LLM Landscape Catalog — Family Index

At a glance

The large-language-model landscape evolves on a monthly cadence; this note is an explicit snapshot of the field as of the first half of 2025 (Q1/Q2), with selective forward-looking notes through 2026. Treat every concrete model, benchmark score, and price quoted here as a time-stamped claim: verify against the live provider docs or lmarena.ai before acting on it.

As of the snapshot date there are roughly 50+ frontier-class models in active commercial deployment, dozens of competitive open-weight families, and a long tail of domain-specific or smaller-edge offerings. The capability frontier in mid-2025 is anchored by OpenAI o3 / GPT-4.1 (and the rumored GPT-5), Anthropic Claude Opus 4.x / Sonnet 4.x, Google Gemini 2.5 Pro, xAI Grok-3 / Grok-4, DeepSeek-V3 / R1, Alibaba Qwen 2.5 / Qwen 3, and Meta LLaMA 4 (Maverick, Behemoth). The reasoning-model branch (o1 / o3, DeepSeek-R1, Gemini 2.5 Thinking, QwQ-32B-Preview, Claude Sonnet 4.x extended thinking) emerged in late 2024 and is the dominant capability axis of 2025. Multimodal-native models (Gemini 2.5, GPT-4o, Claude 3.7+, LLaMA 4, Pixtral, Qwen2.5-VL) have largely displaced text-only models as the default release form factor.

This index catalogs: (a) frontier closed US/EU labs, (b) frontier closed China labs, (c) open-weight Western families, (d) modality-specific stacks (vision, audio, video, image), (e) coding-specialized, (f) edge/small, (g) domain-specific verticals, (h) eval and leaderboards, (i) cost bands, and (j) the structural trends driving 2024–2026.

Frontier closed (US + EU)

OpenAI

GPT-4o (May 2024) — flagship multimodal (text + vision + audio in/out), 128K context, ~$2.50/$10 per 1M tokens (input/output). Coding strong, omni-modal native.
GPT-4o-mini (Jul 2024) — distilled, $0.15/$0.60 per 1M; default cheap workhorse. Replaced GPT-3.5-turbo.
GPT-4.1 (Apr 2025) — coding + long-context refresh, 1M context window, $2/$8 per 1M. Better instruction-following than 4o on long-form tasks.
GPT-4.1-mini + GPT-4.1-nano — cheaper tiers in the 4.1 family.
o1 (Dec 2024 GA) — reasoning model, hidden chain-of-thought, $15/$60 per 1M; saturates MATH (~94%) and AIME (~83%).
o1-mini — coding-focused reasoning, cheaper.
o3 (Apr 2025) — frontier reasoning, breakthrough on ARC-AGI-1 (87.5% high-compute), GPQA-Diamond ~87%, SWE-Bench Verified ~71%. Expensive ($/task compute) but capability-defining.
o3-mini + o4-mini — production reasoning tiers; o4-mini-high a strong coding model.
GPT-5 — rumored launch H2 2025; unified reasoning + chat with adaptive thinking.
DALL-E 3 image, Sora video (GA Dec 2024), Whisper audio, Operator (Jan 2025) agentic web-use product, tts-1 / tts-1-hd speech synthesis.

Distribution: ChatGPT (free / Plus / Pro / Team / Enterprise), API (chat completions + responses + assistants), Azure OpenAI (Microsoft cloud).

Anthropic

Claude Opus 4.x — frontier model line; coding king on SWE-Bench Verified (~72%+ as of Q2 2025), extended-thinking mode toggle.
Claude Sonnet 4.x — mid-tier, the production workhorse; best cost-to-quality on most enterprise tasks; ~$3/$15 per 1M.
Claude Haiku 4.x — small + fast, $0.80/$4 per 1M.
Claude 3.7 Sonnet (Feb 2025) — bridge release that introduced extended thinking; widely used through mid-2025.
Computer Use (Oct 2024) — agentic screen + mouse + keyboard control (beta).
Claude Code — CLI agent product.
MCP (Model Context Protocol) — open standard for tool + data connectivity launched Nov 2024; now adopted by competitors.

Distribution: Anthropic API, claude.ai chat, AWS Bedrock, Google Cloud Vertex.

Google DeepMind

Gemini 2.5 Pro (Mar 2025) — frontier multimodal + reasoning; LMArena top-3 mid-2025; 1M context (2M extended); strong on math, code, vision.
Gemini 2.5 Flash — cheap fast tier; ~$0.30/$2.50 per 1M; thinking-mode toggle.
Gemini 2.5 Flash-Lite — even cheaper.
Gemini 2.0 Pro / Flash / Flash Thinking — late-2024 series.
Gemini Nano — on-device (Pixel + Chrome).
Gemini Live — bidirectional voice + screen-share API.
Gemma 2 (9B / 27B, Jun 2024) and Gemma 3 (1B / 4B / 12B / 27B, Mar 2025) open-weight, multimodal variants.
PaliGemma 2 — vision-language open.
AlphaFold-3, AlphaProteo, AlphaProof / AlphaGeometry 2 scientific specialists.

Distribution: Gemini app, Google AI Studio, Vertex AI, NotebookLM, Workspace integrations.

xAI

Grok-3 (Feb 2025) — frontier reasoning + tool use; trained on Colossus cluster (~200K H100s).
Grok-3 mini + Grok-3 Reasoning (Think mode).
Grok-4 — announced mid-2025.
Native X (Twitter) integration; real-time data access.
Aurora image generation (Dec 2024).

Mistral AI (France)

Mistral Large 2 (Jul 2024, 123B dense) and Mistral Large 3 (2025) — flagship closed.
Mixtral 8x22B (Apr 2024) and Mixtral 8x7B (Dec 2023) open MoE.
Mistral Small 3.1 (Mar 2025, 24B) open Apache.
Pixtral 12B (Sep 2024) and Pixtral Large 124B (Nov 2024) multimodal.
Codestral 22B (May 2024) and Codestral 25.01 (Jan 2025) code-specialized.
Mistral Nemo 12B (Jul 2024) — with NVIDIA, 128K context, Apache.
Ministral 3B / 8B (Oct 2024) edge.
Voxtral — speech model (2025).

Distribution: La Plateforme API, Le Chat web product, AWS Bedrock, Azure AI.

Cohere (Canada)

Command R+ 08-2024 (104B) and Command R 08-2024 — enterprise RAG-tuned.
Command A (Mar 2025) — flagship.
Embed v3 + Embed v4 — multilingual + multimodal embedding models.
Rerank 3 + Rerank 3.5 — search reranker.
Aya 23 + Aya Expanse (8B / 32B) — 23-language multilingual open-weight.

Enterprise + on-prem focus; the most “boring + reliable” frontier shop.

Reka AI

Reka Core (~67B), Reka Flash (21B), Reka Edge (7B) — multimodal native (text + image + audio + video).
Reka Flash 3 (2025) — refresh, open-weighted under Apache.

AI21 Labs (Israel)

Jamba 1.5 Large (398B MoE) + Jamba 1.5 Mini (52B MoE) — hybrid Transformer-Mamba; 256K context; strong long-doc performance.
Jurassic legacy line.

Inflection AI

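Microsoft hired most of the team (including co-founders Mustafa Suleyman and Karén Simonyan) in March 2024 and licensed Inflection's models, after which the company de-emphasized the Pi consumer product and pivoted to enterprise. Effectively absorbed as a frontier lab.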

Frontier closed (China)

Alibaba — Qwen (Tongyi Qianwen)

Qwen 2.5 family (Sep 2024) — open-weight 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B, Apache 2.0 (except 3B/72B which carry custom license).
Qwen 2.5-Max — flagship closed, MoE; competitive with GPT-4o on MMLU + GPQA.
Qwen 2.5-Plus + Qwen 2.5-Turbo — API tiers; Turbo supports 1M context.
Qwen 3 (Apr 2025) — next-gen with thinking-mode toggle; 0.6B → 235B-A22B MoE variants; many open-weighted.
Qwen2.5-Coder (1.5B / 7B / 32B) — open SOTA on code among open weights.
Qwen2.5-VL (3B / 7B / 72B) — vision-language; chart + document understanding.
QVQ-72B-Preview (Dec 2024) — visual reasoning thinking-model.
QwQ-32B-Preview (Nov 2024) — reasoning thinking-model; surprisingly strong on AIME + MATH.

DeepSeek (Hangzhou)

DeepSeek-V2 (May 2024, 236B MoE / 21B active) — established the MoE-at-cheap-price playbook.
DeepSeek-V3 (Dec 2024, 671B MoE / 37B active) — rivals GPT-4o quality at ~1/10th cost; trained for ~$5.6M reported.
DeepSeek-R1 (Jan 2025) — reasoning model trained primarily with large-scale RL on verifiable rewards (the R1-Zero variant used pure RL with no SFT; R1 adds a small cold-start SFT stage); rivals o1 on AIME + MATH + GPQA; fully open weights under the MIT license; the open-source moment of 2025. Distilled variants include R1-Distill-Qwen-32B.
DeepSeek-Coder V2 (Jun 2024) — code-specialized.
DeepSeek-VL2 — multimodal.

The pricing + openness combination resets economics: V3 at ~$0.27/$1.10 per 1M tokens (input/output).

MoonshotAI — Kimi

Kimi K1 + K1.5 + K2 + K3 generations through 2024–2025.
Kimi K2 focus: long context (claimed 2M-token document QA in production product).
Kimi-VL + Kimi-Audio multimodal extensions.
Heavy on Chinese-language QA + research-assistant use case.

MiniMax (Shanghai)

abab 6.5 + abab 6.5s (2024) chat models.
MiniMax-Text-01 and MiniMax-VL-01 (Jan 2025) — open hybrid Transformer + Lightning attention; 4M+ context.
MiniMax M1 (mid-2025) — open-weight reasoning model; hybrid MoE combining softmax attention with Lightning (linear) attention.
Hailuo AI — video generation product (competitive with Kling / Sora).

Zhipu AI — GLM

GLM-4 + GLM-4-Plus + GLM-4-Air (2024) closed.
GLM-4.5 + GLM-4.5V (2025) flagship.
ChatGLM-3 (6B / 9B) and GLM-4-9B open-weight.
CogVideoX open video model line (2B / 5B).
CogVLM2 / GLM-4V multimodal.

01-AI — Yi

Yi-Large + Yi-Lightning (Oct 2024) closed; Yi-Lightning briefly top-10 LMArena late 2024.
Yi-1.5 open (6B / 9B / 34B) Apache.
Yi-Coder + Yi-Vision specialized.
Founder Kai-Fu Lee; company pivoted toward applications + tooling mid-2025.

Baidu — ERNIE

ERNIE 4.0 + ERNIE 4.0 Turbo + ERNIE Speed + ERNIE Lite + ERNIE Tiny API tier.
ERNIE 4.5 (Mar 2025) — flagship refresh.
ERNIE X1 — reasoning model (Mar 2025) at ~half DeepSeek-R1’s quoted price.
Wenxin Yiyan consumer chat product.

ByteDance — Doubao (Volcengine)

Doubao 1.5 Pro + Doubao 1.5 Lite + Doubao 1.5 Vision Pro (Jan 2025).
Doubao-1.5-pro-256k long-context, Doubao-Reasoning o1-style.
Doubao-Audio-TTS / ASR voice stack — used in CapCut + Douyin.
Aggressive price-war pricing drove Chinese API rates down 80%+ in late 2024.

Tencent — Hunyuan

Hunyuan Large (389B MoE / 52B active, Nov 2024) — largest open MoE at release time.
Hunyuan Turbo (closed) flagship.
Hunyuan 3D 1.0 / 2.0 — open 3D-asset generation (3D-aware diffusion + multi-view).
Hunyuan Video (Dec 2024) — open video model, 13B; rivals closed Kling/Sora on benchmarks.

Stepfun (Shanghai)

Step-1V multimodal + Step-2 (1T parameters, MoE) closed flagship.
Step-Video + Step-Audio multimodal extensions.

Skywork (Kunlun Tech)

Skywork 13B open + Skywork-MoE (146B-A22B) open.
Skywork-O1-Open — reasoning chain-of-thought open-weight (Nov 2024).
SkyReels-V1 open video generation (2025).

InternLM (Shanghai AI Lab)

InternLM2.5 (1.8B / 7B / 20B) open Apache.
InternLM3-8B-Instruct (2025) open.
InternVL2 + InternVL2.5 + InternVL3 vision-language; rivals proprietary on MMMU + DocVQA.

Mianbi (ModelBest) — MiniCPM

MiniCPM-V 2.6 (8B) — small multimodal; runs on mobile.
MiniCPM 3.0 4B + MiniCPM-o 2.6 (omni-modal: text + image + audio + video).
Edge + on-device focus; widely used in Chinese smartphone deployments.

Open-weight Western

Meta — LLaMA

LLaMA 3 (Apr 2024, 8B / 70B) — community-license open.
LLaMA 3.1 (Jul 2024, 8B / 70B / 405B) — 128K context; 405B was largest open dense at release.
LLaMA 3.2 (Sep 2024) — 1B / 3B text + 11B / 90B vision; introduced multimodal LLaMA.
LLaMA 3.3 70B (Dec 2024) — instruction-tuned refresh matching 405B quality at 70B size.
LLaMA 4 (Apr 2025) — multimodal MoE family:
- LLaMA 4 Scout (109B MoE / 17B active, 10M context) — competitive with Gemini Flash.
- LLaMA 4 Maverick (400B MoE / 17B active, 1M context) — challenger to GPT-4o.
- LLaMA 4 Behemoth (~2T MoE / 288B active) — teacher model, in training mid-2025.
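The "total / active" parameter pairs quoted for MoE models can be read as an active fraction: per-token compute scales roughly with the active parameters, while the memory needed to hold the weights scales with the total. The sketch below is only an illustration using the figures quoted in this note; the helper name `active_fraction` is hypothetical, and "~2T" is taken as 2000B for the arithmetic.

```python
# (total parameters, active parameters per token), in billions,
# as quoted in this note. "~2T" is approximated as 2000B (assumption).
MOE_MODELS = {
    "LLaMA 4 Scout": (109, 17),
    "LLaMA 4 Maverick": (400, 17),
    "LLaMA 4 Behemoth": (2000, 288),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's weights that participate in each token's forward pass."""
    return active_b / total_b


for name, (total, active) in MOE_MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of weights active per token")
```

Scout and Maverick share the same active size, so their per-token compute is similar even though Maverick needs roughly four times the memory to host.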

The LLaMA ecosystem is the gravitational center of open-weight tooling: HuggingFace transformers, vLLM, llama.cpp, Ollama, LM Studio, the GGUF format, and derivative fine-tunes by the thousands.

Mistral open

Covered above — Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Nemo 12B, Mistral Small 3.1 24B, Pixtral 12B, Codestral Mamba 7B, Mathstral 7B.

Microsoft — Phi

Phi-3 (Apr 2024, 3.8B / 7B / 14B) — small + capable; punches above weight on reasoning + code.
Phi-3.5 (Aug 2024) — Mini-Instruct (3.8B), MoE (16x3.8B / 6.6B active), Vision (4.2B).
Phi-4 (Dec 2024, 14B) — reasoning-trained, math + science strong.
Phi-4-multimodal + Phi-4-mini (Feb 2025) — text + vision + audio compact.
Phi-4-reasoning + Phi-4-reasoning-plus (Apr 2025) — thinking models.
Phi Silica — quantized variant shipping on Windows Copilot+ PCs.

Google Gemma

Covered above — Gemma 2 + Gemma 3 + PaliGemma 2 + CodeGemma + RecurrentGemma (Griffin-architecture experiment) + ShieldGemma safety classifier.

Allen Institute (AI2)

OLMo 2 (Nov 2024, 7B / 13B) — fully-open: weights + data + checkpoints + training code.
Tülu 3 (Nov 2024) — open post-training recipe applied to LLaMA 3.1; rivals closed instruct quality.
Molmo (Sep 2024) — open vision-language; competitive with proprietary 7B-class VLMs.
OLMoE (Sep 2024, 7B MoE / 1B active) — open MoE.

Stability AI

Mostly legacy. Stable LM 2 1.6B / 12B, StableLM Zephyr small models — declining relevance vs Phi + Gemma.

Serving infrastructure (open-model hosts)

Together AI — fast inference, fine-tuning, dedicated.
Fireworks AI — function calling, low-latency serving.
DeepInfra — cheap per-token open-model API.
Hyperbolic — GPU rental + serving.
Replicate — model-as-API with cold-start.
Groq — LPU custom silicon, extreme tokens/sec on selected open models.
Cerebras Inference — wafer-scale silicon, similar speed claims.
SambaNova — RDU silicon serving.
AWS Bedrock, Azure AI Foundry, GCP Vertex Model Garden — hyperscaler model garden hosting.

Multimodal (vision + audio + video + image)

Vision-language (VLM)

GPT-4o + GPT-4.1 + o3 / o4-mini with vision.
Claude Opus 4.x + Sonnet 4.x + Haiku 4.x + Claude 3.5/3.7 vision.
Gemini 2.5 Pro / Flash vision.
Pixtral 12B / Pixtral Large (Mistral, open).
Qwen2.5-VL 3B / 7B / 72B + QVQ-72B (Alibaba, open).
InternVL2 / 2.5 / 3 (Shanghai AI Lab, open).
LLaVA-OneVision / LLaVA-NeXT academic line.
LLaMA 3.2 Vision 11B / 90B + LLaMA 4 native multimodal.
Phi-4 Multimodal (Microsoft).
Molmo (Allen) — open SOTA on visual grounding.
CogVLM2 / GLM-4V (Zhipu).
MiniCPM-V 2.6 / MiniCPM-o 2.6 (Mianbi, mobile-class).
NVLM-D-72B (NVIDIA, open).

Audio input (speech recognition + speech-to-text)

Whisper large-v3 / large-v3-turbo (OpenAI, open, 2022–2024) — open SOTA ASR.
GPT-4o audio (Realtime API) — speech-in, speech-out, low-latency conversational.
Gemini Live — bidirectional speech + screen-share, June 2024.
Moshi (Kyutai, Sep 2024, open) — streaming full-duplex 200ms latency.
Voxtral (Mistral, 2025) — open speech understanding.
SeamlessM4T v2 (Meta, open) — translation + transcription, 100+ langs.
Canary-1B / Parakeet (NVIDIA, open) — ASR leaders.
Qwen2-Audio + Doubao Audio.

Audio output (text-to-speech + voice cloning)

ElevenLabs (v3 Turbo, Multilingual v2) — commercial leader for quality + voice cloning.
OpenAI tts-1 / tts-1-hd / gpt-4o-mini-tts — built into Realtime API.
Sesame CSM-1B (2025, open) — conversational speech model.
Suno Voice + Suno Bark — singing + speech.
Resemble AI, Play.ht 2.0, Cartesia Sonic (sub-100ms latency), Inflection Voice.
MeloTTS + F5-TTS + GPT-SoVITS + OpenVoice v2 (open).
MiniMax T2A, Doubao TTS, CosyVoice (Alibaba open) — Chinese-led open frontier.

Video generation

Sora (OpenAI, Feb 2024 demo, Dec 2024 GA) — up to 20s at 1080p; text + image + video conditioning.
Veo 2 (Google, Dec 2024) and Veo 3 (May 2025) — Veo 3 added native audio generation, perceived top-tier mid-2025.
Kling 1.6 / 2.0 (Kuaishou) — commercial Chinese leader; minutes-long realistic.
Runway Gen-3 Alpha / Gen-4 — pioneer commercial product.
Pika 2.0 / 2.1 — consumer-creator focused.
Luma Dream Machine (Ray2 model, 2025).
Mochi 1 (Genmo, Oct 2024, open Apache).
Hunyuan Video (Tencent, Dec 2024, open).
WanX / Wan 2.1 (Alibaba, Feb 2025 open) — competitive with closed on motion quality.
CogVideoX + OpenSora (open academic).
LTX-Video (Lightricks, open).
Hailuo / MiniMax video (closed Chinese product).

Image generation

DALL-E 3 (OpenAI) — inside ChatGPT + Bing.
GPT-Image-1 (OpenAI, Mar 2025) — native multimodal image gen inside GPT-4o.
Imagen 3 / Imagen 4 (Google) — strong text rendering + photorealism.
Midjourney v6.1 / v7 — aesthetic leader; Discord + web.
Stable Diffusion 3 / 3.5 Large (Stability, open) + SDXL legacy.
FLUX.1 [pro / dev / schnell] (Black Forest Labs, Aug 2024) + FLUX.1.1 pro + FLUX Kontext (May 2025) — open + commercial; new open leader on aesthetics + prompt-adherence.
Ideogram 2.0 / 3.0 — strong text-in-image.
Recraft V3 — vector + design-focused.
HiDream (open, 2025), Lumina-mGPT, Janus-Pro (DeepSeek, Jan 2025 open).

Coding-specific models

Claude Opus 4.x / Sonnet 4.x — frontier on SWE-Bench Verified (~72%+); the integrated agent stack (Claude Code CLI + Cursor + Windsurf) compounds the advantage.
GPT-4.1 + o4-mini-high — strong coding, especially on long-context refactors.
Codestral 22B / 25.01 + Codestral Mamba 7B (Mistral).
DeepSeek-Coder V2 + DeepSeek-V3 for code — open SOTA in late 2024 / early 2025.
Qwen2.5-Coder 32B — open SOTA mid-2025; matches GPT-4o on HumanEval / MBPP / LiveCodeBench.
CodeLlama (Meta) — legacy.
Granite Code (IBM, open).
Starcoder2 (BigCode / ServiceNow / HF, open).
Yi-Coder 9B (01-AI, open).
Tabby, Continue.dev, Aider, Cline, Cursor, Windsurf, Codex CLI (OpenAI, Apr 2025), Claude Code — IDE / CLI surfaces.

Smaller + edge models

Phi-3.5 Mini 3.8B, Phi-4 Mini — Microsoft small-model excellence.
Gemma 3 1B / 4B / 12B — Google compact multimodal.
LLaMA 3.2 1B / 3B — mobile-class.
Mistral Nemo 12B, Ministral 3B / 8B.
MiniCPM 3.0 4B + MiniCPM-V 2.6 8B (mobile-class multimodal).
Qwen 2.5 0.5B / 1.5B / 3B + Qwen 3 0.6B / 1.7B / 4B.
SmolLM2 (HuggingFace, 135M / 360M / 1.7B) — research-class tiny.
Phi Silica — quantized 3.3B running locally on Windows Copilot+ NPUs.
Apple Intelligence Foundation Models — ~3B on-device on iPhone 15 Pro+, M-series Macs (Sep 2024+).
Gemini Nano — on-device Pixel + Chrome (built-in window.ai JS API rolling out 2025).
TinyLlama 1.1B, StableLM Zephyr 3B — legacy small-model line.
Llama.cpp / Ollama / MLX / LM Studio runtimes target this tier on consumer hardware.

Domain-specific + specialized

Finance: BloombergGPT (50B, internal), FinGPT (open academic), Salesforce XGen-Sales, Hebbia (RAG-focused for analysts).
Medical: Med-PaLM 2 / Med-Gemini (Google, multimodal medical), Anthropic clinical adapters, OpenBioLLM, MedAlpaca, BioMistral, Hippocratic AI (patient-facing).
Science: Galactica (Meta, 2022 — withdrawn), Tx-LLM (Google therapeutics), ChemCrow + Coscientist research agents.
Biology / proteins: AlphaFold-3 (DeepMind, May 2024), AlphaProteo (Sep 2024), ESM-3 (EvolutionaryScale, 2024) — see [[Biology/genetics-and-genomics]].
Legal: Harvey AI (enterprise legal), CoCounsel (Thomson Reuters / Casetext), Lexis+ AI, Spellbook (contracts), Robin AI.
Education: Khanmigo (Khan Academy + GPT-4o), Duolingo Max (GPT-4o tutor), Synthesis (math), Magic School AI.
Code: covered above.
Robotics + embodied: RT-2 (Google), Pi-0 (Physical Intelligence, 2024), Helix (Figure AI), Gemini Robotics-1 (Mar 2025), Optimus + xAI integrations — see [[Robotics/robot-learning-and-foundation-models]].

Eval + leaderboards

General + chat preference

LMSYS Chatbot Arena → lmarena.ai — crowdsourced pairwise Elo, the single most-watched live leaderboard. Categories: Overall, Hard Prompts, Coding, Math, Vision, Multi-turn, Style-controlled.
Artificial Analysis — independent multi-axis benchmarking (quality + speed + price) at artificialanalysis.ai.
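For intuition about how pairwise votes turn into a ranking, the textbook Elo update can be sketched as follows. This is a minimal illustration, not the leaderboard's actual rating pipeline (which this note does not specify); the function names and the step size `k=32` are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model
    (logistic curve on a 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one pairwise vote.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is the step size (an assumed value for illustration).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start level; model A wins one vote.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, which is why a few hundred votes against strong opponents can place a new model quickly.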

Knowledge + reasoning

MMLU (57-subject multiple choice) — saturated ~92% by frontier; deprecated as discriminator.
MMLU-Pro — harder variant, 10-way choices, still useful.
MMLU-Redux — error-corrected.
GPQA-Diamond (Google-proof QA, PhD-level science) — frontier discriminator; o3 ~87%.
BIG-Bench Hard (BBH) — composite reasoning, mostly saturated.
HellaSwag, ARC-Challenge, WinoGrande, TruthfulQA — older but still cited.

Math + AIME

MATH (competition math) — frontier ~95%+, saturating.
GSM8K (grade-school math) — saturated ~95%+.
AIME 2024 / 2025 — American Invitational Math Exam, the new high-water mark for reasoning models; o3 / R1 / QwQ score 80%+ on AIME 2024.
OlympiadBench, PutnamBench, FrontierMath (Epoch AI 2024) — frontier-grade math; on FrontierMath, pre-o3 models scored under 2% and OpenAI reported ~25% for o3 (Dec 2024, high compute).

Coding

HumanEval (164 Python problems) — saturated, ~95%+, deprecated as sole signal.
MBPP / MBPP+ — basic programming.
SWE-Bench Verified (500 real GitHub issues) — current frontier coding benchmark; Claude Opus 4.x ~72%, o3 ~71% mid-2025.
SWE-Bench Multilingual / Lite — variants.
LiveCodeBench — contamination-resistant competitive programming.
Aider Polyglot — multi-language refactoring.
BigCodeBench — library-rich tasks.
CodeContests — historical Codeforces.

Tool use + function calling + agents

BFCL (Berkeley Function-Calling Leaderboard) v1 / v2 / v3 — tool-use accuracy.
τ-Bench (Tau-Bench, Sierra 2024) — agentic customer-support.
GAIA (Meta + HF, 2023) — general-assistant.
AgentBench, WebArena, VisualWebArena, AppWorld — agent envs.
OSWorld + WindowsAgentArena — computer-use evaluation.

Multimodal

MMMU (multi-discipline university multimodal) — frontier ~80%.
MathVista — visual math.
DocVQA, ChartQA, TextVQA, VQAv2 — document + chart + scene VQA.
MM-Vet, MMBench, SEED-Bench.
VideoMME, MVBench — video understanding.

Long context

Needle-in-a-Haystack (NIAH) — single-fact retrieval; saturated for 1M-context frontier.
RULER (NVIDIA, 2024) — synthetic long-context across 13 tasks; harder discriminator.
InfiniteBench, LongBench v2 — natural long-doc QA.

Reasoning / generalization frontier

ARC-AGI-1 (Chollet) — abstract reasoning grids; o3 high-compute hit 87.5% (Dec 2024), exceeding the prize threshold.
ARC-AGI-2 (Mar 2025) — successor benchmark; frontier still <10% as of Q2 2025.
HLE — Humanity’s Last Exam (Center for AI Safety + Scale, Jan 2025) — 3000 expert-level questions; frontier mid-2025 ~20%.

Safety + alignment

HarmBench (CAIS).
AILuminate (MLCommons, 2024) — safety benchmark with industry adoption.
AISI evaluations (UK + US AI Safety Institutes) — frontier pre-deployment access.
AgentHarm (Gray Swan AI + UK AISI, 2024) — agentic-safety eval.

Cost bands (USD per 1M input + output tokens, mid-2025 ballpark)

Tier	Examples	Input $/M	Output $/M
Frontier reasoning	OpenAI o3, Claude Opus 4 thinking, Gemini 2.5 Pro thinking	$15–30	$60–150
Frontier chat	GPT-4o, GPT-4.1, Claude Sonnet 4, Gemini 2.5 Pro, Grok-3	$2–6	$8–25
Mid closed	GPT-4o-mini, Claude Haiku 4, Gemini 2.5 Flash, Mistral Small	$0.15–0.80	$0.60–4
Cheap closed	GPT-4.1-nano, Gemini Flash-Lite, Doubao Lite, DeepSeek-V3 API	$0.05–0.30	$0.15–1.10
Open hosted	LLaMA 3.3 70B, Qwen 2.5 72B, Mixtral 8x22B on Together/Fireworks	$0.10–0.80	$0.30–3
Open self-hosted	Any model on rented GPUs	electricity + GPU-hours	electricity + GPU-hours

Chinese provider pricing collapsed in late 2024 via ByteDance + Alibaba + DeepSeek price war; many APIs sit at 1/5 to 1/20 of US frontier rates. Quote with caveat — pricing changes weekly.
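To make the bands concrete, the cost of a single request follows directly from the per-1M-token prices. The sketch below is a minimal illustration: the function name `request_cost` and the example workload are assumptions, and the prices are the ballpark figures quoted elsewhere in this note, not live rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one request, given prices in USD per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Ballpark prices quoted in this note (USD per 1M tokens: input, output).
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "deepseek-v3": (0.27, 1.10),
}

# Hypothetical workload: 10,000 input tokens and 1,000 output tokens per request.
for name, (inp, out) in PRICES.items():
    cost = request_cost(10_000, 1_000, inp, out)
    print(f"{name}: ${cost:.4f} per request")
```

Because output tokens are priced several times higher than input tokens, long generations (especially hidden reasoning tokens on thinking models) tend to dominate the bill even when prompts are large.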

Trends 2024 – 2026

Reasoning + thinking models. o1 (Sep 2024) opened the paradigm; R1 + QwQ + Gemini 2.5 Thinking + Claude extended-thinking + Phi-4-reasoning followed. RL on verifiable rewards (math, code, formal proofs) is the dominant training axis. Test-time compute scaling is the new dimension on the scaling laws.
Mixture-of-Experts. Mixtral lit the fuse (Dec 2023); DeepSeek-V3 (671B MoE), LLaMA 4 (Scout / Maverick / Behemoth MoE), Qwen 3 MoE variants, Hunyuan Large, MiniMax Text-01, Skywork-MoE. Active-parameter counts of 17B–37B at total counts of 100B–2T+ are now mainstream. GPT-4 has long been rumored to be MoE; OpenAI has not disclosed the architecture of GPT-4o or GPT-4.1, so any MoE reading of their inference characteristics remains speculation.
Long context routine. 128K is table-stakes. 1M is normal (Gemini 1.5/2.5, GPT-4.1), with LLaMA 4 Scout at 10M. Kimi K2 + MiniMax 01 reach 2M–4M+. RULER is the more useful discriminator now that NIAH is saturated; real comprehension still degrades past ~200K for most models.
Multimodal native. Gemini 2.5, GPT-4o, Claude 3.5/3.7+, LLaMA 4, Qwen2.5-VL, Phi-4-multimodal, MiniCPM-o — single model, all modalities, joint training rather than late-fused adapters. Image-out (GPT-Image-1, native Imagen-in-Gemini) collapsing image-gen into the LLM.
Agentic tooling baked in. Claude Computer Use (Oct 2024), OpenAI Operator (Jan 2025), Google Project Mariner + Project Astra previews. MCP (Model Context Protocol) open standard (Anthropic, Nov 2024) adopted by OpenAI + others through 2025 — the USB-C of agentic tool integration.
Tool calling + structured output mature. JSON schema + grammar-constrained decoding (vLLM + Outlines + xgrammar) reliable across all frontier models. Parallel tool calls standard. OpenAI Responses API (Mar 2025) replaces Assistants v1.
Smaller-cheaper-faster competitive on routine tasks. Haiku 4.x, Gemini Flash, GPT-4o-mini, Phi-4, Gemma 3 27B, and Qwen 2.5 32B can handle the bulk of routine production traffic (often put at ~80%) at roughly 1/10th of frontier cost. The router pattern (auto-route easy queries to cheap models) is standard architecture.
Open weights catching up. DeepSeek-R1 (open MIT-license reasoning matching o1), Qwen 3 (open at multiple scales), LLaMA 4 — the open-vs-closed gap narrowed from 12+ months to 3–6 months on quality. China-origin open weights now dominate the open frontier on both quality and openness of license.
Inference scaling laws + test-time compute. The o1 paradigm — spend compute at inference rather than only at training — re-opens scaling. Compute-budgeted reasoning, search + verify, parallel sampling + best-of-N, process-reward-model verifiers all in production.
Distillation pipeline standard. Frontier model → cheap-fast variant → on-device variant is now an explicit product line at every major lab. R1 → R1-Distill-Qwen-32B is the canonical open example.
Hybrid architectures. Mamba / SSM components appear in Jamba 1.5 (AI21, hybrid Transformer-Mamba), Codestral Mamba (Mistral), and Falcon Mamba (TII); MiniMax M1 instead pairs softmax attention with linear (Lightning) attention. All of these target long-context efficiency.
Agentic + reasoning combine. o3 + Claude Opus 4.x + Gemini 2.5 Thinking show that thinking-models executing tool calls deliver order-of-magnitude better results on real coding + research tasks than either capability alone. Coding agents (Claude Code, Codex CLI, Windsurf Cascade, Cursor Composer) are the first mass-market application.
Inference silicon diversification. Groq LPU + Cerebras WSE + SambaNova RDU + AWS Trainium 2 + Google TPU v5p / v6 / v7 + Microsoft Maia + Meta MTIA + AMD MI300X / MI325X / MI355X + NVIDIA H100 / H200 / B100 / B200 / GB200 — see [[Compute/Tier3/ai-accelerators]].
Safety + governance. EU AI Act in force Aug 2024 with phased application through 2026. UK + US AISI pre-deployment evaluations on frontier models. Voluntary frontier-safety frameworks (Responsible Scaling Policies, Preparedness Framework, Frontier Safety Framework) from Anthropic + OpenAI + DeepMind.
China decoupling pressure + competition. Export controls (H100/H200/B200-class parts restricted for China; rules first issued Oct 2022 and tightened Oct 2023, Dec 2024, and Jan 2025). Chinese labs respond with MoE efficiency (DeepSeek), Huawei Ascend 910C, and aggressive open-weighting. The DeepSeek-R1 Jan 2025 release was a watershed moment for the openness + cost narrative.
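The router pattern mentioned in the trends above can be sketched in a few lines. This is a deliberately naive illustration: production routers typically use a learned classifier or a small model as the judge, whereas the keyword list, the word-count threshold, the tier names, and the prices here are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str                 # placeholder model identifier (hypothetical)
    input_price_per_m: float  # USD per 1M input tokens (illustrative)


CHEAP = Tier("cheap-tier-model", 0.15)
FRONTIER = Tier("frontier-tier-model", 2.50)

# Hypothetical signals that a query needs the stronger model.
HARD_HINTS = ("prove", "refactor", "debug", "step by step")


def route(prompt: str, max_easy_words: int = 200) -> Tier:
    """Send short, simple-looking prompts to the cheap tier, the rest to frontier."""
    text = prompt.lower()
    looks_hard = any(hint in text for hint in HARD_HINTS)
    too_long = len(text.split()) > max_easy_words
    return FRONTIER if (looks_hard or too_long) else CHEAP


print(route("What is the capital of France?").name)
print(route("Refactor this module and debug the failing test").name)
```

A common refinement is to let the cheap model answer first and escalate only when a verifier or confidence check fails, which trades a little latency for a lower average cost.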

Cross-references

[[Compute/transformer-architecture]] — attention + MoE + Mamba mechanics underlying every model in this catalog.
[[Compute/fine-tuning-rlhf]] — SFT + DPO + RLHF + RLVR (the training method behind reasoning models).
[[Compute/inference-optimization]] — quantization (GPTQ, AWQ, GGUF, FP8), speculative decoding, KV-cache mgmt, batching.
[[Compute/rag-embeddings-vector-search]] — embedding models + vector DBs that pair with these LLMs.
[[Compute/prompt-engineering-agent-systems]] — agent frameworks + MCP + tool calling patterns.
[[Compute/model-serving-infrastructure]] — vLLM, TGI, TensorRT-LLM, SGLang, Triton, batching, autoscaling.
[[Compute/Tier3/ml-framework-comparison]] — PyTorch / JAX / TF / MLX training-side comparison.
[[Compute/Tier3/ai-accelerators]] — GPU + TPU + LPU + custom silicon for training + inference.
[[Compute/Tier3/_index]] — Compute Tier 3 family index.

Snapshot date: 2026-05-17. Models, scores, and prices quoted reflect the state of the field as of the first half of 2025 (Q1/Q2), with selective notes through Q2 2026. The landscape moves on a monthly cadence, so verify any specific claim against the provider’s live documentation, lmarena.ai, or artificialanalysis.ai before acting on it.
