LLM Landscape Catalog — Family Index
At a glance
The large-language-model landscape evolves on a monthly cadence; this note is an explicit snapshot dated mid-2025 (Q1/Q2) with selective forward-looking notes through 2026. Treat every concrete model, benchmark score, and price quoted here as a time-stamped claim — verify against the live provider docs or lmarena.ai before acting on it.
As of the snapshot date there are roughly 50 or more frontier-class models in active commercial deployment, dozens of competitive open-weight families, and a long tail of domain-specific or smaller-edge offerings. The capability frontier in mid-2025 is anchored by OpenAI o3 / GPT-4.1 (and rumored GPT-5), Anthropic Claude Opus 4.x / Sonnet 4.x, Google Gemini 2.5 Pro, xAI Grok-3 / Grok-4, DeepSeek-V3 / R1, Alibaba Qwen 2.5 / Qwen 3, and Meta LLaMA 4 (Maverick, Behemoth). The reasoning-model branch (o1 / o3, DeepSeek-R1, Gemini 2.5 Thinking, QwQ-32B-Preview, Claude Sonnet 4.x extended thinking) emerged in late 2024 and is the dominant capability axis of 2025. Multimodal-native models (Gemini 2.5, GPT-4o, Claude 3.7+, LLaMA 4, Pixtral, Qwen2.5-VL) have largely displaced text-only models as the default release form factor.
This index catalogs: (a) frontier closed US/EU labs, (b) frontier closed China labs, (c) open-weight Western families, (d) modality-specific stacks (vision, audio, video, image), (e) coding-specialized, (f) edge/small, (g) domain-specific verticals, (h) eval and leaderboards, (i) cost bands, and (j) the structural trends driving 2024–2026.
Frontier closed (US + EU)
OpenAI
- GPT-4o (May 2024): flagship multimodal (text + vision + audio in/out), 128K context, ~$10 per 1M output tokens. Strong at coding; natively omni-modal.
- GPT-4o-mini (Jul 2024): distilled, $0.60 per 1M output tokens; default cheap workhorse. Replaced GPT-3.5-turbo.
- GPT-4.1 (Apr 2025): coding + long-context refresh, 1M context window, $8 per 1M output tokens. Better instruction-following than 4o on long-form tasks.
- GPT-4.1-mini + GPT-4.1-nano — cheaper tiers in the 4.1 family.
- o1 (Dec 2024 GA): reasoning model, hidden chain-of-thought, $60 per 1M output tokens; near-saturates MATH (~94%) and scores ~83% on AIME.
- o1-mini — coding-focused reasoning, cheaper.
- o3 (Apr 2025) — frontier reasoning, breakthrough on ARC-AGI-1 (87.5% high-compute), GPQA-Diamond ~87%, SWE-Bench Verified ~71%. Expensive ($/task compute) but capability-defining.
- o3-mini + o4-mini — production reasoning tiers; o4-mini-high a strong coding model.
- GPT-5 — rumored launch H2 2025; unified reasoning + chat with adaptive thinking.
- DALL-E 3 image, Sora video (GA Dec 2024), Whisper audio, Operator (Jan 2025) agentic web-use product, tts-1 / tts-1-hd speech synthesis.
Distribution: ChatGPT (free / Plus / Pro / Team / Enterprise), API (chat completions + responses + assistants), Azure OpenAI (Microsoft cloud).
Anthropic
- Claude Opus 4.x — frontier model line; coding king on SWE-Bench Verified (~72%+ as of Q2 2025), extended-thinking mode toggle.
- Claude Sonnet 4.x: mid-tier, the production workhorse; best cost-to-quality on most enterprise tasks; ~$15 per 1M output tokens.
- Claude Haiku 4.x: small + fast, $4 per 1M output tokens.
- Claude 3.7 Sonnet (Feb 2025) — bridge release that introduced extended thinking; widely used through mid-2025.
- Computer Use (Oct 2024) — agentic screen + mouse + keyboard control (beta).
- Claude Code: CLI coding-agent product.
- MCP (Model Context Protocol) — open standard for tool + data connectivity launched Nov 2024; now adopted by competitors.
Distribution: Anthropic API, claude.ai chat, AWS Bedrock, Google Cloud Vertex.
Google DeepMind
- Gemini 2.5 Pro (Mar 2025) — frontier multimodal + reasoning; LMArena top-3 mid-2025; 1M context (2M extended); strong on math, code, vision.
- Gemini 2.5 Flash: cheap fast tier; ~$2.50 per 1M output tokens; thinking-mode toggle.
- Gemini 2.5 Flash-Lite — even cheaper.
- Gemini 2.0 Pro / Flash / Flash Thinking — late-2024 series.
- Gemini Nano — on-device (Pixel + Chrome).
- Gemini Live — bidirectional voice + screen-share API.
- Gemma 2 (9B / 27B, Jun 2024) and Gemma 3 (1B / 4B / 12B / 27B, Mar 2025) open-weight, multimodal variants.
- PaliGemma 2 — vision-language open.
- AlphaFold-3, AlphaProteo, AlphaProof / AlphaGeometry 2 scientific specialists.
Distribution: Gemini app, Google AI Studio, Vertex AI, NotebookLM, Workspace integrations.
xAI
- Grok-3 (Feb 2025) — frontier reasoning + tool use; trained on Colossus cluster (~200K H100s).
- Grok-3 mini + Grok-3 Reasoning (Think mode).
- Grok-4 — announced mid-2025.
- Native X (Twitter) integration; real-time data access.
- Aurora image generation (Dec 2024).
Mistral AI (France)
- Mistral Large 2 (Jul 2024, 123B dense) and Mistral Large 3 (2025) — flagship closed.
- Mixtral 8x22B (Apr 2024) and Mixtral 8x7B (Dec 2023) open MoE.
- Mistral Small 3.1 (Mar 2025, 24B) open Apache.
- Pixtral 12B (Sep 2024) and Pixtral Large 124B (Nov 2024) multimodal.
- Codestral 22B (May 2024) and Codestral 25.01 (Jan 2025) code-specialized.
- Mistral Nemo 12B (Jul 2024) — with NVIDIA, 128K context, Apache.
- Ministral 3B / 8B (Oct 2024) edge.
- Voxtral — speech model (2025).
Distribution: La Plateforme API, Le Chat web product, AWS Bedrock, Azure AI.
Cohere (Canada)
- Command R+ 08-2024 (104B) and Command R 08-2024 — enterprise RAG-tuned.
- Command A (Mar 2025) — flagship.
- Embed v3 + Embed v4 — multilingual + multimodal embedding models.
- Rerank 3 + Rerank 3.5 — search reranker.
- Aya 23 + Aya Expanse: 23-language multilingual open-weight models (Aya Expanse ships at 8B and 32B parameters).
Enterprise + on-prem focus; the most “boring + reliable” frontier shop.
Reka AI
- Reka Core (~67B), Reka Flash (21B), Reka Edge (7B) — multimodal native (text + image + audio + video).
- Reka Flash 3 (2025) — refresh, open-weighted under Apache.
AI21 Labs (Israel)
- Jamba 1.5 Large (398B MoE) + Jamba 1.5 Mini (52B MoE) — hybrid Transformer-Mamba; 256K context; strong long-doc performance.
- Jurassic legacy line.
Inflection AI
- Pi consumer product paused after Microsoft hire-of-team in March 2024; Inflection 2.5 model now lives at MS as backbone for Copilot. Effectively absorbed.
Frontier closed (China)
Alibaba — Qwen (Tongyi Qianwen)
- Qwen 2.5 family (Sep 2024) — open-weight 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B, Apache 2.0 (except 3B/72B which carry custom license).
- Qwen 2.5-Max — flagship closed, MoE; competitive with GPT-4o on MMLU + GPQA.
- Qwen 2.5-Plus + Qwen 2.5-Turbo — API tiers; Turbo supports 1M context.
- Qwen 3 (Apr 2025) — next-gen with thinking-mode toggle; 0.6B → 235B-A22B MoE variants; many open-weighted.
- Qwen2.5-Coder (1.5B / 7B / 32B) — open SOTA on code among open weights.
- Qwen2.5-VL (3B / 7B / 72B) — vision-language; chart + document understanding.
- QVQ-72B-Preview (Dec 2024) — visual reasoning thinking-model.
- QwQ-32B-Preview (Nov 2024) — reasoning thinking-model; surprisingly strong on AIME + MATH.
DeepSeek (Hangzhou)
- DeepSeek-V2 (May 2024, 236B MoE / 21B active) — established the MoE-at-cheap-price playbook.
- DeepSeek-V3 (Dec 2024, 671B MoE / 37B active) — rivals GPT-4o quality at ~1/10th cost; trained for ~$5.6M reported.
- DeepSeek-R1 (Jan 2025): reasoning model trained primarily with large-scale RL on verifiable rewards (the R1-Zero variant uses pure RL; R1 adds a small cold-start SFT stage); rivals o1 on AIME + MATH + GPQA; fully open weights under the MIT license; the open-source moment of 2025. Distilled variants include R1-Distill-Qwen-32B.
- DeepSeek-Coder V2: code-specialized.
- DeepSeek-VL2 — multimodal.
The pricing + openness combination resets economics: V3 at ~$1.10 per 1M output tokens.
MoonshotAI — Kimi
- Kimi K1 + K1.5 + K2 + K3 generations through 2024–2025.
- Kimi K2 focus: long context (claimed 2M-token document QA in production product).
- Kimi-VL + Kimi-Audio multimodal extensions.
- Heavy on Chinese-language QA + research-assistant use case.
MiniMax (Shanghai)
- abab 6.5 + abab 6.5s (2024) chat models.
- MiniMax-Text-01 and MiniMax-VL-01 (Jan 2025) — open hybrid Transformer + Lightning attention; 4M+ context.
- MiniMax M1 (mid-2025): open-weight reasoning model; hybrid-attention MoE built on Lightning (linear) attention.
- Hailuo AI — video generation product (competitive with Kling / Sora).
Zhipu AI — GLM
- GLM-4 + GLM-4-Plus + GLM-4-Air (2024) closed.
- GLM-4.5 + GLM-4.5V (2025) flagship.
- ChatGLM3-6B and GLM-4-9B open-weight.
- CogVideoX open video model line (2B / 5B).
- CogVLM2 / GLM-4V multimodal.
01-AI — Yi
- Yi-Large + Yi-Lightning (Oct 2024) closed; Yi-Lightning briefly top-10 LMArena late 2024.
- Yi-1.5 open (6B / 9B / 34B) Apache.
- Yi-Coder + Yi-Vision specialized.
- Founder Kai-Fu Lee; company pivoted toward applications + tooling mid-2025.
Baidu — ERNIE
- ERNIE 4.0 + ERNIE 4.0 Turbo + ERNIE Speed + ERNIE Lite + ERNIE Tiny API tier.
- ERNIE 4.5 (Mar 2025) — flagship refresh.
- ERNIE X1 — reasoning model (Mar 2025) at ~half DeepSeek-R1’s quoted price.
- Wenxin Yiyan consumer chat product.
ByteDance — Doubao (Volcengine)
- Doubao 1.5 Pro + Doubao 1.5 Lite + Doubao 1.5 Vision Pro (Jan 2025).
- Doubao-1.5-pro-256k long-context, Doubao-Reasoning o1-style.
- Doubao-Audio-TTS / ASR voice stack — used in CapCut + Douyin.
- Aggressive price war pricing — drove Chinese API rates down 80%+ in late 2024.
Tencent — Hunyuan
- Hunyuan Large (389B MoE / 52B active, Nov 2024) — largest open MoE at release time.
- Hunyuan Turbo (closed) flagship.
- Hunyuan 3D 1.0 / 2.0 — open 3D-asset generation (3D-aware diffusion + multi-view).
- Hunyuan Video (Dec 2024) — open video model, 13B; rivals closed Kling/Sora on benchmarks.
Stepfun (Shanghai)
- Step-1V multimodal + Step-2 (1T parameters, MoE) closed flagship.
- Step-Video + Step-Audio multimodal extensions.
Skywork (Kunlun Tech)
- Skywork 13B open + Skywork-MoE (146B-A22B) open.
- Skywork-O1-Open — reasoning chain-of-thought open-weight (Nov 2024).
- SkyReels-V1 open video generation (2025).
InternLM (Shanghai AI Lab)
- InternLM2.5 (1.8B / 7B / 20B) open Apache.
- InternLM3-8B-Instruct (2025) open.
- InternVL2 + InternVL2.5 + InternVL3 vision-language; rivals proprietary on MMMU + DocVQA.
Mianbi (ModelBest) — MiniCPM
- MiniCPM-V 2.6 (8B) — small multimodal; runs on mobile.
- MiniCPM 3.0 4B + MiniCPM-o 2.6 (omni-modal: text + image + audio + video).
- Edge + on-device focus; widely used in Chinese smartphone deployments.
Open-weight Western
Meta — LLaMA
- LLaMA 3 (Apr 2024, 8B / 70B) — community-license open.
- LLaMA 3.1 (Jul 2024, 8B / 70B / 405B) — 128K context; 405B was largest open dense at release.
- LLaMA 3.2 (Sep 2024) — 1B / 3B text + 11B / 90B vision; introduced multimodal LLaMA.
- LLaMA 3.3 70B (Dec 2024) — instruction-tuned refresh matching 405B quality at 70B size.
- LLaMA 4 (Apr 2025) — multimodal MoE family:
  - LLaMA 4 Scout (109B MoE / 17B active, 10M context): competitive with Gemini Flash.
  - LLaMA 4 Maverick (400B MoE / 17B active, 1M context): challenger to GPT-4o.
  - LLaMA 4 Behemoth (~2T MoE / 288B active): teacher model, in training mid-2025.
Llama ecosystem is the gravitational center of open-weight: HuggingFace transformers, vLLM, llama.cpp, Ollama, LM Studio, GGUF format, derivative fine-tunes by the thousands.
Mistral open
Covered above — Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Nemo 12B, Mistral Small 3.1 24B, Pixtral 12B, Codestral Mamba 7B, Mathstral 7B.
Microsoft — Phi
- Phi-3 (Apr 2024, 3.8B / 7B / 14B) — small + capable; punches above weight on reasoning + code.
- Phi-3.5 (Aug 2024) — Mini-Instruct (3.8B), MoE (16x3.8B / 6.6B active), Vision (4.2B).
- Phi-4 (Dec 2024, 14B) — reasoning-trained, math + science strong.
- Phi-4-multimodal + Phi-4-mini (Feb 2025) — text + vision + audio compact.
- Phi-4-reasoning + Phi-4-reasoning-plus (Apr 2025) — thinking models.
- Phi Silica — quantized variant shipping on Windows Copilot+ PCs.
Google Gemma
Covered above — Gemma 2 + Gemma 3 + PaliGemma 2 + CodeGemma + RecurrentGemma (Griffin-architecture experiment) + ShieldGemma safety classifier.
Allen Institute (AI2)
- OLMo 2 (Nov 2024, 7B / 13B) — fully-open: weights + data + checkpoints + training code.
- Tülu 3 (Nov 2024) — open post-training recipe applied to LLaMA 3.1; rivals closed instruct quality.
- Molmo (Sep 2024) — open vision-language; competitive with proprietary 7B-class VLMs.
- OLMoE (Sep 2024, 7B MoE / 1B active) — open MoE.
Stability AI
Mostly legacy. Stable LM 2 1.6B / 12B, StableLM Zephyr small models — declining relevance vs Phi + Gemma.
Serving infrastructure (open-model hosts)
- Together AI — fast inference, fine-tuning, dedicated.
- Fireworks AI — function calling, low-latency serving.
- DeepInfra — cheap per-token open-model API.
- Hyperbolic — GPU rental + serving.
- Replicate — model-as-API with cold-start.
- Groq — LPU custom silicon, extreme tokens/sec on selected open models.
- Cerebras Inference — wafer-scale silicon, similar speed claims.
- SambaNova — RDU silicon serving.
- AWS Bedrock, Azure AI Foundry, GCP Vertex Model Garden — hyperscaler model garden hosting.
Multimodal (vision + audio + video + image)
Vision-language (VLM)
- GPT-4o + GPT-4.1 + o3 / o4-mini with vision.
- Claude Opus 4.x + Sonnet 4.x + Haiku 4.x + Claude 3.5/3.7 vision.
- Gemini 2.5 Pro / Flash vision.
- Pixtral 12B / Pixtral Large (Mistral, open).
- Qwen2.5-VL 3B / 7B / 72B + QVQ-72B (Alibaba, open).
- InternVL2 / 2.5 / 3 (Shanghai AI Lab, open).
- LLaVA-OneVision / LLaVA-NeXT academic line.
- LLaMA 3.2 Vision 11B / 90B + LLaMA 4 native multimodal.
- Phi-4 Multimodal (Microsoft).
- Molmo (Allen) — open SOTA on visual grounding.
- CogVLM2 / GLM-4V (Zhipu).
- MiniCPM-V 2.6 / MiniCPM-o 2.6 (Mianbi, mobile-class).
- NVLM-D-72B (NVIDIA, open).
Audio input (speech recognition + speech-to-text)
- Whisper large-v3 / large-v3-turbo (OpenAI, open, 2022–2024) — open SOTA ASR.
- GPT-4o audio (Realtime API) — speech-in, speech-out, low-latency conversational.
- Gemini Live — bidirectional speech + screen-share, June 2024.
- Moshi (Kyutai, Sep 2024, open) — streaming full-duplex 200ms latency.
- Voxtral (Mistral, 2025) — open speech understanding.
- SeamlessM4T v2 (Meta, open) — translation + transcription, 100+ langs.
- Canary-1B / Parakeet (NVIDIA, open) — ASR leaders.
- Qwen2-Audio + Doubao Audio.
Audio output (text-to-speech + voice cloning)
- ElevenLabs (v3 Turbo, Multilingual v2) — commercial leader for quality + voice cloning.
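- OpenAI tts-1 / tts-1-hd / gpt-4o-mini-tts: speech-synthesis models served through the audio speech endpoint; native speech output is also available via the Realtime API.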
- Sesame CSM-1B (2025, open) — conversational speech model.
- Suno Voice + Suno Bark — singing + speech.
- Resemble AI, Play.ht 2.0, Cartesia Sonic (sub-100ms latency), Inflection Voice.
- MeloTTS + F5-TTS + GPT-SoVITS + OpenVoice v2 (open).
- MiniMax T2A, Doubao TTS, CosyVoice (Alibaba open) — Chinese-led open frontier.
Video generation
- Sora (OpenAI, Feb 2024 demo, Dec 2024 GA) — up to 20s at 1080p; text + image + video conditioning.
- Veo 2 (Google, Dec 2024) and Veo 3 (May 2025) — Veo 3 added native audio generation, perceived top-tier mid-2025.
- Kling 1.6 / 2.0 (Kuaishou) — commercial Chinese leader; minutes-long realistic.
- Runway Gen-3 Alpha / Gen-4 — pioneer commercial product.
- Pika 2.0 / 2.1 — consumer-creator focused.
- Luma Dream Machine (Ray2 model, 2025).
- Mochi 1 (Genmo, Oct 2024, open Apache).
- Hunyuan Video (Tencent, Dec 2024, open).
- WanX / Wan 2.1 (Alibaba, Feb 2025 open) — competitive with closed on motion quality.
- CogVideoX + OpenSora (open academic).
- LTX-Video (Lightricks, open).
- Hailuo / MiniMax video (closed Chinese product).
Image generation
- DALL-E 3 (OpenAI) — inside ChatGPT + Bing.
- GPT-Image-1 (OpenAI, Mar 2025) — native multimodal image gen inside GPT-4o.
- Imagen 3 / Imagen 4 (Google) — strong text rendering + photorealism.
- Midjourney v6.1 / v7 — aesthetic leader; Discord + web.
- Stable Diffusion 3 / 3.5 Large (Stability, open) + SDXL legacy.
- FLUX.1 [pro / dev / schnell] (Black Forest Labs, Aug 2024) + FLUX.1.1 pro + FLUX Kontext (May 2025) — open + commercial; new open leader on aesthetics + prompt-adherence.
- Ideogram 2.0 / 3.0 — strong text-in-image.
- Recraft V3 — vector + design-focused.
- HiDream (open, 2025), Lumina-mGPT, Janus-Pro (DeepSeek, Jan 2025 open).
Coding-specific models
- Claude Opus 4.x / Sonnet 4.x — frontier on SWE-Bench Verified (~72%+); the integrated agent stack (Claude Code CLI + Cursor + Windsurf) compounds the advantage.
- GPT-4.1 + o4-mini-high — strong coding, especially on long-context refactors.
- Codestral 22B / 25.01 + Codestral Mamba 7B (Mistral).
- DeepSeek-Coder V2 + DeepSeek-V3 for code — open SOTA in late 2024 / early 2025.
- Qwen2.5-Coder 32B — open SOTA mid-2025; matches GPT-4o on HumanEval / MBPP / LiveCodeBench.
- CodeLlama (Meta) — legacy.
- Granite Code (IBM, open).
- Starcoder2 (BigCode / ServiceNow / HF, open).
- Yi-Coder 9B (01-AI, open).
- Tabby, Continue.dev, Aider, Cline, Cursor, Windsurf, Codex CLI (OpenAI, Apr 2025), Claude Code — IDE / CLI surfaces.
Smaller + edge models
- Phi-3.5 Mini 3.8B, Phi-4 Mini — Microsoft small-model excellence.
- Gemma 3 1B / 4B / 12B — Google compact multimodal.
- LLaMA 3.2 1B / 3B — mobile-class.
- Mistral Nemo 12B, Ministral 3B / 8B.
- MiniCPM 3.0 4B + MiniCPM-V 2.6 8B (mobile-class multimodal).
- Qwen 2.5 0.5B / 1.5B / 3B + Qwen 3 0.6B / 1.7B / 4B.
- SmolLM2 (HuggingFace, 135M / 360M / 1.7B) — research-class tiny.
- Phi Silica — quantized 3.3B running locally on Windows Copilot+ NPUs.
- Apple Intelligence Foundation Models — ~3B on-device on iPhone 15 Pro+, M-series Macs (Sep 2024+).
- Gemini Nano: on-device Pixel + Chrome (built-in window.ai JS API rolling out 2025).
- TinyLlama 1.1B, StableLM Zephyr 3B: legacy small-model line.
- Llama.cpp / Ollama / MLX / LM Studio runtimes target this tier on consumer hardware.
Domain-specific + specialized
- Finance: BloombergGPT (50B, internal), FinGPT (open academic), Salesforce XGen-Sales, Hebbia (RAG-focused for analysts).
- Medical: Med-PaLM 2 / Med-Gemini (Google, multimodal medical), Anthropic clinical adapters, OpenBioLLM, MedAlpaca, BioMistral, Hippocratic AI (patient-facing).
- Science: Galactica (Meta, 2022 — withdrawn), Tx-LLM (Google therapeutics), ChemCrow + Coscientist research agents.
- Biology / proteins: AlphaFold-3 (DeepMind, May 2024), AlphaProteo (Sep 2024), ESM-3 (EvolutionaryScale, 2024); see [[Biology/genetics-and-genomics]].
- Legal: Harvey AI (enterprise legal), CoCounsel (Thomson Reuters / Casetext), Lexis+ AI, Spellbook (contracts), Robin AI.
- Education: Khanmigo (Khan Academy + GPT-4o), Duolingo Max (GPT-4o tutor), Synthesis (math), Magic School AI.
- Code: covered above.
- Robotics + embodied: RT-2 (Google), Pi-0 (Physical Intelligence, 2024), Helix (Figure AI), Gemini Robotics-1 (Mar 2025), Optimus + xAI integrations; see [[Robotics/robot-learning-and-foundation-models]].
Eval + leaderboards
General + chat preference
- LMSYS Chatbot Arena → lmarena.ai: crowdsourced pairwise Elo, the single most-watched live leaderboard. Categories: Overall, Hard Prompts, Coding, Math, Vision, Multi-turn, Style-controlled.
- Artificial Analysis: independent multi-axis benchmarking (quality + speed + price) at artificialanalysis.ai.
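The pairwise Elo idea behind arena-style leaderboards can be illustrated with a minimal sketch. It assumes the classic online Elo update with a K-factor of 32 and a starting rating of 1000; those constants and the function names are illustrative choices, and the live leaderboard's actual fitting procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two models start level and A wins one vote.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

Because the expected score depends only on the rating gap, an upset win against a much higher-rated model moves both ratings further than a win between equals.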
Knowledge + reasoning
- MMLU (57-subject multiple choice) — saturated ~92% by frontier; deprecated as discriminator.
- MMLU-Pro — harder variant, 10-way choices, still useful.
- MMLU-Redux — error-corrected.
- GPQA-Diamond (Google-proof QA, PhD-level science) — frontier discriminator; o3 ~87%.
- BIG-Bench Hard (BBH) — composite reasoning, mostly saturated.
- HellaSwag, ARC-Challenge, WinoGrande, TruthfulQA — older but still cited.
Math + AIME
- MATH (competition math) — frontier ~95%+, saturating.
- GSM8K (grade-school math) — saturated ~95%+.
- AIME 2024 / 2025 — American Invitational Math Exam, the new high-water mark for reasoning models; o3 / R1 / QwQ score 80%+ on AIME 2024.
- OlympiadBench, PutnamBench, FrontierMath (Epoch AI, 2024): frontier-grade math; pre-o3 models scored under ~2% on FrontierMath, and OpenAI reported ~25% for o3 in a high-compute setting (Dec 2024).
Coding
- HumanEval (164 Python problems) — saturated, ~95%+, deprecated as sole signal.
- MBPP / MBPP+ — basic programming.
- SWE-Bench Verified (500 real GitHub issues) — current frontier coding benchmark; Claude Opus 4.x ~72%, o3 ~71% mid-2025.
- SWE-Bench Multilingual / Lite — variants.
- LiveCodeBench — contamination-resistant competitive programming.
- Aider Polyglot — multi-language refactoring.
- BigCodeBench — library-rich tasks.
- CodeContests — historical Codeforces.
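Scores on HumanEval-style benchmarks are commonly reported as pass@k. The sketch below shows the standard unbiased estimator, assuming n generated samples per problem of which c pass the unit tests; the function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 passed the tests.
print(round(pass_at_k(n=10, c=3, k=1), 3))
print(round(pass_at_k(n=10, c=3, k=5), 3))
```

The benchmark score is the mean of this estimate over all problems; for k=1 it reduces to the plain fraction of passing samples.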
Tool use + function calling + agents
- BFCL (Berkeley Function-Calling Leaderboard) v1 / v2 / v3 — tool-use accuracy.
- τ-Bench (Tau-Bench, Sierra 2024): agentic customer-support.
- GAIA (Meta + HF, 2023) — general-assistant.
- AgentBench, WebArena, VisualWebArena, AppWorld — agent envs.
- OSWorld + WindowsAgentArena — computer-use evaluation.
Multimodal
- MMMU (multi-discipline university multimodal) — frontier ~80%.
- MathVista — visual math.
- DocVQA, ChartQA, TextVQA, VQAv2 — document + chart + scene VQA.
- MM-Vet, MMBench, SEED-Bench.
- VideoMME, MVBench — video understanding.
Long context
- Needle-in-a-Haystack (NIAH) — single-fact retrieval; saturated for 1M-context frontier.
- RULER (NVIDIA, 2024) — synthetic long-context across 13 tasks; harder discriminator.
- InfiniteBench, LongBench v2 — natural long-doc QA.
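A needle-in-a-haystack probe of the kind listed above can be sketched in a few lines. This is a minimal illustration, assuming sentence-level filler text and a substring grader; the helper names, the filler, and the passphrase are all hypothetical, and real harnesses sweep context length and depth over a grid.

```python
def build_niah_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler_sentences))
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(haystack) + "\n\nQuestion: " + question


def score_answer(model_answer: str, expected: str) -> bool:
    """Case-insensitive substring check, the simplest possible grader."""
    return expected.lower() in model_answer.lower()


# Hypothetical haystack: 1000 filler sentences with one planted fact.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret passphrase is blue-heron-42."
prompt = build_niah_prompt(
    filler, needle, depth=0.5, question="What is the secret passphrase?"
)
print(len(prompt.split()), "words in prompt")
```

The prompt would then be sent to the model under test and the reply passed to `score_answer`; single-fact probes like this are what the harder multi-task suites extend.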
Reasoning / generalization frontier
- ARC-AGI-1 (Chollet) — abstract reasoning grids; o3 high-compute hit 87.5% (Dec 2024), exceeding the prize threshold.
- ARC-AGI-2 (Mar 2025) — successor benchmark; frontier still <10% as of Q2 2025.
- HLE — Humanity’s Last Exam (Center for AI Safety + Scale, Jan 2025) — 3000 expert-level questions; frontier mid-2025 ~20%.
Safety + alignment
- HarmBench (CAIS).
- AILuminate (MLCommons, 2024) — safety benchmark with industry adoption.
- AISI evaluations (UK + US AI Safety Institutes) — frontier pre-deployment access.
- AgentHarm (Gray Swan AI + UK AISI, 2024): agentic-safety eval.
Cost bands (USD per 1M input + output tokens, mid-2025 ballpark)
| Tier | Examples | Input $/M | Output $/M |
|---|---|---|---|
| Frontier reasoning | OpenAI o3, Claude Opus 4 thinking, Gemini 2.5 Pro thinking | $15–30 | $60–150 |
| Frontier chat | GPT-4o, GPT-4.1, Claude Sonnet 4, Gemini 2.5 Pro, Grok-3 | $2–6 | $8–25 |
| Mid closed | GPT-4o-mini, Claude Haiku 4, Gemini 2.5 Flash, Mistral Small | $0.15–0.80 | $0.60–4 |
| Cheap closed | GPT-4.1-nano, Gemini Flash-Lite, Doubao Lite, DeepSeek-V3 API | $0.05–0.30 | $0.15–1.10 |
| Open hosted | LLaMA 3.3 70B, Qwen 2.5 72B, Mixtral 8x22B on Together/Fireworks | $0.10–0.80 | $0.30–3 |
| Open self-hosted | Any model on rented GPUs | electricity + GPU-hours | electricity + GPU-hours |
Chinese provider pricing collapsed in late 2024 via ByteDance + Alibaba + DeepSeek price war; many APIs sit at 1/5 to 1/20 of US frontier rates. Quote with caveat — pricing changes weekly.
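To make the bands in the table concrete, here is a minimal per-request cost calculation. The prices used are illustrative values picked from inside the table's ranges, not quotes for any specific model, and the function name is hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD of one request, given per-1M-token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Assumed workload: 10K input tokens and 1K output tokens per request.
frontier_chat = request_cost_usd(10_000, 1_000, input_price_per_m=3.0, output_price_per_m=15.0)
mid_closed = request_cost_usd(10_000, 1_000, input_price_per_m=0.15, output_price_per_m=0.60)

print(f"frontier chat: ${frontier_chat:.4f} per request")
print(f"mid closed:    ${mid_closed:.4f} per request")
```

Under these assumed prices the same request costs about $0.045 on a frontier chat model and about $0.0021 on a mid-tier one, roughly a 20x gap, which is why input-heavy workloads are so sensitive to tier choice.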
Trends 2024 – 2026
- Reasoning + thinking models. o1 (Sep 2024) opened the paradigm; R1 + QwQ + Gemini 2.5 Thinking + Claude extended-thinking + Phi-4-reasoning followed. RL on verifiable rewards (math, code, formal proofs) is the dominant training axis. Test-time compute scaling is the new dimension on the scaling laws.
- Mixture-of-Experts. Mixtral lit the fuse (Dec 2023); DeepSeek-V3 (671B MoE), LLaMA 4 (Scout / Maverick / Behemoth MoE), Qwen 3 MoE variants, Hunyuan Large, MiniMax Text-01, and Skywork-MoE followed. Active-parameter counts of 17B–37B at total counts of 100B–2T+ are now mainstream. GPT-4 was long rumored to be MoE; OpenAI has not confirmed the architecture of GPT-4o or GPT-4.1, so any MoE attribution there is inference from serving characteristics, not a disclosed fact.
- Long context routine. 128K is table-stakes. 1M is normal (Gemini 1.5/2.5, GPT-4.1; LLaMA 4 Scout claims 10M). Kimi K2 + MiniMax 01 reach 2M–4M+. With NIAH saturated, RULER and natural long-document QA (LongBench v2, InfiniteBench) are the discriminators; real comprehension still degrades past ~200K for most models.
- Multimodal native. Gemini 2.5, GPT-4o, Claude 3.5/3.7+, LLaMA 4, Qwen2.5-VL, Phi-4-multimodal, MiniCPM-o: single model, all modalities, joint training rather than late-fused adapters. Image-out (GPT-Image-1, native Imagen-in-Gemini) collapsing image-gen into the LLM.
- Agentic tooling baked in. Claude Computer Use (Oct 2024), OpenAI Operator (Jan 2025), Google Project Mariner + Project Astra previews. MCP (Model Context Protocol) open standard (Anthropic, Nov 2024) adopted by OpenAI + others through 2025: the USB-C of agentic tool integration.
- Tool calling + structured output mature. JSON schema + grammar-constrained decoding (vLLM + Outlines + xgrammar) is broadly reliable across frontier models. Parallel tool calls are standard. OpenAI Responses API (Mar 2025) is positioned as the successor to the Assistants API.
- Smaller-cheaper-faster competitive on routine tasks. Haiku 4.x, Gemini Flash, GPT-4o-mini, Phi-4, Gemma 3 27B, and Qwen 2.5 32B can each handle the bulk of routine production traffic (roughly 80% as a rule of thumb) at about 1/10th of frontier cost. The router pattern (auto-route easy queries to cheap models) is standard architecture.
- Open weights catching up. DeepSeek-R1 (open MIT-license reasoning matching o1), Qwen 3 (open at multiple scales), LLaMA 4: the open-vs-closed gap narrowed from 12+ months to 3–6 months on quality. China-origin open weights now dominate the open frontier on both quality and openness of license.
- Inference scaling laws + test-time compute. The o1 paradigm (spend compute at inference rather than only at training) re-opens scaling. Compute-budgeted reasoning, search + verify, parallel sampling + best-of-N, process-reward-model verifiers all in production.
- Distillation pipeline standard. Frontier model → cheap-fast variant → on-device variant is now an explicit product line at every major lab. R1 → R1-Distill-Qwen-32B is the canonical open example.
- Hybrid architectures. Mamba / SSM layers appear in Jamba 1.5 (AI21, hybrid Transformer-Mamba), Codestral Mamba (Mistral), and Falcon Mamba (TII), all targeting long-context efficiency; MiniMax Text-01 and M1 pursue the same goal with hybrid Lightning (linear) attention rather than Mamba.
- Agentic + reasoning combine. o3 + Claude Opus 4.x + Gemini 2.5 Thinking show that thinking models executing tool calls deliver markedly better results on real coding + research tasks than either capability alone. Coding agents (Claude Code, Codex CLI, Windsurf Cascade, Cursor Composer) are the first mass-market application.
- Inference silicon diversification. Groq LPU + Cerebras WSE + SambaNova RDU + AWS Trainium 2 + Google TPU v5p / v6 / v7 + Microsoft Maia + Meta MTIA + AMD MI300X / MI325X / MI355X + NVIDIA H100 / H200 / B100 / B200 / GB200; see [[Compute/Tier3/ai-accelerators]].
- Safety + governance. EU AI Act in force Aug 2024 with phased application through 2026. UK + US AISI pre-deployment evaluations on frontier models. Voluntary frontier-safety frameworks (Responsible Scaling Policies, Preparedness Framework, Frontier Safety Framework) from Anthropic + OpenAI + DeepMind.
- China decoupling pressure + competition. Export controls (H100/H200/B200 to China restricted Oct 2023 + Oct 2024 + Jan 2025 tightening). Chinese labs respond with MoE efficiency (DeepSeek), Huawei Ascend 910C, and aggressive open-weighting. The DeepSeek-R1 Jan 2025 release was a watershed moment for the openness + cost narrative.
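The router pattern named in the trends above (send easy queries to a cheap model, hard ones to a frontier model) can be sketched as follows. This is a toy illustration: the keyword-and-length heuristic, the model names, and the stub handlers are all assumptions, and production routers typically use a learned classifier instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str
    handler: Callable[[str], str]  # stands in for a real API call


def looks_hard(prompt: str, max_easy_words: int = 50) -> bool:
    """Crude difficulty heuristic: long prompts or reasoning keywords."""
    keywords = ("prove", "refactor", "step by step", "debug")
    lowered = prompt.lower()
    return len(prompt.split()) > max_easy_words or any(k in lowered for k in keywords)


def route(prompt: str, cheap: Route, frontier: Route) -> tuple[str, str]:
    """Pick a tier for the prompt and return (model name, response)."""
    chosen = frontier if looks_hard(prompt) else cheap
    return chosen.name, chosen.handler(prompt)


# Stub handlers; replace with real client calls in practice.
cheap = Route("cheap-model", lambda p: f"[cheap] {p}")
frontier = Route("frontier-model", lambda p: f"[frontier] {p}")

print(route("What is the capital of France?", cheap, frontier)[0])
print(route("Debug this race condition and refactor the lock.", cheap, frontier)[0])
```

The first prompt goes to the cheap tier and the second to the frontier tier; the design choice that matters is keeping the difficulty check far cheaper than the frontier call it avoids.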
Cross-references
- [[Compute/transformer-architecture]]: attention + MoE + Mamba mechanics underlying every model in this catalog.
- [[Compute/fine-tuning-rlhf]]: SFT + DPO + RLHF + RLVR (the training method behind reasoning models).
- [[Compute/inference-optimization]]: quantization (GPTQ, AWQ, GGUF, FP8), speculative decoding, KV-cache mgmt, batching.
- [[Compute/rag-embeddings-vector-search]]: embedding models + vector DBs that pair with these LLMs.
- [[Compute/prompt-engineering-agent-systems]]: agent frameworks + MCP + tool calling patterns.
- [[Compute/model-serving-infrastructure]]: vLLM, TGI, TensorRT-LLM, SGLang, Triton, batching, autoscaling.
- [[Compute/Tier3/ml-framework-comparison]]: PyTorch / JAX / TF / MLX training-side comparison.
- [[Compute/Tier3/ai-accelerators]]: GPU + TPU + LPU + custom silicon for training + inference.
- [[Compute/Tier3/_index]]: Compute Tier 3 family index.
Snapshot date: 2026-05-17. Models, scores, and prices quoted reflect the state of the field as of mid-2025 Q1/Q2 with selective notes through Q2 2026. The landscape moves on a monthly cadence — verify any specific claim against the provider’s live documentation, lmarena.ai, or artificialanalysis.ai before acting on it.