LLM Landscape Catalog — Family Index

At a glance

The large-language-model landscape evolves on a monthly cadence; this note is an explicit snapshot dated mid-2025 (Q1/Q2) with selective forward-looking notes through 2026. Treat every concrete model, benchmark score, and price quoted here as a time-stamped claim — verify against the live provider docs or lmarena.ai before acting on it.

As of the snapshot date there are more than 50 frontier-class models in active commercial deployment, dozens of competitive open-weight families, and a long tail of domain-specific or smaller-edge offerings. The capability frontier in mid-2025 is anchored by OpenAI o3 / GPT-4.1 (and rumored GPT-5), Anthropic Claude Opus 4.x / Sonnet 4.x, Google Gemini 2.5 Pro, xAI Grok-3 / Grok-4, DeepSeek-V3 / R1, Alibaba Qwen 2.5 / Qwen 3, and Meta LLaMA 4 (Maverick, Behemoth). The reasoning-model branch (o1 / o3, DeepSeek-R1, Gemini 2.5 Thinking, QwQ-32B-Preview, Claude Sonnet 4.x extended thinking) emerged in late 2024 and is the dominant capability axis of 2025. Multimodal-native models (Gemini 2.5, GPT-4o, Claude 3.7+, LLaMA 4, Pixtral, Qwen2.5-VL) have largely displaced text-only as the default release form factor.

This index catalogs: (a) frontier closed US/EU labs, (b) frontier closed China labs, (c) open-weight Western families, (d) modality-specific stacks (vision, audio, video, image), (e) coding-specialized, (f) edge/small, (g) domain-specific verticals, (h) eval and leaderboards, (i) cost bands, and (j) the structural trends driving 2024–2026.


Frontier closed (US + EU)

OpenAI

  • GPT-4o (May 2024) — flagship multimodal (text + vision + audio in/out), 128K context, ~$10 per 1M output tokens. Coding strong, omni-modal native.
  • GPT-4o-mini (Jul 2024) — distilled, $0.60 per 1M output tokens; default cheap workhorse. Replaced GPT-3.5-turbo.
  • GPT-4.1 (Apr 2025) — coding + long-context refresh, 1M context window, $8 per 1M output tokens. Better instruction-following than 4o on long-form tasks.
  • GPT-4.1-mini + GPT-4.1-nano — cheaper tiers in the 4.1 family.
  • o1 (Dec 2024 GA) — reasoning model, hidden chain-of-thought, $60 per 1M output tokens; near-saturates MATH (~94%) and scores ~83% on AIME.
  • o1-mini — coding-focused reasoning, cheaper.
  • o3 (Apr 2025) — frontier reasoning, breakthrough on ARC-AGI-1 (87.5% high-compute), GPQA-Diamond ~87%, SWE-Bench Verified ~71%. Expensive ($/task compute) but capability-defining.
  • o3-mini + o4-mini — production reasoning tiers; o4-mini-high a strong coding model.
  • GPT-5 — rumored launch H2 2025; unified reasoning + chat with adaptive thinking.
  • DALL-E 3 image, Sora video (GA Dec 2024), Whisper audio, Operator (Jan 2025) agentic web-use product, tts-1 / tts-1-hd speech synthesis.

Distribution: ChatGPT (free / Plus / Pro / Team / Enterprise), API (chat completions + responses + assistants), Azure OpenAI (Microsoft cloud).

Anthropic

  • Claude Opus 4.x — frontier model line; coding king on SWE-Bench Verified (~72%+ as of Q2 2025), extended-thinking mode toggle.
  • Claude Sonnet 4.x — mid-tier, the production workhorse; best-cost-to-quality on most enterprise tasks; ~$15 per 1M output tokens.
  • Claude Haiku 4.x — small + fast, $4 per 1M output tokens.
  • Claude 3.7 Sonnet (Feb 2025) — bridge release that introduced extended thinking; widely used through mid-2025.
  • Computer Use (Oct 2024) — agentic screen + mouse + keyboard control (beta).
  • Claude Code — CLI coding-agent product.
  • MCP (Model Context Protocol) — open standard for tool + data connectivity launched Nov 2024; now adopted by competitors.

Distribution: Anthropic API, claude.ai chat, AWS Bedrock, Google Cloud Vertex.

Google DeepMind

  • Gemini 2.5 Pro (Mar 2025) — frontier multimodal + reasoning; LMArena top-3 mid-2025; 1M context (2M extended); strong on math, code, vision.
  • Gemini 2.5 Flash — cheap fast tier; ~$2.50 per 1M output tokens; thinking-mode toggle.
  • Gemini 2.5 Flash-Lite — even cheaper.
  • Gemini 2.0 Pro / Flash / Flash Thinking — late-2024 series.
  • Gemini Nano — on-device (Pixel + Chrome).
  • Gemini Live — bidirectional voice + screen-share API.
  • Gemma 2 (9B / 27B, Jun 2024) and Gemma 3 (1B / 4B / 12B / 27B, Mar 2025) open-weight, multimodal variants.
  • PaliGemma 2 — vision-language open.
  • AlphaFold-3, AlphaProteo, AlphaProof / AlphaGeometry 2 scientific specialists.

Distribution: Gemini app, Google AI Studio, Vertex AI, NotebookLM, Workspace integrations.

xAI

  • Grok-3 (Feb 2025) — frontier reasoning + tool use; trained on Colossus cluster (~200K H100s).
  • Grok-3 mini + Grok-3 Reasoning (Think mode).
  • Grok-4 — announced mid-2025.
  • Native X (Twitter) integration; real-time data access.
  • Aurora image generation (Dec 2024).

Mistral AI (France)

  • Mistral Large 2 (Jul 2024, 123B dense) and Mistral Large 3 (2025) — flagship closed.
  • Mixtral 8x22B (Apr 2024) and Mixtral 8x7B (Dec 2023) open MoE.
  • Mistral Small 3.1 (Mar 2025, 24B) open Apache.
  • Pixtral 12B (Sep 2024) and Pixtral Large 124B (Nov 2024) multimodal.
  • Codestral 22B (May 2024) and Codestral 25.01 (Jan 2025) code-specialized.
  • Mistral Nemo 12B (Jul 2024) — with NVIDIA, 128K context, Apache.
  • Ministral 3B / 8B (Oct 2024) edge.
  • Voxtral — speech model (2025).

Distribution: La Plateforme API, Le Chat web product, AWS Bedrock, Azure AI.

Cohere (Canada)

  • Command R+ 08-2024 (104B) and Command R 08-2024 — enterprise RAG-tuned.
  • Command A (Mar 2025) — flagship.
  • Embed v3 + Embed v4 — multilingual + multimodal embedding models.
  • Rerank 3 + Rerank 3.5 — search reranker.
  • Aya 23 + Aya Expanse — 23/32-language multilingual open-weight.

Enterprise + on-prem focus; the most “boring + reliable” frontier shop.

Reka AI

  • Reka Core (~67B), Reka Flash (21B), Reka Edge (7B) — multimodal native (text + image + audio + video).
  • Reka Flash 3 (2025) — refresh, open-weighted under Apache.

AI21 Labs (Israel)

  • Jamba 1.5 Large (398B MoE) + Jamba 1.5 Mini (52B MoE) — hybrid Transformer-Mamba; 256K context; strong long-doc performance.
  • Jurassic legacy line.

Inflection AI

  • Pi consumer product sidelined after Microsoft hired most of the team (March 2024) and licensed the Inflection models. Effectively absorbed.

Frontier closed (China)

Alibaba — Qwen (Tongyi Qianwen)

  • Qwen 2.5 family (Sep 2024) — open-weight 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B, Apache 2.0 (except 3B/72B which carry custom license).
  • Qwen 2.5-Max — flagship closed, MoE; competitive with GPT-4o on MMLU + GPQA.
  • Qwen 2.5-Plus + Qwen 2.5-Turbo — API tiers; Turbo supports 1M context.
  • Qwen 3 (Apr 2025) — next-gen with thinking-mode toggle; 0.6B → 235B-A22B MoE variants; many open-weighted.
  • Qwen2.5-Coder (1.5B / 7B / 32B) — open SOTA on code among open weights.
  • Qwen2.5-VL (3B / 7B / 72B) — vision-language; chart + document understanding.
  • QVQ-72B-Preview (Dec 2024) — visual reasoning thinking-model.
  • QwQ-32B-Preview (Nov 2024) — reasoning thinking-model; surprisingly strong on AIME + MATH.

DeepSeek (Hangzhou)

  • DeepSeek-V2 (May 2024, 236B MoE / 21B active) — established the MoE-at-cheap-price playbook.
  • DeepSeek-V3 (Dec 2024, 671B MoE / 37B active) — rivals GPT-4o quality at ~1/10th cost; trained for ~$5.6M reported.
  • DeepSeek-R1 (Jan 2025) — reasoning model trained with pure RL on verifiable rewards; rivals o1 on AIME + MATH + GPQA; fully open weights MIT license; the open-source moment of 2025. Distilled variants R1-Distill-Qwen-32B etc.
  • DeepSeek-Coder V2 — code-specialized.
  • DeepSeek-VL2 — multimodal.

The pricing + openness combination resets economics: V3 at ~$1.10 per 1M output tokens.

MoonshotAI — Kimi

  • Kimi K1 + K1.5 + K2 + K3 generations through 2024–2025.
  • Kimi K2 focus: long context (claimed 2M-token document QA in production product).
  • Kimi-VL + Kimi-Audio multimodal extensions.
  • Heavy on Chinese-language QA + research-assistant use case.

MiniMax (Shanghai)

  • abab 6.5 + abab 6.5s (2024) chat models.
  • MiniMax-Text-01 and MiniMax-VL-01 (Jan 2025) — open hybrid Transformer + Lightning attention; 4M+ context.
  • MiniMax M1 (mid-2025) — reasoning model on the same hybrid design (Lightning attention interleaved with standard attention).
  • Hailuo AI — video generation product (competitive with Kling / Sora).

Zhipu AI — GLM

  • GLM-4 + GLM-4-Plus + GLM-4-Air (2024) closed.
  • GLM-4.5 + GLM-4.5V (2025) flagship.
  • ChatGLM-3 (6B / 9B) and GLM-4-9B open-weight.
  • CogVideoX open video model line (2B / 5B).
  • CogVLM2 / GLM-4V multimodal.

01-AI — Yi

  • Yi-Large + Yi-Lightning (Oct 2024) closed; Yi-Lightning briefly top-10 LMArena late 2024.
  • Yi-1.5 open (6B / 9B / 34B) Apache.
  • Yi-Coder + Yi-Vision specialized.
  • Founder Kai-Fu Lee; company pivoted toward applications + tooling mid-2025.

Baidu — ERNIE

  • ERNIE 4.0 + ERNIE 4.0 Turbo + ERNIE Speed + ERNIE Lite + ERNIE Tiny API tier.
  • ERNIE 4.5 (Mar 2025) — flagship refresh.
  • ERNIE X1 — reasoning model (Mar 2025) at ~half DeepSeek-R1’s quoted price.
  • Wenxin Yiyan consumer chat product.

ByteDance — Doubao (Volcengine)

  • Doubao 1.5 Pro + Doubao 1.5 Lite + Doubao 1.5 Vision Pro (Jan 2025).
  • Doubao-1.5-pro-256k long-context, Doubao-Reasoning o1-style.
  • Doubao-Audio-TTS / ASR voice stack — used in CapCut + Douyin.
  • Aggressive price war pricing — drove Chinese API rates down 80%+ in late 2024.

Tencent — Hunyuan

  • Hunyuan Large (389B MoE / 52B active, Nov 2024) — largest open MoE at release time.
  • Hunyuan Turbo (closed) flagship.
  • Hunyuan 3D 1.0 / 2.0 — open 3D-asset generation (3D-aware diffusion + multi-view).
  • Hunyuan Video (Dec 2024) — open video model, 13B; rivals closed Kling/Sora on benchmarks.

Stepfun (Shanghai)

  • Step-1V multimodal + Step-2 (1T parameters, MoE) closed flagship.
  • Step-Video + Step-Audio multimodal extensions.

Skywork (Kunlun Tech)

  • Skywork 13B open + Skywork-MoE (146B-A22B) open.
  • Skywork-O1-Open — reasoning chain-of-thought open-weight (Nov 2024).
  • SkyReels-V1 open video generation (2025).

InternLM (Shanghai AI Lab)

  • InternLM2.5 (1.8B / 7B / 20B) open Apache.
  • InternLM3-8B-Instruct (2025) open.
  • InternVL2 + InternVL2.5 + InternVL3 vision-language; rivals proprietary on MMMU + DocVQA.

Mianbi (ModelBest) — MiniCPM

  • MiniCPM-V 2.6 (8B) — small multimodal; runs on mobile.
  • MiniCPM 3.0 4B + MiniCPM-o 2.6 (omni-modal: text + image + audio + video).
  • Edge + on-device focus; widely used in Chinese smartphone deployments.

Open-weight Western

Meta — LLaMA

  • LLaMA 3 (Apr 2024, 8B / 70B) — community-license open.
  • LLaMA 3.1 (Jul 2024, 8B / 70B / 405B) — 128K context; 405B was largest open dense at release.
  • LLaMA 3.2 (Sep 2024) — 1B / 3B text + 11B / 90B vision; introduced multimodal LLaMA.
  • LLaMA 3.3 70B (Dec 2024) — instruction-tuned refresh matching 405B quality at 70B size.
  • LLaMA 4 (Apr 2025) — multimodal MoE family:
    • LLaMA 4 Scout (109B MoE / 17B active, 10M context) — competitive with Gemini Flash.
    • LLaMA 4 Maverick (400B MoE / 17B active, 1M context) — challenger to GPT-4o.
    • LLaMA 4 Behemoth (~2T MoE / 288B active) — teacher model, in training mid-2025.
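
The total / active parameter pairs quoted for MoE models in this catalog can be compared as a simple ratio. The sketch below is only arithmetic on figures copied from this note; the list name `MOE_MODELS` and the helper names are illustrative. The ratio is a rough proxy for per-token compute, while memory footprint still scales with the total count.

```python
# (name, total parameters in billions, active parameters in billions),
# copied from the figures quoted in this catalog.
MOE_MODELS = [
    ("DeepSeek-V3", 671, 37),
    ("LLaMA 4 Scout", 109, 17),
    ("LLaMA 4 Maverick", 400, 17),
    ("Hunyuan Large", 389, 52),
]


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights used per token (active / total)."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active <= total")
    return active_b / total_b


def summarize(models):
    """Map each model name to its active fraction."""
    return {name: active_fraction(total, active) for name, total, active in models}


if __name__ == "__main__":
    for name, frac in summarize(MOE_MODELS).items():
        print(f"{name}: {frac:.1%} of parameters active per token")
```

On these figures every listed model activates well under a fifth of its weights per token, which is the economic point of the MoE design.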

Llama ecosystem is the gravitational center of open-weight: HuggingFace transformers, vLLM, llama.cpp, Ollama, LM Studio, GGUF format, derivative fine-tunes by the thousands.

Mistral open

Largely covered above: Mixtral 8x7B, Mixtral 8x22B, Mistral Nemo 12B, Mistral Small 3.1 24B, Pixtral 12B. The open line also includes Mistral 7B, Codestral Mamba 7B, and Mathstral 7B.

Microsoft — Phi

  • Phi-3 (Apr 2024, 3.8B / 7B / 14B) — small + capable; punches above weight on reasoning + code.
  • Phi-3.5 (Aug 2024) — Mini-Instruct (3.8B), MoE (16x3.8B / 6.6B active), Vision (4.2B).
  • Phi-4 (Dec 2024, 14B) — reasoning-trained, math + science strong.
  • Phi-4-multimodal + Phi-4-mini (Feb 2025) — text + vision + audio compact.
  • Phi-4-reasoning + Phi-4-reasoning-plus (Apr 2025) — thinking models.
  • Phi Silica — quantized variant shipping on Windows Copilot+ PCs.

Google Gemma

Largely covered above: Gemma 2 + Gemma 3 + PaliGemma 2. The family also includes CodeGemma, RecurrentGemma (Griffin-architecture experiment), and the ShieldGemma safety classifier.

Allen Institute (AI2)

  • OLMo 2 (Nov 2024, 7B / 13B) — fully-open: weights + data + checkpoints + training code.
  • Tülu 3 (Nov 2024) — open post-training recipe applied to LLaMA 3.1; rivals closed instruct quality.
  • Molmo (Sep 2024) — open vision-language; competitive with proprietary 7B-class VLMs.
  • OLMoE (Sep 2024, 7B MoE / 1B active) — open MoE.

Stability AI

Mostly legacy. Stable LM 2 1.6B / 12B, StableLM Zephyr small models — declining relevance vs Phi + Gemma.

Serving infrastructure (open-model hosts)

  • Together AI — fast inference, fine-tuning, dedicated.
  • Fireworks AI — function calling, low-latency serving.
  • DeepInfra — cheap per-token open-model API.
  • Hyperbolic — GPU rental + serving.
  • Replicate — model-as-API with cold-start.
  • Groq — LPU custom silicon, extreme tokens/sec on selected open models.
  • Cerebras Inference — wafer-scale silicon, similar speed claims.
  • SambaNova — RDU silicon serving.
  • AWS Bedrock, Azure AI Foundry, GCP Vertex Model Garden — hyperscaler model garden hosting.

Multimodal (vision + audio + video + image)

Vision-language (VLM)

  • GPT-4o + GPT-4.1 + o3 / o4-mini with vision.
  • Claude Opus 4.x + Sonnet 4.x + Haiku 4.x + Claude 3.5/3.7 vision.
  • Gemini 2.5 Pro / Flash vision.
  • Pixtral 12B / Pixtral Large (Mistral, open).
  • Qwen2.5-VL 3B / 7B / 72B + QVQ-72B (Alibaba, open).
  • InternVL2 / 2.5 / 3 (Shanghai AI Lab, open).
  • LLaVA-OneVision / LLaVA-NeXT academic line.
  • LLaMA 3.2 Vision 11B / 90B + LLaMA 4 native multimodal.
  • Phi-4 Multimodal (Microsoft).
  • Molmo (Allen) — open SOTA on visual grounding.
  • CogVLM2 / GLM-4V (Zhipu).
  • MiniCPM-V 2.6 / MiniCPM-o 2.6 (Mianbi, mobile-class).
  • NVLM-D-72B (NVIDIA, open).

Audio input (speech recognition + speech-to-text)

  • Whisper large-v3 / large-v3-turbo (OpenAI, open, 2022–2024) — open SOTA ASR.
  • GPT-4o audio (Realtime API) — speech-in, speech-out, low-latency conversational.
  • Gemini Live — bidirectional speech + screen-share, June 2024.
  • Moshi (Kyutai, Sep 2024, open) — streaming full-duplex 200ms latency.
  • Voxtral (Mistral, 2025) — open speech understanding.
  • SeamlessM4T v2 (Meta, open) — translation + transcription, 100+ langs.
  • Canary-1B / Parakeet (NVIDIA, open) — ASR leaders.
  • Qwen2-Audio + Doubao Audio.

Audio output (text-to-speech + voice cloning)

  • ElevenLabs (v3 Turbo, Multilingual v2) — commercial leader for quality + voice cloning.
  • OpenAI tts-1 / tts-1-hd / gpt-4o-mini-tts — served through the OpenAI API speech endpoint.
  • Sesame CSM-1B (2025, open) — conversational speech model.
  • Suno Voice + Suno Bark — singing + speech.
  • Resemble AI, Play.ht 2.0, Cartesia Sonic (sub-100ms latency), Inflection Voice.
  • MeloTTS + F5-TTS + GPT-SoVITS + OpenVoice v2 (open).
  • MiniMax T2A, Doubao TTS, CosyVoice (Alibaba open) — Chinese-led open frontier.

Video generation

  • Sora (OpenAI, Feb 2024 demo, Dec 2024 GA) — up to 20s at 1080p; text + image + video conditioning.
  • Veo 2 (Google, Dec 2024) and Veo 3 (May 2025) — Veo 3 added native audio generation, perceived top-tier mid-2025.
  • Kling 1.6 / 2.0 (Kuaishou) — commercial Chinese leader; minutes-long realistic.
  • Runway Gen-3 Alpha / Gen-4 — pioneer commercial product.
  • Pika 2.0 / 2.1 — consumer-creator focused.
  • Luma Dream Machine (Ray2 model, 2025).
  • Mochi 1 (Genmo, Oct 2024, open Apache).
  • Hunyuan Video (Tencent, Dec 2024, open).
  • WanX / Wan 2.1 (Alibaba, Feb 2025 open) — competitive with closed on motion quality.
  • CogVideoX + OpenSora (open academic).
  • LTX-Video (Lightricks, open).
  • Hailuo / MiniMax video (closed Chinese product).

Image generation

  • DALL-E 3 (OpenAI) — inside ChatGPT + Bing.
  • GPT-Image-1 (OpenAI, Mar 2025) — native multimodal image gen inside GPT-4o.
  • Imagen 3 / Imagen 4 (Google) — strong text rendering + photorealism.
  • Midjourney v6.1 / v7 — aesthetic leader; Discord + web.
  • Stable Diffusion 3 / 3.5 Large (Stability, open) + SDXL legacy.
  • FLUX.1 [pro / dev / schnell] (Black Forest Labs, Aug 2024) + FLUX.1.1 pro + FLUX Kontext (May 2025) — open + commercial; new open leader on aesthetics + prompt-adherence.
  • Ideogram 2.0 / 3.0 — strong text-in-image.
  • Recraft V3 — vector + design-focused.
  • HiDream (open, 2025), Lumina-mGPT, Janus-Pro (DeepSeek, Jan 2025 open).

Coding-specific models

  • Claude Opus 4.x / Sonnet 4.x — frontier on SWE-Bench Verified (~72%+); the integrated agent stack (Claude Code CLI + Cursor + Windsurf) compounds the advantage.
  • GPT-4.1 + o4-mini-high — strong coding, especially on long-context refactors.
  • Codestral 22B / 25.01 + Codestral Mamba 7B (Mistral).
  • DeepSeek-Coder V2 + DeepSeek-V3 for code — open SOTA in late 2024 / early 2025.
  • Qwen2.5-Coder 32B — open SOTA mid-2025; matches GPT-4o on HumanEval / MBPP / LiveCodeBench.
  • CodeLlama (Meta) — legacy.
  • Granite Code (IBM, open).
  • Starcoder2 (BigCode / ServiceNow / HF, open).
  • Yi-Coder 9B (01-AI, open).
  • Tabby, Continue.dev, Aider, Cline, Cursor, Windsurf, Codex CLI (OpenAI, Apr 2025), Claude Code — IDE / CLI surfaces.

Smaller + edge models

  • Phi-3.5 Mini 3.8B, Phi-4 Mini — Microsoft small-model excellence.
  • Gemma 3 1B / 4B / 12B — Google compact multimodal.
  • LLaMA 3.2 1B / 3B — mobile-class.
  • Mistral Nemo 12B, Ministral 3B / 8B.
  • MiniCPM 3.0 4B + MiniCPM-V 2.6 8B (mobile-class multimodal).
  • Qwen 2.5 0.5B / 1.5B / 3B + Qwen 3 0.6B / 1.7B / 4B.
  • SmolLM2 (HuggingFace, 135M / 360M / 1.7B) — research-class tiny.
  • Phi Silica — quantized 3.3B running locally on Windows Copilot+ NPUs.
  • Apple Intelligence Foundation Models — ~3B on-device on iPhone 15 Pro+, M-series Macs (Sep 2024+).
  • Gemini Nano — on-device Pixel + Chrome (built-in window.ai JS API rolling out 2025).
  • TinyLlama 1.1B, StableLM Zephyr 3B — legacy small-model line.
  • Llama.cpp / Ollama / MLX / LM Studio runtimes target this tier on consumer hardware.

Domain-specific + specialized

  • Finance: BloombergGPT (50B, internal), FinGPT (open academic), Salesforce XGen-Sales, Hebbia (RAG-focused for analysts).
  • Medical: Med-PaLM 2 / Med-Gemini (Google, multimodal medical), Anthropic clinical adapters, OpenBioLLM, MedAlpaca, BioMistral, Hippocratic AI (patient-facing).
  • Science: Galactica (Meta, 2022 — withdrawn), Tx-LLM (Google therapeutics), ChemCrow + Coscientist research agents.
  • Biology / proteins: AlphaFold-3 (DeepMind, May 2024), AlphaProteo (Sep 2024), ESM-3 (EvolutionaryScale, 2024) — see [[Biology/genetics-and-genomics]].
  • Legal: Harvey AI (enterprise legal), CoCounsel (Thomson Reuters / Casetext), Lexis+ AI, Spellbook (contracts), Robin AI.
  • Education: Khanmigo (Khan Academy + GPT-4o), Duolingo Max (GPT-4o tutor), Synthesis (math), Magic School AI.
  • Code: covered above.
  • Robotics + embodied: RT-2 (Google), Pi-0 (Physical Intelligence, 2024), Helix (Figure AI), Gemini Robotics-1 (Mar 2025), Optimus + xAI integrations — see [[Robotics/robot-learning-and-foundation-models]].

Eval + leaderboards

General + chat preference

  • LMSYS Chatbot Arena (lmarena.ai) — crowdsourced pairwise Elo, the single most-watched live leaderboard. Categories: Overall, Hard Prompts, Coding, Math, Vision, Multi-turn, Style-controlled.
  • Artificial Analysis — independent multi-axis benchmarking (quality + speed + price) at artificialanalysis.ai.
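
To make "pairwise Elo" concrete, here is a minimal sketch of the textbook Elo update applied to one head-to-head vote. The K-factor of 32 and the function names are assumptions for illustration; the live leaderboard's own statistical procedure may differ from this online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.5 for a tie, 0.0 if B was preferred.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; a voter prefers model A.
    print(update(1000.0, 1000.0, 1.0))
```

An upset (a low-rated model beating a high-rated one) moves both ratings further than an expected result, which is why a steady stream of votes is needed before rankings stabilize.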

Knowledge + reasoning

  • MMLU (57-subject multiple choice) — saturated ~92% by frontier; deprecated as discriminator.
  • MMLU-Pro — harder variant, 10-way choices, still useful.
  • MMLU-Redux — error-corrected.
  • GPQA-Diamond (Google-proof QA, PhD-level science) — frontier discriminator; o3 ~87%.
  • BIG-Bench Hard (BBH) — composite reasoning, mostly saturated.
  • HellaSwag, ARC-Challenge, WinoGrande, TruthfulQA — older but still cited.

Math + AIME

  • MATH (competition math) — frontier ~95%+, saturating.
  • GSM8K (grade-school math) — saturated ~95%+.
  • AIME 2024 / 2025 — American Invitational Math Exam, the new high-water mark for reasoning models; o3 / R1 / QwQ score 80%+ on AIME 2024.
  • OlympiadBench, PutnamBench, FrontierMath (Epoch AI 2024) — frontier-grade math; pre-o3 models stayed under ~2% on FrontierMath, and OpenAI reported roughly 25% for o3 (Dec 2024).

Coding

  • HumanEval (164 Python problems) — saturated, ~95%+, deprecated as sole signal.
  • MBPP / MBPP+ — basic programming.
  • SWE-Bench Verified (500 real GitHub issues) — current frontier coding benchmark; Claude Opus 4.x ~72%, o3 ~71% mid-2025.
  • SWE-Bench Multilingual / Lite — variants.
  • LiveCodeBench — contamination-resistant competitive programming.
  • Aider Polyglot — multi-language refactoring.
  • BigCodeBench — library-rich tasks.
  • CodeContests — historical Codeforces.

Tool use + function calling + agents

  • BFCL (Berkeley Function-Calling Leaderboard) v1 / v2 / v3 — tool-use accuracy.
  • τ-Bench (Tau-Bench, Sierra 2024) — agentic customer-support.
  • GAIA (Meta + HF, 2023) — general-assistant.
  • AgentBench, WebArena, VisualWebArena, AppWorld — agent envs.
  • OSWorld + WindowsAgentArena — computer-use evaluation.
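
Function-calling benchmarks in this family score whether a model emits the right call with the right arguments. The sketch below shows that idea in its simplest exact-match form. The output shape (`{"name": ..., "arguments": {...}}`) and both helper names are assumptions for illustration; real leaderboards use richer matching rules.

```python
import json


def call_matches(model_output: str, expected_name: str, expected_args: dict) -> bool:
    """True if the output parses as JSON and names the expected call and arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    return call.get("name") == expected_name and call.get("arguments") == expected_args


def accuracy(cases) -> float:
    """cases: iterable of (model_output, expected_name, expected_args) tuples."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(call_matches(out, name, args) for out, name, args in cases)
    return hits / len(cases)


if __name__ == "__main__":
    good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    print(accuracy([(good, "get_weather", {"city": "Paris"}),
                    ("not json", "get_weather", {"city": "Paris"})]))
```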

Multimodal

  • MMMU (multi-discipline university multimodal) — frontier ~80%.
  • MathVista — visual math.
  • DocVQA, ChartQA, TextVQA, VQAv2 — document + chart + scene VQA.
  • MM-Vet, MMBench, SEED-Bench.
  • VideoMME, MVBench — video understanding.

Long context

  • Needle-in-a-Haystack (NIAH) — single-fact retrieval; saturated for 1M-context frontier.
  • RULER (NVIDIA, 2024) — synthetic long-context across 13 tasks; harder discriminator.
  • InfiniteBench, LongBench v2 — natural long-doc QA.
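
A needle-in-a-haystack probe is easy to build by hand: bury one fact at a chosen depth inside filler text, then ask for it. The sketch below covers only prompt construction and a substring check; the filler sentence, needle, and function names are hypothetical, and the model call itself is left out.

```python
def build_haystack(filler: str, needle: str, total_sentences: int, depth: float) -> str:
    """Repeat `filler` and insert `needle` at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def build_prompt(haystack: str, question: str) -> str:
    """Wrap the long context and the retrieval question into one prompt."""
    return f"{haystack}\n\nQuestion: {question}\nAnswer using only the text above."


def retrieved(answer: str, expected: str) -> bool:
    """Lenient scoring: the expected string appears somewhere in the answer."""
    return expected.lower() in answer.lower()


if __name__ == "__main__":
    hay = build_haystack("The sky is blue.", "The secret code is 7421.", 1000, 0.5)
    prompt = build_prompt(hay, "What is the secret code?")
    print(len(prompt.split()), "words in prompt")
```

Sweeping `depth` and `total_sentences` gives the familiar depth-by-length grid; multi-task suites such as RULER exist because this single-fact version no longer separates frontier models.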

Reasoning / generalization frontier

  • ARC-AGI-1 (Chollet) — abstract reasoning grids; o3 high-compute hit 87.5% (Dec 2024), exceeding the prize threshold.
  • ARC-AGI-2 (Mar 2025) — successor benchmark; frontier still <10% as of Q2 2025.
  • HLE — Humanity’s Last Exam (Center for AI Safety + Scale, Jan 2025) — 3000 expert-level questions; frontier mid-2025 ~20%.

Safety + alignment

  • HarmBench (CAIS).
  • AILuminate (MLCommons, 2024) — safety benchmark with industry adoption.
  • AISI evaluations (UK + US AI Safety Institutes) — frontier pre-deployment access.
  • AgentHarm (Gray Swan AI + UK AISI) — agentic-safety eval.

Cost bands (USD per 1M input + output tokens, mid-2025 ballpark)

Tier | Examples | Input $/M | Output $/M
Frontier reasoning | OpenAI o3, Claude Opus 4 thinking, Gemini 2.5 Pro thinking | $15–30 | $60–150
Frontier chat | GPT-4o, GPT-4.1, Claude Sonnet 4, Gemini 2.5 Pro, Grok-3 | $2–6 | $8–25
Mid closed | GPT-4o-mini, Claude Haiku 4, Gemini 2.5 Flash, Mistral Small | $0.15–0.80 | $0.60–4
Cheap closed | GPT-4.1-nano, Gemini Flash-Lite, Doubao Lite, DeepSeek-V3 API | $0.05–0.30 | $0.15–1.10
Open hosted | LLaMA 3.3 70B, Qwen 2.5 72B, Mixtral 8x22B on Together/Fireworks | $0.10–0.80 | $0.30–3
Open self-hosted | Any model on rented GPUs | electricity + GPU-hours | electricity + GPU-hours

Chinese provider pricing collapsed in late 2024 via ByteDance + Alibaba + DeepSeek price war; many APIs sit at 1/5 to 1/20 of US frontier rates. Quote with caveat — pricing changes weekly.
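
Per-request cost follows directly from the per-1M-token rates. The sketch below shows the arithmetic, plus a rough conversion from a GPU-hour price to a per-1M-token figure for the self-hosted case. The price points are illustrative values chosen from inside the bands above, not provider quotes, and the GPU price and throughput in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the given per-1M-token rates."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def self_hosted_cost_per_m(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """USD per 1M generated tokens on rented hardware, ignoring idle time."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Illustrative points inside the bands above (assumptions, not quotes).
FRONTIER_CHAT = Price(input_per_m=2.0, output_per_m=8.0)
MID_CLOSED = Price(input_per_m=0.15, output_per_m=0.60)

if __name__ == "__main__":
    # A 10K-token prompt with a 1K-token answer.
    print(f"frontier chat: ${request_cost(FRONTIER_CHAT, 10_000, 1_000):.4f}")
    print(f"mid closed:    ${request_cost(MID_CLOSED, 10_000, 1_000):.4f}")
    # Hypothetical: a $3.60/hour GPU sustaining 1,000 tokens/s.
    print(f"self-hosted:   ${self_hosted_cost_per_m(3.60, 1000):.2f} per 1M tokens")
```

The self-hosted figure assumes full utilization. At low traffic the effective per-token cost rises in proportion to idle time, which is usually what decides between hosted and self-hosted.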


Structural trends driving 2024–2026

  1. Reasoning + thinking models. o1 (preview Sep 2024, GA Dec 2024) opened the paradigm; R1 + QwQ + Gemini 2.5 Thinking + Claude extended-thinking + Phi-4-reasoning followed. RL on verifiable rewards (math, code, formal proofs) is the dominant training axis. Test-time compute is the new axis of the scaling laws.

  2. Mixture-of-Experts. Mixtral lit the fuse (Dec 2023); DeepSeek-V3 (671B MoE), LLaMA 4 (Scout / Maverick / Behemoth MoE), Qwen 3 MoE variants, Hunyuan Large, MiniMax Text-01, and Skywork-MoE followed. Active-parameter counts of 17B–37B at total counts of 100B–2T+ are now mainstream. GPT-4 was long rumored to be MoE; OpenAI has not confirmed the architecture of 4o or 4.1, so inferences drawn from their serving characteristics remain speculation.

  3. Long context routine. 128K is table-stakes. 1M is normal (Gemini 1.5/2.5, GPT-4.1; LLaMA 4 Scout claims 10M). Kimi K2 + MiniMax 01 reach 2M–4M+. RULER is the sharper discriminator now that NIAH is saturated; real comprehension still degrades past ~200K for most models.

  4. Multimodal native. Gemini 2.5, GPT-4o, Claude 3.5/3.7+, LLaMA 4, Qwen2.5-VL, Phi-4-multimodal, MiniCPM-o: single model, all modalities, joint training rather than late-fused adapters. Image-out (GPT-Image-1, native Imagen-in-Gemini) is collapsing image-gen into the LLM.

  5. Agentic tooling baked in. Claude Computer Use (Oct 2024), OpenAI Operator (Jan 2025), Google Project Mariner + Project Astra previews. MCP (Model Context Protocol) open standard (Anthropic, Nov 2024) adopted by OpenAI + others through 2025 — the USB-C of agentic tool integration.

  6. Tool calling + structured output mature. JSON schema + grammar-constrained decoding (vLLM + Outlines + xgrammar) is reliable across all frontier models. Parallel tool calls are standard. OpenAI Responses API (Mar 2025) supersedes the Assistants API.

  7. Smaller-cheaper-faster competitive on routine tasks. Haiku 4.x, Gemini Flash, GPT-4o-mini, Phi-4, Gemma 3 27B, and Qwen 2.5 32B can each handle the bulk of routine production traffic at roughly a tenth of frontier cost. The router pattern (auto-route easy queries to cheap models) is standard architecture.

  8. Open weights catching up. DeepSeek-R1 (open MIT-license reasoning matching o1), Qwen 3 (open at multiple scales), LLaMA 4 — the open-vs-closed gap narrowed from 12+ months to 3–6 months on quality. China-origin open weights now dominate the open frontier on both quality and openness of license.

  9. Inference scaling laws + test-time compute. The o1 paradigm (spend compute at inference rather than only at training) re-opens scaling. Compute-budgeted reasoning, search + verify, parallel sampling + best-of-N, and process-reward-model verifiers are all in production.

  10. Distillation pipeline standard. Frontier model → cheap-fast variant → on-device variant is now an explicit product line at every major lab. R1 → R1-Distill-Qwen-32B is the canonical open example.

  11. Hybrid architectures. Mamba / SSM layers appear in Jamba 1.5 (AI21, hybrid Transformer-Mamba), Codestral Mamba (Mistral), and Falcon Mamba (TII); MiniMax-01 and M1 instead interleave Lightning (linear) attention with standard attention. The shared goal is long-context efficiency.

  12. Agentic + reasoning combine. o3 + Claude Opus 4.x + Gemini 2.5 Thinking show that thinking-models executing tool calls deliver markedly better results on real coding + research tasks than either capability alone. Coding agents (Claude Code, Codex CLI, Windsurf Cascade, Cursor Composer) are the first mass-market application.

  13. Inference silicon diversification. Groq LPU + Cerebras WSE + SambaNova RDU + AWS Trainium 2 + Google TPU v5p / v6 / v7 + Microsoft Maia + Meta MTIA + AMD MI300X / MI325X / MI355X + NVIDIA H100 / H200 / B100 / B200 / GB200 — see [[Compute/Tier3/ai-accelerators]].

  14. Safety + governance. EU AI Act in force Aug 2024 with phased application through 2026. UK + US AISI pre-deployment evaluations on frontier models. Voluntary frontier-safety frameworks (Responsible Scaling Policies, Preparedness Framework, Frontier Safety Framework) from Anthropic + OpenAI + DeepMind.

  15. China decoupling pressure + competition. US export controls restrict H100 / H200 / B200 sales to China (rules issued Oct 2022, tightened Oct 2023, Dec 2024, and Jan 2025). Chinese labs respond with MoE efficiency (DeepSeek), Huawei Ascend 910C, and aggressive open-weighting. The DeepSeek-R1 Jan 2025 release was a watershed moment for the openness + cost narrative.
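
The router pattern named in item 7 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the keyword list, the word-count threshold, and the `looks_easy` heuristic are all hypothetical, and the two model arguments are plain callables standing in for real API clients. Production routers typically use a trained classifier rather than keywords.

```python
from typing import Callable, Tuple

# Hypothetical hints that a prompt needs the expensive tier.
HARD_HINTS = ("prove", "refactor", "debug", "step by step")


def looks_easy(prompt: str, max_words: int = 40) -> bool:
    """Crude difficulty heuristic: short prompt with no hard-task keywords."""
    text = prompt.lower()
    return len(text.split()) <= max_words and not any(h in text for h in HARD_HINTS)


def route(
    prompt: str,
    cheap_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
    classifier: Callable[[str], bool] = looks_easy,
) -> Tuple[str, str]:
    """Send easy prompts to the cheap model, everything else to the frontier model."""
    if classifier(prompt):
        return "cheap", cheap_model(prompt)
    return "frontier", frontier_model(prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    cheap = lambda p: f"[cheap] {p}"
    frontier = lambda p: f"[frontier] {p}"
    print(route("What is the capital of France?", cheap, frontier))
    print(route("Please debug this stack trace", cheap, frontier))
```

Returning the chosen tier alongside the response makes it straightforward to log routing decisions and measure how much traffic actually lands on the cheap tier.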


Cross-references

  • [[Compute/transformer-architecture]] — attention + MoE + Mamba mechanics underlying every model in this catalog.
  • [[Compute/fine-tuning-rlhf]] — SFT + DPO + RLHF + RLVR (the training method behind reasoning models).
  • [[Compute/inference-optimization]] — quantization (GPTQ, AWQ, GGUF, FP8), speculative decoding, KV-cache mgmt, batching.
  • [[Compute/rag-embeddings-vector-search]] — embedding models + vector DBs that pair with these LLMs.
  • [[Compute/prompt-engineering-agent-systems]] — agent frameworks + MCP + tool calling patterns.
  • [[Compute/model-serving-infrastructure]] — vLLM, TGI, TensorRT-LLM, SGLang, Triton, batching, autoscaling.
  • [[Compute/Tier3/ml-framework-comparison]] — PyTorch / JAX / TF / MLX training-side comparison.
  • [[Compute/Tier3/ai-accelerators]] — GPU + TPU + LPU + custom silicon for training + inference.
  • [[Compute/Tier3/_index]] — Compute Tier 3 family index.

Snapshot date: 2026-05-17. Models, scores, and prices quoted reflect the state of the field as of mid-2025 (Q1/Q2), with selective notes through Q2 2026. The landscape moves on a monthly cadence, so verify any specific claim against the provider’s live documentation, lmarena.ai, or artificialanalysis.ai before acting on it.