ML Framework Comparison — Family Index
A Tier 3 reference covering the dominant machine-learning frameworks of the 2024–26 era: their lineage, design philosophy, ecosystem, hardware coverage, performance posture, and production readiness. The goal is to make framework selection a structured decision rather than a tribal one. Date of compilation: 2026-05-17.
Introduction — a short history
The deep-learning framework landscape has gone through three reasonably distinct eras.
2010–2015 — the academic-toolkit era. Theano (Université de Montréal, 2007), Caffe (Berkeley, 2013), and Torch7 (NYU / Facebook, Lua-based, 2011) were the early movers. They introduced define-and-run graph compilation, GPU acceleration via CUDA, and the basic primitives of automatic differentiation. None survived as primary research frameworks past 2018, but their conceptual DNA — static graphs, dispatch over n-dim tensors, op-level GPU kernels — propagated into everything that followed.
2015–2020 — the consolidation era. Google’s TensorFlow 1.x (2015) and Facebook’s
PyTorch (2017, a Python-first rewrite of Torch7) split the field. TensorFlow optimized
for production deployment with a static-graph-first design (tf.Session, frozen graphs,
TFServing); PyTorch optimized for research with eager-mode tensor execution and Pythonic
control flow. JAX (Google, 2018) emerged from the Autograd lineage and introduced
composable function transformations (jit, grad, vmap, pmap) over NumPy semantics,
targeting both TPUs and GPUs. MXNet (Apache, Amazon-backed) and CNTK (Microsoft) were
also contenders in this era but lost developer mindshare and were effectively wound down
by 2020–21.
2020–2026 — the pluralism era. PyTorch swept academic ML and most of industry, reaching ~70%+ of new papers by 2024 (per Papers with Code framework stats). JAX established a strong niche in foundation-model training, scientific computing, and TPU-native workflows — used at DeepMind, Anthropic, Google Brain (pre-merger), Pixar, and parts of Meta FAIR. TensorFlow shrank in research relevance but retained a dominant share of mature production pipelines, edge deployment (TFLite / LiteRT), and JS-in-browser ML (TF.js). Keras 3 (2023) reset the multi-backend dream — write once, run on TF / JAX / PyTorch. MLX (Apple, late 2023) made Apple Silicon a first-class research target. Mojo (Modular, 2023+) attempted to fuse Python’s ergonomics with kernel-level performance. Triton (OpenAI, 2021) became the dominant high-level GPU-kernel DSL. Rust entered the fray through candle, burn, and dfdx. And the Hugging Face stack — transformers, accelerate, peft, trl, datasets, tokenizers, diffusers — became the de facto API layer sitting above whichever framework you chose.
The result, in 2026, is a layered ecosystem: choose your framework (PyTorch / JAX / TF / MLX / candle / etc.) for the compute substrate, your library (HF transformers / Flax / Lightning / Equinox / etc.) for the model abstraction, your trainer (DeepSpeed / accelerate / FSDP / Megatron-LM / MaxText / TRL) for distributed orchestration, and your serving stack (vLLM / TGI / TensorRT-LLM / TFServing / ONNX Runtime / MLX serve / MAX serve) for inference.
This index covers the framework and trainer layers in depth, with pointers to adjacent Tier 2 notes on transformers, fine-tuning, inference, serving, and GPU kernels.
Comparison axes
The framework matrix is evaluated along eight axes:
- Ergonomics — how Pythonic, how surprise-free, how good are the error messages.
- Performance — single-GPU throughput, kernel fusion quality, compile latency.
- Hardware coverage — NVIDIA / AMD / Apple / Google TPU / Intel / Qualcomm / WASM.
- Distributed training — data / tensor / pipeline / expert / FSDP / ZeRO support.
- Ecosystem depth — model zoos, pretrained checkpoints, tutorials, third-party libs.
- Production readiness — serving paths, mobile deployment, ONNX export, stability.
- Research adoption — fraction of recent papers, frontier-lab usage.
- Maintenance posture — release cadence, breaking-change discipline, governance.
PyTorch — Meta AI Research, 2017–present
License: BSD-3-Clause. GitHub: pytorch/pytorch, ~83k stars (mid-2026). Latest stable: 2.7 (early 2026), with 2.8 in nightly. Governance: PyTorch Foundation (under the Linux Foundation since Sep 2022 — Meta donated the project).
PyTorch is the gravity well of modern ML. It is eager-mode by default — tensors execute
immediately, control flow is plain Python — which makes debugging straightforward and
research iteration fast. Since PyTorch 2.0 (Mar 2023), torch.compile() wraps any
nn.Module or function in a graph-capture pipeline (TorchDynamo → AOTAutograd →
Inductor) that emits fused Triton kernels on GPU and C++ on CPU. By PyTorch 2.5–2.7
(2025–26), torch.compile is the default recommendation for non-trivial training and
yields 1.5–3x speedups on typical transformer workloads versus eager mode.
The PyTorch sub-ecosystem
- torch.compile — the JIT graph-capture stack; by 2025 it had landed in nearly every Tier 1 lab’s training pipeline.
- torchvision — image models, datasets, transforms; classic but maintained.
- torchaudio — audio, including codecs, spectrograms, and pretrained ASR backbones.
- torchtune — released 2024, a fine-tuning recipe library (LoRA / QLoRA / full FT / PPO / DPO) competing with HF’s peft + trl.
- torchao — quantization and sparsity primitives (int4 / int8 / fp8 / nf4 / mxfp4), a major focus area through 2024–26 as inference cost dominates.
- torchrec — recommendation-system primitives (large embedding tables, sharded collective ops); used by Meta internally.
- torchchat — local LLM inference on laptops; the Meta-blessed answer to llama.cpp, though llama.cpp remains more popular in practice.
- TorchServe — model serving; less popular than dedicated solutions like vLLM and TGI for LLMs, but used for classic CV / NLP models.
- PyTorch/XLA — runs PyTorch on TPUs and on AMD MI300 via XLA; growing in 2024–26 as a viable cross-accelerator path.
- FSDP / FSDP2 — Fully Sharded Data Parallel, PyTorch’s answer to DeepSpeed ZeRO;
FSDP2 (2024) added per-parameter sharding and cleaner composition with
torch.compile.
PyTorch Lightning — research scaffolding
Lightning AI’s PyTorch Lightning (formerly pytorch-lightning, ~30k stars) provides the
LightningModule abstraction, which separates the model definition from training-loop boilerplate.
Lightning has expanded into Lightning Fabric (lower-level distributed primitives) and
LitGPT (LLM-specific recipes). It’s the most popular high-level trainer library for
people who don’t want to write training loops, though Hugging Face’s Trainer is more
common for transformer workloads.
Hugging Face — the application layer
Hugging Face built the standard interface to LLMs on top of PyTorch (and increasingly JAX / TF for parity). The stack as of 2026:
- transformers — model architectures, tokenizers, pretrained weights for ~500k models on the Hub; the API everyone codes against. ~140k stars.
- datasets — streaming-friendly dataset loader; supports Parquet, Arrow, JSON, audio, image, video; ~20k stars.
- tokenizers — Rust-backed fast tokenizers (BPE, WordPiece, Unigram, SentencePiece).
- accelerate — distributed-launch wrapper; abstracts over FSDP / DeepSpeed / Megatron; the default transformers.Trainer runs through accelerate under the hood.
- peft — parameter-efficient fine-tuning (LoRA, QLoRA, AdaLoRA, IA³, prefix tuning, P-tuning v2); commodity for any fine-tuning workflow.
- trl — RLHF / DPO / KTO / ORPO / GRPO trainers; expanded 2024–26 to cover the alignment-method explosion.
- diffusers — image / video / audio diffusion models; pipeline abstraction over SD, SDXL, FLUX, Stable Video, Stable Audio.
- optimum — hardware-vendor integrations (Intel Neural Compressor, ONNX Runtime, TensorRT, Habana Gaudi, AWS Neuron).
- text-generation-inference (TGI) — production LLM serving (covered in model-serving-infrastructure).
- safetensors — format for tensor storage that replaced pickle; safe, fast mmap.
Use cases
PyTorch is the default for: academic research, computer vision, classical NLP, speech, generative-model training (the Stable Diffusion family and most open LLMs, including Llama 3, Mistral, Qwen, DeepSeek, and Phi, were trained primarily in PyTorch), reinforcement learning (stable-baselines3; RLlib has a PyTorch backend), graph neural networks (PyG; DGL has PyTorch support), and most fine-tuning workflows.
Weaknesses
- TPU support is via XLA, second-class compared to JAX.
- Compile times for torch.compile can be long on first run (cached after).
- Distributed training has improved a lot but still has more sharp edges than JAX’s pjit / shard_map on large-scale TPU pods.
JAX — Google Research, 2018–present
License: Apache-2.0. GitHub: jax-ml/jax (formerly google/jax), ~32k stars. Latest stable: 0.5.x (mid-2026, still 0.x but stable in practice). Governance: Google with significant DeepMind contribution; spun out a quasi-independent dev cadence after the Brain–DeepMind merger in 2023.
JAX is a numerical-computing library that composes four function transformations over
NumPy-shaped code: jit (XLA compilation), grad (reverse-mode autodiff), vmap
(automatic vectorization), and pmap / pjit / shard_map (parallelism over devices).
The result is that you write pure NumPy-style functions and JAX transforms them into
optimized, parallel, differentiable XLA programs. Its tagline — “composable function
transformations” — is the entire design philosophy.
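To see what "composable" means without depending on JAX itself, here is a rough plain-Python sketch. Both helpers are hypothetical stand-ins: `grad` uses central finite differences rather than reverse-mode autodiff, and `vmap` is an ordinary loop rather than true vectorization. The only point is that each transformation takes a function and returns a function, so they stack in any order.

```python
from typing import Callable, List


def grad(f: Callable[[float], float], eps: float = 1e-4) -> Callable[[float], float]:
    """Stand-in for jax.grad: central finite differences, NOT real autodiff."""
    def df(x: float) -> float:
        return (f(x + eps) - f(x - eps)) / (2 * eps)
    return df


def vmap(f: Callable[[float], float]) -> Callable[[List[float]], List[float]]:
    """Stand-in for jax.vmap: applies f to each batch element with a plain loop."""
    def batched(xs: List[float]) -> List[float]:
        return [f(x) for x in xs]
    return batched


def cube(x: float) -> float:
    return x ** 3


# Each transformation maps a function to a function, so they nest freely.
batched_slope = vmap(grad(cube))              # per-example first derivative, 3x^2
batched_curvature = vmap(grad(grad(cube)))    # per-example second derivative, 6x

slopes = batched_slope([1.0, 2.0, 3.0])
curvatures = batched_curvature([1.0, 2.0, 3.0])
```

In real JAX code the equivalent composition would be written `jax.vmap(jax.grad(f))`, with exact derivatives and a compiled, vectorized result.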
Why JAX matters
JAX is the only widely-used framework where TPUs are a first-class target rather than an afterthought. The XLA backend natively targets TPUs (v4, v5e, v5p, Trillium / v6e in 2024–26), GPUs (via CUDA), and CPUs. For Google-scale TPU pod training, JAX is the default substrate. It’s also the framework of choice at DeepMind (AlphaFold 2, AlphaFold 3 (Isomorphic, 2024), Gemini training, AlphaProof / AlphaGeometry, MuZero family), Anthropic (Claude models — confirmed publicly in posts and engineering job descriptions), parts of Meta FAIR’s foundation-model work, and the high-end of academic foundation-model research.
Flax — neural-network library on top of JAX
Flax (Google, ~6k stars) is the canonical NN library on top of JAX. The “Linen” API was the standard from 2020–24; “NNX” (Flax 0.8+, 2024) is the new state-as-pytree API that moves away from the explicit-state-passing style toward something closer to PyTorch modules while preserving JAX’s functional purity. NNX is the recommended API as of 2025.
Equinox — minimalist neural-network library
Equinox (Patrick Kidger, ~3k stars) takes a different approach: models are PyTrees of parameters and modules, fully compatible with JAX transformations. It’s lighter-weight than Flax and very popular in scientific-ML circles. Equinox is the framework of choice for neural ODEs (via diffrax), implicit models, and physics-informed NN work.
Optax — gradient processing & optimizers
Optax (DeepMind, ~2k stars) is the optimizer library: Adam, AdamW, Lion, AdEMAMix, Sophia, Shampoo, etc., composed via chain-of-transforms (gradient clipping, weight decay, EMA, etc.). Standard across the Flax + Equinox ecosystems.
Other JAX libraries
- diffrax — differentiable ODE / SDE / CDE solvers; used in scientific ML.
- NumPyro — probabilistic programming on JAX; competitor to Pyro (PyTorch) and Stan.
- Pallas — JAX’s kernel-language DSL (analogous to Triton); compiles to GPU and TPU kernels. Released 2024, matured through 2025–26.
- Penzai — DeepMind’s structured-pytree library for interpretability work (2024).
- MaxText — Google’s JAX-native LLM training reference codebase; benchmarks high on TPU and on H100.
- EasyLM / Levanter — JAX foundation-model training recipes from Young Geng (UC Berkeley) and Stanford CRFM, respectively.
JAX use cases
Best fit: TPU training, scientific ML, anything with custom AD requirements (higher-order gradients, Jacobians, Hessians, implicit differentiation), simulation + ML coupling, very large-scale foundation-model training where the team has TPU access.
Worst fit: rapid prototyping with messy imperative state, ecosystems with no JAX port (audio domain libraries, some CV models), production serving (JAX has weaker serving story than ONNX / TFServing / TensorRT-LLM / vLLM).
TensorFlow — Google, 2015–present
License: Apache-2.0. GitHub: tensorflow/tensorflow, ~186k stars (massive historical accumulation). Latest stable: 2.18 (early 2026). Governance: Google with TensorFlow community process.
TensorFlow’s research relevance has declined sharply since 2020 but its production footprint is enormous and stable. Whole industries (e.g., recommendations at scale, ad-ranking pipelines, MLOps platforms built on Vertex AI / SageMaker / Azure ML) have years of investment in TF graphs, TFX pipelines, and TF Serving deployments. Those workloads aren’t migrating.
Where TensorFlow remains dominant
- Production ML platforms — Vertex AI Pipelines, Kubeflow Pipelines, TFX, TF Serving; the maturity here is unmatched.
- Edge / mobile — TensorFlow Lite was rebranded to LiteRT in 2024 and repositioned as a cross-framework on-device runtime: models authored in TensorFlow, Keras, JAX, or PyTorch are converted to the .tflite format before deployment (PyTorch via Google’s AI Edge Torch converter; ExecuTorch is PyTorch’s own, separate edge runtime). Android’s ML Kit, Pixel’s on-device features, and most non-Apple mobile ML run through TFLite/LiteRT.
- MediaPipe — Google’s cross-platform CV/audio pipeline library; runs TF and LiteRT models on web, mobile, desktop, embedded.
- TensorFlow.js (TF.js) — in-browser ML; the historical leader, though ONNX Runtime Web and transformers.js have eroded some of its share.
- Recommendations — TFRS (TensorFlow Recommenders), TPU embedding lookup, sparse feature columns; Google and many ad-tech firms rely heavily on this.
What TensorFlow gave up
Research has overwhelmingly moved to PyTorch (and to JAX for foundation models). Most new transformer architectures, generative models, and RL methods publish PyTorch reference implementations only, with JAX as a secondary port and TF nowhere. Within Google itself, JAX has effectively replaced TF for new model research.
TensorFlow Keras vs Keras 3
TF includes Keras as tf.keras. Confusingly, the standalone Keras project relaunched in
2023 as Keras 3 with a multi-backend architecture (covered next). The two are
distinct — tf.keras is the TF-specific tightly-integrated API; Keras 3 is the new
standalone project that runs on top of TF / JAX / PyTorch.
Keras 3 — multi-backend Keras, 2023+
License: Apache-2.0. GitHub: keras-team/keras, ~62k stars. Lead: François Chollet (until early 2025; he left Google to start Ndea and focus on ARC-AGI, but still contributes). Latest stable: 3.7 (2026).
Keras 3 is a rewrite of Keras as a backend-agnostic API. The same model code runs
on TensorFlow, JAX, or PyTorch by setting os.environ["KERAS_BACKEND"]. This resurrects
the “write once, run anywhere” promise that the original Keras (2015) had before being
absorbed into TF.
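A minimal sketch of the backend switch described above. The helper `select_keras_backend` is a hypothetical name introduced for illustration; the one detail worth stressing is ordering, since Keras reads `KERAS_BACKEND` when it is first imported, so the variable has to be set beforehand. Note that the PyTorch backend is spelled `torch`.

```python
import os

# Backend names accepted by the KERAS_BACKEND variable for the three
# frameworks discussed here (Keras spells the PyTorch backend "torch").
SUPPORTED_BACKENDS = ("tensorflow", "jax", "torch")


def select_keras_backend(name: str) -> str:
    """Hypothetical helper: validate the choice and export it to the environment.

    Call this before the first `import keras`; the variable is read at import time.
    """
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}")
    os.environ["KERAS_BACKEND"] = name
    return name


chosen = select_keras_backend("jax")

# Only after the variable is set would the model code import Keras, e.g.:
#   import keras
#   model = keras.Sequential([keras.layers.Dense(10)])
```

The same model definition can then be re-run under another backend by changing only the string passed to the helper and restarting the process.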
Practical effect: a lot of teaching material, MLOps pipelines, and applied-research shops use Keras 3 as a glue layer. It hasn’t recaptured the research mindshare it had in 2017–19 — PyTorch’s native ergonomics are good enough that the abstraction layer isn’t compelling for cutting-edge work — but it’s a sensible production choice when you want to keep options open across backends.
KerasNLP and KerasCV are companion model-zoo libraries; both retargeted to Keras 3.
MLX — Apple, 2023+
License: MIT. GitHub: ml-explore/mlx, ~17k stars. Latest stable: 0.20+ (2026). Maintainer: Apple Machine Learning Research.
MLX is Apple’s research-oriented array framework designed for Apple Silicon. It exploits
the unified-memory architecture of M1 / M2 / M3 / M4 chips: tensors live in shared CPU/GPU
memory, no host-device copies needed. The API is a hybrid of NumPy semantics and PyTorch
ergonomics, with a mlx.nn module library and a mlx.optimizers package.
Why MLX matters
- Mac developer experience for LLMs. Running Llama 3 70B in 4-bit on a 64GB M3 Max is a real workflow that MLX enables natively, without the CPU-GPU copy overhead that cripples PyTorch’s MPS backend.
- Lazy evaluation + dynamic graph. MLX builds a graph behind the scenes and evaluates on demand, similar to JAX, but with a more Pythonic API.
- mlx-lm — the companion library for LLM training and inference; supports LoRA / QLoRA fine-tuning, generation, quantization (int4 / int8 / fp16).
- mlx-swift — Swift bindings for on-device iOS / macOS deployment.
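The Llama 3 70B figure in the first bullet can be sanity-checked with back-of-envelope arithmetic. This sketch counts weight storage only; it assumes a round 70 billion parameters and ignores quantization scales, the KV cache, activations, and whatever memory the OS reserves, so real headroom is tighter than the numbers suggest.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


llama_70b = 70e9  # parameter count, rounded (assumption for the estimate)

fp16_gb = weight_footprint_gb(llama_70b, 16)  # about 140 GB: far beyond 64 GB
int4_gb = weight_footprint_gb(llama_70b, 4)   # about 35 GB: fits with room to spare

fits_in_64gb_fp16 = fp16_gb < 64
fits_in_64gb_int4 = int4_gb < 64
```

The roughly 4x reduction from 16-bit to 4-bit weights is what moves a 70B model from multi-GPU territory into a single unified-memory laptop.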
Where MLX falls short
- Apple Silicon only — useless if you’re not on a Mac or shipping to Apple devices.
- Smaller ecosystem: no equivalent of Hugging Face transformers, though mlx-lm covers the major LLM architectures.
- Distributed training is nascent; MLX is primarily a single-machine framework.
Use cases
Local LLM dev on Macs, on-device iOS inference (via mlx-swift), education and research-prototyping on Apple Silicon. Several open-source projects (Ollama for some backends, LM Studio’s Mac path, exo-explore) leverage MLX or MLX-style backends.
Mojo — Modular Inc, 2023+
License: source available for parts of Mojo, proprietary core; some libraries on GitHub at modularml/mojo. Founder/CEO: Chris Lattner (LLVM, Clang, Swift, MLIR). Status (2026): still maturing; community edition is open, full Mojo SDK is commercial.
Mojo is a new systems-programming language designed as a Python superset that can also be used to write GPU kernels. The pitch: write your model in Python today, then selectively rewrite hot kernels in Mojo for orders-of-magnitude speedups, all inside the same source file, with full Python interop. Built on MLIR, so it has access to a deep compilation stack and can target CPU / GPU / TPU / NPU.
MAX — the Modular platform
Mojo lives inside Modular’s broader MAX platform: MAX Engine (runtime), MAX Serve (LLM serving competitor to vLLM / TensorRT-LLM), and MAX Graph (model representation). The vision is end-to-end — write models in Mojo, optimize via MAX, serve through MAX Serve, all with hardware-portable lowering.
Reality check, mid-2026
- Mojo has working compilers and benchmarks impressively on hand-written GPU kernels.
- Adoption is real but small — perhaps a few hundred companies using it in production.
- Python interop works but the “drop-in faster Python” promise is more aspirational than delivered: most real codebases require Mojo-specific rewrites for the speedups.
- MAX Serve has shown competitive performance against vLLM on Llama-class models but hasn’t yet displaced the open-source stacks.
- The commercial licensing model is a friction point for adoption versus pure open source competitors.
Use cases
If you have a hot path (e.g., custom attention variant, novel quantization scheme, signal processing) and want to squeeze every cycle out of a specific GPU SKU without dropping into raw CUDA / Triton, Mojo is a real option. For mainstream PyTorch / JAX work, it’s not yet a switch you’d make.
Triton — OpenAI, 2021+
License: MIT. GitHub: triton-lang/triton, ~13k stars. Maintainer: originally Philippe Tillet at OpenAI; now a multi-org project (OpenAI, Meta, NVIDIA, AMD all contribute).
Triton is a GPU-kernel DSL embedded in Python. You write kernels at the block (tile)
level rather than per thread, using @triton.jit decorators and tl.load / tl.store /
tl.dot primitives, and Triton compiles them to PTX (NVIDIA) or AMDGCN (AMD) via its
own MLIR-based pipeline. Triton is roughly as fast as hand-written CUDA for most
workloads while needing ~5–10x less code. It’s the kernel backend for torch.compile’s
Inductor and for many modern fused-kernel libraries, including Triton implementations
of FlashAttention; the reference FlashAttention 2 / 3 and FlashInfer kernels are
themselves written in CUDA.
This index treats Triton as a peer of CUDA, not of PyTorch — it’s a kernel layer, not a model framework. For depth on Triton’s programming model, hardware targets (Hopper / Blackwell tile shapes, AMD CDNA3 / RDNA4, NVIDIA TMA), and patterns, see the dedicated note: cuda-triton-gpu-programming.
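To give a feel for the block-level programming model without requiring a GPU, the sketch below emulates it in plain Python. None of this is Triton API: `add_kernel`, `launch`, and `BLOCK_SIZE` are illustrative names. In a real kernel the corresponding pieces would be `tl.program_id`, `tl.arange`, and masked `tl.load` / `tl.store`.

```python
from typing import List

BLOCK_SIZE = 4  # elements handled by one program instance (illustrative value)


def add_kernel(x: List[float], y: List[float], out: List[float],
               n_elements: int, pid: int) -> None:
    """One emulated program instance: it owns a whole block, not a single element."""
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    mask = [off < n_elements for off in offsets]  # guards the ragged final block
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:  # emulates a masked load, add, and masked store
            out[off] = x[off] + y[off]


def launch(x: List[float], y: List[float]) -> List[float]:
    """Emulated launch over a 1-D grid of ceil(n / BLOCK_SIZE) program instances."""
    n = len(x)
    out = [0.0] * n
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE
    for pid in range(grid):  # on a GPU these instances would run in parallel
        add_kernel(x, y, out, n, pid)
    return out


result = launch([1.0] * 10, [2.0] * 10)
```

With 10 elements and a block size of 4, the grid has three instances and the mask disables the last two lanes of the third, which is the standard pattern for sizes that are not a multiple of the block.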
candle — Hugging Face, 2023+
License: Apache-2.0 / MIT. GitHub: huggingface/candle, ~17k stars. Lead: Laurent Mazaré at HF.
candle is HF’s minimalist ML framework in Rust. It targets the niche of “I want PyTorch ergonomics but as a single dependency-free Rust binary that runs anywhere.” The typical use case is shipping an LLM or embedding model as part of a Rust app — e.g., candle backs HF’s own text-embeddings-inference (TEI) server, several Tauri-based desktop apps, and various edge-deployment projects.
Strengths
- No Python runtime needed; ship a single binary.
- Native CUDA backend, Metal backend, CPU backend, WebGPU (experimental).
- Fast compile times for inference workloads.
- Direct Hugging Face Hub integration (load checkpoints by repo ID).
Weaknesses
- Training support exists but is limited; not a research framework.
- Smaller model coverage than PyTorch + transformers — supports the popular models (Llama, Mistral, Phi, Gemma, BERT, Whisper, Stable Diffusion, etc.) but the long tail isn’t ported.
- Rust learning curve for ML teams that are mostly Python.
Use cases
Embedded ML in Rust apps, single-binary inference servers, edge devices where you can’t ship a Python interpreter, latency-sensitive RAG pipelines (TEI is candle-based).
tinygrad — tiny corp / George Hotz, 2020+
License: MIT. GitHub: tinygrad/tinygrad, ~28k stars. Founder: George Hotz.
tinygrad is a minimalist deep-learning framework whose explicit design constraint is to stay under ~10k lines of code (a moving target, but the discipline is real). The project’s commercial side, tiny corp, sells the “tinybox” — a 6× AMD 7900 XTX / 4× A100 / 8× 4090 prebuilt training box — running tinygrad’s software stack as an NVIDIA-CUDA alternative.
Why it matters
- Pedagogical clarity: tinygrad is the easiest framework to read end-to-end. People use it to understand how PyTorch / JAX work under the hood.
- AMD support: tinygrad targets AMD GPUs (via its own kernel compilation, bypassing ROCm in some configurations) as a first-class backend, not as a port.
- Multi-backend: CUDA, AMD, Apple Metal, WebGPU, CPU; uses a unified Linearizer pass that lowers a generic op graph to per-device kernels.
Reality check
tinygrad is impressive engineering and very useful for education and AMD experimentation. It’s not yet a viable choice for serious production training — too small an ecosystem and too many missing optimizations on the long tail of ops. But its mere existence is real pressure on the CUDA monoculture.
DeepSpeed — Microsoft Research, 2020+
License: Apache-2.0. GitHub: microsoft/DeepSpeed, ~36k stars. Status (2026): in maintenance mode after the Megatron-DeepSpeed merge effort with NVIDIA; less new development than 2020–22 but still widely used.
DeepSpeed is the distributed-training and inference optimization library that introduced ZeRO (Zero Redundancy Optimizer) — the technique that lets you train models larger than a single GPU’s memory by sharding optimizer states, gradients, and parameters across the data-parallel group. ZeRO-1, ZeRO-2, ZeRO-3, ZeRO-Infinity (with CPU/NVMe offload), and ZeRO++ (with quantized communication) are the milestones.
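The effect of each ZeRO stage can be estimated with the usual mixed-precision Adam accounting: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The sketch below is an approximation under those assumptions; it covers model state only and ignores activations, buffers, and fragmentation. The 7.5B-parameter, 64-GPU configuration is just an illustrative input.

```python
def zero_model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in decimal GB under mixed-precision Adam.

    stage 0 is plain data parallelism; stages 1-3 shard progressively more state.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus   # ZeRO-1 shards the optimizer states
    if stage >= 2:
        grads /= n_gpus   # ZeRO-2 also shards the gradients
    if stage >= 3:
        params /= n_gpus  # ZeRO-3 also shards the parameters themselves
    return n_params * (params + grads + optim) / 1e9


# Illustrative configuration: 7.5B parameters across 64 data-parallel GPUs.
per_gpu = {stage: zero_model_state_gb(7.5e9, 64, stage) for stage in range(4)}
```

Under these assumptions the per-GPU model state drops from about 120 GB with no sharding to under 2 GB at stage 3, which is why full sharding makes otherwise impossible model sizes trainable.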
DeepSpeed also includes:
- DeepSpeed-Inference — tensor-parallel inference with kernel fusion.
- DeepSpeed-MoE — mixture-of-experts training (used for GLM, Phi-MoE, etc.).
- DeepSpeed Chat — RLHF training pipeline.
- DeepSpeed-Domino — tensor-parallel training that overlaps communication with computation.
In 2024–26 the field has largely converged on FSDP / FSDP2 (PyTorch native), with
DeepSpeed retained for ZeRO-Infinity-style offload and MoE training. The two
approaches are functionally similar; FSDP2’s tighter integration with torch.compile is
the main reason for the migration.
Megatron-LM — NVIDIA, 2019+
License: BSD-style (Megatron-LM source). GitHub: NVIDIA/Megatron-LM, ~11k stars. Status: actively developed; ships as NeMo Megatron inside NVIDIA NeMo.
Megatron-LM is NVIDIA’s reference codebase for large-scale transformer training. It pioneered tensor parallelism (splitting a single matmul across GPUs) and pipeline parallelism with interleaved schedules, on top of which DeepSpeed and FSDP layer their data parallelism. The Llama 3 paper credits Megatron-style 3D parallelism; most >100B-parameter models trained on NVIDIA hardware use some derivative.
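The idea of splitting a single matmul across GPUs can be illustrated on the CPU. This is a toy column-parallel sketch with plain lists standing in for device tensors, not Megatron-LM code: each simulated device holds a slice of the weight matrix's columns, computes its partial output independently, and the slices are concatenated (the role an all-gather plays on real hardware). All function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix multiply, standing in for a GPU GEMM."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_columns(w: Matrix, n_shards: int) -> List[Matrix]:
    """Give each simulated device an equal slice of W's columns."""
    cols = len(w[0])
    if cols % n_shards != 0:
        raise ValueError("column count must divide evenly across shards")
    step = cols // n_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(n_shards)]


def column_parallel_matmul(x: Matrix, w: Matrix, n_shards: int) -> Matrix:
    """Each shard computes X @ W_shard; row-wise concatenation rebuilds X @ W."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
sharded = column_parallel_matmul(x, w, n_shards=2)
```

Because every output column depends on only one column of the weight matrix, the shards need no communication until the final concatenation, which is what makes this split attractive inside a transformer layer.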
NeMo (NVIDIA’s higher-level training framework) wraps Megatron-LM with recipes for LLM, ASR, TTS, and multimodal training. NeMo is the recommended path on DGX clusters and H100 / B200 / GB200 systems.
MaxText — Google, 2023+
License: Apache-2.0. GitHub: google/maxtext, ~2k stars.
MaxText is Google’s JAX-native reference codebase for LLM training on TPUs (and on
GPUs via XLA:GPU). It’s the cleanest public example of how to train a >100B-parameter
model on TPU pods. MaxText demonstrates JAX’s pjit / shard_map for partitioning,
FlashAttention-equivalent attention via Pallas, optimizer sharding via Optax, and
checkpointing via Orbax. It’s the canonical reference if you’re starting a JAX-on-TPU
training stack.
Jittor, MindSpore, PaddlePaddle — China-origin frameworks
Three significant frameworks come from China and have substantial domestic adoption:
Jittor — Tsinghua University, 2020+
License: Apache-2.0. GitHub: Jittor/jittor, ~3k stars. Lead: Tsinghua Graphics & Geometric Computing Group.
Jittor uses meta-operator fusion and dynamic graph design. It has reasonable adoption in Chinese academic CV / graphics research and a notable community in Chinese universities, but limited Western traction. The project is alive but not in fast growth.
MindSpore — Huawei, 2020+
License: Apache-2.0. GitHub: mindspore-ai/mindspore, ~4k stars (mirror). Primary development on Gitee. Status: production-grade, deeply tied to Huawei’s Ascend AI accelerators (910B / 910C in 2024–26).
MindSpore is Huawei’s strategic answer to the US-export-controlled hardware landscape: domestic Ascend accelerators with a domestic framework. Inside China it’s substantial — the open-source PanGu, ChatGLM, and many DeepSeek-related training runs have MindSpore ports — and Huawei has invested heavily in the developer experience. Outside China, adoption is small. The MindSpore ecosystem is the most credible domestic alternative to PyTorch + CUDA in 2026.
PaddlePaddle — Baidu, 2016+
License: Apache-2.0. GitHub: PaddlePaddle/Paddle, ~22k stars.
PaddlePaddle (PArallel Distributed Deep LEarning) is older than PyTorch and has been Baidu’s strategic ML framework. ERNIE (Baidu’s foundation-model family) and many domestic NLP / OCR / CV systems are trained on Paddle. PaddleOCR is widely used globally (one of the most accurate open-source OCR systems). PaddleNLP, PaddleSpeech, PaddleDetection are mature model libraries. Western adoption is limited but PaddleOCR specifically punches above its weight.
ONNX & ONNX Runtime — cross-framework portability
ONNX: Open Neural Network Exchange format. Apache-2.0. ~17k stars. ONNX Runtime: Microsoft, MIT. ~14k stars.
ONNX is a standardized model representation (operators + tensors + graph). Most
frameworks can export to ONNX and most runtimes can consume it: PyTorch via
torch.onnx.export() (whose TorchDynamo-based path, dynamo=True, supersedes the
earlier torch.onnx.dynamo_export() entry point), TF via tf2onnx, JAX via
jax2tf → tf2onnx, etc. ONNX Runtime is the high-performance executor, used for
production inference of CV / classical NLP / speech models on CPU, GPU (CUDA /
DirectML / ROCm), and edge devices.
For LLMs specifically, the situation is more nuanced: ONNX Runtime is competitive with dedicated LLM serving stacks (vLLM / TGI / TensorRT-LLM) only on smaller models. For frontier LLMs, the dedicated stacks remain ahead, but ONNX Runtime is the dominant choice for classical ML and for embedding cross-platform models in apps (e.g., Windows ML, browser inference via ONNX Runtime Web).
Experiment tracking & MLOps
A framework choice goes hand in hand with an experiment-management choice:
- Weights & Biases (wandb) — closed-source SaaS + self-hosted; the de facto standard for experiment tracking in 2024–26, especially in research and frontier-lab settings.
- MLflow (Linux Foundation, Databricks-led) — open-source experiment tracking, model registry, model serving; dominant in enterprise contexts.
- Kubeflow — open-source MLOps platform on Kubernetes; declining momentum vs Flyte / Metaflow / Argo Workflows but still widely deployed.
- Metaflow (Netflix / Outerbounds) — Pythonic ML pipeline framework; popular with applied ML teams.
- DVC — Git-style data versioning; complements wandb / MLflow.
- ClearML — open-source MLOps platform; smaller but loyal user base.
- Comet — closed-source experiment tracking; niche relative to wandb.
- Neptune.ai — similar; niche.
Model formats — pickle → SafeTensors → GGUF → CoreML → TFLite/LiteRT
Model serialization formats are a parallel axis to the framework choice:
- Pickle (.bin, .pt) — historical default; unsafe (arbitrary code execution on load). Being deprecated in HF Hub uploads.
- SafeTensors (.safetensors) — HF’s format; safe (no code execution), fast (mmap), framework-agnostic. The standard for HF Hub since 2023.
- GGUF — llama.cpp’s format; designed for CPU inference with quantization metadata baked in. The standard for local LLM inference (Ollama, LM Studio, koboldcpp).
- CoreML (.mlpackage) — Apple’s format; required for App Store distribution of on-device models. Conversion via coremltools.
- TFLite / LiteRT (.tflite) — Google’s mobile/edge format. Cross-framework as of the 2024 rebrand.
- ONNX (.onnx) — covered above.
- TensorRT engines (.engine, .plan) — NVIDIA’s compiled-for-specific-GPU format; not portable across GPU SKUs but maximum performance on the target hardware.
- OpenVINO IR (.xml + .bin) — Intel’s IR format for CPU / iGPU / NPU.
The pattern in 2026: train in PyTorch (SafeTensors), then export to ONNX for cross-vendor serving, to GGUF for local CPU inference, to CoreML for Apple devices, or to TFLite/LiteRT for Android. The “trained model” and the “deployed model” are typically in different formats.
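To make the "safe, no code execution" property of SafeTensors concrete, here is a simplified stdlib-only sketch of its on-disk layout: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. It omits the optional metadata entry and all validation, and the function names are illustrative; real files should be read and written with the safetensors library.

```python
import json
import struct
from typing import Dict, List, Tuple

TensorSpec = Tuple[str, List[int], bytes]  # (dtype tag, shape, raw bytes)


def write_minimal_safetensors(tensors: Dict[str, TensorSpec]) -> bytes:
    """Serialize tensors in the SafeTensors layout (simplified sketch)."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        start = len(payload)
        payload += raw
        # Offsets are relative to the start of the data section.
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [start, len(payload)]}
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header length, then JSON header, then tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob: bytes) -> dict:
    """Parse only the JSON header: plain data, nothing executable is loaded."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # four float32 values, 16 bytes
blob = write_minimal_safetensors({"weight": ("F32", [2, 2], raw)})
header = read_header(blob)
```

Loading amounts to parsing JSON and slicing bytes, which is the contrast with pickle, where deserialization can invoke arbitrary callables.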
Comparison matrix — high level
| Framework | Ergonomics | Perf | HW coverage | Distributed | Ecosystem | Production | Research adoption (2026) |
|---|---|---|---|---|---|---|---|
| PyTorch | A | A | NV / AMD / Apple MPS | FSDP2, DeepSpeed | AAA | A | ~70%+ |
| JAX | B+ (steep) | A+ | TPU / NV / AMD | pjit, shard_map | A | B | ~20% (concentrated) |
| TensorFlow 2 | B | A | NV / TPU / Edge | tf.distribute | AA | A | <5% |
| Keras 3 | A | A (backend-dep) | inherit | inherit | B+ | A | <2% |
| MLX | A | A (Apple Silicon) | Apple only | nascent | C+ (growing) | B+ (Apple) | niche |
| Mojo / MAX | B (new) | A (kernels) | NV / AMD / multi | nascent | C | B | <1% |
| candle (Rust) | B (Rust) | A | NV / Metal / CPU | limited | B | A (inference) | <1% |
| tinygrad | A (small) | B+ | NV / AMD / Metal / CPU | limited | C | B | educational |
| Jittor | B | B+ | NV / Ascend | yes | C+ | B | China niche |
| MindSpore | B+ | A (Ascend) | NV / Ascend | yes | B (China) | A (China) | China substantial |
| PaddlePaddle | B+ | A | NV / Kunlun / Ascend | yes | B+ (China) | A (China) | China substantial |
Grades are rough industry impressions, not benchmarked metrics.
Selection heuristics
The decision tree most teams follow as of 2026:
- Doing research (academic or frontier-lab) that needs broad community code? → PyTorch with HF transformers. Default for nearly all CV, NLP, speech, RL, GNN work.
- Doing very-large-scale foundation-model training with TPU access? → JAX + Flax NNX + Optax + MaxText / Levanter, on TPU v5p / v6e pods.
- Doing very-large-scale foundation-model training on NVIDIA GPUs? → PyTorch + FSDP2 (+ DeepSpeed for offload) + torch.compile, or NeMo Megatron-LM for the NVIDIA-blessed path.
- Targeting Apple Silicon for local dev / iOS deployment? → MLX for development; export to CoreML for shipping.
- Targeting mobile / embedded broadly? → CoreML (Apple), TFLite/LiteRT (Android + cross-platform), or candle / executorch for edge Linux.
- Targeting web browser? → transformers.js (ONNX Runtime Web under the hood), TF.js for legacy, MediaPipe for audio / video pipelines.
- Serving LLMs at scale? → vLLM (open-source, dominant), TGI (HF), TensorRT-LLM (NVIDIA, max perf), SGLang (newer, structured generation), llama.cpp (CPU + small GPU), MLX (Apple), MAX Serve (Modular). See model-serving-infrastructure for depth.
- Embedding ML in a Rust app / single-binary distribution? → candle or burn (pure Rust).
- Scientific ML, neural ODEs, physics-informed NNs, probabilistic programming? → JAX + Equinox + diffrax + NumPyro. PyTorch alternatives exist (torchode, pyro) but JAX is the center of gravity here.
- Building a teaching example or wanting to read end-to-end framework code? → tinygrad.
- Inside the Chinese hardware ecosystem (Ascend / Kunlun)? → MindSpore (Ascend), PaddlePaddle (Kunlun + Ascend).
- Production ML platform with mature MLOps tooling, recommendations, or established TF investment? → Stay on TF. Migration cost rarely pays off.
- Want a portable model that can ship to any hardware? → Train in PyTorch / JAX, export to ONNX, serve via ONNX Runtime + the vendor-specific execution provider.
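The heuristics above can be read as an ordered rule list. The following is a minimal sketch of that reading, not a definitive selector: the `Project` fields, the goal and hardware labels, and the priority order of the rules are all assumptions made for illustration, and only a subset of the bullets is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    """Hypothetical project description; the field names are illustrative."""
    goal: str                  # "research", "foundation-training", "llm-serving", ...
    hardware: str = "nvidia"   # "nvidia", "tpu", "apple", "ascend", "kunlun"
    has_tf_investment: bool = False


def suggest_stack(p: Project) -> str:
    """Return a starting-point recommendation; the first matching rule wins."""
    # Assumed priority: sunk TF investment and vendor-locked hardware come first.
    if p.has_tf_investment:
        return "Stay on TensorFlow"
    if p.hardware == "ascend":
        return "MindSpore or PaddlePaddle"
    if p.hardware == "kunlun":
        return "PaddlePaddle"
    if p.goal == "foundation-training":
        if p.hardware == "tpu":
            return "JAX + Flax NNX + Optax"
        if p.hardware == "nvidia":
            return "PyTorch + FSDP2 + torch.compile"
    if p.goal == "llm-serving":
        return "MLX" if p.hardware == "apple" else "vLLM"
    if p.goal == "scientific-ml":
        return "JAX + Equinox + diffrax + NumPyro"
    if p.goal == "rust-embedding":
        return "candle or burn"
    if p.hardware == "apple":
        return "MLX, exporting to CoreML for shipping"
    # Default branch: broad-community research and applied work.
    return "PyTorch + Hugging Face transformers"


print(suggest_stack(Project(goal="foundation-training", hardware="tpu")))
print(suggest_stack(Project(goal="research")))
```

Writing the rules down this way mostly exposes the ordering question: a team with both TPU access and a large TF estate has to decide which rule fires first, and that choice is organizational rather than technical.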
Adoption trends, 2020 → 2026
| Year | PyTorch | JAX | TF (research) | TF (production) | Other |
|---|---|---|---|---|---|
| 2020 | ~50% | ~3% | ~30% | dominant | ~17% |
| 2022 | ~65% | ~8% | ~12% | strong | ~15% |
| 2024 | ~70% | ~15% | ~5% | stable | ~10% |
| 2026 | ~72% | ~18% | ~3% | stable, niche-y | ~7% |
(Percentages are rough Papers with Code / Hugging Face Hub / GitHub-star–trend extrapolations; treat as direction-of-travel, not measurement.)
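As a quick sanity check on the table, the sketch below copies the four research-share columns (PyTorch, JAX, TF research, Other) and confirms that each row sums to 100, then computes the 2020 → 2026 change in percentage points. The dictionary layout and helper names are illustrative; the numbers are the table's own rough estimates.

```python
# Research-share estimates copied from the table above (percent).
SHARES = {
    2020: {"PyTorch": 50, "JAX": 3, "TF": 30, "Other": 17},
    2022: {"PyTorch": 65, "JAX": 8, "TF": 12, "Other": 15},
    2024: {"PyTorch": 70, "JAX": 15, "TF": 5, "Other": 10},
    2026: {"PyTorch": 72, "JAX": 18, "TF": 3, "Other": 7},
}


def row_total(year: int) -> int:
    """Sum of all research-share columns for one year."""
    return sum(SHARES[year].values())


def delta(framework: str, start: int = 2020, end: int = 2026) -> int:
    """Change in share between two years, in percentage points."""
    return SHARES[end][framework] - SHARES[start][framework]


for year in SHARES:
    print(year, "total:", row_total(year))
for framework in SHARES[2020]:
    print(f"{framework}: {delta(framework):+d} pp")
```

Because the rows are shares of one total, the deltas sum to zero: the points PyTorch and JAX gained are the points TF (research) and Other lost.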
PyTorch: dominant and still slowly growing share among research; consolidated as the single framework most ML engineers learn first.
JAX: growing in the foundation-model and scientific-ML niches; concentration is extreme (a few labs do a lot of the work) but the work is influential. Pallas (2024+) is closing the kernel-DSL gap that previously sent JAX users to Triton-via-PyTorch for custom kernels.
TensorFlow: research share collapsed; production share stable because of high switching costs and continued investment in LiteRT / MediaPipe / Vertex AI. Google’s own new model research is JAX-first internally.
Keras 3: re-launched well, useful as a portable applied layer, but not a force in research. Niche but real production traction.
MLX: rapid growth in the Apple-Silicon developer community; foundational to the “run LLMs on a Mac” subculture. Won’t go mainstream without leaving Apple-only territory.
Mojo: real but slower-than-hyped adoption. The kernel story works; the “drop-in faster Python” story is harder. Watch MAX Serve for the production-LLM pitch.
candle: steady growth as the Rust ML default; especially for inference.
tinygrad: niche but loud; tinybox sales appear to be real but small. The codebase remains the best teaching resource.
MindSpore / PaddlePaddle / Jittor: largely China-internal; geopolitically important but limited Western adoption.
What gets unbundled — the modern stack
The dominant pattern in 2026 is layered:
[Application]
└── Hugging Face transformers / diffusers / TRL / PEFT    <- API + recipes
    └── PyTorch (or JAX or TF or MLX or candle)           <- framework
        └── torch.compile / Triton / Pallas / XLA         <- kernel compilation
            └── CUDA / ROCm / Metal / TPU / NPU           <- hardware
Choosing a framework is really choosing the bottom three layers as a bundle. The top layer (HF) is shared. The kernel layer is mostly invisible until you need to write a custom op, at which point Triton (PyTorch) or Pallas (JAX) is where you’ll be.
Custom kernels — when to write them
Modern training and inference frequently use hand-tuned kernels for attention, normalization, and quantized matmul. Common fused-kernel libraries:
- FlashAttention 2 / 3 (Tri Dao et al.): the canonical attention kernel; FA3 (2024) added Hopper-specific TMA, warp specialization, and FP8 low-precision support.
- FlashInfer: attention + paged-KV-cache kernels for serving; used by SGLang and others. Originally a CMU / University of Washington project.
- xFormers (Meta) — older fused-attention library; partially superseded by FA.
- CUTLASS / CuTe (NVIDIA) — template library for writing CUDA matmuls.
- Liger Kernel (LinkedIn): fused Triton kernels for LLM training; the project reports roughly 20% higher training throughput and about 60% lower memory use against Hugging Face baseline implementations.
- TransformerEngine (NVIDIA) — FP8 + FP4 training primitives for Hopper / Blackwell.
These are mostly framework-agnostic in spirit, but the bindings are typically PyTorch-first.
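To give a feel for what a fused attention kernel buys, here is a rough pure-Python sketch of the online-softmax idea that FlashAttention-style kernels rely on: keys and values are consumed one tile at a time, and a running maximum and normalizer are rescaled on the fly, so the full score vector is never materialized. This is not the FlashAttention implementation (which operates on GPU tiles held in on-chip memory); the function names, the single-query setup, and the tile size are assumptions for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, softmax, then take the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """One pass over key/value tiles with a running max, normalizer, and output."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

The two functions agree up to floating-point rounding. The tiled version only ever holds one tile of scores, which is the property that lets a real kernel keep its working set in fast on-chip memory.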
Hardware coverage — 2024–26 reality
| Hardware | PyTorch native | JAX native | TF native | MLX | Notes |
|---|---|---|---|---|---|
| NVIDIA H100 / B200 | first-class | first-class via XLA | first-class | - | dominant target |
| NVIDIA RTX 40/50 series | first-class | works | works | - | consumer / workstation |
| AMD MI300X / MI325X | improving (ROCm) | via XLA:AMD or PTpath | limited | - | improving 2024–26 |
| Google TPU v5p / v6e | via PyTorch/XLA | first-class | first-class | - | JAX is the default |
| Apple M-series | MPS (slower than MLX) | limited | limited | first-class | MLX is the answer |
| Intel Gaudi 3 | via Optimum-Habana | nascent | limited | - | niche, but real |
| Huawei Ascend 910B/C | limited | - | - | - | MindSpore-first |
| AWS Trainium / Inferentia | via Neuron-Torch | limited | limited | - | growing |
| Groq LPU | inference only | inference only | inference only | - | inference-only |
| Cerebras CS-3 | via wafer-scale SDK | limited | - | - | niche, training |
Software-defined accelerator coverage is one of the strongest cases for ONNX (export to any of these via vendor-specific execution providers) and for JAX/XLA (one IR, multiple targets).
Common pitfalls when picking a framework
- Picking JAX because it’s “more elegant” without TPU access. JAX on a single 8×H100 box is fine but you’ll be fighting the ecosystem; PyTorch is the better choice unless you have a real reason for functional purity (e.g., higher-order gradients).
- Picking TF for new research. The model zoo and community have moved; you’ll be porting reference implementations from PyTorch yourself.
- Picking MLX for non-Apple deployment. MLX is Mac-only; nothing portable comes out of it.
- Picking Mojo expecting drop-in Python speedup. You’ll need to rewrite the hot path in Mojo specifically; the marketing implies less work than reality.
- Hand-writing CUDA when Triton would have sufficed. Triton usually gets close to hand-tuned CUDA performance with far less code; reach for raw CUDA only when you’ve measured a real Triton gap.
- Building production serving on TorchServe for LLMs. vLLM / TGI / TensorRT-LLM are dramatically better for transformer serving.
- Ignoring quantization formats early. Decide at training time whether you’ll serve fp16 / bf16 / int8 / int4 / fp8 / fp4 / mxfp4 — this affects training recipes (QAT vs PTQ) and serving infrastructure choice.
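The quantization point is easiest to see as arithmetic on weight storage. The sketch below estimates the weights-only footprint for a hypothetical 70B-parameter model in each format named above. It assumes plain bit widths for int4 / fp4 (real schemes add per-group scale metadata) and, for mxfp4, 4-bit elements plus one shared 8-bit scale per 32-element block, i.e. 4.25 bits per weight. KV cache and activations are ignored.

```python
# Bits stored per weight. The mxfp4 entry assumes 4-bit elements plus one
# shared 8-bit scale per 32-element block: 4 + 8 / 32 = 4.25 bits.
BITS_PER_WEIGHT = {
    "fp16": 16,
    "bf16": 16,
    "fp8": 8,
    "int8": 8,
    "int4": 4,
    "fp4": 4,
    "mxfp4": 4.25,
}


def weight_gib(num_params: float, fmt: str) -> float:
    """Weights-only footprint in GiB; ignores KV cache and activations."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


PARAMS = 70e9  # hypothetical 70B-parameter model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>6}: {weight_gib(PARAMS, fmt):7.1f} GiB")
```

At 2 bytes per weight, 70e9 parameters come to 140 GB (about 130 GiB), which spans multiple 80 GB accelerators before any KV cache is allocated; the 4-bit formats cut that by roughly 4×. That gap is why the serving format is worth fixing before the training recipe.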
Cross-links
- Tier 3 hub: _index
- Tier 2: transformer-architecture — model-side concepts that all frameworks implement.
- Tier 2: fine-tuning-rlhf — LoRA / QLoRA / PEFT / DPO / RLHF — almost entirely via the HF ecosystem on PyTorch.
- Tier 2: inference-optimization — quantization, KV cache, speculative decoding, paged attention.
- Tier 2: model-serving-infrastructure — vLLM, TGI, TensorRT-LLM, TFServing, ONNX Runtime, MAX Serve, llama.cpp, MLX serve.
- Tier 2: cuda-triton-gpu-programming — kernel-level: Triton, CUDA, Pallas, CUTLASS / CuTe, FlashAttention internals.
Citations & references
- PyTorch 2.0 release notes (Mar 2023): torch.compile, TorchDynamo, Inductor.
- “PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation”, Ansel et al., ASPLOS 2024.
- JAX docs and “Compiling machine learning programs via high-level tracing” — Frostig, Johnson, Leary (Google, 2018).
- Flax NNX release notes (2024).
- Equinox: “Equinox: neural networks in JAX via callable PyTrees and filtered transformations” — Kidger, Garcia (2021).
- Keras 3 launch announcement, keras.io (Nov 2023).
- MLX paper / docs — Apple ML Research (Dec 2023).
- Mojo / MAX documentation — Modular Inc.
- “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations” — Tillet et al., MAPL 2019.
- FlashAttention 1/2/3 papers — Tri Dao et al. (2022, 2023, 2024).
- ZeRO: “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” — Rajbhandari et al. (2020). ZeRO-Infinity (2021). ZeRO++ (2023).
- Megatron-LM paper — Shoeybi et al. (NVIDIA, 2019). Megatron-LM v3 (2022).
- LiteRT rebrand — Google announcement (Sep 2024).
- Hugging Face SafeTensors spec.
- Papers with Code framework trend data (PyTorch vs TF vs JAX share over time).
Maintenance note
This index reflects the 2026-05-17 landscape. Frameworks evolve fast: re-validate specific version numbers, star counts, and release cadences if you’re making a real adoption decision. Adoption-share figures are direction-of-travel estimates, not audited metrics.