GenAI / LLM-Runtime / Model-Serving Config DSLs Family Index
type: language-family-index family: genai-llm-runtime languages_catalogued: 30 tags: [language-reference, family-index, genai-llm-runtime, huggingface, gguf, onnx, vllm, safetensors, peft, structured-outputs]
GenAI / LLM-Runtime / Model-Serving — Family Index
Family overview
This family is the model-as-artifact + serving-config layer for transformer-era ML. It is distinct from the prompt-DSL world (see ai-prompt-languages) and from the older NLU intent DSLs (see chatbot-intent-dsls). What unifies the entries here is that they describe how a model is packaged, quantized, loaded, served, and adapted at runtime, not how a developer asks the model to produce a result. The defining file is HuggingFace’s config.json — a single JSON document that says what architecture this is (architectures: ["LlamaForCausalLM"]), what its dimensions are (hidden_size, num_attention_heads, intermediate_size, num_hidden_layers, vocab_size), what activations and normalisations it uses, and increasingly what tokenizer and chat template it expects. Around config.json accreted tokenizer.json (the BPE/WordPiece/Unigram tokenizer-as-data spec with full pretokenizer + normalizer pipeline), generation_config.json (decoding defaults), model_index.json (Diffusers pipeline manifest), and the YAML frontmatter of README.md model cards (the discovery metadata Hub uses for pipeline_tag, library_name, tags, license). Transformers 4.51+ (current April 2025+) standardised this set, and almost every open-weight model release on the Hub now follows it.
The weights container layer evolved on a parallel track. The 2018–2022 era used PyTorch’s pickle-based .bin files, which were a security hazard (arbitrary code on load). HuggingFace’s safetensors (2022, stabilised through 2023) replaced them with a JSON-header + raw-tensor binary format — zero-copy, memmap-friendly, and unable to execute code on load. As of 2025 over 42% of Hub models carry safetensors weights. On the local-inference side, GGUF (GGML Unified Format, v3) displaced its predecessor GGML in late 2023 as the llama.cpp ecosystem’s quantized container, embedding both KV-metadata and tensors in a single file. GGUF added an embedded Jinja2 chat template in 2024 (parsed by llama.cpp’s own minja runtime in C++), turning the artifact into a self-describing serving package. ONNX, born in 2017 as the cross-framework IR, still anchors edge/mobile/Intel/AMD deployment (current ONNX 1.22.0, opset 27 as of 2025; opset 21 was the 2024 stable; ONNX Runtime 1.18+ for opset-21 coverage), but its share of frontier-LLM deployment has eroded as PyTorch-native pipelines and bespoke kernels dominate.
The serving zoo is the layer above weights+config. vLLM (0.6+ in 2025, with PagedAttention as its signature contribution) is the dominant OSS GPU server; its config surface is a mix of LLM(...) Python kwargs and vllm serve CLI flags rather than a standalone schema file. NVIDIA TensorRT-LLM uses trtllm-build to ahead-of-time compile a model into a serialized engine plus a JSON config file. NVIDIA Triton Inference Server uses config.pbtxt (protobuf-text) for every served model, with detailed sections for instance_group, dynamic_batching, model_warmup, optimization, and response_cache. TGI (Text Generation Inference, HuggingFace) entered maintenance mode December 2025 but added vLLM and TensorRT-LLM as alternative backends earlier that year. MLX (Apple, current 0.21+ as of 2025; M5 Neural Accelerator support arrived with macOS 26.2) is the Apple-silicon-native runtime that Ollama now uses on Apple hardware. MLC LLM (Apache TVM-based) compiles models for browsers, mobile, and heterogeneous CPUs/GPUs via the mlc-chat-config.json workflow. DeepSpeed anchors the training-config side with ds_config.json (ZeRO stages, optimizer offload, mixed-precision, pipeline-parallel).
Finally, two adjacent worlds: the adapter / fine-tune DSLs built around PEFT’s adapter_config.json (LoRA, QLoRA, DoRA — DoRA decomposes weight updates into magnitude + direction; QLoRA quantizes the base to 4-bit before LoRA on top), plus quantization-format configs (GPTQ, AWQ, EXL2, GGUF Q4_K_M, FP8) that travel as JSON metadata alongside safetensors shards. And the hosted-API session DSLs: OpenAI’s Structured Outputs (August 2024, response_format: {type: "json_schema", json_schema: {...}}, the GPT-4o-anchored evolution of json_mode), Anthropic’s Structured Outputs for tool-use (public beta November 2025, anthropic-beta header structured-outputs-2025-11-13), OpenAI Batch API JSONL, and the Realtime API session.update config. JSON Schema is the lingua franca: closed-API providers consume the same Pydantic/Zod-derived schemas that PEFT’s Python dataclasses serialize to.
In our deep library
None of these formats has a standalone deep-library note — they are sub-DSLs of JSON and YAML on disk, or Python dataclasses in memory. Cross-reference:
- ai-prompt-languages — prompt-DSL sibling (Guidance, LMQL, DSPy, Outlines, Jsonformer); structured-output JSON-Schema sits at the boundary between the two families.
- chatbot-intent-dsls — older NLU intent / dialogue-flow DSLs (Rasa, Dialogflow); pre-transformer lineage.
- oci-cloud-native — Kubernetes CRDs for model-serving (KServe
InferenceService, Seldon, vLLM Production Stack Helm charts) overlap with this family at the deployment-manifest layer. - api-description — JSON Schema, OpenAPI; the underlying spec for Structured Outputs and tool-use contracts.
- codec-and-dsp — TFLite/FlatBuffers and signal-processing model formats overlap with mobile inference here.
- nlp-corpus — training-data interchange (Datasets
dataset_infos.json). - python — universal host language for transformers, PEFT, vLLM, Diffusers; dataclasses serialize directly to these JSON formats.
- build-devops — CI/CD for model release (HuggingFace Hub spaces, OCI image push, S3 weight upload).
- yaml — model-card frontmatter, TorchServe
model-config.yaml, DeepSpeed YAML variants. - protobuf-rpc — Triton’s
config.pbtxtuses protobuf text format; gRPC underlies TGI’s router↔server channel.
Tier 3 family table — Model-artifact / weights formats
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| safetensors | 2022 (Hub adoption); spec stable 2023 | HuggingFace | JSON header + raw tensor binary; zero-copy, memmap, no code-execution-on-load | Dominant; ~42% of Hub models as of March 2025; current spec adds FP8 (E4M3FNUZ/E5M2FNUZ), MUSA device support | https://github.com/huggingface/safetensors |
| GGUF (GGML Unified Format) v3 | Aug 2023 (replacing GGML) | Georgi Gerganov / llama.cpp | Single binary file: magic + version + KV-metadata + tensor blocks; supports 2–8-bit ints, FP32/FP16/BF16, 1.58-bit | Dominant local-inference standard; embeds Jinja2 chat template, tokenizer; Dynamic GGUF 2.0 (2025) for better-accuracy quants | https://github.com/ggml-org/ggml/blob/master/docs/gguf.md |
| GGML (predecessor) | 2022 | Georgi Gerganov | Earlier llama.cpp tensor container; metadata convention varied across versions | Legacy, superseded by GGUF Aug 2023 | https://github.com/ggml-org/ggml |
| ONNX (.onnx) | 2017 | Microsoft + Facebook (now LF AI) | Protobuf IR for cross-framework models; ops version-pinned via opset | Active; ONNX 1.22.0 (2025) ships opset 27; opset 21 was the 2024 stable (ONNX Runtime 1.18+); slowed adoption for frontier LLMs as PyTorch-native dominates | https://onnx.ai/ |
| ONNX-ML | 2017 (with ONNX) | Microsoft + Facebook | Extension for classical ML ops (tree ensembles, linear models, vectorizers, label encoders) | Active; the export target for scikit-learn-onnx | https://onnx.ai/onnx/operators/ |
TensorFlow SavedModel + signature_def | 2017 | Protobuf-based MetaGraphDef + variables; signature_def declares serving inputs/outputs | Active within TF/Keras 3 ecosystem, lower share outside | https://www.tensorflow.org/guide/saved_model | |
| TFLite FlatBuffers (.tflite) | 2017 | FlatBuffer schema for on-device inference; integer quantization metadata | Active; LiteRT branding (2024) but format compatible; on-device Android/iOS workhorse | https://www.tensorflow.org/lite/guide | |
| Core ML (.mlpackage) | 2017; .mlpackage 2021 | Apple | Protobuf model description + weights bundle; iOS/macOS deployment | Active; coremltools 8.x; tight integration with MLX (2025) | https://developer.apple.com/documentation/coreml |
| MLX format | 2023 | Apple ML Research | npz/safetensors-derived format optimised for Apple unified memory | Very active; current 0.21+ (2025); macOS 26.2 added M5 Neural Accelerator support; Ollama on Apple Silicon now uses MLX | https://github.com/ml-explore/mlx |
| MLC LLM model libraries | 2023 | OctoML / mlc-ai / CMU Catalyst | TVM-compiled artifact tree + mlc-chat-config.json; targets WebGPU/Vulkan/Metal/CUDA/iOS/Android | Active; Apache TVM Unity pipeline; only universal-platform OSS LLM runtime | https://llm.mlc.ai/ |
| OpenVINO IR (.xml + .bin) | 2018 | Intel | XML graph + binary weights; Intel CPU/GPU/NPU deployment | Active; 2024/2025 LTS branches; pairs with optimum-intel | https://docs.openvino.ai/ |
Tier 3 family table — Model + tokenizer config (HuggingFace ecosystem)
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
config.json | 2018 (transformers v1+) | HuggingFace | JSON; the canonical model-config schema (architectures, model_type, hidden_size, num_attention_heads, etc.) | Dominant; transformers 4.51+ (April 2025) is current; the de facto model-artifact contract | https://huggingface.co/docs/transformers/main_classes/configuration |
tokenizer.json | 2020 (tokenizers v0.9+) | HuggingFace | JSON specification of normalizer + pretokenizer + model (BPE/WordPiece/Unigram) + post-processor + decoder pipeline | Dominant; tokenizer-as-data, replaces vocab.txt + merges.txt legacy split | https://huggingface.co/docs/tokenizers/ |
generation_config.json | 2022 (transformers 4.25+) | HuggingFace | JSON; default decoding parameters (temperature, top_p, top_k, max_new_tokens, repetition_penalty, stop tokens) | Active; loaded automatically by generate() | https://huggingface.co/docs/transformers/main_classes/text_generation |
model_index.json (Diffusers) | 2022 | HuggingFace Diffusers | JSON manifest enumerating the components of a Diffusion pipeline (UNet, VAE, text encoder, scheduler, safety checker) and their library/class | Active; the Diffusers pipeline equivalent of config.json | https://huggingface.co/docs/diffusers/using-diffusers/loading |
scheduler_config.json (Diffusers) | 2022 | HuggingFace Diffusers | JSON; diffusion scheduler config (DDIM, DPMSolver, Euler, etc., with beta schedule + timesteps) | Active | https://huggingface.co/docs/diffusers/api/schedulers/overview |
| Hub README.md YAML frontmatter (model card) | 2021 | HuggingFace | YAML frontmatter: license, library_name, pipeline_tag, tags, language, datasets, base_model, metrics | Active; powers Hub discovery, search filters, and inference-widget routing | https://huggingface.co/docs/hub/model-cards |
sentence_bert_config.json | 2019 | UKP Lab / Sentence-Transformers | JSON; max_seq_length and do_lower_case for embedding pooling | Active; v3 (2024) introduced unified modules.json + per-module configs | https://www.sbert.net/ |
Datasets dataset_infos.json | 2020 | HuggingFace | JSON; dataset splits + features schema + checksums + citations | Active; companion to datasets library | https://huggingface.co/docs/datasets/ |
| GGUF embedded chat template (Jinja2) | 2024 | llama.cpp / ggml-org | Jinja2 template string stored as a GGUF metadata KV; rendered by llama.cpp’s minja C++ implementation | Active and standard; --jinja flag uses it; tool-call grammars built on top; CVE-2024-34359 in llama-cpp-python’s older Python Jinja path | https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md |
Tier 3 family table — Serving / inference engines
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| vLLM engine config | 2023 (UC Berkeley); 0.6+ current 2025 | UC Berkeley Sky Computing Lab → vLLM Project (PyTorch Foundation, 2025) | Python LLM(...) kwargs + vllm serve CLI flags; PagedAttention is the signature contribution; no standalone schema | Dominant OSS GPU server; 0.6 added GPU NGram spec decoding, KV-cache offloading, elastic expert-parallelism | https://docs.vllm.ai/ |
| TensorRT-LLM build config | 2023 | NVIDIA | trtllm-build CLI + JSON checkpoint config + serialized engine.cfg; AOT compilation per-GPU-architecture | Active; the highest-throughput proprietary-stack option on NVIDIA HW | https://nvidia.github.io/TensorRT-LLM/ |
Triton Inference Server config.pbtxt | 2018 (TensorRT IS); 2019 Triton rebrand | NVIDIA | Protobuf-text ModelConfig: platform, max_batch_size, input[], output[], instance_group, dynamic_batching, model_warmup, response_cache, optimization | Active; the de facto multi-backend serving config (PyTorch, TF, ONNX, TensorRT, vLLM, TRT-LLM all plug in) | https://docs.nvidia.com/deeplearning/triton-inference-server/ |
| TGI (Text Generation Inference) router config | 2022 | HuggingFace | Rust HTTP/gRPC router with CLI/env-var config; continuous batching | Maintenance mode as of Dec 2025; added vLLM + TRT-LLM backends earlier 2025 | https://huggingface.co/docs/text-generation-inference/ |
TorchServe model-config.yaml | 2020 | Meta AI / AWS | YAML config: minWorkers, maxWorkers, batchSize, responseTimeout, handler class | Active but lower share; PyTorch’s official server, less momentum than vLLM | https://pytorch.org/serve/ |
| llama.cpp / llama-cpp-python config | 2023 | Georgi Gerganov / Andrei Betlen | CLI flags + Python kwargs (n_ctx, n_gpu_layers, chat_format, rope_freq_base); params.json for legacy llama-1 conversion | Very active; reference local-inference runtime | https://github.com/abetlen/llama-cpp-python |
| LM Studio model config | 2023 | LM Studio (Element Labs) | Per-model JSON sidecar in app data; GUI surfaces n_ctx, prompt template, GPU offload | Active; closed-source desktop wrapper around llama.cpp + MLX | https://lmstudio.ai/docs |
| Ollama Modelfile | 2023 | Ollama Inc. | Dockerfile-shaped DSL (FROM, PARAMETER, TEMPLATE, SYSTEM, ADAPTER); 2025 MLX backend on Apple Silicon | Very active; the dominant local-LLM bundler for non-developers | https://github.com/ollama/ollama/blob/main/docs/modelfile.md |
DeepSpeed ds_config.json | 2020 | Microsoft | JSON config for ZeRO stages, optimizer offload, mixed precision, pipeline parallelism, activation checkpointing | Active; the training-config heavyweight; integrates with HF Accelerate | https://www.deepspeed.ai/docs/config-json/ |
Tier 3 family table — Adapter / fine-tune / quantization
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
PEFT adapter_config.json | 2023 | HuggingFace PEFT | JSON; LoraConfig serialization (r, lora_alpha, lora_dropout, target_modules, modules_to_save, base_model_name_or_path) | Dominant adapter format; covers LoRA, QLoRA (4-bit base), DoRA (magnitude+direction decomposition), AdaLoRA, IA³, LoHA, LoKR | https://huggingface.co/docs/peft/ |
| AutoGPTQ / GPTQModel config | 2023 / 2024 | PanQiWei → ModelCloud | quantize_config.json + safetensors shards; bits, group_size, desc_act, sym, damp_percent, static_groups | Active; GPTQModel v7+ (2025) adds Huawei Ascend NPU; integrates HF, vLLM, SGLang | https://github.com/ModelCloud/GPTQModel |
| AWQ config | 2023 | MIT HAN Lab | JSON-embedded quantization_config (bits, group_size, zero_point, quant_method: "awq"); activation-aware scaling of salient channels | Active; lower perplexity than GPTQ in most benches; widely shipped on Hub | https://github.com/casper-hansen/AutoAWQ |
| ExLlamaV2 / EXL2 measurements | 2023 | turboderp | Variable bits-per-weight quantization with per-layer sensitivity measurement; safetensors-stored | Active; ExLlamaV3 quantizers (2025) extended to GPTQ/AWQ on consumer GPUs | https://github.com/turboderp-org/exllamav2 |
| GGUF quantization tags (Q4_K_M, Q5_K_S, Q6_K, IQ2_XS, etc.) | 2023 | llama.cpp | Quantization-type enum embedded as GGUF metadata; K-quants, I-quants, importance-matrix-derived | Active; the lingua franca of local-inference filenames | https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/README.md |
Llama-cpp params.json | 2023 | Meta + llama.cpp | JSON sidecar from original Llama-1 release (dim, n_heads, n_layers, norm_eps); used by convert.py to produce GGUF | Legacy; superseded by reading directly from HF config.json | https://github.com/ggml-org/llama.cpp |
Tier 3 family table — Hosted-API session / structured-output
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
OpenAI Structured Outputs (response_format: json_schema) | Aug 2024 (GPT-4o-2024-08-06) | OpenAI | Strict JSON-Schema-constrained decoding; Pydantic/Zod schemas auto-converted via SDK | Dominant pattern for function-calling-via-schema across the industry | https://developers.openai.com/api/docs/guides/structured-outputs |
OpenAI Realtime API session.update config | Oct 2024 | OpenAI | JSON over WebSocket session config: voice, turn_detection, input_audio_format, instructions, tools, modalities | Active; the spec for low-latency voice/multimodal sessions | https://platform.openai.com/docs/guides/realtime |
| OpenAI Batch API JSONL | Apr 2024 | OpenAI | Line-delimited JSON; each line is {custom_id, method, url, body}; 24-hour SLA, 50% pricing | Active; the workhorse for offline backfills | https://platform.openai.com/docs/guides/batch |
| OpenAI Files / Assistants config | Nov 2023; v2 Apr 2024 | OpenAI | JSON for assistants, threads, runs, vector_stores; tool configs (code_interpreter, file_search, function) | Active (Assistants v2); Responses API is the supersession path | https://platform.openai.com/docs/assistants |
| Anthropic tool_use schema | May 2024 | Anthropic | JSON Schema input_schema per tool; tool_choice: auto/any/tool/none | Active | https://docs.claude.com/en/docs/build-with-claude/tool-use |
| Anthropic Structured Outputs | Nov 14, 2025 (public beta) | Anthropic | output_format + strict: true on tools; grammar-constrained decoding compiled from JSON Schema; anthropic-beta: structured-outputs-2025-11-13 header | GA across Claude Mythos Preview / Opus 4.5–4.7 / Sonnet 4.5–4.6 / Haiku 4.5 as of late 2025 / 2026 | https://platform.claude.com/docs/en/build-with-claude/structured-outputs |
Notable threads
-
config.jsonas the de facto model-artifact contract. No standards body ever sat down to specify “what a transformer model on disk looks like” — HuggingFace’sconfig.json(2018) plustokenizer.json(2020) plusmodel.safetensors(2022) became the universal trio by sheer adoption. By 2024 every meaningful open-weight release ships exactly that layout, and downstream tools (vLLM, TGI, llama.cpp convert, MLX, Ollama, MLC) all read it. ONNX-as-IR was supposed to be the standard at the IR level, but practitioners learned to prefer “ship a HF directory and let the runtime do the conversion.” -
GGUF v3 + embedded chat templates turned models into self-describing serving packages. The 2024 move to embed a Jinja2 chat template directly inside the GGUF metadata KV-store (along with tokenizer state and recommended generation params) closed the loop: a single
.gguffile is now sufficient to load and prompt the model correctly without any sidecar config. llama.cpp implements its own C++ minja Jinja2 runtime to render these templates safely. The flip side is CVE-2024-34359 (llama-cpp-python), which demonstrated that template injection via a malicious GGUF metadata field is a real attack surface — the embedded-DSL convenience has security-DSL costs. -
Safetensors displaced pickle for security reasons, then stayed for speed. The original case for safetensors was “pickle can execute arbitrary code on load and people are downloading random weights from the internet.” But the format also turned out to be faster (zero-copy memmap, sharded reads, GIL-free writes since the 2024 release) and cleaner (a JSON header you can
head -c 8and parse without loading tensors). As of March 2025, ~42% of Hub models are safetensors-tagged, and the percentage among new releases is approaching 100%. -
ONNX’s relative slowdown for frontier LLMs. ONNX still wins on edge/mobile/heterogeneous-hardware deployment (Intel OpenVINO, mobile NPUs, classical-ML embedded), and opset 27 in ONNX 1.22.0 (2025) covers the modern transformer ops. But for frontier-scale LLM training and inference, PyTorch-native (FSDP,
torch.compile, FlashAttention kernels) plus vLLM/TRT-LLM-style bespoke serving have crowded out the “export-to-IR” path. ONNX Runtime GenAI is the counter-bet, with phi/llama/gemma optimisations, but the LLM gravity is elsewhere. -
vLLM vs TensorRT-LLM vs MLX as the 2026 three-way split. vLLM is the default OSS GPU server (PagedAttention, continuous batching, broad model coverage, healthy contributor community, joined the PyTorch Foundation in 2025); TensorRT-LLM wins peak throughput on NVIDIA hardware via AOT compilation; MLX owns Apple Silicon (Ollama switched its Mac backend to MLX in 2025; M5 Neural Accelerators came online with macOS 26.2). MLC LLM is the only universal-platform OSS LLM compiler, but its operational complexity has kept it niche outside browser/mobile.
-
The proliferation of quantization configs is a tiny self-contained DSL ecosystem. GGUF tags (
Q4_K_M,Q5_K_S,Q6_K,IQ2_XXS,IQ3_XS), GPTQquantize_config.json(bits,group_size,desc_act,damp_percent), AWQ’s activation-aware scaling, EXL2’s per-layer variable bits, AutoRound, FP8 (E4M3 / E5M2 / E4M3FNUZ / E5M2FNUZ in safetensors), 1.58-bit ternary (BitNet, added in transformers 4.51) — each is a tiny vocabulary of allowed values stored in JSON metadata. The combinatorial explosion (which method × which bits × which group size × which calibration set) means model-card READMEs have become the de facto documentation surface for quant choices. -
OpenAI Structured Outputs set the function-calling-via-schema pattern. When OpenAI launched Structured Outputs in August 2024 with
response_format: {type: "json_schema", json_schema: {...}, strict: true}, every other vendor followed within 12 months. Anthropic shipped its own Structured Outputs in public beta on 2025-11-13 with the same essential shape (JSON Schema → grammar-constrained decoding → guaranteed conformance). vLLM added the same surface in OSS. The convergence is striking: by 2026, “tool-calling” and “structured outputs” are the same API pattern, distinguished only by whether the result lands incontent(structured output) ortool_use(tool call). JSON Schema, born as an OpenAPI/REST artifact, is now the central control surface of the LLM-API world.
Citations
- HuggingFace Transformers config docs — https://huggingface.co/docs/transformers/main_classes/configuration
- HuggingFace Tokenizers — https://huggingface.co/docs/tokenizers/
- HuggingFace Generation config — https://huggingface.co/docs/transformers/main_classes/text_generation
- Transformers v4.51 release (April 2025, BitNet, Qwen2.5-Omni preview, MLCD) — https://github.com/huggingface/transformers/releases
- HuggingFace Diffusers loading +
model_index.json— https://huggingface.co/docs/diffusers/using-diffusers/loading - HuggingFace Hub model cards (README.md frontmatter) — https://huggingface.co/docs/hub/model-cards
- safetensors repo — https://github.com/huggingface/safetensors
- GGUF spec (ggml repo) — https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
- llama.cpp function-calling + Jinja templates — https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md
- ONNX 1.22.0 docs (opset 27) — https://onnx.ai/onnx/repo-docs/Versioning.html
- ONNX Runtime compatibility (opset 21 ↔ ORT 1.18, 2024) — https://onnxruntime.ai/docs/reference/compatibility.html
- vLLM docs — https://docs.vllm.ai/
- vLLM 0.6 release notes — https://github.com/vllm-project/vllm/releases
- TensorRT-LLM
trtllm-build— https://nvidia.github.io/TensorRT-LLM/latest/commands/trtllm-build.html - Triton Inference Server model configuration — https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html
- TGI architecture — https://huggingface.co/docs/text-generation-inference/main/en/architecture
- TGI multi-backends announcement (TRT-LLM, vLLM) — https://huggingface.co/blog/tgi-multi-backend
- MLX repo — https://github.com/ml-explore/mlx
- Apple ML Research, MLX + M5 Neural Accelerators — https://machinelearning.apple.com/research/exploring-llms-mlx-m5
- MLC LLM docs — https://llm.mlc.ai/
- PEFT LoRA developer guide (LoraConfig, DoRA, QLoRA) — https://huggingface.co/docs/peft/developer_guides/lora
- GPTQModel — https://github.com/ModelCloud/GPTQModel
- AutoAWQ — https://github.com/casper-hansen/AutoAWQ
- ExLlamaV2 — https://github.com/turboderp-org/exllamav2
- OpenAI Structured Outputs — https://openai.com/index/introducing-structured-outputs-in-the-api/
- OpenAI Realtime API — https://platform.openai.com/docs/guides/realtime
- OpenAI Batch API — https://platform.openai.com/docs/guides/batch
- Anthropic Structured Outputs (Nov 2025 beta) — https://platform.claude.com/docs/en/build-with-claude/structured-outputs
- Anthropic tool use — https://docs.claude.com/en/docs/build-with-claude/tool-use
- DeepSpeed
ds_config.json— https://www.deepspeed.ai/docs/config-json/ - Ollama Modelfile — https://github.com/ollama/ollama/blob/main/docs/modelfile.md
- Ollama on MLX (2025) — https://ollama.com/blog/mlx
- CVE-2024-34359 (llama-cpp-python Jinja SSTI) — https://github.com/abetlen/llama-cpp-python/security/advisories/GHSA-56xg-wfcc-g829