GenAI / LLM-Runtime / Model-Serving Config DSLs Family Index

type: language-family-index family: genai-llm-runtime languages_catalogued: 30 tags: [language-reference, family-index, genai-llm-runtime, huggingface, gguf, onnx, vllm, safetensors, peft, structured-outputs]

GenAI / LLM-Runtime / Model-Serving — Family Index

Family overview

This family is the model-as-artifact + serving-config layer for transformer-era ML. It is distinct from the prompt-DSL world (see ai-prompt-languages) and from the older NLU intent DSLs (see chatbot-intent-dsls). What unifies the entries here is that they describe how a model is packaged, quantized, loaded, served, and adapted at runtime, not how a developer asks the model to produce a result. The defining file is HuggingFace’s config.json — a single JSON document that says what architecture this is (architectures: ["LlamaForCausalLM"]), what its dimensions are (hidden_size, num_attention_heads, intermediate_size, num_hidden_layers, vocab_size), what activations and normalisations it uses, and increasingly what tokenizer and chat template it expects. Around config.json accreted tokenizer.json (the BPE/WordPiece/Unigram tokenizer-as-data spec with full pretokenizer + normalizer pipeline), generation_config.json (decoding defaults), model_index.json (Diffusers pipeline manifest), and the YAML frontmatter of README.md model cards (the discovery metadata Hub uses for pipeline_tag, library_name, tags, license). Transformers 4.51+ (current April 2025+) standardised this set, and almost every open-weight model release on the Hub now follows it.

The weights container layer evolved on a parallel track. The 2018–2022 era used PyTorch’s pickle-based .bin files, which were a security hazard (arbitrary code on load). HuggingFace’s safetensors (2022, stabilised through 2023) replaced them with a JSON-header + raw-tensor binary format — zero-copy, memmap-friendly, and unable to execute code on load. As of 2025 over 42% of Hub models carry safetensors weights. On the local-inference side, GGUF (GGML Unified Format, v3) displaced its predecessor GGML in late 2023 as the llama.cpp ecosystem’s quantized container, embedding both KV-metadata and tensors in a single file. GGUF added an embedded Jinja2 chat template in 2024 (parsed by llama.cpp’s own minja runtime in C++), turning the artifact into a self-describing serving package. ONNX, born in 2017 as the cross-framework IR, still anchors edge/mobile/Intel/AMD deployment (current ONNX 1.22.0, opset 27 as of 2025; opset 21 was the 2024 stable; ONNX Runtime 1.18+ for opset-21 coverage), but its share of frontier-LLM deployment has eroded as PyTorch-native pipelines and bespoke kernels dominate.

The serving zoo is the layer above weights+config. vLLM (0.6+ in 2025, with PagedAttention as its signature contribution) is the dominant OSS GPU server; its config surface is a mix of LLM(...) Python kwargs and vllm serve CLI flags rather than a standalone schema file. NVIDIA TensorRT-LLM uses trtllm-build to ahead-of-time compile a model into a serialized engine plus a JSON config file. NVIDIA Triton Inference Server uses config.pbtxt (protobuf-text) for every served model, with detailed sections for instance_group, dynamic_batching, model_warmup, optimization, and response_cache. TGI (Text Generation Inference, HuggingFace) entered maintenance mode December 2025 but added vLLM and TensorRT-LLM as alternative backends earlier that year. MLX (Apple, current 0.21+ as of 2025; M5 Neural Accelerator support arrived with macOS 26.2) is the Apple-silicon-native runtime that Ollama now uses on Apple hardware. MLC LLM (Apache TVM-based) compiles models for browsers, mobile, and heterogeneous CPUs/GPUs via the mlc-chat-config.json workflow. DeepSpeed anchors the training-config side with ds_config.json (ZeRO stages, optimizer offload, mixed-precision, pipeline-parallel).

Finally, two adjacent worlds: the adapter / fine-tune DSLs built around PEFT’s adapter_config.json (LoRA, QLoRA, DoRA — DoRA decomposes weight updates into magnitude + direction; QLoRA quantizes the base to 4-bit before LoRA on top), plus quantization-format configs (GPTQ, AWQ, EXL2, GGUF Q4_K_M, FP8) that travel as JSON metadata alongside safetensors shards. And the hosted-API session DSLs: OpenAI’s Structured Outputs (August 2024, response_format: {type: "json_schema", json_schema: {...}}, the GPT-4o-anchored evolution of json_mode), Anthropic’s Structured Outputs for tool-use (public beta November 2025, anthropic-beta header structured-outputs-2025-11-13), OpenAI Batch API JSONL, and the Realtime API session.update config. JSON Schema is the lingua franca: closed-API providers consume the same Pydantic/Zod-derived schemas that PEFT’s Python dataclasses serialize to.

In our deep library

None of these formats has a standalone deep-library note — they are sub-DSLs of JSON and YAML on disk, or Python dataclasses in memory. Cross-reference:

ai-prompt-languages — prompt-DSL sibling (Guidance, LMQL, DSPy, Outlines, Jsonformer); structured-output JSON-Schema sits at the boundary between the two families.
chatbot-intent-dsls — older NLU intent / dialogue-flow DSLs (Rasa, Dialogflow); pre-transformer lineage.
oci-cloud-native — Kubernetes CRDs for model-serving (KServe InferenceService, Seldon, vLLM Production Stack Helm charts) overlap with this family at the deployment-manifest layer.
api-description — JSON Schema, OpenAPI; the underlying spec for Structured Outputs and tool-use contracts.
codec-and-dsp — TFLite/FlatBuffers and signal-processing model formats overlap with mobile inference here.
nlp-corpus — training-data interchange (Datasets dataset_infos.json).
python — universal host language for transformers, PEFT, vLLM, Diffusers; dataclasses serialize directly to these JSON formats.
build-devops — CI/CD for model release (HuggingFace Hub spaces, OCI image push, S3 weight upload).
yaml — model-card frontmatter, TorchServe model-config.yaml, DeepSpeed YAML variants.
protobuf-rpc — Triton’s config.pbtxt uses protobuf text format; gRPC underlies TGI’s router↔server channel.

Tier 3 family table — Model-artifact / weights formats

Format	First appeared	Origin	Type	Status (2026)	URL
safetensors	2022 (Hub adoption); spec stable 2023	HuggingFace	JSON header + raw tensor binary; zero-copy, memmap, no code-execution-on-load	Dominant; ~42% of Hub models as of March 2025; current spec adds FP8 (E4M3FNUZ/E5M2FNUZ), MUSA device support	https://github.com/huggingface/safetensors
GGUF (GGML Unified Format) v3	Aug 2023 (replacing GGML)	Georgi Gerganov / llama.cpp	Single binary file: magic + version + KV-metadata + tensor blocks; supports 2–8-bit ints, FP32/FP16/BF16, 1.58-bit	Dominant local-inference standard; embeds Jinja2 chat template, tokenizer; Dynamic GGUF 2.0 (2025) for better-accuracy quants	https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
GGML (predecessor)	2022	Georgi Gerganov	Earlier llama.cpp tensor container; metadata convention varied across versions	Legacy, superseded by GGUF Aug 2023	https://github.com/ggml-org/ggml
ONNX (.onnx)	2017	Microsoft + Facebook (now LF AI)	Protobuf IR for cross-framework models; ops version-pinned via opset	Active; ONNX 1.22.0 (2025) ships opset 27; opset 21 was the 2024 stable (ONNX Runtime 1.18+); slowed adoption for frontier LLMs as PyTorch-native dominates	https://onnx.ai/
ONNX-ML	2017 (with ONNX)	Microsoft + Facebook	Extension for classical ML ops (tree ensembles, linear models, vectorizers, label encoders)	Active; the export target for scikit-learn-onnx	https://onnx.ai/onnx/operators/
TensorFlow SavedModel + `signature_def`	2017	Google	Protobuf-based MetaGraphDef + variables; `signature_def` declares serving inputs/outputs	Active within TF/Keras 3 ecosystem, lower share outside	https://www.tensorflow.org/guide/saved_model
TFLite FlatBuffers (.tflite)	2017	Google	FlatBuffer schema for on-device inference; integer quantization metadata	Active; LiteRT branding (2024) but format compatible; on-device Android/iOS workhorse	https://www.tensorflow.org/lite/guide
Core ML (.mlpackage)	2017; .mlpackage 2021	Apple	Protobuf model description + weights bundle; iOS/macOS deployment	Active; coremltools 8.x; tight integration with MLX (2025)	https://developer.apple.com/documentation/coreml
MLX format	2023	Apple ML Research	npz/safetensors-derived format optimised for Apple unified memory	Very active; current 0.21+ (2025); macOS 26.2 added M5 Neural Accelerator support; Ollama on Apple Silicon now uses MLX	https://github.com/ml-explore/mlx
MLC LLM model libraries	2023	OctoML / mlc-ai / CMU Catalyst	TVM-compiled artifact tree + `mlc-chat-config.json`; targets WebGPU/Vulkan/Metal/CUDA/iOS/Android	Active; Apache TVM Unity pipeline; only universal-platform OSS LLM runtime	https://llm.mlc.ai/
OpenVINO IR (.xml + .bin)	2018	Intel	XML graph + binary weights; Intel CPU/GPU/NPU deployment	Active; 2024/2025 LTS branches; pairs with `optimum-intel`	https://docs.openvino.ai/

Tier 3 family table — Model + tokenizer config (HuggingFace ecosystem)

Format	First appeared	Origin	Type	Status (2026)	URL
`config.json`	2018 (transformers v1+)	HuggingFace	JSON; the canonical model-config schema (`architectures`, `model_type`, `hidden_size`, `num_attention_heads`, etc.)	Dominant; transformers 4.51+ (April 2025) is current; the de facto model-artifact contract	https://huggingface.co/docs/transformers/main_classes/configuration
`tokenizer.json`	2020 (tokenizers v0.9+)	HuggingFace	JSON specification of normalizer + pretokenizer + model (BPE/WordPiece/Unigram) + post-processor + decoder pipeline	Dominant; tokenizer-as-data, replaces `vocab.txt` + `merges.txt` legacy split	https://huggingface.co/docs/tokenizers/
`generation_config.json`	2022 (transformers 4.25+)	HuggingFace	JSON; default decoding parameters (temperature, top_p, top_k, max_new_tokens, repetition_penalty, stop tokens)	Active; loaded automatically by `generate()`	https://huggingface.co/docs/transformers/main_classes/text_generation
`model_index.json` (Diffusers)	2022	HuggingFace Diffusers	JSON manifest enumerating the components of a Diffusion pipeline (UNet, VAE, text encoder, scheduler, safety checker) and their library/class	Active; the Diffusers pipeline equivalent of `config.json`	https://huggingface.co/docs/diffusers/using-diffusers/loading
`scheduler_config.json` (Diffusers)	2022	HuggingFace Diffusers	JSON; diffusion scheduler config (DDIM, DPMSolver, Euler, etc., with beta schedule + timesteps)	Active	https://huggingface.co/docs/diffusers/api/schedulers/overview
Hub README.md YAML frontmatter (model card)	2021	HuggingFace	YAML frontmatter: `license`, `library_name`, `pipeline_tag`, `tags`, `language`, `datasets`, `base_model`, `metrics`	Active; powers Hub discovery, search filters, and inference-widget routing	https://huggingface.co/docs/hub/model-cards
`sentence_bert_config.json`	2019	UKP Lab / Sentence-Transformers	JSON; max_seq_length and do_lower_case for embedding pooling	Active; v3 (2024) introduced unified `modules.json` + per-module configs	https://www.sbert.net/
Datasets `dataset_infos.json`	2020	HuggingFace	JSON; dataset splits + features schema + checksums + citations	Active; companion to `datasets` library	https://huggingface.co/docs/datasets/
GGUF embedded chat template (Jinja2)	2024	llama.cpp / ggml-org	Jinja2 template string stored as a GGUF metadata KV; rendered by llama.cpp’s minja C++ implementation	Active and standard; `--jinja` flag uses it; tool-call grammars built on top; CVE-2024-34359 in llama-cpp-python’s older Python Jinja path	https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md

Tier 3 family table — Serving / inference engines

Format	First appeared	Origin	Type	Status (2026)	URL
vLLM engine config	2023 (UC Berkeley); 0.6+ current 2025	UC Berkeley Sky Computing Lab → vLLM Project (PyTorch Foundation, 2025)	Python `LLM(...)` kwargs + `vllm serve` CLI flags; PagedAttention is the signature contribution; no standalone schema	Dominant OSS GPU server; 0.6 added GPU NGram spec decoding, KV-cache offloading, elastic expert-parallelism	https://docs.vllm.ai/
TensorRT-LLM build config	2023	NVIDIA	`trtllm-build` CLI + JSON checkpoint config + serialized engine.cfg; AOT compilation per-GPU-architecture	Active; the highest-throughput proprietary-stack option on NVIDIA HW	https://nvidia.github.io/TensorRT-LLM/
Triton Inference Server `config.pbtxt`	2018 (TensorRT IS); 2019 Triton rebrand	NVIDIA	Protobuf-text `ModelConfig`: `platform`, `max_batch_size`, `input[]`, `output[]`, `instance_group`, `dynamic_batching`, `model_warmup`, `response_cache`, `optimization`	Active; the de facto multi-backend serving config (PyTorch, TF, ONNX, TensorRT, vLLM, TRT-LLM all plug in)	https://docs.nvidia.com/deeplearning/triton-inference-server/
TGI (Text Generation Inference) router config	2022	HuggingFace	Rust HTTP/gRPC router with CLI/env-var config; continuous batching	Maintenance mode as of Dec 2025; added vLLM + TRT-LLM backends earlier 2025	https://huggingface.co/docs/text-generation-inference/
TorchServe `model-config.yaml`	2020	Meta AI / AWS	YAML config: `minWorkers`, `maxWorkers`, `batchSize`, `responseTimeout`, handler class	Active but lower share; PyTorch’s official server, less momentum than vLLM	https://pytorch.org/serve/
llama.cpp / llama-cpp-python config	2023	Georgi Gerganov / Andrei Betlen	CLI flags + Python kwargs (`n_ctx`, `n_gpu_layers`, `chat_format`, `rope_freq_base`); `params.json` for legacy llama-1 conversion	Very active; reference local-inference runtime	https://github.com/abetlen/llama-cpp-python
LM Studio model config	2023	LM Studio (Element Labs)	Per-model JSON sidecar in app data; GUI surfaces n_ctx, prompt template, GPU offload	Active; closed-source desktop wrapper around llama.cpp + MLX	https://lmstudio.ai/docs
Ollama Modelfile	2023	Ollama Inc.	Dockerfile-shaped DSL (`FROM`, `PARAMETER`, `TEMPLATE`, `SYSTEM`, `ADAPTER`); 2025 MLX backend on Apple Silicon	Very active; the dominant local-LLM bundler for non-developers	https://github.com/ollama/ollama/blob/main/docs/modelfile.md
DeepSpeed `ds_config.json`	2020	Microsoft	JSON config for ZeRO stages, optimizer offload, mixed precision, pipeline parallelism, activation checkpointing	Active; the training-config heavyweight; integrates with HF Accelerate	https://www.deepspeed.ai/docs/config-json/

Tier 3 family table — Adapter / fine-tune / quantization

Format	First appeared	Origin	Type	Status (2026)	URL
PEFT `adapter_config.json`	2023	HuggingFace PEFT	JSON; LoraConfig serialization (r, lora_alpha, lora_dropout, target_modules, modules_to_save, base_model_name_or_path)	Dominant adapter format; covers LoRA, QLoRA (4-bit base), DoRA (magnitude+direction decomposition), AdaLoRA, IA³, LoHA, LoKR	https://huggingface.co/docs/peft/
AutoGPTQ / GPTQModel config	2023 / 2024	PanQiWei → ModelCloud	`quantize_config.json` + safetensors shards; `bits`, `group_size`, `desc_act`, `sym`, `damp_percent`, `static_groups`	Active; GPTQModel v7+ (2025) adds Huawei Ascend NPU; integrates HF, vLLM, SGLang	https://github.com/ModelCloud/GPTQModel
AWQ config	2023	MIT HAN Lab	JSON-embedded `quantization_config` (`bits`, `group_size`, `zero_point`, `quant_method: "awq"`); activation-aware scaling of salient channels	Active; lower perplexity than GPTQ in most benches; widely shipped on Hub	https://github.com/casper-hansen/AutoAWQ
ExLlamaV2 / EXL2 measurements	2023	turboderp	Variable bits-per-weight quantization with per-layer sensitivity measurement; safetensors-stored	Active; ExLlamaV3 quantizers (2025) extended to GPTQ/AWQ on consumer GPUs	https://github.com/turboderp-org/exllamav2
GGUF quantization tags (Q4_K_M, Q5_K_S, Q6_K, IQ2_XS, etc.)	2023	llama.cpp	Quantization-type enum embedded as GGUF metadata; K-quants, I-quants, importance-matrix-derived	Active; the lingua franca of local-inference filenames	https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/README.md
Llama-cpp `params.json`	2023	Meta + llama.cpp	JSON sidecar from original Llama-1 release (`dim`, `n_heads`, `n_layers`, `norm_eps`); used by `convert.py` to produce GGUF	Legacy; superseded by reading directly from HF `config.json`	https://github.com/ggml-org/llama.cpp

Tier 3 family table — Hosted-API session / structured-output

Format	First appeared	Origin	Type	Status (2026)	URL
OpenAI Structured Outputs (`response_format: json_schema`)	Aug 2024 (GPT-4o-2024-08-06)	OpenAI	Strict JSON-Schema-constrained decoding; Pydantic/Zod schemas auto-converted via SDK	Dominant pattern for function-calling-via-schema across the industry	https://developers.openai.com/api/docs/guides/structured-outputs
OpenAI Realtime API `session.update` config	Oct 2024	OpenAI	JSON over WebSocket session config: voice, turn_detection, input_audio_format, instructions, tools, modalities	Active; the spec for low-latency voice/multimodal sessions	https://platform.openai.com/docs/guides/realtime
OpenAI Batch API JSONL	Apr 2024	OpenAI	Line-delimited JSON; each line is `{custom_id, method, url, body}`; 24-hour SLA, 50% pricing	Active; the workhorse for offline backfills	https://platform.openai.com/docs/guides/batch
OpenAI Files / Assistants config	Nov 2023; v2 Apr 2024	OpenAI	JSON for `assistants`, `threads`, `runs`, `vector_stores`; tool configs (code_interpreter, file_search, function)	Active (Assistants v2); Responses API is the supersession path	https://platform.openai.com/docs/assistants
Anthropic tool_use schema	May 2024	Anthropic	JSON Schema `input_schema` per tool; `tool_choice: auto/any/tool/none`	Active	https://docs.claude.com/en/docs/build-with-claude/tool-use
Anthropic Structured Outputs	Nov 14, 2025 (public beta)	Anthropic	`output_format` + `strict: true` on tools; grammar-constrained decoding compiled from JSON Schema; `anthropic-beta: structured-outputs-2025-11-13` header	GA across Claude Mythos Preview / Opus 4.5–4.7 / Sonnet 4.5–4.6 / Haiku 4.5 as of late 2025 / 2026	https://platform.claude.com/docs/en/build-with-claude/structured-outputs

Notable threads

config.json as the de facto model-artifact contract. No standards body ever sat down to specify “what a transformer model on disk looks like” — HuggingFace’s config.json (2018) plus tokenizer.json (2020) plus model.safetensors (2022) became the universal trio by sheer adoption. By 2024 every meaningful open-weight release ships exactly that layout, and downstream tools (vLLM, TGI, llama.cpp convert, MLX, Ollama, MLC) all read it. ONNX-as-IR was supposed to be the standard at the IR level, but practitioners learned to prefer “ship a HF directory and let the runtime do the conversion.”
GGUF v3 + embedded chat templates turned models into self-describing serving packages. The 2024 move to embed a Jinja2 chat template directly inside the GGUF metadata KV-store (along with tokenizer state and recommended generation params) closed the loop: a single .gguf file is now sufficient to load and prompt the model correctly without any sidecar config. llama.cpp implements its own C++ minja Jinja2 runtime to render these templates safely. The flip side is CVE-2024-34359 (llama-cpp-python), which demonstrated that template injection via a malicious GGUF metadata field is a real attack surface — the embedded-DSL convenience has security-DSL costs.
Safetensors displaced pickle for security reasons, then stayed for speed. The original case for safetensors was “pickle can execute arbitrary code on load and people are downloading random weights from the internet.” But the format also turned out to be faster (zero-copy memmap, sharded reads, GIL-free writes since the 2024 release) and cleaner (a JSON header you can head -c 8 and parse without loading tensors). As of March 2025, ~42% of Hub models are safetensors-tagged, and the percentage among new releases is approaching 100%.
ONNX’s relative slowdown for frontier LLMs. ONNX still wins on edge/mobile/heterogeneous-hardware deployment (Intel OpenVINO, mobile NPUs, classical-ML embedded), and opset 27 in ONNX 1.22.0 (2025) covers the modern transformer ops. But for frontier-scale LLM training and inference, PyTorch-native (FSDP, torch.compile, FlashAttention kernels) plus vLLM/TRT-LLM-style bespoke serving have crowded out the “export-to-IR” path. ONNX Runtime GenAI is the counter-bet, with phi/llama/gemma optimisations, but the LLM gravity is elsewhere.
vLLM vs TensorRT-LLM vs MLX as the 2026 three-way split. vLLM is the default OSS GPU server (PagedAttention, continuous batching, broad model coverage, healthy contributor community, joined the PyTorch Foundation in 2025); TensorRT-LLM wins peak throughput on NVIDIA hardware via AOT compilation; MLX owns Apple Silicon (Ollama switched its Mac backend to MLX in 2025; M5 Neural Accelerators came online with macOS 26.2). MLC LLM is the only universal-platform OSS LLM compiler, but its operational complexity has kept it niche outside browser/mobile.
The proliferation of quantization configs is a tiny self-contained DSL ecosystem. GGUF tags (Q4_K_M, Q5_K_S, Q6_K, IQ2_XXS, IQ3_XS), GPTQ quantize_config.json (bits, group_size, desc_act, damp_percent), AWQ’s activation-aware scaling, EXL2’s per-layer variable bits, AutoRound, FP8 (E4M3 / E5M2 / E4M3FNUZ / E5M2FNUZ in safetensors), 1.58-bit ternary (BitNet, added in transformers 4.51) — each is a tiny vocabulary of allowed values stored in JSON metadata. The combinatorial explosion (which method × which bits × which group size × which calibration set) means model-card READMEs have become the de facto documentation surface for quant choices.
OpenAI Structured Outputs set the function-calling-via-schema pattern. When OpenAI launched Structured Outputs in August 2024 with response_format: {type: "json_schema", json_schema: {...}, strict: true}, every other vendor followed within 12 months. Anthropic shipped its own Structured Outputs in public beta on 2025-11-13 with the same essential shape (JSON Schema → grammar-constrained decoding → guaranteed conformance). vLLM added the same surface in OSS. The convergence is striking: by 2026, “tool-calling” and “structured outputs” are the same API pattern, distinguished only by whether the result lands in content (structured output) or tool_use (tool call). JSON Schema, born as an OpenAPI/REST artifact, is now the central control surface of the LLM-API world.

Citations

HuggingFace Transformers config docs — https://huggingface.co/docs/transformers/main_classes/configuration
HuggingFace Tokenizers — https://huggingface.co/docs/tokenizers/
HuggingFace Generation config — https://huggingface.co/docs/transformers/main_classes/text_generation
Transformers v4.51 release (April 2025, BitNet, Qwen2.5-Omni preview, MLCD) — https://github.com/huggingface/transformers/releases
HuggingFace Diffusers loading + model_index.json — https://huggingface.co/docs/diffusers/using-diffusers/loading
HuggingFace Hub model cards (README.md frontmatter) — https://huggingface.co/docs/hub/model-cards
safetensors repo — https://github.com/huggingface/safetensors
GGUF spec (ggml repo) — https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
llama.cpp function-calling + Jinja templates — https://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md
ONNX 1.22.0 docs (opset 27) — https://onnx.ai/onnx/repo-docs/Versioning.html
ONNX Runtime compatibility (opset 21 ↔ ORT 1.18, 2024) — https://onnxruntime.ai/docs/reference/compatibility.html
vLLM docs — https://docs.vllm.ai/
vLLM 0.6 release notes — https://github.com/vllm-project/vllm/releases
TensorRT-LLM trtllm-build — https://nvidia.github.io/TensorRT-LLM/latest/commands/trtllm-build.html
Triton Inference Server model configuration — https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html
TGI architecture — https://huggingface.co/docs/text-generation-inference/main/en/architecture
TGI multi-backends announcement (TRT-LLM, vLLM) — https://huggingface.co/blog/tgi-multi-backend
MLX repo — https://github.com/ml-explore/mlx
Apple ML Research, MLX + M5 Neural Accelerators — https://machinelearning.apple.com/research/exploring-llms-mlx-m5
MLC LLM docs — https://llm.mlc.ai/
PEFT LoRA developer guide (LoraConfig, DoRA, QLoRA) — https://huggingface.co/docs/peft/developer_guides/lora
GPTQModel — https://github.com/ModelCloud/GPTQModel
AutoAWQ — https://github.com/casper-hansen/AutoAWQ
ExLlamaV2 — https://github.com/turboderp-org/exllamav2
OpenAI Structured Outputs — https://openai.com/index/introducing-structured-outputs-in-the-api/
OpenAI Realtime API — https://platform.openai.com/docs/guides/realtime
OpenAI Batch API — https://platform.openai.com/docs/guides/batch
Anthropic Structured Outputs (Nov 2025 beta) — https://platform.claude.com/docs/en/build-with-claude/structured-outputs
Anthropic tool use — https://docs.claude.com/en/docs/build-with-claude/tool-use
DeepSpeed ds_config.json — https://www.deepspeed.ai/docs/config-json/
Ollama Modelfile — https://github.com/ollama/ollama/blob/main/docs/modelfile.md
Ollama on MLX (2025) — https://ollama.com/blog/mlx
CVE-2024-34359 (llama-cpp-python Jinja SSTI) — https://github.com/abetlen/llama-cpp-python/security/advisories/GHSA-56xg-wfcc-g829

Compendium

Explorer

GenAI / LLM-Runtime / Model-Serving Config DSLs Family Index

GenAI / LLM-Runtime / Model-Serving Config DSLs Family Index

type: language-family-index family: genai-llm-runtime languages_catalogued: 30 tags: [language-reference, family-index, genai-llm-runtime, huggingface, gguf, onnx, vllm, safetensors, peft, structured-outputs]

GenAI / LLM-Runtime / Model-Serving — Family Index

Family overview

In our deep library

Tier 3 family table — Model-artifact / weights formats

Tier 3 family table — Model + tokenizer config (HuggingFace ecosystem)

Tier 3 family table — Serving / inference engines

Tier 3 family table — Adapter / fine-tune / quantization

Tier 3 family table — Hosted-API session / structured-output

Notable threads

Citations

Graph View

Table of Contents