GenAI / LLM-Runtime / Model-Serving Config DSLs Family Index


type: language-family-index family: genai-llm-runtime languages_catalogued: 30 tags: [language-reference, family-index, genai-llm-runtime, huggingface, gguf, onnx, vllm, safetensors, peft, structured-outputs]

GenAI / LLM-Runtime / Model-Serving — Family Index

Family overview

This family is the model-as-artifact + serving-config layer for transformer-era ML. It is distinct from the prompt-DSL world (see ai-prompt-languages) and from the older NLU intent DSLs (see chatbot-intent-dsls). What unifies the entries here is that they describe how a model is packaged, quantized, loaded, served, and adapted at runtime, not how a developer asks the model to produce a result. The defining file is HuggingFace’s config.json — a single JSON document that says what architecture this is (architectures: ["LlamaForCausalLM"]), what its dimensions are (hidden_size, num_attention_heads, intermediate_size, num_hidden_layers, vocab_size), what activations and normalisations it uses, and increasingly what tokenizer and chat template it expects. Around config.json accreted tokenizer.json (the BPE/WordPiece/Unigram tokenizer-as-data spec with full pretokenizer + normalizer pipeline), generation_config.json (decoding defaults), model_index.json (Diffusers pipeline manifest), and the YAML frontmatter of README.md model cards (the discovery metadata Hub uses for pipeline_tag, library_name, tags, license). Transformers 4.51+ (current April 2025+) standardised this set, and almost every open-weight model release on the Hub now follows it.

The weights container layer evolved on a parallel track. The 2018–2022 era used PyTorch’s pickle-based .bin files, which were a security hazard (arbitrary code on load). HuggingFace’s safetensors (2022, stabilised through 2023) replaced them with a JSON-header + raw-tensor binary format — zero-copy, memmap-friendly, and unable to execute code on load. As of 2025 over 42% of Hub models carry safetensors weights. On the local-inference side, GGUF (GGML Unified Format, v3) displaced its predecessor GGML in late 2023 as the llama.cpp ecosystem’s quantized container, embedding both KV-metadata and tensors in a single file. GGUF added an embedded Jinja2 chat template in 2024 (parsed by llama.cpp’s own minja runtime in C++), turning the artifact into a self-describing serving package. ONNX, born in 2017 as the cross-framework IR, still anchors edge/mobile/Intel/AMD deployment (current ONNX 1.22.0, opset 27 as of 2025; opset 21 was the 2024 stable; ONNX Runtime 1.18+ for opset-21 coverage), but its share of frontier-LLM deployment has eroded as PyTorch-native pipelines and bespoke kernels dominate.

The serving zoo is the layer above weights+config. vLLM (0.6+ in 2025, with PagedAttention as its signature contribution) is the dominant OSS GPU server; its config surface is a mix of LLM(...) Python kwargs and vllm serve CLI flags rather than a standalone schema file. NVIDIA TensorRT-LLM uses trtllm-build to ahead-of-time compile a model into a serialized engine plus a JSON config file. NVIDIA Triton Inference Server uses config.pbtxt (protobuf-text) for every served model, with detailed sections for instance_group, dynamic_batching, model_warmup, optimization, and response_cache. TGI (Text Generation Inference, HuggingFace) entered maintenance mode December 2025 but added vLLM and TensorRT-LLM as alternative backends earlier that year. MLX (Apple, current 0.21+ as of 2025; M5 Neural Accelerator support arrived with macOS 26.2) is the Apple-silicon-native runtime that Ollama now uses on Apple hardware. MLC LLM (Apache TVM-based) compiles models for browsers, mobile, and heterogeneous CPUs/GPUs via the mlc-chat-config.json workflow. DeepSpeed anchors the training-config side with ds_config.json (ZeRO stages, optimizer offload, mixed-precision, pipeline-parallel).

Finally, two adjacent worlds: the adapter / fine-tune DSLs built around PEFT’s adapter_config.json (LoRA, QLoRA, DoRA — DoRA decomposes weight updates into magnitude + direction; QLoRA quantizes the base to 4-bit before LoRA on top), plus quantization-format configs (GPTQ, AWQ, EXL2, GGUF Q4_K_M, FP8) that travel as JSON metadata alongside safetensors shards. And the hosted-API session DSLs: OpenAI’s Structured Outputs (August 2024, response_format: {type: "json_schema", json_schema: {...}}, the GPT-4o-anchored evolution of json_mode), Anthropic’s Structured Outputs for tool-use (public beta November 2025, anthropic-beta header structured-outputs-2025-11-13), OpenAI Batch API JSONL, and the Realtime API session.update config. JSON Schema is the lingua franca: closed-API providers consume the same Pydantic/Zod-derived schemas that PEFT’s Python dataclasses serialize to.

In our deep library

None of these formats has a standalone deep-library note — they are sub-DSLs of JSON and YAML on disk, or Python dataclasses in memory. Cross-reference:

  • ai-prompt-languages — prompt-DSL sibling (Guidance, LMQL, DSPy, Outlines, Jsonformer); structured-output JSON-Schema sits at the boundary between the two families.
  • chatbot-intent-dsls — older NLU intent / dialogue-flow DSLs (Rasa, Dialogflow); pre-transformer lineage.
  • oci-cloud-native — Kubernetes CRDs for model-serving (KServe InferenceService, Seldon, vLLM Production Stack Helm charts) overlap with this family at the deployment-manifest layer.
  • api-description — JSON Schema, OpenAPI; the underlying spec for Structured Outputs and tool-use contracts.
  • codec-and-dsp — TFLite/FlatBuffers and signal-processing model formats overlap with mobile inference here.
  • nlp-corpus — training-data interchange (Datasets dataset_infos.json).
  • python — universal host language for transformers, PEFT, vLLM, Diffusers; dataclasses serialize directly to these JSON formats.
  • build-devops — CI/CD for model release (HuggingFace Hub spaces, OCI image push, S3 weight upload).
  • yaml — model-card frontmatter, TorchServe model-config.yaml, DeepSpeed YAML variants.
  • protobuf-rpc — Triton’s config.pbtxt uses protobuf text format; gRPC underlies TGI’s router↔server channel.

Tier 3 family table — Model-artifact / weights formats

FormatFirst appearedOriginTypeStatus (2026)URL
safetensors2022 (Hub adoption); spec stable 2023HuggingFaceJSON header + raw tensor binary; zero-copy, memmap, no code-execution-on-loadDominant; ~42% of Hub models as of March 2025; current spec adds FP8 (E4M3FNUZ/E5M2FNUZ), MUSA device supporthttps://github.com/huggingface/safetensors
GGUF (GGML Unified Format) v3Aug 2023 (replacing GGML)Georgi Gerganov / llama.cppSingle binary file: magic + version + KV-metadata + tensor blocks; supports 2–8-bit ints, FP32/FP16/BF16, 1.58-bitDominant local-inference standard; embeds Jinja2 chat template, tokenizer; Dynamic GGUF 2.0 (2025) for better-accuracy quantshttps://github.com/ggml-org/ggml/blob/master/docs/gguf.md
GGML (predecessor)2022Georgi GerganovEarlier llama.cpp tensor container; metadata convention varied across versionsLegacy, superseded by GGUF Aug 2023https://github.com/ggml-org/ggml
ONNX (.onnx)2017Microsoft + Facebook (now LF AI)Protobuf IR for cross-framework models; ops version-pinned via opsetActive; ONNX 1.22.0 (2025) ships opset 27; opset 21 was the 2024 stable (ONNX Runtime 1.18+); slowed adoption for frontier LLMs as PyTorch-native dominateshttps://onnx.ai/
ONNX-ML2017 (with ONNX)Microsoft + FacebookExtension for classical ML ops (tree ensembles, linear models, vectorizers, label encoders)Active; the export target for scikit-learn-onnxhttps://onnx.ai/onnx/operators/
TensorFlow SavedModel + signature_def2017GoogleProtobuf-based MetaGraphDef + variables; signature_def declares serving inputs/outputsActive within TF/Keras 3 ecosystem, lower share outsidehttps://www.tensorflow.org/guide/saved_model
TFLite FlatBuffers (.tflite)2017GoogleFlatBuffer schema for on-device inference; integer quantization metadataActive; LiteRT branding (2024) but format compatible; on-device Android/iOS workhorsehttps://www.tensorflow.org/lite/guide
Core ML (.mlpackage)2017; .mlpackage 2021AppleProtobuf model description + weights bundle; iOS/macOS deploymentActive; coremltools 8.x; tight integration with MLX (2025)https://developer.apple.com/documentation/coreml
MLX format2023Apple ML Researchnpz/safetensors-derived format optimised for Apple unified memoryVery active; current 0.21+ (2025); macOS 26.2 added M5 Neural Accelerator support; Ollama on Apple Silicon now uses MLXhttps://github.com/ml-explore/mlx
MLC LLM model libraries2023OctoML / mlc-ai / CMU CatalystTVM-compiled artifact tree + mlc-chat-config.json; targets WebGPU/Vulkan/Metal/CUDA/iOS/AndroidActive; Apache TVM Unity pipeline; only universal-platform OSS LLM runtimehttps://llm.mlc.ai/
OpenVINO IR (.xml + .bin)2018IntelXML graph + binary weights; Intel CPU/GPU/NPU deploymentActive; 2024/2025 LTS branches; pairs with optimum-intelhttps://docs.openvino.ai/

Tier 3 family table — Model + tokenizer config (HuggingFace ecosystem)

FormatFirst appearedOriginTypeStatus (2026)URL
config.json2018 (transformers v1+)HuggingFaceJSON; the canonical model-config schema (architectures, model_type, hidden_size, num_attention_heads, etc.)Dominant; transformers 4.51+ (April 2025) is current; the de facto model-artifact contracthttps://huggingface.co/docs/transformers/main_classes/configuration
tokenizer.json2020 (tokenizers v0.9+)HuggingFaceJSON specification of normalizer + pretokenizer + model (BPE/WordPiece/Unigram) + post-processor + decoder pipelineDominant; tokenizer-as-data, replaces vocab.txt + merges.txt legacy splithttps://huggingface.co/docs/tokenizers/
generation_config.json2022 (transformers 4.25+)HuggingFaceJSON; default decoding parameters (temperature, top_p, top_k, max_new_tokens, repetition_penalty, stop tokens)Active; loaded automatically by generate()https://huggingface.co/docs/transformers/main_classes/text_generation
model_index.json (Diffusers)2022HuggingFace DiffusersJSON manifest enumerating the components of a Diffusion pipeline (UNet, VAE, text encoder, scheduler, safety checker) and their library/classActive; the Diffusers pipeline equivalent of config.jsonhttps://huggingface.co/docs/diffusers/using-diffusers/loading
scheduler_config.json (Diffusers)2022HuggingFace DiffusersJSON; diffusion scheduler config (DDIM, DPMSolver, Euler, etc., with beta schedule + timesteps)Activehttps://huggingface.co/docs/diffusers/api/schedulers/overview
Hub README.md YAML frontmatter (model card)2021HuggingFaceYAML frontmatter: license, library_name, pipeline_tag, tags, language, datasets, base_model, metricsActive; powers Hub discovery, search filters, and inference-widget routinghttps://huggingface.co/docs/hub/model-cards
sentence_bert_config.json2019UKP Lab / Sentence-TransformersJSON; max_seq_length and do_lower_case for embedding poolingActive; v3 (2024) introduced unified modules.json + per-module configshttps://www.sbert.net/
Datasets dataset_infos.json2020HuggingFaceJSON; dataset splits + features schema + checksums + citationsActive; companion to datasets libraryhttps://huggingface.co/docs/datasets/
GGUF embedded chat template (Jinja2)2024llama.cpp / ggml-orgJinja2 template string stored as a GGUF metadata KV; rendered by llama.cpp’s minja C++ implementationActive and standard; --jinja flag uses it; tool-call grammars built on top; CVE-2024-34359 in llama-cpp-python’s older Python Jinja pathhttps://github.com/ggml-org/llama.cpp/blob/master/docs/function-calling.md

Tier 3 family table — Serving / inference engines

FormatFirst appearedOriginTypeStatus (2026)URL
vLLM engine config2023 (UC Berkeley); 0.6+ current 2025UC Berkeley Sky Computing Lab → vLLM Project (PyTorch Foundation, 2025)Python LLM(...) kwargs + vllm serve CLI flags; PagedAttention is the signature contribution; no standalone schemaDominant OSS GPU server; 0.6 added GPU NGram spec decoding, KV-cache offloading, elastic expert-parallelismhttps://docs.vllm.ai/
TensorRT-LLM build config2023NVIDIAtrtllm-build CLI + JSON checkpoint config + serialized engine.cfg; AOT compilation per-GPU-architectureActive; the highest-throughput proprietary-stack option on NVIDIA HWhttps://nvidia.github.io/TensorRT-LLM/
Triton Inference Server config.pbtxt2018 (TensorRT IS); 2019 Triton rebrandNVIDIAProtobuf-text ModelConfig: platform, max_batch_size, input[], output[], instance_group, dynamic_batching, model_warmup, response_cache, optimizationActive; the de facto multi-backend serving config (PyTorch, TF, ONNX, TensorRT, vLLM, TRT-LLM all plug in)https://docs.nvidia.com/deeplearning/triton-inference-server/
TGI (Text Generation Inference) router config2022HuggingFaceRust HTTP/gRPC router with CLI/env-var config; continuous batchingMaintenance mode as of Dec 2025; added vLLM + TRT-LLM backends earlier 2025https://huggingface.co/docs/text-generation-inference/
TorchServe model-config.yaml2020Meta AI / AWSYAML config: minWorkers, maxWorkers, batchSize, responseTimeout, handler classActive but lower share; PyTorch’s official server, less momentum than vLLMhttps://pytorch.org/serve/
llama.cpp / llama-cpp-python config2023Georgi Gerganov / Andrei BetlenCLI flags + Python kwargs (n_ctx, n_gpu_layers, chat_format, rope_freq_base); params.json for legacy llama-1 conversionVery active; reference local-inference runtimehttps://github.com/abetlen/llama-cpp-python
LM Studio model config2023LM Studio (Element Labs)Per-model JSON sidecar in app data; GUI surfaces n_ctx, prompt template, GPU offloadActive; closed-source desktop wrapper around llama.cpp + MLXhttps://lmstudio.ai/docs
Ollama Modelfile2023Ollama Inc.Dockerfile-shaped DSL (FROM, PARAMETER, TEMPLATE, SYSTEM, ADAPTER); 2025 MLX backend on Apple SiliconVery active; the dominant local-LLM bundler for non-developershttps://github.com/ollama/ollama/blob/main/docs/modelfile.md
DeepSpeed ds_config.json2020MicrosoftJSON config for ZeRO stages, optimizer offload, mixed precision, pipeline parallelism, activation checkpointingActive; the training-config heavyweight; integrates with HF Acceleratehttps://www.deepspeed.ai/docs/config-json/

Tier 3 family table — Adapter / fine-tune / quantization

FormatFirst appearedOriginTypeStatus (2026)URL
PEFT adapter_config.json2023HuggingFace PEFTJSON; LoraConfig serialization (r, lora_alpha, lora_dropout, target_modules, modules_to_save, base_model_name_or_path)Dominant adapter format; covers LoRA, QLoRA (4-bit base), DoRA (magnitude+direction decomposition), AdaLoRA, IA³, LoHA, LoKRhttps://huggingface.co/docs/peft/
AutoGPTQ / GPTQModel config2023 / 2024PanQiWei → ModelCloudquantize_config.json + safetensors shards; bits, group_size, desc_act, sym, damp_percent, static_groupsActive; GPTQModel v7+ (2025) adds Huawei Ascend NPU; integrates HF, vLLM, SGLanghttps://github.com/ModelCloud/GPTQModel
AWQ config2023MIT HAN LabJSON-embedded quantization_config (bits, group_size, zero_point, quant_method: "awq"); activation-aware scaling of salient channelsActive; lower perplexity than GPTQ in most benches; widely shipped on Hubhttps://github.com/casper-hansen/AutoAWQ
ExLlamaV2 / EXL2 measurements2023turboderpVariable bits-per-weight quantization with per-layer sensitivity measurement; safetensors-storedActive; ExLlamaV3 quantizers (2025) extended to GPTQ/AWQ on consumer GPUshttps://github.com/turboderp-org/exllamav2
GGUF quantization tags (Q4_K_M, Q5_K_S, Q6_K, IQ2_XS, etc.)2023llama.cppQuantization-type enum embedded as GGUF metadata; K-quants, I-quants, importance-matrix-derivedActive; the lingua franca of local-inference filenameshttps://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/README.md
Llama-cpp params.json2023Meta + llama.cppJSON sidecar from original Llama-1 release (dim, n_heads, n_layers, norm_eps); used by convert.py to produce GGUFLegacy; superseded by reading directly from HF config.jsonhttps://github.com/ggml-org/llama.cpp

Tier 3 family table — Hosted-API session / structured-output

FormatFirst appearedOriginTypeStatus (2026)URL
OpenAI Structured Outputs (response_format: json_schema)Aug 2024 (GPT-4o-2024-08-06)OpenAIStrict JSON-Schema-constrained decoding; Pydantic/Zod schemas auto-converted via SDKDominant pattern for function-calling-via-schema across the industryhttps://developers.openai.com/api/docs/guides/structured-outputs
OpenAI Realtime API session.update configOct 2024OpenAIJSON over WebSocket session config: voice, turn_detection, input_audio_format, instructions, tools, modalitiesActive; the spec for low-latency voice/multimodal sessionshttps://platform.openai.com/docs/guides/realtime
OpenAI Batch API JSONLApr 2024OpenAILine-delimited JSON; each line is {custom_id, method, url, body}; 24-hour SLA, 50% pricingActive; the workhorse for offline backfillshttps://platform.openai.com/docs/guides/batch
OpenAI Files / Assistants configNov 2023; v2 Apr 2024OpenAIJSON for assistants, threads, runs, vector_stores; tool configs (code_interpreter, file_search, function)Active (Assistants v2); Responses API is the supersession pathhttps://platform.openai.com/docs/assistants
Anthropic tool_use schemaMay 2024AnthropicJSON Schema input_schema per tool; tool_choice: auto/any/tool/noneActivehttps://docs.claude.com/en/docs/build-with-claude/tool-use
Anthropic Structured OutputsNov 14, 2025 (public beta)Anthropicoutput_format + strict: true on tools; grammar-constrained decoding compiled from JSON Schema; anthropic-beta: structured-outputs-2025-11-13 headerGA across Claude Mythos Preview / Opus 4.5–4.7 / Sonnet 4.5–4.6 / Haiku 4.5 as of late 2025 / 2026https://platform.claude.com/docs/en/build-with-claude/structured-outputs

Notable threads

  • config.json as the de facto model-artifact contract. No standards body ever sat down to specify “what a transformer model on disk looks like” — HuggingFace’s config.json (2018) plus tokenizer.json (2020) plus model.safetensors (2022) became the universal trio by sheer adoption. By 2024 every meaningful open-weight release ships exactly that layout, and downstream tools (vLLM, TGI, llama.cpp convert, MLX, Ollama, MLC) all read it. ONNX-as-IR was supposed to be the standard at the IR level, but practitioners learned to prefer “ship a HF directory and let the runtime do the conversion.”

  • GGUF v3 + embedded chat templates turned models into self-describing serving packages. The 2024 move to embed a Jinja2 chat template directly inside the GGUF metadata KV-store (along with tokenizer state and recommended generation params) closed the loop: a single .gguf file is now sufficient to load and prompt the model correctly without any sidecar config. llama.cpp implements its own C++ minja Jinja2 runtime to render these templates safely. The flip side is CVE-2024-34359 (llama-cpp-python), which demonstrated that template injection via a malicious GGUF metadata field is a real attack surface — the embedded-DSL convenience has security-DSL costs.

  • Safetensors displaced pickle for security reasons, then stayed for speed. The original case for safetensors was “pickle can execute arbitrary code on load and people are downloading random weights from the internet.” But the format also turned out to be faster (zero-copy memmap, sharded reads, GIL-free writes since the 2024 release) and cleaner (a JSON header you can head -c 8 and parse without loading tensors). As of March 2025, ~42% of Hub models are safetensors-tagged, and the percentage among new releases is approaching 100%.

  • ONNX’s relative slowdown for frontier LLMs. ONNX still wins on edge/mobile/heterogeneous-hardware deployment (Intel OpenVINO, mobile NPUs, classical-ML embedded), and opset 27 in ONNX 1.22.0 (2025) covers the modern transformer ops. But for frontier-scale LLM training and inference, PyTorch-native (FSDP, torch.compile, FlashAttention kernels) plus vLLM/TRT-LLM-style bespoke serving have crowded out the “export-to-IR” path. ONNX Runtime GenAI is the counter-bet, with phi/llama/gemma optimisations, but the LLM gravity is elsewhere.

  • vLLM vs TensorRT-LLM vs MLX as the 2026 three-way split. vLLM is the default OSS GPU server (PagedAttention, continuous batching, broad model coverage, healthy contributor community, joined the PyTorch Foundation in 2025); TensorRT-LLM wins peak throughput on NVIDIA hardware via AOT compilation; MLX owns Apple Silicon (Ollama switched its Mac backend to MLX in 2025; M5 Neural Accelerators came online with macOS 26.2). MLC LLM is the only universal-platform OSS LLM compiler, but its operational complexity has kept it niche outside browser/mobile.

  • The proliferation of quantization configs is a tiny self-contained DSL ecosystem. GGUF tags (Q4_K_M, Q5_K_S, Q6_K, IQ2_XXS, IQ3_XS), GPTQ quantize_config.json (bits, group_size, desc_act, damp_percent), AWQ’s activation-aware scaling, EXL2’s per-layer variable bits, AutoRound, FP8 (E4M3 / E5M2 / E4M3FNUZ / E5M2FNUZ in safetensors), 1.58-bit ternary (BitNet, added in transformers 4.51) — each is a tiny vocabulary of allowed values stored in JSON metadata. The combinatorial explosion (which method × which bits × which group size × which calibration set) means model-card READMEs have become the de facto documentation surface for quant choices.

  • OpenAI Structured Outputs set the function-calling-via-schema pattern. When OpenAI launched Structured Outputs in August 2024 with response_format: {type: "json_schema", json_schema: {...}, strict: true}, every other vendor followed within 12 months. Anthropic shipped its own Structured Outputs in public beta on 2025-11-13 with the same essential shape (JSON Schema → grammar-constrained decoding → guaranteed conformance). vLLM added the same surface in OSS. The convergence is striking: by 2026, “tool-calling” and “structured outputs” are the same API pattern, distinguished only by whether the result lands in content (structured output) or tool_use (tool call). JSON Schema, born as an OpenAPI/REST artifact, is now the central control surface of the LLM-API world.

Citations