NLP / Corpus / Linguistic Annotation DSLs Family Index
type: language-family-index family: nlp-corpus languages_catalogued: 28 tags: [language-reference, family-index, nlp-corpus, tei, conll-u, universal-dependencies, propbank, framenet, elan, praat, brat, huggingface-datasets]
NLP / Corpus / Linguistic Annotation — Family Index
Family overview
Corpus and linguistic-annotation DSLs are the textual languages used to encode digital text collections, parsed sentences, dependency trees, semantic frames, and multimedia transcriptions. Unlike the streaming-SQL or i18n families, they do not converge on a single syntax — each subfamily reflects a particular era and a particular research community. The oldest and broadest is TEI (Text Encoding Initiative), an XML vocabulary born in 1987 and codified in TEI P5 in November 2007 (current 4.11.0, 18 February 2026); it remains the cornerstone of digital humanities, encoding everything from critical editions and manuscript descriptions to historical corpora and linguistic markup.
The 2000s saw the rise of shared-task tab-separated formats anchored by the CoNLL (Conference on Natural Language Learning) shared tasks: CoNLL-2003 (named-entity recognition), CoNLL-2009 (semantic-role labelling), and most importantly the CoNLL-U format that underlies Universal Dependencies (UD). UD is the dominant multilingual-treebank standard of the modern era: v2.17 was released in May 2026 as the twenty-third release, covering 200+ treebanks across 100+ languages with consistent dependency-relation labels (nsubj, obj, obl, nmod, etc.) and a universal POS tagset. CoNLL-U’s ten-column tab-separated representation (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) is now the lingua franca that every modern multilingual parser (Stanza, spaCy, UDPipe, Trankit) reads and writes.
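
The ten-column layout described above can be read with very little code. The following is a minimal sketch in plain Python, not the reader used by any of the parsers named here; the `read_conllu` function and the sample sentence are illustrative assumptions, and multiword-token ranges and empty nodes are passed through as ordinary rows without special handling.

```python
from typing import Dict, Iterator, List

# Column names in the order the CoNLL-U specification lists them.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of {column: value} dicts."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():            # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):        # sentence-level comment / metadata
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:                        # file without a trailing blank line
        yield sentence


# Hypothetical sample sentence, hand-written for this sketch.
SAMPLE = (
    "# text = The cat sleeps.\n"
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\tVBZ\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for tok in sentences[0]:
    print(tok["ID"], tok["FORM"], tok["UPOS"], tok["HEAD"], tok["DEPREL"])
```

Keeping every value as a string is a deliberate simplification: ID can be a range or a decimal in real treebanks, so converting it to an integer is left to the caller.
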
Parallel to syntax, the semantic-annotation tradition produced PropBank (verb-oriented predicate-argument numbering, Arg0/Arg1/…), FrameNet (frame-semantic abstractions over related lexical units), and VerbNet (verb-class hierarchies), with SemLink mapping between all three. The parallel-corpus / translation-memory layer — TMX, OPUS, Moses phrase tables, XCES — supports machine translation, and the multimedia/discourse-annotation lineage (ELAN .eaf, Praat TextGrid, EXMARaLDA, TalkBank CHAT) handles time-aligned audio and video. ELAN (Max Planck Institute for Psycholinguistics, current 7.1 of April 2026) and Praat (6.4.64, April 2026) are the workhorses of phonetic and sign-language research.
The LLM era has reshaped the field: HuggingFace Datasets (Apache Arrow-backed, current 4.x in 2026) and JSONL-with-custom-schema corpora have eaten most of the “training corpus” niche, while BRAT and its successor INCEpTION (active 2026; WebAnno itself archived) continue the standoff-annotation tradition for entity/relation tagging. spaCy’s .spacy binary serialization (DocBin) is the de facto Python NLP exchange format, and Sketch Engine’s CQL remains a bona fide corpus-query language for concordance research.
In our deep library
None of these formats have standalone Tier-1/2 deep-library notes — they are all data-exchange vocabularies hosted in XML, tab-separated text, JSON, or JSON-LD. Cross-reference:
- i18n-locale — TMX (Translation Memory eXchange) is dual-classified here; the i18n side covers software-localisation workflows, this index covers the parallel-corpus / machine-translation side.
- citation-formats — TEI overlaps heavily with humanities text-markup and bibliographic encoding (`<biblStruct>`, `<listBibl>`).
- api-description — XML-schema-based formats (TEI, XCES, PAULA-XML, TIGER-XML, GrAF) share the XSD ecosystem.
- notation-spec — formal-grammar adjacent (CFG / dependency-grammar formalisms).
- document-typesetting — TEI documents are frequently typeset via TEI-XSL → LaTeX / HTML pipelines.
- scientific — corpus-statistics tooling overlaps with R / Python data-science stacks.
- ai-prompt-languages — modern LLM training corpora (HuggingFace Datasets, JSONL) sit at the boundary.
- python — host language for spaCy, Stanza, NLTK, HuggingFace Datasets, Trankit; almost all 2020s NLP tooling.
Tier 3 family table — Treebank / dependency / parse formats
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| CoNLL-U | 2014 | Universal Dependencies project (Joakim Nivre et al.) | 10-column tab-separated; one token per line, blank lines separate sentences; encodes ID/FORM/LEMMA/UPOS/XPOS/FEATS/HEAD/DEPREL/DEPS/MISC | Very active, the de facto multilingual-treebank format under UD v2.17 (May 2026) | https://universaldependencies.org/format.html |
| CoNLL-U Plus | 2017 | Universal Dependencies | Extension of CoNLL-U allowing additional user-defined columns declared in a header line; backward-compatible | Active, niche extension | https://universaldependencies.org/ext-format.html |
| CoNLL-2003 | 2003 | CoNLL shared task on NER (Tjong Kim Sang & De Meulder) | 4-column tab-separated: token, POS, chunk-tag, NER-tag (BIO/IOB2) | Legacy but still the NER reference; HuggingFace conll2003 dataset remains a standard benchmark | https://www.clips.uantwerpen.be/conll2003/ner/ |
| CoNLL-2009 | 2009 | CoNLL shared task on syntactic + semantic dependencies | 14-column tab-separated with separate gold/predicted columns and PropBank-style predicate sense + arguments | Legacy, superseded by CoNLL-U for syntax + PropBank-3 for semantics | https://ufal.mff.cuni.cz/conll2009-st/task-description.html |
| Penn Treebank format | 1992 | UPenn (Marcus, Santorini, Marcinkiewicz) | Bracketed parse trees (S (NP (DT the) (NN cat)) (VP (VBZ is) ...)); encodes phrase-structure constituents | Legacy but evergreen; still the reference for constituency-parsing benchmarks (PTB WSJ §23) | https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf |
| TIGER-XML | 2002 | TIGER project (Saarbrücken / Stuttgart / Potsdam) | XML representation of German treebank graphs with crossing/non-projective edges and secondary edges | Legacy, primarily for the TIGER corpus and downstream German parsers | https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/ |
| NEGRA export format | 1997 | NEGRA project (Saarbrücken) | Tab-separated phrase-structure format for German treebanking; predecessor to TIGER-XML | Legacy | https://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/ |
| TüBa-D/Z format | 2000s | University of Tübingen (Erhard Hinrichs et al.) | Tabular phrase-structure with topological-field annotation specific to German | Legacy / archival (TüBa-D/Z release 11 was final) | https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-dz/ |
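
As an illustration of the bracketed Penn Treebank notation in the table above, here is a minimal sketch of a recursive-descent reader that turns one bracketed tree into nested `(label, children)` tuples and lists its tagged leaves. The function names `parse_ptb` and `tagged_leaves` are hypothetical, and the sketch assumes a single labelled root; the extra unlabelled outer bracket found in some distributed treebank files is not handled.

```python
from typing import List, Tuple


def parse_ptb(text: str):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1                              # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def tagged_leaves(tree) -> List[Tuple[str, str]]:
    """Return (POS tag, word) pairs in left-to-right order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs: List[Tuple[str, str]] = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_leaves(child))
    return pairs


tree = parse_ptb("(S (NP (DT the) (NN cat)) (VP (VBZ is) (ADJP (JJ asleep))))")
print(tagged_leaves(tree))
```

Splitting on parentheses is safe for this notation because literal brackets in the text are conventionally escaped as `-LRB-` and `-RRB-`.
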
Tier 3 family table — Semantic / frame annotation
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| PropBank annotation | 2002 | UPenn / Colorado (Martha Palmer, Paul Kingsbury) | Verb-frame files (.xml) declaring senses; annotation files attach Arg0/Arg1/…/ArgM-* labels to constituent spans on PTB or UD trees | Active, the dominant SRL standard; PropBank-3 ongoing | https://propbank.github.io/ |
| FrameNet annotation | 1997 | ICSI Berkeley (Charles Fillmore et al.) | XML lexicon of frames + frame elements + lexical units; annotation files attach FE labels to spans; supports cross-frame relations (Inherits-from, Uses, etc.) | Active, FrameNet 1.7+ widely used; Berkeley project ongoing | https://framenet.icsi.berkeley.edu/ |
| VerbNet | 2000 | Karin Kipper Schuler (UPenn) | XML verb-class hierarchy with thematic roles, syntactic frames, and semantic predicates; descended from Levin’s verb classes | Active, VerbNet 3.4 current; SemLink bridges to PropBank/FrameNet | https://verbs.colorado.edu/verbnet/ |
| WordNet | 1985 | Princeton (George A. Miller) | Lexical database of synsets with hypernym/hyponym/meronym relations; distributed as flat-file Prolog-like format + XML/RDF exports | Active, WordNet 3.1 long-stable; Open English WordNet (OEWN) the active 2020s descendant | https://wordnet.princeton.edu/ |
| Open Multilingual Wordnet (OMW) | 2010s | NTU + community | Aggregates 30+ wordnets across languages into a common synset-aligned format; LMF-compatible | Active | https://omwn.org/ |
| NIF (NLP Interchange Format) / CoNLL-RDF | 2013 | DBpedia / FREME projects | Linked-data NLP: RDF triples express tokens, spans, annotations using NIF Core Ontology; CoNLL-RDF lifts CoNLL columns into RDF | Active but niche; mostly in semantic-web NLP | https://persistence.uni-leipzig.org/nlp2rdf/ |
| LMF (Lexical Markup Framework, ISO 24613) | 2008 (ISO) | ISO TC 37/SC 4 | XML meta-model for lexicons; standardises the abstract structure for monolingual/multilingual/morphological lexicons | Active standard, revised editions 2019–2024 | https://www.iso.org/standard/68820.html |
Tier 3 family table — Multimedia / discourse annotation
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| ELAN .eaf (EUDICO Annotation Format) | 2002 | Max Planck Institute for Psycholinguistics, Nijmegen | XML format for time-aligned multi-tier annotation of audio/video; tiers can be independent, symbolic-association, or time-subdivision | Very active, ELAN 7.1 (April 2026) is the workhorse of sign-language, fieldwork, and gesture research | https://archive.mpi.nl/tla/elan |
| Praat .TextGrid | ~1995 | Paul Boersma & David Weenink, University of Amsterdam | Plain-text or short-text format for time-aligned interval/point tiers; paired with .wav for phonetic analysis | Very active, Praat 6.4.64 (April 2026) is the dominant phonetic-analysis tool | https://www.fon.hum.uva.nl/praat/ |
| EXMARaLDA | 2002 | University of Hamburg (Thomas Schmidt) | XML “Basic Transcription” + “Segmented Transcription” for spoken-discourse and multilingual corpora; converters to/from ELAN, Praat, TEI | Active, niche in spoken-discourse communities | https://exmaralda.org/ |
| TalkBank CHAT format | 1984 | Brian MacWhinney (CMU) | Plain-text transcription format with *SPEAKER: lines and %dependent: tiers; underlying format for the entire TalkBank federation (CHILDES, AphasiaBank, etc.) | Very active, CHILDES + TalkBank remain canonical conversation corpora | https://talkbank.org/ |
| CLAN (CHAT) toolkit format | 1984 | Brian MacWhinney (CMU) | Companion to CHAT; CLAN is the analysis toolkit that reads/writes CHAT and produces frequency, MLU, and other developmental metrics | Active | https://dali.talkbank.org/clan/ |
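
The `*SPEAKER:` and `%dependent:` line structure of CHAT noted in the table can be sketched as follows. This is an illustrative reader, not CLAN's own parser: `parse_chat` is a hypothetical helper, the transcript is invented for the example, and only the basic line types (`@` headers, `*` main tiers, `%` dependent tiers, tab-indented continuation lines) are covered.

```python
from typing import Any, Dict, List


def parse_chat(text: str) -> List[Dict[str, Any]]:
    """Group each main-tier utterance with the dependent tiers that follow it."""
    # Join continuation lines (which begin with a tab) onto the previous line.
    logical: List[str] = []
    for raw in text.splitlines():
        if raw.startswith("\t") and logical:
            logical[-1] += " " + raw.strip()
        elif raw.strip():
            logical.append(raw.rstrip())

    utterances: List[Dict[str, Any]] = []
    for line in logical:
        if line.startswith("@"):
            continue                          # file-level header, skipped here
        marker, _, content = line.partition(":")
        content = content.strip()
        if line.startswith("*"):              # main tier: *CHI:, *MOT:, ...
            utterances.append({"speaker": marker[1:], "text": content, "tiers": {}})
        elif line.startswith("%") and utterances:
            utterances[-1]["tiers"][marker[1:]] = content   # %mor:, %gra:, ...
    return utterances


# Invented transcript fragment for illustration only.
SAMPLE = (
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*CHI:\tmore cookie .\n"
    "%mor:\tqn|more n|cookie .\n"
    "*MOT:\tyou want another\n"
    "\tcookie ?\n"
    "@End\n"
)

utterances = parse_chat(SAMPLE)
for utt in utterances:
    print(utt["speaker"], "|", utt["text"], "|", utt["tiers"])
```
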
Tier 3 family table — Corpus encoding / translation memory
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| TEI P5 (Text Encoding Initiative) | 1990 (P1) / 2007 (P5) / 4.11.0 (Feb 2026) | TEI Consortium | XML vocabulary for digital humanities + linguistic markup; 500+ elements organised into modules (core, header, linguistic, msdesc, drama, verse, etc.) | Very active, the cornerstone of DH; ~six-month release cadence | https://tei-c.org/guidelines/p5/ |
| TMX (Translation Memory eXchange) | 1998 | LISA / now OASIS | XML format for exchanging translation memories between CAT tools; bilingual aligned segments with metadata | Active, TMX 1.4b is the long-stable industry standard; cross-listed in i18n-locale | https://www.gala-global.org/lisa-oscar-standards |
| XCES (XML Corpus Encoding Standard) | 2000 | Vassar / Nancy (Nancy Ide, Laurent Romary) | XML-Schema corpus encoding refinement of CES; defines header + text + alignment + linguistic-annotation XML | Legacy, mostly referenced for older parallel corpora (e.g. OPUS) | https://www.xces.org/ |
| CES (Corpus Encoding Standard) | 1996 | EAGLES / Vassar | SGML predecessor of XCES; pre-XML corpus encoding | Legacy / historical | http://www.cs.vassar.edu/CES/ |
| GrAF (Graph Annotation Framework, ISO 24612) | 2008 (ISO) | ISO TC 37/SC 4 (Nancy Ide, Keith Suderman) | XML pivot format for representing arbitrary linguistic annotations as labelled directed graphs over a node-set | Active standard, pivot format for ANC (American National Corpus) | https://www.iso.org/standard/37326.html |
| PAULA-XML | ~2006 | SFB 632, University of Potsdam | XML standoff format for arbitrarily layered linguistic annotation; used by ANNIS corpus query system | Active within ANNIS-using projects | https://www.sfb632.uni-potsdam.de/en/paula.html |
| Moses phrase-table format | 2007 | Edinburgh (Philipp Koehn et al.) | Pipe-separated text file (fields delimited by `\|\|\|`, beginning with the source phrase) | | http://www.statmt.org/moses/ |
| OPUS aligned-corpus formats | 2004 | University of Helsinki (Jörg Tiedemann) | Distributes parallel corpora in TMX, Moses, plain-text-aligned, and XCES; the canonical free parallel-corpus repository | Very active, the go-to source for parallel data | https://opus.nlpl.eu/ |
| Apertium bidix / monodix | 2005 | Universitat d’Alacant / Apertium project | XML dictionary formats for shallow-transfer rule-based MT: monolingual morphological dictionaries (monodix) + bilingual transfer dictionaries (bidix) | Active, focused on low-resource and Iberian/Turkic/Romance language pairs | https://wiki.apertium.org/wiki/Bilingual_dictionary |
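
To make the "bilingual aligned segments" of the TMX row concrete, here is a minimal sketch that extracts sentence pairs from a TMX document with the standard-library `xml.etree.ElementTree`. The `read_tmx` helper and the sample document are assumptions made for this example; real translation memories carry more metadata and inline markup than shown.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample; the third unit has no German variant on purpose.
SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="de"><seg>Danke.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Goodbye.</seg></tuv>
      <tuv xml:lang="fr"><seg>Au revoir.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(xml_text: str, src: str, tgt: str) -> List[Tuple[str, str]]:
    """Return (source, target) segment pairs for the two language codes."""
    root = ET.fromstring(xml_text)
    pairs: List[Tuple[str, str]] = []
    for tu in root.iter("tu"):                 # one translation unit
        segs = {}
        for tuv in tu.findall("tuv"):          # one language variant
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text inside inline markup elements.
                segs[lang.lower()] = "".join(seg.itertext())
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = read_tmx(SAMPLE, "en", "de")
for source, target in pairs:
    print(source, "=>", target)
```

Units that lack one of the two requested languages are skipped rather than padded, which keeps the output strictly parallel.
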
Tier 3 family table — Modern ML-NLP / lexicon / query
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| spaCy DocBin (.spacy) | 2019 (spaCy v2.2+) | Explosion AI (Matthew Honnibal, Ines Montani) | Binary serialization of Doc objects via msgpack; compact storage of tokens, spans, entities, and custom attributes | Very active, spaCy 3.8.14 current (March 2026); v4 in development | https://spacy.io/api/docbin |
| Stanza / Stanford CoreNLP CoNLL-U output | 2020 (Stanza) / 2010 (CoreNLP) | Stanford NLP Group | Stanza emits standard CoNLL-U; CoreNLP also emits TSV, JSON, XML, and Penn Treebank | Active | https://stanfordnlp.github.io/stanza/ |
| HuggingFace Datasets schema | 2020 | HuggingFace | Python/Arrow-backed loader with per-dataset features schema (Value, ClassLabel, Sequence, Translation, etc.); on-disk format is Apache Arrow + Parquet | Very active, the de facto ML-era corpus-exchange layer; dataset cards in YAML+Markdown | https://huggingface.co/docs/datasets/ |
| BRAT standoff (.ann + .txt) | 2012 | NLPlab (Sampo Pyysalo et al.) | Plain-text standoff: T1 Person 0 4 Mary for text-bound annotations (start and end character offsets, end exclusive), R1 Located Arg1:T1 Arg2:T2 for relations, E1, A1, N1, #1 for events, attributes, normalizations, notes | Legacy (BRAT itself archived) but its format is still widely produced/consumed | https://brat.nlplab.org/standoff.html |
| WebAnno / INCEpTION CAS XMI | 2013 (WebAnno) / 2018 (INCEpTION) | TU Darmstadt UKP Lab | UIMA CAS-based XMI serialization for layered text annotation; INCEpTION is the active successor to the archived WebAnno | INCEpTION active 2026; WebAnno archived | https://inception-project.github.io/ |
| Sketch Engine CQL (Corpus Query Language) | 2003 | Lexical Computing (Adam Kilgarriff, Pavel Rychlý) | Token-pattern query language: [lemma="run"] [tag="N.*"], regex over attributes, within/containing structure operators; implemented in Manatee | Very active, the standard corpus-concordance query language | https://www.sketchengine.eu/documentation/corpus-querying/ |
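
The BRAT standoff lines summarised in the table can be read with a short sketch like the one below. It is illustrative only: `parse_ann`, `TextBound`, and `Relation` are hypothetical names, the sample text and annotations are invented, and only contiguous text-bound (`T`) and relation (`R`) lines are handled; events, attributes, normalizations, notes, and discontinuous spans are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBound:
    id: str
    type: str
    start: int   # character offset into the .txt file
    end: int     # exclusive end offset
    text: str


@dataclass
class Relation:
    id: str
    type: str
    args: Dict[str, str]   # role name -> annotation id


def parse_ann(ann: str) -> Tuple[Dict[str, TextBound], List[Relation]]:
    entities: Dict[str, TextBound] = {}
    relations: List[Relation] = []
    for line in ann.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Second field is "TYPE START END" for a contiguous span.
            type_, start, end = fields[1].split(" ")
            entities[ann_id] = TextBound(ann_id, type_, int(start), int(end), fields[2])
        elif ann_id.startswith("R"):
            type_, *pairs = fields[1].split(" ")
            args = dict(pair.split(":", 1) for pair in pairs)
            relations.append(Relation(ann_id, type_, args))
        # Other line types (E, A, N, #) are skipped in this sketch.
    return entities, relations


# Invented source text and standoff annotations.
TEXT = "Mary lives in Paris."
ANN = (
    "T1\tPerson 0 4\tMary\n"
    "T2\tLocation 14 19\tParis\n"
    "R1\tLocated Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(ANN)
for ent in entities.values():
    # The offsets index the untouched source text directly.
    print(ent.id, ent.type, repr(TEXT[ent.start:ent.end]))
for rel in relations:
    print(rel.id, rel.type, rel.args)
```

Checking that `TEXT[start:end]` equals the stored covered text is a cheap way to detect an `.ann` file that has drifted out of sync with its `.txt`.
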
Notable threads
- TEI’s 36-year reign (counting from P1 in 1990) as the digital-humanities cornerstone. Begun in 1987 as an SGML application, codified as P5/XML in 2007, and steadily refined on a six-month cadence (4.11.0, February 2026), TEI has outlasted every competing humanities-markup proposal because it is genuinely a guideline + schema generator (ODD / Roma) rather than a fixed DTD. Critical editions, manuscript catalogues, historical newspapers, drama corpora, and linguistic samplers all sit in TEI; the EpiDoc subset is the standard for inscriptions and papyri; the MEI sibling handles music. Its longevity is a study in slow, consensus-driven standards work, the opposite of the “move fast” ML-NLP ecosystem.
- Universal Dependencies as the breakthrough multilingual treebank standard. Before UD (Stanford Dependencies → UD v1 2014 → v2 2017), every language had its own treebank format and its own dependency-relation inventory; cross-lingual parsing was largely impossible. UD’s design choices (content-word-headed dependencies, a fixed inventory of ~37 universal relations, the UPOS tagset, the FEATS morphological-feature inventory) let a single parser architecture train on 200+ treebanks across 100+ languages. v2.17 (May 2026) is the twenty-third release; the project’s twice-yearly cadence and rigorous validation tooling are themselves a model of how a community-driven linguistic standard can scale.
- CoNLL-U as the format that ate the field. UD’s 10-column tab-separated CoNLL-U is now the format every modern parser reads and writes: Stanza, spaCy (via converters), UDPipe, Trankit, COMBO, and the recent BERT-based dependency parsers. The format is plain-text, diff-friendly, grep-friendly, and trivially streamed; the DEPS column carries enhanced (graph-not-tree) dependencies; the MISC column is a flexible attribute bucket that has absorbed SpaceAfter, alignment, lemma-confidence, even STREUSLE supersenses in v2.17. CoNLL-U’s success is a vindication of “tab-separated text with a strict spec” over more elaborate XML encodings.
- The BRAT / INCEpTION standoff tradition. BRAT (NLPlab, 2012) made the “annotations live in a separate `.ann` file referencing character offsets into an untouched `.txt`” model the default for entity-and-relation annotation projects, biomedical NLP especially. BRAT itself is archived but its format remains widely produced; INCEpTION (TU Darmstadt UKP Lab) is the actively developed successor, adding human-in-the-loop ML assistance, knowledge-base entity linking, and UIMA CAS XMI serialization. The standoff principle (never modify the source text, layer annotations on top) is now baked into virtually every modern annotation tool.
- ELAN as the multimedia-linguistics workhorse. ELAN (Max Planck Institute for Psycholinguistics, current 7.1 of April 2026) is the dominant tool for time-aligned video-and-audio annotation: sign languages, gesture studies, language documentation, multimodal interaction. Its `.eaf` XML format supports unlimited tiers with rich parent-child relationships (time-subdivision, symbolic-association, included-in). Praat TextGrids cover the speech-acoustic niche, EXMARaLDA the spoken-discourse niche, and TalkBank CHAT the conversation-and-development niche, but ELAN is uniquely positioned for multimodal field linguistics.
- The LLM era’s effect on “corpus” formats. Before 2018, an NLP “corpus” usually meant CoNLL-U, TEI-XML, or BRAT standoff with hand-curated annotations. After 2018, “corpus” increasingly means a HuggingFace Datasets entry: Apache Arrow on disk, a YAML+Markdown dataset card, often just JSONL with a few fields. The painstaking gold-annotation tradition still exists (UD, PropBank, BRAT projects) and remains essential for evaluation, but the training corpora are now web-scale unannotated text. spaCy’s `.spacy` DocBin sits in between: a binary format optimised for ML pipelines that still carries linguistic annotations.
- Sketch Engine CQL as a bona fide corpus query language. Most “corpus formats” are passive data containers, but CQL is an actual query language: token-pattern matching with regex over morphological attributes (`[lemma="run" & tag="V.*"]`), structure operators (`within <s>`, `containing <np>`), Boolean combinations, and distance constraints. It powers Sketch Engine’s word-sketch grammars and concordance searches and is the closest thing the corpus-linguistics world has to a SQL equivalent. The IMS Corpus Workbench (CWB) CQP variant is the closely related open-source ancestor.
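
The token-pattern idea behind CQL can be illustrated with a toy matcher. This is a sketch under strong assumptions, not how Manatee or CQP evaluate queries: `compile_query` and `find_matches` are hypothetical helpers, only bracketed positions with `attr="regex"` conditions joined by `&` are supported, each regex is assumed to match the whole attribute value, and regexes containing `]` are not handled. Structure operators such as `within`, repetition, and distance constraints are left out.

```python
import re
from typing import Dict, List, Pattern, Tuple

POSITION_RE = re.compile(r"\[([^\]]*)\]")           # one [...] token position
CONDITION_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')  # attr="regex"

Position = List[Tuple[str, Pattern[str]]]


def compile_query(query: str) -> List[Position]:
    """Turn '[a="x" & b="y"] [c="z"]' into per-position attribute tests."""
    positions: List[Position] = []
    for body in POSITION_RE.findall(query):
        positions.append([(attr, re.compile(rx))
                          for attr, rx in CONDITION_RE.findall(body)])
    return positions


def find_matches(tokens: List[Dict[str, str]], query: str) -> List[Tuple[int, int]]:
    """Return (start, end) token spans where every position's tests succeed."""
    pattern = compile_query(query)
    if not pattern:
        return []
    hits: List[Tuple[int, int]] = []
    for i in range(len(tokens) - len(pattern) + 1):
        if all(rx.fullmatch(tokens[i + j].get(attr, ""))
               for j, conds in enumerate(pattern)
               for attr, rx in conds):
            hits.append((i, i + len(pattern)))
    return hits


# Invented, pre-annotated token sequence.
TOKENS = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "runs", "lemma": "run", "tag": "VBZ"},
    {"word": "marathons", "lemma": "marathon", "tag": "NNS"},
    {"word": "and", "lemma": "and", "tag": "CC"},
    {"word": "ran", "lemma": "run", "tag": "VBD"},
    {"word": "yesterday", "lemma": "yesterday", "tag": "NN"},
]

for start, end in find_matches(TOKENS, '[lemma="run"] [tag="N.*"]'):
    print(" ".join(tok["word"] for tok in TOKENS[start:end]))
```

A production engine answers such queries from inverted indexes over each attribute rather than by scanning tokens linearly; the scan here is only meant to show what a query means.
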
Citations
- TEI P5 Guidelines (current 4.11.0, 18 February 2026): https://tei-c.org/guidelines/p5/
- TEI P5 release archive (Zenodo DOI): https://doi.org/10.5281/zenodo.3413524
- Universal Dependencies home: https://universaldependencies.org/
- CoNLL-U format spec: https://universaldependencies.org/format.html
- CoNLL-U Plus extension: https://universaldependencies.org/ext-format.html
- UD v2 specifications: https://universaldependencies.org/v2/
- PropBank: https://propbank.github.io/
- FrameNet (Berkeley): https://framenet.icsi.berkeley.edu/
- VerbNet (Colorado): https://verbs.colorado.edu/verbnet/
- WordNet (Princeton): https://wordnet.princeton.edu/
- Open Multilingual Wordnet: https://omwn.org/
- NIF / NLP Interchange Format: https://persistence.uni-leipzig.org/nlp2rdf/
- ISO 24613 (LMF): https://www.iso.org/standard/68820.html
- ISO 24612 (GrAF): https://www.iso.org/standard/37326.html
- ELAN (MPI Nijmegen, current 7.1 April 2026): https://archive.mpi.nl/tla/elan
- ELAN release notes: https://archive.mpi.nl/tla/elan/release-notes
- Praat (current 6.4.64 April 2026): https://www.fon.hum.uva.nl/praat/
- EXMARaLDA: https://exmaralda.org/
- TalkBank / CHILDES + CHAT format: https://talkbank.org/
- BRAT standoff format: https://brat.nlplab.org/standoff.html
- INCEpTION (active WebAnno successor): https://inception-project.github.io/
- OPUS parallel corpora: https://opus.nlpl.eu/
- TMX 1.4b spec (GALA / OSCAR): https://www.gala-global.org/lisa-oscar-standards
- Moses SMT: http://www.statmt.org/moses/
- Apertium platform: https://www.apertium.org/
- Apertium bidix wiki: https://wiki.apertium.org/wiki/Bilingual_dictionary
- spaCy DocBin (`.spacy`): https://spacy.io/api/docbin
- Stanza (Stanford NLP, Python): https://stanfordnlp.github.io/stanza/
- HuggingFace Datasets: https://huggingface.co/docs/datasets/
- Sketch Engine CQL: https://www.sketchengine.eu/documentation/corpus-querying/
- Penn Treebank annotation guidelines: https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf
- TIGER corpus: https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/
- CoNLL-2003 NER shared task: https://www.clips.uantwerpen.be/conll2003/ner/
- CoNLL-2009 shared task: https://ufal.mff.cuni.cz/conll2009-st/task-description.html