NLP / Corpus / Linguistic Annotation DSLs Family Index

type: language-family-index
family: nlp-corpus
languages_catalogued: 28
tags: [language-reference, family-index, nlp-corpus, tei, conll-u, universal-dependencies, propbank, framenet, elan, praat, brat, huggingface-datasets]

NLP / Corpus / Linguistic Annotation — Family Index

Family overview

Corpus and linguistic-annotation DSLs are the textual languages used to encode digital text collections, parsed sentences, dependency trees, semantic frames, and multimedia transcriptions. Unlike the streaming-SQL or i18n families, they do not converge on a single syntax — each subfamily reflects a particular era and a particular research community. The oldest and broadest is TEI (Text Encoding Initiative), an XML vocabulary born in 1987 and codified in TEI P5 in November 2007 (current 4.11.0, 18 February 2026); it remains the cornerstone of digital humanities, encoding everything from critical editions and manuscript descriptions to historical corpora and linguistic markup.

The 2000s saw the rise of shared-task tab-separated formats anchored by the CoNLL (Conference on Natural Language Learning) shared tasks: CoNLL-2003 (named-entity recognition), CoNLL-2009 (semantic-role labelling), and most importantly the CoNLL-U format that underlies Universal Dependencies (UD). UD is the dominant multilingual-treebank standard of the modern era: v2.17 was released in May 2026 as the twenty-third release, covering 200+ treebanks across 100+ languages with consistent dependency-relation labels (nsubj, obj, obl, nmod, etc.) and a universal POS tagset. CoNLL-U’s ten-column tab-separated representation (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) is now the lingua franca that every modern multilingual parser (Stanza, spaCy, UDPipe, Trankit) reads and writes.
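The ten-column layout described above is simple enough to read with the standard library alone. The following is a minimal sketch, not a replacement for a validated reader: the sample sentence is hand-written for illustration, and the helper names `read_conllu` and `dependency_edges` are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Column names in the order given by the CoNLL-U specification.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Hand-written sample sentence (illustrative, not taken from a treebank).
SAMPLE = (
    "# text = The cat sleeps.\n"
    "1\tThe\tthe\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\t_\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():              # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):        # sentence-level comment or metadata
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(f"expected 10 columns, got {len(fields)}")
            sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:                          # input without a trailing blank line
        yield sentence

def dependency_edges(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (head_form, deprel, dependent_form) triples; HEAD 0 is the root."""
    forms = {tok["ID"]: tok["FORM"] for tok in sentence}
    return [(forms.get(tok["HEAD"], "ROOT"), tok["DEPREL"], tok["FORM"])
            for tok in sentence
            if tok["ID"].isdigit()]       # skip multiword ranges (1-2) and empty nodes (1.1)

sentences = list(read_conllu(SAMPLE))
edges = dependency_edges(sentences[0])
for head, rel, dep in edges:
    print(f"{head} --{rel}--> {dep}")
```

Keeping each token as a plain dict keyed by column name mirrors the spec directly; production code would more likely rely on one of the parsers named above.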

Parallel to syntax, the semantic-annotation tradition produced PropBank (verb-oriented predicate-argument numbering, Arg0/Arg1/…), FrameNet (frame-semantic abstractions over related lexical units), and VerbNet (verb-class hierarchies), with SemLink mapping between all three. The parallel-corpus / translation-memory layer — TMX, OPUS, Moses phrase tables, XCES — supports machine translation, and the multimedia/discourse-annotation lineage (ELAN .eaf, Praat TextGrid, EXMARaLDA, TalkBank CHAT) handles time-aligned audio and video. ELAN (Max Planck Institute for Psycholinguistics, current 7.1 of April 2026) and Praat (6.4.64, April 2026) are the workhorses of phonetic and sign-language research.

The LLM era has reshaped the field: HuggingFace Datasets (Apache Arrow-backed, current 4.x in 2026) and JSONL-with-custom-schema corpora have eaten most of the “training corpus” niche, while BRAT and INCEpTION (active in 2026, the successor to the now-archived WebAnno) continue the standoff-annotation tradition for entity/relation tagging. spaCy’s .spacy binary serialization (DocBin) is the de facto Python NLP exchange format, and Sketch Engine’s CQL remains a bona fide corpus-query language for concordance research.

In our deep library

None of these formats has a standalone Tier-1/2 deep-library note; they are all data-exchange vocabularies hosted in XML, tab-separated text, JSON, or JSON-LD. Cross-references:

i18n-locale — TMX (Translation Memory eXchange) is dual-classified here; the i18n side covers software-localisation workflows, this index covers the parallel-corpus / machine-translation side.
citation-formats — TEI overlaps heavily with humanities text-markup and bibliographic encoding (<biblStruct>, <listBibl>).
api-description — XML-schema-based formats (TEI, XCES, PAULA-XML, TIGER-XML, GrAF) share the XSD ecosystem.
notation-spec — formal-grammar adjacent (CFG / dependency-grammar formalisms).
document-typesetting — TEI documents are frequently typeset via TEI-XSL → LaTeX / HTML pipelines.
scientific — corpus-statistics tooling overlaps with R / Python data-science stacks.
ai-prompt-languages — modern LLM training corpora (HuggingFace Datasets, JSONL) sit at the boundary.
python — host language for spaCy, Stanza, NLTK, HuggingFace Datasets, Trankit; almost all 2020s NLP tooling.

Tier 3 family table — Treebank / dependency / parse formats

Format	First appeared	Origin	Type	Status (2026)	URL
CoNLL-U	2014	Universal Dependencies project (Joakim Nivre et al.)	10-column tab-separated; one token per line, blank lines separate sentences; encodes ID/FORM/LEMMA/UPOS/XPOS/FEATS/HEAD/DEPREL/DEPS/MISC	Very active, the de facto multilingual-treebank format under UD v2.17 (May 2026)	https://universaldependencies.org/format.html
CoNLL-U Plus	2017	Universal Dependencies	Extension of CoNLL-U allowing additional user-defined columns declared in a header line; backward-compatible	Active, niche extension	https://universaldependencies.org/ext-format.html
CoNLL-2003	2003	CoNLL shared task on NER (Tjong Kim Sang & De Meulder)	4-column space-separated: token, POS, chunk-tag, NER-tag (IOB1 in the original release, often redistributed as IOB2)	Legacy but still the NER reference; HuggingFace `conll2003` dataset remains a standard benchmark	https://www.clips.uantwerpen.be/conll2003/ner/
CoNLL-2009	2009	CoNLL shared task on syntactic + semantic dependencies	14-column tab-separated with separate gold/predicted columns and PropBank-style predicate sense + arguments	Legacy, superseded by CoNLL-U for syntax + PropBank-3 for semantics	https://ufal.mff.cuni.cz/conll2009-st/task-description.html
Penn Treebank format	1992	UPenn (Marcus, Santorini, Marcinkiewicz)	Bracketed parse trees `(S (NP (DT the) (NN cat)) (VP (VBZ is) ...))`; encodes phrase-structure constituents	Legacy but evergreen; still the reference for constituency-parsing benchmarks (PTB WSJ §23)	https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf
TIGER-XML	2002	TIGER project (Saarbrücken / Stuttgart / Potsdam)	XML representation of German treebank graphs with crossing/non-projective edges and secondary edges	Legacy, primarily for the TIGER corpus and downstream German parsers	https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/
NEGRA export format	1997	NEGRA project (Saarbrücken)	Tab-separated phrase-structure format for German treebanking; predecessor to TIGER-XML	Legacy	https://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
TüBa-D/Z format	2000s	University of Tübingen (Erhard Hinrichs et al.)	Tabular phrase-structure with topological-field annotation specific to German	Legacy / archival (TüBa-D/Z release 11 was final)	https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-dz/
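The bracketed Penn Treebank notation listed in the table can be read with a short recursive-descent parser. The sketch below is an illustration under stated assumptions: the sample sentence is invented, the function names are hypothetical, and the extra unlabelled outer bracket found in distributed `.mrg` files is not handled.

```python
from typing import List, Tuple

# Invented example sentence in bracketed constituency notation.
SAMPLE = "(S (NP (DT the) (NN cat)) (VP (VBZ is) (ADJP (JJ asleep))))"

def tokenize(s: str) -> List[str]:
    """Split a bracketed string into '(' , ')' and symbol tokens."""
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens: List[str], pos: int = 0):
    """Parse one constituent starting at tokens[pos].

    Returns ((label, children), next_pos); a leaf child is a plain string.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '('")
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)   # nested constituent
            children.append(child)
        else:
            children.append(tokens[pos])      # terminal word
            pos += 1
    return (label, children), pos + 1         # step past the closing ')'

def leaves(tree) -> List[Tuple[str, str]]:
    """Collect (word, POS-tag) pairs from the preterminal nodes."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(leaves(child))
    return pairs

tokens = tokenize(SAMPLE)
tree, end = parse(tokens)
print(leaves(tree))
```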

Tier 3 family table — Semantic / frame annotation

Format	First appeared	Origin	Type	Status (2026)	URL
PropBank annotation	2002	UPenn / Colorado (Martha Palmer, Paul Kingsbury)	Verb-frame files (`.xml`) declaring senses; annotation files attach Arg0/Arg1/…/ArgM-* labels to constituent spans on PTB or UD trees	Active, the dominant SRL standard; PropBank-3 ongoing	https://propbank.github.io/
FrameNet annotation	1997	ICSI Berkeley (Charles Fillmore et al.)	XML lexicon of frames + frame elements + lexical units; annotation files attach FE labels to spans; supports cross-frame relations (Inherits-from, Uses, etc.)	Active, FrameNet 1.7+ widely used; Berkeley project ongoing	https://framenet.icsi.berkeley.edu/
VerbNet	2000	Karin Kipper Schuler (UPenn)	XML verb-class hierarchy with thematic roles, syntactic frames, and semantic predicates; descended from Levin’s verb classes	Active, VerbNet 3.4 current; SemLink bridges to PropBank/FrameNet	https://verbs.colorado.edu/verbnet/
WordNet	1985	Princeton (George A. Miller)	Lexical database of synsets with hypernym/hyponym/meronym relations; distributed as flat-file Prolog-like format + XML/RDF exports	Active, WordNet 3.1 long-stable; Open English WordNet (OEWN) the active 2020s descendant	https://wordnet.princeton.edu/
Open Multilingual Wordnet (OMW)	2010s	NTU + community	Aggregates 30+ wordnets across languages into a common synset-aligned format; LMF-compatible	Active	https://omwn.org/
NIF (NLP Interchange Format) / CoNLL-RDF	2013	DBpedia / FREME projects	Linked-data NLP: RDF triples express tokens, spans, annotations using NIF Core Ontology; CoNLL-RDF lifts CoNLL columns into RDF	Active but niche; mostly in semantic-web NLP	https://persistence.uni-leipzig.org/nlp2rdf/
LMF (Lexical Markup Framework, ISO 24613)	2008 (ISO)	ISO TC 37/SC 4	XML meta-model for lexicons; standardises the abstract structure for monolingual/multilingual/morphological lexicons	Active standard, revised editions 2019–2024	https://www.iso.org/standard/68820.html

Tier 3 family table — Multimedia / discourse annotation

Format	First appeared	Origin	Type	Status (2026)	URL
ELAN `.eaf` (EUDICO Annotation Format)	2002	Max Planck Institute for Psycholinguistics, Nijmegen	XML format for time-aligned multi-tier annotation of audio/video; tiers can be independent, symbolic-association, or time-subdivision	Very active, ELAN 7.1 (April 2026) is the workhorse of sign-language, fieldwork, and gesture research	https://archive.mpi.nl/tla/elan
Praat `.TextGrid`	~1995	Paul Boersma & David Weenink, University of Amsterdam	Plain-text or short-text format for time-aligned interval/point tiers; paired with `.wav` for phonetic analysis	Very active, Praat 6.4.64 (April 2026) is the dominant phonetic-analysis tool	https://www.fon.hum.uva.nl/praat/
EXMARaLDA	2002	University of Hamburg (Thomas Schmidt)	XML “Basic Transcription” + “Segmented Transcription” for spoken-discourse and multilingual corpora; converters to/from ELAN, Praat, TEI	Active, niche in spoken-discourse communities	https://exmaralda.org/
TalkBank CHAT format	1984	Brian MacWhinney (CMU)	Plain-text transcription format with `*SPEAKER:` lines and `%dependent:` tiers; underlying format for the entire TalkBank federation (CHILDES, AphasiaBank, etc.)	Very active, CHILDES + TalkBank remain canonical conversation corpora	https://talkbank.org/
CLAN (CHAT) toolkit format	1984	Brian MacWhinney (CMU)	Companion to CHAT; CLAN is the analysis toolkit that reads/writes CHAT and produces frequency, MLU, and other developmental metrics	Active	https://dali.talkbank.org/clan/
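The `*SPEAKER:` / `%dependent:` line structure of CHAT noted in the table lends itself to a line-oriented reader. This is a deliberately simplified sketch: the transcript is invented, `parse_chat` is a hypothetical helper, and `@` header lines and tab-indented continuation lines are ignored rather than interpreted.

```python
from typing import Dict, List

# Invented mini-transcript; the tab after the colon follows CHAT convention.
SAMPLE = (
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*CHI:\tmore cookie .\n"
    "%com:\tpoints at the jar\n"
    "*MOT:\tyou want more ?\n"
    "@End\n"
)

def parse_chat(text: str) -> List[Dict]:
    """Group each main (*) line with the dependent (%) tiers that follow it."""
    utterances: List[Dict] = []
    for line in text.splitlines():
        if line.startswith("*"):                      # main speaker tier
            speaker, _, words = line[1:].partition(":")
            utterances.append({"speaker": speaker,
                               "text": words.strip(),
                               "tiers": {}})
        elif line.startswith("%") and utterances:     # dependent tier
            name, _, value = line[1:].partition(":")
            utterances[-1]["tiers"][name] = value.strip()
        # '@' headers and continuation lines are skipped in this sketch
    return utterances

utterances = parse_chat(SAMPLE)
for u in utterances:
    print(u["speaker"], "|", u["text"], "|", u["tiers"])
```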

Tier 3 family table — Corpus encoding / translation memory

Format	First appeared	Origin	Type	Status (2026)	URL
TEI P5 (Text Encoding Initiative)	1990 (P1) / 2007 (P5) / 4.11.0 (Feb 2026)	TEI Consortium	XML vocabulary for digital humanities + linguistic markup; 500+ elements organised into modules (core, header, linguistic, msdesc, drama, verse, etc.)	Very active, the cornerstone of DH; ~six-month release cadence	https://tei-c.org/guidelines/p5/
TMX (Translation Memory eXchange)	1998	LISA (OSCAR group); hosted by GALA since LISA closed in 2011	XML format for exchanging translation memories between CAT tools; bilingual aligned segments with metadata	Active, TMX 1.4b is the long-stable industry standard; cross-listed in i18n-locale	https://www.gala-global.org/lisa-oscar-standards
XCES (XML Corpus Encoding Standard)	2000	Vassar / Nancy (Nancy Ide, Laurent Romary)	XML-Schema corpus encoding refinement of CES; defines header + text + alignment + linguistic-annotation XML	Legacy, mostly referenced for older parallel corpora (e.g. OPUS)	https://www.xces.org/
CES (Corpus Encoding Standard)	1996	EAGLES / Vassar	SGML predecessor of XCES; pre-XML corpus encoding	Legacy / historical	http://www.cs.vassar.edu/CES/
GrAF (Graph Annotation Framework, ISO 24612)	2008 (ISO)	ISO TC 37/SC 4 (Nancy Ide, Keith Suderman)	XML pivot format for representing arbitrary linguistic annotations as labelled directed graphs over a node-set	Active standard, pivot format for ANC (American National Corpus)	https://www.iso.org/standard/37326.html
PAULA-XML	~2006	SFB 632, University of Potsdam	XML standoff format for arbitrarily layered linguistic annotation; used by ANNIS corpus query system	Active within ANNIS-using projects	https://www.sfb632.uni-potsdam.de/en/paula.html
Moses phrase-table format	2007	Edinburgh (Philipp Koehn et al.)	Pipe-separated text file `source
OPUS aligned-corpus formats	2004	University of Helsinki (Jörg Tiedemann)	Distributes parallel corpora in TMX, Moses, plain-text-aligned, and XCES; the canonical free parallel-corpus repository	Very active, the go-to source for parallel data	https://opus.nlpl.eu/
Apertium bidix / monodix	2005	Universitat d’Alacant / Apertium project	XML dictionary formats for shallow-transfer rule-based MT: monolingual morphological dictionaries (monodix) + bilingual transfer dictionaries (bidix)	Active, focused on low-resource and Iberian/Turkic/Romance language pairs	https://wiki.apertium.org/wiki/Bilingual_dictionary

Tier 3 family table — Modern ML-NLP / lexicon / query

Format	First appeared	Origin	Type	Status (2026)	URL
spaCy DocBin (`.spacy`)	2019 (spaCy v2.2+)	Explosion AI (Matthew Honnibal, Ines Montani)	Binary serialization of `Doc` objects via msgpack; compact storage of tokens, spans, entities, and custom attributes	Very active, spaCy 3.8.14 current (March 2026); v4 in development	https://spacy.io/api/docbin
Stanza / Stanford CoreNLP CoNLL-U output	2020 (Stanza) / 2010 (CoreNLP)	Stanford NLP Group	Stanza emits standard CoNLL-U; CoreNLP also emits TSV, JSON, XML, and Penn Treebank	Active	https://stanfordnlp.github.io/stanza/
HuggingFace Datasets schema	2020	HuggingFace	Python/Arrow-backed loader with per-dataset `features` schema (Value, ClassLabel, Sequence, Translation, etc.); on-disk format is Apache Arrow + Parquet	Very active, the de facto ML-era corpus-exchange layer; dataset cards in YAML+Markdown	https://huggingface.co/docs/datasets/
BRAT standoff (`.ann` + `.txt`)	2012	NLPlab (Sampo Pyysalo et al.)	Plain-text standoff: `T1 Person 0 4 Mary` for text-bound annotations (start offset inclusive, end offset exclusive), `R1 Located Arg1:T1 Arg2:T2` for relations, `E1`, `A1`, `N1`, `#1` for events, attributes, normalizations, notes	Legacy (BRAT itself archived) but its format is still widely produced/consumed	https://brat.nlplab.org/standoff.html
WebAnno / INCEpTION CAS XMI	2013 (WebAnno) / 2018 (INCEpTION)	TU Darmstadt UKP Lab	UIMA CAS-based XMI serialization for layered text annotation; INCEpTION is the active successor to the archived WebAnno	INCEpTION active 2026; WebAnno archived	https://inception-project.github.io/
Sketch Engine CQL (Corpus Query Language)	2003	Lexical Computing (Adam Kilgarriff, Pavel Rychlý)	Token-pattern query language: `[lemma="run"] [tag="N.*"]`, regex over attributes, within/containing structure operators; implemented in Manatee	Very active, the standard corpus-concordance query language	https://www.sketchengine.eu/documentation/corpus-querying/

Notable threads

TEI’s 36-year reign as the digital-humanities cornerstone. Begun in 1987 as an SGML application, codified as P5/XML in 2007, and steadily refined on a six-month cadence (4.11.0, February 2026), TEI has outlasted every competing humanities-markup proposal because it is genuinely a guideline + schema generator (ODD / Roma) rather than a fixed DTD. Critical editions, manuscript catalogues, historical newspapers, drama corpora, and linguistic samplers all sit in TEI; the EpiDoc subset is the standard for inscriptions and papyri; the MEI sibling handles music. Its longevity is a study in slow, consensus-driven standards work — the opposite of the “move fast” ML-NLP ecosystem.
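Because TEI is namespaced XML, generic XML tooling is enough to pull linguistic markup out of a document. The sketch below assumes a tiny invented TEI file that uses `<w>` elements with `lemma` and `pos` attributes; real projects customise their schema via ODD, so element and attribute choices will vary.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the prefix "tei" is a local choice for XPath lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented minimal document for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Invented for this example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><s><w lemma="the" pos="DET">The</w> <w lemma="cat" pos="NOUN">cats</w> <w lemma="sleep" pos="VERB">sleep</w><pc>.</pc></s></p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Metadata lives in the header; text content lives under <text>.
title = root.findtext(
    "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
    namespaces=TEI_NS,
)

# Collect (surface form, lemma, part of speech) for every <w> element.
tokens = [(w.text, w.get("lemma"), w.get("pos"))
          for w in root.iterfind(".//tei:w", TEI_NS)]

print(title)
print(tokens)
```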
Universal Dependencies as the breakthrough multilingual treebank standard. Before UD (Stanford Dependencies → UD v1 2014 → v2 2017), every language had its own treebank format and its own dependency-relation inventory; cross-lingual parsing was largely impossible. UD’s design choices — content-word-headed dependencies, a fixed inventory of ~37 universal relations, the UPOS tagset, the FEATS morphological-feature inventory — let a single parser architecture train on 200+ treebanks across 100+ languages. v2.17 (May 2026) is the twenty-third release; the project’s twice-yearly cadence and rigorous validation tooling are themselves a model of how a community-driven linguistic standard can scale.
CoNLL-U as the format that ate the field. UD’s 10-column tab-separated CoNLL-U is now the format every modern parser reads and writes — Stanza, spaCy (via converters), UDPipe, Trankit, COMBO, all the recent BERT-based dependency parsers. The format is plain-text, diff-friendly, grep-friendly, and trivially streamed; the DEPS column carries enhanced (graph-not-tree) dependencies; the MISC column is a flexible attribute bucket that has absorbed SpaceAfter, alignment, lemma-confidence, even STREUSLE supersenses in v2.17. CoNLL-U’s success is a vindication of “tab-separated text with a strict spec” over more elaborate XML encodings.
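As a small illustration of the MISC column acting as an attribute bucket, the sketch below rebuilds surface text from the `SpaceAfter=No` flag. The sample tokens are hand-written, `parse_misc` and `detokenize` are hypothetical helper names, and multiword-token ranges are skipped, which is a simplification that would lose information in languages that rely on them.

```python
from typing import Dict

# Hand-written sample: "Do" + "n't" and "go" + "." are not separated by spaces.
SAMPLE = (
    "1\tDo\tdo\tAUX\t_\t_\t3\taux\t_\tSpaceAfter=No\n"
    "2\tn't\tnot\tPART\t_\t_\t3\tadvmod\t_\t_\n"
    "3\tgo\tgo\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

def parse_misc(misc: str) -> Dict[str, str]:
    """Split 'Key=Value|Key=Value' into a dict; '_' means empty."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)

def detokenize(conllu_sentence: str) -> str:
    """Rebuild the surface string from FORM and MISC SpaceAfter."""
    pieces = []
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue                      # simplification: skip 1-2 ranges and 1.1 nodes
        pieces.append(cols[1])            # FORM
        if parse_misc(cols[9]).get("SpaceAfter") != "No":
            pieces.append(" ")
    return "".join(pieces).rstrip()

surface = detokenize(SAMPLE)
print(surface)
```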
The BRAT / INCEpTION standoff tradition. BRAT (NLPlab, 2012) made the “annotations live in a separate .ann file referencing character offsets into an untouched .txt” model the default for entity-and-relation annotation projects — biomedical NLP especially. BRAT itself is archived but its format remains widely produced; INCEpTION (TU Darmstadt UKP Lab) is the actively-developed successor, adding human-in-the-loop ML assistance, knowledge-base entity linking, and UIMA CAS XMI serialization. The standoff principle (never modify the source text, layer annotations on top) is now baked into virtually every modern annotation tool.
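The standoff principle can be shown in a few lines: annotations carry character offsets, and the source text is never modified. The sketch below handles only contiguous text-bound (`T`) and binary relation (`R`) lines; the sentence, the annotation types, and the names `Entity`, `Relation`, and `parse_ann` are invented for illustration.

```python
from typing import Dict, NamedTuple, Tuple

class Entity(NamedTuple):
    id: str
    type: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    text: str

class Relation(NamedTuple):
    id: str
    type: str
    arg1: str
    arg2: str

# Invented source text and its standoff annotations.
TEXT = "Mary lives in Oslo."
ANN = (
    "T1\tPerson 0 4\tMary\n"
    "T2\tLocation 14 18\tOslo\n"
    "R1\tLocated Arg1:T1 Arg2:T2\n"
)

def parse_ann(ann: str) -> Tuple[Dict[str, Entity], Dict[str, Relation]]:
    entities: Dict[str, Entity] = {}
    relations: Dict[str, Relation] = {}
    for line in ann.splitlines():
        if not line.strip():
            continue
        if line.startswith("T"):
            ann_id, type_span, text = line.split("\t", 2)
            etype, start, end = type_span.split(" ")   # contiguous spans only
            entities[ann_id] = Entity(ann_id, etype, int(start), int(end), text)
        elif line.startswith("R"):
            ann_id, body = line.split("\t")[:2]
            rtype, arg1, arg2 = body.split(" ")
            relations[ann_id] = Relation(ann_id, rtype,
                                         arg1.split(":", 1)[1],
                                         arg2.split(":", 1)[1])
        # E, A, N and # lines are ignored in this sketch
    return entities, relations

entities, relations = parse_ann(ANN)

# Sanity check: every offset pair must reproduce the annotated string.
offsets_ok = all(TEXT[e.start:e.end] == e.text for e in entities.values())
print(entities, relations, offsets_ok)
```

Checking offsets against the untouched text is a cheap way to catch annotation files that have drifted from their source.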
ELAN as the multimedia-linguistics workhorse. ELAN (Max Planck Institute for Psycholinguistics, current 7.1 of April 2026) is the dominant tool for time-aligned video-and-audio annotation: sign languages, gesture studies, language documentation, multimodal interaction. Its .eaf XML format supports unlimited tiers with rich parent-child relationships (time-subdivision, symbolic-association, included-in). Praat TextGrids cover the speech-acoustic niche, EXMARaLDA the spoken-discourse niche, and TalkBank CHAT the conversation-and-development niche — but ELAN is uniquely positioned for multimodal field linguistics.
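The time-aligned tier model of `.eaf` can be sketched with the standard XML parser. The fragment below is a heavily trimmed, invented document: a real file also carries a header, linguistic-type definitions, and reference annotations for dependent tiers, none of which this sketch reads. `read_aligned` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Invented, trimmed-down EAF fragment; TIME_VALUE is in milliseconds.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2500"/>
  </TIME_ORDER>
  <TIER TIER_ID="gloss" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>HELLO</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>WORLD</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

Row = Tuple[str, Optional[int], Optional[int], str]

def read_aligned(eaf_xml: str) -> List[Row]:
    """Return (tier_id, start_ms, end_ms, value) for time-aligned annotations."""
    root = ET.fromstring(eaf_xml)
    # Time slots are declared once and referenced by ID from annotations.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iterfind("TIME_ORDER/TIME_SLOT")
             if ts.get("TIME_VALUE") is not None}
    rows: List[Row] = []
    for tier in root.iterfind("TIER"):
        for ann in tier.iterfind("ANNOTATION/ALIGNABLE_ANNOTATION"):
            rows.append((tier.get("TIER_ID"),
                         slots.get(ann.get("TIME_SLOT_REF1")),
                         slots.get(ann.get("TIME_SLOT_REF2")),
                         ann.findtext("ANNOTATION_VALUE", default="")))
    return rows

rows = read_aligned(SAMPLE_EAF)
for row in rows:
    print(row)
```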
The LLM era’s effect on “corpus” formats. Before 2018, an NLP “corpus” usually meant CoNLL-U, TEI-XML, or BRAT standoff with hand-curated annotations. After 2018, “corpus” increasingly means a HuggingFace Datasets entry — Apache Arrow on disk, a YAML+Markdown dataset card, often just JSONL with a few fields. The painstaking gold-annotation tradition still exists (UD, PropBank, BRAT projects) and remains essential for evaluation, but the training corpora are now web-scale unannotated text. spaCy’s .spacy DocBin sits in between: a binary format optimised for ML pipelines that still carries linguistic annotations.
Sketch Engine CQL as a bona fide corpus query language. Most “corpus formats” are passive data containers, but CQL is an actual query language: token-pattern matching with regex over morphological attributes ([lemma="run" & tag="V.*"]), structure operators (within <s>, containing <np>), Boolean combinations, and distance constraints. It powers Sketch Engine’s word-sketch grammars and concordance searches and is the closest thing the corpus-linguistics world has to a SQL equivalent. The closely related CQP query language of the IMS Corpus Workbench (CWB) is its open-source ancestor.
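To make the token-pattern idea concrete, here is a toy matcher for a very small CQL-like subset: a sequence of bracketed token specifications whose conditions are joined with `&`. It is not Sketch Engine's or CQP's implementation; structure operators, repetition, and distance constraints are out of scope, patterns containing `]` are not supported, and all function names are hypothetical.

```python
import re
from typing import Dict, List, Pattern

TOKEN_SPEC_RE = re.compile(r"\[([^\]]*)\]")          # one [...] per token position
CONDITION_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')  # attr="regex"

def compile_query(query: str) -> List[Dict[str, Pattern[str]]]:
    """Turn '[a="x" & b="y"] [c="z"]' into a list of {attr: regex} dicts."""
    return [{attr: re.compile(pat) for attr, pat in CONDITION_RE.findall(spec)}
            for spec in TOKEN_SPEC_RE.findall(query)]

def find_matches(tokens: List[Dict[str, str]], query: str) -> List[int]:
    """Return start indices where the token sequence satisfies the query."""
    pattern = compile_query(query)
    if not pattern:
        return []
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        # Every condition must match the whole attribute value.
        if all(rx.fullmatch(tok.get(attr, ""))
               for spec, tok in zip(pattern, window)
               for attr, rx in spec.items()):
            hits.append(i)
    return hits

# Invented tagged sentence for illustration.
TOKENS = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "runs", "lemma": "run", "tag": "VBZ"},
    {"word": "marathons", "lemma": "marathon", "tag": "NNS"},
    {"word": ".", "lemma": ".", "tag": "."},
]

verb_noun = find_matches(TOKENS, '[lemma="run" & tag="V.*"] [tag="N.*"]')
print(verb_noun)
```

Anchoring each regex to the whole attribute value is the choice made here so that `tag="N"` does not silently match `NNS`.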

Citations

TEI P5 Guidelines (current 4.11.0, 18 February 2026): https://tei-c.org/guidelines/p5/
TEI P5 release archive (Zenodo DOI): https://doi.org/10.5281/zenodo.3413524
Universal Dependencies home: https://universaldependencies.org/
CoNLL-U format spec: https://universaldependencies.org/format.html
CoNLL-U Plus extension: https://universaldependencies.org/ext-format.html
UD v2 specifications: https://universaldependencies.org/v2/
PropBank: https://propbank.github.io/
FrameNet (Berkeley): https://framenet.icsi.berkeley.edu/
VerbNet (Colorado): https://verbs.colorado.edu/verbnet/
WordNet (Princeton): https://wordnet.princeton.edu/
Open Multilingual Wordnet: https://omwn.org/
NIF / NLP Interchange Format: https://persistence.uni-leipzig.org/nlp2rdf/
ISO 24613 (LMF): https://www.iso.org/standard/68820.html
ISO 24612 (GrAF): https://www.iso.org/standard/37326.html
ELAN (MPI Nijmegen, current 7.1 April 2026): https://archive.mpi.nl/tla/elan
ELAN release notes: https://archive.mpi.nl/tla/elan/release-notes
Praat (current 6.4.64 April 2026): https://www.fon.hum.uva.nl/praat/
EXMARaLDA: https://exmaralda.org/
TalkBank / CHILDES + CHAT format: https://talkbank.org/
BRAT standoff format: https://brat.nlplab.org/standoff.html
INCEpTION (active WebAnno successor): https://inception-project.github.io/
OPUS parallel corpora: https://opus.nlpl.eu/
TMX 1.4b spec (GALA / OSCAR): https://www.gala-global.org/lisa-oscar-standards
Moses SMT: http://www.statmt.org/moses/
Apertium platform: https://www.apertium.org/
Apertium bidix wiki: https://wiki.apertium.org/wiki/Bilingual_dictionary
spaCy DocBin (.spacy): https://spacy.io/api/docbin
Stanza (Stanford NLP, Python): https://stanfordnlp.github.io/stanza/
HuggingFace Datasets: https://huggingface.co/docs/datasets/
Sketch Engine CQL: https://www.sketchengine.eu/documentation/corpus-querying/
Penn Treebank annotation guidelines: https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf
TIGER corpus: https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/
CoNLL-2003 NER shared task: https://www.clips.uantwerpen.be/conll2003/ner/
CoNLL-2009 shared task: https://ufal.mff.cuni.cz/conll2009-st/task-description.html
