NLP / Corpus / Linguistic Annotation DSLs Family Index


type: language-family-index
family: nlp-corpus
languages_catalogued: 28
tags: [language-reference, family-index, nlp-corpus, tei, conll-u, universal-dependencies, propbank, framenet, elan, praat, brat, huggingface-datasets]

Family overview

Corpus and linguistic-annotation DSLs are the textual languages used to encode digital text collections, parsed sentences, dependency trees, semantic frames, and multimedia transcriptions. Unlike the streaming-SQL or i18n families, they do not converge on a single syntax — each subfamily reflects a particular era and a particular research community. The oldest and broadest is TEI (Text Encoding Initiative), an XML vocabulary born in 1987 and codified in TEI P5 in November 2007 (current 4.11.0, 18 February 2026); it remains the cornerstone of digital humanities, encoding everything from critical editions and manuscript descriptions to historical corpora and linguistic markup.

The 2000s saw the rise of shared-task tab-separated formats anchored by the CoNLL (Conference on Natural Language Learning) shared tasks: CoNLL-2003 (named-entity recognition) and CoNLL-2009 (semantic-role labelling). Their most important descendant is CoNLL-U (2014), the format that underlies Universal Dependencies (UD). UD is the dominant multilingual-treebank standard of the modern era: v2.17 was released in May 2026 as the twenty-third release, covering 200+ treebanks across 100+ languages with consistent dependency-relation labels (nsubj, obj, obl, nmod, etc.) and a universal POS tagset. CoNLL-U’s ten-column tab-separated representation (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) is now the lingua franca that modern multilingual parsers (Stanza, UDPipe, Trankit, and spaCy via converters) read and write.
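
The ten-column layout described above can be read with a few lines of standard-library Python. The following is a minimal sketch rather than a validating parser: the function name read_conllu and the sample sentence are illustrative assumptions, and multiword-token ranges (IDs such as 1-2) and empty nodes (IDs such as 1.1) are passed through as ordinary rows without special handling.

```python
from typing import Dict, Iterator, List

# Column names in the order given in the paragraph above.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column: value} dicts.

    Hypothetical helper for illustration; not part of any UD tool.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment/metadata
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:                      # file may lack a trailing blank line
        yield sentence


# Invented sample sentence, for illustration only.
SAMPLE = (
    "# text = The cat sleeps.\n"
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\tVBZ\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
# (FORM, HEAD, DEPREL) triples give the dependency arcs of the first sentence.
arcs = [(tok["FORM"], tok["HEAD"], tok["DEPREL"]) for tok in sentences[0]]
```

All values stay as strings here; a real pipeline would typically convert HEAD to an integer and split FEATS and MISC on "|" into key/value pairs.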

Parallel to syntax, the semantic-annotation tradition produced PropBank (verb-oriented predicate-argument numbering, Arg0/Arg1/…), FrameNet (frame-semantic abstractions over related lexical units), and VerbNet (verb-class hierarchies), with SemLink mapping between all three. The parallel-corpus / translation-memory layer — TMX, OPUS, Moses phrase tables, XCES — supports machine translation, and the multimedia/discourse-annotation lineage (ELAN .eaf, Praat TextGrid, EXMARaLDA, TalkBank CHAT) handles time-aligned audio and video. ELAN (Max Planck Institute for Psycholinguistics, current 7.1 of April 2026) and Praat (6.4.64, April 2026) are the workhorses of phonetic and sign-language research.

The LLM era has reshaped the field: HuggingFace Datasets (Apache Arrow-backed, current 4.x in 2026) and JSONL-with-custom-schema corpora have eaten most of the “training corpus” niche, while BRAT and its spiritual successor INCEpTION (active in 2026; its direct predecessor WebAnno is archived) continue the standoff-annotation tradition for entity/relation tagging. spaCy’s .spacy binary serialization (DocBin) is the de facto Python NLP exchange format, and Sketch Engine’s CQL remains a bona fide corpus-query language for concordance research.

In our deep library

None of these formats has a standalone Tier-1/2 deep-library note; they are all data-exchange vocabularies hosted in XML, tab-separated text, JSON, or JSON-LD. Cross-reference:

  • i18n-locale — TMX (Translation Memory eXchange) is dual-classified here; the i18n side covers software-localisation workflows, this index covers the parallel-corpus / machine-translation side.
  • citation-formats — TEI overlaps heavily with humanities text-markup and bibliographic encoding (<biblStruct>, <listBibl>).
  • api-description — XML-schema-based formats (TEI, XCES, PAULA-XML, TIGER-XML, GrAF) share the XSD ecosystem.
  • notation-spec — formal-grammar adjacent (CFG / dependency-grammar formalisms).
  • document-typesetting — TEI documents are frequently typeset via TEI-XSL → LaTeX / HTML pipelines.
  • scientific — corpus-statistics tooling overlaps with R / Python data-science stacks.
  • ai-prompt-languages — modern LLM training corpora (HuggingFace Datasets, JSONL) sit at the boundary.
  • python — host language for spaCy, Stanza, NLTK, HuggingFace Datasets, Trankit; almost all 2020s NLP tooling.

Tier 3 family table — Treebank / dependency / parse formats

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| CoNLL-U | 2014 | Universal Dependencies project (Joakim Nivre et al.) | 10-column tab-separated; one token per line, blank lines separate sentences; encodes ID/FORM/LEMMA/UPOS/XPOS/FEATS/HEAD/DEPREL/DEPS/MISC | Very active, the de facto multilingual-treebank format under UD v2.17 (May 2026) | https://universaldependencies.org/format.html |
| CoNLL-U Plus | 2017 | Universal Dependencies | Extension of CoNLL-U allowing additional user-defined columns declared in a header line; backward-compatible | Active, niche extension | https://universaldependencies.org/ext-format.html |
| CoNLL-2003 | 2003 | CoNLL shared task on NER (Tjong Kim Sang & De Meulder) | 4-column tab-separated: token, POS, chunk-tag, NER-tag (BIO/IOB2) | Legacy but still the NER reference; HuggingFace conll2003 dataset remains a standard benchmark | https://www.clips.uantwerpen.be/conll2003/ner/ |
| CoNLL-2009 | 2009 | CoNLL shared task on syntactic + semantic dependencies | 14-column tab-separated with separate gold/predicted columns and PropBank-style predicate sense + arguments | Legacy, superseded by CoNLL-U for syntax + PropBank-3 for semantics | https://ufal.mff.cuni.cz/conll2009-st/task-description.html |
| Penn Treebank format | 1992 | UPenn (Marcus, Santorini, Marcinkiewicz) | Bracketed parse trees (S (NP (DT the) (NN cat)) (VP (VBZ is) ...)); encodes phrase-structure constituents | Legacy but evergreen; still the reference for constituency-parsing benchmarks (PTB WSJ §23) | https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf |
| TIGER-XML | 2002 | TIGER project (Saarbrücken / Stuttgart / Potsdam) | XML representation of German treebank graphs with crossing/non-projective edges and secondary edges | Legacy, primarily for the TIGER corpus and downstream German parsers | https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/ |
| NEGRA export format | 1997 | NEGRA project (Saarbrücken) | Tab-separated phrase-structure format for German treebanking; predecessor to TIGER-XML | Legacy | https://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/ |
| TüBa-D/Z format | 2000s | University of Tübingen (Erhard Hinrichs et al.) | Tabular phrase-structure with topological-field annotation specific to German | Legacy / archival (TüBa-D/Z release 11 was final) | https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-dz/ |
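
The bracketed Penn Treebank notation in the table above is an S-expression, so a small recursive-descent reader is enough to recover the tree. This is a sketch under stated assumptions: the names parse_brackets and leaves are hypothetical, the sample sentence is invented, input is assumed to be well-formed, and the unlabeled outer bracket that some treebank files wrap around each tree is not handled.

```python
import re
from typing import List, Tuple, Union

# A node is (label, children); a child is either a node or a word string.
Node = Tuple[str, list]


def parse_brackets(text: str) -> Node:
    """Parse one bracketed tree such as '(S (NP (DT the)) ...)'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse() -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]           # first symbol after '(' is the category
        pos += 1
        children: List[Union[Node, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1                      # consume ')'
        return (label, children)

    tree = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree


def leaves(tree: Node) -> List[Tuple[str, str]]:
    """Return (word, POS tag) pairs in left-to-right order."""
    label, children = tree
    out: List[Tuple[str, str]] = []
    for child in children:
        if isinstance(child, str):
            out.append((child, label))
        else:
            out.extend(leaves(child))
    return out


# Invented sample tree, for illustration only.
SAMPLE_TREE = "(S (NP (DT the) (NN cat)) (VP (VBZ is) (ADJP (JJ asleep))))"
tree = parse_brackets(SAMPLE_TREE)
tagged = leaves(tree)
```

Here tagged holds the word/tag pairs of the sample in sentence order, which is the usual first step when converting constituency trees to a token-per-line format.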

Tier 3 family table — Semantic / frame annotation

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| PropBank annotation | 2002 | UPenn / Colorado (Martha Palmer, Paul Kingsbury) | Verb-frame files (.xml) declaring senses; annotation files attach Arg0/Arg1/…/ArgM-* labels to constituent spans on PTB or UD trees | Active, the dominant SRL standard; PropBank-3 ongoing | https://propbank.github.io/ |
| FrameNet annotation | 1997 | ICSI Berkeley (Charles Fillmore et al.) | XML lexicon of frames + frame elements + lexical units; annotation files attach FE labels to spans; supports cross-frame relations (Inherits-from, Uses, etc.) | Active, FrameNet 1.7+ widely used; Berkeley project ongoing | https://framenet.icsi.berkeley.edu/ |
| VerbNet | 2000 | Karin Kipper Schuler (UPenn) | XML verb-class hierarchy with thematic roles, syntactic frames, and semantic predicates; descended from Levin’s verb classes | Active, VerbNet 3.4 current; SemLink bridges to PropBank/FrameNet | https://verbs.colorado.edu/verbnet/ |
| WordNet | 1985 | Princeton (George A. Miller) | Lexical database of synsets with hypernym/hyponym/meronym relations; distributed as flat-file Prolog-like format + XML/RDF exports | Active, WordNet 3.1 long-stable; Open English WordNet (OEWN) the active 2020s descendant | https://wordnet.princeton.edu/ |
| Open Multilingual Wordnet (OMW) | 2010s | NTU + community | Aggregates 30+ wordnets across languages into a common synset-aligned format; LMF-compatible | Active | https://omwn.org/ |
| NIF (NLP Interchange Format) / CoNLL-RDF | 2013 | DBpedia / FREME projects | Linked-data NLP: RDF triples express tokens, spans, annotations using NIF Core Ontology; CoNLL-RDF lifts CoNLL columns into RDF | Active but niche; mostly in semantic-web NLP | https://persistence.uni-leipzig.org/nlp2rdf/ |
| LMF (Lexical Markup Framework, ISO 24613) | 2008 (ISO) | ISO TC 37/SC 4 | XML meta-model for lexicons; standardises the abstract structure for monolingual/multilingual/morphological lexicons | Active standard, revised editions 2019–2024 | https://www.iso.org/standard/68820.html |

Tier 3 family table — Multimedia / discourse annotation

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| ELAN .eaf (EUDICO Annotation Format) | 2002 | Max Planck Institute for Psycholinguistics, Nijmegen | XML format for time-aligned multi-tier annotation of audio/video; tiers can be independent, symbolic-association, or time-subdivision | Very active, ELAN 7.1 (April 2026) is the workhorse of sign-language, fieldwork, and gesture research | https://archive.mpi.nl/tla/elan |
| Praat .TextGrid | ~1995 | Paul Boersma & David Weenink, University of Amsterdam | Plain-text or short-text format for time-aligned interval/point tiers; paired with .wav for phonetic analysis | Very active, Praat 6.4.64 (April 2026) is the dominant phonetic-analysis tool | https://www.fon.hum.uva.nl/praat/ |
| EXMARaLDA | 2002 | University of Hamburg (Thomas Schmidt) | XML “Basic Transcription” + “Segmented Transcription” for spoken-discourse and multilingual corpora; converters to/from ELAN, Praat, TEI | Active, niche in spoken-discourse communities | https://exmaralda.org/ |
| TalkBank CHAT format | 1984 | Brian MacWhinney (CMU) | Plain-text transcription format with *SPEAKER: lines and %dependent: tiers; underlying format for the entire TalkBank federation (CHILDES, AphasiaBank, etc.) | Very active, CHILDES + TalkBank remain canonical conversation corpora | https://talkbank.org/ |
| CLAN (CHAT) toolkit format | 1984 | Brian MacWhinney (CMU) | Companion to CHAT; CLAN is the analysis toolkit that reads/writes CHAT and produces frequency, MLU, and other developmental metrics | Active | https://dali.talkbank.org/clan/ |
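
The CHAT layout named in the table above (speaker lines introduced by "*", dependent tiers introduced by "%") lends itself to a simple line-oriented reader. The sketch below is an illustration only: read_chat is a hypothetical helper, the transcript is invented, "@" header lines are skipped, and tab-indented continuation lines are ignored, which a real CHAT reader such as CLAN would not do.

```python
from typing import Any, Dict, List


def read_chat(text: str) -> List[Dict[str, Any]]:
    """Group each '*SPEAKER:' line with the '%tier:' lines that follow it."""
    utterances: List[Dict[str, Any]] = []
    for line in text.splitlines():
        if line.startswith("*"):
            # Main tier: '*MOT:<tab>utterance text'
            speaker, _, content = line[1:].partition(":")
            utterances.append(
                {"speaker": speaker, "text": content.strip(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: attaches to the most recent main tier.
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
        # '@' headers and anything else are skipped in this sketch.
    return utterances


# Invented transcript, for illustration only.
SAMPLE_CHAT = (
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*MOT:\twhere is the ball ?\n"
    "%mor:\tpro:int|where cop|be&3S det:art|the n|ball ?\n"
    "*CHI:\tball .\n"
    "@End\n"
)

utterances = read_chat(SAMPLE_CHAT)
speakers = [u["speaker"] for u in utterances]
```

Keeping dependent tiers in a per-utterance dictionary mirrors how the format layers analyses (morphology, gesture, comments) on top of an untouched transcription line.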

Tier 3 family table — Corpus encoding / translation memory

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| TEI P5 (Text Encoding Initiative) | 1990 (P1) / 2007 (P5) / 4.11.0 (Feb 2026) | TEI Consortium | XML vocabulary for digital humanities + linguistic markup; 500+ elements organised into modules (core, header, linguistic, msdesc, drama, verse, etc.) | Very active, the cornerstone of DH; ~six-month release cadence | https://tei-c.org/guidelines/p5/ |
| TMX (Translation Memory eXchange) | 1998 | LISA / now OASIS | XML format for exchanging translation memories between CAT tools; bilingual aligned segments with metadata | Active, TMX 1.4b is the long-stable industry standard; cross-listed in i18n-locale | https://www.gala-global.org/lisa-oscar-standards |
| XCES (XML Corpus Encoding Standard) | 2000 | Vassar / Nancy (Nancy Ide, Laurent Romary) | XML-Schema corpus encoding refinement of CES; defines header + text + alignment + linguistic-annotation XML | Legacy, mostly referenced for older parallel corpora (e.g. OPUS) | https://www.xces.org/ |
| CES (Corpus Encoding Standard) | 1996 | EAGLES / Vassar | SGML predecessor of XCES; pre-XML corpus encoding | Legacy / historical | http://www.cs.vassar.edu/CES/ |
| GrAF (Graph Annotation Framework, ISO 24612) | 2008 (ISO) | ISO TC 37/SC 4 (Nancy Ide, Keith Suderman) | XML pivot format for representing arbitrary linguistic annotations as labelled directed graphs over a node-set | Active standard, pivot format for ANC (American National Corpus) | https://www.iso.org/standard/37326.html |
| PAULA-XML | ~2006 | SFB 632, University of Potsdam | XML standoff format for arbitrarily layered linguistic annotation; used by ANNIS corpus query system | Active within ANNIS-using projects | https://www.sfb632.uni-potsdam.de/en/paula.html |
| Moses phrase-table format | 2007 | Edinburgh (Philipp Koehn et al.) | Pipe-separated text file `source | | |
| OPUS aligned-corpus formats | 2004 | University of Helsinki (Jörg Tiedemann) | Distributes parallel corpora in TMX, Moses, plain-text-aligned, and XCES; the canonical free parallel-corpus repository | Very active, the go-to source for parallel data | https://opus.nlpl.eu/ |
| Apertium bidix / monodix | 2005 | Universitat d’Alacant / Apertium project | XML dictionary formats for shallow-transfer rule-based MT: monolingual morphological dictionaries (monodix) + bilingual transfer dictionaries (bidix) | Active, focused on low-resource and Iberian/Turkic/Romance language pairs | https://wiki.apertium.org/wiki/Bilingual_dictionary |

Tier 3 family table — Modern ML-NLP / lexicon / query

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| spaCy DocBin (.spacy) | 2019 (spaCy v2.2+) | Explosion AI (Matthew Honnibal, Ines Montani) | Binary serialization of Doc objects via msgpack; compact storage of tokens, spans, entities, and custom attributes | Very active, spaCy 3.8.14 current (March 2026); v4 in development | https://spacy.io/api/docbin |
| Stanza / Stanford CoreNLP CoNLL-U output | 2020 (Stanza) / 2010 (CoreNLP) | Stanford NLP Group | Stanza emits standard CoNLL-U; CoreNLP also emits TSV, JSON, XML, and Penn Treebank | Active | https://stanfordnlp.github.io/stanza/ |
| HuggingFace Datasets schema | 2020 | HuggingFace | Python/Arrow-backed loader with per-dataset features schema (Value, ClassLabel, Sequence, Translation, etc.); on-disk format is Apache Arrow + Parquet | Very active, the de facto ML-era corpus-exchange layer; dataset cards in YAML+Markdown | https://huggingface.co/docs/datasets/ |
| BRAT standoff (.ann + .txt) | 2012 | NLPlab (Sampo Pyysalo et al.) | Plain-text standoff: T1 Person 0 4 Mary for text-bound annotations (end offset exclusive), R1 Located Arg1:T1 Arg2:T2 for relations, E1, A1, N1, #1 for events, attributes, normalizations, notes | Legacy (BRAT itself archived) but its format is still widely produced/consumed | https://brat.nlplab.org/standoff.html |
| WebAnno / INCEpTION CAS XMI | 2013 (WebAnno) / 2018 (INCEpTION) | TU Darmstadt UKP Lab | UIMA CAS-based XMI serialization for layered text annotation; INCEpTION is the active successor to the archived WebAnno | INCEpTION active 2026; WebAnno archived | https://inception-project.github.io/ |
| Sketch Engine CQL (Corpus Query Language) | 2003 | Lexical Computing (Adam Kilgarriff, Pavel Rychlý) | Token-pattern query language: [lemma="run"] [tag="N.*"], regex over attributes, within/containing structure operators; implemented in Manatee | Very active, the standard corpus-concordance query language | https://www.sketchengine.eu/documentation/corpus-querying/ |
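
The BRAT standoff row above can be made concrete with a short reader for text-bound (T) and relation (R) lines. This is a minimal sketch, not the BRAT tooling itself: parse_ann, TextBound, and Relation are hypothetical names, the sample text and annotations are invented, fields are assumed to be tab-separated as in the .ann files BRAT writes, and discontinuous spans, events, attributes, normalizations, and notes are skipped.

```python
from typing import Dict, NamedTuple, Tuple


class TextBound(NamedTuple):
    id: str
    label: str
    start: int      # character offset into the .txt file
    end: int        # exclusive end offset
    text: str


class Relation(NamedTuple):
    id: str
    label: str
    args: Dict[str, str]   # role name -> annotation ID


def parse_ann(ann: str) -> Tuple[Dict[str, TextBound], Dict[str, Relation]]:
    """Read T and R lines; other line types are ignored in this sketch."""
    entities: Dict[str, TextBound] = {}
    relations: Dict[str, Relation] = {}
    for line in ann.splitlines():
        if not line.strip():
            continue
        ident, _, rest = line.partition("\t")
        if ident.startswith("T"):
            head, _, surface = rest.partition("\t")
            label, start, end = head.split(" ")   # contiguous spans only
            entities[ident] = TextBound(ident, label, int(start), int(end), surface)
        elif ident.startswith("R"):
            label, *pairs = rest.split()
            args = dict(pair.split(":", 1) for pair in pairs)
            relations[ident] = Relation(ident, label, args)
    return entities, relations


# Invented sample document and annotations, for illustration only.
TXT = "Mary lives in Paris."
ANN = (
    "T1\tPerson 0 4\tMary\n"
    "T2\tLocation 14 19\tParis\n"
    "R1\tLocated Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(ANN)
# Standoff check: every offset pair must slice the untouched source text
# to exactly the surface string recorded in the annotation.
offsets_ok = all(TXT[e.start:e.end] == e.text for e in entities.values())
```

The final check is the practical payoff of the standoff design: because the source text is never modified, annotation files can be validated against it at any time.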

Notable threads

  • TEI’s 36-year reign as the digital-humanities cornerstone. Begun in 1987 as an SGML application, first published as P1 in 1990, codified as P5/XML in 2007, and steadily refined on a six-month cadence (4.11.0, February 2026), TEI has outlasted every competing humanities-markup proposal because it is genuinely a guideline + schema generator (ODD / Roma) rather than a fixed DTD. Critical editions, manuscript catalogues, historical newspapers, drama corpora, and linguistic samplers all sit in TEI; the EpiDoc subset is the standard for inscriptions and papyri; the MEI sibling handles music. Its longevity is a study in slow, consensus-driven standards work, the opposite of the “move fast” ML-NLP ecosystem.

  • Universal Dependencies as the breakthrough multilingual treebank standard. Before UD (Stanford Dependencies → UD v1 2014 → v2 2017), every language had its own treebank format and its own dependency-relation inventory; cross-lingual parsing was largely impossible. UD’s design choices — content-word-headed dependencies, a fixed inventory of ~37 universal relations, the UPOS tagset, the FEATS morphological-feature inventory — let a single parser architecture train on 200+ treebanks across 100+ languages. v2.17 (May 2026) is the twenty-third release; the project’s twice-yearly cadence and rigorous validation tooling are themselves a model of how a community-driven linguistic standard can scale.

  • CoNLL-U as the format that ate the field. UD’s 10-column tab-separated CoNLL-U is now the format every modern parser reads and writes — Stanza, spaCy (via converters), UDPipe, Trankit, COMBO, all the recent BERT-based dependency parsers. The format is plain-text, diff-friendly, grep-friendly, and trivially streamed; the DEPS column carries enhanced (graph-not-tree) dependencies; the MISC column is a flexible attribute bucket that has absorbed SpaceAfter, alignment, lemma-confidence, even STREUSLE supersenses in v2.17. CoNLL-U’s success is a vindication of “tab-separated text with a strict spec” over more elaborate XML encodings.

  • The BRAT / INCEpTION standoff tradition. BRAT (NLPlab, 2012) made the “annotations live in a separate .ann file referencing character offsets into an untouched .txt” model the default for entity-and-relation annotation projects — biomedical NLP especially. BRAT itself is archived but its format remains widely produced; INCEpTION (TU Darmstadt UKP Lab) is the actively-developed successor, adding human-in-the-loop ML assistance, knowledge-base entity linking, and UIMA CAS XMI serialization. The standoff principle (never modify the source text, layer annotations on top) is now baked into virtually every modern annotation tool.

  • ELAN as the multimedia-linguistics workhorse. ELAN (Max Planck Institute for Psycholinguistics, current 7.1 of April 2026) is the dominant tool for time-aligned video-and-audio annotation: sign languages, gesture studies, language documentation, multimodal interaction. Its .eaf XML format supports unlimited tiers with rich parent-child relationships (time-subdivision, symbolic-association, included-in). Praat TextGrids cover the speech-acoustic niche, EXMARaLDA the spoken-discourse niche, and TalkBank CHAT the conversation-and-development niche — but ELAN is uniquely positioned for multimodal field linguistics.

  • The LLM era’s effect on “corpus” formats. Before 2018, an NLP “corpus” usually meant CoNLL-U, TEI-XML, or BRAT standoff with hand-curated annotations. After 2018, “corpus” increasingly means a HuggingFace Datasets entry — Apache Arrow on disk, a YAML+Markdown dataset card, often just JSONL with a few fields. The painstaking gold-annotation tradition still exists (UD, PropBank, BRAT projects) and remains essential for evaluation, but the training corpora are now web-scale unannotated text. spaCy’s .spacy DocBin sits in between: a binary format optimised for ML pipelines that still carries linguistic annotations.

  • Sketch Engine CQL as a bona fide corpus query language. Most “corpus formats” are passive data containers, but CQL is an actual query language: token-pattern matching with regex over morphological attributes ([lemma="run" & tag="V.*"]), structure operators (within <s>, containing <np>), Boolean combinations, and distance constraints. It powers Sketch Engine’s word-sketch grammars and concordance searches and is the closest thing the corpus-linguistics world has to a SQL equivalent. The IMS Corpus Workbench (CWB) CQP variant is the closely related open-source ancestor.
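
To make the token-pattern idea from the CQL thread concrete, here is a toy matcher for a very small subset of the syntax: a sequence of bracketed positions, each holding attribute="regex" constraints joined by "&". It is a sketch under stated assumptions and not how Manatee or CQP evaluate queries: parse_query and find_matches are hypothetical names, the token list is invented, every constraint inside a position is treated as a conjunction, regexes containing "]" are not supported, and structure operators, "|", negation, and distance constraints are left out. Attribute values are matched against the whole string.

```python
import re
from typing import Dict, List

Token = Dict[str, str]


def parse_query(query: str) -> List[Dict[str, str]]:
    """Turn '[lemma="run"] [tag="N.*"]' into one constraint dict per position."""
    positions: List[Dict[str, str]] = []
    for body in re.findall(r"\[([^\]]*)\]", query):
        # Every attr="regex" pair found in the brackets must hold (conjunction).
        positions.append(dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', body)))
    return positions


def find_matches(tokens: List[Token], query: str) -> List[int]:
    """Return the start index of every token window satisfying the query."""
    pattern = parse_query(query)
    if not pattern:
        return []
    hits: List[int] = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, tok.get(attr, "")) is not None
            for tok, constraints in zip(window, pattern)
            for attr, regex in constraints.items()
        ):
            hits.append(start)
    return hits


# Invented tagged sentence, for illustration only.
TOKENS: List[Token] = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "runs", "lemma": "run", "tag": "VBZ"},
    {"word": "marathons", "lemma": "marathon", "tag": "NNS"},
    {"word": "and", "lemma": "and", "tag": "CC"},
    {"word": "ran", "lemma": "run", "tag": "VBD"},
    {"word": "races", "lemma": "race", "tag": "NNS"},
]

verb_noun = find_matches(TOKENS, '[lemma="run"] [tag="N.*"]')
past_run = find_matches(TOKENS, '[lemma="run" & tag="VBD"]')
```

On this sample, the first query finds the two places where a form of "run" is followed by a noun, and the second narrows to the past-tense form; a production engine answers the same kind of query from an inverted index rather than a linear scan.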

Citations