Linguistic-Resource Publishing / Lexicon DSLs Family Index
type: language-family-index
family: linguistic-resources
languages_catalogued: 28
tags: [language-reference, family-index, linguistic-resources, ontolex-lemon, lmf, tei-lex0, lift, wn-lmf, cldf, dbnary, glottolog, olac, bcp47]
Linguistic-Resource Publishing / Lexicon / Dictionary — Family Index
Family overview
Linguistic-resource publishing DSLs are the textual and structured vocabularies used to encode digital dictionaries, terminology bases, wordnets, morpheme inventories, and language-documentation archives. They sit at the intersection of three different traditions that historically did not talk to each other: the lexicographic-publishing tradition (Oxford, Merriam-Webster, Brill — born from typesetting and now mostly XML/JSON), the field-linguistics tradition (SIL Toolbox SFM, FieldWorks FLEx, LIFT — born from missionary linguistics in the 1980s and still anchoring endangered-language documentation), and the computational-lexicon tradition (WordNet, OntoLex-Lemon, ISO LMF — born from NLP and the Semantic Web). The 2020s have seen a partial convergence around two attractors: OntoLex-Lemon on the linked-data side and TEI Lex-0 on the document-encoding side, with ISO LMF 24613 as the formal-standards reference.
OntoLex-Lemon is the de facto W3C model for lexicons as linked data, published as a W3C Final Community Group Report in May 2016 (with the Lexicography module lexicog added as a separate Final CG Report in September 2019). It is not a W3C Recommendation (the Community Group track does not produce Recommendations), but it functions as the modern standard: every major linked-data lexicon (DBnary, Apertium-RDF, BabelNet, the European Language Resources Coordination’s outputs) publishes in OntoLex. Its modular structure separates an ontolex core (forms, lexical entries, senses, references to ontology concepts) from lime (lexicon metadata), vartrans (translation and variation), decomp (morphological decomposition), morph (inflection), synsem (syntax–semantics interface), lexicog (lexicographic structure), and the newer frac (frequency, attestation, corpus; still a working draft as of 2026, not yet a Final CG Report).
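
The core classes described above can be illustrated with a handful of triples. The following is a minimal sketch that uses only the Python standard library rather than an RDF toolkit. The `http://example.org/lexicon/` namespace, the helper names, and the DBpedia concept URI are illustrative assumptions; only the `ontolex:` class and property names come from the model.

```python
# Minimal sketch: one OntoLex-Lemon lexical entry written out as N-Triples.
# BASE and all helper names are hypothetical; only the ontolex: terms are real.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical lexicon namespace


def uri(value):
    return f"<{value}>"


def literal(text, lang):
    # Escape backslashes and quotes, then attach a language tag.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def build_entry(lemma, lang, concept_uri):
    """Return triples for entry -> canonical form -> sense -> ontology concept."""
    entry = BASE + lemma  # assumes the lemma is URI-safe (no spaces)
    form = entry + "#form"
    sense = entry + "#sense1"
    return [
        (uri(entry), uri(RDF_TYPE), uri(ONTOLEX + "LexicalEntry")),
        (uri(entry), uri(ONTOLEX + "canonicalForm"), uri(form)),
        (uri(form), uri(RDF_TYPE), uri(ONTOLEX + "Form")),
        (uri(form), uri(ONTOLEX + "writtenRep"), literal(lemma, lang)),
        (uri(entry), uri(ONTOLEX + "sense"), uri(sense)),
        (uri(sense), uri(RDF_TYPE), uri(ONTOLEX + "LexicalSense")),
        (uri(sense), uri(ONTOLEX + "reference"), uri(concept_uri)),
    ]


def to_ntriples(triples):
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = build_entry("cat", "en", "http://dbpedia.org/resource/Cat")
print(to_ntriples(triples))
```

A real project would more likely use an RDF library and add lime metadata for the enclosing lexicon; the sketch only shows how the entry, form, and sense nodes hang together.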
ISO LMF (Lexical Markup Framework, ISO 24613) went through a long modular split between 2019 and 2024: the legacy 2008 monolithic standard was retired in favor of six parts — Part 1 Core model (ISO 24613-1:2024), Part 2 Machine-Readable Dictionary (24613-2:2020), Part 3 Etymological Extension (24613-3:2021), Part 4 TEI Serialisation (24613-4:2021), Part 5 Lexical Base Exchange / LBX (24613-5:2022), and Part 6 Syntax and Semantics / SynSem (24613-6:2024). The TEI-serialisation part (24613-4) is the formal bridge from LMF to TEI Lex-0, and the LBX serialisation (24613-5) is the practical XML exchange format used by terminology vendors and EU-funded projects (ELEXIS, European Lexicographic Infrastructure).
The SIL legacy stack persists in field linguistics despite the linked-data wave: SFM (Standard Format Markers, 1980s line-based \lx … \ge … markers) feeds Toolbox; FieldWorks FLEx (.NET application, current 9.x) is the modern descendant; LIFT (Lexicon Interchange Format), currently at v0.13 and maintained via the SIL.Lift NuGet package, is the cross-tool XML exchange format between FLEx, WeSay, Lexique Pro, and Dictionary App Builder. Parallel to all of this, Wikibase Lexemes (Wikidata’s Lexeme/Form/Sense entity model, with dedicated wikibase-lexeme, wikibase-form, wikibase-sense datatypes) have become the largest crowdsourced structured-lexicon corpus, and DBnary (Gilles Sérasset, LIG/Grenoble) republishes Wiktionary as OntoLex-Lemon RDF twice a month across 26+ language editions. The wordnet world has consolidated around the Global WordNet Association’s WN-LMF XML schema (current 1.4 as of 2024), with the Open Multilingual Wordnet (OMW) as the canonical multi-language distribution, and CLDF (Cross-Linguistic Data Formats) — current 1.3 — has become the lingua franca for typological and comparative-linguistics datasets (WALS, PHOIBLE, Glottolog itself).
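
To make the line-based SFM convention concrete, here is a small sketch of a reader that splits SFM text into records, starting a new record at each `\lx` marker. The sample entries and the function name `parse_sfm` are invented for illustration, and the sketch deliberately ignores continuation lines and project-specific marker hierarchies, which real Toolbox data often relies on.

```python
# Minimal sketch: split SFM text into records, one per \lx (lexeme) marker.
# The sample data is invented; real projects define their own marker sets.
import re

SAMPLE = r"""
\lx kala
\ps n
\ge fish

\lx juosta
\ps v
\ge run
"""

# A marker line is a backslash, a marker name, then an optional value.
MARKER_LINE = re.compile(r"^\\(\w+)\s*(.*)$")


def parse_sfm(text, record_marker="lx"):
    records = []
    current = None
    for line in text.splitlines():
        match = MARKER_LINE.match(line.strip())
        if not match:
            continue  # blank lines and continuation lines are skipped here
        marker, value = match.group(1), match.group(2).strip()
        if marker == record_marker:
            current = []  # a record marker opens a new record
            records.append(current)
        if current is not None:
            current.append((marker, value))
    return records


records = parse_sfm(SAMPLE)
for record in records:
    print(record)
```

Records are kept as ordered lists of `(marker, value)` pairs rather than dictionaries because SFM entries routinely repeat markers, for example several glosses under one lexeme.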
In our deep library
None of the formats in this family have standalone Tier-1/2 deep-library notes — they are XML/RDF/TSV exchange vocabularies hosted in general-purpose carrier languages.
Cross-reference:
- nlp-corpus — sibling family for corpus and annotation formats (TEI body, CoNLL-U, Universal Dependencies, ELAN, PropBank/FrameNet). The line is fuzzy: ELAN .eaf, CHILDES CHAT, and LMF itself appear in both indexes because dictionaries and corpora share serialisation.
- voice-phonetics — pronunciation-lexicon overlap (CMUdict, IPA in <pron> elements of TEI Lex-0 and OntoLex lexinfo:pronunciation).
- semantic-web — OntoLex-Lemon is built on RDF/OWL/SKOS; MMoOn, GOLD, and CIDOC CRM are all OWL ontologies and live conceptually inside the Semantic Web stack.
- i18n-locale — TBX (terminology) and BCP 47 language tags overlap heavily; the i18n index covers them from the software-localisation angle, this index from the dictionary/terminology-publishing angle.
- citation-formats — MARC for library cataloging; OAI-PMH harvest protocol used by OLAC for language-archive metadata.
- api-description — XML-schema and JSON-schema underpinnings (LIFT XSD, WN-LMF DTD, OntoLex Turtle).
- notation-spec — formal-grammar adjacent for morphological-rule formalisms inside LMF Morphology / MMoOn.
Tier 3 family table — Linked-data lexicon (RDF / OntoLex-Lemon)
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| OntoLex-Lemon core | 2016 (Final CG Report, May 2016) | W3C Ontology-Lexica Community Group (John P. McCrae et al.) | RDF/OWL model for lexicons; classes ontolex:LexicalEntry, ontolex:Form, ontolex:LexicalSense, ontolex:LexicalConcept | De facto standard; the modern linked-data lexicon model. Not a W3C Recommendation (CG track) but cited as the reference everywhere | https://www.w3.org/2016/05/ontolex/ |
| OntoLex lime (Lexicon Metadata) | 2016 | W3C OntoLex CG | Metadata module (lime:Lexicon, lime:entries, lime:language) | Active, Final CG Report | https://www.w3.org/2016/05/ontolex/#metadata-lime |
| OntoLex vartrans (translation/variation) | 2016 | W3C OntoLex CG | Translation, term-variation, and lexical-relation module | Active | https://www.w3.org/2016/05/ontolex/#variation-translation-vartrans |
| OntoLex decomp + morph | 2016 (decomp) / 2019+ (morph) | W3C OntoLex CG | Morphological decomposition and inflection paradigms | Active; morph is the newer of the two | https://www.w3.org/2016/05/ontolex/#morphology |
| OntoLex synsem | 2016 | W3C OntoLex CG | Syntax–semantics interface for predicates | Active | https://www.w3.org/2016/05/ontolex/#syntax-and-semantics-synsem |
| OntoLex lexicog (Lexicography module) | 2019 (Final CG Report, Sept 2019) | W3C OntoLex CG | Lexicographic-structure module — entries, sub-entries, ordering, dictionary metadata; the canonical “linked-data dictionary” layer | Active, Final CG Report | https://www.w3.org/2019/09/lexicog/ |
| OntoLex FrAC (Frequency, Attestation, Corpus) | 2018+ (in development), still Working Draft as of 2026 | W3C OntoLex CG (Christian Chiarcos et al.) | Corpus-derived frequencies, attestations, embedding pointers | Working draft, not yet Final CG Report | https://ontolex.github.io/frequency-attestation-corpus-information/ |
| DBnary | 2012 (Sérasset, LIG/Grenoble) | INRIA/Université Grenoble Alpes | Wiktionary republished as OntoLex-Lemon RDF; twice-monthly dumps synced to Wikimedia Wiktionary dumps; 26+ language editions as of 2024 | Very active | http://kaiko.getalp.org/about-dbnary/ |
| Wikibase Lexeme (Wikidata) | 2018 (Lexeme namespace launched on Wikidata) | Wikimedia Deutschland | First-class Wikibase entity types Lexeme (L-IDs), Form (F-IDs), Sense (S-IDs); dedicated datatypes wikibase-lexeme, wikibase-form, wikibase-sense | Very active; the largest crowdsourced structured-lexicon corpus | https://www.wikidata.org/wiki/Wikidata:Lexicographical_data |
| MMoOn Core (Multilingual Morpheme Ontology) | 2016 (initial) / 2021 (Semantic Web journal publication) | AKSW Leipzig (Bettina Klimek et al.) | OWL ontology for morpheme-level inventories; sub-word-level analogue to OntoLex | Active research-grade; not standardised | https://mmoon.org/ |
| GOLD (General Ontology for Linguistic Description) | 2003 | LinguistList / U. Arizona (Farrar, Langendoen) | OWL ontology for descriptive-linguistics categories (parts of speech, grammatical features); designed for endangered-language fieldwork | Stable/legacy reference; widely cited but not actively versioned | https://linguistics-ontology.org/ |
| CIDOC CRM (ISO 21127:2023) | 1996 (CIDOC), ISO 21127:2006 → 2014 → 2023; current community version 7.1.3 | International Council of Museums (ICOM) | OWL/RDFS ontology for cultural-heritage records; 81 classes, 160 properties; not lexicon-specific but used for archive metadata around dictionaries and manuscripts | Active; ISO 21127:2023 is current | https://cidoc-crm.org/ |
| Lexvo.org / lexinfo.net | 2008+ (Lexvo, Gerard de Melo; LexInfo, McCrae et al.) | DERI Galway / INSIGHT Centre | Companion vocabularies: Lexvo (URIs for languages, scripts, terms) and LexInfo (linguistic-category vocabulary used inside OntoLex) | Active, low-velocity maintenance | https://www.lexinfo.net/ |
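
The Wikibase Lexeme row above distinguishes Lexeme, Form, and Sense identifiers. As a rough sketch, the snippet below classifies such identifiers, assuming the shapes seen on Wikidata (`L123` for a lexeme, `L123-F1` for a form, `L123-S1` for a sense). The `LexemeRef` type and `parse_lexeme_id` function are hypothetical names, not part of any Wikibase client library.

```python
# Minimal sketch: classify Wikibase lexicographical entity IDs.
# Assumes the ID shapes L<n>, L<n>-F<m>, L<n>-S<m>; names are hypothetical.
import re
from typing import NamedTuple, Optional

ENTITY_ID = re.compile(r"^L([1-9]\d*)(?:-([FS])([1-9]\d*))?$")


class LexemeRef(NamedTuple):
    lexeme: str            # the owning lexeme, e.g. "L123"
    kind: str              # "lexeme", "form", or "sense"
    sub_id: Optional[str]  # e.g. "F1" or "S2"; None for a bare lexeme


def parse_lexeme_id(entity_id):
    match = ENTITY_ID.match(entity_id)
    if match is None:
        raise ValueError(f"not a lexeme, form, or sense ID: {entity_id!r}")
    lexeme = "L" + match.group(1)
    if match.group(2) is None:
        return LexemeRef(lexeme, "lexeme", None)
    kind = "form" if match.group(2) == "F" else "sense"
    return LexemeRef(lexeme, kind, match.group(2) + match.group(3))


for example in ("L123", "L123-F1", "L123-S2"):
    print(example, parse_lexeme_id(example))
```

Forms and senses are always scoped to their lexeme, which is why the parser returns the owning lexeme ID in every case.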
Tier 3 family table — TEI / XML dictionary
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| TEI Lex-0 | 2018 (DARIAH Working Group on Lexical Resources) | DARIAH-EU, Toma Tasovac et al. | Constrained TEI subset for dictionary encoding — fixes the underspecified parts of TEI Chapter 9; canonical baseline for born-digital + retro-digitised dictionaries | Active; current release July 2025 (versioned, ODD-driven schema generation) | https://lex-0.org/ |
| TEI P5 dictionary module (Chapter 9) | 1990 (TEI P1) → P5 (2007) → 4.10.0 (Aug 2025) → 4.x ongoing | Text Encoding Initiative Consortium | Full TEI dictionary vocabulary: <entry>, <form>, <gramGrp>, <sense>, <cit>, <def> — flexible but underspecified, hence the need for Lex-0 | Active; the parent standard; current Guidelines 4.10.0 | https://tei-c.org/release/doc/tei-p5-doc/en/html/DI.html |
| LMF Part 1 — Core model (ISO 24613-1:2024) | 2008 (original ISO 24613), revised 2024 | ISO TC 37 / SC 4 | UML metamodel for lexical resources; class hierarchy LexicalResource → Lexicon → LexicalEntry → Sense | Current, published 2024 | https://www.iso.org/standard/82014.html |
| LMF Part 2 — Machine-Readable Dictionary / MRD (ISO 24613-2:2020) | 2020 | ISO TC 37 / SC 4 | Specialisation for general-purpose dictionaries | Current | https://www.iso.org/standard/72100.html |
| LMF Part 3 — Etymological Extension (ISO 24613-3:2021) | 2021 | ISO TC 37 / SC 4 | Etymology, cognates, borrowing chains | Current | https://www.iso.org/standard/72101.html |
| LMF Part 4 — TEI Serialisation (ISO 24613-4:2021) | 2021 | ISO TC 37 / SC 4 + TEI liaison | Normative TEI serialisation of LMF (bridge to TEI Lex-0) | Current | https://www.iso.org/standard/72102.html |
| LMF Part 5 — LBX Lexical Base Exchange (ISO 24613-5:2022) | 2022 | ISO TC 37 / SC 4 | XML exchange-format serialisation; the practical interchange syntax | Current | https://www.iso.org/standard/72099.html |
| LMF Part 6 — Syntax and Semantics / SynSem (ISO 24613-6:2024) | 2024 | ISO TC 37 / SC 4 | Predicate–argument structures, semantic frames | Current, newest part | https://www.iso.org/standard/83180.html |
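
The TEI dictionary elements listed in the table above (`<entry>`, `<form>`, `<gramGrp>`, `<sense>`, `<def>`) can be read with a standard XML parser. The sketch below assumes a small hand-written entry in the TEI namespace with a lemma form and a `<gram type="pos">` element, roughly in the style TEI Lex-0 recommends. The sample content and the `summarise` helper are illustrative and have not been validated against the Lex-0 schema.

```python
# Minimal sketch: pull lemma, part of speech, and definitions from a TEI entry.
# The sample entry is hand-written for illustration, not schema-validated.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="cat" xml:lang="en">
  <form type="lemma"><orth>cat</orth></form>
  <gramGrp><gram type="pos">noun</gram></gramGrp>
  <sense xml:id="cat.1"><def>A small domesticated feline.</def></sense>
</entry>
"""


def summarise(entry):
    """Return a plain dict view of one <entry> element."""
    return {
        "lemma": entry.findtext("tei:form/tei:orth", namespaces=TEI_NS),
        "pos": entry.findtext(
            "tei:gramGrp/tei:gram[@type='pos']", namespaces=TEI_NS
        ),
        "definitions": [
            sense.findtext("tei:def", default="", namespaces=TEI_NS)
            for sense in entry.findall("tei:sense", TEI_NS)
        ],
    }


summary = summarise(ET.fromstring(SAMPLE_ENTRY))
print(summary)
```

Because plain TEI P5 allows many alternative encodings of the same entry, a reader like this only works reliably on data that follows a constrained profile.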
Tier 3 family table — SIL legacy / FLEx / LIFT / wordlist
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| SFM (Standard Format Markers) | early 1980s | SIL International (Shoebox/Toolbox lineage) | Line-based \marker value plaintext; configurable marker hierarchies; the substrate of Toolbox | Legacy but widely deployed in field linguistics | https://software.sil.org/toolbox/ |
| MDF (Multi-Dictionary Formatter) | 1990s | SIL (Toolbox shipping standard) | A standardised SFM dialect — agreed marker set (\lx, \ps, \ge, \dt) for typological consistency across field projects | Legacy/maintenance | https://software.sil.org/toolbox/ |
| FieldWorks FLEx XML | 2007+ (FLEx 1.0), current 9.x | SIL International | Native FLEx data format (XML; project files); the modern successor to Toolbox | Active; FLEx remains the dominant field-linguistics workbench | https://software.sil.org/fieldworks/ |
| LIFT (Lexicon Interchange Format) | 2007, current v0.13 | SIL International | XML cross-tool dictionary-exchange format; used by FLEx, WeSay, Lexique Pro, Dictionary App Builder; SIL.Lift NuGet package is current implementation | Active, the de facto SIL-ecosystem interchange | https://github.com/sillsdev/lift-standard |
| OpenDictionary / SIL Toolbox project format | 1990s | SIL | Project file bundles (.prj + SFM data) for Toolbox | Legacy | https://software.sil.org/toolbox/ |
| ELAN .eaf | 2002+, current ELAN 7.x (April 2026) | Max Planck Institute for Psycholinguistics, Nijmegen | XML time-aligned multimedia transcription/annotation; tier hierarchy with constraints; cross-listed in nlp-corpus | Active, the workhorse for endangered-language documentation | https://archive.mpi.nl/tla/elan |
| CHILDES CHAT | 1984+ | TalkBank / Carnegie Mellon (Brian MacWhinney) | Plain-text transcription convention for child-language acquisition; cross-listed in nlp-corpus | Active, central in CHILDES/TalkBank archives | https://talkbank.org/manuals/CHAT.html |
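
As an illustration of the LIFT row above, the following sketch reads headwords, parts of speech, and glosses from a small LIFT-style document. The element and attribute names (`lexical-unit`, `form`, `text`, `sense`, `grammatical-info`, `gloss`) reflect a reading of LIFT 0.13 and should be checked against the schema in the lift-standard repository. The sample entry and the `read_lift` helper are invented.

```python
# Minimal sketch: read headword, part of speech, and glosses from LIFT-style XML.
# Element names follow a reading of LIFT 0.13; verify against the published schema.
import xml.etree.ElementTree as ET

SAMPLE_LIFT = """
<lift version="0.13">
  <entry id="kala_1">
    <lexical-unit>
      <form lang="fi"><text>kala</text></form>
    </lexical-unit>
    <sense id="kala_1_s1">
      <grammatical-info value="Noun"/>
      <gloss lang="en"><text>fish</text></gloss>
    </sense>
  </entry>
</lift>
"""


def read_lift(xml_text):
    """Return one row per sense, carrying the entry's headword along."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("entry"):
        headword = entry.findtext("lexical-unit/form/text")
        for sense in entry.findall("sense"):
            info = sense.find("grammatical-info")
            rows.append({
                "entry_id": entry.get("id"),
                "headword": headword,
                "pos": info.get("value") if info is not None else None,
                # Map each gloss language to its gloss text.
                "glosses": {
                    gloss.get("lang"): gloss.findtext("text")
                    for gloss in sense.findall("gloss")
                },
            })
    return rows


rows = read_lift(SAMPLE_LIFT)
print(rows)
```

The sketch takes only the first form of the lexical unit; multilingual or multi-script projects typically carry several `form` elements per unit.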
Tier 3 family table — Wordnet / terminology / Japanese-dictionary
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| WN-LMF (Global Wordnet LMF) | 2013 (WN-LMF 1.0), current WN-LMF 1.4 (2024) | Global WordNet Association | DTD-defined XML schema for wordnets — synsets, senses, ILI (Inter-Lingual Index) linking | Active; 1.4 current; cross-listed in nlp-corpus | https://globalwordnet.github.io/schemas/ |
| Princeton WordNet native database | 1985+ (Miller et al.), 3.1 (2011, last canonical release) | Princeton CSL | Native flat-file format (data.noun, index.noun, etc.); the original WordNet exchange substrate | Frozen at 3.1; Open English WordNet now extends it | https://wordnet.princeton.edu/ |
| Open English WordNet | 2019+, ongoing yearly releases | Global WordNet Association (John P. McCrae et al.) | The actively maintained successor to Princeton WN 3.1; published as WN-LMF + JSON + RDF | Active, the canonical English wordnet today | https://en-word.net/ |
| OMW-EN / OMW JSON | 2010+ (OMW), JSON form 2018+ | NTU + GWA (Francis Bond et al.) | Open Multilingual Wordnet — JSON and WN-LMF distributions across 150+ wordnets | Active | https://omwn.org/ |
| TBX (TermBase eXchange, ISO 30042:2019) | 2002 (LISA), ISO 2008 → ISO 30042:2019 v3, revision in progress (ISO/AWI 30042) | ISO TC 37 / LISA legacy | XML terminology-exchange standard; concept-oriented; cross-listed in i18n-locale | Active; v3 current; new revision in drafting | https://www.iso.org/standard/62510.html |
| DatCatInfo (Data Category Repository) | 2019 (successor to ISOcat) | LTAC Global / TerminOrgs, ISO TC 37 liaison | Web-accessible repository of standardised data categories (POS values, gender, number, etc.) per ISO 12620 | Active; replaces retired ISOcat | https://datcatinfo.net/ |
| ISOcat (legacy) | 2009 → frozen 2014 | ISO TC 37 | Original Data Category Registry per ISO 12620:2009; categories migrated to DatCatInfo and CLARIN Concept Registry | Retired; consult successors | https://www.clarin.eu/news/concept-revival-isocat-clarin-concept-registry |
| JMdict XML | 1999 (Jim Breen, EDRDG) | Electronic Dictionary Research and Development Group | UTF-8 XML Japanese dictionary with English and other-language glosses; daily releases; multiple kanji + readings + glosses per entry | Very active; the canonical OSS Japanese dictionary | https://www.edrdg.org/jmdict/edict.html |
| EDICT / EDICT2 | 1991 (EDICT), 2003 (EDICT2) | Jim Breen / EDRDG | Plain-text EUC-JP (EDICT) / enhanced text (EDICT2) Japanese-English dictionary; legacy formats now generated from the JMdict data | Legacy (provided for older apps); JMdict XML is canonical | https://www.edrdg.org/jmdict/edict.html |
| KANJIDIC / KANJIDIC2 | 1991+, current KANJIDIC2 XML | EDRDG | Per-kanji metadata (readings, meanings, stroke counts, JIS/Unicode codepoints, dictionary cross-references) | Active | https://www.edrdg.org/wiki/KANJIDIC_Project |
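
The WN-LMF row above describes an XML layout of lexical entries, senses, and synsets. The sketch below groups lemmas by synset for a tiny hand-written lexicon. The identifiers, the `ili` value, and the `lemmas_by_synset` helper are made up for illustration; a production reader would validate against the GWA DTD first.

```python
# Minimal sketch: group lemmas by synset in a tiny WN-LMF-style document.
# All IDs and the ili value are invented; no DTD validation is performed.
import xml.etree.ElementTree as ET

SAMPLE_WN = """
<LexicalResource>
  <Lexicon id="demo-en" label="Demo wordnet" language="en"
           email="nobody@example.org" license="https://example.org/license"
           version="0.1">
    <LexicalEntry id="demo-en-cat-n">
      <Lemma writtenForm="cat" partOfSpeech="n"/>
      <Sense id="demo-en-cat-n-1" synset="demo-en-0001-n"/>
    </LexicalEntry>
    <LexicalEntry id="demo-en-feline-n">
      <Lemma writtenForm="feline" partOfSpeech="n"/>
      <Sense id="demo-en-feline-n-1" synset="demo-en-0001-n"/>
    </LexicalEntry>
    <Synset id="demo-en-0001-n" ili="i12345" partOfSpeech="n">
      <Definition>a small domesticated feline</Definition>
    </Synset>
  </Lexicon>
</LexicalResource>
"""


def lemmas_by_synset(xml_text):
    root = ET.fromstring(xml_text)
    synsets = {}
    for lexicon in root.findall("Lexicon"):
        # First index every synset, then attach lemmas through their senses.
        for synset in lexicon.findall("Synset"):
            synsets[synset.get("id")] = {
                "ili": synset.get("ili"),
                "definition": synset.findtext("Definition", default=""),
                "lemmas": [],
            }
        for entry in lexicon.findall("LexicalEntry"):
            lemma = entry.find("Lemma").get("writtenForm")
            for sense in entry.findall("Sense"):
                target = synsets.get(sense.get("synset"))
                if target is not None:
                    target["lemmas"].append(lemma)
    return synsets


synsets = lemmas_by_synset(SAMPLE_WN)
print(synsets)
```

The `ili` attribute is what lets synsets from different wordnets be aligned, so a multilingual merge would key on it rather than on the lexicon-local synset ID.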
Tier 3 family table — Language codes / archives / comparative
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| ISO 639-1 / 639-2 / 639-3 / 639-5 | 1967 (ISO/R 639, the precursor of 639-1) / 1998 (639-2) / 2007 (639-3) / 2008 (639-5) | ISO TC 37 / SC 2; SIL is RA for 639-3 | 2-letter (639-1), 3-letter (639-2, -3), and family codes (639-5); 639-3 covers ~7,800 individual languages | Active; Q1 2026 SIL change requests applied; 639-3 is the workhorse | https://iso639-3.sil.org/ |
| BCP 47 / RFC 5646 language tags | RFC 5646 (Sept 2009), still current; companion RFC 4647 for matching | IETF (Phillips, Davis) | Composition of ISO 639 + ISO 15924 script + ISO 3166-1 region + private use + variants; the Web/HTML/XML lang-tag standard | Active; the de facto language identifier on the Web | https://www.rfc-editor.org/rfc/bcp/bcp47.txt |
| Glottocode (Glottolog languoid ID) | 2011+, current Glottolog 5.3 (2026) | MPI EVA Leipzig (Hammarström, Forkel et al.) | 8-character ID (e.g. stan1288) for every languoid (family, language, dialect); fills gaps in ISO 639-3 (covers extinct, undocumented, and unclassified varieties); available as CLDF + JSON + RDF | Active; complement to 639-3, not replacement | https://glottolog.org/ |
| OLAC metadata | 2000+ | Open Language Archives Community (Bird, Simons) | XML metadata format extending Dublin Core (all 15 DC elements + community qualifiers); harvested via OAI-PMH; integrated with Linguistic Linked Open Data Cloud (2016) | Active; central to language-archive interop | http://www.language-archives.org/OLAC/metadata.html |
| OAI-PMH (harvest protocol) | 2002 (OAI 2.0) | Open Archives Initiative | HTTP harvest protocol — OLAC archives expose metadata via OAI-PMH endpoints; cross-listed adjacent to citation-formats | Active | https://www.openarchives.org/OAI/openarchivesprotocol.html |
| CLDF (Cross-Linguistic Data Formats) | 2018 (Forkel et al., Scientific Data); current CLDF 1.3 | Glottobank consortium (MPI SHH / EVA, ERC CALC) | CSV-on-the-Web (CSVW) profile for comparative-linguistics data — wordlists, structure datasets (WALS), phoneme inventories (PHOIBLE), Glottolog itself; JSON-LD metadata + CSV tables | Active; the lingua franca of comparative linguistics | https://cldf.clld.org/ |
| Wiktionary template syntax | 2002+ (Wiktionary) | Wikimedia Foundation | MediaWiki templates inside Wiktionary entries ({{lb}}, {{l}}, {{m}}, {{tt}}, language-specific headers); the substrate of the world’s largest free dictionary | Active; complemented by Wikibase Lexemes for structured data | https://en.wiktionary.org/wiki/Wiktionary:Templates |
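
The BCP 47 row above describes tags as a composition of language, script, region, variant, and private-use subtags. The following sketch splits a tag along those lines. It is a deliberately simplified reading of RFC 5646: extended language subtags, extension singletons, and grandfathered tags are not handled, subtags are not checked against the IANA registry, and the function name `parse_language_tag` is hypothetical.

```python
# Minimal sketch: split a BCP 47 tag into language, script, region, variants,
# and private use. Simplified: no extlang, extensions, or registry lookup.
def parse_language_tag(tag):
    if not tag.isascii():
        raise ValueError(f"non-ASCII tag: {tag!r}")
    subtags = tag.split("-")
    language = subtags[0]
    if not (language.isalpha() and 2 <= len(language) <= 3):
        raise ValueError(f"unsupported primary language subtag: {language!r}")
    result = {
        "language": language.lower(),
        "script": None,
        "region": None,
        "variants": [],
        "private_use": [],
    }
    i = 1
    # Script: exactly four letters, conventionally title-cased (e.g. Hant).
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        result["script"] = subtags[i].title()
        i += 1
    # Region: two letters or three digits, conventionally upper-cased.
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        result["region"] = subtags[i].upper()
        i += 1
    # Variants: 5-8 alphanumerics, or 4 characters starting with a digit.
    while i < len(subtags) and subtags[i].lower() != "x":
        subtag = subtags[i]
        is_variant = subtag.isalnum() and (
            5 <= len(subtag) <= 8
            or (len(subtag) == 4 and subtag[0].isdigit())
        )
        if not is_variant:
            raise ValueError(f"unsupported subtag: {subtag!r}")
        result["variants"].append(subtag.lower())
        i += 1
    # Everything after the "x" singleton is private use.
    if i < len(subtags):
        result["private_use"] = [s.lower() for s in subtags[i + 1:]]
    return result


for example in ("zh-Hant-TW", "en-GB-oxendict", "tr-x-icu"):
    print(example, parse_language_tag(example))
```

Note that `oxendict` lands in the variants list while `icu` lands under private use, which is the distinction between a registered variant and an `x-` private-use sequence.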
Notable threads
- OntoLex-Lemon as the de facto linked-data lexicon standard. Although it never advanced to W3C Recommendation (the Community Group track does not produce Recs), OntoLex-Lemon’s Final Community Group Report (May 2016) plus the lexicog extension (September 2019) function as the modern standard. Every major linked-data lexicography project (DBnary, Apertium-RDF, BabelNet, the European Language Resources Coordination outputs, Wikidata’s own data export) publishes in OntoLex. The trick that made it dominant was modularity: a thin core for forms/senses/concepts, with optional lime, vartrans, decomp, morph, synsem, and lexicog modules picked à la carte. The FrAC module (frequency, attestation, corpus) remains a Working Draft as of May 2026: the unfinished frontier is corpus-derived evidence.
- The long shadow of SIL Toolbox SFM (1980s) in field linguistics. SFM’s \marker value line-based format predates XML by a decade and remains the substrate of an enormous installed base of field-linguistic data. MDF standardised the marker set, FLEx replaced the Toolbox application, and LIFT became the XML export format, but huge legacy SFM corpora still exist in researcher and missionary archives. The persistence is partly cultural (linguistics PhDs trained on Toolbox keep using it) and partly technical (SFM is human-readable in a text editor, which matters in low-connectivity field settings where FLEx project files are fragile). LIFT 0.13 is the cross-tool migration path for everyone who has finally moved off SFM.
- LIFT as the cross-tool interchange that mostly succeeded. Where OntoLex-Lemon won the linked-data world, LIFT won the SIL-ecosystem world: it is what flows between FLEx (the editor), WeSay (the lightweight tablet/laptop entry tool), Lexique Pro (the publication tool), Dictionary App Builder (mobile-app generator), and Webonary (the web-publishing tool). It does not aspire to round-trip every FLEx feature (FLEx’s native XML is richer), but it carries the dictionary “Send/Receive” workflow that is the actual collaboration model for distributed field projects with intermittent connectivity. v0.13 has been stable since the late 2010s; the SIL.Lift NuGet package is the canonical implementation.
- Wikibase Lexemes as Wikidata’s growing structured-dictionary layer. The Lexeme/Form/Sense entity types launched in 2018 added a third Wikibase entity dimension alongside Items (Q-IDs) and Properties (P-IDs). Each Lexeme has an L-ID, with sub-entities for Forms (F-IDs, one per inflected surface form) and Senses (S-IDs). The three dedicated datatypes (wikibase-lexeme, wikibase-form, wikibase-sense) let properties on Items link to lexical data, and vice versa. This has produced the largest crowdsourced multilingual structured lexicon ever, growing fastest in languages underserved by commercial dictionaries. Wikibase Lexemes are SPARQL-queryable on the Wikidata Query Service and exportable to OntoLex via the Lexicographical data ontology mapping.
- CLDF as the typological-database lingua franca. Before CLDF (2018), every comparative-linguistics project (WALS, AUTOTYP, ASJP, Glottolog itself) used its own bespoke CSV/SQLite layout, so cross-project queries required custom ETL. CLDF profiled CSVW (CSV on the Web) for cross-linguistic data, defining standard column names (Form, Cognateset_ID, Parameter_ID) and standard component tables (LanguageTable, ParameterTable, FormTable, CognateTable). Today WALS, PHOIBLE, Glottolog, Concepticon, NorthEuraLex, IELex, and dozens of family-specific datasets all ship CLDF, and pycldf is the canonical Python access library. CLDF 1.3 is current.
- WN-LMF unifying the wordnet world. Before WN-LMF, each wordnet (Princeton, EuroWordNet, IndoWordNet, BalkaNet, OMW component wordnets) shipped its own format. WN-LMF 1.0 (2013) and the current 1.4 (2024) defined a single XML schema validated by DTD, with the GWA’s Inter-Lingual Index (ILI) as a glue layer assigning stable cross-language synset IDs. The Open English WordNet now actively maintains the Princeton lineage (Princeton WN 3.1 has been frozen since 2011), and OMW redistributes 150+ wordnets in WN-LMF + JSON.
- ISO 639 / BCP 47 / Glottolog as overlapping language-ID systems with subtly different goals. ISO 639-3 (SIL as RA) is the workhorse 3-letter individual-language identifier (~7,800 codes, Q1 2026 list current); ISO 639-1 (2-letter) and -2 (3-letter bibliographic) cover smaller subsets for older systems. BCP 47 / RFC 5646 composes 639 codes with ISO 15924 scripts (Hans/Hant), ISO 3166-1 regions (US/GB), registered variants (as in en-GB-oxendict), and private-use subtags (as in tr-x-icu); it is the Web/HTML/XML standard. Glottolog (currently 5.3) assigns its own 8-character glottocodes that cover everything 639-3 misses (extinct languoids, unclassified varieties, dialects). Modern best practice: use BCP 47 in user-facing contexts (HTML lang, CLDR locale), ISO 639-3 for individual-language identification, and Glottolog glottocodes for typological research and endangered-language documentation.
- The TEI Lex-0 / LMF Part 4 / OntoLex lexicog tripod. All three are 2016–2024-era responses to the same problem: the TEI dictionary chapter (P5 Ch 9) is too permissive, and serious dictionary projects need a constrained baseline. TEI Lex-0 is the constrained TEI subset (community recommendations + ODD-driven schema); ISO 24613-4 is the normative TEI serialisation of LMF; OntoLex lexicog is the RDF version of the same concepts. The three are interconvertible by design: DARIAH’s Lexical Resources Working Group, the OntoLex CG, and ISO TC 37 SC 4 share overlapping membership precisely to keep them aligned. Modern projects (ELEXIS, DigiLex, Dictionaria) target all three formats from a single source.
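
The CLDF thread above mentions standard column names and component tables. As a rough sketch of why that matters, the snippet below joins a FormTable to a CognateTable using only the standard `csv` module and the column names `ID`, `Language_ID`, `Parameter_ID`, `Form`, `Form_ID`, and `Cognateset_ID`. The rows and the `group_by_cognateset` helper are invented. A real CLDF dataset also ships a JSON metadata file describing its tables, which pycldf reads; this sketch bypasses that and hard-codes the column names.

```python
# Minimal sketch: join CLDF-style FormTable and CognateTable rows with the
# standard library. The data is invented; real datasets declare their tables
# in a JSON metadata file, which this sketch does not read.
import csv
import io
from collections import defaultdict

FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
f1,deu,hand,Hand
f2,eng,hand,hand
f3,fra,hand,main
f4,spa,hand,mano
"""

COGNATES_CSV = """ID,Form_ID,Cognateset_ID
c1,f1,hand-1
c2,f2,hand-1
c3,f3,hand-2
c4,f4,hand-2
"""


def group_by_cognateset(forms_csv, cognates_csv):
    """Map each Cognateset_ID to the (Language_ID, Form) pairs it contains."""
    forms = {row["ID"]: row for row in csv.DictReader(io.StringIO(forms_csv))}
    groups = defaultdict(list)
    for judgement in csv.DictReader(io.StringIO(cognates_csv)):
        form = forms[judgement["Form_ID"]]  # foreign key into the FormTable
        groups[judgement["Cognateset_ID"]].append(
            (form["Language_ID"], form["Form"])
        )
    return dict(groups)


groups = group_by_cognateset(FORMS_CSV, COGNATES_CSV)
for cognateset, members in groups.items():
    print(cognateset, members)
```

Because every dataset uses the same column names for the same roles, a join like this can be written once and reused across datasets instead of being rebuilt per project.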
Citations
- OntoLex-Lemon Final Community Group Report (May 2016): https://www.w3.org/2016/05/ontolex/
- OntoLex Lexicography module lexicog (Sept 2019 Final CG Report): https://www.w3.org/2019/09/lexicog/
- OntoLex FrAC (Frequency, Attestation, Corpus — working draft): https://ontolex.github.io/frequency-attestation-corpus-information/
- ISO 24613-1:2024 LMF Part 1 Core model: https://www.iso.org/standard/82014.html
- ISO 24613-5:2022 LMF Part 5 LBX: https://www.iso.org/standard/72099.html
- ISO 24613-6:2024 LMF Part 6 SynSem: https://www.iso.org/standard/83180.html
- ISO 30042:2019 TBX v3: https://www.iso.org/standard/62510.html
- TEI Lex-0 portal (current July 2025 release): https://lex-0.org/
- TEI Guidelines 4.10.0 release notes (Aug 2025): https://tei-c.org/news/2025/08/15/new-release-tei-guidelines-4-10-0-stylesheets-7-59-0/
- LIFT standard: https://github.com/sillsdev/lift-standard
- Global Wordnet schemas (WN-LMF 1.4): https://globalwordnet.github.io/schemas/
- CLDF specification (current 1.3): https://cldf.clld.org/
- CLDF Scientific Data paper (Forkel et al., 2018): https://www.nature.com/articles/sdata2018205
- Glottolog 5.3: https://glottolog.org/
- Wikidata Lexicographical data documentation: https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation
- WikibaseLexeme data model: https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model
- DBnary project: http://kaiko.getalp.org/about-dbnary/
- OLAC metadata: http://www.language-archives.org/OLAC/metadata.html
- ISO 639-3 (SIL Registration Authority): https://iso639-3.sil.org/
- BCP 47 / RFC 5646: https://www.rfc-editor.org/rfc/bcp/bcp47.txt
- MMoOn Multilingual Morpheme Ontology: https://mmoon.org/
- DatCatInfo (post-ISOcat Data Category Repository): https://datcatinfo.net/
- CIDOC CRM (ISO 21127:2023): https://cidoc-crm.org/
- JMdict/EDICT/KANJIDIC (EDRDG): https://www.edrdg.org/