Voice / Phonetics / Pronunciation DSLs Family Index


type: language-family-index
family: voice-phonetics
languages_catalogued: 22
tags: [language-reference, family-index, voice-phonetics, ipa, x-sampa, arpabet, cmudict, espeak, mfa, tts]

Voice / Phonetics / Pronunciation — Family Index

Family overview

The voice/phonetics family is the set of textual notations for encoding speech sounds: the alphabets, lexicons, and engine-specific phoneme files that sit between orthographic text and acoustic output. The canonical citizen is the International Phonetic Alphabet (IPA), first published by the International Phonetic Association in 1888 (Paris/London) as a Latin-derived alphabet with one symbol per distinctive sound. The chart has been revised several times: the substantive 1989 Kiel revision, the 1996 and 2005 updates (the latter adding the labiodental flap), the 2015 layout refresh, and the 2020 administrative re-issue (year/copyright only; no symbol changes). The IPA Extensions Unicode block (U+0250..U+02AF), together with IPA-related symbols scattered through Spacing Modifier Letters (U+02B0..U+02FF) and Combining Diacritical Marks (U+0300..U+036F), is what lets the alphabet flow through ordinary text pipelines.
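
As a rough illustration of how the three Unicode ranges quoted above show up in practice, the following Python sketch tags each code point of a transcription with the block it falls in. The function names `block_of` and `describe` are hypothetical helpers written for this note, and the range table covers only the blocks named in this overview, not every block that can carry phonetic symbols.

```python
import unicodedata

# Block ranges exactly as quoted in the overview paragraph.
IPA_RELATED_BLOCKS = {
    "IPA Extensions": (0x0250, 0x02AF),
    "Spacing Modifier Letters": (0x02B0, 0x02FF),
    "Combining Diacritical Marks": (0x0300, 0x036F),
}


def block_of(ch: str) -> str:
    """Return the name of the IPA-related block containing ch, or 'other'."""
    cp = ord(ch)
    for name, (lo, hi) in IPA_RELATED_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return "other"


def describe(transcription: str):
    """Pair every code point with its block, e.g. ('U+0283', 'IPA Extensions')."""
    return [(f"U+{ord(ch):04X}", block_of(ch)) for ch in transcription]


if __name__ == "__main__":
    # NFD splits precomposed letters so that combining diacritics become visible.
    sample = unicodedata.normalize("NFD", "ˈfɪʃ ã")
    for codepoint, block in describe(sample):
        print(codepoint, block)
```

Plain Latin letters such as "f" land in "other", which is why a range check on IPA Extensions alone cannot decide whether a string is an IPA transcription.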

Because IPA pre-dates Unicode by a century, an entire shadow ecosystem of ASCII-friendly transliterations grew up: SAMPA (1989, EU ESPRIT project, with per-language tables SAMPA-DE / SAMPA-EN / SAMPA-FR / etc.), its superset X-SAMPA (1995, John Wells, UCL; covers the entire IPA chart in 7-bit ASCII), Kirshenbaum (1992, also called ASCII-IPA), and the older Anglocentric ARPAbet (1971, DARPA Speech Understanding Research). ARPAbet won inside North-American TTS: its 39-phoneme inventory plus 0/1/2 stress digits is the surface form of CMUdict (Carnegie Mellon Pronouncing Dictionary, 134k+ entries, current canonical release cmudict-0.7b from November 2014), which in turn is the lexicon behind Festival, Flite, Sphinx, and almost every English-TTS-from-source project. eSpeak NG is the notable exception: it ships its own rule and exception dictionaries rather than CMUdict.
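
To make the "ARPAbet symbols plus 0/1/2 stress digits" surface form concrete, here is a minimal Python sketch that splits one CMUdict-style line into a headword and a list of (phoneme, stress) pairs. It assumes the classic plain-text layout (headword, whitespace, space-separated phones, `;;;` comment lines, and `WORD(1)`-style markers for alternate pronunciations); `parse_line` is a hypothetical helper, and the sample lines are illustrative rather than quoted from a specific release.

```python
import re
from typing import List, Optional, Tuple

# "WORD(1)" style suffix that marks an alternate pronunciation.
_ALTERNATE = re.compile(r"^(?P<word>.+?)\((?P<n>\d+)\)$")
# An ARPAbet symbol, optionally followed by a stress digit (vowels only).
_PHONE = re.compile(r"^(?P<symbol>[A-Z]+)(?P<stress>[012])?$")

Phone = Tuple[str, Optional[int]]


def parse_line(line: str) -> Optional[Tuple[str, List[Phone]]]:
    """Parse one dictionary line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    head, *raw_phones = line.split()
    alt = _ALTERNATE.match(head)
    word = alt.group("word") if alt else head
    phones: List[Phone] = []
    for raw in raw_phones:
        m = _PHONE.match(raw)
        if m is None:
            raise ValueError(f"not an ARPAbet symbol: {raw!r}")
        stress = m.group("stress")
        # Consonants carry no digit, so their stress is None.
        phones.append((m.group("symbol"), int(stress) if stress is not None else None))
    return word, phones


if __name__ == "__main__":
    print(parse_line("HELLO  HH AH0 L OW1"))
    print(parse_line("HELLO(1)  HH EH0 L OW1"))
```

Keeping the stress digit separate from the symbol makes it easy to fold the lexicon down to the bare 39-phoneme inventory when a downstream model does not use stress.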

Around the alphabets sits a TTS-engine phoneme-data zoo: Festival’s .utt/HTS label files (CSTR Edinburgh, 1996), MaryTTS XML (DFKI, 2000+, last release 5.2.1 in 2022, now maintenance-only), MBROLA diphone packages (TCTS Mons, 1996, open-sourced 2018), and eSpeak NG, which uses a Kirshenbaum-derived ASCII IPA internally. eSpeak NG’s most recent release is 1.52.0 (Dec 2024), which added stress marks to phoneme events and finalised the cmake-only build. On the speech-recognition side, Kaldi nnet3 phoneme labels, the HTK label file (.lab, Cambridge, 1989+), and most importantly the Montreal Forced Aligner (MFA), currently at 3.3.9 (Feb 2026), are the modern de facto formats and tools for time-aligning phonemes to audio.

The W3C overlay (SSML 1.1, PLS 1.0, SRGS 1.0, all catalogued in detail at accessibility-aria) standardises pronunciation hints, lexicon entries, and recognition grammars for the browser/cloud speech APIs. The LLM era has materially reduced the importance of phoneme intermediates: OpenAI’s Realtime API (2024, later the gpt-realtime model), ElevenLabs v3, and the open-source XTTS / OpenVoice / F5-TTS line synthesise audio end-to-end from text + acoustic prompt, often bypassing the phoneme layer entirely. The phonetic-data layer is therefore quietly bifurcating: classical TTS/ASR still needs it for low-resource languages, alignment, and pronunciation-control corner cases; flagship neural TTS does not. IPA itself remains indispensable for linguistics, lexicography, and Wiktionary.

In our deep library

None of these have standalone deep-library notes. Cross-reference:

  • accessibility-aria: sibling; W3C SSML 1.1, PLS 1.0, and SRGS 1.0 are catalogued there. This index cross-lists them but does not re-document.
  • codec-and-dsp: sibling; Kaldi nnet3, Vosk, and acoustic-side speech processing live there. Phoneme-label files are the textual interface between the two families.
  • nlp-corpus: sibling; Praat TextGrid and corpus annotation formats live there. Cross-listed below for forced-alignment.
  • i18n-locale: locale data (CLDR transformations, Unicode locale rules) carries pronunciation hints for digits/dates/currencies that feed TTS.
  • notation-spec: formal-grammar tradition (BNF/ABNF) underlies SRGS.
  • document-typesetting: TIPA (TeX IPA macros) belongs to the LaTeX ecosystem; cross-listed.
  • chatbot-intent-dsls: Dragon NaturallySpeaking voice command grammars and intent slot-filling DSLs overlap.

Tier 3 family table — Phonetic alphabets

| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| IPA (International Phonetic Alphabet) | 1888 | International Phonetic Association (Paris/London, Paul Passy et al.) | Unicode/print alphabet, one symbol per distinctive sound | Canonical; chart last substantively updated 2005 (labiodental flap), layout-refreshed 2015, re-issued 2018 / 2020 with copyright-only changes; annually re-issued since 2025 | https://www.internationalphoneticassociation.org/content/ipa-chart |
| IPA Extensions Unicode block | 1991 (Unicode 1.0) | Unicode Consortium | Unicode block U+0250..02AF (plus modifier letters U+02B0..02FF and combining diacritics U+0300..036F) | Stable; the standard digital encoding of IPA | https://www.unicode.org/charts/PDF/U0250.pdf |
| X-SAMPA | 1995 | John Wells (UCL) | ASCII-only superset of SAMPA covering the full IPA chart | Active; remains the standard ASCII-IPA used in ASR/TTS corpora and ISO 639-3 wordlists | https://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm |
| SAMPA (original) | 1989 | EU ESPRIT SAM project (Speech Assessment Methods) | Per-language ASCII phonetic tables (SAMPA-DE, SAMPA-EN, SAMPA-FR, …) | Legacy but still cited; superseded in practice by X-SAMPA for cross-language work | https://www.phon.ucl.ac.uk/home/sampa/ |
| ARPAbet | 1971 | DARPA Speech Understanding Research project | ASCII phoneme set for North-American English; 39 phonemes + 0/1/2 stress digits | Active; the lingua franca of US-English TTS via CMUdict | https://en.wikipedia.org/wiki/ARPABET |
| Kirshenbaum (ASCII-IPA) | 1992 | Evan Kirshenbaum (HP Labs) | 7-bit ASCII transliteration of IPA, broader than ARPAbet | Niche but live; eSpeak NG uses a Kirshenbaum-derived encoding internally | http://www.kirshenbaum.net/IPA/ascii-ipa.pdf |
| TIPA (TeX IPA macros) | 1996 | Fukui Rei | LaTeX macro package for typesetting IPA in print | Active; standard in linguistics journals; current TIPA 1.3 (CTAN, periodic updates) | https://ctan.org/pkg/tipa |
| Wikipedia / Wiktionary IPA templates | ~2003 | MediaWiki community | Wiki markup wrappers around IPA Unicode ({{IPA}}, {{IPAc-en}}, {{IPA-fr}}, …) | Very active; the largest curated IPA corpus in the world by entry count | https://en.wiktionary.org/wiki/Wiktionary:International_Phonetic_Alphabet |
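
X-SAMPA-to-IPA conversion is essentially a table lookup, which can be sketched in a few lines of Python. The mapping below is a deliberately small subset of the X-SAMPA chart chosen for illustration, and `xsampa_to_ipa` is a hypothetical helper rather than any library's API. The one subtlety shown is greedy longest-match tokenisation, needed because some X-SAMPA symbols span two ASCII characters (for example `r\`).

```python
# Small illustrative subset of the X-SAMPA chart; a real table is much larger.
XSAMPA_TO_IPA = {
    "r\\": "ɹ",   # two-character symbol: must be tried before plain "r"
    "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "N": "ŋ",
    "I": "ɪ", "U": "ʊ", "E": "ɛ", "O": "ɔ", "A": "ɑ", "V": "ʌ",
    "{": "æ", "@": "ə", "3": "ɜ", "Q": "ɒ",
    '"': "ˈ", "%": "ˌ", ":": "ː",
}
_LONGEST = max(len(key) for key in XSAMPA_TO_IPA)


def xsampa_to_ipa(text: str) -> str:
    """Convert an X-SAMPA string to IPA using greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(text):
        for width in range(_LONGEST, 0, -1):
            chunk = text[i:i + width]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += width
                break
        else:
            # Symbols such as "f" or "d" are identical in both notations.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ['"fIS', "r\\Ed", "TIN"]:
        print(word, "->", xsampa_to_ipa(word))
```

Because the source side is plain ASCII, the same table can be inverted for the reverse direction, provided combining diacritics on the IPA side are normalised first.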

Tier 3 family table — Pronunciation lexicons

| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| CMUdict (Carnegie Mellon Pronouncing Dictionary) | 1993 (v0.1); 0.7b in November 2014 | Carnegie Mellon University Speech Group | Public-domain English pronunciation lexicon in ARPAbet; 134k+ entries with 0/1/2 stress digits | Canonical; cmudict-0.7b remains the current standard release (community-maintained via cmusphinx/cmudict GitHub; infrequent point updates) | https://github.com/cmusphinx/cmudict |
| W3C Pronunciation Lexicon Specification (PLS) 1.0 | 2008 (W3C Recommendation) | W3C Voice Browser WG | XML pronunciation-lexicon format for SSML/SRGS interop; IPA + X-SAMPA aliases | Stable Recommendation; covered in depth in accessibility-aria | https://www.w3.org/TR/pronunciation-lexicon/ |
| Festival lexicon / ICELS | 1996+ | CSTR Edinburgh + CMU (Festvox) | Scheme-style entries ((word pos (phonemes))) consumed by Festival’s lexicon module | Maintained via Festival/Festvox; current Festival 2.5.1 (July 2020) | http://www.festvox.org/docs/manual-2.4.0/festival_24.html |
| MaryTTS lexicon | 2001+ | DFKI MaryTTS | XML and FST-compiled lexicons per language | Maintenance only; MaryTTS 5.2.1 (May 2022) is the last release | https://github.com/marytts/marytts |
| OpenJTalk dictionary (NAIST jdic / Mecab-NAIST-jdic) | 2009 | Nagoya Institute of Technology | Japanese pronunciation + accent dictionary feeding OpenJTalk’s HTS-engine voices | Active; widely embedded in Japanese TTS pipelines | https://open-jtalk.sourceforge.net/ |
| Jieba pinyin dictionary | 2012 | Sun Junyi (jieba project) | Chinese word-segmentation + pinyin pronunciation tables | Active; the de facto Chinese pinyin lexicon in the Python ecosystem | https://github.com/fxsjy/jieba |

Tier 3 family table — TTS-engine phoneme / voice formats

| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| eSpeak NG phoneme + voice files | 2015 (NG fork); eSpeak itself 1995 | Reece H. Dunn (NG fork of Jonathan Duddington’s eSpeak) | Plain-text phoneme rule files + per-voice prosody data; uses Kirshenbaum-derived ASCII IPA internally | Active; current 1.52.0 (Dec 2024) added stress marks to phoneme events, finalised cmake-only build (autoconf removed in 1.52) | https://github.com/espeak-ng/espeak-ng |
| Festival utt format / HTS labels | 1996 | CSTR Edinburgh (Black, Taylor, Caley) | Utterance representation: heterogeneous-relation-graph of words, syllables, phonemes, prosody; HTS-engine .lab for HMM voices | Maintained; Festival 2.5.1 (Jul 2020) is current; HTS-engine still tracks it | https://www.cstr.ed.ac.uk/projects/festival/ |
| MaryTTS MaryXML | 2001+ | DFKI MaryTTS | XML pipeline format carrying raw text → tokens → phonemes → acoustic parameters | Frozen with MaryTTS 5.2.1 (2022) | https://github.com/marytts/marytts/wiki/MaryXML |
| MBROLA diphone format | 1996 (open-sourced 2018) | TCTS Lab, Université de Mons (Thierry Dutoit) | Binary diphone databases + plain-text phoneme/duration/pitch input | Open / maintained since the 2018 open-source release on GitHub | https://github.com/numediart/MBROLA |
| Festvox voices | 1998 | Alan W. Black, Kevin Lenzo (CMU) | Voice-building toolkit on top of Festival; Clustergen, Clunits, HTS recipes | Active for academic use | http://www.festvox.org/ |
| HTK label file (.lab) | 1989+ | Cambridge University Engineering Dept (Steve Young et al.) | Plain-text phoneme labels with start/end times in 100-ns units | Legacy but live; still the lingua franca for HMM-era acoustic models and many alignment toolkits | https://htk.eng.cam.ac.uk/ |
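
The 100-ns time unit of HTK label files is easy to get wrong by a factor of ten, so a short Python sketch of the conversion may help. It assumes the common three-column layout (`start end label`, with any further columns ignored); `parse_lab` and `Segment` are hypothetical names, and the sample text is made up for illustration.

```python
from typing import List, NamedTuple

# HTK stores times as integer multiples of 100 ns, i.e. 10,000,000 units per second.
HTK_UNITS_PER_SECOND = 10_000_000


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def parse_lab(text: str) -> List[Segment]:
    """Parse 'start end label' lines into segments with times in seconds."""
    segments = []
    for raw in text.splitlines():
        fields = raw.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines in this sketch
        start_units, end_units, label = int(fields[0]), int(fields[1]), fields[2]
        segments.append(
            Segment(
                start_units / HTK_UNITS_PER_SECOND,
                end_units / HTK_UNITS_PER_SECOND,
                label,
            )
        )
    return segments


if __name__ == "__main__":
    sample = "0 1200000 sil\n1200000 2500000 hh\n2500000 3100000 ah\n"
    for seg in parse_lab(sample):
        print(f"{seg.start:.3f} {seg.end:.3f} {seg.label}")
```

Keeping the raw integers until the last moment (and dividing only for display or export) avoids accumulating floating-point error when segments are later concatenated.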

Tier 3 family table — Forced alignment, ASR labels, encoding algorithms

| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| Montreal Forced Aligner (MFA) | 2017 | McAuliffe et al., Montreal Corpus Tools / McGill | Kaldi-backed forced aligner; ships acoustic models + lexicons for 80+ languages; reads/writes Praat TextGrid | Active; current 3.3.9 (Feb 2026); the modern de facto forced-alignment standard | https://montreal-forced-aligner.readthedocs.io/ |
| Praat TextGrid | 1992+ | Boersma & Weenink (Univ. Amsterdam) | Plain-text tier-based time-aligned annotation format; the lingua franca of phonetics labs | Active; cross-listed from nlp-corpus | https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html |
| Kaldi GMM/nnet3 lexicon + phoneme labels | 2011+ | Povey et al., Johns Hopkins | lexicon.txt, phones.txt, nnet3 posterior labels; the Kaldi recipe convention | Active; cross-listed from codec-and-dsp | https://kaldi-asr.org/doc/data_prep.html |
| W3C SRGS 1.0 (Speech Recognition Grammar Spec) | 2004 (Recommendation) | W3C Voice Browser WG | Two equivalent syntaxes (XML form and ABNF form) for recognition grammars | Stable Recommendation; covered in accessibility-aria; still embedded in IVR/MRCP stacks | https://www.w3.org/TR/speech-grammar/ |
| W3C SSML 1.1 | 2010 (Recommendation) | W3C Voice Browser WG | XML markup for prosody, phonemes (<phoneme ph="...">), break, say-as, lookup | Stable Recommendation; covered in accessibility-aria | https://www.w3.org/TR/speech-synthesis11/ |
| Soundex | 1918 (patent), formal 1922 | Robert C. Russell & Margaret K. Odell | Phonetic-matching algorithm encoding surnames to 1-letter + 3-digit codes | Legacy/canonical; still the default phonetic match in many SQL engines (SOUNDEX()) | https://en.wikipedia.org/wiki/Soundex |
| Metaphone / Double Metaphone / Metaphone 3 | 1990 / 2000 / 2009 | Lawrence Philips | Improved phonetic-matching algorithms for English (Double Metaphone covers Slavic, Germanic, Spanish) | Active; Metaphone 3 is commercial, Double Metaphone is BSD-licensed | https://en.wikipedia.org/wiki/Metaphone |
| Phonex / NYSIIS / Caverphone | 1990 / 1970 / 2002 | A. J. Lait & B. Randell / NY State / David Hood (NZ) | Regional phonetic-matching variants (UK, US, NZ) | Niche; bundled in libraries like Apache Commons Codec, jellyfish (Python) | https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System |
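
For the SSML row above, a small Python sketch shows what a `<phoneme>` pronunciation hint looks like when built programmatically with the standard library. The `alphabet="ipa"` value and the namespace URI come from the SSML 1.1 Recommendation; the helper name `phoneme_ssml`, the default language tag, and the sample word are assumptions made for illustration, and individual speech engines may accept only a subset of SSML.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def phoneme_ssml(text: str, ipa: str, lang: str = "en-US") -> str:
    """Wrap text in an SSML <speak> document with an IPA <phoneme> hint."""
    # Serialise SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.1", f"{{{XML_NS}}}lang": lang},
    )
    phoneme = ET.SubElement(
        speak,
        f"{{{SSML_NS}}}phoneme",
        {"alphabet": "ipa", "ph": ipa},
    )
    # The element text is the orthographic fallback if the hint is ignored.
    phoneme.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(phoneme_ssml("tomato", "təˈmeɪtoʊ"))
```

Building the markup through an XML library rather than string concatenation keeps quotation marks and other special characters in the `ph` attribute correctly escaped.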

Notable threads

  • IPA’s 138-year longevity (1888 → 2026). Few notations in computing or linguistics have survived this long without forking. The IPA’s secret is that its governing body (the International Phonetic Association) restricts revisions to substantive, evidence-based proposals: the chart changed in 1989 (Kiel), 1993, 1996, 2005 (labiodental flap), and 2015 (layout), and was re-issued in 2018, 2020, and yearly since 2025 with only year/copyright changes. The Unicode IPA Extensions block (U+0250..02AF) cemented its digital permanence in 1991. Even modern neural TTS papers cite IPA in their datasets despite their models bypassing it at inference.

  • X-SAMPA’s stubborn relevance despite Unicode dominance. You’d expect a 1995 ASCII-only encoding of IPA to have been obsoleted by Unicode IPA, and yet X-SAMPA persists in 2026 corpora because: (a) some toolchains still mangle Unicode in CSV/TSV pipelines; (b) regex over X-SAMPA is far simpler than over combining-diacritic Unicode IPA; (c) MaryTTS, eSpeak NG (Kirshenbaum), and many Kaldi recipes still natively accept ASCII phonetic input; (d) X-SAMPA-to-IPA conversion is a one-shot lookup table, making round-tripping trivial. Lexicon files in academic phonetics often ship both representations.

  • ARPAbet + CMUdict as the dominant US-English TTS pair. The combination of (1) a 39-phoneme ASCII inventory designed for US English in 1971 and (2) a 134k-entry public-domain lexicon (CMUdict 0.7b, Nov 2014, still canonical in 2026) underlies most from-source English TTS projects of the past 30 years, including Festival, Flite, Sphinx, and the Mozilla TTS line. Many open-source neural TTS pretraining pipelines (Tacotron, FastSpeech, VITS variants) still use CMUdict as their G2P (grapheme-to-phoneme) bootstrap, even when the final model is end-to-end. CMUdict releases are deliberately infrequent (0.7b has been the canonical release for 11+ years), which is a feature for reproducibility, not neglect.

  • eSpeak NG’s surprising portability and longevity. Despite being a tiny C codebase (originally Jonathan Duddington, NG fork since 2015 by Reece Dunn), eSpeak NG supports 100+ languages, runs on every platform from Android to Raspberry Pi to web (via WASM/Emscripten), and remains the fallback TTS for screen readers (NVDA, Orca) and a popular G2P backend for neural TTS phonemisation. The 1.52.0 release (Dec 2024) added stress marks to phoneme events and finalised the cmake-only build (autoconf was removed). It is among the most widely installed TTS engines on Linux and the de facto reference for low-resource-language G2P.

  • MaryTTS / Festival as gracefully aging classics. Both Festival (CSTR, 1996, current 2.5.1 Jul 2020) and MaryTTS (DFKI, 2000s, current 5.2.1 May 2022) have settled into maintenance mode — neither is dead, but neither is the answer for new production TTS work in 2026. Their durable value is for academic phonetics, low-resource languages, voice cloning research with controllable phonetic input, and the Festvox tooling ecosystem. The Festival .utt format and MaryTTS MaryXML remain useful reference data formats for understanding heterogeneous-relation-graph utterance representations.

  • MFA as the modern forced-alignment standard. Montreal Forced Aligner (McAuliffe et al., 2017, current 3.3.9 Feb 2026) effectively replaced HTK-era alignment for new academic work. It ships acoustic models + lexicons for 80+ languages, reads/writes Praat TextGrid (so it slots straight into existing phonetics workflows), and uses Kaldi under the hood. Many speech-corpus papers from 2020 onwards either use MFA directly or compare against it. The 3.x release line (since 2022) modernised the CLI, fixed the lexicon-format brittleness of 2.x, and is now the canonical install.

  • The LLM-era TTS bypass of phoneme intermediates. OpenAI’s Realtime API (Oct 2024, later the gpt-realtime model), ElevenLabs v3, and the open-source XTTS / OpenVoice / F5-TTS / E2-TTS line synthesise audio end-to-end from text + acoustic prompt, typically with no IPA, ARPAbet, or CMUdict lookup at inference time. The phoneme layer survives at training time (datasets are often phonemised for stability, and pronunciation control via SSML <phoneme> tags still helps for proper nouns and homographs), but the user-visible role is shrinking. The classical pipeline (text → G2P → phoneme sequence → acoustic model → vocoder) is now the low-resource-language and controllability path, not the flagship path.

  • Phonetic-matching algorithms as DSL-adjacent constants. Soundex (1918), Metaphone (1990), Double Metaphone (2000), Phonex, NYSIIS, and Caverphone are not languages in any conventional sense, but they appear here because they are defined as terse symbolic transformation rules (“drop silent letters, fold equivalence classes, output a fixed-width code”). They live on in SQL (SOUNDEX() is built into MySQL and SQL Server and is available in PostgreSQL through the fuzzystrmatch extension), in record-linkage and entity-resolution code, and in genealogy/healthcare deduplication.
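
The "fold equivalence classes, output a fixed-width code" pattern from the last thread can be sketched with American Soundex in Python. This is one common formulation of the rules (H and W do not separate letters with the same code, vowels do), written as a hypothetical `soundex` helper; database SOUNDEX() functions and library implementations differ in small details, so their output may not match this sketch on every name.

```python
# Consonant equivalence classes; vowels (and Y) have no code.
_CODES = {}
for _digit, _letters in (
    ("1", "BFPV"),
    ("2", "CGJKQSXZ"),
    ("3", "DT"),
    ("4", "L"),
    ("5", "MN"),
    ("6", "R"),
):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code (letter + 3 digits) for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    out = [first]
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            out.append(code)
        previous = code  # a vowel resets the run, so a repeated code is kept
    # Pad with zeros, then truncate to the fixed width of four characters.
    return ("".join(out) + "000")[:4]


if __name__ == "__main__":
    for surname in ("Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"):
        print(surname, soundex(surname))
```

Names that share a code ("Robert" and "Rupert" both map to R163 here) become candidates for a match, which is why the algorithm is typically used as a cheap blocking key before a more precise comparison.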

Citations