Voice / Phonetics / Pronunciation DSLs Family Index
type: language-family-index
family: voice-phonetics
languages_catalogued: 22
tags: [language-reference, family-index, voice-phonetics, ipa, x-sampa, arpabet, cmudict, espeak, mfa, tts]
Voice / Phonetics / Pronunciation — Family Index
Family overview
The voice/phonetics family is the set of textual notations for encoding speech sounds — the alphabets, lexicons, and engine-specific phoneme files that sit between orthographic text and acoustic output. The canonical citizen is the International Phonetic Alphabet (IPA), first published by the International Phonetic Association in 1888 (Paris/London) as a Latin-derived alphabet with one symbol per distinctive sound. The chart has been revised several times — the substantive 1989 Kiel revision, the 1996/2005 additions (notably the labiodental flap), the 2015 layout refresh, and the 2020 administrative re-issue (year/copyright only; no symbol changes). The IPA Extensions Unicode block (U+0250..02AF, plus IPA-related symbols scattered through Combining Diacritical Marks U+0300..036F and Spacing Modifier Letters U+02B0..02FF) is what lets the alphabet actually flow through ordinary text pipelines.
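
As a rough illustration of how IPA text spreads across those three Unicode blocks, the sketch below labels each code point of a transcription by block. The ranges are the ones quoted above; the function names and the "other" label are assumptions made for this example, not part of any standard API.

```python
# Code-point ranges quoted in the overview; names are illustrative labels.
IPA_RELATED_BLOCKS = [
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
]


def block_of(ch: str) -> str:
    """Return the block label for one character, or "other" if outside the ranges."""
    cp = ord(ch)
    for lo, hi, name in IPA_RELATED_BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


def block_profile(transcription: str) -> list[tuple[str, str]]:
    """Pair every code point (as U+XXXX) with its block label."""
    return [(f"U+{ord(ch):04X}", block_of(ch)) for ch in transcription]


if __name__ == "__main__":
    # t + modifier letter small h + combining tilde, written with escapes
    for code_point, block in block_profile("t\u02b0\u0303"):
        print(code_point, block)
```

Plain Latin letters such as "t" fall outside all three ranges, which is why a pipeline that filters on the IPA Extensions block alone would drop most of a real transcription.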
Because IPA pre-dates Unicode by a century, an entire shadow ecosystem of ASCII-friendly transliterations grew up: SAMPA (1989, EU ESPRIT project, with per-language tables SAMPA-DE / SAMPA-EN / SAMPA-FR / etc.), its superset X-SAMPA (1995, John Wells, UCL; covers the entire IPA chart in 7-bit ASCII), Kirshenbaum (1992, also called ASCII-IPA), and the older Anglocentric ARPAbet (1971, DARPA Speech Understanding Research). ARPAbet won inside North-American TTS: its 39-phoneme inventory plus 0/1/2 stress digits is the surface form of CMUdict (Carnegie Mellon Pronouncing Dictionary, 134k+ entries, current canonical release cmudict-0.7b from November 2014), which in turn is the lexicon embedded in Festival, Sphinx, and almost every English-TTS-from-source project.
Around the alphabets sits a TTS-engine phoneme-data zoo: Festival’s .utt/HTS label files (CSTR Edinburgh, 1996), MaryTTS XML (DFKI, 2000+, last release 5.2.1 in 2022, now maintenance-only), MBROLA diphone packages (TCTS Mons, 1996, open-sourced 2019), and eSpeak NG, which uses a Kirshenbaum-derived ASCII IPA internally. eSpeak NG’s most recent release is 1.52.0 (Dec 2024 / Jan 2025), which added stress marks to phoneme events and finalised the cmake-only build. On the speech-recognition side, Kaldi nnet3 phoneme labels and the HTK label file (.lab, Cambridge, 1989+) are the common label formats, and the Montréal Forced Aligner (MFA), currently at 3.3.9 (Feb 2026), is the modern de facto tool for time-aligning phonemes to audio.
The W3C overlay (SSML 1.1, PLS 1.0, SRGS 1.0 — all catalogued in detail at accessibility-aria) standardises pronunciation hints, lexicon entries, and recognition grammars for the browser/cloud speech APIs. The LLM era has materially reduced the importance of phoneme intermediates: OpenAI’s gpt-realtime (2024), ElevenLabs v3, and the open-source XTTS / OpenVoice / F5-TTS line synthesise audio end-to-end from text + acoustic prompt, often bypassing the phoneme layer entirely. The phonetic-data layer is therefore quietly bifurcating: classical TTS/ASR still needs it for low-resource languages, alignment, and pronunciation-control corner cases; flagship neural TTS does not. IPA itself remains indispensable for linguistics, lexicography, and Wiktionary.
In our deep library
None of these have standalone deep-library notes. Cross-reference:
- accessibility-aria — sibling; W3C SSML 1.1, PLS 1.0, and SRGS 1.0 are catalogued there. This index cross-lists them but does not re-document.
- codec-and-dsp — sibling; Kaldi nnet3, Vosk, and acoustic-side speech processing live there. Phoneme-label files are the textual interface between the two families.
- nlp-corpus — sibling; Praat TextGrid and corpus annotation formats live there. Cross-listed below for forced-alignment.
- i18n-locale — locale data (CLDR transformations, Unicode locale rules) carries pronunciation hints for digits/dates/currencies that feed TTS.
- notation-spec — formal-grammar tradition (BNF/ABNF) underlies SRGS.
- document-typesetting — TIPA (TeX IPA macros) belongs to the LaTeX ecosystem; cross-listed.
- chatbot-intent-dsls — Dragon NaturallySpeaking voice command grammars and intent slot-filling DSLs overlap.
Tier 3 family table — Phonetic alphabets
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| IPA (International Phonetic Alphabet) | 1888 | International Phonetic Association (Paris/London, Paul Passy et al.) | Unicode/print alphabet, one symbol per distinctive sound | Canonical; chart last substantively updated 2005 (labiodental flap), layout-refreshed 2015, re-issued 2018 / 2020 with copyright-only changes; annually re-issued since 2025 | https://www.internationalphoneticassociation.org/content/ipa-chart |
| IPA Extensions Unicode block | 1991 (Unicode 1.0) | Unicode Consortium | Unicode block U+0250..02AF (plus modifier letters U+02B0..02FF and combining diacritics U+0300..036F) | Stable; the standard digital encoding of IPA | https://www.unicode.org/charts/PDF/U0250.pdf |
| X-SAMPA | 1995 | John Wells (UCL) | ASCII-only superset of SAMPA covering the full IPA chart | Active; remains the standard ASCII-IPA used in ASR/TTS corpora and ISO 639-3 wordlists | https://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm |
| SAMPA (original) | 1989 | EU ESPRIT SAM project (Speech Assessment Methods) | Per-language ASCII phonetic tables (SAMPA-DE, SAMPA-EN, SAMPA-FR, …) | Legacy but still cited; superseded in practice by X-SAMPA for cross-language work | https://www.phon.ucl.ac.uk/home/sampa/ |
| ARPAbet | 1971 | DARPA Speech Understanding Research project | ASCII phoneme set for North-American English; 39 phonemes + 0/1/2 stress digits | Active; the lingua franca of US-English TTS via CMUdict | https://en.wikipedia.org/wiki/ARPABET |
| Kirshenbaum (ASCII-IPA) | 1992 | Evan Kirshenbaum (HP Labs) | 7-bit ASCII transliteration of IPA, broader than ARPAbet | Niche but live; eSpeak NG uses a Kirshenbaum-derived encoding internally | http://www.kirshenbaum.net/IPA/ascii-ipa.pdf |
| TIPA (TeX IPA macros) | 1996 | Fukui Rei | LaTeX macro package for typesetting IPA in print | Active; standard in linguistics journals; current TIPA 1.3 (CTAN, periodic updates) | https://ctan.org/pkg/tipa |
| Wikipedia / Wiktionary IPA templates | ~2003 | MediaWiki community | Wiki markup wrappers around IPA Unicode ({{IPA}}, {{IPAc-en}}, {{IPA-fr}}, …) | Very active; the largest curated IPA corpus in the world by entry count | https://en.wiktionary.org/wiki/Wiktionary:International_Phonetic_Alphabet |
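
To make the relationship between the ASCII alphabets and Unicode IPA concrete, here is a minimal sketch of X-SAMPA to IPA conversion as a greedy longest-match lookup. The table is a small hand-picked subset of the X-SAMPA chart, chosen for this example; symbols not in the subset pass through unchanged, which only happens to be right for most lowercase letters. The function name `xsampa_to_ipa` is hypothetical.

```python
# Small illustrative subset of X-SAMPA; a real converter needs the full chart.
XSAMPA_TO_IPA = {
    "r\\": "\u0279",  # alveolar approximant
    "_h": "\u02b0",   # aspiration diacritic
    '"': "\u02c8",    # primary stress
    "%": "\u02cc",    # secondary stress
    ":": "\u02d0",    # length mark
    "@": "\u0259",    # schwa
    "{": "\u00e6",    # ash
    "I": "\u026a",
    "U": "\u028a",
    "E": "\u025b",
    "O": "\u0254",
    "A": "\u0251",
    "V": "\u028c",
    "3": "\u025c",
    "S": "\u0283",
    "Z": "\u0292",
    "T": "\u03b8",
    "D": "\u00f0",
    "N": "\u014b",
}
MAX_KEY_LEN = max(len(key) for key in XSAMPA_TO_IPA)


def xsampa_to_ipa(text: str) -> str:
    """Convert X-SAMPA to IPA, always trying the longest table key first."""
    out = []
    i = 0
    while i < len(text):
        for n in range(MAX_KEY_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            # Not in the subset: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('"fISIN'))  # X-SAMPA for "fishing"
```

Longest-match matters because multi-character symbols such as `r\` and `_h` would otherwise be split into unrelated single characters.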
Tier 3 family table — Pronunciation lexicons
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| CMUdict (Carnegie Mellon Pronouncing Dictionary) | 1993 (v0.1); 0.7b in November 2014 | Carnegie Mellon University Speech Group | Public-domain English pronunciation lexicon in ARPAbet; 134k+ entries with 0/1/2 stress digits | Canonical; cmudict-0.7b remains the current standard release (community-maintained via cmusphinx/cmudict GitHub; infrequent point updates) | https://github.com/cmusphinx/cmudict |
| W3C Pronunciation Lexicon Specification (PLS) 1.0 | 2008 (W3C Recommendation) | W3C Voice Browser WG | XML pronunciation-lexicon format for SSML/SRGS interop; IPA + X-SAMPA aliases | Stable Recommendation; covered in depth in accessibility-aria | https://www.w3.org/TR/pronunciation-lexicon/ |
| Festival lexicon / ICELS | 1996+ | CSTR Edinburgh + CMU (Festvox) | Scheme-style entries ((word pos (phonemes))) consumed by Festival’s lexicon module | Maintained via Festival/Festvox; current Festival 2.5.1 (July 2020) | http://www.festvox.org/docs/manual-2.4.0/festival_24.html |
| MaryTTS lexicon | 2001+ | DFKI MaryTTS | XML and FST-compiled lexicons per language | Maintenance only; MaryTTS 5.2.1 (May 2022) is the last release | https://github.com/marytts/marytts |
| OpenJTalk dictionary (NAIST jdic / Mecab-NAIST-jdic) | 2009 | Nagoya Institute of Technology | Japanese pronunciation + accent dictionary feeding OpenJTalk’s HTS-engine voices | Active; widely embedded in Japanese TTS pipelines | https://open-jtalk.sourceforge.net/ |
| Jieba pinyin dictionary | 2012 | Sun Junyi (jieba project) | Chinese word-segmentation + pinyin pronunciation tables | Active; the de facto Chinese pinyin lexicon in the Python ecosystem | https://github.com/fxsjy/jieba |
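
The CMUdict row describes ARPAbet phonemes carrying 0/1/2 stress digits. A minimal sketch of reading one entry is shown below, assuming the plain-text layout of a headword (optionally with a numbered variant suffix such as `(1)`) followed by whitespace-separated symbols, and `;;;` comment lines. The `Entry` type and `parse_line` helper are hypothetical names for this example, and the sample line is only shaped like a CMUdict entry.

```python
import re
from typing import NamedTuple, Optional

# Headword with an optional "(n)" suffix marking an alternate pronunciation.
ENTRY_HEAD = re.compile(r"^(?P<word>.+?)(?:\((?P<variant>\d+)\))?$")


class Entry(NamedTuple):
    word: str
    variant: int        # 0 for the primary pronunciation
    phones: list[str]   # ARPAbet symbols with stress digits stripped
    stress: list[int]   # one 0/1/2 digit per vowel, in order


def parse_line(line: str) -> Optional[Entry]:
    """Parse one lexicon line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    head, *symbols = line.split()
    match = ENTRY_HEAD.match(head)
    phones, stress = [], []
    for sym in symbols:
        if sym[-1].isdigit():          # vowels carry a trailing stress digit
            phones.append(sym[:-1])
            stress.append(int(sym[-1]))
        else:                          # consonants carry no digit
            phones.append(sym)
    return Entry(match["word"], int(match["variant"] or 0), phones, stress)


if __name__ == "__main__":
    print(parse_line("HELLO(1)  HH EH0 L OW1"))
```

Keeping the stress digits in a separate list makes it easy to compare stress patterns across words without re-parsing the phoneme symbols.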
Tier 3 family table — TTS-engine phoneme / voice formats
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| eSpeak NG phoneme + voice files | 2015 (NG fork); eSpeak itself 1995 | Reece H. Dunn (NG fork of Jonathan Duddington’s eSpeak) | Plain-text phoneme rule files + per-voice prosody data; uses Kirshenbaum-derived ASCII IPA internally | Active; current 1.52.0 (Dec 2024) added stress marks to phoneme events, finalised cmake-only build (autoconf removed in 1.52) | https://github.com/espeak-ng/espeak-ng |
| Festival utt format / HTS labels | 1996 | CSTR Edinburgh (Black, Taylor, Caley) | Utterance representation: heterogeneous-relation-graph of words, syllables, phonemes, prosody; HTS-engine .lab for HMM voices | Maintained; Festival 2.5.1 (Jul 2020) is current; HTS-engine still tracks it | https://www.cstr.ed.ac.uk/projects/festival/ |
| MaryTTS MaryXML | 2001+ | DFKI MaryTTS | XML pipeline format carrying raw text → tokens → phonemes → acoustic parameters | Frozen with MaryTTS 5.2.1 (2022) | https://github.com/marytts/marytts/wiki/MaryXML |
| MBROLA diphone format | 1996 (open-sourced 2019) | TCTS Lab, Université de Mons (Thierry Dutoit) | Binary diphone databases + plain-text phoneme/duration/pitch input | Open / maintained since the 2019 open-source release on GitHub | https://github.com/numediart/MBROLA |
| Festvox voices | 1998 | Alan W. Black, Kevin Lenzo (CMU) | Voice-building toolkit on top of Festival; Clustergen, Clunits, HTS recipes | Active for academic use | http://www.festvox.org/ |
| HTK label file (.lab) | 1989+ | Cambridge University Engineering Dept (Steve Young et al.) | Plain-text phoneme labels with start/end times in 100-ns units | Legacy but live; still the lingua franca for HMM-era acoustic models and many alignment toolkits | https://htk.eng.cam.ac.uk/ |
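
The HTK row mentions start/end times in 100-ns units. A short sketch of what that means in practice follows, assuming the common three-field line layout `start end label`; lines with fewer fields are skipped here as a simplification. The function name and the sample data are made up for illustration.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # one second divided by 100 ns


def parse_htk_lab(text: str) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, label) for each timed line."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # simplification: ignore untimed or blank lines
        start, end, label = int(fields[0]), int(fields[1]), fields[2]
        segments.append(
            (start / HTK_UNITS_PER_SECOND, end / HTK_UNITS_PER_SECOND, label)
        )
    return segments


if __name__ == "__main__":
    sample = "0 2500000 sil\n2500000 3700000 HH\n3700000 4500000 AH0\n"
    for start, end, label in parse_htk_lab(sample):
        print(f"{start:.2f} {end:.2f} {label}")
```

So a boundary written as 2500000 is 0.25 s into the audio.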
Tier 3 family table — Forced alignment, ASR labels, encoding algorithms
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| Montréal Forced Aligner (MFA) | 2017 | McAuliffe et al., Montreal Corpus Tools / McGill | Kaldi-backed forced aligner; ships acoustic models + lexicons for 80+ languages; reads/writes Praat TextGrid | Active; current 3.3.9 (Feb 2026); the modern de facto forced-alignment standard | https://montreal-forced-aligner.readthedocs.io/ |
| Praat TextGrid | 1992+ | Boersma & Weenink (Univ. Amsterdam) | Plain-text tier-based time-aligned annotation format; the lingua franca of phonetics labs | Active; cross-listed from nlp-corpus | https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html |
| Kaldi GMM/nnet3 lexicon + phoneme labels | 2011+ | Povey et al., Johns Hopkins | lexicon.txt, phones.txt, nnet3 posterior labels; the Kaldi recipe convention | Active; cross-listed from codec-and-dsp | https://kaldi-asr.org/doc/data_prep.html |
| W3C SRGS 1.0 (Speech Recognition Grammar Spec) | 2004 (Recommendation) | W3C Voice Browser WG | Two equivalent syntaxes — XML form and ABNF form — for recognition grammars | Stable Recommendation; covered in accessibility-aria; still embedded in IVR/MRCP stacks | https://www.w3.org/TR/speech-grammar/ |
| W3C SSML 1.1 | 2010 (Recommendation) | W3C Voice Browser WG | XML markup for prosody, phonemes (<phoneme ph="...">), break, say-as, lookup | Stable Recommendation; covered in accessibility-aria | https://www.w3.org/TR/speech-synthesis11/ |
| Soundex | 1918 (patent), formal 1922 | Robert C. Russell & Margaret K. Odell | Phonetic-matching algorithm encoding surnames to 1-letter + 3-digit codes | Legacy/canonical; still the default phonetic match in many SQL engines (SOUNDEX()) | https://en.wikipedia.org/wiki/Soundex |
| Metaphone / Double Metaphone / Metaphone 3 | 1990 / 2000 / 2009 | Lawrence Philips | Improved phonetic-matching algorithms for English (Double Metaphone covers Slavic, Germanic, Spanish) | Active; Metaphone 3 is commercial, Double Metaphone is BSD-licensed | https://en.wikipedia.org/wiki/Metaphone |
| Phonex / NYSIIS / Caverphone | 1990 / 1970 / 2002 | A. J. Lait & B. Randell / NY State / David Hood (NZ) | Regional phonetic-matching variants (UK, US, NZ) | Niche; bundled in libraries like Apache Commons Codec, jellyfish (Python) | https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System |
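
For the Soundex row, the "1-letter + 3-digit" code can be sketched in a few lines. This follows the commonly described American Soundex rules (H and W do not separate letters with the same code, vowels do); SQL engines differ in some of these details, so treat it as an illustration rather than a drop-in match for any particular `SOUNDEX()`.

```python
# Consonant classes of American Soundex; vowels, H, W, and Y have no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Encode a name as one letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue              # transparent: does not break a run of equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code               # a vowel resets prev, so repeats are coded again
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"):
        print(name, soundex(name))
```

"Robert" and "Rupert" both encode to R163, which is the equivalence-class folding the algorithm is designed for.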
Notable threads
- IPA’s 138-year longevity (1888 → 2026). Few notations in computing or linguistics have survived this long without forking. The IPA’s secret is that its governing body (the International Phonetic Association) restricts revisions to substantive, evidence-based proposals: the chart changed in 1989 (Kiel), 1993, 1996, 2005 (labiodental flap), and 2015 (layout), and was re-issued in 2018, 2020, and yearly since 2025 with only year/copyright changes. The Unicode IPA Extensions block (U+0250..02AF) cemented its digital permanence in 1991. Even modern neural TTS papers cite IPA in their datasets despite their models bypassing it at inference.
- X-SAMPA’s stubborn relevance despite Unicode dominance. You’d expect a 1995 ASCII-only encoding of IPA to have been obsoleted by Unicode IPA, and yet X-SAMPA persists in 2026 corpora because: (a) some toolchains still mangle Unicode in CSV/TSV pipelines; (b) regex over X-SAMPA is far simpler than over combining-diacritic Unicode IPA; (c) MaryTTS, eSpeak NG (Kirshenbaum), and many Kaldi recipes still natively accept ASCII phonetic input; (d) X-SAMPA-to-IPA conversion is a one-shot lookup table, making round-tripping trivial. Lexicon files in academic phonetics often ship both representations.
- ARPAbet + CMUdict as the dominant US-English TTS pair. The combination of (1) a 39-phoneme ASCII inventory designed for US English in 1971 and (2) a 134k-entry public-domain lexicon (CMUdict 0.7b, Nov 2014, still canonical in 2026) underlies most from-source English speech projects of the past 30 years, including Festival, Flite, Sphinx, and the Mozilla TTS line. Many open-source neural TTS pretraining pipelines (Tacotron, FastSpeech, VITS variants) still use CMUdict as their G2P (grapheme-to-phoneme) bootstrap, even when the final model is end-to-end. CMUdict releases are deliberately infrequent (0.7b has been the canonical release for 11+ years), which is a feature for reproducibility, not neglect.
- eSpeak NG’s surprising portability and longevity. Despite being a tiny C codebase (originally by Jonathan Duddington; NG fork since 2015 by Reece Dunn), eSpeak NG supports 100+ languages, runs on every platform from Android to Raspberry Pi to the web (via WASM/Emscripten), and remains the fallback TTS for screen readers (NVDA, Orca) and a popular G2P backend for neural TTS phonemisation. The 1.52.0 release (Dec 2024 / Jan 2025) added stress marks to phoneme events and finalised the cmake-only build (autoconf was deprecated). It is arguably the most widely installed TTS engine on Linux and the de facto reference for low-resource-language G2P.
- MaryTTS / Festival as gracefully aging classics. Both Festival (CSTR, 1996, current 2.5.1 Jul 2020) and MaryTTS (DFKI, 2000s, current 5.2.1 May 2022) have settled into maintenance mode: neither is dead, but neither is the answer for new production TTS work in 2026. Their durable value is for academic phonetics, low-resource languages, voice cloning research with controllable phonetic input, and the Festvox tooling ecosystem. The Festival .utt format and MaryTTS MaryXML remain useful reference data formats for understanding heterogeneous-relation-graph utterance representations.
- MFA as the modern forced-alignment standard. Montréal Forced Aligner (McAuliffe et al., 2017, current 3.3.9 Feb 2026) effectively replaced HTK-era alignment for new academic work. It ships acoustic models + lexicons for 80+ languages, reads/writes Praat TextGrid (so it slots straight into existing phonetics workflows), and uses Kaldi under the hood. Almost every speech-corpus paper from 2020 onwards either uses MFA directly or compares against it. The 3.x release line (since 2022) modernised the CLI, fixed the lexicon-format brittleness of 2.x, and is now the canonical install.
- The LLM-era TTS bypass of phoneme intermediates. OpenAI gpt-realtime (Oct 2024), ElevenLabs v3, and the open-source XTTS / OpenVoice / F5-TTS / E2-TTS line synthesise audio end-to-end from text + acoustic prompt, with no IPA, no ARPAbet, and no CMUdict at inference time. The phoneme layer survives at training time (datasets are often phonemised for stability, and pronunciation control via SSML <phoneme> tags still helps for proper nouns and homographs), but the user-visible role is shrinking. The classical pipeline (text → G2P → phoneme sequence → acoustic model → vocoder) is now the low-resource-language and controllability path, not the flagship path.
- Phonetic-matching algorithms as DSL-adjacent constants. Soundex (1918), Metaphone (1990), Double Metaphone (2000), Phonex, NYSIIS, and Caverphone are not languages in any conventional sense, but they appear here because they are defined as terse symbolic transformation rules (“drop silent letters, fold equivalence classes, output a fixed-width code”). They live on in SQL (SOUNDEX() is built into MySQL and SQL Server, and is available in PostgreSQL through the fuzzystrmatch extension), in record-linkage and entity-resolution code, and in genealogy/healthcare deduplication.
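
The SSML <phoneme> tag mentioned in the threads above is the main place where a phonetic string still reaches a modern engine. Below is a minimal sketch of building such a fragment with Python's standard library, assuming an engine that accepts SSML 1.1 with `alphabet="ipa"`. The helper name `phoneme_ssml` and the sample transcription are illustrative assumptions, not taken from any vendor's documentation.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # SSML 1.1 namespace URI


def phoneme_ssml(before: str, word: str, ipa: str, after: str = "",
                 lang: str = "en-US") -> str:
    """Wrap one word in a <phoneme> element inside a <speak> document."""
    speak = ET.Element(
        "speak",
        {"version": "1.1", "xmlns": SSML_NS, "xml:lang": lang},
    )
    speak.text = before
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word    # orthographic fallback for engines that ignore ph
    phoneme.tail = after
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # Illustrative IPA for a place name whose spelling misleads G2P rules.
    print(phoneme_ssml("Welcome to ", "Worcester", "\u02c8w\u028ast\u0259r", "."))
```

Building the element tree instead of concatenating strings keeps quotes and ampersands in the text or the `ph` attribute correctly escaped.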
Citations
- IPA Chart, International Phonetic Association: https://www.internationalphoneticassociation.org/content/ipa-chart
- IPA chart 2020 PDF (Wikimedia Commons): https://commons.wikimedia.org/wiki/File:IPA_chart_2020.pdf
- Unicode IPA Extensions block: https://www.unicode.org/charts/PDF/U0250.pdf
- X-SAMPA reference (John Wells, UCL): https://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm
- Original SAMPA: https://www.phon.ucl.ac.uk/home/sampa/
- ARPAbet (Wikipedia overview, primary references therein): https://en.wikipedia.org/wiki/ARPABET
- Kirshenbaum ASCII-IPA: http://www.kirshenbaum.net/IPA/ascii-ipa.pdf
- TIPA on CTAN: https://ctan.org/pkg/tipa
- CMUdict (cmusphinx GitHub mirror): https://github.com/cmusphinx/cmudict
- CMUdict online query: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- W3C SSML 1.1 Recommendation: https://www.w3.org/TR/speech-synthesis11/
- W3C PLS 1.0 Recommendation: https://www.w3.org/TR/pronunciation-lexicon/
- W3C SRGS 1.0 Recommendation: https://www.w3.org/TR/speech-grammar/
- W3C Semantic Interpretation for SRGS (SISR): https://www.w3.org/TR/semantic-interpretation/
- eSpeak NG GitHub + releases: https://github.com/espeak-ng/espeak-ng
- eSpeak NG 1.52.0 release: https://github.com/espeak-ng/espeak-ng/releases/tag/1.52.0
- eSpeak NG phoneme docs: https://github.com/espeak-ng/espeak-ng/blob/master/docs/phonemes.md
- MaryTTS GitHub: https://github.com/marytts/marytts
- Festival Speech Synthesis System (CSTR): https://www.cstr.ed.ac.uk/projects/festival/
- Festvox toolkit (CMU): http://www.festvox.org/
- MBROLA on GitHub: https://github.com/numediart/MBROLA
- HTK toolkit: https://htk.eng.cam.ac.uk/
- Montréal Forced Aligner docs: https://montreal-forced-aligner.readthedocs.io/
- MFA PyPI: https://pypi.org/project/Montreal-Forced-Aligner/
- Praat manual (TextGrid file formats): https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html
- Kaldi data prep (lexicon + phones): https://kaldi-asr.org/doc/data_prep.html
- OpenJTalk: https://open-jtalk.sourceforge.net/
- jieba (Chinese segmentation + pinyin): https://github.com/fxsjy/jieba
- Soundex / Metaphone overviews: https://en.wikipedia.org/wiki/Soundex , https://en.wikipedia.org/wiki/Metaphone