Voice / Phonetics / Pronunciation DSLs Family Index

type: language-family-index
family: voice-phonetics
languages_catalogued: 22
tags: [language-reference, family-index, voice-phonetics, ipa, x-sampa, arpabet, cmudict, espeak, mfa, tts]

Voice / Phonetics / Pronunciation — Family Index

Family overview

The voice/phonetics family is the set of textual notations for encoding speech sounds: the alphabets, lexicons, and engine-specific phoneme files that sit between orthographic text and acoustic output. The canonical citizen is the International Phonetic Alphabet (IPA), first published by the International Phonetic Association in 1888 (Paris/London) as a Latin-derived alphabet with one symbol per distinctive sound. The chart has been revised several times: the substantive 1989 Kiel revision, the 1996 and 2005 updates (the latter adding the labiodental flap), the 2015 layout refresh, and the 2020 administrative re-issue (year/copyright only; no symbol changes). The IPA Extensions Unicode block (U+0250..02AF), together with IPA-related symbols scattered through Spacing Modifier Letters (U+02B0..02FF) and Combining Diacritical Marks (U+0300..036F), is what lets the alphabet flow through ordinary text pipelines.
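
The block ranges above can be checked mechanically. The following is a minimal Python sketch, assuming only the three ranges quoted in this overview; the helper names `block_of` and `block_profile` are made up for illustration. Note that many IPA letters (p, t, a, and so on) are plain Basic Latin, so a result of "other" does not mean a character is not IPA.

```python
# Ranges as quoted in the overview; names are the Unicode block names.
IPA_RELATED_BLOCKS = [
    ("IPA Extensions", 0x0250, 0x02AF),
    ("Spacing Modifier Letters", 0x02B0, 0x02FF),
    ("Combining Diacritical Marks", 0x0300, 0x036F),
]


def block_of(ch: str) -> str:
    """Return the IPA-related block that contains ch, or 'other'."""
    cp = ord(ch)
    for name, lo, hi in IPA_RELATED_BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


def block_profile(transcription: str) -> dict:
    """Count how many code points of a transcription fall in each block."""
    counts = {}
    for ch in transcription:
        key = block_of(ch)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # f, schwa, primary-stress mark, n, open-mid e, combining tilde
    sample = "f\u0259\u02C8n\u025B\u0303"
    for block, n in block_profile(sample).items():
        print(f"{block}: {n}")
```

A profile like this is a quick way to spot transcriptions that will need normalisation (combining marks) before they pass through tools that assume one code point per symbol.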

Because IPA pre-dates Unicode by a century, an entire shadow ecosystem of ASCII-friendly transliterations grew up: SAMPA (1989, EU ESPRIT project, with per-language tables SAMPA-DE / SAMPA-EN / SAMPA-FR / etc.), its superset X-SAMPA (1995, John Wells, UCL; covers the entire IPA chart in 7-bit ASCII), Kirshenbaum (1992, also called ASCII-IPA), and the older Anglocentric ARPAbet (1971, DARPA Speech Understanding Research). ARPAbet won inside North-American TTS: its 39-phoneme inventory plus 0/1/2 stress digits is the surface form of CMUdict (Carnegie Mellon Pronouncing Dictionary, 134k+ entries, current canonical release cmudict-0.7b from November 2014). CMUdict in turn is the lexicon behind Festival, Flite, Sphinx, and many other English-TTS-from-source projects.
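
To make the "ARPAbet phones plus 0/1/2 stress digits" surface form concrete, here is a small Python sketch of a CMUdict-style line parser. It assumes the plain-text layout of the 0.7b file (word, whitespace, space-separated phones, `;;;` comment lines, `WORD(1)` for alternate pronunciations); the function names are hypothetical and the sample lines are illustrative rather than quoted from a specific release.

```python
import re

_VARIANT = re.compile(r"\(\d+\)$")          # alternate-pronunciation marker, e.g. READ(1)
_PHONE = re.compile(r"^([A-Z]+)([012])?$")  # ARPAbet symbol + optional stress digit


def parse_entry(line):
    """Parse one lexicon line into (word, [(phone, stress_or_None), ...]).

    Returns None for blank lines and ';;;' comments (assumed layout).
    """
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    head, *phones = line.split()
    word = _VARIANT.sub("", head)
    parsed = []
    for p in phones:
        m = _PHONE.match(p)
        if m is None:
            raise ValueError(f"not an ARPAbet token: {p!r}")
        base, stress = m.groups()
        parsed.append((base, int(stress) if stress is not None else None))
    return word, parsed


def stress_pattern(phones):
    """Concatenate the stress digits of the vowels, e.g. '01'."""
    return "".join(str(s) for _, s in phones if s is not None)


if __name__ == "__main__":
    word, phones = parse_entry("HELLO  HH AH0 L OW1")
    print(word, phones, stress_pattern(phones))
```

Only vowels carry a stress digit, so the stress pattern doubles as a syllable count, which is one reason the format is popular as a G2P bootstrap.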

Around the alphabets sits a TTS-engine phoneme-data zoo: Festival’s .utt/HTS label files (CSTR Edinburgh, 1996), MaryTTS XML (DFKI, 2000+; last release 5.2.1 in 2022, now maintenance-only), MBROLA diphone packages (TCTS Mons, 1996, open-sourced 2019), and eSpeak NG, which uses a Kirshenbaum-derived ASCII IPA internally. The most recent eSpeak NG release is 1.52.0 (Dec 2024 / Jan 2025), which added stress marks to phoneme events and finalised the cmake-only build. On the speech-recognition side, the relevant label formats are the Kaldi nnet3 phoneme labels and the HTK label file (.lab, Cambridge, 1989+). For time-aligning phonemes to audio, the Montréal Forced Aligner (MFA), currently at 3.3.9 (Feb 2026), is the modern de facto standard.

The W3C overlay (SSML 1.1, PLS 1.0, SRGS 1.0 — all catalogued in detail at accessibility-aria) standardises pronunciation hints, lexicon entries, and recognition grammars for the browser/cloud speech APIs. The LLM era has materially reduced the importance of phoneme intermediates: OpenAI’s gpt-realtime (2024), ElevenLabs v3, and the open-source XTTS / OpenVoice / F5-TTS line synthesise audio end-to-end from text + acoustic prompt, often bypassing the phoneme layer entirely. The phonetic-data layer is therefore quietly bifurcating: classical TTS/ASR still needs it for low-resource languages, alignment, and pronunciation-control corner cases; flagship neural TTS does not. IPA itself remains indispensable for linguistics, lexicography, and Wiktionary.

In our deep library

None of these have standalone deep-library notes. Cross-reference:

accessibility-aria — sibling; W3C SSML 1.1, PLS 1.0, and SRGS 1.0 are catalogued there. This index cross-lists them but does not re-document.
codec-and-dsp — sibling; Kaldi nnet3, Vosk, and acoustic-side speech processing live there. Phoneme-label files are the textual interface between the two families.
nlp-corpus — sibling; Praat TextGrid and corpus annotation formats live there. Cross-listed below for forced-alignment.
i18n-locale — locale data (CLDR transformations, Unicode locale rules) carries pronunciation hints for digits/dates/currencies that feed TTS.
notation-spec — formal-grammar tradition (BNF/ABNF) underlies SRGS.
document-typesetting — TIPA (TeX IPA macros) belongs to the LaTeX ecosystem; cross-listed.
chatbot-intent-dsls — Dragon NaturallySpeaking voice command grammars and intent slot-filling DSLs overlap.

Tier 3 family table — Phonetic alphabets

Format	First appeared	Origin	Type	Status (2026)	URL
IPA (International Phonetic Alphabet)	1888	International Phonetic Association (Paris/London, Paul Passy et al.)	Unicode/print alphabet, one symbol per distinctive sound	Canonical; chart last substantively updated 2005 (labiodental flap), layout-refreshed 2015, re-issued 2018 / 2020 with copyright-only changes; annually re-issued since 2025	https://www.internationalphoneticassociation.org/content/ipa-chart
IPA Extensions Unicode block	1991 (Unicode 1.0)	Unicode Consortium	Unicode block `U+0250..02AF` (plus modifier letters `U+02B0..02FF` and combining diacritics `U+0300..036F`)	Stable; the standard digital encoding of IPA	https://www.unicode.org/charts/PDF/U0250.pdf
X-SAMPA	1995	John Wells (UCL)	ASCII-only superset of SAMPA covering the full IPA chart	Active; remains the standard ASCII-IPA used in ASR/TTS corpora and ISO 639-3 wordlists	https://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm
SAMPA (original)	1989	EU ESPRIT SAM project (Speech Assessment Methods)	Per-language ASCII phonetic tables (SAMPA-DE, SAMPA-EN, SAMPA-FR, …)	Legacy but still cited; superseded in practice by X-SAMPA for cross-language work	https://www.phon.ucl.ac.uk/home/sampa/
ARPAbet	1971	DARPA Speech Understanding Research project	ASCII phoneme set for North-American English; 39 phonemes + 0/1/2 stress digits	Active; the lingua franca of US-English TTS via CMUdict	https://en.wikipedia.org/wiki/ARPABET
Kirshenbaum (ASCII-IPA)	1992	Evan Kirshenbaum (HP Labs)	7-bit ASCII transliteration of IPA, broader than ARPAbet	Niche but live; eSpeak NG uses a Kirshenbaum-derived encoding internally	http://www.kirshenbaum.net/IPA/ascii-ipa.pdf
TIPA (TeX IPA macros)	1996	Fukui Rei	LaTeX macro package for typesetting IPA in print	Active; standard in linguistics journals; current TIPA 1.3 (CTAN, periodic updates)	https://ctan.org/pkg/tipa
Wikipedia / Wiktionary IPA templates	~2003	MediaWiki community	Wiki markup wrappers around IPA Unicode (`{{IPA}}`, `{{IPAc-en}}`, `{{IPA-fr}}`, …)	Very active; the largest curated IPA corpus in the world by entry count	https://en.wiktionary.org/wiki/Wiktionary:International_Phonetic_Alphabet
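
Because X-SAMPA is an ASCII re-spelling of the IPA chart, conversion is essentially a table lookup with longest-match tokenisation. The Python sketch below illustrates the idea under loud assumptions: the table is a small hand-picked subset of the X-SAMPA chart, not the full mapping, and the function name is invented for this example.

```python
# A small subset of the X-SAMPA table (assumption: illustrative only;
# the full chart has many more symbols and diacritics).
XSAMPA_TO_IPA = {
    "r\\": "\u0279",  # alveolar approximant (two-character X-SAMPA symbol)
    "@": "\u0259",    # schwa
    "S": "\u0283",    # voiceless postalveolar fricative
    "Z": "\u0292",    # voiced postalveolar fricative
    "T": "\u03B8",    # voiceless dental fricative
    "D": "\u00F0",    # voiced dental fricative
    "N": "\u014B",    # velar nasal
    "I": "\u026A",    # near-close front vowel
    "E": "\u025B",    # open-mid front vowel
    "{": "\u00E6",    # near-open front vowel
    '"': "\u02C8",    # primary stress
    "%": "\u02CC",    # secondary stress
    ":": "\u02D0",    # length mark
}
_MAX_LEN = max(len(k) for k in XSAMPA_TO_IPA)


def xsampa_to_ipa(text: str) -> str:
    """Greedy longest-match conversion; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        for n in range(_MAX_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # plain letters such as f, d, p are shared with IPA
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('"fIS'))
    print(xsampa_to_ipa("r\\Ed"))
```

Longest-match matters because several X-SAMPA symbols are built from a base character plus a backslash or backtick; matching single characters first would split them.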

Tier 3 family table — Pronunciation lexicons

Format	First appeared	Origin	Type	Status (2026)	URL
CMUdict (Carnegie Mellon Pronouncing Dictionary)	1993 (v0.1); 0.7b in November 2014	Carnegie Mellon University Speech Group	Public-domain English pronunciation lexicon in ARPAbet; 134k+ entries with 0/1/2 stress digits	Canonical; `cmudict-0.7b` remains the current standard release (community-maintained via cmusphinx/cmudict GitHub; infrequent point updates)	https://github.com/cmusphinx/cmudict
W3C Pronunciation Lexicon Specification (PLS) 1.0	2008 (W3C Recommendation)	W3C Voice Browser WG	XML pronunciation-lexicon format for SSML/SRGS interop; IPA + X-SAMPA aliases	Stable Recommendation; covered in depth in accessibility-aria	https://www.w3.org/TR/pronunciation-lexicon/
Festival lexicon / ICELS	1996+	CSTR Edinburgh + CMU (Festvox)	Scheme-style entries (`(word pos (phonemes))`) consumed by Festival’s lexicon module	Maintained via Festival/Festvox; current Festival 2.5.1 (July 2020)	http://www.festvox.org/docs/manual-2.4.0/festival_24.html
MaryTTS lexicon	2001+	DFKI MaryTTS	XML and FST-compiled lexicons per language	Maintenance only; MaryTTS 5.2.1 (May 2022) is the last release	https://github.com/marytts/marytts
OpenJTalk dictionary (NAIST jdic / Mecab-NAIST-jdic)	2009	Nagoya Institute of Technology	Japanese pronunciation + accent dictionary feeding OpenJTalk’s HTS-engine voices	Active; widely embedded in Japanese TTS pipelines	https://open-jtalk.sourceforge.net/
Jieba segmentation dictionary	2012	Sun Junyi (jieba project)	Chinese word-segmentation dictionary (word, frequency, part-of-speech); pinyin readings come from a separate converter such as pypinyin	Active; widely used as the segmentation front end in Python Chinese TTS pipelines	https://github.com/fxsjy/jieba

Tier 3 family table — TTS-engine phoneme / voice formats

Format	First appeared	Origin	Type	Status (2026)	URL
eSpeak NG phoneme + voice files	2015 (NG fork); eSpeak itself 1995	Reece H. Dunn (NG fork of Jonathan Duddington’s eSpeak)	Plain-text phoneme rule files + per-voice prosody data; uses Kirshenbaum-derived ASCII IPA internally	Active; current 1.52.0 (Dec 2024) added stress marks to phoneme events, finalised cmake-only build (autoconf removed in 1.52)	https://github.com/espeak-ng/espeak-ng
Festival utt format / HTS labels	1996	CSTR Edinburgh (Black, Taylor, Caley)	Utterance representation: heterogeneous-relation-graph of words, syllables, phonemes, prosody; HTS-engine `.lab` for HMM voices	Maintained; Festival 2.5.1 (Jul 2020) is current; HTS-engine still tracks it	https://www.cstr.ed.ac.uk/projects/festival/
MaryTTS MaryXML	2001+	DFKI MaryTTS	XML pipeline format carrying raw text → tokens → phonemes → acoustic parameters	Frozen with MaryTTS 5.2.1 (2022)	https://github.com/marytts/marytts/wiki/MaryXML
MBROLA diphone format	1996 (open-sourced 2019)	TCTS Lab, Université de Mons (Thierry Dutoit)	Binary diphone databases + plain-text phoneme/duration/pitch input	Open / maintained since the 2019 open-source release on GitHub	https://github.com/numediart/MBROLA
Festvox voices	1998	Alan W. Black, Kevin Lenzo (CMU)	Voice-building toolkit on top of Festival; Clustergen, Clunits, HTS recipes	Active for academic use	http://www.festvox.org/
HTK label file (`.lab`)	1989+	Cambridge University Engineering Dept (Steve Young et al.)	Plain-text phoneme labels with start/end times in 100-ns units	Legacy but live; still the lingua franca for HMM-era acoustic models and many alignment toolkits	https://htk.eng.cam.ac.uk/
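
The 100-ns time unit in HTK label files is easy to get wrong, so a short Python sketch may help. It assumes the simplest label layout, one `start end label` triple per line with both times present; the sample labels and the function name are hypothetical.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # one HTK time unit is 100 ns


def parse_htk_lab(text: str):
    """Parse 'start end label' lines into (start_s, end_s, label) tuples.

    Assumption: both times are present on every line; lines with fewer
    than three fields are skipped rather than interpreted.
    """
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        segments.append(
            (start / HTK_UNITS_PER_SECOND, end / HTK_UNITS_PER_SECOND, label)
        )
    return segments


# Hypothetical alignment for illustration only.
SAMPLE_LAB = """\
0 3000000 sil
3000000 4500000 HH
4500000 6000000 AH0
"""

if __name__ == "__main__":
    for start_s, end_s, label in parse_htk_lab(SAMPLE_LAB):
        print(f"{label}: {start_s:.3f}s to {end_s:.3f}s")
```

Keeping the times as integers until the final division avoids accumulating floating-point error when segment boundaries are compared for adjacency.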

Tier 3 family table — Forced alignment, ASR labels, encoding algorithms

Format	First appeared	Origin	Type	Status (2026)	URL
Montréal Forced Aligner (MFA)	2017	McAuliffe et al., Montreal Corpus Tools / McGill	Kaldi-backed forced aligner; ships acoustic models + lexicons for 80+ languages; reads/writes Praat TextGrid	Active; current 3.3.9 (Feb 2026); the modern de facto forced-alignment standard	https://montreal-forced-aligner.readthedocs.io/
Praat TextGrid	1992+	Boersma & Weenink (Univ. Amsterdam)	Plain-text tier-based time-aligned annotation format; the lingua franca of phonetics labs	Active; cross-listed from nlp-corpus	https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html
Kaldi GMM/nnet3 lexicon + phoneme labels	2011+	Povey et al., Johns Hopkins	`lexicon.txt`, `phones.txt`, `nnet3` posterior labels; the Kaldi recipe convention	Active; cross-listed from codec-and-dsp	https://kaldi-asr.org/doc/data_prep.html
W3C SRGS 1.0 (Speech Recognition Grammar Spec)	2004 (Recommendation)	W3C Voice Browser WG	Two equivalent syntaxes — XML form and ABNF form — for recognition grammars	Stable Recommendation; covered in accessibility-aria; still embedded in IVR/MRCP stacks	https://www.w3.org/TR/speech-grammar/
W3C SSML 1.1	2010 (Recommendation)	W3C Voice Browser WG	XML markup for prosody, phonemes (`<phoneme ph="...">`), break, say-as, lookup	Stable Recommendation; covered in accessibility-aria	https://www.w3.org/TR/speech-synthesis11/
Soundex	1918 (patent), formal 1922	Robert C. Russell & Margaret K. Odell	Phonetic-matching algorithm encoding surnames to 1-letter + 3-digit codes	Legacy/canonical; still the default phonetic match in many SQL engines (`SOUNDEX()`)	https://en.wikipedia.org/wiki/Soundex
Metaphone / Double Metaphone / Metaphone 3	1990 / 2000 / 2009	Lawrence Philips	Improved phonetic-matching algorithms for English (Double Metaphone covers Slavic, Germanic, Spanish)	Active; Metaphone 3 is commercial, Double Metaphone is BSD-licensed	https://en.wikipedia.org/wiki/Metaphone
Phonex / NYSIIS / Caverphone	1990 / 1970 / 2002	A. J. Lait & B. Randell / NY State / David Hood (NZ)	Regional phonetic-matching variants (UK, US, NZ)	Niche; bundled in libraries like Apache Commons Codec, `jellyfish` (Python)	https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
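
As an illustration of the SSML `<phoneme ph="...">` hint listed above, the following Python sketch builds a minimal SSML 1.1 document with the standard library. The helper name and the example sentence are assumptions for illustration, and whether a given synthesiser honours the `ph` value depends on the engine.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def speak_with_phoneme(before, word, ipa, after, lang="en-US"):
    """Return an SSML string in which `word` carries an IPA pronunciation hint."""
    speak = ET.Element(
        "speak",
        {"version": "1.1", "xmlns": SSML_NS, "xml:lang": lang},
    )
    speak.text = before
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word   # orthographic fallback if the hint is ignored
    phoneme.tail = after  # text following the element
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    ssml = speak_with_phoneme(
        "I say ",
        "tomato",
        "t\u0259\u02C8m\u0251\u02D0t\u0259\u028A",  # illustrative IPA value
        ".",
    )
    print(ssml)
```

The element text stays as ordinary orthography, so a consumer that does not understand the hint can still read the sentence aloud.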

Notable threads

IPA’s 138-year longevity (1888 → 2026). Few notations in computing or linguistics have survived this long without forking. The IPA’s secret is that its governing body (the International Phonetic Association) restricts revisions to substantive, evidence-based proposals. The chart changed in 1989 (Kiel), 1993, 1996, 2005 (labiodental flap), and 2015 (layout), and was re-issued in 2018, 2020, and yearly since 2025 with only year/copyright changes. The Unicode IPA Extensions block (U+0250..02AF) cemented its digital permanence in 1991. Even modern neural TTS papers cite IPA in their datasets, although their models bypass it at inference.
X-SAMPA’s stubborn relevance despite Unicode dominance. You’d expect a 1995 ASCII-only encoding of IPA to have been obsoleted by Unicode IPA, and yet X-SAMPA persists in 2026 corpora because: (a) some toolchains still mangle Unicode in CSV/TSV pipelines; (b) regex over X-SAMPA is far simpler than over combining-diacritic Unicode IPA; (c) MaryTTS, eSpeak NG (Kirshenbaum), and many Kaldi recipes still natively accept ASCII phonetic input; (d) X-SAMPA-to-IPA conversion is a one-shot lookup table, making round-tripping trivial. Lexicon files in academic phonetics often ship both representations.
ARPAbet + CMUdict as the dominant US-English TTS pair. The combination of (1) a 39-phoneme ASCII inventory designed for US English in 1971 and (2) a 134k-entry public-domain lexicon (CMUdict 0.7b, Nov 2014, still canonical in 2026) underlies most from-source English TTS projects of the past 30 years, including Festival, Flite, Sphinx, and the Mozilla TTS line. Many open-source neural TTS pretraining pipelines (Tacotron, FastSpeech, VITS variants) still use CMUdict as their G2P (grapheme-to-phoneme) bootstrap, even when the final model is end-to-end. CMUdict releases are deliberately infrequent (0.7b has been the canonical release for 11+ years), which is a feature for reproducibility, not neglect.
eSpeak NG’s surprising portability and longevity. Despite being a tiny C codebase (originally Jonathan Duddington, NG fork since 2015 by Reece Dunn), eSpeak NG supports 100+ languages, runs on every platform from Android to Raspberry Pi to web (via WASM/Emscripten), and remains the fallback TTS for screen readers (NVDA, Orca) and a popular G2P backend for neural TTS phonemisation. The 1.52.0 release (Dec 2024 / Jan 2025) added stress marks to phoneme events and finalised the cmake-only build (autoconf was deprecated). It is arguably the most widely installed TTS engine on Linux and the de facto reference for low-resource-language G2P.
MaryTTS / Festival as gracefully aging classics. Both Festival (CSTR, 1996, current 2.5.1 Jul 2020) and MaryTTS (DFKI, 2000s, current 5.2.1 May 2022) have settled into maintenance mode — neither is dead, but neither is the answer for new production TTS work in 2026. Their durable value is for academic phonetics, low-resource languages, voice cloning research with controllable phonetic input, and the Festvox tooling ecosystem. The Festival .utt format and MaryTTS MaryXML remain useful reference data formats for understanding heterogeneous-relation-graph utterance representations.
MFA as the modern forced-alignment standard. Montréal Forced Aligner (McAuliffe et al., 2017, current 3.3.9 Feb 2026) effectively replaced HTK-era alignment for new academic work. It ships acoustic models + lexicons for 80+ languages, reads/writes Praat TextGrid (so it slots straight into existing phonetics workflows), and uses Kaldi under the hood. Almost every speech-corpus paper from 2020 onwards either uses MFA directly or compares against it. The 3.x release line (since 2022) modernised the CLI, fixed the lexicon-format brittleness of 2.x, and is now the canonical install.
The LLM-era TTS bypass of phoneme intermediates. OpenAI gpt-realtime (Oct 2024), ElevenLabs v3, and the open-source XTTS / OpenVoice / F5-TTS / E2-TTS line synthesise audio end-to-end from text + acoustic prompt — no IPA, no ARPAbet, no CMUdict at inference time. The phoneme layer survives at training time (datasets are often phonemised for stability, and pronunciation control via SSML <phoneme> tags still helps for proper nouns and homographs), but the user-visible role is shrinking. The classical pipeline (text → G2P → phoneme sequence → acoustic model → vocoder) is now the low-resource-language and controllability path, not the flagship path.
Phonetic-matching algorithms as DSL-adjacent constants. Soundex (1918), Metaphone (1990), Double Metaphone (2000), Phonex, NYSIIS, and Caverphone are not languages in any conventional sense, but they appear here because they are defined as terse symbolic transformation rules (“drop silent letters, fold equivalence classes, output a fixed-width code”). They live on in SQL (SOUNDEX() is built into MySQL and SQL Server, and PostgreSQL provides soundex() through its fuzzystrmatch extension), in record-linkage and entity-resolution code, and in genealogy/healthcare deduplication.
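
The "fold equivalence classes, output a fixed-width code" description of Soundex can be sketched in a few lines of Python. This follows the commonly described American Soundex rules (keep the first letter, collapse adjacent equal codes, let H and W be transparent, pad to one letter plus three digits); it is an illustrative sketch, and the output of a particular SQL engine’s SOUNDEX() may differ in edge cases.

```python
_SOUNDEX_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
_CODE = {ch: digit for digit, letters in _SOUNDEX_GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 1-letter + 3-digit Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODE.get(letters[0], "")  # the first letter's code counts for adjacency
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODE.get(ch, "")
        if code and code != prev:
            out.append(code)
            if len(out) == 4:
                break
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return "".join(out).ljust(4, "0")


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"):
        print(name, soundex(name))
```

Names that sound alike, such as Robert and Rupert, land on the same code, which is what makes the algorithm useful as a cheap blocking key in record linkage.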

Citations

IPA Chart, International Phonetic Association: https://www.internationalphoneticassociation.org/content/ipa-chart
IPA chart 2020 PDF (Wikimedia Commons): https://commons.wikimedia.org/wiki/File:IPA_chart_2020.pdf
Unicode IPA Extensions block: https://www.unicode.org/charts/PDF/U0250.pdf
X-SAMPA reference (John Wells, UCL): https://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm
Original SAMPA: https://www.phon.ucl.ac.uk/home/sampa/
ARPAbet (Wikipedia overview, primary references therein): https://en.wikipedia.org/wiki/ARPABET
Kirshenbaum ASCII-IPA: http://www.kirshenbaum.net/IPA/ascii-ipa.pdf
TIPA on CTAN: https://ctan.org/pkg/tipa
CMUdict (cmusphinx GitHub mirror): https://github.com/cmusphinx/cmudict
CMUdict online query: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
W3C SSML 1.1 Recommendation: https://www.w3.org/TR/speech-synthesis11/
W3C PLS 1.0 Recommendation: https://www.w3.org/TR/pronunciation-lexicon/
W3C SRGS 1.0 Recommendation: https://www.w3.org/TR/speech-grammar/
W3C Semantic Interpretation for SRGS (SISR): https://www.w3.org/TR/semantic-interpretation/
eSpeak NG GitHub + releases: https://github.com/espeak-ng/espeak-ng
eSpeak NG 1.52.0 release: https://github.com/espeak-ng/espeak-ng/releases/tag/1.52.0
eSpeak NG phoneme docs: https://github.com/espeak-ng/espeak-ng/blob/master/docs/phonemes.md
MaryTTS GitHub: https://github.com/marytts/marytts
Festival Speech Synthesis System (CSTR): https://www.cstr.ed.ac.uk/projects/festival/
Festvox toolkit (CMU): http://www.festvox.org/
MBROLA on GitHub: https://github.com/numediart/MBROLA
HTK toolkit: https://htk.eng.cam.ac.uk/
Montréal Forced Aligner docs: https://montreal-forced-aligner.readthedocs.io/
MFA PyPI: https://pypi.org/project/Montreal-Forced-Aligner/
Praat manual (TextGrid file formats): https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html
Kaldi data prep (lexicon + phones): https://kaldi-asr.org/doc/data_prep.html
OpenJTalk: https://open-jtalk.sourceforge.net/
jieba (Chinese segmentation + pinyin): https://github.com/fxsjy/jieba
Soundex / Metaphone overviews: https://en.wikipedia.org/wiki/Soundex , https://en.wikipedia.org/wiki/Metaphone
