Historical Linguistics and Typology
Historical linguistics reconstructs the past of languages: how they change over time, how they descend from common ancestors, and how they influence one another through contact. Typology surveys the structural diversity of the world’s languages, seeking patterns, universals, and the limits of variation. The two fields converge on the comparative study of language families and the explanation of how attested variety arose from earlier states.
The Comparative Method
The comparative method is the central tool for establishing genetic relationship between languages. Starting from cognate sets — words across multiple languages descended from a common ancestor — the linguist identifies regular sound correspondences. The Latin–Sanskrit–Greek–Gothic–English data on the word for father — pater, pitár, patḗr, fadar, father — yields the correspondences Latin p : Sanskrit p : Greek p : Gothic f : English f in word-initial position. The same correspondence appears in pes / pad / poús / fōtus / foot, piscis / – / – / fisks / fish, pellis / – / pélma / fill / fell. Regularity supports the inference that all five forms in each row descend from a single Proto-Indo-European form: *p_ə_₂tḗr,*ped-, *peisk-,*pel-.
The regularity hypothesis (Neogrammarians, Leipzig 1870s — Karl Brugmann, August Leskien, Hermann Osthoff) holds that sound change is exceptionless: every word containing the affected segment in the conditioning environment undergoes the change. Apparent exceptions reduce to dialect borrowing, analogical leveling, or as-yet-unrecognized conditioning. The principle remains the backbone of historical phonology, though Labov’s empirical work on change-in-progress (Philadelphia, 1960s–2010s) shows that some changes diffuse lexically before completing as a regular shift.
The Discovery of Indo-European
The Indo-European hypothesis emerged in the late 18th century. Sir William Jones, a British judge in Calcutta, observed in his 1786 anniversary discourse to the Asiatic Society: “The Sanskrit language, whatever be its antiquity, is of a wonderful structure; more perfect than the Greek, more copious than the Latin, and more exquisitely refined than either, yet bearing to both of them a stronger affinity, both in the roots of verbs and in the forms of grammar, than could possibly have been produced by accident; so strong indeed, that no philologer could examine them all three, without believing them to have sprung from some common source, which, perhaps, no longer exists.”
Franz Bopp Über das Conjugationssystem der Sanskritsprache (1816) compared the verb systems of Sanskrit, Greek, Latin, Persian, and Germanic, establishing systematic correspondences. Rasmus Rask independently (1814 manuscript, published 1818) worked out the consonant correspondences between Germanic and other Indo-European branches that would later be codified as Grimm’s Law (Jacob Grimm Deutsche Grammatik 1822).
August Schleicher Compendium der vergleichenden Grammatik der indogermanischen Sprachen (1861) developed the Stammbaumtheorie (family tree model) — languages branch from a common ancestor like a phylogenetic tree. Schleicher even composed a short fable in his reconstructed Proto-Indo-European (Avis akvāsas ka, “The Sheep and the Horses,” 1868), which has since been repeatedly updated as reconstruction improved.
Johannes Schmidt Die Verwandtschaftsverhältnisse der indogermanischen Sprachen (1872) countered with the Wellentheorie (wave theory): innovations radiate from centers in waves, overlapping in different language areas — explaining why languages share some features with their immediate neighbors regardless of tree topology. Modern phylogenetic approaches synthesize tree and wave: a primary tree with secondary contact-induced sharing.
Laryngeal Theory
Ferdinand de Saussure in his Mémoire sur le système primitif des voyelles dans les langues indo-européennes (1879) — written when he was 21 — proposed that PIE had a set of “sonant coefficients” (later called laryngeals) that explained otherwise puzzling vowel alternations. Jerzy Kuryłowicz (1927) discovered direct reflexes in Hittite (deciphered after Saussure’s death in 1913) — confirming the laryngeals as real segments. The mainstream reconstruction posits three laryngeals: *h₁ (vowel-coloring neutral or e-coloring),*h₂ (a-coloring), *h₃ (o-coloring). Their loss across most daughter branches produced compensatory vowel lengthening and coloring.
Grimm’s Law and Verner’s Law
Grimm’s Law (Jacob Grimm 1822, refining Rask) describes the First Germanic Consonant Shift, the systematic transformation of PIE stops in Proto-Germanic:
- PIE voiceless stops → Germanic voiceless fricatives: *p t k → f θ x (Lat pater : Eng father; Lat decem : Eng ten; Lat cor : Eng heart)
- PIE voiced stops → Germanic voiceless stops: *b d g → p t k (Lat slabus → Eng sleep; Lat decem d→t; Lat ager → Eng acre)
- PIE voiced aspirates → Germanic voiced stops or fricatives: *bʰ dʰ gʰ → b d g (Skt bhárāmi : Lat ferō : Eng bear)
Verner’s Law (Karl Verner 1875 Eine Ausnahme der ersten Lautverschiebung) resolved residual exceptions: PIE voiceless stops became voiced fricatives in Germanic when the immediately preceding syllable was unstressed in PIE accent. This explains the Gothic alternation faihu / fadar (PIE *péḱu /*ph₂tḗr — the laryngeal originally drew accent off the root). The discovery vindicated the regularity hypothesis: apparent exceptions had a regular cause.
Other Major Sound Laws
- Bartholomae’s Law (Indo-Iranian) — aspirate + voiceless stop → voiced stop + aspirate
- Grassmann’s Law (Greek and Indo-Iranian) — dissimilation of aspirates: in successive aspirate-aspirate syllables, the first deaspirates (Greek thríx “hair” / gen trikhós, from earlier *thrikh-)
- High German Consonant Shift (~500–700 CE) — Second Germanic shift differentiating Old High German from English and Low German: t → s/ts, p → f/pf, k → x/kx (English ten : German zehn; Eng open : Ger offen; Eng make : Ger machen)
- Great Vowel Shift of English (~1400–1700) — long vowels chain-shifted upward: ME [iː] → [aɪ] (Middle English tide [tiːdə] → ModE [taɪd]); [eː] → [iː] (meet); [aː] → [eɪ] (make); [uː] → [aʊ] (house); [oː] → [uː] (moon). The shift explains the mismatch between English spelling (frozen around Chaucer’s pronunciation) and modern values.
Ablaut and Apophony
Ablaut (German ab + Laut, “vowel gradation”) is the systematic vowel alternation within an Indo-European root family, marking grammatical or derivational categories. The classic IE e-o-∅ pattern shows in English strong verbs: sing-sang-sung, ring-rang-rung, drink-drank-drunk. Other examples: Latin tegō / toga (“I cover” / “covering”), Greek légō / lógos (“I say” / “word”). Ablaut is conditioned by accent in PIE (full grades when accented; zero grade when unaccented).
The Indo-European Branches
The Indo-European family comprises ten major branches plus several smaller groups:
- Anatolian — Hittite (deciphered Hrozný 1915 from Boğazköy archives), Luwian, Palaic, Lycian, Lydian, Carian. The oldest attested branch (~17th century BCE). Now extinct; its archaic features (laryngeals preserved) anchor reconstructions.
- Tocharian — Tocharian A and B, attested in northwest China (Tarim Basin, ~6th–8th centuries CE manuscripts). Surprising distribution given Indo-Iranian neighbors; satem/centum-defying.
- Indo-Iranian — Sanskrit, Avestan, Old Persian, Middle Indic (Pali, Prakrit), modern Indo-Aryan (Hindi-Urdu, Bengali, Punjabi, Gujarati, Marathi, Sinhala, Nepali) and Iranian (Persian, Pashto, Kurdish, Balochi, Tajik, Ossetian).
- Hellenic — Mycenaean Greek (Linear B, 14th–13th century BCE, deciphered Ventris 1952), Classical Greek (Attic, Ionic, Doric, Aeolic), Koine, Byzantine, Modern Greek.
- Italic — Latin and Sabellic (Oscan, Umbrian), descended into the Romance languages: Italian, French, Spanish, Portuguese, Romanian, Catalan, Occitan, Sardinian, Romansh, etc.
- Celtic — Continental (Gaulish, Lepontic, Celtiberian — all extinct) and Insular (Goidelic: Irish, Scottish Gaelic, Manx; Brythonic: Welsh, Cornish, Breton).
- Germanic — North (Old Norse → Icelandic, Faroese, Norwegian, Swedish, Danish), West (Old English → English, Old High German → German, Dutch, Frisian, Yiddish), East (Gothic — extinct).
- Balto-Slavic — Baltic (Lithuanian, Latvian, Old Prussian) and Slavic (Old Church Slavonic; modern East: Russian, Ukrainian, Belarusian; West: Polish, Czech, Slovak; South: Bulgarian, Macedonian, Serbo-Croatian, Slovenian).
- Albanian — isolated branch (Tosk and Gheg).
- Armenian — isolated branch (Classical Armenian “Grabar” from the 5th century CE; modern Eastern and Western Armenian).
Other extinct branches with limited attestation: Phrygian, Thracian, Illyrian, Messapic.
Non-Indo-European Families
The world has hundreds of language families. Major ones:
- Sino-Tibetan — Sinitic (Mandarin, Cantonese, Hokkien, Hakka, Wu, etc.) and Tibeto-Burman (Tibetan, Burmese, Karen, Lolo-Burmese)
- Afro-Asiatic — Semitic (Arabic, Hebrew, Amharic, Tigrinya, Aramaic, Akkadian), Egyptian (extinct → Coptic), Berber, Cushitic, Omotic, Chadic (Hausa)
- Niger-Congo — the largest African family by number of languages; includes Bantu (Swahili, Zulu, Yoruba, Igbo, Lingala, Shona, Xhosa), Kwa, Atlantic, Gur, Adamawa-Ubangi, Mande
- Nilo-Saharan — proposed family (Greenberg 1963) of disputed unity; includes Nilotic (Maasai, Dinka, Nuer), Songhay, Kanuri, Nubian
- Khoisan — historically grouped languages with click consonants in southern Africa; now widely regarded as a typological rather than genetic unity (Khoekhoe, Ju, Tuu, Sandawe, Hadza — last two probable isolates)
- Dravidian — Tamil, Telugu, Kannada, Malayalam, Tulu, Brahui (in Pakistan)
- Uralic — Finnic (Finnish, Estonian, Karelian), Samic (Sámi languages), Hungarian, Khanty, Mansi, Samoyedic (Nenets, Selkup)
- Altaic — disputed family proposing relation among Turkic (Turkish, Kazakh, Uzbek, Uyghur), Mongolic (Mongolian, Buryat), Tungusic (Manchu, Evenki), Korean, Japonic (Japanese, Ryukyuan). Most contemporary historical linguists treat these as separate families with prolonged contact (Sprachbund effects)
- Austronesian — one of the largest families by geography and language count: Formosan (Taiwan, the homeland), Malayo-Polynesian (Indonesian, Malay, Tagalog, Javanese, Malagasy, Hawaiian, Maori, Samoan, Tongan, Fijian, Tahitian)
- Austroasiatic — Vietnamese, Khmer, Mon, Mundari, Khasi, Santali
- Tai-Kadai — Thai, Lao, Shan, Zhuang
- Hmong-Mien — Hmong / Miao, Mien / Yao
- Trans-New Guinea — proposed family covering hundreds of New Guinea highlands languages
- Pama-Nyungan — most of Australia (Warlpiri, Pitjantjatjara, Arrernte, Dyirbal, Guugu Yimithirr)
- Algic — Algonquian (Cree, Ojibwe, Cheyenne, Blackfoot, Powhatan), Yurok, Wiyot
- Iroquoian — Mohawk, Cherokee, Oneida, Onondaga, Seneca, Cayuga, Tuscarora
- Salishan — Coast Salish (Lushootseed, Halkomelem) and Interior Salish (Lillooet, Colville-Okanagan, Spokane)
- Mayan — Yucatec, K’iche’, Q’eqchi’, Tzeltal, Tzotzil, Huastec
- Quechuan and Aymaran — Andean families (Quechua spoken by ~8 million; Aymara ~2 million)
Language Isolates
A language isolate has no demonstrated relatives. Major isolates:
- Basque (Euskara) — Pyrenees; pre-Indo-European; ~750k speakers
- Burushaski — northern Pakistan (Hunza and Yasin valleys)
- Korean — often treated as isolate or with Japonic as “Koreanic”; ~80M speakers
- Ainu — northern Japan (Hokkaido); critically endangered
- Sumerian — ancient Mesopotamia (extinct ~2000 BCE; preserved in Akkadian cuneiform); first written language
- Etruscan — ancient Italy (extinct; influenced Latin alphabet)
- Elamite, Hurrian, Hattic, Kassite — ancient Near East isolates
- Zuni, Haida, Tlingit (sometimes joined with Na-Dene), Kutenai, Yuchi — North America
- Sandawe, Hadza — Tanzania (formerly grouped with Khoisan)
Macro-Family Proposals — Controversial
Some linguists have proposed deeper relationships among multiple families:
- Nostratic — Holger Pedersen 1903; revived Vladislav Illich-Svitych and Aharon Dolgopolsky 1960s–70s; Allan Bomhard A Comprehensive Introduction to Nostratic Comparative Linguistics 2008. Proposes affinity among Indo-European, Afro-Asiatic, Uralic, Altaic, Dravidian, Kartvelian, Eskimo-Aleut. Methodology disputed.
- Eurasiatic — Joseph Greenberg Indo-European and its Closest Relatives (2000, 2002). Indo-European + Uralic + Yukaghir + Altaic + Korean + Japanese + Ainu + Gilyak + Chukotko-Kamchatkan + Eskimo-Aleut.
- Amerind — Greenberg Language in the Americas (1987). Most of the indigenous Americas as a single family, excluding Na-Dene and Eskimo-Aleut. Sharply criticized by Lyle Campbell American Indian Languages (1997), William Poser, Ives Goddard, and others on methodological grounds — mass lexical comparison without rigorous demonstration of regular correspondences. Mainstream consensus rejects Amerind.
The Sapir-Whorf Hypothesis
Edward Sapir and his student Benjamin Lee Whorf developed in the 1920s–1940s the hypothesis that the language one speaks shapes thought and perception (linguistic relativity). Whorf’s papers on Hopi time, Apache event categorization, and SAE (Standard Average European) drew extensive comparisons. The strong version (language determines thought) is widely rejected; a weak version (language influences cognition in domain-specific ways) has experimental support — Lera Boroditsky’s work on grammatical gender, time-space metaphor cross-linguistically; Stephen Levinson’s work on absolute spatial frames of reference (Guugu Yimithirr lacks ego-centric directional terms and uses cardinal directions even for body parts). The current consensus accepts linguistic-relativity effects as real but bounded.
Language Contact
Languages in contact borrow, mix, and converge. Standard taxonomy (Sarah Thomason and Terrence Kaufman Language Contact, Creolization, and Genetic Linguistics 1988):
- Substrate — the language of a population that shifts to another (Celtic substrate in French, Dravidian substrate in Indo-Aryan)
- Superstrate — the language of a politically dominant minority adopted by a majority (Norman French superstrate on Anglo-Saxon English producing Middle English’s massive Romance vocabulary)
- Adstrate — neighbors of comparable status (Old Norse adstrate on Old English contributing sky, take, they)
English exemplifies all three: a Germanic base, Norse adstrate (8th–11th c), Norman French superstrate (1066–~1400), and continual Latin and Greek learned borrowing.
Pidgins and Creoles
A pidgin is a contact language with reduced grammar arising for limited communication between groups without a common language; it has no native speakers. A creole emerges when a pidgin acquires native speakers and develops full grammatical complexity. Major creole languages: Tok Pisin (Papua New Guinea, English-based), Haitian Creole (French-based), Sranan (Suriname, English-based), Bislama (Vanuatu), Hawaiian Creole English (“Pidgin”), Singlish (Singapore Colloquial English), Mauritian Creole, Cape Verdean, Saramaccan, Krio (Sierra Leone), Pichi (Equatorial Guinea), Papiamentu (ABC islands, Dutch Caribbean).
Derek Bickerton Roots of Language (1981) proposed a bioprogram hypothesis: creoles arise via children imposing innate universal grammar structures on impoverished pidgin input, producing strikingly similar features across unrelated creoles (TMA — tense, mood, aspect — preverbal markers; SVO order; specific articles; no inflectional morphology). Critics — Salikoko Mufwene The Ecology of Language Evolution (2001), Michel DeGraff, Sarah Thomason and Terrence Kaufman, Philip Baker — argue that creole properties reflect inherited substrate and superstrate features and the constraints of any rapidly transmitted language under conditions of severe disruption, not a sui generis bioprogram.
Greenberg’s Universals
Joseph Greenberg Some Universals of Grammar with Particular Reference to the Order of Meaningful Elements (1963) proposed 17 implicational universals based on a 30-language sample. Examples:
- Universal 1: In declarative sentences with nominal subject and object, the dominant order is almost always one in which the subject precedes the object.
- Universal 3: Languages with dominant VSO order are always prepositional.
- Universal 4: With overwhelmingly greater than chance frequency, languages with normal SOV order are postpositional.
- Universal 9: With well more than chance frequency, when question particles or affixes are specified in position by reference to the sentence as a whole, if initial, the language is prepositional; if final, postpositional.
These universals — strongly correlated with head-directionality — laid the groundwork for typology as a systematic enterprise.
WALS, Glottolog, ASJP
- WALS (World Atlas of Language Structures Online; Matthew Dryer and Martin Haspelmath, eds; Max Planck Institute for Evolutionary Anthropology, Leipzig, current edition 2013) — 192 typological features across 2,679 languages.
- Glottolog (Harald Hammarström, Robert Forkel, Martin Haspelmath, eds; current v5.0+) — comprehensive catalog of 8,000+ languages with genealogical classification, bibliographic references, geographic data.
- Ethnologue (SIL International, current 27th ed 2024) — 7,168 living languages; estimates ~40% endangered.
- ASJP (Automated Similarity Judgment Program; Søren Wichmann, Cecil Brown, et al.) — short Swadesh-style word lists in 7,000+ languages with phonetic transcriptions for automated comparison.
Rate of Change
Morris Swadesh (1950s) developed lexicostatistics / glottochronology: a basic vocabulary (100-word or 200-word Swadesh lists of universally found concepts — body parts, kinship, natural phenomena, basic verbs) is more resistant to replacement than peripheral vocabulary. Swadesh’s empirical calibration suggested ~14% replacement per millennium for the 100-word list, implying a Latin–English separation of ~6,000–7,000 years. The method has been heavily criticized (constant rate is empirically wrong; replacement is not uniform across the list) but variants survive in modern Bayesian phylogenetics.
Computational Phylogenetics
Russell Gray and Quentin Atkinson Language-Tree Divergence Times Support the Anatolian Theory of Indo-European Origin (Nature 2003) applied Bayesian phylogenetic methods (originally from biology) to cognate matrices and inferred a PIE date of ~8,700 BP with an Anatolian homeland — supporting the Renfrew hypothesis (Colin Renfrew Archaeology and Language 1987) against the Kurgan / Steppe hypothesis (Marija Gimbutas 1956+). A follow-up Bouckaert, Atkinson, Gray et al. (Science 2012) refined the Anatolian conclusion.
The debate reignited with ancient DNA evidence (David Reich, Iosif Lazaridis, Wolfgang Haak, and collaborators 2015–2022): mass migrations of Yamnaya steppe pastoralists (from the Pontic-Caspian steppe ~3300–2600 BCE) brought 70%+ population replacement to northern Europe and contributed substantial ancestry to South Asia. The genetic evidence aligns with the Steppe hypothesis. The David Anthony The Horse, the Wheel, and Language (2007) synthesis of archaeology, linguistics, and (now) genetics is widely cited; recent phylogenetic re-analysis (Heggarty, Anderson, Scarborough et al., Science 2023 Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages) proposes a hybrid model — earlier divergence in Anatolian, later northern Indo-European spread with the Steppe.
Word Order and Other Typological Distributions
The cross-linguistic frequency of basic word orders is highly skewed: SOV ~45%, SVO ~42%, VSO ~9%, VOS ~2%, OVS ~1%, OSV <1%. Strong implicational correlations link word order to:
- Adposition type (postposition with SOV, preposition with SVO/VSO)
- Genitive-noun vs noun-genitive (genitive-N with SOV, N-genitive with VO)
- Relative clause position (head-final RC with SOV, head-initial RC with VO)
- Standard-Marker-Adjective in comparatives
Matthew Dryer The Greenbergian Word Order Correlations (Language 1992) and subsequent work systematized these as branching-direction correlations: head-final languages cluster certain orders; head-initial languages cluster the opposite orders.
Endangered Languages
UNESCO’s Atlas of the World’s Languages in Danger classifies languages from “vulnerable” through “critically endangered” to “extinct.” Of ~7,000 living languages, roughly half are projected to lose their last native speakers by 2100 absent intervention. Drivers: economic and political pressure to shift to dominant languages; education in national languages only; media and digital infrastructure concentrated in major languages; intermarriage and migration; outright suppression in some historical and ongoing cases.
Successful revitalization efforts:
- Welsh — Welsh Language Act 1993, 2011; Welsh-medium education growing; ~30% of population in Wales report Welsh proficiency.
- Hebrew — the most cited revitalization: Eliezer Ben-Yehuda (1858–1922) led the revival from a liturgical/literary language to a spoken vernacular; now native language of ~9 million Israeli Jews.
- Maori (te reo Māori) — Kohanga reo language-nest preschools from 1982; broadcasting; ~3% native speakers but rising L2.
- Hawaiian — Pūnana Leo language nests from 1983; immersion schools; ~24,000 speakers.
- Basque — Ikastola immersion schools, Standard Basque (Euskara Batua, 1968).
- Catalan — robust post-Franco revival; ~10M speakers.
- Cornish — revival from extinction (last native speaker died 1777); current speakers ~600.
- Wampanoag (Massachusett) — revival from extinction by Jessie Little Doe Baird from 1990s using mission documents.
Adjacent
- phonetics-and-phonology — sound change is the engine of historical phonology
- syntax-and-grammar — syntactic change and typological correlations
- sociolinguistics-and-applied — endangered-language documentation and revitalization
- semantics-and-pragmatics — semantic change (broadening, narrowing, metaphor)
- biological-anthropology — ancient DNA and prehistoric population movements
- archaeology-foundations — material correlates of language spread