Statistical Software & Reproducible-Research DSLs Family Index

type: language-family-index family: statistical-software languages_catalogued: 26 tags: [language-reference, family-index, statistical-software, stata, sas, spss, stan, jags, pymc, quarto, jupyter, rmarkdown, bayesian, econometrics, reproducible-research]

Statistical Software & Reproducible-Research — Family Index

Family overview

The proprietary-statistics triumvirate — Stata, SAS, and SPSS — has dominated academic, governmental, and pharmaceutical statistical work since the 1970s/1980s and remains, in 2026, the default in three industries where it should arguably have been displaced a decade ago: clinical-trials biostatistics (where the FDA’s electronic-submission pathway under CDISC SDTM/ADaM is built around SAS), academic econometrics (where Stata .do files are the lingua franca of working-paper replication archives), and survey research (where SPSS .sav files remain the default export from instruments like Qualtrics, SurveyMonkey, and REDCap). Each platform ships its own internal command DSL — Stata’s do-file syntax, SAS’s macro + DATA step + PROC step trinity, and SPSS Statistics’ .sps command language — and each persists primarily because its installed base of code, its certified validation, and its regulator familiarity dwarf the costs of a clean R/Python rewrite.

In parallel sits the Bayesian probabilistic-programming lineage, which over thirty years progressed from BUGS (MRC Biostatistics Unit, Cambridge, late 1980s) to WinBUGS (1997) to OpenBUGS (2005) to JAGS (Martyn Plummer, 2007) to Stan (Andrew Gelman et al., Columbia, 2012) and onward to the language-embedded successors PyMC (Python, rebuilt on PyTensor from v5) and Turing.jl (Julia). NIMBLE (UC Berkeley, 2014+) sits sideways to this lineage: it accepts BUGS-syntax models inside R but compiles them to C++ via its own intermediate language. Each generation traded away a constraint in exchange for better ergonomics: BUGS was restricted to directed-acyclic-graph models sampled by Gibbs steps; Stan introduced Hamiltonian Monte Carlo, automatic differentiation, and a real type-checked modelling language; PyMC and Turing.jl gave up the standalone-language ceremony in exchange for being host-language libraries with Python/Julia syntactic affordances.
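
To make the Hamiltonian Monte Carlo step in that lineage concrete, here is a minimal sketch in pure Python for a one-dimensional standard-normal target. It is an illustration under stated assumptions, not how Stan is implemented: the gradient is written by hand where Stan would use automatic differentiation, the trajectory length is fixed rather than adapted by NUTS, and the names `hmc_step` and `hmc_sample` are hypothetical.

```python
import math
import random


def potential(q: float) -> float:
    """Negative log density of a standard normal, up to an additive constant."""
    return 0.5 * q * q


def grad_potential(q: float) -> float:
    """Gradient of the potential; hand-written here (assumption: Stan would autodiff this)."""
    return q


def leapfrog(q: float, p: float, step_size: float, n_steps: int):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p -= 0.5 * step_size * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p                        # full step for position
        if i != n_steps - 1:
            p -= step_size * grad_potential(q)    # full step for momentum
    p -= 0.5 * step_size * grad_potential(q)      # final half step for momentum
    return q, p


def hmc_step(q: float, step_size: float, n_steps: int, rng: random.Random) -> float:
    """One HMC transition: refresh momentum, integrate, then accept or reject."""
    p = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p, step_size, n_steps)
    h_old = potential(q) + 0.5 * p * p
    h_new = potential(q_new) + 0.5 * p_new * p_new
    # Metropolis correction for the integrator's discretisation error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q


def hmc_sample(n_draws: int, step_size: float = 0.2, n_steps: int = 10, seed: int = 0):
    """Draw n_draws samples with a fixed trajectory length (no NUTS adaptation)."""
    rng = random.Random(seed)
    q = 0.0
    draws = []
    for _ in range(n_draws):
        q = hmc_step(q, step_size, n_steps, rng)
        draws.append(q)
    return draws


if __name__ == "__main__":
    draws = hmc_sample(4000)
    mean = sum(draws) / len(draws)
    print("sample mean:", mean)
```

The gradient call inside `leapfrog` is the piece that Gibbs-only samplers never needed, which is why HMC-based tools pair the sampler with an automatic-differentiation library.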

The reproducible-research substrate is the third layer. R Markdown (RStudio, 2014) introduced the literate-programming workflow that combined Markdown prose, R code chunks via knitr, and pandoc-mediated multi-output rendering. Quarto (Posit, 2022) is its multi-language successor: built on pandoc + a Lua filter ecosystem, it renders Markdown documents containing R, Python, Julia, and Observable JS chunks to HTML, PDF (via document-typesetting / LaTeX), Word, ePub, dashboards, and slides — and crucially, it does not require R to render Python or Julia documents. Jupyter (originally IPython Notebook, 2011) sits adjacent: its .ipynb JSON document format remains the dominant in-the-wild notebook artefact format, with Quarto able to consume it directly. Together, R Markdown + Quarto + Jupyter form the textual reproducibility layer that the proprietary platforms grudgingly accommodate through ODS (SAS), dyndoc (Stata), and OUTPUT EXPORT (SPSS).
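
As a rough picture of what such a multi-language document looks like on disk, the sketch below assembles the text of a minimal `.qmd` file in Python: a YAML header that selects the Jupyter engine, some Markdown prose, and one executable Python chunk. The helper name `build_qmd` and the file name `report.qmd` are hypothetical; the YAML keys (`format`, `jupyter`) and the `#|` chunk-option comment follow Quarto's documented syntax.

```python
# The fence string is built programmatically so this example can itself
# sit inside a fenced code block without closing it early.
FENCE = "`" * 3


def build_qmd(title: str, code: str, kernel: str = "python3") -> str:
    """Return the text of a minimal Quarto document with one Python chunk."""
    header = "\n".join([
        "---",
        f'title: "{title}"',
        "format: html",
        f"jupyter: {kernel}",   # explicit engine choice; no R involved
        "---",
    ])
    chunk = "\n".join([
        f"{FENCE}{{python}}",    # executable chunk, as opposed to a plain code listing
        "#| echo: true",         # chunk option: show the source in the output
        code,
        FENCE,
    ])
    return f"{header}\n\nSome Markdown prose.\n\n{chunk}\n"


if __name__ == "__main__":
    with open("report.qmd", "w", encoding="utf-8") as handle:
        handle.write(build_qmd("Demo report", "print(1 + 1)"))
    # Rendering is a separate step on the command line, e.g.:
    #   quarto render report.qmd --to html
```

Because the document is plain text, it diffs cleanly under version control, which is one practical difference from the JSON-based notebook format.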

Underlying all of this is the binary-data-format zoo: Stata’s .dta (currently format 121 in Stata 19, since April 2025), SPSS’s .sav, SAS’s .sas7bdat and the older XPORT (.xpt) transport format. None has an open specification of comparable rigour to Parquet or Arrow, and the de facto interoperability layer is a pair of libraries, Python’s pandas and R’s haven, which rely on reverse-engineered readers and between them cover reading for all of these formats and writing for most (pandas writes .dta; haven writes .dta, .sav, and .xpt). Open-source replacements (Parquet, Feather, R .rds) have made deep inroads in tech/finance but barely scratched pharma and academia.
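
Landing-zone pipelines often need to tell these containers apart before handing them to a reader. The sketch below guesses the format from a file's leading bytes. It covers only the signatures that are straightforward to state: the XML-style header of modern `.dta` files (format 117 and later), the `$FL2` / `$FL3` record type at the start of SPSS system files, and the first header record of a version 5 XPORT file. Older `.dta` files and `.sas7bdat` are deliberately left out, and the function names are hypothetical.

```python
import re
from typing import Optional

# Modern .dta files (format 117+) open with an XML-like header that
# carries the three-digit format number.
STATA_RELEASE = re.compile(rb"<stata_dta><header><release>(\d{3})</release>")

# First header record of a SAS version 5 transport (XPORT) file.
XPORT_PREFIX = b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"


def sniff_format(head: bytes) -> Optional[str]:
    """Guess the container format from the first bytes of a file.

    Returns a short label, or None when no known signature matches.
    """
    match = STATA_RELEASE.match(head)
    if match:
        return "stata-dta-" + match.group(1).decode("ascii")
    # "$FL2" marks ordinary SPSS system files, "$FL3" the zlib-compressed variant.
    if head.startswith(b"$FL2") or head.startswith(b"$FL3"):
        return "spss-sav"
    if head.startswith(XPORT_PREFIX):
        return "sas-xport-v5"
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read just enough of a file to run sniff_format on it."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(80))
```

Reporting the `.dta` format number is useful in practice because a reader library that predates a format bump can be detected and routed around before it fails mid-pipeline.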

In our deep library

r — the deep R note; R is the natural home of Rmd, knitr, the haven package, RStan, cmdstanr, nimble, brms, and rstanarm.
python — covers Python, including statsmodels, scikit-learn, PyMC, cmdstanpy, arviz, and pandas as the de facto reader for .dta/.sav/.sas7bdat.
julia — covers Julia, including Turing.jl, DynamicPPL, and the wider SciML stack.
scientific — MATLAB / Mathematica / Octave / APL / Maxima / Scilab. Adjacent to this family but distinct: MATLAB’s Statistics and Machine Learning Toolbox is a MathWorks add-on rather than a separate stats-package DSL.
document-typesetting — Quarto’s PDF output path runs through LaTeX (tinytex by default); R Markdown likewise. Output toolchain overlaps heavily.
healthcare-clinical — CDISC SDTM, ADaM, and define-xml are nearly always implemented on top of SAS. Conversely, most code that implements clinical-trial statistical analysis plans is written as SAS macros and PROC steps.
survey-questionnaire — REDCap, Qualtrics, and SurveyMonkey export to SPSS .sav and Stata .dta as primary archival formats.

Tier 3 family table — Stata family

Format	First appeared	Origin	Type	Status (2026)	URL
Stata `.do` script	1985	Stata Corp (William Gould et al.), College Station TX	Command DSL; one-line-per-command imperative syntax (`regress y x1 x2 if region==3, robust`)	Very active; Stata 19 released April 2025; StataNow continuous-release branch previews Stata 20 features	https://www.stata.com/manuals/u.pdf
Stata `.ado` programs	1985	Stata Corp	User-written commands; a `.ado` file plus a matching `.sthlp` help file define a new Stata verb	Very active; SSC (Statistical Software Components, Boston College) is the de facto package repository	https://www.stata.com/manuals/u17.pdf
Stata `.dta` data format	1985, current spec format 121 (Stata 19)	Stata Corp	Versioned binary; widely used as exchange container in econometrics and health-policy datasets	Active and de facto standard in academic working-paper replication archives	https://www.stata.com/manuals/pfileformatsdta.pdf
Stata Mata language	2005 (Stata 9)	Stata Corp	Compiled matrix-programming language embedded inside Stata; C-like syntax, optimised for linear algebra	Active, the substrate for almost all complex `.ado` packages written since ~2010	https://www.stata.com/features/mata/
Stata Markdown (`dyndoc` / `markstat`)	2016 (`markstat`) / 2017 (`dyndoc`, Stata 15)	Stata Corp (`dyndoc`) + Germán Rodríguez (Princeton) for `markstat`	Dynamic-document tools embedding Stata output in Markdown / HTML	Active but niche; many users now run Stata from Quarto instead, through a Jupyter kernel or a knitr engine rather than a built-in Stata engine	https://www.stata.com/manuals/rptdyndoc.pdf

Tier 3 family table — SAS family

Format	First appeared	Origin	Type	Status (2026)	URL
SAS DATA step language	1976	SAS Institute (North Carolina; Anthony Barr, Jim Goodnight)	Row-by-row procedural data-manipulation DSL; the “data step / proc step” dichotomy is foundational to SAS	Active; SAS 9.4 M9 (June 2025) is the current 9-series release; supported through ~2030	https://documentation.sas.com/doc/en/pgmsascdc/default/lestmtsref/titlepage.htm
SAS macro language	~1980	SAS Institute	Token-substitution preprocessor with `%macro`/`%mend`, `%let`, `%if`, `%do` — the long-standing SAS metaprogramming layer	Very active, still the dominant way large pharma/biostat shops parameterise SAS programs	https://documentation.sas.com/doc/en/pgmsascdc/default/mcrolref/titlepage.htm
SAS PROC SQL	~1989 (SAS 6.06)	SAS Institute	Embedded ANSI-SQL extension with SAS-specific extensions (e.g. macro-variable interpolation, automatic dataset access)	Very active	https://documentation.sas.com/doc/en/sqlproc/v_017/titlepage.htm
SAS PROC IML (Interactive Matrix Language)	1985	SAS Institute	Matrix-programming DSL inside SAS; APL-influenced; analogue of Stata Mata	Active but lower-velocity than DATA step / macro	https://documentation.sas.com/doc/en/imlug/15.3/titlepage.htm
SAS `.sas7bdat` data files	1999 (SAS 7)	SAS Institute	Proprietary versioned binary data format with no published specification; read outside SAS through reverse-engineered parsers such as the ReadStat C library behind R `haven` (Hadley Wickham, Evan Miller)	Active and dominant as the working dataset format in pharma/biotech; datasets are converted to XPORT for FDA submission	https://documentation.sas.com/doc/en/pgmsascdc/default/lrcon/p1n8hb0gn5erf0n1qm0xthahmu99.htm
SAS XPORT (`.xpt`) format	1987 (SAS 6, “v5 transport”)	SAS Institute	Older fixed-width transport format with strict 8-character name limits; the version 5 transport format is what the FDA requires for submitted study datasets, including CDISC SDTM and ADaM data	Active in regulatory contexts, otherwise legacy	https://support.sas.com/techsup/technote/ts140.pdf

Tier 3 family table — SPSS / JMP / proprietary econometrics

Format	First appeared	Origin	Type	Status (2026)	URL
SPSS `.sps` syntax	1968 (SPSS first release)	Norman Nie / C. Hadlai Hull / Dale Bent (Stanford); SPSS Inc.; IBM (2009)	Command-language DSL (`COMPUTE`, `RECODE`, `FREQUENCIES`, `GLM`, `DO IF` … `END IF`) with verbose, FORTRAN-influenced syntax	Active; IBM SPSS Statistics 31 / 32 are the current releases (Server 32.0.x GA April 2026); the IBM divestment to Francisco Partners (2024) has left the product roadmap somewhat unclear	https://www.ibm.com/docs/en/spss-statistics
SPSS `.sav` data files	1968	SPSS Inc. → IBM	Proprietary binary data format with variable labels, value labels, missing-value codes, and measurement levels — features that survey researchers depend on	Active, the default export format from most survey instruments	https://www.ibm.com/docs/en/spss-statistics/saveformats
SPSS Statistics Python / R integration	~2007 (Python plug-in)	SPSS Inc. → IBM	Embedded Python and R inside SPSS for scripting custom procedures (a “begin program” block); R 4.4.1 in v30+	Active but underused; most SPSS users still write pure syntax	https://www.ibm.com/docs/en/spss-statistics/30.0.0?topic=integration-python
JMP Scripting Language (JSL)	1989 (JMP 1)	SAS Institute (John Sall)	C-influenced scripting DSL for the JMP visual-discovery platform; controls data tables, platforms (analyses), and graphical objects	Active; JMP 19 (2025) added expanded Python integration and continued JSL evolution	https://www.jmp.com/support/help/en/19.0/jmp/introduction-to-writing-jsl-scripts.shtml
GAUSS language	1984	Aptech Systems (Lee Edlefsen, Steve Jones), Maple Valley WA	Matrix-programming language for econometrics and finance; strong installed base in central banks	Active; GAUSS 26 released February 2026 with new time-series and panel functions	https://www.aptech.com/
EViews command/program language	1994 (EViews 1)	Quantitative Micro Software → IHS Markit → S&P Global	Time-series-econometrics-oriented DSL with `.prg` programs, `.wf1` workfiles	Active; EViews 14 (2024) with ongoing bug-fix patches through 2026	https://www.eviews.com/help/
gretl / hansl	2000	Allin Cottrell (Wake Forest University), Riccardo Lucchetti (Università Politecnica delle Marche)	Open-source econometrics package with its own scripting language hansl (Hansl’s a Neat Scripting Language); imports Stata `.dta`, R, Octave	Active; gretl 2026a released February 2026	http://gretl.sourceforge.net/

Tier 3 family table — Bayesian modelling DSLs

Format	First appeared	Origin	Type	Status (2026)	URL
BUGS / WinBUGS / OpenBUGS model language	1989 (BUGS) / 1997 (WinBUGS) / 2005 (OpenBUGS)	MRC Biostatistics Unit, Cambridge (David Spiegelhalter, Andrew Thomas, Nicky Best, Wally Gilks)	Declarative DAG model specification using `~` for stochastic and `<-` for deterministic nodes; Gibbs sampling only	Mostly legacy — WinBUGS development ended ~2007; OpenBUGS effectively unmaintained since ~2014; syntax lives on in JAGS and NIMBLE	https://www.mrc-bsu.cam.ac.uk/software/bugs/
JAGS model language	2007	Martyn Plummer (IARC Lyon)	BUGS-dialect successor written in C++; the most widely used “classic” Bayesian DSL via R’s `rjags` and `R2jags`	Active but slow — JAGS 4.3.2 (March 2023) remains the current release as of May 2026; few new features but reliable	https://mcmc-jags.sourceforge.io/
NIMBLE model language	2014 (CRAN release)	Perry de Valpine, Christopher Paciorek et al. (UC Berkeley)	BUGS-syntax-compatible R-internal modelling DSL that compiles models to C++; supports MCMC, Laplace approximation, MCEM, and user-defined samplers	Active; CRAN release ~v1.4.1 (April 2026)	https://r-nimble.org/
Stan model language	2012	Andrew Gelman / Bob Carpenter / Daniel Lee et al. (Columbia University)	Strongly typed C++-compiled probabilistic DSL with reverse-mode auto-differentiation, NUTS (No-U-Turn Sampler) Hamiltonian Monte Carlo, variational inference, optimisation	Very active; CmdStan 2.38.0 released January 2026; reference standard for Bayesian inference	https://mc-stan.org/docs/reference-manual/
RStan / cmdstanr / PyStan / cmdstanpy	2012+	Stan team	Language bindings to Stan from R and Python; `cmdstanr` and `cmdstanpy` are the recommended modern bindings (wrap CmdStan rather than embedding Stan’s C++)	Very active	https://mc-stan.org/cmdstanr/
PyMC model definition	2003 (PyMC2) → 2017 (PyMC3, Theano) → 2022 (PyMC v4, Aesara) → 2023 (PyMC v5, PyTensor)	Chris Fonnesbeck et al.	Python-embedded probabilistic-programming DSL; the v5 rewrite migrated from Theano → Aesara → PyTensor to escape the Theano deprecation	Very active; v5.28.5 released early May 2026	https://www.pymc.io/projects/docs/en/stable/
Turing.jl model definition	2018	Hong Ge et al. (Cambridge)	Julia-embedded probabilistic-programming DSL using the `@model` macro; supports HMC, NUTS, particle MCMC, variational inference, Gibbs composition	Very active; requires Julia ≥ 1.10.8	https://turinglang.org/

Tier 3 family table — Reproducible-research notebooks and document formats

Format	First appeared	Origin	Type	Status (2026)	URL
R Markdown (`.Rmd`)	2014	RStudio (Yihui Xie, JJ Allaire)	Markdown + YAML metadata + R code chunks executed by knitr; pandoc-rendered to HTML/PDF/Word	Maintained but in soft sunset; RStudio/Posit’s strategic successor is Quarto	https://rmarkdown.rstudio.com/
knitr engine	2012	Yihui Xie	The chunk-execution engine under R Markdown and bookdown; also supports many other host languages via reticulate, JuliaCall, etc.	Active, still core to R-side reproducible reporting	https://yihui.org/knitr/
Quarto (`.qmd`)	2022	Posit (formerly RStudio); Charles Teague, JJ Allaire et al.	Multi-language successor to R Markdown built directly on pandoc + Lua filters; native Python (Jupyter kernel), R (knitr), and Julia (`QuartoNotebookRunner.jl`) support; does not require R for non-R documents	Very active; v1.9.37 released April 2026 (1.7 line shipped late 2025)	https://quarto.org/
Jupyter notebook (`.ipynb`)	2011 (IPython Notebook), 2014 (Jupyter rename)	Project Jupyter (Fernando Pérez, Brian Granger et al.)	JSON document format storing cells (code, markdown, raw) + outputs + kernel metadata; the dominant interactive-notebook artefact	Very active; JupyterLab 4.x and Jupyter Notebook 7.x are current; consumed natively by Quarto	https://jupyter.org/
Posit Connect deployment manifests	~2018 (as RStudio Connect)	Posit (formerly RStudio)	JSON manifest (`manifest.json`) listing the content type, entry point, bundled files, and package dependencies; used to publish Quarto/Shiny/R Markdown/Streamlit/FastAPI artefacts to a Posit Connect server	Active; current Posit Connect 2026.03.1 (March 2026)	https://docs.posit.co/connect/

Notable threads

The persistence of SAS in regulated industries. SAS’s dominance in pharma/biotech is a regulatory-network-effect story, not a technical one. The FDA’s electronic-submission pathway (eCTD Module 5) is built around CDISC SDTM and ADaM standards whose canonical implementation is in SAS, and SAS’s per-procedure documentation has been the de facto “audit trail” for forty years. The R Consortium’s R Submissions Working Group has been testing an R-based alternative through a series of pilot submissions to the FDA (Pilot 1 was accepted in 2022; subsequent pilots have widened the scope); these are slowly opening the door, but no major sponsor has yet abandoned SAS for a Phase III submission. Inertia, validated-package availability, and clinical-research-organisation (CRO) staffing all favour the incumbent. SAS 9.4 M9 (June 2025) being a five-year-support release tells you SAS expects this status quo through ~2030.
Stata’s outsized academic-econometrics dominance. Stata’s installed base in economics, political science, and public health is sustained by replication-archive expectations: the American Economic Review, Journal of Political Economy, and Quarterly Journal of Economics all require replication code alongside data, in practice most often Stata .do files, and most working-paper repositories (NBER, IZA) implicitly assume .dta. Stata’s pricing model (student licences priced at student-budget levels, perpetual licences that survive cohorts of grad students) locks in this network effect. Stata 19 (April 2025) added a heavy machine-learning suite (h2oml) and conditional-average-treatment-effect (CATE) commands, which is the company’s bet on staying relevant as econml/DoubleML/Python encroach.
SPSS’s slow decline. IBM acquired SPSS in 2009 for $1.2B, then in 2024 sold its analytics portfolio (including SPSS Statistics and SPSS Modeler) to Francisco Partners. The IBM divestment marks the end of SPSS as a flagship analytics platform; ongoing development continues (v30 added dark mode and Bland-Altman; v31 and v32 follow) but innovation has clearly slowed compared to Stata or even SAS. SPSS retains a powerful base in social-survey research because of .sav’s variable-label / value-label / measurement-level metadata, which Stata .dta and R only partially preserve.
The Bayesian DSL progression: BUGS → JAGS → Stan → PyMC/Turing. Each generation traded a constraint for ergonomics. BUGS was Gibbs-only and required a DAG structure that the package could partition; JAGS extended this with more sampler types but kept the BUGS dialect. Stan (2012) was a hard reset: a real strongly typed modelling language, reverse-mode automatic differentiation (the stan::math library is itself a substantial autodiff codebase), and Hamiltonian Monte Carlo with NUTS adaptive trajectory length — together making continuous high-dimensional models tractable in ways BUGS never was. PyMC v5 and Turing.jl took the next step: don’t write a separate model file, embed the model in a host language so that data wrangling, plotting (ArviZ), and reporting (Quarto) are all the same script.
PyMC’s PyTensor rewrite. PyMC’s history mirrors the Theano deprecation crisis. PyMC3 (2017) was built on Theano. After the Theano team announced end-of-life in 2017, the PyMC developers took over maintenance with a fork called Theano-PyMC, renamed that fork Aesara (2020) and built PyMC v4 on it, then forked Aesara as PyTensor (2022) when Aesara stagnated. PyMC v5 (January 2023) is the first stable release line on PyTensor and is what current releases (v5.28.5, early May 2026) build on. The throughline: a graph-rewriting/auto-diff backend is the load-bearing piece, and PyMC has had to maintain its own through three reincarnations to survive.
Quarto as the modern reproducible-research substrate. Quarto’s strategic significance is that it decouples reproducible-document tooling from R. R Markdown required R even to render a pure-Python document. Quarto runs on pandoc directly with Lua-filter extensions and executes Python via Jupyter kernels (no R required), Julia via QuartoNotebookRunner.jl, and R via knitr, choosing the engine from the languages of the document’s code chunks or from an explicit engine setting in the YAML header. The current Quarto 1.9.x line (April 2026) makes this a serious competitor to Jupyter Book and a partial competitor to Sphinx for technical-book authoring. Posit’s bet is that Quarto will outlive R Markdown as the cross-language reproducible-document format.
Jupyter’s long shadow. .ipynb JSON is far from a perfect document format — diffing it is hostile, output cells balloon notebooks to megabytes, and the JSON schema entangles document and runtime state — yet it remains overwhelmingly dominant as the artefact people actually share on GitHub, Kaggle, and Google Colab. Tooling has adapted around its flaws: nbdime for diffing, jupytext for paired .py/.md round-tripping, papermill for parameterised execution, and Quarto’s ability to consume .ipynb directly. The format is unlikely to be displaced; instead it has become the substrate everything else interoperates with.
The binary-data-format compat layer. Reading .dta, .sav, and .sas7bdat from outside their native platforms is mostly the work of two libraries: R’s haven (tidyverse, Hadley Wickham + Evan Miller, leveraging the ReadStat C library) and Python’s pandas (read_stata, read_spss via pyreadstat, read_sas). These have been reliable enough for long enough that a substantial fraction of the .dta / .sav files in production data pipelines are never opened in Stata or SPSS at all; they are bulk-converted to Parquet on landing. Stata 19’s format-121 update broke pre-existing haven versions until a patch release; this is the recurring tension whenever a proprietary format version bumps.
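
Returning to the Jupyter thread above: the complaint that outputs balloon notebooks and entangle document with runtime state is easy to see from the JSON itself. The sketch below uses only the standard library to clear outputs and execution counts from an nbformat-4 notebook held as a dictionary. The function name `strip_outputs` and the tiny example notebook are hypothetical; the field names (`cells`, `cell_type`, `outputs`, `execution_count`) are those of the notebook format.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)       # leave the caller's notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stored results, plots, tracebacks
            cell["execution_count"] = None  # drop the runtime counter
    return cleaned


# A tiny hand-written notebook, just large enough to show the structure.
EXAMPLE = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

if __name__ == "__main__":
    print(json.dumps(strip_outputs(EXAMPLE), indent=1))
```

Stripping outputs before committing is one of the simpler ways to keep notebook diffs readable; the paired-text approach of jupytext mentioned above avoids the problem differently, by keeping the source in a plain-text file.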

Citations

Stata 19 announcement: https://blog.stata.com/2025/04/08/stata-19-is-released/
Stata .dta file format reference: https://www.stata.com/manuals/pfileformatsdta.pdf
Stata Mata reference: https://www.stata.com/features/mata/
SAS 9.4 maintenance releases: http://support.sas.com/software/maintenance/index.html
SAS macro language reference: https://documentation.sas.com/doc/en/pgmsascdc/default/mcrolref/titlepage.htm
SAS PROC SQL: https://documentation.sas.com/doc/en/sqlproc/v_017/titlepage.htm
SAS PROC IML: https://documentation.sas.com/doc/en/imlug/15.3/titlepage.htm
SAS XPORT specification (TS-140): https://support.sas.com/techsup/technote/ts140.pdf
IBM SPSS Statistics release notes: https://www.ibm.com/support/pages/release-notes-ibm%C2%AE-spss%C2%AE-statistics-30
JMP Scripting Guide: https://www.jmp.com/support/help/en/19.0/jmp/introduction-to-writing-jsl-scripts.shtml
Stan reference manual: https://mc-stan.org/docs/reference-manual/
CmdStan 2.38 release: https://blog.mc-stan.org/2026/01/13/release-of-cmdstan-2-38/
JAGS: https://mcmc-jags.sourceforge.io/
BUGS project (MRC Biostatistics Unit): https://www.mrc-bsu.cam.ac.uk/software/bugs/
NIMBLE: https://r-nimble.org/
PyMC documentation: https://www.pymc.io/projects/docs/en/stable/
Turing.jl: https://turinglang.org/
GAUSS 26: https://www.aptech.com/blog/gauss26/
EViews 14: https://www.eviews.com/EViews14/ev14main.html
gretl: http://gretl.sourceforge.net/
Quarto documentation: https://quarto.org/
R Markdown: https://rmarkdown.rstudio.com/
knitr: https://yihui.org/knitr/
Project Jupyter: https://jupyter.org/
R haven package: https://haven.tidyverse.org/
ReadStat (C library underlying haven and pyreadstat): https://github.com/WizardMac/ReadStat
R Consortium R Submissions Working Group: https://rconsortium.github.io/submissions-wg/
CDISC standards: https://www.cdisc.org/standards
