Statistical Software & Reproducible-Research DSLs Family Index
type: language-family-index
family: statistical-software
languages_catalogued: 26
tags: [language-reference, family-index, statistical-software, stata, sas, spss, stan, jags, pymc, quarto, jupyter, rmarkdown, bayesian, econometrics, reproducible-research]
Family overview
The proprietary-statistics triumvirate — Stata, SAS, and SPSS — has dominated academic, governmental, and pharmaceutical statistical work since the 1970s/1980s and remains, in 2026, the default in three industries where it should arguably have been displaced a decade ago: clinical-trials biostatistics (where the FDA’s electronic-submission pathway under CDISC SDTM/ADaM is built around SAS), academic econometrics (where Stata .do files are the lingua franca of working-paper replication archives), and survey research (where SPSS .sav files remain the default export from instruments like Qualtrics, SurveyMonkey, and REDCap). Each platform ships its own internal command DSL — Stata’s do-file syntax, SAS’s macro + DATA step + PROC step trinity, and SPSS Statistics’ .sps command language — and each persists primarily because its installed base of code, its certified validation, and its regulator familiarity dwarf the costs of a clean R/Python rewrite.
In parallel sits the Bayesian probabilistic-programming lineage, which over thirty years progressed from BUGS (MRC Biostatistics Unit, Cambridge, late 1980s) to WinBUGS (1997) to OpenBUGS (2005) to JAGS (Martyn Plummer, 2007) to Stan (Andrew Gelman et al., Columbia, 2012) and onward to language-embedded successors PyMC (Python, v5+ rewrite on PyTensor) and Turing.jl (Julia). NIMBLE (UC Berkeley, 2014+) sits sideways to this lineage: it accepts BUGS-syntax models inside R but compiles them to C++ via its own intermediate language. Each generation traded a constraint for ergonomics: BUGS was restricted to directed-acyclic-graph models sampled by Gibbs updates; Stan introduced Hamiltonian Monte Carlo, automatic differentiation, and a real type-checked modelling language; PyMC and Turing.jl gave up the standalone-language ceremony in exchange for being host-language libraries with Python/Julia syntactic affordances.
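To make the "Gibbs-only" starting point of this lineage concrete, here is a minimal sketch of a Gibbs sampler in plain Python. It is not BUGS code and not how any of the packages above are implemented; the target (a standard bivariate normal with correlation `rho`) and the function names are assumptions chosen because both full conditionals are known in closed form.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Gibbs-sample a standard bivariate normal with correlation rho.

    Each full conditional is itself normal:
        x | y ~ Normal(rho * y, 1 - rho**2)
        y | x ~ Normal(rho * x, 1 - rho**2)
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    draws = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        if i >= burn_in:
            draws.append((x, y))
    return draws


def sample_correlation(draws):
    """Plain sample correlation of the (x, y) pairs."""
    n = len(draws)
    mean_x = sum(x for x, _ in draws) / n
    mean_y = sum(y for _, y in draws) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in draws) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in draws) / n
    var_y = sum((y - mean_y) ** 2 for _, y in draws) / n
    return cov / math.sqrt(var_x * var_y)


draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(round(sample_correlation(draws), 2))
```

The sampler only ever needs each variable's conditional distribution given the others, which is why this style of inference pairs naturally with a DAG model specification and struggles when the conditionals are not available in convenient form.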
The reproducible-research substrate is the third layer. R Markdown (RStudio, 2014) introduced the literate-programming workflow that combined Markdown prose, R code chunks via knitr, and pandoc-mediated multi-output rendering. Quarto (Posit, 2022) is its multi-language successor: built on pandoc + a Lua filter ecosystem, it renders Markdown documents containing R, Python, Julia, and Observable JS chunks to HTML, PDF (via document-typesetting / LaTeX), Word, ePub, dashboards, and slides — and crucially, it does not require R to render Python or Julia documents. Jupyter (originally IPython Notebook, 2011) sits adjacent: its .ipynb JSON document format remains the dominant in-the-wild notebook artefact format, with Quarto able to consume it directly. Together, R Markdown + Quarto + Jupyter form the textual reproducibility layer that the proprietary platforms grudgingly accommodate through ODS (SAS), dyndoc (Stata), and OUTPUT EXPORT (SPSS).
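How a multi-language renderer can avoid requiring R is easier to see with a sketch of engine selection. The function below is a simplified approximation, not Quarto's implementation: it assumes an explicit `engine:` key in the YAML header wins, that any `{r}` block selects knitr, and that any other executable block falls back to Jupyter. The name `guess_engine` and the sample document are hypothetical.

```python
import re

TICKS = "`" * 3  # built programmatically so this example nests cleanly
FENCE = re.compile(r"^" + re.escape(TICKS) + r"\{(\w+)[^}]*\}", re.MULTILINE)
ENGINE_KEY = re.compile(r"^engine:\s*(\S+)", re.MULTILINE)


def guess_engine(qmd_text):
    """Approximate which engine a .qmd document would be bound to."""
    header = ""
    if qmd_text.startswith("---"):
        parts = qmd_text.split("---", 2)
        if len(parts) == 3:
            header = parts[1]
    explicit = ENGINE_KEY.search(header)
    if explicit:
        return explicit.group(1)  # assumed: an explicit key overrides detection
    languages = {m.group(1).lower() for m in FENCE.finditer(qmd_text)}
    if "r" in languages:
        return "knitr"
    if languages:
        return "jupyter"
    return None  # no executable blocks found


doc = "\n".join([
    "---",
    "title: Demo",
    "---",
    "",
    TICKS + "{python}",
    "print('hi')",
    TICKS,
])
print(guess_engine(doc))
```

Under these assumptions a document with only Python blocks never touches knitr, which is the property the paragraph above describes.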
Underlying all of this is the binary-data-format zoo: Stata’s .dta (currently format 121 in Stata 19, since April 2025), SPSS’s .sav, and SAS’s .sas7bdat plus the older XPORT (.xpt) transport format. None has an open specification of comparable rigour to Parquet or Arrow, and the de facto interoperability layer is Python’s pandas together with R’s haven package, which rely on reverse-engineered readings of the formats to provide reading and, for several of the formats, writing. Open-source replacements (Parquet, Feather, R .rds) have made deep inroads in tech/finance but barely scratched pharma and academia.
In our deep library
- r: the deep R note; R is the natural home of Rmd, knitr, the haven package, RStan, cmdstanr, nimble, brms, and rstanarm.
- python: covers Python, including statsmodels, scikit-learn, PyMC, cmdstanpy, arviz, and pandas as the de facto reader for .dta / .sav / .sas7bdat.
- julia: covers Julia, including Turing.jl, DynamicPPL, and the wider SciML stack.
- scientific: MATLAB / Mathematica / Octave / APL / Maxima / Scilab. Adjacent to this family but distinct: MATLAB’s Statistics Toolbox is a MathWorks add-on rather than a separate stats-package DSL.
- document-typesetting: Quarto’s PDF output path runs through LaTeX (tinytex by default); R Markdown likewise. Output toolchain overlaps heavily.
- healthcare-clinical: CDISC SDTM, ADaM, and define-xml are nearly always implemented on top of SAS. Conversely, most clinical-trial statistical-analysis-plan code is SAS macro + PROC.
- survey-questionnaire: REDCap, Qualtrics, and SurveyMonkey export to SPSS .sav and Stata .dta as primary archival formats.
Tier 3 family table — Stata family
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| Stata .do script | 1985 | Stata Corp (William Gould et al.), College Station TX | Command DSL; one-line-per-command imperative syntax (regress y x1 x2 if region==3, robust) | Very active; Stata 19 released April 2025; StataNow continuous-release branch previews Stata 20 features | https://www.stata.com/manuals/u.pdf |
| Stata .ado programs | 1985 | Stata Corp | User-written commands; a .ado file plus a matching .sthlp help file define a new Stata verb | Very active; SSC (Statistical Software Components, Boston College) is the de facto package repository | https://www.stata.com/manuals/u17.pdf |
| Stata .dta data format | 1985, current spec format 121 (Stata 19) | Stata Corp | Versioned binary; widely used as exchange container in econometrics and health-policy datasets | Active and de facto standard in academic working-paper replication archives | https://www.stata.com/manuals/pfileformatsdta.pdf |
| Stata Mata language | 2003 (Stata 9) | Stata Corp | Compiled matrix-programming language embedded inside Stata; C-like syntax, optimised for linear algebra | Active, the substrate for almost all complex .ado packages written since ~2010 | https://www.stata.com/features/mata/ |
| Stata Markdown (dyndoc / markstat) | ~2014 | Stata Corp (dyndoc) + Germán Rodríguez (Princeton) for markstat | Dynamic-document tools embedding Stata output in Markdown / HTML | Active but niche; many users now reach for Quarto’s Stata engine instead | https://www.stata.com/manuals/rptdyndoc.pdf |
Tier 3 family table — SAS family
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| SAS DATA step language | 1976 | SAS Institute (North Carolina; Anthony Barr, Jim Goodnight) | Row-by-row procedural data-manipulation DSL; the “data step / proc step” dichotomy is foundational to SAS | Active; SAS 9.4 M9 (June 2025) is the current 9-series release; supported through ~2030 | https://documentation.sas.com/doc/en/pgmsascdc/default/lestmtsref/titlepage.htm |
| SAS macro language | ~1980 | SAS Institute | Token-substitution preprocessor with %macro/%mend, %let, %if, %do — the long-standing SAS metaprogramming layer | Very active, still the dominant way large pharma/biostat shops parameterise SAS programs | https://documentation.sas.com/doc/en/pgmsascdc/default/mcrolref/titlepage.htm |
| SAS PROC SQL | ~1989 (SAS 6.06) | SAS Institute | Embedded ANSI-SQL extension with SAS-specific extensions (e.g. macro-variable interpolation, automatic dataset access) | Very active | https://documentation.sas.com/doc/en/sqlproc/v_017/titlepage.htm |
| SAS PROC IML (Interactive Matrix Language) | 1985 | SAS Institute | Matrix-programming DSL inside SAS; APL-influenced; analogue of Stata Mata | Active but lower-velocity than DATA step / macro | https://documentation.sas.com/doc/en/imlug/15.3/titlepage.htm |
| SAS .sas7bdat data files | 1999 (SAS 7) | SAS Institute | Proprietary versioned binary data format; format reverse-engineered by R haven (Hadley Wickham, Evan Miller) | Active and dominant for working datasets in pharma/biotech; FDA submissions themselves carry datasets as XPORT (.xpt) files described by define-xml | https://documentation.sas.com/doc/en/pgmsascdc/default/lrcon/p1n8hb0gn5erf0n1qm0xthahmu99.htm |
| SAS XPORT (.xpt) format | 1987 (SAS 6, “v5 transport”) | SAS Institute | Older fixed-width transport format with strict 8-character name limits; version 5 XPORT is the transport format in which SDTM and ADaM clinical-trial datasets are delivered to the FDA | Active in regulatory contexts, otherwise legacy | https://support.sas.com/techsup/technote/ts140.pdf |
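
The 8-character name limit in the XPORT row is the constraint that most often bites when data prepared in R or Python has to be written to .xpt. A minimal pre-flight check might look like the sketch below. The exact rule encoded here (ASCII letter or underscore first, then letters, digits, or underscores, compared case-insensitively) is an assumption based on classic SAS naming conventions, and `xport_name_problems` is a hypothetical helper, not part of any library.

```python
import re

# Assumed rule: 1-8 characters, ASCII letter or underscore first,
# then letters, digits, or underscores.
V5_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,7}")


def xport_name_problems(names):
    """Map each name that breaks the assumed v5 rule to a short reason."""
    problems = {}
    seen = set()
    for name in names:
        upper = name.upper()
        if len(name) > 8:
            problems[name] = "longer than 8 characters"
        elif not V5_NAME.fullmatch(name):
            problems[name] = "invalid characters"
        elif upper in seen:
            problems[name] = "duplicate after upper-casing"
        seen.add(upper)
    return problems


columns = ["USUBJID", "AGE", "TREATMENT_ARM", "2NDVISIT", "age"]
print(xport_name_problems(columns))
```

Running a check like this before export surfaces renames early, rather than leaving them to whatever truncation a writer library applies silently.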
Tier 3 family table — SPSS / JMP / proprietary econometrics
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| SPSS .sps syntax | 1968 (SPSS first release) | Norman Nie / C. Hadlai Hull / Dale Bent (Stanford); SPSS Inc.; IBM (2009) | Command-language DSL (COMPUTE, RECODE, FREQUENCIES, GLM, DO IF … END IF) with verbose, FORTRAN-influenced syntax | Active; IBM SPSS Statistics 31 / 32 are the current releases (Server 32.0.x GA April 2026); the IBM divestment to Francisco Partners (2024) has left the product roadmap somewhat unclear | https://www.ibm.com/docs/en/spss-statistics |
| SPSS .sav data files | 1968 | SPSS Inc. → IBM | Proprietary binary data format with variable labels, value labels, missing-value codes, and measurement levels, all features that survey researchers depend on | Active, the default export format from most survey instruments | https://www.ibm.com/docs/en/spss-statistics/saveformats |
| SPSS Statistics Python / R integration | ~2007 (Python plug-in) | SPSS Inc. → IBM | Embedded Python and R inside SPSS for scripting custom procedures (a “begin program” block); R 4.4.1 in v30+ | Active but underused; most SPSS users still write pure syntax | https://www.ibm.com/docs/en/spss-statistics/30.0.0?topic=integration-python |
| JMP Scripting Language (JSL) | 1989 (JMP 1) | SAS Institute (John Sall) | C-influenced scripting DSL for the JMP visual-discovery platform; controls data tables, platforms (analyses), and graphical objects | Active; JMP 19 (2025) added expanded Python integration and continued JSL evolution | https://www.jmp.com/support/help/en/19.0/jmp/introduction-to-writing-jsl-scripts.shtml |
| GAUSS language | 1984 | Aptech Systems (Lee Edlefsen, Steve Jones), Maple Valley WA | Matrix-programming language for econometrics and finance; strong installed base in central banks | Active; GAUSS 26 released February 2026 with new time-series and panel functions | https://www.aptech.com/ |
| EViews command/program language | 1994 (EViews 1) | Quantitative Micro Software → IHS Markit → S&P Global | Time-series-econometrics-oriented DSL with .prg programs, .wf1 workfiles | Active; EViews 14 (2024) with ongoing bug-fix patches through 2026 | https://www.eviews.com/help/ |
| gretl / hansl | 2000 | Allin Cottrell, Riccardo Lucchetti (Università Politecnica delle Marche) | Open-source econometrics package with its own scripting language hansl (Hansl’s a Neat Scripting Language); imports Stata .dta, R, Octave | Active; gretl 2026a released February 2026 | http://gretl.sourceforge.net/ |
Tier 3 family table — Bayesian modelling DSLs
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| BUGS / WinBUGS / OpenBUGS model language | 1989 (BUGS) / 1997 (WinBUGS) / 2005 (OpenBUGS) | MRC Biostatistics Unit, Cambridge (David Spiegelhalter, Andrew Thomas, Nicky Best, Wally Gilks) | Declarative DAG model specification using ~ for stochastic and <- for deterministic nodes; Gibbs sampling only | Mostly legacy — WinBUGS development ended ~2007; OpenBUGS effectively unmaintained since ~2014; syntax lives on in JAGS and NIMBLE | https://www.mrc-bsu.cam.ac.uk/software/bugs/ |
| JAGS model language | 2007 | Martyn Plummer (IARC Lyon) | BUGS-dialect successor written in C++; the most widely used “classic” Bayesian DSL via R’s rjags and R2jags | Active but slow — JAGS 4.3.2 (March 2023) remains the current release as of May 2026; few new features but reliable | https://mcmc-jags.sourceforge.io/ |
| NIMBLE model language | 2014 (CRAN release) | Perry de Valpine, Christopher Paciorek et al. (UC Berkeley) | BUGS-syntax-compatible R-internal modelling DSL that compiles models to C++; supports MCMC, Laplace approximation, MCEM, and user-defined samplers | Active; CRAN release ~v1.4.1 (April 2026) | https://r-nimble.org/ |
| Stan model language | 2012 | Andrew Gelman / Bob Carpenter / Daniel Lee et al. (Columbia University) | Strongly typed C++-compiled probabilistic DSL with reverse-mode auto-differentiation, NUTS (No-U-Turn Sampler) Hamiltonian Monte Carlo, variational inference, optimisation | Very active; CmdStan 2.38.0 released January 2026; reference standard for Bayesian inference | https://mc-stan.org/docs/reference-manual/ |
| RStan / cmdstanr / PyStan / cmdstanpy | 2012+ | Stan team | Language bindings to Stan from R and Python; cmdstanr and cmdstanpy are the recommended modern bindings (wrap CmdStan rather than embedding Stan’s C++) | Very active | https://mc-stan.org/cmdstanr/ |
| PyMC model definition | 2003 (PyMC2) → 2017 (PyMC3, Theano) → 2022 (PyMC v4, Aesara) → 2023 (PyMC v5, PyTensor) | Chris Fonnesbeck et al. | Python-embedded probabilistic-programming DSL; the v5 rewrite migrated from Theano → Aesara → PyTensor to escape the Theano deprecation | Very active; v5.28.5 released early May 2026 | https://www.pymc.io/projects/docs/en/stable/ |
| Turing.jl model definition | 2018 | Hong Ge et al. (Cambridge) | Julia-embedded probabilistic-programming DSL using the @model macro; supports HMC, NUTS, particle MCMC, variational inference, Gibbs composition | Very active; requires Julia ≥ 1.10.8 | https://turinglang.org/ |
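
The step from Gibbs updates to Hamiltonian Monte Carlo recorded in the table above can be illustrated with a small sketch. This is deliberately not Stan's algorithm: it uses a fixed trajectory length rather than NUTS, a hand-coded gradient rather than reverse-mode autodiff, and a 1-D standard normal target chosen only because its gradient is trivial. All names and tuning values are assumptions for illustration.

```python
import math
import random


def hmc_standard_normal(n_samples, step_size=0.2, n_leapfrog=10, seed=0):
    """Fixed-length HMC targeting a 1-D standard normal.

    Potential energy U(q) = q**2 / 2, so the gradient is simply q.
    """
    rng = random.Random(seed)

    def grad_u(q):
        return q  # hand-coded here; an autodiff library would derive this

    q = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # resample momentum each iteration
        q_new, p_new = q, p
        # Leapfrog: half step in p, alternating full steps, half step in p.
        p_new -= 0.5 * step_size * grad_u(q_new)
        for step in range(n_leapfrog):
            q_new += step_size * p_new
            if step < n_leapfrog - 1:
                p_new -= step_size * grad_u(q_new)
        p_new -= 0.5 * step_size * grad_u(q_new)
        # Metropolis correction on the total energy H = U + K.
        h_old = 0.5 * q * q + 0.5 * p * p
        h_new = 0.5 * q_new * q_new + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


samples = hmc_standard_normal(5000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(variance, 2))
```

The only model-specific ingredient is the gradient of the log density, which is why automatic differentiation is the piece that makes this family of samplers practical for arbitrary user-written models.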
Tier 3 family table — Reproducible-research notebooks and document formats
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| R Markdown (.Rmd) | 2014 | RStudio (Yihui Xie, JJ Allaire) | Markdown + YAML metadata + R code chunks executed by knitr; pandoc-rendered to HTML/PDF/Word | Maintained but in soft sunset; RStudio/Posit’s strategic successor is Quarto | https://rmarkdown.rstudio.com/ |
| knitr engine | 2012 | Yihui Xie | The chunk-execution engine under R Markdown and bookdown; also supports many other host languages via reticulate, JuliaCall, etc. | Active, still core to R-side reproducible reporting | https://yihui.org/knitr/ |
| Quarto (.qmd) | 2022 | Posit (formerly RStudio); Charles Teague, JJ Allaire et al. | Multi-language successor to R Markdown built directly on pandoc + Lua filters; native Python (Jupyter kernel), R (knitr), and Julia (QuartoNotebookRunner.jl) support; does not require R for non-R documents | Very active; v1.9.37 released April 2026 (1.7 line shipped late 2025) | https://quarto.org/ |
| Jupyter notebook (.ipynb) | 2011 (IPython Notebook), 2014 (Jupyter rename) | Project Jupyter (Fernando Pérez, Brian Granger et al.) | JSON document format storing cells (code, markdown, raw) + outputs + kernel metadata; the dominant interactive-notebook artefact | Very active; JupyterLab 4.x and Jupyter Notebook 7.x are current; consumed natively by Quarto | https://jupyter.org/ |
| Posit Connect deployment manifests | ~2018 (as RStudio Connect) | Posit (formerly RStudio) | YAML + lock-file format used to publish Quarto/Shiny/R Markdown/Streamlit/FastAPI artefacts to a Posit Connect server | Active; current Posit Connect 2026.03.1 (March 2026) | https://docs.posit.co/connect/ |
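
Because an .ipynb file is plain JSON holding cells, outputs, and kernel metadata together, simple tooling can be written against it with nothing but the standard library. The sketch below clears outputs from code cells, the same idea that output-stripping tools apply before a notebook is committed. The sample notebook is hand-written for illustration and is not validated against the official schema; `strip_outputs` is a hypothetical helper.

```python
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict with outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy via JSON
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hand-written minimal notebook for illustration only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

stripped = strip_outputs(notebook)
print(json.dumps(stripped, indent=1, sort_keys=True))
```

Stripping outputs and execution counts removes most of the runtime state that makes notebook diffs noisy, while the source cells stay untouched.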
Notable threads
- The persistence of SAS in regulated industries. SAS’s dominance in pharma/biotech is a regulatory-network-effect story, not a technical one. The FDA’s electronic-submission pathway (eCTD Module 5) is built around CDISC SDTM and ADaM standards whose canonical implementation is in SAS, and SAS’s per-procedure documentation has been the de facto “audit trail” for forty years. The R Consortium’s R Submissions Working Group has run a series of pilot submissions through the FDA (Pilot 1 was accepted in 2022; subsequent pilots have widened the scope); these are slowly opening the door, but no major sponsor has yet abandoned SAS for a Phase III submission. Inertia, validated-package availability, and clinical-research-organisation (CRO) staffing all favour the incumbent. SAS 9.4 M9 (June 2025) being a five-year-support release tells you SAS expects this status quo through ~2030.
- Stata’s outsized academic-econometrics dominance. Stata’s installed base in economics, political science, and public health is sustained by replication-archive expectations: the American Economic Review, Journal of Political Economy, and Quarterly Journal of Economics all expect Stata .do files alongside data, and most working-paper repositories (NBER, IZA) implicitly assume .dta. Stata’s pricing model (student licences priced at student-budget levels, perpetual licences that survive cohorts of grad students) locks in this network effect. Stata 19 (April 2025) added a heavy machine-learning suite (h2oml) and conditional-average-treatment-effect (CATE) commands, which is the company’s bet on staying relevant as econml, DoubleML, and Python encroach.
- SPSS’s slow decline. IBM acquired SPSS in 2009 for $1.2B, then in 2024 sold its analytics portfolio (including SPSS Statistics and SPSS Modeler) to Francisco Partners. The IBM divestment marks the end of SPSS as a flagship analytics platform; ongoing development continues (v30 added dark mode and Bland-Altman; v31 and v32 follow) but innovation has clearly slowed compared to Stata or even SAS. SPSS retains a powerful base in social-survey research because of .sav’s variable-label / value-label / measurement-level metadata, which Stata .dta and R only partially preserve.
- The Bayesian DSL progression: BUGS → JAGS → Stan → PyMC/Turing. Each generation traded a constraint for ergonomics. BUGS was Gibbs-only and required a DAG structure that the package could partition; JAGS extended this with more sampler types but kept the BUGS dialect. Stan (2012) was a hard reset: a real strongly typed modelling language, reverse-mode automatic differentiation (the stan::math library is itself a substantial autodiff codebase), and Hamiltonian Monte Carlo with NUTS adaptive trajectory length, which together made continuous high-dimensional models tractable in ways BUGS never was. PyMC v5 and Turing.jl took the next step: rather than writing a separate model file, embed the model in a host language so that data wrangling, plotting (ArviZ), and reporting (Quarto) all live in the same script.
- PyMC’s PyTensor rewrite. PyMC’s history mirrors the Theano deprecation crisis. PyMC3 (2017) was built on Theano. When the Theano team announced end-of-life in 2017, PyMC forked Theano as Theano-PyMC, then renamed the fork Aesara (2020), which became the backend of PyMC v4 (2022), then forked Aesara as PyTensor (2022) when Aesara stagnated. PyMC v5 (January 2023) is the first stable release line on PyTensor and is what current releases (v5.28.5, early May 2026) build on. The throughline: a graph-rewriting/auto-diff backend is the load-bearing piece, and PyMC has had to maintain its own through three reincarnations to survive.
- Quarto as the modern reproducible-research substrate. Quarto’s strategic significance is that it decouples reproducible-document tooling from R. R Markdown required R even to render a pure-Python document. Quarto runs on pandoc directly with Lua-filter extensions and executes Python via Jupyter kernels (no R required), Julia via QuartoNotebookRunner.jl, and R via knitr, binding one engine per document based on the languages of its executable code blocks or an explicit engine key in the YAML header. The current Quarto 1.9.x line (April 2026) makes this a serious competitor to Jupyter Book and a partial competitor to Sphinx for technical-book authoring. Posit’s bet is that Quarto will outlive R Markdown as the cross-language reproducible-document format.
- Jupyter’s long shadow. .ipynb JSON is far from a perfect document format (diffing it is hostile, output cells balloon notebooks to megabytes, and the JSON schema entangles document and runtime state), yet it remains overwhelmingly dominant as the artefact people actually share on GitHub, Kaggle, and Google Colab. Tooling has adapted around its flaws: nbdime for diffing, jupytext for paired .py/.md round-tripping, papermill for parameterised execution, and Quarto’s ability to consume .ipynb directly. The format is unlikely to be displaced; instead it has become the substrate everything else interoperates with.
- The binary-data-format compat layer. Reading .dta, .sav, and .sas7bdat from outside their native platforms is mostly the work of two libraries: R’s haven (tidyverse, Hadley Wickham + Evan Miller, leveraging the ReadStat C library) and Python’s pandas (read_stata, read_spss via pyreadstat, read_sas). These have been reliable enough for many years that a substantial fraction of .dta / .sav files in production data pipelines is never opened in Stata or SPSS at all; they are bulk-converted to Parquet on landing. Stata 19’s format-121 update broke pre-existing haven versions until a patch release; this is the recurring tension whenever a proprietary format version bumps.
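
The version-bump tension in the last thread comes down to readers checking the format number at the start of a file and refusing what they do not recognise. A minimal sketch of that check for .dta follows. It assumes the header layout as commonly described for the format: formats 117 and later open with a tagged header containing a three-digit release field, while earlier formats store the number in the first byte. The byte strings are synthetic fragments, not complete files, and `sniff_dta_release` is a hypothetical helper.

```python
import io

NEW_HEADER_PREFIX = b"<stata_dta><header><release>"


def sniff_dta_release(stream):
    """Return the .dta format number announced at the start of a stream.

    Assumed layout: formats 117+ open with an XML-like tag followed by
    three ASCII digits; earlier formats store the number in byte 0.
    """
    first = stream.read(1)
    if not first:
        raise ValueError("empty stream")
    if first == b"<":
        rest = stream.read(len(NEW_HEADER_PREFIX) - 1)
        if first + rest != NEW_HEADER_PREFIX:
            raise ValueError("not a tagged .dta header")
        digits = stream.read(3)
        if len(digits) != 3 or not digits.isdigit():
            raise ValueError("malformed release field")
        return int(digits)
    return first[0]


# Synthetic header fragments, not complete .dta files.
tagged = io.BytesIO(b"<stata_dta><header><release>118</release>")
legacy = io.BytesIO(bytes([114]) + b"\x02\x01\x00")
print(sniff_dta_release(tagged), sniff_dta_release(legacy))
```

A reader that compares this number against an allow-list of known releases will fail on a new release even when the body layout is nearly unchanged, which is how a single format bump can break older library versions overnight.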
Citations
- Stata 19 announcement: https://blog.stata.com/2025/04/08/stata-19-is-released/
- Stata .dta file format reference: https://www.stata.com/manuals/pfileformatsdta.pdf
- Stata Mata reference: https://www.stata.com/features/mata/
- SAS 9.4 maintenance releases: http://support.sas.com/software/maintenance/index.html
- SAS macro language reference: https://documentation.sas.com/doc/en/pgmsascdc/default/mcrolref/titlepage.htm
- SAS PROC SQL: https://documentation.sas.com/doc/en/sqlproc/v_017/titlepage.htm
- SAS PROC IML: https://documentation.sas.com/doc/en/imlug/15.3/titlepage.htm
- SAS XPORT specification (TS-140): https://support.sas.com/techsup/technote/ts140.pdf
- IBM SPSS Statistics release notes: https://www.ibm.com/support/pages/release-notes-ibm%C2%AE-spss%C2%AE-statistics-30
- JMP Scripting Guide: https://www.jmp.com/support/help/en/19.0/jmp/introduction-to-writing-jsl-scripts.shtml
- Stan reference manual: https://mc-stan.org/docs/reference-manual/
- CmdStan 2.38 release: https://blog.mc-stan.org/2026/01/13/release-of-cmdstan-2-38/
- JAGS: https://mcmc-jags.sourceforge.io/
- BUGS project (MRC Biostatistics Unit): https://www.mrc-bsu.cam.ac.uk/software/bugs/
- NIMBLE: https://r-nimble.org/
- PyMC documentation: https://www.pymc.io/projects/docs/en/stable/
- Turing.jl: https://turinglang.org/
- GAUSS 26: https://www.aptech.com/blog/gauss26/
- EViews 14: https://www.eviews.com/EViews14/ev14main.html
- gretl: http://gretl.sourceforge.net/
- Quarto documentation: https://quarto.org/
- R Markdown: https://rmarkdown.rstudio.com/
- knitr: https://yihui.org/knitr/
- Project Jupyter: https://jupyter.org/
- R haven package: https://haven.tidyverse.org/
- ReadStat (C library underlying haven and pyreadstat): https://github.com/WizardMac/ReadStat
- R Consortium R Submissions Working Group: https://rconsortium.github.io/submissions-wg/
- CDISC standards: https://www.cdisc.org/standards