Statistical Software & Reproducible-Research DSLs Family Index


type: language-family-index
family: statistical-software
languages_catalogued: 26
tags: [language-reference, family-index, statistical-software, stata, sas, spss, stan, jags, pymc, quarto, jupyter, rmarkdown, bayesian, econometrics, reproducible-research]

Statistical Software & Reproducible-Research — Family Index

Family overview

The proprietary-statistics triumvirate of Stata, SAS, and SPSS has dominated academic, governmental, and pharmaceutical statistical work since the 1970s and 1980s. In 2026 it remains the default in three domains where it should arguably have been displaced a decade ago: clinical-trials biostatistics (where the FDA’s electronic-submission pathway under CDISC SDTM/ADaM is built around SAS), academic econometrics (where Stata .do files are the lingua franca of working-paper replication archives), and survey research (where SPSS .sav files remain the default export from instruments like Qualtrics, SurveyMonkey, and REDCap). Each platform ships its own internal command DSL: Stata’s do-file syntax; SAS’s trinity of macro language, DATA step, and PROC step; and SPSS Statistics’ .sps command language. Each persists primarily because its installed base of code, its certified validation, and regulators’ familiarity with it make a clean R/Python rewrite cost more than it would return.

In parallel sits the Bayesian probabilistic-programming lineage, which over thirty years progressed from BUGS (MRC Biostatistics Unit, Cambridge, late 1980s) to WinBUGS (1997) to OpenBUGS (2005) to JAGS (Martyn Plummer, 2007) to Stan (Andrew Gelman et al., Columbia, 2012) and onward to the language-embedded successors PyMC (Python; rebuilt on PyTensor as of v5) and Turing.jl (Julia). NIMBLE (UC Berkeley, 2014+) sits sideways to this lineage: it accepts BUGS-syntax models inside R but compiles them to C++ via its own intermediate language. Each generation made a different trade-off: BUGS required a directed-acyclic-graph model and offered Gibbs sampling only; Stan introduced Hamiltonian Monte Carlo, automatic differentiation, and a real type-checked modelling language; PyMC and Turing.jl gave up the standalone-language ceremony in exchange for being host-language libraries with Python and Julia syntactic affordances.
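
To make the Gibbs-to-HMC shift concrete, here is a minimal sketch of a single-parameter Hamiltonian Monte Carlo sampler in plain Python. It assumes a standard normal target and a hand-written gradient; Stan obtains gradients by automatic differentiation and adapts the trajectory length with NUTS, neither of which is attempted here. The function names, step size, and step count are illustrative choices, not taken from any of the packages above.

```python
import math
import random


def neg_log_density(q):
    """Potential energy U(q) of a standard normal target, up to a constant."""
    return 0.5 * q * q


def grad_neg_log_density(q):
    """Hand-written dU/dq; an autodiff library would derive this instead."""
    return q


def hmc_step(q, rng, step_size=0.2, n_leapfrog=20):
    """One HMC transition: draw a momentum, integrate, then accept or reject."""
    p = rng.gauss(0.0, 1.0)
    current_h = neg_log_density(q) + 0.5 * p * p

    # Leapfrog integration: half momentum step, alternating full steps,
    # then a closing half momentum step.
    q_new, p_new = q, p
    p_new -= 0.5 * step_size * grad_neg_log_density(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_neg_log_density(q_new)
    p_new -= 0.5 * step_size * grad_neg_log_density(q_new)

    proposed_h = neg_log_density(q_new) + 0.5 * p_new * p_new
    # Metropolis correction for the integrator's energy error.
    if rng.random() < math.exp(min(0.0, current_h - proposed_h)):
        return q_new
    return q


def sample(n_draws, seed=0):
    """Run a fixed-length HMC chain started at zero and return the draws."""
    rng = random.Random(seed)
    q = 0.0
    draws = []
    for _ in range(n_draws):
        q = hmc_step(q, rng)
        draws.append(q)
    return draws
```

Calling `sample(5000)` should give draws whose mean and variance sit close to 0 and 1. The gradient call inside the leapfrog loop is the step a Gibbs sampler has no equivalent of, and it is the reason an autodiff backend matters so much to Stan, PyMC, and Turing.jl.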

The reproducible-research substrate is the third layer. R Markdown (RStudio, 2014) introduced the literate-programming workflow that combined Markdown prose, R code chunks via knitr, and pandoc-mediated multi-output rendering. Quarto (Posit, 2022) is its multi-language successor: built on pandoc plus a Lua filter ecosystem, it renders Markdown documents containing R, Python, Julia, and Observable JS chunks to HTML, PDF (via LaTeX), Word, ePub, dashboards, and slides. Crucially, it does not require R to render Python or Julia documents. Jupyter (originally IPython Notebook, 2011) sits adjacent: its .ipynb JSON document format remains the dominant in-the-wild notebook artefact format, and Quarto can consume it directly. Together, R Markdown, Quarto, and Jupyter form the textual reproducibility layer that the proprietary platforms grudgingly accommodate through ODS (SAS), dyndoc (Stata), and OUTPUT EXPORT (SPSS).
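
The point that Quarto needs R only for documents that actually contain R can be illustrated with a simplified sketch of engine selection. This is an approximation written for this note, not Quarto's own code: it assumes that an explicit `engine:` or `jupyter:` key in the YAML header wins, that any `{r}` chunk selects knitr, and that any other executable chunk falls through to Jupyter. The function names and the regular expressions are hypothetical.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests inside a fence

# Executable chunks look like {python} or {r, echo=FALSE} after the fence.
CHUNK_RE = re.compile(r"^" + FENCE + r"\{(\w+)[^}]*\}\s*$", re.MULTILINE)
ENGINE_RE = re.compile(r"^engine:\s*(\w+)\s*$", re.MULTILINE)


def split_front_matter(text):
    """Return (front_matter, body); front matter is a leading '---' block."""
    if text.startswith("---\n"):
        end = text.find("\n---", 4)
        if end != -1:
            return text[4:end], text[end + 4:]
    return "", text


def guess_engine(qmd_text):
    """Guess which engine would execute a .qmd document (simplified rules)."""
    front, body = split_front_matter(qmd_text)

    explicit = ENGINE_RE.search(front)
    if explicit:
        return explicit.group(1)
    if re.search(r"^jupyter:", front, re.MULTILINE):
        return "jupyter"

    languages = [name.lower() for name in CHUNK_RE.findall(body)]
    if "r" in languages:
        return "knitr"      # any R chunk pulls in knitr (and therefore R)
    if languages:
        return "jupyter"    # Python-only documents never touch R
    return "markdown"       # nothing to execute
```

Under these assumed rules a document with only `{python}` chunks resolves to `jupyter`, while adding a single `{r}` chunk switches the whole document to `knitr`.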

Underlying all of this is the binary-data-format zoo: Stata’s .dta (currently format 121 in Stata 19, since April 2025), SPSS’s .sav, SAS’s .sas7bdat, and the older XPORT (.xpt) transport format. None has an open specification of comparable rigour to Parquet or Arrow, and the de facto interoperability layer is the pairing of Python’s pandas and R’s haven package, which together reverse-engineer the formats and provide round-trip reading and writing. Open-source replacements (Parquet, Feather, R .rds) have made deep inroads in tech and finance but have barely scratched pharma and academia.
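
As a rough sketch of what that interoperability layer looks like in practice, the snippet below maps each proprietary extension to the pandas reader that handles it and converts a file to Parquet. The helper names (`READERS`, `reader_name`, `convert_to_parquet`) are hypothetical; the pandas calls (`read_stata`, `read_spss`, `read_sas`, `DataFrame.to_parquet`) are real, but `read_spss` needs the optional pyreadstat package and `to_parquet` needs a Parquet engine such as pyarrow, so treat the conversion step as an assumption about the installed environment.

```python
from pathlib import Path

# File suffix -> name of the pandas function that reads it.
READERS = {
    ".dta": "read_stata",
    ".sav": "read_spss",      # pandas delegates to pyreadstat here
    ".sas7bdat": "read_sas",
    ".xpt": "read_sas",       # read_sas infers XPORT from the extension
}


def reader_name(path):
    """Return the pandas reader name for a statistical data file."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported statistical file type: {suffix!r}") from None


def convert_to_parquet(src, dest_dir):
    """Read one proprietary file and write it next door as Parquet."""
    import pandas as pd  # third-party; imported lazily so lookup works without it

    reader = getattr(pd, reader_name(src))
    frame = reader(src)
    dest = Path(dest_dir) / (Path(src).stem + ".parquet")
    frame.to_parquet(dest, index=False)
    return dest
```

A landing job might call `convert_to_parquet("wave1.dta", "landing/")` for each incoming file. Metadata such as value labels and measurement levels does not necessarily survive this conversion.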

In our deep library

  • r — the deep R note; R is the natural home of Rmd, knitr, the haven package, RStan, cmdstanr, nimble, brms, and rstanarm.
  • python — covers Python, including statsmodels, scikit-learn, PyMC, cmdstanpy, arviz, and pandas as the de facto reader for .dta/.sav/.sas7bdat.
  • julia — covers Julia, including Turing.jl, DynamicPPL, and the wider SciML stack.
  • scientific — MATLAB / Mathematica / Octave / APL / Maxima / Scilab. Adjacent to this family but distinct: MATLAB’s Statistics Toolbox is a MathWorks add-on rather than a separate stats-package DSL.
  • document-typesetting — Quarto’s PDF output path runs through LaTeX (tinytex by default); R Markdown likewise. Output toolchain overlaps heavily.
  • healthcare-clinical: CDISC SDTM, ADaM, and define-xml are nearly always implemented on top of SAS. Conversely, most clinical-trial statistical-analysis-plan code is written as SAS macros and PROC steps.
  • survey-questionnaire — REDCap, Qualtrics, and SurveyMonkey export to SPSS .sav and Stata .dta as primary archival formats.

Tier 3 family table — Stata family

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| Stata .do script | 1985 | Stata Corp (William Gould et al.), College Station TX | Command DSL; one-line-per-command imperative syntax (regress y x1 x2 if region==3, robust) | Very active; Stata 19 released April 2025; StataNow continuous-release branch previews Stata 20 features | https://www.stata.com/manuals/u.pdf |
| Stata .ado programs | 1985 | Stata Corp | User-written commands; a .ado file plus a matching .sthlp help file define a new Stata verb | Very active; SSC (Statistical Software Components, Boston College) is the de facto package repository | https://www.stata.com/manuals/u17.pdf |
| Stata .dta data format | 1985, current spec format 121 (Stata 19) | Stata Corp | Versioned binary; widely used as exchange container in econometrics and health-policy datasets | Active and de facto standard in academic working-paper replication archives | https://www.stata.com/manuals/pfileformatsdta.pdf |
| Stata Mata language | 2005 (Stata 9) | Stata Corp | Compiled matrix-programming language embedded inside Stata; C-like syntax, optimised for linear algebra | Active, the substrate for almost all complex .ado packages written since ~2010 | https://www.stata.com/features/mata/ |
| Stata Markdown (dyndoc / markstat) | ~2014 | Stata Corp (dyndoc) + Germán Rodríguez (Princeton) for markstat | Dynamic-document tools embedding Stata output in Markdown / HTML | Active but niche; many users now reach for Quarto’s Stata engine instead | https://www.stata.com/manuals/rptdyndoc.pdf |

Tier 3 family table — SAS family

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| SAS DATA step language | 1976 | SAS Institute (North Carolina; Anthony Barr, Jim Goodnight) | Row-by-row procedural data-manipulation DSL; the “data step / proc step” dichotomy is foundational to SAS | Active; SAS 9.4 M9 (June 2025) is the current 9-series release; supported through ~2030 | https://documentation.sas.com/doc/en/pgmsascdc/default/lestmtsref/titlepage.htm |
| SAS macro language | ~1980 | SAS Institute | Token-substitution preprocessor with %macro/%mend, %let, %if, %do; the long-standing SAS metaprogramming layer | Very active, still the dominant way large pharma/biostat shops parameterise SAS programs | https://documentation.sas.com/doc/en/pgmsascdc/default/mcrolref/titlepage.htm |
| SAS PROC SQL | ~1989 (SAS 6.06) | SAS Institute | Embedded ANSI-SQL extension with SAS-specific extensions (e.g. macro-variable interpolation, automatic dataset access) | Very active | https://documentation.sas.com/doc/en/sqlproc/v_017/titlepage.htm |
| SAS PROC IML (Interactive Matrix Language) | 1985 | SAS Institute | Matrix-programming DSL inside SAS; APL-influenced; analogue of Stata Mata | Active but lower-velocity than DATA step / macro | https://documentation.sas.com/doc/en/imlug/15.3/titlepage.htm |
| SAS .sas7bdat data files | 1999 (SAS 7) | SAS Institute | Proprietary versioned binary data format; format reverse-engineered by R haven (Hadley Wickham, Evan Miller) | Active and dominant in pharma/biotech; widely accepted by FDA submissions when bundled with define-xml | https://documentation.sas.com/doc/en/pgmsascdc/default/lrcon/p1n8hb0gn5erf0n1qm0xthahmu99.htm |
| SAS XPORT (.xpt) format | 1987 (SAS 6, “v5 transport”) | SAS Institute | Older fixed-width transport format with strict 8-character name limits; the only data format accepted by the FDA for legacy submissions until SDTM-XPT replaced it for clinical-trial datasets | Active in regulatory contexts, otherwise legacy | https://support.sas.com/techsup/technote/ts140.pdf |
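
The 8-character name limit noted for XPORT is the constraint that most often bites when data prepared in R or Python has to be exported to .xpt. A minimal pre-export check might look like the sketch below. It assumes the usual v5 naming rule (one to eight characters, a letter or underscore first, then letters, digits, or underscores) and treats names case-insensitively; the function names are hypothetical, and the sketch does not check label or value lengths.

```python
import re

# Assumed v5 rule: 1-8 characters, letter/underscore first, then word characters.
V5_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,7}")


def invalid_v5_names(names):
    """Return the variable names that break the assumed 8-character rule."""
    return [name for name in names if V5_NAME.fullmatch(name) is None]


def truncation_collisions(names, width=8):
    """Group names that become identical once cut to `width` and upper-cased."""
    seen = {}
    for name in names:
        seen.setdefault(name[:width].upper(), []).append(name)
    return {short: group for short, group in seen.items() if len(group) > 1}
```

Running `invalid_v5_names` on a column list flags names that need renaming, and `truncation_collisions` shows why naive truncation is unsafe: `TREATMENT_ARM` and `TREATMENT_GRP` both shorten to `TREATMEN`.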

Tier 3 family table — SPSS / JMP / proprietary econometrics

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| SPSS .sps syntax | 1968 (SPSS first release) | Norman Nie / C. Hadlai Hull / Dale Bent (Stanford); SPSS Inc.; IBM (2009) | Command-language DSL (COMPUTE, RECODE, FREQUENCIES, GLM, IF/END IF); verbose, FORTRAN-influenced syntax | Active; IBM SPSS Statistics 31 / 32 are the current releases (Server 32.0.x GA April 2026); the IBM divestment to Francisco Partners (2024) has left the product roadmap somewhat unclear | https://www.ibm.com/docs/en/spss-statistics |
| SPSS .sav data files | 1968 | SPSS Inc. → IBM | Proprietary binary data format with variable labels, value labels, missing-value codes, and measurement levels, features that survey researchers depend on | Active, the default export format from most survey instruments | https://www.ibm.com/docs/en/spss-statistics/saveformats |
| SPSS Statistics Python / R integration | ~2007 (Python plug-in) | SPSS Inc. → IBM | Embedded Python and R inside SPSS for scripting custom procedures (a “begin program” block); R 4.4.1 in v30+ | Active but underused; most SPSS users still write pure syntax | https://www.ibm.com/docs/en/spss-statistics/30.0.0?topic=integration-python |
| JMP Scripting Language (JSL) | 1989 (JMP 1) | SAS Institute (John Sall) | C-influenced scripting DSL for the JMP visual-discovery platform; controls data tables, platforms (analyses), and graphical objects | Active; JMP 19 (2025) added expanded Python integration and continued JSL evolution | https://www.jmp.com/support/help/en/19.0/jmp/introduction-to-writing-jsl-scripts.shtml |
| GAUSS language | 1984 | Aptech Systems (Lee Edlefsen, Steve Jones), Maple Valley WA | Matrix-programming language for econometrics and finance; strong installed base in central banks | Active; GAUSS 26 released February 2026 with new time-series and panel functions | https://www.aptech.com/ |
| EViews command/program language | 1994 (EViews 1) | Quantitative Micro Software → IHS Markit → S&P Global | Time-series-econometrics-oriented DSL with .prg programs, .wf1 workfiles | Active; EViews 14 (2024) with ongoing bug-fix patches through 2026 | https://www.eviews.com/help/ |
| gretl / hansl | 2000 | Allin Cottrell, Riccardo Lucchetti (Università Politecnica delle Marche) | Open-source econometrics package with its own scripting language hansl (Hansl’s a Neat Scripting Language); imports Stata .dta, R, Octave | Active; gretl 2026a released February 2026 | http://gretl.sourceforge.net/ |

Tier 3 family table — Bayesian modelling DSLs

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| BUGS / WinBUGS / OpenBUGS model language | 1989 (BUGS) / 1997 (WinBUGS) / 2005 (OpenBUGS) | MRC Biostatistics Unit, Cambridge (David Spiegelhalter, Andrew Thomas, Nicky Best, Wally Gilks) | Declarative DAG model specification using ~ for stochastic and <- for deterministic nodes; Gibbs sampling only | Mostly legacy: WinBUGS development ended ~2007; OpenBUGS effectively unmaintained since ~2014; syntax lives on in JAGS and NIMBLE | https://www.mrc-bsu.cam.ac.uk/software/bugs/ |
| JAGS model language | 2007 | Martyn Plummer (IARC Lyon) | BUGS-dialect successor written in C++; the most widely used “classic” Bayesian DSL via R’s rjags and R2jags | Active but slow: JAGS 4.3.2 (March 2023) remains the current release as of May 2026; few new features but reliable | https://mcmc-jags.sourceforge.io/ |
| NIMBLE model language | 2014 (CRAN release) | Perry de Valpine, Christopher Paciorek et al. (UC Berkeley) | BUGS-syntax-compatible R-internal modelling DSL that compiles models to C++; supports MCMC, Laplace approximation, MCEM, and user-defined samplers | Active; CRAN release ~v1.4.1 (April 2026) | https://r-nimble.org/ |
| Stan model language | 2012 | Andrew Gelman / Bob Carpenter / Daniel Lee et al. (Columbia University) | Strongly typed C++-compiled probabilistic DSL with reverse-mode auto-differentiation, NUTS (No-U-Turn Sampler) Hamiltonian Monte Carlo, variational inference, optimisation | Very active; CmdStan 2.38.0 released January 2026; reference standard for Bayesian inference | https://mc-stan.org/docs/reference-manual/ |
| RStan / cmdstanr / PyStan / cmdstanpy | 2012+ | Stan team | Language bindings to Stan from R and Python; cmdstanr and cmdstanpy are the recommended modern bindings (wrap CmdStan rather than embedding Stan’s C++) | Very active | https://mc-stan.org/cmdstanr/ |
| PyMC model definition | 2003 (PyMC2) → 2017 (PyMC3, Theano) → 2022 (PyMC v4, Aesara) → 2023 (PyMC v5, PyTensor) | Chris Fonnesbeck et al. | Python-embedded probabilistic-programming DSL; the v5 rewrite migrated from Theano → Aesara → PyTensor to escape the Theano deprecation | Very active; v5.28.5 released early May 2026 | https://www.pymc.io/projects/docs/en/stable/ |
| Turing.jl model definition | 2018 | Hong Ge et al. (Cambridge) | Julia-embedded probabilistic-programming DSL using the @model macro; supports HMC, NUTS, particle MCMC, variational inference, Gibbs composition | Very active; requires Julia ≥ 1.10.8 | https://turinglang.org/ |

Tier 3 family table — Reproducible-research notebooks and document formats

| Format | First appeared | Origin | Type | Status (2026) | URL |
| --- | --- | --- | --- | --- | --- |
| R Markdown (.Rmd) | 2014 | RStudio (Yihui Xie, JJ Allaire) | Markdown + YAML metadata + R code chunks executed by knitr; pandoc-rendered to HTML/PDF/Word | Maintained but in soft sunset; RStudio/Posit’s strategic successor is Quarto | https://rmarkdown.rstudio.com/ |
| knitr engine | 2012 | Yihui Xie | The chunk-execution engine under R Markdown and bookdown; also supports many other host languages via reticulate, JuliaCall, etc. | Active, still core to R-side reproducible reporting | https://yihui.org/knitr/ |
| Quarto (.qmd) | 2022 | Posit (formerly RStudio); Charles Teague, JJ Allaire et al. | Multi-language successor to R Markdown built directly on pandoc + Lua filters; native Python (Jupyter kernel), R (knitr), and Julia (QuartoNotebookRunner.jl) support; does not require R for non-R documents | Very active; v1.9.37 released April 2026 (1.7 line shipped late 2025) | https://quarto.org/ |
| Jupyter notebook (.ipynb) | 2011 (IPython Notebook), 2014 (Jupyter rename) | Project Jupyter (Fernando Pérez, Brian Granger et al.) | JSON document format storing cells (code, markdown, raw) + outputs + kernel metadata; the dominant interactive-notebook artefact | Very active; JupyterLab 4.x and Jupyter Notebook 7.x are current; consumed natively by Quarto | https://jupyter.org/ |
| Posit Connect deployment manifests | ~2018 (as RStudio Connect) | Posit (formerly RStudio) | YAML + lock-file format used to publish Quarto/Shiny/R Markdown/Streamlit/FastAPI artefacts to a Posit Connect server | Active; current Posit Connect 2026.03.1 (March 2026) | https://docs.posit.co/connect/ |

Notable threads

  • The persistence of SAS in regulated industries. SAS’s dominance in pharma/biotech is a regulatory-network-effect story, not a technical one. The FDA’s electronic-submission pathway (eCTD Module 5) is built around CDISC SDTM and ADaM standards whose canonical implementation is in SAS, and SAS’s per-procedure documentation has been the de facto “audit trail” for forty years. The R Consortium’s R Submissions Working Group has run a series of pilot submissions through the FDA (Pilot 1 was accepted in 2022; subsequent pilots have widened the scope). These are slowly opening the door, but no major sponsor has yet abandoned SAS for a Phase III submission. Inertia, validated-package availability, and clinical-research-organisation (CRO) staffing all favour the incumbent. That SAS 9.4 M9 (June 2025) is a five-year-support release suggests SAS expects this status quo to last through ~2030.

  • Stata’s outsized academic-econometrics dominance. Stata’s installed base in economics, political science, and public health is sustained by replication-archive expectations: the American Economic Review, Journal of Political Economy, and Quarterly Journal of Economics all expect Stata .do files alongside data, and most working-paper repositories (NBER, IZA) implicitly assume .dta. Stata’s pricing-model choice — student licences priced at student-budget levels, perpetual licences that survive cohorts of grad students — locks in this network effect. Stata 19 (April 2025) added a heavy machine-learning suite (h2oml) and conditional-average-treatment-effect (CATE) commands, which is the company’s bet on staying relevant as econml/DoubleML/Python encroach.

  • SPSS’s slow decline. IBM acquired SPSS in 2009 for $1.2B, then in 2024 sold its analytics portfolio (including SPSS Statistics and SPSS Modeler) to Francisco Partners. The IBM divestment marks the end of SPSS as a flagship analytics platform; ongoing development continues (v30 added dark mode and Bland-Altman; v31 and v32 follow) but innovation has clearly slowed compared to Stata or even SAS. SPSS retains a powerful base in social-survey research because of .sav’s variable-label / value-label / measurement-level metadata, which Stata .dta and R only partially preserve.

  • The Bayesian DSL progression: BUGS → JAGS → Stan → PyMC/Turing. Each generation traded a constraint for ergonomics. BUGS was Gibbs-only and required a DAG structure that the package could partition; JAGS extended this with more sampler types but kept the BUGS dialect. Stan (2012) was a hard reset: a real strongly typed modelling language, reverse-mode automatic differentiation (the stan::math library is itself a substantial autodiff codebase), and Hamiltonian Monte Carlo with NUTS adaptive trajectory length — together making continuous high-dimensional models tractable in ways BUGS never was. PyMC v5 and Turing.jl took the next step: don’t write a separate model file, embed the model in a host language so that data wrangling, plotting (ArviZ), and reporting (Quarto) are all the same script.

  • PyMC’s PyTensor rewrite. PyMC’s history mirrors the Theano deprecation crisis. PyMC3 (2017) was built on Theano. When the Theano team announced end-of-life in 2017, the PyMC developers forked Theano as Theano-PyMC, renamed the fork Aesara (2020) and shipped PyMC v4 (2022) on it, then forked Aesara as PyTensor (2022) when Aesara stagnated. PyMC v5 (January 2023) is the first stable release on PyTensor, and current releases (v5.28.5, early May 2026) still build on it. The throughline: a graph-rewriting/auto-diff backend is the load-bearing piece, and PyMC has had to maintain its own through three reincarnations to survive.

  • Quarto as the modern reproducible-research substrate. Quarto’s strategic significance is that it decouples reproducible-document tooling from R. R Markdown required R even to render a pure-Python document. Quarto runs on pandoc directly with Lua-filter extensions, executes Python via Jupyter kernels (no R required), Julia via QuartoNotebookRunner.jl, and R via knitr, choosing the engine from the languages of the document’s code chunks or from an explicit engine setting in the YAML header. The current Quarto 1.9.x line (April 2026) makes this a serious competitor to Jupyter Book and a partial competitor to Sphinx for technical-book authoring. Posit’s bet is that Quarto will outlive R Markdown as the cross-language reproducible-document format.

  • Jupyter’s long shadow. .ipynb JSON is far from a perfect document format — diffing it is hostile, output cells balloon notebooks to megabytes, and the JSON schema entangles document and runtime state — yet it remains overwhelmingly dominant as the artefact people actually share on GitHub, Kaggle, and Google Colab. Tooling has adapted around its flaws: nbdime for diffing, jupytext for paired .py/.md round-tripping, papermill for parameterised execution, and Quarto’s ability to consume .ipynb directly. The format is unlikely to be displaced; instead it has become the substrate everything else interoperates with.

  • The binary-data-format compat layer. Reading .dta, .sav, and .sas7bdat from outside their native platforms is mostly the work of two libraries: R’s haven (tidyverse, Hadley Wickham + Evan Miller, leveraging the ReadStat C library) and Python’s pandas (read_stata, read_spss via pyreadstat, read_sas). These have been reliable for long enough that a substantial fraction of the .dta / .sav files in production data pipelines are never opened in Stata or SPSS at all; they are bulk-converted to Parquet on landing. Stata 19’s format-121 update broke pre-existing haven versions until a patch release; this is the recurring tension whenever a proprietary format version bumps.
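
For the Jupyter thread above, the complaint that outputs balloon notebooks and entangle document with runtime state is easy to see in the JSON itself. The sketch below builds a tiny nbformat-4 style notebook as a Python dict and strips its outputs, which is roughly what output-stripping tools do before a commit. The `strip_outputs` helper and the example notebook are illustrative assumptions, not part of any Jupyter API.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 style notebook dict without code outputs."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop runtime state
    return cleaned


# A hand-written two-cell notebook: one markdown cell, one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {"cell_type": "markdown", "id": "intro", "metadata": {}, "source": ["# Demo\n"]},
        {
            "cell_type": "code",
            "id": "calc",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1\n"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}

stripped = strip_outputs(example)
print(len(json.dumps(example)), "->", len(json.dumps(stripped)))
```

The source cells survive unchanged while the execution counter and result payload disappear, which is what makes the stripped file diff cleanly.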

Citations