Bioinformatics & Scientific Workflow DSLs Family Index

type: language-family-index
family: bio-workflow
languages_catalogued: 22
tags: [language-reference, family-index, bioinformatics, workflow, scientific-computing, pipeline, data-orchestration]

Bioinformatics & Scientific Workflow DSLs — Family Index

Family overview

Scientific workflow DSLs are a specific subgenre of build/dataflow tools. They exist because three forces converged in computational science: (a) reproducibility demands explicit declaration of every input, output, parameter and tool version so a paper’s pipeline can be re-run a decade later; (b) jobs are long-running and prone to partial failure: a 200-sample variant-calling run that dies on sample 137 must resume rather than restart, so the bio-side systems are essentially Make with checkpointing; and (c) the execution target is heterogeneous: the same workflow must run on a developer’s laptop, an LSF/Slurm HPC cluster, AWS Batch, Google Cloud Batch (successor to the retired Cloud Life Sciences API), and Kubernetes, often switching mid-project. No general-purpose build tool addressed all three, so the field grew its own.
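The resume requirement in (b) can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular engine implements it: `run_with_checkpoints` and `call_variants` are hypothetical names, and real systems also track parameters, code versions and timestamps rather than only checking that an output file exists.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def run_with_checkpoints(samples, process, workdir):
    """Process each sample once, skipping any whose output file already exists."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    executed = []
    for sample in samples:
        output = workdir / f"{sample}.json"
        if output.exists():
            continue  # finished in an earlier run; do not recompute
        result = process(sample)
        # Write under a temporary name, then rename, so a crash never leaves
        # a half-written file that looks like a finished output.
        partial = output.with_name(output.name + ".partial")
        partial.write_text(json.dumps(result))
        partial.replace(output)
        executed.append(sample)
    return executed


def call_variants(sample):
    """Hypothetical stand-in for a long-running tool; fails on one sample."""
    if sample == "sample137" and not call_variants.fixed:
        raise RuntimeError("node failure")
    return {"sample": sample, "variants": len(sample)}


call_variants.fixed = False

if __name__ == "__main__":
    samples = [f"sample{i}" for i in range(1, 201)]
    with TemporaryDirectory() as tmp:
        try:
            run_with_checkpoints(samples, call_variants, tmp)
        except RuntimeError:
            pass  # the first attempt dies on sample 137
        call_variants.fixed = True
        rerun = run_with_checkpoints(samples, call_variants, tmp)
        print(len(rerun), rerun[0])  # 64 sample137
```

The second call redoes only samples 137 through 200; the 136 outputs written before the failure are left alone.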

The ecosystem has a distinct bio/genomics origin bias. Snakemake (Köster, 2012, University of Duisburg-Essen), Nextflow (Notredame lab, CRG Barcelona, 2013), WDL (Broad Institute, ~2012) and CWL (the multi-vendor Common Workflow Language working group, 2014) all came from genomics labs solving genomics-shaped problems: fan-out per sample, fan-in for joint analysis, container-per-tool isolation. Cromwell is the Broad’s WDL execution engine. Galaxy predates them all (2005) and provided a GUI-driven XML workflow model for wet-lab scientists. Toil (UCSC) and Pegasus (USC/ISI) extended Python-driven DAGs to grid and cloud.
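The fan-out per sample, fan-in for joint analysis shape can be illustrated with a small planner that infers sample names from a filename pattern. This is a rough sketch in plain Python and not any engine's real inference algorithm; the `{sample}` placeholder, the job names and the `calls/` and `joint/` paths are assumptions made up for the example.

```python
import re


def infer_samples(pattern, filenames):
    """Return sorted {sample} values for filenames that match the pattern."""
    prefix, suffix = pattern.split("{sample}")
    regex = re.compile(re.escape(prefix) + r"(?P<sample>[^/]+)" + re.escape(suffix))
    found = set()
    for name in filenames:
        match = regex.fullmatch(name)
        if match:
            found.add(match.group("sample"))
    return sorted(found)


def plan(pattern, filenames):
    """Build one job per sample (fan-out) and one job that merges them (fan-in)."""
    per_sample = [
        {
            "name": f"call_{sample}",
            "input": pattern.format(sample=sample),
            "output": f"calls/{sample}.vcf",
        }
        for sample in infer_samples(pattern, filenames)
    ]
    joint = {
        "name": "joint_genotype",
        "inputs": [job["output"] for job in per_sample],
        "output": "joint/all.vcf",
    }
    return per_sample, joint


if __name__ == "__main__":
    files = ["reads/A.fastq.gz", "reads/B.fastq.gz", "reads/notes.txt"]
    jobs, merge = plan("reads/{sample}.fastq.gz", files)
    print([job["name"] for job in jobs])  # ['call_A', 'call_B']
    print(merge["inputs"])  # ['calls/A.vcf', 'calls/B.vcf']
```

Adding a third FASTQ file to the directory would add a third per-sample job and a third input to the merge step without any change to the plan definition, which is the property these DSLs are built around.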

Parallel to bio, the general-purpose data-orchestration lineage evolved separately: Apache Airflow (Airbnb, 2014) brought Python-DSL DAGs to ETL, Prefect (2018) and Dagster (2019) responded to Airflow’s pain points, and Kubeflow Pipelines / MLflow Recipes adapted the same model for ML. The two camps remain culturally distinct — bio workflows obsess over reproducibility and resume semantics; data orchestration obsesses over scheduling, observability and SLAs — but they’re converging via Argo Workflows (K8s-native, used by both) and Pachyderm (data-versioning + pipelines). Note also the standards push (CWL, WDL — portable workflow specs) versus the framework push (Snakemake, Nextflow — opinionated runtimes); standards lost the developer mind-share war but won as compiler targets.

In our deep library

None directly: the deep library covers host languages and domain languages, not DSLs layered atop them. Cross-reference scientific (R, Stan, MATLAB and Mathematica, the numerical tooling that workflows orchestrate), build-devops (Argo Workflows and Tekton are cross-listed there as K8s pipeline runners), python (host language for Snakemake, Airflow, Prefect, Dagster, Kubeflow Pipelines DSL, MLflow, Toil, Ruffus, doit, BioBB), and groovy (host for Nextflow and Bpipe).

Tier 3 family table

Language	First appeared	Origin	Domain	Status (2026)	URL
Snakemake	2012	Johannes Köster, Univ. of Duisburg-Essen	Bioinformatics pipeline DSL (Python-flavored)	Dominant in academic genomics; v8 active	https://snakemake.readthedocs.io
Nextflow	2013	Paolo Di Tommaso / CRG; Seqera Labs	Dataflow pipeline DSL (Groovy-based)	Dominant in pharma/industry; nf-core ecosystem thriving	https://www.nextflow.io
WDL	~2012	Broad Institute (OpenWDL)	Portable workflow spec	Active standard; v1.1 current, v2.0 in progress	https://openwdl.org
CWL	2014	Common Workflow Language working group (multi-vendor)	Portable workflow spec (YAML/JSON)	Stable v1.2; niche but durable as compiler target	https://www.commonwl.org
Cromwell	2016	Broad Institute	WDL/CWL execution engine (Scala/JVM)	Active; backbone of Terra and GATK pipelines	https://cromwell.readthedocs.io
Galaxy workflows	2005	Penn State / Galaxy Project	XML/JSON workflow + GUI for wet-lab users	Very active; Galaxy Training Network ecosystem	https://galaxyproject.org
Toil	2015	UCSC Computational Genomics Lab	Python workflow framework, cloud + HPC	Active; powers UCSC large-scale pan-cancer projects	https://toil.readthedocs.io
Pegasus WMS	2002	USC/ISI + Univ. of Wisconsin	Grid/cloud workflow management	Active; large-physics & astronomy use (LIGO)	https://pegasus.isi.edu
Apache Airflow	2014	Airbnb (now Apache)	General-purpose DAG orchestration (Python)	Industry standard for data ETL; v3.0 released 2025, v2.x still widely deployed	https://airflow.apache.org
Prefect	2018	Prefect Technologies	Python orchestration; “negative engineering” focus	Active; v3 (2024) competitive vs Airflow	https://www.prefect.io
Dagster	2019	Elementl (now Dagster Labs)	Python orchestration with software-defined assets	Rapidly growing; asset-centric model differentiator	https://dagster.io
Kubeflow Pipelines DSL	2018	Google + community	Python DSL compiling to Argo on K8s for ML	Active; KFP v2 SDK current	https://www.kubeflow.org/docs/components/pipelines
MLflow Recipes	2022	Databricks	Templated ML pipeline DSL (YAML + Python)	De-emphasized after 2024; MLflow core thrives, Recipes sidelined	https://mlflow.org/docs/latest/recipes.html
Argo Workflows	2017	Applatix / Intuit (CNCF)	K8s-native container workflow DSL (YAML)	CNCF graduated; cross-listed in build-devops	https://argoproj.github.io/workflows
Bpipe	2012	Sadedin et al. (WEHI Melbourne)	Groovy-based bioinformatics pipeline DSL	Maintenance mode; eclipsed by Nextflow	https://docs.bpipe.org
Ruffus	2010	Goodstadt (Oxford)	Python decorator-based pipeline library	Quiet; still used in legacy bio pipelines	http://www.ruffus.org.uk
BioBB	2019	BSC Barcelona	Bioinformatics Building Blocks (Python wrappers + workflows)	Active; CWL/PyCOMPSs/Galaxy targets	https://mmb.irbbarcelona.org/biobb
Slurm batch scripts	2002 (Slurm)	LLNL → SchedMD	`#SBATCH`-annotated shell — cluster-submit DSL	Ubiquitous; substrate beneath all bio HPC workflows	https://slurm.schedmd.com
LSF batch scripts	1992	Platform Computing → IBM Spectrum	`#BSUB`-annotated shell	Active in legacy enterprise/pharma HPC	https://www.ibm.com/products/hpc-workload-management
BioMake	2017	Chris Mungall (LBL)	GNU Make extension w/ Prolog backtracking & cluster targets	Niche; ontology / OBO community	https://github.com/evoldoers/biomake
doit	2008	Eduardo Schettino	Python build/workflow runner (“Make-in-Python”)	Active; popular for reproducible-research compendia	https://pydoit.org
Pachyderm pipelines	2014	Pachyderm Inc. (acquired by HPE 2023)	JSON pipeline spec over data-versioned filesystem	Open-source maintained post-acquisition; enterprise focus	https://www.pachyderm.com

(Reproducible Research Compendia — e.g. rrtools, workflowr, targets — are a methodology more than a language; the R-side targets package is the closest to a DSL and is covered in scientific under R tooling.)

Notable threads

Snakemake vs Nextflow is the dominant axis of the bio pipeline world. Snakemake won academia: Python-syntax rules feel familiar, the wildcard/inference engine is elegant for “one rule per analysis step, fan-out per sample,” and labs publishing methods papers like its readability. Nextflow won pharma and industrialized genomics: dataflow channels compose better at scale, the Groovy DSL2 module system is more reusable, the nf-core community curates production-grade pipelines (rnaseq, sarek, ampliseq) under semver discipline, and Seqera Labs built a viable commercial layer (Tower, Fusion, Wave, Seqera Cloud) that pharma procurement can sign. The split is roughly: methods development and one-off lab analyses → Snakemake; regulated/clinical/CRO pipelines → Nextflow.
CWL and WDL — standards that almost no one writes by hand. Both were designed as portable workflow specifications a decade ago, on the theory that pipelines should be a vendor-neutral interchange format. Reality: developers want a real programming language with loops, conditionals and abstraction, not YAML/JSON spec files. CWL is now mostly a compiler target (e.g. Snakemake --export-cwl, BioBB, Arvados). WDL fared slightly better because it has actual syntax and Cromwell as a reference implementation, but most WDL in the wild is Broad Institute pipelines; few external groups author greenfield WDL.
The Airflow → Prefect → Dagster generational shift in general-purpose orchestration. Airflow (2014) defined the category but its DAG-as-Python-module model conflated workflow definition with execution, made testing painful, and treated data as opaque side-effects. Prefect (2018) reframed as “negative engineering” — making failure cheap with first-class retries, dynamic mapping, and a hybrid hosted backend. Dagster (2019) went further with software-defined assets — declaring data products and letting the framework derive the DAG, which fits the modern “data mesh” mental model and integrates cleanly with dbt. Airflow remains the install base; Prefect and Dagster grow at its expense in greenfield.
Why Argo Workflows didn’t displace bio DSLs despite being K8s-native. Argo’s container-per-step YAML model is lower-level than Snakemake/Nextflow: you write a workflow as a tree of templates referencing container images, with no wildcard inference, no implicit fan-out per sample, and no file-based resume from cached intermediates (step memoization exists but must be configured per template). It became an execution backend rather than a surface DSL: Kubeflow Pipelines runs on it and Hera wraps it in Python, whereas Nextflow’s K8s executor submits pods to the Kubernetes API directly and does not go through Argo. Bio pipelines stayed in Snakemake/Nextflow; Argo won the general K8s job-graph slot.
Seqera Cloud / Tower as commercial Nextflow infrastructure. Seqera Labs (founded by Nextflow’s creators) commercialized the runtime around Nextflow — Tower (orchestration UI, monitoring, multi-cloud launch), Fusion (POSIX→object-store filesystem), Wave (on-the-fly container building), Seqera Cloud (managed control plane). This pattern (open-source DSL, commercial control plane) is now the de facto bio workflow business model and pulled real pharma money into the ecosystem in a way Snakemake’s pure-academic model has not (though Snakemake has Snakedeploy and Snakemake Workflow Catalog as community equivalents).
The LSF/Slurm submit-script lineage as the substrate everything else compiles to. On an HPC cluster, every higher-level bio DSL eventually emits sbatch/bsub invocations against the scheduler, and the #SBATCH --mem=64G --time=12:00:00 directive vocabulary is the lowest common denominator. Slurm dominates academia (it is free, and most NSF/DOE-funded sites run it); LSF persists in legacy pharma and IBM-aligned shops. Both are themselves DSLs in the sense that the comment-prefixed directive grammar is parsed by the scheduler, which is why even Snakemake users learn --cluster "sbatch ..." syntax (replaced by executor plugins in v8) sooner or later. Cloud batch services (AWS Batch, Google Batch, Azure Batch) increasingly compete, but the on-prem HPC reality is still Slurm-shaped.
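As an illustration of the submit-script layer described in the last thread, the sketch below renders a Slurm batch script from a resource request. It is a simplified assumption of what a workflow engine does internally, not code from any of them; `render_sbatch` and the example command are hypothetical, while `--job-name`, `--mem`, `--time` and `--cpus-per-task` are standard `sbatch` options.

```python
def render_sbatch(job_name, command, mem="64G", time="12:00:00", cpus=1):
    """Render a batch script whose #SBATCH comment lines carry the resource request."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time}",
        f"#SBATCH --cpus-per-task={cpus}",
        "",
        "set -euo pipefail",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The command is a placeholder; a real engine would substitute the
    # rule's shell block or a wrapper that re-invokes the engine per job.
    print(render_sbatch("align_sample137", "echo aligning sample137", cpus=8))
```

The directives are ordinary shell comments, so the same file also runs under plain `bash`; only `sbatch` reads them as a resource request.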

Citations

Snakemake docs — https://snakemake.readthedocs.io
Köster J, Rahmann S (2012). Snakemake — a scalable bioinformatics workflow engine. Bioinformatics. https://academic.oup.com/bioinformatics/article/28/19/2520/290322
Nextflow docs — https://www.nextflow.io/docs/latest/index.html
Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology. https://www.nature.com/articles/nbt.3820
nf-core — https://nf-co.re
Seqera Labs — https://seqera.io
WDL specification — https://github.com/openwdl/wdl
OpenWDL — https://openwdl.org
Cromwell docs — https://cromwell.readthedocs.io
Common Workflow Language — https://www.commonwl.org
CWL specification — https://www.commonwl.org/v1.2
Galaxy Project — https://galaxyproject.org
Toil — https://toil.readthedocs.io
Pegasus WMS — https://pegasus.isi.edu
Apache Airflow — https://airflow.apache.org
Prefect — https://docs.prefect.io
Dagster — https://docs.dagster.io
Kubeflow Pipelines — https://www.kubeflow.org/docs/components/pipelines
MLflow Recipes — https://mlflow.org/docs/latest/recipes.html
Argo Workflows — https://argoproj.github.io/workflows
Bpipe — https://docs.bpipe.org
Ruffus — http://www.ruffus.org.uk
BioBB — https://mmb.irbbarcelona.org/biobb
Slurm — https://slurm.schedmd.com
IBM Spectrum LSF — https://www.ibm.com/products/hpc-workload-management
BioMake — https://github.com/evoldoers/biomake
doit — https://pydoit.org
Pachyderm — https://docs.pachyderm.com
targets (R) — https://docs.ropensci.org/targets
