Bioinformatics & Scientific Workflow DSLs Family Index
type: language-family-index
family: bio-workflow
languages_catalogued: 22
tags: [language-reference, family-index, bioinformatics, workflow, scientific-computing, pipeline, data-orchestration]
Family overview
Scientific workflow DSLs are a specific subgenre of build/dataflow tools. They exist because three forces converged in computational science: (a) reproducibility demands explicit declaration of every input, output, parameter and tool version so a paper’s pipeline can be re-run a decade later; (b) jobs are long-running and partial-failure-prone — a 200-sample variant-calling run that dies on sample 137 must resume rather than restart, so all of these systems are essentially Make-with-checkpointing; and (c) the execution target is heterogeneous — the same workflow must run on a developer’s laptop, an LSF/Slurm HPC cluster, AWS Batch, Google Life Sciences, and Kubernetes, often switching mid-project. No general-purpose build tool addressed all three, so the field grew its own.
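The "Make-with-checkpointing" idea in point (b) can be illustrated with a minimal sketch. It assumes a step counts as done when its output file exists; real engines typically also compare timestamps, checksums, or parameters. The names `run_step`, `call_variants`, and `run_all` are hypothetical and not taken from any of the tools discussed here.

```python
import tempfile
from pathlib import Path


def run_step(output: Path, action) -> bool:
    """Run `action` only if `output` does not exist yet.

    Returns True if the step ran, False if it was skipped as already complete.
    """
    if output.exists():
        return False
    tmp = output.with_name(output.name + ".tmp")
    action(tmp)          # write to a temporary path first
    tmp.replace(output)  # the rename marks the step as complete
    return True


def call_variants(sample: str, out: Path) -> None:
    # Hypothetical stand-in for a long-running tool invocation.
    out.write_text(f"variants for {sample}\n")


def run_all(samples, workdir: Path):
    """Process every sample; return the samples that were actually computed."""
    ran = []
    for sample in samples:
        target = workdir / f"{sample}.vcf"
        if run_step(target, lambda tmp, s=sample: call_variants(s, tmp)):
            ran.append(sample)
    return ran


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    samples = [f"sample{i:03d}" for i in range(1, 6)]
    print(run_all(samples[:3], workdir))  # simulates a run that stopped early
    print(run_all(samples, workdir))      # resumes: only the missing samples run
```

Writing to a temporary path and renaming afterwards means a job killed mid-write leaves no file that looks complete, which is what makes "output exists" a usable resume signal in this sketch.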
The ecosystem has a distinct bio/genomics origin bias. Snakemake (Köster, 2012, University of Duisburg-Essen), Nextflow (Notredame lab, CRG Barcelona, 2013), WDL (Broad Institute, ~2012) and CWL (Open Workflow Group, 2014) all came from genomics labs solving genomics-shaped problems: fan-out per sample, fan-in for joint analysis, and container-per-tool isolation. Cromwell is the Broad’s WDL execution engine. Galaxy predates them all (2005) and provided a GUI-driven XML workflow model for wet-lab scientists. Toil (UCSC) and Pegasus (USC/ISI) extended Python-driven DAGs to grid and cloud.
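The fan-out per sample and fan-in for joint analysis shape can be sketched in a few lines. This is an illustration loosely in the spirit of wildcard-style rules, not the implementation of any engine named here; `match_wildcards`, `expand`, `plan`, and the file layout (`reads/`, `calls/`, `joint/`) are assumptions made for the example.

```python
import re
from itertools import product


def match_wildcards(pattern: str, path: str):
    """Return wildcard values if `path` matches `pattern`, else None.

    `pattern` uses {name} placeholders, e.g. "calls/{sample}.vcf".
    """
    regex, pos = "", 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])
        regex += f"(?P<{m.group(1)}>[^/]+)"  # a wildcard never spans a "/"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    hit = re.fullmatch(regex, path)
    return hit.groupdict() if hit else None


def expand(pattern: str, **values):
    """Fan out: fill the pattern with every combination of the given values."""
    keys = list(values)
    combos = product(*(values[k] for k in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


def plan(samples):
    """Build (inputs, output) jobs: one per sample, then one joint fan-in job."""
    per_sample_in = "reads/{sample}.fastq"
    per_sample_out = "calls/{sample}.vcf"
    targets = expand(per_sample_out, sample=samples)
    jobs = []
    for target in targets:
        wildcards = match_wildcards(per_sample_out, target)
        jobs.append(([per_sample_in.format(**wildcards)], target))
    jobs.append((targets, "joint/all.vcf"))  # fan-in over all per-sample outputs
    return jobs


if __name__ == "__main__":
    for inputs, output in plan(["A", "B", "C"]):
        print(inputs, "->", output)
```

The point of the design is that the per-sample rule is written once; the set of concrete jobs is derived from the requested targets rather than enumerated by hand.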
Parallel to bio, the general-purpose data-orchestration lineage evolved separately: Apache Airflow (Airbnb, 2014) brought Python-DSL DAGs to ETL, Prefect (2018) and Dagster (2019) responded to Airflow’s pain points, and Kubeflow Pipelines / MLflow Recipes adapted the same model for ML. The two camps remain culturally distinct — bio workflows obsess over reproducibility and resume semantics; data orchestration obsesses over scheduling, observability and SLAs — but they’re converging via Argo Workflows (K8s-native, used by both) and Pachyderm (data-versioning + pipelines). Note also the standards push (CWL, WDL — portable workflow specs) versus the framework push (Snakemake, Nextflow — opinionated runtimes); standards lost the developer mind-share war but won as compiler targets.
In our deep library
None directly — the deep library covers host languages and domain languages, not DSLs layered atop them. Cross-reference scientific (R, Stan, MATLAB, Mathematica numerical tooling that workflows orchestrate), build-devops (Argo Workflows and Tekton are cross-listed there as K8s pipeline runners), python (host language for Snakemake, Airflow, Prefect, Dagster, Kubeflow Pipelines DSL, MLflow, Toil, Ruffus, doit, BioBB), and groovy (host for Nextflow and Bpipe).
Tier 3 family table
| Language | First appeared | Origin | Domain | Status (2026) | URL |
|---|---|---|---|---|---|
| Snakemake | 2012 | Johannes Köster, Univ. of Duisburg-Essen | Bioinformatics pipeline DSL (Python-flavored) | Dominant in academic genomics; v8 active | https://snakemake.readthedocs.io |
| Nextflow | 2013 | Paolo Di Tommaso / CRG; Seqera Labs | Dataflow pipeline DSL (Groovy-based) | Dominant in pharma/industry; nf-core ecosystem thriving | https://www.nextflow.io |
| WDL | ~2012 | Broad Institute (OpenWDL) | Portable workflow spec | Active standard; v1.1 current, v2.0 in progress | https://openwdl.org |
| CWL | 2014 | Open Workflow Group (multi-vendor) | Portable workflow spec (YAML/JSON) | Stable v1.2; niche but durable as compiler target | https://www.commonwl.org |
| Cromwell | 2016 | Broad Institute | WDL/CWL execution engine (Scala/JVM) | Active; backbone of Terra and GATK pipelines | https://cromwell.readthedocs.io |
| Galaxy workflows | 2005 | Penn State / Galaxy Project | XML/JSON workflow + GUI for wet-lab users | Very active; Galaxy Training Network ecosystem | https://galaxyproject.org |
| Toil | 2015 | UCSC Computational Genomics Lab | Python workflow framework, cloud + HPC | Active; powers UCSC large-scale pan-cancer projects | https://toil.readthedocs.io |
| Pegasus WMS | 2002 | USC/ISI + Univ. of Wisconsin | Grid/cloud workflow management | Active; large-physics & astronomy use (LIGO) | https://pegasus.isi.edu |
| Apache Airflow | 2014 | Airbnb (now Apache) | General-purpose DAG orchestration (Python) | Industry standard for data ETL; v3 released 2025, v2.x still widely deployed | https://airflow.apache.org |
| Prefect | 2018 | Prefect Technologies | Python orchestration; “negative engineering” focus | Active; v3 (2024) competitive vs Airflow | https://www.prefect.io |
| Dagster | 2019 | Elementl | Python orchestration with software-defined assets | Rapidly growing; asset-centric model differentiator | https://dagster.io |
| Kubeflow Pipelines DSL | 2018 | Google + community | Python DSL compiling to Argo on K8s for ML | Active; KFP v2 SDK current | https://www.kubeflow.org/docs/components/pipelines |
| MLflow Recipes | 2022 | Databricks | Templated ML pipeline DSL (YAML + Python) | De-emphasized after 2024; MLflow core thrives, Recipes sidelined | https://mlflow.org/docs/latest/recipes.html |
| Argo Workflows | 2017 | Applatix / Intuit (CNCF) | K8s-native container workflow DSL (YAML) | CNCF graduated; cross-listed in build-devops | https://argoproj.github.io/workflows |
| Bpipe | 2012 | Sadedin et al. (MCRI Melbourne) | Groovy-based bioinformatics pipeline DSL | Maintenance mode; eclipsed by Nextflow | https://docs.bpipe.org |
| Ruffus | 2010 | Goodstadt (Oxford) | Python decorator-based pipeline library | Quiet; still used in legacy bio pipelines | http://www.ruffus.org.uk |
| BioBB | 2019 | BSC Barcelona | Bioinformatics Building Blocks (Python wrappers + workflows) | Active; CWL/PyCOMPSs/Galaxy targets | https://mmb.irbbarcelona.org/biobb |
| Slurm batch scripts | 2002 (Slurm) | LLNL → SchedMD | #SBATCH-annotated shell — cluster-submit DSL | Ubiquitous; substrate beneath all bio HPC workflows | https://slurm.schedmd.com |
| LSF batch scripts | 1992 | Platform Computing → IBM Spectrum | #BSUB-annotated shell | Active in legacy enterprise/pharma HPC | https://www.ibm.com/products/hpc-workload-management |
| BioMake | 2017 | Chris Mungall (LBL) | GNU Make extension w/ Prolog backtracking & cluster targets | Niche; ontology / OBO community | https://github.com/evoldoers/biomake |
| doit | 2008 | Eduardo Schettino | Python build/workflow runner (“Make-in-Python”) | Active; popular for reproducible-research compendia | https://pydoit.org |
| Pachyderm pipelines | 2014 | Pachyderm Inc. (acquired by HPE 2023) | JSON pipeline spec over data-versioned filesystem | Open-source maintained post-acquisition; enterprise focus | https://www.pachyderm.com |
(Reproducible Research Compendia — e.g. rrtools, workflowr, targets — are a methodology more than a language; the R-side targets package is the closest to a DSL and is covered in scientific under R tooling.)
Notable threads
- Snakemake vs Nextflow is the dominant axis of the bio pipeline world. Snakemake won academia: Python-syntax rules feel familiar, the wildcard/inference engine is elegant for “one rule per analysis step, fan-out per sample,” and labs publishing methods papers like its readability. Nextflow won pharma and industrialized genomics: dataflow channels compose better at scale, the Groovy DSL2 module system is more reusable, the nf-core community curates production-grade pipelines (rnaseq, sarek, ampliseq) under semver discipline, and Seqera Labs built a viable commercial layer (Tower, Fusion, Wave, Seqera Cloud) that pharma procurement can sign. The split is roughly: methods development and one-off lab analyses → Snakemake; regulated/clinical/CRO pipelines → Nextflow.
- CWL and WDL: standards that almost no one writes by hand. Both were designed as portable workflow specifications a decade ago, on the theory that pipelines should be a vendor-neutral interchange format. Reality: developers want a real programming language with loops, conditionals and abstraction, not YAML/JSON spec files. CWL is now mostly a compiler target (e.g. Snakemake `--export-cwl`, BioBB, Arvados). WDL fared slightly better because it has actual syntax and Cromwell as a reference implementation, but most WDL in the wild is Broad Institute pipelines; few external groups author greenfield WDL.
- The Airflow → Prefect → Dagster generational shift in general-purpose orchestration. Airflow (2014) defined the category, but its DAG-as-Python-module model conflated workflow definition with execution, made testing painful, and treated data as opaque side-effects. Prefect (2018) reframed the problem as “negative engineering”: making failure cheap with first-class retries, dynamic mapping, and a hybrid hosted backend. Dagster (2019) went further with software-defined assets, declaring data products and letting the framework derive the DAG, which fits the modern “data mesh” mental model and integrates cleanly with dbt. Airflow remains the install base; Prefect and Dagster grow at its expense in greenfield projects.
- Why Argo Workflows didn’t displace bio DSLs despite being K8s-native. Argo’s container-per-step YAML model is lower-level than Snakemake/Nextflow: you write a workflow as a tree of templates referencing container images, with no wildcard inference, no implicit fan-out per sample, and no built-in resume from cached intermediates. It became an execution backend of choice (Kubeflow Pipelines compiles to it; Hera wraps it in Python) rather than the surface DSL. Nextflow’s K8s executor, by contrast, submits pods through the Kubernetes API directly and does not depend on Argo. Bio pipelines stayed in Snakemake/Nextflow; Argo won the general K8s job-graph slot.
- Seqera Cloud / Tower as commercial Nextflow infrastructure. Seqera Labs (founded by Nextflow’s creators) commercialized the runtime around Nextflow — Tower (orchestration UI, monitoring, multi-cloud launch), Fusion (POSIX→object-store filesystem), Wave (on-the-fly container building), Seqera Cloud (managed control plane). This pattern (open-source DSL, commercial control plane) is now the de facto bio workflow business model and pulled real pharma money into the ecosystem in a way Snakemake’s pure-academic model has not (though Snakemake has Snakedeploy and Snakemake Workflow Catalog as community equivalents).
- The LSF/Slurm submit-script lineage as the substrate everything else compiles to. When targeting a cluster, every higher-level bio DSL eventually emits `sbatch`/`bsub` invocations against an HPC scheduler; the `#SBATCH --mem=64G --time=12:00:00` directive vocabulary is the lowest common denominator. Slurm dominates academia (it is free, and most NSF/DOE-funded sites run it); LSF persists in legacy pharma and IBM-aligned shops. Both are themselves DSLs in the sense that the comment-prefixed directive grammar is parsed by the scheduler, which is why even Snakemake users learn `--cluster "sbatch ..."` syntax (or, from v8 on, the equivalent executor-plugin settings) sooner or later. Cloud batch services (AWS Batch, Google Batch, Azure Batch) increasingly compete, but the on-prem HPC reality is still Slurm-shaped.
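To make the "comment-prefixed directive grammar" concrete, here is a small sketch of what a higher-level tool does when it emits a Slurm submit script, plus a simplified reader for the directives. The helper names `render_sbatch` and `parse_directives` are hypothetical; the parser handles only the `--key=value` form and is a simplification of what `sbatch` itself accepts. The options used (`--job-name`, `--mem`, `--time`, `--cpus-per-task`) are standard `sbatch` options.

```python
def render_sbatch(job_name: str, command: str, mem: str = "64G",
                  time: str = "12:00:00", cpus: int = 1) -> str:
    """Render a submit script with resources expressed as #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time}",
        f"#SBATCH --cpus-per-task={cpus}",
        "",
        "set -euo pipefail",
        command,
    ]
    return "\n".join(lines) + "\n"


def parse_directives(script: str) -> dict:
    """Collect --key=value #SBATCH options from the leading comment block.

    Scanning stops at the first line that is neither blank nor a comment,
    so directives placed after a command are ignored.
    """
    options = {}
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            rest = stripped[len("#SBATCH"):].strip()
            if rest.startswith("--") and "=" in rest:
                key, value = rest[2:].split("=", 1)
                options[key] = value
        elif stripped == "" or stripped.startswith("#"):
            continue
        else:
            break
    return options


if __name__ == "__main__":
    script = render_sbatch("call_sample137", "echo calling variants for sample137")
    print(script)
    print(parse_directives(script))
```

To the shell the directives are ordinary comments; only the scheduler gives them meaning, which is the sense in which the submit script is a small DSL of its own.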
Citations
- Snakemake docs — https://snakemake.readthedocs.io
- Köster J, Rahmann S (2012). Snakemake — a scalable bioinformatics workflow engine. Bioinformatics. https://academic.oup.com/bioinformatics/article/28/19/2520/290322
- Nextflow docs — https://www.nextflow.io/docs/latest/index.html
- Di Tommaso et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology. https://www.nature.com/articles/nbt.3820
- nf-core — https://nf-co.re
- Seqera Labs — https://seqera.io
- WDL specification — https://github.com/openwdl/wdl
- OpenWDL — https://openwdl.org
- Cromwell docs — https://cromwell.readthedocs.io
- Common Workflow Language — https://www.commonwl.org
- CWL specification — https://www.commonwl.org/v1.2
- Galaxy Project — https://galaxyproject.org
- Toil — https://toil.readthedocs.io
- Pegasus WMS — https://pegasus.isi.edu
- Apache Airflow — https://airflow.apache.org
- Prefect — https://docs.prefect.io
- Dagster — https://docs.dagster.io
- Kubeflow Pipelines — https://www.kubeflow.org/docs/components/pipelines
- MLflow Recipes — https://mlflow.org/docs/latest/recipes.html
- Argo Workflows — https://argoproj.github.io/workflows
- Bpipe — https://docs.bpipe.org
- Ruffus — http://www.ruffus.org.uk
- BioBB — https://mmb.irbbarcelona.org/biobb
- Slurm — https://slurm.schedmd.com
- IBM Spectrum LSF — https://www.ibm.com/products/hpc-workload-management
- BioMake — https://github.com/evoldoers/biomake
- doit — https://pydoit.org
- Pachyderm — https://docs.pachyderm.com
- targets (R) — https://docs.ropensci.org/targets