Bioinformatics & Scientific Workflow DSLs Family Index


type: language-family-index
family: bio-workflow
languages_catalogued: 22
tags: [language-reference, family-index, bioinformatics, workflow, scientific-computing, pipeline, data-orchestration]


Family overview

Scientific workflow DSLs are a specific subgenre of build/dataflow tools. They exist because three forces converged in computational science: (a) reproducibility demands explicit declaration of every input, output, parameter and tool version so a paper’s pipeline can be re-run a decade later; (b) jobs are long-running and prone to partial failure: a 200-sample variant-calling run that dies on sample 137 must resume rather than restart, so all of these systems are essentially Make-with-checkpointing; and (c) the execution target is heterogeneous: the same workflow must run on a developer’s laptop, an LSF/Slurm HPC cluster, AWS Batch, Google Cloud Batch (the successor to the Cloud Life Sciences API), and Kubernetes, often switching mid-project. No general-purpose build tool addressed all three, so the field grew its own.
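The resume requirement in (b) can be illustrated with a minimal Python sketch of the Make-style freshness rule. It assumes file outputs and modification times are the only freshness signal; the names `Step`, `is_stale` and `run_pipeline` are hypothetical and do not correspond to any engine's API. Real engines often track more than timestamps (parameters, code, container versions), so treat this as an illustration of the idea rather than a description of how any of them work.

```python
import os
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One pipeline step with explicitly declared inputs and outputs (hypothetical type)."""
    name: str
    inputs: List[str]
    outputs: List[str]
    action: Callable[[], None]


def is_stale(step: Step) -> bool:
    """A step must run if any output is missing or older than any input."""
    if not all(os.path.exists(path) for path in step.outputs):
        return True
    if not step.inputs:
        return False
    # Assumes every declared input already exists on disk.
    oldest_output = min(os.path.getmtime(path) for path in step.outputs)
    newest_input = max(os.path.getmtime(path) for path in step.inputs)
    return newest_input > oldest_output


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in the given (already dependency-sorted) order, skipping fresh ones.

    Returns the names of the steps that actually executed.
    """
    executed = []
    for step in steps:
        if is_stale(step):
            step.action()
            executed.append(step.name)
    return executed
```

Calling `run_pipeline` a second time after a crash re-runs only the steps whose outputs are missing or out of date, which is the behavior the 200-sample example needs: samples 1 to 136 are skipped and work restarts at sample 137.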

The ecosystem has a distinct bio/genomics origin bias. Snakemake (Köster, 2012, University of Duisburg-Essen), Nextflow (Notredame lab, CRG Barcelona, 2013), WDL (Broad Institute, ~2012) and CWL (Common Workflow Language working group, 2014) all came from genomics groups solving genomics-shaped problems: fan-out per sample, fan-in for joint analysis, container-per-tool isolation. Cromwell is the Broad’s WDL execution engine. Galaxy predates all four (2005) and provided a GUI-driven XML workflow model for wet-lab scientists. Toil (UCSC) and Pegasus (USC/ISI) took programmatically defined DAGs to grid and cloud.

Parallel to bio, the general-purpose data-orchestration lineage evolved separately: Apache Airflow (Airbnb, 2014) brought Python-DSL DAGs to ETL, Prefect (2018) and Dagster (2019) responded to Airflow’s pain points, and Kubeflow Pipelines / MLflow Recipes adapted the same model for ML. The two camps remain culturally distinct — bio workflows obsess over reproducibility and resume semantics; data orchestration obsesses over scheduling, observability and SLAs — but they’re converging via Argo Workflows (K8s-native, used by both) and Pachyderm (data-versioning + pipelines). Note also the standards push (CWL, WDL — portable workflow specs) versus the framework push (Snakemake, Nextflow — opinionated runtimes); standards lost the developer mind-share war but won as compiler targets.

In our deep library

None directly — the deep library covers host languages and domain languages, not DSLs layered atop them. Cross-reference scientific (R, Stan, MATLAB, Mathematica numerical tooling that workflows orchestrate), build-devops (Argo Workflows and Tekton are cross-listed there as K8s pipeline runners), python (host language for Snakemake, Airflow, Prefect, Dagster, Kubeflow Pipelines DSL, MLflow, Toil, Ruffus, doit, BioBB), and groovy (host for Nextflow and Bpipe).

Tier 3 family table

| Language | First appeared | Origin | Domain | Status (2026) | URL |
|---|---|---|---|---|---|
| Snakemake | 2012 | Johannes Köster, Univ. of Duisburg-Essen | Bioinformatics pipeline DSL (Python-flavored) | Dominant in academic genomics; v8 active | https://snakemake.readthedocs.io |
| Nextflow | 2013 | Paolo Di Tommaso / CRG; Seqera Labs | Dataflow pipeline DSL (Groovy-based) | Dominant in pharma/industry; nf-core ecosystem thriving | https://www.nextflow.io |
| WDL | ~2012 | Broad Institute (OpenWDL) | Portable workflow spec | Active standard; v1.1 current, v2.0 in progress | https://openwdl.org |
| CWL | 2014 | Common Workflow Language working group (multi-vendor) | Portable workflow spec (YAML/JSON) | Stable v1.2; niche but durable as compiler target | https://www.commonwl.org |
| Cromwell | 2016 | Broad Institute | WDL/CWL execution engine (Scala/JVM) | Active; backbone of Terra and GATK pipelines | https://cromwell.readthedocs.io |
| Galaxy workflows | 2005 | Penn State / Galaxy Project | XML/JSON workflow + GUI for wet-lab users | Very active; Galaxy Training Network ecosystem | https://galaxyproject.org |
| Toil | 2015 | UCSC Computational Genomics Lab | Python workflow framework, cloud + HPC | Active; powers UCSC large-scale pan-cancer projects | https://toil.readthedocs.io |
| Pegasus WMS | 2002 | USC/ISI + Univ. of Wisconsin | Grid/cloud workflow management | Active; large-physics & astronomy use (LIGO) | https://pegasus.isi.edu |
| Apache Airflow | 2014 | Airbnb (now Apache) | General-purpose DAG orchestration (Python) | Industry standard for data ETL; v3 generally available since 2025, v2.x still widely deployed | https://airflow.apache.org |
| Prefect | 2018 | Prefect Technologies | Python orchestration; “negative engineering” focus | Active; v3 (2024) competitive vs Airflow | https://www.prefect.io |
| Dagster | 2019 | Elementl (now Dagster Labs) | Python orchestration with software-defined assets | Rapidly growing; asset-centric model differentiator | https://dagster.io |
| Kubeflow Pipelines DSL | 2018 | Google + community | Python DSL compiling to Argo on K8s for ML | Active; KFP v2 SDK current | https://www.kubeflow.org/docs/components/pipelines |
| MLflow Recipes | 2022 | Databricks | Templated ML pipeline DSL (YAML + Python) | De-emphasized after 2024; MLflow core thrives, Recipes sidelined | https://mlflow.org/docs/latest/recipes.html |
| Argo Workflows | 2017 | Applatix / Intuit (CNCF) | K8s-native container workflow DSL (YAML) | CNCF graduated; cross-listed in build-devops | https://argoproj.github.io/workflows |
| Bpipe | 2012 | Sadedin et al. (MCRI Melbourne) | Groovy-based bioinformatics pipeline DSL | Maintenance mode; eclipsed by Nextflow | https://docs.bpipe.org |
| Ruffus | 2010 | Goodstadt (Oxford) | Python decorator-based pipeline library | Quiet; still used in legacy bio pipelines | http://www.ruffus.org.uk |
| BioBB | 2019 | BSC Barcelona | Bioinformatics Building Blocks (Python wrappers + workflows) | Active; CWL/PyCOMPSs/Galaxy targets | https://mmb.irbbarcelona.org/biobb |
| Slurm batch scripts | 2002 (Slurm) | LLNL → SchedMD | #SBATCH-annotated shell; cluster-submit DSL | Ubiquitous; substrate beneath all bio HPC workflows | https://slurm.schedmd.com |
| LSF batch scripts | 1992 | Platform Computing → IBM Spectrum | #BSUB-annotated shell | Active in legacy enterprise/pharma HPC | https://www.ibm.com/products/hpc-workload-management |
| BioMake | 2017 | Chris Mungall (LBL) | GNU Make extension w/ Prolog backtracking & cluster targets | Niche; ontology / OBO community | https://github.com/evoldoers/biomake |
| doit | 2008 | Eduardo Schettino | Python build/workflow runner (“Make-in-Python”) | Active; popular for reproducible-research compendia | https://pydoit.org |
| Pachyderm pipelines | 2014 | Pachyderm Inc. (acquired by HPE 2023) | JSON pipeline spec over data-versioned filesystem | Open-source maintained post-acquisition; enterprise focus | https://www.pachyderm.com |

(Reproducible Research Compendia — e.g. rrtools, workflowr, targets — are a methodology more than a language; the R-side targets package is the closest to a DSL and is covered in scientific under R tooling.)

Notable threads

  • Snakemake vs Nextflow is the dominant axis of the bio pipeline world. Snakemake won academia: Python-syntax rules feel familiar, the wildcard/inference engine is elegant for “one rule per analysis step, fan-out per sample,” and labs publishing methods papers like its readability. Nextflow won pharma and industrialized genomics: dataflow channels compose better at scale, the Groovy DSL2 module system is more reusable, the nf-core community curates production-grade pipelines (rnaseq, sarek, ampliseq) under semver discipline, and Seqera Labs built a viable commercial layer (Tower, Fusion, Wave, Seqera Cloud) that pharma procurement can sign. The split is roughly: methods development and one-off lab analyses → Snakemake; regulated/clinical/CRO pipelines → Nextflow.
  • CWL and WDL — standards that almost no one writes by hand. Both were designed as portable workflow specifications a decade ago, on the theory that pipelines should be a vendor-neutral interchange format. Reality: developers want a real programming language with loops, conditionals and abstraction, not YAML/JSON spec files. CWL is now mostly a compiler target (e.g. Snakemake --export-cwl, BioBB, Arvados). WDL fared slightly better because it has actual syntax and Cromwell as a reference implementation, but most WDL in the wild is Broad Institute pipelines; few external groups author greenfield WDL.
  • The Airflow → Prefect → Dagster generational shift in general-purpose orchestration. Airflow (2014) defined the category, but its DAG-as-Python-module model conflated workflow definition with execution, made testing painful, and treated data as opaque side-effects. Prefect (2018) reframed orchestration as “negative engineering”: making failure cheap with first-class retries, dynamic mapping, and a hybrid hosted backend. Dagster (2019) went further with software-defined assets: you declare data products and let the framework derive the DAG, which fits the modern “data mesh” mental model and integrates cleanly with dbt. Airflow remains the install base; Prefect and Dagster grow at its expense in greenfield projects.
  • Why Argo Workflows didn’t displace bio DSLs despite being K8s-native. Argo’s container-per-step YAML model is lower-level than Snakemake/Nextflow: you write a workflow as a tree of templates referencing container images, with no wildcard inference, no implicit fan-out per sample, and no file-based resume from cached intermediates comparable to what Snakemake and Nextflow provide. It became an execution backend rather than a surface DSL (Kubeflow Pipelines compiles to it, and Hera is a Python SDK for it); Nextflow’s own K8s executor submits pods through the Kubernetes API directly rather than through Argo. Bio pipelines stayed in Snakemake/Nextflow; Argo won the general K8s job-graph slot.
  • Seqera Cloud / Tower as commercial Nextflow infrastructure. Seqera Labs (founded by Nextflow’s creators) commercialized the runtime around Nextflow — Tower (orchestration UI, monitoring, multi-cloud launch), Fusion (POSIX→object-store filesystem), Wave (on-the-fly container building), Seqera Cloud (managed control plane). This pattern (open-source DSL, commercial control plane) is now the de facto bio workflow business model and pulled real pharma money into the ecosystem in a way Snakemake’s pure-academic model has not (though Snakemake has Snakedeploy and Snakemake Workflow Catalog as community equivalents).
  • The LSF/Slurm submit-script lineage as the substrate everything else compiles to. On HPC, every higher-level bio DSL eventually emits sbatch/bsub invocations against the scheduler, and the #SBATCH --mem=64G --time=12:00:00 directive vocabulary is the lowest common denominator. Slurm dominates academia (it is free, and it is the usual scheduler at NSF/DOE-funded sites); LSF persists in legacy pharma and IBM-aligned shops. Both are themselves DSLs in the sense that the comment-prefixed directive grammar is parsed by the scheduler, which is why even Snakemake users sooner or later learn sbatch syntax, whether through the older --cluster "sbatch ..." option or through the executor plugins that replaced it in v8. Cloud batch services (AWS Batch, Google Batch, Azure Batch) increasingly compete, but the on-prem HPC reality is still Slurm-shaped.
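To make the #SBATCH directive vocabulary from the last bullet concrete, here is a minimal Python sketch of the kind of submit script a higher-level engine might generate. The function `render_sbatch_script`, the job name, and the sample file names are hypothetical and chosen for illustration; `--job-name`, `--mem` and `--time` are standard sbatch options. This is not how any specific engine builds its scripts.

```python
from typing import Dict, List


def render_sbatch_script(command: str, resources: Dict[str, str]) -> str:
    """Render a shell script whose #SBATCH comment lines carry the resource requests."""
    lines: List[str] = ["#!/bin/bash"]
    # Each resource becomes one comment-prefixed directive that sbatch parses.
    for option, value in resources.items():
        lines.append(f"#SBATCH --{option}={value}")
    lines.append("")
    lines.append(command)
    return "\n".join(lines) + "\n"


# Hypothetical per-sample alignment job; file names are placeholders.
script = render_sbatch_script(
    "bwa mem ref.fa sample137.fastq > sample137.sam",
    {"job-name": "align_sample137", "mem": "64G", "time": "12:00:00"},
)
print(script)
```

Because the directives are shell comments, the same file runs unchanged under plain bash on a laptop, which is part of why this format works as a lowest common denominator.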

Citations