Pathway Databases & Bioinformatics Resources

This Tier 3 family index catalogues the public + commercial knowledge bases that practising computational biologists rely on, ordered by content type: sequence → structure → pathway → ontology → variant/disease → drug/chemistry → ML/foundation models → pipeline tools → reproducibility. Numerical scale figures are mid-2026 unless noted. SI units throughout (storage in B/kB/MB/GB/TB/PB; sequence size in bp; protein in residues or Da).

Sequence databases — the INSDC and friends

The International Nucleotide Sequence Database Collaboration (INSDC) is the foundational three-way mirror of all public DNA/RNA sequence: GenBank (NCBI USA) ⇄ ENA (EBI Europe) ⇄ DDBJ (Japan). Submissions to any node propagate to all three nightly. Total holdings as of 2026: ~5 × 10¹² bp (5 Tbp) in core assembled sequence; ~5 × 10¹⁶ bp (50 Pbp) in raw read archives. Identifiers across all three are interchangeable (accession.version format).
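The accession.version convention can be illustrated with a minimal parsing sketch. The regular expression below is a deliberately loose assumption (uppercase letters, digits and underscores, then a dot and an integer version); it is not the official INSDC grammar, and the helper name `parse_accession` is hypothetical.

```python
import re
from typing import NamedTuple


class Accession(NamedTuple):
    accession: str
    version: int


# Loose pattern (an assumption of this sketch, not the INSDC specification):
# an alphanumeric/underscore stem, a dot, then an integer version.
_ACC_RE = re.compile(r"^([A-Z0-9_]+)\.(\d+)$")


def parse_accession(text: str) -> Accession:
    """Split 'U49845.1' into the stable accession and its version number."""
    match = _ACC_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not in accession.version form: {text!r}")
    return Accession(match.group(1), int(match.group(2)))


print(parse_accession("U49845.1"))
```

Keeping the version as an integer makes it easy to detect when a record has been revised while the accession itself stays stable.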

NCBI (National Center for Biotechnology Information; NLM/NIH; Bethesda MD; established 1988; David Lipman director 1989–2017)

  • GenBank — primary nucleotide repository; ~270 million sequence records; INSDC partner
  • RefSeq — curated reference sequences (NCBI staff curation); ~250,000 organisms with at least one RefSeq; assembly accessions GCF_*
  • GEO Gene Expression Omnibus — ~250,000 series (GSE), ~7 million samples (GSM); microarray + RNA-seq + ChIP-seq + scRNA-seq; the EBI counterpart is ArrayExpress (now part of BioStudies)
  • SRA Sequence Read Archive — raw reads from sequencing instruments; ~50 PB as of 2025 (5 × 10¹⁶ bytes); cloud mirrors AWS Open Data + GCP + Azure
  • dbSNP — 1 billion+ variants (mostly SNPs and small indels); reference RSIDs (rs prefix)
  • dbVar — structural variants ≥ 50 bp
  • ClinVar — clinically annotated variants; ~3 million records; five-tier classification (B/LB/VUS/LP/P) with star-rated review level
  • dbGaP — controlled-access genotype/phenotype (US NIH project data; biobank summary statistics)
  • BioProject + BioSample — metadata anchors linking submissions
  • PubChem — small-molecule chemistry (NLM; ~115 million compounds; 800+ data sources); bioassays + patent + bioactivity + literature
  • PubMed — citation/abstract database; ~36 million records (2024); MeSH-indexed
  • PMC PubMed Central — full-text open-access (~10 million articles)
  • MedGen — phenotype + disease terms cross-mapped to ClinVar, OMIM, MONDO, HPO
  • Taxonomy — NCBI Taxonomy Database (formal arbiter for sequence-record organism)
  • Genome — assembly hub; current human reference GRCh38.p14 (2022) + T2T-CHM13v2.0 Telomere-to-Telomere 2022 (Nurk 2022 Science)
  • BLAST — Basic Local Alignment Search Tool (Altschul 1990; PSI-BLAST Altschul 1997); now distributed as the BLAST+ toolkit; web + local; DELTA-BLAST; Magic-BLAST for RNA-seq read alignment
  • Datasets / E-utilities API — programmatic access
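As a minimal sketch of programmatic access, the snippet below only builds an E-utilities `esearch` request URL and does not send it. The helper name `esearch_url` and the example query are illustrative assumptions; the `db`, `term`, `retmax` and `retmode` parameters follow the public E-utilities conventions, but current usage limits and API-key rules should be checked in the NCBI documentation before running real queries.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an esearch request URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: PubMed records with "AlphaFold" in the title.
url = esearch_url("pubmed", "AlphaFold[Title]", retmax=5)
print(url)
```

Separating URL construction from the network call keeps the query logic testable offline; the resulting URL can be fetched with any HTTP client.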

EMBL-EBI (European Bioinformatics Institute; Hinxton UK; directors have included Janet Thornton 2001–2015 and Ewan Birney)

  • ENA European Nucleotide Archive — INSDC partner; submission portal
  • UniProt Universal Protein Resource — collaboration EBI + SIB Swiss + PIR; Swiss-Prot ~570,000 manually annotated entries; TrEMBL ~245 million automatically annotated; UniRef clusters at 100 % / 90 % / 50 % identity; UniParc archive
  • Ensembl — genome browser + annotation pipeline; 300+ vertebrate species; Ensembl Plants / Fungi / Bacteria / Protists / Metazoa sister projects
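  • ENCODE Encyclopedia of DNA Elements (NHGRI-funded; data portal run by the ENCODE DCC at Stanford rather than by EBI; listed here for cross-reference, with data also mirrored in GEO/Ensembl) — functional annotation of human + mouse genomes; ~17,000 experiments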
  • ChEMBL — manually curated bioactivity database; ~2.5 million compounds + 18 million bioactivities + 15,000 targets
  • Europe PMC — full-text literature (PMC + biomedical preprints from bioRxiv etc.)
  • PRIDE — proteomics identifications database (mass spec); part of ProteomeXchange consortium with PeptideAtlas + MassIVE + JPOST + iProX
  • MetaboLights — metabolomics
  • EVA European Variation Archive — variation submission
  • IntAct — molecular interactions (curated); part of IMEx consortium with MINT (Roma), DIP (UCLA), HPRD (Bangalore; legacy), BioGRID (Mt Sinai/Princess Margaret)
  • InterPro — protein-domain/family signature aggregator (PROSITE + Pfam + SMART + CATH-Gene3D + PRINTS + HAMAP + PIRSF + SUPERFAMILY + TIGRFAMs + AntiFam + CDD + PANTHER); ~45,000 signatures
  • Pfam (now hosted at InterPro since 2022 retirement of standalone Pfam) — protein families; ~23,000 family HMM models (HMMER); founded by Sonnhammer Eddy Durbin; Bateman long-term lead
  • AlphaFold Database (joint with DeepMind 2021 + 2022) — 214 million predicted structures (originally 200 M in 2022; expanded to 214 M; covers ~98.5 % of UniProt)
  • Open Targets Platform — target validation; integrates evidence from genetics + drugs + literature + RNA expression (EMBL-EBI + GSK + Sanger + others)
  • Expression Atlas — bulk + scRNA-seq curated re-analysis (~5,000 experiments)
  • Single Cell Expression Atlas — single-cell focus
  • EMPIAR electron microscopy raw data archive
  • EMDB Electron Microscopy Data Bank (cryo-EM maps)
  • BioStudies — generic submission for misc; absorbed ArrayExpress in 2022

DDBJ (DNA Data Bank of Japan; NIG Mishima)

  • INSDC partner; DRA DDBJ Sequence Read Archive (mirror); JGA Japanese Genotype-phenotype Archive (controlled access); DDBJ Annotated/Assembled Sequences

Proteomics & structure

UniProt

The single most-used protein resource. Swiss-Prot entries hand-curated with sequence + function + subcellular location + PTMs + variants + disease association + GO terms + cross-references; TrEMBL entries automatic. UniProt IDs are stable accessions (e.g. P38398 = BRCA1_HUMAN, P04637 = P53_HUMAN). The proteomics community standardised on UniProt IDs for protein-level reporting. Update cycle ~8 weeks.
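A short sketch of how such accessions can be validated before they are used as join keys. The pattern follows the accession format described in UniProt's help pages as best recalled here; treat it as an assumption and confirm against the current UniProt documentation. Note that entry names such as BRCA1_HUMAN are mnemonic labels and are intentionally rejected.

```python
import re

# Accession pattern as described in UniProt's help pages (assumed current):
# 6-character accessions, plus the newer 10-character form.
UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def is_uniprot_accession(text: str) -> bool:
    """Return True for strings shaped like a UniProt accession."""
    return UNIPROT_AC.match(text) is not None


for candidate in ["P38398", "P04637", "A0A024R161", "BRCA1_HUMAN"]:
    print(candidate, is_uniprot_accession(candidate))
```

A shape check like this only confirms the format; whether an accession is current, merged or obsolete still has to be resolved against UniProt itself.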

Protein Data Bank (PDB)

Founded 1971 at Brookhaven National Laboratory; now hosted by the wwPDB worldwide PDB consortium comprising RCSB PDB (USA; Rutgers + UCSD; Helen Berman/Stephen Burley), PDBe (EBI; Sameer Velankar), PDBj (Osaka; Haruki Nakamura/Genji Kurisu), and BMRB Biological Magnetic Resonance Bank (UConn) for NMR. As of 2026: ~230,000 experimental structures; ~78 % X-ray crystallography, ~17 % cryo-EM (the fastest-growing share of new depositions), ~5 % NMR, small fraction electron diffraction + neutron. Resolution distribution: 15 % < 1.8 Å, 70 % 1.8–3 Å, 15 % > 3 Å (mostly low-resolution cryo-EM). PDB IDs are 4-character (e.g. 1HHO, 7K3G); 8-character extensions for newer entries.

  • PDB-Dev — model archive for integrative structural models (low-res + cross-link MS + SAXS hybrids)
  • EMDB Electron Microscopy Data Bank — paired cryo-EM map deposition
  • CSD Cambridge Structural Database (CCDC) — small-molecule structures (~1.2 million); separate from PDB
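The 4-character PDB identifier form mentioned above can be checked with a small sketch. It assumes the classic form is one digit from 1 to 9 followed by three alphanumeric characters, case-insensitive; the extended identifiers for newer entries are deliberately not handled, and the function name is hypothetical.

```python
import re

# Classic form only (assumption: leading digit 1-9, then three alphanumerics).
_CLASSIC_PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def is_classic_pdb_id(text: str) -> bool:
    """Return True for strings shaped like a classic 4-character PDB ID."""
    return _CLASSIC_PDB_ID.match(text) is not None


for candidate in ["1HHO", "7k3g", "HHO1", "1HH"]:
    print(candidate, is_classic_pdb_id(candidate))
```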

Predicted structures

  • AlphaFold Protein Structure Database (AlphaFold2; Jumper 2021 Nature) — 214 million predicted structures hosted at EBI; first release ~350,000 structures July 2021 → ~1 million by early 2022 → 200+ million in July 2022
  • ESM Atlas (Meta AI; Lin 2023 Science) — 617 million metagenomic protein structures predicted with ESMFold (faster but less accurate than AF2)
  • OmegaFold (HeliXon; Wu 2022) — single-sequence (no MSA) prediction
  • RoseTTAFold (Baker UW IPD; Baek 2021 Science); RoseTTAFold All-Atom 2024; RoseTTAFold Diffusion (RFdiffusion) generative protein design
  • AlphaFold-Multimer (Evans 2022) — heteromeric complexes
  • AlphaFold3 (Abramson/Jumper 2024 Nature) — generalises to proteins + small molecules + ions + nucleic acids; web-server-only initial release (code and weights later released for academic use)
  • Chai-1 (Chai Discovery 2024) — open-source AlphaFold3 alternative
  • Boltz-1 / Boltz-2 (MIT 2024/2025) — MIT-licensed protein–ligand structure + affinity prediction
  • HelixFold (Baidu / PaddlePaddle); OpenFold (Columbia; AlQuraishi); Uni-Fold (DP Technology Beijing)

Sequence-to-structure-to-function annotation

  • InterPro + CATH (UCL Orengo) + SCOP/SCOP2 (Murzin Cambridge) + ECOD Evolutionary Classification of Protein Domains (Grishin UT Southwestern)
  • PROSITE (SIB Geneva) — patterns + profiles
  • HMMER (Eddy HHMI) — profile HMM tools; underlies Pfam
  • HHsuite (Söding Munich) — HMM-HMM comparison
  • MMseqs2 (Steinegger Söding) — ultra-fast sequence search; underlies ColabFold MSAs
  • DALI (Holm Helsinki) — structure comparison
  • TM-align (Zhang Michigan; now NUS) — TM-score + structure superposition; Foldseek (Steinegger Söding 2023 Nat Biotechnol) — structure search four to five orders of magnitude faster than DALI/TM-align, using the 3Di structural alphabet
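The TM-score reported by TM-align can be sketched for a single, already-fixed superposition. The formula used is the commonly cited one, with d0 = 1.24 (L - 15)^(1/3) - 1.8 Å; the real program additionally searches over superpositions to maximise the score, which this sketch does not do. The 0.5 Å floor for short chains is an assumption made here to keep d0 positive.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (angstroms)."""
    if target_length <= 15:
        return 0.5  # assumed floor; the cube-root term is undefined here
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """Score one superposition from aligned C-alpha distances (angstroms).

    `distances` holds one value per aligned residue pair; unaligned target
    residues contribute nothing but still count in the normalisation.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


perfect = tm_score([0.0] * 100, target_length=100)
half_covered = tm_score([0.0] * 50, target_length=100)
print(perfect, half_covered)
```

Normalising by the target length rather than the aligned length is what makes the score comparable across proteins of different sizes.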

Protein-protein interaction (PPI)

  • STRING (von Mering Jensen Bork; SIB Geneva → UZH; ~12,000 organisms; protein-protein “associations” with evidence channels) — most-used PPI for hypothesis generation
  • BioGRID (Tyers Mt Sinai + Princess Margaret) — curated PPI + genetic interactions; ~2 million human PPI
  • IntAct + MINT + DIP + HPRD + InnateDB — IMEx Consortium curated
  • CORUM (Munich) — protein complexes
  • mentha (Tor Vergata) — aggregator
  • HuRI / HI-union (Vidal/CCSB Dana-Farber) — systematic Y2H interactome
  • OpenCell + HPA Cell Atlas — proximity / colocalisation
  • Reactome FI — functional interaction (computed)

Pathway databases

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Founded 1995 (Minoru Kanehisa, Kyoto University Bioinformatics Center; now GenomeNet). Comprises:

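  • KEGG PATHWAY — manually drawn molecular pathway maps; ~570 organisms × ~570 reference maps gives an upper bound of ~3 × 10⁵ organism-specific maps (order 10⁵ in practice, since not every reference map applies to every organism); covering metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, drug development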
  • KEGG GENES — ortholog clustering; KO (KEGG Orthology) identifiers (e.g. K00001 for ADH alcohol dehydrogenase)
  • KEGG ENZYME + KEGG REACTION + KEGG COMPOUND + KEGG GLYCAN + KEGG RPAIR — metabolic reactions
  • KEGG DRUG + KEGG DGROUP — drugs hierarchically classified
  • KEGG DISEASE — diseases linked to genes/pathways
  • KEGG MODULE — functional units (~1,200 modules)
  • KEGG BRITE — hierarchical classifications (taxonomy, drug classes, etc.)

Licensing changed in 2011: free web + non-commercial API, but FTP bulk + commercial use requires Pathway Solutions licence (US$15k–200k/year depending on size). This drove much community usage to Reactome.

Reactome

Founded 2003 (CSHL + EBI + OICR Toronto + NYU; Henning Hermjakob, Peter D’Eustachio, Lincoln Stein). CC-BY licensed open. ~3,000 human pathways; 19 model species via orthology. Strengths: detailed reactions with stoichiometry; physical-entity model (compartments + complexes + sets); SBGN-compatible diagrams; ELV (event-level) hierarchies; downloadable as BioPAX + SBML + SBGN-ML + PSI-MITAB + JSON + neo4j graph; web service + ContentService + ReactomeFIViz Cytoscape plug-in. Best for detailed mechanistic queries and as an alternative to KEGG that is fully open. Pathway analysis: standard over-representation via R/ReactomePA + reactome.db Bioconductor.

WikiPathways

Open community-edited (Bohler/Pico/Slenter; Maastricht + Gladstone Institutes); CC0/CC-BY licence; 1,500+ pathways covering 30+ species; integration with Cytoscape WikiPathways App + PathVisio editor; emphasis on community-curated metabolism + lipid biology.

BioCyc / MetaCyc / EcoCyc

Founded 1996 by Peter Karp at SRI International. EcoCyc is the gold-standard whole-organism database for E. coli K-12 MG1655 (~30 years of curation); MetaCyc is the multi-organism reference metabolic pathway database (~3,000 pathways across 3,000+ organisms); BioCyc is the tiered umbrella with ~20,000 pathway/genome databases (Tier 1: curated like EcoCyc; Tier 2: moderately curated; Tier 3: automated PathoLogic). Subscription required for downloads + many tier-3 PGDBs since 2018 controversy; web access remains free.

Pathway Commons

NCI-funded aggregator (Bader Lab Toronto + MSKCC; Demir Babur); integrates Reactome + WikiPathways + KEGG (subset) + PID + PANTHER + Inoh + NetPath + BIND + HumanCyc + others into BioPAX Level 3; powering ChiBE + ReactomeFIViz + Newt + cBioPortal pathway lookups.

IPA — Ingenuity Pathway Analysis

Qiagen Digital Insights (Qiagen acquired Ingenuity Systems in 2013; Thermo Fisher's 2020 bid for Qiagen did not complete). Closed commercial (typical ~US$15k–30k/yr per seat). Strong in curated upstream-regulator + causal network analysis; popular in pharma + biotech but losing share to open tools and to ML-based pathway analysis.

Other pathway / signalling resources

  • SignaLink (Korcsmáros, Budapest) — signaling cross-talk
  • Signor (Cesareni Rome; Sacchi-Cesareni 2020) — manually-curated signalling causality
  • NetPath (Pandey Hopkins/IOB Bangalore) — immune pathways (legacy)
  • PID Pathway Interaction Database (NCI 2009–2014; archived; now in Pathway Commons)
  • SMPDB Small Molecule Pathway Database (Wishart UAlberta; small-molecule context)
  • HMDB Human Metabolome Database (Wishart UAlberta) — companion metabolite info
  • PathBank (Wishart) — visual painter for all human pathways
  • OmniPath (Saez-Rodriguez Heidelberg) — meta-database aggregating ~100 sources for systems-biology modelling (used in NicheNet, LIANA, ROOTS)

Gene ontology & functional annotation

Gene Ontology (GO)

Founded 1998 (Ashburner Cherry Botstein; Ashburner 2000 Nature Genetics). Three independent ontologies (namespaces): biological process (BP; “DNA repair” GO:0006281), molecular function (MF; “kinase activity” GO:0016301), cellular component (CC; “mitochondrion” GO:0005739). ~45,000 terms with DAG (directed acyclic graph) relationships (is_a + part_of + regulates + has_part + occurs_in). GO Consortium maintains; evidence codes from EXP (experimental) through IEA (electronic, default for most TrEMBL annotations). GOA GO Annotations at EBI is the canonical annotation set for UniProt entries.
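Because GO is a DAG rather than a tree, a term can have several parents, and annotations are usually propagated to all ancestors. The sketch below walks a toy graph over `is_a` and `part_of` edges; the edges are illustrative only and are not taken from the real ontology files.

```python
from collections import deque

# Toy child -> [(relation, parent)] edges. Illustrative only, NOT the real GO graph.
TOY_EDGES = {
    "double-strand break repair": [("is_a", "DNA repair")],
    "DNA repair": [
        ("is_a", "DNA metabolic process"),
        ("part_of", "DNA damage response"),
    ],
    "DNA metabolic process": [("is_a", "biological_process")],
    "DNA damage response": [("is_a", "biological_process")],
}


def ancestors(term, edges, relations=("is_a", "part_of")):
    """Collect every term reachable upward through the chosen relation types."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in edges.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("double-strand break repair", TOY_EDGES)))
print(sorted(ancestors("double-strand break repair", TOY_EDGES, relations=("is_a",))))
```

Restricting the traversal to `is_a` gives a strict subclass view, while including `part_of` gives the broader closure that enrichment tools typically use.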

Enrichment / pathway analysis tools

  • GSEA Gene Set Enrichment Analysis (Subramanian Mootha 2005 PNAS; Broad). Pre-ranked variant + classic; uses MSigDB Molecular Signatures Database — hallmarks (H; 50 well-defined biological-process gene sets), positional (C1), curated (C2 — includes Reactome, KEGG, WikiPathways, BioCarta), regulatory motif (C3 — TF + miRNA targets), computational (C4 — cancer modules), GO (C5), oncogenic (C6), immunologic (C7), cell type (C8)
  • fgsea (Korotkevich 2021; R/Bioconductor) — fast GSEA
  • enrichplot / clusterProfiler (Yu Guangchuang Hong Kong) — R/Bioconductor; KEGG + GO + Reactome + MSigDB
  • GSEApy (Python)
  • DAVID (Huang Sherman Lempicki; NIH NIAID; 2003); free; web; older “EASE score” over-representation
  • Enrichr / Enrichr-KG (Avi Ma’ayan Lab Mt Sinai; Chen 2013; Xie 2021); free web/API; ~200 libraries
  • Metascape (Zhou 2019; Sanford-Burnham) — free; web; meta-analysis across DEG sets
  • PANTHER (PANTHERGO/PantherDB; Thomas 2003 + 2022; USC) — GO enrichment
  • g:Profiler (Reimand Tartu/OICR) — multi-organism; orthology
  • WebGestalt (Zhang Lab Baylor College of Medicine; Liao 2019)
  • STRING enrichment (network-based)
  • Cytoscape + stringApp + ClueGO + clueCharGO
  • AUCell + UCell + ssGSEA for single-cell signature scoring (Bioconductor; Aibar 2017; Andreatta 2021)
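Most of the over-representation tools listed above reduce to the same core calculation: a one-sided hypergeometric test of a query gene list against each gene set. The sketch below parses gene sets in the tab-separated GMT layout (name, description, then genes) and computes that p-value with the standard library only. Gene and set names are toy values, multiple-testing correction is omitted, and individual tools differ in background handling.

```python
from math import comb


def parse_gmt(text):
    """Parse GMT text: one set per line as name<TAB>description<TAB>gene..."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the set."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def over_representation(query, gene_sets, background):
    """Return {set name: (overlap size, p-value)} restricted to the background."""
    query = set(query) & set(background)
    results = {}
    for name, members in gene_sets.items():
        members = members & set(background)
        overlap = len(query & members)
        p = hypergeom_sf(overlap, len(background), len(members), len(query))
        results[name] = (overlap, p)
    return results


TOY_GMT = "SET_A\ttoy set\tG1\tG2\tG3\nSET_B\ttoy set\tG4\tG5\n"
BACKGROUND = {f"G{i}" for i in range(1, 11)}
results = over_representation({"G1", "G2", "G6"}, parse_gmt(TOY_GMT), BACKGROUND)
print(results)
```

The choice of background matters as much as the test itself: restricting it to genes that were actually measured usually gives more conservative p-values than using the whole genome.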

Variant + disease databases

ClinVar + ClinGen

  • ClinVar (NCBI 2013) — clinically-observed variants; ACMG/AMP-style classification (Richards 2015 — benign/likely benign/VUS/likely pathogenic/pathogenic); two-star review-status threshold for clinical use; ~3 million records
  • ClinGen Clinical Genome Resource (NIH 2013; Rehm Harvard/Broad + others) — expert panels assign gene-disease validity + actionability; ClinGen Allele Registry; Variant Curation Interface (VCI)
  • VarSome (Saphetor Geneva) — commercial wrapper for variant interpretation
  • Franklin (Genoox) — commercial variant interpretation
  • Mastermind (Genomenon) — literature variant search

Somatic / cancer variants

  • COSMIC Catalogue Of Somatic Mutations In Cancer (Sanger Institute; Forbes Bamford 2004–; Tate 2019 NAR; ~14 million somatic mutations across 1.5 million tumours)
  • cBioPortal (Cerami Schultz Solit; Memorial Sloan-Kettering Cancer Center 2012 Cancer Discovery); free; integrates ~30,000 samples across 300+ studies including all TCGA; ground-zero for “what mutations exist in cancer X”; OncoKB + OncoPrint visualisations
  • OncoKB (MSK 2017) — FDA-recognised precision-oncology knowledgebase; therapy levels (1–4 + R1/R2)
  • CIViC Clinical Interpretation of Variants in Cancer (WashU Griffith) — open community-curated; CC0; competitor to OncoKB
  • JAX-CKB Clinical Knowledgebase (Jackson Lab)
  • PMKB Precision Medicine Knowledgebase (Weill Cornell)
  • CGI Cancer Genome Interpreter (Lopez-Bigas Barcelona)

Cancer-genome consortia & atlases

  • TCGA The Cancer Genome Atlas (NCI + NHGRI 2006–2018; Collins, Hudson, Hayes leadership). 33 cancer types; ~11,000 tumour-normal pairs; published Pan-Cancer Atlas (Hoadley 2018 + 27 companion papers Cell). Data accessible at GDC Genomic Data Commons (NCI 2016–; Chicago) — successor portal hosting TCGA + TARGET (paediatric) + CPTAC (proteogenomics) + CMI + MMRF + many more
  • ICGC International Cancer Genome Consortium (2008–2019; ~22,000 tumours); PCAWG Pan-Cancer Analysis of Whole Genomes (Campbell Korbel Stein 2020 Nature); successor ICGC-ARGO ongoing
  • AACR Project GENIE Genomics Evidence Neoplasia Information Exchange — clinical-grade tumour panel data; ~165,000 samples (v15)
  • MMRF CoMMpass — multiple myeloma longitudinal
  • CCLE / DepMap — cell lines (see Cell Lines file)

Reference + population databases

  • gnomAD Genome Aggregation Database (Karczewski 2020 Nature; Broad/MacArthur). v4.1 (2024) ~730,000 exomes + 76,000 whole genomes; predecessor ExAC (Lek 2016; 60,000 exomes); allele frequency tables drive variant interpretation; “constraint” pLI + LOEUF metrics on gene-level intolerance; gnomAD-SV structural variants
  • UK Biobank (Bycroft 2018 Nature; Sudlow 2015 PLOS Med) — ~500,000 UK participants enrolled 2006–2010; whole-exome sequencing (Backman 2021) → 500k whole-genome sequencing (2023; DNAnexus + AstraZeneca + Amgen + GSK + J&J + Roche + Pfizer + Wellcome Sanger consortium); medical records + biomarkers + retinal imaging + heart + brain MRI subset (~100k); proteomics Olink 3,000 proteins (~50k participants) + Olink Explore HT 5,400 proteins (whole-cohort 2024)
  • All of Us Research Program (NIH 2018–) — >1 million US enrolled by end 2024; whole-genome subset ongoing; diversity emphasis
  • TOPMed Trans-Omics for Precision Medicine (NHLBI; ~200,000 deeply sequenced) — variant frequency resource
  • MyCode Community Health Initiative (Geisinger + Regeneron Genetics Center) — ~340,000 sequenced (DiscovEHR collaboration)
  • Estonian Biobank (UTartu; ~210,000 — 20 % of adult population)
  • SG10K_Health (Singapore) — 10,000 SE Asian genomes
  • China Kadoorie Biobank (Oxford + CAMS Beijing; 510,000 enrolled)
  • BBJ BioBank Japan (~270,000)
  • FinnGen (Finland; ~500,000 with health-registry linkage)
  • Genomics England 100,000 Genomes Project (2018 complete; NHS; transitioning to NHS Genomic Medicine Service routine sequencing)
  • Million Veteran Program (VA US; >1 million enrolled)
  • 23andMe (private; ~14 million genotyped; bankruptcy filing 2025 disrupting data access)
  • GA4GH Global Alliance for Genomics and Health — interoperability standards (Beacon + VRS variation representation + Phenopackets + DUO data use ontology + DRS data repository service + htsget streaming + RNAget + Service Registry)
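Population resources such as gnomAD feed variant interpretation mainly through allele frequency, computed as allele count (AC) divided by allele number (AN). The sketch below filters toy records by a frequency ceiling. The records, identifiers and the 0.1 % threshold are illustrative assumptions, not values from any real database or guideline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantCount:
    variant_id: str     # hypothetical identifier, not a real record
    allele_count: int   # AC: observed copies of the alternate allele
    allele_number: int  # AN: total called alleles at the site


def allele_frequency(v: VariantCount) -> Optional[float]:
    """AF = AC / AN, or None when nothing was called at the site."""
    if v.allele_number == 0:
        return None
    return v.allele_count / v.allele_number


def rare_variants(variants, max_af=0.001):
    """Keep variants with a defined frequency at or below max_af."""
    kept = []
    for v in variants:
        af = allele_frequency(v)
        if af is not None and af <= max_af:
            kept.append(v)
    return kept


TOY = [
    VariantCount("var_common", 30_000, 150_000),
    VariantCount("var_rare", 3, 150_000),
    VariantCount("var_uncalled", 0, 0),
]
print([v.variant_id for v in rare_variants(TOY)])
```

Sites with no called alleles are kept separate from rare variants on purpose: absence of data is not evidence of rarity.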

Mendelian / rare disease

  • OMIM Online Mendelian Inheritance in Man (Johns Hopkins; McKusick founded 1966 print version; web 1995); ~8,000 phenotype entries, ~17,000 genes; canonical for monogenic disease
  • Orphanet (INSERM France) — rare diseases (~6,000 listed); ORPHAcode identifiers
  • MONDO Monarch Disease Ontology — unified disease ontology mapping OMIM + Orphanet + DOID + UMLS + MedDRA + MeSH
  • HPO Human Phenotype Ontology (Robinson Köhler; Charité + JAX; ~18,000 terms) — phenotypic abnormalities; standard for clinical decision support and “phenotype-to-gene” searches
  • DECIPHER (Sanger; pathogenic CNVs + variants + phenotypes ~50,000+ patients)
  • GeneReviews (UWashington; expert-authored gene-disease reviews)
  • GenCC Gene Curation Coalition — gene-disease validity aggregator
  • PanelApp (Genomics England; Australia mirror) — gene panels for rare disease

Drug + chemistry resources

Database | Owner | Scope | Notes
DrugBank | Wishart UAlberta + OMx Personal Health Analytics | ~16,000 drug entries with targets, mechanism, ADME | Subscription for commercial / bulk (US$3k–50k+); free for academic
ChEMBL | EBI | ~2.5 million compounds, 18 million bioactivities, 15,000 targets | Free + Creative Commons; gold-standard for SAR data
PubChem | NLM NIH | 115 million compounds, 300 million substances, 1.5 million bioassays | Free open
ChemSpider | RSC Royal Society of Chemistry | 100 million+ structures | Free open
Reaxys | Elsevier | ~80 million reactions, 200 million substances | Commercial premium ($15k–100k/seat)
SciFinderⁿ | CAS Chemical Abstracts Service | All CAS RN; reactions; bioactivity | Commercial premium
DGIdb Drug-Gene Interaction Database | WashU McDonnell | Aggregator of 30+ sources for gene → drug interactions | Free
Guide to Pharmacology (GtoPdb) | IUPHAR + BPS | Curated pharmacology of ~1,800 protein targets + ~10,000 ligands | Free; complementary to ChEMBL with editorial curation
STITCH | von Mering Bork EBI | Chemical-protein interactions (~500,000 chemicals × 9.6 million proteins) | Free
BindingDB | UCSD/Skaggs (Gilson) | Measured binding affinities (~2.7 million) | Free
ChEMBL-NTD | EBI | Neglected tropical disease | Free
NPASS | Northeast Forestry U | Natural products | Free
COCONUT | Jena/Steinbeck | Natural products consolidated | Free
ZINC22 | UCSF Irwin/Shoichet | ~37 billion purchasable compounds (mostly Enamine REAL) | Free; the go-to for ultra-large virtual screening
Enamine REAL Space | Enamine | ~50 billion virtual; ~5 billion synthesisable | Industry premium (catalog ~US$50–500/compound 5 mg)
WuXi GalaXi | WuXi LabNetwork | ~12 billion | Vendor space
OTAVA / ChemDiv / Maybridge / Asinex / Specs | various | hundreds of K to few M each | Vendor screening libraries

ML / AI for biology — foundation models + key tools

Structure prediction & design

  • AlphaFold2 Jumper 2021 Nature (Nobel Chemistry 2024 Hassabis Jumper; co-laureate Baker for protein design) — MSA-based; Evoformer + structure module + 3 recycling iterations
  • AlphaFold-Multimer Evans 2022 — heteromer support
  • AlphaFold3 Abramson 2024 Nature — joint protein + ligand + ion + NA prediction; Pairformer + diffusion module; web-only initial release → weights released late 2024 (academic use)
  • ESM2 + ESMFold (Lin Meta AI 2023 Science) — single-sequence; ESM-IF inverse folding; ESM-3 (Hayes EvolutionaryScale 2024)
  • RoseTTAFold + RoseTTAFold2 + RFdiffusion + RFantibody (Baker UW IPD) — diffusion-based design; Baker won 2024 Chemistry Nobel
  • Chai-1 Chai Discovery 2024 (open-weights AlphaFold3 alternative)
  • Boltz-1 / Boltz-2 MIT 2024–2025 (MIT-licensed; co-structure + binding affinity)
  • HelixFold Baidu PaddlePaddle; OmegaFold Wu HeliXon (no MSA); OpenFold Columbia AlQuraishi (open AF2 reimplementation); Uni-Fold DP Tech
  • ColabFold Mirdita Schütze Moriwaki Heo Ovchinnikov Steinegger 2022 Nat Methods — MMseqs2-based MSA + AF2/RoseTTAFold pipeline; democratised structure prediction (~10 minutes/Colab)

Protein language models

  • ProtBert / ProtT5 (Elnaggar Rost TUM 2021)
  • ESM2 (Meta) — 15B-parameter PLM at top end
  • ESM-IF (Hsu) inverse folding
  • ProteinMPNN (Baker UW; Dauparas 2022 Science) — message-passing sequence design from structure
  • AbLang (Olsen Boomsma Deane 2022) — antibody language model
  • IgLM (Shuai Ruffolo Gray 2023) — antibody generative
  • OpenFold + UniRef50 / UniRef90 training data
  • AlphaMissense (Cheng Avsec Velankar Jumper 2023 Science) — missense variant pathogenicity prediction; ~89 % of 71 million possible missense variants classified as likely benign or likely pathogenic, remainder ambiguous

DNA / RNA / single-cell

  • Nucleotide Transformer (InstaDeep + NVIDIA 2024) — 2.5B-parameter genomic LM trained on 850 genomes
  • HyenaDNA (Stanford 2023) — long-range up to 1M-bp context; Hyena operator as an attention alternative
  • Caduceus (Cornell + Princeton + CMU 2024) — bi-directional, reverse-complement-equivariant; Mamba backbone
  • Evo (Arc Institute 2024 Science; Nguyen Poli Re) — 7B trained on 2.7M prokaryotic + phage genomes; generative
  • Evo2 (Arc 2025) — 40B + 9.3T-token training
  • DNABERT + DNABERT-2 (Northwestern 2024)
  • Geneformer (Theodoris Ellinor 2023 Nature; MGB; pretrained on Genecorpus-30M single-cell) — gene-level tokens
  • scGPT (Bo Wang Toronto 2023) — single-cell foundation model
  • scFoundation (Hao Tsinghua 2023); GeneCompass (Beijing 2024)
  • scvi-tools (Yosef Berkeley/Weizmann; Lopez 2018 deep generative for scRNA-seq) — most-used scRNA-seq integration toolkit; scVI + totalVI + scANVI + DestVI + cell2location
  • CellTypist (Teichmann Sanger; Domínguez-Conde 2022) — automated cell-type annotation
  • CellxGene (CZI) — single-cell atlas browser + Census API
  • HCA Human Cell Atlas (Regev Teichmann founders 2017–; ~100M+ cells planned; ~50M cataloged)
  • Tabula Sapiens / Tabula Muris (Quake Stanford) — multi-organ scRNA-seq atlases
  • BioNeMo NVIDIA — GPU framework + pretrained models
  • NVIDIA NIM microservices for protein design

Bio-foundation model companies / labs

  • DeepMind Isomorphic Labs (Hassabis Jumper) — AlphaFold + AI drug design (Lilly + Novartis partnerships)
  • EvolutionaryScale (Rives Hayes spinoff from Meta FAIR 2024) — ESM3 multimodal
  • Iambic Therapeutics (NeuralPLexer; structure-based generative drug design; IAM1363 clinical)
  • Variational AI (Vancouver; ENKI generative)
  • Insilico Medicine (Zhavoronkov; PandaOmics + Chemistry42 + InClinico; ISM001-055 idiopathic pulmonary fibrosis Phase II)
  • Recursion (Gibson; HCS phenomics + ML; acquired Exscientia 2024 ~US$700M)
  • Genesis Therapeutics (Lab; tilted molecule design)
  • Atomic AI (Townshend; RNA structure)
  • Cradle Bio + Profluent Bio (Madani) — protein engineering services
  • Generate Biomedicines (Flagship; deep-learning protein design; multiple oncology + immunology assets)
  • Absci (de-novo antibody discovery)
  • BigHat Biosciences (antibody optimisation)
  • PostEra (medicinal chemistry; Manifold platform)
  • Valence Labs (Mila + Recursion; small-molecule)
  • Chai Discovery (Chai-1 open; co-fold + affinity)
  • Cyrus Bio (Rosetta-derived design)

Imaging + spatial

  • CellProfiler (Carpenter Broad) — image-based phenotyping
  • DeepCell + Mesmer (Van Valen Caltech) — segmentation
  • Cellpose (Stringer Pachitariu HHMI Janelia) — generalist segmentation
  • Stardist (Schmidt Weigert) — star-convex polygon segmentation
  • squidpy (Theis Helmholtz Munich) — spatial transcriptomics analysis (Visium + Xenium + MERFISH + CosMx + Stereo-seq)
  • cell2location (Kleshchevnikov Bayraktar Teichmann 2022 Nat Biotechnol) — spatial cell-type mapping
  • MoNuSeg / PanNuke / Lizard — pathology nucleus datasets
  • Foundation pathology: CTransPath (Wang 2022); UNI (Mahmood Lab 2024 Nat Med); Virchow (Paige.AI 2024); Prov-GigaPath (Microsoft + Providence 2024 Nature); PRISM (Paige); PLUTO (Path-AI); PathChat (Mahmood/Vasquez)

Pipeline + workflow tools

Tool | Author | Strengths
Snakemake | Köster 2012; Univ Duisburg-Essen | Pythonic; popular in academic genomics; YAML config + Conda integration
Nextflow | Di Tommaso 2017 CRG Barcelona; Seqera Labs commercial | Channel + dataflow paradigm; cloud-portable (AWS Batch, Google Batch, Kubernetes, Azure Batch); de facto pharma standard
nf-core | Community Nextflow pipeline collection (~100 pipelines; rnaseq, sarek, ampliseq, fetchngs, mag, etc.); rigorous CI + best-practices template; supported by Seqera | Production-grade community workflows
WDL Workflow Description Language | OpenWDL Broad/Cromwell engine | GA4GH-endorsed; runs on Cromwell + miniWDL; Terra preferred
CWL Common Workflow Language | Amstutz 2016; CWL community | Container-portable; older but solid
Galaxy | Goecks Nekrutenko Taylor 2010 Penn State (Hopkins move) | GUI-first; pedagogical + production; usegalaxy.org + .eu + .au
Terra | Broad + Verily | Cloud workspace + WDL/Cromwell; primary AnVIL platform
DNAnexus | DNAnexus (founded 2009; Stanford spinoff) | Commercial cloud platform; AstraZeneca + Regeneron + UKBB RAP partner
Seven Bridges Genomics / Cancer Genomics Cloud (CGC) | Velsera (formed 2023 merger Seven Bridges + Pierian + UgenTec) | NCI CGC official platform
Latch.Bio | YC alumnus; Tisza | DataFrame + modern UI
Form Bio | Colossal Biosciences spin-out | Workflow + visualization
Argo Workflows / Airflow / Prefect / Dagster | CNCF / Apache / Prefect / Elementl | General-purpose orchestration adopted by some bio teams
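All of the workflow engines above share one core idea: steps declare their inputs, and the engine derives a valid execution order from the resulting dependency graph. A minimal sketch of that idea using only the Python standard library (`graphlib`, Python 3.9+) is shown below. The step names are hypothetical, and real engines add caching, containers, retries and parallel scheduling on top.

```python
from graphlib import TopologicalSorter

# Hypothetical RNA-seq steps; each key maps to the steps it depends on.
PIPELINE = {
    "fastqc": set(),
    "trim": {"fastqc"},
    "align": {"trim"},
    "count": {"align"},
    "multiqc": {"fastqc", "count"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(PIPELINE).static_order())
print(order)
```

A cyclic dependency raises `graphlib.CycleError`, which mirrors the way workflow engines refuse to run a pipeline whose rules depend on each other in a loop.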

Bioconductor + Python ecosystem

  • Bioconductor (Gentleman Carey Huber 2003; R-based; ~2,300 packages 2024) — DESeq2 (Love Anders Huber; differential expression); edgeR (Robinson Smyth McCarthy); limma (Smyth voom); GenomicRanges + IRanges + S4Vectors infrastructure; SummarizedExperiment + SingleCellExperiment data classes; ChIPseeker; ChIPQC; Gviz; karyoploteR; minfi (methylation); maftools; ComplexHeatmap (Gu); Seurat-bridge SeuratObject + sceasy
  • Seurat (Satija NYGC; v5 2024) — R single-cell flagship
  • Scanpy (Wolf Theis 2018) — Python single-cell flagship; anndata + scverse stack
  • scverse (Heumos Theis 2023) — Python single-cell ecosystem: anndata, scanpy, scvi-tools, squidpy, MUON multimodal, decoupler, dynamo
  • Galaxy Training Network GTN — curated tutorials
  • Biopython (Cock 2009)
  • Python/scverse data infrastructure: AnnData (Wolf + Strobl); MuData; Zarr chunked; OME-Zarr for imaging
  • PyTorch + JAX + scikit-learn + Keras + Hugging Face Transformers — ML
  • Polars + Pandas + Dask for tabular
  • DVC + MLflow + Weights & Biases + Aim — experiment tracking

Reproducibility, repositories, preprints

  • Zenodo (CERN OpenAIRE; ~3 million records; up to 50 GB per record; DOIs)
  • Dryad (~70k datasets; partnered with Zenodo to host associated software and supplementary files)
  • Figshare (Digital Science)
  • OSF Open Science Framework (Center for Open Science)
  • Code Ocean (compute capsules + Docker)
  • Whole Tale + Binder + mybinder.org — reproducible computational environments
  • Renku (SDSC Lausanne)
  • DataLad (Halchenko Hanke) — git-annex-based data versioning
  • Synapse / Sage Bionetworks — collaborative biomedical projects (DREAM Challenges, ADKnowledgePortal)
  • PhysioNet — physiologic signals + MIMIC-IV ICU data
  • bioRxiv (CSHL; ~400k preprints 2024); medRxiv (CSHL + BMJ + Yale; ~80k); chemRxiv (ACS + RSC + GDCh + ChemSoc Japan); arXiv q-bio (Cornell); ResearchSquare (Springer)
  • GitHub + GitLab — code (Bioconda for binary distribution)
  • Docker Hub / quay.io BioContainers / Singularity / Apptainer — containerised tools (Conda + Bioconda + Mamba dominant for Python/R bioinformatics)
  • Conda-forge + Bioconda (BioConda 2018 Grüning Köster; ~10,000 bioinformatics packages)
  • EDAM Ontology for bioinformatics tool/data classification
  • bio.tools registry (~25,000 tools)

Standards & FAIR-data anchors

  • FAIR Principles (Wilkinson 2016 Sci Data) — findable + accessible + interoperable + reusable
  • GA4GH standards (Beacon, VRS, Phenopackets, DUO, DRS, htsget, RNAget, Service Registry, Crypt4GH)
  • MIAME / MINSEQE / MIAPE — minimum information about experiment standards
  • SBML (Systems Biology Markup Language; Hucka 2003) for kinetic models
  • SBGN (Systems Biology Graphical Notation; Le Novère 2009) for diagrams
  • BioPAX (Bader Demir 2010) for pathway exchange
  • MIAxE / FAIRsharing.org — schemas registry
  • CITE-seq / SHARE-seq / Multiome / SPLiT-seq / Slide-seq / GeoMx / CosMx / Visium HD / Xenium / MERFISH / Stereo-seq / DBiT-seq — common single-cell / spatial assay protocols (referenced in many pipelines)

Adjacent