Pathway Databases & Bioinformatics Resources
This Tier 3 family index catalogues the public + commercial knowledge bases that practising computational biologists rely on, ordered by content type: sequence → structure → pathway → ontology → variant/disease → drug/chemistry → ML/foundation models → pipeline tools → reproducibility. Numerical scale figures are mid-2026 unless noted. SI units throughout (storage in B/kB/MB/GB/TB/PB; sequence size in bp; protein in residues or Da).
Sequence databases — the INSDC and friends
The International Nucleotide Sequence Database Collaboration (INSDC) is the foundational three-way mirror of all public DNA/RNA sequence: GenBank (NCBI USA) ⇄ ENA (EBI Europe) ⇄ DDBJ (Japan). Submissions to any node propagate to all three nightly. Total holdings as of 2026: ~5 × 10¹² bp (5 Tbp) in core assembled sequence; ~5 × 10¹⁶ bp (50 Pbp) in raw read archives. Identifiers across all three are interchangeable (accession.version format).
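As a small illustration of the accession.version convention, the sketch below splits an identifier into its accession and integer version. The regular expression is a simplifying assumption (letters, digits and underscores before the final dot), not the official INSDC grammar, and `split_accession` is a hypothetical helper name.

```python
import re

# Simplified pattern: accession, a dot, then an integer version.
# Illustrative assumption only, not the official INSDC accession grammar.
_ACC_VER = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)$")


def split_accession(identifier: str) -> tuple[str, int]:
    """Return (accession, version) for an 'accession.version' string."""
    match = _ACC_VER.match(identifier.strip())
    if match is None:
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    # U49845 is the record used in the GenBank sample-record documentation.
    print(split_accession("U49845.1"))
```

Keeping the version as an integer makes it easy to detect when a cached record has been superseded by a later version of the same accession.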
NCBI (National Center for Biotechnology Information; NLM/NIH; Bethesda MD; established 1988; David Lipman director 1989–2017)
- GenBank — primary nucleotide repository; ~270 million sequence records; INSDC partner
- RefSeq — curated reference sequences (NCBI staff curation); ~250,000 organisms with at least one RefSeq; assembly accessions GCF_*
- GEO Gene Expression Omnibus — ~250,000 series (GSE), ~7 million samples (GSM); microarray + RNA-seq + ChIP-seq + scRNA-seq; EBI counterpart is ArrayExpress (now housed in BioStudies)
- SRA Sequence Read Archive — raw reads from sequencing instruments; ~50 PB as of 2025 (5 × 10¹⁶ bytes); cloud mirrors AWS Open Data + GCP + Azure
- dbSNP — 1 billion+ variants (mostly SNPs and small indels); reference RSIDs (rs prefix)
- dbVar — structural variants ≥ 50 bp
- ClinVar — clinically annotated variants; ~3 million records; five-tier classification (B/LB/VUS/LP/P) with star-rated review level
- dbGaP — controlled-access genotype/phenotype (US NIH project data; biobank summary statistics)
- BioProject + BioSample — metadata anchors linking submissions
- PubChem — small-molecule chemistry (NLM; ~115 million compounds; 800+ data sources); bioassays + patent + bioactivity + literature
- PubMed — citation/abstract database; ~36 million records (2024); MeSH-indexed
- PMC PubMed Central — full-text open-access (~10 million articles)
- MedGen — phenotype + disease terms cross-mapped to ClinVar, OMIM, MONDO, HPO
- Taxonomy — NCBI Taxonomy Database (formal arbiter for sequence-record organism)
- Genome — assembly hub; current human reference GRCh38.p14 (2022) + T2T-CHM13v2.0 Telomere-to-Telomere 2022 (Nurk 2022 Science)
- BLAST — Basic Local Alignment Search Tool (Altschul 1990 + 1997 PSI-BLAST); now BLAST+ toolkit; web + local; DELTA-BLAST + magicblast for RNA-seq read alignment
- Datasets / E-utilities API — programmatic access
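For the programmatic access mentioned above, a minimal sketch of composing an E-utilities EFetch request is shown here. It only builds the URL and sends nothing over the network; `efetch_url` is a hypothetical helper, and the parameter set (`db`, `id`, `rettype`, `retmode`) covers just the common case, so the current E-utilities documentation should be checked for rate limits and API-key handling.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, ids: list[str], rettype: str = "fasta",
               retmode: str = "text") -> str:
    """Build (but do not send) an EFetch URL for one or more record IDs."""
    query = urlencode({
        "db": db,             # Entrez database name, e.g. "nuccore"
        "id": ",".join(ids),  # comma-separated UIDs or accession.version IDs
        "rettype": rettype,
        "retmode": retmode,
    })
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    print(efetch_url("nuccore", ["U49845.1"]))
```

Separating URL construction from the HTTP call keeps the request easy to log and test before any traffic is sent to NCBI.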
EMBL-EBI (European Bioinformatics Institute; Hinxton UK; Janet Thornton director 2001–2015; Ewan Birney director since 2015)
- ENA European Nucleotide Archive — INSDC partner; submission portal
- UniProt Universal Protein Resource — collaboration EBI + SIB Swiss + PIR; Swiss-Prot ~570,000 manually annotated entries; TrEMBL ~245 million automatically annotated; UniRef clusters at 100 % / 90 % / 50 % identity; UniParc archive
- Ensembl — genome browser + annotation pipeline; 300+ vertebrate species; Ensembl Plants / Fungi / Bacteria / Protists / Metazoa sister projects
- ENCODE Encyclopedia of DNA Elements (NHGRI-funded; portal run by the data coordination center at Stanford rather than by EBI) — functional annotation of human + mouse genomes; ~17,000 experiments
- ChEMBL — manually curated bioactivity database; ~2.5 million compounds + 18 million bioactivities + 15,000 targets
- Europe PMC — full-text literature (PMC + biomedical preprints from bioRxiv etc.)
- PRIDE — proteomics identifications database (mass spec); part of ProteomeXchange consortium with PeptideAtlas + MassIVE + JPOST + iProX
- MetaboLights — metabolomics
- EVA European Variation Archive — variation submission
- IntAct — molecular interactions (curated); part of IMEx consortium with MINT (Roma), DIP (UCLA), HPRD (Bangalore legacy), BioGRID (Mt Sinai/Princess Margaret)
- InterPro — protein-domain/family signature aggregator (PROSITE + Pfam + SMART + CATH-Gene3D + PRINTS + HAMAP + PIRSF + SUPERFAMILY + TIGRFAMs + AntiFam + CDD + PANTHER); ~45,000 signatures
- Pfam (now hosted at InterPro since 2022 retirement of standalone Pfam) — protein families; ~23,000 family HMM models (HMMER); Sonnhammer Bateman Eddy founders
- AlphaFold Database (joint with DeepMind 2021 + 2022) — 214 million predicted structures (originally 200 M in 2022; expanded to 214 M; covers ~98.5 % of UniProt)
- Open Targets Platform — target validation; integrates evidence from genetics + drugs + literature + RNA expression (EMBL-EBI + GSK + Sanger + others)
- Expression Atlas — bulk + scRNA-seq curated re-analysis (~5,000 experiments)
- Single Cell Expression Atlas — single-cell focus
- EMPIAR electron microscopy raw data archive
- EMDB Electron Microscopy Data Bank (cryo-EM maps)
- BioStudies — generic submission for misc; absorbed ArrayExpress in 2022
DDBJ (DNA Data Bank of Japan; NIG Mishima)
- INSDC partner; DRA DDBJ Sequence Read Archive (mirror); JGA Japanese Genotype-phenotype Archive (controlled access); DDBJ Annotated/Assembled Sequences
Proteomics & structure
UniProt
The single most-used protein resource. Swiss-Prot entries hand-curated with sequence + function + subcellular location + PTMs + variants + disease association + GO terms + cross-references; TrEMBL entries automatic. UniProt IDs are stable accessions (e.g. P38398 = BRCA1_HUMAN, P04637 = P53_HUMAN). The proteomics community standardised on UniProt IDs for protein-level reporting. Update cycle ~8 weeks.
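Because UniProt accessions are used as join keys across so many resources, a quick format check is often worthwhile before a lookup. The sketch below uses the accession pattern given in the UniProt help pages, reproduced here from memory, so it should be verified against the current documentation; `is_uniprot_accession` is a hypothetical helper name. A match only means the string is well formed, not that the entry exists.

```python
import re

# Accession pattern as described in the UniProt help pages (verify before relying on it).
_UNIPROT_ACC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """True if text is shaped like a 6- or 10-character UniProt accession."""
    return _UNIPROT_ACC.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("P38398", "P04637", "BRCA1_HUMAN"):
        print(candidate, is_uniprot_accession(candidate))
```

Entry names such as BRCA1_HUMAN are deliberately rejected: they are mnemonic labels that can change, whereas the accession is the stable identifier.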
Protein Data Bank (PDB)
Founded 1971 at Brookhaven National Laboratory; now hosted by the wwPDB worldwide PDB consortium comprising RCSB PDB (USA; Rutgers + UCSD; Helen Berman/Stephen Burley), PDBe (EBI; Sameer Velankar), PDBj (Osaka; Haruki Nakamura/Genji Kurisu), BMRB Biological Magnetic Resonance Bank (UConn) for NMR. As of 2026: ~230,000 experimental structures; ~78 % X-ray crystallography, ~17 % cryo-EM (fastest-growing — overtook X-ray in new depositions ~2022), ~5 % NMR, small fraction electron diffraction + neutron. Resolution distribution: 15 % < 1.8 Å, 70 % 1.8–3 Å, 15 % > 3 Å (most low-res cryo-EM). PDB IDs are 4-character (e.g. 1HHO, 7K3G); extended IDs of the form pdb_0000XXXX (a pdb_ prefix plus 8 characters) are being introduced for newer entries.
- PDB-Dev — model archive for integrative structural models (low-res + cross-link MS + SAXS hybrids)
- EMDB Electron Microscopy Data Bank — paired cryo-EM map deposition
- CSD Cambridge Structural Database (CCDC) — small-molecule structures (~1.2 million); separate from PDB
Predicted structures
- AlphaFold Protein Structure Database (AlphaFold2; Jumper 2021 Nature) — 214 million predicted structures hosted at EBI; first version ~350,000 structures July 2021 → ~1 million in early 2022 → 200+ million from July 2022
- ESM Atlas (Meta AI; Lin 2023 Science) — 617 million metagenomic protein structures predicted with ESMFold (faster but less accurate than AF2)
- OmegaFold (HeliXon; Wu 2022) — single-sequence (no MSA) prediction
- RoseTTAFold (Baker UW IPD; Baek 2021 Science); RoseTTAFold All-Atom 2024; RoseTTAFold Diffusion (RFdiffusion) generative protein design
- AlphaFold-Multimer (Evans 2022) — heteromeric complexes
- AlphaFold3 (Abramson/Jumper 2024 Nature) — generalises to proteins + small molecules + ions + nucleic acids; web-only initial release (later open-weighted)
- Chai-1 (Chai Discovery 2024) — open-source AlphaFold3 alternative
- Boltz-1 / Boltz-2 (MIT 2024/2025) — MIT-licensed protein–ligand structure + affinity prediction
- HelixFold (Baidu / PaddlePaddle); OpenFold (Columbia; AlQuraishi); Uni-Fold (DP Technology Beijing)
Sequence-to-structure-to-function annotation
- InterPro + CATH (UCL Orengo) + SCOP/SCOP2 (Murzin Cambridge) + ECOD Evolutionary Classification of Protein Domains (Grishin UT Southwestern)
- PROSITE (SIB Geneva) — patterns + profiles
- HMMER (Eddy HHMI) — profile HMM tools; underlies Pfam
- HHsuite (Söding Munich) — HMM-HMM comparison
- MMseqs2 (Steinegger Söding) — ultra-fast sequence search; underlies ColabFold MSAs
- DALI (Holm Helsinki) — structure comparison
- TM-align (Zhang Michigan; now NUS Singapore) — TM-score + structure superposition; Foldseek (Steinegger Söding 2023 Nat Biotechnol) — structure search four to five orders of magnitude faster than DALI/TM-align, using the structural alphabet 3Di
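The TM-score named in the list above can be written out directly once residue pairs are aligned and superposed. The sketch below follows the standard definition (Zhang and Skolnick 2004), in which each aligned pair at distance d contributes 1 / (1 + (d / d0)^2) and the sum is normalised by the target length. It does not perform the superposition search that TM-align carries out, and the lower clamp on d0 for very short chains is an assumption made here to keep the scale positive.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 in angstroms."""
    if target_length <= 15:
        return 0.5  # assumed floor for very short chains
    # Floor at 0.5 is an assumption; the raw formula goes negative for short chains.
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances: list[float], target_length: int) -> float:
    """TM-score from per-pair distances (angstroms) of already superposed residues.

    Unaligned target residues contribute zero, so the list may be shorter
    than target_length.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    print(round(tm_d0(100), 3))
    print(tm_score([0.0] * 100, 100))
```

Normalising by the target length rather than the number of aligned pairs is what makes the score penalise partial alignments.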
Protein-protein interaction (PPI)
- STRING (von Mering Jensen Bork; SIB Geneva → UZH; ~12,000 organisms; protein-protein “associations” with evidence channels) — most-used PPI for hypothesis generation
- BioGRID (Tyers Mt Sinai + Princess Margaret) — curated PPI + genetic interactions; ~2 million human PPI
- IntAct + MINT + DIP + HPRD + InnateDB — IMEx Consortium curated
- CORUM (Munich) — protein complexes
- mentha (Tor Vergata) — aggregator
- HuRI / HI-union (Vidal/CCSB Dana-Farber) — systematic Y2H interactome
- OpenCell + HPA Cell Atlas — proximity / colocalisation
- Reactome FI — functional interaction (computed)
Pathway databases
KEGG (Kyoto Encyclopedia of Genes and Genomes)
Founded 1995 (Minoru Kanehisa, Kyoto University Bioinformatics Center; now GenomeNet). Comprises:
- KEGG PATHWAY — manually drawn molecular pathway maps; ~570 organisms × ~570 reference maps ≈ 3 × 10⁵ organism-specific maps as an upper bound (not every reference map applies to every organism); covering metabolism, genetic info, environmental info, cellular processes, organismal systems, human diseases, drug development
- KEGG GENES — ortholog clustering; KO (KEGG Orthology) identifiers (e.g. K00001 for ADH alcohol dehydrogenase)
- KEGG ENZYME + KEGG REACTION + KEGG COMPOUND + KEGG GLYCAN + KEGG RPAIR — metabolic reactions
- KEGG DRUG + KEGG DGROUP — drugs hierarchically classified
- KEGG DISEASE — diseases linked to genes/pathways
- KEGG MODULE — functional units (~1,200 modules)
- KEGG BRITE — hierarchical classifications (taxonomy, drug classes, etc.)
Licensing changed in 2011: web access and the non-commercial API remain free, but FTP bulk download and commercial use require a Pathway Solutions licence (US$15k–200k/year depending on size). This drove much community usage to Reactome.
Reactome
Founded 2003 (CSHL + EBI + OICR Toronto + NYU; Henning Hermjakob, Peter D’Eustachio, Lincoln Stein). CC-BY licensed open. ~3,000 human pathways; 19 model species via orthology. Strengths: detailed reactions with stoichiometry; physical-entity model (compartments + complexes + sets); SBGN-compatible diagrams; ELV (event-level) hierarchies; downloadable as BioPAX + SBML + SBGN-ML + PSI-MITAB + JSON + neo4j graph; web service + ContentService + ReactomeFIViz Cytoscape plug-in. Best for detailed mechanistic queries and as an alternative to KEGG that is fully open. Pathway analysis: standard over-representation via R/ReactomePA + reactome.db Bioconductor.
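The over-representation analysis mentioned above is, at its core, a one-sided hypergeometric test: given a gene universe, a pathway of known size and a selected gene list, how surprising is the observed overlap? The sketch below is a stdlib-only illustration of that test, not the ReactomePA implementation; `ora_pvalue` is a hypothetical name, and real tools additionally apply multiple-testing correction across pathways.

```python
from math import comb


def ora_pvalue(universe: int, pathway: int, selected: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe: number of background genes
    pathway:  number of background genes annotated to the pathway
    selected: size of the query gene list (drawn from the background)
    overlap:  query genes that fall in the pathway
    """
    if not (0 <= pathway <= universe and 0 <= selected <= universe):
        raise ValueError("pathway and selected must lie within the universe")
    upper = min(pathway, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(max(overlap, 0), upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers chosen for illustration only.
    print(ora_pvalue(universe=10, pathway=4, selected=3, overlap=2))
```

The choice of universe matters as much as the test itself: using all annotated genes instead of the genes actually measurable in the experiment tends to inflate significance.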
WikiPathways
Open community-edited (Bohler/Pico/Slenter; Maastricht + Gladstone Institutes); CC0/CC-BY licence; 1,500+ pathways covering 30+ species; integration with Cytoscape WikiPathways App + PathVisio editor; emphasis on community-curated metabolism + lipid biology.
BioCyc / MetaCyc / EcoCyc
Founded 1996 by Peter Karp at SRI International. EcoCyc is the gold-standard whole-organism database for E. coli K-12 MG1655 (~30 years of curation); MetaCyc is the multi-organism reference metabolic pathway database (~3,000 pathways across 3,000+ organisms); BioCyc is the tiered umbrella with ~20,000 pathway/genome databases (Tier 1: curated like EcoCyc; Tier 2: moderately curated; Tier 3: automated PathoLogic). Subscription required for downloads + many tier-3 PGDBs since 2018 controversy; web access remains free.
Pathway Commons
NCI-funded aggregator (Bader Lab Toronto + MSKCC; Demir Babur); integrates Reactome + WikiPathways + KEGG (subset) + PID + PANTHER + Inoh + NetPath + BIND + HumanCyc + others into BioPAX Level 3; powering ChiBE + ReactomeFIViz + Newt + cBioPortal pathway lookups.
IPA — Ingenuity Pathway Analysis
Qiagen Digital Insights (Qiagen acquired Ingenuity Systems in 2013; Thermo Fisher's 2020 bid for Qiagen lapsed). Closed commercial (typical ~US$15k–30k/yr per seat). Strong in curated upstream-regulator + causal network analysis; popular in pharma + biotech but losing share to open tools and to ML-based pathway analysis.
Other pathway / signalling resources
- SignaLink (Korcsmáros, Budapest) — signaling cross-talk
- Signor (Cesareni Rome; Sacchi-Cesareni 2020) — manually-curated signalling causality
- NetPath (Pandey Hopkins/IOB Bangalore) — immune pathways (legacy)
- PID Pathway Interaction Database (NCI 2009–2014; archived; now in Pathway Commons)
- SMPDB Small Molecule Pathway Database (Wishart UAlberta; small-molecule context)
- HMDB Human Metabolome Database (Wishart UAlberta) — companion metabolite info
- PathBank (Wishart) — visual painter for all human pathways
- OmniPath (Saez-Rodriguez Heidelberg) — meta-database aggregating ~100 sources for systems-biology modelling (used in NicheNet, LIANA, ROOTS)
Gene ontology & functional annotation
Gene Ontology (GO)
Founded 1998 (Ashburner Cherry Botstein; Ashburner 2000 Nature Genetics). Three independent ontologies (namespaces): biological process (BP; “DNA repair” GO:0006281), molecular function (MF; “kinase activity” GO:0016301), cellular component (CC; “mitochondrion” GO:0005739). ~45,000 terms with DAG (directed acyclic graph) relationships (is_a + part_of + regulates + has_part + occurs_in). GO Consortium maintains; evidence codes from EXP (experimental) through IEA (electronic, default for most TrEMBL annotations). GOA GO Annotations at EBI is the canonical annotation set for UniProt entries.
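Because GO is a DAG, an annotation to a specific term implies annotation to every ancestor of that term, and enrichment tools rely on this propagation. The sketch below shows the idea on a toy graph; the term labels are hypothetical placeholders rather than real GO identifiers, and every edge is treated alike, whereas real propagation depends on the relationship type (for example is_a and part_of are usually followed, regulates usually is not).

```python
def ancestors(term: str, parents: dict[str, set[str]]) -> set[str]:
    """All terms reachable from `term` by following parent edges."""
    seen: set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations: dict[str, set[str]],
              parents: dict[str, set[str]]) -> dict[str, set[str]]:
    """Expand each gene's direct terms to include all ancestor terms."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


# Toy DAG with hypothetical labels: child -> set of parents.
TOY_PARENTS = {
    "repair": {"dna_metabolism", "damage_response"},
    "dna_metabolism": {"process_root"},
    "damage_response": {"process_root"},
}

if __name__ == "__main__":
    print(propagate({"geneX": {"repair"}}, TOY_PARENTS))
```

A term with two parents, as in the toy graph, is exactly what distinguishes a DAG from a tree and is why a visited set is needed during traversal.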
Enrichment / pathway analysis tools
- GSEA Gene Set Enrichment Analysis (Subramanian Mootha 2005 PNAS; Broad). Pre-ranked variant + classic; uses MSigDB Molecular Signatures Database — hallmarks (H; 50 well-defined biological-process gene sets), positional (C1), curated (C2 — includes Reactome, KEGG, WikiPathways, BioCarta), regulatory motif (C3 — TF + miRNA targets), computational (C4 — cancer modules), GO (C5), oncogenic (C6), immunologic (C7), cell type (C8)
- fgsea (Korotkevich 2021; R/Bioconductor) — fast GSEA
- enrichplot / clusterProfiler (Yu Guangchuang Hong Kong) — R/Bioconductor; KEGG + GO + Reactome + MSigDB
- GSEApy (Python)
- DAVID (Huang Sherman Lempicki; NIH NIAID; 2003); free; web; older “EASE score” over-representation
- Enrichr / Enrichr-KG (Avi Ma’ayan Lab Mt Sinai; Chen 2013; Xie 2021); free web/API; ~200 libraries
- Metascape (Zhou 2019; Sanford-Burnham) — free; web; meta-analysis across DEG sets
- PANTHER (PANTHERGO/PantherDB; Thomas 2003 + 2022; USC) — GO enrichment
- g:Profiler (Reimand Tartu/OICR) — multi-organism; orthology
- WebGestalt (Zhang Lab Baylor College of Medicine; Liao 2019)
- STRING enrichment (network-based)
- Cytoscape + stringApp + ClueGO + CluePedia
- AUCell + UCell + ssGSEA for single-cell signature scoring (Bioconductor; Aibar 2017; Andreatta 2021)
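Several of the tools above (GSEA, fgsea, GSEApy) share the same core statistic: a weighted running sum over a ranked gene list that steps up at gene-set members and down elsewhere, with the enrichment score taken as the largest deviation from zero. The sketch below illustrates that running sum in the style of Subramanian 2005; it omits the permutation-based significance and normalisation steps, and `enrichment_score` is a hypothetical name rather than any package's API.

```python
def enrichment_score(ranked: list[tuple[str, float]],
                     gene_set: set[str],
                     weight: float = 1.0) -> float:
    """Signed maximum deviation of the weighted running sum.

    ranked: (gene, score) pairs already sorted by decreasing score.
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must hit some but not all ranked genes")

    hit_weights = [abs(score) ** weight if hit else 0.0
                   for (_, score), hit in zip(ranked, in_set)]
    hit_total = sum(hit_weights)
    if hit_total == 0.0:
        raise ValueError("all gene-set members have zero weight")

    running = 0.0
    best = 0.0
    for hit, w in zip(in_set, hit_weights):
        # Step up by the gene's share of the hit weight, or down uniformly.
        running += w / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    toy = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0)]
    print(enrichment_score(toy, {"a", "b"}))
```

Setting `weight=0` reduces the statistic to an unweighted Kolmogorov-Smirnov-style walk, which is the "classic" variant; the weighted form emphasises members at the extremes of the ranking.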
Variant + disease databases
ClinVar + ClinGen
- ClinVar (NCBI 2013) — clinically-observed variants; ACMG/AMP-style classification (Richards 2015 — benign/likely benign/VUS/likely pathogenic/pathogenic); two-star review-status threshold for clinical use; ~3 million records
- ClinGen Clinical Genome Resource (NIH 2013; Rehm Harvard/Broad + others) — expert panels assign gene-disease validity + actionability; ClinGen Allele Registry; Variant Curation Interface (VCI)
- VarSome (Saphetor Geneva) — commercial wrapper for variant interpretation
- Franklin (Genoox) — commercial variant interpretation
- Mastermind (Genomenon) — literature variant search
Somatic / cancer variants
- COSMIC Catalogue Of Somatic Mutations In Cancer (Sanger Institute; Forbes Bamford 2004–; Tate 2019 NAR; ~14 million somatic mutations across 1.5 million tumours)
- cBioPortal (Cerami Schultz Solit; Memorial Sloan-Kettering Cancer Center 2012 Cancer Discovery); free; integrates ~30,000 samples across 300+ studies including all TCGA; ground-zero for “what mutations exist in cancer X”; OncoKB + OncoPrint visualisations
- OncoKB (MSK 2017) — FDA-recognised precision-oncology knowledgebase; therapy levels (1–4 + R1/R2)
- CIViC Clinical Interpretation of Variants in Cancer (WashU Griffith) — open community-curated; CC0; competitor to OncoKB
- JAX-CKB Clinical Knowledgebase (Jackson Lab)
- PMKB Precision Medicine Knowledgebase (Weill Cornell)
- CGI Cancer Genome Interpreter (Lopez-Bigas Barcelona)
Cancer-genome consortia & atlases
- TCGA The Cancer Genome Atlas (NCI + NHGRI 2006–2018; Collins, Hudson, Hayes leadership). 33 cancer types; ~11,000 tumour-normal pairs; published Pan-Cancer Atlas (Hoadley 2018 + 27 companion papers Cell). Data accessible at GDC Genomic Data Commons (NCI 2016–; Chicago) — successor portal hosting TCGA + TARGET (paediatric) + CPTAC (proteogenomics) + CMI + MMRF + many more
- ICGC International Cancer Genome Consortium (2008–2019; ~22,000 tumours); PCAWG Pan-Cancer Analysis of Whole Genomes (Campbell Korbel Stein 2020 Nature); successor ICGC-ARGO ongoing
- AACR Project GENIE Genomics Evidence Neoplasia Information Exchange — clinical-grade tumour panel data; ~165,000 samples (v15)
- MMRF CoMMpass — multiple myeloma longitudinal
- CCLE / DepMap — cell lines (see Cell Lines file)
Reference + population databases
- gnomAD Genome Aggregation Database (Karczewski Lek Tiao 2020 Nature; Broad/MacArthur). v4.1 (2024) ~730,000 exomes + 76,000 whole genomes; predecessor ExAC (Lek 2016; 60,000 exomes); allele frequency tables drive variant interpretation; “constraint” pLI + LOEUF metrics on gene-level intolerance; gnomAD-SV structural variants
- UK Biobank (Bycroft 2018 Nature; Sudlow 2015 PLOS Med) — ~500,000 UK participants enrolled 2006–2010; whole-exome (Backman 2021) → 500k whole-genome (2023; DNAnexus + AstraZeneca + Amgen + GSK + J&J + Roche + Pfizer + Wellcome Sanger consortium); medical records + biomarkers + retinal imaging + heart + brain MRI subset (~100k); proteomics Olink 3,000 proteins (~50k participants) + Olink Explore HT 5,400 proteins (whole-cohort 2024)
- All of Us Research Program (NIH 2018–) — >1 million US enrolled by end 2024; whole-genome subset ongoing; diversity emphasis
- TOPMed Trans-Omics for Precision Medicine (NHLBI; ~200,000 deeply sequenced) — variant frequency resource
- MyCode Community Health Initiative (Geisinger + Regeneron Genetics Center) — ~340,000 sequenced (DiscovEHR collaboration)
- Estonian Biobank (UTartu; ~210,000 — 20 % of adult population)
- SG10K_Health (Singapore) — 10,000 SE Asian genomes
- China Kadoorie Biobank (Oxford + CAMS Beijing; 510,000 enrolled)
- BBJ BioBank Japan (~270,000)
- FinnGen (Finland; ~500,000 with health-registry linkage)
- Genomics England 100,000 Genomes Project (2018 complete; NHS; transitioning to NHS Genomic Medicine Service routine sequencing)
- Million Veteran Program (VA US; >1 million enrolled)
- 23andMe (private; ~14 million genotyped; bankruptcy filing 2025 disrupting data access)
- GA4GH Global Alliance for Genomics and Health — interoperability standards (Beacon + VRS variation representation + Phenopackets + DUO data use ontology + DRS data repository service + htsget streaming + RNAget + Service Registry)
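The allele-frequency tables that population resources such as gnomAD provide reduce to two counts per variant and population: the alternate allele count (AC) and the number of called alleles (AN). The sketch below shows the arithmetic and a simple rarity filter on the highest per-population frequency. The population labels, counts and the 0.01 threshold are hypothetical illustrations, not values from any database, and real pipelines typically use confidence-adjusted frequencies rather than the raw maximum.

```python
def allele_frequency(ac: int, an: int) -> float:
    """Allele frequency AF = AC / AN for one population."""
    if an <= 0:
        raise ValueError("allele number must be positive")
    return ac / an


def max_population_af(counts: dict[str, tuple[int, int]]) -> float:
    """Highest AF across populations; counts maps label -> (AC, AN)."""
    usable = [(ac, an) for ac, an in counts.values() if an > 0]
    if not usable:
        raise ValueError("no population has called alleles")
    return max(allele_frequency(ac, an) for ac, an in usable)


def is_rare(counts: dict[str, tuple[int, int]], threshold: float = 0.01) -> bool:
    """True if the variant is below the threshold in every population."""
    return max_population_af(counts) < threshold


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    example = {"pop_a": (1, 1000), "pop_b": (10, 2000)}
    print(max_population_af(example), is_rare(example))
```

Filtering on the per-population maximum rather than the pooled frequency avoids discarding a variant that is common in one ancestry group but diluted in the overall cohort.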
Mendelian / rare disease
- OMIM Online Mendelian Inheritance in Man (Johns Hopkins; McKusick founded 1966 print version; web 1995); ~8,000 phenotype entries, ~17,000 genes; canonical for monogenic disease
- Orphanet (INSERM France) — rare diseases (~6,000 listed); ORPHAcode identifiers
- MONDO Monarch Disease Ontology — unified disease ontology mapping OMIM + Orphanet + DOID + UMLS + MedDRA + MeSH
- HPO Human Phenotype Ontology (Robinson Köhler; Charité + JAX; ~18,000 terms) — phenotypic abnormalities; standard for clinical decision support and “phenotype-to-gene” searches
- DECIPHER (Sanger; pathogenic CNVs + variants + phenotypes ~50,000+ patients)
- GeneReviews (UWashington; expert-authored gene-disease reviews)
- GenCC Gene Curation Coalition — gene-disease validity aggregator
- PanelApp (Genomics England; Australia mirror) — gene panels for rare disease
Drug + chemistry resources
| Database | Owner | Scope | Notes |
|---|---|---|---|
| DrugBank | Wishart UAlberta + OMx Personal Health Analytics | ~16,000 drug entries with targets, mechanism, ADME | Subscription for commercial / bulk (US$3k–50k+); free for academic |
| ChEMBL | EBI | ~2.5 million compounds, 18 million bioactivities, 15,000 targets | Free + Creative Commons; gold-standard for SAR data |
| PubChem | NLM NIH | 115 million compounds, 300 million substances, 1.5 million bioassays | Free open |
| ChemSpider | RSC Royal Society of Chemistry | 100 million+ structures | Free open |
| Reaxys | Elsevier | ~80 million reactions, 200 million substances | Commercial premium ($15k–100k/seat) |
| SciFinder^n | CAS Chemical Abstracts Service | All CAS RN; reactions; bioactivity | Commercial premium |
| DGIdb Drug-Gene Interaction Database | WashU McDonnell | Aggregator of 30+ sources for gene → drug interactions | Free |
| Guide to Pharmacology (GtoPdb) | IUPHAR + BPS | Curated pharmacology of ~1,800 protein targets + ~10,000 ligands | Free; complementary to ChEMBL with editorial curation |
| STITCH | von Mering Bork EMBL | Chemical-protein interactions (~500,000 chemicals × 9.6 million proteins) | Free |
| BindingDB | UCSD/Skaggs (Gilson) | Measured binding affinities (~2.7 million) | Free |
| ChEMBL-NTD | EBI | Neglected tropical disease | Free |
| NPASS | Northeast Forestry U | Natural products | Free |
| COCONUT | Jena/Steinbeck | Natural products consolidated | Free |
| ZINC22 | UCSF Irwin/Shoichet | ~37 billion purchasable compounds (mostly Enamine REAL) | Free; the go-to for ultra-large virtual screening |
| Enamine REAL Space | Enamine | ~50 billion virtual; ~5 billion synthesisable | Industry premium (catalog ~US$50–500/compound 5 mg) |
| WuXi GalaXi | WuXi LabNetwork | ~12 billion | Vendor space |
| OTAVA / ChemDiv / Maybridge / Asinex / Specs | various | hundreds of K to few M each | Vendor screening libraries |
ML / AI for biology — foundation models + key tools
Structure prediction & design
- AlphaFold2 Jumper 2021 Nature (Nobel Chemistry 2024 Hassabis Jumper; co-laureate Baker for protein design) — MSA-based, EvoFormer + structure module + 3-recycling
- AlphaFold-Multimer Evans 2022 — heteromer support
- AlphaFold3 Abramson 2024 Nature — joint protein + ligand + ion + NA prediction; Pairformer + diffusion module; web-only initial → released weights 2024 (academic)
- ESM2 + ESMFold (Lin Meta AI 2023 Science) — single-sequence; ESM-IF inverse folding; ESM-3 (Hayes EvolutionaryScale 2024)
- RoseTTAFold + RoseTTAFold2 + RFdiffusion + RFantibody (Baker UW IPD) — diffusion-based design; Baker won 2024 Chemistry Nobel
- Chai-1 Chai Discovery 2024 (open-weights AlphaFold3 alternative)
- Boltz-1 / Boltz-2 MIT 2024–2025 (MIT-licensed; co-structure + binding affinity)
- HelixFold Baidu PaddlePaddle; OmegaFold Wu HeliXon (no MSA); OpenFold Columbia AlQuraishi (open AF2 reimplementation); Uni-Fold DP Tech
- ColabFold Mirdita Schütze Moriwaki Heo Ovchinnikov Steinegger 2022 Nat Methods — MMseqs2-based MSA + AF2/RoseTTAFold pipeline; democratised structure prediction (~10 minutes/Colab)
Protein language models
- ProtBert / ProtT5 (Elnaggar Rost TUM 2021)
- ESM2 (Meta) — 15B-parameter PLM at top end
- ESM-IF (Hsu) inverse folding
- ProteinMPNN (Baker UW; Dauparas 2022 Science) — message-passing sequence design from structure
- AbLang (Olsen Boomsma Deane 2022) — antibody language model
- IgLM (Shuai Ruffolo Gray 2023) — antibody generative
- OpenFold + UniRef50 / UniRef90 training data
- AlphaMissense (Cheng Avsec Velankar Jumper 2023 Science) — missense variant pathogenicity prediction; ~89 % of 71 million possible missense variants classified as likely benign or likely pathogenic
DNA / RNA / single-cell
- Nucleotide Transformer (InstaDeep + NVIDIA 2024) — 2.5B-parameter genomic LM trained on 850 genomes
- HyenaDNA (Stanford 2024) — long-range up to 1M-bp context; Hyena attention-alternative
- Caduceus (Cornell + Princeton + CMU 2024) — bi-directional, reverse-complement-aware; Mamba backbone
- Evo (Arc Institute 2024 Science; Nguyen Poli Re) — 7B trained on 2.7M prokaryotic + phage genomes; generative
- Evo2 (Arc 2025) — 40B + 9.3T-token training
- DNABERT + DNABERT-2 (Northwestern 2024)
- Geneformer (Theodoris Ellinor 2023 Nature; MGB; pretrained on Genecorpus-30M single-cell) — gene-level tokens
- scGPT (Bo Wang Toronto 2023) — single-cell foundation model
- scFoundation (Hao Tsinghua 2023); GeneCompass (Beijing 2024)
- scvi-tools (Yosef Berkeley/Weizmann; Lopez 2018 deep generative for scRNA-seq) — most-used scRNA-seq integration toolkit; scVI + totalVI + scANVI + DestVI + cell2location
- CellTypist (Teichmann Sanger; Domínguez-Conde 2022) — automated cell-type annotation
- CellxGene (CZI) — single-cell atlas browser + Census API
- HCA Human Cell Atlas (Regev Teichmann founders 2017–; ~100M+ cells planned; ~50M cataloged)
- Tabula Sapiens / Tabula Muris (Quake Stanford) — multi-organ scRNA-seq atlases
- BioNeMo NVIDIA — GPU framework + pretrained models
- NVIDIA NIM microservices for protein design
Bio-foundation model companies / labs
- DeepMind Isomorphic Labs (Hassabis Jumper) — AlphaFold + AI drug design (Lilly + Novartis partnerships)
- EvolutionaryScale (Rives Hayes spinoff from Meta FAIR 2024) — ESM3 multimodal
- Iambic Therapeutics (NeuralPLexer Qiao; structure-based generative drug discovery; IAM1363 clinical)
- Variational AI (Vancouver; ENKI generative)
- Insilico Medicine (Zhavoronkov; PandaOmics + Chemistry42 + InClinico; ISM001-055 idiopathic pulmonary fibrosis Phase II)
- Recursion (Hillman; HCS phenomics + ML; acquired Exscientia 2024 ~US$700M)
- Genesis Therapeutics (Lab; tilted molecule design)
- Atomic AI (Townshend; RNA structure)
- Cradle Bio + Profluent Bio (Madani) — protein engineering services
- Generate Biomedicines (Flagship; deep-learning protein design; multiple oncology + immunology assets)
- Absci (de-novo antibody discovery)
- BigHat Biosciences (antibody optimisation)
- PostEra (medicinal chemistry; Manifold platform)
- Valence Labs (Mila + Recursion; small-molecule)
- Chai Discovery (Chai-1 open; co-fold + affinity)
- Cyrus Bio (Rosetta-derived design)
Imaging + spatial
- CellProfiler (Carpenter Broad) — image-based phenotyping
- DeepCell + Mesmer (Van Valen Caltech) — segmentation
- Cellpose (Stringer Pachitariu HHMI Janelia) — generalist segmentation
- Stardist (Schmidt Weigert) — star-convex polygon segmentation
- squidpy (Theis Helmholtz Munich) — spatial transcriptomics analysis (Visium + Xenium + MERFISH + CosMx + Stereo-seq)
- cell2location (Kleshchevnikov Bayraktar Teichmann 2022 Nat Biotechnol) — spatial cell-type mapping
- MoNuSeg / PanNuke / Lizard — pathology nucleus datasets
- Foundation pathology: CTransPath (Wang 2022); UNI (Mahmood Lab 2024 Nat Med); Virchow (Paige.AI 2024); Prov-GigaPath (Microsoft + Providence 2024 Nature); PRISM (Paige); PLUTO (PathAI); PathChat (Mahmood/Vasquez)
Pipeline + workflow tools
| Tool | Author | Strengths |
|---|---|---|
| Snakemake | Köster 2012; Univ Duisburg-Essen | Pythonic; popular in academic genomics; YAML config + Conda integration |
| Nextflow | Di Tommaso 2017 CRG Barcelona; Seqera Labs commercial | Channel + dataflow paradigm; cloud-portable (AWS Batch, Google Batch, Kubernetes, Azure Batch); de facto pharma standard |
| nf-core | Community Nextflow pipeline collection (~100 pipelines; rnaseq, sarek, ampliseq, fetchngs, mag, etc.); rigorous CI + best-practices template; supported by Seqera | Production-grade community workflows |
| WDL Workflow Description Language | OpenWDL Broad/Cromwell engine | GA4GH-endorsed; runs on Cromwell + miniWDL; Terra preferred |
| CWL Common Workflow Language | Amstutz 2016; CWL community | Container-portable; older but solid |
| Galaxy | Goecks Nekrutenko Taylor 2010 Penn State (Hopkins move) | GUI-first; pedagogical + production; usegalaxy.org + .eu + .au |
| Terra | Broad + Verily | Cloud workspace + WDL/Cromwell; primary AnVIL platform |
| DNAnexus | DNAnexus (2009; Stanford spinoff) | Commercial cloud platform; AstraZeneca + Regeneron + UKBB RAP partner |
| Seven Bridges Genomics / Cancer Genomics Cloud (CGC) | Velsera (formed 2023 merger Seven Bridges + Pierian + UgenTec) | NCI CGC official platform |
| Latch.Bio | YC alumnus; Tisza | DataFrame + modern UI |
| Form Bio (Colossal Biosciences spin-out) | — | Workflow + visualization |
| Argo Workflows / Airflow / Prefect / Dagster | CNCF / Apache / Prefect / Elementl | General-purpose orchestration adopted by some bio teams |
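What the workflow engines in this table share is that a pipeline is a dependency graph: each step declares its inputs, and the engine derives a valid execution order (and what can run in parallel) from those declarations. The sketch below shows only that core idea using the standard-library `graphlib` module; the file names and rules are hypothetical, and real engines add caching, containers, retries and cluster or cloud dispatch on top.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: each target maps to the set of inputs it depends on.
RULES = {
    "trimmed.fastq": {"raw.fastq"},
    "genome.index": {"genome.fa"},
    "aligned.bam": {"trimmed.fastq", "genome.index"},
    "counts.tsv": {"aligned.bam"},
}


def execution_order(rules: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every input precedes its target.

    Raises graphlib.CycleError if the rules contain a dependency cycle.
    """
    return list(TopologicalSorter(rules).static_order())


if __name__ == "__main__":
    for step in execution_order(RULES):
        print(step)
```

Several valid orders usually exist, so downstream code should depend only on the guarantee that inputs come before targets, not on one specific sequence.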
Bioconductor + Python ecosystem
- Bioconductor (Gentleman Carey Huber 2003; R-based; ~2,300 packages 2024) — DESeq2 (Love Anders Huber; differential expression); edgeR (Robinson Smyth McCarthy); limma (Smyth voom); GenomicRanges + IRanges + S4Vectors infrastructure; SummarizedExperiment + SingleCellExperiment data classes; ChIPseeker; ChIPQC; Gviz; karyoploteR; minfi (methylation); maftools; ComplexHeatmap (Gu); Seurat-bridge SeuratObject + sceasy
- Seurat (Satija NYGC; v5 2024) — R single-cell flagship
- Scanpy (Wolf Theis 2018) — Python single-cell flagship; anndata + scverse stack
- scverse (Heumos Theis 2023) — Python single-cell ecosystem: anndata, scanpy, scvi-tools, squidpy, MUON multimodal, decoupler, dynamo
- Galaxy Training Network GTN — curated tutorials
- Biopython (Cock 2009)
- Python/scverse data infrastructure: AnnData (Wolf + Strobl); MuData; Zarr chunked; OME-Zarr for imaging
- PyTorch + JAX + scikit-learn + Keras + Hugging Face Transformers — ML
- Polars + Pandas + Dask for tabular
- DVC + MLflow + Weights & Biases + Aim — experiment tracking
Reproducibility, repositories, preprints
- Zenodo (CERN OpenAIRE; ~3 million records; up to 50 GB per record; DOIs)
- Dryad (~70k datasets; partnered with Zenodo to host associated software and supplementary files)
- Figshare (Digital Science)
- OSF Open Science Framework (Center for Open Science)
- Code Ocean (compute capsules + Docker)
- Whole Tale + Binder + mybinder.org — reproducible computational environments
- Renku (SDSC Lausanne)
- DataLad (Halchenko Hanke) — git-annex-based data versioning
- Synapse / Sage Bionetworks — collaborative biomedical projects (DREAM Challenges, ADKnowledgePortal)
- PhysioNet — physiologic signals + MIMIC-IV ICU data
- bioRxiv (CSHL; ~400k preprints 2024); medRxiv (CSHL + BMJ + Yale; ~80k); chemRxiv (ACS + RSC + GDCh + ChemSoc Japan); arXiv q-bio (Cornell); ResearchSquare (Springer)
- GitHub + GitLab — code (Bioconda for binary distribution)
- Docker Hub / quay.io BioContainers / Singularity / Apptainer — containerised tools (Conda + Bioconda + Mamba dominant for Python/R bioinformatics)
- Conda-forge + Bioconda (BioConda 2018 Grüning Köster; ~10,000 bioinformatics packages)
- EDAM Ontology for bioinformatics tool/data classification
- bio.tools registry (~25,000 tools)
Standards & FAIR-data anchors
- FAIR Principles (Wilkinson 2016 Sci Data) — findable + accessible + interoperable + reusable
- GA4GH standards (Beacon, VRS, Phenopackets, DUO, DRS, htsget, RNAget, Service Registry, Crypt4GH)
- MIAME / MINSEQE / MIAPE — minimum information about experiment standards
- SBML (Systems Biology Markup Language; Hucka 2003) for kinetic models
- SBGN (Systems Biology Graphical Notation; Le Novère 2009) for diagrams
- BioPAX (Bader Demir 2010) for pathway exchange
- MIAxE / FAIRsharing.org — schemas registry
- CITE-seq / SHARE-seq / Multiome / SPLiT-seq / Slide-seq / GeoMx / CosMx / Visium HD / Xenium / MERFISH / Stereo-seq / DBiT-seq — common single-cell / spatial assay protocols (referenced in many pipelines)
Adjacent
- protein-families-and-drug-targets — UniProt, ChEMBL, GtoPdb power the target catalogs there
- cell-lines-and-antibody-catalog — Human Protein Atlas + Antibodypedia + DepMap CCLE feed cell-line and antibody choice
- model-organisms-and-sequencing-tech — Ensembl, RefSeq, MGI, ZFIN, FlyBase, WormBase, SGD are the model-organism counterparts
- Bioinformatics and computational biology Tier 2
- Genetics and genomics Tier 2
- Systems biology Tier 2
- Scientific computing Tier 2 (cross-domain — pipelines, container orchestration, HPC apply across bio + physical sciences)