Pathway Databases & Bioinformatics Resources

This Tier 3 family index catalogues the public + commercial knowledge bases that practising computational biologists rely on, ordered by content type: sequence → structure → pathway → ontology → variant/disease → drug/chemistry → ML/foundation models → pipeline tools → reproducibility. Numerical scale figures are mid-2026 unless noted. SI units throughout (storage in B/kB/MB/GB/TB/PB; sequence size in bp; protein in residues or Da).

Sequence databases — the INSDC and friends

The International Nucleotide Sequence Database Collaboration (INSDC) is the foundational three-way mirror of all public DNA/RNA sequence: GenBank (NCBI USA) ⇄ ENA (EBI Europe) ⇄ DDBJ (Japan). Submissions to any node propagate to all three nightly. Total holdings as of 2026: ~5 × 10¹² bp (5 Tbp) in core assembled sequence; ~5 × 10¹⁶ bp (50 Pbp) in raw read archives. Identifiers across all three are interchangeable (accession.version format).
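
Because identifiers follow the accession.version convention, a record can be tracked across mirrors by comparing the accession part and treating the version as a revision counter. The following is a minimal sketch under a simplified pattern that is an assumption made here, not the INSDC's formal grammar; the function names are illustrative.

```python
import re
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


# Assumed simplified pattern: letters, digits or underscores, a dot, an integer version.
_PATTERN = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)$")


def parse_accession(text: str) -> VersionedAccession:
    """Split an 'accession.version' string into its two parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in accession.version form: {text!r}")
    return VersionedAccession(match.group(1), int(match.group(2)))


def same_record(a: str, b: str) -> bool:
    """True when two identifiers share an accession, ignoring the version."""
    return parse_accession(a).accession == parse_accession(b).accession
```

Comparing on the accession alone is useful when one mirror has already picked up a newer version of the same record.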

NCBI (National Center for Biotechnology Information; NLM/NIH; Bethesda MD; founded 1988; David Lipman long-time director)

GenBank — primary nucleotide repository; ~270 million sequence records; INSDC partner
RefSeq — curated reference sequences (NCBI staff curation); ~250,000 organisms with at least one RefSeq; assembly accessions GCF_*
GEO Gene Expression Omnibus — ~250,000 series (Series GSE), ~7 million samples (GSM); microarray + RNA-seq + ChIP-seq + scRNA-seq; European counterpart is ArrayExpress (now part of BioStudies at EBI)
SRA Sequence Read Archive — raw reads from sequencing instruments; ~50 PB as of 2025 (5 × 10¹⁶ bytes); cloud mirrors AWS Open Data + GCP + Azure
dbSNP — 1 billion+ variants (mostly SNPs and small indels); reference RSIDs (rs prefix)
dbVar — structural variants ≥ 50 bp
ClinVar — clinically annotated variants; ~3 million records; five-tier classification (B/LB/VUS/LP/P) with star-rated review level
dbGaP — controlled-access genotype/phenotype (US NIH project data; biobank summary statistics)
BioProject + BioSample — metadata anchors linking submissions
PubChem — small-molecule chemistry (NLM; ~115 million compounds; 800+ data sources); bioassays + patent + bioactivity + literature
PubMed — citation/abstract database; ~36 million records (2024); MeSH-indexed
PMC PubMed Central — full-text open-access (~10 million articles)
MedGen — phenotype + disease terms cross-mapped to ClinVar, OMIM, MONDO, HPO
Taxonomy — NCBI Taxonomy Database (formal arbiter for sequence-record organism)
Genome — assembly hub; current human reference GRCh38.p14 (2022) + T2T-CHM13v2.0 Telomere-to-Telomere 2022 (Nurk 2022 Science)
BLAST — Basic Local Alignment Search Tool (Altschul 1990 + 1997 PSI-BLAST); now the BLAST+ toolkit; web + local; DELTA-BLAST (domain-enhanced protein search) + Magic-BLAST (RNA-seq read alignment)
Datasets / E-utilities API — programmatic access
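
Programmatic access through E-utilities comes down to building a query URL against one of its endpoints. The sketch below only constructs an efetch URL and makes no network call; the helper name and the default return type are choices made here for illustration, and the accession in the usage note is just an example.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, ids: list[str], rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an efetch URL for one or more record identifiers."""
    if not ids:
        raise ValueError("at least one identifier is required")
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"
```

For example, efetch_url("nuccore", ["U49845.1"]) yields a URL that can be passed to any HTTP client. Heavy use should follow NCBI's published rate limits and API-key guidance.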

EMBL-EBI (European Bioinformatics Institute; Hinxton UK; Janet Thornton DG long-tenured; Ewan Birney)

ENA European Nucleotide Archive — INSDC partner; submission portal
UniProt Universal Protein Resource — collaboration EBI + SIB Swiss + PIR; Swiss-Prot ~570,000 manually annotated entries; TrEMBL ~245 million automatically annotated; UniRef clusters at 100 % / 90 % / 50 % identity; UniParc archive
Ensembl — genome browser + annotation pipeline; 300+ vertebrate species; Ensembl Plants / Fungi / Bacteria / Protists / Metazoa sister projects
ENCODE Encyclopedia of DNA Elements (NHGRI-funded; data coordination centre at Stanford, so not an EBI-hosted resource) — functional annotation of human + mouse genomes; ~17,000 experiments
ChEMBL — manually curated bioactivity database; ~2.5 million compounds + 18 million bioactivities + 15,000 targets
Europe PMC — full-text literature (PMC + biomedical preprints from bioRxiv etc.)
PRIDE — proteomics identifications database (mass spec); part of ProteomeXchange consortium with PeptideAtlas + MassIVE + JPOST + iProX
MetaboLights — metabolomics
EVA European Variation Archive — variation submission
IntAct — molecular interactions (curated); part of IMEx consortium with MINT (Roma), DIP (UCLA), HPRD (Bangalore; legacy), BioGRID (Mt Sinai Toronto/Princeton)
InterPro — protein-domain/family signature aggregator (PROSITE + Pfam + SMART + CATH-Gene3D + PRINTS + HAMAP + PIRSF + SUPERFAMILY + TIGRFAMs + AntiFam + CDD + PANTHER); ~45,000 signatures
Pfam (now hosted at InterPro since 2022 retirement of standalone Pfam) — protein families; ~23,000 family HMM models (HMMER); Sonnhammer Bateman Eddy founders
AlphaFold Database (joint with DeepMind 2021 + 2022) — 214 million predicted structures (originally 200 M in 2022; expanded to 214 M; covers ~98.5 % of UniProt)
Open Targets Platform — target validation; integrates evidence from genetics + drugs + literature + RNA expression (EMBL-EBI + GSK + Sanger + others)
Expression Atlas — bulk + scRNA-seq curated re-analysis (~5,000 experiments)
Single Cell Expression Atlas — single-cell focus
EMPIAR electron microscopy raw data archive
EMDB Electron Microscopy Data Bank (cryo-EM maps)
BioStudies — generic submission for misc; absorbed ArrayExpress in 2022

DDBJ (DNA Data Bank of Japan; NIG Mishima)

INSDC partner; DRA DDBJ Sequence Read Archive (mirror); JGA Japanese Genotype-phenotype Archive (controlled access); DDBJ Annotated/Assembled Sequences

Proteomics & structure

UniProt

The single most-used protein resource. Swiss-Prot entries hand-curated with sequence + function + subcellular location + PTMs + variants + disease association + GO terms + cross-references; TrEMBL entries automatic. UniProt IDs are stable accessions (e.g. P38398 = BRCA1_HUMAN, P04637 = P53_HUMAN). The proteomics community standardised on UniProt IDs for protein-level reporting. Update cycle ~8 weeks.
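
Since accessions such as P38398 are the join key for protein-level reporting, it can help to validate them before merging tables. The sketch below uses the accession pattern that UniProt documents, as far as I recall it; the function name is illustrative.

```python
import re

# Accession pattern as documented by UniProt (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """True for strings shaped like a UniProt accession, e.g. P38398."""
    return UNIPROT_ACCESSION.fullmatch(text) is not None
```

Entry names such as BRCA1_HUMAN are deliberately rejected: they are mnemonic and can change, whereas accessions are the stable identifiers.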

Protein Data Bank (PDB)

Founded 1971 at Brookhaven National Laboratory; now hosted by the wwPDB worldwide PDB consortium comprising RCSB PDB (USA; Rutgers + UCSD; Helen Berman/Stephen Burley), PDBe (EBI; Sameer Velankar), PDBj (Osaka; Haruki Nakamura/Genji Kurisu), BMRB Biological Magnetic Resonance Bank (UConn) for NMR. As of 2026: ~230,000 experimental structures; ~78 % X-ray crystallography, ~17 % cryo-EM (the fastest-growing method and a rapidly rising share of new depositions), ~5 % NMR, small fraction electron diffraction + neutron. Resolution distribution: 15 % < 1.8 Å, 70 % 1.8–3 Å, 15 % > 3 Å (mostly low-resolution cryo-EM). PDB IDs are 4-character (e.g. 1HHO, 7K3G); the extended format prefixes pdb_ to an 8-character code (e.g. pdb_00001hho) for when the 4-character space runs out.
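
The method shares quoted above translate into approximate entry counts by simple proportion. This is a rough sketch using only the rounded figures from the text (230,000 structures; 78/17/5 % shares), so the outputs are estimates and not wwPDB statistics.

```python
TOTAL_STRUCTURES = 230_000  # rounded total quoted in the text
METHOD_SHARE_PERCENT = {"X-ray": 78, "cryo-EM": 17, "NMR": 5}


def approximate_counts(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Convert integer percentage shares into approximate entry counts."""
    return {method: total * percent // 100 for method, percent in shares.items()}


counts = approximate_counts(TOTAL_STRUCTURES, METHOD_SHARE_PERCENT)
# X-ray: 179,400; cryo-EM: 39,100; NMR: 11,500
```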

PDB-Dev — model archive for integrative structural models (low-res + cross-link MS + SAXS hybrids)
EMDB Electron Microscopy Data Bank — paired cryo-EM map deposition
CSD Cambridge Structural Database (CCDC) — small-molecule structures (~1.2 million); separate from PDB

Predicted structures

AlphaFold Protein Structure Database (AlphaFold2; Jumper 2021 Nature) — 214 million predicted structures hosted at EBI; first version ~350,000 structures July 2021 → ~1 million by early 2022 → 200+ million in the July 2022 release
ESM Atlas (Meta AI; Lin 2023 Science) — 617 million metagenomic protein structures predicted with ESMFold (faster but less accurate than AF2)
OmegaFold (Helix Bio; Wu 2022) — single-sequence (no MSA) prediction
RoseTTAFold (Baker UW IPD; Baek 2021 Science); RoseTTAFold All-Atom 2024; RoseTTAFold Diffusion (RFdiffusion) generative protein design
AlphaFold-Multimer (Evans 2022) — heteromeric complexes
AlphaFold3 (Abramson/Jumper 2024 Nature) — generalises to proteins + small molecules + ions + nucleic acids; web-only initial release (code + weights later released for academic non-commercial use)
Chai-1 (Chai Discovery 2024) — open-source AlphaFold3 alternative
Boltz-1 / Boltz-2 (MIT 2024/2025) — Apache-2 licensed protein–ligand structure + affinity prediction
HelixFold (Baidu / PaddlePaddle); OpenFold (Columbia; AlQuraishi); Uni-Fold (DP Technology Beijing)

Sequence-to-structure-to-function annotation

InterPro + CATH (UCL Orengo) + SCOP/SCOP2 (Murzin Cambridge) + ECOD Evolutionary Classification of Protein Domains (Grishin UT Southwestern)
PROSITE (SIB Geneva) — patterns + profiles
HMMER (Eddy HHMI) — profile HMM tools; underlies Pfam
HHsuite (Söding Munich) — HMM-HMM comparison
MMseqs2 (Steinegger Söding) — ultra-fast sequence search; underlies ColabFold MSAs
DALI (Holm Helsinki) — structure comparison
TM-align (Zhang Michigan; now NUS Singapore) — TM-score + structure superposition; Foldseek (van Kempen Steinegger Söding 2023 Nat Biotechnol) — structure search using the 3Di structural alphabet, roughly four to five orders of magnitude faster than DALI or TM-align
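
The TM-score reported by TM-align can be computed from per-residue distances once a superposition is fixed. The sketch below follows the published Zhang and Skolnick formula, with d0 = 1.24 (L - 15)^(1/3) - 1.8; it does not search for the optimal superposition, and the 0.5 Å floor on d0 is an assumption made here for short chains.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 in angstroms."""
    if target_length <= 15:
        raise ValueError("formula is only defined here for lengths above 15")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)  # assumed floor for short chains


def tm_score(distances: list[float], target_length: int) -> float:
    """TM-score for aligned residue pairs at the given distances (angstroms).

    Unaligned target residues contribute zero, which is why the sum is
    normalised by target_length and not by the number of aligned pairs.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

A perfect superposition of all residues gives 1.0, and every aligned pair sitting exactly at d0 gives 0.5, which is why scores above roughly 0.5 are commonly read as the same fold.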

Protein-protein interaction (PPI)

STRING (von Mering Jensen Bork; SIB/UZH Zurich + CPR Copenhagen + EMBL Heidelberg; ~12,000 organisms; protein-protein “associations” with evidence channels) — most-used PPI for hypothesis generation
BioGRID (Tyers; Mt Sinai Toronto + Princeton) — curated PPI + genetic interactions; ~2 million human PPI
IntAct + MINT + DIP + HPRD + InnateDB — IMEx Consortium curated
CORUM (Munich) — protein complexes
mentha (Tor Vergata) — aggregator
HuRI / HI-union (Vidal/CCSB Dana-Farber) — systematic Y2H interactome
OpenCell + HPA Cell Atlas — proximity / colocalisation
Reactome FI — functional interaction (computed)

Pathway databases

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Founded 1995 (Minoru Kanehisa, Kyoto University Bioinformatics Center; now GenomeNet). Comprises:

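KEGG PATHWAY — manually drawn molecular pathway maps; ~570 reference maps projected onto ~570 organisms gives an upper bound of ~3 × 10⁵ organism-specific maps (570 × 570 ≈ 3.2 × 10⁵; fewer in practice because not every map applies to every organism); covering metabolism, genetic info, environmental info, cellular processes, organismal systems, human diseases, drug development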
KEGG GENES — ortholog clustering; KO (KEGG Orthology) identifiers (e.g. K00001 for ADH alcohol dehydrogenase)
KEGG ENZYME + KEGG REACTION + KEGG COMPOUND + KEGG GLYCAN + KEGG RPAIR — metabolic reactions
KEGG DRUG + KEGG DGROUP — drugs hierarchically classified
KEGG DISEASE — diseases linked to genes/pathways
KEGG MODULE — functional units (~1,200 modules)
KEGG BRITE — hierarchical classifications (taxonomy, drug classes, etc.)

Licensing changed in 2011: free web + non-commercial API, but FTP bulk + commercial use requires Pathway Solutions licence (US$15k–200k/year depending on size). This drove much community usage to Reactome.

Reactome

Founded 2003 (CSHL + EBI + OICR Toronto + NYU; Henning Hermjakob, Peter D’Eustachio, Lincoln Stein). CC-BY licensed open. ~3,000 human pathways; 19 model species via orthology. Strengths: detailed reactions with stoichiometry; physical-entity model (compartments + complexes + sets); SBGN-compatible diagrams; ELV (event-level) hierarchies; downloadable as BioPAX + SBML + SBGN-ML + PSI-MITAB + JSON + neo4j graph; web service + ContentService + ReactomeFIViz Cytoscape plug-in. Best for detailed mechanistic queries and as an alternative to KEGG that is fully open. Pathway analysis: standard over-representation via R/ReactomePA + reactome.db Bioconductor.
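
Standard over-representation analysis asks whether a gene list overlaps a pathway more than chance would predict, usually via a hypergeometric tail probability. The sketch below is a generic stdlib implementation of that test, not the code inside ReactomePA; the parameter names are illustrative and multiple-testing correction is left out.

```python
from math import comb


def hypergeom_pvalue(universe: int, pathway: int, selected: int, overlap: int) -> float:
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway` belong to the pathway."""
    if not (0 <= pathway <= universe and 0 <= selected <= universe):
        raise ValueError("pathway and selected must lie within the universe")
    upper = min(pathway, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(max(overlap, 0), upper + 1)
    )
    return tail / total
```

The choice of universe (all annotated genes versus all genes measured in the experiment) changes the p-value noticeably, so it is worth stating explicitly when reporting results.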

WikiPathways

Open community-edited (Bohler/Pico/Slenter; Maastricht + Gladstone Institutes); CC0/CC-BY licence; 1,500+ pathways covering 30+ species; integration with Cytoscape WikiPathways App + PathVisio editor; emphasis on community-curated metabolism + lipid biology.

BioCyc / MetaCyc / EcoCyc

Founded 1996 by Peter Karp at SRI International. EcoCyc is the gold-standard whole-organism database for E. coli K-12 MG1655 (~30 years of curation); MetaCyc is the multi-organism reference metabolic pathway database (~3,000 pathways across 3,000+ organisms); BioCyc is the tiered umbrella with ~20,000 pathway/genome databases (Tier 1: curated like EcoCyc; Tier 2: moderately curated; Tier 3: automated PathoLogic). Subscription required for downloads + many tier-3 PGDBs since 2018 controversy; web access remains free.

Pathway Commons

NCI-funded aggregator (Bader Lab Toronto + MSKCC; Demir Babur); integrates Reactome + WikiPathways + KEGG (subset) + PID + PANTHER + Inoh + NetPath + BIND + HumanCyc + others into BioPAX Level 3; powering ChiBE + ReactomeFIViz + Newt + cBioPortal pathway lookups.

IPA — Ingenuity Pathway Analysis

Qiagen Digital Insights (Qiagen acquired Ingenuity Systems in 2013; Thermo Fisher's 2020 bid for Qiagen did not complete). Closed commercial (typical ~US$15k–30k/yr per seat). Strong in curated upstream-regulator + causal network analysis; popular in pharma + biotech but losing share to open tools and to ML-based pathway analysis.

Other pathway / signalling resources

SignaLink (Korcsmáros, Budapest) — signaling cross-talk
SIGNOR (Cesareni Rome; Licata 2020 NAR) — manually-curated signalling causality
NetPath (Pandey Hopkins/IBR) — immune pathways (legacy)
PID Pathway Interaction Database (NCI 2009–2014; archived; now in Pathway Commons)
SMPDB Small Molecule Pathway Database (Wishart UAlberta; small-molecule context)
HMDB Human Metabolome Database (Wishart UAlberta) — companion metabolite info
PathBank (Wishart) — visual pathway database covering human + model-organism pathways
OmniPath (Saez-Rodriguez Heidelberg) — meta-database aggregating ~100 sources for systems-biology modelling (used in NicheNet, LIANA, ROOTS)

Gene ontology & functional annotation

Gene Ontology (GO)

Founded 1998 (Ashburner Cherry Botstein; Ashburner 2000 Nature Genetics). Three independent ontologies (namespaces): biological process (BP; “DNA repair” GO:0006281), molecular function (MF; “kinase activity” GO:0016301), cellular component (CC; “mitochondrion” GO:0005739). ~45,000 terms with DAG (directed acyclic graph) relationships (is_a + part_of + regulates + has_part + occurs_in). GO Consortium maintains; evidence codes from EXP (experimental) through IEA (electronic, default for most TrEMBL annotations). GOA GO Annotations at EBI is the canonical annotation set for UniProt entries.
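
Because GO is a DAG, an annotation to a term implies annotation to all of its ancestors, and enrichment tools rely on this propagation. The following sketch walks a toy graph; the term IDs are hypothetical placeholders and not real GO identifiers, and which relations to follow is a modelling choice.

```python
from collections import deque

# Hypothetical toy ontology: child -> list of (relation, parent).
TOY_DAG = {
    "T:leaf": [("is_a", "T:mid1"), ("part_of", "T:mid2")],
    "T:mid1": [("is_a", "T:root")],
    "T:mid2": [("is_a", "T:root")],
    "T:root": [],
}


def ancestors(term: str, dag: dict, relations: tuple = ("is_a", "part_of")) -> set:
    """Collect every ancestor reachable through the allowed relations."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in dag.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Restricting the walk to is_a alone gives a smaller ancestor set, which is one reason two tools can report different term sizes for the same annotation file.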

Enrichment / pathway analysis tools

GSEA Gene Set Enrichment Analysis (Subramanian Mootha 2005 PNAS; Broad). Pre-ranked variant + classic; uses MSigDB Molecular Signatures Database — hallmarks (H; 50 well-defined biological-process gene sets), positional (C1), curated (C2 — includes Reactome, KEGG, WikiPathways, BioCarta), regulatory motif (C3 — TF + miRNA targets), computational (C4 — cancer modules), GO (C5), oncogenic (C6), immunologic (C7), cell type (C8)
fgsea (Korotkevich 2021; R/Bioconductor) — fast GSEA
enrichplot / clusterProfiler (Yu Guangchuang; Southern Medical University Guangzhou) — R/Bioconductor; KEGG + GO + Reactome + MSigDB
GSEApy (Python)
DAVID (Huang Sherman Lempicki; NIH NIAID; 2003); free; web; older “EASE score” over-representation
Enrichr / Enrichr-KG (Avi Ma’ayan Lab Mt Sinai; Chen 2013; Xie 2021); free web/API; ~200 libraries
Metascape (Zhou 2019; Sanford-Burnham) — free; web; meta-analysis across DEG sets
PANTHER (PANTHERGO/PantherDB; Thomas 2003 + 2022; USC) — GO enrichment
g:Profiler (Reimand Tartu/OICR) — multi-organism; orthology
WebGestalt (Zhang Lab Baylor College of Medicine; Liao 2019)
STRING enrichment (network-based)
Cytoscape + stringApp + ClueGO + CluePedia
AUCell + UCell + ssGSEA for single-cell signature scoring (Bioconductor; Aibar 2017; Andreatta 2021)
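
The running-sum statistic behind GSEA-style tools can be illustrated with the unweighted (classic, Kolmogorov-Smirnov-like) form. This is a sketch of that simplified variant only; it omits the correlation weighting and the permutation-based significance estimate used by the published method, and the function name is illustrative.

```python
def enrichment_score(ranked_genes: list[str], gene_set: set[str]) -> float:
    """Unweighted enrichment score: the largest deviation from zero of a
    running sum that steps up at set members and down at non-members."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 means the set is concentrated at the top of the ranking, near -1 at the bottom.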

Variant + disease databases

ClinVar + ClinGen

ClinVar (NCBI 2013) — clinically-observed variants; ACMG/AMP-style classification (Richards 2015 — benign/likely benign/VUS/likely pathogenic/pathogenic); two-star review-status threshold for clinical use; ~3 million records
ClinGen Clinical Genome Resource (NIH 2013; Rehm Harvard/Broad + others) — expert panels assign gene-disease validity + actionability; ClinGen Allele Registry; Variant Curation Interface (VCI)
VarSome (Saphetor Geneva) — commercial wrapper for variant interpretation
Franklin (Genoox) — commercial variant interpretation
Mastermind (Genomenon) — literature variant search
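
The two-star threshold mentioned for ClinVar can be applied as a simple filter once review statuses are mapped to stars. The mapping below reflects ClinVar's review-status wording as I understand it and should be checked against the current documentation; the record layout (dicts with a review_status key) is hypothetical.

```python
# Assumed mapping from ClinVar review-status text to star rating.
STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting classifications": 1,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
}


def clinically_usable(records: list[dict], min_stars: int = 2) -> list[dict]:
    """Keep records whose review status reaches the star threshold.

    Unrecognised statuses are treated as zero stars, the conservative choice.
    """
    return [r for r in records if STARS.get(r["review_status"], 0) >= min_stars]
```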

Somatic / cancer variants

COSMIC Catalogue Of Somatic Mutations In Cancer (Sanger Institute; Forbes Bamford 2004–; Tate 2019 NAR; ~14 million somatic mutations across 1.5 million tumours)
cBioPortal (Cerami Schultz Solit; Memorial Sloan-Kettering Cancer Center 2012 Cancer Discovery); free; integrates ~30,000 samples across 300+ studies including all TCGA; ground-zero for “what mutations exist in cancer X”; OncoKB + OncoPrint visualisations
OncoKB (MSK 2017) — FDA-recognised precision-oncology knowledgebase; therapy levels (1–4 + R1/R2)
CIViC Clinical Interpretation of Variants in Cancer (WashU Griffith) — open community-curated; CC0; competitor to OncoKB
JAX-CKB Clinical Knowledgebase (Jackson Lab)
PMKB Precision Medicine Knowledgebase (Weill Cornell)
CGI Cancer Genome Interpreter (Lopez-Bigas Barcelona)

Cancer-genome consortia & atlases

TCGA The Cancer Genome Atlas (NCI + NHGRI 2006–2018; Collins, Hudson, Hayes leadership). 33 cancer types; ~11,000 tumour-normal pairs; published Pan-Cancer Atlas (Hoadley 2018 + 27 companion papers Cell). Data accessible at GDC Genomic Data Commons (NCI 2016–; Chicago) — successor portal hosting TCGA + TARGET (paediatric) + CPTAC (proteogenomics) + CMI + MMRF + many more
ICGC International Cancer Genome Consortium (2008–2019; ~22,000 tumours); PCAWG Pan-Cancer Analysis of Whole Genomes (Campbell Korbel Stein 2020 Nature); successor ICGC-ARGO ongoing
AACR Project GENIE Genomics Evidence Neoplasia Information Exchange — clinical-grade tumour panel data; ~165,000 samples (v15)
MMRF CoMMpass — multiple myeloma longitudinal
CCLE / DepMap — cell lines (see Cell Lines file)

Reference + population databases

gnomAD Genome Aggregation Database (Karczewski 2020 Nature; Broad/MacArthur). v4.1 (2024) ~730,000 exomes + 76,000 whole genomes; predecessor ExAC (Lek 2016; 60,000 exomes); allele frequency tables drive variant interpretation; “constraint” pLI + LOEUF metrics on gene-level intolerance; gnomAD-SV structural variants
UK Biobank (Bycroft 2018 Nature; Sudlow 2015 PLOS Med) — ~500,000 UK participants enrolled 2006–2010; whole-exome (Backman 2021) → 500k whole-genome (2023; DNAnexus + AstraZeneca + Amgen + GSK + J&J + Roche + Pfizer + Wellcome Sanger consortium); medical records + biomarkers + retinal imaging + heart + brain MRI subset (~100k); proteomics Olink 3,000 proteins (~50k participants) + Olink Explore HT 5,400 proteins (whole-cohort 2024)
All of Us Research Program (NIH 2018–) — >1 million US enrolled by end 2024; whole-genome subset ongoing; diversity emphasis
TOPMed Trans-Omics for Precision Medicine (NHLBI; ~200,000 deeply sequenced) — variant frequency resource
MyCode Community Health Initiative (Geisinger + Regeneron Genetics Center) — ~340,000 sequenced (DiscovEHR collaboration)
Estonian Biobank (UTartu; ~210,000 — 20 % of adult population)
SG10K_Health (Singapore) — 10,000 SE Asian genomes
China Kadoorie Biobank (Oxford + CAMS Beijing; 510,000 enrolled)
BBJ BioBank Japan (~270,000)
FinnGen (Finland; ~500,000 with health-registry linkage)
Genomics England 100,000 Genomes Project (2018 complete; NHS; transitioning to NHS Genomic Medicine Service routine sequencing)
Million Veteran Program (VA US; >1 million enrolled)
23andMe (private; ~14 million genotyped; bankruptcy filing 2025 disrupting data access)
GA4GH Global Alliance for Genomics and Health — interoperability standards (Beacon + VRS variation representation + Phenopackets + DUO data use ontology + DRS data repository service + htsget streaming + RNAget + Service Registry)
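
The allele-frequency tables mentioned for gnomAD reduce to allele count over allele number, and a rarity cut-off is then applied during variant interpretation. The sketch below shows that arithmetic; the 0.1 % default threshold is an illustrative assumption, not a recommendation from any of the resources listed.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """AF = AC / AN, where AN is the number of called alleles at the site."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele_count must lie between 0 and allele_number")
    return allele_count / allele_number


def is_rare(allele_count: int, allele_number: int, threshold: float = 0.001) -> bool:
    """True when the observed frequency is below the (assumed) threshold."""
    return allele_frequency(allele_count, allele_number) < threshold
```

A low allele number means the site was poorly covered, so a frequency of zero there is weaker evidence of rarity than the same value at a well-covered site.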

Mendelian / rare disease

OMIM Online Mendelian Inheritance in Man (Johns Hopkins; McKusick founded 1966 print version; web 1995); ~8,000 phenotype entries, ~17,000 genes; canonical for monogenic disease
Orphanet (INSERM France) — rare diseases (~6,000 listed); ORPHAcode identifiers
MONDO Monarch Disease Ontology — unified disease ontology mapping OMIM + Orphanet + DOID + UMLS + MedDRA + MeSH
HPO Human Phenotype Ontology (Robinson Köhler; Charité + JAX; ~18,000 terms) — phenotypic abnormalities; standard for clinical decision support and “phenotype-to-gene” searches
DECIPHER (Sanger; pathogenic CNVs + variants + phenotypes ~50,000+ patients)
GeneReviews (UWashington; expert-authored gene-disease reviews)
GenCC Gene Curation Coalition — gene-disease validity aggregator
PanelApp (Genomics England; Australia mirror) — gene panels for rare disease

Drug + chemistry resources

Database	Owner	Scope	Notes
DrugBank	Wishart UAlberta + OMx Personal Health Analytics	~16,000 drug entries with targets, mechanism, ADME	Subscription for commercial / bulk (US$3k–50k+); free for academic
ChEMBL	EBI	~2.5 million compounds, 18 million bioactivities, 15,000 targets	Free + Creative Commons; gold-standard for SAR data
PubChem	NLM NIH	115 million compounds, 300 million substances, 1.5 million bioassays	Free open
ChemSpider	RSC Royal Society of Chemistry	100 million+ structures	Free open
Reaxys	Elsevier	~80 million reactions, 200 million substances	Commercial premium ($15k–100k/seat)
SciFinder^n	CAS Chemical Abstracts Service	All CAS RN; reactions; bioactivity	Commercial premium
DGIdb Drug-Gene Interaction Database	WashU McDonnell	Aggregator of 30+ sources for gene → drug interactions	Free
Guide to Pharmacology (GtoPdb)	IUPHAR + BPS	Curated pharmacology of ~1,800 protein targets + ~10,000 ligands	Free; complementary to ChEMBL with editorial curation
STITCH	von Mering Bork EMBL	Chemical-protein interactions (~500,000 chemicals × 9.6 million proteins)	Free
BindingDB	UCSD/Skaggs (Gilson)	Measured binding affinities (~2.7 million)	Free
ChEMBL-NTD	EBI	Neglected tropical disease	Free
NPASS	Northeast Forestry U	Natural products	Free
COCONUT	Jena/Steinbeck	Natural products consolidated	Free
ZINC22	UCSF Irwin/Shoichet	~37 billion purchasable compounds (mostly Enamine REAL)	Free; the go-to for ultra-large virtual screening
Enamine REAL Space	Enamine	~50 billion virtual; ~5 billion synthesisable	Industry premium (catalog ~US$50–500/compound 5 mg)
WuXi GalaXi	WuXi LabNetwork	~12 billion	Vendor space
OTAVA / ChemDiv / Maybridge / Asinex / Specs	various	hundreds of K to few M each	Vendor screening libraries

ML / AI for biology — foundation models + key tools

Structure prediction & design

AlphaFold2 Jumper 2021 Nature (Nobel Chemistry 2024 Hassabis Jumper; co-laureate Baker for protein design) — MSA-based, EvoFormer + structure module + 3-recycling
AlphaFold-Multimer Evans 2022 — heteromer support
AlphaFold3 Abramson 2024 Nature — joint protein + ligand + ion + NA prediction; Pairformer + diffusion module; web-only initial → released weights 2024 (academic)
ESM2 + ESMFold (Lin Meta AI 2023 Science) — single-sequence; ESM-IF inverse folding; ESM-3 (Hayes EvolutionaryScale 2024)
RoseTTAFold + RoseTTAFold2 + RFdiffusion + RFantibody (Baker UW IPD) — diffusion-based design; Baker won 2024 Chemistry Nobel
Chai-1 Chai Discovery 2024 (open-weights AlphaFold3 alternative)
Boltz-1 / Boltz-2 MIT 2024–2025 (Apache-2 licensed; co-structure + binding affinity)
HelixFold Baidu PaddlePaddle; OmegaFold Wu Helix Bio (no MSA); OpenFold Columbia AlQuraishi (open AF2 reimplementation); Uni-Fold DP Tech
ColabFold Mirdita Ovchinnikov Steinegger 2022 Nat Methods — MMseqs2-based MSA + AF2/RoseTTAFold pipeline; democratised structure prediction (~10 minutes/Colab)

Protein language models

ProtBert / ProtT5 (Elnaggar Rost TUM 2021)
ESM2 (Meta) — 15B-parameter PLM at top end
ESM-IF (Hsu) inverse folding
ProteinMPNN (Baker UW; Dauparas 2022 Science) — message-passing sequence design from structure
AbLang (Olsen Boomsma Deane 2022) — antibody language model
IgLM (Shuai Ruffolo Gray 2023) — antibody generative
OpenFold + UniRef50 / UniRef90 training data
AlphaMissense (Cheng Avsec 2023 Science) — missense variant pathogenicity prediction; classifies ~89 % of 71 million possible missense variants as likely benign or likely pathogenic

DNA / RNA / single-cell

Nucleotide Transformer (InstaDeep + NVIDIA 2024) — 2.5B-parameter genomic LM trained on 850 genomes
HyenaDNA (Stanford 2024) — long-range up to 1M-bp context; Hyena attention-alternative
Caduceus (Princeton 2024) — bi-directional; Mamba-2 backbone
Evo (Arc Institute 2024 Science; Nguyen Poli Re) — 7B trained on 2.7M prokaryotic + phage genomes; generative
Evo2 (Arc 2025) — 40B + 9.3T-token training
DNABERT + DNABERT-2 (Northwestern 2024)
Geneformer (Theodoris Ellinor 2023 Nature; MGB; pretrained on Genecorpus-30M single-cell) — gene-level tokens
scGPT (Bo Wang Toronto 2023) — single-cell foundation model
scFoundation (Hao Tsinghua 2023); GeneCompass (Beijing 2024)
scvi-tools (Yosef Berkeley/Weizmann; Lopez 2018 deep generative for scRNA-seq) — most-used scRNA-seq integration toolkit; scVI + totalVI + scANVI + DestVI + cell2location
CellTypist (Teichmann Sanger; Domínguez-Conde 2022) — automated cell-type annotation
CellxGene (CZI) — single-cell atlas browser + Census API
HCA Human Cell Atlas (Regev Teichmann founders 2017–; ~100M+ cells planned; ~50M cataloged)
Tabula Sapiens / Tabula Muris (Quake Stanford) — multi-organ scRNA-seq atlases
BioNeMo NVIDIA — GPU framework + pretrained models
NVIDIA NIM microservices for protein design

Bio-foundation model companies / labs

DeepMind Isomorphic Labs (Hassabis Jumper) — AlphaFold + AI drug design (Lilly + Novartis partnerships)
EvolutionaryScale (Rives Hayes spinoff from Meta FAIR 2024) — ESM3 multimodal
Iambic Therapeutics (NeuralPLexer Qiao Miller; structure-based generative DD; IAM1363 clinical)
Variational AI (Vancouver; ENKI generative)
Insilico Medicine (Zhavoronkov; PandaOmics + Chemistry42 + InClinico; ISM001-055 idiopathic pulmonary fibrosis Phase II)
Recursion (Gibson; HCS phenomics + ML; acquired Exscientia 2024 ~US$700M)
Genesis Therapeutics (Lab; tilted molecule design)
Atomic AI (Townshend; RNA structure)
Cradle Bio + Profluent Bio (Madani) — protein engineering services
Generate Biomedicines (Flagship; deep-learning protein design; multiple oncology + immunology assets)
Absci (de-novo antibody discovery)
BigHat Biosciences (antibody optimisation)
PostEra (medicinal chemistry; Manifold platform)
Valence Labs (Mila + Recursion; small-molecule)
Chai Discovery (Chai-1 open; co-fold + affinity)
Cyrus Bio (Rosetta-derived design)

Imaging + spatial

CellProfiler (Carpenter Broad) — image-based phenotyping
DeepCell + Mesmer (Van Valen Caltech) — segmentation
Cellpose (Stringer Pachitariu HHMI Janelia) — generalist segmentation
Stardist (Schmidt Weigert) — star-convex polygon segmentation
squidpy (Theis Helmholtz Munich) — spatial transcriptomics analysis (Visium + Xenium + MERFISH + CosMx + Stereo-seq)
cell2location (Kleshchevnikov Bayraktar Teichmann 2022 Nat Biotechnol) — spatial cell-type mapping
MoNuSeg / PanNuke / Lizard — pathology nucleus datasets
Foundation pathology: CTransPath (Wang 2022); UNI (Mahmood Lab 2024 Nat Med); Virchow (Paige.AI 2024); Prov-GigaPath (Microsoft + Providence 2024 Nature); PRISM (Paige); PLUTO (PathAI); PathChat (Mahmood/Vasquez)

Pipeline + workflow tools

Tool	Author	Strengths
Snakemake	Köster 2012; Univ Duisburg-Essen	Pythonic; popular in academic genomics; YAML config + Conda integration
Nextflow	Di Tommaso 2017 CRG Barcelona; Seqera Labs commercial	Channel + dataflow paradigm; cloud-portable (AWS Batch, Google Batch, Kubernetes, Azure Batch); de facto pharma standard
nf-core	nf-core community; supported by Seqera	Nextflow pipeline collection (~100 pipelines; rnaseq, sarek, ampliseq, fetchngs, mag, etc.); rigorous CI + best-practices template; production-grade community workflows
WDL Workflow Description Language	OpenWDL Broad/Cromwell engine	GA4GH-endorsed; runs on Cromwell + miniWDL; Terra preferred
CWL Common Workflow Language	Amstutz 2016; CWL community	Container-portable; older but solid
Galaxy	Goecks Nekrutenko Taylor 2010 Penn State (Hopkins move)	GUI-first; pedagogical + production; usegalaxy.org + .eu + .au
Terra	Broad + Verily	Cloud workspace + WDL/Cromwell; primary AnVIL platform
DNAnexus	DNAnexus (2009; Stanford spin-off)	Commercial cloud platform; AstraZeneca + Regeneron + UKBB RAP partner
Seven Bridges Genomics / Cancer Genomics Cloud (CGC)	Velsera (formed 2023 merger Seven Bridges + PierianDx + UgenTec)	NCI CGC official platform
Latch.Bio	YC alumnus; Tisza	DataFrame + modern UI
Form Bio (Colossal Biosciences spin-out)	—	Workflow + visualization
Argo Workflows / Airflow / Prefect / Dagster	CNCF / Apache / Prefect / Elementl	General-purpose orchestration adopted by some bio teams
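
All of the engines in this table share one core idea: a workflow is a DAG of steps, and a step may run once everything it depends on has finished. The sketch below shows that scheduling idea with Python's standard graphlib module (Python 3.9+); the step names are hypothetical and no real engine's syntax is implied.

```python
from graphlib import TopologicalSorter

# Hypothetical RNA-seq style workflow: step -> set of steps it depends on.
WORKFLOW = {
    "fastqc": {"fetch_reads"},
    "align": {"fetch_reads", "build_index"},
    "count": {"align"},
    "report": {"fastqc", "count"},
}


def execution_order(workflow: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every step follows its dependencies."""
    return list(TopologicalSorter(workflow).static_order())
```

Real engines add what this sketch leaves out: running independent steps in parallel, caching completed outputs, and resuming after failure.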

Bioconductor + Python ecosystem

Bioconductor (Gentleman Carey Huber 2003; R-based; ~2,300 packages 2024) — DESeq2 (Love Anders Huber; differential expression); edgeR (Robinson Smyth McCarthy); limma (Smyth voom); GenomicRanges + IRanges + S4Vectors infrastructure; SummarizedExperiment + SingleCellExperiment data classes; ChIPseeker; ChIPQC; Gviz; karyoploteR; minfi (methylation); maftools; ComplexHeatmap (Gu); Seurat-bridge SeuratObject + sceasy
Seurat (Satija NYGC; v5 2024) — R single-cell flagship
Scanpy (Wolf Theis 2018) — Python single-cell flagship; anndata + scverse stack
scverse (Heumos Theis 2023) — Python single-cell ecosystem: anndata, scanpy, scvi-tools, squidpy, MUON multimodal, decoupler, dynamo
Galaxy Training Network GTN — curated tutorials
Biopython (Cock 2009)
Python-side data infrastructure: AnnData (Wolf + Strobl); MuData; Zarr chunked; OME-Zarr for imaging
PyTorch + JAX + scikit-learn + Keras + Hugging Face Transformers — ML
Polars + Pandas + Dask for tabular
DVC + MLflow + Weights & Biases + Aim — experiment tracking

Reproducibility, repositories, preprints

Zenodo (CERN OpenAIRE; ~3 million records; up to 50 GB per record; DOIs)
Dryad (~70k datasets; partnered with Zenodo for software + supplementary files)
Figshare (Digital Science)
OSF Open Science Framework (Center for Open Science)
Code Ocean (compute capsules + Docker)
Whole Tale + Binder + mybinder.org — reproducible computational environments
Renku (SDSC Lausanne)
DataLad (Halchenko Hanke) — git-annex-based data versioning
Synapse / Sage Bionetworks — collaborative biomedical projects (DREAM Challenges, ADKnowledgePortal)
PhysioNet — physiologic signals + MIMIC-IV ICU data
bioRxiv (CSHL; ~400k preprints 2024); medRxiv (CSHL + BMJ + Yale; ~80k); chemRxiv (ACS + RSC + GDCh + ChemSoc Japan); arXiv q-bio (Cornell); ResearchSquare (Springer)
GitHub + GitLab — code (Bioconda for binary distribution)
Docker Hub / quay.io BioContainers / Singularity / Apptainer — containerised tools (Conda + Bioconda + Mamba dominant for Python/R bioinformatics)
Conda-forge + Bioconda (BioConda 2018 Grüning Köster; ~10,000 bioinformatics packages)
EDAM Ontology for bioinformatics tool/data classification
bio.tools registry (~25,000 tools)

Standards & FAIR-data anchors

FAIR Principles (Wilkinson 2016 Sci Data) — findable + accessible + interoperable + reusable
GA4GH standards (Beacon, VRS, Phenopackets, DUO, DRS, htsget, RNAget, Service Registry, Crypt4GH)
MIAME / MINSEQE / MIAPE — minimum information about experiment standards
SBML (Systems Biology Markup Language; Hucka 2003) for kinetic models
SBGN (Systems Biology Graphical Notation; Le Novère 2009) for diagrams
BioPAX (Bader Demir 2010) for pathway exchange
MIAxE / FAIRsharing.org — schemas registry
CITE-seq / SHARE-seq / Multiome / SPLiT-seq / Slide-seq / GeoMx / CosMx / Visium HD / Xenium / MERFISH / Stereo-seq / DBiT-seq — common single-cell / spatial assay protocols (referenced in many pipelines)

Adjacent

protein-families-and-drug-targets — UniProt, ChEMBL, GtoPdb power the target catalogs there
cell-lines-and-antibody-catalog — Human Protein Atlas + Antibodypedia + DepMap CCLE feed cell-line and antibody choice
model-organisms-and-sequencing-tech — Ensembl, RefSeq, MGI, ZFIN, FlyBase, WormBase, SGD are the model-organism counterparts
Bioinformatics and computational biology Tier 2
Genetics and genomics Tier 2
Systems biology Tier 2
Scientific computing Tier 2 (cross-domain — pipelines, container orchestration, HPC apply across bio + physical sciences)
