Bioinformatics & Life-Sciences File-Format Languages Family Index
type: language-family-index
family: bio-fileformats
languages_catalogued: 28
tags: [language-reference, family-index, bio-fileformats, fasta, fastq, vcf, sam, bam, cram, pdb, mmcif, cif, star, smiles, inchi, selfies, gff, gtf, bed, newick, nexus, mzml]
Bioinformatics & Life-Sciences File-Formats — Family Index
Family overview
This family covers the data formats, mostly textual, that underpin biology and structural science: the on-disk grammars that flow through every sequencing pipeline, crystallography refinement, and cheminformatics search. Unlike the workflow-orchestration sibling (bio-workflow: Snakemake/Nextflow/CWL/WDL), these are the data the workflows transform. Most are real grammars with formal specifications, parsers, validators, and installed bases 30 to 50 years old; a handful (FASTA, PDB, GenBank flat-file, SMILES) have been in continuous use since the 1970s and 1980s, CIF followed in 1991, and all remain canonical in 2026.
The lineage is striking. PDB (Protein Data Bank flat-file, 1976) defined an 80-column fixed-width format for atomic coordinates, a deliberate choice for FORTRAN punched-card compatibility, and survived for more than four decades before the wwPDB began phasing it out for new depositions. FASTA (William Pearson, 1985) is the universal >id\nACGT... text format and has not changed materially in 41 years. GenBank flat-file (NCBI, since the 1980s) and its EMBL twin still carry the world’s annotated nucleotide records. CIF (IUCr, 1991) introduced the STAR-grammar-based dictionary approach to crystallography; mmCIF/PDBx (1997+) is its macromolecular specialisation, now the wwPDB deposition format. SMILES (Weininger, Daylight, 1986) is the dominant chemical-structure DSL, with InChI (IUPAC, 2005+) alongside it as the canonical identifier and SELFIES (2020+) emerging as the ML-robust alternative.
The genomics big-data wave (~2008–2014) brought a different design pressure: terabyte-scale per-experiment data, which forced compact binary cousins. FASTQ, which adds per-base Phred quality scores to FASTA-style records, became the standard container for raw reads. SAM (Sequence Alignment Map, 2009) standardised aligned-read storage, with BAM as its BGZF-compressed binary form and CRAM (3.0 in 2014, 3.1 stable since 2022) as the reference-based, further-compressed archival form now defaulted by major sequencing centres. VCF (1000 Genomes, 2010) emerged for variant calling with #CHROM/POS/REF/ALT/INFO/FORMAT records; v4.5 is the current draft on hts-specs. The genome-browser line, BED (UCSC), GFF/GFF3 (Sequence Ontology), and GTF (Ensembl/UCSC), quietly carries most of the field's feature annotation.
Around the central genomics/structural axes orbit several smaller but durable ecosystems: phylogenetics (NEXUS 1997, Newick from the Maddison/Felsenstein era, PhyloXML), mass-spec proteomics (mzML — the HUPO/PSI standard that displaced mzXML), chemistry/biology bridge formats (MOL/SDF, CCD, ligand definitions), and alignment/profile formats (Stockholm, HMMER profiles). The recurring pattern is that bio formats almost never die — installed base, decades of tooling (htslib, Biopython 1.87 from March 2026, BioPerl, EMBOSS, RDKit), and academic citation-permanence keep formats alive long after they are technically obsolete.
In our deep library
None of the bio file-format languages have standalone deep-library notes (they are domain DSLs, not general-purpose languages). Cross-reference:
- bio-workflow — the sibling family: workflow-orchestration DSLs (Snakemake, Nextflow, CWL, WDL) that consume and produce these file formats. Read together for full coverage of “bio computing”.
- api-description — UniProt/NCBI/Ensembl/RCSB all expose REST + GraphQL APIs over these formats; OpenAPI/Smithy describe the access layer.
- scientific — R (Bioconductor), MATLAB (Bioinformatics Toolbox), and Mathematica all have native readers for FASTA/FASTQ/PDB/SDF.
- python — Biopython is the canonical reader stack; pysam wraps htslib for SAM/BAM/CRAM/VCF; RDKit handles SMILES/InChI/MOL; BioPandas wraps PDB/mmCIF.
- notation-spec — STAR/CIF and the chemistry line-notations (SMILES, InChI, SELFIES) are dual-classified there as compact textual notations.
- api-description — PhyloXML, mzML, BioM(JSON), HUPO-PSI XML formats all sit on top of XML/JSON-Schema infrastructure catalogued there.
Tier 3 family table — Sequence & alignment formats
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| FASTA | 1985 | William Pearson, Univ. of Virginia | Plain-text sequence: >id\nACGT... | Universal; unchanged in 41 years; every bio tool reads it | https://en.wikipedia.org/wiki/FASTA_format |
| FASTQ | ~2000 | Wellcome Trust Sanger Institute | FASTA + per-base Phred quality (4-line records) | Universal for short-read sequencing; Illumina/PacBio/ONT default | https://maq.sourceforge.net/fastq.shtml |
| GenBank flat-file | 1982 (GenBank), format formalised 1986+ | NCBI / Los Alamos | Annotated nucleotide record: LOCUS / DEFINITION / ACCESSION / FEATURES / ORIGIN | Active; daily releases continue | https://www.ncbi.nlm.nih.gov/genbank/release/current/ |
| EMBL flat-file | 1982 | EMBL-EBI | Sibling of GenBank with ID/AC/DE/FT/SQ line-tag grammar | Active; INSDC sync with GenBank/DDBJ | https://www.ebi.ac.uk/ena/browser/text-search |
| SAM | 2009 | Heng Li et al., 1000 Genomes / Sanger | Tab-separated aligned-read records (text); 11 mandatory cols + tags | Universal for alignment data | https://samtools.github.io/hts-specs/SAMv1.pdf |
| BAM | 2009 | Heng Li et al. | BGZF-compressed binary SAM | Universal; default working format | https://samtools.github.io/hts-specs/SAMv1.pdf |
| CRAM | 2014 (3.0); 3.1 stable in htslib since ~2022 | EBI / Cochrane / Bonfield | Reference-based compressed alignment; data-type-specific codecs | Active; default archival format at EBI/SRA; v3.1 current | https://samtools.github.io/hts-specs/CRAMv3.pdf |
| Stockholm | 1990s | Sean Eddy, HMMER project | Multiple-sequence-alignment format with markup lines (#=GF/GS/GR/GC) | Active; Pfam/Rfam/Dfam canonical format | https://en.wikipedia.org/wiki/Stockholm_format |
| HMMER profile | 1995+ | Sean Eddy, WashU/HHMI | Profile-HMM text grammar; companion to Stockholm alignments | Active; HMMER 3.4 current | http://hmmer.org/documentation.html |
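
The record shapes in the FASTA and FASTQ rows above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a replacement for Biopython or pysam: the function names `read_fasta` and `read_fastq` are hypothetical, the FASTQ reader assumes strict 4-line records (no wrapped sequence lines), and quality decoding assumes the Phred+33 offset.

```python
from typing import Iterator, List, Tuple


def read_fasta(lines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from strict 4-line records."""
    for i in range(0, len(lines) - 3, 4):
        title, seq, plus, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding, so '!' is Q0 and 'I' is Q40.
        yield title[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical demo data.
fasta_demo = [">seq1 demo\n", "ACGT\n", "GGCC\n", ">seq2\n", "TTAA\n"]
fastq_demo = ["@r1\n", "ACGT\n", "+\n", "II5!\n"]

fasta_records = list(read_fasta(fasta_demo))
fastq_records = list(read_fastq(fastq_demo))
print(fasta_records)  # [('seq1 demo', 'ACGTGGCC'), ('seq2', 'TTAA')]
print(fastq_records)  # [('r1', 'ACGT', [40, 40, 20, 0])]
```

The FASTA reader is deliberately permissive, since the format has no formal grammar, while the FASTQ reader fails loudly on structural problems because silent misalignment of the sequence and quality lines corrupts every downstream record.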
Tier 3 family table — Structural / crystallography
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| PDB (flat-file) | 1976 | Brookhaven National Lab | 80-column fixed-width atomic-coordinate records (HEADER/ATOM/HETATM/CONECT) | Legacy / sunset path: 4-char IDs continue; entries with 5-char extended IDs distributed in mmCIF only; full transition to extended IDs scheduled July 21, 2027 | https://www.wwpdb.org/documentation/file-format |
| mmCIF / PDBx | 1997+ | wwPDB / IUCr | CIF-based macromolecular dictionary; current PDB deposition format (mandatory for crystallographic deposits since 2019) | Canonical; required for new wwPDB deposits | https://mmcif.wwpdb.org/ |
| CIF | 1991 | IUCr (Hall, Allen, Brown) | STAR-grammar-based crystallographic information format for small molecules | Universal for small-molecule crystallography; CIF2 spec stable | https://www.iucr.org/resources/cif |
| STAR | 1991 | Sydney Hall, IUCr | Self-defining Text Archive and Retrieval — the parent grammar of CIF/mmCIF/NMR-STAR | Stable foundation; rarely written by humans, mostly via CIF dialects | https://en.wikipedia.org/wiki/Self-defining_Text_Archive_and_Retrieval |
| NMR-STAR | 1990s | BMRB (Biological Magnetic Resonance Bank), UW-Madison | STAR-dialect dictionary for NMR data deposition | Active; BMRB v3.2 dictionary | https://bmrb.io/standards/ |
| CCDC CSD format | 1965+ | Cambridge Crystallographic Data Centre | Proprietary database format for the Cambridge Structural Database (>1.3M structures) | Active; commercial; CSD-System 2026 release | https://www.ccdc.cam.ac.uk/ |
| PDB CCD (Chemical Component Dictionary) | 2000s | wwPDB | mmCIF-based dictionary of every ligand/residue ever seen in the PDB | Active; updated weekly with PDB releases | https://www.wwpdb.org/data/ccd |
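
To make the "80-column fixed-width" description of the PDB flat-file concrete, here is a minimal sketch that slices an ATOM/HETATM record by column position. The `Atom` dataclass and `parse_atom_line` are hypothetical names, only a subset of the fields is extracted, and the demo coordinates are invented; the column ranges follow the published wwPDB coordinate-section layout as I understand it.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM record by fixed columns (1-based in the spec)."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    line = line.rstrip("\n").ljust(80)  # pad short lines to full width
    return Atom(
        serial=int(line[6:11]),          # columns 7-11
        name=line[12:16].strip(),        # columns 13-16
        res_name=line[17:20].strip(),    # columns 18-20
        chain_id=line[21],               # column 22
        res_seq=int(line[22:26]),        # columns 23-26
        x=float(line[30:38]),            # columns 31-38
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
        element=line[76:78].strip(),     # columns 77-78
    )


# Hypothetical record, assembled with explicit widths so the columns are exact.
demo_line = (
    "ATOM  " + f"{1:>5}" + " " + f"{' N':<4}" + " " + "MET" + " " + "A"
    + f"{1:>4}" + " " * 4
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}"
    + " " * 10 + f"{'N':>2}" + "  "
)

atom = parse_atom_line(demo_line)
print(len(demo_line), atom.name, atom.res_name, atom.chain_id, atom.x)
```

Splitting on whitespace instead of slicing is a common mistake: adjacent fields can touch with no separating space (for example, large negative coordinates), which is exactly why column positions, not delimiters, define the format.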
Tier 3 family table — Genomic features & variants
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| GFF / GFF3 | ~1997 (GFF), 2007 (GFF3) | Sanger Centre / Sequence Ontology | Tab-separated genomic-feature annotation (9 columns; attribute key=value pairs) | Universal for annotation; GFF3 current | https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md |
| GTF | ~2003 | Ensembl/UCSC | GFF dialect with stricter attribute syntax; gene/transcript/exon hierarchy | Universal for transcript annotation; Ensembl/GENCODE primary | https://useast.ensembl.org/info/website/upload/gff.html |
| BED | ~2003 | UCSC Genome Browser (Kent et al.) | Tab-separated browser-extensible records (3–12 cols); BED12 carries blocks | Universal; bedtools ecosystem; bigBed binary variant ubiquitous | https://genome.ucsc.edu/FAQ/FAQformat.html#format1 |
| VCF | 2010 | 1000 Genomes Project / Danecek et al. | Variant-calling DSL: #CHROM/POS/ID/REF/ALT/QUAL/FILTER/INFO/FORMAT/<samples> | Universal; v4.5 draft active on hts-specs; v4.4 widely deployed | https://samtools.github.io/hts-specs/ |
| BCF | 2011 | samtools project | Binary Call Format: BGZF-compressed binary encoding of VCF | Active; v2.2 current | https://samtools.github.io/hts-specs/ |
| PED / FAM / TPED / TFAM | ~2007 | PLINK (Purcell et al., Broad Institute) | Pedigree + genotype text formats for GWAS | Active; PLINK 2.0 native uses PGEN binary but PED/MAP still common | https://www.cog-genomics.org/plink/2.0/formats |
| BIOM | 2011 | Earth Microbiome Project / Caporaso et al. | Biological Observation Matrix; HDF5 (v2) or sparse JSON (v1) for OTU tables | Active in microbiome work | https://biom-format.org/ |
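
The GFF3 and BED rows above differ in a detail that causes many off-by-one bugs: GFF3 coordinates are 1-based and inclusive, while BED is 0-based and half-open. The sketch below parses one GFF3 line, including the percent-encoded `key=value` attribute column, and converts it to a BED3 triple. The names `parse_gff3_line` and `to_bed3` are hypothetical, and the `Note` attribute in the demo line is invented for illustration.

```python
from typing import Dict, List, Tuple
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated GFF3 feature line into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    feature: Dict[str, object] = dict(zip(GFF3_COLUMNS, cols))
    feature["start"] = int(cols[3])  # 1-based, inclusive
    feature["end"] = int(cols[4])    # 1-based, inclusive
    attrs: Dict[str, List[str]] = {}
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # Commas separate multiple values; reserved characters are
            # percent-encoded, so decode only after splitting.
            attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attrs
    return feature


def to_bed3(feature: Dict[str, object]) -> Tuple[str, int, int]:
    """Convert to BED's 0-based, half-open interval."""
    return (feature["seqid"], feature["start"] - 1, feature["end"])


demo = "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN;Note=a%3Bb,c"
gene = parse_gff3_line(demo)
print(gene["attributes"])  # {'ID': ['gene00001'], 'Name': ['EDEN'], 'Note': ['a;b', 'c']}
print(to_bed3(gene))       # ('ctg123', 999, 9000)
```

Only the start coordinate shifts in the conversion: the inclusive 1-based end and the exclusive 0-based end are the same number.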
Tier 3 family table — Chemistry / small molecules
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| SMILES | 1986 | David Weininger, Daylight Chemical | Linear line notation: CC(=O)O for acetic acid | Universal; unchanged in 40 years; canonical-SMILES algorithms vary by toolkit | https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html |
| OpenSMILES | 2007 | OpenSMILES community | Vendor-neutral SMILES specification (resolves Daylight ambiguities) | Active reference standard | http://opensmiles.org/opensmiles.html |
| InChI | 2005 (IUPAC); v1.07 approved 16 July 2024 | IUPAC + InChI Trust | Layered canonical chemical-identifier string (formula/connectivity/H/charge/stereo/isotopes) | Active; v1.07 on GitHub since 2024, MIT-licensed | https://www.inchi-trust.org/ |
| InChIKey | 2007 | IUPAC | Fixed-length hashed InChI (27 chars); database-friendly | Active; ubiquitous in chemistry databases | https://www.inchi-trust.org/about-the-inchi-standard/ |
| SELFIES | 2020 (Krenn et al., MLST 2020) | Aspuru-Guzik group, Toronto + community | Self-Referencing Embedded Strings; every string is a valid molecule | Active; ongoing extensions to polymers/crystals/reactions; widely used for ML generative chemistry | https://github.com/aspuru-guzik-group/selfies |
| MOL / MDL Molfile | 1980s | Molecular Design Limited (later Symyx, Accelrys, BIOVIA) | V2000 (atom/bond table) and V3000 (extended) chemical-structure files | Universal; V3000 supports >999 atoms; default ChemDraw export | https://en.wikipedia.org/wiki/Chemical_table_file |
| SDF | 1990s | MDL | Structure-Data File: concatenated MOL records + property tags (> <name>) | Universal for chemistry datasets (PubChem, ChEMBL exports) | https://en.wikipedia.org/wiki/Chemical_table_file#SDF |
| MOL2 | ~1990 | Tripos | Tripos MOL2 — atom/bond/substructure with explicit Sybyl atom types | Active; common in docking (AutoDock, GOLD) | http://chemyang.ccnu.edu.cn/ccb/server/AIMMS/mol2.pdf |
| Reaction SMILES / RXN | 1990s+ | Daylight (rxn SMILES); MDL (RXN file) | SMILES extension: reactants>agents>products (reactants>>products when there are no agents) | Active; RXN SMILES dominant in ML reaction prediction | https://daylight.com/dayhtml/doc/theory/theory.rxn.html |
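
As a small illustration of the reaction-SMILES row, the Daylight convention separates the three roles with single `>` characters (`reactants>agents>products`), with `.` separating molecules inside each role. The sketch below only splits the string; it does not validate the SMILES themselves, which would need a toolkit such as RDKit. `split_reaction_smiles` is a hypothetical helper, and it assumes plain reaction SMILES with no trailing toolkit-specific extension text.

```python
from typing import List, Tuple


def split_reaction_smiles(rxn: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError("reaction SMILES must contain exactly two '>' characters")
    # An empty role (as in 'A>>B') yields an empty list.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return reactants, agents, products


# Acid-catalysed esterification of acetic acid with ethanol.
with_agent = split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
no_agent = split_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
print(with_agent)  # (['CC(=O)O', 'OCC'], ['[H+]'], ['CC(=O)OCC', 'O'])
print(no_agent)    # (['CC(=O)O', 'OCC'], [], ['CC(=O)OCC', 'O'])
```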
Tier 3 family table — Phylogenetics & MS / other
| Format | First appeared | Origin | Type | Status (2026) | URL |
|---|---|---|---|---|---|
| Newick | 1986 (informal); standardised in NEXUS 1997 | Felsenstein/Maddison “Newick’s lobster house” meeting | Recursive parenthetical tree: ((A,B),(C,D)); with optional branch lengths | Universal for phylogenetic trees | https://evolution.genetics.washington.edu/phylip/newicktree.html |
| NEXUS | 1997 | Maddison, Swofford, Maddison | Block-structured phylogenetics format (TAXA/CHARACTERS/TREES/…); embeds Newick | Active; MrBayes/PAUP*/MEGA native input | https://academic.oup.com/sysbio/article/46/4/590/1629695 |
| PhyloXML | 2009 | Han & Zmasek | XML alternative to Newick with rich annotation | Active but niche; Newick still dominant for size reasons | http://www.phyloxml.org/ |
| mzML | 2008 (HUPO-PSI) | HUPO Proteomics Standards Initiative | XML-based mass-spec data format; superseded mzXML and mzData | Universal for MS data archives (ProteomeXchange/PRIDE) | https://www.psidev.info/mzML |
| mzXML | 2004 (legacy) | Seattle Proteome Center / ISB | Earlier MS format; predecessor to mzML | Legacy; archives still common but new tools prefer mzML | https://en.wikipedia.org/wiki/Mass_spectrometry_data_format#mzXML |
| mzTab | 2014 | HUPO-PSI | Tab-separated proteomics-results format (PSM/peptide/protein/SmallMolecule sections) | Active; v2.0 released 2021 (small-molecule extension) | https://www.psidev.info/mztab |
| MIBI / IBIS / OME-TIFF (cross-list) | 2010s | OME consortium / multiplexed-imaging community | Imaging-mass-cytometry / multiplexed-imaging data formats | Active | https://www.openmicroscopy.org/ome-files/ |
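
Newick's "recursive parenthetical tree" maps directly onto a recursive-descent parser. The following is a minimal sketch under simplifying assumptions: unquoted labels only, no bracketed comments, and the string must end with `;`. The `Node` class, `parse_newick`, and `leaf_names` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a Newick string such as '((A:0.1,B:0.2)AB:0.3,(C,D));'."""
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def subtree() -> Node:
        nonlocal pos
        node = Node()
        if s[pos] == "(":
            pos += 1  # consume '('
            while True:
                node.children.append(subtree())
                if s[pos] == ",":
                    pos += 1
                elif s[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected {s[pos]!r} at offset {pos}")
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        node.name = s[start:pos]  # may be empty for unnamed internal nodes
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            node.length = float(s[start:pos])
        return node

    root = subtree()
    if s[pos] != ";" or pos != len(s) - 1:
        raise ValueError(f"unexpected {s[pos]!r} at offset {pos}")
    return root


def leaf_names(node: Node) -> List[str]:
    """Collect leaf labels in left-to-right order."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,(C,D));")
print(leaf_names(tree))                           # ['A', 'B', 'C', 'D']
print(tree.children[0].name, tree.children[0].length)  # AB 0.3
```

Very deep trees can exceed Python's recursion limit with this approach; production parsers typically use an explicit stack instead.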
Notable threads
- The PDB → mmCIF transition is mid-flight, not finished. A common misconception is that PDB was “deprecated in 2024.” The reality is more granular: crystallographic depositions to wwPDB have required mmCIF since 2019; PDB entries assigned 5-character extended IDs (rolled out as the 4-char namespace exhausts) are distributed in mmCIF only; and the full archive cutover, when wwPDB stops issuing 4-character IDs and the entire archive moves to extended IDs and mmCIF-only distribution, is scheduled for 21 July 2027. The 80-column PDB flat-file format will continue to exist for legacy 4-char entries indefinitely. Software that hard-codes `^.{4}$` PDB ID regexes or 80-column line widths is on the clock.
- FASTA’s 41-year reign and why nothing displaces it. FASTA (1985) is essentially unchanged. The format has no formal grammar, yet its core convention (a `>` header line followed by sequence lines) is read compatibly by Biopython 1.87 (March 2026), BioPerl, EMBOSS, samtools, BWA, BLAST, and protein-language-model preprocessors. The reason nothing displaces it is the Lindy effect: every new tool needs a FASTA reader on day 1 to interoperate with the existing zoo, so no replacement ever achieves escape velocity. FASTQ added quality scores but did not replace FASTA; it sits alongside, used for raw reads while FASTA holds reference and downstream protein/nucleotide sequences.
- SMILES vs InChI vs SELFIES: three different optimisation goals. SMILES (1986) is human-writable, compact, and standard, but non-canonical: the same molecule has many valid SMILES strings depending on traversal order, and toolkits (RDKit, OpenBabel, OEChem) disagree on canonicalisation. InChI (2005, v1.07 approved July 2024 under IUPAC + InChI Trust) is canonical by construction (one molecule, one InChI string), and the InChIKey hash makes it database-indexable. SELFIES (2020, Aspuru-Guzik group, Toronto) solves a third problem: every randomly-generated SELFIES string corresponds to a syntactically valid molecule, so generative ML models cannot produce invalid output. Three formats, three different “valid” definitions: human-readable, canonical, ML-robust.
- The SAM / BAM / CRAM compression ladder. SAM (2009) is the human-readable text alignment format; BAM is its BGZF-compressed binary form (typically 4–6× smaller); CRAM (3.0 in 2014, 3.1 stable in htslib since ~2022) is reference-based: it stores only the differences from a reference genome, so well-aligned data compresses 30–60% better than BAM. CRAM 3.1 added data-type-specific codecs (separate ones for identifiers, quality scores, and sequence) rather than relying on general-purpose compression. EBI and SRA now default to CRAM for archival; htslib 1.23.1 (2026) is the canonical decoder. The lossy quality-binning option in CRAM is controversial because it sacrifices precision for ~2× more compression.
- mzML as the proteomics OOXML moment. Mass-spec data went through three formats in a decade: vendor-binary (Thermo .raw, Bruker .d), then mzXML (Seattle Proteome Center, 2004), then mzML (HUPO-PSI, 2008). mzML won because it had a controlled-vocabulary backbone (PSI-MS CV) and unified the previously split mzXML and mzData communities under one standard. ProteomeXchange repositories such as PRIDE accept mzML as the standard open format for deposited spectra. mzTab (2014, v2.0 in 2021 with small-molecule support) handles the results layer above mzML.
- The INSDC three-database choreography. GenBank (NCBI), EMBL (EBI), and DDBJ (Japan) are the three nodes of the International Nucleotide Sequence Database Collaboration. They synchronise content daily, so every record submitted to one appears in the other two within 24 hours, but each preserves its own slightly different flat-file dialect (`LOCUS`/`DEFINITION`/`ACCESSION` for GenBank, `ID`/`AC`/`DE` line tags for EMBL, similar for DDBJ). This is one of the most successful long-running international data-format federations in any field; it has worked since 1987 and now coordinates petabytes of sequence data.
- The chemistry-biology bridge formats. Biology has to handle small molecules (drug ligands binding to proteins) and chemistry has to handle proteins (target structures), so the formats inevitably overlap. The wwPDB Chemical Component Dictionary (CCD) is mmCIF-formatted and defines every ligand and residue ever observed in the PDB. MOL/SDF files appear inside drug-discovery pipelines that ingest from ChEMBL/PubChem/ZINC and feed docking software (AutoDock Vina, GOLD), which itself reads PDB protein structures and MOL2 ligands. The seam between the two ecosystems is a recurring source of bugs; bond-order ambiguity in PDB ligands is the canonical example, addressed by the CCD’s full chemical specification.
- VCF as a genuine DSL, not just a file format. VCF’s `INFO` column carries semicolon-separated key=value records (`AC=2;AN=4;DP=500;AF=0.5`); `FORMAT` declares per-sample fields (`GT:DP:AD:GQ`); the header declares all available keys with `##INFO=<ID=...,Number=...,Type=...,Description=...>` lines. This is effectively a typed schema language embedded in the file header, closer in spirit to Avro or Protobuf than to TSV. bcftools query, GATK SelectVariants, and snpEff all operate as miniature interpreters over this schema. v4.5 (current draft on hts-specs) extends it to handle modified bases.
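
The "typed schema in the header" idea from the VCF thread can be sketched as a tiny interpreter: read the `##INFO` declarations, then use them to cast the values in a record's INFO column. This is an illustrative sketch, not how pysam or bcftools implement it. The names `read_info_schema` and `parse_info` are hypothetical, the regex assumes the common `ID,Number,Type,Description` key order, and the header lines are example declarations rather than output from a real file.

```python
import re
from typing import Dict, List, Tuple

INFO_RE = re.compile(
    r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="([^"]*)"'
)
CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}


def read_info_schema(header_lines: List[str]) -> Dict[str, Tuple[str, str]]:
    """Map each declared INFO key to its (Number, Type) pair."""
    schema = {}
    for line in header_lines:
        match = INFO_RE.match(line)
        if match:
            key, number, vtype, _description = match.groups()
            schema[key] = (number, vtype)
    return schema


def parse_info(info: str, schema: Dict[str, Tuple[str, str]]) -> Dict[str, object]:
    """Cast one INFO column according to the header schema."""
    parsed: Dict[str, object] = {}
    if info == ".":
        return parsed
    for item in info.split(";"):
        key, sep, raw = item.partition("=")
        # Assumption: undeclared keys are treated as free-form strings.
        number, vtype = schema.get(key, (".", "String"))
        if vtype == "Flag" or not sep:
            parsed[key] = True  # flags carry no value
            continue
        cast = CASTS.get(vtype, str)
        values = [None if v == "." else cast(v) for v in raw.split(",")]
        # Number=1 means a scalar; A, R, G, '.' and counts > 1 stay as lists.
        parsed[key] = values[0] if number == "1" else values
    return parsed


header = [
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">',
    '##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
]
schema = read_info_schema(header)
record_info = parse_info("AC=2;AN=4;DP=500;AF=0.5;DB", schema)
print(record_info)  # {'AC': [2], 'AN': 4, 'DP': 500, 'AF': [0.5], 'DB': True}
```

Note how `AC` and `AF` come back as lists even with a single value: `Number=A` ties their length to the number of ALT alleles, which is the kind of constraint a plain TSV reader has no way to express.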
Citations
- PDB Format Description (wwPDB): https://www.wwpdb.org/documentation/file-format
- PDBx/mmCIF dictionary (wwPDB): https://mmcif.wwpdb.org/
- wwPDB extended IDs / July 2027 transition (PDBj): https://pdbj.org/news/ExtensionCCDCodes?lang=en
- PDBx/mmCIF mandatory crystallographic deposition (PMC): https://pmc.ncbi.nlm.nih.gov/articles/PMC6465986/
- IUCr CIF resources: https://www.iucr.org/resources/cif
- BMRB NMR-STAR standards: https://bmrb.io/standards/
- FASTA description (NCBI BLAST docs): https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
- FASTQ format (Cock et al., NAR 2010): https://academic.oup.com/nar/article/38/6/1767/3112533
- HTS-specs (SAM/BAM/CRAM/VCF/BCF): https://samtools.github.io/hts-specs/
- CRAM 3.1 paper (Bonfield, Bioinformatics 2022): https://academic.oup.com/bioinformatics/article/38/6/1497/6499262
- htslib releases: https://github.com/samtools/htslib/releases
- VCF 4.5 draft (hts-specs): https://github.com/samtools/hts-specs/blob/master/VCFv4.5.draft.tex
- GFF3 specification (Sequence Ontology): https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
- UCSC BED format: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
- Newick format (Felsenstein): https://evolution.genetics.washington.edu/phylip/newicktree.html
- NEXUS paper (Maddison/Swofford/Maddison, Syst Biol 1997): https://academic.oup.com/sysbio/article/46/4/590/1629695
- SMILES (Daylight): https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
- OpenSMILES specification: http://opensmiles.org/opensmiles.html
- InChI Trust (v1.07 release, 16 July 2024): https://www.inchi-trust.org/iupac-inchi-moves-to-github-to-support-sustainable-chemical-standards-development/
- SELFIES paper (Krenn et al., MLST 2020): https://iopscience.iop.org/article/10.1088/2632-2153/aba947
- SELFIES library: https://github.com/aspuru-guzik-group/selfies
- HUPO-PSI mzML: https://www.psidev.info/mzML
- HUPO-PSI mzTab: https://www.psidev.info/mztab
- Stockholm format (Pfam): https://en.wikipedia.org/wiki/Stockholm_format
- HMMER documentation: http://hmmer.org/documentation.html
- Biopython 1.87 (March 2026 release): https://biopython.org/wiki/Download
- INSDC (international nucleotide collaboration): https://www.insdc.org/
- BIOM format: https://biom-format.org/
- PLINK formats: https://www.cog-genomics.org/plink/2.0/formats