Bioinformatics & Life-Sciences File-Format Languages Family Index


type: language-family-index
family: bio-fileformats
languages_catalogued: 28
tags: [language-reference, family-index, bio-fileformats, fasta, fastq, vcf, sam, bam, cram, pdb, mmcif, cif, star, smiles, inchi, selfies, gff, gtf, bed, newick, nexus, mzml]

Bioinformatics & Life-Sciences File-Formats — Family Index

Family overview

This family covers the textual data formats that underpin biology and structural science: the on-disk grammars that flow through every sequencing pipeline, crystallography refinement, and cheminformatics search. Unlike the workflow-orchestration sibling (bio-workflow: Snakemake/Nextflow/CWL/WDL), these are the data the workflows transform. Most are real grammars with formal specifications, parsers, validators, and 30–50-year installed bases; a handful (FASTA, PDB, GenBank flat-file, SMILES, CIF) have been in continuous use since the 1970s, 1980s, or early 1990s and remain canonical in 2026.

The lineage is striking. PDB (Protein Data Bank flat-file, 1976) defined an 80-column fixed-width format for atomic coordinates, a deliberate choice for FORTRAN punched-card compatibility, and remained the primary deposition format for more than four decades before the wwPDB began phasing it out for new depositions. FASTA (William Pearson, 1985) is the universal >id\nACGT... text format and has not changed materially in 41 years. GenBank flat-file (NCBI, since the 1980s) and its EMBL twin still carry the world’s annotated nucleotide records. CIF (IUCr, 1991) introduced the STAR-grammar-based dictionary approach to crystallography; mmCIF/PDBx (1997+) is its macromolecular specialisation, now the wwPDB deposition format. SMILES (Weininger, Daylight, 1986) is the dominant chemical-structure DSL, with InChI (IUPAC, 2005+) sitting alongside it as the canonical identifier and SELFIES (2020+) emerging as the ML-robust alternative.

The genomics big-data wave (~2008–2014) brought a different design pressure: terabyte-scale per-experiment data, which forced compact binary cousins. FASTQ added per-base Phred quality to FASTA for Illumina/Sanger reads. SAM (Sequence Alignment Map, 2009) standardised aligned-read storage, with BAM as its BGZF-compressed binary form and CRAM (3.0 in 2014, 3.1 stable since 2022) as the reference-based, further-compressed archival form that major sequencing centres now use by default. VCF (1000 Genomes, 2010) emerged for variant calling with #CHROM/POS/REF/ALT/INFO/FORMAT records; v4.5 is the current draft on hts-specs. The genome-browser line of BED (UCSC), GFF/GFF3 (Sequence Ontology), and GTF (Ensembl/UCSC) carries the feature annotation behind essentially every genome browser and annotation pipeline.
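
The reference-based idea behind CRAM can be illustrated with a toy sketch. This is not CRAM's actual codec or container layout; it only shows why a read that matches the reference costs almost nothing to store. The function names and the tuple layout are hypothetical, and the sketch assumes ungapped alignments (substitutions only, no indels or clipping).

```python
from typing import List, Tuple

# (offset within the read, observed base) for each mismatch against the reference
Substitution = Tuple[int, str]
Encoded = Tuple[int, int, List[Substitution]]


def encode_read(reference: str, position: int, read: str) -> Encoded:
    """Store a read as (start, length, mismatches) relative to the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = [
        (i, base)
        for i, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return position, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by copying the reference window and applying mismatches."""
    position, length, diffs = encoded
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "ACGAACGT"  # aligned at 0-based position 4, one mismatch
encoded = encode_read(reference, 4, read)
print(encoded)                          # (4, 8, [(3, 'A')])
print(decode_read(reference, encoded))  # ACGAACGT
```

A perfectly matching read reduces to a start position and a length, which is the intuition behind the better compression of well-aligned data; decoding always needs the same reference that was used for encoding.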

Around the central genomics/structural axes orbit several smaller but durable ecosystems: phylogenetics (NEXUS 1997, Newick from the Maddison/Felsenstein era, PhyloXML), mass-spec proteomics (mzML — the HUPO/PSI standard that displaced mzXML), chemistry/biology bridge formats (MOL/SDF, CCD, ligand definitions), and alignment/profile formats (Stockholm, HMMER profiles). The recurring pattern is that bio formats almost never die — installed base, decades of tooling (htslib, Biopython 1.87 from March 2026, BioPerl, EMBOSS, RDKit), and academic citation-permanence keep formats alive long after they are technically obsolete.

In our deep library

None of the bio file-format languages have standalone deep-library notes (they are domain DSLs, not general-purpose languages). Cross-reference:

  • bio-workflow — the sibling family: workflow-orchestration DSLs (Snakemake, Nextflow, CWL, WDL) that consume and produce these file formats. Read together for full coverage of “bio computing”.
  • api-description — UniProt/NCBI/Ensembl/RCSB all expose REST + GraphQL APIs over these formats; OpenAPI/Smithy describe the access layer.
  • scientific — R (Bioconductor), MATLAB (Bioinformatics Toolbox), and Mathematica all have native readers for FASTA/FASTQ/PDB/SDF.
  • python — Biopython is the canonical reader stack; pysam wraps htslib for SAM/BAM/CRAM/VCF; RDKit handles SMILES/InChI/MOL; BioPandas wraps PDB/mmCIF.
  • notation-spec — STAR/CIF and the chemistry line-notations (SMILES, InChI, SELFIES) are dual-classified there as compact textual notations.
  • api-description — PhyloXML, mzML, BIOM (JSON), and the HUPO-PSI XML formats all sit on top of XML/JSON-Schema infrastructure catalogued there.

Tier 3 family table — Sequence & alignment formats

Format | First appeared | Origin | Type | Status (2026) | URL
FASTA | 1985 | William Pearson, Univ. of Virginia | Plain-text sequence: >id\nACGT... | Universal; unchanged in 41 years; every bio tool reads it | https://en.wikipedia.org/wiki/FASTA_format
FASTQ | ~2009 | Wellcome Trust Sanger Institute | FASTA + per-base Phred quality (4-line records) | Universal for short-read sequencing; Illumina/PacBio/ONT default | https://maq.sourceforge.net/fastq.shtml
GenBank flat-file | 1982 (GenBank), format formalised 1986+ | NCBI / Los Alamos | Annotated nucleotide record: LOCUS / DEFINITION / ACCESSION / FEATURES / ORIGIN | Active; daily releases continue | https://www.ncbi.nlm.nih.gov/genbank/release/current/
EMBL flat-file | 1982 | EMBL-EBI | Sibling of GenBank with ID/AC/DE/FT/SQ line-tag grammar | Active; INSDC sync with GenBank/DDBJ | https://www.ebi.ac.uk/ena/browser/text-search
SAM | 2009 | Heng Li et al., 1000 Genomes / Sanger | Tab-separated aligned-read records (text); 11 mandatory cols + tags | Universal for alignment data | https://samtools.github.io/hts-specs/SAMv1.pdf
BAM | 2009 | Heng Li et al. | BGZF-compressed binary SAM | Universal; default working format | https://samtools.github.io/hts-specs/SAMv1.pdf
CRAM | 2014 (3.0); 3.1 stable in htslib since ~2022 | EBI / Cochrane / Bonfield | Reference-based compressed alignment; data-type-specific codecs | Active; default archival format at EBI/SRA; v3.1 current | https://samtools.github.io/hts-specs/CRAMv3.pdf
Stockholm | 1990s | Sean Eddy, HMMER project | Multiple-sequence-alignment format with markup lines (#=GF/GS/GR/GC) | Active; Pfam/Rfam/Dfam canonical format | https://en.wikipedia.org/wiki/Stockholm_format
HMMER profile | 1995+ | Sean Eddy, WashU/HHMI | Profile-HMM text grammar; companion to Stockholm alignments | Active; HMMER 3.4 current | http://hmmer.org/documentation.html
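
The FASTA and FASTQ rows describe formats simple enough to read in a few lines. The sketch below is a minimal illustration, not a replacement for Biopython or htslib. It assumes the first whitespace-delimited word after ">" is the record ID, that FASTQ records are strictly four lines with no wrapping, and that qualities use the Phred+33 ASCII offset; the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line has no identifier")
            current = fields[0]
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[current] += line  # sequences may wrap over many lines
    return records


def parse_fastq(lines: Iterable[str], offset: int = 33) -> List[Tuple[str, str, List[int]]]:
    """Return (name, sequence, Phred scores) for strict 4-line records."""
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if len(rows) % 4:
        raise ValueError("input is not a whole number of 4-line records")
    out = []
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        out.append((header[1:], seq, [ord(c) - offset for c in qual]))
    return out


fasta = parse_fasta([">seq1 first example", "ACGT", "ACGT", "", ">seq2", "MKV"])
fastq = parse_fastq(["@read1\n", "ACGT\n", "+\n", "II5!\n"])
print(fasta)     # {'seq1': 'ACGTACGT', 'seq2': 'MKV'}
print(fastq[0])  # ('read1', 'ACGT', [40, 40, 20, 0])
```

Real-world files add wrinkles this sketch ignores, such as wrapped FASTQ sequence lines and older quality offsets, which is one reason established readers are preferred in pipelines.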

Tier 3 family table — Structural / crystallography

Format | First appeared | Origin | Type | Status (2026) | URL
PDB (flat-file) | 1976 | Brookhaven National Lab | 80-column fixed-width atomic-coordinate records (HEADER/ATOM/HETATM/CONECT) | Legacy / sunset path: 4-char IDs continue; entries with extended IDs (pdb_ prefix plus 8 characters) distributed in mmCIF only; full transition to extended IDs scheduled July 21, 2027 | https://www.wwpdb.org/documentation/file-format
mmCIF / PDBx | 1997+ | wwPDB / IUCr | CIF-based macromolecular dictionary; current PDB deposition format (mandatory for crystallographic deposits since 2019) | Canonical; required for new wwPDB deposits | https://mmcif.wwpdb.org/
CIF | 1991 | IUCr (Hall, Allen, Brown) | STAR-grammar-based crystallographic information format for small molecules | Universal for small-molecule crystallography; CIF2 spec stable | https://www.iucr.org/resources/cif
STAR | 1991 | Sydney Hall, IUCr | Self-defining Text Archive and Retrieval: the parent grammar of CIF/mmCIF/NMR-STAR | Stable foundation; rarely written by humans, mostly via CIF dialects | https://en.wikipedia.org/wiki/Self-defining_Text_Archive_and_Retrieval
NMR-STAR | 1990s | BMRB (Biological Magnetic Resonance Bank), UW-Madison | STAR-dialect dictionary for NMR data deposition | Active; BMRB v3.2 dictionary | https://bmrb.io/standards/
CCDC CSD format | 1965+ | Cambridge Crystallographic Data Centre | Proprietary database format for the Cambridge Structural Database (>1.3M structures) | Active; commercial; CSD-System 2026 release | https://www.ccdc.cam.ac.uk/
PDB CCD (Chemical Component Dictionary) | 2000s | wwPDB | mmCIF-based dictionary of every ligand/residue ever seen in the PDB | Active; updated weekly with PDB releases | https://www.wwpdb.org/data/ccd
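
Code that validates PDB IDs is the most common casualty of the extended-ID transition noted in the PDB row. The sketch below accepts both shapes and normalises them to one form. It assumes the extended shape is the "pdb_" prefix followed by eight alphanumeric characters, and that a legacy ID maps to its extended form by zero-padding (1abc becomes pdb_00001abc); both are assumptions to check against current wwPDB documentation, and the function name is hypothetical.

```python
import re

# Legacy IDs: a digit 1-9 followed by three alphanumeric characters.
LEGACY_ID = re.compile(r"[1-9][A-Za-z0-9]{3}")
# Extended IDs (assumed shape): 'pdb_' plus eight alphanumeric characters.
EXTENDED_ID = re.compile(r"pdb_[0-9A-Za-z]{8}", re.IGNORECASE)


def normalize_pdb_id(raw: str) -> str:
    """Return the lowercase extended form for either ID shape."""
    candidate = raw.strip()
    if EXTENDED_ID.fullmatch(candidate):
        return candidate.lower()
    if LEGACY_ID.fullmatch(candidate):
        # Assumed mapping: pad the legacy ID with zeros to eight characters.
        return "pdb_0000" + candidate.lower()
    raise ValueError(f"not a recognised PDB ID: {raw!r}")


print(normalize_pdb_id("1ABC"))          # pdb_00001abc
print(normalize_pdb_id("pdb_00001abc"))  # pdb_00001abc
```

Normalising at the boundary means the rest of a codebase only ever sees one ID shape, so the legacy pattern lives in exactly one place.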

Tier 3 family table — Genomic features & variants

Format | First appeared | Origin | Type | Status (2026) | URL
GFF / GFF3 | ~1997 (GFF), 2007 (GFF3) | Sanger Centre / Sequence Ontology | Tab-separated genomic-feature annotation (9 columns; attribute key=value pairs) | Universal for annotation; GFF3 current | https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
GTF | ~2003 | Ensembl/UCSC | GFF dialect with stricter attribute syntax; gene/transcript/exon hierarchy | Universal for transcript annotation; Ensembl/GENCODE primary | https://useast.ensembl.org/info/website/upload/gff.html
BED | ~2003 | UCSC Genome Browser (Kent et al.) | Tab-separated browser-extensible records (3–12 cols); BED12 carries blocks | Universal; bedtools ecosystem; bigBed binary variant ubiquitous | https://genome.ucsc.edu/FAQ/FAQformat.html#format1
VCF | 2010 | 1000 Genomes Project / Danecek et al. | Variant-calling DSL: #CHROM/POS/ID/REF/ALT/QUAL/FILTER/INFO/FORMAT/<samples> | Universal; v4.5 draft active on hts-specs; v4.4 widely deployed | https://samtools.github.io/hts-specs/
BCF | 2011 | samtools project | Binary Call Format: BGZF-compressed binary encoding of VCF | Active; v2.2 current | https://samtools.github.io/hts-specs/
PED / FAM / TPED / TFAM | ~2007 | PLINK (Purcell et al., Broad Institute) | Pedigree + genotype text formats for GWAS | Active; PLINK 2.0 native uses PGEN binary but PED/MAP still common | https://www.cog-genomics.org/plink/2.0/formats
BIOM | 2011 | Earth Microbiome Project / Caporaso et al. | Biological Observation Matrix; HDF5 (v2) or sparse JSON (v1) for OTU tables | Active in microbiome work | https://biom-format.org/
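
The GFF3 row's nine-column layout with key=value attributes can be read as follows. This is a minimal sketch, not a validating parser. It assumes GFF3 conventions of percent-encoded reserved characters and comma-separated multi-values in column 9; the class and function names are hypothetical, and the example line is made up.

```python
from typing import Dict, List, NamedTuple, Optional
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: str
    attributes: Dict[str, List[str]]


def parse_gff3_line(line: str) -> Optional[Gff3Feature]:
    """Parse one feature line; return None for blank lines and '#' lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes: Dict[str, List[str]] = {}
    for pair in cols[8].split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        # Split on ',' first, then decode, so encoded commas stay inside values.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    score = None if cols[5] == "." else float(cols[5])
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       score, cols[6], cols[7], attributes)


feature = parse_gff3_line(
    "chr1\tsketch\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=my%20gene;Alias=a,b"
)
print(feature.attributes)
# {'ID': ['gene1'], 'Name': ['my gene'], 'Alias': ['a', 'b']}
```

GTF uses a different attribute syntax in the same column, so this function should not be pointed at GTF files unchanged.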

Tier 3 family table — Chemistry / small molecules

Format | First appeared | Origin | Type | Status (2026) | URL
SMILES | 1986 | David Weininger, Daylight Chemical | Linear line notation: CC(=O)O for acetic acid | Universal; unchanged in 40 years; canonical-SMILES algorithms vary by toolkit | https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
OpenSMILES | 2007 | OpenSMILES community | Vendor-neutral SMILES specification (resolves Daylight ambiguities) | Active reference standard | http://opensmiles.org/opensmiles.html
InChI | 2005 (IUPAC); v1.07 approved 16 July 2024 | IUPAC + InChI Trust | Layered canonical chemical-identifier string (formula/connectivity/H/charge/stereo/isotopes) | Active; v1.07 on GitHub since 2024, MIT-licensed | https://www.inchi-trust.org/
InChIKey | 2007 | IUPAC | Fixed-length hashed InChI (27 chars); database-friendly | Active; ubiquitous in chemistry databases | https://www.inchi-trust.org/about-the-inchi-standard/
SELFIES | 2020 (Krenn et al., MLST 2020) | Aspuru-Guzik group, Toronto + community | Self-Referencing Embedded Strings; every string is a valid molecule | Active; ongoing extensions to polymers/crystals/reactions; widely used for ML generative chemistry | https://github.com/aspuru-guzik-group/selfies
MOL / MDL Molfile | 1980s | Molecular Design Limited (later Symyx, Accelrys, BIOVIA) | V2000 (atom/bond table) and V3000 (extended) chemical-structure files | Universal; V3000 supports >999 atoms; default ChemDraw export | https://en.wikipedia.org/wiki/Chemical_table_file
SDF | 1990s | MDL | Structure-Data File: concatenated MOL records + property tags (> <name>) | Universal for chemistry datasets (PubChem, ChEMBL exports) | https://en.wikipedia.org/wiki/Chemical_table_file#SDF
MOL2 | ~1990 | Tripos | Tripos MOL2: atom/bond/substructure with explicit Sybyl atom types | Active; common in docking (AutoDock, GOLD) | http://chemyang.ccnu.edu.cn/ccb/server/AIMMS/mol2.pdf
Reaction SMILES / RXN | 1990s+ | Daylight (rxn SMILES); MDL (RXN file) | SMILES extension: reactants>agents>products (reactants>>products when no agents are listed) | Active; RXN SMILES dominant in ML reaction prediction | https://daylight.com/dayhtml/doc/theory/theory.rxn.html
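
The fixed 27-character shape of an InChIKey makes a cheap sanity check possible before a database lookup. The sketch below tests shape only: it cannot confirm that a key is the hash of a real structure, and generating a key requires the InChI software or a toolkit that wraps it. The function names are hypothetical; the block layout (14 letters, hyphen, 10 letters, hyphen, 1 letter) is the standard one, and the example key is the commonly published InChIKey for acetic acid.

```python
import re

# 14 letters (connectivity hash), 10 letters (remaining layers plus
# flag and version characters), 1 letter (protonation indicator).
INCHIKEY_SHAPE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """True when the string has the 27-character InChIKey shape."""
    return INCHIKEY_SHAPE.fullmatch(text) is not None


def connectivity_block(key: str) -> str:
    """Return the first 14-letter block, shared by keys with the same skeleton."""
    if not looks_like_inchikey(key):
        raise ValueError(f"not InChIKey-shaped: {key!r}")
    return key.split("-")[0]


acetic_acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
print(len(acetic_acid))                  # 27
print(looks_like_inchikey(acetic_acid))  # True
print(looks_like_inchikey("CC(=O)O"))    # False (that is a SMILES string)
print(connectivity_block(acetic_acid))   # QTBSBXVTEAMEQO
```

Grouping records by the first block is a common way to find entries that share a skeleton but differ in stereochemistry or isotopes.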

Tier 3 family table — Phylogenetics & MS / other

Format | First appeared | Origin | Type | Status (2026) | URL
Newick | 1986 (informal); standardised in NEXUS 1997 | Felsenstein/Maddison “Newick’s lobster house” meeting | Recursive parenthetical tree: ((A,B),(C,D)); with optional branch lengths | Universal for phylogenetic trees | https://evolution.genetics.washington.edu/phylip/newicktree.html
NEXUS | 1997 | Maddison, Swofford, Maddison | Block-structured phylogenetics format (TAXA/CHARACTERS/TREES/…); embeds Newick | Active; MrBayes/PAUP*/MEGA native input | https://academic.oup.com/sysbio/article/46/4/590/1629695
PhyloXML | 2009 | Han & Zmasek | XML alternative to Newick with rich annotation | Active but niche; Newick still dominant for size reasons | http://www.phyloxml.org/
mzML | 2008 (HUPO-PSI) | HUPO Proteomics Standards Initiative | XML-based mass-spec data format; superseded mzXML and mzData | Universal for MS data archives (ProteomeXchange/PRIDE) | https://www.psidev.info/mzML
mzXML | 2004 (legacy) | Seattle Proteome Center / ISB | Earlier MS format; predecessor to mzML | Legacy; archives still common but new tools prefer mzML | https://en.wikipedia.org/wiki/Mass_spectrometry_data_format#mzXML
mzTab | 2014 | HUPO-PSI | Tab-separated proteomics-results format (PSM/peptide/protein/SmallMolecule sections) | Active; v2.0 released 2021 (small-molecule extension) | https://www.psidev.info/mztab
MIBI / IBIS / OME-TIFF (cross-list) | 2010s | OME consortium / multiplexed-imaging community | Imaging-mass-cytometry / multiplexed-imaging data formats | Active | https://www.openmicroscopy.org/ome-files/
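
Newick's recursive parenthetical structure maps directly onto a recursive-descent parser. The sketch below is a minimal illustration with hypothetical names. It assumes unquoted labels, no bracketed comments, and no whitespace inside the tree, all of which real Newick files may contain.

```python
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    name: str
    length: Optional[float]
    children: List["Node"]


def _parse_subtree(text: str, pos: int) -> Tuple[Node, int]:
    """Parse one subtree starting at pos; return it with the next position."""
    children: List[Node] = []
    if text[pos] == "(":
        pos += 1
        while True:
            child, pos = _parse_subtree(text, pos)
            children.append(child)
            if text[pos] == ",":
                pos += 1
            elif text[pos] == ")":
                pos += 1
                break
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    start = pos
    while text[pos] not in "(),:;":  # label runs to the next structural character
        pos += 1
    name = text[start:pos]
    length = None
    if text[pos] == ":":
        start = pos + 1
        pos = start
        while text[pos] not in "(),;":
            pos += 1
        length = float(text[start:pos])
    return Node(name, length, children), pos


def parse_newick(text: str) -> Node:
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    root, pos = _parse_subtree(text, 0)
    if text[pos] != ";":
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return root


def leaf_names(node: Node) -> List[str]:
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,(C,D));")
print(leaf_names(tree))                           # ['A', 'B', 'C', 'D']
print(tree.children[0].name, tree.children[0].length)  # AB 0.3
```

The terminating ";" doubles as a sentinel here, which is why the scanning loops never run off the end of the string.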

Notable threads

  • The PDB → mmCIF transition is mid-flight, not finished. A common misconception is that PDB was “deprecated in 2024.” The reality is more granular: crystallographic depositions to wwPDB have required mmCIF since 2019; PDB entries assigned extended IDs (the pdb_ prefix plus eight characters, e.g. pdb_00001abc, rolled out as the 4-character namespace is exhausted) are distributed in mmCIF only; and the full archive cutover, when wwPDB stops issuing 4-character IDs and the entire archive moves to extended IDs with mmCIF-only distribution, is scheduled for 21 July 2027. The 80-column PDB flat-file format will continue to exist for legacy 4-character entries indefinitely. Software that hard-codes ^.{4}$ PDB ID regexes or 80-column line widths is on the clock.

  • FASTA’s 41-year reign and why nothing displaces it. FASTA (1985) is genuinely unchanged. The format has no formal grammar, yet it is read compatibly by Biopython 1.87 (March 2026), BioPerl, EMBOSS, samtools, BWA, BLAST, and the preprocessing code of protein language models. The reason nothing displaces it is the Lindy effect: every new tool needs a FASTA reader on day 1 to interoperate with the existing zoo, so no replacement ever achieves escape velocity. FASTQ added quality scores but did not replace FASTA; it sits alongside, used for raw reads while FASTA holds reference and downstream protein/nucleotide sequences.

  • SMILES vs InChI vs SELFIES — three different optimisation goals. SMILES (1986) is human-writable, compact, and standard, but non-canonical: the same molecule has many valid SMILES strings depending on traversal order, and toolkits (RDKit, OpenBabel, OEChem) disagree on canonicalisation. InChI (2005, v1.07 approved July 2024 under IUPAC + InChI Trust) is canonical by construction — one molecule, one InChI string — and the InChIKey hash makes it database-indexable. SELFIES (2020, Aspuru-Guzik group, Toronto) solves a third problem: every randomly-generated SELFIES string corresponds to a syntactically valid molecule, so generative ML models cannot produce invalid output. Three formats, three different “valid” definitions: human-readable, canonical, ML-robust.

  • The SAM / BAM / CRAM compression ladder. SAM (2009) is the human-readable text alignment format; BAM is its BGZF-gzip binary form (typically 4–6× smaller); CRAM (3.0 in 2014, 3.1 stable in htslib since ~2022) is reference-based — it stores only the differences from a reference genome, so well-aligned data compresses 30–60% better than BAM. CRAM 3.1 added data-type-specific codecs (separate for identifiers, quality scores, sequence) rather than relying on general-purpose compression. EBI and SRA now default to CRAM for archival; htslib 1.23.1 (2026) is the canonical decoder. The lossy quality-binning option in CRAM is controversial — it sacrifices precision for ~2× more compression.

  • mzML as the proteomics OOXML moment. Mass-spec data went through three formats in a decade: vendor-binary (Thermo .raw, Bruker .d), then mzXML (Seattle Proteome Center, 2004), then mzML (HUPO-PSI, 2008). mzML won because it had a controlled-vocabulary backbone (PSI-MS CV) and unified the previously-bifurcated mzXML/mzData community under one standard. ProteomeXchange repositories such as PRIDE accept mzML as the open spectra format for deposition. mzTab (2014, v2.0 in 2021 with small-molecule support) handles the results layer above mzML.

  • The INSDC three-database choreography. GenBank (NCBI), EMBL (EBI), and DDBJ (Japan) are the three nodes of the International Nucleotide Sequence Database Collaboration. They synchronise content daily, so every record submitted to one appears in the other two within 24 hours, but each preserves its own slightly different flat-file dialect (LOCUS/DEFINITION/ACCESSION for GenBank, ID/AC/DE line tags for EMBL, similar for DDBJ). This is one of the most successful long-running international data-format federations in any field; it has worked since 1987 and now coordinates petabytes of sequence data.

  • The chemistry-biology bridge formats. Biology has to handle small molecules (drug ligands binding to proteins) and chemistry has to handle proteins (target structures), so the formats inevitably overlap. The wwPDB Chemical Component Dictionary (CCD) is mmCIF-formatted and defines every ligand and residue ever observed in the PDB. MOL/SDF files appear inside drug-discovery pipelines that ingest from ChEMBL/PubChem/ZINC and feed docking software (AutoDock Vina, GOLD) which itself reads PDB protein structures and MOL2 ligands. The seam between the two ecosystems is a recurring source of bugs — bond-order ambiguity in PDB ligands is the canonical example, addressed by the CCD’s full chemical specification.

  • VCF as a genuine DSL, not just a file format. VCF’s INFO column carries semicolon-separated key=value records (AC=2;AN=4;DP=500;AF=0.5); FORMAT declares per-sample fields (GT:DP:AD:GQ); the header declares all available keys with ##INFO=<ID=...,Number=...,Type=...,Description=...> lines. This is effectively a typed schema language embedded in the file header — closer in spirit to Avro or Protobuf than to TSV. bcftools query, GATK SelectVariants, and snpEff all operate as miniature interpreters over this schema. v4.5 (current draft on hts-specs) extends it to handle modified bases.
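
The typed-schema reading of VCF in the last thread can be sketched in a few lines: read the ##INFO declarations, then use them to cast the values in a record's INFO column. This is a simplified illustration with hypothetical function names, not how bcftools or GATK implement it. It assumes well-formed header lines, ignores missing values ("."), and treats any Number other than 1 as a list.

```python
import re
from typing import Any, Dict, Iterable

# Captures ID, Number and Type from a '##INFO=<...>' header line.
INFO_HEADER = re.compile(r"##INFO=<ID=([^,>]+),Number=([^,>]+),Type=([^,>]+)")
CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}


def read_info_schema(header_lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Collect the declared Number and Type for every INFO key."""
    schema: Dict[str, Dict[str, str]] = {}
    for line in header_lines:
        match = INFO_HEADER.match(line.strip())
        if match:
            key, number, value_type = match.groups()
            schema[key] = {"number": number, "type": value_type}
    return schema


def parse_info(field: str, schema: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Cast one INFO column according to the header schema."""
    if field == ".":
        return {}
    result: Dict[str, Any] = {}
    for entry in field.split(";"):
        key, _, raw = entry.partition("=")
        declared = schema.get(key)
        if declared is None:
            raise KeyError(f"INFO key {key!r} is not declared in the header")
        if declared["type"] == "Flag":
            result[key] = True  # flags carry no value
            continue
        cast = CASTS[declared["type"]]
        values = [cast(v) for v in raw.split(",")]
        result[key] = values[0] if declared["number"] == "1" else values
    return result


header = [
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">',
    '##INFO=<ID=AN,Number=1,Type=Integer,Description="Total alleles">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="Known variant">',
]
schema = read_info_schema(header)
info = parse_info("AC=2;AN=4;DP=500;AF=0.5;DB", schema)
print(info)  # {'AC': [2], 'AN': 4, 'DP': 500, 'AF': [0.5], 'DB': True}
```

Rejecting undeclared keys is a deliberate choice in this sketch: it makes the header behave like a schema rather than a comment, which is the point the thread makes.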

Citations