Bioinformatics & Life-Sciences File-Format Languages Family Index

type: language-family-index
family: bio-fileformats
languages_catalogued: 28
tags: [language-reference, family-index, bio-fileformats, fasta, fastq, vcf, sam, bam, cram, pdb, mmcif, cif, star, smiles, inchi, selfies, gff, gtf, bed, newick, nexus, mzml]

Bioinformatics & Life-Sciences File-Formats — Family Index

Family overview

This family covers the data formats that underpin biology and structural science: the on-disk grammars that flow through every sequencing pipeline, crystallography refinement, and cheminformatics search. Most are textual; a few (BAM, CRAM, BCF) are binary encodings of textual siblings. Unlike the workflow-orchestration sibling (bio-workflow: Snakemake/Nextflow/CWL/WDL), these are the data the workflows transform. Most are real grammars with formal specifications, parsers, validators, and installed bases 30 to 50 years old; a handful (FASTA, PDB, GenBank flat-file, SMILES) have been in continuous use since the 1970s and 80s, with CIF following in 1991, and all remain canonical in 2026.

The lineage is striking. PDB (Protein Data Bank flat-file, 1976) defined an 80-column fixed-width format for atomic coordinates, a deliberate choice for FORTRAN punched-card compatibility, and remained the deposition format for more than four decades before the wwPDB began phasing it out for new depositions (mmCIF has been mandatory for crystallographic deposits since 2019). FASTA (William Pearson, 1985) is the universal >id\nACGT... text format and has not changed materially in 41 years. GenBank flat-file (NCBI, since the 1980s) and its EMBL twin still carry the world’s annotated nucleotide records. CIF (IUCr, 1991) introduced the STAR-grammar-based dictionary approach to crystallography; mmCIF/PDBx (1997+) is its macromolecular specialisation, now the wwPDB deposition format. SMILES (Weininger, Daylight, 1986) is the dominant chemical-structure DSL, with InChI (IUPAC, 2005+) layered on top as the canonical identifier and SELFIES (2020+) emerging as the ML-robust alternative.
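
The >id\nACGT... layout is simple enough to read in a few lines. The following is a minimal sketch, not a replacement for a library reader such as Biopython's; the function name parse_fasta and the choice to use the first whitespace-delimited word of the header as the identifier are assumptions made for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into {identifier: sequence}.

    Assumption: the identifier is the first word after '>'; the rest of
    the header line (the free-text description) is discarded.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = ""
        elif current is not None:
            # Sequence data may be wrapped over many lines; join them.
            records[current] += line
    return records
```

Because the function accepts any iterable of lines, it works on an open file handle as well as on an in-memory list.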

The genomics big-data wave (~2008–2014) brought a different design pressure: terabyte-scale per-experiment data, which forced compact binary cousins. FASTQ, which adds per-base Phred quality scores to FASTA, became the standard container for raw Illumina/Sanger reads. SAM (Sequence Alignment Map, 2009) standardised aligned-read storage, with BAM as its BGZF-compressed binary form and CRAM (3.0 in 2014, 3.1 stable since 2022) as the reference-based, more heavily compressed archival form that major sequencing centres now default to. VCF (1000 Genomes, 2010) emerged for variant calling, with #CHROM/POS/REF/ALT/INFO/FORMAT records; v4.5 is the current draft on hts-specs. The genome-browser line of BED (UCSC), GFF/GFF3 (Sequence Ontology), and GTF (Ensembl/UCSC) quietly carries most of the feature annotation exchanged between tools.
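
To make the "FASTA plus per-base quality" idea concrete, here is a minimal sketch of reading four-line FASTQ records and decoding the quality string. It assumes the Phred+33 ASCII offset used by Sanger-style and current Illumina files (older Illumina pipelines used other offsets), and the function name read_fastq is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one Phred score as an ASCII offset.
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield header[1:], seq, scores
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20 means roughly a 1% chance that the base call is wrong.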

Around the central genomics/structural axes orbit several smaller but durable ecosystems: phylogenetics (Newick 1986, NEXUS 1997, PhyloXML 2009), mass-spec proteomics (mzML, the HUPO-PSI standard that displaced mzXML), chemistry/biology bridge formats (MOL/SDF, CCD, ligand definitions), and alignment/profile formats (Stockholm, HMMER profiles). The recurring pattern is that bio formats almost never die: installed base, decades of tooling (htslib, Biopython 1.87 from March 2026, BioPerl, EMBOSS, RDKit), and academic citation-permanence keep formats alive long after they are technically obsolete.

In our deep library

None of the bio file-format languages have standalone deep-library notes (they are domain DSLs, not general-purpose languages). Cross-reference:

bio-workflow — the sibling family: workflow-orchestration DSLs (Snakemake, Nextflow, CWL, WDL) that consume and produce these file formats. Read together for full coverage of “bio computing”.
api-description — UniProt/NCBI/Ensembl/RCSB all expose REST + GraphQL APIs over these formats; OpenAPI/Smithy describe the access layer.
scientific — R (Bioconductor), MATLAB (Bioinformatics Toolbox), and Mathematica all have native readers for FASTA/FASTQ/PDB/SDF.
python — Biopython is the canonical reader stack; pysam wraps htslib for SAM/BAM/CRAM/VCF; RDKit handles SMILES/InChI/MOL; BioPandas wraps PDB/mmCIF.
notation-spec — STAR/CIF and the chemistry line-notations (SMILES, InChI, SELFIES) are dual-classified there as compact textual notations.
api-description — PhyloXML, mzML, BioM(JSON), HUPO-PSI XML formats all sit on top of XML/JSON-Schema infrastructure catalogued there.

Tier 3 family table — Sequence & alignment formats

Format	First appeared	Origin	Type	Status (2026)	URL
FASTA	1985	William Pearson, Univ. of Virginia	Plain-text sequence: `>id\nACGT...`	Universal; unchanged in 41 years; every bio tool reads it	https://en.wikipedia.org/wiki/FASTA_format
FASTQ	~2000 (formally described 2010)	Wellcome Trust Sanger Institute	FASTA + per-base Phred quality (4-line records)	Universal for short-read sequencing; Illumina/PacBio/ONT default	https://maq.sourceforge.net/fastq.shtml
GenBank flat-file	1982 (GenBank), format formalised 1986+	NCBI / Los Alamos	Annotated nucleotide record: LOCUS / DEFINITION / ACCESSION / FEATURES / ORIGIN	Active; daily releases continue	https://www.ncbi.nlm.nih.gov/genbank/release/current/
EMBL flat-file	1982	EMBL-EBI	Sibling of GenBank with `ID/AC/DE/FT/SQ` line-tag grammar	Active; INSDC sync with GenBank/DDBJ	https://www.ebi.ac.uk/ena/browser/text-search
SAM	2009	Heng Li et al., 1000 Genomes / Sanger	Tab-separated aligned-read records (text); 11 mandatory cols + tags	Universal for alignment data	https://samtools.github.io/hts-specs/SAMv1.pdf
BAM	2009	Heng Li et al.	BGZF-compressed binary SAM	Universal; default working format	https://samtools.github.io/hts-specs/SAMv1.pdf
CRAM	2014 (3.0); 3.1 stable in htslib since ~2022	EBI / Cochrane / Bonfield	Reference-based compressed alignment; data-type-specific codecs	Active; default archival format at EBI/SRA; v3.1 current	https://samtools.github.io/hts-specs/CRAMv3.pdf
Stockholm	1990s	Sean Eddy, HMMER project	Multiple-sequence-alignment format with markup lines (`#=GF/GS/GR/GC`)	Active; Pfam/Rfam/Dfam canonical format	https://en.wikipedia.org/wiki/Stockholm_format
HMMER profile	1995+	Sean Eddy, WashU/HHMI	Profile-HMM text grammar; companion to Stockholm alignments	Active; HMMER 3.4 current	http://hmmer.org/documentation.html
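
The SAM row above describes 11 mandatory tab-separated columns followed by optional tags. A minimal sketch of splitting one alignment line into those parts follows. The column names come from the SAM specification; the function name parse_sam_line and the sample read are made up for illustration, and only the `i` and `f` tag types are converted (all others are kept as strings).

```python
from typing import Any, Dict

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one SAM alignment line into mandatory fields plus a tag dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("SAM alignment lines need at least 11 columns")
    record: Dict[str, Any] = {
        name: int(value) if name in INT_FIELDS else value
        for name, value in zip(SAM_FIELDS, cols)
    }
    tags: Dict[str, Any] = {}
    for field in cols[len(SAM_FIELDS):]:
        # Optional fields have the shape TAG:TYPE:VALUE.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value
    record["tags"] = tags
    return record


def is_unmapped(record: Dict[str, Any]) -> bool:
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(record["FLAG"] & 0x4)
```

Header lines (those starting with `@`) are not handled here; a caller would filter them out before passing lines to the function.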

Tier 3 family table — Structural / crystallography

Format	First appeared	Origin	Type	Status (2026)	URL
PDB (flat-file)	1976	Brookhaven National Lab	80-column fixed-width atomic-coordinate records (HEADER/ATOM/HETATM/CONECT)	Legacy / sunset path: 4-char IDs continue; entries with 5-char extended IDs distributed in mmCIF only; full transition to extended IDs scheduled July 21, 2027	https://www.wwpdb.org/documentation/file-format
mmCIF / PDBx	1997+	wwPDB / IUCr	CIF-based macromolecular dictionary; current PDB deposition format (mandatory for crystallographic deposits since 2019)	Canonical; required for new wwPDB deposits	https://mmcif.wwpdb.org/
CIF	1991	IUCr (Hall, Allen, Brown)	STAR-grammar-based crystallographic information format for small molecules	Universal for small-molecule crystallography; CIF2 spec stable	https://www.iucr.org/resources/cif
STAR	1991	Sydney Hall, IUCr	Self-defining Text Archive and Retrieval — the parent grammar of CIF/mmCIF/NMR-STAR	Stable foundation; rarely written by humans, mostly via CIF dialects	https://en.wikipedia.org/wiki/Self-defining_Text_Archive_and_Retrieval
NMR-STAR	1990s	BMRB (Biological Magnetic Resonance Bank), UW-Madison	STAR-dialect dictionary for NMR data deposition	Active; BMRB v3.2 dictionary	https://bmrb.io/standards/
CCDC CSD format	1965+	Cambridge Crystallographic Data Centre	Proprietary database format for the Cambridge Structural Database (>1.3M structures)	Active; commercial; CSD-System 2026 release	https://www.ccdc.cam.ac.uk/
PDB CCD (Chemical Component Dictionary)	2000s	wwPDB	mmCIF-based dictionary of every ligand/residue ever seen in the PDB	Active; updated weekly with PDB releases	https://www.wwpdb.org/data/ccd
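
The 80-column fixed-width layout of the legacy PDB flat file means fields are located by column position rather than by a delimiter. A minimal sketch of reading one ATOM/HETATM record follows. The column ranges follow the wwPDB format description cited in the table; the function name parse_atom_record and the sample coordinates are made up for illustration, and only a subset of the fields is extracted.

```python
from typing import Any, Dict


def parse_atom_record(line: str) -> Dict[str, Any]:
    """Slice an ATOM/HETATM line by fixed columns (0-based Python slices)."""
    line = line.rstrip("\n").ljust(80)  # pad so slices never run past the end
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {record!r}")
    return {
        "record": record,
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "element": line[76:78].strip(),   # columns 77-78
    }


# Hypothetical sample record, assembled field by field so the widths are visible.
SAMPLE = "".join([
    "ATOM  ", "    1", " ", " N  ", " ", "MET", " ", "A", "   1", " ", "   ",
    "  27.340", "  24.430", "   2.614", "  1.00", "  9.67",
    " " * 10, " N", "  ",
])
```

Splitting such lines on whitespace is a common mistake: adjacent numeric fields can touch when values are large, which is why column slicing is used here.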

Tier 3 family table — Genomic features & variants

Format	First appeared	Origin	Type	Status (2026)	URL
GFF / GFF3	~1997 (GFF), 2007 (GFF3)	Sanger Centre / Sequence Ontology	Tab-separated genomic-feature annotation (9 columns; attribute key=value pairs)	Universal for annotation; GFF3 current	https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
GTF	~2003	Ensembl/UCSC	GFF dialect with stricter attribute syntax; gene/transcript/exon hierarchy	Universal for transcript annotation; Ensembl/GENCODE primary	https://useast.ensembl.org/info/website/upload/gff.html
BED	~2003	UCSC Genome Browser (Kent et al.)	Tab-separated browser-extensible records (3–12 cols); BED12 carries blocks	Universal; bedtools ecosystem; bigBed binary variant ubiquitous	https://genome.ucsc.edu/FAQ/FAQformat.html#format1
VCF	2010	1000 Genomes Project / Danecek et al.	Variant-calling DSL: `#CHROM/POS/ID/REF/ALT/QUAL/FILTER/INFO/FORMAT/<samples>`	Universal; v4.5 draft active on hts-specs; v4.4 widely deployed	https://samtools.github.io/hts-specs/
BCF	2011	samtools project	Binary Call Format: BGZF-compressed binary encoding of VCF	Active; v2.2 current	https://samtools.github.io/hts-specs/
PED / FAM / TPED / TFAM	~2007	PLINK (Purcell et al., Broad Institute)	Pedigree + genotype text formats for GWAS	Active; PLINK 2.0 native uses PGEN binary but PED/MAP still common	https://www.cog-genomics.org/plink/2.0/formats
BIOM	2011	Earth Microbiome Project / Caporaso et al.	Biological Observation Matrix; HDF5 (v2) or sparse JSON (v1) for OTU tables	Active in microbiome work	https://biom-format.org/
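
The nine-column GFF3 layout with key=value attributes can be read with the standard library alone. The sketch below is a minimal reader for a single feature line; the column names follow the GFF3 specification, while the function name parse_gff3_line and the sample feature are assumptions made for illustration. Directive lines (`##...`) and comments are not handled.

```python
from typing import Any, Dict
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Parse one GFF3 feature line into a dict with typed coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError("GFF3 feature lines have exactly 9 tab-separated columns")
    record: Dict[str, Any] = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    attributes: Dict[str, list] = {}
    for pair in record["attributes"].split(";"):
        if pair in ("", "."):
            continue  # empty attribute column or trailing semicolon
        key, _, value = pair.partition("=")
        # Values may hold comma-separated lists; reserved characters
        # inside a value are percent-encoded.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record
```

Every attribute value is returned as a list, so single-valued keys such as ID are read as `record["attributes"]["ID"][0]`.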

Tier 3 family table — Chemistry / small molecules

Format	First appeared	Origin	Type	Status (2026)	URL
SMILES	1986	David Weininger, Daylight Chemical	Linear line notation for chemical structures: `CC(=O)O` for acetic acid	Universal; unchanged in 40 years; canonical-SMILES algorithms vary by toolkit	https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
OpenSMILES	2007	OpenSMILES community	Vendor-neutral SMILES specification (resolves Daylight ambiguities)	Active reference standard	http://opensmiles.org/opensmiles.html
InChI	2005 (IUPAC); v1.07 approved 16 July 2024	IUPAC + InChI Trust	Layered canonical chemical-identifier string (formula/connectivity/H/charge/stereo/isotopes)	Active; v1.07 on GitHub since 2024, MIT-licensed	https://www.inchi-trust.org/
InChIKey	2007	IUPAC	Fixed-length hashed InChI (27 chars); database-friendly	Active; ubiquitous in chemistry databases	https://www.inchi-trust.org/about-the-inchi-standard/
SELFIES	2020 (Krenn et al., MLST 2020)	Aspuru-Guzik group, Toronto + community	Self-Referencing Embedded Strings; every string is a valid molecule	Active; ongoing extensions to polymers/crystals/reactions; widely used for ML generative chemistry	https://github.com/aspuru-guzik-group/selfies
MOL / MDL Molfile	1980s	Molecular Design Limited (later Symyx, Accelrys, BIOVIA)	V2000 (atom/bond table) and V3000 (extended) chemical-structure files	Universal; V3000 supports >999 atoms; default ChemDraw export	https://en.wikipedia.org/wiki/Chemical_table_file
SDF	1990s	MDL	Structure-Data File: concatenated MOL records + property tags (`> <name>`)	Universal for chemistry datasets (PubChem, ChEMBL exports)	https://en.wikipedia.org/wiki/Chemical_table_file#SDF
MOL2	~1990	Tripos	Tripos MOL2 — atom/bond/substructure with explicit Sybyl atom types	Active; common in docking (AutoDock, GOLD)	http://chemyang.ccnu.edu.cn/ccb/server/AIMMS/mol2.pdf
Reaction SMILES / RXN	1990s+	Daylight (rxn SMILES); MDL (RXN file)	SMILES extension: `reactants>agents>products` (`reactants>>products` when there are no agents); RXN is the MDL connection-table counterpart	Active; RXN SMILES dominant in ML reaction prediction	https://daylight.com/dayhtml/doc/theory/theory.rxn.html
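
Daylight reaction SMILES separates reactants, agents, and products with single `>` characters, leaving the middle part empty (`reactants>>products`) when no agents are given. A minimal sketch of splitting such a string follows; the function name split_reaction_smiles is hypothetical, and splitting each part on `.` is a simplifying assumption that treats every disconnected component (including counter-ions of a salt) as its own molecule.

```python
from typing import Dict, List


def split_reaction_smiles(rxn: str) -> Dict[str, List[str]]:
    """Split 'reactants>agents>products' into lists of component SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # '.' separates disconnected components within each part.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}
```

The function does no chemical validation; checking that each component is a valid molecule would need a cheminformatics toolkit such as RDKit.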

Tier 3 family table — Phylogenetics & MS / other

Format	First appeared	Origin	Type	Status (2026)	URL
Newick	1986 (informal); standardised in NEXUS 1997	Felsenstein/Maddison “Newick’s lobster house” meeting	Recursive parenthetical tree: `((A,B),(C,D));` with optional branch lengths	Universal for phylogenetic trees	https://evolution.genetics.washington.edu/phylip/newicktree.html
NEXUS	1997	Maddison, Swofford, Maddison	Block-structured phylogenetics format (TAXA/CHARACTERS/TREES/…); embeds Newick	Active; MrBayes/PAUP*/MEGA native input	https://academic.oup.com/sysbio/article/46/4/590/1629695
PhyloXML	2009	Han & Zmasek	XML alternative to Newick with rich annotation	Active but niche; Newick still dominant for size reasons	http://www.phyloxml.org/
mzML	2008 (HUPO-PSI)	HUPO Proteomics Standards Initiative	XML-based mass-spec data format; superseded mzXML and mzData	Universal for MS data archives (ProteomeXchange/PRIDE)	https://www.psidev.info/mzML
mzXML	2004 (legacy)	Seattle Proteome Center / ISB	Earlier MS format; predecessor to mzML	Legacy; archives still common but new tools prefer mzML	https://en.wikipedia.org/wiki/Mass_spectrometry_data_format#mzXML
mzTab	2014	HUPO-PSI	Tab-separated proteomics-results format (PSM/peptide/protein/SmallMolecule sections)	Active; v2.0 released 2021 (small-molecule extension)	https://www.psidev.info/mztab
MIBI / IBIS / OME-TIFF (cross-list)	2010s	OME consortium / multiplexed-imaging community	Imaging-mass-cytometry / multiplexed-imaging data formats	Active	https://www.openmicroscopy.org/ome-files/
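
Newick's recursive parenthetical structure maps directly onto a small recursive-descent parser. The following is a minimal sketch under simplifying assumptions: it handles unquoted labels and optional `:length` suffixes only (no quoted labels, no bracketed comments), and the names parse_newick and leaf_names are hypothetical.

```python
from typing import Any, Dict, List


def parse_newick(text: str) -> Dict[str, Any]:
    """Parse a Newick string into nested {'name', 'length', 'children'} dicts."""
    text = text.strip()
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def read_until(stops: str) -> str:
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def node() -> Dict[str, Any]:
        nonlocal pos
        children: List[Dict[str, Any]] = []
        if peek() == "(":
            pos += 1  # consume '('
            children.append(node())
            while peek() == ",":
                pos += 1  # consume ','
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ')'
        name = read_until(",():;")
        length = None
        if peek() == ":":
            pos += 1  # consume ':'
            length = float(read_until(",();"))
        return {"name": name, "length": length, "children": children}

    tree = node()
    if text[pos:] != ";":
        raise ValueError("Newick string must end with ';'")
    return tree


def leaf_names(tree: Dict[str, Any]) -> List[str]:
    """Collect leaf labels in left-to-right order."""
    if not tree["children"]:
        return [tree["name"]]
    return [name for child in tree["children"] for name in leaf_names(child)]
```

Internal nodes without a label get an empty string as their name, which mirrors how the format allows labels to be omitted.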

Notable threads

The PDB → mmCIF transition is mid-flight, not finished. A common misconception is that PDB was “deprecated in 2024.” The reality is more granular: crystallographic depositions to wwPDB have required mmCIF since 2019; PDB entries assigned 5-character extended IDs (rolled out as the 4-char namespace exhausts) are distributed in mmCIF only; and the full archive cutover — when wwPDB stops issuing 4-character IDs and the entire archive moves to extended IDs + mmCIF-only distribution — is scheduled for 21 July 2027. The 80-column PDB flat-file format will continue to exist for legacy 4-char entries indefinitely. Software that hard-codes ^.{4}$ PDB ID regexes or 80-column line widths is on the clock.
FASTA’s 41-year reign and why nothing displaces it. FASTA (1985) is genuinely unchanged. The format has no formal grammar, yet its core convention (a > header line followed by sequence lines) is read compatibly by Biopython 1.87 (March 2026), BioPerl, EMBOSS, samtools, BWA, BLAST, and every LLM-protein-model preprocessor. The reason nothing displaces it is the Lindy effect: every new tool needs a FASTA reader on day 1 to interoperate with the existing zoo, so no replacement ever achieves escape velocity. FASTQ added quality scores but did not replace FASTA; it sits alongside, used for raw reads while FASTA holds reference and downstream protein/nucleotide sequences.
SMILES vs InChI vs SELFIES — three different optimisation goals. SMILES (1986) is human-writable, compact, and standard, but non-canonical: the same molecule has many valid SMILES strings depending on traversal order, and toolkits (RDKit, OpenBabel, OEChem) disagree on canonicalisation. InChI (2005, v1.07 approved July 2024 under IUPAC + InChI Trust) is canonical by construction — one molecule, one InChI string — and the InChIKey hash makes it database-indexable. SELFIES (2020, Aspuru-Guzik group, Toronto) solves a third problem: every randomly-generated SELFIES string corresponds to a syntactically valid molecule, so generative ML models cannot produce invalid output. Three formats, three different “valid” definitions: human-readable, canonical, ML-robust.
The SAM / BAM / CRAM compression ladder. SAM (2009) is the human-readable text alignment format; BAM is its BGZF-gzip binary form (typically 4–6× smaller); CRAM (3.0 in 2014, 3.1 stable in htslib since ~2022) is reference-based — it stores only the differences from a reference genome, so well-aligned data compresses 30–60% better than BAM. CRAM 3.1 added data-type-specific codecs (separate for identifiers, quality scores, sequence) rather than relying on general-purpose compression. EBI and SRA now default to CRAM for archival; htslib 1.23.1 (2026) is the canonical decoder. The lossy quality-binning option in CRAM is controversial — it sacrifices precision for ~2× more compression.
mzML as the proteomics OOXML moment. Mass-spec data went through three formats in a decade: vendor-binary (Thermo .raw, Bruker .d), then mzXML (Seattle Proteome Center, 2004), then mzML (HUPO-PSI, 2008). mzML won because it had a controlled-vocabulary backbone (PSI-MS CV) and unified the previously-bifurcated mzXML/mzData community under one standard. ProteomeXchange and PRIDE require mzML for deposition. mzTab (2014, v2.0 in 2021 with small-molecule support) handles the results layer above mzML.
The INSDC three-database choreography. GenBank (NCBI), EMBL (EBI), and DDBJ (Japan) are the three nodes of the International Nucleotide Sequence Database Collaboration. They synchronise content daily (every record submitted to one appears in the other two within 24 hours), but each preserves its own slightly different flat-file dialect (LOCUS/DEFINITION/ACCESSION for GenBank, ID/AC/DE line tags for EMBL, similar for DDBJ). This is one of the most successful long-running international data-format federations in any field; it has worked since 1987 and now coordinates petabytes of sequence data.
The chemistry-biology bridge formats. Biology has to handle small molecules (drug ligands binding to proteins) and chemistry has to handle proteins (target structures), so the formats inevitably overlap. The wwPDB Chemical Component Dictionary (CCD) is mmCIF-formatted and defines every ligand and residue ever observed in the PDB. MOL/SDF files appear inside drug-discovery pipelines that ingest from ChEMBL/PubChem/ZINC and feed docking software (AutoDock Vina, GOLD) which itself reads PDB protein structures and MOL2 ligands. The seam between the two ecosystems is a recurring source of bugs — bond-order ambiguity in PDB ligands is the canonical example, addressed by the CCD’s full chemical specification.
VCF as a genuine DSL, not just a file format. VCF’s INFO column carries semicolon-separated key=value records (AC=2;AN=4;DP=500;AF=0.5); FORMAT declares per-sample fields (GT:DP:AD:GQ); the header declares all available keys with ##INFO=<ID=...,Number=...,Type=...,Description=...> lines. This is effectively a typed schema language embedded in the file header — closer in spirit to Avro or Protobuf than to TSV. bcftools query, GATK SelectVariants, and snpEff all operate as miniature interpreters over this schema. v4.5 (current draft on hts-specs) extends it to handle modified bases.
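
The "typed schema in the header" idea from the VCF thread can be illustrated with a small interpreter: read the ##INFO declarations, then use them to convert an INFO column into typed values. This is a minimal sketch, not how bcftools or GATK implement it; the function names read_info_schema and parse_info are hypothetical, undeclared keys fall back to strings, and only a Number of 1 is treated as a scalar.

```python
import re
from typing import Any, Dict, Iterable

# Matches the ID, Number and Type keys of a ##INFO header declaration.
_INFO_DECL = re.compile(r"##INFO=<ID=([^,>]+),Number=([^,>]+),Type=([^,>]+)")
_CASTS = {"Integer": int, "Float": float, "String": str, "Character": str}


def read_info_schema(header_lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Build {key: {'number': ..., 'type': ...}} from ##INFO header lines."""
    schema: Dict[str, Dict[str, str]] = {}
    for line in header_lines:
        match = _INFO_DECL.match(line)
        if match:
            key, number, type_name = match.groups()
            schema[key] = {"number": number, "type": type_name}
    return schema


def parse_info(info: str, schema: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Convert a semicolon-separated INFO column using the declared types."""
    result: Dict[str, Any] = {}
    if info == ".":
        return result  # '.' means no INFO fields
    for item in info.split(";"):
        if not item:
            continue
        key, has_value, value = item.partition("=")
        decl = schema.get(key, {"number": ".", "type": "String"})
        if decl["type"] == "Flag" or not has_value:
            result[key] = True  # flags carry no value
            continue
        cast = _CASTS.get(decl["type"], str)
        values = [None if v == "." else cast(v) for v in value.split(",")]
        result[key] = values[0] if decl["number"] == "1" else values
    return result
```

Keys declared with Number=A (one value per alternate allele) come back as lists even when there is a single ALT, which keeps downstream code uniform across biallelic and multiallelic records.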

Citations

PDB Format Description (wwPDB): https://www.wwpdb.org/documentation/file-format
PDBx/mmCIF dictionary (wwPDB): https://mmcif.wwpdb.org/
wwPDB extended IDs / July 2027 transition (PDBj): https://pdbj.org/news/ExtensionCCDCodes?lang=en
PDBx/mmCIF mandatory crystallographic deposition (PMC): https://pmc.ncbi.nlm.nih.gov/articles/PMC6465986/
IUCr CIF resources: https://www.iucr.org/resources/cif
BMRB NMR-STAR standards: https://bmrb.io/standards/
FASTA description (NCBI BLAST docs): https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
FASTQ format (Cock et al., NAR 2010): https://academic.oup.com/nar/article/38/6/1767/3112533
HTS-specs (SAM/BAM/CRAM/VCF/BCF): https://samtools.github.io/hts-specs/
CRAM 3.1 paper (Bonfield, Bioinformatics 2022): https://academic.oup.com/bioinformatics/article/38/6/1497/6499262
htslib releases: https://github.com/samtools/htslib/releases
VCF 4.5 draft (hts-specs): https://github.com/samtools/hts-specs/blob/master/VCFv4.5.draft.tex
GFF3 specification (Sequence Ontology): https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
UCSC BED format: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Newick format (Felsenstein): https://evolution.genetics.washington.edu/phylip/newicktree.html
NEXUS paper (Maddison/Swofford/Maddison, Syst Biol 1997): https://academic.oup.com/sysbio/article/46/4/590/1629695
SMILES (Daylight): https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
OpenSMILES specification: http://opensmiles.org/opensmiles.html
InChI Trust (v1.07 release, 16 July 2024): https://www.inchi-trust.org/iupac-inchi-moves-to-github-to-support-sustainable-chemical-standards-development/
SELFIES paper (Krenn et al., MLST 2020): https://iopscience.iop.org/article/10.1088/2632-2153/aba947
SELFIES library: https://github.com/aspuru-guzik-group/selfies
HUPO-PSI mzML: https://www.psidev.info/mzML
HUPO-PSI mzTab: https://www.psidev.info/mztab
Stockholm format (Pfam): https://en.wikipedia.org/wiki/Stockholm_format
HMMER documentation: http://hmmer.org/documentation.html
Biopython 1.87 (March 2026 release): https://biopython.org/wiki/Download
INSDC (international nucleotide collaboration): https://www.insdc.org/
BIOM format: https://biom-format.org/
PLINK formats: https://www.cog-genomics.org/plink/2.0/formats
