FPGA Design

1. At a glance

A Field-Programmable Gate Array is a post-fabrication-configurable integrated circuit: a sea of small programmable logic elements (LUTs and flip-flops) embedded in a routing fabric, with islands of hard IP (DSP slices, block RAM, transceivers, PCIe controllers, DDR PHYs, sometimes ARM application cores) hardened in silicon for performance. The configuration — usually a multi-megabit bitstream — is loaded into on-die SRAM at power-up and defines what every LUT, FF, and routing switch does for the rest of the operating session.

FPGAs occupy the middle of the digital-implementation spectrum. ASICs ([[Engineering/digital-logic]] section 6p) bake function into mask sets — fastest, lowest unit cost at volume, USD 1–100 M NRE, months-to-years cycle time. CPUs / MCUs ([[Engineering/microcontrollers]]) execute software on a fixed datapath — most flexible, easiest to develop, sequential-by-default. FPGAs sit in between: post-fab reconfigurable like software, but spatially parallel like an ASIC. The technology wins where parallelism beats sequential execution, latency must be deterministic, protocol flexibility matters, or volumes don’t justify ASIC tape-out.

Real-world workloads where FPGAs dominate today: 5G base-station L1 (Xilinx Versal RFSoC, Intel Agilex), network-line-card packet classification at 400 GbE (Xilinx Alveo, Achronix Speedster7t), automotive radar and lidar front-ends (Xilinx Zynq UltraScale+ RFSoC, Intel Cyclone 10), military signal-intelligence (rad-hard Microchip RTG4), ASIC emulation and prototyping (Cadence Palladium / Synopsys ZeBu use thousands of FPGAs), financial HFT (Solarflare X3522), ML inference acceleration (AWS F1 with Xilinx VU9P, Vitis AI), video processing in broadcast (Lattice CrossLink-NX, Xilinx Zynq), and motor / power control at MHz rates (Zynq Z-7020, Cyclone V SoC; see [[Engineering/power-electronics]]).

Two issues consume more engineering hours on a real FPGA project than any other: timing closure (making a design hit its clock-frequency target across all process-voltage-temperature corners, covered in section 5p) and clock-domain crossing (CDC — covered in [[Engineering/digital-logic]] section 5p and revisited here for FPGA-specific gotchas in section 7p). Verification effort routinely outweighs RTL design effort 3:1; on safety-critical avionics work (DO-254), 5:1 or higher.

2. First principles

The programmable logic cell

The atomic unit of an FPGA fabric is a logic cell (Xilinx terminology) / logic element (Intel/Altera) / PFU sub-slice (Lattice). At its core sit:

k-input look-up table (LUT-k): an SRAM with 2^k bits storing the truth table of any k-input combinational function. Setting the SRAM bits at configuration time defines the logic. Modern fabrics use LUT6 (Xilinx 7-series, UltraScale, UltraScale+, Versal; 64 SRAM bits per LUT) or fracturable LUT6 (Intel Stratix-10, Agilex; can split into two independent LUT5s sharing four inputs, doubling utility for narrow logic). Lattice ECP5 and iCE40 use LUT4; older Spartan-6 used LUT6.
D flip-flop: one or two per LUT, with clock-enable and synchronous or asynchronous set/reset; configurable as a latch on some devices.
Carry chain: dedicated fast-carry logic spanning a vertical column of cells, used for adders, subtractors, and counters. A 32-bit adder synthesises into 32 LUTs riding a dedicated chain that traverses one column in roughly 1 ns end to end on UltraScale+, far faster than routing carry through general fabric.
Distributed-RAM and shift-register modes: a Xilinx LUT6 in a SLICEM can also function as a 64×1 distributed RAM or a 32-bit shift register (SRL32), reusing the SRAM cells for storage instead of logic.

Cells are bundled into larger units: Xilinx CLB (Configurable Logic Block) = 2 SLICEs = 8 LUT6 + 16 FF; Intel LAB (Logic Array Block) = 10 ALMs = 20 fracturable LUT6 + 40 FF on Agilex; Lattice PLC = 8 SLICEs on Nexus.

Block RAM, distributed RAM, URAM

BRAM (Block RAM): dedicated 18 kb or 36 kb dual-port SRAM blocks. A 36 kb block is configurable as 32k×1, 16k×2, 8k×4, …, 1k×36, or 512×72 (the ×9 and wider modes include the parity bits). True dual-port (independent read+write on each port, separate clocks), FIFO mode (built-in pointers and flags), ECC mode on UltraScale+ (64+8 SECDED). Xilinx 7-series has up to 1880 BRAM36 (XC7VX1140T); Versal Premium has up to 2520. Read latency 1–2 cycles, write 1 cycle.
Distributed RAM: using the LUT6 SRAM cells in SLICEM as 32×1, 64×1, 128×1, or 256×1 single/dual-port memories. Useful for very small (< 256 word) high-fanout lookup tables. The asynchronous read path has no clock-cycle latency.
URAM (UltraRAM): 288 kb blocks (4k×72, two ports on a single clock) introduced on Xilinx UltraScale+. Larger and denser than BRAM, with fewer configuration options. Designed for ML weight buffers and packet buffers. VU13P has 1280 URAMs.
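
As a rough sizing aid for the memory types above, the sketch below estimates how many 36 kb block RAMs a depth × width memory needs by trying each aspect ratio and tiling. The aspect-ratio list is an assumption based on the usual Xilinx BRAM36 modes, and bram36_count is a hypothetical helper rather than a vendor API; real tools may pack differently (for example by cascading blocks or using BRAM18 halves).

```python
from math import ceil

# Assumed BRAM36 aspect ratios as (depth, width).
# Widths of 9 and above include the parity bits.
BRAM36_MODES = [(32768, 1), (16384, 2), (8192, 4), (4096, 9),
                (2048, 18), (1024, 36), (512, 72)]


def bram36_count(depth: int, width: int) -> int:
    """Smallest number of BRAM36 blocks that tiles a depth x width memory."""
    return min(ceil(depth / d) * ceil(width / w) for d, w in BRAM36_MODES)


if __name__ == "__main__":
    for depth, width in [(1024, 8), (1024, 32), (1024, 64), (4096, 72)]:
        print(f"{depth} x {width}: {bram36_count(depth, width)} BRAM36")
```

The estimate ignores port-mode restrictions (the 512×72 mode is not available in every port configuration), so treat it as an early budget figure and confirm against the synthesis utilisation report.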

DSP slices

A DSP slice is a hard MAC (multiply-accumulate) unit, far smaller and faster than the LUT-equivalent. The Xilinx DSP48E2 (UltraScale+) is a 27×18 signed multiplier + 48-bit adder/accumulator + pre-adder + pattern detector, running at up to 891 MHz. The DSP58 on Versal adds INT8/INT16 SIMD modes for AI. The Intel Variable Precision DSP on Stratix 10 / Agilex offers 18×19 or 27×27 with similar accumulation and supports IEEE-754 single-precision float natively (one FP32 multiply-accumulate per cycle per slice). Modern parts ship thousands: Kintex UltraScale+ KU15P has 1968 DSP48E2; Versal Premium VP1502 has 7392 DSP58s.

Clocking, I/O, and SerDes

MMCM / PLL: Mixed-Mode Clock Manager (Xilinx) / Phase-Locked Loop (all vendors). These generate, phase-shift, multiply, divide, and jitter-filter clocks from external references. An UltraScale+ MMCM accepts 10 MHz–800 MHz input, produces up to seven output clocks each up to 891 MHz, with ~100 fs RMS output jitter.
Global clock buffers (BUFG, BUFGCE, BUFGCTRL): distribute clocks across regions with low skew (< 100 ps within a clock region on UltraScale+). 32 BUFGs per device is typical, and clock-heavy designs compete for them.
I/O blocks: single-ended LVCMOS (1.2 V / 1.5 V / 1.8 V / 2.5 V / 3.3 V), differential LVDS, SSTL/POD for DDR4/DDR5 memory. Per-pin programmable IDELAY/ODELAY tap chains (78 ps resolution on UltraScale+) for source-synchronous interface deskew.
SerDes (transceivers): multi-Gb/s differential serial links. Xilinx GTH (16.3 Gb/s), GTY (32.75 Gb/s NRZ), GTM (58 Gb/s PAM4, up to 112 Gb/s on Versal Premium). Intel E-tile/F-tile transceivers sit in a similar tier. Each transceiver includes CDR (clock-data recovery), 8b/10b or 64b/66b encoding, FEC, and a programmable equaliser.

Hard IP

Hard IP blocks integrate high-bandwidth, standardised functions more cost-effectively than soft logic: PCIe Gen3 x16 / Gen4 x16 / Gen5 x16 hard blocks (Xilinx, Intel), 100G / 400G Ethernet MACs with RS-FEC (UltraScale+, Versal Premium, Agilex 7), DDR4 / DDR5 / HBM2e / HBM3 controllers, ARM Cortex-A53 / A78AE Processing Systems in SoC variants (Zynq-7000, Zynq UltraScale+ MPSoC, Versal ACAP, Cyclone V SoC, Stratix 10 SoC, Agilex 7 with quad Cortex-A76AE), and AI Engine tiles in Versal (vector processors at 1.3 GHz, hundreds of cores arranged in a 2D mesh, designed for ML and signal-processing kernels).

3. Practical math / design equations

Resource estimation — a few rules of thumb

Block	Resource cost	Notes
32-bit ripple-carry adder	32 LUT, riding 1 carry chain	~1 ns on UltraScale+
32-bit register	32 FF	Free placement
32×32 = 64-bit multiplier	4 DSP48E2	1-cycle multiply + 1-cycle accumulate; ~10 cycles latency pipelined
18-bit × 25-bit multiplier (signed)	1 DSP48E2	Native, single-cycle
1 kB single-port RAM (1024×8)	1 BRAM18 (one half used)	2-cycle read latency typ
32 kB dual-port RAM	8 BRAM36	True dual-port
8-bit-wide async FIFO 1024 deep	1 BRAM18 + ~20 LUT + 40 FF	Gray-code pointers for CDC
256 kB packet buffer	8 URAM	UltraScale+ only; 288 kb (36 kB) per URAM
16-tap symmetric FIR filter, 16-bit data, 18-bit coeff	8 DSP48E2 + ~200 LUT + ~200 FF	At 200 MHz → 200 MSPS
100G Ethernet MAC	1 CMAC hard block	UltraScale+ / Versal
PCIe Gen3 x8 endpoint	1 PCIe hard block + DMA RTL	Xilinx XDMA / QDMA IP

Setup, hold, and slack — restated for FPGA flow

The setup and hold inequalities from [[Engineering/digital-logic]] section 3 apply unchanged to FPGAs. The FPGA flavour adds two specifics:

t_CO and t_setup are device-specific, not designer-controllable. They come from the FPGA’s speed grade. UltraScale+ -2 speed grade: FF t_CO ≈ 350 ps, t_setup ≈ 130 ps. -3 speed grade roughly 15 % faster, -1 about 15 % slower.
t_route dominates t_comb in most paths. On a fully placed-and-routed UltraScale+ design at 400 MHz, a typical register-to-register path with 4 LUT delays might be 1.0 ns of LUT/CARRY logic and 1.2 ns of net (wire) delay — routing accounts for more than half the total. Designers can’t directly edit routes; instead, they pull data + control closer together via floorplanning constraints, pblock assignments, and pipelining to reduce per-stage logic depth.

Slack at a destination FF:

slack = T_clk − t_CO_src − t_LUT − t_route − t_setup_dst + t_skew_dst − t_skew_src

Negative slack = timing violation. Vivado and Quartus both report worst negative slack (WNS) and total negative slack (TNS), summarised per clock domain. A design with WNS ≥ 0 (and therefore TNS = 0) passes; any negative slack fails.
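
The slack equation above can be checked numerically with a few lines of Python. This is a minimal sketch: setup_slack is a hypothetical helper, the t_CO and t_setup values are the -2 speed grade figures quoted earlier, and the LUT and routing delays are illustrative assumptions rather than measured numbers.

```python
def setup_slack(t_clk, t_co, t_lut, t_route, t_setup,
                skew_dst=0.0, skew_src=0.0):
    """Setup slack in ns, term by term as in the slack equation."""
    return t_clk - t_co - t_lut - t_route - t_setup + skew_dst - skew_src


if __name__ == "__main__":
    # 400 MHz clock (2.5 ns period); t_lut and t_route are assumed values.
    slack = setup_slack(t_clk=2.5, t_co=0.35, t_lut=0.8,
                        t_route=1.0, t_setup=0.13)
    status = "met" if slack >= 0 else "violated"
    print(f"slack = {slack:+.2f} ns -> timing {status}")
```

A positive destination skew (the capture clock arriving later than the launch clock) adds to setup slack, which is why the two skew terms carry opposite signs.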

Pipelining math

If a combinational path has logic delay t_comb, splitting it into N pipeline stages with FFs between gives:

t_stage ≈ t_comb / N
T_clk_min ≈ t_CO + t_comb/N + t_setup
Throughput = 1 / T_clk_min (one result per cycle once the pipeline is filled)
Latency = N cycles = N × T_clk_min (from input to first valid output)

Diminishing returns set in when t_CO + t_setup approach t_comb/N (~600 ps fixed overhead on UltraScale+); past N ≈ t_comb / (3 × overhead) extra stages cost area without speed.
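
To make the diminishing returns concrete, the sketch below evaluates the pipelining formulas for increasing N. It assumes the logic splits perfectly evenly and that routing delay is folded into t_comb; pipeline_metrics is a hypothetical helper, the default t_CO and t_setup are the -2 speed grade figures quoted earlier, and the 8 ns starting path is an arbitrary illustration.

```python
def pipeline_metrics(t_comb, n_stages, t_co=0.35, t_setup=0.13):
    """Return (T_clk_min in ns, f_max in MHz, latency in ns) for N stages."""
    t_clk_min = t_co + t_comb / n_stages + t_setup
    f_max_mhz = 1000.0 / t_clk_min
    latency_ns = n_stages * t_clk_min
    return t_clk_min, f_max_mhz, latency_ns


if __name__ == "__main__":
    # An assumed 8 ns combinational path split into 1..16 stages.
    for n in (1, 2, 4, 8, 16):
        t_clk, f_max, latency = pipeline_metrics(t_comb=8.0, n_stages=n)
        print(f"N={n:2d}  T_clk_min={t_clk:5.2f} ns  "
              f"f_max={f_max:7.1f} MHz  latency={latency:5.2f} ns")
```

Each doubling of N buys less frequency than the previous one because the fixed t_CO + t_setup term does not shrink, while total latency in nanoseconds keeps growing.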

Worked example A — 16-tap symmetric pipelined FIR filter

A 16-tap symmetric FIR at 200 MSPS, 16-bit data, 18-bit coefficients, on Artix-7 (XC7A35T-1).

Symmetric (h[n] = h[15-n]) → use 8 multipliers with pre-adders: y[n] = Σ h[k] × (x[n-k] + x[n-15+k]) for k = 0..7.
8 DSP48E1 slices (Artix-7 uses DSP48E1; 25×18 signed). Pre-adder inside each DSP holds the symmetric sum (saves one external adder per tap).
Pipeline: input register → pre-adder stage → multiply stage → adder-tree (3 cycles for log₂(8) levels) → output register. Total latency ≈ 16 + log₂(8) = 19 cycles → 95 ns at 200 MHz.
Resource count after synthesis + implementation: 8 DSP48E1, ~200 LUT, ~250 FF, no BRAM. WNS at -1 speed grade: ~0.3 ns. Timing closes.
Throughput: 200 MHz × 1 sample/cycle = 200 MSPS. The Nyquist bandwidth is 100 MHz, so a 50 MHz signal band leaves plenty of guard band.
Software baseline: Cortex-M7 at 480 MHz with single-cycle MAC, ~30 MSPS sustained on this filter. The FPGA is roughly 7× faster and runs deterministically with sub-cycle jitter; the CPU jitters by hundreds of cycles on cache misses.
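
The pre-adder folding used in this example can be checked in software. The sketch below is a behavioural model only (plain Python integers, zero initial state, no pipelining or bit-width limits); fir_direct and fir_symmetric are hypothetical names, and the coefficients are arbitrary test values.

```python
def fir_direct(h, x):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_symmetric(h_half, x):
    """Folded FIR for even-length symmetric taps: pre-add, then one multiply per pair."""
    half = len(h_half)
    taps = 2 * half

    def sample(i):
        # Samples before the start of the stream are treated as zero.
        return x[i] if i >= 0 else 0

    out = []
    for n in range(len(x)):
        acc = 0
        for k in range(half):
            pre_add = sample(n - k) + sample(n - (taps - 1) + k)  # pre-adder
            acc += h_half[k] * pre_add                            # one multiplier
        out.append(acc)
    return out


if __name__ == "__main__":
    h_half = [3, -1, 4, 1, -5, 9, 2, -6]   # h[0..7]; h[15-k] = h[k]
    h_full = h_half + h_half[::-1]         # all 16 taps
    x = [(7 * i) % 11 - 5 for i in range(50)]
    assert fir_symmetric(h_half, x) == fir_direct(h_full, x)
    print("folded 8-multiplier form matches the 16-tap direct form")
```

The inner loop performs 8 multiplications per output sample instead of 16, which is the saving the 8 DSP48E1 slices exploit in hardware.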

Worked example B — Asynchronous BRAM FIFO for CDC

A 4 kB FIFO buffers between a 200 MHz producer domain (e.g. line-rate packet ingress) and a 156.25 MHz consumer domain (e.g. 10G MAC interface).

Using Xilinx XPM_FIFO_ASYNC parameterised macro:

Depth = 1024, width = 32 bits → 4 kB, fits in 1 BRAM36 (or 2 × BRAM18) on 7-series / UltraScale.
Pointer width = 11 bits (10 address + 1 wrap bit). Gray-coded read and write pointers cross the clock-domain boundary via two-FF synchronisers (one bit toggling at a time guarantees no decode error on the receive side).
Read latency = 3 cycles (gray pointer sync + read-enable + BRAM read), so an empty flag deasserts 3 cycles before data is valid; consumer must respect FWFT (first-word-fall-through) mode or pipeline-aware mode.
Programmable almost-full and almost-empty thresholds back-pressure the producer when fewer than 64 entries are free.
Resource cost: 1 BRAM36 + 16 LUT + ~40 FF.
Critical timing constraint: set_max_delay -datapath_only -from clk_wr -to clk_rd 6.4 ns (one rd period), and Vivado handles internal CDC paths via the macro’s pre-validated constraint set automatically.
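
The single-bit-change property that makes Gray-coded pointers safe to synchronise can be verified exhaustively for the 11-bit pointers in this example. This is a software sketch of the encoding only, not of the XPM macro internals; the function names are hypothetical.

```python
PTR_BITS = 11  # 10 address bits + 1 wrap bit, as in the example above


def bin_to_gray(value: int) -> int:
    """Binary to reflected Gray code."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Gray code back to binary by XOR-ing all right shifts together."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    size = 1 << PTR_BITS
    for ptr in range(size):
        nxt = (ptr + 1) % size  # includes the wrap from 2047 back to 0
        assert bits_changed(bin_to_gray(ptr), bin_to_gray(nxt)) == 1
        assert gray_to_bin(bin_to_gray(ptr)) == ptr
    print("binary 1023 -> 1024 flips", bits_changed(1023, 1024), "bits")
    print("Gray-coded pointers flip exactly 1 bit on every increment")
```

Because only one bit changes per increment, a synchroniser that samples mid-transition sees either the old or the new pointer value, never an unrelated one.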

Worked example C — Closing timing on a critical path

A motor-control inner loop targets 300 MHz on Zynq UltraScale+ ZU3EG-1 to compute a 32-bit Park transform + PI controller every cycle.

Initial implementation: Vivado reports WNS = −0.85 ns on clk_300m. The timing report shows the worst path is FF → 7 levels of LUT (combinational logic including a 32-bit signed multiplier built in fabric) → FF, with t_CO = 0.35 ns + t_LUT = 2.2 ns + t_route = 1.5 ns + t_setup = 0.13 ns = 4.18 ns against a 3.33 ns period, i.e. 3.33 − 4.18 = −0.85 ns.

Three fixes considered:

Retiming: let Vivado auto-balance combinational stages by sliding existing FFs across logic, enabled globally with the -retiming option of synth_design or locally with the retiming_forward / retiming_backward attributes on the registers involved. Vivado moves 2 of the 7 LUT levels into the next stage. New WNS = +0.08 ns. Closes with no functional change; because retiming moves registers rather than adding them, latency only grows by a cycle if a spare register stage has to be added at the boundary for the tool to work with.
Manual pipelining: insert a register between the multiplier output and downstream logic. Similar effect to auto-retiming but explicit in RTL, at the cost of one cycle of latency; preferred for safety-critical (DO-254) work where every register must be designer-justified.
Map multiplier to DSP slice: the multiplier was being mapped to LUT fabric (Vivado’s heuristic considered it too narrow). Add the (* USE_DSP = "yes" *) attribute. The DSP48E2 absorbs the multiply in a single hardened slice with 1 ns total, freeing the LUT logic. New WNS = +0.42 ns. Best fix: no extra latency, fewer LUTs, hard timing margin.

Real designs use combinations — retiming for routine paths, manual pipelining for hand-tuned cores, DSP/BRAM forcing for arithmetic-heavy or storage-heavy paths.

Worked example D — Power budget for a portable instrument

A handheld test instrument runs an XC7Z020-1CLG400C (Zynq-7020) at 100 MHz fabric clock with the dual Cortex-A9 PS at 666 MHz. Estimate average power.

PS (processing system): V_CCPINT = 1.0 V, ~250 mW at 666 MHz with dual cores active per Xilinx Power Estimator (XPE).
PL (programmable logic) dynamic: 45 k LUT @ 30 % toggle, 60 k FF, 80 BRAM, 60 DSP — XPE reports ~480 mW at 100 MHz, V_CCINT = 1.0 V.
PL static (leakage): temperature-dependent. At T_J = 50 °C, ~120 mW.
I/O banks: 80 LVCMOS33 outputs at 20 MHz, C_L = 15 pF each. P_IO = C · V² · f · α · N = 15 pF × (3.3 V)² × 20 MHz × 0.25 × 80 = 65 mW.
MGT transceivers: not used → 0 mW.
DDR3 PHY: ~150 mW for 32-bit DDR3-1066.

Total: ~1.07 W. With a 3.7 V Li-ion at 3400 mAh = 12.6 Wh, runtime ≈ 12.6 / 1.07 ≈ 11.8 hours continuous fabric activity. Adding clock gating to the unused 60 % of PL FFs and dropping the PS to 333 MHz when idle pushes average power to ~600 mW and runtime to ~21 h.
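
The arithmetic in this budget is easy to script so that individual entries can be varied. The sketch below reuses the figures from the example; the helper names are hypothetical, and the XPE-derived entries are entered as fixed numbers rather than computed.

```python
def io_power_mw(c_load_pf, v_io, f_mhz, activity, n_pins):
    """Switching power of N outputs in mW: C * V^2 * f * alpha * N."""
    watts = c_load_pf * 1e-12 * v_io ** 2 * f_mhz * 1e6 * activity * n_pins
    return watts * 1e3


def runtime_hours(total_mw, v_batt, capacity_mah):
    """Battery runtime in hours, ignoring regulator losses (an assumption)."""
    energy_wh = v_batt * capacity_mah / 1000.0
    return energy_wh / (total_mw / 1000.0)


# Entries in mW; all but the I/O term are taken directly from the example.
BUDGET_MW = {
    "PS, dual A9 at 666 MHz": 250.0,
    "PL dynamic": 480.0,
    "PL static at 50 C": 120.0,
    "I/O, 80 x LVCMOS33": io_power_mw(15, 3.3, 20, 0.25, 80),
    "MGT transceivers": 0.0,
    "DDR3 PHY": 150.0,
}

if __name__ == "__main__":
    total = sum(BUDGET_MW.values())
    for name, mw in BUDGET_MW.items():
        print(f"{name:26s} {mw:7.1f} mW")
    print(f"{'Total':26s} {total:7.1f} mW")
    print(f"Runtime at full load: {runtime_hours(total, 3.7, 3400):.1f} h")
    print(f"Runtime at 600 mW:    {runtime_hours(600.0, 3.7, 3400):.1f} h")
```

Regulator efficiency is not modelled; a real design would divide the battery energy by the converter efficiency before computing runtime.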

4. Reference data

FPGA vendor and family landscape

Vendor	Family	Process	Largest part	LUTs	DSP	BRAM (Mb)	SerDes max	Use case
AMD (Xilinx)	Spartan-7	28 nm	XC7S100	102 k	160	4.1	none	Cost-sensitive embedded
AMD	Artix-7	28 nm	XC7A200T	215 k	740	13	6.6 Gb/s	Industrial, prototyping
AMD	Kintex-7	28 nm	XC7K480T	478 k	1920	34	12.5 Gb/s	Networking, mid-range DSP
AMD	Virtex-7	28 nm	XC7V2000T	1955 k	2160	46	13.1 Gb/s	High-end (legacy)
AMD	Zynq-7000 SoC	28 nm	XC7Z045 / XC7Z020	350 k / 85 k	900 / 220	19 / 4.4	12.5 Gb/s	Dual Cortex-A9 + PL
AMD	Kintex UltraScale	20 nm	XCKU115	1451 k	5520	75	30.5 Gb/s	DSP, ML, radar
AMD	Kintex UltraScale+	16 nm	XCKU15P	1143 k	1968	35	32.75 Gb/s	5G, networking
AMD	Virtex UltraScale+	16 nm	XCVU19P	8938 k	3840	224 (BRAM+URAM)	58 Gb/s PAM4	ASIC emulation, data centre
AMD	Zynq UltraScale+ MPSoC	16 nm	XCZU19EG	1143 k	1968	35	32.75 Gb/s	Quad A53 + dual R5 + PL
AMD	Versal Prime	7 nm	VP1502	1968 k	7392	90	112 Gb/s	NoC + AI Engines + PL
AMD	Versal AI Core	7 nm	VC1902	1969 k	1968	60	58 Gb/s	400 AIE tiles for ML
AMD	Versal Premium	7 nm	VP1802	3735 k	10848	199	112 Gb/s PAM4	Hyperscale networking
Intel (Altera)	Cyclone IV / V	60 / 28 nm	5CGXFC9	301 k	342	13.9	6.144 Gb/s	Cost-sensitive
Intel	Cyclone V SoC	28 nm	5CSXFC6	110 k	112	5.1	6.144 Gb/s	Dual Cortex-A9 + FPGA
Intel	Arria 10	20 nm	10AX115	1150 k	1518	53	17.4 Gb/s	Mid-range DSP / video
Intel	Stratix 10 GX/MX	14 nm	10SG280	2753 k	5760	229 (M20K + HBM)	28.9 Gb/s	High-end DSP / HBM2
Intel	Agilex 7	10 nm SF	AGFB014R24A2E2V (F-tile)	2692 k	8736	165	116 Gb/s PAM4	DC, networking
Intel	Agilex 5	7 nm	A5E065BB	656 k	1024	16	28 Gb/s	Edge AI, mid-range
Lattice	iCE40 UltraPlus	40 nm	iCE40UP5K	5.3 k	8	1 (BRAM + SPRAM)	none	Smartphone sensor hub
Lattice	iCE40 HX	40 nm	iCE40HX8K	7.7 k	0	0.13	none	Tiny dev boards
Lattice	ECP5	40 nm	ECP5-85F (LFE5UM5G-85F)	84 k	156	3.7	5 Gb/s	Mid-range, open-source toolchain
Lattice	MachXO3 / CrossLink-NX	28 nm FD-SOI	LIFCL-40	40 k	156	2.5	10.3 Gb/s	Edge AI, MIPI
Lattice	CertusPro-NX	28 nm FD-SOI	LFCPNX-100	96 k	174	7.5	5 Gb/s	Edge processing
Lattice	Avant-E / Avant-G	16 nm	LAV-AT-100	500 k	1300	36	25 Gb/s	High-mid range
Microchip	PolarFire	28 nm SONOS-NV	MPF300 / MPF500	481 k	1480	33	12.7 Gb/s	Low-power, secure, non-volatile
Microchip	PolarFire SoC	28 nm	MPFS250T	254 k	784	17	12.7 Gb/s	5-core RISC-V (4×U54 + 1×E51)
Microchip	RTG4	65 nm rad-hard	RT4G150	151 k	462	5.2	3.125 Gb/s	Space, rad-tolerant
Microchip	SmartFusion2	65 nm	M2S150	146 k	240	4.6	5 Gb/s	Cortex-M3 + secure flash
Achronix	Speedster7t	7 nm TSMC	AC7t1500	692 k	2560 (MLP)	195	112 Gb/s	Networking, ML
Efinix	Trion / Titanium	40 nm / 16 nm	Ti200	200 k	320	18	25 Gb/s	Edge AI, low-cost
GOWIN	LittleBee / Arora	55 / 22 nm	GW5AT-138	138 k	116	11	12.5 Gb/s	Chinese consumer / IoT

HDL language comparison

Language	Standard	Origin	Verbosity	Verification	Synthesis support	Industry footprint
Verilog	IEEE 1364-2005	Gateway Design (1985), C-like	Low	Limited (testbench is just RTL)	All vendors	Legacy still common; superseded by SystemVerilog
SystemVerilog	IEEE 1800-2023	Synopsys/Accellera (2002), superset of Verilog	Low-medium	UVM, constrained-random, SVA, classes	All vendors; subset on open tools	Default for ASIC + most new FPGA RTL
VHDL	IEEE 1076-2019	DoD VHSIC (1987), Ada-derived	High (strongly typed)	Native records, generics; OSVVM	All vendors	Aerospace, defence, European industry
Chisel	UCB, Scala-embedded	2012	Medium (Scala metaprog)	ScalaTest + Treadle	Emits Verilog	SiFive, Esperanto, Google TPU
SpinalHDL	Scala-embedded	2015	Low (cleaner than Chisel)	SpinalSim, Cocotb	Emits Verilog/VHDL	Open-source community, VexRiscv
Amaranth (nMigen)	Python-embedded	2020	Low	Cocotb, native sim	Emits Verilog	iCE40/ECP5 open ecosystem
MyHDL	Python-embedded	2003	Low	Native Python	Emits Verilog/VHDL	Niche; older
Bluespec SystemVerilog	Bluespec Inc.	2000	High (atomic transactions)	Native	Emits Verilog	Research, defence
Vitis HLS (C/C++)	C++ subset + pragmas	AMD/Xilinx	High (raised abstraction)	C testbench + co-sim	Emits Verilog	DSP, vision, ML — wrap in RTL shell
Intel HLS / oneAPI FPGA	OpenCL / SYCL / C++	Intel	High	Native, with emulation	Emits Verilog	Intel-only

Tier-3 idioms and code patterns: [[Languages/Tier3/hdl]].

Tool flow stages

Stage	What it does	Vendor commercial tools	Open-source equivalents
RTL coding	Designer writes Verilog/VHDL	Any text editor; VS Code; vivado-vhdl-mode	Same
Elaboration + lint	Resolves hierarchy, parameters, checks style	Synopsys SpyGlass, Cadence HAL, Vivado lint	Verible, sv2v + lint
Simulation	Run testbench against RTL	ModelSim, Questa, VCS, Xcelium, Vivado XSIM	Verilator, Icarus, GHDL, Cocotb
Synthesis	RTL → netlist of LUTs + FFs + hard IP	Vivado Synth, Quartus Pro, Synplify Pro, Diamond Synth	Yosys (+ SystemVerilog plugin)
Floorplan / placement	Assigns netlist to physical sites	Vivado Implementation, Quartus Fitter, Diamond Map	nextpnr (iCE40, ECP5, Nexus, Gowin, GateMate)
Routing	Connects placed sites via metal	Same as placement	Same as placement
Static timing analysis	Verifies setup/hold across PVT corners	Vivado Timing, Quartus Timing Analyzer (formerly TimeQuest), Tempus (Cadence)	OpenSTA
Bitstream generation	Produces device-loadable binary	Vivado write_bitstream, Quartus assembler, Diamond bitgen	Project IceStorm, Project Trellis, Project Apicula
Configuration / programming	Loads bitstream into FPGA	Vivado Hardware Manager, Quartus Programmer	openFPGALoader, iceprog, ecpprog, openocd
Hardware debug	In-system probing	Vivado ILA / VIO, SignalTap, Lattice Reveal	LiteScope, Glasgow Interface Explorer

Common-design-block resource budgets (Xilinx UltraScale+ -2 speed grade)

Block	LUTs	FFs	BRAM18	URAM	DSP	f_max (typ)
32-bit binary counter	32	32	0	0	0	750 MHz
32-bit ripple adder	32	0	0	0	0	combinational
32-bit Kogge-Stone adder	~150	0	0	0	0	combinational, ~3 LUT levels
32×32 = 64 multiply (DSP)	~50	100	0	0	4	600 MHz pipelined
1024×32 single-port BRAM	0	~40	2	0	0	700 MHz
4096×72 dual-port BRAM	0	~80	16	0	0	600 MHz
8192×72 packet buffer	0	~100	0	2	0	700 MHz
Async FIFO 1024×64 (XPM)	100	200	4	0	0	500 MHz both clocks
AXI4-Stream 64-bit interconnect M:N=2:2	~3000	~5000	2	0	0	400 MHz
10 GbE MAC (soft)	~10 k	~12 k	8	0	0	156.25 MHz
100 GbE CMAC hard block	0	0	0	0	0	322.27 MHz native
PCIe Gen3 x8 + XDMA	~25 k	~40 k	30	0	0	250 MHz user clock
DDR4-2400 MIG controller	~15 k	~20 k	20	0	0	300 MHz user clock
RISC-V VexRiscv core (small)	~1 k	~1 k	1	0	0	200 MHz
MicroBlaze (small)	~1 k	~800	4	0	0	250 MHz

Certification standards for hardware (FPGA-relevant)

Standard	Domain	Level / scope	What it requires
DO-254	Civil avionics	DAL A–E (A = catastrophic)	Hardware planning, requirements, design, V&V, configuration mgmt, traceability. DAL A/B need independent verification, elemental analysis (covered by Vivado / Synplify in DO-254 mode).
IEC 61508-2 / -3	Industrial functional safety	SIL 1–4	Random + systematic failure rates, FMEDA, diagnostic coverage. Hardware (-2) and software/IP (-3) tracks.
ISO 26262-5 / -11	Automotive (passenger)	ASIL A–D (D highest)	Lockstep / DCLS cores, hardware metrics (SPFM, LFM, PMHF), independence of safety mechanism. Microchip PolarFire SoC, Xilinx Zynq US+ MPSoC, NXP S32 line all have ASIL-D ready variants.
EN 50128 / EN 50129	Rail signalling	SIL 1–4	Similar to IEC 61508, with rail-specific requirements.
IEC 62443	Industrial cybersecurity	SL 1–4	Bitstream encryption, secure boot, key storage.
MIL-STD-883 / -461 / -750	Defence	Class B/S	Environmental, EMI, mechanical.
RTCA DO-178C	Civil avionics software	DAL A–E	If a soft-core runs software, software certs apply alongside DO-254.
NASA EEE-INST-002	Spaceflight parts	Level 1/2/3	Screening, qualification for space. RTG4 and Virtex-5QV target this.

5p. Theory — synthesis, timing closure, and CDC for FPGAs

Why LUTs win over discrete gates

A k-input LUT is a 2^k-bit SRAM with a tree of 2:1 muxes selecting one bit based on the k input signals. Any k-input Boolean function is implementable by writing the function’s truth table into the SRAM at configuration time. For k = 6 (Xilinx UltraScale+), a single LUT6 absorbs functions up to 6 variables — equivalent to several discrete gates worth of logic in one routing-hop. The trade-off vs ASIC standard cells:

LUT area cost: ~12× the equivalent NAND2 in standard cells.
LUT delay: ~5–10× a standard-cell NAND2 (gate delay + intra-LUT mux propagation).
LUT power: ~5× per equivalent function.

These are the structural reasons an FPGA design is typically 3–10× larger, 3–5× slower, and 5–20× more power-hungry than the same RTL on an ASIC at the same process node. The reconfigurability — no NRE, instant turn-around, in-field updates — is what justifies those penalties for the application classes where they fit.
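
The LUT mechanism described at the start of this subsection (a 2^k-bit truth table read out through a tree of 2:1 muxes) can be modelled in a few lines. This is a behavioural sketch with hypothetical function names; the bit ordering, with input 0 as the least significant address bit, is an assumption chosen for the model and is not tied to any vendor's INIT convention.

```python
def make_lut_init(func, k):
    """Pack the truth table of a k-input function into a 2**k-bit integer."""
    init = 0
    for addr in range(2 ** k):
        inputs = [(addr >> i) & 1 for i in range(k)]  # inputs[0] = address LSB
        if func(*inputs):
            init |= 1 << addr
    return init


def lut_read(init, inputs):
    """Select one stored bit through a tree of 2:1 muxes, one level per input."""
    k = len(inputs)
    level = [(init >> addr) & 1 for addr in range(2 ** k)]
    for sel in inputs:  # inputs[0] drives the first mux level
        level = [level[2 * i + 1] if sel else level[2 * i]
                 for i in range(len(level) // 2)]
    return level[0]


if __name__ == "__main__":
    def f(a, b, c, d, e, g):
        # An arbitrary 6-input function used as the example.
        return (a & b) | (c & d) | (e & g)

    init = make_lut_init(f, 6)
    print(f"LUT6 contents: 0x{init:016X}")
    for addr in range(64):
        bits = [(addr >> i) & 1 for i in range(6)]
        assert lut_read(init, bits) == f(*bits)
    print("mux-tree read matches the function for all 64 input combinations")
```

Changing the function only changes the 64 stored bits; the mux tree, and therefore the delay, stays the same, which is why any 6-input function costs one LUT delay.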

Synthesis flow inside the FPGA tool

Modern FPGA synthesis (Vivado Synth, Quartus Pro, Synplify Pro, Yosys) follows the same general flow as ASIC synthesis ([[Engineering/digital-logic]] section 5p) with one key difference: the target library is a fixed set of device primitives (LUT6, FDRE, FDSE, RAMB36, DSP48E2, MMCM, BUFG, GTY) rather than a foundry standard-cell library. The stages are:

Elaboration: builds the design hierarchy and resolves generics, parameter overrides, generate constructs, and package items.
Logic synthesis: combinational logic mapped to LUT6 / LUT5 / fracturable LUT pairs, using Espresso-style minimisation and Boolean restructuring.
Resource mapping: multipliers mapped to DSP48 or LUT fabric (driven by the USE_DSP attribute and width thresholds), memories mapped to BRAM / URAM / distributed RAM (driven by RAM_STYLE), shift registers to SRL16/SRL32.
Retiming: optional automatic FF migration across logic boundaries to balance pipeline stages (-retiming in Vivado synth, -pipelining in Synplify).
Technology mapping: the final form fits the device family’s LUT/FF/BRAM/DSP primitive set.

Output is a netlist; place-and-route (Implementation in Vivado, Fitter in Quartus) assigns physical sites, routes signals, and reports actual timing.

Static Timing Analysis on FPGAs

STA on an FPGA differs from ASIC STA mainly in the device-specific PVT corners the tool ships with. Vivado timing analyses paths across slow-slow-cold (worst setup, e.g. -40 °C, V_CCINT = 0.825 V, slow process for -1 part) and fast-fast-hot for hold. The user provides:

Clock periods — create_clock -period 2.5 [get_ports clk_400m]
Clock relationships — set_clock_groups -asynchronous -group {clk_a} -group {clk_b} when domains are independent
I/O delays — set_input_delay -clock clk -min/-max <ns> [get_ports din] for source-synchronous timing
False paths — set_false_path -from [get_pins rst*] for static signals or properly synchronised paths
Multicycle paths — set_multicycle_path 2 -setup -from ... -to ... when data must hold valid for multiple cycles

Constraint files use XDC (Xilinx Design Constraints) or SDC (Synopsys Design Constraints) syntax — XDC is essentially TCL with a curated subset of SDC commands. Intel Quartus uses SDC directly. Constraints do not constrain the design; they tell the tool the designer’s intent. Wrong constraints (especially missing false paths or wrong clock groups) cause either timing failures the design doesn’t really have, or — far worse — passing timing on paths that are actually broken.

Clock-domain crossing on FPGAs

CDC fundamentals are in [[Engineering/digital-logic]] section 5p. FPGA-specific points:

Vendor synchroniser primitives. Xilinx XPM_CDC_SYNC_RST, XPM_CDC_SINGLE, XPM_CDC_GRAY, XPM_CDC_HANDSHAKE, XPM_CDC_PULSE macros wrap 2/3-FF synchronisers, gray-counter CDC, full handshake protocols, and pulse extension respectively. Each carries automatic constraints (ASYNC_REG = TRUE attribute on the synchroniser flip-flops, plus set_max_delay -datapath_only). Intel altera_std_synchronizer and dcfifo macros do the same.
ASYNC_REG = TRUE attribute (Xilinx) places the synchroniser FFs in adjacent slice locations and prevents synthesis from optimising them away. Quartus equivalent: synchronizer_identification = forced.
MTBF reporting. Vivado reports CDC violations (paths through unsynchronised FFs in different clock groups) but does not compute MTBF — the designer must own the failure analysis for unconventional crossings.
Multi-bit CDC. A 2-FF synchroniser on each bit of a 32-bit bus does not work — different bits resolve at different cycles. Use an async FIFO, a handshake, or a gray-coded encoding for monotonic counters.
Reset CDC. Each clock domain must have its own reset synchroniser; an asynchronous reset asserted globally is fine, but deassertion must be re-synchronised to each domain’s clock individually. Skipping this lets some FFs come out of reset on a different cycle to others, breaking initial state.
Recovery / removal timing — STA checks for asynchronous reset paths, analogous to setup/hold for data. Vivado reports recovery (reset must deassert this much before the clock edge) and removal (must stay asserted this long after); if these fail, asynchronous reset is effectively a CDC violation and the synchroniser is mandatory.

Reconfiguration

The bitstream is loaded at power-up by one of:

Master SPI from external Flash (most common; Micron N25Qxxx, Macronix MX25L, Cypress S25FL). 4 Mb to 2 Gb sizes. Boot in 1–10 s typical.
Master BPI (parallel NOR Flash) — faster but more pins.
JTAG (IEEE 1149.1) — typically development only. Xilinx XVC (Xilinx Virtual Cable) allows remote JTAG over TCP/IP.
SelectMAP / FPP — parallel slave configuration from an external CPU (ARM, x86). Used in many SmartNIC designs where an external host owns the boot.

Partial reconfiguration (PR) — Xilinx Dynamic Function eXchange (DFX) and Intel Partial Reconfiguration allow swapping a region of the fabric while the rest continues to run. Used for FPGA-as-a-service (AWS F1), software-defined radio with multiple waveform plug-ins, and hardware accelerator hotswap. PR regions (“pblocks”) must be floorplanned, and the bitstream is region-specific. Reconfig time on UltraScale+: ~5–50 ms per region typical.

Power consumption

P_total = P_dynamic + P_static + P_IO_xceiver

Dynamic: α · C · V_CCINT² · f, summed over all toggling nodes. Reduced by clock gating (BUFGCE with enable), reducing toggle activity, and lowering V_CCINT (configurable via UltraScale+ system-monitor + external regulator).
Static (leakage): temperature-dependent and a significant contributor at 28 nm and below. A fully populated VU13P at 85 °C die temperature draws ~30 W of static power alone.
Transceivers and I/O: GTH/GTY at full line rate draw ~0.4 W per lane, so a 16-lane PCIe Gen4 link costs ~7 W.

Xilinx Power Estimator (XPE) spreadsheets give a pre-implementation budget; the post-implementation report_power command in Vivado gives an estimate based on the actual placed-and-routed design. The two typically match within 10 % when activity factors are realistic.

Floorplanning and physical constraints

Large designs (> 50 % of fabric utilised, or with mixed-clock-domain blocks) often need floorplan constraints to converge timing. Tools:

Pblocks (Xilinx) / LogicLock regions (Intel) — rectangular fabric regions that constrain placement of a hierarchical module. Used to (a) keep latency-critical logic together, (b) reserve area for partial-reconfiguration regions, (c) isolate safety-critical lockstep blocks from each other for ISO 26262 independence requirements.
LOC constraints — pin a specific cell to a specific site (e.g. set_property LOC SLICE_X10Y20 [get_cells my_reg]). Used sparingly — usually for IO buffers, MMCMs, GTYs, and the occasional high-fanout register.
Clock-region constraints — confine a clock and its loads to a clock region (clusters of ~50–100 CLBs sharing a clock-region network). Reduces clock skew and routing congestion.
Hold-fixing buffers — Vivado / Quartus auto-insert LUT1 buffers in too-fast paths during the optimise phase; user usually doesn’t intervene.

Bad floorplanning is worse than no floorplanning. Pblocks too tight cause congestion + routing failure; pblocks too loose defeat the purpose. Start without floorplanning, add only after the tool has shown which paths it cannot close on its own.

6p. Application — design patterns

Finite state machine (Moore vs Mealy, one-hot vs binary encoding). The synthesis tool picks the encoding automatically unless it is overridden with the FSM_ENCODING attribute: typically one_hot for ≤ 16 states (LUT-efficient on FPGA), binary or gray for larger machines.
FIFO — synchronous (single-clock, BRAM-backed) or asynchronous (CDC, gray-coded pointers). Always parameterised macros (Xilinx XPM_FIFO_*, Intel dcfifo).
Pipelined DSP — FIR filters via DSP slices, polyphase decimators / interpolators, CIC filters, FFT cores (Vitis DSP Library, Intel FFT IP). Symmetric / antisymmetric filters halve multiplier count via pre-adders.
Memory controllers — DDR4 / DDR5 / LPDDR4 / HBM via vendor IP. Xilinx MIG (Memory Interface Generator) for 7-series, integrated DDR/HBM controllers on UltraScale+ and Versal. Latency 80–150 ns to DDR4 from AXI4 master.
AXI4 / AXI4-Stream / AXI4-Lite — ARM AMBA interface fabric. AXI4 burst memory, AXI4-Stream packet/sample DMA, AXI4-Lite for control registers. Bridges via Xilinx AXI Interconnect or SmartConnect, Intel Avalon-MM↔AXI bridges. Most third-party IP standardises on AXI.
PCIe endpoint / root-complex — hard IP block + DMA engine (Xilinx XDMA, QDMA; Intel DMA IP). User logic sits behind an AXI4-Stream packet interface or AXI4 memory-mapped interface. Typical sustained PCIe Gen3 x8 throughput: 7.5 GB/s.
Ethernet MAC + PHY — 1G via tri-mode soft MAC + RGMII PHY; 10G via XGEMAC + KR/SR SerDes; 25G / 100G via hard CMAC + RS-FEC. PCS/PMA always uses transceivers.
Image-processing pipelines — Vitis Vision Library (HLS-generated OpenCV-equivalent kernels), Intel Computer Vision SDK.
HLS for ML inference — Vitis AI (Xilinx, DPU IP for Zynq US+ and Versal AIE), Intel OpenVINO FPGA, FINN (Xilinx research; quantised CNN to HLS), hls4ml (CERN; tiny networks for trigger systems).
Soft-core CPUs — Xilinx MicroBlaze (32-bit, RISC-like), Intel Nios II (proprietary ISA) and Nios V (RISC-V), VexRiscv (open-source, configurable RISC-V), PicoRV32 (tiny RISC-V). Used for control plane around fabric datapaths; see [[Engineering/microcontrollers]] for software workflow.
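
The gray-coded pointers mentioned under FIFO above are worth a concrete look. The sketch below is a software model only, not RTL, and the 4-bit pointer width is a hypothetical choice. It shows why an asynchronous FIFO passes Gray-coded rather than binary pointers across the clock boundary: each increment changes exactly one bit, so a pointer sampled mid-transition is off by at most one position instead of being arbitrary.

```python
def bin_to_gray(n: int) -> int:
    """Convert a non-negative binary count to reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Invert bin_to_gray by XOR-folding successively shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


ADDR_BITS = 4            # hypothetical pointer width (16-entry FIFO)
DEPTH = 1 << ADDR_BITS

gray_seq = [bin_to_gray(i) for i in range(DEPTH)]

# Every increment, including the wrap from DEPTH-1 back to 0, flips one bit.
single_bit_steps = all(
    hamming(gray_seq[i], gray_seq[(i + 1) % DEPTH]) == 1 for i in range(DEPTH)
)

# A plain binary counter can flip every bit at once (for example 0111 -> 1000).
worst_binary_step = max(hamming(i, (i + 1) % DEPTH) for i in range(DEPTH))
```

The single-bit property only holds for power-of-two depths with this encoding, which is one reason the vendor FIFO macros restrict depth choices.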

FPGA vs ASIC vs CPU — when to choose which

Concern	FPGA	ASIC	CPU / MCU
Unit volume	< 100 k	> 1 M	any
Time to first silicon	weeks	months to years	hours
NRE	USD 0–100 k (tools + dev kit)	USD 1 M – 100 M (mask + design)	USD 0
Unit cost (high volume)	USD 5–10 000	USD 0.10–500	USD 0.30–100
Power efficiency vs algorithm	mid (5–20× worse than ASIC)	best	worst on data-parallel; best on control-heavy
Determinism	hard real-time, sub-ns jitter	hard real-time	varies — best on bare-metal MCU, worst on Linux SoC
Field update	yes (bitstream)	no (mask spin)	yes (firmware)
Protocol flexibility	yes (any I/O, any line-rate ≤ device limit)	no	limited to integrated peripherals
Tooling cost	USD 0–50 k seats	USD 1–10 M tool + verification IP	USD 0–10 k seats
Required expertise	hardware (RTL + verification + closure)	hardware + foundry partnership	software

The decision tree: if the workload fits the CPU’s throughput and timing budget, use a CPU (software is always cheaper). If it doesn’t fit a CPU but ships < 100 k units, use an FPGA. If volume > 1 M units, and the design is stable, and power efficiency / unit cost matter, the ASIC NRE amortises. Many products start FPGA, prove the architecture at hundreds-of-units pilot scale, then port to ASIC for volume launch.
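
The decision tree can be written down as a small function. This is a sketch of the rule of thumb above, not a sizing tool: the function and argument names are hypothetical, and the text gives no rule for volumes between 100 k and 1 M units or for unstable high-volume designs, so the fallback branch is an assumption based on the "start FPGA, port to ASIC later" remark.

```python
def choose_platform(fits_cpu: bool, units: int,
                    design_stable: bool, efficiency_matters: bool) -> str:
    """Return a first-pass platform choice following the decision tree."""
    if fits_cpu:
        # Software is the cheapest option whenever it meets the budget.
        return "CPU/MCU"
    if units < 100_000:
        return "FPGA"
    if units > 1_000_000 and design_stable and efficiency_matters:
        # Only here does the ASIC NRE amortise.
        return "ASIC"
    # Assumed fallback for the cases the text leaves open.
    return "FPGA now, revisit ASIC"


examples = [
    choose_platform(True, 5_000_000, True, True),
    choose_platform(False, 20_000, False, False),
    choose_platform(False, 3_000_000, True, True),
    choose_platform(False, 400_000, True, True),
]
```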

Verification — the other half of the project

A real FPGA RTL block is verified against several tiers of test before first board bring-up (or, for RTL destined for an ASIC, before tape-out):

Unit-level RTL simulation — testbench drives the block’s interfaces, scoreboards check responses. ModelSim / Questa / Verilator. Most engineers spend 50–70 % of project time here.
UVM (Universal Verification Methodology, IEEE 1800.2) — SystemVerilog object-oriented testbench framework with constrained-random stimulus, transaction-level interfaces, sequencers, drivers, monitors, scoreboards, coverage collectors. Dominant for ASIC; increasingly used on large FPGA SoCs.
Cocotb (Python) — Python-based testbench wrapping any simulator. Faster to write than UVM; preferred in startups, open-source projects, and small FPGA teams.
OSVVM (Open Source VHDL Verification Methodology) — VHDL equivalent of UVM. Free, IEEE-compatible, used in defence and aerospace.
Formal property verification — JasperGold, VC Formal, SymbiYosys. Mathematical proof that SystemVerilog Assertions (SVA) hold for all input sequences within bounded depth. Catches deadlocks, livelocks, AXI4-protocol violations, and CDC bugs that random simulation misses.
Hardware-in-the-loop (HIL) — design loaded onto a dev board with traffic generators or scope-injected stimulus. Catches issues with timing margins, PVT, power-up sequencing, and analog-coupled effects (jitter, SI) that simulation misses.
Emulation — Cadence Palladium / Synopsys ZeBu / Siemens (Mentor) Veloce map the entire RTL onto a dedicated emulation appliance (FPGA-based in ZeBu, custom emulation silicon in Palladium and Veloce) running at MHz speeds, so real software workloads can run pre-silicon. USD millions per box; only large SoC teams use them.
Coverage closure — code coverage (line, toggle, branch, expression, FSM), functional coverage (covergroups in SystemVerilog), assertion coverage. 100 % code coverage and 95 % + functional is the gate to tape-out for many ASIC teams; FPGA teams typically target 80 % + and rely on board-level testing for the rest.
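
Functional coverage is easier to reason about with a toy model. The sketch below imitates a SystemVerilog coverpoint in plain Python: named bins, a sample call, and a percentage. The FIFO depth, the bin names, and the use of a single coverpoint are all hypothetical; the 80 % figure is the FPGA target quoted above.

```python
from typing import Callable, Dict


class CoverPoint:
    """Counts hits per named bin, loosely like a SystemVerilog coverpoint."""

    def __init__(self, bins: Dict[str, Callable[[int], bool]]):
        self.bins = bins
        self.hits = {name: 0 for name in bins}

    def sample(self, value: int) -> None:
        """Record one observation against every bin it satisfies."""
        for name, predicate in self.bins.items():
            if predicate(value):
                self.hits[name] += 1

    def percent(self) -> float:
        """Share of bins hit at least once, as a percentage."""
        covered = sum(1 for count in self.hits.values() if count > 0)
        return 100.0 * covered / len(self.bins)


FIFO_DEPTH = 16   # hypothetical DUT parameter
FPGA_GATE = 80.0  # the 80 % target quoted above

occupancy = CoverPoint({
    "empty": lambda n: n == 0,
    "partial": lambda n: 0 < n < FIFO_DEPTH,
    "full": lambda n: n == FIFO_DEPTH,
})

# A directed test that fills the FIFO but never observes it empty.
for level in range(1, FIFO_DEPTH + 1):
    occupancy.sample(level)

gate_met = occupancy.percent() >= FPGA_GATE
```

Here two of three bins are hit, so the gate is not met even though every line of a FIFO model may have executed. That gap between code coverage and functional coverage is why both are tracked.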

7p. Edge cases & assumptions

Inferred latches from incomplete case / if-else chains — usually a bug. Always include default (case) or else (if). A combinational always block that does not assign a signal in every path infers a latch, which is level-sensitive, glitch-prone, and hard for STA to analyse on FPGAs.
Clock-domain crossing without synchronisers → metastability → field failures. Vivado CDC reports flag this; ignore at peril. The 2-FF synchroniser brings MTBF to centuries; absence brings it to days.
Reset strategy. Synchronous reset is preferred on Xilinx UltraScale+: the fabric FFs support both styles, but DSP48 and BRAM registers have synchronous reset only, so async-reset registers cannot be absorbed into those blocks. Synchronous reset also keeps the reset path analysable by STA. An asynchronous reset must always be deasserted synchronously (small reset-synchroniser per clock domain). Microchip PolarFire prefers async assert / sync deassert; Intel Agilex prefers fully sync.
Initialisation of BRAM / FF. Xilinx FFs initialise to the value specified by their INIT attribute (default 0) at configuration. BRAMs initialise from INIT_xx configuration data; UltraScale+ URAMs power up as zeros and cannot be preloaded from the bitstream. On Lattice iCE40 the EBR blocks can be preloaded, but the UltraPlus SPRAM blocks cannot (designer must write them in a startup state machine). Microchip PolarFire’s non-volatile cells retain prior state across power cycles, which is both useful and dangerous.
Power consumption. Dynamic ∝ C · V² · f; at 1 GHz on Versal Premium a fully-active 1 % of fabric is enough to push 50 W. Clock gating is mandatory for portable / battery applications. UltraScale+ has dynamic-V_CCINT (configurable supply voltage at runtime) but few designs use it.
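Reconfiguration time. Cold boot 10–500 ms typical depending on bitstream size and SPI flash speed. The largest devices fall well outside that range: a Versal Premium VP1502 image of ~200 MB needs about 4 s over quad SPI at 100 MHz (4 bits per clock, 50 MB/s), so an octal or otherwise wider boot interface is worth budgeting for. Partial reconfig: 5–50 ms per region.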
Single-event upset (SEU) in space / high-altitude / high-energy physics. Cosmic rays flip individual configuration bits, corrupting logic. Mitigations: SEU-mitigated FPGAs (Microchip RTG4 with TMR on every FF, Xilinx Virtex-5QV, RT Kintex UltraScale), bitstream scrubbing (Xilinx SEM IP periodically re-reads and corrects configuration), TMR (triple-modular redundancy) at user-logic level (Synopsys Synplify Pro auto-TMR, Mentor Precision Hi-Rel), ECC on BRAM (built-in on UltraScale+).
Bitstream encryption and IP security. AES-256 / AES-GCM bitstream encryption with keys in eFuse or battery-backed BBRAM (Xilinx, Intel). Authentication via HMAC or RSA-4096 signature. Required for anti-piracy and anti-tamper in defence/industrial. Microchip PolarFire adds DPA-resistant boot — protects keys against side-channel attacks.
Lockstep / DCLS (dual-core lockstep) for safety-critical (DO-254 DAL A/B, IEC 61508 SIL 3/4, ISO 26262 ASIL D). Both cores execute the same workload; a comparator flags divergence. Microchip PolarFire SoC and Xilinx Zynq US+ MPSoC offer lockstep RISC-V / Cortex-R5 pairs. Pure-fabric lockstep: replicate the RTL and compare outputs externally.
Hot-pluggable in datacentre. Power sequencing is critical: V_CCINT before V_CCAUX before V_CCO, deassert PROGRAM_B last, monitor INIT_B and DONE. Mis-sequencing latches up the part or corrupts boot. AWS F1, Azure NP, and Alibaba FPGA-cloud designs all have documented sequencing.
Counterfeit parts. Refurbished and relabelled Xilinx / Intel parts circulate in the grey market. Verify supply via authorised distributors (Avnet, Arrow, Future, Mouser, DigiKey). Xilinx and Intel run anti-counterfeit programs with device ID readback (DNA register) and traceability.
Vendor toolchain lock-in. XDC and SDC constraints are not perfectly portable across Vivado, Quartus, Diamond/Radiant, and Libero. SystemVerilog feature support varies — interface, program, bind, covergroup support differs significantly. RTL written to the IEEE 1800-2017 subset that Synplify Pro accepts is the safest cross-vendor portable form.
Compile times. Versal Premium HBM design: 4–12 hours full P+R wall-time on a 64-core server with 256 GB RAM. Smaller designs (Artix-7, Cyclone V) finish in 5–30 minutes. Incremental synthesis + implementation (Vivado Block-Level Synthesis, Quartus Incremental Compile) reduces iteration to 10–30 minutes for small RTL changes on big designs.
HLS quality limits. Hand-coded RTL is still 2–5× faster and smaller than HLS-generated RTL for highly-tuned DSP or networking kernels. HLS pays off for productivity (algorithmic exploration, video / vision / ML where the algorithm is complex but the dataflow is regular).
Glitches feeding asynchronous resets or clocks — same rule as [[Engineering/digital-logic]] section 7p. Never use a combinational signal as a clock or asynchronous reset; always register it first.
Routing congestion vs logic utilisation. A design at 85 % LUT utilisation may fail P+R because routing resources, not logic resources, ran out. Congestion is worst around dense BRAM / DSP columns where many nets converge. Symptoms: unroutable nets in the router log, very long P+R wall-time, marginal timing closure. Fixes: reduce utilisation, floorplan to spread the hot region, change BRAM packing (RAM_STYLE = "distributed" for small memories), or move to a larger device.
Pin assignment drift. Vivado / Quartus default pin auto-place will move I/Os between iterations, breaking PCB layout. Always lock pins in the XDC / QSF early and keep the constraint file under version control.
IP version mismatch. Vendor IP (PCIe, DDR controllers, Ethernet) is tied to a specific tool version. Upgrading Vivado from 2023.1 to 2024.1 often forces IP re-customisation. Check release notes for IP-breaking changes before upgrading mid-project.
License servers — most commercial tools (Vivado Enterprise, Quartus Pro, ModelSim, VCS, JasperGold) require a network license server. Plan for license-server outages by using either node-locked or cloud-floating licenses, and never let a developer’s daily work depend on a single license that may be checked out elsewhere.
Bitstream-revision compatibility. Configuring an UltraScale+ part with a bitstream built for a different speed grade is allowed by the hardware but can fail timing in the field. The verify step in the programmer matters.
Heat dissipation. Versal Premium and Stratix 10 / Agilex 7 routinely dissipate > 50 W; thermal pads, heatsinks, and 200+ LFM forced airflow are not optional. Junction temperature must stay below T_J max (typically 100 °C); above that, the part throttles or shuts down via on-die thermal sensor (System Monitor / SmartVID).
Decoupling and PDN. A modern FPGA package draws di/dt spikes of tens of amps per nanosecond on V_CCINT. Hundreds of 0.1 μF capacitors plus tens of 10 μF bulk plus a controlled-impedance plane pair are the minimum for clean operation; vendor reference designs spell out exact placement (see UG583 PCB Design and Pin Planning Guide for Xilinx).
JTAG chain integrity. Mixed-voltage JTAG chains (3.3 V Xilinx + 1.8 V Lattice) need level translators; otherwise device-detection fails or boundary-scan reports phantom chains.
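
The MTBF contrast in the clock-domain-crossing item above follows from the standard synchroniser model, MTBF = exp(t_r / τ) / (T_w · f_clk · f_data). The sketch below evaluates it for two resolution times. All numeric values are illustrative assumptions, not data for any device: τ and T_w must come from vendor characterisation, and the 10 % and 90 % settling fractions are stand-ins for "logic after the first FF" versus "a dedicated second FF".

```python
import math


def synchroniser_mtbf_s(t_resolve_s: float, tau_s: float, t_window_s: float,
                        f_clk_hz: float, f_data_hz: float) -> float:
    """MTBF = exp(t_resolve / tau) / (T_w * f_clk * f_data), in seconds."""
    return math.exp(t_resolve_s / tau_s) / (t_window_s * f_clk_hz * f_data_hz)


# Illustrative numbers only (assumed, not taken from a datasheet).
TAU = 50e-12        # metastability resolution time constant
T_WINDOW = 100e-12  # metastability capture window
F_CLK = 200e6       # destination clock frequency
F_DATA = 10e6       # toggle rate of the crossing signal

PERIOD = 1.0 / F_CLK

# No synchroniser: downstream logic leaves little of the cycle for settling.
mtbf_short = synchroniser_mtbf_s(0.1 * PERIOD, TAU, T_WINDOW, F_CLK, F_DATA)

# Two-FF synchroniser: the first FF gets most of a period to resolve.
mtbf_long = synchroniser_mtbf_s(0.9 * PERIOD, TAU, T_WINDOW, F_CLK, F_DATA)

# The improvement depends only on the extra settling time and tau.
gain = mtbf_long / mtbf_short
```

Because the settling time sits in an exponent, every extra τ of slack multiplies MTBF by e. This is why a second flip-flop changes the outcome so sharply, and why raising the clock frequency erodes the margin faster than linearly.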

8p. Tools & software

Vendor end-to-end suites

AMD Vivado / Vitis — Vivado for RTL + synthesis + P+R + bitstream (UltraScale+, Versal). Vivado HLS / Vitis HLS for C/C++ → RTL. Vitis Embedded for Zynq Linux / FreeRTOS / bare-metal. Vitis AI for ML inference on Zynq US+ and Versal AIE. Free WebPACK / ML editions for smaller parts; Enterprise for largest. Vivado Hardware Manager for JTAG programming and ILA debug.
Intel Quartus Prime Pro / Standard / Lite — synthesis + P+R for Cyclone, Arria, Stratix, Agilex. Platform Designer (Qsys) for IP integration. OneAPI for HLS via SYCL. Signal Tap II for in-system probing.
Lattice Diamond / Radiant / Propel — Diamond for older families (MachXO2, MachXO3, ECP3, ECP5), Radiant for current ones (iCE40 UltraPlus, CrossLink-NX, CertusPro-NX, Avant), Propel for building soft-CPU systems and their software on top of the hardware flow.
Microchip Libero SoC — PolarFire, PolarFire SoC, SmartFusion2, RTG4 toolchain. Includes secure boot, FlashPro hardware programming.

Open-source

Yosys (Claire Wolf, YosysHQ) — synthesis for FPGAs and (via OpenROAD downstream) ASICs. Reads Verilog and a SystemVerilog subset (fuller SystemVerilog via sv2v, the open-source yosys-slang plugin, or the commercial Verific front end in YosysHQ's Tabby CAD Suite). Targets nextpnr-compatible architectures.
nextpnr — place-and-route for iCE40 (Project IceStorm), ECP5 (Project Trellis), MachXO3 / Nexus (Project Oxide), GOWIN LittleBee / Arora (Project Apicula), GateMate (Cologne Chip). No production-quality support for Xilinx 7-series or UltraScale+ as of 2025; experimental 7-series flows (nextpnr-xilinx, F4PGA) build on Project X-Ray's bitstream documentation.
SymbiYosys — open formal verification (model checking, BMC, k-induction) using SMT solvers (Z3, Boolector, Yices). Front-ends Yosys.
Verilator — open compiled Verilog → C++ simulation. 10–100× faster than commercial event-driven simulators for pure RTL. Used heavily by SiFive, LowRISC, Chipyard, and any open RISC-V project.
GHDL — open VHDL simulator. Pairs with GTKWave for waveform viewing.
Cocotb — Python-based testbench framework wrapping any simulator (Verilator, Icarus, GHDL, ModelSim, VCS, Xcelium). Increasingly the open-source default for verification.
Icarus Verilog — interpreted Verilog simulator. Slower than Verilator but supports event semantics.

Third-party / multi-vendor

Synplify Pro / Synplify Premier (Synopsys, commercial) — synthesis targeting all major FPGA vendors. Used by ASIC teams targeting FPGAs for prototyping. Hi-Rel variant adds auto-TMR for SEU.
Mentor / Siemens EDA Precision (commercial) — peer to Synplify Pro; Precision Hi-Rel for rad-hard.
ModelSim / Questa (Siemens EDA) — industry-default simulator; UVM-aware; mixed-language.
Xcelium (Cadence), VCS (Synopsys) — commercial simulators for big SoC houses.
JasperGold (Cadence), VC Formal (Synopsys), Questa Formal (Siemens), OneSpin (Siemens) — commercial formal property verification.

Debug

Vivado ILA / VIO (Integrated Logic Analyzer / Virtual Input-Output) — embeds a logic-analyser core in the design, captures internal signals on configurable triggers, reads back over JTAG. Adds 1 LUT + 1 FF + BRAM per probe sample-depth slot.
Intel Signal Tap II — same concept for Quartus.
Lattice Reveal — same for Diamond/Radiant.
Synplify Identify — vendor-neutral RTL-level probing.
External JTAG dongles — Xilinx Platform Cable USB II / SmartLynq, Intel USB-Blaster II, Lattice HW-USBN-2B, Microchip FlashPro5; or generic FT2232H-based dongles + OpenOCD.

Hardware development kits

Digilent Arty A7 (Artix-7 XC7A35T, USD ~130), Arty S7 (Spartan-7), Nexys A7 / Nexys Video (Artix-7 + DDR3 + HDMI), Cmod A7 (small Artix-7 PMOD form factor).
Avnet ZedBoard (Zynq-7000 XC7Z020, USD ~500), MiniZed (Zynq Z-7007S, USD ~150), PicoZed, MicroZed.
AMD/Xilinx KCU105 (Kintex UltraScale) / KCU116 (Kintex UltraScale+), VCK190 / VEK280 (Versal AI Core / AI Edge), Alveo U200 / U250 / U280 / U55C (data-centre PCIe cards).
Terasic DE10-Nano (Intel Cyclone V SoC, USD ~150), DE10-Standard, DE2-115 (Cyclone IV).
Intel Agilex 7 F-Series transceiver dev kit (AGFB014R24A2E2V).
Lattice iCEstick (iCE40HX1K, USD ~30), iCEBreaker (iCE40UP5K, fully open toolchain), ECP5 Evaluation Board (LFE5UM5G-85F), CrossLink-NX Eval.
Microchip MPF300-Splash-Kit (PolarFire), PolarFire SoC Discovery Kit (USD ~700).
Cologne Chip GateMate Eval Board (GateMate A1, fully open toolchain via Yosys + nextpnr).

Cloud FPGA

AWS F1 instances — f1.2xlarge / f1.4xlarge / f1.16xlarge with 1, 2, or 8 Xilinx Virtex UltraScale+ VU9P boards. Used for ML inference, genomics, financial.
Alibaba F3 — Xilinx Virtex UltraScale+ VU9P (the older F1 family used Intel Arria 10).
Azure NP-series — Xilinx Alveo U250.

Constraints file primer

A minimal Xilinx XDC for a 200 MHz design with a 100 MHz input clock, async reset, and a 50 MHz output bus might read:

# Primary clock (from external oscillator)
create_clock -name clk_100m -period 10.0 [get_ports clk_in_p]
 
# MMCM-generated clocks (Vivado infers from MMCM IP, but explicit is safer)
create_generated_clock -name clk_200m -source [get_pins mmcm/CLKIN1] \
    -multiply_by 2 [get_pins mmcm/CLKOUT0]
 
# Async reset — synchronised internally; mark the input as false-path
set_false_path -from [get_ports reset_n]
 
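# Output: 50 MHz source-synchronous, ±2 ns total skew budget.
# clk_50m_out must be defined before it is referenced; the forwarded-clock
# port clk_out and the MMCM pin CLKOUT1 are assumed names for this sketch.
create_generated_clock -name clk_50m_out -source [get_pins mmcm/CLKOUT1] \
    -divide_by 1 [get_ports clk_out]
# Brace bus patterns so Tcl does not treat [*] as command substitution
set_output_delay -max 2.0 -clock clk_50m_out [get_ports {data_out[*]}]
set_output_delay -min -2.0 -clock clk_50m_out [get_ports {data_out[*]}]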
 
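# CDC: clk_100m and clk_200m are related via the MMCM, so they are not asynchronous,
# but clk_uart (external) is asynchronous to both. It must exist before it is grouped;
# the port name uart_clk and the 20 ns period are placeholders.
create_clock -name clk_uart -period 20.0 [get_ports uart_clk]
set_clock_groups -asynchronous -group {clk_100m clk_200m} -group {clk_uart}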
 
# Pin assignments
set_property PACKAGE_PIN E3      [get_ports clk_in_p]
set_property IOSTANDARD LVDS_25  [get_ports clk_in_p]
set_property PACKAGE_PIN H14     [get_ports {data_out[0]}]
set_property IOSTANDARD LVCMOS18 [get_ports {data_out[*]}]

Intel SDC syntax is nearly identical for the timing commands; pin assignments live in the .qsf (Quartus Settings File) instead.

Continuous integration for FPGA projects

Modern FPGA projects mirror software CI / CD:

Source control — Git for RTL, constraints, testbenches, scripts. Generated outputs (bitstream, .runs/) excluded via .gitignore.
Tcl-driven flow — Vivado batch (vivado -mode batch -source build.tcl), Quartus quartus_sh -t build.tcl, Yosys .ys scripts. Reproducible builds, no GUI clicks.
Containerisation — Docker images with the tool installed, typically built in-house from the vendor installer and pointed at a network license server. Build inside CI without polluting developer machines.
Regression suites — every commit triggers RTL simulation across the testbench library. Cocotb + pytest integration is natural; UVM regression usually runs via make or vendor regression managers.
Linting / style gates — Verible, Verilator -Wall, vendor lint (Vivado synth_design -mode out_of_context with -assert). Block PRs that introduce new lint warnings.
Implementation gate — full build to bitstream + timing-report parsing as a CI step. Fails the build on WNS < 0. Heavy (hours); often nightly rather than per-commit on big designs.
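
The implementation gate can be a few lines of script. The sketch below assumes the timing report is plain text containing a header row that starts with WNS(ns) and TNS(ns), followed by a dashed rule and one row of values, which is how Vivado's report_timing_summary output is usually laid out. The function names and the sample report are hypothetical, and other tools or report versions may need a different parser.

```python
import re

# Header row of the timing summary table (layout assumed, see lead-in).
_HEADER = re.compile(r"^\s*WNS\(ns\)\s+TNS\(ns\)")


def parse_wns(report_text: str) -> float:
    """Return worst negative slack in ns from the first summary table."""
    lines = report_text.splitlines()
    for i, line in enumerate(lines):
        if _HEADER.match(line):
            for row in lines[i + 1:]:
                fields = row.split()
                if not fields or set(fields[0]) <= {"-"}:
                    continue  # blank line or the dashed rule under the header
                return float(fields[0])
    raise ValueError("no WNS(ns) header row found in report")


def timing_met(report_text: str) -> bool:
    """Gate condition from the text: fail on WNS < 0."""
    return parse_wns(report_text) >= 0.0


# Hand-written sample in the assumed layout, not output from a real build.
SAMPLE_REPORT = """\
Design Timing Summary
---------------------

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints
    -------      -------  ---------------------  -------------------
     -0.142       -3.310                     27                 5120
"""

sample_passes = timing_met(SAMPLE_REPORT)
```

In a CI job the build script would write the report to a file, and a wrapper would read it and exit non-zero when timing_met returns False. A hold-slack (WHS) check can be added the same way by reading the corresponding column.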

11. Cross-references

[[Engineering/digital-logic]] — RTL fundamentals (CMOS, FFs, FSMs, setup/hold/skew, CDC). Read first if unfamiliar; FPGA design specialises this material.
[[Engineering/semiconductor-devices]] — CMOS transistor physics underlies every LUT, FF, and SerDes.
[[Engineering/microcontrollers]] — soft cores (MicroBlaze, Nios II, VexRiscv) and hard cores (Zynq Cortex-A, PolarFire SoC RISC-V) inside FPGAs use the same software workflow.
[[Engineering/digital-control]] — FPGA-implemented control loops at MHz rates (motor control, power-electronics PWM, lidar / radar signal processing).
[[Engineering/power-electronics]] — gate-drive timing, dead-time generation, predictive control on FPGA.
[[Engineering/pcb-design]] — FPGA package escape routing (BGA fanout), DDR4 / DDR5 length matching, transceiver layout, PDN decoupling.
[[Engineering/op-amps]] — analog front-end (instrumentation amps, anti-alias filters, programmable-gain amplifiers) before ADCs feeding FPGAs.
[[Engineering/circuit-analysis]] — DC + AC analysis for I/O signalling.
[[Engineering/electromagnetics-engineering]] — high-speed-signal transmission-line theory, S-parameters, eye diagrams.
planned [[Engineering/realtime-embedded]] — RTOS and bare-metal software running on Zynq / PolarFire SoC’s hard-core side.
planned [[Robotics/bayesian-estimation]] — multi-sensor synchronisation (lidar + camera + IMU) implemented on FPGA front-ends.
[[Languages/Tier3/hdl]] — VHDL, SystemVerilog, Chisel, SpinalHDL, Amaranth idioms and code examples.
[[Languages/Tier3/notation-spec]] — SVA, PSL for formal-property specification.
[[Languages/Tier3/theorem-prover-dsls]] — formal-verification ecosystem (TLA+, SymbiYosys backends, JasperGold property templates).
[[Languages/Tier3/assembly-and-encoding]] — RISC-V and ARM ISA encoding for soft-core / hard-core dispatch.

12. Citations

Chu, P. P. (2018). FPGA Prototyping by SystemVerilog Examples (2nd ed.). Wiley. Canonical lab-bench text — board-tested RTL examples for Nexys / Basys 3.
Chu, P. P. (2017). FPGA Prototyping by VHDL Examples (2nd ed.). Wiley. VHDL companion.
Kilts, S. (2007). Advanced FPGA Design: Architecture, Implementation, and Optimization. Wiley. Timing closure, retiming, pipelining, CDC. Still the canonical practitioner book.
Crockett, L. H., Elliot, R. A., Enderwitz, M. A. & Stewart, R. W. (2014). The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. Strathclyde Academic Media. Free PDF — classic Zynq introduction.
Pellerin, D. & Thibault, S. (2005). Practical FPGA Programming in C. Prentice Hall. Early HLS perspective; concepts still relevant to Vitis HLS workflow.
Rodríguez-Andina, J. J., de la Torre-Arnanz, E. & Valdés-Peña, M. D. (2017). FPGAs: Fundamentals, Advanced Features, and Applications in Industrial Electronics. CRC Press. Industrial-control orientation.
Spear, C. & Tumbush, G. (2012). SystemVerilog for Verification (3rd ed.). Springer. Constrained-random verification, UVM foundations.
Cohen, B., Venkataramanan, S., Kumari, A. & Piper, L. (2013). SystemVerilog Assertions Handbook (4th ed.). VhdlCohen Publishing. SVA reference.
Cummings, C. E. (2008). Clock Domain Crossing (CDC) Design & Verification Techniques Using SystemVerilog. SNUG Boston paper. Two-FF synchroniser MTBF derivation and async-FIFO conventions.
IEEE Std 1364-2005. IEEE Standard for Verilog Hardware Description Language. (Subsumed by 1800.)
IEEE Std 1800-2023. IEEE Standard for SystemVerilog — Unified Hardware Design, Specification, and Verification Language.
IEEE Std 1076-2019. IEEE Standard VHDL Language Reference Manual.
IEEE Std 1149.1-2013. IEEE Standard for Test Access Port and Boundary-Scan Architecture (JTAG).
IEEE Std 1801-2018. IEEE Standard for Design and Verification of Low-Power, Energy-Aware Electronic Systems (UPF).
Synopsys (2018). Application Note: Synthesis Coding Style. Industry-standard RTL coding guidelines.
AMD/Xilinx user guides: UG901 Synthesis, UG903 Using Constraints, UG904 Implementation, UG906 Design Analysis and Closure Techniques, UG909 Partial Reconfiguration / DFX, UG949 Methodology Guide, UG974 UltraScale Architecture Libraries.
Intel: Quartus Prime Pro Handbook (vol. 1–3), Stratix 10 / Agilex Device Datasheets, OneAPI for FPGA programming guide.
Lattice: TN-series and AN-series for iCE40 / ECP5 / Nexus device design.
Microchip: PolarFire UG0680, RTG4 UG0150, Libero SoC Design Flow User Guide.
DO-254 / ED-80. Design Assurance Guidance for Airborne Electronic Hardware. RTCA / EUROCAE. The civil-avionics hardware-development standard.
IEC 61508-2:2010 + IEC 61508-3:2010. Functional safety of electrical/electronic/programmable electronic safety-related systems.
ISO 26262-5:2018 + ISO 26262-11:2018. Road vehicles — Functional safety — Hardware level / Semiconductors.
The Yosys Open Synthesis Suite (yosyshq.net) and nextpnr documentation. Open-source equivalents of the commercial flows; production-quality at iCE40, ECP5, Nexus, GOWIN, GateMate.
Project IceStorm, Project Trellis, Project Oxide, Project Apicula. Bitstream-documentation projects that enable open-source toolchains.
