FPGA Design
1. At a glance
A Field-Programmable Gate Array is a post-fabrication-configurable integrated circuit: a sea of small programmable logic elements (LUTs and flip-flops) embedded in a routing fabric, with islands of hard IP (DSP slices, block RAM, transceivers, PCIe controllers, DDR PHYs, sometimes ARM application cores) hardened in silicon for performance. The configuration — usually a multi-megabit bitstream — is loaded into on-die SRAM at power-up and defines what every LUT, FF, and routing switch does for the rest of the operating session.
FPGAs occupy the middle of the digital-implementation spectrum. ASICs ([[Engineering/digital-logic]] section 6p) bake function into mask sets — fastest, lowest unit cost at volume, USD 1–100 M NRE, months-to-years cycle time. CPUs / MCUs ([[Engineering/microcontrollers]]) execute software on a fixed datapath — most flexible, easiest to develop, sequential-by-default. FPGAs sit in between: post-fab reconfigurable like software, but spatially parallel like an ASIC. The technology wins where parallelism beats sequential execution, latency must be deterministic, protocol flexibility matters, or volumes don’t justify ASIC tape-out.
Real-world workloads where FPGAs dominate today: 5G base-station L1 (Xilinx Versal RFSoC, Intel Agilex), network-line-card packet classification at 400 GbE (Xilinx Alveo, Achronix Speedster7t), automotive radar and lidar front-ends (Xilinx Zynq UltraScale+ RFSoC, Intel Cyclone 10), military signal-intelligence (rad-hard Microchip RTG4), ASIC emulation and prototyping (Cadence Palladium / Synopsys ZeBu use thousands of FPGAs), financial HFT (Solarflare X3522), ML inference acceleration (AWS F1 with Xilinx VU9P, Vitis AI), video processing in broadcast (Lattice CrossLink-NX, Xilinx Zynq), and motor / power control at MHz rates (Zynq Z-7020, Cyclone V SoC; see [[Engineering/power-electronics]]).
Two issues consume more engineering hours on a real FPGA project than any other: timing closure (making a design hit its clock-frequency target across all process-voltage-temperature corners, covered in section 5p) and clock-domain crossing (CDC — covered in [[Engineering/digital-logic]] section 5p and revisited here for FPGA-specific gotchas in section 7p). Verification effort routinely outweighs RTL design effort 3:1; on safety-critical avionics work (DO-254), 5:1 or higher.
2. First principles
The programmable logic cell
The atomic unit of an FPGA fabric is a logic cell (Xilinx terminology) / logic element (Intel/Altera) / PFU sub-slice (Lattice). At its core sit:
- k-input look-up table (LUT-k) — an SRAM with 2^k bits storing the truth table of any k-input combinational function. Setting the SRAM bits at configuration time defines the logic. Modern fabrics use LUT6 (Xilinx 7-series, UltraScale, UltraScale+, Versal; 64 SRAM bits per LUT) or fracturable LUT6 (Intel Stratix-10, Agilex; can split into two independent LUT5s sharing four inputs, doubling utility for narrow logic). Lattice ECP5 uses LUT4; iCE40 uses LUT4; older Spartan-6 used LUT6.
- D flip-flop — one or two per LUT, with clock-enable, synchronous + asynchronous set/reset, configurable as latch on some devices.
- Carry chain — dedicated fast-carry logic spanning a vertical column of cells, used for adders, subtractors, and counters. A 32-bit adder synthesises into 32 LUTs riding a dedicated chain that traverses one column at a few tens of picoseconds per bit on UltraScale+ (a full 32-bit add completes in about 1 ns), far faster than routing carry through general fabric.
- Wide-mux and shift-register modes — a Xilinx LUT6 in a SLICEM can also function as a 64×1 distributed RAM or a 32-bit shift register (SRL32), reusing the SRAM cells for storage instead of logic.
Cells are bundled into larger units: Xilinx CLB (Configurable Logic Block) = 2 SLICEs = 8 LUT6 + 16 FF; Intel LAB (Logic Array Block) = 10 ALMs = 20 fracturable LUT6 + 40 FF on Agilex; Lattice PLC = 8 SLICEs on Nexus.
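To make the LUT idea concrete, the sketch below models a LUT-k as nothing more than a 2^k-bit truth table addressed by its inputs. The helper names (`lut_init_from_function`, `lut_eval`) and the example function are illustrative and not part of any vendor tool; treating input 0 as the least-significant address bit is an assumed convention.

```python
def lut_init_from_function(func, k):
    """Build the 2**k-bit truth-table word for a k-input Boolean function.

    Bit `addr` of the result holds func(*inputs), where input j is bit j
    of `addr` (input 0 is the least-significant address bit; assumed).
    """
    init = 0
    for addr in range(2 ** k):
        bits = [(addr >> j) & 1 for j in range(k)]
        if func(*bits):
            init |= 1 << addr
    return init


def lut_eval(init, inputs):
    """Evaluate the LUT: the inputs form an address into the stored word."""
    addr = sum(bit << j for j, bit in enumerate(inputs))
    return (init >> addr) & 1


def example(a, b, c, d, e, f):
    """Arbitrary 6-input function: (a AND b) XOR (c OR d) XOR (e AND NOT f)."""
    return (a & b) ^ (c | d) ^ (e & (1 - f))


INIT = lut_init_from_function(example, 6)
print(f"LUT6 truth table = 64'h{INIT:016X}")
```

Whatever the function's gate-level complexity, it costs exactly one LUT6 as long as it has six or fewer inputs; only the 64 stored bits change.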
Block RAM, distributed RAM, URAM
- BRAM (Block RAM) — dedicated 18 kb or 36 kb dual-port SRAM blocks; a BRAM36 is configurable as 32K×1, 16K×2, 8K×4, 4K×9, 2K×18, or 1K×36, plus 512×72 in simple dual-port mode. True dual-port (independent read+write on each port, separate clocks), FIFO mode (built-in pointers and flags), ECC mode on UltraScale+ (64+8 SECDED). Xilinx 7-series has up to 1880 BRAM36 (XC7VX1140T); Versal Premium has up to 2520. Read latency 1–2 cycles, write 1 cycle.
- Distributed RAM — using the LUT6 SRAM cells in SLICEM as 32×1, 64×1, 128×1, or 256×1 single/dual-port memories. Useful for very small (< 256 word) high-fanout lookup tables. No clock cycle latency on async-read path.
- URAM (UltraRAM) — 288 kb single-port (or dual-mode) blocks introduced on Xilinx UltraScale+. Larger, denser, fewer flexibility options. Designed for ML weight buffers and packet buffers. VU13P has 1280 URAMs.
DSP slices
A DSP slice is a hard MAC (multiply-accumulate) unit, far smaller and faster than the LUT-equivalent. The Xilinx DSP48E2 (UltraScale+) is a 27×18 signed multiplier + 48-bit adder/accumulator + pre-adder + pattern detector, running at up to 891 MHz. The DSP58 on Versal adds INT8/INT16 SIMD modes for AI. The Intel Variable Precision DSP on Stratix 10 / Agilex offers 18×19 or 27×27 with similar accumulation and supports IEEE-754 single-precision float natively (one FP32 multiply-accumulate per cycle per slice). Modern parts ship thousands: Kintex UltraScale+ KU15P has 1968 DSP48E2; Versal Premium VP1502 has 7392 DSP58s.
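A wide multiply is built from several narrow hard multipliers. The sketch below shows why a 32×32 product needs four partial products when each multiplier handles at most 26×17 unsigned bits (the unsigned range of a 27×18 signed port). It is an unsigned arithmetic model only: the function name and the split points are assumptions for illustration, and real synthesis also handles signed operands and uses the slice's cascade adders.

```python
A_LO_BITS = 26  # unsigned bits that fit a 27-bit signed multiplier port
B_LO_BITS = 17  # unsigned bits that fit an 18-bit signed multiplier port


def mul32_via_partial_products(a, b):
    """Return (a * b, number_of_partial_products) for unsigned 32-bit a, b."""
    assert 0 <= a < 2 ** 32 and 0 <= b < 2 ** 32
    a_lo, a_hi = a & ((1 << A_LO_BITS) - 1), a >> A_LO_BITS  # 26 + 6 bits
    b_lo, b_hi = b & ((1 << B_LO_BITS) - 1), b >> B_LO_BITS  # 17 + 15 bits
    # Each tuple is (narrow product, left shift applied before summation).
    partials = [
        (a_lo * b_lo, 0),
        (a_hi * b_lo, A_LO_BITS),
        (a_lo * b_hi, B_LO_BITS),
        (a_hi * b_hi, A_LO_BITS + B_LO_BITS),
    ]
    product = sum(p << shift for p, shift in partials)
    return product, len(partials)


product, n_mults = mul32_via_partial_products(0xDEADBEEF, 0x12345678)
print(hex(product), "using", n_mults, "narrow multiplies")
```

Each of the four narrow products fits one hard multiplier, which is where the "4 DSP48E2" figure for a 32×32 multiply comes from.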
Clocking, I/O, and SerDes
- MMCM / PLL — Mixed-Mode Clock Manager (Xilinx) / Phase-Locked Loop (all vendors). Generate, phase-shift, multiply, divide, and jitter-filter clocks from external references. An UltraScale+ MMCM accepts 10 MHz–800 MHz input, produces up to seven output clocks each up to 891 MHz, with ~100 fs RMS output jitter.
- Global clock buffers (BUFG, BUFGCE, BUFGCTRL) — distribute clocks across regions with low skew (< 100 ps within a clock region on UltraScale+). 32 BUFGs per device typical; constraints fight over them in large designs.
- I/O blocks — single-ended LVCMOS (1.2 V / 1.5 V / 1.8 V / 2.5 V / 3.3 V), differential LVDS, SSTL/POD for DDR4/DDR5 memory. Per-pin programmable IDELAY/ODELAY tap chains (78 ps per tap on 7-series with a 200 MHz reference; finer on UltraScale+) for source-synchronous interface deskew.
- SerDes (transceivers) — multi-Gb/s differential serial. Xilinx GTH (16.3 Gb/s), GTY (32.75 Gb/s NRZ / PAM4 to 58 Gb/s), GTM (58 Gb/s PAM4 native, up to 112 Gb/s on Versal Premium). Intel E-tile/F-tile transceivers similar tier. Each transceiver includes CDR (clock-data recovery), 8b/10b or 64b/66b encoding, FEC, and a programmable equaliser.
Hard IP
Cost-effective integration of high-bandwidth standardised functions: PCIe Gen3 x16 / Gen4 x16 / Gen5 x16 hard blocks (Xilinx, Intel), 100G / 400G Ethernet MACs with RS-FEC (UltraScale+, Versal Premium, Agilex 7), DDR4 / DDR5 / HBM2e / HBM3 controllers, ARM Cortex-A53 / A78AE Processing Systems in SoC variants (Zynq-7000, Zynq UltraScale+ MPSoC, Versal ACAP, Cyclone V SoC, Stratix 10 SoC, Agilex 7 with quad Cortex-A76AE), AI Engine tiles in Versal (vector processors at 1.3 GHz, hundreds-of-cores arranged in a 2D mesh, designed for ML and signal-processing kernels).
3. Practical math / design equations
Resource estimation — a few rules of thumb
| Block | Resource cost | Notes |
|---|---|---|
| 32-bit ripple-carry adder | 32 LUT, riding 1 carry chain | ~1 ns on UltraScale+ |
| 32-bit register | 32 FF | Free placement |
| 32×32 = 64-bit multiplier | 4 DSP48E2 | 1-cycle multiply + 1-cycle accumulate; ~10 cycles latency pipelined |
| 18-bit × 25-bit multiplier (signed) | 1 DSP48E2 | Native, single-cycle |
| 1 kB single-port RAM (1024×8) | 1 BRAM18 (one half used) | 2-cycle read latency typ |
| 32 kB dual-port RAM | 8 BRAM36 | True dual-port |
| 8-bit-wide async FIFO 1024 deep | 1 BRAM18 + ~20 LUT + 40 FF | Gray-code pointers for CDC |
| 256 kB packet buffer | 8 URAM (288 kb = 36 kB each) | UltraScale+ only |
| 16-tap symmetric FIR filter, 16-bit data, 18-bit coeff | 8 DSP48E2 + ~200 LUT + ~200 FF | At 200 MHz → 200 MSPS |
| 100G Ethernet MAC | 1 CMAC hard block | UltraScale+ / Versal |
| PCIe Gen3 x8 endpoint | 1 PCIe hard block + DMA RTL | Xilinx XDMA / QDMA IP |
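The BRAM rows in the table can be reproduced with a small tiling estimate: pick the block aspect ratio that covers the requested depth × width with the fewest blocks. This is a rough sketch, not a vendor algorithm. The aspect-ratio list is an assumption based on the commonly documented BRAM36 true-dual-port modes, and `bram36_count` is a hypothetical helper name.

```python
from math import ceil

# Depth x width modes of one 36 kb block in true-dual-port use.
# The 512x72 simple-dual-port mode is deliberately left out of this sketch.
BRAM36_CONFIGS = [
    (32768, 1), (16384, 2), (8192, 4),
    (4096, 9), (2048, 18), (1024, 36),
]


def bram36_count(depth, width):
    """Fewest BRAM36 blocks needed to tile a depth x width memory."""
    return min(
        ceil(depth / cfg_depth) * ceil(width / cfg_width)
        for cfg_depth, cfg_width in BRAM36_CONFIGS
    )


for depth, width in [(1024, 32), (8192, 32), (1024, 64)]:
    kbytes = depth * width / 8 / 1024
    print(f"{depth} x {width} ({kbytes:.0f} kB): {bram36_count(depth, width)} BRAM36")
```

Synthesis tools may choose a different decomposition to trade block count against output multiplexing, so treat the result as a lower bound for early budgeting.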
Setup, hold, and slack — restated for FPGA flow
The setup and hold inequalities from [[Engineering/digital-logic]] section 3 apply unchanged to FPGAs. The FPGA flavour adds two specifics:
- t_CO and t_setup are device-specific, not designer-controllable. They come from the FPGA’s speed grade. UltraScale+ -2 speed grade: FF t_CO ≈ 350 ps, t_setup ≈ 130 ps. -3 speed grade roughly 15 % faster, -1 about 15 % slower.
- t_route dominates t_comb in most paths. On a fully placed-and-routed UltraScale+ design at 400 MHz, a typical register-to-register path with 4 LUT delays might be 1.0 ns of LUT/CARRY logic and 1.2 ns of net (wire) delay — routing accounts for more than half the total. Designers can’t directly edit routes; instead, they pull data + control closer together via floorplanning constraints, pblock assignments, and pipelining to reduce per-stage logic depth.
Slack at a destination FF:
slack = T_clk − t_CO_src − t_LUT − t_route − t_setup_dst + t_skew_dst − t_skew_src
Negative slack = timing violation. Vivado and Quartus both report worst-negative-slack (WNS) and total-negative-slack (TNS) summarised per clock domain. A design with WNS ≥ 0 (and therefore TNS = 0) passes; anything negative fails.
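The slack equation is easy to check by hand or in a few lines of code. The sketch below evaluates it for one register-to-register path using the illustrative -2 speed-grade figures quoted above (t_CO ≈ 0.35 ns, t_setup ≈ 0.13 ns, 1.0 ns of logic, 1.2 ns of routing). The 300 MHz clock and the zero-skew default are assumptions chosen for the example.

```python
def setup_slack(t_clk, t_co_src, t_lut, t_route, t_setup_dst,
                t_skew_dst=0.0, t_skew_src=0.0):
    """Setup slack in ns; all arguments in ns. Negative means a violation."""
    return (t_clk - t_co_src - t_lut - t_route - t_setup_dst
            + t_skew_dst - t_skew_src)


def max_frequency_mhz(t_clk, slack):
    """Highest clock rate at which this path would just meet setup."""
    return 1000.0 / (t_clk - slack)


T_CLK_NS = 1000.0 / 300.0  # assumed 300 MHz target clock
slack = setup_slack(T_CLK_NS, t_co_src=0.35, t_lut=1.0,
                    t_route=1.2, t_setup_dst=0.13)
print(f"slack = {slack:+.3f} ns, "
      f"path f_max = {max_frequency_mhz(T_CLK_NS, slack):.1f} MHz")
```

The same function applied to every path, with the minimum taken per clock domain, gives WNS; summing only the negative values gives TNS.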
Pipelining math
If a combinational path has logic delay t_comb, splitting it into N pipeline stages with FFs between gives:
t_stage ≈ t_comb / N → T_clk_min ≈ t_CO + t_comb/N + t_setup
Throughput = 1 / T_clk_min (one result per cycle once filled)
Latency = N × T_clk_min (cycles from input to first valid output)
Diminishing returns set in when t_CO + t_setup approach t_comb/N (~600 ps fixed overhead on UltraScale+); past N ≈ t_comb / (3 × overhead) extra stages cost area without speed.
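A quick sweep over N shows the diminishing returns numerically. This is a sketch under stated assumptions: the 8 ns combinational delay is a made-up example path, the register overheads default to the -2 speed-grade figures quoted earlier, and the model assumes logic splits perfectly evenly across stages, which real designs rarely achieve.

```python
def pipeline_metrics(t_comb_ns, n_stages, t_co_ns=0.35, t_setup_ns=0.13):
    """Apply T_clk_min = t_CO + t_comb/N + t_setup to an evenly split path."""
    t_clk_min = t_co_ns + t_comb_ns / n_stages + t_setup_ns
    return {
        "t_clk_min_ns": t_clk_min,
        "f_max_mhz": 1000.0 / t_clk_min,
        "latency_ns": n_stages * t_clk_min,
    }


T_COMB_NS = 8.0  # hypothetical unpipelined logic delay
for n in (1, 2, 4, 8, 16, 32):
    m = pipeline_metrics(T_COMB_NS, n)
    print(f"N={n:2d}  T_clk_min={m['t_clk_min_ns']:.2f} ns  "
          f"f_max={m['f_max_mhz']:6.1f} MHz  latency={m['latency_ns']:.2f} ns")
```

Each doubling of N buys less frequency than the previous one, while total latency keeps growing by the fixed register overhead per added stage.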
Worked example A — 16-tap symmetric pipelined FIR filter
A 16-tap symmetric FIR at 200 MSPS, 16-bit data, 18-bit coefficients, on Artix-7 (XC7A35T-1).
- Symmetric (h[n] = h[15-n]) → use 8 multipliers with pre-adders: y[n] = Σ h[k] × (x[n-k] + x[n-15+k]) for k = 0..7.
- 8 DSP48E1 slices (Artix-7 uses DSP48E1; 25×18 signed). Pre-adder inside each DSP holds the symmetric sum (saves one external adder per tap).
- Pipeline: input register → pre-adder stage → multiply stage → adder-tree (3 cycles for log₂(8) levels) → output register. Total latency ≈ 16 + log₂(8) = 19 cycles → 95 ns at 200 MHz.
- Resource count after synthesis + implementation: 8 DSP48E1, ~200 LUT, ~250 FF, no BRAM. WNS at -1 speed grade: ~0.3 ns. Timing closes.
- Throughput: 200 MHz × 1 sample/cycle = 200 MSPS. The Nyquist bandwidth is 100 MHz, so a 50 MHz signal is covered with plenty of guard band.
- Software baseline: Cortex-M7 at 480 MHz with single-cycle MAC, ~30 MSPS sustained on this filter. FPGA is 7× faster and runs deterministically with sub-cycle jitter; the CPU jitters by 100s of cycles on cache misses.
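The pre-adder folding in this example can be checked with a bit-exact software model. The sketch below compares a direct 16-multiply FIR against the folded 8-multiply form y[n] = Σ h[k] × (x[n-k] + x[n-15+k]). Coefficients and input samples are arbitrary integers chosen for the test, and samples before time zero are assumed to be zero.

```python
def fir_direct(h, x):
    """Reference form: one multiply per tap."""
    out = []
    for n in range(len(x)):
        acc = 0
        for k, coeff in enumerate(h):
            if n - k >= 0:
                acc += coeff * x[n - k]
        out.append(acc)
    return out


def fir_symmetric(h_half, x):
    """Folded form: pre-add the two samples that share a coefficient."""
    half = len(h_half)
    n_taps = 2 * half

    def sample(i):
        return x[i] if i >= 0 else 0  # zero history before the first sample

    out = []
    for n in range(len(x)):
        acc = 0
        for k in range(half):
            pre_add = sample(n - k) + sample(n - (n_taps - 1) + k)
            acc += h_half[k] * pre_add  # one multiply per coefficient pair
        out.append(acc)
    return out


h_half = [1, -2, 3, 4, -5, 6, 7, 8]   # arbitrary test coefficients
h_full = h_half + h_half[::-1]        # h[n] = h[15 - n]
x = list(range(-20, 20))
assert fir_direct(h_full, x) == fir_symmetric(h_half, x)
print("folded FIR matches direct form with half the multiplies")
```

A model like this doubles as a golden reference for the RTL testbench, since integer arithmetic here is exact.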
Worked example B — Asynchronous BRAM FIFO for CDC
A 4 kB FIFO buffers between a 200 MHz producer domain (e.g. line-rate packet ingress) and a 156.25 MHz consumer domain (e.g. 10G MAC interface).
Using Xilinx XPM_FIFO_ASYNC parameterised macro:
- Depth = 1024, width = 32 bits → 4 kB, fits in 1 BRAM36 (or 2 × BRAM18) on 7-series / UltraScale.
- Pointer width = 11 bits (10 address + 1 wrap bit). Gray-coded read and write pointers cross the clock-domain boundary via two-FF synchronisers (one bit toggling at a time guarantees no decode error on the receive side).
- Read latency = 3 cycles (gray pointer sync + read-enable + BRAM read), so an empty flag deasserts 3 cycles before data is valid; consumer must respect FWFT (first-word-fall-through) mode or pipeline-aware mode.
- Programmable almost-full and almost-empty thresholds back-pressure the producer when fewer than 64 entries are free.
- Resource cost: 1 BRAM36 + 16 LUT + ~40 FF.
- Critical timing constraint: `set_max_delay -datapath_only -from [get_clocks clk_wr] -to [get_clocks clk_rd] 6.4` (one read-clock period, in ns). Vivado handles the internal CDC paths automatically via the macro’s pre-validated constraint set.
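The Gray-coded pointers are what make this crossing safe, and the property is simple to verify in software. The sketch below is a behavioural model only, not the XPM implementation: it converts an 11-bit pointer to and from Gray code, and derives empty/full from binary pointers that carry one extra wrap bit. All function names are illustrative.

```python
POINTER_BITS = 11  # 10 address bits + 1 wrap bit, as in the example above


def bin_to_gray(value):
    """Standard reflected binary Gray code."""
    return value ^ (value >> 1)


def gray_to_bin(gray):
    """Inverse conversion: XOR together all right-shifts of the Gray value."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def fifo_flags(wr_ptr, rd_ptr, addr_bits=POINTER_BITS - 1):
    """Return (empty, full) from binary pointers that include the wrap bit."""
    empty = wr_ptr == rd_ptr
    full = (wr_ptr ^ rd_ptr) == (1 << addr_bits)  # same address, wrap differs
    return empty, full


# Every increment, including the wrap-around, changes exactly one Gray bit.
size = 1 << POINTER_BITS
for p in range(size):
    changed = bin_to_gray(p) ^ bin_to_gray((p + 1) % size)
    assert bin(changed).count("1") == 1
    assert gray_to_bin(bin_to_gray(p)) == p
print("all", size, "pointer increments are single-bit transitions")
```

Because only one bit changes per increment, a synchroniser that samples mid-transition sees either the old or the new pointer, never an unrelated value.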
Worked example C — Closing timing on a critical path
A motor-control inner loop targets 300 MHz on Zynq UltraScale+ ZU3EG-1 to compute a 32-bit Park transform + PI controller every cycle.
Initial implementation: Vivado reports WNS = −0.85 ns on clk_300m. Timing report shows the worst path is FF → 7 levels of LUT (combinational logic including a 32-bit signed multiplier modelled in fabric) → FF, with t_CO = 0.35 ns + t_LUT = 2.2 ns + t_route = 1.5 ns + t_setup = 0.13 ns = 4.18 ns against a 3.33 ns period.
Three fixes considered:
- Retiming — let Vivado auto-balance combinational stages by sliding FFs across logic. `set_property RETIMING_FORWARD true` on the relevant module. Vivado moves 2 of the 7 LUT levels into the next cycle. New WNS = +0.08 ns. Closes, no functional change, +1 cycle latency at the boundary.
- Manual pipelining — insert a register between the multiplier output and downstream logic. Same effect as auto-retiming but explicit in RTL; preferred for safety-critical (DO-254) work where every register must be designer-justified.
- Map multiplier to DSP slice — the multiplier was being mapped to LUT fabric (Vivado’s heuristic considered it too narrow). Add the `(* USE_DSP = "yes" *)` attribute. The DSP48E2 absorbs the multiply in a single hardened slice with 1 ns total, freeing the LUT logic. New WNS = +0.42 ns. Best fix: no extra latency, fewer LUTs, hard timing margin.
Real designs use combinations — retiming for routine paths, manual pipelining for hand-tuned cores, DSP/BRAM forcing for arithmetic-heavy or storage-heavy paths.
Worked example D — Power budget for a portable instrument
A handheld test instrument runs an XC7Z020-1CLG400C (Zynq-7020) at 100 MHz fabric clock with the dual Cortex-A9 PS at 666 MHz. Estimate average power.
- PS (processing system): V_CCPINT = 1.0 V, ~250 mW at 666 MHz with dual cores active per Xilinx Power Estimator (XPE).
- PL (programmable logic) dynamic: 45 k LUT @ 30 % toggle, 60 k FF, 80 BRAM, 60 DSP — XPE reports ~480 mW at 100 MHz, V_CCINT = 1.0 V.
- PL static (leakage): temperature-dependent. At T_J = 50 °C, ~120 mW.
- I/O banks: 80 LVCMOS33 outputs at 20 MHz, C_L = 15 pF each. P_IO = C · V² · f · α · N = 15 pF × (3.3 V)² × 20 MHz × 0.25 × 80 = 65 mW.
- MGT transceivers: not used → 0 mW.
- DDR3 PHY: ~150 mW for 32-bit DDR3-1066.
Total: ~1.07 W. With a 3.7 V Li-ion at 3400 mAh = 12.6 Wh, runtime ≈ 12.6 / 1.07 ≈ 11.8 hours continuous fabric activity. Adding clock gating to the unused 60 % of PL FFs and dropping the PS to 333 MHz when idle pushes average power to ~600 mW and runtime to ~21 h.
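The budget above can be reproduced in a few lines, which also makes it easy to rerun with different assumptions. The sketch takes the per-rail figures straight from the list and computes only the I/O term from P = C · V² · f · α · N; regulator losses and battery derating are ignored, as they are in the hand calculation.

```python
def io_power_mw(c_load_pf, v_io, f_toggle_mhz, activity, n_pins):
    """Dynamic I/O power P = C * V^2 * f * alpha * N, returned in mW.

    pF * V^2 * MHz yields microwatts, hence the final division by 1000.
    """
    microwatts = c_load_pf * v_io ** 2 * f_toggle_mhz * activity * n_pins
    return microwatts / 1000.0


budget_mw = {
    "PS (dual A9 @ 666 MHz)": 250.0,
    "PL dynamic": 480.0,
    "PL static": 120.0,
    "I/O banks": io_power_mw(15.0, 3.3, 20.0, 0.25, 80),
    "MGT transceivers": 0.0,
    "DDR3 PHY": 150.0,
}

total_mw = sum(budget_mw.values())
battery_wh = 3.7 * 3.4                      # 3.7 V x 3400 mAh
runtime_h = battery_wh / (total_mw / 1000.0)

for name, mw in budget_mw.items():
    print(f"{name:24s} {mw:7.1f} mW")
print(f"{'Total':24s} {total_mw:7.1f} mW, runtime {runtime_h:.1f} h")
```

Swapping in the ~600 mW optimised figure for `total_mw` gives the ~21 h runtime quoted for the clock-gated case.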
4. Reference data
FPGA vendor and family landscape
| Vendor | Family | Process | Largest part | LUTs | DSP | BRAM (Mb) | SerDes max | Use case |
|---|---|---|---|---|---|---|---|---|
| AMD (Xilinx) | Spartan-7 | 28 nm | XC7S100 | 102 k | 160 | 4.1 | none | Cost-sensitive embedded |
| AMD | Artix-7 | 28 nm | XC7A200T | 215 k | 740 | 13 | 6.6 Gb/s | Industrial, prototyping |
| AMD | Kintex-7 | 28 nm | XC7K480T | 478 k | 1920 | 34 | 12.5 Gb/s | Networking, mid-range DSP |
| AMD | Virtex-7 | 28 nm | XC7V2000T | 1955 k | 2160 | 46 | 13.1 Gb/s | High-end (legacy) |
| AMD | Zynq-7000 SoC | 28 nm | XC7Z045 / XC7Z020 | 350 k / 85 k | 900 / 220 | 19 / 4.4 | 12.5 Gb/s | Dual Cortex-A9 + PL |
| AMD | Kintex UltraScale | 20 nm | XCKU115 | 1451 k | 5520 | 75 | 30.5 Gb/s | DSP, ML, radar |
| AMD | Kintex UltraScale+ | 16 nm | XCKU15P | 1143 k | 1968 | 35 | 32.75 Gb/s | 5G, networking |
| AMD | Virtex UltraScale+ | 16 nm | XCVU19P | 8938 k | 3840 | 224 (BRAM+URAM) | 58 Gb/s PAM4 | ASIC emulation, data centre |
| AMD | Zynq UltraScale+ MPSoC | 16 nm | XCZU19EG | 1143 k | 1968 | 35 | 32.75 Gb/s | Quad A53 + dual R5 + PL |
| AMD | Versal Prime | 7 nm | VP1502 | 1968 k | 7392 | 90 | 112 Gb/s | NoC + AI Engines + PL |
| AMD | Versal AI Core | 7 nm | VC1902 | 1969 k | 1968 | 60 | 58 Gb/s | 400 AIE tiles for ML |
| AMD | Versal Premium | 7 nm | VP1802 | 3735 k | 10848 | 199 | 112 Gb/s PAM4 | Hyperscale networking |
| Intel (Altera) | Cyclone IV / V | 60 / 28 nm | 5CGXFC9 | 301 k | 342 | 13.9 | 6.144 Gb/s | Cost-sensitive |
| Intel | Cyclone V SoC | 28 nm | 5CSXFC6 | 110 k | 112 | 5.1 | 6.144 Gb/s | Dual Cortex-A9 + FPGA |
| Intel | Arria 10 | 20 nm | 10AX115 | 1150 k | 1518 | 53 | 17.4 Gb/s | Mid-range DSP / video |
| Intel | Stratix 10 GX/MX | 14 nm | 10SG280 | 2753 k | 5760 | 229 (M20K + HBM) | 28.9 Gb/s | High-end DSP / HBM2 |
| Intel | Agilex 7 | 10 nm SF | AGFB014R24A2E2V (F-tile) | 2692 k | 8736 | 165 | 116 Gb/s PAM4 | DC, networking |
| Intel | Agilex 5 | 7 nm | A5E065BB | 656 k | 1024 | 16 | 28 Gb/s | Edge AI, mid-range |
| Lattice | iCE40 UltraPlus | 40 nm | iCE40UP5K | 5.3 k | 8 | 1 (BRAM + SPRAM) | none | Smartphone sensor hub |
| Lattice | iCE40 HX | 40 nm | iCE40HX8K | 7.7 k | 0 | 0.13 | none | Tiny dev boards |
| Lattice | ECP5 | 40 nm | ECP5-85F (LFE5UM5G-85F) | 84 k | 156 | 3.7 | 5 Gb/s | Mid-range, open-source toolchain |
| Lattice | MachXO3 / CrossLink-NX | 28 nm FD-SOI | LIFCL-40 | 40 k | 156 | 2.5 | 10.3 Gb/s | Edge AI, MIPI |
| Lattice | CertusPro-NX | 28 nm FD-SOI | LFCPNX-100 | 96 k | 174 | 7.5 | 5 Gb/s | Edge processing |
| Lattice | Avant-E / Avant-G | 16 nm | LAV-AT-100 | 500 k | 1300 | 36 | 25 Gb/s | High-mid range |
| Microchip | PolarFire | 28 nm SONOS-NV | MPF300 / MPF500 | 481 k | 1480 | 33 | 12.7 Gb/s | Low-power, secure, non-volatile |
| Microchip | PolarFire SoC | 28 nm | MPFS250T | 254 k | 784 | 17 | 12.7 Gb/s | 5-core RISC-V (4×U54 + 1×E51) |
| Microchip | RTG4 | 65 nm rad-hard | RT4G150 | 151 k | 462 | 5.2 | 3.125 Gb/s | Space, rad-tolerant |
| Microchip | SmartFusion2 | 65 nm | M2S150 | 146 k | 240 | 4.6 | 5 Gb/s | Cortex-M3 + secure flash |
| Achronix | Speedster7t | 7 nm TSMC | AC7t1500 | 692 k | 2560 (MLP) | 195 | 112 Gb/s | Networking, ML |
| Efinix | Trion / Titanium | 40 nm / 16 nm | Ti200 | 200 k | 320 | 18 | 25 Gb/s | Edge AI, low-cost |
| GOWIN | LittleBee / Arora | 55 / 22 nm | GW5AT-138 | 138 k | 116 | 11 | 12.5 Gb/s | Chinese consumer / IoT |
HDL language comparison
| Language | Standard | Origin | Verbosity | Verification | Synthesis support | Industry footprint |
|---|---|---|---|---|---|---|
| Verilog | IEEE 1364-2005 | Gateway Design (1985), C-like | Low | Limited (testbench is just RTL) | All vendors | Legacy still common; superseded by SystemVerilog |
| SystemVerilog | IEEE 1800-2023 | Synopsys/Accellera (2002), superset of Verilog | Low-medium | UVM, constrained-random, SVA, classes | All vendors; subset on open tools | Default for ASIC + most new FPGA RTL |
| VHDL | IEEE 1076-2019 | DoD VHSIC (1987), Ada-derived | High (strongly typed) | Native records, generics; OSVVM | All vendors | Aerospace, defence, European industry |
| Chisel | UCB, Scala-embedded | 2012 | Medium (Scala metaprog) | ScalaTest + Treadle | Emits Verilog | SiFive, Esperanto, Google TPU |
| SpinalHDL | Scala-embedded | 2015 | Low (cleaner than Chisel) | SimSlang, Cocotb | Emits Verilog/VHDL | Open-source community, VexRiscv |
| Amaranth (nMigen) | Python-embedded | 2020 | Low | Cocotb, native sim | Emits Verilog | iCE40/ECP5 open ecosystem |
| MyHDL | Python-embedded | 2003 | Low | Native Python | Emits Verilog/VHDL | Niche; older |
| Bluespec SystemVerilog | Bluespec Inc. | 2000 | High (atomic transactions) | Native | Emits Verilog | Research, defence |
| Vitis HLS (C/C++) | C++ subset + pragmas | AMD/Xilinx | High (raised abstraction) | C testbench + co-sim | Emits Verilog | DSP, vision, ML — wrap in RTL shell |
| Intel HLS / oneAPI FPGA | OpenCL / SYCL / C++ | Intel | High | Native, with emulation | Emits Verilog | Intel-only |
Tier-3 idioms and code patterns: [[Languages/Tier3/hdl]].
Tool flow stages
| Stage | What it does | Vendor commercial tools | Open-source equivalents |
|---|---|---|---|
| RTL coding | Designer writes Verilog/VHDL | Any text editor; VS Code; vivado-vhdl-mode | Same |
| Elaboration + lint | Resolves hierarchy, parameters, checks style | Synopsys SpyGlass, Cadence HAL, Vivado lint | Verible, sv2v + lint |
| Simulation | Run testbench against RTL | ModelSim, Questa, VCS, Xcelium, Vivado XSIM | Verilator, Icarus, GHDL, Cocotb |
| Synthesis | RTL → netlist of LUTs + FFs + hard IP | Vivado Synth, Quartus Pro, Synplify Pro, Diamond Synth | Yosys (+ SystemVerilog plugin) |
| Floorplan / placement | Assigns netlist to physical sites | Vivado Implementation, Quartus Fitter, Diamond Map | nextpnr (iCE40, ECP5, Nexus, Gowin, GateMate) |
| Routing | Connects placed sites via metal | Same as placement | Same as placement |
| Static timing analysis | Verifies setup/hold across PVT corners | Vivado Timing, Quartus TimeQuest / TimeQuest II, Tempus (Cadence) | OpenSTA |
| Bitstream generation | Produces device-loadable binary | Vivado write_bitstream, Quartus assembler, Diamond bitgen | Project IceStorm, Project Trellis, Project Apicula |
| Configuration / programming | Loads bitstream into FPGA | Vivado Hardware Manager, Quartus Programmer | openFPGALoader, iceprog, ecpprog, openocd |
| Hardware debug | In-system probing | Vivado ILA / VIO, SignalTap, Lattice Reveal | LiteScope, Glasgow Interface Explorer |
Common-design-block resource budgets (Xilinx UltraScale+ -2 speed grade)
| Block | LUTs | FFs | BRAM18 | URAM | DSP | f_max (typ) |
|---|---|---|---|---|---|---|
| 32-bit binary counter | 32 | 32 | 0 | 0 | 0 | 750 MHz |
| 32-bit ripple adder | 32 | 0 | 0 | 0 | 0 | combinational |
| 32-bit Kogge-Stone adder | ~150 | 0 | 0 | 0 | 0 | combinational, ~3 LUT levels |
| 32×32 = 64 multiply (DSP) | ~50 | 100 | 0 | 0 | 4 | 600 MHz pipelined |
| 1024×32 single-port BRAM | 0 | ~40 | 2 | 0 | 0 | 700 MHz |
| 4096×72 dual-port BRAM | 0 | ~80 | 16 | 0 | 0 | 600 MHz |
| 8192×72 packet buffer | 0 | ~100 | 0 | 2 | 0 | 700 MHz |
| Async FIFO 1024×64 (XPM) | 100 | 200 | 4 | 0 | 0 | 500 MHz both clocks |
| AXI4-Stream 64-bit interconnect M:N=2:2 | ~3000 | ~5000 | 2 | 0 | 0 | 400 MHz |
| 10 GbE MAC (soft) | ~10 k | ~12 k | 8 | 0 | 0 | 156.25 MHz |
| 100 GbE CMAC hard block | 0 | 0 | 0 | 0 | 0 | 322.27 MHz native |
| PCIe Gen3 x8 + XDMA | ~25 k | ~40 k | 30 | 0 | 0 | 250 MHz user clock |
| DDR4-2400 MIG controller | ~15 k | ~20 k | 20 | 0 | 0 | 300 MHz user clock |
| RISC-V VexRiscv core (small) | ~1 k | ~1 k | 1 | 0 | 0 | 200 MHz |
| MicroBlaze (small) | ~1 k | ~800 | 4 | 0 | 0 | 250 MHz |
Certification standards for hardware (FPGA-relevant)
| Standard | Domain | Level / scope | What it requires |
|---|---|---|---|
| DO-254 | Civil avionics | DAL A–E (A = catastrophic) | Hardware planning, requirements, design, V&V, configuration mgmt, traceability. DAL A/B need independent verification, elemental analysis (covered by Vivado / Synplify in DO-254 mode). |
| IEC 61508-2 / -3 | Industrial functional safety | SIL 1–4 | Random + systematic failure rates, FMEDA, diagnostic coverage. Hardware (-2) and software/IP (-3) tracks. |
| ISO 26262-5 / -11 | Automotive (passenger) | ASIL A–D (D highest) | Lockstep / DCLS cores, hardware metrics (SPFM, LFM, PMHF), independence of safety mechanism. Microchip PolarFire SoC, Xilinx Zynq US+ MPSoC, NXP S32 line all have ASIL-D ready variants. |
| EN 50128 / EN 50129 | Rail signalling | SIL 1–4 | Similar to IEC 61508 with rail-specific. |
| IEC 62443 | Industrial cybersecurity | SL 1–4 | Bitstream encryption, secure boot, key storage. |
| MIL-STD-883 / -461 / -750 | Defence | Class B/S | Environmental, EMI, mechanical. |
| RTCA DO-178C | Civil avionics software | DAL A–E | If a soft-core runs software, software certs apply alongside DO-254. |
| NASA EEE-INST-002 | Spaceflight parts | Level 1/2/3 | Screening, qualification for space. RTG4 and Virtex-5QV target this. |
5p. Theory — synthesis, timing closure, and CDC for FPGAs
Why LUTs win over discrete gates
A k-input LUT is a 2^k-bit SRAM with a tree of 2:1 muxes selecting one bit based on the k input signals. Any k-input Boolean function is implementable by writing the function’s truth table into the SRAM at configuration time. For k = 6 (Xilinx UltraScale+), a single LUT6 absorbs functions up to 6 variables — equivalent to several discrete gates worth of logic in one routing-hop. The trade-off vs ASIC standard cells:
- LUT area cost: ~12× the equivalent NAND2 in standard cells.
- LUT delay: ~5–10× a standard-cell NAND2 (gate delay + intra-LUT mux propagation).
- LUT power: ~5× per equivalent function.
These are the structural reasons an FPGA design is typically 3–10× larger, 3–5× slower, and 5–20× more power-hungry than the same RTL on an ASIC at the same process node. The reconfigurability — no NRE, instant turn-around, in-field updates — is what justifies those penalties for the application classes where they fit.
Synthesis flow inside the FPGA tool
Modern FPGA synthesis (Vivado Synth, Quartus Pro, Synplify Pro, Yosys) follows the same general flow as ASIC synthesis ([[Engineering/digital-logic]] section 5p) with one key difference: the target library is a fixed set of primitives — LUT6, FDRE, FDSE, RAMB36, DSP48E2, MMCM, BUFG, GTY — rather than a foundry standard-cell library. The stages are:
- Elaboration — flattens hierarchy, resolves generics. Resolves `parameter` overrides, `generate` constructs, package items.
- Logic synthesis — combinational logic mapped to LUT6 / LUT5 / fracturable LUT pairs. Espresso-style minimisation and Boolean restructuring.
- Resource mapping — multipliers mapped to DSP48 or LUT fabric (driven by the `USE_DSP` attribute and width thresholds), memories mapped to BRAM / URAM / distributed RAM (driven by `RAM_STYLE`), shift registers to SRL16/SRL32.
- Retiming — optional automatic FF migration across logic boundaries to balance pipeline stages (`-retiming` in Vivado synth, `-pipelining` in Synplify).
- Technology mapping — final form fits the device family’s LUT/FF/BRAM/DSP primitive set.
Output is a netlist; place-and-route (Implementation in Vivado, Fitter in Quartus) assigns physical sites, routes signals, and reports actual timing.
Static Timing Analysis on FPGAs
STA on an FPGA differs from ASIC STA mainly in the device-specific PVT corners the tool ships with. Vivado timing analyses paths across slow-slow-cold (worst setup, e.g. -40 °C, V_CCINT = 0.825 V, slow process for -1 part) and fast-fast-hot for hold. The user provides:
- Clock periods — `create_clock -period 2.5 [get_ports clk_400m]`
- Clock relationships — `set_clock_groups -asynchronous -group {clk_a} -group {clk_b}` when domains are independent
- I/O delays — `set_input_delay -clock clk -min/-max <ns> [get_ports din]` for source-synchronous timing
- False paths — `set_false_path -from [get_pins rst*]` for static signals or properly synchronised paths
- Multicycle paths — `set_multicycle_path 2 -setup -from ... -to ...` when data must hold valid for multiple cycles
Constraint files use XDC (Xilinx Design Constraints) or SDC (Synopsys Design Constraints) syntax — XDC is essentially TCL with a curated subset of SDC commands. Intel Quartus uses SDC directly. Constraints do not constrain the design; they tell the tool the designer’s intent. Wrong constraints (especially missing false paths or wrong clock groups) cause either timing failures the design doesn’t really have, or — far worse — passing timing on paths that are actually broken.
Clock-domain crossing on FPGAs
CDC fundamentals are in [[Engineering/digital-logic]] section 5p. FPGA-specific points:
- Vendor synchroniser primitives. Xilinx XPM_CDC_SYNC_RST, XPM_CDC_SINGLE, XPM_CDC_GRAY, XPM_CDC_HANDSHAKE, XPM_CDC_PULSE macros wrap 2/3-FF synchronisers, gray-counter CDC, full handshake protocols, and pulse extension respectively. Each carries automatic constraints (`ASYNC_REG = TRUE` attribute on the synchroniser flip-flops, plus `set_max_delay -datapath_only`). Intel altera_std_synchronizer and dcfifo macros do the same.
- `ASYNC_REG = TRUE` attribute (Xilinx) places the synchroniser FFs in adjacent slice locations and prevents synthesis from optimising them away. Quartus equivalent: `synchronizer_identification = forced`.
- MTBF reporting. Vivado reports CDC violations (paths through unsynchronised FFs in different clock groups) but does not compute MTBF — the designer must own the failure analysis for unconventional crossings.
- Multi-bit CDC. A 2-FF synchroniser on each bit of a 32-bit bus does not work — different bits resolve at different cycles. Use an async FIFO, a handshake, or a gray-coded encoding for monotonic counters.
- Reset CDC. Each clock domain must have its own reset synchroniser; an asynchronous reset asserted globally is fine, but deassertion must be re-synchronised to each domain’s clock individually. Skipping this lets some FFs come out of reset on a different cycle to others, breaking initial state.
- Recovery / removal timing — STA checks for asynchronous reset paths, analogous to setup/hold for data. Vivado reports `recovery` (reset must deassert this much before the clock edge) and `removal` (must stay asserted this long after); if these fail, asynchronous reset is effectively a CDC violation and the synchroniser is mandatory.
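Where the designer has to own the MTBF analysis, the usual first-order model is MTBF = exp(t_r / τ) / (T_w · f_clk · f_data), with t_r the time available for metastability to resolve. The sketch below applies it to a 2-FF and a 3-FF synchroniser. The values of τ and T_w are placeholders, not vendor data; real figures must come from the device characterisation report, and the 0.5 ns routing allowance is likewise assumed.

```python
from math import exp

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def synchroniser_mtbf_seconds(t_resolve_s, tau_s, t_window_s, f_clk_hz, f_data_hz):
    """First-order metastability model: exp(t_r / tau) / (T_w * f_clk * f_data)."""
    return exp(t_resolve_s / tau_s) / (t_window_s * f_clk_hz * f_data_hz)


# Placeholder device constants (assumptions, not datasheet values).
TAU_S = 20e-12        # metastability resolution time constant
T_WINDOW_S = 50e-12   # metastability capture window

F_CLK_HZ = 200e6      # destination clock
F_DATA_HZ = 10e6      # toggle rate of the asynchronous input
T_CLK_S = 1.0 / F_CLK_HZ

# Resolve time for the first stage: one period minus t_CO, t_setup,
# and an assumed 0.5 ns of routing between the two synchroniser FFs.
t_res_2ff = T_CLK_S - 0.35e-9 - 0.13e-9 - 0.5e-9
t_res_3ff = t_res_2ff + T_CLK_S  # an extra stage adds a full period

mtbf_2ff = synchroniser_mtbf_seconds(t_res_2ff, TAU_S, T_WINDOW_S, F_CLK_HZ, F_DATA_HZ)
mtbf_3ff = synchroniser_mtbf_seconds(t_res_3ff, TAU_S, T_WINDOW_S, F_CLK_HZ, F_DATA_HZ)
print(f"2-FF MTBF: {mtbf_2ff / SECONDS_PER_YEAR:.3e} years")
print(f"3-FF MTBF: {mtbf_3ff / SECONDS_PER_YEAR:.3e} years")
```

The exponential term explains two of the points above: placing the synchroniser FFs adjacent to each other recovers routing delay as resolve time, and the model only holds if the tools are prevented from retiming or duplicating those registers.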
Reconfiguration
The bitstream is loaded at power-up by one of:
- Master SPI from external Flash (most common; Micron N25Qxxx, Macronix MX25L, Cypress S25FL). 4 Mb to 2 Gb sizes. Boot in 1–10 s typical.
- Master BPI (parallel NOR Flash) — faster but more pins.
- JTAG (IEEE 1149.1) — typically development only. Xilinx XVC (Xilinx Virtual Cable) allows remote JTAG over TCP/IP.
- SelectMAP / FPP — parallel slave configuration from an external CPU (ARM, x86). Used in many SmartNIC designs where an external host owns the boot.
Partial reconfiguration (PR) — Xilinx Dynamic Function eXchange (DFX) and Intel Partial Reconfiguration allow swapping a region of the fabric while the rest continues to run. Used for FPGA-as-a-service (AWS F1), software-defined radio with multiple waveform plug-ins, and hardware accelerator hotswap. PR regions (“pblocks”) must be floorplanned, and the bitstream is region-specific. Reconfig time on UltraScale+: ~5–50 ms per region typical.
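A first estimate of load time, for a full or a partial bitstream, is simply the bitstream size divided by the configuration interface bandwidth. The sketch below covers only that raw transfer time; device initialisation, flash wake-up, and bitstream compression are ignored. The bitstream sizes and clock rates are hypothetical round numbers, not figures for any specific part.

```python
def config_time_s(bitstream_bits, bus_width_bits, cclk_hz):
    """Raw transfer time: bits / (bits per clock * clocks per second)."""
    return bitstream_bits / (bus_width_bits * cclk_hz)


FULL_BITSTREAM_BITS = 100e6   # hypothetical 100 Mb full-device bitstream
PARTIAL_BITSTREAM_BITS = 8e6  # hypothetical 8 Mb partial-region bitstream

interfaces = {
    "SPI x1 @ 50 MHz": (1, 50e6),
    "Quad SPI x4 @ 50 MHz": (4, 50e6),
    "Parallel x16 @ 100 MHz": (16, 100e6),
    "Internal x32 @ 200 MHz": (32, 200e6),
}

for name, (width, clk) in interfaces.items():
    full_ms = config_time_s(FULL_BITSTREAM_BITS, width, clk) * 1e3
    part_ms = config_time_s(PARTIAL_BITSTREAM_BITS, width, clk) * 1e3
    print(f"{name:24s} full {full_ms:8.1f} ms   partial {part_ms:6.2f} ms")
```

The comparison makes the design trade-off visible: a wider or faster configuration port shortens boot, and partial bitstreams load faster roughly in proportion to the region size.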
Power consumption
P_total = P_dynamic + P_static + P_IO_xceiver
- Dynamic — α · C · V_CCINT² · f, summed over all toggling nodes. Reduced by clock gating (BUFGCE with enable), reducing toggle activity, and lowering V_CCINT (configurable via UltraScale+ system-monitor + external regulator).
- Static (leakage) — temperature-dependent; 28 nm and below contributes significantly. A fully populated VU13P at 85 °C die: ~30 W static alone.
- Transceivers and I/O — GTH/GTY at full line rate: ~0.4 W per lane. A 16-lane PCIe Gen4 link: ~7 W.
Xilinx Power Estimator (XPE) spreadsheets give a pre-implementation budget; post-implementation `report_power` in Vivado gives the actual figure. The two typically match within 10 % when activity factors are realistic.
Floorplanning and physical constraints
Large designs (> 50 % of fabric utilised, or with mixed-clock-domain blocks) often need floorplan constraints to converge timing. Tools:
- Pblocks (Xilinx) / LogicLock regions (Intel) — rectangular fabric regions that constrain placement of a hierarchical module. Used to (a) keep latency-critical logic together, (b) reserve area for partial-reconfiguration regions, (c) isolate safety-critical lockstep blocks from each other for ISO 26262 independence requirements.
- LOC constraints — pin a specific cell to a specific site (e.g. `set_property LOC SLICE_X10Y20 [get_cells my_reg]`). Used sparingly — usually for IO buffers, MMCMs, GTYs, and the occasional high-fanout register.
- Clock-region constraints — confine a clock and its loads to a clock region (clusters of ~50–100 CLBs sharing a clock-region network). Reduces clock skew and routing congestion.
- Hold-fixing buffers — Vivado / Quartus auto-insert LUT1 buffers in too-fast paths during the optimise phase; user usually doesn’t intervene.
Bad floorplanning is worse than no floorplanning. Pblocks too tight cause congestion + routing failure; pblocks too loose defeat the purpose. Start without floorplanning, add only after the tool has shown which paths it cannot close on its own.
6p. Application — design patterns
- Finite state machine (Moore vs Mealy, one-hot vs binary encoding). The synthesis tool picks an encoding automatically unless the `FSM_ENCODING` attribute overrides it: `one_hot` for ≤ 16 states (LUT-efficient on FPGA), `binary` or `gray` for larger.
- FIFO — synchronous (single-clock, BRAM-backed) or asynchronous (CDC, gray-coded pointers). Always use the vendor’s parameterised macros (Xilinx XPM_FIFO_*, Intel dcfifo).
- Pipelined DSP — FIR filters via DSP slices, polyphase decimators / interpolators, CIC filters, FFT cores (Vitis DSP Library, Intel FFT IP). Symmetric / antisymmetric filters halve multiplier count via pre-adders.
- Memory controllers — DDR4 / DDR5 / LPDDR4 / HBM via vendor IP. Xilinx MIG (Memory Interface Generator) for 7-series, integrated DDR/HBM controllers on UltraScale+ and Versal. Latency 80–150 ns to DDR4 from AXI4 master.
- AXI4 / AXI4-Stream / AXI4-Lite — ARM AMBA interface fabric. AXI4 burst memory, AXI4-Stream packet/sample DMA, AXI4-Lite for control registers. Bridges via Xilinx AXI Interconnect or SmartConnect, Intel Avalon-MM↔AXI bridges. Most third-party IP standardises on AXI.
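- PCIe endpoint / root-complex: hard IP block + DMA engine (Xilinx XDMA, QDMA; Intel DMA IP). User logic sits behind an AXI4-Stream packet interface or AXI4 memory-mapped interface. PCIe Gen3 x8 carries about 7.88 GB/s per direction after 128b/130b encoding; sustained DMA throughput is typically around 6.5–7 GB/s once TLP header and flow-control overhead are paid.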
- Ethernet MAC + PHY — 1G via tri-mode soft MAC + RGMII PHY; 10G via XGEMAC + KR/SR SerDes; 25G / 100G via hard CMAC + RS-FEC. PCS/PMA always uses transceivers.
- Image-processing pipelines — Vitis Vision Library (HLS-generated OpenCV-equivalent kernels), Intel Computer Vision SDK.
- HLS for ML inference — Vitis AI (Xilinx, DPU IP for Zynq US+ and Versal AIE), Intel OpenVINO FPGA, FINN (Xilinx research; quantised CNN to HLS), hls4ml (CERN; tiny networks for trigger systems).
- Soft-core CPUs: Xilinx MicroBlaze (32-bit, RISC-like), Intel Nios II (proprietary ISA) and Nios V (RISC-V), VexRiscv (open-source, configurable), PicoRV32 (tiny). Used for control plane around fabric datapaths; see [[Engineering/microcontrollers]] for software workflow.
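The gray-coded pointers mentioned for asynchronous FIFOs rely on one property: consecutive pointer values differ in exactly one bit, so a pointer sampled mid-transition in the other clock domain is off by at most one position. The following Python model is a minimal sketch of that property, not of any vendor macro; the 4-bit pointer width is an arbitrary choice for the demonstration.

```python
def bin_to_gray(value: int) -> int:
    """Binary to reflected Gray code: XOR the value with itself shifted right by one."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Gray to binary: XOR together all right-shifted copies of the Gray word."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


ADDR_BITS = 4            # arbitrary pointer width for the demonstration
DEPTH = 1 << ADDR_BITS

for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH  # includes the wrap-around from DEPTH-1 back to 0
    # Exactly one bit changes per increment, so a sample taken during the
    # transition resolves to either the old or the new pointer value.
    assert hamming_distance(bin_to_gray(ptr), bin_to_gray(nxt)) == 1
    assert gray_to_bin(bin_to_gray(ptr)) == ptr

print([format(bin_to_gray(p), "04b") for p in range(DEPTH)])
```

The single-bit property holds across the wrap-around only when the pointer range is a power of two, which is one reason asynchronous FIFO depths are normally powers of two.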
FPGA vs ASIC vs CPU — when to choose which
| Concern | FPGA wins | ASIC wins | CPU / MCU wins |
|---|---|---|---|
| Unit volume | < 100 k | > 1 M | any |
| Time to first silicon | weeks | months to years | hours |
| NRE | USD 0–100 k (tools + dev kit) | USD 1 M – 100 M (mask + design) | USD 0 |
| Unit cost (high volume) | USD 5–10 000 | USD 0.10–500 | USD 0.30–100 |
| Power efficiency vs algorithm | mid (5–20× worse than ASIC) | best | worst on data-parallel; best on control-heavy |
| Determinism | hard real-time, sub-ns jitter | hard real-time | varies — best on bare-metal MCU, worst on Linux SoC |
| Field update | yes (bitstream) | no (mask spin) | yes (firmware) |
| Protocol flexibility | yes (any I/O, any line-rate ≤ device limit) | no | limited to integrated peripherals |
| Tooling cost | USD 0–50 k seats | USD 1–10 M tool + verification IP | USD 0–10 k seats |
| Required expertise | hardware (RTL + verification + closure) | hardware + foundry partnership | software |
The decision tree: if the workload fits the CPU’s throughput and timing budget, use a CPU (software is always cheaper). If it doesn’t fit a CPU but ships < 100 k units, use an FPGA. If volume > 1 M units, and the design is stable, and power efficiency / unit cost matter, the ASIC NRE amortises. Many products start FPGA, prove the architecture at hundreds-of-units pilot scale, then port to ASIC for volume launch.
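The decision tree above can be written down as a small Python function, together with the break-even volume at which ASIC NRE is paid back by the lower unit cost. This is a sketch under stated assumptions: the function and parameter names are invented for illustration, the source does not say what to do between 100 k and 1 M units so the fall-through answer is a guess, and the example prices are placeholders chosen from within the table's ranges.

```python
import math


def choose_platform(fits_cpu: bool, units: int, design_stable: bool,
                    efficiency_or_cost_critical: bool) -> str:
    """Encode the decision tree from the paragraph above."""
    if fits_cpu:
        return "CPU/MCU"  # software is always cheaper
    if units < 100_000:
        return "FPGA"
    if units > 1_000_000 and design_stable and efficiency_or_cost_critical:
        return "ASIC"
    # Assumption: cases the text does not cover stay on FPGA until they firm up.
    return "FPGA (re-evaluate ASIC as volume and stability firm up)"


def asic_break_even_units(nre_usd: float, fpga_unit_usd: float,
                          asic_unit_usd: float) -> int:
    """Smallest volume at which NRE is recovered by the per-unit saving."""
    saving = fpga_unit_usd - asic_unit_usd
    if saving <= 0:
        raise ValueError("ASIC unit cost must be below FPGA unit cost")
    return math.ceil(nre_usd / saving)


print(choose_platform(fits_cpu=False, units=20_000,
                      design_stable=False, efficiency_or_cost_critical=True))
print(choose_platform(fits_cpu=False, units=5_000_000,
                      design_stable=True, efficiency_or_cost_critical=True))
# Placeholder prices: USD 5 M NRE, USD 60 FPGA, USD 10 ASIC.
print(asic_break_even_units(5_000_000, 60.0, 10.0))
```

The break-even figure covers NRE only. Schedule risk and the cost of a mask respin if the design changes are not modelled.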
Verification — the other half of the project
An FPGA RTL block passes through several tiers of verification, using a mix of methodologies and tools, before first board bring-up (or tape-out, for ASIC-bound RTL):
- Unit-level RTL simulation — testbench drives the block’s interfaces, scoreboards check responses. ModelSim / Questa / Verilator. Most engineers spend 50–70 % of project time here.
- UVM (Universal Verification Methodology, IEEE 1800.2) — SystemVerilog object-oriented testbench framework with constrained-random stimulus, transaction-level interfaces, sequencers, drivers, monitors, scoreboards, coverage collectors. Dominant for ASIC; increasingly used on large FPGA SoCs.
- Cocotb (Python) — Python-based testbench wrapping any simulator. Faster to write than UVM; preferred in startups, open-source projects, and small FPGA teams.
- OSVVM (Open Source VHDL Verification Methodology) — VHDL equivalent of UVM. Free, IEEE-compatible, used in defence and aerospace.
- Formal property verification: JasperGold, VC Formal, SymbiYosys. Mathematical proof that SystemVerilog Assertions (SVA) hold for all input sequences, either up to a bounded depth (bounded model checking) or for unbounded time when an inductive proof converges. Catches deadlocks, livelocks, AXI4-protocol violations, and CDC bugs that random simulation misses.
- Hardware-in-the-loop (HIL) — design loaded onto a dev board with traffic generators or scope-injected stimulus. Catches issues with timing margins, PVT, power-up sequencing, and analog-coupled effects (jitter, SI) that simulation misses.
- Emulation: Cadence Palladium / Synopsys ZeBu / Mentor Veloce map the entire RTL onto a dedicated emulation appliance (FPGA-based or built on custom emulation processors, depending on the vendor) that runs at MHz speeds and executes real software workloads pre-silicon. USD millions per box; only large SoC teams use them.
- Coverage closure — code coverage (line, toggle, branch, expression, FSM), functional coverage (covergroups in SystemVerilog), assertion coverage. 100 % code coverage and 95 % + functional is the gate to tape-out for many ASIC teams; FPGA teams typically target 80 % + and rely on board-level testing for the rest.
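The scoreboard, constrained-random, and functional-coverage ideas in the list above can be shown without a simulator. The pure-Python sketch below is a simplified stand-in for a real testbench: the block under test is a hypothetical 8-bit saturating adder, dut_stub takes the place of a simulated DUT, and the 20 % corner-value bias and the coverage bin names are arbitrary choices.

```python
import random


def reference_model(a: int, b: int) -> int:
    """Golden model: 8-bit saturating add."""
    return min(a + b, 0xFF)


def dut_stub(a: int, b: int) -> int:
    """Stand-in for the real DUT; a cocotb or UVM bench would drive RTL here."""
    total = a + b
    return 0xFF if total > 0xFF else total


def constrained_random_operand(rng: random.Random) -> int:
    """Bias 20 % of stimulus toward corner values, otherwise uniform over 0..255."""
    if rng.random() < 0.2:
        return rng.choice([0x00, 0x01, 0xFE, 0xFF])
    return rng.randrange(0x100)


def run_regression(num_tests: int, seed: int):
    """Drive random stimulus, compare DUT against the model, and count coverage bins."""
    rng = random.Random(seed)  # seeded so that a failure can be replayed
    coverage = {"no_saturation": 0, "saturation": 0, "both_max": 0}
    mismatches = []
    for _ in range(num_tests):
        a = constrained_random_operand(rng)
        b = constrained_random_operand(rng)
        expected, actual = reference_model(a, b), dut_stub(a, b)
        if expected != actual:  # scoreboard check
            mismatches.append((a, b, expected, actual))
        coverage["saturation" if a + b > 0xFF else "no_saturation"] += 1
        if a == 0xFF and b == 0xFF:
            coverage["both_max"] += 1
    return mismatches, coverage


mismatches, coverage = run_regression(num_tests=2000, seed=1)
print("mismatches:", len(mismatches))
print("coverage hits:", coverage)
```

A bin that is still at zero after the run is a coverage hole: either the constraints need adjusting or a directed test is needed for that case.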
7p. Edge cases & assumptions
- Inferred latches from incomplete case / if-else chains: usually a bug. Always include default (case) or else (if). A combinational always block that does not assign a signal on every path infers a latch, which is level-sensitive, glitch-prone, and hard for STA to analyse on FPGAs.
- Clock-domain crossing without synchronisers → metastability → field failures. Vivado CDC reports flag this; ignore them at your peril. The 2-FF synchroniser brings MTBF to centuries; its absence brings it to days.
- Reset strategy. Synchronous reset is preferred on Xilinx UltraScale+: DSP and block-RAM registers support only synchronous reset, so registers with asynchronous reset cannot be absorbed into those blocks. Synchronous reset also keeps timing fully analysable. An asynchronous reset must always be deasserted synchronously (small reset-synchroniser per clock domain). Microchip PolarFire prefers async assert / sync deassert; Intel Agilex prefers fully sync.
- Initialisation of BRAM / FF. Xilinx FFs initialise to the value specified by their INIT attribute (default 0) at configuration. BRAMs and URAMs initialise from INIT_xx configuration data. Lattice iCE40 UltraPlus SPRAM is not initialised by the bitstream (the designer must write zeros in a startup state machine). Microchip PolarFire’s non-volatile cells retain prior state across power cycles, which is both useful and dangerous.
- Power consumption. Dynamic ∝ C · V² · f; at 1 GHz on Versal Premium a fully-active 1 % of fabric is enough to push 50 W. Clock gating is mandatory for portable / battery applications. UltraScale+ has dynamic-V_CCINT (configurable supply voltage at runtime) but few designs use it.
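- Reconfiguration time. Cold boot 10–500 ms typical depending on bitstream size and SPI flash speed. Versal Premium VP1502 bitstream: ~200 MB; a single QSPI at 100 MHz moves at most 50 MB/s (4 bits per clock), so loading takes on the order of 4 s, well outside the typical range, unless a wider or faster boot interface is used. Partial reconfig: 5–50 ms per region.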
- Single-event upset (SEU) in space / high-altitude / high-energy physics. Cosmic rays flip individual configuration bits, corrupting logic. Mitigations: SEU-mitigated FPGAs (Microchip RTG4 with TMR on every FF, Xilinx Virtex-5QV, RT Kintex UltraScale), bitstream scrubbing (Xilinx SEM IP periodically re-reads and corrects configuration), TMR (triple-modular redundancy) at user-logic level (Synopsys Synplify Pro auto-TMR, Mentor Precision Hi-Rel), ECC on BRAM (built-in on UltraScale+).
- Bitstream encryption and IP security. AES-256 / AES-GCM bitstream encryption with keys in eFuse or battery-backed BBRAM (Xilinx, Intel). Authentication via HMAC or RSA-4096 signature. Required for anti-piracy and anti-tamper in defence/industrial. Microchip PolarFire adds DPA-resistant boot — protects keys against side-channel attacks.
- Lockstep / DCLS (dual-core lockstep) for safety-critical (DO-254 DAL A/B, IEC 61508 SIL 3/4, ISO 26262 ASIL D). Both cores execute the same workload; a comparator flags divergence. Microchip PolarFire SoC and Xilinx Zynq US+ MPSoC offer lockstep RISC-V / Cortex-R5 pairs. Pure-fabric lockstep: replicate the RTL and compare outputs externally.
- Hot-pluggable in datacentre. Power sequencing is critical: V_CCINT before V_CCAUX before V_CCO, deassert PROGRAM_B last, monitor INIT_B and DONE. Mis-sequencing latches up the part or corrupts boot. AWS F1, Azure NP, and Alibaba FPGA-cloud designs all have documented sequencing.
- Counterfeit parts. Refurbished and relabelled Xilinx / Intel parts circulate in the grey market. Verify supply via authorised distributors (Avnet, Arrow, Future, Mouser, DigiKey). Xilinx and Intel run anti-counterfeit programs with device ID readback (DNA register) and traceability.
- Vendor toolchain lock-in. XDC and SDC constraints are not perfectly portable across Vivado, Quartus, Diamond/Radiant, and Libero. SystemVerilog feature support varies: interface, program, bind, and covergroup support differs significantly. RTL written to the IEEE 1800-2017 subset that Synplify Pro accepts is the safest cross-vendor portable form.
- Compile times. Versal Premium HBM design: 4–12 hours full P+R wall-time on a 64-core server with 256 GB RAM. Smaller designs (Artix-7, Cyclone V) finish in 5–30 minutes. Incremental synthesis + implementation (Vivado Block-Level Synthesis, Quartus Incremental Compile) reduces iteration to 10–30 minutes for small RTL changes on big designs.
- HLS quality limits. Hand-coded RTL is still 2–5× faster and smaller than HLS-generated RTL for highly-tuned DSP or networking kernels. HLS pays off for productivity (algorithmic exploration, video / vision / ML where the algorithm is complex but the dataflow is regular).
- Glitches feeding asynchronous resets or clocks: same rule as [[Engineering/digital-logic]] section 7p. Never use a combinational signal as a clock or asynchronous reset; always register it first.
- Routing congestion vs logic utilisation. A design at 85 % LUT utilisation may fail P+R because routing resources, not logic resources, ran out. Congestion is worst around dense BRAM / DSP columns where many nets converge. Symptoms: unroutable nets in the router log, very long P+R wall-time, marginal timing closure. Fixes: reduce utilisation, floorplan to spread the hot region, change BRAM packing (RAM_STYLE = "distributed" for small memories), or move to a larger device.
- Pin assignment drift. Left unconstrained, Vivado / Quartus pin auto-placement will move I/Os between iterations, breaking PCB layout. Always lock pins in the XDC / QSF early and keep the constraint file under version control.
- IP version mismatch. Vendor IP (PCIe, DDR controllers, Ethernet) is tied to a specific tool version. Upgrading Vivado from 2023.1 to 2024.1 often forces IP re-customisation. Check release notes for IP-breaking changes before upgrading mid-project.
- License servers: most commercial tools (Vivado Enterprise, Quartus Pro, ModelSim, VCS, JasperGold) require a network license server. Plan for license-server outages by using either node-locked or cloud-floating licenses, and never let a developer’s daily work depend on a single license that may be checked out elsewhere.
- Bitstream-revision compatibility. Configuring an UltraScale+ part with a bitstream built for a different speed grade is allowed by hardware but will fail timing in the field. The verify step in the programmer matters.
- Heat dissipation. Versal Premium and Stratix 10 / Agilex 7 routinely dissipate > 50 W; thermal pads, heatsinks, and 200+ LFM forced airflow are not optional. Junction temperature must stay below T_J max (typically 100 °C); above that, the part throttles or shuts down via on-die thermal sensor (System Monitor / SmartVID).
- Decoupling and PDN. A modern FPGA package draws di/dt spikes of tens of amps per nanosecond on V_CCINT. Hundreds of 0.1 µF capacitors plus tens of 10 µF bulk plus a controlled-impedance plane pair are the minimum for clean operation; vendor reference designs spell out exact placement (see UG583 PCB Design and Pin Planning Guide for Xilinx).
- JTAG chain integrity. Mixed-voltage JTAG chains (3.3 V Xilinx + 1.8 V Lattice) need level translators; otherwise device-detection fails or boundary-scan reports phantom chains.
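The synchroniser MTBF claim in the clock-domain-crossing bullet follows from the standard metastability model, MTBF = exp(t_r / τ) / (T_w · f_clk · f_data), where t_r is the time available for a metastable output to resolve. The sketch below evaluates that formula in Python. All parameter values (τ, T_w, clock and data rates, and the 1 ns of slack assumed for the unsynchronised case) are illustrative placeholders, not characterised figures for any device.

```python
import math


def synchroniser_mtbf_s(resolve_time_s, tau_s, window_s, clk_hz, data_hz):
    """MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), in seconds."""
    return math.exp(resolve_time_s / tau_s) / (window_s * clk_hz * data_hz)


# Illustrative placeholders, not datasheet values.
TAU_S = 50e-12       # metastability resolution time constant
WINDOW_S = 50e-12    # metastability capture window T_w
CLK_HZ = 200e6       # receiving clock frequency
DATA_HZ = 10e6       # toggle rate of the asynchronous input
SLACK_S = 1e-9       # resolve time left when the first FF feeds logic directly
CLK_PERIOD_S = 1.0 / CLK_HZ

# Without a second stage, only the path slack is available for resolution.
mtbf_single = synchroniser_mtbf_s(SLACK_S, TAU_S, WINDOW_S, CLK_HZ, DATA_HZ)
# Simplified model: a second FF adds a full clock period of resolve time.
mtbf_double = synchroniser_mtbf_s(SLACK_S + CLK_PERIOD_S, TAU_S, WINDOW_S,
                                  CLK_HZ, DATA_HZ)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
print(f"single stage: {mtbf_single:.3g} s")
print(f"two stages:   {mtbf_double / SECONDS_PER_YEAR:.3g} years")
```

Because resolve time sits in the exponent, each extra clock period of settling multiplies MTBF by exp(T_clk / τ). That is why one additional flip-flop changes the outcome so drastically.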
8p. Tools & software
Vendor end-to-end suites
- AMD Vivado / Vitis — Vivado for RTL + synthesis + P+R + bitstream (UltraScale+, Versal). Vivado HLS / Vitis HLS for C/C++ → RTL. Vitis Embedded for Zynq Linux / FreeRTOS / bare-metal. Vitis AI for ML inference on Zynq US+ and Versal AIE. Free WebPACK / ML editions for smaller parts; Enterprise for largest. Vivado Hardware Manager for JTAG programming and ILA debug.
- Intel Quartus Prime Pro / Standard / Lite: synthesis + P+R for Cyclone, Arria, Stratix, Agilex. Platform Designer (Qsys) for IP integration. oneAPI for HLS via SYCL. Signal Tap II for in-system probing.
- Lattice Diamond / Radiant / Propel: Diamond for older families (MachXO2, MachXO3, ECP3, ECP5), Radiant for current ones (iCE40 UltraPlus, CrossLink-NX, CertusPro-NX), Propel for software-driven flow on Avant.
- Microchip Libero SoC — PolarFire, PolarFire SoC, SmartFusion2, RTG4 toolchain. Includes secure boot, FlashPro hardware programming.
Open-source
- Yosys (Claire Wolf, now CTCFT / YosysHQ): synthesis for FPGAs and (via OpenROAD downstream) ASICs. Reads Verilog and a SystemVerilog subset (fuller SystemVerilog via sv2v or the Slang-based plugin). Targets nextpnr-compatible architectures.
- nextpnr: place-and-route for iCE40 (Project IceStorm), ECP5 (Project Trellis), MachXO3 / Nexus (Project Oxide), GOWIN LittleBee / Arora (Project Apicula), GateMate (Cologne Chip). No upstream support for Xilinx 7-series or UltraScale+ as of 2025 (Project X-Ray bitstream documentation incomplete for routing).
- SymbiYosys: open formal verification (model checking, BMC, k-induction) using SMT solvers (Z3, Boolector, Yices). A front-end driver for Yosys-based formal flows.
- Verilator: open compiled Verilog → C++ simulation. 10–100× faster than commercial event-driven simulators for pure RTL. Used heavily by SiFive, lowRISC, Chipyard, and many other open RISC-V projects.
- GHDL: open VHDL simulator. Pairs with GTKWave for waveform viewing.
- Cocotb — Python-based testbench framework wrapping any simulator (Verilator, Icarus, GHDL, ModelSim, VCS, Xcelium). Increasingly the open-source default for verification.
- Icarus Verilog — interpreted Verilog simulator. Slower than Verilator but supports event semantics.
Third-party / multi-vendor
- Synplify Pro / Synplify Premier (Synopsys, commercial) — synthesis targeting all major FPGA vendors. Used by ASIC teams targeting FPGAs for prototyping. Hi-Rel variant adds auto-TMR for SEU.
- Mentor / Siemens EDA Precision (commercial) — peer to Synplify Pro; Precision Hi-Rel for rad-hard.
- ModelSim / Questa (Siemens EDA) — industry-default simulator; UVM-aware; mixed-language.
- Xcelium (Cadence), VCS (Synopsys) — commercial simulators for big SoC houses.
- JasperGold (Cadence), VC Formal (Synopsys), Questa Formal (Siemens), OneSpin (Siemens) — commercial formal property verification.
Debug
- Vivado ILA / VIO (Integrated Logic Analyzer / Virtual Input-Output) — embeds a logic-analyser core in the design, captures internal signals on configurable triggers, reads back over JTAG. Adds 1 LUT + 1 FF + BRAM per probe sample-depth slot.
- Intel Signal Tap II: same concept for Quartus.
- Lattice Reveal — same for Diamond/Radiant.
- Synplify Identify — vendor-neutral RTL-level probing.
- External JTAG dongles — Xilinx Platform Cable USB II / SmartLynq, Intel USB-Blaster II, Lattice HW-USBN-2B, Microchip FlashPro5; or generic FT2232H-based dongles + OpenOCD.
Hardware development kits
- Digilent Arty A7 (Artix-7 XC7A35T, USD ~130), Arty S7 (Spartan-7), Nexys A7 / Nexys Video (Artix-7 + DDR3 + HDMI), Cmod A7 (small Artix-7 module in a breadboard-friendly DIP form factor).
- Avnet ZedBoard (Zynq-7000 XC7Z020, USD ~500), MiniZed (Zynq Z-7007S, USD ~150), PicoZed, MicroZed.
- AMD/Xilinx KCU105 (Kintex UltraScale) / KCU116 (Kintex UltraScale+), VCK190 / VEK280 (Versal AI Core / AI Edge), Alveo U200 / U250 / U280 / U55C (data-centre PCIe cards).
- Terasic DE10-Nano (Intel Cyclone V SoC, USD ~150), DE10-Standard, DE2-115 (Cyclone IV).
- Intel Agilex 7 F-Series transceiver dev kit (AGFB014R24A2E2V).
- Lattice iCEstick (iCE40HX1K, USD ~30), iCEBreaker (iCE40UP5K, fully open toolchain), ECP5 Evaluation Board (LFE5UM5G-85F), CrossLink-NX Eval.
- Microchip MPF300-Splash-Kit (PolarFire), PolarFire SoC Discovery Kit (USD ~700).
- Cologne Chip GateMate Eval Board (GateMate A1, fully open toolchain via Yosys + nextpnr).
Cloud FPGA
- AWS F1 instances — f1.2xlarge / f1.4xlarge / f1.16xlarge with 1, 2, or 8 Xilinx Virtex UltraScale+ VU9P boards. Used for ML inference, genomics, financial.
- Alibaba F3: Xilinx Virtex UltraScale+ VU9P (the earlier F1 family used Intel Arria 10).
- Azure NP-series — Xilinx Alveo U250.
Constraints file primer
A minimal Xilinx XDC for a 200 MHz design with a 100 MHz input clock, async reset, and a 50 MHz output bus might read:
# Primary clock (from external oscillator)
create_clock -name clk_100m -period 10.0 [get_ports clk_in_p]
# MMCM-generated clocks (Vivado infers from MMCM IP, but explicit is safer)
create_generated_clock -name clk_200m -source [get_pins mmcm/CLKIN1] \
    -multiply_by 2 [get_pins mmcm/CLKOUT0]
# Forwarded 50 MHz output clock. Assumption for this example: it is driven
# from MMCM CLKOUT1 and leaves the device on a port named clk_out.
create_generated_clock -name clk_50m_out -source [get_pins mmcm/CLKOUT1] \
    -divide_by 1 [get_ports clk_out]
# External UART clock. Assumption for this example: 50 MHz on a port named uart_clk_in.
create_clock -name clk_uart -period 20.0 [get_ports uart_clk_in]
# Async reset: synchronised internally; mark the input as false-path
set_false_path -from [get_ports reset_n]
# Output: 50 MHz source-synchronous, ±2 ns total skew budget
# (braces stop Tcl from treating [*] as command substitution)
set_output_delay -max 2.0 -clock clk_50m_out [get_ports {data_out[*]}]
set_output_delay -min -2.0 -clock clk_50m_out [get_ports {data_out[*]}]
# CDC: clk_100m and clk_200m are related via MMCM; not asynchronous
# but clk_uart (external) is asynchronous to both
set_clock_groups -asynchronous -group {clk_100m clk_200m} -group {clk_uart}
# Pin assignments
set_property PACKAGE_PIN E3 [get_ports clk_in_p]
set_property IOSTANDARD LVDS_25 [get_ports clk_in_p]
set_property PACKAGE_PIN H14 [get_ports {data_out[0]}]
set_property IOSTANDARD LVCMOS18 [get_ports {data_out[*]}]
Intel SDC syntax is nearly identical for the timing commands; pin assignments live in the .qsf (Quartus Settings File) instead.
Continuous integration for FPGA projects
Modern FPGA projects mirror software CI / CD:
- Source control — Git for RTL, constraints, testbenches, scripts. Generated outputs (bitstream, .runs/) excluded via .gitignore.
- Tcl-driven flow: Vivado batch (vivado -mode batch -source build.tcl), Quartus quartus_sh -t build.tcl, Yosys .ys scripts. Reproducible builds, no GUI clicks.
- Containerisation: Docker images with the tool installed and pointed at a network license server (usually built in-house from the vendor installer). Build inside CI without polluting developer machines.
- Regression suites: every commit triggers RTL simulation across the testbench library. Cocotb + pytest integration is natural; UVM regression usually runs via make or vendor regression managers.
- Linting / style gates: Verible, Verilator -Wall, vendor lint (Vivado synth_design -mode out_of_context with -assert). Block PRs that introduce new lint warnings.
- Implementation gate: full build to bitstream + timing-report parsing as a CI step. Fails the build on WNS < 0. Heavy (hours); often nightly rather than per-commit on big designs.
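The implementation gate described above amounts to extracting WNS from the timing report and turning it into an exit code. The Python sketch below shows one way to do that. It assumes the report contains a summary table whose header row starts with WNS(ns), followed by a row of dashes and a row of values, as in a Vivado timing summary; the embedded sample text is hand-written for illustration and the function names are hypothetical.

```python
import re

# Hand-written sample in the assumed layout; not output captured from a real run.
SAMPLE_REPORT = """\
Design Timing Summary
---------------------

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints
    -------      -------  ---------------------  -------------------
     -0.212       -3.410                     27                 5120
"""

NUMBER = re.compile(r"-?\d+(\.\d+)?")


def parse_wns_ns(report_text: str) -> float:
    """Return the WNS value from the first summary table in the report."""
    lines = report_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("WNS(ns)"):
            # The values row is the first following row whose first field is a number;
            # the row of dashes in between does not match.
            for row in lines[i + 1:]:
                fields = row.split()
                if fields and NUMBER.fullmatch(fields[0]):
                    return float(fields[0])
    raise ValueError("no WNS(ns) summary table found")


def timing_gate(report_text: str) -> int:
    """Return a process exit code: 0 when WNS >= 0, otherwise 1."""
    return 0 if parse_wns_ns(report_text) >= 0.0 else 1


wns = parse_wns_ns(SAMPLE_REPORT)
print(f"WNS = {wns} ns, exit code {timing_gate(SAMPLE_REPORT)}")
```

In a CI job the report text would be read from the file the build script writes, and the return value passed to sys.exit so the pipeline stage fails on negative slack.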
11. Cross-references
- [[Engineering/digital-logic]]: RTL fundamentals (CMOS, FFs, FSMs, setup/hold/skew, CDC). Read first if unfamiliar; FPGA design specialises this material.
- [[Engineering/semiconductor-devices]]: CMOS transistor physics underlies every LUT, FF, and SerDes.
- [[Engineering/microcontrollers]]: soft cores (MicroBlaze, Nios II, VexRiscv) and hard cores (Zynq Cortex-A, PolarFire SoC RISC-V) inside FPGAs use the same software workflow.
- [[Engineering/digital-control]]: FPGA-implemented control loops at MHz rates (motor control, power-electronics PWM, lidar / radar signal processing).
- [[Engineering/power-electronics]]: gate-drive timing, dead-time generation, predictive control on FPGA.
- [[Engineering/pcb-design]]: FPGA package escape routing (BGA fanout), DDR4 / DDR5 length matching, transceiver layout, PDN decoupling.
- [[Engineering/op-amps]]: analog front-end (instrumentation amps, anti-alias filters, programmable-gain amplifiers) before ADCs feeding FPGAs.
- [[Engineering/circuit-analysis]]: DC + AC analysis for I/O signalling.
- [[Engineering/electromagnetics-engineering]]: high-speed-signal transmission-line theory, S-parameters, eye diagrams.
- planned [[Engineering/realtime-embedded]]: RTOS and bare-metal software running on Zynq / PolarFire SoC’s hard-core side.
- planned [[Robotics/bayesian-estimation]]: multi-sensor synchronisation (lidar + camera + IMU) implemented on FPGA front-ends.
- [[Languages/Tier3/hdl]]: VHDL, SystemVerilog, Chisel, SpinalHDL, Amaranth idioms and code examples.
- [[Languages/Tier3/notation-spec]]: SVA, PSL for formal-property specification.
- [[Languages/Tier3/theorem-prover-dsls]]: formal-verification ecosystem (TLA+, SymbiYosys backends, JasperGold property templates).
- [[Languages/Tier3/assembly-and-encoding]]: RISC-V and ARM ISA encoding for soft-core / hard-core dispatch.
12. Citations
- Chu, P. P. (2018). FPGA Prototyping by SystemVerilog Examples (2nd ed.). Wiley. Canonical lab-bench text — board-tested RTL examples for Nexys / Basys 3.
- Chu, P. P. (2017). FPGA Prototyping by VHDL Examples (2nd ed.). Wiley. VHDL companion.
- Kilts, S. (2007). Advanced FPGA Design: Architecture, Implementation, and Optimization. Wiley. Timing closure, retiming, pipelining, CDC. Still the canonical practitioner book.
- Crockett, L. H., Elliot, R. A., Enderwitz, M. A. & Stewart, R. W. (2014). The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. Strathclyde Academic Media. Free PDF — classic Zynq introduction.
- Pellerin, D. & Thibault, S. (2005). Practical FPGA Programming in C. Prentice Hall. Early HLS perspective; concepts still relevant to Vitis HLS workflow.
- Rodríguez Andina, J. J., de la Torre Arnanz, E. & Valdés Peña, M. D. (2017). FPGAs: Fundamentals, Advanced Features, and Applications in Industrial Electronics. CRC Press. Industrial-control orientation.
- Spear, C. & Tumbush, G. (2012). SystemVerilog for Verification (3rd ed.). Springer. Constrained-random verification, UVM foundations.
- Cohen, B., Venkataramanan, S., Kumari, A. & Piper, L. (2013). SystemVerilog Assertions Handbook (4th ed.). VhdlCohen Publishing. SVA reference.
- Cummings, C. E. (2008). Clock Domain Crossing (CDC) Design & Verification Techniques Using SystemVerilog. SNUG Boston paper. Two-FF synchroniser MTBF derivation and async-FIFO conventions.
- IEEE Std 1364-2005. IEEE Standard for Verilog Hardware Description Language. (Subsumed by 1800.)
- IEEE Std 1800-2023. IEEE Standard for SystemVerilog — Unified Hardware Design, Specification, and Verification Language.
- IEEE Std 1076-2019. IEEE Standard VHDL Language Reference Manual.
- IEEE Std 1149.1-2013. IEEE Standard for Test Access Port and Boundary-Scan Architecture (JTAG).
- IEEE Std 1801-2018. IEEE Standard for Design and Verification of Low-Power, Energy-Aware Electronic Systems (UPF).
- Synopsys (2018). Application Note: Synthesis Coding Style. Industry-standard RTL coding guidelines.
- AMD/Xilinx user guides: UG901 Synthesis, UG904 Implementation, UG906 Design Analysis and Closure Techniques, UG909 Partial Reconfiguration / DFX, UG949 Methodology Guide, UG974 UltraScale Architecture Libraries.
- Intel: Quartus Prime Pro Handbook (vol. 1–3), Stratix 10 / Agilex Device Datasheets, oneAPI for FPGA programming guide.
- Lattice: TN-series and AN-series for iCE40 / ECP5 / Nexus device design.
- Microchip: PolarFire UG0680, RTG4 UG0150, Libero SoC Design Flow User Guide.
- DO-254 / ED-80. Design Assurance Guidance for Airborne Electronic Hardware. RTCA / EUROCAE. The civil-avionics hardware-development standard.
- IEC 61508-2:2010 + IEC 61508-3:2010. Functional safety of electrical/electronic/programmable electronic safety-related systems.
- ISO 26262-5:2018 + ISO 26262-11:2018. Road vehicles — Functional safety — Hardware level / Semiconductors.
- The Yosys Open Synthesis Suite (yosyshq.net) and nextpnr documentation. Open-source equivalents of the commercial flows; production-quality at iCE40, ECP5, Nexus, GOWIN, GateMate.
- Project IceStorm, Project Trellis, Project Oxide, Project Apicula. Bitstream-documentation projects that enable open-source toolchains.