Reliability Engineering — Engineering Reference
1. At a glance
Reliability engineering is the quantitative discipline built around one quantity: the probability that a system performs its intended function under stated conditions for a stated period of time. That sentence is the IEEE 90 / IEC 60050-191 definition and it carries more weight than it looks: every word is load-bearing. Probability: reliability is statistical, not deterministic. Intended function: failure is defined against a specification, not a vague notion of “working.” Stated conditions: temperature, vibration, humidity, duty cycle, electrical stress, all of it. Stated period: a reliability figure means nothing without a mission time, even though the industry routinely quotes a bare MTBF.
The discipline spans six sub-fields that share probabilistic language but live in different parts of the V-cycle:
- Prediction — bottom-up estimates of field failure rate from datasheet stress + handbook coefficients (MIL-HDBK-217F, Telcordia SR-332, IEC 62380, RIAC 217Plus). Useful early; calibrated against returns later.
- Allocation — top-down apportionment of a system reliability goal to subsystems via importance + cost weighting. Drives supplier specs and design-for-reliability budgets.
- Testing — HALT, HASS, ALT, ESS, burn-in, reliability demonstration, reliability growth (Duane / Crow-AMSAA).
- Maintenance + RAMS — RCM, CBM, PdM, MTTR optimisation, spares queueing, availability targets.
- Failure analysis — FMEA / FMECA bottom-up, FTA top-down, RBD architectural, event trees forward, Markov for repairable + standby.
- Field reliability + warranty — censored field-return data, Weibull mixture fits, cohort + Pareto analysis, recall + safety-defect investigation.
Where it sits in the design stack: requirements + concept → architecture (RBD, allocation) → detail design (DfR, FMEA, prediction) → verification (HALT, ALT, RDT) → production (HASS, ESS, burn-in) → service (CBM, PdM, warranty analysis) → root-cause (failure analysis, FRACAS) → redesign. In a safety-critical industry (aerospace, automotive, medical, nuclear, rail) the full loop is mandated by standards (IEC 61508, ISO 26262, ARP4761, IEC 62304, EN 50128/50129, DO-178C/DO-254). In consumer goods the same toolkit drives warranty-cost minimisation.
Reliability sits adjacent to but is not the same as quality (conformance to spec at time of manufacture — Six Sigma, SPC, Cpk) or safety (acceptable risk of harm — HAZOP, LOPA, SIL/ASIL). High quality is necessary but not sufficient for reliability; high reliability is necessary but not sufficient for safety. The three disciplines share statistical machinery and standards bodies but answer different questions.
2. Why it matters
Hardware-failure cost is large and measurable. The economy-wide bill (warranty payments, recalls, downtime, product-liability awards, and consequential loss) sits in the 5–15 % of GDP band depending on the methodology (BCG 2018 industrial-IoT study; Warranty Week 2024 annual). For 2024, the US automotive industry alone booked about $20 B in direct costs. Takata’s airbag-inflator recall, root-caused to ammonium-nitrate propellant degradation under humidity-and-temperature cycling (a textbook reliability-physics problem), bankrupted the company and covered ~100 million airbag inflators worldwide.
Aerospace sets the high-water mark on quantitative reliability targets. FAR Part 25 / EASA CS-25 large-transport-category aircraft require a target of ≤ 1 catastrophic failure per 10⁹ flight hours for any system whose failure is catastrophic; that corresponds to DAL A in DO-178C / DO-254 and is derived in AC 25.1309-1B. ISO 26262 ASIL D for road-vehicle functional safety sits at < 10⁻⁸ /h for the residual random-hardware-failure budget (the PMHF metric). Medical Class C software (IEC 62304) imposes equivalent rigor on software failure. These numbers are operating-point requirements that flow down through architecture, allocation, prediction, FMEA, FTA, redundancy design, and verification testing.
Outside safety-critical, reliability is a competitive moat. Caterpillar, Toyota, and Rolls-Royce build entire product strategies around demonstrated field reliability; the data is collected, analysed, and reinvested in design via FRACAS (Failure Reporting, Analysis, and Corrective Action System) per MIL-STD-2155.
3. First principles
3.1 The four basic functions
For a non-negative random failure time T, four equivalent descriptions exist; an engineer must be fluent in all four.
F(t) = P(T ≤ t) cumulative failure distribution (CDF)
R(t) = P(T > t) = 1 − F(t) reliability / survival function
f(t) = dF/dt failure density (PDF)
λ(t) = f(t) / R(t) hazard rate / instantaneous failure rate
Equivalent integral form:
R(t) = exp(−∫₀ᵗ λ(τ) dτ)
Reliability theory is the algebra of these four functions under combinatorial structures (series, parallel, k-of-n, bridge, Markov chains).
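The link between the hazard rate and the other three functions can be checked numerically. The sketch below assumes a Weibull-shaped hazard with illustrative parameters (β = 2, η = 1000 h, chosen only for the demonstration, not taken from any data set) and compares the integral form of R(t) with the closed form.

```python
import math

BETA, ETA = 2.0, 1000.0  # illustrative values only (assumption)

def hazard(t):
    """lambda(t) for a Weibull life model, used here as an example hazard."""
    return (BETA / ETA) * (t / ETA) ** (BETA - 1)

def reliability_numeric(t, steps=10_000):
    """R(t) = exp(-integral of lambda from 0 to t), trapezoidal rule."""
    h = t / steps
    area = 0.5 * (hazard(0.0) + hazard(t)) * h
    area += sum(hazard(i * h) for i in range(1, steps)) * h
    return math.exp(-area)

def reliability_exact(t):
    """Closed-form Weibull survival function for comparison."""
    return math.exp(-((t / ETA) ** BETA))

t = 500.0
R = reliability_exact(t)
F = 1.0 - R          # cumulative failure probability
f = hazard(t) * R    # density recovered from lambda(t) = f(t) / R(t)
print(f"R numeric = {reliability_numeric(t):.6f}, R exact = {R:.6f}")
print(f"F = {F:.6f}, f = {f:.3e} per hour")
```

Any one of the four functions determines the other three; the numerical route matters when λ(t) comes from a table or a fitted curve with no closed-form integral.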
3.2 The bathtub curve
Empirical λ(t) over the life of a large fielded population shows three regimes (Klutke et al. 2003 review; originally documented in Davis 1952 telephone-relay data):
| Phase | λ(t) | Drivers | Reliability remedy |
|---|---|---|---|
| Infant mortality | decreasing | Manufacturing escapes, assembly defects, weak components | Burn-in, ESS, HASS, supplier qual |
| Useful life | ~constant | Random overstress, lightning, mishandling, true random | Redundancy, derating |
| Wear-out | increasing | Fatigue, corrosion, electromigration, bearing wear, capacitor dry-out | PM/CBM, end-of-life replacement, design margin |
The constant-λ assumption that underpins MIL-HDBK-217-style prediction is only valid in the useful-life phase. Burn-in moves the population past infant mortality before shipping; preventive maintenance retires units before wear-out kicks in. The Weibull shape parameter β maps directly to the three regimes: β < 1 ↔ infant, β = 1 ↔ constant, β > 1 ↔ wear-out (see §4).
3.3 MTBF, MTTF, MTTR, availability
- MTTF (Mean Time To Failure) — non-repairable. MTTF = ∫₀^∞ R(t) dt. For exponential, MTTF = 1/λ.
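- MTBF (Mean Time Between Failures): repairable. Measured failure-to-failure, MTBF = MTTF + MTTR; many texts instead use MTBF for the mean up-time alone, which is the convention behind A = MTBF / (MTBF + MTTR). The two readings are numerically interchangeable when MTTR ≪ MTTF.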
- MTTR (Mean Time To Repair) — detect + isolate + access + replace + verify. Quantified separately; usually lognormal.
- Availability — A = MTBF / (MTBF + MTTR), steady-state. Inherent (design-only), operational (incl. logistics delay), achieved (incl. PM downtime) are three distinct numbers.
Common pitfall: MTBF is not a lifetime. A unit with MTBF = 1 000 000 h does not last 114 years; under constant λ, R(MTBF) = e⁻¹ ≈ 0.37 — only 37 % of units survive to MTBF. The number is a rate parameter, not a service life.
FIT (Failures In Time) = failures per 10⁹ component-hours. 1 FIT = 1 failure per billion hours = λ in inverse hours × 10⁹. Used universally for semiconductors. A 100-FIT op-amp running continuously (24 × 365.25 = 8766 h/yr) in 1000 units gives 100 × 10⁻⁹ × 8766 × 1000 ≈ 0.88 failures/yr: small but not zero.
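The FIT and MTBF arithmetic is easy to get wrong by a factor of a thousand, so a few helper functions are worth keeping around. A minimal sketch, assuming constant λ and continuous operation; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # 8766 h of continuous operation

def fit_to_rate(fit):
    """Convert FIT (failures per 1e9 device-hours) to lambda in 1/h."""
    return fit * 1e-9

def expected_failures_per_year(fit, units):
    """Expected fleet failures per year under constant lambda."""
    return fit_to_rate(fit) * HOURS_PER_YEAR * units

def survival_fraction(t_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF) for the exponential model."""
    return math.exp(-t_hours / mtbf_hours)

print(expected_failures_per_year(100, 1000))              # 100-FIT part, 1000 units
print(survival_fraction(1_000_000, 1_000_000))            # fraction alive at t = MTBF
print(survival_fraction(10 * HOURS_PER_YEAR, 1_000_000))  # fraction alive after 10 years
```

The last line is the useful one in practice: a million-hour MTBF part still loses several percent of the population over a ten-year service life.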
4. Probability distributions
4.1 Distribution comparison
| Distribution | Parameters | Shape / behaviour | Typical use | Notes |
|---|---|---|---|---|
| Exponential | λ | — | Useful-life electronics, memoryless events | Special case of Weibull (β = 1) and Gamma (k = 1) |
| Weibull (2-param) | β shape, η scale | β < 1 infant, β = 1 const, β > 1 wear-out | Bearings, fatigue, mechanical | Most flexible; default for field data |
| Weibull (3-param) | β, η, γ shift | + minimum life γ | Guaranteed-life parts | γ is the “no-failure before” floor |
| Lognormal | μ, σ on log(t) | Right-skewed | Maintenance time, fatigue-crack growth, semiconductor wear-out | Multiplicative damage models |
| Normal | μ, σ | Symmetric | Mechanical strength, dimensional, “stress-strength” interference | Truncated at 0 |
| Gamma | k shape, θ scale | Sum of k exponentials | Time-to-k-th-event, spares demand | k = 1 ⇒ exponential |
| Birnbaum–Saunders | α, β | Fatigue physics-based | Crack growth (Miner-rule failure time) | “Fatigue life distribution” — see fatigue-analysis |
| Mixed Weibull | β₁,η₁,p₁ + β₂,η₂,(1−p₁) | Multimodal | Multiple failure modes (infant + wear-out) | Common in warranty data |
| Extreme value (Gumbel) | μ, σ | — | Min-of-n weakest link | Brittle ceramics, “weakest link” |
4.2 Weibull in depth
Waloddi Weibull (1951, J Appl Mech, “A statistical distribution function of wide applicability”) proposed the form:
R(t) = exp(−(t/η)^β)
F(t) = 1 − R(t)
f(t) = (β/η)·(t/η)^(β−1)·exp(−(t/η)^β)
λ(t) = (β/η)·(t/η)^(β−1)
η is the characteristic life: the t at which R = e⁻¹ ≈ 0.368 (63.2 % failed), whatever the value of β. β is the shape parameter: the slope of the line on a Weibull probability plot, with the hazard rate varying as t^(β−1). The original paper used cotton-fibre and steel-strength data; the distribution now dominates mechanical reliability because so many real failure mechanisms (bearings per bearings L_10, fatigue per fatigue-analysis Wöhler scatter, weld-joint life) fit it well.
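The four Weibull expressions translate directly into code. The sketch below also inverts F(t) to get a percentile (B-life) and confirms that 63.2 % of units have failed at t = η for any β; the parameter values are illustrative assumptions.

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """lambda(t) = (beta/eta) * (t/eta)^(beta-1), valid for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def weibull_b_life(p, beta, eta):
    """Time by which a fraction p of the population has failed."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

eta = 1000.0  # illustrative characteristic life, hours
for beta in (0.5, 1.0, 3.0):
    failed_at_eta = 1.0 - weibull_reliability(eta, beta, eta)
    b10 = weibull_b_life(0.10, beta, eta)
    print(f"beta={beta}: F(eta)={failed_at_eta:.3f}, B10={b10:.1f} h")
```

The B10 column shows how strongly early-life percentiles depend on β even when η is held fixed.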
4.3 Parameter estimation
Three competing methods:
- Maximum Likelihood Estimation (MLE): best statistical efficiency for moderate-to-large samples and handles censored data natively, but the β estimate is biased high for small samples and the optimisation can be ill-conditioned. Available in Reliasoft Weibull++ and Minitab.
- Median Rank Regression (MRR / RRX, RRY) — fit the linearised Weibull on a probability plot using Bernard’s approximation F̂ᵢ = (i − 0.3)/(n + 0.4) for the i-th order statistic of n units. Standard for small-sample mechanical data; robust and visualisable.
- Probability plotting (visual) — last resort when n is tiny; eyeball fit on Weibull paper. Still useful as a sanity check that the data is in fact Weibull (look for curvature).
For censored data (units still running at end of test, removed for other reasons, or failed by a mode not under study): only MLE and adjusted-rank MRR handle it correctly. Type-I censoring = stop at time T; Type-II = stop at r-th failure; multiply censored = arbitrary withdrawals. Always declare the censoring scheme in the analysis.
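To make the censoring point concrete, here is a minimal sketch of a right-censored two-parameter Weibull MLE using only the standard library. It solves the profile-likelihood equation for β by bisection and then computes η in closed form. The function name, the bisection bounds, and the test data are assumptions for illustration; a production analysis would use an established package and report confidence bounds.

```python
import math

def weibull_mle(times, failed, tol=1e-10):
    """Right-censored Weibull MLE.
    times[i]: failure or suspension time; failed[i]: True if unit i failed."""
    r = sum(failed)
    if r == 0:
        raise ValueError("at least one failure is required")
    t_max = max(times)
    scaled = [t / t_max for t in times]  # avoids overflow in t**beta
    mean_log_fail = sum(math.log(s) for s, d in zip(scaled, failed) if d) / r

    def score(beta):
        # Profile-likelihood equation in beta; its root is the MLE.
        s0 = sum(s ** beta for s in scaled)
        s1 = sum(s ** beta * math.log(s) for s in scaled)
        return s1 / s0 - 1.0 / beta - mean_log_fail

    lo, hi = 0.01, 100.0  # assumed search bracket for beta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    eta = t_max * (sum(s ** beta for s in scaled) / r) ** (1.0 / beta)
    return beta, eta

# Hypothetical test: 5 failures, 5 survivors suspended at 1200 h (Type-I censoring).
times = [410, 620, 790, 1020, 1180, 1200, 1200, 1200, 1200, 1200]
failed = [True] * 5 + [False] * 5
beta_hat, eta_hat = weibull_mle(times, failed)
print(f"beta = {beta_hat:.2f}, eta = {eta_hat:.0f} h")
```

Suspended units enter the sums over all times but not the failure count r, which is exactly how their survival information reaches the estimate.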
5. Reliability prediction
5.1 Handbook methods — origin and status
- MIL-HDBK-217F Notice 2 (1995): the original. Parts-count (early design) and parts-stress (detailed design) methods for electronic components. Each part's failure rate is a base rate λ_b modified by π factors for temperature, electrical stress, environment, quality, package. Not revised since 1995 and never replaced by DoD; still cited in legacy contracts. Known to be 2–10× pessimistic for modern silicon, because its base failure rates predate sub-micron CMOS, and its environment factor (π_E) for spacecraft is widely ignored.
- Telcordia SR-332 Issue 4 (2016) — telecom-industry replacement. Allows field data feedback (Bayesian update of generic λ with site experience). Better calibrated for modern parts than 217F.
- IEC 62380:2004 — European equivalent, similar structure.
- RIAC HDBK-217 Plus (Quanterion 2017) — modern field-data update of 217F with failure-mechanism distinctions (operating vs non-operating, T-cycle vs dwell). Maintained.
- PRISM (now incorporated into 217Plus) — process-grade-based extension.
- ANSI/VITA 51.1, 51.2, 51.3 — pre-adopted corrections to 217F (parameter tuning, additional factors, modern parts).
- FIDES Guide 2022 (French aerospace consortium) — physics-of-failure-influenced, mission-profile-driven, defect-aware.
5.2 Physics-of-failure (PoF) models
For modern microelectronics, JEDEC JEP122H (“Failure Mechanisms and Models for Semiconductor Devices”) and JESD85 (“Methods for Calculating Failure Rates in Units of FITs”) define the mechanism-by-mechanism approach:
| Mechanism | Model | Stress | Typical Ea |
|---|---|---|---|
| Oxide TDDB | E-model, 1/E-model | E-field, T | 0.6–0.9 eV |
| Electromigration | Black (1969): t₅₀ = A·J⁻ⁿ·exp(Ea/kT) | J, T | 0.7–1.1 eV, n = 1–2 |
| Hot carrier | I_sub power law | V_DS, I_sub, T | negative Ea |
| NBTI | Power-law in V_GS, exp in T | V_GS, T | 0.1–0.3 eV |
| Stress migration | Power-law in (T_proc − T) | ΔT | 0.5–0.9 eV |
| Solder thermal fatigue | Coffin–Manson (1954), Norris–Landzberg (1969) | ΔT, dwell | n = 2–8 |
| Corrosion | Peck humidity model | RH, T | 0.7–0.9 eV, RH^−3 |
| Mechanical fatigue | Basquin / Coffin–Manson | Δε | see fatigue-analysis |
PoF supplements handbook prediction: the handbook gives “this resistor sees 10 FIT”, PoF gives “this BGA solder joint will see 1 % failure at 4500 power-cycles at ΔT = 80 °C with 30 min dwell”. PoF is essential for new technologies (sub-7 nm, wide-bandgap power, advanced packaging) where handbook data does not exist.
5.3 Software tools
- Reliasoft Lambda Predict (Hottinger Bruel & Kjaer), Relyence Reliability Prediction, Item Toolkit ITEM-217, Isograph Hawk: all support 217F + 217Plus + Telcordia + IEC + FIDES.
6. System-level methods
6.1 Reliability Block Diagram (RBD)
Architecture-level representation: blocks = components, edges = success paths from source to sink. Math:
- Series: R_sys = ∏ Rᵢ. All must work.
- Parallel (active redundancy): R_sys = 1 − ∏ (1 − Rᵢ). Any one suffices.
- k-out-of-n: binomial sum ∑ⱼ₌ₖⁿ C(n,j) Rʲ (1−R)^(n−j). Voted redundancy (e.g. triple-modular redundancy TMR, k = 2 of 3).
- Bridge network: not series-parallel; use cut-set or tie-set algorithms.
- Standby (cold): must include switchover reliability + standby failure rate during dormant period.
Standard: MIL-STD-756B (cancelled 1995 but still referenced); IEC 61078:2016 (“Reliability block diagrams”). Tools: Reliasoft BlockSim, Isograph Reliability Workbench, ITEM Toolkit RBD, OpenFTA, OpenRBD.
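The series, parallel, and k-out-of-n formulas above fit in a few lines. A minimal sketch assuming independent blocks and, for k-out-of-n, identical blocks; the function names are illustrative.

```python
from math import comb, prod

def series(reliabilities):
    """All blocks must work."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Active redundancy: the system fails only if every block fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

def k_of_n(k, n, r):
    """At least k of n identical, independent blocks must work."""
    return sum(comb(n, j) * r**j * (1.0 - r) ** (n - j) for j in range(k, n + 1))

r = 0.9
print(f"single block      : {r:.4f}")
print(f"1-of-2 (parallel) : {parallel([r, r]):.4f}")
print(f"2-of-3 (TMR)      : {k_of_n(2, 3, r):.4f}")
print(f"3-of-3 (series)   : {series([r, r, r]):.4f}")
```

Note that TMR is less reliable than a simple parallel pair at the same block reliability; what the voter buys is tolerance of wrong outputs, not just of silent failures.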
6.2 Fault Tree Analysis (FTA)
Top-down deductive analysis. Top event = system failure; expand downward through AND / OR / k-of-n gates to basic events (component failures, human errors, external events). Mathematics is Boolean algebra of failure events.
- Minimal cut sets (MCS) — smallest combinations of basic events whose simultaneous occurrence causes the top event. Order-1 cut sets are single-point failures (red flags). Importance measures (Birnbaum, Fussell-Vesely, Risk Achievement Worth, Risk Reduction Worth) rank components.
- Standards: IEC 61025:2006, NASA Fault Tree Handbook (NASA/SP-2002-7106), NUREG-0492 (nuclear).
- Cross-discipline use: ARP4761A:2023 is the canonical avionics safety-assessment standard, mandates FHA → PSSA → SSA flow using FTA + FMEA + CCA (common-cause analysis).
Tools: Isograph FaultTree+, Reliasoft BlockSim FTA module, ITEM ToolKit FTA, FaultCAT (Boeing internal), SAPHIRE (INL nuclear), OpenFTA, RAM Commander.
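Once the minimal cut sets are known, the top-event probability and the Birnbaum importance follow mechanically. The sketch below assumes independent basic events and a small hypothetical tree, TOP = A OR (B AND C) OR D, with made-up probabilities; inclusion-exclusion is exact but only practical for a handful of cut sets, which is why tools use bounds or BDDs on large trees.

```python
from itertools import combinations
from math import prod

def top_event_probability(cut_sets, q):
    """Exact P(top) by inclusion-exclusion over minimal cut sets."""
    total = 0.0
    for size in range(1, len(cut_sets) + 1):
        sign = 1.0 if size % 2 else -1.0
        for combo in combinations(cut_sets, size):
            events = frozenset().union(*combo)  # union removes shared events
            total += sign * prod(q[e] for e in events)
    return total

def rare_event_bound(cut_sets, q):
    """Upper bound: sum of cut-set probabilities."""
    return sum(prod(q[e] for e in cs) for cs in cut_sets)

def birnbaum(cut_sets, q, event):
    """P(top | event failed) - P(top | event working)."""
    failed = {**q, event: 1.0}
    working = {**q, event: 0.0}
    return top_event_probability(cut_sets, failed) - top_event_probability(cut_sets, working)

# Hypothetical tree and basic-event probabilities (assumptions for illustration).
cut_sets = [frozenset({"A"}), frozenset({"B", "C"}), frozenset({"D"})]
q = {"A": 1e-4, "B": 5e-4, "C": 5e-4, "D": 2e-4}

p_top = top_event_probability(cut_sets, q)
print(f"P(top) exact = {p_top:.6e}, rare-event bound = {rare_event_bound(cut_sets, q):.6e}")
for e in sorted(q):
    print(f"Birnbaum importance of {e}: {birnbaum(cut_sets, q, e):.3e}")
```

The order-1 cut sets A and D show Birnbaum importance near 1, while B and C sit near the partner's failure probability, which is the numerical form of the single-point-failure red flag.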
6.3 Event Tree Analysis (ETA)
Forward inductive from an initiating event through success/failure branches of safety functions to a set of end states. Used in PRA (Probabilistic Risk Assessment) for nuclear and chemical plants per IEC 62502. Complementary to FTA: ETA structures the consequence side, FTA structures the cause side; both feed PRA.
6.4 FMEA / FMECA
Failure Mode and Effects Analysis — bottom-up tabular discipline. For each component, list every failure mode (open, short, drift, leak, jam), its local + system effects, the cause, detection method, and a risk index.
- Classical RPN (Risk Priority Number) = Severity × Occurrence × Detection, each on 1–10 scale; RPN ∈ [1, 1000]. Used to prioritise corrective action.
- Modern AIAG-VDA FMEA Handbook 1st ed 2019 replaced RPN with Action Priority (AP) (High / Medium / Low) to discourage RPN-threshold gaming and to keep severity-10 single-point items always at the top.
- FMECA = FMEA + Criticality (probability + severity) per MIL-STD-1629A. Defence and aerospace flavour.
- DFMEA (Design) vs PFMEA (Process) vs System-level FMEA. IATF 16949 requires both DFMEA and PFMEA in automotive.
Standards: IEC 60812:2018, MIL-STD-1629A (cancelled 1998 but still referenced), AIAG-VDA 2019, SAE J1739, ARP5580 (non-auto).
Tools: Reliasoft Xfmea, APIS IQ-FMEA, Plato AG SCIO, Siemens Teamcenter FMEA, Ansys medini analyze (ISO 26262 + ARP4761 + IEC 61508).
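The weakness of a pure RPN ranking shows up in a three-row example. The rows and ratings below are hypothetical, and the severity-first sort is only an illustration of the motivation for moving away from RPN thresholds; it is not the AIAG-VDA Action Priority table, which is a lookup defined in the handbook.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int    # 1-10
    occurrence: int  # 1-10
    detection: int   # 1-10

    @property
    def rpn(self):
        """Classical Risk Priority Number, range 1..1000."""
        return self.severity * self.occurrence * self.detection

# Hypothetical worksheet rows (assumptions for illustration).
modes = [
    FailureMode("connector", "open circuit", 7, 5, 4),
    FailureMode("brake valve", "stuck closed", 10, 2, 3),
    FailureMode("fan", "bearing seizure", 5, 6, 6),
]

by_rpn = sorted(modes, key=lambda m: m.rpn, reverse=True)
by_severity_first = sorted(modes, key=lambda m: (m.severity, m.rpn), reverse=True)

print("RPN order:           ", [(m.item, m.rpn) for m in by_rpn])
print("Severity-first order:", [(m.item, m.severity) for m in by_severity_first])
```

Under RPN alone the severity-10 item lands last because it is rare and detectable, which is the gaming risk the handbook change was meant to remove.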
6.5 Markov, semi-Markov, Monte Carlo
For repairable systems, standby with imperfect switching, and time-dependent failure rates, simple RBD math breaks down. Markov state-transition diagrams (states = combinations of working/failed components, transitions = λ and μ rates) give analytical availability and unreliability. Semi-Markov handles non-exponential sojourn times. Monte Carlo simulation is the workhorse when analytic methods fail (mixture distributions, complex maintenance policies, opportunistic replacement). Tools: Reliasoft BlockSim (discrete-event sim), GoldSim, AnyLogic.
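A small Monte Carlo run shows the approach on a case where the answer is known. The sketch simulates a single repairable unit with exponential up-times and lognormal repair times (the lognormal spread σ = 0.8, the seed, and the horizon are assumptions) and compares the simulated availability with the steady-state formula, which depends only on the two means.

```python
import math
import random

def simulate_availability(mean_up, mean_repair, horizon, sigma=0.8, seed=1):
    """Fraction of the horizon spent in the up state."""
    rng = random.Random(seed)
    mu = math.log(mean_repair) - 0.5 * sigma**2  # lognormal with mean = mean_repair
    t = up = 0.0
    while t < horizon:
        ttf = rng.expovariate(1.0 / mean_up)     # time to next failure
        up += min(ttf, horizon - t)              # clip the last up period
        t += ttf
        if t >= horizon:
            break
        t += rng.lognormvariate(mu, sigma)       # repair duration
    return up / horizon

mean_up, mean_repair = 1000.0, 10.0              # hours, illustrative
simulated = simulate_availability(mean_up, mean_repair, horizon=2_000_000.0)
analytic = mean_up / (mean_up + mean_repair)
print(f"simulated A = {simulated:.5f}, analytic A = {analytic:.5f}")
```

The same loop extends to cases with no closed form, such as spares stock-outs or opportunistic maintenance, by adding state and rules inside the loop.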
6.6 Reliability allocation
Top-down apportionment of a system reliability goal R_sys* to subsystems R_i such that ∏ R_i ≥ R_sys* (series case). Methods:
- Equal apportionment — naïve baseline.
- AGREE (Advisory Group on Reliability of Electronic Equipment 1957) — weighted by complexity (parts count) × importance.
- ARINC — weighted by predicted λ from prior systems.
- Feasibility of Objectives (FoO) — weighted by state-of-art difficulty score.
- Karmiol / Bracha — cost-weighted optimisation.
Allocation drives reliability budget flow-down to subsystem teams and supplier requirements (e.g. “this connector must demonstrate ≤ 50 FIT”).
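The two simplest allocation rules can be sketched as follows for a series system. The goal values and the predicted failure rates are illustrative assumptions; the ARINC-style rule here is the basic proportional form, with weights taken from predicted λ.

```python
import math

def equal_apportionment(r_sys_goal, n):
    """Each of n series subsystems gets the n-th root of the system goal."""
    return [r_sys_goal ** (1.0 / n)] * n

def arinc_allocation(lambda_sys_goal, predicted_lambdas):
    """Split a system failure-rate goal in proportion to predicted rates."""
    total = sum(predicted_lambdas)
    return [lambda_sys_goal * lam / total for lam in predicted_lambdas]

# Illustrative goals: R_sys* = 0.999 over the mission, or 250 FIT at system level.
equal = equal_apportionment(0.999, 4)
print("equal apportionment R_i:", [f"{r:.6f}" for r in equal])

predicted_fit = [100.0, 500.0, 200.0]  # hypothetical predictions per subsystem
allocated_fit = arinc_allocation(250.0, predicted_fit)
print("ARINC-style allocation (FIT):", allocated_fit)
```

Proportional allocation asks every subsystem for the same relative improvement, which is why the difficulty-weighted and cost-weighted methods exist.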
7. Worked examples
7.1 Example A — Weibull fit + B₁₀ life of a deep-groove ball bearing
Test data: 10 bearings (6205-2RS, see bearings) run to failure under constant radial load and lubrication; failure times in operating hours.
Sorted failure times (hr): 2340, 3100, 4280, 5600, 6900, 8100, 9400, 10800, 12600, 15800.
Apply Bernard’s approximation for median rank Fᵢ = (i − 0.3)/(n + 0.4), n = 10:
| i | tᵢ (hr) | Fᵢ |
|---|---|---|
| 1 | 2340 | 0.0673 |
| 2 | 3100 | 0.1635 |
| 5 | 6900 | 0.4519 |
| 10 | 15800 | 0.9327 |
Linearise: x = ln(t), y = ln(−ln(1 − F)). Linear regression of y on x gives slope β and intercept −β·ln(η). For this data set (all 10 points):
β ≈ 1.79 (wear-out, consistent with bearing fatigue spalling), η ≈ 9040 hr.
B₁₀ life (the bearing-industry standard, t at which 10 % have failed; see bearings):
B₁₀ = η · (−ln(1 − 0.10))^(1/β)
= 9040 · (0.10536)^(1/1.79)
= 9040 · (0.10536)^0.5587
= 9040 · 0.2844
≈ 2570 hr
Compare with the rolled-up ISO 281 L₁₀ rated life (computed from C/P in the catalogue): if the catalogue rated L₁₀ for this load is ~3000 hr, the Weibull-fit B₁₀ is about 14 % lower, a gap that a 10-unit sample cannot distinguish from the rating, so the test is consistent with it. A discrepancy of > 30 % points to either a load-spectrum error or a lubrication / contamination problem.
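The median-rank regression for this example can be reproduced with the standard library alone. The sketch uses the ten failure times listed above, Bernard's approximation, and an ordinary least-squares fit of y on x; it prints the fitted β, η, B₁₀ and the coefficient of determination as a linearity check.

```python
import math

times = sorted([2340, 3100, 4280, 5600, 6900, 8100, 9400, 10800, 12600, 15800])
n = len(times)

# Bernard's approximation to the median rank of the i-th ordered failure.
ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

xs = [math.log(t) for t in times]
ys = [math.log(-math.log(1.0 - F)) for F in ranks]

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

beta = sxy / sxx                      # slope of y on x
intercept = y_bar - beta * x_bar      # equals -beta * ln(eta)
eta = math.exp(-intercept / beta)
b10 = eta * (-math.log(1.0 - 0.10)) ** (1.0 / beta)
r_squared = sxy**2 / (sxx * syy)

print(f"beta = {beta:.2f}, eta = {eta:.0f} h, B10 = {b10:.0f} h, r^2 = {r_squared:.3f}")
```

Regressing x on y instead (the RRX convention) gives a slightly different β on the same data, so the regression direction should be stated alongside the result.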
7.2 Example B — Series-parallel RBD with mixed redundancy
System: control rack composed of A (PSU) in series with a redundant pair (B || C, both compute boards, active redundancy) in series with D (output module).
Failure rates (FIT, useful-life zone, constant-λ exponential):
- λ_A = 100 FIT = 100 × 10⁻⁹ /h = 1 × 10⁻⁷ /h
- λ_B = λ_C = 500 FIT = 5 × 10⁻⁷ /h
- λ_D = 200 FIT = 2 × 10⁻⁷ /h
Mission time t = 1000 h (continuous operation).
Component reliabilities R(t) = e^(−λt):
- R_A(1000) = e^(−1e−7·1000) = e^(−1e−4) = 0.99990000
- R_B = R_C = e^(−5e−4) = 0.99950012
- R_D = e^(−2e−4) = 0.99980002
Parallel B || C:
R_BC = 1 − (1 − R_B)(1 − R_C)
= 1 − (0.00049988)²
= 1 − 2.499e−7
= 0.99999975
Series system:
R_sys = R_A · R_BC · R_D
= 0.99990000 · 0.99999975 · 0.99980002
= 0.99969980
System failure probability over 1000 h: 1 − R_sys = 3.002 × 10⁻⁴ ≈ 300 ppm. System equivalent FIT (for short missions in useful-life regime) ≈ (1 − R_sys)/t × 10⁹ ≈ 300 FIT. The redundant pair contributes about 0.25 ppm of that total; A and D dominate. Allocation insight: spend reliability budget on A and D, not on doubling-up B/C further. This is a generic finding: redundancy without diversifying single-point-of-failure power and I/O is theatre.
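With reliabilities this close to 1, eight-digit hand arithmetic loses the interesting digits. A sketch of the same calculation that works in unreliabilities and uses `math.expm1` to keep precision for tiny exponents; the dictionary layout is just one convenient way to hold the block data.

```python
import math

FIT = 1e-9           # failures per hour per FIT
t = 1000.0           # mission time, hours

lam = {"A": 100 * FIT, "B": 500 * FIT, "C": 500 * FIT, "D": 200 * FIT}

# q = 1 - exp(-lambda * t), computed without cancellation error.
q = {name: -math.expm1(-rate * t) for name, rate in lam.items()}

q_pair = q["B"] * q["C"]                       # both boards of the pair fail
r_sys = (1.0 - q["A"]) * (1.0 - q_pair) * (1.0 - q["D"])
q_sys = 1.0 - r_sys
equivalent_fit = q_sys / t / FIT

print(f"R_sys = {r_sys:.8f}, 1 - R_sys = {q_sys:.4e}")
print(f"equivalent FIT = {equivalent_fit:.1f}")
for name in ("A", "D"):
    print(f"share of {name}: {q[name] / q_sys:.1%}")
print(f"share of B||C pair: {q_pair / q_sys:.3%}")
```

The share breakdown makes the allocation argument quantitative: the series elements carry essentially all of the system unreliability.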
7.3 Example C — Arrhenius accelerated life test design
Goal: predict 10-year field life of an integrated motor-drive at T_use = 55 °C using a 100-hour bench test.
Choose stress temperature T_test = 125 °C. Use Arrhenius (1889 / Eyring 1936 simplified):
AF = exp(Ea/k · (1/T_use − 1/T_test))
with k = 8.617 × 10⁻⁵ eV/K (Boltzmann in eV/K), Ea = 0.7 eV (typical aggregate for IC bond-wire + die-attach + electromigration; per JEP122H).
Convert temperatures: T_use = 328.15 K, T_test = 398.15 K.
1/T_use − 1/T_test = 1/328.15 − 1/398.15
= 3.0474e−3 − 2.5116e−3
= 5.358e−4 K⁻¹
Ea / k = 0.7 / 8.617e−5 = 8124 K
AF = exp(8124 · 5.358e−4) = exp(4.353) = 77.7
Equivalent field hours: 100 h × 77.7 = 7770 h ≈ 10.6 months at 55 °C.
If the design target is 10 years at 55 °C = 87,600 h, the 100-h bench test covers only ≈ 9 % of design life, which is not sufficient. Either extend the bench test to ≥ 1130 h at 125 °C, or raise T_test to 150 °C (AF ≈ 259, 100 h ≈ 25,900 field-h ≈ 3.0 yr), or run a larger sample size to demonstrate at 60 % confidence.
Sample-size sanity check using the χ² formula for a time-terminated (Type-I censored) test with zero failures:
t_total = χ²(2r+2, 1−CL) / (2 · λ_target)
For r = 0 failures at 60 % confidence demonstrating λ_target = 100 FIT = 10⁻⁷ /h:
χ²(2, 0.40) = 1.833
t_total = 1.833 / (2 · 10⁻⁷) = 9.17e6 device-hours
Spread over n = 50 units at 125 °C with AF = 77.7: 9.17e6 / (50 · 77.7) ≈ 2360 bench hours. That is the test that demonstrates the requirement. A 100-h test for n = 50 only demonstrates ~24× weaker reliability than the field target — useful for screening, useless as a demonstration.
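The whole test-design calculation fits in two functions. The sketch below assumes the same Ea, temperatures, and confidence level as the example; for zero failures the χ² quantile with 2 degrees of freedom has the closed form −2·ln(1 − CL), so no statistics library is needed. The general r > 0 case would need a χ² quantile function and is not covered here.

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_test_c, ea_ev):
    """Acceleration factor between use and test temperature (deg C inputs)."""
    t_use = t_use_c + 273.15
    t_test = t_test_c + 273.15
    return math.exp(ea_ev / K_B * (1.0 / t_use - 1.0 / t_test))

def zero_failure_device_hours(lambda_target, confidence):
    """Device-hours at use conditions needed to demonstrate lambda_target
    with zero failures: chi2(2 dof) / (2 * lambda) = -ln(1 - CL) / lambda."""
    return -math.log(1.0 - confidence) / lambda_target

af_125 = arrhenius_af(55.0, 125.0, 0.7)
af_150 = arrhenius_af(55.0, 150.0, 0.7)
device_hours = zero_failure_device_hours(100e-9, 0.60)
bench_hours = device_hours / (50 * af_125)   # 50 units on test at 125 deg C

print(f"AF(125 C) = {af_125:.1f}, AF(150 C) = {af_150:.1f}")
print(f"required device-hours = {device_hours:.3e}")
print(f"bench hours for n = 50 at 125 C = {bench_hours:.0f}")
```

Because AF is exponential in Ea, rerunning the script with Ea = 0.6 and 0.8 eV is a quick way to see how much of the conclusion rests on the activation-energy assumption.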
8. Testing methods
8.1 Accelerated-stress test taxonomy
| Test | Goal | Stress | Sample size | Output |
|---|---|---|---|---|
| HALT (Highly Accelerated Life Test) | Find design margins, weakest link | Step-stress T, vibration, T-cycle, V | 4–8 | Failure modes, margins (operational + destruct limits) |
| HASS (Highly Accelerated Stress Screen) | Production screening | Above-spec but below HALT destruct | 100 % | Pass/fail per unit |
| ESS (Environmental Stress Screening) | Production screening, milder than HASS | T-cycle + random vibration | 100 % | Pass/fail per unit, MIL-STD-2164 / IEC 61163-1 |
| ALT (Accelerated Life Test) | Quantitative life vs stress | Single stress, multiple levels | 20+ per cell | Acceleration factor, life distribution |
| Burn-in | Infant-mortality removal | Bias at elevated T for hours-days | 100 % (legacy) | Pass/fail; cost-benefit debated post-2000 |
| RDT (Reliability Demonstration Test) | Prove λ ≤ target | Use-condition | Sized via χ² / binomial | Demonstrated MTBF at confidence CL |
| Reliability Growth | Track λ improvement during dev | Use-condition + corrective action | Continuous | Duane / Crow-AMSAA growth slope α |
8.2 Acceleration models
| Stress | Model | Equation | Origin |
|---|---|---|---|
| Temperature | Arrhenius | AF = exp(Ea/k · (1/T₁ − 1/T₂)) | Arrhenius 1889 |
| ΔT cycle | Coffin–Manson | N_f = C · ΔT⁻ⁿ, n = 2–8 | Coffin 1954, Manson 1954 |
| ΔT cycle + dwell | Norris–Landzberg | N_f = C · f^m · ΔT⁻ⁿ · exp(Ea/kT_max) | Norris & Landzberg 1969 |
| Voltage / E-field | Power law / Eyring | t_f = A · V⁻ⁿ · exp(Ea/kT) | Eyring 1936 |
| Current density | Black | t₅₀ = A · J⁻ⁿ · exp(Ea/kT), n = 1–2 | Black 1969 |
| Humidity | Peck | t_f = A · RH⁻ⁿ · exp(Ea/kT) | Peck 1986 |
| Vibration | Basquin power law | N_f = C · S⁻ᵇ | Basquin 1910 |
Pitfall: multi-stress models (T + V + humidity) compound errors quickly. Always validate the Ea assumption on at least two stress points before extrapolating.
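Checking Ea against two stress points is a one-line calculation once median lives are available at two temperatures. The sketch below also includes Coffin–Manson and Peck acceleration factors written as ratios of the table's equations; the median lives, exponents, and stress levels in the demo are hypothetical numbers chosen for illustration.

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def activation_energy(life_1, temp_1_c, life_2, temp_2_c):
    """Solve life = A * exp(Ea / kT) for Ea from two (life, temperature) points."""
    inv_t1 = 1.0 / (temp_1_c + 273.15)
    inv_t2 = 1.0 / (temp_2_c + 273.15)
    return K_B * math.log(life_1 / life_2) / (inv_t1 - inv_t2)

def coffin_manson_af(delta_t_test, delta_t_use, n):
    """Ratio of cycles to failure, use over test, for N_f = C * dT^-n."""
    return (delta_t_test / delta_t_use) ** n

def peck_af(rh_test, rh_use, n, ea_ev, t_use_c, t_test_c):
    """Humidity-temperature acceleration for t_f = A * RH^-n * exp(Ea / kT)."""
    thermal = math.exp(ea_ev / K_B * (1.0 / (t_use_c + 273.15) - 1.0 / (t_test_c + 273.15)))
    return (rh_test / rh_use) ** n * thermal

# Hypothetical median lives from two ALT cells.
ea = activation_energy(4000.0, 125.0, 1200.0, 150.0)
print(f"Ea implied by the two cells = {ea:.2f} eV")
print(f"Coffin-Manson AF, dT 100 C vs 40 C, n = 2: {coffin_manson_af(100.0, 40.0, 2.0):.2f}")
print(f"Peck AF, 85 C / 85 %RH vs 40 C / 60 %RH: {peck_af(85.0, 60.0, 3.0, 0.8, 40.0, 85.0):.0f}")
```

If the Ea implied by the cells differs materially from the value assumed in the plan, the extrapolation to use conditions should be redone before any life claim is made.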
8.3 Reliability growth — Duane / Crow-AMSAA
Duane (1964) observed empirically that during a development program, cumulative MTBF tracked as ln(MTBF_cum) = ln(MTBF_initial) + α · ln(T) with growth slope α ∈ [0.3, 0.6] for well-managed programs. Crow (1974, at AMSAA, the Army Materiel Systems Analysis Activity) made it a formal NHPP (non-homogeneous Poisson process) with intensity λ(t) = λ_0 · β · t^(β−1), enabling MLE on the growth parameters; β < 1 indicates growth, and β = 1 − α links the two models. Standard: IEC 61164:2004.
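For a time-terminated test the Crow-AMSAA maximum-likelihood estimates have closed forms, sketched below. The cumulative failure times are hypothetical, and the variable `lam` plays the role of the scale parameter in the NHPP intensity.

```python
import math

def crow_amsaa_mle(failure_times, total_time):
    """MLE for intensity lam * beta * t^(beta - 1), time-terminated test."""
    n = len(failure_times)
    beta = n / sum(math.log(total_time / t) for t in failure_times)
    lam = n / total_time**beta
    return lam, beta

def instantaneous_mtbf(lam, beta, t):
    """Reciprocal of the failure intensity at cumulative test time t."""
    return 1.0 / (lam * beta * t ** (beta - 1.0))

# Hypothetical cumulative test hours at each failure; test stopped at 4000 h.
failures = [25, 80, 190, 400, 750, 1300, 2100, 3300]
T = 4000.0

lam, beta = crow_amsaa_mle(failures, T)
print(f"beta = {beta:.3f} (growth if < 1), Duane alpha = {1.0 - beta:.3f}")
print(f"cumulative MTBF at T    = {T / len(failures):.0f} h")
print(f"instantaneous MTBF at T = {instantaneous_mtbf(lam, beta, T):.0f} h")
```

The gap between cumulative and instantaneous MTBF is the point of the model: the cumulative figure is dragged down by early failures that corrective action has already addressed.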
9. Maintenance + RAMS
9.1 Maintenance strategies
| Strategy | When | Cost driver | Tool |
|---|---|---|---|
| Corrective (CM) | After failure | Downtime + collateral damage | Reactive |
| Preventive (PM, scheduled) | Time / cycle interval | Over-maintenance, waste of remaining life | RCM analysis |
| Condition-Based (CBM) | Threshold on measured indicator | Sensor + analytics infrastructure | Vibration analysis, oil debris, thermal |
| Predictive (PdM) | ML-predicted remaining-useful-life (RUL) | Model risk, training data | LSTM / survival models |
9.2 RCM — Reliability-Centred Maintenance
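Origin: commercial aviation. The MSG-1 maintenance-steering logic (1968) evolved into MSG-2 / MSG-3, and the Nowlan & Heap report (United Airlines, 1978, for the US Department of Defense) gave the method its name; Moubray (“RCM II”, 1997) generalised it to industry. Seven structured questions answered for each significant function: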
- What are the functions and performance standards in its current context?
- In what ways does it fail?
- What causes each failure?
- What happens when each failure occurs?
- In what way does each failure matter?
- What can be done to predict or prevent each failure?
- What if nothing can be done?
Standards: SAE JA1011:2009 (RCM evaluation criteria), SAE JA1012:2002 (RCM guide), IEC 60300-3-11:2009.
9.3 Availability + spares
Steady-state availability A = MTBF/(MTBF + MTTR). To achieve A = 0.999 (8.76 h downtime/yr) with MTTR = 4 h requires MTBF ≥ 3996 h ≈ 5.5 months. To achieve A = 0.99999 (“five nines”, 5.3 min/yr) requires MTBF ≥ 400 000 h at the same 4 h MTTR, or MTTR < 0.04 h (about 2.4 min) at the same 3996 h MTBF, or redundancy.
Spares optimisation: Poisson demand model with rate λ_demand = N_installed × λ_part; service level S(k) = P(demand ≤ k spares) drives stockholding cost vs stockout cost. Tools: SAP MRP, Oracle EAM, Servigistics.
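The Poisson spares calculation is short enough to sketch directly. The fleet size, part failure rate, and replenishment period below are hypothetical; the model assumes constant λ and no resupply within the period.

```python
import math

def poisson_cdf(k, mean):
    """P(demand <= k) for Poisson-distributed demand."""
    term = math.exp(-mean)
    total = term
    for j in range(1, k + 1):
        term *= mean / j
        total += term
    return total

def spares_for_service_level(mean_demand, target):
    """Smallest stock k such that P(demand <= k) >= target."""
    k = 0
    while poisson_cdf(k, mean_demand) < target:
        k += 1
    return k

# Hypothetical fleet: 200 installed units, 5000 FIT part, one year (8766 h) per resupply.
n_installed = 200
lambda_part = 5000e-9
period_h = 8766.0
mean_demand = n_installed * lambda_part * period_h

k = spares_for_service_level(mean_demand, 0.95)
print(f"expected demand = {mean_demand:.2f} per period")
print(f"spares for 95 % service level = {k} (achieved {poisson_cdf(k, mean_demand):.3f})")
```

Stock grows roughly with the mean plus a multiple of its square root, so pooling spares across sites cuts the total holding for the same service level.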
9.4 CMMS + EAM software
- CMMS (work-order, PM-schedule, spares): IBM Maximo, SAP PM (now S/4HANA Asset Management), Fiix (Rockwell), Maintainx, UpKeep, Limble, eMaint, Hippo CMMS.
- EAM (enterprise + financial integration): IBM Maximo Application Suite, SAP EAM, Infor EAM, Oracle eAM, Ivara EXP.
- APM (asset performance management, CBM + PdM): GE APM (Meridium), AVEVA APM, Bentley AssetWise, AspenTech Mtell.
10. Functional safety + safety-critical
10.1 Standards landscape
| Standard | Domain | Safety integrity level | Year |
|---|---|---|---|
| IEC 61508 | Generic E/E/PE safety-related systems | SIL 1–4 | 2010 (Ed 2) |
| IEC 61511 | Process industry safety-instrumented systems | SIL 1–4 | 2016 |
| ISO 26262 | Automotive E/E systems | ASIL A–D | 2018 (Ed 2) |
| IEC 62304 | Medical-device software | Class A / B / C | 2006 + Amd 1 2015 |
| ISO 14971 | Medical-device risk management | — | 2019 |
| DO-178C / DO-254 | Avionics SW / HW | DAL A–E | 2011 / 2000 |
| ARP4754A / ARP4761A | Civil aircraft system development / safety | DAL A–E | 2010 / 2023 |
| EN 50128 / 50129 / 50657 | Rail SW / signalling / on-board | SIL 0–4 | 2011–2017 |
| API RP 581 | Risk-Based Inspection (oil & gas) | — | 2016 |
| IEC 60601-1 | Medical electrical equipment basic safety | — | 2020 |
10.2 Integrity-level equivalence (approximate, demand mode)
| IEC 61508 SIL | ISO 26262 ASIL | DO-178C DAL | IEC 62304 Class | PFD (low demand) | PFH (high demand, /h) |
|---|---|---|---|---|---|
| 4 | — | A | C | 10⁻⁵ – 10⁻⁴ | 10⁻⁹ – 10⁻⁸ |
| 3 | D | B | C | 10⁻⁴ – 10⁻³ | 10⁻⁸ – 10⁻⁷ |
| 2 | C | C | B | 10⁻³ – 10⁻² | 10⁻⁷ – 10⁻⁶ |
| 1 | A / B | D | A | 10⁻² – 10⁻¹ | 10⁻⁶ – 10⁻⁵ |
The equivalence table is approximate — ASIL D is calibrated to road vehicles’ exposure-controllability rather than IEC 61508’s pure target probability, and a strict mapping is debated; see ISO 26262-10 Annex.
10.3 Failure categories
Modern functional-safety standards split failures into:
- Hardware random failure — Weibull / Exponential life of a component. Addressed by FIT + redundancy + diagnostics.
- Systematic failure — design + spec error, software bug, common-cause. Addressed by process rigour (V-model, formal methods, MISRA, code review, test coverage).
- Common-cause failure (CCF) — single root affects redundant channels. Addressed by diversity (heterogeneous redundancy), separation, β-factor model (β ≈ 0.01–0.10 typical).
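The effect of common-cause failure on a redundant pair can be shown with the simplest form of the β-factor model, in which a fraction β of each channel's failure probability is assumed to hit both channels at once. The channel unreliability and the β values below are illustrative, and this is an approximation, not the full IEC 61508-6 PFD formula.

```python
def redundant_pair_unreliability(q_channel, beta_ccf):
    """Simple beta-factor model for a 1-out-of-2 pair.
    Independent part: both channels fail separately.
    Common-cause part: one shared root fails both."""
    independent = ((1.0 - beta_ccf) * q_channel) ** 2
    common = beta_ccf * q_channel
    return independent + common

q_channel = 5e-4  # illustrative channel unreliability over the mission
for beta_ccf in (0.0, 0.01, 0.05, 0.10):
    q_pair = redundant_pair_unreliability(q_channel, beta_ccf)
    print(f"beta = {beta_ccf:.2f}: pair unreliability = {q_pair:.3e}")
```

Even β = 0.01 raises the pair's unreliability by more than an order of magnitude over the independent case, which is why diversity and separation matter more than adding a third identical channel.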
10.4 HAZOP + LOPA
- HAZOP (Hazard and Operability Study, IEC 61882:2016) — guideword-driven team review (NO, MORE, LESS, REVERSE, AS WELL AS, PART OF, OTHER THAN) of process node-by-node deviations. Origin: ICI 1963, formalised by Kletz 1974. See chemical-process-fundamentals.
- LOPA (Layer of Protection Analysis, IEC 61511 + CCPS 2001) — semi-quantitative SIL assignment via independent protection-layer accounting.
11. Field reliability + warranty analysis
11.1 Warranty data — what’s hard
Warranty data is interval-censored, multiply-right-censored (units still operating), left-truncated (units that failed before the warranty registered), and contaminated by non-failures (no-trouble-found, NTF rates of 20–40 % are typical in consumer electronics). Naive MTBF = total operating hours / observed failures is wrong by a factor of 2–10× in most cases.
Correct treatment: fit a Weibull or mixture-Weibull to age-at-failure data, using MLE with the censoring structure. Mixture Weibull is essential when both infant-mortality and wear-out failure modes coexist (very common — the bathtub curve is in the data, not just on the slide).
11.2 Cohort + Pareto analysis
- Cohort analysis — group units by production date (e.g. month built) and plot failure rate vs age. Reveals manufacturing-process shifts.
- Pareto chart — 80/20 of failure modes; combined with Ishikawa fishbone for cause categorisation (Man, Machine, Method, Material, Measurement, Environment).
- Kaplan–Meier estimator — non-parametric survival curve from censored data. Visualises whether a Weibull is a defensible fit.
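The Kaplan–Meier estimate is simple enough to compute by hand, which makes it a good cross-check on any parametric fit. A minimal sketch with a small hypothetical data set; ties between a failure and a suspension at the same age are handled with the usual convention that the suspended unit is still at risk at that age.

```python
def kaplan_meier(times, failed):
    """Return [(t, S(t))] at each distinct failure age.
    times[i]: age at failure or suspension; failed[i]: True if unit i failed."""
    data = sorted(zip(times, failed), key=lambda pair: pair[0])
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        group = [d for tt, d in data if tt == t]  # all units leaving at age t
        deaths = sum(group)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= len(group)
        i += len(group)
    return curve

# Hypothetical ages in months; False marks a unit still running (right-censored).
ages = [5, 8, 8, 12, 15]
status = [True, False, True, True, False]
for t, s in kaplan_meier(ages, status):
    print(f"t = {t:>2}: S(t) = {s:.2f}")
```

Plotting ln(−ln S(t)) against ln t from this output gives a censoring-aware Weibull plot; clear curvature there argues for a mixture or a different distribution.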
11.3 FRACAS
Failure Reporting, Analysis, and Corrective Action System — MIL-STD-2155 (cancelled but referenced); GEIA-STD-0009. The closed-loop process by which field failures generate root-cause findings that update design + manufacturing controls. Implementation tools: Reliasoft XFRACAS, Relyence FRACAS, IBM Engineering Workflow Management.
11.4 Industry-specific failure databases
- NPRD-2023 (Quanterion Non-electronic Parts Reliability Data) — mechanical and electromechanical part failure data.
- EPRD-2014 (Electronic Parts Reliability Data).
- OREDA (Offshore and Onshore REliability DAta, SINTEF) — oil & gas.
- PERD (CCPS Process Equipment Reliability Database).
- CNF / NEMS-FRACAS (NASA + commercial spaceflight).
- NHTSA (auto), CPSC (consumer), FDA MAUDE (medical devices) — public recall + adverse event.
11.5 Customer-perceived reliability
J.D. Power IQS (Initial Quality Study, first 90 days) and VDS (Vehicle Dependability Study, 3 years) drive consumer auto-brand reliability perception independent of engineering MTBF data. Consumer Reports annual reliability survey is the consumer-electronics equivalent. These data sets are noisy but serve as leading indicators of warranty cost two to five years out.
12. Tools / software
| Domain | Tools |
|---|---|
| Prediction | Reliasoft Lambda Predict, Relyence Reliability Prediction, Isograph Hawk, ITEM ToolKit ITEM-217, FIDES TestBench |
| Weibull / life data | Reliasoft Weibull++, Minitab Reliability + Survival, JMP Life Distribution, R packages survival, fitdistrplus, WeibullR, Python lifelines, reliability |
| FTA / RBD | Isograph FaultTree+ / Reliability Workbench, Reliasoft BlockSim, ITEM ToolKit RBD/FTA, FaultCAT (Boeing), SAPHIRE (INL), OpenFTA, RAM Commander, RiskSpectrum (Lloyd’s) |
| FMEA / FMECA | Reliasoft Xfmea / RCM++, APIS IQ-FMEA, Plato AG SCIO, Siemens Teamcenter Quality, Ansys medini analyze, IQ-RM |
| Monte Carlo / sim | Reliasoft BlockSim, GoldSim, AnyLogic, Reliasoft RGA (growth), @RISK |
| HALT/HASS chambers | Qualmark Typhoon series, Espec ARS series, Thermotron AST series, Weiss Technik |
| CMMS / EAM | IBM Maximo, SAP PM / S/4HANA Asset Mgmt, Fiix, Maintainx, Infor EAM, Oracle eAM |
| APM (PdM) | GE APM, AVEVA APM, Aspen Mtell, Bentley AssetWise, Augury, Uptake |
| Risk / RBI | Lloyd’s RBI Toolkit (per API 581), DNV-Synergi Plant, Bentley AssetWise APM |
| Free / OSS | OpenFTA, OpenRBD, R reliability, Python reliability, scilab reliability toolbox |
13. Cross-references
- fatigue-analysis — strain-life + S-N feeds Birnbaum–Saunders and lognormal life models
- fracture-mechanics — damage-tolerance and Paris-law feed crack-driven RUL prediction
- bearings — L₁₀ life = Weibull B₁₀ (this note’s worked Example A)
- gears-power-transmission — gear-tooth bending + pitting fatigue feed FMEA + L₁₀
- realtime-embedded — IEC 61508 + ISO 26262 software architecture
- microcontrollers — JEDEC JEP122H mechanisms, automotive AEC-Q100 grade
- system-identification — degradation modelling for PdM
- chemical-process-fundamentals — HAZOP + LOPA cross-link
- mems — JEDEC JEP122 stiction + fatigue mechanisms
- safety-standards — ISO 10218 (industrial robot safety), ISO 13849-1 (performance levels for safety-related control systems)
- planned six-sigma — DMAIC + DfSS companion in same batch
- planned lean-manufacturing — overlapping continuous-improvement framework
14. Citations
Textbooks (canonical)
- O’Connor, P. & Kleyner, A. Practical Reliability Engineering, 5th ed. Wiley, 2012. ISBN 978-0-470-97981-5. Industry standard.
- Ebeling, C. An Introduction to Reliability and Maintainability Engineering, 3rd ed. Waveland, 2019. ISBN 978-1-4786-3933-6.
- Birolini, A. Reliability Engineering: Theory and Practice, 8th ed. Springer, 2017. ISBN 978-3-662-54208-8.
- Modarres, M.; Kaminskiy, M.; Krivtsov, V. Reliability Engineering and Risk Analysis: A Practical Guide, 3rd ed. CRC, 2017.
- Meeker, W. Q.; Escobar, L. A.; Pascual, F. G. Statistical Methods for Reliability Data, 2nd ed. Wiley, 2022. ISBN 978-1-118-11545-9.
- Nelson, W. B. Accelerated Testing: Statistical Models, Test Plans, and Data Analyses. Wiley, 1990. Canonical ALT reference.
- Moubray, J. Reliability-Centred Maintenance II. Industrial Press, 1997.
- Smith, D. J. Reliability, Maintainability and Risk, 9th ed. Butterworth-Heinemann, 2017.
Foundational papers
- Weibull, W. (1951). “A statistical distribution function of wide applicability.” J Appl Mech 18, 293–297.
- Arrhenius, S. (1889). “Über die Reaktionsgeschwindigkeit bei der Inversion von Rohrzucker durch Säuren.” Z Phys Chem 4, 226–248.
- Coffin, L. F. (1954). “A study of the effects of cyclic thermal stresses on a ductile metal.” Trans ASME 76, 931–950.
- Manson, S. S. (1953). “Behavior of materials under conditions of thermal stress.” NACA TN 2933.
- Norris, K. C. & Landzberg, A. H. (1969). “Reliability of controlled collapse interconnections.” IBM J Res Dev 13, 266–271.
- Black, J. R. (1969). “Electromigration — a brief survey and some recent results.” IEEE Trans Electron Devices 16, 338–347.
- Eyring, H. (1936). “Viscosity, plasticity, and diffusion as examples of absolute reaction rates.” J Chem Phys 4, 283.
- Crow, L. H. (1974). “Reliability analysis for complex repairable systems.” US Army AMSAA Tech Report 138.
- Duane, J. T. (1964). “Learning curve approach to reliability monitoring.” IEEE Trans Aerospace 2, 563–566.
- Peck, D. S. (1986). “Comprehensive model for humidity testing correlation.” IRPS Proc., 44–50.
Standards
- IEC 61508:2010 — Functional safety of E/E/PE safety-related systems (Parts 1-7).
- IEC 60812:2018 — FMEA / FMECA procedure.
- IEC 61025:2006 — Fault tree analysis.
- IEC 61078:2016 — Reliability block diagrams.
- IEC 61164:2004 — Reliability growth.
- IEC 62308:2006 — Reliability assessment methods.
- IEC 62380:2004 — Reliability prediction (telecom).
- ISO 26262:2018 — Road-vehicle functional safety.
- ARP4761A:2023 — Civil aircraft safety assessment.
- ARP4754A:2010 — Civil aircraft system development.
- DO-178C:2011 / DO-254:2000 — Avionics SW/HW.
- AIAG-VDA FMEA Handbook 1st ed, 2019 — Automotive FMEA + AP.
- MIL-HDBK-217F Notice 2 (1995) — Reliability prediction (legacy).
- MIL-STD-1629A (1980; cancelled 1998) — FMECA.
- Telcordia SR-332 Issue 4 (2016) — Reliability prediction (telecom).
- JEDEC JEP122H (2016) — Failure mechanisms + models for semiconductors.
- JEDEC JESD85 (2001, R2008) — FIT-rate calculation.
- ANSI/VITA 51.1/51.2/51.3 (2008–2010) — 217F corrections.
- RIAC HDBK-217 Plus (2017, Quanterion) — 217F field-data update.
- SAE JA1011:2009 + JA1012:2002 — RCM.
Online resources
- Reliasoft online textbook + ReliaWiki: https://www.reliawiki.com
- Quanterion Solutions databases (NPRD, EPRD, 217Plus): https://www.quanterion.com
- Weibull.com (Reliasoft) — practitioner articles.
- NASA Fault Tree Handbook NASA/SP-2002-7106.
- IAEA / NUREG nuclear PRA reports (publicly available).