Reliability Engineering — Engineering Reference

1. At a glance

Reliability engineering is the quantitative discipline concerned with the probability that a system performs its intended function under stated conditions for a stated period of time. That sentence follows the IEEE 90 / IEC 60050-191 definition of reliability, and every word is load-bearing. Probability: reliability is statistical, not deterministic. Intended function: failure is defined against a specification, not a vague notion of “working.” Stated conditions: temperature, vibration, humidity, duty cycle, electrical stress, all of it. Stated period: a reliability figure means nothing without a mission time, even though the industry routinely quotes a bare MTBF.

The discipline spans six sub-fields that share probabilistic language but live in different parts of the V-cycle:

  1. Prediction — bottom-up estimates of field failure rate from datasheet stress + handbook coefficients (MIL-HDBK-217F, Telcordia SR-332, IEC 62380, RIAC 217Plus). Useful early; calibrated against returns later.
  2. Allocation — top-down apportionment of a system reliability goal to subsystems via importance + cost weighting. Drives supplier specs and design-for-reliability budgets.
  3. Testing — HALT, HASS, ALT, ESS, burn-in, reliability demonstration, reliability growth (Duane / Crow-AMSAA).
  4. Maintenance + RAMS — RCM, CBM, PdM, MTTR optimisation, spares queueing, availability targets.
  5. Failure analysis — FMEA / FMECA bottom-up, FTA top-down, RBD architectural, event trees forward, Markov for repairable + standby.
  6. Field reliability + warranty — censored field-return data, Weibull mixture fits, cohort + Pareto analysis, recall + safety-defect investigation.

Where it sits in the design stack: requirements + concept → architecture (RBD, allocation) → detail design (DfR, FMEA, prediction) → verification (HALT, ALT, RDT) → production (HASS, ESS, burn-in) → service (CBM, PdM, warranty analysis) → root-cause (failure analysis, FRACAS) → redesign. In a safety-critical industry (aerospace, automotive, medical, nuclear, rail) the full loop is mandated by standards (IEC 61508, ISO 26262, ARP4761, IEC 62304, EN 50128/50129, DO-178C/DO-254). In consumer goods the same toolkit drives warranty-cost minimisation.

Reliability sits adjacent to but is not the same as quality (conformance to spec at time of manufacture — Six Sigma, SPC, Cpk) or safety (acceptable risk of harm — HAZOP, LOPA, SIL/ASIL). High quality is necessary but not sufficient for reliability; high reliability is necessary but not sufficient for safety. The three disciplines share statistical machinery and standards bodies but answer different questions.

2. Why it matters

Hardware-failure cost is large and measurable. The economy-wide bill (warranty payments, recalls, downtime, product-liability awards, and consequential loss) sits in the 5–15 % of GDP band depending on the methodology (BCG 2018 industrial-IoT study; Warranty Week 2024 annual). For 2024, the US automotive industry alone booked $20 B in direct costs. Takata’s airbag-inflator recall, root-caused to ammonium-nitrate propellant degradation under humidity-and-temperature cycling (a textbook reliability-physics problem), bankrupted the company and led to the recall of ~100 million airbags worldwide.

Aerospace sets the high-water mark on quantitative reliability targets. FAR Part 25 / EASA CS-25 large-transport-category aircraft require a target of ≤ 1 catastrophic failure per 10⁹ flight hours for any system whose failure is catastrophic; that is DAL A in DO-178C / DO-254 and is derived in AC 25.1309-1B. ISO 26262 ASIL D for road-vehicle functional safety sits at < 10⁻⁸ /h for the residual random hardware failure budget (the PMHF target). Medical Class C software (IEC 62304) imposes equivalent rigor on software failure. These numbers are operating-point requirements that flow down through architecture, allocation, prediction, FMEA, FTA, redundancy design, and verification testing.

Outside safety-critical, reliability is a competitive moat. Caterpillar, Toyota, and Rolls-Royce build entire product strategies around demonstrated field reliability; the data is collected, analysed, and reinvested in design via FRACAS (Failure Reporting, Analysis, and Corrective Action System) per MIL-STD-2155.

3. First principles

3.1 The four basic functions

For a non-negative random failure time T, four equivalent descriptions exist; an engineer must be fluent in all four.

F(t) = P(T ≤ t)             cumulative failure distribution (CDF)
R(t) = P(T > t) = 1 − F(t)  reliability / survival function
f(t) = dF/dt                failure density (PDF)
λ(t) = f(t) / R(t)          hazard rate / instantaneous failure rate

Equivalent integral form:

R(t) = exp(−∫₀ᵗ λ(τ) dτ)

Reliability theory is the algebra of these four functions under combinatorial structures (series, parallel, k-of-n, bridge, Markov chains).
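
As a quick check of the integral form, the sketch below integrates a hazard rate numerically and compares the result with the closed-form R(t). The linearly rising hazard λ(t) = a·t and the values of `a` and `t` are illustrative assumptions, not data from this note.

```python
import math

def reliability_from_hazard(hazard, t, steps=10_000):
    """R(t) = exp(-integral of hazard from 0 to t), integral by the midpoint rule."""
    dt = t / steps
    cumulative_hazard = sum(hazard((i + 0.5) * dt) for i in range(steps)) * dt
    return math.exp(-cumulative_hazard)

# Hypothetical linearly rising hazard: lambda(t) = a * t, so R(t) = exp(-a * t**2 / 2).
a = 2.0e-6   # 1/h^2, illustrative value only
t = 500.0    # h

def linear_hazard(tau):
    return a * tau

r_numeric = reliability_from_hazard(linear_hazard, t)
r_closed = math.exp(-a * t ** 2 / 2)

# The other two functions follow from R(t) and lambda(t).
F_t = 1.0 - r_closed                   # F(t) = 1 - R(t)
f_t = linear_hazard(t) * r_closed      # f(t) = lambda(t) * R(t)

print(f"R numeric = {r_numeric:.6f}, closed form = {r_closed:.6f}")
print(f"F(t) = {F_t:.6f}, f(t) = {f_t:.3e} per hour")
```

Any one of the four functions is enough to recover the other three; the hazard is usually the most convenient starting point because it is what the bathtub curve plots.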

3.2 The bathtub curve

Empirical λ(t) over the life of a large fielded population shows three regimes (Klutke et al. 2003 review; originally documented in Davis 1952 telephone-relay data):

Phase | λ(t) | Drivers | Reliability remedy
Infant mortality | decreasing | Manufacturing escapes, assembly defects, weak components | Burn-in, ESS, HASS, supplier qual
Useful life | ~constant | Random overstress, lightning, mishandling, true random | Redundancy, derating
Wear-out | increasing | Fatigue, corrosion, electromigration, bearing wear, capacitor dry-out | PM/CBM, end-of-life replacement, design margin

The constant-λ assumption that underpins MIL-HDBK-217-style prediction is only valid in the useful-life phase. Burn-in moves the population past infant mortality before shipping; preventive maintenance retires units before wear-out kicks in. The Weibull shape parameter β maps directly to the three regimes: β < 1 ↔ infant, β = 1 ↔ constant, β > 1 ↔ wear-out (see §4).

3.3 MTBF, MTTF, MTTR, availability

  • MTTF (Mean Time To Failure) — non-repairable. MTTF = ∫₀^∞ R(t) dt. For exponential, MTTF = 1/λ.
  • MTBF (Mean Time Between Failures) — repairable. MTBF = MTTF + MTTR; in practice the two are used interchangeably when MTTR ≪ MTTF.
  • MTTR (Mean Time To Repair) — detect + isolate + access + replace + verify. Quantified separately; usually lognormal.
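  • Availability: A = MTTF / (MTTF + MTTR), steady-state; usually written MTBF / (MTBF + MTTR) with MTBF read as mean up-time between failures, which is numerically the same when MTTR ≪ MTTF. Inherent (design-only), operational (incl. logistics delay), achieved (incl. PM downtime) are three distinct numbers.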

Common pitfall: MTBF is not a lifetime. A unit with MTBF = 1 000 000 h does not last 114 years; under constant λ, R(MTBF) = e⁻¹ ≈ 0.37 — only 37 % of units survive to MTBF. The number is a rate parameter, not a service life.

FIT (Failures In Time) = failures per 10⁹ component-hours. 1 FIT = 1 failure per billion hours, i.e. FIT = λ (in h⁻¹) × 10⁹. Used universally for semiconductors. A 100-FIT op-amp running continuously (24 h × 365.25 d = 8766 h/yr) in 1000 units gives ~0.88 failures/yr: small but not zero.
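
The arithmetic in this subsection can be reproduced with a few helper functions. This is a minimal sketch under the constant-λ assumption; the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8766  # 24 h x 365.25 d, continuous operation

def fit_to_lambda(fit):
    """Convert FIT (failures per 1e9 device-hours) to a failure rate in 1/h."""
    return fit * 1e-9

def expected_failures_per_year(fit, units):
    """Expected yearly failure count for a fleet running continuously."""
    return fit_to_lambda(fit) * HOURS_PER_YEAR * units

def survival_fraction(t_hours, mtbf_hours):
    """R(t) under constant lambda = 1 / MTBF."""
    return math.exp(-t_hours / mtbf_hours)

fleet_failures = expected_failures_per_year(100, 1000)
surviving_at_mtbf = survival_fraction(1_000_000, 1_000_000)
surviving_at_10_years = survival_fraction(10 * HOURS_PER_YEAR, 1_000_000)

print(f"100 FIT x 1000 units: {fleet_failures:.2f} failures/yr")
print(f"Fraction surviving to t = MTBF: {surviving_at_mtbf:.3f}")
print(f"Fraction surviving 10 years at MTBF = 1e6 h: {surviving_at_10_years:.3f}")
```

The last line is the useful reading of a 1 000 000 h MTBF: roughly 8 % of a continuously operating population is expected to fail within ten years, provided the constant-λ assumption holds over that span.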

4. Probability distributions

4.1 Distribution comparison

Distribution | Parameters | β / shape | Use case | Notes
Exponential | λ | - | Useful-life electronics, memoryless events | Special case of Weibull (β = 1) and Gamma (k = 1)
Weibull (2-param) | β shape, η scale | β < 1 infant, β = 1 const, β > 1 wear-out | Bearings, fatigue, mechanical | Most flexible; default for field data
Weibull (3-param) | β, η, γ shift | + minimum life γ | Guaranteed-life parts | γ is the “no-failure before” floor
Lognormal | μ, σ on log(t) | Right-skewed | Maintenance time, fatigue-crack growth, semi wear-out | Multiplicative damage models
Normal | μ, σ | Symmetric | Mechanical strength, dimensional, “stress-strength” interference | Truncated at 0
Gamma | k shape, θ scale | Sum of k exponentials | Time-to-k-th-event, spares demand | k = 1 ⇒ exponential
Birnbaum–Saunders | α, β | Fatigue physics-based | Crack growth (Miner-rule failure time) | “Fatigue life distribution”; see fatigue-analysis
Mixed Weibull | β₁,η₁,p₁ + β₂,η₂,(1−p₁) | Multimodal | Multiple failure modes (infant + wear-out) | Common in warranty data
Extreme value (Gumbel) | μ, σ | Min-of-n weakest link | Brittle ceramics, “weakest link” | -

4.2 Weibull in depth

Waloddi Weibull (1951, J Appl Mech, “A statistical distribution function of wide applicability”) proposed the form:

R(t) = exp(−(t/η)^β)
F(t) = 1 − R(t)
f(t) = (β/η)·(t/η)^(β−1)·exp(−(t/η)^β)
λ(t) = (β/η)·(t/η)^(β−1)

η is the characteristic life = the t at which R = e⁻¹ ≈ 0.368 (63.2 % failed), independent of β. β is the shape parameter = the slope of the line on a Weibull probability plot; the hazard rate itself varies as t^(β−1). The original paper used cotton-fibre and steel-strength data; the distribution now dominates mechanical reliability because so many real failure mechanisms (bearing L₁₀ life, see bearings; Wöhler-curve fatigue scatter, see fatigue-analysis; weld-joint life) fit it well.
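
The four Weibull expressions translate directly into code. The sketch below uses an illustrative η of 1000 h and three β values to show that F(η) stays at 63.2 % whatever the shape, while the hazard trend flips with β.

```python
import math

def weibull_reliability(t, beta, eta):
    return math.exp(-((t / eta) ** beta))

def weibull_cdf(t, beta, eta):
    return 1.0 - weibull_reliability(t, beta, eta)

def weibull_hazard(t, beta, eta):
    return (beta / eta) * (t / eta) ** (beta - 1)

def weibull_pdf(t, beta, eta):
    # f(t) = lambda(t) * R(t)
    return weibull_hazard(t, beta, eta) * weibull_reliability(t, beta, eta)

eta = 1000.0  # hours, illustrative value only
for beta in (0.5, 1.0, 3.0):
    early = weibull_hazard(0.5 * eta, beta, eta)
    late = weibull_hazard(2.0 * eta, beta, eta)
    if late < early:
        trend = "decreasing"
    elif late > early:
        trend = "increasing"
    else:
        trend = "constant"
    print(f"beta = {beta}: F(eta) = {weibull_cdf(eta, beta, eta):.3f}, hazard is {trend}")
```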

4.3 Parameter estimation

Three competing methods:

  1. Maximum Likelihood Estimation (MLE) — best statistical efficiency, handles censored data natively, but optimisation can be ill-conditioned for small samples. Default in Reliasoft Weibull++ and Minitab.
  2. Median Rank Regression (MRR / RRX, RRY) — fit the linearised Weibull on a probability plot using Bernard’s approximation F̂ᵢ = (i − 0.3)/(n + 0.4) for the i-th order statistic of n units. Standard for small-sample mechanical data; robust and visualisable.
  3. Probability plotting (visual) — last resort when n is tiny; eyeball fit on Weibull paper. Still useful as a sanity check that the data is in fact Weibull (look for curvature).

For censored data (units still running at end of test, removed for other reasons, or failed by a mode not under study): only MLE and adjusted-rank MRR handle it correctly. Type-I censoring = stop at time T; Type-II = stop at r-th failure; multiply censored = arbitrary withdrawals. Always declare the censoring scheme in the analysis.

5. Reliability prediction

5.1 Handbook methods — origin and status

  • MIL-HDBK-217F Notice 2 (1995): the original. Parts-count (early design) and parts-stress (detailed design) methods for electronic components. Each part model is a base failure rate λ_b modified by π factors for temperature, electrical stress, environment, quality, package. Last revised in 1995 and never updated or replaced by DoD; still cited in legacy contracts. Known to be 2–10× pessimistic for modern silicon (its base failure rates predate sub-micron CMOS), and its environment factor (π_E) for spacecraft is widely ignored.
  • Telcordia SR-332 Issue 4 (2016) — telecom-industry replacement. Allows field data feedback (Bayesian update of generic λ with site experience). Better calibrated for modern parts than 217F.
  • IEC 62380:2004 — European equivalent, similar structure.
  • RIAC HDBK-217 Plus (Quanterion 2017) — modern field-data update of 217F with failure-mechanism distinctions (operating vs non-operating, T-cycle vs dwell). Maintained.
  • PRISM (now incorporated into 217Plus) — process-grade-based extension.
  • ANSI/VITA 51.1, 51.2, 51.3 — pre-adopted corrections to 217F (parameter tuning, additional factors, modern parts).
  • FIDES Guide 2022 (French aerospace consortium) — physics-of-failure-influenced, mission-profile-driven, defect-aware.

5.2 Physics-of-failure (PoF) models

For modern microelectronics, JEDEC JEP122H (“Failure Mechanisms and Models for Semiconductor Devices”) and JESD85 (“Methods for Calculating Failure Rates in Units of FITs”) define the mechanism-by-mechanism approach:

Mechanism | Model | Stress | Typical Ea
Oxide TDDB | E-model, 1/E-model | E-field, T | 0.6–0.9 eV
Electromigration | Black (1969): t₅₀ = A·J⁻ⁿ·exp(Ea/kT) | J, T | 0.7–1.1 eV, n = 1–2
Hot carrier | I_sub power law | V_DS, I_sub, T | negative Ea
NBTI | Power-law in V_GS, exp in T | V_GS, T | 0.1–0.3 eV
Stress migration | Power-law in (T_proc − T) | ΔT | 0.5–0.9 eV
Solder thermal fatigue | Coffin–Manson (1954), Norris–Landzberg (1969) | ΔT, dwell | n = 2–8
Corrosion | Peck humidity model | RH, T | 0.7–0.9 eV, RH^−3
Mechanical fatigue | Basquin / Coffin–Manson | Δε | see fatigue-analysis

PoF supplements handbook prediction: the handbook gives “this resistor sees 10 FIT”, PoF gives “this BGA solder joint will see 1 % failure at 4500 power-cycles at ΔT = 80 °C with 30 min dwell”. PoF is essential for new technologies (sub-7 nm, wide-bandgap power, advanced packaging) where handbook data does not exist.

5.3 Software tools

  • Reliasoft Lambda Predict (Hottinger Bruel & Kjaer), Relyence Reliability Workbench, Item Toolkit ITEM-217, Isograph Hawk — all support 217F + 217Plus + Telcordia + IEC + FIDES.

6. System-level methods

6.1 Reliability Block Diagram (RBD)

Architecture-level representation: blocks = components, edges = success paths from source to sink. Math:

  • Series: R_sys = ∏ Rᵢ. All must work.
  • Parallel (active redundancy): R_sys = 1 − ∏ (1 − Rᵢ). Any one suffices.
  • k-out-of-n: binomial sum ∑ⱼ₌ₖⁿ C(n,j) Rʲ (1−R)^(n−j). Voted redundancy (e.g. triple-modular redundancy TMR, k = 2 of 3).
  • Bridge network: not series-parallel; use cut-set or tie-set algorithms.
  • Standby (cold): must include switchover reliability + standby failure rate during dormant period.

Standard: MIL-STD-756B (cancelled 1995 but still referenced); IEC 61078:2016 (“Reliability block diagrams”). Tools: Reliasoft BlockSim, Isograph Reliability Workbench, ITEM Toolkit RBD, OpenFTA, OpenRBD.
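
The series, parallel, and k-out-of-n formulas compose naturally as small functions. This is a minimal sketch assuming independent blocks (and identical blocks for k-out-of-n); the reliabilities used are illustrative, and bridge or standby structures are not covered.

```python
import math

def series(reliabilities):
    """All blocks must work."""
    return math.prod(reliabilities)

def parallel(reliabilities):
    """Active redundancy: any one block suffices."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def k_out_of_n(k, n, r):
    """At least k of n identical, independent blocks of reliability r must work."""
    return sum(math.comb(n, j) * r ** j * (1.0 - r) ** (n - j) for j in range(k, n + 1))

r = 0.9  # illustrative block reliability
tmr = k_out_of_n(2, 3, r)
nested = series([0.99, parallel([0.95, 0.95]), 0.98])

print(f"TMR (2-of-3) at R = {r}: {tmr:.3f}")
print(f"Series of 0.99, (0.95 || 0.95), 0.98: {nested:.4f}")
```

Note that TMR only beats a single block when R > 0.5; below that, voting makes things worse.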

6.2 Fault Tree Analysis (FTA)

Top-down deductive analysis. Top event = system failure; expand downward through AND / OR / k-of-n gates to basic events (component failures, human errors, external events). Mathematics is Boolean algebra of failure events.

  • Minimal cut sets (MCS) — smallest combinations of basic events whose simultaneous occurrence causes the top event. Order-1 cut sets are single-point failures (red flags). Importance measures (Birnbaum, Fussell-Vesely, Risk Achievement Worth, Risk Reduction Worth) rank components.
  • Standards: IEC 61025:2006, NASA Fault Tree Handbook (NASA/SP-2002-7106), NUREG-0492 (nuclear).
  • Cross-discipline use: ARP4761A:2023 is the canonical avionics safety-assessment standard, mandates FHA → PSSA → SSA flow using FTA + FMEA + CCA (common-cause analysis).

Tools: Isograph FaultTree+, Reliasoft BlockSim FTA module, ITEM ToolKit FTA, FaultCAT (Boeing internal), SAPHIRE (INL nuclear), OpenFTA, RAM Commander.
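
To make the cut-set mathematics concrete, the sketch below evaluates a small hypothetical tree with minimal cut sets {A}, {D}, {B, C}. It computes the exact top-event probability by brute-force state enumeration, the rare-event approximation (sum of cut-set probabilities), and the Birnbaum importance of each basic event. The event names and probabilities are assumptions for illustration, and enumeration scales as 2ⁿ, so it only suits small trees.

```python
import math
from itertools import product

def top_event_probability(cut_sets, q):
    """Exact P(top) by enumerating every combination of basic-event states."""
    names = sorted(q)
    total = 0.0
    for states in product((False, True), repeat=len(names)):
        failed = {name for name, is_failed in zip(names, states) if is_failed}
        if any(cut_set <= failed for cut_set in cut_sets):
            total += math.prod(q[n] if s else 1.0 - q[n] for n, s in zip(names, states))
    return total

def rare_event_approximation(cut_sets, q):
    """Upper bound: sum over cut sets of the product of their event probabilities."""
    return sum(math.prod(q[e] for e in cut_set) for cut_set in cut_sets)

def birnbaum_importance(cut_sets, q, event):
    """P(top | event failed) - P(top | event working)."""
    return (top_event_probability(cut_sets, {**q, event: 1.0})
            - top_event_probability(cut_sets, {**q, event: 0.0}))

# Hypothetical tree: A and D are single-point failures, B and C are redundant.
cut_sets = [frozenset({"A"}), frozenset({"D"}), frozenset({"B", "C"})]
q = {"A": 0.01, "B": 0.05, "C": 0.05, "D": 0.02}

exact = top_event_probability(cut_sets, q)
approx = rare_event_approximation(cut_sets, q)
importance = {event: birnbaum_importance(cut_sets, q, event) for event in q}

print(f"exact = {exact:.6f}, rare-event approximation = {approx:.6f}")
for event, value in sorted(importance.items()):
    print(f"Birnbaum importance of {event}: {value:.4f}")
```

The order-1 cut sets dominate both the top-event probability and the importance ranking, which is why they are flagged first in a review.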

6.3 Event Tree Analysis (ETA)

Forward inductive analysis from an initiating event through success/failure branches of safety functions to a set of end states. Used in PRA (Probabilistic Risk Assessment) for nuclear and chemical plants per IEC 62502. Complementary to FTA: ETA structures the consequence side, FTA structures the cause side; both feed PRA.

6.4 FMEA / FMECA

Failure Mode and Effects Analysis — bottom-up tabular discipline. For each component, list every failure mode (open, short, drift, leak, jam), its local + system effects, the cause, detection method, and a risk index.

  • Classical RPN (Risk Priority Number) = Severity × Occurrence × Detection, each on 1–10 scale; RPN ∈ [1, 1000]. Used to prioritise corrective action.
  • Modern AIAG-VDA FMEA Handbook 1st ed 2019 replaced RPN with Action Priority (AP) (High / Medium / Low) to discourage RPN-threshold gaming and to keep severity-10 single-point items always at the top.
  • FMECA = FMEA + Criticality (probability + severity) per MIL-STD-1629A. Defence and aerospace flavour.
  • DFMEA (Design) vs PFMEA (Process) vs System-level FMEA. IATF 16949 requires both DFMEA and PFMEA in automotive.

Standards: IEC 60812:2018, MIL-STD-1629A (cancelled 1998 but still referenced), AIAG-VDA 2019, SAE J1739, ARP5580 (non-auto).

Tools: Reliasoft Xfmea, APIS IQ-FMEA, Plato AG SCIO, Siemens Teamcenter FMEA, Ansys medini analyze (ISO 26262 + ARP4761 + IEC 61508).

6.5 Markov, semi-Markov, Monte Carlo

For repairable systems, standby with imperfect switching, and time-dependent failure rates, simple RBD math breaks down. Markov state-transition diagrams (states = combinations of working/failed components, transitions = λ and μ rates) give analytical availability and unreliability. Semi-Markov handles non-exponential sojourn times. Monte Carlo simulation is the workhorse when analytic methods fail (mixture distributions, complex maintenance policies, opportunistic replacement). Tools: Reliasoft BlockSim (discrete-event sim), GoldSim, AnyLogic.
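
The simplest Markov model is a single repairable unit with two states (up, down), failure rate λ and repair rate μ. The sketch below compares the closed-form point availability with a forward-Euler integration of the state equation. The rates are illustrative assumptions, and a real model would have one state per combination of component conditions.

```python
import math

def availability_closed_form(t, lam, mu):
    """Point availability of a two-state Markov model that starts in the up state."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def availability_euler(t_end, lam, mu, dt=0.01):
    """Integrate dP_up/dt = -lam * P_up + mu * (1 - P_up) with forward Euler."""
    p_up = 1.0
    for _ in range(int(round(t_end / dt))):
        p_up += dt * (-lam * p_up + mu * (1.0 - p_up))
    return p_up

lam = 1.0 / 3996.0   # failures per hour (illustrative)
mu = 1.0 / 4.0       # repairs per hour, i.e. MTTR = 4 h (illustrative)

a_10h_exact = availability_closed_form(10.0, lam, mu)
a_10h_euler = availability_euler(10.0, lam, mu)
a_steady = mu / (lam + mu)

print(f"A(10 h): closed form {a_10h_exact:.7f}, Euler {a_10h_euler:.7f}")
print(f"Steady-state availability: {a_steady:.6f}")
```

The steady-state value equals MTTF / (MTTF + MTTR), so the Markov model reduces to the familiar availability formula once the transient has died out.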

6.6 Reliability allocation

Top-down apportionment of a system reliability goal R_sys* to subsystems R_i such that ∏ R_i ≥ R_sys* (series case). Methods:

  • Equal apportionment — naïve baseline.
  • AGREE (Advisory Group on Reliability of Electronic Equipment 1957) — weighted by complexity (parts count) × importance.
  • ARINC — weighted by predicted λ from prior systems.
  • Feasibility of Objectives (FoO) — weighted by state-of-art difficulty score.
  • Karmiol / Bracha — cost-weighted optimisation.

Allocation drives reliability budget flow-down to subsystem teams and supplier requirements (e.g. “this connector must demonstrate ≤ 50 FIT”).
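
Two of the methods above are easy to sketch for a series system with constant failure rates: equal apportionment and ARINC-style weighting by predicted λ. The subsystem names, predicted rates, target, and mission time below are hypothetical.

```python
import math

def equal_apportionment(r_sys_target, n):
    """Every subsystem gets the same reliability target: R_i = R_sys ** (1/n)."""
    return [r_sys_target ** (1.0 / n)] * n

def arinc_allocation(r_sys_target, mission_hours, predicted_rates):
    """Split the allowed system failure rate in proportion to predicted rates."""
    lam_sys_target = -math.log(r_sys_target) / mission_hours
    total_predicted = sum(predicted_rates.values())
    return {name: lam_sys_target * rate / total_predicted
            for name, rate in predicted_rates.items()}

r_target = 0.999          # hypothetical system goal
mission_hours = 1000.0    # hypothetical mission time
predicted = {"psu": 1e-7, "compute": 5e-7, "output": 2e-7}  # 1/h, hypothetical

equal = equal_apportionment(r_target, len(predicted))
allocated = arinc_allocation(r_target, mission_hours, predicted)

print(f"Equal apportionment: each subsystem R >= {equal[0]:.6f}")
for name, lam in allocated.items():
    print(f"ARINC allocation for {name}: {lam * 1e9:.0f} FIT")
```

ARINC weighting hands the largest share of the budget to the subsystem that was predicted to be worst, which keeps the allocation achievable but does not by itself push the weak subsystem to improve.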

7. Worked examples

7.1 Example A — Weibull fit + B₁₀ life of a deep-groove ball bearing

Test data: 10 bearings (6205-2RS, see bearings) run to failure under constant radial load and lubrication; failure times in operating hours.

Sorted failure times (hr): 2340, 3100, 4280, 5600, 6900, 8100, 9400, 10800, 12600, 15800.

Apply Bernard’s approximation for median rank Fᵢ = (i − 0.3)/(n + 0.4), n = 10:

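i | tᵢ (hr) | Fᵢ
1 | 2340 | 0.0673
2 | 3100 | 0.1635
3 | 4280 | 0.2596
4 | 5600 | 0.3558
5 | 6900 | 0.4519
6 | 8100 | 0.5481
7 | 9400 | 0.6442
8 | 10800 | 0.7404
9 | 12600 | 0.8365
10 | 15800 | 0.9327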

Linearise: x = ln(t), y = ln(−ln(1 − F)). Linear regression of y on x gives slope β and intercept −β·ln(η). For this data set:

β ≈ 1.79 (wear-out, consistent with bearing fatigue spalling), η ≈ 9040 hr.

B₁₀ life (the bearing-industry standard, t at which 10 % have failed; see bearings):

B₁₀ = η · (−ln(1 − 0.10))^(1/β)
    = 9040 · (0.10536)^(1/1.79)
    = 9040 · (0.10536)^0.5587
    = 9040 · 0.2844
    ≈ 2570 hr

Compare with the rolled-up ISO 281 L₁₀ rated life (computed from C/P in the catalogue): if the catalogue rated L₁₀ for this load is ~3000 hr, the Weibull-fit B₁₀ lands about 14 % below it, which is within the scatter expected from a ten-unit sample, so the rating stands. A discrepancy of > 30 % points to either a load-spectrum error or a lubrication / contamination problem.
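
The median-rank regression in this example can be reproduced with the standard library. The sketch below regresses y on x by ordinary least squares over all ten failure times and assumes complete (uncensored) data; on this data set it returns β of about 1.8, η of about 9000 hr, and B₁₀ of about 2600 hr.

```python
import math

failure_hours = [2340, 3100, 4280, 5600, 6900, 8100, 9400, 10800, 12600, 15800]

def median_rank_regression(times):
    """Fit a 2-parameter Weibull by least squares on the linearised CDF (complete data only)."""
    ordered = sorted(times)
    n = len(ordered)
    xs = [math.log(t) for t in ordered]
    ys = []
    for i in range(1, n + 1):
        median_rank = (i - 0.3) / (n + 0.4)          # Bernard's approximation
        ys.append(math.log(-math.log(1.0 - median_rank)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta = s_xy / s_xx                                # slope
    intercept = y_mean - beta * x_mean                # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta

def b_life(fraction_failed, beta, eta):
    """Time by which the given fraction of the population has failed."""
    return eta * (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)

beta, eta = median_rank_regression(failure_hours)
b10 = b_life(0.10, beta, eta)
print(f"beta = {beta:.2f}, eta = {eta:.0f} hr, B10 = {b10:.0f} hr")
```

Regressing x on y instead (the RRX convention used by some commercial tools) gives a slightly different slope, so the regression direction should be stated alongside the result.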

7.2 Example B — Series-parallel RBD with mixed redundancy

System: control rack composed of A (PSU) in series with a redundant pair (B || C, both compute boards, active redundancy) in series with D (output module).

Failure rates (FIT, useful-life zone, constant-λ exponential):

  • λ_A = 100 FIT = 100 × 10⁻⁹ /h = 1 × 10⁻⁷ /h
  • λ_B = λ_C = 500 FIT = 5 × 10⁻⁷ /h
  • λ_D = 200 FIT = 2 × 10⁻⁷ /h

Mission time t = 1000 h (continuous operation).

Component reliabilities R(t) = e^(−λt):

  • R_A(1000) = e^(−1e−7·1000) = e^(−1e−4) = 0.99990000
  • R_B = R_C = e^(−5e−4) = 0.99950012
  • R_D = e^(−2e−4) = 0.99980002

Parallel B || C:

R_BC = 1 − (1 − R_B)(1 − R_C)
     = 1 − (0.00049988)²
     = 1 − 2.499e−7
     = 0.99999975

Series system:

R_sys = R_A · R_BC · R_D
      = 0.99990000 · 0.99999975 · 0.99980002
      = 0.99969980

System failure probability over 1000 h: 1 − R_sys = 3.002 × 10⁻⁴ ≈ 300 ppm. System equivalent FIT (for short missions in useful-life regime) ≈ (1 − R_sys)/t × 10⁹ ≈ 300 FIT. The redundant pair contributes essentially nothing to that total (about 0.25 ppm); A and D dominate. Allocation insight: spend reliability budget on A and D, not on doubling-up B/C further. This is a generic finding: redundancy without diversifying single-point-of-failure power and I/O is theatre.
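
The whole example fits in a few lines of code, which makes it easy to rerun with other mission times or failure rates. This sketch uses the FIT values stated above and assumes independent, constant-λ blocks.

```python
import math

MISSION_HOURS = 1000.0
fit = {"A": 100.0, "B": 500.0, "C": 500.0, "D": 200.0}  # FIT values from the example

def reliability(fit_rate, hours):
    """Exponential reliability for a failure rate given in FIT."""
    return math.exp(-fit_rate * 1e-9 * hours)

r = {name: reliability(rate, MISSION_HOURS) for name, rate in fit.items()}

r_bc = 1.0 - (1.0 - r["B"]) * (1.0 - r["C"])   # active-redundant pair
r_sys = r["A"] * r_bc * r["D"]                 # series chain A, (B || C), D

unreliability = 1.0 - r_sys
equivalent_fit = unreliability / MISSION_HOURS * 1e9
pair_share = (1.0 - r_bc) / unreliability      # fraction of failures due to the pair

print(f"R_sys = {r_sys:.8f}")
print(f"Equivalent system rate = {equivalent_fit:.0f} FIT")
print(f"Share of unreliability from B || C = {pair_share:.2%}")
```

The pair's share stays below 0.1 % at this mission time, which is the numerical form of the allocation insight above.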

7.3 Example C — Arrhenius accelerated life test design

Goal: predict 10-year field life of an integrated motor-drive at T_use = 55 °C using a 100-hour bench test.

Choose stress temperature T_test = 125 °C. Use Arrhenius (1889 / Eyring 1936 simplified):

AF = exp(Ea/k · (1/T_use − 1/T_test))

with k = 8.617 × 10⁻⁵ eV/K (Boltzmann in eV/K), Ea = 0.7 eV (typical aggregate for IC bond-wire + die-attach + electromigration; per JEP122H).

Convert temperatures: T_use = 328.15 K, T_test = 398.15 K.

1/T_use  − 1/T_test = 1/328.15 − 1/398.15
                    = 3.0474e−3 − 2.5116e−3
                    = 5.358e−4 K⁻¹
Ea / k             = 0.7 / 8.617e−5 = 8124 K
AF                 = exp(8124 · 5.358e−4) = exp(4.353) = 77.7

Equivalent field hours: 100 h × 77.7 = 7770 h ≈ 10.8 months at 55 °C.

If the design target is 10 years at 55 °C = 87,600 h, the 100-h bench test covers only ≈ 9 % of design life, which is not sufficient. Either extend the bench test to ≥ 1130 h at 125 °C, or raise T_test to 150 °C (AF ≈ 259, 100 h ≈ 25,900 field-h ≈ 3.0 yr, still short on its own), or run a larger sample size to demonstrate at 60 % confidence.

Sample-size sanity check using the χ² formula for zero failures, time-terminated (Type-I) right-censoring:

t_total = χ²(2r+2, 1−CL) / (2 · λ_target)

For r = 0 failures at 60 % confidence demonstrating λ_target = 100 FIT = 10⁻⁷ /h:

χ²(2, 0.40) = 1.833
t_total = 1.833 / (2 · 10⁻⁷) = 9.17e6 device-hours

Spread over n = 50 units at 125 °C with AF = 77.7: 9.17e6 / (50 · 77.7) ≈ 2360 bench hours. That is the test that demonstrates the requirement. A 100-h test for n = 50 only demonstrates ~24× weaker reliability than the field target — useful for screening, useless as a demonstration.
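
The acceleration factor and the zero-failure demonstration time can be scripted as below. This is a sketch under the example's assumptions (single Ea of 0.7 eV, constant λ); the χ² shortcut is valid only for r = 0, where the χ² quantile with 2 degrees of freedom has the closed form −2·ln(1 − CL).

```python
import math

BOLTZMANN_EV = 8.617e-5  # eV/K

def arrhenius_af(ea_ev, t_use_c, t_test_c):
    """Arrhenius acceleration factor between use and test temperatures (deg C)."""
    t_use_k = t_use_c + 273.15
    t_test_k = t_test_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV * (1.0 / t_use_k - 1.0 / t_test_k))

def zero_failure_device_hours(lambda_target, confidence):
    """Device-hours needed to demonstrate lambda_target with zero failures."""
    chi2_2dof = -2.0 * math.log(1.0 - confidence)   # closed form, r = 0 only
    return chi2_2dof / (2.0 * lambda_target)

af_125 = arrhenius_af(0.7, 55.0, 125.0)
device_hours = zero_failure_device_hours(1e-7, 0.60)
bench_hours = device_hours / (50 * af_125)          # 50 units on test

print(f"AF at 125 C = {af_125:.1f}")
print(f"Required field-equivalent device-hours = {device_hours:.3e}")
print(f"Bench hours for n = 50 at 125 C = {bench_hours:.0f}")
```

Because AF depends exponentially on Ea, rerunning `arrhenius_af` with Ea = 0.6 eV and 0.8 eV is a quick way to see how sensitive the test plan is to that single assumption.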

8. Testing methods

8.1 Accelerated-stress test taxonomy

Test | Goal | Stress | Sample size | Output
HALT (Highly Accelerated Life Test) | Find design margins, weakest link | Step-stress T, vibration, T-cycle, V | 4–8 | Failure modes, margins (operational + destruct limits)
HASS (Highly Accelerated Stress Screen) | Production screening | Above-spec but below HALT destruct | 100 % | Pass/fail per unit
ESS (Environmental Stress Screening) | Production screening, milder than HASS | T-cycle + random vibration | 100 % | Pass/fail per unit, MIL-STD-2164 / IEC 61163-1
ALT (Accelerated Life Test) | Quantitative life vs stress | Single stress, multiple levels | 20+ per cell | Acceleration factor, life distribution
Burn-in | Infant-mortality removal | Bias at elevated T for hours-days | 100 % (legacy) | Pass/fail; cost-benefit debated post-2000
RDT (Reliability Demonstration Test) | Prove λ ≤ target | Use-condition | Sized via χ² / binomial | Demonstrated MTBF at confidence CL
Reliability Growth | Track λ improvement during dev | Use-condition + corrective action | Continuous | Duane / Crow-AMSAA growth slope α

8.2 Acceleration models

Stress | Model | Equation | Origin
Temperature | Arrhenius | AF = exp(Ea/k · (1/T₁ − 1/T₂)) | Arrhenius 1889
ΔT cycle | Coffin–Manson | N_f = C · ΔT⁻ⁿ, n = 2–8 | Coffin 1954, Manson 1954
ΔT cycle + dwell | Norris–Landzberg | N_f = C · f^m · ΔT⁻ⁿ · exp(Ea/kT_max) | Norris & Landzberg 1969
Voltage / E-field | Power law / Eyring | t_f = A · V⁻ⁿ · exp(Ea/kT) | Eyring 1936
Current density | Black | t₅₀ = A · J⁻ⁿ · exp(Ea/kT), n = 1–2 | Black 1969
Humidity | Peck | t_f = A · RH⁻ⁿ · exp(Ea/kT) | Peck 1986
Vibration | Basquin power law | N_f = C · S⁻ᵇ | Basquin 1910

Pitfall: multi-stress models (T + V + humidity) compound errors quickly. Always validate the Ea assumption on at least two stress points before extrapolating.

8.3 Reliability growth — Duane / Crow-AMSAA

Duane (1964) observed empirically that during a development program, cumulative MTBF tracked as ln(MTBF_cum) = ln(MTBF_initial) + α · ln(T) with growth slope α ∈ [0.3, 0.6] for well-managed programs. Crow (1974, at AMSAA, the Army Materiel Systems Analysis Activity) made it a formal NHPP (non-homogeneous Poisson process) with intensity λ(t) = λ_0 · β · t^(β−1), enabling MLE on the growth parameters; growth corresponds to β < 1, with α = 1 − β. Standard: IEC 61164:2004.
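
For a time-terminated test, the Crow-AMSAA maximum-likelihood estimates have a closed form: β̂ = n / Σ ln(T / tᵢ) and λ̂₀ = n / T^β̂. The sketch below applies them to a hypothetical set of cumulative failure times; the data and the 3000 h test length are invented for illustration.

```python
import math

def crow_amsaa_mle(failure_times, total_time):
    """MLE of (lam0, beta) for a power-law NHPP, time-terminated test."""
    n = len(failure_times)
    beta = n / sum(math.log(total_time / t) for t in failure_times)
    lam0 = n / total_time ** beta
    return lam0, beta

def instantaneous_mtbf(t, lam0, beta):
    """Reciprocal of the intensity lam0 * beta * t**(beta - 1)."""
    return 1.0 / (lam0 * beta * t ** (beta - 1.0))

# Hypothetical cumulative test hours at which failures occurred.
failures = [25, 80, 200, 410, 700, 1100, 1650, 2400]
T = 3000.0

lam0, beta = crow_amsaa_mle(failures, T)
mtbf_now = instantaneous_mtbf(T, lam0, beta)
mtbf_cumulative = T / len(failures)

print(f"beta = {beta:.3f} (growth slope alpha = {1 - beta:.3f})")
print(f"Cumulative MTBF = {mtbf_cumulative:.0f} h, instantaneous MTBF at T = {mtbf_now:.0f} h")
```

With β < 1 the instantaneous MTBF at the end of test exceeds the cumulative MTBF, which is the quantitative statement that the corrective actions are working.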

9. Maintenance + RAMS

9.1 Maintenance strategies

Strategy | When | Cost driver | Tool
Corrective (CM) | After failure | Downtime + collateral damage | Reactive
Preventive (PM, scheduled) | Time / cycle interval | Over-maintenance, waste of remaining life | RCM analysis
Condition-Based (CBM) | Threshold on measured indicator | Sensor + analytics infrastructure | Vibration analysis, oil debris, thermal
Predictive (PdM) | ML-predicted remaining-useful-life (RUL) | Model risk, training data | LSTM / survival models

9.2 RCM — Reliability-Centred Maintenance

Origin: United Airlines (MSG-1 for the Boeing 747, 1968; Nowlan & Heap report to the US DoD, 1978), formalised as MSG-2 / MSG-3 for commercial aviation; carried into general industry by Moubray (“RCM II”, 1997). Seven structured questions answered for each significant function:

  1. What are the functions and performance standards in its current context?
  2. In what ways does it fail?
  3. What causes each failure?
  4. What happens when each failure occurs?
  5. In what way does each failure matter?
  6. What can be done to predict or prevent each failure?
  7. What if nothing can be done?

Standards: SAE JA1011:2009 (RCM evaluation criteria), SAE JA1012:2002 (RCM guide), IEC 60300-3-11:2009.

9.3 Availability + spares

Steady-state availability A = MTBF/(MTBF + MTTR). To achieve A = 0.999 (8.76 h downtime/yr) with MTTR = 4 h requires MTBF ≥ 3996 h ≈ 5.5 months. To achieve A = 0.99999 (“five nines”, 5.3 min/yr) requires MTBF ≥ 400 000 h or MTTR < 0.04 h or redundancy.

Spares optimisation: Poisson demand model with rate λ_demand = N_installed × λ_part; service level S(k) = P(demand ≤ k spares) drives stockholding cost vs stockout cost. Tools: SAP MRP, Oracle EAM, Servigistics.
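
The Poisson service-level calculation is short enough to sketch directly. The fleet size, part failure rate, and replenishment lead time below are hypothetical; the mean demand is taken as λ_demand multiplied by the lead time, which assumes spares are only restocked once per lead-time interval.

```python
import math

def service_level(spares, mean_demand):
    """P(demand <= spares) for Poisson-distributed demand."""
    return sum(math.exp(-mean_demand) * mean_demand ** j / math.factorial(j)
               for j in range(spares + 1))

def min_spares(mean_demand, target_service_level):
    """Smallest stock level that meets the target (target must be below 1)."""
    k = 0
    while service_level(k, mean_demand) < target_service_level:
        k += 1
    return k

n_installed = 200          # hypothetical fleet size
lambda_part = 5e-6         # failures per hour per part (hypothetical)
lead_time_hours = 720.0    # hypothetical replenishment lead time

mean_demand = n_installed * lambda_part * lead_time_hours
stock = min_spares(mean_demand, 0.95)

print(f"Mean demand over lead time = {mean_demand:.2f}")
for k in range(stock + 1):
    print(f"S({k}) = {service_level(k, mean_demand):.4f}")
print(f"Spares needed for 95 % service level: {stock}")
```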

9.4 CMMS + EAM software

  • CMMS (work-order, PM-schedule, spares): IBM Maximo, SAP PM (now S/4HANA Asset Management), Fiix (Rockwell), Maintainx, UpKeep, Limble, eMaint, Hippo CMMS.
  • EAM (enterprise + financial integration): IBM Maximo Application Suite, SAP EAM, Infor EAM, Oracle eAM, Ivara EXP.
  • APM (asset performance management, CBM + PdM): GE APM (Meridium), AVEVA APM, Bentley AssetWise, AspenTech Mtell.

10. Functional safety + safety-critical

10.1 Standards landscape

Standard | Domain | Safety integrity level | Year
IEC 61508 | Generic E/E/PE safety-related systems | SIL 1–4 | 2010 (Ed 2)
IEC 61511 | Process industry safety-instrumented systems | SIL 1–4 | 2016
ISO 26262 | Automotive E/E systems | ASIL A–D | 2018 (Ed 2)
IEC 62304 | Medical-device software | Class A / B / C | 2006 + Amd 1 2015
ISO 14971 | Medical-device risk management | - | 2019
DO-178C / DO-254 | Avionics SW / HW | DAL A–E | 2011 / 2000
ARP4754A / ARP4761A | Civil aircraft system development / safety | DAL A–E | 2010 / 2023
EN 50128 / 50129 / 50657 | Rail SW / signalling / on-board | SIL 0–4 | 2011–2017
API RP 581 | Risk-Based Inspection (oil & gas) | - | 2016
IEC 60601-1 | Medical electrical equipment basic safety | - | 2020

10.2 Integrity-level equivalence (approximate, demand mode)

IEC 61508 SIL | ISO 26262 ASIL | DO-178C DAL | IEC 62304 Class | PFD (low demand) | PFH (high demand, /h)
4 | - | A | C | 10⁻⁵ – 10⁻⁴ | 10⁻⁹ – 10⁻⁸
3 | D | B | C | 10⁻⁴ – 10⁻³ | 10⁻⁸ – 10⁻⁷
2 | C | C | B | 10⁻³ – 10⁻² | 10⁻⁷ – 10⁻⁶
1 | A / B | D | A | 10⁻² – 10⁻¹ | 10⁻⁶ – 10⁻⁵

The equivalence table is approximate — ASIL D is calibrated to road vehicles’ exposure-controllability rather than IEC 61508’s pure target probability, and a strict mapping is debated; see ISO 26262-10 Annex.

10.3 Failure categories

Modern functional-safety standards split failures into:

  • Hardware random failure — Weibull / Exponential life of a component. Addressed by FIT + redundancy + diagnostics.
  • Systematic failure — design + spec error, software bug, common-cause. Addressed by process rigour (V-model, formal methods, MISRA, code review, test coverage).
  • Common-cause failure (CCF) — single root affects redundant channels. Addressed by diversity (heterogeneous redundancy), separation, β-factor model (β ≈ 0.01–0.10 typical).
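
The β-factor model can be illustrated on a redundant pair (1oo2): a fraction β of each channel's failure rate is treated as common to both channels and placed in series with the independent parallel part. The failure rate, mission time, and β value below are illustrative assumptions.

```python
import math

def one_oo_two_reliability(lam, hours, beta_ccf):
    """1oo2 redundant pair with a beta-factor common-cause contribution."""
    # Independent part: each channel fails on its own at rate (1 - beta) * lam.
    q_independent = 1.0 - math.exp(-(1.0 - beta_ccf) * lam * hours)
    # Common-cause part: one shared event at rate beta * lam takes out both channels.
    r_common = math.exp(-beta_ccf * lam * hours)
    return (1.0 - q_independent ** 2) * r_common

lam = 5e-7        # failures per hour per channel (illustrative)
hours = 1000.0    # mission time (illustrative)

q_no_ccf = 1.0 - one_oo_two_reliability(lam, hours, 0.0)
q_with_ccf = 1.0 - one_oo_two_reliability(lam, hours, 0.05)

print(f"Pair unreliability, beta = 0:    {q_no_ccf:.3e}")
print(f"Pair unreliability, beta = 0.05: {q_with_ccf:.3e}")
print(f"Ratio: {q_with_ccf / q_no_ccf:.0f}x")
```

Even a 5 % common-cause fraction raises the pair's failure probability by about two orders of magnitude here, which is why diversity and separation matter more than adding further identical channels.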

10.4 HAZOP + LOPA

  • HAZOP (Hazard and Operability Study, IEC 61882:2016) — guideword-driven team review (NO, MORE, LESS, REVERSE, AS WELL AS, PART OF, OTHER THAN) of process node-by-node deviations. Origin: ICI 1963, formalised by Kletz 1974. See chemical-process-fundamentals.
  • LOPA (Layer of Protection Analysis, IEC 61511 + CCPS 2001) — semi-quantitative SIL assignment via independent protection-layer accounting.

11. Field reliability + warranty analysis

11.1 Warranty data — what’s hard

Warranty data is interval-censored, multiply-right-censored (units still operating), left-truncated (units that failed before the warranty registered), and contaminated by non-failures (no-trouble-found, NTF rates of 20–40 % are typical in consumer electronics). Naive MTBF = total operating hours / observed failures is wrong by a factor of 2–10× in most cases.

Correct treatment: fit a Weibull or mixture-Weibull to age-at-failure data, using MLE with the censoring structure. Mixture Weibull is essential when both infant-mortality and wear-out failure modes coexist (very common — the bathtub curve is in the data, not just on the slide).

11.2 Cohort + Pareto analysis

  • Cohort analysis — group units by production date (e.g. month built) and plot failure rate vs age. Reveals manufacturing-process shifts.
  • Pareto chart — 80/20 of failure modes; combined with Ishikawa fishbone for cause categorisation (Man, Machine, Method, Material, Measurement, Environment).
  • Kaplan–Meier estimator — non-parametric survival curve from censored data. Visualises whether a Weibull is a defensible fit.
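
A Kaplan–Meier curve needs nothing beyond counting. The sketch below is a minimal product-limit estimator for right-censored data; the five observations are hypothetical, and units censored at a failure time are counted as still at risk at that time.

```python
def kaplan_meier(observations):
    """Product-limit survival estimate.

    observations: list of (time, failed) pairs; failed=False marks a
    right-censored unit (still running or withdrawn at that time).
    Returns a list of (failure_time, survival_just_after) steps.
    """
    failure_times = sorted({t for t, failed in observations if failed})
    survival = 1.0
    curve = []
    for t in failure_times:
        at_risk = sum(1 for time, _ in observations if time >= t)
        failures = sum(1 for time, failed in observations if failed and time == t)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical field data: (age in hours, True if failed / False if censored).
field_data = [(1200, True), (1500, False), (2100, True), (2600, True), (3000, False)]

km_curve = kaplan_meier(field_data)
for t, s in km_curve:
    print(f"t = {t} h: S(t) = {s:.4f}")
```

Plotting ln(−ln S(t)) against ln t from this curve gives the non-parametric counterpart of a Weibull probability plot; visible curvature suggests a single Weibull is not a defensible fit.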

11.3 FRACAS

Failure Reporting, Analysis, and Corrective Action System — MIL-STD-2155 (cancelled but referenced); GEIA-STD-0009. The closed-loop process by which field failures generate root-cause findings that update design + manufacturing controls. Implementation tools: Reliasoft XFRACAS, Relyence FRACAS, IBM Engineering Workflow Management.

11.4 Industry-specific failure databases

  • NPRD-2023 (Quanterion Non-electronic Parts Reliability Data) — mechanical and electromechanical part failure data.
  • EPRD-2014 (Electronic Parts Reliability Data).
  • OREDA (Offshore and Onshore REliability DAta, SINTEF) — oil & gas.
  • PERD (CCPS Process Equipment Reliability Database).
  • CNF / NEMS-FRACAS (NASA + commercial spaceflight).
  • NHTSA (auto), CPSC (consumer), FDA MAUDE (medical devices) — public recall + adverse event.

11.5 Customer-perceived reliability

J.D. Power IQS (Initial Quality Study, first 90 days) and VDS (Vehicle Dependability Study, 3 years) drive consumer auto-brand reliability perception independent of engineering MTBF data. Consumer Reports annual reliability survey is the consumer-electronics equivalent. These data sets are noisy but act as leading indicators of warranty cost two to five years out.

12. Tools / software

Domain | Tools
Prediction | Reliasoft Lambda Predict, Relyence Reliability Workbench, Isograph Hawk, ITEM ToolKit ITEM-217, FIDES TestBench
Weibull / life data | Reliasoft Weibull++, Minitab Reliability + Survival, JMP Life Distribution, R packages survival, fitdistrplus, WeibullR, Python lifelines, reliability
FTA / RBD | Isograph FaultTree+ / Reliability Workbench, Reliasoft BlockSim, ITEM ToolKit RBD/FTA, FaultCAT (Boeing), SAPHIRE (INL), OpenFTA, RAM Commander, RiskSpectrum (Lloyd’s)
FMEA / FMECA | Reliasoft Xfmea / RCM++, APIS IQ-FMEA, Plato AG SCIO, Siemens Teamcenter Quality, Ansys medini analyze, IQ-RM
Monte Carlo / sim | Reliasoft BlockSim, GoldSim, AnyLogic, Reliasoft RGA (growth), @RISK
HALT/HASS chambers | Qualmark Typhoon series, Espec ARS series, Thermotron AST series, Weiss Technik
CMMS / EAM | IBM Maximo, SAP PM / S/4HANA Asset Mgmt, Fiix, Maintainx, Infor EAM, Oracle eAM
APM (PdM) | GE APM, AVEVA APM, Aspen Mtell, Bentley AssetWise, Augury, Uptake
Risk / RBI | Lloyd’s RBI Toolkit (per API 581), DNV-Synergi Plant, Bentley AssetWise APM
Free / OSS | OpenFTA, OpenRBD, R reliability, Python reliability, scilab reliability toolbox

13. Cross-references

14. Citations

Textbooks (canonical)

  • O’Connor, P. & Kleyner, A. Practical Reliability Engineering, 5th ed. Wiley, 2012. ISBN 978-0-470-97981-5. Industry standard.
  • Ebeling, C. An Introduction to Reliability and Maintainability Engineering, 3rd ed. Waveland, 2019. ISBN 978-1-4786-3933-6.
  • Birolini, A. Reliability Engineering: Theory and Practice, 8th ed. Springer, 2017. ISBN 978-3-662-54208-8.
  • Modarres, M.; Kaminskiy, M.; Krivtsov, V. Reliability Engineering and Risk Analysis: A Practical Guide, 3rd ed. CRC, 2017.
  • Meeker, W. Q. & Escobar, L. A. Statistical Methods for Reliability Data, 2nd ed. Wiley, 2022. ISBN 978-1-118-11545-9.
  • Nelson, W. B. Accelerated Testing: Statistical Models, Test Plans, and Data Analyses. Wiley, 1990. Canonical ALT reference.
  • Moubray, J. Reliability-Centred Maintenance II. Industrial Press, 1997.
  • Smith, D. J. Reliability, Maintainability and Risk, 9th ed. Butterworth-Heinemann, 2017.

Foundational papers

  • Weibull, W. (1951). “A statistical distribution function of wide applicability.” J Appl Mech 18, 293–297.
  • Arrhenius, S. (1889). “Über die Reaktionsgeschwindigkeit bei der Inversion von Rohrzucker durch Säuren.” Z Phys Chem 4, 226–248.
  • Coffin, L. F. (1954). “A study of the effects of cyclic thermal stresses on a ductile metal.” Trans ASME 76, 931–950.
  • Manson, S. S. (1954). “Behavior of materials under conditions of thermal stress.” NACA TN 2933.
  • Norris, K. C. & Landzberg, A. H. (1969). “Reliability of controlled collapse interconnections.” IBM J Res Dev 13, 266–271.
  • Black, J. R. (1969). “Electromigration — a brief survey and some recent results.” IEEE Trans Electron Devices 16, 338–347.
  • Eyring, H. (1936). “Viscosity, plasticity, and diffusion as examples of absolute reaction rates.” J Chem Phys 4, 283.
  • Crow, L. H. (1974). “Reliability analysis for complex repairable systems.” US Army AMSAA Tech Report 138.
  • Duane, J. T. (1964). “Learning curve approach to reliability monitoring.” IEEE Trans Aerospace 2, 563–566.
  • Peck, D. S. (1986). “Comprehensive model for humidity testing correlation.” IRPS Proc., 44–50.

Standards

  • IEC 61508:2010 — Functional safety of E/E/PE safety-related systems (Parts 1-7).
  • IEC 60812:2018 — FMEA / FMECA procedure.
  • IEC 61025:2006 — Fault tree analysis.
  • IEC 61078:2016 — Reliability block diagrams.
  • IEC 61164:2004 — Reliability growth.
  • IEC 62308:2006 — Reliability assessment methods.
  • IEC 62380:2004 — Reliability prediction (telecom).
  • ISO 26262:2018 — Road-vehicle functional safety.
  • ARP4761A:2023 — Civil aircraft safety assessment.
  • ARP4754A:2010 — Civil aircraft system development.
  • DO-178C:2011 / DO-254:2000 — Avionics SW/HW.
  • AIAG-VDA FMEA Handbook 1st ed, 2019 — Automotive FMEA + AP.
  • MIL-HDBK-217F Notice 2 (1995) — Reliability prediction (legacy).
  • MIL-STD-1629A (1980; cancelled 1998) — FMECA.
  • Telcordia SR-332 Issue 4 (2016) — Reliability prediction (telecom).
  • JEDEC JEP122H (2016) — Failure mechanisms + models for semiconductors.
  • JEDEC JESD85 (2001, R2008) — FIT-rate calculation.
  • ANSI/VITA 51.1/51.2/51.3 (2008–2010) — 217F corrections.
  • RIAC HDBK-217 Plus (2017, Quanterion) — 217F field-data update.
  • SAE JA1011:2009 + JA1012:2002 — RCM.

Online resources

  • Reliasoft online textbook + ReliaWiki: https://www.reliawiki.com
  • Quanterion Solutions databases (NPRD, EPRD, 217Plus): https://www.quanterion.com
  • Weibull.com (Reliasoft) — practitioner articles.
  • NASA Fault Tree Handbook NASA/SP-2002-7106.
  • IAEA / NUREG nuclear PRA reports (publicly available).
