Time Series & Hidden Markov Models

Time-series analysis treats data indexed by time, where temporal ordering matters and observations are typically correlated. Hidden Markov Models (HMM) extend the framework to latent discrete states, while modern foundation models (2024-26) bring transformer-scale pretraining to forecasting. This reference covers classical Box-Jenkins, state-space + Kalman, volatility (GARCH), HMM + CRF, and the current deep + foundation-model landscape.

1. Foundations: Stationarity and Ergodicity

A stochastic process $\{X_{t}\}$ is strictly stationary if the joint distribution of $(X_{t_{1}}, \dots, X_{t_{k}})$ equals that of $(X_{t_{1} + h}, \dots, X_{t_{k} + h})$ for all $k$, all time points $t_{1}, \dots, t_{k}$, and all shifts $h$ . Weak (covariance) stationarity requires only a constant mean, a constant finite variance, and an autocovariance $γ (h) = Cov (X_{t}, X_{t + h})$ that depends only on the lag $h$ .

Ergodicity allows time averages to substitute for ensemble averages: $\frac{1}{T} \sum_{t = 1}^{T} X_{t} \to E [X_{t}]$ as $T \to \infty$ . Without ergodicity, a single realization is uninformative about the population.

Most real series are nonstationary (trend, seasonality, structural breaks). Box-Jenkins addresses this by differencing; modern methods use state-space decomposition or learn nonstationarity end-to-end.

2. ACF, PACF, and Wold Decomposition

The autocorrelation function is $ρ (h) = γ (h) / γ (0)$ . The partial autocorrelation $ϕ_{hh}$ is the correlation of $X_{t}$ and $X_{t + h}$ after removing the linear effect of intermediate lags — equivalently, the last coefficient in an AR( $h$ ) regression.
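As a minimal sketch of the definition $ρ (h) = γ (h) / γ (0)$, the sample ACF can be computed directly from a series. The function name `sample_acf` and the toy series are illustrative assumptions; the biased estimator (dividing by $n$ at every lag) is one common convention, not the only one.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelation rho(h) = gamma(h) / gamma(0) for h = 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    # Biased autocovariance estimator: divide by n (not n - h) at every lag.
    gamma0 = sum(d * d for d in dev) / n
    acf = []
    for h in range(max_lag + 1):
        gamma_h = sum(dev[t] * dev[t + h] for t in range(n - h)) / n
        acf.append(gamma_h / gamma0)
    return acf


# Hypothetical toy series, used only to exercise the function.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
rho = sample_acf(series, max_lag=3)
print([round(r, 3) for r in rho])
```

Dividing by $n$ rather than $n - h$ keeps every estimated autocorrelation within $[-1, 1]$, at the cost of some bias at large lags.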

Wold’s decomposition theorem (1938): every weakly stationary process admits a unique decomposition into a deterministic component and an MA( $\infty$ ) innovation component:

$X_{t} = μ_{t} + \sum_{j = 0}^{\infty} ψ_{j} ε_{t - j}, \quad \sum_{j = 0}^{\infty} ψ_{j}^{2} < \infty$

This justifies ARMA modeling as a parsimonious approximation to the MA( $\infty$ ) representation.

Yule-Walker equations relate AR coefficients to autocovariances: for AR( $p$ ), $γ (h) = ϕ_{1} γ (h - 1) + \dots + ϕ_{p} γ (h - p)$ for $h > 0$ . Solved via Levinson-Durbin recursion in $O (p^{2})$ .
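The recursion can be sketched as follows, assuming the autocovariances $γ (0), \dots, γ (p)$ are already available (for example from a sample estimate). The function name `levinson_durbin` and its return layout are illustrative choices, not a library API. As a by-product, the last coefficient at each order is the partial autocorrelation $ϕ_{kk}$.

```python
def levinson_durbin(gamma, p):
    """Solve the Yule-Walker equations for AR(p) in O(p^2).

    gamma: autocovariances [gamma(0), ..., gamma(p)].
    Returns (phi, sigma2, pacf): AR coefficients phi_1..phi_p,
    innovation variance, and partial autocorrelations phi_11..phi_pp.
    """
    phi = []            # coefficients of the current order (empty at order 0)
    sigma2 = gamma[0]   # prediction-error variance at order 0
    pacf = []
    for k in range(1, p + 1):
        # Reflection coefficient phi_kk from the order-(k-1) solution.
        acc = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        kappa = acc / sigma2
        # phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j}, then append phi_kk.
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        sigma2 *= 1.0 - kappa * kappa
        pacf.append(kappa)
    return phi, sigma2, pacf


# AR(1) with coefficient 0.5 and unit innovation variance:
# gamma(h) = 0.5**h / (1 - 0.5**2).
gamma = [0.5 ** h / 0.75 for h in range(3)]
phi, sigma2, pacf = levinson_durbin(gamma, p=2)
print(phi, sigma2, pacf)
```

On this AR(1) input the order-2 fit returns a first coefficient of about 0.5 and a second coefficient (the lag-2 partial autocorrelation) of about zero, which is the PACF cut-off behaviour expected of an AR(1).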

3. AR, MA, and ARMA

AR( $p$ ): $X_{t} = c + ϕ_{1} X_{t - 1} + \dots + ϕ_{p} X_{t - p} + ε_{t}$ , with $ε_{t} \sim WN (0, σ^{2})$ . Stationary iff all roots of $1 - ϕ_{1} z - \dots - ϕ_{p} z^{p} = 0$ lie outside the unit circle.

MA( $q$ ): $X_{t} = μ + ε_{t} + θ_{1} ε_{t - 1} + \dots + θ_{q} ε_{t - q}$ . Always stationary; invertible iff MA polynomial roots lie outside unit circle.

ARMA( $p, q$ ) combines both. Identification heuristic: AR cuts off in PACF at lag $p$ ; MA cuts off in ACF at lag $q$ ; ARMA tails off in both.

4. ARIMA — Box-Jenkins Methodology

Box and Jenkins (1970, “Time Series Analysis: Forecasting and Control”) formalized a three-stage workflow:

Identification: difference the series $d$ times until stationary; examine ACF/PACF to choose $(p, q)$ . ARIMA( $p, d, q$ ) applies ARMA to the $d$ -th difference $\nabla^{d} X_{t} = (1 - B)^{d} X_{t}$ , where $B$ is the backshift operator.

Estimation: maximum likelihood (Gaussian) or conditional sum of squares. Modern implementations use Kalman-filter likelihood (Harvey).

Diagnostics: Ljung-Box test on residuals (no remaining autocorrelation); QQ plots; AIC/BIC for model selection. Hyndman + Khandakar (2008) auto.arima automates the search.

5. SARIMA and ARIMAX

SARIMA( $p, d, q$ ) $\times$ ( $P, D, Q$ ) $_{s}$ adds a seasonal component with period $s$ :

$Φ_{P} (B^{s}) ϕ_{p} (B) (1 - B^{s})^{D} (1 - B)^{d} X_{t} = Θ_{Q} (B^{s}) θ_{q} (B) ε_{t}$

Standard for monthly/quarterly data (e.g., airline passenger series of Box-Jenkins).

ARIMAX adds exogenous regressors: $X_{t} = β^{'} Z_{t} + η_{t}$ , where $η_{t}$ follows ARIMA. Useful for demand forecasting with price + promotion covariates, or load forecasting with weather. R forecast::Arima and Python statsmodels.tsa.statespace.SARIMAX implement both.

6. Unit Root and Stationarity Tests

Augmented Dickey-Fuller (Dickey + Fuller 1979; Said + Dickey 1984): null hypothesis is unit root (nonstationary). Test regression $\nabla X_{t} = α + βt + γ X_{t - 1} + \sum δ_{i} \nabla X_{t - i} + ε_{t}$ ; reject $H_{0}$ if $γ$ significantly negative.

KPSS (Kwiatkowski + Phillips + Schmidt + Shin 1992): reversed null — stationarity. Use ADF + KPSS in tandem; agreement on stationarity is more convincing than either alone.

Phillips-Perron (1988): nonparametric correction to ADF for serial correlation.

7. Cointegration and VECM

Engle + Granger (1987, Nobel 2003): two nonstationary I(1) series $X_{t}, Y_{t}$ are cointegrated if a linear combination $Y_{t} - β X_{t}$ is stationary. Cointegration implies a long-run equilibrium and motivates the Vector Error Correction Model (VECM):

$Δ Y_{t} = α (Y_{t - 1} - β X_{t - 1}) + \sum Γ_{i} Δ Y_{t - i} + ε_{t}$

The error-correction term pulls the system back toward equilibrium. Johansen (1991) test estimates the cointegration rank via eigenvalues of a reduced-rank regression.

Applications: pairs trading (statistical arbitrage), term-structure modeling, PPP exchange-rate analysis.

8. VAR and Granger Causality

Vector autoregression (Sims 1980, Nobel 2011): $X_{t} = c + A_{1} X_{t - 1} + \dots + A_{p} X_{t - p} + ε_{t}$ . Each variable depends on lags of itself and other variables.

Granger causality (Granger 1969, Nobel 2003): $X$ Granger-causes $Y$ if past values of $X$ improve prediction of $Y$ beyond past values of $Y$ alone. Tested via F-test on lagged $X$ coefficients in the $Y$ equation. This is a statement about predictive content, not true causation.

Impulse response functions (IRF) trace how a shock in one variable propagates through the system; standard tool in macroeconometrics (e.g., monetary policy analysis).

9. State-Space Form and Kalman Filter

Any linear Gaussian time-series model can be written in state-space form:

$x_{t} = F_{t} x_{t - 1} + w_{t}, \quad w_{t} \sim N (0, Q_{t})$
$y_{t} = H_{t} x_{t} + v_{t}, \quad v_{t} \sim N (0, R_{t})$

The Kalman filter (Kalman 1960) recursively computes the posterior $p (x_{t} ∣ y_{1 : t})$ via predict-update:

Predict: $\hat{x}_{t ∣ t - 1} = F_{t} \hat{x}_{t - 1∣ t - 1}$ , $P_{t ∣ t - 1} = F_{t} P_{t - 1∣ t - 1} F_{t}^{'} + Q_{t}$
Update: $K_{t} = P_{t ∣ t - 1} H_{t}^{'} (H_{t} P_{t ∣ t - 1} H_{t}^{'} + R_{t})^{- 1}$ (Kalman gain), $\hat{x}_{t ∣ t} = \hat{x}_{t ∣ t - 1} + K_{t} (y_{t} - H_{t} \hat{x}_{t ∣ t - 1})$ , $P_{t ∣ t} = (I - K_{t} H_{t}) P_{t ∣ t - 1}$

The Kalman smoother (Rauch-Tung-Striebel 1965) computes $p (x_{t} ∣ y_{1 : T})$ by backward recursion. The filter also yields the innovation form likelihood, used for ML estimation of structural parameters.
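The predict-update cycle can be sketched for the scalar case, where $F$, $H$, $Q$, $R$ are time-invariant numbers. This is a simplified illustration under those assumptions, not a general implementation; the function name, defaults, and the toy observations are hypothetical. The covariance update used is the standard $P_{t ∣ t} = (1 - K_{t} H) P_{t ∣ t - 1}$.

```python
def kalman_filter_1d(ys, q, r, x0=0.0, p0=1.0, f=1.0, h=1.0):
    """Scalar Kalman filter with constant f, h, q, r.

    Returns the filtered means and variances of x_t given y_1..y_t.
    """
    x, p = x0, p0
    means, variances = [], []
    for y in ys:
        # Predict: propagate the previous posterior through the state equation.
        x_pred = f * x
        p_pred = f * p * f + q
        # Update: weight the innovation (y - h * x_pred) by the Kalman gain.
        gain = p_pred * h / (h * p_pred * h + r)
        x = x_pred + gain * (y - h * x_pred)
        p = (1.0 - gain * h) * p_pred
        means.append(x)
        variances.append(p)
    return means, variances


# Hypothetical noisy readings of a slowly drifting level (local-level model).
observations = [1.1, 0.9, 1.2, 1.0, 0.8, 1.3]
means, variances = kalman_filter_1d(observations, q=0.01, r=0.25)
print([round(m, 3) for m in means])
```

With $f = h = 1$ this is the local-level model; setting `q=0` reduces the filter to a recursive estimate of a constant, whose posterior variance shrinks with every observation.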

See bayesian-estimation for the robotics view (sensor fusion, SLAM via EKF/UKF/particle filter). Extensions: Extended KF linearizes nonlinear $f, h$ ; Unscented KF (Julier + Uhlmann 1997) propagates sigma points; Particle filter for arbitrary non-Gaussian models.

Software: Python pykalman, statsmodels.tsa.statespace, filterpy; R KFAS, dlm; MATLAB Econometrics Toolbox.

10. ETS — Exponential Smoothing

Simple exponential smoothing (Brown 1956; Holt 1957): $\hat{X}_{t + 1} = α X_{t} + (1 - α) \hat{X}_{t}$ , level-only.
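The recursion $\hat{X}_{t + 1} = α X_{t} + (1 - α) \hat{X}_{t}$ is short enough to write out in full. In this sketch the initial forecast defaults to the first observation, which is one simple convention and an assumption here; estimated initial states are also common. The function name is illustrative.

```python
def ses_forecasts(x, alpha, init=None):
    """One-step-ahead simple exponential smoothing forecasts.

    Returns a list of length len(x) + 1: element t is the forecast of x[t]
    made before seeing it, and the last element forecasts the next value.
    """
    level = x[0] if init is None else init
    forecasts = [level]
    for obs in x:
        level = alpha * obs + (1.0 - alpha) * level
        forecasts.append(level)
    return forecasts


print(ses_forecasts([10.0, 12.0, 11.0], alpha=0.5))
```

Larger `alpha` weights recent observations more heavily; `alpha=1` reproduces the naive forecast and `alpha` near 0 approaches a long-run average.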

Holt’s linear trend (1957): adds slope. Holt-Winters (1960): adds seasonality (additive or multiplicative).

ETS taxonomy (Hyndman + Koehler + Snyder + Grose 2002) classifies models by Error/Trend/Seasonal components: ETS(A,N,N) = SES, ETS(A,A,A) = Holt-Winters additive, ETS(M,Ad,M) = multiplicative damped, etc. Total of 30 models. Each has a state-space form with closed-form likelihood.

Hyndman + Athanasopoulos, “Forecasting: Principles and Practice”, 3rd ed (2021) is the standard reference; R forecast::ets, fable::ETS; Python statsmodels.tsa.holtwinters + sktime + statsforecast (Nixtla’s automated version).

ETS often beats sophisticated ML on the M-competitions (Makridakis); the M4 competition (2018) showed hybrid ETS+RNN (Smyl) winning, with pure-ML methods often underperforming statistical baselines.

11. Structural Time Series

Harvey (1989, “Forecasting, Structural Time Series Models, and the Kalman Filter”) decomposes a series into interpretable unobserved components:

$X_{t} = μ_{t} + τ_{t} + γ_{t} + ε_{t}$

(level, trend, seasonal, irregular), each with its own stochastic equation. Estimated by Kalman filter ML. Variants: local level, local linear trend, basic structural model (BSM), unobserved-components model (UCM).

Bayesian structural time series (BSTS, Scott + Varian 2014) adds spike-and-slab regression for variable selection over many candidate predictors; used at Google for causal-impact analysis (CausalImpact R package).

12. GARCH and Volatility

ARCH (Engle 1982, Nobel 2003) models conditional heteroskedasticity: $σ_{t}^{2} = ω + \sum α_{i} ε_{t - i}^{2}$ .

GARCH( $p, q$ ) (Bollerslev 1986) adds lagged variance: $σ_{t}^{2} = ω + \sum_{i = 1}^{q} α_{i} ε_{t - i}^{2} + \sum_{j = 1}^{p} β_{j} σ_{t - j}^{2}$ . GARCH(1,1) suffices for most financial returns. Stationarity requires $\sum α_{i} + \sum β_{j} < 1$ .
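For GARCH(1,1) the variance recursion is $σ_{t}^{2} = ω + α ε_{t - 1}^{2} + β σ_{t - 1}^{2}$. The sketch below filters a given residual series through that recursion with known parameters; it does not estimate them. Starting the recursion at the unconditional variance $ω / (1 - α - β)$ is an assumed initialization, and the function name and parameter values are illustrative.

```python
def garch11_variance(eps, omega, alpha, beta):
    """Conditional variances from a GARCH(1,1) recursion with known parameters.

    Returns a list of length len(eps) + 1: element t is the variance of eps[t]
    given the past, and the last element is the one-step-ahead forecast.
    """
    if omega <= 0 or alpha < 0 or beta < 0:
        raise ValueError("need omega > 0 and alpha, beta >= 0")
    if alpha + beta >= 1:
        raise ValueError("covariance stationarity needs alpha + beta < 1")
    # Assumed initialization: the unconditional (long-run) variance.
    sigma2 = [omega / (1.0 - alpha - beta)]
    for e in eps:
        sigma2.append(omega + alpha * e * e + beta * sigma2[-1])
    return sigma2


# Hypothetical residuals: a calm period followed by one large shock.
residuals = [0.1, -0.2, 0.1, 3.0, 0.2, -0.1]
print([round(s, 3) for s in garch11_variance(residuals, 0.1, 0.1, 0.8)])
```

After the large shock the conditional variance jumps and then decays geometrically at rate $α + β$ toward the long-run level, which is the volatility-clustering behaviour the model is meant to capture.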

Asymmetric variants:

EGARCH (Nelson 1991): models $\log σ_{t}^{2}$ , so no positivity constraint is needed; captures the leverage effect (negative shocks raise volatility more than positive shocks of equal magnitude).
GJR-GARCH (Glosten + Jagannathan + Runkle 1993): adds $γ I_{ε < 0} ε^{2}$ term.
TGARCH (Zakoian 1994): threshold version using $∣ ε ∣$ .

Multivariate: DCC-GARCH (Engle 2002), BEKK (Engle + Kroner 1995). Used for portfolio VaR and correlation forecasting.

Software: R rugarch + rmgarch; Python arch (Sheppard); Eviews; MATLAB econometrics. See derivatives-and-quant-finance for risk-management context.

13. Hidden Markov Models

A discrete-state HMM has unobserved states $S_{t} \in \{1, \dots, K\}$ following a Markov chain with transition matrix $A_{ij} = P (S_{t} = j ∣ S_{t - 1} = i)$ , and observations $O_{t}$ generated from $S_{t}$ via emission distribution $B_{j} (o) = P (O_{t} = o ∣ S_{t} = j)$ . Parameters: $λ = (A, B, π)$ where $π$ is the initial-state distribution.

Rabiner (1989, “A Tutorial on HMMs and Selected Applications in Speech Recognition”) is the canonical reference, framing three problems:

Evaluation: compute $P (O ∣ λ)$ . Solved by the forward algorithm: $α_{t} (j) = \sum_{i} α_{t - 1} (i) A_{ij} B_{j} (O_{t})$ , in $O (K^{2} T)$ .
Decoding: find the most likely state sequence. Solved by Viterbi (1967): $δ_{t} (j) = \max_{i} [δ_{t - 1} (i) A_{ij}] \cdot B_{j} (O_{t})$ , backtrack through argmax. $O (K^{2} T)$ .
Learning: estimate $λ$ from observations. Solved by Baum-Welch (Baum + Petrie + Soules + Weiss 1970), an EM algorithm using forward-backward probabilities $α_{t}, β_{t}$ to compute expected sufficient statistics.
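The evaluation and decoding recursions can be sketched for a discrete-emission HMM as below. States and symbols are 0-indexed integers, `B[j][o]` is the probability of symbol `o` in state `j`, and the two-state parameters are a made-up toy example; none of the names correspond to a library API.

```python
def forward_likelihood(obs, pi, A, B):
    """P(O | lambda) via the forward recursion, O(K^2 T)."""
    K = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(K)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(K)) * B[j][o]
                 for j in range(K)]
    return sum(alpha)


def viterbi(obs, pi, A, B):
    """Most likely state path and its joint probability, O(K^2 T)."""
    K = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(K)]
    backptr = []  # backptr[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backptr.append(step)
        delta = new_delta
    # Backtrack from the best final state through the stored argmaxes.
    state = max(range(K), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for step in reversed(backptr):
        state = step[state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Toy two-state model with three observation symbols (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]
print(forward_likelihood(obs, pi, A, B))
print(viterbi(obs, pi, A, B))
```

Raw probability products underflow on long sequences, so practical implementations rescale $α_{t}$ at each step or work in log space; the sketch omits that for readability.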

Applications: speech recognition (pre-deep-learning era — HTK toolkit, Kaldi GMM-HMM), POS tagging, gene finding (HMMER + GeneMark + Augustus), regime-switching in finance, activity recognition from wearables, fault detection in industrial systems.

Hidden semi-Markov model (HSMM): gives each state an explicit duration distribution, relaxing the geometric state-duration distribution implied by a standard HMM.

Input-output HMM, factorial HMM, hierarchical HMM generalize for structured states.

Software: Python hmmlearn, pomegranate; R depmixS4, HMM; classic C: HTK, GHMM.

14. Conditional Random Fields

Lafferty + McCallum + Pereira (2001) introduced linear-chain CRFs as the discriminative cousin of HMMs. Where HMM models joint $P (O, S)$ , CRF models conditional $P (S ∣ O)$ directly via:

$P (S ∣ O) = \frac{1}{Z (O)} \exp (\sum_{t} \sum_{k} λ_{k} f_{k} (S_{t - 1}, S_{t}, O, t))$

Advantages: rich, overlapping features on $O$ without the independence assumptions HMM requires; better accuracy on sequence labeling (NER, POS, chunking) in the pre-deep era. Trained by gradient methods on conditional log-likelihood (forward-backward over feature expectations).

Modern NLP largely supplanted CRF with BiLSTM-CRF (Lample 2016) and then transformers (transformer-architecture), but CRF still appears as the top decoding layer in some structured-prediction tasks.

15. RNNs and LSTMs for Sequences

Recurrent neural networks maintain a hidden state $h_{t} = f (W h_{t - 1} + U x_{t} + b)$ . Vanilla RNNs suffer from vanishing/exploding gradients over long sequences.

LSTM (Hochreiter + Schmidhuber 1997) introduces gated memory cells (input and output gates in the original paper; the forget gate was added by Gers + Schmidhuber + Cummins 2000) and a cell-state highway, enabling learning of long-range dependencies. GRU (Cho 2014) is a simplified variant with reset/update gates.

For time-series forecasting: DeepAR (Salinas + Flunkert + Gasthaus 2017, Amazon) uses LSTM with probabilistic outputs (Gaussian/negative-binomial) and Monte Carlo sampling for prediction intervals. Available as a built-in algorithm in Amazon SageMaker. Strong on retail demand at scale.

Seq2seq with attention (Bahdanau 2015) and N-BEATS (Oreshkin 2019) were standard before transformer-based approaches dominated.

16. Transformers for Time Series

The transformer’s attention mechanism handles long-range dependencies in parallel, but vanilla transformers scale $O (L^{2})$ in sequence length and struggle with the inductive biases of TS (trend, seasonality, scale).

Key architectures (2020-23):

Informer (Zhou 2021): ProbSparse attention for $O (L \log L)$ .
Autoformer (Wu 2021): series decomposition + auto-correlation block.
FEDformer (Zhou 2022): frequency-domain attention.
PatchTST (Nie 2023): patches the series like ViT does images; subseries-level tokens.
iTransformer (Liu 2024): inverted attention — each variate is a token, attention across variates rather than across time. Strong on multivariate benchmarks.
TimeMixer (Wang 2024): pure MLP-based mixer with decomposable multiscale paths; competitive with transformers at lower cost.

See transformer-architecture for architectural fundamentals.

17. Foundation Models for Time Series (2024-26)

Following the LLM revolution, the field has produced pretrained zero-shot forecasters trained on massive heterogeneous TS corpora:

Lag-Llama (Rasul + Ashok 2024, ServiceNow + Mila): decoder-only transformer with lag features as covariates; first openly released TS foundation model. Probabilistic outputs via Student-t head.
Chronos (Ansari 2024, Amazon): tokenizes scaled + quantized values, then trains a T5-style encoder-decoder on synthetic + real TS. Strong zero-shot. Released as chronos-t5-{tiny,small,base,large} on HuggingFace. Chronos-Bolt (2024) is a faster variant that, per its model card, keeps the T5 encoder-decoder but feeds patches of observations to the encoder and decodes multi-step quantile forecasts directly.
TimesFM (Das + Kong + Sen + Zhou 2024, Google Research): decoder-only patched-input model trained on Google Trends + Wiki Pageviews + synthetic data. 200M params. v2.0 (late 2024) improved long-horizon performance; available on HuggingFace.
Moirai (Woo + Liu 2024, Salesforce AI Research): masked encoder transformer with multi-patch projections; moirai-1.0-R-{small,base,large}. Moirai-MoE (2024) uses mixture-of-experts for scale efficiency.
TimeGPT (Garza + Mergenthaler-Canseco 2023-25, Nixtla): commercial API; first widely available TS foundation model. Recent TimeGPT-1 long-horizon variants integrated into nixtla Python client.
Toto (Datadog 2024) and TabPFN-TS (Hoo 2025) extend the foundation-model paradigm.

These models offer zero/few-shot forecasting without retraining per series. Benchmarks such as GIFT-Eval and BasicTS show foundation models competitive with task-specific deep models but rarely beating well-tuned classical methods on small, regular series, consistent with the M-competition pattern.

18. Change-Point and Anomaly Detection

CUSUM (Page 1954) detects shifts in mean: $S_{t} = \max (0, S_{t - 1} + X_{t} - μ_{0} - k)$ ; alarm when $S_{t} > h$ .
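A minimal sketch of the one-sided (upward) CUSUM statistic, with $μ_{0}$ the in-control mean, $k$ the allowance, and $h$ the alarm threshold. The function name and the toy series are illustrative; the detector here only records the first alarm and keeps accumulating afterwards, which is a simplifying assumption.

```python
def cusum_upper(x, mu0, k, h):
    """One-sided CUSUM for an upward mean shift.

    Returns (stats, alarm_at): the statistic S_t at every step and the
    0-based index of the first step with S_t > h (None if no alarm).
    """
    s = 0.0
    stats = []
    alarm_at = None
    for t, value in enumerate(x):
        # Accumulate deviations above mu0 + k; reset to zero when negative.
        s = max(0.0, s + value - mu0 - k)
        stats.append(s)
        if alarm_at is None and s > h:
            alarm_at = t
    return stats, alarm_at


# Hypothetical series: mean 0 for three steps, then a shift to 2.
series = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0]
stats, alarm_at = cusum_upper(series, mu0=0.0, k=0.5, h=4.0)
print(stats, alarm_at)
```

The allowance $k$ is often set to half the shift one wants to detect, and $h$ trades detection delay against false-alarm rate; a mirrored statistic handles downward shifts.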

Bayesian online change-point detection (Adams + MacKay 2007): maintains posterior over run length $r_{t}$ via recursive updates; flexible to nonstationary regimes.

PELT (Killick 2012): pruned exact linear-time multiple change-point detection.

Modern: ruptures (Python), changepoint (R), bocpd implementations.

Anomaly detection:

Statistical: STL decomposition + residual thresholding (Cleveland 1990); Twitter’s AnomalyDetection package.
Isolation forest (Liu + Ting + Zhou 2008): random tree partitioning; anomalies isolated in fewer splits.
Autoencoder-based: reconstruction error as anomaly score (e.g., LSTM autoencoder for multivariate; VAE for probabilistic anomaly).
Matrix profile (Yeh + Keogh 2016, stumpy): nearest-neighbor distance over sliding windows.

Used in IT operations (Datadog Watchdog, Dynatrace), fraud detection, manufacturing QC.

19. Spectral Methods

Periodogram estimates spectral density via squared FFT magnitudes. Welch’s method (1967) averages periodograms over overlapping segments; multitaper (Thomson 1982) uses orthogonal tapers for variance reduction.
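To make the periodogram concrete, the sketch below evaluates the discrete Fourier transform directly and returns $∣ \text{DFT}_{k} ∣^{2} / n$ at the frequencies $k / n$ for $k = 0, \dots, n / 2$. The $1 / n$ scaling is one of several conventions, the direct sum is $O (n^{2})$ (an FFT would be used in practice), and the function name and test signal are illustrative assumptions.

```python
import cmath
import math


def periodogram(x):
    """Raw periodogram |DFT_k|^2 / n at frequencies k / n, k = 0..n//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]  # remove the mean so k = 0 carries no power
    power = []
    for k in range(n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centered))
        power.append(abs(coef) ** 2 / n)
    return power


# Sine wave completing 4 cycles over 32 samples: power concentrates at k = 4.
n = 32
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
power = periodogram(signal)
peak = max(range(len(power)), key=lambda k: power[k])
print(peak)
```

The raw periodogram is a noisy estimate whose variance does not shrink with $n$, which is what motivates the segment averaging and tapering approaches described above.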

Lomb-Scargle (Lomb 1976; Scargle 1982) generalizes the periodogram to irregularly-sampled series — essential in astronomy (variable-star light curves), environmental monitoring with missing data.

Wavelet transforms (Daubechies 1988; Mallat 1989) provide time-frequency localization: short scales for high frequencies, long scales for low. Discrete wavelet transform (DWT) for compression/denoising; continuous (CWT) for analysis. Used in finance for multiresolution volatility, biomedical signals (EEG/ECG), and climate.

20. Forecasting Metrics

Point-forecast accuracy:

MAE $= \frac{1}{n} \sum ∣ y_{t} - \hat{y}_{t} ∣$ (robust, in original units).
RMSE $= \sqrt{\frac{1}{n} \sum (y_{t} - \hat{y}_{t})^{2}}$ (penalizes large errors more).
MAPE $= \frac{1}{n} \sum ∣ (y_{t} - \hat{y}_{t}) / y_{t} ∣$ (scale-free but undefined at zero, asymmetric).
sMAPE $= \frac{1}{n} \sum 2 ∣ y_{t} - \hat{y}_{t} ∣ / (∣ y_{t} ∣ + ∣ \hat{y}_{t} ∣)$ (symmetric variant).
MASE (Hyndman + Koehler 2006) $= MAE / MAE_{naive}$ , with $MAE_{naive}$ the in-sample MAE of the naive forecast (scale-free, well-behaved at zero, used alongside sMAPE in the M4 competition).
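The point-forecast metrics can be sketched in a few lines. RMSE here takes the square root of the mean squared error, and MASE scales by the in-sample MAE of the (seasonal) naive forecast on a training series; the argument names `y_train` and `m` (seasonal period) are assumptions introduced for the example.

```python
import math


def mae(y, yhat):
    return sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, yhat)) / len(y))


def mape(y, yhat):
    # Undefined when any actual value is zero.
    return sum(abs((a - f) / a) for a, f in zip(y, yhat)) / len(y)


def smape(y, yhat):
    return sum(2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(y, yhat)) / len(y)


def mase(y, yhat, y_train, m=1):
    # Scale: in-sample MAE of the naive forecast that repeats the value m steps back.
    scale = sum(abs(y_train[t] - y_train[t - m])
                for t in range(m, len(y_train))) / (len(y_train) - m)
    return mae(y, yhat) / scale


# Hypothetical actuals, forecasts, and training history.
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 330.0]
history = [90.0, 100.0, 110.0, 100.0]
print(mae(actual, forecast), rmse(actual, forecast), mase(actual, forecast, history))
```

A MASE below 1 means the forecast beats the in-sample naive benchmark on average; values above 1 mean it does worse.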

Probabilistic-forecast accuracy:

CRPS (continuous ranked probability score): $\int (F (z) - 1_{z \geq y})^{2} d z$ — proper scoring rule, generalizes MAE to distributions.
Pinball loss at quantile $τ$ : $ρ_{τ} (y - \hat{q}_{τ}) = (y - \hat{q}_{τ}) (τ - 1_{y < \hat{q}_{τ}})$ ; the basis for quantile regression and quantile-loss DL training.
Log-score (negative log predictive density), interval coverage at level $α$ , Winkler score for interval sharpness + coverage.
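Two of these probabilistic metrics are simple enough to sketch directly: the pinball loss for a single quantile forecast and the empirical coverage of a set of prediction intervals. Function names and the numbers in the usage lines are illustrative assumptions.

```python
def pinball_loss(y, q_hat, tau):
    """Pinball (quantile) loss of forecast q_hat for quantile level tau."""
    diff = y - q_hat
    indicator = 1.0 if diff < 0 else 0.0  # 1 when the outcome falls below the forecast
    return diff * (tau - indicator)


def interval_coverage(y, lower, upper):
    """Fraction of outcomes that fall inside their prediction intervals."""
    hits = sum(1 for a, lo, hi in zip(y, lower, upper) if lo <= a <= hi)
    return hits / len(y)


# For tau = 0.9, under-forecasting is penalized 9x more than over-forecasting.
print(pinball_loss(10.0, 8.0, tau=0.9))   # outcome above the forecast quantile
print(pinball_loss(10.0, 12.0, tau=0.9))  # outcome below the forecast quantile
print(interval_coverage([1.0, 5.0, 9.0], [0.0, 6.0, 8.0], [2.0, 7.0, 10.0]))
```

At $τ = 0.5$ the pinball loss is half the absolute error, which is why the median forecast minimizes MAE; empirical coverage should be compared against the nominal level of the intervals.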

21. Cross-Validation for Time Series

Standard K-fold violates temporal order. Use:

Rolling-origin (walk-forward) CV: train on $[1, t]$ , forecast $[t + 1, t + h]$ , advance $t$ . Most realistic.
Sliding window: fixed-size training window slides through time.
Blocked / purged K-fold (López de Prado 2018): gap between train and test to prevent leakage; standard in quant finance to avoid label-overlap leakage.
Time-series split (sklearn.model_selection.TimeSeriesSplit).
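The rolling-origin and sliding-window schemes differ only in whether the training window grows or keeps a fixed size, so one generator can sketch both. The function name and its parameters (`initial`, `horizon`, `step`, `window`) are illustrative assumptions, not the scikit-learn API.

```python
def rolling_origin_splits(n, initial, horizon, step=1, window=None):
    """Yield (train_idx, test_idx) pairs that respect temporal order.

    n: series length; initial: size of the first training set;
    horizon: forecast length; step: how far the origin advances;
    window: if set, keep only the last `window` training points (sliding window).
    """
    origin = initial
    while origin + horizon <= n:
        start = 0 if window is None else max(0, origin - window)
        yield list(range(start, origin)), list(range(origin, origin + horizon))
        origin += step


# Expanding window: training set grows as the origin advances.
for train, test in rolling_origin_splits(10, initial=6, horizon=2, step=2):
    print(train, test)

# Sliding window: training set fixed at the 4 most recent points.
for train, test in rolling_origin_splits(10, initial=6, horizon=2, step=2, window=4):
    print(train, test)
```

Every test index is strictly later than every training index, which is the property ordinary K-fold breaks; a purged variant would additionally drop training points near the split boundary.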

Diebold-Mariano test (1995) compares forecast accuracy of two methods on a common test set under a chosen loss function.

22. Hierarchical Forecasting and Reconciliation

Forecasts at multiple levels (product → category → total; store → region → country) must be coherent (children sum to parent).

MinT (Wickramasuriya + Athanasopoulos + Hyndman 2019): minimum-trace optimal reconciliation. Project base forecasts $\hat{y}$ onto the coherent subspace via $\tilde{y} = S G \hat{y}$ with the covariance-weighted matrix $G = (S^{'} W^{- 1} S)^{- 1} S^{'} W^{- 1}$ , where $S$ is the summing matrix and $W$ is the covariance matrix of the base-forecast errors. Typically improves accuracy across levels relative to bottom-up or top-down.
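The projection can be sketched with NumPy for a two-series hierarchy (total = A + B). Reconciled forecasts are $S G \hat{y}$ with $G = (S^{'} W^{- 1} S)^{- 1} S^{'} W^{- 1}$, where $\hat{y}$ stacks the base forecasts for all levels. With `W` left as the identity this reduces to OLS reconciliation, a special case; MinT proper plugs in an estimate of the base-forecast error covariance, which this sketch does not compute. The function name and the numbers are illustrative assumptions.

```python
import numpy as np


def reconcile(base, S, W=None):
    """Project base forecasts onto the coherent subspace: S @ G @ base.

    base: base forecasts for all series, ordered like the rows of S.
    S: summing matrix mapping bottom-level series to every series.
    W: weight (error covariance) matrix; identity gives OLS reconciliation.
    """
    S = np.asarray(S, dtype=float)
    base = np.asarray(base, dtype=float)
    W = np.eye(S.shape[0]) if W is None else np.asarray(W, dtype=float)
    W_inv = np.linalg.inv(W)
    # G = (S' W^-1 S)^-1 S' W^-1, computed with a linear solve.
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
    return S @ G @ base


# Rows: total, A, B. Columns: bottom-level series A, B.
S = [[1, 1],
     [1, 0],
     [0, 1]]
base = [10.0, 4.0, 5.0]  # incoherent: 4 + 5 != 10
reconciled = reconcile(base, S)
print(reconciled, reconciled[1] + reconciled[2])
```

The reconciled bottom-level values sum exactly to the reconciled total, and the adjustment is spread across all three series rather than being forced onto one level as in bottom-up or top-down.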

Software: R fable::reconcile, hts (deprecated), fabletools; Python hierarchicalforecast (Nixtla). Used heavily in retail demand and energy load forecasting.

23. Software Landscape

R ecosystem:

forecast (Hyndman, classic): ARIMA, ETS, auto.arima.
fable + fabletools + tsibble + feasts (tidyverts, Hyndman): tidy/tibble-based replacement.
prophet (Facebook 2017, Taylor + Letham): generalized additive model with piecewise trend, Fourier seasonality, holidays; widely used for business forecasting despite criticism of underperformance vs. ETS/ARIMA.
rugarch, rmgarch: GARCH family.
KFAS, dlm, bsts: state-space and structural.
vars, urca: VAR + unit-root.
changepoint, bcp: change-point detection.

Python ecosystem:

statsmodels.tsa: ARIMA, SARIMAX, state-space, VAR, exponential smoothing.
sktime: scikit-learn-compatible TS framework with classification, forecasting, regression.
Nixtla suite: statsforecast (fast classical), neuralforecast (PyTorch DL), mlforecast (gradient boosting), hierarchicalforecast, nixtla (TimeGPT client).
arch (Sheppard): GARCH family.
pykalman, filterpy: Kalman.
hmmlearn, pomegranate: HMM.
prophet: Facebook’s GAM-based forecaster.
darts (Unit8): unified DL + classical interface.
gluonts (Amazon): probabilistic DL forecasting; DeepAR + Chronos + transformer models.
tsfresh: automated feature extraction.
stumpy: matrix profile.

Bayesian: PyMC, Stan, numpyro for structural Bayesian TS; CausalImpact (R/Python ports) for BSTS-based causal inference.

24. Applications

Finance

Returns prediction is famously hard (efficient market); volatility prediction (GARCH family) is the workhorse. Used for risk management, option pricing inputs, VaR/ES, portfolio construction. Realized volatility from high-frequency data (Andersen + Bollerslev 1998) and HAR-RV (Corsi 2009) dominate practical forecasting. See derivatives-and-quant-finance.

Macroeconomics and Nowcasting

Banbura + Giannone + Reichlin (2013) survey of nowcasting GDP from mixed-frequency, ragged-edge data. Dynamic factor models (Stock + Watson 2002): a few latent factors driving many observed series. Used by central banks (ECB, NY Fed) and BEA for real-time GDP estimates.

Demand Forecasting

Retail (Amazon, Walmart, Target): hierarchical SKU forecasts feeding inventory and supply chain. Methods range from ETS/ARIMA per series to single global DeepAR/Chronos models. M5 competition (2020, Walmart data) showed gradient boosting (LightGBM) winning at scale.

Energy Load

ISO/RTO operational forecasting (PJM, CAISO, MISO, ERCOT, NYISO, ISO-NE, SPP): day-ahead and intraday load + wind + solar. Methods: SARIMAX + weather + calendar dummies; gradient boosting; neural models (e.g., NeuralProphet, and the transformer-based TFT). See electricity-markets.

Epidemiology

ARIMA for syndromic surveillance; SIR/SEIR compartmental models for outbreaks; Bayesian state-space epidemic models. COVID-19 sparked rapid development of forecast hubs (US CDC ForecastHub, ECDC Hub) ensembling dozens of models.

Climate

Wavelet + spectral methods, attribution analysis, ENSO/PDO indices, paleoclimate reconstruction. Foundation models (ClimaX, GraphCast, Aurora) now compete with NWP. Covered in ClimateScience reference.

Industrial — Predictive Maintenance and RUL

Remaining useful life (RUL) estimation: Cox proportional-hazards model, Weibull AFT, gamma-process degradation, particle filters, LSTM/transformer on sensor streams. C-MAPSS benchmark (NASA turbofan). Deployed at GE Aviation, Siemens Energy, Caterpillar.

25. Pitfalls

Look-ahead bias / data leakage: using information unavailable at prediction time. Especially insidious in feature engineering (e.g., normalizing with full-sample mean), label construction, and split design.
Non-stationarity ignored: regressions on integrated series produce spurious correlations (Granger + Newbold 1974) — high $R^{2}$ , low Durbin-Watson, no real relationship.
Overfitting: TS has fewer effective samples than naive $n$ would suggest. AIC/BIC, time-series CV, parsimony.
Backtest overfitting (Bailey + López de Prado 2014): deflated Sharpe; minimum backtest length; the more strategies tested, the more skeptical to be.
Survivorship bias: especially in financial backtests using current index constituents.
MAPE asymmetry and zero division: prefer MASE.
Forecast horizon mismatch: validation horizon must match operational horizon.
Calendar pitfalls: timezone, DST, business calendars, holiday boundary effects (e.g., moving Lunar New Year).
Model-staleness in non-stationary regimes: rolling refit; concept-drift detection.

26. Cross-References

probability-fundamentals — joint/conditional distributions, Markov chains, stochastic processes.
bayesian-inference — Bayesian filtering, BSTS, PyMC structural models.
ode-numerical-methods — continuous-time analogs (SDEs, Kalman-Bucy filter, compartmental models for epidemiology).
bayesian-estimation — Kalman/EKF/UKF/particle filters for state estimation; shared mathematical foundation with TS state-space.
derivatives-and-quant-finance — GARCH for option pricing inputs, cointegration for pairs trading, realized volatility, factor models.
electricity-markets — operational load + renewables forecasting, day-ahead price forecasting.
transformer-architecture — attention foundations used in TS transformers and foundation models.

27. Citations

Box, G.E.P., Jenkins, G.M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day.
Wold, H. (1938). A Study in the Analysis of Stationary Time Series. Almqvist & Wiksell.
Dickey, D.A., Fuller, W.A. (1979). “Distribution of the estimators for autoregressive time series with a unit root.” JASA 74:427-431.
Kwiatkowski, D., Phillips, P.C.B., Schmidt, P., Shin, Y. (1992). “Testing the null hypothesis of stationarity…” J. Econometrics 54:159-178.
Engle, R.F., Granger, C.W.J. (1987). “Co-integration and error correction.” Econometrica 55:251-276. (Nobel 2003)
Sims, C.A. (1980). “Macroeconomics and reality.” Econometrica 48:1-48. (Nobel 2011)
Granger, C.W.J. (1969). “Investigating causal relations by econometric models and cross-spectral methods.” Econometrica 37:424-438.
Kalman, R.E. (1960). “A new approach to linear filtering and prediction problems.” J. Basic Engineering 82:35-45.
Rauch, H.E., Tung, F., Striebel, C.T. (1965). “Maximum likelihood estimates of linear dynamic systems.” AIAA J. 3:1445-1450.
Julier, S.J., Uhlmann, J.K. (1997). “A new extension of the Kalman filter to nonlinear systems.” SPIE.
Hyndman, R.J., Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd ed. OTexts. https://otexts.com/fpp3/
Hyndman, R.J., Koehler, A.B., Snyder, R.D., Grose, S. (2002). “A state space framework for automatic forecasting using exponential smoothing methods.” Int. J. Forecasting 18:439-454.
Hyndman, R.J., Khandakar, Y. (2008). “Automatic time series forecasting: the forecast package for R.” JSS 27.
Harvey, A.C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge.
Engle, R.F. (1982). “Autoregressive conditional heteroscedasticity…” Econometrica 50:987-1007. (Nobel 2003)
Bollerslev, T. (1986). “Generalized autoregressive conditional heteroskedasticity.” J. Econometrics 31:307-327.
Nelson, D.B. (1991). “Conditional heteroskedasticity in asset returns: a new approach.” Econometrica 59:347-370.
Glosten, L.R., Jagannathan, R., Runkle, D.E. (1993). “On the relation between the expected value and the volatility of the nominal excess return on stocks.” J. Finance 48:1779-1801.
Rabiner, L.R. (1989). “A tutorial on hidden Markov models and selected applications in speech recognition.” Proc. IEEE 77:257-286.
Baum, L.E., Petrie, T., Soules, G., Weiss, N. (1970). “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.” Ann. Math. Stat. 41:164-171.
Viterbi, A.J. (1967). “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm.” IEEE Trans. Inf. Theory 13:260-269.
Lafferty, J., McCallum, A., Pereira, F. (2001). “Conditional random fields: probabilistic models for segmenting and labeling sequence data.” ICML.
Hochreiter, S., Schmidhuber, J. (1997). “Long short-term memory.” Neural Computation 9:1735-1780.
Salinas, D., Flunkert, V., Gasthaus, J. (2020). “DeepAR: probabilistic forecasting with autoregressive recurrent networks.” Int. J. Forecasting 36:1181-1191. (orig. 2017)
Zhou, H. et al. (2021). “Informer: beyond efficient transformer for long sequence time-series forecasting.” AAAI.
Wu, H. et al. (2021). “Autoformer: decomposition transformers with auto-correlation for long-term series forecasting.” NeurIPS.
Nie, Y. et al. (2023). “A time series is worth 64 words: long-term forecasting with transformers.” ICLR.
Liu, Y. et al. (2024). “iTransformer: inverted transformers are effective for time series forecasting.” ICLR.
Wang, S. et al. (2024). “TimeMixer: decomposable multiscale mixing for time series forecasting.” ICLR.
Rasul, K., Ashok, A. et al. (2024). “Lag-Llama: towards foundation models for probabilistic time series forecasting.” arXiv 2310.08278.
Ansari, A.F. et al. (2024). “Chronos: learning the language of time series.” TMLR. Amazon.
Das, A., Kong, W., Sen, R., Zhou, Y. (2024). “A decoder-only foundation model for time-series forecasting (TimesFM).” ICML. Google.
Woo, G., Liu, C. et al. (2024). “Unified training of universal time series forecasting transformers (Moirai).” ICML. Salesforce.
Garza, A., Mergenthaler-Canseco, M. (2023). “TimeGPT-1.” arXiv 2310.03589. Nixtla.
Adams, R.P., MacKay, D.J.C. (2007). “Bayesian online changepoint detection.” arXiv 0710.3742.
Killick, R., Fearnhead, P., Eckley, I.A. (2012). “Optimal detection of changepoints with a linear computational cost (PELT).” JASA 107:1590-1598.
Liu, F.T., Ting, K.M., Zhou, Z.-H. (2008). “Isolation forest.” ICDM.
Yeh, C.-C.M., Keogh, E. et al. (2016). “Matrix profile I: all pairs similarity joins for time series.” ICDM.
Hyndman, R.J., Koehler, A.B. (2006). “Another look at measures of forecast accuracy.” Int. J. Forecasting 22:679-688.
Diebold, F.X., Mariano, R.S. (1995). “Comparing predictive accuracy.” JBES 13:253-263.
Wickramasuriya, S., Athanasopoulos, G., Hyndman, R.J. (2019). “Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization.” JASA 114:804-819.
Taylor, S.J., Letham, B. (2018). “Forecasting at scale (Prophet).” American Statistician 72:37-45.
Scott, S.L., Varian, H.R. (2014). “Predicting the present with Bayesian structural time series.” Int. J. Math. Modelling and Numerical Optimisation 5:4-23.
Banbura, M., Giannone, D., Reichlin, L. (2013). “Nowcasting.” In Oxford Handbook of Economic Forecasting.
Granger, C.W.J., Newbold, P. (1974). “Spurious regressions in econometrics.” J. Econometrics 2:111-120.
Makridakis, S., Spiliotis, E., Assimakopoulos, V. (2020). “The M4 competition: 100,000 time series and 61 forecasting methods.” Int. J. Forecasting 36:54-74.
Makridakis, S., Spiliotis, E., Assimakopoulos, V. (2022). “The M5 accuracy competition: results, findings, and conclusions.” Int. J. Forecasting 38:1346-1364.
López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
