MCMC · Bayesian Inference · Complete Derivation

Markov Chain Monte Carlo
mathematically, step by step

Same toy problem: F(x) = x², y = 9.5. We solve it three ways — posterior geometry, Metropolis-Hastings by hand, then live interactive sampling.

SECTION 01

The problem — posterior we want to sample

We have a nonlinear forward model F(x) = x², one observation y = 9.5, measurement noise σ_ε = 0.5, and a Gaussian prior x_a = 2, σ_a = 2.

From Bayes' theorem:

── Bayes' theorem ───────────────────────────────────────────────

P(x | y) ∝ P(y | x) · P(x) ∝ exp(−(y − F(x))² / 2σ_ε²) · exp(−(x − x_a)² / 2σ_a²)

── Plug in numbers ──────────────────────────────────────────────

∝ exp(−(9.5 − x²)² / 0.50) · exp(−(x − 2)² / 8)

This is not Gaussian in x because x appears as x² inside the exponential. We cannot evaluate the normalising constant Z = ∫P(y|x)P(x)dx analytically. This is exactly where MCMC becomes necessary.

Why OE fails here: OE linearises F(x) ≈ F(x_a) + K(x−x_a) and assumes the posterior is Gaussian. But F(x) = x² is symmetric — there are two solutions near x ≈ +3.08 and x ≈ −3.08. OE finds only one. MCMC finds both.
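
To see the bimodality concretely, the unnormalised posterior can be evaluated on a grid and normalised numerically. This is a minimal Python sketch; the grid range and resolution are arbitrary choices, not from the text:

```python
import numpy as np

# Unnormalised posterior for F(x) = x², y = 9.5, σ_ε = 0.5, x_a = 2, σ_a = 2
def unnorm_posterior(x):
    likelihood = np.exp(-(9.5 - x**2) ** 2 / (2 * 0.5**2))
    prior = np.exp(-(x - 2.0) ** 2 / (2 * 2.0**2))
    return likelihood * prior

x = np.linspace(-6.0, 6.0, 4001)
p = unnorm_posterior(x)

# Normalise numerically with the trapezoidal rule (no closed form exists)
Z = np.sum((p[1:] + p[:-1]) / 2 * np.diff(x))
density = p / Z

# The density has two local maxima, one on each side of zero
neg_mode = x[x < 0][np.argmax(density[x < 0])]
pos_mode = x[x > 0][np.argmax(density[x > 0])]
print(f"modes near x = {neg_mode:.2f} and x = {pos_mode:.2f}")
```

Both modes sit near ±√9.5 ≈ ±3.08, with the positive one slightly favoured because the prior is centred at x_a = 2.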

We define the cost function J(x) = −2 log P(x|y), which we minimise in OE but only evaluate in MCMC:

J(x)  =  (9.5 − x²)² / 0.25   +   (x − 2)² / 4
         └─ measurement term ─┘   └─ prior term ─┘

── Evaluate at key points ───────────────────────────────────────

x = −3.08 → J = 6.45 (second mode — OE misses this!)

x = 0.00 → J = 362.00 (very unlikely)

x = +2.00 → J = 121.00 (prior mean — moderate)

x = +3.08 → J = 0.29 (primary minimum — OE solution)

x = +3.50 → J = 30.81 (falling away fast)
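
Evaluating J at these key points takes only a few lines, using exactly the J(x) defined above:

```python
# J(x) = -2 log P(x|y) up to a constant, as defined above
def J(x):
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

for xk in (-3.08, 0.0, 2.0, 3.08, 3.5):
    print(f"x = {xk:+.2f}  ->  J = {J(xk):.2f}")
```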

SECTION 02

What is a Markov chain?

A Markov chain is a sequence of random variables x_0, x_1, x_2, ... where the next state depends only on the current state — not the full history:

P(x_{i+1} | x_i, x_{i-1}, ..., x_0) = P(x_{i+1} | x_i)

This is the Markov property — memoryless given the present state.

The transition kernel T(x_{i+1} | x_i) defines how we move from one state to the next. The magic of MCMC: we design T so that the chain's stationary distribution equals our posterior P(x|y).

The condition that guarantees this is called detailed balance:

── Detailed balance (sufficient condition for correct stationary dist.) ──

P(x | y) · T(x* | x) = P(x* | y) · T(x | x*)

Interpretation: the probability flux from x→x* equals flux from x*→x.

If this holds, the chain cannot drift — it is in equilibrium.

Key insight: We never need to compute the normalising constant Z = ∫P(y|x)P(x)dx. In any ratio P(x*|y) / P(x|y), Z cancels. MCMC exploits this completely.
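
Detailed balance can be checked numerically. The sketch below builds the Metropolis transition density (symmetric Gaussian proposal times acceptance probability) from our unnormalised posterior and verifies that the forward and backward probability fluxes match for an arbitrarily chosen state pair:

```python
import math

# Unnormalised posterior p(x) for F(x) = x², y = 9.5 (Z cancels throughout)
def p(x):
    return math.exp(-(9.5 - x**2) ** 2 / 0.5) * math.exp(-(x - 2) ** 2 / 8)

def q(a, b, s=1.0):
    # Symmetric Gaussian proposal density q(a|b): note q(a, b) == q(b, a)
    return math.exp(-(a - b) ** 2 / (2 * s**2)) / (s * math.sqrt(2 * math.pi))

def T(x_new, x_old):
    # Off-diagonal Metropolis transition density: propose, then accept
    return q(x_new, x_old) * min(1.0, p(x_new) / p(x_old))

x, x_star = 2.7, 3.2          # arbitrary pair of states
flux_fwd = p(x) * T(x_star, x)
flux_bwd = p(x_star) * T(x, x_star)
print(abs(flux_fwd - flux_bwd))  # ~0: the two fluxes balance
```

The equality holds for every pair, which is what makes P(x|y) the stationary distribution of the chain.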
SECTION 03

Metropolis-Hastings algorithm — full derivation

We use a symmetric Gaussian proposal: x* ~ N(x_i, σ_prop²). This means q(x*|x_i) = q(x_i|x*), which simplifies the acceptance ratio.

── Algorithm ────────────────────────────────────────────────────

Initialise: x_0 = any starting point (we use x_0 = 0.0)

Proposal σ: σ_prop = 1.0

For i = 0, 1, 2, ..., N:

1. Propose: x* ~ N(x_i, σ_prop²)

← draw from Gaussian centred on current x

2. Compute ratio:
                       P(x* | y)
         α  =  min(1, ─────────)
                       P(x  | y)
            =  min(1,  exp(−J(x*)/2) / exp(−J(x_i)/2))
            =  min(1,  exp(−[J(x*) − J(x_i)] / 2))
        

← only need DIFFERENCE of J — Z cancels!

3. Accept/Reject:

u  ~  Uniform(0, 1)

if u ≤ α:   x_{i+1} = x* (accept)

else:       x_{i+1} = x_i   (reject — stay put)

Why this works: If x* is more probable than x_i (α=1), always accept. If less probable, accept with probability equal to the probability ratio — giving the chain a chance to explore lower-probability regions. Over many steps, the fraction of time spent at any x equals P(x|y).
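
The full loop, as a minimal Python sketch (the seed and step count are choices made here, not values from the text). One caveat: with σ_prop = 1.0 the two modes at ±3.08 are separated by a deep probability valley, so a single chain of this length typically settles into one mode rather than hopping between them:

```python
import numpy as np

def J(x):
    # Cost function from Section 01: J(x) = -2 log P(x|y) up to a constant
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

def metropolis_hastings(n_steps=50_000, x0=0.0, sigma_prop=1.0, seed=0):
    rng = np.random.default_rng(seed)   # fixed seed for reproducibility
    chain = np.empty(n_steps)
    x, n_accept = x0, 0
    for i in range(n_steps):
        x_star = rng.normal(x, sigma_prop)                  # 1. propose
        alpha = min(1.0, np.exp(-(J(x_star) - J(x)) / 2))   # 2. ratio (Z cancels)
        if rng.uniform() <= alpha:                          # 3. accept/reject
            x, n_accept = x_star, n_accept + 1
        chain[i] = x
    return chain, n_accept / n_steps

chain, acc_rate = metropolis_hastings()
burned = chain[len(chain) // 5:]          # discard first 20% as burn-in
print(f"acceptance rate: {acc_rate:.2f}")
print(f"mean |x| after burn-in: {np.mean(np.abs(burned)):.2f}")
```

The samples concentrate near |x| ≈ 3.08, matching the minima of J found in Section 01.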

Hand calculation — first 5 steps

Starting at x_0 = 0.0, σ_prop = 1.0. We pre-fix the random draws for reproducibility.
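
The original table of draws is not reproduced here. The sketch below instead runs five steps with explicitly fixed proposal offsets and uniform draws (illustrative values chosen for this sketch, not the ones used in the original hand calculation), so every accept/reject decision can be verified by hand:

```python
import math

def J(x):
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

# Fixed draws (illustrative): proposal offsets from N(0, 1) and uniforms
offsets = [1.2, 0.8, 1.1, -0.4, 0.3]
uniforms = [0.50, 0.50, 0.50, 0.90, 0.50]

x = 0.0  # x_0
for i, (dx, u) in enumerate(zip(offsets, uniforms)):
    x_star = x + dx                        # propose from N(x, 1)
    alpha = min(1.0, math.exp(-(J(x_star) - J(x)) / 2))
    accepted = u <= alpha
    if accepted:
        x = x_star
    print(f"step {i+1}: x* = {x_star:+.2f}, alpha = {alpha:.3g}, "
          f"{'accept' if accepted else 'reject'} -> x = {x:+.2f}")
```

The pattern is typical: downhill moves (α = 1) are always accepted, so the chain climbs quickly from x_0 = 0 toward the mode, after which most uphill proposals are rejected.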

SECTION 04

Diagnostics — how do we know if the chain is good?

Diagnostic              What it measures                                    Good value               Formula / method
Trace plot              Chain mixing well? ("hairy caterpillar" look)       No drifts or sticking    Plot x_i vs i
Acceptance rate         Fraction of proposals accepted                      ~23–44%                  N_accepted / N_total
Autocorrelation τ       How many steps until samples are independent        τ as small as possible   τ = 1 + 2 Σ ρ_k
Effective sample size   Number of "independent" samples                     ESS > 200                ESS = N / τ
Gelman-Rubin R̂          Do multiple chains agree?                           R̂ < 1.01                 R̂ = √(V̂/W); V̂ pooled, W within-chain
Burn-in                 Initial transient before reaching the typical set   Discard first ~20%       Visual inspection of trace plot
── Autocorrelation at lag k ──────────────────────────────────────
         Σ (x_i − x̄)(x_{i+k} − x̄)
ρ_k  =  ─────────────────────────────
              Σ (x_i − x̄)²
── Integrated autocorrelation time ──────────────────────────────
τ  =  1  +  2 Σ_{k=1}^{∞} ρ_k         ← sum until ρ_k ≈ 0
── Effective sample size ─────────────────────────────────────────
ESS  =  N / τ                         ← independent draws equivalent
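
These formulas can be exercised on a synthetic chain. The sketch below builds an AR(1) series with known correlation ρ = 0.9, for which theory gives τ = (1 + ρ)/(1 − ρ) = 19, then estimates τ and ESS exactly as above (the truncation threshold is a choice made here):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.9, 200_000
x = np.empty(n)
x[0] = 0.0
noise = rng.normal(0.0, np.sqrt(1 - rho**2), n)  # unit stationary variance
for i in range(1, n):
    x[i] = rho * x[i - 1] + noise[i]

xc = x - x.mean()
var = np.mean(xc**2)

def autocorr(k):
    # rho_k = sum (x_i - xbar)(x_{i+k} - xbar) / sum (x_i - xbar)^2
    return np.mean(xc[:-k] * xc[k:]) / var if k else 1.0

# tau = 1 + 2 sum rho_k, summed until rho_k drops to ~0
tau, k = 1.0, 1
while True:
    r = autocorr(k)
    if r < 0.01 or k > 1000:
        break
    tau += 2 * r
    k += 1

ess = n / tau
print(f"tau ≈ {tau:.1f}, ESS ≈ {ess:.0f}")  # tau near 19, ESS near n/19
```
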
SECTION 05

Live MCMC simulation — interact

Run Metropolis-Hastings on our problem F(x) = x², y = 9.5. Adjust the proposal width and see how it affects mixing and the recovered posterior shape.

[Interactive controls: proposal σ_prop = 1.0, N = 5000 samples, burn-in 20%]
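
Offline, the same experiment can be sketched by rerunning the sampler for several proposal widths and comparing acceptance rates (the widths, seed, and chain length are choices made for this sketch):

```python
import numpy as np

def J(x):
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

def run_chain(sigma_prop, n_steps=5000, x0=0.0, seed=42):
    rng = np.random.default_rng(seed)
    x, n_acc = x0, 0
    for _ in range(n_steps):
        x_star = rng.normal(x, sigma_prop)
        if rng.uniform() <= min(1.0, np.exp(-(J(x_star) - J(x)) / 2)):
            x, n_acc = x_star, n_acc + 1
    return n_acc / n_steps

rates = {}
for s in (0.05, 1.0, 10.0):
    rates[s] = run_chain(s)
    print(f"sigma_prop = {s:5.2f} -> acceptance = {rates[s]:.2f}")
```

Tiny steps are almost always accepted but crawl through the posterior; huge steps are almost always rejected; the useful regime sits in between, which is what the ~23–44% acceptance target in Section 04 captures.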
SECTION 06

OE vs MCMC — what each gives you

Aspect                     Optimal Estimation (OE)                     MCMC (M-H)
Output                     Single point x̂ + Gaussian Sx                Samples from true P(x|y)
Posterior shape            Always Gaussian (assumption)                Any shape — multimodal, skewed
Modes found                Only 1 (nearest to start)                   All modes (if chain mixes)
Speed                      ~5 iterations → seconds                     10⁴–10⁶ samples → minutes
Nonlinearity               Linearises F(x) — fails badly if curved     Handles full nonlinearity exactly
Key formula                Sx = (KᵀSε⁻¹K + Sa⁻¹)⁻¹                     α = min(1, P(x*|y)/P(x|y))
Used in practice for CO₂   Operational (millions of soundings/day)     Algorithm development, difficult scenes

Key equations to remember