MCMC · Bayesian Inference · Complete Derivation

Markov Chain Monte Carlo
mathematically, step by step

Same toy problem: F(x) = x², y = 9.5. We solve it three ways — posterior geometry, Metropolis-Hastings by hand, then live interactive sampling.

SECTION 01

The problem — posterior we want to sample

We have a nonlinear forward model F(x) = x², one observation y = 9.5, measurement noise σ_ε = 0.5, and a Gaussian prior x_a = 2, σ_a = 2.

From Bayes' theorem:

── Bayes' theorem ───────────────────────────────────────────────

P(x | y) ∝ P(y | x) · P(x) ∝ exp(−(y − F(x))² / 2σ_ε²) · exp(−(x − x_a)² / 2σ_a²)

── Plug in numbers ──────────────────────────────────────────────

∝ exp(−(9.5 − x²)² / 0.50) · exp(−(x − 2)² / 8)

This is not Gaussian in x because x appears as x² inside the exponential. We cannot evaluate the normalising constant Z = ∫P(y|x)P(x)dx analytically. This is exactly where MCMC becomes necessary.

Why OE fails here: OE linearises F(x) ≈ F(x_a) + K(x−x_a) and assumes the posterior is Gaussian. But F(x) = x² is symmetric — there are two solutions near x ≈ +3.08 and x ≈ −3.08. OE finds only one. MCMC finds both.
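
To see the bimodality concretely, the unnormalised posterior can be evaluated on a grid and normalised numerically. This is a minimal Python sketch; the grid range and resolution are arbitrary choices, not from the text:

```python
import numpy as np

# Unnormalised posterior for F(x) = x², y = 9.5, σ_ε = 0.5, x_a = 2, σ_a = 2
def unnorm_posterior(x):
    likelihood = np.exp(-(9.5 - x**2) ** 2 / (2 * 0.5**2))
    prior = np.exp(-(x - 2.0) ** 2 / (2 * 2.0**2))
    return likelihood * prior

x = np.linspace(-6.0, 6.0, 4001)
p = unnorm_posterior(x)

# Normalise numerically with the trapezoidal rule (no closed form exists)
Z = np.sum((p[1:] + p[:-1]) / 2 * np.diff(x))
density = p / Z

# The density has two local maxima, one on each side of zero
neg_mode = x[x < 0][np.argmax(density[x < 0])]
pos_mode = x[x > 0][np.argmax(density[x > 0])]
print(f"modes near x = {neg_mode:.2f} and x = {pos_mode:.2f}")
```

Both modes sit near ±√9.5 ≈ ±3.08, with the positive one slightly favoured because the prior is centred at x_a = 2.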

We define the cost function J(x) = −2 log P(x|y), which we minimise in OE but only evaluate in MCMC:

J(x)  =  (9.5 − x²)² / 0.25   +   (x − 2)² / 4
         └─ measurement term ─┘   └─ prior term ─┘

── Evaluate at key points ───────────────────────────────────────

x = −3.08 → J = 6.45 (second mode — OE misses this!)

x = 0.00 → J = 362.00 (very unlikely)

x = +2.00 → J = 121.00 (prior mean — moderate)

x = +3.08 → J = 0.29 (primary minimum — OE solution)

x = +3.50 → J = 30.81 (falling away fast)
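
Evaluating J at these key points takes only a few lines, using exactly the J(x) defined above:

```python
# J(x) = -2 log P(x|y) up to a constant, as defined above
def J(x):
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

for xk in (-3.08, 0.0, 2.0, 3.08, 3.5):
    print(f"x = {xk:+.2f}  ->  J = {J(xk):.2f}")
```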

SECTION 02

What is a Markov chain?

A Markov chain is a sequence of random variables x_0, x_1, x_2, ... where the next state depends only on the current state — not the full history:

P(x_{i+1} | x_i, x_{i-1}, ..., x_0) = P(x_{i+1} | x_i)

This is the Markov property — memoryless given the present state.

The transition kernel T(x_{i+1} | x_i) defines how we move from one state to the next. The magic of MCMC: we design T so that the chain's stationary distribution equals our posterior P(x|y).

The condition that guarantees this is called detailed balance:

── Detailed balance (sufficient condition for correct stationary dist.) ──

P(x | y) · T(x* | x) = P(x* | y) · T(x | x*)

Interpretation: the probability flux from x→x* equals flux from x*→x.

If this holds, the chain cannot drift — it is in equilibrium.

Key insight: We never need to compute the normalising constant Z = ∫P(y|x)P(x)dx. In any ratio P(x*|y) / P(x|y), Z cancels. MCMC exploits this completely.
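
Detailed balance can be checked numerically. The sketch below builds the Metropolis transition density (symmetric Gaussian proposal times acceptance probability) from our unnormalised posterior and verifies that the forward and backward probability fluxes match for an arbitrarily chosen state pair:

```python
import math

# Unnormalised posterior p(x) for F(x) = x², y = 9.5 (Z cancels throughout)
def p(x):
    return math.exp(-(9.5 - x**2) ** 2 / 0.5) * math.exp(-(x - 2) ** 2 / 8)

def q(a, b, s=1.0):
    # Symmetric Gaussian proposal density q(a|b): note q(a, b) == q(b, a)
    return math.exp(-(a - b) ** 2 / (2 * s**2)) / (s * math.sqrt(2 * math.pi))

def T(x_new, x_old):
    # Off-diagonal Metropolis transition density: propose, then accept
    return q(x_new, x_old) * min(1.0, p(x_new) / p(x_old))

x, x_star = 2.7, 3.2          # arbitrary pair of states
flux_fwd = p(x) * T(x_star, x)
flux_bwd = p(x_star) * T(x, x_star)
print(abs(flux_fwd - flux_bwd))  # ~0: the two fluxes balance
```

The equality holds for every pair, which is what makes P(x|y) the stationary distribution of the chain.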
SECTION 03

Metropolis-Hastings algorithm — full derivation

We use a symmetric Gaussian proposal: x* ~ N(x_i, σ_prop²). This means q(x*|x_i) = q(x_i|x*), which simplifies the acceptance ratio.

── Algorithm ────────────────────────────────────────────────────

Initialise: x_0 = any starting point (we use x_0 = 0.0)

Proposal σ: σ_prop = 1.0

For i = 0, 1, 2, ..., N:

1. Propose: x* ~ N(x_i, σ_prop²)

← draw from Gaussian centred on current x

2. Compute ratio:
                       P(x* | y)
         α  =  min(1, ─────────)
                       P(x  | y)
            =  min(1,  exp(−J(x*)/2) / exp(−J(x_i)/2))
            =  min(1,  exp(−[J(x*) − J(x_i)] / 2))
        

← only need DIFFERENCE of J — Z cancels!

3. Accept/Reject:

u  ~  Uniform(0, 1)

if u ≤ α:   x_{i+1} = x* (accept)

else:       x_{i+1} = x_i   (reject — stay put)

Why this works: If x* is more probable than x_i (α=1), always accept. If less probable, accept with probability equal to the probability ratio — giving the chain a chance to explore lower-probability regions. Over many steps, the fraction of time spent at any x equals P(x|y).
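
The full loop, as a minimal Python sketch (the seed and step count are choices made here, not values from the text). One caveat: with σ_prop = 1.0 the two modes at ±3.08 are separated by a deep probability valley, so a single chain of this length typically settles into one mode rather than hopping between them:

```python
import numpy as np

def J(x):
    # Cost function from Section 01: J(x) = -2 log P(x|y) up to a constant
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

def metropolis_hastings(n_steps=50_000, x0=0.0, sigma_prop=1.0, seed=0):
    rng = np.random.default_rng(seed)   # fixed seed for reproducibility
    chain = np.empty(n_steps)
    x, n_accept = x0, 0
    for i in range(n_steps):
        x_star = rng.normal(x, sigma_prop)                  # 1. propose
        alpha = min(1.0, np.exp(-(J(x_star) - J(x)) / 2))   # 2. ratio (Z cancels)
        if rng.uniform() <= alpha:                          # 3. accept/reject
            x, n_accept = x_star, n_accept + 1
        chain[i] = x
    return chain, n_accept / n_steps

chain, acc_rate = metropolis_hastings()
burned = chain[len(chain) // 5:]          # discard first 20% as burn-in
print(f"acceptance rate: {acc_rate:.2f}")
print(f"mean |x| after burn-in: {np.mean(np.abs(burned)):.2f}")
```

The samples concentrate near |x| ≈ 3.08, matching the minima of J found in Section 01.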

Hand calculation — first 5 steps

Starting at x_0 = 0.0, σ_prop = 1.0. We pre-fix the random draws for reproducibility.
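
The original table of draws is not reproduced here. The sketch below instead runs five steps with explicitly fixed proposal offsets and uniform draws (illustrative values chosen for this sketch, not the ones used in the original hand calculation), so every accept/reject decision can be verified by hand:

```python
import math

def J(x):
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

# Fixed draws (illustrative): proposal offsets from N(0, 1) and uniforms
offsets = [1.2, 0.8, 1.1, -0.4, 0.3]
uniforms = [0.50, 0.50, 0.50, 0.90, 0.50]

x = 0.0  # x_0
for i, (dx, u) in enumerate(zip(offsets, uniforms)):
    x_star = x + dx                        # propose from N(x, 1)
    alpha = min(1.0, math.exp(-(J(x_star) - J(x)) / 2))
    accepted = u <= alpha
    if accepted:
        x = x_star
    print(f"step {i+1}: x* = {x_star:+.2f}, alpha = {alpha:.3g}, "
          f"{'accept' if accepted else 'reject'} -> x = {x:+.2f}")
```

The pattern is typical: downhill moves (α = 1) are always accepted, so the chain climbs quickly from x_0 = 0 toward the mode, after which most uphill proposals are rejected.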

SECTION 04

Diagnostics — how do we know if the chain is good?

Diagnostic              What it measures                                    Good value               Formula / method
Trace plot              Chain mixing well? ("hairy caterpillar" look)       No drifts or sticking    Plot x_i vs i
Acceptance rate         Fraction of proposals accepted                      ~23–44%                  N_accepted / N_total
Autocorrelation τ       How many steps until samples are independent        τ as small as possible   τ = 1 + 2 Σ ρ_k
Effective sample size   Number of "independent" samples                     ESS > 200                ESS = N / τ
Gelman-Rubin R̂          Do multiple chains agree?                           R̂ < 1.01                 R̂ = √(V̂/W); V̂ pooled, W within-chain
Burn-in                 Initial transient before reaching the typical set   Discard first ~20%       Visual inspection of trace plot
── Autocorrelation at lag k ──────────────────────────────────────
         Σ (x_i − x̄)(x_{i+k} − x̄)
ρ_k  =  ─────────────────────────────
              Σ (x_i − x̄)²
── Integrated autocorrelation time ──────────────────────────────
τ  =  1  +  2 Σ_{k=1}^{∞} ρ_k         ← sum until ρ_k ≈ 0
── Effective sample size ─────────────────────────────────────────
ESS  =  N / τ                         ← independent draws equivalent
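
These formulas can be exercised on a synthetic chain. The sketch below builds an AR(1) series with known correlation ρ = 0.9, for which theory gives τ = (1 + ρ)/(1 − ρ) = 19, then estimates τ and ESS exactly as above (the truncation threshold is a choice made here):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.9, 200_000
x = np.empty(n)
x[0] = 0.0
noise = rng.normal(0.0, np.sqrt(1 - rho**2), n)  # unit stationary variance
for i in range(1, n):
    x[i] = rho * x[i - 1] + noise[i]

xc = x - x.mean()
var = np.mean(xc**2)

def autocorr(k):
    # rho_k = sum (x_i - xbar)(x_{i+k} - xbar) / sum (x_i - xbar)^2
    return np.mean(xc[:-k] * xc[k:]) / var if k else 1.0

# tau = 1 + 2 sum rho_k, summed until rho_k drops to ~0
tau, k = 1.0, 1
while True:
    r = autocorr(k)
    if r < 0.01 or k > 1000:
        break
    tau += 2 * r
    k += 1

ess = n / tau
print(f"tau ≈ {tau:.1f}, ESS ≈ {ess:.0f}")  # tau near 19, ESS near n/19
```
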
SECTION 05

Live MCMC simulation — interact

Run Metropolis-Hastings on our problem F(x) = x², y = 9.5. Adjust the proposal width and see how it affects mixing and the recovered posterior shape.

[Interactive controls: proposal σ_prop = 1.0, N = 5000 samples, burn-in 20%]
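
Offline, the same experiment can be sketched by rerunning the sampler for several proposal widths and comparing acceptance rates (the widths, seed, and chain length are choices made for this sketch):

```python
import numpy as np

def J(x):
    return (9.5 - x**2) ** 2 / 0.25 + (x - 2.0) ** 2 / 4.0

def run_chain(sigma_prop, n_steps=5000, x0=0.0, seed=42):
    rng = np.random.default_rng(seed)
    x, n_acc = x0, 0
    for _ in range(n_steps):
        x_star = rng.normal(x, sigma_prop)
        if rng.uniform() <= min(1.0, np.exp(-(J(x_star) - J(x)) / 2)):
            x, n_acc = x_star, n_acc + 1
    return n_acc / n_steps

rates = {}
for s in (0.05, 1.0, 10.0):
    rates[s] = run_chain(s)
    print(f"sigma_prop = {s:5.2f} -> acceptance = {rates[s]:.2f}")
```

Tiny steps are almost always accepted but crawl through the posterior; huge steps are almost always rejected; the useful regime sits in between, which is what the ~23–44% acceptance target in Section 04 captures.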
SECTION 06

OE vs MCMC — what each gives you

Aspect                     Optimal Estimation (OE)                     MCMC (M-H)
Output                     Single point x̂ + Gaussian Sx                Samples from true P(x|y)
Posterior shape            Always Gaussian (assumption)                Any shape — multimodal, skewed
Modes found                Only 1 (nearest to start)                   All modes (if chain mixes)
Speed                      ~5 iterations → seconds                     10⁴–10⁶ samples → minutes
Nonlinearity               Linearises F(x) — fails badly if curved     Handles full nonlinearity exactly
Key formula                Sx = (KᵀSε⁻¹K + Sa⁻¹)⁻¹                     α = min(1, P(x*|y)/P(x|y))
Used in practice for CO₂   Operational (millions of soundings/day)     Algorithm development, difficult scenes

Key equations to remember