Table of Contents
- Mathematical Fundamentals
- Probability Theory and Distributions
- Markov Chains and Stationarity
- MCMC: Theoretical Framework
- MCMC Algorithms (Metropolis-Hastings, Gibbs)
- Application to CO₂ Retrieval from Remote Sensing
- Convergence Diagnostics and Validation
- Computational Considerations
1. Mathematical Fundamentals
1.1 Probability Measure and Sample Spaces
A probability measure is a function \(P: \mathcal{F} \rightarrow [0,1]\) defined on a \(\sigma\)-algebra \(\mathcal{F}\) of subsets of a sample space \(\Omega\), satisfying:
$$\begin{eqnarray} P(\Omega) & = & 1 \quad \text{(Normalization)} \\ P(\cup_i A_i) & = & \sum_i P(A_i) \quad \text{for disjoint sets } A_i \text{ (Countable additivity)} \end{eqnarray}$$

1.2 Random Variables and Distributions
A random variable \(X\) is a measurable function \(X:\Omega \rightarrow \mathbb{R}\). The probability distribution of \(X\) is characterized by:
$$F(x) = P(X \leq x)$$
For continuous distributions, the probability density function (PDF) is:
$$f(x) = \frac{dF(x)}{dx}$$

1.3 Expectation and Moments
The expectation (mean) of a random variable \(X\) with PDF \(f(x)\) is:
$$\mathbb{E}[X] = \int x f(x) dx$$

The variance quantifies spread around the mean:

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

The covariance between two random variables \(X\) and \(Y\) is:

$$\text{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

1.4 Conditional Probability and Independence
The conditional probability of event \(A\) given event \(B\) (where \(P(B) > 0\)) is:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

For random variables, the conditional density is:

$$f(x|y) = \frac{f(x,y)}{f(y)}$$

\(X\) and \(Y\) are conditionally independent given \(Z\) if:

$$f(x,y|z) = f(x|z)f(y|z)$$

2. Probability Theory and Distributions
2.1 Joint, Marginal, and Conditional Distributions
For a multivariate distribution with density \(f(x_1, x_2, \ldots, x_n)\):
$$f(x_1) = \int\int\cdots\int f(x_1, x_2, \ldots, x_n) dx_2 dx_3 \cdots dx_n$$
2.2 Bayes' Theorem
The foundational theorem for Bayesian inference is:
$$P(\theta|y) = \frac{P(y|\theta) P(\theta)}{P(y)}$$

Or in terms of densities:

$$\pi(\theta|y) = \frac{L(y|\theta) \pi(\theta)}{\int L(y|\theta) \pi(\theta) d\theta}$$

Where:
- \(\pi(\theta|y)\) = Posterior distribution
- \(L(y|\theta)\) = Likelihood function
- \(\pi(\theta)\) = Prior distribution
- \(\int L(y|\theta) \pi(\theta) d\theta\) = Normalizing constant (evidence)
2.3 Important Distributions
| Distribution | PDF/PMF | Mean | Variance |
|---|---|---|---|
| \(\mathcal{N}(\mu, \sigma^2)\) | \(\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) | \(\mu\) | \(\sigma^2\) |
| \(\text{Uniform}(a,b)\) | \(\frac{1}{b-a}\) for \(x \in [a,b]\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
| \(\text{Exponential}(\lambda)\) | \(\lambda \exp(-\lambda x)\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) |
| \(\text{Beta}(\alpha,\beta)\) | \(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}\) | \(\frac{\alpha}{\alpha+\beta}\) | \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\) |
2.4 Multivariate Normal Distribution
The \(n\)-dimensional multivariate normal distribution has PDF:
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-n/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Where:
- \(\boldsymbol{\mu} \in \mathbb{R}^n\) = Mean vector
- \(\boldsymbol{\Sigma} \in \mathbb{R}^{n \times n}\) = Covariance matrix (positive definite)
- \(|\boldsymbol{\Sigma}|\) = Determinant of \(\boldsymbol{\Sigma}\)
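In practice the multivariate normal log-density is evaluated through a Cholesky factor of \(\boldsymbol{\Sigma}\), which avoids forming \(\boldsymbol{\Sigma}^{-1}\) or \(|\boldsymbol{\Sigma}|\) explicitly. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma), evaluated via a Cholesky factor
    for numerical stability (no explicit inverse or determinant)."""
    n = len(mu)
    L = np.linalg.cholesky(Sigma)           # Sigma = L @ L.T
    z = np.linalg.solve(L, x - mu)          # solves L z = x - mu
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2 * np.pi) + log_det + z @ z)
```

For a one-dimensional standard normal this reduces to \(-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}x^2\), matching the table above.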
2.5 Log-Probability and Numerical Stability
In computational applications, we work with log-probabilities to avoid underflow:
$$\log \pi(\theta|y) = \log L(y|\theta) + \log \pi(\theta) - \log p(y)$$

For optimization and MCMC, we can work with the unnormalized log-posterior, which differs from \(\log \pi(\theta|y)\) only by an additive constant:

$$\ell(\theta) = \log L(y|\theta) + \log \pi(\theta) = \log \pi(\theta|y) + \text{const}$$

3. Markov Chains and Stationarity
3.1 Markov Property
A sequence of random variables \(\{X_t\}_{t=0}^{\infty}\) forms a Markov chain if:
$$P(X_{t+1} = x_{t+1} \mid X_0 = x_0, \ldots, X_t = x_t) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)$$

The future depends only on the current state, not the history. This is called the memoryless property.
3.2 Transition Kernels
The transition kernel \(K(x, x')\) specifies the probability of moving from state \(x\) to state \(x'\):
$$K(x, x') = P(X_{t+1} \in dx' \mid X_t = x)$$

For a discrete state space, this is the transition probability matrix \(\mathbf{P}\) with elements:

$$P_{ij} = P(X_{t+1} = j \mid X_t = i)$$

The transition matrix must satisfy row-stochasticity:

$$\sum_j P_{ij} = 1 \quad \text{for all } i$$

3.3 Stationary Distribution
A probability distribution \(\pi\) is stationary (or invariant) for a Markov chain with transition kernel \(K\) if:
$$\pi(x') = \int \pi(x) K(x, x') dx$$

In matrix form for discrete chains:

$$\boldsymbol{\pi} = \boldsymbol{\pi} \mathbf{P} \quad \text{(\(\boldsymbol{\pi}\) is a left eigenvector of \(\mathbf{P}\) with eigenvalue 1)}$$

3.4 Detailed Balance (Reversibility)
A chain satisfies detailed balance with respect to \(\pi\) if:
$$\pi(x) K(x, x') = \pi(x') K(x', x)$$

Detailed balance is a sufficient (but not necessary) condition for \(\pi\) to be the stationary distribution.
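Both conditions are easy to check numerically for a small discrete chain. A sketch with an illustrative two-state transition matrix:

```python
import numpy as np

# An illustrative 2-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# For a 2-state chain the stationary distribution has the closed form
# pi proportional to (P[1,0], P[0,1]).
pi = np.array([P[1, 0], P[0, 1]])
pi = pi / pi.sum()                                    # -> [0.75, 0.25]

assert np.allclose(pi @ P, pi)                        # invariance: pi P = pi
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance
```

Here detailed balance holds because the chain is reversible; a chain with a directed "cycle" among three or more states can still have a stationary distribution while violating detailed balance.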
3.5 Ergodicity and Convergence
A chain is ergodic if it satisfies:
- Irreducibility: From any state, we can reach any other state in a finite number of steps
- Aperiodicity: The chain does not return to states in cycles of fixed length \(> 1\)
For an ergodic chain with stationary distribution \(\pi\):
$$\lim_{t \to \infty} P(X_t \in A \mid X_0 = x) = \pi(A) \quad \text{for all } x \text{ and measurable sets } A$$

3.6 Law of Large Numbers for Markov Chains
Under ergodicity, empirical averages converge to expectations under \(\pi\):

$$\frac{1}{T} \sum_{t=1}^{T} f(X_t) \to \int f(x) \pi(x) dx \quad \text{as } T \to \infty$$

This is the basis for MCMC inference: we can estimate posterior expectations by averaging over sampled chains.
4. MCMC: Theoretical Framework
4.1 The Core Principle
MCMC constructs a Markov chain whose stationary distribution is a target distribution \(\pi\) (e.g., a posterior). We then sample from this chain and use the samples to estimate quantities of interest.
Given observed data \(y\) and a model with parameters \(\theta\):
1. Define likelihood: \(L(y|\theta)\)
2. Choose prior: \(\pi(\theta)\)
3. Posterior: \(\pi(\theta|y) \propto L(y|\theta) \pi(\theta)\)
4. Design Markov chain with \(\pi(\theta|y)\) as stationary distribution
5. Sample: \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\)
6. Estimate: \(\mathbb{E}[g(\theta)|y] \approx \frac{1}{N} \sum_{i=1}^{N} g(\theta^{(i)})\)
4.2 Why MCMC Works
MCMC solves a fundamental problem: we want to sample from \(\pi\) but can only evaluate it up to a normalizing constant.
If we construct a chain with transition kernel \(K\) such that:
- \(\pi\) is the stationary distribution of \(K\)
- The chain is irreducible and aperiodic
Then for large enough \(t\), \(X_t \sim \pi\) (approximately).
4.3 Acceptance-Rejection Framework
Many MCMC algorithms use an accept-reject mechanism. Proposed move from \(x\) to \(x'\) is accepted with probability:
$$\alpha(x, x') = \min(1, r(x, x'))$$

Where \(r(x, x')\) depends on the algorithm. With this mechanism:

$$K(x, x') = \alpha(x, x') q(x, x') + \left(1 - \int \alpha(x, z) q(x, z) dz\right) \delta_x(x')$$

Where \(q(x, x')\) is the proposal distribution and \(\delta_x(x')\) is the point mass at \(x\) (stay in place if rejected).
4.4 Metropolis-Hastings Condition
To ensure detailed balance (and thus \(\pi\) as the stationary distribution), we need:
$$\pi(x) q(x, x') \alpha(x, x') = \pi(x') q(x', x) \alpha(x', x)$$

The Metropolis-Hastings ratio ensures this:

$$r(x, x') = \frac{\pi(x') q(x', x)}{\pi(x) q(x, x')}$$

With acceptance probability \(\alpha(x, x') = \min(1, r(x, x'))\), detailed balance is automatically satisfied.
4.5 Burn-in and Mixing
Due to dependence on initial conditions, early samples are discarded (burn-in phase). After burn-in, we approximate:

$$\pi(\theta|y) \approx \frac{1}{N-B} \sum_{t=B+1}^{N} \delta_{\theta^{(t)}}(\theta)$$

Where \(B\) is the burn-in period. Mixing refers to how quickly the chain explores the posterior; faster mixing means faster convergence.
5. MCMC Algorithms
5.1 Metropolis-Hastings Algorithm
The most general MCMC algorithm. Given target \(\pi\) and proposal \(q\):
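The algorithm can be sketched as a minimal random-walk Metropolis sampler in NumPy (the standard-normal target, function name, and tuning values here are all illustrative):

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the
    Hastings ratio reduces to pi(theta*) / pi(theta_t).
    log_target may be unnormalized (constants cancel in the ratio)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    log_p = log_target(theta)
    samples = np.empty(n_samples)
    accepted = 0
    for t in range(n_samples):
        proposal = theta + step * rng.standard_normal()
        log_p_prop = log_target(proposal)
        # accept with probability min(1, r), compared on the log scale
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_samples

# Target: unnormalized standard normal log-density
samples, acc_rate = metropolis_hastings(lambda x: -0.5 * x**2, 0.0, 50_000)
```

With a symmetric proposal this is the special case of Section 5.2; an asymmetric proposal would add the \(\log q(x|x') - \log q(x'|x)\) correction to the log-ratio.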
5.2 Random Walk Metropolis
A simple proposal using a symmetric random walk:
$$q(\theta^*|\theta^{(t)}) = \mathcal{N}(\theta^*|\theta^{(t)}, \sigma^2 \mathbf{I})$$

This is symmetric, so \(q(\theta^{(t)}|\theta^*) = q(\theta^*|\theta^{(t)})\), and the acceptance ratio simplifies to:

$$r = \frac{\pi(\theta^*)}{\pi(\theta^{(t)})}$$

The proposal variance \(\sigma^2\) controls step size; tuning it is critical for efficiency.
5.3 Gibbs Sampling
For multivariate targets, Gibbs sampling updates one component at a time from its full conditional:
At step \(t+1\), update each component \(j = 1, \ldots, p\) in turn from its full conditional:

$$\theta_j^{(t+1)} \sim \pi\left(\theta_j \,\middle|\, \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_p^{(t)}\right)$$
Gibbs sampling always accepts moves (acceptance rate = 100%) when full conditionals are tractable.
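A concrete sketch for a case where the full conditionals are known in closed form: a bivariate normal with correlation \(\rho\) (function name and values are illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for (x, y) ~ N(0, [[1, rho], [rho, 1]]).
    Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    s = np.sqrt(1.0 - rho**2)            # conditional standard deviation
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        x = rho * y + s * rng.standard_normal()   # draw x | y
        y = rho * x + s * rng.standard_normal()   # draw y | x
        out[t] = x, y
    return out

draws = gibbs_bivariate_normal(rho=0.8, n_samples=50_000)
```

Note that strong correlation between components slows Gibbs mixing even though every move is accepted: each conditional draw can only move a short distance along the correlated ridge.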
5.4 Adaptive Metropolis
Adapt proposal covariance online to improve mixing. At iteration \(t\), use estimated posterior covariance:
$$\boldsymbol{\Sigma}^{(t)} = \text{Cov}(\theta^{(1)}, \ldots, \theta^{(t)}) + \varepsilon \mathbf{I}$$

And propose:

$$\theta^* \sim \mathcal{N}\left(\theta^{(t)}, \frac{2.38^2}{p} \boldsymbol{\Sigma}^{(t)}\right)$$

The constant \(\frac{2.38^2}{p}\), where \(p\) is the parameter dimension, is asymptotically optimal for high-dimensional Gaussian targets (Roberts et al., 1997).
5.5 Hamiltonian Monte Carlo (HMC)
Uses gradient information to propose moves that are less likely to be rejected.
Augment parameter space with auxiliary momentum variables \(\mathbf{p} \sim \mathcal{N}(0, \mathbf{M})\):
$$H(\theta, \mathbf{p}) = -\log \pi(\theta) + \frac{1}{2} \mathbf{p}^T \mathbf{M}^{-1} \mathbf{p}$$

Proposal follows Hamiltonian dynamics:

$$\frac{d\theta}{dt} = \nabla_{\mathbf{p}} H = \mathbf{M}^{-1} \mathbf{p} \qquad \frac{d\mathbf{p}}{dt} = -\nabla_{\theta} H = \nabla_{\theta} \log \pi(\theta)$$

HMC typically has much higher acceptance rates and better mixing than random walk methods, especially in high dimensions.
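In practice the dynamics are discretized with the leapfrog integrator, which is reversible and volume-preserving, so the acceptance step only has to correct for discretization error in \(H\). A sketch for scalar \(\theta\) with \(\mathbf{M} = 1\) (`grad_log_pi` is assumed supplied by the model):

```python
def leapfrog(theta, p, grad_log_pi, step, n_steps, M_inv=1.0):
    """Leapfrog integration of the Hamiltonian dynamics above:
    half step in momentum, alternating full steps, final half step."""
    p = p + 0.5 * step * grad_log_pi(theta)        # half step in momentum
    for _ in range(n_steps - 1):
        theta = theta + step * M_inv * p           # full step in position
        p = p + step * grad_log_pi(theta)          # full step in momentum
    theta = theta + step * M_inv * p               # last position step
    p = p + 0.5 * step * grad_log_pi(theta)        # closing half step
    return theta, p
```

For a standard normal target (\(\log \pi(\theta) = -\theta^2/2\), gradient \(-\theta\)) the trajectory rotates in \((\theta, p)\) phase space while approximately conserving \(H\); the Metropolis correction then accepts with probability \(\min(1, e^{-\Delta H})\).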
5.6 No-U-Turn Sampler (NUTS)
An extension of HMC that automatically determines trajectory length by detecting when the chain starts doubling back (U-turn).
This reduces manual tuning and improves efficiency for complex posteriors.
6. Application to CO₂ Retrieval from Remote Sensing
6.1 Remote Sensing Inverse Problem
Estimating atmospheric CO₂ from satellite measurements is an inverse problem:
$$\mathbf{y} = \mathbf{f}(\theta) + \varepsilon$$

Where:
- \(\mathbf{y}\) = Observed spectral radiances (measurements)
- \(\mathbf{f}(\theta)\) = Forward radiative transfer model
- \(\theta\) = Parameters (CO₂ column, temperature profile, surface albedo, aerosols, ...)
- \(\varepsilon \sim \mathcal{N}(0, \boldsymbol{\Sigma}_{\text{noise}})\) = Measurement error
6.2 Bayesian Formulation
Define the likelihood assuming Gaussian errors:
$$L(\mathbf{y}|\theta) = (2\pi)^{-m/2} |\boldsymbol{\Sigma}_{\text{noise}}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{y} - \mathbf{f}(\theta))^T \boldsymbol{\Sigma}_{\text{noise}}^{-1}(\mathbf{y} - \mathbf{f}(\theta))\right)$$

The prior \(\pi(\theta)\) encodes information from climatology, seasonality, and physical constraints. For CO₂:

$$\pi(\theta) = \mathcal{N}(\theta|\boldsymbol{\mu}_{\text{prior}}, \boldsymbol{\Sigma}_{\text{prior}}) \times \mathbb{I}(\text{physical\_bounds})$$

The posterior combines likelihood and prior:

$$\pi(\theta|\mathbf{y}) \propto L(\mathbf{y}|\theta) \pi(\theta)$$

6.3 Parameter Space for CO₂ Retrieval
In CO₂ retrieval problems, parameters typically include:
| Parameter | Physical Meaning | Typical Range |
|---|---|---|
| \(X_{\text{CO}_2}\) | CO₂ column-averaged mole fraction | 380–420 ppm |
| \(T(z)\) | Temperature profile | 200–300 K |
| \(P_{\text{surface}}\) | Surface pressure | 900–1050 hPa |
| \(\text{Albedo}\) | Surface reflectivity | 0.01–0.6 |
| \(\text{Aerosol}\) | Optical depth & properties | 0–2 |
6.4 Advantages of MCMC for CO₂ Retrieval
- Handles high-dimensional parameter spaces (temperature profile, aerosols)
- Quantifies full posterior distributions, not just point estimates
- Accounts for correlated uncertainties via covariance matrices
- Scales to large datasets when using parallel chains
- Provides uncertainty propagation through forward model
- Naturally handles nuisance parameters via marginalization
6.5 MCMC Implementation for CO₂
A practical workflow:
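Since a real radiative transfer code is outside the scope of this note, the workflow can be sketched end-to-end with a stand-in linear forward model. Everything here (`forward`, the parameter values, the noise level, the step size) is an illustrative assumption, not an operational retrieval:

```python
import numpy as np

# Hypothetical stand-in for the radiative transfer model f(theta):
# a fixed linear map; a real retrieval would call a spectroscopic code.
rng = np.random.default_rng(0)
p, m = 4, 20                          # state dimension, spectral channels
A = rng.standard_normal((m, p))

def forward(theta):
    return A @ theta                  # placeholder for f(theta)

mu_prior = np.zeros(p)
Sigma_prior_inv = np.eye(p)           # unit prior covariance (illustrative)
noise_var = 0.1**2

true_theta = np.array([1.0, 0.5, -0.3, 0.2])
y = forward(true_theta) + 0.1 * rng.standard_normal(m)

def log_posterior(theta):
    resid = y - forward(theta)
    log_like = -0.5 * resid @ resid / noise_var
    d = theta - mu_prior
    log_prior = -0.5 * d @ Sigma_prior_inv @ d
    return log_like + log_prior

# Random-walk Metropolis over the retrieval state vector,
# started from a cheap least-squares first guess to shorten burn-in.
theta = np.linalg.lstsq(A, y, rcond=None)[0]
log_p = log_posterior(theta)
chain = np.empty((20_000, p))
for t in range(len(chain)):
    prop = theta + 0.02 * rng.standard_normal(p)
    log_p_prop = log_posterior(prop)
    if np.log(rng.uniform()) < log_p_prop - log_p:
        theta, log_p = prop, log_p_prop
    chain[t] = theta

posterior_mean = chain[5_000:].mean(axis=0)   # retrieved state estimate
posterior_sd = chain[5_000:].std(axis=0)      # 1-sigma uncertainties
```

The structure mirrors Section 6.2 exactly: a Gaussian likelihood built from forward-model residuals, a Gaussian prior, and posterior samples that deliver both the estimate and its uncertainty.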
6.6 Computational Considerations
For operational CO₂ retrieval:
- Forward model evaluation is expensive (radiative transfer) → use emulators or surrogates
- Parameter correlations are strong → adaptive methods essential
- Real-time processing demands → parallel chains, GPU acceleration
- Validation requires reference data (TCCON, in-situ) → posterior predictive checks
7. Convergence Diagnostics and Validation
7.1 The Convergence Problem
A finite run can never prove that the chain has reached its stationary distribution; diagnostics can reveal failure to converge, but cannot guarantee convergence.
7.2 Trace Plots and Visual Inspection
Plot \(\theta^{(t)}\) vs. iteration \(t\). Look for:
- Stationarity: No obvious trends
- Mixing: Random fluctuation around mean
- No stuck chains: Chain explores full posterior
7.3 Potential Scale Reduction Factor (PSRF) / \(\hat{R}\)
Run \(M\) independent chains. Compute within-chain (\(W\)) and between-chain (\(B\)) variances:
$$W = \frac{1}{M} \sum_{m=1}^{M} s_m^2 \qquad B = \frac{n}{M-1} \sum_{m=1}^{M} (\bar{\theta}_m - \bar{\theta})^2$$

Where \(s_m^2\) and \(\bar{\theta}_m\) are the sample variance and mean of chain \(m\), \(\bar{\theta}\) is the overall mean, and \(n\) is the chain length.
The PSRF is:
$$\hat{R} = \sqrt{\frac{(n-1)W + B}{nW}}$$

Values close to 1 (a common threshold is \(\hat{R} < 1.1\), with stricter modern practice using 1.01) suggest the chains have mixed.

7.4 Effective Sample Size (ESS)
Due to autocorrelation, MCMC samples are not independent. The effective sample size is:
$$\text{ESS} = \frac{N}{\tau}$$

Where \(\tau\) is the integrated autocorrelation time:

$$\tau = 1 + 2 \sum_{k=1}^{\infty} \rho(k)$$

And \(\rho(k)\) is the autocorrelation at lag \(k\). Rule of thumb: ESS > 400 per parameter.
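A simple estimator of \(\tau\) and ESS, truncating the sum at the first negative \(\hat\rho(k)\) (one common heuristic; production tools such as Stan and ArviZ use more careful truncation rules):

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=None):
    """Estimate tau = 1 + 2*sum_k rho(k), truncating the sum at the
    first negative estimated autocorrelation."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 3
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)   # rho(k) estimate
        if rho < 0:
            break
        tau += 2.0 * rho
    return tau

def effective_sample_size(x):
    return len(x) / integrated_autocorr_time(x)
```

For independent draws \(\tau \approx 1\) and ESS \(\approx N\); for a strongly autocorrelated chain such as an AR(1) process with coefficient \(\phi\), \(\tau = (1+\phi)/(1-\phi)\), so \(\phi = 0.9\) costs roughly a factor of 19 in effective samples.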
7.5 Autocorrelation Function
The autocorrelation at lag \(k\) is:
$$\rho(k) = \frac{\text{Cov}(\theta^{(t)}, \theta^{(t+k)})}{\text{Var}(\theta)}$$

Plot \(\rho(k)\) vs. \(k\): it should decay quickly to 0. Slow decay indicates poor mixing and the need for longer chains.
7.6 Posterior Predictive Checks
Generate predictions from posterior samples:
$$\mathbf{y}_{\text{rep}}^{(t)} \sim p(\mathbf{y} \mid \theta^{(t)})$$

Compare posterior predictive distribution with observed data \(\mathbf{y}\):

$$p(\mathbf{y}_{\text{rep}} \mid \mathbf{y}) = \int p(\mathbf{y}_{\text{rep}} \mid \theta) \pi(\theta|\mathbf{y}) d\theta \approx \frac{1}{N} \sum_{t} p(\mathbf{y}_{\text{rep}} \mid \theta^{(t)})$$

If the model is well-calibrated, \(\mathbf{y}\) should look like a draw from \(p(\mathbf{y}_{\text{rep}} \mid \mathbf{y})\).
7.7 Validation Against Reference Data
For CO₂ retrieval, validate against ground truth:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i} (X_{\text{CO}_2, \text{retrieved}, i} - X_{\text{CO}_2, \text{true}, i})^2}$$

$$\text{Bias} = \frac{1}{n} \sum_{i} (X_{\text{CO}_2, \text{retrieved}, i} - X_{\text{CO}_2, \text{true}, i})$$

$$\text{Coverage} = P(X_{\text{CO}_2, \text{true}} \in [X_{\text{CO}_2, \text{retrieved}} \pm 2\sigma]) \approx 0.95$$

8. Computational Considerations and Advanced Topics
8.1 Tuning Proposal Distributions
For random walk proposals, the optimal acceptance rate is ≈ 0.234 in high dimensions (Roberts et al., 1997):
- If acceptance rate < 0.2: reduce proposal variance (steps are too large and are mostly rejected)
- If acceptance rate > 0.3: increase proposal variance (steps are too small and the chain diffuses slowly)
- Target: acceptance rate ≈ 0.234
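These rules can be wrapped in a small warm-up adapter (a sketch; the multiplicative factors are arbitrary choices):

```python
def tune_step(step, acc_rate, target=0.234):
    """Multiplicative step-size adaptation toward the target acceptance
    rate. Run only during warm-up: adapting the proposal from the
    chain's own history after warm-up invalidates the stationary
    distribution unless adaptation is diminishing."""
    if acc_rate < 0.2:
        return step * 0.8          # too many rejections -> smaller steps
    if acc_rate > 0.3:
        return step * 1.2          # accepting too often -> larger steps
    return step
```

A typical pattern is to run the sampler in short warm-up batches, recompute the empirical acceptance rate after each batch, and call `tune_step` before the next one.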
8.2 Parallel Tempering
Run \(M\) chains at different "temperatures":
$$\pi_{\tau_m}(\theta) \propto [\pi(\theta)]^{1/\tau_m} \quad \text{where } 1 = \tau_1 < \tau_2 < \ldots < \tau_M$$

Hotter chains explore more, and they occasionally swap states with cooler chains. This helps escape local modes.
8.3 Variational Inference as Approximation
For large-scale problems, variational inference provides a fast approximation to the posterior:
$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q\,\|\,\pi) = \arg\max_{q} \left[\int q(\theta) \log \pi(\theta|\mathbf{y}) d\theta - \int q(\theta) \log q(\theta) d\theta\right]$$

The advantage: no burn-in and no mixing diagnostics (though the optimization itself must still converge). The disadvantage: approximation error, and posterior uncertainty that is typically underestimated.
8.4 Emulators and Surrogate Models
When the forward model is expensive, train a surrogate:
$$\hat{\mathbf{f}}(\theta) \approx \mathbf{f}(\theta) \quad \text{where } \hat{\mathbf{f}} \text{ is fast to evaluate}$$

Options: Gaussian processes, neural networks, polynomial chaos. Use the true model for final refinement.
8.5 Importance Sampling and Reweighting
If you have samples from \(q(\theta)\), estimate integrals under \(\pi(\theta)\):
$$\mathbb{E}_{\pi}[g(\theta)] = \int g(\theta) \pi(\theta) d\theta \approx \sum_{i} w_i g(\theta_i)$$

Where weights are:

$$w_i = \frac{\pi(\theta_i)}{q(\theta_i)} \quad \text{(normalized to sum to 1)}$$

Useful for re-weighting existing MCMC samples without re-running the chain.
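A sketch of self-normalized importance sampling, with weights computed on the log scale for stability (the function name and the Gaussian example are illustrative):

```python
import numpy as np

def importance_reweight(g_vals, log_pi, log_q):
    """Self-normalized importance sampling estimate of E_pi[g] from
    samples drawn under q. Because the weights are normalized to sum
    to one, any constants in log_pi and log_q cancel, so unnormalized
    log-densities are fine."""
    log_w = np.asarray(log_pi) - np.asarray(log_q)
    log_w = log_w - log_w.max()        # stabilize before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()
    return float(np.sum(w * g_vals))

# Example: estimate E_pi[theta^2] = 1 for pi = N(0, 1),
# using draws from the wider proposal q = N(0, 2^2).
rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal(100_000)
est = importance_reweight(x**2, -0.5 * x**2, -0.5 * x**2 / 4.0)
```

The proposal here is deliberately wider than the target; when \(q\) has thinner tails than \(\pi\), a few huge weights can dominate and the estimator degrades badly.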
8.6 Software Implementations
| Software | Language | Algorithms | Best For |
|---|---|---|---|
| Stan | C++/R/Python | HMC, NUTS | General Bayesian inference |
| PyMC3/PyMC | Python | NUTS, Gibbs, custom | Flexible modeling |
| JAGS | C++/R | Gibbs, slice sampling | Hierarchical models |
| emcee | Python | Affine invariant ensemble | High-dimensional posteriors |
| Julia (Turing.jl) | Julia | HMC, Gibbs, custom | Scientific computing, speed |
Summary: Key Takeaways
Essential Concepts
- MCMC constructs a Markov chain whose stationary distribution equals the target posterior
- Bayesian inference combines likelihood and prior to form a posterior
- Metropolis-Hastings is the most general algorithm; acceptance ratio ensures correct stationary distribution
- Gibbs sampling is efficient when full conditionals are tractable
- HMC/NUTS use gradients for better mixing and higher acceptance rates
- Convergence diagnostics (\(\hat{R}\), ESS, trace plots) assess chain behavior
- Uncertainty quantification is automatic from posterior samples
- CO₂ retrieval benefits from MCMC's ability to handle high-dimensional correlated parameters
The MCMC Workflow for CO₂ Retrieval
- Formulate Bayesian model: likelihood from radiative transfer, prior from climatology
- Implement (or use existing) MCMC sampler with efficient proposals
- Run parallel chains from dispersed starting points
- Monitor convergence (\(\hat{R}\), ESS, mixing time)
- Discard burn-in; keep post-convergence samples
- Estimate CO₂ and uncertainties from posterior samples
- Validate with ground truth data (TCCON, aircraft)
- Quantify systematic errors via posterior predictive checks
Why MCMC is Ideal for Remote Sensing Inverse Problems
- ✓ Handles ill-posed problems with strong prior information
- ✓ Propagates uncertainty through expensive forward models
- ✓ Natural accounting for correlated measurement errors
- ✓ Scales to many simultaneous unknown parameters
- ✓ Marginalizes nuisance parameters automatically
- ✓ Provides full posterior, not just point estimates