Table of Contents
- Mathematical Fundamentals
- Probability Theory and Distributions
- Markov Chains and Stationarity
- MCMC: Theoretical Framework
- MCMC Algorithms (Metropolis-Hastings, Gibbs)
- Application to CO₂ Retrieval from Remote Sensing
- Convergence Diagnostics and Validation
- Computational Considerations
1. Mathematical Fundamentals
1.1 Probability Measure and Sample Spaces
A probability measure is a function \(P: \mathcal{F} \rightarrow [0,1]\) defined on a \(\sigma\)-algebra \(\mathcal{F}\) of subsets of a sample space \(\Omega\), satisfying:
$$\begin{eqnarray} P(\Omega) & = & 1 \quad \text{(Normalization)} \\ P(\cup_i A_i) & = & \sum_i P(A_i) \quad \text{for disjoint sets } A_i \text{ (Countable additivity)} \end{eqnarray}$$

1.2 Random Variables and Distributions
A random variable \(X\) is a measurable function \(X:\Omega \rightarrow \mathbb{R}\). The probability distribution of \(X\) is characterized by:
$$F(x) = P(X \leq x)$$
For continuous distributions, the probability density function (PDF) is:
$$f(x) = \frac{dF(x)}{dx}$$

1.3 Expectation and Moments
The expectation (mean) of a random variable \(X\) with PDF \(f(x)\) is:
$$\mathbb{E}[X] = \int x f(x) dx$$

The variance quantifies spread around the mean:

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

The covariance between two random variables \(X\) and \(Y\) is:

$$\text{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

1.4 Conditional Probability and Independence
The conditional probability of event \(A\) given event \(B\) (where \(P(B) > 0\)) is:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

For random variables, the conditional density is:

$$f(x|y) = \frac{f(x,y)}{f(y)}$$

\(X\) and \(Y\) are conditionally independent given \(Z\) if:

$$f(x,y|z) = f(x|z)f(y|z)$$

2. Probability Theory and Distributions
2.1 Joint, Marginal, and Conditional Distributions
For a multivariate distribution with density \(f(x_1, x_2, \ldots, x_n)\):
$$f(x_1) = \int\int\cdots\int f(x_1, x_2, \ldots, x_n) dx_2 dx_3 \cdots dx_n$$
2.2 Bayes' Theorem
The foundational theorem for Bayesian inference is:
$$P(\theta|y) = \frac{P(y|\theta) P(\theta)}{P(y)}$$

Or in terms of densities:

$$\pi(\theta|y) = \frac{L(y|\theta) \pi(\theta)}{\int L(y|\theta) \pi(\theta) d\theta}$$

Where:
- \(\pi(\theta|y)\) = Posterior distribution
- \(L(y|\theta)\) = Likelihood function
- \(\pi(\theta)\) = Prior distribution
- \(\int L(y|\theta) \pi(\theta) d\theta\) = Normalizing constant (evidence)
2.3 Important Distributions
| Distribution | PDF/PMF | Mean | Variance |
|---|---|---|---|
| \(\mathcal{N}(\mu, \sigma^2)\) | \(\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) | \(\mu\) | \(\sigma^2\) |
| \(\text{Uniform}(a,b)\) | \(\frac{1}{b-a}\) for \(x \in [a,b]\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
| \(\text{Exponential}(\lambda)\) | \(\lambda \exp(-\lambda x)\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) |
| \(\text{Beta}(\alpha,\beta)\) | \(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}\) | \(\frac{\alpha}{\alpha+\beta}\) | \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\) |
2.4 Multivariate Normal Distribution
The \(n\)-dimensional multivariate normal distribution has PDF:
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-n/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Where:
- \(\boldsymbol{\mu} \in \mathbb{R}^n\) = Mean vector
- \(\boldsymbol{\Sigma} \in \mathbb{R}^{n \times n}\) = Covariance matrix (positive definite)
- \(|\boldsymbol{\Sigma}|\) = Determinant of \(\boldsymbol{\Sigma}\)
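In practice the multivariate normal log-density is evaluated through a Cholesky factor of \(\boldsymbol{\Sigma}\), which avoids forming \(\boldsymbol{\Sigma}^{-1}\) or \(|\boldsymbol{\Sigma}|\) explicitly. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma), evaluated via a Cholesky factor
    for numerical stability (no explicit inverse or determinant)."""
    n = len(mu)
    L = np.linalg.cholesky(Sigma)           # Sigma = L @ L.T
    z = np.linalg.solve(L, x - mu)          # solves L z = x - mu
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2 * np.pi) + log_det + z @ z)
```

For a one-dimensional standard normal this reduces to \(-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}x^2\), matching the table above.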
2.5 Log-Probability and Numerical Stability
In computational applications, we work with log-probabilities to avoid underflow:
$$\log \pi(\theta|y) = \log L(y|\theta) + \log \pi(\theta) - \log p(y)$$

For optimization and MCMC, we can work with the unnormalized log-posterior, which differs from \(\log \pi(\theta|y)\) only by an additive constant:

$$\ell(\theta) = \log L(y|\theta) + \log \pi(\theta) = \log \pi(\theta|y) + \text{const}$$

3. Markov Chains and Stationarity
3.1 Markov Property
A sequence of random variables \(\{X_t\}_{t=0}^{\infty}\) forms a Markov chain if:
$$P(X_{t+1} = x_{t+1} \mid X_0 = x_0, \ldots, X_t = x_t) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)$$

The future depends only on the current state, not the history. This is called the memoryless property.
3.2 Transition Kernels
The transition kernel \(K(x, x')\) specifies the probability of moving from state \(x\) to state \(x'\):
$$K(x, x') = P(X_{t+1} \in dx' \mid X_t = x)$$

For a discrete state space, this is the transition probability matrix \(\mathbf{P}\) with elements:

$$P_{ij} = P(X_{t+1} = j \mid X_t = i)$$

The transition matrix must satisfy row-stochasticity:

$$\sum_j P_{ij} = 1 \quad \text{for all } i$$

3.3 Stationary Distribution
A probability distribution \(\pi\) is stationary (or invariant) for a Markov chain with transition kernel \(K\) if:
$$\pi(x') = \int \pi(x) K(x, x') dx$$

In matrix form for discrete chains:

$$\boldsymbol{\pi} = \boldsymbol{\pi} \mathbf{P} \quad \text{(\(\boldsymbol{\pi}\) is a left eigenvector of \(\mathbf{P}\) with eigenvalue 1)}$$

3.4 Detailed Balance (Reversibility)
A chain satisfies detailed balance with respect to \(\pi\) if:
$$\pi(x) K(x, x') = \pi(x') K(x', x)$$

Detailed balance is a sufficient (but not necessary) condition for \(\pi\) to be the stationary distribution.
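Both conditions are easy to check numerically for a small discrete chain. A sketch with an illustrative two-state transition matrix:

```python
import numpy as np

# An illustrative 2-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# For a 2-state chain the stationary distribution has the closed form
# pi proportional to (P[1,0], P[0,1]).
pi = np.array([P[1, 0], P[0, 1]])
pi = pi / pi.sum()                                    # -> [0.75, 0.25]

assert np.allclose(pi @ P, pi)                        # invariance: pi P = pi
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance
```

Here detailed balance holds because the chain is reversible; a chain with a directed "cycle" among three or more states can still have a stationary distribution while violating detailed balance.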
3.5 Ergodicity and Convergence
A chain is ergodic if it satisfies:
- Irreducibility: From any state, we can reach any other state in a finite number of steps
- Aperiodicity: The chain does not return to states in cycles of fixed length \(> 1\)
For an ergodic chain with stationary distribution \(\pi\):
$$\lim_{t \to \infty} P(X_t \in A \mid X_0 = x) = \pi(A) \quad \text{for all } x \text{ and measurable sets } A$$

3.6 Law of Large Numbers for Markov Chains
Under ergodicity, empirical averages converge to expectations under \(\pi\):

$$\frac{1}{T} \sum_{t=1}^{T} f(X_t) \to \int f(x) \pi(x) dx \quad \text{as } T \to \infty$$

This is the basis for MCMC inference: we can estimate posterior expectations by averaging over sampled chains.
4. MCMC: Theoretical Framework
4.1 The Core Principle
MCMC constructs a Markov chain whose stationary distribution is a target distribution \(\pi\) (e.g., a posterior). We then sample from this chain and use the samples to estimate quantities of interest.
Given observed data \(y\) and a model with parameters \(\theta\):
1. Define likelihood: \(L(y|\theta)\)
2. Choose prior: \(\pi(\theta)\)
3. Posterior: \(\pi(\theta|y) \propto L(y|\theta) \pi(\theta)\)
4. Design Markov chain with \(\pi(\theta|y)\) as stationary distribution
5. Sample: \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\)
6. Estimate: \(\mathbb{E}[g(\theta)|y] \approx \frac{1}{N} \sum_{i=1}^{N} g(\theta^{(i)})\)
4.2 Why MCMC Works
MCMC solves a fundamental problem: we want to sample from \(\pi\) but can only evaluate it up to a normalizing constant.
If we construct a chain with transition kernel \(K\) such that:
- \(\pi\) is the stationary distribution of \(K\)
- The chain is irreducible and aperiodic
Then for large enough \(t\), \(X_t \sim \pi\) (approximately).
4.3 Acceptance-Rejection Framework
Many MCMC algorithms use an accept-reject mechanism. Proposed move from \(x\) to \(x'\) is accepted with probability:
$$\alpha(x, x') = \min(1, r(x, x'))$$

Where \(r(x, x')\) depends on the algorithm. With this mechanism:

$$K(x, x') = \alpha(x, x') q(x, x') + \left(1 - \int \alpha(x, z) q(x, z) dz\right) \delta_x(x')$$

Where \(q(x, x')\) is the proposal distribution and \(\delta_x(x')\) is the point mass at \(x\) (stay in place if rejected).
4.4 Metropolis-Hastings Condition
To ensure detailed balance (and thus \(\pi\) as the stationary distribution), we need:
$$\pi(x) q(x, x') \alpha(x, x') = \pi(x') q(x', x) \alpha(x', x)$$

The Metropolis-Hastings ratio ensures this:

$$r(x, x') = \frac{\pi(x') q(x', x)}{\pi(x) q(x, x')}$$

With acceptance probability \(\alpha(x, x') = \min(1, r(x, x'))\), detailed balance is automatically satisfied.
4.5 Burn-in and Mixing
Due to dependence on initial conditions, early samples are discarded (burn-in phase). After burn-in, we approximate:

$$\pi(\theta|y) \approx \frac{1}{N-B} \sum_{t=B+1}^{N} \delta_{\theta^{(t)}}(\theta)$$

Where \(B\) is the burn-in period. Mixing refers to how quickly the chain explores the posterior; faster mixing means faster convergence.
5. MCMC Algorithms
5.1 Metropolis-Hastings Algorithm
The most general MCMC algorithm. Given target \(\pi\) and proposal \(q\):
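The algorithm can be sketched as a minimal random-walk Metropolis sampler in NumPy (the standard-normal target, function name, and tuning values here are all illustrative):

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the
    Hastings ratio reduces to pi(theta*) / pi(theta_t).
    log_target may be unnormalized (constants cancel in the ratio)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    log_p = log_target(theta)
    samples = np.empty(n_samples)
    accepted = 0
    for t in range(n_samples):
        proposal = theta + step * rng.standard_normal()
        log_p_prop = log_target(proposal)
        # accept with probability min(1, r), compared on the log scale
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_samples

# Target: unnormalized standard normal log-density
samples, acc_rate = metropolis_hastings(lambda x: -0.5 * x**2, 0.0, 50_000)
```

With a symmetric proposal this is the special case of Section 5.2; an asymmetric proposal would add the \(\log q(x|x') - \log q(x'|x)\) correction to the log-ratio.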
5.2 Random Walk Metropolis
A simple proposal using a symmetric random walk:
$$q(\theta^*|\theta^{(t)}) = \mathcal{N}(\theta^*|\theta^{(t)}, \sigma^2 \mathbf{I})$$

This is symmetric, so \(q(\theta^{(t)}|\theta^*) = q(\theta^*|\theta^{(t)})\), and the acceptance ratio simplifies to:

$$r = \frac{\pi(\theta^*)}{\pi(\theta^{(t)})}$$

The proposal variance \(\sigma^2\) controls step size; tuning it is critical for efficiency.
5.3 Gibbs Sampling
For multivariate targets, Gibbs sampling updates one component at a time from its full conditional:
At step \(t+1\), update each component \(j = 1, \ldots, p\) in turn from its full conditional:

$$\theta_j^{(t+1)} \sim \pi\left(\theta_j \,\middle|\, \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_p^{(t)}\right)$$
Gibbs sampling always accepts moves (acceptance rate = 100%) when full conditionals are tractable.
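A concrete sketch for a case where the full conditionals are known in closed form: a bivariate normal with correlation \(\rho\) (function name and values are illustrative):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for (x, y) ~ N(0, [[1, rho], [rho, 1]]).
    Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    s = np.sqrt(1.0 - rho**2)            # conditional standard deviation
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        x = rho * y + s * rng.standard_normal()   # draw x | y
        y = rho * x + s * rng.standard_normal()   # draw y | x
        out[t] = x, y
    return out

draws = gibbs_bivariate_normal(rho=0.8, n_samples=50_000)
```

Note that strong correlation between components slows Gibbs mixing even though every move is accepted: each conditional draw can only move a short distance along the correlated ridge.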
5.4 Adaptive Metropolis
Adapt proposal covariance online to improve mixing. At iteration \(t\), use estimated posterior covariance:
$$\boldsymbol{\Sigma}^{(t)} = \text{Cov}(\theta^{(1)}, \ldots, \theta^{(t)}) + \varepsilon \mathbf{I}$$

And propose:

$$\theta^* \sim \mathcal{N}\left(\theta^{(t)}, \frac{2.38^2}{p} \boldsymbol{\Sigma}^{(t)}\right)$$

The constant \(\frac{2.38^2}{p}\), where \(p\) is the parameter dimension, is asymptotically optimal for high-dimensional Gaussian targets (Roberts et al., 1997).
5.5 Hamiltonian Monte Carlo (HMC)
Uses gradient information to propose moves that are less likely to be rejected.
Augment parameter space with auxiliary momentum variables \(\mathbf{p} \sim \mathcal{N}(0, \mathbf{M})\):
$$H(\theta, \mathbf{p}) = -\log \pi(\theta) + \frac{1}{2} \mathbf{p}^T \mathbf{M}^{-1} \mathbf{p}$$

Proposal follows Hamiltonian dynamics:

$$\frac{d\theta}{dt} = \nabla_{\mathbf{p}} H = \mathbf{M}^{-1} \mathbf{p} \qquad \frac{d\mathbf{p}}{dt} = -\nabla_{\theta} H = \nabla_{\theta} \log \pi(\theta)$$

HMC typically has much higher acceptance rates and better mixing than random walk methods, especially in high dimensions.
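In practice the dynamics are discretized with the leapfrog integrator, which is reversible and volume-preserving, so the acceptance step only has to correct for discretization error in \(H\). A sketch for scalar \(\theta\) with \(\mathbf{M} = 1\) (`grad_log_pi` is assumed supplied by the model):

```python
def leapfrog(theta, p, grad_log_pi, step, n_steps, M_inv=1.0):
    """Leapfrog integration of the Hamiltonian dynamics above:
    half step in momentum, alternating full steps, final half step."""
    p = p + 0.5 * step * grad_log_pi(theta)        # half step in momentum
    for _ in range(n_steps - 1):
        theta = theta + step * M_inv * p           # full step in position
        p = p + step * grad_log_pi(theta)          # full step in momentum
    theta = theta + step * M_inv * p               # last position step
    p = p + 0.5 * step * grad_log_pi(theta)        # closing half step
    return theta, p
```

For a standard normal target (\(\log \pi(\theta) = -\theta^2/2\), gradient \(-\theta\)) the trajectory rotates in \((\theta, p)\) phase space while approximately conserving \(H\); the Metropolis correction then accepts with probability \(\min(1, e^{-\Delta H})\).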
5.6 No-U-Turn Sampler (NUTS)
An extension of HMC that automatically determines trajectory length by detecting when the chain starts doubling back (U-turn).
This reduces manual tuning and improves efficiency for complex posteriors.
6. Application to CO₂ Retrieval from Remote Sensing
6.1 Remote Sensing Inverse Problem
Estimating atmospheric CO₂ from satellite measurements is an inverse problem:
$$\mathbf{y} = \mathbf{f}(\theta) + \varepsilon$$

Where:
- \(\mathbf{y}\) = Observed spectral radiances (measurements)
- \(\mathbf{f}(\theta)\) = Forward radiative transfer model
- \(\theta\) = Parameters (CO₂ column, temperature profile, surface albedo, aerosols, ...)
- \(\varepsilon \sim \mathcal{N}(0, \boldsymbol{\Sigma}_{\text{noise}})\) = Measurement error
6.2 Bayesian Formulation
Define the likelihood assuming Gaussian errors:
$$L(\mathbf{y}|\theta) = (2\pi)^{-m/2} |\boldsymbol{\Sigma}_{\text{noise}}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{y} - \mathbf{f}(\theta))^T \boldsymbol{\Sigma}_{\text{noise}}^{-1}(\mathbf{y} - \mathbf{f}(\theta))\right)$$

The prior \(\pi(\theta)\) encodes information from climatology, seasonality, and physical constraints. For CO₂:

$$\pi(\theta) = \mathcal{N}(\theta|\boldsymbol{\mu}_{\text{prior}}, \boldsymbol{\Sigma}_{\text{prior}}) \times \mathbb{I}(\text{physical\_bounds})$$

The posterior combines likelihood and prior:

$$\pi(\theta|\mathbf{y}) \propto L(\mathbf{y}|\theta) \pi(\theta)$$

6.3 Parameter Space for CO₂ Retrieval
In CO₂ retrieval problems, parameters typically include:
| Parameter | Physical Meaning | Typical Range |
|---|---|---|
| \(X_{\text{CO}_2}\) | CO₂ column-averaged mole fraction | 380–420 ppm |
| \(T(z)\) | Temperature profile | 200–300 K |
| \(P_{\text{surface}}\) | Surface pressure | 900–1050 hPa |
| \(\text{Albedo}\) | Surface reflectivity | 0.01–0.6 |
| \(\text{Aerosol}\) | Optical depth & properties | 0–2 |
6.4 Advantages of MCMC for CO₂ Retrieval
- Handles high-dimensional parameter spaces (temperature profile, aerosols)
- Quantifies full posterior distributions, not just point estimates
- Accounts for correlated uncertainties via covariance matrices
- Scales to large datasets when using parallel chains
- Provides uncertainty propagation through forward model
- Naturally handles nuisance parameters via marginalization
6.5 MCMC Implementation for CO₂
A practical workflow:
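Since a real radiative transfer code is outside the scope of this note, the workflow can be sketched end-to-end with a stand-in linear forward model. Everything here (`forward`, the parameter values, the noise level, the step size) is an illustrative assumption, not an operational retrieval:

```python
import numpy as np

# Hypothetical stand-in for the radiative transfer model f(theta):
# a fixed linear map; a real retrieval would call a spectroscopic code.
rng = np.random.default_rng(0)
p, m = 4, 20                          # state dimension, spectral channels
A = rng.standard_normal((m, p))

def forward(theta):
    return A @ theta                  # placeholder for f(theta)

mu_prior = np.zeros(p)
Sigma_prior_inv = np.eye(p)           # unit prior covariance (illustrative)
noise_var = 0.1**2

true_theta = np.array([1.0, 0.5, -0.3, 0.2])
y = forward(true_theta) + 0.1 * rng.standard_normal(m)

def log_posterior(theta):
    resid = y - forward(theta)
    log_like = -0.5 * resid @ resid / noise_var
    d = theta - mu_prior
    log_prior = -0.5 * d @ Sigma_prior_inv @ d
    return log_like + log_prior

# Random-walk Metropolis over the retrieval state vector,
# started from a cheap least-squares first guess to shorten burn-in.
theta = np.linalg.lstsq(A, y, rcond=None)[0]
log_p = log_posterior(theta)
chain = np.empty((20_000, p))
for t in range(len(chain)):
    prop = theta + 0.02 * rng.standard_normal(p)
    log_p_prop = log_posterior(prop)
    if np.log(rng.uniform()) < log_p_prop - log_p:
        theta, log_p = prop, log_p_prop
    chain[t] = theta

posterior_mean = chain[5_000:].mean(axis=0)   # retrieved state estimate
posterior_sd = chain[5_000:].std(axis=0)      # 1-sigma uncertainties
```

The structure mirrors Section 6.2 exactly: a Gaussian likelihood built from forward-model residuals, a Gaussian prior, and posterior samples that deliver both the estimate and its uncertainty.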
6.6 Computational Considerations
For operational CO₂ retrieval:
- Forward model evaluation is expensive (radiative transfer) → use emulators or surrogates
- Parameter correlations are strong → adaptive methods essential
- Real-time processing demands → parallel chains, GPU acceleration
- Validation requires reference data (TCCON, in-situ) → posterior predictive checks
7. Convergence Diagnostics and Validation
7.1 The Convergence Problem
A finite run can never prove that the chain has reached its stationary distribution; diagnostics can reveal failure to converge, but cannot guarantee convergence.
7.2 Trace Plots and Visual Inspection
Plot \(\theta^{(t)}\) vs. iteration \(t\). Look for:
- Stationarity: No obvious trends
- Mixing: Random fluctuation around mean
- No stuck chains: Chain explores full posterior
7.3 Potential Scale Reduction Factor (PSRF) / \(\hat{R}\)
Run \(M\) independent chains. Compute within-chain (\(W\)) and between-chain (\(B\)) variances:
$$W = \frac{1}{M} \sum_{m=1}^{M} s_m^2 \qquad B = \frac{n}{M-1} \sum_{m=1}^{M} (\bar{\theta}_m - \bar{\theta})^2$$

Where \(s_m^2\) and \(\bar{\theta}_m\) are the sample variance and mean of chain \(m\), \(\bar{\theta}\) is the overall mean, and \(n\) is the chain length.
The PSRF is:
$$\hat{R} = \sqrt{\frac{(n-1)W + B}{nW}}$$

Values close to 1 (a common threshold is \(\hat{R} < 1.1\), with stricter modern practice using 1.01) suggest the chains have mixed.

7.4 Effective Sample Size (ESS)
Due to autocorrelation, MCMC samples are not independent. The effective sample size is:
$$\text{ESS} = \frac{N}{\tau}$$

Where \(\tau\) is the integrated autocorrelation time:

$$\tau = 1 + 2 \sum_{k=1}^{\infty} \rho(k)$$

And \(\rho(k)\) is the autocorrelation at lag \(k\). Rule of thumb: ESS > 400 per parameter.
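A simple estimator of \(\tau\) and ESS, truncating the sum at the first negative \(\hat\rho(k)\) (one common heuristic; production tools such as Stan and ArviZ use more careful truncation rules):

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=None):
    """Estimate tau = 1 + 2*sum_k rho(k), truncating the sum at the
    first negative estimated autocorrelation."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 3
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)   # rho(k) estimate
        if rho < 0:
            break
        tau += 2.0 * rho
    return tau

def effective_sample_size(x):
    return len(x) / integrated_autocorr_time(x)
```

For independent draws \(\tau \approx 1\) and ESS \(\approx N\); for a strongly autocorrelated chain such as an AR(1) process with coefficient \(\phi\), \(\tau = (1+\phi)/(1-\phi)\), so \(\phi = 0.9\) costs roughly a factor of 19 in effective samples.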
7.5 Autocorrelation Function
The autocorrelation at lag \(k\) is:
$$\rho(k) = \frac{\text{Cov}(\theta^{(t)}, \theta^{(t+k)})}{\text{Var}(\theta)}$$

Plot \(\rho(k)\) vs. \(k\): it should decay quickly to 0. Slow decay indicates poor mixing and the need for longer chains.
7.6 Posterior Predictive Checks
Generate predictions from posterior samples:
$$\mathbf{y}_{\text{rep}}^{(t)} \sim p(\mathbf{y} \mid \theta^{(t)})$$

Compare posterior predictive distribution with observed data \(\mathbf{y}\):

$$p(\mathbf{y}_{\text{rep}} \mid \mathbf{y}) = \int p(\mathbf{y}_{\text{rep}} \mid \theta) \pi(\theta|\mathbf{y}) d\theta \approx \frac{1}{N} \sum_{t} p(\mathbf{y}_{\text{rep}} \mid \theta^{(t)})$$

If the model is well-calibrated, \(\mathbf{y}\) should look like a draw from \(p(\mathbf{y}_{\text{rep}} \mid \mathbf{y})\).
7.7 Validation Against Reference Data
For CO₂ retrieval, validate against ground truth:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i} (X_{\text{CO}_2, \text{retrieved}, i} - X_{\text{CO}_2, \text{true}, i})^2}$$

$$\text{Bias} = \frac{1}{n} \sum_{i} (X_{\text{CO}_2, \text{retrieved}, i} - X_{\text{CO}_2, \text{true}, i})$$

$$\text{Coverage} = P(X_{\text{CO}_2, \text{true}} \in [X_{\text{CO}_2, \text{retrieved}} \pm 2\sigma]) \approx 0.95$$

8. Computational Considerations and Advanced Topics
8.1 Tuning Proposal Distributions
For random walk proposals, the optimal acceptance rate is ≈ 0.234 in high dimensions (Roberts et al., 1997):
- If acceptance rate < 0.2: reduce proposal variance (steps are too large and are mostly rejected)
- If acceptance rate > 0.3: increase proposal variance (steps are too small and the chain diffuses slowly)
- Target: acceptance rate ≈ 0.234
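These rules can be wrapped in a small warm-up adapter (a sketch; the multiplicative factors are arbitrary choices):

```python
def tune_step(step, acc_rate, target=0.234):
    """Multiplicative step-size adaptation toward the target acceptance
    rate. Run only during warm-up: adapting the proposal from the
    chain's own history after warm-up invalidates the stationary
    distribution unless adaptation is diminishing."""
    if acc_rate < 0.2:
        return step * 0.8          # too many rejections -> smaller steps
    if acc_rate > 0.3:
        return step * 1.2          # accepting too often -> larger steps
    return step
```

A typical pattern is to run the sampler in short warm-up batches, recompute the empirical acceptance rate after each batch, and call `tune_step` before the next one.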
8.2 Parallel Tempering
Run \(M\) chains at different "temperatures":
$$\pi_{\tau_m}(\theta) \propto [\pi(\theta)]^{1/\tau_m} \quad \text{where } 1 = \tau_1 < \tau_2 < \ldots < \tau_M$$

Hotter chains explore more, and they occasionally swap states with cooler chains. This helps escape local modes.
8.3 Variational Inference as Approximation
For large-scale problems, variational inference provides a fast approximation to the posterior:
$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q\,\|\,\pi) = \arg\max_{q} \left[\int q(\theta) \log \pi(\theta|\mathbf{y}) d\theta - \int q(\theta) \log q(\theta) d\theta\right]$$

The advantage: no burn-in and no mixing diagnostics (though the optimization itself must still converge). The disadvantage: approximation error, and posterior uncertainty that is typically underestimated.
8.4 Emulators and Surrogate Models
When the forward model is expensive, train a surrogate:
$$\hat{\mathbf{f}}(\theta) \approx \mathbf{f}(\theta) \quad \text{where } \hat{\mathbf{f}} \text{ is fast to evaluate}$$

Options: Gaussian processes, neural networks, polynomial chaos. Use the true model for final refinement.
8.5 Importance Sampling and Reweighting
If you have samples from \(q(\theta)\), estimate integrals under \(\pi(\theta)\):
$$\mathbb{E}_{\pi}[g(\theta)] = \int g(\theta) \pi(\theta) d\theta \approx \sum_{i} w_i g(\theta_i)$$

Where weights are:

$$w_i = \frac{\pi(\theta_i)}{q(\theta_i)} \quad \text{(normalized to sum to 1)}$$

Useful for re-weighting existing MCMC samples without re-running the chain.
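A sketch of self-normalized importance sampling, with weights computed on the log scale for stability (the function name and the Gaussian example are illustrative):

```python
import numpy as np

def importance_reweight(g_vals, log_pi, log_q):
    """Self-normalized importance sampling estimate of E_pi[g] from
    samples drawn under q. Because the weights are normalized to sum
    to one, any constants in log_pi and log_q cancel, so unnormalized
    log-densities are fine."""
    log_w = np.asarray(log_pi) - np.asarray(log_q)
    log_w = log_w - log_w.max()        # stabilize before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()
    return float(np.sum(w * g_vals))

# Example: estimate E_pi[theta^2] = 1 for pi = N(0, 1),
# using draws from the wider proposal q = N(0, 2^2).
rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal(100_000)
est = importance_reweight(x**2, -0.5 * x**2, -0.5 * x**2 / 4.0)
```

The proposal here is deliberately wider than the target; when \(q\) has thinner tails than \(\pi\), a few huge weights can dominate and the estimator degrades badly.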
8.6 Software Implementations
| Software | Language | Algorithms | Best For |
|---|---|---|---|
| Stan | C++/R/Python | HMC, NUTS | General Bayesian inference |
| PyMC3/PyMC | Python | NUTS, Gibbs, custom | Flexible modeling |
| JAGS | C++/R | Gibbs, slice sampling | Hierarchical models |
| emcee | Python | Affine invariant ensemble | High-dimensional posteriors |
| Julia (Turing.jl) | Julia | HMC, Gibbs, custom | Scientific computing, speed |
Summary: Key Takeaways
Essential Concepts
- MCMC constructs a Markov chain whose stationary distribution equals the target posterior
- Bayesian inference combines likelihood and prior to form a posterior
- Metropolis-Hastings is the most general algorithm; acceptance ratio ensures correct stationary distribution
- Gibbs sampling is efficient when full conditionals are tractable
- HMC/NUTS use gradients for better mixing and higher acceptance rates
- Convergence diagnostics (\(\hat{R}\), ESS, trace plots) assess chain behavior
- Uncertainty quantification is automatic from posterior samples
- CO₂ retrieval benefits from MCMC's ability to handle high-dimensional correlated parameters
The MCMC Workflow for CO₂ Retrieval
- Formulate Bayesian model: likelihood from radiative transfer, prior from climatology
- Implement (or use existing) MCMC sampler with efficient proposals
- Run parallel chains from dispersed starting points
- Monitor convergence (\(\hat{R}\), ESS, mixing time)
- Discard burn-in; keep post-convergence samples
- Estimate CO₂ and uncertainties from posterior samples
- Validate with ground truth data (TCCON, aircraft)
- Quantify systematic errors via posterior predictive checks
Why MCMC is Ideal for Remote Sensing Inverse Problems
- ✓ Handles ill-posed problems with strong prior information
- ✓ Propagates uncertainty through expensive forward models
- ✓ Natural accounting for correlated measurement errors
- ✓ Scales to many simultaneous unknown parameters
- ✓ Marginalizes nuisance parameters automatically
- ✓ Provides full posterior, not just point estimates