1.1 What You Already Know: Gradient Descent

We are familiar with gradient descent (see the Gradient Descent page), a method to minimize a loss function \( L(\theta) \) by iteratively updating the model parameters \( \theta \) in the direction opposite to the gradient:

\[ \theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(\theta_t) \]

Where:

  • \( \theta_t \) = parameters at step \( t \)
  • \( \eta \) = learning rate (step size)
  • \( \nabla_\theta L(\theta_t) \) = gradient of the loss w.r.t. parameters

The key idea: Move parameters in the direction that most reduces the loss.

Example — Linear Regression with MSE:

Given data \( \{(x_i, y_i)\}_{i=1}^{n} \), model \( \hat{y} = \theta^T x \), and loss:

\[ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \]

The gradient is:

\[ \nabla_\theta L = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) x_i \]

We update \( \theta \) repeatedly until convergence. This works whenever the model is parameterized and differentiable (linear models, neural networks, etc.).
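The update rule above can be sketched in a few lines of NumPy. The data here is synthetic (a single feature with true slope 3), chosen purely for illustration:

```python
import numpy as np

# Synthetic data: y = 3*x + noise (values are illustrative, not from the text)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 0.1, size=100)

theta = np.zeros(1)   # single weight, no intercept, to keep the sketch minimal
eta = 0.1             # learning rate

for _ in range(500):
    y_hat = x @ theta
    grad = -(2.0 / len(y)) * x.T @ (y - y_hat)   # gradient of the MSE loss
    theta -= eta * grad                          # step opposite to the gradient

print(theta)   # close to the true slope, approximately [3.]
```

Because the MSE loss is convex in \( \theta \), the iterates converge to the least-squares solution for any sufficiently small learning rate.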

1.2 The Problem: What if You Can't Parameterize Your Model?

Gradient descent updates numbers (parameters \( \theta \)). But what if your model is a decision tree?

Decision trees are not easily differentiable with respect to their structure. You can't write \( \nabla_\theta L \) for "which split to make at which node." The model structure is discrete and combinatorial.

So the question becomes:
"Can we do something gradient-descent-like, but instead of updating parameters, we add new models?"

The answer is yes — and that's called Gradient Boosting.

1.3 The Core Idea: Boosting

Boosting is an ensemble technique that combines many weak learners (models only slightly better than random) into a single strong learner.

In this context, weak typically means a decision tree with very few splits (called a stump or a shallow tree).

Why Combine Weak Learners?

A shallow tree underfits — it can't capture complex patterns. But if we fit many of them sequentially, each one correcting the mistakes of the previous, the ensemble can become very powerful.

Analogy: Imagine learning a skill. After each practice session, you identify your weakest area and focus specifically on that. Over many sessions, you improve dramatically — even if each session only gives a tiny improvement.

1.4 AdaBoost: The Predecessor

Before Gradient Boosting, there was AdaBoost (Freund & Schapire, 1996). It fits weak learners sequentially, giving higher sample weights to misclassified points so the next learner focuses on them.

\[ F(x) = \sum_{m=1}^{M} \alpha_m h_m(x) \]

Where \( h_m(x) \) is the \( m \)-th weak learner and \( \alpha_m \) is its weight based on accuracy.

AdaBoost was powerful, but it was designed specifically for classification with exponential loss. We need something more general.
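A minimal AdaBoost sketch, assuming 1-D inputs, labels in \( \{-1, +1\} \), and threshold stumps as the weak learners; the toy data and all names are illustrative, not from the text:

```python
import numpy as np

def fit_stump(x, y, w):
    """Best threshold stump on 1-D inputs under sample weights w (y in {-1,+1})."""
    best = None
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = sign * np.where(x >= thr, 1, -1)
            err = np.sum(w[pred != y])          # weighted misclassification error
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

# Toy 1-D data: not separable by a single stump
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([-1, -1, 1, 1, 1, -1])

w = np.full(len(x), 1 / len(x))                 # start with uniform weights
stumps = []
for m in range(5):
    err, thr, sign = fit_stump(x, y, w)
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)       # learner weight from its accuracy
    pred = sign * np.where(x >= thr, 1, -1)
    w *= np.exp(-alpha * y * pred)              # up-weight misclassified points
    w /= w.sum()
    stumps.append((alpha, thr, sign))

# Final ensemble F(x) = sum_m alpha_m h_m(x)
F = sum(a * s * np.where(x >= t, 1, -1) for a, t, s in stumps)
print(np.sign(F))   # sign(F) matches y on all six points
```

No single stump classifies this data, but the weighted vote of five stumps does, which is the whole point of the reweighting scheme.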

1.5 Gradient Boosting: The Unifying Framework

Gradient Boosting (Friedman, 1999–2001) generalizes AdaBoost to work with any differentiable loss function by viewing boosting through the lens of functional gradient descent.

The Setup

We want to find a function \( F(x) \) that minimizes:

\[ \mathcal{L} = \sum_{i=1}^{n} L(y_i, F(x_i)) \]

Where:

  • \( y_i \) = true label
  • \( F(x_i) \) = model prediction
  • \( L \) = any differentiable loss (MSE, log-loss, etc.)

We build \( F(x) \) additively, one tree at a time:

\[ F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) \]

Where:

  • \( F_{m-1}(x) \) = current ensemble
  • \( h_m(x) \) = new weak learner (tree) added at step \( m \)
  • \( \nu \) = shrinkage/learning rate (typically 0.01–0.3)

1.6 Functional Gradient Descent — The Key Insight

Standard Gradient Descent (parameter space):

\[ \theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L} \]

Functional Gradient Descent (function space):

Instead of updating parameters, we update the prediction function itself:

\[ F_m(x) = F_{m-1}(x) - \eta \cdot \nabla_{F} \mathcal{L} \]

The gradient of the loss w.r.t. the function's output at each training point \( x_i \) is:

\[ g_i = \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg|_{F = F_{m-1}} \]

We call \( -g_i \) the pseudo-residual (negative gradient). This is the "error signal" the next tree must try to correct.

The critical connection: Just as gradient descent moves parameters in the direction \( -\nabla_\theta L \), gradient boosting moves the prediction function in the direction \( -\nabla_F L \) by fitting a tree to the pseudo-residuals.
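This connection is easy to verify numerically. For squared-error loss \( L = \frac{1}{2}(y - F)^2 \) at a single made-up training point, a finite-difference gradient confirms that the pseudo-residual \( -g_i \) is the ordinary residual \( y_i - F(x_i) \):

```python
# Numerically verify that the negative gradient of squared-error loss
# w.r.t. the prediction F(x_i) equals the ordinary residual y_i - F(x_i).
def num_grad(L, F, eps=1e-6):
    return (L(F + eps) - L(F - eps)) / (2 * eps)   # central finite difference

y_i, F_i = 2.0, 0.5                        # one training point (made-up values)
L = lambda F: 0.5 * (y_i - F)**2
g_i = num_grad(L, F_i)
print(abs(-g_i - (y_i - F_i)) < 1e-6)      # pseudo-residual -g_i equals y_i - F_i
```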

1.7 Gradient Boosting Algorithm — Step by Step

Input: Training data \( \{(x_i, y_i)\}_{i=1}^{n} \), loss function \( L \), number of trees \( M \), learning rate \( \nu \)

Step 0 — Initialize with the best constant prediction:

\[ F_0(x) = \arg\min_\gamma \sum_{i=1}^{n} L(y_i, \gamma) \]

For MSE loss, this is simply \( F_0(x) = \bar{y} \) (the mean of \( y \)).

For \( m = 1 \) to \( M \):

  1. Compute pseudo-residuals (negative gradients): \[ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \quad i = 1, \ldots, n \]
  2. Fit a regression tree \( h_m(x) \) to the pseudo-residuals \( \{(x_i, r_{im})\} \).
    Let the tree have \( J \) leaf regions \( R_{1m}, \ldots, R_{Jm} \).
  3. Find optimal leaf values (line search per leaf): \[ \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \]
  4. Update the ensemble: \[ F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm} \cdot \mathbf{1}[x \in R_{jm}] \]

Output: \( F_M(x) \)
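The four steps above can be sketched in plain NumPy. As a stand-in for a general regression tree, this uses depth-1 stumps whose leaf values are residual means (for MSE loss, that is exactly the per-leaf line-search solution); the sine data is purely illustrative:

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree: best split on 1-D x minimizing squared error to r."""
    best = None
    for thr in np.unique(x)[1:]:            # candidate split points between samples
        left, right = r[x < thr], r[x >= thr]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, gl, gr = best
    return lambda q: np.where(q < thr, gl, gr)   # leaf values = mean residuals

# Toy regression data (illustrative)
x = np.linspace(0, 10, 50)
y = np.sin(x)

nu, M = 0.3, 200
F = np.full_like(y, y.mean())               # Step 0: best constant for MSE
trees = []
for m in range(M):
    r = y - F                               # Step 1: pseudo-residuals (= residuals for MSE)
    h = fit_stump(x, r)                     # Steps 2-3: fit tree, leaf values = residual means
    trees.append(h)
    F = F + nu * h(x)                       # Step 4: shrunken additive update

print(np.mean((y - F)**2))                  # training MSE shrinks toward 0
```

Each individual stump is a terrible model of a sine wave, yet the shrunken sum of 200 of them fits it closely, which is boosting in miniature.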

1.8 Worked Example: MSE Loss

Let \( L(y, F) = \frac{1}{2}(y - F)^2 \)

Pseudo-residual:

\[ r_i = -\frac{\partial L}{\partial F} = -\frac{\partial}{\partial F}\left[\frac{1}{2}(y_i - F)^2\right] = y_i - F(x_i) \]

So for MSE, the pseudo-residuals are literally the ordinary residuals \( y_i - \hat{y}_i \). Each new tree fits the remaining error!

Optimal leaf value (line search):

\[ \gamma_j = \arg\min_\gamma \sum_{x_i \in R_j} \frac{1}{2}(y_i - F_{m-1}(x_i) - \gamma)^2 \]

Taking derivative and setting to zero:

\[ \sum_{x_i \in R_j} -(y_i - F_{m-1}(x_i) - \gamma) = 0 \quad \Rightarrow \quad \sum_{x_i \in R_j} r_i = n_j \cdot \gamma \]
\[ \boxed{\gamma_j = \frac{1}{n_j} \sum_{x_i \in R_j} r_i} \]

The leaf value is just the mean of residuals in that leaf. Simple and elegant.
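A quick numeric check of the boxed formula, with three made-up residuals falling in one leaf:

```python
import numpy as np

r = np.array([2.0, 4.0, 6.0])             # residuals in one leaf (made-up values)
loss = lambda g: 0.5 * ((r - g)**2).sum() # leaf objective: sum of (1/2)(r_i - gamma)^2

gamma = r.mean()                          # closed-form leaf value from the derivation
print(gamma)                              # 4.0
# The mean is at least as good as any nearby candidate value:
print(all(loss(gamma) <= loss(g) for g in np.linspace(0, 8, 81)))   # True
```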

1.9 Worked Example: Log-Loss (Binary Classification)

For binary classification with \( y \in \{0, 1\} \), using log-loss:

\[ L(y, F) = -\left[y \log p + (1-y)\log(1-p)\right], \quad p = \sigma(F) = \frac{1}{1 + e^{-F}} \]

Pseudo-residual:

\[ r_i = -\frac{\partial L}{\partial F} = y_i - \sigma(F(x_i)) = y_i - p_i \]

The residual is the difference between the true label and the predicted probability — intuitively clear!

Optimal leaf value (approximate — no closed form):

\[ \gamma_j \approx \frac{\sum_{x_i \in R_j} r_i}{\sum_{x_i \in R_j} p_i(1 - p_i)} \]

This uses a Newton step approximation (second-order), which brings us to XGBoost-style methods.
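A sketch of this classification round, repeated for several iterations, using the pseudo-residual \( y - p \) and the Newton-step leaf value above. Threshold stumps serve as the trees, and the separable 1-D data is purely illustrative:

```python
import numpy as np

sigmoid = lambda F: 1.0 / (1.0 + np.exp(-F))

# Toy binary data: label 1 iff x > 5 (illustrative)
x = np.linspace(0, 10, 40)
y = (x > 5).astype(float)

nu, M = 0.5, 30
F = np.zeros_like(y)                      # start from F0 = 0, i.e. p = 0.5
for m in range(M):
    p = sigmoid(F)
    r = y - p                             # pseudo-residuals for log-loss
    # Depth-1 tree: pick the split minimizing squared error to the residuals
    best = None
    for thr in np.unique(x)[1:]:
        mask = x < thr
        sse = (((r[mask] - r[mask].mean())**2).sum()
               + ((r[~mask] - r[~mask].mean())**2).sum())
        if best is None or sse < best[0]:
            best = (sse, mask)
    _, mask = best
    # Newton-step leaf values: sum(r) / sum(p * (1 - p)) per leaf
    for leaf in (mask, ~mask):
        gamma = r[leaf].sum() / (p[leaf] * (1 - p[leaf])).sum()
        F[leaf] += nu * gamma

p_hat = sigmoid(F)
print(np.mean((p_hat > 0.5) == (y == 1)))   # training accuracy reaches 1.0
```

Note that \( F(x) \) lives on the log-odds scale; only at prediction time is it squashed through the sigmoid to get a probability.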

1.10 Gradient Boosting vs Gradient Descent — Summary Table

| Feature | Gradient Descent | Gradient Boosting |
| --- | --- | --- |
| What it optimizes | Parameters \( \theta \) | Prediction function \( F(x) \) |
| Step direction | \( -\nabla_\theta L \) | \( -\nabla_F L \) (pseudo-residuals) |
| Each step adds | A parameter update | A new tree |
| Works with | Differentiable models | Any model (trees!) |
| Final model | Parameterized function | Sum of trees |
| Overfitting control | Regularization | Shrinkage, tree depth, subsampling |

1.11 Why Not Just Use One Big Tree?

A very deep tree can memorize training data perfectly — but generalizes poorly (overfitting).

Gradient boosting with many shallow trees is better because:

  1. Bias-variance tradeoff — Shallow trees have high bias but low variance. Boosting reduces bias without blowing up variance.
  2. Regularization options — You can control learning rate \( \nu \), tree depth, subsampling, and number of trees.
  3. Robustness — Errors are corrected gradually, making the model robust to noise.
