Chapter 1: Contents
- 1.1 What You Already Know: Gradient Descent
- 1.2 The Problem: What if You Can't Parameterize Your Model?
- 1.3 The Core Idea: Boosting
- 1.4 AdaBoost: The Predecessor
- 1.5 Gradient Boosting: The Unifying Framework
- 1.6 Functional Gradient Descent — The Key Insight
- 1.7 Gradient Boosting Algorithm — Step by Step
- 1.8 Worked Example: MSE Loss
- 1.9 Worked Example: Log-Loss (Binary Classification)
- 1.10 Gradient Boosting vs Gradient Descent — Summary Table
- 1.11 Why Not Just Use One Big Tree?
1.1 What You Already Know: Gradient Descent
We are familiar with gradient descent (see the Gradient Descent page), a method to minimize a loss function \( L(\theta) \) by iteratively updating the model parameters \( \theta \) in the direction opposite to the gradient:
\[ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) \]
Where:
- \( \theta_t \) = parameters at step \( t \)
- \( \eta \) = learning rate (step size)
- \( \nabla_\theta L(\theta_t) \) = gradient of the loss w.r.t. parameters
The key idea: Move parameters in the direction that most reduces the loss.
Example — Linear Regression with MSE:
Given data \( \{(x_i, y_i)\}_{i=1}^{n} \), model \( \hat{y} = \theta^T x \), and loss:
\[ L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \]
The gradient is:
\[ \nabla_\theta L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} (y_i - \theta^T x_i)\, x_i \]
We update \( \theta \) repeatedly until convergence. This works whenever the model is parameterized and differentiable with respect to \( \theta \) (linear models, neural networks, etc.).
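A minimal sketch of this loop in Python (the toy data and hyperparameters are made up for illustration):

```python
import numpy as np

# Toy data: y ~ 2*x + noise, so the fitted slope should approach 2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=100)

theta = np.zeros(1)   # parameters theta
eta = 0.1             # learning rate
n = len(y)

for t in range(500):
    y_hat = X @ theta                    # predictions theta^T x
    grad = -(X.T @ (y - y_hat)) / n      # gradient of (1/2n) * sum (y_i - theta^T x_i)^2
    theta = theta - eta * grad           # step opposite to the gradient

print(theta)  # close to [2.]
```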
1.2 The Problem: What if You Can't Parameterize Your Model?
Gradient descent updates numbers (parameters \( \theta \)). But what if your model is a decision tree?
Decision trees are not easily differentiable with respect to their structure. You can't write \( \nabla_\theta L \) for "which split to make at which node." The model structure is discrete and combinatorial.
"Can we do something gradient-descent-like, but instead of updating parameters, we add new models?"
The answer is yes — and that's called Gradient Boosting.
1.3 The Core Idea: Boosting
Boosting is an ensemble technique that combines many weak learners (models only slightly better than random) into a single strong learner.
In practice, a weak learner is typically a decision tree with very few splits (a one-split tree is called a stump; a handful of splits makes a shallow tree).
Why Combine Weak Learners?
A shallow tree underfits — it can't capture complex patterns. But if we fit many of them sequentially, each one correcting the mistakes of the previous, the ensemble can become very powerful.
1.4 AdaBoost: The Predecessor
Before Gradient Boosting, there was AdaBoost (Freund & Schapire, 1996). It fits weak learners sequentially, giving higher sample weights to misclassified points so that the next learner focuses on them. The final prediction is a weighted vote:
\[ F(x) = \operatorname{sign}\!\left( \sum_{m=1}^{M} \alpha_m h_m(x) \right) \]
Where \( h_m(x) \) is the \( m \)-th weak learner and \( \alpha_m \) is its weight, based on its accuracy.
AdaBoost was powerful, but it was designed specifically for classification with exponential loss. We need something more general.
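To make the reweighting loop concrete, here is a minimal sketch of discrete AdaBoost with decision stumps. It assumes labels in \( \{-1, +1\} \) and uses scikit-learn only for the stumps; the function names and the `1e-10` guard are illustrative choices, not a library API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Discrete AdaBoost sketch; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start with uniform sample weights
    learners, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # weighted error rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # learner weight
        w = w * np.exp(-alpha * y * pred)                # upweight misclassified points
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    scores = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(scores)               # weighted vote of the weak learners
```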
1.5 Gradient Boosting: The Unifying Framework
Gradient Boosting (Friedman, 1999–2001) generalizes AdaBoost to work with any differentiable loss function by viewing boosting through the lens of functional gradient descent.
The Setup
We want to find a function \( F(x) \) that minimizes the total loss over the training data:
\[ F^* = \arg\min_F \sum_{i=1}^{n} L(y_i, F(x_i)) \]
Where:
- \( y_i \) = true label
- \( F(x_i) \) = model prediction
- \( L \) = any differentiable loss (MSE, log-loss, etc.)
We build \( F(x) \) additively, one tree at a time:
\[ F_m(x) = F_{m-1}(x) + \nu\, h_m(x) \]
Where:
- \( F_{m-1}(x) \) = current ensemble
- \( h_m(x) \) = new weak learner (tree) added at step \( m \)
- \( \nu \) = shrinkage/learning rate (typically 0.01–0.3)
1.6 Functional Gradient Descent — The Key Insight
Standard Gradient Descent (parameter space):
\[ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) \]
Functional Gradient Descent (function space):
Instead of updating parameters, we update the prediction function itself, one value \( F(x_i) \) at a time:
\[ F_m(x_i) = F_{m-1}(x_i) - \nu\, g_i, \quad i = 1, \ldots, n \]
The gradient of the loss w.r.t. the function's output at each training point \( x_i \) is:
\[ g_i = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} \]
We call \( -g_i \) the pseudo-residual (negative gradient). This is the "error signal" the next tree must try to correct.
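To see what "gradient descent in function space" means before any trees enter the picture, here is a minimal sketch that performs the update directly on the vector of training predictions \( F(x_i) \), for MSE loss (the toy targets and step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)          # toy targets
F = np.zeros_like(y)             # start from the constant prediction 0
eta = 0.1

for step in range(200):
    g = -(y - F)                 # dL/dF for L = 0.5*(y - F)^2
    F = F - eta * g              # step along the negative gradient

# F converges to y exactly: perfect on the training points, but this "model"
# says nothing about a new x. Fitting a tree to the pseudo-residuals -g at
# each step is what lets gradient boosting generalize beyond the training set.
```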
1.7 Gradient Boosting Algorithm — Step by Step
Input: Training data \( \{(x_i, y_i)\}_{i=1}^{n} \), loss function \( L \), number of trees \( M \), learning rate \( \nu \)
Step 0 — Initialize with the best constant prediction:
\[ F_0(x) = \arg\min_\gamma \sum_{i=1}^{n} L(y_i, \gamma) \]
For MSE loss, this is simply \( F_0(x) = \bar{y} \) (the mean of \( y \)).
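To see why, set the derivative of the constant-prediction objective to zero:
\[ \frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \frac{1}{2}(y_i - \gamma)^2 = -\sum_{i=1}^{n} (y_i - \gamma) = 0 \quad \Rightarrow \quad \gamma = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} \]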
For \( m = 1 \) to \( M \):
- Compute pseudo-residuals (negative gradients): \[ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \quad i = 1, \ldots, n \]
- Fit a regression tree \( h_m(x) \) to the pseudo-residuals \( \{(x_i, r_{im})\} \).
  Let the tree have \( J \) leaf regions \( R_{1m}, \ldots, R_{Jm} \).
- Find optimal leaf values (line search per leaf): \[ \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \]
- Update the ensemble: \[ F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm} \cdot \mathbf{1}[x \in R_{jm}] \]
Output: \( F_M(x) \)
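Putting the steps together, here is a minimal sketch of the algorithm for MSE loss, using scikit-learn regression trees as the weak learners (the class name and default hyperparameters are illustrative assumptions, not a library API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingMSE:
    def __init__(self, M=100, nu=0.1, max_depth=2):
        self.M, self.nu, self.max_depth = M, nu, max_depth

    def fit(self, X, y):
        self.F0 = y.mean()                  # Step 0: best constant for MSE
        self.trees = []
        F = np.full(len(y), self.F0)        # current ensemble predictions
        for m in range(self.M):
            r = y - F                       # pseudo-residuals (= residuals for MSE)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, r)                  # fit the tree to the negative gradient
            F = F + self.nu * tree.predict(X)   # shrunken update of the ensemble
            self.trees.append(tree)
        return self

    def predict(self, X):
        return self.F0 + self.nu * sum(t.predict(X) for t in self.trees)
```

Usage is simply `model = GradientBoostingMSE().fit(X, y)` followed by `model.predict(X_new)`. Note that for MSE the separate per-leaf line search can be skipped: a squared-error regression tree already stores the mean residual in each leaf, which is exactly the \( \gamma_{jm} \) derived in Section 1.8. Other losses need the explicit per-leaf optimization.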
1.8 Worked Example: MSE Loss
Let \( L(y, F) = \frac{1}{2}(y - F)^2 \)
Pseudo-residual:
\[ r_i = -\frac{\partial}{\partial F} \left[ \frac{1}{2}(y_i - F)^2 \right] = y_i - F(x_i) \]
So for MSE, the pseudo-residuals are literally the ordinary residuals \( y_i - \hat{y}_i \). Each new tree fits the remaining error!
Optimal leaf value (line search):
\[ \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} \frac{1}{2} \left( y_i - F_{m-1}(x_i) - \gamma \right)^2 \]
Taking the derivative and setting it to zero:
\[ \gamma_{jm} = \frac{1}{|R_{jm}|} \sum_{x_i \in R_{jm}} \left( y_i - F_{m-1}(x_i) \right) \]
The leaf value is just the mean of residuals in that leaf. Simple and elegant.
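As a concrete (hypothetical) illustration: if a leaf contains three points with current residuals 1, 2, and 6, the line search gives
\[ \gamma = \frac{1 + 2 + 6}{3} = 3 \]
so that leaf shifts every prediction falling into it by 3 (before shrinkage).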
1.9 Worked Example: Log-Loss (Binary Classification)
For binary classification with \( y \in \{0, 1\} \), using log-loss:
\[ L(y, F) = -\left[ y \log p + (1 - y) \log(1 - p) \right], \quad p = \sigma(F) = \frac{1}{1 + e^{-F}} \]
Here \( F(x) \) is the raw score (log-odds), squashed into a probability by the sigmoid \( \sigma \).
Pseudo-residual:
\[ r_i = -\frac{\partial L}{\partial F(x_i)} = y_i - p_i \]
The residual is the difference between the true label and the predicted probability — intuitively clear!
Optimal leaf value (approximate — no closed form):
\[ \gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} (y_i - p_i)}{\sum_{x_i \in R_{jm}} p_i (1 - p_i)} \]
This uses a Newton step approximation (second-order), which brings us to XGBoost-style methods.
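A minimal sketch of one boosting round under log-loss with these Newton-style leaf values (the function name, hyperparameters, and `1e-10` guard are illustrative; scikit-learn's `DecisionTreeRegressor.apply` returns each point's leaf index):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_round_logloss(X, y, F, nu=0.1, max_depth=2):
    """One round of gradient boosting for log-loss; F holds raw scores (log-odds)."""
    p = 1.0 / (1.0 + np.exp(-F))           # predicted probabilities sigma(F)
    r = y - p                              # pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, r)
    leaf = tree.apply(X)                   # leaf index of each training point
    for j in np.unique(leaf):
        mask = leaf == j
        # Newton step: sum of gradients over sum of Hessians p*(1-p)
        gamma = r[mask].sum() / ((p[mask] * (1 - p[mask])).sum() + 1e-10)
        F[mask] += nu * gamma              # update scores in this leaf
    # A full implementation would also store the per-leaf gammas so that
    # new points can be scored by routing them through the trees.
    return F, tree
```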
1.10 Gradient Boosting vs Gradient Descent — Summary Table
| Feature | Gradient Descent | Gradient Boosting |
|---|---|---|
| What it optimizes | Parameters \( \theta \) | Prediction function \( F(x) \) |
| Step direction | \( -\nabla_\theta L \) | \( -\nabla_F L \) (pseudo-residuals) |
| Each step adds | A parameter update | A new tree |
| Works with | Differentiable models | Any model (trees!) |
| Final model | Parameterized function | Sum of trees |
| Overfitting control | Regularization | Shrinkage, tree depth, subsampling |
1.11 Why Not Just Use One Big Tree?
A very deep tree can memorize training data perfectly — but generalizes poorly (overfitting).
Gradient boosting with many shallow trees is better because:
- Bias-variance tradeoff — Shallow trees have high bias but low variance. Boosting reduces bias without blowing up variance.
- Regularization options — You can control learning rate \( \nu \), tree depth, subsampling, and number of trees.
- Robustness — Errors are corrected gradually, making the model robust to noise.
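A small (hypothetical) experiment makes the point, here using scikit-learn's reference implementations; exact numbers vary with the random seed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy synthetic data: a smooth signal plus noise the deep tree will memorize.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(max_depth=None).fit(X_tr, y_tr)   # one big tree
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=2).fit(X_tr, y_tr)   # many shallow trees

print("deep tree test MSE:     ", mean_squared_error(y_te, deep.predict(X_te)))
print("boosted shallow trees:  ", mean_squared_error(y_te, gbm.predict(X_te)))
# The boosted shallow trees typically generalize noticeably better.
```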
References & Further Reading
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics.
- Friedman, J. H. (1999). Stochastic Gradient Boosting.
- My GitHub: Machine Learning repository
- Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning (chapter on boosting).