1.1 What You Already Know: Gradient Descent

We are familiar with gradient descent (see the Gradient Descent page), a method to minimize a loss function \( L(\theta) \) by iteratively updating the model parameters \( \theta \) in the direction opposite to the gradient:

\[ \theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(\theta_t) \]

Where:

  • \( \theta_t \) = parameters at step \( t \)
  • \( \eta \) = learning rate (step size)
  • \( \nabla_\theta L(\theta_t) \) = gradient of the loss w.r.t. parameters

The key idea: Move parameters in the direction that most reduces the loss.

Example — Linear Regression with MSE:

Given data \( \{(x_i, y_i)\}_{i=1}^{n} \), model \( \hat{y} = \theta^T x \), and loss:

\[ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \]

The gradient is:

\[ \nabla_\theta L = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) x_i \]

We update \( \theta \) repeatedly until convergence. This works whenever the model is parameterized and differentiable (linear models, neural networks, etc.).
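The update rule above can be sketched in a few lines of NumPy. The data here is synthetic (a single feature with true slope 3), chosen purely for illustration:

```python
import numpy as np

# Synthetic data: y = 3*x + noise (values are illustrative, not from the text)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 0.1, size=100)

theta = np.zeros(1)   # single weight, no intercept, to keep the sketch minimal
eta = 0.1             # learning rate

for _ in range(500):
    y_hat = x @ theta
    grad = -(2.0 / len(y)) * x.T @ (y - y_hat)   # gradient of the MSE loss
    theta -= eta * grad                          # step opposite to the gradient

print(theta)   # close to the true slope, approximately [3.]
```

Because the MSE loss is convex in \( \theta \), the iterates converge to the least-squares solution for any sufficiently small learning rate.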

1.2 The Problem: What if You Can't Parameterize Your Model?

Gradient descent updates numbers (parameters \( \theta \)). But what if your model is a decision tree?

Decision trees are not easily differentiable with respect to their structure. You can't write \( \nabla_\theta L \) for "which split to make at which node." The model structure is discrete and combinatorial.

So the question becomes:
"Can we do something gradient-descent-like, but instead of updating parameters, we add new models?"

The answer is yes — and that's called Gradient Boosting.

1.3 The Core Idea: Boosting

Boosting is an ensemble technique that combines many weak learners (models only slightly better than random) into a single strong learner.

In this context, weak typically means a decision tree with very few splits (called a stump or a shallow tree).

Why Combine Weak Learners?

A shallow tree underfits — it can't capture complex patterns. But if we fit many of them sequentially, each one correcting the mistakes of the previous, the ensemble can become very powerful.

Analogy: Imagine learning a skill. After each practice session, you identify your weakest area and focus specifically on that. Over many sessions, you improve dramatically — even if each session only gives a tiny improvement.

1.4 AdaBoost: The Predecessor

Before Gradient Boosting, there was AdaBoost (Freund & Schapire, 1996). It fits weak learners sequentially, giving higher sample weights to misclassified points so the next learner focuses on them.

\[ F(x) = \sum_{m=1}^{M} \alpha_m h_m(x) \]

Where \( h_m(x) \) is the \( m \)-th weak learner and \( \alpha_m \) is its weight based on accuracy.

AdaBoost was powerful, but it was designed specifically for classification with exponential loss. We need something more general.
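A minimal AdaBoost sketch, assuming 1-D inputs, labels in \( \{-1, +1\} \), and threshold stumps as the weak learners; the toy data and all names are illustrative, not from the text:

```python
import numpy as np

def fit_stump(x, y, w):
    """Best threshold stump on 1-D inputs under sample weights w (y in {-1,+1})."""
    best = None
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = sign * np.where(x >= thr, 1, -1)
            err = np.sum(w[pred != y])          # weighted misclassification error
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

# Toy 1-D data: not separable by a single stump
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([-1, -1, 1, 1, 1, -1])

w = np.full(len(x), 1 / len(x))                 # start with uniform weights
stumps = []
for m in range(5):
    err, thr, sign = fit_stump(x, y, w)
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)       # learner weight from its accuracy
    pred = sign * np.where(x >= thr, 1, -1)
    w *= np.exp(-alpha * y * pred)              # up-weight misclassified points
    w /= w.sum()
    stumps.append((alpha, thr, sign))

# Final ensemble F(x) = sum_m alpha_m h_m(x)
F = sum(a * s * np.where(x >= t, 1, -1) for a, t, s in stumps)
print(np.sign(F))   # sign(F) matches y on all six points
```

No single stump classifies this data, but the weighted vote of five stumps does, which is the whole point of the reweighting scheme.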

1.5 Gradient Boosting: The Unifying Framework

Gradient Boosting (Friedman, 1999–2001) generalizes AdaBoost to work with any differentiable loss function by viewing boosting through the lens of functional gradient descent.

The Setup

We want to find a function \( F(x) \) that minimizes:

\[ \mathcal{L} = \sum_{i=1}^{n} L(y_i, F(x_i)) \]

Where:

  • \( y_i \) = true label
  • \( F(x_i) \) = model prediction
  • \( L \) = any differentiable loss (MSE, log-loss, etc.)

We build \( F(x) \) additively, one tree at a time:

\[ F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) \]

Where:

  • \( F_{m-1}(x) \) = current ensemble
  • \( h_m(x) \) = new weak learner (tree) added at step \( m \)
  • \( \nu \) = shrinkage/learning rate (typically 0.01–0.3)

1.6 Functional Gradient Descent — The Key Insight

Standard Gradient Descent (parameter space):

\[ \theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L} \]

Functional Gradient Descent (function space):

Instead of updating parameters, we update the prediction function itself:

\[ F_m(x) = F_{m-1}(x) - \eta \cdot \nabla_{F} \mathcal{L} \]

The gradient of the loss w.r.t. the function's output at each training point \( x_i \) is:

\[ g_i = \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg|_{F = F_{m-1}} \]

We call \( -g_i \) the pseudo-residual (negative gradient). This is the "error signal" the next tree must try to correct.

The critical connection: Just as gradient descent moves parameters in the direction \( -\nabla_\theta L \), gradient boosting moves the prediction function in the direction \( -\nabla_F L \) by fitting a tree to the pseudo-residuals.
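This connection is easy to verify numerically. For squared-error loss \( L = \frac{1}{2}(y - F)^2 \) at a single made-up training point, a finite-difference gradient confirms that the pseudo-residual \( -g_i \) is the ordinary residual \( y_i - F(x_i) \):

```python
# Numerically verify that the negative gradient of squared-error loss
# w.r.t. the prediction F(x_i) equals the ordinary residual y_i - F(x_i).
def num_grad(L, F, eps=1e-6):
    return (L(F + eps) - L(F - eps)) / (2 * eps)   # central finite difference

y_i, F_i = 2.0, 0.5                        # one training point (made-up values)
L = lambda F: 0.5 * (y_i - F)**2
g_i = num_grad(L, F_i)
print(abs(-g_i - (y_i - F_i)) < 1e-6)      # pseudo-residual -g_i equals y_i - F_i
```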

1.7 Gradient Boosting Algorithm — Step by Step

Input: Training data \( \{(x_i, y_i)\}_{i=1}^{n} \), loss function \( L \), number of trees \( M \), learning rate \( \nu \)

Step 0 — Initialize with the best constant prediction:

\[ F_0(x) = \arg\min_\gamma \sum_{i=1}^{n} L(y_i, \gamma) \]

For MSE loss, this is simply \( F_0(x) = \bar{y} \) (the mean of \( y \)).

For \( m = 1 \) to \( M \):

  1. Compute pseudo-residuals (negative gradients): \[ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \quad i = 1, \ldots, n \]
  2. Fit a regression tree \( h_m(x) \) to the pseudo-residuals \( \{(x_i, r_{im})\} \).
    Let the tree have \( J \) leaf regions \( R_{1m}, \ldots, R_{Jm} \).
  3. Find optimal leaf values (line search per leaf): \[ \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \]
  4. Update the ensemble: \[ F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm} \cdot \mathbf{1}[x \in R_{jm}] \]

Output: \( F_M(x) \)
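The four steps above can be sketched in plain NumPy. As a stand-in for a general regression tree, this uses depth-1 stumps whose leaf values are residual means (for MSE loss, that is exactly the per-leaf line-search solution); the sine data is purely illustrative:

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree: best split on 1-D x minimizing squared error to r."""
    best = None
    for thr in np.unique(x)[1:]:            # candidate split points between samples
        left, right = r[x < thr], r[x >= thr]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, gl, gr = best
    return lambda q: np.where(q < thr, gl, gr)   # leaf values = mean residuals

# Toy regression data (illustrative)
x = np.linspace(0, 10, 50)
y = np.sin(x)

nu, M = 0.3, 200
F = np.full_like(y, y.mean())               # Step 0: best constant for MSE
trees = []
for m in range(M):
    r = y - F                               # Step 1: pseudo-residuals (= residuals for MSE)
    h = fit_stump(x, r)                     # Steps 2-3: fit tree, leaf values = residual means
    trees.append(h)
    F = F + nu * h(x)                       # Step 4: shrunken additive update

print(np.mean((y - F)**2))                  # training MSE shrinks toward 0
```

Each individual stump is a terrible model of a sine wave, yet the shrunken sum of 200 of them fits it closely, which is boosting in miniature.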

1.8 Worked Example: MSE Loss

Let \( L(y, F) = \frac{1}{2}(y - F)^2 \)

Pseudo-residual:

\[ r_i = -\frac{\partial L}{\partial F} = -\frac{\partial}{\partial F}\left[\frac{1}{2}(y_i - F)^2\right] = y_i - F(x_i) \]

So for MSE, the pseudo-residuals are literally the ordinary residuals \( y_i - \hat{y}_i \). Each new tree fits the remaining error!

Optimal leaf value (line search):

\[ \gamma_j = \arg\min_\gamma \sum_{x_i \in R_j} \frac{1}{2}(y_i - F_{m-1}(x_i) - \gamma)^2 \]

Taking derivative and setting to zero:

\[ \sum_{x_i \in R_j} -(y_i - F_{m-1}(x_i) - \gamma) = 0 \quad \Rightarrow \quad \sum_{x_i \in R_j} r_i = n_j \cdot \gamma \]
\[ \boxed{\gamma_j = \frac{1}{n_j} \sum_{x_i \in R_j} r_i} \]

The leaf value is just the mean of residuals in that leaf. Simple and elegant.
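A quick numeric check of the boxed formula, with three made-up residuals falling in one leaf:

```python
import numpy as np

r = np.array([2.0, 4.0, 6.0])             # residuals in one leaf (made-up values)
loss = lambda g: 0.5 * ((r - g)**2).sum() # leaf objective: sum of (1/2)(r_i - gamma)^2

gamma = r.mean()                          # closed-form leaf value from the derivation
print(gamma)                              # 4.0
# The mean is at least as good as any nearby candidate value:
print(all(loss(gamma) <= loss(g) for g in np.linspace(0, 8, 81)))   # True
```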

1.9 Worked Example: Log-Loss (Binary Classification)

For binary classification with \( y \in \{0, 1\} \), using log-loss:

\[ L(y, F) = -\left[y \log p + (1-y)\log(1-p)\right], \quad p = \sigma(F) = \frac{1}{1 + e^{-F}} \]

Pseudo-residual:

\[ r_i = -\frac{\partial L}{\partial F} = y_i - \sigma(F(x_i)) = y_i - p_i \]

The residual is the difference between the true label and the predicted probability — intuitively clear!

Optimal leaf value (approximate — no closed form):

\[ \gamma_j \approx \frac{\sum_{x_i \in R_j} r_i}{\sum_{x_i \in R_j} p_i(1 - p_i)} \]

This uses a Newton step approximation (second-order), which brings us to XGBoost-style methods.
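A sketch of this classification round, repeated for several iterations, using the pseudo-residual \( y - p \) and the Newton-step leaf value above. Threshold stumps serve as the trees, and the separable 1-D data is purely illustrative:

```python
import numpy as np

sigmoid = lambda F: 1.0 / (1.0 + np.exp(-F))

# Toy binary data: label 1 iff x > 5 (illustrative)
x = np.linspace(0, 10, 40)
y = (x > 5).astype(float)

nu, M = 0.5, 30
F = np.zeros_like(y)                      # start from F0 = 0, i.e. p = 0.5
for m in range(M):
    p = sigmoid(F)
    r = y - p                             # pseudo-residuals for log-loss
    # Depth-1 tree: pick the split minimizing squared error to the residuals
    best = None
    for thr in np.unique(x)[1:]:
        mask = x < thr
        sse = (((r[mask] - r[mask].mean())**2).sum()
               + ((r[~mask] - r[~mask].mean())**2).sum())
        if best is None or sse < best[0]:
            best = (sse, mask)
    _, mask = best
    # Newton-step leaf values: sum(r) / sum(p * (1 - p)) per leaf
    for leaf in (mask, ~mask):
        gamma = r[leaf].sum() / (p[leaf] * (1 - p[leaf])).sum()
        F[leaf] += nu * gamma

p_hat = sigmoid(F)
print(np.mean((p_hat > 0.5) == (y == 1)))   # training accuracy reaches 1.0
```

Note that \( F(x) \) lives on the log-odds scale; only at prediction time is it squashed through the sigmoid to get a probability.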

1.10 Gradient Boosting vs Gradient Descent — Summary Table

| Feature | Gradient Descent | Gradient Boosting |
| --- | --- | --- |
| What it optimizes | Parameters \( \theta \) | Prediction function \( F(x) \) |
| Step direction | \( -\nabla_\theta L \) | \( -\nabla_F L \) (pseudo-residuals) |
| Each step adds | A parameter update | A new tree |
| Works with | Differentiable models | Any model (trees!) |
| Final model | Parameterized function | Sum of trees |
| Overfitting control | Regularization | Shrinkage, tree depth, subsampling |

1.11 Why Not Just Use One Big Tree?

A very deep tree can memorize training data perfectly — but generalizes poorly (overfitting).

Gradient boosting with many shallow trees is better because:

  1. Bias-variance tradeoff — Shallow trees have high bias but low variance. Boosting reduces bias without blowing up variance.
  2. Regularization options — You can control learning rate \( \nu \), tree depth, subsampling, and number of trees.
  3. Robustness — Errors are corrected gradually, making the model robust to noise.
