Gradient Boosting

Gradient Boosting in ML

Chapter 1: Contents

1.1 What We Already Know: Gradient Descent
1.2 The Problem: What if You Can't Parameterize Your Model?
1.3 The Core Idea: Boosting
1.4 AdaBoost: The Predecessor
1.5 Gradient Boosting: The Unifying Framework
1.6 Functional Gradient Descent — The Key Insight
1.7 Gradient Boosting Algorithm — Step by Step
1.8 Worked Example: MSE Loss
1.9 Worked Example: Log-Loss (Binary Classification)
1.10 Gradient Boosting vs Gradient Descent — Summary Table
1.11 Why Not Just Use One Big Tree?
1.12 Regularization in Gradient Boosting
1.13 Variants: XGBoost, LightGBM, CatBoost

1.1 What We Already Know: Gradient Descent

Before jumping into Gradient Boosting, let us quickly revisit Gradient Descent. In classical machine learning, we have a model $ f(\mathbf{x}; \boldsymbol{\theta}) $ parameterized by a vector $ \boldsymbol{\theta} \in \mathbb{R}^p $. We define a loss function that measures how wrong our predictions are:

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(y_i,\; f(\mathbf{x}_i;\boldsymbol{\theta})\bigr)$$

where $ L $ is a per-sample loss (e.g., squared error, cross-entropy), $ y_i $ is the true label, and $ f(\mathbf{x}_i;\boldsymbol{\theta}) $ is the model's prediction for sample $ i $.

Gradient Descent updates the parameters iteratively in the direction of steepest descent:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \;\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})$$

where $ \eta > 0 $ is the learning rate. Expanding the gradient:

$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial L(y_i, f(\mathbf{x}_i;\boldsymbol{\theta}))}{\partial \boldsymbol{\theta}}$$

Key point: Gradient Descent works in parameter space — it adjusts the weights/coefficients of a fixed-form model (linear regression, neural network, logistic regression, etc.) to minimize the loss.

This is powerful, but it has a fundamental assumption: the model must be differentiable with respect to its parameters. What happens when it is not?

1.2 The Problem: What if You Can't Parameterize Your Model?

Consider a Decision Tree. A tree splits data using hard thresholds (e.g., "Is feature $ x_3 > 5.2 $?"). The prediction is a piecewise-constant function over the input space. There is no smooth parameter vector $ \boldsymbol{\theta} $ that you can take derivatives of. The split decisions are discrete, and the tree structure itself is part of the model.

So how do we "optimize" a model built out of decision trees? We cannot compute $ \nabla_{\boldsymbol{\theta}} \mathcal{L} $ because there is no $ \boldsymbol{\theta} $!

The core challenge: We want the power and flexibility of non-parametric models (like trees), but we need a principled way to iteratively improve them — analogous to gradient descent, but in function space rather than parameter space.

This is exactly the gap that Gradient Boosting fills.

1.3 The Core Idea: Boosting

Boosting is a family of ensemble methods where we build a strong model by sequentially adding weak learners (typically shallow decision trees), each one correcting the mistakes of its predecessors.

The final model is an additive expansion of base learners:

$$F_M(\mathbf{x}) = \sum_{m=1}^{M} \gamma_m \; h_m(\mathbf{x})$$

where $ h_m(\mathbf{x}) $ is the $ m $-th weak learner (base learner) and $ \gamma_m $ is its coefficient (weight). The model is built greedily: at each stage $ m $, we fix all previous learners and add one new one to reduce the overall loss.

Why "weak" learners? A weak learner is one that performs only slightly better than random guessing. Decision stumps (trees with depth 1) or shallow trees (depth 3–8) are common choices. The magic of boosting is that combining many weak learners produces a strong learner with high predictive power.

1.4 AdaBoost family (Adaptive boosting): The Predecessor

AdaBoost (Adaptive Boosting), proposed by Freund & Schapire (1997), was the first successful boosting algorithm. It works for binary classification where $ y_i \in \{-1, +1\} $.

AdaBoost Algorithm

Main boosting algorithms of this family are as follows:

AdaBoost
Real AdaBoost
Gentle AdaBoost
LogitBoost

Common weak learner:

Decision stump
Shallow decision tree

This family works best for simple classification problems and educational demostration purpose.

Algorithm: AdaBoost

Input: Training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, number of rounds $M$
Initialize: Sample weights $w_i^{(1)} = \frac{1}{N}$ for all $i$

For $m = 1$ to $M$:
  1. Fit weak learner $h_m(\mathbf{x})$ to training data using weights $w_i^{(m)}$
  2. Compute weighted error:
    $\epsilon_m = \frac{\sum_{i=1}^{N} w_i^{(m)} \;\mathbb{1}[y_i \neq h_m(\mathbf{x}_i)]}{\sum_{i=1}^{N} w_i^{(m)}}$

  3. Compute learner weight:
    $\alpha_m = \frac{1}{2}\ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$

  4. Update sample weights:
    $w_i^{(m+1)} = w_i^{(m)} \cdot \exp\bigl(-\alpha_m \; y_i \; h_m(\mathbf{x}_i)\bigr)$
    Normalize so that $\sum_i w_i^{(m+1)} = 1$

Output: $F_M(\mathbf{x}) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m \; h_m(\mathbf{x})\right)$

Why Does AdaBoost Work?

Breiman (and later Friedman, Mason, et al.) showed that AdaBoost can be interpreted as performing forward stagewise additive modeling under the exponential loss:

$$L(y, F(\mathbf{x})) = \exp\bigl(-y \cdot F(\mathbf{x})\bigr)$$

This insight was a breakthrough: it connected boosting to optimization and opened the door to generalizing boosting to any differentiable loss function. That generalization is Gradient Boosting.

1.5 Gradient Boosting (GB) family: The Unifying Framework

This family is the most important and widely used boosting method today. Each new tree learns the residual errors (gradients) of the previous ensemble. Common GB algo's are:

Gradient Boosting (GB)
Gradient Boosted Decision Trees
Stochastic Gradient Boosting

This family works best for regression problems and can be used for classification problems as well on a structured/tabular data. Some real world applications of this algo are:

Credit scoring
Recommendation systems
Risk modeling
Ad Click-Through Rate (CTR) prediction

Gradient Boosting (GB), introduced by Jerome Friedman (2001), generalizes the boosting idea by framing it as gradient descent in function space. Instead of being tied to exponential loss (like AdaBoost), it works with any differentiable loss function.

Forward Stagewise Additive Modeling

We model the prediction as an additive sum of base learners:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \; h_m(\mathbf{x})$$

Starting from an initial model $ F_0(\mathbf{x}) $, at each step $ m $ we want to find the base learner $ h_m $ that best reduces the total loss:

$$h_m = \arg\min_{h} \sum_{i=1}^{N} L\bigl(y_i,\; F_{m-1}(\mathbf{x}_i) + h(\mathbf{x}_i)\bigr)$$

Solving this exactly is hard. Gradient Boosting provides an elegant approximation using the idea of functional gradients.

1.6 Functional Gradient Descent — The Key Insight

From Parameters to Functions

In classical gradient descent, we have:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \;\nabla_{\boldsymbol{\theta}} \mathcal{L}$$

Now, think of the model's prediction on the training set as an $ N $-dimensional vector:

$$\mathbf{F} = \bigl(F(\mathbf{x}_1),\; F(\mathbf{x}_2),\; \ldots,\; F(\mathbf{x}_N)\bigr) \in \mathbb{R}^N$$

The total loss is a function of this vector:

$$\mathcal{L}(\mathbf{F}) = \sum_{i=1}^{N} L\bigl(y_i,\; F(\mathbf{x}_i)\bigr)$$

We can compute the gradient of the loss with respect to the function values (not parameters!):

$$\mathbf{g} = -\nabla_{\mathbf{F}} \mathcal{L} = -\left(\frac{\partial L(y_1, F(\mathbf{x}_1))}{\partial F(\mathbf{x}_1)},\; \ldots,\; \frac{\partial L(y_N, F(\mathbf{x}_N))}{\partial F(\mathbf{x}_N)}\right)$$

Each component of this gradient is called a pseudo-residual:

$$r_{im} = -\frac{\partial L(y_i,\; F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\Bigg|_{F = F_{m-1}}$$

The key insight: The negative gradient $ \mathbf{g} $ tells us the direction in function space that would decrease the loss the fastest. If we could update the function directly, we would set $ F_m = F_{m-1} + \eta \;\mathbf{g} $. But $ \mathbf{g} $ is only defined on the training points! We need a function that generalizes this direction to unseen data.

The Approximation: Fit a Base Learner to the Pseudo-Residuals

Since the negative gradient (pseudo-residuals) is only defined at the $ N $ training points, we fit a base learner $ h_m(\mathbf{x}) $ to approximate it:

$$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{N} \bigl(r_{im} - h(\mathbf{x}_i)\bigr)^2$$

This is typically a regression tree fitted to the pseudo-residuals, regardless of whether the original problem is classification or regression. The tree "learns" the gradient direction and can generalize it to new data points.

Why This Works: The Analogy

Aspect	Gradient Descent (Parameter Space)	Gradient Boosting (Function Space)
What we optimize	Parameters $ \boldsymbol{\theta} \in \mathbb{R}^p $	Function $ F(\mathbf{x}) $
Gradient of	$ \nabla_{\boldsymbol{\theta}} \mathcal{L} $	$ \nabla_{F(\mathbf{x}_i)} \mathcal{L} $ (pseudo-residuals)
Update rule	$ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\;\nabla \mathcal{L} $	$ F \leftarrow F + \eta\; h_m $ where $ h_m \approx -\nabla \mathcal{L} $
Step direction	Exact gradient vector	Approximated by a base learner

1.7 Gradient Boosting Algorithm — Step by Step

Here is the full algorithm as proposed by Friedman (2001):

Algorithm: Gradient Boosting Machine (GBM)

Input: Training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, differentiable loss function $L(y, F)$, number of iterations $M$, learning rate $\eta$

Step 1 — Initialize:
  $F_0(\mathbf{x}) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$
  (A constant prediction that minimizes the total loss)

Step 2 — For $m = 1$ to $M$:

  (a) Compute pseudo-residuals:
    $r_{im} = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F=F_{m-1}}$    for $i = 1, \ldots, N$

  (b) Fit a base learner (regression tree) to the pseudo-residuals:
    Fit $h_m(\mathbf{x})$ to targets $\{r_{im}\}_{i=1}^N$
    Let the tree have $J_m$ terminal (leaf) regions $\{R_{jm}\}_{j=1}^{J_m}$

  (c) Compute optimal leaf values:
    For each leaf $j = 1, \ldots, J_m$:
    $\gamma_{jm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{jm}} L\bigl(y_i,\; F_{m-1}(\mathbf{x}_i) + \gamma\bigr)$

  (d) Update the model:
    $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \sum_{j=1}^{J_m} \gamma_{jm}\;\mathbb{1}[\mathbf{x} \in R_{jm}]$

Output: $F_M(\mathbf{x})$

Note on Step 2(c): This is what makes Gradient Boosting different from simply fitting trees to residuals. After the tree determines the structure (splits), we re-optimize the leaf values by solving a one-dimensional optimization per leaf. This is a form of line search in function space — analogous to choosing the optimal step size in gradient descent.

1.8 Worked Example: MSE Loss (Regression)

Let the loss function be Mean Squared Error:

$$L(y_i, F(\mathbf{x}_i)) = \frac{1}{2}\bigl(y_i - F(\mathbf{x}_i)\bigr)^2$$

Step 1: Initialization

We need $ F_0 = \arg\min_\gamma \sum_{i=1}^N \frac{1}{2}(y_i - \gamma)^2 $. Taking the derivative and setting it to zero:

$$\frac{\partial}{\partial \gamma}\sum_{i=1}^N \frac{1}{2}(y_i - \gamma)^2 = -\sum_{i=1}^N (y_i - \gamma) = 0$$

$$\implies \gamma = \frac{1}{N}\sum_{i=1}^N y_i = \bar{y}$$

So $ F_0(\mathbf{x}) = \bar{y} $ — the mean of the target values.

Step 2(a): Pseudo-Residuals

$$r_{im} = -\frac{\partial L}{\partial F(\mathbf{x}_i)} = -\frac{\partial}{\partial F(\mathbf{x}_i)}\left[\frac{1}{2}(y_i - F(\mathbf{x}_i))^2\right] = y_i - F_{m-1}(\mathbf{x}_i)$$

Result: For MSE loss, the pseudo-residuals are exactly the ordinary residuals $ y_i - F_{m-1}(\mathbf{x}_i) $. This is the special case that motivates the name "residual fitting," but for other losses the pseudo-residuals are different from the raw residuals.

Step 2(b): Fit Tree to Residuals

We fit a regression tree $ h_m(\mathbf{x}) $ to the targets $ \{r_{im}\} $.

Step 2(c): Optimal Leaf Values

For MSE, the optimal leaf value for region $ R_{jm} $ is:

$$\gamma_{jm} = \arg\min_\gamma \sum_{\mathbf{x}_i \in R_{jm}} \frac{1}{2}(y_i - F_{m-1}(\mathbf{x}_i) - \gamma)^2 = \frac{1}{|R_{jm}|}\sum_{\mathbf{x}_i \in R_{jm}} r_{im}$$

This is simply the mean of the pseudo-residuals in each leaf — which is exactly what the regression tree already outputs.

Step 2(d): Update

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \; h_m(\mathbf{x})$$

With a learning rate $ \eta < 1 $, we take a small step in the direction of the negative gradient. This is called shrinkage and acts as regularization.

1.9 Worked Example: Log-Loss (Binary Classification)

For binary classification with $ y_i \in \{0, 1\} $, the model outputs log-odds (logits), and the predicted probability is:

$$p_i = \sigma(F(\mathbf{x}_i)) = \frac{1}{1 + e^{-F(\mathbf{x}_i)}}$$

The binary cross-entropy (log-loss) is:

$$L(y_i, F(\mathbf{x}_i)) = -\bigl[y_i \ln(p_i) + (1-y_i)\ln(1-p_i)\bigr]$$

Substituting $ p_i = \sigma(F(\mathbf{x}_i)) $ and simplifying:

$$L(y_i, F) = -y_i \; F + \ln(1 + e^F)$$

Step 1: Initialization

We solve $ F_0 = \arg\min_\gamma \sum_i L(y_i, \gamma) = \arg\min_\gamma \sum_i \bigl[-y_i\gamma + \ln(1+e^\gamma)\bigr] $.

$$\frac{\partial}{\partial \gamma}\sum_i \bigl[-y_i\gamma + \ln(1+e^\gamma)\bigr] = -\sum_i y_i + \sum_i \frac{e^\gamma}{1+e^\gamma} = 0$$

$$\implies \frac{e^\gamma}{1+e^\gamma} = \bar{y} \implies \gamma = \ln\frac{\bar{y}}{1-\bar{y}}$$

So $ F_0 = \ln\frac{\bar{y}}{1-\bar{y}} $ — the log-odds of the positive class proportion.

Step 2(a): Pseudo-Residuals

$$r_{im} = -\frac{\partial L(y_i, F)}{\partial F}\Bigg|_{F=F_{m-1}} = y_i - \sigma(F_{m-1}(\mathbf{x}_i)) = y_i - p_i^{(m-1)}$$

Result: The pseudo-residuals for log-loss are $ y_i - p_i $, i.e., the difference between the true label (0 or 1) and the current predicted probability. If the model predicts $ p_i = 0.3 $ for a positive example ($ y_i = 1 $), the pseudo-residual is $ 0.7 $, pushing the model to increase its prediction.

Step 2(b): Fit Tree to Pseudo-Residuals

A regression tree is fitted to $ \{r_{im}\} $, creating leaf regions $ R_{jm} $.

Step 2(c): Optimal Leaf Values

For log-loss, the exact solution per leaf requires solving a nonlinear equation. Friedman showed that a Newton-Raphson one-step approximation gives:

$$\gamma_{jm} = \frac{\sum_{\mathbf{x}_i \in R_{jm}} r_{im}}{\sum_{\mathbf{x}_i \in R_{jm}} p_i^{(m-1)}\;(1 - p_i^{(m-1)})}$$

The numerator is the sum of pseudo-residuals (first-order gradient). The denominator is the sum of $ p_i(1-p_i) $, which is the second-order derivative (Hessian) of the log-loss. This is essentially a Newton step — and this second-order approximation is exactly what XGBoost later exploits systematically.

Step 2(d): Update

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \sum_{j=1}^{J_m} \gamma_{jm}\;\mathbb{1}[\mathbf{x} \in R_{jm}]$$

Predicted probabilities at any point are recovered via the sigmoid: $ p(\mathbf{x}) = \sigma(F_M(\mathbf{x})) $.

1.10 Gradient Boosting vs Gradient Descent — Summary Table

Aspect	Gradient Descent	Gradient Boosting
Optimization space	Parameter space $ \mathbb{R}^p $	Function space
What is updated	Weight vector $ \boldsymbol{\theta} $	Prediction function $ F(\mathbf{x}) $
Gradient computed w.r.t.	Parameters $ \boldsymbol{\theta} $	Function values $ F(\mathbf{x}_i) $
Step direction	Exact negative gradient $ -\nabla_\theta \mathcal{L} $	Base learner fitted to pseudo-residuals
Step size	Learning rate $ \eta $	Learning rate $ \eta $ + optimal leaf values $ \gamma_{jm} $
Model class	Parametric (fixed form)	Non-parametric (additive ensemble)
Requirements	Model differentiable w.r.t. $ \boldsymbol{\theta} $	Loss differentiable w.r.t. $ F(\mathbf{x}_i) $
Common base model	Linear / Neural Network	Decision Tree (regression)
Handles non-differentiable models	No	Yes (trees, stumps, etc.)

1.11 Why Not Just Use One Big Tree?

A natural question: if we are using trees as base learners, why not just grow one very deep tree? The answer involves the bias-variance tradeoff:

Aspect	Single Deep Tree	Gradient Boosted Ensemble
Bias	Low (can model complex patterns)	Low (additive composition achieves complexity)
Variance	High (overfits training data)	Lower (shrinkage + shallow trees regularize)
Interpretability	Hard (too many splits)	Feature importances available
Generalization	Poor without pruning	Good with proper learning rate + early stopping

Each shallow tree in gradient boosting has high bias (it's a weak learner — can't capture complex relationships alone) but low variance (it doesn't overfit). By combining many such trees additively, the ensemble reduces bias while keeping variance controlled through the learning rate $ \eta $.

Intuition: Think of it as a committee of cautious experts. Each expert makes a small, conservative correction to the current prediction. No single expert dominates, and together they converge to an accurate solution.

1.12 Regularization in Gradient Boosting

Gradient boosting is prone to overfitting if left unchecked. Several regularization techniques are used in practice:

1. Shrinkage (Learning Rate)

The learning rate $ \eta \in (0, 1] $ scales each tree's contribution:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \; h_m(\mathbf{x})$$

Smaller $ \eta $ (e.g., 0.01–0.1) means each tree makes a smaller correction, requiring more trees but generally yielding better generalization. There is a trade-off between $ \eta $ and $ M $ (number of trees).

2. Subsampling (Stochastic Gradient Boosting)

Friedman (1999) proposed fitting each tree on a random subsample of the training data (without replacement). A typical subsample fraction is 0.5–0.8. This:

Reduces variance (similar to bagging)
Speeds up training (fewer samples per tree)
Often improves generalization

3. Tree Constraints

Maximum depth: Shallow trees (depth 3–8) prevent individual trees from overfitting.
Minimum samples per leaf: Avoids tiny leaves that memorize noise.
Maximum number of leaves: Directly controls tree complexity.

4. Column Subsampling

At each tree (or each split), only a random subset of features is considered. This idea, borrowed from Random Forests, decorrelates the trees and prevents overfitting to dominant features.

5. Early Stopping

Monitor performance on a validation set and stop adding trees when the validation loss stops improving. This is one of the most effective regularization techniques.

6. L1 and L2 Regularization on Leaf Weights

Modern implementations (notably XGBoost) add regularization terms to the objective function:

$$\Omega(h_m) = \gamma\;T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where $ T $ is the number of leaves and $ w_j $ is the leaf weight. The term $ \gamma T $ penalizes tree complexity (like a cost-complexity pruning parameter), and $ \lambda \|w\|^2 $ is L2 regularization on leaf scores.

1.13 Variants: XGBoost, LightGBM, CatBoost

The basic Gradient Boosting framework has been extended into several high-performance libraries. Here is a brief comparison:

Feature	Classic GBM	XGBoost	LightGBM	CatBoost
Tree growth	Level-wise	Level-wise (default)	Leaf-wise (best-first)	Oblivious trees (symmetric)
Second-order info	Optional (leaf values)	Yes (Hessian in split gain)	Yes	Yes
Categorical features	Manual encoding	Manual encoding	Native (optimal split)	Native (ordered target statistics)
Missing values	Manual imputation	Native (learns best direction)	Native	Native
Speed	Slow	Fast	Fastest (histogram-based)	Fast
Regularization	Shrinkage + subsampling	L1/L2 on leaf weights + $\gamma$ on leaves	All XGBoost + extra constraints	Ordered boosting (reduces target leakage)

XGBoost — Second-Order Approximation

XGBoost (Chen & Guestrin, 2016) uses a second-order Taylor expansion of the loss when evaluating splits. For a given tree structure, the objective is approximated as:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^N \left[g_i\; h_m(\mathbf{x}_i) + \frac{1}{2} h_i\; h_m(\mathbf{x}_i)^2\right] + \Omega(h_m)$$

where

$$g_i = \frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}, \qquad h_i = \frac{\partial^2 L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)^2}$$

Here $ g_i $ and $ h_i $ are the first and second derivatives of the loss with respect to the model output (not the tree — the model output at iteration $ m-1 $). With this formulation, the optimal leaf weight for leaf $ j $ has a closed-form solution:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

And the split gain for splitting a leaf into left ($ I_L $) and right ($ I_R $) is:

$$\text{Gain} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I_L \cup I_R} g_i\right)^2}{\sum_{i \in I_L \cup I_R} h_i + \lambda}\right] - \gamma$$

This gain formula elegantly combines the reduction in loss from splitting with the regularization penalty $ \gamma $ for adding a new leaf.

LightGBM — Leaf-Wise Growth & Histogram-Based Splitting

LightGBM (Ke et al., 2017) introduces two key innovations:

Leaf-wise (best-first) tree growth: Instead of growing all leaves at the same depth (level-wise), it picks the leaf with the maximum gain reduction, leading to deeper, more accurate trees with fewer leaves.
Histogram-based splitting: Continuous features are binned into discrete histograms (typically 255 bins), dramatically reducing computation. Split finding becomes O(#bins) instead of O(#data × #features).

LightGBM also introduces Gradient-based One-Side Sampling (GOSS) — keeping samples with large gradients and randomly sampling from those with small gradients — and Exclusive Feature Bundling (EFB) for high-dimensional sparse data.

CatBoost (Categorical Boosting) Family

A gradient boosting algorithm specially designed to handle categorical data efficiently and reduce prediction bias.

How CatBoost Works?

Start with training data
Automatically encode categorical features
Compute residual errors
Train decision trees sequentially
Use Ordered Boosting to reduce overfitting and target leakage
Combine all trees into the final prediction model

Key Features:

Handles categorical variables automatically
Reduces overfitting
Requires less preprocessing
High accuracy with minimal tuning

Core Idea Instead of manually converting categories into numbers, CatBoost intelligently processes categorical features during boosting, improving both speed and accuracy.

1.14 Python Implementation: Gradient Boosting from Scratch

To solidify your understanding, here is a minimal implementation of gradient boosting for regression using scikit-learn's DecisionTreeRegressor as the base learner:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingFromScratch:
    """Gradient Boosting for regression with MSE loss."""
    
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.lr = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.F0 = None
    
    def fit(self, X, y):
        # Step 1: Initialize with the mean
        self.F0 = np.mean(y)
        F = np.full(len(y), self.F0)
        
        for m in range(self.n_estimators):
            # Step 2(a): Compute pseudo-residuals (negative gradient of MSE)
            residuals = y - F
            
            # Step 2(b): Fit a regression tree to pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            
            # Step 2(c)+(d): Update predictions (tree already outputs leaf means)
            F += self.lr * tree.predict(X)
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        F = np.full(X.shape[0], self.F0)
        for tree in self.trees:
            F += self.lr * tree.predict(X)
        return F

# --- Example usage ---
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

gbm = GradientBoostingFromScratch(n_estimators=200, learning_rate=0.05, max_depth=4)
gbm.fit(X_train, y_train)
y_pred = gbm.predict(X_test)

print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

For binary classification with log-loss, you would modify the pseudo-residual computation and initialization:


class GradientBoostingClassifier:
    """Gradient Boosting for binary classification with log-loss."""
    
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.lr = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.F0 = None
    
    @staticmethod
    def _sigmoid(x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def fit(self, X, y):
        # Step 1: Initialize with log-odds
        p_bar = np.mean(y)
        self.F0 = np.log(p_bar / (1 - p_bar))
        F = np.full(len(y), self.F0)
        
        for m in range(self.n_estimators):
            # Step 2(a): Pseudo-residuals for log-loss
            p = self._sigmoid(F)
            residuals = y - p  # negative gradient of log-loss
            
            # Step 2(b): Fit tree to pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            
            # Step 2(d): Update (using tree output as approximation)
            F += self.lr * tree.predict(X)
            self.trees.append(tree)
        
        return self
    
    def predict_proba(self, X):
        F = np.full(X.shape[0], self.F0)
        for tree in self.trees:
            F += self.lr * tree.predict(X)
        return self._sigmoid(F)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

📘 Next Chapter: Decision Trees — The Building Blocks of Gradient Boosting. We will dive into how trees split data, the impurity measures (Gini, entropy, MSE), and how the split-finding algorithm works — the foundation for understanding XGBoost and LightGBM internals.

References & Further Reading

Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232.
Friedman, J. H. (1999). Stochastic Gradient Boosting. Computational Statistics & Data Analysis.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proc. 22nd ACM SIGKDD.
Ke, G., et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems (NeurIPS).
Prokhorenkova, L., et al. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS.
Freund, Y., & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning. JCSS, 55(1), 119–139.
The Elements of Statistical Learning — Chapter 10: Boosting and Additive Trees.
My GitHub: Machine Learning repository

Aspect	Gradient Descent (Parameter Space)	Gradient Boosting (Function Space)
What we optimize	Parameters \( \boldsymbol{\theta} \in \mathbb{R}^p \)	Function \( F(\mathbf{x}) \)
Gradient of	\( \nabla_{\boldsymbol{\theta}} \mathcal{L} \)	\( \nabla_{F(\mathbf{x}_i)} \mathcal{L} \) (pseudo-residuals)
Update rule	\( \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\;\nabla \mathcal{L} \)	\( F \leftarrow F + \eta\; h_m \) where \( h_m \approx -\nabla \mathcal{L} \)
Step direction	Exact gradient vector	Approximated by a base learner

Aspect	Gradient Descent	Gradient Boosting
Optimization space	Parameter space \( \mathbb{R}^p \)	Function space
What is updated	Weight vector \( \boldsymbol{\theta} \)	Prediction function \( F(\mathbf{x}) \)
Gradient computed w.r.t.	Parameters \( \boldsymbol{\theta} \)	Function values \( F(\mathbf{x}_i) \)
Step direction	Exact negative gradient \( -\nabla_\theta \mathcal{L} \)	Base learner fitted to pseudo-residuals
Step size	Learning rate \( \eta \)	Learning rate \( \eta \) + optimal leaf values \( \gamma_{jm} \)
Model class	Parametric (fixed form)	Non-parametric (additive ensemble)
Requirements	Model differentiable w.r.t. \( \boldsymbol{\theta} \)	Loss differentiable w.r.t. \( F(\mathbf{x}_i) \)
Common base model	Linear / Neural Network	Decision Tree (regression)
Handles non-differentiable models	No	Yes (trees, stumps, etc.)