Chapter 3 Contents
- 3.1 Why Go Second-Order?
- 3.2 The Regularized Objective
- 3.3 Second-Order Taylor Approximation
- 3.4 Reformulating in Terms of Leaves
- 3.5 Optimal Leaf Weights
- 3.6 Optimal Tree Score
- 3.7 Split Gain Formula
- 3.8 Gradient & Hessian for Common Losses
- 3.9 The Full XGBoost Tree-Building Algorithm
- 3.10 Why is the Hessian Important?
3.1 Why Go Second-Order?
Standard gradient boosting uses only the first derivative (gradient) of the loss. This is like gradient descent — it tells us which direction to move, but not how far.
Newton's method uses both first and second derivatives for faster, more accurate optimization. XGBoost (Chen & Guestrin, 2016) brought this idea to gradient boosting, giving a principled framework that LightGBM extends.
Recap: Newton's Method
For minimizing \( f(\theta) \), the update is:
\[ \theta \;\leftarrow\; \theta - \frac{f'(\theta)}{f''(\theta)} \]
The second derivative \( f'' \) (curvature) tells us whether to take a large or small step. High curvature → smaller step, more precision.
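As a tiny illustration (plain Python with a toy function of my own, not from the original), one Newton step lands exactly on the minimum of a quadratic, while a fixed-learning-rate gradient step does not:

```python
# Minimal sketch: one gradient step vs. one Newton step on f(theta) = 3*theta^2 - 12*theta,
# which has f'(theta) = 6*theta - 12, f''(theta) = 6, and its minimum at theta = 2.
def grad(theta):
    return 6 * theta - 12

def hess(theta):
    return 6.0

theta0 = 0.0
lr = 0.1

gradient_step = theta0 - lr * grad(theta0)            # 0 - 0.1 * (-12) = 1.2  (still far from 2)
newton_step   = theta0 - grad(theta0) / hess(theta0)  # 0 - (-12) / 6   = 2.0  (exact minimum)

print(gradient_step, newton_step)  # 1.2 2.0
```

The curvature tells Newton's method exactly how far to go in one step; gradient descent has to guess via the learning rate.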
3.2 The Regularized Objective
At step \( m \), the exact objective for the new tree \( h_m \) is:
\[ \text{Obj}^{(m)} = \sum_{i=1}^{n} L\big(y_i,\ F_{m-1}(x_i) + h_m(x_i)\big) + \Omega(h_m) \]
where the regularization term \( \Omega(h_m) \) penalizes complexity:
\[ \Omega(h_m) = \gamma J + \frac{\lambda}{2} \sum_{j=1}^{J} w_j^2 \]
- \( J \) = number of leaves (penalizes having too many leaves, controlled by \( \gamma \))
- \( w_j \) = leaf weight (penalizes large weights, controlled by \( \lambda \))
- \( \gamma \geq 0 \) and \( \lambda \geq 0 \) are regularization hyperparameters
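A quick numeric sketch of the penalty, with made-up values for \( \gamma \), \( \lambda \), and the leaf weights:

```python
import numpy as np

# Omega(h) = gamma * J + (lambda / 2) * sum(w_j^2), for illustrative values.
gamma_, lambda_ = 1.0, 1.0
w = np.array([0.4, -0.3, 0.1])   # leaf weights of a hypothetical 3-leaf tree (J = 3)
omega = gamma_ * len(w) + 0.5 * lambda_ * np.sum(w ** 2)
print(omega)  # 3.0 + 0.5 * 0.26 = 3.13
```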
3.3 Second-Order Taylor Approximation
We approximate the loss for each sample with a second-order Taylor expansion around the current prediction \( F_{m-1}(x_i) \). Let \( \hat{y}_i^{(m-1)} = F_{m-1}(x_i) \) and let \( f = h_m(x_i) \) denote the new tree's output:
\[ L\big(y_i,\ \hat{y}_i^{(m-1)} + f\big) \;\approx\; L\big(y_i, \hat{y}_i^{(m-1)}\big) + g_i\, f + \tfrac{1}{2}\, h_i\, f^2 \]
Where:
\[ g_i = \frac{\partial L\big(y_i, \hat{y}_i^{(m-1)}\big)}{\partial \hat{y}_i^{(m-1)}}, \qquad h_i = \frac{\partial^2 L\big(y_i, \hat{y}_i^{(m-1)}\big)}{\partial \big(\hat{y}_i^{(m-1)}\big)^2} \]
Note: \( h_i \) here is the Hessian (curvature), not to be confused with \( h_m \) (the tree). Standard notation — context will clarify.
Dropping constant terms (\( L(y_i, \hat{y}_i^{(m-1)}) \) does not depend on the new tree), the objective to minimize becomes:
\[ \tilde{\text{Obj}}^{(m)} = \sum_{i=1}^{n} \Big[ g_i\, h_m(x_i) + \tfrac{1}{2}\, h_i\, h_m(x_i)^2 \Big] + \Omega(h_m) \]
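To get a feel for how tight this approximation is, the sketch below (binary log-loss, hypothetical numbers of my own) compares the true change in loss with the quadratic estimate \( g_i f + \tfrac{1}{2} h_i f^2 \) for a small tree output \( f \):

```python
import numpy as np

def logloss(y, raw):
    """Binary log-loss as a function of the raw score F(x)."""
    p = 1.0 / (1.0 + np.exp(-raw))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, raw_prev, f = 1.0, 0.3, 0.2               # label, current raw prediction, new tree output

p = 1.0 / (1.0 + np.exp(-raw_prev))
g = p - y                                     # first derivative of log-loss w.r.t. the raw score
h = p * (1.0 - p)                             # second derivative (Hessian)

true_change   = logloss(y, raw_prev + f) - logloss(y, raw_prev)
approx_change = g * f + 0.5 * h * f ** 2      # second-order Taylor estimate

print(true_change, approx_change)             # ~ -0.080 vs ~ -0.080: very close for small f
```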
3.4 Reformulating in Terms of Leaves
Since the tree \( h_m \) assigns leaf weight \( w_j \) to all samples \( x_i \) in leaf \( j \), we can group the sum by leaves. Let \( I_j = \{i : x_i \in R_j\} \) be the set of sample indices in leaf \( j \):
\[ \tilde{\text{Obj}}^{(m)} = \sum_{j=1}^{J} \Big[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \tfrac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \Big] + \gamma J \]
Define:
\[ G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \]
Then:
\[ \tilde{\text{Obj}}^{(m)} = \sum_{j=1}^{J} \Big[ G_j\, w_j + \tfrac{1}{2} (H_j + \lambda)\, w_j^2 \Big] + \gamma J \]
3.5 Optimal Leaf Weights
For a fixed tree structure, the objective is a quadratic in each \( w_j \), and the leaves are independent (each sample goes to exactly one leaf), so we can minimize each leaf separately. Setting the derivative with respect to \( w_j \) to zero:
\[ G_j + (H_j + \lambda)\, w_j = 0 \quad\Longrightarrow\quad w_j^* = -\frac{G_j}{H_j + \lambda} \]
This is the Newton step: gradient \( G_j \) divided by curvature \( H_j \), with \( \lambda \) stabilizing the denominator (like a regularized Newton step).
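A minimal sketch of this Newton-style weight for one leaf, using hypothetical per-sample gradients and Hessians (the function name and numbers are mine):

```python
import numpy as np

def leaf_weight(g, h, lambda_=1.0):
    """Optimal leaf weight w* = -G / (H + lambda) from the per-sample g_i, h_i in the leaf."""
    G, H = np.sum(g), np.sum(h)
    return -G / (H + lambda_)

# Hypothetical gradients/Hessians of the samples falling into one leaf.
g = np.array([0.6, -0.2, 0.4, 0.5])
h = np.array([0.24, 0.16, 0.24, 0.25])

print(leaf_weight(g, h))  # -1.3 / (0.89 + 1.0) ≈ -0.688
```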
3.6 Optimal Tree Score
Substituting \( w_j^* \) back into the objective gives the minimum loss for a given tree structure:
\[ \tilde{\text{Obj}}^{(m)*} = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_j^2}{H_j + \lambda} + \gamma J \]
This score tells us how good a particular tree structure is: a lower (more negative) score means a better tree.
3.7 Split Gain Formula
When evaluating a candidate split that divides a leaf with totals \( (G, H) \) into left \( (G_L, H_L) \) and right \( (G_R, H_R) \) children (so \( G = G_L + G_R \) and \( H = H_L + H_R \)):
\[ \text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma \]
Before split (single leaf score): \( -\tfrac{1}{2}\frac{G^2}{H + \lambda} \)
After split (sum of the two leaf scores): \( -\tfrac{1}{2}\frac{G_L^2}{H_L + \lambda} - \tfrac{1}{2}\frac{G_R^2}{H_R + \lambda} \)
We split only if \( \text{Gain} > 0 \).
- The \( -\gamma \) term means a split must overcome the regularization penalty on an extra leaf.
- Higher \( \lambda \) shrinks the scores, making splits less attractive (more regularization).
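Here is a small sketch of the gain computation; the function name and example numbers are mine, not a library API:

```python
def split_gain(G_L, H_L, G_R, H_R, lambda_=1.0, gamma_=0.0):
    """Gain of splitting a leaf with totals (G_L + G_R, H_L + H_R) into left/right children."""
    def score(G, H):
        return G * G / (H + lambda_)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma_

# Example: the two children pull in opposite directions, so the split is clearly worthwhile.
print(split_gain(G_L=-8.0, H_L=10.0, G_R=9.0, H_R=12.0, lambda_=1.0, gamma_=1.0))  # ≈ 5.0 > 0
```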
3.8 Gradient & Hessian for Common Loss Functions
| Loss | \( L(y, \hat{y}) \) | Gradient \( g_i \) | Hessian \( h_i \) |
|---|---|---|---|
| MSE | \( \frac{1}{2}(y - \hat{y})^2 \) | \( \hat{y}_i - y_i \) | \( 1 \) |
| Log-loss (binary) | \( -y\log p - (1-y)\log(1-p) \) | \( p_i - y_i \) | \( p_i(1 - p_i) \) |
| MAE | \( \lvert y - \hat{y} \rvert \) | \( \text{sign}(\hat{y}_i - y_i) \) | \( 0 \) (special handling needed) |
| Huber | Smooth MAE | Piecewise | Piecewise |
Where \( p_i = \sigma(F(x_i)) = 1/(1 + e^{-F(x_i)}) \) for log-loss.
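A sketch of the two most common rows of the table (squared error and binary log-loss), written as functions of the raw score; the variable names are mine:

```python
import numpy as np

def mse_grad_hess(y, raw):
    """L = 0.5 * (y - raw)^2  ->  g = raw - y,  h = 1."""
    return raw - y, np.ones_like(raw)

def logloss_grad_hess(y, raw):
    """L = -y*log(p) - (1-y)*log(1-p) with p = sigmoid(raw)  ->  g = p - y,  h = p*(1-p)."""
    p = 1.0 / (1.0 + np.exp(-raw))
    return p - y, p * (1.0 - p)

y   = np.array([1.0, 0.0, 1.0])
raw = np.array([0.2, -0.5, 1.5])
print(mse_grad_hess(y, raw))
print(logloss_grad_hess(y, raw))
```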
3.9 The Full XGBoost Tree-Building Algorithm
At each boosting round \( m \):
1. Compute \( g_i \) and \( h_i \) for all samples using the current predictions \( F_{m-1}(x_i) \).
2. Start with a single leaf containing all samples.
3. For each leaf, try all candidate splits \( (k, s) \):
   - Compute \( G_L, H_L, G_R, H_R \)
   - Compute the Gain using the formula above
4. Apply the best split if Gain > 0.
5. Repeat steps 3 and 4 for the child nodes (up to `max_depth`).
6. Compute the optimal weights \( w_j^* = -G_j / (H_j + \lambda) \) for all leaves.
7. Update: \( F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) \)
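Below is a compact, simplified sketch of one boosting round following these steps, with exact greedy split search and squared-error loss. It is a toy illustration under my own naming and defaults, not the actual XGBoost implementation (which adds sparsity handling, approximate/histogram split finding, column sampling, and more):

```python
import numpy as np

def best_split(X, g, h, idx, lambda_, gamma_):
    """Exact greedy search: best (feature, threshold) by gain over the samples in `idx`."""
    def score(G, H):
        return G * G / (H + lambda_)
    G, H = g[idx].sum(), h[idx].sum()
    best = (0.0, None, None)                        # (gain, feature, threshold)
    for k in range(X.shape[1]):
        order = idx[np.argsort(X[idx, k])]          # sample indices sorted by feature k
        G_L = H_L = 0.0
        for pos in range(len(order) - 1):
            G_L += g[order[pos]]; H_L += h[order[pos]]
            if X[order[pos], k] == X[order[pos + 1], k]:
                continue                            # no valid threshold between equal values
            gain = 0.5 * (score(G_L, H_L) + score(G - G_L, H - H_L) - score(G, H)) - gamma_
            if gain > best[0]:                      # keep only splits with positive gain
                thr = 0.5 * (X[order[pos], k] + X[order[pos + 1], k])
                best = (gain, k, thr)
    return best

def build_tree(X, g, h, idx, depth, max_depth=3, lambda_=1.0, gamma_=0.0):
    """Recursively grow one tree; leaves carry the Newton weight w* = -G / (H + lambda)."""
    if depth < max_depth:
        gain, k, thr = best_split(X, g, h, idx, lambda_, gamma_)
        if k is not None:                           # a split with Gain > 0 exists
            left, right = idx[X[idx, k] <= thr], idx[X[idx, k] > thr]
            return {"feat": k, "thr": thr,
                    "left": build_tree(X, g, h, left, depth + 1, max_depth, lambda_, gamma_),
                    "right": build_tree(X, g, h, right, depth + 1, max_depth, lambda_, gamma_)}
    return {"leaf": -g[idx].sum() / (h[idx].sum() + lambda_)}

def predict_tree(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feat"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

# One boosting round with squared error: g = F_prev - y, h = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
F_prev = np.zeros(200)
g, h = F_prev - y, np.ones(200)

tree = build_tree(X, g, h, np.arange(200), depth=0)
nu = 0.3                                            # learning rate
F_new = F_prev + nu * np.array([predict_tree(tree, x) for x in X])
print("MSE before:", np.mean((y - F_prev) ** 2), "after:", np.mean((y - F_new) ** 2))
```

Each round only needs the per-sample \( g_i, h_i \); swapping the loss function means swapping those two lines, which is exactly why the framework generalizes so cleanly.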
3.10 Why is the Hessian Important?
Consider two leaves with the same gradient sum \( G_j = 10 \):
- Leaf A: \( H_j = 100 \) (many samples, high curvature)
- Leaf B: \( H_j = 1 \) (few samples, low curvature)
Without the Hessian: \( w_A = w_B = -10 \) (the same weight for both!)
With the Hessian (taking \( \lambda = 1 \)): \( w_A = -10/101 \approx -0.099 \) and \( w_B = -10/2 = -5 \)
The curvature scales the step automatically: the high-curvature leaf gets a small, cautious correction, while the low-curvature leaf can move much further. This is exactly why second-order methods are more sample-efficient and converge in fewer boosting rounds.
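The same arithmetic in a few lines (assuming \( \lambda = 1 \), as above):

```python
# Leaf A: many samples, high total curvature; Leaf B: few samples, low curvature.
G_A, H_A = 10.0, 100.0
G_B, H_B = 10.0, 1.0
lambda_ = 1.0

print(-G_A / (H_A + lambda_))  # ≈ -0.099: small, cautious step
print(-G_B / (H_B + lambda_))  # -5.0: large step where curvature is low
```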
References & Further Reading
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016.
- My GitHub: Machine Learning repository
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), Chapter 10: Boosting and Additive Trees.