3.1 Why Go Second-Order?

Standard gradient boosting uses only the first derivative (gradient) of the loss. This is like gradient descent — it tells us which direction to move, but not how far.

Newton's method uses both first and second derivatives for faster, more accurate optimization. XGBoost (Chen & Guestrin, 2016) brought this idea to gradient boosting, giving a principled framework that LightGBM extends.

Recap: Newton's Method

For minimizing \( f(\theta) \):

\[ \theta \leftarrow \theta - \frac{f'(\theta)}{f''(\theta)} \]

The second derivative \( f'' \) (curvature) tells us whether to take a large or small step. High curvature → smaller step, more precision.
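
As a quick illustration (a minimal sketch, not tied to any boosting library; function and variable names are made up for this example), the loop below applies the Newton update to a one-dimensional function with a known minimizer:

```python
import math

# Minimize f(theta) = exp(theta) - 2*theta, whose true minimizer is ln(2).
def grad(theta):
    return math.exp(theta) - 2.0      # f'(theta)

def curvature(theta):
    return math.exp(theta)            # f''(theta), always positive here

theta = 0.0
for step in range(1, 6):
    theta -= grad(theta) / curvature(theta)   # Newton update
    print(f"step {step}: theta = {theta:.8f}")

print(f"ln(2)  = {math.log(2):.8f}")
```

After only a few steps the iterate agrees with \( \ln 2 \) to many decimal places, which is the behavior we want to import into boosting.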

3.2 The Regularized Objective

At step \( m \), the exact objective for the new tree \( h_m \) is:

\[ \mathcal{L}^{(m)} = \sum_{i=1}^{n} L\left(y_i,\ F_{m-1}(x_i) + h_m(x_i)\right) + \Omega(h_m) \]

Where the regularization term \( \Omega(h_m) \) penalizes complexity:

\[ \Omega(h_m) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2 \]
  • \( J \) = number of leaves (penalizes having too many leaves, controlled by \( \gamma \))
  • \( w_j \) = leaf weight (penalizes large weights, controlled by \( \lambda \))
  • \( \gamma \geq 0 \) and \( \lambda \geq 0 \) are regularization hyperparameters
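
As a concrete check of the penalty (a sketch; `tree_penalty`, `gamma_`, and `lambda_` are names made up for this example), \( \Omega \) can be computed directly from a candidate tree's leaf weights:

```python
import numpy as np

def tree_penalty(leaf_weights, gamma_=0.1, lambda_=1.0):
    """Omega(h) = gamma * J + 0.5 * lambda * sum_j w_j^2."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma_ * w.size + 0.5 * lambda_ * np.sum(w ** 2)

# A 3-leaf tree with weights [0.3, -0.5, 0.1]:
print(tree_penalty([0.3, -0.5, 0.1]))   # 0.1*3 + 0.5*1.0*0.35 = 0.475
```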

3.3 Second-Order Taylor Approximation

We approximate the loss for each sample with a Taylor expansion around the current prediction \( F_{m-1}(x_i) \):

Let \( \hat{y}_i^{(m-1)} = F_{m-1}(x_i) \), and \( f = h_m(x_i) \) be the new tree's output.

\[ L(y_i, \hat{y}_i^{(m-1)} + f) \approx L(y_i, \hat{y}_i^{(m-1)}) + g_i f + \frac{1}{2} h_i f^2 \]

Where:

\[ g_i = \frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\bigg|_{\hat{y} = \hat{y}_i^{(m-1)}} \quad \text{(first derivative / gradient)} \]
\[ h_i = \frac{\partial^2 L(y_i, \hat{y})}{\partial \hat{y}^2}\bigg|_{\hat{y} = \hat{y}_i^{(m-1)}} \quad \text{(second derivative / Hessian)} \]

Note: \( h_i \) here is the Hessian (curvature), not to be confused with \( h_m \) (the new tree). Both usages are standard; context makes clear which is meant.

Dropping the constant term (\( L(y_i, \hat{y}_i^{(m-1)}) \) does not depend on the new tree), the objective to minimize becomes:

\[ \tilde{\mathcal{L}}^{(m)} = \sum_{i=1}^{n} \left[g_i h_m(x_i) + \frac{1}{2} h_i h_m(x_i)^2\right] + \Omega(h_m) \]
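
To see how good the approximation is, the snippet below (an illustrative sketch using the binary log-loss, whose gradient and Hessian appear in Section 3.8) compares the exact loss at a shifted raw prediction with the second-order Taylor estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss(y, z):
    # Binary log-loss as a function of the raw score z, with p = sigmoid(z).
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y, z_prev = 1.0, 0.2                 # label and current raw prediction F_{m-1}(x_i)
p = sigmoid(z_prev)
g = p - y                            # first derivative w.r.t. the raw score
h = p * (1.0 - p)                    # second derivative w.r.t. the raw score

f = 0.3                              # candidate output of the new tree, h_m(x_i)
exact  = logloss(y, z_prev + f)
taylor = logloss(y, z_prev) + g * f + 0.5 * h * f ** 2
print(f"exact = {exact:.6f}, taylor = {taylor:.6f}")   # the two values should be close
```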

3.4 Reformulating in Terms of Leaves

Since the tree \( h_m \) assigns the same leaf weight \( w_j \) to every sample \( x_i \) that falls in leaf \( j \), we can group the sum by leaves. Let \( I_j = \{i : x_i \in R_j\} \) be the set of sample indices in leaf \( j \) (with region \( R_j \)). Substituting \( h_m(x_i) = w_j \) and expanding \( \Omega(h_m) \):

\[ \tilde{\mathcal{L}}^{(m)} = \sum_{j=1}^{J} \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma J \]

Define:

\[ G_j = \sum_{i \in I_j} g_i \quad \text{(sum of gradients in leaf } j\text{)} \]
\[ H_j = \sum_{i \in I_j} h_i \quad \text{(sum of Hessians in leaf } j\text{)} \]

Then:

\[ \tilde{\mathcal{L}}^{(m)} = \sum_{j=1}^{J} \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2\right] + \gamma J \]
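
A sketch of the grouping step: given each sample's \( g_i \), \( h_i \), and the leaf it falls into under a candidate structure, the per-leaf sums are simple aggregations (the arrays below are made-up toy values):

```python
import numpy as np

# Per-sample gradients/Hessians and the leaf index each sample falls into.
g = np.array([0.4, -0.3, 0.1, 0.7, -0.2])
h = np.array([0.24, 0.21, 0.25, 0.20, 0.16])
leaf = np.array([0, 0, 1, 1, 1])                # leaf index j for each sample

J = leaf.max() + 1
G = np.bincount(leaf, weights=g, minlength=J)   # G_j = sum of g_i in leaf j
H = np.bincount(leaf, weights=h, minlength=J)   # H_j = sum of h_i in leaf j
print(G, H)                                     # [0.1 0.6] [0.45 0.61]
```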

3.5 Optimal Leaf Weights

For a fixed tree structure, the objective is a quadratic in each \( w_j \), and the leaves are independent (each sample goes to exactly one leaf). We minimize each leaf separately:

\[ \frac{\partial \tilde{\mathcal{L}}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0 \]
\[ \boxed{w_j^* = -\frac{G_j}{H_j + \lambda}} \]

This is a regularized Newton step: the negative gradient sum \( -G_j \) divided by the curvature \( H_j \), with \( \lambda \) added to stabilize the denominator.
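
Continuing the toy numbers from the previous sketch, the optimal weights follow directly (note how adding \( \lambda \) shrinks every step):

```python
import numpy as np

G = np.array([0.1, 0.6])      # per-leaf gradient sums from the previous sketch
H = np.array([0.45, 0.61])    # per-leaf Hessian sums
lambda_ = 1.0

w_star = -G / (H + lambda_)   # regularized Newton step for each leaf
print(w_star)                 # approx [-0.069, -0.373]
```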

3.6 Optimal Tree Score

Substituting \( w_j^* \) back into the objective gives the minimum loss for a given tree structure:

\[ \tilde{\mathcal{L}}^*_{tree} = -\frac{1}{2}\sum_{j=1}^{J} \frac{G_j^2}{H_j + \lambda} + \gamma J \]

This score tells us how good this particular tree structure is. A lower (more negative) score = better tree.
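
The structure score is just as easy to evaluate; `structure_score` below is an illustrative helper written for this chapter, not part of any library API:

```python
import numpy as np

def structure_score(G, H, lambda_=1.0, gamma_=0.0):
    """Minimum of the second-order objective for a fixed tree structure."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    return -0.5 * np.sum(G ** 2 / (H + lambda_)) + gamma_ * G.size

# Using the per-leaf sums from the earlier sketch:
print(structure_score([0.1, 0.6], [0.45, 0.61], lambda_=1.0, gamma_=0.1))
```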

3.7 Split Gain Formula

When evaluating a candidate split that divides a leaf with totals \( (G, H) \) into left and right children with \( (G_L, H_L) \) and \( (G_R, H_R) \), where \( G = G_L + G_R \) and \( H = H_L + H_R \):

Before split (the single leaf's contribution to the structure score): \( -\frac{1}{2}\frac{G^2}{H + \lambda} \)

After split (two leaves): \( -\frac{1}{2}\frac{G_L^2}{H_L + \lambda} - \frac{1}{2}\frac{G_R^2}{H_R + \lambda} \)

The gain is the reduction in the objective (before minus after), minus the \( \gamma \) cost of one extra leaf:

\[ \boxed{\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma} \]

We split only if \( \text{Gain} > 0 \).

  • The \( -\gamma \) term means a split must overcome the regularization penalty on an extra leaf.
  • Higher \( \lambda \) shrinks the scores, making splits less attractive (more regularization).
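
Putting the pieces together, a small helper (illustrative names, not XGBoost's internal API) computes the gain for one candidate split:

```python
def split_gain(G_L, H_L, G_R, H_R, lambda_=1.0, gamma_=0.0):
    """Gain of splitting a node with totals (G_L + G_R, H_L + H_R) into two children."""
    def score(G, H):
        return G ** 2 / (H + lambda_)
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G, H)) - gamma_

# A split is accepted only if the gain is positive.
print(split_gain(G_L=3.0, H_L=4.0, G_R=-2.5, H_R=3.5, lambda_=1.0, gamma_=0.1))
```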

3.8 Gradient & Hessian for Common Loss Functions

| Loss | \( L(y, \hat{y}) \) | Gradient \( g_i \) | Hessian \( h_i \) |
| --- | --- | --- | --- |
| MSE | \( \frac{1}{2}(y - \hat{y})^2 \) | \( \hat{y}_i - y_i \) | \( 1 \) |
| Log-loss (binary) | \( -y\log p - (1-y)\log(1-p) \) | \( p_i - y_i \) | \( p_i(1 - p_i) \) |
| MAE | \( \lvert y - \hat{y}\rvert \) | \( \text{sign}(\hat{y}_i - y_i) \) | \( 0 \) (special handling needed) |
| Huber | smoothed MAE | piecewise | piecewise |

Where \( p_i = \sigma(F(x_i)) = 1/(1 + e^{-F(x_i)}) \) for log-loss.
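
The first two rows of the table translate directly into code; a minimal sketch (function names are made up for illustration):

```python
import numpy as np

# Per-sample gradients and Hessians, taken w.r.t. the raw prediction F(x_i).
def mse_grad_hess(y, F):
    return F - y, np.ones_like(F)        # g = F - y, h = 1

def logloss_grad_hess(y, F):
    p = 1.0 / (1.0 + np.exp(-F))         # p_i = sigmoid(F(x_i))
    return p - y, p * (1.0 - p)          # g = p - y, h = p(1 - p)

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.5, -0.2, 2.0])
print(logloss_grad_hess(y, F))
```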

3.9 The Full XGBoost Tree-Building Algorithm

At each boosting round \( m \):

  1. Compute \( g_i \) and \( h_i \) for all samples using current predictions \( F_{m-1}(x_i) \)
  2. Start with a single leaf containing all samples
  3. For each leaf, try every candidate split \( (k, s) \), where \( k \) is a feature and \( s \) a threshold:
    • Compute \( G_L, H_L, G_R, H_R \)
    • Compute Gain using the formula above
  4. Apply the best split if Gain > 0
  5. Repeat steps 3–4 recursively for the child nodes (up to max_depth)
  6. Compute optimal weights \( w_j^* = -G_j / (H_j + \lambda) \) for all leaves
  7. Update the ensemble: \( F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) \), where \( \nu \) is the learning rate (shrinkage)
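
Steps 3–4 are where most of the work happens. Below is a deliberately simplified sketch of the exhaustive split scan for a single node and a single feature; real implementations use histogram binning, sparsity handling, and parallel column blocks, and all names here are illustrative:

```python
import numpy as np

def best_split_one_feature(x, g, h, lambda_=1.0, gamma_=0.0):
    """Scan all thresholds of one feature; return (best_gain, best_threshold).

    x, g, h are the feature values, gradients, and Hessians of the samples
    currently in the node. This is an illustrative sketch, not XGBoost's
    actual histogram-based implementation.
    """
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(Gs, Hs):
        return Gs ** 2 / (Hs + lambda_)

    best_gain, best_thr = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:
            continue                           # cannot split between equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G, H)) - gamma_
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr

# Toy data: gradients flip sign around x = 3, so the scan should split there.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
g = np.array([0.4, 0.3, 0.5, -0.6, -0.7])
h = np.array([0.24, 0.21, 0.25, 0.24, 0.21])
print(best_split_one_feature(x, g, h))
```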

3.10 Why is the Hessian Important?

Consider two leaves with the same gradient sum \( G_j = 10 \), and take \( \lambda = 1 \):

  • Leaf A: \( H_j = 100 \) (many samples, high curvature)
  • Leaf B: \( H_j = 1 \) (few samples, low curvature)

Without the Hessian (treating \( H_j \) as \( 0 \)): \( w_A = w_B = -10/(0 + 1) = -10 \) (the same weight!)

With Hessian: \( w_A = -10/101 \approx -0.099 \) and \( w_B = -10/2 = -5 \)

Leaf A has many samples with conflicting predictions — the Hessian correctly gives it a smaller, more cautious update. Leaf B has a confident signal — it gets a larger update.
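
The arithmetic above, in a few lines (same \( G_j = 10 \) and \( \lambda = 1 \) as in the example):

```python
G, lambda_ = 10.0, 1.0

for name, H in [("Leaf A", 100.0), ("Leaf B", 1.0)]:
    w_grad_only = -G / (0.0 + lambda_)     # ignoring curvature: -10 for both leaves
    w_newton = -G / (H + lambda_)          # curvature-aware (Newton) step
    print(f"{name}: gradient-only {w_grad_only:.3f}, Newton {w_newton:.3f}")
```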

This is exactly why second-order methods are more sample-efficient and faster to converge.
📘 Next Chapter: LightGBM — Speed, Scale, and Innovation

References & Further Reading

