3.1 Why Go Second-Order?

Standard gradient boosting uses only the first derivative (gradient) of the loss. This is like gradient descent — it tells us which direction to move, but not how far.

Newton's method uses both first and second derivatives for faster, more accurate optimization. XGBoost (Chen & Guestrin, 2016) brought this idea to gradient boosting, giving a principled framework that LightGBM extends.

Recap: Newton's Method

For minimizing \( f(\theta) \):

\[ \theta \leftarrow \theta - \frac{f'(\theta)}{f''(\theta)} \]

The second derivative \( f'' \) (curvature) tells us whether to take a large or small step. High curvature → smaller step, more precision.
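
As a quick illustration (a minimal sketch, not tied to any boosting library; function and variable names are made up for this example), the loop below applies the Newton update to a one-dimensional function with a known minimizer:

```python
import math

# Minimize f(theta) = exp(theta) - 2*theta, whose true minimizer is ln(2).
def grad(theta):
    return math.exp(theta) - 2.0      # f'(theta)

def curvature(theta):
    return math.exp(theta)            # f''(theta), always positive here

theta = 0.0
for step in range(1, 6):
    theta -= grad(theta) / curvature(theta)   # Newton update
    print(f"step {step}: theta = {theta:.8f}")

print(f"ln(2)  = {math.log(2):.8f}")
```

After only a few steps the iterate agrees with \( \ln 2 \) to many decimal places, which is the behavior we want to import into boosting.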

3.2 The Regularized Objective

At step \( m \), the exact objective for the new tree \( h_m \) is:

\[ \mathcal{L}^{(m)} = \sum_{i=1}^{n} L\left(y_i,\ F_{m-1}(x_i) + h_m(x_i)\right) + \Omega(h_m) \]

Where the regularization term \( \Omega(h_m) \) penalizes complexity:

\[ \Omega(h_m) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2 \]
  • \( J \) = number of leaves (penalizes having too many leaves, controlled by \( \gamma \))
  • \( w_j \) = leaf weight (penalizes large weights, controlled by \( \lambda \))
  • \( \gamma \geq 0 \) and \( \lambda \geq 0 \) are regularization hyperparameters
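
As a concrete check of the penalty (a sketch; `tree_penalty`, `gamma_`, and `lambda_` are names made up for this example), \( \Omega \) can be computed directly from a candidate tree's leaf weights:

```python
import numpy as np

def tree_penalty(leaf_weights, gamma_=0.1, lambda_=1.0):
    """Omega(h) = gamma * J + 0.5 * lambda * sum_j w_j^2."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma_ * w.size + 0.5 * lambda_ * np.sum(w ** 2)

# A 3-leaf tree with weights [0.3, -0.5, 0.1]:
print(tree_penalty([0.3, -0.5, 0.1]))   # 0.1*3 + 0.5*1.0*0.35 = 0.475
```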

3.3 Second-Order Taylor Approximation

We approximate the loss for each sample with a Taylor expansion around the current prediction \( F_{m-1}(x_i) \):

Let \( \hat{y}_i^{(m-1)} = F_{m-1}(x_i) \), and \( f = h_m(x_i) \) be the new tree's output.

\[ L(y_i, \hat{y}_i^{(m-1)} + f) \approx L(y_i, \hat{y}_i^{(m-1)}) + g_i f + \frac{1}{2} h_i f^2 \]

Where:

\[ g_i = \frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\bigg|_{\hat{y} = \hat{y}_i^{(m-1)}} \quad \text{(first derivative / gradient)} \]
\[ h_i = \frac{\partial^2 L(y_i, \hat{y})}{\partial \hat{y}^2}\bigg|_{\hat{y} = \hat{y}_i^{(m-1)}} \quad \text{(second derivative / Hessian)} \]

Note: \( h_i \) here is the Hessian (curvature), not to be confused with \( h_m \) (the new tree). Both usages are standard; context makes clear which is meant.

Dropping the constant term (\( L(y_i, \hat{y}_i^{(m-1)}) \) does not depend on the new tree), the objective to minimize becomes:

\[ \tilde{\mathcal{L}}^{(m)} = \sum_{i=1}^{n} \left[g_i h_m(x_i) + \frac{1}{2} h_i h_m(x_i)^2\right] + \Omega(h_m) \]
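
To see how good the approximation is, the snippet below (an illustrative sketch using the binary log-loss, whose gradient and Hessian appear in Section 3.8) compares the exact loss at a shifted raw prediction with the second-order Taylor estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss(y, z):
    # Binary log-loss as a function of the raw score z, with p = sigmoid(z).
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y, z_prev = 1.0, 0.2                 # label and current raw prediction F_{m-1}(x_i)
p = sigmoid(z_prev)
g = p - y                            # first derivative w.r.t. the raw score
h = p * (1.0 - p)                    # second derivative w.r.t. the raw score

f = 0.3                              # candidate output of the new tree, h_m(x_i)
exact  = logloss(y, z_prev + f)
taylor = logloss(y, z_prev) + g * f + 0.5 * h * f ** 2
print(f"exact = {exact:.6f}, taylor = {taylor:.6f}")   # the two values should be close
```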

3.4 Reformulating in Terms of Leaves

Since the tree \( h_m \) assigns the same leaf weight \( w_j \) to every sample \( x_i \) that falls in leaf \( j \), we can group the sum by leaves. Let \( I_j = \{i : x_i \in R_j\} \) be the set of sample indices in leaf \( j \) (with region \( R_j \)). Substituting \( h_m(x_i) = w_j \) and expanding \( \Omega(h_m) \):

\[ \tilde{\mathcal{L}}^{(m)} = \sum_{j=1}^{J} \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma J \]

Define:

\[ G_j = \sum_{i \in I_j} g_i \quad \text{(sum of gradients in leaf } j\text{)} \]
\[ H_j = \sum_{i \in I_j} h_i \quad \text{(sum of Hessians in leaf } j\text{)} \]

Then:

\[ \tilde{\mathcal{L}}^{(m)} = \sum_{j=1}^{J} \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2\right] + \gamma J \]
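
A sketch of the grouping step: given each sample's \( g_i \), \( h_i \), and the leaf it falls into under a candidate structure, the per-leaf sums are simple aggregations (the arrays below are made-up toy values):

```python
import numpy as np

# Per-sample gradients/Hessians and the leaf index each sample falls into.
g = np.array([0.4, -0.3, 0.1, 0.7, -0.2])
h = np.array([0.24, 0.21, 0.25, 0.20, 0.16])
leaf = np.array([0, 0, 1, 1, 1])                # leaf index j for each sample

J = leaf.max() + 1
G = np.bincount(leaf, weights=g, minlength=J)   # G_j = sum of g_i in leaf j
H = np.bincount(leaf, weights=h, minlength=J)   # H_j = sum of h_i in leaf j
print(G, H)                                     # [0.1 0.6] [0.45 0.61]
```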

3.5 Optimal Leaf Weights

For a fixed tree structure, the objective is a quadratic in each \( w_j \), and the leaves are independent (each sample goes to exactly one leaf). We minimize each leaf separately:

\[ \frac{\partial \tilde{\mathcal{L}}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0 \]
\[ \boxed{w_j^* = -\frac{G_j}{H_j + \lambda}} \]

This is a regularized Newton step: the negative gradient sum \( -G_j \) divided by the curvature \( H_j \), with \( \lambda \) added to stabilize the denominator.
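
Continuing the toy numbers from the previous sketch, the optimal weights follow directly (note how adding \( \lambda \) shrinks every step):

```python
import numpy as np

G = np.array([0.1, 0.6])      # per-leaf gradient sums from the previous sketch
H = np.array([0.45, 0.61])    # per-leaf Hessian sums
lambda_ = 1.0

w_star = -G / (H + lambda_)   # regularized Newton step for each leaf
print(w_star)                 # approx [-0.069, -0.373]
```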

3.6 Optimal Tree Score

Substituting \( w_j^* \) back into the objective gives the minimum loss for a given tree structure:

\[ \tilde{\mathcal{L}}^*_{tree} = -\frac{1}{2}\sum_{j=1}^{J} \frac{G_j^2}{H_j + \lambda} + \gamma J \]

This score tells us how good this particular tree structure is. A lower (more negative) score = better tree.
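
The structure score is just as easy to evaluate; `structure_score` below is an illustrative helper written for this chapter, not part of any library API:

```python
import numpy as np

def structure_score(G, H, lambda_=1.0, gamma_=0.0):
    """Minimum of the second-order objective for a fixed tree structure."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    return -0.5 * np.sum(G ** 2 / (H + lambda_)) + gamma_ * G.size

# Using the per-leaf sums from the earlier sketch:
print(structure_score([0.1, 0.6], [0.45, 0.61], lambda_=1.0, gamma_=0.1))
```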

3.7 Split Gain Formula

When evaluating a candidate split that divides a leaf with totals \( (G, H) \) into left and right children with \( (G_L, H_L) \) and \( (G_R, H_R) \), where \( G = G_L + G_R \) and \( H = H_L + H_R \):

Before split (the single leaf's contribution to the structure score): \( -\frac{1}{2}\frac{G^2}{H + \lambda} \)

After split (two leaves): \( -\frac{1}{2}\frac{G_L^2}{H_L + \lambda} - \frac{1}{2}\frac{G_R^2}{H_R + \lambda} \)

The gain is the reduction in the objective (before minus after), minus the \( \gamma \) cost of one extra leaf:

\[ \boxed{\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma} \]

We split only if \( \text{Gain} > 0 \).

  • The \( -\gamma \) term means a split must overcome the regularization penalty on an extra leaf.
  • Higher \( \lambda \) shrinks the scores, making splits less attractive (more regularization).
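
Putting the pieces together, a small helper (illustrative names, not XGBoost's internal API) computes the gain for one candidate split:

```python
def split_gain(G_L, H_L, G_R, H_R, lambda_=1.0, gamma_=0.0):
    """Gain of splitting a node with totals (G_L + G_R, H_L + H_R) into two children."""
    def score(G, H):
        return G ** 2 / (H + lambda_)
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G, H)) - gamma_

# A split is accepted only if the gain is positive.
print(split_gain(G_L=3.0, H_L=4.0, G_R=-2.5, H_R=3.5, lambda_=1.0, gamma_=0.1))
```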

3.8 Gradient & Hessian for Common Loss Functions

| Loss | \( L(y, \hat{y}) \) | Gradient \( g_i \) | Hessian \( h_i \) |
| --- | --- | --- | --- |
| MSE | \( \frac{1}{2}(y - \hat{y})^2 \) | \( \hat{y}_i - y_i \) | \( 1 \) |
| Log-loss (binary) | \( -y\log p - (1-y)\log(1-p) \) | \( p_i - y_i \) | \( p_i(1 - p_i) \) |
| MAE | \( \lvert y - \hat{y}\rvert \) | \( \text{sign}(\hat{y}_i - y_i) \) | \( 0 \) (special handling needed) |
| Huber | smoothed MAE | piecewise | piecewise |

Where \( p_i = \sigma(F(x_i)) = 1/(1 + e^{-F(x_i)}) \) for log-loss.
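
The first two rows of the table translate directly into code; a minimal sketch (function names are made up for illustration):

```python
import numpy as np

# Per-sample gradients and Hessians, taken w.r.t. the raw prediction F(x_i).
def mse_grad_hess(y, F):
    return F - y, np.ones_like(F)        # g = F - y, h = 1

def logloss_grad_hess(y, F):
    p = 1.0 / (1.0 + np.exp(-F))         # p_i = sigmoid(F(x_i))
    return p - y, p * (1.0 - p)          # g = p - y, h = p(1 - p)

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.5, -0.2, 2.0])
print(logloss_grad_hess(y, F))
```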

3.9 The Full XGBoost Tree-Building Algorithm

At each boosting round \( m \):

  1. Compute \( g_i \) and \( h_i \) for all samples using current predictions \( F_{m-1}(x_i) \)
  2. Start with a single leaf containing all samples
  3. For each leaf, try every candidate split \( (k, s) \), where \( k \) is a feature and \( s \) a threshold:
    • Compute \( G_L, H_L, G_R, H_R \)
    • Compute Gain using the formula above
  4. Apply the best split if Gain > 0
  5. Repeat steps 3–4 recursively for the child nodes (up to max_depth)
  6. Compute optimal weights \( w_j^* = -G_j / (H_j + \lambda) \) for all leaves
  7. Update the ensemble: \( F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) \), where \( \nu \) is the learning rate (shrinkage)
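
Steps 3–4 are where most of the work happens. Below is a deliberately simplified sketch of the exhaustive split scan for a single node and a single feature; real implementations use histogram binning, sparsity handling, and parallel column blocks, and all names here are illustrative:

```python
import numpy as np

def best_split_one_feature(x, g, h, lambda_=1.0, gamma_=0.0):
    """Scan all thresholds of one feature; return (best_gain, best_threshold).

    x, g, h are the feature values, gradients, and Hessians of the samples
    currently in the node. This is an illustrative sketch, not XGBoost's
    actual histogram-based implementation.
    """
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(Gs, Hs):
        return Gs ** 2 / (Hs + lambda_)

    best_gain, best_thr = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:
            continue                           # cannot split between equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G, H)) - gamma_
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr

# Toy data: gradients flip sign around x = 3, so the scan should split there.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
g = np.array([0.4, 0.3, 0.5, -0.6, -0.7])
h = np.array([0.24, 0.21, 0.25, 0.24, 0.21])
print(best_split_one_feature(x, g, h))
```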

3.10 Why is the Hessian Important?

Consider two leaves with the same gradient sum \( G_j = 10 \), and take \( \lambda = 1 \):

  • Leaf A: \( H_j = 100 \) (many samples, high curvature)
  • Leaf B: \( H_j = 1 \) (few samples, low curvature)

Without the Hessian (treating \( H_j \) as \( 0 \)): \( w_A = w_B = -10/(0 + 1) = -10 \) (the same weight!)

With Hessian: \( w_A = -10/101 \approx -0.099 \) and \( w_B = -10/2 = -5 \)

Leaf A has many samples with conflicting predictions — the Hessian correctly gives it a smaller, more cautious update. Leaf B has a confident signal — it gets a larger update.
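
The arithmetic above, in a few lines (same \( G_j = 10 \) and \( \lambda = 1 \) as in the example):

```python
G, lambda_ = 10.0, 1.0

for name, H in [("Leaf A", 100.0), ("Leaf B", 1.0)]:
    w_grad_only = -G / (0.0 + lambda_)     # ignoring curvature: -10 for both leaves
    w_newton = -G / (H + lambda_)          # curvature-aware (Newton) step
    print(f"{name}: gradient-only {w_grad_only:.3f}, Newton {w_newton:.3f}")
```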

This is exactly why second-order methods are more sample-efficient and faster to converge.
📘 Next Chapter: LightGBM — Speed, Scale, and Innovation

References & Further Reading

