6.1 The Complete Gradient Boosting Framework (Unified View)

We now have all the pieces. Here is the complete algorithm as implemented in LightGBM; a code sketch mapping its symbols to LightGBM parameter names follows the pseudocode:

Input:  
- Training data D = {(x_i, y_i)} for i=1..n  
- Loss function L(y, ŷ) with gradient g_i and Hessian h_i  
- Hyperparameters: M, ν (learning rate), J (max leaves), λ, γ, B (bins)

Initialization:
F₀(x) = argmin_c Σ L(y_i, c)
(e.g., F₀ = ȳ for MSE, F₀ = log( p̄/(1-p̄) ) for log-loss)

For m = 1 to M:

  ① Compute statistics for all samples:
     g_i = ∂L/∂ŷ_i |_{ŷ_i = F_{m-1}(x_i)},   h_i = ∂²L/∂ŷ_i² |_{ŷ_i = F_{m-1}(x_i)}

  ② [GOSS] Sample selection:
     - Keep set A: top a·n samples by |g_i|
     - Randomly sample set B: b·n samples drawn from the rest
     - Amplify B gradients by (1-a)/b

  ③ [EFB] Bundle features (in practice computed once before the boosting loop and reused):
     - Find mutually exclusive feature groups
     - Merge into bundles with bin offsets

  ④ Build histograms for each feature/bundle k:
     G_{k,b} = Σ_{i∈A∪B, x̃_i^(k)=b} g̃_i,   H_{k,b} = Σ_{i∈A∪B, x̃_i^(k)=b} h̃_i

  ⑤ Leaf-wise tree construction with at most J leaves:
     Repeat until J leaves:
       - For each leaf ℓ, each feature k, scan bins to find:
           Gain = ½[ G_L²/(H_L+λ) + G_R²/(H_R+λ) - G_ℓ²/(H_ℓ+λ) ] - γ
       - Apply global best split
        - Compute the larger sibling's histogram by subtraction (parent − smaller child)

  ⑥ Compute leaf weights:
     w_j* = -G_j / (H_j + λ),   j = 1,...,J

  ⑦ Update model:
     F_m(x) = F_{m-1}(x) + ν · Σ w_j* · 1[x ∈ R_j]

Output: F_M(x) (for regression) or σ(F_M(x)) (for binary classification)
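
As a bridge from symbols to software, here is a minimal training sketch. It assumes the standard `lightgbm` Python package and synthetic data; the comments map each parameter back to the symbols above. Note that the GOSS switch is spelled `boosting='goss'` in older releases and `data_sample_strategy='goss'` in newer ones, so adjust for your version.

```python
import numpy as np
import lightgbm as lgb

# Synthetic regression data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=5000)

train = lgb.Dataset(X, label=y)

params = {
    "objective": "regression",      # L(y, ŷ) = ½(y − ŷ)²  →  g = ŷ − y, h = 1
    "learning_rate": 0.1,           # ν
    "num_leaves": 31,               # J (max leaves, leaf-wise growth)
    "lambda_l2": 1.0,               # λ in w* = −G/(H + λ)
    "min_gain_to_split": 0.0,       # γ
    "max_bin": 255,                 # B histogram bins
    "boosting": "goss",             # GOSS (newer versions: data_sample_strategy="goss")
    "top_rate": 0.2,                # a: keep top 20% of samples by |g|
    "other_rate": 0.1,              # b: sample 10% of the rest, amplified by (1−a)/b
    "enable_bundle": True,          # EFB on (the default)
}

booster = lgb.train(params, train, num_boost_round=200)   # M = 200
print(booster.predict(X[:5]))
```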

6.2 Loss Functions — Derivations in Detail

6.2.1 Mean Squared Error (Regression)

\[ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 \]
\[ g_i = \frac{\partial L}{\partial \hat{y}_i} = \hat{y}_i - y_i, \quad h_i = \frac{\partial^2 L}{\partial \hat{y}_i^2} = 1 \]

Optimal leaf weight: \( w_j^* = -\frac{\sum_{i \in R_j}(\hat{y}_i - y_i)}{\sum_{i \in R_j} 1 + \lambda} = \frac{\sum_{i \in R_j}(y_i - \hat{y}_i)}{n_j + \lambda} \)

This is the mean residual (with L2 shrinkage) — the same intuition as in basic gradient boosting.
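
A quick numerical sanity check of these derivatives (a standalone numpy sketch, not LightGBM code): the central finite difference of ½(y − ŷ)² should match g = ŷ − y, and the second difference should be 1.

```python
import numpy as np

def mse_loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

y, y_hat, eps = 3.0, 2.2, 1e-5

# Analytic first and second derivatives w.r.t. the prediction ŷ.
g_analytic = y_hat - y            # = -0.8
h_analytic = 1.0

# Central finite differences.
g_numeric = (mse_loss(y, y_hat + eps) - mse_loss(y, y_hat - eps)) / (2 * eps)
h_numeric = (mse_loss(y, y_hat + eps) - 2 * mse_loss(y, y_hat)
             + mse_loss(y, y_hat - eps)) / eps**2

print(g_analytic, g_numeric)   # ≈ -0.8, -0.8
print(h_analytic, h_numeric)   # ≈ 1.0, 1.0
```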

6.2.2 Binary Cross-Entropy (Log-Loss)

\[ L(y, \hat{y}) = -y \log \sigma(\hat{y}) - (1-y)\log(1 - \sigma(\hat{y})), \quad \sigma(z) = \frac{1}{1 + e^{-z}} \]
\[ g_i = p_i - y_i, \quad p_i = \sigma(F_{m-1}(x_i)) \]
\[ h_i = p_i(1 - p_i) \]

The Hessian is the variance of a Bernoulli(p_i) random variable — maximum at p_i = 0.5 (maximum uncertainty) and minimum near 0 or 1 (confident predictions).
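
These two lines of math are enough to reproduce the built-in binary objective as a custom one. Below is a sketch assuming the custom objective receives raw scores F(x) (the default behaviour for custom objectives in the Python API); the exact hook for passing it differs across LightGBM versions.

```python
import numpy as np

def binary_logloss_objective(preds, train_data):
    """Custom objective: preds are raw scores F(x), not probabilities."""
    y = train_data.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # σ(F(x))
    grad = p - y                        # g_i = p_i − y_i
    hess = p * (1.0 - p)                # h_i = p_i (1 − p_i)
    return grad, hess

# Usage depends on the LightGBM version: older releases accept
#   lgb.train(params, train_set, fobj=binary_logloss_objective)
# while newer ones take params["objective"] = binary_logloss_objective.
```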

6.2.3 Softmax for Multi-class (K classes)

For class k and sample i with true class y_i:

\[ g_{i,k} = p_{i,k} - \mathbf{1}[y_i = k], \quad p_{i,k} = \frac{e^{F_k(x_i)}}{\sum_{k'} e^{F_{k'}(x_i)}} \]
\[ h_{i,k} = p_{i,k}(1 - p_{i,k}) \]

One tree per class is built at each round (K trees per boosting iteration).
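
A small numpy sketch of these per-class statistics (illustrative only; LightGBM computes them internally when `objective='multiclass'`, and like XGBoost it uses the diagonal of the Hessian):

```python
import numpy as np

def softmax_grad_hess(F, y, num_class):
    """F: (n, K) raw scores; y: (n,) integer class labels."""
    F = F - F.max(axis=1, keepdims=True)                     # numerical stability
    p = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)     # p_{i,k}
    onehot = np.eye(num_class)[y]                            # 1[y_i = k]
    grad = p - onehot                                        # g_{i,k} = p_{i,k} − 1[y_i = k]
    hess = p * (1.0 - p)                                     # diagonal Hessian h_{i,k}
    return grad, hess

F = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
grad, hess = softmax_grad_hess(F, np.array([0, 2]), num_class=3)
print(grad)   # each row of the gradient sums to 0
```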

6.3 Complexity Analysis

| Operation | Naive (XGBoost) | LightGBM |
| --- | --- | --- |
| Histogram build | \(O(n \cdot d)\) per tree | \(O(n' \cdot d')\), where \(n' = (a+b)\,n\) and \(d' \ll d\) |
| Split finding | \(O(n \cdot d)\) per node | \(O(B \cdot d')\) per node |
| Per-tree total | \(O(n \cdot d \cdot D \cdot 2^D)\) | \(O(B \cdot d' \cdot J)\) |
| Memory | \(O(n \cdot d)\) | \(O(B \cdot d')\) |

With \(B=255\), \(n=10^6\), \(d'=0.1d\) (EFB), \(n'=0.3n\) (GOSS):

\[ \text{theoretical speedup} \approx \frac{n \cdot d}{B \cdot d' \cdot (n'/n)} = \frac{10^6 \cdot d}{255 \cdot 0.1d \cdot 0.3} \approx 130{,}000\times \]

This back-of-the-envelope ratio mixes the per-node saving of the histogram scan (n rows → B bins, d features → d' bundles) with the GOSS sampling factor, and it ignores the cost of building the histograms themselves. Once that cost is included, along with overhead, cache effects, and sequential tree building, the practical speedup is typically 10–50×.
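
Plugging in the numbers (a tiny back-of-the-envelope script; d = 100 is an arbitrary placeholder since d cancels in the first ratio, and the second ratio, which also charges for histogram construction over the sampled data, lands in the range seen in practice):

```python
n, d = 1_000_000, 100        # samples, raw features (d is a placeholder; it cancels below)
B = 255                       # histogram bins
d_eff = 0.1 * d               # EFB: ~10× fewer feature bundles
n_eff = 0.3 * n               # GOSS: a + b ≈ 0.3 of the data

theoretical = (n * d) / (B * d_eff * (n_eff / n))   # per-node scan ratio from the text
realistic   = (n * d) / (n_eff * d_eff)             # also charging for histogram construction
print(f"{theoretical:,.0f}x   {realistic:,.0f}x")    # ≈ 130,719x   ≈ 33x
```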

6.4 DART — Dropouts Meet Multiple Additive Regression Trees

LightGBM also supports DART (boosting='dart'), which randomly drops trees during training (analogous to dropout in neural networks):

  1. Randomly drop a fraction ρ of existing trees from the ensemble
  2. Train new tree h_m to fit residuals of the reduced ensemble
  3. Add h_m back with normalized weights to maintain scale

Effect: DART can substantially reduce overfitting and often improves generalization on noisy datasets, at the cost of slower training, since the remaining ensemble's predictions must be recomputed after every drop-and-renormalize step. A minimal configuration sketch follows.
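
Here is a minimal DART configuration sketch (the parameter names are the standard LightGBM ones; `drop_rate` plays the role of ρ above):

```python
import lightgbm as lgb

params = {
    "objective": "binary",
    "boosting": "dart",      # Dropouts meet Multiple Additive Regression Trees
    "drop_rate": 0.1,        # ρ: fraction of existing trees dropped each iteration
    "skip_drop": 0.5,        # probability of skipping the dropout step entirely
    "max_drop": 50,          # upper bound on the number of dropped trees
    "learning_rate": 0.05,
    "num_leaves": 31,
}
# booster = lgb.train(params, train_set, num_boost_round=500)   # train_set: a lgb.Dataset
```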

6.5 The Full Picture — Conceptual Hierarchy

GRADIENT BOOSTING FAMILY
│
├── Base Concept: Functional Gradient Descent
│   └── Build ensemble F_M(x) = Σ ν·h_m(x) to minimize Σ L(y_i, F(x_i))
│
├── First-Order Methods (fit gradients)
│   └── Classic Gradient Boosting (Friedman 2001)
│       └── Pseudo-residuals r_i = -∂L/∂F
│
├── Second-Order Methods (fit gradients + Hessians)
│   ├── XGBoost (Chen 2016): Exact splits, level-wise
│   └── LightGBM (Ke 2017): Histogram splits, leaf-wise
│       ├── GOSS: Sample by gradient magnitude
│       ├── EFB: Bundle exclusive features
│       ├── Histograms: Discretize → B bins
│       └── Leaf-wise growth: Always best split globally
│
└── Variants
    ├── DART: Dropout regularization
    └── Random Forest mode: Independent trees

6.6 When to Use LightGBM vs Alternatives

| Situation | Recommendation |
| --- | --- |
| Tabular data, any size | LightGBM — first choice |
| Very large data (>1M rows) | LightGBM — GOSS/EFB shine here |
| Small dataset (<1,000 rows) | Consider XGBoost (exact splits) or Ridge/Lasso |
| Image / audio / text | Deep learning (CNNs, Transformers) |
| Need full interpretability | Linear models or shallow single trees |
| Time series (temporal) | LightGBM with careful feature engineering, or Prophet/ARIMA |
| Ranking problems | LightGBM LambdaRank (see the sketch below) |
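
For the ranking row above, a minimal LambdaRank sketch on toy data (the `group` array, giving the number of rows per query, is the part most often missed):

```python
import numpy as np
import lightgbm as lgb

# Toy ranking data: 3 queries with 4, 3 and 5 candidate documents each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))
relevance = rng.integers(0, 4, size=12)        # graded relevance labels 0–3
group = [4, 3, 5]                              # rows per query, in order

train = lgb.Dataset(X, label=relevance, group=group)
params = {"objective": "lambdarank", "metric": "ndcg", "num_leaves": 15}
ranker = lgb.train(params, train, num_boost_round=50)
```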

6.7 Key Equations — Quick Reference Card

📐 LightGBM / XGBoost Quick Reference

Model:

\(F_M(x) = F_0(x) + \sum_{m=1}^M \nu \cdot h_m(x)\)

Gradient (1st order):

\(g_i = \partial L(y_i, \hat{y}_i) / \partial \hat{y}_i\)

Hessian (2nd order):

\(h_i = \partial^2 L(y_i, \hat{y}_i) / \partial \hat{y}_i^2\)

Leaf sums:

\(G_j = \sum_{i \in I_j} g_i,\; H_j = \sum_{i \in I_j} h_i\)

Optimal leaf weight:

\(w_j^* = -G_j / (H_j + \lambda)\)

Split gain:

\(\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma\)

Tree score:

\(-\frac{1}{2}\sum_j \frac{G_j^2}{H_j + \lambda} + \gamma J\)

GOSS amplification (weight of set B):

\(\frac{1-a}{b}\)
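
The reference card condenses into a few lines of code. Below is a small illustrative sketch (plain Python, not LightGBM internals) that evaluates the leaf weight and split gain from gradient/Hessian sums:

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + λ)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a leaf with totals G = G_L + G_R, H = H_L + H_R."""
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma

# The split is worth taking only if the gain is positive.
print(leaf_weight(G=-12.0, H=8.0, lam=1.0))                        # ≈ 1.33
print(split_gain(-10.0, 5.0, 6.0, 3.0, lam=1.0, gamma=0.0))        # ≈ 11.94
```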

6.8 Recommended Learning Path & Further Reading

  • ✅ Chapter 1 — Gradient Descent → Gradient Boosting conceptual bridge
  • ✅ Chapter 2 — Decision trees as building blocks
  • ✅ Chapter 3 — XGBoost: second-order framework, gain formula derivation
  • ✅ Chapter 4 — LightGBM innovations: histograms, leaf-wise, GOSS, EFB
  • ✅ Chapter 5 — Practical guide: parameters, tuning, code
  • ✅ Chapter 6 — Unified view and mathematical summary

Further Reading

  • Original papers:
    • Friedman (2001) — Greedy Function Approximation: A Gradient Boosting Machine
    • Chen & Guestrin (2016) — XGBoost: A Scalable Tree Boosting System
    • Ke et al. (2017) — LightGBM: A Highly Efficient Gradient Boosting Decision Tree
  • Practice:
    • My GitHub: Machine Learning repository
🏁 End of Tutorial — You now know LightGBM from first principles to production use.

References

  • Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189-1232.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30 (NeurIPS 2017).
