Chapter 6: Contents
- 6.1 The Complete Gradient Boosting Framework (Unified View)
- 6.2 Loss Functions — Derivations in Detail
- 6.3 Complexity Analysis
- 6.4 DART — Dropouts Meet Multiple Additive Regression Trees
- 6.5 The Full Picture — Conceptual Hierarchy
- 6.6 When to Use LightGBM vs Alternatives
- 6.7 Key Equations — Quick Reference Card
- 6.8 Recommended Learning Path & Further Reading
6.1 The Complete Gradient Boosting Framework (Unified View)
We now have all the pieces. Here is the complete algorithm as implemented in LightGBM:
Input:
- Training data D = {(x_i, y_i)} for i=1..n
- Loss function L(y, ŷ) with gradient g_i and Hessian h_i
- Hyperparameters: M (boosting rounds), ν (learning rate), J (max leaves), λ (L2 leaf regularization), γ (per-split complexity penalty), B (histogram bins)
Initialization:
F₀(x) = argmin_c Σ L(y_i, c)
(e.g., F₀ = ȳ for MSE, F₀ = log( p̄/(1-p̄) ) for log-loss)
For m = 1 to M:
① Compute statistics for all samples:
g_i = ∂L/∂ŷ_i |_{ŷ_i = F_{m-1}(x_i)}, h_i = ∂²L/∂ŷ_i² |_{ŷ_i = F_{m-1}(x_i)}
② [GOSS] Sample selection:
- Keep set A: top a·n samples by |g_i|
- Randomly sample set B: b·n samples drawn from the remaining (1-a)·n
- Amplify B gradients by (1-a)/b
③ [EFB] Bundle features (in practice the bundles are computed once, before the first iteration):
- Find groups of mutually exclusive features (features that are rarely nonzero simultaneously)
- Merge each group into a single bundle, offsetting bin values so the original features stay distinguishable
④ Build histograms for each feature/bundle k:
G_{k,b} = Σ_{i∈A∪B, x̃_i^(k)=b} g̃_i, H_{k,b} = Σ_{i∈A∪B, x̃_i^(k)=b} h̃_i
⑤ Leaf-wise tree construction with at most J leaves:
Repeat until J leaves:
- For each leaf ℓ, each feature k, scan bins to find:
Gain = ½[ G_L²/(H_L+λ) + G_R²/(H_R+λ) - G_ℓ²/(H_ℓ+λ) ] - γ
(with G_ℓ = G_L + G_R and H_ℓ = H_L + H_R, the sums over the parent leaf)
- Apply global best split
- Build the histogram for the smaller child only; obtain the larger sibling's by subtracting it from the parent's (histogram subtraction)
⑥ Compute leaf weights:
w_j* = -G_j / (H_j + λ), j = 1,...,J
⑦ Update model:
F_m(x) = F_{m-1}(x) + ν · Σ w_j* · 1[x ∈ R_j]
Output: F_M(x) (for regression) or σ(F_M(x)) (for binary classification)
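To make the loop concrete, here is a minimal NumPy/scikit-learn sketch of steps ①, ⑤, ⑥ and ⑦ for squared error. GOSS, EFB, histogram binning and the exact gain criterion are omitted, and the split search is delegated to scikit-learn's best-first tree growth, so this illustrates the structure of the algorithm rather than LightGBM's implementation:

```python
# Minimal second-order boosting loop (squared error: g_i = F(x_i) - y_i, h_i = 1).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_ensemble(X, y, M=50, nu=0.1, J=8, lam=1.0):
    y = np.asarray(y, dtype=float)
    F = np.full(len(y), y.mean())                    # F0 = mean target for MSE
    trees, leaf_weights = [], []
    for _ in range(M):
        g = F - y                                    # ① gradients
        h = np.ones_like(y)                          # ① Hessians (all 1 for MSE)
        # ⑤ grow a small tree on the negative gradients to obtain the partition;
        #    max_leaf_nodes makes scikit-learn grow best-first (leaf-wise style)
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, -g)
        leaves = tree.apply(X)                       # leaf region R_j of each sample
        # ⑥ Newton leaf weights: w_j* = -G_j / (H_j + lambda)
        w = {j: -g[leaves == j].sum() / (h[leaves == j].sum() + lam)
             for j in np.unique(leaves)}
        F = F + nu * np.array([w[j] for j in leaves])  # ⑦ shrunken update
        trees.append(tree)
        leaf_weights.append(w)
    return y.mean(), nu, trees, leaf_weights

def predict(model, X):
    f0, nu, trees, leaf_weights = model
    F = np.full(X.shape[0], f0)
    for tree, w in zip(trees, leaf_weights):
        F += nu * np.array([w[j] for j in tree.apply(X)])
    return F
```

Swapping in another loss only changes the two lines that compute g and h; everything from the leaf-weight formula onward is unchanged, which is exactly the point of the unified framework.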
6.2 Loss Functions — Derivations in Detail
6.2.1 Mean Squared Error (Regression)
For \( L = \tfrac{1}{2}(y - \hat{y})^2 \) the per-sample statistics are \( g_i = \hat{y}_i - y_i \) and \( h_i = 1 \), so the optimal leaf weight is \( w_j^* = -\frac{\sum_{i \in R_j}(\hat{y}_i - y_i)}{\sum_{i \in R_j} 1 + \lambda} = \frac{\sum_{i \in R_j}(y_i - \hat{y}_i)}{n_j + \lambda} \)
This is the mean residual with L2 shrinkage, exactly the same intuition as in basic gradient boosting.
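A quick numeric sanity check with illustrative numbers: for a leaf containing two samples with residuals \( y_i - \hat{y}_i \) of 2 and 4, and \( \lambda = 1 \),
\[ w_j^* = \frac{2 + 4}{2 + 1} = 2, \]
slightly below the unregularized mean residual of 3, because the L2 term shrinks the step.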
6.2.2 Binary Cross-Entropy (Log-Loss)
With \( p_i = \sigma(F_{m-1}(x_i)) \), the gradient is \( g_i = p_i - y_i \) and the Hessian is \( h_i = p_i(1 - p_i) \). The Hessian is the variance of a Bernoulli(\(p_i\)) random variable: largest at \( p_i = 0.5 \) (maximum uncertainty) and close to zero near 0 or 1 (confident predictions).
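Since this section promises derivations, here is the short chain-rule computation behind those two formulas, using \( L_i = -[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,] \) and \( \frac{\partial p_i}{\partial F} = p_i(1 - p_i) \):
\[
g_i = \frac{\partial L_i}{\partial F}
    = \frac{p_i - y_i}{p_i(1 - p_i)} \cdot p_i(1 - p_i)
    = p_i - y_i,
\qquad
h_i = \frac{\partial g_i}{\partial F} = \frac{\partial p_i}{\partial F} = p_i(1 - p_i).
\]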
6.2.3 Softmax for Multi-class (K classes)
For class \(k\) and sample \(i\) with true class \(y_i\), writing \( p_i^{(k)} = \frac{\exp(F^{(k)}(x_i))}{\sum_{c=1}^{K} \exp(F^{(c)}(x_i))} \), the per-class statistics are \( g_i^{(k)} = p_i^{(k)} - \mathbb{1}[y_i = k] \) and \( h_i^{(k)} = p_i^{(k)}(1 - p_i^{(k)}) \).
One tree per class is built at each round (K trees per boosting iteration).
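A quick way to see the K-trees-per-round structure using the LightGBM Python package (synthetic data; Booster.num_trees() counts weak sub-models, i.e. rounds × classes, but verify the exact call against your installed version):

```python
# Illustrative check that multiclass boosting builds K trees per round.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 3, size=500)        # K = 3 classes (synthetic labels)

params = {"objective": "multiclass", "num_class": 3, "verbosity": -1}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)

print(booster.num_trees())              # expected: 10 rounds x 3 classes = 30
```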
6.3 Complexity Analysis
| Operation | XGBoost (exact, pre-sorted) | LightGBM (histogram) |
|---|---|---|
| Per-tree data scan / histogram build | \(O(n \cdot d)\) | \(O(n' \cdot d')\), where \(n' = (a+b)\,n\) (GOSS) and \(d' \ll d\) (EFB) |
| Split finding | \(O(n \cdot d)\) per node | \(O(B \cdot d')\) per node |
| Per-tree total | \(O(n \cdot d \cdot D)\) for a level-wise tree of depth \(D\) | \(O(n' \cdot d' + B \cdot d' \cdot J)\) for \(J\) leaves |
| Memory | \(O(n \cdot d)\) | \(O(B \cdot d')\) |
With \(B=255\), \(n=10^6\), \(d' \approx 0.1d\) (EFB), and \(n' \approx 0.3n\) (GOSS), histogram construction costs roughly \(0.3 \times 0.1 = 3\%\) of the full-data scan, about a 33× reduction relative to using every sample and feature. End to end, the practical speedup over exact methods is typically 10–50×, once overhead, cache effects, and the sequential nature of boosting are accounted for.
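Back-of-envelope check of that reduction factor, with illustrative GOSS fractions a = 0.2 and b = 0.1 (so n' = 0.3n) and an EFB compression of d' = 0.1d:

```python
# Illustrative reduction in histogram-construction work from GOSS + EFB.
a, b = 0.2, 0.1          # GOSS fractions (illustrative): n' = (a + b) * n
efb_ratio = 0.1          # EFB compression (illustrative): d' = 0.1 * d

work_ratio = (a + b) * efb_ratio   # (n' * d') / (n * d) for histogram building
print(f"histogram work: {work_ratio:.0%} of the full-data cost "
      f"(about {1 / work_ratio:.0f}x less)")
```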
6.4 DART — Dropouts Meet Multiple Additive Regression Trees
LightGBM also supports DART (boosting='dart'), which randomly drops trees during training, analogous to dropout in neural networks (a configuration sketch follows the list):
- Randomly drop a fraction ρ of existing trees from the ensemble
- Train new tree h_m to fit residuals of the reduced ensemble
- Add h_m back with normalized weights to maintain scale
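A minimal configuration sketch using the LightGBM Python API; the synthetic data and parameter values are illustrative, not tuned recommendations (drop_rate and skip_drop are the standard DART knobs):

```python
# Illustrative DART setup: values below are examples, not recommendations.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

params = {
    "objective": "binary",
    "boosting": "dart",    # DART instead of standard GBDT
    "drop_rate": 0.1,      # fraction rho of existing trees dropped each iteration
    "skip_drop": 0.5,      # probability of skipping the dropout step entirely
    "num_leaves": 31,
    "learning_rate": 0.1,
    "verbosity": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```

Note that the LightGBM documentation cautions that early stopping is not available in DART mode, so the number of boosting rounds has to be budgeted directly.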
6.5 The Full Picture — Conceptual Hierarchy
GRADIENT BOOSTING FAMILY
│
├── Base Concept: Functional Gradient Descent
│ └── Build ensemble F_M(x) = Σ ν·h_m(x) to minimize Σ L(y_i, F(x_i))
│
├── First-Order Methods (fit gradients)
│ └── Classic Gradient Boosting (Friedman 2001)
│ └── Pseudo-residuals r_i = -∂L/∂F
│
├── Second-Order Methods (fit gradients + Hessians)
│ ├── XGBoost (Chen 2016): Exact splits, level-wise
│ └── LightGBM (Ke 2017): Histogram splits, leaf-wise
│ ├── GOSS: Sample by gradient magnitude
│ ├── EFB: Bundle exclusive features
│ ├── Histograms: Discretize → B bins
│ └── Leaf-wise growth: Always best split globally
│
└── Variants
├── DART: Dropout regularization
└── Random Forest mode: Independent trees
6.6 When to Use LightGBM vs Alternatives
| Situation | Recommendation |
|---|---|
| Tabular data, any size | LightGBM — first choice |
| Very large data (>1M rows) | LightGBM — GOSS/EFB shine here |
| Small dataset (<1000 rows) | Consider XGBoost (exact splits) or Ridge/Lasso |
| Image / audio / text | Deep learning (CNNs, Transformers) |
| Need full interpretability | Linear models or shallow single trees |
| Time series (temporal) | LightGBM with careful feature engineering, or Prophet/ARIMA |
| Ranking problems | LightGBM LambdaRank |
6.7 Key Equations — Quick Reference Card
📐 LightGBM / XGBoost Quick Reference
Model: \( F_M(x) = F_0(x) + \sum_{m=1}^{M} \nu \sum_{j=1}^{J} w_j^{(m)} \, \mathbb{1}[x \in R_j^{(m)}] \)
Gradient (1st order): \( g_i = \left.\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}\right|_{\hat{y}_i = F_{m-1}(x_i)} \)
Hessian (2nd order): \( h_i = \left.\frac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}\right|_{\hat{y}_i = F_{m-1}(x_i)} \)
Leaf sums: \( G_j = \sum_{i \in R_j} g_i, \quad H_j = \sum_{i \in R_j} h_i \)
Optimal leaf weight: \( w_j^* = -\frac{G_j}{H_j + \lambda} \)
Split gain: \( \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma \)
Tree score: \( -\frac{1}{2}\sum_j \frac{G_j^2}{H_j + \lambda} + \gamma J \)
GOSS amplification (weight of set B): \( \frac{1-a}{b} \)
6.8 Recommended Learning Path & Further Reading
- ✅ Chapter 1 — Gradient Descent → Gradient Boosting conceptual bridge
- ✅ Chapter 2 — Decision trees as building blocks
- ✅ Chapter 3 — XGBoost: second-order framework, gain formula derivation
- ✅ Chapter 4 — LightGBM innovations: histograms, leaf-wise, GOSS, EFB
- ✅ Chapter 5 — Practical guide: parameters, tuning, code
- ✅ Chapter 6 — Unified view and mathematical summary
Further Reading
- Original papers:
- Friedman (2001) — Greedy Function Approximation: A Gradient Boosting Machine
- Chen & Guestrin (2016) — XGBoost: A Scalable Tree Boosting System
- Ke et al. (2017) — LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- Practice:
- Kaggle competitions (LightGBM dominates tabular competitions)
- LightGBM official docs: https://lightgbm.readthedocs.io
- My GitHub: Machine Learning repository
References
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232.
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30 (NeurIPS 2017).