4.1 Why LightGBM?
XGBoost was a breakthrough, but it has a critical bottleneck: finding the best split.
For each node, XGBoost must:
- For each feature \( k \) (of \( d \) features)
- Sort all \( n \) samples by feature value
- Scan through all possible thresholds
Total cost per tree: \( O(n \cdot d \cdot D) \) where \( D \) = depth
For \( n = 10^6 \) samples and \( d = 1000 \) features, this is extremely slow.
LightGBM (Ke et al., Microsoft Research, 2017) introduced four major innovations:
| Innovation | Problem Solved |
|---|---|
| GOSS (Gradient-based One-Side Sampling) | Reduce number of samples |
| EFB (Exclusive Feature Bundling) | Reduce number of features |
| Histogram-based splitting | Faster split finding |
| Leaf-wise growth | Better accuracy per tree |
4.2 Innovation 1: Histogram-Based Splitting
The Problem with Exact Splits
XGBoost's exact split algorithm sorts all values for each feature at each node. For large \( n \), this is \( O(n \log n) \) per feature per split.
The Histogram Idea
Instead of examining every unique value, LightGBM discretizes continuous features into \( B \) bins (typically \( B = 255 \)):
For each bin \( b \) and feature \( k \), we accumulate the gradient and hessian sums:
\[ G_{k,b} = \sum_{i \,:\, \text{bin}_k(x_i) = b} g_i, \qquad H_{k,b} = \sum_{i \,:\, \text{bin}_k(x_i) = b} h_i \]
Building the histogram costs \( O(n \cdot d) \).
Finding the Best Split from a Histogram
We scan \( B \) thresholds (bins) instead of \( n \) thresholds. For a candidate split after bin \( b \):
\[ G_L = \sum_{b' \le b} G_{k,b'}, \quad H_L = \sum_{b' \le b} H_{k,b'}, \quad G_R = G - G_L, \quad H_R = H - H_L \]
where \( G = \sum_{b'} G_{k,b'} \) and \( H = \sum_{b'} H_{k,b'} \) are the totals for the node.
Using the XGBoost gain formula:
\[ \text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \]
Split-finding cost: \( O(B \cdot d) \) per node (instead of \( O(n \cdot d) \))
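To make this concrete, here is a minimal NumPy sketch of histogram-based split finding for a single feature. The quantile binning, the helper name `best_split_histogram`, and the omission of the \( \gamma \) term are simplifications for illustration, not LightGBM's internal code:

```python
import numpy as np

def best_split_histogram(x, g, h, n_bins=255, lam=1.0):
    """Sketch: histogram-based split search for one feature.

    x: feature values; g, h: per-sample gradients and hessians
    (as in the XGBoost gain formula). Returns (best_gain, threshold_bin).
    """
    # 1. Discretize the feature into at most n_bins bins (quantile edges).
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    bins = np.searchsorted(edges, x)            # bin index per sample

    # 2. Accumulate gradient/hessian sums per bin: O(n).
    n_used = len(edges) + 1
    G = np.bincount(bins, weights=g, minlength=n_used)
    H = np.bincount(bins, weights=h, minlength=n_used)

    # 3. Scan bin boundaries as candidate thresholds: O(B).
    G_tot, H_tot = G.sum(), H.sum()
    G_L = np.cumsum(G)[:-1]                     # left stats if we split after bin b
    H_L = np.cumsum(H)[:-1]
    G_R, H_R = G_tot - G_L, H_tot - H_L
    gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G_tot**2 / (H_tot + lam))   # gamma term omitted for brevity
    b = int(np.argmax(gain))
    return gain[b], b
```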
Histogram Subtraction Trick
For a parent node histogram \( H_P \) and left child histogram \( H_L \), the right child histogram is:
\[ H_R = H_P - H_L \quad \text{(bin-wise, for both the gradient and hessian sums)} \]
We only need to build the histogram for the smaller child — the larger one is derived by subtraction. This halves the work when building sibling nodes.
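In code, the subtraction trick is a single element-wise operation on the per-bin sums (the arrays below are illustrative, not real data):

```python
import numpy as np

# Per-bin gradient sums for the parent node and its smaller (left) child.
G_parent = np.array([4.0, 2.5, 1.0, 0.5])
G_left   = np.array([3.0, 1.0, 0.0, 0.5])

# The sibling's histogram is obtained for free by subtraction.
G_right = G_parent - G_left        # -> [1.0, 1.5, 1.0, 0.0]
```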
Memory Efficiency
Histograms store \( B \cdot d \) bins (each holding a gradient sum and a hessian sum) instead of \( n \cdot d \) raw values. For \( B=255 \), \( d=1000 \), \( n=10^6 \):
- Raw data: \( 10^6 \times 1000 \times 4 \) bytes = 4 GB
- Histogram: \( 255 \times 1000 \times 2 \times 4 \) bytes = 2 MB
4.3 Innovation 2: Leaf-Wise Tree Growth
Level-Wise Growth (XGBoost style)
Grows the tree level by level, splitting every leaf at the current depth before any leaf at the next depth:
Depth 1:           [Root]
                  /      \
Depth 2:       [L1]      [R1]
               /  \       /  \
Depth 3:   [LL]  [LR]  [RL]  [RR]
All leaves at depth 2 are split before moving to depth 3. This creates balanced trees.
Leaf-Wise Growth (LightGBM default)
Always split the leaf with the highest loss reduction (best Gain), regardless of depth:
Round 1: Split the root into two leaves
Round 2: Split whichever of the two leaves has the higher gain
Round 3: Split whichever leaf now has the highest gain
...
This produces unbalanced trees that grow deeper on one side.
Why Leaf-Wise is Better
Given a fixed number of leaves \( J \), leaf-wise growth achieves a lower training loss because it always makes the single best split available — it never wastes a split on a low-gain node just to maintain balance.
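Conceptually, leaf-wise growth is a greedy loop over a priority queue of candidate leaves. The sketch below assumes hypothetical `best_split_gain` and `split_leaf` callbacks standing in for the histogram machinery above:

```python
import heapq

def grow_leaf_wise(root_leaf, num_leaves, best_split_gain, split_leaf):
    """Greedy leaf-wise growth: always split the leaf with the highest gain.

    best_split_gain(leaf) -> float and split_leaf(leaf) -> (left, right)
    are hypothetical callbacks supplied by the tree learner.
    """
    # Max-heap via negated gains (heapq is a min-heap); the counter breaks ties.
    heap = [(-best_split_gain(root_leaf), 0, root_leaf)]
    counter, leaves = 1, 1
    while heap and leaves < num_leaves:
        neg_gain, _, leaf = heapq.heappop(heap)
        if -neg_gain <= 0:          # no useful split left anywhere
            break
        left, right = split_leaf(leaf)
        leaves += 1                 # one leaf became two
        for child in (left, right):
            heapq.heappush(heap, (-best_split_gain(child), counter, child))
            counter += 1
    return root_leaf
```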
Risk of Leaf-Wise
It can overfit on small datasets because it grows very deep asymmetrically. Control this with num_leaves (LightGBM's primary depth control) and min_data_in_leaf.
num_leaves — the maximum number of leaves in one tree. Setting num_leaves = 2^max_depth gives similar capacity to level-wise at depth max_depth.
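A configuration sketch using LightGBM's native parameter names (the data and parameter values here are illustrative, not recommendations):

```python
import lightgbm as lgb
import numpy as np

# Toy regression data, just to make the snippet runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

params = {
    "objective": "regression",
    "num_leaves": 31,          # primary complexity control for leaf-wise growth
    "max_depth": -1,           # -1 = no explicit depth limit
    "min_data_in_leaf": 20,    # guards against overfitting on tiny leaves
    "learning_rate": 0.1,
}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=100)
```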
4.4 Innovation 3: GOSS — Gradient-based One-Side Sampling
Motivation
For large datasets, even building histograms over all \( n \) samples is expensive. Can we use fewer samples without losing accuracy?
• Samples with large gradients \( |g_i| \) have large errors — they're the most informative.
• Samples with small gradients \( |g_i| \) are already well-predicted — they contribute less to learning.
The GOSS Algorithm
Input: Data \( \{(x_i, y_i, g_i)\} \), top fraction \( a \), random fraction \( b \), amplification factor \( \frac{1-a}{b} \)
- Sort samples by \( |g_i| \) in descending order
- Keep the top \( a \times 100\% \) samples with largest gradients → set \( A \)
- From the remaining \( (1-a) \times 100\% \) of samples, randomly draw a subset of size \( b \times 100\% \) of the full dataset → set \( B \)
- Use \( A \cup B \) for histogram construction
- Amplify the gradients of \( B \) by factor \( \frac{1-a}{b} \) to correct for undersampling
Why Amplification?
Samples in \( B \) represent the small-gradient portion of the data. Set \( B \) contains \( b \cdot n \) samples drawn from the \( (1-a) \cdot n \) small-gradient samples, so each sampled point stands in for \( \frac{(1-a) \cdot n}{b \cdot n} = \frac{1-a}{b} \) points of the original data. In the histogram, contributions from set \( B \) are therefore scaled:
\[ \tilde{G}_{k,\beta} = \sum_{i \in A \,:\, \text{bin}_k(x_i) = \beta} g_i \;+\; \frac{1-a}{b} \sum_{i \in B \,:\, \text{bin}_k(x_i) = \beta} g_i \]
This gives an unbiased estimate of the full-data histogram.
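A minimal NumPy sketch of the sampling and re-weighting step described above (a simplified illustration, not LightGBM's internal implementation):

```python
import numpy as np

def goss_sample(g, a=0.2, b=0.1, rng=None):
    """GOSS sketch: keep the top a*100% of samples by |gradient|, randomly
    sample b*100% of the full dataset from the rest, and up-weight set B.

    Returns (indices, weights) for weighted histogram construction.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(g)
    n_top, n_rand = int(a * n), int(b * n)

    order = np.argsort(-np.abs(g))          # sort by |g_i|, descending
    top_idx = order[:n_top]                 # set A: large-gradient samples
    rest_idx = order[n_top:]                # small-gradient pool
    rand_idx = rng.choice(rest_idx, size=n_rand, replace=False)   # set B

    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - a) / b         # amplification factor for set B
    return idx, weights

# Usage: weighted gradients over the sampled subset.
g = np.random.default_rng(1).normal(size=10_000)
idx, w = goss_sample(g)
g_goss = g[idx] * w                         # feed these into histogram building
```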
GOSS Formal Guarantee
The authors prove a bound on the approximation error of the gain computed from the GOSS sample relative to the full-data gain. As long as \( a + b \) is not too small and the top-\( a \) gradients dominate, GOSS introduces negligible approximation error while using only \( (a + b) \times 100\% \) of the data.
4.5 Innovation 4: EFB — Exclusive Feature Bundling
Motivation
Real-world datasets (especially from one-hot encoding) are often very sparse — most feature values are zero. Features that are "mutually exclusive" (rarely non-zero simultaneously) can be bundled into one feature without information loss.
Definition: Exclusivity
Two features \( f_1 \) and \( f_2 \) are exclusive if they rarely take non-zero values simultaneously:
\[ \frac{1}{n} \left| \{\, i : f_1(x_i) \ne 0 \ \text{and}\ f_2(x_i) \ne 0 \,\} \right| \le \epsilon \]
For truly exclusive features (\( \epsilon = 0 \)), we can losslessly merge them.
The Bundling Process
Step 1: Build a conflict graph
- Nodes = features
- Edge weight between \( f_i \) and \( f_j \) = number of samples where both are non-zero
Step 2: Solve the graph coloring problem
Finding the optimal bundles is equivalent to graph coloring (NP-hard). LightGBM uses a greedy approximation: sort features by degree, assign each feature to an existing bundle if it doesn't exceed conflict threshold, otherwise create a new bundle.
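The sketch below illustrates the greedy idea on a dense feature matrix; ordering features by their non-zero counts and the `max_conflicts` threshold are simplifying assumptions (LightGBM works on sparse bin data and orders by conflict degree):

```python
import numpy as np

def greedy_bundle(X, max_conflicts=0):
    """Group features into bundles so that features in the same bundle
    rarely take non-zero values on the same sample.

    X: (n_samples, n_features) array. Returns a list of bundles,
    each a list of feature indices.
    """
    nonzero = X != 0                                   # (n, d) boolean mask
    # Process features in descending order of non-zero count.
    order = np.argsort(-nonzero.sum(axis=0))

    bundles, bundle_masks = [], []                     # masks: samples already "occupied"
    for f in order:
        placed = False
        for bundle, mask in zip(bundles, bundle_masks):
            conflicts = np.sum(mask & nonzero[:, f])   # both non-zero on same sample
            if conflicts <= max_conflicts:
                bundle.append(int(f))
                mask |= nonzero[:, f]                  # in-place update of this bundle's mask
                placed = True
                break
        if not placed:
            bundles.append([int(f)])
            bundle_masks.append(nonzero[:, f].copy())
    return bundles
```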
Step 3: Merge features within a bundle
For features \( f_1 \in [0, B_1] \) and \( f_2 \in [0, B_2] \) in a bundle:
\[ \text{bundle}(x) = \begin{cases} f_1(x) & \text{if } f_1(x) \ne 0 \\ f_2(x) + B_1 & \text{if } f_2(x) \ne 0 \\ 0 & \text{otherwise} \end{cases} \]
The offset \( B_1 \) ensures the two features occupy different ranges of the merged histogram, so they can still be distinguished when needed. For example, with two exclusive binary features (\( B_1 = B_2 = 1 \)):
• both zero → bundle value 0
• \( f_1 = 1 \) → bundle value 1
• \( f_2 = 1 \) → bundle value \( 1 + 1 = 2 \)
Since they're exclusive (never both non-zero), bundle \( \in \{0, 1, 2\} \) uniquely encodes both features.
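A tiny sketch of the offset merge for two exclusive (integer-valued) features:

```python
import numpy as np

def merge_exclusive(f1, f2, B1):
    """Merge two (nearly) exclusive features into one bundled feature.

    f1 takes values in [0, B1]; f2's non-zero values are shifted by B1
    so both features occupy disjoint ranges of the merged histogram.
    """
    bundled = f1.astype(np.int64)
    nz2 = f2 != 0
    bundled[nz2] = f2[nz2] + B1
    return bundled

# Two exclusive binary features: bundle values are 0, 1, or 2.
f1 = np.array([1, 0, 0, 1, 0])
f2 = np.array([0, 1, 0, 0, 0])
print(merge_exclusive(f1, f2, B1=1))   # [1 2 0 1 0]
```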
EFB Gain
If the original dataset has \( d \) sparse features (only \( s \ll n \) non-zero values per feature), exclusive features can be grouped into \( d_{\text{bundle}} \ll d \) bundles, and the histogram-building cost drops from \( O(n \cdot d) \) to \( O(n \cdot d_{\text{bundle}}) \).
Typical speedup: 8× on sparse data with aggressive bundling.
4.6 Putting It All Together: LightGBM vs XGBoost
| Feature | XGBoost | LightGBM |
|---|---|---|
| Split finding | Exact (all thresholds) | Histogram (B=255 bins) |
| Tree growth | Level-wise | Leaf-wise |
| Sample reduction | — | GOSS |
| Feature reduction | — | EFB |
| Memory | \( O(n \cdot d) \) | \( O(B \cdot d) \) |
| Training speed | 1× | ~10–20× faster |
| Accuracy | Slightly higher on small data | Comparable or better on large data |
| Categorical features | Requires encoding | Native support |
References & Further Reading
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017.
- My GitHub: Machine Learning repository
- LightGBM Official Documentation