4.1 Why LightGBM?

XGBoost was a breakthrough, but it has a critical bottleneck: finding the best split.

For each node, XGBoost must:

  • For each feature \( k \) (of \( d \) features)
  • Sort all \( n \) samples by feature value
  • Scan through all possible thresholds

Total cost per tree: \( O(n \cdot d \cdot D) \) where \( D \) = depth

For \( n = 10^6 \) samples and \( d = 1000 \) features, that is on the order of \( 10^9 \) candidate thresholds to evaluate per level of the tree.

LightGBM (Ke et al., Microsoft Research, 2017) introduced four major innovations:

| Innovation | Problem Solved |
| --- | --- |
| GOSS (Gradient-based One-Side Sampling) | Reduce number of samples |
| EFB (Exclusive Feature Bundling) | Reduce number of features |
| Histogram-based splitting | Faster split finding |
| Leaf-wise growth | Better accuracy per tree |

4.2 Innovation 1: Histogram-Based Splitting

The Problem with Exact Splits

XGBoost's exact split algorithm sorts all values for each feature at each node. For large \( n \), this is \( O(n \log n) \) per feature per split.

The Histogram Idea

Instead of examining every unique value, LightGBM discretizes continuous features into \( B \) bins (typically \( B = 255 \)):

\[ x^{(k)} \rightarrow \tilde{x}^{(k)} = \text{bin}(x^{(k)}) \in \{0, 1, \ldots, B-1\} \]

For each bin \( b \) and feature \( k \), we accumulate:

\[ G_{k,b} = \sum_{i: \tilde{x}_i^{(k)} = b} g_i \quad \text{(sum of gradients in bin } b\text{)} \]

\[ H_{k,b} = \sum_{i: \tilde{x}_i^{(k)} = b} h_i \quad \text{(sum of Hessians in bin } b\text{)} \]

Building the histogram costs \( O(n \cdot d) \).
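
To make the bookkeeping concrete, here is a minimal NumPy sketch of both steps; the function and variable names are illustrative, not LightGBM's internals.

```python
import numpy as np

def bin_features(X, n_bins=255):
    """Map each raw feature to integer bin indices via quantile edges.
    (LightGBM builds its own bin mapper once before training; this is a stand-in.)"""
    n, d = X.shape
    X_binned = np.empty((n, d), dtype=np.uint8)
    for k in range(d):
        edges = np.quantile(X[:, k], np.linspace(0, 1, n_bins + 1)[1:-1])
        X_binned[:, k] = np.searchsorted(edges, X[:, k])  # values in 0..n_bins-1
    return X_binned

def build_histograms(X_binned, grad, hess, n_bins=255):
    """Accumulate the per-bin sums G[k, b] and H[k, b] for one node."""
    d = X_binned.shape[1]
    G = np.zeros((d, n_bins))
    H = np.zeros((d, n_bins))
    for k in range(d):
        G[k] = np.bincount(X_binned[:, k], weights=grad, minlength=n_bins)
        H[k] = np.bincount(X_binned[:, k], weights=hess, minlength=n_bins)
    return G, H
```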

Finding the Best Split from a Histogram

We scan \( B \) thresholds (bins) instead of \( n \) thresholds. For threshold \( b \):

\[ G_L = \sum_{b'=0}^{b} G_{k,b'}, \quad H_L = \sum_{b'=0}^{b} H_{k,b'}, \quad G_R = G - G_L, \quad H_R = H - H_L \]

where \( G = \sum_i g_i \) and \( H = \sum_i h_i \) are the totals over all samples in the node.

Using the XGBoost gain formula:

\[ \text{Gain}(k, b) = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma \]

Split-finding cost: \( O(B \cdot d) \) per node (instead of \( O(n \cdot d) \))

Since \( B \ll n \) (e.g., \( B = 255 \) vs. \( n = 10^6 \)), the split scan is roughly 4000× cheaper per node.
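
Continuing the sketch above (same illustrative names), the best split for one feature is found with a single cumulative pass over its \( B \) bins:

```python
import numpy as np

def best_split_for_feature(G_k, H_k, lam=1.0, gamma=0.0):
    """Scan the B bins of one feature's histogram; return (best_gain, best_bin).

    G_k, H_k: per-bin gradient/Hessian sums for feature k (length B).
    lam, gamma: the regularization terms from the gain formula above.
    """
    G_total, H_total = G_k.sum(), H_k.sum()
    GL, HL = np.cumsum(G_k), np.cumsum(H_k)      # G_L, H_L for every threshold b
    GR, HR = G_total - GL, H_total - HL
    gain = 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - G_total**2 / (H_total + lam)) - gamma
    best = int(np.argmax(gain[:-1]))             # last bin leaves the right child empty
    return gain[best], best
```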

Histogram Subtraction Trick

For a parent node's histogram \( \text{hist}_P \) and the left child's histogram \( \text{hist}_L \), the right child's histogram is:

\[ \text{hist}_R = \text{hist}_P - \text{hist}_L \]

We only need to build the histogram for the smaller child — the larger one is derived by subtraction. This halves the work when building sibling nodes.
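
In the sketch's terms, this is an elementwise subtraction over the \( (d, B) \) histogram arrays (here `left_idx`, `G_parent`, and `H_parent` are assumed to come from the surrounding tree-building code):

```python
# Build the histogram only for the smaller child, then derive its sibling.
G_left, H_left = build_histograms(X_binned[left_idx], grad[left_idx], hess[left_idx])
G_right = G_parent - G_left   # elementwise over the (d, B) arrays
H_right = H_parent - H_left
```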

Memory Efficiency

Per node, the histograms hold \( B \cdot d \) bins (each storing a gradient sum and a Hessian sum) instead of \( n \cdot d \) raw values. For \( B=255 \), \( d=1000 \), \( n=10^6 \):

  • Raw data: \( 10^6 \times 1000 \times 4 \) bytes = 4 GB
  • Histogram: \( 255 \times 1000 \times 2 \times 4 \) bytes = 2 MB

4.3 Innovation 2: Leaf-Wise Tree Growth

Level-Wise Growth (XGBoost style)

Grows the tree level by level, splitting every leaf at the current depth before any leaf at the next depth:

```
Depth 1:            [Root]
                   /      \
Depth 2:        [L1]      [R1]
                /  \      /  \
Depth 3:     [LL]  [LR] [RL]  [RR]
```

All leaves at depth 2 are split before moving to depth 3. This creates balanced trees.

Leaf-Wise Growth (LightGBM default)

Always split the leaf with the highest loss reduction (best Gain), regardless of depth:

Round 1: Split Root → best gain
Round 2: Split whichever child has higher gain
Round 3: Split whichever leaf now has highest gain
...

This creates unbalanced trees that grow deep along the branches with the largest gains.

Why Leaf-Wise is Better

Given a fixed number of leaves \( J \), leaf-wise growth typically reaches a lower training loss than level-wise growth: at every round it makes the single best split available and never spends a split on a low-gain node just to keep the tree balanced. Writing

\[ \mathcal{L}_K = \text{training loss after } K \text{ splits} \]

leaf-wise growth greedily minimizes \( \mathcal{L}_K \) at each step \( K \).
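
A best-first sketch of the leaf-wise loop; `find_best_split` and `apply_split` are hypothetical helpers standing in for the histogram machinery sketched earlier.

```python
import heapq
from itertools import count

def grow_leaf_wise(root_leaf, num_leaves):
    """Repeatedly split whichever leaf currently offers the largest gain."""
    tie = count()  # tie-breaker so heapq never has to compare leaf objects
    heap = [(-find_best_split(root_leaf).gain, next(tie), root_leaf)]
    n_leaves = 1
    while heap and n_leaves < num_leaves:
        neg_gain, _, leaf = heapq.heappop(heap)
        if -neg_gain <= 0:                  # no remaining split reduces the loss
            break
        left, right = apply_split(leaf)     # the chosen leaf becomes an internal node
        for child in (left, right):
            heapq.heappush(heap, (-find_best_split(child).gain, next(tie), child))
        n_leaves += 1                       # one leaf was replaced by two
    return root_leaf
```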

Risk of Leaf-Wise

It can overfit on small datasets because the tree can grow very deep down a single branch. Control this with num_leaves (LightGBM's main complexity control) and min_data_in_leaf.

Key parameter: num_leaves — the maximum number of leaves in one tree. Setting num_leaves = 2^max_depth gives similar capacity to level-wise at depth max_depth.
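
A minimal training sketch with the LightGBM Python API; the toy data and parameter values are illustrative starting points, not tuned recommendations.

```python
import lightgbm as lgb
import numpy as np

# Toy data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

params = {
    "objective": "binary",
    "num_leaves": 31,         # primary complexity control under leaf-wise growth
    "max_depth": -1,          # -1 means no explicit depth limit
    "min_data_in_leaf": 20,   # guards against tiny, overfit leaves
    "learning_rate": 0.1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```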

4.4 Innovation 3: GOSS — Gradient-based One-Side Sampling

Motivation

For large datasets, even building histograms over all \( n \) samples is expensive. Can we use fewer samples without losing accuracy?

Key insight from gradient boosting:
• Samples with large gradients \( |g_i| \) have large errors — they're the most informative.
• Samples with small gradients \( |g_i| \) are already well-predicted — they contribute less to learning.

The GOSS Algorithm

Input: Data \( \{(x_i, y_i, g_i)\} \), top fraction \( a \), random fraction \( b \), amplification factor \( \frac{1-a}{b} \)

  1. Sort samples by \( |g_i| \) in descending order
  2. Keep the top \( a \times 100\% \) samples with largest gradients → set \( A \)
  3. From the remaining \( (1-a) \times 100\% \) of samples, randomly sample \( b \times 100\% \) of the full dataset (i.e., \( b \cdot n \) samples) → set \( B \)
  4. Use \( A \cup B \) for histogram construction
  5. Amplify the gradients of \( B \) by factor \( \frac{1-a}{b} \) to correct for undersampling
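
A minimal NumPy sketch of these five steps (illustrative names; LightGBM's internal implementation differs):

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, seed=0):
    """Return sample indices and per-sample weights following GOSS.

    grad: (n,) per-sample gradients; a and b are fractions of the full dataset.
    """
    rng = np.random.default_rng(seed)
    n = len(grad)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(-np.abs(grad))          # step 1: sort by |g_i|, descending
    top_idx = order[:top_n]                    # step 2: set A, largest gradients
    rand_idx = rng.choice(order[top_n:], size=rand_n, replace=False)  # step 3: set B
    used_idx = np.concatenate([top_idx, rand_idx])   # step 4: A ∪ B
    weights = np.ones(len(used_idx))
    weights[top_n:] = (1 - a) / b              # step 5: amplify set B
    return used_idx, weights
```

The returned weights multiply each sampled point's \( g_i \) (and \( h_i \)) when the histograms are built over \( A \cup B \).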

Why Amplification?

Set \( B \) contains \( b \cdot n \) samples drawn at random from the \( (1-a) \cdot n \) small-gradient samples, so each sampled point stands in for \( \frac{(1-a) \cdot n}{b \cdot n} = \frac{1-a}{b} \) of the original points:

\[ \text{Weight of each sample in } B = \frac{1-a}{b} \]

In the histogram, contributions from set \( B \) become:

\[ G_{k,b}^{GOSS} = \sum_{i \in A, \tilde{x}_i^{(k)}=b} g_i + \frac{1-a}{b}\sum_{i \in B, \tilde{x}_i^{(k)}=b} g_i \]

This gives an unbiased estimate of the full-data histogram.

GOSS Formal Guarantee

The authors prove that the approximation error of the GOSS gain estimate is bounded by a term of the form

\[ \mathcal{E}_{GOSS} \leq \frac{1-a}{b} \cdot O\!\left(\frac{1}{\sqrt{n}}\right) \]

As long as \( a + b \) is not too small and the top-\( a \) gradients dominate, GOSS introduces negligible approximation error while using only \( (a + b) \times 100\% \) of the data.

Typical values: \( a = 0.2 \) (top 20%), \( b = 0.1 \) (random 10%), so we use only 30% of data with nearly full accuracy.
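
In LightGBM itself, GOSS is enabled through parameters rather than written by hand. A minimal sketch reusing `X` and `y` from the earlier example; in the paper's notation, `top_rate` plays the role of \( a \) and `other_rate` the role of \( b \):

```python
import lightgbm as lgb

goss_params = {
    "objective": "binary",
    "boosting_type": "goss",  # recent releases also expose data_sample_strategy="goss"
    "top_rate": 0.2,          # a: keep the top 20% of samples by |gradient|
    "other_rate": 0.1,        # b: keep another 10% of the data at random, amplified by (1-a)/b
    "num_leaves": 31,
    "learning_rate": 0.1,
}
booster = lgb.train(goss_params, lgb.Dataset(X, label=y), num_boost_round=100)
```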

4.5 Innovation 4: EFB — Exclusive Feature Bundling

Motivation

Real-world datasets (especially from one-hot encoding) are often very sparse — most feature values are zero. Features that are "mutually exclusive" (rarely non-zero simultaneously) can be bundled into one feature without information loss.

Definition: Exclusivity

Two features \( f_1 \) and \( f_2 \) are exclusive if they rarely take non-zero values simultaneously:

\[ \Pr[f_1 \neq 0 \land f_2 \neq 0] \leq \epsilon \]

For truly exclusive features (\( \epsilon = 0 \)), we can losslessly merge them.

The Bundling Process

Step 1: Build a conflict graph

  • Nodes = features
  • Edge weight between \( f_i \) and \( f_j \) = number of samples where both are non-zero

Step 2: Solve the graph coloring problem

Finding the optimal bundles is equivalent to graph coloring (NP-hard). LightGBM uses a greedy approximation: sort features by degree, assign each feature to an existing bundle if it doesn't exceed conflict threshold, otherwise create a new bundle.
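
A rough sketch of this greedy procedure (illustrative, not LightGBM's actual routine): count pairwise conflicts, then place each feature into the first bundle whose accumulated conflicts stay under a threshold.

```python
import numpy as np

def greedy_bundles(X, max_conflicts=10):
    """Group nearly-exclusive features into bundles.

    X: (n, d) sparse-ish matrix (zeros = feature is "off").
    Returns a list of bundles, each a list of feature indices.
    """
    nonzero = X != 0                                          # boolean (n, d)
    conflicts = nonzero.T.astype(int) @ nonzero.astype(int)   # (d, d) co-nonzero counts
    order = np.argsort(-nonzero.sum(axis=0))                  # densest features first (degree proxy)
    bundles, bundle_conflicts = [], []
    for k in order:
        placed = False
        for j, bundle in enumerate(bundles):
            extra = sum(conflicts[k, f] for f in bundle)
            if bundle_conflicts[j] + extra <= max_conflicts:  # conflict budget not exceeded
                bundle.append(k)
                bundle_conflicts[j] += extra
                placed = True
                break
        if not placed:                                        # start a new bundle
            bundles.append([k])
            bundle_conflicts.append(0)
    return bundles
```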

Step 3: Merge features within a bundle

For features \( f_1 \in [0, B_1] \) and \( f_2 \in [0, B_2] \) in a bundle:

\[ f_{bundle} = \begin{cases} f_1 & \text{if } f_2 = 0 \\ f_2 + B_1 & \text{if } f_1 = 0 \end{cases} \]

The offset places the two features in different ranges of the merged histogram, so they can still be distinguished when needed.

Example: \( f_1 \in \{0, 1\} \) and \( f_2 \in \{0, 1\} \) (two binary features, so \( B_1 = 1 \)):
• both zero → bundle value \( 0 \)
• \( f_1 = 1 \) (so \( f_2 = 0 \)) → bundle value \( 1 \)
• \( f_2 = 1 \) (so \( f_1 = 0 \)) → bundle value \( 1 + 1 = 2 \)
Since they're exclusive (never both non-zero), the bundle value \( \in \{0, 1, 2\} \) uniquely encodes both features (see the sketch below).
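
A tiny sketch of the merge, with a hypothetical `merge_bundle` helper applied to the two-binary-feature example above:

```python
import numpy as np

def merge_bundle(X, bundle, n_bins_per_feature):
    """Encode a bundle of mutually exclusive (binned) features as one feature.

    X: (n, d) binned feature matrix; bundle: list of feature indices;
    n_bins_per_feature[k]: number of non-zero bin values of feature k.
    """
    merged = np.zeros(X.shape[0], dtype=np.int64)
    offset = 0
    for k in bundle:
        active = X[:, k] != 0
        merged[active] = X[active, k] + offset   # shift feature k into its own range
        offset += n_bins_per_feature[k]
    return merged

# The two-binary-feature example: f1, f2 exclusive, bundle values in {0, 1, 2}.
X_demo = np.array([[0, 0], [1, 0], [0, 1]])
print(merge_bundle(X_demo, bundle=[0, 1], n_bins_per_feature=[1, 1]))  # -> [0 1 2]
```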

EFB Gain

If many of the \( d \) original features are rarely non-zero at the same time (as with sparse, one-hot-encoded data), bundling reduces the effective feature count to

\[ d_{eff} = d \cdot (1 - \text{bundling ratio}) \ll d \]

and histogram construction then scales with \( d_{eff} \) instead of \( d \).

Typical speedup: 8× on sparse data with aggressive bundling.

4.6 Putting It All Together: LightGBM vs XGBoost

| Feature | XGBoost | LightGBM |
| --- | --- | --- |
| Split finding | Exact (all thresholds) | Histogram (B = 255 bins) |
| Tree growth | Level-wise | Leaf-wise |
| Sample reduction | – | GOSS |
| Feature reduction | – | EFB |
| Memory | \( O(n \cdot d) \) | \( O(B \cdot d) \) |
| Training speed | baseline | ~10–20× faster |
| Accuracy | Slightly higher on small data | Comparable or better on large data |
| Categorical features | Requires encoding | Native support |

📘 Next Chapter: LightGBM in Practice — Parameters, Training, and Tuning

References & Further Reading

  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30 (NIPS 2017).
  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016).