2.1 What is a Decision Tree?

A decision tree partitions the feature space into rectangular regions through a series of binary splits, and assigns a constant prediction to each region (leaf).

Given input \( x = (x^{(1)}, x^{(2)}, \ldots, x^{(d)}) \), a tree makes a prediction by routing \( x \) from the root node down through a series of decisions to a leaf node.

Each internal node asks: "Is \( x^{(k)} \leq s \)?" — a threshold \( s \) on feature \( k \).

Formal Structure

A regression tree with \( J \) leaves partitions the input space into regions \( R_1, \ldots, R_J \):

\[ h(x) = \sum_{j=1}^{J} w_j \cdot \mathbf{1}[x \in R_j] \]

where \( w_j \) is the leaf weight (constant prediction) for leaf \( j \).
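
To make this concrete, here is a minimal sketch in Python of a three-leaf regression tree written as nested threshold tests; the thresholds and leaf weights are invented for illustration. Evaluating it returns the weight of the single region that \( x \) falls into.

def predict(x):
    # Made-up tree: two splits, three leaves.
    if x[0] <= 3.0:          # root: "Is x^(1) <= 3.0?"
        if x[1] <= 1.5:      # left child: "Is x^(2) <= 1.5?"
            return 10.0      # leaf weight w_1 for region R_1
        return 4.0           # w_2 for R_2
    return -2.0              # w_3 for R_3

print(predict((2.0, 1.0)))   # lands in R_1, prints 10.0
print(predict((5.0, 0.0)))   # lands in R_3, prints -2.0

Only one indicator \( \mathbf{1}[x \in R_j] \) is non-zero for any given \( x \), so the sum in \( h(x) \) reduces to reading off a single leaf weight.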

2.2 How a Tree is Built: Splitting Criterion

To build a tree, we greedily choose the best split at each node. "Best" means the split that most reduces the impurity (loss on training data).

Regression: Sum of Squared Errors (SSE)

For a node containing data points \( S = \{(x_i, y_i)\} \), the SSE is:

\[ \text{SSE}(S) = \sum_{i \in S} (y_i - \bar{y}_S)^2, \quad \bar{y}_S = \frac{1}{|S|}\sum_{i \in S} y_i \]

For a candidate split \( (k, s) \) that divides \( S \) into \( S_L \) (left: \( x^{(k)} \leq s \)) and \( S_R \) (right: \( x^{(k)} > s \)):

\[ \text{Gain}(k, s) = \text{SSE}(S) - \text{SSE}(S_L) - \text{SSE}(S_R) \]

We choose:

\[ k^*, s^* = \arg\max_{k, s} \; \text{Gain}(k, s) \]
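
As a sanity check, the following sketch evaluates the gain of one candidate split on a made-up four-point sample (the numbers are illustrative only):

import numpy as np

def sse(y):
    # SSE of a node around its mean; 0 for an empty node.
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

# Made-up node: values of feature k and the corresponding targets y.
x_k = np.array([1.0, 2.0, 3.0, 4.0])
y   = np.array([1.0, 1.2, 3.8, 4.0])

s = 2.0                                  # candidate threshold on feature k
left, right = y[x_k <= s], y[x_k > s]
gain = sse(y) - sse(left) - sse(right)   # 7.88 - 0.02 - 0.02
print(gain)                              # ~7.84: a strong split for this node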

Equivalent Formulation (faster computation)

\[ \text{SSE}(S) = \sum_{i \in S} y_i^2 - \frac{1}{|S|}\left(\sum_{i \in S} y_i\right)^2 \]

So the gain simplifies to:

\[ \text{Gain}(k, s) = \frac{\left(\sum_{i \in S_L} y_i\right)^2}{|S_L|} + \frac{\left(\sum_{i \in S_R} y_i\right)^2}{|S_R|} - \frac{\left(\sum_{i \in S} y_i\right)^2}{|S|} \]
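
Because this form depends only on label sums and counts, every threshold of one feature can be scored in a single pass over the sorted values. A sketch of that sweep (the function name and variables are my own, not a library API):

import numpy as np

def best_split_one_feature(x_k, y):
    # Sort samples by the feature once, then sweep all split positions.
    order = np.argsort(x_k)
    xs, ys = x_k[order], y[order]
    total, n = ys.sum(), len(ys)

    best_gain, best_s = -np.inf, None
    left_sum = 0.0
    for i in range(n - 1):                 # split between positions i and i+1
        left_sum += ys[i]
        if xs[i] == xs[i + 1]:             # cannot split between equal values
            continue
        right_sum = total - left_sum
        gain = (left_sum ** 2 / (i + 1)
                + right_sum ** 2 / (n - i - 1)
                - total ** 2 / n)
        if gain > best_gain:
            best_gain, best_s = gain, (xs[i] + xs[i + 1]) / 2.0
    return best_s, best_gain

The sort dominates, so scoring one feature costs \( O(n \log n) \); the sweep itself is \( O(n) \).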

2.3 Leaf Values

After the tree structure is determined, the leaf value is:

\[ w_j = \frac{1}{|R_j|} \sum_{i \in R_j} y_i \quad \text{(for MSE loss)} \]

Or, more generally, for any loss \( L \) (written here in the boosting setting, where \( F_{m-1} \) is the ensemble fitted so far and the new tree's leaf contributes an additive correction \( w \)):

\[ w_j = \arg\min_w \sum_{i \in R_j} L(y_i, F_{m-1}(x_i) + w) \]
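
For instance, under squared error the minimizer is the leaf mean, while under absolute error it is the leaf median. A tiny sketch with made-up targets (standalone tree, i.e. \( F_{m-1} \equiv 0 \)):

import numpy as np

y_leaf = np.array([1.0, 2.0, 100.0])   # targets falling in one leaf

w_mse = y_leaf.mean()                  # argmin_w sum (y_i - w)^2  -> 34.33...
w_mae = np.median(y_leaf)              # argmin_w sum |y_i - w|    -> 2.0
print(w_mse, w_mae)

The outlier drags the mean far from the bulk of the data while the median stays put, which is exactly why the choice of loss changes the leaf values.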

2.4 Tree Building Algorithm

BuildTree(S, depth, max_depth):
  if depth == max_depth or |S| < min_samples:
    return Leaf(mean(y for (x,y) in S))

  best_gain = -∞
  for each feature k:
    for each threshold s in sorted unique values of x^(k) in S (excluding the largest):
      S_L = {(x,y) ∈ S : x^(k) ≤ s}
      S_R = {(x,y) ∈ S : x^(k) > s}
      gain = Gain(S, S_L, S_R)
      if gain > best_gain:
        best_gain = gain
        best_split = (k, s)

  if best_gain ≤ 0:            # no split reduces the SSE
    return Leaf(mean(y for (x,y) in S))

  S_L, S_R = split S by best_split
  return Node(best_split,
              left  = BuildTree(S_L, depth+1, max_depth),
              right = BuildTree(S_R, depth+1, max_depth))

Complexity: for \( n \) samples and \( d \) features, finding the best split at a node costs \( O(n \cdot d) \) once the feature values are sorted, and since every sample appears in exactly one node per level, a full tree of depth \( D \) costs \( O(n \cdot d \cdot D) \). For large datasets this is expensive, which is why LightGBM introduces major optimizations (Chapter 5).
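
For reference, here is a compact runnable version of the pseudocode above as a NumPy sketch; the class and function names are my own, not a library API.

import numpy as np

class Leaf:
    def __init__(self, value):
        self.value = value                   # constant prediction w_j

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

def sse(y):
    # Sum of squared errors around the node mean; 0 for an empty node.
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    if depth == max_depth or y.size < min_samples:
        return Leaf(y.mean())

    parent_sse = sse(y)
    best_gain, best = 0.0, None              # require strictly positive gain
    for k in range(X.shape[1]):              # loop over d features
        for s in np.unique(X[:, k])[:-1]:    # candidate thresholds (drop max)
            mask = X[:, k] <= s
            gain = parent_sse - sse(y[mask]) - sse(y[~mask])
            if gain > best_gain:
                best_gain, best = gain, (k, s)

    if best is None:                         # no split improves the fit
        return Leaf(y.mean())

    k, s = best
    mask = X[:, k] <= s
    return Node(k, s,
                build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
                build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples))

def predict_one(tree, x):
    # Route x from the root to a leaf: O(depth) comparisons per sample.
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.value

Note that recomputing the mask for every threshold makes this naive scan \( O(n^2 \cdot d) \) per node; the sorted prefix-sum sweep from Section 2.2 restores the \( O(n \cdot d) \) cost quoted above.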

2.5 Overfitting & Regularization in Trees

A fully grown tree memorizes training data. To prevent overfitting:

Hyperparameter       Effect
max_depth            Limits the maximum depth of the tree
min_samples_leaf     Requires a minimum number of samples in each leaf
min_gain_to_split    Splits only if the gain exceeds a threshold
max_leaves           Limits the total number of leaves (leaf-wise growth)

In gradient boosting, trees are intentionally kept shallow (depth 3–8). The ensemble of many shallow trees does the heavy lifting.
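
As one concrete mapping (an illustrative assumption, not something this chapter prescribes), scikit-learn's DecisionTreeRegressor exposes essentially the same knobs under slightly different names:

from sklearn.tree import DecisionTreeRegressor

# Roughly the same regularizers under scikit-learn's parameter names:
#   max_depth          -> max_depth
#   min_samples_leaf   -> min_samples_leaf
#   min_gain_to_split  -> min_impurity_decrease (closest analogue)
#   max_leaves         -> max_leaf_nodes
tree = DecisionTreeRegressor(
    max_depth=4,
    min_samples_leaf=20,
    min_impurity_decrease=1e-3,
    max_leaf_nodes=31,
)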

2.6 Why Trees in Gradient Boosting?

Trees have unique advantages as base learners:

  1. Handle mixed data types — numerical and categorical features naturally
  2. No feature scaling needed — splits are threshold-based, not distance-based
  3. Automatic interaction detection — a split on feature A then B captures the A×B interaction
  4. Fast prediction — \( O(D) \) per sample where \( D \) = depth
  5. Interpretable — you can visualize and understand individual trees

📘 Next Chapter: XGBoost — Second-Order Gradient Boosting
