2.1 What is a Decision Tree?

A decision tree partitions the feature space into rectangular regions through a series of binary splits, and assigns a constant prediction to each region (leaf).

Given input \( x = (x^{(1)}, x^{(2)}, \ldots, x^{(d)}) \), a tree makes a prediction by routing \( x \) from the root node down through a series of decisions to a leaf node.

Each internal node asks: "Is \( x^{(k)} \leq s \)?" — a threshold \( s \) on feature \( k \).

Formal Structure

A regression tree with \( J \) leaves partitions the input space into regions \( R_1, \ldots, R_J \):

\[ h(x) = \sum_{j=1}^{J} w_j \cdot \mathbf{1}[x \in R_j] \]

where \( w_j \) is the leaf weight (constant prediction) for leaf \( j \).
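
To make this concrete, here is a minimal sketch in Python of a three-leaf regression tree written as nested threshold tests; the thresholds and leaf weights are invented for illustration. Evaluating it returns the weight of the single region that \( x \) falls into.

def predict(x):
    # Made-up tree: two splits, three leaves.
    if x[0] <= 3.0:          # root: "Is x^(1) <= 3.0?"
        if x[1] <= 1.5:      # left child: "Is x^(2) <= 1.5?"
            return 10.0      # leaf weight w_1 for region R_1
        return 4.0           # w_2 for R_2
    return -2.0              # w_3 for R_3

print(predict((2.0, 1.0)))   # lands in R_1, prints 10.0
print(predict((5.0, 0.0)))   # lands in R_3, prints -2.0

Only one indicator \( \mathbf{1}[x \in R_j] \) is non-zero for any given \( x \), so the sum in \( h(x) \) reduces to reading off a single leaf weight.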

2.2 How a Tree is Built: Splitting Criterion

To build a tree, we greedily choose the best split at each node. "Best" means the split that most reduces the impurity (loss on training data).

Regression: Sum of Squared Errors (SSE)

For a node containing data points \( S = \{(x_i, y_i)\} \), the SSE is:

\[ \text{SSE}(S) = \sum_{i \in S} (y_i - \bar{y}_S)^2, \quad \bar{y}_S = \frac{1}{|S|}\sum_{i \in S} y_i \]

For a candidate split \( (k, s) \) that divides \( S \) into \( S_L \) (left: \( x^{(k)} \leq s \)) and \( S_R \) (right: \( x^{(k)} > s \)):

\[ \text{Gain}(k, s) = \text{SSE}(S) - \text{SSE}(S_L) - \text{SSE}(S_R) \]

We choose:

\[ k^*, s^* = \arg\max_{k, s} \; \text{Gain}(k, s) \]
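
As a sanity check, the following sketch evaluates the gain of one candidate split on a made-up four-point sample (the numbers are illustrative only):

import numpy as np

def sse(y):
    # SSE of a node around its mean; 0 for an empty node.
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

# Made-up node: values of feature k and the corresponding targets y.
x_k = np.array([1.0, 2.0, 3.0, 4.0])
y   = np.array([1.0, 1.2, 3.8, 4.0])

s = 2.0                                  # candidate threshold on feature k
left, right = y[x_k <= s], y[x_k > s]
gain = sse(y) - sse(left) - sse(right)   # 7.88 - 0.02 - 0.02
print(gain)                              # ~7.84: a strong split for this node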

Equivalent Formulation (faster computation)

\[ \text{SSE}(S) = \sum_{i \in S} y_i^2 - \frac{1}{|S|}\left(\sum_{i \in S} y_i\right)^2 \]

So the gain simplifies to:

\[ \text{Gain}(k, s) = \frac{\left(\sum_{i \in S_L} y_i\right)^2}{|S_L|} + \frac{\left(\sum_{i \in S_R} y_i\right)^2}{|S_R|} - \frac{\left(\sum_{i \in S} y_i\right)^2}{|S|} \]
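
Because this form depends only on label sums and counts, every threshold of one feature can be scored in a single pass over the sorted values. A sketch of that sweep (the function name and variables are my own, not a library API):

import numpy as np

def best_split_one_feature(x_k, y):
    # Sort samples by the feature once, then sweep all split positions.
    order = np.argsort(x_k)
    xs, ys = x_k[order], y[order]
    total, n = ys.sum(), len(ys)

    best_gain, best_s = -np.inf, None
    left_sum = 0.0
    for i in range(n - 1):                 # split between positions i and i+1
        left_sum += ys[i]
        if xs[i] == xs[i + 1]:             # cannot split between equal values
            continue
        right_sum = total - left_sum
        gain = (left_sum ** 2 / (i + 1)
                + right_sum ** 2 / (n - i - 1)
                - total ** 2 / n)
        if gain > best_gain:
            best_gain, best_s = gain, (xs[i] + xs[i + 1]) / 2.0
    return best_s, best_gain

The sort dominates, so scoring one feature costs \( O(n \log n) \); the sweep itself is \( O(n) \).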

2.3 Leaf Values

After the tree structure is determined, the leaf value is:

\[ w_j = \frac{1}{|R_j|} \sum_{i \in R_j} y_i \quad \text{(for MSE loss)} \]

Or, more generally, for any loss \( L \) (written here in the boosting setting, where \( F_{m-1} \) is the ensemble fitted so far and the new tree's leaf contributes an additive correction \( w \)):

\[ w_j = \arg\min_w \sum_{i \in R_j} L(y_i, F_{m-1}(x_i) + w) \]
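
For instance, under squared error the minimizer is the leaf mean, while under absolute error it is the leaf median. A tiny sketch with made-up targets (standalone tree, i.e. \( F_{m-1} \equiv 0 \)):

import numpy as np

y_leaf = np.array([1.0, 2.0, 100.0])   # targets falling in one leaf

w_mse = y_leaf.mean()                  # argmin_w sum (y_i - w)^2  -> 34.33...
w_mae = np.median(y_leaf)              # argmin_w sum |y_i - w|    -> 2.0
print(w_mse, w_mae)

The outlier drags the mean far from the bulk of the data while the median stays put, which is exactly why the choice of loss changes the leaf values.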

2.4 Tree Building Algorithm

BuildTree(S, depth, max_depth):
  if depth == max_depth or |S| < min_samples:
    return Leaf(mean(y for (x,y) in S))

  best_gain = -∞
  for each feature k:
    for each threshold s in sorted unique values of x^(k) in S (excluding the largest):
      S_L = {(x,y) ∈ S : x^(k) ≤ s}
      S_R = {(x,y) ∈ S : x^(k) > s}
      gain = Gain(S, S_L, S_R)
      if gain > best_gain:
        best_gain = gain
        best_split = (k, s)

  if best_gain ≤ 0:            # no split reduces the SSE
    return Leaf(mean(y for (x,y) in S))

  S_L, S_R = split S by best_split
  return Node(best_split,
              left  = BuildTree(S_L, depth+1, max_depth),
              right = BuildTree(S_R, depth+1, max_depth))

Complexity: for \( n \) samples and \( d \) features, finding the best split at a node costs \( O(n \cdot d) \) once the feature values are sorted, and since every sample appears in exactly one node per level, a full tree of depth \( D \) costs \( O(n \cdot d \cdot D) \). For large datasets this is expensive, which is why LightGBM introduces major optimizations (Chapter 5).
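
For reference, here is a compact runnable version of the pseudocode above as a NumPy sketch; the class and function names are my own, not a library API.

import numpy as np

class Leaf:
    def __init__(self, value):
        self.value = value                   # constant prediction w_j

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

def sse(y):
    # Sum of squared errors around the node mean; 0 for an empty node.
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    if depth == max_depth or y.size < min_samples:
        return Leaf(y.mean())

    parent_sse = sse(y)
    best_gain, best = 0.0, None              # require strictly positive gain
    for k in range(X.shape[1]):              # loop over d features
        for s in np.unique(X[:, k])[:-1]:    # candidate thresholds (drop max)
            mask = X[:, k] <= s
            gain = parent_sse - sse(y[mask]) - sse(y[~mask])
            if gain > best_gain:
                best_gain, best = gain, (k, s)

    if best is None:                         # no split improves the fit
        return Leaf(y.mean())

    k, s = best
    mask = X[:, k] <= s
    return Node(k, s,
                build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
                build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples))

def predict_one(tree, x):
    # Route x from the root to a leaf: O(depth) comparisons per sample.
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.value

Note that recomputing the mask for every threshold makes this naive scan \( O(n^2 \cdot d) \) per node; the sorted prefix-sum sweep from Section 2.2 restores the \( O(n \cdot d) \) cost quoted above.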

2.5 Overfitting & Regularization in Trees

A fully grown tree memorizes training data. To prevent overfitting:

Hyperparameter       Effect
max_depth            Limits the maximum depth of the tree
min_samples_leaf     Requires a minimum number of samples in each leaf
min_gain_to_split    Splits only if the gain exceeds a threshold
max_leaves           Limits the total number of leaves (leaf-wise growth)

In gradient boosting, trees are intentionally kept shallow (depth 3–8). The ensemble of many shallow trees does the heavy lifting.
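
As one concrete mapping (an illustrative assumption, not something this chapter prescribes), scikit-learn's DecisionTreeRegressor exposes essentially the same knobs under slightly different names:

from sklearn.tree import DecisionTreeRegressor

# Roughly the same regularizers under scikit-learn's parameter names:
#   max_depth          -> max_depth
#   min_samples_leaf   -> min_samples_leaf
#   min_gain_to_split  -> min_impurity_decrease (closest analogue)
#   max_leaves         -> max_leaf_nodes
tree = DecisionTreeRegressor(
    max_depth=4,
    min_samples_leaf=20,
    min_impurity_decrease=1e-3,
    max_leaf_nodes=31,
)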

2.6 Why Trees in Gradient Boosting?

Trees have unique advantages as base learners:

  1. Handle mixed data types — numerical and categorical features naturally
  2. No feature scaling needed — splits are threshold-based, not distance-based
  3. Automatic interaction detection — a split on feature A then B captures the A×B interaction
  4. Fast prediction — \( O(D) \) per sample where \( D \) = depth
  5. Interpretable — you can visualize and understand individual trees

📘 Next Chapter: XGBoost — Second-Order Gradient Boosting
