2.1 What is a Decision Tree?
A decision tree partitions the feature space into rectangular regions through a series of binary splits, and assigns a constant prediction to each region (leaf).
Given input \( x = (x^{(1)}, x^{(2)}, \ldots, x^{(d)}) \), a tree makes a prediction by routing \( x \) from the root node down through a series of decisions to a leaf node.
Each internal node asks: "Is \( x^{(k)} \leq s \)?" — a threshold \( s \) on feature \( k \).
A regression tree with \( J \) leaves partitions the input space into regions \( R_1, \ldots, R_J \) and predicts

\[ h(x) = \sum_{j=1}^{J} w_j \cdot \mathbf{1}[x \in R_j], \]

where \( w_j \) is the leaf weight (constant prediction) for leaf \( j \).
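To make the piecewise-constant form concrete, here is a minimal sketch in plain Python. The thresholds and leaf weights are hypothetical (not from any fitted model); the point is that an input is routed to exactly one region \( R_j \) and receives that region's constant \( w_j \):

```python
def predict(x):
    """Route a 2-feature input through a hand-built 3-leaf regression tree.

    Hypothetical tree: split on x[0] <= 5.0 at the root, then split the
    right branch on x[1] <= 2.0. Each leaf returns its constant weight w_j.
    """
    if x[0] <= 5.0:          # region R_1
        return 1.3           # w_1
    elif x[1] <= 2.0:        # region R_2
        return -0.4          # w_2
    else:                    # region R_3
        return 2.1           # w_3

print(predict([3.0, 7.0]))   # falls in R_1 -> 1.3
```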
2.2 How a Tree is Built: Splitting Criterion
To build a tree, we greedily choose the best split at each node. "Best" means the split that most reduces the impurity (loss on training data).
Regression: Sum of Squared Errors (SSE)
For a node containing data points \( S = \{(x_i, y_i)\} \), the SSE is the squared deviation of the targets from the node mean:

\[ \text{SSE}(S) = \sum_{(x_i, y_i) \in S} (y_i - \bar{y}_S)^2, \qquad \bar{y}_S = \frac{1}{|S|} \sum_{(x_i, y_i) \in S} y_i. \]
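For example, for a node whose targets are \( y = (1, 3, 5) \), the node mean is \( \bar{y}_S = 3 \), so

\[ \text{SSE}(S) = (1-3)^2 + (3-3)^2 + (5-3)^2 = 8. \]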
For a candidate split \( (k, s) \) that divides \( S \) into \( S_L \) (left: \( x^{(k)} \leq s \)) and \( S_R \) (right: \( x^{(k)} > s \)), the gain is the reduction in SSE:

\[ \text{Gain}(k, s) = \text{SSE}(S) - \big( \text{SSE}(S_L) + \text{SSE}(S_R) \big). \]

We choose the split with the largest gain:

\[ (k^*, s^*) = \arg\max_{(k, s)} \text{Gain}(k, s). \]
Equivalent Formulation (faster computation)

Expanding the square shows that a node's SSE depends only on simple sums of its targets:

\[ \text{SSE}(S) = \sum_{(x_i, y_i) \in S} y_i^2 - \frac{\left( \sum_{(x_i, y_i) \in S} y_i \right)^2}{|S|}. \]

Because \( \sum y_i^2 \) over \( S \) equals the same sum over \( S_L \) and \( S_R \) combined, those terms cancel in the gain. So the gain simplifies to

\[ \text{Gain}(k, s) = \frac{\left( \sum_{S_L} y_i \right)^2}{|S_L|} + \frac{\left( \sum_{S_R} y_i \right)^2}{|S_R|} - \frac{\left( \sum_{S} y_i \right)^2}{|S|}, \]

which needs only running sums and counts of the targets rather than recomputed means and residuals for every candidate threshold.
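Here is a sketch of how this makes the split search cheap for a single numeric feature (assuming NumPy; the function name `best_split_fast` and the midpoint-threshold convention are illustrative choices, not from the text). After one sort, every threshold's gain comes from running sums in \( O(1) \):

```python
import numpy as np

def best_split_fast(x, y):
    """Find the best threshold on one feature using the simplified gain.

    Sorts once, then evaluates every split position with cumulative sums,
    so the scan over thresholds costs O(n) after the O(n log n) sort.
    """
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    prefix = np.cumsum(y_sorted)          # sum of y in the left partition
    total = prefix[-1]
    best_gain, best_thr = -np.inf, None
    for i in range(1, n):                 # split between positions i-1 and i
        if x_sorted[i] == x_sorted[i - 1]:
            continue                      # identical values cannot be separated
        sum_l, n_l = prefix[i - 1], i
        sum_r, n_r = total - sum_l, n - i
        gain = sum_l**2 / n_l + sum_r**2 / n_r - total**2 / n
        if gain > best_gain:
            best_gain = gain
            best_thr = (x_sorted[i - 1] + x_sorted[i]) / 2  # midpoint threshold
    return best_thr, best_gain
```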
2.3 Leaf Values
After the tree structure is determined, the leaf value for squared-error regression is simply the mean of the targets that fall into that leaf:

\[ w_j = \frac{1}{n_j} \sum_{x_i \in R_j} y_i, \]

where \( n_j \) is the number of training points in \( R_j \). Or, more generally, for any loss \( L \), the leaf value is the constant that minimizes the loss over the points in the leaf:

\[ w_j = \arg\min_{w} \sum_{x_i \in R_j} L(y_i, w). \]
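As a quick illustration of the general form (an added example, not part of the algorithm above): for squared loss the minimizer is the mean, while for absolute loss \( L(y, w) = |y - w| \) it is the median, which is far less sensitive to outliers:

```python
import numpy as np

y_leaf = np.array([1.0, 2.0, 2.0, 10.0])   # targets routed to one leaf

w_squared = y_leaf.mean()       # arg min of sum (y - w)^2  -> 3.75
w_absolute = np.median(y_leaf)  # arg min of sum |y - w|    -> 2.0
```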
2.4 Tree Building Algorithm
```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def build_tree(X, y, depth, max_depth, min_samples=2):
    """Greedy recursive construction of a regression tree."""
    # Stop: depth limit reached or too few samples -> leaf predicts the mean.
    if depth == max_depth or len(y) < min_samples:
        return {"leaf": float(y.mean())}
    best_gain, best_split = float("-inf"), None
    for k in range(X.shape[1]):                  # every feature
        for s in np.unique(X[:, k]):             # every candidate threshold
            left = X[:, k] <= s
            if left.all() or not left.any():
                continue                         # skip splits that separate nothing
            gain = sse(y) - (sse(y[left]) + sse(y[~left]))
            if gain > best_gain:
                best_gain, best_split = gain, (k, s)
    if best_split is None:                       # no informative split exists
        return {"leaf": float(y.mean())}
    k, s = best_split
    left = X[:, k] <= s
    return {"split": (k, s),
            "left": build_tree(X[left], y[left], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth, min_samples)}
```
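A quick usage sketch of `build_tree` on synthetic data (the `predict_one` helper and the toy step-function dataset are illustrative additions):

```python
def predict_one(tree, x):
    """Walk the fitted dict-based tree down to a leaf."""
    while "leaf" not in tree:
        k, s = tree["split"]
        tree = tree["left"] if x[k] <= s else tree["right"]
    return tree["leaf"]

# Toy 1-D regression: y is a noisy step function of x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.1, 200)

tree = build_tree(X, y, depth=0, max_depth=2)
print(predict_one(tree, np.array([2.0])))   # close to 1.0
print(predict_one(tree, np.array([8.0])))   # close to 3.0
```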
Complexity: For \( n \) samples and \( d \) features, finding the best split for a node containing \( n \) samples costs \( O(n \cdot d) \) once each feature is sorted, and because every level of the tree scans each sample once, a full tree of depth \( D \) costs \( O(n \cdot d \cdot D) \). For large datasets this is expensive, which is why LightGBM introduces major optimizations (Chapter 5).
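For a rough sense of scale (illustrative numbers, not from the text): with \( n = 10^6 \) samples, \( d = 100 \) features, and depth \( D = 8 \), a single tree already needs on the order of

\[ n \cdot d \cdot D = 10^6 \times 100 \times 8 = 8 \times 10^8 \]

split-evaluation operations, and a boosted ensemble builds hundreds of such trees.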
2.5 Overfitting & Regularization in Trees
A fully grown tree memorizes training data. To prevent overfitting:
| Hyperparameter | Effect |
|---|---|
| max_depth | Limits tree depth |
| min_samples_leaf | Requires a minimum number of samples per leaf |
| min_gain_to_split | Splits only if the gain exceeds a threshold |
| max_leaves | Limits the total number of leaves (leaf-wise growth) |
In gradient boosting, trees are intentionally kept shallow (depth 3–8). The ensemble of many shallow trees does the heavy lifting.
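As one concrete mapping (a sketch using scikit-learn's DecisionTreeRegressor, whose parameter names differ slightly from the generic names in the table above):

```python
from sklearn.tree import DecisionTreeRegressor

# scikit-learn counterparts to the hyperparameters in the table:
#   max_depth          -> max_depth
#   min_samples_leaf   -> min_samples_leaf
#   min_gain_to_split  -> min_impurity_decrease (split only if the impurity
#                         decrease meets this threshold)
#   max_leaves         -> max_leaf_nodes (grows the tree best-first, i.e. leaf-wise)
tree = DecisionTreeRegressor(
    max_depth=4,
    min_samples_leaf=20,
    min_impurity_decrease=1e-4,
    max_leaf_nodes=31,
)
```

LightGBM exposes the same ideas under its own names (for example num_leaves, min_data_in_leaf, min_gain_to_split), which is where Chapter 5 picks up.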
2.6 Why Trees in Gradient Boosting?
Trees have unique advantages as base learners:
- Handle mixed data types — numerical and categorical features naturally
- No feature scaling needed — splits are threshold-based, not distance-based
- Automatic interaction detection — a split on feature A then B captures the A×B interaction
- Fast prediction — \( O(D) \) per sample where \( D \) = depth
- Interpretable — you can visualize and understand individual trees