Gradient Descent to Gradient Boosting
Adaptive Boosting Algorithm

Introduction

AdaBoost (Adaptive Boosting) is a sequential ensemble method that combines many weak learners (typically shallow decision trees) into a strong classifier by focusing more and more on the hard-to-classify points.

What makes it Adaptive?

Instead of training all models on the same data, AdaBoost:

  • Starts with equal weights for all data points.
  • Trains a weak learner (e.g., a decision stump — a tree with one split).
  • Increases the weights of misclassified points.
  • Trains the next learner to focus on those difficult points.
  • Repeats for a fixed number of iterations or until performance stops improving.

Why "Weak" Learners?

AdaBoost works with models that are only slightly better than random guessing (e.g., error rate < 50% on binary classification). The power comes from the ensemble, not the individual learners. Each tree corrects the mistakes of the previous ones.

The Boosting Process

Imagine a dataset of 10 data points. AdaBoost goes through these steps:

  • Step 1: Train the first tree. Let's say it correctly classifies 7 points but misclassifies 3.
  • Step 2: Increase the weights of the 3 misclassified points.
  • Step 3: Train the second tree, forcing it to focus more on those 3 points. It might correctly classify 2 of them, but misclassify 1.
  • Step 4: Increase the weight of the remaining misclassified point.
  • Step 5: Train the third tree to focus on that last point.
  • Final Model: Combine all 3 trees with different weights (the ones that performed better get higher weights).

AdaBoost Algorithm Steps

Here's the step-by-step breakdown of the AdaBoost algorithm:

  1. Initialize Weights: Start with equal weights for all data points:
  2. Loop for T Iterations:
    • Train Weak Learner: Train a weak learner (e.g., decision stump) on the current weighted data.
    • Calculate Weighted Error: Calculate the weighted error rate of the learner.
    • Calculate Learner Weight: Calculate the weight for this learner (more error = lower weight).
    • Update Data Weights: Increase weights for misclassified points, decrease for correctly classified points.
    • Normalize Weights: Rescale weights so they sum to 1.
  3. Final Prediction: Combine all learners using their weights (weighted majority vote for classification, weighted sum for regression).

AdaBoost Mathematics (for the curious mind)

Here are the core formulas used in AdaBoost:

Initializing Weights: $$ w_i = \frac{1}{N} \quad \text{for all } i = 1, \dots, N $$
Learner Weight Calculation: $$ \alpha_t = \frac{1}{2} \ln \left( \frac{1 - \epsilon_t}{\epsilon_t} \right) $$

The larger \(\alpha_t\), the more weight is given to the learner.

Updating Data Weights: $$ w_{i, t+1} = \frac{w_{i, t}}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if misclassified} \\ e^{\alpha_t} & \text{if correctly classified} \end{cases} $$

where \(Z_t\) is the normalization factor to ensure weights sum to 1.

Final Prediction: $$ y_{final}(x) = \text{sign} \left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) $$

Step-by-Step Algorithm

  1. Step 1: Initialize weights: Initially, all data points are equally important: $$w_i^{(1)} = \frac{1}{n}$$
  2. Step 2: Train weak learner: Train a weak classifier \(h_t(x)\) using the weighted dataset. The learner focus more on points with higher weights.
  3. Step 3: Compute weighted error: $$\epsilon_t = \sum_{i=1}^n w_i^{(t)} \cdot I(y_i \neq h_t(x_i))$$ This is not simple accuracy, it's weighted error.
  4. Step 4: Compute model weight: $$\alpha_t = \frac{1}{2} \ln \left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$$ Interpretation:
    • If error is small \(\rightarrow \alpha_t\) is large \(\rightarrow\) strong learner
    • If error is \(\sim 0.5 \rightarrow \alpha_t \approx 0 \rightarrow\) useless learner
    • If error is close to 1 \(\rightarrow \alpha_t \approx -\infty \rightarrow\) overfitting - giving more weight to the mistakes.
  5. Step 5: Update sample weights: $$w_i^{(t+1)} = w_i^{(t)} \cdot \exp\left(-\alpha_t y_i h_t(x_i)\right)$$ This is the key idea:
    • If correctly classfied \(\rightarrow y_i h_t(x_i) = +1 \rightarrow \) weight decreases.
    • If misclassified \(\rightarrow y_i h_t(x_i) = -1 \rightarrow \) weight increases.
  6. Step 6: Normalize weights: $$w_i^{(t+1)} = \frac{w_i^{(t+1)}}{\sum_{j=1}^n w_j^{(t+1)}}$$ so weights remain a probability distribution.
  7. Step 7: Repeat until desired accuracy or max iterations reached.
  8. Step 8: Final prediction: $$y_{final}(x) = \text{sign} \left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$$ so:
    • Each weak learner votes
    • Stronger learners (low error) get higher weights
    • Final prediction is weighted majority vote
  9. Step 6: Intuition Behind the Math:

    Why exponential update?

    The term: \(\exp(-\alpha_t y_i h_t(x_i))\)

    creates:

    • smooth penalty
    • stronger updates for misclassified points
    • connections to exponential loss minimization

    AdaBoost minimizes this loss:

    $$\sum_{i=1}^n \exp\left(-y_i F(x_i)\right)$$ where $F(x)$ is the final predictor and is given by \(F(x) =\sum_{t=1}^T \alpha_t h_t(x)\). So AdaBoost is actually doing stage-wise additive modeling.
Image credit: © Arun Kumar Pandey.
For a example project, checkout the link: Adaboost - Gradient Boosting comparison project
📘 Next Chapter: Decision Trees — The Building Blocks of Gradient Boosting. We will dive into how trees split data, the impurity measures (Gini, entropy, MSE), and how the split-finding algorithm works — the foundation for understanding XGBoost and LightGBM internals.

References & Further Reading

  • Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232.
  • Friedman, J. H. (1999). Stochastic Gradient Boosting. Computational Statistics & Data Analysis.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proc. 22nd ACM SIGKDD.
  • Ke, G., et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems (NeurIPS).
  • Prokhorenkova, L., et al. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS.
  • Freund, Y., & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning. JCSS, 55(1), 119–139.
  • The Elements of Statistical Learning — Chapter 10: Boosting and Additive Trees.
  • My GitHub: Machine Learning repository

📌 Other interesting topics: