5.1 Installation & Quick Start

pip install lightgbm
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# LightGBM Dataset format (memory-efficient: stores pre-binned feature values)
train_data = lgb.Dataset(X_train, label=y_train)
val_data   = lgb.Dataset(X_val,   label=y_val, reference=train_data)

# Parameters
params = {
    'objective':     'binary',
    'metric':        'binary_logloss',
    'num_leaves':    31,
    'learning_rate': 0.05,
    'verbose':       -1
}

# Train (the native API takes the tree count as num_boost_round;
# n_estimators is the sklearn-API spelling)
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[val_data],
    callbacks=[lgb.early_stopping(stopping_rounds=20)]
)

# Predict
y_pred = (model.predict(X_val) > 0.5).astype(int)
print(f"Accuracy: {accuracy_score(y_val, y_pred):.4f}")

5.2 Core Parameters — Complete Reference

Task / Objective

| Parameter | Value | Use Case |
|---|---|---|
| `objective` | `'regression'` | Regression (MSE loss) |
| | `'regression_l1'` | Regression (MAE loss) |
| | `'binary'` | Binary classification |
| | `'multiclass'` | Multi-class classification |
| | `'lambdarank'` | Learning to rank |
| `num_class` | integer | Required for multiclass |
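
For example, a multiclass setup must pair the objective with num_class; a minimal sketch for a hypothetical 3-class problem:

params_multiclass = {
    'objective': 'multiclass',
    'num_class': 3,               # required: number of classes
    'metric':    'multi_logloss',
}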

Tree Structure

| Parameter | Default | Meaning | Tuning Advice |
|---|---|---|---|
| `num_leaves` | 31 | Max leaves per tree | Primary complexity control. Keep ≤ 2^max_depth. Start at 31. |
| `max_depth` | -1 (no limit) | Max tree depth | Set if `num_leaves` alone is too permissive. |
| `min_data_in_leaf` | 20 | Min samples in a leaf | Increase for large or noisy datasets. |
| `min_sum_hessian_in_leaf` | 1e-3 | Min sum of H_j in a leaf | Equivalent to a minimum sample weight. |
| `max_bin` | 255 | Histogram bins B | Higher = more accurate splits, slower. |
| `min_gain_to_split` | 0 | γ in the gain formula | Increase to prune low-gain splits. |
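
The num_leaves ≤ 2^max_depth constraint is easy to encode when tuning both together; a small illustrative helper (the names are hypothetical, not a LightGBM API):

# A binary tree of depth d has at most 2^d leaves, so cap num_leaves accordingly
def capped_num_leaves(requested: int, max_depth: int) -> int:
    return min(requested, 2 ** max_depth)

capped_num_leaves(255, 6)   # -> 64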

Learning

| Parameter | Default | Meaning | Tuning Advice |
|---|---|---|---|
| `learning_rate` (ν) | 0.1 | Shrinkage applied to each tree | Lower = better generalization, needs more trees. |
| `num_iterations` (alias `n_estimators`) | 100 | Number of trees M | Use early stopping instead of a fixed value. |
| `early_stopping_rounds` | 0 (disabled) | Stop if no improvement for k rounds | Set to ~50–100. |

Regularization

| Parameter | Default | Meaning | Effect |
|---|---|---|---|
| `lambda_l1` | 0 | L1 regularization on leaf weights | Sparse leaf weights |
| `lambda_l2` | 0 | L2 regularization (λ) | Smooths leaf weights, reduces overfitting |
| `min_gain_to_split` (γ) | 0 | Minimum split gain | Prunes unprofitable splits |
| `num_leaves` | 31 | Max leaves per tree | Lower = simpler model |
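
For context, this λ is the same λ that appears in the categorical gain formula of Section 5.5: it enters through the optimal leaf weight, where G_j and H_j are the sums of gradients and Hessians in leaf j:
\[ w_j^* = -\frac{G_j}{H_j + \lambda} \]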

Sampling (GOSS & Bagging)

| Parameter | Default | Meaning |
|---|---|---|
| `bagging_fraction` | 1.0 | Fraction of rows sampled per tree (like random forest) |
| `bagging_freq` | 0 | Perform bagging every k iterations (0 = disabled) |
| `feature_fraction` | 1.0 | Fraction of features per tree (column subsampling) |
| `top_rate` | 0.2 | GOSS a: top-gradient fraction kept |
| `other_rate` | 0.1 | GOSS b: random fraction sampled from the rest |
| `boosting` | 'gbdt' | Set to 'goss' to enable GOSS |
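
A minimal GOSS configuration matching the table might look like the sketch below. Note that GOSS replaces row bagging, so bagging_fraction/bagging_freq should stay at their defaults; recent versions (LightGBM ≥ 4.0) prefer data_sample_strategy='goss' over boosting='goss'.

params_goss = {
    'objective':            'binary',
    'boosting':             'gbdt',
    'data_sample_strategy': 'goss',   # LightGBM >= 4.0 spelling; older: boosting='goss'
    'top_rate':             0.2,      # a: keep the 20% largest-gradient samples
    'other_rate':           0.1,      # b: randomly sample 10% of the remainder
}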

5.3 The Bias-Variance Tradeoff in LightGBM

🔧 To reduce overfitting (high variance):
  • Decrease num_leaves
  • Increase min_data_in_leaf
  • Decrease learning_rate + increase n_estimators
  • Increase lambda_l1, lambda_l2
  • Decrease feature_fraction, bagging_fraction
🚀 To reduce underfitting (high bias):
  • Increase num_leaves
  • Decrease min_data_in_leaf
  • Increase learning_rate
  • Decrease regularization
  • Increase feature_fraction, bagging_fraction to 1.0
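
As a concrete starting point, an anti-overfitting configuration might look like this (the values are illustrative starting points to tune from, not recommendations):

params_regularized = {
    'num_leaves':       15,     # down from 31
    'min_data_in_leaf': 50,     # up from 20
    'learning_rate':    0.02,   # lower rate, paired with more trees + early stopping
    'lambda_l1':        0.1,
    'lambda_l2':        1.0,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq':     1,
}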

5.4 Early Stopping — The Key to Optimal M

Choosing the number of trees M is critical: too few → underfitting; too many → overfitting.

model = lgb.train(
    params,
    train_data,
    num_boost_round=2000,           # Upper bound on trees
    valid_sets=[train_data, val_data],
    valid_names=['train', 'val'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),  # Stop after 50 rounds without improvement
        lgb.log_evaluation(period=50)             # Print every 50 rounds
    ]
)

print(f"Best iteration: {model.best_iteration}")
print(f"Best val score: {model.best_score}")
The optimal M is model.best_iteration. This is the most important practical trick for LightGBM.
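
When early stopping fires, predict() uses the best iteration by default, but being explicit costs nothing and documents intent:

y_prob = model.predict(X_val, num_iteration=model.best_iteration)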

5.5 Categorical Feature Handling

# Specify categorical columns by index (values must be encoded
# as non-negative integers)
train_data = lgb.Dataset(X_train, label=y_train,
                         categorical_feature=[0, 3, 7])  # column indices

# Or with pandas DataFrames
import pandas as pd
df = pd.DataFrame(X_train)
df[0] = df[0].astype('category')  # category dtype is picked up automatically
train_data = lgb.Dataset(df, label=y_train)
Internally, LightGBM does not one-hot encode categoricals. Instead of thresholding (feature ≤ s), it searches for a subset of categories C ⊆ {0, 1, ..., K−1} to send left, with the complement going right:
\[ \text{Gain}(C) = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \]
LightGBM finds a near-optimal C by sorting categories by their gradient statistic (≈ G_k / (H_k + λ)) and scanning the sorted order, in O(K log K) time.
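
To make the heuristic concrete, here is an illustrative re-implementation (not LightGBM's actual internals, which add smoothing via cat_smooth and other constraints): sort categories by G_k / (H_k + λ), then scan prefixes of the sorted order, scoring each partition with the gain formula above.

import numpy as np

def best_categorical_split(G_cat, H_cat, lam=1.0):
    """Illustrative sketch: G_cat[k], H_cat[k] are the per-category
    sums of gradients and Hessians for categories k = 0..K-1."""
    K = len(G_cat)
    order = np.argsort(G_cat / (H_cat + lam))        # sorted-gradient order
    G, H = G_cat.sum(), H_cat.sum()
    parent = G**2 / (H + lam)
    best_gain, best_left = -np.inf, None
    G_L = H_L = 0.0
    for i in range(K - 1):                           # scan prefixes; keep right side non-empty
        k = order[i]
        G_L += G_cat[k]; H_L += H_cat[k]
        G_R, H_R = G - G_L, H - H_L
        gain = G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - parent
        if gain > best_gain:
            best_gain, best_left = gain, set(order[:i + 1].tolist())
    return best_gain, best_left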

5.6 Feature Importance

# 'split': how many times feature was used to split
# 'gain': total gain from all splits using this feature
importance_split = model.feature_importance(importance_type='split')
importance_gain  = model.feature_importance(importance_type='gain')

# Plot (requires matplotlib)
import matplotlib.pyplot as plt
lgb.plot_importance(model, importance_type='gain', max_num_features=20)
plt.show()
Prefer gain importance; it weights features by how much they improve the objective, not just how often they are used (raw split counts tend to over-credit high-cardinality features, which offer many candidate splits).
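
A convenient way to inspect both views side by side is a small DataFrame (a sketch, assuming pandas is available):

import pandas as pd

imp = (pd.DataFrame({
           'feature': model.feature_name(),
           'split':   model.feature_importance(importance_type='split'),
           'gain':    model.feature_importance(importance_type='gain'),
       })
       .sort_values('gain', ascending=False))
print(imp.head(10))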

5.7 Hyperparameter Tuning Strategy

Phase 1 — Set learning rate high, find rough M: learning_rate=0.1, use early stopping.

Phase 2 — Tune tree structure: num_leaves ∈ [15,31,63,127], min_data_in_leaf ∈ [10,20,50,100]

Phase 3 — Tune regularization: lambda_l2 ∈ [0,0.1,1,10], lambda_l1 ∈ [0,0.1,1], min_gain_to_split ∈ [0,0.1,1]

Phase 4 — Tune sampling: feature_fraction, bagging_fraction, bagging_freq

Phase 5 — Lower learning rate, retrain: learning_rate=0.01 or 0.05

Using Optuna for Automated Tuning

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    params = {
        'objective':          'binary',
        'metric':             'auc',
        'num_leaves':         trial.suggest_int('num_leaves', 20, 300),
        'min_data_in_leaf':   trial.suggest_int('min_data_in_leaf', 10, 100),
        'learning_rate':      trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'lambda_l2':          trial.suggest_float('lambda_l2', 1e-3, 10.0, log=True),
        'feature_fraction':   trial.suggest_float('feature_fraction', 0.5, 1.0),
        'bagging_fraction':   trial.suggest_float('bagging_fraction', 0.5, 1.0),
        'bagging_freq':       trial.suggest_int('bagging_freq', 1, 10),
        'verbose':            -1
    }
    model = lgb.train(params, train_data,                 # Dataset objects from Section 5.1
                      num_boost_round=2000,               # upper bound; early stopping picks M
                      valid_sets=[val_data],
                      callbacks=[lgb.early_stopping(50, verbose=False)])
    return model.best_score['valid_0']['auc']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
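
A typical follow-up, sketched under the same assumptions (train_data/val_data from Section 5.1), merges the winning values into a full parameter dict and retrains with early stopping:

best_params = {'objective': 'binary', 'metric': 'auc', 'verbose': -1,
               **study.best_params}
final_model = lgb.train(best_params, train_data,
                        num_boost_round=2000,
                        valid_sets=[val_data],
                        callbacks=[lgb.early_stopping(50)])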

5.8 Cross-Validation

cv_results = lgb.cv(
    params,
    train_data,
    num_boost_round=1000,
    nfold=5,
    stratified=True,        # For classification
    callbacks=[lgb.early_stopping(50)]
)

best_rounds = len(cv_results['valid binary_logloss-mean'])
print(f"Best rounds: {best_rounds}")
print(f"CV score: {cv_results['valid binary_logloss-mean'][-1]:.4f} "
      f"± {cv_results['valid binary_logloss-stdv'][-1]:.4f}")

5.9 sklearn API

from lightgbm import LGBMClassifier, LGBMRegressor

clf = LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    early_stopping_rounds=50,   # in LightGBM >= 4.0, set here or via callbacks, not in fit()
    verbose=-1
)
clf.fit(X_train, y_train,
        eval_set=[(X_val, y_val)])

y_pred = clf.predict(X_val)
y_prob = clf.predict_proba(X_val)[:, 1]
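
With early stopping active, the fitted estimator records the chosen round in best_iteration_, and predict()/predict_proba() use it automatically:

print(f"Stopped at iteration: {clf.best_iteration_}")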

5.10 Saving and Loading Models

# Save
model.save_model('model.lgb')

# Load
model_loaded = lgb.Booster(model_file='model.lgb')
y_pred = model_loaded.predict(X_val)

# dump_model() returns a dict (human-readable structure, useful for
# inspection); write it to JSON yourself
import json
with open('model.json', 'w') as f:
    json.dump(model.dump_model(), f, indent=2)

5.11 Common Pitfalls & Fixes

| Problem | Symptom | Fix |
|---|---|---|
| Overfitting | Train loss ↓, val loss ↑ | Reduce `num_leaves`, increase `min_data_in_leaf`, add regularization |
| Underfitting | Both losses high | Increase `num_leaves`, more trees, decrease regularization |
| Slow training | Long wall time | Enable GOSS, reduce `max_bin`, use `feature_fraction < 1` |
| Memory error | OOM | Reduce `max_bin`, enable `two_round` (alias `two_round_loading`) for large files |
| Categoricals not working | High error on cat features | Ensure `categorical_feature` param is set, or use pandas `category` dtype |
| NaN predictions | NaN in output | Check for NaN in input features; LightGBM treats NaN as a separate bin |
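
For the NaN row above, a quick pre-flight check (a sketch, assuming a numeric NumPy feature matrix) catches bad inputs before training:

import numpy as np

nan_cols = np.flatnonzero(np.isnan(X_train).any(axis=0))
print(f"Columns containing NaN: {nan_cols}")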

5.12 Complete Parameter Cheat Sheet

params = {
    # Task
    'objective':             'binary',      # or 'regression', 'multiclass'
    'metric':                'auc',         # or 'rmse', 'binary_logloss', 'multi_logloss'
    'num_class':             1,             # only for multiclass

    # Tree structure
    'num_leaves':            31,            # ↑ more complex, ↑ overfit risk
    'max_depth':             -1,            # -1 = no limit
    'min_data_in_leaf':      20,            # ↑ more regularization
    'min_sum_hessian_in_leaf': 1e-3,

    # Learning
    'learning_rate':         0.05,          # ↓ = better generalization
    'num_iterations':        1000,          # alias: n_estimators; use early stopping
    'early_stopping_rounds': 50,

    # Regularization
    'lambda_l1':             0.0,           # L1 on leaf weights
    'lambda_l2':             0.0,           # L2 on leaf weights
    'min_gain_to_split':     0.0,           # γ: minimum gain to split

    # Sampling
    'feature_fraction':      0.8,           # column subsampling per tree
    'bagging_fraction':      0.8,           # row subsampling
    'bagging_freq':          5,             # bagging every 5 iterations

    # Speed & hardware
    'max_bin':               255,           # histogram bins
    'num_threads':           -1,            # -1 = use all cores
    'device_type':           'cpu',         # 'cpu' or 'gpu'
    'boosting':              'gbdt',        # 'gbdt', 'goss', 'dart', 'rf'

    # Output
    'verbose':               -1,            # suppress output
    'seed':                  42,
}
📘 Next Chapter: Mathematical Appendix & Summary
