Gradient Boosting: A Comprehensive Guide to Machine Learning

Gradient Boosting Machines (GBMs) are a powerful and versatile family of machine learning algorithms that have achieved significant success across a wide range of applications. They are particularly effective for both regression and classification tasks, and have become a go-to method for winning Kaggle competitions and solving real-world problems. This article provides a comprehensive tutorial on gradient boosting, covering its methodology, design considerations, and practical applications.

Introduction to Gradient Boosting

Many supervised machine learning algorithms rely on a single predictive model. However, ensemble methods like bagging and random forests combine multiple models to improve overall performance. Boosting, a general algorithm for building ensembles, is particularly effective for models with high bias and low variance. The main idea behind boosting is to add new models to the ensemble sequentially, with each model focusing on correcting the errors of its predecessors.

The Power of Ensembles

In many machine learning applications, a common task is to build a non-parametric regression or classification model from data. While one strategy is to build a model from theory and adjust its parameters based on observed data, this is often not feasible in real-world situations. Non-parametric machine learning techniques like neural networks or support vector machines can be used to build a model directly from the data.

Single Strong Models vs. Ensembles

The most frequent approach to data-driven modeling is to build a single strong predictive model. However, an alternative approach is to build an ensemble of models for a particular learning task. Instead of combining a set of strong models like neural networks, the ensemble approach typically relies on combining a large number of relatively weak, simple models to obtain a stronger ensemble prediction. Common ensemble techniques like random forests rely on simple averaging of models in the ensemble.

Boosting: A Constructive Ensemble Formation Strategy

Boosting methods take a different approach, based on a constructive strategy of ensemble formation. The main idea is to add new models to the ensemble sequentially. At each iteration, a new weak base-learner model is trained with respect to the error of the whole ensemble learned so far. Early boosting techniques were purely algorithm-driven, making detailed analysis of their properties and performance difficult.


Gradient Boosting Machines: A Statistical Framework

To establish a connection with the statistical framework, a gradient-descent-based formulation of boosting methods was derived, leading to the development of gradient boosting machines (GBMs). In GBMs, the learning procedure consecutively fits new models to provide a more accurate estimate of the response variable. The principal idea behind this algorithm is to construct the new base-learners to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble. The loss functions applied can be arbitrary, but to give a better intuition, if the error function is the classic squared-error loss, the learning procedure would result in consecutive error-fitting. This high flexibility makes GBMs highly customizable to any particular data-driven task. It introduces a lot of freedom into the model design, making the choice of the most appropriate loss function a matter of trial and error. However, boosting algorithms are relatively simple to implement, which allows one to experiment with different model designs.

Gradient Boosting Methodology

Gradient boosting is a machine learning ensemble technique that sequentially combines the predictions of multiple weak learners, typically decision trees. It aims to improve overall predictive performance by optimizing the model’s weights based on the errors of previous iterations, gradually reducing prediction errors and enhancing the model’s accuracy.

Weak Learners

Boosting is a framework that iteratively improves any weak learning model. Many gradient boosting applications allow you to “plug in” various classes of weak learners at your disposal. In practice, however, boosted algorithms almost always use decision trees as the base-learner. A weak model is one whose error rate is only slightly better than random guessing. Shallow trees (i.e., trees with relatively few splits) represent a weak learner.

Sequential Training

Boosted trees are grown sequentially; each tree is grown using information from previously grown trees to improve performance. Each model in the sequence slightly improves upon the performance of the previous one (essentially, by focusing on the rows of the training data where the previous tree had the largest errors or residuals).
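This sequential residual-fitting loop can be sketched in a few lines of Python. The sketch below is purely illustrative (the names `fit_stump` and `boost` are not from any library): it fits one-split stumps on 1-D inputs to the current residuals and adds each with a small learning rate.

```python
def fit_stump(xs, rs):
    """Fit a one-split stump (a weak learner) to residuals rs by minimizing SSE."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=50, lr=0.1):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]      # errors of the ensemble so far
        h = fit_stump(xs, residuals)                       # new learner fits those errors
        pred = [p + lr * h(x) for p, x in zip(pred, xs)]   # add it with shrinkage lr
        stumps.append(h)
    return lambda x: sum(lr * h(x) for h in stumps)

model = boost([1, 2, 3, 4, 5, 6], [1.0, 1.2, 1.1, 3.0, 3.2, 3.1])
```

Each iteration only needs the residuals of the current ensemble, which is exactly the "focus on the largest errors" behavior described above.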

Gradient Descent

Gradient boosting is considered a gradient descent algorithm. Gradient descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to tweak parameter(s) iteratively in order to minimize a cost function. Gradient descent can be performed on any loss function that is differentiable. Consequently, this allows GBMs to optimize different loss functions as desired. An important parameter in gradient descent is the size of the steps which is controlled by the learning rate.
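As a minimal stand-alone illustration (not specific to boosting), here is gradient descent minimizing a one-dimensional quadratic; the function names are ours:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)   # step against the gradient, scaled by the learning rate
    return w

# minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2(w - 3)
w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With a learning rate of 0.1 the iterate converges to the minimizer $w=3$; too large a learning rate would overshoot, too small a rate would require many more steps.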


Algorithm Details

Let $\ell$ denote a (convex and differentiable) loss function. Assume we have already finished $t$ iterations and already have an ensemble classifier $H_t(\vec{x})$. Now in iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. Once $h_{t+1}$ has been found, we add it to our ensemble. Given $H$, we want to find the step-size $\alpha$ and (weak learner) $h$ to minimize the loss $\ell(H+\alpha h)$.

The approximation $\ell(H+\alpha h) \approx \ell(H) + \alpha\langle\nabla \ell(H),h\rangle$ only holds within a small region around $\ell(H)$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha\approx 0.1$).

We need a function $\mathbb{A}(\{(\mathbf{x}_1,r_1),\dots,(\mathbf{x}_n,r_n)\})=\textrm{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(\mathbf{x}_i)$. In order to make progress this $h$ does not have to be great. First, we assume that $\sum_{i=1}^{n} h^2(\mathbf{x}_i)$ = constant. This is simple to do (we normalize the predictions) and important because we could always decrease $\sum_{i=1}^n h(\mathbf{x}_i)r_i$ by rescaling $h$ with a large constant. By fixing $\sum_{i=1}^n h^2(\mathbf{x}_i)$ to a constant we are essentially fixing the vector $h$ to lie on a circle, and we are only concerned with its direction but not its length.

Squared Loss

If the loss function $\ell$ is the squared loss, i.e. $\ell(H)=\frac{1}{2}\sum_{i=1}^n \left(H(\mathbf{x}_i)-y_i\right)^2$, then the negative gradient with respect to the prediction at $\mathbf{x}_i$ is $r_i = y_i - H(\mathbf{x}_i)$, which is simply the residual, i.e. $\mathbf{r}$ is the vector pointing from $\mathbf{H}$ to $\mathbf{y}$.
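Concretely, writing out the squared loss and differentiating shows why the negative gradient is exactly the residual:

```latex
\ell(H) = \frac{1}{2}\sum_{i=1}^{n}\bigl(H(\mathbf{x}_i)-y_i\bigr)^{2}
\quad\Longrightarrow\quad
r_i = -\frac{\partial \ell}{\partial H(\mathbf{x}_i)} = y_i - H(\mathbf{x}_i).
```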

AdaBoost

For notational convenience (and for reasons that will become clear in a little bit), let us define $w_i= \frac{1}{Z}e^{-y_iH(\mathbf{x}_i)}$, where $Z=\sum_{i=1}^{n} e^{-y_iH(\mathbf{x}_i)}$ is a normalizing factor so that $\sum_{i=1}^{n} w_i=1$. Note that the normalizing constant $Z$ is identical to the loss function. Each weight $w_i$ therefore has a very nice interpretation: it is the relative contribution of the training point $(\mathbf{x}_i,y_i)$ towards the overall loss. Let us denote the weighted classification error as $\epsilon=\sum_{i:h(\mathbf{x}_i)y_i=-1} w_i$. So for AdaBoost, we only need a classifier that can take training data and a distribution over the training set (i.e. normalized weights $w_i$ for all training samples) and which returns a classifier $h\in \mathbb{H}$ that reduces the weighted classification error of these training samples.


In the AdaBoost setting we can find the optimal step-size in closed form: we differentiate the loss w.r.t. $\alpha$, set the derivative to zero, and obtain $\alpha=\frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$. It is unusual that we can find the optimal step-size in such a simple closed form. After you take a step, i.e. $H_{t+1}=H_{t}+\alpha h$, you need to re-compute all the weights and then re-normalize.

As long as $\mathbb{H}$ is negation closed (this means for every $h\in \mathbb{H}$ we must also have $-h\in \mathbb{H}$), it cannot be that the error $\epsilon>\frac{1}{2}$. The reason is simply that if $h$ has error $\epsilon$, it must be that $-h$ has error $1-\epsilon$. So you could just flip $h$ to $-h$ and obtain a classifier with smaller error. As $h$ was found by minimizing the error, this is a contradiction. The inner loop can terminate when the error reaches $\epsilon=\frac{1}{2}$, and in most cases it will converge to $\frac{1}{2}$ over time. In that case the latest weak learner $h$ is only as good as a coin toss and cannot benefit the ensemble (therefore boosting terminates). Also note that if $\epsilon=\frac{1}{2}$ the step-size $\alpha$ would be zero. Since $h(\mathbf{x}_i)y_i$ is either $+1$ (if classified correctly by this weak learner) or $-1$ (otherwise), it follows that this weight update multiplies the weight $w_i$ either by a factor $e^\alpha>1$ if it was classified incorrectly (i.e. increases the weight), or by a factor $e^{-\alpha}<1$ if it was classified correctly (i.e. decreases the weight).
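The full loop — find a weak learner with small weighted error, compute the closed-form step size, multiply the weights up or down, re-normalize — can be sketched as follows. This is a toy implementation with 1-D decision stumps as weak learners; all names are ours, not from any library.

```python
import math

def adaboost(xs, ys, rounds=10):
    """AdaBoost on 1-D inputs xs with labels ys in {+1, -1}."""
    n = len(xs)
    w = [1.0 / n] * n                             # uniform initial distribution
    ensemble = []                                 # list of (alpha, weak learner) pairs
    for _ in range(rounds):
        # search all stumps (threshold t, sign s) for the smallest weighted error
        eps, t, s = min(
            (sum(wi for xi, yi, wi in zip(xs, ys, w)
                 if (s if xi <= t else -s) != yi), t, s)
            for t in xs for s in (+1, -1))
        if eps == 0.0 or eps >= 0.5:              # perfect or coin-toss learner: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)   # optimal closed-form step size
        h = lambda x, t=t, s=s: s if x <= t else -s
        # multiply each weight by e^{-alpha y_i h(x_i)}, then re-normalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(xs, ys, w)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

H = adaboost([1, 2, 3, 4, 5, 6], [1, 1, -1, 1, -1, -1])
```

Note how misclassified points gain weight after each round, forcing the next stump to concentrate on them.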

Loss Function and Normalization

Previously we established that the normalizer $Z$ is identical to the loss. One can show that after $T$ iterations the loss becomes $Z_T = n\prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}$ (the factor $n$ comes from the fact that the initial $Z_0=n$, when all weights are $\frac{1}{n}$). The function $c(1-c)$ is maximized at $c=\frac{1}{2}$. But we know that each $\epsilon_t<\frac{1}{2}$ (or else the algorithm would have terminated). Therefore $\epsilon_t(1-\epsilon_t)<\frac{1}{4}$ and we can re-write it as $\epsilon_t(1-\epsilon_t)=\frac{1}{4}-\gamma_t^2$, for some $\gamma_t>0$. Since $2\sqrt{\frac{1}{4}-\gamma_t^2}=\sqrt{1-4\gamma_t^2}\leq e^{-2\gamma_t^2}$, each iteration shrinks the loss by a constant factor. In other words, the training loss is decreasing exponentially! In fact, we can go even further and compute after how many iterations we must have zero training error. Note that the training loss is an upper bound on the training error (defined as $\sum_{i=1}^n \delta_{H(\mathbf{x}_i)\neq y_i}$) - simply because $\delta_{H(\mathbf{x}_i)\neq y_i}\leq e^{-y_i H(\mathbf{x}_i)}$ in all cases. We can then compute the number of steps required until the loss is less than $1$, which would imply that not a single training input is misclassified. This is an amazing result: it shows that after $O(\log(n))$ iterations your training error must be zero. In practice it often makes sense to keep boosting even after you make no more mistakes on the training set.

Boosting as a Strong Classifier

Boosting is a great way to turn a weak classifier into a strong classifier.

Gradient Boosting Design Considerations

To design a particular GBM for a given task, one has to provide the choices of functional parameters Ψ(y, f) and h(x, θ). In other words, one has to specify what one is actually going to optimize, and afterwards, to choose the form of the function, which will be used in building the solution. It is clear that these choices would greatly affect the GBM model properties. This section provides the descriptions and illustrations of different families of loss functions and models of base-learners.

Loss Functions

Given a particular learning task, one can consider different loss functions Ψ(y, f) to exploit. This choice is often influenced by the demand of specific characteristics of the conditional distribution. To use an arbitrary loss function, one has to specify both the loss function and the function to calculate the corresponding negative gradient. Given these two functions, they can be directly substituted into the GBM algorithm. Loss-functions can be classified according to the type of response variable y. Specific boosting algorithms have been derived for various families of the response, among which are the regression, classification and time-to-event analysis tasks.

Continuous Response Variables: Regression

When the response variable y is continuous, a regression task is solved. In the case of the L2 loss-function, the negative gradient is the residual y − f, which implies that the GBM algorithm simply performs residual refitting. The idea behind this loss function is to penalize large deviations from the target outputs while neglecting small residuals.

L2 Squared Loss Function

The L2 squared loss function is a common choice for regression tasks.

L1 Absolute Loss Function

Another example is the absolute L1 loss, also known as the “Laplacian” loss function. The L1 loss corresponds to the median of the conditional distribution and is thus considered a robust regression loss. It may be of particular interest in tasks where the response variable has a long-tailed error distribution.

Huber Loss Function

One can also exploit parameterized loss functions. A robust regression alternative to the L1 loss is the Huber loss function. It comprises two parts, corresponding to the L2 and L1 losses. The cutoff parameter δ specifies the robustification effect of the loss function: intuitively, it is the maximum value of error up to which the L2 loss applies, after which the L1 loss takes over.
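A sketch of the Huber loss as a plain function, with the cutoff parameter δ (here `delta`) marking where the quadratic part hands over to the linear part:

```python
def huber(y, f, delta=1.0):
    """Huber loss between target y and prediction f, with cutoff delta."""
    r = abs(y - f)
    if r <= delta:
        return 0.5 * r ** 2              # quadratic (L2) region for small residuals
    return delta * (r - 0.5 * delta)     # linear (L1) region beyond the cutoff
```

Because the penalty grows only linearly past δ, single extreme outliers cannot dominate the fit the way they do under pure squared loss.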

Quantile Loss Function

A more general approach is based on predicting a conditional quantile of the response variable. This approach is distribution free and in general proves to provide good robustness to outliers. The parameter α in this case specifies the desired quantile of the conditional distribution. One can note that when α = 0.5, this would coincide with the L1 loss, thus resulting in the conditional median.
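The quantile (pinball) loss can be sketched as follows; note that at α = 0.5 it is proportional to the absolute (L1) loss, so minimizing it yields the conditional median:

```python
def quantile_loss(y, f, alpha=0.5):
    """Pinball loss for quantile level alpha between target y and prediction f."""
    r = y - f
    return alpha * r if r >= 0 else (alpha - 1) * r   # asymmetric penalty around the quantile
```

With α = 0.9, under-predictions (r > 0) are penalized nine times as heavily as over-predictions, which pushes the fitted function toward the 90th percentile of the conditional distribution.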

Categorical Response Variables: Classification

In the case of categorical response, the response variable y typically takes on binary values y ∈ {0, 1}.

Regularization in Gradient Boosting

Shrinkage

The step size $\alpha$ is often referred to as shrinkage. Shrinkage scales the contribution of each new model by the learning rate (denoted $\eta$). Smaller learning rates mean the contribution of each tree is smaller, which reduces the risk of overfitting but requires more trees to achieve the same performance. Larger learning rates mean each tree has a more significant impact, but this can lead to overfitting. There's a trade-off between the learning rate and the number of estimators (trees): a smaller learning rate usually means more trees are required to achieve optimal performance.

Model Complexity

Some people do not consider gradient boosting algorithms to be part of the boosting family, because they have no guarantee that the training error decreases exponentially. Often these algorithms are referred to as stage-wise regression instead.

Stochastic Gradient Boosting

Inspired by Breiman's bagging, stochastic gradient boosting subsamples the training data for each weak learner, combining the benefits of bagging and boosting. One variant is to subsample only $n/2$ data points without replacement, which speeds up the training process. An important insight made by Breiman in developing his bagging and random forest algorithms was that training the algorithm on a random subsample of the training data set offered additional reduction in tree correlation and, therefore, improvement in prediction accuracy. Friedman used this same logic and updated the boosting algorithm accordingly. Generally, aggressive subsampling of rows, such as selecting only 50% or less of the training data, has been shown to be beneficial, and typical values range between 0.5-0.8. The impact of subsampling columns largely depends on the nature of the data and whether there is strong multicollinearity or a lot of noisy features. Similar to the $m_{try}$ parameter in random forests, if there are fewer relevant predictors (more noisy data), higher values of column subsampling tend to perform better because they make it more likely to select the features with the strongest signal.
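The stochastic variant changes only one line of the boosting loop: each weak learner is fit on a random subsample of the rows. A toy sketch (with a constant weak learner to keep it short; all names are ours):

```python
import random

def stochastic_boost(ys, n_rounds=200, lr=0.1, subsample=0.5):
    """Boost a constant model toward the targets using row subsampling."""
    pred = 0.0                                        # start from a constant prediction of 0
    for _ in range(n_rounds):
        rows = random.sample(range(len(ys)), max(1, int(subsample * len(ys))))
        residuals = [ys[i] - pred for i in rows]      # residuals on the sampled rows only
        pred += lr * sum(residuals) / len(residuals)  # fit and add the constant weak learner
    return pred

est = stochastic_boost([1, 2, 3, 4, 5, 6])
```

Each round sees a different half of the data, which decorrelates successive weak learners; the estimate still converges to a neighborhood of the target mean.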

Hyperparameter Tuning

A simple GBM model contains two categories of hyperparameters: boosting hyperparameters and tree-specific hyperparameters.

Boosting Hyperparameters

  • Number of trees: The total number of trees in the sequence or ensemble. The averaging of independently grown trees in bagging and random forests makes it very difficult to overfit with too many trees. However, GBMs function differently as each tree is grown in sequence to fix up the past tree’s mistakes. For example, in regression, GBMs will chase residuals as long as you allow them to.
  • Learning rate: Determines the contribution of each tree on the final outcome and controls how quickly the algorithm proceeds down the gradient descent (learns). Values range from 0-1 with typical values between 0.001-0.3. Smaller values make the model robust to the specific characteristics of each individual tree, thus allowing it to generalize well. Smaller values also make it easier to stop prior to overfitting; however, they increase the risk of not reaching the optimum with a fixed number of trees and are more computationally demanding. This hyperparameter is also called shrinkage.

Tree-Specific Hyperparameters

  • Tree depth: Controls the depth of the individual trees. Typical values range from a depth of 3-8 but it is not uncommon to see a tree depth of 1. Smaller depth trees such as decision stumps are computationally efficient (but require more trees); however, higher depth trees allow the algorithm to capture unique interactions but also increase the risk of over-fitting.
  • Minimum number of observations in terminal nodes: Also controls the complexity of each tree. Since we tend to use shorter trees this rarely has a large impact on performance.

Tuning Strategy

Unlike random forests, GBMs can have high variability in accuracy dependent on their hyperparameter settings. So tuning can require much more strategy than a random forest model.

  1. Choose a relatively high learning rate (e.g. 0.1).
  2. Determine the optimum number of trees for this learning rate.
  3. Fix tree hyperparameters and tune the learning rate, assessing speed versus performance.
  4. Tune tree-specific parameters for the decided learning rate.
  5. Once tree-specific parameters have been found, lower the learning rate to assess for any improvements in accuracy.
  6. Use final hyperparameter settings and increase CV procedures to get more robust estimates.

Often, the above steps are performed with a simple validation procedure or 5-fold CV due to computational constraints.

Stochastic Gradient Boosting Tuning

When adding in a stochastic procedure, you can either include it in step 4) in the general tuning strategy above, or once you’ve found the optimal basic model.

XGBoost: Extreme Gradient Boosting

Extreme gradient boosting (XGBoost) is an optimized distributed gradient boosting library that is designed to be efficient, flexible, and portable across multiple languages.

Parallel Processing

Since gradient boosting is sequential in nature, it is difficult to parallelize across trees. XGBoost instead parallelizes the construction of each individual tree, for example by evaluating candidate splits across features in parallel.

Continue with Existing Model

A user can train an XGBoost model, save the results, and later on return to that model and continue building onto the results.

Regularization Parameters in XGBoost

xgboost provides multiple regularization parameters to help reduce model complexity and guard against overfitting. The first, gamma, is a pseudo-regularization hyperparameter known as a Lagrangian multiplier that controls the complexity of a given tree. gamma specifies the minimum loss reduction required to make a further partition on a leaf node of the tree. When gamma is specified, xgboost will grow the tree to the specified max depth but then prune the tree to find and remove splits that do not meet the specified gamma. gamma tends to be worth exploring as the trees in your GBM become deeper and when you see a significant difference between the train and test CV error. The value of gamma ranges from $0$ to $\infty$ (0 means no constraint while large numbers mean higher regularization). Two more traditional regularization parameters are alpha and lambda: alpha provides an L1 regularization and lambda provides an L2 regularization. Setting both of these to greater than 0 results in an elastic net regularization; similar to gamma, these parameters can range from $0$ to $\infty$. All three hyperparameters (gamma, alpha, lambda) work to constrain model complexity and reduce overfitting. Although gamma is more commonly implemented, your tuning strategy should explore the impact of all three.
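As an illustration, here is a hypothetical parameter dictionary combining the three regularization hyperparameters in the style of xgboost's `params` argument (the key names follow the xgboost documentation; the specific values are arbitrary, not recommendations):

```python
params = {
    "max_depth": 6,     # grow trees to this depth, then prune splits failing gamma
    "gamma": 1.0,       # minimum loss reduction required to keep a split
    "alpha": 0.1,       # L1 regularization term on leaf weights
    "lambda": 1.0,      # L2 regularization term on leaf weights
    "eta": 0.1,         # learning rate (shrinkage)
}
```

A dictionary like this would be passed to `xgboost.train` alongside the training data.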

Dropout

Dropout is an alternative approach to reduce overfitting and can loosely be described as regularization. The dropout approach has been widely employed in deep learning to prevent deep neural networks from overfitting. Dropout can also be used to address overfitting in GBMs. When constructing a GBM, the first few trees added at the beginning of the ensemble typically dominate the model performance, while trees added later typically improve the prediction for only a small subset of the feature space. The dropout variant of boosting (DART) counters this by randomly dropping a subset of the existing trees when fitting each new tree, so that later trees carry more of the predictive load.

Advantages and Applications of Gradient Boosting

One advantage of boosted classifiers is that during test time the computation $H(\mathbf{x})=\sum_{t=1}^T \alpha_t h_t(\mathbf{x})$ can be stopped prematurely if it becomes clear which way the prediction goes. This is particularly interesting in search engines, where the exact ranking of results is typically only interesting for the top 10 search results. Stopping the evaluation of lower-ranked search results can lead to tremendous speed-ups. A similar approach is also used by the Viola-Jones algorithm to speed up face detection in images. Here, the algorithm scans regions of an image to detect possible faces. As almost all regions in natural images do not contain faces, there are huge savings if the evaluation can be stopped after just a few weak learners are evaluated. These classifiers are referred to as cascades, which spend very little time on the common case (no face) but more time on the rare interesting case (face). With this approach Viola and Jones were the first to achieve real-time face detection on low-performance hardware (e.g. cameras).

Gradient Boosted Regression Trees is one of the most popular algorithms for Learning to Rank, the branch of machine learning focused on learning ranking functions, for example for web search engines.

Real-World Applications

Gradient boosting has become a dominant force in machine learning, with applications spanning various industries, including:

  • Predicting customer churn
  • Detecting asteroids
  • Learning to Rank for web search engines
