XGBoost: A Comprehensive Guide to Extreme Gradient Boosting

Traditional machine learning models, such as decision trees and random forests, are often easy to interpret but may struggle with accuracy on complex datasets. XGBoost, short for eXtreme Gradient Boosting, is an advanced machine learning algorithm designed for efficiency, speed, and high performance. It has grown in popularity in recent years because of its strong track record on Kaggle structured-data competitions, where it has powered many winning solutions.

Introduction to XGBoost

XGBoost is an optimized, distributed gradient boosting library that trains machine learning models in an efficient and scalable way. It is a form of ensemble learning: many small, simple models (typically decision trees) are combined, one at a time, through a technique known as "boosting" to produce a single, more robust prediction.

What is Boosting?

Boosting is an ensemble method that creates a sequence of models, each attempting to correct the mistakes of the models before it in the sequence. The first model is built on the training data, the second model improves on the first, the third improves on the second, and so on.

In boosting, the training dataset is passed to classifier 1, which incorrectly predicts some data points. The weights of these incorrectly predicted data points are increased (to a certain extent) and the reweighted data is sent to the next model, classifier 2. Classifier 2 correctly predicts some of the data points that classifier 1 missed, but it also makes errors of its own. This process continues until the combined final classifier predicts the training data points far more accurately than any single model could.

Classifier models can be added until every item in the training dataset is predicted correctly or a maximum number of models is reached. The optimal number of models to train can be determined through hyperparameter tuning.

How XGBoost Works

XGBoost builds decision trees sequentially with each tree attempting to correct the mistakes made by the previous one. The process can be broken down as follows:

  1. Start with a base learner: The first decision tree is trained on the data. In regression tasks, this base model simply predicts the average of the target variable.
  2. Calculate the errors: After training the first tree the errors between the predicted and actual values are calculated.
  3. Train the next tree: The next tree is trained on the errors of the previous tree. This step attempts to correct the errors made by the first tree.
  4. Repeat the process: This process continues with each new tree trying to correct the errors of the previous trees until a stopping criterion is met.
  5. Combine the predictions: The final prediction is the sum of the predictions from all the trees.
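The five steps above can be sketched as a minimal boosting loop. The code below is an illustrative toy (plain NumPy, with regression stumps standing in for full decision trees), not XGBoost itself:

```python
import numpy as np

def fit_stump(x, residuals):
    # Try each unique x value as a split threshold and keep the one
    # that minimizes the squared error of the two leaf means.
    best = None
    for t in np.unique(x)[:-1]:
        left, right = residuals[x <= t], residuals[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lval, rval = best
    return lambda q: np.where(q <= t, lval, rval)

# Toy 1-D regression data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])

pred = np.full_like(y, y.mean())       # step 1: base learner predicts the average
trees, lr = [], 0.5                    # learning rate shrinks each tree's contribution
for _ in range(20):                    # steps 2-4: fit each new stump to the errors
    residuals = y - pred
    stump = fit_stump(x, residuals)
    trees.append(stump)
    pred = pred + lr * stump(x)        # step 5: predictions from all trees add up

print(np.abs(y - pred).mean())         # the error shrinks as trees are added
```

Each stump only has to model what the previous ensemble got wrong, which is why the error keeps dropping as trees accumulate.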

Mathematics Behind XGBoost Algorithm

Training can be viewed as an iterative process: we start with an initial prediction, often set to zero, and then add one tree at a time to reduce the remaining errors. Mathematically the model can be represented as:

$$\hat{y}_{i} = \sum_{k=1}^{K} f_k(x_i)$$

Where :

  • $\hat{y}_{i}$ is the final predicted value for the $i^{th}$ data point
  • $K$ is the number of trees in the ensemble
  • $f_k(x_i)$ represents the prediction of the $k^{th}$ tree for the $i^{th}$ data point.

The objective function in XGBoost consists of two parts: a loss function and a regularization term. The loss function measures how well the model fits the data, while the regularization term penalizes overly complex trees. The general form of the objective is:

$$obj(\theta) = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}) + \sum_{k=1}^{K} \Omega(f_{k})$$

Where:

  • $l(y_{i}, \hat{y}_{i})$ is the loss function which computes the difference between the true value $y_i$ and the predicted value $\hat{y}_i$,
  • $\Omega(f_{k})$ is the regularization term which discourages overly complex trees.

Now instead of fitting the model all at once we optimize the model iteratively. We start with an initial prediction $\hat{y}_i^{(0)} =0$ and at each step we add a new tree to improve the model. The updated predictions after adding the $t^{th}$ tree can be written as:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

Where:

  • $\hat{y}_i^{(t-1)}$ is the prediction from the previous iteration
  • $f_t(x_i)$ is the prediction of the $t^{th}$ tree for the $i^{th}$ data point.

The regularization term $\Omega(f_t)$ penalizes complex trees based on the number of leaves in the tree and the magnitude of the leaf weights. It is defined as:

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Where:

  • $T$ is the number of leaves in the tree
  • $\gamma$ is a regularization parameter that controls the complexity of the tree
  • $\lambda$ is a parameter that penalizes the squared weight of the leaves $w_j$​

Finally, when deciding how to split the nodes in the tree we compute the information gain for every possible split. The information gain for a split is calculated as:

$$Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

Where:

  • $G_L, G_R$ are the sums of gradients in the left and right child nodes
  • $H_L, H_R$ are the sums of Hessians in the left and right child nodes

By calculating the information gain for every possible split at each node XGBoost selects the split that results in the largest gain which effectively reduces the errors and improves the model's performance.
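The gain formula translates directly into a few lines of code. This is an illustrative helper, not XGBoost's internal implementation:

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of a candidate split, following the formula above.

    G_L, G_R: sums of gradients in the left/right child nodes.
    H_L, H_R: sums of Hessians in the left/right child nodes.
    lam: L2 regularization parameter (lambda); gamma: split penalty.
    """
    def score(G, H):
        # Quality of a single leaf: G^2 / (H + lambda)
        return G * G / (H + lam)

    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# A split that separates positive from negative gradients has positive gain
print(split_gain(2.0, 3.0, -2.0, 3.0))
```

A negative gain means the split costs more (via $\gamma$) than it helps, which is exactly the condition under which XGBoost prunes a split away.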

What Makes XGBoost "eXtreme"?

XGBoost extends traditional gradient boosting by including regularization terms in the objective function, which improves generalization and helps prevent overfitting.

Preventing Overfitting

XGBoost incorporates several techniques to reduce overfitting and improve model generalization:

  • Learning rate (eta): Controls each tree’s contribution; a lower value makes the model more conservative.
  • Regularization: Adds penalties to complexity to prevent overly complex trees.
  • Pruning: Trees are grown to a maximum depth and then pruned back; splits that do not improve the objective function are removed, keeping trees simpler and faster.
  • Combination effect: Using learning rate, regularization, and pruning together enhances robustness and reduces overfitting.
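These techniques map directly onto XGBoost's hyperparameters. A sketch of a typical configuration follows; the values are illustrative starting points, not recommendations:

```python
# Illustrative hyperparameter values only; tune them for your own data
params = {
    "eta": 0.1,        # learning rate: lower values make the model more conservative
    "lambda": 1.0,     # L2 penalty on the leaf weights (regularization)
    "gamma": 0.5,      # minimum gain required to keep a split (pruning)
    "max_depth": 4,    # caps tree depth, limiting complexity
}
```

A dictionary like this would typically be passed to `xgb.train()`, or the equivalent keyword arguments given to the scikit-learn style `XGBClassifier`/`XGBRegressor` wrappers.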

Tree Structure

XGBoost builds trees level-wise (breadth-first) rather than the conventional depth-first approach, adding nodes at each depth before moving to the next level.

  • Best splits: Evaluates every possible split for each feature at each level and selects the one that minimizes the objective function (e.g., MSE for regression, cross-entropy for classification).
  • Feature prioritization: Level-wise growth reduces overhead, as all features are considered simultaneously, avoiding repeated evaluations.
  • Benefit: Handles complex feature interactions effectively by considering all features at the same depth.

Handling Missing Data

XGBoost manages missing values robustly during training and prediction using a sparsity-aware approach.

  • Sparsity-Aware Split Finding: Treats missing values as a separate category when evaluating splits.
  • Default direction: During tree building, missing values follow a default branch.
  • Prediction: Instances with missing features follow the learned default branch.
  • Benefit: Ensures robust predictions even with incomplete input data.
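The default-direction idea can be illustrated with a tiny routing function (a toy sketch, not XGBoost's actual code):

```python
import math

def route(value, threshold, default_left):
    """Sparsity-aware routing: missing values take the learned default branch."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        # Missing feature: follow the default direction learned during training
        return "left" if default_left else "right"
    # Present feature: ordinary threshold comparison
    return "left" if value < threshold else "right"

print(route(float("nan"), 5.0, default_left=True))  # missing value -> default branch
print(route(7.0, 5.0, default_left=True))           # present value -> compared to threshold
```

During training, XGBoost tries sending the missing values both ways at each split and keeps the direction that yields the higher gain; that direction becomes the default branch used here.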

Cache-Aware Access

XGBoost optimizes memory usage to speed up computations by taking advantage of CPU cache.

  • Memory hierarchy: Frequently accessed data is stored in the CPU cache.
  • Spatial locality: Nearby data is accessed together to reduce memory access time.
  • Benefit: Reduces reliance on slower main memory, improving training speed.

XGBoost Fundamentals Explained

In the realm of machine learning, XGBoost stands out as a powerful tool. Understanding its fundamental principles, including gradient boosting and its core components, is essential for harnessing its capabilities.

Explanation of Gradient Boosting

Gradient Boosting is a machine learning ensemble method that combines the predictions of multiple models to create a stronger and more accurate model.

It works sequentially, where each new model (typically decision trees) focuses on correcting the errors made by the previous models.

By iteratively minimising the prediction errors, Gradient Boosting builds a robust predictive model that excels in both classification and regression tasks.

This approach is akin to a team of experts refining their judgments over time, with each expert addressing the mistakes of their predecessors to achieve a more precise final decision.

Key Components of XGBoost

  • Decision Trees: XGBoost uses decision trees as base models. These trees are constructed and combined to form a powerful ensemble. Each tree captures specific patterns or relationships in the data.
  • Objective Functions: Objective functions in XGBoost define the optimisation goals during training. By selecting the appropriate objective function, you can tailor XGBoost to your specific task, such as minimising errors for regression or maximising information gain for classification.
  • Learning Tasks: XGBoost can be applied to various learning tasks, including regression, classification, and ranking. The learning task determines the type of output and the associated objective function.

Feature Importance in XGBoost

Features, in a nutshell, are the variables we use to predict the target variable. Sometimes, we are not satisfied with just knowing how good our machine learning model is; we would also like to know which features have the most predictive power. There are various reasons why knowing feature importance can help us:

  • If you know that a certain feature is more important than others, you can pay more attention to it and see whether it improves your model further.
  • After you have run the model, you can check whether dropping a few features improves it.
  • Initially, if the dataset is small, the time taken to run a model is not a significant factor while designing a system. But if the strategy is complex and requires a large dataset to run, then the computing resources and the time taken to run the model become important factors.

Visualising Feature Importance

Visualising feature importance involves creating charts, graphs, or plots to represent the relative significance of different features in a machine learning model.

Visualisations can be in the form of bar charts, heatmaps, or scatter plots, making it easier to grasp the hierarchy of feature importance.

By visually understanding the importance of features, data scientists and stakeholders can make informed decisions regarding feature engineering and model interpretation.

The good thing about XGBoost is that it contains an inbuilt function to compute the feature importance and we don’t have to worry about coding it in the model.

Understanding How the Model Makes Predictions

To comprehend how an XGBoost model arrives at its predictions, one must delve into the model's internal workings.

This process includes tracing the path through the decision trees, considering the learned weights associated with each branch, and combining these elements to produce the final prediction.

Understanding this process is crucial for model interpretation, debugging, and ensuring the model's transparency and reliability, especially in sensitive or high-stakes applications.

XGBoost in Practice: A Trading Example

XGBoost is useful for data scientists, machine learning engineers, researchers, software developers, students, and business analysts looking for a quick and straightforward way to create and apply machine learning models. Let's explore how to use the XGBoost model for trading, including installation and a simple Python code example.

How to Install XGBoost in Anaconda

Anaconda is a Python distribution and environment manager that makes it simple to write Python code and takes care of the nitty-gritty of package management. Hence, we are specifying the steps to install XGBoost in Anaconda. It's actually just one line. Open the Anaconda prompt and input the following:

pip install xgboost

The Anaconda environment will download the required setup file and install it for you.

XGBoost Python Code Example

We will divide the XGBoost Python code into the following sections for a better understanding of the model:

  1. Import libraries: Each library's purpose is noted in the comments; for example, when importing the XGBoost Python library, we add # Import XGBoost as a comment.
  2. Define parameters: We define the list of stocks, the start date and the end date we will be working with in this blog. Just to make things interesting, we will use the XGBoost Python model on companies such as Apple, Amazon, Netflix, Nvidia and Microsoft.
  3. Creating predictors and target variables: We have also defined a list of predictors from which the model will pick the best predictors. Here, we have the percentage change and the standard deviation with different time periods as the predictor variables.

The target variable is the next day's return. If the next day’s return is positive we label it as 1 and if it is negative then we label it as -1. You can also try to create the target variables with three labels such as 1, 0 and -1 for long, no position and short.
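The labelling rule can be sketched in pandas with hypothetical prices (the real code would use downloaded stock data):

```python
import numpy as np
import pandas as pd

# Hypothetical close prices standing in for real downloaded stock data
close = pd.Series([100.0, 101.0, 100.5, 102.0, 101.0])

next_day_return = close.pct_change().shift(-1)  # tomorrow's return, aligned to today
# Label 1 if the next day's return is positive, otherwise -1
# (NaN compares False, so the last row, with no next day, gets -1)
target = np.where(next_day_return > 0, 1, -1)
print(target)
```

Shifting by -1 is what makes this a prediction problem: today's features are paired with tomorrow's return, so the model never sees the value it is asked to predict.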

Initialise the XGBoost Machine Learning Model

We will initialise the classifier model. We will set two hyperparameters namely max_depth and n_estimators.

Cross Validation in Train Dataset

All right, we will now perform cross-validation on the train set to check the accuracy.

Feature Importance

We have plotted the top 7 features, sorted by their importance. The XGBoost model tells us that pct_change_15 is the most important feature.
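Sorting and slicing the importances can be done as follows; the scores here are hypothetical stand-ins for the model's `feature_importances_` attribute:

```python
import numpy as np

# Hypothetical feature names and importance scores; real values would come
# from model.feature_importances_ after fitting
features = np.array(["pct_change_15", "std_10", "pct_change_5", "std_30",
                     "pct_change_1", "std_5", "pct_change_30", "std_15"])
importances = np.array([0.22, 0.18, 0.15, 0.12, 0.11, 0.09, 0.07, 0.06])

top = np.argsort(importances)[::-1][:7]  # indices of the 7 largest scores
for name, score in zip(features[top], importances[top]):
    print(name, score)
```

With a fitted model, `xgboost.plot_importance(model)` produces the same ranking as a ready-made bar chart.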

Individual Stock Performance

Let’s see how the XGBoost based strategy returns held up against the normal daily returns i.e. the buy and hold strategy. We will plot a comparison graph between the strategy returns and the daily returns for all the companies we had mentioned before.

In the output above, the XGBoost model performed the best for NFLX in certain time periods. For other stocks the model didn’t perform that well.
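The comparison itself can be sketched as follows, with hypothetical daily returns and model signals in place of the real data:

```python
import pandas as pd

# Hypothetical daily returns and model signals (1 = long, -1 = short)
daily_returns = pd.Series([0.01, -0.02, 0.015, 0.01])
signal = pd.Series([1, -1, 1, 1])

# Trade on yesterday's signal to avoid look-ahead bias
strategy_returns = signal.shift(1) * daily_returns

cum_strategy = (1 + strategy_returns.fillna(0)).cumprod()  # XGBoost strategy
cum_buy_hold = (1 + daily_returns).cumprod()               # buy and hold
print(cum_strategy.iloc[-1], cum_buy_hold.iloc[-1])
```

Plotting `cum_strategy` against `cum_buy_hold` for each stock gives the comparison graph described above; the `shift(1)` is essential, since acting on a signal the same day it is generated would leak future information into the backtest.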

Performance of Portfolio

What would happen if we invested in all the companies equally and traded according to the XGBoost model's signals? Finally, let us visualise the cumulative returns of such a portfolio.
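With hypothetical per-stock strategy returns, an equal-weight portfolio and its cumulative return can be computed as:

```python
import pandas as pd

# Hypothetical per-stock strategy returns; real values would come from the backtest
rets = pd.DataFrame({
    "AAPL": [0.01, -0.01],
    "AMZN": [0.02, 0.00],
})

portfolio = rets.mean(axis=1)        # equal weight across all companies each day
cum = (1 + portfolio).cumprod()      # cumulative return of the portfolio
print(cum.iloc[-1])
```

Averaging across stocks each day is what "investing equally" means here: a loss in one name can be offset by gains in the others, which is why the portfolio curve is typically smoother than any single stock's.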
