Random Forest: A Comprehensive Guide to Ensemble Learning with Decision Trees

Random Forest is a powerful and versatile machine learning algorithm widely used for classification and regression tasks. It is an ensemble learning method that operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

Introduction to Ensemble Learning

Ensemble learning is a technique that combines the predictions of multiple "weak" models to create a stronger, more accurate model. The core idea behind ensemble learning is that a group of models can often outperform a single model, especially if the individual models are diverse and make different types of errors.

One of the most well-known ensemble methods is bagging, short for bootstrap aggregation. In bagging, multiple random samples are drawn from the training set with replacement, and a separate model is trained independently on each sample. Depending on the type of task (regression or classification), averaging or taking the majority of those predictions yields a more accurate estimate. Random Forest is an extension of the bagging method, enhancing it with feature randomness.

Condorcet’s Jury Theorem suggests that the majority vote aggregation can have better accuracy than the individual models. There are other methods to aggregate predictions, such as weighted majority vote.
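As a toy sketch of majority-vote aggregation (the per-model predictions below are made up for illustration, not from any real classifier), three models that each make one mistake in a different place can be perfectly correct in aggregate:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the per-model predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers, each wrong on a different one of 5 samples
model_a = [1, 0, 0, 1, 0]
model_b = [1, 1, 1, 1, 0]
model_c = [0, 0, 1, 1, 0]
truth   = [1, 0, 1, 1, 0]

# Each model is right on 4/5 samples, but the majority is right on 5/5
ensemble = [majority_vote(votes) for votes in zip(model_a, model_b, model_c)]
print(ensemble)
```

Because each model errs on a different sample, the majority vote corrects every individual mistake, which is exactly the intuition behind Condorcet's theorem.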

Decision Trees as Building Blocks

At the heart of the Random Forest algorithm lies the decision tree. A decision tree is a supervised learning algorithm that uses a tree-like structure to model the relationship between features and a target variable. It starts with a basic question, such as, “Should I surf?” From there, you can ask a series of questions to determine an answer, such as, “Is it a long period swell?” or “Is the wind blowing offshore?”. These questions make up the decision nodes in the tree, acting as a means to split the data. Each question moves an observation closer to a final decision, denoted by a leaf node. Observations that fit the criteria follow the “Yes” branch, and those that don’t follow the alternate path. Decision trees seek to find the best split to subset the data and are typically trained through the Classification and Regression Tree (CART) algorithm.
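To make the CART idea concrete, here is a minimal sketch of finding the best split on a single numeric feature by Gini impurity. The surf-themed feature and toy data are invented for illustration:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Find the threshold on one feature that most reduces Gini impurity."""
    parent = gini(labels)
    best = (None, 0.0)  # (threshold, impurity reduction)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        reduction = parent - (len(left) / n * gini(left)
                              + len(right) / n * gini(right))
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Toy "Should I surf?" data: swell period in seconds vs. surf / no-surf
periods = [4, 5, 6, 12, 13, 14]
labels  = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(periods, labels))  # the midpoint 9.0 separates the classes
```

The candidate thresholds are midpoints between consecutive unique values, exactly as in the algorithm steps later in this article.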

While decision trees are intuitive and easy to understand, they have limitations. They can be prone to overfitting, meaning they learn the training data too well and perform poorly on new, unseen data. They can also be sensitive to small changes in the training data, leading to high variance.

The Random Forest Algorithm: Combining Bagging and Feature Randomness

The Random Forest algorithm addresses the limitations of decision trees by combining two key techniques: bagging and feature randomness.

Bagging: Bootstrap Aggregation

Bagging involves creating multiple subsets of the training data by randomly sampling with replacement. This means that some data points may appear multiple times in a subset, while others may not be included. Each subset is then used to train a separate decision tree.
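A bootstrap sample is straightforward to sketch in code. This toy example (sample indices only, no real dataset) shows how sampling with replacement duplicates some points and omits others:

```python
import random

random.seed(0)  # fixed seed so the example is reproducible
data = list(range(10))  # indices of ten training samples

# One bootstrap sample: draw n items with replacement
bootstrap = [random.choice(data) for _ in data]

# Samples never drawn are "out-of-bag" for this tree
oob = [i for i in data if i not in bootstrap]

print(sorted(bootstrap))
print(oob)
```

Each tree in the forest would get its own independently drawn bootstrap sample like this one.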

Feature Randomness: The Random Subspace Method

Feature randomness, also known as feature bagging or the random subspace method, selects a random subset of features at each split, which keeps the correlation among decision trees low. This is a key difference between decision trees and random forests: when building each tree, the algorithm does not consider all the features (columns) at once. It picks a few at random to decide how to split the data.
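The feature subset drawn at a split might look like the following sketch. The feature names are invented, and the √n_features rule shown here is the common default for classification (discussed in the algorithm steps below):

```python
import math
import random

random.seed(1)  # fixed seed so the example is reproducible
features = ["wind", "swell_period", "tide", "temp", "crowd", "direction"]

# At each split, consider only about sqrt(n_features) randomly chosen features
k = max(1, int(math.sqrt(len(features))))
candidates = random.sample(features, k)
print(candidates)  # the only features this particular split may use
```

Because each split sees a different random subset, two trees grown on similar data still end up making different sequences of decisions.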

Leo Breiman, the inventor of the random forest model, writes in his paper that "[o]ur results indicate that better (lower generalization error) random forests have lower correlation between classifiers and higher strength." The high variance of the individual decision trees helps keep the correlation among them low; the bagging method and random feature selection are the key innovations that keep it low.

Algorithm Steps

Here's a step-by-step breakdown of the Random Forest algorithm:

  1. Bootstrap Sample Creation: For each tree in the forest:

    • Create a new training set by random sampling from the original data with replacement until reaching the original dataset size. This is called bootstrap sampling.

    • Mark and set aside non-selected samples as out-of-bag (OOB) samples for later error estimation.

  2. Tree Construction:

    • Start at the root node with the complete bootstrap sample.

    • Calculate initial node impurity using all samples in the node:

      • Classification: Gini impurity or entropy
      • Regression: Mean Squared Error (MSE)
    • Select a random subset of features from the total available features:

      • Classification: √n_features
      • Regression: n_features/3
    • For each selected feature:

      • Sort data points by feature values.

      • Identify potential split points (midpoints between consecutive unique feature values).

    • For each potential split point:

      • Divide samples into left and right groups.

      • Calculate left child impurity using its samples.

      • Calculate right child impurity using its samples.

      • Calculate impurity reduction: parent_impurity - (left_weight × left_impurity + right_weight × right_impurity)

    • Split the current node data using the feature and split point that gives the highest impurity reduction. Then pass data points to the respective child nodes.

    • For each child node, repeat the process (selecting a random feature subset, evaluating split points, and splitting) until a stopping criterion is met:

      • Pure node or minimum impurity decrease

      • Minimum samples threshold

      • Maximum depth

      • Maximum leaf nodes

  3. Tree Construction Repetition: Repeat the whole of Step 2 for each remaining bootstrap sample until the forest contains the desired number of trees.

  4. Prediction: For prediction, route new samples through all trees and aggregate:

    • Classification: majority vote

    • Regression: mean prediction
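Assuming scikit-learn is available, the whole procedure above reduces to a few lines in practice. The dataset and hyperparameter choices here are illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data for demonstration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# n_estimators = number of trees; max_features="sqrt" matches the
# sqrt(n_features) rule for classification described above
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
forest.fit(X_tr, y_tr)  # steps 1-3: bootstrap, grow, repeat
print(round(forest.score(X_te, y_te), 3))  # step 4: aggregated prediction
```

Each fitted tree is accessible via `forest.estimators_`, which can be useful for inspecting how much individual trees disagree.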

Variance in Composition

We previously discussed how the decision tree model suffers from high variance. However, this variance among trees is employed in the random forest as a feature, not a bug.

If every tree produced the same prediction, the ensemble's accuracy could not improve on a single tree's. Plotting per-tree predictions as a grid reveals an irregular pattern of errors: the trees differ in where they make mistakes. As expected, the random forest performs best overall even when some individual trees have very low accuracy. Note, however, that the majority can still be wrong on a given sample even while several trees predict it correctly. This observation inspired other aggregation methods, such as Boosting, which is very popular today.

Out-of-Bag (OOB) Evaluation

Because each tree is trained on a bootstrap sample, approximately one-third of the original data is left out of that sample. Rather than ignoring these out-of-bag (OOB) samples, Random Forest uses them as a convenient validation set: each tree is tested on the data it never saw during training. Averaging these per-tree OOB scores gives a built-in estimate of generalization performance without requiring a separate validation dataset.
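In scikit-learn, OOB evaluation is a single flag; the dataset below is synthetic and chosen only to demonstrate the mechanism:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True evaluates each tree on the samples it never saw in training
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print(round(forest.oob_score_, 3))  # built-in generalization estimate
```

The `oob_score_` attribute typically tracks held-out test accuracy closely, which is why OOB evaluation can substitute for a validation split when data is scarce.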

Advantages of Random Forest

The Random Forest algorithm offers several advantages over other machine learning algorithms:

  • High Accuracy: By combining many diverse decision trees and aggregating their votes, Random Forest typically achieves high accuracy, often outperforming single decision trees and other ensemble methods.

  • Reduced Risk of Overfitting: Individual decision trees run the risk of overfitting because they tend to fit the training samples too tightly. With a robust number of trees, however, a random forest is far less likely to overfit, since averaging uncorrelated trees lowers the overall variance and prediction error.

  • Robustness to Outliers: Random forest is relatively robust to outliers because tree splits are based on thresholds, not distance metrics.

  • Handles Non-Linearity: Random forest does not assume linearity between inputs and outputs.

  • Flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists.

  • Easy to Determine Feature Importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model.

  • Versatility: Random forest is also a very handy algorithm because the default hyperparameters it uses often produce a good prediction result.
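Feature importance, mentioned above, is exposed directly on a fitted forest in scikit-learn. This sketch uses the classic Iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Impurity-based importances, normalized so they sum to 1
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

These impurity-based importances are computed from how much each feature reduces impurity across all splits in all trees; permutation importance is a common alternative when features vary widely in cardinality.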

Disadvantages of Random Forest

While Random Forest offers numerous advantages, it also has some drawbacks: