Tackling Class Imbalance in Machine Learning: A Comprehensive Guide

Class imbalance is a common issue in machine learning, particularly in classification problems, where the distribution of examples within a dataset is skewed or biased. This means that one class has significantly more instances than the other(s). In such scenarios, standard machine learning algorithms tend to be biased towards the majority class, leading to poor performance on the minority class, which is often the class of interest. This article explores various techniques to address class imbalance and improve the performance of machine learning models.

Introduction: The Imbalance Problem

Imbalanced classification is a frequent challenge in machine learning, especially in binary classification tasks. It arises when the training dataset exhibits an unequal distribution of classes, potentially biasing the trained model. Real-world examples of imbalanced classification problems are abundant and include:

  • Fraud detection: Identifying fraudulent transactions within a large volume of legitimate transactions.
  • Claim prediction: Predicting which policyholders are likely to file an insurance claim.
  • Default prediction: Determining which borrowers are likely to default on their loans.
  • Churn prediction: Identifying customers who are likely to cancel their subscriptions.
  • Spam detection: Filtering out spam emails from legitimate emails.
  • Anomaly detection: Detecting unusual patterns or outliers in data.

Addressing class imbalance is crucial for enhancing model performance and ensuring accuracy, especially when the minority class is of particular interest. Note that most of the examples above are binary classification problems.

From Multi-Class to Binary Classification

Binary classification involves dividing a dataset into two groups: a positive group and a negative group. These principles can also be extended to multi-class problems by decomposing the problem into multiple two-class problems. This technique allows us to address class imbalance and utilize a range of methods to enhance the performance of our model.
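As a sketch, scikit-learn's OneVsRestClassifier performs exactly this decomposition; the synthetic dataset and parameter values below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Illustrative 3-class problem, decomposed into three one-vs-rest binary problems.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(clf.estimators_))  # one binary classifier per class
```

Each of the three binary sub-problems can then be rebalanced with any of the techniques below.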

Common Practices for Handling Class Imbalance

Several methods can be employed to address class imbalance in machine learning. These include:


  • Resampling techniques: Adjusting the number of samples in the minority or majority class to create a more balanced dataset. This can be achieved through undersampling (reducing the number of samples in the majority class) or oversampling (increasing the number of samples in the minority class).
  • Weight modification on the loss function: Assigning different weights to the classes in the loss function, allowing the model to focus more on the minority class during training.
  • Bias initialization: Adjusting the initial values of the model’s parameters to better reflect the distribution of the training data, particularly the bias of the final layer.

These approaches can be used individually or in combination, depending on the specific problem and dataset.

Resampling Techniques: Under-sampling and Over-sampling

Resampling is a common technique used to address class imbalance. It involves creating a new version of the training dataset with a different class distribution by selecting examples from the original dataset. One popular method of resampling is random resampling, where examples are chosen randomly for the transformed dataset. Resampling is often considered a simple and effective strategy for imbalanced classification problems because it allows the model to more evenly consider examples from different classes during training. However, it is important to carefully consider the trade-offs and limitations of resampling, as it can also introduce additional noise and bias into the dataset.

Under-sampling

Under-sampling techniques aim to balance the class distribution by reducing the number of instances in the majority class. This can be achieved by randomly removing instances from the majority class until the desired balance is achieved.

1. Using Random Under-Sampling

Random under-sampling removes observations from the majority class at random until the majority and minority classes are balanced.

Steps:


  • Firstly, we'll divide the data points from each class into separate DataFrames.
  • After this, the majority class is resampled without replacement, setting the number of data points equal to that of the minority class.
  • In the end, we'll concatenate the down-sampled majority class DataFrame and the original minority class DataFrame.
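These steps can be sketched with pandas; the toy DataFrame below is illustrative:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 majority (0) and 10 minority (1) rows.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})

# Split the classes into separate DataFrames.
df_majority = df[df["label"] == 0]
df_minority = df[df["label"] == 1]

# Down-sample the majority class (without replacement) to the minority size.
df_majority_down = df_majority.sample(n=len(df_minority), random_state=42)

# Concatenate the down-sampled majority and the original minority class.
df_balanced = pd.concat([df_majority_down, df_minority])
print(df_balanced["label"].value_counts())
```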

2. Using RandomUnderSampler from imblearn

RandomUnderSampler is a quick and simple way to even out the data by randomly selecting samples from the classes we want to balance. Its main parameters are:

  • sampling_strategy: Specifies which classes to resample and the desired ratio between them.
  • random_state: Controls the randomization of the resampling for reproducible results. Default value is None.
  • replacement: Whether resampling is done with or without replacement. Boolean; default value is False.

Advantages of Under-sampling:

  • Reduced Complexity: Under-sampling streamlines the dataset and speeds up calculations.
  • Prevents Overfitting: Helps avoid overfitting to the dominant majority class by reducing its overwhelming presence.
  • Simpler Models: results in less complex models that are simpler to understand.

Disadvantages of Under-sampling:

  • Loss of Information: Information loss may occur from removing instances of the majority class.
  • Risk of Bias: Undersampling may cause bias in how the original data are represented.
  • Potential for Instability: The model might become unstable as it grows more susceptible to changes.
  • Unrepresentative Samples: The subset chosen by random under-sampling may be biased and may not accurately represent the full population.

Over-sampling

Over-sampling techniques aim to balance the class distribution by increasing the number of instances in the minority class. This can be achieved by randomly duplicating existing instances from the minority class or by generating synthetic instances.

1. Using RandomOverSampler:

Oversampling is the process of adding more copies of minority-class samples. This approach is helpful when data resources are constrained. It can be done with the RandomOverSampler class in imblearn.

  • sampling_strategy: Sampling Information for dataset.
    • Possible values are 'minority' (only the minority class), 'not minority' (all classes except the minority class), 'not majority' (all classes except the majority class), 'all' (all classes), and 'auto' (equivalent to 'not majority'). Default value is 'auto'.
  • random_state: Controls the randomization of the resampling for reproducible results. Default value is None.
  • shrinkage: Parameter controlling the shrinkage applied when generating smoothed bootstrap samples.
    • Values are: float: shrinkage factor applied to all classes; dict: a per-class shrinkage factor; None: no shrinkage, i.e. plain duplication of existing samples (default).

2. Random Over-Sampling with Imblearn

To address imbalanced data, one approach is to create more examples for the minority classes. A simple way to do this is by randomly selecting and duplicating existing samples.

3. Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is used to generate artificial/synthetic samples for the minority class. It works by randomly choosing a sample from the minority class and finding the k-nearest minority-class neighbors of that sample; a synthetic sample is then added at a random point between the picked sample and one of its neighbors.


  • sampling_strategy: Sampling Information for dataset
  • random_state: Controls the randomization of the algorithm for reproducible results. Default value is None.
  • k_neighbors: Number of nearest neighbours used to generate artificial/synthetic samples. Default value is 5.
  • n_jobs: Number of CPU cores to be used. Default value is None. None here means 1 not 0.

Working of SMOTE Algorithm

An algorithm called SMOTE (Synthetic Minority Over-sampling Technique) is used to rectify dataset class imbalances. To put it briefly, SMOTE generates synthetic samples for the minority class.

  • Identify minority class instances: Determine which dataset instances belong to the minority class.
  • Select a minority instance: Select a minority instance at random from the dataset.
  • Find nearest neighbors: Determine which members of the minority class are the selected instance's k-nearest neighbors.
  • Generate synthetic samples: Pick one of these neighbors and place an artificial sample point at a random position between the neighbor and the selected instance.
  • Repeat until the dataset is balanced.
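Numerically, the generation step places a synthetic point on the line segment between a minority sample and one of its neighbors. A toy numpy sketch with hand-picked points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2-D.
x_i = np.array([1.0, 1.0])    # selected minority sample
x_nn = np.array([2.0, 1.5])   # one of its nearest minority neighbors

# Synthetic sample: a random point on the segment between them.
lam = rng.uniform(0, 1)
x_new = x_i + lam * (x_nn - x_i)
print(x_new)
```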

4. Oversampling with ADASYN (+ How it’s different from SMOTE)

ADASYN is a cousin of SMOTE: both SMOTE and ADASYN generate new samples by interpolation.

But there’s one critical difference. ADASYN generates samples next to the original samples that are wrongly classified by a k-nearest-neighbors classifier, focusing on the regions that are hardest to learn. SMOTE, by contrast, does not differentiate between samples that are correctly or wrongly classified by the KNN classifier.

Advantages of Over-sampling:

  • Enhanced Model Performance: Enhances the model's capacity to identify patterns in data from minority classes.
  • More Robust Models: It becomes more robust, especially when handling unbalanced datasets.
  • Reduced Risk of Information Loss: Oversampling helps to keep potentially important data from being lost.

Disadvantages of Over-sampling:

  • Increased Complexity: As the dataset grows, so do the computational requirements.
  • Potential Overfitting: Duplicated or synthetic samples may introduce noise into the model fitting process and cause the model to overfit.
  • Algorithm Sensitivity: Some algorithms are sensitive to repeated occurrences of the same instances, which may hurt generalization.

Weights Modification on a Loss Function

The second method for addressing class imbalance is to modify the weights on the loss function. In a balanced dataset, the gradient of the loss function (i.e., the direction towards the local minimum) is calculated as the average gradient over all samples. In an imbalanced dataset, however, this average is dominated by the majority class and may not reflect the optimal direction for the minority class. To address this issue, we can rebalance the gradient either by oversampling as part of the optimization process or by using a weighted loss.

Oversampling involves artificially increasing the number of minority class examples in the dataset, which can help the model more accurately consider those examples during training. Alternatively, using a weighted loss involves assigning higher weights to the minority class examples, so that the model places more emphasis on correctly classifying them. Both of these methods can help improve the performance of the model on imbalanced datasets. For XGBoost, scale_pos_weight can be set to the ratio of negative to positive examples so as to penalise errors on the minority class more heavily.
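As a sketch, scikit-learn exposes weighted losses via the class_weight parameter, and the XGBoost ratio can be computed directly; the synthetic dataset is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# Weighted loss: "balanced" reweights each class inversely to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Equivalent idea for XGBoost: scale_pos_weight = (# negatives) / (# positives).
neg, pos = np.bincount(y)
scale_pos_weight = neg / pos
print(round(scale_pos_weight, 2))
```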

Bias Initialization

The last technique we introduce in this post for addressing class imbalance in machine learning is bias initialization, which involves adjusting the initial values of the model’s parameters to better reflect the distribution of the training data. More specifically, we will set the final layer bias. For example, in an imbalanced binary classification problem with a sigmoid output, we can set the initial bias of the final layer to b = log(P/N), where P is the number of positive examples and N is the number of negative examples. The model then outputs the correct positive-class prior at the start of training, improving its performance on imbalanced datasets.
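For example, with hypothetical counts of P = 1,000 positives and N = 99,000 negatives, the initial bias of a sigmoid output unit works out as:

```python
import numpy as np

# Hypothetical class counts: 1,000 positives among 100,000 examples.
P, N = 1_000, 99_000

# Initial bias for the final sigmoid unit: b = log(P / N).
b0 = np.log(P / N)
print(round(b0, 3))  # -4.595

# Sanity check: with zero weights, the initial output sigmoid(b0)
# equals the positive-class prior P / (P + N).
p0 = 1 / (1 + np.exp(-b0))
print(round(p0, 3))  # 0.01
```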

It is important to carefully consider the trade-offs and limitations of bias initialization, as it can potentially introduce additional bias into the model if you initialize it wrong. However, when used properly, this technique can be an effective and efficient way to address class imbalance and improve the performance of the model.

Classification Metrics: Beyond Accuracy

When working with imbalanced datasets in machine learning, it is crucial to choose the right evaluation metrics in order to accurately assess the performance of the model. For example, in a dataset with 99,000 images of cats and only 1,000 images of dogs, a model that always predicts "cat" achieves 99% accuracy. This metric clearly does not reflect the model’s ability to classify the minority class (dogs).

One useful tool for evaluating the performance of a classifier on imbalanced datasets is the confusion matrix-based metrics. This matrix provides a breakdown of the true positive, true negative, false positive, and false negative predictions made by the model, allowing for a more nuanced understanding of its performance. It is important to consider a variety of metrics when evaluating a model on imbalanced datasets in order to get a comprehensive understanding of its capabilities.

Understanding the Confusion Matrix

In evaluating the performance of a classifier, it is helpful to consider a variety of metrics. A confusion matrix is a useful tool for understanding the true positive (TP) predictions, where the model correctly identified the positive class, as well as the false negative (FN) predictions, where the model incorrectly classified a sample as the negative class that was actually positive. The confusion matrix also provides information on false positive (FP) predictions, where the model incorrectly identified a sample as the positive class that was actually negative, and true negative (TN) predictions, where the model correctly identified the negative class. By considering these different types of predictions, we can gain a more comprehensive understanding of the model’s performance.

Key Evaluation Metrics for Imbalanced Datasets

In order to understand the performance of a classifier, it is important to consider a range of evaluation metrics. Accuracy, precision, and recall are three commonly used metrics that can be calculated from the confusion matrix.

  • Accuracy: Reflects the overall accuracy of the model’s predictions, calculated as the number of correct predictions divided by the total number of predictions.
  • Precision: Measures the proportion of positive predictions that were actually correct, calculated as the number of true positive predictions divided by the total number of positive predictions made by the model. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness.
  • Recall: Also known as sensitivity or true positive rate, captures the proportion of actual positive samples that were correctly predicted by the model, calculated as the number of true positive predictions divided by the total number of actual positive samples. The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness.

In addition to these metrics, it is also important to consider the false positive rate and the false negative rate.

  • The false positive rate represents the proportion of actual negative samples that were incorrectly predicted as positive by the model, calculated as the number of false positive predictions divided by the total number of actual negative samples.
  • The false negative rate reflects the proportion of actual positive samples that were incorrectly predicted as negative by the model, calculated as the number of false negative predictions divided by the total number of actual positive samples.
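Putting the definitions together, these metrics can be computed directly from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for an imbalanced test set of 1,000 samples.
TP, FN, FP, TN = 80, 20, 30, 870

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)                 # sensitivity / true positive rate
false_positive_rate = FP / (FP + TN)
false_negative_rate = FN / (FN + TP)

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Note how accuracy (0.95) looks far stronger than precision (~0.73), illustrating why a single metric is not enough.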

Choosing the Right Metric

As for which metric to prioritise, my recommendation is to assess which one best reflects performance when comparing the predictions to the test data. Consider accuracy vs. precision vs. recall with this example: we build a model on a dataset with a 90% major class and a 10% minor class, but the model fails to correctly classify any of the minor-class observations in the test set. Accuracy still reads 90%, so you should also check precision (how many of the positive predictions were correct) and recall (how many of the actual positives were found). As an example, say a company wants to predict which customers will cancel their subscription to a product (1 = cancel, 0 = do not cancel). In this instance, because we want to minimise false negatives (missed cancellations), we are looking for a high recall score.

Combining Oversampling and Undersampling: SMOTE-Tomek

Now that we have learnt about oversampling and undersampling, can we combine them? Of course! SMOTE-Tomek does exactly that: it first over-samples the minority class with SMOTE and then cleans the result by removing Tomek links, i.e. pairs of nearest-neighbor samples from opposite classes that sit on the class boundary.

Additional Techniques and Considerations

  • Decision Trees: Decision trees frequently perform well on imbalanced data. Tree-based algorithms work by learning a hierarchy of if/else questions.
  • Cost-Sensitive Training: Cost-sensitive training is a machine learning technique in which the algorithm is trained by taking into account the various costs connected to various kinds of errors. The cost of incorrectly classifying instances of one class may differ from the cost of incorrectly classifying instances of the other class in a typical binary classification problem.
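A minimal sketch of cost-sensitive training with a decision tree, where the cost ratio {0: 1, 1: 10} and the synthetic dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=4)

# Errors on the minority class (1) are treated as 10x more costly.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, max_depth=4, random_state=4)
clf.fit(X, y)
print(clf.score(X, y))
```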
