Scikit-Learn Cross-Validation: A Comprehensive Tutorial
When training machine learning models, the primary goal is strong performance on unseen data. Hyperparameter tuning plays a crucial role in achieving that, but tuning hyperparameters against the test set leaks information about it into the model, producing optimistic scores and a model that performs poorly on genuinely new data. Cross-validation (CV) is a robust technique that addresses this issue and provides a more reliable estimate of a model's performance.
Introduction to Cross-Validation
Cross-validation is a statistical method used to evaluate a machine learning model's performance and test its ability to generalize to unseen data, thereby detecting overfitting. It is a fundamental technique in applied machine learning (ML): it is easy to understand and implement, and it produces less biased performance estimates than a single train-test split. Cross-validation helps in selecting the best model and tuning hyperparameters by iteratively training and testing the model on different portions of the data.
Imagine training a machine learning model and wanting to know how it will perform on new, unseen data. K-Fold Cross-Validation offers a sneak peek at how your model might fare in the real world. This guide unpacks the basics of K-Fold Cross-Validation, compares it to simpler methods like the Train-Test Split, explores various cross-validation methods using Python, and explains why choosing the right one can greatly impact your projects.
K-Fold Cross-Validation Explained
K-Fold Cross-Validation is a robust technique used to evaluate the performance of machine learning models. In K-Fold cross-validation, the input data is divided into 'K' number of folds. The model undergoes training with K-1 folds and is evaluated on the remaining fold. This procedure is performed K times, where each fold is utilized as the testing set one time. For example, if K = 10, the K-Fold cross-validation splits the input data into 10 folds, resulting in 10 sets of data to train and test the model. In each iteration, the model uses one fold as test data and the remaining nine folds as training data.
Visualizing K-Fold Cross-Validation
To better understand the behavior of K-Fold cross-validation, consider a classification dataset. By using the make_classification() method, one can create a synthetic binary classification dataset of 100 samples with 20 features. A K-Fold cross-validation procedure can then be prepared for the dataset with 10 folds. Visualizing the training and test data for each fold helps illustrate how the data is split and used in each iteration.
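As a minimal sketch of this setup (the dataset parameters here mirror the description above and are otherwise illustrative), the fold sizes can be printed for each of the 10 iterations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# Synthetic binary classification dataset: 100 samples, 20 features
X, y = make_classification(n_samples=100, n_features=20, random_state=1)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold holds out 10 samples for testing and trains on the other 90
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```

Every sample appears in exactly one test fold, which is what makes the procedure an exhaustive sweep over the data.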
K-Fold Cross-Validation vs. Train-Test Split
While K-Fold Cross-Validation partitions the dataset into multiple subsets to iteratively train and test the model, the Train-Test Split method divides the dataset into just two parts: one for training and the other for testing. K-Fold Cross-Validation provides a more robust and reliable performance estimate because it reduces the impact of data variability. By using multiple training and testing cycles, it minimizes the risk of overfitting to a particular data split.
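To make the contrast concrete, the following sketch (using the Iris dataset purely for illustration) compares a single train-test split score with the mean of five cross-validated scores:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Single train-test split: one estimate, sensitive to which rows land in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: five estimates averaged into one, more stable, score
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(f"Single split accuracy: {single:.3f}")
print(f"5-fold mean accuracy:  {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

The standard deviation of the fold scores also gives a sense of how much the estimate varies with the split, which a single split cannot provide.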
Implementing K-Fold Cross-Validation with Scikit-Learn
Scikit-learn provides the KFold class in the sklearn.model_selection module to implement K-Fold cross-validation. The KFold class offers several parameters:
- `n_splits` (int, default=5): The number of folds. Must be at least 2.
- `shuffle` (bool, default=False): Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.
- `random_state` (int, default=None): When `shuffle` is True, `random_state` affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect.
Example: K-Fold with a Synthetic Regression Dataset
To illustrate how the K-Fold split works, a synthetic regression dataset can be created using the make_regression() method from sklearn.datasets.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=10, n_features=1, noise=0.5, random_state=42)

kf = KFold(n_splits=4)
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {fold}:")
    print(f"  Training dataset index: {train_index}")
    print(f"  Test dataset index: {test_index}")
```

In this example, the code creates a synthetic regression dataset and divides the input data into four folds using the split() method. The output shows the train and test indices for each iteration, demonstrating how the data is split for training and testing.
Example: Cross-Validating Different Regression Models
To demonstrate the effectiveness of K-Fold cross-validation, different regression models can be cross-validated using the California Housing dataset from Scikit-learn.
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(),
    "Random Forest Regression": RandomForestRegressor(),
}

# Perform K-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=kf,
                             scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-scores)
    print(f"----- {name} Cross Validation ------")
    print(f"Scores: {rmse_scores}")
    print(f"Mean: {rmse_scores.mean()}")
    print(f"Standard Deviation: {rmse_scores.std()}")
```

In this code, three different regression models (Linear Regression, Decision Tree Regression, and Random Forest Regression) are created. The cross_val_score() method evaluates each model using negative mean squared error as the scoring metric and K-Fold as the cross-validation procedure; taking the square root of the negated scores gives the RMSE for each fold, allowing the models' mean prediction error to be compared.
Variations of K-Fold Cross-Validation
Besides the standard K-Fold cross-validation, there are several variations to address specific needs:
- Stratified K-Fold: This is a variation of K-Fold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
- Repeated K-Fold: This can be used when one requires to run KFold n times, producing different splits in each repetition.
- Group K-Fold: This is a variation of k-fold which ensures that the same group is not represented in both testing and training sets.
- Stratified Group K-Fold: This is a cross-validation scheme that combines both StratifiedKFold and GroupKFold.
- Leave-One-Out Cross-Validation (LOOCV): An extreme case of k-Fold CV where k is equal to n (the number of samples in the dataset). LOOCV uses one observation to validate and n-1 observations to train.
- Leave-P-Out Cross-Validation (LpOCV): Similar to Leave-one-out CV, it creates all possible training and test sets by using p samples as the test set.
- ShuffleSplit: This method randomly samples the training and test sets on each iteration. If the specified train and test sizes do not cover the whole dataset, a percentage of the data is left out of both sets.
- Repeated Random Sub-Sampling CV (Monte Carlo cross-validation): A variation in which the test set is not a fixed fold. On every iteration, samples are randomly selected from across the dataset as the test set, so a sample may appear in the test set of several iterations or in none at all.
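As a quick illustrative sketch of how several of these variations are constructed in scikit-learn (the toy arrays below are made up for demonstration):

```python
import numpy as np
from sklearn.model_selection import (
    StratifiedKFold, RepeatedKFold, GroupKFold, LeaveOneOut, ShuffleSplit)

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# StratifiedKFold keeps the 50/50 class ratio inside every fold
skf = StratifiedKFold(n_splits=5)
# RepeatedKFold reruns 2-fold CV 3 times with different shuffles -> 6 splits
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)
# GroupKFold never puts the same group in both train and test
gkf = GroupKFold(n_splits=5)
# LeaveOneOut yields one split per sample
loo = LeaveOneOut()
# ShuffleSplit draws an independent random train/test split each iteration
ss = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0)

print(len(list(skf.split(X, y))))          # 5
print(len(list(rkf.split(X))))             # 6
print(len(list(gkf.split(X, y, groups))))  # 5
print(len(list(loo.split(X))))             # 10
print(len(list(ss.split(X))))              # 4
```

All of these classes expose the same split()/get_n_splits() interface, so any of them can be passed to cross_val_score via the cv parameter.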
Cross-Validation for Imbalanced Datasets
In cases where classes are imbalanced, standard cross-validation techniques may not be suitable. Stratified K-Fold is particularly useful in such scenarios. It splits the dataset into k folds such that each fold contains approximately the same percentage of samples of each target class as the complete set. This ensures that both the training and validation sets are representative of the overall class distribution.
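A small sketch (with a made-up 90/10 class split) shows this property directly: each test fold receives the same 10% positive rate as the full dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold of 20 samples gets 2 of the 10 positives (10% base rate)
    print(f"Fold {fold}: positives in test = {y[test_idx].sum()} / {len(test_idx)}")
```

With plain KFold on data this imbalanced, a fold could easily end up with no positive samples at all, making metrics like recall undefined for that fold.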
Cross-Validation for Time Series Data
Traditional cross-validation techniques don't work on sequential data such as time series, because data points cannot be assigned to the train or test set at random: it makes no sense to use values from the future to forecast values in the past. Instead, cross-validation is done on a rolling basis: start with a small subset of the data for training, predict the values that follow, and check the accuracy of those forecasts; the training window then grows to include those points for the next iteration.
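Scikit-learn's built-in TimeSeriesSplit implements this rolling scheme; a small sketch with 12 made-up ordered observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)  # 12 ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # The training window grows; the test window always lies in the future
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In every fold, the largest training index is smaller than the smallest test index, so the model never sees the future during training.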
Custom Cross-Validation Generators
Scikit-learn provides several built-in cross-validation methods, but there are scenarios where these standard methods may not be sufficient, and a custom cross-validation generator is needed. A custom cross-validation generator in Scikit-learn is essentially an iterable that yields train-test splits. Each split is a tuple containing two arrays: the training indices and the test indices.
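A minimal sketch of such a generator (the contiguous-block scheme below is purely illustrative of the interface; contiguous blocks are a poor choice for class-ordered data like Iris):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def simple_block_cv(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) tuples, one contiguous block per fold."""
    indices = np.arange(n_samples)
    for block in np.array_split(indices, n_splits):
        yield np.setdiff1d(indices, block), block

X, y = load_iris(return_X_y=True)
# Any iterable of (train, test) index pairs can be passed straight to cv=
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=simple_block_cv(len(X), n_splits=5))
print(scores)
```

Because cross_val_score accepts any iterable of index pairs, no class or base-class machinery is strictly required; a plain generator function is enough.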
Example: Stratified Oversampling
Consider a scenario with an imbalanced dataset where you want to apply oversampling only to the training data.
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

class CustomStratifiedKFold:
    def __init__(self, n_splits=5, random_state=None, oversample_func=None):
        self.n_splits = n_splits
        self.random_state = random_state
        self.oversample_func = oversample_func

    def split(self, X, y, groups=None):
        skf = StratifiedKFold(n_splits=self.n_splits, shuffle=True,
                              random_state=self.random_state)
        # scikit-learn CV interfaces exchange indices only, so the splitter
        # yields indices; oversample_func is applied to the training slice
        # inside a manual cross-validation loop.
        for train_index, test_index in skf.split(X, y):
            yield train_index, test_index

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

# Example usage:
def oversample(X_train, y_train):
    # Resample the minority class (label 1) with replacement until the
    # two classes are balanced
    minority_mask = y_train == 1
    n_minority = np.sum(minority_mask)
    n_majority = len(y_train) - n_minority
    X_minority, y_minority = X_train[minority_mask], y_train[minority_mask]
    X_extra, y_extra = resample(X_minority, y_minority, replace=True,
                                n_samples=n_majority - n_minority,
                                random_state=42)
    return np.vstack((X_train, X_extra)), np.hstack((y_train, y_extra))
```

In this example, CustomStratifiedKFold uses StratifiedKFold to ensure that each fold has the same proportion of classes as the entire dataset. The oversample_func is a user-defined function that balances the training data; because scikit-learn's cross-validation interfaces exchange only indices, it must be called on the training slice (X[train_index], y[train_index]) inside a manual cross-validation loop rather than inside split() itself.
Example: Time Series Split
For time series data, maintaining the temporal order is crucial.
```python
from sklearn.model_selection import TimeSeriesSplit

class CustomTimeSeriesSplit:
    def __init__(self, n_splits=5):
        self.n_splits = n_splits
        self.tscv = TimeSeriesSplit(n_splits=n_splits)

    def split(self, X, y=None, groups=None):
        for train_index, test_index in self.tscv.split(X, y):
            yield train_index, test_index

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```

This CustomTimeSeriesSplit generator creates splits that respect the temporal order of the data, ensuring that training data always precedes the test data.
Example: Grouped Data
Let's create an example where we handle grouped data. This can be useful when your data contains groups and you want to ensure that all samples from a single group are either in the training set or the testing set, but never split between them.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

class GroupAwareKFold:
    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def split(self, X, y, groups):
        unique_groups = np.unique(groups)
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=42)
        # Split the unique group labels, then map them back to sample indices
        for train_idx, test_idx in kf.split(unique_groups):
            train_groups = unique_groups[train_idx]
            test_groups = unique_groups[test_idx]
            train_indices = np.where(np.isin(groups, train_groups))[0]
            test_indices = np.where(np.isin(groups, test_groups))[0]
            yield train_indices, test_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

# Example usage with synthetic grouped data
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.randint(0, 2, 20)
groups = np.repeat(np.arange(5), 4)  # 5 groups of 4 samples each

cv = GroupAwareKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups):
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    print("Fold accuracy:", accuracy_score(y[test_idx], model.predict(X[test_idx])))
```

This GroupAwareKFold class ensures that all samples from the same group are either in the training set or the testing set. It does this by splitting the unique groups and then using these to generate the indices for the data.
Computing Cross-Validated Metrics
The cross_val_score function is the simplest way to use cross-validation. For example, to know the accuracy of a pipeline computing the score with 5 folds:
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Create a pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("estimator", DecisionTreeClassifier()),
], verbose=True)

# Perform cross-validation
CV = 5
scoring = "accuracy"
scores = cross_val_score(pipe, X, y, scoring=scoring, cv=CV)
print(f"{scoring}: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```

The scoring parameter defines the evaluation method used to calculate the score. The most popular metric is accuracy, but there are many scoring options available in Scikit-learn.
Cross-Validated Confusion Matrix
A confusion matrix is a tool used to evaluate the performance of a classification model. The matrix compares the predicted classes of a model with the actual classes of the data, breaking down the results into categories like true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Create a pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("estimator", DecisionTreeClassifier()),
], verbose=True)

# Perform cross-validation and get predictions
CV = 5
y_pred = cross_val_predict(pipe, X, y, cv=CV)

# Plot the confusion matrix
cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y, y_pred),
                            display_labels=iris.target_names)
cm.plot()
plt.show()
```

Tips and Tricks for Effective Cross-Validation
- Stratified K-Fold for Imbalanced Datasets: Use `StratifiedKFold` to ensure class proportions are preserved in each fold.
- Shuffle Data: Use `shuffle=True` in `KFold` to prevent bias, especially if the dataset is ordered.
- Parallel Processing: Use `n_jobs=-1` in `cross_val_score` to run cross-validation on all available CPU cores.
- Choose Appropriate Metrics: Use metrics like recall or F1-score instead of accuracy, depending on the dataset and problem needs.
- Leave-One-Out Cross-Validation (LOOCV): Use LOOCV for very small datasets to provide an exhaustive evaluation.
- Cross-Validation with Pipelines: Apply cross-validation to a machine learning pipeline that encapsulates model training with prior data preprocessing steps, such as scaling.
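Several of these tips can be combined in one sketch (the breast cancer dataset and the recall metric here are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and never sees the fold's test data
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified, shuffled folds; recall as the metric; all CPU cores in parallel
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=skf, scoring="recall", n_jobs=-1)
print(f"recall: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Fitting the scaler inside the pipeline is what prevents preprocessing from leaking test-fold statistics into training, which is the point of the last tip above.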
Advantages and Disadvantages of K-Fold Cross-Validation
Advantages
- Reduces the risk of overfitting to a particular data split and makes underfitting easier to detect.
- Considers most of the data for training and validation.
- Model performance analysis on each fold helps to understand data variation.
- Handles imbalanced data well (via stratified variants) and is widely used for hyperparameter tuning.
Disadvantages
- Can be computationally expensive, since the model must be trained and evaluated K times.

