Understanding Learning Curves in Machine Learning
Learning curves are essential diagnostic tools in machine learning, offering insights into how well a model learns from data. They visually represent a model's performance improvement over time or with increased experience. This article delves into the definition, interpretation, and practical applications of learning curves, providing a comprehensive guide for machine learning practitioners.
Introduction to Learning Curves
In machine learning, models are trained to identify patterns in data. The goal is to create models that generalize well to new, unseen data. During the research phase, experiments are conducted to find the model that best solves the problem at hand while minimizing prediction error. Learning curves help monitor this process, revealing whether a model is underfitting, overfitting, or achieving a good fit. A learning curve is essentially a plot showing how a specific metric related to learning progresses over experience during the training of a machine learning model.
Imagine teaching a child to ride a bike. Initially, they are unsteady and unsure, much like a neural network just starting to learn. With practice (or more training data), they become more confident and skilled. This progression mirrors the evolving learning curves of a neural network.
Definition of a Learning Curve
A learning curve is a graphical representation of learning progress over time or experience. The concept originates outside machine learning: learning curves have long been used to monitor the performance of workers exposed to a new task, providing a mathematical description of the improvement that occurs as the task is repeated. In machine learning, the x-axis typically represents the amount of experience, such as the size of the training dataset or the number of training iterations, and the y-axis represents a measure of learning, such as accuracy or loss.
Learning curves are a common diagnostic tool in machine learning for algorithms that learn progressively from a training dataset. During training, the current state of the model can be evaluated at each step of the training algorithm. After each update, the model may be tested on the training dataset and on a held-out validation dataset, and plots of the measured performance can be constructed to display learning curves.
Types of Learning Curves
Learning curves can be categorized based on the metric used to evaluate the model's performance:
- Optimization Learning Curves: These curves are calculated using the metric by which the model's parameters are being optimized, such as loss.
- Performance Learning Curves: These curves are calculated using the metric by which the model will be evaluated and selected, such as accuracy, precision, or recall.
In some cases, it is also common to create learning curves for multiple metrics, such as in the case of classification predictive modeling problems, where the model may be optimized according to cross-entropy loss and model performance is evaluated using classification accuracy.
Interpreting Learning Curves: Bias and Variance
Learning curves are interpreted by assessing their shape, which reveals valuable information about a model's bias and variance. In machine learning, the best models can generalize well when faced with instances that were not part of the initial training data. To attain a more accurate solution, we seek to reduce the amount of bias and variance present in our model.
Understanding Bias and Variance
All supervised learning algorithms strive toward the same objective: estimating a mapping function (f_hat) from input data (X) to a target variable (y). Changing the training data used to approximate this function will likely yield a different estimated function, which in turn changes the model's predictions. How much the estimated function varies as the training data changes is known as the variance.
Bias: Bias refers to the error introduced by approximating a real-world problem, which is often complex, with a simplified model. It is the difference between the model's average prediction and the correct value. Models with high bias make strong assumptions about the training data, leading to an over-simplified model that may show high error on both the training and testing sets. On the other hand, such models are faster to train and easier to understand. Linear algorithms such as Linear Regression generally have high bias.
Variance: Variance refers to how much a model's predictions change when the training data changes. Ideally, a machine learning model should not vary too much across different training sets; the algorithm should pick up the important patterns in the data regardless of the particular sample it is trained on. Examples of algorithms with high variance are Decision Trees and Support Vector Machines (SVMs).
Ideally, we would want a model with both low variance and low bias. To reduce bias we typically need a more flexible model, but greater flexibility tends to increase variance, so we have to strike a balance between the two. This is called the bias-variance trade-off. A learning curve can help find the right model complexity and the right amount of training data for a good bias-variance trade-off. This is why learning curves are so important.
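The trade-off can be made concrete with a small numpy-only sketch (the synthetic data and polynomial degrees here are chosen purely for illustration): a low-degree polynomial underfits a sine wave (high bias), while a high-degree polynomial fits the training points almost perfectly but generalizes worse (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from one period of a sine wave
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

def mse(degree):
    """Fit a polynomial of the given degree; return (train_mse, test_mse)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err

underfit = mse(1)   # high bias: large error on both sets
good = mse(3)       # balanced fit
overfit = mse(15)   # high variance: tiny train error, larger test error
```

Comparing the pairs of errors shows the pattern described above: the degree-1 fit has high error everywhere, while the degree-15 fit drives the training error down at the cost of a wider gap to the test error.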
Common Learning Curve Dynamics
Learning curves typically exhibit three common dynamics:
Underfitting: A model with high bias is said to be underfit. It makes simplistic assumptions about the training data, which makes it difficult to learn the underlying patterns. This results in a model that has high error on the training and validation datasets.
Overfitting: A model with high variance is said to be overfit. It learns the training data and the random noise extremely well, thus resulting in a model that performs well on the training data, but fails to generalize to unseen instances.
Good Fit: A good fit model exists in the gray area between an underfit and overfit model. The model may not be as good on the training data as it is in the overfit instance, but it will make far fewer errors when faced with unseen instances.
Diagnosing Model Performance with Learning Curves
Examining model learning curves during training can help to detect learning issues such as an underfit or overfit model, as well as whether the training and validation datasets are sufficiently representative.
Underfitting Learning Curves
Underfitting occurs when a model is unable to learn the training dataset. An underfit model can be identified from the learning curve of the training loss alone. The curve may show a flat line or noisy values around a relatively large loss, indicating that the model failed to learn the training dataset at all; this is typical when the model's capacity is insufficient for the complexity of the dataset. An underfit model can also be spotted by a training loss that is still decreasing at the end of the plot, which shows that the model is capable of further learning and that training was stopped prematurely.
Underfitting is shown by a plot of learning curves if:
- The training loss stays flat regardless of training.
- The training loss continues to decrease until the end of training.
To address underfitting, we enhance the model’s capacity, allowing it to learn more complex patterns. It’s like giving the child training wheels.
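The two signs above can be checked automatically. The sketch below is only a heuristic; the thresholds are arbitrary and would need tuning for real loss scales.

```python
import numpy as np

def diagnose_underfit(train_loss, flat_tol=1e-3, still_falling_tol=1e-2):
    """Heuristic underfitting check on a sequence of per-epoch training losses."""
    train_loss = np.asarray(train_loss, dtype=float)
    total_drop = train_loss[0] - train_loss[-1]
    # Sign 1: loss barely moved over the whole run
    if abs(total_drop) < flat_tol:
        return "flat loss: model may lack capacity"
    # Sign 2: loss is still dropping fast in the final epochs
    tail_drop = train_loss[-2] - train_loss[-1]
    if tail_drop > still_falling_tol:
        return "loss still falling: training stopped too early"
    return "no underfitting signal"

flat = diagnose_underfit([2.3, 2.3, 2.3, 2.3])
early = diagnose_underfit([2.3, 1.8, 1.3, 0.9])
converged = diagnose_underfit([2.3, 0.9, 0.52, 0.515])
```

The first curve is flat, the second is still falling at the end, and the third has settled, matching the three situations described in the text.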
Overfitting Learning Curves
Overfitting occurs when a model has learned the training dataset too well, including its statistical noise and random fluctuations. The drawback is that the more specialized the model becomes to the training data, the less successfully it can generalize to new data, resulting in an increase in generalization error. This rise in generalization error can be measured by the model's performance on the validation dataset.
This is common when the model has more capacity than is necessary for the problem and, as a result, too much flexibility. It can also happen if the model is trained for an inordinately extended period of time.
Overfitting is indicated by a plot of learning curves if:
- The training loss continues to decrease with experience.
- The validation loss decreases to a point and then begins climbing again.
The inflection point in the validation loss may be the point at which training could be stopped, since experience beyond that point only demonstrates the mechanics of overfitting.
To combat overfitting, we introduce regularization. It’s like teaching the child to adjust to different terrains and obstacles. The result? Both training and validation losses go down.
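These signs can also be detected from the recorded loss values. The sketch below (with made-up loss values for illustration) finds the epoch of minimum validation loss and flags overfitting when the training loss keeps falling while the validation loss has turned upward.

```python
import numpy as np

def diagnose_overfit(train_loss, val_loss):
    """Return (epoch of minimum validation loss, whether the curves look overfit)."""
    train_loss = np.asarray(train_loss, dtype=float)
    val_loss = np.asarray(val_loss, dtype=float)
    best_epoch = int(np.argmin(val_loss))
    # Overfitting: training loss still decreasing while validation loss rises
    train_still_falling = train_loss[-1] < train_loss[best_epoch]
    val_rising = val_loss[-1] > val_loss[best_epoch]
    return best_epoch, bool(train_still_falling and val_rising)

train = [1.0, 0.6, 0.4, 0.25, 0.15, 0.08]
val = [1.1, 0.7, 0.5, 0.55, 0.65, 0.80]
best, overfit = diagnose_overfit(train, val)
```

Here the validation loss bottoms out at epoch 2 while the training loss keeps dropping, so the curves are flagged as overfit.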
Good Fit Learning Curves
The learning algorithm seeks a good fit between an overfit and an underfit model. A good fit is identified by a training and validation loss that both decrease to a point of stability with a small gap between the two final loss values. The model's loss is almost always smaller on the training dataset than on the validation dataset, so some disparity between the train and validation loss curves is expected. This disparity is known as the "generalization gap."
A plot of learning curves indicates a good fit if:
- The training loss plot lowers to a point of stability.
- The validation loss plot approaches stability and has a tiny gap with the training loss.
Continued training of a good fit will almost certainly result in an overfit model.
Diagnosing Unrepresentative Datasets
Learning curves may also be used to determine the qualities of a dataset and its representativeness. An unrepresentative dataset is one that does not capture the statistical properties of another dataset drawn from the same domain, such as the difference between a train and a validation dataset. This is typical when the number of samples in one dataset is too small in comparison to the other.
There are two common scenarios that may be observed:
- The training dataset is not very representative.
- The validation dataset is not very representative.
Unrepresentative Training Dataset
An unrepresentative training dataset is one that does not give enough information to learn the problem in comparison to the validation dataset used to evaluate it. This can happen if the training dataset includes fewer instances than the validation dataset. This condition is indicated by a learning curve for training loss that shows improvement and a learning curve for validation loss that also shows progress, but there is a big gap between the two curves.
Unrepresentative Validation Dataset
An unrepresentative validation dataset is one that does not include enough information to assess the model's ability to generalize. This can happen if the validation dataset contains far fewer instances than the training dataset. The scenario is identified by a training loss curve that looks like a good fit (or another fit) together with a validation loss curve that moves noisily around the training loss. It can also be identified by a validation loss that is lower than the training loss.
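Both unrepresentative-dataset patterns can be screened for with simple statistics over the last few epochs of the curves. This is a rough heuristic only; the tolerances are arbitrary and the example loss values are invented for illustration.

```python
import numpy as np

def dataset_checks(train_loss, val_loss, gap_tol=0.5, noise_tol=0.1):
    """Flag possible unrepresentative train/validation sets from loss curves."""
    train_loss = np.asarray(train_loss, dtype=float)
    val_loss = np.asarray(val_loss, dtype=float)
    tail = slice(-5, None)  # look at the last few epochs only
    gap = np.mean(val_loss[tail]) - np.mean(train_loss[tail])
    val_noise = np.std(val_loss[tail])
    flags = []
    # Both curves improving but separated by a large gap
    if gap > gap_tol:
        flags.append('possibly unrepresentative training set')
    # Validation loss noisy, or sitting below the training loss
    if val_noise > noise_tol or gap < 0:
        flags.append('possibly unrepresentative validation set')
    return flags

flags_a = dataset_checks([1.0, 0.8, 0.6, 0.5, 0.45, 0.42, 0.41],
                         [1.8, 1.6, 1.4, 1.3, 1.25, 1.22, 1.21])
flags_b = dataset_checks([1.0, 0.7, 0.5, 0.4, 0.35, 0.33, 0.32],
                         [0.5, 0.3, 0.45, 0.25, 0.4, 0.2, 0.35])
```

The first pair of curves shows a persistent large gap (unrepresentative training set); the second shows a validation loss below the training loss (unrepresentative validation set).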
Practical Applications and Examples
Learning curves are valuable in various real-world scenarios. For instance, a new machine learning project requires lots of "babysitting": monitoring, data preparation, and experimentation. The most popular example of a learning curve is loss over time. Loss (or cost) measures our model error, or "how bad our model is doing". Other very popular learning curves track accuracy, precision, and recall.
Example: Real Estate Valuation Model
Consider the task of building a model to predict real estate valuation using historical market data. To demonstrate bias, variance, and good fit solutions, we are going to build three models: a decision tree regressor, a support vector machine for regression, and a random forest regressor.
Decision Tree Regressor (High Variance): The model makes very few mistakes when predicting instances it has seen during training, but performs poorly on new instances it has not been exposed to. You can observe this behavior in the large generalization gap between the training curve and the validation curve. One way to improve this behavior is to add more instances to the training dataset, which reduces variance. Another is to add regularization to the model, which trades a little bias for lower variance.
Support Vector Machine (High Bias): The generalization gap for the training and validation curve becomes extremely small as the training dataset size increases. This indicates that adding more examples to our model is not going to improve its performance.
Random Forest Regressor (Good Fit): Now you can see we've reduced the error in the validation data. The generalization error is much smaller, with a low number of errors being made.
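A comparison of this kind can be sketched with scikit-learn's learning_curve. The snippet below substitutes a synthetic dataset for the real estate data (which is not available here); the three model choices mirror the text, but the hyperparameters are arbitrary, and the generalization gap is measured crudely as the difference in mean cross-validated score at the largest training size.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in for historical market data
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

models = {
    'decision_tree': DecisionTreeRegressor(random_state=0),  # tends to high variance
    'svr': SVR(kernel='linear'),                             # tends to high bias here
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

gaps = {}
for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring='neg_mean_squared_error',
        train_sizes=np.linspace(0.2, 1.0, 5))
    # Gap at the largest training size (higher = worse generalization)
    gaps[name] = train_scores[-1].mean() - val_scores[-1].mean()
```

Plotting the train and validation score curves per model reproduces the three pictures described above; comparing `gaps` shows the decision tree's generalization gap dwarfing the linear SVR's.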
Implementing Learning Curves in Python
To implement learning curves in Python, we can use the scikit-learn library. Here's a basic example using the Digits dataset and a k-Nearest Neighbors classifier:
```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

# Load the digits dataset
digits = load_digits()
X, y = digits.data, digits.target

# Create a k-NN classifier
knn = KNeighborsClassifier()

# Generate the learning curve data
train_sizes, train_scores, test_scores = learning_curve(
    knn, X, y, cv=10, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10))

# Calculate the mean and standard deviation of the scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot the learning curves
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, test_mean, label='Cross-validation score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
plt.title('Learning Curve')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy Score')
plt.legend(loc='best')
plt.show()
```

From the curve, we can see that as the training set grows, the training score curve and the cross-validation score curve converge, and the cross-validation accuracy increases as we add more training data. So adding training data is useful in this case. Since the training score stays very high, the model has low bias and high variance: it overfits the data, as the cross-validation score remains relatively lower and increases only slowly as the training set size increases.
The Concept of Early Stopping
There is a critical moment in training: both training and validation losses decrease together, but then the validation loss flattens and begins to rise while the training loss keeps falling. It's like the child who starts to ride too fast, becoming overconfident. This is where you need to stop the training. This is the concept of early stopping, crucial for preventing overfitting.
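A minimal sketch of early stopping with a patience counter is shown below. The validation losses are made up for illustration; in practice they would come from evaluating the model after each epoch, and most deep learning frameworks provide a ready-made callback for this.

```python
def early_stopping(val_losses, patience=2):
    """Return the epoch at which training stops: the first epoch where the
    validation loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses) - 1  # trained to the end

stop = early_stopping([0.9, 0.7, 0.5, 0.52, 0.55, 0.6])  # stops at epoch 4
```

The best model is the one saved at the epoch of minimum validation loss (epoch 2 here), not the one at the stopping epoch; in practice the weights from the best epoch are restored.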
Learning Curves: The Mirror of a Model’s Soul
Learning curves are not just graphs; they are stories. They tell us about the struggles and triumphs of a neural network. From underfitting to overfitting and finally to a balanced model, these curves provide a comprehensive view of the model’s learning journey.

