Loss of Plasticity in Continual Learning Strategies
In the ever-evolving landscape of artificial intelligence, one of the most intriguing and challenging concepts is continual learning, also known as continuous learning, incremental learning, or lifelong learning. Continual learning aims to mimic the human brain’s ability to learn new tasks without forgetting the old ones. However, this ambition is often hindered by a phenomenon that is both fascinating and frustrating: catastrophic forgetting.
The Human Brain vs. Neural Networks
Humans have an extraordinary ability to learn multiple tasks over a lifetime without significant interference between them. For instance, learning to play the violin does not cause you to forget how to play the guitar. This is in stark contrast to neural networks, which are prone to catastrophic forgetting. When a neural network is trained on a new task, it often forgets the knowledge it acquired from previous tasks. This issue arises because the network’s weights, adjusted to minimize the loss for the new task, inadvertently overwrite the weights that encoded the previous tasks.
Understanding Catastrophic Forgetting
Catastrophic forgetting, or catastrophic interference, occurs when a neural network substantially or completely forgets the information related to previously learned tasks after being trained on a new one. This happens because the network’s weights are modified to reduce the error for the new task, which can dramatically alter the knowledge representation of prior tasks. Imagine a neural network that has mastered recognizing cats and dogs; when it is later trained to recognize birds, it may forget how to distinguish between cats and dogs.
The Stability-Plasticity Tradeoff
At the heart of continual learning is the stability-plasticity tradeoff. This tradeoff involves balancing the model’s ability to learn new information (plasticity) against its ability to retain old information (stability). Essentially, forgetting means erasing data, and learning means storing data. The ideal solution should allow the model to retain significant knowledge from previous tasks while accommodating new information. However, this balance is delicate; too much stability can prevent the model from learning new tasks effectively, while too much plasticity can lead to catastrophic forgetting.
Strategies to Mitigate Catastrophic Forgetting
Several strategies have been developed to address the issue of catastrophic forgetting. These strategies aim to strike a balance between retaining old knowledge and acquiring new knowledge.
Learning Without Forgetting
One approach is the “Learning Without Forgetting” method proposed by Li and Hoiem. This algorithm explicitly deals with the weaknesses of traditional methods by ensuring that the parameters for the new task do not overwrite the knowledge acquired for the old tasks. It achieves this by using the knowledge distillation technique, where the model is trained to mimic the outputs of the old model on the old tasks while learning the new task.
Here is a simplified example of how you might implement the “Learning Without Forgetting” approach using Python and the PyTorch library:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

def knowledge_distillation(old_model, new_model, inputs, labels, alpha=0.5):
    # Soft targets from the old model (no gradients needed)
    with torch.no_grad():
        old_outputs = old_model(inputs)
    new_outputs = new_model(inputs)
    # KLDivLoss expects log-probabilities as input and probabilities as target
    kd_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(new_outputs, dim=1), F.softmax(old_outputs, dim=1))
    cross_entropy_loss = nn.CrossEntropyLoss()(new_outputs, labels)
    return alpha * kd_loss + (1 - alpha) * cross_entropy_loss

# Assume old_model is the pre-trained model on the old task
# and new_model is the model to be trained on the new task
old_model = Net()
new_model = Net()

# Train the new model with knowledge distillation
optimizer = optim.SGD(new_model.parameters(), lr=0.01)
for inputs, labels in new_task_data:
    loss = knowledge_distillation(old_model, new_model, inputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Elastic Weight Consolidation (EWC)
Another significant approach is Elastic Weight Consolidation (EWC), proposed by Kirkpatrick et al. EWC is inspired by the human brain’s synaptic consolidation process, which reduces the plasticity of synapses related to previously learned tasks. This method quantifies the importance of weights in terms of their impact on previous tasks’ performance and penalizes changes to these weights during the learning of new tasks. This helps in preserving the knowledge of previous tasks while allowing the model to learn new ones.
Here is a simplified example of how you might implement EWC:
```python
import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

def compute_fisher(model, inputs, labels):
    # Diagonal Fisher information approximation: squared gradients of the
    # old-task loss with respect to each parameter
    # (this is a simplified version; actual implementations may vary)
    loss = nn.CrossEntropyLoss()(model(inputs), labels)
    grads = torch.autograd.grad(loss, model.parameters())
    return [g.detach() ** 2 for g in grads]

def ewc_loss(model, old_params, fisher, inputs, labels, lambda_):
    # Cross-entropy on the new task plus a quadratic penalty that anchors
    # important weights (large Fisher values) near their old-task values
    new_task_loss = nn.CrossEntropyLoss()(model(inputs), labels)
    penalty = 0.0
    for param, old_param, f in zip(model.parameters(), old_params, fisher):
        penalty = penalty + (f * (param - old_param) ** 2).sum()
    return new_task_loss + lambda_ * penalty

# Assume model is the pre-trained model on the old task
model = Net()
old_params = [p.detach().clone() for p in model.parameters()]
fisher = compute_fisher(model, old_task_inputs, old_task_labels)

# Train the model with EWC on the new task
optimizer = optim.SGD(model.parameters(), lr=0.01)
for inputs, labels in new_task_data:
    loss = ewc_loss(model, old_params, fisher, inputs, labels, lambda_=1000)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Joint Training and Fine-Tuning
Other methods include joint training, where the model is trained simultaneously on all tasks, and fine-tuning, where the model is trained on one task and then fine-tuned on subsequent tasks. However, these methods often require significant computational resources and access to all previous task data, which may not always be feasible.
Continual Backpropagation
Continual backpropagation selectively reinitializes low-utility units in the network. A contribution utility is defined for each connection or weight and for each unit. The basic intuition behind the contribution utility is that the magnitude of the product of a unit's activation and its outgoing weight indicates how valuable that connection is to the units that consume it. If a hidden unit's contribution to its consumer is small, it can be overwhelmed by the contributions of other hidden units, in which case the hidden unit is not useful to its consumer. The contribution utility of a hidden unit is defined as the sum of the utilities of all its outgoing connections, and it is measured as a running average of instantaneous contributions with decay rate η (for example, η = 0.99).
When a hidden unit is reinitialized, its outgoing weights are set to zero. This ensures that the newly added hidden unit does not affect the already learned function. However, it also makes the new unit vulnerable to immediate reinitialization, because it has zero utility. To prevent this, new units are protected from reinitialization for a maturity threshold of m updates; a unit is called mature once its age exceeds m. At every step, a fraction ρ of mature units, called the replacement rate, is reinitialized in every layer.
The replacement rate ρ is typically set to a very small value, meaning that only one unit is replaced after hundreds of updates. For example, in class-incremental CIFAR-100 (Fig. 2), continual backpropagation was used with a replacement rate of 10⁻⁵. The last layer of the network in that problem had 512 units, so at each step roughly 512 × 10⁻⁵ = 0.00512 units are replaced. This corresponds to roughly one replacement every 1/0.00512 ≈ 200 updates, or one replacement every eight epochs on the first five classes.
The final algorithm combines conventional backpropagation with selective reinitialization to continually inject random units from the initial distribution. Continual backpropagation performs a gradient descent and selective reinitialization step at each update. Algorithm 1 specifies continual backpropagation for a feed-forward neural network. In cases in which the learning system uses mini-batches, the instantaneous contribution utility can be used by averaging the utility over the mini-batch instead of keeping a running average to save computation.
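The selective-reinitialization step described above can be sketched for a single fully connected hidden layer. This is a minimal illustration under stated assumptions, not the paper's implementation: the `CBPLayer` name, the weight shapes, the 0.01 scale for fresh input weights and updating utilities from a single example's activations are all choices made here for brevity.

```python
import torch

class CBPLayer:
    """Sketch of continual backpropagation's bookkeeping for one hidden layer:
    tracks a running-average contribution utility per unit and selectively
    reinitializes low-utility mature units."""

    def __init__(self, w_in, w_out, eta=0.99, rho=1e-4, maturity=100):
        self.w_in, self.w_out = w_in, w_out   # shapes: (hidden, in), (out, hidden)
        self.eta, self.rho, self.maturity = eta, rho, maturity
        n_hidden = w_in.shape[0]
        self.utility = torch.zeros(n_hidden)
        self.age = torch.zeros(n_hidden)

    def step(self, h):
        # h: hidden activations for the current example, shape (hidden,).
        # Instantaneous contribution of unit i: |h_i| times the magnitude of
        # its outgoing weights, summed over consumers.
        contribution = h.abs() * self.w_out.abs().sum(dim=0)
        self.utility = self.eta * self.utility + (1 - self.eta) * contribution
        self.age += 1

        # Reinitialize the lowest-utility mature units at rate rho.
        mature = self.age > self.maturity
        n_replace = int(self.rho * mature.sum().item())
        if n_replace > 0:
            utility = self.utility.clone()
            utility[~mature] = float('inf')   # never replace immature units
            idx = torch.topk(utility, n_replace, largest=False).indices
            with torch.no_grad():
                # Fresh input weights from the initial distribution
                self.w_in[idx] = torch.randn_like(self.w_in[idx]) * 0.01
                # Zero outgoing weights: no effect on the learned function
                self.w_out[:, idx] = 0.0
            self.utility[idx] = 0.0
            self.age[idx] = 0.0
        return n_replace
```

In a full implementation this step would run alongside the usual gradient update at every iteration; here a large ρ would be used only for testing, since realistic values replace a unit only once every few hundred updates.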
Experiments and Results
Continual ImageNet
The ImageNet database used consists of 1,000 classes with 700 images each. The 700 images for each class were divided into a training set of 600 images and a test set of 100 images. On each binary classification task, the deep-learning network was first trained on the 1,200 training images of the two classes and its classification accuracy was then measured on the 200 test images. The training consisted of several passes through the training set, called epochs. For each task, all learning algorithms performed 250 passes through the training set using mini-batches of size 100. All tasks used the downsampled 32 × 32 version of the ImageNet dataset, as is often done to save computation.
All algorithms on Continual ImageNet used a convolutional network. The network had three convolutional-plus-max-pooling layers, followed by three fully connected layers. The final layer consisted of just two units, the heads, corresponding to the two classes. At task changes, the input weights of the heads were reset to zero. Resetting the heads in this way can be viewed as introducing new heads for the new tasks. This resetting of the output weights is not ideal for studying plasticity, as the learning system gets access to privileged information on the timing of task changes. It is used here because it is the standard practice in deep continual learning for this type of problem in which the learning system has to learn a sequence of independent tasks.
In this problem, the head of the network is reset at the beginning of each task. It means that, for a linear network, the whole network is reset. That is why the performance of a linear network will not degrade in Continual ImageNet. As the linear network is a baseline, having a low-variance estimate of its performance is desirable. The value of this baseline is obtained by averaging over thousands of tasks. This averaging gives us a much better estimate of its performance than other networks.
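The head reset described above amounts to zeroing the input weights and biases of the output units at each task boundary. A minimal PyTorch sketch, assuming a hypothetical 128-dimensional feature layer feeding the two-unit head:

```python
import torch
import torch.nn as nn

# `head` stands in for the final two-unit layer described above; the
# 128-dimensional input is an assumed feature size. The rest of the network
# is untouched, so zeroing the head is equivalent to introducing new heads.
head = nn.Linear(128, 2)

def reset_head(head: nn.Linear) -> None:
    """Zero the head's input weights and biases at a task change."""
    with torch.no_grad():
        head.weight.zero_()
        head.bias.zero_()

reset_head(head)
```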
The network was trained using SGD with momentum on the cross-entropy loss and was initialized once, before the first task. The momentum hyperparameter was 0.9. Various step-size parameters were tested for backpropagation, but for clarity only the performance for step sizes 0.01, 0.001 and 0.0001 is presented in Fig. 1b. Thirty runs were performed for each hyperparameter value, varying the sequence of tasks and other randomness. The same sequences of pairs of classes were used across different hyperparameters and algorithms.
The hyperparameter selection for L2 regularization, Shrink and Perturb and continual backpropagation is described below. The main text presents the results for these algorithms on Continual ImageNet in Fig. 1c. A grid search was performed for all algorithms to find the set of hyperparameters that had the highest average classification accuracy over 5,000 tasks. The values of hyperparameters used for the grid search are described in Extended Data Table 2. L2 regularization has two hyperparameters, step size and weight decay. Shrink and Perturb has three hyperparameters, step size, weight decay and noise variance. Two hyperparameters of continual backpropagation were swept over: step size and replacement rate. The maturity threshold in continual backpropagation was set to 100. For both backpropagation and L2 regularization, the performance was poor for step sizes of 0.1 or 0.003. Step sizes of 0.03 and 0.01 were only used for continual backpropagation and Shrink and Perturb. Ten independent runs were performed for all sets of hyperparameters. Then another 20 runs were performed to complete 30 runs for the best-performing set of hyperparameters to produce the results in Fig. 1c.
Class-incremental CIFAR-100
In the class-incremental CIFAR-100, the learning system gets access to more and more classes over time. Classes are provided to the learning system in increments of five. First, it has access to just five classes, then ten and so on, until it gets access to all 100 classes. The learning system is evaluated on the basis of how well it can discriminate between all the available classes at present. The dataset consists of 100 classes with 600 images each. The 600 images for each class were divided into 450 images to create a training set, 50 for a validation set and 100 for a test set. Note that the network is trained on all data from all classes available at present. First, it is trained on data from just five classes, then from all ten classes and so on, until finally, it is trained from data from all 100 classes simultaneously.
After each increment, the network was trained for 200 epochs, for a total of 4,000 epochs for all 20 increments. A learning-rate schedule that resets at the start of each increment was used. For the first 60 epochs of each increment, the learning rate was set to 0.1, then to 0.02 for the next 60 epochs, then 0.004 for the next 40 epochs and to 0.0008 for the last 40 epochs; the initial learning rate and learning-rate schedule reported in ref. 53 were used. During the 200 epochs of training for each increment, the network with the best accuracy on the validation set was tracked. To prevent overfitting, at the start of each new increment, the weights of the network were reset to the weights of the best-performing (on the validation set) network found during the previous increment; this is equivalent to early stopping for each different increment.
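The learning-rate schedule above can be written as a small helper function; the `epoch` argument is assumed to be the 0-based epoch index within the current increment, so the schedule resets automatically at each increment:

```python
def cifar_increment_lr(epoch: int) -> float:
    """Piecewise-constant learning rate for one 200-epoch increment:
    0.1 for the first 60 epochs, 0.02 for the next 60, 0.004 for the
    next 40 and 0.0008 for the last 40."""
    if epoch < 60:
        return 0.1
    if epoch < 120:
        return 0.02
    if epoch < 160:
        return 0.004
    return 0.0008
```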
An 18-layer deep residual network was used for all experiments on class-incremental CIFAR-100. The weights of convolutional and linear layers were initialized using Kaiming initialization, the weights for the batch-norm layers were initialized to one and all of the bias terms in the network were initialized to zero. Each time five new classes were made available to the network, five more output units were added to the final layer of the network. The weights and biases of these output units were initialized using the same initialization scheme. The weights of the network were optimized using SGD with a momentum of 0.9, a weight decay of 0.0005 and a mini-batch size of 90.
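The step of growing the output layer by five units can be sketched as follows. The `expand_head` helper is hypothetical, but it follows the description above: existing weights and biases are copied over, while the new rows receive Kaiming-initialized weights and zero biases (the exact Kaiming variant used in the paper is an assumption; `kaiming_normal_` is used here).

```python
import torch
import torch.nn as nn

def expand_head(old_head: nn.Linear, extra: int = 5) -> nn.Linear:
    """Return a new linear head with `extra` additional output units.
    Old units keep their learned parameters; new units are freshly
    initialized (Kaiming weights, zero biases)."""
    new_head = nn.Linear(old_head.in_features, old_head.out_features + extra)
    nn.init.kaiming_normal_(new_head.weight)
    nn.init.zeros_(new_head.bias)
    with torch.no_grad():
        new_head.weight[: old_head.out_features] = old_head.weight
        new_head.bias[: old_head.out_features] = old_head.bias
    return new_head
```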
Several steps of data preprocessing were used before the images were presented to the network. First, the value of all the pixels in each image was rescaled between 0 and 1 through division by 255. Then, each pixel in each channel was centred and rescaled by the average and standard deviation of the pixel values of each channel, respectively. Finally, three random data transformations were applied to each image before feeding it to the network: randomly horizontally flip the image with a probability of 0.5, randomly crop the image by padding the image with 4 pixels on each side and randomly cropping to the original size, and randomly rotate the image between 0 and 15°. The first two steps of preprocessing were applied to the training, validation and test sets, but the random transformations were only applied to the images in the training set.
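The two deterministic preprocessing steps can be sketched in NumPy; the random flip, crop and rotation augmentations are omitted, and the `(N, H, W, C)` image layout is an assumption made here:

```python
import numpy as np

def preprocess(images: np.ndarray) -> np.ndarray:
    """Rescale pixel values to [0, 1], then centre and rescale each channel
    by its mean and standard deviation. These deterministic steps apply to
    the training, validation and test sets alike."""
    x = images.astype(np.float64) / 255.0          # pixels in [0, 1]
    mean = x.mean(axis=(0, 1, 2), keepdims=True)   # per-channel mean
    std = x.std(axis=(0, 1, 2), keepdims=True)     # per-channel std
    return (x - mean) / std
```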
Several hyperparameters were tested to ensure the best performance for each algorithm with this specific architecture. For the base system, values for the weight-decay parameter in {0.005, 0.0005, 0.00005} were tested; a value of 0.0005 gave the best performance in terms of area under the curve for test-set accuracy over the 20 increments. For Shrink and Perturb, the weight-decay value of the base system was used and values for the standard deviation of the Gaussian noise in {10⁻⁴, 10⁻⁵, 10⁻⁶} were tested; 10⁻⁵ gave the best performance. For continual backpropagation, values for the maturity threshold in {1,000, 10,000} and for the replacement rate in {10⁻⁴, 10⁻⁵, 10⁻⁶}, using the contribution utility described in equation (1), were tested; a maturity threshold of 1,000 and a replacement rate of 10⁻⁵ gave the best performance. Finally, for the head-resetting baseline in Extended Data Fig. 1a, the same hyperparameters as for the base system were used, but the output layer was reinitialized at the start of each increment.
In Fig. 2d, the stable rank of the representation in the penultimate layer of the network and the percentage of dead units in the full network are plotted. For a matrix $\Phi \in \mathbb{R}^{n \times m}$ with singular values $\sigma_k$ sorted in descending order for $k = 1, 2, \ldots, q$ and $q = \min(n, m)$, the stable rank (ref. 55) is $\min\left\{k : \frac{\sum_{i=1}^{k} \sigma_i}{\sum_{j=1}^{q} \sigma_j} > 0.99\right\}$.
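This stable-rank quantity is straightforward to compute from the singular values; a small NumPy helper (the function name and 0.99 default are ours):

```python
import numpy as np

def stable_rank(phi: np.ndarray, threshold: float = 0.99) -> int:
    """Smallest k such that the top-k singular values account for strictly
    more than `threshold` of the total singular-value mass."""
    s = np.linalg.svd(phi, compute_uv=False)   # singular values, descending
    cumulative = np.cumsum(s) / s.sum()
    # side='right' finds the first index whose cumulative share exceeds
    # the threshold; +1 converts the 0-based index to a 1-based rank
    return int(np.searchsorted(cumulative, threshold, side='right')) + 1
```

An identity matrix spreads its mass evenly over all singular values, so its stable rank equals its full rank, whereas a rank-one matrix has stable rank 1.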
For reference, a network with the same hyperparameters as the base system, but reinitialized at the beginning of each increment, was also implemented. Figure 2b shows the performance of each algorithm relative to that of the reinitialized network. For completeness, Extended Data Fig. 1a shows the test accuracy of each algorithm in each increment. The final accuracy of continual backpropagation on all 100 classes was 76.13%, and Extended Data Fig. 1b shows the performance of continual backpropagation for different replacement rates with a maturity threshold of 1,000. For all algorithms tested, there was no correlation between when a class was presented and the accuracy on that class, implying that the temporal order of classes did not affect performance.
Robust Loss of Plasticity in Permuted MNIST
A computationally cheap problem based on the MNIST dataset was used to test the generality of loss of plasticity across various conditions. MNIST is one of the most common supervised-learning datasets used in deep learning. It consists of 60,000 greyscale images, 28 × 28 pixels each, of handwritten digits from 0 to 9, together with their correct labels. For example, the left image in Extended Data Fig. 3a shows an image labelled with the digit 7. The smaller number of classes and the simpler images enable much smaller networks to perform well on this dataset than are needed on ImageNet or CIFAR-100. These smaller networks in turn require much less computation, so experiments can be performed in greater quantities and under a wider variety of conditions, enabling deeper and more extensive studies of plasticity.
A continual supervised-learning problem was created using permuted MNIST datasets. An individual permuted MNIST dataset is created by permuting the pixels in the original MNIST dataset. The right image in Extended Data Fig. 3a is an example of such a permuted image. Given a way of permuting, all 60,000 images are permuted in the same way to produce the new permuted MNIST dataset. For each task, each of its 60,000 images was presented one by one, in random order, to the learning network. No indication was given to the network at the time of task switching. With the pixels permuted in a completely unrelated way, classification performance might be expected to fall substantially at the time of each task switch. Nevertheless, across tasks there could be some savings, some improvement in the speed of learning, or, alternatively, there could be loss of plasticity, that is, a loss of the ability to learn across tasks. The network was trained on a single pass through the data and there were no mini-batches. This problem is called Online Permuted MNIST.
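Constructing one permuted-MNIST task then amounts to drawing a single pixel permutation and applying it identically to every image. A minimal NumPy sketch (the function name and array layout are assumptions):

```python
import numpy as np

def make_permuted_task(images: np.ndarray, rng: np.random.Generator):
    """Create one permuted-MNIST task: a single fixed pixel permutation
    applied to every image in the same way (labels are unchanged)."""
    n, h, w = images.shape
    perm = rng.permutation(h * w)        # one fixed permutation per task
    flat = images.reshape(n, h * w)
    return flat[:, perm].reshape(n, h, w), perm
```

Each call with a fresh permutation yields a new task whose pixel statistics are identical to the original MNIST but whose spatial structure is scrambled.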
Feed-forward neural networks with three hidden layers were applied to Online Permuted MNIST. Convolutional layers were not used, as they would not help on the permuted problem, in which the spatial information is lost; on MNIST, convolutional layers are often not used even on the standard, non-permuted problem. For each example, the network estimated the probabilities of each of the ten classes, compared them with the correct label and performed SGD on the cross-entropy loss. As a measure of online performance, the percentage of times the network correctly classified each of the 60,000 images in the task was recorded. This per-task performance measure versus task number is plotted in Extended Data Fig. 3b. The weigh…
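The online protocol described above, per-example SGD on a single pass with the prediction scored before each update, can be sketched as follows. The hidden-layer widths (100 units) and step size are illustrative assumptions, not the paper's values:

```python
import torch
import torch.nn as nn

# Three-hidden-layer feed-forward network for 784-pixel inputs, ten classes.
net = nn.Sequential(
    nn.Linear(784, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)
opt = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def run_task(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One pass through a task's examples, one SGD update per example;
    returns the online accuracy (fraction classified correctly before
    each update)."""
    correct = 0
    for x, y in zip(images, labels):
        logits = net(x.view(1, -1))
        correct += int(logits.argmax(dim=1).item() == y.item())
        loss = loss_fn(logits, y.view(1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return correct / len(labels)
```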
Churn and Plasticity in Deep Continual RL
Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this context, churn refers to network output variability induced by the data in each training batch. Studies have shown that the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix and that reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively.
Continual Churn Approximated Reduction (C-CHAIN) has been introduced and demonstrated to improve learning performance and outperform baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.
Loss of Plasticity in Deep Reinforcement Learning
The ability to learn continually is essential in a complex and changing world. Research has characterized the behavior of canonical value-based deep reinforcement learning (RL) approaches under varying degrees of non-stationarity. Deep RL agents lose their ability to learn good policies when they cycle through a sequence of Atari 2600 games. This phenomenon is alluded to in prior work under various guises, for example, loss of plasticity, implicit under-parameterization, primacy bias and capacity loss. Analysis shows that the activation footprint of the network becomes sparser, contributing to the diminishing gradients.
Activation Functions and Plasticity
The issue of the network’s activation function is important in the context of plasticity. Experiments have shown that the loss of plasticity is robust across different activations.

