LoRA: Balancing Efficiency and Knowledge Retention in Language Model Fine-Tuning
Large language models (LLMs) have become increasingly prevalent, boasting trillions of parameters and being pretrained on vast datasets. However, the subsequent post-training phase, which involves adapting these models to specific tasks or domains, often utilizes smaller datasets. This raises the question of efficiency: is it necessary to adjust a massive number of parameters when only a fraction of the data is used for fine-tuning? Low-Rank Adaptation (LoRA) has emerged as a promising parameter-efficient fine-tuning (PEFT) method to address this concern.
Understanding LoRA: A Parameter-Efficient Approach
LoRA offers an alternative to full fine-tuning (FullFT) by modifying only a small subset of the model's parameters. Instead of directly updating the original weight matrix $W$, LoRA introduces a modified version $W'$ defined as:

$$W' = W + \gamma BA$$

Here, $B$ and $A$ are low-rank matrices, and $\gamma$ is a scaling factor. The key advantage is that $B$ and $A$ together contain far fewer parameters than $W$, yielding substantial memory and computational savings.
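To make this concrete, here is a minimal numpy sketch of the LoRA parametrization. The dimensions and scaling factor below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical dimensions for illustration (not from the text).
d_out, d_in, r = 1024, 1024, 8
gamma = 2.0  # scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
B = np.zeros((d_out, r))                   # LoRA matrices: B starts at zero,
A = rng.standard_normal((r, d_in)) * 0.01  # so W' == W at initialization

x = rng.standard_normal(d_in)

# Two equivalent forms of the adapted forward pass:
y_merged = (W + gamma * B @ A) @ x          # merge the adapter into W
y_factored = W @ x + gamma * (B @ (A @ x))  # keep the adapter separate

assert np.allclose(y_merged, y_factored)

# Trainable-parameter savings:
full_params = W.size              # 1,048,576
lora_params = A.size + B.size     # 16,384 (~1.6% of the full matrix)
```

Note that the factored form never materializes the full update, which is why the adapter can be trained and served cheaply alongside the frozen base weights.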
Benefits of LoRA
LoRA offers several compelling advantages over FullFT:
- Reduced memory footprint: By training only a fraction of the parameters, LoRA significantly reduces memory requirements. This is particularly beneficial when dealing with large models and limited computational resources.
- Faster training: With fewer parameters to update, LoRA converges faster than FullFT.
- Ease of loading and transfer: LoRA adapters, which contain the trained low-rank matrices, are much smaller than the full model, making them easier to load, store, and transfer.
- Multi-tenant serving: LoRA's efficiency makes it well-suited for multi-tenant serving scenarios, where multiple users or applications share the same model.
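The multi-tenant point can be sketched in a few lines: one frozen base weight is shared across all tenants, and only a small per-tenant (B, A) pair is swapped in at request time. All names and dimensions below are hypothetical:

```python
import numpy as np

# Sketch of multi-tenant serving: one shared base model, per-tenant LoRA
# adapters applied at request time (dimensions and tenant names are made up).
d, r = 256, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))  # shared, frozen base weight

# Each tenant stores only its own small (B, A) pair.
adapters = {
    tenant: (rng.standard_normal((d, r)) * 0.01, rng.standard_normal((r, d)))
    for tenant in ("tenant_a", "tenant_b")
}

def forward(x, tenant):
    """The base computation is shared; only the low-rank correction is per-tenant."""
    B, A = adapters[tenant]
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d)
y_a = forward(x, "tenant_a")
y_b = forward(x, "tenant_b")
```

Each adapter here is `2*d*r` values instead of `d*d`, so serving a new tenant costs kilobytes rather than a full model copy.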
Addressing the Limitations of Full Fine-Tuning
Full fine-tuning (FullFT) presents several challenges:
- High computational cost: Updating every parameter requires significant processing and memory. Besides the weights themselves, training must store gradients and optimizer moments for all of the weights, and these are often kept in higher precision (float32) than the weights used for inference (bfloat16 or lower). As a result, FullFT usually requires an order of magnitude more accelerators than sampling from the same model does, and thus a different layout.
- Risk of catastrophic forgetting: The model may lose previously learned capabilities as it overfits to the new data.
- Large training layouts: FullFT needs a much bigger accelerator layout for training than for sampling. LoRA, by contrast, trains far fewer weights and uses far less memory, so it can be trained on a layout only slightly larger than the one used for sampling.
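A rough per-parameter memory accounting makes the layout gap concrete. This sketch assumes, per the text, bfloat16 weights (2 bytes) plus float32 gradients and two Adam moments (4 bytes each) for everything being trained; the model and adapter sizes are illustrative:

```python
# Back-of-the-envelope training-memory accounting (assumptions noted above).
BF16, FP32 = 2, 4  # bytes per value

def fullft_bytes(n_params):
    # weights + gradients + two Adam moments, all per parameter
    return n_params * (BF16 + FP32 + 2 * FP32)

def lora_bytes(n_base, n_adapter):
    # frozen base weights, plus full training state only for the adapter
    return n_base * BF16 + n_adapter * (BF16 + FP32 + 2 * FP32)

n_base = 8_000_000_000  # e.g. an 8B-parameter model
n_adapter = 3_000_000   # e.g. a rank-1 adapter (size taken from the text)

print(fullft_bytes(n_base) / 2**30)           # ≈ 104 GiB of weight-related state
print(lora_bytes(n_base, n_adapter) / 2**30)  # ≈ 15 GiB, close to inference cost
```

Under these assumptions, FullFT needs roughly 7x the weight-related memory of inference, while LoRA adds only a rounding error on top of the frozen weights (activations are extra in both cases).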
LoRA's Performance: A Balancing Act
While LoRA offers significant advantages in terms of efficiency, it's crucial to understand its performance characteristics compared to FullFT. Recent research has shed light on the conditions under which LoRA excels and where it may fall short.
When LoRA Underperforms
There is broad agreement that LoRA underperforms in settings that resemble pre-training: very large datasets whose information content exceeds the storage capacity of the LoRA parameters. When the dataset exceeds that capacity, LoRA falls behind FullFT.
LoRA Learns Less and Forgets Less
A recent paper, “LoRA Learns Less and Forgets Less”, argues that by training fewer parameters, LoRA acts as a regularizer that keeps the fine-tuned model close to what it learned during pretraining. LoRA turns out to be a stronger regularizer than traditional methods like weight decay and dropout. This prevents the loss of generalization on tasks the model was originally trained on, a known issue with more aggressive fine-tuning methods.
The paper's authors fine-tune models with LoRA and FullFT in two domains: programming and math. In the paper's figures, LoRA falls well below FullFT on coding tasks (less so in math), and the gap widens as the number of training tokens grows. Notably, the gap is much larger in continued pretraining (≈10B unstructured tokens) than in instruction fine-tuning (≈100K prompt-response pairs). This suggests LoRA is far less effective at learning a new domain than at changing tone and behavior within a domain the model already knows. The conclusion: LoRA learns less and forgets less.
LoRA in Reinforcement Learning
LoRA performs on par with FullFT for reinforcement learning, even at small ranks. The reason is information-theoretic: in policy gradient methods, learning is driven by the advantage function, which provides only O(1) bits per episode. In the MATH example, we trained on ~10,000 problems with 32 samples per problem; assuming each completion yields a single bit of information, the whole training process only needs to absorb 320,000 bits. Rank-1 LoRA for Llama-3.1-8B already has about 3M parameters, more than enough capacity.
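The arithmetic behind that information budget is worth spelling out; the one-bit-per-completion figure is the assumption stated above:

```python
# Information-budget arithmetic from the MATH example above.
problems = 10_000
samples_per_problem = 32
bits_per_episode = 1  # assumption: ~1 bit of advantage signal per completion

total_bits = problems * samples_per_problem * bits_per_episode  # 320,000

rank1_lora_params = 3_000_000  # rank-1 LoRA on Llama-3.1-8B, per the text
capacity_headroom = rank1_lora_params / total_bits  # ~9x more parameters than bits
```

Even a rank-1 adapter has roughly an order of magnitude more parameters than there are bits of reward signal to absorb, which is why raising the rank buys nothing in this regime.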
Hyperparameter Tuning: Optimizing LoRA's Performance
One barrier to LoRA adoption is the need to choose good hyperparameters, which differ from those tuned for FullFT.
Learning Rate
The optimal learning rate is similar across LoRA runs of different ranks, but roughly 10x higher than the optimal learning rate for FullFT on the same task. In our experiments, this 10x ratio held consistently, for both supervised learning and reinforcement learning.
Batch Size
In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning: it pays a larger penalty in loss as batch size grows beyond some point. This gap at large batch sizes does not seem to depend on rank; it appears to be a property of LoRA itself. The likely reason is that the product-of-matrices parametrization ($BA$) has less favorable optimization dynamics on this dataset than the full matrix ($W$).
Layer Selection
Even in small-data settings, LoRA performs better when applied to all weight matrices, especially the MLP and MoE layers. The original paper by Hu et al. recommended applying LoRA only to the attention matrices, and many subsequent papers followed suit, though a recent trend has been to apply it to all layers. Indeed, we achieved far better results applying LoRA to all layers, in particular the MLP (including MoE) layers. Attention-only LoRA significantly underperforms MLP-only LoRA, and adding attention adapters on top of MLP adapters yields no further improvement.
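One plausible reason the MLP layers matter so much is simply where the parameters live. This sketch counts dense and LoRA parameters per transformer block using hypothetical Llama-like dimensions (hidden size 4096, MLP width 14336, rank 8):

```python
# Rough per-block parameter accounting for a Llama-like transformer
# (hypothetical dims; real models use GQA, which shrinks the K/V projections).
d_model, d_ff, r = 4096, 14336, 8

# Attention: Q, K, V, O projections (square here for simplicity).
attn_shapes = [(d_model, d_model)] * 4
# Gated MLP: gate, up, and down projections.
mlp_shapes = [(d_ff, d_model), (d_ff, d_model), (d_model, d_ff)]

def dense_params(shapes):
    return sum(d_out * d_in for d_out, d_in in shapes)

def lora_params(shapes, rank):
    # One (B, A) pair per adapted matrix: d_out*rank + rank*d_in parameters.
    return sum(rank * (d_out + d_in) for d_out, d_in in shapes)

# Most of the block's weights, and hence its adaptable capacity, sit in the MLP:
mlp_to_attn = dense_params(mlp_shapes) / dense_params(attn_shapes)  # ≈ 2.6x
```

Under these assumptions, the MLP holds about 2.6x the attention parameters, so attention-only LoRA adapts a minority of the block's weights.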
The Role of Rank and Scaling
The choice of LoRA rank $r$ and scaling factor $\alpha$ also plays a crucial role in performance. The equation for $W'$ includes these parameters:
$$W' = W + \frac{\alpha}{r}BA$$
where $r$ is the LoRA rank, $\alpha$ is the LoRA scaling factor, and $A$, $B$ are the LoRA weight matrices of rank $r$. The $1/r$ factor in the scaling makes the optimal learning rate approximately independent of rank.
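The role of the rank can be seen by expanding the update on toy dimensions: $BA$ is a sum of $r$ rank-1 outer products, one per rank direction, each scaled by $\alpha/r$. All values below are arbitrary toy numbers:

```python
import numpy as np

# The scaled update Δ = (α/r)·BA, written two equivalent ways (toy dims).
d_out, d_in, r, alpha = 6, 5, 3, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

W_prime = W + (alpha / r) * (B @ A)

# BA decomposes into a sum of r rank-1 outer products.
delta = sum(np.outer(B[:, i], A[i, :]) for i in range(r))
assert np.allclose(W_prime, W + (alpha / r) * delta)
```

Raising $r$ adds more rank-1 directions (more capacity), while the $\alpha/r$ factor shrinks each one's contribution, which is what keeps learning-rate tuning roughly rank-agnostic.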
Practical Implications and Applications
LoRA's strengths and weaknesses have significant implications for various applications.
- Resource-constrained environments: LoRA is ideal for scenarios where computational resources are limited, such as edge devices or mobile applications.
- Rapid prototyping: LoRA's efficiency allows for faster experimentation and prototyping of different fine-tuning strategies.
- Knowledge retention: LoRA's ability to preserve pre-trained knowledge makes it suitable for tasks where maintaining a broad range of capabilities is crucial.
- Hybrid approaches: Combining LoRA with other fine-tuning techniques, such as full fine-tuning on specific layers, can potentially strike a balance between accuracy and efficiency.
tags: #LoRA #LoRALearnsLessAndForgetsLess

