SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR, a framework for contrastive learning of visual representations, simplifies existing contrastive self-supervised learning algorithms by removing the need for specialized architectures or a memory bank. It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.
Introduction to Self-Supervised Learning
Traditional deep learning relies heavily on labeled datasets, which are expensive and time-consuming to create. Self-supervised learning addresses this challenge by letting models learn from raw, unlabeled data, producing robust visual features that transfer to a variety of downstream tasks. After self-supervised pretraining, even a simple classifier on top of the learned features can achieve competitive results.
In computer vision, a common approach to self-supervision is to take different crops of an image, or apply different augmentations, and pass the modified inputs through the model. Because the resulting views still depict the same content, the model learns to assign them similar latent representations.
SimCLR Framework
The SimCLR framework comprises four key components: data augmentation, a base encoder, a projection head, and a contrastive loss function.
Data Augmentation
A stochastic data augmentation module transforms any given data example randomly, resulting in two correlated views of the same example, denoted ~xi and ~xj, as positive pairs. The composition of data augmentations plays a critical role in defining effective predictive tasks.
Three simple augmentations are applied sequentially:
- Random Cropping (with resize): randomly cropping and resizing images yields contrastive prediction tasks ranging from global-to-local view prediction to adjacent view prediction.
- Random Color Distortions: Color jittering is applied as an augmentation.
- Random Gaussian Blur: Blurring is applied as an augmentation.
No single transformation suffices to learn good representations. The combination of random crop and color distortion is crucial to achieving good performance.
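The stochastic augmentation module can be sketched in plain NumPy. This is a minimal illustration, not the paper's exact pipeline (which uses torchvision-style random resized crop, color jitter, and Gaussian blur with specific hyperparameters); the helper names and the brightness/channel-jitter scheme here are simplified assumptions, and blur is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, out_size=32):
    """Randomly crop a square patch, then resize back with nearest-neighbour sampling."""
    h, w, _ = img.shape
    size = rng.integers(out_size // 2, min(h, w) + 1)
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = img[top:top + size, left:left + size]
    idx = np.linspace(0, size - 1, out_size).astype(int)
    return patch[idx][:, idx]

def color_distort(img, strength=0.5):
    """Simplified color jitter: random brightness and per-channel scaling, clipped to [0, 1]."""
    img = img * rng.uniform(1 - strength, 1 + strength)          # brightness
    img = img * rng.uniform(1 - strength, 1 + strength, size=3)  # per-channel jitter
    return np.clip(img, 0.0, 1.0)

def augment(img):
    """One stochastic view: crop -> color distortion (Gaussian blur omitted here)."""
    return color_distort(random_crop_resize(img))

img = rng.random((32, 32, 3))          # a dummy image with values in [0, 1]
x_i, x_j = augment(img), augment(img)  # two correlated views: a positive pair
print(x_i.shape, x_j.shape)            # both (32, 32, 3)
```

Applying the same stochastic module twice to one image is exactly what produces the positive pair (~xi, ~xj) described above.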
Base Encoder
A neural network base encoder f(·) extracts representation vectors from augmented data examples. ResNet is used to obtain hi = f(~xi) = ResNet(~xi), where hi is the d-dimensional output after the average pooling layer. Notably, unsupervised learning benefits more from bigger models than its supervised counterpart does.
Projection Head
A small neural network projection head g(·) maps representations to the space where the contrastive loss is applied. A multilayer perceptron (MLP) with one hidden layer is used to obtain zi = g(hi) = W(2) σ(W(1) hi), where σ is a ReLU nonlinearity. Defining the contrastive loss on the zi's rather than the hi's is found to be beneficial. After training, the projection head is discarded and only f(x) is used for downstream tasks.
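The projection head zi = g(hi) = W(2) σ(W(1) hi) is just a one-hidden-layer MLP. Below is a minimal NumPy sketch; the random weight initialization is purely illustrative, while the dimensions (2048-d encoder output, 128-d projection) follow the paper's ResNet-50 setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, proj_dim = 2048, 128  # ResNet-50 feature size; 128-d projection space

# Illustrative randomly initialised weights for the MLP g(.)
W1 = rng.standard_normal((d, d)) * 0.01
W2 = rng.standard_normal((d, proj_dim)) * 0.01

def projection_head(h):
    """z = g(h) = W2^T relu(W1^T h): the loss is computed on z; downstream tasks use h."""
    return np.maximum(h @ W1, 0.0) @ W2

h = rng.standard_normal((4, d))  # a batch of 4 encoder representations
z = projection_head(h)
print(z.shape)                   # (4, 128)
```

After pretraining, g(·) (that is, W1 and W2) is thrown away and h is used as the representation.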
Contrastive Loss
A minibatch of N examples is randomly sampled. The contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in 2N data points. Given a positive pair, the other 2(N-1) augmented examples within the minibatch serve as negative examples. The loss function for a positive pair of examples (i, j) is defined as:

ℓ(i, j) = −log [ exp(sim(zi, zj)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(zi, zk)/τ) ]

where sim(u, v) = uᵀv / (‖u‖‖v‖) is cosine similarity, τ is a temperature parameter, and 1[k≠i] is an indicator that equals 1 iff k ≠ i. The final loss is computed across all positive pairs, both (i, j) and (j, i), in a minibatch. This loss is named NT-Xent (the normalized temperature-scaled cross-entropy loss).
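The NT-Xent loss can be written compactly over the full 2N × 2N similarity matrix. Below is a minimal NumPy sketch, assuming rows 2k and 2k+1 of z are the two views of example k; the pairing convention and τ = 0.5 are illustrative choices, not fixed by the framework.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss sketch. z has shape (2N, dim); rows 2k and 2k+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm rows: dot product = cosine sim
    sim = z @ z.T / tau                               # (2N, 2N) temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude k == i from the denominator
    pos = np.arange(len(z)) ^ 1                       # index of each row's positive: 0<->1, 2<->3, ...
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()                           # average over both (i, j) and (j, i)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 128))  # 2N = 8 projections for a minibatch of N = 4
print(nt_xent(z))
```

Note that each row's loss is a softmax cross-entropy over 2N − 1 candidates, which is why the loss drops as positive pairs align and negatives spread apart.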
SOTA Comparison
SimCLR significantly outperforms previous methods for self-supervised and semi-supervised learning on ImageNet.
Linear Evaluation on ImageNet
A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, a 7% relative improvement over the previous state of the art, matching the performance of a supervised ResNet-50. This best result is obtained with a ResNet-50 (4×) backbone.
Few Labels Evaluation on ImageNet
SimCLR significantly improves over the state of the art when fine-tuned with only 1% or 10% of the labels; with 1% of the labels it reaches 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.
Transfer Learning
When fine-tuned on 12 natural image classification datasets, SimCLR significantly outperforms the supervised baseline on 5, while the supervised baseline is superior on only 2; on the remainder the two are roughly on par.
Key Findings
The study of SimCLR's components reveals several key findings:
- Data Augmentation Composition: The composition of data augmentations plays a critical role in defining effective predictive tasks.
- Nonlinear Transformation: Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.
- Batch Size and Training Steps: Contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

