Learning Transferable Visual Models from Natural Language Supervision: An Overview of CLIP

CLIP (Contrastive Language-Image Pre-training) is a groundbreaking image classification model developed by OpenAI and published in February 2021. CLIP marks a major advance in demonstrating the value of large-scale pre-training, and it shows how far both computer vision and natural language processing have come as a result of the transformer architecture. Unlike traditional image classification models that identify objects from a predefined set of categories, CLIP learns from the text associated with an image, enabling zero-shot transfer and classification using arbitrary labels.

Introduction to CLIP

Traditional object identification models predict fixed, predetermined categories. This approach requires new labeled data whenever objects from a new category must be recognized. CLIP solves this problem by learning from the text associated with an image rather than from manually assigned labels. The basic idea is to use a model that maps text prompts to a latent space, much like a sentence embedding - in CLIP's case, a Transformer. CLIP was pre-trained on 400 million images and their corresponding text from the Internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer and classification using arbitrary labels that were never part of any previous training.

The Core Idea: Contrastive Language-Image Pre-training

CLIP, short for Contrastive Language-Image Pre-training, leverages a vast dataset of images and their captions crawled from the internet to train two models: one for encoding images and another for encoding text. The training objective is to increase the cosine similarity between the embeddings of corresponding images and captions while minimizing the similarity between non-corresponding pairs. This is achieved through a contrastive loss similar to the one used for word embeddings: for each (image, caption) pair, spurious pairs are also formed by matching the image with a random caption and the caption with a random image.

CLIP's Architecture: A Dual-Encoder Approach

CLIP’s architecture is built around two separate encoders: a vision encoder for processing images and a text encoder for analyzing textual descriptions.

Vision Encoder

The vision encoder in CLIP utilizes two primary architectures: ResNet-50 a Convolutional Neural Network (CNN), and the Vision Transformer (ViT). Both serve the same purpose - transforming images into high-dimensional feature vectors that can be compared with text embeddings - but they achieve this through different methods. The choice of encoder depends on the task at hand and the computational resources available, offering flexibility and allowing for comparative research to determine the most effective architecture for specific scenarios.


ResNet-50 Enhancements

ResNet-50 has been enhanced in CLIP with several modifications to optimize its performance within this framework. First, ResNet-D improvements have been applied to optimize the downsampling process, preserving more spatial detail. One key aspect of these improvements is placing the stride on a 3x3 convolution rather than on a 1x1 convolution. In this context, the 3x3 convolution refers to a small filter, or kernel, that slides over the input data (the image or feature map), covering a 3x3 area at a time. The stride determines how far the filter moves with each step: a stride of 2 moves the filter two pixels at a time, roughly halving each spatial dimension of the output feature map. This technique downsamples the input data more effectively, retaining important spatial details while making the computation more efficient.
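The stride arithmetic described above can be verified with the standard convolution output-size formula. The helper function below is ours, purely for illustration; it shows how a stride-2 3x3 convolution with padding 1 halves each spatial dimension while a stride-1 version preserves it:

```python
def conv_output_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

# A stride-2 3x3 convolution (padding 1) halves a 224-pixel input:
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112
# A stride-1 3x3 convolution (padding 1) preserves the size:
print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
```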

Second, antialiased Rect-2 blur pooling has been introduced to reduce aliasing artifacts, which are distortions that can occur during downsampling. This technique applies a blur filter before downsampling, ensuring that high-frequency details, like edges, are captured more accurately.
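A one-dimensional toy example makes the aliasing problem concrete: naive stride-2 subsampling of an alternating signal can discard every high value, while blurring with a rect-2 (two-tap box) filter first preserves the signal's average level. This is a simplified sketch, not CLIP's actual implementation:

```python
import numpy as np

def blur_pool_1d(x: np.ndarray) -> np.ndarray:
    """Anti-aliased rect-2 downsampling: blur with a [0.5, 0.5] box
    filter, then subsample every other element (stride 2)."""
    blurred = np.convolve(x, [0.5, 0.5], mode="valid")  # average adjacent pairs
    return blurred[::2]

signal = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
# Naive stride-2 subsampling keeps only the zeros (aliasing):
print(signal[::2])            # [0. 0. 0. 0.]
# Blur pooling preserves the signal's average level instead:
print(blur_pool_1d(signal))   # [0.5 0.5 0.5 0.5]
```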

Lastly, CLIP replaces the traditional global average pooling with an attention pooling mechanism. This mechanism uses transformer-style multi-head query, key, value (QKV) attention to dynamically focus on different parts of the image, allowing the model to emphasize the most relevant regions in relation to the text description. These three enhancements - ResNet-D improvements, antialiased blur pooling, and attention pooling - work together to create more robust and contextually aware image representations.
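The following is a simplified, single-head sketch of such an attention pooling step, assuming NumPy and identity projection weights purely for illustration (CLIP's real version is multi-head with learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, Wq, Wk, Wv):
    """Single-head QKV attention pooling over spatial features.

    features: (num_positions, dim) grid of image features. The query is
    derived from the global average of the features, so the output is a
    weighted summary that can emphasize the most informative positions.
    """
    q = features.mean(axis=0) @ Wq             # (dim,) query from the mean
    k = features @ Wk                          # (num_positions, dim) keys
    v = features @ Wv                          # (num_positions, dim) values
    scores = softmax(k @ q / np.sqrt(q.size))  # attention weights over positions
    return scores @ v                          # (dim,) pooled representation

rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 64))              # e.g. a 7x7 feature grid
Wq = Wk = Wv = np.eye(64)                      # identity weights for the sketch
pooled = attention_pool(feats, Wq, Wk, Wv)
print(pooled.shape)  # (64,)
```

Unlike global average pooling, which weights every position equally, the attention weights here vary per image, letting salient regions dominate the pooled vector.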

Vision Transformer (ViT)

In addition to ResNet-50, CLIP also employs the Vision Transformer (ViT) as an alternative architecture for the vision encoder. Unlike CNNs, which process images using convolutional layers, ViT treats an image as a sequence of patches, much like how transformers process sequences of words in text. The image is first divided into fixed-size patches, and each patch is then flattened and linearly embedded into a vector. These vectors, representing different regions of the image, are combined with positional embeddings that provide information about the position of each patch in the original image, preserving spatial relationships.

These embedded patches are fed into a standard transformer model, where self-attention mechanisms allow the model to consider relationships between all patches across the entire image simultaneously. This approach enables ViT to capture both global context and fine-grained details, which are important for tasks like image-text matching. An additional layer normalization step is applied to the combined patch and position embeddings before they enter the transformer, which enhances the model’s stability and performance during training. The ViT’s ability to effectively process images as sequences and capture intricate relationships between different parts of an image makes it a powerful component in CLIP’s architecture.
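The patching and embedding steps can be sketched in a few lines of NumPy. The function name and the random projection and position weights below are illustrative stand-ins for learned parameters, not CLIP's actual code:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches,
    as the Vision Transformer does before its linear embedding."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .swapaxes(1, 2)
            .reshape(rows * cols, patch * patch * c))

image = np.zeros((224, 224, 3))
patches = patchify(image, patch=32)
print(patches.shape)   # (49, 3072): a 7x7 grid of patches, each 32*32*3 values

# Linear embedding plus learned positional embeddings (random here,
# purely for illustration):
dim = 512
W = np.random.randn(3072, dim) * 0.01    # patch projection
pos = np.random.randn(49, dim) * 0.01    # positional embeddings
tokens = patches @ W + pos               # (49, 512) sequence for the transformer
print(tokens.shape)
```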


Text Encoder

The text encoder in CLIP is a transformer similar in architecture to those used in the GPT series, with the input sequence capped at 76 tokens. The text sequence is framed with [SOS] (start of sequence) and [EOS] (end of sequence) tokens, and the transformer's output at the [EOS] token is used as the text embedding. This embedding is layer normalized and then linearly projected into a shared multi-modal embedding space for comparison with image embeddings.
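A minimal sketch of this extraction step, with random values standing in for the trained transformer's outputs and projection weights (names and shapes here are illustrative, not CLIP's actual code):

```python
import numpy as np

def text_embedding(token_features: np.ndarray, eos_index: int,
                   W_proj: np.ndarray) -> np.ndarray:
    """Take the transformer features at the [EOS] position,
    layer-normalize them, and project into the shared space."""
    x = token_features[eos_index]                 # (width,) features at [EOS]
    x = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # layer norm (no learned scale/bias)
    return x @ W_proj                             # (embed_dim,) text embedding

features = np.random.randn(77, 512)    # one token sequence, width 512
W = np.random.randn(512, 512) * 0.02   # linear projection to the joint space
emb = text_embedding(features, eos_index=10, W_proj=W)
print(emb.shape)  # (512,)
```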

Contrastive Learning: Aligning Images and Text

At the core of CLIP’s success is its use of contrastive learning, a technique where the model learns to align images and text by comparing pairs. This approach contrasts with traditional supervised methods, which rely on labeled datasets for each specific task. By leveraging the relationships between data pairs, contrastive learning enables CLIP to create robust, generalizable embeddings that can perform zero-shot classification across a wide range of tasks. This method allows CLIP to handle diverse and previously unseen datasets effectively, a key aspect of its innovative design. The training data is a batch of (image, text) tuples. During training, the model is optimized so that the inner product of the vector from the image encoder and the vector from the text encoder is high when the image/text association is correct and low otherwise.

Dataset Construction: WebImageText (WIT)

A critical component of CLIP’s success is the extensive and diverse dataset used for training. The WebImageText (WIT) dataset, constructed specifically for this purpose, consists of 400 million image-text pairs collected from publicly available sources on the internet. The dataset was carefully curated to ensure high quality, filtering out images with automatically generated filenames or irrelevant metadata, and retaining pairs where the text was in English and contained natural language descriptions. To capture a wide range of visual concepts, roughly 500,000 different queries - built from terms appearing frequently in the English version of Wikipedia - were used to identify relevant image-text pairs, with up to 20,000 images collected per query to keep the dataset approximately balanced across a diverse array of objects, scenes, and situations.

Training Process and Efficiency

CLIP was trained from scratch on the WIT dataset, with a focus on efficiency and scalability. The core of CLIP’s training approach is contrastive learning, where the model learns by comparing pairs of images and text. The training objective involves predicting which of the N possible image-text pairings in a batch are correct. The model maximizes the cosine similarity between the embeddings of correct pairs while minimizing the similarity for incorrect pairs. Each batch considers all possible N² pairs (correct and incorrect), with a symmetric cross-entropy loss function optimizing these cosine similarity scores.

Given the scale of the dataset, efficiency was a primary consideration. CLIP employs a linear projection to map the encoder representations into the contrastive embedding space, where the model can effectively compare the embeddings of images and text. This approach was found to be more efficient without compromising performance, as it avoids the complexity of non-linear transformations while still enabling the model to learn robust embeddings.


Scaling CLIP for Performance

Scaling plays an important role in the adaptability and performance of the CLIP architecture. To explore the scalability of CLIP, different versions of the model were trained with varying computational resources. For the ResNet architecture, scaling follows an EfficientNet-style approach of jointly increasing the network's width (the number of filters per layer), depth, and input resolution. CLIP includes variants like RN50x4, RN50x16, and RN50x64, where the numbers indicate approximately how much more compute the model uses relative to the base ResNet-50. This scaling strategy allows CLIP to balance performance and computational efficiency, making the ResNet models more adaptable to different tasks and hardware constraints.

The Vision Transformer was also scaled, with versions such as ViT-B/32, ViT-B/16, and ViT-L/14. In these models, the numbers typically indicate the size of the model and the input resolution (for example, B/32 refers to a base model with a patch size of 32 pixels). The largest Vision Transformer, ViT-L/14, was further fine-tuned at a higher resolution of 336 pixels for one additional epoch to enhance performance. This scaling adjusts the transformer’s dimensions and the input resolution, optimizing the model for various levels of detail and computational cost.
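Patch size directly determines the sequence length the transformer must process at a 224x224 input, which is one reason smaller patches cost more compute. Counting patches per side, squared, plus one class token:

```python
# Sequence length seen by each ViT variant at 224x224 input resolution.
lengths = {}
for name, patch in [("ViT-B/32", 32), ("ViT-B/16", 16), ("ViT-L/14", 14)]:
    per_side = 224 // patch              # patches along each image side
    lengths[name] = per_side ** 2 + 1    # grid of patches + class token
    print(name, lengths[name])
# ViT-B/32 50
# ViT-B/16 197
# ViT-L/14 257
```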

For the text encoder, only the width was scaled proportionally to the ResNet, as CLIP’s performance was less sensitive to the depth of the text encoder.

Training Duration and Resources

Training CLIP required substantial computational resources, particularly for the largest models in the architecture. For instance, the largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs, while the largest Vision Transformer, ViT-L/14, took 12 days on 256 V100 GPUs. To efficiently manage memory and computational demands during this extensive training process, two key techniques were employed: mixed-precision training and gradient checkpointing.

Mixed-Precision Training

Mixed-precision training is a method that uses both 16-bit (FP16) and 32-bit (FP32) floating-point precision during training. By leveraging FP16 for most computations, such as matrix multiplications and convolutions, this technique significantly reduces memory usage and accelerates the training process. Important operations that require higher precision, such as gradient calculations, continue to use FP32 to maintain numerical stability. This balance allows larger models to be trained more efficiently on the same hardware, making it possible to handle the extensive computational workload required by models like RN50x64 and ViT-L/14.
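The trade-off is easy to demonstrate with NumPy's half-precision type: FP16 has only about 10 bits of mantissa and a maximum value near 65,504, so both small updates and large magnitudes are lost.

```python
import numpy as np

# FP16 halves memory per value but sacrifices precision and range.
print(np.float16(1.0001))                    # rounds to 1.0 (too few mantissa bits)
print(np.float32(1.0001))                    # keeps 1.0001
print(np.array(70000.0).astype(np.float16))  # overflows to inf (max ~65504)

# This is why accumulations needing exact small updates (such as
# weight updates from gradients) stay in FP32 during mixed-precision training:
w16 = np.float16(1.0) + np.float16(1e-4)     # update lost entirely in FP16
print(w16)                                   # 1.0
w32 = np.float32(1.0) + np.float32(1e-4)     # update survives in FP32
print(w32)
```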

Gradient Checkpointing

Gradient checkpointing is another technique that was essential in managing the memory requirements of these large-scale models. During the forward pass of training, instead of storing all intermediate activations, gradient checkpointing strategically saves checkpoints at specific layers. The remaining activations are discarded, reducing memory usage. When gradients need to be computed during the backward pass, the model recomputes the discarded activations using the saved checkpoints. While this method increases the computational load due to the need for recomputation, it drastically reduces the memory footprint, enabling the training of large models within the constraints of available GPU memory.
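A toy sketch of the idea, using a chain of trivial Python functions in place of network layers: store only every fourth activation during the forward pass, then recompute the rest from the nearest checkpoint when the backward pass needs them.

```python
def forward(x, layers):
    """Plain forward pass: stores every intermediate activation."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts                      # len(layers) + 1 stored values

def forward_checkpointed(x, layers, every=4):
    """Checkpointed forward pass: keeps only every `every`-th activation."""
    ckpts = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % every == 0:
            ckpts[i + 1] = x
    return ckpts

def recompute(ckpts, layers, i, every=4):
    """Recover activation i from the nearest earlier checkpoint
    by re-running the intervening layers (extra compute, less memory)."""
    start = (i // every) * every
    x = ckpts[start]
    for f in layers[start:i]:
        x = f(x)
    return x

layers = [lambda v, a=a: v + a for a in range(16)]   # 16 toy "layers"
full = forward(1.0, layers)
ckpts = forward_checkpointed(1.0, layers)
print(len(full), len(ckpts))                     # 17 vs 5 stored values
print(recompute(ckpts, layers, 10) == full[10])  # True
```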

In addition to these techniques, CLIP’s training process also benefits from its robustness to data shifts and variations, an advantage over models that rely heavily on specific datasets like ImageNet. Traditional models often experience significant performance drops when faced with data that deviates from the training set, but CLIP’s ability to maintain performance across different types of images - such as sketches or adversarial examples - demonstrates its broader applicability. This robustness is a direct result of CLIP’s training methodology, which focuses on aligning images and text across a wide range of contexts.

Furthermore, CLIP’s training on diverse, web-scraped data enables it to generalize more effectively across tasks. Unlike models that overfit to the specific categories they are trained on, CLIP’s contrastive learning approach allows it to capture more complex relationships between images and text, making it particularly effective for tasks involving natural images and contexts that are less structured than traditional datasets. This capability is reflected in its strong performance on zero-shot tasks and its resilience in scenarios with few labeled examples, further underscoring the importance of its training resources and methodologies.

These techniques were important in making the training of CLIP’s massive models feasible, ensuring that the process could be completed within a reasonable timeframe while optimizing resource utilization.

Zero-Shot Learning with CLIP

CLIP's ability to perform zero-shot learning is one of its most remarkable features. Zero-shot learning is a process in which a model is trained on an independent corpus of data but still can be tested on benchmark datasets without ever previously “seeing” or being trained on these standard, canonical datasets.

After being pre-trained on the WIT dataset, CLIP can be directly applied to various downstream tasks without any task-specific training. This is achieved by encoding the class names (or prompts built from them) of the target dataset using the trained text encoder. The inner product of the encoded image vector with each encoded class name is then computed, and the label with the highest value is taken as the prediction.

For example, on the Oxford-IIIT Pets dataset, the query "A photo of a {label}, a type of pet" provides more appropriate context. For OCR datasets, performance can be improved by putting quotes around the text or number to be recognized. For satellite image classification, the sentence "A satellite photo of a {label}" can be used. Given an input image input.jpg and the candidate labels human, dog, and cat, CLIP returns the probability of each label.
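The scoring step can be sketched with random vectors standing in for the outputs of CLIP's two encoders (the function names and the temperature value below are illustrative, not CLIP's actual API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Score one image against a set of prompt embeddings: cosine
    similarity with each candidate label's text embedding, softmaxed
    into a probability distribution over the labels."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return softmax(text_embs @ image_emb / temperature)

labels = ["a photo of a human", "a photo of a dog", "a photo of a cat"]
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)       # stand-in for encoding input.jpg
text_embs = rng.normal(size=(3, 512))  # stand-in for encoding the prompts
probs = zero_shot_probs(image_emb, text_embs)
print(dict(zip(labels, probs.round(3))))  # probabilities sum to 1; highest wins
```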

CLIP's Performance and Advantages

As reported in the paper “Learning Transferable Visual Models from Natural Language Supervision,” the researchers compared Vision Transformers and ResNets. Zero-shot CLIP and its “linear probe” variant solidly outperformed both a ResNet-50 baseline and the previous state of the art, BiT-M and EfficientNet Noisy Student, by at least five percentage points of average classification accuracy. CLIP also matched the accuracy of numerous other strong models with roughly a fourfold improvement in training efficiency.

Thus, if you want to build your own machine learning vision classifier without training, start with CLIP as a “rough and ready,” accurate zero-shot model and add a logistic regression (a linear probe) as your output layer. If you do have time to train on your own dataset, you can be reasonably confident of matching or outperforming other state-of-the-art vision classifiers. Building a competitive vision AI system just got a whole lot easier.

CLIP shows promise as a new standard baseline model, whose accuracy typically improves further when the final output layer is replaced with a simple logistic regression and trained for one or more epochs on a benchmark dataset such as Oxford Pets.

Limitations of CLIP

One oddity of CLIP is that its prompts often work best when prefaced with “a photo of …,” “a type of …,” or even “a centered satellite photo of …,” and yet CLIP isn’t perfect. In fact, it performs worse on the standard MNIST dataset of handwritten digits than a simple logistic regression fit on raw pixels. More specifically, the authors identify three typical failure conditions:

  • Fine-grained classification (very narrow classes)
  • Complex queries (estimating distance in an image)
  • Datasets not commonly seen in image searches (handwritten digits in MNIST, for example)

Beyond traditional performance metrics, CLIP also absorbs certain biases commonly found in web images and their descriptions, although the authors report that on bias benchmark datasets such as FairFace, CLIP does better than several alternatives. Still, images of certain ages, genders, and races are more likely to be matched with certain “animal” classes, and certain ages and races are more likely to be categorized as a “thief” or “criminal.” The authors encourage further research into transfer learning and training runs on specific datasets in order to reduce the biases inherent in CLIP.

Applications of CLIP

CLIP was evaluated against more than 30 different benchmarks, including OCR, action recognition in videos, geo-localization, and general object classification. For most tasks, CLIP performs as well as models trained on the benchmark’s own dataset, even though it never uses the benchmark training data and undergoes no additional training.

GLIDE, a text-to-image generation model, is one prominent application of CLIP; it was state of the art until DALL-E 2 arrived.

Conclusion

Three years after its introduction, CLIP continues to be a groundbreaking model in multimodal AI, significantly influencing both research and real-world applications. Its innovative use of contrastive learning to integrate visual and textual data has led to widespread adoption across diverse domains, from image generation and content moderation to creative tools. CLIP’s ability to generalize across various datasets without task-specific training remains a key strength, making it an invaluable and versatile tool. Notably, its use alongside segmentation models such as SAM 2 (Segment Anything Model 2) underscores its ongoing relevance, supporting tasks such as object segmentation guided by textual descriptions.
