Zero-Shot Learning: Recognizing the Unseen in Artificial Intelligence
Zero-shot learning (ZSL) represents a paradigm shift in machine learning, enabling models to recognize and classify objects or concepts they have never encountered during training. Unlike traditional supervised learning, which requires labeled examples for every class, ZSL leverages auxiliary information to bridge the gap between seen and unseen classes. This approach holds immense potential for various applications, particularly in scenarios where obtaining labeled data is challenging or impossible.
The Essence of Zero-Shot Learning
In essence, ZSL empowers models to generalize to novel categories by associating observed and non-observed classes through auxiliary information. This information encodes observable distinguishing properties of objects, allowing the model to infer connections between new class descriptions and its learned feature space.
Consider a model trained to recognize horses but never exposed to zebras. With ZSL, if the model knows that zebras look like striped horses, it can still recognize a zebra when presented with one. This ability to "understand the labels" is crucial for classifying examples without relying on annotated data.
How Zero-Shot Learning Works: A Detailed Breakdown
The process of ZSL can be broken down into the following key stages:
- Training on Seen Classes: The model is trained on a labeled dataset of "seen" classes, learning to associate visual features with semantic attributes.
- Auxiliary Information: Auxiliary information, such as textual descriptions, semantic information, or word embeddings, is provided to describe the characteristics of both seen and unseen classes.
- Semantic Space: Both seen and unseen classes are related in a high-dimensional vector space, called semantic space, where the knowledge from seen classes can be transferred to unseen classes.
- Inference on Unseen Classes: When presented with an instance from an "unseen" class, the model utilizes the auxiliary information and learned associations to predict the class label.
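The four stages above can be sketched in a few lines of code. This is a minimal toy example with invented class names and attribute values: the trained feature-to-attribute model is mocked as a hard-coded attribute estimate, and inference is a nearest-prototype lookup in the shared semantic space.

```python
# Toy sketch of the ZSL stages using hand-made binary attribute vectors.
# Classes and attributes are illustrative, not from a real dataset.

# Stage 2: auxiliary information -- attributes [has_stripes, has_mane, has_wings].
seen_classes = {"horse": [0, 1, 0], "bird": [0, 0, 1]}
unseen_classes = {"zebra": [1, 1, 0]}

def predict(attribute_estimate, candidate_classes):
    """Stage 4: pick the class whose prototype is nearest (L1 distance)."""
    def dist(proto):
        return sum(abs(a - b) for a, b in zip(attribute_estimate, proto))
    return min(candidate_classes, key=lambda c: dist(candidate_classes[c]))

# Stages 1 and 3 are mocked: imagine the trained model maps an image of a
# zebra to this attribute estimate in the shared semantic space.
estimated = [0.9, 0.8, 0.1]
print(predict(estimated, {**seen_classes, **unseen_classes}))  # -> zebra
```

Even though "zebra" never appeared in training, its prototype lives in the same attribute space, so nearest-prototype inference can still select it.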
Types of Zero-Shot Learning
Several ZSL techniques exist, each addressing specific challenges and leveraging different types of auxiliary information:
Learning with Attributes
In this approach, classes are accompanied by pre-defined structured descriptions, outlining their observable characteristics. For example, animals might be described by attributes like "has wings," "is a mammal," or "is carnivorous."
Learning from Textual Descriptions
This technique leverages natural language processing to extract semantic information from textual descriptions of classes. Class labels are augmented with definitions or free-text descriptions, allowing the model to "understand" the meaning of each class.
Class-Class Similarity
Here, classes are embedded in a continuous space, where the relationships between classes are explicitly modeled. This allows the model to infer the characteristics of unseen classes based on their similarity to seen classes.
Attribute-Based Zero-Shot Learning
This involves training a classification model using specific attributes of labeled data. If the new class sufficiently resembles the attribute classes in the training data, a ZSL model can infer the label of new classes using these attributes.
Semantic Embedding-Based Zero-Shot Learning
Semantic embeddings are vector representations of attributes in a semantic space. Zero-shot learning models can learn these semantic embeddings from labeled data and associate them with specific classes during training. The model can infer the category of unknown data by measuring the similarity between embeddings using distance measures.
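The similarity measurement described above can be made concrete with cosine similarity. The class embeddings below are invented placeholders for vectors a model would learn from labeled data:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical semantic embeddings associated with classes during training.
class_embeddings = {"cat": [0.9, 0.1, 0.2], "truck": [0.1, 0.95, 0.3]}

def classify(instance_embedding):
    """Assign the class whose embedding is most similar to the instance."""
    return max(class_embeddings,
               key=lambda c: cosine_similarity(instance_embedding,
                                               class_embeddings[c]))

print(classify([0.8, 0.2, 0.1]))  # -> cat
```

Euclidean distance is a common alternative to cosine similarity; the choice of distance measure is part of the model design.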
Generalized Zero-Shot Learning (GZSL)
Unlike traditional ZSL, which is evaluated only on unseen classes, GZSL requires the model to classify test instances that may come from either seen or unseen classes. GZSL models still transfer knowledge from seen to unseen classes through their semantic attributes, establishing a relationship between the two sets of classes, but they must do so without defaulting to the seen classes encountered during training.
Multi-Modal Zero-Shot Learning
Multi-modal ZSL combines information from multiple data modalities, such as text, images, videos, and audio, to predict unknown classes. By training a model using images and their associated textual descriptions, for instance, an ML practitioner can extract semantic embeddings and discern valuable associations.
Methods to Solve Zero-Shot Learning Problems
The two most common approaches to the zero-shot recognition problem are:
Classifier-Based Methods
Existing classifier-based methods usually adopt a one-versus-rest strategy for training the multiclass zero-shot classifier: for each unseen class, they train a binary one-versus-rest classifier. Depending on how these classifiers are constructed, classifier-based methods can be divided into three subcategories.
Correspondence Methods
Correspondence methods construct the classifier for each unseen class via the correspondence between a class's binary one-versus-rest classifier and its class prototype. Each class has exactly one prototype in the semantic space, which can be regarded as the "representation" of that class; likewise, each class has a corresponding binary one-versus-rest classifier in the feature space, another "representation" of the same class. Correspondence methods learn a correspondence function between these two types of "representations," which can then be applied to the prototypes of unseen classes.
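A minimal numerical sketch of this idea: fit a linear correspondence function from seen-class prototypes to their learned classifier weights, then apply it to an unseen-class prototype. All matrices here are invented toy values, and a simple least-squares fit stands in for whatever learning procedure a real method would use.

```python
import numpy as np

# Prototypes of 2 seen classes in a 2-D semantic space (invented values).
P_seen = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
# Their binary one-versus-rest classifier weights in the feature space.
W_seen = np.array([[2.0, 0.5],
                   [0.3, 1.5]])

# Fit the correspondence function f(p) = p @ M by least squares.
M, *_ = np.linalg.lstsq(P_seen, W_seen, rcond=None)

# Apply the learned correspondence to an unseen-class prototype:
p_unseen = np.array([0.5, 0.5])
w_unseen = p_unseen @ M   # classifier weights built without any unseen data
print(w_unseen)
```

The unseen class gets a classifier purely from its prototype, which is exactly the point of correspondence methods.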
Relationship Methods
Such methods construct a classifier for the unseen classes based on inter- and intra-class relationships. In the feature space, binary one-versus-rest classifiers for the seen classes can be learned from the available data, while the relationships among the seen and unseen classes can be obtained by calculating the relationships among their corresponding prototypes. Relationship methods then construct the classifier for the unseen classes from these learned seen-class classifiers and the class relationships.
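In the simplest form, the unseen-class classifier is a similarity-weighted combination of the seen-class classifiers. The weights and similarities below are invented; in practice the similarities would come from distances between class prototypes in the semantic space.

```python
# Learned one-versus-rest classifier weights for two seen classes (invented).
seen_classifiers = {"car": [1.0, 0.2], "bus": [0.4, 1.0]}

# Similarity of the unseen class "truck" to each seen class, assumed to be
# computed from prototype relationships in the semantic space.
similarity_to_truck = {"car": 0.6, "bus": 0.4}

# Build the "truck" classifier as the similarity-weighted sum.
truck_classifier = [
    sum(similarity_to_truck[c] * seen_classifiers[c][i]
        for c in seen_classifiers)
    for i in range(2)
]
print(truck_classifier)  # -> [0.76, 0.52]
```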
Combination Methods
Combination methods construct the classifier for unseen classes by combining classifiers for the basic elements used to constitute the classes. In these methods, classes are regarded as being formed from a list of "basic elements," and each data point in the seen and unseen classes is a combination of those elements. For example, a "dog" class image will have a tail, fur, and so on. In the semantic space, each dimension represents one basic element, and each class prototype denotes the combination of elements for the corresponding class: each dimension of a prototype takes the value 1 or 0, denoting whether the class has the corresponding element. Methods in this category are therefore mainly suitable for semantic spaces with such binary dimensions.
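A small sketch of the combination idea: one detector per basic element, binary class prototypes over those elements, and a class score that combines the per-element predictions (here multiplied under an independence assumption; attribute names and probabilities are invented).

```python
# Probability that each "basic element" is present in a test image, as if
# produced by per-attribute classifiers trained on seen-class data.
attr_probs = {"tail": 0.9, "fur": 0.8, "wings": 0.1}

# Binary prototypes: 1/0 per element, as described in the text.
prototypes = {"dog":  {"tail": 1, "fur": 1, "wings": 0},
              "bird": {"tail": 1, "fur": 0, "wings": 1}}

def class_score(proto):
    """Probability the image matches the prototype, assuming independence."""
    score = 1.0
    for attr, present in proto.items():
        score *= attr_probs[attr] if present else (1 - attr_probs[attr])
    return score

best = max(prototypes, key=lambda c: class_score(prototypes[c]))
print(best)  # -> dog
```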
Instance-Based Methods
Instance-based methods aim first to obtain labeled instances for the unseen classes and then, with these instances, to train the zero-shot classifier. Depending on the source of these instances, existing instance-based methods can be classified into three subcategories.
Projection Methods
The idea of projection methods is to obtain labeled instances for the unseen classes by projecting both the feature-space instances and the semantic-space prototypes into a shared space. The feature space contains labeled training instances belonging to the seen classes, while the semantic space contains prototypes of both the seen and unseen classes. Both spaces are real vector spaces, so the prototypes can themselves be regarded as labeled instances; we therefore have labeled instances in two spaces. Projection methods map the instances from both spaces into a common space, yielding labeled instances that belong to the unseen classes.
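The two projections can be sketched as simple linear maps into a shared 2-D space; a test instance is then labeled by its nearest projected prototype. The maps and numbers below are stand-ins for learned projections.

```python
def project_feature(x):
    """Map a feature-space instance (3-D) into the shared space (2-D)."""
    return [x[0] + x[1], x[2]]

def project_prototype(p):
    """Map a semantic-space prototype (2-D) into the shared space (2-D)."""
    return [2 * p[0], p[1]]

# Prototypes of unseen classes in the semantic space (invented values).
prototypes = {"zebra": [0.5, 0.2], "whale": [0.1, 0.9]}

def classify(x):
    """Label an instance by its nearest prototype in the shared space."""
    z = project_feature(x)
    def dist(c):
        q = project_prototype(prototypes[c])
        return sum((a - b) ** 2 for a, b in zip(z, q))
    return min(prototypes, key=dist)

print(classify([0.6, 0.5, 0.3]))  # -> zebra
```

Because prototypes act as labeled instances in the shared space, a plain nearest-neighbor rule is enough to label unseen-class data.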
Instance-Borrowing Methods
These methods obtain labeled instances for the unseen classes by borrowing from the training instances, relying on similarities between classes. Take object recognition in images as an example: suppose we want to build a classifier for the class "truck" but have no corresponding labeled instances. However, we do have labeled instances of the classes "car" and "bus." Since these are similar objects, we can use their instances as positive examples when training a classifier for "truck." This approach mirrors how humans recognize objects and explore the world: we may never have seen instances of some classes, but with knowledge of similar classes we have seen, we can still recognize instances of the unseen ones.
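The truck/car/bus example can be sketched directly: instances of the similar seen classes are relabeled as positives for "truck," and a trivial nearest-mean classifier is trained on the borrowed labels. All feature values are invented.

```python
# Labeled seen-class instances (invented 2-D features).
data = {"car": [[0.9, 0.1]], "bus": [[0.8, 0.3]], "cat": [[0.1, 0.9]]}
similar_to_truck = ["car", "bus"]   # chosen by class similarity

# Borrow instances: similar classes become positives for "truck".
positives = [x for c in similar_to_truck for x in data[c]]
negatives = [x for c in data if c not in similar_to_truck for x in data[c]]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

mu_pos, mu_neg = mean(positives), mean(negatives)

def is_truck(x):
    """Nearest-mean decision trained entirely on borrowed labels."""
    dist = lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
    return dist(mu_pos) < dist(mu_neg)

print(is_truck([0.85, 0.2]))  # -> True
```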
Synthesizing Methods
The idea behind synthesizing methods is to obtain labeled instances for the unseen classes by synthesizing pseudo-instances using different strategies. Some methods assume that the instances of each class follow some kind of distribution. First, the distribution parameters for the unseen classes are estimated; then instances of the unseen classes are synthesized by sampling from those distributions.
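A minimal 1-D sketch, assuming Gaussian class distributions: the unseen class's parameters are taken as given (in a real method they would be estimated from its semantic description), pseudo-instances are sampled, and a supervised nearest-class-mean classifier is trained on real plus synthesized data.

```python
import random

random.seed(0)

# Real labeled instances for a seen class (1-D features, invented scale).
seen = {"horse": [random.gauss(2.0, 0.3) for _ in range(50)]}

# Assumed distribution parameters "estimated" for the unseen class "zebra".
unseen_mean, unseen_std = 5.0, 0.3
pseudo = [random.gauss(unseen_mean, unseen_std) for _ in range(50)]

# Supervised step: nearest-class-mean over real + synthesized instances.
means = {"horse": sum(seen["horse"]) / 50, "zebra": sum(pseudo) / 50}

def classify(x):
    return min(means, key=lambda c: abs(x - means[c]))

print(classify(4.8))  # -> zebra
```

Once pseudo-instances exist, any ordinary supervised classifier can be used, which is why synthesizing methods effectively reduce ZSL to supervised learning.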
Zero-Shot Learning Methods Evaluation Metrics
We use the average per-class top-k accuracy to evaluate zero-shot recognition results. For an N-class classification problem, the classifier outputs a probability distribution over classes for each test sample. A top-k (k ≤ N) prediction is correct when the actual class of the sample (i.e., its label) lies among the k classes with the highest predicted probabilities.
For example, for a 5-class classification problem, if the predicted probability distribution is: [0.35, 0.20, 0.15, 0.25, 0.05] for classes 0,1,…,4, then the top-2 classes are class-0 and class-3. If the original data label is either class-0 or class-3, the trained classifier is said to have predicted correctly.
Thus, the average per-class top-k accuracy is computed as

acc = (1/C) · Σ_{c=1}^{C} (number of correct top-k predictions for class c) / (number of test samples of class c),

where "C" denotes the number of unseen classes and a prediction counts as correct according to the rule above.
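The metric is straightforward to implement. The sketch below checks the top-k rule on the worked 5-class example and averages per-class accuracies:

```python
def topk_correct(probs, true_label, k):
    """True if true_label is among the k highest-probability classes."""
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return true_label in topk

def per_class_topk_accuracy(predictions, labels, k, num_classes):
    """Average of per-class top-k accuracy over classes with test samples."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for probs, y in zip(predictions, labels):
        total[y] += 1
        correct[y] += topk_correct(probs, y, k)
    rates = [correct[c] / total[c] for c in range(num_classes) if total[c]]
    return sum(rates) / len(rates)

# The 5-class example from the text: the top-2 classes are 0 and 3.
probs = [0.35, 0.20, 0.15, 0.25, 0.05]
print(topk_correct(probs, 3, 2))  # -> True
```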
Applications of Zero-Shot Learning
The versatility of ZSL makes it applicable to a wide range of domains:
- Image Classification: Recognizing objects in images without needing examples of every object. For example, classifying bird species without having seen images of all species during training.
- Natural Language Processing: Understanding and responding to new words or concepts in text. This includes tasks like text classification, sentiment analysis, and machine translation.
- Healthcare: Diagnosing rare diseases without needing extensive examples. ZSL can leverage knowledge of related diseases and symptoms to identify unseen conditions.
- Visual Search Engines: Enabling search engines to understand and retrieve information about novel objects or concepts based on visual input.
- Pharmaceutical Compliance: Tagging regulatory sections across evolving clinical trial documents, regardless of new terminology or formats.
- Manufacturing Quality Systems: Identifying new defect types in inline optical inspections without prior examples.
- Investment Management Tools: Classifying novel financial instruments into asset classes for portfolio analysis and regulatory reporting.
Challenges and Limitations
Despite its potential, ZSL faces several challenges:
- Bias: Models can be biased towards seen classes, leading to poor performance on unseen classes.
- Domain Shift: The characteristics of seen and unseen classes may differ significantly, leading to inaccurate mappings and predictions.
- Semantic Loss: During training, the model may discard latent information in the seen classes that does not contribute significantly to its decision-making, even though that information could be important for recognizing unseen classes.
- Hubness: In high-dimensional data, some points, called hubs, frequently occur in the k-nearest-neighbor sets of other points. These hubs tend to be close to the semantic attribute vectors of many classes, so the nearest-neighbor search in semantic space used during the testing phase deteriorates in performance.
- Scalability: Scaling ZSL to handle a large number of unseen classes can be difficult, especially when the semantic space is not large enough to differentiate between many categories.
Recent Advances in Zero-Shot Learning
Recent advancements in ZSL include:
- Vision-Language Models (e.g., CLIP): OpenAI’s CLIP model maps images and text to the same semantic space, enabling zero-shot classification.
- Generative Models: Generative models like GANs create synthetic samples of unseen classes, turning zero-shot learning into a supervised learning problem.
- Self-Supervised Learning: This method allows models to learn robust class representations without extensive labeled data, enhancing their generalization to unseen classes.
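CLIP's mechanism can be illustrated without the real model: encode the image and one text prompt per candidate label into the same space, then pick the label whose text embedding is most similar to the image embedding. The two "encoders" below are hard-coded stand-ins for CLIP's actual image and text encoders, so this shows only the shape of the computation.

```python
import math

def fake_image_encoder(image_id):
    """Stand-in for CLIP's image encoder (returns a fixed 2-D embedding)."""
    return {"photo_of_zebra": [0.9, 0.1]}[image_id]

def fake_text_encoder(prompt):
    """Stand-in for CLIP's text encoder."""
    return {"a photo of a zebra": [0.8, 0.2],
            "a photo of a horse": [0.2, 0.9]}[prompt]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def zero_shot_classify(image_id, labels):
    """Pick the label whose prompt embedding best matches the image."""
    img = fake_image_encoder(image_id)
    return max(labels,
               key=lambda l: cosine(img, fake_text_encoder(f"a photo of a {l}")))

print(zero_shot_classify("photo_of_zebra", ["zebra", "horse"]))  # -> zebra
```

Because the label set is just a list of strings turned into prompts, new classes can be added at inference time with no retraining, which is what makes this style of model zero-shot.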
Zero-Shot Learning in Practice: A Real-World Example
Consider a scenario where you want to build an image classifier to identify different types of animals. Training a traditional supervised learning model would require a vast dataset of labeled images for each animal species. However, with ZSL, you can train the model on a smaller set of "seen" animal classes and then use auxiliary information, such as textual descriptions or attributes, to enable the model to recognize "unseen" animal classes.
For example, you could train the model on images of cats, dogs, and birds, along with textual descriptions of their characteristics. Then, when presented with an image of a zebra, the model could use the textual description of a zebra (e.g., "a striped horse-like animal") to infer that it belongs to a new, unseen class.
The Future of Zero-Shot Learning
Zero-shot learning represents a significant step towards more flexible and adaptable AI systems. By enabling models to generalize to unseen classes, ZSL opens up new possibilities for applications in various domains. As research in this area continues, we can expect to see even more sophisticated ZSL techniques emerge, further blurring the lines between what a machine has "seen" and what it can "understand."

