Multimodal Machine Learning: A Comprehensive Tutorial

The rise of multimodal data has created a growing need for machines to understand this data holistically. Data scientists and machine learning engineers often face challenges combining knowledge from existing tutorials, which typically address each modality separately. This article provides a comprehensive guide to multimodal machine learning, covering its core concepts, challenges, techniques, and applications.

Introduction to Multimodal Machine Learning

Multimodal machine learning involves developing computer algorithms that learn and improve performance by using multimodal datasets. It is a subfield of machine learning that trains AI models to process and find relationships between different types of data, such as images, video, audio, and text. By combining these modalities, a deep learning model can gain a more comprehensive understanding of its environment, as some cues are only present in certain modalities.

Consider the task of emotion recognition. It involves more than just analyzing facial expressions (the visual modality). The tone and pitch of a person's voice (the audio modality) convey significant information about their emotional state, information that may not be apparent from the facial expressions alone, even when the audio and visual streams are synchronized.

Unimodal models, which process only a single modality, have been extensively researched and have achieved remarkable results in fields like computer vision and natural language processing. However, the limitations of unimodal deep learning have led to the need for multimodal models.

For example, unimodal models may struggle with tasks like recognizing sarcasm or hate speech. In contrast, a multimodal model that processes both text and images can relate the two and discover the deeper meaning.


Modalities in Multimodal Learning

In multimodal deep learning, the most common modalities are visual (images, videos), textual, and auditory (voice, sounds, music). However, other modalities can be included, such as 3D visual data, depth sensor data, and LiDAR data (commonly used in self-driving cars). In clinical practice, imaging modalities include computed tomography (CT) scans and X-ray images, while non-image modalities include electroencephalogram (EEG) data. Sensor data, such as thermal data or data from eye-tracking devices, can also be used.

Any combination of these unimodal data types results in a multimodal dataset. For example:

  • Video + LiDAR + Depth Data: Suitable for self-driving car applications.
  • EEG + Eye Tracking Device Data: Connects eye movements with brain activity.

The most popular combinations include:

  • Image + Text
  • Image + Audio
  • Image + Text + Audio
  • Text + Audio

Multimodal Learning Challenges

Multimodal deep learning seeks to address five core challenges:

Representation

Multimodal representation involves encoding data from multiple modalities into a vector or tensor. Effective representations that capture the semantic information of raw data are vital for the success of machine learning models. However, extracting features from heterogeneous data in a way that leverages the synergies between them is challenging. It is essential to fully exploit the complementarity of different modalities while avoiding redundant information.


Multimodal representations fall into two categories:

  1. Joint Representation: Each modality is encoded and placed into a mutual high-dimensional space. This approach is direct and works well when modalities are similar.
  2. Coordinated Representation: Each modality is encoded independently, but their representations are coordinated by imposing a restriction.
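The two approaches can be sketched with toy vectors. The following NumPy snippet is illustrative only (random weights, arbitrary dimensions, no training): the joint branch merges both modalities into one shared vector, while the coordinated branch keeps two vectors and applies a restriction, here a cosine-similarity constraint that training would optimize.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unimodal features of different sizes (e.g. CNN and text-encoder outputs).
image_feat = rng.standard_normal(2048)
text_feat = rng.standard_normal(768)

# --- Joint representation: project both modalities into one shared space. ---
W_img = rng.standard_normal((512, 2048)) * 0.01
W_txt = rng.standard_normal((512, 768)) * 0.01
joint = np.tanh(W_img @ image_feat + W_txt @ text_feat)  # a single shared vector

# --- Coordinated representation: encode independently, then coordinate. ---
# The "restriction" here is cosine similarity: training would push matched
# pairs toward similarity 1 and mismatched pairs apart.
z_img = W_img @ image_feat
z_txt = W_txt @ text_feat
cos = z_img @ z_txt / (np.linalg.norm(z_img) * np.linalg.norm(z_txt))

print(joint.shape, float(cos))
```

Note that the joint branch produces one fused vector, whereas the coordinated branch retains a separate vector per modality, which is useful when modalities must also be used independently (e.g. cross-modal retrieval).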

Fusion

Fusion is the process of combining information from two or more modalities to perform a prediction task. Effectively fusing multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data.

Fusing heterogeneous information is central to multimodal research but presents several challenges. Practical challenges include handling different formats, sequence lengths, and non-synchronized data. Theoretical challenges involve identifying the optimal fusion technique. Options range from simple operations like concatenation or weighted sums to more sophisticated attention mechanisms like transformer networks or attention-based recurrent neural networks (RNNs).

Fusion can be performed early or late. In early fusion, features are integrated immediately after feature extraction using one of the fusion mechanisms mentioned above. In late fusion, integration occurs only after each unimodal network outputs a prediction (classification, regression). Voting schemes, weighted averages, and other techniques are commonly used in late fusion. Hybrid fusion techniques combine outputs from early fusion and unimodal predictors.
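The early/late distinction can be shown in a few lines. This is a minimal illustrative sketch with random, untrained weights: early fusion concatenates features before a single shared classifier, while late fusion trains one classifier per modality and combines their predictions, here by simple averaging (one common voting scheme).

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoded features for one sample.
visual = rng.standard_normal(128)
audio = rng.standard_normal(64)
n_classes = 3

# --- Early fusion: concatenate features, then one shared classifier. ---
W_early = rng.standard_normal((n_classes, 128 + 64)) * 0.1
early_probs = softmax(W_early @ np.concatenate([visual, audio]))

# --- Late fusion: one classifier per modality, then average predictions. ---
W_v = rng.standard_normal((n_classes, 128)) * 0.1
W_a = rng.standard_normal((n_classes, 64)) * 0.1
late_probs = 0.5 * softmax(W_v @ visual) + 0.5 * softmax(W_a @ audio)

print(early_probs.round(3), late_probs.round(3))
```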

Alignment

Alignment involves identifying direct relationships between different modalities. Current research in multimodal learning aims to create modality-invariant representations. This means that when different modalities refer to a similar semantic concept, their representations should be similar or close together in a latent space.


Translation

Translation is the process of mapping one modality to another while retaining the semantic meaning. For example, translating text to visual modalities. Translations are often open-ended and subjective, adding to the complexity of the task. Current research in multimodal learning focuses on constructing generative models that can translate between different modalities.

Co-Learning

Multimodal co-learning aims to transfer information learned through one or more modalities to tasks involving another. Co-learning is particularly important in cases with low-resource target tasks, fully or partly missing modalities, or noisy modalities. Translation can be used as a method of co-learning to transfer knowledge from one modality to another.

How Multimodal Learning Works

Multimodal neural networks typically combine multiple unimodal neural networks. For example, an audiovisual model might consist of two unimodal networks: one for visual data and one for audio data. These unimodal networks usually process their inputs separately, a process called encoding. After unimodal encoding, the information extracted from each model is fused together. Multiple fusion techniques have been proposed, ranging from simple concatenation to attention mechanisms. The process of multimodal data fusion is a crucial success factor. After fusion, a final "decision" network accepts the fused encoded information and is trained on the end task.

In essence, multimodal architectures consist of three parts:

  1. Unimodal Encoders: Encode individual modalities.
  2. Fusion Network: Combines the features extracted from each input modality during the encoding phase.
  3. Classifier: Accepts the fused data and makes predictions.
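These three parts can be composed end to end. The class below is a minimal sketch with random, untrained weights and arbitrary dimensions (an image encoder, a text encoder, concatenation-based fusion, and a linear classifier head); a real system would replace each matrix with a trained network.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    return np.maximum(x, 0.0)

class TwoModalNet:
    """Minimal image+text network: two encoders, a fusion step, a classifier."""

    def __init__(self, d_img=512, d_txt=300, d_hid=64, n_classes=5):
        s = 0.05
        self.enc_img = rng.standard_normal((d_hid, d_img)) * s   # unimodal encoder 1
        self.enc_txt = rng.standard_normal((d_hid, d_txt)) * s   # unimodal encoder 2
        self.fuse = rng.standard_normal((d_hid, 2 * d_hid)) * s  # fusion network
        self.head = rng.standard_normal((n_classes, d_hid)) * s  # classifier

    def forward(self, img, txt):
        h_img = relu(self.enc_img @ img)                          # 1. encode
        h_txt = relu(self.enc_txt @ txt)
        fused = relu(self.fuse @ np.concatenate([h_img, h_txt]))  # 2. fuse
        return self.head @ fused                                  # 3. classify

net = TwoModalNet()
logits = net.forward(rng.standard_normal(512), rng.standard_normal(300))
print(logits.shape)  # (5,)
```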

Encoding

During encoding, the goal is to create meaningful representations. Typically, each modality is handled by a different unimodal encoder. However, inputs are often in the form of embeddings rather than raw data. For example, word2vec embeddings may be used for text, and COVAREP embeddings for audio. Multimodal embeddings, such as data2vec, translate video, text, and audio data into embeddings in a high-dimensional space and have outperformed other embeddings in many tasks.

Deciding whether to use joint or coordinated representations is an important decision. Joint representation is often preferred when modalities are similar. In practice, when designing multimodal networks, encoders are chosen based on what works well in each area, with more emphasis given to designing the fusion method. Many research papers use ResNets for visual modalities and RoBERTa for text.

Fusion Module

The fusion module combines each individual modality after feature extraction. The method or architecture used for fusion is a critical factor for success. The simplest method is using simple operations like concatenating or summing the different unimodal representations. However, more sophisticated methods have been researched and implemented.

For example, the cross-attention layer mechanism is a successful fusion method used to capture cross-modal interactions and fuse modalities more meaningfully.
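Cross-attention reduces to a few matrix operations. In the illustrative sketch below (random weights, toy token counts, single head), text tokens act as queries and image patches as keys and values, so each text token gathers the visual evidence most relevant to it.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 32                                   # shared model dimension
text = rng.standard_normal((6, d))       # 6 text tokens
image = rng.standard_normal((10, d))     # 10 image patches

# Cross-attention: queries come from one modality, keys/values from the other.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = text @ Wq, image @ Wk, image @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))     # (6, 10) text-to-image weights
fused_text = attn @ V                    # text tokens enriched with image info

print(attn.shape, fused_text.shape)
```

In a full transformer-based fusion module this layer would be stacked, given multiple heads, and often applied in both directions (text attending to image and image attending to text).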

Classification

Once fusion is complete, the resulting vector F is fed into a classification model, typically a neural network with one or two hidden layers. The input vector F encodes complementary information from multiple modalities, providing a richer representation compared to individual modalities. This increases the predictive power of the classifier.

Multimodal Fusion Techniques

Fusion involves learning a joint representation that models cross-modality interactions. Several methods exist to handle fusion:

  1. Additive Fusion: Combines modalities by adding their representations.
  2. Multiplicative Fusion: Combines modalities by multiplying their representations.
  3. Bilinear Fusion: A more complex form of multiplicative fusion that captures interactions between modalities.
  4. Tensor Fusion: Learns both intra-modality and inter-modality interactions but can be computationally heavy.
  5. Low-Rank Tensor Fusion: Reduces the computational load of tensor fusion by using low-rank approximations.
  6. Complex Fusion: Advanced techniques like channel exchange, where information is exchanged between modalities based on sparsity constraints.
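The first five techniques can be compared directly on toy vectors. The sketch below uses random 16-dimensional representations and an arbitrary rank of 4; appending a constant 1 to each vector before the outer product is the trick tensor fusion uses to retain the unimodal terms alongside the interactions.

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.standard_normal(16)   # modality-A representation
b = rng.standard_normal(16)   # modality-B representation

additive = a + b                          # 1. additive fusion
multiplicative = a * b                    # 2. multiplicative (elementwise)

# 3./4. Bilinear / tensor fusion: the outer product captures all pairwise
# interactions; the appended 1 keeps each modality's unimodal terms too.
a1, b1 = np.append(a, 1.0), np.append(b, 1.0)
tensor = np.outer(a1, b1)                 # (17, 17): intra + inter interactions

# 5. Low-rank tensor fusion: factorize instead of materializing the tensor.
r = 4
Ua, Ub = rng.standard_normal((r, 17)), rng.standard_normal((r, 17))
low_rank = (Ua @ a1) * (Ub @ b1)          # r-dim fused vector, far cheaper

print(additive.shape, tensor.shape, low_rank.shape)
```

The size contrast makes the computational argument concrete: tensor fusion grows multiplicatively with each added modality, while the low-rank variant stays linear in the rank.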

Multimodal Deep Learning Applications

Multimodal deep learning has numerous applications across various fields. Here are some examples within computer vision:

Image Captioning

Image captioning involves generating short text descriptions for a given image. It is a multimodal task that relies on datasets of images paired with short text descriptions. It addresses the translation challenge by translating visual representations into textual ones. The task can also be extended to video captioning, where text coherently describes short videos.

For a model to translate visual modalities into text, it must capture the semantics of a picture, detect key objects, key actions, and key characteristics of objects, and reason about the relationships between objects in an image.

Image captioning models can provide text alternatives to images, assisting blind and visually-impaired users.
