Multimodal Deep Learning: A Comprehensive Tutorial

Previous AI systems performed specific tasks within a single modality (unimodal AI). This article explores multimodality in generative AI, discussing its fundamental principles and real-world applications. Multimodal generative AI refers to artificial intelligence systems that process and create content from multiple data modalities, using cross-modal learning to produce richer results from multiple input types. Our aim is to provide a comprehensive guide to multimodal deep learning, covering its fundamental concepts, challenges, techniques, and diverse applications.

Introduction to Multimodal AI

Multimodal AI fundamentally depends on its capability to process and integrate various data types within a unified computational framework. Many recent advances in generative AI originate from multimodal approaches, even though not all multimodal AI systems are generative models. Rather than being opposing approaches, multimodal AI and generative AI often work together in a single unified system.

Let’s consider a multimodal generative AI system that can read scene descriptions and analyze corresponding images to produce new content, such as audio narrations and detailed images. Generative AI refers to systems that generate new content, including visual outputs from tools like DALL·E and Stable Diffusion.

By merging inputs from different modalities, multimodal AI systems can produce a range of high-quality design alternatives. OpenAI’s GPT-4 is an example of a large language model that can process both text and image data.

The Essence of Data Processing in Multimodal AI

The core of multimodal AI is data processing. Textual data, for example, requires tokenization during preprocessing, while image data typically passes through convolutional neural networks to extract visual features. Models must then accurately align the extracted features: text-based descriptions, for instance, can help an image recognition system identify objects more accurately. This interplay requires cross-attention, a mechanism that allows different parts of the model’s architecture to focus on the relevant aspects of each modality. Data fusion then combines the aligned features into one unified representation. The fusion layer plays a critical role, since it identifies the most important details from each modality for the task at hand. For generative tasks, a decoder stage, typically a transformer or recurrent neural network, transforms the unified representation into the target output. Depending on the structure of the model, the resulting output can appear as text, images, or various other formats. These examples highlight how multimodality, used in generative AI, significantly broadens the potential for content development and user engagement.

Read also: Understanding Multimodal Machine Learning

Multimodal Deep Learning: A Deep Dive

Multimodal deep learning frequently uses the transformer-based encoder-decoder framework as its primary method. Effective multimodal systems rely on attention mechanisms that let models focus on the most relevant components across the various modalities. As an example, a PyTorch model can combine text, image, and audio data through self-attention to achieve multimodal fusion: distinct linear layers project each modality into a shared fusion space, the transformed features are stacked into a single unified input tensor, and a fully connected layer transforms the aligned features into a fused representation with dimensions (batch_size, fusion_dim). By combining different modalities, multimodal AI systems can carry out tasks with human-like context awareness.
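The description above can be written down as a minimal PyTorch sketch. The layer sizes, the number of attention heads, and the `fusion_dim` name are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch: project each modality into a shared space, fuse with self-attention."""

    def __init__(self, text_dim=768, image_dim=2048, audio_dim=128, fusion_dim=256):
        super().__init__()
        # One projection per modality into the shared fusion space.
        self.text_proj = nn.Linear(text_dim, fusion_dim)
        self.image_proj = nn.Linear(image_dim, fusion_dim)
        self.audio_proj = nn.Linear(audio_dim, fusion_dim)
        self.attn = nn.MultiheadAttention(fusion_dim, num_heads=4, batch_first=True)
        self.fc = nn.Linear(fusion_dim, fusion_dim)

    def forward(self, text, image, audio):
        # Stack the projected modalities as a sequence of three "tokens".
        tokens = torch.stack(
            [self.text_proj(text), self.image_proj(image), self.audio_proj(audio)],
            dim=1,
        )  # (batch_size, 3, fusion_dim)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention across modalities
        fused = self.fc(attended.mean(dim=1))            # (batch_size, fusion_dim)
        return fused

model = MultimodalFusion()
out = model(torch.randn(2, 768), torch.randn(2, 2048), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 256])
```

Note that mean-pooling the attended tokens before the fully connected layer is one of several reasonable choices; concatenating the three tokens would work as well.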

Understanding Multimodal Machine Learning

Multimodal machine learning is the study of computer algorithms that learn and improve performance through the use of multimodal datasets. Multimodal deep learning is a machine learning subfield that aims to train AI models to process and find relationships between different types of data (modalities), typically images, video, audio, and text. By combining different modalities, a deep learning model can comprehend its environment more holistically, since some cues exist only in certain modalities.

Imagine the task of emotion recognition. There is more to it than just looking at a human face (visual modality). The tone and pitch of a person’s voice (audio modality) encode enormous amounts of information about their emotional state, which may not be visible in their facial expressions, even when the two are in sync.

The Limitations of Unimodal Learning

Unimodal (or monomodal) models, models that process only a single modality, have been researched extensively and have produced extraordinary results in advancing fields like computer vision and natural language processing. However, unimodal deep learning has limited capabilities, which is where the need for multimodal models arises. The figure is part of Meta’s multimodal dataset “Hateful Memes”, where image and text combine to create a sarcastic meme. Unimodal models cannot perceive this kind of sarcasm, since each individual modality contains just half the information. In contrast, a multimodal model that processes both text and images can relate the two and discover the deeper meaning.

Data Modalities in Multimodal Learning

Multimodal models, more often than not, rely on deep neural networks, even though earlier research incorporated other machine learning models such as hidden Markov models (HMMs) or restricted Boltzmann machines (RBMs). In multimodal deep learning, the most typical modalities are visual (images, videos), textual, and auditory (voice, sounds, music). However, other less typical modalities include 3D visual data, depth sensor data, and LiDAR data (typical in self-driving cars). In clinical practice, imaging modalities include computed tomography (CT) scans and X-ray images, while non-image modalities include electroencephalogram (EEG) data. Sensor data such as thermal data or data from eye-tracking devices can also be included in the list. Any combination of the above unimodal data results in a multimodal dataset. For example, combining video + LiDAR + depth data creates an excellent dataset for self-driving car applications, and EEG + eye-tracking data creates a multimodal dataset that connects eye movements with brain activity. However, the most popular combinations involve the three most popular modalities:

  • Image + Text
  • Image + Audio
  • Image + Text + Audio
  • Text + Audio

Read also: Comprehensive Overview of Deep Learning for Cybersecurity

Core Challenges in Multimodal Deep Learning

Multimodal deep learning aims to solve five core challenges that are active areas of research. Solutions or improvements on any of the below challenges will advance multimodal AI research and practice.

Representation

Multimodal representation is the task of encoding data from multiple modalities in the form of a vector or tensor. Good representations that capture the semantic information of the raw data are very important for the success of machine learning models. However, extracting features from heterogeneous data in a way that exploits the synergies between them is very hard. It is also essential to fully exploit the complementarity of the different modalities while avoiding redundant information. Multimodal representations fall into two categories.

  1. Joint representation: each individual modality is encoded and then placed into a mutual high dimensional space. This is the most direct way and may work well when modalities are of similar nature.

  2. Coordinated representation: each individual modality is encoded independently of the others, but their representations are then coordinated by imposing a restriction. For example, their linear projections should be maximally correlated:

    $$(u^*, v^*) = \underset{u,v}{\operatorname{argmax}} \; \operatorname{corr}(u^T X, v^T Y)$$


    where $X, Y$ denote the input modalities, $u^T, v^T$ denote matrices that project the input modalities into some representation space, and $(u^*, v^*)$ denote the desired projection matrices that map the inputs to a mutual representation space after the restriction has been imposed.

Fusion

Fusion is the task of joining information from two or more modalities to perform a prediction task. Effective fusion of multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. Fusing heterogeneous information is the core of multimodal research but comes with a big set of challenges. Practical challenges involve handling different formats, different lengths, and non-synchronized data. Theoretical challenges involve finding the optimal fusion technique: options range from simple operations such as concatenation or a weighted sum to more sophisticated attention mechanisms such as transformer networks or attention-based recurrent neural networks (RNNs).

Finally, one may also need to choose between early and late fusion. In early fusion, features are integrated immediately after feature extraction using one of the fusion mechanisms above. In late fusion, on the other hand, integration is performed only after each unimodal network outputs a prediction (classification, regression). Voting schemes, weighted averages, and other techniques are usually used in late fusion. Hybrid fusion techniques, which combine outputs from early fusion and unimodal predictors, have also been proposed.
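To make the early/late distinction concrete, here is a toy numpy sketch; the feature sizes, the stand-in linear classifiers, and the equal weighting in late fusion are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unimodal feature vectors for a batch of 4 samples.
visual = rng.normal(size=(4, 16))   # e.g. image features
audio = rng.normal(size=(4, 8))     # e.g. audio features

# Early fusion: join features right after extraction, then predict once.
early = np.concatenate([visual, audio], axis=1)  # (4, 24) -> fed to one classifier

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Late fusion: each unimodal model predicts on its own; merge the predictions.
W_v = rng.normal(size=(16, 3))  # stand-in unimodal classifier (3 classes)
W_a = rng.normal(size=(8, 3))
p_visual = softmax(visual @ W_v)
p_audio = softmax(audio @ W_a)
late = 0.5 * p_visual + 0.5 * p_audio  # weighted average of the two predictions

print(early.shape, late.shape)  # (4, 24) (4, 3)
```

Early fusion lets a single model learn cross-modal interactions, while late fusion keeps the unimodal models independent, which is simpler and more robust to a missing modality.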

Alignment

Alignment refers to the task of identifying direct relationships between different modalities. Current research in multimodal learning aims to create modality-invariant representations. This means that when different modalities refer to a similar semantic concept, their representations must be similar/close together in a latent space. For example, the sentence “she dived into the pool”, an image of a pool, and the audio signal of a splash sound should lie close together in a manifold of the representation space.

Translation

Translating is the act of mapping one modality to another. The main idea is how one modality (e.g., textual modality) can be translated to another (e.g., visual modalities) while retaining the semantic meaning. Translations, however, are open-ended, subjective, and no perfect answer exists, which adds to the complexity of the task.

Part of the current research in multimodal learning is to construct generative models that make translations between different modalities. The recent DALL-E and other text-to-image models are great examples of such generative models that translate text modalities to visual modalities.

Co-Learning

Multimodal co-learning aims to transfer information learned through one or more modalities to tasks involving another. Co-learning is especially important for low-resource target tasks and for fully or partly missing or noisy modalities. Translation, explained in the section above, may be used as a method of co-learning to transfer knowledge from one modality to another.

Neuroscience suggests that humans may use methods of co-learning through translation as well. People who suffer from aphantasia, the inability to create mental images in their heads, perform worse on memory tests. The opposite is also true: people who do create such mappings, textual/auditory to visual, perform better on memory tests. This suggests that being able to convert representations between different modalities is an important aspect of human cognition and memory.

How Multimodal Learning Works: A Detailed Explanation

Multimodal neural networks are usually a combination of multiple unimodal neural networks. For example, an audiovisual model might consist of two unimodal networks, one for visual data and one for audio data. These unimodal neural networks usually process their inputs separately. This process is called encoding. After unimodal encoding takes place, the information extracted from each model must be fused together. Multiple fusion techniques have been proposed that range from simple concatenation to attention mechanisms. The process of multimodal data fusion is one of the most important success factors. After fusion takes place, a final “decision” network accepts the fused encoded information and is trained on the end task.

To put it simply, multimodal architectures usually consist of three parts:

  • Unimodal encoders that encode individual modalities. Usually, one for each input modality.
  • A fusion network that combines the features extracted from each input modality, during the encoding phase.
  • A classifier that accepts the fused data and makes predictions.

We refer to the above as the encoding module, fusion module, and classification module.
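The three modules can be sketched as a skeleton PyTorch model; the encoder sizes, a plain concatenation fusion, and the class count are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, hidden=256, n_classes=5):
        super().__init__()
        # Encoding module: one unimodal encoder per input modality.
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Classification module: a small network trained on the end task.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, visual, audio):
        # Fusion module: here, simple concatenation of the encoded features.
        fused = torch.cat([self.visual_enc(visual), self.audio_enc(audio)], dim=-1)
        return self.classifier(fused)

model = MultimodalClassifier()
logits = model(torch.randn(2, 512), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 5])
```

In a real system the linear encoders would be replaced by pretrained unimodal networks, and the concatenation by one of the fusion mechanisms discussed in this article.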

Encoding

During encoding, we seek to create meaningful representations. Usually, each individual modality is handled by a different monomodal encoder. However, it’s often the case that the inputs are in the form of embeddings instead of their raw form. For example, word2vec embeddings may be used for text, and COVAREP embeddings for audio. Multimodal embeddings such as data2vec, which translate video, text, and audio data into embeddings in a high-dimensional space, are one of the latest practices and have outperformed other embeddings, achieving SOTA performance on many tasks.

Deciding whether it’s more suitable to use joint representations or coordinated representations (explained in the representation challenge) is an important decision. Usually, a joint representation method works well when modalities are similar in nature, and it’s the one most often used. In practice, when designing multimodal networks, encoders are chosen based on what works well in each area, since more emphasis is given to designing the fusion method. Many research papers use the all-time-classic ResNets for the visual modalities and RoBERTa for text.

Fusion

The fusion module is responsible for combining each individual modality after feature extraction is completed. The method/architecture used for fusion is probably the most important ingredient for success.

The simplest method is to use simple operations such as concatenating or summing the different unimodal representations. However, more sophisticated and successful methods have been researched and implemented. For example, the cross-attention layer mechanism is one of the more recent and successful fusion methods. It has been used to capture cross-modal interactions and fuse modalities in a more meaningful way. The equation below describes the cross-attention mechanism and assumes basic familiarity with self-attention.

$$\alpha_{kl} = s\left(\frac{Q_k K_l^T}{\sqrt{d}}\right)V_l$$

Where $\alpha_{kl}$ denotes the attention score vector, $s(\cdot)$ denotes the softmax function, and $K$, $Q$, and $V$ are the key, query, and value matrices of the attention mechanism, respectively. For symmetry, $\alpha_{lk}$ is also computed, and the two may be summed to create an attention vector that captures the synergy between the two modalities $(k, l)$ involved. Essentially, the difference between $\alpha_{kl}$ and $\alpha_{lk}$ is that in the former modality $k$ is used as the query, while in the latter modality $l$ is used instead and modality $k$ takes the role of key and value.

In the case of three or more modalities, multiple cross-attention mechanisms may be used so that every different combination is calculated. For example, if we have vision (V), text (T), and audio (A) modalities, then we create the combinations VT, VA, TA, and AVT in order to capture all possible cross-modal interactions.

Even after using an attention mechanism, a concatenation of the above cross-modal vectors is often performed to produce the fused vector F. Sum(·), max(·), or even pooling operations may be used instead.
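The cross-attention fusion above can be sketched directly in numpy. The token counts and feature dimension are illustrative assumptions, and the learned query/key/value projections are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared feature dimension

# Token sequences for two modalities k and l (e.g. text tokens and image patches).
x_k = rng.normal(size=(5, d))   # modality k: 5 tokens
x_l = rng.normal(size=(7, d))   # modality l: 7 tokens

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query_tokens, context_tokens):
    # alpha = softmax(Q K^T / sqrt(d)) V, where Q comes from one modality
    # and K, V come from the other (learned projections omitted).
    Q, K, V = query_tokens, context_tokens, context_tokens
    scores = softmax(Q @ K.T / np.sqrt(d))
    return scores @ V

alpha_kl = cross_attention(x_k, x_l)  # modality k attends over modality l
alpha_lk = cross_attention(x_l, x_k)  # the symmetric direction
F = np.concatenate([alpha_kl.mean(axis=0), alpha_lk.mean(axis=0)])  # fused vector

print(alpha_kl.shape, alpha_lk.shape, F.shape)  # (5, 8) (7, 8) (16,)
```

Here the two directional attention outputs are mean-pooled and concatenated into F; summing or max-pooling them, as mentioned above, are equally valid choices.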

Classification

Finally, once the fusion has been completed, vector F is fed into a classification model. This is usually a neural network with one or two hidden layers. The input vector F encodes complementary information from multiple modalities, thus providing a richer representation compared to the individual modalities V, A, and T. Hence, it should increase the predictive power of the classifier. Mathematically, the aim of a unimodal model is to minimize the loss

$$L(C(\phi_m(X)),y)$$

where $\phi_m$ is an encoding function, typically a deep neural network, and C(.) is a classifier, typically one or more dense layers.

In contrast, the aim of multimodal learning is to minimize the loss

$$L_{multi}(C(\phi_{m_1}(X_1) \oplus \phi_{m_2}(X_2) \oplus \cdots \oplus \phi_{m_k}(X_k)), y)$$

where $\oplus$ denotes a fusion operation (e.g., concatenation) and $\phi_{m_i}$ denotes the encoding function of a single modality.

Applications of Multimodal Deep Learning

Here are some examples of Multimodal Deep Learning applications within the computer vision field:

Image Captioning

Image captioning is the task of generating short text descriptions for a given image. It’s a multimodal task that involves multimodal datasets consisting of images and short text descriptions. It solves the translation challenge described previously by translating visual representations into textual ones. The task can also be extended to video captioning, where text coherently describes short videos.

For a model to translate visual modalities into text, it has to capture the semantics of a picture. It needs to detect the key objects, key actions, and key characteristics of objects. Referencing the example of fig. 3, “A horse (key object) carrying (key action) a large load (key characteristic) of hay (key object) and two people (key object) sitting on it.” Moreover, it needs to reason about the relationship between objects in an image, e.g., “Bunk bed with a narrow shelf sitting underneath it (spatial relationship).” However, as already mentioned, the task of multimodal translation is open-ended and subjective. Hence the caption “Two men are riding a horse carriage full of hay,” and “Two men transfer hay with a horse carriage,” are also valid captions.

Image captioning models can be applied to provide text alternatives to images, which help blind and visually-impaired users.

Self-Driving Cars

Self-driving cars demonstrate how multimodal AI operates effectively in practice. Autonomous vehicles depend on data inputs from numerous sensors, including camera images, LiDAR point clouds, radar signals, and GPS information.

Speech Recognition

Traditional speech recognition models transform spoken audio signals into written text. In noisy environments, combining lip reading (visual modality) with audio data can achieve much better results.

Emotion Recognition

To understand human emotions we need to observe subtle signals in facial expressions (visual), voice tone (audio), and textual content (when it exists). Robust emotion recognition emerges from multimodal AI systems that combine multiple signals.

Challenges and Future Directions

Multimodal datasets require careful curation and alignment to ensure that texts correspond to their respective images or audio clips. Multimodal architectures need more parameters than single-modality models. Gaining insight into the decision-making process of multimodal systems is also more complex than analyzing unimodal models. And although benchmarks for text and vision tasks are well established, comprehensive multimodal AI applications remain new.

The industry is addressing these challenges by developing stronger data curation pipelines and more efficient model architectures (such as sparse transformers and mixture-of-experts) with improved alignment strategies. AI models that learn from personalized data sources such as text messages, social media feeds, and voice commands will enable highly customized user experiences. At the same time, as models incorporate multiple data types, the potential for biased or inappropriate outputs increases. In robotics, the ability to jointly process visual information and spoken language enables robots to adapt to their environments.
