Neural Machine Translation by Jointly Learning to Align and Translate: A Comprehensive Overview

Neural Machine Translation (NMT) represents a paradigm shift in the field of machine translation, moving away from traditional phrase-based systems built from many independently tuned sub-components toward a single neural network trained end-to-end. An NMT model reads a source sentence and directly emits a translation. This approach has demonstrated state-of-the-art results on large-scale translation tasks, such as English-to-French and English-to-German, and is attractive due to its minimal reliance on prior domain expertise.

Background: The Essence of Neural Machine Translation

From a probabilistic perspective, translation can be viewed as finding the target sentence y that maximizes the conditional probability given the source sentence x: P(y|x). NMT systems leverage parallel training corpora to maximize the conditional probability of sentence pairings. After learning the conditional distribution for a particular source sentence, the model generates a translation by searching for the sentence that maximizes this conditional probability.

The NMT process typically involves two key stages:

  1. Encoding: The source sentence x is encoded into a fixed-length vector.
  2. Decoding: The encoded vector is decoded into the desired target sentence y.

Early NMT systems utilized recurrent neural networks (RNNs) to encode a variable-length source sentence into a fixed-length vector, which was then decoded into a variable-length target sentence. These early systems showed impressive results, even achieving near state-of-the-art performance compared to traditional phrase-based machine translation systems, particularly when using Long Short-Term Memory (LSTM) units within the RNNs.
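The fixed-length encoding step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual architecture: the dimensions, weight names, and the choice of a vanilla tanh RNN (rather than LSTM/GRU) are all assumptions for clarity.

```python
import numpy as np

# Illustrative sketch: a vanilla RNN encoder that compresses a
# variable-length sequence of word vectors into one fixed-length
# vector -- the final hidden state. Dimensions are hypothetical.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 8                       # embedding and hidden sizes
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))

def encode(x_seq):
    """x_seq: list of input vectors; returns the fixed-length summary c."""
    h = np.zeros(d_hid)
    for x in x_seq:                      # h_t = f(x_t, h_{t-1})
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h                             # here c = q({h_1..h_T}) = h_T

sentence = [rng.normal(size=d_in) for _ in range(5)]
c = encode(sentence)
assert c.shape == (d_hid,)               # same size for any sentence length
```

Note that the output dimensionality is fixed regardless of how many words the sentence contains, which is exactly the property the attention mechanism later relaxes.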

The RNN Encoder-Decoder Framework

The RNN Encoder-Decoder serves as the foundation for many NMT architectures, including the attention-based model discussed in the seminal paper "Neural Machine Translation by Jointly Learning to Align and Translate."


In this framework, the encoder reads the input sentence (a sequence of vectors x) and transforms it into a context vector c. This is typically achieved using an RNN, where:

  • ht represents the hidden state at time t.
  • c is a vector generated from the sequence of hidden states.
  • f and q are non-linear functions.

The decoder is then trained to predict the next word yt using the context vector c and all previously predicted words {y1, …, yt-1}. This can be expressed as:

P(y) = ∏t=1T P(yt | {y1, …, yt-1}, c)

The decoder effectively separates the joint probability into ordered conditionals, defining a probability over the translation y. This probability is typically modeled as:

P(yt | {y1, …, yt-1}, c) = g(yt-1, st, c)


Where g is a non-linear function (potentially multi-layered) that outputs the probability of yt, and st is the hidden state of the RNN in the decoder.
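The chain-rule factorization above is simple bookkeeping and can be sketched directly. In this toy, the function g is faked with random logits; the point is only how P(y) decomposes into per-step conditionals (all names and sizes are illustrative).

```python
import numpy as np

# Toy sketch of the decoder factorization:
#   log P(y) = sum_t log P(y_t | y_<t, c)
# Random logits stand in for g(y_{t-1}, s_t, c).
rng = np.random.default_rng(1)
vocab = 10

def softmax(z):
    z = z - z.max()                      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def sentence_log_prob(target_ids):
    logp = 0.0
    for y_t in target_ids:
        probs = softmax(rng.normal(size=vocab))  # stand-in for g(...)
        logp += np.log(probs[y_t])       # accumulate one conditional per step
    return logp

lp = sentence_log_prob([3, 1, 7])
assert lp < 0.0   # a product of probabilities below 1 has negative log
```

In a real system the per-step distribution would come from the decoder RNN state and the context vector, and training maximizes this log-probability over the parallel corpus.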

Addressing the Bottleneck: The Issue of Fixed-Length Vectors

A significant limitation of the traditional encoder-decoder approach is its reliance on compressing the entire source sentence into a fixed-length vector. This can hinder the network's ability to process long sentences, especially those significantly longer than the sentences used during training. Research has demonstrated that the performance of a basic encoder-decoder degrades rapidly as the input sentence length increases.

The Attention Mechanism: A Solution to the Bottleneck

To overcome this limitation, the groundbreaking paper "Neural Machine Translation by Jointly Learning to Align and Translate" introduced a novel encoder-decoder model that simultaneously learns to align and translate. This model employs an attention mechanism to (softly) search for relevant parts of the source sentence when predicting each target word.

The core idea is that instead of encoding the entire input sentence into a single fixed-length vector, the model encodes the input sentence into a sequence of vectors and adaptively selects a subset of these vectors during decoding. This allows the model to focus on the most important information in the source sentence for each word it generates in the translation.

The authors demonstrated that this approach significantly improves translation performance compared to the standard encoder-decoder, particularly for longer sentences. The improvement is noticeable across sentences of all lengths.


Decoder with Attention: A Detailed Look

In this new architecture, the conditional probability of each target word yi is conditioned on a distinct context vector ci:

P(yi | y1, …, yi-1, x) = g(yi-1, si, ci)

This contrasts with the traditional encoder-decoder strategy where a single context vector is used for the entire translation.

The context vector ci is determined by a sequence of annotations (h1, …, hTx) to which the encoder maps the input sentence. Each annotation hj provides information about the entire input sequence, with a focus on the parts surrounding the j-th word. The context vector ci is computed as a weighted sum of these annotations:

ci = ∑j=1Tx αij hj

Where αij represents the weight of the j-th annotation hj for the i-th target word. These weights are calculated by an alignment model that scores how well the inputs around position j and the output at position i match:

eij = a(si-1, hj)

αij = softmax(eij) = exp(eij) / ∑k=1Tx exp(eik)

The alignment model a is parameterized as a feedforward neural network and trained jointly with the rest of the system. Crucially, alignment is not treated as a latent variable, but rather as a soft alignment that can be backpropagated through using the cost function's gradient. This allows for joint training of the alignment and translation models.

The softmax function ensures that the attention weights sum to 1, expressing the relative importance of each input sequence element. By incorporating this attention mechanism, the encoder is relieved of the burden of encoding all information in the source sentence into a fixed-length vector.
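The two equations above (softmax over alignment scores, then a weighted sum of annotations) can be sketched in a few lines. The dimensions and random scores here are illustrative stand-ins; a real alignment model would produce the scores from si-1 and hj.

```python
import numpy as np

# Sketch of alpha_ij = softmax_j(e_ij) and c_i = sum_j alpha_ij * h_j.
rng = np.random.default_rng(2)
Tx, d = 6, 8
H = rng.normal(size=(Tx, d))          # annotations h_1 .. h_Tx
e = rng.normal(size=Tx)               # stand-in scores e_ij = a(s_{i-1}, h_j)

alpha = np.exp(e - e.max())
alpha /= alpha.sum()                  # softmax over source positions
c_i = alpha @ H                       # weighted sum of annotations

assert np.isclose(alpha.sum(), 1.0)   # weights form a distribution
assert c_i.shape == (d,)              # context has the annotation size
```

The weights form a probability distribution over source positions, so ci is a convex combination of the annotations rather than a single fixed summary.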

Encoder: Leveraging Bidirectional RNNs for Enhanced Annotations

To ensure that each word's annotation serves as a synopsis of both preceding and subsequent words, the authors employ a bidirectional RNN (BiRNN).

A BiRNN consists of a forward RNN and a backward RNN. The forward RNN reads the input sequence in its original order, computing a sequence of forward hidden states (→h1, →h2, …, →hTx). The backward RNN reads the sequence in reverse order, producing a sequence of backward hidden states (←h1, ←h2, …, ←hTx).

The annotation for each word xi is obtained by concatenating the forward and backward hidden states:

hi = [→hi; ←hi]

This ensures that each annotation captures information from the entire input sentence, providing a more comprehensive representation for the attention mechanism.
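The BiRNN annotation step can be sketched as follows. The weight names, sizes, and the use of plain tanh RNNs (rather than the gated units in the paper) are assumptions for illustration.

```python
import numpy as np

# Sketch: run the encoder forward and backward over the input, then
# concatenate the two hidden states at each position: h_j = [fwd_j; bwd_j].
rng = np.random.default_rng(3)
d_in, d_hid, T = 4, 8, 5
Wf_x = rng.normal(scale=0.1, size=(d_hid, d_in))
Wf_h = rng.normal(scale=0.1, size=(d_hid, d_hid))
Wb_x = rng.normal(scale=0.1, size=(d_hid, d_in))
Wb_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def run_rnn(xs, Wx, Wh):
    h, out = np.zeros(d_hid), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

xs = [rng.normal(size=d_in) for _ in range(T)]
fwd = run_rnn(xs, Wf_x, Wf_h)                    # reads left to right
bwd = run_rnn(xs[::-1], Wb_x, Wb_h)[::-1]        # reads right to left
annotations = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
assert annotations[0].shape == (2 * d_hid,)      # [fwd; bwd] per position
```

Each annotation is twice the hidden size because it stacks a summary of the prefix (forward state) and the suffix (backward state) at that position.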

Bahdanau Attention: A Step-by-Step Process

Bahdanau's attention mechanism, used in the original "Neural Machine Translation by Jointly Learning to Align and Translate" paper, follows a chronological procedure:

  1. Generating the Encoder's Hidden States: The encoder generates hidden states for each element in the input sequence using an RNN (or a variant like LSTM or GRU). Each input sequence element produces a hidden state/output after being processed by the encoder RNN. All the encoder's hidden states are passed to the next time step.

  2. Calculating Alignment Scores: The decoder calculates an alignment score for each encoder output relative to the current decoder input and hidden state at each time step. The alignment score determines how much weight the decoder will give to each encoder output when generating the following output. The alignment scores for Bahdanau attention are calculated as:

    score(ht, hs) = vT tanh(W ht + U hs)

    Where ht is the decoder hidden state, hs is the encoder hidden state, and v, W, and U are trainable weight matrices. The decoder hidden state and encoder outputs are passed through their own Linear layers, each with its own trainable weights. They are then combined and passed through a tanh activation function. Finally, a score is assigned to each encoder output by multiplying the resulting vector by a trainable vector to get an alignment score vector.

  3. Softmaxing the Alignment Scores: The alignment scores vector is then passed through a softmax function to generate attention weights. Each value in the vector represents the relative importance of the various inputs at the current time step and will be between 0 and 1.

  4. Calculating the Context Vector: The context vector is computed by multiplying each encoder output by its attention weight and summing the results. Because of the softmax, encoder outputs whose weights are close to 1 dominate the context vector, while those with weights close to 0 contribute little.

  5. Decoding the Output: The newly created context vector is concatenated with the previous decoder output. This is then fed into the decoder RNN cell to generate a new hidden state, and the process starts again from step 2. By feeding the updated hidden state into a Linear layer (a classifier), probability scores for the next predicted word are calculated, retrieving the final output for the time step.
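The scoring core of the steps above can be sketched as a single additive-attention computation. This is an illustrative sketch with hypothetical dimensions and weight names, not the paper's exact parameterization.

```python
import numpy as np

# Sketch of the additive (Bahdanau) score  v^T tanh(W s_prev + U h_j),
# followed by the softmax and context-vector steps.
rng = np.random.default_rng(4)
d_dec, d_enc, d_att, Tx = 8, 8, 10, 6
W = rng.normal(scale=0.1, size=(d_att, d_dec))   # projects decoder state
U = rng.normal(scale=0.1, size=(d_att, d_enc))   # projects encoder outputs
v = rng.normal(scale=0.1, size=d_att)            # scoring vector

s_prev = rng.normal(size=d_dec)           # previous decoder hidden state
H = rng.normal(size=(Tx, d_enc))          # encoder annotations

scores = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in H])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                      # attention weights sum to 1
context = alpha @ H                       # weighted sum of annotations

assert np.isclose(alpha.sum(), 1.0)
assert context.shape == (d_enc,)
```

Note that the decoder state enters the score before the decoder RNN runs for the current step, which is the key ordering difference from Luong attention discussed below.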

Luong Attention Mechanism: Variations on the Theme

Luong's attention mechanism, also known as multiplicative attention, offers alternative approaches to calculating attention scores. It often involves simpler matrix multiplications to reduce the states of the encoder and the decoder into attention scores, resulting in faster computation and reduced space requirements.

A global attentional model aims to derive the context vector ct by considering all the encoder's hidden states. In this model, the current target hidden state ht is compared to each source hidden state hs, yielding a variable-length alignment vector at whose size is equal to the number of time steps on the source side.

The alignment scores can be calculated using three different content-based functions:

  • Dot: score(ht, hs) = htT hs
  • General: score(ht, hs) = htT W hs
  • Concat: score(ht, hs) = vT tanh(W[ht; hs])

The dot function simply takes the dot product of the encoder's hidden state and the decoder's hidden state. The general function is similar but inserts a weight matrix between the two. The concat function resembles Bahdanau's scoring, with the difference that the decoder hidden state and encoder hidden state are first concatenated into a single vector before being sent through a Linear layer. After passing through the Linear layer, the output is subjected to a tanh activation before being multiplied by a weight vector to get the alignment score.

The researchers also employ a location-based function in which the alignment scores are derived purely from the target hidden state ht.

The context vector ct is the weighted average of all the source hidden states, produced using the alignment vector as a weight.
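The three scoring functions can be sketched side by side for a single (ht, hs) pair. Dimensions and weight names here are illustrative assumptions.

```python
import numpy as np

# Sketch of Luong's three content-based score functions.
rng = np.random.default_rng(5)
d = 8
ht = rng.normal(size=d)                  # current decoder hidden state
hs = rng.normal(size=d)                  # one encoder hidden state
W_g = rng.normal(scale=0.1, size=(d, d))        # weights for "general"
W_c = rng.normal(scale=0.1, size=(d, 2 * d))    # weights for "concat"
v = rng.normal(scale=0.1, size=d)               # scoring vector for "concat"

dot_score = ht @ hs                                          # dot
general_score = ht @ (W_g @ hs)                              # general
concat_score = v @ np.tanh(W_c @ np.concatenate([ht, hs]))   # concat

# Each variant reduces a pair of states to one scalar alignment score.
assert np.ndim(dot_score) == 0
assert np.ndim(general_score) == 0
assert np.ndim(concat_score) == 0
```

In practice the chosen function is applied to every source position, and the resulting score vector is softmaxed into the alignment vector at.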

Luong Attention Process: A Condensed View

The Luong attention process can be summarized as follows:

  1. Producing the Encoder Hidden States: The encoder generates a hidden state for each input sequence element, much like Bahdanau Attention.

  2. Decoder RNN: The decoder RNN runs earlier in Luong Attention than in Bahdanau Attention: it first produces the new decoder hidden state for the current time step, and attention is then computed from that state, rather than from the previous state.

  3. Calculating Alignment Scores: The alignment score function can be specified in one of three methods in Luong Attention: dot, general, or concat. Alignment scores are computed by these scoring functions using encoder outputs and the decoder hidden state generated in the previous step.

  4. Softmaxing the Alignment Scores: Alignment scores are softmaxed similarly to Bahdanau Attention, yielding weights between 0 and 1.

  5. Calculating the Context Vector: This operation is identical to the one used in Bahdanau Attention, where the attention weights are multiplied by the encoder outputs.

  6. Producing the Final Output: Finally, the decoder hidden state formed in Step 2 is concatenated with the resulting context vector, and the combined vector is passed through a Linear layer and a softmax classifier to produce the output word for the time step.
