We’ve been hearing a lot about Transformers, and with good reason. They have taken the world of NLP by storm in the last few years. The Transformer is an architecture that uses Attention to significantly improve the performance of deep learning NLP translation models. It was first introduced in the paper ‘Attention Is All You Need’ and was quickly established as the leading architecture for most text data applications. Since then, numerous projects including Google’s BERT and OpenAI’s GPT series have built on this foundation and published performance results that handily beat existing state-of-the-art benchmarks.
In this article, we’ll go over the basics of the Transformer: its architecture and how it works internally. We will cover its functionality top-down, look under the covers to understand the operation of the system in detail, and do a deep dive into the workings of multi-head attention, which is the heart of the Transformer.
The Transformer architecture excels at handling text data, which is inherently sequential. It takes a text sequence as input and produces another text sequence as output, e.g. translating an input English sentence to Spanish.
At its core, it contains a stack of Encoder layers and Decoder layers. To avoid confusion, we will refer to an individual layer as an Encoder or a Decoder, and will use ‘Encoder stack’ or ‘Decoder stack’ for a group of them.
The Encoder stack and the Decoder stack each have their corresponding Embedding layers for their respective inputs. Finally, there is an Output layer to generate the final output.
All the Encoders are identical to one another. Similarly, all the Decoders are identical.
The Encoder is a reusable module that is the defining component of all Transformer architectures. In addition to the Self-attention and Feed-forward layers described below, it also has Residual skip connections around both layers along with two LayerNorm layers.
There are many variations of the Transformer architecture. Some Transformer architectures have no Decoder at all and rely only on the Encoder.
The key to the Transformer’s ground-breaking performance is its use of Attention. While processing a word, Attention enables the model to focus on other words in the input that are closely related to that word.
e.g. in a sentence like ‘The boy is holding the blue ball’, ‘ball’ is closely related to ‘blue’ and ‘holding’. On the other hand, ‘blue’ is not related to ‘boy’.
The Transformer architecture uses self-attention by relating every word in the input sequence to every other word.
e.g. Consider these two sentences:
- The cat drank the milk because it was hungry.
- The cat drank the milk because it was sweet.
In the first sentence, the word ‘it’ refers to ‘cat’, while in the second it refers to ‘milk’. When the model processes the word ‘it’, self-attention gives the model more information about its meaning so that it can associate ‘it’ with the correct word.
To handle more nuances about the intent and semantics of the sentence, Transformers include multiple attention scores for each word. e.g. while processing the word ‘it’, the first score highlights ‘cat’, while the second score highlights ‘hungry’. So when the model decodes the word ‘it’, by translating it into a different language, for instance, it will incorporate some aspect of both ‘cat’ and ‘hungry’ into the translated word.
The Transformer works slightly differently during Training and while doing Inference.
Let’s first look at the flow of data during Training. Training data consists of two parts:
- The source or input sequence (e.g. ‘You are welcome’ in English, for a translation problem)
- The destination or target sequence (e.g. ‘De nada’ in Spanish)
The Transformer’s goal is to learn how to output the target sequence, by using both the input and target sequence.
The Transformer processes the data like this:
1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.
2. The stack of Encoders processes this and produces an encoded representation of the input sequence.
3. The target sequence is shifted right by one position, a Start token is inserted in the first position, and the result is converted into Embeddings (with Position Encoding) and fed to the Decoder.
4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.
5. The Output layer converts it into word probabilities and the final output sequence.
6. The Loss function compares this output sequence with the target sequence from the training data, and the loss is used to generate gradients to train the Transformer via backpropagation.
During Inference, we have only the input sequence and don’t have the target sequence to pass as input to the Decoder. The goal of the Transformer is to produce the target sequence from the input sequence alone.
So, like in a Seq2Seq model, we generate the output in a loop and feed the output sequence from the previous timestep to the Decoder in the next timestep until we come across an end-of-sentence token.
The difference from the Seq2Seq model is that, at each timestep, we re-feed the entire output sequence generated thus far, rather than just the last word.
The flow of data during Inference is:
1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.
2. The stack of Encoders processes this and produces an encoded representation of the input sequence.
3. Instead of the target sequence, we use an empty sequence containing only a start-of-sentence token. This is converted into Embeddings (with Position Encoding) and fed to the Decoder.
4. The stack of Decoders processes this along with the Encoder stack’s encoded representation to produce an encoded representation of the target sequence.
5. The Output layer converts it into word probabilities and produces an output sequence.
6. We take the last word of the output sequence as the predicted word, fill it into the next position of the Decoder input sequence, and repeat from step 3 until an end-of-sentence token is predicted.
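As a concrete illustration, here is a minimal sketch of that loop in PyTorch. The `model` object, its `encode`/`decode` methods, the token IDs, and the length limit are all hypothetical stand-ins for whatever a given implementation provides, not a specific library API.

```python
import torch

START_ID, EOS_ID, MAX_LEN = 1, 2, 50   # hypothetical token IDs and length limit

def greedy_decode(model, src_ids):
    # Steps 1-2: encode the input sequence once; it never changes in the loop.
    memory = model.encode(src_ids)                       # (1, src_len, emb_size)
    out_ids = torch.tensor([[START_ID]])                 # start-of-sentence only
    for _ in range(MAX_LEN):
        # Steps 3-5: re-feed the entire output sequence generated so far.
        logits = model.decode(out_ids, memory)           # (1, cur_len, vocab_size)
        next_id = logits[0, -1].argmax().view(1, 1)      # most probable next word
        out_ids = torch.cat([out_ids, next_id], dim=1)   # step 6: append it
        if next_id.item() == EOS_ID:                     # stop at end-of-sentence
            break
    return out_ids
```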
The approach of feeding the target sequence to the Decoder during training is known as Teacher Forcing. Why do we do this and what does that term mean?
During training, we could have used the same approach that is used during inference. In other words, run the Transformer in a loop, take the last word from the output sequence, append it to the Decoder input and feed it to the Decoder for the next iteration. Finally, when the end-of-sentence token is predicted, the Loss function would compare the generated output sequence to the target sequence to train the network.
Not only would this looping cause training to take much longer, but it also makes it harder to train the model. The model would have to predict the second word based on a potentially erroneous first predicted word, and so on.
Instead, by feeding the target sequence to the Decoder, we are giving it a hint, so to speak, just like a Teacher would. Even though it predicted an erroneous first word, it can instead use the correct first word to predict the second word so that those errors don’t keep compounding.
In addition, the Transformer is able to output all the words in parallel without looping, which greatly speeds up training.
Transformers are very versatile and are used for most NLP tasks such as language models and text classification. They are frequently used in sequence-to-sequence models for applications such as Machine Translation, Text Summarization, Question-Answering, Named Entity Recognition, and Speech Recognition.
There are different flavors of the Transformer architecture for different problems. The basic Encoder Layer is used as a common building block for these architectures, with different application-specific ‘heads’ depending on the problem being solved.
A Sentiment Analysis application, for instance, would take a text document as input. A Classification head takes the Transformer’s output and generates predictions of the class labels such as a positive or negative sentiment.
A Language Model architecture would take the initial part of an input sequence such as a text sentence as input, and generate new text by predicting sentences that would follow. A Language Model head takes the Transformer’s output and generates a probability for every word in the vocabulary. The highest probability word becomes the predicted output for the next word in the sentence.
RNNs and their cousins, LSTMs and GRUs, were the de facto architecture for all NLP applications until Transformers came along and dethroned them.
RNN-based sequence-to-sequence models performed well, and when the Attention mechanism was first introduced, it was used to enhance their performance.
However, they had two limitations:
- They struggled to deal with long-range dependencies between words that were spread far apart in a long sentence.
- They process the input sequence sequentially, one word at a time, so the computation for a given time step has to wait until the computation for the previous time step is complete. This slows down training and inference and prevents parallelization.
As an aside, with CNNs, all of the outputs can be computed in parallel, which makes convolutions much faster. However, they also have limitations in dealing with long-range dependencies:
- In a convolutional layer, only words that are close enough to fit within the kernel size can interact with each other. Words that are further apart have to go through many layers before they can influence one another.
The Transformer architecture addresses both of these limitations. It gets rid of RNNs altogether and relies exclusively on the benefits of Attention.
Now that we have a high-level idea of what a Transformer is, we can go deeper into its internal functionality.
In this section, we look under the hood and study exactly how the Transformer works in detail. We’ll see how data flows through the system, with its actual matrix representations and shapes, and understand the computations performed at each stage.
As we saw, the main components of the architecture are:
Data inputs for both the Encoder and Decoder, which contain:
- An Embedding layer
- A Position Encoding layer
The Encoder stack contains a number of Encoders. Each Encoder contains:
- A Multi-head Self-attention layer
- A Feed-forward layer
The Decoder stack contains a number of Decoders. Each Decoder contains:
- Two Multi-head attention layers: a Self-attention layer and an Encoder-Decoder attention layer
- A Feed-forward layer
Output — generates the final output, and contains:
- A Linear layer
- A Softmax layer
To understand what each component does, let’s walk through the working of the Transformer while we are training it to solve a translation problem. We’ll use one sample of our training data which consists of an input sequence (‘You are welcome’ in English) and a target sequence (‘De nada’ in Spanish).
Like any NLP model, the Transformer needs two things about each word — the meaning of the word and its position in the sequence.
The Transformer combines these two encodings by adding them.
The Transformer has two Embedding layers. The input sequence is fed to the first Embedding layer, known as the Input Embedding.
The target sequence is fed to the second Embedding layer after shifting the targets right by one position and inserting a Start token in the first position. Note that, during Inference, we have no target sequence and we feed the output sequence to this second layer in a loop, as we learned. That is why it is called the Output Embedding.
The text sequence is mapped to numeric word IDs using our vocabulary. The embedding layer then maps each input word into an embedding vector, which is a richer representation of the meaning of that word.
Since an RNN implements a loop where each word is input sequentially, it implicitly knows the position of each word.
However, Transformers don’t use RNNs and all words in a sequence are input in parallel. This is its major advantage over the RNN architecture, but it means that the position information is lost, and has to be added back in separately.
Just like the two Embedding layers, there are two Position Encoding layers. The Position Encoding is computed independently of the input sequence: these are fixed values that depend only on the max length of the sequence. For instance, the encoding for the first position is always the same constant vector, regardless of which word occupies that position or what the rest of the sequence contains.

These constants are computed using the formula below:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the word in the sequence, d_model is the length of the encoding vector (the same as the embedding size), and i is the index value into this vector.
In other words, it interleaves a sine curve and a cosine curve, with sine values at all even indexes and cosine values at all odd indexes. As an example, if we encode a sequence of 40 words, the 0th encoding index traces a sine curve across the 40 word-positions, the 1st index traces a cosine curve, and the remaining indexes trace similar curves with increasing wavelengths.
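As a concrete sketch, the encoding above can be computed in a few lines of NumPy (the function name and example sizes are my own; the formula is the one given earlier):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[np.newaxis, :]      # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # sine at even indexes
    pe[:, 1::2] = np.cos(angles)                     # cosine at odd indexes
    return pe

pe = positional_encoding(40, 6)   # fixed values, independent of the input words
# In the forward pass, the encoding is simply added to the word embeddings:
# x = word_embeddings + pe[:seq_len]
```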
As we know, deep learning models process a batch of training samples at a time. The Embedding and Position Encoding layers operate on matrices representing a batch of sequence samples. The Embedding takes a (samples, sequence length) shaped matrix of word IDs. It encodes each word ID into a word vector whose length is the embedding size, resulting in a (samples, sequence length, embedding size) shaped output matrix. The Position Encoding uses an encoding size that is equal to the embedding size. So it produces a similarly shaped matrix that can be added to the embedding matrix.
The (samples, sequence length, embedding size) shape produced by the Embedding and Position Encoding layers is preserved all through the Transformer, as the data flows through the Encoder and Decoder Stacks until it is reshaped by the final Output layers.
This gives a sense of the 3D matrix dimensions in the Transformer. However, to simplify the visualization, from here on we will drop the first dimension (for the samples) and use the 2D representation for a single sample.
The Input Embedding sends its outputs into the Encoder. Similarly, the Output Embedding feeds into the Decoder.
The Encoder and Decoder Stacks consist of several (usually six) Encoders and Decoders respectively, connected sequentially.
The first Encoder in the stack receives its input from the Embedding and Position Encoding. The other Encoders in the stack receive their input from the previous Encoder.
The Encoder passes its input into a Multi-head Self-attention layer. The Self-attention output is passed into a Feed-forward layer, which then sends its output upwards to the next Encoder.
Both the Self-attention and Feed-forward sub-layers have a residual skip-connection around them, followed by a Layer-Normalization.
The output of the last Encoder is fed into each Decoder in the Decoder Stack as explained below.
The Decoder’s structure is very similar to the Encoder’s but with a couple of differences.
Like the Encoder, the first Decoder in the stack receives its input from the Output Embedding and Position Encoding. The other Decoders in the stack receive their input from the previous Decoder.
The Decoder passes its input into a Multi-head Self-attention layer. This operates in a slightly different way than the one in the Encoder. It is only allowed to attend to earlier positions in the sequence. This is done by masking future positions, which we’ll talk about shortly.
Unlike the Encoder, the Decoder has a second Multi-head attention layer, known as the Encoder-Decoder attention layer. The Encoder-Decoder attention layer works like Self-attention, except that it combines two sources of inputs — the Self-attention layer below it as well as the output of the Encoder stack.
The output of the Encoder-Decoder attention is passed into a Feed-forward layer, which then sends its output upwards to the next Decoder.
Each of these sub-layers (Self-attention, Encoder-Decoder attention, and Feed-forward) has a residual skip-connection around it, followed by a Layer-Normalization.
We talked about why Attention is so important while processing sequences. In the Transformer, Attention is used in three places:
- Self-attention in the Encoder: the input sequence pays attention to itself
- Self-attention in the Decoder: the target sequence pays attention to itself
- Encoder-Decoder attention in the Decoder: the target sequence pays attention to the input sequence
The Attention layer takes its input in the form of three parameters, known as the Query, Key, and Value.
The Transformer calls each Attention processor an Attention Head and repeats it several times in parallel. This is known as Multi-head attention. It gives its Attention greater power of discrimination, by combining several similar Attention calculations.
The Query, Key, and Value are each passed through separate Linear layers, each with their own weights, producing three results called Q, K, and V respectively. These are then combined using the Attention formula below to produce the Attention Score:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

where d_k is the Query size.
The important thing to realize here is that the Q, K, and V values carry an encoded representation of each word in the sequence. The Attention calculations then combine each word with every other word in the sequence, so that the Attention Score encodes a score for each word in the sequence.
When discussing the Decoder a little while back, we briefly mentioned masking. The Mask also appears in the Attention calculation, just before the Softmax. Let’s see how it works.
While computing the Attention Score, the Attention module implements a masking step. Masking serves two purposes:
- In the Encoder Self-attention, masking serves to zero out attention where there is padding in the input sentences, to ensure that padding doesn’t contribute to the self-attention. (Note: since input sequences could be of different lengths, they are extended with padding tokens, as in most NLP applications, so that fixed-length vectors can be input to the Transformer.) The same padding mask is applied in the Encoder-Decoder attention.
- In the Decoder Self-attention, masking serves to prevent the Decoder from ‘peeking’ ahead at the rest of the target sentence when predicting the next word.
The Decoder processes words in the source sequence and uses them to predict the words in the destination sequence. During training, this is done via Teacher Forcing, where the complete target sequence is fed as Decoder inputs. Therefore, while predicting a word at a certain position, the Decoder has available to it the target words preceding that word as well as the target words following that word. This allows the Decoder to ‘cheat’ by using target words from future ‘time steps’.
For instance, when predicting ‘Word 3’, the Decoder should refer only to the first 3 words of the target but not to the fourth word.
Therefore, the Decoder masks out input words that appear later in the sequence.
When calculating the Attention Score (see the formula above), masking is applied just before the Softmax. The masked-out elements are set to negative infinity, so that Softmax turns those values to zero.
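A minimal sketch of how the two kinds of masks might be built in PyTorch (the function names and padding ID are my own; the only thing taken from the description above is the convention of adding negative infinity before the Softmax):

```python
import torch

def padding_mask(ids, pad_id=0):
    # 0.0 over real tokens, -inf over padding tokens.
    mask = torch.zeros(ids.shape)
    return mask.masked_fill(ids == pad_id, float("-inf"))

def look_ahead_mask(seq_len):
    # -inf above the diagonal: position i may attend only to positions <= i,
    # hiding the target words from future time steps.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(4, 4)                 # raw scores for a 4-word sequence
weights = (scores + look_ahead_mask(4)).softmax(dim=-1)
print(weights)                             # masked-out positions become zero
```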
The last Decoder in the stack passes its output to the Output component which converts it into the final output sentence.
The Linear layer projects the Decoder vector into Word Scores, with a score value for each unique word in the target vocabulary, at each position in the sentence. For instance, if our final output sentence has 7 words and the target Spanish vocabulary has 10000 unique words, we generate 10000 score values for each of those 7 words. The score values indicate the likelihood of occurrence for each word in the vocabulary in that position of the sentence.
The Softmax layer then turns those scores into probabilities (which add up to 1.0). In each position, we find the index for the word with the highest probability, and then map that index to the corresponding word in the vocabulary. Those words then form the output sequence of the Transformer.
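A sketch of this final Output stage under the same assumptions (a 7-word output and a 10000-word vocabulary; the layer and variable names are mine):

```python
import torch
import torch.nn as nn

emb_size, vocab_size, seq_len = 512, 10000, 7
decoder_out = torch.randn(1, seq_len, emb_size)   # output of the last Decoder

linear = nn.Linear(emb_size, vocab_size)          # projects to Word Scores
scores = linear(decoder_out)                      # (1, 7, 10000)
probs = scores.softmax(dim=-1)                    # probabilities sum to 1.0
pred_ids = probs.argmax(dim=-1)                   # (1, 7) best word per position
# Mapping each index in pred_ids back through the vocabulary
# yields the Transformer's output sentence.
```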
During training, we use a loss function such as cross-entropy loss to compare the generated output probability distribution to the target sequence. The probability distribution gives the probability of each word occurring in that position.
Let’s assume our target vocabulary contains just four words. Our goal is to produce a probability distribution that matches our expected target sequence “De nada END”.
This means that the probability distribution for the first word-position should have a probability of 1 for “De” with probabilities for all other words in the vocabulary being 0. Similarly, “nada” and “END” should have a probability of 1 for the second and third word-positions respectively. As usual, the loss is used to compute gradients to train the Transformer via backpropagation.
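In code, that comparison is typically a cross-entropy between the raw Word Scores and the target word IDs. A tiny sketch, assuming a four-word vocabulary as above and hypothetical word IDs:

```python
import torch
import torch.nn.functional as F

# Scores for 3 output positions over a 4-word vocabulary.
scores = torch.randn(3, 4, requires_grad=True)   # pre-Softmax Word Scores
target = torch.tensor([1, 2, 3])                 # "De nada END" as word IDs

loss = F.cross_entropy(scores, target)   # low when each position concentrates
loss.backward()                          # its probability on the target word;
                                         # gradients then train the Transformer
```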
Hopefully, this gives you a feel for what goes on inside the Transformer during Training. As we discussed, it runs in a loop during Inference but most of the processing remains the same. The Multi-head Attention module is what gives the Transformer its power.
In the next section, we will continue our journey and go one step deeper to really understand the details of how Attention is computed.
In this section, we will go a step further and dive deeper into Multi-head Attention, which is the brain of the Transformer.
As we discussed, Attention is used in the Transformer in three places:
- Self-attention in the Encoder
- Self-attention in the Decoder
- Encoder-Decoder attention in the Decoder
Attention Input Parameters — Query, Key, and Value
As we saw, the Attention layer takes its input in the form of three parameters, known as the Query, Key, and Value.
All three parameters are similar in structure, with each word in the sequence represented by a vector.
Encoder Self-Attention
The input sequence is fed to the Input Embedding and Position Encoding, which produce an encoded representation for each word in the input sequence, capturing the meaning and position of each word. This is fed to all three parameters (Query, Key, and Value) in the Self-Attention of the first Encoder, which then produces another encoded representation for each word, now incorporating the attention scores as well. As this passes through all the Encoders in the stack, each Self-Attention module adds its own attention scores into each word’s representation.
Decoder Self-Attention
Coming to the Decoder stack, the target sequence is fed to the Output Embedding and Position Encoding, which produce an encoded representation for each word in the target sequence, capturing the meaning and position of each word. This is fed to all three parameters (Query, Key, and Value) in the Self-Attention of the first Decoder, which then produces another encoded representation for each word, now incorporating the attention scores as well.
After passing through the Layer Norm, this is fed to the Query parameter of the Encoder-Decoder Attention in the first Decoder.
Encoder-Decoder Attention
Along with that, the output of the final Encoder in the stack is passed to the Value and Key parameters in the Encoder-Decoder Attention.
The Encoder-Decoder Attention therefore gets a representation of both the target sequence (from the Decoder Self-Attention) and the input sequence (from the Encoder stack). It produces a representation with the attention scores for each target-sequence word that captures the influence of the attention scores from the input sequence as well.
As this passes through all the Decoders in the stack, each Self-Attention and each Encoder-Decoder Attention also add their own attention scores into each word’s representation.
In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters N-ways and passes each split independently through a separate Head. All of these similar Attention calculations are then combined together to produce a final Attention score. This is called Multi-head attention and gives the Transformer greater power to encode multiple relationships and nuances for each word.
To understand exactly how the data is processed internally, let’s walk through the working of the Attention module while we are training the Transformer to solve a translation problem. We’ll use one sample of our training data which consists of an input sequence (‘You are welcome’ in English) and a target sequence (‘De nada’ in Spanish).
There are three hyperparameters that determine the data dimensions:
- Embedding Size: the width of the embedding vector (we use a width of 6 in our example)
- Query Size (equal to the Key and Value size): the size of the three corresponding parameters (we use a Query Size of 3)
- Number of Attention heads (we use 2 in our example)
In addition, we also have the Batch size, giving us one dimension for the number of samples.
The Input Embedding and Position Encoding layers produce a matrix of shape (Number of Samples, Sequence Length, Embedding Size) which is fed to the Query, Key, and Value of the first Encoder in the stack.
To make it simple to visualize, we will drop the Batch dimension in our pictures and focus on the remaining dimensions.
There are three separate Linear layers for the Query, Key, and Value. Each Linear layer has its own weights. The input is passed through these Linear layers to produce the Q, K, and V matrices.
Now the data gets split across the multiple Attention heads so that each can process it independently.
However, the important thing to understand is that this is a logical split only. The Query, Key, and Value are not physically split into separate matrices, one for each Attention head. There is a single data matrix each for the Query, Key, and Value, with logically separate sections for each Attention head. Similarly, there are no separate Linear layers, one per Attention head. All the Attention heads share the same Linear layer but simply operate on their ‘own’ logical section of the data matrix.
Linear layer weights are logically partitioned per head
This logical split is done by partitioning the input data as well as the Linear layer weights uniformly across the Attention heads. We can achieve this by choosing the Query Size as below:
Query Size = Embedding Size / Number of heads
In our example, that is why the Query Size = 6/2 = 3. Even though the layer weight (and input data) is a single matrix we can think of it as ‘stacking together’ the separate layer weights for each head.
The computations for all Heads can therefore be achieved via a single matrix operation rather than requiring N separate operations. This makes the computations more efficient and keeps the model simple, because fewer Linear layers are required, while still achieving the power of the independent Attention heads.
Reshaping the Q, K, and V matrices
The Q, K, and V matrices output by the Linear layers are reshaped to include an explicit Head dimension. Now each ‘slice’ corresponds to a matrix per head.
This matrix is reshaped again by swapping the Head and Sequence dimensions. Although the Batch dimension is not drawn, the dimensions of Q are now (Batch, Head, Sequence, Query size).
Putting the two steps together: after coming out of the Linear layer, the example Q matrix is reshaped to add the Head dimension and then transposed, as in the sketch below. The final view is for visualization only: although the Q matrix remains a single matrix, we can think of it as a logically separate Q matrix per head.
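The split amounts to two tensor operations; a sketch using the example sizes (batch 1, a 3-word sequence, embedding size 6, 2 heads):

```python
import torch

batch, seq_len, emb_size, n_heads = 1, 3, 6, 2
q_size = emb_size // n_heads                 # Query Size = 6 / 2 = 3

q = torch.randn(batch, seq_len, emb_size)    # output of the Q Linear layer
q = q.view(batch, seq_len, n_heads, q_size)  # add an explicit Head dimension
q = q.transpose(1, 2)                        # swap Head and Sequence dimensions
print(q.shape)                               # (Batch, Head, Sequence, Query size)
```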
We are ready to compute the Attention Score.
We now have the 3 matrices, Q, K, and V, split across the heads. These are used to compute the Attention Score.
We will show the computations for a single head using just the last two dimensions (Sequence and Query size) and skip the first two dimensions (Batch and Head). Essentially, we can imagine that the computations we’re looking at are getting ‘repeated’ for each head and for each sample in the batch (although, obviously, they are happening as a single matrix operation, and not as a loop).
The first step is to do a matrix multiplication between Q and K.
A Mask value is now added to the result. In the Encoder Self-attention, the mask is used to mask out the Padding values so that they don’t participate in the Attention Score.
Different masks are applied in the Decoder Self-attention and in the Encoder-Decoder attention, which we’ll come to a little later in the flow.
The result is now scaled by dividing by the square root of the Query size, and then a Softmax is applied to it.
Another matrix multiplication is performed between the output of the Softmax and the V matrix.
This is the complete Attention Score calculation in the Encoder Self-attention.
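As a minimal sketch, the five steps for a single head might look like this in PyTorch (the example sizes from above; random tensors stand in for real data):

```python
import math
import torch

seq_len, q_size = 3, 3                    # example sizes from above
q = torch.randn(seq_len, q_size)          # Q, K, V slices for a single head
k = torch.randn(seq_len, q_size)
v = torch.randn(seq_len, q_size)
mask = torch.zeros(seq_len, seq_len)      # all zeros here: no padding to hide

scores = q @ k.T                          # 1. matrix multiply Q and K
scores = scores + mask                    # 2. add the Mask value
scores = scores / math.sqrt(q_size)       # 3. scale by sqrt(Query size)
weights = scores.softmax(dim=-1)          # 4. Softmax
attention = weights @ v                   # 5. matrix multiply with V
```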
We now have separate Attention Scores for each head, which need to be combined together into a single score. This Merge operation is essentially the reverse of the Split operation.
It is done by simply reshaping the result matrix to eliminate the Head dimension. The steps are:
- Reshape the Attention Score matrix by swapping the Head and Sequence dimensions: the shape goes from (Batch, Head, Sequence, Query size) to (Batch, Sequence, Head, Query size).
- Collapse the Head dimension by reshaping to (Batch, Sequence, Head * Query size).
Since Embedding size = Head * Query size, the merged Score has shape (Batch, Sequence, Embedding size).
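In code, the Merge reverses the earlier Split sketch:

```python
import torch

batch, seq_len, n_heads, q_size = 1, 3, 2, 3          # same example sizes
score = torch.randn(batch, n_heads, seq_len, q_size)  # per-head Attention Scores

score = score.transpose(1, 2).contiguous()            # swap Head and Sequence back
merged = score.view(batch, seq_len, n_heads * q_size)
print(merged.shape)                                   # (Batch, Sequence, Embedding size)
```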
Putting it all together, this is the end-to-end flow of Multi-head Attention.
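The whole flow can be condensed into one module. This is a simplified sketch for illustration, not a reference implementation: real implementations typically also add dropout and a final output Linear layer, which are omitted here to match the flow described above.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Linear layers -> split into heads -> attention -> merge heads."""

    def __init__(self, emb_size, n_heads):
        super().__init__()
        assert emb_size % n_heads == 0
        self.n_heads = n_heads
        self.q_size = emb_size // n_heads       # Query Size = Embedding / heads
        # One Linear layer each for Query, Key, and Value; all heads share
        # them and operate on their own logical section of the output.
        self.w_q = nn.Linear(emb_size, emb_size)
        self.w_k = nn.Linear(emb_size, emb_size)
        self.w_v = nn.Linear(emb_size, emb_size)

    def _split(self, x):
        b, s, _ = x.shape
        return x.view(b, s, self.n_heads, self.q_size).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self._split(self.w_q(query))
        k = self._split(self.w_k(key))
        v = self._split(self.w_v(value))
        scores = q @ k.transpose(-2, -1)              # (batch, head, seq, seq)
        if mask is not None:
            scores = scores + mask                    # -inf entries vanish below
        weights = (scores / math.sqrt(self.q_size)).softmax(dim=-1)
        out = weights @ v                             # (batch, head, seq, q_size)
        b, h, s, d = out.shape                        # Merge: swap back and
        return out.transpose(1, 2).contiguous().view(b, s, h * d)  # collapse heads

mha = MultiHeadAttention(emb_size=6, n_heads=2)
x = torch.randn(1, 3, 6)            # (batch, seq, emb): 'You are welcome'
print(mha(x, x, x).shape)           # Self-attention: torch.Size([1, 3, 6])
```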
An Embedding vector captures the meaning of a word. In the case of Multi-head Attention, as we have seen, the Embedding vectors for the input (and target) sequence get logically split across multiple heads. What is the significance of this?
This means that separate sections of the Embedding can learn different aspects of the meanings of each word, as it relates to other words in the sequence. This allows the Transformer to capture richer interpretations of the sequence.
This may not be a realistic example, but it might help to build intuition. For instance, one section might capture the ‘gender-ness’ (male, female, neuter) of a noun while another might capture the ‘cardinality’ (singular vs plural) of a noun. This might be important during translation because, in many languages, the verb that needs to be used depends on these factors.
The Decoder Self-Attention works just like the Encoder Self-Attention, except that it operates on each word of the target sequence.
Similarly, the Masking masks out the Padding words in the target sequence.
The Encoder-Decoder Attention takes its input from two sources. Therefore, unlike the Encoder Self-Attention, which computes the interaction between each input word with other input words, and Decoder Self-Attention which computes the interaction between each target word with other target words, the Encoder-Decoder Attention computes the interaction between each target word with each input word.
Therefore each row in the resulting Attention Score corresponds to the interaction of one Q (i.e. target sequence) word with all the K (i.e. input sequence) words and all the V (i.e. input sequence) words.
Similarly, the Masking masks out the later words in the target output, as was explained.
Hopefully, this gives you a good sense of what the Attention modules in the Transformer do. When put together with the end-to-end flow of the Transformer as a whole, we have now covered the detailed operation of the entire Transformer architecture. We now understand exactly what the Transformer does. But we haven’t fully answered the question of why the Transformer’s Attention performs the calculations that it does. Why does it use the notions of Query, Key, and Value, and why does it perform the matrix multiplications that we just saw?
We have a vague intuitive idea that it ‘captures the relationship between each word with each other word’, but what exactly does that mean? How exactly does that give the Transformer’s Attention the capability to understand the nuances of each word in the sequence?
The Transformer gets its power from the Attention module, and that happens because it captures the relationships between each word in a sequence and every other word. But the all-important question is: how exactly does it do that?
In the following sections, we will attempt to answer that question, and understand why it performs the calculations that it does. To understand what makes the Transformer tick, we must focus on Attention. Let’s start with the input that goes into it, and then look at how it processes that input.
The Attention module is present in every Encoder in the Encoder stack, as well as every Decoder in the Decoder stack. We’ll zoom in on the Encoder attention first.
As an example, let’s say that we’re working on an English-to-Spanish translation problem, where one sample source sequence is “The ball is blue”. The target sequence is “La bola es azul”.
The source sequence is first passed through the Embedding and Position Encoding layers, which generate the embedding vectors for each word in the sequence. The embedding is passed to the Encoder, where it first reaches the Attention module.
Within Attention, the embedded sequence is passed through three Linear layers which produce three separate matrices — known as the Query, Key, and Value. These are the three matrices that are used to compute the Attention Score.
The important thing to keep in mind is that each ‘row’ of these matrices corresponds to one word in the source sequence.
The way we will understand what is going on with Attention, is by starting with the individual words in the source sequence, and then following their path as they make their way through the Transformer. In particular, we want to focus on what goes on inside the Attention Module.
That will help us clearly see how each word in the source and target sequences interacts with other words in the source and target sequences.
So as we go through this explanation, concentrate on what operations are being performed on each word, and how each vector maps to the original input word. We do not need to worry about many of the other details such as matrix shapes, specifics of the arithmetic calculations, multiple attention heads, and so on if they are not directly relevant to where each word is going.
So to simplify the explanation and the visualization, let’s ignore the embedding dimension and track just the rows for each word.
Each such row has been generated from its corresponding source word by a series of transformations — embedding, position encoding, and linear layer.
All of those transformations are trainable operations. This means that the weights used in those operations are not pre-decided but are learned by the model in such a way that they produce the desired output predictions.
The key question is, how does the Transformer figure out what set of weights will give it the best results? Keep this point in the back of your mind as we will come back to it a little later.
Attention performs several steps, but here, we will focus only on the Linear layer and the Attention Score.
As we can see from the formula, the first step within Attention is a matrix multiplication (i.e. a set of dot products) between the Query (Q) matrix and a transpose of the Key (K) matrix. Watch what happens to each word.
We produce an intermediate matrix (let’s call it a ‘factor’ matrix) where each cell is the dot product between two word vectors.
For instance, each cell in the fourth row holds a dot product between the fourth Query word and one of the Key words, so the row as a whole relates the fourth word to every Key word.
The next step is a matrix multiplication between this intermediate ‘factor’ matrix and the Value (V) matrix, to produce the attention score that is output by the attention module. The fourth row of the output then corresponds to the fourth Query word combined with all the Key and Value words.
This produces the Attention Score vector (Z) that is output by the Attention Module.
The way to think about the output score is that, for each word, it is a weighted sum of the encoded value of every word from the ‘Value’ matrix, with the weights given by the ‘factor’ matrix. The factor matrix is the dot product of the Query value for that specific word with the Key values of all words.
The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value words are the words to which we are paying attention, i.e. how relevant each of those words is to the Query word.
For example, for the sentence, “The ball is blue”, the row for the word “blue” will contain the attention scores for “blue” with every other word. Here, “blue” is the Query word, and the other words are the “Key/Value”.
There are other operations being performed such as a division and a softmax, but we can ignore them in this article. They just change the numeric values in the matrices but don’t affect the position of each word row in the matrix. Nor do they involve any inter-word interactions.
So we have seen that the Attention Score is capturing some interaction between a particular word, and every other word in the sentence, by doing a dot product, and then adding them up. But how does the matrix multiply help the Transformer determine the relevance between two words?
To understand this, remember that the Query, Key, and Value rows are actually vectors with an Embedding dimension. Let’s zoom in on how the matrix multiplication between those vectors is calculated.
When we do a dot product between two vectors, we multiply pairs of numbers and then sum them up.
This means that if the signs of the corresponding numbers in the two vectors are aligned, the final sum will be larger.
This notion of the Dot Product applies to the attention score as well. If the vectors for two words are more aligned, the attention score will be higher.
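A tiny numeric illustration of this point, with made-up three-dimensional word vectors:

```python
import numpy as np

milk  = np.array([ 0.8, -0.2,  0.9])   # hypothetical word vectors
drank = np.array([ 0.9, -0.1,  0.7])   # signs aligned with "milk"
black = np.array([-0.6,  0.9, -0.8])   # signs mostly opposed to "milk"

print(np.dot(milk, drank))   #  1.37 -> aligned vectors, high attention score
print(np.dot(milk, black))   # -1.38 -> opposed vectors, low attention score
```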
So what is the behavior we want for the Transformer?
We want the attention score to be high for two words that are relevant to each other in the sentence. And we want the score to be low for two words that are unrelated to one another.
For example, for the sentence, “The black cat drank the milk”, the word “milk” is very relevant to “drank”, perhaps slightly less relevant to “cat”, and irrelevant to “black”. We want “milk” and “drank” to produce a high attention score, for “milk” and “cat” to produce a slightly lower score, and for “milk” and “black”, to produce a negligible score.
This is the output we want the model to learn to produce.
For this to happen, the word vectors for “milk” and “drank” must be aligned. The vectors for “milk” and “cat” will diverge somewhat. And they will be quite different for “milk” and “black”.
Let’s go back to the point we had kept at the back of our minds — how does the Transformer figure out what set of weights will give it the best results?
The word vectors are generated based on the word embeddings and the weights of the Linear layers. Therefore the Transformer can learn those embeddings, Linear weights, and so on to produce the word vectors as required above.
In other words, it will learn those embeddings and weights in such a way that if two words in a sentence are relevant to each other, then their word vectors will be aligned. And hence produce a higher attention score. For words that are not relevant to each other, the word vectors will not be aligned and will produce a lower attention score.
Therefore the embeddings for “milk” and “drank” will be very aligned and produce a high attention score. They will diverge somewhat for “milk” and “cat” to produce a slightly lower score and will be quite different for “milk” and “black”, to produce a low score.
This then is the principle behind the Attention module.
The dot product between the Query and Key computes the relevance between each pair of words. This relevance is then used as a “factor” to compute a weighted sum of all the Value words. That weighted sum is output as the Attention Score.
The Transformer learns embeddings etc, in such a way that words that are relevant to one another are more aligned.
This is one reason for introducing the three Linear layers, which make three versions of the input sequence for the Query, Key, and Value. That gives the Attention module additional parameters that it can learn, to tune the creation of the word vectors.
Attention is used in the Transformer in three places:
In the Encoder Self Attention, we compute the relevance of each word in the source sentence to each other word in the source sentence. This happens in all the Encoders in the stack.
Most of what we’ve just seen in the Encoder Self Attention applies to Attention in the Decoder as well, with a few small but significant differences.
In the Decoder Self Attention, we compute the relevance of each word in the target sentence to each other word in the target sentence.
In the Encoder-Decoder Attention, the Query is obtained from the target sentence and the Key/Value from the source sentence. Thus it computes the relevance of each word in the target sentence to each word in the source sentence.
Hopefully, this gives you a good sense of the elegance of the Transformer design.