Transformer Architecture

This post provides a primer on the Transformer model architecture. The Transformer is extremely adept at sequence modelling tasks such as language modelling, where the elements in a sequence exhibit temporal correlations with each other.

Encoder-Decoder Models

Transformers are a type of Encoder-Decoder model. In this section, we explain in general terms what an Encoder-Decoder model is.

Encoder-Decoder model block diagram

The Encoder part of the model maps a high-dimensional input representation, such as word embeddings, into a lower dimensional latent representation. It can achieve this via transformations and feature extraction. Hence, encoding can be done by passing data through fully-connected (dense) layers or convolutional layers.

The Decoder part of the model maps the latent representation into an output representation, which often has a higher dimension than the latent representation. In Autoencoder models, the output representation is often meant to be semantically similar to the input representation. An example of an Autoencoder is a model that is trained to remove noise from images.
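
To make the change in dimensions concrete, here is a minimal NumPy sketch of an encoder and a decoder built from one dense layer each. The sizes (8 and 2), the tanh activation and the untrained random weights are assumptions chosen purely for illustration; they are not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, latent_dim = 8, 2          # illustrative sizes (assumed)

# One dense layer each for the encoder and the decoder (untrained weights).
W_enc = rng.normal(size=(input_dim, latent_dim))
W_dec = rng.normal(size=(latent_dim, input_dim))

def encode(x):
    """Map the input to a lower-dimensional latent representation."""
    return np.tanh(x @ W_enc)

def decode(z):
    """Map the latent representation back to the output space."""
    return z @ W_dec

x = rng.normal(size=(4, input_dim))   # a batch of 4 input vectors
z = encode(x)
x_hat = decode(z)

print(x.shape, z.shape, x_hat.shape)  # (4, 8) (4, 2) (4, 8)
```

Training an Autoencoder would adjust W_enc and W_dec so that x_hat is close to x (or to a clean version of a noisy x).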

Transformers use multi-headed self-attention for both encoding and decoding, and are applied to tasks such as language modelling (next-word prediction) and machine translation.

Multi-headed Self-attention

Self-attention is a mechanism used to build representations based on the pair-wise correlations between the elements in a sequence. In particular, we compute an attention score between the query vector Q of each element and the key vector K of every element in the sequence (including the element itself). In the most commonly used dot-product attention, the attention score is computed as the dot product of Q and K.

Self-attention animation

Intuitively, the attention mechanism performs a lookup (similar to a dictionary lookup, or database query), producing a set of possible answers. The most probable answers have the highest attention scores. The computed attention scores are normalised with a softmax and used as weights. The output of the attention mechanism is the product of the attention weights and the value vectors V of the sequence. For each element, this final output is effectively a weighted sum over the entire sequence.

Distributional Hypothesis

The output of the attention mechanism combines the token representations across the entire sequence. This can be viewed as an extension of the distributional hypothesis. Loosely interpreted, the semantic meaning of a word is closely correlated to its surrounding context, a.k.a. “a word is characterized by the company it keeps” (Firth, 1957).

Thus, self-attention is used as a mechanism for representation learning, based on the correlations (or dependencies) between the elements in a sequence.

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{H}}\right)V

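As the dot product of Q and K can be very large, which would push the softmax into a region with extremely small gradients, we scale it down by the square root of the dimension of the key vectors. For a single attention head acting on the full representation, this is the model dimension H.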
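
The formula can be sketched in a few lines of NumPy. This is a minimal, single-head illustration rather than a reference implementation: it assumes one row per sequence element (so Q, K and V have shape sequence_len × d), and the function names and sizes are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q, K: (sequence_len, d); V: (sequence_len, d_v).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (sequence_len, sequence_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape)       # (5, 16)
print(weights.shape)   # (5, 5)
```

Row i of weights holds the attention that element i pays to every element of the sequence, and row i of out is the corresponding weighted sum of the value vectors.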

Attention Masks

By default, the self-attention mechanism can freely attend to any position in the input sequence. In other words, the entire value vector V is fully visible to the attention mechanism. This may not be desirable in two circumstances:

  • Padding tokens are present in the input sequence.
  • The model must be prevented from “looking ahead” (“looking into the future”), so attention to positions on the right is not allowed. This is a concern when using the attention mechanism inside the decoder portion of the model.

In both cases, we can introduce an attention mask. The attention mask forcefully sets the attention scores of certain positions to a highly negative number before the softmax operation. Hence, those positions are effectively “zeroed-out”: they receive an insignificantly small weight after the softmax.
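
Both kinds of mask can be illustrated with a small NumPy sketch. The scores are set to zero so the effect of the mask is easy to read, and the last token is assumed to be padding; the helper name and the value -1e9 are illustrative choices, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_weights(scores, mask, neg=-1e9):
    """mask[i, j] is True where position i is allowed to attend to position j."""
    return softmax(np.where(mask, scores, neg), axis=-1)

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # uniform toy scores

# Causal ("look-ahead") mask: position i may attend to positions 0..i only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Padding mask: assume the last token is padding, so no position attends to it.
is_real_token = np.array([True, True, True, False])
padding_mask = np.broadcast_to(is_real_token, (seq_len, seq_len))

causal_w = masked_attention_weights(scores, causal_mask)
padding_w = masked_attention_weights(scores, padding_mask)
print(np.round(causal_w, 2))
print(np.round(padding_w, 2))
```

With the causal mask, everything above the diagonal receives zero weight. With the padding mask, the last column is zero and the remaining weight is shared among the real tokens.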

Deep-dive into Self-Attention

Attention Colab notebook

For a deep-dive into self-attention and the mechanics behind how it works, do check out this Colab notebook made by me. It contains several toy examples to demonstrate precisely how self-attention and masking work.

Training Attention; Q, K And V

Up to this point, you may have noticed two things:

  • We introduced the Q, K and V vectors
  • Self-attention itself does not have any learnable parameters

The Q, K and V vectors are generated for every element in the input sequence by passing the representation of that element (a vector) through trainable dense layers, one each for Q, K and V. Crucially, each attention head does not work with the entire resulting vector, but with a subsection of it (along the embedding dimension). We do this to enable multiple attention heads (next section).
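
As a rough sketch of where the learnable parameters live, the snippet below produces Q, K and V from a sequence of token embeddings using three dense layers. The sizes, the random initialisation and the names W_q, W_k, W_v are assumptions for illustration; in a real model these weights are updated during training.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, H = 5, 16                        # illustrative sizes (assumed)

X = rng.normal(size=(seq_len, H))         # one embedding vector per token

# Trainable parameters: one dense layer (weights and bias) each for Q, K and V.
# They are randomly initialised here; training would update them.
W_q, W_k, W_v = (rng.normal(scale=H ** -0.5, size=(H, H)) for _ in range(3))
b_q, b_k, b_v = (np.zeros(H) for _ in range(3))

Q = X @ W_q + b_q
K = X @ W_k + b_k
V = X @ W_v + b_v

num_params = 3 * (H * H + H)              # all learnable parameters in this step
print(Q.shape, K.shape, V.shape)          # (5, 16) (5, 16) (5, 16)
print(num_params)                         # 816
```

The attention operation that follows (dot products, softmax, weighted sum) adds no further parameters of its own.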

Transforming the token embedding into Q, K and V, then performing the attention operation

Multi-headed Self-attention

The main limitation of a single self-attention mechanism is that it models only one particular type of correlation. Below are examples of such correlations learnt by the self-attention mechanism, as shown in Analysing Multi-Head Self-Attention (Voita et al., 2019). From left to right, these example correlations represent a positional relation, a syntactic relation and the presence of a rare word.

Correlations picked up by the attention mechanism (images from Analysing Multi-Head Self-Attention (Voita et al., 2019))

We refer to each single self-attention mechanism as a head. In that sense, multi-head attention (MHA) is the combination of several attention heads. This is conceptually similar to how a convolutional layer can consist of multiple convolution filters, with each filter independently extracting a different type of feature.

In MHA, each attention head acts on a different subspace of the input representation in the embedding dimension. This effectively enables each attention head to work independently with a hidden dimension of d_embedding//num_heads. During model training, the parameters in each embedding subspace are trained separately as part of a particular self-attention mechanism.
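
Here is a minimal NumPy sketch of the splitting step, assuming Q, K and V have already been computed with shape sequence_len × H. Each head receives a slice of width H // num_heads, runs scaled dot-product attention on it independently, and the results are concatenated back together. The final output projection used in most implementations is left out to keep the sketch short, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    """(seq_len, H) -> (num_heads, seq_len, H // num_heads)."""
    seq_len, H = x.shape
    return x.reshape(seq_len, num_heads, H // num_heads).transpose(1, 0, 2)

def multi_head_attention(Q, K, V, num_heads):
    Qh, Kh, Vh = (split_heads(t, num_heads) for t in (Q, K, V))
    d_head = Qh.shape[-1]
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ Vh                                     # (heads, seq, d_head)
    # Concatenate the heads back along the embedding dimension.
    seq_len = Q.shape[0]
    return heads.transpose(1, 0, 2).reshape(seq_len, -1), weights

rng = np.random.default_rng(0)
seq_len, H, A = 5, 16, 4
Q, K, V = (rng.normal(size=(seq_len, H)) for _ in range(3))
out, weights = multi_head_attention(Q, K, V, num_heads=A)
print(out.shape, weights.shape)   # (5, 16) (4, 5, 5)
```

Note that there is one sequence_len × sequence_len attention pattern per head, which is what allows different heads to pick up different correlations.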

Multi-head Attention

Self-Attention VS CNN/RNN

Self-attention is better able to model long-range dependencies than a convolutional neural network (CNN) or a recurrent neural network (RNN). Unlike a CNN or an RNN, self-attention directly connects every pair of elements in a given sequence within a single layer.

Modelling Dependencies with CNN and RNN

In a CNN, the maximum range of a dependency that can be learnt in a layer is the size of the convolutional filter. In other words, convolution filters can only extract local correlations within the current filter window.

CNN animation

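With contiguous filters of size k, modelling the dependency between an element at the start and an element at the end of a sequence of length n thus requires a stack of O(n/k) layers. The depth drops to O(log n) layers only if dilated convolutions are used (Vaswani et al., 2017).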

CNN learning sequence dependencies
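
To get a feel for how many layers are needed, the sketch below counts stacked 1-D convolution layers until the receptive field covers the whole sequence. It assumes stride 1 and no pooling. Two cases are compared: ordinary (contiguous) filters, where the depth grows linearly with n, and dilated filters whose dilation grows as 1, k, k², ..., where the depth grows logarithmically. The function names are my own.

```python
def layers_contiguous(n, k):
    """Stride-1, kernel-size-k layers needed for the receptive field to cover n."""
    layers, receptive_field = 0, 1
    while receptive_field < n:
        receptive_field += k - 1          # each layer widens the field by k - 1
        layers += 1
    return layers

def layers_dilated(n, k):
    """Same, but layer i uses dilation k**i (1, k, k**2, ...)."""
    layers, receptive_field, dilation = 0, 1, 1
    while receptive_field < n:
        receptive_field += (k - 1) * dilation
        dilation *= k
        layers += 1
    return layers

for n in (16, 256, 4096):
    print(n, layers_contiguous(n, k=3), layers_dilated(n, k=3))
```

In contrast, a single self-attention layer connects the first and last elements directly, whatever the sequence length.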

As for an RNN, modelling long-term dependencies is a well-known challenge, as the RNN must be able to persist the “correct” hidden state as it moves through the sequence. Advanced variants of RNNs, such as LSTMs, are better able to model such dependencies by introducing additional gates to modify the hidden state. In general, RNNs do not have a mechanism for modelling specific time dependencies and provide no learning guarantees for doing so.

RNN animation

Transformer “Layer”

While MHA is the “signature” mechanism in a Transformer model, there are many other things going on inside the same model. A Transformer model can be broken down into a series of repeating blocks. In this section, we will break down the structure of a Transformer block or “layer”.

Structure of Transformer Layer

Each Transformer layer takes in an input representation (embedding_dim × sequence_len) and passes it through the MHA mechanism. The output is then passed through a feed-forward network (2 dense layers with an activation function in between) to produce an output embedding. A high-level view of a single Transformer layer can be seen below.

Transformer layer block diagram

Notice that the input and output representations of the Transformer layer have exactly the same dimensions. As such, we can chain together any number of Transformer layers to produce a deep Transformer model.

Parameterising the Transformer Layer

The Transformer layer can be parameterised by the following hyper-parameters:

  • H: size of embedding dimension (or hidden dimension)
  • A: number of attention heads
  • ffn_dim: number of nodes in the first layer of the feed-forward network. Typically 4×H.
  • ffn_act: activation function inside the feed-forward network
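
As a worked example of how these hyper-parameters determine the size of a layer, the sketch below counts the learnable parameters in one layer. It assumes a common layout that goes slightly beyond what is described above: four H × H projections in the attention block (Q, K, V and an output projection), biases on every dense layer, and two layer normalisations with a scale and a shift each. Under these assumptions, the number of heads A does not change the count, because the heads only partition the H dimensions.

```python
def transformer_layer_params(H, ffn_dim=None):
    """Learnable parameters in one Transformer layer (assumed layout, see above)."""
    ffn_dim = 4 * H if ffn_dim is None else ffn_dim
    attention = 4 * (H * H + H)                          # Q, K, V and output projections
    ffn = (H * ffn_dim + ffn_dim) + (ffn_dim * H + H)    # two dense layers with biases
    layer_norm = 2 * (2 * H)                             # two layer norms: scale and shift
    return attention + ffn + layer_norm

print(transformer_layer_params(H=768))   # 7087872
```

With H = 768 and ffn_dim = 4 × H = 3072, a single layer has roughly 7.1M parameters, about two thirds of which sit in the feed-forward network.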

Additional Notes on Transformer Layer

Each Transformer layer contains residual connections. This has empirically been shown to allow positional information (injected at the first Transformer layer in the model) to propagate deeper within the model. Layer normalisation operations are present to normalise the values of the representations produced within the Transformer layer. These are shown in the diagram below.

Transformer layer, showing residual connections and layer normalization
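
Putting the pieces together, here is a compact NumPy sketch of a Transformer layer with residual connections and layer normalisation, stacked three times. It is a simplified illustration under several assumptions: layer normalisation is applied after each residual connection, biases and the learned scale and shift of the layer norm are omitted, the feed-forward network uses ReLU, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_layer(rng, H, ffn_dim):
    shapes = {"W_q": (H, H), "W_k": (H, H), "W_v": (H, H), "W_o": (H, H),
              "W_1": (H, ffn_dim), "W_2": (ffn_dim, H)}
    return {name: rng.normal(scale=H ** -0.5, size=shape)
            for name, shape in shapes.items()}

def mha(x, p, A):
    seq_len, H = x.shape
    def split(t):   # (seq_len, H) -> (A, seq_len, H // A)
        return t.reshape(seq_len, A, H // A).transpose(1, 0, 2)
    q, k, v = split(x @ p["W_q"]), split(x @ p["W_k"]), split(x @ p["W_v"])
    w = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(H // A))
    heads = (w @ v).transpose(1, 0, 2).reshape(seq_len, H)
    return heads @ p["W_o"]

def transformer_layer(x, p, A):
    x = layer_norm(x + mha(x, p, A))                    # residual + layer norm
    ffn = np.maximum(0.0, x @ p["W_1"]) @ p["W_2"]      # 2 dense layers, ReLU between
    return layer_norm(x + ffn)                          # residual + layer norm

rng = np.random.default_rng(0)
seq_len, H, A, ffn_dim = 5, 16, 4, 64
x = rng.normal(size=(seq_len, H))
for p in [init_layer(rng, H, ffn_dim) for _ in range(3)]:
    x = transformer_layer(x, p, A)                      # same shape in and out
print(x.shape)   # (5, 16)
```

Because each layer returns a representation with the same shape as its input, the loop can be extended to any number of layers.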

Many later Transformer models (such as BERT and GPT-2) use the GELU activation function (Hendrycks et al., 2016) in the feed-forward network, which has empirically shown better performance. The original Transformer (Vaswani et al., 2017) used ReLU.
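
For reference, GELU multiplies its input by the standard normal CDF evaluated at that input. The sketch below implements this exact form together with the tanh approximation given by Hendrycks and Gimpel, and prints both next to ReLU. It uses only the Python standard library and works on scalars for simplicity.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

Unlike ReLU, GELU is smooth around zero and lets small negative values pass through slightly attenuated rather than cutting them off.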

Vaswani Transformer

In Attention is All You Need (Vaswani et al., 2017), the authors proposed an Encoder-Decoder architecture for the Transformer model, with a focus on machine translation. The overall structure of the model is shown below:

Transformer encoder-decoder model

This full encoder-decoder structure is not retained by most of the later Transformer models, many of which keep only the encoder or only the decoder. The model takes in an input sequence (a sentence in the source language), a seed output sequence (either a START token or a partial output sequence) and predicts the next word in the output sequence.

The intuition behind this model is as follows:

  1. The input sequence is encoded by a series of Encoder layers
  2. The output of the Encoder is a sequence of vectors (one per input token) that represents the semantic meaning of the input sequence
  3. With the seed output sequence as an initial context representation, the Decoder layers predict the next words to complete the translated sentence, given the Encoder output
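
The three steps above amount to a loop that repeatedly feeds the growing output sequence back into the Decoder. The sketch below shows the control flow only, using greedy decoding (one next word per step). The functions encode and predict_next are hypothetical stand-ins for a trained model, and the hard-coded toy translation exists only so that the example runs.

```python
START, END = "<start>", "<end>"

# Hypothetical stand-in for a trained model: a fixed toy "translation".
# A real model would run the Encoder and Decoder layers instead.
TOY_TRANSLATION = ["ich", "mag", "katzen", END]

def encode(source_tokens):
    """Stand-in for the Encoder: a real model returns one vector per source token."""
    return tuple(source_tokens)

def predict_next(encoder_output, output_so_far):
    """Stand-in for the Decoder: returns the next token given the output so far."""
    position = len(output_so_far) - 1       # tokens generated after START
    return TOY_TRANSLATION[min(position, len(TOY_TRANSLATION) - 1)]

def greedy_decode(source_tokens, max_len=10):
    encoder_output = encode(source_tokens)  # steps 1 and 2: encode the input once
    output = [START]                        # seed output sequence
    while len(output) < max_len:
        next_token = predict_next(encoder_output, output)   # step 3
        if next_token == END:
            break
        output.append(next_token)
    return output[1:]                       # drop the START token

print(greedy_decode(["i", "like", "cats"]))   # ['ich', 'mag', 'katzen']
```

Note that the input sequence is encoded only once, while the Decoder is called once per generated word.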

Input Representation (Vaswani Transformer)

In the Vaswani Transformer, the input into the very first encoder block is a tensor of shape embedding_dim × sequence_len. It consists of the element-wise sum of:

  1. A sequence of word embeddings (to be trained)
  2. A sequence of positional embeddings (predefined using a series of sine and cosine functions)

The positional embeddings are injected in the first layer. Since self-attention by itself is entirely position invariant (it has no notion of token order), these positional signals allow the Transformer to differentiate between the same token present at different positions.
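
The sine and cosine positional embeddings of Vaswani et al. (2017) can be sketched as follows: even embedding dimensions use a sine, odd dimensions use a cosine, and the wavelength grows geometrically with the dimension index. The sketch assumes one row per position (sequence_len × embedding_dim) and an even embedding_dim; the word embeddings are random placeholders for what would be trained.

```python
import numpy as np

def sinusoidal_positions(sequence_len, embedding_dim):
    """Returns an array of shape (sequence_len, embedding_dim); embedding_dim must be even."""
    positions = np.arange(sequence_len)[:, None]           # (sequence_len, 1)
    even_dims = np.arange(0, embedding_dim, 2)[None, :]    # 2i for i = 0, 1, ...
    angles = positions / np.power(10000.0, even_dims / embedding_dim)
    pe = np.zeros((sequence_len, embedding_dim))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

rng = np.random.default_rng(0)
sequence_len, embedding_dim = 6, 8
word_embeddings = rng.normal(size=(sequence_len, embedding_dim))   # would be trained
positional = sinusoidal_positions(sequence_len, embedding_dim)
model_input = word_embeddings + positional                         # element-wise sum
print(model_input.shape)   # (6, 8)
```

Since the rows of the positional array differ from one position to the next, the same word embedding produces a different input vector at each position.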

While the Vaswani Transformer used fixed positional embeddings, the positional embeddings can be learnt during training. This approach is taken by most other Transformer models.

Transformer Family Tree

The Vaswani Transformer is but the first in an entire lineage of Transformer models that followed. Most of the later models are larger, drop either the encoder or the decoder, and introduce concepts such as relative attention.

The general trend is that Transformer models become larger over time. As the number of parameters and the dataset size increase, so does the performance of the model on a variety of benchmark tasks. For example, Megatron-LM (Shoeybi et al., 2019), the largest decoder model at the time of writing (8.3B parameters), achieves state-of-the-art results on many language modelling benchmarks. T5 (Raffel et al., 2019) achieves state-of-the-art results on many natural language understanding benchmarks with a staggering 11B parameters.


As Transformer models continue to ace NLP benchmarks such as GLUE and SuperGLUE, we will no doubt see more varieties of Transformer models proposed by various research labs around the world. I hope this post has been informative in helping you understand Transformer models.

I have made an effort to be complete and precise in my explanations, while also providing the intuition necessary to understand Transformer models in general. If you would like to learn more, I have linked to some great resources below.

If you would like to test-drive the various Transformer models in your own projects, I highly recommend HuggingFace’s Transformers library, which is compatible with both TensorFlow 2.0 and PyTorch.

By Timothy Liu

I’m a Computer Science PhD student at the Singapore University of Technology and Design (SUTD) doing research on Machine Learning and Natural Language Processing. I work primarily on Natural Language Processing with Deep Learning (DL). My wider interests include understanding generalisation in DL, performant execution of DL models and understanding uncertainty in DL models.
