In the previous post, we explored sequence modeling using an encoder-decoder architecture connected through an attention mechanism. This mechanism allows the decoder to “attend” to different parts of the input at each time step while generating the output sequence.
Attention can also be applied to a variety of tasks, such as image captioning. In this case, the decoder RNN focuses on different regions of the input image as it generates each word of the output caption.

Image captioning task with an RNN decoder with Attention.
Given its usefulness, let’s abstract the attention mechanism from sequence modeling and generalize it into a layer that can be inserted into any network.
General Attention Layer

(left) Attention mechanism in the image captioning task. (right) Generalized attention mechanism represented with vectors.
Recall that in the attention mechanism at each timestep:
Input:
- Features: $\mathbf{z}$ (Shape: $\text{H} \times \text{W} \times \text{D}$)
- Hidden state: $\mathbf{h}$ (Shape: $\text{D}$)
- Similarity function: $f_{\text{att}}$
Operations:
- Alignment: $e_{i, j} = f_{\text{att}} (\mathbf{h}, \mathbf{z}_{i,j})$
- Attention: $a = \text{softmax} (e) $
Output:
- Context vector: $\mathbf{c} = \sum_{i, j} \text{ } a_{i,j} \text{ } \mathbf{z}_{i,j}$ (Shape: $\text{D}$)
This mechanism can be generalized to operate on any set of vectors, making it more broadly applicable in deep learning.
To formalize this generalization of the attention mechanism, let’s redefine its components:
The input features are now represented as a set of vectors, $\mathbf{x}$, with shape $(\text{N} \times \text{D})$, where $\text{N} = \text{H} \times \text{W}$. These vectors are the elements we want to attend over.
- $\text{N}$ represents the number of vectors
- $\text{D}$ represents the dimension of each vector.
The hidden state of the decoder is renamed as a query vector, $\mathbf{q}$ (Shape: $\text{D}$).
The similarity function $f_{\text{att}}$, typically implemented as a Multi-Layer Perceptron (MLP), compares the query vector to each input vector.
The output context vector is denoted as $\mathbf{y}$.
Our general attention mechanism now looks like this:
Input:
- Input vectors: $\mathbf{x}$ (Shape: $\text{N} \times \text{D}$)
- Query vector: $\mathbf{q}$ (Shape: $\text{D}$)
- Similarity function: $f_{\text{att}}$
Operations:
- Alignment: $e_{j} = f_{\text{att}} (\mathbf{q}, \mathbf{x}_{j})$
- Attention: $a = \text{softmax} (e) $
Output:
- Context vector: $\mathbf{y} = \sum_j a_{j} \text{ } \mathbf{x}_{j}$ (Shape: $\text{D}$)
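To make this concrete, here is a minimal sketch of the general attention step for a single query, assuming $f_{\text{att}}$ is a small MLP that scores each (query, input) pair after concatenation (the MLP architecture and tensor names here are illustrative choices, not prescribed by the mechanism itself):

import torch
import torch.nn as nn

N, D = 5, 16                      # number of input vectors, feature dimension
x = torch.randn(N, D)             # input vectors
q = torch.randn(D)                # query vector

# A small MLP as the similarity function f_att
f_att = nn.Sequential(nn.Linear(2 * D, D), nn.Tanh(), nn.Linear(D, 1))

e = f_att(torch.cat([q.expand(N, D), x], dim=-1)).squeeze(-1)  # alignment scores, shape [N]
a = torch.softmax(e, dim=-1)                                   # attention weights, sum to 1
y = (a.unsqueeze(-1) * x).sum(dim=0)                           # context vector, shape [D]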
Modifications
Scaled dot product for similarity function
The similarity function we used earlier is called additive attention, where a feed-forward network with a single hidden layer computes the compatibility function between query vectors and input vectors.
A more commonly used alternative is dot-product (multiplicative) attention, where the query vector and input vectors are combined using a dot product. It can also be viewed as a measure of similarity, since a dot product quantifies how closely two vectors are aligned.
While both methods have similar theoretical complexity, dot-product attention is computationally more efficient as it can be implemented using optimized matrix multiplication code.
However, when the vector dimension $\text{D}$ is large, which is typically greater than 1000 for large language models (LLMs), the resulting alignment scores can have high magnitudes. Since the softmax function normalizes these scores to compute attention weights, large magnitudes can cause softmax to saturate, leading to vanishing gradients during backpropagation.
Let’s look at the example below to understand this better.
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
This looks as expected. But when the magnitudes are large, it saturates and converges to a one-hot encoding, killing the gradients.
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*1000, dim=-1)
tensor([0., 0., 0., 0., 1.])
To mitigate this, we scale the dot product by $\frac{1}{\sqrt{\text{D}}}$. This adjustment reduces the impact of large vector magnitudes, similar to initialization techniques like Xavier or Kaiming. This approach, known as scaled dot-product attention, is widely used in modern models.
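To see why the $1/\sqrt{\text{D}}$ factor helps, note that the dot product of two $\text{D}$-dimensional vectors with roughly unit-variance components has variance on the order of $\text{D}$, so raw scores grow with the dimension; dividing by $\sqrt{\text{D}}$ brings them back to unit scale. A quick sketch with random vectors (illustrative only):

D = 1024
q = torch.randn(D)
k = torch.randn(5, D)

raw = k @ q                      # dot products have std ~ sqrt(D), so magnitudes are large
scaled = raw / D**0.5            # rescaled scores have std ~ 1

print(torch.softmax(raw, dim=-1))     # tends to be (nearly) one-hot
print(torch.softmax(scaled, dim=-1))  # a soft, usable distribution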
Multiple Query vectors
At each time step of the decoder, we use one query vector (i.e., one hidden state) to compute a probability distribution over the inputs, producing one context vector.
We can generalize this concept to handle multiple query vectors simultaneously, each generating a corresponding output vector. This allows us to compute multiple attention context vectors in parallel.

(left) A general attention layer. (right) The same layer with multiple query vectors.
With this modification, the attention layer includes:
Input:
- Input vectors: $\mathbf{x}$ (Shape: $\text{N} \times \text{D}$)
- Query vectors: $\mathbf{q}$ (Shape: $\text{M} \times \text{D}$)
- Similarity function: scaled dot product
Operations:
- Alignment: $e_{i, j} = (\mathbf{q}_i \cdot \mathbf{x}_{j}) / \sqrt{\text{D}}$ (Shape: $\text{M} \times \text{N}$)
- Attention: $a = \text{softmax} (e) $
- Context vectors: $\mathbf{y}_i = \sum_j a_{i,j} \text{ } \mathbf{x}_{j}$
Output:
- Output vectors: $\mathbf{y}$ (Shape: $\text{M} \times \text{D}$)
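In matrix form, all $\text{M}$ context vectors are produced by just two matrix multiplications; a minimal sketch of the operations listed above:

# Sketch: attention with M query vectors over N input vectors (no key/value split yet).
N, M, D = 6, 3, 16
x = torch.randn(N, D)             # input vectors
q = torch.randn(M, D)             # query vectors

e = (q @ x.T) / D**0.5            # alignment scores, shape [M, N]
a = torch.softmax(e, dim=-1)      # each row is a distribution over the N inputs
y = a @ x                         # context vectors, shape [M, D]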
Split into Key and Value vectors
In the attention mechanism, we use the input vectors in two different ways:
To generate the alignment scores that compare the input vectors with each query vector via the similarity function.
To compute the output context vectors by taking a weighted sum of input vectors and the attention weights.
To handle these roles effectively, the input vectors are separated into key vectors ($\mathbf{k}$) and value vectors ($\mathbf{v}$). Both are derived using learnable projection matrices applied to the input vectors:
- Keys are used to compute alignment scores with the query.
- Values are used to construct the context vector.
The separation of keys and values enables the model to use input vectors differently for comparison and retrieval. For example:
- Query - What am I looking for?
- Google search: “How tall is the Empire State Building?”
- Keys - What do I contain?
- Google compares the query with a set of webpages that may contain the answer.
- Values - Information in the token that will be communicated
- Returns the webpage saying, “At its top floor, the Empire State Building stands 1,250 feet (380 meters) tall.”
Since the information used for matching (keys) is different from the information returned (values), we separate them into two distinct vectors. The key determines relevance or alignment, whereas the value is used to retrieve the actual information. This gives the model flexibility in handling different types of information.

(left) General attention layer with multiple query vectors. (right) The same layer with separate key and value vectors.
With these modifications, the attention layer is described as follows:
Input:
- Input vectors: $\mathbf{x}$ (Shape: $\text{N} \times \text{D}$)
- Query vectors: $\mathbf{q}$ (Shape: $\text{M} \times \color{blue}{\text{D}_k}$)
- Similarity function: scaled dot product
Operations:
- Key vectors: ${\color{blue}{\mathbf{k}}} = \mathbf{x} {\color{blue}{W_\mathbf{k}}}$ (Shape: $\text{N} \times \color{blue}{\text{D}_k}$)
- Value vectors: ${\color{orange}{\mathbf{v}}} = \mathbf{x} {\color{orange}{W_\mathbf{v}}}$ (Shape: $\text{N} \times {\color{orange}{\text{D}_v}}$)
- Alignment: $e_{i, j} = (\mathbf{q}_i \cdot {\color{blue}{\mathbf{k}_j}}) / \sqrt{\color{blue}{\text{D}_k}}$
- Attention: $a = \text{softmax} (e) $
- Context vectors: $\mathbf{y}_i = \sum_j a_{i,j} \text{ } {\color{orange}{\mathbf{v}_j}}$
Output:
- Output vectors: $\mathbf{y}$ (Shape: $\text{M} \times {\color{orange}{\text{D}_v}}$)
Matrix Representation
The entire attention mechanism can be expressed compactly in matrix form:
\begin{align} \text{Attention} (\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} (\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{\text{D}_k}}) \mathbf{V} \end{align}
Where:
- $\mathbf{Q}$: Matrix of query vectors (Shape: $\text{M} \times \text{D}_k$)
- $\mathbf{K}$: Matrix of key vectors (Shape: $\text{N} \times \text{D}_k$)
- $\mathbf{V}$: Matrix of value vectors (Shape: $\text{N} \times \text{D}_v$)
- Alignment score matrix has a shape of $\text{M} \times \text{N}$ and the softmax is taken along the last dimension $\text{N}$ (which is the number of input vectors).
This is the most common representation of attention that we’ve derived so far.
Notably, the attention mechanism itself has no learnable parameters, and the attention weights are not learned; instead, they are computed using a simple scaled dot product function followed by a softmax operation.
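The formula translates directly into a few lines of code. A minimal sketch, without batching (the learnable parts live outside this function, in the projections that produce $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$):

import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Q: [M, D_k], K: [N, D_k], V: [N, D_v] -> output [M, D_v]."""
    d_k = Q.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / d_k**0.5    # [M, N]
    weights = torch.softmax(scores, dim=-1)          # softmax over the N inputs
    return weights @ V                               # [M, D_v]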
Self-Attention layer
One special case of the attention layer is the self-attention layer, where we only have input vectors and no explicit query vectors. In this case, we use a query matrix to derive the query vectors from our input vectors.
Since each input vector serves as its own query, we end up comparing each vector in the input set with every other vector. This design leverages the power of attention while eliminating the need for external query vectors, enabling the model to capture relationships between different parts of the input.
Self-attention enhances input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence.

A self-attention layer
To distinguish the input and output dimensions, we denote $\text{D}_k = \text{D}_v = \text{D}_{out}$.
Input:
- Input vectors: $\mathbf{x}$ (Shape: $\text{N} \times \text{D}_{in}$)
- Similarity function: scaled dot product
Operations:
Query vectors: ${\color{green}{\mathbf{q}}} = \mathbf{x} {\color{green}{W_\mathbf{q}}}$ (Shape: $\text{N} \times \text{D}_{out}$)
Key vectors: ${\color{blue}{\mathbf{k}}} = \mathbf{x} {\color{blue}{W_\mathbf{k}}}$ (Shape: $\text{N} \times \text{D}_{out}$)
Value vectors: ${\color{orange}{\mathbf{v}}} = \mathbf{x} {\color{orange}{W_\mathbf{v}}}$ (Shape: $\text{N} \times \text{D}_{out}$)
Alignment: $e_{i, j} = ({\color{green}{\mathbf{q}_i}} \cdot {\color{blue}{\mathbf{k}_j}} ) / \sqrt{\text{D}_{out}}$
Attention: $a = \text{softmax} (e) $
Context vectors: $\mathbf{y}_i = \sum_j a_{i,j} \text{ } {\color{orange}{\mathbf{v}_j}}$
Output:
- Output vectors: $\mathbf{y}$ (Shape: $\text{N} \times \text{D}_{out}$)
This forms a new type of neural network layer, where we input a set of vectors and output another set of vectors, effectively allowing the model to attend to different parts of its own input.
In practice, we typically set $\text{D}_{in} = \text{D}_{out}$ to ensure that the input and output vectors have the same dimensions, allowing us to stack multiple layers seamlessly.
Let’s take a look at how this would be implemented in code.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int, qkv_bias: bool = False):
        super().__init__()
        # Linear projections for Query, Key, and Value
        self.W_Q = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_K = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_V = nn.Linear(d_model, d_model, bias=qkv_bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: [B, N, d_model]
        Returns: [B, N, d_model]
        """
        # Compute Q, K, V
        Q = self.W_Q(x)
        K = self.W_K(x)
        V = self.W_V(x)
        # Apply scaled dot-product attention
        attn_scores = (Q @ K.transpose(-2, -1)) * K.shape[-1]**-0.5
        attn_weights = torch.softmax(attn_scores, dim=-1)
        out = attn_weights @ V
        return out
Since the softmax function is invariant to adding a constant offset to every score in a row, a bias in the key projection has no effect on the attention weights. Biases in the query and value projections are likewise commonly omitted in practice (hence qkv_bias=False), following many modern LLM implementations.
Setting dim=-1 in the softmax function instructs it to normalize along the last dimension, which corresponds to the columns (since attn_scores has the shape [batch, row N, column N]), ensuring that the values in each row sum to 1.
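Both facts are easy to verify directly; a quick check:

scores = torch.randn(2, 4)
a1 = torch.softmax(scores, dim=-1)
a2 = torch.softmax(scores + 3.0, dim=-1)   # adding a constant to every score changes nothing
print(torch.allclose(a1, a2))              # True
print(a1.sum(dim=-1))                      # each row sums to 1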
Positional encoding
A key consideration with the self-attention layer is that it is permutation equivariant. This means that if we change the order of the input vectors, we would still compute the same key, value, and query vectors, but they would be permuted in the same way the input vectors were permuted. As a result, the set of output vectors would remain the same, but their order would change.

Self-attention layer is permutation equivariant $f(s(x)) = s(f(x))$.
The self-attention layer does not inherently account for the order of the input vectors; it processes them as a set, irrespective of their sequence. While this property works well for certain tasks, it poses a challenge for tasks like machine translation or text generation, where the order of tokens is crucial.
For example, the sentences “The dog chased the cat” and “The cat chased the dog” would appear identical to the transformer but convey different meanings.
To make the layer position-aware, positional encodings are added to the input vectors. These encodings capture the position of each element in the sequence. A function ${\color{purple}{pos}}: \mathbb{N} \rightarrow \mathbb{R}^\text{D}$ transforms each position $j$ into a unique, $\text{D}$-dimensional positional vector.

Concatenate special positional encoding $\color{purple}{p_j}$ to each input vector $\mathbf{x}_j$
There are two common ways to obtain this positional encoding function:
Learnable Lookup Table:
- A lookup table is learned during training that assigns a unique encoding to each position.
- This approach learns parameters for each position $t \in [0, T)$, where $T$ is the maximum sequence length, leading to a lookup table of size $T \times D$.
position_encoding = nn.Embedding(T, d_model)
Fixed Function:
- A fixed function is designed that outputs a unique, deterministic encoding for each position.
- This approach doesn’t require any learnable parameters.
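The fixed function used in the original Transformer is a sinusoidal encoding, where each dimension corresponds to a sinusoid of a different frequency. A minimal sketch (assuming d_model is even):

import math
import torch

def sinusoidal_positional_encoding(T: int, d_model: int) -> torch.Tensor:
    """Returns a [T, d_model] matrix of fixed positional encodings (d_model assumed even)."""
    position = torch.arange(T, dtype=torch.float32).unsqueeze(1)             # [T, 1]
    freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / d_model))                      # [d_model / 2]
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(position * freqs)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * freqs)   # odd dimensions use cosine
    return pe

Because the function is deterministic and requires no learned parameters, it can also produce encodings for positions longer than any sequence seen during training.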
Masked Self-Attention layer
A variant of the self-attention layer, called masked self-attention or causal attention, is used for tasks like language modeling, where the goal is to predict the next word given the previous words. This is similar to a decoder RNN, where new words in the output sequence are generated one by one, with previously generated words providing context.
With the standard self-attention layer, the model can attend to all input tokens at once, which isn’t ideal for such tasks. To make the model “look” only at previous words while generating the next one, we mask future tokens. This prevents the model from attending to words that come after the current position in the sequence.
To achieve this, we set the attention weights for all future tokens to zero, ensuring that the model can only attend to current and past tokens when computing the context vector. This is particularly useful in decoders, where sequences are generated step-by-step.

A Masked self-attention layer (Causal Attention).
Rather than zeroing out the attention weights of future tokens and renormalizing them, we can assign negative infinity to those positions and apply softmax directly. This ensures they receive zero probability while maintaining a row sum of one—all in a single step.
The mask will have a shape of $\text{N} \times \text{N}$, where $\text{N}$ represents the number of input vectors. We use PyTorch’s tril function to create a mask where values above the diagonal are zero. These positions are then replaced with negative infinity in the attention scores matrix.
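Here is a tiny standalone illustration of that masking step on a 4-token example:

N = 4
scores = torch.randn(N, N)                            # raw attention scores
mask = torch.tril(torch.ones(N, N))                   # 1s on and below the diagonal
masked = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked, dim=-1)               # entries above the diagonal become exactly 0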
Let’s modify our code to add causal attention.
class SelfAttention(nn.Module):
    def __init__(
        self,
        d_model: int,
        context_len: int | None = None,
        dropout: float = 0.0,
        qkv_bias: bool = False,
        causal: bool = True
    ):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_K = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_V = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.attn_dropout = nn.Dropout(dropout)
        # Register lower-triangular causal mask
        if causal and context_len is not None:
            self.register_buffer("mask", torch.tril(torch.ones(context_len, context_len)))
        else:
            self.mask = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: [B, N, d_model]
        Returns: [B, N, d_model]
        """
        B, N, _ = x.shape
        # Compute Q, K, V: [B, N, d_model]
        Q = self.W_Q(x)
        K = self.W_K(x)
        V = self.W_V(x)
        # Apply scaled dot-product attention (with causal mask)
        attn_scores = (Q @ K.transpose(-2, -1)) * K.shape[-1]**-0.5
        if self.mask is not None:
            attn_scores = attn_scores.masked_fill(self.mask[:N, :N] == 0, float('-inf'))
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        out = attn_weights @ V
        return out
Context length refers to the maximum sequence length of the input vectors. It defines the largest possible attention mask, which can then be dynamically sliced and applied to sequences of any shorter length.
We have also used register_buffer here, which ensures that the mask tensor is automatically moved to the appropriate device (CPU or GPU) along with the model. When we run model.to(device), PyTorch only moves registered parameters (such as weights) and buffers; plain tensor attributes are left where they were created.
Since mask is used in our forward computation, defining it as a regular variable would cause a device mismatch error because it would remain on the CPU while the model operates on the GPU. To avoid this issue, we register it as a buffer, ensuring it stays synchronized with the model’s device.
It is also common practice to zero out additional elements in the attention weight matrix by applying a dropout layer, also called attention dropout. As you may recall, dropout is a regularization technique that helps prevent overfitting. Since some attention weights are zeroed out based on the dropout probability, the remaining weights are re-scaled to compensate for the reduction. Note that this dropout layer is disabled during inference.
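PyTorch's dropout layer shows both behaviors directly; a quick illustration:

drop = nn.Dropout(p=0.5)
drop.train()
w = torch.ones(1, 6)
print(drop(w))    # roughly half the entries are zeroed, survivors are scaled by 1 / (1 - p) = 2
drop.eval()
print(drop(w))    # identity at inference time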
Multi-head Self-Attention layer
Instead of performing a single attention function over the entire input vector space, the multi-head self-attention mechanism splits the input into multiple representation subspaces and performs attention in parallel across these subspaces. This allows the model to attend to different aspects of the input simultaneously.
In this case, we divide each input vector into $H$ chunks of equal size and feed them into several parallel attention layers. The input to each head has dimension $\text{d}_{h} = \text{D}/H$.
The outputs from all attention heads are concatenated (dimension $H \cdot \text{d}_h$) and passed through a projection linear layer to produce the final output of dimension $\text{D}_{out}$. While this projection layer is not strictly necessary, it is commonly used in many LLM architectures.

A Multi-head self-attention layer
The above image may seem more intuitive, but it is not optimal for computation. If we first split the input vectors $\mathbf{x}$, we would need to compute the keys, values, and queries independently for each head, increasing computation.
Instead, a more efficient approach is to compute the keys, values, and queries for the entire input vector dimension first, and then split them into $H$ equal chunks. After computing attention independently for each head, we combine the results.
Input:
- Input vectors: $\mathbf{x}$ (Shape: $\text{N} \times \text{D}_{in}$)
- Similarity function: scaled dot product
Operations:
Key vectors: ${\color{blue}{\mathbf{k}}} = \mathbf{x} {\color{blue}{W_\mathbf{k}}}$ (Shape: $\text{N} \times \text{D}_{out}$)
Value vectors: ${\color{orange}{\mathbf{v}}} = \mathbf{x} {\color{orange}{W_\mathbf{v}}}$ (Shape: $\text{N} \times \text{D}_{out}$)
Query vectors: ${\color{green}{\mathbf{q}}} = \mathbf{x} {\color{green}{W_\mathbf{q}}}$ (Shape: $\text{N} \times \text{D}_{out}$)
Split key, value and query vectors. For each head:
- Alignment: $e_{i, j} = (\mathbf{q}_i \cdot \mathbf{k}_j ) / \sqrt{\text{d}_{h}}$
- Attention: $a = \text{softmax} (e) $
- Output vectors: ${\mathbf{y}_i} = \sum_j a_{i,j} \text{ } \mathbf{v}_j$ (Shape: $\text{N} \times \text{d}_{h}$)
Output:
- Output vectors: $\mathbf{y} = \text{Concat} (\text{y}^0, \cdots, \text{y}^{\text{H} - 1}) W_o$ (Shape: $\text{N} \times \text{D}_{out}$)
Now, let’s implement a multi-head attention module that runs several causal attention heads in parallel.
class MultiHeadSelfAttention(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        context_len: int | None = None,
        dropout: float = 0.0,
        qkv_bias: bool = False,
        causal: bool = True,
    ):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.W_QKV = nn.Linear(d_model, 3 * d_model, bias=qkv_bias)
        self.W_O = nn.Linear(d_model, d_model, bias=False)
        self.attn_dropout = nn.Dropout(dropout)
        if causal and context_len is not None:
            self.register_buffer("mask", torch.tril(torch.ones(context_len, context_len)))
        else:
            self.mask = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: [B, N, d_model]
        Returns: [B, N, d_model]
        """
        B, N, _ = x.shape
        # Compute Q, K, V in one go: [B, N, 3 * d_model]
        QKV = self.W_QKV(x)
        Q, K, V = QKV.chunk(3, dim=-1)
        # Split into H heads: [B, N, H, d_h] and then transpose to [B, H, N, d_h]
        Q = Q.view(B, N, self.num_heads, self.head_dim).transpose(-2, -3)
        K = K.view(B, N, self.num_heads, self.head_dim).transpose(-2, -3)
        V = V.view(B, N, self.num_heads, self.head_dim).transpose(-2, -3)
        # Apply scaled dot-product attention (with causal mask) on each head
        attn_scores = (Q @ K.transpose(-2, -1)) * K.shape[-1]**-0.5
        if self.mask is not None:
            attn_scores = attn_scores.masked_fill(self.mask[:N, :N] == 0, float('-inf'))
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        out = attn_weights @ V
        # Concatenate: transpose back to [B, N, H, d_h], then combine heads into [B, N, d_model]
        out = out.transpose(-2, -3).contiguous().view(B, N, -1)
        out = self.W_O(out)
        return out
Instead of using three separate linear layers for Q, K, and V, we project the input once into a tensor three times the output dimension, then split it along the last dimension, reducing matrix multiplications and improving efficiency.
This implementation also computes attention for all heads in parallel through efficient batched matrix multiplications, allowing the model to process multiple attention subspaces simultaneously. With a single attention head, the attention mechanism would average over these subspaces, potentially losing valuable information.
Additionally, since each head operates on a reduced dimension, the total computational cost remains comparable to that of single-head attention with full dimensionality.
Computational cost
This is the most commonly used layer today. If you think about it, it’s just 4 matrix multiplications!
QKV Projection: $\mathbf{x}[\text{N} \times \text{D}] \cdot W_{\mathbf{QKV}}[\text{D} \times 3\text{D}] \rightarrow [\text{N} \times 3\text{D}]$
QK Similarity: $\mathbf{Q}[\text{N} \times \text{D}] \cdot \mathbf{K}^T[\text{D} \times \text{N}] \rightarrow [\text{N} \times \text{N}]$
V-weighting: $\mathbf{A} [\text{N} \times \text{N}] \cdot \mathbf{V} [\text{N} \times \text{D}] \rightarrow [\text{N} \times \text{D}]$
Output Projection: $\text{out} [\text{N} \times \text{D}] \cdot W_\mathbf{O} [\text{D} \times \text{D}] \rightarrow [\text{N} \times \text{D}]$
Now, let’s analyze the computational cost:
Number of learnable parameters: $\underbrace{\text{D} \times 3\text{D}}_{W_{\mathbf{QKV}}} + \underbrace{\text{D} \times \text{D}}_{W_\mathbf{O}} = 4\text{D}^2$
Number of Floating-point operations (FLOPs):
- QKV Projection: $2 \times \text{N} \times \text{D} \times 3\text{D} = 6\text{N}\text{D}^2$
- QK similarity and V-weighting: $2 \times \text{N} \times \text{N} \times \text{D} = 2\text{N}^2\text{D}$ each, so $4\text{N}^2\text{D}$ combined
- Output projection: $2 \times \text{N} \times \text{D} \times \text{D} = 2\text{N}\text{D}^2$
- Total: $8\text{N}\text{D}^2 + 4\text{N}^2\text{D} \sim \text{O}(\text{N}^2)$
It’s important to note that the computational cost grows quadratically with the sequence length $\text{N}$.
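As a quick sanity check, we can count the parameters of the MultiHeadSelfAttention module implemented earlier (with its default bias-free projections) and compare against $4\text{D}^2$:

D = 512
mhsa = MultiHeadSelfAttention(d_model=D, num_heads=8, context_len=1024)
num_params = sum(p.numel() for p in mhsa.parameters())
print(num_params, 4 * D**2)   # both are 1,048,576 (the causal mask is a buffer, not a parameter)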
Transformer
The Transformer [1] was the first sequence model that relies solely on self-attention layers to compute representations of its input and output without using any recurrent units. This innovation marked a turning point for natural language processing (NLP), outperforming all state-of-the-art models of its time. It is often referred to as the “ImageNet moment” for NLP.
Let’s take a closer look at the complete architecture of the Transformer.
Embeddings and Positional Encoding
The input and output words are converted into vectors of dimension $\text{D} = d_{\text{model}} = 512$ by learned embeddings.
Next, positional encodings are added to input embedding vectors to inject information about the position of the tokens in the sequence. A fixed sinusoidal function is used to compute the positional encodings, which are of the same dimension $d_\text{model}$ as the embeddings, so that the two can be summed together.

Encoder block
The encoder is composed of a stack of $L = 6$ identical layers, each consisting of two sub-layers:
Multi-head Self-Attention: Each output from this layer depends on every input, allowing for interactions between all vectors in the input sequence. Since the inputs come from the previous encoder layer, each position in the encoder can attend to all positions in the previous layer.
- Hyperparameters: $H = 8$, $d_{model} = 512$
Feed-forward network: Each input vector passes through an MLP independently, consisting of two linear layers with a ReLU activation in between. This layer internally expands the embedding dimension into a higher-dimensional space (by a factor of 4), which allows for exploration of a richer representation space.
- Linear 1 $(512, 2048)$ -> ReLU -> Linear 2 $(2048, 512)$
Additionally:
Residual Dropout: Each of the two sublayers is followed by a dropout layer. Additionally, dropout is applied to the sum of the embeddings and position encodings. $p = 0.1$.
Residual connection: A residual connection is used around each of the two sub-layers to improve the gradient flow through the model, and overcome the vanishing gradient problem.
Layer Normalization: Each sub-layer is followed by layer normalization to aid optimization (similar to BatchNorm in convolutional layers).

The Encoder of the Transformer.
- Input: A set of vectors $\mathbf{x}$ (Shape: $\text{N} \times 512$)
- Output: A set of vectors $\mathbf{y}$ (Shape: $\text{N} \times 512$)
The interaction between vectors occurs only in the self-attention layer. LayerNorm and MLP operate on each input vector independently. The uniformity in input and output dimensions enables the stacking of multiple layers, thus making the model more scalable.
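Putting the pieces together, a post-norm encoder layer in the style described above might look like the following sketch. It reuses the MultiHeadSelfAttention module from earlier (with causal=False, since the encoder attends to all positions) and is illustrative rather than a faithful reproduction of the original implementation:

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads, causal=False)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the higher-dimensional space (factor of 4)
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # project back to d_model
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, d_model] -> [B, N, d_model]
        x = self.norm1(x + self.drop(self.self_attn(x)))   # self-attention sub-layer + residual + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))         # feed-forward sub-layer + residual + LayerNorm
        return x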
Decoder block
The decoder consists of $L = 6$ identical layers, each with three sub-layers:
Masked Multi-head Self-Attention: To prevent the decoder from attending to future tokens in the output sequence, we use masked self-attention, which ensures that each position in the decoder can only attend to past and current tokens.
- Hyperparameters: $H = 8$, $d_{model} = 512$
Multi-head Cross-Attention over Encoder outputs: This layer allows each position in the decoder to attend to all positions in the input sequence, passing relevant context from the encoder. This mimics the traditional encoder-decoder attention mechanism used in seq2seq models.
- Hyperparameters: $H = 8$, $d_{model} = 512$
Feed-forward network: Same structure as in the encoder:
- Linear 1 (512, 2048) -> ReLU -> Linear 2 (2048, 512)
Similar to the encoder, each sublayer is followed by a dropout layer, with residual connections added around each one. Finally, layer normalization is applied to each vector independently.

The Decoder of the Transformer.
- Input:
- Decoder sequence: A set of vectors $\mathbf{x}$ (Shape: $\text{M} \times 512$)
- Encoder context: A set of context vectors $\mathbf{c}$ (Shape: $\text{N} \times 512$)
- Output: A set of vectors $\mathbf{y}$ (Shape: $\text{M} \times 512$)
The masked self-attention sub-layer ensures autoregressive behavior by restricting attention to past inputs. The multi-head attention over encoder outputs bridges the encoder and decoder, allowing the decoder to focus on relevant parts of the input sequence.
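The cross-attention sub-layer differs from the self-attention modules above only in where its queries, keys, and values come from: queries are computed from the decoder sequence, while keys and values are computed from the encoder output. A minimal sketch under the same head-splitting scheme (the class name and fused key/value projection are our choices):

class CrossAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, qkv_bias: bool = False):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=qkv_bias)        # queries from the decoder
        self.W_KV = nn.Linear(d_model, 2 * d_model, bias=qkv_bias)   # keys/values from the encoder
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: decoder sequence [B, M, d_model]; context: encoder output [B, N, d_model]
        B, M, _ = x.shape
        N = context.shape[1]
        Q = self.W_Q(x).view(B, M, self.num_heads, self.head_dim).transpose(1, 2)      # [B, H, M, d_h]
        K, V = self.W_KV(context).chunk(2, dim=-1)
        K = K.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)                # [B, H, N, d_h]
        V = V.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)                # [B, H, N, d_h]
        attn = torch.softmax((Q @ K.transpose(-2, -1)) * self.head_dim**-0.5, dim=-1)  # [B, H, M, N]
        out = (attn @ V).transpose(1, 2).contiguous().view(B, M, -1)                   # [B, M, d_model]
        return self.W_O(out)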
The decoder block is followed by a linear layer and softmax function to convert the decoder output into predicted next-token probabilities.
During inference, the decoder sequence begins with a <START> token embedding (Shape: $1 \times 512$). As the model predicts each subsequent token, we append it to this sequence.
Key characteristics
Parallel computation: Unlike RNNs, the Transformer processes entire sequences simultaneously, allowing alignment and attention scores for all inputs to be computed in parallel, significantly improving efficiency on large datasets.
Flexibility of inputs: It can effectively handle both unordered sets and ordered sequences (with positional encodings).
Global context: Self-attention enables the model to capture long-range dependencies across the entire sequence.
Scalability: The Transformer’s architecture is highly scalable, with a few key hyperparameters that can be adjusted to meet various requirements:
- Number of Layers $L$: Applies equally to the encoder and decoder.
- Hidden size $d_{\text{model}}$: Defines the dimensionality of the model.
- MLP size $d_{ff}$: Specifies the output size of the first layer in the feed-forward MLP.
- Heads $H$: Determines the number of attention heads in multi-head self-attention (encoder), masked multi-head self-attention (decoder), and multi-head cross-attention (decoder).

The Transformer Architecture.
The Transformer’s innovative architecture, built almost entirely from attention and linear layers, has become the foundation for many advancements in NLP, inspiring models like BERT and GPT and ushering in the new era of Large Language Models (LLMs).
References
- [1] Vaswani et al, “Attention is all you need”, NeurIPS 2017.
- Read more on Pytorch Buffers here: Understanding PyTorch Buffers