In previous discussions, we focused on feedforward neural networks, which take a single image as input, process it through multiple layers of convolution, normalization, and fully connected layers, and output a single label for image classification tasks. This is a one-to-one relationship: a single image maps to a single output label.

However, there are other types of problems we want to solve using deep learning that involve variable-length sequences as both inputs and outputs. Here are some examples:

One-to-many: Consider the task of image captioning, where the input is a single image, and the output is a sentence—a sequence of words describing the content of the image in natural language.

Many-to-one: An example of this is sentiment analysis, where the input is a sentence, and the task is to classify whether it expresses a positive or negative sentiment.

Many-to-many: In this case, we want to produce an output for each input in the sequence. For example, in video classification, we may wish to label each frame of the video.

Sequence-to-sequence (Seq2Seq): In tasks like machine translation, the input is a sentence in one language (e.g., English), and the output is a translation of that sentence in another language (e.g., Spanish). The lengths of the input and output sequences are not necessarily equal.

Inputs are red, outputs are blue, and green boxes represent the RNN’s internal state (more on this soon). [Modified from Andrej Karpathy’s blog]

Conventional neural networks or convolutional networks are designed to handle fixed-size input vectors (such as images) and produce fixed-size output vectors (such as class probabilities). To handle sequences of arbitrary lengths, we use Recurrent Neural Networks (RNNs). RNNs can process sequences of vectors, enabling us to handle sequences in the input, the output, or both.

Recurrent Neural Networks

An RNN processes inputs sequentially, maintaining an internal state called a hidden state $\mathbf{h}$ (a vector) that encodes information about the inputs it has seen so far.

A vanilla RNN for a many-to-many task is shown below:

At each time step $t$, the RNN takes an input $\mathbf{x}_t$ (shown in red) and updates its hidden state $\mathbf{h}_t$ using a recurrence formula:

\begin{align} \mathbf{h}_t &= \text{tanh} (W_{hh} \text{ } \mathbf{h}_{t-1} + W_{xh} \text{ } \mathbf{x}_t + b_h) \\ \mathbf{y}_t &= W_{hy} \text{ } \mathbf{h}_t \end{align}

This updated hidden state is used to produce an output $\mathbf{y}_t$ (shown in blue) at each time step. The tanh nonlinearity is largely a historical choice, carried over from the earlier days of neural network research when this model was developed.

We usually initialize the first hidden state $\mathbf{h}_0$ to a zero vector. Note that the same weight matrices are used at every timestep in the sequence. This weight sharing allows RNNs to handle sequences of arbitrary length by unrolling the computation graph over any number of timesteps, all while using the same set of weights.

This iterative process allows each step to build on the context of all previous inputs, which is what later lets the model generate coherent sequences, such as sentences, one element at a time.
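
To make the recurrence concrete, here is a minimal sketch of an unrolled vanilla RNN in PyTorch; all sizes and weight names (`Wxh`, `Whh`, `Why`, `hidden_size`) are illustrative rather than taken from any particular implementation.

import torch

# Illustrative sizes (not from any specific model)
input_size, hidden_size, output_size = 10, 32, 10

Wxh = 0.01 * torch.randn(hidden_size, input_size)   # input-to-hidden weights
Whh = 0.01 * torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights (shared across timesteps)
Why = 0.01 * torch.randn(output_size, hidden_size)  # hidden-to-output weights
bh = torch.zeros(hidden_size)

def rnn_forward(xs):
    # xs: list of input vectors x_t, one per timestep
    h = torch.zeros(hidden_size)                # h_0 initialized to a zero vector
    ys = []
    for x in xs:
        h = torch.tanh(Whh @ h + Wxh @ x + bh)  # h_t = tanh(Whh h_{t-1} + Wxh x_t + b_h)
        ys.append(Why @ h)                      # y_t = Why h_t
    return ys, h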

Training

Each word must be encoded as a vector before the network can process it, since neural networks cannot operate on raw text. We use a fixed vocabulary and convert each unique word in it into a vector, either a one-hot encoding or the output of some learnable function, before training. This process of converting data into vector form is called embedding.

Then, we use a softmax layer at the end of our network to predict a probability distribution over the vocabulary, selecting the word with the highest probability as the output.

To train our RNN, we apply a loss function (e.g., cross-entropy) at each time step in the sequence. The loss is calculated between the output vector $\mathbf{y}_t$ and the ground truth label to obtain a loss value per time step. By summing these individual losses across all time steps, we obtain the total loss, which we then use for backpropagation through the network.
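
As a hedged sketch of this training setup in PyTorch (the embedding layer, vanilla RNN cell, and vocabulary/hidden sizes below are illustrative assumptions):

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 1000, 64, 128  # illustrative sizes

embed = nn.Embedding(vocab_size, embed_dim)     # maps word indices to embedding vectors
rnn_cell = nn.RNNCell(embed_dim, hidden_size)   # vanilla (tanh) recurrence
to_vocab = nn.Linear(hidden_size, vocab_size)   # projects h_t to scores over the vocabulary
loss_fn = nn.CrossEntropyLoss()                 # softmax + cross-entropy per time step

def sequence_loss(input_ids, target_ids):
    # input_ids, target_ids: (batch, seq_len) integer word indices
    h = torch.zeros(input_ids.size(0), hidden_size)   # h_0 = 0
    total_loss = 0.0
    for t in range(input_ids.size(1)):
        h = rnn_cell(embed(input_ids[:, t]), h)       # update hidden state
        logits = to_vocab(h)                          # y_t
        total_loss = total_loss + loss_fn(logits, target_ids[:, t])
    return total_loss                                 # summed over all time steps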

Computational graph for a many-to-many RNN that produces one output per timestep in our input sequence.

Limitations

During backpropagation, the chain rule is applied across time steps. Because the same weight matrix is used at every time step, the gradient involves repeated multiplication by that matrix, which can cause two issues:

Exploding gradient problem

If the weight matrix has large values (more precisely, a largest singular value greater than 1), repeated multiplication can cause the gradients to grow exponentially, leading to unstable training.

To address this, we use a technique called gradient clipping, where we scale down the gradients if their norm exceeds a certain threshold. This helps control and limit the magnitude of the gradients during training.

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)

Given the gradient (for all parameters) $g$, we compute its L2-norm. If this norm is greater than a maximum value $M$, we scale down $g$ by a factor of $\frac{M}{||g||_2 + \epsilon}$.

import torch

# Gradient clipping: rescale all gradients if their combined L2 norm exceeds max_norm
def gradient_clipping(params, max_norm: float, eps: float = 1e-6):
    grads = [p.grad for p in params if p.grad is not None]
    # Flatten each gradient so they can be treated as one long vector
    norm = torch.norm(torch.cat([g.flatten() for g in grads]))

    if norm > max_norm:
        # Scale down every gradient by M / (||g||_2 + eps)
        scale = max_norm / (norm + eps)
        for g in grads:
            g *= scale

Vanishing gradient problem

Conversely, if the weight matrix has small values (a largest singular value less than 1), repeated multiplication can shrink the gradients exponentially, leading to the vanishing gradient problem.

To mitigate this issue, we use a variant of the RNN called the Long Short-Term Memory (LSTM) network. LSTMs have specialized gating mechanisms and a dedicated cell state that help manage the flow of information and gradients over longer sequences.

Without going into detail, the concepts we’ve discussed for RNNs remain the same, except that the mathematical formulation for updating the hidden state is more complex in LSTMs. You can read more about LSTMs here: Colah’s blog.
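
As a brief, hedged sketch, PyTorch's built-in `nn.LSTM` can serve as a drop-in replacement for the vanilla RNN (sizes below are illustrative); note that an LSTM tracks a cell state alongside the hidden state:

import torch
import torch.nn as nn

embed_dim, hidden_size = 64, 128                  # illustrative sizes
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

x = torch.randn(8, 20, embed_dim)                 # (batch, seq_len, embed_dim)
outputs, (h_n, c_n) = lstm(x)                     # final hidden state h_n and cell state c_n
# outputs: (batch, seq_len, hidden_size), one hidden state per time step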

Extending to other tasks

Let’s explore how RNNs can be adapted for different sequence-based tasks.

One-to-Many: Image Captioning

In the image captioning task, the goal is to generate a descriptive sentence based on a single image. This is done in two main steps:

  1. Feature extraction: First, we feed the input image into a pre-trained convolutional neural network (CNN) to extract a feature vector, capturing important information about the image content.

  2. Generating the sequence: Next, we pass this image feature vector to an RNN, which uses a recurrence formula to generate a sequence of words that describes the image.

An RNN for an image captioning task at test time.

To incorporate the image features in the recurrence formula, we modify it as follows:

\begin{align} \mathbf{h}_t &= \text{tanh} (W_{hh} \text{ } \mathbf{h}_{t-1} + W_{xh} \text{ } \mathbf{x}_t + {\color{purple}{W_{ih}} \text{ } \mathbf{v}} + b_h) \\ \mathbf{y}_t &= W_{hy} \text{ } \mathbf{h}_t \end{align}

where $\mathbf{v}$ is the feature vector of the input image.

Training is similar to what we discussed earlier, but the procedure differs at test time. During testing, we feed the image feature vector along with an initial seed token, <START>, into the RNN. This produces a probability distribution for the first word in the caption.

For instance, if “man” has the highest probability in this distribution, we select it as the first word in the caption. We then feed its embedding vector back into the RNN as the next input to generate the following word.

We repeat this process, generating words sequentially and unrolling the graph to obtain $\mathbf{y}_{t=1:T}$. Sampling stops when we encounter the end token <END>, which marks the end of the sentence.
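
A minimal sketch of this test-time decoding loop is shown below. The helpers `rnn_step` (the modified recurrence with the image term $W_{ih}\mathbf{v}$), `embed`, and `to_vocab`, along with the token ids, are hypothetical stand-ins rather than parts of a specific codebase.

import torch

def greedy_caption(v, h0, rnn_step, embed, to_vocab, start_id, end_id, max_len=20):
    # v: CNN feature vector of the image; h0: initial hidden state
    # rnn_step(x, h, v): one step of the modified recurrence (hypothetical helper)
    # embed(word_id): embedding vector for a word id; to_vocab(h): scores over the vocabulary
    h, word, caption = h0, start_id, []
    for _ in range(max_len):
        h = rnn_step(embed(word), h, v)          # update hidden state
        probs = torch.softmax(to_vocab(h), dim=-1)
        word = int(torch.argmax(probs))          # greedily pick the most likely word
        if word == end_id:                       # stop at the <END> token
            break
        caption.append(word)
    return caption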

Many-to-One: Sentiment Analysis

In a sentiment analysis task, the RNN receives word embeddings for each word in an input sentence. After processing the entire sequence, the RNN produces a single output prediction $\mathbf{y}$ based on the final hidden state. This output might be a binary classification, such as “1” for positive sentiment or “0” for negative sentiment.

An RNN for a sentiment analysis task.

The final hidden state effectively summarizes the information from the entire input sequence, capturing the context the network needs to make a sentiment prediction.
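
A hedged sketch of this many-to-one setup (all sizes are illustrative; only the final hidden state reaches the classifier):

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 1000, 64, 128   # illustrative sizes

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, 2)               # negative (0) vs. positive (1)

def predict_sentiment(input_ids):
    # input_ids: (batch, seq_len) word indices for the input sentences
    _, h_n = rnn(embed(input_ids))                   # h_n: final hidden state, (1, batch, hidden)
    return classifier(h_n.squeeze(0))                # one prediction per sentence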

Sequence-to-Sequence (Many-to-One + One-to-Many): Machine Translation

The seq2seq model was first introduced for machine translation [1], where the goal is to transform an input sequence (source) into a corresponding output sequence (target), with both sequences being of arbitrary lengths. These models are also commonly used in applications like chatbots and personal assistants, where they generate meaningful responses to input queries.

It typically uses a combination of two RNNs in an encoder-decoder style architecture:

  • Encoder RNN (Many-to-One): The encoder takes the input sequence and outputs a fixed-length vector that summarizes the content of the input. This output vector, which is the last hidden state of the encoder, is often called the context vector or thought vector.

  • Decoder RNN (One-to-Many): We then feed this context vector as the initial hidden state into the decoder RNN, which generates the target sequence as its output.

A seq2seq model for translating a sentence from English to Spanish.

It’s important to note that the encoder and decoder RNNs have different weight matrices since they handle different sequences, which may vary in length.

A common practice is to separate the context vector and the initial hidden state of the decoder, as both serve different purposes:

  • The context vector captures the input information from the encoder, which is passed on to the decoder to help generate the output sequence. This is often set to the last hidden state of the encoder, $\mathbf{c} = \mathbf{h}_T$, and is used in the recurrence formula at each time step of the decoder (similar to how image features are used in image captioning).

  • The initial decoder state is used to start the decoding process and is typically derived from a projection or a feed-forward layer applied to the encoder’s final hidden state. This approach allows the initial state to be optimized specifically for the decoder, rather than directly copying the encoder’s final state.
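
A hedged sketch of this encoder-decoder split is shown below. The sizes, the `init_proj` projection layer, and the use of plain `nn.RNN` modules are illustrative assumptions, and for brevity the context vector only initializes the decoder here rather than entering every decoder step.

import torch
import torch.nn as nn

embed_dim, hidden_size = 64, 128                    # illustrative sizes
src_vocab, tgt_vocab = 1000, 1200

src_embed = nn.Embedding(src_vocab, embed_dim)
tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
encoder = nn.RNN(embed_dim, hidden_size, batch_first=True)
decoder = nn.RNN(embed_dim, hidden_size, batch_first=True)
init_proj = nn.Linear(hidden_size, hidden_size)     # encoder final state -> decoder initial state
to_vocab = nn.Linear(hidden_size, tgt_vocab)

def seq2seq_forward(src_ids, tgt_ids):
    # Encoder (many-to-one): summarize the source sentence
    _, h_T = encoder(src_embed(src_ids))            # context vector c = h_T
    s0 = torch.tanh(init_proj(h_T))                 # separate initial decoder state
    # Decoder (one-to-many): generate scores for the target sequence
    dec_out, _ = decoder(tgt_embed(tgt_ids), s0)
    return to_vocab(dec_out)                        # scores over the target vocabulary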

A seq2seq model with separate context vector and initial decoder state.

The context vector passes information from the encoder to the decoder. However, since all the input information is bottlenecked into this single fixed-size vector, it becomes difficult for the model to retain information from longer input sequences. Often, the earlier parts of the input may be “forgotten” by the time the encoder finishes processing. To address this limitation, we use a mechanism called Attention [2].

Attention

Instead of relying solely on the encoder’s last hidden state to build a single context vector, attention enables the model to dynamically create context vectors at each decoding step. We let the decoder weigh all the encoder’s hidden states according to their relevance to the current step in the decoding process.

To focus on the parts of the input sequence that are most relevant to the current output, the decoder follows these steps:

  1. Alignment scores: An alignment function (often an MLP) takes the current hidden state of the decoder and each hidden state of the encoder, producing a score (scalar value) for each encoder state. These scores indicate the relevance of each encoder hidden state to the current decoding step.

    $$ e_{t, i} = f_{\text{att}} (\mathbf{s}_{t-1}, \mathbf{h}_i) $$

  2. Alignment weights: The alignment scores are passed through a softmax function (normalizing over $i$) to produce a probability distribution. These values tell us how much weight to assign to each encoder hidden state when constructing the context vector.

    $$ a_{t, i} = \text{softmax} (e_{t, i}) $$

  3. New context vector: The context vector for the current time step is a weighted sum of the encoder hidden states, using the alignment weights. $$ \mathbf{c}_t = \sum_i a_{t, i} \mathbf{h}_i $$
  4. Update step: The decoder uses this new context vector in the recurrence formula to obtain the next hidden state:

\begin{align} \mathbf{s}_t &= \text{tanh} (W_{ss} \text{ } \mathbf{s}_{t-1} + W_{ys} \text{ } \mathbf{y}_{t-1} + {\color{purple}{W_{cs}} \text{ } \mathbf{c}_t} + b_s) \\ \mathbf{y}_t &= W_{sy} \text{ } \mathbf{s}_t \end{align}

We repeat these four steps for every time step in the decoder’s sequence, with the initial decoder state coming from the encoder’s final hidden state (or through a projection layer).
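
A hedged sketch of a single attention step under these equations; the two-layer MLP used for $f_{\text{att}}$ and the hidden size are assumptions for illustration:

import torch
import torch.nn as nn

hidden_size = 128                                    # illustrative size
f_att = nn.Sequential(                               # alignment MLP: (s_{t-1}, h_i) -> scalar score
    nn.Linear(2 * hidden_size, hidden_size),
    nn.Tanh(),
    nn.Linear(hidden_size, 1),
)

def attention_step(s_prev, encoder_states):
    # s_prev: (hidden,) previous decoder state; encoder_states: (src_len, hidden)
    s_rep = s_prev.expand(encoder_states.size(0), -1)                 # pair s_{t-1} with every h_i
    e = f_att(torch.cat([s_rep, encoder_states], dim=1)).squeeze(1)   # alignment scores e_{t,i}
    a = torch.softmax(e, dim=0)                                       # alignment weights a_{t,i}
    c = (a.unsqueeze(1) * encoder_states).sum(dim=0)                  # c_t = sum_i a_{t,i} h_i
    return c, a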

The intuition here is that as each word in the output sentence is generated, the context vector “attends” to the most relevant part of the input sentence. For example, in an English-to-Spanish translation, when the model is generating the word “estamos” for “we are”, it focuses more on the corresponding English encoder states. Sample attention weights might look like:

\begin{align} a_{11} [\text{we}] = a_{12} [\text{are}] = 0.45, \\ a_{13} [\text{eating}] = a_{14} [\text{bread}] = 0.05 \end{align}

Similarly, for the word “comiendo” (from “eating”), the weights could be:

\begin{align} a_{21} [\text{we}] = a_{24} [\text{bread}] = 0.05, \\ a_{22} [\text{are}] = 0.1, \\ a_{23} [\text{eating}] = 0.8 \end{align}

Since the seq2seq model with attention is trainable end-to-end, the network learns on its own which parts of the input sequence to focus on for each output word. This flexibility allows it to dynamically adjust focus as needed, creating a more accurate and contextually aware output sequence.

We can visualize this intuition through the attention weight matrix from a trained seq2seq model, where each pixel reflects the attention value between corresponding words in the source (e.g., English) and target (e.g., French) sentences. Higher attention weights (white boxes) indicate stronger relevance or “focus” on particular input words when generating specific output words, showing how the model aligns corresponding words across the two languages.

The x-axis and y-axis correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the attention weight value.

Image Captioning with Visual Attention

The attention mechanism enables models to generate sequences by focusing on different parts of the input at each generation step. Importantly, this mechanism doesn’t rely on the input being a sequence; it simply lets the model attend to the most relevant parts of the input, which can be structured in any way. This flexibility makes attention mechanisms applicable not only to sequential data but also to other types of inputs, such as images.

Let’s explore how attention can be applied to the image captioning task [3].

  1. Encoder (feature extraction): We begin by using a CNN to extract a feature map, a grid of feature vectors where each vector corresponds to a specific spatial location in the input image. These feature vectors capture the image’s content in a spatially structured manner.

  2. Decoder (RNN):

    • Initial hidden state: This grid of feature vectors is then fed into an MLP to predict the initial hidden state of the decoder RNN.

    • Attention mechanism: At each step of the generation process, the attention mechanism combines the decoder’s current hidden state with the feature grid to construct a new context vector. This context vector reflects the parts of the image that are most relevant for generating the next word in the caption.
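
The only new ingredient relative to translation is that the decoder attends over spatial locations of the feature grid instead of encoder hidden states; a small, hedged sketch of that reshaping (shapes are illustrative):

import torch

feature_map = torch.randn(512, 7, 7)            # (channels, H, W) from the CNN encoder
spatial_features = feature_map.flatten(1).t()   # (H*W, channels): one feature vector per location
# Each decoding step then computes attention weights over these 49 locations,
# exactly as the translation decoder attended over encoder hidden states.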

An RNN with Attention for an image captioning task at test time.

By using different context vectors at each timestep, the model can “attend” to different parts of the input image as it generates each word. Similar to sequence-to-sequence models, we can visualize the attention weights overlaid on the image to gain insight into which parts of the image the model is focusing on at each step.

Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image.

In the image above, the model attends to the bird when generating “bird flying over,” and shifts focus to the water region for the word “water.” This is similar to how humans visually explore a scene, focusing on different parts depending on what we’re describing.

This interpretability—visualizing attention weights—sets attention mechanisms apart, as they provide insight into how a model makes its decisions, making them more transparent and understandable than other neural network approaches.

References