Linear classifiers and the MLPs we have discussed so far don’t respect the 2D spatial structure of input images: the images are flattened into a 1D vector before being passed through the network, which destroys their spatial structure.

This creates a need for a new computational model that can operate on images while preserving spatial relationships — Convolutional Neural Networks (CNNs). Let’s understand the components of this CNN model.
Convolutional Layer
To respect the 2D structure of the input image, we use a 2D learnable weight matrix of shape $(K_w, K_h)$, called a kernel, which convolves over the input image. That is, the kernel slides spatially across the image channel, computing dot products at each position.
Recall that the input image is a 3D tensor of shape $(C, H, W)$, where $C = 3$ for an RGB image. For the example below, consider a single-channel input of shape $(5,5)$ convolving with a $3 \times 3$ kernel.

The values in this kernel matrix are weights learned during training. The same kernel is applied across all positions of the input channel, and each position computes a dot product, outputting a single number.
The intuition behind convolving kernels with images is rooted in classical computer vision. Kernels help extract feature information from an image, such as edges, textures, and patterns. These are useful for tasks like blurring, sharpening, edge detection, and more.

Instead of manually designing these kernels for feature extraction, we let the model learn them from the training data. This allows the network to automatically discover “good” kernels that extract the most relevant features for correctly classifying the input image.
Different channels of the image encode distinct features, so it’s essential to learn a different kernel matrix for each input channel and combine the results to form a unique feature representation of the image.

Filter
Since we have a unique kernel matrix for each input channel, the learnable matrix is a 3D tensor of weights with shape $(C, K_w, K_h)$, known as a filter. The depth of the filter (i.e., the number of kernels) always matches the number of input channels.
The figure below shows an example of a convolution operation on an RGB image with $3 \times 3$ kernels.


The per-channel convolution outputs are summed across all channels, along with a bias term, to produce a single value. This value forms one entry in the 2D output called an activation map, or feature map.
Similar to a linear classifier where we have: $$ f(\mathbf{x}, \mathbf{w}) = \mathbf{w}_1 \mathbf{x}_1 + \mathbf{w}_2 \mathbf{x}_2 + \mathbf{w}_3 \mathbf{x}_3 + \mathbf{b} $$ In a convolution, we have: $$ A(\mathbf{x}, \mathbf{k}_{1:3}) = \mathbf{k}_1 * \mathbf{x}_{\text{channel 1}} + \mathbf{k}_2 * \mathbf{x}_{\text{channel 2}} + \mathbf{k}_3 * \mathbf{x}_{\text{channel 3}} + \mathbf{b} $$
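As a quick check of this channel-wise sum, here’s a minimal sketch, with random tensors standing in for a real image and learned kernels:
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 5, 5)  # Input: [B, C, H, W]
k = torch.randn(1, 3, 3, 3)  # One filter: one 3x3 kernel per input channel
out = F.conv2d(x, k)         # Shape: [1, 1, 3, 3]

# Same result, computed channel by channel and summed (no bias term here)
manual = sum(F.conv2d(x[:, c:c+1], k[:, c:c+1]) for c in range(3))
print(torch.allclose(out, manual, atol=1e-6))  # True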
Each 3D filter produces one 2D feature map, with each map corresponding to a specific feature or pattern detected in the input. Since we want multiple features to be detected, we use multiple 3D filters. The output is formed by stacking the activation maps computed from each filter, resulting in a 3D tensor, where the depth corresponds to the number of filters used.
In the example below, a $(3, 32, 32)$ RGB image is convolved with six filters of $5 \times 5$ kernels.

To formulate this, let $N$ be the number of images in a mini-batch, $C_{in}$ the number of input channels, and $C_{out}$ the number of output channels (or filters) in the convolution layer:
\begin{align} \underbrace{N \times C_{in} \times H \times W}_{\text{Input size}} + \underbrace{C_{out} \times C_{in} \times K_w \times K_h}_{\text{Filters size}} \Rightarrow \underbrace{N \times C_{out} \times H' \times W'}_{\text{Output size}} \end{align}
Assuming $H = W$ (square image) and $K_w = K_h$ (square kernel), the output size is given by: $$ W' = \frac{W - K_w + 2P}{S} + 1 $$ where $P$ is the padding size and $S$ is the stride. Let’s understand the significance of these terms.
Padding
When we convolve a filter with an image, the spatial size of the output reduces, i.e., $W' < W$. The feature map shrinks, as shown below.

To preserve the spatial size, we pad the input image with zeros around the borders. For instance, $P = 1$ refers to adding a single border of zeros around the input.
Padding also helps retain edge information, as without it, edge pixels contribute less to the output because the filter doesn’t fully cover them.

The two most common padding schemes are:
- Valid padding: No padding is applied, i.e., $P = 0$.
- Same padding: The output size equals the input size, i.e., $W' = W$. For stride $S = 1$ and an odd kernel size, this is achieved with $P = (K_w - 1)/2$.
Stride
Earlier, I mentioned that we convolve our filter at every position of the input channel. However, we can choose to compute the output at every other position by setting the stride $S = 2$.
Stride refers to the number of pixels by which the filter moves across the input image during the convolution operation. When using a stride greater than 1, the filter skips certain pixels, effectively reducing the spatial size of the input.
The image below shows a convolution operation with padding = 1 and stride = 2.

Using larger strides can effectively downsample the input, reducing computational costs and speeding up training. However, it may result in skipping over fine-grained details in the input image.
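To make the output-size formula concrete, here’s a small helper (a sketch; the function name is just illustrative):
def conv_output_size(W: int, K: int, P: int, S: int) -> int:
    """W' = (W - K + 2P) / S + 1 for a square input and kernel."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(W=5, K=3, P=0, S=1))  # 3: valid padding shrinks the map
print(conv_output_size(W=5, K=3, P=1, S=1))  # 5: same padding preserves size
print(conv_output_size(W=5, K=3, P=1, S=2))  # 3: stride 2 downsamples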
Next, let’s examine how the spatial size of feature maps affects computation costs in a convolutional layer.
Computational Cost
Consider a convolutional layer with 10 filters of size $5 \times 5$, stride 1, and pad 2 (same padding), applied to an RGB image of size $(3, 32, 32)$.
- Input: $[C_{in} \times H \times W] = 3 \times 32 \times 32$
- Convolution: $[C_{out} \times C_{in} \times K_w \times K_h]$ = 10 filters of size $3 \times 5 \times 5$
- Output: $[C_{out} \times H’ \times W’] = 10 \times 32 \times 32$
For a batch size of $B=1$, let’s calculate:
1. Number of learnable parameters: $\underbrace{(C_{out} \times C_{in} \times K_w \times K_h)}_{\text{filters}} + \underbrace{C_{out}}_{\text{bias}} = (10 \times 3 \times 5 \times 5) + 10 = 760$.
2. Number of floating-point operations (FLOPs): total multiply-add operations $\underbrace{(C_{out} \times H' \times W')}_{\text{output size}} \times \underbrace{(C_{in} \times K_w \times K_h)}_{\text{filter size}}$
$10 \times 32 \times 32 = 10240$ outputs, each of which is the inner product of the input with a $3 \times 5 \times 5 = 75$ tensor. Total = $75 \times 10240$ = 768K FLOPs.
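We can sanity-check both numbers against PyTorch (a quick sketch):
import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
print(sum(p.numel() for p in conv.parameters()))  # 760 = 10*3*5*5 + 10

# Multiply-add count: one inner product of length C_in*Kw*Kh per output element
flops = (10 * 32 * 32) * (3 * 5 * 5)
print(flops)  # 768000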
Memory calculations help estimate the maximum batch size that can be used without exceeding GPU memory limits, learnable parameters define the model’s capacity to learn from data, and FLOPs influence the computational time during training and inference.
Receptive Fields
Receptive fields refer to the specific regions of the input data that a feature map responds to. In the convolution shown below, each element of the output layer depends on a $3 \times 3$ receptive field in the input.

To cover the entire input of size $(7, 7)$ with $3 \times 3$ kernels, at least 3 stride-1 convolution layers are needed for a single output element to “see” the whole input image.
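For stride-1 convolutions, the receptive field grows by $K - 1$ per layer, i.e., $\text{RF}_L = 1 + L(K - 1)$; a quick check:
# Receptive field after L stacked 3x3, stride-1 convolution layers
for L in range(1, 4):
    print(L, 1 + L * (3 - 1))  # 1 -> 3, 2 -> 5, 3 -> 7 (covers the 7x7 input)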
Stacking multiple convolution layers increases the size of the receptive fields, allowing the network to capture more global information about the input. However, stacked convolutions without non-linearities in between collapse into a single equivalent convolution, so we apply activation functions after each convolution. ReLU is a commonly used activation function in this context.
For large images, many layers may be required to capture global information, which can become computationally expensive. A solution is to downsample the feature map within the network, effectively increasing the receptive fields of subsequent layers while reducing computation. We already saw that using strides can help achieve this. Pooling is another way to accomplish the same goal.
Pooling Layer
Unlike convolutional layers, which apply a filter to extract features, pooling layers summarize information within a localized region of the input. One common pooling method is max pooling, where we take the maximum value from the elements within a defined kernel.
Pooling layers have two hyperparameters: kernel size and stride. For example, consider a max pooling layer with a kernel size of 2 and a stride of 2:

This operation effectively reduces the spatial dimensions of the input by half, while retaining the most prominent features (the strongest responses) and discarding less relevant details.
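Here’s a minimal sketch of this $2 \times 2$, stride-2 max pooling in PyTorch:
import torch

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 7., 6., 8.],
                    [9., 2., 1., 3.],
                    [4., 6., 5., 7.]]]])  # Shape: [1, 1, 4, 4]
pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))  # tensor([[[[7., 8.], [9., 7.]]]]) -- the max of each 2x2 window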
The max operation introduces a degree of translation invariance: whether a key feature, like a cat’s ear, lands in the top-left corner of the pooling window or the bottom-right, the output remains the same since we take the maximum over the window. This property makes the model more robust to small spatial shifts in the input.
Here are some key features of pooling layers:
- Pooling layers do not contain learnable parameters, which simplifies the backpropagation process. During backpropagation, they simply pass gradients back to the locations of the maximum values from the previous layer.
- As max pooling is a non-linear operation, it adds non-linearity to the model without requiring an additional activation function.
Another pooling method, average pooling, calculates the average of the elements within the kernel. This approach tends to smooth out the feature map by considering all values within the window, making it useful for reducing noise. However, max pooling is generally preferred for its ability to highlight the most significant features in the input.
Network Architecture
To maintain the spatial dimension of the input, same padding with stride 1 is frequently used. Below are some common choices for hyperparameters in convolutional layers:
- $C_{in}$, $C_{out}$ = 32, 64, 128, 256 (in powers of 2)
- Kernel = $5 \times 5$, Padding = 2, Stride = 1
- Kernel = $3 \times 3$, Padding = 1, Stride = 1
- Kernel = $3 \times 3$, Padding = 1, Stride = 2 [downsample by 2]
- Kernel = $1 \times 1$, Padding = 0, Stride = 1 [Pointwise convolution]
Using channel dimensions that are powers of 2 allows for more efficient memory allocation on GPUs.
The $1 \times 1$ convolution layer is also called pointwise convolution because the filter operates on each pixel individually across the depth (channels) of the input. This allows changing the number of channels while keeping the spatial dimensions intact, as the depth of the resulting feature map is determined by the number of filters used.
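For example (a minimal sketch), a $1 \times 1$ convolution can shrink 256 channels to 64 without touching the spatial dimensions:
import torch

x = torch.randn(1, 256, 14, 14)                  # [B, C_in, H, W]
pointwise = torch.nn.Conv2d(256, 64, kernel_size=1)
print(pointwise(x).shape)                        # torch.Size([1, 64, 14, 14])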

A classical architecture for a convolutional neural network follows this structure: $$ [\text{Conv, ReLU, Pool}]_{\times N} \rightarrow \text{Flatten} \rightarrow [\text{FC, ReLU}]_{\times M} \rightarrow \text{FC} $$
The input image is first processed through multiple layers of convolution, followed by ReLU activations, and then pooling to downsample the feature maps.
The initial layers of a CNN learn to detect basic patterns such as edges, corners, and simple textures. These low-level features are often universal across different images, making them not specific to any particular object class.
As the network deepens, subsequent layers begin to detect more complex patterns by combining low-level features to recognize higher-level representations, such as shapes, textures, or parts of objects (e.g., eyes, wheels, or fur textures).
After the convolutional and pooling layers, the output is a multi-dimensional tensor representing the learned features. This tensor is then flattened into a 1D vector and fed into fully connected (FC) layers with ReLU activations. These layers combine the learned features to make predictions.
The final fully-connected layer produces the output, typically class scores for each category in a classification task.
![Visualization of features in a fully trained model. The left image shows the kernels learned by the first convolutional layer of AlexNet [1], while the right image displays the features learned in Layers 3-5 of ZFNet [2].](../feature-map.png#center)
Coding
Let’s modify our model from here into a convolutional neural network for classifying handwritten digits in the MNIST dataset.

# Define the model
class DigitClassifier(torch.nn.Module):
    def __init__(self):
        super(DigitClassifier, self).__init__()
        self.conv1 = torch.nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=1)
        # Output: [32 x 28 x 28]
        self.max_pool1 = torch.nn.MaxPool2d(kernel_size=2, stride=2)
        # Output: [32 x 14 x 14]
        self.conv2 = torch.nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        # Output: [64 x 14 x 14]
        self.max_pool2 = torch.nn.MaxPool2d(kernel_size=2, stride=2)
        # Output: [64 x 7 x 7]
        self.fc1 = torch.nn.Linear(7 * 7 * 64, 128)
        self.fc2 = torch.nn.Linear(128, 10)
        self.relu = torch.nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input x: [B, 1, 28, 28]
        x = self.relu(self.conv1(x))
        x = self.max_pool1(x)
        x = self.relu(self.conv2(x))
        x = self.max_pool2(x)
        x = x.view(x.shape[0], -1)  # Flatten
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize the model and move it to the selected device
model = DigitClassifier().to(device)
I used the Adam optimizer with a learning rate of 0.001, achieving a test accuracy of 99%.
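For reference, here is a minimal training-loop sketch; it assumes `train_loader` and `device` are defined as in the earlier post, and the epoch count is an arbitrary choice:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(5):  # epoch count is arbitrary here
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # logits vs. integer labels
        loss.backward()
        optimizer.step()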
To view the computational statistics of our model, we can use torchsummary:
from torchsummary import summary
summary(model, (1, 28, 28))

We can draw the following conclusions from this:
- Most of the memory used to store outputs as floats is consumed by the convolutional layers, which produce large feature maps.
- Almost all learnable parameters are found in the fully connected layers.
While it’s possible to create deeper convolutional neural networks, training them can be quite challenging. Like linear networks, they face overfitting issues, and convergence becomes increasingly difficult as depth increases. A common solution to this problem is Batch Normalization [3].
Batch Normalization
The idea behind batch normalization is to normalize the activations of a layer so that they have zero mean and unit variance. But why is this necessary?
As a neural network trains, the distribution of inputs to its layers can shift due to weight updates. This phenomenon, known as internal covariate shift, can cause the learning algorithm to struggle, effectively chasing a moving target. Batch normalization helps mitigate this shift, stabilizing the training process and improving optimization.
In addition, normalizing the outputs of a layer ensures that the distribution is well-behaved before passing through the non-linear activation function, which helps prevent issues like saturation.
In practice, batch normalization is implemented as a layer that processes inputs before they are passed to the next layer. The normalization can be mathematically described as:
\begin{align} & \hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}} \\ & y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j \end{align}
Here, $\mu_j$ and $\sigma_j$ are the mean and standard deviation calculated across the mini-batch of inputs for each channel.
Since forcing zero mean and unit variance can be too restrictive, we introduce learnable scale ($\gamma$) and shift ($\beta$) parameters. These let the network adjust, or even undo, the normalization if that benefits the task.
A beneficial side effect of this normalization is that it introduces some randomness, since each example’s normalization depends on the other examples in its mini-batch. This acts as a form of regularization and can improve the model’s generalization ability.
Unused Bias
The bias added to $x_{ij}$ in the previous layer is effectively canceled out when the mean of the outputs is subtracted during batch normalization. As a result, the bias in the layer preceding the batch normalization layer becomes redundant and can be omitted.
Training vs Testing
Since $\mu_j$ and $\sigma_j$ are computed from the mini-batch, their values depend on the specific batch. For example, if one test batch contains [cat, dog, frog] and another contains [cat, car, horse], the output for the same cat image will differ because the batch means and standard deviations differ.
To address this, batch normalization behaves differently during training and testing. During testing, we do not compute the mean and standard deviation from the batch; instead, we fix these values and use the running averages collected during training.
The running averages are updated during each training step as:
\begin{align} \mu_{\text{running}} = (1 - m) \cdot \mu_{\text{running}} + m \cdot \mu_j \end{align}
where $m$, called the momentum, is typically set to $0.1$. (We write $m$ here to avoid confusion with the learnable shift $\beta$.)
Momentum is used to smoothly update the running mean and variance, allowing them to gradually accumulate information from multiple batches instead of relying solely on the current one.
Here’s what a simple 1-D BatchNorm looks like in pseudocode:
# Batch Norm 1D
# Assumes: D = number of features, a `training` flag, and a dataloader yielding [B, D] batches
eps = 1e-5
momentum = 0.1
scale = torch.ones(D)          # gamma (learnable)
shift = torch.zeros(D)         # beta (learnable)
running_mean = torch.zeros(D)
running_var = torch.ones(D)

for x in dataloader:
    # Input: x [B, D]
    if training:
        # Compute per-feature mean and variance over the batch
        mean = x.mean(dim=0)                # Shape: [D]
        var = x.var(dim=0, unbiased=False)  # Shape: [D], without Bessel's correction
        # Update running estimates
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Use running averages during inference
        mean = running_mean
        var = running_var
    # Normalize, scale, and shift
    norm_x = (x - mean) / torch.sqrt(var + eps)  # Shape: [B, D]
    out = scale * norm_x + shift                 # Shape: [B, D]
Bessel’s correction adjusts the variance calculation by dividing by $n - 1$ instead of $n$ to produce an unbiased estimate of the population variance from a finite sample. In batch normalization, it’s typically disabled (unbiased=False) since we’re normalizing based on the current batch rather than estimating the true population statistics.
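A two-line check of the difference, using `torch.var`:
import torch

x = torch.tensor([1., 2., 3., 4.])
print(x.var(unbiased=False))  # 1.2500 -- divides by n
print(x.var(unbiased=True))   # 1.6667 -- divides by n - 1 (Bessel's correction)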
During inference, batch normalization becomes a linear operation, allowing it to be easily fused with preceding linear or convolutional layers. Typically, the batch norm layer is inserted after fully connected or convolutional layers and before the activation function.
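As an illustration of this fusion, here’s a sketch that folds a `BatchNorm2d` into a preceding `Conv2d` (assuming default groups and dilation; `fuse_conv_bn` is just an illustrative name):
import torch

@torch.no_grad()
def fuse_conv_bn(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    # y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    #   = (gamma / std) * conv(x) + (beta - gamma * mean / std)
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                            stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / std, per channel
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

# Sanity check in eval mode (BN uses running statistics)
conv = torch.nn.Conv2d(3, 10, kernel_size=5, padding=2, bias=False)
bn = torch.nn.BatchNorm2d(10).eval()
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True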
Benefits of Batch Normalization
In summary, batch normalization offers several advantages:
- Makes deep networks much easier to train by mitigating internal covariate shift.
- Allows for higher learning rates, which can lead to faster convergence during training.
- Normalization process makes networks more robust to weight initialization.
- Acts as a form of regularization during training.
- Zero computational overhead at test time (it uses fixed statistics, so it can be fused with the previous layer).
Variants
While batch normalization is powerful, it behaves differently during training and testing. To address this, several variants have been developed.
Layer Normalization
Instead of averaging over the batch dimension, layer normalization averages over the feature dimensions, producing a per-example mean and standard deviation. This makes it independent of the batch size, and it behaves the same during both training and testing. Layer normalization is commonly used in RNNs and Transformers, where batch sizes can vary significantly.
Instance Normalization
Here, the normalization is done over the spatial dimensions of each channel of each image, resulting in per-image, per-channel statistics. It also behaves consistently during training and testing, and is often used in style transfer and image generation tasks.
Group Normalization
Instead of normalizing across the entire channel dimension like layer normalization, group normalization splits the channels into groups and normalizes within each group. This method is particularly effective for tasks like object detection, where small batch sizes make batch statistics unreliable.
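All four are available as PyTorch modules; a quick shape-level sketch (note that `GroupNorm` with a single group normalizes like layer normalization over $(C, H, W)$):
import torch

x = torch.randn(8, 64, 14, 14)  # [B, C, H, W]
print(torch.nn.BatchNorm2d(64)(x).shape)                           # per-channel stats over (B, H, W)
print(torch.nn.GroupNorm(num_groups=1, num_channels=64)(x).shape)  # LayerNorm-style: per-example over (C, H, W)
print(torch.nn.InstanceNorm2d(64)(x).shape)                        # per-example, per-channel over (H, W)
print(torch.nn.GroupNorm(num_groups=8, num_channels=64)(x).shape)  # per-example, per-group stats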
The figure below provides an intuitive understanding of the four types of normalizations.

Coding
Let’s add Batch Normalization to our model. Since Batch Normalization behaves differently during training and testing, we must ensure to use model.train() and model.eval() to switch between these modes. This guarantees that we compute the mean and variance in training mode and use the running averages in evaluation mode.
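For instance (a sketch; `test_loader` is assumed to be defined):
model.eval()                           # BatchNorm uses running stats; Dropout is disabled
with torch.no_grad():
    for images, labels in test_loader:
        logits = model(images.to(device))
model.train()                          # back to batch statistics for further training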
Additionally, since most of the learnable parameters are in the fully connected layers, there’s a higher risk of overfitting. To address this, I’ll include a dropout layer for regularization.
# Define the model
class DigitClassifier(torch.nn.Module):
    def __init__(self):
        super(DigitClassifier, self).__init__()
        # First convolutional block: Conv -> BatchNorm -> ReLU -> MaxPool
        self.conv1 = torch.nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = torch.nn.BatchNorm2d(32)
        self.max_pool1 = torch.nn.MaxPool2d(kernel_size=2, stride=2)
        # Second convolutional block: Conv -> BatchNorm -> ReLU -> MaxPool
        self.conv2 = torch.nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = torch.nn.BatchNorm2d(64)
        self.max_pool2 = torch.nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected layers: FC -> ReLU -> Dropout -> FC
        self.fc1 = torch.nn.Linear(7 * 7 * 64, 128)
        self.dropout = torch.nn.Dropout(p=0.5)
        self.fc2 = torch.nn.Linear(128, 10)
        self.relu = torch.nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.max_pool1(x)
        x = self.relu(self.bn2(self.conv2(x)))
        x = self.max_pool2(x)
        x = x.view(x.shape[0], -1)  # Flatten
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
With this implementation, I achieved a test accuracy of 99.3%, our best model so far.
It can be challenging to make numerous decisions regarding the architecture of a CNN model and its various hyperparameters in pursuit of the best possible accuracy. Therefore, it’s always beneficial to review state-of-the-art CNN models that have been successful in the past for inspiration. Let’s explore that in our next post.
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NeurIPS 2012.
[2] Matthew D. Zeiler and Rob Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014.
[3] Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015.