The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition that took place from 2010 to 2017, attracting teams from around the world to showcase their best-performing image classification models. This challenge became a crucial benchmark in the field, with its winners significantly influencing the landscape of image recognition and deep learning research.
The competition used a subset of the ImageNet dataset, containing 1.3M training examples across 1000 different classes, with 50k validation and 100k test examples. The images were resized to $256 \times 256$ pixels to standardize input, as they were originally downloaded from the web in various sizes.
Evaluation was based on two key performance indicators: top-1 and top-5 error rates. The top-5 error rate is the fraction of test images for which the correct label is not among the model’s five most likely predictions. Teams with the lowest top-5 error rate emerged as the winners of this challenge.
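As a quick illustration of the metric, here is a minimal sketch (assuming PyTorch-style `logits` and integer `labels` tensors) of how the top-5 error rate can be computed:

```python
import torch

def top5_error(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of examples whose true label is NOT among the 5 highest-scoring classes.

    logits: (N, 1000) class scores, labels: (N,) integer class indices.
    """
    top5 = logits.topk(k=5, dim=1).indices              # (N, 5) predicted class indices
    correct = (top5 == labels.unsqueeze(1)).any(dim=1)  # (N,) True if the label is in the top 5
    return 1.0 - correct.float().mean().item()

# Example with random scores: the expected top-5 error is roughly 1 - 5/1000.
logits = torch.randn(64, 1000)
labels = torch.randint(0, 1000, (64,))
print(f"top-5 error: {top5_error(logits, labels):.3f}")
```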

Winners of the challenge each year vs. Top-5 error rate
Early years
For the first two years, 2010 and 2011, the winning models were not based on neural networks at all. Instead, they relied on multiple layers of hand-designed feature extractors with linear models on top. At the time, training convolutional neural networks (CNNs) on large-scale, high-resolution images was prohibitively expensive due to the computational limitations of available hardware.
The breakthrough came in 2012 with the use of GPUs, which enabled highly optimized implementations of 2D convolutions. This, combined with the large-scale ImageNet dataset providing sufficient labeled examples, made it feasible to train deep CNN models without severe overfitting.
Several factors contributed to the huge leap in 2012:
- Data: The availability of the large-scale, well-labeled ImageNet dataset.
- Computation: Advances in GPUs enabled efficient training of deep networks.
- Algorithm: The introduction of AlexNet, a deep convolutional network architecture that took full advantage of the above advances, leading to a dramatic improvement in performance.
AlexNet (2012)
AlexNet [1] was the largest convolutional neural network trained at that time and achieved the best results reported on the ImageNet dataset, outperforming all other competitors by a large margin. This breakthrough made CNNs a mainstream topic in the field of computer vision, and AlexNet became one of the most influential works in the field.
Architecture
The model has 8 layers: 5 convolutional layers and 3 fully-connected layers.

Activation function: While Sigmoid and Tanh were common at the time, AlexNet was one of the first CNN architectures to use the ReLU activation. The authors showed empirically that CNNs with ReLUs train several times faster than their equivalents using Tanh units.
Normalization: The authors used “Local Response Normalization” (LRN) after the 1st and 2nd convolutional layers, which aided generalization, reducing the top-5 error rate by 1.2%. Although this technique has largely fallen out of use, it was an early precursor to batch normalization.
Pooling layer: The network includes max-pooling layers with a $3 \times 3$ kernel and a stride of 2, referred to as “overlapping pooling” because adjacent pooling windows overlap. The authors observed during training that this approach makes it slightly more difficult for the model to overfit, resulting in a 0.3% lower top-5 error rate compared to a non-overlapping scheme (kernel size of 2 and stride of 2).
Dropout: The network’s size (60 million parameters) made overfitting a significant problem, even with such a large dataset. To address this, dropout with a probability of $0.5$ was added to the first two fully-connected layers, where the majority of parameters are located. Dropout roughly doubles the number of iterations required for convergence.
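Putting these pieces together, here is a hedged PyTorch sketch of the layer stack (kernel sizes and channel counts follow the paper; treat it as an illustration rather than a reimplementation of the original two-GPU, grouped-convolution model):

```python
import torch.nn as nn

# Sketch of the 8-layer AlexNet: 5 convolutional layers + 3 fully-connected layers.
alexnet = nn.Sequential(
    # Conv 1: 11x11, stride 4, followed by ReLU, LRN, and overlapping max-pooling.
    # padding=2 is added so a 224x224 input yields the final 6x6x256 feature map
    # (the paper's arithmetic implies 227x227 inputs).
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Conv 2: 5x5 with padding 2, again with LRN and overlapping pooling.
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Conv 3-5: 3x3 convolutions, pooling only after the last one.
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Classifier: dropout on the first two fully-connected layers.
    nn.Flatten(),
    nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)
```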

Model summary of AlexNet
Data Manipulation
Preprocessing: The mean image of the training set (per-pixel mean) is subtracted from each pixel.
Augmentation: Two data augmentation techniques are used to further reduce overfitting. These augmentations are performed on-the-fly on the CPU while the GPUs train on the previous batch, making them computationally “free”.
Image translations and horizontal reflection:
Training: Random $224 \times 224$ patches (and their horizontal reflections) are extracted from $256 \times 256$ images for training.
- Number of transformations: $(256 - 224) * (256 - 224) = 1024$
- Horizontal reflections: $1024 * 2 = 2048$
Testing: The network makes a prediction by extracting five $224 \times 224$ patches (four corner patches and the center patch) as well as their horizontal reflections, averaging predictions across all ten patches.
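Test-time averaging over the ten crops can be sketched with torchvision’s `TenCrop` transform (a simplified single-image version; names such as `model` are placeholders):

```python
import torch
from torchvision import transforms
from PIL import Image

# Ten-crop evaluation: 4 corner crops + the center crop, plus their horizontal reflections.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.TenCrop(224),  # returns a tuple of 10 PIL crops
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_ten_crop(model, image: Image.Image) -> torch.Tensor:
    crops = ten_crop(image)                    # (10, 3, 224, 224)
    with torch.no_grad():
        scores = model(crops).softmax(dim=1)   # (10, 1000) class probabilities
    return scores.mean(dim=0)                  # average the predictions over all 10 crops
```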
PCA color augmentation (also called Fancy PCA): This technique alters the intensities of the RGB channels in the training images. It captures an important property of images—that object identity remains invariant to changes in intensity and illumination color.
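Concretely, the paper adds multiples of the principal components of RGB pixel values, scaled by the corresponding eigenvalues times a Gaussian random variable with standard deviation 0.1. A minimal NumPy sketch of the idea (computing the covariance from a single image for brevity, whereas the paper computes it once over the whole training set; `image` is assumed to be a float array in [0, 1]):

```python
import numpy as np

def fancy_pca(image: np.ndarray, std: float = 0.1) -> np.ndarray:
    """PCA color augmentation applied to a single (H, W, 3) float image."""
    pixels = image.reshape(-1, 3)
    # Eigendecomposition of the 3x3 covariance matrix of RGB values.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Add p_i * alpha_i * lambda_i to every pixel, with alpha_i ~ N(0, std^2).
    alphas = np.random.normal(0.0, std, size=3)
    shift = eigvecs @ (alphas * eigvals)   # (3,) RGB offset, broadcast over all pixels
    return np.clip(image + shift, 0.0, 1.0)
```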
Training
Key hyperparameters used for training AlexNet:
Initialization:
- Weights: Initialized from a zero-mean Gaussian distribution with a standard deviation of $0.01$ for each layer.
- Bias: Initialized to $1$ for the 2nd, 4th, and 5th convolutional layers, as well as in the fully connected layers. This choice accelerates early learning by providing ReLUs with positive inputs. All other layers have their biases initialized to $0$.
Loss Function: Cross-entropy loss.
Optimizer: Stochastic gradient descent (SGD)
- Momentum $m = 0.9$
- L2 weight decay $\lambda = 5 \times 10^{-4}$
- Batch size: 128
Learning rate: Initially set to $0.01$ and reduced three times prior to termination by dividing by 10 when the validation error rate stopped improving with the current learning rate.
Number of epochs: 90
Training time: 5 to 6 days on two GTX 580 3GB GPUs.
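The initialization and optimizer settings listed above translate roughly to the following PyTorch sketch (assuming the `alexnet` module from the earlier sketch; the learning-rate schedule is approximated with `ReduceLROnPlateau`):

```python
import torch
import torch.nn as nn

def init_alexnet(model: nn.Sequential) -> None:
    """Gaussian weight init (std 0.01); bias 1 in the 2nd/4th/5th conv and the FC layers."""
    conv_idx = 0
    for m in model:
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.01)
            if isinstance(m, nn.Conv2d):
                conv_idx += 1
                bias = 1.0 if conv_idx in (2, 4, 5) else 0.0  # 2nd, 4th, 5th conv layers
            else:
                bias = 1.0                                     # fully-connected layers
            nn.init.constant_(m.bias, bias)

init_alexnet(alexnet)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(alexnet.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when the validation error stops improving;
# call scheduler.step(val_error) once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
```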

The image above illustrates a typical AlexNet architecture. The model was spread across two GTX 580 3GB GPUs, as it was too large to fit into the memory of a single GPU. The GPUs communicated only in certain layers to optimize computation. However, with modern hardware, this complexity is unnecessary; today, the entire model can be trained on a single GPU (Google Colab, for instance, offers 12GB/16GB GPUs).
Submission: The CNN architecture described above achieved a top-5 error rate of 18.2%. The authors submitted an ensemble of 5 similar CNNs that yielded an error rate of 16.4%, winning the 2012 challenge.
ZFNet (2013)
With AlexNet stealing the show in 2012, there was a significant increase in the number of CNN models submitted to ILSVRC 2013. The winner was ZFNet [2], an improved version of AlexNet that tweaked some layer configurations to achieve better performance.
- First layer adjustment: AlexNet used a large filter size of $11 \times 11$ with a stride of 4 in the first layer. While this aggressive downsampling reduced computational cost, it also resulted in the loss of relevant pixel information. To address this, ZFNet used a $7 \times 7$ filter with a stride of 2 in the first layer (see the code sketch below).
To justify these changes, the authors proposed a “deconvnet”, a technique to project feature maps from each convolutional layer back to the input pixel space. This allows us to visualize the different types of features learned by each layer, providing insights into the inner workings of CNNs:
Visualization: Initial layers learn to detect general patterns such as corners, edges, and textures, while deeper layers capture class-specific details like dog faces, bird legs, and other object parts.
Evolution during training: Lower layers of the model converged within a few epochs, while the upper layers only developed after a considerable number of epochs, demonstrating the need to let the models train until fully converged.
Invariance: Small transformations strongly affect the first layer, while higher layers show greater stability, with minimal impact from translations and scalings. The network output remains stable under these transformations.
Occlusion: When an object is occluded, the probability of the correct class drops significantly, indicating that the model relies heavily on local structure within the image rather than broad scene context.
- Model size adjustment: The authors also conducted an ablation study that revealed performance gains from increasing the size of the middle convolutional layers. Consequently, they modified layers 3, 4, and 5 to have 512, 1024, and 512 output channels, respectively.
An ensemble of 6 CNNs—five with the modified first layer and one incorporating both modifications—achieved the lowest error rate.
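Both adjustments amount to small changes in layer hyperparameters. A hedged sketch of the affected layers (channel counts for the unchanged layers follow AlexNet, and the padding values are illustrative, not the authors’ code):

```python
import torch.nn as nn

# AlexNet's first layer: large 11x11 filters with an aggressive stride of 4.
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)

# ZFNet's first layer: smaller 7x7 filters with stride 2 retain more pixel detail.
zfnet_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)

# Enlarged middle layers: 512, 1024, and 512 output channels for conv 3, 4, and 5.
zfnet_conv3 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
zfnet_conv4 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
zfnet_conv5 = nn.Conv2d(1024, 512, kernel_size=3, padding=1)
```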

ImageNet 2012 classification error rates
AlexNet and ZFNet were designed in a somewhat ad-hoc manner, with an arbitrary number of convolution and pooling layers, and the configurations of each layer set by trial and error. This makes scaling them quite challenging.
VGGNet (2014)
In 2014, the runner-up of the ImageNet challenge was VGGNet [3], developed by the Visual Geometry Group at Oxford. This architecture was one of the first to have a principled design that guided the overall configuration of the network, enabling the creation of deeper networks and achieving significant improvements over previous configurations.
Architecture
Let’s take a look at a side-by-side comparison of AlexNet, ZFNet, VGG-16, and VGG-19 architectures.

VGGNet features clean and simple design principles. The configuration for each stage is fixed as follows:
Convolutional layers: Kernel size of $3 \times 3$, with a stride of 1 and padding of 1 (same padding).
- Why? This is the smallest kernel capable of capturing directionality (left/right, up/down, center). This design choice ensures that the network remains compact while allowing for greater depth.
Max-pooling layers: Kernel size of $2 \times 2$, with a stride of 2.
Channels: Starting from 64, the number of channels doubles after each pooling layer, reaching a maximum of 512.
- When the pooling layer downsamples the feature map by half, we double the number of channels to preserve the overall volume, thereby maintaining consistent time complexity (FLOPs) across each layer.
Activation: ReLU non-linearities are used throughout the network.
Normalization: No normalization is applied, as it does not improve performance and instead increases memory consumption and computation time.
Dropout: Added to the first two fully connected layers, with a dropout ratio of $0.5$.
While AlexNet has 5 convolutional layers, VGGNet comprises 5 stages:
Stage 1: conv-conv-pool
Stage 2: conv-conv-pool
Stage 3: conv-conv-conv-[conv]-pool
Stage 4: conv-conv-conv-[conv]-pool
Stage 5: conv-conv-conv-[conv]-pool
Two VGGNet variants were presented: VGG-16 and VGG-19, with 16 and 19 layers, respectively. VGG-19 includes an additional convolutional layer in stages 3, 4, and 5.
The convolutional layers are stacked to increase the receptive field. For instance, a stack of two $3 \times 3$ layers achieves an effective receptive field of $5 \times 5$, and three layers result in $7 \times 7$. What do we gain by using three $3 \times 3$ layers instead of a single $7 \times 7$ layer?
More activations: Three non-linearities instead of one, making the decision function more discriminative.
Fewer parameters: Assuming the input and output have $C$ channels:
- Three $3 \times 3$ layers: $3(3 * 3 * C^2) = 27C^2$ params.
- A single $7 \times 7$ layer: $7 * 7 * C^2 = 49C^2$ params.
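The parameter comparison is easy to verify in code. Here is a small sketch that builds a stack of three $3 \times 3$ convolutions (one VGG-style stage without the pooling layer) and compares its weight count with a single $7 \times 7$ layer, assuming $C = 256$ channels:

```python
import torch.nn as nn

C = 256

# Three 3x3 convolutions with ReLU in between: same 7x7 receptive field, more non-linearity.
stacked = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

# A single 7x7 convolution covering the same receptive field in one step.
single = nn.Conv2d(C, C, kernel_size=7, padding=3)

def num_weights(m: nn.Module) -> int:
    # Count only the convolution weights (ignore biases) to match the 27C^2 vs 49C^2 figures.
    return sum(p.numel() for name, p in m.named_parameters() if name.endswith("weight"))

print(num_weights(stacked), 27 * C * C)   # 1769472 1769472
print(num_weights(single), 49 * C * C)    # 3211264 3211264
```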

Data Manipulation
Preprocessing: The mean RGB value from the training set (per-channel mean) is subtracted from each pixel.
Augmentation: Training images are first rescaled to a training scale, $S$. These rescaled images are then randomly cropped to $224 \times 224$ and undergo random horizontal flipping and random RGB color shifting (similar to AlexNet).
Single training scale:
- Training: Models are trained at two fixed scales, $S = 256$ or $S = 384$.
- Testing: The output is averaged over three test image versions rescaled at $ \{ S - 32, S, S + 32 \} $.
Multi-training scale:
- Training: Each image is rescaled individually by randomly sampling $S$ from the range $[S_{min} =256, S_{max} = 512]$. This approach, called scale jittering, enables the model to recognize objects at different scales.
- Testing: The output is averaged over three test image versions rescaled at $ \{ S_{min}, S_{avg}, S_{max} \} = \{256, 384, 512 \} $.
This provides three trained versions of the same network: two single-scale models trained at fixed scales and one model trained using multiple scales.
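A rough sketch of the multi-scale (scale-jittering) training pipeline using torchvision (sampling $S$ per image inside a custom transform is an assumption about how one might implement it, not the authors’ code):

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class ScaleJitter:
    """Rescale each image so its smaller side is a random S in [s_min, s_max]."""
    def __init__(self, s_min: int = 256, s_max: int = 512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)  # sample a training scale per image
        return TF.resize(img, s)                    # isotropic rescale: smaller side -> S

train_transform = transforms.Compose([
    ScaleJitter(256, 512),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```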
Training
Training hyperparameters and choices are similar to AlexNet.
Initialization:
- Pre-training: A shallow network is first trained with random initialization (weights from a zero-mean Gaussian distribution with $0.01$ variance and biases set to $0$), and then re-trained with additional convolutional layers.
- Xavier initialization: After submission, the authors discovered that Xavier initialization enables training without the need for pre-training.
Loss Function: Cross-entropy loss.
Optimizer: Stochastic gradient descent (SGD)
- Momentum $m = 0.9$
- L2 weight decay $\lambda = 5 \times 10^{-4}$
- Batch size: 256
Learning rate: Initialized at $0.01$, and reduced three times by dividing by 10 when the validation accuracy stopped improving.
Number of epochs: 74
Training time: 2-3 weeks on four Titan Black 6GB GPUs.
Despite having more parameters and greater depth than AlexNet, VGGNet required fewer epochs to converge due to (a) implicit regularization imposed by greater depth and smaller convolutional filter sizes, and (b) pre-initialization of certain layers.
Submission: The authors submitted an ensemble of 7 networks—six single-scale models and one multi-scale model—resulting in a top-5 error rate of 7.3%.

ResNet (2015)
By 2015, it had become clear that increasing a network’s depth significantly improved its performance. However, deep networks typically face two major challenges:
Overfitting: Regularization techniques like Batch Normalization (BatchNorm) help mitigate overfitting and enable higher learning rates.
Vanishing/Exploding gradients: Non-saturating activations like ReLU, combined with Kaiming Initialization to preserve signal variance, help address this. BatchNorm further ensures that forward-propagated signals maintain stable, non-zero variances.
The development of BatchNorm and Kaiming Initialization in 2015 set the stage for experimenting with deeper models. However, a “degradation problem” came to light: as network depth increased, accuracy would initially saturate (which might be expected) and then degrade rapidly.

Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer networks. The deeper network has higher training and test error.
This degradation is not caused by overfitting, as deeper networks exhibit higher training error than their shallower counterparts, as shown in the figure above. The authors hypothesize that this is an optimization problem: deeper models are harder to optimize, leading to underfitting.
Architecture
Intuitively, we expect a deeper model to perform at least as well as a shallower model, since it could theoretically emulate the shallower network by copying its layers and setting the extra layers to identity mappings. The fact that deeper models performed worse suggests that the solvers struggle to approximate identity mappings with multiple non-linear layers.
Residual Block
To address this, ResNet [4] introduced a new network design that simplifies learning identity mappings. If we let the stacked non-linear layers fit a mapping $F(x)$ for an input $x$, we can add a skip connection to recast the original mapping as $F(x) + x$.

In the extreme case, if an identity mapping were optimal, the solver would push the residual to zero, i.e., drive the weights of the non-linear layers toward zero to approximate the identity mapping. While it is unlikely that identity mappings are optimal in practice, this reformulation helps precondition the problem, making it easier for the solver to find perturbations relative to an identity mapping rather than learning a new function from scratch.
These shortcut connections neither add extra parameters nor increase computational complexity, and they can be trained end-to-end using SGD with backpropagation without modification.
You might wonder why the skip connection spans two layers. The authors found no advantage in using a single-layer skip, while two-layer residual blocks trained more stably and performed better.
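A two-layer residual block, the basic building unit described above, can be sketched as follows (the BatchNorm placement and the optional projection shortcut follow the paper’s description; the code is illustrative):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus a skip connection: output = ReLU(F(x) + x)."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (1x1 conv) when crossing a stage boundary, so that the
        # channel and spatial dimensions match; identity shortcut otherwise.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))   # first conv-BN-ReLU
        out = self.bn2(self.conv2(out))            # second conv-BN (no ReLU yet)
        out = out + self.shortcut(x)               # element-wise addition with the skip path
        return self.relu(out)                      # final ReLU after the addition
```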
A residual network, or ResNet, is formed by stacking multiple residual blocks. The image below shows a comparison of VGG-19, 34-layer plain, and residual networks.

Similar to VGGNet, ResNet is divided into four stages, each with a different number of residual blocks. The architecture includes:
Aggressive stem: The input is aggressively downsampled with a $7 \times 7$ convolution with stride 2 (similar to ZFNet), followed by a $3 \times 3$ max-pooling layer with stride 2, before applying residual blocks.
Residual block: Each residual block contains two $3 \times 3$ convolutional layers.
Element-wise addition is performed on two feature maps, channel by channel.
Skip connection across stage boundaries (shown by dotted lines): A $1 \times 1$ convolution is applied to the input to match the channel dimension, using a stride of 2 to match the spatial dimension. The output is given by $$ H(x) = F(x) + W_s x $$ where $W_s$ is the projection shortcut, used solely for matching dimensions; all other shortcuts are identity.
Channels: Starting with 64 channels, the number is doubled after each stage, up to 512.
Strided convolution: Instead of using pooling layers to halve the feature map after each stage, a stride of 2 is used in the first convolution of the next stage.
Normalization: Each convolution is followed by a BatchNorm layer and a ReLU activation function.
Global average pooling: Instead of using fully connected layers, the network ends with a global average pooling layer followed by a single linear layer with softmax to generate class scores.
- Why? Fully connected layers have a large number of parameters, increasing memory usage.
- Average pooling is applied to the last convolutional layer ($512 \times 7 \times 7$) using a $7 \times 7$ kernel to cover the entire spatial structure.
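A sketch of this global-average-pooling head for ResNet-18/34, to show how small it is compared to a VGG-style fully-connected head (the parameter counts in the comments are my own estimates):

```python
import torch.nn as nn

# Pool the final 512 x 7 x 7 feature map down to 512 values,
# then a single linear layer produces the 1000 class scores.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # equivalent to 7x7 average pooling on a 7x7 feature map
    nn.Flatten(),
    nn.Linear(512, 1000),
)
# Roughly 0.5M parameters, versus roughly 124M for VGG-16's three fully-connected layers.
```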
The authors presented two variants, ResNet-18 and ResNet-34, both of which have lower complexity than VGG-19 (19.6 GFLOPs).

As the 34-layer network performed better than the 18-layer network, it was clear that adding more layers could yield better performance. However, this also increases computational costs.
Bottleneck Residual Block
To reduce computation, a bottleneck block is introduced with three convolutions: $1 \times 1$, $3 \times 3$ and $1 \times 1$. The pointwise convolutions are used to reduce and then restore dimensions.

Comparing FLOPs, computed per convolution as $(C_{out} \times H' \times W') * (C_{in} \times K_h \times K_w)$:
Basic residual block: $2 * [(64 \times 56 \times 56) * (64 \times 3 \times 3)]$ = $0.23$ GFLOPs.
Bottleneck residual block: $2 * [(64 \times 56 \times 56) * (256 \times 1 \times 1)]$ + $[(64 \times 56 \times 56) * (64 \times 3 \times 3)]$ = $0.22$ GFLOPs.
The bottleneck design introduces extra layers and non-linearities with a slight reduction in computational cost, enabling ResNet to add more layers without significantly increasing overall complexity.
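A sketch of the bottleneck block (the $1 \times 1$ reduce / $3 \times 3$ / $1 \times 1$ restore pattern described above; the 4x expansion factor matches the 64-to-256 channel layout used in the FLOP comparison):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore), with a skip connection around all three."""
    expansion = 4  # e.g. 256 -> 64 -> 64 -> 256 channels

    def __init__(self, in_channels: int, mid_channels: int, stride: int = 1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.reduce = nn.Sequential(                          # 1x1: shrink the channel count
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
        )
        self.conv = nn.Sequential(                            # 3x3: the "expensive" convolution
            nn.Conv2d(mid_channels, mid_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
        )
        self.restore = nn.Sequential(                         # 1x1: restore the channel count
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:        # projection shortcut if needed
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.restore(self.conv(self.reduce(x)))
        return self.relu(out + self.shortcut(x))
```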
Replacing all basic blocks in ResNet-34 with bottleneck blocks results in the ResNet-50 architecture, a widely-used baseline. The authors further expanded ResNet with 101 and 152-layer variants using different numbers of bottleneck blocks.

Architecture of ResNet variants. Downsampling is performed by the first convolution of stages 2, 3, and 4 with a stride of 2. Error rates shown are for single-crop testing, as reported by torchvision.
Data Manipulation
Preprocessing: A per-pixel mean is subtracted, as in AlexNet.
Augmentation: Similar to VGGNet, multi-scale training is applied.
Training: Images are rescaled with the shorter side randomly sampled in $[256, 480]$ for scale augmentation, then randomly cropped to $224 \times 224$, and subjected to random horizontal flipping and random RGB color shifting.
Testing: Scores are averaged across multiple scales: $\{ 224, 256, 384, 480, 640 \}$.
Training
Training hyperparameters and choices are as follows.
Initialization: Kaiming Initialization
Loss Function: Cross-entropy loss.
Optimizer: Stochastic gradient descent (SGD)
- Momentum $m = 0.9$
- L2 weight decay $\lambda = 10^{-4}$
- Batch size: 256
Learning rate: Initialized at $0.1$ and divided by 10 when the error plateaus.
Number of epochs: 120 ($60 \times 10^{4}$ iterations)
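Kaiming Initialization is available directly in PyTorch; a minimal sketch of applying it to the convolutional layers of the blocks sketched earlier (the BatchNorm scale/shift values shown are the common convention, added here as an assumption):

```python
import torch.nn as nn

def kaiming_init(model: nn.Module) -> None:
    """Initialize conv weights so that signal variance is preserved through ReLU layers."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.constant_(m.weight, 1.0)  # scale (gamma), usual default
            nn.init.constant_(m.bias, 0.0)    # shift (beta), usual default
```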
Submission: The authors submitted an ensemble of six networks with varying depths, achieving a 3.57% error rate—surpassing human performance and securing the 2015 ImageNet challenge win.
Transfer Learning
Beyond using these CNN models solely for inspiration, transfer learning enables us to apply them directly to many tasks—even with limited training data!
The core idea is to take a model pre-trained on ImageNet, remove its last fully-connected layer, and freeze the weights of the remaining layers. With this setup, the pre-trained model becomes an excellent feature extractor, effectively capturing complex patterns from input images.
Smaller dataset: For smaller datasets, we typically train a new linear layer on top of this feature extractor, tailored to our specific task. This approach has proven effective across various downstream tasks, delivering impressive performance even with minimal data.
Larger dataset: For larger datasets, you can go further with fine-tuning. Here, we combine the new linear layer with the pre-trained network and jointly train them on the new data. This gradual adjustment allows the network to adapt to the nuances of the new dataset while retaining valuable prior knowledge.
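A hedged sketch of both setups using a torchvision ResNet-50 pre-trained on ImageNet (the `weights` enum requires a recent torchvision; `num_classes` is a placeholder for the new task):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: number of classes in the new task

# Load a ResNet-50 pre-trained on ImageNet.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Feature extraction (smaller datasets): freeze the backbone ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the last fully-connected layer with a new, trainable one.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tuning (larger datasets): unfreeze everything and train jointly,
# typically with a smaller learning rate for the pre-trained layers.
for param in model.parameters():
    param.requires_grad = True
```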

References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NeurIPS 2012.
[2] Matthew D. Zeiler and Rob Fergus. “Visualizing and Understanding Convolutional Networks”, ECCV 2014.
[3] Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition”, CVPR 2016.