The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition that took place from 2010 to 2017, attracting teams from around the world to showcase their best-performing image classification models. This challenge became a crucial benchmark in the field, with its winners significantly influencing the landscape of image recognition and deep learning research.
The competition used a subset of the ImageNet dataset, containing 1.3M training examples across 1000 different classes, with 50k validation and 100k test examples. The images were resized to $256 \times 256$ pixels to standardize input, as they were originally downloaded from the web in various sizes.
Evaluation was based on two key performance indicators: top-1 and top-5 error rates. The top-5 error rate is the fraction of test images for which the correct label is not among the model’s five most likely predictions. Teams with the lowest top-5 error rate emerged as the winners of this challenge.
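As a quick illustration of the metric, here is a minimal sketch (assuming PyTorch-style `logits` and integer `labels` tensors) of how the top-5 error rate can be computed:

```python
import torch

def top5_error(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of examples whose true label is NOT among the 5 highest-scoring classes.

    logits: (N, 1000) class scores, labels: (N,) integer class indices.
    """
    top5 = logits.topk(k=5, dim=1).indices              # (N, 5) predicted class indices
    correct = (top5 == labels.unsqueeze(1)).any(dim=1)  # (N,) True if the label is in the top 5
    return 1.0 - correct.float().mean().item()

# Example with random scores: the expected top-5 error is roughly 1 - 5/1000.
logits = torch.randn(64, 1000)
labels = torch.randint(0, 1000, (64,))
print(f"top-5 error: {top5_error(logits, labels):.3f}")
```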

Winners of the challenge each year vs. Top-5 error rate
Early years
For the first two years, 2010 and 2011, the winning models were not based on neural networks at all. Instead, they relied on multiple layers of hand-designed feature extractors with linear models on top. At the time, training convolutional neural networks (CNNs) on large-scale, high-resolution images was prohibitively expensive due to the computational limitations of available hardware.
The breakthrough came in 2012 with the use of GPUs, which enabled highly optimized implementations of 2D convolutions. This, combined with the large-scale ImageNet dataset providing sufficient labeled examples, made it feasible to train deep CNN models without severe overfitting.
Several factors contributed to the huge leap in 2012:
- Data: The availability of the large-scale, well-labeled ImageNet dataset.
- Computation: Advances in GPUs enabled efficient training of deep networks.
- Algorithm: The introduction of AlexNet, a deep convolutional network architecture that took full advantage of the above advances, leading to a dramatic improvement in performance.
AlexNet (2012)
AlexNet [1] was the largest convolutional neural network trained at that time and achieved the best results reported on the ImageNet dataset, outperforming all other competitors by a large margin. This breakthrough made CNNs a mainstream topic in the field of computer vision, and AlexNet became one of the most influential works in the field.
Architecture
The model has 8 layers: 5 convolutional layers and 3 fully-connected layers.

Activation function: While Sigmoid and Tanh were common at the time, AlexNet was one of the first CNN architectures to use the ReLU activation. The authors showed empirically that CNNs with ReLUs train several times faster than their equivalents using Tanh units.
Normalization: The authors used “Local Response Normalization” (LRN) after the 1st and 2nd convolutional layers, which aided generalization, reducing the top-5 error rate by 1.2%. Although this technique has largely fallen out of use, it was an early precursor to batch normalization.
Pooling layer: The network includes max-pooling layers with a $3 \times 3$ kernel and a stride of 2, referred to as “overlapping pooling” because adjacent pooling windows overlap. The authors observed during training that this approach makes it slightly more difficult for the model to overfit, resulting in a 0.3% lower top-5 error rate compared to a non-overlapping scheme (kernel size of 2 and stride of 2).
Dropout: The network’s size (60 million parameters) made overfitting a significant problem, even with such a large dataset. To address this, dropout with a probability of $0.5$ was added to the first two fully-connected layers, where the majority of parameters are located. Dropout roughly doubles the number of iterations required for convergence.
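Putting these pieces together, here is a hedged PyTorch sketch of the layer stack (kernel sizes and channel counts follow the paper; treat it as an illustration rather than a reimplementation of the original two-GPU, grouped-convolution model):

```python
import torch.nn as nn

# Sketch of the 8-layer AlexNet: 5 convolutional layers + 3 fully-connected layers.
alexnet = nn.Sequential(
    # Conv 1: 11x11, stride 4, followed by ReLU, LRN, and overlapping max-pooling.
    # padding=2 is added so a 224x224 input yields the final 6x6x256 feature map
    # (the paper's arithmetic implies 227x227 inputs).
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Conv 2: 5x5 with padding 2, again with LRN and overlapping pooling.
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Conv 3-5: 3x3 convolutions, pooling only after the last one.
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Classifier: dropout on the first two fully-connected layers.
    nn.Flatten(),
    nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)
```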

Model summary of AlexNet
Data Manipulation
Preprocessing: The mean image of the training set (per-pixel mean) is subtracted from each pixel.
Augmentation: Two data augmentation techniques are used to further reduce overfitting. These augmentations are performed on-the-fly on the CPU while the GPUs train on the previous batch, making them computationally “free”.
Image translations and horizontal reflection:
Training: Random $224 \times 224$ patches (and their horizontal reflections) are extracted from $256 \times 256$ images for training.
- Number of transformations: $(256 - 224) * (256 - 224) = 1024$
- Horizontal reflections: $1024 * 2 = 2048$
Testing: The network makes a prediction by extracting five $224 \times 224$ patches (four corner patches and the center patch) as well as their horizontal reflections, averaging predictions across all ten patches.
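Test-time averaging over the ten crops can be sketched with torchvision’s `TenCrop` transform (a simplified single-image version; names such as `model` are placeholders):

```python
import torch
from torchvision import transforms
from PIL import Image

# Ten-crop evaluation: 4 corner crops + the center crop, plus their horizontal reflections.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.TenCrop(224),  # returns a tuple of 10 PIL crops
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_ten_crop(model, image: Image.Image) -> torch.Tensor:
    crops = ten_crop(image)                    # (10, 3, 224, 224)
    with torch.no_grad():
        scores = model(crops).softmax(dim=1)   # (10, 1000) class probabilities
    return scores.mean(dim=0)                  # average the predictions over all 10 crops
```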
PCA color augmentation (also called Fancy PCA): This technique alters the intensities of the RGB channels in the training images. It captures an important property of images—that object identity remains invariant to changes in intensity and illumination color.
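Concretely, the paper adds multiples of the principal components of RGB pixel values, scaled by the corresponding eigenvalues times a Gaussian random variable with standard deviation 0.1. A minimal NumPy sketch of the idea (computing the covariance from a single image for brevity, whereas the paper computes it once over the whole training set; `image` is assumed to be a float array in [0, 1]):

```python
import numpy as np

def fancy_pca(image: np.ndarray, std: float = 0.1) -> np.ndarray:
    """PCA color augmentation applied to a single (H, W, 3) float image."""
    pixels = image.reshape(-1, 3)
    # Eigendecomposition of the 3x3 covariance matrix of RGB values.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Add p_i * alpha_i * lambda_i to every pixel, with alpha_i ~ N(0, std^2).
    alphas = np.random.normal(0.0, std, size=3)
    shift = eigvecs @ (alphas * eigvals)   # (3,) RGB offset, broadcast over all pixels
    return np.clip(image + shift, 0.0, 1.0)
```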
Training
Key hyperparameters used for training AlexNet:
Initialization:
- Weights: Initialized from a zero-mean Gaussian distribution with a standard deviation of $0.01$ for each layer.
- Bias: Initialized to $1$ for the 2nd, 4th, and 5th convolutional layers, as well as in the fully connected layers. This choice accelerates early learning by providing ReLUs with positive inputs. All other layers have their biases initialized to $0$.
Loss Function: Cross-entropy loss.
Optimizer: Stochastic gradient descent (SGD)
- Momentum $m = 0.9$
- L2 weight decay $\lambda = 5 \times 10^{-4}$
- Batch size: 128
Learning rate: Initially set to $0.01$ and reduced three times prior to termination by dividing by 10 when the validation error rate stopped improving with the current learning rate.
Number of epochs: 90
Training time: 5 to 6 days on two GTX 580 3GB GPUs.
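The initialization and optimizer settings listed above translate roughly to the following PyTorch sketch (assuming the `alexnet` module from the earlier sketch; the learning-rate schedule is approximated with `ReduceLROnPlateau`):

```python
import torch
import torch.nn as nn

def init_alexnet(model: nn.Sequential) -> None:
    """Gaussian weight init (std 0.01); bias 1 in the 2nd/4th/5th conv and the FC layers."""
    conv_idx = 0
    for m in model:
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.01)
            if isinstance(m, nn.Conv2d):
                conv_idx += 1
                bias = 1.0 if conv_idx in (2, 4, 5) else 0.0  # 2nd, 4th, 5th conv layers
            else:
                bias = 1.0                                     # fully-connected layers
            nn.init.constant_(m.bias, bias)

init_alexnet(alexnet)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(alexnet.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when the validation error stops improving;
# call scheduler.step(val_error) once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
```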

The image above illustrates a typical AlexNet architecture. The model was spread across two GTX 580 3GB GPUs, as it was too large to fit into the memory of a single GPU. The GPUs communicated only in certain layers to optimize computation. However, with modern hardware, this complexity is unnecessary; today, the entire model can be trained on a single GPU (Google Colab, for instance, offers 12GB/16GB GPUs).
Submission: The CNN architecture described above achieved a top-5 error rate of 18.2%. The authors submitted an ensemble of 5 similar CNNs that yielded an error rate of 16.4%, winning the 2012 challenge.
ZFNet (2013)
With AlexNet stealing the show in 2012, there was a significant increase in the number of CNN models submitted to ILSVRC 2013. The winner was ZFNet [2], an improved version of AlexNet that tweaked some layer configurations to achieve better performance.
- First layer adjustment: AlexNet used a large filter size of $11 \times 11$ with a stride of 4 in the first layer. While this aggressive downsampling reduced computational cost, it also resulted in the loss of relevant pixel information. To address this, ZFNet used a $7 \times 7$ filter with a stride of 2 in the first layer (see the code sketch below).
To justify these changes, the authors proposed a “deconvnet”, a technique to project feature maps from each convolutional layer back to the input pixel space. This allows us to visualize the different types of features learned by each layer, providing insights into the inner workings of CNNs:
Visualization: Initial layers learn to detect general patterns such as corners, edges, and textures, while deeper layers capture class-specific details like dog faces, bird legs, and other object parts.
Evolution during training: Lower layers of the model converged within a few epochs, while the upper layers only developed after a considerable number of epochs, demonstrating the need to let the models train until fully converged.
Invariance: Small transformations strongly affect the first layer, while higher layers show greater stability, with minimal impact from translations and scalings. The network output remains stable under these transformations.
Occlusion: When an object is occluded, the probability of the correct class drops significantly, indicating that the model relies heavily on local structure within the image rather than broad scene context.
- Model size adjustment: The authors also conducted an ablation study that revealed performance gains from increasing the size of the middle convolutional layers. Consequently, they modified layers 3, 4, and 5 to have 512, 1024, and 512 output channels, respectively.
An ensemble of 6 CNNs—five with the modified first layer and one incorporating both modifications—achieved the lowest error rate.
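Both adjustments amount to small changes in layer hyperparameters. A hedged sketch of the affected layers (channel counts for the unchanged layers follow AlexNet, and the padding values are illustrative, not the authors’ code):

```python
import torch.nn as nn

# AlexNet's first layer: large 11x11 filters with an aggressive stride of 4.
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)

# ZFNet's first layer: smaller 7x7 filters with stride 2 retain more pixel detail.
zfnet_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)

# Enlarged middle layers: 512, 1024, and 512 output channels for conv 3, 4, and 5.
zfnet_conv3 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
zfnet_conv4 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
zfnet_conv5 = nn.Conv2d(1024, 512, kernel_size=3, padding=1)
```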

ImageNet 2012 classification error rates
AlexNet and ZFNet were designed in a somewhat ad-hoc manner, with an arbitrary number of convolution and pooling layers, and the configurations of each layer set by trial and error. This makes scaling them quite challenging.
VGGNet (2014)
In 2014, the runner-up of the ImageNet challenge was VGGNet [3], developed by the Visual Geometry Group at Oxford. This architecture was one of the first to have a principled design that guided the overall configuration of the network, enabling the creation of deeper networks and achieving significant improvements over previous configurations.
Architecture
Let’s take a look at a side-by-side comparison of AlexNet, ZFNet, VGG-16, and VGG-19 architectures.

VGGNet features clean and simple design principles. The configuration for each stage is fixed as follows:
Convolutional layers: Kernel size of $3 \times 3$, with a stride of 1 and padding of 1 (same padding).
- Why? This is the smallest kernel capable of capturing directionality (left/right, up/down, center). This design choice ensures that the network remains compact while allowing for greater depth.
Max-pooling layers: Kernel size of $2 \times 2$, with a stride of 2.
Channels: Starting from 64, the number of channels doubles after each pooling layer, reaching a maximum of 512.
- When the pooling layer downsamples the feature map by half, we double the number of channels to preserve the overall volume, thereby maintaining consistent time complexity (FLOPs) across each layer.
Activation: ReLU non-linearities are used throughout the network.
Normalization: No normalization is applied, as it does not improve performance and instead increases memory consumption and computation time.
Dropout: Added to the first two fully connected layers, with a dropout ratio of $0.5$.
While AlexNet has 5 convolutional layers, VGGNet comprises 5 stages:
Stage 1: conv-conv-pool
Stage 2: conv-conv-pool
Stage 3: conv-conv-conv-[conv]-pool
Stage 4: conv-conv-conv-[conv]-pool
Stage 5: conv-conv-conv-[conv]-pool
Two VGGNet variants were presented: VGG-16 and VGG-19, with 16 and 19 layers, respectively. VGG-19 includes an additional convolutional layer in stages 3, 4, and 5.
The convolutional layers are stacked to increase the receptive field. For instance, a stack of two $3 \times 3$ layers achieves an effective receptive field of $5 \times 5$, and three layers result in $7 \times 7$. What do we gain by using three $3 \times 3$ layers instead of a single $7 \times 7$ layer?
More activations: Three non-linearities instead of one, making the decision function more discriminative.
Fewer parameters: Assuming the input and output have $C$ channels:
- Three $3 \times 3$ layers: $3(3 * 3 * C^2) = 27C^2$ params.
- A single $7 \times 7$ layer: $7 * 7 * C^2 = 49C^2$ params.
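The parameter comparison is easy to verify in code. Here is a small sketch that builds a stack of three $3 \times 3$ convolutions (one VGG-style stage without the pooling layer) and compares its weight count with a single $7 \times 7$ layer, assuming $C = 256$ channels:

```python
import torch.nn as nn

C = 256

# Three 3x3 convolutions with ReLU in between: same 7x7 receptive field, more non-linearity.
stacked = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

# A single 7x7 convolution covering the same receptive field in one step.
single = nn.Conv2d(C, C, kernel_size=7, padding=3)

def num_weights(m: nn.Module) -> int:
    # Count only the convolution weights (ignore biases) to match the 27C^2 vs 49C^2 figures.
    return sum(p.numel() for name, p in m.named_parameters() if name.endswith("weight"))

print(num_weights(stacked), 27 * C * C)   # 1769472 1769472
print(num_weights(single), 49 * C * C)    # 3211264 3211264
```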

Data Manipulation
Preprocessing: The mean RGB value from the training set (per-channel mean) is subtracted from each pixel.
Augmentation: Training images are first rescaled to a training scale, $S$. These rescaled images are then randomly cropped to $224 \times 224$ and undergo random horizontal flipping and random RGB color shifting (similar to AlexNet).
Single training scale:
- Training: Models are trained at two fixed scales, $S = 256$ or $S = 384$.
- Testing: The output is averaged over three test image versions rescaled at $ \{ S - 32, S, S + 32 \} $.
Multi-training scale:
- Training: Each image is rescaled individually by randomly sampling $S$ from the range $[S_{min} =256, S_{max} = 512]$. This approach, called scale jittering, enables the model to recognize objects at different scales.
- Testing: The output is averaged over three test image versions rescaled at $ \{ S_{min}, S_{avg}, S_{max} \} = \{256, 384, 512 \} $.
This provides three trained versions of the same network: two single-scale models trained at fixed scales and one model trained using multiple scales.
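A rough sketch of the multi-scale (scale-jittering) training pipeline using torchvision (sampling $S$ per image inside a custom transform is an assumption about how one might implement it, not the authors’ code):

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class ScaleJitter:
    """Rescale each image so its smaller side is a random S in [s_min, s_max]."""
    def __init__(self, s_min: int = 256, s_max: int = 512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)  # sample a training scale per image
        return TF.resize(img, s)                    # isotropic rescale: smaller side -> S

train_transform = transforms.Compose([
    ScaleJitter(256, 512),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```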
Training
Training hyperparameters and choices are similar to AlexNet.
Initialization:
- Pre-training: A shallow network is first trained with random initialization (weights from a zero-mean Gaussian distribution with $0.01$ variance and biases set to $0$), and then re-trained with additional convolutional layers.
- Xavier initialization: After submission, the authors discovered that Xavier initialization enables training without the need for pre-training.
Loss Function: Cross-entropy loss.
Optimizer: Stochastic gradient descent (SGD)
- Momentum $m = 0.9$
- L2 weight decay $\lambda = 5 \times 10^{-4}$
- Batch size: 256
Learning rate: Initialized at $0.01$, and reduced three times by dividing by 10 when the validation accuracy stopped improving.
Number of epochs: 74
Training time: 2-3 weeks on four Titan Black 6GB GPUs.
Despite having more parameters and greater depth than AlexNet, VGGNet required fewer epochs to converge due to (a) implicit regularization imposed by greater depth and smaller convolutional filter sizes, and (b) pre-initialization of certain layers.
Submission: The authors submitted an ensemble of 7 networks—six single-scale models and one multi-scale model—resulting in a top-5 error rate of 7.3%.

ResNet (2015)
By 2015, it had become clear that increasing a network’s depth significantly improved its performance. However, deep networks typically face two major challenges:
Overfitting: Regularization techniques like Batch Normalization (BatchNorm) help mitigate overfitting and enable higher learning rates.
Vanishing/Exploding gradients: Non-saturating activations like ReLU, combined with Kaiming Initialization to preserve signal variance, help address this. BatchNorm further ensures that forward-propagated signals maintain stable, non-zero variances.
The development of BatchNorm and Kaiming Initialization in 2015 set the stage for experimenting with deeper models. However, a “degradation problem” came to light: as network depth increased, accuracy would initially saturate (which might be expected) and then degrade rapidly.

Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer networks. The deeper network has higher training and test error.
This degradation is not caused by overfitting, as deeper networks exhibit higher training error than their shallower counterparts, as shown in the figure above. The authors hypothesize that this is an optimization problem: deeper models are harder to optimize, leading to underfitting.
Architecture
Intuitively, we expect a deeper model to perform at least as well as a shallower model, since it could theoretically emulate the shallower network by copying its layers and setting the extra layers to identity mappings. The fact that deeper models performed worse suggests that the solvers struggle to approximate identity mappings with multiple non-linear layers.
Residual Block
To address this, ResNet [4] introduced a new network design that simplifies learning identity mappings. If we let the stacked non-linear layers fit a mapping $F(x)$ for an input $x$, we can add a skip connection to recast the original mapping as $F(x) + x$.

In the extreme case, if an identity mapping were optimal, the solver would push the residual to zero, i.e., drive the weights of the non-linear layers toward zero to approximate the identity mapping. While it is unlikely that identity mappings are optimal in practice, this reformulation helps precondition the problem, making it easier for the solver to find perturbations relative to an identity mapping rather than learning a new function from scratch.
These shortcut connections neither add extra parameters nor increase computational complexity, and they can be trained end-to-end using SGD with backpropagation without modification.
You might wonder why the skip connection spans two layers. The authors found no advantage in using a single-layer skip, while two-layer residual blocks trained more stably and performed better.
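A two-layer residual block, the basic building unit described above, can be sketched as follows (the BatchNorm placement and the optional projection shortcut follow the paper’s description; the code is illustrative):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus a skip connection: output = ReLU(F(x) + x)."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (1x1 conv) when crossing a stage boundary, so that the
        # channel and spatial dimensions match; identity shortcut otherwise.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))   # first conv-BN-ReLU
        out = self.bn2(self.conv2(out))            # second conv-BN (no ReLU yet)
        out = out + self.shortcut(x)               # element-wise addition with the skip path
        return self.relu(out)                      # final ReLU after the addition
```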
A residual network, or ResNet, is formed by stacking multiple residual blocks. The image below shows a comparison of VGG-19, 34-layer plain, and residual networks.

Similar to VGGNet, ResNet is divided into four stages, each with a different number of residual blocks. The architecture includes:
Aggressive stem: The input is aggressively downsampled with a $7 \times 7$ convolution with stride 2 (similar to ZFNet), followed by a $3 \times 3$ max-pooling layer with stride 2, before applying residual blocks.
Residual block: Each residual block contains two $3 \times 3$ convolutional layers.
Element-wise addition is performed on two feature maps, channel by channel.
Skip connection across stage boundaries (shown by dotted lines): A $1 \times 1$ convolution is applied to the input to match the channel dimension, using a stride of 2 to match the spatial dimension. The output is given by $$ H(x) = F(x) + W_s x $$ where $W_s$ is the projection shortcut, used solely for matching dimensions; all other shortcuts are identity.
Channels: Starting with 64 channels, the number is doubled after each stage, up to 512.
Strided convolution: Instead of using pooling layers to halve the feature map after each stage, a stride of 2 is used in the first convolution of the next stage.
Normalization: Each convolution is followed by a BatchNorm layer and a ReLU activation function.
Global average pooling: Instead of using fully connected layers, the network ends with a global average pooling layer followed by a single linear layer with softmax to generate class scores.
- Why? Fully connected layers have a large number of parameters, increasing memory usage.
- Average pooling is applied to the last convolutional layer ($512 \times 7 \times 7$) using a $7 \times 7$ kernel to cover the entire spatial structure.
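A sketch of this global-average-pooling head for ResNet-18/34, to show how small it is compared to a VGG-style fully-connected head (the parameter counts in the comments are my own estimates):

```python
import torch.nn as nn

# Pool the final 512 x 7 x 7 feature map down to 512 values,
# then a single linear layer produces the 1000 class scores.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # equivalent to 7x7 average pooling on a 7x7 feature map
    nn.Flatten(),
    nn.Linear(512, 1000),
)
# Roughly 0.5M parameters, versus roughly 124M for VGG-16's three fully-connected layers.
```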
The authors presented two variants, ResNet-18 and ResNet-34, both of which have lower complexity than VGG-19 (19.6 GFLOPs).

As the 34-layer network performed better than the 18-layer network, it was clear that adding more layers could yield better performance. However, this also increases computational costs.
Bottleneck Residual Block
To reduce computation, a bottleneck block is introduced with three convolutions: $1 \times 1$, $3 \times 3$ and $1 \times 1$. The pointwise convolutions are used to reduce and then restore dimensions.

Comparing FLOPs, computed per convolution as $(C_{out} \times H' \times W') * (C_{in} \times K_h \times K_w)$:
Basic residual block: $2 * [(64 \times 56 \times 56) * (64 \times 3 \times 3)]$ = $0.23$ GFLOPs.
Bottleneck residual block: $2 * [(64 \times 56 \times 56) * (256 \times 1 \times 1)]$ + $[(64 \times 56 \times 56) * (64 \times 3 \times 3)]$ = $0.22$ GFLOPs.
The bottleneck design introduces extra layers and non-linearities with a slight reduction in computational cost, enabling ResNet to add more layers without significantly increasing overall complexity.
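A sketch of the bottleneck block (the $1 \times 1$ reduce / $3 \times 3$ / $1 \times 1$ restore pattern described above; the 4x expansion factor matches the 64-to-256 channel layout used in the FLOP comparison):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore), with a skip connection around all three."""
    expansion = 4  # e.g. 256 -> 64 -> 64 -> 256 channels

    def __init__(self, in_channels: int, mid_channels: int, stride: int = 1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.reduce = nn.Sequential(                          # 1x1: shrink the channel count
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
        )
        self.conv = nn.Sequential(                            # 3x3: the "expensive" convolution
            nn.Conv2d(mid_channels, mid_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
        )
        self.restore = nn.Sequential(                         # 1x1: restore the channel count
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:        # projection shortcut if needed
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.restore(self.conv(self.reduce(x)))
        return self.relu(out + self.shortcut(x))
```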
Replacing all basic blocks in ResNet-34 with bottleneck blocks results in the ResNet-50 architecture, a widely-used baseline. The authors further expanded ResNet with 101 and 152-layer variants using different numbers of bottleneck blocks.

Architecture of ResNet variants. Downsampling is performed by the first convolution of stages 2, 3, and 4 with a stride of 2. Error rates shown are for single-crop testing, as reported by torchvision.
Data Manipulation
Preprocessing: A per-pixel mean is subtracted, as in AlexNet.
Augmentation: Similar to VGGNet, multi-scale training is applied.
Training: Images are rescaled with the shorter side randomly sampled in $[256, 480]$ for scale augmentation, then randomly cropped to $224 \times 224$, and subjected to random horizontal flipping and random RGB color shifting.
Testing: Scores are averaged across multiple scales: $\{ 224, 256, 384, 480, 640 \}$.
Training
Training hyperparameters and choices are as follows.
Initialization: Kaiming Initialization
Loss Function: Cross-entropy loss.
Optimizer: Stochastic gradient descent (SGD)
- Momentum $m = 0.9$
- L2 weight decay $\lambda = 10^{-4}$
- Batch size: 256
Learning rate: Initialized at $0.1$ and divided by 10 when the error plateaus.
Number of epochs: 120 ($60 \times 10^{4}$ iterations)
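Kaiming Initialization is available directly in PyTorch; a minimal sketch of applying it to the convolutional layers of the blocks sketched earlier (the BatchNorm scale/shift values shown are the common convention, added here as an assumption):

```python
import torch.nn as nn

def kaiming_init(model: nn.Module) -> None:
    """Initialize conv weights so that signal variance is preserved through ReLU layers."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.constant_(m.weight, 1.0)  # scale (gamma), usual default
            nn.init.constant_(m.bias, 0.0)    # shift (beta), usual default
```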
Submission: The authors submitted an ensemble of six networks with varying depths, achieving a 3.57% error rate—surpassing human performance and securing the 2015 ImageNet challenge win.
Transfer Learning
Beyond using these CNN models solely for inspiration, transfer learning enables us to apply them directly to many tasks—even with limited training data!
The core idea is to take a model pre-trained on ImageNet, remove its last fully-connected layer, and freeze the weights of the remaining layers. With this setup, the pre-trained model becomes an excellent feature extractor, effectively capturing complex patterns from input images.
Smaller dataset: For smaller datasets, we typically train a new linear layer on top of this feature extractor, tailored to our specific task. This approach has proven effective across various downstream tasks, delivering impressive performance even with minimal data.
Larger dataset: For larger datasets, you can go further with fine-tuning. Here, we combine the new linear layer with the pre-trained network and jointly train them on the new data. This gradual adjustment allows the network to adapt to the nuances of the new dataset while retaining valuable prior knowledge.
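A hedged sketch of both setups using a torchvision ResNet-50 pre-trained on ImageNet (the `weights` enum requires a recent torchvision; `num_classes` is a placeholder for the new task):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: number of classes in the new task

# Load a ResNet-50 pre-trained on ImageNet.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Feature extraction (smaller datasets): freeze the backbone ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the last fully-connected layer with a new, trainable one.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tuning (larger datasets): unfreeze everything and train jointly,
# typically with a smaller learning rate for the pre-trained layers.
for param in model.parameters():
    param.requires_grad = True
```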

References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NeurIPS 2012.
[2] Matthew D. Zeiler and Rob Fergus. “Visualizing and Understanding Convolutional Networks”, ECCV 2014.
[3] Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition”, CVPR 2016.