In the last post, we introduced Linear Classifiers as the simplest model in deep learning for image classification problems. We discussed how Loss Functions express preferences over different choices of weights, and how Optimization minimizes these loss functions to train the model.

However, linear classifiers have limitations: their decision boundaries are linear, which makes them inadequate for classifying complex data. One option is to manually extract features from the input image and transform them into a feature space, hoping that it makes the data linearly separable. We could then apply a linear classifier on top of these features to classify each image. This would result in a non-linear decision boundary when projected back to the input space.

While possible, this approach is cumbersome since it requires manually defining feature extraction methods for each specific task. Instead, we want our model to jointly learn both the feature representation and the linear classifier. This is where neural networks come in. Neural networks are built by stacking linear classifiers with non-linearities in between, which is why they are also called Multi-Layer Perceptrons (MLPs).

Neural Networks

A neural network is composed of interconnected layers, where every neuron in one layer connects to every neuron in the next layer. A typical deep learning model has multiple layers, where the “depth” refers to the number of layers or the number of learnable weight matrices.

\begin{align} \text{Linear function: } & f = W x \\ \text{2-Layer Neural Network: } & f = W_2 \hspace{2mm} z(W_1 x) \\ \text{3-Layer Neural Network: } & f = W_3 \hspace{2mm} z_2(W_2 \hspace{2mm} z_1(W_1 x)) \\ \end{align}

Each layer is followed by a non-linear function called an activation function $z_i$. Activation functions are critical because, without them, stacking multiple linear layers would simply result in a linear classifier ($ y = W_2 W_1 x = W x $).
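To see this collapse concretely, here is a minimal sketch (with small random matrices) showing that two stacked linear layers without an activation are equivalent to a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))    # first layer: 3 -> 4
W2 = rng.standard_normal((2, 4))    # second layer: 4 -> 2
x = rng.standard_normal(3)

# Without a non-linearity, the two layers collapse into a single matrix W = W2 @ W1
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True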

The non-linearity introduced by activation functions allows neural networks to create complex decision boundaries. As the number of layers (and activation functions) increases, the decision boundary becomes more intricate in the input space, enabling the model to fit the data more effectively.

Activation functions

There are several popular activation functions to choose from, each with its pros and cons. Let’s briefly explore them:

Sigmoid

This function transforms input data into the range $[0, 1]$, which can be interpreted as a probability (i.e., the likelihood of a particular feature being present). But, it has significant drawbacks:

  • For large positive or negative inputs, the output saturates (the curve becomes parallel to the x-axis), causing the gradient to approach zero ($\mathbf{dw} \approx 0$). This is known as the vanishing gradient problem, which effectively kills the learning process.
  • Computing the exponential function is expensive, slowing down training.

Tanh

The tanh function is a scaled and shifted version of sigmoid that outputs in the range $[-1, 1]$, making it zero-centered. This allows for both positive and negative outputs, improving gradient flow compared to sigmoid. However:

  • It still suffers from the vanishing gradient problem.
  • Like sigmoid, it relies on exponentials, making it computationally expensive.

ReLU

ReLU is the most commonly used activation function due to its computational efficiency, which leads to faster training. But,

  • It has a dying ReLU problem: when the inputs are negative, the gradient becomes zero, and those neurons stop learning. They become inactive (always outputting zero), reducing the effective capacity of the model.

Leaky ReLU

Leaky ReLU is a variation that addresses the dying ReLU problem by adding a small, non-zero gradient for negative input. This prevents neurons from dying in the negative region.

  • The slope in the negative region (usually 0.1) is a hyperparameter, introducing another variable to tune 😔.

There are advanced variations like Exponential Linear Unit (ELU), Scaled ELU (SELU), and Gaussian Error Linear Unit (GELU), each offering specific benefits. A good rule of thumb is to start with ReLU as your default choice. For finer improvements, you can experiment with these advanced variants to push your model’s accuracy by a small margin.
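As a quick reference, here is a minimal sketch of these activations using PyTorch's built-in functions (the 0.1 negative slope for Leaky ReLU is just an example value):

import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, steps=11)

sigmoid = torch.sigmoid(x)                        # squashes to (0, 1), saturates for large |x|
tanh = torch.tanh(x)                              # zero-centered, squashes to (-1, 1)
relu = F.relu(x)                                  # max(0, x): cheap, but zero gradient for x < 0
leaky_relu = F.leaky_relu(x, negative_slope=0.1)  # small slope for x < 0 keeps gradients alive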

Weight Initialization

Once the architecture of the network is fixed, the next step is to initialize the weights of each layer. Weight initialization sets the starting point on the loss landscape for gradient descent. Each new set of initial weights kicks off the optimization process from a different spot, potentially leading to different final solutions. Therefore, the choice of weight initialization is crucial, as it significantly influences the optimization path and the model’s convergence.

Initializing with zeros - If the weights are initialized to a constant value, such as zero, the activations (outputs of neurons: linear layer + activation function) will also be zero, and so will the gradients. This effectively kills learning, preventing the network from even starting to train!

W_l = 0 # very bad idea

Initializing with small random numbers - Using a Gaussian distribution with zero mean and a standard deviation of 0.01 can work for shallow networks. However, for deeper networks, repeatedly multiplying small weights (less than 1) through many layers will shrink the activations as they propagate, causing them to approach zero. This leads to the vanishing gradient problem with tanh and ReLU non-linearities.

W_l = 0.01 * np.random.randn(Din, Dout)

Initializing with larger standard deviation - If weights are initialized with a larger standard deviation, the activations can grow exponentially as they propagate, leading to very large outputs and gradients. This is known as the exploding gradient problem.
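To make this concrete, here is a minimal sketch (a hypothetical 10-layer stack of 512-unit linear layers, with the non-linearity omitted for simplicity) showing how the initialization scale drives activations toward zero or toward huge values as depth grows:

import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 512)                  # a batch of random 512-dim inputs

for scale in [0.01, 1 / np.sqrt(512), 0.1]:     # small, balanced, and large init scales
    h = x
    for _ in range(10):                         # pass through 10 linear layers of width 512
        W = scale * np.random.randn(512, 512)
        h = h @ W
    print(f"scale {scale:.4f}: std of activations after 10 layers = {h.std():.2e}")

The small scale makes the activations vanish, the large one makes them explode, and the 1/√512 scale, a preview of the Xavier initialization derived below, keeps them roughly constant.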

Thus, weight initialization must strike a balance between these two extremes, so that activations neither vanish nor explode as they propagate through the network. This idea is formalized in Xavier Initialization [1].

Xavier Initialization

If we can ensure that the variance of the output of a layer is the same as the variance of its input, the scale of activations won’t change as they propagate, allowing us to avoid vanishing or exploding gradients. To derive this, let’s consider a linear layer:

\begin{align} y &= \sum_{i=1}^{D_{in}} x_i w_i \\ Var(y) &= D_{in} * Var(x_i w_i) \\ &= D_{in} * (E[x_i^2]E[w_i^2] - E[x_i]^2E[w_i]^2) &\text{[Assume x,w independent]} \\ &= D_{in} * (E[x_i^2]E[w_i^2]) &\text{[Assume x,w are zero-mean]} \\ &= D_{in} * Var(x_i) * Var(w_i) \end{align} To maintain the variance between input and output, $Var(y) = Var(x_i)$, $$ \Rightarrow Var(w_i) = 1/D_{in} $$

Therefore, we initialize the weights of each layer with zero mean and a standard deviation equal to the inverse square root of the input dimension of that particular layer.

W_l = np.random.randn(Din, Dout) * np.sqrt(1 / Din)

Note that the input $x_i$ refers to the activation (output) from the previous layer. Our derivation assumes that this activation is zero-centered (has zero mean), which holds for activation functions like tanh that are centered around zero.

However, this initialization does not work well for networks with ReLU non-linearities and needs to be adjusted. This modification is incorporated in Kaiming Initialization [2].

Kaiming/He Initialization

Assuming that the inputs are zero-centered, ReLU kills all negative inputs, so we effectively keep only half of our expected outputs, resulting in the variance being halved, $Var(y) = Var(x_i) / 2$. To maintain the scale of activations across layers, $$ \Rightarrow Var(w_i) = 2/D_{in} $$

W_l = np.random.randn(Din, Dout) * np.sqrt(2 / Din)

Backpropagation

Most of our choices so far have revolved around gradients, the essential flour of our deep learning cake. To compute these gradients (the derivatives of the loss function with respect to the weights of our neural network), we rely on a computation graph.

A computation graph is a directed graph that represents the computations inside our model. Let’s take a simple example of a computation graph that represents the following equation: \begin{align} q = x + y \ \ \text{and} \ \ f = z * q \end{align}

First, we perform a forward pass through the graph (denoted in green), where we compute the outputs of each node sequentially, given the inputs: $x = -2, y = 5, z = -4$. \begin{align} q &= x + y = -2 + 5 = 3\\ f & = z * q = -4 * 3 = -12 \end{align}

Backpropagation is the algorithm we use to compute gradients in this computation graph. In our backward pass (denoted in red), we apply the chain rule to compute, in reverse order, the derivative of the output $f$ with respect to each node. \begin{align} \frac{df}{df} &= 1 \\ \frac{df}{dq} &= z = -4 \\ \frac{df}{dz} &= q = 3 \\ \frac{df}{dx} &= \frac{df}{dq} \frac{dq}{dx} = -4 * 1 = -4 \\ \frac{df}{dy} &= \frac{df}{dq} \frac{dq}{dy} = -4 * 1 = -4 \end{align}

In this way, we compute the derivative of the output with respect to the inputs. In deep learning, the output $f$ is typically the loss function, and we use this method to compute the gradients of the loss with respect to the weights.

With more complex neural networks, we follow the same procedure; but don’t worry, you don’t have to compute these gradients manually! Libraries like PyTorch handle it for us.
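To see this in action, here is a minimal PyTorch autograd sketch reproducing the toy example above:

import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y            # forward pass: q = 3
f = z * q            # forward pass: f = -12
f.backward()         # backward pass: applies the chain rule automatically

print(x.grad, y.grad, z.grad)    # tensor(-4.), tensor(-4.), tensor(3.)

The printed gradients match the values we derived by hand.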

Coding from scratch

Okay, enough theory. Let’s dive into coding a neural network from scratch using the PyTorch library. Here are a couple of key PyTorch terminologies to understand:

  • Tensor: Similar to a NumPy array but with the ability to run on GPUs.
  • Autograd: A package that builds computational graphs from Tensors and automatically computes gradients, allowing for easy backpropagation.

For our example, we’ll build a simple neural network for image classification. The input is a grayscale image of size (28, 28), flattened to a 784-dimensional vector, and our model will have two layers with ReLU activation, outputting scores for 10 classes:

[Input -> Linear -> ReLU -> Linear -> Output]

Training Pipeline

We’ll follow this pipeline for training:

1. Initialize the model and weights

# Initialize the weights
w1 = (torch.randn(784, 512) * (2 / 784) ** 0.5).requires_grad_()
w2 = (torch.randn(512, 10) * (2 / 512) ** 0.5).requires_grad_()

Since we want to compute gradients with respect to the weights, we create them as tensors and set requires_grad=True (here via the in-place requires_grad_() call, applied after scaling so that the weights remain leaf tensors). This tells PyTorch to enable autograd on these tensors and build a computational graph that tracks how they are used in subsequent operations.

We initialize the weights using Kaiming initialization because we have the ReLU activation function in our network. Additionally, we prefer powers of 2 for hidden dimensions, such as 512, 256, 128, and 64, due to their computational efficiency on GPUs.
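If you later switch to nn.Linear modules (as we do below), the same Kaiming initialization can be applied through torch.nn.init; a minimal sketch:

import torch

layer = torch.nn.Linear(784, 512)
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')   # std = sqrt(2 / fan_in)
torch.nn.init.zeros_(layer.bias)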

2. Perform a forward pass of the model

# Forward Pass
logits = X_batch.mm(w1).clamp(min=0).mm(w2)  

Here, X_batch represents the mini-batch of inputs, while y_batch contains the corresponding labels. The output of the model, logits, is the score vector, which contains scores for each category. The clamp(min=0) operation implements the ReLU activation.

3. Compute the loss

# Compute the loss
loss = loss_fn(logits, y_batch)

The loss function calculates the average loss between the predicted scores (logits) and the true labels (y_batch) across the mini-batch. This loss value is then used to compute gradients for updating the model parameters during training.
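The from-scratch code above leaves loss_fn unspecified; for multi-class classification a common choice is cross-entropy, which PyTorch provides as a functional. A minimal sketch, assuming that choice:

import torch.nn.functional as F

# Cross-entropy combines log-softmax over the scores with negative log-likelihood,
# and by default returns the mean loss over the mini-batch.
loss_fn = F.cross_entropy      # usage: loss = loss_fn(logits, y_batch)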

4. Backpropagate the gradients

# Compute gradients of loss wrt weights
loss.backward()     

After computing the loss, the gradients of the loss with respect to the weights are automatically computed by calling loss.backward(). This function performs backpropagation through all tensors that have the requires_grad flag set to True, accumulating the gradients in their .grad attributes.

5. Update the weights

# Gradient descent step
with torch.no_grad():       # Tells PyTorch not to build a graph
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

# Set gradients to zero
w1.grad.zero_()
w2.grad.zero_()

The torch.no_grad() context manager ensures that no computation graph is built during the weight update step.

It’s crucial to zero out the gradients (w.grad.zero_()) after each update step so that the next pass starts from fresh gradients; otherwise, gradients from successive backward calls would keep accumulating.

The entire code put together looks something like the following. We train the model for 10 full passes over the training data, known as epochs in machine learning terminology.

# Initialize the weights
w1 = (torch.randn(784, 512) * (2 / 784) ** 0.5).requires_grad_()
w2 = (torch.randn(512, 10) * (2 / 512) ** 0.5).requires_grad_()

# Split data into mini-batches
mini_batches = split(data, batch_size)        

num_epochs = 10

# Train
for epoch in range(num_epochs):
    for (X_batch, y_batch) in mini_batches:

        # Forward Pass
        logits = X_batch.mm(w1).clamp(min=0).mm(w2)  

        # Compute the loss 
        loss = loss_fn(logits, y_batch)

        # Backpropagate
        loss.backward()          
        
        # Gradient descent step
        with torch.no_grad():       # Tells PyTorch not to build a graph
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad
        
        # Set gradients to zero
        w1.grad.zero_()
        w2.grad.zero_()

Pretty simple, right? Kudos! You can now code a neural network!

While this was a naive way to code a deep learning model, let’s leverage PyTorch’s high-level APIs to rewrite the same model in a more scalable manner.

  • Module: The base class for layers and models; its parameters are automatically registered with requires_grad=True, and built-in layers such as nn.Linear come with a Kaiming-style default initialization.
  • Loss functions: PyTorch provides a variety of loss functions to choose from, depending on the task at hand. We use Cross-Entropy loss for our example below.
  • optim: PyTorch’s package for optimization algorithms including SGD, SGD with Momentum, and Adam.
  • DataLoader: Helps in creating mini-batches efficiently.

class TwoLayerNet(torch.nn.Module):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(784, 512)
        self.linear2 = torch.nn.Linear(512, 10)
        self.relu = torch.nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.linear1(x))
        x = self.linear2(x)
        return x

# Initialize the model
model = TwoLayerNet()

# Define the loss function
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Create mini-batches
mini_batches = torch.utils.data.DataLoader(data, batch_size=batch_size)

num_epochs = 10

# Train
for epoch in range(num_epochs):
    for (X_batch, y_batch) in mini_batches:

        # Forward Pass
        logits = model(X_batch)

        # Compute the loss 
        loss = criterion(logits, y_batch)

        # Backpropagate
        loss.backward()

        # Gradient descent step
        optimizer.step()

        # Set gradients to zero
        optimizer.zero_grad()

With these powerful APIs, PyTorch takes care of most of the grunt work, allowing us to focus on the higher-level logic.

This approach to writing neural networks has become the standard due to its clarity and efficiency, and you’ll notice that nearly all deep learning code follows a similar structure. Next, let’s apply this to create a model that can classify handwritten digits (MNIST Dataset).

MNIST

The MNIST dataset consists of 70,000 grayscale images of size (28, 28) pixels, featuring handwritten digits from 0 to 9, making up 10 classes.

The first step is to import the relevant libraries that we will be using throughout our code.

import torch
import torchvision
import matplotlib.pyplot as plt

Downloading & Pre-processing the dataset

Let’s download the dataset in two subsets: a training set containing 60,000 images and a test set containing 10,000 images. PyTorch provides transforms to convert these datasets into tensors. Each pixel value, originally in the range [0, 255], is divided by 255 during the conversion to tensors, resulting in values in the range [0, 1].

# Convert images to tensors
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

# Download the dataset
mnist_trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
mnist_testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

Loading the dataset

We randomly split the training data into a training set containing 80% of the images and a validation set containing the remaining 20%. We then create data loaders for each set, choosing a batch size of 64 for this example.

Remember, the test set is reserved for evaluation at the very end of our pipeline; we use a batch size of 1 for it since we only run inference on it and perform no gradient updates.

# Randomly split the dataset into train (80%) and validation (20%)
len_train = int(0.8 * len(mnist_trainset))
len_val = len(mnist_trainset) - len_train
train_dataset, val_dataset = torch.utils.data.random_split(mnist_trainset, [len_train, len_val])

# Data loaders for training, validation, and testing
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_testset, batch_size=1)
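Before training, it can help to peek at one batch from the loader to confirm the shapes we will be working with (a quick sanity-check sketch; the exact sizes assume the default MNIST tensors and our batch size of 64):

# Grab one mini-batch and inspect its shape
images, labels = next(iter(train_loader))
print(images.shape, labels.shape)   # torch.Size([64, 1, 28, 28]) torch.Size([64])

The images arrive as (batch, 1, 28, 28) tensors, which is why we flatten them to 784-dimensional vectors inside the training loop.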

Define the Architecture

Let’s use the same two-layer neural network that we’ve been discussing.

# select the device on which the model will train
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
print("Device: ", device)

# Define the model
class TwoLayerNet(torch.nn.Module):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        self.fc1 = torch.nn.Linear(28 * 28 * 1, 512)
        self.fc2 = torch.nn.Linear(512, 10)
        self.relu = torch.nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize the model and move it to the selected device
model = TwoLayerNet().to(device)

If you are using Google Colab, don’t forget to change the runtime to GPU to take advantage of hardware acceleration.

Train the model

The training loop looks exactly like before.

# Define the loss function
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,  momentum=0.9)

num_epochs = 10
train_loss, train_acc, val_acc = [], [], []

for epoch in range(num_epochs):

    # Training loop
    running_loss = 0
    correct, total = 0, 0 

    for (image, label) in train_loader:

        # Send the batch of images and labels to the GPU
        image, label = image.to(device), label.to(device)

        # Flatten the image
        image = image.view(image.shape[0], -1)

        # Forward pass and optimization
        logits = model(image)
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Track training loss
        running_loss += loss.item()

        # Get model predictions (category with max score)
        _, predicted = torch.max(logits, dim=1)

        # Track total labels and correct predictions
        total += label.shape[0]
        correct += (predicted == label).sum().item()

    train_loss.append(running_loss / len(train_loader))
    train_acc.append(correct / total)


    # Validation loop
    correct, total = 0, 0 

    for (image, label) in val_loader:

        # Send the batch of images and labels to the GPU
        image, label = image.to(device), label.to(device)

        # Flatten the image
        image = image.view(image.shape[0], -1)

        # Get model predictions
        logits = model(image)
        _, predicted = torch.max(logits, dim=1)

        # Track total labels and correct predictions
        total += label.shape[0]
        correct += (predicted == label).sum().item()
    
    val_acc.append(correct/total)

    print('\nEpoch: {}/{}, Train Loss: {:.4f}, Train Accuracy: {:.4f}, Val Accuracy: {:.4f}'.format(epoch + 1, num_epochs, train_loss[-1], train_acc[-1], val_acc[-1]))



# Plot training statistics
plt.figure(1)
plt.plot(range(1, num_epochs + 1), train_loss, label='Training Loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title('Loss vs Number of Epochs')
plt.legend()

plt.figure(2)
plt.plot(range(1, num_epochs + 1), train_acc, label='Training Accuracy')
plt.plot(range(1, num_epochs + 1), val_acc, label='Validation Accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.title('Accuracy vs Number of Epochs')
plt.legend()

plt.show()

Additionally, we plot training statistics to assess the model’s training performance. Typically, we visualize three curves:

  • Training loss vs Number of Epochs
  • Training accuracy vs Number of Epochs
  • Validation accuracy vs Number of Epochs

We calculate the training loss for the entire epoch by summing the average losses across batches and dividing by the number of batches in the training loader. To compute accuracy, we compare the model’s predictions with the actual labels. A similar approach is taken for validation, except we do not backpropagate the loss for weight updates.

The training loss should ideally decrease with each epoch, while both training and validation accuracy should increase. Below are the plots obtained after running the code above.

Test the model

We evaluate the model’s accuracy on unseen data using the same approach as in the validation loop.

# Get test accuracy
correct, total = 0, 0 

for (image, label) in test_loader:

    # Send image and label to the GPU
    image, label = image.to(device), label.to(device)

    # Flatten the image
    image = image.view(image.shape[0], -1)

    # Get model predictions
    logits = model(image)
    _, predicted = torch.max(logits, dim=1)

    # Track total labels and correct predictions
    total += label.shape[0]
    correct += (predicted == label).sum().item()

print("Test Accuracy: ", correct/total)

With this model, I achieved 97.5% accuracy on the test set after training for just 10 epochs!

This simple yet effective two-layer neural network demonstrates how accessible deep learning can be, enabling us to achieve impressive results with this tiny dataset.

Our deep learning cake is almost ready—now, it’s time to add the toppings!

 

References

[1] Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, JMLR 2010.

[2] He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015.