We’ve already coded our first neural network architecture from scratch and learned how to train it. Our deep learning cake is almost ready, but we still need the toppings to make it more appealing. In this part, we’re going to discuss the available toppings—concepts that enhance optimization and help us reach a better final solution for the model’s weights. The most important topping among them is Regularization.
Regularization
When optimizing, our goal is to find the specific set of weights that minimizes the loss on our training data, aiming for the highest possible accuracy on the test set. But in practice, this doesn’t always work as expected! When we focus solely on minimizing the loss on the training data—essentially trying to fit it perfectly—we run the risk of not generalizing well to the test data, a phenomenon known as overfitting.
To illustrate this, let’s consider an example. The blue dots in the figure below represent the training data. We fit two models: $m_1$ (a polynomial model) and $m_2$ (a linear model). Model $m_1$ has many more parameters, giving it a more complex decision boundary that fits the training data almost perfectly.

It’s important to remember that both the training and test sets are assumed to be sampled from a common dataset, represented as $p_{\text{data}}$. Our goal should be to fit this probability distribution and not just the training data alone, as this ensures good performance on unseen test data. Now, let’s examine the test set, represented by the yellow points in the next figure.

While $m_1$ fits the training data perfectly, it struggles with the test data because it overfits the training set. Model $m_2$, despite being simpler, performs better on the test set as it generalizes better to the underlying distribution $p_{\text{data}}$.
If a simpler model like $m_2$ performs better on the test set, you might wonder why we don’t always use smaller models. The reason is that a smaller model runs the risk of underfitting, where it fails to capture the complexities of the data.
Since the true distribution $p_{\text{data}}$ is unknown to us, our best approach is to use a larger, more flexible model and apply techniques to help it generalize well, and this is where regularization comes in. Regularization helps ensure that our model doesn’t overfit by encouraging it to find simpler decision boundaries.
There are several ways to regularize a model, and we’ll dive into each of these methods in detail next.
Weight Decay
The simplest way to make a deeper model behave more like a shallow one is to effectively reduce the influence of some weights by shrinking them. To discourage the model from becoming overly complex, we add a penalty to the loss function based on the magnitude of the model’s weights.
$$ L(\mathbf{W}) = \frac{1}{N} \sum_{i = 1}^N L_i(x_i, y_i, \mathbf{W}) + \lambda R(\mathbf{W}) $$
where $\lambda$ is the hyperparameter that controls the regularization strength, i.e., how much to shrink the weights. Since we minimize this loss function during optimization, the added term helps decay some weights in the model. $R(\mathbf{W})$, the regularization term, varies depending on the type of weight decay:
\begin{align} \text{L2: } & R(\mathbf{W}) = \sum_{k} \sum_{l} \mathbf{W}_{k,l}^2 \\ \text{L1: } & R(\mathbf{W}) = \sum_k \sum_l |\mathbf{W}_{k,l}| \\ \text{Elastic Net (L1 + L2): } & R(\mathbf{W}) = \sum_k \sum_l \beta \mathbf{W}_{k,l}^2 + |\mathbf{W}_{k,l}| \end{align}
L2 regularization shrinks weights in proportion to their size, meaning larger weights get shrunk more (since we add their square to the loss function). This type of regularization prefers to keep all available features but with smaller weight values.
L1 regularization shrinks all the weights by the same amount, with the smaller weights getting zeroed out. The intuition here is that small weights indicate that the feature associated with that weight is less important, so zeroing it out won’t significantly affect the output.
Let’s take a simple example to understand this better: \begin{align} \mathbf{x} &= [1, 1, 1, 1] \\ \mathbf{w}_1 &= [1, 0, 0, 0] \\ \mathbf{w}_2 &= [0.25, 0.25, 0.25, 0.25] \end{align} Both weight vectors produce the same output, $\mathbf{w}_1^T \mathbf{x} = \mathbf{w}_2^T \mathbf{x} = 1$. However, L1 regularization favors sparse solutions like $\mathbf{w}_1$ (it zeros out some weights), while L2 regularization prefers $\mathbf{w}_2$: spreading the weight evenly yields a smaller squared penalty ($0.25$ versus $1$) while retaining all the features. Since we usually want to keep all features active, L2 regularization is the more commonly used approach. A typical value for the L2 regularization parameter $\lambda$ is 1e-4.
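As a quick check, here’s a small PyTorch sketch (the variable names and printing are just for illustration) that computes the outputs and both penalties for these two weight vectors:

import torch

x = torch.tensor([1.0, 1.0, 1.0, 1.0])
w1 = torch.tensor([1.0, 0.0, 0.0, 0.0])
w2 = torch.tensor([0.25, 0.25, 0.25, 0.25])

# Both weight vectors produce the same output
print(torch.dot(w1, x).item(), torch.dot(w2, x).item())  # 1.0 1.0

# L2 penalty (sum of squares): clearly smaller for the spread-out w2
print(w1.pow(2).sum().item(), w2.pow(2).sum().item())    # 1.0 0.25

# L1 penalty (sum of absolute values): equal here, but L1's geometry pushes
# optimization toward sparse solutions like w1
print(w1.abs().sum().item(), w2.abs().sum().item())      # 1.0 1.0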
Let’s now look at the mechanics of weight decay. When we apply L2 regularization, it modifies the gradient update step as follows: \begin{align} L(\mathbf{w}) &= L_{\text{data}}(\mathbf{w}) + \color{blue} \lambda |\mathbf{w}|^2 \\ g_t &= \nabla L_{\text{data}}(\mathbf{w}_t) + \color{blue} 2 \lambda \mathbf{w}_t \\ s_t &= \text{optimizer} (g_t) = \text{optimizer} (\mathbf{dw} + {\color{blue} 2 \lambda \mathbf{w}_t} ) \\ \mathbf{w}_{t+1} &= \mathbf{w}_t - \eta s_t \end{align}
The blue term represents the weight decay, and it can be incorporated directly into our optimizer. In PyTorch, this is done using the weight_decay argument.
# Define the optimizer with L2 regularization
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
While L2 regularization and weight decay are often used interchangeably for SGD and SGD with momentum, they diverge in behavior when applied to adaptive methods like AdaGrad, RMSProp, and Adam. Let’s explore why.
Decoupled Weight Decay
Consider adding L2 regularization to Adam:
\begin{align} m_1 &= \beta_1 m_1 + (1 - \beta_1) (\mathbf{dw} + {\color{blue} 2 \lambda \mathbf{w}_t} ) \\ m_2 &= \beta_2 m_2 + (1 - \beta_2) (\mathbf{dw} + {\color{blue} 2 \lambda \mathbf{w}_t} )^2 \\ {m_1}_{\text{unbias}} &= m_1 / (1 - \beta_1^t) \\ {m_2}_{\text{unbias}} &= m_2 / (1 - \beta_2^t) \\ \mathbf{w}_{t+1} &= \mathbf{w}_{t} - \eta * \frac{{m_1}_{\text{unbias}}}{\sqrt{{m_2}_{\text{unbias}}} + \epsilon} \end{align}
By substituting the unbiased value of $m_1$ into the update step, we get:
\begin{align} \mathbf{w}_{t+1} &= \mathbf{w}_{t} - \eta * \frac{\beta_1 m_1 + (1 - \beta_1) (\mathbf{dw} + {\color{blue} 2 \lambda \mathbf{w}_t} )}{(\sqrt{{m_2}_{\text{unbias}}} + \epsilon) (1 - \beta_1^t)} \end{align}
Notice how the blue term is normalized by the square root of the unbiased second moment, which tracks the sum of squared gradients. As a result, if the gradients for a particular weight are large, that weight will be regularized less than weights with smaller gradients. This means that the effect of regularization becomes dependent not only on the size of the weights but also on the size of the gradients.
This leads to unintended behavior, making L2 regularization less effective than it should be. In contrast, with SGD and SGD with Momentum, L2 regularization works as expected: \begin{align} \text{SGD: } & \Delta \mathbf{w}^{t} = - \eta * (\mathbf{dw} + {\color{blue} 2 \lambda \mathbf{w}_t} ) \\ \text{SGD + Momentum: } & \Delta \mathbf{w}^{t} = - \eta * (\mathbf{dw} + {\color{blue} 2 \lambda \mathbf{w}_t} ) + m * \Delta \mathbf{w}^{t-1} \end{align}
To fix this issue in Adam, a variant called AdamW [1] was introduced. In this variant, the weight decay term is decoupled from the optimizer and added directly to the update step.
\begin{align} L(\mathbf{w}) &= L_{\text{data}}(\mathbf{w}) \\ \mathbf{dw} = g_t &= \nabla L_{\text{data}}(\mathbf{w}_t) \\ s_t &= \text{optimizer} (g_t) \\ \mathbf{w}_{t+1} &= \mathbf{w}_t - \eta (s_t + {\color{orange} 2 \lambda \mathbf{w}_t} ) \end{align}
# AdamW
m1 = 0
m2 = 0
for t in range(1, num_steps):
    # dw: gradient of the data loss only (no L2 term folded in)
    m1 = (beta1 * m1) + (1 - beta1) * dw
    m2 = (beta2 * m2) + (1 - beta2) * dw * dw
    m1_unbias = m1 / (1 - beta1 ** t)
    m2_unbias = m2 / (1 - beta2 ** t)
    s = m1_unbias / (m2_unbias.sqrt() + 1e-7)
    # Weight decay is applied directly in the update step, outside the moment estimates
    w += -learning_rate * (s + (weight_decay * w))
This adjustment ensures that the weight decay term remains unaffected by the moving averages, keeping it directly proportional to the weights themselves.
AdamW has been shown to significantly improve the generalization performance of the standard Adam optimizer. It is generally the preferred choice when we want effective regularization with adaptive optimizers.
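In PyTorch, switching to decoupled weight decay is just a matter of using torch.optim.AdamW; the hyperparameter values below are typical defaults rather than tuned choices:

# Define the AdamW optimizer with decoupled weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)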
Dropout
Another powerful regularization technique is Dropout [2], where we randomly set a fraction of neurons in a layer to zero during each forward pass. The dropout rate (probability of dropping a neuron) is a hyperparameter, typically set to 0.5.

By zeroing out neurons, the network is forced to avoid relying too heavily on any individual neuron or set of neurons, encouraging the model to distribute learning across different parts of the network. This approach effectively simulates training a large number of different sub-networks (ensemble), each with a unique structure but sharing weights.
In terms of interpretation, dropout creates a more robust network by introducing redundancy in the learned features. Since neurons are randomly excluded during each forward pass, the model must learn multiple independent representations of the data. This redundancy helps prevent overfitting, as the model doesn’t rely on specific patterns that might only exist in the training data, allowing it to generalize better to unseen data.
Inverted dropout
Dropout introduces randomness into the model during training, which can make prediction outputs non-deterministic. However, during inference, we want consistent predictions. To handle this, we use inverted dropout, which rescales the output of the active neurons during training to compensate for the dropped ones.
When neurons are randomly dropped, the outputs of the remaining neurons are scaled up by the inverse of the dropout rate to maintain the overall magnitude of activations. For instance, if half of the neurons are dropped (dropout rate = 0.5), we double the outputs of the remaining neurons during training to compensate.
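To make the mechanics concrete, here is a minimal sketch of inverted dropout applied to an activation tensor (an illustration only, not how torch.nn.Dropout is implemented internally):

import torch

def inverted_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    if not training or p == 0.0:
        return x  # at inference time the layer is a no-op
    # Keep each activation with probability (1 - p) ...
    mask = (torch.rand_like(x) > p).float()
    # ... and scale the survivors by 1 / (1 - p) so the expected activation stays the same
    return x * mask / (1 - p)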
Dropout is most commonly applied to fully connected layers, as these layers tend to contain the majority of learnable parameters, making them more prone to overfitting. Here’s how you can add it to your network:
class TwoLayerNet(torch.nn.Module):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        self.fc1 = torch.nn.Linear(28 * 28 * 1, 512)
        self.fc2 = torch.nn.Linear(512, 10)
        self.dropout = torch.nn.Dropout(p=0.5)  # Added dropout
        self.relu = torch.nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
Early Stopping
Another way to prevent overfitting is to stop training when the validation accuracy starts to decrease. While training, we typically plot three curves:
- Training loss vs Number of iterations
- Training accuracy vs Number of iterations
- Validation accuracy vs Number of iterations

These curves provide insight into the “health” of the model. Ideally, training loss should decay exponentially with each iteration, while both training and validation accuracy should increase. A large gap between the training and validation accuracy suggests overfitting, whereas little to no gap indicates underfitting.
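A minimal sketch of such a plot, assuming the loss and accuracies have been recorded in plain Python lists during training (matplotlib is used here purely for visualization):

import matplotlib.pyplot as plt

def plot_training_curves(train_losses, train_accs, val_accs):
    # Each argument is a list of values recorded during training, one entry per iteration
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(train_losses)
    ax1.set_xlabel("Iteration")
    ax1.set_ylabel("Training loss")
    ax2.plot(train_accs, label="Train accuracy")
    ax2.plot(val_accs, label="Validation accuracy")
    ax2.set_xlabel("Iteration")
    ax2.set_ylabel("Accuracy")
    ax2.legend()
    plt.show()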
At each iteration, we save the model’s weights as checkpoints and later select the checkpoint where the validation accuracy is at its maximum. Training beyond this point might increase the training accuracy (as the model starts overfitting), but it could perform poorly on the test set because validation accuracy reflects the model’s ability to generalize.
This technique is called early stopping because it prevents the model from training all the way to the full number of iterations (or num_epochs hyperparameter), stopping earlier to capture the best version of the model without overfitting.
import os
from typing import IO, BinaryIO

# Save checkpoint
def save_checkpoint(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    iteration: int,
    out: str | os.PathLike | BinaryIO | IO[bytes],
):
    checkpoint = {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "iter": iteration,
    }
    torch.save(checkpoint, out)
# Load checkpoint
def load_checkpoint(
    src: str | os.PathLike | BinaryIO | IO[bytes],
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
) -> int:
    checkpoint = torch.load(src)
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["iter"]
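Putting it together, here is a sketch of how early stopping might use these helpers; train_one_step, evaluate_accuracy, and the checkpoint path are placeholders, not functions defined earlier:

best_val_acc = 0.0
for iteration in range(num_iterations):
    train_one_step(model, optimizer)                # placeholder: the usual forward/backward/step
    val_acc = evaluate_accuracy(model, val_loader)  # placeholder: accuracy on the validation set
    if val_acc > best_val_acc:
        # Keep only the checkpoint with the best validation accuracy so far
        best_val_acc = val_acc
        save_checkpoint(model, optimizer, iteration, "best_model.pt")

# After training, restore the best-performing weights
load_checkpoint("best_model.pt", model, optimizer)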
Data Manipulation
The next topping is data manipulation, where we explore how to process and modify the training dataset to make it more suitable for efficient learning.
Preprocessing
At every step of the optimization algorithm, the loss function is computed on a mini-batch of the training data. Therefore, the shape of the loss landscape, which the optimization algorithm navigates, depends directly on how the training data is structured. To make this landscape more favorable for optimization, we often preprocess the data before feeding it into the neural network.

A common preprocessing step is normalizing the data, which zero-centers the data and ensures all features have the same variance. This step helps smooth out the optimization process by making the model less sensitive to small weight changes.
Pixel values range from 0 to 255 and are divided by 255 when converted to tensors, resulting in values in the range [0, 1]. We then normalize these values with a mean of 0.5 and a standard deviation of 0.5 to obtain pixel values in the range [-1, 1]. This can be included in the list of transforms in PyTorch:
# Apply transforms to dataset
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),                # Convert image to Tensor
    torchvision.transforms.Normalize((0.5,), (0.5,))  # Normalize with mean=0.5, std=0.5
])
The same transform is also applied to the test set to maintain consistency.
Other standard preprocessing techniques for image data include:
- Subtracting the mean image
- Subtracting the per-channel mean (mean along each color channel)
- Subtracting the per-channel mean and dividing by the per-channel standard deviation (sketched below)
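For three-channel (RGB) images, the last option could look like this; the mean and standard deviation values below are made-up placeholders that would normally be computed from your own training set:

# Per-channel normalization for RGB images (statistics are example values, not computed)
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(
        mean=(0.49, 0.48, 0.44),  # per-channel mean of the training set
        std=(0.25, 0.24, 0.26),   # per-channel standard deviation of the training set
    ),
])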
Augmentation
Data augmentation is a widely-used technique that increases both the size and diversity of the training data by applying label-preserving transformations to the input images. Not only does this artificially expand the dataset, but it also adds regularization by introducing noise in the training data, which helps prevent overfitting.

In computer vision, augmentations like flipping, rotating, scaling, blurring, jittering, and cropping are widely used as an implicit form of regularization. Augmentations also encode invariances into the model—this means we teach the model to learn that certain transformations shouldn’t change the output.
Think about your specific task: what transformations should not change the desired network output? The answer will vary depending on the problem. For instance, a vertical flip might make sense for images of white blood cells but not for car images (as cars are never upside down in real-world scenarios).
Augmentations can also be included in the transforms list:
# Apply transforms to dataset
transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomRotation(degrees=15),  # Randomly rotate by +/- 15 degrees
    torchvision.transforms.ToTensor(),                  # Convert image to Tensor
    torchvision.transforms.Normalize((0.5,), (0.5,))    # Normalize with mean=0.5, std=0.5
])
Note that augmentations are typically excluded from the test set to ensure the model is evaluated on the original data distribution.
Learning Rate Schedules
All the optimizers we’ve discussed so far, including SGD, SGD with Momentum, Adagrad, RMSProp, and Adam, use a fixed learning rate as a hyperparameter to guide the search for the global minimum. As you may have learned by now, this learning rate is a crucial variable in our learning process.
Different learning rates produce different learning behaviors, as shown in the figure below. Therefore, it’s essential to choose an appropriate learning rate, ideally the red one. However, finding that one “perfect” learning rate through trial and error is not always feasible.

What if we don’t keep the learning rate fixed and instead change it during the training process? We can start with a high learning rate to allow our optimization to make quick progress in the initial iterations and then gradually decay it over time, ensuring that the model converges to a lower loss at the end. This would lead to faster convergence and better performance characteristics.
This mechanism of changing the learning rate over the course of training is called learning rate scheduling. Let’s look at some commonly used learning rate schedules.
Step Schedule
In a step schedule, we start with a high learning rate (similar to the green curve in the figure), and when the loss curve starts to plateau, we decrease the learning rate. This process continues until convergence.
For example, we might reduce the learning rate by a factor of 0.1 after epochs 30, 60, and 90. The learning curve would then look something like this.

Since we’re decaying the learning rate at arbitrary points during training, this schedule introduces additional hyperparameters, like the number of steps and when to decay. A common approach is to monitor the loss and decay the learning rate whenever the loss plateaus.
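In PyTorch, a step schedule like this can be expressed with MultiStepLR (the milestones and decay factor below mirror the example above), while ReduceLROnPlateau is the built-in option for decaying whenever a monitored metric plateaus:

# Drop the learning rate by a factor of 0.1 after epochs 30, 60, and 90
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# Alternative: decay automatically when the validation loss stops improving
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)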
Decay Schedule
Instead of selecting fixed points to adjust the learning rate, we can define a function that dictates how the learning rate should decay over time. This eliminates the need for extra hyperparameters. Starting from an initial rate, these functions gradually reduce it over time.
Here are some commonly used decay functions; a small code sketch implementing them follows the list:
- Cosine schedule: This is a popular choice for computer vision problems. The learning rate follows a cosine function that smoothly decays over time. If $\eta_0$ is the initial learning rate and $T$ is the total number of epochs, the learning rate at epoch $t$ is given by:
$$ \eta_t = \frac{\eta_0}{2} \left( 1 + \cos \frac{t \pi}{T} \right) $$
- Linear schedule: In this approach, the learning rate decays linearly over time, which is shown to work well for language models.
$$ \eta_t = \eta_0 (1 - \frac{t}{T}) $$
- Inverse square root schedule: Commonly used in training transformer models, this schedule follows an inverse square root decay. One drawback is that the model spends very little time at the higher learning rate.
$$ \eta_t = \frac{\eta_0}{ \sqrt{t} } $$
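As a rough sketch, the three decay rules can be written as plain Python functions of the epoch t, the initial learning rate eta0, and the total number of epochs T:

import math

def cosine_lr(t: int, eta0: float, T: int) -> float:
    # eta_t = eta0 / 2 * (1 + cos(t * pi / T))
    return eta0 / 2 * (1 + math.cos(t * math.pi / T))

def linear_lr(t: int, eta0: float, T: int) -> float:
    # eta_t = eta0 * (1 - t / T)
    return eta0 * (1 - t / T)

def inv_sqrt_lr(t: int, eta0: float) -> float:
    # eta_t = eta0 / sqrt(t), defined for t >= 1
    return eta0 / math.sqrt(t)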

Cyclic Schedule
In addition to decaying the learning rate monotonically, we can adopt a cyclic learning rate schedule, which alternates between high and low learning rates during training. These schedules help prevent the optimization process from getting stuck in local minima.
In this approach, the learning rate decreases smoothly within each cycle, following a cosine decay curve. Once the cycle completes, the learning rate “warms up” by resetting to a higher value. This method, also known as Warm Restarts [3], allows the optimizer to periodically explore different regions of the loss landscape.
Cyclic schedules are particularly useful for training models on large datasets, as the periodic warm restarts enhance exploration, reducing the chances of premature convergence to suboptimal solutions.
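PyTorch provides this schedule as CosineAnnealingWarmRestarts; the cycle lengths below are illustrative values rather than recommendations:

# Cosine decay with warm restarts: the first cycle lasts 10 epochs, and each new cycle doubles in length
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)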

Updating the code
Let’s modify the training loop to include a learning rate scheduler: cosine decay.
# Define the loss function
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Define the cosine annealing learning rate scheduler
num_epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # Training loop
    for (image, label) in train_loader:
        # Send the batch of images and labels to the GPU
        image, label = image.to(device), label.to(device)

        # Flatten the image
        image = image.view(image.shape[0], -1)

        # Forward pass and optimization
        logits = model(image)
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Update the learning rate at the end of each epoch
    scheduler.step()
The .step() function updates the learning rate based on the cosine decay schedule at the end of each epoch.
Using learning rate schedules with optimizers like SGD or SGD with Momentum is highly recommended for improving training efficiency and model performance. However, for adaptive methods like AdamW, a constant learning rate often works well, as these methods automatically adjust learning rates on a per-parameter basis.
Our deep learning cake is now complete; it’s time to dig in and savor the delicious knowledge we’ve baked together. Enjoy every slice as you continue your journey in this exciting field!
References
[1] Loshchilov and Hutter, “Decoupled Weight Decay Regularization”, ICLR 2019.
[2] Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR 2014.
[3] Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017.