While there is a wealth of deep learning content scattered across the internet, I wanted to create a one-stop solution where you can find all the fundamental concepts needed to write your own neural network from scratch.
This series is inspired by Justin Johnson’s course, and Stanford’s CS231n. I would also highly recommend watching Andrej Karpathy’s videos.
Image Classification
Before diving into the details, let’s start with the basics: image classification. This is a core task in computer vision where a model takes an image as input and assigns it a category label from a fixed set of predefined categories.
For us humans, this is a trivial task, but for computers, it’s much more complex. An image, represented as $(C, H, W)$, is essentially a grid of numbers ranging from 0 to 255, where each value defines the brightness of a pixel. Here, $H$ is the height, $W$ is the width, and $C$ is the number of channels, which is 3 for an RGB image. This makes it difficult for a computer to “see” the image in the same way we do.
One option is to encode our human knowledge about objects and patterns into the model to help recognize images. However, a more robust and scalable approach is data-driven: we train a model that learns how to recognize different types of objects in images from data. The pipeline looks something like this:
- Collect a dataset of images and their corresponding labels.
- Use deep learning (DL) to train a classifier.
- Evaluate the classifier on new, unseen images.
Dataset
The dataset is usually divided into 3 parts: training set, validation set and test set. We use the training set to train our model.
Hyperparameters are variables that must be set before training begins, and their values cannot be derived from the training data. Since these parameters are highly problem-dependent, they are usually selected through trial and error. After training, we evaluate our trained model on the validation set and use it to fine-tune the hyperparameters, selecting the values that yield the best performance on this set.

Once everything is finalized, we evaluate the model on the test set only once, at the very end. This gives us an unbiased estimate of how the model will perform on truly unseen data, helping us predict its real-world performance.
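To make the split concrete, here is a minimal sketch of partitioning a dataset into the three sets with NumPy; the 80/10/10 ratio and the array names (`images`, `labels`) are illustrative assumptions, not fixed rules.
# Splitting a dataset into train / validation / test sets (illustrative sketch)
import numpy as np

num_samples = images.shape[0]                 # images: (N, 3, 32, 32), labels: (N,) -- assumed arrays
indices = np.random.permutation(num_samples)  # shuffle before splitting

n_train = int(0.8 * num_samples)              # e.g., 80% for training
n_val = int(0.1 * num_samples)                # e.g., 10% for validation, the rest for testing

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

x_train, y_train = images[train_idx], labels[train_idx]
x_val, y_val = images[val_idx], labels[val_idx]
x_test, y_test = images[test_idx], labels[test_idx]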
Linear Classifier
Now, let’s define our first and most basic model: a linear classifier (also known as a perceptron). Suppose we have a dataset, $\{x_i, y_i\}_{i=1}^N$, where $x_i$ is an RGB image of size $(3, 32, 32)$ and $y_i$ is its corresponding label (integer) across three categories: cat (label=0), car (label=1), and frog (label=2).

We feed the image to our model as a flattened vector of shape $(3072, 1)$. The linear classifier consists of a weight matrix $W$ and a bias vector $b$ (offset). In practice, we use the bias trick: we append a constant 1 to the bottom of the input vector and fold the bias into the weight matrix, giving a single combined matrix $\mathbf{W}$.
$$ f(\mathbf{x}, \mathbf{W}) = W \mathbf{x} + b = \underbrace{\begin{bmatrix} W & b \end{bmatrix}}_{\mathbf{W}} \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} $$
The model outputs a score vector of size $(3, 1)$, which defines the class scores for each of the 3 predefined categories.

Think of these scores as the model’s confidence levels—it’s basically saying how sure it is that the image belongs to each specific category. The class with the highest score becomes our predicted label $\hat{y}_i$. The scores of the true class are highlighted in bold.
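In code, scoring a single image with the bias trick looks roughly like this; a minimal sketch with random weights and an arbitrary input, so the numbers themselves are meaningless.
# Linear classifier: scoring a single image (illustrative sketch)
import numpy as np

x = np.random.rand(3072, 1)               # flattened (3, 32, 32) image
W = np.random.randn(3, 3073) * 0.01       # combined weight matrix with the bias folded in

x_aug = np.vstack([x, [[1.0]]])           # append a 1 at the bottom (bias trick) -> (3073, 1)
scores = W @ x_aug                        # class scores, shape (3, 1)
predicted_label = int(np.argmax(scores))  # 0 = cat, 1 = car, 2 = frog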
Clearly, the predictions of our classifier are not always correct. This brings us to our first essential concept in deep learning: the loss function.
Loss function
A loss function helps us quantify the performance of our model on the data. Naturally, the lower the loss, the better our classifier is. It is also called an objective function or a cost function.

We compute the loss for each input image and simply take an average across the training set to compute the final loss value.
$$ L(\mathbf{W}) = \frac{1}{N} \sum_{i=1}^N L_i \left( f(x_i, \mathbf{W}), y_i \right) $$
SVM Loss
The idea behind Support Vector Machine (SVM) loss is pretty straightforward: we want the score of the correct class to be significantly higher than the scores given to all the incorrect classes. Makes sense, right? We want our classifier to be confident in its predictions, meaning it should assign a higher score to the correct category, compared to the wrong ones, by a certain margin.
The SVM loss has the form, $$ L_i = \sum_{j \neq y_i} \text{max}(0, \underbrace{s_j}_{\text{score of jth class}} - \underbrace{s_{y_i}}_{\text{score of true class}} + \underbrace{1}_{\text{margin}}) $$

It’s also called Hinge Loss because its shape looks just like a door hinge. When the score for the correct class exceeds every other score by at least the margin, the loss becomes zero. Otherwise, it increases linearly. A key point to note is that the sum runs only over the incorrect categories; the correct class is skipped, since including it would simply add a constant margin of 1 to every example.

Let’s compute the SVM loss for our toy example. For the cat image, the score of the true class is $s_{y_i} = 3.2$. The loss terms for the car and frog classes are calculated as, $$ L_1 = \text{max}(0, 5.1 - 3.2 + 1) + \text{max}(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9 $$
Similarly, for the other two training examples, \begin{align} L_2 &= \text{max}(0, 1.3 - 4.9 + 1) + \text{max}(0, 2 - 4.9 + 1) = 0 \\ L_3 &= \text{max}(0, 2.2 + 3.1 + 1) + \text{max}(0, 2.5 + 3.1 + 1) = 6.3 + 6.6 = 12.9 \end{align}
The final loss of our classifier is the average of all the losses: $L_{\text{svm}} = 5.27$.
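The same calculation in code, as a small NumPy sketch; the car and frog score vectors are read off from the worked arithmetic above.
# Multiclass SVM (hinge) loss for the toy example
import numpy as np

scores = np.array([[3.2, 5.1, -1.7],   # cat image
                   [1.3, 4.9,  2.0],   # car image
                   [2.2, 2.5, -3.1]])  # frog image
labels = np.array([0, 1, 2])           # true classes: cat, car, frog

N = scores.shape[0]
true_scores = scores[np.arange(N), labels][:, None]  # (N, 1) scores of the true classes
margins = np.maximum(0, scores - true_scores + 1.0)  # hinge terms with margin = 1
margins[np.arange(N), labels] = 0                    # the true class is excluded from the sum
losses = margins.sum(axis=1)                         # [2.9, 0.0, 12.9]
svm_loss = losses.mean()                             # ≈ 5.27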
Cross-Entropy Loss
Cross-Entropy Loss, also known as Logistic Loss, is one of the most commonly used loss functions in deep learning. It helps interpret raw classifier scores as probabilities, which makes sense because we want our model’s output to reflect how likely it thinks a certain class is correct.
To do this, we run the raw scores through an exponential function to ensure all the outputs are positive (since probabilities are always non-negative). After that, we normalize them so they sum to 1, giving us a valid probability distribution over all the categories. This entire transformation is known as the softmax function. The image below shows how the softmax function works on our cat image scores.

The term “softmax” comes from the fact that it’s a smooth, differentiable approximation to the (one-hot) arg max function. If we were to apply a hard arg max directly to the raw scores—like in our cat image example—it would output something like $[0, 1, 0]$, which is a valid probability distribution. But the problem is, the arg max function isn’t differentiable, which means we can’t use it to train neural networks (you’ll see why shortly). This makes softmax a go-to tool in deep learning when you need to highlight the maximum value while ensuring the function remains differentiable.
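Here is the transformation in code, as a minimal sketch on the cat image scores from the example.
# Softmax on the cat image scores
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # raw class scores for the cat image
exp_scores = np.exp(scores)            # exponentiate: everything becomes positive
probs = exp_scores / exp_scores.sum()  # normalize so the values sum to 1
# probs ≈ [0.13, 0.87, 0.00]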
The softmax function provides our estimated probability distribution, denoted by $q$. The true or desired distribution is one where all the probability mass is concentrated on the correct class, i.e., $p = [0, \dots, 1, \dots, 0]$, with a single 1 at the $y_i$th position (also known as a one-hot encoded vector). The difference between the two probability distributions is measured using cross-entropy. The cross-entropy between a true distribution $p$ and an estimated distribution $q$ is defined as: \begin{equation} H(p,q) = - \sum_x p(x) \log q(x) \end{equation}
Thus, the cross-entropy loss is simply the negative log of the predicted probability for the true class. For our cat image example, the loss would be $L_1 = -\log(0.13) = 2.04$.
The cross-entropy loss has the form, \begin{equation} L_i = -\text{log} \underbrace{\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)}_{\text{softmax function on score of true class}} \end{equation}
Some important things to note:
- The loss can only reach zero if the softmax output is a perfect one-hot vector, which would require the true-class score to be infinitely larger than all the others. Therefore, achieving $L_{\text{ce}} = 0$ is not feasible in practice.
- On initialization, when the weights are initialized with small random values (we will get to this in the next part), all classes will be equally likely. The loss would be $L_{\text{ce}} = - \log(\frac{1}{c}) = \log(c)$ where $c$ is the number of categories. This serves as a quick check: if you observe this loss value at the start of training, it’s a sign that everything is working as expected.
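As a quick numeric confirmation of that sanity check for our 3-class example:
# Expected initial loss when all classes are equally likely
import numpy as np

num_classes = 3
initial_loss = -np.log(1.0 / num_classes)  # = log(3) ≈ 1.0986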
Exponentials of large logits can easily overflow (e.g., exp(1000) → ∞). To avoid this, we use the fact that the softmax function is invariant to adding any constant to all inputs. We can therefore subtract the maximum logit per sample before exponentiating — keeping values within a stable numerical range.
Additionally, we can use the log-sum-exp trick for numerical stability: \begin{align} L_i &= -\log \left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right) \\ &= - \log (e^{s_{y_i}}) + \log \left(\sum_j e^{s_j} \right)\\ &= - \underbrace{s_{y_i}}_{\text{score of true class}} + \underbrace{\log \left(\sum_j e^{s_j} \right)}_{\text{logsumexp}} \end{align}
# Cross-Entropy Loss
import numpy as np

# logits: (N, C) [raw class scores from the model for N samples]
# targets: (N,) [true class labels]
N = logits.shape[0]
losses = []
for i in range(N):
    # Step 1: Subtract the largest logit for numerical stability
    max_val = np.max(logits[i])
    logits_shifted = logits[i] - max_val
    # Step 2: Compute the log-sum-exp term
    logsumexp = np.log(np.sum(np.exp(logits_shifted)))
    # Step 3: Gather the (shifted) true class logit
    true_class_score = logits_shifted[targets[i]]
    # Step 4: Compute the loss for this sample
    loss_i = -true_class_score + logsumexp
    losses.append(loss_i)
# Step 5: Mean over all samples
loss = np.mean(losses)
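The per-sample loop above can also be written in a fully vectorized form; a minimal sketch under the same assumptions about `logits` and `targets`.
# Cross-Entropy Loss (vectorized)
import numpy as np

shifted = logits - logits.max(axis=1, keepdims=True)              # stability shift per sample
logsumexp = np.log(np.exp(shifted).sum(axis=1))                   # (N,)
true_class_scores = shifted[np.arange(logits.shape[0]), targets]  # (N,)
loss = np.mean(-true_class_scores + logsumexp)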
The loss function tells us how well our current classifier is doing (with its existing weights) on the training data. Since we’re all a bit greedy, we aim to find that golden set of weights that minimizes this loss. This brings us to the second key concept in deep learning: optimization.
Optimization
Optimization is the process of utilizing the training data to search through all possible values of $\mathbf{W}$ to find the best one that results in the minimum loss.
Intuitively, it is like traversing a large high-dimensional landscape, where every point (x, y) on the ground represents the value of the weight matrix $\mathbf{W}$ and the height of that point is the value of loss function $L(\mathbf{W})$. Our goal is to navigate to the bottom-most point of this terrain, where we’ll find the weights that yield the minimal loss.
Since we don’t have the exact equation of the landscape, computing the minimum of the curve is not straightforward (it’s not a convex optimization problem). Instead, we take iterative, small steps towards the direction of the local minimum, hoping to eventually reach the lowest point. The derivative of a function gives us the slope, so this direction of steepest descent will be the negative gradient of the loss function with respect to the weights. This is the famous gradient descent algorithm.
Gradient Descent
In the algorithm below, commonly referred to as the vanilla version of gradient descent, we start by initializing the weights with some arbitrary values. Then, we loop for a fixed number of iterations, where for each iteration, we compute the gradient for our current weights and take a step in the direction of the negative gradient to update the weight values.
# Vanilla Gradient Descent
w = initialize_weights()
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)
    w += - learning_rate * dw
The size of each step is a hyperparameter that controls how much we move in the negative gradient direction during each iteration of the optimization process. In other words, it controls how fast our network learns, which is why it’s referred to as the learning rate.
A higher learning rate means taking larger steps toward the negative gradient, which might lead to faster convergence but can also result in unstable behavior. On the other hand, lower learning rate is more stable but may take longer to converge. Therefore it is crucial to select an optimal learning rate.
The number of iterations determines how many update steps we take before stopping. Generally, more iterations lead to better results, since each step moves us closer to the minimum, so in practice this hyperparameter is constrained mainly by computational resources and time.
Just as we averaged the losses over all images in the training data to compute the overall loss, we also take the average of individual gradients when calculating the overall gradient.
\begin{align} L(\mathbf{W}) &= \frac{1}{N} \sum_{i = 1}^N L_i(x_i, y_i, \mathbf{W}) \\ \mathbf{dw} = \frac{\partial L}{ \partial \mathbf{W}} &= \frac{1}{N} \sum_{i=1}^N \frac{\partial}{\partial \mathbf{W}} L_i(x_i, y_i, \mathbf{W}) \end{align}
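As a concrete illustration, here is what `compute_gradient` could look like for the linear classifier with cross-entropy loss; the standard analytic result is that each sample contributes $(q - p)\,x_i^\top$ to the gradient, but the array shapes and the simplified signature below are assumptions of this sketch, not code from the text.
# Average gradient of the cross-entropy loss for a linear classifier (sketch)
import numpy as np

def compute_gradient(X, y, W):
    # X: (N, D) flattened inputs (bias trick applied), y: (N,) labels, W: (C, D) weights
    scores = X @ W.T                                      # (N, C) class scores
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    probs[np.arange(X.shape[0]), y] -= 1.0                # softmax output minus one-hot target
    return probs.T @ X / X.shape[0]                       # average gradient, shape (C, D)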
This process becomes computationally expensive and slow when dealing with a huge training set ($N$ is very large), as just one step of gradient descent involves looping over the entire training dataset.
In practice, this version of gradient descent is not very feasible, as we almost always have large datasets. Instead of computing the gradient over all examples, we approximate its value by computing it over small sub-samples of the training set. The intuition is that the examples in the training data are correlated, so the gradient computed over a batch of the training data is a good approximation of the gradient of the full objective.
We split the training dataset into mini-batches, each containing $B$ examples, and use these mini-batches to perform weight update steps. This approach allows for much faster convergence because we can perform more frequent parameter updates using the mini-batch gradients. This variant is known as Mini-batch Stochastic Gradient Descent (SGD).
# Mini-batch Gradient Descent
w = initialize_weights()
batches = split(data, batch_size)   # randomly split the data into mini-batches of size B
for t in range(num_steps):
    for mini_batch in batches:
        dw = compute_gradient(loss_fn, mini_batch, w)
        w += - learning_rate * dw
The samples are drawn randomly, and the batch size is another hyperparameter often selected from values like 32, 64, 128, or 256—commonly powers of 2. This is because many vectorized operations perform more efficiently when their inputs are sized in powers of 2. A larger batch size provides a better estimate of the true gradient and is often chosen as large as possible until you run out of GPU memory.
The extreme case where the mini-batch contains only a single example ($B = 1$), represents a pure Stochastic Gradient Descent. In this approach, we update the parameters more frequently by taking one observation at each iteration, resulting in a much more erratic learning curve compared to the mini-batch version. This is why mini-batch SGD is generally preferred and is almost always used in practice.
Mini-batch SGD, while effective, does have some challenges and limitations:
- Loss Landscape Variation: If the loss landscape changes rapidly in one direction and slowly in another, a constant learning rate would cause the algorithm to progress slowly along the shallow direction while jittering along the steep direction. Although a smaller learning rate might stabilize the optimization in steeper regions, it would significantly reduce progress in the shallow regions, leading to slower overall convergence.
- Local Minima and Saddle Points: When the loss function has local minima or saddle points, the gradients at these locations are zero. As a result, gradient descent can get stuck and be unable to escape these points.
- Noisy Gradients: Since our gradients are computed from mini-batches, they can be noisy, which may affect the stability and convergence of the optimization process.
A solution to these problems is a technique called Momentum [1], which draws inspiration from physics.
Momentum
Instead of relying solely on the gradient of the current step to guide the search, momentum also accumulates the gradients from past steps to determine the direction of movement. This helps the optimization process build speed in directions with consistent gradients while smoothing out updates. As a result, when the gradient updates ($\mathbf{dw}$) consistently point in the same direction, we build momentum and quickly descend the surface, potentially escaping local minima or saddle points along the way.
\begin{align} \text{SGD: } & \Delta \mathbf{w} = - \eta * \mathbf{dw} \\ \text{SGD + Momentum: } & \Delta \mathbf{w}^{t} = - \eta * \mathbf{dw} + m * \Delta \mathbf{w}^{t-1} \end{align}
Momentum helps address the three challenges of mini-batch SGD:
- Loss Landscape Variation: With a smaller learning rate, we can avoid jittering in steep areas, while also maintaining a faster pace in shallow regions. As we progress through flatter areas, we build velocity, which speeds up convergence without increasing the learning rate.
- Local Minima and Saddle Points: By accumulating previous gradients, we ensure there is always an update to the weights, even when gradients temporarily become small or zero. This enables the optimizer to push through saddle points or escape shallow local minima.
- Noisy Gradients: Momentum smooths out noisy gradient updates by averaging them over time, resulting in more stable updates. This reduces the erratic behavior caused by mini-batch noise and leads to smoother convergence overall.
# SGD + Momentum
w = initialize_weights()
v = 0
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)
    v = - (learning_rate * dw) + (m * v)
    w += v
In the algorithm above, the variable $v$ accumulates velocity as a running mean of gradients, while $m$ serves as a friction or decay factor. To understand this concept better, consider two extreme cases:
- No momentum ($m = 0$): The algorithm behaves exactly like standard gradient descent.
- Maximum momentum ($m = 1$): The updates oscillate indefinitely, like a ball rolling in a frictionless bowl, failing to converge.
In practice, values for $m$ are typically set around 0.9 or 0.99, striking a balance between acceleration and stability.
Since we build velocity over time, we risk overshooting even after reaching the minimum, which can lead to jittering. Therefore, we need a way to decay the weight updates over time, and this is where AdaGrad [2] comes in.
AdaGrad
Instead of keeping track of the sum of gradients like momentum, the Adaptive Gradient algorithm, or AdaGrad for short, maintains a running sum of squared gradients to have an adaptive learning rate.
# AdaGrad
w = initialize_weights()
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)
    grad_squared += dw * dw
    w += -learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
In the direction of a steep descent (where $\mathbf{dw}$ is high), AdaGrad dampens the progress, resulting in smaller update steps. Conversely, in shallow regions (where $\mathbf{dw}$ is low), it allows for larger updates. This adaptive approach effectively addresses the problem of landscape variation.
During the initial iterations, the sum of the squared gradients is small, leading to a higher learning rate. As we continue accumulating squared gradients, the learning rate gradually decays over time—a beneficial feature to have!
Since the sum of squared gradients only increases over time, it can lead to situations where the learning rate becomes too low to make meaningful updates, potentially even before reaching the minimum. To overcome this limitation, we use a variant called RMSProp, which decays this running sum of squared gradients, ensuring that the learning rate does not become too small.
RMSProp
Root Mean Square Propagation, or RMSProp for short, is a leaky version of AdaGrad that introduces a decay rate to the running sum of squared gradients.
# RMSProp
w = initialize_weights()
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)
    grad_squared = (decay_rate * grad_squared) + (1 - decay_rate) * dw * dw
    w += -learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
The decay rate is a hyperparameter typically set to 0.9, while a common choice for the learning rate is 0.001. This mechanism allows RMSProp to effectively balance the adaptation of the learning rate, preventing the rapid decay issues encountered with AdaGrad.
Adam
Adaptive Moment Estimation, or Adam [3] for short, combines the strengths of both RMSProp and Momentum, utilizing a running sum of gradients as well as the squared gradients to optimize the learning process.
# Adam
w = initialize_weights()
m1 = 0
m2 = 0
for t in range(1, num_steps + 1):
    dw = compute_gradient(loss_fn, data, w)
    m1 = (beta1 * m1) + (1 - beta1) * dw        # momentum-like running average of gradients
    m2 = (beta2 * m2) + (1 - beta2) * dw * dw   # RMSProp-like running average of squared gradients
    m1_unbias = m1 / (1 - beta1 ** t)           # bias correction
    m2_unbias = m2 / (1 - beta2 ** t)
    w += -learning_rate * m1_unbias / (np.sqrt(m2_unbias) + 1e-7)
The sum of gradients and squared gradients are very small at the beginning of training (biased towards zero), which can lead to excessively large updates. To address this issue, Adam employs a technique called bias correction as shown in algorithm above.
This adjustment ensures that the updates remain stable and effective, especially in the early stages of training when the optimizer may be prone to making erratic updates due to the low initial values of the moments.
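To see why this works, consider the very first step ($t = 1$) with $m_1$ initialized to zero: without the correction, the update would be scaled down by a factor of $(1 - \beta_1)$, i.e., by 0.1 for $\beta_1 = 0.9$. The correction divides that factor back out:
\begin{align} m_1 &= (1 - \beta_1) \cdot \mathbf{dw} \\ \frac{m_1}{1 - \beta_1^{1}} &= \mathbf{dw} \end{align}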
- Beta1 ($\beta_1$) is the decay rate for the first moment (the running average of the gradients, i.e., momentum), and it is commonly set to 0.9.
- Beta2 ($\beta_2$) is the decay rate for the second moment (the running average of the squared gradients), typically set to 0.999.
Adam has become a go-to optimizer for much of the deep learning community today. Learning rates of 1e-3, 5e-4, and 1e-4 serve as great starting points for most models.
The visualization below demonstrates how different optimizers behave on a loss landscape featuring both local and global minima, all starting from a common point.

Gradient Descent (cyan) vs. Momentum (magenta, m = 0.99) vs. AdaGrad (white) vs. RMSProp (green, decay rate = 0.9) vs. Adam (blue, beta1 = 0.9, beta2 = 0.999), on a surface with a global minimum (the left well) and a local minimum (the right well). Source: visualization tool
Now that we have prepared the base of our deep learning cake, it’s time to add the icing in the next part!
References
[1] Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013.
[2] Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011.
[3] Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015.