DSA Cheatsheet
List (Dynamic Array) list = [1, 2, 3]

| Operation | Code | Time complexity |
| --- | --- | --- |
| Get / Set element | list[i] or list[i] = 3 | O(1) |
| Add element | list.append(3) | O(1) |
| Add at an index | list.insert(i, 3) | O(n) |
| Pop last element | list.pop() | O(1) |
| Pop at an index | list.pop(i) | O(n) |
| Delete element by value | list.remove(3) | O(n) |
| Search element | 3 in x | O(n) |
| Find index of element | list.index(3) | O(n) — use a dictionary [element → index] instead |
| Sort | sorted_list = sorted(list, reverse = False) or list.... | |
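The cheatsheet's note about trading an O(n) `list.index` scan for an O(1) dictionary lookup can be sketched as follows; `data` and `index_of` are illustrative names, not from the cheatsheet, and the dictionary trick assumes distinct elements:

```python
data = [10, 20, 30, 40]

# list.index scans from the front: O(n) on every call.
assert data.index(30) == 2

# One O(n) pass builds an element -> index dictionary,
# after which each lookup is O(1) on average.
index_of = {value: i for i, value in enumerate(data)}
assert index_of[30] == 2
```

If elements repeat, `list.index` returns the first occurrence while the dictionary comprehension keeps the last, so the two only agree for unique values.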
Transformers for Image Classification: ViT and CLIP
Modeling in computer vision has long been dominated by convolutional neural networks (CNNs). We’ve already discussed famous architectures like VGGNet and ResNet in previous posts, which have served as the primary backbones for a variety of vision tasks. In contrast, network architectures in natural language processing (NLP) have evolved along a different trajectory. The dominant architecture in NLP is the Transformer, designed for sequence modeling. Models like GPT-3 have achieved remarkable success, scaling to over 100 billion parameters thanks to their computational efficiency and scalability....
GPT Series Part 3: Building GPT-2 & Sampling Techniques
Building on our previous exploration of GPT-1, let’s now modify its architecture to recreate the GPT-2 [1] small model, which contains 124 million parameters. Although the original paper refers to it as 117M, OpenAI later clarified the actual count. A key advantage of GPT-2 is that OpenAI released its pre-trained weights, which we can directly load into our implementation. This not only serves as a sanity check for our model but also provides a strong foundation for fine-tuning....
GPT Series Part 2: Implementing BPE Tokenizer
In the previous post, we trained a character-level GPT-1 model from scratch, but it struggled to produce coherent words. Since each token represented a single character, the model had to learn the structure of words and sentences entirely on its own, which is nearly impossible given our limited training data. This highlights the importance of tokenization, one of the most critical preprocessing steps in training large language models. Representing subwords or whole words as tokens, instead of individual characters, can significantly improve learning efficiency and language understanding....
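As a taste of what the post builds, here is a minimal sketch of a single BPE merge step in plain Python. This is not the tokenizer implemented in the post, and `most_frequent_pair` / `merge_pair` are hypothetical helper names:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of the chosen pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from individual characters
pair = most_frequent_pair(tokens)   # picks ('l', 'o'); ties go to first occurrence
tokens = merge_pair(tokens, pair)   # 'l' + 'o' becomes a single 'lo' token
```

A full BPE tokenizer simply repeats this count-and-merge loop until a target vocabulary size is reached, recording the merges so they can be replayed at encoding time.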
GPT Series Part 1: Understanding LLMs & Coding GPT-1 from scratch
By now, you’ve probably used OpenAI’s ChatGPT, a chatbot that has taken the AI community by storm and transformed the way we work. First released in 2022 with GPT-3.5 (Generative Pre-trained Transformer 3.5) as its backend model, it reached one million users in just five days and a staggering 100 million in two months. The unprecedented success of ChatGPT fueled further research into the technology behind it: Large Language Models (LLMs)....
Generalizing Attention with Transformers
In the previous post, we explored sequence modeling using an encoder-decoder architecture connected through an attention mechanism. This mechanism allows the decoder to “attend” to different parts of the input at each time step while generating the output sequence. Attention can also be applied to a variety of tasks, such as image captioning. In this case, the decoder RNN focuses on different regions of the input image as it generates each word of the output caption....
Sequence Modeling with Recurrent Neural Networks and Attention
In previous discussions, we focused on feedforward neural networks, which take a single image as input, process it through multiple layers of convolution, normalization, and fully connected layers, and output a single label for image classification tasks. This is a one-to-one relationship: a single image maps to a single output label. However, there are other types of problems we want to solve using deep learning that involve variable-length sequences as both inputs and outputs....
ImageNet Challenge: The Olympics of Deep Learning
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition that took place from 2010 to 2017, attracting teams from around the world to showcase their best-performing image classification models. This challenge became a crucial benchmark in the field, with its winners significantly influencing the landscape of image recognition and deep learning research. The competition used a subset of the ImageNet dataset, containing 1.3M training examples across 1000 different classes, with 50k validation and 100k test examples....
Convolutional Neural Networks: Deep Learning for Image Recognition
Linear classifiers or MLPs that we have discussed so far don’t respect the 2D spatial structure of input images. These images are flattened into a 1D vector before being passed through the network, which destroys the spatial structure of the image. This creates a need for a new computational model that can operate on images while preserving spatial relationships: Convolutional Neural Networks (CNNs). Let’s understand the components of this CNN model....
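The core idea the post develops, sliding a small kernel over the image so 2D neighborhoods are preserved instead of flattened away, can be sketched in pure Python. `conv2d` is an illustrative helper, not the post's implementation:

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D convolution over nested lists.
    Like most deep learning libraries, this is technically cross-correlation."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]

# A 4x4 image with a sharp dark-to-bright boundary between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# A 2x2 vertical-edge kernel: responds where intensity changes left to right.
kernel = [[1, -1], [1, -1]]
response = conv2d(image, kernel)  # nonzero only at the boundary column
```

Because the kernel sees a 2×2 neighborhood at every position, the edge shows up at the same spatial location in the output, something a flattened 1D representation cannot express.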
Deep Learning Basics Part 3: The Cherry on Top
We’ve already coded our first neural network architecture from scratch and learned how to train it. Our deep learning cake is almost ready, but we still need the toppings to make it more appealing. In this part, we’re going to discuss the available toppings, concepts that enhance optimization and help us reach a better final solution for the model’s weights. The most important topping among them is Regularization. Regularization When optimizing, our goal is to find the specific set of weights that minimize the loss on our training data, aiming for the highest possible accuracy on the test set....
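A minimal sketch of the regularization idea, assuming the common L2 penalty form; the helper names are illustrative, not from the post:

```python
def l2_penalty(weights, lam):
    """L2 regularization: lam times the sum of squared weights.
    Large weights are penalized, nudging the optimizer toward simpler models."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.1):
    # Total objective = how well we fit the training data
    #                 + a preference for small weights.
    return data_loss + l2_penalty(weights, lam)

total = regularized_loss(1.0, [3.0, -4.0])  # 1.0 + 0.1 * (9 + 16) = 3.5
```

Two weight vectors can achieve the same training loss, but the penalty term makes the optimizer prefer the one with smaller weights, which tends to generalize better to the test set.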
Deep Learning Basics Part 2: The Icing
In the last post, we introduced Linear Classifiers as the simplest model in deep learning for image classification problems. We discussed how Loss Functions express preferences over different choices of weights, and how Optimization minimizes these loss functions to train the model. However, linear classifiers have limitations: their decision boundaries are linear, which makes them inadequate for classifying complex data. One option is to manually extract features from the input image and transform them into a feature space, hoping that it makes the data linearly separable....
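A toy sketch of such a manual feature transform: mapping points to polar coordinates so two concentric classes become separable by radius alone. `to_polar` and the sample points are illustrative, not from the post:

```python
import math

def to_polar(x, y):
    """Map Cartesian (x, y) to polar (radius, angle) features."""
    return math.hypot(x, y), math.atan2(y, x)

# Two classes on concentric rings: not linearly separable in (x, y),
# but a single threshold on the radius feature separates them.
inner = [(0.3, 0.1), (-0.2, 0.25), (0.1, -0.3)]
outer = [(1.0, 0.1), (-0.9, 0.5), (0.2, -1.1)]
inner_radii = [to_polar(x, y)[0] for x, y in inner]
outer_radii = [to_polar(x, y)[0] for x, y in outer]
```

In the transformed space a linear classifier only needs to learn the threshold on the radius; the catch, which motivates neural networks, is that such transforms must otherwise be hand-designed per problem.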
Deep Learning Basics Part 1: The Base of the Cake
While there is a wealth of deep learning content scattered across the internet, I wanted to create a one-stop solution where you can find all the fundamental concepts needed to write your own neural network from scratch. This series is inspired by Justin Johnson’s course and Stanford’s CS231n. I would also highly recommend watching Andrej Karpathy’s videos. Image Classification Before diving into the details, let’s start with the basics: image classification....