By now, you’ve probably used OpenAI’s ChatGPT—a chatbot that has taken the AI community by storm and transformed the way we work. First released in 2022 with GPT-3.5 (Generative Pre-trained Transformer 3.5) as its backend model, it reached one million users in just five days and a staggering 100 million in two months.

ChatGPT interface
The unprecedented success of ChatGPT fueled further research into the technology behind it—Large Language Models (LLMs). Over the past two posts, we’ve built the theoretical foundation, and now it’s time to get hands-on: coding a GPT-like LLM from the ground up.
Large Language Models (LLMs)
An LLM is a neural network designed to understand, generate, and interpret human language. As the name suggests, LLMs are simply language models with an extremely large number of parameters (billions or even trillions) trained on vast datasets—potentially encompassing all publicly available internet text.
While language modeling has been an area of research for decades (with RNNs, LSTMs, and Attention mechanisms), the introduction of Transformers in 2017 revolutionized the field.

Evolution of language models
With Transformers proving their effectiveness, the next logical step was scaling. This movement gained traction with Google’s BERT and OpenAI’s GPT models in 2018. These two architectures define the major types of LLMs we see today.
Types of LLMs
Recall the Transformer architecture from the last post: it consists of two components, the encoder and the decoder. It was proposed as a replacement for the RNN + Attention encoder-decoder structure used in Seq2Seq learning, where the encoder reads the input text and the decoder produces predictions for the task.

The Transformer Architecture.
Although the encoder-decoder architecture may seem natural for machine translation tasks (for which it was originally developed), it is not always necessary.
Representation Models: Encoder-Only Models
First school of thought: “Is a Decoder Necessary if I Only Want to Perform Data-to-Numbers Conversion?”
For example, in text classification tasks like sentiment analysis and spam detection or regression tasks like stock price prediction, where the goal is simply to convert data into categories or numerical values, the decoder can be omitted.
A good example of this approach is Bidirectional Encoder Representations from Transformers (BERT)[1], an encoder-only Transformer model that uses bidirectional processing to understand the context and relationships of the input data.
A special [CLS] token (short for classification) is prepended to the input sentence. Through the self-attention mechanism, this token aggregates information from all words, capturing the overall meaning of the sentence. The hidden-state representation corresponding to it is then fed into an output layer for classification.

BERT for classification and regression tasks.
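To make this concrete, here is a minimal PyTorch sketch of a classification head sitting on top of an encoder's [CLS] hidden state. This is an illustrative toy, not BERT's actual implementation; the encoder module, hidden size, and number of classes are placeholder assumptions.
# Hypothetical sketch: classification head over the [CLS] hidden state
import torch.nn as nn
class CLSClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                 # any encoder returning [B, N, hidden_dim]
        self.head = nn.Linear(hidden_dim, num_classes)
    def forward(self, token_ids):
        hidden = self.encoder(token_ids)       # [B, N, hidden_dim]
        cls_repr = hidden[:, 0, :]             # hidden state at position 0, i.e. the [CLS] token
        return self.head(cls_repr)             # [B, num_classes] classification logits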
Generative Models: Decoder-Only Models
Second school of thought: “Is an Encoder Necessary for Language Generation?”
Tasks like translation, summarization, and Q&A involve transforming an input sequence into an output sequence. Traditionally, this process has been handled using a model composed of an encoder for understanding the input and a decoder for generating the output.
However, this process can be reframed—what if the input and output are treated as one continuous sequence? For example, a machine translation task—“we are eating bread” in English to “estamos comiendo pan” in Spanish—can be reframed as a language generation task: “Translate English to Spanish: we are eating bread.”
A language generation or text completion task is an autoregressive problem, which can be handled using a decoder-only model. One such example is the Generative Pre-trained Transformer (GPT)[2].

GPT-1 Architecture.
Since the decoder contains a “masked” self-attention layer, it attends only to previously generated tokens or the available context, allowing it to generate coherent sequences that follow the context.
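As a quick illustration of this masking (a minimal sketch of the same idea we implement later in code), a lower-triangular mask blocks attention to future positions before the softmax:
# Toy causal mask: each position may only attend to itself and earlier positions
import torch
N = 4
scores = torch.randn(N, N)                          # toy attention scores for 4 tokens
mask = torch.tril(torch.ones(N, N))                 # 1s on and below the diagonal
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)             # upper-triangular entries become 0 after softmax
print(weights)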
Chronologically, GPT was first introduced in early 2018, followed by BERT later that year. Both architectures were implemented with distinct purposes but achieved great success due to their similar training approach.
Training LLMs
LLMs are trained in two main stages using a semi-supervised learning procedure that combines unsupervised pre-training and supervised fine-tuning.
1. Unsupervised Pre-training
In this first stage, an LLM is trained on vast amounts of raw, unlabeled data available on the internet. This allows the model to acquire broad world knowledge and develop an understanding of language semantics, including grammar and sentence structures.
This task-agnostic model is often referred to as a base model or a foundation model.
BERT pre-training uses the masked language model (MLM) objective, where random tokens in the input are masked, and the model is trained to predict them based on the surrounding context. This unique training strategy (masked word prediction) makes such models well-suited for text classification tasks.
For generative models like GPT, the language modeling objective is used—given a sequence of tokens, the model predicts the next token in the sequence, a simple next-word prediction task. This approach helps the model learn how words and phrases fit together naturally, making it capable of generating coherent and contextually relevant text based on the given context.
Since these objectives do not require labeled data, they enable training on massive, unlabeled text datasets. To capture a wide range of knowledge, the training data must be as diverse as possible.
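To make the two objectives concrete, here is a tiny illustrative sketch; the token IDs and the [MASK] ID are made up for the example:
# Toy comparison of the two pre-training objectives
import torch
tokens = torch.tensor([12, 5, 48, 7, 31])   # a toy tokenized sentence

# Masked language modeling (BERT-style): hide a token, predict it from both sides
MASK_ID = 99                                 # hypothetical [MASK] token ID
mlm_input = tokens.clone()
mlm_input[2] = MASK_ID                       # input:  [12, 5, 99, 7, 31]
mlm_target = tokens[2]                       # target: 48 (the hidden token)

# Language modeling (GPT-style): predict the next token at every position
lm_input = tokens[:-1]                       # [12, 5, 48, 7]
lm_target = tokens[1:]                       # [5, 48, 7, 31] (inputs shifted by one)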
Key points:
- Requires large datasets—pre-training involves downloading and processing vast internet text data.
- Computationally expensive and time-intensive.
- Compresses internet knowledge and models language semantics.
- A trained GPT model functions as a text completer.
2. Supervised Fine-Tuning (SFT)
The second stage involves adapting the pre-trained model to specific tasks using labeled data. Since task-specific datasets require manual annotation, they are significantly smaller compared to the massive datasets used in pre-training. This semi-supervised approach enables LLMs to adapt effectively to new tasks with relatively small amounts of labeled data.
Fine-tuning can be categorized into two types:
Classification Fine-Tuning: Labeled data consists of text samples and associated class labels (e.g., emails labeled as “spam” or “not spam”).
Instruction Fine-Tuning: Labeled data consists of instruction-response pairs (e.g., “Translate this sentence into Spanish: we are eating bread” → “estamos comiendo pan”).
Key points:
- Computationally cheaper than pre-training
- Requires less data (fine-tuned on a narrower, manually labeled dataset).
- Tailors the LLM to a specific task or domain.
3. Reinforcement Learning from Human Feedback (RLHF)
The third stage in model training is Reinforcement Learning from Human Feedback (RLHF), which plays a crucial role in shaping the “chat” capabilities of ChatGPT. This technique further refines a fine-tuned language model to better align with human preferences and instruction-following behavior.
The process begins with data collection: human annotators compare model-generated responses and select the preferred one in a binary choice setup. This empirical feedback is then used to train a reward model, which learns to predict human preferences by identifying patterns in the annotators’ choices.
Once trained, the reward model assigns a scalar quality score (e.g., based on helpfulness, safety, or relevance) to model responses, using both the prompt and contextual information from previous turns. These scores serve as rewards in reinforcement learning, guiding the model to optimize its responses over time. Through this iterative process, RLHF enhances the model’s ability to produce more helpful, safe, and human-aligned outputs.

The training pipeline of ChatGPT.
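As a rough sketch of the reward-modeling step (a common pairwise formulation, not necessarily OpenAI’s exact recipe), the reward model is trained to score the human-preferred response higher than the rejected one:
# Illustrative pairwise (Bradley-Terry style) reward-model loss
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Push the preferred response's scalar score above the rejected one's
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar scores assigned by a reward model to two responses per prompt
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.4, 0.9])
print(pairwise_reward_loss(r_chosen, r_rejected))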
Open LLMs
Organizations developing open LLMs often share model weights and architectures with the public. Examples include Cohere’s Command R, Mistral models, Microsoft’s Phi, and Meta’s Llama models.
These organizations typically release two types of models:
Base Models: A foundational model trained on a massive amount of diverse data that remains largely unoptimized for specific tasks.
- Unaligned, meaning it may generate raw outputs based on internet-derived knowledge.
- Can be fine-tuned or adapted for downstream tasks.
Assistant Models: Fine-tuned version of a base model, optimized for user interactions, safety, and specific applications, often labeled as “Instruct” or “SFT” (Supervised Fine-Tuned) models.
- Aligned to follow instructions better (instruction-tuned).
- Often trained with Reinforcement Learning from Human Feedback (RLHF) for better responses.
Coding GPT-1 from scratch
This section demonstrates how to pre-train a small-scale GPT-1 model for educational purposes. Large-scale pre-training requires significant computational resources (GPT-3 pretraining cost is estimated at $4.6 million), so the community typically uses pre-trained base models.
We’ll train a character-level GPT-1 model using the Tiny Shakespeare dataset, containing 40,000 lines from Shakespeare’s plays. The model will generate text character by character in an unsupervised setting, learning through next-character prediction.
In order to use text as input to our LLM, we first split it into individual characters, convert each character into integer tokens, and then transform these tokens into embedding vectors. These embeddings serve as numerical representations of the text, making it suitable for neural network processing.
Download the data
First, we import necessary libraries and select the appropriate device.
# Import functions
import torch
import torch.nn as nn
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using", device)
Next, we download and inspect the dataset. The dataset contains 1.1 million characters.
# Download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("Length of the dataset:", len(text))
print(text[:100])
Length of the dataset: 1115394
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
Yo
Tokenization
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be individual words, characters, or special symbols.
For our character-level model, each character itself will be treated as a token, meaning no further splitting is required. We first create a vocabulary, which is the set of all unique characters in our dataset. This defines all possible tokens our model can process as input and generate as output.
# Create vocabulary
all_chars = sorted(list(set(text)))
print("All characters that occur in the dataset:", ''.join(all_chars))
vocab_size = len(all_chars)
print("Vocab size (number of unique characters in the dataset):", vocab_size)
All characters that occur in the dataset:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size (number of unique characters in the dataset): 65
We do not convert all text to lowercase because capitalization helps the model distinguish between proper and common nouns, understand sentence structure, and generate correctly capitalized text.
Next, we map each character in our vocabulary to an integer token ID (ranging from $0$ to $64$). This mapping allows us to later convert token IDs into embedding vectors.
# Tokenize each character - convert the chars to ints
char_to_int = {all_chars[i]:i for i in range(len(all_chars))}
int_to_char = {i:all_chars[i] for i in range(len(all_chars))}
encode = lambda s: [char_to_int[i] for i in s]
decode = lambda s: ''.join([int_to_char[i] for i in s])
# Tokenize entire dataset
data = torch.tensor(encode(text))
print("Tokenized first 10 characters:")
print(data[:10].tolist())
Tokenized first 10 characters:
[18, 47, 56, 57, 58, 1, 15, 47, 58, 47]
The encode function maps characters to token IDs, while decode reverses the process to reconstruct text.
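As a quick sanity check, encoding a short string and decoding it back should reproduce it exactly (the exact IDs depend on the vocabulary built above):
# Round-trip a short string through the tokenizer
ids = encode("hello")
print(ids)            # e.g. [46, 43, 50, 50, 53] with this vocabulary
print(decode(ids))    # hello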
In summary,
- Tokenization breaks down the input text (training data) into individual tokens.
- We build a vocabulary out of all unique tokens.
- Each token is mapped to a unique integer ID.
Create batches
The next step is to generate input-target pairs for training our language model.
Since training on the entire dataset at once is computationally expensive, we sample small chunks of the dataset (called block data) and train on them instead. The maximum size of these chunks is fixed for each task and is referred to as the block size or context length.
The context length is a vital parameter in LLMs, as it determines the maximum number of tokens the model can process in a single pass. A larger context window enables the model to capture longer dependencies and even process entire documents, but it also increases computational cost, which grows quadratically with sequence length.
Let’s take an example to understand this better. It is intuitive to assume that $\text{x}$ represents the input tokens, while $\text{y}$ contains the target tokens, which are simply the inputs shifted by one position. This setup aligns with the next-token prediction task that our model will be trained on.
block_size = 8 # what is the maximum context length for predictions?
block_data = data[:block_size+1]
x = data[:block_size]
y = data[1:block_size+1]
print("Training dataset chunck:", block_data.tolist())
print("x:", x.tolist())
print("y:", y.tolist())
# How training occurs
context = []
for i in range(block_size):
    context.append(x[i].item())
    target = y[i]
    print(f"Context: {context}, Target: {target}")
Training dataset chunk: [18, 47, 56, 57, 58, 1, 15, 47, 58]
x: [18, 47, 56, 57, 58, 1, 15, 47]
y: [47, 56, 57, 58, 1, 15, 47, 58]
Context: [18], Target: 47
Context: [18, 47], Target: 56
Context: [18, 47, 56], Target: 57
Context: [18, 47, 56, 57], Target: 58
Context: [18, 47, 56, 57, 58], Target: 1
Context: [18, 47, 56, 57, 58, 1], Target: 15
Context: [18, 47, 56, 57, 58, 1, 15], Target: 47
Context: [18, 47, 56, 57, 58, 1, 15, 47], Target: 58
Here, the model iteratively builds context from previous tokens. At each step, it predicts the next token based on what it has seen so far.
This sliding context window ensures that:
- The model is trained on varied-length inputs.
- It generalizes well to different sequence lengths.
Now that we have understood the concept of block size, let’s create batches.
# Build the data loader (train/val split)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
def make_batch(split: str, batch_size: int, context_len: int):
    data = train_data if split == "train" else val_data
    idx = torch.randint(len(data) - context_len, (batch_size,))
    x, y = [], []
    for i in idx:
        x.append(data[i:context_len+i])
        y.append(data[i+1:context_len+i+1])
    return torch.stack(x), torch.stack(y)
In practice, input text can be longer than the model’s supported context length. In such cases, we truncate the text, keeping only the most recent tokens up to the maximum length.
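A minimal sketch of this truncation, using the same slicing we will later apply during generation:
# Keep only the most recent tokens up to the supported context length
tokens = torch.arange(20)        # pretend we already have 20 tokens of context
max_len = 8
truncated = tokens[-max_len:]    # the last 8 tokens
print(truncated.tolist())        # [12, 13, 14, 15, 16, 17, 18, 19]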
Model Architecture
Let’s break down the key components of the GPT-1 architecture and implement it from-scratch:
Token Embeddings: Converts integer token IDs from the vocabulary into a $\text{D}$-dimensional embedding vector using a simple lookup table. In our case, each character token in the input sequence is represented as a vector.
Positional Encoding: Since transformers do not inherently understand the order of tokens, we add a $\text{D}$-dimensional positional encoding to each token. GPT models use a learnable lookup table for this purpose.
Input vectors: The final input vector is obtained by summing the token embeddings and positional encodings. This results in an input tensor of shape $\text{N} \times \text{D}$, where $\text{N}$ is the number of tokens (up to the context length) and $\text{D}$ is the embedding dimension.
Model Architecture (Decoder-Only Transformer):
- The model consists of $\text{L}$ transformer blocks, each with:
  - Masked Multi-Head Self-Attention → Layer Norm → MLP → Layer Norm
- A fully connected layer (language modeling head) that projects the $\text{D}$-dimensional output back into the vocabulary space.
- A softmax layer that converts these logits into probabilities for the next-token prediction.

GPT-1 architecture
Masked Multi-head Self-attention layer
This follows the same mechanism as in the previous post, so I won’t go into detail here.
Note that bias terms were included in all linear layers in the original GPT models, since the authors directly used PyTorch’s default nn.Linear implementation (which sets bias=True by default).
class MultiHeadSelfAttention(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        context_len: int | None = None,
        dropout: float = 0.0,
        causal: bool = True,
    ):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.W_QKV = nn.Linear(d_model, 3 * d_model)
        self.W_O = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        if causal and context_len is not None:
            self.register_buffer("mask", torch.tril(torch.ones(context_len, context_len)))
        else:
            self.mask = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: [B, N, d_model]
        Returns: [B, N, d_model]
        """
        B, N, _ = x.shape
        # Compute Q, K, V in one go: [B, N, 3 * d_model]
        QKV = self.W_QKV(x)
        Q, K, V = QKV.chunk(3, dim=-1)
        # Split into H heads: [B, N, H, d_h] and then transpose to [B, H, N, d_h]
        Q = Q.view(B, N, self.num_heads, self.head_dim).transpose(-2, -3)
        K = K.view(B, N, self.num_heads, self.head_dim).transpose(-2, -3)
        V = V.view(B, N, self.num_heads, self.head_dim).transpose(-2, -3)
        # Apply scaled dot-product attention (with causal mask) on each head
        attn_scores = (Q @ K.transpose(-2, -1)) * K.shape[-1]**-0.5
        if self.mask is not None:
            attn_scores = attn_scores.masked_fill(self.mask[:N, :N] == 0, float('-inf'))
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        out = attn_weights @ V
        # Concatenate: transpose back to [B, N, H, d_h], then combine heads [B, N, d_model]
        out = out.transpose(-2, -3).contiguous().view(B, N, -1)
        out = self.W_O(out)
        return out
Layer Normalization
As you might recall, layer normalization improves the stability and efficiency of training. Normalization is performed across the feature dimension which is the embedding dimension in our case.
class LayerNorm(nn.Module):
    def __init__(self, emb_dim: int):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input x: [B, N, d_model]
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # No Bessel's correction
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
This implementation closely resembles our Batch Normalization code.
Feed forward layer
Next, we will implement a small neural network used as a part of the transformer block in LLMs.
Historically, the ReLU activation function has been widely used due to its simplicity and effectiveness. However, GPT-1 used the Gaussian Error Linear Unit (GELU) for its improved performance.

GELU vs ReLU activation functions
GELU can be thought of as a smoother version of ReLU. Its smooth transitions allow for better optimization properties during training, leading to more nuanced parameter adjustments.
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

class MLP(nn.Module):
    def __init__(self, emb_dim: int):
        super().__init__()
        self.layer1 = nn.Linear(emb_dim, 4 * emb_dim)
        self.layer2 = nn.Linear(4 * emb_dim, emb_dim)
        self.gelu = GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.gelu(self.layer1(x))
        x = self.layer2(x)
        return x
Transformer Decoder block
Now, we assemble all the components we’ve built so far into a Transformer decoder block. This block forms the foundation of GPT models and is repeated multiple times throughout the architecture.
class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, context_len: int, num_heads: int, dropout: float):
        super().__init__()
        self.ln_1 = LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(
            d_model=d_model,
            num_heads=num_heads,
            context_len=context_len,
            dropout=dropout
        )
        self.ln_2 = LayerNorm(d_model)
        self.mlp = MLP(emb_dim=d_model)
        self.resid_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Masked self-attention with residual connection
        attn_out = self.resid_dropout(self.attn(x))
        x = self.ln_1(x + attn_out)
        # Feed-forward network with residual connection
        mlp_out = self.resid_dropout(self.mlp(x))
        x = self.ln_2(x + mlp_out)
        return x
Defining the GPT-1 model
With the decoder block implemented, we now have all the necessary components to build the GPT-1 architecture.
class GPT1(nn.Module):
    def __init__(
        self,
        D_embd: int,
        vocab_size: int,
        context_len: int,
        num_blocks: int,
        num_heads: int,
        dropout: float
    ):
        super().__init__()
        # Token Embeddings - Convert integer word tokens (vocab) to a D-dimensional embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, D_embd)
        # Position Encoding - Encode each position to a D-dimensional embedding vector
        self.position_embedding_table = nn.Embedding(context_len, D_embd)
        # Embedding dropout
        self.embd_dropout = nn.Dropout(dropout)
        # Define multiple Decoder blocks
        self.blocks = nn.Sequential(*[
            DecoderBlock(d_model=D_embd, context_len=context_len, num_heads=num_heads, dropout=dropout)
            for _ in range(num_blocks)
        ])
        # Final FC layer to project the D-dimensional vector back to vocab space
        self.lm_head = nn.Linear(D_embd, vocab_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N = x.size()
        device = x.device  # ensure we use the same device as the input
        # Token + positional embeddings + embedding dropout
        token_emb = self.token_embedding_table(x)  # [B, N, D]
        pos_emb = self.position_embedding_table(torch.arange(N, device=device))  # [N, D]
        x = token_emb + pos_emb  # [B, N, D]
        x = self.embd_dropout(x)
        # Transformer blocks
        x = self.blocks(x)
        # Project back to vocabulary space
        logits = self.lm_head(x)  # [B, N, vocab_size]
        return logits
Tokenized text is first converted into token embeddings, which are then augmented with positional embeddings. This combined representation forms a tensor that passes through a series of transformer blocks, outputting a tensor of the same dimensionality. The final language modeling head (a linear layer without bias) projects this output into the vocabulary space, generating logits for each token in the vocabulary.
Weight initialization
Since LayerNorm is used extensively throughout the model, the weights (in both linear and embedding layers) were initialized from a zero-mean Gaussian distribution with a standard deviation of $0.02$, which the authors noted was sufficient. The biases were set to $0$, and the LayerNorm gain and bias parameters were initialized to $1$ and $0$, respectively.
# Initialize weights as per GPT-1: Normal(0, 0.02)
def gpt1_init(m):
    if isinstance(m, (nn.Linear, nn.Embedding)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if isinstance(m, nn.Linear) and m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, LayerNorm):
        nn.init.ones_(m.scale)
        nn.init.zeros_(m.shift)
Pre-Training
Now, let’s initialize a small GPT model and perform a forward pass as a sanity check.
# Define hyperparameters
context_len = 32
batch_size = 16
model = GPT1(D_embd=64, vocab_size=65, context_len=context_len, num_blocks=4, num_heads=4, dropout=0.1).to(device)
# Apply initialization
model = model.apply(gpt1_init)
# print the number of parameters in the model
print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')
# Forward pass with one example
xb, yb = make_batch("train", batch_size=batch_size, context_len=context_len)
print("Input shape:", xb.size())
print("Target shape:", yb.size())
context, target = xb.to(device), yb.to(device)
logits = model(context)
print("Model output shape:", logits.size())
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), target.view(-1))
print("Initial Loss:", loss.item())
0.209536 M parameters
Input shape: torch.Size([16, 32])
Target shape: torch.Size([16, 32])
Model output shape: torch.Size([16, 32, 65])
Initial Loss: 4.262292861938477
As you can see, the model outputs a tensor of shape [16, 32, 65], since we passed 16 input texts with 32 tokens each. The last dimension corresponds to the vocabulary size of the tokenizer. To compute the loss, we collapse the first two dimensions, concatenating all tokens together and averaging the loss over the batch.
At initialization, all tokens in our vocabulary are equally likely, meaning each has a probability of $1/65$. If we manually compute the cross-entropy loss, we get: $$ \text{loss} = -\ln(1/65) = 4.174 $$ This is approximately the same as our computed value, indicating that we are on the right track.
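We can verify this quickly:
# Cross-entropy of a uniform distribution over 65 tokens
import math
print(-math.log(1/65))   # ≈ 4.174, close to the initial loss above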
Weight decay
GPT-1 used AdamW with a weight decay of 0.01, applied only to non-bias and non-gain weights. What does this mean?
Let’s look at all the learnable parameters in the model:
- MSA Linear layers: W_QKV and W_O weights and biases.
- Layer Norm: scale and shift params (also called gain and bias).
- MLP Linear layers: layer1 and layer2 weights and biases.
- Embedding layers: token_embedding_table and position_embedding_table weights.
- LM Linear layer: lm_head weight.
Therefore, weight decay was applied only to the weights of linear and embedding layers, not to biases or LayerNorm gain parameters. Since these weights are 2D tensors, while others are 1D, we can use this criterion to separate them into decay and non-decay parameter groups.
Here’s how to implement it in PyTorch:
# Separate parameters
decay_params = []
no_decay_params = []
for name, param in model.named_parameters():
    if param.requires_grad:
        if param.dim() >= 2:
            decay_params.append(param)
        else:
            no_decay_params.append(param)
# Define optimizer with different parameter groups
optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.01},   # Linear & embedding weights
        {"params": no_decay_params, "weight_decay": 0.0}  # Biases & LayerNorm params
    ],
    lr=1e-3
)
With that, we’re ready to pre-train the model on our dataset. 🚀
max_iter = 5000
eval_iter = 200
# Define the loss function & optimizer
criterion = torch.nn.CrossEntropyLoss()
for iter in range(max_iter):
    model.train()
    # Sample a batch of data
    xb, yb = make_batch('train', batch_size=batch_size, context_len=context_len)
    context, target = xb.to(device), yb.to(device)
    # Forward pass and optimization
    optimizer.zero_grad()
    logits = model(context)
    loss = criterion(logits.view(-1, vocab_size), target.view(-1))
    loss.backward()
    optimizer.step()
    # Evaluate the loss on train and val
    if iter % 100 == 0 or iter == (max_iter - 1):
        with torch.no_grad():
            model.eval()
            out = {}
            for split in ['train', 'val']:
                running_loss = 0
                for k in range(eval_iter):
                    xb, yb = make_batch(split, batch_size=batch_size, context_len=context_len)
                    context, target = xb.to(device), yb.to(device)
                    logits = model(context)
                    loss = criterion(logits.view(-1, vocab_size), target.view(-1))
                    running_loss += loss.item()
                out[split] = running_loss / eval_iter
            print('\n Step:{}/{}, Train Loss:{:.4f}, Val Loss:{:.4f}'.format(iter, max_iter, out['train'], out['val']))
...
Step:4999/5000, Train Loss:1.7101, Val Loss:1.8699
Generating text
The inference process involves generating new text based on patterns the model has learned from the training data. We start with an initial context, typically the integer token 0, which represents the newline character in our vocabulary. This serves as the seed input to the model.
The model then outputs a probability distribution over all tokens in the vocabulary. We sample from this distribution to generate the next token. This newly generated token is appended to the existing context and fed back into the model to produce the next prediction. By iterating this process, the model generates coherent text that follows the structure and style of the training data—such as Shakespearean plays in the case of our Tiny Shakespeare dataset.
# Generate text from the trained model
max_new_tokens = 300
model.eval()
# Start with a zero token
context = torch.zeros((1, 1), dtype=torch.long, device=device) # [B=1, N=1]
for _ in range(max_new_tokens):
    with torch.no_grad():
        # Trim to context length supported by the model
        idx_cond = context[:, -context_len:]
        # Forward pass
        logits = model(idx_cond)   # [B=1, N, vocab_size]
        logits = logits[:, -1, :]  # last time step → [B, vocab_size]
        # Convert to probabilities and sample next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Append sampled token
        context = torch.cat((context, next_token), dim=1)  # [B=1, N+1]
# Decode the generated sequence
print(decode(context[0].tolist()))
O this must says it toill dake to feare,
What in, I reigreaton all him draster. he spit me with hermattan?
COMANIUSIUS:
As you
my lastless; tell was with cantal you was?
QUEEN ELIZABETH:
Ay, forwas now can, and you fear'd, more my mune chamnot Moounce
Let not the kise him? hy bu do to him,
Unce I
Since we sample at every step from the probability distribution, text generation is stochastic, meaning that even with the same initial context, we may obtain different outputs. This ensures that the generated text is diverse rather than simply memorizing the training data.
I know the generation is not very good since we’ve trained a character-level model, and it doesn’t grasp the meaning of words well. But still, it does a good job at generating Shakespeare-like text.
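To see the stochasticity directly, sampling twice from the same next-token distribution can yield different tokens, whereas greedy decoding (taking the argmax) is deterministic; here is a toy illustration:
# Toy next-token distribution over 3 tokens
probs = torch.tensor([[0.1, 0.6, 0.3]])
print(torch.multinomial(probs, num_samples=1))   # stochastic: may differ between runs
print(torch.multinomial(probs, num_samples=1))
print(torch.argmax(probs, dim=-1))               # greedy: always picks token 1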
Summary of GPT-1 model (2018)
Here are some key implementation details of the GPT-1 model.
Tokenizer: Byte Pair Encoding (BPE) vocabulary with 40,000 merges. We’ll go into more detail on this in the next post.
Context length: $N = 512$ tokens.
Position Embeddings: Learned instead of the sinusoidal version proposed in Transformers.
Architecture hyperparameters:
- Number of layers (decoder blocks): $L = 12$
- Number of attention heads: $H = 12$
- Embedding dimension: $d_{model} = 768$
- MLP size: $d_{ff} = 4 \times d_{model} = 3072$
- Residual, embedding and attention dropout: $p = 0.1$
Pre-Training: Trained on the BooksCorpus dataset, which contains over 7,000 unique unpublished books from a variety of genres.
- Initialization: Initialized from a zero-mean Gaussian distribution with a standard deviation of $0.02$.
- Optimizer: AdamW
- Batch size: 64 (randomly sampled)
- Weight decay: $0.01$
- Learning rate: Increased linearly from zero to a maximum value of $2.5 \times 10^{-4}$ over the first 2000 updates, then annealed to zero using a cosine schedule.
- Number of epochs: 100
Fine-tuning: Carried out using supervised learning on specific downstream tasks (like classification, question answering, etc.).
- Dropout in the classifier: $p = 0.1$
- Learning rate: $6.25 \times 10^{-5}$ with a linear decay schedule and warmup over 0.2% of training steps.
- Batch size: 32
- Number of epochs: 3 (sufficient for most cases)
- Weight decay: $0.5$
- Other parameters remain the same as in the pre-training stage.
References
Medium Post on Most Successful Transformer Variants: Introducing BERT and GPT
Andrej Karpathy’s video: Let’s build GPT: from scratch, in code, spelled out.
[1] Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019.
[2] Radford et al., “Improving Language Understanding by Generative Pre-Training”, OpenAI, 2018.