In the previous post, we trained a character-level GPT-1 model from scratch, but it struggled to produce coherent words. Since each token represented a single character, the model had to learn the structure of words and sentences entirely on its own, which is nearly impossible given our limited training data.
This highlights the importance of tokenization, one of the most critical preprocessing steps in training large language models. Representing subwords or whole words as tokens, instead of individual characters, can significantly improve learning efficiency and language understanding.
In this post, we’ll dive into various tokenization schemes and implement the Byte Pair Encoding (BPE) tokenizer, which is commonly used in GPT models. If you’re curious to see how different tokenizers behave across popular LLMs, check out this interactive tool: Tiktoken web app.
Tokenization
A tokenizer converts between raw text and a sequence of integers (tokens). These token sequences are then mapped into dense vector representations through an embedding table, which are fed into the neural network.

Tokenization
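As a concrete sketch of that pipeline (assuming PyTorch; the sizes below are illustrative and the token IDs are the GPT-2 IDs for "Hello World!" shown at the end of this post), token IDs simply index into an embedding table:
import torch

vocab_size, d_model = 50257, 768              # hypothetical GPT-2-like sizes
embedding = torch.nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([15496, 2159, 0])    # GPT-2 token IDs for "Hello World!"
vectors = embedding(token_ids)                # dense vectors fed into the network
print(vectors.shape)                          # torch.Size([3, 768])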
In Python, raw text is represented as a Unicode string, which is a sequence of Unicode characters.
text = "Hello 👋"
A tokenizer must therefore support two operations:
- Encode a string into tokens.
- Decode tokens back into a string.
The tokenizer class usually implements both methods. The vocabulary size is the number of possible tokens (integers).
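A minimal sketch of that interface (the class and method names here are just illustrative):
class Tokenizer:
    """Interface every tokenizer implements."""
    def encode(self, text: str) -> list[int]:
        """Convert a string into a list of token IDs."""
        ...

    def decode(self, tokens: list[int]) -> str:
        """Convert a list of token IDs back into a string."""
        ...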
There are a few different ways to perform tokenization.
Character-level
The simplest way is to tokenize by character. To convert each Unicode character into a code point (an integer), we can use the ord() function.
text = "Hello 👋"
for t in text:
print(f"{t} -> {ord(t)}")
H -> 72
e -> 101
l -> 108
l -> 108
o -> 111
-> 32
👋 -> 128075
These integer code points can then be converted back using the chr() function.
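For example:
print(chr(72), chr(128075))   # H 👋
print("".join(chr(cp) for cp in [72, 101, 108, 108, 111, 32, 128075]))  # Hello 👋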
But this approach requires modeling the entire space of Unicode symbols to represent all strings — resulting in a base vocabulary of over 150K. Such a large vocabulary causes two problems:
- Model Size: The size of both the embedding layer and the language-modeling head depends on the vocabulary size. A larger vocabulary increases the number of parameters, leading to higher memory and compute requirements.
- Risk of Underfitting: Many characters (e.g. 👋) occur very rarely in the training data, making it difficult for the model to learn meaningful representations.
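To get a feel for the model-size issue, here is a quick back-of-the-envelope calculation, assuming a hypothetical embedding dimension of 768 (the size used by GPT-2 small):
vocab_size = 150_000     # roughly the number of assigned Unicode characters
d_model = 768            # hypothetical embedding dimension
embedding_params = vocab_size * d_model
lm_head_params = vocab_size * d_model    # output projection back onto the vocabulary
print(f"{embedding_params + lm_head_params:,}")  # 230,400,000 parameters just for vocab handling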
Unicode Encodings (Byte-level)
Instead of mapping directly to large integer code points, Unicode characters can be represented as sequences of bytes (integers 0–255) using an encoding scheme. The most common is UTF-8, which maps each code point into 1 to 4 bytes.
UTF-8 Encoding Rules:
- Code points from 0–127 (Basic ASCII) → Stored in 1 byte.
- Code points from 128–2047 → Stored in 2 bytes.
- Code points from 2048–65535 → Stored in 3 bytes.
- Code points 65536+ → Stored in 4 bytes.
text = "Hello 👋"
for t in text:
print(f"{t} -> {t.encode("utf-8")}, {list(t.encode("utf-8"))}")
H -> b'H', [72]
e -> b'e', [101]
l -> b'l', [108]
l -> b'l', [108]
o -> b'o', [111]
-> b' ', [32]
👋 -> b'\xf0\x9f\x91\x8b', [240, 159, 145, 139]
Some Unicode characters are represented by a single byte:
Hello => [72, 101, 108, 108, 111, 32] (same as ASCII)
Others take multiple bytes:
👋 => [240, 159, 145, 139]
To decode bytes back to text:
tokens = [72, 101, 108, 108, 111, 32, 240, 159, 145, 139]
print(bytes(tokens).decode('utf-8'))
Hello 👋
So we’ve converted a sequence of Unicode code points (integers drawn from a space of over 150K assigned characters) into a sequence of bytes (integers 0–255).
This reduces the vocabulary size to just 256. However, each character may now expand to multiple tokens, which increases sequence length. For example, "Hello 👋" becomes 10 tokens, resulting in a poor compression ratio.
Problems this creates:
- Shorter context: With a fixed context length, the model can attend to fewer characters.
- Slower training: Since attention is quadratic in sequence length, more tokens mean more computation.
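As a quick sanity check, we can compute the compression ratio (characters per token) for the example above:
text = "Hello 👋"
tokens = list(text.encode("utf-8"))
print(len(text), len(tokens))                                # 7 characters -> 10 byte tokens
print(f"compression ratio: {len(text) / len(tokens):.2f}")   # 0.70 characters per token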
Word-level
Another approach is word-level tokenization, which splits text into words. This aligns well with human language and was a classical NLP technique.
In Python, we can implement this using the regex package (the standard re module doesn’t support the \p{...} Unicode classes):
import regex as re
pattern = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
text = "Hello 👋, how're you?"
print(re.findall(pattern, text))
['Hello', ' 👋', ',', ' how', "'re", ' you', '?']
Each token can be mapped to a unique integer, forming our tokenizer. This produces a compact vocabulary (all unique words in the training data) and good compression.
But there’s a major drawback: it relies on a fixed vocabulary. What happens if a user enters a word that isn’t in the vocabulary? The tokenizer fails (KeyError).
A common workaround is to use an <unk> (unknown) token for out-of-vocabulary words, but this loses information.
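Here’s a minimal sketch of the failure mode and the <unk> workaround (the three-word vocabulary is made up purely for illustration):
word_to_id = {"hello": 0, "world": 1, "<unk>": 2}   # toy vocabulary

def encode_words(words):
    # Out-of-vocabulary words fall back to <unk>, losing their identity
    return [word_to_id.get(w, word_to_id["<unk>"]) for w in words]

print(encode_words(["hello", "tokenizer"]))  # [0, 2] -- "tokenizer" is lost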
Subword-level
Subword tokenization strikes a balance between characters and words. Frequent words are kept whole, while rare words are broken into smaller subword units. This eliminates the need for a special unknown token.

Different levels of tokenization
Characters are first encoded as bytes (keeping vocabulary size manageable), then frequent sequences of bytes are merged into tokens, improving compression. This is exactly what Byte Pair Encoding (BPE) [2] does.
Byte pair encoding (BPE)
The Byte Pair Encoding (BPE) algorithm automatically builds a vocabulary by training on raw text. Note that a tokenizer is typically trained separately from the language model. A diverse and extensive dataset is used to learn an efficient subword vocabulary, enabling it to generalize effectively across domains and languages.
BPE algorithm
1. Convert the input string to a sequence of bytes.
2. Initialize the vocabulary with single characters: the vocabulary starts with all individual characters (e.g., b'a', b'b', b'c', …) or, equivalently, all 256 byte values.
3. Find the most frequent pair: the algorithm identifies the most frequently occurring pair of adjacent tokens (bigram) in the dataset.
4. Merge that pair: the most frequent bigram is merged and assigned a new token ID, which is added to the vocabulary. For example, b't' and b'h' often appear together in words like “the” or “there”, so BPE merges them into a single token b'th'.
5. Repeat: continue iteratively, merging the most common adjacent token pairs. Over time, subwords like b'the', b'ing', b'tion', etc., may emerge, allowing the vocabulary to represent common word segments efficiently.
Each iteration compresses the dataset (fewer tokens), while simultaneously expanding the vocabulary (new subwords). Too much merging reduces sequence length but inflates vocabulary size.
Thus, a balance must be struck between compression ratio and vocabulary size. The number of merges is a key hyperparameter: for example, GPT-1 used 40,000 merges, while GPT-2 used 50,000.
BPE in Code
Let’s walk through a toy implementation.
# Step 1 -> Convert the input string to utf-8 byte tokens
text = "the cat in the hat"
tokens = list(text.encode("utf-8"))
print(tokens)
[116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]
The BPE tokenizer starts with the first 256 byte values (single-byte tokens) as the initial vocabulary:
# Step 2 -> Initialize the vocabulary
vocab = {idx: bytes([idx]) for idx in range(256)} # token_id (int) -> utf-8 bytes
Now we repeatedly find the most frequent adjacent byte pairs and merge them:
from collections import Counter

# Count frequency of each adjacent pair (bigram)
def get_pairs(tokens):
    counts = Counter()
    for i in range(len(tokens) - 1):
        counts[(tokens[i], tokens[i+1])] += 1
    return counts

# Merge every occurrence of a given bigram into a new token
def merge(tokens, bigram, new_id):
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == bigram:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens
num_merges = 3
for i in range(num_merges):
    # Step 3 -> Find most frequent bigram
    pairs = get_pairs(tokens)
    most_freq_bigram = max(pairs, key=lambda k: (pairs[k], k))
    print("Most frequent bigram:", most_freq_bigram)
    # Step 4 -> Merge and update vocabulary
    new_token_id = 256 + i
    vocab[new_token_id] = vocab[most_freq_bigram[0]] + vocab[most_freq_bigram[1]]
    tokens = merge(tokens, most_freq_bigram, new_token_id)
    print(tokens)
Tie-breaking: If multiple bigrams have the same frequency, choose the lexicographically greater pair.
Iteration 1
Most frequent bigram: (116, 104)
[256, 101, 32, 99, 97, 116, 32, 105, 110, 32, 256, 101, 32, 104, 97, 116]
- Input text: the cat in the hat
- Most frequent bigram (116, 104) corresponds to characters b't' and b'h'
- Replace it with a new token ID: 256
- New byte stream: <256>e cat in <256>e hat
Iteration 2
Most frequent bigram: (256, 101)
[257, 32, 99, 97, 116, 32, 105, 110, 32, 257, 32, 104, 97, 116]
- Input text: <256>e cat in <256>e hat
- Most frequent bigram (256, 101) corresponds to b'th' and b'e'
- Replace it with a new token ID: 257
- New byte stream: <257> cat in <257> hat
Iteration 3
Most frequent bigram: (257, 32)
[258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116]
- Input text: <257> cat in <257> hat
- Most frequent bigram (257, 32) corresponds to b'the' and b' '
- Replace it with a new token ID: 258
- New byte stream: <258>cat in <258>hat
The updated vocabulary after three merges:
...
256: b'th',
257: b'the',
258: b'the '
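We can verify the result by mapping the merged tokens back through the toy vocabulary:
# Decode the merged token stream back to text using the toy vocab
print(b"".join(vocab[t] for t in tokens).decode("utf-8"))   # the cat in the hat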
Special Context Tokens
In GPT-like large language models, it’s common to use special tokens to help manage the flow and structure of the data.
End-of-Text token
One such token is the <|endoftext|> token, which is used to separate two unrelated texts, especially when multiple independent documents or books are concatenated for training. This helps the model understand that even though the texts are joined together, they are separate and distinct. It is also helpful to indicate the stopping point during generation.
GPT-2’s vocabulary includes:
- 256 raw byte tokens,
- 50,000 merged tokens, and
- 1 <|endoftext|> token.
This brings the total vocabulary size to 50,257 tokens.

Screenshot from tiktoken webapp. Note that the <|endoftext|> token is the last token in the vocabulary corresponding to token id 50256.
Fine-Tuned Conversational Models
Conversational fine-tuning often introduces additional special tokens to track the flow of conversation between a user and an assistant. These tokens help to distinguish different parts of the dialogue and maintain coherence. Examples:
- <|im_start|>: Marks the start of a structured block.
- user / assistant: Denote the speaker (either the user or the assistant).
- <|im_sep|>: A separator token used to separate different sections or turns in the conversation.
- <|im_end|>: Marks the end of a block.
These tokens are introduced during the fine-tuning phase when the model is trained on a conversational dataset. Example sequence:
<|im_start|>
<|user|> How are you?
<|im_sep|>
<|assistant|> I'm doing great! How can I assist you today?
<|im_end|>
These tokens are essential for managing structured dialogues, especially in use cases like chatbots or virtual assistants, where knowing the flow of conversation and who is “speaking” is crucial.

Screenshot from the tiktoken web app showing the special tokens used in the gpt-3.5-turbo conversational model.
GPT-2 Tokenizer
Unlike for its other GPT models, OpenAI has released the code for GPT-2 [1] on GitHub, which lets us explore the model’s inner workings, including the tokenizer and its implementation within the architecture.
However, the training code and data for the GPT-2 tokenizer were never released. What we have is the inference code (encoding/decoding), available here: encoder.py
To run inference with GPT-2’s tokenizer, you just need two files:
- encoder.json → stores the vocabulary mappings
- vocab.bpe → stores the learned BPE merge rules
#!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
#!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json
import os, json

with open('encoder.json', 'r') as f:
    vocab = json.load(f)

with open('vocab.bpe', 'r', encoding="utf-8") as f:
    bpe_data = f.read()
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
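As a quick check, the loaded files line up with the vocabulary breakdown mentioned earlier:
print(len(vocab))              # 50257 entries: 256 bytes + 50,000 merges + <|endoftext|>
print(len(bpe_merges))         # 50000 learned merge rules
print(vocab["<|endoftext|>"])  # 50256 -- the last token ID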
Now, let’s see how we could train our own BPE tokenizer and generate files in the same format.
Pre-tokenization
Before merging byte pairs, GPT-2 applies some preprocessing. The key steps are:
1. Preserve special tokens
Special tokens (e.g., <|endoftext|>) are inserted in the training data to encode metadata like document boundaries. These are mapped to fixed IDs and must always remain intact. We therefore split on them during preprocessing so merges never cross their boundaries.
2. Word-level tokenization
If we merged raw bytes across the corpus, words like dog! and dog. might end up with totally different tokens because of punctuation, resulting in suboptimal merges. Instead, GPT-2 first runs a regex-based word tokenizer to split text into meaningful chunks. Then BPE merges happen within each chunk. This avoids bad merges and improves generalization.
3. Cache word counts
Naively, we’d have to scan the entire corpus in every merge iteration to recompute bigram frequencies, which is extremely computationally expensive. Instead:
- Pre-tokenize into “word-like” units.
- Count their frequencies.
- During merges, update these counts rather than re-scanning the dataset.
For example, if the appears 10 times, we can increment the bigram count for (t,h) and (h,e) by 10 immediately, instead of re-reading all text.
import regex as re
from collections import Counter

def run_pretokenizer(input_path: str | os.PathLike, special_tokens: list[str]) -> dict[tuple[bytes], int]:
    """
    Reads the corpus and returns a dict mapping each pre-token (tuple of bytes) to its frequency.
    """
    pre_token_counts = Counter()
    with open(input_path, "r", encoding="utf-8") as f:
        data = f.read()

    # Split on special tokens (e.g., <|endoftext|>) so merges never cross their boundaries
    special_pattern = "|".join(re.escape(tok) for tok in special_tokens)
    docs = re.split(special_pattern, data)

    # Regex-based pre-tokenizer (same as GPT-2)
    pattern = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    for text in docs:
        for match in re.finditer(pattern, text):
            byte_tokens = match.group().encode("utf-8")
            key = tuple(bytes([b]) for b in byte_tokens)
            pre_token_counts[key] += 1
    return pre_token_counts
Each pre-token is represented as a tuple of UTF-8 bytes. We could also store them as integers, but bytes make it easier to stay compatible with GPT-2’s released files.
The pre-token frequency table is represented as dict[tuple[bytes], int], e.g. {(b't', b'h', b'e'): 10}
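For example, with such a cached table and the get_pairs helper from the toy implementation above, bigram frequencies can be accumulated without re-reading the corpus (a small sketch; the counts are made up for illustration):
from collections import Counter

# Toy cached counts: "the" appears 10 times, "hat" 3 times
pre_token_counts = {(b't', b'h', b'e'): 10, (b'h', b'a', b't'): 3}

bigram_counts = Counter()
for byte_tokens, count in pre_token_counts.items():
    for pair, freq in get_pairs(byte_tokens).items():
        bigram_counts[pair] += freq * count   # weight by how often the pre-token occurs

print(bigram_counts[(b't', b'h')])   # 10
print(bigram_counts[(b'h', b'a')])   # 3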
Training the tokenizer
Now let’s train a toy GPT-2–style tokenizer on the Tiny Shakespeare dataset.
def train_tokenizer(
    input_path: str,
    vocab_size: int,
    special_tokens: list[str]) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:

    # Initialize vocab and merge list
    vocab = {idx: bytes([idx]) for idx in range(256)}  # token_id -> UTF-8 bytes
    bpe_merges = []

    # Step 1: Pre-tokenization
    print("Running pre-tokenization...")
    pre_token_counts = run_pretokenizer(input_path, special_tokens)

    # Step 2: Run BPE merges
    print("Running BPE training...")
    num_merges = vocab_size - len(special_tokens) - 256
    for i in range(num_merges):
        # Find the most frequent bigram, weighting each pre-token by its frequency
        bigram_counts = Counter()
        for byte_tokens, count in pre_token_counts.items():
            for pair, freq in get_pairs(byte_tokens).items():
                bigram_counts[pair] += freq * count
        most_freq_bigram = max(bigram_counts, key=lambda k: (bigram_counts[k], k))

        # Merge it into a new token
        new_token_id = 256 + i
        vocab[new_token_id] = most_freq_bigram[0] + most_freq_bigram[1]
        bpe_merges.append(most_freq_bigram)

        # Update the cached counts instead of re-scanning the dataset
        updated_pre_token_counts = Counter()
        for byte_tokens, count in pre_token_counts.items():
            new_byte_tokens = merge(byte_tokens, most_freq_bigram, vocab[new_token_id])
            updated_pre_token_counts[tuple(new_byte_tokens)] += count
        pre_token_counts = updated_pre_token_counts

    # Step 3: Add special tokens at the end
    for i, tok in enumerate(special_tokens):
        vocab[256 + num_merges + i] = tok.encode("utf-8")

    return vocab, bpe_merges
Running with Tiny Shakespeare:
# Download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
vocab, bpe_merges = train_tokenizer("input.txt", vocab_size=276, special_tokens=["<|endoftext|>"])
The two main variables needed for inference are vocab and bpe_merges. These mirror what the GPT-2 tokenizer files give us for inference.
- vocab contains the final vocabulary, mapping token IDs to byte sequences.
- bpe_merges is a list of UTF-8 byte token bigrams, in the order they were learned.
Let’s take a look at our trained vocabulary:
...
262: b'ing',
263: b'on',
264: b'ar',
...
275: b'<|endoftext|>'
Saving & Reading the files
Raw bytes (0–255) cannot be stored directly in a JSON file, since JSON holds Unicode text rather than arbitrary binary data. To make the vocabulary JSON-serializable and fully reversible, GPT-2 maps each byte value to a unique Unicode character that is guaranteed to round-trip correctly through UTF-8.
The helper function below defines that mapping:
def bytes_to_unicode():
    """
    Returns mapping from bytes [0–255] to unique unicode strings.
    GPT-2 uses this trick to serialize bytes reversibly.
    """
    bs = list(range(ord("!"), ord("~")+1)) + list(range(ord("¡"), ord("¬")+1)) + list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    cs = [chr(c) for c in cs]
    return dict(zip(bs, cs))
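Printable bytes map to themselves, while the rest are shifted to unused Unicode code points. For example:
byte_encoder = bytes_to_unicode()
print(byte_encoder[ord("H")])   # 'H'  (printable bytes map to themselves)
print(byte_encoder[32])         # 'Ġ'  (the character GPT-2 uses to represent a leading space)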
After training, we can serialize the resulting vocabulary and merges so they can be reloaded later for inference.
# Save vocab GPT-2 style: vocab.json & merges.txt
special_tokens = ["<|endoftext|>"]
special_token_bytes = {tok.encode("utf-8") for tok in special_tokens}

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

# Convert vocab {id: bytes} → {string: id}
vocab_json = {
    (tok.decode("utf-8") if tok in special_token_bytes
     else "".join(byte_encoder[b] for b in tok)): i
    for i, tok in vocab.items()
}

# Save vocab.json
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab_json, f, ensure_ascii=False, indent=2)

# Save merges as TXT, one merge per line
with open("bpe_merges.txt", "w", encoding="utf-8") as f:
    for a, b in bpe_merges:
        # Convert each bytes object to its unicode string representation
        a_str = "".join(byte_encoder[byte] for byte in a)
        b_str = "".join(byte_encoder[byte] for byte in b)
        f.write(f"{a_str} {b_str}\n")
We can then easily load them back for inference:
# Load vocab
vocab_filepath, merges_filepath = "vocab.json", "bpe_merges.txt"

with open(vocab_filepath, "r", encoding="utf-8") as f:
    vocab_json = json.load(f)

# Convert {string: id} → {id: bytes}
vocab = {}
for tok_str, idx in vocab_json.items():
    if tok_str in special_tokens:
        vocab[idx] = tok_str.encode("utf-8")
    else:
        vocab[idx] = bytes([byte_decoder[ch] for ch in tok_str])

# Load merges
bpe_merges = []
with open(merges_filepath, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
for line in lines:
    a, b = line.split(" ")  # split on single space
    a_bytes = bytes([byte_decoder[ch] for ch in a])
    b_bytes = bytes([byte_decoder[ch] for ch in b])
    bpe_merges.append((a_bytes, b_bytes))
Decoding tokens back to string
To decode a sequence of integer token IDs back to raw text, we simply look up each ID’s entry in the vocabulary (a byte sequence), concatenate the byte sequences, and decode the result to a Unicode string.
def decode(ids: list[int]) -> str:
    byte_arr = b"".join(vocab[idx] for idx in ids)
    text = byte_arr.decode('utf-8', errors='replace')
    return text
Bytes in the range 128–255 are valid in UTF-8 only as part of multi-byte sequences; if they appear alone, decoding fails. That’s why we pass errors='replace', which tells Python to substitute invalid bytes with a replacement character (usually ‘�’) instead of raising an error, so decoding continues without interruption.
Let’s test it:
print(decode([65, 275, 270, 128]))
Allen�
Encoding string to tokens
The process of encoding text by BPE mirrors how we train the BPE vocabulary. We first pre-tokenize the sequence and represent each pre-token as a sequence of UTF-8 bytes. Then, we iteratively apply the same merge operations on each pre-token independently, following the exact order in which the merges were learned.
To do this, we create a bpe_ranks dictionary that assigns each merge a rank based on its priority. We then search for adjacent byte pairs in the text and merge the pair with the lowest rank first. This process continues until the token length becomes less than two or no further valid merges remain.
This approach is also computationally efficient, since we only look up and apply merges that are relevant to the current pre-token — rather than iterating over the entire list of merges each time to check which ones apply.
def encode(text: str, special_tokens=["<|endoftext|>"]) -> list[int]:
    # Split text by special tokens
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    parts = re.split(pattern, text)

    # Pre-tokenizer regex (GPT-2 style)
    pretoken_pattern = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

    # Map merges to ranks (lower rank = learned earlier = higher priority)
    bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

    output_tokens = []
    byte_to_id = {v: k for k, v in vocab.items()}
    for part in parts:
        if not part:
            continue
        if part in special_tokens:
            output_tokens.append(byte_to_id[part.encode("utf-8")])
        else:
            # Split by regex and run BPE on each segment
            for match in re.finditer(pretoken_pattern, part):
                byte_tokens = match.group().encode("utf-8")
                byte_tokens_list = [bytes([b]) for b in byte_tokens]
                while len(byte_tokens_list) > 1:  # Ensure we have at least 2 tokens to merge
                    pairs = get_pairs(byte_tokens_list)  # Get all adjacent pairs
                    # Choose the bigram that appears in bpe_merges with the lowest rank
                    bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float('inf')))
                    # If no pair exists in bpe_merges, stop merging this pre-token
                    if bigram not in bpe_ranks:
                        break
                    # Merge the selected bigram
                    merged_bytes = vocab[256 + bpe_ranks[bigram]]
                    byte_tokens_list = merge(byte_tokens_list, bigram, merged_bytes)
                # Map merged byte tokens to IDs
                for tok in byte_tokens_list:
                    output_tokens.append(byte_to_id[tok])
    return output_tokens
Testing our encoding function:
tokens = encode("It is raining👋 <|endoftext|>")
print(tokens)
print(decode(tokens))
[73, 116, 32, 272, 32, 114, 97, 256, 262, 240, 159, 145, 139, 32, 275]
It is raining👋 <|endoftext|>
Using the Tiktoken library
In practice, I highly recommend using OpenAI’s tiktoken library for tokenization inference, as it is optimized for speed and efficiency. Although it doesn’t include a training loop, it already comes pre-trained with OpenAI’s official tokenizer vocabularies, allowing us to use it directly for encoding and decoding tasks.
Here’s how you can encode a string:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("Hello World!"))
[15496, 2159, 0]
The output shows the token IDs according to the GPT-2 tokenizer.
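Decoding works the same way, and the special token handling matches what we saw earlier (tiktoken only encodes special tokens when they are explicitly allowed):
print(enc.decode([15496, 2159, 0]))   # Hello World!
print(enc.n_vocab)                    # 50257
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]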
References
- Sebastian Raschka’s blog: Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch
- Andrej Karpathy’s video: Let’s build the GPT Tokenizer
- Stanford’s CS336 Lec 1 & Assignment 1
- [1] Radford et al., “Language Models are Unsupervised Multitask Learners”, OpenAI 2019.
- [2] Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units”, ACL 2016.