Modelling in computer vision has long been dominated by convolutional neural networks (CNNs). We’ve already discussed famous architectures like VGGNet and ResNet in previous posts, which have served as the primary backbones for a variety of vision tasks.

In contrast, network architectures in natural language processing (NLP) have evolved along a different trajectory. The dominant architecture in NLP is the Transformer, designed for sequence modeling. Models like GPT-3 have achieved remarkable success, scaling to over 100 billion parameters thanks to their computational efficiency and scalability.

Inspired by the success of Transformers in NLP, researchers sought to apply them directly to images with minimal modifications, aiming to replace CNNs entirely. This led to the introduction of Vision Transformers (ViTs) in 2021, marking a paradigm shift in deep learning for vision.

In this post, we’ll explore how to adapt the Transformer architecture for images by implementing a Vision Transformer (ViT). We’ll then dive into CLIP, the first multimodal foundation model that connects text and images, to learn a wide range of visual concepts directly from natural language.

Vision Transformer (ViT)

The paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” introduced the Vision Transformer (ViT) [1], the first Transformer-based architecture to achieve impressive results on the ImageNet dataset.

Traditionally, Transformers process text by tokenizing it into discrete units (words or subwords) and mapping them into continuous embeddings. The challenge is: how do we apply this to images, which are fundamentally different from text?

The key idea is to “tokenize” an image by splitting it into fixed-size patches and treating each patch as a token. As the title suggests, ViT represents an input image (of size $224 \times 224$ in ImageNet) as a sequence of image patches (each of size $16 \times 16$), analogous to sequences of tokens in NLP.

Each patch is then flattened and linearly projected into an embedding vector, forming the standard input to the Transformer.

Split the input into $16 \times 16$ patches.
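
To make the arithmetic concrete, here is a quick sanity check of how many tokens this produces for an ImageNet-sized input:

image_size, patch_size, n_channels = 224, 16, 3

n_patches = (image_size // patch_size) ** 2        # N = (224 / 16)^2 = 196 tokens
patch_dim = n_channels * patch_size * patch_size   # each flattened patch has 3 * 16 * 16 = 768 values

print(n_patches, patch_dim)                        # 196 768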

To preserve spatial information, learnable positional encodings are added to each patch embedding. These encodings help the model understand the relative arrangement of patches, allowing it to capture the image’s underlying 2D structure.

This transformation enables us to feed an image into a standard Transformer architecture — no convolutional layers required.

Choosing the Right Transformer Model

Recall the two broad categories of large language models we’ve discussed before:

  1. Encoder-only models – Primarily used for classification and regression tasks (e.g., sentiment analysis, spam detection, stock price prediction). A prime example is BERT, where a special [CLS] token is prepended to the input sequence, and its final hidden state is used for classification.

  2. Decoder-only models – Designed for language generation tasks like translation, summarization, and question answering. We covered these extensively in our GPT series.

Since the goal is image classification, an encoder-based approach is the most sensible choice. Similar to BERT, an extra learnable classification token is added to the input sequence. After passing through the Transformer encoder, the output corresponding to this token is used to predict the image class label.

Overview of the ViT Pipeline

Let’s visualize the overall architecture:

The Vision Transformer model.

  1. Split an image into fixed-size patches.
  2. Flatten the patches.
  3. Project the flattened patches into the embedding dimension.
  4. Prepend a [cls] token embedding to the sequence.
  5. Add positional embeddings.
  6. Feed the sequence as an input to a standard transformer encoder.
  7. Pretrain the model in a fully supervised setting on a large dataset.
  8. Replace the classification head and fine-tune on a smaller dataset.

Similar to other Transformer models, Vision Transformers (ViTs) require pre-training on large, diverse datasets to generalize well; when trained only on smaller or mid-sized datasets (e.g., ImageNet with ~1.3M images), they struggle to outperform standard CNNs.

Unlike Convolutional Neural Networks (CNNs), which have strong inductive biases (such as spatial locality and translation invariance), ViTs rely solely on self-attention, lacking these built-in properties. As a result, ViTs do not generalize as effectively when trained on limited datasets and often underperform compared to CNNs.

However, when pre-trained on much larger datasets (e.g., ImageNet-21k with 14M images and JFT with 300M images) and then transferred to downstream tasks, the situation changes. Large-scale training helps compensate for the lack of inductive biases, enabling pre-trained ViTs to match or even surpass state-of-the-art performance on several image recognition benchmarks.

This is why I haven’t included ViT training in this post. It’s generally a good idea to adopt pretrained ViTs and fine-tune them on small datasets, rather than training them from scratch.

Training a ViT from scratch vs. fine-tuning a pre-trained ViT. Ref: Sebastian Raschka’s blog
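
As a minimal sketch of that workflow (assuming a recent torchvision release, 0.13 or newer, which ships pretrained ViT weights), we can load ViT-B/16, swap its classification head, and fine-tune it on a small dataset:

import torch.nn as nn
import torchvision

# Load ViT-B/16 pretrained on ImageNet-1k
weights = torchvision.models.ViT_B_16_Weights.IMAGENET1K_V1
model = torchvision.models.vit_b_16(weights=weights)

# Replace the classification head for a 10-class dataset (e.g., CIFAR-10)
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

# Optionally freeze the backbone and train only the new head
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")

# Note: inputs must be resized to 224x224 to match the pretrained patch grid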

ViT vs CNNs

Despite being data-hungry, ViTs offer several key advantages over CNNs:

  • Global Context and Long-Range Dependencies:

    • Each convolution in a CNN looks only at a small local region, requiring multiple layers or pooling operations to capture global context.

    • In contrast, every patch in a ViT can attend to every other patch in a single layer via self-attention. This gives ViTs an intrinsic global receptive field, leading to better scene understanding and feature interactions.

  • Scalability and Efficiency:

    • Scaling CNNs to larger models (more layers, wider filters) often demands architectural innovations such as ResNet or EfficientNet.
    • ViTs, on the other hand, scale naturally with data and compute — performance improves predictably with model size, following Transformer scaling laws.

This is why Vision Transformers have largely replaced CNNs as the foundation of modern computer vision.

Elon Musk confirming that Tesla has transitioned from CNNs (originally pioneered by Yann LeCun) to Transformer-based vision models.

Spelled out in Code

Let’s implement the Vision Transformer (ViT) architecture step by step. Here’s the basic set of imports to get started:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import torchvision

Patch Embeddings (naive implementation)

Ignoring the batch dimension for now, a standard Transformer processes a 1D sequence of token embeddings with shape $(\text{N} \times \text{D})$, where:

  • $\text{N}$ is the number of tokens.
  • $\text{D}$ is the embedding dimension.

Since images are inherently 2D, we convert them into a sequence of patch embeddings through the following steps:

  1. Input image: $(C \times H \times W)$, where $H$ is height, $W$ is width, and $C$ is the number of channels.

  2. Splitting into Patches: We divide the image into $\text{N}$ non-overlapping patches, each of size $(C \times \text{P} \times \text{P})$.

    • $(C \times H \times W) \rightarrow (C \times H/\text{P} \times \text{P} \times W/\text{P} \times \text{P})$.
  3. Reshape: Rearrange the dimensions to group patches separately.

    • $(C \times H/\text{P} \times \text{P} \times W/\text{P} \times \text{P}) \rightarrow (H/\text{P} \times W/\text{P} \times C \times \text{P} \times \text{P})$.
  4. Number of Patches: The number of patches is computed as $\text{N} = (H \cdot W)/\text{P}^2$.

    • $(H/\text{P} \times W/\text{P} \times C \times \text{P} \times \text{P}) \rightarrow (\text{N} \times C \times \text{P} \times \text{P})$.
    • This is analogous to the number of tokens in NLP.
  5. Flatten the Patches: Each patch is flattened into a vector of size $\text{P}^2 C$.

    • $(\text{N} \times C \times \text{P} \times \text{P}) \rightarrow (\text{N} \times \text{P}^2 C)$
  6. Obtain Patch Embeddings: Map the patch dimension to the embedding dimension $\text{D}$ using a linear layer.

    • $(\text{N} \times \text{P}^2 C) \rightarrow (\text{N} \times \text{D})$

Visually, this process is illustrated as follows:

We split the input image into 4 patches, each represented with a different color. We obtain patch embeddings using the naive implementation.

Implementing these steps in code:

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size: int, n_channels: int, emb_dim: int):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Linear(patch_size * patch_size * n_channels, emb_dim)
    
    def forward(self, x: torch.Tensor, visualize: bool = False) -> torch.Tensor:
        B, C, H, W = x.size()

        # Split into patches [B, C, H', p_H, W', p_W]
        x = x.view(B, C, H // self.patch_size, self.patch_size, W // self.patch_size, self.patch_size)  

        # Reshape into [B, H', W', C, p_H, p_W]
        x = x.permute(0, 2, 4, 1, 3, 5)    

        # Flatten the spatial grid of patches into a sequence of N patches [B, N, C, p_H, p_W]
        x = x.flatten(1, 2)                                                                              
        
        if visualize:
            return x                                                                                    

        # Flatten the patches [B, N, C*p_H*p_W]
        x = x.flatten(2, 4)    

        # Project into the embedding dimension [B, N, D]                                                                         
        x = self.projection(x)                                                                          
        return x

It takes a batch of images, splits each image into fixed-size patches, flattens them, and linearly projects them into the embedding dimension to obtain patch embeddings.

If visualize=True, it returns the patches before flattening them, allowing us to plot and inspect their structure. Let’s patchify a sample image from CIFAR-10 and verify the output:

# Download CIFAR-10 dataset
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
cifar10_trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(cifar10_trainset, batch_size=64, shuffle=True)

# Extract a batch of images
for image, label in train_loader:
    break

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Plot the original first image in the batch
axes[0].imshow(image[0].permute(1, 2, 0))
axes[0].set_title('Original Image')

# Plot the patches of first image in the batch
patchify = PatchEmbedding(patch_size=8, n_channels=3, emb_dim=768)
patches = patchify(image, visualize=True)[0]
img_grid = torchvision.utils.make_grid(patches, nrow=4, pad_value=0.9)
axes[1].imshow(img_grid.permute(1, 2, 0))
axes[1].set_title('Patches')

plt.show()

Each CIFAR-10 image has a shape of $(32, 32, 3)$. Using patch_size=8, we get:

  • $\text{N} = (32 \cdot 32) / (8)^2 = 16$ patches in total.
  • Each patch of shape $(8, 8, 3)$.

We arrange the image patches into a grid and visualize them as shown below:

A 32x32x3 image split into 16 patches of size 8x8x3.

This output verifies that our function is accurately splitting the image into the desired patches.

Patch Embeddings (convolution)

Another common approach to obtain patch embeddings directly from images is to use a single convolutional layer, where both the kernel size and stride are set equal to the patch size, and the number of output channels equals the embedding dimension. This convolution acts as a non-overlapping sliding window, extracting and projecting each patch into the embedding space in one step.

Splitting by convolution

  1. Input image: $(C \times H \times W)$

  2. Apply convolution: We apply a convolution with kernel size and stride = $P$, and output channels = $\text{D}$.

    • $(C \times H \times W) \rightarrow (\text{D} \times H/\text{P} \times W/\text{P})$
  3. Number of Patches: The number of patches is computed as $\text{N} = (H \cdot W)/\text{P}^2$.

    • $(\text{D} \times H/\text{P} \times W/\text{P}) \rightarrow (\text{D} \times \text{N})$
  4. Transpose:

    • $(\text{D} \times \text{N}) \rightarrow (\text{N} \times \text{D})$

These steps yield the same result as the naive approach, as shown below:

We split the input image into 4 patches, each represented with a different color. We obtain patch embeddings using convolution.

Implementing in code:

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size: int, n_channels: int, emb_dim: int):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Conv2d(
            in_channels=n_channels,
            out_channels=emb_dim,
            kernel_size=patch_size,
            stride=patch_size
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.size()

        # Output from Conv: [B, D, H', W']
        x = self.projection(x)      

        # Flatten the spatial grid of patches into a sequence of N patches [B, D, N]
        x = x.flatten(2, 3)              

        # Transpose to [B, N, D]
        x = x.transpose(1, 2)                                                                           
        return x

Both methods produce the same results, but the convolution-based approach is more concise in code and computationally efficient.
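
As a quick sanity check, the sketch below assumes the two implementations above are renamed NaivePatchEmbedding and ConvPatchEmbedding (hypothetical names, since both were defined as PatchEmbedding); copying the linear weights into the convolution then makes their outputs match exactly:

import torch

naive = NaivePatchEmbedding(patch_size=8, n_channels=3, emb_dim=768)   # linear version
conv = ConvPatchEmbedding(patch_size=8, n_channels=3, emb_dim=768)     # convolutional version

# The naive flatten order (C, p_H, p_W) matches Conv2d's weight layout [D, C, P, P]
with torch.no_grad():
    conv.projection.weight.copy_(naive.projection.weight.view(768, 3, 8, 8))
    conv.projection.bias.copy_(naive.projection.bias)

x = torch.randn(2, 3, 32, 32)
print(torch.allclose(naive(x), conv(x), atol=1e-5))                    # True (up to float error)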

Architecture

The Transformer model here is largely similar to the GPT-2 architecture we implemented in the previous post. The key difference is that, since we are using the encoder part of the Transformer, it employs self-attention (bidirectional) instead of causal (masked) attention. Aside from this change, the overall structure and operations remain the same.

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, context_len: int, num_heads: int, dropout: float):
        super().__init__()
        self.ln_1 = LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(
            d_model=d_model,
            num_heads=num_heads,
            dropout=dropout,
            causal=False                # Encoder style
        )
        self.ln_2 = LayerNorm(d_model)
        self.mlp = MLP(emb_dim=d_model)
        self.resid_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.resid_dropout(self.attn(self.ln_1(x)))
        x = x + self.resid_dropout(self.mlp(self.ln_2(x)))
        return x

Defining the ViT model

Let’s implement the Vision Transformer model in code as follows:

class ViT(nn.Module):
    def __init__(self, cfg: dict):
        super().__init__()
        self.patch_size = cfg['patch_size']

        # Patch embeddings
        self.patch_embedding = PatchEmbedding(
            patch_size=cfg['patch_size'],
            n_channels=cfg['n_channels'],
            emb_dim=cfg['emb_dim']
        )
 
        # Define class token [1, 1, D] and position embeddings [1, N+1, D] as learnable parameters
        self.cls_token_embedding = nn.Parameter(torch.randn(1, 1, cfg['emb_dim']))
        self.position_embedding = nn.Parameter(torch.randn(1, 1 + cfg['n_patches'], cfg['emb_dim']))

        # Dropout and Transformer encoder stack
        self.embd_dropout = nn.Dropout(cfg['embd_pdrop'])
        self.blocks = nn.Sequential(*[
            EncoderBlock(
                d_model=cfg['emb_dim'],
                context_len=1 + cfg['n_patches'],
                num_heads=cfg['n_heads'],
                dropout=cfg['dropout']
            )
            for _ in range(cfg['n_layers'])
        ])
        
        # Final layer norm and classification head
        self.ln_f = LayerNorm(cfg['emb_dim'])
        self.cls_head = nn.Linear(cfg['emb_dim'], cfg['n_classes'])
 
    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # Input img: [B, C, H, W]

        # Obtain patch embeddings [B, N, D]
        patch_emb = self.patch_embedding(img)                                   
        B, N, _ = patch_emb.size()
 
        # Prepend CLS token to the sequence of patch embeddings
        cls_token_emb = self.cls_token_embedding.expand(B, -1, -1)  # [B, 1, D]
        x = torch.cat((cls_token_emb, patch_emb), dim=1)            # [B, N+1, D]

        # Add positional embeddings
        pos_emb = self.position_embedding[:, :N+1, :]               # [1, N+1, D]
        x = self.embd_dropout(x + pos_emb)                          # [B, N+1, D]

        # Pass through Transformer encoder blocks
        x = self.blocks(x)                                          # [B, N+1, D]
        x = self.ln_f(x)                                            # [B, N+1, D]

        # Classification head on the [CLS] token
        x = x[:, 0, :]                                              # [B, D]
        logits = self.cls_head(x)                                   # [B, num_classes]

        return logits

We use nn.Parameter for the class token and position embeddings because these are learnable parameters that need to be optimized during training, just like the weights in a neural network.

We could also use nn.Embedding for the positional embeddings, as we did in the previous post — this is just another way of doing it.
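
As a rough sketch of that alternative (showing only the two lines that would change, with the same cfg keys as above):

# In __init__: an nn.Embedding table with one row per position (CLS + N patches)
self.position_embedding = nn.Embedding(1 + cfg['n_patches'], cfg['emb_dim'])

# In forward(): look up rows 0..N and add them, broadcasting over the batch
positions = torch.arange(N + 1, device=x.device)                 # [N+1]
x = self.embd_dropout(x + self.position_embedding(positions))    # [B, N+1, D]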

After passing the input through the Transformer blocks, we extract the learned representation of the first token (the classification token) and feed it into the classification head, which outputs logits for the class labels.

Architecture variants

The authors present three variants of the ViT architecture: Base, Large, and Huge, with their respective hyperparameters defined below.

Details of Vision Transformer model variants.

To denote model size and input patch size, models are named using the format ViT-[Size]/[Patch Size]. For example, ViT-B/16 refers to the Base variant with a $16 \times 16$ input patch size.
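
For example, a configuration for ViT-B/16 on $224 \times 224$ inputs using the cfg keys expected by the ViT class above might look like the following (the hidden size, depth, and head count come from the Base variant in the paper; the dropout values and exact key names are assumptions of this post's implementation):

vit_b_16_cfg = {
    'patch_size': 16,
    'n_channels': 3,
    'n_patches': (224 // 16) ** 2,   # 196 patches for a 224x224 input
    'emb_dim': 768,                  # hidden size D of the Base variant
    'n_layers': 12,                  # number of encoder blocks
    'n_heads': 12,                   # attention heads per block
    'dropout': 0.1,                  # illustrative value
    'embd_pdrop': 0.1,               # illustrative value
    'n_classes': 1000,               # ImageNet-1k classes
}

model = ViT(vit_b_16_cfg)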

The Transformer’s sequence length equals the number of patches, $\text{N} = (H \cdot W)/\text{P}^2$: it grows with the number of pixels and is inversely proportional to the square of the patch size. For example, a $224 \times 224$ image with $16 \times 16$ patches yields 196 tokens, while halving the patch size to $8 \times 8$ quadruples that to 784. Models using smaller patches or higher-resolution images are therefore more computationally expensive, as they must process longer sequences (and self-attention cost grows quadratically with sequence length).

To address this limitation, Microsoft Research introduced the Swin Transformer [2], which you can read more about here: Aman Arora’s blog on the Swin Transformer. It has become the go-to transformer-based backbone for vision tasks today.

Contrastive Language-Image Pre-Training (CLIP)

With the success of GPT-3 in zero-shot transfer to downstream tasks, OpenAI set out to extend this capability to image models, aiming to build a multi-modal foundation model. Pre-training on vast amounts of web-sourced data has enabled such task-agnostic models to perform competitively across a wide range of benchmarks with little or no dataset-specific training.

Traditionally, vision models have been pre-trained on crowd-labeled datasets like ImageNet. This form of supervision constrains the model to a fixed set of predefined object categories and typically requires additional fine-tuning on new labeled datasets to adapt to different tasks.

CLIP [3] (Contrastive Language-Image Pre-training) bridges this gap by learning visual concepts through natural language supervision on a large scale. The motivation behind this approach lies in the abundance of image–text pairs available publicly on the internet.

Once trained, CLIP can be applied to virtually any visual classification task without additional training, simply by providing text descriptions of the categories to recognize. In essence, it brings the zero-shot generalization capabilities of GPT-2 and GPT-3 into the visual domain.

Dataset

The training data leverages an abundant source of supervision: text paired with images found across the internet. Specifically, the authors constructed a new dataset of 400 million (image, text) pairs, naming it WIT (WebImageText).

Contrastive Pre-training

The initial approach was to jointly train an image CNN and a text transformer to predict the caption of an image (i.e., an image captioning model). However, predicting the exact words of a caption proved challenging due to the vast variability in descriptions, comments, and related text accompanying images on the internet.

Prior research has shown that contrastive objectives can learn better representations than equivalent predictive objectives. Thus, CLIP was trained on a more efficient proxy task: predicting which text as a whole is paired with which image, rather than predicting the exact words of that text.

CLIP jointly trains an image encoder and a text encoder to learn a shared multi-modal embedding space, mapping both text and images to the same representation. The contrastive learning approach optimizes this space such that similar pairs (image and its corresponding text) stay close, while dissimilar ones are pushed apart.

For example, an image of a dog and the sentence “golden retriever” will have similar embeddings and will be close to each other in the vector space, whereas a sentence like “fluffy cat” will have dissimilar embeddings and will be farther from that image.

Contrastive loss encourages matching (image, text) pairs to have similar embeddings while pushing apart mismatched pairs.

This contrastive loss is computed bidirectionally: each image embedding is compared against all text embeddings in the batch, and vice versa, as illustrated above.

CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples.

Given a batch of $B$ (image, text) pairs, CLIP is trained to predict the correct pairings among the $B \times B$ possible combinations:

  • It maximizes the cosine similarity between the image and text embeddings for the $B$ correct pairs (i.e., the diagonal elements of the similarity matrix).
  • It minimizes the cosine similarity for the $B^2 - B$ incorrect pairings.

These scores are optimized using symmetric cross-entropy loss. Let’s break this down with pseudocode:

  1. Input: Batch of images and texts

    # Image batch: [B, C, H, W]
    # Text batch:  [B, N]
    
    An example of a batch from the dataset.

  2. Extract feature representations: Images are passed through an image encoder and texts through a text encoder to obtain features of dimensions $\text{D}_\text{i}$ and $\text{D}_\text{t}$, respectively. (These features come from different models, so they will have different dimensions.)

    # Extract feature representation
    I_f = image_encoder(images)         # Image features [B, D_i]
    T_f = text_encoder(texts)           # Text features  [B, D_t]
    
  3. Obtain embeddings: A linear layer projects both feature sets into a common embedding space of dimension $\text{D}_\text{e}$.

    # Use linear projection to map to the multi-modal embedding space.
    W_i = nn.Linear(D_i, D_e)
    W_t = nn.Linear(D_t, D_e)
    I_e = W_i(I_f)                     # Image in embedding space [B, D_e]
    T_e = W_t(T_f)                     # Text in embedding space  [B, D_e]
    
  4. Normalize: The embeddings are normalized (L2) to unit vectors, preventing any scale differences.

    # Obtain normalized features
    I_e = nn.functional.normalize(I_e, p=2, dim=1)
    T_e = nn.functional.normalize(T_e, p=2, dim=1)
  5. Compute cosine similarity: A dot product operation between image and text embeddings yields similarity scores.

    # Find cosine similarity
    logits = I_e @ T_e.T               # [B, B]
    
    Cosine similarity matrix of our batch.

  6. Calculate contrastive loss: The cross-entropy loss is computed for each row and each column of the similarity matrix, and the two losses are averaged (divided by 2), since each correct pair is counted twice.

    # Symmetric loss function
    labels = np.arange(B)
    loss_i = cross_entropy_loss(logits, labels, axis=0)
    loss_t = cross_entropy_loss(logits, labels, axis=1)
    loss = (loss_i + loss_t) / 2

The correct labels correspond to the diagonal elements of the similarity matrix. The variables loss_i and loss_t push the logits for image and text similarities to be high along the diagonal and low elsewhere. This is referred to as the contrastive loss.

Using this objective, CLIP learns to recognize a wide range of visual concepts in images directly from natural language, making it applicable to diverse visual classification tasks.

Temperature scaling

Cross-entropy loss includes a softmax function that converts logits into probability scores. CLIP uses temperature scaling (as discussed in our previous post) to adjust the range of logits before applying the softmax, effectively controlling the strength of separation between positive and negative pairs.

To avoid manually tuning this as a hyperparameter, temperature scaling is optimized during training as a log-parameterized multiplicative scalar, called logit_scale.

# Find scaled cosine similarity
logit_scale = np.exp(temperature)
logits = (I_e @ T_e.T) * logit_scale               # [B, B]
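
Putting the pieces together, here is a minimal PyTorch sketch of the full symmetric contrastive loss with a learnable logit_scale. The projection dimensions below are illustrative placeholders; the temperature initialization to 1/0.07 follows the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPContrastiveLoss(nn.Module):
    def __init__(self, d_i: int, d_t: int, d_e: int):
        super().__init__()
        self.W_i = nn.Linear(d_i, d_e)                  # image projection
        self.W_t = nn.Linear(d_t, d_e)                  # text projection
        # Learnable log-parameterized temperature, initialized to 1/0.07
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, I_f: torch.Tensor, T_f: torch.Tensor) -> torch.Tensor:
        # Project both modalities into the shared embedding space and L2-normalize
        I_e = F.normalize(self.W_i(I_f), dim=-1)        # [B, D_e]
        T_e = F.normalize(self.W_t(T_f), dim=-1)        # [B, D_e]

        # Scaled pairwise cosine similarities [B, B]
        logits = self.logit_scale.exp() * I_e @ T_e.T

        # Correct pairings lie on the diagonal
        labels = torch.arange(I_e.size(0), device=I_e.device)
        loss_i = F.cross_entropy(logits, labels)        # image -> text
        loss_t = F.cross_entropy(logits.T, labels)      # text -> image
        return (loss_i + loss_t) / 2

# Usage with random "features" standing in for encoder outputs
loss_fn = CLIPContrastiveLoss(d_i=2048, d_t=512, d_e=512)
loss = loss_fn(torch.randn(8, 2048), torch.randn(8, 512))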

Architecture

CLIP consists of two main components: an image encoder and a text encoder.

  • Image Encoder: The authors experimented with two different models:

    • ResNets: ResNet-50, ResNet-101, RN50x4, RN50x16, and RN50x64.
    • Vision Transformers (ViTs): ViT-B/32, ViT-B/16, and ViT-L/14 (without the classification head).
      • Modification: An additional layer normalization was added to the combined patch and position embeddings before inputting into the transformer.
      • ViTs offered 3x compute efficiency over ResNets and trained faster, making them the preferred choice.
      • The [CLS] token’s output is used as the image feature vector representing the entire image.
  • Text Encoder: A 63M-parameter decoder-only Transformer, similar to GPT-2:

    • Number of layers: $L = 12$.
    • Number of heads: $H = 8$.
    • Embedding dimension: $\text{d}_{model} = 512$.
    • Tokenizer: Lowercased BPE with a vocabulary size of 49,152.
    • Context length: $N = 76$.
    • Special tokens: The sequence is bracketed with [SOS] (start) and [EOS] (end) tokens.
    • The [EOS] token’s output is taken as the text feature vector, which aggregates information from all preceding tokens through causal (masked) self-attention.
  • CLIP is trained from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.

  • A batch size of 32,768 is used, meaning that given an image, CLIP predicts which one of 32,768 text snippets was actually paired with it in the dataset. The contrastive loss depends heavily on the batch size, as it directly determines the number of negative samples available for learning:

    • If the batch is too small and contains only weak negatives (e.g., “cat” vs. “truck”), the model would not get enough contrastive supervision.
    • Large batch sizes ensure the presence of hard negatives (e.g., “corgi” vs. “golden retriever”) that force the model to learn subtle, fine-grained differences between similar concepts.

Zero-shot transfer

CLIP demonstrates impressive zero-shot performance across several benchmarks, including ImageNet. Remarkably, the model was never trained on ImageNet’s 1.28M labeled examples, yet it achieves accuracy comparable to a ResNet-50 that was trained directly on the dataset in a fully supervised manner.

It is pre-trained to predict whether an image and a text snippet (rather than a single class label) are paired together. To bridge the distribution gap between a text snippet and a class label, we use the prompt template “A photo of a {label}.”

Let’s take an example to see how we can perform zero-shot transfer with CLIP:

  1. Create text descriptions from labels – If we are working with ImageNet, which has 1000 possible classes, we create text descriptions for each label using the prompt template (e.g., “A photo of a dog.”).

  2. Obtain text embeddings – We feed these text descriptions into CLIP’s text encoder to obtain 1000 different embeddings corresponding to all possible classes.

  3. Obtain image embeddings – Next, we take the input image we want to classify (e.g., an image of a dog) and embed it using CLIP.

  4. Compute similarity scores – We compute cosine similarity scores between the image embedding and all text embeddings.

  5. Obtain class probabilities – Finally, we pass these similarity scores through a softmax function to obtain a probability distribution over all classes, just as in a typical vision model trained for image classification.

At test time, we convert all of a dataset’s classes into captions such as ‘A photo of a dog’ and predict the class whose caption CLIP estimates best pairs with the given image.
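
As a concrete, hedged sketch of these five steps using the open-source openai/CLIP package (installable with pip install git+https://github.com/openai/CLIP.git; the image path dog.jpg is a placeholder):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Step 1: turn candidate classes into captions using the prompt template
labels = ["dog", "cat", "car"]
text = clip.tokenize([f"A photo of a {label}." for label in labels]).to(device)

# Step 3: preprocess the image to classify ("dog.jpg" is a placeholder path)
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)

# Steps 2-5: embed captions and image, compute scaled cosine similarities, apply softmax
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))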

Like the GPT family, CLIP learns a wide variety of tasks during pre-training, enabling powerful zero-shot transfer for image classification and beyond. It can be used as a foundation model for open-vocabulary recognition, capable of identifying entirely new categories simply by providing their text descriptions.

References
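
[1] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.

[2] Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” ICCV 2021.

[3] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” ICML 2021.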