Vision Transformers vs. Convolutional Neural Networks

Introduction:

In this tutorial, we learn about the difference between the Vision Transformers (ViT) and the Convolutional Neural Networks (CNN). Transformers have become the choice in NLP due to their effectiveness and flexibility. In computer vision, convolutional neural network (CNN) architectures still dominate, but some researchers have tried to combine CNNs with personal ones. After pre-training on large datasets, the Vision Transformer (ViT) outperforms state-of-the-art convolutional networks at multiple scales while requiring fewer computational resources for training.

The authors tested applying the Transformer model directly to images. They found that the model's accuracy when obtaining information about average data was average compared to ResNet-like architectures. However, when learning about larger datasets, Vision Transformer (ViT) achieves good results and outperforms existing methods of many image recognition methods.

The model of Vision Transformer (ViT) that transforms 2D images into a series of 2D patches. Patches is a fixed-size vector using a linear projection. In addition to the patch order, the state of the Transformer encoder's output is used as a representation before the object is added.

Using the classification head, the image representation is pre-trained or fine-tuned. Position embeddings are added to store information about the position. The sequence of embedding vectors provides access to the Transformer encoder, which has a series of turns consisting of several self-attention and MLP blocks.

Previously, the first choice for image processing was CNN or Convolutional Neural Network. They are good at capturing local spatial patterns through convolution techniques, thus achieving hierarchical feature extraction. CNNs are good at learning from large amounts of image data and have successfully performed important tasks like image classification, object detection, and segmentation.

CNNs are powerful for tracking records in many computing operations. They are effective at processing big data, and Vision Transformers have advantages in understanding key concepts and situations where the world depends. However, Vision Transformers often need more information to achieve performance similar to CNNs. Additionally, CNNs have advantages due to their parallelism, making them more applicable to real-time and resource-constrained applications.

What is vision Transformer?

The vision Transformer is shortly known as ViT. It is an image classification model. The ViT is used as a Transformer-like image patch architecture. Split the image into a fixed size of patches. Then, embed each block linearly, add position embeddings and feed the result of the vectors into the standard Transformer encoder. The standard method of adding additional learning "classification symbols" to the sequence is used to perform the classification.

What is Convolutional Neural Network or CNN?

The Convolutional Neural Network is shortly known as CNN. It is a deep learning neural network architecture. In the computer vision, the CNN is widely used. This field of expertise enables computers to understand and interpret visual images or data. Neural networks are very effective when it comes to machine learning. This network is used for many data types, such as images, audio, and text. Different types of neural networks are used for different purposes. For example, we use convolutional neural networks to predict the order of words, similar to image classification, where we use neural networks (LSTMs in particular).

Example:

Here, we give an example of a Vision Transformer vs. Convolutional Neural Network. Here, we will use CNN (Convolutional Neural Network) and Visual Transformer methods to train the visual classification of cat and cow datasets available on Kaggle. Firstly, we need to download the cat and cow files containing 25,000 RGB images from Kaggle. If you have not already, you can read instructions on how to set the Kaggle API certificate here. We can download the file from your current working directory with the help of the given Python code. So, the code is given in below -

# The KaggleApi is imported here
from kaggle.api.kaggle_api_extended import KaggleApi

# Using the KaggleApi() function, we set the api
api = KaggleApi()
api.authenticate()

# For write the current directory, we use here the './' sign
api.dataset_download_files('karakaggle/kaggle-cat-vs-cow-dataset', path='./')

When the Kaggle file downloading is completed, then we can unzip this file using a simple command. The unzip command of this file is given below -

!unzip -qq kaggle-cat-vs-cow-dataset.zip
!rm -r kaggle-cat-vs-cow-dataset.zip

Use the following command to clone the Vision-Transformer GitHub repository. This repository, under the Vision_tr directory, contains all the codes the vision transformers need. So, the command is now given in below -

!git clone https://github.com/PriyankaA/vision-transformer.git
!mv vision-transformer/vision_tr .

Then, we need to clear the downloaded data. It was also preparing to train our image classifier. Create the following utility to clean and load data from Pytorch's DataLoader format. Now, the code is given below -

import torch.nn as nn
import torch
import torch.optim as optim

from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader, Dataset
from PIL import Image
from sklearn.model_selection import train_test_split

import os
# Create the class LoadData 
class LoadData:
# Define the function __init__ with parameter self
    def __init__(self):
        self.cat_path = 'kagglecatsandcows_3067a/PetImages/Cat'
        self.dog_path = 'kagglecatsandcows_3067a/PetImages/Cow'

 # Define the function delete_non_jpeg_files with two parameter self and directory 
  def delete_non_jpeg_files(self, directory):
        for filename in os.listdir(directory):
            if not filename.endswith('.jpg') and not filename.endswith('.jpeg'):
                filePath = os.path.join(directory, filename)
                try:
                    if os.path.isfile(filePath) or os.path.islink(filePath):
                        os.unlink(filePath)
                    elif os.path.isdir(filePath):
                        shutil.rmtree(filePath)
                    print('The deleted file path is', filePath)
                except Exception as e:
                    print('Failed to delete %s. Reason: %s' % (filePath, e))
# Here, we define the function data with a parameter self
    def data(self):
        self.delete_non_jpeg_files(self.cowPath)
        self.delete_non_jpeg_files(self.catPath)

# Here we find the cowList
        cowList = os.listdir(self.cowPath)
        cowList = [(os.path.join(self.cowPath, i), 1) for i in cowList]

# Here we find the catList
        catList = os.listdir(self.catPath)
        catList = [(os.path.join(self.catPath, i), 0) for i in catList]

# Here we find the TotalList by adding catList and cowList
        TotalList = catList + cowList

        trainList, testList = train_test_split(TotalList, testSize = 0.4)
        trainList, valueList = train_test_split(trainList, testSize = 0.4)
        print('The train list is', len(trainList))
        print('The test list is', len(testList))
        print('The value list is', len(valueList))
        return trainList, testList, valueList

# Here we done the data Augumentation
transform = transforms.Compose([
    transforms.Resize((230, 230)),
    transforms.RandomResizedCrop(230),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Create the class dataset
class dataset(torch.utils.data.Dataset):

# Define the function __ init __ with three parameter self, fileList and transform
    def __init__(self, fileList, transform=None):
        self.file_list = fileList
        self.transform = transform

    # Define the function __len__ with parameter self for find the length of the dataset
    def __len__(self):
        self.filelength = len(self.fileList)
        return self.filelength

    # Define the function __getitem__ with two parameter self and index
    def __getitem__(self, index):
        imagePath, label = self.file_list[index]
        image = Image.open(imagePath).convert('RGB')
        image_transformed = self.transform(image)
        return image_transformed, label

Approach of Convolutional Neural Network (CNN):

The CNN model of this image classifier has three 2D convolution layers with kernel size 3, several strides of 2, and a maximum pooling layer of 2. After convolution, the number of fully connected layers is 2, each with 10 nodes. Here, we give the snippet code that is used to illustrate this pattern below:

# Here we create the class CNN
class CNN(nn.Module):

# Define the function __init__ with the parameter self
    def __init__(self):
        super(Cnn, self).__init__()

        self.layer1 = nn.Sequential(
            nn.Conv2d(2, 14, kernel_size=3, padding=0, stride=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.layer2 = nn.Sequential(
            nn.Conv2d(14, 28, kernel_size=3, padding=0, stride=2),
            nn.BatchNorm2d(28),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.layer3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=0, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.fc1 = nn.Linear(3 * 3 * 64, 10)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(10, 2)
        self.relu = nn.ReLU()
# Define the function forward with two parameter self and a
    def forward(self, a):
        OUT = self.layer1(a)
        OUT = self.layer2(OUT)
        OUT = self.layer3(OUT)
        OUT = out.view(out.size(0), -1)
        OUT = self.relu(self.fc1(OUT))
        OUT = self.fc2(OUT)
# Here we return the OUT 
        return OUT

Output:

The deleted file path is kagglecatsandcows_3067a/PetImages/Cow/non_image_file.txt
The train list is 432
The test list is 288
The value list is 115
Length of the training dataset: 432
CNN(
  (layer1): Sequential(
    (0): Conv2d(2, 14, kernel_size=(3, 3), stride=(2, 2))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (layer2): Sequential(
    (0): Conv2d(14, 28, kernel_size=(3, 3), stride=(2, 2))
    (1): BatchNorm2d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (layer3): Sequential(
    (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc1): Linear(in_features=576, out_features=10, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc2): Linear(in_features=10, out_features=2, bias=True)
  (relu): ReLU()
)

Approach of the Vision Transformer:

In this tutorial, we also learn about the vision transformer. This architecture is designed with specific dimensions that can be customized to meet specific needs. The vision transformer architecture is still large for an image dataset of this size. Now we give the code below:

# Here we import ViT, which means Vision Transformer
from vision_tr.simple_vit import ViT

# Define the model here
model = ViT(
    imageSize = 230,
    patchSize = 28,
    numClass = 2,
    dimension = 128,
    depth = 12,
    heads = 8,
    mlp_dim = 1024,
    drop_out = 0.1,
    embedded_dropout = 0.1,
).to(device)

Output:

ViT(
  (patch_embedding): Conv2d(3, 128, kernel_size=(28, 28), stride=(28, 28))
  (position_embedding): PositionalEmbedding1D()
  (transformer): Transformer(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=1024, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=1024, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerEncoderLayer(
        ...
        (similar structure as previous layer)
        ...
      )
      ...
      (11): TransformerEncoderLayer(
        ...
        (similar structure as previous layer)
        ...
      )
    )
  )
  (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (mlp_head): MLPHead(
    (fc1): Linear(in_features=128, out_features=1024, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (fc2): Linear(in_features=1024, out_features=2, bias=True)
  )
)

Here, we define eight parameters in the ViT. Each parameter contains a certain definition, which we discuss in below-

1. imageSize = 230:

Firstly, we discuss about the parameter imageSize. The imageSize is initialized by the value 230. This is discussed about the size of the image. In this example, the image size should be 230x230 pixels.

2. patchSize = 28:

Secondly, we discuss the parameter patchSize. The patchSize is initialized by the value 28. This is divided the size of the image into patches. In this example, the patch size should be 28x28 pixels.

3. numClass = 2:

Thirdly, we discuss the parameter numClass. The numClass is initialized by the value 2. This is discussed about the number of the class for a classifier test. In this example, we take a model with two numbers of the class, such as cat and cow.

4. dimension = 128:

Fourthly, we discuss about the parameter dimension. The dimension is initialized by the value 128. This is discussed about the dimension of the embedding test model. In this example, Embedding captures a representation of each image patch.

5. depth = 12:

Fifthly, we discuss about the parameter depth. In this example, the depth is initialized by the value 12. This is discussed about the depth of the ViT or vision Transformers model. More complex extraction feature is allowed by the higher depth of the model.

6. heads = 8:

Sixthly, we discuss the parameter heads. In this example, the heads are initialized by the value 8. This is discussed about the number of attention heads in the self-attention mechanism model.

7. mlp_dim = 1024:

Seventhly, we discuss about the parameter mlp_dim. In this example, the mlp_dim is initialized by the value 1024. This is discussed about the dimension of the model's multilayer perceptron (MLP) layers. After displaying the self-attention, the multilayer perceptron (MLP) changes the token representation.

8. drop_out = 0.1:

Next, we discuss about the parameter drop_out. In this example, the drop_out is initialized by the value 0.1. This is used for the control of the dropout level. It is a regular technique that is used to prevent overfitting problems. The drop_out parameter can randomly set a subset of input units to 0 during training.

9. embedded_dropout = 0.1:

Lastly, we discuss about the parameter embedded_dropout. In this example, the embedded_dropout is initialized by the value 0.1. This is used to control the dropout level in the embedding model. During training, the embedded dropout option helps prevent over-reliance on special characters or tokens.

The vision Transformer was trained 20 times using a Tesla T4 (g4dn-xlarge) GPU machine for task classification. In the Vision transformer, the training is done 20 times, but in the CNN or Convolutional Neural Network, the training is done 10 epochs. The cause of more training or loss of training is slow convergence. While the Convolutional Neural Network or CNN method achieves 75% accuracy in 10 epochs, the Vision Transformer achieves 69% accuracy and takes longer to train.

Conclusion:

In this tutorial, we are learning about the difference between Vision Transformers and Convolutional Neural Networks. There are differences in model size, memory requirements, accuracy, and performance between CNN and vision transformers. CNN models have always been known for their size and efficient memory usage, making them suitable for constrained environments. It has also proven useful in image processing and increased accuracy in various computer vision applications. On the other hand, the Vision Transformers or ViT is used to improve the performance of certain tasks by providing a powerful way to capture all the dependencies of global and understanding of the content in images.

Moreover, the Vision Transformers or ViT tend to have larger models and higher requirements than the Convolutional Neural Network or CNN. Although they can achieve good accuracy, especially when processing larger data sets, their computational requirements may limit their effectiveness on limited resources. Ultimately, the choice of CNN and the Vision Transformer models depends on the specific needs of the task. It depends on factors such as resources, big data, and trade-offs between models, difficulty, accuracy, and efficiency. As computer vision continues, further advances in both architectures are expected, allowing researchers and practitioners to make it more informed for choices based on their specific needs and limitations. We also learn that the CNN method achieves 75% accuracy in 10 epochs, but the Vision Transformer achieves 69% accuracy.

So, here we conclude that the Convolutional Neural Network or CNN is better than the Vision Transformers or ViT.

Next TopicV-Net in Image Segmentation

← prev next →