Vision Transformer
The Transformer is a deep learning model that was originally developed for machine translation using the self-attention mechanism and has found wide application in Natural Language Processing. Transformers mark a paradigm shift from traditional sequence-to-sequence models and are built on one of the most significant innovations in deep learning research: Attention. The Transformer model was first introduced in the paper Attention is All You Need [1], published by Google Research in 2017. The Attention mechanism aims to mimic traits of human perception. The idea first appeared in a 2014 research paper called Neural Machine Translation by Jointly Learning to Align and Translate [2]. Another paper, Recurrent Models of Visual Attention [3], emphasizes that human perception does not process a scene in its entirety but focuses on fragments of the visual space; the human brain does not process a visual snapshot all at once. Attention grew out of the aim to attend to the most important parts of an image, and the attention mechanism is the core of the Transformer model.
Before delving into the general Transformer architecture and then into the features, architecture, and input processing of its Vision Transformer [4] variant, the following is an exploration of the concepts of the Attention and Self-Attention mechanisms.
The Attention mechanism allows a network to assign a different weight, or amount of attention, to each element in an input sequence. It is an integral component for modelling dependencies between various parts of the input and output sequences, regardless of their distance within those sequences. An attention model therefore enables a learner to better understand the context of an input: when one input item attends strongly to another, this signals a relation between the two, and such items form part of the contextual information of the input item. The attention mechanism was first used for the neural translation use case, and it can be described as a mapping between a query, a set of key-value pairs, and an output. The queries, keys, and values are all represented as embedding vectors. A compatibility function calculates an alignment score between the query and each key, and a weight is assigned to each value based on this score.
The main objective of self-attention is to use the whole sequence to compute a weighted average for each embedding.
The scaled dot-product attention computation receives as inputs queries and keys of dimension dk and values of dimension dv. The query and key vectors are used to calculate a compatibility score that describes how well the queries and keys match. These alignment scores are then turned into weights used for a weighted sum of the value vectors, and this weighted sum is returned as the attention vector. The queries are stacked into a matrix Q, so attention can be computed simultaneously for every query.
Following are the computation model and steps for calculating scaled dot-product attention. The attention layer outputs a context vector for each query; the context vectors are weighted sums of the values, where the similarity between the queries and keys determines the weights assigned to each value. $$ Attention(Q, K, V) = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
1. The query, key, and value matrices are created by stacking the embeddings for each word in the input sequence. The query matrix and the transpose of the key matrix are multiplied together to get a matrix of alignment scores.
2. These scores are then divided by the square root of the dimension of the key vectors.
3. The scaled scores are converted to weights using the softmax function, producing probabilities as the output.
4. The weights and the value matrices are multiplied to get the attention vectors for each query.
When the softmax output is multiplied with V (the value matrix), we get a linear combination of the initial input values, and this linear combination is what is passed on to the decoder.
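As a concrete illustration, here is a minimal sketch of the scaled dot-product attention computation described above, written in PyTorch with randomly generated toy inputs (the shapes and values are assumptions for illustration, not part of the original formulation).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (num_tokens, d_k); V: (num_tokens, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # alignment scores, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # attention weights for each query
    return weights @ V                              # weighted sum of the value vectors

# Toy usage: 3 tokens with d_k = d_v = 4 (random stand-ins for real embeddings)
Q, K, V = (torch.randn(3, 4) for _ in range(3))
context = scaled_dot_product_attention(Q, K, V)     # context vectors, shape (3, 4)
```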
The input items are all embedded first, and the embeddings are scored against each other: the input item being encoded is compared with, and scored against, the other items in the input sequence. Each score is then divided by the square root of the dimension of the keys.
A softmax function is applied to these scores, and each softmax weight is multiplied with the embedding of the corresponding input item to determine how strongly each vector is expressed. These weighted vectors are then added together to construct the self-attention context vector. In this process the input vector attends to all other items in the input sequence.
A schematic diagram of the process of encoding the second word by comparing it with the other words in the input sequence is shown below.
A query for each embedding is constructed by multiplying the embedding vector with a query weight matrix; in other words, the embeddings are passed through a feed-forward (linear) layer. The keys and values are created using the same process with their own weight matrices.
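A minimal sketch of how the queries, keys, and values might be produced from the token embeddings with learned linear layers (the dimensions and layer names here are illustrative assumptions):

```python
import torch
import torch.nn as nn

embed_dim = 8                                        # assumed embedding dimension
to_q = nn.Linear(embed_dim, embed_dim, bias=False)   # query projection
to_k = nn.Linear(embed_dim, embed_dim, bias=False)   # key projection
to_v = nn.Linear(embed_dim, embed_dim, bias=False)   # value projection

tokens = torch.randn(3, embed_dim)                   # 3 token embeddings (random stand-ins)
Q, K, V = to_q(tokens), to_k(tokens), to_v(tokens)   # inputs to the attention computation
```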
In the multi-head self-attention mechanism, each head uses a different linear transformation to represent different relationships and associations between input tokens. Scaled dot-product attention is calculated using the query, key, and value matrices generated from the word embeddings; in multi-head attention, the attention mechanism is applied in parallel to multiple sets of these matrices. In the Transformer model, different representations are constructed by linearly transforming the original embeddings into query, key, and value matrices for each head, so each head can learn different relationships between input tokens and produce different weight vectors. Because multiple heads apply the attention mechanism at the same time to multiple sets of matrices, computation time is also saved. The number of heads determines how many sets of these matrices are needed to perform the multi-head operation. The attention outputs computed by the heads are merged horizontally.
1. The inputs to the multi-head attention block are the query, key, and value matrices. Each type of matrix has as many sets as there are heads in the multi-head block.
2. Apply the scaled dot-product attention mechanism to every set of queries, keys, and values.
3. Concatenate the outputs of all heads into a single matrix.
4. Lastly, apply another linear transformation to get the output context vectors.
The following diagram shows an example of the matrix multiplications executed to compute a merged attention vector. The example has two heads; the number of columns in the input matrices (query, key, and value) is the dimension of the embedding vectors, and the number of rows is the number of tokens in the sequence. If the embedding vectors have dimension 3 and there are 3 tokens in the sequence, then all the input matrices are 3 x 3. The input matrices are linearly transformed by multiplying them with 3 x 2 weight matrices. These weight matrices give a different representation for each head and are the key components of the parallel attention computation. After the scaled dot-product computation, each head returns an attention matrix of dimension 3 x 2 in this scenario (in the standard Transformer, this per-head size is the query/value dimension divided by the number of heads).
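A sketch of this two-head example in code, using random matrices in place of the learned projections (the weight values are placeholders; only the shapes follow the example above):

```python
import torch
import torch.nn.functional as F

tokens = torch.randn(3, 3)                       # 3 tokens, embedding dimension 3
head_outputs = []
for _ in range(2):                               # two heads
    Wq, Wk, Wv = (torch.randn(3, 2) for _ in range(3))       # per-head 3 x 2 projections
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv          # each 3 x 2
    scores = Q @ K.T / 2 ** 0.5                              # 3 x 3 alignment scores
    head_outputs.append(F.softmax(scores, dim=-1) @ V)       # 3 x 2 attention output per head
merged = torch.cat(head_outputs, dim=-1)         # heads concatenated horizontally: 3 x 4
Wo = torch.randn(4, 3)                           # final linear transformation
context = merged @ Wo                            # merged context vectors: 3 x 3
```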
The dimension of the embeddings and the query/value size (with query dimension equal to value dimension) are two hyperparameters that can be tuned when configuring the model.
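For reference, these are the corresponding values in the ViT-Base configuration (the backbone fine-tuned later in this section), where the per-head query/key/value size is the embedding dimension divided by the number of heads:

```python
# ViT-Base hyperparameters (as published for google/vit-base-patch16-224-in21k)
vit_base = {
    "hidden_size": 768,          # embedding dimension D
    "num_attention_heads": 12,   # heads per multi-head attention block
    "head_dim": 768 // 12,       # per-head query/key/value size = 64
    "num_layers": 12,            # stacked encoder blocks
    "mlp_dim": 3072,             # hidden size of the encoder MLP
}
```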
The traditional Transformer model consists of an encoder and a decoder. The input tokens are converted into embedding vectors, which are passed to the stack of encoders. The encoder's output is fed into the stack of decoders, which attends to the context vectors calculated by the encoder and generates the output, namely the next token in the sequence. Standalone encoders or decoders can also be used for specific use cases; for example, the Vision Transformer uses an encoder-only architecture.
The encoder component in the Transformer architecture starts with a multi-head attention layer, in which the linear transformations are learnable parameters. The multi-head attention layer performs self-attention on the input sequence: each item (a word or an image patch) attends to every other item in the input sequence. This is followed by a residual connection and normalization, a feed-forward layer, and another residual connection and normalization. The encoder thus provides a contextual representation of each input item. Like the encoder, the decoder architecture comprises masked multi-head attention layers, residual connections, and normalization blocks. The first attention module is masked so that each position attends only to previous positions. The second attention layer takes the encoder output and allows the decoder to attend to all the encoder items. The decoder component comprises a stack of such decoder layers.
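A minimal sketch of one encoder layer as described above (post-norm ordering: self-attention, add and normalize, feed-forward, add and normalize); the layer sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=8, num_heads=2, d_ff=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (batch, tokens, d_model)
        attn_out, _ = self.attn(x, x, x)           # each token attends to every other token
        x = self.norm1(x + attn_out)               # residual connection + normalization
        x = self.norm2(x + self.ff(x))             # feed-forward + residual + normalization
        return x

block = EncoderBlock()
contextual = block(torch.randn(1, 5, 8))           # 5 tokens in, 5 contextual vectors out
```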
Vision Transformers are state-of-the-art deep learning models that mark a new advancement in the field of computer vision. The idea of ViT was first introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [4]. This paper introduced a transformer-based vision framework that applies the self-attention mechanism to perform image classification. The paper claims that large-scale training allows vision transformers to do without the inductive biases that are built into CNN-based architectures.
A vision transformer uses a transformer encoder as the feature extractor, and a multilayer perceptron that receives the processed features from the encoder and performs the classification.
One of the key challenges in making an image the input to a self-attention-based network is the size of the image. It becomes computationally infeasible for every pixel in a 224x224 image to attend to every other pixel: the resulting attention matrix would be (224 x 224) x (224 x 224), i.e. 50,176 x 50,176, in dimension. As a solution, the input image is divided into patches and a linear embedding is applied to these patches; the embeddings are the input to the network. Image patches are analogous to the word tokens used in NLP Transformers. Following are the steps for tokenizing an image into the input of the Vision Transformer model (a minimal sketch of the patch-splitting step is given after the list):
1. Split an image into a grid of sub-image patches.
2. Embed each patch with a linear projection.
3. Each embedded patch is treated as a token and becomes the input to the Vision Transformer model.
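As referenced in the list above, here is a minimal sketch of the patch-splitting step for a single 224x224 RGB image with 16x16 patches (the array here is a random stand-in for a real image):

```python
import torch

H, W, C, P = 224, 224, 3, 16                         # image height/width/channels, patch size
image = torch.rand(H, W, C)                          # stand-in for a real input image

num_patches = (H // P) * (W // P)                    # 14 * 14 = 196 patches
grid = image.reshape(H // P, P, W // P, P, C)        # carve the image into a patch grid
grid = grid.permute(0, 2, 1, 3, 4)                   # (14, 14, 16, 16, 3): one patch per cell
patch_tokens = grid.reshape(num_patches, P * P * C)  # (196, 768) flattened patch "tokens"
```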
Since the standard transformer input is a sequence of token embeddings, the 3D image is reshaped into a sequence of flattened 2D patches. Following figure shows the input reshaping process.
The original shape of the image is determined by its height, width, and the number of colour channels present in the image:
$$ \large{inputX_{original \ image} \in {\mathbb{R}^{height \times width \times channels}}} $$
The image is then reshaped into the following form:
$$ \large{inputX_{flattened \ 2D \ patches } \in {\mathbb{R}^{number \ of \ patches \times ((patch \ resolution)^2 \times channels)}}} $$
The resulting number of patches is given by:
$$ \large{inputX_{resulting \ number \ of \ patches } = \frac{image \ height \times image \ width} {(patch \ resolution)^2}} $$
The patches resulting from one image constitute a sentence of image patches. Since the flattened patch size $P^2C$ is a high dimension, a linear transformation is applied to bring the result to a lower dimension by multiplying it with a matrix. This transformation matrix maps $P^2C$ to a lower dimension D and is referred to as a trainable linear projection. The output of this projection is called the patch embeddings.
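Continuing the sketch above, the trainable linear projection can be illustrated as a single matrix multiplication; the model dimension D and the random weights below are illustrative assumptions (in a real model the projection is learned). For a 224x224 image with 16x16 patches, this gives N = (224 x 224) / 16^2 = 196 patches, each of flattened size P^2C = 16 * 16 * 3 = 768.

```python
import torch

N, P, C, D = 196, 16, 3, 64                 # patches, patch size, channels, assumed model dim
patch_tokens = torch.rand(N, P * P * C)     # flattened patches, e.g. from the sketch above
E = torch.randn(P * P * C, D)               # trainable linear projection (random placeholder)
patch_embeddings = patch_tokens @ E         # (196, 768) @ (768, 64) -> (196, 64)
```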
In the Transformer architecture the order of the inputs is not naturally preserved, since the inputs are passed in one shot, unlike in RNNs or LSTMs. The Transformer paper (Attention is All You Need [1]) recommends the introduction of positional embeddings to retain the order of the input sequence. Positional embeddings are therefore added to the patch embeddings to retain the positional information.
The resulting matrix of patch projections combined with positional embeddings is passed to the Vision Transformer encoder as the input. The Vision Transformer encoder consists of multi-head self-attention (MSA) layers for feature extraction and an MLP head for image classification. Layer normalization is applied before both the MSA and MLP blocks, and a residual connection is applied after each block.
Following are the computation steps involved in this process. A learnable class token is first prepended to the patch embedding vectors, and the position embedding vectors are added.
$$ \Large{ z_0 = [x_{class}; x^{1}_{p}E; x^{2}_{p}E; \dots; x^{N}_{p}E] + E_{pos}}$$ $$ \Large{\ E \in \mathbb{R^{(P^{2}C)\times{D}}}, \ E_{pos} \in \mathbb{R^{(N+1)\times{D}}}} $$The embedded patches along with the position information are passed as inputs to the Transformer. The Transformer encoder's MLP contains two layers with a GELU non-linearity. The input first passes through the MSA layer, where the attention matrix is calculated. The GELU non-linearity weights inputs by their value, rather than gating inputs by their sign.
$$ \Large{z'_l = MSA(LN(z_{l-1})) + z_{l-1}, \quad l = 1 \dots L }$$The output from the attention layer is passed to the MLP layer, and the output of the final MLP layer is the output sequence returned by the Transformer encoder. The first element in the output sequence is used to predict the class of the input image.
$$ \Large{z_l = MLP(LN(z'_{l})) + z'_{l}, \quad l = 1 \dots L }$$ $$ \Large{y = LN(z^{0}_L) }$$Vision Transformers contain less image-specific inductive bias than CNNs and perform better than ResNets on very large volumes of data, while CNNs remain more suitable for small and medium-sized datasets.
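Putting the equations above together, here is a minimal sketch of the Vision Transformer encoder (pre-norm MSA and MLP blocks, a learnable class token, learned position embeddings, and a classification head on the class token); all sizes are illustrative assumptions, not the ViT-Base values.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, d_model=64,
                 num_heads=4, num_layers=2, num_classes=3):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)                  # trainable projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))  # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))  # E_pos
        self.layers = nn.ModuleList(nn.ModuleDict({
            "ln1": nn.LayerNorm(d_model),
            "msa": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
            "ln2": nn.LayerNorm(d_model),
            "mlp": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model)),
        }) for _ in range(num_layers))
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)                # classification head

    def forward(self, patches):                                    # patches: (batch, N, P^2*C)
        z = self.proj(patches)                                     # patch embeddings
        cls = self.cls_token.expand(z.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed            # z_0
        for blk in self.layers:
            h = blk["ln1"](z)
            z = blk["msa"](h, h, h)[0] + z                         # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
            z = blk["mlp"](blk["ln2"](z)) + z                      # z_l = MLP(LN(z'_l)) + z'_l
        return self.head(self.norm(z[:, 0]))                       # y from the class token

logits = MiniViT()(torch.rand(1, 196, 768))          # one image's patches -> 3 class logits
```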
I have fine-tuned the pretrained vision transformer model google/vit-base-patch16-224-in21k on a custom dataset for a plastic-container shape detection use case. I collected the dataset from the online grocery shopping websites of Coles, Woolworths, Bunnings, and BigW, and from Google Image search. There are 780 images used for training and 30 images used for testing. The task is a supervised classification task, and the class labels are bottles, cans, and sprays.
I have created a model composed of vit-base-patch16-224-in21k and a linear layer for classification. After training for 5 epochs, the recognition accuracy reached 80%. I consider this a good performance because the test data was gathered from a different source. Following are the performance evaluation results of the model.
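For clarity, the following is a sketch of the described model setup using the Hugging Face transformers library; it approximates the architecture outlined above rather than reproducing the exact training code, and the class name and defaults are assumptions.

```python
import torch.nn as nn
from transformers import ViTModel

class ContainerShapeClassifier(nn.Module):
    def __init__(self, num_classes=3):               # bottles, cans, sprays
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.classifier = nn.Linear(self.vit.config.hidden_size, num_classes)

    def forward(self, pixel_values):                  # pixel_values: (batch, 3, 224, 224)
        outputs = self.vit(pixel_values=pixel_values)
        cls_embedding = outputs.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls_embedding)             # class logits
```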
This paper applies Transformers to image recognition. It does not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step; instead, an image is interpreted as a sequence of patches and processed with a standard Transformer encoder as used in NLP. This strategy worked well when coupled with pre-training on large datasets. However, Transformers are not mainstream yet: their operations are compute-heavy, training takes a long time even on good hardware, and the final trained model is too large to host on smaller servers without severe latency issues during inference. Also, the Vision Transformer performed best when pre-trained on Google Brain's large proprietary JFT-300M dataset.