One CV model, part (2): ViT

  References:

  An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale [Paper link]

  [Paper source code]

  [A well-written PyTorch ViT tutorial]

  1. Motivation

  When the paper was written, the common practice in the research community was to use CNNs (convolutions) for image data and Transformers for language data. Readers may want to pause and think about why this split existed. Here is my understanding. The strength of a CNN lies in the design of the convolution kernel, which provides multi-scale receptive fields and weight sharing, so a stack of convolutional layers can efficiently extract visual features from an image. The success of the Transformer lies in the self-attention structure, which works like magic: it lets the network completely abandon the sequential processing of recurrent networks (LSTM, RNN) when handling sequence data (this refers especially to the encoder side; on the decoder side there are still many successful autoregressive designs).

  Therefore, the authors naturally asked: given the structure of image data, can we design a network that uses a Transformer to extract image features? ViT was born.

  2. Challenge

  As a pioneering network structure, coming up with the architecture itself was the biggest challenge.

  3. Idea

  The Transformer is essentially a structure for processing sequence data, and its input is a sequence of tokens. So how do we find these tokens in an image? The authors' idea: split the image into patches of 16×16 pixels, flatten each patch, and map it to a feature vector with a trainable linear projection. These feature vectors, combined with positional embeddings, form the input tokens of the Transformer (encoder).
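  To make the patch-to-token step concrete, here is a minimal PyTorch sketch of patch embedding (my own illustration, not the paper's code). The image size, patch size, and embedding dimension are just example values; the Conv2d with kernel = stride = patch size is a common shortcut that is equivalent to slicing, flattening, and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each to a token, add positions (sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv2d with kernel = stride = patch_size is equivalent to
        # cutting P x P patches, flattening them, and applying one shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable positional embedding per patch token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                  # x: (B, C, H, W)
        x = self.proj(x)                   # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, N, D): a sequence of patch tokens
        return x + self.pos_embed          # add positional information

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```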

  4. Method

  The model architecture diagram is as follows:

  Looking at this network diagram, three words come to mind: concise, intuitive, and reasonable. The model structure is clear at a glance.

  1) First split the image into patches and extract a feature vector from each one.

  2) Concatenate these with the [CLASS] token and feed the sequence through a stack of Transformer Encoder layers.

  3) After the last Transformer Encoder layer, take the feature corresponding to the [CLASS] token as the representation of the image.

  4) Attach a prediction head to complete the classification task (see the sketch below).
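  Here is a rough end-to-end sketch of steps 1)-4), written by me for illustration. It reuses the PatchEmbedding module from the earlier snippet and PyTorch's built-in nn.TransformerEncoder; the hyperparameters roughly follow ViT-Base but are only example values, and for simplicity the positional embedding here covers only the patch tokens (in the paper it also covers the [CLASS] token).

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """A simplified ViT-style classifier (sketch), assuming PatchEmbedding from above."""
    def __init__(self, num_classes=1000, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)       # step 1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [CLASS] token
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)  # step 2)
        self.head = nn.Linear(embed_dim, num_classes)                 # step 4)

    def forward(self, x):
        tokens = self.patch_embed(x)                       # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1)           # prepend [CLASS]
        tokens = self.encoder(tokens)                      # (B, N+1, D)
        return self.head(tokens[:, 0])                     # step 3): read out the [CLASS] feature

logits = SimpleViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```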

  Regarding the method, I would like to mention two more points worth noting. The first is the [CLASS] token. It actually comes from BERT, where it sits at the beginning of the sentence. Is it strictly necessary to use the feature corresponding to the [CLASS] token? It seems one could also obtain a representation of the image by directly applying a pooling operation over the patch tokens. Note also that the embedding of this [CLASS] token is a learnable parameter.
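  As an illustration of the two readout options discussed above (my own sketch, not code from the paper): taking the [CLASS] token feature versus mean-pooling the patch tokens.

```python
import torch

# Encoder output: (B, 1 + num_patches, D), with the [CLASS] token at position 0.
encoded = torch.randn(2, 197, 768)

cls_repr  = encoded[:, 0]                # option A: feature of the [CLASS] token
mean_repr = encoded[:, 1:].mean(dim=1)   # option B: average-pool the patch tokens

print(cls_repr.shape, mean_repr.shape)   # torch.Size([2, 768]) torch.Size([2, 768])
```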

  The second point is that the "Transformer" proposed in the paper is really just a Transformer Encoder, the same as in BERT. The original Transformer has both an encoder and a decoder; here only the Encoder is used to extract the image representation. Pay attention to this! After all, spelling out "Encoder" in the name would give "ViTE", which is obviously not very elegant.

  5. Result

  1) With a small amount of training data, it failed to surpass the SOTA ResNet-based models.

  2) With a large amount of training data, it surpassed the (CNN-based) SOTA models.

  The comparison above refers to classification, i.e., the recognition task.

  I will not paste the specific result figures here. ViT has since shone in CLIP and other models that need to extract visual features, which is ample proof of its effectiveness.
