[Deep Learning] Summary of CNN+Transformer


1. Summary of CNN+Transformer algorithm

Foreword

  • Summarizes CV algorithms that combine the CNN and Transformer frameworks since 2021
  • In a convolutional neural network (CNN), convolution operations are good at extracting local features but remain limited in capturing global feature representations. In a Vision Transformer, the cascaded self-attention modules can capture long-distance feature dependencies but tend to ignore local feature details

CNN and Transformer

CNN has very good performance, largely due to the convolution operation.
Advantages: it collects local features in a hierarchical manner to build better image representations.
Disadvantages: despite its strength in local feature extraction, CNN's ability to capture the global representation is still insufficient, and that global view is important for many high-level computer vision tasks.
One of the most intuitive solutions is to enlarge the receptive field, but doing so can compromise the operation of the pooling layers.

Transformer

  1. ViT method
  • Build a token sequence by splitting each image into patches and adding positional embeddings (see the sketch after this list);
  • A Transformer block is then used to extract parameterized vectors as visual representations.
    Thanks to the self-attention mechanism (Self-Attention) and the multi-layer perceptron (MLP) structure, Vision Transformer can model complex spatial transformations and long-distance feature dependencies to obtain a global feature representation.
    Disadvantages: Vision Transformer tends to ignore local feature details, which reduces the discriminability between background and foreground.
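
As a rough illustration of the tokenization step above, here is a minimal PyTorch sketch (the class token and the initialization details of the real ViT are omitted; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and add positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A stride-p convolution is equivalent to flattening each p x p patch
        # and projecting it linearly to embed_dim.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                 # x: (B, 3, 224, 224)
        x = self.proj(x)                  # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim) token sequence
        return x + self.pos_embed         # learnable positional embeddings
```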

Therefore, some works propose a tokenization module or use CNN feature maps as input tokens to capture the neighborhood information of features. However, these methods still do not fundamentally reconcile local modeling with global modeling.

How to insert a Transformer into a CNN

Related work such as DETR uses a CNN to extract image features and then attaches a Transformer encoder and decoder.

For example, Bottleneck Transformers for Visual Recognition, published on arXiv on January 27, 2021, also uses CNN+Transformer, but in my opinion it takes a more elegant approach:

  1. It incorporates the Transformer's self-attention into the CNN backbone instead of simply stacking the two;

  2. Specifically, the original 3×3 convolution is replaced with MHSA (Multi-Head Self-Attention) in the last three bottleneck blocks of ResNet. These new blocks are named BoT blocks, and the resulting network is named BoTNet.

Benefits:

  1. It can use mature, well-tested CNN structures to extract features; CNN comes with useful priors and inductive biases for the visual domain;

  2. After the CNN has down-sampled the input image, self-attention operates on the smaller feature map, which reduces computation considerably compared with applying self-attention directly to the original image (see the quick cost estimate after this list);

  3. This design can be combined with other methods; for example, it could serve as the backbone in DETR.
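
To make point 2 concrete, here is a back-of-the-envelope estimate, assuming a 224×224 input, a 16×-down-sampled 14×14 feature map, and the usual O(N²·d) cost of self-attention over N tokens (numbers are illustrative):

```python
# Self-attention cost grows quadratically with the number of tokens N = H * W.
def attn_cost(h, w, d=512):
    n = h * w
    return n * n * d  # rough multiply count for the N x N attention map

raw = attn_cost(224, 224)   # attention directly on the 224x224 input
down = attn_cost(14, 14)    # attention on a 16x-down-sampled feature map
print(f"cost ratio: {raw / down:,.0f}x")  # ~65,536x fewer operations
```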

In the MHSA layer, the input feature X is mapped to q, k, and v through three matrices W_Q, W_K, and W_V, representing query, key, and value respectively. Self-attention then computes softmax(q·kᵀ/√d)·v, i.e. each position aggregates the values of all positions, weighted by query-key similarity.
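
Below is a minimal PyTorch sketch of such an MHSA layer operating directly on a CNN feature map; it follows the generic scaled dot-product formulation and omits BoTNet's relative position encodings, so treat it as an illustration rather than the paper's exact block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMHSA(nn.Module):
    """Multi-head self-attention over a (B, C, H, W) feature map."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        # W_Q, W_K, W_V realized as a single 1x1 convolution.
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)           # each (B, C, H, W)
        # Reshape each to (B, heads, H*W, C // heads).
        q, k, v = (t.reshape(b, self.heads, c // self.heads, h * w)
                    .transpose(-2, -1) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                     # (B, heads, H*W, C // heads)
        return out.transpose(-2, -1).reshape(b, c, h, w)
```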

2021 ICCV - Conformer (UCAS & Huawei & Pengcheng)

Details

2021 IEEE/CVF International Conference on Computer Vision (ICCV), 9 May
Paper: Original Paper
Code: Code
References:
1. CNN+Transformer=Better: UCAS & Huawei & Pengcheng Laboratory propose Conformer, 84.1% Top-1 accuracy
2. Paper notes 32 – Conformer: Local Features Coupling Global Representations for Visual Recognition

This paper proposes a hybrid network structure called Conformer, which combines convolution operations with self-attention mechanisms to enhance the learned feature representations. Conformer relies on a Feature Coupling Unit (FCU) to fuse local and global feature representations at different resolutions in an interactive manner. In addition, Conformer adopts a parallel structure to preserve local features and global representations to the greatest possible extent.
The authors show experimentally that, with similar parameter counts and complexity, Conformer outperforms DeiT-B by 2.3% on ImageNet. On the MS-COCO dataset, it outperforms ResNet-101 by 3.7% and 3.6% mAP on object detection and instance segmentation, respectively.

Framework overview

Conformer is a dual-network structure that combines CNN-based local features with Transformer-based global representations to enhance representation learning. It consists of a CNN branch and a Transformer branch, built from combinations of local convolution blocks, self-attention modules, and MLP units. During training, a cross-entropy loss supervises each branch separately, yielding features with both CNN-style and Transformer-style properties (a sketch follows below).
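
A minimal sketch of this dual supervision, assuming a model whose forward pass returns one set of logits per branch (the names here are hypothetical, not the authors' code):

```python
import torch.nn.functional as F

def conformer_loss(cnn_logits, vit_logits, labels):
    """Each branch keeps its own classifier head; both are supervised
    with cross-entropy on the same labels."""
    return F.cross_entropy(cnn_logits, labels) + F.cross_entropy(vit_logits, labels)
```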

Considering the asymmetry between CNN and Vision Transformer features, the authors designed the Feature Coupling Unit (FCU) as a bridge between the two branches.
On the one hand, to fuse the two styles of features, the FCU uses 1×1 convolutions to align channel dimensions, down/up-sampling strategies to align feature resolutions, and LayerNorm and BatchNorm to align feature values.
On the other hand, since the CNN and Vision Transformer branches tend to capture different levels of features (local vs. global), the FCU is inserted into every block to eliminate the semantic gap between them in a continuous, interactive manner. This fusion process can greatly improve the global awareness of local features and the local details of global representations.
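
Below is a minimal sketch of the CNN→Transformer direction of such a coupling unit, following the three alignment steps just described; the module name, pooling choice, and stride are assumptions for illustration, not the paper's exact FCU:

```python
import torch
import torch.nn as nn

class CNNToTransformerFCU(nn.Module):
    """One direction of an FCU-style bridge: CNN feature map -> patch tokens."""
    def __init__(self, cnn_channels, embed_dim, down_stride=4):
        super().__init__()
        # 1x1 convolution aligns the channel dimension.
        self.align_channels = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)
        # Average pooling aligns the spatial resolution with the token grid.
        self.downsample = nn.AvgPool2d(kernel_size=down_stride, stride=down_stride)
        # LayerNorm aligns the feature values with the Transformer branch.
        self.norm = nn.LayerNorm(embed_dim)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (B, C, H, W) from the CNN branch
        x = self.downsample(self.align_channels(x))
        x = x.flatten(2).transpose(1, 2)  # (B, N, embed_dim) token sequence
        return self.act(self.norm(x))
```

The Transformer→CNN direction would mirror this: project the tokens back with a 1×1 convolution, up-sample by interpolation, and apply BatchNorm before adding the result to the CNN feature map.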

Method


Origin: blog.csdn.net/zhe470719/article/details/124196490