PVT v2: Improved Baselines with Pyramid Vision Transformer

Paper address: https://arxiv.org/pdf/2106.13797.pdf
Code address: https://github.com/whai362/PVT

1. Research background

Recent research on vision Transformers is converging on backbone networks designed for downstream vision tasks such as image classification, object detection, and instance and semantic segmentation. For example, Vision Transformer (ViT) was the first to demonstrate that a pure Transformer can achieve state-of-the-art performance in image classification. Pyramid Vision Transformer (PVT v1) showed that pure Transformer backbones can also outperform CNN counterparts in dense prediction tasks such as detection and segmentation. Afterwards, Swin Transformer, CoaT, LeViT, and Twins further improved the performance of Transformer backbones in classification, detection, and segmentation.
This article aims to establish stronger and more practical baselines on top of PVT v1. There are three design improvements, namely:
(1) a linear-complexity attention layer;
(2) overlapping patch embedding;
(3) a convolutional feed-forward network.
These designs are orthogonal to the PVT v1 framework, and when combined with PVT v1 they bring better performance in image classification, object detection, and instance and semantic segmentation. The improved framework is called PVT v2.

2. Implementation details

The three main limitations of PVT v1 are as follows:
(1) Similar to ViT, the computational complexity of PVT v1 is relatively high when processing high-resolution inputs (e.g., 800 pixels on the short side).
(2) PVT v1 treats the image as a sequence of non-overlapping patches, which loses the local continuity of the image to a certain extent.
(3) The positional encoding in PVT v1 has a fixed size, which makes it inflexible when processing images of arbitrary size.
These issues limit the performance of PVT v1 on vision tasks.

1. Linear spatial reduction attention

First, to reduce the high computational cost caused by the attention operation, this paper proposes a linear spatial reduction attention (linear SRA) layer, shown in the figure below. Unlike SRA, which uses convolution for spatial reduction, linear SRA uses average pooling to reduce the spatial dimension (i.e., h × w) to a fixed size (i.e., P × P) before the attention operation. As a result, linear SRA has linear computational and memory cost, like a convolutional layer. Specifically, given an input of size h × w × c, the complexity of SRA and linear SRA is:
[Equation: computational complexity of SRA and linear SRA]
where R is the spatial reduction ratio of SRA, and P is the pooling size of linear SRA, which is set to 7.
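As a rough back-of-the-envelope check of why pooling to a fixed P × P makes the cost linear (a derivation from the standard cost of the two attention matrix products, not a formula copied from the paper): hw queries attend to hw/R² keys in SRA, but only to P² keys in linear SRA, so

$$
\Omega(\mathrm{SRA}) \approx \frac{2\,(hw)^2\,c}{R^2}, \qquad
\Omega(\text{Linear SRA}) \approx 2\,hw\,P^2\,c .
$$

With P fixed (e.g., 7), the cost of linear SRA grows linearly in the number of tokens hw, whereas the cost of SRA remains quadratic, only reduced by a factor of R².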
[Figure: comparison of the spatial reduction in SRA (PVT v1) and linear SRA (PVT v2)]
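A minimal PyTorch-style sketch of linear SRA following the description above (variable names are illustrative; this is not the official implementation):

```python
import torch
import torch.nn as nn

class LinearSRAttention(nn.Module):
    # Keys/values are average-pooled to a fixed P x P grid before attention,
    # so the attention cost is linear in the number of input tokens.
    def __init__(self, dim, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.pool = nn.AdaptiveAvgPool2d(pool_size)   # h x w -> P x P
        self.norm = nn.LayerNorm(dim)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # spatial reduction of keys/values by average pooling to P x P
        x_ = x.transpose(1, 2).reshape(B, C, h, w)
        x_ = self.pool(x_).flatten(2).transpose(1, 2)        # (B, P*P, C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                     # each (B, heads, P*P, C/heads)

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, heads, N, P*P)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, with tokens `x` of shape `(2, 56 * 56, 64)` and `h = w = 56`, the attention matrix is only `(56*56) × 49` instead of `(56*56) × (56*56)`.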

2. Overlapping patch embedding

Second, to model local continuity information, overlapping patch embedding is used to tokenize images. As shown in Figure (a) below, the patch window is enlarged so that adjacent windows overlap by half of their area, and the feature map is zero-padded to keep the resolution. In this work, convolution with zero padding is used to implement overlapping patch embedding. Specifically, given an input of size h × w × c, it is fed into a convolution with stride S, kernel size 2S − 1, padding size S − 1, and c′ kernels. The output size is (h/S) × (w/S) × c′.
[Figure (a): overlapping patch embedding]
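A minimal PyTorch-style sketch of this overlapping patch embedding, using the stride/kernel/padding relation above (illustrative, not the official code):

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    # Strided convolution with kernel 2S-1, stride S and zero padding S-1,
    # so neighbouring patch windows overlap while the output stays h/S x w/S.
    def __init__(self, in_chans=3, embed_dim=64, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=2 * stride - 1,  # 2S - 1
                              stride=stride,               # S
                              padding=stride - 1)          # S - 1
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (B, c, h, w) -> tokens of shape (B, (h/S) * (w/S), c')
        x = self.proj(x)
        _, _, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)
        return self.norm(x), h, w
```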

3. Convolutional feedforward network

Third, this paper removes the fixed-size positional encoding and introduces zero padding-based positional encoding into PVT, as shown in Figure (b) below. A 3×3 depth-wise convolution with padding size 1 is added between the first fully connected (FC) layer and GELU in the feed-forward network.
[Figure (b): convolutional feed-forward network]
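A minimal PyTorch-style sketch of this convolutional feed-forward block (FC → 3×3 depth-wise conv → GELU → FC; illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    # Feed-forward network with a zero-padded 3x3 depth-wise convolution
    # inserted between the first FC layer and GELU; the zero padding itself
    # leaks positional information into the features.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depth-wise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w
        B, N, _ = x.shape
        x = self.fc1(x)                             # (B, N, hidden)
        x = x.transpose(1, 2).reshape(B, -1, h, w)  # back to a 2D feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)            # (B, N, hidden)
        x = self.act(x)
        return self.fc2(x)
```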

4. Details of the PVT v2 series

This paper extends PVT v2 from B0 to B5 by varying the hyperparameters, which are defined as follows:
S_i: the stride of the overlapping patch embedding in stage i;
C_i: the number of output channels of stage i;
L_i: the number of encoder layers in stage i;
R_i: the reduction ratio of SRA in stage i;
P_i: the adaptive average pooling size of linear SRA in stage i;
N_i: the number of heads of the efficient self-attention in stage i;
E_i: the expansion ratio of the feed-forward layer in stage i.
The following table gives the details of the PVT v2 series, which follows the design principles of ResNet:
(1) the channel dimension increases while the spatial resolution shrinks as the network deepens;
(2) most of the computational cost is allocated to stage 3.
[Table: detailed settings of the PVT v2 series (B0–B5)]
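To make the notation above concrete, the sketch below shows how these per-stage hyperparameters could assemble into one four-stage variant. The numeric values are illustrative placeholders, not the settings from the paper's table:

```python
from dataclasses import dataclass

@dataclass
class StageCfg:
    stride: int      # S_i: stride of the overlapping patch embedding
    channels: int    # C_i: output channel width of the stage
    layers: int      # L_i: number of encoder layers in the stage
    sr_ratio: int    # R_i: reduction ratio of SRA
    pool_size: int   # P_i: adaptive average pooling size of linear SRA
    heads: int       # N_i: number of self-attention heads
    mlp_ratio: int   # E_i: expansion ratio of the feed-forward layer

# Illustrative four-stage layout: resolution shrinks and channels grow with depth,
# and most encoder layers (hence most compute) sit in stage 3.
example_variant = [
    StageCfg(stride=4, channels=64,  layers=2, sr_ratio=8, pool_size=7, heads=1, mlp_ratio=8),
    StageCfg(stride=2, channels=128, layers=2, sr_ratio=4, pool_size=7, heads=2, mlp_ratio=8),
    StageCfg(stride=2, channels=320, layers=6, sr_ratio=2, pool_size=7, heads=5, mlp_ratio=4),
    StageCfg(stride=2, channels=512, layers=2, sr_ratio=1, pool_size=7, heads=8, mlp_ratio=4),
]
```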

5. Advantages of PVT v2

Combining these improvements, PVT v2 can
(1) obtain more local continuity of images and feature maps;
(2) handle variable resolution inputs more flexibly;
(3) enjoy the same linear complexity as a CNN.

3. Experimental verification


[Tables: experimental results of PVT v2 on image classification, object detection, and semantic segmentation]
Ablation experiments for PVT v2 are reported in Table 6. All three designs improve the model in terms of performance, parameter count, or computational overhead. Overlapping patch embedding (OPE) is important: comparing #1 and #2 in Table 6, the model with OPE achieves better top-1 accuracy on ImageNet than the model with the original patch embedding (PE) (81.1% vs. 79.8%), and higher AP on COCO (42.2 vs. 40.4). OPE is effective because it models the local continuity of images and feature maps through its overlapping sliding windows.
Convolutional feed-forward networks (CFFN) are also important. Compared with the original feed-forward network (FFN), CFFN contains a zero-padded convolutional layer, which captures the local continuity of the input tensor. Furthermore, since the zero padding in OPE and CFFN introduces positional information, the fixed-size positional embeddings used in PVT v1 can be removed, allowing the model to flexibly handle variable-resolution inputs.

Origin blog.csdn.net/qq_52302919/article/details/127788991