Dual Vision Transformer

Summary

Several strategies have been proposed to ease the computation of self-attention mechanisms with high-resolution inputs, such as decomposing the global self-attention over image patches into regional and local feature extraction processes, each of which incurs smaller computational complexity. Despite their good efficiency, previous methods rarely explore the overall interactions among all patches, making it difficult to fully capture global semantics.
In this paper, we propose a new Transformer architecture that elegantly exploits global semantics for self-attention learning, namely the Dual Vision Transformer (Dual-ViT). The new architecture introduces a critical semantic pathway that compresses token vectors into global semantics more efficiently and with reduced complexity.
These compressed global semantics then serve as useful prior information for learning finer pixel-level details through a second, pixel pathway.
The semantic path and the pixel path are integrated and jointly trained to propagate enhanced self-attention information through both paths in parallel.
Dual-ViT can thus leverage global semantics to improve self-attention learning without adding much computational complexity.

1. Introduction

Self-attention is the main source of this computational burden, especially for high-resolution inputs, since the representation of each token is updated by attending to all other tokens.
Many works therefore combine self-attention with downsampling to replace the original standard global attention over all image patches.
This approach naturally enables the exploration of regional semantic information, which further promotes the extraction of finer local features.
For example, PVT [12], [13] proposed linear spatial-reduction attention (SRA), which utilizes downsampling operations (e.g., average pooling or strided convolution), as shown in Fig. 1(a). Twins [14] (Fig. 1(b)) adds a locally grouped self-attention layer before SRA to further enhance the representation via intra-region interactions. RegionViT [15] (Fig. 1(c)) decomposes the vanilla attention into regional and local self-attention. However, since the above methods rely heavily on downsampling feature maps into regions, they inevitably ignore the overall interactions among all patches that convey global semantic information.
Fig. 1. (a) Pyramid Vision Transformer (PVT) block, (b) Twins block combining locally grouped self-attention and spatial-reduction attention, (c) region-to-local attention block in RegionViT, and (d) the Dual block in our proposed Dual-ViT. DS: downsampling operation; MSA: multi-head self-attention; FFN: feed-forward layer; LSA: locally grouped self-attention. For simplicity, layer normalization and residual connections are omitted.
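To make the downsampling idea of Fig. 1(a) concrete, below is a minimal PyTorch-style sketch of spatial-reduction attention in which keys and values come from an average-pooled feature map. It is an illustrative simplification under our own naming (class name, pooled size, and head count are arbitrary), not the exact PVT implementation.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Illustrative sketch: queries come from all N = H*W tokens, while keys and
    values are taken from an average-pooled (downsampled) copy of the feature map,
    so the attention cost scales with N * pooled_size**2 instead of N**2."""
    def __init__(self, dim, num_heads=8, pooled_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)                 # downsample K/V tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)                    # tokens -> feature map
        kv = self.pool(kv).flatten(2).transpose(1, 2)                 # (B, pooled_size**2, C)
        out, _ = self.attn(query=x, key=kv, value=kv)                 # attend to pooled tokens
        return out
```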

Among these different combination strategies, few have attempted to study the dependencies between global semantics and finer pixel-level features for self-attention learning.
In this paper, we decompose self-attention learning into global semantic attention and finer feature attention via the proposed Dual-ViT. The motivation is to extract global semantic information (i.e., a parametric semantic query) that can be used as a rich prior to aid finer local feature extraction in the new dual-path design.
Our unique decomposition and integration of global semantics and local features allows us to effectively reduce the number of tokens involved in multi-head attention, thereby saving computational complexity compared to standard global self-attention.
In particular, as shown in Fig. 1(d), Dual-ViT consists of two special pathways, called the "semantic pathway" and the "pixel pathway" respectively. Local pixel-level feature extraction in the pixel pathway is conditioned on the compressed global prior produced by the semantic pathway. Since gradients are passed through both the semantic and pixel pathways, training Dual-ViT can effectively compensate for the information loss of global feature compression while reducing the difficulty of finer local feature extraction. Both effects significantly facilitate self-attention learning in parallel without sacrificing much computational cost, thanks to the smaller attention size and the dependency between the two pathways.
Contributions:
1) We propose a new Transformer architecture called the Dual Vision Transformer (Dual-ViT). As the name suggests, the Dual-ViT network includes two paths: a semantic path and a pixel path. The semantic path extracts a global, semantic view of the input features, while the pixel path focuses on learning finer local features.
2) Dual-ViT exploits the dependency between global semantics and local features along the two paths, with the goal of easing self-attention learning through fewer tokens and smaller attention maps.
3) Compared with VOLO-D4 [16], Dual-ViT achieves 85.7% top-1 accuracy on ImageNet with only 41.1% of the FLOPs and 37.8% of the parameters. For object detection and instance segmentation on COCO, Dual-ViT also improves over PVT [13] by more than 1.4%/0.8% mAP while using 47.3%/42.3% fewer parameters.

2. Related work

Our Dual-ViT is also a multi-scale ViT backbone. Compared with existing multi-scale ViTs that rely heavily on local self-attention or downsampling operations within local windows, Dual-ViT decomposes self-attention modeling into the learning of global semantics and finer features along two paths. Semantic tokens and input features are further combined to propagate enhanced self-attention information in parallel. This unique decomposition-and-integration design not only effectively reduces the number of tokens in self-attention learning, but also imposes an interaction between the two paths, resulting in a better accuracy-latency trade-off.

3. Method

This section first briefly reviews the traditional multi-head self-attention block adopted in existing ViTs and analyzes how prior designs scale down its computational cost. Next, we present a new principled Transformer structure, namely the Dual Vision Transformer (Dual-ViT). Our starting point is to upgrade the typical Transformer structure with a specific dual-path design that triggers the dependency between global semantics and local features to enhance self-attention learning.
Specifically, Dual-ViT consists of four stages, where the resolution of the feature maps gradually shrinks from stage to stage, as in [4]. In the first two stages with high-resolution inputs, Dual-ViT adopts a new Dual block consisting of two paths: (i) a pixel path that captures fine-grained information by refining the input features at the pixel level, and (ii) a semantic path that abstracts high-level semantic tokens at the global level. The semantic path is slightly deeper (has more operations) but contains fewer semantic tokens abstracted from the pixels, and the pixel path treats these global semantics as a prior when learning finer pixel-level details. Such a design conveniently encodes the dependence of finer information on the overall semantics while maintaining a favorable computational cost on high-resolution inputs. The outputs of the two paths are merged and further fed to multi-head self-attention in the last two stages.
A. Traditional Transformer
Traditional Transformer architectures (e.g., [2], [35]) often rely on multi-head self-attention to capture long-range dependencies between inputs.
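For reference, the standard self-attention behind these blocks can be written as

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V,

where Q, K, V ∈ R^(n×d) are linear projections of the n input tokens. Computing Q K^T over all token pairs is what yields the quadratic O(n²d) cost that becomes prohibitive at high resolution.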
Limiting the scope of attention to a local window achieves only linear computational complexity with respect to the input resolution; however, the limited receptive field of the local window adversely hinders the modeling of global dependencies.
Progress has also been made by using downsampling operations (e.g., average pooling in [13] or pooling kernels in [33], [36]) to reduce the computational cost, but these pooling-based operations inevitably lead to information loss, and the overall semantic information among all patches is not fully utilized.
B. Dual Block

To alleviate the above issues, we design a principled self-attention block customized for high-resolution inputs (i.e., the first two stages), namely the Dual block. This new design nicely introduces an additional pathway to facilitate self-attention learning with global semantic information. Figure 2(b) depicts the Dual block architecture in detail. Specifically, a Dual block contains two pathways: the pixel path and the semantic path. The semantic path summarizes the input feature maps into semantic tokens. Afterwards, the pixel path takes these semantic tokens as a rich prior in the form of keys/values and performs multi-head cross-attention to refine the input feature map. In terms of complexity, since the semantic path contains far fewer tokens than the pixel path, the computational cost is reduced to O(nmd + m²d), where n is the number of pixel tokens, m is the number of semantic tokens (m << n), and d is the channel dimension.
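For intuition, take a hypothetical first-stage input of n = 56 × 56 = 3136 pixel tokens and m = 49 semantic tokens (illustrative numbers, not the paper's exact configuration): nm + m² ≈ 1.6 × 10⁵, versus n² ≈ 9.8 × 10⁶ for standard global self-attention, roughly a 60× reduction in attention cost.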
Formally, given the input features x_l of the l-th Dual block, we augment them with an additional parametric semantic query z_l ∈ R^(m×d). The semantic path first contextually encodes the semantic query via self-attention, and then extracts semantic tokens by exploiting the interaction between the refined semantic query and the input features x_l via cross-attention, followed by a feed-forward layer. This operation is performed as follows:
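A sketch of this semantic-path update, in the notation of Fig. 1 (layer normalization and residual connections omitted; this is a reconstruction from the description above rather than the paper's exact equations):

z_l' = MSA(z_l)            (self-attention over the parametric semantic query)
z_l'' = CA(z_l', x_l)      (cross-attention with query z_l' and key/value x_l)
z_{l+1} = FFN(z_l'')       (feed-forward layer producing the semantic tokens)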
The semantic tokens z_{l+1} are then fed into the pixel path, where they act as prior information carrying high-level semantics. At the same time, we treat the semantic tokens as an enhanced semantic query and feed them into the semantic path of the next Dual block.
The pixel path plays a similar role to a traditional Transformer block, except that it additionally employs the semantic tokens derived from the semantic path to refine the input features through cross-attention. More specifically, the pixel path treats the semantic tokens z_{l+1} as keys/values and performs cross-attention as follows:
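Sketched in the same notation (again a reconstruction, with layer normalization and residual connections omitted):

x_l' = CA(x_l, z_{l+1})    (cross-attention with query x_l and key/value z_{l+1})
x_{l+1} = FFN(x_l')        (feed-forward layer producing the refined pixel tokens)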
Since the gradient is back-propagated through both paths, the Dual block can simultaneously compensate for the information loss of global feature compression and, through semantic-pixel interaction, reduce the difficulty of extracting finer local features under the global prior.
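Putting the two paths together, the following PyTorch-style sketch mirrors the Dual block described above. It is a simplified reconstruction for illustration only (layer normalization, residual connections and implementation details are omitted, and all names are our own), not the authors' code.

```python
import torch
import torch.nn as nn

class DualBlock(nn.Module):
    """Sketch of a Dual block: a semantic path that compresses pixel tokens into
    m semantic tokens, and a pixel path that uses them as key/value priors."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.sem_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sem_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sem_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.pix_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pix_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, z):
        # x: (B, n, C) pixel tokens; z: (B, m, C) semantic tokens, with m << n
        # Semantic path: refine the semantic query, then summarize x into semantic tokens.
        z, _ = self.sem_self_attn(z, z, z)                    # cost ~ O(m^2 * C)
        z, _ = self.sem_cross_attn(query=z, key=x, value=x)   # cost ~ O(n * m * C)
        z = self.sem_ffn(z)
        # Pixel path: refine pixel tokens with the semantic tokens as key/value prior.
        x, _ = self.pix_cross_attn(query=x, key=z, value=z)   # cost ~ O(n * m * C)
        x = self.pix_ffn(x)
        return x, z
```

In use, z would be initialized as a learnable parameter of shape (m, C) and carried from one Dual block to the next as the enhanced semantic query.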
C. Merge Blocks
Recall that the Dual blocks in the first two stages exploit the interaction between the two paths, while leaving the internal interactions among local tokens within the pixel path unexploited due to the huge complexity of high-resolution inputs. To alleviate this, we propose a simple yet effective self-attention block (i.e., the merge block) that, in the last two stages (with low-resolution inputs), concatenates the semantic and local tokens and performs self-attention over them, thus enabling internal interactions across local tokens. Figure 2(c) depicts the architecture of the merge block. Specifically, we directly merge the output tokens from the two paths and feed them to a multi-head self-attention layer. Since the tokens from the two paths convey different information, two separate feed-forward layers are employed, one for each path, in the merge block:
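A sketch of this computation (a reconstruction consistent with the description above; layer normalization and residual connections omitted):

[x_l' || z_l'] = MSA([x_l || z_l])    (joint self-attention over the concatenated tokens)
x_{l+1} = FFN_x(x_l'),  z_{l+1} = FFN_z(z_l')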
where [·||·] denotes tensor concatenation and FFN_x and FFN_z are two different feed-forward layers. Finally, we apply global average pooling over the output tokens of both paths to produce the final classification token.
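As with the Dual block, a compact PyTorch-style sketch of the merge block may help; it is again an illustrative reconstruction with hypothetical names, not the released implementation.

```python
import torch
import torch.nn as nn

class MergeBlock(nn.Module):
    """Sketch of a merge block: joint self-attention over the concatenated pixel
    and semantic tokens, followed by a separate feed-forward layer per path."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_x = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_z = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, z):
        # x: (B, n, C) pixel tokens; z: (B, m, C) semantic tokens
        tokens = torch.cat([x, z], dim=1)                   # [x || z]
        tokens, _ = self.attn(tokens, tokens, tokens)       # interactions across all tokens
        x, z = tokens.split([x.shape[1], z.shape[1]], dim=1)
        return self.ffn_x(x), self.ffn_z(z)                 # two separate FFNs, one per path
```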
D. Dual Vision Transformer
Our proposed Dual blocks and merge blocks are essentially unified self-attention blocks. It is therefore straightforward to build a multi-scale ViT backbone by stacking these blocks. Following the basic configuration of existing multi-scale ViTs [4], [12], the complete Dual-ViT contains four stages: the first two stages consist of stacked Dual blocks, while the last two stages consist of merge blocks. Following the design principles of CNN architectures, a patch embedding layer is used at each stage to increase the channel dimension while reducing the spatial resolution. In this work, we propose three variants of Dual-ViT with different model sizes, namely Dual-ViT-S (small), Dual-ViT-B (base) and Dual-ViT-L (large). Note that Dual-ViT-S/B/L have model sizes and computational complexities similar to Swin-T/S/B [4]. Table I details the architecture of all three Dual-ViT variants, where HD_i and C_i are the number of heads and the channel dimension in stage i, and E_x/E_z are the feed-forward expansion ratios for tokens derived from the pixel/semantic path.
E. Differences between our Dual-ViT and previous Vision Transformers
In this section, we discuss in detail the differences between our Dual-ViT and previous related ViT backbones.
RegionViT [15] designed region-to-local attention, which involves two kinds of tokens: regional tokens with a larger patch size and local tokens with a smaller patch size. RegionViT first performs regional self-attention on all regional tokens to learn global information. Local self-attention then exchanges information between each single regional token and its associated local tokens within a local window. Our Dual-ViT differs from this work in two aspects: first, the semantic tokens in Dual blocks are not restricted to equal-sized uniform patches as in RegionViT, and thus are more flexible in encoding semantics; second, the global semantic tokens interact with the pixel path as a whole to compensate for its information loss, whereas RegionViT's regional tokens only interact with their associated local tokens within a local window.
Twins [14] consists of two types of attention operations: local and subsampled self-attention. Specifically, Twins divides the input tokens into groups and performs local self-attention on each group within its local sub-window. To enable interaction between different sub-windows, Twins exploits subsampled self-attention by downsampling the feature maps into regional tokens, which act as the keys and values, as in [13]. The difference between Dual-ViT and Twins is that our semantic tokens are summarized holistically from the entire feature map, while Twins generates each regional token independently from a local sub-window.
CrossViT [37] contains two sets of tokens with different patch sizes, which are fed into two Transformer encoders in two independent branches. Finally, CrossViT fuses each branch's CLS token with the other branch's output patch tokens. In contrast, our Dual-ViT triggers interactions between tokens from the pixel and semantic paths within each Dual block, rather than encoding each branch with a separate Transformer encoder as in CrossViT. Furthermore, Dual-ViT exchanges information between multiple global-level semantic tokens and finer pixel-level features throughout the process, while CrossViT only lets a single compressed CLS token interact with the final output features, which may lack detailed semantics.
Non-deep Networks [38] and Parallel ViT [39] respectively designed CNN/ViT backbones with multiple parallel branches. Each branch in Non-deep Networks is fed feature maps of different scales from the main CNN branch. In each block of Parallel ViT, every branch operates on the same input as in a typical Transformer block (i.e., uniform patches of equal size). In contrast, the input to the semantic path in our Dual-ViT, i.e., the semantic tokens, is not constrained to be uniform feature-map patches of equal size. The semantic tokens are summarized holistically from the entire feature map and are thereby more flexible in encoding semantics. Furthermore, unlike the independent feature encoding of each branch in Non-deep Networks and Parallel ViT, our proposed Dual-ViT frequently exchanges self-attention information between the pixel and semantic paths throughout the architecture.

4. Experiment

We validate the advantages of our Dual-ViT with extensive empirical evidence on multiple vision tasks, including image recognition, object detection, instance segmentation, and semantic segmentation. In particular, we first train the proposed Dual-ViT from scratch on the most common image recognition benchmark, ImageNet [40]. Next, the pre-trained Dual-ViT is fine-tuned on COCO [41] and ADE20K [42] for the downstream tasks of object detection, instance segmentation and semantic segmentation, aiming to evaluate the generalization ability of the pre-trained Dual-ViT.

5. Summary

In this work, we propose the Dual Vision Transformer (Dual-ViT), a new multi-scale ViT backbone that models self-attention learning in two interacting paths: a pixel path for learning finer pixel-level details and a semantic path that extracts overall global semantic information from the input.
The semantic tokens learned in the semantic path further serve as high-level semantic priors to facilitate finer local feature extraction in the pixel path.
In this way, enhanced self-attention information is propagated along the two paths in parallel, pursuing a better accuracy-latency trade-off. Extensive empirical results on a variety of vision tasks demonstrate the superiority of Dual-ViT over state-of-the-art ViTs.

Origin blog.csdn.net/weixin_43722052/article/details/132888641