CVPR 2022: On the Integration of Self-Attention and Convolution

Paper | GitHub | Gitee

1. Summary

  • Both convolution and self-attention can be used to learn representations, and there is a fundamental relationship between the two: most of the computation in both paradigms is in fact performed by the same operations. A traditional $k \times k$ convolution can be decomposed into $k^2$ individual 1×1 convolutions followed by shift and summation operations (as sketched below). Likewise, the projections of queries, keys, and values in the self-attention module can be interpreted as multiple 1×1 convolutions, followed by computing the attention weights and aggregating the values. The resulting hybrid model, ACmix, enjoys the benefits of both self-attention and convolution while adding minimal computational overhead compared to pure convolution or pure self-attention counterparts.
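
The decomposition claim can be checked numerically. Below is a minimal sketch, assuming PyTorch, with all tensor sizes chosen arbitrarily for illustration:

```python
# Minimal numerical check of the decomposition (PyTorch assumed; sizes are arbitrary):
# a k x k convolution equals the sum of k^2 1x1 convolutions, each applied to a
# suitably shifted copy of the input.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C_in, C_out, H, W, k = 2, 4, 8, 16, 16, 3
x = torch.randn(B, C_in, H, W)
weight = torch.randn(C_out, C_in, k, k)

# Reference: an ordinary k x k convolution with "same" padding.
ref = F.conv2d(x, weight, padding=k // 2)

# Decomposition: pad once, then for every kernel position (i, j) apply a 1x1
# convolution to the correspondingly shifted input and sum the results.
padded = F.pad(x, [k // 2] * 4)
out = torch.zeros_like(ref)
for i in range(k):
    for j in range(k):
        shifted = padded[:, :, i:i + H, j:j + W]                 # shift operation
        w_1x1 = weight[:, :, i, j].unsqueeze(-1).unsqueeze(-1)   # 1x1 kernel
        out = out + F.conv2d(shifted, w_1x1)                     # 1x1 conv + summation

print(torch.allclose(ref, out, atol=1e-4))  # True: the two computations match
```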

2. Introduction

  • The convolution operation aggregates features over local receptive fields using convolution filter weights that are shared across the entire feature map. This weight sharing introduces a crucial inductive bias for image processing. The attention module instead applies a weighted-average operation based on the context of the input features, where the attention weights are computed dynamically by a similarity function over pairs of related pixels (a minimal sketch of this weighted average follows this list). This flexibility enables the attention module to adaptively focus on different regions and capture more informative features.
  • Specifically, we first project the input feature maps with 1×1 convolutions to obtain a rich set of intermediate features. These intermediate features are then reused and aggregated under the two different paradigms, i.e., by the self-attention and convolution paths, respectively. In this way, ACmix enjoys the benefits of both modules while avoiding performing the expensive projection operations twice.
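
A minimal sketch of the context-dependent weighted average described above, assuming plain global self-attention over pixels in PyTorch (variable names and sizes are illustrative, not the paper's):

```python
# Attention weights come from pairwise similarities of the input itself, unlike the
# static, shared weights of a convolution filter.
import torch

B, C, H, W = 1, 8, 14, 14
feat = torch.randn(B, C, H, W)
tokens = feat.flatten(2).transpose(1, 2)                    # (B, H*W, C): one token per pixel
attn = torch.softmax(tokens @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)  # similarities -> weights
out = (attn @ tokens).transpose(1, 2).reshape(B, C, H, W)   # weighted average of the values
```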

3. Method

3.1 The connection between self-attention and convolution

  • There is a close connection between self-attention and the decomposition of the convolution module. The first stage is a feature learning module, where the two methods share the same operation: 1×1 convolutions that project features into a deeper space. The second stage, in contrast, corresponds to the feature aggregation process, although the two paradigms differ in how they aggregate.
  • From a computational perspective, the 1×1 convolutions performed in stage one of both the convolution and self-attention modules have theoretical FLOPs and parameter counts that scale quadratically with the channel dimension C. In the second stage, by contrast, both modules are lightweight and require little computation (a back-of-the-envelope illustration follows this list).
  • The above analysis therefore shows that (1) convolution and self-attention actually share the same operation of projecting the input feature maps through 1×1 convolutions, which is also the main computational overhead of both modules, and (2) although crucial for capturing semantic features, the aggregation operations in the second stage are lightweight and introduce no additional learnable parameters.
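
The following illustrative arithmetic (my own helper function and numbers, not taken from the paper) shows why the stage-one projections dominate the cost:

```python
# The stage-one 1x1 projections carry all the learnable weights, and both their
# parameter count and FLOPs grow quadratically with the channel dimension C.
def stage_one_cost(H, W, C, num_1x1=3):
    """Cost of num_1x1 projections from C to C channels (query/key/value needs three)."""
    params = num_1x1 * C * C            # each 1x1 conv holds C_in * C_out weights
    flops = num_1x1 * H * W * C * C     # one multiply-add per weight per pixel
    return params, flops

for C in (64, 128, 256):                # doubling C roughly quadruples both costs
    print(C, stage_one_cost(H=56, W=56, C=C))
# The stage-two aggregation (shift/sum or softmax-weighted average) adds no
# learnable parameters, matching observation (2) above.
```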

3.2 Integration of self-attention and convolution

ACmix

  • ACmix consists of two stages:
      In the first stage, input features are projected through three 1×1 convolutions and reshaped into N blocks respectively, resulting in an intermediate feature set of 3×N feature maps.
      In the second stage, there are two paths of self-attention and convolution. For the self-attention path, the corresponding three feature maps are used as query, key and value, following the traditional multi-head self-attention module.
      For the convolution path with kernel size k, a light fully connected layer generates k² feature maps, which are then shifted and aggregated by summation.
      Finally, the outputs of the two paths are added together, with their relative strengths controlled by two learnable scalars (a simplified sketch of the full block follows this list):
    $F_{out} = \alpha F_{att} + \beta F_{conv}$
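
Putting the two stages together, a simplified sketch of such a block is given below, assuming PyTorch; the class name, the use of global rather than windowed attention, and the plain fully connected layer are my own simplifications, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACmixSketch(nn.Module):
    def __init__(self, channels, kernel_size=3, heads=4):
        super().__init__()
        assert channels % heads == 0
        self.k, self.heads = kernel_size, heads
        # Stage one: three 1x1 projections, shared by both paths.
        self.proj_q = nn.Conv2d(channels, channels, 1)
        self.proj_k = nn.Conv2d(channels, channels, 1)
        self.proj_v = nn.Conv2d(channels, channels, 1)
        # Light fully connected layer (a 1x1 conv) turning the intermediate
        # features into k^2 maps for the convolution path.
        self.fc = nn.Conv2d(3 * channels, kernel_size ** 2 * channels, 1)
        # Learnable scalars weighting the two paths.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

        # Self-attention path (global attention here for brevity).
        d = C // self.heads
        def split(t):  # (B, C, H, W) -> (B, heads, H*W, d)
            return t.reshape(B, self.heads, d, H * W).transpose(2, 3)
        attn = torch.softmax(split(q) @ split(k).transpose(2, 3) / d ** 0.5, dim=-1)
        f_att = (attn @ split(v)).transpose(2, 3).reshape(B, C, H, W)

        # Convolution path: fully connected layer -> k^2 feature maps -> shift -> sum.
        maps = self.fc(torch.cat([q, k, v], dim=1)).reshape(B, self.k ** 2, C, H, W)
        pad = self.k // 2
        padded = F.pad(maps, [pad] * 4)
        f_conv = sum(padded[:, i * self.k + j, :, i:i + H, j:j + W]
                     for i in range(self.k) for j in range(self.k))

        # F_out = alpha * F_att + beta * F_conv
        return self.alpha * f_att + self.beta * f_conv

# x = torch.randn(2, 64, 14, 14); y = ACmixSketch(64)(x)  # y.shape == x.shape
```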

Source: blog.csdn.net/u013308709/article/details/129289169