Paper Notes - DAS

Paper: https://arxiv.org/pdf/2311.12091.pdf

Code: None

Convolutional neural networks (CNNs) excel at recognizing local spatial patterns. For many vision tasks, such as object recognition and segmentation, salient information also lies outside the kernel's boundaries, and the restricted receptive field of CNNs makes them inadequate at capturing this relevant information.

The self-attention mechanism can improve a model's ability to capture global information, but it also increases computational overhead. The authors propose DAS, a fast and simple fully convolutional method that helps focus attention on relevant information. The method uses deformable convolutions to locate relevant image regions and separable convolutions for efficiency.

DAS can be plugged into existing CNNs and propagates relevant information using a gating mechanism. Compared with the O(n^2) computational complexity of Transformer-style attention, the complexity of DAS is O(n).

The authors claim that DAS's ability to focus attention on relevant features leads to performance improvements when added to popular CNNs for image classification and object detection. For example, with a ResNet-50 backbone, DAS improves over the base model on Stanford Dogs (+4.47%), ImageNet (+1.91%), and COCO AP (+3.3%).

This approach outperforms other CNN attention mechanisms while using similar or fewer FLOPs.

1 Introduction

Convolutional neural networks (CNNs) are structurally designed to exploit local spatial hierarchies by applying convolutional filters. Although this makes them efficient and effective on tasks involving local spatial patterns, their inherent design limits their receptive field, potentially hindering the full integration of relevant information that lies beyond the kernel's boundaries.

The Vision Transformer (ViT) can capture global dependencies and contextual understanding in images and shows improved performance on many computer vision tasks. ViT decomposes an image into a sequence of flattened patches and maps them to a sequence of embedding vectors that are fed to a Transformer encoder.

This patch-based approach is adopted because the inherent computational complexity of the attention mechanism grows with the number of input vectors. ViTs effectively reduce the number of inputs by converting images into coarse patches, but dense attention at the pixel level remains computationally challenging. Furthermore, ViTs typically require larger model sizes, more memory, and extensive pre-training compared to CNNs, and their computational requirements limit their usefulness in real-time embedded applications.

While some efforts aim to control the Transformer's quadratic complexity so that dense attention can be performed on long sequences using convolutions, many studies instead integrate self-attention mechanisms directly into CNNs to achieve dense salient-feature attention. The latter is the main motivation for this work.

Attention mechanisms in CNNs can be broadly divided into channel attention, spatial attention, and mixed-domain attention. To keep the attention computation tractable, these methods rely on techniques such as aggregation, subsampling, and pooling, which in turn make it difficult to provide dense attention.

For example, most papers following the stacked-attention-module line of work apply average pooling to the attention-aware feature maps before computing attention weights. A popular strategy is to compute one weight per channel [15, 33]. This can cause important spatial context information to be overlooked.

Some methods extend the above by mixing channel and spatial attention to obtain more robust attention modules. Another extension combines global pooling of two rotated views of the input with global pooling of the original tensor to merge three-dimensional information from the features.

However, these methods still struggle to provide attention to salient features effectively. They treat channel and spatial attention as independent processes, so they do not consider the information in the features holistically, which can lead to information loss.

A promising approach for increasing focus on relevant regions of an image is to use a deformable sampling grid instead of the regular grid of standard convolutional filters. DCNv2 has shown an improved ability to focus on relevant image regions.

Deformable mechanisms have also been used in ViTs to provide deformable attention for fine-grained semantic segmentation and image classification, by finding better keys and queries. However, the authors' main interest lies in providing an attention mechanism directly in CNNs while minimizing changes to the original network or its training. The rest of this article therefore focuses on convolutional attention methods.

The authors' approach is inspired partly by the success of DCNs and partly by the dominance of RAFT-style architecture designs in vision tasks such as optical flow and stereo vision, which recurrently propagate image/feature maps using gated recurrent units (GRUs).

The authors' main contribution is DAS, an efficient gated attention mechanism that can focus and increase attention on salient image regions. It can easily be integrated into any existing CNN, improving performance with minimal added FLOPs and, most importantly, without changing the backbone structure.

[Figure 1]

The authors' attention gate combines the context provided by layer features with the power of deformable convolutions to elegantly increase attention to salient features (see Figure 1). DAS adds only a single hyperparameter and is easy to tune.

The authors show how to add their gate to standard CNNs such as ResNet and MobileNetV2, and through extensive experiments they demonstrate performance improvements across a variety of tasks. To support the claim that CNNs with the attention gate indeed focus and increase attention on task-relevant features, they present Grad-CAM heatmaps that highlight important pixels. They also define and compute a simple metric, the Salient Feature Detection (_sfd_) score, for quantitative comparison of the effectiveness of the attention gate.

2 Related Work

CNN attention mechanisms have been developed to suppress redundant information flowing through the network while managing the computational load. The goal is to increase focus on salient features and reduce or eliminate focus on irrelevant ones.

Channel attention. The Squeeze-and-Excitation Network (SENet) introduces an efficient channel attention mechanism using global pooling and fully connected layers. SENet computes one attention weight per channel, yielding significant performance improvements over base architectures. The Global Second-order Pooling Network (GSoP-Net) adopts second-order pooling to compute the attention weight vector. Efficient Channel Attention (ECA-Net) computes per-channel attention weights through global average pooling and a one-dimensional convolution. These channel attention methods ignore a large amount of spatial context information.
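For reference, a minimal PyTorch sketch of SE-style channel attention (the standard formulation with the conventional reduction ratio of 16, not code from any of the cited papers):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling followed by a two-layer
    FC bottleneck produces one sigmoid-gated weight per channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: (N, C)
        return x * w.view(n, c, 1, 1)     # excite: per-channel reweighting
```

The gate has shape (N, C, 1, 1), so every spatial position in a channel is scaled identically, which is precisely the loss of spatial context noted above.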

Spatial attention. GE-Net spatially encodes information through depthwise convolutions and then integrates the input and the encoded information into subsequent layers.

Dual attention network. The dual attention network (A2-Nets) method introduces a new relation function for non-local (NL) blocks, applying two attention blocks in sequence. The Global Context Network (GC-Net) integrates NL blocks and SE blocks using permutation-based operations to capture long-range dependencies.

Criss-cross contextual information. CC-Net incorporates the contextual information of pixels along criss-cross paths. Parallel sub-feature processing. SA-Net exploits channel splitting to process sub-features in parallel.

All of the spatial attention methods above aim primarily at capturing long-range dependencies, but their computational overhead can be high, as the experimental results later show.

Channel-Spatial Attention. The Convolutional Block Attention Module (CBAM) and Bottleneck Attention Module (BAM) separate channel and spatial attention and combine them in a final step, achieving better performance than SENet. CBAM's attention module includes a multi-layer perceptron (MLP) and convolutional layers, using a fusion of global average and max pooling. A pooling technique called strip pooling was introduced in SPNet [13], utilizing a long, narrow kernel to capture extensive contextual detail for pixel-level prediction tasks.

GALA likewise finds local and global information separately, using two 2D tensors, and integrates them to obtain channel-spatial attention. Triplet attention [26] improves performance by permuting the input tensor and pooling to capture cross-dimensional interactions.

DRA-Net also adopts two independent FC layers to capture channel and spatial relationships. OFDet uses all three types of attention (channel, spatial, and channel-spatial) simultaneously.

In all of the above approaches, the separately computed attention maps must be carefully combined to provide a more comprehensive representation of feature dependencies. Dense attention is also difficult to achieve due to the use of averaging and/or pooling, and the computational overhead is high.

Regarding attention mechanisms in CNNs, one survey divides them into six categories:

  1. channel attention

  2. spatial attention

  3. temporal attention

  4. branch attention

  5. channel & spatial attention

  6. spatial & temporal attention

The attention module proposed by the authors does not separate attention the way the above methods do; it considers the entire feature simultaneously and returns pixel-level attention weights in a very simple way. In summary, existing methods have not yet fully addressed capturing channel, spatial, and correlation information in a holistic manner, which is crucial for understanding contextual information. In most cases, dense attention and/or computational overhead are also issues.

In contrast, the attention gate proposed by the authors combines the advantages of depthwise separable convolution and deformable convolution to provide pixel-level attention in a holistic manner. It enables the model to effectively focus attention on relevant information while maintaining the architectural simplicity of CNNs.

3 Methodology

In this section, the authors propose DAS, an attention mechanism that enhances CNNs' capabilities in a computationally efficient manner and provides focused attention on relevant information. They illustrate its use by placing the DAS attention gate after the skip connection of each main block of the ResNet and MobileNetV2 models. The key steps and components of the approach are described below.

Bottleneck Layer

The authors use a depthwise separable convolution as the bottleneck layer. This operation reduces the number of channels in the feature map from C to αC, where α < 1. The reduction parameter α was chosen to balance computational efficiency against accuracy.


Through ablation experiments, the authors determined the optimal value of α (see Figure 3); the results also show that the model is not overly sensitive to α, the only hyperparameter the method adds. After the bottleneck layer, the authors apply a normalization layer, specifically Instance Normalization, followed by a GELU nonlinear activation. These operations enhance the expressive power of the features and contribute to the effectiveness of the attention mechanism.

The selection of Instance and Layer Normalization is supported by the experimental results in Table 5. Equation 1 describes the compression process, where X is the input feature map and the reduction is performed by the depthwise separable convolution.

In Table 5, the authors demonstrate the importance of using InstanceNorm as the normalization technique before the deformable convolution. Intuitively, instance normalization removes instance-specific contrast information from the images, improving the robustness of the deformable convolutional attention model during training.
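A minimal PyTorch sketch of this bottleneck step, assuming a 3x3 depthwise kernel; the module name and the default alpha of 0.2 are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class DASBottleneck(nn.Module):
    """Compress C channels to roughly alpha*C with a depthwise separable
    convolution, then apply InstanceNorm and GELU (Eq. 1)."""
    def __init__(self, in_channels: int, alpha: float = 0.2):
        super().__init__()
        out_channels = max(1, int(alpha * in_channels))
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.norm = nn.InstanceNorm2d(out_channels, affine=True)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.pointwise(self.depthwise(x))))
```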


Deformable Attention Gate

The compressed feature map obtained in the previous step (Eq. 1), which represents the feature context, is then processed by a deformable convolution using the dynamic grid (via learned offsets) introduced in [5, 38]. This grid is known to help focus on relevant image regions.

The values of the offsets Δp and modulation scalars Δm depend on the features to which the kernel is applied.

After the DCN, the authors apply Layer Normalization followed by a sigmoid activation (Equation 3). The deformable convolution also restores the channel count, mapping the compressed features back to the number of channels of the original input.

The output of Equation 3 is the attention gate. This gate controls the flow of information from the feature map; each element of the gate tensor has a value between 0 and 1. These values determine which parts of the feature map are emphasized or filtered out. Finally, to integrate the DAS attention mechanism into a CNN model, the authors perform an element-wise multiplication between the original input tensor and the attention tensor obtained in the previous step.
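Collecting the steps above, the referenced equations can be sketched as follows (a hedged reconstruction; the original typeset equations are not reproduced in these notes, so the operator notation is assumed):

```latex
X_c = \mathrm{GELU}\big(\mathrm{IN}(\mathrm{DSC}_{C \rightarrow \alpha C}(X))\big) \tag{1}
D   = \mathrm{DCN}_{\alpha C \rightarrow C}(X_c)                                  \tag{2}
A   = \sigma\big(\mathrm{LN}(D)\big)                                              \tag{3}
X'  = A \odot X                                                                   \tag{4}
```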

The result of the multiplication in Equation 4 is the input to the next layer of the CNN, seamlessly integrating the attention mechanism into the model without changing the backbone structure. Unlike earlier deformable attention mechanisms, DAS is intended mainly for CNNs and uses a 3x3 kernel, which suits convolutional backbones.

Deformable DETR applies deformable attention specifically to query features, whereas the DAS attention mechanism considers image features holistically. The mechanism operates as a standalone module and requires no major architectural changes, making it more pluggable than Transformer-based attention methods.
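A minimal PyTorch sketch of the full gate, assuming torchvision's DeformConv2d for the deformable convolution, a small convolution for offset prediction, and GroupNorm(1, C) as a stand-in for layer normalization over feature maps; it reuses the DASBottleneck sketch from the Bottleneck Layer section, and all names are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DASGate(nn.Module):
    """Deformable attention gate sketch: compress (Eq. 1), deformable conv
    (Eq. 2), normalize + sigmoid (Eq. 3), then gate the input (Eq. 4)."""
    def __init__(self, channels: int, alpha: float = 0.2):
        super().__init__()
        hidden = max(1, int(alpha * channels))
        self.bottleneck = DASBottleneck(channels, alpha)  # sketched earlier
        self.offset = nn.Conv2d(hidden, 2 * 3 * 3, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(hidden, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm stand-in (assumption)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = self.bottleneck(x)                                # context
        attn = self.sigmoid(self.norm(self.dcn(ctx, self.offset(ctx))))
        return x * attn                                         # Eq. 4 gating
```

Since the output keeps the input's shape (N, C, H, W), such a gate could be dropped in after each main block's skip connection without modifying the rest of the backbone, matching the pluggability claim above.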

4 Experiments

4.1 Training settings

For image classification, the authors use the CIFAR100, Stanford Dogs, and ImageNet-1k datasets; for object detection, they use the MS COCO dataset. They adopt ResNet and MobileNetV2 architectures consistent with those in [26].

For the ImageNet experiments, the authors adopted the same settings as [26]: ResNet is trained with a batch size of 256, an initial learning rate of 0.1, and a weight decay of 1e-4 for a total of 100 epochs, with the learning rate decayed by a factor of 0.1 at epochs 30, 60, and 90.
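As a sketch, this schedule maps onto standard PyTorch utilities (momentum 0.9 is an assumption; only the quoted settings come from these notes):

```python
import torch
import torchvision

model = torchvision.models.resnet50()  # stand-in backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by 0.1 at epochs 30, 60, and 90 over 100 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1)
```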

MobileNetV2: batch size 96, initial learning rate 0.045, weight decay 4e-5, with the learning rate multiplied by 0.98 after every epoch.

For the CIFAR100 and Stanford Dogs datasets, the authors compare against Triplet attention [26] and vanilla ResNet. They performed a hyperparameter search on ResNet-18 and used the same settings for all baselines: 300 epochs, batch size 128, initial learning rate 0.1, weight decay 5e-4, with the learning rate decayed by a factor of 0.2 at epochs 70, 130, 200, and 260.

For the Stanford Dogs dataset, the settings are: batch size 32, learning rate 0.1, weight decay 1e-4, a CosineAnnealing learning-rate scheduler, and random flipping and cropping for image preprocessing.

For object detection, the authors use Faster R-CNN on MS COCO via the MMDetection toolbox, with a batch size of 16, an initial learning rate of 0.02, a weight decay of 0.0001, and an ImageNet-1k pre-trained backbone. To reduce noise, they first train with the backbone, then train the backbone together with the rest of the model for several epochs; the weights from this initial training serve as the initialization for the subsequent training runs. The SGD optimizer is used throughout.

Image Classification

[Table 3]

Table 3 shows that adding Triplet attention slightly improves ResNet-18's accuracy on CIFAR100 (+0.3%) but reduces it by 1.36% on the Stanford Dogs dataset. DAS, in contrast, improves ResNet-18's accuracy by 0.79% on CIFAR100 and 4.91% on Stanford Dogs.

As with ResNet-18, adding Triplet attention to ResNet-50 hurt the backbone on Stanford Dogs, while DAS improved the backbone by 2.8% on CIFAR100 and 4.47% on Stanford Dogs, demonstrating DAS's consistency across small and large models. Interestingly, on both CIFAR100 and Stanford Dogs, the proposed DAS-18 not only outperformed the base ResNet-18 model but also outperformed deeper architectures including ResNet-50, while using 2.26G fewer FLOPs. This makes DAS-18 a strong option for mobile applications.

[Table 1]

The ImageNet classification results are given in Table 1. When the DAS attention gate is applied to ResNet-18, classification accuracy improves significantly: DAS reaches 72.03% Top-1 and 90.70% Top-5 accuracy, exceeding existing methods such as SENet, BAM, CBAM, Triplet Attention, and EMCA, and demonstrating the effectiveness of DAS in improving model performance.

With a ResNet-50 backbone, DAS reaches 78.04% Top-1 and 94.00% Top-5 accuracy, the best performance among the compared methods, while using 32% fewer FLOPs and 1.39M fewer parameters than the next best method, GSoP-Net. ResNet-50 + DAS surpasses ResNet-101 in Top-1 accuracy, a 0.69% improvement at about 60% of the FLOPs and parameters. ResNet-101 + DAS achieves the best Top-1 accuracy (78.62%) among the attention modules while using fewer parameters than SENet and CBAM.

On the lightweight MobileNetV2, DAS still works well: it reaches 72.79% Top-1 and 90.87% Top-5 accuracy, surpassing SENet, CBAM, and Triplet Attention while remaining computationally efficient at only 0.35G FLOPs.

Object Detection

Table 2 shows the authors' object detection results with the Faster R-CNN model on the challenging MS COCO dataset. Evaluation metrics include average precision (AP), AP at different IoU thresholds (AP50, AP75), and AP for small (APS), medium (APM), and large (APL) objects.

[Table 2]

The choice of backbone architecture has a significant impact on object detection performance. In the evaluation, ResNet-50, ResNet-101, SENet-50, CBAM-50, and Triplet Attention-50 serve as strong baselines. The DAS-50 model surpasses all of these backbones in AP, AP50, AP75, and the size-specific AP scores, while having fewer parameters than ResNet-101, SENet-50, and CBAM-50.

Design Evolution and Ablation Studies

Before settling on the final DAS design, the authors explored two pixel-level attention concepts. These are shown in Figures 2(a) and (b), with corresponding results on the Stanford Dogs dataset in Table 4.

[Figure 2: design concepts (a) and (b)]

(a): The authors concatenate the input with a GridSample of itself, and a convolutional layer then fuses the input with information from distant pixels. Although this method shows potential, accuracy on the Stanford Dogs dataset is only 65.00%. GridSample is a differentiable PyTorch operation that spatially interpolates neighboring pixels according to a given grid tensor (see the short sketch after this list).

(b): The authors extended the initial concept by using the compressed input together with the GridSample output to weight redundant information in the features. This improvement raises accuracy from 65.00% to 65.21% while reducing computational overhead.

(c): The final DAS design described in Section 3. To evaluate their design decisions, the authors conducted various ablation studies:

(d): Removing the initial part and relying only on deformable convolutions reduces accuracy (65.338%), emphasizing the importance of the first convolutional layer.

(e): Removing the deformable convolution but retaining the initial part increases computation and reduces accuracy (65.291%), indicating the need for multiple layers for accurate attention modeling.

(f): Replacing the deformable convolution with a depthwise separable convolution improves accuracy (66.107%) but is still surpassed by the authors' method, highlighting the advantage of deformable convolution in focusing on relevant information.

(g): Removing the attention module and using only deformable convolutions significantly reduces accuracy, emphasizing the importance of the gating behavior.

(h): Similarly, removing the attention module and using additional plain layers also yields low accuracy, supporting the use of these layers as an attention module.
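For concreteness, the GridSample operation used in designs (a) and (b) corresponds to PyTorch's F.grid_sample. A toy illustration (the grid here is random purely for demonstration, whereas in the designs above it would come from a learned layer):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)             # (N, C, H, W) feature map
grid = torch.rand(1, 32, 32, 2) * 2 - 1    # sampling coords, normalized to [-1, 1]
sampled = F.grid_sample(x, grid, align_corners=False)  # differentiable resampling
fused_input = torch.cat([x, sampled], dim=1)  # concatenation fed to a fusing conv
```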

[Table 4]

The authors' proposed attention design (c) outperforms the alternatives across all configurations, achieving the best accuracy (66.410%). This highlights the effectiveness of the context-aware attention mechanism in focusing on relevant information, even outside kernel boundaries, and in enhancing model performance.

[Table 5]

Table 5 shows the impact of different normalization layers on the attention module. In summary, the experimental results show that the authors' method has advantages in accuracy and computational efficiency over the other ideas and configurations, providing a valuable complement to pixel-level attention models.

The authors studied the impact of varying α from 0.01 to 1 on the amount of computation. Increasing α increases both FLOPs and parameter count. As shown in Figure 3, values of α greater than 0.1 lead to better results. Since there is a trade-off between FLOPs and accuracy, the authors fixed a single value of α in this range for most studies.

The authors also studied the number of attention layers. Adding an attention gate after every skip connection slightly improves performance but significantly increases computation and parameters, especially in larger models. Empirically, four attention gate layers strike a good balance between computational cost and accuracy. The authors further studied the placement of the gates, ultimately selecting an attention model that is simple, efficient, and accurate on both small and large datasets.

Salient Feature Detection Effectiveness

The goal of applying an attention mechanism to any task is to increase attention to relevant features while reducing or avoiding attention to irrelevant ones. The authors believe the main reason for the performance improvements reported earlier is that their gate singles out salient features in the image and increases focus on them. This section visualizes the extent to which the attention mechanism achieves this goal.

To do this, the authors use Grad-CAM, which generates a heatmap showing which parts of the input image are important for the classification decision made by the trained network. The heatmap's color scheme runs from red to blue, with blue indicating lower importance.

[Figure 4]

Figure 4 shows heatmaps at block 3 and block 4 of ResNet-50, with and without the attention gate, for multiple samples. These cases clearly show that the attention gate focuses better on relevant features in the images. The authors apply the gate at the end of each ResNet block so that the network starts attending to relevant features at an early stage. Observing how the heatmaps change from block 3 to block 4 in Figure 4, one can see that with DAS attention, focus does shift to the relevant features.

Finally, the authors define a simple metric to measure how effectively a trained network focuses on relevant features, based on the Grad-CAM output. Since the Grad-CAM weights are compressed into the range 0 to 1, an inverse logarithmic scaling is applied to them. The salient feature detection (sfd) score is then computed from two quantities: a_R, the scaled attention falling on the relevant region R, and a_B, the scaled attention falling on salient regions outside R.

a_R measures the intensity of attention received by the relevant features in an image; the higher its value, the more attention the relevant features receive. Conversely, a higher value of a_B means that attention to irrelevant features is also increasing. The sfd score varies between 0 and 1: a score close to 1 means attention is focused on the relevant features, a score close to 0 means attention is completely misdirected, and intermediate values represent attention split between relevant and irrelevant features. The authors use the following procedure to detect R and B.

The authors first use Grounding-DINO + SAM to identify the objects to be classified in the images. To avoid manual inspection, they accept possible errors in this step, which yields the relevant feature region R. Outside of R, they select regions containing salient pixels according to Grad-CAM; together with R, this gives B. The last row of Figure 4 shows the sfd values for ResNet-50 and DAS. The authors also computed the values over 100 ImageNet images: 0.59 for ResNet and 0.72 for DAS, illustrating the strength of the method in achieving targeted feature attention.
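A hedged sketch of how such a score could be computed; the paper's exact formula is not reproduced in these notes, so the inverse-log rescaling form and the final combination a_R * (1 - a_B) are assumptions chosen to match the described behavior (near 1 for focused attention, near 0 for misdirected attention):

```python
import numpy as np

def sfd_score(cam: np.ndarray, relevant: np.ndarray, eps: float = 1e-6) -> float:
    """cam: Grad-CAM weights in [0, 1]; relevant: boolean mask of region R."""
    w = -np.log(1.0 - np.clip(cam, 0.0, 1.0 - eps))  # inverse-log rescaling (assumed form)
    w = w / (w.max() + eps)                          # renormalize to [0, 1]
    a_R = float(w[relevant].mean())                  # attention on relevant features
    a_B = float(w[~relevant].mean())                 # attention on irrelevant salient regions
    return a_R * (1.0 - a_B)                         # near 1: focused; near 0: misdirected
```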

5 Conclusion, Limitations and Extensions

In this paper, the authors propose the DAS attention gate, a new self-attention mechanism suited to CNNs. DAS does not use Transformers. Compared with earlier in-CNN methods, DAS provides dense attention and considers the feature context holistically.

DAS is very simple: it combines depthwise separable convolutions (for an efficient representation of global context) with deformable convolutions (for increased attention to relevant image regions). The results show that despite its simplicity, DAS achieves focused attention on task-relevant features in images.

One limitation is that the computational overhead can grow significantly when the network has large, deep feature maps, so the value of α must be chosen carefully: too small a value loses context information, while too large a value increases the amount of computation.

Although the authors demonstrate the performance of DAS on image classification and object detection, they hope to apply it in the future to denser vision tasks such as semantic segmentation and stereo matching, where DAS's dense attention capabilities could provide significant advantages.


Origin: blog.csdn.net/Sciws/article/details/134861913