Attention: channel attention mechanism (Squeeze-and-Excitation Networks)

Squeeze-and-Excitation Networks

Published: 2017

Basic idea

The core idea is to improve the representational power of the network by explicitly modeling the interdependencies between the channels of its convolutional features.
The authors propose a mechanism that lets the network recalibrate its features: using global information, it can selectively emphasize informative feature channels and suppress less useful ones.

Steps


  1. The first is the Squeeze operation. Features are compressed along the spatial dimensions, turning each two-dimensional feature channel into a single real number. This number has, to some extent, a global receptive field, and the output dimensionality matches the number of input feature channels. It characterizes the global distribution of responses over the feature channels and lets even layers close to the input access global information, which is useful in many tasks (see the functional sketch after this list).
    Concretely, global average pooling compresses the feature map from (batch_size, C, H, W) to (batch_size, C, 1, 1).

  2. The second is the Excitation operation, a mechanism similar to the gates in recurrent neural networks. A weight is generated for each feature channel through learned parameters w, which explicitly model the correlation between the feature channels.
    The (batch_size, C, 1, 1) vector from the Squeeze step is fed into a bottleneck built from two Fully Connected layers, which models the correlation between channels and outputs as many weights as there are input channels. The first Fully Connected layer reduces the feature dimension to 1/16 of the input; after a ReLU activation, a second Fully Connected layer restores it to the original dimension. Compared with a single Fully Connected layer, this has two advantages:
    1) It has more non-linearity, so it can better fit the complex correlations between channels;
    2) The number of parameters and the amount of computation are greatly reduced.
    Finally, a Sigmoid gate produces normalized weights between 0 and 1.

  3. Finally, there is a Reweight operation. The weights output by the Excitation step are treated as the importance of each feature channel after feature selection, and the original features are weighted channel by channel through multiplication, completing the recalibration of the original features in the channel dimension.
    That is, each channel of the original feature map is scaled by its weight.
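Putting the three steps together, here is a minimal functional sketch in PyTorch (the tensor sizes and the untrained weight matrices are illustrative and only show the shapes; a real SE block uses learned nn.Linear layers, as in the module below):

import torch
import torch.nn.functional as F

# Illustrative feature map of shape (batch_size, C, H, W)
x = torch.randn(8, 64, 32, 32)
b, c, _, _ = x.shape
reduction = 16  # reduction ratio: bottleneck reduces C to C/16

# 1. Squeeze: global average pooling -> (batch_size, C)
z = F.adaptive_avg_pool2d(x, 1).view(b, c)

# 2. Excitation: bottleneck C -> C/r -> C, then Sigmoid
#    (random weights here, for shape illustration only)
w1 = torch.randn(c // reduction, c)
w2 = torch.randn(c, c // reduction)
s = torch.sigmoid(F.linear(F.relu(F.linear(z, w1)), w2))  # (batch_size, C), values in (0, 1)

# 3. Reweight: scale each channel of x by its weight
y = x * s.view(b, c, 1, 1)
print(y.shape)  # torch.Size([8, 64, 32, 32])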

from torch import nn


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y.expand_as(x)
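A quick sanity check of the module (a minimal sketch; the sizes are arbitrary). Note that the SE block is meant to be dropped into an existing architecture, e.g. applied to the residual branch of a block before the identity addition, as in SE-ResNet:

import torch

se = SELayer(channel=64, reduction=16)
x = torch.randn(8, 64, 32, 32)   # (batch_size, C, H, W)
y = se(x)                        # channel-recalibrated features
print(y.shape)                   # torch.Size([8, 64, 32, 32]) -- shape is unchanged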

Reference article


Source: blog.csdn.net/weixin_42764932/article/details/112227689