Attention mechanism in CV

Attention mechanism

The basic idea of the attention mechanism in computer vision is to make the system learn to pay attention: to ignore irrelevant information and focus on the key information.

1. Hard attention mechanism (Hard/Local Attention)

The weight assigned to each input item is either 0 or 1. Unlike soft attention, the hard attention mechanism only decides which parts need attention and which do not, directly discarding the irrelevant items. The advantage is that it reduces time and computation cost, but information that should have been attended to may be lost.
Hard attention is a non-differentiable form of attention.
Examples: center cropping, max pooling layers. A small sketch follows.
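
A minimal sketch of these two examples, assuming a [B, C, H, W] tensor (the crop region and pooling window are illustrative). Both make a hard, all-or-nothing selection of which positions survive, and that selection is not differentiable:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)       # hypothetical input feature map [B, C, H, W]

# center crop: keep only the central 16x16 region, everything else gets weight 0
crop = x[:, :, 8:24, 8:24]

# max pooling: within each 2x2 window only the maximum survives, the rest are dropped
pooled = F.max_pool2d(x, kernel_size=2)

print(crop.shape, pooled.shape)      # torch.Size([1, 3, 16, 16]) torch.Size([1, 3, 16, 16])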

2. Soft attention mechanism

Soft attention assigns each region a continuous weight in [0, 1]: the degree of attention for each area is expressed as a score between 0 and 1. Soft attention typically focuses on regions or channels, and it is deterministic: once training is complete, the attention map can be generated directly by a forward pass through the network. The most important property is that soft attention is differentiable, so gradients can be computed through the network and the attention weights are learned via forward propagation and back-propagation. The drawback is that this kind of soft attention is computationally more expensive.
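
A toy illustration of this point, assuming a single feature map and a hypothetical 1x1 scoring layer: the sigmoid gate produces weights in (0, 1) and gradients flow through it.

import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8, requires_grad=True)   # hypothetical feature map
gate = nn.Conv2d(16, 1, kernel_size=1)              # learnable scoring layer

weights = torch.sigmoid(gate(x))    # soft attention map, values in (0, 1)
out = (weights * x).sum()
out.backward()                      # gradients reach both the gate parameters and the input
print(weights.min().item(), weights.max().item())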

2.1 Spatial Transformer Networks (spatial domain attention)


import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), "kernel size must be 3 or 7"
        padding = 3 if kernel_size == 7 else 1

        # 2 input channels (avg + max maps), 1 output channel: the spatial attention map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # aggregate channel information with average- and max-pooling along the channel axis
        avgout = torch.mean(x, dim=1, keepdim=True)
        maxout, _ = torch.max(x, dim=1, keepdim=True)
        x = torch.cat([avgout, maxout], dim=1)   # [B, 2, H, W]
        x = self.conv(x)
        return self.sigmoid(x)                   # [B, 1, H, W] attention map in (0, 1)
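
A quick shape check for the module above, assuming an arbitrary feature map (the sizes are illustrative):

x = torch.randn(8, 64, 32, 32)        # hypothetical feature map [B, C, H, W]
sa = SpatialAttention(kernel_size=7)
mask = sa(x)                          # [8, 1, 32, 32], values in (0, 1)
out = mask * x                        # broadcast over the channel dimension
print(mask.shape, out.shape)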

2.2 Channel Attention Module (channel attention)


class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ratio=16):
        super(ChannelAttention, self).__init__()
        # squeeze spatial information to 1x1 with average- and max-pooling
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        # shared MLP implemented with 1x1 convolutions; `ratio` controls the bottleneck width
        self.sharedMLP = nn.Sequential(
            nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False), nn.ReLU(),
            nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avgout = self.sharedMLP(self.avg_pool(x))
        maxout = self.sharedMLP(self.max_pool(x))
        return self.sigmoid(avgout + maxout)     # [B, C, 1, 1] channel weights in (0, 1)
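
And the corresponding shape check for the channel module (sizes again illustrative):

x = torch.randn(8, 64, 32, 32)            # hypothetical feature map [B, C, H, W]
ca = ChannelAttention(in_planes=64, ratio=16)
weights = ca(x)                           # [8, 64, 1, 1]
out = weights * x                         # broadcast over H and W
print(weights.shape, out.shape)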

2.3 Branch Attention Module (branch attention) - SKNet

SKNet

The first step uses grouped convolutions with different kernel sizes, and the branch outputs are summed to obtain the feature map U, which incorporates information from multiple receptive fields. U has shape [C, H, W] (C is the number of channels, H the height, W the width). Averaging U along the H and W dimensions yields a C×1×1 vector that represents the importance of the information in each channel.

A linear transformation then maps this C-dimensional vector to a Z-dimensional one, and three separate linear transformations map it from Z back to C, which completes the information extraction along the channel dimension. After Softmax normalization, each channel receives a score that represents its importance, which acts like a mask. Multiplying these three masks with the corresponding branch outputs U1, U2 and U3 gives A1, A2 and A3; adding the three modules together fuses the information into the final module A. Compared with the original X, A has been refined and fuses information from multiple receptive fields.

import torch.nn as nn
import torch

class SKConv(nn.Module):
    def __init__(self, features, WH, M, G, r, stride=1, L=32):
        super(SKConv, self).__init__()
        # d: dimension of the squeezed vector z; WH is not used in this implementation
        d = max(int(features / r), L)
        self.M = M
        self.features = features
        self.convs = nn.ModuleList([])
        for i in range(M):
            # convolution branches with different kernel sizes (3, 5, 7, ...)
            self.convs.append(
                nn.Sequential(
                    nn.Conv2d(features,
                              features,
                              kernel_size=3 + i * 2,
                              stride=stride,
                              padding=1 + i,
                              groups=G), nn.BatchNorm2d(features),
                    nn.ReLU(inplace=False)))
            
        self.fc = nn.Linear(features, d)
        self.fcs = nn.ModuleList([])
        for i in range(M):
            self.fcs.append(nn.Linear(d, features))
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # run each branch and stack the results along a new branch dimension: [B, M, C, H, W]
        for i, conv in enumerate(self.convs):
            fea = conv(x).unsqueeze_(dim=1)
            if i == 0:
                feas = fea
            else:
                feas = torch.cat([feas, fea], dim=1)
        fea_U = torch.sum(feas, dim=1)   # fuse the branches by summation: [B, C, H, W]
        fea_s = fea_U.mean(-1).mean(-1)  # global average pooling: [B, C]
        fea_z = self.fc(fea_s)           # squeeze to d dimensions: [B, d]
        # one fc per branch maps z back to C; the branch scores are stacked as [B, M, C]
        for i, fc in enumerate(self.fcs):
            vector = fc(fea_z).unsqueeze_(dim=1)
            if i == 0:
                attention_vectors = vector
            else:
                attention_vectors = torch.cat([attention_vectors, vector],
                                              dim=1)
        attention_vectors = self.softmax(attention_vectors)               # softmax across the M branches
        attention_vectors = attention_vectors.unsqueeze(-1).unsqueeze(-1) # [B, M, C, 1, 1]
        fea_v = (feas * attention_vectors).sum(dim=1)                     # weighted sum of branches: [B, C, H, W]
        return fea_v

if __name__ == "__main__":
    t = torch.ones((32, 256, 24,24))
    sk = SKConv(256,WH=1,M=2,G=1,r=2)
    out = sk(t)
    print(out.shape)

2.4 Convolutional Block Attention Module (channel domain + spatial domain)


def conv3x3(in_planes, out_planes, stride=1):
    # 3x3 convolution with padding, as used in ResNet
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.ca = ChannelAttention(planes)
        self.sa = SpatialAttention()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.ca(out) * out  # channel attention, applied via broadcasting
        out = self.sa(out) * out  # spatial attention, applied via broadcasting
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
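
A brief usage sketch, assuming the ChannelAttention and SpatialAttention modules defined above (the tensor sizes are illustrative):

block = BasicBlock(inplanes=64, planes=64)
x = torch.randn(4, 64, 56, 56)     # hypothetical input from an earlier ResNet stage
y = block(x)
print(y.shape)                     # torch.Size([4, 64, 56, 56])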

2.5 Non-Local Attention

To address long-range information propagation and improve the modeling of long-range dependencies, the paper draws inspiration from the traditional non-local means filter and introduces a non-local operation into convolutional networks: the response at a given position is a weighted sum of the features at all other positions, so every point is associated with every other point, realizing the non-local idea.

X is a feature map of shape [bs, c, h, w]. Three 1×1 convolutions (θ, φ, g) reduce the channel dimension to half of the original (c/2). The h and w dimensions are then flattened into h×w, giving tensors of shape [bs, c/2, h×w]. The tensor from θ is permuted (a transpose, in linear-algebra terms) to shape [bs, h×w, c/2] and matrix-multiplied with the tensor from φ, producing a matrix of shape [bs, h×w, h×w]. This matrix measures pairwise similarity (it can be understood as attention). It is normalized with softmax and then matrix-multiplied with the flattened-and-transposed output of g, giving a result y of shape [bs, h×w, c/2]. y is transposed back to [bs, c/2, h×w], and the h×w dimension is reshaped back into [h, w], yielding a tensor of shape [bs, c/2, h, w]. A final 1×1 convolution expands the channels back to c, producing a [bs, c, h, w] tensor that matches the shape of the original X. The last step adds X to this result (similar to the residual block in ResNet).
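
A minimal PyTorch sketch of the block described above, assuming the channel-halving and the naming (θ, φ, g) from the description rather than any particular reference implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, c):
        super(NonLocalBlock, self).__init__()
        self.inter = c // 2                                  # reduce channels to c/2
        self.theta = nn.Conv2d(c, self.inter, kernel_size=1)
        self.phi = nn.Conv2d(c, self.inter, kernel_size=1)
        self.g = nn.Conv2d(c, self.inter, kernel_size=1)
        self.out = nn.Conv2d(self.inter, c, kernel_size=1)   # expand back to c

    def forward(self, x):
        bs, c, h, w = x.shape
        theta = self.theta(x).view(bs, self.inter, h * w).permute(0, 2, 1)  # [bs, hw, c/2]
        phi = self.phi(x).view(bs, self.inter, h * w)                       # [bs, c/2, hw]
        g = self.g(x).view(bs, self.inter, h * w).permute(0, 2, 1)          # [bs, hw, c/2]

        attn = F.softmax(torch.bmm(theta, phi), dim=-1)      # [bs, hw, hw] similarity matrix
        y = torch.bmm(attn, g)                               # [bs, hw, c/2]
        y = y.permute(0, 2, 1).view(bs, self.inter, h, w)    # back to [bs, c/2, h, w]
        return x + self.out(y)                               # residual connection

x = torch.randn(2, 64, 16, 16)
print(NonLocalBlock(64)(x).shape)    # torch.Size([2, 64, 16, 16])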

Source: blog.csdn.net/aqiangdeba/article/details/130314993