[Attention Mechanisms in CV] Understanding and Implementing Non-Local Neural Networks

1. Non-local

Non-Local is a self-attention model proposed by Wang Xiaolong et al. at CVPR 2018. Non-Local NN feels somewhat similar to Non-Local Means, the classical non-local mean denoising filter. An ordinary filter is, say, a 3×3 convolution kernel that slides over the whole image, so it only ever processes local 3×3 information. The Non-Local Means operation instead searches over a large range and combines the results with weights. See: https://blog.csdn.net/qianhen123/article/details/81043217

The Non-Local NN in this paper is related to the idea above, mainly through the receptive field: an ordinary convolution has a receptive field of 3×3 or 5×5, whereas the non-local operation allows a very large receptive field instead of being confined to a local region.

Like the previously introduced CBAM, SE, BAM, and SK modules, Non-Local is a module that is easy to plug in to refine the information in a feature map, and it too achieves a good attention mechanism. Compared with those earlier attention modules, however, Non-Local has stronger theoretical support, which also makes it somewhat harder to grasp.

The generic non-local formula:

\[ y_i=\frac{1}{C(x)}\sum_{\forall j}f(x_i,x_j)g(x_j) \]

  • x is the input signal; in CV it is usually a feature map
  • i is the index of an output position, in space, time, or spacetime, whose response is computed by enumerating all positions j
  • the function f computes the similarity between positions i and j
  • the function g computes a representation of the feature map at position j
  • the final response y is obtained by normalizing with the factor C(x)

Intuition: by analogy with Non-Local Means this is easy to understand: i indexes the current position whose response we want, j enumerates the global positions, and the response value is obtained as a non-local weighted sum, as the sketch below illustrates.
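
To see the formula in action, here is a minimal sketch (my own toy illustration, not from the paper), evaluating y for a handful of positions using the Gaussian f and linear g defined in section 2 below:

import torch

# Gaussian f, linear g, C(x) = sum of f over all j (toy sizes).
n, d = 6, 4                        # n positions, d channels per position
x = torch.randn(n, d)
w_g = torch.randn(d, d)            # weights of g, normally learned

f = torch.exp(x @ x.t())           # f(x_i, x_j) = exp(x_i^T x_j) -> [n, n]
g = x @ w_g.t()                    # g(x_j) = W_g x_j             -> [n, d]
y = (f / f.sum(dim=1, keepdim=True)) @ g   # y_i = (1/C(x)) sum_j f(x_i,x_j) g(x_j)
print(y.shape)                     # torch.Size([6, 4])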

What are the advantages of Non-Local?

  • The proposed non-local operation captures long-range dependencies directly by computing the interaction between any two positions, rather than being confined to adjacent points. This is equivalent to constructing a convolution kernel as large as the feature map itself, which preserves more information.
  • Non-local can be used as a drop-in component combined with other network structures. The authors' experiments show it can be applied to image classification, object detection, object segmentation, pose estimation, and other visual tasks, with good results.
  • Non-local works particularly well for video classification, so it is often used in the video domain.

2. Details

Starting from the general formula, we now instantiate the functions f and g:

Function g: it can be viewed as a linear embedding, given by the following formula:
\[ g(x_j)=W_gx_j \]
\(W_g\) is a weight matrix to be learned; it can be implemented with a 1×1 spatial convolution (relatively simple to implement).
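
To see why a 1×1 convolution implements this, the following sketch (my own check, not from the paper) verifies that it applies the same matrix \(W_g\) independently at every spatial position:

import torch
from torch import nn

c, c_half = 8, 4
g = nn.Conv2d(c, c_half, kernel_size=1, bias=False)
x = torch.randn(1, c, 5, 5)

out = g(x)                                   # [1, c/2, 5, 5]
# The same result via an explicit per-position matrix multiply:
W_g = g.weight.view(c_half, c)               # [c/2, c]
ref = torch.einsum('oc,bchw->bohw', W_g, x)
print(torch.allclose(out, ref, atol=1e-6))   # True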


Function f: this function computes the similarity between i and j. The authors propose four concrete choices for f.

  • Gaussian function: the formula is as follows:

\[ f(x_i,x_j)=e^{x_i^Tx_j} \\ C(x)=\sum_{\forall j}f(x_i,x_j) \]

Here \(x_i^Tx_j\) is a dot product used to compute similarity. The reason a dot product can measure similarity comes from simplifying the cosine similarity:
\[ \vec a\cdot\vec b=|\vec a||\vec b|\cos\theta \]

  • Embedded Gaussian: the formula is as follows:

\[ f(x_i,x_j)=e^{\theta(x_i)^T\phi(x_j)} \\ C(x)=\sum_{\forall j}f(x_i,x_j) \]

  • Dot product: the formula is as follows:

\[ f(x_i,x_j)=\theta(x_i)^T\phi(x_j) \\ C(x)=|\{i \mid i\ \text{is a valid index of}\ x\}| \]

  • Concatenation: the formula is as follows (a small sketch follows this list):

\[ f(x_i,x_j)=\mathrm{ReLU}(w_f^T[\theta(x_i),\phi(x_j)]) \\ C(x)=|\{i \mid i\ \text{is a valid index of}\ x\}| \]
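
The concatenation version is the least obvious of the four, so here is a minimal sketch of my own (for an already-flattened feature map): pairwise embedded features are concatenated along the channel dimension and scored by a learned vector \(w_f\):

import torch
from torch import nn
import torch.nn.functional as F

N, c = 6, 4                             # N positions, c embedded channels
theta_x = torch.randn(N, c)             # theta(x_i) for all i
phi_x = torch.randn(N, c)               # phi(x_j) for all j
w_f = nn.Linear(2 * c, 1, bias=False)   # the learned vector w_f

# Broadcast to all (i, j) pairs, concatenate along the channel dim.
pairs = torch.cat([theta_x[:, None, :].expand(N, N, c),
                   phi_x[None, :, :].expand(N, N, c)], dim=-1)  # [N, N, 2c]
f = F.relu(w_f(pairs)).squeeze(-1)      # f(x_i, x_j), shape [N, N]
y_weights = f / N                       # C(x) = N, the number of positions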


These four functions may look difficult to read at first, so below we explain the meaning of the symbols, combined with the schematic (taking Embedded Gaussian as an example; for the concrete processing details, refer to the non_local_embedded_gaussian.py file at the Github code address):

  • X represents the feature map; \(x_i\) represents the information at the current position of interest; \(x_j\) represents the global information.
  • θ represents \(\theta(x_i)=W_{\theta}x_i\); in the actual implementation this is learned by a 1×1 convolution
  • φ represents \(\phi(x_j)=W_{\phi}x_j\); in the actual implementation this is likewise a 1×1 convolution
  • Similarly for g
  • C(x) represents the normalization; in the Embedded Gaussian case it is implemented with softmax, as the identity below shows
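
Indeed, for the (Embedded) Gaussian choices, dividing by \(C(x)=\sum_{\forall j}f(x_i,x_j)\) is exactly a softmax along the dimension j, which is why the code in section 3 simply calls F.softmax:

\[ y_i=\frac{1}{C(x)}\sum_{\forall j}e^{\theta(x_i)^T\phi(x_j)}g(x_j)=\sum_{\forall j}\operatorname{softmax}_j\!\left(\theta(x_i)^T\phi(x_j)\right)g(x_j) \]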

The figure above (implementation view) can then be understood together with the figure below (abstract view):

The details are as follows (ps: the explanation below includes the batch size bs; the figure above omits bs since it was inconvenient to draw):

X is a feature map of shape [bs, c, h, w]. It passes through three 1×1 convolutions, which reduce the channels to half (c/2). The h and w dimensions are then flattened into h×w, giving tensors of shape [bs, c/2, h×w]. The tensor corresponding to θ has its axes permuted (a transpose, in linear-algebra terms) to shape [bs, h×w, c/2] and is matrix-multiplied with the tensor for φ, yielding a matrix of shape [bs, h×w, h×w]. This matrix holds the similarities (or, if you prefer, the attention). It is normalized with softmax, and the resulting matrix \(f_c\) is matrix-multiplied with the flattened-and-transposed output of g, giving a result y of shape [bs, h×w, c/2]. y is then transposed to a [bs, c/2, h×w] tensor, the h×w dimension is unflattened back to [h, w], and we obtain a tensor of shape [bs, c/2, h, w]. One more 1×1 convolution expands the channels back to c, giving a [bs, c, h, w] tensor with the same shape as the original X. The final step adds X to this tensor (like the residual block in ResNet).
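
This shape bookkeeping can be traced with a few standalone lines (a sketch of my own mirroring the walkthrough above, with random stand-ins for the convolution outputs):

import torch

bs, c, h, w = 2, 16, 8, 8

theta = torch.randn(bs, c // 2, h * w).permute(0, 2, 1)  # [bs, h*w, c/2]
phi = torch.randn(bs, c // 2, h * w)                     # [bs, c/2, h*w]
g = torch.randn(bs, c // 2, h * w).permute(0, 2, 1)      # [bs, h*w, c/2]

f = torch.softmax(theta @ phi, dim=-1)   # similarity/attention, [bs, h*w, h*w]
y = (f @ g).permute(0, 2, 1).reshape(bs, c // 2, h, w)   # back to spatial layout
print(y.shape)   # torch.Size([2, 8, 8, 8]); a final 1x1 conv restores c channels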

Possible issues

The computation is heavy: introduce the non-local layer only at high-level semantic layers, and a pooling layer can also be added in the concrete implementation to further reduce the computation.
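
The saving is easy to quantify: with the sub_sample trick in the code below, φ and g are max-pooled by 2 in each spatial dimension, so the attention matrix shrinks from (h·w)×(h·w) to (h·w)×(h·w/4). A quick check of my own:

import torch
from torch import nn

h, w = 32, 32
pool = nn.MaxPool2d(kernel_size=(2, 2))
phi = torch.randn(1, 16, h, w)
print(pool(phi).shape)              # torch.Size([1, 16, 16, 16])
print(h * w * h * w)                # 1048576 similarity entries without pooling
print(h * w * (h // 2) * (w // 2))  # 262144 entries with pooling, a 4x saving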

3. Code

The code comes from the official implementation, slightly modified for readability. It is recommended to read the forward part of the code side by side with the figure above.

import torch
from torch import nn
from torch.nn import functional as F


class _NonLocalBlockND(nn.Module):
    """
    Call pattern:
    NONLocalBlock2D(in_channels=32),
    super(NONLocalBlock2D, self).__init__(in_channels,
            inter_channels=inter_channels,
            dimension=2, sub_sample=sub_sample,
            bn_layer=bn_layer)
    """
    def __init__(self,
                 in_channels,
                 inter_channels=None,
                 dimension=3,
                 sub_sample=True,
                 bn_layer=True):
        super(_NonLocalBlockND, self).__init__()

        assert dimension in [1, 2, 3]

        self.dimension = dimension
        self.sub_sample = sub_sample

        self.in_channels = in_channels
        self.inter_channels = inter_channels

        if self.inter_channels is None:
            self.inter_channels = in_channels // 2
            # compress: reduce the channel count to half
            if self.inter_channels == 0:
                self.inter_channels = 1

        if dimension == 3:
            conv_nd = nn.Conv3d
            max_pool_layer = nn.MaxPool3d(kernel_size=(1, 2, 2))
            bn = nn.BatchNorm3d
        elif dimension == 2:
            conv_nd = nn.Conv2d
            max_pool_layer = nn.MaxPool2d(kernel_size=(2, 2))
            bn = nn.BatchNorm2d
        else:
            conv_nd = nn.Conv1d
            max_pool_layer = nn.MaxPool1d(kernel_size=(2))
            bn = nn.BatchNorm1d

        self.g = conv_nd(in_channels=self.in_channels,
                         out_channels=self.inter_channels,
                         kernel_size=1,
                         stride=1,
                         padding=0)

        if bn_layer:
            self.W = nn.Sequential(
                conv_nd(in_channels=self.inter_channels,
                        out_channels=self.in_channels,
                        kernel_size=1,
                        stride=1,
                        padding=0), bn(self.in_channels))
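            # zero-init so the block starts as an identity mapping (W_y = 0, z = x)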
            nn.init.constant_(self.W[1].weight, 0)
            nn.init.constant_(self.W[1].bias, 0)
        else:
            self.W = conv_nd(in_channels=self.inter_channels,
                             out_channels=self.in_channels,
                             kernel_size=1,
                             stride=1,
                             padding=0)
            nn.init.constant_(self.W.weight, 0)
            nn.init.constant_(self.W.bias, 0)

        self.theta = conv_nd(in_channels=self.in_channels,
                             out_channels=self.inter_channels,
                             kernel_size=1,
                             stride=1,
                             padding=0)
        self.phi = conv_nd(in_channels=self.in_channels,
                           out_channels=self.inter_channels,
                           kernel_size=1,
                           stride=1,
                           padding=0)

        if sub_sample:
            self.g = nn.Sequential(self.g, max_pool_layer)
            self.phi = nn.Sequential(self.phi, max_pool_layer)

    def forward(self, x):
        '''
        :param x: (b, c,  h, w)
        :return:
        '''

        batch_size = x.size(0)

        # g(x): [bs, c/2, h*w] (spatial dims halved first if sub_sample),
        # then permuted to [bs, h*w, c/2]
        g_x = self.g(x).view(batch_size, self.inter_channels, -1)
        g_x = g_x.permute(0, 2, 1)

        # theta(x): [bs, c/2, h*w] -> [bs, h*w, c/2]
        theta_x = self.theta(x).view(batch_size, self.inter_channels, -1)
        theta_x = theta_x.permute(0, 2, 1)

        # phi(x): [bs, c/2, h*w]
        phi_x = self.phi(x).view(batch_size, self.inter_channels, -1)

        # pairwise similarity (attention) matrix: [bs, h*w, h*w]
        # (second dim reduced by pooling if sub_sample)
        f = torch.matmul(theta_x, phi_x)

        # normalization C(x): softmax over the last dimension
        f_div_C = F.softmax(f, dim=-1)

        # weighted sum of g: [bs, h*w, c/2] -> [bs, c/2, h, w]
        y = torch.matmul(f_div_C, g_x)
        y = y.permute(0, 2, 1).contiguous()
        y = y.view(batch_size, self.inter_channels, *x.size()[2:])
        # expand channels back to c and add the residual connection
        W_y = self.W(y)
        z = W_y + x
        return z
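
A quick smoke test (my own, assuming the class above is in scope), confirming that the block preserves the input shape:

if __name__ == "__main__":
    block = _NonLocalBlockND(in_channels=32, dimension=2, sub_sample=True)
    x = torch.randn(4, 32, 20, 20)    # [bs, c, h, w]
    out = block(x)
    print(out.shape)                  # torch.Size([4, 32, 20, 20])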

4. Experimental conclusions

  • The paper proposes four similarity-computing models, and experiments were run on all four. The four models turn out to perform about the same, which leads to a conclusion: using non-local improves over the baseline, but the gap between the different similarity functions is small, so it suffices to pick one of them for experiments; the paper uses Embedded Gaussian as the default similarity function.

  • The authors ran a series of ablation experiments to prove the effectiveness of Non-Local NN:

  1. Using the four similarity models: the choice has little impact, but all beat the baseline.

  2. Taking ResNet-50 as an example, they test adding the block after different stages (see the sketch after this list). Adding it at res2, res3, or res4 gives a relatively large improvement over the baseline, while res5 is mediocre. This is probably because the feature map at the 5th stage has a small spatial size and carries little information, so the gain is small.

  3. Trying different numbers of non-local blocks: the more blocks added, the better the result, but the extra computation also grows, so speed and accuracy have to be traded off.

  4. Comparing Non-local with 3D convolution: Non-local achieves a fairly substantial accuracy improvement while requiring less computation than 3D convolution.

  5. The authors also apply the Non-local block to object detection, instance segmentation, keypoint detection, and other areas. It can be added as a trick in these areas and may bring a 1-3% improvement.
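
For instance, to approximate the "different stages" experiment, one could wrap a stage of torchvision's ResNet-50 as below (a hedged sketch of my own, assuming the _NonLocalBlockND class from section 3 is in scope; note the paper actually inserts blocks inside a stage, before its last residual block, rather than appending after it):

import torch
from torch import nn
from torchvision.models import resnet50

model = resnet50()
# Append a 2D non-local block after res4 (torchvision's layer3, 1024 channels).
model.layer3 = nn.Sequential(model.layer3,
                             _NonLocalBlockND(in_channels=1024, dimension=2))
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 1000])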

5. Assessment

Non-Local NN draws inspiration from the traditional Non-Local Means method and applies the idea inside a neural network, fusing global information directly rather than obtaining merely semi-global information by stacking many convolutional layers. This supplies the subsequent layers with richer semantic information.

Through ablation experiments, the paper thoroughly demonstrates the module's effectiveness in video classification, object detection, instance segmentation, keypoint detection, and other areas, but it does not report the resulting change in parameter count or computation speed. One can guess that the parameter increase is nontrivial, so experiments with speed requirements may have to trade speed against accuracy rather than blindly adding non-local blocks. There is another common operation in neural networks that also exploits global information: the Linear layer. A fully connected layer fuses the information from every point of the feature map, so Linear can be seen as a special case of the Non-local operation.

6. References

Paper: https://arxiv.org/abs/1711.07971

Video classification code: https://github.com/facebookresearch/video-nonlocal-net

Non-local official implementation (PyTorch reproduction): https://github.com/pprp/SimpleCVReproduction/tree/master/attention/Non-local/Non-Local_pytorch_0.4.1_to_1.1.0/lib

Zhihu article: https://zhuanlan.zhihu.com/p/33345791

Blog: https://hellozhaozheng.github.io/z_post/%E8%AE%A1%E7%AE%97%E6%9C%BA%E8%A7%86%E8%A7%89-NonLocal-CVPR2018/
