CVPR 2023 | Plug-and-play attention module HAT: activate more useful pixels to significantly boost low-level vision tasks!


Title: Activating More Pixels in Image Super-Resolution Transformer
Paper: arxiv.org/pdf/2205.04…
Code: github.com/XPixelGroup…

Introduction

This paper presents a method named Hybrid Attention Transformer (HAT), which aims to improve image super-resolution by combining several complementary attention mechanisms.

The single image super-resolution (SR) task is a classic problem in computer vision and image processing: the goal is to reconstruct a high-resolution image from a given low-resolution input. Through attribution analysis, the authors found that existing Transformer methods use only a limited spatial range of the input information. This means that current networks have not realized the full potential of the Transformer.

Therefore, to activate more input pixels for better reconstruction, this paper constructs a novel Hybrid Attention Transformer (HAT). The method combines channel attention with window-based self-attention, exploiting their complementary strengths: the former's use of global statistics and the latter's strong local fitting capability. Moreover, to better aggregate cross-window information, the authors introduce an overlapping cross-attention module that enhances the interaction between adjacent window features. In the training phase, a same-task pre-training strategy is also adopted to further tap the model's potential.

Finally, extensive experiments demonstrate the effectiveness of the proposed method and further show that scaling up the model yields a significant boost in task performance. Overall, the proposed method outperforms current state-of-the-art methods by more than 1 dB.

Motivation

First, let's analyze the Swin Transformer. It performs well on image super-resolution, but it is often unclear what advantage it holds over CNN-based methods. To reveal how the method works, the researchers applied LAM (Local Attribution Maps), an attribution-analysis diagnostic tool designed for the SR task.

LAM shows which input pixels contribute most to the reconstruction. LAM analysis reveals that SwinIR (the Transformer-based method) does not draw on a wider range of input information than CNN-based methods such as RCAN. This is counter-intuitive, but it also gave the authors additional insight.

First, it shows that SwinIR has a stronger mapping capability than CNNs and can therefore achieve better performance with less information. Second, because the range of utilized pixels is limited, SwinIR may recover incorrect textures, so its performance could improve further if more input pixels were exploited. The researchers therefore aimed to design a network that keeps a similar self-attention mechanism while activating more pixels for reconstruction. In the LAM visualizations, their HAT network sees almost the entire image and recovers correct, clear textures.

In addition, the figure above shows that SwinIR's intermediate features exhibit obvious blocking artifacts. These artifacts are caused by the window-partitioning mechanism, indicating that the shifted-window mechanism is inefficient at building connections across windows. Several studies on high-level vision tasks have likewise pointed out that strengthening the connections between windows improves window-based self-attention methods. When designing their method, the authors therefore deliberately enhanced the information interaction between windows, which markedly reduces the blocking artifacts in the intermediate features produced by HAT.

Method

Framework

As shown in the figure above, the overall network consists of three parts: shallow feature extraction, deep feature extraction, and image reconstruction. This architectural design has been widely adopted in previous work. Specifically, for a given low-resolution (LR) input, a convolutional layer first extracts shallow features. A series of Residual Hybrid Attention Groups (RHAG) and a 3×3 convolutional layer then perform deep feature extraction. After that, a global residual connection fuses the shallow and deep features, and a reconstruction module produces the high-resolution result.

In addition, each RHAG contains several Hybrid Attention Blocks (HAB), an Overlapping Cross-Attention Block (OCAB), and a 3×3 convolutional layer with a residual connection. For the reconstruction module, the pixel-shuffle method upsamples the fused features. The network parameters are simply optimized with an L1 loss; the sketch below makes this pipeline concrete.
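Here is a minimal PyTorch sketch of the three-stage structure described above. The names (`HATSketch`, `dim`, `n_groups`) and their defaults are illustrative assumptions, and the RHAG body is a placeholder convolution rather than the authors' actual HAB/OCAB stack; see the official repository for the real implementation.

```python
import torch
import torch.nn as nn

class RHAG(nn.Module):
    """Residual Hybrid Attention Group. In the real model the body is
    several HABs followed by an OCAB; a single conv stands in here so
    the sketch runs end to end."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Conv2d(dim, dim, 3, padding=1)   # placeholder body
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)   # 3x3 conv in each RHAG
    def forward(self, x):
        return x + self.conv(self.body(x))              # residual connection

class HATSketch(nn.Module):
    def __init__(self, dim=96, n_groups=6, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)  # shallow feature extraction
        self.groups = nn.Sequential(*[RHAG(dim) for _ in range(n_groups)])
        self.conv_after_body = nn.Conv2d(dim, dim, 3, padding=1)
        self.reconstruct = nn.Sequential(               # pixel-shuffle upsampler
            nn.Conv2d(dim, dim * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(dim, 3, 3, padding=1),
        )
    def forward(self, lr):
        shallow = self.shallow(lr)
        deep = self.conv_after_body(self.groups(shallow))
        return self.reconstruct(shallow + deep)         # global residual connection

lr = torch.randn(1, 3, 64, 64)                          # an LR input patch
sr = HATSketch()(lr)                                    # (1, 3, 256, 256) for x4
loss = nn.L1Loss()(sr, torch.randn_like(sr))            # the simple L1 objective
```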

Hybrid Attention Block (HAB)

The HAB combines different types of attention to activate more pixels for better reconstruction. It consists of two key components: window-based self-attention and channel attention.

In the HAB module, the input features are first normalized and then processed by window-based self-attention, which partitions the features into local windows and computes self-attention within each window, capturing correlations inside local regions. Channel attention then brings in global information to compute channel-wise attention weights; by weighting features with global statistics, it activates more pixels.

The output of the HAB module is a weighted sum of the window self-attention branch and the channel attention branch, added back to the input features through a residual connection. This design lets the network exploit local and global information simultaneously for better reconstruction, as the sketch below illustrates.
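Below is a minimal sketch of the block under the assumptions just described: a window self-attention branch plus a channel-attention branch whose output is scaled by a small constant weight before the residual addition. `HABSketch`, its defaults, and `alpha=0.01` are illustrative; window shifting, the relative position bias, and the MLP sub-layer of the full block are omitted.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excite style channel attention over global statistics."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global average pooling
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.attn(x)                      # reweight each channel

class HABSketch(nn.Module):
    def __init__(self, dim=96, heads=6, window=16, alpha=0.01):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.wattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cab = nn.Sequential(                    # conv block + channel attention
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), ChannelAttention(dim))
        self.alpha, self.window = alpha, window

    def forward(self, x):                            # x: (B, H*W, C) tokens
        B, L, C = x.shape
        H = W = int(L ** 0.5)
        w = self.window
        n = self.norm(x)
        # Window self-attention: partition the grid into non-overlapping
        # w x w windows and attend within each one (shift omitted here).
        win = n.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, w * w, C)
        attn, _ = self.wattn(win, win, win)
        attn = attn.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        attn = attn.reshape(B, L, C)
        # Channel attention branch operates on the 2-D feature map.
        cab = self.cab(n.transpose(1, 2).view(B, C, H, W))
        cab = cab.flatten(2).transpose(1, 2)
        # Weighted sum of both branches plus the residual connection.
        return x + attn + self.alpha * cab

tokens = torch.randn(2, 64 * 64, 96)                 # a 64x64 feature grid
out = HABSketch()(tokens)                            # same shape as the input
```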

Overlapping Cross-Attention Block (OCAB)

The Overlapping Cross-Attention Block (OCAB) introduces an overlapping cross-attention layer that builds cross-window connections into window self-attention, enhancing the network's representational ability. Queries are still computed from non-overlapping windows, while keys and values come from larger, overlapping windows, so each query can draw on pixel information beyond its own window and improve reconstruction performance. The sketch below shows the overlapping window extraction.
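Here is a minimal sketch of the overlapping window extraction at the heart of this idea, assuming the design described above: queries use the standard non-overlapping M×M partition, while keys and values are unfolded from enlarged windows of size (1 + γ)·M. The function name and the γ = 0.5 default are illustrative.

```python
import torch
import torch.nn.functional as F

def overlapping_windows(x, window=16, gamma=0.5):
    """x: (B, C, H, W) -> (B * num_windows, Mo * Mo, C) key/value tokens."""
    mo = int(window * (1 + gamma))      # enlarged window size Mo
    pad = (mo - window) // 2            # overlap added on each side
    # Unfold extracts one enlarged patch per base window position,
    # sliding with stride equal to the non-overlapping window size.
    patches = F.unfold(x, kernel_size=mo, stride=window, padding=pad)
    B, ck, n = patches.shape            # ck = C * mo * mo, n = num_windows
    C = ck // (mo * mo)
    patches = patches.view(B, C, mo * mo, n).permute(0, 3, 2, 1)
    return patches.reshape(-1, mo * mo, C)

feat = torch.randn(1, 96, 64, 64)
kv = overlapping_windows(feat)          # (16, 576, 96): 4x4 windows of 24x24
```

Queries would be partitioned as in the HAB sketch; cross-attention then matches each 16×16 query window against its enlarged 24×24 key/value window.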

The Same-task Pre-training

Pre-training has proven effective for high-level vision tasks, and recent studies show it also benefits low-level vision. Some methods emphasize pre-training on multiple low-level tasks, such as denoising, deraining, and super-resolution, while others exploit different degradation levels of a specific task. Unlike these approaches, this work pre-trains directly on a larger-scale dataset (such as ImageNet) using the same task, and the results show that the effectiveness of pre-training depends more on the scale and diversity of the data. For example, to train a ×4 super-resolution model, a ×4 SR model is first trained on ImageNet and then fine-tuned on the specific dataset (such as DF2K). This same-task pre-training strategy is simpler, yet brings larger performance gains.

Note, however, that the effectiveness of pre-training hinges on a sufficient number of training iterations and an appropriately small learning rate during fine-tuning. Transformer models need more data and iterations to learn general knowledge of the task, but fine-tuning requires a small learning rate to avoid overfitting to the specific dataset. The pre-training stage therefore needs enough time and data to learn general features, while the fine-tuning stage needs careful adjustment to fit the target task; a sketch of this two-stage schedule follows.
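A hedged sketch of the two-stage schedule: the learning rates and iteration counts below are illustrative assumptions, not the paper's exact recipe, and `build_sr_loader` / `train_loop` are hypothetical helpers standing in for whatever training pipeline you use.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full HAT model

# Stage 1: same-task pre-training, i.e. train the x4 SR model itself on
# large-scale ImageNet pairs (many iterations, normal learning rate).
pretrain_opt = torch.optim.Adam(model.parameters(), lr=2e-4)

# Stage 2: fine-tune on the target SR dataset with a much smaller
# learning rate to avoid overfitting, as stressed above.
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-5)

# Hypothetical training calls (placeholders, not a real API):
# train_loop(model, build_sr_loader("ImageNet", scale=4), pretrain_opt,
#            iterations=800_000, criterion=nn.L1Loss())
# train_loop(model, build_sr_loader("DF2K", scale=4), finetune_opt,
#            iterations=250_000, criterion=nn.L1Loss())
```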

Experiments

As discussed above, activating more input pixels helps achieve better super-resolution performance, and enlarging the window size is an intuitive way to do so. Some prior work has studied the effect of different window sizes, but those experiments were based on shifted cross local attention and only explored window sizes up to 12×12.

This paper further investigates how the self-attention window size affects representational ability. To eliminate the influence of the newly introduced modules, the authors run the following experiments directly on SwinIR. As shown in Table 1, the model with a 16×16 window performs better, especially on the Urban100 dataset. Figure 6 provides a qualitative comparison: for the patch marked in red, the model with window size 16 utilizes more input pixels than the model with window size 8, and the quantitative results of the reconstruction likewise confirm the effectiveness of large windows. Based on this conclusion, the authors adopt a window size of 16 as the default setting.

As shown in Table 6, comparing the performance of HAT and HAT† shows that HAT benefits greatly from the pre-training strategy. To demonstrate the superiority of the proposed same-task pre-training, the authors also apply the multi-related-task pre-training approach to HAT for comparison, using the same training settings on the full ImageNet dataset; the experiments are quite thorough.

Conclusion

This paper proposes HAT, a novel super-resolution Transformer that combines different types of attention mechanisms with large-scale pre-training to achieve better image reconstruction. Experiments demonstrate its superior performance on super-resolution tasks, surpassing current state-of-the-art methods. This work extends the application of Transformers in computer vision and offers a way to improve low-level vision tasks.


