ReCoNet: Recurrent Correction Network for Fast and Efficient Multi-modality Image Fusion

1. Abstract

In recent years, deep networks have achieved major breakthroughs in infrared and visible image fusion (IVIF) and attracted widespread attention. However, most existing methods cannot handle slight misalignment of the source images and suffer from high computational and memory overhead. This paper addresses these two rarely studied issues by developing ReCoNet, a recurrent correction network for robust and efficient fusion. Specifically, we design a deformation module to explicitly compensate for geometric distortion and an attention mechanism to reduce artifacts such as ghosting. In addition, the network is built from a parallel dilated convolution layer and runs recurrently, which significantly reduces space and computational complexity. ReCoNet can effectively and efficiently mitigate the structural distortion and texture artifacts caused by slight misalignment. Extensive experiments on two public datasets demonstrate the superior accuracy and efficiency of ReCoNet over state-of-the-art IVIF methods. On datasets with misalignment, we achieve a relative improvement of 16% in correlation coefficient (CC) and an 86% increase in efficiency.

In real working environments, asynchronous multi-source sensor imaging usually contains errors such as slight misalignment and imaging distortion, and some high-level vision tasks such as target tracking place strict demands on fusion speed. This work builds an efficient joint framework for image correction and fusion to address these problems. The framework uses successive iterations to eliminate imaging differences between multi-source images and suppress imaging distortions, makes full use of modern graphics processing units for parallel computation, and efficiently fuses multi-modal image pairs containing errors from real scenes, providing precise and continuous perception for human-computer interaction and intelligent autonomous operation. The fusion results generated by this method are not only superior in terms of human visual perception, but also yield significant quantitative improvements over existing state-of-the-art methods on various computer vision perception tasks.

2. Introduction

Infrared and visible image fusion (IVIF) produces a fused image that exhibits complementary features and contains richer information than a single modality image. The resulting images are visually appealing and of great significance in practical applications such as video surveillance, remote sensing, and autonomous driving.

Traditional IVIF methods are devoted to finding the best cross-modal feature representation and designing appropriate fusion weights. Recently, owing to the powerful nonlinear fitting and feature extraction capabilities of deep learning, researchers have begun to use deep networks to learn common features of infrared and visible images, or to design fusion strategies on IVIF training samples. These methods can produce good fusion results under specific conditions, such as fixed capture devices and/or aligned input images, and are especially suitable for human inspection. However, two key issues remain to be solved before existing IVIF methods can significantly benefit subsequent computer vision (CV) tasks, including object detection, object tracking, and semantic segmentation.
First, existing IVIF methods, whether traditional or deep-learning-based, are typically very sensitive to misalignment of the input images. A slight translation or deformation in one modality produces obvious geometric distortions in the image structure and ghost-like artifacts in textured regions, as shown in Figure 1, which severely harms the performance of downstream CV algorithms. Only a few works have attempted to mitigate these adverse effects. Ma et al. proposed a total variation minimization method to enhance the geometric structure of the infrared image while preserving the texture of the visible input. However, such methods significantly blur details and do not fully exploit the complementary information between the two modalities. In addition, their iterative optimization requires intensive gradient computation, making the fusion process time-consuming. Other deep-learning-based methods employ attention/masking mechanisms to enhance robustness to misalignment, avoiding artifacts by down-weighting mismatched patches. However, these attention/masking mechanisms have difficulty describing the correlation between the two modalities, leaving small artifacts in the fusion results.

Second, existing methods require a large amount of memory to store numerous network parameters and fall short of real-time performance at runtime, as shown by the circle sizes and time values in Figure 1, even though deep methods speed up fusion compared to traditional ones. The main bottleneck is that these deep methods must stack many convolutional blocks to learn the common features shared between infrared and visible images, whose appearances differ considerably. Moreover, training such large networks requires a large number of image pairs, which is impractical.

This study develops a recurrent lightweight network to resolve the structural distortion and texture artifacts caused by misalignment. Specifically, we train a micro-registration module (R) to predict the deformation field between the input images. This module explicitly corrects geometric distortions due to pixel shifts. We also learn attention maps ($\sigma_{ir}$ and $\sigma_{vis}$) from the two modalities to find the salient regions in the respective inputs. During fusion, the texture of the visible input therefore receives a larger weight while high-frequency repeating patterns caused by spatial offsets are distinguished, implicitly weakening artifacts. To improve efficiency, we design a parallel dilated convolution (PDC) layer that uses multi-scale receptive fields to learn contextual information. We train this simple PDC layer with a single set of parameters and run the network recurrently (F) during fusion, cascading the attention and lightweight PDC modules. This recurrent process saves parameter space and iteratively improves fusion quality. Figure 1 demonstrates the higher numerical scores, lower computational cost, and fewer parameters of our method compared with state-of-the-art methods on two public datasets. We summarize our main contributions as follows:

  • To the best of our knowledge, this is the first work to jointly learn deep networks on mid-wave infrared and visible images for registration and fusion, thereby achieving robustness to source image misalignment.
  • We design a deformation module to explicitly compensate for geometric distortions and an attention mechanism to mitigate residual artifacts. This design effectively addresses two different types of undesirable effects that occur in the structural and textured areas of a given scene.

  • We develop a parallel dilated convolutional layer and loop mechanism that significantly reduces space and computational complexity.

3. Method


3.1 Motivation

In real-life scenarios, pixel-level aligned infrared and visible images cannot be obtained due to insurmountable internal and external factors. As shown in Figure 2, three typical factors often appear in real acquisitions:

  • In most packaged devices, the complementary metal oxide semiconductor (CMOS) sensor generates image noise when the system has been running for a long time or is operating in a high-temperature internal environment.

  • In severe environments, such as deserts and tropical forests, refraction caused by thermal airflow can severely distort the source image.

  • Bumpy roads, fast-moving objects, or unsynchronized multi-view cameras can degrade the source images with, for example, motion blur and translation errors; a slight shift or deformation in one modality introduces significant geometric distortion.

    Few existing methods can overcome these problems because they only perform fusion on pixel-level aligned image pairs.

Based on this observation, we propose a recurrent correction network for IVIF that has sufficient capacity to handle source inputs that are only coarsely (visually) aligned rather than pixel-level aligned.

In addition, most previous fusion methods strive to strengthen the network by increasing its depth and width to achieve state-of-the-art performance. However, this drastic increase in network layers results in significant computational and memory requirements, making such models difficult to apply to subsequent high-level computer vision tasks such as object detection, depth estimation, and object tracking. Therefore, our method carefully designs parallel dilated convolutional layers and a recurrent learning mechanism to improve computational efficiency.

3.2 Micro Registration Module

The micro-registration module $R$ helps mitigate minor misalignment errors caused by geometric distortion or scale differences. It consists of two components: a deformation field prediction network $R_\phi$ and a resampling layer $R_S$. The deformation field $\phi$ is used to represent the transformation, allowing our method to map images accurately in a non-uniform manner.

Note: a deformation field is a technique for describing the deformation of an image or object. It is a vector field containing the displacement vector of every point in a given region; these displacement vectors represent how far each point moves from its initial position during deformation. With a traditional uniform transformation, every pixel in the image is displaced, rotated, scaled, or otherwise geometrically transformed in the same way. For example, translation is a common uniform transformation in which every pixel moves according to the same displacement vector. Non-uniform transformations allow more accurate mapping: pixels in different regions can be mapped to different locations to better adapt to changes in the image. For example, in face recognition, a non-uniform transformation can align key points in the image to their corresponding positions based on each person's facial features, enabling more accurate recognition.
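To make the note concrete, here is a tiny PyTorch sketch (our illustration, not code from the paper) contrasting a uniform translation field with a non-uniform deformation field of shape $h \times w \times 2$:

```python
import torch

# Toy sketch: a deformation field of shape (H, W, 2) stores a per-pixel
# displacement; a uniform translation is the special case where every pixel
# shares the same displacement vector.
H, W = 4, 4

uniform_phi = torch.zeros(H, W, 2)
uniform_phi[..., 0] = 1.0   # every pixel shifts by 1 along the first axis
uniform_phi[..., 1] = 2.0   # and by 2 along the second axis

nonuniform_phi = 0.5 * torch.randn(H, W, 2)   # each pixel gets its own offset

print(uniform_phi[0, 0], uniform_phi[2, 3])        # identical offsets
print(nonuniform_phi[0, 0], nonuniform_phi[2, 3])  # generally different offsets
```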

Suppose we are given an infrared image $x$ and a distorted visible image $\tilde{y}$. The goal of $R_\phi$ is to predict a deformation field $\phi_{\tilde{y}\rightarrow y} = R_\phi(x, \tilde{y})$ that describes how $\tilde{y}$ is non-rigidly aligned to $y$. The deformation field $\phi \in \mathbb{R}^{h\times w\times 2}$, where each $\phi_{h,w} = (\Delta x_h, \Delta x_w) \in \mathbb{R}^2$ denotes the deformation offset of pixel $v_{h,w}$ in $\tilde{y}$. Since our $R$ mainly focuses on the fusion quality after registration, we design a micro U-Net-like module; its detailed architecture is given in the lower left corner of Figure 3.

To apply the geometric transformation to the image, we use a resampling layer $R_S$, which receives the deformation field $\phi_{\tilde{y}\rightarrow y}$ generated by $R_\phi$ and applies it to the distorted visible image $\tilde{y}$. The value of the transformed visible image $\bar{y}$ at pixel $v_{h,w}$ is computed as

$$\bar{y}\left[v_{h,w}\right] = \tilde{y}\left[v_{h,w} + \phi^{\tilde{y}\rightarrow y}_{h,w}\right] \qquad (1)$$
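A minimal PyTorch sketch of such a resampling step is given below. It assumes the offsets in $\phi$ are expressed in pixels and uses `torch.nn.functional.grid_sample`, which samples with coordinates normalized to $[-1, 1]$; the paper does not specify its exact implementation, so treat this as an illustration of Eq. (1) with bilinear interpolation:

```python
import torch
import torch.nn.functional as F

def warp_with_field(y_tilde: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    """Resample y_tilde (N, C, H, W) using a per-pixel offset field phi (N, H, W, 2).

    phi[..., 0] / phi[..., 1] are assumed to be pixel offsets along x / y."""
    n, _, h, w = y_tilde.shape
    dtype, device = y_tilde.dtype, y_tilde.device
    # Base sampling grid: the pixel coordinates of the output image.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=dtype, device=device),
        torch.arange(w, dtype=dtype, device=device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)

    # Shift each output pixel by its offset, then normalize to [-1, 1]
    # as required by grid_sample.
    coords = base + phi
    grid_x = 2.0 * coords[..., 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)

    return F.grid_sample(y_tilde, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Usage: y_bar = warp_with_field(y_tilde, phi) realizes Eq. (1) bilinearly.
```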

3.3 Biphasic Recurrent Fusion Module

Contextual features such as edges, objects, and contours play a key role in the fusion process. However, as network depth increases, contextual features gradually degrade, resulting in blurred targets and unclear details in the fusion results. To address this issue, previous work attempted to design various attention mechanisms or to increase the width of the network (e.g., by adding dense or residual blocks). In practice, these attention mechanisms have difficulty characterizing the contextual features of the source images, and increasingly complex model architectures lead to significant computational and memory requirements. Therefore, we propose a biphasic recurrent fusion module that represents sufficient contextual features at multiple scales with high computational efficiency.

Dual-stage attention layer: To obtain salient features and maintain contextual consistency with the source images, we design a dual-stage attention layer. It consists of a max pooling operation, an average pooling operation, and an unbiased convolutional layer (a convolution without a bias term). For each pixel of the two inputs, we take the maximum and the average value and combine them as the input of the convolutional layer. Let $A$ denote the dual-stage attention layer and $I_a$ and $I_b$ denote the input images; it can be expressed as

$$A(I_a, I_b) = \theta_A * \left[\max(I_a, I_b),\ \operatorname{avg}(I_a, I_b)\right]$$

Note: combining the maximum and the average value provides a richer feature representation. The maximum captures the strongest local responses in the image, while the average reflects its overall characteristics. By combining the two, both local and global information can be taken into account, improving the expressiveness of the features.
In the above formula, $*$ denotes the convolution operation and $\theta_A$ denotes the parameters of the convolutional layer in our attention layer. We concatenate $\max(I_a, I_b)$ and $\operatorname{avg}(I_a, I_b)$ as the input of the attention layer. As shown in Figure 3, the network generates the attention maps $\sigma_{ir}$ and $\sigma_{\bar{y}}$ from the input image group $\{x, u, \bar{y}\}$:

$$\sigma_{ir} = A_x(x, u_i)$$
$$\sigma_{\bar{y}} = A_{\bar{y}}(\bar{y}, u_i)$$

where $A_x$ and $A_{\bar{y}}$ denote the attention layers for the infrared and visible images respectively, and $u_i$ denotes the fusion result of the previous iteration.
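A possible PyTorch sketch of this dual-stage attention layer is shown below; the kernel size, channel count, and the final sigmoid are our assumptions, since the formula above only specifies the max/avg pooling and an unbiased convolution:

```python
import torch
import torch.nn as nn

class DualStageAttention(nn.Module):
    """Sketch of the dual-stage attention layer A(I_a, I_b): per-pixel max and
    average of the two inputs are concatenated and passed through an unbiased
    (bias-free) convolution."""

    def __init__(self, channels: int = 1):
        super().__init__()
        # bias=False implements the "unbiased convolutional layer".
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3,
                              padding=1, bias=False)

    def forward(self, i_a: torch.Tensor, i_b: torch.Tensor) -> torch.Tensor:
        pooled_max = torch.maximum(i_a, i_b)   # strongest local response
        pooled_avg = 0.5 * (i_a + i_b)         # overall intensity
        attn = self.conv(torch.cat([pooled_max, pooled_avg], dim=1))
        return torch.sigmoid(attn)             # bound the map to (0, 1) (assumed)

# Usage (shapes assumed): sigma_ir = DualStageAttention()(x, u_prev)
```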

In addition, thanks to the dual-stage attention maps, we can better emphasize contextual features and make our method implicitly adaptive to slight alignment errors by reducing the weight of slightly distorted regions such as non-smooth edges and ghosting.

Parallel dilated convolutional layers: We develop a set of parallel dilated convolutional layers to efficiently extract features from the source images. A set of dilated convolutions with a sawtooth dilation pattern enlarges the receptive field without losing neighboring information. The three dilation paths use convolutions with the same 3×3 kernel but different dilation rates; as shown in Figure 3, the dilation rates are set to 1, 2, and 3 respectively, so the receptive fields of the three parallel paths are 3×3, 5×5, and 7×7.
To give a formal description, let $f^i_{in}$ denote the input feature map of the dilated convolutional layers in the $i$-th iteration. The output feature map of the parallel dilated convolutional layer, $f^i_{out}$, is updated step by step as follows:

$$f^i_{out} = \left\{C^k(f^i_{in})\right\},\quad k\in\{1,2,3\},\qquad C^k(f^i_{in}) = \theta^k_C * f^i_{in} + b^k_C,$$

where $\theta^k_C$ and $b^k_C$ denote the weights and bias of the convolutional layer with dilation rate $k$.
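The parallel dilated convolution layer could be sketched in PyTorch as follows; how the three paths $C^1, C^2, C^3$ are merged is not spelled out in this excerpt, so concatenation followed by a 1×1 convolution is assumed here:

```python
import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """Sketch of the parallel dilated convolution (PDC) layer: three 3x3
    convolutions with dilation rates 1, 2, 3 (receptive fields 3x3, 5x5, 7x7)
    applied in parallel."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 3)   # padding = dilation keeps the spatial size fixed
        ])
        self.merge = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)  # merge rule assumed

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        outs = [conv(f_in) for conv in self.paths]   # C^1, C^2, C^3
        return self.merge(torch.cat(outs, dim=1))
```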

Recurrent learning: We propose a recurrent architecture that replaces time-consuming multi-layer convolutions and extracts features from contextual information in a step-by-step manner. By partially reusing the computational graph, we reduce the overhead of graph construction, especially for dynamic-graph frameworks such as PyTorch. As shown in Figure 4, compared with a sequential network structure, we spend more time building the graph in the first loop, but save about 27% of the time in each subsequent loop. Overall, our recurrent architecture reduces time by about 15%, parameters by 33%, and GPU memory by 42%. This recurrent learning enables ReCoNet to extract image features from contextual information and meet the real-time standard (≥ 25 fps). Thanks to the reduced parameters and memory, ReCoNet can be deployed on mobile devices.
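A schematic of the recurrent fusion loop, with the per-iteration module abstracted as a callable (the iteration count and the initialization of $u_0$ are our assumptions):

```python
import torch

def recurrent_fuse(x: torch.Tensor, y_bar: torch.Tensor, fusion_step,
                   num_iters: int = 3) -> torch.Tensor:
    """Sketch of the recurrent fusion loop: the same lightweight module
    (attention + PDC, abstracted here as `fusion_step`) is applied repeatedly,
    feeding the previous fusion result u_i back in."""
    u = 0.5 * (x + y_bar)                # assumed initialization of u_0
    for _ in range(num_iters):
        u = fusion_step(x, y_bar, u)     # parameters are shared across iterations
    return u
```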


3.4 Loss Functions

The total loss function of our network, $L_{total}$, consists of two terms, namely the fusion loss $L_{fuse}$ and the registration loss $L_{reg}$. The fusion loss ensures that the network generates informative fusion results, while the registration loss helps limit and correct image distortions due to alignment errors. We train our network by minimizing the following loss function:

$$L_{total} = \lambda L_{fuse} + (1-\lambda) L_{reg}, \qquad (2)$$

where λ is a trade-off parameter.

The fusion loss $L_{fuse}$ consists of two terms. The structural similarity loss $L_{SSIM}$ maintains structural consistency in terms of luminance, contrast, and structural information, while the pixel loss $L_{pixel}$ balances the pixel intensities of the two source images. Therefore, $L_{fuse}$ can be expressed as:

$$L_{fuse} = \gamma L_{SSIM} + (1-\gamma) L_{pixel}, \qquad (3)$$

where $\gamma$ weights the two loss terms. Specifically, we constrain the fusion result to share the basic structure of the source images, so the $L_{SSIM}$ loss is defined as:

$$L_{SSIM} = (1 - SSIM(u, x)) + (1 - SSIM(u, y)), \qquad (4)$$

Similarly, the fusion result should balance the pixel intensity distribution from infrared and visible light images, and the pixel loss can be expressed as:

$$L_{pixel} = \|u - x\|_1 + \|u - y\|_1, \qquad (5)$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm.
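A sketch of the fusion loss in PyTorch, with the SSIM computation passed in as an assumed callable `ssim_fn` (any off-the-shelf SSIM implementation) and γ = 0.28 taken from the training details below:

```python
import torch
import torch.nn.functional as F

def fusion_loss(u: torch.Tensor, x: torch.Tensor, y: torch.Tensor,
                ssim_fn, gamma: float = 0.28) -> torch.Tensor:
    """Sketch of L_fuse = gamma * L_SSIM + (1 - gamma) * L_pixel (Eqs. 3-5).

    `ssim_fn(a, b)` is assumed to return the mean SSIM between two images;
    it is not defined here. A mean-reduced l1 term is assumed for Eq. (5)."""
    l_ssim = (1.0 - ssim_fn(u, x)) + (1.0 - ssim_fn(u, y))      # Eq. (4)
    l_pixel = F.l1_loss(u, x, reduction="mean") + \
              F.l1_loss(u, y, reduction="mean")                 # Eq. (5)
    return gamma * l_ssim + (1.0 - gamma) * l_pixel             # Eq. (3)
```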

In addition, the registration loss $L_{reg}$ plays a key role in correcting distortion and can be expressed as:

$$L_{reg} = \eta L_{sim} + (1-\eta) L_{smooth}, \qquad (6)$$

where $L_{sim}$ denotes the similarity loss, $L_{smooth}$ is a smoothness loss designed to ensure smooth deformations, and $\eta$ is a trade-off parameter balancing the two terms.

More specifically, $L_{sim}$ is calculated as:

$$L_{sim} = \left\|\phi_{\tilde{y}\rightarrow y} - (-\phi_{y\rightarrow\tilde{y}})\right\|^2_2, \qquad (7)$$

where $\phi_{\tilde{y}\rightarrow y}$ denotes the predicted deformation field and $\phi_{y\rightarrow\tilde{y}}$ denotes the random deformation field used to generate the distorted input. Since our framework mainly focuses on the fusion quality after alignment, $-\phi_{y\rightarrow\tilde{y}}$ serves, to a certain extent, as the fitting target of the predicted deformation field. The subtle errors introduced by this approximation are eliminated in our recurrent fusion mechanism.

For each pixel $p$ in the two-dimensional spatial domain $\Omega$, $L_{smooth}$ is defined as:

$$L_{smooth} = \sum_{p\in\Omega} \left\|\nabla\phi(p)\right\|_1, \qquad (8)$$

where $\nabla$ denotes the spatial gradient, approximated by differences between adjacent pixels.
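The registration loss of Eqs. (6)–(8) could be sketched as follows; mean reduction over pixels is assumed here, and η = 0.78 is taken from the training details below:

```python
import torch

def registration_loss(phi_pred: torch.Tensor, phi_rand: torch.Tensor,
                      eta: float = 0.78) -> torch.Tensor:
    """Sketch of L_reg = eta * L_sim + (1 - eta) * L_smooth (Eqs. 6-8).

    phi_pred: predicted field phi_{y~ -> y}, shape (N, H, W, 2).
    phi_rand: random field phi_{y -> y~} used to distort the input, same shape."""
    # Eq. (7): the negated random field serves as the fitting target
    # (mean-reduced squared error assumed).
    l_sim = torch.mean((phi_pred - (-phi_rand)) ** 2)

    # Eq. (8): l1 norm of the spatial gradient via adjacent-pixel differences.
    d_h = torch.abs(phi_pred[:, 1:, :, :] - phi_pred[:, :-1, :, :])
    d_w = torch.abs(phi_pred[:, :, 1:, :] - phi_pred[:, :, :-1, :])
    l_smooth = d_h.mean() + d_w.mean()

    return eta * l_sim + (1.0 - eta) * l_smooth
```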

4 Experiments and Results

First, we introduce the dataset, evaluation metrics, and training details. We then compare the proposed method with eight state-of-the-art methods (i.e., DenseFuse, FusionGAN, RFN, GANMcC, MFEIF, PMGI, DIDFuse, and U2Fusion) on aligned/unaligned datasets. Additionally, we provide complexity evaluation, average opinion score analysis, and extensive ablation experiments. All experiments were conducted using PyTorch on a computer equipped with an Nvidia V100 GPU.

4.1 Dataset and Preprocessing

Dataset: Our aligned and unaligned fusion experiments are performed on the TNO and RoadScene datasets. We generate infrared images with varying degrees of distortion by applying random deformation fields. In each aligned/unaligned IVIF experiment, we randomly select 20/180 image pairs from the TNO/RoadScene dataset as training samples.
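For illustration only (the paper's actual distortion parameters are not given here), a smooth random deformation field for generating unaligned training pairs could be sampled like this, reusing the `warp_with_field` sketch from Section 3.2:

```python
import torch
import torch.nn.functional as F

def random_smooth_field(h: int, w: int, max_offset: float = 4.0) -> torch.Tensor:
    """Sample a coarse random field and upsample it so the per-pixel offsets
    vary smoothly. The magnitude and grid size are illustrative assumptions."""
    coarse = (torch.rand(1, 2, max(1, h // 16), max(1, w // 16)) - 0.5) * 2 * max_offset
    field = F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=True)
    return field.permute(0, 2, 3, 1)   # -> (1, H, W, 2) pixel offsets

# e.g. y_tilde = warp_with_field(y, random_smooth_field(h, w))
```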

Evaluation Metrics: We adopt three existing statistical metrics, namely standard deviation (SD), entropy (EN), and correlation coefficient (CC), to comprehensively evaluate the quality of the fused images.

Training details: λ, γ, and η are set to 0.6, 0.28, and 0.78 respectively. Parameters are updated with the Adam optimizer at a learning rate of 0.001 for a total of 300 epochs. The micro-registration module $R_\phi$ is trained jointly with the biphasic recurrent fusion module $F$.
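Putting the pieces together, a joint-training sketch under these hyper-parameters might look as follows; `reg_net` ($R_\phi$), `fuse_net` ($F$), the data `loader` yielding `(x, y, y_tilde, phi_rand)`, and `ssim_fn` are assumed to be provided, and the helper functions are the sketches shown earlier:

```python
import torch

def train_reconet(reg_net, fuse_net, loader, ssim_fn,
                  lam: float = 0.6, epochs: int = 300, lr: float = 1e-3):
    """Joint-training sketch under the reported hyper-parameters
    (lambda = 0.6, Adam, lr = 0.001, 300 epochs)."""
    params = list(reg_net.parameters()) + list(fuse_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(epochs):
        for x, y, y_tilde, phi_rand in loader:
            phi_pred = reg_net(x, y_tilde)                   # deformation field R_phi
            y_bar = warp_with_field(y_tilde, phi_pred)       # resampling layer R_S
            u = recurrent_fuse(x, y_bar, fuse_net)           # biphasic recurrent fusion F
            loss = lam * fusion_loss(u, x, y, ssim_fn) + \
                   (1.0 - lam) * registration_loss(phi_pred, phi_rand)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```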

4.2 Results on Aligned Dataset

Qualitative Comparisons: Figure 5 shows eight representative fusion images generated by different models. Visual inspection shows that our method significantly outperforms the other compared models. Although the other methods achieve meaningful fusion results, they still exhibit problems such as unclear thermal targets (see the green boxes for DenseFuse, RFN, and U2Fusion in Figure 5) and blurred details (see the red boxes for GANMcC and DIDFuse in Figure 5). In contrast, our method generates visually friendly fusion results with clear targets, obvious contrast, and rich details.
**Quantitative Comparisons:** The quantitative results on 15/40 image pairs from the TNO/RoadScene datasets are shown in Figure 6. Our method clearly achieves the highest values on two metrics (SD and EN), followed by DIDFuse and U2Fusion. On the CC metric, our method only slightly lags behind FGAN on the TNO dataset.

4.3 Results on Slightly Misaligned Dataset

Qualitative Comparisons: Since our method can handle slightly misaligned image pairs, we further test its fusion performance against other state-of-the-art methods on the TNO and RoadScene datasets, as shown in Figure 7. Clearly, the other methods suffer from structural distortions or undesirable halo effects in their fusion results. In contrast, our method overcomes, to a certain extent, the undesirable artifacts caused by misaligned image pairs. This is mainly attributable to the structural refinement and recurrent attention modules used during training.

**Quantitative Comparisons:** As shown in Figure 8, we evaluate the CC metric of these methods on 20 selected images from the TNO/RoadScene dataset under four different transformations: random noise, elastic transformation, affine transformation, and a hybrid transformation. The scores of DenseFuse, PMGI, DIDFuse, and U2Fusion drop significantly once the transformations are applied to the input images. Because MFEIF uses an attention mechanism, it shows some resistance to random noise, and because FusionGAN is based on gradient transformation, elastic transformation has little impact on it. In contrast, our method handles all four transformations robustly.

4.4 Computational Complexity Analysis


4.5 Mean Opinion Score Analysis


We conduct subjective experiments on these eight IVIF methods using the aligned/unaligned TNO/RoadScene datasets, selecting 20 pairs of typical images from each dataset. The unaligned datasets are generated by transforming the infrared images with three kinds of transformations, namely affine, elastic, and a combination of both. Ten computer vision researchers rated the fused images in terms of overall visual perception, object clarity, and detail richness. Figure 9 shows the ranked mean opinion score (MOS) of all methods after normalization, where the shade of color indicates the score level (yellow: best, purple: worst). Notably, our method achieves the highest score on all test image pairs, indicating that it is more consistent with the perception of the human visual system.

4.6 Ablation Studies

**Discussion of iterations in the attention module:** Figure 10 shows the impact of the number of iterations of recurrent attention learning on the fusion results. As the number of iterations in our attention module increases, the fusion results tend to achieve better visual quality: texture details and objects become clearer. This is mainly due to the progressive recurrent attention module, whereby each iteration has a positive effect on the fusion result.
**Ablation of the attention mechanism:** To verify the benefits of our attention module, we compare the attention maps and the corresponding ablation results, as shown in Figure 11. Our attention module is able to perceive the most discriminative regions of the source images (i.e., targets in the infrared image and details in the visible image), so the fusion result retains more meaningful information.

**Ablation of the deformable alignment module:** To study the effect of the deformable alignment (micro-registration) module, Figure 12 shows visual results with and without it. Without this module, the fusion results clearly suffer from undesirable artifacts and structural distortions (e.g., the road signs in the second row and the flagpole in the bottom row). In contrast, our full method overcomes the artifact and structural distortion problems to a certain extent.

5 Conclusion

This paper proposes an innovative network based on biphasic recurrent attention learning to achieve robust and efficient infrared and visible image fusion in an end-to-end manner. We first design a micro-registration module to roughly estimate the distortion caused by image misalignment. The source images are then fused by a biphasic recurrent learning network, which also removes residual artifacts. In addition, we employ parallel dilated convolutions and a shared computational graph in the recurrent network to achieve high computational efficiency. Both subjective and objective experimental results show that ReCoNet offers significant efficiency advantages and can handle misaligned image pairs to a certain extent, showing clear superiority over existing state-of-the-art methods.
