Deep learning paper sharing (5) DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion

Foreword

Original paper: https://arxiv.org/abs/2303.06840

Title: DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion
Authors: Zixiang Zhao¹,², Haowen Bai¹, Yuanzhi Zhu², Jiangshe Zhang¹, Shuang Xu³, Yulun Zhang², Kai Zhang², Deyu Meng¹, Radu Timofte²,⁴, Luc Van Gool²
¹Xi'an Jiaotong University  ²Computer Vision Lab, ETH Zürich  ³Northwestern Polytechnical University  ⁴University of Würzburg

This post is a translation only.

Abstract

Multi-modality image fusion aims to combine different modalities to produce fused images that preserve the complementary features of each modality, such as functional highlights and texture details. To exploit strong generative priors and address the challenges of GAN-based generative methods, such as training instability and lack of interpretability, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iterations, our method can generate high-quality fused images with natural image generative priors and cross-modality information from the source images. Note that all we need is an unconditionally pre-trained generative model; no fine-tuning is required. Extensive experiments show that the method achieves promising fusion results on infrared-visible image fusion and medical image fusion. The code will be released.

1. Introduction

Image fusion integrates the essential information of multi-source images to form high-quality fused images [29], covering source types such as digital [15,53], multi-modal [45,57] and remote sensing [48,60] images. This technique provides clearer object and scene representations and has various applications such as saliency detection [32], object detection [2] and semantic segmentation [21]. Among the different subcategories of image fusion, infrared-visible image fusion (IVF) and medical image fusion (MIF) are particularly challenging instances of multi-modality image fusion (MMIF), because they focus on modeling cross-modal features and retaining critical information from all sensors and modalities. Specifically, in IVF the fused image aims to preserve the thermal radiation of the infrared image and the detailed texture of the visible image, thereby avoiding both the sensitivity of visible images to illumination conditions and the noise and low resolution of infrared images. MIF, on the other hand, can accurately localize abnormalities by fusing multiple medical imaging modalities, thereby assisting diagnosis and treatment [12].

Many methods have recently been devised to address the challenges posed by MMIF [20, 51], and generative models [7, 30] have been widely used to model the distribution of fused images and obtain satisfactory fusion results. Among them, models based on Generative Adversarial Networks (GANs) [26, 27, 25, 20] dominate. As shown in Figure 1a, the workflow of a GAN-based model involves a generator that creates an image containing information from the source images, and a discriminator that determines whether the generated image lies on a similar manifold to the source images. Although GAN-based methods can generate high-quality fused images, they suffer from unstable training, lack of interpretability and mode collapse, which seriously affect the quality of generated samples. Furthermore, as black-box models, the internal mechanisms and behavior of GANs are difficult to understand, which makes controllable generation challenging.
Figure 1: (a) Workflow of existing GAN-based methods. (b) Probabilistic graph of the hierarchical Bayesian model for likelihood rectification. (c) Overall workflow of DDFM.

Recently, Denoising Diffusion Probabilistic Models (DDPM) [9] have attracted the attention of the machine learning community. DDPM generates high-quality images by simulating a diffusion process that restores noise-corrupted images to clean ones, producing promising synthetic samples through a sequence of reverse-diffusion steps based on Langevin dynamics [35]. Compared with GANs, DDPM does not require a discriminator network, which alleviates the common problems of training instability and mode collapse in GANs. Furthermore, its generative process is interpretable, since it generates images by denoising diffusion, which allows a better understanding of the image generation process [44].

Therefore, we propose the denoising diffusion image fusion model (DDFM), as shown in Fig. 1c. We formulate the conditional generation task as a DDPM-based posterior sampling model, which is further decomposed into an unconditional generation diffusion problem and a maximum likelihood estimation problem. The former satisfies the natural image prior, while the latter constrains the similarity to the source images through likelihood rectification. Compared with discriminative methods, modeling the natural image prior with DDPM can better generate details that are difficult to control with hand-crafted loss functions, resulting in visually perceptible images. As a generative method, DDFM achieves stable and controllable, discriminator-free generation of fused images by applying likelihood rectification to the DDPM output.

Our contributions fall into three areas:
• We introduce a DDPM-based posterior sampling model for MMIF, consisting of an unconditional generation module and a conditional likelihood rectification module. Sampling of the fused image is achieved with only a pre-trained DDPM, without any fine-tuning.
• For likelihood rectification, since the likelihood cannot be obtained explicitly, we formulate the optimization loss as a probabilistic inference problem involving latent variables, which can be solved by the EM algorithm. This solution is then integrated into the DDPM loop to complete conditional image generation.
• Extensive evaluations on IVF and MIF tasks show that DDFM consistently provides good fusion results, effectively preserving the structure and detail information of source images, while also satisfying visual fidelity requirements.

Figure 2: DDFM (marked in yellow) outperforms all other methods on MSRS [40] and RoadScene [46] on six metrics.

2. Background

2.1. Score-based diffusion models

Score SDE formulation: Diffusion models are designed to generate samples by inverting a predefined forward process that converts a clean sample $x_0$ into an almost Gaussian signal $x_T$ by gradually adding noise. This forward process can be described by an Itô stochastic differential equation (SDE) [38]:

$$dx = -\frac{\beta(t)}{2}\,x\,dt + \sqrt{\beta(t)}\,dw, \tag{1}$$

where $dw$ is a standard Wiener process and $\beta(t)$ is a predefined noise schedule [38]; this choice corresponds to the variance-preserving SDE.

This forward process can be reversed in time and still takes the form of an SDE [1]:

$$dx = \left[-\frac{\beta(t)}{2}x - \beta(t)\,\nabla_{x_t}\log p_t(x_t)\right]dt + \sqrt{\beta(t)}\,d\bar{w}, \tag{2}$$

where $d\bar{w}$ corresponds to a standard Wiener process running backward in time. The only unknown part, $\nabla_{x_t}\log p_t(x_t)$, can be modeled as the so-called score function $s_\theta(x_t, t)$. Using the denoising score matching method, the score function can be trained with the following objective [11,37]:

$$\min_\theta \; \mathbb{E}_{t,\,x_0,\,x_t}\left[\left\| s_\theta(x_t,t) - \nabla_{x_t}\log p_{0t}(x_t\,|\,x_0)\right\|_2^2\right], \tag{3}$$

where $t$ is uniformly sampled over $[0, T]$ and the data pair $(x_0, x_t) \sim p_0(x)\,p_{0t}(x_t\,|\,x_0)$.

Sampling with diffusion models: Specifically, the unconditional diffusion generation process starts from a random noise vector $x_T \sim \mathcal{N}(0, I)$ and is updated according to a discretization of Eq. (2). Alternatively, the sampling process can be understood in the DDIM manner [35], where the score function is also viewed as a denoiser that, at iteration $t$, predicts the denoised estimate $\tilde{x}_{0|t}$ from the current state $x_t$:

$$\tilde{x}_{0|t} = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t + (1-\bar{\alpha}_t)\, s_\theta(x_t, t)\right), \tag{4}$$

where $\tilde{x}_{0|t}$ denotes the estimate of $x_0$ given $x_t$. Following Ho et al. [9], we use the notation $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. With the predicted $\tilde{x}_{0|t}$ and the current state $x_t$, $x_{t-1}$ is updated by

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\tilde{x}_{0|t} + \sqrt{1-\bar{\alpha}_{t-1}-\tilde{\sigma}_t^2}\cdot\frac{x_t - \sqrt{\bar{\alpha}_t}\,\tilde{x}_{0|t}}{\sqrt{1-\bar{\alpha}_t}} + \tilde{\sigma}_t z, \tag{5}$$

where $z \sim \mathcal{N}(0, I)$ and $\tilde{\sigma}_t^2$ is the variance, usually set to 0. The sampled $x_{t-1}$ is then fed into the next sampling iteration until the final image $x_0$ is generated. More details on this sampling process can be found in the supplementary material or in the original paper [35].
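To make the DDIM-style sampling loop of Eqs. (4)-(5) concrete, here is a minimal PyTorch-style sketch. It is our own illustration, not the authors' code: `score_model` and `alphas_bar` are assumed to be a pretrained score network and the precomputed $\bar{\alpha}_t$ schedule.

```python
import torch

@torch.no_grad()
def ddim_sample(score_model, alphas_bar, shape, sigma_tilde=0.0, device="cpu"):
    """Minimal sketch of the DDIM-style sampling loop in Eqs. (4)-(5).

    Assumptions (not from the paper's code): `score_model(x, t)` returns the
    score s_theta(x_t, t), and `alphas_bar` is a 1-D tensor holding alpha_bar_t.
    """
    T = len(alphas_bar) - 1
    x = torch.randn(shape, device=device)                  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_bar_t, a_bar_prev = alphas_bar[t], alphas_bar[t - 1]
        score = score_model(x, t)
        # Eq. (4): denoised estimate x~_{0|t} from x_t and the score
        x0_pred = (x + (1.0 - a_bar_t) * score) / a_bar_t.sqrt()
        # Eq. (5): DDIM update from x_t and x~_{0|t} to x_{t-1}
        eps = (x - a_bar_t.sqrt() * x0_pred) / (1.0 - a_bar_t).sqrt()
        noise = sigma_tilde * torch.randn_like(x)
        x = (a_bar_prev.sqrt() * x0_pred
             + (1.0 - a_bar_prev - sigma_tilde ** 2).sqrt() * eps
             + noise)
    return x
```

Setting `sigma_tilde=0.0` gives the deterministic DDIM sampler used as the default in the text.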

Diffusion model applications: Recently, diffusion models have been improved to generate images of better quality than previous generative models such as GANs [5, 31]. Furthermore, the diffusion model can be viewed as a powerful generative prior and applied to many conditional generation tasks. A representative work is Stable Diffusion, which can generate images given textual prompts [33]. Diffusion models are also applied to many low-level vision tasks. For example, DDRM [14] performs diffusion sampling in the spectral space of the degradation operator A to reconstruct the missing information in the observation y. DDNM [50] follows a similar idea and accomplishes image restoration by iteratively refining the null space of the operator A. DPS [3] uses a Laplace approximation to compute the log-likelihood gradient for posterior sampling, which can handle many noisy nonlinear inverse problems. In ΠGDM [36], the authors use a few approximations to make the log-likelihood tractable, allowing it to solve inverse problems even with non-differentiable measurements.

2.2. Multi-modal image fusion

Deep learning-based multi-modal image fusion algorithms achieve effective feature extraction and information fusion through the powerful fitting ability of neural networks. Fusion algorithms fall mainly into two branches: generative and discriminative methods. Generative methods [26, 23, 27], especially the GAN family, employ adversarial training [7, 28, 30] to generate fused images with the same distribution as the source images. Among discriminative methods, autoencoder-based models [57, 18, 16, 21, 42, 17, 51] use encoders and decoders to extract features and fuse them on a high-dimensional manifold. Algorithm-unrolling models [4, 6, 58, 49, 59] combine traditional optimization methods and neural networks to balance efficiency and interpretability. Unified models [46, 52, 45, 54, 13] avoid the problems of insufficient training data and missing task-specific ground truth. More recently, fusion methods have been combined with pattern recognition tasks such as semantic segmentation [39] and object detection [20] to explore interactions with downstream tasks. Self-supervised learning [19] has been employed to train fusion networks without paired images. In addition, pre-registration modules [47, 10, 43] can enhance robustness to unregistered input images.

2.3. Comparison with existing approaches

The methods most relevant to our model are optimization-based methods and GAN-based generative methods. Traditional optimization-based methods are often limited by hand-designed loss functions, which may not be flexible enough to capture all relevant aspects and are sensitive to changes in the data distribution. Incorporating natural image priors provides additional knowledge that cannot be modeled by the loss function alone. Moreover, compared with GAN-based generative methods, which may suffer from unstable training and mode collapse, our DDFM achieves more stable and controllable fusion.

3. Method

In this section, we first propose a new method to obtain fused images using DDPM posterior sampling. Then, starting from the established image fusion loss function, a likelihood correction method for unconditional DDPM sampling is derived. Finally, we propose the DDFM algorithm, which embeds the solution of hierarchical Bayesian inference into diffusion sampling. In addition, this paper will also demonstrate the rationality of the proposed algorithm. For brevity, we omit the derivation of some equations and refer the interested reader to the supplementary material. It is worth noting that we use IVF as an example to illustrate our DDFM, and MIF can be performed similarly to IVF.

3.1. Fusing images via diffusion posterior sampling

We first give the notation for the model formulation. The infrared, visible and fused images are denoted as $i \in \mathbb{R}^{HW}$, $v \in \mathbb{R}^{3HW}$ and $f \in \mathbb{R}^{3HW}$, respectively.

We expect the distribution of $f$ given $i$ and $v$, i.e. $p(f\,|\,i, v)$, to be modeled, so that $f$ can be sampled from the posterior distribution. Inspired by Eq. (2), the reverse SDE of the diffusion process can be expressed as:

$$df = \left[-\frac{\beta(t)}{2}f - \beta(t)\,\nabla_{f_t}\log p_t(f_t\,|\,i,v)\right]dt + \sqrt{\beta(t)}\,d\bar{w}, \tag{6}$$

where the score function, i.e. $\nabla_{f_t}\log p_t(f_t\,|\,i,v)$, is computed as:

$$\nabla_{f_t}\log p_t(f_t\,|\,i,v) = \nabla_{f_t}\log p_t(f_t) + \nabla_{f_t}\log p_t(i,v\,|\,f_t) \approx \nabla_{f_t}\log p_t(f_t) + \nabla_{f_t}\log p_t(i,v\,|\,\tilde{f}_{0|t}), \tag{7}$$

where $\tilde{f}_{0|t}$ is the estimate of $f_0$ given $f_t$ obtained from the unconditional DDPM. The equality follows from Bayes' theorem, and the approximation is proved in [3].

In Eq. (7), the first term is the score function of unconditional diffusion sampling, which is readily obtained from the pre-trained DDPM. In the next section, we illustrate how to obtain $\nabla_{f_t}\log p_t(i,v\,|\,\tilde{f}_{0|t})$.
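Eq. (7) can be read as "conditional score = unconditional score + likelihood gradient". The sketch below illustrates that decomposition in DPS-style autograd form [3]; `log_likelihood` is a hypothetical placeholder for $\log p(i,v\,|\,\tilde f_{0|t})$, which DDFM itself does not evaluate directly but handles through the EM rectification of Section 3.3.

```python
import torch

def conditional_score(score_model, f_t, t, i_img, v_img, alphas_bar, log_likelihood):
    """Illustration of Eq. (7): conditional score = unconditional score +
    gradient of the measurement log-likelihood evaluated at the denoised
    estimate f~_{0|t} (DPS-style guidance, not the exact DDFM procedure).

    `log_likelihood(i_img, v_img, f0)` is a hypothetical placeholder that
    must return a scalar approximating log p(i, v | f0).
    """
    f_t = f_t.detach().requires_grad_(True)
    uncond_score = score_model(f_t, t)                                 # ∇_{f_t} log p_t(f_t)
    a_bar_t = alphas_bar[t]
    f0_tilde = (f_t + (1.0 - a_bar_t) * uncond_score) / a_bar_t.sqrt() # Eq. (4)
    ll = log_likelihood(i_img, v_img, f0_tilde)                        # log p(i, v | f~_{0|t})
    grad_ll = torch.autograd.grad(ll, f_t)[0]                          # ∇_{f_t} log p(i, v | f~_{0|t})
    return uncond_score.detach() + grad_ll
```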

3.2. Likelihood rectification for image fusion

Unlike the traditional image degradation inverse problem $y = A(x) + n$, where $x$ is the underlying image, $y$ is the measurement and $A(\cdot)$ is known, so that the posterior distribution can be obtained explicitly, in image fusion $p_t(i,v\,|\,\tilde{f}_t)$ and $p_t(i,v\,|\,\tilde{f}_{0|t})$ cannot be expressed explicitly. To address this, we start from the loss function and establish the relationship between the optimization loss $\ell(i,v,\tilde{f}_{0|t})$ and the likelihood $p_t(i,v\,|\,\tilde{f}_{0|t})$ of a probabilistic model. For brevity, in Sections 3.2.1 and 3.2.2, $\tilde{f}_{0|t}$ is abbreviated as $f$.

3.2.1 Formulation of the likelihood model

We first give the commonly used loss functions for image fusion tasks [17, 51, 22, 55]:

(Eq. (8): equation image not reproduced)

Then, with the simple variable substitution $x = f - v$ and $y = i - v$, we get

(Eq. (9): equation image not reproduced)

Since $y$ is known and $x$ is unknown, this $\ell_1$-norm optimization corresponds to the regression model $y = kx + \epsilon$ with $k$ fixed to 1. According to the relationship between regularization terms and prior/noise distributions, $\epsilon$ should be Laplacian noise and $x$ should be governed by a Laplace distribution. Thus, in a Bayesian manner, we have:

(Eq. (10): equation image not reproduced)

where $\mathrm{LAP}(\cdot)$ denotes a Laplace distribution, and $\rho$ and $\gamma$ are the scale parameters of $p(x)$ and $p(y\,|\,x)$, respectively.
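For readers unfamiliar with this correspondence, the link between $\ell_1$ terms and Laplace distributions is the standard one; it is stated here for completeness (with the notation $\rho$, $\gamma$ of Eq. (10)) rather than copied from the paper:

$$\mathrm{LAP}(\xi;\mu,b) = \frac{1}{2b}\exp\!\left(-\frac{|\xi-\mu|}{b}\right)
\;\;\Longrightarrow\;\;
-\log p(y\,|\,x) = \frac{1}{\gamma}\|y-x\|_1 + \text{const}, \qquad
-\log p(x) = \frac{1}{\rho}\|x\|_1 + \text{const},$$

so maximizing the posterior $p(x\,|\,y) \propto p(y\,|\,x)\,p(x)$ amounts to minimizing an $\ell_1$ data term plus an $\ell_1$-type regularizer, which matches the structure described above.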

To handle the $\ell_1$-norm optimization in Eq. (9), inspired by [22, 56], we give Proposition 1:

Proposition 1. A random variable (RV) $\xi$ following a Laplace distribution can be seen as the coupling of a normally distributed RV and an exponentially distributed RV, i.e.:

(Eq. (11): equation image not reproduced)

Accordingly, $p(x)$ and $p(y\,|\,x)$ in Eq. (10) can be rewritten as the following hierarchical Bayesian framework:

(Eq. (12): equation image not reproduced)

where $i = 1, \ldots, H$ and $j = 1, \ldots, W$. Through the above probabilistic analysis, the optimization problem in Eq. (9) can be transformed into a maximum likelihood inference problem.

In addition, following [22, 39], a total variation penalty term $r(x) = \|\nabla x\|_2^2$ can be added so that the fused image $f$ better preserves the texture information of $v$, where $\nabla$ is the gradient operator. Ultimately, the log-likelihood function of the probabilistic inference problem is:

(Eq. (13): equation image not reproduced)
The probabilistic graph of this hierarchical Bayesian model is shown in Figure 1b. It is worth noting that in this way we transform the optimization problem in Eq. (8) into a maximum likelihood problem for the probabilistic model in Eq. (13). Furthermore, unlike traditional optimization methods that require manually specifying the tuning coefficient φ in Eq. (8), φ in our model can be updated adaptively by inferring the latent variables, enabling the model to better fit different data distributions. The effectiveness of this design is also verified by the ablation experiments in Section 4.3. We explore how to infer the model in the next section.

3.2.2 Inferring the likelihood model via the EM algorithm

To solve the maximum log-likelihood problem in Eq. (13), which can be viewed as an optimization problem with latent variables, we use the expectation-maximization (EM) algorithm to obtain the optimal $x$. In the E-step, the expectation of the log-likelihood with respect to $p(a, b\,|\,x^{(t)}, y)$, the so-called $Q$ function, is computed:

(Eq. (14): equation image not reproduced)

In the M-step, the optimal $x$ is obtained by

(Eq. (15): equation image not reproduced)

Next, we present the implementation details of each step.
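For reference, the generic form of the two EM steps is the textbook one, written here in the notation above with latent variables $a, b$; the exact expressions of Eqs. (14) and (15) are given in the original paper:

$$Q\big(x\,\big|\,x^{(t)}\big) = \mathbb{E}_{a,b\,|\,x^{(t)},y}\Big[\log p\big(y, a, b\,\big|\,x\big)\Big], \qquad x^{(t+1)} = \arg\max_x\; Q\big(x\,\big|\,x^{(t)}\big).$$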

E-step. Proposition 2 gives the computation of the conditional expectations of the latent variables, from which the $Q$ function is then derived.
Proposition 2. The conditional expectations of the latent variables $1/m_{ij}$ and $1/n_{ij}$ in Eq. (13) are:

(Eq. (16): equation image not reproduced)

Proof: (derivation images not reproduced)

After that, the $Q$ function in Eq. (14) can be deduced as:

(equation image not reproduced)

where $m_{ij}$ and $n_{ij}$ denote $\mathbb{E}_{m_{ij}|x^{(t)}_{ij},y_{ij}}[1/m_{ij}]$ and $\mathbb{E}_{n_{ij}|x^{(t)}_{ij},y_{ij}}[1/n_{ij}]$ from Eq. (16), respectively, $\odot$ is element-wise multiplication, and $m$ and $n$ are the matrices whose elements are $\sqrt{m_{ij}}$ and $\sqrt{n_{ij}}$, respectively.

M-step. Here, we need to minimize the negative $Q$ function with respect to $x$. We handle this problem with the half-quadratic splitting algorithm, i.e.:

(equation image not reproduced)

which can be further transformed into the following unconstrained optimization problem:

(equation image not reproduced)

The unknown variables $k$, $u$ and $x$ can be solved iteratively by coordinate descent.

Update k: This is a deconvolution problem,

(equation image not reproduced)

which can be solved efficiently with the fast Fourier transform (FFT) and inverse FFT operators; the solution for $k$ is

(equation image not reproduced)

where $\widetilde{(\cdot)}$ denotes the complex conjugate.
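The closed-form FFT solution is not reproduced above, but the underlying trick is standard: a quadratic deconvolution problem has a closed-form minimizer in the Fourier domain. The NumPy sketch below shows that generic solver; the operator `h`, right-hand side `c` and weight `eta` are illustrative placeholders, not the paper's exact sub-problem.

```python
import numpy as np

def fft_deconvolve(h, c, eta):
    """Solve min_k ||h * k - c||_2^2 + eta * ||k||_2^2 in closed form,
    where * denotes circular convolution. In the Fourier domain the minimizer is
        k = IFFT( conj(H) * C / (|H|^2 + eta) ),  with H = FFT(h), C = FFT(c).
    This mirrors the kind of FFT/IFFT update used for the k sub-problem,
    but is a generic sketch rather than the paper's exact expression."""
    H = np.fft.fft2(h, s=c.shape)          # transfer function of the operator
    C = np.fft.fft2(c)
    K = np.conj(H) * C / (np.abs(H) ** 2 + eta)
    return np.real(np.fft.ifft2(K))

# Illustrative usage with a finite-difference-like kernel:
# c = np.random.rand(64, 64)
# h = np.array([[0.0, -1.0], [0.0, 1.0]])
# k = fft_deconvolve(h, c, eta=0.1)
```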

Update u: This is a double-norm penalized regression problem,

(equation image not reproduced)

and the solution for $u$ is

(equation image not reproduced)

Update x: This is a least-squares problem,

(equation image not reproduced)

and the solution for $x$ is

(equation image not reproduced)

where the division is element-wise, and the final estimate of $f$ is

(equation image not reproduced)

In addition, the hyperparameters $\gamma$ and $\rho$ in Eq. (10) can also be sampled based on $x$ (Eq. (29)) via

(equation image not reproduced)

3.3. DDFM

Overview. In Section 3.2, we proposed a method to obtain a hierarchical Bayesian model from an existing loss function and to perform model inference via the EM algorithm. In this section, we present our DDFM, in which inference and diffusion sampling are integrated within the same iterative framework to generate $f_0$ given $i$ and $v$. The algorithm is illustrated in Algorithm 1 and Figure 3.

DDFM contains two modules, the unconditional diffusion sampling (UDS) module and the likelihood rectification (EM) module. The UDS module provides natural image priors to improve the visual plausibility of the fused image, while the EM module rectifies the output of the UDS module via the likelihood to preserve more information from the source images.
(Algorithm 1 and the Figure 3 graphic are not reproduced here.)
Figure 3: Computational graph of our DDFM in one iteration. Unlike traditional DDPM, the likelihood rectification is performed by the EM algorithm, i.e. the update $\tilde{f}_{0|t} \Rightarrow \hat{f}_{0|t}$.

Unconditional diffusion sampling module. In Section 2.1, we briefly introduced diffusion sampling. In Algorithm 1, UDS (gray part) is divided into two parts: the first uses $f_t$ to estimate $\tilde{f}_{0|t}$, and the second uses both $f_t$ and $\hat{f}_{0|t}$ to estimate $f_{t-1}$. From the score-based DDPM in Eq. (7), the pre-trained DDPM directly outputs the current $\nabla_{f_t}\log p_t(f_t)$, while $\nabla_{f_t}\log p_t(i,v\,|\,\tilde{f}_{0|t})$ is obtained through the EM module.

EM module. The EM module updates $\tilde{f}_{0|t} \Rightarrow \hat{f}_{0|t}$. In Algorithm 1 and Figure 3, the EM algorithm (blue and yellow) is plugged into UDS (gray). The initial estimate $\tilde{f}_{0|t}$ generated by DDPM sampling (line 5) serves as the input of the EM algorithm, which yields $\hat{f}_{0|t}$ (lines 6-13), the likelihood-rectified estimate of the fused image. In other words, the EM module rectifies $\tilde{f}_{0|t}$ into $\hat{f}_{0|t}$ so that it satisfies the likelihood.
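Putting the two modules together, one run of DDFM has roughly the following shape. This is a simplified sketch of Algorithm 1, not the authors' code: `em_rectify` stands in for the one-step EM update of lines 6-13, and all function names are illustrative placeholders.

```python
import torch

@torch.no_grad()
def ddfm_sample(score_model, em_rectify, i_img, v_img, alphas_bar):
    """Simplified sketch of DDFM: unconditional diffusion sampling (UDS)
    interleaved with one EM rectification step per iteration.

    `em_rectify(f0_tilde, i_img, v_img)` is a placeholder for the one-step EM
    update (lines 6-13 of Algorithm 1) returning the rectified f^_{0|t}."""
    T = len(alphas_bar) - 1
    f = torch.randn_like(v_img)                                  # f_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_bar_t, a_bar_prev = alphas_bar[t], alphas_bar[t - 1]
        score = score_model(f, t)
        # UDS part 1: estimate f~_{0|t} from f_t (Eq. (4))
        f0_tilde = (f + (1.0 - a_bar_t) * score) / a_bar_t.sqrt()
        # EM module: rectify f~_{0|t} -> f^_{0|t} using the source images
        f0_hat = em_rectify(f0_tilde, i_img, v_img)
        # UDS part 2: estimate f_{t-1} from f_t and f^_{0|t} (Eq. (5), sigma~_t = 0)
        eps = (f - a_bar_t.sqrt() * f0_tilde) / (1.0 - a_bar_t).sqrt()
        f = a_bar_prev.sqrt() * f0_hat + (1.0 - a_bar_prev).sqrt() * eps
    return f
```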

3.4. Why does one-step EM work?

The main difference between our DDFM and the traditional EM algorithm is that the traditional method requires multiple iterations to obtain the optimal x, i.e., lines 6-13 of Algorithm 1 would have to be looped many times. In contrast, our DDFM requires only one EM iteration, embedded in the DDPM framework, to accomplish the sampling. Below, we give Proposition 3 to justify this.

Proposition 3. One-step unconditional diffusion sampling combined with a one-step EM iteration is equivalent to one-step conditional diffusion sampling.

Proof: (derivation image not reproduced)

4. Infrared and visible image fusion

In this section, we detail extensive experiments on the IVF task to demonstrate the superiority of our method. More related experiments are placed in the supplementary material.

4.1. Setup

Datasets and pre-trained model. According to the scheme of [20,19], IVF experiments were carried out on four test datasets of TNO[41], RoadScene[46], MSRS[40] and M3FD[20]. Note that there is no training dataset, since we do not need to do any fine-tuning for a specific task, but directly use the pre-trained DDPM model. We choose the pretrained model proposed in [5], which is trained on ImageNet [34].

Metrics. We use entropy (EN), standard deviation (SD), mutual information (MI), visual information fidelity (VIF), Q^{AB/F} and the structural similarity index measure (SSIM) for a comprehensive quantitative evaluation. Details of these metrics can be found in [24].
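As an illustration of the two simplest metrics, the following snippet computes EN and SD with their standard formulas (our own sketch, not the evaluation code used in the paper):

```python
import numpy as np

def entropy_and_sd(img_uint8):
    """Compute image entropy (EN, in bits) and standard deviation (SD)
    for a grayscale fused image with values in [0, 255]."""
    hist, _ = np.histogram(img_uint8, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                               # drop empty bins before the log
    en = -np.sum(p * np.log2(p))               # EN = -sum_k p_k log2 p_k
    sd = np.std(img_uint8.astype(np.float64))  # SD over all pixels
    return en, sd
```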

Implementation details. We use a machine with an NVIDIA GeForce RTX 3090 GPU for fused image generation. All input images are normalized to $[-1, 1]$. $\psi$ and $\eta$ in Eq. (23) are set to 0.5 and 0.1, respectively. Please refer to the supplementary material for the selection of $\psi$ and $\eta$ by grid search.


4.2. Comparison with SOTA methods

In this section, we compare our DDFM with state-of-the-art methods, including the GAN-based group: FusionGAN [26], GANMcC [27], TarDAL [20] and UMFusion [43]; and the discriminative group: U2Fusion [45], RFNet [47] and DeFusion [19].

Qualitative comparison . We show the comparison of fusion results in Figure 4 and Figure 5. Our method effectively combines thermal radiation information from infrared images and detailed texture information from visible light images. As a result, objects located in dimly lit environments are clearly highlighted, making it easy to distinguish foreground objects from the background. In addition, background features that were previously unclear due to low illumination now have sharp edges and rich contour information, enhancing our ability to understand the scene.

Quantitative comparison. Subsequently, the aforementioned six metrics are used to quantitatively compare the fusion results, as shown in Table 1. Our method shows superior performance on almost all metrics, confirming its applicability to different lighting conditions and object classes. Notably, the outstanding values of MI, VIF and Q^{AB/F} across all datasets indicate that it generates images that match human visual perception while maintaining the integrity of the source image information.

4.3. Ablation studies

Extensive ablation experiments were performed to confirm the reliability of our modules. The six metrics above are used to evaluate the fusion performance of each experimental group; the results on the RoadScene test set are shown in Table 2.
Unconditional diffusion sampling module. We first verify the effectiveness of DDPM. In Exp. I, we remove the denoising diffusion generation framework and only use the EM algorithm to solve the optimization problem in Eq. (8) to obtain the fused image. For fairness, the total number of iterations is kept consistent with DDFM.

EM module. Next, we verify the components of the EM module. In Exp. II, we remove the total variation penalty term r(x) in Eq. (13). We then remove the hierarchical Bayesian inference model: as mentioned earlier, φ in Eq. (8) can be inferred automatically in the hierarchical Bayesian model, so we instead manually set φ to 0.1 (Exp. III) and 1 (Exp. IV) and use the ADMM algorithm to solve the model.

In summary, the results in Table 2 show that none of the experimental groups can achieve fusion results comparable to our DDFM, further emphasizing the effectiveness and rationality of our method.

5. Medical image fusion

In this section, we conduct MIF experiments to verify the effectiveness of our method.

Settings. We select 50 pairs of medical images from the Harvard medical image dataset [8] for the MIF experiments, including MRI-CT, MRI-PET and MRI-SPECT image pairs. The generation strategy and evaluation metrics for the MIF task are the same as for IVF.

Comparison with SOTA methods. Qualitative and quantitative results are shown in Fig. 6 and Table 3. DDFM clearly preserves complex textures while emphasizing structural information, resulting in excellent visual quality and superior scores on almost all numerical metrics.

6. Conclusion

We propose DDFM, a generative image fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The generation problem is divided into an unconditional DDPM subproblem, which exploits image generation priors, and a maximum likelihood subproblem, which preserves cross-modality information from the source images. We model the latter in a hierarchical Bayesian manner and integrate its EM-based solution into unconditional DDPM sampling to achieve conditional image fusion. Experiments on infrared-visible and medical image fusion show that the method achieves promising fusion results.

References

[1] Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982. 2
[2] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. CoRR, abs/2004.10934, 2020. 1
[3] Hyungjin Chung, Jeongsol Kim, Michael T. McCann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In ICLR, 2023. 3, 6
[4] Xin Deng and Pier Luigi Dragotti. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell., 43(10):3333–3348, 2021. 3
[5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 3, 7
[6] Fangyuan Gao, Xin Deng, Mai Xu, Jingyi Xu, and Pier Luigi Dragotti. Multi-modal convolutional dictionary learning. IEEE Trans. Image Process., 31:1325–1339, 2022. 3
[7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014. 1, 3
[8] Harvard Medical website. http://www.med.harvard.edu/AANLIB/home.html. 8
[9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 2
[10] Zhanbo Huang, Jinyuan Liu, Xin Fan, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Reconet: Recurrent correction network for fast and efficient multi-modality image fusion. In European Conference on Computer Vision (ECCV), 2022. 3
[11] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. 2
[12] Alex Pappachen James and Belur V. Dasarathy. Medical image fusion: A survey of the state of the art. Inf. Fusion, 19:4–19, 2014. 1
[13] Hyungjoo Jung, Youngjung Kim, Hyunsung Jang, Namkoo Ha, and Kwanghoon Sohn. Unsupervised deep image fusion with structure tensor representations. IEEE Trans. Image Process., 29:3845–3858, 2020. 3
[14] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793, 2022. 3
[15] Hui Li, Kede Ma, Hongwei Yong, and Lei Zhang. Fast multi-scale structural patch decomposition for multi-exposure image fusion. IEEE Trans. Image Process., 29:5805–5816, 2020. 1
[16] Hui Li, Xiao-Jun Wu, and Tariq S. Durrani. Nestfuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Trans. Instrum. Meas., 69(12):9645–9656, 2020. 3
[17] Hui Li, Xiao-Jun Wu, and Josef Kittler. Rfn-nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion, 73:72–86, 2021. 3, 4
[18] Hui Li and Xiao-Jun Wu. Densefuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing, 28(5):2614–2623, 2018. 3
[19] Pengwei Liang, Junjun Jiang, Xianming Liu, and Jiayi Ma. Fusion from decomposition: A self-supervised decomposition approach for image fusion. In European Conference on Computer Vision (ECCV), 2022. 3, 6, 7, 8
[20] Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In CVPR, pages 5792–5801. IEEE, 2022. 1, 2, 3, 6, 7, 8
[21] Risheng Liu, Zhu Liu, Jinyuan Liu, and Xin Fan. Searching a hierarchically aggregated fusion architecture for fast multi-modality image fusion. In ACM Multimedia, pages 1600–1608. ACM, 2021. 1, 3
[22] Jiayi Ma, Chen Chen, Chang Li, and Jun Huang. Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion, 31:100–109, 2016. 4
[23] Jiayi Ma, Pengwei Liang, Wei Yu, Chen Chen, Xiaojie Guo, Jia Wu, and Junjun Jiang. Infrared and visible image fusion via detail preserving adversarial learning. Information Fusion, 54:85–98, 2020. 3
[24] Jiayi Ma, Yong Ma, and Chang Li. Infrared and visible image fusion methods and applications: A survey. Information Fusion, 45:153–178, 2019. 7
[25] Jiayi Ma, Han Xu, Junjun Jiang, Xiaoguang Mei, and Xiao-Ping (Steven) Zhang. Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Trans. Image Process., 29:4980–4995, 2020. 2
[26] Jiayi Ma, Wei Yu, Pengwei Liang, Chang Li, and Junjun Jiang. Fusiongan: A generative adversarial network for infrared and visible image fusion. Information Fusion, 48:11–26, 2019. 2, 3, 7, 8
[27] Jiayi Ma, Hao Zhang, Zhenfeng Shao, Pengwei Liang, and Han Xu. Ganmcc: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans. Instrum. Meas., 70:1–14, 2021. 2, 3, 7, 8
[28] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017. 3
[29] Bikash Meher, Sanjay Agrawal, Rutuparna Panda, and Ajith Abraham. A survey on region based image fusion methods. Information Fusion, 48:119–132, 2019. 1
[30] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 1, 3
[31] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, pages 8162–8171, 2021. 3
[32] Xuebin Qin, Zichen Vincent Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. Basnet: Boundary-aware salient object detection. In CVPR, pages 7479–7489. Computer Vision Foundation / IEEE, 2019. 1
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 3
[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, 2015. 7
[35] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021. 2, 3
[36] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. 3
[37] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. 2
[38] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR. OpenReview.net, 2021. 2
[39] Linfeng Tang, Jiteng Yuan, and Jiayi Ma. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion, 82:28–42, 2022. 3, 4
[40] Linfeng Tang, Jiteng Yuan, Hao Zhang, Xingyu Jiang, and Jiayi Ma. Piafusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion, 83-84:79–92, 2022. 1, 6, 8
[41] Alexander Toet and Maarten A. Hogervorst. Progress in color night vision. Optical Engineering, 51(1):1–20, 2012. 6, 8
[42] Vibashan VS, Jeya Maria Jose Valanarasu, Poojan Oza, and Vishal M. Patel. Image fusion transformer. CoRR, abs/2107.09011, 2021. 3
[43] Di Wang, Jinyuan Liu, Xin Fan, and Risheng Liu. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. In IJCAI, pages 3508–3515. ijcai.org, 2022. 3, 7, 8
[44] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In ICLR, 2022. 2
[45] Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell., 44(1):502–518, 2022. 1, 3, 7, 8
[46] Han Xu, Jiayi Ma, Zhuliang Le, Junjun Jiang, and Xiaojie Guo. Fusiondn: A unified densely connected network for image fusion. In AAAI Conference on Artificial Intelligence, AAAI, pages 12484–12491, 2020. 1, 3, 6, 8
[47] Han Xu, Jiayi Ma, Jiteng Yuan, Zhuliang Le, and Wei Liu. Rfnet: Unsupervised network for mutually reinforcing multi-modal image registration and fusion. In CVPR, pages 19647–19656. IEEE, 2022. 3, 7, 8
[48] Shuang Xu, Jiangshe Zhang, Zixiang Zhao, Kai Sun, Junmin Liu, and Chunxia Zhang. Deep gradient projection networks for pan-sharpening. In CVPR, pages 1366–1375. Computer Vision Foundation / IEEE, 2021. 1
[49] Shuang Xu, Zixiang Zhao, Yicheng Wang, Chunxia Zhang, Junmin Liu, and Jiangshe Zhang. Deep convolutional sparse coding networks for image fusion. CoRR, abs/2005.08448, 2020. 3
[50] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. arXiv:2212.00490, 2022. 3
[51] Hao Zhang and Jiayi Ma. Sdnet: A versatile squeeze-and-decomposition network for real-time image fusion. Int. J. Comput. Vis., 129(10):2761–2785, 2021. 1, 3, 4
[52] Hao Zhang, Han Xu, Yang Xiao, Xiaojie Guo, and Jiayi Ma. Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity. In AAAI, pages 12797–12804. AAAI Press, 2020. 3
[53] Xingchen Zhang. Deep learning-based multi-focus image fusion: A survey and a comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 1
[54] Yu Zhang, Yu Liu, Peng Sun, Han Yan, Xiaolin Zhao, and Li Zhang. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion, 54:99–118, 2020. 3
[55] Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. CoRR, abs/2211.14461, 2022. 4
[56] Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, and Jiangshe Zhang. Bayesian fusion for infrared and visible images. Signal Processing, 177, 2020. 4
[57] Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Jiangshe Zhang, and Pengfei Li. DIDFuse: Deep image decomposition for infrared and visible image fusion. In International Joint Conference on Artificial Intelligence, IJCAI, pages 970–976, 2020. 1, 3
[58] Zixiang Zhao, Shuang Xu, Jiangshe Zhang, Chengyang Liang, Chunxia Zhang, and Junmin Liu. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans. Circuits Syst. Video Technol., 32(3):1186–1196, 2022. 3
[59] Zixiang Zhao, Jiangshe Zhang, Shuang Xu, Zudi Lin, and Hanspeter Pfister. Discrete cosine transform network for guided depth map super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5697–5707, June 2022. 3
[60] Zixiang Zhao, Jiangshe Zhang, Shuang Xu, Kai Sun, Lu Huang, Junmin Liu, and Chunxia Zhang. FGF-GAN: A lightweight generative adversarial network for pansharpening via fast guided filter. In ICME, pages 1–6. IEEE, 2021. 1


Origin blog.csdn.net/qq_52358603/article/details/131922363