DVDNET: A FAST NETWORK FOR DEEP VIDEO DENOISING


https://ieeexplore.ieee.org/document/8803136

Summary

The existing state-of-the-art video denoising algorithms are patch-based methods, and previous NN-based algorithms could not match their performance. This paper, however, proposes an NN-based video denoising algorithm with better performance:

  • Compared with patch-based algorithms, computation time is significantly reduced
  • Compared with other neural-network algorithms, it has a small memory footprint and can handle a range of noise levels with a single model

Introduction

We introduce a network for deep video denoising: DVDnet. The algorithm compares favorably with other state-of-the-art methods while achieving fast runtimes. The output of the algorithm exhibits remarkable temporal coherence, very low flicker, strong noise reduction, and accurate detail preservation.

Image Denoising

Most recent image denoising algorithms are based on deep learning techniques and perform well, but they are restricted to specific forms of priors and require hand-tuned parameters.

Most current algorithms also face a disadvantage: a specific model must be trained for each noise level.

Video Denoising

There are few neural-network-based algorithms for video denoising, and their performance has not matched that of patch-based methods. Through developments such as VBM4D and VNLB, the current best denoising results are obtained by VNLB, but it suffers from long runtimes: processing even a single frame takes several minutes. The algorithm proposed in this paper outperforms VNLB.

Method in this paper

Current state of the field: most previous approaches based on deep learning have failed to effectively employ the temporal information present in image sequences.

Key to denoising: temporal coherence and the absence of flicker are vital aspects of the perceived quality of a video.

Methods for enforcing temporal coherence in the output:

  • the extension of search regions from spatial neighborhoods to volumetric neighborhoods
  • the use of motion estimation.


First, denoising is divided into two stages:

  • each frame of the input sequence is individually denoised with a spatial denoiser
  • the denoised frames are registered with respect to the central frame

(1) Viewed as a sequence, the individually denoised frames exhibit noticeable flickering; therefore, in the second stage, the adjacent frames are aligned to the central frame by warping them with optical flow, i.e., motion compensation.
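The optical-flow warping step can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (DVDnet uses DeepFlow estimates with sub-pixel interpolation); the function name is hypothetical and nearest-neighbour sampling is used to keep it short:

```python
import numpy as np

def warp_to_reference(frame, flow):
    """Backward-warp `frame` toward the reference frame: each output pixel
    (x, y) samples `frame` at (x + dx, y + dy) given by the flow field.
    Nearest-neighbour sampling keeps the sketch short; a real implementation
    interpolates at sub-pixel positions."""
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

# A constant flow of (dx=1, dy=0) shifts the content one pixel to the left.
frame = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
warped = warp_to_reference(frame, flow)
```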

(2) The 2T + 1 aligned frames are concatenated and fed into the temporal denoising block. Using temporal neighbors when denoising each frame helps reduce flickering, because the residual errors in the frames are correlated.


Additionally, a noise map is added as input to the spatial and temporal denoisers. Including the noise map as input makes it possible to handle spatially varying noise [18]. In contrast to other denoising algorithms, the denoiser takes no parameters as input other than the image sequence and an estimate of the input noise.
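The two-stage pipeline described above can be sketched with NumPy shapes. The denoiser and warping functions below are identity stubs with hypothetical names, meant only to show how the 2T + 1 frames and the noise map flow through the two stages:

```python
import numpy as np

T = 2                                      # temporal radius: 2T + 1 = 5 frames
H, W, sigma = 64, 64, 25.0

# Identity stubs standing in for the learned blocks (names hypothetical):
spatial_denoise  = lambda frame, noise_map: frame
warp_to_center   = lambda frame: frame     # optical-flow alignment stub
temporal_denoise = lambda stack: stack[T]  # returns the central frame

frames = [np.random.rand(H, W) for _ in range(2 * T + 1)]
noise_map = np.full((H, W), sigma / 255.0)  # constant map for uniform AWGN

# Stage 1: denoise each frame individually with the spatial block.
den = [spatial_denoise(f, noise_map) for f in frames]
# Stage 2: align neighbours to the central frame, then stack all 2T + 1
# aligned frames together with the noise map as input to the temporal block.
aligned = [warp_to_center(f) for f in den]
stack = np.stack(aligned + [noise_map], axis=0)  # shape (2T + 2, H, W)
out = temporal_denoise(stack)
```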

Temporal and Spatial Denoising Blocks


The design of the spatial and temporal blocks offers a good compromise between denoising performance and fast runtimes. Both modules are implemented as standard feed-forward networks.

The architecture of the spatial denoiser is inspired by the architecture in [8, 9], while the temporal denoiser also borrows some elements from [13].

The spatial and temporal denoising blocks are composed of D_spatial = 12 and D_temporal = 6 convolutional layers, respectively. The number of feature maps is set to W = 96. Each convolutional layer is followed by a pointwise ReLU [19] activation function ReLU(x) = max(x, 0). During training, a batch normalization layer (BN [20]) is placed between the convolutional and ReLU layers.

At test time, the batch normalization layers are removed and replaced by an affine layer that applies the learned normalization. The spatial size of the convolution kernels is 3 × 3, and the stride is set to 1.
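Folding a trained BN layer into an affine layer is a standard identity: the learned statistics collapse into a scale and a shift. A minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

def fold_bn(gamma, beta, mean, var, eps=1e-5):
    """Collapse trained batch-norm statistics into the affine map y = a*x + b,
    which is what replaces BN at test time."""
    a = gamma / np.sqrt(var + eps)
    b = beta - a * mean
    return a, b

# The folded affine layer reproduces the BN output exactly:
x = np.random.randn(8)
gamma, beta = 1.5, 0.2
mean, var = x.mean(), x.var()
bn_out = gamma * (x - mean) / np.sqrt(var + 1e-5) + beta
a, b = fold_bn(gamma, beta, mean, var)
```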

In both blocks, the input is first downscaled to quarter resolution. The main advantage of performing denoising at lower resolutions is the greatly reduced runtime and memory requirements, without sacrificing denoising performance [8, 18]. Scaling back to full resolution is performed using the technique described in [21]. Both blocks feature residual connections [10], which are observed to simplify the training process [18].
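Assuming the down/upscaling in [21] is a sub-pixel (space-to-depth) rearrangement, the round trip is lossless, unlike plain downsampling. A NumPy sketch for a single-channel image:

```python
import numpy as np

def space_to_depth(x, r=2):
    """Rearrange an (H, W) image into r*r sub-sampled maps of quarter area."""
    H, W = x.shape
    return (x.reshape(H // r, r, W // r, r)
             .transpose(1, 3, 0, 2)
             .reshape(r * r, H // r, W // r))

def depth_to_space(x, r=2):
    """Inverse rearrangement (sub-pixel upscaling back to full resolution)."""
    c, h, w = x.shape
    return (x.reshape(r, r, h, w)
             .transpose(2, 0, 3, 1)
             .reshape(h * r, w * r))
```

Because the rearrangement only reorders pixels, no information is lost at the lower resolution.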

Training Details

The spatial denoising block and the temporal denoising block are trained separately, with the spatial block trained first. Training uses randomly cropped patches, corrupted by adding AWGN with σ ∈ [0, 55] to a given image or sequence.
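Generating a training pair under this scheme means drawing a noise level, corrupting the clean patch, and building the matching noise map. A minimal NumPy sketch (function name hypothetical):

```python
import numpy as np

def add_awgn(patch, rng, sigma_max=55.0):
    """Corrupt a clean training patch with AWGN of a random level
    sigma ~ U[0, sigma_max] (pixel range [0, 255]) and return the matching
    constant noise map fed to the network alongside the patch."""
    sigma = rng.uniform(0.0, sigma_max)
    noisy = patch + rng.normal(0.0, sigma, size=patch.shape)
    noise_map = np.full(patch.shape, sigma)
    return noisy, noise_map, sigma

noisy, noise_map, sigma = add_awgn(np.zeros((50, 50)), np.random.default_rng(0))
```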

Spatial denoising block: trained on the Waterloo Exploration Database, with a total of 10,240,000 randomly cropped patches of size 50. I personally believe the loss function used is the L2 loss, and motion estimation uses DeepFlow for compensation.

Temporal denoising block: trained on the DAVIS dataset.

Common settings: the ADAM algorithm [25] is applied to minimize the loss function, with all its hyper-parameters set to their default values. Epochs = 80, mini-batch size = 128; the learning rate is 1e-3 for the first 50 epochs, 1e-4 for epochs 50-60, and 1e-6 for the remainder.

Data augmentation: by introducing different scaling factors and random flips, the data is multiplied fivefold.
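One way to realize this augmentation is a random flip combined with a rescale. The text does not give the exact scaling factors, so the set below is an illustrative placeholder, and nearest-neighbour resizing stands in for a proper interpolation:

```python
import numpy as np

def augment(patch, rng):
    """Random horizontal flip plus a rescale by a factor drawn from a small
    set. The factors below are illustrative placeholders, not the paper's."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                  # horizontal flip
    scale = rng.choice([1.0, 0.9, 0.8])
    nh, nw = int(patch.shape[0] * scale), int(patch.shape[1] * scale)
    ys = (np.arange(nh) / scale).astype(int)    # nearest-neighbour resize
    xs = (np.arange(nw) / scale).astype(int)
    return patch[np.ix_(ys, xs)]

out = augment(np.arange(100.0).reshape(10, 10), np.random.default_rng(1))
```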

Results

Test sets: DAVIS and Set8 (the first paragraph of the results section describes the parameters of these datasets)

Comparison: VBM4D, VNLB, Neat Video (commercial denoising software)


In general, the sequences output by DVDnet display significant temporal coherence. The method renders very little flicker, especially in flat regions, where patch-based algorithms often leave behind low-frequency residual noise. An example can be observed in Fig. 3 (best viewed in digital format). Temporally decorrelated low-frequency noise in flat regions appears particularly annoying to the observer. More video examples can be found on the algorithm's website.


Conclusion



DVDnet's denoising results show remarkable temporal coherence, very low flicker, and excellent detail preservation. The algorithm achieves runtimes that are at least an order of magnitude faster than other state-of-the-art competitors. Although the results presented in this paper are for Gaussian noise, the method could be extended to denoise other types of noise.


Origin blog.csdn.net/qq_38758371/article/details/131730259