RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs

article: Michael Niemeyer et al., "RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs" (CVPR 2022), https://doi.org/10.48550/arXiv.2112.00724.
code: https://github.com/google-research/google-research/tree/master/regnerf
affiliations: Max Planck Institute for Intelligent Systems, Google Research

Summary

NeRF has become a powerful representation for novel view synthesis thanks to its simplicity and state-of-the-art performance.
When NeRF is given images from many views, it can render photo-realistic images from new camera viewpoints, but when the number of input views shrinks, its performance drops significantly. In real-world applications such as AR/VR, autonomous driving, and robotics, the available input is usually sparse: each scene has only a few views (covering a specific object or a partial region). In these scenarios the quality of the novel views rendered by NeRF degrades markedly, as shown in the figure below.
image.png
The authors of RegNeRF argue that most of the artifacts appearing in sparse-input settings have two main causes:

  • Scene geometry estimation error
  • Divergent behavior at the beginning of training

In response to these problems, RegNeRF proposes the following improvements:

  • Regularizing the geometry and appearance of patches rendered from unobserved viewpoints
  • annealing the ray sampling space during training
  • Regularizing the color predictions at unobserved viewpoints with a normalizing flow model

1 Introduction

To address the problem that NeRF's performance degrades significantly with sparse inputs, some works (e.g., MVSNeRF, SRF, pixelNeRF) propose conditional models. Instead of being optimized from scratch at test time for a given scene, these models must be trained on large-scale datasets of many scenes with multi-view images and camera pose annotations, and such pre-training is relatively expensive.
Once pre-trained on a large dataset, these models do not need to be retrained for new scenes: they can generate novel views from only a few input images via amortized inference, and a short per-scene fine-tuning stage can produce sharper and more detailed results than models retrained on that scene alone. Although these conditional models have achieved good results, pre-training on large multi-scene datasets is expensive, the models may not generalize well to new views, and because sparse input data is inherently ambiguous, the novel views rendered by these models can also be ambiguous.

:::info
Multi-view image datasets containing diverse scenes are not always readily available and can be expensive to acquire. In addition, most of these methods require a period of fine-tuning, and when the test data domain changes (e.g., categories not seen during training), the quality of the generated novel views tends to degrade.
:::

The RegNeRF paper proposes a new method for regularizing NeRF models in sparse-input scenarios. The main contributions are as follows:

  • Use a patch-based regularizer to regularize the geometry and appearance of patches rendered from unobserved viewpoints, avoiding expensive pre-training

:::info
Previous work in the direction of regularization includes DS-NeRF and DietNeRF.
DS-NeRF improves reconstruction accuracy by adding depth supervision. In contrast, RegNeRF only uses RGB images and does not require depth input.
DietNeRF compares CLIP embeddings of unobserved viewpoints rendered at low resolution. This semantic consistency loss can only provide high-level information and cannot improve scene geometry for sparse inputs. In contrast, RegNeRF regularizes the scene geometry and appearance of rendered patches and applies a scene-space annealing strategy.
In experiments, the authors find that RegNeRF obtains more realistic scene geometry and more accurate novel views.
:::

  • A normalizing flow model is used to regularize the predicted colors at unseen viewpoints by maximizing the log-likelihood of rendered patches, thereby avoiding color shifts between different views.
  • The ray-sampling annealing strategy first samples scene content in a small range before expanding to the full scene bounds, which effectively prevents divergence early in training.

3 Overall Module Design

image.png
The overall idea: given the viewpoints of the observed cameras (blue cameras), compute the bounding box of all plausible camera positions and define unobserved viewpoints (red cameras). These views are not among the input images but can be sampled from the set of possible camera poses. Rays (red) are cast from the unobserved views, points are sampled along these rays, and the sampled points are fed into the neural radiance field $f_\theta$ to predict color and density. Alpha compositing over all samples on a ray yields the color and depth of one pixel; because each unobserved view is rendered as a patch, repeating this for every pixel of the patch yields a predicted RGB color patch $\hat{P}$ and a depth patch $\hat{d}_\theta$. The appearance of the RGB patch and the scene geometry of the depth patch are then regularized. Appearance regularization estimates the likelihood of the rendered patch's colors (originally a maximized log-likelihood; taking the negative turns it into a minimized negative log-likelihood), where $\phi$ is a RealNVP normalizing flow model trained on patches of the JFT-300M dataset. Since JFT-300M is an unstructured 2D image dataset, $\phi$ can be reused for scenes of any kind. Geometry regularization imposes a depth-smoothness prior on the rendered depth patches; the authors state that this helps reduce floating artifacts and yields more realistic scene geometry even with sparse input views.
RegNeRF builds on mip-NeRF, so let us briefly review NeRF and mip-NeRF.

3.1 NeRF and mip-NeRF

3.1.1 NeRF

Broadly speaking, NeRF constructs an implicit rendering process. Its input is the origin $o$ and direction $d$ of a ray emitted from a viewpoint, together with the corresponding 3D coordinates $(x, y, z)$ of sample points; these are fed into the neural radiance field to obtain volume density and color, and the final image is produced through volume rendering.
image.png
Neural radiance field
image.png
Volume rendering
Once we have a 3D scene model (that is, the neural radiance field $f_\theta$), images must be synthesized with the radiance field as the intermediate carrier; this process is rendering, and NeRF uses volume rendering. Concretely, assume the camera optical center is at position $o \in R^3$. Connecting any pixel on the image plane with the optical center gives the viewing direction $d \in R^3$, and from the optical center and the viewing direction we obtain a ray $r(t) = o + t d$. The volume rendering formula below then gives the color observed at that pixel:
image.png
The above formula is approximated discretely to complete the volume rendering process. With that in place, the observed images serve as supervision with an MSE loss, and training can begin.
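As a concrete illustration, here is a minimal sketch of the discretized volume rendering (alpha compositing) step in jax.numpy; the function name and array shapes are illustrative assumptions, not NeRF's reference code:

```python
import jax.numpy as jnp

def composite_ray(sigma, rgb, t_vals):
    """sigma: (N,) densities, rgb: (N, 3) colors, t_vals: (N + 1,) sample distances."""
    delta = t_vals[1:] - t_vals[:-1]                                 # interval lengths
    alpha = 1.0 - jnp.exp(-sigma * delta)                            # per-sample opacity
    accum = jnp.cumsum(sigma * delta)
    trans = jnp.exp(-jnp.concatenate([jnp.zeros(1), accum[:-1]]))    # transmittance T_i
    weights = alpha * trans
    color = (weights[:, None] * rgb).sum(axis=0)                     # composited pixel color
    midpoints = 0.5 * (t_vals[1:] + t_vals[:-1])
    depth = (weights * midpoints).sum()                              # expected depth along the ray
    return color, depth
```

The expected depth computed from the same weights is what the depth patches used later for geometry regularization are built from.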

3.1.2 mip-NeRF

:::info
Background of mip-NeRF: NeRF samples the scene with a single ray per pixel. This works well only when the camera position is fixed and just the viewing direction changes; if the image is zoomed in or out, blurring and aliasing appear in NeRF's renderings. This is usually caused by inconsistent resolutions among the images of the same scene. Aside: aliasing arises when the sampling frequency is lower than the frequency of the underlying signal, i.e., the "aliasing" phenomenon in signal processing. There are two ways to combat it: increase the sampling rate as much as possible, e.g., SSAA/MSAA used for anti-aliasing in graphics; or remove high-frequency components, e.g., blur edges with a low-pass filter. The idea behind mip-NeRF: NeRF casts only a single ray per pixel; casting multiple rays per pixel would raise the sampling rate and alleviate aliasing to some extent, but this is impractical for NeRF, because rendering along a single ray already requires hundreds of MLP queries, so the computation would grow enormously. mip-NeRF therefore proposes to replace each ray with a cone, as shown below.
:::


image.png
Compared with NeRF, mip-NeRF makes two main improvements:

  • Replacing rays with cones
  • Replacing NeRF's two MLPs (coarse and fine) with a single multiscale MLP, improving training speed and reducing model size.

When a cone is used, a sample is no longer a discrete point set but a continuous conical frustum, which addresses the fact that NeRF ignores the volume of the region each ray actually observes. The corresponding region is expressed as:
image.png
Correspondingly, the positional encoding is converted into an integrated form:
image.png
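To make the idea concrete, below is a small sketch of an integrated positional encoding in the spirit of mip-NeRF: the sin/cos features of the frustum's mean are attenuated according to its diagonal Gaussian variance, so high frequencies are suppressed for large footprints. This is an illustrative approximation, not the paper's exact implementation:

```python
import jax.numpy as jnp

def integrated_pos_enc(mean, var, num_freqs=16):
    """mean, var: (..., 3) Gaussian approximation of a conical frustum."""
    scales = 2.0 ** jnp.arange(num_freqs)                     # 2^l, l = 0..L-1
    scaled_mean = mean[..., None, :] * scales[:, None]        # (..., L, 3)
    scaled_var = var[..., None, :] * (scales[:, None] ** 2)   # (..., L, 3)
    damping = jnp.exp(-0.5 * scaled_var)                      # attenuation of E[sin], E[cos]
    feats = jnp.concatenate(
        [jnp.sin(scaled_mean) * damping, jnp.cos(scaled_mean) * damping], axis=-1)
    return feats.reshape(feats.shape[:-2] + (-1,))            # flatten frequency bands
```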

3.2 Patch-Based Regularization

:::info
The authors first review why sparse input views cause NeRF's performance to drop so significantly.
:::
With sparse input views, NeRF is supervised only by the reconstruction loss in (3) above. Although it can learn to reconstruct the input views perfectly, it may degenerate on novel views, because in this sparse-input setting nothing biases the model toward learning a 3D-consistent solution.

To address this, RegNeRF regularizes unobserved camera viewpoints. Specifically, it defines a space of viewpoints that are not observed but are related to the existing viewpoints, and renders patches from viewpoints randomly sampled in this space.
The main idea of the paper is that these patches can be regularized to produce smooth geometry and high-likelihood colors.
Unobserved Viewpoint Selection
To apply regularization at unobserved viewpoints, one must first define a sample space of unobserved camera poses. First, assume a set of known target poses $\{P^i_{target}\}_i$, where:
image.png
These target poses can be thought of as the bounds of the set of poses from which you wish to render new views at test time. The authors define the space of possible camera positions as the bounding box of all given target camera positions:
image.png
where $t_{min}$ and $t_{max}$ are the element-wise minimum and maximum of $\{t^i_{target}\}_i$, respectively.
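A minimal sketch of sampling a camera position uniformly inside this bounding box (assuming `t_target` holds the (N, 3) target translations; names are illustrative):

```python
import jax
import jax.numpy as jnp

def sample_camera_position(key, t_target):
    """t_target: (N, 3) translations of the given target poses."""
    t_min = t_target.min(axis=0)
    t_max = t_target.max(axis=0)
    u = jax.random.uniform(key, shape=(3,))
    return t_min + u * (t_max - t_min)   # uniform sample inside the bounding box
```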

What does $[R|t]$ mean?
image.png
https://www.jianshu.com/p/2341da36aa8e
SE(3): the special Euclidean group.
Lie group introduction: https://zhuanlan.zhihu.com/p/460985235
Simply put, $R$ is a 3×3 orthogonal matrix with determinant 1 (a rotation), and $t$ is a 3×1 translation vector.

To obtain the sampling space for camera rotations, the authors assume that all cameras roughly focus on a central scene point. A common "up" axis $\bar{p}_u$ is defined by computing the normalized mean of the up axes of all target poses. Then the mean focal point $\bar{p}_f$ is computed by solving a least-squares problem for the 3D point with the smallest squared distance to the optical axes of all target poses. To learn a more robust representation, random jitter is added to the focal point before computing the camera rotation matrix. The set of all possible camera rotations, given a sampled position $t$, is defined as:
image.png
where $R(\cdot,\cdot,\cdot)$ denotes the resulting "look-at" camera rotation matrix (the camera focuses on the central scene point) and $\epsilon$ is a small jitter added to the focal point, drawn from a normal distribution with mean 0 and variance 0.0125. A random camera pose is obtained by sampling a position and a rotation:
image.png
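Putting the two steps together, a hedged sketch of building the "look-at" rotation for a sampled position is shown below; the axis conventions and the way the jitter scale is applied are my own assumptions, not the authors' exact code:

```python
import jax
import jax.numpy as jnp

def _normalize(v):
    return v / (jnp.linalg.norm(v) + 1e-8)

def sample_look_at_rotation(key, position, focal_point, up_axis, jitter=0.0125):
    # small zero-mean jitter added to the focal point (scale taken from the text)
    eps = jitter * jax.random.normal(key, shape=(3,))
    forward = _normalize(focal_point + eps - position)   # optical axis toward the scene
    right = _normalize(jnp.cross(forward, up_axis))
    up = jnp.cross(right, forward)
    # stack the camera axes as columns of the rotation matrix; the sign convention
    # for the viewing direction depends on the chosen camera model
    return jnp.stack([right, up, -forward], axis=-1)
```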
Geometry Regularization
It is a generally accepted fact that real-world geometries tend to be piecewise smooth, i.e. planar structures are more likely than high-frequency structures.

:::info
What are the low-frequency and high-frequency components of an image?
:::
Low frequency means the color (i.e., the gray level) changes slowly, corresponding to regions of continuous, gradual change. For an image, the low-frequency part is everything except the high-frequency part, roughly the content inside the edges; it carries most of the image's information, namely its general outline and approximate content.
High frequency means the signal changes quickly. In an image, the gray level changes quickly where adjacent regions differ strongly, typically at the boundary between an object and the background: along edges the gray value changes abruptly, so edges correspond to high frequencies. Image details also belong to regions where the gray value changes sharply, which is precisely why they appear as details.
As shown in the figure below, location 1 is a low-frequency region and location 2 is a high-frequency region.

The authors encourage depth smoothness at unobserved viewpoints and incorporate this prior into the model. Analogously to the pixel-color rendering formula in (2), the depth of a ray is computed as (following NeRF):
image.png
The depth-smoothness loss is then:
image.png
where $R_r$ denotes a set of rays sampled from the camera poses $S_P$, $r_{ij}$ is the ray through pixel $(i, j)$ of the patch centered at $r$, and $S_{patch}$ is the size of the rendered patch.
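A minimal sketch of this depth-smoothness term on a single rendered depth patch (shapes and names are illustrative):

```python
import jax.numpy as jnp

def depth_smoothness_loss(depth_patch):
    """depth_patch: (S, S) expected depths of a patch rendered from an unobserved view."""
    d_h = depth_patch[:, :-1] - depth_patch[:, 1:]   # horizontal neighbor differences
    d_v = depth_patch[:-1, :] - depth_patch[1:, :]   # vertical neighbor differences
    return (d_h ** 2).sum() + (d_v ** 2).sum()       # squared differences, summed over the patch
```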
Color Regularization
The authors note that for sparse inputs, most artifacts are caused by incorrect scene geometry. However, even with correct geometry, optimizing a NeRF model can still lead to color shifts or other appearance errors because of the sparsity of the input. To avoid degenerate colors and ensure stable optimization, RegNeRF also regularizes the color predictions. The key idea is to estimate the likelihood of rendered patches and maximize it during optimization. To this end, the paper leverages off-the-shelf unstructured 2D image datasets.

Structured data, also called quantitative data, is information that can be represented with numbers or a unified structure, such as numbers and symbols. Typical examples include credit card numbers, dates, financial amounts, phone numbers, addresses, and product names.
Unstructured data is data whose structure is irregular or incomplete, with no predefined data model, and which is inconvenient to represent with the two-dimensional tables of a database: office documents of all formats, text, pictures, HTML, various reports, images, and audio/video, etc.

Although datasets of posed multi-view images are extremely expensive to collect, collections of unstructured natural images are plentiful. The only requirement on the dataset in this paper is that it contain diverse natural images, so that the same flow model can be reused when reconstructing any kind of real-world scene.
The paper trains a RealNVP normalizing flow model on patches of the JFT-300M dataset. With this trained flow model, the log-likelihood (LL) of rendered patches is estimated and maximized during optimization. Here $\phi$ is the learned bijection that maps a patch of size $S_{patch}=8$ to $R^d$, with $d = S_{patch}\cdot S_{patch}\cdot 3$.

bijection (bijective mapping): if each element of one set is paired with exactly one element of a second set, and each element of the second set is paired with exactly one element of the first, the function between the two sets is a bijection; every element is paired exactly once. Reference: https://brilliant.org/wiki/bijection-injection-and-surjection/

Define the color regularization loss as:
image.png
where $R_r$ denotes a set of rays sampled from $S_P$, $\hat{P}_r$ is the predicted RGB color patch centered at $r$, and $-\log p_Z$ denotes the negative log-likelihood (NLL) under the Gaussian base density $p_Z$.
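A hedged sketch of this appearance term, assuming a placeholder `flow` object whose `forward` method implements the RealNVP-style bijection $\phi$; following the loss as written above, only the Gaussian base density $p_Z$ of the mapped patch is evaluated:

```python
import jax.numpy as jnp

def color_patch_nll(flow, rgb_patch):
    """rgb_patch: (S, S, 3) predicted colors; `flow` stands in for a pretrained RealNVP model."""
    x = rgb_patch.reshape(-1)          # flatten the patch to R^(S*S*3)
    z = flow.forward(x)                # apply the learned bijection phi
    # negative log-likelihood of z under a standard Gaussian base density p_Z
    return 0.5 * jnp.sum(z ** 2) + 0.5 * z.size * jnp.log(2.0 * jnp.pi)
```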
The total loss optimized in each iteration is:

image.png

where $R_i$ denotes a set of rays from the input poses and $R_r$ denotes a set of rays from the randomly sampled poses $S_P$.

The total loss is NeRF's original MSE loss plus the geometry regularization loss plus the color regularization loss. The weights of the three terms in the paper are 1, 0.1, and $10^{-6}$, respectively. To make training more robust, the weight of the depth-smoothness loss is annealed from 400 to 0.1 over the first 512 optimization steps.
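Combining the terms with the numbers quoted above gives roughly the following sketch, assuming a linear anneal of the depth weight; the exact schedule in the released code may differ:

```python
def total_loss(loss_mse, loss_depth, loss_color, step,
               w_depth_start=400.0, w_depth_end=0.1, anneal_steps=512,
               w_color=1e-6):
    # anneal the depth-smoothness weight over the first `anneal_steps` steps
    frac = min(step / anneal_steps, 1.0)
    w_depth = w_depth_start + frac * (w_depth_end - w_depth_start)
    return 1.0 * loss_mse + w_depth * loss_depth + w_color * loss_color
```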

3.3 Sample Space Annealing

For very sparse settings (e.g., 3 or 6 input views), the authors identify another cause of poor NeRF performance: divergent behavior at the start of training, which results in high density values at the ray origins. The input views are then reconstructed correctly, but novel views degenerate because no 3D-consistent representation is recovered. Quickly annealing the sampled scene space over the early iterations of optimization helps avoid this problem. By restricting the scene sampling space to a small region defined for all input images, the authors introduce an inductive bias that the geometric structure of interest lies near the center of the scene.

Annealing algorithm: to address the problem of local optima, Kirkpatrick et al. proposed the simulated annealing algorithm (SA) in 1983, which can effectively escape local optima. Simulated annealing consists of two parts, the Metropolis algorithm and the annealing schedule; the Metropolis algorithm is what allows it to jump out of local optima and is the basis of annealing. In 1953, Metropolis proposed the importance-sampling idea of accepting new states with a certain probability instead of a fully deterministic rule (the Metropolis criterion), which keeps the amount of computation low. Reference: https://blog.csdn.net/weixin_42398658/article/details/84031235

Recall formula (2) (the original NeRF color rendering formula): $t_n, t_f$ are the camera's near and far planes, respectively, and let $t_m$ be a defined center point (usually the midpoint between $t_n$ and $t_f$). Define:
image.png
where $i$ is the current training iteration, $N_t$ is the number of iterations until the full range is reached, and $p_s$ is the starting range (e.g., 0.5). This annealing is applied to renderings from both the input poses and the sampled unobserved viewpoints. The authors find that this annealing strategy ensures stability during early training and avoids degenerate solutions.

The scene sample space is annealed over the first iterations for the 3/6-input-view settings. Specifically, the sampling space is annealed linearly over the first $N_t = 256$ iterations, starting from the midpoint $t_m$ between $t_n$ and $t_f$ with an initial range of $p_s = 0.5$. For the LLFF dataset, which is parameterized with NDC rays, 512 steps are used, starting from the far plane $t_m = t_f$ with an initial range of $p_s = 0.0001$.
Question: If you set ps=0.5, according to the above calculation formula, the section from 0.5 to 1 seems to have no effect
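Reading the description literally, the annealed near/far bounds can be sketched as follows (names are mine; the released code may differ):

```python
def annealed_bounds(step, t_near, t_far, t_mid, n_anneal=256, p_start=0.5):
    # fraction of the full range that is currently allowed
    eta = min(max(step / n_anneal, p_start), 1.0)
    near = t_mid + (t_near - t_mid) * eta
    far = t_mid + (t_far - t_mid) * eta
    return near, far
```

Under this reading, the clamp at $p_s$ only matters for the first $p_s \cdot N_t$ iterations, after which the range grows linearly until it covers the full $[t_n, t_f]$ interval.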

3.4 Training Details

The RegNeRF code is built on top of mip-NeRF, using the JAX framework.

What is JAX?
Simply put, it is GPU-accelerated NumPy with support for automatic differentiation (autodiff).
The main goal of JAX is to combine these advantages of NumPy with hardware acceleration. JAX (https://github.com/google/jax) is now open source and uses GPUs (CUDA) for hardware acceleration.

Optimizer: Adam, with an exponential learning-rate decay from $2\cdot 10^{-3}$ to $2\cdot 10^{-5}$.

Deep-learning-based optimization and inference are computationally intensive and consume a lot of energy. To curb unnecessary energy usage, the authors use the relatively large learning rate above to shorten optimization time. Finding good weights in even fewer iterations would be a direction worth exploring.

The neural radiance field is parameterized as a fully connected ReLU network with 8 layers and a hidden dimension of 256.
For the 3/6/9-input-view settings, 128 points are sampled along each ray.

Clip the gradient by a value of 0.1, and then clip the gradient by a norm of 0.1. (We clip gradients by value at 0.1 and then by norm at 0.1.)
The paper trains for 500 pixel epochs with a batch size of 4096, which corresponds to 44K, 88K, and 132K iterations on DTU for 3/6/9 input views respectively, fewer than mip-NeRF's default 250K iterations.
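As an illustration, an optax-based optimizer matching the hyperparameters quoted above might be set up as follows; this mirrors the text, not necessarily the released configuration:

```python
import optax

def make_optimizer(total_steps):
    lr_schedule = optax.exponential_decay(
        init_value=2e-3,
        transition_steps=total_steps,
        decay_rate=2e-5 / 2e-3,          # reaches 2e-5 after total_steps
    )
    return optax.chain(
        optax.clip(0.1),                 # clip gradients by value
        optax.clip_by_global_norm(0.1),  # then clip by global norm
        optax.adam(learning_rate=lr_schedule),
    )
```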

4 Experiments

  1. Datasets

Two real-world multi-view datasets:
DTU: contains images of objects placed on a table;
LLFF: contains complex forward-facing scenes.

  2. Evaluation metrics

PSNR (Peak Signal-to-Noise Ratio)

image.png

Structural Similarity Index (SSIM)

image.png

LPIPS (Learned Perceptual Image Patch Similarity), also known as "perceptual loss"

It measures the difference between two images. Introduced in the CVPR 2018 paper "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric", the metric learns the inverse mapping from a generated image to the ground truth, forcing the generator to reconstruct the real image from the fake one and prioritizing the perceptual similarity between them. LPIPS matches human perception better than traditional measures such as L2/PSNR, SSIM, and FSIM. A lower LPIPS value means the two images are more similar; a higher value means they differ more.
Given a ground-truth reference patch x and a distorted patch x0, the perceptual similarity is computed as follows:

where d is the distance between x0 and x. Feature stacks are extracted from L layers and unit-normalized along the channel dimension. A vector $w_l \in R^{C_l}$ scales the activated channels, and the L2 distance is computed; finally, the result is averaged over space and summed over channels.

For comparison, the average of $MSE = 10^{-PSNR/10}$, $\sqrt{1-SSIM}$, and LPIPS is also reported.
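For reference, a small sketch of this combined score as described here (the paper may aggregate the three transformed metrics differently):

```python
import jax.numpy as jnp

def combined_score(psnr, ssim, lpips):
    mse = 10.0 ** (-psnr / 10.0)            # recover MSE from PSNR
    return jnp.mean(jnp.array([mse, jnp.sqrt(1.0 - ssim), lpips]))
```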

  3. Baselines

Comparison with the state-of-the-art methods PixelNeRF, SRF, and MVSNeRF.

image.png

4.3 Ablation Experiments

image.png
image.png
In Table 3 above, the authors run ablations on the components proposed in the paper. For sparse input views, the scene-space annealing strategy and geometry regularization play an important role; in contrast, appearance regularization has almost no effect, and my impression is that it was added mainly for the sake of novelty.
Table 4 reports other geometry regularization techniques the authors studied; their proposed method performs best.
