NeRF must-read: Mip-NeRF summary and formula derivation

Foreword

It has only been three years since NeRF appeared in 2020, and follow-up work has grown explosively. I believe that in the near future NeRF will reshape the 3D reconstruction industry in one fell swoop, and perhaps even rebuild our four-dimensional world (allow me a bit of hype up front). Although NeRF's history is short, a few works have already become essential reading in my research area:
* PixelNeRF ---- a go-to tool for generalization
* Mip-NeRF ---- reconstruction of near and far views
* NeRF in the Wild ---- background reconstruction under varying illumination
* NeuS ---- surface reconstruction with NeRF
* Instant-NGP ---- multi-scale hash encoding for efficient rendering

Abstract

Because distant and near views are observed at different resolutions, classic NeRF shows obvious flaws on multi-scale scenes: its reconstruction of close-up views is blurred, while its reconstruction of distant views appears jagged (aliased). The simple, brute-force remedy is supersampling, but that is prohibitively expensive. In place of the positional encoding (PE) used by NeRF, Mip-NeRF proposes an integrated positional encoding (IPE), which describes the distribution of information in space at multiple scales and is therefore much more sensible.

NeRF

Positional Encoding:

\gamma (x)=[\sin (x),\cos (x),...,\sin (2^{L-1}x),\cos (2^{L-1}x)]^T
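For concreteness, here is a minimal NumPy sketch of this encoding; the function name and the default L are illustrative, not from the paper:

```python
import numpy as np

def positional_encoding(x, L=10):
    """NeRF positional encoding gamma(x) for x of shape (..., D).

    Returns [sin(2^0 x), cos(2^0 x), ..., sin(2^{L-1} x), cos(2^{L-1} x)]
    with shape (..., 2 * L * D).
    """
    freqs = 2.0 ** np.arange(L)                               # 2^0, 2^1, ..., 2^{L-1}
    xb = x[..., None, :] * freqs[:, None]                     # (..., L, D)
    enc = np.concatenate([np.sin(xb), np.cos(xb)], axis=-1)   # (..., L, 2D)
    return enc.reshape(*x.shape[:-1], -1)
```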

MLP:

\forall t_k \in \mathbf{t},\quad [\tau_k,\mathbf{c}_k]=\mathrm{MLP}(\gamma(\mathbf{r}(t_k));\Theta)

Final Predicted Color of the Pixel:

\mathbf{C}(\mathbf{r};\Theta,\mathbf{t})=\sum_{k}T_{k}\,(1-\exp(-\tau_k(t_{k+1}-t_k)))\,\mathbf{c}_k,\quad \text{with}\quad T_k=\exp\Big(-\sum_{k'<k}\tau_{k'}(t_{k'+1}-t_{k'})\Big)
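A small NumPy sketch of this quadrature, assuming the densities \tau_k, colors \mathbf{c}_k and sample distances t_k are already given (function and variable names are mine):

```python
import numpy as np

def composite_color(tau, colors, t):
    """Alpha-composite sampled densities/colors into a pixel color.

    tau    : (K,)   densities tau_k
    colors : (K, 3) colors c_k
    t      : (K+1,) sorted sample distances along the ray
    Returns (C, weights) where weights[k] = T_k * (1 - exp(-tau_k * delta_k)).
    """
    delta = t[1:] - t[:-1]                                    # t_{k+1} - t_k
    alpha = 1.0 - np.exp(-tau * delta)                        # per-segment opacity
    T = np.exp(-np.cumsum(np.concatenate([[0.0], tau * delta]))[:-1])  # transmittance T_k
    weights = T * alpha
    return (weights[:, None] * colors).sum(axis=0), weights
```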

 Loss Function:

\min_{\Theta^c,\Theta^f} \sum_{\mathbf{r}\in R}\Big(\Vert C^*(\mathbf{r})-C(\mathbf{r};\Theta^c,\mathbf{t}^c) \Vert^2_2+\Vert C^*(\mathbf{r})-C(\mathbf{r};\Theta^f,\mathrm{sort}(\mathbf{t}^c \cup \mathbf{t}^f)) \Vert^2_2\Big)

Here \mathbf{t}^f is obtained by inverse transform sampling from the weights \omega_k=T_k(1-\exp(-\tau_k(t_{k+1}-t_k))) inferred by the coarse model.
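A minimal sketch of that inverse transform sampling, assuming the coarse bin edges and weights are already computed (names are illustrative):

```python
import numpy as np

def sample_fine(t_coarse, weights, n_fine, rng=np.random.default_rng(0)):
    """Sample t^f from the piecewise-constant PDF defined by the coarse weights.

    t_coarse : (K+1,) coarse bin edges along the ray
    weights  : (K,)   coarse weights omega_k (not necessarily normalized)
    """
    pdf = weights / (weights.sum() + 1e-8)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_fine)                                        # uniform draws
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    return t_coarse[idx] + frac * (t_coarse[idx + 1] - t_coarse[idx])   # invert the CDF
```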

Mip-NeRF

NeRF encodes each pixel by point-sampling along the corresponding ray, ignoring the shape and volume of the region each sample actually covers.

As a result, at the intersection of the yellow and blue samples shown in the figure, inference produces nearly identical point-sampled features even though the two rays cover very different regions. Mip-NeRF attempts to encode shape and volume to resolve this dilemma.

Mip-NeRF defines the conical frustum represented by a ray over an interval [t_0,t_1] as follows:
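For reference, the paper writes this frustum as an indicator function over points \mathbf{x}, for a ray with origin \mathbf{o}, direction \mathbf{d}, interval [t_0,t_1] and pixel radius \dot{r} on the image plane:

F(\mathbf{x},\mathbf{o},\mathbf{d},\dot{r},t_0,t_1)=\mathbb{1}\left\{\left(t_0<\frac{\mathbf{d}^T(\mathbf{x}-\mathbf{o})}{\Vert\mathbf{d}\Vert^2_2}<t_1\right)\wedge\left(\frac{\mathbf{d}^T(\mathbf{x}-\mathbf{o})}{\Vert\mathbf{d}\Vert_2\,\Vert\mathbf{x}-\mathbf{o}\Vert_2}>\frac{1}{\sqrt{1+(\dot{r}/\Vert\mathbf{d}\Vert_2)^2}}\right)\right\}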

The part inside the red box is interpreted as follows:

OB in the figure is the value of the middle expression in the red box, which must lie within [t_0,t_1].

\theta_0 and \theta_1 denote the right-hand and left-hand expressions in the blue box, respectively.

The expected positional encoding is:
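Written out with the frustum indicator F from above:

\gamma^*(\mathbf{o},\mathbf{d},\dot{r},t_0,t_1)=\frac{\int\gamma(\mathbf{x})\,F(\mathbf{x},\mathbf{o},\mathbf{d},\dot{r},t_0,t_1)\,d\mathbf{x}}{\int F(\mathbf{x},\mathbf{o},\mathbf{d},\dot{r},t_0,t_1)\,d\mathbf{x}}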

However, this formula has no closed-form solution, so the authors instead approximate the frustum with a multivariate Gaussian, i.e. they solve for (\mathbf{\mu},\mathbf{\Sigma}). Because the frustum is rotationally symmetric about the ray, its mean must lie on the ray, so for the expected position we only need the expected distance along the ray, denoted \mu_t, together with the variance along the ray, \sigma^2_t, and the variance in the circular cross-section perpendicular to the ray, \sigma^2_r.

The conclusion is stated first; see the Formula Derivation section for the derivation:
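With t_\mu=(t_0+t_1)/2 and t_\delta=(t_1-t_0)/2, the closed-form (numerically stable) expressions are:

\mu_t=t_\mu+\frac{2t_\mu t_\delta^2}{3t_\mu^2+t_\delta^2},\quad \sigma_t^2=\frac{t_\delta^2}{3}-\frac{4t_\delta^4(12t_\mu^2-t_\delta^2)}{15(3t_\mu^2+t_\delta^2)^2},\quad \sigma_r^2=\dot{r}^2\left(\frac{t_\mu^2}{4}+\frac{5t_\delta^2}{12}-\frac{4t_\delta^4}{15(3t_\mu^2+t_\delta^2)}\right)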

Convert the mean and variance from the frustum coordinate system to the world coordinate system:
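Concretely, with ray origin \mathbf{o} and direction \mathbf{d}:

\mathbf{\mu}=\mathbf{o}+\mu_t\mathbf{d},\qquad \mathbf{\Sigma}=\sigma_t^2\,(\mathbf{d}\mathbf{d}^T)+\sigma_r^2\left(\mathbf{I}-\frac{\mathbf{d}\mathbf{d}^T}{\Vert\mathbf{d}\Vert^2_2}\right)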

For the positional encoding, collect the frequencies into a matrix and push the frustum Gaussian through that linear map:
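Following the paper, with a frequency matrix \mathbf{P} the PE becomes a linear map followed by sin/cos:

\mathbf{P}=\begin{bmatrix}1&0&0&2&0&0&\cdots&2^{L-1}&0&0\\0&1&0&0&2&0&\cdots&0&2^{L-1}&0\\0&0&1&0&0&2&\cdots&0&0&2^{L-1}\end{bmatrix}^T,\qquad \gamma(\mathbf{x})=\begin{bmatrix}\sin(\mathbf{P}\mathbf{x})\\\cos(\mathbf{P}\mathbf{x})\end{bmatrix}

and the frustum Gaussian lifted into the encoding space is:

\mathbf{\mu}_\gamma=\mathbf{P}\mathbf{\mu},\qquad \mathbf{\Sigma}_\gamma=\mathbf{P}\mathbf{\Sigma}\mathbf{P}^T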

The IPE then has a closed-form solution, given by the following formula:
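This relies on the standard closed form for the expectation of sine and cosine under a Gaussian:

E_{x\sim\mathcal{N}(\mu,\sigma^2)}[\sin(x)]=\sin(\mu)\,e^{-\sigma^2/2},\qquad E_{x\sim\mathcal{N}(\mu,\sigma^2)}[\cos(x)]=\cos(\mu)\,e^{-\sigma^2/2}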

The resulting IPE encoding is:
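With \circ denoting the element-wise product, the integrated positional encoding is:

\gamma(\mathbf{\mu},\mathbf{\Sigma})=E_{\mathbf{x}\sim\mathcal{N}(\mathbf{\mu}_\gamma,\mathbf{\Sigma}_\gamma)}[\gamma(\mathbf{x})]=\begin{bmatrix}\sin(\mathbf{\mu}_\gamma)\circ\exp(-\tfrac{1}{2}\,\mathrm{diag}(\mathbf{\Sigma}_\gamma))\\\cos(\mathbf{\mu}_\gamma)\circ\exp(-\tfrac{1}{2}\,\mathrm{diag}(\mathbf{\Sigma}_\gamma))\end{bmatrix}

A tiny NumPy sketch of this step, assuming the lifted mean \mathbf{\mu}_\gamma and the diagonal of \mathbf{\Sigma}_\gamma have already been computed (names are mine):

```python
import numpy as np

def integrated_pos_enc(mu_gamma, diag_sigma_gamma):
    """IPE feature: expected sin/cos of a Gaussian with mean mu_gamma
    and covariance diagonal diag_sigma_gamma in encoding space."""
    scale = np.exp(-0.5 * diag_sigma_gamma)            # exp(-diag(Sigma_gamma)/2)
    return np.concatenate([np.sin(mu_gamma) * scale,
                           np.cos(mu_gamma) * scale], axis=-1)
```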

Formula Derivation:

Let us look at how \mathbf{\mu}_t, \sigma^2_t and \sigma^2_r are each derived.

The authors give the derivation in the supplemental material; some explanations are added below alongside that derivation.

(x,y,z)=\varphi(r,t,\theta)=(rt\cos \theta,\,rt\sin \theta,\,t),\quad \theta \in [0,2\pi),\; t\geq 0,\; 0\leq r \leq \dot{r}

What should be noted in this formula is that t is a proportionality coefficient with respect to the distance d from the origin to the pixel center on the image plane (this phrasing is a bit confusing; see the figure below).

So when t=1 and r=\dot{r}, the point (x, y) lies on the red circle outlining the pixel; similarly, t=1 and r\leq \dot{r} covers the area inside that red circle on the pixel plane. To evaluate the three-dimensional integrals over the frustum, we need the volume element of this parameterization, as follows:
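The Jacobian determinant of \varphi works out to r\,t^2, so:

dV=\left|\det\frac{\partial(x,y,z)}{\partial(r,t,\theta)}\right|\,dr\,dt\,d\theta=r\,t^2\,dr\,dt\,d\theta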

Therefore, the volume inside the frustum is:
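Integrating this volume element over 0\leq r\leq\dot{r}, t_0\leq t\leq t_1, 0\leq\theta<2\pi:

V=\int_0^{2\pi}\!\int_{t_0}^{t_1}\!\int_0^{\dot{r}} r\,t^2\,dr\,dt\,d\theta=\frac{\pi\dot{r}^2(t_1^3-t_0^3)}{3}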

The expectation of t is:
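Weighting the same integral by t:

E[t]=\frac{1}{V}\int_0^{2\pi}\!\int_{t_0}^{t_1}\!\int_0^{\dot{r}} t\cdot r\,t^2\,dr\,dt\,d\theta=\frac{3(t_1^4-t_0^4)}{4(t_1^3-t_0^3)}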

The expectation of t^{2} is:
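and similarly:

E[t^2]=\frac{1}{V}\int_0^{2\pi}\!\int_{t_0}^{t_1}\!\int_0^{\dot{r}} t^2\cdot r\,t^2\,dr\,dt\,d\theta=\frac{3(t_1^5-t_0^5)}{5(t_1^3-t_0^3)}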

The expectation of x^2 is:
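With x=r\,t\cos\theta:

E[x^2]=\frac{1}{V}\int_0^{2\pi}\!\int_{t_0}^{t_1}\!\int_0^{\dot{r}}(r\,t\cos\theta)^2\,r\,t^2\,dr\,dt\,d\theta=\frac{3\dot{r}^2(t_1^5-t_0^5)}{20(t_1^3-t_0^3)}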

Hence \mu_t=E[t], and \sigma^2_t=E[t^2]-E^2[t].

The paper says that the variance along r can be replaced by the variance of x or y; I still need to work out exactly why, and comments are welcome.

For numerical stability, let t_\mu=(t_0+t_1)/2 and t_\delta=(t_1-t_0)/2. Rewriting the expressions above in terms of t_\mu and t_\delta gives exactly the closed-form formulas for \mu_t, \sigma^2_t and \sigma^2_r quoted in the conclusion above.

References:

Barron, Jonathan T., et al. "Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Barron, Jonathan T., et al. "Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields Supplemental Material."


Origin blog.csdn.net/i_head_no_back/article/details/129419735