Paper notes on NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

NeRF represents a scene with a neural network. Given images of the scene captured from a set of known viewpoints, NeRF can synthesize images of the same scene from novel viewpoints.

Neural radiance field

A neural radiance field (NeRF) represents a scene as a 5D vector-valued function.
The input is a continuous 5D coordinate consisting of a position $\mathbf x = (x, y, z)$ and a viewing direction $\mathbf d = (\theta, \phi)$; the output is the emitted color $\mathbf c = (r, g, b)$ and the volume density $\sigma$.
Concretely, a fully connected network is used to approximate this function, i.e. the network learns $F_\Theta : (\mathbf x, \mathbf d) \rightarrow (\mathbf c, \sigma)$.
The authors encourage the volume density to depend only on position. The network therefore first takes the position $\mathbf x$ as input and outputs $\sigma$ together with a feature vector; the feature vector is then concatenated with the viewing direction $\mathbf d$ and finally mapped to the color $\mathbf c$.
Note that a separate NeRF must be trained for each scene.
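To make this two-branch structure concrete, below is a minimal PyTorch sketch. It is my own simplification, not the paper's exact network (the paper uses 8 hidden layers of width 256 with a skip connection at the fifth layer); the layer sizes and names here are assumptions for illustration, and the inputs are assumed to be already positionally encoded as described in the next section.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Simplified two-branch NeRF MLP: density depends only on position,
    color additionally depends on the viewing direction."""

    def __init__(self, pos_dim=63, dir_dim=27, hidden=256):
        super().__init__()
        # Position branch: processes the encoded position gamma(x).
        self.pos_mlp = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)          # volume density
        self.feature_head = nn.Linear(hidden, hidden)   # feature vector
        # Color branch: feature vector concatenated with the encoded direction gamma(d).
        self.color_mlp = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),    # RGB in [0, 1]
        )

    def forward(self, x_enc, d_enc):
        h = self.pos_mlp(x_enc)
        sigma = torch.relu(self.sigma_head(h))          # keep density non-negative
        feature = self.feature_head(h)
        rgb = self.color_mlp(torch.cat([feature, d_enc], dim=-1))
        return rgb, sigma
```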

Positional encoding

Mapping the inputs into a higher-dimensional space with high-frequency functions before passing them to the network enables a much better fit to data containing high-frequency variation. Therefore, similar to the positional encoding in the Transformer, the authors map $\mathbf x$ and $\mathbf d$ into a high-dimensional space with trigonometric functions before feeding them to the network:
$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$
Here $p$ is each scalar coordinate, e.g. $x$, $y$, $z$ in $\mathbf x$; the paper uses $L = 10$ for $\gamma(\mathbf x)$ and $L = 4$ for $\gamma(\mathbf d)$.
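A small sketch of this encoding in PyTorch (the function name is mine; unlike the formula above, the sketch also keeps the raw coordinate, as common implementations do):

```python
import math
import torch

def positional_encoding(p, num_freqs):
    """Apply gamma(.) to every scalar coordinate of p (shape (..., C)).
    Returns (..., C + 2 * C * num_freqs): the raw coordinate plus
    sin/cos terms at frequencies 2^0 * pi, ..., 2^(L-1) * pi."""
    out = [p]  # keep the raw coordinate as well
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        out.append(torch.sin(freq * p))
        out.append(torch.cos(freq * p))
    return torch.cat(out, dim=-1)

# L = 10 for positions and L = 4 for view directions, as in the paper:
x_enc = positional_encoding(torch.rand(1024, 3), num_freqs=10)  # (1024, 63)
d_enc = positional_encoding(torch.rand(1024, 3), num_freqs=4)   # (1024, 27)
```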

Volume rendering with radiance fields

To render images from the radiance field, the authors use classical volume rendering.
For background on volume rendering, see https://zhuanlan.zhihu.com/p/595117334
The volume density $\sigma(\mathbf x)$ can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at position $\mathbf x$.
In volume rendering, the expected color $C(\mathbf r)$ of a camera ray $\mathbf r(t) = \mathbf o + t\mathbf d$ with near and far bounds $[t_n, t_f]$ is computed as:
$$C(\mathbf r) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf r(t))\,\mathbf c(\mathbf r(t), \mathbf d)\,dt, \quad \text{where } T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf r(s))\,ds\right)$$
Rendering a view from a continuous neural radiance field requires estimating this integral $C(\mathbf r)$ for the camera ray traced through every pixel of the desired virtual camera.

In practice, the integral above is computed numerically, which requires sampling and summation. However, if the sampling were fixed at certain points, the resolution of the representation would be limited. To solve this problem, the authors use stratified sampling: $[t_n, t_f]$ is divided into $N$ evenly spaced bins, and one sample is drawn uniformly at random from each bin:
$$t_i \sim \mathcal U\!\left[t_n + \frac{i-1}{N}(t_f - t_n),\; t_n + \frac{i}{N}(t_f - t_n)\right]$$
Although each set of samples is still discrete, the optimization runs over many iterations and each iteration samples different positions, so it is equivalent to optimizing at continuous positions. $C(\mathbf r)$ is then estimated from the samples as:
$$\hat C(\mathbf r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right)\mathbf c_i, \quad \text{where } T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \tag{1}$$
where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples. This estimate of $C(\mathbf r)$ is differentiable, so the parameters can be optimized directly.
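Below is a PyTorch sketch of the stratified sampling and of the quadrature in formula (1) for a single ray. Function names and the treatment of the last interval are my own choices, not dictated by the paper.

```python
import torch

def stratified_samples(t_near, t_far, num_samples):
    """Split [t_n, t_f] into N equal bins and draw one uniform sample per bin."""
    edges = torch.linspace(t_near, t_far, num_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * torch.rand(num_samples)

def render_ray(sigma, rgb, t_vals):
    """Numerical quadrature of formula (1) for one ray.
    sigma: (N,) densities, rgb: (N, 3) colors, t_vals: (N,) sorted sample depths."""
    delta = t_vals[1:] - t_vals[:-1]                    # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.tensor([1e10])])    # give the last sample an "infinite" interval
    alpha = 1.0 - torch.exp(-sigma * delta)             # 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j delta_j) = prod_{j<i} (1 - alpha_j)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])
    weights = trans * alpha                             # w_i = T_i (1 - exp(-sigma_i delta_i))
    color = (weights[:, None] * rgb).sum(dim=0)         # estimated C_hat(r)
    return color, weights
```

The returned `weights` are exactly the $w_i$ reused by the hierarchical sampling described next.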

Hierarchical volume sampling

Evaluating the NeRF densely at $N$ query points along each camera ray is an inefficient rendering strategy: free space and occluded regions that contribute nothing to the rendered image are still sampled repeatedly.
To solve this problem, the authors train two networks, a coarse one and a fine one. First, $N_c$ points are drawn with stratified sampling, and the coarse network is evaluated at them to compute $\hat C_c(\mathbf r)$.
Rewriting the coarse estimate as a weighted sum of the sampled colors, $\hat C_c(\mathbf r) = \sum_{i=1}^{N_c} w_i \mathbf c_i$ with $w_i = T_i\left(1 - \exp(-\sigma_i \delta_i)\right)$, and normalizing the weights as $\hat w_i = \frac{w_i}{\sum_j w_j}$ yields a piecewise-constant probability density function along the ray. $N_f$ additional points are then drawn from this distribution with inverse transform sampling, so these $N_f$ points are placed according to their importance for the rendering.
Then the fine network is evaluated at all $N_c + N_f$ points, and formula (1) is used to compute the final ray color $\hat C_f(\mathbf r)$.
This hierarchical procedure allocates more samples to the regions that are expected to contain visible content.
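A sketch of the inverse transform sampling step for one ray, assuming the interval boundaries `bins` and the coarse weight of each interval are already available (the function name and the per-ray, non-batched form are my simplifications):

```python
import torch

def sample_pdf(bins, weights, num_fine):
    """Draw num_fine depths along one ray from the piecewise-constant PDF
    defined by the coarse weights.
    bins:    (K,)   depths bounding K-1 intervals
    weights: (K-1,) coarse weight w_i of each interval"""
    pdf = weights / (weights.sum() + 1e-10)                      # normalized w_hat_i
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])  # (K,), cdf[0] = 0
    u = torch.rand(num_fine)                                     # uniform samples in [0, 1)
    # Find the interval each u falls into, then place the sample inside it
    # by linear interpolation (inverse transform sampling).
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(bins) - 1)
    below, above = idx - 1, idx
    denom = (cdf[above] - cdf[below]).clamp(min=1e-10)
    frac = (u - cdf[below]) / denom
    return bins[below] + frac * (bins[above] - bins[below])
```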

Training

A separate NeRF is trained for each scene. In addition to a set of RGB images of the scene, the training input includes the corresponding camera poses, camera intrinsics, and scene bounds (which can be estimated with a structure-from-motion package such as COLMAP).
The training loss is the total squared error between the rendered pixel colors and the ground-truth pixel colors:
$$\mathcal L = \sum_{\mathbf r \in \mathcal R} \left[ \left\| \hat C_c(\mathbf r) - C(\mathbf r) \right\|_2^2 + \left\| \hat C_f(\mathbf r) - C(\mathbf r) \right\|_2^2 \right]$$
where $\mathcal R$ is the set of rays in a batch. In each training iteration, a batch of camera rays is randomly sampled from the pixels of the training images.
Although the final rendered image comes from the fine network, the coarse network must also be trained, because it determines where the fine samples are placed.
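A minimal sketch of this loss and one training step, under the assumption that `c_coarse` and `c_fine` come from rendering a random batch of rays with the coarse and fine networks (taking the mean over the batch instead of the sum is a common implementation choice, not the paper's exact reduction):

```python
import torch

def nerf_loss(c_coarse, c_fine, c_gt):
    """Squared-error loss over a batch of rays; both the coarse and the fine
    renderings are supervised with the same ground-truth pixel colors."""
    return (((c_coarse - c_gt) ** 2).sum(-1) + ((c_fine - c_gt) ** 2).sum(-1)).mean()

# One hypothetical training iteration:
# loss = nerf_loss(c_coarse, c_fine, c_gt)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```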

Source: blog.csdn.net/icylling/article/details/129064698