NeRF paper reading notes - Neuralangelo: High-Fidelity Neural Surface Reconstruction


WeChat official account: AI Knowledge Story; Bilibili channel: to be determined; same name on Zhihu

For a video introduction, see:

Bilibili - CVPR 2023 new work! Neuralangelo: High-Fidelity NeRF Surface Reconstruction

https://www.bilibili.com/video/BV1Ju411W7FL/spm_id_from=333.337.searchcard.all.click&vd_source=03387e75fde3d924cb207c0c18ffa567
As shown in Figure 1, this paper proposes Neuralangelo, a framework for high-fidelity 3D surface reconstruction from RGB images using neural volume rendering, even without auxiliary data such as segmentation or depth. The figure shows an extracted 3D mesh of the Courthouse scene.

Abstract

Neural surface reconstruction has been shown to be effective for recovering dense 3D surfaces via image-based neural rendering. However, current methods struggle to recover the detailed structure of real-world scenes. To address this problem, we propose Neuralangelo, which combines the representational power of multi-resolution 3D hash grids with neural surface rendering. Our approach has two key elements:
(1) numerical gradients for computing higher-order derivatives as a smoothing operation;
(2) coarse-to-fine optimization of hash grids that control different levels of detail.
Even without auxiliary inputs such as depth, Neuralangelo can efficiently recover dense 3D surface structures from multi-view images with significantly higher fidelity than previous methods, enabling detailed large-scale scene reconstruction from RGB video captures.

1. Introduction

3D surface reconstruction aims to recover dense geometric scene structure from multiple images observed from different viewpoints [9]. The recovered surfaces provide structural information useful for many downstream applications, such as 3D asset generation for augmented/virtual/mixed reality or environment mapping for autonomous robot navigation. Photogrammetric surface reconstruction using monocular RGB cameras is of particular interest because it enables users to create digital twins of the real world at will using ubiquitous mobile devices.

Classical multi-view stereo algorithms [6, 16, 29, 34] have long been the method of choice for this task. However, an inherent shortcoming of these algorithms is that they cannot handle ambiguous observations, such as regions with large areas of uniform color, repetitive texture patterns, or strong color variations. This results in inaccurate reconstructions with noisy or missing surfaces.

Recently, neural surface reconstruction methods [36, 41, 42] have shown great potential in addressing these limitations. This new class of methods uses coordinate-based multilayer perceptrons (MLPs) to represent the scene as an implicit function, such as an occupancy field [25] or a signed distance function (SDF) [36, 41, 42]. Leveraging the inherent continuity of MLPs together with neural volume rendering [22], these techniques allow the optimized surfaces to interpolate meaningfully between spatial positions, resulting in smooth and complete surface representations.

Despite the superiority of neural surface reconstruction over classical methods, the fidelity recovered by current methods does not scale well with the capacity of MLPs. Recently, Müller et al. [23] proposed a new scalable representation called Instant NGP (Neural Graphics Primitives). Instant NGP introduces a hybrid 3D grid structure with multi-resolution hash encoding and a lightweight MLP, with a memory footprint that is log-linear in resolution. The proposed hybrid representation greatly improves the representational capability of neural fields and has achieved great success in capturing very fine-grained details for various tasks, such as object shape representation and novel view synthesis.

In this paper, we propose Neuralangelo for high-fidelity surface reconstruction (Fig. 1). Neuralangelo adopts Instant NGP as the neural SDF representation of the underlying 3D scene, optimized from multi-view image observations via neural surface rendering [36]. We present two findings that are critical to fully unlocking the potential of multi-resolution hash encoding. First, using numerical gradients to compute higher-order derivatives, such as the surface normals used for eikonal regularization [8, 12, 20, 42], is crucial for stable optimization. Second, a progressive optimization schedule plays an important role in recovering structures at different levels of detail. We combine these two key elements and, through extensive experiments on standard benchmarks and real-world scenes, demonstrate significant improvements over image-based neural surface reconstruction methods in both reconstruction accuracy and view synthesis quality. In summary, we make the following contributions:

• We propose the Neuralangelo framework, which naturally incorporates the representational power of multi-resolution hash encoding [23] into neural SDF representations.

• We propose two simple techniques to improve the quality of hash-encoded surface reconstruction: higher-order derivatives computed with numerical gradients, and coarse-to-fine optimization with progressive levels of detail.

• We empirically demonstrate the effectiveness of Neuralangelo on a variety of datasets, showing significant improvements over previous methods.

2. Related work

Multi-view surface reconstruction.

Early image-based photogrammetry techniques used volumetric occupancy grids to represent scenes [4, 16, 17, 29, 32]. Each voxel is visited and marked as occupied if strict color constancy between corresponding projected image pixels is satisfied. Photometric consistency assumptions often fail due to autoexposure or non-Lambertian materials (which are prevalent in the real world). Relaxing this constraint of color constancy across views is important for realistic 3D reconstructions. Subsequent methods typically start from 3D point clouds from multi-view stereo techniques [6, 7, 28, 34] and then perform dense surface reconstruction [13, 14]. The reliance on the quality of the generated point cloud often results in missing or noisy surfaces. Recent learning-based methods enhance the point cloud generation process by learning image features and cost volume construction [2, 10, 40]. However, these methods are inherently limited by cost volume resolution and cannot recover geometric details.

Neural radiance fields (NeRF).

NeRF [22] achieves remarkable photorealistic view synthesis with view-dependent effects. NeRF encodes 3D scenes using an MLP that maps 3D spatial positions to colors and volume densities. These predictions are composited into pixel colors using neural volume rendering. However, one problem with NeRF and its variants [1, 30, 43, 46] is how to define an isosurface of the volume density that represents the underlying 3D geometry. Current practice often relies on heuristic thresholding of density values; however, due to insufficient constraints on the level set, such surfaces are often noisy and may fail to accurately model the scene structure [36, 41]. Therefore, for photogrammetric surface reconstruction, more direct surface modeling is preferred.

Neural surface reconstruction.

For scene representations with better-defined 3D surfaces, implicit functions such as occupancy grids [24, 25] or SDFs [42] are preferred over simple volume density fields. To integrate with neural volume rendering [22], different techniques [36, 41] have been proposed to reparameterize the underlying representation back to volume density. The design of these neural implicit functions enables more accurate surface prediction without sacrificing view synthesis quality [42]. Subsequent work extended the above methods to real-world scenes at the expense of surface fidelity [18, 37], while others [3, 5, 44] used auxiliary information to improve reconstruction results. Notably, NeuralWarp [3] uses patch warping guided by co-visibility information from structure-from-motion (SfM) to guide surface optimization, but the patch-planarity assumption fails on highly varying surfaces [3]. Other methods [5, 45] use sparse point clouds from SfM to supervise the SDF, but their quality is upper-bounded by that of the point clouds and they perform worse than classical methods [45]. Using monocular depth and segmentation as auxiliary data has also been explored with unconstrained image collections [31] or with hash-encoded scene representations [44]. In contrast, our work Neuralangelo builds on hash encoding [23] to recover surfaces but does not require the auxiliary inputs used in previous work [3, 5, 31, 44, 45]. Concurrent work [38] also proposes coarse-to-fine optimization to improve surface details, where a displacement network corrects the shape predicted by a coarse network. In contrast, we use multi-resolution hash grids and control the level of detail based on our analysis of higher-order derivatives.

3. Method

Neuralangelo reconstructs dense scene structure from multi-view images. It samples 3D positions along camera rays and encodes each position with a multi-resolution hash encoding. The encoded features are fed into an SDF MLP and a color MLP, and images are synthesized via SDF-based volume rendering.

3.1. Preliminaries

Neural volume rendering.

NeRF [22] represents a 3D scene as volumetric density and color fields. Given a posed camera and a ray direction, the volume rendering scheme integrates the colors of points sampled along the ray. The 3D position xi of the i-th sample lies at distance ti from the camera center. A coordinate MLP predicts the volume density σi and color ci of each sample point. The rendered color of the pixel is approximated as a Riemann sum:
ĉ = Σ_{i=1}^{N} wi ci,   where wi = Ti αi    (1)

Here, αi = 1 − exp(−σi δi) is the opacity of the i-th ray segment, δi = t_{i+1} − ti is the distance between adjacent samples, and Ti = Π_{j=1}^{i−1} (1 − αj) is the accumulated transmittance, representing the fraction of light that reaches the camera. To supervise the network, a color loss is applied between the input image c and the rendered image ĉ:

LRGB = ‖ĉ − c‖1    (2)
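To make Eqs. 1 and 2 concrete, here is a minimal NumPy sketch (an illustration, not the authors' code) of the Riemann-sum compositing and the color loss; the sample layout and the L1 form of the loss are assumptions consistent with the text above.

```python
import numpy as np

def composite_ray(sigma, color, t):
    """Approximate the volume rendering integral (Eq. 1) as a Riemann sum.

    sigma: (N,) volume densities of the N samples along one ray
    color: (N, 3) RGB colors of the samples
    t:     (N+1,) distances of the sample-bin boundaries from the camera center
    """
    delta = t[1:] - t[:-1]                    # δi = t_{i+1} − ti
    alpha = 1.0 - np.exp(-sigma * delta)      # αi = 1 − exp(−σi δi)
    # Ti = Π_{j<i} (1 − αj): accumulated transmittance up to sample i
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    w = T * alpha                             # per-sample compositing weights wi
    return (w[:, None] * color).sum(axis=0)   # rendered pixel color ĉ

def color_loss(c_hat, c_gt):
    """L1 color loss between rendered and observed pixel colors (Eq. 2)."""
    return np.abs(c_hat - c_gt).mean()
```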

However, this density formulation does not clearly define where the surface is. Extracting surfaces from density-based representations often leads to noisy, unrealistic results [36, 41].

Volume rendering of SDF. One of the most common surface representations is the SDF. The surface S of an SDF can be represented implicitly by its zero level set, i.e., S = {x ∈ R^3 | f(x) = 0}, where f(x) is the SDF value. In the context of neural SDFs, Wang et al. [36] proposed converting the volume density prediction in NeRF into an SDF representation using a logistic function, allowing optimization by neural volume rendering. Given a 3D point xi with SDF value f(xi), the corresponding opacity αi used in Eq. 1 is computed as
αi = max( (Φs(f(xi)) − Φs(f(x_{i+1}))) / Φs(f(xi)), 0 )    (3)

where Φs is the sigmoid (logistic) function. In this work, we use the same SDF-based volume rendering formulation [36].
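The conversion from SDF values to the opacities used in Eq. 1 can be sketched as follows. This follows the NeuS-style formulation [36] reconstructed in Eq. 3, with the steepness s of the logistic function (a learnable parameter in the original method) fixed to an arbitrary value here.

```python
import numpy as np

def logistic(x, s):
    """Logistic function Φs with steepness s (learnable in [36], fixed here)."""
    return 1.0 / (1.0 + np.exp(-s * x))

def sdf_to_alpha(sdf, s=64.0):
    """Convert SDF values of consecutive ray samples into opacities (Eq. 3).

    sdf: (N,) signed distances f(xi) of the samples along one ray
    Returns (N-1,) opacities for the segments between consecutive samples.
    """
    phi = logistic(sdf, s)
    alpha = (phi[:-1] - phi[1:]) / np.clip(phi[:-1], 1e-6, None)
    return np.clip(alpha, 0.0, 1.0)           # max(·, 0), also bounded above by 1
```

These opacities then take the place of the density-derived αi in the compositing step sketched above.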

Multi-resolution hash encoding.

Recently, the multi-resolution hash encoding proposed by Müller et al. [23] has shown great scalability for neural scene representations, generating fine-grained details for tasks such as novel view synthesis. In Neuralangelo, we exploit the representational power of hash encoding to recover high-fidelity surfaces. The hash encoding uses multi-resolution grids, where each grid cell corner is mapped to a hash entry, and each hash entry stores an encoded feature. Let {V1, ..., VL} be the set of spatial grid resolutions. Given an input position xi, we map it to the corresponding position at each grid resolution Vl, i.e., xi,l = xi · Vl. The feature vector γl(xi,l) ∈ R^c at resolution Vl is obtained by trilinear interpolation of the hash entries at the grid cell corners. The encoded features at all spatial resolutions are concatenated into a feature vector γ(xi) ∈ R^{cL}:

γ(xi) = (γ1(xi,1), ..., γL(xi,L))    (4)

The encoded features are then passed to a shallow MLP. An alternative to hash encoding is sparse voxel structures [30, 33, 39, 43], where each grid corner is uniquely defined without collisions. However, volumetric feature grids require hierarchical spatial decomposition (e.g., octrees) to keep the parameter count tractable; otherwise, memory grows cubically with spatial resolution. With such a hierarchical structure, a finer voxel resolution cannot, by design, recover a surface that was wrongly removed at a coarser resolution [33]. In contrast, hash encoding assumes no spatial hierarchy and resolves collisions automatically via gradient averaging [23].
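Below is a simplified NumPy sketch of multi-resolution hash encoding in the spirit of Instant NGP [23]. It hashes every level with the XOR-of-primes spatial hash described in [23] and interpolates trilinearly; in the actual method the feature tables are trainable parameters and coarse levels whose dense grids fit into the table are indexed directly rather than hashed. The level count, table size, and feature width below are placeholders, not the paper's settings.

```python
import numpy as np

def hash_index(corner, table_size):
    """Spatial hash of an integer grid corner: XOR of per-axis products with large primes [23]."""
    primes = (1, 2654435761, 805459861)
    h = 0
    for c, p in zip(corner, primes):
        h ^= (int(c) * p) & 0xFFFFFFFFFFFFFFFF   # wrap to 64 bits
    return h % table_size

def hash_encode(x, tables, resolutions):
    """Multi-resolution hash encoding γ(x) of a 3D point x in [0, 1]^3 (Eq. 4).

    tables:      list of (table_size, C) feature arrays, one per level
    resolutions: list of grid resolutions Vl, one per level
    Returns the concatenated feature vector of size L * C.
    """
    feats = []
    for table, V in zip(tables, resolutions):
        xl = np.asarray(x, dtype=float) * V          # xi,l = xi · Vl
        lo = np.floor(xl).astype(int)                # ⌊xi,l⌋
        beta = xl - lo                               # trilinear weights β
        f = np.zeros(table.shape[1])
        for dx in (0, 1):                            # accumulate over the 8 cell corners
            for dy in (0, 1):
                for dz in (0, 1):
                    corner = lo + np.array([dx, dy, dz])
                    w = ((beta[0] if dx else 1 - beta[0]) *
                         (beta[1] if dy else 1 - beta[1]) *
                         (beta[2] if dz else 1 - beta[2]))
                    f += w * table[hash_index(corner, len(table))]
        feats.append(f)                              # γl(xi,l)
    return np.concatenate(feats)                     # γ(xi) ∈ R^{cL}

# toy usage: 4 levels from 2^5 to 2^8, 2^14 entries of 2 features each (placeholders)
rng = np.random.default_rng(0)
resolutions = [2 ** r for r in range(5, 9)]
tables = [rng.normal(scale=1e-4, size=(2 ** 14, 2)) for _ in resolutions]
features = hash_encode([0.3, 0.7, 0.5], tables, resolutions)   # shape (8,)
```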

3.2. Numerical gradient calculation

We show in this section that the analytical gradient of the hash encoding with respect to position suffers from locality. As a result, optimization updates only propagate to local hash grids, lacking non-local smoothness. We propose a simple method using numerical gradients to address this locality problem. An overview is shown in Figure 2.
A special property of the SDF is that its gradient has unit norm: the gradient satisfies ||∇f(x)||_2 = 1 almost everywhere. To encourage the optimized neural representation to be a valid SDF, an eikonal loss [8] is typically imposed on the predicted SDF:

Leik = (1/N) Σ_{i=1}^{N} (‖∇f(xi)‖2 − 1)²    (5)

Figure 2. Using numerical gradients for higher-order derivatives distributes back-propagation updates beyond the local hash grid cells, acting as a smoothed version of the analytical gradient.

where N is the total number of sampled points. To allow end-to-end optimization, a double backward operation on the SDF prediction f(x) is required. The de facto way of computing the surface normal ∇f(x) of an SDF is to use the analytical gradient [36, 41, 42].

Analytical gradient of hash encoding. However, the analytical gradient of the hash encoding with respect to position is not continuous across space under trilinear interpolation. To find the sampling location in a voxel grid, each 3D point xi is first scaled by the grid resolution Vl, written as xi,l = xi · Vl. Let the coefficient of (tri)linear interpolation be β = xi,l − ⌊xi,l⌋. The resulting feature vector is
γl(xi,l) = γl(⌊xi,l⌋) · (1 − β) + γl(⌈xi,l⌉) · β    (6)

where the rounded positions ⌊xi,l⌋ and ⌈xi,l⌉ correspond to the local grid cell corners (written in 1D for simplicity). We note that the rounding operations ⌊·⌋ and ⌈·⌉ are non-differentiable. As a result, the derivative of the hash encoding with respect to the position is
∂γl(xi,l)/∂xi = (γl(⌈xi,l⌉) − γl(⌊xi,l⌋)) · Vl    (7)

The derivative of the hash encoding is local: when xi crosses a grid cell boundary, the corresponding hash entries change. Hence, the eikonal loss defined in Eq. 5 only back-propagates to the locally sampled hash entries, namely γl(⌊xi,l⌋) and γl(⌈xi,l⌉). When a continuous surface (such as a flat wall) spans multiple grid cells, these grid cells should produce coherent surface normals without abrupt transitions. To ensure such a consistent surface representation, these grid cells need to be optimized jointly. However, analytical gradients are restricted to local grid cells unless all the relevant grid cells happen to be sampled and optimized simultaneously, which is not always guaranteed.

To overcome the locality of the analytical gradient of hash encoding, we propose using numerical gradients to compute the surface normal. If the step size of the numerical gradient is smaller than the hash-encoded grid size, the numerical gradient is equivalent to the analytical gradient; otherwise, hash entries of multiple grid cells participate in the surface normal computation. Back-propagating through such surface normals thus allows hash entries of multiple cells to receive optimization updates simultaneously. Intuitively, a numerical gradient with a carefully chosen step size can be interpreted as a smoothing operation on the analytical gradient expression. An alternative for normal supervision is a teacher-student loss [35, 47], where predicted noisy normals are driven towards the output of an MLP to exploit the smoothness of MLPs. However, the analytical gradient in such a teacher-student loss still only back-propagates to the local grid cells of the hash encoding. In contrast, numerical gradients address the locality problem without requiring an additional network.

Computing the surface normal numerically requires additional SDF samples. Given a sample point xi = (xi, yi, zi), we additionally sample two points along each axis of the canonical coordinate frame around xi within a step size of ϵ. For example, the x-component of the surface normal is computed as
∇x f(xi) ≈ ( f(γ(xi + ϵx)) − f(γ(xi − ϵx)) ) / (2ϵ)    (8)

where ϵx = [ϵ, 0, 0]. A total of six additional SDF samples are required for numerical surface normal computation.
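A minimal sketch of the numerical surface normal of Eq. 8 and the eikonal loss of Eq. 5, assuming a generic callable sdf(x) that returns the signed distance at a 3D point; this illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

def numerical_normal(sdf, x, eps):
    """Surface normal ∇f(x) by central differences with step size ϵ (Eq. 8).

    sdf: callable mapping a 3D point to its signed distance f(x)
    x:   (3,) query point
    eps: finite-difference step; Neuralangelo shrinks it from the coarsest
         hash grid size down to finer grid sizes during optimization
    Uses the 6 extra SDF samples (two per axis) mentioned in the text.
    """
    grad = np.zeros(3)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = eps
        grad[axis] = (sdf(x + offset) - sdf(x - offset)) / (2.0 * eps)
    return grad

def eikonal_loss(sdf, points, eps):
    """Leik: mean of (||∇f(xi)||_2 − 1)^2 over the sampled points (Eq. 5)."""
    norms = [np.linalg.norm(numerical_normal(sdf, p, eps)) for p in points]
    return float(np.mean((np.array(norms) - 1.0) ** 2))

# sanity check on the exact SDF of a unit sphere, whose gradient has unit norm
sphere_sdf = lambda p: np.linalg.norm(p) - 1.0
pts = np.random.default_rng(0).normal(size=(16, 3))
print(eikonal_loss(sphere_sdf, pts, eps=1e-3))   # close to 0
```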

3.3. Progressive levels of detail

Coarse-to-fine optimization can better shape the loss landscape and avoid getting stuck in spurious local minima. This strategy has found many applications in computer vision, such as image-based registration [19, 21, 26]. Neuralangelo also adopts a coarse-to-fine optimization scheme to reconstruct surfaces with progressive levels of detail. Using numerical gradients for higher-order derivatives naturally enables Neuralangelo to perform coarse-to-fine optimization from two perspectives.

Step size ϵ. As discussed above, the numerical gradient can be interpreted as a smoothing operation, where the step size ϵ controls the resolution and the amount of recovered detail. Imposing Leik with a larger ϵ in the numerical surface normal computation ensures that surface normals are consistent over larger regions, yielding consistent and continuous surfaces. Conversely, applying Leik with a smaller ϵ affects smaller regions and avoids smoothing away details. In practice, we initialize ϵ to the coarsest hash grid size and shrink it exponentially over the course of optimization, matching the different hash grid sizes.

Hash grid resolution V. If all hash grids were activated from the beginning of the optimization, then in order to capture geometric details, the fine hash grids would first have to "unlearn" what was learned from the coarse optimization with a large step size ϵ and "relearn" it with a smaller ϵ. If this process is unsuccessful because the optimization has already converged, geometric details are lost. Therefore, we only enable an initial set of coarse hash grids and progressively activate finer ones over the course of optimization, as ϵ decreases to their spatial sizes. The relearning process can thus be avoided and details are better captured. In practice, we also apply weight decay on all parameters to avoid single-resolution features dominating the final result.
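The following is a minimal sketch of one possible coarse-to-fine schedule consistent with the description above: ϵ starts at the coarsest grid cell size and decays exponentially, and a finer hash level is activated once ϵ has shrunk to that level's cell size. The decay factor, iteration count, and geometric level spacing are placeholder assumptions, not the paper's exact schedule.

```python
def coarse_to_fine_schedule(resolutions, n_active_init, decay_per_step, n_steps):
    """Yield (step, eps, n_active_levels) for a simple coarse-to-fine schedule.

    resolutions:    grid resolutions Vl from coarse to fine (cell size = 1 / Vl)
    n_active_init:  number of hash levels enabled at the start of optimization
    decay_per_step: multiplicative decay applied to the numerical-gradient step ϵ
    """
    eps = 1.0 / resolutions[0]        # initialize ϵ to the coarsest grid cell size
    n_active = n_active_init
    for step in range(n_steps):
        # activate the next finer level once ϵ has shrunk to that level's cell size
        while n_active < len(resolutions) and eps <= 1.0 / resolutions[n_active]:
            n_active += 1
        yield step, eps, n_active
        eps *= decay_per_step

# toy usage: 16 levels spanning 2^5 ... 2^11 with geometric spacing, 4 active at start
resolutions = [int(round(32 * (2048 / 32) ** (l / 15))) for l in range(16)]
for step, eps, n_active in coarse_to_fine_schedule(resolutions, 4, 0.9995, 20001):
    if step % 5000 == 0:
        print(step, round(eps, 5), n_active)
```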

Figure 3. Qualitative comparison on the DTU benchmark [11]. Neuralangelo produces more accurate, higher-fidelity surfaces.

3.4. Optimization

To further encourage the smoothness of the reconstructed surfaces, we impose a prior by regularizing the mean curvature of the SDF. The mean curvature is computed from a discrete Laplacian, similar to the surface normal computation; otherwise, the second-order analytical gradients of the hash encoding are zero everywhere when trilinear interpolation is used. The curvature loss Lcurv is defined as:

Lcurv = (1/N) Σ_{i=1}^{N} |∇²f(xi)|    (9)

We note that the samples used for the surface normal computation in Eq. 8 are sufficient for the curvature computation.
The total loss is defined as a weighted sum of the losses:
L = LRGB + weik Leik + wcurv Lcurv    (10)

All network parameters, including the MLPs and the hash encoding features, are trained jointly end-to-end.
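As a sketch of the optimization objective, the snippet below approximates the mean curvature with a discrete Laplacian built from the same six axis-aligned SDF samples used for the numerical normals, and combines the three losses into the weighted sum of Eq. 10; the specific weights are placeholders, not the paper's values.

```python
import numpy as np

def discrete_laplacian(sdf, x, eps):
    """Discrete Laplacian of the SDF at x, reusing the 6 axis samples of Eq. 8."""
    lap, fx = 0.0, sdf(x)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = eps
        lap += (sdf(x + offset) - 2.0 * fx + sdf(x - offset)) / (eps ** 2)
    return lap

def curvature_loss(sdf, points, eps):
    """Lcurv: mean absolute discrete mean curvature over the sampled points (Eq. 9)."""
    return float(np.mean([abs(discrete_laplacian(sdf, p, eps)) for p in points]))

def total_loss(l_rgb, l_eik, l_curv, w_eik=0.1, w_curv=5e-4):
    """Weighted sum of color, eikonal, and curvature losses (Eq. 10); weights are placeholders."""
    return l_rgb + w_eik * l_eik + w_curv * l_curv
```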

4. Experiments

Datasets
Following previous work, we conduct experiments on 15 object-centric scenes of the DTU dataset [11]. Each scene has 49 or 64 images captured by a monocular RGB camera held by the robot. The ground truth is obtained from structured light scanners. We further conduct experiments on 6 scenes of the Tanks and Temples dataset [15], including large indoor/outdoor scenes. Each scene contains between 263 and 1107 images captured with a handheld monocular RGB camera. The ground truth is obtained using a LiDAR sensor.

Implementation details

Our hash encoding resolutions span 2^5 to 2^11, with 16 levels in total. Each hash entry has a channel size of 8. The maximum number of hash entries per resolution is 2^22. Due to differences in scene scale, we activate 4 and 8 hash resolutions at the beginning of optimization for the DTU dataset and Tanks and Temples, respectively. We enable a new hash resolution every 5000 iterations, when the step size ϵ equals its grid cell size. For all experiments, we do not use auxiliary data such as segmentation or depth during optimization.
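To collect the hyperparameters stated above in one place, here is a hedged configuration sketch (plain Python; the field names are illustrative and do not correspond to the authors' configuration files):

```python
# Settings as stated in the text; everything else about training is omitted here.
neuralangelo_config = {
    "hash_levels": 16,                  # 16 resolution levels
    "coarsest_resolution": 2 ** 5,      # 32
    "finest_resolution": 2 ** 11,       # 2048
    "features_per_entry": 8,            # channel size of each hash entry
    "max_entries_per_level": 2 ** 22,   # hash table size per resolution
    "active_levels_at_start": {"DTU": 4, "TanksAndTemples": 8},
    "new_level_every_iters": 5000,      # enable a finer level every 5000 iterations
    "auxiliary_inputs": None,           # no segmentation or depth supervision
}
```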

Evaluation criteria

We report Chamfer distance and F1 score for surface evaluation [11, 15], and peak signal-to-noise ratio (PSNR) for image synthesis quality.

4.1. DTU Benchmark

We show qualitative results in Fig. 3 and quantitative results in Table 1. On average, Neuralangelo achieves the lowest Chamfer distance and the highest PSNR, even without using auxiliary inputs. The results show that Neuralangelo is more broadly applicable than previous work for recovering surfaces and synthesizing images, although it does not perform best in every individual scene.

We further ablate Neuralangelo under the following conditions:

1) AG: analytical gradients;

2) AG+P: analytical gradients with progressive activation of hash resolutions;

3) NG: numerical gradients with a varying ϵ. Figure 4 shows the results qualitatively. AG produces noisy surfaces, even with progressive activation of the hash resolutions (AG+P). NG improves surface smoothness at the expense of detail. Our full setting (NG+P) produces both smooth surfaces and fine detail.

Figure 4. Qualitative comparison of different coarse-to-fine optimization schemes. Surfaces contain artifacts when analytical gradients are used (AG and AG+P). With numerical gradients (NG), coarse shapes are obtained but details are smoothed out. Our scheme (NG+P) produces smooth surfaces and fine details.

Table 1. Quantitative results on the DTU dataset [11]. Neuralangelo achieves the best reconstruction accuracy and image synthesis quality. † requires 3D points from SfM.

4.2. Tanks and Temples

Since Tanks and Temples have no public results, we train NeuS [36] and NeuralWarp [3] following our setup. We also report classical multi-view stereo results using COLMAP [27]. Since COLMAP and NeuralWarp do not support view synthesis, we only report PSNR for NeuS. The results are summarized in Figure 5 and Table 2. Neuralangelo achieves the highest average PSNR and performs best in terms of F1 scores. Compared to NeuS [36], we can recover high-fidelity surfaces with complex details. We found that the dense surfaces generated by COLMAP are sensitive to outliers in sparse point clouds.

We also find that NeuralWarp often predicts surfaces for the sky and the background, possibly due to its color rendering scheme following VolSDF [41]. Extra surfaces predicted for the background are treated as outliers and can significantly worsen the F1 score. Instead, we follow NeuS [36] and use an additional network [46] to model the background. Similar to the DTU results, using analytical gradients produces noisy surfaces and leads to lower F1 scores. We further note that the Courthouse reconstructions shown in Figs. 1 and 5 are different sides of the same building, demonstrating Neuralangelo's ability to reconstruct large-scale scenes with fine-grained detail.

4.3. Level of detail

As Neuralangelo optimizes the hash encoding progressively, we inspect the progressive levels of detail, similar to NGLOD [33], by increasing the hash resolution. We show qualitative visualizations in Figure 6. Although some surfaces, such as trees, tables, and bike racks, are completely missed at the coarse levels, these structures are successfully recovered at finer resolutions. The ability to recover missing surfaces demonstrates the advantage of hash encoding over designs with a fixed spatial hierarchy [33].

We also notice that flat surfaces are only predicted well at a sufficiently high resolution (around level 8 in this example). Therefore, relying only on the continuity of coarse-resolution local cells is not sufficient to reconstruct large continuous surfaces. This result motivates the use of numerical gradients for higher-order derivatives, which enables back-propagation beyond local grid cells.

Figure 6. Results at different hash resolutions. Some structures, such as trees, tables, and bike racks, are missed at coarse resolutions (level 4), but finer resolutions gradually recover these missing surfaces. Flat continuous surfaces also require a sufficiently fine resolution to be predicted well (level 8). These results motivate non-local updates when using numerical gradients for higher-order derivatives.

Table 2. Quantitative results on the Tanks and Temples dataset [15]. Neuralangelo achieves the best surface reconstruction quality and the best image synthesis quality on average.


Figure 7. Ablation results. (a) Curvature regularization Lcurv improves surface smoothness. (b) Concave shapes form better with topology warm-up.

4.4. Ablations

Curvature regularization

We examine the necessity of curvature regularization in Neuralangelo by removing it and comparing the results in Fig. 7(a). Intuitively, Lcurv acts as a smoothness prior by minimizing surface curvature. Without Lcurv, we find that surfaces tend to exhibit undesirable sharp transitions; the surface noise is removed when Lcurv is used.

Topology warm-up

We follow previous work and initialize the SDF to approximately a sphere [42]. Starting from such a spherical initialization, using Lcurv also makes it difficult for concave shapes to form, because Lcurv implicitly preserves topology by preventing singularities in curvature. Therefore, instead of applying Lcurv from the very beginning of optimization, we use a short warm-up period that linearly increases the curvature loss strength. We find this strategy particularly helpful for concave regions, as shown in Figure 7(b).

5. Conclusion

We introduce Neuralangelo, a method for photogrammetric neural surface reconstruction. The findings of Neuralangelo are simple yet effective: using numerical gradients for higher-order derivatives, and a coarse-to-fine optimization strategy. Neuralangelo unlocks the representational power of multi-resolution hash encoding for neural surface reconstruction modeled as an SDF. We demonstrate that Neuralangelo efficiently recovers dense 3D scene structure from both object-centric captures and large-scale indoor/outdoor scenes with extremely high fidelity, enabling detailed large-scale scene reconstruction from RGB videos. Our method currently samples pixels from images randomly without tracking their statistics and errors, so we use long training iterations to reduce the stochasticity and ensure sufficient sampling of details. Exploring more efficient sampling strategies to speed up training is left as future work.


Origin blog.csdn.net/qq_40514113/article/details/131666889