[Paper Reading] NEF: Neural Edge Fields for Reconstructing 3D Parametric Curves from Multiple Views


Paper | Code

Summary

We study the problem of reconstructing the 3D characteristic curves of an object from a set of calibrated multi-view images. To this end, we learn a neural implicit field representing the density distribution of 3D edges, which we refer to as Neural Edge Field (NEF). Inspired by NeRF [22], NEF is optimized with a view-based rendering loss, where a 2D edge map is rendered on a given view and compared with the ground-truth edge map extracted from the image for that view. NEF's rendering-based differentiable optimization leverages 2D edge detection without the need for supervised 3D edges, 3D geometric operators, or cross-view edge correspondence. Several techniques are designed to ensure the learning of a range-limited and view-independent NEF for robust edge extraction. The final parametric 3D curves are extracted from the NEF using an iterative optimization method. On our benchmark built from synthetic data, we demonstrate that NEF outperforms existing state-of-the-art methods on all metrics.

Figure 1. We exploit 2D edge detection to directly obtain 3D edge points by learning a neural implicit edge field, and further reconstruct 3D parametric curves representing the object geometry. We present the details of extracting 3D edges from the proposed neural edge field in Section 3.1, and the coarse-to-fine optimization strategy for reconstructing parametric curves in Section 3.2. The entire pipeline is self-supervised, with only 2D supervision.

1. Introduction

Feature curves “define” 3D shapes to some extent, not only geometrically (surface reconstruction from curve networks [17, 18]), but also perceptually (feature-curve-based shape perception [4, 38]). Therefore, the extraction of characteristic curves has been a long-standing problem in graphics and vision. Traditional 3D curve extraction methods usually work directly on 3D shapes, e.g., polygonal meshes or point clouds. This approach faces a major difficulty: sharp edges may be partially corrupted or completely lost due to imperfect 3D acquisition and/or reconstruction. Therefore, geometry-based methods, even the state-of-the-art ones, are sensitive to parameter settings and tend to fail near rounded edges, noise, and sparse data. Recently, learning-based methods have been proposed to address these issues, but they have limited generalizability [20, 21, 36, 42].

In many cases, edges are visually prominent and easily detectable in 2D images of 3D shapes. To address occlusions, one might think of reconstructing 3D curves from multi-view edges. However, this solution strongly relies on cross-view edge correspondences, which is a very difficult problem in itself [31]. This explains why there is little work on multi-view curve reconstruction even in the era of deep learning. We ask the question: can we learn 3D feature curve extraction directly from the input of multi-view images?

In this work, we attempt to answer this question by learning a neural implicit field representing the 3D edge density distribution from a set of calibrated multi-view images, inspired by the recent success of Neural Radiance Fields (NeRF) [22]. We refer to this edge density field as the Neural Edge Field (NEF). Similar to NeRF, NEF is optimized with a view-based rendering loss, where a 2D edge map is rendered on a given view and compared with the ground-truth edge map extracted from the image for that view. Volume rendering is based on the edge density and color (grayscale) predicted by the MLP along each viewing ray. However, unlike NeRF, our goal is only to optimize NEF for the later extraction of parametric 3D curves; no novel view synthesis is involved. The rendering-based differentiable optimization of 3D edge density takes full advantage of 2D edge detection without the need for 3D geometric operators or cross-view edge correspondence; the latter is implicitly learned via multi-view consistency.

Optimizing NEF with NeRF-like densities directly is problematic, because the density range can be arbitrarily large and varies between scenes, making it difficult to choose an appropriate threshold to extract useful geometry (e.g., 3D surfaces for NeRF and 3D edges for NEF). Also, due to noise, the NeRF density often does not approximate the underlying 3D shape well. Therefore, we constrain the edge density to the range [0, 1] and map it to the actual NEF density through a mapping function with a learnable scaling factor. By doing this, we can easily choose a threshold to robustly extract edges from the optimized edge density.

Another problem with NEF optimization is the incompatibility between the edge density field and the visibility of edges detected in the images. While the former is essentially a wireframe representation of the underlying 3D shape, in which all edges are visible from any view (i.e., no self-occlusion), edges in 2D images can be occluded by the object itself. This leads to inconsistent supervision across views with different visibilities and can cause missed detections: an edge that should be present in the NEF according to a view where it is visible may be suppressed by other views where it is occluded. To solve this problem, we choose to 1) enforce consistency between density and color in NEF, and 2) give a smaller penalty to non-edge pixels in the rendering loss, allowing NEF to retain all edges. This essentially makes NEF view-independent, which is the desired behavior.

After obtaining the edge density, we fit parametric curves by treating the 3D density volume as a cloud of edge points. We refine the control points of the curves in a coarse-to-fine fashion. Since initialization is important for this non-convex optimization, we first greedily fit straight lines to cover the majority of points. Based on this initialization, we then upgrade the lines to cubic Bezier curves by adding additional control points, and optimize all curves simultaneously with an additional endpoint regularization.

We build a benchmark using a synthetic dataset consisting of 115 CAD models with complex shape structures from the ABC dataset [16] and leverage BlenderProc [7] to render posed images. Extensive experiments on the proposed dataset show that NEF, trained with only 2D supervision, outperforms existing state-of-the-art methods on all metrics. Our contributions include:

  1. Self-supervised 3D edge detection from multi-view 2D edges, based on neural implicit field optimization.
  2. Several techniques designed to ensure the learning of a range-limited and view-independent NEF, and an iterative optimization strategy to reconstruct the parametric curves.
  3. A benchmark for evaluating and comparing various edge/curve extraction methods.

2. Related work

Neural Radiance Fields
NeRF [22] has demonstrated a remarkable ability for novel view synthesis. The basic idea of NeRF is to represent the geometry and appearance of a scene as a radiance field, allowing the color and volume density at continuous spatial positions and viewing directions to be queried for rendering. Many extensions have been built on the NeRF backbone, such as accelerated training [28, 33] and inference [3, 14, 41], editing [19, 34, 43], generative models [23, 30], and model reconstruction [24, 35, 40]; more are discussed in [5, 10]. However, there are few works using NeRFs to extract 3D skeletons/curves. We propose the Neural Edge Field (NEF) to reconstruct 3D edges from 2D images. The NeRF-based works closest to ours target model reconstruction [35, 40]: both they and we recover accurate geometry by redefining the original density as a transformed new representation. The difference is that they represent surfaces with the zero-level set of signed distance functions (SDFs) and focus on surface reconstruction, whereas we introduce an edge density by learning NEF to represent the edge probability at each spatial location.

3D Parametric Curve Reconstruction
3D parametric curve reconstruction builds on edge detection in point clouds. Traditional (non-learning) methods rely on multi-view images [26] or local geometric properties of point clouds, such as normals [6, 37], curvature [39], and hierarchical clustering [9]. Recent data-driven approaches often cast edge detection as binary classification of point clouds: for each candidate edge point, its neighborhood attributes are used as learned features. With the advancement of network architectures, classifiers for edge detection range from Random Forests [11, 12] and PointNet++ [27]-based point multilayer perceptrons (MLPs) [36, 42] to Capsule Networks [2].

Representing point cloud edges as parametric curves is more challenging. PIE-NET [36] learns to detect edges and corners from point clouds, uses the network to generate parametric curve candidates, and finally suppresses invalid candidates. PC2WF [20] consists of a sequence of feed-forward blocks that sample point clouds into patches and classify whether a patch contains a corner; it then regresses the corner positions and connects the corners into a parametric wireframe. DEF [21] estimates a truncated distance-to-feature field for each input point cloud from an additional set of depth images in a patch-based manner, and fits curves after corner detection and clustering. Unlike these works, which require at least point clouds as input and training on labeled datasets, our method is self-supervised by 2D edges.

3. Method

To obtain 3D parametric curves from multi-view images, our method consists of two steps: constructing the neural edge field (NEF) and reconstructing parametric curves. As shown in Fig. 1, 2D edge maps are predicted by PiDiNet [32], a state-of-the-art edge detection network, and NEF is constructed from these multi-view edge maps. As introduced in Section 3.1, directly adopting NeRF on edge maps is problematic because edge maps differ from the original images in many ways: due to occlusion, edges are sparse and inconsistent between views. To cope with these issues, we introduce several training losses specifically designed for NEF. To reconstruct parametric curves from 3D edge points, we introduce a two-stage coarse-to-fine optimization in Section 3.2. In the coarse stage, we simplify the curves to straight lines and fit a set of lines to the 3D edge points with a fit-and-drop strategy. In the refinement stage, we upgrade the straight lines to cubic Bezier curves by adding additional control points.

3.1 Reconstruction of 3D edge points

In this section, we learn a neural implicit field representing the spatial distribution of 3D edges, called the Neural Edge Field (NEF). We first introduce preliminary knowledge about NeRF in Section 3.1.1. The design of NEF is introduced in Section 3.1.2. Section 3.1.3 introduces the loss designs required to train NEF.

3.1.1 Preliminaries

NeRF [22] represents a continuous scene with an MLP network that maps 5D coordinates along camera rays (a position (x, y, z) and viewing direction (θ, φ)) to colors (r, g, b) and volume density σ. After training, novel views can be rendered from arbitrary camera poses via volume rendering. Given the camera origin o, ray direction d, and near and far bounds $t_n$ and $t_f$, the predicted pixel color $\hat{C}$ for a camera ray r(t) = o + td is defined as follows:

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,c(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$

where $T(t)$ is the accumulated transmittance along the ray, and the density σ and color c are the predictions of the MLP network. NeRF's loss function is the re-rendering loss, defined by the mean squared error (MSE) between the rendered color $\hat{C}$ and the ground-truth pixel color $C$:
$$\mathcal{L}_{\text{MSE}} = \sum_{\mathbf{r} \in R_i} \left\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\|_2^2$$
where $R_i$ is the set of rays in each training batch. In our scenario, we adopt the structure of NeRF as the backbone, but replace the color c = (r, g, b) with a one-dimensional gray value c = (gray) that represents the edge intensity.
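For readers who want to connect the formulas above to an implementation, below is a minimal sketch of the discretized volume rendering (the quadrature form of the rendering integral above) with a single-channel gray value, written in PyTorch. The function name and tensor layout are illustrative, not taken from the authors' code.

```python
import torch

def render_rays(sigma, gray, t_vals):
    """Discretized volume rendering for a batch of rays.

    sigma:  (R, N) NEF densities predicted at N samples along each of R rays
    gray:   (R, N) one-channel edge intensity predicted at the same samples
    t_vals: (R, N) sample depths along each ray, sorted from near to far
    Returns the rendered pixel intensities C_hat of shape (R,).
    """
    # Distance between adjacent samples; pad the last interval with a large value.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # Per-interval opacity and accumulated transmittance.
    alpha = 1.0 - torch.exp(-sigma * deltas)                               # (R, N)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                                    # (R, N)

    weights = trans * alpha
    c_hat = (weights * gray).sum(dim=-1)
    return c_hat

# The re-rendering MSE loss is then simply ((c_hat - c_gt) ** 2).sum() over the batch of rays.
```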

3.1.2 Neural edge field

We introduce the Neural Edge Field (NEF), trained from 2D edge maps to represent the edge probability at each spatial location. While NeRFs synthesize realistic novel-view images through differentiable volume rendering, their volume densities do not closely approximate the actual 3D shape. A similar situation exists for NEF: the NEF density does not approximate the actual 3D edges, and its range and scale also vary from scene to scene, making 3D edges difficult to extract from it. Recently, NeuS [35] and VolSDF [40] represent object surfaces via signed distance functions (SDFs) and map the SDF to NeRF volume densities via distribution functions. Likewise, we introduce an intermediate density field, called the edge density, before the NEF density, as shown in Figure 3. With proper supervision/constraints on the edge density during training, it is expected to approximate the 3D edges well. The edge density describes the edge probability at each location. It lies in the range [0, 1], consistent with the 2D edges (1 means an edge, 0 means no edge). Through a mapping function, we convert the edge density to the NEF density for volume rendering. Let x ∈ ℝ³ be a position in the space occupied by the object and E(x) the edge density at x; the NEF density σ is computed as follows:

(Equation 3: the mapping from the edge density $E(\mathbf{x})$ to the NEF density $\sigma$, parameterized by $\alpha$, $\beta$, and $g$.)
where α is a trainable parameter controlling the density scale, β is the mean controlling the position of the function, and g adjusts the distribution around [0, 1]. The edge density is expected to adaptively match the distribution of NEF density, and should also be easily binarized by a uniform threshold. Therefore, to ensure the correct mapping from edge density to NEF density, we set β = 0.8 and g = 10 in all experiments, and α is a trainable parameter. As shown in Figure 2, the value of α varies from scene to scene. The edge density is obtained by adding an extra sigmoid layer after the original NeRF MLP. We add another MLP of 4 hidden layers with size 256 to predict gray value c from edge density and view orientation. The network architecture is shown in Figure 3.
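Since the exact form of Equation 3 is not reproduced in this post, the sketch below assumes a scaled logistic (sigmoid-shaped) mapping, which is one way to realize the described behavior: a trainable scale α, a position parameter β = 0.8, and a sharpness g = 10 that keeps the useful transition inside [0, 1]. Treat it as an illustrative assumption rather than the paper's formula.

```python
import torch
import torch.nn as nn

class EdgeToNEFDensity(nn.Module):
    """Maps edge density E(x) in [0, 1] to a NEF volume density sigma.

    Assumed form: sigma = alpha * sigmoid(g * (E - beta)), with alpha
    trainable (parameterized through its log so it stays positive) and
    beta = 0.8, g = 10 fixed, following the hyper-parameters in the text.
    """

    def __init__(self, beta: float = 0.8, g: float = 10.0):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.tensor(0.0))  # alpha = exp(log_alpha)
        self.beta = beta
        self.g = g

    def forward(self, edge_density: torch.Tensor) -> torch.Tensor:
        alpha = self.log_alpha.exp()
        return alpha * torch.sigmoid(self.g * (edge_density - self.beta))
```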


(a) Curves mapping edge density to volume density for different α. (b) Plot showing five randomly selected samples of α (×10⁻⁴) over training iterations.

Figure 2. Example of the mapping and the trend of α in Equation 3. As shown in (a), the edge density ranges from 0 to 1 and can be adaptively mapped to the NEF density. Adaptation to different scenes is controlled by the trainable parameter α, shown in (b).

Figure 3. The 3D position (x, y, z) and viewing direction (θ, φ) are fed into the network after positional encoding (PE). The NEF density σ is derived from the edge density through a mapping with a learnable scale.

3.1.3 Training NEFs

Training NEF like a NeRF is problematic. First, 3D edges resemble the 3D skeleton of an object; in volume rendering, valid samples are sparse along rays, making the network prone to getting trapped in local optima. Second, the 2D edge maps do not match the actual 3D wireframe due to occlusion: edges may not be visible in all views, causing inconsistencies between views. We introduce a weighted mean squared error loss (W-MSE) and a consistency loss to address these two issues. Furthermore, to encourage the sparsity of points in NEF, we also introduce a sparsity loss.

With three loss designs, we are able to stably train NEF supervised by 2D images. The final loss function is expressed as:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{W-MSE}} + \lambda_2 \mathcal{L}_{\text{consistency}} + \lambda_3 \mathcal{L}_{\text{sparsity}}$$
where the balancing parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ are set to 1, 1, and 0.01 in all experiments, respectively. Once the NEF is trained, we use a fixed threshold of 0.7 to binarize the edge density and extract 3D edge points from the NEF.
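As a concrete (hypothetical) illustration of this extraction step, one can query the trained edge-density network on a dense grid and keep every point whose edge density exceeds 0.7; `nef_edge_density` below stands in for the trained MLP (with positional encoding) and is not the authors' API.

```python
import torch

@torch.no_grad()
def extract_edge_points(nef_edge_density, resolution=256, threshold=0.7,
                        bounds=(-1.0, 1.0), chunk=65536):
    """Sample edge density on a dense grid and keep points above threshold.

    nef_edge_density: callable mapping (M, 3) positions to (M,) densities in [0, 1]
    Returns a (K, 3) tensor of 3D edge points.
    """
    axis = torch.linspace(bounds[0], bounds[1], resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    pts = grid.reshape(-1, 3)

    kept = []
    for i in range(0, pts.shape[0], chunk):      # chunked queries to bound memory
        p = pts[i:i + chunk]
        kept.append(p[nef_edge_density(p) > threshold])
    return torch.cat(kept, dim=0)
```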

W-MSE Loss
We obtain 2D edge maps from the lightweight edge detector PiDiNet [32]. On edge maps, edge pixels appear white and are usually sparsely distributed. Therefore, when training NEF, edge and non-edge pixels are highly unbalanced, resulting in very sparse valid samples along rays. In this case, the network easily degenerates to a local optimum; the most common degenerate case is predicting all densities and colors as zero, so the rendered image is all black. Therefore, we modify the original color loss by adding an adaptive weight W(r) in each batch. The weighted mean squared error loss (W-MSE) is defined as:
$$\mathcal{L}_{\text{W-MSE}} = \frac{1}{|R_i|} \sum_{\mathbf{r} \in R_i} W(\mathbf{r})\, \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2$$

(The adaptive weight $W(\mathbf{r})$ is defined in the paper from the per-batch counts $C^+$ and $C^-$ of edge and non-edge pixels.)
where $C^+$ and $C^-$ denote the numbers of edge and non-edge pixels in each batch, determined by a threshold η. We set η to 0.3 throughout the paper. The adaptive weights are simple yet effective: by forcing the network to pay more attention to edge pixels/rays, they avoid the degenerate solution.
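The exact form of W(r) is defined in the paper; the sketch below only assumes a common re-balancing choice, weighting edge rays by the non-edge/edge pixel ratio of the batch, with η = 0.3 deciding which ground-truth pixels count as edges. It is a plausible reading of the text, not the verified formula.

```python
import torch

def w_mse_loss(c_pred, c_gt, eta=0.3):
    """Weighted MSE between rendered and ground-truth edge intensities.

    c_pred, c_gt: (R,) values in [0, 1] for one batch of rays.
    Pixels with c_gt > eta are treated as edge pixels; edge rays are
    up-weighted by the non-edge/edge ratio (assumed form of W(r)).
    """
    is_edge = c_gt > eta
    n_pos = is_edge.sum().clamp(min=1).float()      # C+
    n_neg = (~is_edge).sum().clamp(min=1).float()   # C-

    weights = torch.where(is_edge, n_neg / n_pos, torch.ones_like(c_gt))
    return (weights * (c_pred - c_gt) ** 2).mean()
```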

Consistency Loss
The edge map for each view does not match the real 3D wireframe. On a 2D edge map, not all edges are visible due to occlusion. This means that the "ground truths" are not quite correct, and those invisible edges are missing in every view.

To successfully reconstruct 3D edge points from these 2D edge maps, we should recover the full edge map for each view, in which occluded edges become visible, by integrating information from other views, as shown in Figure 4. For each view, the occluded edges are invisible in the image as well as in the edge map. Because of this inconsistency between views, NEF gets confused during training: in each view there are many missed pixels that are labeled non-edge in the "ground truth". This inconsistency introduces noise into the NEF at spatial locations around the object surface. For these occluded edges, the edge density value stays close to 1, but the predicted color drops close to 0 to fit those missed samples. Therefore, we enforce the values of edge density and color intensity (both in the range [0, 1]) to be consistent for all samples along a ray, which reduces missed pixels. The consistency loss is also computed as a mean squared error, defined as:

$$\mathcal{L}_{\text{consistency}} = \frac{1}{|R_i|} \sum_{\mathbf{r} \in R_i} \frac{1}{N} \sum_{j=1}^{N} \big( E(\mathbf{x}_j) - c(\mathbf{x}_j) \big)^2$$
The W-MSE loss in Equation 5 encourages NEF to pay more attention to edge pixels and penalizes non-edge pixels less. Combining the W-MSE loss and the consistency loss therefore not only stabilizes the training process, but also encourages NEF to recover occluded edges by learning from other views. After training, the edge map re-rendered by NEF successfully recovers those invisible edges, as shown in Figure 4. As a result, adopting different 2D edge detectors has limited impact on NEF reconstruction: NEF automatically corrects missing 2D edges thanks to the consistency loss, regardless of whether they are occluded or missed by the detector (see Section B.2).
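A minimal sketch of such a consistency term, assuming the per-sample edge density and gray value are simply pulled toward each other with an MSE (the paper's exact normalization may differ):

```python
import torch

def consistency_loss(edge_density, gray):
    """Encourage edge density and edge intensity to agree along rays.

    edge_density, gray: (R, N) per-sample predictions for R rays and N
    samples per ray, both in [0, 1]. High-density samples on occluded
    edges then keep a high intensity instead of being dimmed to fit a
    single view's edge map.
    """
    return ((edge_density - gray) ** 2).mean()
```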
insert image description here
Figure 4. Green dots represent edges that can be seen and detected in the given view. Yellow dots represent edges that are visible in this view but hard to detect (e.g., poorly lit). Red dots indicate edges that are completely occluded in this view but visible in other views. Our method integrates edges seen from multiple views and can re-render all edges from this view.

Sparsity Loss
As mentioned earlier, edges are sparse in both 2D and 3D space. To encourage spatial sparsity as well as to speed up convergence, we add an additional regularizer, the sparsity loss, which penalizes unnecessary edge density along the rays of non-edge pixels during training. We adopt the Cauchy loss [1] as the sparsity regularizer, which is robust to outliers. The sparsity loss is defined as:
$$\mathcal{L}_{\text{sparsity}} = \frac{1}{N} \sum_{i} \sum_{j} \log\!\left( 1 + \frac{E(\mathbf{x}_{ij})^2}{s} \right)$$

where i indexes the non-edge pixels of the input edge map, j indexes the samples along the corresponding ray, and s controls the scale of the regularizer. We fix s = 0.5.
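Below is a sketch of a Cauchy-style sparsity regularizer matching this description; the exact normalization and the way s enters the formula are assumptions.

```python
import torch

def sparsity_loss(edge_density_non_edge, s=0.5):
    """Cauchy regularizer pushing edge density toward zero.

    edge_density_non_edge: (R_neg, N) edge densities sampled along rays
    whose ground-truth pixel is non-edge. s controls the scale of the
    robust loss; small densities incur an almost-quadratic penalty while
    large outliers are damped by the log.
    """
    return torch.log(1.0 + edge_density_non_edge ** 2 / s).mean()
```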

3.2 Extracting 3D parametric curves

From the 3D edge point cloud of NEF, we further extract parametric curves. We extract Bezier curves from 3D edges in a coarse-to-fine manner. The optimization objective is introduced in Section 3.2.1. The coarse-to-fine process is introduced in Section 3.2.2.

3.2.1 Bezier curve optimization

Figure 5. We iteratively refine lines to fit the 3D edge points one by one, following a fit-and-drop strategy. This process continues until very few points remain. Fitted lines are displayed in different colors.

We adopt cubic Bezier curves to represent the geometry of 3D edges. For each curve, we optimize the positions of four control points: the first and last control points define the start and end positions, while the other two control points determine the curvature. A straight line can be regarded as a linear Bezier curve with two control points. The goal is to optimize the parameters (the positions of the four control points) $\{\text{curve}_i\}_{i=1}^{n} = \{\{p_i^j\}_{j=1}^{4}\}_{i=1}^{n}$ of all curves to fit the 3D point cloud; the number of curves n varies from object to object. To optimize the curve fitting, we sample 100 points on each curve and dilate them to 500 by adding Gaussian noise around them. We use the widely adopted Chamfer Distance (CD) to measure the distance between curve points and 3D edge points. Let $P_c$ and $P_t$ denote the curve sample points and the target 3D edge points, respectively; the Chamfer loss is defined as:

$$\mathcal{L}_{\text{CD}} = \frac{\gamma}{|P_c|} \sum_{x \in P_c} \min_{y \in P_t} \|x - y\|_2^2 + \frac{1}{|P_t|} \sum_{y \in P_t} \min_{x \in P_c} \|x - y\|_2^2$$
where γ is a parameter controlling the weight of each side (γ = 1 recovers the original Chamfer loss). For each point in $P_c$ we find the closest point in $P_t$ (and vice versa) and average the pairwise point-level distances. A larger γ means the optimization focuses more on minimizing the distance from $P_c$ to $P_t$.
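To make the fitting objective concrete, here is a small PyTorch sketch of sampling points on cubic Bezier curves from their control points and computing a one-side-weighted Chamfer loss in the spirit of Eq. 9 (the Gaussian-noise dilation step is omitted; shapes and weighting are assumptions):

```python
import torch

def sample_bezier(ctrl, n=100):
    """Sample n points per curve from (B, 4, 3) cubic Bezier control points."""
    t = torch.linspace(0.0, 1.0, n, device=ctrl.device).view(1, n, 1)
    p0, p1, p2, p3 = ctrl[:, 0:1], ctrl[:, 1:2], ctrl[:, 2:3], ctrl[:, 3:4]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)          # (B, n, 3)

def chamfer_loss(pc, pt, gamma=1.0):
    """Weighted Chamfer distance between curve samples pc and edge points pt.

    pc: (M, 3) points sampled on all curves, pt: (K, 3) 3D edge points.
    gamma > 1 emphasizes pulling the curve samples toward the edge points.
    """
    d = torch.cdist(pc, pt) ** 2                 # (M, K) squared distances
    return gamma * d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```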

By minimizing the CD, we fit Bezier curves to the 3D edge points. However, the CD optimization is not sensitive to endpoint details, and we find that many curves end up disconnected. To encourage curve connections, we add a regularizer to the objective function that encourages spatially close endpoints to meet. The two endpoints of each Bezier curve are its first and last control points, and the endpoints of all curves $\{\text{curve}_i\}_{i=1}^{n}$ are collected as $P_E = \{\{p_i^j\}_{j \in \{1, 4\}}\}_{i=1}^{n}$. The endpoint regularizer is defined as:
$$\mathcal{L}_{\text{end}} = \sum_{p,\, q \in P_E,\; p \neq q} M(p, q)\, \|p - q\|_2^2$$
where M is a mask ensuring that the endpoint loss penalizes only those endpoints that are close enough to each other (within a distance d). Finally, the overall objective optimized over all curves is:
$$\mathcal{L}_{\text{curve}} = \mathcal{L}_{\text{CD}} + \lambda\, \mathcal{L}_{\text{end}}$$
where λ is set to 0.01 in all experiments.
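A sketch of the endpoint term and the combined objective, under the assumption that the mask simply keeps endpoint pairs within distance d (the paper's exact masking may differ):

```python
import torch

def endpoint_loss(ctrl, d=0.05):
    """Pull together curve endpoints that are already close (within d).

    ctrl: (B, 4, 3) control points of all curves; the endpoints are the
    first and last control points. Self-pairs and far-apart pairs are
    masked out, mimicking the mask M in the endpoint regularizer.
    """
    ends = torch.cat([ctrl[:, 0], ctrl[:, 3]], dim=0)   # (2B, 3) endpoints P_E
    dist = torch.cdist(ends, ends)                      # (2B, 2B) pairwise distances
    mask = (dist < d) & (dist > 0)
    if not mask.any():
        return ends.new_zeros(())
    return (dist[mask] ** 2).mean()

# Combined objective of Eq. 12 with lambda = 0.01:
# loss = chamfer_loss(samples.reshape(-1, 3), edge_points, gamma=1.0) \
#        + 0.01 * endpoint_loss(ctrl)
```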

3.2.2 Coarse-to-fine solution

The objective function is highly non-convex, so the control points easily converge to local minima. Therefore, the initialization of the Bezier curves has a significant impact on the final result of the optimization. It is also difficult to choose the right number of curves for all objects. We therefore design a coarse-to-fine pipeline to extract curves. At the coarse level, we degrade the cubic Bezier curves to straight lines and fit a set of straight lines to the 3D edge points. At the fine level, we upgrade the straight lines to cubic Bezier curves and connect the endpoints of the curves.

Coarse-Level Optimization
Instead of optimizing all lines simultaneously, we iteratively optimize lines one by one using a fit-and-drop strategy. Specifically, in each iteration we increase γ to 5 in the Chamfer loss (Eq. 9) to encourage a line (linear curve) to best fit part of the 3D edge points. After identifying a line, we delete the 3D edge points around it and record its parameters. This process continues until few 3D edge points remain (i.e., fewer than 20). The fitted set of lines serves as the initialization for the fine level. We illustrate the coarse optimization process in Fig. 5. Since lines are fitted one by one, we do not consider endpoint regularization at this level.
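The loop below sketches this greedy fit-and-drop initialization: each candidate line (parameterized by its two endpoints) is fitted with the γ = 5 Chamfer loss, then the edge points it explains are dropped. The drop radius, learning rate, and iteration counts are illustrative assumptions, not the paper's settings.

```python
import torch

def fit_and_drop(edge_points, drop_radius=0.02, min_points=20,
                 iters=300, gamma=5.0):
    """Greedy coarse initialization: fit lines one at a time, then remove
    the 3D edge points each fitted line explains, until few points remain.

    edge_points: (K, 3) tensor of 3D edge points.
    Returns a list of (2, 3) endpoint tensors, one per fitted line.
    """
    lines, remaining = [], edge_points.clone()
    while remaining.shape[0] >= min_points:
        # Initialize a segment from two random remaining points.
        idx = torch.randperm(remaining.shape[0])[:2]
        ends = remaining[idx].clone().requires_grad_(True)
        opt = torch.optim.Adam([ends], lr=1e-2)
        t = torch.linspace(0, 1, 100, device=remaining.device).unsqueeze(1)

        for _ in range(iters):
            samples = (1 - t) * ends[0] + t * ends[1]        # (100, 3) line samples
            d = torch.cdist(samples, remaining) ** 2
            loss = gamma * d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

        ends = ends.detach()
        lines.append(ends)
        # Drop the edge points covered by this line.
        samples = (1 - t) * ends[0] + t * ends[1]
        covered = torch.cdist(remaining, samples).min(dim=1).values < drop_radius
        if not covered.any():            # guard against an infinite loop
            break
        remaining = remaining[~covered]
    return lines
```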

Fine-Level Optimization
The coarse level initializes the number of lines as well as their start and end positions. At the fine level, we upgrade all straight lines back to cubic Bezier curves by inserting two more control points between each endpoint pair, and solve the optimization in Equation 12. The resulting parametric curves match the 3D edge points precisely, as shown in Figure 6.

Figure 6. Qualitative comparison with other methods. From left to right, we present the rendered image, the results of PIE-NET, PC2WF, DEF, our reconstructed curves, our point cloud edges obtained from edge densities, and the ground truth edges. While other methods are trained on point clouds of the ABC dataset, our method is self-supervised via 2D edge maps.

4. Experiments

We compare with state-of-the-art methods and perform ablation experiments on the contributed ABC-NEF dataset. More experiments, discussions, training details, and video demonstrations are in the supplement.

4.1. ABC-NEF dataset.

As in previous works [20, 21, 36], we conduct experiments on the ABC dataset [16], which consists of over one million CAD models with edge annotations. To evaluate our pipeline, we provide a dataset named ABC-NEF, which contains 115 distinct and challenging CAD models. They cover various types of surfaces and curves and come from the first chunk of the ABC dataset. We employ BlenderProc [7] to render posed images oriented toward the object center. We render 50 views of 800 × 800 images per object; the 50 views are sampled by Fibonacci sampling [13], with the cameras evenly placed on a sphere. Statistical analyses of the ABC-NEF dataset and an ablation with respect to the number of views are included in the supplementary material.
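For reference, here is a small sketch of Fibonacci (golden-spiral) sampling of camera centers on a sphere; the radius and coordinate convention are assumptions, and the look-at orientation toward the object center is omitted.

```python
import numpy as np

def fibonacci_sphere(n_views=50, radius=1.5):
    """Place n_views camera centers quasi-uniformly on a sphere around the
    origin using the golden-spiral (Fibonacci) lattice."""
    i = np.arange(n_views)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - (2 * i + 1) / n_views            # heights evenly spaced in (-1, 1)
    theta = 2 * np.pi * i / golden           # azimuth driven by the golden ratio
    r_xy = np.sqrt(1 - z ** 2)
    return radius * np.stack([r_xy * np.cos(theta), r_xy * np.sin(theta), z], axis=1)
```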

4.2. Comparison with state-of-the-art

Comparative Settings
We compare the proposed method with three state-of-the-art data-driven methods for parametric curve reconstruction: PIE-NET [36], PC2WF [20], and DEF [21]. All three methods require point clouds as input, while our method only requires 2D images. Following the settings in their papers: for PIE-NET, we use farthest point sampling to uniformly sample 8,096 points from a clean point cloud representing the object shape as input, and its output contains both closed and open curves. For PC2WF, we subsample 200,000 points per object from the surface mesh as input, and it outputs pairs of endpoints representing straight lines. For DEF, the input consists of 128 depth maps together with the point cloud; the depth maps are aggregated to construct a distance-to-feature field, which is used to detect corners on the point cloud and extract splines.

We employ their pretrained models to reconstruct parametric curves for evaluation. Since PC2WF is designed to detect straight lines, we also compare against a subset of the proposed ABC-NEF, which contains 26 CAD models with only straight lines, named ABC-NEF-Line.

Evaluation Metrics
We sample points on the reconstructed parametric curves and evaluate the distance between the sampled points and the ground-truth edge points. To ensure an even distribution of points, we downsample the points on a voxel grid so that there is at most one point per voxel.

To measure how well the reconstructed 3D edges are localized, we employ Intersection over Union (IoU), precision, recall, and their F-score. However, small shifts between two point clouds may lead to large changes in these metrics, so we also employ the Chamfer distance (CD) between point clouds to measure the geometric accuracy of the reconstructed parametric curves; small offsets between point clouds do not affect the CD much. We normalize and align all ground-truth edges and curve predictions to the range [0, 1] before comparison. After normalization, when evaluating IoU, precision, recall, and F-score, a point is considered matched if there exists at least one point in the other set within an L2 distance of 0.02.
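Below is a sketch of how precision, recall, and F-score could be computed under this protocol; the voxel size used for downsampling is an assumption, and the IoU and CD variants are omitted for brevity.

```python
import numpy as np
from scipy.spatial import cKDTree

def edge_point_metrics(pred, gt, voxel=0.005, tau=0.02):
    """Precision / recall / F-score between predicted and GT edge points.

    pred, gt: (N, 3) arrays, assumed already normalized to [0, 1]^3.
    Both sets are voxel-downsampled to at most one point per voxel, then a
    point counts as matched if its nearest neighbor in the other set is
    within tau.
    """
    def downsample(pts):
        keys = np.unique(np.floor(pts / voxel).astype(np.int64), axis=0)
        return (keys + 0.5) * voxel              # one representative per voxel

    pred_d, gt_d = downsample(pred), downsample(gt)
    d_pred = cKDTree(gt_d).query(pred_d)[0]      # pred -> nearest gt distance
    d_gt = cKDTree(pred_d).query(gt_d)[0]        # gt -> nearest pred distance

    precision = float((d_pred < tau).mean())
    recall = float((d_gt < tau).mean())
    f_score = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f_score
```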

Comparison Results
As shown in Table 1, our self-supervised method with only 2D supervision significantly outperforms the other state-of-the-art methods on all metrics and both datasets. We observe that both PIE-NET and PC2WF achieve much higher precision than recall, which means they often miss curves, but the detected curves are well localized. Although PC2WF is designed to detect straight lines, our method still achieves better performance on the ABC-NEF-Line dataset.

Table 1. Quantitative comparison with state-of-the-art methods. Note that our method is self-supervised by 2D edge maps, while the other methods are trained on point clouds sampled from the ABC dataset. "AN" denotes the ABC-NEF dataset, and "ANL" denotes the ABC-NEF-Line dataset.

We illustrate qualitative performance in Figure 6. The results show that PIE-NET and DEF can detect and localize most curves well, and PC2WF is proficient at reconstructing line structures. However, limited by design, PC2WF can only detect straight lines and struggles to capture any other type of curve. Since PIE-NET is trained specifically on sharp features, it performs poorly on elliptical edges and regions with relatively weak curvature. Meanwhile, DEF reconstructs curves mainly from a continuous, smooth distance-to-feature field, so it has difficulty distinguishing nearby curves and easily connects adjacent curves by mistake. We also note that these methods rely heavily on corner detection to reconstruct curves and thus cannot cover all edges if some corners are missed. Fundamentally, these data-driven methods may struggle to reconstruct curves on shapes outside their training distribution. In contrast, our method benefits from a self-supervised pipeline that can be trained on natural images. More comparisons and discussions are in the supplementary material.

4.3. Ablation studies

We conduct ablation studies to validate each loss and design choice. The W-MSE loss in Equation 5 is crucial for learning NEF due to the imbalance of edge and non-edge pixels (rays); without it, NEF training degenerates to predicting an all-zero field. Therefore, we take NEF with the W-MSE loss as the baseline version and evaluate the sparsity loss and the consistency loss on top of it.

For better visualization, we compare the quality of edge density by illustrating the rendered depth map. Since the depth map is essentially rendered by the NEF density accumulated along the rays, it accurately conveys the spatial distribution of edge density. As shown in Figure 7, the network may generate random noise densities in the scene without sparse regularization. Without consistency loss, the network is trained to adapt to an incorrect "ground truth", missing occluded edges, thus overfitting 2D edges in each view and failing to reconstruct consistent 3D edges.

Figure 7. From left to right, we present the 2D image and the detected edge map in a given view, followed by the rendered edge maps and depth maps of the three loss combinations. The rendered depth map reveals the spatial distribution of the edge density field. Sparsity regularization removes most of the noise around objects, and the consistency loss makes the edge densities align clearly with the 2D edges, which are easy to separate from the background.

After obtaining the 3D edge points, we reconstruct the parametric curves in a coarse-to-fine manner. We justify the necessity of our design by removing each part individually. We show optimization results with and without coarse-level initialization, the line-to-curve strategy, and the endpoint loss in Figure 8; quantitative results on selected samples are shown in Table 2. Without coarse-level initialization, i.e., fitting cubic Bezier curves directly without initializing from straight lines, one cubic Bezier curve may fit multiple connected lines. Consequently, without the line-to-curve strategy, the total number of curves may be insufficient for the global optimization, which further affects the endpoint loss. The endpoint loss is used to refine all curves so that they are tightly connected. With all design strategies, our complete result is clean, compact, and geometrically faithful.

Figure 8. Based on the 3D edge points, we show the reconstructed parametric curves of the ablations, obtained by excluding each key design from the full version.

We also run ablation experiments with another edge detector (Canny), with noisy 2D edge maps detected on blurred images (Gaussian blur with a 9 × 9 kernel), and with 30% and 50% of the edge pixels randomly discarded in all views, to test the robustness of our method. As shown in Figure 9, all alternatives perform reasonably well. Even when the edge maps are badly damaged, our method can still recover the rough 3D edge structure.

Figure 9. Ablation experiments with the Canny detector and low-quality 2D edge maps. For each ablation experiment, from left to right, it shows detected 2D edges, rendered edges, and a depth map (revealing the distribution of edge densities).

4.4. Real scene

We also test NEF on several toys with sharp geometric features captured in real-world scenes. We take a video walking around the target toy and extract around 60 frames as input. We apply COLMAP [29], a well-known structure-from-motion (SfM) solver, to estimate the camera pose of each input image. We then apply the pretrained PiDiNet [32] to extract 2D edge maps, train NEF, and reconstruct curves from the extracted edge points. The process is shown in Figure 10. The reconstruction results show the potential of our method to extract 3D edge points and reconstruct parametric curves in real-world scenes, even when the camera poses are not perfectly accurate.

Figure 10. Given a set of multi-view images extracted from a video as input, we use COLMAP [29] to obtain camera poses, detect 2D edge maps via PiDiNet [32], and reconstruct 3D curves.

5. Conclusion

We present the first self-supervised pipeline for 3D parametric curve reconstruction by learning neural edge fields. Self-supervised with only 2D supervision, our method achieves comparable or even better curve reconstruction than alternative methods that take clean and complete point clouds as input. Our method shows the potential for generalization and the advantage of exploiting multi-modal information. It has limitations in handling textured objects and edges inside objects, and the network architecture could be simplified in future work. More discussion is in the supplementary material.


  1. The idea is novel. The three loss designs proposed for dealing with occluded edges are quite clever and worth learning from.
  2. Judging from the experimental results, the method seems suitable only for regular geometry and does not perform well on real data.
  3. This work reminds me of LIMAP. LIMAP uses traditional methods, does not perform curve parameterization, and focuses on 3D line segment reconstruction; it is highly competitive in point-line visual localization and performs well in real scenes. The two have different focuses: the goal of this paper is component-level reconstruction, whereas LIMAP targets scene-level reconstruction.

Origin blog.csdn.net/m0_50910915/article/details/130609839