[Intensive reading of papers] DELS-MVS

Today I read a paper published at WACV 2023; the first author is from Graz University of Technology.
Article link: DELS-MVS: Deep Epipolar Line Search for Multi-View Stereo

Abstract

For each pixel in the reference image, the method uses a deep architecture to search for the corresponding point in the source image along the corresponding epipolar line. Previous learning-based MVS work selects a depth range of interest, discretizes it, and samples the epipolar line at the resulting depth values: this leads to a non-uniform scan of the epipolar line. Instead, our method operates directly on the epipolar lines: this guarantees a uniform scan of the image space and avoids having to choose a depth range of interest, which is usually not known a priori, can vary greatly from scene to scene, and requires a suitable discretization of the depth space. Furthermore, the proposed search is iterative, which avoids building a cost volume. Finally, the method fuses the estimated depth maps in a robust, geometry-aware manner, using a confidence predicted along with each depth estimate.

1. Introduction

The disadvantages of discretizing the depth range are discussed:

  1. Obtaining the depth range of a natural scene requires running SfM first, which is not necessarily accurate
  2. A fixed discretization strategy leads to imbalance: objects close to the camera need a fine division, while for objects far from the camera a coarse division suffices, so no single discretization fits both
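The imbalance in point 2 is easy to see numerically. Below is a minimal sketch (with hypothetical focal length and baseline, for a rectified stereo pair where position along the epipolar line is simply the disparity f·b/depth) showing that uniformly spaced depth samples map to very non-uniform steps along the epipolar line:

```python
import numpy as np

# Toy two-camera setup (hypothetical intrinsics/baseline, just for illustration):
# a reference pixel's ray is sampled at uniformly spaced depths, and each sample
# is mapped to its position along the source epipolar line.
f, baseline = 1000.0, 0.2            # focal length (px), baseline (m)
depths = np.linspace(1.0, 10.0, 10)  # uniform discretization of a depth range

# For a rectified pair, position along the epipolar line is the disparity:
disparity = f * baseline / depths

# Spacing between consecutive samples along the epipolar line:
spacing = np.abs(np.diff(disparity))
print(spacing)  # large steps at near depths; far-depth samples cluster together
```

The spacing shrinks rapidly with depth: distant samples pile up in a tiny segment of the line while near samples are spread far apart, which is exactly the non-uniform scan the paper avoids by searching the line directly.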

The advantages of the proposed method are presented:
Compared with methods that first discretize a given depth range, selected a priori, and then convert the resulting depth values to points or line segments along the epipolar line when searching for a match, our method has several advantages.
First, operating on epipolar lines allows our method to better exploit image information. In fact, due to the scene geometry and the relative pose between the reference camera and the source camera, a uniform discretization of the depth range may cause points to cluster in a small segment of the epipolar line, preventing correct matching.
Second, our strategy avoids the need to define a search depth range and a depth discretization strategy tailored to the scene content, since the epipolar lines are explored dynamically. Our method is iterative, and the epipolar lines can be scanned efficiently in a coarse-to-fine fashion. This avoids building a large, fine-grained cost volume over depth.
Finally, our method estimates a depth map of the reference image for each available source image and fuses them in a geometry-aware manner using a confidence measure estimated along with the depth map itself. These confidence measures can also be exploited during point cloud construction to filter outliers, leading to more accurate reconstructions.

In summary, the core contributions are as follows:

  1. A deep, iterative, coarse-to-fine depth estimation algorithm that operates directly on the epipolar line, thus avoiding the drawbacks of depth discretization (in particular, no depth range needs to be specified)
  2. A confidence prediction module and a geometry-aware fusion strategy that, coupled together, allow a robust fusion of the reference-image depth maps estimated from the different source images
  3. The method is validated on the most popular MVS benchmarks, namely ETH3D, Tanks and Temples, and DTU, achieving competitive results

2. Related Work

Related MVS work is presented.

3. Algorithm

Overall Architecture Diagram

  1. Feature extraction: the extracted features are handed to the core algorithm, which estimates the depth of the reference image.
  2. For each reference-image pixel, the goal of the algorithm is to estimate the residual, along the epipolar line, between the actual projection of the pixel into the source image and an initial guess. This part is introduced in Section 3.1.
  3. To avoid scale dependencies, the algorithm estimates residuals through iterative classification steps that proceed in a coarse-to-fine fashion. The algorithm is named Deep Epipolar Line Search (DELS) because the iterative classification resembles a search, and it relies on a deep neural network called the Epipolar Residual Network (ER-Net). DELS and ER-Net, which represent the core of DELS-MVS, are described in Sections 3.2 and 3.3.
  4. DELS-MVS also has a confidence network (C-Net) that associates a confidence map with each estimated depth map $D^{n}$. This network is introduced in Section 3.4, together with the process used to fuse the depth maps $D^{n}$, $0 \le n \le N-1$ (one per source image), into a single depth map.

3.1 Depth estimation via epipolar residual

[Figure: projection of a reference pixel onto the source epipolar line]
For each reference pixel, the goal is to estimate the signed epipolar residual $r^{n}$ such that the true correspondence $p^{n}$ is recovered from the projection at the initial depth guess:
$p^{n} = \hat{p}^{n} + r^{n}\,\bar{e}^{n}$,
where $\hat{p}^{n}$ is the projection of the reference pixel into source image $n$ at the initial depth and $\bar{e}^{n}$ is the unit direction of the epipolar line.
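This parameterization can be verified with a small numerical sketch (hypothetical intrinsics and pose, not the paper's data): project a reference pixel at an initial depth guess and at the true depth, and check that the true match is reached by a single signed step along the unit epipolar direction.

```python
import numpy as np

# Minimal sketch of the epipolar residual parameterization, with hypothetical
# camera parameters. The reference pixel projected at an initial depth guess
# lands at p_hat on the source epipolar line; the true match p_true is reached
# by moving a signed residual r along the unit epipolar direction e.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.3, 0.0, 0.0])  # source pose relative to reference

def project(p_ref, depth):
    """Back-project a reference pixel at `depth`, then project into the source view."""
    X = depth * np.linalg.inv(K) @ np.array([p_ref[0], p_ref[1], 1.0])
    x = K @ (R @ X + t)
    return x[:2] / x[2]

p_ref = (400.0, 250.0)
p_hat = project(p_ref, 2.0)                   # projection at the initial guess
p_far = project(p_ref, 50.0)                  # a second sample fixes the line
e = (p_far - p_hat) / np.linalg.norm(p_far - p_hat)  # unit epipolar direction

p_true = project(p_ref, 3.5)                  # correspondence at the true depth
r = np.dot(p_true - p_hat, e)                 # signed epipolar residual
assert np.allclose(p_hat + r * e, p_true)     # p_true = p_hat + r * e
```

With a purely horizontal baseline, the epipolar line is horizontal and the residual reduces to a 1-D signed offset, which is the quantity the network is trained to predict.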

3.2 Deep epipolar line search (DELS)

In MVS scenarios, the baselines between the reference image and the different source images can vary greatly, both within a scene and across scenes. Furthermore, depth maps can exhibit very different ranges depending on the specific scene: from very small ranges when reconstructing small objects to very large ranges when reconstructing outdoor scenes. In most 3D reconstruction scenarios, the scene scale is not known a priori. Overall, this makes directly regressing the epipolar residual with a network a very challenging task. Therefore, we reformulate the epipolar residual estimation problem as an iterative, coarse-to-fine classification scheme.
[Figure: partition of the epipolar line into segments for the DELS iterations]
To estimate the epipolar residual at a new iteration $i$, the epipolar line is divided into $k$ segments, as shown in the figure: the inner part is denoted $L_{I}$ and the outer part $L_{O}$. Classifying the match into one of these segments provides the search direction for the next iteration.

The process is as follows:
[Figure: the iterative DELS search procedure]
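The iterative scheme can be sketched in one dimension. In this toy version (my simplification, not the paper's code) the learned classifier is replaced by an oracle matching cost, purely to show the mechanics: classify the match into one of k segments around the current estimate, recenter on the winning segment, halve the search interval, and repeat:

```python
import numpy as np

# Toy sketch of the coarse-to-fine DELS iterations. The learned classifier
# (ER-Net) is replaced by an oracle cost -- distance to a hidden true position
# on the epipolar line -- only to illustrate the search scheme itself.
true_pos = 37.3        # hidden true position along the epipolar line (pixels)
estimate = 0.0         # initial guess
half_width, k = 64.0, 8

for _ in range(6):     # iterative, coarse-to-fine
    centers = estimate + np.linspace(-half_width, half_width, k)  # k segments
    costs = np.abs(centers - true_pos)     # stand-in for the learned matching cost
    estimate = centers[np.argmin(costs)]   # classification step: pick a segment
    half_width /= 2.0                      # refine: halve the search interval

print(round(estimate, 2))  # converges close to true_pos, with no depth range given
```

Note that the search never needs a min/max depth: the interval is defined directly in image space along the epipolar line and shrinks geometrically with the iterations.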

3.3 Epipolar Residual Network

Classification at each DELS iteration is performed by ER-Net. The inputs of ER-Net are the feature maps of the src img and ref img, together with the residual map produced by the previous stage. This allows sampling features around the latest estimate on the epipolar line for each pixel of the ref img. To this end, deformable convolutions are incorporated into a U-Net-like architecture.
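The effect of the offset-driven sampling can be illustrated without the actual deformable layers. Below is a plain-NumPy stand-in (a deliberate simplification of what `deform_conv2d`-style layers do): for one pixel, source features are gathered by bilinear interpolation at sub-pixel locations offset along the epipolar line around the current estimate:

```python
import numpy as np

# Sketch of the feature sampling that deformable convolutions achieve inside
# ER-Net: source-image features are read at sub-pixel positions offset along
# the epipolar line around the current residual estimate. Bilinear
# interpolation stands in for the learned, offset-driven sampling.
def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a (H, W) feature map at float coords (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x0 + 1]
            + (1 - dx) * dy * feat[y0 + 1, x0] + dx * dy * feat[y0 + 1, x0 + 1])

H, W = 8, 8
feat = np.arange(H * W, dtype=float).reshape(H, W)  # toy source feature map
center = np.array([3.0, 4.0])   # current estimate (x, y) on the epipolar line
e = np.array([1.0, 0.0])        # unit epipolar direction (horizontal here)

# Gather features at offsets along the epipolar line around the estimate:
samples = [bilinear_sample(feat, *(center + s * e)) for s in (-1.5, 0.0, 1.5)]
print(samples)
```

In the real network the offsets are produced per pixel from the residual map, so the receptive field follows the epipolar geometry instead of a fixed square grid.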

3.4 Confidence Network

Our method computes N depth maps for the ref img, each computed with a different src img. This raises the question of how to fuse all the estimated depth maps into a single one, since some regions of the ref img may be visible in one src img but not in another. To this end, we introduce a confidence network (C-Net) that assigns a confidence map $C^{n}$ to each estimated depth map $D^{n}$: the confidence maps are then used to guide the fusion of the available depth maps.

At each level j of our multiresolution scheme, we compute a map similar to the pixel-wise entropy of the partition probability, but taking into account its evolution over DELS iterations:
[Equation: entropy-like measure of the partition probabilities, accumulated over the DELS iterations]
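The intuition behind this cue can be sketched as follows (a hedged illustration only; the paper's exact formula may weight the iterations differently): a peaked segment distribution at each iteration means a confident, unambiguous match, while a flat one means ambiguity, and averaging the entropy over iterations tracks its evolution:

```python
import numpy as np

# Entropy-like confidence cue: at each DELS iteration the classifier outputs a
# probability over the k epipolar segments; low mean entropy across iterations
# suggests a confident match, high mean entropy an ambiguous one.
def mean_entropy(prob_per_iter):
    """prob_per_iter: (iterations, k) classification probabilities for one pixel."""
    p = np.clip(prob_per_iter, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

k = 8
confident = np.tile(np.eye(k)[0], (4, 1))   # always certain of one segment
ambiguous = np.full((4, k), 1.0 / k)        # uniform over the segments

assert mean_entropy(confident) < mean_entropy(ambiguous)
print(mean_entropy(confident), mean_entropy(ambiguous))  # low vs. high entropy
```

C-Net turns such per-pixel statistics into the confidence map $C^{n}$ used for fusion and outlier filtering.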

3.5 Geometry-aware multi-view fusion

A method for fusing multiple depth maps is introduced.
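A minimal sketch of such a fusion for one pixel (hypothetical numbers; the paper's geometry-aware rule checks geometric consistency between views, which is reduced here to masking estimates that deviate too far from the most confident one before confidence weighting):

```python
import numpy as np

# Confidence-guided fusion of the per-source depth estimates at one pixel.
depths = np.array([2.01, 1.98, 3.50])   # D^n: one estimate per source image
confs  = np.array([0.90, 0.80, 0.30])   # C^n: confidence predicted by C-Net

best = depths[np.argmax(confs)]                  # anchor: most confident estimate
consistent = np.abs(depths - best) < 0.1 * best  # crude consistency mask
w = confs * consistent                           # drop inconsistent views
fused = float(np.sum(w * depths) / np.sum(w))    # confidence-weighted mean
print(round(fused, 3))  # the outlier 3.50 is excluded from the fused depth
```

The same confidences can later be thresholded during point cloud construction to discard unreliable points, as mentioned in the introduction.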

4 Experimental evaluation and Training

The training procedure and the detailed configuration for training and testing are described. Performance on each benchmark is reported as follows:
ETH3D
T&T
DTU

5 Conclusion

We propose DELS-MVS, a novel MVS method that utilizes deep neural networks to perform matching search directly on the src img epipolar line. After estimating a dense depth map on the ref img for each available src, DELS-MVS employs a geometry-aware strategy to fuse them into a single depth map using the learned confidence, aiming to improve robustness to outliers. DELS-MVS is iterative, so there is no need to build large cost volumes. Also, no explicit discretization of depth space in min/max ranges is required, since DELS-MVS explores epipolar lines dynamically. We demonstrate the robustness of our method by evaluating on the ETH3D, DTU, and Tanks and Temples benchmarks, achieving competitive results.

Source: blog.csdn.net/YuhsiHu/article/details/131289860