[Paper Intensive Reading 2] Detailed Explanation of MVSNet Series Papers - R-MVSNet

R-MVSNet, full paper title: "Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference", CVPR 2019 (CCF A).
It builds on MVSNet and mainly targets the excessive memory consumption of the cost volume regularization step, with three changes:
(1) In the cost volume regularization step, a sequential (convolutional) GRU replaces the 3D CNN.
(2) Soft argmin is replaced by softmax: the original regression problem becomes a multi-class classification problem trained with a cross-entropy loss.
(3) To reach sub-pixel accuracy, the initial depth map produced by the network is refined by variational depth map refinement.


MVSNet is the basis of this series of papers; it is recommended to understand it before reading the optimized models. For details, see [Paper Intensive Reading 1] Detailed Explanation of MVSNet Series Papers - MVSNet.

1. Problem introduction

One of the main limitations of cost volume-based MVS reconstruction methods is scalability: the memory cost of cost volume regularization makes learned MVS hard to apply to high-resolution scenes.

MVS (Multi-View Stereo) refers to reconstructing a 3D object or scene from multiple overlapping images whose intrinsic and extrinsic camera parameters are known.
In this type of problem, a cost volume is usually built from the matched features of the images and regularized into a probability volume to infer a depth map. Whether in traditional methods or learning-based methods, if the entire cost volume is fed into the regularization step at once, memory consumption grows roughly cubically with the reconstruction scale (see the rough estimate after the lists below). Both traditional and learning-based methods have tried to work around this:

Traditional methods usually regularize the cost volume implicitly, for example:

  • Local depth propagation to iteratively refine the depth map / point cloud
  • Regularizing the cost volume under a simple plane-sweeping order
  • 2D spatial cost aggregation with depth-wise winner-take-all

Learning-based methods have made two kinds of attempts:

  • OctNet and O-CNN exploit the sparsity of 3D data by introducing octree structures into 3D CNNs, but they are still limited to reconstructions at resolutions below 512^3 voxels.
  • SurfaceNet and DeepMVS apply an engineered divide-and-conquer strategy to MVS reconstruction, but suffer from the loss of global context information and slow processing.
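To make the scalability issue concrete, here is a rough back-of-the-envelope estimate (my own illustrative shapes and numbers, not figures from the paper) of how much memory a dense float32 cost volume occupies as resolution and the number of depth samples grow:

```python
# Rough memory estimate for a dense float32 cost volume of shape [C, D, H, W].
def cost_volume_bytes(height, width, depth_samples, channels, bytes_per_value=4):
    return height * width * depth_samples * channels * bytes_per_value

# Doubling the image resolution quadruples H*W, and larger scenes usually also
# need more depth samples D, so memory grows close to cubically with scale.
for h, w, d in [(256, 320, 128), (512, 640, 192), (1024, 1280, 256)]:
    gb = cost_volume_bytes(h, w, d, channels=32) / 1024**3
    print(f"{h}x{w}, D={d}: {gb:.1f} GB")   # ~1.2 GB, ~7.5 GB, ~40 GB
```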

2. Model structure

The core idea of this paper is to use a GRU (a variant of the RNN) to turn the usual one-shot regularization over all depths into depth-by-depth processing that reuses the output of the previous depth (i.e., the depth dimension is treated as the time dimension of a recurrent network), thus reducing the memory needed from roughly T for D depth samples to about T/D (this value is only for intuition).

R-MVSNet overall model

1. Feature extraction

Consistent with MVSNet

2. Feature volume regularization

2.1 Feature Map -> Feature Volume

This is the process marked by the circled M in the figure: the N feature maps extracted from the reference image and the source images by the feature extraction network are warped via the homographies H corresponding to depth D(0), giving N warped feature maps; then the variance across the N warped features is computed at every pixel of every feature channel, which yields the cost map at this depth (C(0) in the figure).

If this part is unclear, see the MVSNet post for details; this differentiable homography warping is the core of MVSNet.
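As a small illustration of the variance-based aggregation at a single depth (a sketch only: the homography warping itself is omitted here, and the shapes are example values), one cost map can be computed like this:

```python
import torch

def variance_cost_map(warped_features):
    """Aggregate N warped feature maps into one cost map by per-element variance.

    warped_features: [N, C, H, W] -- the reference feature map plus the source
    feature maps already warped to the reference view via the homography of the
    current depth hypothesis.  Returns a cost map of shape [C, H, W].
    """
    mean = warped_features.mean(dim=0, keepdim=True)        # [1, C, H, W]
    return ((warped_features - mean) ** 2).mean(dim=0)      # [C, H, W]

# Toy usage: N=3 views, 32 feature channels, quarter-resolution feature maps.
feats = torch.randn(3, 32, 128, 160)
print(variance_cost_map(feats).shape)   # torch.Size([32, 128, 160])
```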

2.2 Cost map regularization

The network structure used for regularization is drawn in the figure. First, an orange 2D convolution reduces the number of channels from 32 to 16; then three stacked convolutional GRU layers reduce the channels to 16, 4 and 1. The final output is a regularized cost map (Cost Map). My personal understanding is that the value of each point on this cost map reflects how likely that point is to lie at the current depth.
Then the process of 2.1 is repeated to compute the cost map at depth D(1), which is again fed into the network for regularization. Note that the input to each GRU layer at this step is not only the cost map at the current depth but also the output (hidden state) of the same GRU layer at depth D(0), which is exactly the idea of a recurrent neural network.
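Below is a minimal sketch of such a stacked convolutional GRU in PyTorch. It is not the authors' exact implementation; the class name, gate convention and channel sizes are my own simplifications, but it shows how the hidden state carried over from the previous depth plays the role of the "previous time step":

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """A minimal convolutional GRU cell (sketch, not the paper's exact layer)."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, kernel_size, padding=pad)
        self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel_size, padding=pad)
        self.hidden_ch = hidden_ch

    def forward(self, x, h):
        if h is None:
            h = torch.zeros(x.size(0), self.hidden_ch, x.size(2), x.size(3),
                            device=x.device, dtype=x.dtype)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde          # new hidden state = regularized output

# Three stacked cells (channel sizes 16 -> 4 -> 1 as in the figure), applied
# depth by depth; only one depth slice is held in memory at a time.
cells = nn.ModuleList([ConvGRUCell(16, 16), ConvGRUCell(16, 4), ConvGRUCell(4, 1)])
hiddens = [None, None, None]
regularized = []
for cost_map in torch.randn(8, 1, 16, 32, 40):    # 8 toy depth samples (e.g. 192 in practice)
    x = cost_map                                   # stands for the 16-channel cost map at this depth
    for i, cell in enumerate(cells):
        hiddens[i] = cell(x, hiddens[i])
        x = hiddens[i]
    regularized.append(x)                          # 1-channel regularized cost map at this depth
print(len(regularized), regularized[0].shape)      # 8 torch.Size([1, 1, 32, 40])
```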

This is the key step for reducing the memory consumption of cost volume regularization: the network only regularizes the cost map of one depth at a time, instead of regularizing the cost volume over all depths at once as MVSNet does. As shown in the figure below, the unrolled RNN structure on the right is equivalent to the compact form on the left; the model in this paper works the same way, so each step only consumes the memory of the single network on the left.

(Figure credit: Zhihu user @hourglass)

2.3 Regularized cost maps -> probability volume

This part is the second change relative to MVSNet. Instead of MVSNet's soft argmin (the expectation along the depth direction, i.e., probability times the corresponding depth value), the regularized cost maps of all depths (where, as understood above, each value reflects how likely the point is to lie at that depth) are stacked into a volume and softmax is applied along the depth direction, giving the probability volume P in which the values of every pixel sum to 1 along depth.
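In code this step is just a stack plus a softmax along the depth axis (the shapes and variable names below are illustrative, not those of the released code):

```python
import torch

D, H, W = 192, 128, 160
regularized_cost_maps = [torch.randn(H, W) for _ in range(D)]    # one map per depth sample

cost_volume = torch.stack(regularized_cost_maps, dim=0)          # [D, H, W]
prob_volume = torch.softmax(cost_volume, dim=0)                  # P: [D, H, W]
print(torch.allclose(prob_volume.sum(dim=0), torch.ones(H, W)))  # True: sums to 1 along depth
```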

3. Calculate the loss

The cross-entropy loss is computed between the probability volume P and the ground-truth probability volume Q built from the ground-truth depth map. This is the third change: the original regression of a probability-weighted expectation becomes a multi-class classification problem.

The ground-truth probability volume Q is obtained from the ground-truth depth map. Each pixel of the depth map has one depth value; the depth map is expanded to D layers (the number of depth samples), and for each pixel the layer corresponding to its true depth is set to 1 while all other layers are set to 0, i.e., a one-hot encoding along the depth direction.

Then, for the probability volume P, each pixel has D softmax probabilities along the depth direction, i.e., the probability of belonging to each depth "class", while the ground-truth volume Q provides the depth "class" label of each pixel. The problem therefore becomes multi-class classification trained with a cross-entropy loss. The formula is as follows:
$$\mathrm{Loss} = \sum_{p} \sum_{d=1}^{D} -Q(d,p)\,\log P(d,p)$$
(cross-entropy between the ground-truth one-hot volume Q and the predicted probability volume P, summed over valid pixels p and depth samples d)
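A hedged sketch of this loss in PyTorch (my own helper; it assumes the ground-truth depth has already been quantized to the index of the nearest depth sample and that a mask marks pixels with valid ground truth):

```python
import torch

def depth_classification_loss(prob_volume, gt_depth_index, valid_mask):
    """Cross-entropy over the depth dimension, as described in the text.

    prob_volume:    [D, H, W] softmax probabilities P along depth
    gt_depth_index: [H, W]    index of the depth sample closest to the GT depth
    valid_mask:     [H, W]    1 where ground-truth depth is available
    """
    D = prob_volume.size(0)
    # One-hot ground-truth volume Q: 1 on the true depth layer, 0 elsewhere.
    q = torch.nn.functional.one_hot(gt_depth_index, num_classes=D)    # [H, W, D]
    q = q.permute(2, 0, 1).float()                                    # [D, H, W]
    ce = -(q * torch.log(prob_volume.clamp_min(1e-8))).sum(dim=0)     # [H, W]
    return (ce * valid_mask).sum() / valid_mask.sum().clamp_min(1)

# Toy example
D, H, W = 8, 4, 5
prob = torch.softmax(torch.randn(D, H, W), dim=0)
gt_idx = torch.randint(0, D, (H, W))
print(depth_classification_loss(prob, gt_idx, torch.ones(H, W)))
```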

3. Variational Depth Map Refinement

According to the paper: "One concern about the classification formulation is the discretized depth map output. To achieve subpixel accuracy, a variational depth map refinement algorithm is proposed in Sec. 4.2 to further refine the depth map output." In other words, one concern with the classification formulation is that the output depth map is discretized, and a variational depth map refinement algorithm is proposed to reach sub-pixel accuracy.
Honestly, my understanding of this "sub-pixel accuracy" is still a bit vague. Does it mean that sub-pixel positions of the image also get corresponding depth values?

The input to this step is the initial depth map produced by the network. Concretely, for each pixel, we look along the depth direction across the regularized cost maps and take the depth with the highest value as that pixel's depth, which yields a complete depth map.

This is the winner-take-all strategy using argmax mentioned in the paper, i.e., directly taking the most likely depth instead of computing an expectation along the depth direction.
The paper also notes that the probability volume P only needs to be computed during training; at test time the depth map can be retrieved directly from the per-depth regularized cost maps using this strategy ("In addition, while we need to compute the whole probability volume during training, for testing, the depth map can be sequentially retrieved from the regularized cost maps using the winner-take-all selection").
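A minimal sketch of this winner-take-all selection (the shapes and the depth range below are illustrative only):

```python
import torch

def winner_take_all_depth(regularized_cost_maps, depth_values):
    """Pick, per pixel, the depth sample with the highest regularized score.

    regularized_cost_maps: [D, H, W] stacked per-depth regularized cost maps
    depth_values:          [D]       depth hypothesis of each sample
    """
    best_index = regularized_cost_maps.argmax(dim=0)     # [H, W]
    return depth_values[best_index]                      # [H, W] depth map

D, H, W = 192, 128, 160
scores = torch.randn(D, H, W)
depths = torch.linspace(425.0, 935.0, D)                 # illustrative depth range
print(winner_take_all_depth(scores, depths).shape)       # torch.Size([128, 160])
```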

The variational depth map refinement can be seen as a process of repeatedly reprojecting pixels and iteratively reducing a specific reprojection error. From the paper:
Given the reference image I1, the reference depth map D1 and one source image Ii, Ii is projected onto I1 through D1 to form the reprojected image Ii→1. The image reprojection error between I1 and Ii→1 at pixel p is defined as:
$$E_i(p) = E_i^{photo}(p) + E_i^{smooth}(p)$$
where Ei^photo is the photometric error between the two pixels, and Ei^smooth is a regularization term that enforces the smoothness of the depth map.
The paper uses zero-mean normalized cross-correlation (ZNCC) to measure the photo-consistency C(·), and uses the bilateral squared depth difference S(·) between p and its neighbors p' ∈ N(p) for the smoothness term.
During refinement, the total image reprojection error between the reference image and all source images, summed over all pixels, is minimized iteratively.
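For reference, a generic implementation of ZNCC (not the authors' code; the photometric error term can then be taken as something like 1 − ZNCC):

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-mean normalized cross-correlation between two image patches.

    Returns a value in [-1, 1]; 1 means perfectly photo-consistent, which makes
    ZNCC robust to affine lighting changes between views.
    """
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# A patch compared with a brightness-scaled copy of itself is still a perfect match.
patch = np.random.rand(7, 7)
print(zncc(patch, 2.0 * patch + 0.1))   # ~1.0
```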

Reconstruction results at each stage (figure from the paper)

This process brings two benefits:
(1) Figure (g) -> Figure (f): the staircase effect is eliminated, which is where the smoothness term plays its role.
(2) The depth values are fine-tuned within a small range, achieving sub-pixel depth accuracy.

Again, I don't fully understand what this sub-pixel depth refers to.
The paper also notes: "It is noteworthy that the initial depth map from R-MVSNet has already achieved satisfying result."
That is, the initial depth map obtained by the network is already satisfactory, and this step only refines it to sub-pixel accuracy. Still, I think there is quite a gap between figure (b) and the ground truth in figure (d), isn't there?

4. Summary

R-MVSNet is an improvement on MVSNet's memory consumption by Yao Yao and the original team, so the basic pipeline is unchanged. The main change is that the cost volume regularization step is serialized with an RNN and processed depth by depth, an application of the idea of trading time for memory.

5. Discussion

1. The paper gives a description diagram for different regularization methods:
The figure compares several schemes that process the cost volume sequentially along the depth direction, in addition to the one-shot global regularization:

  • Figure (a) is the simplest sequential approach, winner-take-all plane-sweeping stereo, which simply keeps the best per-pixel depth value and therefore suffers from noise.
  • Figure (b) shows improved cost aggregation methods, which filter the matching cost C(d) at different depths so as to gather spatial context information for each cost estimate.
  • Figure (c) is the more powerful recurrent regularization scheme based on the convolutional GRU proposed in this work, following the same idea of sequential processing. It gathers spatial context as well as unidirectional context along the depth direction, achieving results comparable to full-space 3D CNN regularization while being much more memory-efficient at runtime.

Figure (a) is the winner-take-all principle mentioned above: compute the cost pixel by pixel along the depth direction and take the most likely depth value.
Figure (b) mainly adds spatial "context" information. "Context" is a bit ambiguous here; it actually refers to spatial neighborhood information. The phrase "filter the matching cost C(d) at different depths" does not feel quite right to me (it should arguably not be "different depths"), but that is what the original text says: "such cost aggregation methods filter the matching cost C(d) at different depths (Fig. 1 (b)) so as to gather spatial context information for each cost estimation".
Figure (c) is this paper's scheme: processing depth by depth, it considers both the depth direction and spatial context information, and because only one depth layer is handled at a time, the memory consumption stays at H×W.
Figure (d) is the approach that applies a 3D CNN directly, represented by MVSNet; it considers the whole volume at once, but because all depths are processed simultaneously the memory becomes H×W×D.
Personally I find this figure somewhat convoluted and not that intuitive...

2. The paper also mentions a strategy for choosing the depth samples D, inverse depth, but does not elaborate; it is said to be detailed in the supplementary material, which I could not find. This seems quite important, because the paper says:
"Most deep stereo/MVS networks regress the disparity/depth outputs using the soft argmin operation, which can be interpreted as the expectation value along the depth direction [30]. The expectation formulation is valid if depth values are uniformly sampled within the depth range. However, in recurrent MVSNet, we apply the inverse depth to sample the depth values in order to efficiently handle reconstructions with wide depth ranges."
That is, the expectation along depth is only valid when the depth values are uniformly sampled within [D_min, D_max]; R-MVSNet instead uses inverse-depth sampling so as to handle reconstructions with a wide depth range, which is clearly not uniform sampling, and is tied to the specific depth values used in the homography warping.
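A small sketch of what uniform sampling in inverse-depth space looks like (a generic construction; the exact sampling used in the paper may differ):

```python
import numpy as np

def inverse_depth_samples(d_min, d_max, num_samples):
    """Sample depths uniformly in inverse-depth (1/d) space: dense near the
    camera, sparse far away, which covers a wide depth range with few samples."""
    inv = np.linspace(1.0 / d_min, 1.0 / d_max, num_samples)
    return 1.0 / inv

print(inverse_depth_samples(1.0, 100.0, 5))
# [  1.     1.33   1.98   3.88 100.  ]  -- clearly not uniform in depth
```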

3. During training, the paper points out that "to prevent depth maps from being biased on the GRU regularization order, each training sample is passed to the network with forward GRU regularization from dmin to dmax as well as the backward regularization from dmax to dmin". That is, to prevent the GRU from being biased by the small-to-large depth order, each sample is also trained with the reversed, large-to-small order.

4. The paper points out that "The memory requirement of R-MVSNet is independent to the depth sample number D, which enables the network to infer depth maps with large depth range that is unable to be recovered by previous learning-based MVS methods." That is, the memory consumption of the method is independent of the number of depth samples, which is the benefit of regularizing the cost maps depth by depth; in essence it trades time for space.

5. As for the concept of "sub-pixel level" accuracy mentioned in the variational depth map refinement, please leave a comment if you understand it...


2022.11.30 update

According to GC-Net, the paper cited by MVSNet for the soft argmin operation, "sub-pixel" precision actually simply means "continuous (down to decimals)" precision:

  • The expectation-based method computes the predicted depth as Σ(weight × depth hypothesis), so it can produce continuous depth values with decimal precision, i.e., sub-pixel accuracy (see the sketch after this list).
  • This paper treats depth inference as a multi-class classification problem: the "true depth label" of a pixel is one-hot encoded as [0,0,...,1,0,...,0], i.e., only the layer of the true depth is 1 and all other layers are 0, and the predicted depth of each pixel is selected directly from one of the hypothesized depth planes. It therefore cannot be continuous (only depths on the preset planes can be chosen), i.e., it cannot produce sub-pixel precision.
  • Because the depths output by the classification network are discrete, neighboring pixels may get very different depths, which is the source of the "staircase effect" mentioned above; the key step of the depth refinement is to fine-tune each pixel's depth and smooth it with its neighborhood, eliminating the staircase effect and reaching sub-pixel precision.
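A tiny numerical example of the difference (toy numbers of my own):

```python
import torch

depth_values = torch.tensor([2.0, 4.0, 6.0, 8.0])                  # D = 4 depth hypotheses
probs = torch.softmax(torch.tensor([0.1, 2.0, 1.8, 0.1]), dim=0)   # per-pixel probabilities

# Soft argmin / expectation (MVSNet-style regression): a continuous value that
# can fall between the hypotheses -- this is the "sub-pixel" accuracy.
print((probs * depth_values).sum())      # ~4.9, between 4.0 and 6.0

# Argmax classification (R-MVSNet before refinement): the result is forced onto
# one of the preset depth planes, hence the discretized, staircase-like output.
print(depth_values[probs.argmax()])      # exactly 4.0
```

This is exactly why the variational refinement step is needed to recover continuous depths from the classification output.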
