Differentiable registration of images and LiDAR point clouds

Overview

Registration across different modalities, such as 2D images from cameras and 3D point clouds from LiDAR, is a crucial task in computer vision and robotics.

Previous methods usually estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks and then apply the Perspective-n-Point (PnP) method in a post-processing stage to estimate the rigid transformation. However, these methods struggle to robustly map points and pixels into a shared latent space, since points and pixels have very different characteristics, and they cannot supervise the transformation directly because PnP is non-differentiable, which leads to unstable registration results. To address these issues, we propose to learn a structured cross-modal latent space that represents pixel features and 3D features via a differentiable probabilistic PnP solver.

Specifically, we design a triplet network to learn VoxelPoint-to-Pixel matching, where 3D elements are represented by both voxels and points so that a cross-modal latent space can be learned together with pixels. The voxel and pixel branches are CNN-based and perform convolutions on voxels/pixels represented as grids, while an additional point branch compensates for the information lost during voxelization. We train the framework end-to-end by imposing supervision directly on the probabilistic PnP solver.

To capture distinctive cross-modal patterns, we design a novel loss with adaptive weight optimization. Experimental results on the KITTI and nuScenes datasets show that our method achieves significant improvements over state-of-the-art methods.

Paper: https://arxiv.org/abs/2312.04060
Code: https://github.com/junshengzhou/VP2P-Match

Main contributions

• Proposed a novel framework for image-to-point cloud registration that learns a structured cross-modal latent space with adaptive weight optimization and is trained end-to-end through a differentiable PnP solver.

• Proposed to represent 3D elements as a combination of voxels and points to overcome the modality gap between point clouds and pixels, and designed a triplet network to learn VoxelPoint-to-Pixel matching.

• Demonstrated superior performance over state-of-the-art methods through extensive experiments on the KITTI and nuScenes datasets.

Content overview

The framework of this method mainly includes the following steps:

VoxelPoint-to-Pixel matching framework: This framework learns a structured cross-modal latent space. It consists of three branches: a voxel branch, a pixel branch, and a point branch. The voxel and pixel branches use convolutional neural networks that operate on voxels and pixels represented as grids to learn the correspondence between them. The point branch supplements the information lost during voxelization.

Novel loss function: To learn distinctive cross-modal patterns, the method introduces an adaptively weighted loss. This loss dynamically adjusts the weights according to the differences between samples, thereby better capturing the correlation between cross-modal features.

Differentiable probabilistic PnP solver: To achieve end-to-end learning, the method introduces a differentiable probabilistic PnP solver that estimates the rigid transformation from the 2D-3D correspondences established in the shared latent space. By imposing supervision directly on the solver, the entire framework can be trained end-to-end.

Overall, the framework achieves cross-modal registration by learning VoxelPoint-to-Pixel matching and applying a differentiable probabilistic PnP solver. The adaptively weighted loss improves the quality and stability of feature learning. Experiments on the KITTI and nuScenes datasets demonstrate significant improvements over state-of-the-art methods.
Figure 1: Overview of our approach. Given an unregistered image I and point cloud P as input, (a) we first voxelize the point cloud into sparse voxels V and then apply a triplet network to extract patterns from the three modalities. We represent 2D patterns as pixel features and 3D patterns as a combination of voxel and point features, and use an adaptively weighted loss to learn distinctive 2D-3D cross-modal patterns. (b) We use cross-modal feature fusion to detect the intersection regions in 2D/3D space. (c) We remove outlier regions based on the intersection detection results, establish 2D-3D correspondences via feature matching, and then apply probabilistic PnP to predict the distribution of the extrinsic pose, supervised end-to-end with the ground-truth pose.

The VoxelPoint-to-Pixel matching framework is a three-branch network for cross-modal feature matching. It includes Voxel, Point and Pixel branches for obtaining 2D and 3D features.

In the Voxel branch, sparse convolution is adopted to effectively capture spatial patterns. This branch mainly processes voxelized data to learn 3D features.

To recover the detailed 3D patterns lost during voxelization, the Point branch is introduced. Inspired by PointNet++, this branch is used to extract local features of point cloud data.

The Pixel branch is based on a convolutional U-Net structure and extracts 2D features from the input image.
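To make the three-branch design concrete, here is a minimal PyTorch sketch of such a triplet network. It is not the authors' implementation: the real voxel branch uses sparse convolutions and the point branch follows PointNet++, whereas this sketch substitutes a dense 3D CNN and a PointNet-style shared MLP purely to stay self-contained; all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative skeleton of the three-branch (voxel / point / pixel) design.
# The actual method uses sparse 3D convolutions for the voxel branch and a
# PointNet++-style encoder for the point branch; the dense layers below are
# placeholders chosen only to keep the sketch short and runnable.

class VoxelBranch(nn.Module):              # stand-in for the sparse-conv branch
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, feat_dim, 3, stride=2, padding=1),
        )

    def forward(self, vox):                # vox: (B, 1, D, H, W) occupancy grid
        return self.net(vox)               # (B, C, D/4, H/4, W/4) voxel features


class PointBranch(nn.Module):              # stand-in for a PointNet++-style encoder
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1),
        )

    def forward(self, pts):                # pts: (B, 3, N) point coordinates
        return self.mlp(pts)               # (B, C, N) per-point features


class PixelBranch(nn.Module):              # stand-in for the U-Net image encoder
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )

    def forward(self, img):                # img: (B, 3, H, W) input image
        return self.net(img)               # (B, C, H, W) per-pixel features


class TripletNet(nn.Module):
    """Maps voxels, points and pixels into one shared C-dimensional space."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.voxel_branch = VoxelBranch(feat_dim)
        self.point_branch = PointBranch(feat_dim)
        self.pixel_branch = PixelBranch(feat_dim)

    def forward(self, vox, pts, img):
        # The three outputs share the same channel dimension so that voxel and
        # point features can later be matched against pixel features by similarity.
        return (self.voxel_branch(vox),
                self.point_branch(pts),
                self.pixel_branch(img))
```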

In terms of 2D-3D feature matching, 3D elements are represented as a combination of voxels and points. Matching of 2D and 3D features is achieved by mapping them into a shared latent space. This VoxelPoint-to-Pixel matching creates a structured cross-modal latent space that provides a uniform feature distribution.

In order to handle outliers between the two modalities, an intersection detection method is introduced. The intersection region is defined by overlapping the 2D projection of the LiDAR point cloud (obtained with the ground-truth camera parameters) with the reference image. The detection strategy predicts the probability that each 2D/3D element lies in the intersection region, which helps to remove outlier regions in both modalities before inferring the 2D-3D correspondences. This improves matching accuracy and stability.
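Because the intersection labels come directly from the calibration, the labelling step can be sketched in a few lines of NumPy: project the point cloud with the ground-truth pose and intrinsics and mark the points (and the pixels they hit) that land inside the image. The simple pinhole model and all variable names below are assumptions for illustration, not the paper's exact preprocessing.

```python
import numpy as np

def intersection_labels(points, K, T_gt, img_h, img_w):
    """Label which 3D points fall inside the image when projected with the
    ground-truth pose, and which pixels are hit by at least one point.

    points: (N, 3) LiDAR points, K: (3, 3) intrinsics,
    T_gt: (4, 4) ground-truth LiDAR-to-camera transform.
    """
    # Transform points into the camera frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4)
    cam = (T_gt @ pts_h.T).T[:, :3]                              # (N, 3)
    in_front = cam[:, 2] > 0.1                                   # keep points in front of the camera

    # Pinhole projection to pixel coordinates.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)

    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    point_label = in_front & inside                              # (N,) 3D intersection mask

    pixel_label = np.zeros((img_h, img_w), dtype=bool)           # 2D intersection mask
    uv_int = uv[point_label].astype(int)
    pixel_label[uv_int[:, 1], uv_int[:, 0]] = True
    return point_label, pixel_label
```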

Figure 2: t-SNE visualization of the latent space learned with point-to-pixel (P2P) and VoxelPoint-to-Pixel (VP2P) matching.

3.1 Adaptive weighted optimization strategy
Adaptive weighted optimization addresses the feature matching problem between 2D and 3D elements. Traditional objectives such as the contrastive loss and the triplet loss struggle with 2D-3D feature matching. We therefore propose an adaptive weighted optimization strategy that, for a set of paired 2D-3D samples, weights the positive and negative pairs with adaptive weighting factors for more flexible optimization.
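The exact formulation is given in the paper; as an illustration of the idea, the sketch below implements a Circle-loss-style adaptively weighted contrastive objective, in which each positive/negative pair is weighted by how far it still is from its target similarity, so that harder pairs are emphasized automatically. The margins and scale are hypothetical hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def adaptive_weighted_loss(feat_2d, feat_3d, pos_mask,
                           margin_pos=0.9, margin_neg=0.3, scale=16.0):
    """Circle-loss-style adaptively weighted matching loss (illustrative).

    feat_2d: (M, C) pixel features, feat_3d: (N, C) point/voxel features,
    pos_mask: (M, N) bool tensor, True for ground-truth 2D-3D correspondences.
    """
    sim = F.normalize(feat_2d, dim=1) @ F.normalize(feat_3d, dim=1).T   # (M, N) cosine similarity

    # Adaptive weights: pairs far from their target similarity get larger weights.
    # Weights are detached so they act as constants during backpropagation.
    w_pos = torch.clamp(margin_pos - sim, min=0).detach()
    w_neg = torch.clamp(sim - margin_neg, min=0).detach()

    logits_pos = -scale * w_pos * (sim - margin_pos)   # pull positives above margin_pos
    logits_neg = scale * w_neg * (sim - margin_neg)    # push negatives below margin_neg

    neg_mask = ~pos_mask
    lse_pos = torch.logsumexp(logits_pos.masked_fill(neg_mask, -1e9), dim=1)
    lse_neg = torch.logsumexp(logits_neg.masked_fill(pos_mask, -1e9), dim=1)
    return F.softplus(lse_pos + lse_neg).mean()
```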
3.2 Differentiable PnP
The 2D-3D correspondences are established by first removing outlier regions in both modalities through intersection detection and then performing 2D-3D feature matching by nearest-neighbor search in the cross-modal latent space. To establish a correspondence, an arg max operation searches for the point coordinates with maximum similarity in the cross-modal latent space. This operation is non-differentiable, but gradients are obtained through a Gumbel estimator to enable end-to-end training. The probabilistic PnP method interprets the output as a probability distribution and is supervised by minimizing a KL-divergence loss between the predicted pose distribution and the ground-truth pose distribution. In addition, an exact pose is solved by an iterative PnP solver based on the Gauss-Newton (GN) algorithm and a pose loss is computed; since the iterative GN step is differentiable, the pose loss also participates in the optimization.
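As a concrete sketch of how an arg max match can still pass gradients, the snippet below uses a straight-through soft-argmax to pick, for each pixel feature, the most similar 3D point while keeping the coordinate lookup differentiable. The matched 2D-3D pairs (with similarity-based weights) would then feed a probabilistic PnP layer (in the spirit of EPro-PnP) for the KL-based pose supervision. This is an illustrative assumption, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def differentiable_match(pix_feat, pt_feat, pt_xyz, tau=0.1):
    """Straight-through soft-argmax matching (illustrative).

    pix_feat: (M, C) features of pixels inside the intersection region.
    pt_feat:  (N, C) features of 3D points, pt_xyz: (N, 3) their coordinates.
    Returns matched 3D coordinates per pixel; the forward pass is a hard argmax,
    while gradients flow through the soft assignment.
    """
    sim = F.normalize(pix_feat, dim=1) @ F.normalize(pt_feat, dim=1).T   # (M, N)
    soft = F.softmax(sim / tau, dim=1)                                   # soft assignment
    hard = F.one_hot(soft.argmax(dim=1), num_classes=soft.shape[1]).float()
    assign = hard + soft - soft.detach()      # straight-through estimator
    matched_xyz = assign @ pt_xyz             # (M, 3) matched 3D coordinates
    weights = (assign * sim).sum(dim=1)       # similarity of the chosen match
    return matched_xyz, weights

# The (pixel, matched_xyz, weights) triples form weighted 2D-3D correspondences
# that a probabilistic PnP layer could turn into a pose distribution, supervised
# with a KL-divergence loss against the ground-truth pose, plus a pose loss from
# a differentiable Gauss-Newton refinement.
```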

Experiments

We evaluate our method on the image-to-point cloud registration task on two widely used benchmark datasets, KITTI and nuScenes. On both datasets, images and point clouds are captured simultaneously by 2D cameras and 3D LiDAR sensors.
4.1 Quantitative and qualitative comparisons
Quantitative comparison: Our method shows excellent performance on the KITTI and nuScenes datasets, especially in RTE, where it is about 4 times better than the latest CorrI2P method. Through the end-to-end training framework combined with the probabilistic PnP solver, our method learns robust 2D-3D correspondences and achieves more accurate predictions, as shown in Table 1.
Visual comparison: The visual comparison in Figure 5 shows that our method achieves better registration accuracy under different road conditions. Our method solves the registration problem more accurately than other methods, especially in difficult cases such as rows 1 and 2, where methods such as DeepI2P and CorrI2P fail to correctly match the projections of trees and cars with the corresponding pixels in the image.
4.2 Accuracy of feature matching
Figure 6 visualizes the feature matching results, where a two-sided error map is generated by computing the matching distance in both modalities. For 2D-to-3D matching, we find, for each 2D pixel in the intersection region, the point with the greatest similarity and compute the Euclidean distance between the projected matched point and the 2D pixel. The results show that our method significantly outperforms CorrI2P in both 2D-to-3D and 3D-to-2D matching. Our method achieves errors of less than 2 pixels for most matches, indicating that the learned shared latent space accurately distinguishes cross-modal patterns and enables precise feature matching. Relatively large errors may remain at the edges of images and point clouds, because intersection detection is difficult to perform perfectly in edge regions.
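The per-pixel matching error described here can be reproduced in a few lines: for each intersection pixel, take its most similar 3D point, project that point with the ground-truth pose and intrinsics, and measure the pixel-space Euclidean distance. The helper below is an illustrative sketch with assumed names, mirroring the projection used for the intersection labels.

```python
import torch
import torch.nn.functional as F

def matching_error_2d_to_3d(pix_feat, pix_uv, pt_feat, pt_xyz, K, T_gt):
    """Per-pixel 2D-to-3D matching error in pixels (illustrative).

    pix_feat: (M, C), pix_uv: (M, 2) pixel coordinates in the intersection area,
    pt_feat: (N, C), pt_xyz: (N, 3), K: (3, 3), T_gt: (4, 4) ground-truth pose.
    """
    sim = F.normalize(pix_feat, dim=1) @ F.normalize(pt_feat, dim=1).T
    best = sim.argmax(dim=1)                       # index of the most similar 3D point
    xyz = pt_xyz[best]                             # (M, 3) matched points

    # Project the matched points with the ground-truth pose and intrinsics.
    cam = (T_gt[:3, :3] @ xyz.T).T + T_gt[:3, 3]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

    return torch.linalg.norm(uv - pix_uv, dim=1)   # Euclidean error per pixel
```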
4.3 Running efficiency
We compare efficiency with other methods on an NVIDIA RTX 3090 GPU and an Intel Xeon E5-2699 CPU. As shown in Table 2, our method has fewer parameters and performs significantly better. In addition, our method takes only 0.19 seconds per frame for network inference and pose estimation, which is approximately 50 times (or more) faster than previous methods.
4.4 Ablation experiments
We conducted an ablation study to verify the effectiveness of each design in our method and the impact of some important parameters, reporting RTE/RRE/Acc. on the KITTI dataset.
Framework design verification: We verified the effectiveness of each design in the framework through four variants: removing the voxel branch, removing the point branch, replacing the adaptively weighted loss, and removing the end-to-end supervision driven by the differentiable PnP. The results in Table 3 show that the full model performs best among all variants, demonstrating the effectiveness of each design. In particular, removing the voxel branch hurts performance more than removing the point branch, indicating that the voxel modality is more suitable for learning image-to-point cloud registration.
Input resolution impact: We further investigated the impact of input image resolution and point cloud density, with results shown in Table 4. Using a higher resolution in both modalities yields better results, because low-resolution images may lose visual information and low-density point clouds may lose detailed geometric structure. We choose settings that balance performance and efficiency.

Summary

This work proposes a novel framework named VoxelPoint-to-Pixel matching, which aims to learn registration between images and point clouds. The framework adopts an adaptive weighted loss method to learn a structured cross-modal latent space.

In this framework, we represent 3D elements as a combination of voxels and points to overcome the domain difference between point clouds and pixels. This combined representation method better captures the correspondence between images and point clouds.

Furthermore, we introduce a differentiable PnP solver to directly supervise the predicted pose distribution, enabling end-to-end training. In this way, our framework can learn and optimize the registration process more efficiently.

We conduct extensive experiments on the KITTI and nuScenes datasets, demonstrating the superior performance of our framework. This work has important practical significance for solving the registration problem between images and point clouds.

Source: https://blog.csdn.net/weixin_47869094/article/details/135138645