KITTI 2D and 3D Vehicle Detection Algorithms

Vision-Based 2D Vehicle Detection

                   Table 1: KITTI vehicle detection leaderboard, vision-based methods

Method	Moderate	Easy	Hard	Runtime
SubCNN[1]	89.04 %	90.81 %	79.27 %	2 s / GPU
MS-CNN[2]	89.02 %	90.03 %	76.11 %	0.4 s / GPU
SDP+RPN[3]	88.85 %	90.14 %	78.38 %	0.4 s / GPU
Mono3D[4]	88.66 %	92.33 %	78.96 %	4.2 s / GPU
3DOP[5]	88.64 %	93.04 %	79.10 %	3 s / GPU
MV3D (LIDAR+MONO)[6]	87.67 %	89.11 %	79.54 %	0.45 s / GPU
SDP+CRC[7]	83.53 %	90.33 %	60.70 %	0.6 s / GPU
Faster R-CNN[8]	81.84 %	86.71 %	65.38 %	2 s / GPU
AOG[9]	75.94 %	84.80 %	60.70 %	3 s / 4 cores
3DVP[10]	75.77 %	87.46 %	65.38 %	40 s / 8 cores
LSVM-MDPM[11]	56.48 %	68.02 %	44.18 %	10 s / 4 cores
ACF[12]	54.74 %	55.89 %	42.98 %	0.2 s / 1 core

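Note: Values are average precision (AP) at each difficulty level, where the levels are defined by object size, occlusion, and truncation. Higher is better.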
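The note above does not say how AP is computed. As a minimal sketch, assuming the PASCAL VOC-style interpolation over 11 evenly spaced recall thresholds (an assumption, not something stated in this article), AP can be derived from a precision-recall curve as follows. The function name and the toy curve are hypothetical.

```python
def interpolated_ap(recalls, precisions, num_points=11):
    """Interpolated average precision over evenly spaced recall thresholds.

    recalls and precisions are parallel lists describing a precision-recall
    curve. num_points=11 is an assumption (PASCAL VOC-style protocol).
    """
    total = 0.0
    for i in range(num_points):
        threshold = i / (num_points - 1)
        # Interpolated precision: best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / num_points


# Toy curve (made-up numbers, for illustration only).
recalls = [0.0, 0.5, 1.0]
precisions = [1.0, 0.8, 0.4]
ap = interpolated_ap(recalls, precisions)
```

A detector is ranked separately on each difficulty level by running this kind of computation on the ground-truth subset that belongs to that level.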

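The KITTI dataset contains many objects at different scales, as well as small objects that are often heavily occluded or truncated. Region-based networks have difficulty detecting such objects, so several methods have been proposed to obtain better object proposals (e.g., MS-CNN[2]).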

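3D information estimated from a stereo camera pair helps further. Inspired by this, Chen et al. (Mono3D[4]) propose class-specific 3D object proposals for monocular images. In this line of work, 3D candidate boxes are scored using 3D point cloud features. Finally, a CNN that exploits contextual information and uses a multi-task loss jointly regresses the object's coordinates and orientation.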

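Cai et al. (2016) MS-CNN[2] propose a multi-scale CNN consisting of a proposal sub-network and a detection sub-network. The proposal network performs detection at multiple output layers, and these complementary scale-specific detectors are combined to produce a strong, state-of-the-art (SOA) multi-scale object detector.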
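The text says only that the scale-specific detectors are combined, not how. One common way to merge detections from several output layers is to pool them and apply greedy non-maximum suppression (NMS). The sketch below shows that generic step under this assumption and should not be read as the MS-CNN authors' procedure. The function names, the threshold, and the boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_multiscale(detections_per_layer, iou_threshold=0.5):
    """Pool (box, score) detections from all layers and apply greedy NMS."""
    pooled = [d for layer in detections_per_layer for d in layer]
    pooled.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in pooled:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# Hypothetical outputs of two detection layers (made-up boxes and scores).
fine_layer = [((10, 10, 30, 30), 0.9)]
coarse_layer = [((12, 11, 31, 29), 0.6), ((100, 50, 220, 140), 0.8)]
merged = merge_multiscale([fine_layer, coarse_layer])
```

In this toy input the small car is reported by both layers, and only the higher-scoring box survives, while the large car found by the coarse layer is kept untouched.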

LiDAR-Based 3D Vehicle Detection

                 Table 2: KITTI vehicle detection leaderboard, LiDAR-based methods

Method	Moderate	Easy	Hard	Runtime
MV3D (LIDAR + MONO)[6]	87.67 %	89.11 %	79.54 %	0.45 s / GPU
MV3D (LIDAR)[6]	79.24 %	87.00 %	78.16 %	0.3 s / GPU
MV-RGBD-RF[13]	69.92 %	76.40 %	57.47 %	4 s / 4 cores
Vote3Deep[14]	68.24 %	76.79 %	63.23 %	1.5 s / 4 cores
VeloFCN[15]	53.59 %	71.06 %	46.92 %	1 s / GPU
Vote3D[16]	47.99 %	56.80 %	42.57 %	0.5 s / 4 cores
CSoR[17]	26.13 %	34.79 %	22.69 %	3.5 s / 4 cores

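The KITTI dataset provides synchronized camera and LiDAR data, which allows image-based and LiDAR-based methods to be compared on the same data. Unlike a camera, a LiDAR range sensor directly provides accurate 3D information, which simplifies the extraction of object candidates and supplies 3D shape information that is useful for classification. However, 3D data from a laser scanner is typically sparse, and its spatial resolution is limited. As a result, state-of-the-art methods that rely on laser range data alone do not yet match the performance of camera-based detection systems. Table 2 shows the state of the art in LiDAR-based detection on the KITTI benchmark, which covers cars, pedestrians, and cyclists; the table lists the vehicle results. Performance is evaluated as for image-based methods, by projecting the 3D bounding boxes into the image plane.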
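The projection step mentioned above can be sketched as follows. This is a simplified illustration: it assumes a 3x4 pinhole projection matrix and an axis-aligned box already expressed in camera coordinates, and it ignores box rotation and points behind the camera. The matrix values and function names are hypothetical, not taken from the KITTI calibration files.

```python
from itertools import product


def project_point(P, point):
    """Project a 3D point (camera coordinates) with a 3x4 matrix P."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    u, v, w = (sum(P[r][c] * homogeneous[c] for c in range(4)) for r in range(3))
    return u / w, v / w  # assumes w > 0, i.e. the point is in front of the camera


def box_corners(center, size):
    """Eight corners of an axis-aligned box; rotation is omitted for brevity."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [(cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
            for dx, dy, dz in product((-1, 1), repeat=3)]


def project_box(P, center, size):
    """2D box (x1, y1, x2, y2) enclosing the projected 3D corners."""
    pixels = [project_point(P, c) for c in box_corners(center, size)]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical pinhole camera: focal length 700 px, principal point (600, 180).
P = [[700.0, 0.0, 600.0, 0.0],
     [0.0, 700.0, 180.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
box2d = project_box(P, center=(0.0, 1.0, 20.0), size=(2.0, 1.5, 4.0))
```

The resulting 2D rectangle can then be scored against 2D ground truth with the same overlap criterion used for image-based detectors.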

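Wang & Posner (2015) Vote3D[16] propose an efficient way to apply the widely used 2D sliding-window detection approach to 3D data. More specifically, they use a voting scheme that exploits the sparsity of the problem to search over all possible object locations and orientations. Li et al. (2016b) VeloFCN[15] improve on these results by using a fully convolutional neural network to detect vehicles from range data. They represent the data as a 2D point map and use a single 2D CNN to predict objectness confidence and bounding boxes simultaneously. The encoding used to represent the data allows them to predict the full 3D bounding box of each vehicle. Engelcke et al. (2016) Vote3Deep[14] use a feature-centric voting scheme to implement novel convolutional layers that exploit the sparsity of the point cloud. In addition, they propose using an L1 regularization penalty.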
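The idea behind feature-centric voting can be illustrated with a small sketch. It assumes scalar features on a sparse 3D grid and a single linear kernel, which is a simplification of the feature vectors and multiple orientations the papers handle. All names and numbers are hypothetical. Instead of sliding a window over every cell, each occupied cell casts weighted votes into the cells whose window would contain it, so the work scales with the number of occupied cells.

```python
from collections import defaultdict


def vote(sparse_grid, kernel):
    """Feature-centric voting: only occupied cells do any work.

    sparse_grid maps (x, y, z) -> feature value for occupied cells.
    kernel maps (dx, dy, dz) -> weight. Each occupied cell q casts a vote
    of weight * feature into cell q - offset, which reproduces the
    sliding-window score sum_o kernel[o] * grid[p + o] at every cell p.
    """
    scores = defaultdict(float)
    for (x, y, z), feature in sparse_grid.items():
        for (dx, dy, dz), weight in kernel.items():
            scores[(x - dx, y - dy, z - dz)] += weight * feature
    return dict(scores)


def sliding_window(sparse_grid, kernel, cells):
    """Dense reference: evaluate the window at every listed cell."""
    return {
        (x, y, z): sum(w * sparse_grid.get((x + dx, y + dy, z + dz), 0.0)
                       for (dx, dy, dz), w in kernel.items())
        for (x, y, z) in cells
    }


# Toy data (made-up): two occupied cells and a 3-tap kernel along x.
grid = {(0, 0, 0): 1.0, (2, 0, 0): 3.0}
kernel = {(-1, 0, 0): 0.5, (0, 0, 0): 1.0, (1, 0, 0): 0.25}
votes = vote(grid, kernel)
dense = sliding_window(grid, kernel, votes.keys())
```

Both functions give the same scores on the cells that receive votes, and every other cell scores zero, which is why the voting form loses nothing relative to an exhaustive search.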

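Because the density of laser scans is limited, relying on laser range data alone makes detection challenging. Consequently, existing LiDAR-based methods perform worse on KITTI than image-based ones. Chen et al. (2016c) MV3D[6] combine LiDAR range data with RGB images for object detection. In their approach, the sparse point cloud is encoded with a compact multi-view representation, and a proposal generation network uses the bird's-eye-view representation of the point cloud to generate 3D box proposals. Finally, they combine region-wise features from multiple views with a deep fusion scheme. This approach outperforms other LiDAR-based methods by a large margin and achieves state-of-the-art (SOA) performance on the KITTI car benchmark.

References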
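A bird's-eye-view encoding of a point cloud can be sketched as a simple rasterization onto the ground plane. The text does not specify which channels are used, so the sketch below assumes two: maximum height and point count per cell. The function name, grid extents, and points are hypothetical.

```python
def birds_eye_view(points, x_range, y_range, cell_size):
    """Rasterize (x, y, z) points into max-height and point-count grids.

    Returns (height, count). Empty cells have height None and count 0.
    The two channels are an assumption made for this illustration.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    rows = round((x_max - x_min) / cell_size)
    cols = round((y_max - y_min) / cell_size)
    height = [[None] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    for x, y, z in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # outside the region of interest
        r = min(int((x - x_min) / cell_size), rows - 1)
        c = min(int((y - y_min) / cell_size), cols - 1)
        count[r][c] += 1
        if height[r][c] is None or z > height[r][c]:
            height[r][c] = z
    return height, count


# Made-up points: two fall in the same cell, one lies outside the grid.
points = [(1.2, 0.3, -1.0), (1.4, 0.1, 0.5), (5.0, -2.0, 0.2), (100.0, 0.0, 0.0)]
height, count = birds_eye_view(points, x_range=(0.0, 10.0),
                               y_range=(-5.0, 5.0), cell_size=1.0)
```

The resulting grids form a dense 2D image, so a standard 2D convolutional network can consume them even though the underlying point cloud is sparse.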

[1] Xiang, Y., Choi, W., Lin, Y., & Savarese, S. (2016). Subcategory-aware convolutional neural networks for object proposals and detection. arXiv.org, 1604.04693.
[2] Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. In Proc. of the European Conf. on Computer Vision (ECCV).
[3] Yang, F., Choi, W., & Lin, Y. (2016). Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[4] Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., & Urtasun, R. (2016a). Monocular 3d object detection for autonomous driving. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[5] Chen, X., Kundu, K., Zhu, Y., Berneshawi, A. G., Ma, H., Fidler, S., & Urtasun, R. (2015c). 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems (NIPS).
[6] Chen, X., Ma, H., Wan, J., Li, B., & Xia, T. (2016c). Multi-view 3d object detection network for autonomous driving. arXiv.org, 1611.07759.
[7] Yang, F., Choi, W., & Lin, Y. (2016). Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[8] Ren, S., He, K., Girshick, R. B., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS).
[9] Wu, T., Li, B., & Zhu, S. (2016a). Learning and-or model to represent context and occlusion for car detection and viewpoint estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 38, 1829–1843.
[10] Xiang, Y., Choi, W., Lin, Y., & Savarese, S. (2015b). Data-driven 3d voxel patterns for object category recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[11] Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 32, 1627–1645.
[12] Dollár, P., Appel, R., Belongie, S. J., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 36, 1532–1545.
[13] González, A., Villalonga, G., Xu, J., Vázquez, D., Amores, J., & López, A. M. (2015). Multiview random forest of local experts combining RGB and LIDAR data for pedestrian detection. In Proc. IEEE Intelligent Vehicles Symposium (IV).
[14] Engelcke, M., Rao, D., Wang, D. Z., Tong, C. H., & Posner, I. (2016). Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. arXiv.org, 1609.06666.
[15] Li, B., Zhang, T., & Xia, T. (2016b). Vehicle detection from 3d lidar using fully convolutional network. In Proc. Robotics: Science and Systems (RSS).
[16] Wang, D. Z., & Posner, I. (2015). Voting for voting in online point cloud object detection. In Proc. Robotics: Science and Systems (RSS).
[17] Plotkin, L. (2015). PyDriver: Entwicklung eines Frameworks für räumliche Detektion und Klassifikation von Objekten in Fahrzeugumgebung. Master's thesis, Karlsruhe Institute of Technology.
[18] Behley, J., Steinhage, V., & Cremers, A. B. (2013). Laser-based segment classification using a mixture of bag-of-words. In Proc. IEEE International Conf. on Intelligent Robots and Systems (IROS).