Vehicle Detection on a Video Traffic Scene: Review and New Perspectives

Abstract-Vehicle detection applications play an important role in reducing the number of road accidents. In this vein, this paper summarizes the recent advances in vehicle detection approaches. Both motion-based and appearance-based approaches are dealt with, and the challenges and limitations of using handcrafted features are discussed. Moreover, we compare different approaches cited as new perspectives in object detection. Experiments performed on two videos illustrate the robustness of the deep-learning approach in which a generic detector is specialized to a specific scene.


1 INTRODUCTION

Cities are growing larger, with a subsequent increase in the transportation infrastructure. Consequently, the number of vehicles significantly rises and population density changes. Therefore, the number of road accidents is rapidly growing and becomes a worrying phenomenon. According to the World Bank, every year more than 1.17 million people die in road crashes around the world. The statistics show that deaths in the developing countries alone represent 70 percent of the total number. The greater part of road victims in these countries are pedestrians, motorcyclists, and bicyclists. This heavy loss weighs on the economy, as road crashes cost approximately 1 to 3 percent of a country's annual Gross National Product [1]. Thus, finding a solution is a valuable contribution toward making public roads safer, reducing rush-hour congestion and saving a large share of the country's wealth. Multiple engineering disciplines, such as transportation, civil, municipal and urban engineering, are concerned with finding sustainable solutions to road accidents. Over the past decade, the computer vision domain has been strongly involved in the traffic-scene understanding effort in order to contribute to saving lives and reducing the number of on-road fatalities [2]. Vision-based analysis systems, in response to such needs, have become popular in transportation management thanks to their capability to extract very rich information on road traffic compared to sensor-based systems [3]. Research covers important thematic sub-areas ranging from vehicle detection and statistics production to traffic parameter estimation and behavior analysis. The use of video cameras has become very common in computer vision systems for traffic surveillance thanks to the rich raw online information provided at a cheaper price. Cameras may be embedded on vehicles, fixed in the road infrastructure or mounted on unmanned aerial vehicles.
Among the three cases, conventional traffic data collection from cameras relying on fixed infrastructure is limited to a local region [4]. Since using a fixed mono camera in traffic detection is a frequently occurring case, in this paper we limit ourselves to an overview of vehicle detection on a video traffic scene using a fixed mono camera. The rest of the paper is organized as follows. Related works are reviewed in section 2. Section 3 gives an overview of the evaluation of detection algorithms. Section 4 includes the tests and results. The last section concludes and gives some perspectives.


2 RELATED WORKS

Vehicle detection is the first step in the process of traffic video analysis. Our literature review shows that research in this area has received considerable attention and has achieved significant recent progress. Although much progress has been made in this field, many difficulties remain to be addressed, especially how to obtain acceptable results under unstable conditions. Indeed, the accuracy of vehicle detection is usually affected by changing environments and different weather conditions. In complex urban real-world outdoor scenes, the main challenges faced by a traffic image analysis system are:


- On-road detection of the multiple actors sharing the public roads is a non-trivial task due to (1) their various shapes and types: cars, vans, commercial vehicles, heavy-goods vehicles, trucks, buses, motorcycles, bicycles, trams, and pedestrians, and (2) the intra-class variability of objects, whose visual appearance depends on their pose, size and color.


- Lighting changes between night time and day time, from clear to dark sky, and other illumination variations.

- Motion blur effects for a moving road object.
- Poor visibility conditions caused by rain or fog.
- Occlusion by other vehicles, pedestrians, road signs, or trees.
- Object scale variations that occur when objects move.
- Deformation of the same object seen from different points of view.
- Shadows of objects on sunny days, whose orientation changes with the sun's position, altering the apparent vehicle shape.
- Cars close to each other in rush hours, which may be merged into a single region of interest.

- Moving background objects, such as trees swaying or flags flapping in the wind, are very difficult to include in the background model, since each background pixel can be modeled by many different colors in this situation. Camera vibration and camera jitter can also be associated with this problem.


In the majority of the proposed approaches, the vehicle detection process is carried out in two main steps: 1) hypothesis generation, in which we assume the potential vehicle locations, and 2) hypothesis verification, where tests are carried out in order to verify the correctness of the hypothesis [5],[6],[7],[8],[9]. For hypothesis generation, Yan et al. proposed an approach for detecting potential vehicle locations based on the shadows under vehicles [5]. Since the hypothesis generation step is tainted with detection errors (false alarms, missing detections), we need to adjust the results using context information such as scene structures and detected vehicle locations and sizes.


Mainly two approaches have been used to hypothesize the detection of potential vehicle locations in traffic video analysis: 1) motion-based methods and 2) appearance-based methods. In this paper we are particularly interested in the second family of approaches, which is much more frequently used.


Motion-based methods

In these approaches, a sequence of images is required to detect a vehicle [2]. Various motion-based approaches have been used in the literature to detect moving objects; however, background subtraction and optical flow are the most widely used. Detecting a moving object using the background subtraction approach is very popular since it is easy to implement. Intuitively, this approach requires a video sequence of the current view and a reference sequence without any moving objects. The background is then subtracted by differentiating the current sequence from a static reference background model. The pixels that deviate significantly from the background are considered to be moving or foreground objects [10],[11],[12]. In [13], vehicle detection in a long tunnel was performed by comparing the difference between two consecutive sequences to a pre-learned value. The authors highlighted a preprocessing step in which the illumination was compensated. On the other hand, optical flow has also been utilized for vehicle detection. It represents the most natural way of motion-based moving-object detection: moving objects are detected by tracking the apparent changes of the (x,y) pixel intensities from one frame to the next. Yet apparent motion can also be produced by illumination changes on an immobile object [14]. Moreover, this method suffers from a performance drop, with an increased false positive rate, mostly when the colors of moving objects are comparable to those of the background, when the scene illumination changes, or when the background is moving or noisy.
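As a minimal sketch of the background-subtraction idea described above, foreground pixels can be flagged wherever the current frame deviates from a static background model by more than a threshold (the function name and threshold value are illustrative, not taken from the cited works):

```python
import numpy as np

def foreground_mask(frame, background, threshold=30):
    """Flag pixels that deviate significantly from the background model."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# Toy scene: uniform gray background, a bright 2x2 "vehicle" enters.
background = np.full((6, 6), 100, dtype=np.uint8)
frame = background.copy()
frame[2:4, 2:4] = 200

mask = foreground_mask(frame, background)
print(int(mask.sum()))  # 4 foreground pixels
```

Real systems maintain an adaptive background model (e.g. a running average or a mixture of Gaussians per pixel) precisely because of the moving-background and illumination problems listed earlier.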


Appearance-based methods

These approaches, also called machine learning methods, have been used to directly detect vehicles in images [2]. The performance of the detection system heavily depends on the choice of data representation. Haar-like features and Histograms of Oriented Gradients (HOG) have been widely used to detect objects [15]. In this context, Wen et al. [16] utilized Haar-like features and an SVM to detect vehicles in videos. First, Haar-like features were extracted to represent the vehicle's edges and structures. Then, they put forward a rapid feature selection algorithm using AdaBoost, combining the feature values with their labels [17]. Finally, the retained feature vector was normalized and used to feed an SVM classifier. Two classes were considered: vehicle as the positive class and non-vehicle as the negative one. Negri et al. [18] compared Haar-like features with HOG-based features, utilizing the same classification algorithm based on AdaBoost. Two strategies were tested: a single-stage detector and a multistage one. They reached an average detection rate of 96% on road-vehicle images. Nevertheless, the success of these approaches generally depends on the stability of the data representation features with respect to moving objects' scale changes, translation, rotation and intra-class variability.
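To make the HOG idea concrete, the following toy sketch computes a single L2-normalized orientation histogram of gradients over a whole patch. This is only a didactic stand-in: a real HOG descriptor aggregates per-cell histograms with block normalization, and the function name and bin count here are illustrative choices.

```python
import numpy as np

def hog_like_descriptor(patch, n_bins=9):
    """Orientation histogram of gradients over a whole patch,
    L2-normalized -- a toy stand-in for a real cell/block HOG."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A patch containing a vertical edge yields a histogram peaked at 0 degrees.
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
desc = hog_like_descriptor(patch)
print(desc.shape)  # (9,)
```

Such a fixed-length vector is what feeds the SVM classifier in the pipeline described above.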


The above detection approaches have many limitations, to wit:


- The performance of an object detection system based on handcrafted feature extraction is mainly dependent on the choice of data representation. Feature extraction is the key step in the process of object detection for machine learning-based methods. Over the last decades, significant efforts have been deployed in the machine learning domain to look for discriminative and invariant features (to scale, rotation, translation, ...). However, relevant feature extraction remains problematic owing to the wide intra-class variability, dynamic environments and moving objects.


- The performance of the trained classifier depends basically on the training dataset. The training features used to construct a trained model must capture characteristics that sufficiently describe each class and discriminate between classes. Furthermore, training and test features are assumed to be extracted from the same feature space and to have the same distribution. Generic object detectors usually perform very well when tested on a specific scene; however, their performance drops considerably when they are tested on another scene, because of the wide variability between the source training dataset and the target scene. Covering this wide intra-class variability requires a huge database that also takes into account all the other parameters interacting with the scene, as well as the variations caused by object movement. Training a system with such a database leads to a complex model, long computation times and convergence problems.


2.1 New trends

In this section, we focus on three new trends proposed in the literature to face the problems of vehicle detection with machine learning methods: 1) deep learning, to resolve the problem linked to feature extraction; 2) transfer learning, to resolve the problem of vehicle detection in a target scene that is different from the source one; and 3) data fusion, to reinforce the description of the data.


Deep-learning networks

A new area of machine learning research, commonly called deep learning (DL), has emerged; it extracts key characteristics automatically, without human intervention [19]. The wide use of DL architectures is driven by recent advances in graphics cards and the availability of huge databases. For instance, the deep-learning neural network known as the Convolutional Neural Network (CNN) is commonly made of multiple layers of non-linear operations, in contrast to classical neural network architectures (figure 1). Adding more hidden layers increases the abstraction of the feature hierarchy, starting from low-level features extracted directly from the images fed into the input layer. These architectures are able to extract hierarchical representations from a large number of images according to their similarities [20]. However, the training time of the CNN remained too long; recent advances in object localization [21] led [22] to propose an improved CNN version called the Region-based CNN (R-CNN).
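The layer stack described above can be sketched with a toy single-channel convolution, a ReLU non-linearity, and max pooling. This is a didactic sketch only: a real CNN learns its kernels from data and stacks many such layers, and the kernel and image below are invented for illustration.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2(x):
    """Non-overlapping 2x2 max pooling."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# One conv -> ReLU -> pool stage on a toy image with a gradient filter.
image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[-1.0, 1.0]])  # horizontal-gradient filter
feature_map = max_pool2(relu(conv2d(image, edge_kernel)))
print(feature_map.shape)  # (3, 2)
```

Stacking several such stages is what produces the increasingly abstract feature hierarchy mentioned above.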



This method reduces the training time by computing the CNN features only on extracted regions of interest. The proposal regions are presented to a CNN, which extracts a fixed-length feature vector that feeds a classifier [22] (figure 2). Finally, the classifier is used to classify each proposal region into an object category or background [24].



Unfortunately, the accuracy of the R-CNN depends mainly on the performance of the predicted object bounding boxes [24], and the processing time remains significant. A significant improvement in detection speed is achieved by using a new R-CNN structure, named Fast R-CNN. The input to a Fast R-CNN network is the image together with multiple regions of interest. A fixed-length feature vector is extracted by first computing a conv feature map over the whole image using convolutional and max pooling layers, and then converting the features inside each region of interest (RoI) into a small feature map. The feature vectors feed fully connected layers (figure 3.a). The Fast R-CNN outputs, for each RoI, a softmax probability and a bounding-box position [24, 25]. In a more recent work, we have considered the Faster R-CNN, in which a Region Proposal Network (RPN) is placed before the Fast R-CNN. An RPN is a fully convolutional deep network that delivers a set of region proposals to the Fast R-CNN (figure 3.b).
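The conversion of an arbitrary RoI into a small fixed-size feature map can be illustrated with a minimal RoI max-pooling sketch. It is illustrative only: Fast R-CNN implements this as a differentiable layer over multi-channel feature maps, and the grid-splitting convention below is one simple choice.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """Pool an arbitrary RoI of a conv feature map into a fixed-size map,
    in the spirit of Fast R-CNN's RoI pooling layer.
    roi = (y0, x0, y1, x1) with exclusive end coordinates."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    oh, ow = out_size
    out = np.empty((oh, ow))
    # Split the region into an oh x ow grid and take the max of each cell.
    ys = np.linspace(0, region.shape[0], oh + 1).astype(int)
    xs = np.linspace(0, region.shape[1], ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)
pooled = roi_max_pool(fmap, (0, 0, 4, 6))
print(pooled.tolist())  # [[10.0, 13.0], [26.0, 29.0]]
```

Whatever the RoI size, the output is always `out_size`, which is what lets a single set of fully connected layers process every proposal.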



Transfer Learning

In recent years, Transfer Learning (TL) has been proposed to (1) specialize a generic classifier towards a specific scene [26] and (2) transfer characteristics of one object class to another class sharing some common shape features (e.g. bicycle and motorbike). In both cases, TL tries to use the knowledge from the source (generic) domain, whose examples are well labeled, to learn a classifier in the target domain, which contains few or no labeled data.


The main problem in specializing a generic classifier towards a specific scene is the following: when using a generic classifier trained on a source database, how can we select, from the target database, the samples most likely to transfer knowledge to the new domain? In this context, Wang et al. [27] used, in addition to the source dataset, different contextual cues such as pedestrian motion, the road model (pedestrians, cars, ...), location, size and the objects' visual appearance to generate a confidence score. Confidence scores from the target scene are used to select positive and negative samples of the target domain. In this work, a confidence-encoded SVM is defined to select only the source samples that are good for the classification in the target scene. Recently, Maamatou et al. [26] proposed a transductive TL approach to specialize a generic classifier.


A generic detector is first created by training a classifier on the INRIA Person Dataset as a generic database. Then, it is utilized to predict samples from the target database. The authors suggested a TL approach based on the Sequential Monte Carlo (SMC) filter: the SMC filter selects the relevant samples and estimates the unknown target distribution. Those samples can be used to learn a specialized classifier that significantly improves the detection performance. The suggested method was validated on a pedestrian detection application using the HOG-SVM classifier. The experiments showed that the proposed specialization framework performs well on the CUHK Square and MIT Traffic databases. For the transfer of knowledge from one category to another sharing some common characteristics, Aytar et al. [28] put forward object category detectors that reuse a generic HOG detector template previously learnt on a similar category (motorbike). The information of this source detector is transferred and tested on a new target category (bicycle). The experiments performed on the PASCAL VOC database showed promising results for TL from one class to another.
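The sample-selection step of such an SMC specialization can be caricatured as importance resampling: candidate samples from the target scene are drawn in proportion to the confidence the generic detector assigns them. The scores and sizes below are invented for illustration; the actual filter in [26] also includes prediction and update steps.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical confidence scores given by the generic detector to
# six candidate samples extracted from the target scene.
scores = np.array([0.9, 0.8, 0.1, 0.05, 0.7, 0.2])

# Importance-resampling step: draw samples in proportion to confidence,
# approximating the unknown target distribution.
weights = scores / scores.sum()
selected = rng.choice(len(scores), size=1000, p=weights)

counts = np.bincount(selected, minlength=len(scores))
print(counts)
```

High-confidence samples dominate the resampled set, which is then used to retrain (specialize) the classifier for the target scene.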


Data fusion

Fusion can basically be carried out at three levels: pixel-level fusion, feature-level fusion, and decision-level fusion.


Multi-sensor image fusion has gained considerable attention in the last decade. Image fusion combines multi-modality sensor information of the same scene to provide rich information combining different points of view of the scene [29]. In [30], the authors used both LiDAR and stereo cameras, taking advantage of two different sensors for vehicle detection and classification. Feature fusion aims to reinforce the description through the cooperation of more than one characteristic. Although the progress is impressive, especially in biometric systems, many open problems remain unsolved: (1) how to find the best subset among all features; (2) how to ensure that all features cooperate without redundancy; and (3) how to choose the best level of fusion (input, features, decision, ...). In [7], a vehicle detection approach was proposed utilizing an AdaBoost classifier fed by Haar-SURF features. Decision-level fusion consists in combining the results of multiple algorithms' outputs to yield a final merged decision. The obtained decision represents a common interpretation of the multiple algorithms.
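Decision-level fusion, in its simplest form, is a majority vote over the outputs of several detectors for the same region. A minimal sketch (the labels and detector outputs are hypothetical):

```python
from collections import Counter

def decision_fusion(decisions):
    """Decision-level fusion: majority vote over the outputs of
    several independent detectors for the same image region."""
    votes = Counter(decisions)
    return votes.most_common(1)[0][0]

# Three hypothetical detectors classify the same region of interest.
outputs = ["vehicle", "vehicle", "background"]
print(decision_fusion(outputs))  # vehicle
```

Weighted votes (e.g. by each detector's confidence) are a common refinement of this scheme.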


3 DETECTION ALGORITHM EVALUATION

To compare a proposed detection algorithm to the panorama of baseline algorithms of the state of the art, we need metrics and public databases. In what follows, we propose a list of common metrics and some databases.


3.1 Evaluation Metrics

Two groups of metrics are commonly used in detection: the first measures the accuracy of the result and the other measures precision [32]. The common criteria used in the literature are:


- Accuracy: Recall and Precision, as well as the False Positive rate
- Precision: Multiple Object Detection Precision (MODP)

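The accuracy metrics above follow directly from detection counts, and the overlap score that MODP averages over matched detections is an intersection-over-union. A minimal sketch (the counts and boxes are invented for illustration):

```python
def precision_recall(tp, fp, fn):
    """Accuracy metrics from detection counts:
    tp = correct detections, fp = false alarms, fn = missed vehicles."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1) --
    the per-detection overlap score averaged by MODP."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# 8 vehicles detected correctly, 2 false alarms, 2 vehicles missed.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r)  # 0.8 0.8
```

ROC and precision-recall curves, as used in section 4, are obtained by sweeping the detector's confidence threshold and recomputing these quantities at each point.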



3.2 Databases

The main problem of vehicle detection is the absence of an annotated public database of traffic videos on which the proposed algorithms can be validated and compared. Table 1 gives an overview of the publicly available datasets. It indicates the availability of ground truth and the number of cameras used (mono or multi-camera) [32].




4 EXPERIMENTAL RESULTS

In this section, we review some of our latest works [26],[33]. We evaluate and compare the three approaches mentioned as new trends in section 2.1.

Two sets of experiments and results are reported:


The first set (figure 4) shows the contribution of the specialization of a detector, based on an SMC, compared to a generic one and the effect of deeper specialization through several iterations. The experiments were performed on the CUHK Square dataset and the MIT traffic dataset. (For more detail, refer to [26]). Figures 4.a and 4.b demonstrate the improvement induced by specializing the detector (compared to the generic detector initially trained with the source database). As shown in figures 4a and 4b, each iteration results in measurable progress toward the specialization of the detector.


The second set (figure 5) reports the results of the implementation of three different algorithms for car detection and compares the methods via their ROC curves. The first curve provides the results of the application of the Faster R-CNN, and the second those of the HOG-Haar feature fusion. In the last curve, we present the result of the SMC Faster R-CNN. The latter is a new transfer method based on a Faster R-CNN deep network for detection and an SMC filter for specialization. The experiments are performed on a dataset composed of 2 videos taken by fixed cameras on highways. It contains 250 frames with images of 740 vehicles and 100 motorcycles. The dataset was proposed by Brad Philip and Paul Updike and taken on the freeways of southern California. We compare the three emerging approaches for road object detection. The recent SMC filter for the Faster R-CNN detector shows a higher performance when compared to the other approaches. Also, the results indicate that fusing the HOG and Haar descriptors is more interesting than the generic Faster R-CNN [33].



5 CONCLUSION

In this paper, a survey on vehicle detection on a video traffic scene has been presented. Two approaches have been presented to hypothesize the detection of potential vehicle locations in traffic video analysis. This review highlights the second family of approaches, which is more commonly used. We discuss three newly emerged perspectives on the recent state of the art: DL to solve the problem of feature extraction, TL to specialize a generic detector to more specific ones, and data fusion to reinforce the description. We evaluate and compare these three perspectives. The recent SMC Faster R-CNN shows a higher performance when compared to the generic Faster R-CNN and to the mixed HOG and Haar features. This work can be extended to multi-object detection.



Reposted from blog.csdn.net/qq_36122764/article/details/80703215