You Only Look Once: Paper Reading Notes

　　Paper link: https://arxiv.org/pdf/1506.02640.pdf

　　Code download: https://github.com/hizhangp/yolo_tensorflow

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
# YOLO is a new object detection method. Earlier methods repurpose classifiers for detection; YOLO instead treats detection as regression to spatially separated bounding boxes and their class probabilities. One network predicts boxes and class probabilities from the full image in a single evaluation, so the whole pipeline can be optimized end-to-end on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
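# The unified architecture is very fast: base YOLO runs at 45 frames per second, and the smaller Fast YOLO reaches 155 frames per second while still achieving twice the mAP of other real-time detectors.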

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
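# Compared with state-of-the-art systems, YOLO makes more localization errors but fewer false positives on background. It also learns very general object representations and outperforms other detectors, including DPM and R-CNN, when transferring from natural images to other domains such as artwork.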

Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought.
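# Humans understand what is in an image, where it is, and how the objects interact at a glance. Our visual system is fast and accurate enough that complex tasks like driving need little conscious thought.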

Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
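# Fast, accurate detection would let computers drive cars without specialized sensors, let assistive devices relay real-time scene information to their users, and open the way to general-purpose, responsive robotic systems.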

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].
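# Current systems repurpose classifiers for detection: a classifier for the target object is evaluated at many locations and scales in the test image. DPM, for example, slides a window over the entire image and runs the classifier at evenly spaced locations [10].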

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
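# Newer methods such as R-CNN first generate candidate bounding boxes with region proposals and then classify each proposed box. Post-processing then refines the boxes, removes duplicate detections, and rescores boxes using the other objects in the scene [13]. These pipelines are slow and hard to optimize because each component must be trained separately.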

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
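# YOLO recasts detection as a single regression problem, going straight from image pixels to bounding box coordinates and class probabilities: you only look once at the image to predict which objects are present and where.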

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
# As Figure 1 shows, YOLO is simple: one convolutional network predicts multiple bounding boxes and their class probabilities at once, trains on full images, and directly optimizes detection performance. This unified model has several advantages over traditional detectors.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps.
# First, YOLO is extremely fast. Because detection is framed as regression, no complex pipeline is needed; the network is simply run on a new image at test time. On a Titan X GPU without batching, the base network runs at 45 fps and the fast version at more than 150 fps.

This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
# Streaming video can therefore be processed in real time with under 25 ms of latency, and YOLO achieves more than twice the mean average precision of other real-time systems. A real-time webcam demo is available at http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.
# Second, YOLO reasons about the whole image. Unlike sliding-window and region-proposal techniques, it sees the entire image during training and testing, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN [14] mistakes background patches for objects because it cannot see the larger context; YOLO makes fewer than half as many background errors.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
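# Third, YOLO learns generalizable representations. Trained on natural images and tested on artwork, it outperforms top detectors such as DPM and R-CNN by a wide margin, and it is less likely to break down on new domains or unexpected inputs.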

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.
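# YOLO still trails state-of-the-art detectors in accuracy: it identifies objects quickly but struggles to localize some of them precisely, especially small ones. The experiments examine these tradeoffs.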

All of our training and testing code is open source. A variety of pretrained models are also available to download.
# All training and testing code is open source, and a variety of pretrained models can be downloaded.

Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously.
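# The separate components of detection are unified into one network, which uses features from the whole image to predict each bounding box and predicts all boxes for all classes simultaneously.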

This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
# The network therefore reasons globally about the full image and every object in it. This design allows end-to-end training and real-time speed while keeping average precision high.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
# The input image is divided into an S × S grid. The grid cell that contains an object's center is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
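# Each grid cell predicts B bounding boxes with a confidence score for each. The score reflects how confident the model is that the box contains an object and how accurate it believes the predicted box to be.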

Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
# Confidence is formally defined as Pr(Object) ∗ IOU^truth_pred. It should be zero when the cell contains no object; otherwise it should equal the IOU between the predicted box and the ground truth.
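
The IOU in the confidence definition can be computed as below. This is a minimal sketch, not code from the paper or the linked repository: the function name `iou` and the corner format `(x_min, y_min, x_max, y_max)` are assumptions chosen for illustration (YOLO itself predicts boxes in center format).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes.

    Each box is (x_min, y_min, x_max, y_max); this corner format is an
    assumption made for the example.
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

During training, the confidence target for a box that contains an object would then be this IOU value, and zero for a box that does not.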

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
# Each bounding box has five predictions: x, y, w, h, and confidence. (x, y) is the box center relative to the bounds of the grid cell; width and height are relative to the whole image; confidence is the IOU between the predicted box and any ground truth box.
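
To make the parametrization concrete, the sketch below converts a ground-truth box given in pixels into the responsible grid cell plus the four box values. The function name `encode_box`, its argument order, and the pixel-based center format are assumptions for illustration, not the paper's or the repository's API.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a box (center cx, cy and size w, h, all in pixels).

    Returns (row, col, (x, y, w_rel, h_rel)) where x and y are offsets
    inside the responsible grid cell and w_rel, h_rel are fractions of
    the image size. All four values lie in [0, 1].
    """
    # Center position measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S

    # The cell containing the center is responsible for the object.
    # min() keeps a center on the right or bottom edge inside the grid.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)

    # Offsets of the center from the cell's top-left corner.
    x = gx - col
    y = gy - row

    return row, col, (x, y, w / img_w, h / img_h)


# A 112x224 box centered at (224, 100) in a 448x448 image.
row, col, target = encode_box(224, 100, 112, 224, 448, 448)
```

Here the center falls in row 1, column 3 of the 7 × 7 grid, halfway across that cell horizontally.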

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
# Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object), conditioned on the cell containing an object. Only one set of class probabilities is predicted per cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions, which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
# At test time the conditional class probabilities are multiplied by each box's confidence, giving a class-specific confidence score per box. The score encodes both the probability that the class appears in the box and how well the predicted box fits the object.
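
The test-time multiplication for a single grid cell can be sketched as follows. The function name `class_scores` and the toy numbers (3 classes, 2 boxes) are illustrative assumptions, not values from the paper.

```python
def class_scores(class_probs, box_confidences):
    """Class-specific confidence scores for one grid cell.

    class_probs:      C conditional probabilities Pr(Class_i | Object),
                      shared by every box in the cell.
    box_confidences:  B box confidences, one per predicted box.
    Returns a B x C nested list: scores[j][i] is the score of class i
    for box j.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


# One cell with B = 2 boxes and C = 3 classes (toy values).
scores = class_scores([0.7, 0.2, 0.1], [0.9, 0.3])
```

Because the class probabilities are shared across the cell, the two boxes rank the classes identically; only the scale of their scores differs.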

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
# For PASCAL VOC, S = 7 and B = 2. The dataset has 20 labelled classes, so C = 20 and the final prediction is a 7 × 7 × 30 tensor.
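
The depth of 30 follows from the general encoding S × S × (B ∗ 5 + C) given in the paper: five values per box plus one shared set of class probabilities per cell. A small check of that arithmetic, with `output_shape` as a hypothetical helper name:

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each holding
    B boxes of 5 values (x, y, w, h, confidence) plus C class
    probabilities."""
    return (S, S, B * 5 + C)


shape = output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
```

With the PASCAL VOC settings this gives a depth of 2 ∗ 5 + 20 = 30 and 7 ∗ 7 ∗ 30 = 1470 output values in total.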

　　 2.1 Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
# The model is implemented as a convolutional neural network and evaluated on the PASCAL VOC detection dataset [9]. The early convolutional layers extract image features; the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.
# The architecture is inspired by GoogLeNet [34]: 24 convolutional layers followed by 2 fully connected layers. In place of GoogLeNet's inception modules it uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. Figure 3 shows the full network.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
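# Fast YOLO is a faster variant with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Apart from network size, all training and testing parameters are identical to YOLO's.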


The final output of our network is the 7 × 7 × 30 tensor of predictions.
# The network's final output is the 7 × 7 × 30 tensor of predictions.

　　 2.2 Training

We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].
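# This ImageNet pretraining took about a week and reached 88% single-crop top-5 accuracy on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. All training and inference use the Darknet framework (an open-source framework written in C and CUDA) [26].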

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
# The model is then converted for detection. Ren et al. show that adding convolutional and connected layers to pretrained networks can improve performance [29]; following them, four convolutional layers and two fully connected layers with randomly initialized weights are added. Because detection often needs fine-grained visual information, the input resolution is raised from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
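# The final layer predicts class probabilities and bounding box coordinates. Box width and height are normalized by the image width and height, and x and y are expressed as offsets within a particular grid cell, so all four values lie between 0 and 1.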

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
# The final layer uses a linear activation; every other layer uses a leaky rectified linear activation.
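
The activation formula itself is not reproduced in these notes. In the paper it is φ(x) = x for x > 0 and φ(x) = 0.1x otherwise. A minimal scalar sketch follows; the function name `leaky_relu` and the `slope` parameter are illustrative choices, not names from the paper or from Darknet.

```python
def leaky_relu(x, slope=0.1):
    """Leaky rectified linear activation.

    Positive inputs pass through unchanged; non-positive inputs are
    scaled by `slope` (0.1 in the YOLO paper) instead of being zeroed.
    """
    return x if x > 0 else slope * x


activations = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
```

Unlike a plain ReLU, negative inputs keep a small non-zero gradient, so units that land in the negative range can still be updated.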

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal.
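# The model is optimized for sum-squared error on its output because that is easy to optimize, although it does not align perfectly with the goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal.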

Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
# In addition, many grid cells in every image contain no object. This pushes their confidence scores toward zero and often overpowers the gradient from cells that do contain objects, which can make the model unstable and cause training to diverge early.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λ_coord and λ_noobj to accomplish this. We set λ_coord = 5 and λ_noobj = .5.
# To remedy this, the loss from bounding box coordinate predictions is increased and the loss from confidence predictions for boxes without objects is decreased, using two parameters: λ_coord = 5 and λ_noobj = 0.5.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
# Sum-squared error also weights errors in large and small boxes equally, although a small deviation matters less in a large box than in a small one. To partially address this, the network predicts the square root of the box width and height rather than the width and height directly.
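
The effect of the square root can be seen with a small numeric comparison. This is an illustrative sketch with made-up box widths (fractions of the image width); `size_error` is a hypothetical helper, not part of any YOLO implementation.

```python
import math


def size_error(w_true, w_pred, use_sqrt=True):
    """Squared error on a box width, optionally in square-root space."""
    if use_sqrt:
        return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    return (w_true - w_pred) ** 2


# The same absolute deviation (0.05) applied to a small and a large box.
small_plain = size_error(0.10, 0.15, use_sqrt=False)
large_plain = size_error(0.80, 0.85, use_sqrt=False)
small_sqrt = size_error(0.10, 0.15)
large_sqrt = size_error(0.80, 0.85)
```

On raw widths both errors are essentially equal, while in square-root space the small box is penalized several times more than the large one. This matches the paper's intent, though only partially, as the authors themselves note.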

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
# YOLO predicts several bounding boxes per grid cell, but during training only one predictor should be responsible for each object: the one whose prediction currently has the highest IOU with the ground truth. The predictors therefore specialize, each getting better at certain sizes, aspect ratios, or object classes, which improves overall recall.
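
Choosing the responsible predictor amounts to an argmax over IOU. The sketch below assumes boxes in corner format `(x_min, y_min, x_max, y_max)`; the helper names `iou` and `responsible_predictor` are hypothetical.

```python
def iou(a, b):
    """IOU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth_box):
    """Index of the predicted box with the highest IOU against the
    ground truth; ties go to the lowest index."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# B = 2 predictions from one cell and one ground-truth box (toy values).
preds = [(0, 0, 2, 2), (1, 1, 3, 3)]
truth = (1, 1, 3, 2.5)
best = responsible_predictor(preds, truth)
```

Only the selected predictor receives the coordinate loss for this object; because the choice depends on the current predictions, it can change as training progresses.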

During training we optimize the following, multi-part loss function:
# Training optimizes a multi-part loss function.
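
The loss formula is not reproduced in these notes. As given in the paper, it reads as follows; the left-hand symbol \mathcal{L} is a label added here for readability and is not used in the paper.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

Here \mathbb{1}_{i}^{\text{obj}} indicates that an object appears in cell i, and \mathbb{1}_{ij}^{\text{obj}} indicates that the j-th bounding box predictor in cell i is responsible for that prediction. The first two terms are the coordinate and size errors weighted by λ_coord, the third and fourth are the confidence errors for boxes with and without objects (the latter weighted by λ_noobj), and the last is the classification error.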

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
# The loss penalizes classification error only when an object is present in the grid cell (hence the conditional class probability above), and it penalizes bounding box coordinate error only for the predictor responsible for the ground truth box, i.e. the one with the highest IOU of any predictor in that cell.

　　 2.3 Inference

　　 2.4 Limitations of YOLO

Comparison to Other Detection Systems

　　　a). Deformable parts models

　　　b). R-CNN

　　　c). Other Fast Detectors

　　　d). Deep MultiBox

　　　e). OverFeat

　　　f). MultiGrasp

Experiments

　　　4.1 Comparison to Other Real-Time Systems

　　　4.2 VOC 2007 Error Analysis

　　　4.3 Combining Fast R-CNN and YOLO

　　　4.4 VOC 2012 Results

　　　4.5 Generalizability: Person Detection in Artwork

Real-Time Detection In The Wild

Conclusion
