[Paper Reading] FCOS: Fully Convolutional One-Stage Object Detection


Preface: why anchor-free methods have made a comeback

A great answer I came across on Zhihu: https://zhuanlan.zhihu.com/p/62372897
For anchor-free methods to match anchor-based methods in accuracy, I think the biggest credit goes to FPN, followed by Focal Loss (inner monologue: RetinaNet for the win). When only one box is predicted per location, the FPN structure compensates well for scale variation, while Focal Loss greatly helps predictions in the center regions. Of course, actually getting such a method to work is not that easy; certain details likely matter a great deal, for example how overlapping regions are handled, how the regression range is restricted, how targets are assigned to different FPN levels, and whether the heads share parameters.

1. Introduction

We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation.

The authors propose FCOS (a fully convolutional one-stage object detector), which is both anchor-free and proposal-free.

The paper lists four drawbacks of anchor-based detectors:

(1) Hyper-parameters must be carefully tuned.
(2) The detector struggles with candidate objects of large shape variation, especially small objects.
(3) The enormous number of anchor boxes, most of which are labeled negative, causes an imbalance between positive and negative samples.
(4) Large computational overhead (e.g., computing IoU between all anchor boxes and ground-truth boxes).

FCOS has the following advantages:

(1) Detection becomes unified with other FCN-solvable tasks.
(2) Detection is proposal-free and anchor-free.
(3) FCOS has lower computational overhead.
(4) It achieves state-of-the-art results among one-stage detectors.
(5) FCOS can readily be extended to other vision tasks.

2. Fully Convolutional One-Stage Object Detector

In other words, our detector directly views locations as training samples instead of anchor boxes in anchor-based detectors, which is the same as FCNs for semantic segmentation

This is exactly where FCOS departs from anchor-based detectors: it regresses directly at each point on the feature map.
For a location $(x, y)$ on the image, the ground-truth bounding boxes are defined as $\{B_i\}$, with $B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \dots, C\}$.
Here $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ are the coordinates of the top-left and bottom-right corners of the ground-truth box, and $c^{(i)}$ is its class label; for COCO, $C = 80$.
Moreover, for each feature map $F_i$ with stride $s$, a location $(x, y)$ on $F_i$ maps back onto the input image as $(\lfloor s/2 \rfloor + xs, \ \lfloor s/2 \rfloor + ys)$, which lies near the center of the receptive field of $(x, y)$.
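
To make this mapping concrete, here is a minimal sketch in plain Python (the function name and the example stride are mine, for illustration only):

```python
# Minimal sketch of the feature-map-to-image mapping described above.
# `stride` (s) and the feature-map size are illustrative assumptions.

def feature_locations(h, w, stride):
    """Map every (x, y) on an h x w feature map back to image coordinates."""
    locations = []
    for y in range(h):
        for x in range(w):
            # (floor(s/2) + x*s, floor(s/2) + y*s) lands near the center of
            # the receptive field of feature cell (x, y)
            locations.append((stride // 2 + x * stride,
                              stride // 2 + y * stride))
    return locations

# e.g. an 8x8 feature map at stride 16 covers a 128x128 image
print(feature_locations(8, 8, 16)[:3])  # [(8, 8), (24, 8), (40, 8)]
```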

location (x, y) is considered as a positive sample if it falls into any ground-truth box and the class label c∗ of the location is the class label of the ground-truth box. Otherwise it is a negative sample and c∗ = 0 (background class).

If a location (x, y) falls inside a ground-truth box, it is a positive sample and its class label $c^*$ is set to that box's class; otherwise it is a negative sample with $c^* = 0$ (background). In other words, FCOS does not use anchor-to-ground-truth IoU to decide which samples are positive.
As shown in the figure below, the regression task is formulated with a 4D vector $t^* = (l^*, t^*, r^*, b^*)$, whose four components are the distances from the location to the four sides of the bounding box.

If a location falls into multiple bounding boxes, it is considered as an ambiguous sample. We simply choose the bounding box with minimal area as its regression target.

This does raise a problem: a location may fall into multiple bounding boxes. The immediate fix is to pick the box with the smallest area; the FPN multi-level scheme introduced later resolves most of this ambiguity.
If location (x, y) is associated with bounding box $B_i$, its regression targets are:
$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y$$

[Figure: the 4D regression targets $(l^*, t^*, r^*, b^*)$, measured from a location to the four sides of its bounding box]
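
A minimal sketch of this target assignment, assuming ground-truth boxes are given as (x0, y0, x1, y1, cls); the function name is illustrative, not from the paper's code:

```python
def assign_target(x, y, gt_boxes):
    """Assign one location to the smallest-area GT box containing it.

    gt_boxes: list of (x0, y0, x1, y1, cls). Returns (cls, (l, t, r, b)),
    or (0, None) for a negative (background) location.
    """
    best = None
    for (x0, y0, x1, y1, cls) in gt_boxes:
        if x0 <= x <= x1 and y0 <= y <= y1:  # the location falls in this box
            area = (x1 - x0) * (y1 - y0)
            if best is None or area < best[0]:
                # l*, t*, r*, b*: distances to the four sides of the box
                best = (area, cls, (x - x0, y - y0, x1 - x, y1 - y))
    if best is None:
        return 0, None  # negative sample, c* = 0
    return best[1], best[2]

print(assign_target(50, 60, [(20, 30, 120, 140, 7)]))  # (7, (30, 30, 70, 80))
```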

3. FCOS Network

As shown in the figure below, FCOS uses a CNN backbone (ResNet) followed by an FPN, whose levels P3-P7 extract multi-scale features. The same head is then applied to all five levels (the head weights are shared across levels, although not entirely identically, e.g., P3 and P4 differ, presumably through the per-level trainable scale introduced below). The head splits into a classification branch and a regression branch. The last layer of the classification branch predicts an 80-D output, implemented as C binary classifiers. The last layer of the regression branch outputs a 4D vector representing the box coordinates. Since the regression targets are always positive (every positive location lies inside a ground-truth box, so all four distances are positive), the paper applies exp(x) on top of the regression branch.
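
A simplified PyTorch sketch of one such head, following the description above (the layer count, GroupNorm, and module names are my assumptions; the center-ness branch introduced later is omitted):

```python
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """Simplified FCOS head, shared across FPN levels P3-P7."""

    def __init__(self, in_channels=256, num_classes=80, num_convs=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower = tower()
        self.reg_tower = tower()
        # C binary classifiers: one logit per class at each location
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)
        # 4D regression output: (l, t, r, b)
        self.bbox_pred = nn.Conv2d(in_channels, 4, 3, padding=1)

    def forward(self, feature):
        cls = self.cls_logits(self.cls_tower(feature))
        # exp maps raw outputs to (0, inf), since targets are positive distances
        reg = torch.exp(self.bbox_pred(self.reg_tower(feature)))
        return cls, reg

head = FCOSHead()
cls, reg = head(torch.randn(1, 256, 32, 32))
print(cls.shape, reg.shape)  # (1, 80, 32, 32) and (1, 4, 32, 32)
```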

Loss Function

$$L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(p_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} \, L_{reg}(t_{x,y}, t^*_{x,y})$$

Here $L_{cls}$ is the focal loss and $L_{reg}$ is the IoU loss. $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function: the regression term $L_{reg}(t_{x,y}, t^*_{x,y})$ is applied only at locations with $c^*_{x,y} > 0$.

where Lcls is focal loss as in [15] and Lreg is the IOU loss as in UnitBox [32]. Npos denotes the number of positive samples and λ being 1 in this paper is the balance weight for Lreg. The summation is calculated over all locations on the feature maps Fi. 1{c∗_{x,y} > 0} is the indicator function, being 1 if c∗_{x,y} > 0 and 0 otherwise.
Moreover, since the regression targets are always positive, we employ exp(x) to map any real number to (0, ∞) on the top of the regression branch.
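
A hedged PyTorch sketch of this loss, with the focal loss and IoU loss written inline in simplified form; the flattened tensor layout is my own convention:

```python
import torch

def fcos_loss(cls_logits, reg_pred, cls_targets, reg_targets,
              alpha=0.25, gamma=2.0, lam=1.0):
    """cls_logits: (N, C); reg_pred / reg_targets: (N, 4) as (l, t, r, b);
    cls_targets: (N,) long tensor with 0 = background."""
    pos = cls_targets > 0
    n_pos = max(pos.sum().item(), 1)

    # --- focal loss over C binary classifiers ---
    onehot = torch.zeros_like(cls_logits)
    onehot[pos, cls_targets[pos] - 1] = 1.0      # class c -> channel c-1
    p = torch.sigmoid(cls_logits)
    pt = p * onehot + (1 - p) * (1 - onehot)     # prob of the true label
    at = alpha * onehot + (1 - alpha) * (1 - onehot)
    cls_loss = -(at * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).sum()

    # --- IoU loss, applied only where the indicator 1{c* > 0} holds ---
    lp, tp, rp, bp = reg_pred[pos].unbind(dim=1)
    lt, tt, rt, bt = reg_targets[pos].unbind(dim=1)
    area_p = (lp + rp) * (tp + bp)
    area_t = (lt + rt) * (tt + bt)
    # both boxes share the same location, so intersection sides add up like this
    inter = (torch.min(lp, lt) + torch.min(rp, rt)) * \
            (torch.min(tp, tt) + torch.min(bp, bt))
    iou = inter / (area_p + area_t - inter).clamp(min=1e-6)
    reg_loss = -torch.log(iou.clamp(min=1e-6)).sum()

    return cls_loss / n_pos + lam * reg_loss / n_pos
```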

Inference

Inference. The inference of FCOS is straightforward. Given an input image, we forward it through the network and obtain the classification scores px,y and the regression prediction tx,y for each location on the feature maps Fi. Following [15], we choose the location with px,y > 0.05 as positive samples and invert Eq. (1) to obtain the predicted bounding boxes.
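
A sketch of this decoding step for a single feature level (NumPy; the names are illustrative and NMS is omitted):

```python
import numpy as np

def decode_boxes(scores, reg, stride, thresh=0.05):
    """scores: (C, H, W) class scores; reg: (4, H, W) as (l, t, r, b)."""
    C, H, W = scores.shape
    boxes, labels, confs = [], [], []
    for c in range(C):
        ys, xs = np.where(scores[c] > thresh)  # candidate positive locations
        for y, x in zip(ys, xs):
            # map the feature location back to image coordinates
            cx = stride // 2 + x * stride
            cy = stride // 2 + y * stride
            l, t, r, b = reg[:, y, x]
            # invert Eq. (1): x0 = cx - l, y0 = cy - t, x1 = cx + r, y1 = cy + b
            boxes.append((cx - l, cy - t, cx + r, cy + b))
            labels.append(c + 1)
            confs.append(scores[c, y, x])
    return boxes, labels, confs
```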

[Figure: the FCOS network architecture — a backbone feeding FPN levels P3-P7, each followed by shared heads with classification, center-ness, and regression branches]
Unlike anchor-based detectors, which assign anchor boxes of different sizes to different feature levels, FCOS directly limits the range of bounding-box regression at each level. It first computes $l^*, t^*, r^*, b^*$ for every location on every feature level; if a location satisfies
$$\max(l^*, t^*, r^*, b^*) > m_i \quad \text{or} \quad \max(l^*, t^*, r^*, b^*) < m_{i-1},$$
then it is set as a negative sample and no longer takes part in bounding-box regression. The values of $m_i$ are as follows:

Here mi is the maximum distance that feature level i needs to regress. In this work, m2, m3, m4, m5, m6 and m7 are set as 0, 64, 128, 256, 512 and ∞, respectively.
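
A minimal sketch of this assignment rule with the m_i values above (the function is illustrative, not the paper's code):

```python
# m_2 .. m_7 from the paper; level i keeps a location only if
# m_{i-1} <= max(l*, t*, r*, b*) <= m_i
M = {2: 0, 3: 64, 4: 128, 5: 256, 6: 512, 7: float("inf")}

def level_for_target(l, t, r, b):
    """Return the FPN level (3..7) responsible for these regression targets."""
    m = max(l, t, r, b)
    for i in range(3, 8):
        if M[i - 1] <= m <= M[i]:
            return i

print(level_for_target(30, 10, 55, 20))   # 3  (max = 55 <= 64)
print(level_for_target(100, 90, 70, 80))  # 4  (64 <= max = 100 <= 128)
```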

If a location is still ambiguous even with multi-level prediction, the ground-truth box with the smallest area is simply chosen.

If a location, even with multi-level prediction used, is still assigned to more than one ground-truth boxes, we simply choose the ground-truth box with minimal area as its target.

As a result, instead of using the standard exp(x), we make use of exp(si x) with a trainable scalar si to automatically adjust the base of the exponential function for feature level Pi, which slightly improves the detection performance.
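
A minimal PyTorch sketch of this per-level trainable scale (the Scale module is a common pattern in FCOS re-implementations; this particular version is my own):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Per-level learnable scalar s_i, so the regression uses exp(s_i * x)."""
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return x * self.scale

# one Scale per FPN level P3-P7; a raw head output x becomes exp(scales[i](x))
scales = nn.ModuleList([Scale(1.0) for _ in range(5)])
```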

Center-ness for FCOS

This branch is introduced to suppress the low-quality predicted bounding boxes produced by locations far from the center of an object. Center-ness is a branch parallel to the classification branch; it depicts the normalized distance from a location to the center of the object the location is responsible for, and it is trained with the binary cross-entropy (BCE) loss.

$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$
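
The center-ness target transcribes directly into code (plain Python, for illustration):

```python
import math

def centerness_target(l, t, r, b):
    """Center-ness in [0, 1]: 1 at the box center, decaying toward the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness_target(50, 50, 50, 50))  # 1.0 at the exact center
print(centerness_target(10, 40, 90, 60))  # ~0.27 for an off-center location
```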

At test time, the final confidence score is obtained by multiplying the predicted center-ness with the corresponding classification score.

When testing, the final score (used for ranking the detected bounding boxes) is computed by multiplying the predicted center-ness with the corresponding classification score.
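
So the ranking score at each location is just a product, e.g.:

```python
cls_score = 0.82   # sigmoid classification score at a location (illustrative)
centerness = 0.35  # predicted center-ness at the same location

# score used to rank boxes in NMS: off-center, low-quality boxes get down-weighted
final_score = cls_score * centerness
print(final_score)  # ~0.287
```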


Reposted from blog.csdn.net/weixin_43823854/article/details/108917308