[Paper Notes] M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network

& Paper Overview

 

Paper address: https://arxiv.org/abs/1811.04533

Code address: https://github.com/qijiezhao/M2Det

& Summary and personal views

  This paper proposes the Multi-Level Feature Pyramid Network (MLFPN) to build feature pyramids for efficient detection of objects at different scales. MLFPN consists of three parts: FFM, TUMs, and SFAM. FFMv1 (Feature Fusion Module 1) fuses multi-level features extracted by the backbone into a base feature; the TUMs (Thinned U-shape Modules) together with FFMv2 extract multi-level, multi-scale features from this base feature; and SFAM (Scale-wise Feature Aggregation Module) aggregates these multi-level, multi-scale features by scale to obtain the final feature pyramid. M2Det, built on MLFPN, is an efficient end-to-end one-stage detector and achieves the best one-stage performance on the MS COCO dataset.

  The most innovative part of this paper is SFAM, which integrates the feature pyramids output by the different TUMs into the final pyramid, so that each layer incorporates enough information to detect objects at its corresponding scale. FFM is no slouch either, and the overall network design is elegant.

& Contribution

1. Proposed the TUM module, which makes improvements on top of FPN;

2. Built multi-level, multi-scale feature pyramid structures via the SFAM module;

3. M2Det achieves the best performance among one-stage methods on the MS COCO dataset and also surpasses most two-stage methods.

& Problems to be solved

Problem: Although previous networks using FPN achieved improved results, the gains are limited because they simply build the feature pyramid from the multi-scale pyramidal features of a backbone that was designed for the classification task.

Analysis: As shown below, SSD directly uses two layers of the backbone plus extra layers obtained by stride-2 convolutions to form its pyramid; STDN uses only the last dense block of DenseNet and builds the pyramid by pooling and scale-transfer operations; FPN builds the pyramid by fusing deep and shallow features in a top-down manner. These methods mainly have two limitations:

  • The feature maps in the pyramid are not representative enough for the object detection task, since these methods simply build the pyramid from backbone layers designed for the classification task;
  • Each feature map in the pyramid, used for detecting objects of a particular size range, is built mainly or solely from single-level backbone layers, which means each level contains mainly or only single-level information. In practice, object instances of similar size can look very different; for example, a traffic light and a distant pedestrian may have similar sizes, but the pedestrian's appearance is more complex; this can lead to suboptimal results.

 

& Framework and main methods

1 Main Structure
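  As summarized above, MLFPN chains FFMv1, alternating FFMv2 modules and TUMs, and finally SFAM. Below is a minimal sketch (plain Python, not the official implementation) of that data flow; the module names and interfaces are assumptions that match the hedged sketches in the subsections that follow.

```python
# Minimal sketch of the MLFPN data flow (assumed interfaces, not official code).
def mlfpn_forward(shallow_feat, deep_feat, ffm_v1, ffm_v2s, tums, sfam):
    # 1. FFMv1 fuses two backbone feature maps into the base feature.
    base = ffm_v1(shallow_feat, deep_feat)

    # 2. Alternate TUMs and FFMv2: each TUM after the first receives the base
    #    feature fused with the largest output map of the previous TUM.
    levels = [tums[0](base)]                 # each TUM returns 6 scales, largest first
    for ffm_v2, tum in zip(ffm_v2s, tums[1:]):
        fused = ffm_v2(base, levels[-1][0])
        levels.append(tum(fused))

    # 3. SFAM concatenates same-scale maps across levels and applies channel
    #    attention to produce the final feature pyramid for the detection heads.
    return sfam(levels)
```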

2 TUM (Thinned U-shape Module)

  The overall structure of a TUM is shown below. It follows the FPN pattern: the upper-level feature maps are upsampled by bilinear interpolation and then fused with an element-wise addition to obtain the output feature maps; six feature maps at different scales are produced here.

  The differences from FPN are:

  •  The encoder is a series of 3 × 3, stride-2 convolution layers, and the decoder takes the outputs of these layers as its reference set of feature maps, whereas FPN uses the output of the last layer of each stage of the ResNet backbone;
  •  In addition, a 1 × 1 convolution layer is added after each upsampling and element-wise addition in the decoder branch, to enhance learning ability while keeping the features smooth.

  The decoder outputs of each TUM form the multi-scale features of the current level, and stacking the outputs of all TUMs forms the multi-level, multi-scale features: the early, middle, and final TUMs provide shallow, medium, and deep features respectively.
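  Below is a minimal PyTorch sketch of a TUM under these assumptions; the channel count, number of scales, and ReLU activations are illustrative choices, not taken from the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TUM(nn.Module):
    """Minimal sketch of a Thinned U-shape Module (not the official code).

    Encoder: a chain of 3x3 stride-2 convolutions.
    Decoder: bilinear upsampling + element-wise addition, followed by
    1x1 convolutions to smooth the fused features.
    """
    def __init__(self, channels=256, num_scales=6):
        super().__init__()
        self.num_scales = num_scales
        # Encoder: (num_scales - 1) downsampling convolutions.
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
             for _ in range(num_scales - 1)])
        # 1x1 smoothing convolutions applied after each fusion.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(channels, channels, 1) for _ in range(num_scales)])

    def forward(self, x):
        # Encoder pass: collect feature maps at every scale.
        enc = [x]
        for conv in self.down:
            enc.append(F.relu(conv(enc[-1])))

        # Decoder pass: top-down bilinear upsampling + element-wise addition.
        out = [self.smooth[-1](enc[-1])]
        prev = enc[-1]
        for i in range(self.num_scales - 2, -1, -1):
            up = F.interpolate(prev, size=enc[i].shape[-2:],
                               mode="bilinear", align_corners=False)
            prev = enc[i] + up
            out.insert(0, self.smooth[i](prev))
        return out  # list of num_scales feature maps, largest first
```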

3 FFM (Feature Fusion Module)

The feature fusion modules fuse features from different levels: 1 × 1 convolutions compress the channels, and concatenation aggregates the feature maps. FFMv1 fuses two backbone feature maps at different scales, so the deeper one must first be upsampled to the same scale before concatenation; FFMv2 fuses the base feature with the largest output feature map of the previous TUM.
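A minimal PyTorch sketch of the two fusion modules under these assumptions follows; the channel widths are illustrative, not the paper's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFMv1(nn.Module):
    """Fuse two backbone feature maps of different scales into the base feature."""
    def __init__(self, c_shallow, c_deep, out_shallow=256, out_deep=512):
        super().__init__()
        self.compress_shallow = nn.Conv2d(c_shallow, out_shallow, 1)
        self.compress_deep = nn.Conv2d(c_deep, out_deep, 1)

    def forward(self, shallow, deep):
        # The deeper (smaller) map is upsampled to the shallow map's scale first.
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        return torch.cat([self.compress_shallow(shallow),
                          self.compress_deep(deep)], dim=1)


class FFMv2(nn.Module):
    """Fuse the base feature with the largest output map of the previous TUM."""
    def __init__(self, c_base, c_out=128):
        super().__init__()
        self.compress = nn.Conv2d(c_base, c_out, 1)

    def forward(self, base, prev_largest):
        # Both inputs already share the same spatial size, so only concatenation is needed.
        return torch.cat([self.compress(base), prev_largest], dim=1)
```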

 

4 SFAM (Scale-wise Feature Aggregation Module)

  SFAM aggregates the multi-level, multi-scale features. As shown above, in the first stage SFAM concatenates feature maps of the same scale along the channel dimension. However, simple concatenation is not adaptive enough, so in the second stage a channel attention module is introduced so that the features can focus on the channels they benefit from most. Global average pooling is used to generate the channel-wise statistic z, and, to fully capture the dependencies between channels, the attention is learned through two fully connected layers:

s = σ(W2 · δ(W1 · z))

where δ denotes the ReLU function and σ denotes the sigmoid function. The final output is obtained by re-weighting the input X channel by channel with the activation s:

X̃_c = s_c · X_c

  
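  A minimal PyTorch sketch of SFAM with this SE-style channel attention follows; the channel count per level, number of levels, and reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SFAM(nn.Module):
    """Concatenate same-scale features across levels, then re-weight channels."""
    def __init__(self, channels_per_level=256, num_levels=8, reduction=16):
        super().__init__()
        c = channels_per_level * num_levels          # channels after concatenation
        self.fc1 = nn.Linear(c, c // reduction)      # W1
        self.fc2 = nn.Linear(c // reduction, c)      # W2

    def forward(self, levels):
        # levels[l][k] is the feature map of TUM l at scale k (same order for every TUM).
        pyramid = []
        for k in range(len(levels[0])):
            x = torch.cat([lvl[k] for lvl in levels], dim=1)       # (N, C, H, W)
            z = x.mean(dim=(2, 3))                                 # global average pooling
            s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # s = sigmoid(W2 relu(W1 z))
            pyramid.append(x * s[:, :, None, None])                # channel-wise re-weighting
        return pyramid
```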

5 Experiment

1) Comparison with other one-stage and two-stage methods on the MS COCO test-dev set: the results show that M2Det achieves the best performance among one-stage methods and also surpasses most two-stage methods.

2) Ablation study: experiments examine the effect of the number of TUMs on performance, the effect on AP of using or omitting the base feature and SFAM, and the effect of different backbones.

 

3) Effect of different MLFPN configurations, in terms of the number of TUMs and channels, on the results: experiments show that 8 TUMs with 512 channels gives the best accuracy but introduces more parameters; overall, 8 TUMs with 256 channels balances efficiency and accuracy.

4) Speed comparison on the MS COCO test-dev set: M2Det has stronger overall capability, reaching top-level results in both speed and accuracy.

5) Visualization of M2Det's detections: although a traffic light and a distant pedestrian have similar sizes, they respond at different feature levels; the pedestrian carries more complex appearance information and is therefore detected at a deeper level, and cars show a similar pattern for similar reasons.

& Thoughts and Inspiration

  While mainstream work focuses on improving the FPN network so that it fuses features better, the authors of this paper go back to the root of how FPN-style feature pyramids are built and identify the shortcomings there in order to improve performance. The authors' understanding of the problem is deep, and their line of attack hits the nail on the head.

 
