【Mask RCNN】《Mask R-CNN》



ICCV-2017




1 Motivation

Object detection and semantic segmentation have advanced rapidly, largely because each has a strong base system: Faster R-CNN and FCN respectively. The goal of this work is to develop a comparably enabling (i.e., broadly applicable) framework for instance segmentation.

Instance segmentation is similar to object detection in that both involve distinguishing different individual objects of the same class.

2 Innovation

  • Extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. One model thus serves three purposes: instance segmentation, bounding-box object detection, and person keypoint detection.

  • Proposes RoIAlign to fix the misalignment caused by RoI pooling in Faster R-CNN, providing the pixel-level alignment needed for instance segmentation.

3 Advantages

Instance segmentation, bounding-box object detection, and person keypoint detection are unified in one model, and the results surpass the winning single-task entries of the 2016 COCO challenges.

4 Methods

[Figure: Mask R-CNN framework for instance segmentation]

4.1 Head Architecture

[Figure: head architectures, ResNet-C4 variant (left) vs. FPN variant (right)]

The structure on the left is less desirable. As the R-FCN paper points out at the very beginning, "this creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation"; it feels similar to R-CNN, where each proposal is processed separately after it is generated. The authors likewise recommend the structure on the right ("we do not recommend using the C4 variant in practice").

Faster R-CNN has two outputs for each candidate object: a class label and a bounding-box offset. The authors add a third branch that outputs the object mask; this third branch, however, requires extracting a much finer spatial layout of the object.

Mask R-CNN also outputs a binary mask for each RoI
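
As a rough illustration, the mask branch of the FPN head can be pictured as a small FCN run on each RoI's features. The sketch below is a minimal PyTorch version, assuming the layer sizes described in the paper (four 3×3 convs, a 2× deconv to 28×28, and a 1×1 conv producing K mask maps); the class and variable names are purely illustrative, not the authors' code:

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Illustrative mask branch: a small FCN applied to each RoI's feature map.

    Outputs K binary-mask logits of m x m resolution per RoI (K = #classes).
    """
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):                       # four 3x3 conv + ReLU blocks
            layers += [nn.Conv2d(c, 256, 3, padding=1), nn.ReLU(inplace=True)]
            c = 256
        self.convs = nn.Sequential(*layers)
        # 2x upsampling: 14x14 RoI features -> 28x28 mask resolution
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        # one m x m mask per class, predicted independently (sigmoid at loss time)
        self.mask_logits = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi_features):             # (num_rois, C, 14, 14)
        x = self.convs(roi_features)
        x = self.relu(self.deconv(x))            # (num_rois, 256, 28, 28)
        return self.mask_logits(x)               # (num_rois, K, 28, 28)
```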

4.2 RoI Align

Segmentation is pixel-level, but RoI pooling in Faster R-CNN performs two quantization (rounding) operations that break the alignment: the first when the RoI is mapped onto the feature map, and the second when the RoI is divided into bins during RoI pooling.
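
A quick numeric illustration of where the two roundings lose precision (the RoI coordinates and stride here are made-up numbers):

```python
stride = 16                       # feature-map stride of the backbone
x0, x1 = 200, 345                 # hypothetical RoI edges in image coordinates

# 1st quantization: mapping the RoI onto the feature map rounds the coordinates
fx0, fx1 = x0 // stride, x1 // stride   # 12, 21 (exact values: 12.5, 21.5625)

# 2nd quantization: dividing the RoI into 7x7 bins rounds the bin size
roi_w = fx1 - fx0                       # 9 feature cells
bin_w = roi_w // 7                      # 1 cell per bin (exact: 9/7 ≈ 1.29)

# RoIAlign keeps 12.5, 21.5625 and 9/7-wide bins instead, and reads values
# at non-integer positions via bilinear interpolation.
```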

The quantization is illustrated below:

[Figure: the two quantizations in RoI pooling]

The differences between RoI pooling, RoI warp, and RoI align are shown below:

[Figure: RoI pooling vs. RoI warp vs. RoI align]

RoIAlign is detailed in the middle part of the figure below:

[Figure: RoIAlign, bilinear sampling of points inside each bin]

Four regularly spaced points are sampled inside each bin of the (non-quantized) RoI; the value at each point is computed by bilinear interpolation from the four surrounding feature-map pixels, and the four sampled values are then averaged.
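
A minimal sketch of that sampling step, assuming a single 2D feature map and a single output bin (the function names and the 2×2 sample placement at the sub-bin centers are illustrative; real implementations batch this on the GPU):

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Interpolate the feature value at a continuous (y, x) location.

    The coordinates are NOT rounded; this is the key difference from RoI pooling.
    """
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feature[y0, x0] +
            (1 - wy) * wx       * feature[y0, x1] +
            wy       * (1 - wx) * feature[y1, x0] +
            wy       * wx       * feature[y1, x1])

def roi_align_bin(feature, y_lo, x_lo, y_hi, x_hi):
    """Pool one output bin: sample 4 regularly spaced points and average them."""
    ys = [y_lo + (y_hi - y_lo) * f for f in (0.25, 0.75)]
    xs = [x_lo + (x_hi - x_lo) * f for f in (0.25, 0.75)]
    return np.mean([bilinear_sample(feature, y, x) for y in ys for x in xs])
```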

4.3 Training

  • Multi-task loss: L = L_cls + L_box + L_mask, where L_mask is defined only on positive RoIs (a loss sketch follows this list).

    The mask branch has a K·m²-dimensional output for each RoI: K classes, each at m×m resolution.

  • An RoI is positive if its IoU with a ground-truth box is at least 0.5, and negative otherwise.
  • As in Fast R-CNN, training uses image-centric sampling rather than RoI-centric sampling:
    • RoI-centric sampling: RoIs are sampled uniformly from the RoIs of all images, so each SGD mini-batch contains samples from different images (used by SPPnet).
    • image-centric sampling: mini-batches are sampled hierarchically, first over images and then over RoIs within those images, so RoIs from the same image share computation and memory.
  • Each mini-batch has 2 images per GPU and each image has N sampled RoIs with a positive:negative ratio of 1:3; N = 64 for the C4 backbone and 512 for FPN (see Figure 3).
  • RPN anchors span 5 scales and 3 aspect ratios.
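
A minimal sketch of L_mask under these definitions (per-pixel sigmoid plus binary cross-entropy, computed only on the ground-truth class channel of positive RoIs); the tensor layout and the function name are my own, not the paper's code:

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes, is_positive):
    """Illustrative L_mask.

    mask_logits: (N, K, m, m) predicted logits, one m x m map per class
    gt_masks:    (N, m, m)    ground-truth binary masks resized to m x m
    gt_classes:  (N,)         ground-truth class index of each RoI
    is_positive: (N,)         True for positive RoIs (IoU >= 0.5)
    """
    pos = is_positive.nonzero(as_tuple=True)[0]
    if pos.numel() == 0:                        # no positive RoIs in this batch
        return mask_logits.sum() * 0.0
    # Only the k-th mask (k = ground-truth class) contributes, so the K
    # per-class masks never compete with each other.
    logits = mask_logits[pos, gt_classes[pos]]  # (P, m, m)
    targets = gt_masks[pos].float()
    return F.binary_cross_entropy_with_logits(logits, targets)
```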

4.4 Inference

  • 300 proposals are used for the C4 backbone and 1000 for FPN; they are fed to the box prediction branch and then filtered by NMS.
  • The mask branch is applied only to the 100 highest-scoring detection boxes. This differs from the parallel computation used in training, but it speeds up inference.
  • The mask branch predicts K masks per RoI, but only the k-th mask is used, where k is the class predicted by the classification branch.
  • The m×m mask is resized to the size of the RoI and binarized at a threshold of 0.5 (see the sketch after this list).
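
A rough sketch of this post-processing, assuming the kept boxes are already clipped to the image; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def postprocess_masks(mask_logits, pred_classes, boxes, image_hw, thresh=0.5):
    """Select the k-th mask per RoI, resize it to the box, and binarize.

    mask_logits:  (N, K, m, m) mask outputs for the kept detection boxes
    pred_classes: (N,)         class k from the classification branch
    boxes:        (N, 4)       (x0, y0, x1, y1) in image coordinates, clipped
    image_hw:     (H, W)       full image size
    """
    n = mask_logits.shape[0]
    probs = mask_logits[torch.arange(n), pred_classes].sigmoid()   # (N, m, m)

    full_masks = []
    for prob, box in zip(probs, boxes):
        x0, y0, x1, y1 = box.round().long().tolist()
        h, w = max(y1 - y0, 1), max(x1 - x0, 1)
        # resize the m x m probability map to the RoI size, then threshold at 0.5
        resized = F.interpolate(prob[None, None], size=(h, w),
                                mode="bilinear", align_corners=False)[0, 0]
        canvas = torch.zeros(image_hw, dtype=torch.bool)
        canvas[y0:y0 + h, x0:x0 + w] = resized > thresh
        full_masks.append(canvas)
    return full_masks
```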

5 Experiments: Instance Segmentation

Evaluation uses mask IoU.

5.1 Main Results

Mask R-CNN outperforms the winners of the COCO 2015 and 2016 instance segmentation challenges.

[Table 1: instance segmentation results on COCO]

[Figure: Mask R-CNN instance segmentation results on COCO test images]

Mask R-CNN vs. FCIS: FCIS exhibits systematic artifacts on overlapping instances, whereas Mask R-CNN does not.

[Figure: FCIS vs. Mask R-CNN on overlapping instances]

5.2 Ablation Experiments

[Table 2: ablation results]

  • Backbone: performance benefits from greater depth (ResNet-50 vs. 101), FPN, and ResNeXt (Table 2 a).
  • Multinomial vs. Independent Masks: in short, softmax vs. sigmoid. With a per-pixel sigmoid and a binary loss, one mask is predicted independently for each class, so the classes do not compete; with a per-pixel softmax and a multinomial logistic loss, the classes compete at every pixel. The decoupled sigmoid formulation works better (Table 2 b, c).
  • RoIAlign: results are insensitive to whether max or average pooling is used at the sampling points, so average pooling is used throughout. RoIAlign clearly improves over RoI pooling (Table 2 c, d). In (c) the backbone is ResNet-50-C4 with stride 16; (d) uses ResNet-50-C5 with stride 32. With RoIAlign, (d) outperforms (c) (AP 30.9 vs. 30.3), and using FPN improves results further.
  • Mask branch: an FCN head works better than an MLP (fully connected) head.

5.3 Bounding Box Detection Results

[Table: bounding-box object detection results on COCO]

Note that the gap between the model trained without the mask output and the full Mask R-CNN (compared on box detection) is solely due to the benefits of multi-task training.

In Table 1, the instance segmentation AP is 37.1. This indicates that the approach largely closes the gap between object detection and the more challenging instance segmentation task.

5.4 Timing

The authors note that their design is not optimized for speed.

Mask R-CNN for human pose estimation and the experiments on Cityscapes (instance segmentation) are not covered in this post; interested readers can refer to the original paper.

