Problem Definition

Task: classification & detection
Input: image
Output: class (classification) / class + bbox (detection)
Metrics: mAP, computational budget
Significance: proposes a new structure: an NN architecture of similar depth & width but sparser (see Algorithm - Architecture), which achieves equally good results while saving computational budget (improved utilization of the computing resources inside the network)
Principles:
- Hebbian Principle
- multi-scale processing
Incarnation: GoogLeNet (22 layers)

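Motivation and Approach

Straightforward way to improve accuracy: increase network size + use more training data
Two drawbacks:
- more params, hence more prone to over-fitting (especially when the training set is limited)
- increased use of computational resources; when the added capacity is used inefficiently (e.g. most weights end up close to 0), much of the computation is wasted.
Solutions:
- Replace full connections with sparsely connected structures (even inside the convolutions): following the Hebbian Principle, cluster neurons with highly correlated outputs and use them as the input of the next layer.
  - Core theory: if the probability distribution of the dataset can be represented by a large, sparse neural network, then the optimal network topology can be built layer by layer, by analyzing the correlation of the previous layer's activations and clustering highly correlated neurons to form the input of the next layer.
  - Result: the Naive Inception Module, where each filter bank clusters highly correlated activations and the filter banks are concatenated as the input of the next layer.
- However, computing with sparse connections does not pay off: non-uniform sparse computation incurs a lot of overhead and cache misses, and as dense matrix multiplication gets faster (by exploiting the hardware), the gap between non-uniform sparse computation and dense computation keeps widening.
  - This is why ConvNets changed from the original random (non-uniform) sparse connections back to full (uniform) connections inside the convolution, to allow parallel computation.
- Because dense computation is so efficient, state-of-the-art CV architectures exploit sparsity only in the spatial domain, through convolutional layers, while the operations inside a convolution are still fully connected.
  - Motivation: sparsity is exploited between layers but not inside the convolution (uniform, dense full connections inside convs); can extra sparsity also be exploited at the filter level?
  - Solution: Modified Inception module with 1*1 conv for dimension reduction.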
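
The layer-by-layer clustering idea can be illustrated with a toy sketch. This is not the paper's procedure (the paper only uses the idea as motivation); the greedy grouping rule, the 0.9 threshold, and the activation values below are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long activation lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_units(activations, threshold=0.9):
    """Greedy grouping: a unit joins the first cluster whose seed unit
    it correlates with at or above `threshold` (hypothetical rule)."""
    clusters = []  # each cluster is a list of unit indices; index 0 is the seed
    for i, act in enumerate(activations):
        for cluster in clusters:
            if pearson(activations[cluster[0]], act) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical activations of 4 units over 5 inputs (rows = units).
activations = [
    [0.1, 0.9, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.9, 0.2],  # moves together with unit 0
    [0.9, 0.1, 0.8, 0.2, 0.9],  # anti-correlated with unit 0
    [1.0, 0.2, 0.9, 0.3, 1.0],  # moves together with unit 2
]
clusters = cluster_units(activations)
print(clusters)
```

Each resulting cluster plays the role of one group of units that would feed a unit of the next layer.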

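Algorithm

Architecture (Fig):
- main idea: the optimal local sparse structure can be approximated/covered by dense components - the concatenation of the filter banks covers the optimal local sparse structure
- NIN:
  - 1*1: each unit in a layer's input corresponds to some region of the original image, and these units are grouped into filter banks. So within one local region of the input layer, many clusters are concentrated together.
    They can be connected to the next layer by 1*1 convs. (in NIN: cross channel pooling)
  - 3*3, 5*5: fewer clusters are spread over larger patches (most clusters sit in a single position, hence 1*1); to avoid patch-alignment issues the larger filter sizes are limited to 3*3 and 5*5
  - pooling: because of its success in state-of-the-art networks
  - concatenate the filter banks above
- Same depth/width but sparser NN architecture:
  - The key is applying dimension reduction to the original Naive Inception module to make the model's connections sparser, while projection keeps the model's width/depth unchanged.
- Problem: computational budget:
  - from stage to stage the layers get thicker (the filter banks are concatenated) -> even a modest number of 5*5 convs becomes very expensive
- Solution: reduction + projection
  - reduction: 1*1 conv + ReLU
    - why: even a more compressed, low-dimensional embedding can hold the information of a relatively large image patch
    - effect of the 1*1 conv: reduction + ReLU activation
  - projection:
    - why: embeddings are a very dense, compressed form, and compressed information is hard to model
    - how: keep the representation sparse at most places; reduce only where aggregation is needed (i.e. apply the 1*1 reduction only right before the 3*3 and 5*5 convs)
- Advantages:
  - more units per layer (increased width) and more stages (increased depth) without the computational complexity blowing up
  - multi-scale (features extracted from different receptive fields)
  - with careful manual design, NNs of similar performance can be 2 to 3 times faster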
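
The effect of the 1*1 reduction can be checked with a rough multiply-accumulate (MAC) count. The sketch below compares a naive module against a reduced one; the channel counts and the 28*28 feature map are illustrative assumptions (not taken from these notes), and stride 1, "same" padding, and no bias terms are assumed.

```python
def conv_macs(c_in, c_out, k, h, w):
    """MACs of a k*k conv with stride 1 and 'same' padding, bias ignored."""
    return c_in * c_out * k * k * h * w

def naive_module(c_in, n1, n3, n5, h, w):
    """Naive Inception: 1*1, 3*3, 5*5 convs and a pooling branch, all on the raw input."""
    macs = (conv_macs(c_in, n1, 1, h, w)
            + conv_macs(c_in, n3, 3, h, w)
            + conv_macs(c_in, n5, 5, h, w))
    out_channels = n1 + n3 + n5 + c_in  # pooling passes all c_in channels through
    return macs, out_channels

def reduced_module(c_in, n1, r3, n3, r5, n5, pool_proj, h, w):
    """Inception with 1*1 reductions before 3*3 / 5*5 and a 1*1 projection after pooling."""
    macs = (conv_macs(c_in, n1, 1, h, w)
            + conv_macs(c_in, r3, 1, h, w) + conv_macs(r3, n3, 3, h, w)
            + conv_macs(c_in, r5, 1, h, w) + conv_macs(r5, n5, 5, h, w)
            + conv_macs(c_in, pool_proj, 1, h, w))
    return macs, n1 + n3 + n5 + pool_proj

# Illustrative (assumed) sizes: 192 input channels on a 28*28 map.
naive_macs, naive_out = naive_module(192, 64, 128, 32, 28, 28)
reduced_macs, reduced_out = reduced_module(192, 64, 96, 128, 16, 32, 32, 28, 28)
print(naive_macs, naive_out)      # MACs and output channels of the naive module
print(reduced_macs, reduced_out)  # MACs and output channels of the reduced module
```

With these assumed sizes the reduced module needs well under half the MACs of the naive one, and its output stays at 256 channels instead of growing to 416, which is the "layers get thicker from stage to stage" problem noted above.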

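GoogLeNet
- Ensemble:
  - 6 Inception NNs of identical structure, with different sampling methods (the best one is named GoogLeNet) and different image crop sizes
  - 1 deeper, wider Inception NN, with inferior performance
- Network structure:
  - AveragePool before the softmax: from NIN, 0.6% improvement (top-1); not to be confused with Polyak averaging, which is used for inference.
  - FC after the AveragePool and before the softmax: only for convenience (makes it easy to fine-tune on other label sets)
  - dropout: still useful
  - Auxiliary classifiers:
    - their losses are added to the total loss (weight 0.3)
    - discarded at inference
    - act as regularizers (from v3):
      - the model performs better when BN/dropout is added to the side head
    - Drawbacks (from v3):
      - removing the lower head has no effect
      - they only help late in training, when accuracy is already high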
- Training:
  - momentum 0.9
  - lr decreased by 4% every 8 epochs
  - Polyak averaging to create the final model used at inference
  - DistBelief for data-parallelism
  - CPU based (GPU: memory usage problem)
  - crops covering 8%-100% of the image area
  - aspect ratio between 3/4 and 4/3
  - photometric distortions (combat overfitting)
  - random interpolation methods
  - Drawback: no indication of which design choices are the contributing factors
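- Testing stage
  - 7 models: same initialization, different sampling & input order
  - 4*3*6*2 = 144 crops
    - 4 scales
    - 3 squares
    - 6 crops (including the whole square resized)
    - mirror
  - softmax probabilities averaged over crops & models (144*7); this is plain averaging of outputs, whereas Polyak averaging refers to averaging the weights of a model
- Model comparison: see [2018.04.24] Inception-v3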
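
The training settings listed above can be sketched as small helpers. This is only an illustration under assumptions: the base learning rate of 0.01, the staircase form of the decay, the two auxiliary heads, and the retry-then-fallback crop sampler are not specified in these notes, and all function names are hypothetical.

```python
import math
import random

def learning_rate(epoch, base_lr=0.01, decay=0.96, step=8):
    """lr decreased by 4% every 8 epochs (staircase form assumed)."""
    return base_lr * decay ** (epoch // step)

def total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus auxiliary classifier losses, each weighted by 0.3."""
    return main_loss + aux_weight * sum(aux_losses)

def sample_crop(width, height, rng, max_tries=10):
    """Sample (x, y, w, h) with area in 8%-100% of the image and
    aspect ratio (w / h) in [3/4, 4/3]; fall back to the whole image."""
    for _ in range(max_tries):
        area = rng.uniform(0.08, 1.0) * width * height
        ratio = rng.uniform(3 / 4, 4 / 3)
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    return 0, 0, width, height

rng = random.Random(0)  # fixed seed so the sketch is reproducible
crop = sample_crop(500, 375, rng)
print(learning_rate(0), learning_rate(8), learning_rate(16))
print(total_loss(1.0, [0.5, 0.5]))
print(crop)
```

At inference the auxiliary terms are simply dropped, so only the main head's output is used.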

Experimental Results

ILSVRC 2014 Classification
- Data:
  - Training: 1.2M
  - Val: 50K
  - Test: 100K
  - no external data
  - Classes: 1000
- Metric: top-5 error
- Testing: same protocol as the Testing stage under GoogLeNet (7 models, 4*3*6*2 = 144 crops per image, softmax probabilities averaged over crops & models)
- result: 6.67% top-5 error
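
The 144-crop count and the top-5 criterion can be made concrete with a short sketch. The scale values (shorter side 256/288/320/352) and the crop names are assumptions filled in for illustration, and the probability vectors are made-up data; the helper names are hypothetical.

```python
from itertools import product

SCALES = (256, 288, 320, 352)           # assumed shorter-side sizes
SQUARES = ("left", "center", "right")   # 3 squares per resized image
CROPS = ("top_left", "top_right", "bottom_left", "bottom_right",
         "center", "full_resized")      # 4 corners, center, whole square resized
MIRROR = (False, True)

def enumerate_crops():
    """All (scale, square, crop, mirrored) combinations for one image."""
    return list(product(SCALES, SQUARES, CROPS, MIRROR))

def average_probs(prob_lists):
    """Element-wise mean of equally long softmax vectors (over crops and models)."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]

def top5_correct(probs, label):
    """True if `label` is among the 5 highest-probability classes."""
    top5 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:5]
    return label in top5

crops = enumerate_crops()
print(len(crops))

# Made-up softmax outputs over 6 classes from two crops.
avg = average_probs([[0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
                     [0.10, 0.35, 0.20, 0.15, 0.17, 0.03]])
print(top5_correct(avg, 4), top5_correct(avg, 5))
```

Top-5 error over a dataset is then the fraction of images for which `top5_correct` is false.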

ILSVRC 2014 Detection
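- Metric: mAP (IoU 0.5)
- Data:
  - Classes: 200
  - External: ILSVRC 2012 Classification 1k
- Model:
  - R-CNN improved: (SS + Multibox) proposals + Inception as the region classifier
  - SS: superpixel size * 2 -> half as many SS proposals
  - Multibox: 200 proposals added back for higher object bounding box recall
  - in total about 60% of the proposals used by R-CNN, with a 1% mAP increase for a single model
  - 6 models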
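
For the detection metric, a detection counts as correct when its class matches and its box overlaps the ground truth with IoU of at least 0.5. A minimal sketch of that check follows; the `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, gt_box, pred_class, gt_class, threshold=0.5):
    """Class must match and IoU must reach the threshold."""
    return pred_class == gt_class and iou(pred_box, gt_box) >= threshold

gt = (0, 0, 10, 10)
print(iou(gt, (2, 0, 12, 10)))  # large overlap
print(iou(gt, (5, 0, 15, 10)))  # overlap below the 0.5 threshold
print(is_true_positive((2, 0, 12, 10), gt, "dog", "dog"))
```

mAP then averages, over the 200 classes, the average precision computed from detections labelled this way.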

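Questions

Architecture: why should the ratio of 3*3 and 5*5 filters gradually increase just because higher layers are more abstract? And in practice their ratio does not seem to increase?
Sec 8 last line: why is bounding box regression not used, and why does reaching comparable results without it indicate better performance?
Some of the models used a bounding box regression pre-trained model.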