FPN

2018-06-17 初七123

Feature Pyramid Networks for Object Detection

Introduction

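The authors describe four forms of feature pyramid:
(a) Downsample the image to several scales and predict on each scale separately
(b) Predict from the last layer of the network only
(c) Predict from each layer of the network separately
(d) The FPN approach: upsample, fuse with the original feature maps, and predict independently at each level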

Related Work

Hand-engineered features and early neural networks
Hand-crafted features
Deep ConvNet object detectors
Deep ConvNet features
Methods using multiple layers
Features from multiple ConvNet layers

Feature Pyramid Networks

The concrete FPN structure: lateral connections use 1 x 1 convolutions, and the top-down pathway uses 2x upsampling

Bottom-up pathway
for ResNets [16] we use the feature activations output by each stage’s last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.
The authors note that C1 is left out because of its large memory footprint.
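To make the strides concrete, here is a minimal sketch that computes the approximate spatial size of each stage output for a given input. The function name and the 800 x 1216 input are illustrative assumptions, and the ceiling division is also an assumption: the exact size depends on the padding used by the backbone.

```python
# Strides of C2..C5 relative to the input image, as quoted from the paper.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width):
    """Return the approximate (rows, cols) of each stage output.

    Ceiling division is an assumption; real backbones may differ by a pixel
    depending on their padding.
    """
    return {
        name: (-(-height // s), -(-width // s))  # ceil(height / s), ceil(width / s)
        for name, s in STRIDES.items()
    }


# Hypothetical input size, chosen only for illustration.
sizes = feature_map_sizes(800, 1216)
for name, shape in sizes.items():
    print(name, shape)
```

Each level has half the resolution of the one before it, which is why a single 2x upsampling step is enough to align neighbouring levels.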

Top-down pathway and lateral connections
The authors apply a 3x3 convolution to each merged map, "which is to reduce the aliasing effect of upsampling".
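One top-down step can be sketched on toy single-channel maps. This is a simplification under stated assumptions, not the paper's implementation: the maps are plain 2-D lists, the 1 x 1 lateral convolution is reduced to a per-pixel scale (which is what it becomes with a single channel), the function names are hypothetical, and the 3x3 smoothing convolution is left out.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def lateral(fmap, weight):
    """Single-channel stand-in for a 1x1 convolution: a per-pixel scale."""
    return [[weight * v for v in row] for row in fmap]


def merge(coarser, finer, weight=1.0):
    """Top-down step: upsample the coarser map and add the lateral map."""
    up = upsample2x(coarser)
    lat = lateral(finer, weight)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lat)]


# Toy values: a 2x2 coarse map and a 4x4 finer bottom-up map.
p5 = [[1.0, 2.0], [3.0, 4.0]]
c4 = [[0.5] * 4 for _ in range(4)]
p4 = merge(p5, c4)
for row in p4:
    print(row)
```

In a real network the same step runs per channel on tensors, and the merged map is then passed through the 3x3 convolution mentioned above.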

Applications

RPN

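The authors note that with FPN there is no need for multi-scale anchors on a given level; each level only needs anchors of different aspect ratios.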

Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively.
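A small sketch of what "one scale per level, several aspect ratios" means in practice. The areas come from the quote above; the three ratios 1:2, 1:1, and 2:1 are the ones used in the paper as far as I recall, so treat them as an assumption here. The helper name is hypothetical.

```python
import math

# One anchor area per pyramid level, as quoted from the paper.
ANCHOR_AREAS = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
# Aspect ratios as height / width (assumed: 1:2, 1:1, 2:1).
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(area, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs that all have the given area."""
    shapes = []
    for ratio in ratios:
        width = math.sqrt(area / ratio)
        height = width * ratio
        shapes.append((width, height))
    return shapes


anchors = {level: anchor_shapes(area) for level, area in ANCHOR_AREAS.items()}
total = sum(len(shapes) for shapes in anchors.values())
print(total)  # number of anchor shapes over the whole pyramid
```

With five levels and three ratios per level, this gives 15 anchor shapes over the pyramid.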

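Positive and negative samples are generated as in Faster R-CNN: an anchor with IoU > 0.7 against a ground-truth box is positive, and an anchor with IoU < 0.3 is negative.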
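The labelling rule can be sketched as follows. This is a simplified version under assumptions: boxes are (x1, y1, x2, y2) tuples, anchors between the two thresholds get an "ignore" label of -1 (a common convention, not something stated in these notes), and the extra Faster R-CNN rule that also marks the highest-IoU anchor of each ground-truth box as positive is omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored, an assumed convention)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


gt_boxes = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 10), gt_boxes))    # full overlap
print(label_anchor((20, 20, 30, 30), gt_boxes))  # no overlap
print(label_anchor((0, 0, 10, 5), gt_boxes))     # IoU of 0.5, between thresholds
```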

Fast R-CNN

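In Fast R-CNN, FPN mainly comes into play when choosing which level's feature map to extract for RoI pooling. The feature pyramid is treated as if it had been produced from an image pyramid.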

Formally, we assign an RoI of width w and height h (on the input image to the network) to the level P(k) of our feature pyramid by

k = ⌊k0 + log2(√(wh) / 224)⌋

Here 224 is the canonical ImageNet pre-training size, and k0 is the target level on which an RoI with w × h = 224² should be mapped into. In the paper, k0 is set to 4.
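The level assignment in the paper, k = floor(k0 + log2(sqrt(w * h) / 224)) with k0 = 4, can be sketched in a few lines. The function name is hypothetical, and clamping the result to the range P2..P5 is an assumption borrowed from common implementations rather than something stated in these notes.

```python
import math


def roi_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Map an RoI of size w x h (in input-image pixels) to a pyramid level k.

    Clamping to [k_min, k_max] is an assumption; the formula itself has no bounds.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448, 1000):
    print(size, roi_level(size, size))
```

An RoI of 224 x 224 lands on P4, and halving its side length moves it one level finer, so small RoIs are pooled from the high-resolution levels.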

Experiments on Object Detection

RPN
AR = Average Recall
s = small
m = medium
l = large

Comparisons with baselines
Comparing (a), (b), and (c) shows that FPN performs well.

How important is top-down enrichment?
(d) Without the top-down pathway, results get worse.
We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid.

How important are lateral connections?
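(e) Without lateral connections, results get worse,
because the fine detail of the bottom-up maps is lost after repeated downsampling and upsampling.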

How important are pyramid representations?
(f) Results without the pyramid structure:
they are decent, but the large number of anchors makes this variant inefficient.

Fast R-CNN

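Here the FPN-based RPN is used to generate the region proposals, which are kept as a fixed set,
but the two networks do not share features.
The results show that FPN also works well in the Fast R-CNN detection stage.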

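The comparison of (a), (b), and (c) shows that, for region-based object detection, a feature pyramid is more effective than single-scale features. The gap between (c) and (f) is small; the authors attribute this to RoI pooling being insensitive to the scale of the region. So fusing features the way (f) does cannot simply be dismissed as bad. In my view it depends on the specific problem: in the RPN above, (f) may not work so well, but in Fast R-CNN the difference is much less pronounced.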

Faster R-CNN
Finally, FPN is applied to Faster R-CNN, with shared network parameters.
Detection of small objects improves markedly.

Extensions: Segmentation Proposals

Not read.

Commentary from 机器之心

On-site Q&A at CVPR:

1. Why can feature maps from different depths be added directly after upsampling?

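A: The authors explained that this works because training is end-to-end. The parameters of the different layers are not fixed, and all levels receive supervision at the same time, so what is learned through the addition fuses shallow and deep information more effectively.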

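2. Why does FPN improve small-object detection so clearly compared with dropping the upsampled deep features (bottom-up pyramid)? (In the RPN step AR goes from 30.5 to 44.9; in the Fast R-CNN step AP goes from 24.9 to 33.9.)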

A: The authors gave the answer to this question on their poster.

FPN

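For small objects, on the one hand we need a high-resolution feature map that attends to small regions; on the other hand, as with the handbag in the figure, we need more global information to judge the presence and location of the handbag more accurately.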

3. If time is not a concern, could an image pyramid outperform a feature pyramid?

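A: The authors think this is possible with careful tuning and training, but the main problem with an image pyramid is its large time and memory cost, whereas a feature pyramid handles multi-scale detection with almost no extra computation.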
