Paper | Detecting Twenty-thousan
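Preface
- Source: ECCV 2022
- Model name: Detic
- Overall summary: like the earlier OVD-Net, this paper tackles the open-vocabulary problem from the pretraining side, but its approach is simpler and blunter: it directly adds the ImageNet classes to training. In essence this is not truly open vocabulary, but it can cover about 20,000 classes.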
1. Introduction:
- Object detection (OD) has two subtasks: 1) finding boxes (localization); 2) naming the boxes (classification).
- Previous works couple these two subtasks.
- However, detection benchmarks are much smaller than classification benchmarks: as the figure shows, LVIS (OD) has far fewer images and far fewer categories than ImageNet (CLS).
![](https://img.haomeiwen.com/i9933353/56e6bdb70429e158.png)
This paper:
- proposes a detector with image classes (Detic) that uses image-level supervision in addition to detection supervision;
- decouples the localization and classification sub-problems;
- uses image-level labels to train the classifier and broaden the vocabulary of the detector.

Illustration:
![](https://img.haomeiwen.com/i9933353/193a8c78fd802d38.png)
- Standard OD: needs ground-truth boxes and labels.
- Weakly supervised OD: assigns image-level labels to predicted boxes (error-prone).
- This paper: assigns image-level labels to the max-size proposal.
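
The max-size rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes proposals are given as `(x1, y1, x2, y2)` boxes and that "size" means box area; the function name `max_size_proposal` and the sample boxes are hypothetical.

```python
import numpy as np

def max_size_proposal(boxes):
    """Return the index of the largest-area box.

    boxes: array-like of shape (N, 4) in (x1, y1, x2, y2) format (assumed).
    """
    boxes = np.asarray(boxes, dtype=float)
    # Clip at zero so degenerate boxes never get a negative width or height.
    widths = np.clip(boxes[:, 2] - boxes[:, 0], 0, None)
    heights = np.clip(boxes[:, 3] - boxes[:, 1], 0, None)
    return int(np.argmax(widths * heights))

# Hypothetical proposals for one image.
proposals = [[10, 10, 50, 50], [0, 0, 200, 150], [30, 40, 90, 120]]
idx = max_size_proposal(proposals)
print(idx)
```

The image-level labels are then attached to `proposals[idx]` only, so no per-box label assignment from predictions is needed.
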
2 Method
2.1 preliminary
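- detection dataset $D_{det}$, with class set $C_{det}$
- image classification dataset $D_{cls}$, with class set $C_{cls}$
- testing dataset with class set $C_{test}$.
- $C_{det}$, $C_{cls}$, and $C_{test}$ may or may not overlap.

Traditional OD: $C_{test} = C_{det}$ and $D_{cls} = \emptyset$.
OVD: allows $C_{test} \neq C_{det}$.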
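
To make the two settings concrete, here is a small sketch using Python sets. It assumes the test class set is written `c_test`, that traditional OD means the test classes equal the detection classes with no image-labeled data, and that OVD means the test classes may differ from the detection classes. The function name and the class names are hypothetical.

```python
def detection_setting(c_det, c_test, d_cls):
    """Name the setting implied by the class sets and the image-labeled data."""
    c_det, c_test = set(c_det), set(c_test)
    if c_test != c_det:
        # Test classes differ from the detection classes.
        return "open-vocabulary"
    if len(d_cls) == 0:
        # Same classes at train and test time, no image-level data.
        return "traditional"
    return "closed-vocabulary with image-level data"

c_det = {"cat", "dog"}
print(detection_setting(c_det, {"cat", "dog"}, d_cls=[]))
print(detection_setting(c_det, {"cat", "dog", "lion"}, d_cls=["img_0.jpg"]))
```
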
2.2 Detic
The whole idea is quite simple:
- use both the detection dataset $D_{det}$ and the classification dataset $D_{cls}$ to train the detection model.
![](https://img.haomeiwen.com/i9933353/babdd48d06fc5e5f.png)
- sample a batch from both $D_{det}$ and $D_{cls}$.
- if an image belongs to $D_{det}$, the loss is the typical OD loss: RPN loss + box regression loss + classification loss.
- if an image belongs to $D_{cls}$, the loss is the max-size loss: the proposal with the largest size is taken as the region for the image-level labels and is used to calculate the classification loss.
![](https://img.haomeiwen.com/i9933353/3ec1d43cf06dc44c.png)
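
The per-image loss switch described above can be sketched as follows. This is a hedged illustration rather than the Detic code: it assumes the classification loss is a binary cross-entropy over classes averaged per class, that boxes are `(x1, y1, x2, y2)`, and that the detector has already produced the individual loss terms and per-proposal logits. All names (`training_loss`, the `sample` dictionary keys) are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed stably from raw logits."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Stable form: max(x, 0) - x * t + log(1 + exp(-|x|))
    per_class = (np.maximum(logits, 0) - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(per_class.mean())

def max_size_loss(boxes, class_logits, image_labels):
    """Classification loss on the largest proposal, using image-level labels."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    j = int(np.argmax(areas))  # index of the max-size proposal
    return bce_with_logits(np.asarray(class_logits, dtype=float)[j], image_labels)

def training_loss(sample):
    """Pick the loss according to which dataset the image came from."""
    if sample["source"] == "det":
        # Detection image: RPN + box regression + classification losses.
        return sample["rpn_loss"] + sample["reg_loss"] + sample["cls_loss"]
    # Classification image: only the max-size loss.
    return max_size_loss(sample["boxes"], sample["class_logits"],
                         sample["image_labels"])

det_sample = {"source": "det", "rpn_loss": 0.5, "reg_loss": 0.25, "cls_loss": 1.0}
cls_sample = {
    "source": "cls",
    "boxes": [[0, 0, 10, 10], [0, 0, 100, 100]],
    "class_logits": [[5.0, -5.0], [0.0, 0.0]],  # one row of logits per proposal
    "image_labels": [1, 0],                      # multi-hot image-level labels
}
det_loss = training_loss(det_sample)
cls_loss = training_loss(cls_sample)
```

Note that a classification image contributes no RPN or box regression term in this sketch, since it has no ground-truth boxes to regress to.
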