Paper | Detecting Twenty-thousan

2023-12-13 与阳光共进早餐

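Preface

Venue: ECCV 2022
Model name: Detic
Summary: Like the earlier OVD-Net, this paper tackles the open-vocabulary problem from the pretraining angle, but its approach is simpler and more direct: it adds the ImageNet classes to training. Strictly speaking this is not truly open vocabulary, but it lets the detector cover roughly twenty thousand categories.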

Object detection (OD) has two subtasks: 1) finding the boxes (localization); 2) naming the boxes (classification).
Previous works couple these two subtasks.
However, detection benchmarks are much smaller than classification benchmarks.

As the figure shows, both the number of images and the number of categories in LVIS (OD) are much smaller than in ImageNet (CLS).

image.png

This paper:

proposes Detic (Detector with image classes), which uses image-level supervision in addition to detection supervision.

decouples the localization and classification sub-problems;
uses image-level labels to train the classifier and broaden the vocabulary of the detector.

illustration:

image.png

standard OD: needs ground-truth boxes and labels;

weakly supervised OD: assigns image-level labels to predicted boxes [error-prone];

this paper: assigns the image-level labels to the max-size proposal.
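
To make the "max-size" idea concrete, here is a minimal sketch of how the largest proposal could be picked. The `(x1, y1, x2, y2)` box format and the helper names `box_area` and `max_size_proposal` are assumptions made for illustration; they are not taken from the paper's code.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes get area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def max_size_proposal(proposals: List[Box]) -> int:
    """Return the index of the proposal with the largest area."""
    if not proposals:
        raise ValueError("need at least one proposal")
    return max(range(len(proposals)), key=lambda i: box_area(proposals[i]))


# Hypothetical proposals for one classification image.
proposals = [(10, 10, 50, 40), (0, 0, 200, 150), (30, 20, 90, 100)]
chosen = max_size_proposal(proposals)
```

The choice depends only on box geometry, not on the classifier's current predictions, which is what separates it from the prediction-based assignment of weakly supervised OD described above.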

traditional OD: $C_{test} = C_{det}$, $D_{cls} = \emptyset$

OVD: allows $C_{test} \neq C_{det}$

the whole idea is quite simple.

use both the detection dataset $D_{det}$ and the classification dataset $D_{cls}$ to train the detection model.

image.png

sample a batch from both $D_{det}$ and $D_{cls}$.
if the image belongs to $D_{det}$, the loss is the typical OD loss: RPN loss + box regression loss + classification loss.
if the image belongs to $D_{cls}$, the loss is the max-size loss: the proposal with the largest size is taken as the region for the image, and the classification loss is calculated on that proposal with the image-level labels.
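
The two branches above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the dictionary layout of `sample`, the function names, and the use of binary cross-entropy for the classification term are all assumptions, and the detection-side losses are passed in as precomputed numbers.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def area(box: Box) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy over classes, in a numerically stable form."""
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def max_size_loss(proposals: List[Box],
                  logits: List[List[float]],
                  image_labels: Set[int]) -> float:
    """Classification loss on the largest proposal, using image-level labels."""
    j = max(range(len(proposals)), key=lambda i: area(proposals[i]))
    num_classes = len(logits[j])
    targets = [1.0 if c in image_labels else 0.0 for c in range(num_classes)]
    return bce_with_logits(logits[j], targets)


def training_loss(sample: Dict) -> float:
    """Pick the loss according to which dataset the image comes from."""
    if sample["source"] == "det":
        # Typical OD loss; the three terms are assumed to be precomputed.
        return sample["rpn_loss"] + sample["reg_loss"] + sample["cls_loss"]
    # Image from the classification dataset: only the max-size loss applies.
    return max_size_loss(sample["proposals"], sample["logits"], sample["labels"])


# Hypothetical samples, one from each dataset.
det_sample = {"source": "det", "rpn_loss": 0.2, "reg_loss": 0.3, "cls_loss": 0.5}
cls_sample = {
    "source": "cls",
    "proposals": [(0, 0, 10, 10), (0, 0, 100, 80)],
    "logits": [[3.0, -3.0], [0.0, 0.0]],  # per-proposal class logits
    "labels": {0},                        # image-level label set
}
det_loss = training_loss(det_sample)
cls_loss = training_loss(cls_sample)
```

In this sketch, images from $D_{cls}$ contribute no box regression term, so they only influence the classifier, which matches the decoupling of localization and classification described earlier.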

image.png