Paper | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers (RO-ViT)

2023-12-13

1 introduction

1.1 main story

OVD: represents categories as text embeddings rather than discrete class IDs during training.

base set C_{B}, novel set C_{N};
OVD: train on C_{B}, then test on the union C_{B} \cup C_{N};

previous OD: the category set is the same between train and test;
OVD: to handle additional categories at test time, the common practice is to replace the conventional fixed-size fully-connected classifier layer with the text embeddings of the base categories.

For details of the OVD task, see the note on the first paper: https://www.jianshu.com/p/b23de6b4476b

Previous methods: leverage image-text pretraining via knowledge distillation, weak supervision, self-training, or frozen models, but all built on CNNs;

they assume a pretrained VLM is given and develop adaptation or finetuning recipes to bridge the gap between image-level pretraining and object-level finetuning (the pretraining is image-level, but the downstream task is object-level).

This paper:

  1. explores OVD with ViTs
  2. proposes RO-ViT: pretraining the ViT in a region-aware manner for OVD

1.2 related work on OVD:

1) learning alignment between region visual representations and category word embeddings;
2) hallucinating visual features with a generative model;
3) image-text pretraining (this paper falls into this line)

Existing papers are based on CNNs, assume image-text pretrained models are given, and focus on finetuning or adaptation.

This paper: focuses on improving the upstream image-text pretraining with ViTs.

2 method

2.1 common pipeline

  1. proposals not matched to any annotation in C_{B} are labeled as "background".
  2. training: for each region i, compute the detection score p_{i} as the cosine similarity between the region visual embedding and the text embeddings, followed by a softmax;
  3. testing: expand the text embeddings from C_{B} to C_{B} \cup C_{N} \cup {"background"}.
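A minimal PyTorch sketch of this scoring step; the names `region_embeds`, `text_embeds`, the learned background embedding `bg_embed`, and the temperature value are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def detection_scores(region_embeds: torch.Tensor,   # (N, D) region visual embeddings
                     text_embeds: torch.Tensor,     # (C, D) category text embeddings
                     bg_embed: torch.Tensor,        # (1, D) learned "background" embedding
                     temperature: float = 0.01) -> torch.Tensor:
    # cosine similarity = dot product of L2-normalized embeddings
    r = F.normalize(region_embeds, dim=-1)
    t = F.normalize(torch.cat([text_embeds, bg_embed], dim=0), dim=-1)
    logits = r @ t.t() / temperature                # (N, C+1) similarities
    return logits.softmax(dim=-1)                   # per-region detection scores p_i
```

Training uses the embeddings of the base categories C_{B}; at test time one simply swaps in embeddings for C_{B} \cup C_{N} without retraining the detector.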

2.2 Region-Aware Image-Text Pretraining

existing methods: align the whole image with the text;
this paper: proposes novel cropped positional embeddings (CPE) to make pretraining aware of regions.

cropped positional embeddings (CPE):
positional embeddings are key to transformers, providing information on where each element comes from.
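As a reminder of the standard mechanism (a generic ViT sketch with illustrative shapes, not this paper's code), the learned positional embeddings are simply added to the patch tokens before the encoder:

```python
import torch

# a 224x224 image with 16x16 patches -> 14x14 = 196 patch tokens of dim 768
tokens = torch.randn(1, 196, 768)                         # patch embeddings
pos_embed = torch.nn.Parameter(torch.zeros(1, 196, 768))  # learned positional table
x = tokens + pos_embed                                    # inject "where" information
```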

The overall framework consists of three parts:

  1. Left: apart from the CPE module, this is the familiar contrastive learning on {img, caption} pairs, except that the loss is changed from the usual softmax cross-entropy to a focal loss;
  2. Middle: the CPE module, i.e. how this paper achieves region-awareness at the pretraining stage. The full-image positional embeddings are transformed in two steps: 1) first upsample them to a size common in detection, e.g. from the original 224x224xD embedding to the typical detection input size 1024x1024xD; 2) then random crop and resize from the 1024x1024xD embeddings back to a positional embedding of the original image size (see the sketch after this list). This makes the model treat the current image as one region cropped from some larger image.
  3. Right: when transferring downstream, the GAP (global average pooling) used at pretraining time is replaced with a detector head;
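A minimal sketch of the CPE transformation, assuming the ViT's learned positional embeddings form a (grid, grid, dim) table; the function and argument names (`up_grid`, `min_crop`) are my own, and the crop sampling is simplified relative to the paper:

```python
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed: torch.Tensor,  # (grid, grid, dim)
                                 up_grid: int = 64,        # detection-scale grid size
                                 min_crop: int = 8) -> torch.Tensor:
    """Return a (grid, grid, dim) table that looks like a random region crop
    taken from a larger image's positional embeddings."""
    grid, _, dim = pos_embed.shape
    x = pos_embed.permute(2, 0, 1).unsqueeze(0)             # (1, dim, grid, grid)
    # 1) upsample to the larger grid used at detection time
    x = F.interpolate(x, size=(up_grid, up_grid), mode="bilinear", align_corners=False)
    # 2) sample a random crop box on the upsampled grid
    h = int(torch.randint(min_crop, up_grid + 1, ()))
    w = int(torch.randint(min_crop, up_grid + 1, ()))
    top = int(torch.randint(0, up_grid - h + 1, ()))
    left = int(torch.randint(0, up_grid - w + 1, ()))
    x = x[:, :, top:top + h, left:left + w]
    # 3) resize the crop back to the pretraining grid
    x = F.interpolate(x, size=(grid, grid), mode="bilinear", align_corners=False)
    return x.squeeze(0).permute(1, 2, 0)                    # (grid, grid, dim)
```

One such randomly cropped table would be sampled per image during pretraining, so the encoder sees each image as a region of a larger canvas; at detection time the full upsampled table matches the detector's input size directly.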

(Note: in my opinion the core of this paper is establishing the region-text relationship already at the pretraining stage, implemented by the CPE module's random transformation of the positional embeddings; other details such as the loss modification are not covered here, and interested readers can follow the original paper.)
