Paper | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers (RO-ViT)

2023-12-13

1 introduction

1.1 main story

OVD: represents categories as text embeddings rather than discrete class IDs during training.

base set C_{B}, novel set C_{N};
OVD: train on C_{B}, then test on the union C_{B} \cup C_{N};

previous OD: the category set is the same between train and test;
OVD: to handle additional categories at test time, the common practice is to replace the conventional fixed-size fully-connected classifier layer with the text embeddings of the base categories.

For details of the OVD task, see the note on the first paper: https://www.jianshu.com/p/b23de6b4476b

Previous methods: leverage image-text pretraining via knowledge distillation, weak supervision, self-training, or frozen models, but all built on CNNs;

they assume a pretrained VLM is given and develop adaptation or finetuning recipes to bridge the gap between image-level pretraining and object-level finetuning (the pretraining is image-level, but the downstream task is object-level).

This paper:

  1. explores OVD with ViTs
  2. proposes RO-ViT: pretraining the ViT in a region-aware manner for OVD

1.2 related work on OVD:

1) learning alignment between region visual representations and category word embeddings;
2) hallucinating visual features with a generative model;
3) image-text pretraining (this paper falls into this line)

Existing papers are based on CNNs, assume image-text pretrained models are given, and focus on finetuning or adaptation.

This paper: focuses on improving the upstream image-text pretraining with ViTs.

2 method

2.1 common pipeline

  1. proposals not matched to any annotation in C_{B} are labeled as "background".
  2. training: for each region i, compute the detection score p_{i} as the cosine similarity between the region visual embedding and the text embeddings, followed by a softmax;
  3. testing: expand the text embeddings from C_{B} to C_{B} \cup C_{N} \cup {"background"}.
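A minimal PyTorch sketch of this scoring step; the names `region_embeds`, `text_embeds`, the learned background embedding `bg_embed`, and the temperature value are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def detection_scores(region_embeds: torch.Tensor,   # (N, D) region visual embeddings
                     text_embeds: torch.Tensor,     # (C, D) category text embeddings
                     bg_embed: torch.Tensor,        # (1, D) learned "background" embedding
                     temperature: float = 0.01) -> torch.Tensor:
    # cosine similarity = dot product of L2-normalized embeddings
    r = F.normalize(region_embeds, dim=-1)
    t = F.normalize(torch.cat([text_embeds, bg_embed], dim=0), dim=-1)
    logits = r @ t.t() / temperature                # (N, C+1) similarities
    return logits.softmax(dim=-1)                   # per-region detection scores p_i
```

Training uses the embeddings of the base categories C_{B}; at test time one simply swaps in embeddings for C_{B} \cup C_{N} without retraining the detector.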

2.2 Region-Aware Image-Text Pretraining

existing methods: align the whole image with the text;
this paper: proposes novel cropped positional embeddings (CPE) to make pretraining aware of regions.

cropped positional embeddings (CPE):
positional embeddings are key to transformers, providing information on where each element comes from.
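As a reminder of the standard mechanism (a generic ViT sketch with illustrative shapes, not this paper's code), the learned positional embeddings are simply added to the patch tokens before the encoder:

```python
import torch

# a 224x224 image with 16x16 patches -> 14x14 = 196 patch tokens of dim 768
tokens = torch.randn(1, 196, 768)                         # patch embeddings
pos_embed = torch.nn.Parameter(torch.zeros(1, 196, 768))  # learned positional table
x = tokens + pos_embed                                    # inject "where" information
```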

The overall framework consists of three parts:

  1. Left: apart from the CPE module, this is the familiar contrastive learning on {img, caption} pairs, except that the loss is changed from the usual softmax cross-entropy to a focal loss;
  2. Middle: the CPE module, i.e. how this paper achieves region-awareness at the pretraining stage. The full-image positional embeddings are transformed in two steps: 1) first upsample them to a size common in detection, e.g. from the original 224x224xD embedding to the typical detection input size 1024x1024xD; 2) then random crop and resize from the 1024x1024xD embeddings back to a positional embedding of the original image size (see the sketch after this list). This makes the model treat the current image as one region cropped from some larger image.
  3. Right: when transferring downstream, the GAP (global average pooling) used at pretraining time is replaced with a detector head;
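A minimal sketch of the CPE transformation, assuming the ViT's learned positional embeddings form a (grid, grid, dim) table; the function and argument names (`up_grid`, `min_crop`) are my own, and the crop sampling is simplified relative to the paper:

```python
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed: torch.Tensor,  # (grid, grid, dim)
                                 up_grid: int = 64,        # detection-scale grid size
                                 min_crop: int = 8) -> torch.Tensor:
    """Return a (grid, grid, dim) table that looks like a random region crop
    taken from a larger image's positional embeddings."""
    grid, _, dim = pos_embed.shape
    x = pos_embed.permute(2, 0, 1).unsqueeze(0)             # (1, dim, grid, grid)
    # 1) upsample to the larger grid used at detection time
    x = F.interpolate(x, size=(up_grid, up_grid), mode="bilinear", align_corners=False)
    # 2) sample a random crop box on the upsampled grid
    h = int(torch.randint(min_crop, up_grid + 1, ()))
    w = int(torch.randint(min_crop, up_grid + 1, ()))
    top = int(torch.randint(0, up_grid - h + 1, ()))
    left = int(torch.randint(0, up_grid - w + 1, ()))
    x = x[:, :, top:top + h, left:left + w]
    # 3) resize the crop back to the pretraining grid
    x = F.interpolate(x, size=(grid, grid), mode="bilinear", align_corners=False)
    return x.squeeze(0).permute(1, 2, 0)                    # (grid, grid, dim)
```

One such randomly cropped table would be sampled per image during pretraining, so the encoder sees each image as a region of a larger canvas; at detection time the full upsampled table matches the detector's input size directly.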

(Note: in my opinion the core of this paper is establishing the region-text relationship already at the pretraining stage, implemented by the CPE module's random transformation of the positional embeddings; other details such as the loss modification are not covered here, and interested readers can follow the original paper.)
