AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
Nov 2022
https://arxiv.org/abs/2211.06679
https://github.com/flagai-open/flagai
In this work, we present a conceptually simple and effective method for training a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we alter its text encoder by replacing it with the pre-trained multilingual text encoder XLM-R, and align the language and image representations through a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations on a wide range of tasks. We set new state-of-the-art performance on a number of tasks, including ImageNet-CN, Flickr30k-CN, COCO-CN, and XTD. Furthermore, we obtain performance very close to that of CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP to gain extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.
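The two-stage schema maps naturally onto a short PyTorch sketch. The following is an illustrative reconstruction from the abstract alone, not the official FlagAI implementation; the class names, the use of a parallel English–Chinese text corpus for distillation, the mean pooling, and the MSE distillation loss are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel, XLMRobertaModel

class MultilingualTextEncoder(nn.Module):
    """XLM-R backbone plus a linear projection into CLIP's text embedding space."""
    def __init__(self, clip_dim: int = 512):
        super().__init__()
        self.backbone = XLMRobertaModel.from_pretrained("xlm-roberta-large")
        self.proj = nn.Linear(self.backbone.config.hidden_size, clip_dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token states over the attention mask (one plausible pooling choice).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.proj(pooled)

# Stage 1 (teacher learning): distill the frozen CLIP text encoder's embedding of an
# English caption into the student, for both the English caption and its translation,
# so that both languages land near CLIP's image-aligned text space.
def teacher_learning_loss(student, teacher: CLIPModel, en_batch, zh_batch, en_clip_batch):
    with torch.no_grad():
        target = teacher.get_text_features(**en_clip_batch)  # teacher stays frozen
    return F.mse_loss(student(**en_batch), target) + F.mse_loss(student(**zh_batch), target)

# Stage 2 (contrastive learning): standard CLIP-style symmetric InfoNCE between
# image features and the new text encoder's features on image-text pairs.
def contrastive_loss(image_feats, text_feats, logit_scale):
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

Here `en_clip_batch` is the English text tokenized with CLIP's own tokenizer, while `en_batch`/`zh_batch` use the XLM-R tokenizer. With the teacher frozen, stage 1 needs only parallel text; image-text pairs enter only in stage 2.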