MetaFormer

2022-06-22  Valar_Morghulis

MetaFormer is Actually What You Need for Vision

22 Nov 2021

CVPR 2022 (Oral)

https://arxiv.org/abs/2111.11418

https://github.com/sail-sg/poolformer

Authors: Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan

Abstract: Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only the most basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 48%/60% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.
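The core idea is compact enough to sketch in code. Below is a minimal PyTorch sketch of a MetaFormer block in which the token mixer is an interchangeable module, instantiated with the paper's pooling mixer (average pooling minus the input, since the block's residual connection already carries the identity). The class names, the GroupNorm(1, C) normalization (a channel-wise stand-in for LayerNorm on feature maps), and the hyperparameters are illustrative assumptions; the official implementation lives at https://github.com/sail-sg/poolformer.

import torch
import torch.nn as nn


class Pooling(nn.Module):
    # Token mixer: average pooling minus the input. Subtracting x avoids
    # double-counting the identity, which the residual connection adds back.
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x) - x


class Mlp(nn.Module):
    # Channel MLP implemented with 1x1 convolutions on (B, C, H, W) tensors.
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        hidden = dim * mlp_ratio
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


class MetaFormerBlock(nn.Module):
    # MetaFormer abstraction: norm -> token mixer -> residual,
    # then norm -> channel MLP -> residual. Any token mixer
    # (attention, spatial MLP, pooling, ...) can be plugged in.
    def __init__(self, dim: int, token_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        self.mixer = token_mixer
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = Mlp(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


# A PoolFormer block is simply a MetaFormer block with pooling as the mixer.
block = MetaFormerBlock(dim=64, token_mixer=Pooling())
out = block(torch.randn(1, 64, 56, 56))  # shape preserved: (1, 64, 56, 56)

Swapping Pooling() for an attention or spatial-MLP module here reproduces the paper's framing: the surrounding block structure stays fixed while only the token mixer changes.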

