RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer
Jiahao Wang1,2     Songyang Zhang1      Yong Liu3      Taiqiang Wu3      Yujiu Yang3     Xihui Liu2           Kai Chen1      Ping Luo2      Dahua Lin1    
1 Shanghai AI Laboratory     2 The University of Hong Kong     3 Tsinghua University    
Abstract
This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, such as self-attention in vision transformers (ViTs), are intended to perform information communication between different spatial tokens, but suffer from considerable computational cost and latency. However, directly removing them leads to an incomplete model structure prior and thus brings a significant accuracy drop. To this end, we first develop RepIdentityFormer (RIFormer), based on the re-parameterizing idea, to study the token-mixer-free model architecture. We then explore an improved learning paradigm to break the limitations of the simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design.
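To make the re-parameterizing idea concrete, below is a minimal PyTorch-style sketch under our own assumptions (class and method names such as AffineTokenMixer, RIFormerBlock, and reparameterize are hypothetical and simplified, not the official implementation): during training the token mixer is replaced by a per-channel affine transform, which at inference is folded into the preceding LayerNorm, leaving a pure identity in place of the token mixer.

import torch
import torch.nn as nn


class AffineTokenMixer(nn.Module):
    # Per-channel scale and shift used as a training-time stand-in for the token mixer.
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):  # x: (B, N, C) tokens
        return x * self.scale + self.shift


class RIFormerBlock(nn.Module):
    # Simplified block: LayerNorm -> affine "mixer" -> residual, followed by an MLP sub-block.
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = AffineTokenMixer(dim)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

    @torch.no_grad()
    def reparameterize(self):
        # Fold Aff(LN(x)) = s * LN(x) + t into LN itself:
        #   gamma' = s * gamma,  beta' = s * beta + t
        # so the token-mixer branch reduces to a single LayerNorm (an identity mixer).
        s, t = self.mixer.scale, self.mixer.shift
        self.norm1.weight.mul_(s)
        self.norm1.bias.mul_(s).add_(t)
        self.mixer = nn.Identity()

Under these assumptions, calling block.reparameterize() after training leaves the block's outputs unchanged up to numerical precision, while the token-mixing branch costs nothing extra beyond the normalization at inference.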
Results
Results of models with different types of token mixers on ImageNet-1K. ⋄ denotes training with a prolonged 600-epoch schedule. ‡ denotes fine-tuning from the ImageNet pre-trained model for 30 epochs. Notably, without a token mixer, RIFormer cannot perform even basic token mixing in its building blocks. Nevertheless, the ImageNet experiments demonstrate that, with the proposed training paradigm, RIFormer still achieves promising results, suggesting that the optimization strategy plays a key role. RIFormer therefore serves as a starting recipe for the exploration of optimization-driven efficient network design, and its performance holds up under advanced training schemes.
Visualization of the feature distributions of the first and last stages of PoolFormer-S12 and RIFormer-S12. After applying the proposed module imitation, the distribution of RIFormer-S12 shifts toward that of PoolFormer-S12, demonstrating its effect in helping the student learn useful knowledge from the teacher.
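As a rough illustration of how such feature-level imitation between the PoolFormer teacher and the RIFormer student can be set up (a minimal sketch under our own assumptions: the function name module_imitation_loss and the plain MSE-on-block-features formulation are hypothetical simplifications of the paper's full module imitation objective):

import torch
import torch.nn.functional as F


def module_imitation_loss(student_feats, teacher_feats):
    # student_feats / teacher_feats: lists of per-block features of matching
    # shapes, e.g. collected with forward hooks from RIFormer (student) and
    # PoolFormer (teacher). The teacher is frozen, hence detach().
    losses = [F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats)]
    return torch.stack(losses).mean()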
BibTeX
@article{wang2023riformer,
  title   = {RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer},
  author  = {Wang, Jiahao and Zhang, Songyang and Liu, Yong and Wu, Taiqiang and Yang, Yujiu and Liu, Xihui and Chen, Kai and Luo, Ping and Lin, Dahua},
  journal = {arXiv preprint arXiv:2304.05659},
  year    = {2023}
}
Related Work
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan. MetaFormer Is Actually What You Need for Vision. CVPR 2022.
Comment: Hypothesize that the general architecture of Transformers, rather than the specific token mixer module, is more essential to the model's performance.
Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, Xinchao Wang. MetaFormer Baselines for Vision. arXiv 2023.
Comment: Introduce several baseline models under MetaFormer using the most basic or common mixers, and demonstrate their strong performance.