LiT can run offline on an edge-side laptop, generating photorealistic images at 1K resolution.
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, motivated by linear attention's simplicity, parallelism, and efficiency for image generation.
Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies.
Our core contributions include 5 practical guidelines:
1) Applying depth-wise convolution within simple linear attention is sufficient for image generation.
2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency (a minimal sketch of guidelines 1 and 2 follows this list).
3) Inheriting weights from a fully converged, pre-trained DiT.
4) Loading all parameters except those related to linear attention (a weight-loading sketch also follows this list).
5) Hybrid knowledge distillation: using a pre-trained teacher DiT to guide the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process.
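To make guidelines 1 and 2 concrete, here is a minimal PyTorch sketch of simple linear attention with a depth-wise convolution branch and few heads. The module name, the ReLU feature map, and the default head count are our illustrative assumptions, not the official LiT implementation.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Simple linear attention + depth-wise conv branch (guidelines 1-2)."""
    def __init__(self, dim, num_heads=2):  # few heads, per guideline 2
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depth-wise 3x3 convolution applied to V as a local-detail branch.
        self.dwc = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W tokens
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # ReLU feature map keeps similarities non-negative.
        q, k = torch.relu(q), torch.relu(k)

        # Linear attention: O(N) in the token count.
        kv = torch.einsum('bhnd,bhne->bhde', k, v)                    # (B, h, d, d)
        z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)           # (B, h, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)

        # Depth-wise conv on V compensates for the locality lost by linearization.
        v_map = v.transpose(1, 2).reshape(B, N, C).transpose(1, 2).reshape(B, C, H, W)
        out = out + self.dwc(v_map).flatten(2).transpose(1, 2)
        return self.proj(out)
```

For example, `LinearAttention(dim=1152, num_heads=2)` would drop into a DiT-XL-style block over 16×16 latent tokens via `attn(x, 16, 16)`.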
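Guidelines 3 and 4 amount to a selective checkpoint load. The sketch below assumes a flat state dict and that attention parameters carry an `attn` key prefix; both are naming assumptions, not details from the paper.

```python
import torch

def inherit_weights(student, dit_ckpt_path, skip_key='attn'):
    """Initialize the student linear DiT from a fully converged DiT checkpoint,
    loading all parameters except those related to (linear) attention."""
    teacher_sd = torch.load(dit_ckpt_path, map_location='cpu')  # assumed flat state dict
    student_sd = student.state_dict()
    kept = {k: v for k, v in teacher_sd.items()
            if skip_key not in k and k in student_sd and v.shape == student_sd[k].shape}
    missing, unexpected = student.load_state_dict(kept, strict=False)
    print(f'inherited {len(kept)} tensors; randomly initialized: {len(missing)}')
    return student
```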
These guidelines lead to our proposed Linear Diffusion Transformer (LiT), which serves as a safe and efficient pure-linear-attention alternative to DiT.
In class-conditional 256×256 and 512×512 ImageNet generation, LiT can be quickly adapted from DiT using only 20% and 33% of DiT’s training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize to text-to-image generation: LiT can be swiftly converted from PixArt-$\Sigma$ to generate high-quality images, maintaining comparable GenEval scores.
Overall training procedure of LiT. Following the macro/micro-level design of DiT (for class-conditional image generation) and PixArt-Σ (for text-to-image generation), LiT replaces the self-attention in each block with linear attention. We linearize diffusion Transformers by (1) building a strong linear DiT baseline with few heads, (2) inheriting weights from a DiT teacher, and (3) distilling useful knowledge (the predicted noise and the variances of the reverse diffusion process) from the teacher model.
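A minimal sketch of the hybrid knowledge distillation described above, assuming DiT's learned-sigma parameterization where the network output concatenates predicted noise and variance along the channel dimension. The function name and loss weights are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_distill_loss(student_out, teacher_out, noise, lambda_eps=1.0, lambda_var=1.0):
    """Supervise the student on ground-truth noise plus the teacher's
    predicted noise and reverse-process variance."""
    s_eps, s_var = student_out.chunk(2, dim=1)            # split [eps, variance]
    t_eps, t_var = teacher_out.detach().chunk(2, dim=1)   # no grad through teacher
    loss = F.mse_loss(s_eps, noise)                       # standard diffusion loss
    loss = loss + lambda_eps * F.mse_loss(s_eps, t_eps)   # distill predicted noise
    loss = loss + lambda_var * F.mse_loss(s_var, t_var)   # distill variance
    return loss
```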
Despite being trained for only 20% of the steps of DiT-XL/2, LiT still performs on par with DiT (FID 2.32 vs. 2.27).
LiT, using pure linear attention, achieves an impressive FID of 3.69, comparable to DiT trained for 3M steps.
Built on PixArt-Σ, LiT leverages pure linear attention to achieve competitive performance.
Generated samples of LiT following user instructions. LiT shares the same macro/micro-level design as PixArt-Σ but elegantly replaces all self-attention with cheap linear attention. While simpler and more efficient, LiT, trained with our cost-effective strategy, can still generate exceptional high-resolution images following complicated user instructions.
@article{wang2025lit,
  title={LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation},
  author={Wang, Jiahao and Kang, Ning and Yao, Lewei and Chen, Mengzhao and Wu, Chengyue and Zhang, Songyang and Xue, Shuchen and Liu, Yong and Wu, Taiqiang and Liu, Xihui and Zhang, Kaipeng and Zhang, Shifeng and Shao, Wenqi and Li, Zhenguo and Luo, Ping},
  journal={arXiv preprint arXiv:2501.12976},
  year={2025}
}