LiT can run offline on an edge-side laptop, generating photorealistic images at 1K resolution.
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, motivated by linear attention's simplicity, parallelism, and efficiency for image generation.
Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies.
Our core contributions include 5 practical guidelines:
1) Applying depth-wise convolution within simple linear attention is sufficient for image generation.
2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency (a minimal sketch of guidelines 1 and 2 follows this list).
3) Inheriting weights from a fully converged, pre-trained DiT.
4) Loading all parameters except those related to linear attention (a weight-loading sketch also follows this list).
5) Hybrid knowledge distillation: using a pre-trained teacher DiT to guide the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process.
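To make guidelines 1 and 2 concrete, here is a minimal PyTorch sketch of simple linear attention with a depth-wise convolution branch and few heads. The module name, the ReLU feature map, and the default head count are our illustrative assumptions, not the official LiT implementation.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Simple linear attention + depth-wise conv branch (guidelines 1-2)."""
    def __init__(self, dim, num_heads=2):  # few heads, per guideline 2
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depth-wise 3x3 convolution applied to V as a local-detail branch.
        self.dwc = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W tokens
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # ReLU feature map keeps similarities non-negative.
        q, k = torch.relu(q), torch.relu(k)

        # Linear attention: O(N) in the token count.
        kv = torch.einsum('bhnd,bhne->bhde', k, v)                    # (B, h, d, d)
        z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)           # (B, h, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)

        # Depth-wise conv on V compensates for the locality lost by linearization.
        v_map = v.transpose(1, 2).reshape(B, N, C).transpose(1, 2).reshape(B, C, H, W)
        out = out + self.dwc(v_map).flatten(2).transpose(1, 2)
        return self.proj(out)
```

For example, `LinearAttention(dim=1152, num_heads=2)` would drop into a DiT-XL-style block over 16×16 latent tokens via `attn(x, 16, 16)`.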
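Guidelines 3 and 4 amount to a selective checkpoint load. The sketch below assumes a flat state dict and that attention parameters carry an `attn` key prefix; both are naming assumptions, not details from the paper.

```python
import torch

def inherit_weights(student, dit_ckpt_path, skip_key='attn'):
    """Initialize the student linear DiT from a fully converged DiT checkpoint,
    loading all parameters except those related to (linear) attention."""
    teacher_sd = torch.load(dit_ckpt_path, map_location='cpu')  # assumed flat state dict
    student_sd = student.state_dict()
    kept = {k: v for k, v in teacher_sd.items()
            if skip_key not in k and k in student_sd and v.shape == student_sd[k].shape}
    missing, unexpected = student.load_state_dict(kept, strict=False)
    print(f'inherited {len(kept)} tensors; randomly initialized: {len(missing)}')
    return student
```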
These guidelines lead to our proposed Linear Diffusion Transformer (LiT), which serves as a safe and efficient pure-linear-attention alternative to DiT.
In class-conditional 256×256 and 512×512 ImageNet generation, LiT can be quickly adapted from DiT using only 20% and 33% of DiT’s training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize to text-to-image generation: LiT can be swiftly converted from PixArt-$\Sigma$ to generate high-quality images, maintaining comparable GenEval scores.
Overall training procedure of LiT. Following the macro/micro-level design of DiT (for class-conditional image generation) and PixArt-Σ (for text-to-image generation), LiT replaces the self-attention in each block with linear attention. We linearize diffusion Transformers by (1) building a strong linear DiT baseline with few heads, (2) inheriting weights from a DiT teacher, and (3) distilling useful knowledge (the predicted noise and the variances of the reverse diffusion process) from the teacher model.
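A minimal sketch of the hybrid knowledge distillation described above, assuming DiT's learned-sigma parameterization where the network output concatenates predicted noise and variance along the channel dimension. The function name and loss weights are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_distill_loss(student_out, teacher_out, noise, lambda_eps=1.0, lambda_var=1.0):
    """Supervise the student on ground-truth noise plus the teacher's
    predicted noise and reverse-process variance."""
    s_eps, s_var = student_out.chunk(2, dim=1)            # split [eps, variance]
    t_eps, t_var = teacher_out.detach().chunk(2, dim=1)   # no grad through teacher
    loss = F.mse_loss(s_eps, noise)                       # standard diffusion loss
    loss = loss + lambda_eps * F.mse_loss(s_eps, t_eps)   # distill predicted noise
    loss = loss + lambda_var * F.mse_loss(s_var, t_var)   # distill variance
    return loss
```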
Despite being trained for only 20% of the steps of DiT-XL/2, LiT still performs on par with DiT (FID 2.32 vs. 2.27).
LiT, using pure linear attention, achieves an impressive FID of 3.69, comparable to DiT trained for 3M steps.
Built on PixArt-Σ, LiT leverages pure linear attention to achieve competitive performance.
Generated samples of LiT following user instructions. LiT shares the same macro/micro-level design as PixArt-Σ but elegantly replaces all self-attention with cheap linear attention. While simpler and more efficient, LiT, trained with our cost-effective strategy, can still generate exceptional high-resolution images following complicated user instructions.
@article{wang2025lit,
  title={LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation},
  author={Wang, Jiahao and Kang, Ning and Yao, Lewei and Chen, Mengzhao and Wu, Chengyue and Zhang, Songyang and Xue, Shuchen and Liu, Yong and Wu, Taiqiang and Liu, Xihui and Zhang, Kaipeng and Zhang, Shifeng and Shao, Wenqi and Li, Zhenguo and Luo, Ping},
  journal={arXiv preprint arXiv:2501.12976},
  year={2025}
}