Efficient Video and Image Generation

The field of video and image generation is moving toward more efficient models, with a focus on reducing computational cost and memory usage. Recent research has explored several routes to this goal, including knowledge distillation, post-training quantization, and novel tokenization techniques, yielding substantial model compression and inference acceleration while matching or even surpassing the quality of the full-size or full-precision originals. Notable papers in this area include V.I.P., which proposes an iterative online preference distillation method for efficient video diffusion models, and LRQ-DiT, which introduces a log-rotation post-training quantization method for diffusion transformers. In addition, S²Q-VDiT contributes an accurate quantized video diffusion transformer built on salient data and sparse token distillation, and WeTok presents a powerful discrete tokenizer for high-fidelity visual reconstruction.
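For intuition, the sketch below illustrates the core idea behind log-based post-training quantization: weight magnitudes are snapped to signed powers of two, so the quantized values span a wide dynamic range with very few bits. The function name and clipping scheme are illustrative assumptions for a minimal NumPy demo, not LRQ-DiT's actual log-rotation algorithm.

```python
import numpy as np

def log2_quantize(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Snap weight magnitudes to signed powers of two (log-domain PTQ).

    Hypothetical illustration of generic log-based post-training
    quantization; not LRQ-DiT's algorithm.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    # Round each magnitude's log2 to the nearest integer exponent.
    exp = np.round(np.log2(np.maximum(mag, np.finfo(w.dtype).tiny)))
    # Keep a window of 2**(num_bits - 1) - 1 exponents below the
    # largest observed one (one bit is reserved for the sign).
    max_exp = exp.max()
    min_exp = max_exp - (2 ** (num_bits - 1) - 1)
    exp = np.clip(exp, min_exp, max_exp)
    w_q = sign * np.exp2(exp)
    # Flush magnitudes below the smallest representable level to zero.
    w_q[mag < np.exp2(min_exp) / 2] = 0.0
    return w_q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 4)).astype(np.float32)
    print(log2_quantize(w, num_bits=4))
```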
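Similarly, the following minimal sketch shows what discrete tokenization means in this context: continuous patch embeddings are mapped to the indices of their nearest codebook entries and decoded back for reconstruction. This is plain vector quantization for illustration only; WeTok's tokenizer is considerably more elaborate.

```python
import numpy as np

def vq_tokenize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each feature vector the index of its nearest codebook entry."""
    # features: (N, D) patch embeddings; codebook: (K, D) learned codes.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (N,) discrete token ids

def vq_reconstruct(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Decode token ids back to the corresponding codebook vectors."""
    return codebook[tokens]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((256, 8)).astype(np.float32)
    patches = rng.standard_normal((16, 8)).astype(np.float32)
    ids = vq_tokenize(patches, codebook)
    recon = vq_reconstruct(ids, codebook)
    print(ids.shape, recon.shape)  # (16,) (16, 8)
```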

Sources

V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models

LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation

S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
