Advances in Image and Video Generation

The field of image and video generation is advancing rapidly as new architectures and techniques emerge. One key trend is the use of multi-scale and multi-frequency approaches, such as pyramid inputs, adaptive spatial-frequency learning units, and global feature fusion blocks, to enhance features at different scales and improve the quality and realism of generated images and videos. Another focus is diffusion models, which deliver exceptional image-synthesis quality but are computationally intensive; techniques such as progressive quantization, calibration-assisted distillation, and knowledge distillation aim to make these models more efficient without sacrificing fidelity. There is also growing interest in applying reinforcement learning and vision-language models to improve the quality and trustworthiness of generated images and videos. Noteworthy papers in this area include Hunyuan3D 2.5, which generates high-fidelity 3D assets with ultimate details; HiWave, which achieves training-free high-resolution image generation via wavelet-based diffusion sampling; and PQCAD-DM and Diffusion Transformer-to-Mamba Distillation, both notable contributions to efficient, high-quality image generation.
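The multi-frequency idea behind wavelet-based methods such as HiWave can be illustrated with a one-level 2-D Haar transform, which splits an image into a low-frequency approximation band and three high-frequency detail bands. This is only a minimal numpy sketch of the frequency decomposition itself, not HiWave's actual sampling procedure; the function names are illustrative.

```python
import numpy as np

def haar_decompose(img):
    """One level of a 2-D Haar wavelet transform.

    Splits an (H, W) image with even dimensions into a low-frequency
    approximation band (ll) and three high-frequency detail bands.
    """
    # Pair up rows: low band = average, high band = difference.
    lo_r = (img[0::2, :] + img[1::2, :]) / 2.0
    hi_r = (img[0::2, :] - img[1::2, :]) / 2.0
    # Repeat the split along columns of each intermediate band.
    ll = (lo_r[:, 0::2] + lo_r[:, 1::2]) / 2.0   # approximation
    lh = (lo_r[:, 0::2] - lo_r[:, 1::2]) / 2.0   # horizontal detail
    hl = (hi_r[:, 0::2] + hi_r[:, 1::2]) / 2.0   # vertical detail
    hh = (hi_r[:, 0::2] - hi_r[:, 1::2]) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def haar_reconstruct(ll, lh, hl, hh):
    """Invert haar_decompose exactly (avg/diff -> original pairs)."""
    lo_r = np.empty((ll.shape[0], ll.shape[1] * 2))
    lo_r[:, 0::2] = ll + lh
    lo_r[:, 1::2] = ll - lh
    hi_r = np.empty_like(lo_r)
    hi_r[:, 0::2] = hl + hh
    hi_r[:, 1::2] = hl - hh
    img = np.empty((lo_r.shape[0] * 2, lo_r.shape[1]))
    img[0::2, :] = lo_r + hi_r
    img[1::2, :] = lo_r - hi_r
    return img
```

Because the transform is exactly invertible, a sampler can steer low- and high-frequency content separately (e.g. denoise coarse structure in `ll` while preserving detail bands) and still reconstruct a full-resolution image.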
Sources
PQCAD-DM: Progressive Quantization and Calibration-Assisted Distillation for Extremely Efficient Diffusion Model
RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
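As a rough illustration of the progressive quantization idea mentioned above: weights are pushed through a schedule of decreasing bit-widths rather than quantized to a low precision in one shot. This numpy sketch shows only the bit-width schedule under an assumed uniform symmetric quantizer; the actual PQCAD-DM pipeline additionally interleaves calibration-assisted distillation against the full-precision model, which is omitted here.

```python
import numpy as np

def uniform_quantize(w, bits):
    """Uniform symmetric fake-quantization of a weight tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / levels        # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale                        # de-quantized weights

def progressive_quantize(w, schedule=(8, 6, 4)):
    """Quantize progressively through decreasing bit-widths.

    A real pipeline would run a short calibration/distillation pass after
    each stage; here only the schedule itself is shown.
    """
    out = w
    for bits in schedule:
        out = uniform_quantize(out, bits)
    return out
```

Each stage starts from the previous stage's already-quantized weights, so the precision drop at every step is small, which is what makes the subsequent calibration passes cheap.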