Controllable Generation and Efficient Modeling in Visual Autoregressive Learning

The field of visual autoregressive learning is moving towards more controllable and efficient models. Recent developments have focused on improving the fidelity and efficiency of visual autoregressive models, with particular emphasis on controllable image synthesis and high-resolution image generation. Notable advances include novel decoding mechanisms and acceleration frameworks that reduce computational overhead without compromising image quality, enabling more precise control over generated outputs and improving the scalability of visual autoregressive models.

Noteworthy papers include:

SCALAR presents a controllable generation method built on visual autoregressive models, with a novel scale-wise conditional decoding mechanism.

SparseVAR introduces a plug-and-play acceleration framework for next-scale prediction that dynamically excludes low-frequency tokens during inference.

Spec-VLA proposes a speculative decoding framework to accelerate vision-language-action models.

DivControl presents a decomposable pretraining framework for unified controllable generation and efficient adaptation.

XSpecMesh employs a lightweight multi-head speculative decoding scheme that predicts multiple tokens in parallel within a single forward pass, accelerating inference in auto-regressive mesh generation models.
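The idea of excluding low-frequency tokens, as SparseVAR does, can be illustrated with a minimal sketch. This is not SparseVAR's actual criterion; it assumes a simple discrete-Laplacian estimate of per-token high-frequency energy and keeps only the most textured fraction of spatial tokens:

```python
import numpy as np

def sparsify_tokens(feature_map, keep_ratio=0.5):
    """Illustrative sketch (not SparseVAR's exact method): rank each
    spatial token by local high-frequency energy, estimated with a
    discrete Laplacian, and keep only the top `keep_ratio` fraction."""
    h, w = feature_map.shape
    padded = np.pad(feature_map, 1, mode="edge")
    # Laplacian response: high magnitude -> edges/texture, low -> smooth.
    lap = (4 * padded[1:-1, 1:-1]
           - padded[:-2, 1:-1] - padded[2:, 1:-1]
           - padded[1:-1, :-2] - padded[1:-1, 2:])
    energy = np.abs(lap).ravel()
    k = max(1, int(keep_ratio * h * w))
    keep = np.argsort(energy)[::-1][:k]  # indices of high-frequency tokens
    mask = np.zeros(h * w, dtype=bool)
    mask[keep] = True
    return mask.reshape(h, w)

# A smooth region with one sharp spike: only the spike survives.
fm = np.zeros((4, 4))
fm[1, 1] = 10.0
print(sparsify_tokens(fm, keep_ratio=1 / 16))
```

Tokens masked out this way would be skipped during inference and filled in cheaply (e.g. by interpolation), which is where the speedup comes from in smooth, low-frequency image regions.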
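Speculative decoding, the acceleration principle behind both Spec-VLA and XSpecMesh's multi-head variant, can be sketched as follows. This is a generic greedy version with toy stand-in models, not either paper's implementation: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, keeping the longest agreed prefix so multiple tokens can be committed per target pass:

```python
def speculative_decode(target, draft, prefix, k=4, max_len=12):
    """Greedy speculative decoding sketch: the cheap draft model proposes
    k tokens ahead; the expensive target model verifies them, keeping the
    longest prefix on which both models agree."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        # Draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Target model checks each proposed position (standing in for a
        # single batched verification pass in a real implementation).
        accepted = 0
        for i in range(k):
            if target(tokens + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # Always emit one token from the target so decoding progresses
        # even when the draft is rejected immediately.
        if len(tokens) < max_len:
            tokens.append(target(tokens))
    return tokens[:max_len]

# Toy deterministic "models": next token = (sum of context) mod 10.
target = lambda ctx: sum(ctx) % 10
draft = lambda ctx: sum(ctx) % 10  # perfect draft -> all proposals accepted

print(speculative_decode(target, draft, [1, 2]))
```

The more often the draft agrees with the target, the fewer target passes are needed per generated token; XSpecMesh's multi-head scheme plays the draft role with extra prediction heads inside a single forward pass rather than a separate model.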

Sources

SCALAR: Scale-wise Controllable Visual Autoregressive Learning

Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis

Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance

DivControl: Knowledge Diversion for Controllable Image Generation

XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding
