The field of multimodal models for image generation and editing is advancing rapidly, with an emphasis on building more capable and efficient models. Recent work has centered on improving the accuracy and consistency of generation and editing, particularly in complex scenes with multiple objects. Researchers are exploring new architectures and techniques, including autoregressive frameworks, mixture-of-experts models, and human-aligned reward models, to reach state-of-the-art performance. Notable papers in this area include HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation, and QL-Adapter, which achieves state-of-the-art performance on quantity- and layout-consistent image editing. EditReward and VaPR are also noteworthy: EditReward introduces a human-aligned reward model for instruction-guided image editing, while VaPR contributes preference-alignment techniques for vision-language reasoning.
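To make the reward-model idea concrete, the sketch below shows the common pairwise (Bradley-Terry) preference objective that underlies human-aligned reward models: the model assigns a scalar score to each candidate edit and is trained so that the human-preferred result scores higher than the rejected one. This is a minimal, illustrative PyTorch example; the class names, embedding dimensions, and training loop are assumptions for demonstration and are not taken from EditReward, VaPR, or any of the papers above.

```python
# Minimal sketch of training a reward model with a pairwise (Bradley-Terry)
# preference loss. All names (SimpleRewardModel, embed_dim, etc.) are
# illustrative and do not come from the cited papers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleRewardModel(nn.Module):
    """Scores a (prompt, image) embedding pair with a single scalar reward."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, prompt_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate prompt and image features; return one scalar reward per sample.
        return self.head(torch.cat([prompt_emb, image_emb], dim=-1)).squeeze(-1)


def preference_loss(model, prompt_emb, chosen_emb, rejected_emb):
    # Bradley-Terry objective: the human-preferred edit should score higher
    # than the rejected one, i.e. minimize -log sigmoid(r_chosen - r_rejected).
    r_chosen = model(prompt_emb, chosen_emb)
    r_rejected = model(prompt_emb, rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = SimpleRewardModel()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Dummy batch of 8 preference pairs with 512-dim embeddings standing in
    # for real prompt/image features from a vision-language encoder.
    prompt = torch.randn(8, 512)
    chosen, rejected = torch.randn(8, 512), torch.randn(8, 512)

    loss = preference_loss(model, prompt, chosen, rejected)
    loss.backward()
    opt.step()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

Once trained on human comparisons, such a reward model can score candidate edits at inference time (e.g., to rerank outputs) or serve as the optimization target for preference alignment of the generator itself.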