The field of text-to-image and video generation is advancing rapidly, with a focus on improving the quality and consistency of generated content. Recent work has concentrated on helping models understand and represent complex scenes, objects, and attributes, yielding more realistic and diverse images and videos, while advances in diffusion models and transformer architectures have made the generation process itself more efficient. Researchers have also made progress on challenges such as attribute-object binding, subject leakage, and cross-attention misalignment, improving both performance and robustness. Novel approaches, including compositional generation, multi-party collaborative attention control, and adaptive joint training, have achieved state-of-the-art results on a range of benchmarks. Overall, the field is moving toward more flexible models that can handle complex inputs and generate high-quality, customized content. Noteworthy papers include VSC, which introduces a compositional generation method, and DualReal, which employs adaptive joint training to achieve lossless identity-motion fusion.
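To make the cross-attention control idea concrete, below is a minimal sketch (not the method of any particular paper) of region-masked cross attention in a diffusion denoiser: a layout mask restricts which prompt tokens can attend to which spatial locations, the basic mechanism behind many approaches to attribute-object binding and subject leakage. The function name, tensor shapes, and mask format are illustrative assumptions.

```python
import torch

def region_masked_cross_attention(latent_feats, text_embeds, region_mask, num_heads=8):
    """Illustrative region-masked cross attention (shapes and mask format are assumptions).

    latent_feats: (B, N, D) flattened spatial features from the denoiser
    text_embeds:  (B, T, D) prompt-token embeddings
    region_mask:  (B, N, T) 1 where a token may attend to a location, 0 otherwise
    """
    B, N, D = latent_feats.shape
    T = text_embeds.shape[1]
    head_dim = D // num_heads

    # Identity "projections" keep the sketch self-contained; a real layer uses learned q/k/v.
    q = latent_feats.view(B, N, num_heads, head_dim).transpose(1, 2)  # (B, H, N, d)
    k = text_embeds.view(B, T, num_heads, head_dim).transpose(1, 2)   # (B, H, T, d)
    v = k

    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5              # (B, H, N, T)

    # Down-weight tokens outside their assigned region: the core idea behind
    # attention-control methods for attribute-object binding / subject leakage.
    scores = scores.masked_fill(region_mask.unsqueeze(1) == 0, -1e4)
    attn = scores.softmax(dim=-1)

    out = attn @ v                                                    # (B, H, N, d)
    return out.transpose(1, 2).reshape(B, N, D)

# Toy usage: two subjects, each bound to half of an 8x8 latent grid.
B, N, T, D = 1, 64, 6, 128
feats, toks = torch.randn(B, N, D), torch.randn(B, T, D)
mask = torch.ones(B, N, T)
mask[:, :32, 3:] = 0   # tokens 3..5 (second subject) excluded from the top half
mask[:, 32:, 1:3] = 0  # tokens 1..2 (first subject) excluded from the bottom half
out = region_masked_cross_attention(feats, toks, mask)
print(out.shape)       # torch.Size([1, 64, 128])
```

In practice such control is applied to a pretrained model's attention layers during sampling rather than to random tensors, but the masking principle is the same.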