The field of multimodal generation is moving toward more tightly synchronized models, with a focus on improving temporal alignment and controllability. Recent work has used audio cues to guide video generation, yielding more realistic and temporally coherent outputs, and has proposed new guidance mechanisms and fusion architectures that improve the quality and diversity of generated audio and video. These advances open up applications such as video editing, Foley sound design, and assistive multimedia. Noteworthy papers include Syncphony, which achieves state-of-the-art synchronization accuracy and visual quality in audio-to-video generation, and AudioMoG, a mixture-of-guidance framework for cross-modal audio generation that improves quality and diversity without sacrificing inference efficiency.
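
To give a rough sense of the general idea behind mixing multiple guidance signals in a diffusion sampler, the sketch below combines several classifier-free guidance terms with per-condition weights. This is a generic illustration under assumed names, not AudioMoG's actual formulation; the model interface, condition tensors, and weights here are hypothetical.

```python
import torch

def mixed_guidance_eps(model, x_t, t, conds, weights, null_cond):
    """Combine several guidance signals in one denoising step (illustrative sketch).

    conds     : list of conditioning tensors (e.g. a text embedding and a video embedding)
    weights   : per-condition guidance scales
    null_cond : "unconditional" embedding used as the guidance baseline
    """
    eps_uncond = model(x_t, t, null_cond)           # unconditional noise prediction
    eps = eps_uncond.clone()
    for cond, w in zip(conds, weights):
        eps_cond = model(x_t, t, cond)              # condition-specific prediction
        eps = eps + w * (eps_cond - eps_uncond)     # add weighted guidance direction
    return eps
```

In such a scheme, the per-condition weights trade off fidelity to each conditioning signal against sample diversity and inference cost; how a specific framework chooses or schedules those weights is what distinguishes it from plain classifier-free guidance.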