The field of audio-visual generation and synthesis is moving toward more realistic and immersive experiences. Research is converging on models that generate high-quality, temporally synchronized audio from video, alongside more realistic talking-head synthesis and audio-driven portrait animation. Noteworthy papers in this area include LD-LAudio-V1, which improves long-form audio generation; FantasyTalking2, which introduces a framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences; FoleySpace, which proposes a framework for video-to-binaural audio generation; TalkVid, which contributes a large-scale, high-quality, and diverse dataset for audio-driven talking-head synthesis; and InfiniteTalk, which presents a sparse-frame video dubbing paradigm enabling holistic, audio-synchronized full-body motion editing.