Talking Heads and Audio-Visual Generation

The field of talking heads and audio-visual generation is moving toward more realistic and controllable synthesis of faces, voices, and whole-body animation. Researchers are exploring new architectures and techniques to improve lip-sync accuracy, preserve identity-related visual detail, and generate high-quality cartoon animation. Noteworthy papers include Livatar-1, which achieves competitive lip-sync quality at high throughput and low latency via tailored flow matching, and Face2VoiceSync, which proposes a novel framework for generating talking-face animations together with the corresponding speech, reporting state-of-the-art performance. MagicAnime introduces a large-scale, hierarchically annotated, multimodal dataset with benchmarks for cartoon animation generation, while JOLT3D revisits the effectiveness of 3DMM for talking head synthesis and proposes a novel lip-sync pipeline. Mask-Free Audio-driven Talking Face Generation improves visual quality and identity preservation by transforming input images to have closed mouths rather than masking the mouth region, and JWB-DH-V1 introduces a benchmark for joint whole-body talking avatar and speech generation. Who is a Better Talker presents a comprehensive study of the quality of AI-generated talking heads, including a dataset and method for both subjective and objective quality assessment.
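
To make the flow-matching angle concrete, below is a minimal sketch of conditional flow matching for audio-driven generation, the general recipe that low-latency systems like Livatar-1 build on. This is not the paper's actual architecture or API; all names (`VelocityNet`, `audio_feat`, the latent and audio dimensions) are illustrative assumptions.

```python
# Minimal conditional flow-matching sketch (assumed setup, not Livatar-1's code):
# learn a velocity field that transports noise to a face latent, conditioned on audio.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts flow velocity for a face latent, conditioned on audio features."""
    def __init__(self, latent_dim=512, audio_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, x_t, t, audio_feat):
        # Concatenate noisy latent, timestep, and audio condition.
        return self.net(torch.cat([x_t, t, audio_feat], dim=-1))

def flow_matching_loss(model, x1, audio_feat):
    """One training step: regress the straight-line velocity from noise to data."""
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.shape[0], 1)     # uniform timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1        # point on the linear interpolation path
    v_target = x1 - x0                 # constant target velocity along that path
    v_pred = model(x_t, t, audio_feat)
    return ((v_pred - v_target) ** 2).mean()

# Usage with dummy tensors (batch of 8 latents and audio features):
model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(8, 512), torch.randn(8, 256))
loss.backward()
```

At inference, a handful of Euler steps integrate the learned velocity field from noise to a face latent, which is what makes flow-matching generators attractive for real-time, high-throughput talking heads.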

Sources

Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching

Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation

JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync

Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation

JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads
