Talking Heads and Audio-Visual Generation

The field of talking heads and audio-visual generation is moving toward more realistic and controllable synthesis of faces, voices, and whole-body animation. Researchers are exploring new architectures and techniques to improve lip-sync accuracy, preserve identity-related visual detail, and generate high-quality cartoon animation. Noteworthy papers include Livatar-1, which achieves competitive lip-sync quality at high throughput and low latency via tailored flow matching, and Face2VoiceSync, which proposes a novel framework for generating talking-face animations together with the corresponding speech, reporting state-of-the-art performance. MagicAnime introduces a large-scale, hierarchically annotated, multimodal dataset with benchmarks for cartoon animation generation, while JOLT3D revisits the effectiveness of 3DMM for talking head synthesis and proposes a novel lip-sync pipeline. Mask-Free Audio-driven Talking Face Generation improves visual quality and identity preservation by transforming input images to have closed mouths rather than masking the mouth region, and JWB-DH-V1 introduces a benchmark for joint whole-body talking avatar and speech generation. Who is a Better Talker presents a comprehensive study of the quality of AI-generated talking heads, including a dataset and method for both subjective and objective quality assessment.
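
To make the flow-matching angle concrete, below is a minimal sketch of conditional flow matching for audio-driven generation, the general recipe that low-latency systems like Livatar-1 build on. This is not the paper's actual architecture or API; all names (`VelocityNet`, `audio_feat`, the latent and audio dimensions) are illustrative assumptions.

```python
# Minimal conditional flow-matching sketch (assumed setup, not Livatar-1's code):
# learn a velocity field that transports noise to a face latent, conditioned on audio.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts flow velocity for a face latent, conditioned on audio features."""
    def __init__(self, latent_dim=512, audio_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, x_t, t, audio_feat):
        # Concatenate noisy latent, timestep, and audio condition.
        return self.net(torch.cat([x_t, t, audio_feat], dim=-1))

def flow_matching_loss(model, x1, audio_feat):
    """One training step: regress the straight-line velocity from noise to data."""
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.shape[0], 1)     # uniform timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1        # point on the linear interpolation path
    v_target = x1 - x0                 # constant target velocity along that path
    v_pred = model(x_t, t, audio_feat)
    return ((v_pred - v_target) ** 2).mean()

# Usage with dummy tensors (batch of 8 latents and audio features):
model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(8, 512), torch.randn(8, 256))
loss.backward()
```

At inference, a handful of Euler steps integrate the learned velocity field from noise to a face latent, which is what makes flow-matching generators attractive for real-time, high-throughput talking heads.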

Sources

Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching

Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation

JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync

Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation

JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads
