Advances in Audio-Driven Animation and Speech Processing

The field of audio-driven animation and speech processing is evolving rapidly, with a focus on generating realistic, coherent animations and natural-sounding synthesized speech. Recent work leverages diffusion models, large language models, and optimal transport to improve the quality and naturalness of generated motion and speech. Noteworthy papers include Model See Model Do, which proposes an example-based generation framework for speech-driven facial animation with style control, and FlowDubber, which achieves high audio-visual synchronization and pronunciation quality in movie dubbing with a large language model-based flow matching architecture. Other significant contributions introduce new benchmarks and datasets: TA-Dubbing targets the evaluation of adaptive movie dubbing, while Teochew-Wild supports the development of speech recognition for a low-resource language.
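
To make the flow matching idea mentioned above concrete, the sketch below shows a generic conditional flow matching training step in which a network regresses the velocity field along straight noise-to-data paths, conditioned on audio features. The model signature, tensor shapes, and variable names are hypothetical illustrations, not the FlowDubber implementation.

```python
# Minimal, generic sketch of one conditional flow matching training step.
# `model`, `mel_target`, and `audio_cond` are assumed/hypothetical placeholders.
import torch
import torch.nn.functional as F

def flow_matching_step(model, mel_target, audio_cond, optimizer):
    """Regress the velocity along straight paths from noise to data,
    conditioned on audio features (e.g. for dubbing-style speech/mel generation)."""
    x1 = mel_target                                 # data sample (batch, frames, dims)
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                    # point on the straight-line path
    v_target = x1 - x0                              # constant target velocity on that path
    v_pred = model(xt, t.view(-1), audio_cond)      # predicted velocity field
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, samples are drawn by integrating the learned velocity field from noise to data (for example with a few Euler steps), which is what makes flow matching attractive for fast, high-quality dubbing-style synthesis.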

Sources

Model See Model Do: Speech-Driven Facial Animation with Style Control

FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks

Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

OT-Talk: Animating 3D Talking Head with Optimal Transportation

PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment

ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition

Language translation, and change of accent for speech-to-speech task using diffusion model

Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations

GesPrompt: Leveraging Co-Speech Gestures to Augment LLM-Based Interaction in Virtual Reality
