Advancements in Audio-Visual Animation and Generation

The field of audio-visual animation and generation is moving toward more realistic, expressive outputs that preserve the distinctive style of the input data. Recent systems can generate high-fidelity audio, speech, and songs coherently synchronized with an input video, and can animate a single reference image into a lifelike, emotionally expressive talking head video driven by an input audio clip.

Notable advancements include multimodal diffusion transformers, which have proven effective at generating semantically rich and acoustically diverse audio, and real-time audio-driven portrait animation frameworks that synthesize realistic, natural talking head videos under real-time constraints.

Occlusion-robust stylization frameworks have also improved the quality of drawing-based 3D animation, preserving an artist's unique style even under occlusion. Furthermore, efficient training paradigms have enabled scaling audio-synchronized visual animation up to diverse audio-video classes.
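To make the diffusion-based generation described above concrete, the sketch below shows a toy audio-conditioned reverse-diffusion sampler: starting from Gaussian noise, a frame latent is iteratively denoised while an audio feature vector steers each step. This is a minimal illustration of the general technique only, not the method of any paper listed here; the linear "noise predictor" and the linear noise schedule are placeholder assumptions standing in for a trained diffusion transformer.

```python
import numpy as np

def linear_schedule(T):
    """Toy cumulative noise schedule (alpha-bar), kept strictly in (0, 1)."""
    return np.linspace(0.999, 0.01, T)

def denoise_step(x_t, audio_feat, t, abar):
    """One DDIM-style (eta=0) reverse step of a toy audio-conditioned model.

    The linear eps_hat below is a placeholder for a learned noise predictor;
    only the update equations follow the standard deterministic sampler.
    """
    a_t = abar[t]
    a_prev = abar[t - 1] if t > 0 else 1.0
    eps_hat = 0.9 * x_t + 0.1 * audio_feat            # placeholder predictor
    x0_hat = (x_t - np.sqrt(1 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * eps_hat

def sample_frame(audio_feat, T=50, seed=0):
    """Iteratively denoise Gaussian noise into one audio-conditioned latent."""
    rng = np.random.default_rng(seed)
    abar = linear_schedule(T)
    x = rng.standard_normal(audio_feat.shape)
    for t in reversed(range(T)):                      # T-1 ... 0
        x = denoise_step(x, audio_feat, t, abar)
    return x
```

In real talking-head systems the latent would decode to a video frame, and the conditioning would combine audio, a reference image, and past frames; the loop structure, however, is the same.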

Some particularly noteworthy papers include:

  • AudioGen-Omni, which presents a unified approach to generating high-fidelity audio, speech, and songs coherently synchronized with the input video.
  • X-Actor, which enables the generation of lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip.
  • READ, which proposes a real-time diffusion-transformer-based talking head generation framework that balances quality against speed.

Sources

Occlusion-robust Stylization for Drawing-based 3D Animation

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation

Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

MienCap: Realtime Performance-Based Facial Animation with Live Mood Dynamics

RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer

Embedding Alignment in Code Generation for Audio
