Advances in Multimodal Generation and Recognition

The field of multimodal generation and recognition is evolving rapidly, with a focus on developing more realistic and controllable models. Recent research explores novel frameworks and techniques for improving the quality and accuracy of generated audio, video, and text. One notable direction is the integration of multimodal inputs, such as audio and text, to produce more coherent and engaging outputs. There is also growing interest in models that capture fine-grained details and nuances of human emotion and expression. These advances have significant implications for applications such as virtual assistants, education, and entertainment. Noteworthy papers include M2DAO-Talker, which achieves state-of-the-art performance in talking-head generation, and FreeAudio, which enables training-free timing planning for controllable long-form text-to-audio generation. Also notable are SnapMoGen, which introduces a new expressive text-motion dataset and improves on prior generative masked modeling approaches, and Think-Before-Draw, which decomposes emotion semantics to enable fine-grained, controllable expressive talking-head generation.

Sources

M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

SnapMoGen: Human Motion Generation from Expressive Texts

EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition

MOSPA: Human Motion Generation Driven by Spatial Audio

Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion
