Advances in Audio Generation and Editing

The field of audio generation and editing is evolving rapidly, with a focus on more sophisticated and controllable models. Recent research applies generative approaches such as diffusion models and flow matching to improve the quality and realism of generated audio. There is also growing interest in multimodal methods that incorporate visual and textual information to guide audio generation and editing (a sketch of the flow-matching objective underlying several of these systems follows the list below). Notable papers in this area include:

Text2Move, which generates moving sounds from text prompts in a controllable fashion.

DAVIS, a Diffusion-based Audio-VIsual Separation framework that tackles audio-visual sound source separation through generative learning.

UniFlow-Audio, a universal audio generation framework based on flow matching that supports omni-modalities, including text, audio, and video.

Object-AVEdit, an object-level audio-visual editing model that achieves strong results on object-level editing of both audio and video, with fine-grained audio-visual semantic alignment.
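Since UniFlow-Audio and related systems build on flow matching, the minimal sketch below shows the core training objective: regress a velocity field along a straight-line path between noise and data. The toy network, tensor shapes, and conditioning vector are illustrative assumptions, not any paper's actual architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy network predicting the velocity field v(x_t, t, cond)."""
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + cond_dim, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy sample, scalar time, and condition embedding.
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """One training step of (rectified) flow matching.

    x1:   clean audio latents, shape (batch, dim) -- stand-in for an audio codec/VAE latent
    cond: conditioning embedding (e.g. text or video features), shape (batch, cond_dim)
    """
    x0 = torch.randn_like(x1)            # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)        # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation between noise and data
    v_target = x1 - x0                   # constant velocity along the straight path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = VelocityNet()
x1 = torch.randn(8, 64)      # stand-in for encoded audio latents
cond = torch.randn(8, 32)    # stand-in for a text/video embedding
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```

At inference, audio is generated by integrating the learned velocity field from pure noise at t=0 to t=1 with an ODE solver; the conditioning input is what lets a single such model serve multiple modalities.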

Sources

Guiding Audio Editing with Audio Language Model

Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment

High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling

From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation

UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks

Video Object Segmentation-Aware Audio Generation

Object-AVEdit: An Object-level Audio-Visual Editing Model

PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation

Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions
