Advances in Audio Generation and Editing

The field of audio generation and editing is moving towards more flexible and controllable models. Recent work focuses on improving the alignment between audio and video and on enabling free-form audio editing through natural language instructions. The result is more modular, production-friendly systems that can be extended or upgraded without significant retraining. There is also growing recognition that large audio language models are sensitive to how instructions are phrased, with efforts underway to benchmark this sensitivity and make models more robust to different instruction styles. A further line of research uses synthetic data to address the scarcity of real-world data, particularly in applications such as aphasia research.

Notable papers include:

Foley Control introduces a lightweight approach to video-guided Foley that achieves competitive temporal and semantic alignment with fewer trainable parameters.

SAO-Instruct enables free-form audio editing using natural language instructions and demonstrates competitive performance on objective metrics.

ISA-Bench provides a dynamic benchmark for evaluating instruction sensitivity in large audio language models and highlights the need for instruction-robust audio understanding.

Towards a Method for Synthetic Generation of PWA Transcripts constructs and validates methods for generating synthetic transcripts of aphasic language and shows promising results for capturing key aspects of linguistic degradation.

Sources

Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

Towards a Method for Synthetic Generation of PWA Transcripts
