The field of audio generation and editing is evolving rapidly, with a focus on developing more sophisticated and controllable models. Recent research has explored generative approaches, such as diffusion models and flow matching, to improve the quality and realism of generated audio. There is also growing interest in multimodal approaches that incorporate visual and textual information to enhance audio generation and editing. Notable papers in this area include:

- Text2Move, which generates moving sounds from text prompts in a controllable fashion.
- DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning.
- UniFlow-Audio, a universal audio generation framework based on flow matching that supports omni-modal inputs, including text, audio, and video.
- Object-AVEdit, an object-level audio-visual editing model that achieves strong results on both audio and video object-level editing tasks with fine-grained audio-visual semantic alignment.
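To make the flow-matching idea behind frameworks like UniFlow-Audio concrete, the following is a minimal PyTorch sketch of the conditional flow matching training objective: sample a noise point, interpolate linearly toward a data point, and regress the model onto the straight-line velocity. The `VelocityNet` toy model and the 64-dimensional latents are illustrative placeholders, not the architecture or code of any of the papers above.

```python
# Minimal sketch of conditional flow matching (illustrative only).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v_theta(x_t, t). A real audio model would be a
    large conditioned network (e.g. a transformer over audio latents)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate the time scalar onto each sample as conditioning.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """One training step: along the interpolant
    x_t = (1 - t) * x0 + t * x1, the target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)            # noise endpoint
    t = torch.rand(x1.shape[0], 1)       # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolant
    target_v = x1 - x0                   # ground-truth straight-line velocity
    return nn.functional.mse_loss(model(x_t, t), target_v)

# Usage on a toy batch of "audio latent" vectors:
model = VelocityNet(dim=64)
x1 = torch.randn(8, 64)                  # stand-in for encoded audio data
loss = flow_matching_loss(model, x1)
loss.backward()
```

At sampling time, the learned velocity field is integrated from noise to data with an ODE solver; multimodal variants additionally condition the velocity network on text, video, or reference-audio embeddings.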