Advances in Multimodal Audio Processing and Analysis

Audio processing and analysis is advancing rapidly, with growing emphasis on multimodal approaches that combine audio with text, images, and video. Recent work integrates large language models and transformer architectures to improve the accuracy and efficiency of audio processing tasks. One key research direction is multimodal fusion, which combines signals from different modalities to make audio processing systems more robust. Another is the development of more efficient and scalable models, for example through state space models and retrieval-augmented generation frameworks. These advances could benefit a wide range of applications, including speech recognition, depression detection, and music generation. Notable papers include a multimodal framework for depression detection that combines textual, user-specific, and image analysis to detect depression among social media users, and SeaLLMs-Audio, a large audio-language model tailored to multiple Southeast Asian languages that performs strongly across diverse audio-centric tasks.

Sources

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Expressive Range Characterization of Open Text-to-Audio Models

A Multimodal Framework for Depression Detection during Covid-19 via Harvesting Social Media: A Novel Dataset and Method

Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models

SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Retrieval-Augmented Multimodal Depression Detection

Improving DF-Conformer Using Hydra For High-Fidelity Generative Speech Enhancement on Discrete Codec Token

Apriel-H1: Towards Efficient Enterprise Reasoning Models

The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013–2023

WST: Weakly Supervised Transducer for Automatic Speech Recognition

CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Probabilistic Textual Time Series Depression Detection

PromptSep: Generative Audio Separation via Multimodal Prompting
