Advancements in Audio-Language Models

The field of audio-language models is evolving rapidly, with a focus on improving both the efficiency and accuracy of these models. Recent work highlights the potential of masked diffusion large language models (dLLMs) for overcoming limitations of traditional autoregressive large language models (arLLMs). Multi-token prediction losses and tool-augmented reasoning frameworks have also shown promise for improving speech-to-speech translation and audio-language understanding. In parallel, researchers are exploring new methods for aligning speech and text representations, such as adaptive vector steering and mixture-of-experts steering modules.

Noteworthy papers include:

Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs, which introduces a novel masked fine-tuning paradigm for injecting knowledge into pre-trained arLLMs.

Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning, which presents a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process.

UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE, which proposes a unified speech and music generation model built on a novel Dynamic-Capacity Mixture-of-Experts framework.
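To make the vector-steering idea concrete, the sketch below shows the general recipe behind training-free activation steering: derive a direction from the difference of mean hidden activations between two behavior classes, then add a scaled copy of that direction to a layer's hidden states at inference time. This is a simplified illustration of the general technique, not the specific adaptive, layer-wise method of the paper listed below; all names, shapes, and the scaling constant are illustrative assumptions.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Illustrative: steering direction = difference of mean activations
    between desirable ("grounded") and undesirable ("hallucinated") examples,
    normalized to unit length."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Add the scaled steering direction to a layer's hidden states at
    inference time -- no gradient updates or fine-tuning involved."""
    return hidden + alpha * v

# Toy data standing in for hidden activations collected from a model layer.
rng = np.random.default_rng(0)
d = 16
pos = rng.normal(0.5, 1.0, size=(32, d))   # activations on grounded outputs
neg = rng.normal(-0.5, 1.0, size=(32, d))  # activations on hallucinated outputs
v = steering_vector(pos, neg)

h = rng.normal(size=(4, d))                # hidden states for 4 tokens
h_steered = steer(h, v)                    # shifted toward the grounded direction
```

In a real model this edit would typically be applied via a forward hook on selected transformer layers; the adaptive variant would choose the layers and scaling per input rather than using a fixed `alpha`.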

Sources

Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction

End-to-end Speech Recognition with similar length speech and text

Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models

UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Steer-MoE: Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module

Closing the Gap Between Text and Speech Understanding in LLMs

TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation
