Advancements in Audio Large Language Models

Recent work on audio large language models concentrates on three areas: making text-to-speech output more human-like, improving speech recognition, and building more robust evaluation frameworks. Researchers are exploring ways to couple large language models with speech encoders to improve performance on tasks such as automatic speech recognition and speech translation. There is also growing emphasis on safety-aware evaluation, both to mitigate diagnostic biases and to assess the trustworthiness of audio large language models. Noteworthy papers include VocalAgent, which introduces a large language model for vocal health diagnostics with a safety-aware evaluation; AudioTrust, which proposes a multifaceted trustworthiness evaluation framework for audio large language models; and LegoSLM, which presents a new paradigm for bridging speech encoders and large language models using ASR posterior matrices (CTC posteriors, per the paper's title).
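To make the idea of bridging a speech encoder and a language model through posterior matrices more concrete, here is a minimal sketch of one plausible reading: each frame's posterior over the vocabulary is used as weights for mixing rows of the language model's embedding table. This is an illustrative assumption, not a description of LegoSLM's actual implementation; the function names, the toy vocabulary, and the toy embedding table are all hypothetical.

```python
import math


def softmax(logits):
    """Convert one frame of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posteriors_to_embeddings(frame_logits, embedding_table):
    """Mix embedding rows using per-frame posteriors (hypothetical helper).

    frame_logits:    T frames, each a list of V vocabulary logits
    embedding_table: V rows, each a list of D floats
    Returns T pseudo-embeddings of dimension D.
    """
    vocab_size = len(embedding_table)
    dim = len(embedding_table[0])
    outputs = []
    for logits in frame_logits:
        if len(logits) != vocab_size:
            raise ValueError("each frame needs one logit per vocabulary entry")
        probs = softmax(logits)
        # Weighted sum of embedding rows, one output dimension at a time.
        mixed = [
            sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Toy example: 3-entry vocabulary, 2-dimensional embeddings (assumed values).
toy_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_logits = [
    [0.0, 0.0, 0.0],    # uniform posterior: an even blend of all rows
    [100.0, 0.0, 0.0],  # peaked posterior: close to the first row
]
pseudo_embeddings = posteriors_to_embeddings(toy_logits, toy_table)
```

With a sharply peaked posterior the result approaches an ordinary embedding lookup, while a flat posterior keeps the encoder's uncertainty in the vector handed to the language model.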

Sources

Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation

S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Large Language Models based ASR Error Correction for Child Conversations

IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
