Advances in Spoken Language Modeling

The field of spoken language modeling is moving toward more advanced approaches, with a focus on bridging the acoustic-semantic gap and building more effective models for speech-to-speech and speech-to-text translation. Researchers are exploring new methods, such as generative modeling and cross-modal knowledge distillation, to improve the performance of spoken language models. Noteworthy papers include GmSLM, which introduces a novel pipeline for modeling Marmoset vocal communication and demonstrates its advantage over basic human-speech-based baselines, and EchoX, which mitigates the acoustic-semantic gap via echo training for speech-to-speech LLMs and achieves strong performance on multiple knowledge-based question-answering benchmarks.
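As a rough illustration of the cross-modal knowledge distillation idea mentioned above, the sketch below shows a standard temperature-scaled distillation loss, where a student (e.g. a speech LLM) is trained to match the softened output distribution of a teacher (e.g. a text LLM). This is a generic sketch, not the specific method of any paper listed here; the function names and the temperature value are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions,
    scaled by T^2 as in standard knowledge distillation.

    In a cross-modal setup (an assumption for illustration), the teacher
    logits might come from a text LLM and the student logits from a
    speech LLM predicting the same next token.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

A higher temperature softens both distributions, exposing the teacher's relative preferences among non-top tokens, which is often where the useful "dark knowledge" for the student lies.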

Sources

GmSLM: Generative Marmoset Spoken Language Modeling

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task

Improving Audio Event Recognition with Consistency Regularization

From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models

UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition

Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens

Cross-Modal Knowledge Distillation for Speech Large Language Models
