Advancements in Speech-Language Models

The field of speech-language models is moving toward contextual paralinguistic understanding and empathetic reasoning, with the goal of building more natural and effective conversational systems. Recent work explores novel training methods, including implicit and explicit approaches to incorporating paralinguistic information, as well as planning-inspired text guidance for more meaningful dialogue generation. There is also growing interest in unified models that handle speech understanding and speech generation within a single framework.

Noteworthy papers include: Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models, which proposes two approaches to incorporating contextual paralinguistic information into model training; DualSpeechLM, which presents a dual-token modeling framework that concurrently models understanding-driven speech tokens as input and acoustic tokens as output; and OSUM-EChat, which introduces a three-stage understanding-driven spoken dialogue training strategy and a linguistic-paralinguistic dual thinking mechanism to enhance empathetic interaction.
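To make the dual-token idea concrete, below is a minimal sketch (not the papers' actual implementation) of a backbone that takes understanding-driven speech tokens as conditioning input and autoregressively predicts acoustic tokens as output. The `DualTokenLM` class, the vocabulary sizes, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of dual speech token modeling, assuming a shared causal
# transformer backbone: understanding tokens condition the sequence and an
# acoustic head scores the next acoustic token. All names and sizes are
# hypothetical, not taken from DualSpeechLM's released code.
import torch
import torch.nn as nn

class DualTokenLM(nn.Module):
    def __init__(self, n_understanding_tokens=1024, n_acoustic_tokens=2048,
                 d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Separate embedding tables for the two token streams.
        self.understanding_emb = nn.Embedding(n_understanding_tokens, d_model)
        self.acoustic_emb = nn.Embedding(n_acoustic_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Output head predicts the next acoustic token.
        self.acoustic_head = nn.Linear(d_model, n_acoustic_tokens)

    def forward(self, understanding_ids, acoustic_ids):
        # Concatenate understanding tokens (input side) with previously
        # generated acoustic tokens (output side) into one causal sequence.
        x = torch.cat([self.understanding_emb(understanding_ids),
                       self.acoustic_emb(acoustic_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        # Score next acoustic tokens only over the acoustic segment.
        return self.acoustic_head(h[:, understanding_ids.size(1):])

# Toy usage: batch of 2, 20 understanding tokens conditioning 10 acoustic tokens.
logits = DualTokenLM()(torch.randint(0, 1024, (2, 20)),
                       torch.randint(0, 2048, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 2048])
```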

Sources

Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

Dual Information Speech Language Models for Emotional Conversations

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue

Shaping Event Backstories to Estimate Potential Emotion Contexts
