Empathetic Speech Language Models and Multilingual Speech Synthesis

The field of speech language models is moving toward more empathetic, human-like conversation, with an emphasis on integrating linguistic content with diverse vocal cues. Recent work highlights the importance of testing emotion recognition in spoken language models, including on emotionally incongruent speech, and the need for evaluation metrics that go beyond traditional word-level accuracy. Multilingual speech synthesis is also gaining prominence, particularly engine-agnostic frameworks that handle code-switching and varied scripts. Noteworthy papers include EchoMind, which presents an interrelated multi-level benchmark for evaluating empathetic speech language models; SFMS-ALR, which introduces a script-first multilingual speech synthesis framework with adaptive locale resolution; and SP-MCQA, which evaluates the intelligibility of TTS beyond the word level.
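
The script-first idea can be illustrated with a minimal sketch (this is not the SFMS-ALR implementation, only an assumption about the general approach): segment code-switched text into runs by Unicode script, then resolve each run to a locale for whichever TTS engine is in use. The SCRIPT_TO_LOCALE table and segment_by_script helper below are hypothetical names introduced for illustration.

```python
import unicodedata
from itertools import groupby

# Hypothetical script-to-locale table; a real system would refine this with
# language identification, user preferences, and engine capabilities.
SCRIPT_TO_LOCALE = {
    "LATIN": "en-US",
    "DEVANAGARI": "hi-IN",
    "CJK": "zh-CN",
    "HIRAGANA": "ja-JP",
    "KATAKANA": "ja-JP",
    "HANGUL": "ko-KR",
    "ARABIC": "ar-SA",
    "CYRILLIC": "ru-RU",
}

def script_of(char: str) -> str:
    """Coarse script label for a character, derived from its Unicode name."""
    if char.isspace() or not char.isalpha():
        return "COMMON"          # punctuation, digits, and marks join the neighboring run
    name = unicodedata.name(char, "UNKNOWN")
    return name.split()[0]       # e.g. "LATIN SMALL LETTER A" -> "LATIN"

def segment_by_script(text: str):
    """Split code-switched text into (segment, locale) runs for per-run synthesis."""
    runs = []
    for script, chars in groupby(text, key=script_of):
        segment = "".join(chars)
        if script == "COMMON" and runs:
            runs[-1][0] += segment          # attach neutral characters to the previous run
            continue
        locale = SCRIPT_TO_LOCALE.get(script, "en-US")
        if runs and runs[-1][1] == locale:
            runs[-1][0] += segment          # merge adjacent runs that share a locale
        else:
            runs.append([segment, locale])
    return [(seg, loc) for seg, loc in runs]

if __name__ == "__main__":
    for segment, locale in segment_by_script("Play the song फिर मिलेंगे now"):
        print(f"{locale}: {segment!r}")
```

Keeping segmentation at the script level, before any engine-specific voice selection, is what makes such a pipeline engine-agnostic: each (segment, locale) pair can be handed to any backend that accepts a locale tag.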

Sources

EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level
