Advances in Speech and Language Processing

Recent work in speech and language processing is moving toward more expressive, human-like models. Current efforts focus on improving the accuracy and expressiveness of speech synthesis and on strengthening the ability of language models to understand and generate emotionally nuanced text. Researchers are applying large language models and multimodal approaches to produce more realistic and engaging speech, and are building new datasets and evaluation frameworks to support these systems.

Noteworthy papers in this area include:

LanStyleTTS, a non-autoregressive, language-aware, style-adaptive TTS framework that enables fine-grained, phoneme-level style control across languages.

ELSA, a new dataset for emotionally intelligent language generation, in which each original sentence is regenerated as multiple emotionally nuanced variations across distinct contextual styles.

Dopamine Audiobook, a unified, training-free system for emotional and human-like audiobook generation and evaluation, reported to achieve stronger emotional expression than state-of-the-art TTS models.

GOAT-TTS, an LLM-based text-to-speech generation approach optimized via a dual-branch architecture, which addresses issues around acoustic characteristics, prompt speech-text pairs, and catastrophic forgetting of the LLM's native text comprehension.

Sources

From Speech to Summary: A Comprehensive Survey of Speech Summarization

Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation

ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation

On The Landscape of Spoken Language Models: A Comprehensive Survey

Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition

Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation

Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation

GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

KODIS: A Multicultural Dispute Resolution Dialogue Corpus
