Prosody and Emotion Recognition in Speech

The field of speech processing is placing greater emphasis on capturing and conveying nuanced prosodic features such as emotion and sarcasm. Recent studies have examined how well discrete speech tokens encode prosodic information, focusing on token designs that let generated responses reflect both semantic content and prosody. Another line of research develops speech emotion recognition models that accurately identify emotions in audio, with some studies incorporating personality traits and multimodal features. Feedback loss and transfer learning have also been shown to improve the quality and naturalness of synthesized speech, particularly for sarcastic speech synthesis. Noteworthy papers include:

  • Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis, which introduces a novel approach to synthesizing sarcastic speech using feedback loss from a bi-modal sarcasm detection model.
  • EmoTale: An Enacted Speech-emotion Dataset in Danish, which presents a new dataset for Danish emotional speech and demonstrates its validity using speech emotion recognition models.
  • Human Feedback Driven Dynamic Speech Emotion Recognition, which proposes a multi-stage method for dynamic speech emotion recognition that incorporates human feedback and models emotional mixtures using the Dirichlet distribution.
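Modeling emotional mixtures with a Dirichlet distribution, as the last paper above does, amounts to treating a predicted emotion not as a single hard label but as a probability vector over categories. The sketch below is a minimal illustration of that idea, not the paper's actual method; the emotion categories and concentration parameters are hypothetical.

```python
import numpy as np

# Hypothetical emotion categories for illustration only.
EMOTIONS = ["neutral", "happy", "sad", "angry"]

def sample_emotion_mixture(alpha, seed=None):
    """Draw a soft emotion label from a Dirichlet distribution.

    `alpha` is the vector of concentration parameters, one per
    emotion. The sample is a probability vector summing to 1,
    e.g. {"neutral": 0.6, "happy": 0.3, ...} represents a mostly
    neutral utterance with some happiness mixed in.
    """
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet(alpha)
    return dict(zip(EMOTIONS, probs))

# A flat alpha (all ones) yields highly mixed labels; a peaked
# alpha (one large component) yields near-one-hot labels.
mixed = sample_emotion_mixture([1.0, 1.0, 1.0, 1.0], seed=0)
peaked = sample_emotion_mixture([50.0, 1.0, 1.0, 1.0], seed=0)
```

The appeal of this parameterization is that the concentration vector directly controls label ambiguity, which makes it a natural fit for incorporating human feedback about how mixed an emotional expression sounds.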

Sources

Benchmarking Prosody Encoding in Discrete Speech Tokens

Speech Emotion Recognition Using Fine-Tuned DWFormer: A Study on Track 1 of the IERP Challenge 2024

Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis

EmoTale: An Enacted Speech-emotion Dataset in Danish

Human Feedback Driven Dynamic Speech Emotion Recognition
