The field of speech emotion recognition and prosody analysis is moving toward more inclusive and explainable models, with self-supervised learning used to improve performance in low-resource settings. Researchers are exploring new ways to identify semantically important segments in speech signals, for example by deriving word informativeness from pre-trained language models. There is also growing interest in the prosodic correlates of emotional states, such as pitch and intensity, as inputs to emotion recognition. Self-supervised representations such as Wav2Vec 2.0 and HuBERT have shown promise in capturing the subtle speech patterns linked to emotional states. Further work investigates the temporal granularity of prosodic structure and its contribution to speech comprehension, as well as the harmonic structure of information contours in language. Illustrative code sketches of these techniques follow the paper list. Noteworthy papers in this area include:
- Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition, which reports notable F1-score improvements in low-resource languages.
- Investigating the Impact of Word Informativeness on Speech Emotion Recognition, which enhances emotion recognition accuracy by computing acoustic features on semantically important segments.
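
To make the word-informativeness idea concrete, the sketch below scores each word of a transcript by its surprisal under a pre-trained language model; high-surprisal words would then mark the segments over which acoustic features are computed. This is a minimal illustration assuming GPT-2 via Hugging Face transformers; the cited paper's exact model and informativeness measure may differ.

```python
# Minimal sketch: per-word surprisal from a pre-trained LM as a proxy
# for word informativeness. GPT-2 and the sub-token pooling choice are
# assumptions, not the cited paper's exact setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisal(text: str) -> dict:
    """Map each word index in `text` to its summed sub-token surprisal (nats)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    # Surprisal of token t given tokens < t; the first token has no
    # left context under this LM, so it is skipped.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    ids = enc.input_ids[0]
    tok_surprisal = -log_probs[torch.arange(len(ids) - 1), ids[1:]]
    scores: dict = {}
    for pos, wid in enumerate(enc.word_ids()):
        if wid is None or pos == 0:
            continue
        scores[wid] = scores.get(wid, 0.0) + tok_surprisal[pos - 1].item()
    return scores

# Words with the highest surprisal would anchor the acoustic analysis.
print(word_surprisal("I honestly did not expect that verdict"))
```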
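
For the prosodic side, pitch and intensity contours can be extracted with standard tooling. The sketch below uses librosa's probabilistic YIN for F0 and frame-level RMS energy as an intensity proxy; these feature choices are illustrative rather than any cited paper's exact recipe.

```python
# Minimal sketch: frame-level pitch (F0) and intensity contours.
# librosa and the pYIN/RMS choices are assumptions for illustration.
import librosa
import numpy as np

def prosody_contours(path: str, sr: int = 16000):
    y, sr = librosa.load(path, sr=sr)
    # Fundamental frequency via probabilistic YIN; unvoiced frames are NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    # Intensity proxy: per-frame RMS energy in dB (default hop matches pYIN's).
    rms = librosa.feature.rms(y=y)[0]
    intensity_db = librosa.amplitude_to_db(rms, ref=np.max)
    return f0, intensity_db

f0, intensity = prosody_contours("utterance.wav")  # hypothetical file
print(f"mean voiced F0: {np.nanmean(f0):.1f} Hz")
```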
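
Finally, self-supervised representations such as Wav2Vec 2.0 are typically used by pooling frame-level encoder states into an utterance vector and training a light classifier on top, a common low-resource baseline; swapping in HuBERT is analogous. The checkpoint name, pooling, and linear probe below are assumptions, not the cited papers' configuration.

```python
# Minimal sketch: utterance-level Wav2Vec 2.0 features for emotion
# classification. Checkpoint, pooling, and label set are assumptions.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform_16k: np.ndarray) -> torch.Tensor:
    """Mean-pool frame representations into a single utterance vector."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(**inputs).last_hidden_state  # (1, T, 768)
    return frames.mean(dim=1).squeeze(0)              # (768,)

# A linear probe over the frozen embedding keeps the trainable parameter
# count small, which suits low-resource settings.
probe = torch.nn.Linear(768, 4)  # e.g. angry / happy / sad / neutral
wav = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of silence
logits = probe(utterance_embedding(wav))
```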