Multimodal Learning and Speech Technologies

Research in multimodal learning and speech technologies is increasingly focused on two persistent challenges: modality imbalance and scarce annotated data. To address them, recent work explores mutual information regularization, adaptive weight allocation mechanisms, and emotion-sensitive augmentation frameworks, all aimed at improving the robustness and generalization of multimodal models so that they can integrate information from different modalities more effectively. Notable papers include Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model, which proposes a data augmentation framework that combines cross-modal information transfer with mutual information regularization, and MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis, which introduces an adaptive, emotion-sensitive augmentation framework that automatically optimizes how samples are mixed across modalities.
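
To make the mixup-style augmentation idea concrete, the sketch below shows a minimal, generic multimodal mixup for paired audio and text features. It is not the MS-Mix method from the cited paper: the shared Beta-sampled mixing ratio and the function and parameter names (multimodal_mixup, alpha) are illustrative assumptions, whereas MS-Mix additionally adapts the mixing in an emotion-sensitive way.

```python
# Minimal sketch of mixup-style augmentation for paired multimodal features.
# Assumptions: features are precomputed embeddings, labels are one-hot or soft
# emotion labels, and a single Beta-sampled ratio is shared across modalities.
# This is NOT the adaptive, emotion-sensitive scheme of MS-Mix.
import numpy as np

def multimodal_mixup(audio, text, labels, alpha=0.4, rng=None):
    """Mix randomly paired samples in each modality and mix their labels.

    audio:  (N, D_a) array of audio features
    text:   (N, D_t) array of text features
    labels: (N, C) one-hot or soft labels
    alpha:  Beta-distribution parameter controlling mixing strength
    """
    rng = rng or np.random.default_rng()
    n = audio.shape[0]
    perm = rng.permutation(n)                  # random pairing of samples
    lam = rng.beta(alpha, alpha, size=(n, 1))  # one mixing ratio per pair

    mixed_audio = lam * audio + (1 - lam) * audio[perm]
    mixed_text = lam * text + (1 - lam) * text[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_audio, mixed_text, mixed_labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 128))
    text = rng.normal(size=(8, 768))
    labels = np.eye(4)[rng.integers(0, 4, size=8)]
    a, t, y = multimodal_mixup(audio, text, labels, rng=rng)
    print(a.shape, t.shape, y.shape)  # (8, 128) (8, 768) (8, 4)
```

Sharing one mixing ratio across modalities keeps the two streams consistent for a given pair; an adaptive variant along the lines of MS-Mix would instead learn or estimate per-modality, emotion-aware ratios.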

Sources

Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Mixup Helps Understanding Multimodal Video Better

MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis

Cost Analysis of Human-corrected Transcription for Predominately Oral Languages

Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

Revisit Modality Imbalance at the Decision Layer