Research in multimodal learning and speech technologies is increasingly focused on two challenges: modality imbalance and scarce annotated data. To address them, recent work draws on mutual information regularization, adaptive weight allocation mechanisms, and emotion-sensitive augmentation frameworks, with the shared goal of making multimodal models more robust and better able to integrate information across modalities. Notable papers in this area include "Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model", which proposes a data augmentation framework aided by cross-modal information transfer and mutual information regularization, and "MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis", which introduces an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings.
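To ground the mixup idea referenced above, the sketch below applies a single Beta-sampled mixing coefficient jointly to the features of each modality and to the labels. This is a minimal illustration of vanilla multimodal mixup only, not the adaptive, emotion-sensitive scheme MS-Mix describes; the dictionary keys, tensor shapes, and fixed Beta prior are assumptions made for the example.

```python
# Minimal sketch of mixup applied jointly across modalities.
# Illustrative only: MS-Mix's actual emotion-sensitive, per-modality mixing
# is more involved; all names and shapes here are hypothetical.
import torch
import torch.nn.functional as F

def multimodal_mixup(batch_a, batch_b, labels_a, labels_b, alpha=0.4):
    """Interpolate paired samples and labels with one Beta-sampled coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Mix each modality's features with the same coefficient.
    mixed = {m: lam * batch_a[m] + (1.0 - lam) * batch_b[m] for m in batch_a}
    # Mix the (soft) label targets accordingly.
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b
    return mixed, mixed_labels

if __name__ == "__main__":
    # Random stand-in features for two modalities (e.g., audio and text embeddings).
    a = {"audio": torch.randn(8, 128), "text": torch.randn(8, 768)}
    b = {"audio": torch.randn(8, 128), "text": torch.randn(8, 768)}
    ya = F.one_hot(torch.randint(0, 4, (8,)), num_classes=4).float()
    yb = F.one_hot(torch.randint(0, 4, (8,)), num_classes=4).float()
    mixed, y = multimodal_mixup(a, b, ya, yb)
    print(mixed["audio"].shape, mixed["text"].shape, y.shape)
```

An adaptive framework in the spirit of MS-Mix would replace the fixed Beta prior and the single shared coefficient with mixing ratios chosen per sample and per modality based on the emotional content, rather than drawn at random as in this sketch.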