The field of multimodal analysis is advancing rapidly, with a focus on detecting mental health conditions such as depression, as well as hate speech, in videos and social media. Researchers are proposing novel frameworks and datasets to improve the accuracy of detection models, and contrastive learning, transformer networks, and multimodal fusion techniques are becoming increasingly popular. These approaches enable the effective extraction and fusion of features across modalities, improving performance in both depression detection and hate speech analysis (a minimal fusion sketch follows the paper list below). Notable papers in this area include:

- ImpliHateVid, which introduces a large-scale dataset and a two-stage contrastive learning framework for implicit hate speech detection in videos.
- MMFformer, which proposes a multimodal depression detection network that surpasses existing state-of-the-art approaches.
- eMotions, which provides a large-scale dataset and an audio-visual fusion network for emotion analysis in short-form videos.
- MDD-Net, which utilizes mutual transformers to efficiently extract and fuse multimodal features for depression detection.
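
To make the idea of cross-modal feature fusion concrete, the sketch below shows a mutual cross-attention block in PyTorch, where audio and visual token sequences attend to each other before being pooled and classified. This is an illustrative assumption, not the architecture of any of the papers above; the class name `CrossModalFusion`, the feature dimension, and the two-class head are all hypothetical choices.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative mutual cross-attention fusion of two modality streams
    (e.g. audio and visual token sequences from pretrained encoders)."""

    def __init__(self, dim=256, num_heads=4, num_classes=2):
        super().__init__()
        # Each modality attends to the other ("mutual" cross-attention).
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        # Pooled streams are concatenated and classified (e.g. depressed vs. control).
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, video):
        # audio: (B, T_a, dim), video: (B, T_v, dim) -- pre-extracted features.
        a_att, _ = self.audio_to_video(query=audio, key=video, value=video)
        v_att, _ = self.video_to_audio(query=video, key=audio, value=audio)
        a = self.norm_a(audio + a_att).mean(dim=1)  # residual + mean-pool audio stream
        v = self.norm_v(video + v_att).mean(dim=1)  # residual + mean-pool video stream
        return self.classifier(torch.cat([a, v], dim=-1))

# Toy usage with random tensors standing in for real encoder outputs.
model = CrossModalFusion()
logits = model(torch.randn(8, 50, 256), torch.randn(8, 30, 256))
print(logits.shape)  # torch.Size([8, 2])
```

The cross-attention step is what distinguishes this kind of fusion from simple feature concatenation: each modality's representation is refined by conditioning on the other before the final decision is made.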