The field of multimodal analysis and detection is evolving rapidly, with recent work applying large language models (LLMs) and vision-language models (VLMs) to deception detection, image splicing and deepfake detection, and emotion and sentiment analysis in text and images. Multimodal learning analytics and personalized feedback frameworks have also shown promise for improving student learning outcomes and supporting positive academic emotions. Notably, LLMs have achieved competitive zero-shot performance on image forensics tasks, while VLMs show only moderate performance on academic facial expression recognition. Despite these advances, LLMs are not yet reliable for standalone deepfake detection and require further development.
Noteworthy papers include a study on using LLMs for image splicing detection, which achieved competitive detection performance in zero-shot settings, and a study on voice phishing detection with fine-tuned small language models, which yielded the best performance among the small language models evaluated and was comparable to a GPT-4-based voice phishing detector.
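To illustrate the zero-shot image forensics setup mentioned above, the sketch below prompts a general-purpose vision-language model for a splicing verdict on a single image. It is a minimal sketch assuming the OpenAI Python SDK; the model name, prompt wording, and output format are illustrative choices and not the cited paper's protocol.

```python
# Minimal sketch: zero-shot image splicing detection by prompting a
# vision-language model. Assumes the OpenAI Python SDK (v1.x); the model
# name, prompt, and label scheme are hypothetical, not the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def splice_verdict(image_path: str, model: str = "gpt-4o") -> str:
    """Ask the VLM for a SPLICED/AUTHENTIC judgment with a short rationale."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are an image forensics assistant. Examine this "
                          "image for signs of splicing (inconsistent lighting, "
                          "edges, noise, or compression artifacts). Answer "
                          "'SPLICED' or 'AUTHENTIC', then give a one-sentence "
                          "rationale.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(splice_verdict("suspect.jpg"))  # hypothetical input file
```

In a benchmark-style evaluation, the same prompt would be run over a labeled forensics dataset and the parsed verdicts scored against ground truth to obtain the zero-shot detection performance reported in such studies.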