The field of deepfake detection and forensics is advancing rapidly, with growing emphasis on multimodal approaches that combine audio, visual, and text signals. Recent research highlights the need for robust, generalizable detection methods that can identify increasingly sophisticated deepfakes. A key challenge is the shortage of large-scale, diverse datasets for training and evaluating detection models. To address this, several new resources have been introduced, including multimodal digital human forgery datasets and benchmarks for face-voice association and video misinformation detection.

Noteworthy papers in this area include ForensicHub, a unified benchmark and codebase for all-domain fake image detection and localization; BiCrossMamba-ST, a robust speech deepfake detection framework built on a dual-branch spectro-temporal architecture; CAD, a general multimodal framework for video deepfake detection that reports significant improvements over previous methods; AvatarShield, a visual reinforcement learning approach for human-centric video forgery detection; and Fact-R1, a novel framework for explainable video misinformation detection with deep reasoning.
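CAD's title names two ingredients, cross-modal alignment and distillation. The paper's exact objective is not reproduced here; the following is only a generic sketch of what such a training loss typically looks like, assuming an alignment term that pulls paired audio/visual embeddings together (via cosine similarity) and a standard knowledge-distillation term (temperature-softened KL divergence between teacher and student predictions). All function names and the 0.5 weighting are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alignment_loss(audio_emb, visual_emb):
    """Cross-modal alignment: 1 minus the mean cosine similarity
    between paired audio and visual embeddings in a batch."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(a * v, axis=1)))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Standard knowledge distillation: KL divergence between
    temperature-softened teacher and student distributions."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)) * T * T)

# Toy batch: 4 paired clips, 8-dim embeddings, binary real/fake logits.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
visual = audio + 0.1 * rng.normal(size=(4, 8))   # nearly aligned pairs
student = rng.normal(size=(4, 2))
teacher = rng.normal(size=(4, 2))

# Hypothetical combined objective: align modalities, distill from teacher.
total = alignment_loss(audio, visual) + 0.5 * distillation_loss(student, teacher)
```

Both terms are zero when their inputs already agree (identical embeddings, identical logits), so minimizing the combined loss drives the student toward modality-consistent, teacher-consistent predictions.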
Multimodal Deepfake Detection and Forensics
Sources
Coordinated Inauthentic Behavior on TikTok: Challenges and Opportunities for Detection in a Video-First Ecosystem
BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention
CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation