Multimodal Deepfake Detection and Forensics

The field of deepfake detection and forensics is advancing rapidly, with a growing focus on multimodal approaches that combine audio, visual, and text signals. Recent work emphasizes robust, generalizable detectors that can identify increasingly sophisticated forgeries. A key challenge is the scarcity of large-scale, diverse datasets for training and evaluating detection models; to address this, several new resources have been introduced, including multimodal digital-human forgery datasets and benchmarks for face-voice association and video misinformation detection.

Noteworthy papers include ForensicHub, a unified benchmark and codebase for all-domain fake image detection and localization; BiCrossMamba-ST, a robust speech deepfake detector built on a dual-branch spectro-temporal architecture; CAD, a general multimodal framework for video deepfake detection that shows significant improvements over previous methods; AvatarShield, a visual reinforcement learning approach for human-centric video forgery detection; and Fact-R1, a novel framework for explainable video misinformation detection with deep reasoning.
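To make the multimodal idea concrete, here is a minimal late-fusion sketch: each modality branch (audio, visual, text) produces its own fake-probability score, and the scores are combined with optional per-modality reliability weights. This is a generic illustration, not the method of any paper above; the function name and weighting scheme are hypothetical.

```python
def late_fusion(scores, weights=None):
    """Combine per-modality deepfake probabilities into one fused score.

    scores  -- dict mapping modality name to a fake probability in [0, 1],
               e.g. {"audio": 0.9, "visual": 0.7, "text": 0.2}
    weights -- optional dict of per-modality reliability weights
               (hypothetical; defaults to uniform weighting)
    """
    if weights is None:
        weights = {m: 1.0 for m in scores}
    total = sum(weights[m] for m in scores)
    # Weighted average of the per-branch probabilities.
    return sum(weights[m] * scores[m] for m in scores) / total

# A clip flagged strongly by the audio branch but weakly by the text branch:
fused = late_fusion({"audio": 0.9, "visual": 0.7, "text": 0.2})
```

In practice, multimodal detectors often fuse learned embeddings rather than final scores (e.g. via cross-modal attention or distillation), but score-level fusion is the simplest baseline for combining heterogeneous branches.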

Sources

BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset

Coordinated Inauthentic Behavior on TikTok: Challenges and Opportunities for Detection in a Video-First Ecosystem

ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection

CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation

SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Pose-invariant face recognition via feature-space pose frontalization

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning

PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association
