Advances in Audio-Visual Speech Processing and Deepfake Detection

Audio-visual speech processing is advancing on three fronts: speech recognition, voice conversion, and deepfake detection. Researchers are addressing challenges such as timbre leakage, speaker privacy, and visual disturbances, and dual attention mechanisms, flow matching, and landmark-guided visual feature extractors are increasingly common. There is also a growing emphasis on robust deepfake detection, including audio-visual speech representation learning and ensemble-based approaches.

Noteworthy papers in this area include the following. DAFMSVC proposes an approach to singing voice conversion using dual attention mechanisms and flow matching. SEF-MK introduces a speaker-embedding-free framework for voice anonymization through multi-k-means quantization. AD-AVSR presents an audio-visual speech recognition framework based on bidirectional modality enhancement. SpeechForensics leverages audio-visual speech representation learning for face forgery detection. Fake Speech Wild proposes a new dataset and benchmark for detecting deepfake speech on social media platforms.

