Research on speech synthesis and audio deepfake detection is evolving rapidly, with a focus on making detection systems more reliable and robust. Recent work highlights the importance of diverse, representative datasets and of new evaluation frameworks for accurately assessing audio deepfake detection models. In parallel, advances in text-to-speech synthesis have produced more efficient and effective models that can generate high-quality speech in a variety of languages and accents. Three papers in this area are noteworthy. "Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems" proposes a novel evaluation framework for audio deepfake detection models. "DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech" introduces a text-to-speech model that achieves promising performance on several key metrics. "HISPASpoof: A New Dataset For Spanish Speech Forensics" presents a large-scale Spanish dataset designed for synthetic speech detection and attribution.
Advances in Speech Synthesis and Audio Deepfake Detection
Sources
DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching