Advances in Speech Synthesis and Audio Deepfake Detection

Speech synthesis and audio deepfake detection are evolving rapidly, with detection research focused on improving the reliability and robustness of detection systems. Recent work highlights the importance of diverse, representative datasets and of new evaluation frameworks for accurately assessing audio deepfake detection models. In parallel, advances in text-to-speech synthesis have produced more efficient models capable of generating high-quality speech in a variety of languages and accents.

Noteworthy papers in this area include:

Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems, which proposes a novel evaluation framework for audio deepfake detection models.

DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech, which introduces a text-to-speech model that achieves promising performance on several key metrics.

HISPASpoof: A New Dataset For Spanish Speech Forensics, which presents a large-scale Spanish dataset designed for synthetic speech detection and attribution.

Sources

Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems

HISPASpoof: A New Dataset For Spanish Speech Forensics

DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration

Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Case-Based Decision-Theoretic Decoding with Quality Memories

A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis

SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models

Measuring Soft Biometric Leakage in Speaker De-Identification Systems

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

MeanFlowSE: one-step generative speech enhancement via conditional mean flow

Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning