Personalization and Security in Automatic Speech Recognition

Automatic speech recognition (ASR) research is moving toward more personalized and controllable models. One line of work adapts generic ASR models to individual speakers using synthetic personal data while preserving the models' ability to recognize a wide range of speakers. This requires frameworks that balance the learning of synthetic, personalized, and generic knowledge. Another area of focus is controllable accent conversion, which allows explicit control over the degree of accent modification and the preservation of speaker identity. At the same time, the increasing use of federated learning for privacy-preserving training of ASR models has raised concerns about attribute inference attacks. Noteworthy papers in this area include:

  • One that proposes a knowledge-decoupled functionally invariant path framework for personalized ASR models, achieving a 29.38% relative character error rate reduction on target speakers.
  • Another that presents a controllable zero-shot foreign accent conversion framework built on a factorized speech codec, providing an explicit user-controllable parameter for accent modification.
  • A study that analyzes the vulnerability of ASR models to attribute inference attacks in the federated setting, demonstrating the feasibility of such attacks on sensitive demographic and clinical attributes.
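
For readers unfamiliar with the metric quoted for the first paper, a relative character error rate (CER) reduction is usually computed as the drop in CER divided by the baseline CER. The sketch below shows that arithmetic under the standard definition of CER (character-level edit distance divided by reference length). The function names and the example numbers are illustrative assumptions and are not taken from the paper, whose exact evaluation setup is not described here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, adapted_cer: float) -> float:
    """Fraction of the baseline CER removed by the adapted model."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - adapted_cer) / baseline_cer


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    print(relative_reduction(0.20, 0.15))
```

With these made-up values, a drop from 20% to 15% CER is a relative reduction of about 25%, even though the absolute reduction is only 5 percentage points. A relative figure therefore cannot be interpreted without knowing the baseline CER.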

Sources

Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

Personal Attribute Leakage in Federated Speech Models
