The field of speech recognition is moving toward more robust and adaptable models that can handle diverse speakers and language varieties. Recent research has focused on incorporating human-like adaptation mechanisms into spoken language models, allowing them to adjust to unfamiliar speakers and language varieties through exposure; this has improved the robustness of automatic speech recognition (ASR) across diverse speaker and language backgrounds. One key innovation is the use of discrete tokens extracted from self-supervised learning (SSL) models: such tokens have been shown to exhibit an interlanguage speech intelligibility benefit and can be used to simulate foreign accents. In addition, novel frameworks for in-context learning and for differentiable k-means clustering have been proposed, the latter enabling joint optimization of tokenization and downstream tasks. Notable papers include:
- In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties, which introduces a scalable framework for in-context learning in speech recognition.
- Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit, which demonstrates the robustness of discrete token-based ASR to non-native speech.
- Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora, which proposes a method for simulating foreign accents using discrete tokens and native speech data.
- Differentiable K-means for Fully-optimized Discrete Token-based ASR, which enables the joint optimization of tokenization and downstream tasks using differentiable k-means clustering.
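To make the in-context adaptation idea concrete, here is a minimal sketch of one common way such prompts are assembled for a decoder-only speech language model: a few (speech-token, transcript) example pairs from the target speaker are prepended to the query utterance, so decoding is conditioned on that speaker's speech. All token names and the function interface are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch: build an in-context prompt for a decoder-only
# speech LM. Special tokens and the interleaving scheme are illustrative.
BOS, SEP, EOS = "<bos>", "<sep>", "<eos>"

def build_icl_prompt(context_pairs, query_speech_tokens):
    """context_pairs: list of (speech_tokens, transcript_tokens) pairs
    from the target speaker; query_speech_tokens: tokens of the
    utterance to be recognized. Returns the full prompt sequence."""
    prompt = [BOS]
    for speech, text in context_pairs:
        # each demonstration: speech tokens, separator, transcript, end marker
        prompt += list(speech) + [SEP] + list(text) + [EOS]
    # the model is expected to continue with the transcript after SEP
    prompt += list(query_speech_tokens) + [SEP]
    return prompt

pairs = [(["s1", "s2"], ["hello"]), (["s3"], ["world"])]
prompt = build_icl_prompt(pairs, ["s4", "s5"])
```

Scaling the number of demonstration pairs is what makes such a framework "scalable" in practice: more context from a speaker gives the model more evidence to adapt to that speaker's accent or variety.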
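The differentiable k-means idea can be sketched with the standard soft-assignment relaxation: replace the hard nearest-centroid assignment with a softmax over negative squared distances, so gradients can flow from the downstream ASR loss back into the centroids. The code below is a minimal NumPy illustration of that relaxation under assumed shapes, not the paper's implementation.

```python
import numpy as np

def soft_kmeans_assign(features, centroids, tau=1.0):
    """Soft (differentiable) cluster assignment.

    features:  (T, D) frame-level SSL features
    centroids: (K, D) cluster centres (the discrete-token codebook)
    tau:       temperature; tau -> 0 recovers hard k-means assignment
    Returns the (T, K) soft assignment matrix and the (T, D)
    soft-quantized features (assignment-weighted centroid mixture).
    """
    # squared Euclidean distance between every frame and every centroid
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (T, K)
    logits = -d2 / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)                 # softmax over clusters
    return a, a @ centroids

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))   # 5 frames of 8-dim "SSL" features
cents = rng.normal(size=(4, 8))   # codebook of 4 centroids
assign, quant = soft_kmeans_assign(feats, cents, tau=0.1)
hard = assign.argmax(axis=1)      # discrete token IDs at low temperature
```

At low temperature the soft assignment concentrates on the nearest centroid, matching ordinary k-means tokenization, while at training time a larger `tau` keeps the mapping smooth enough to optimize jointly with the downstream task.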