The field of multilingual speech processing is moving towards more efficient and accurate models, with a particular focus on adapting to low-resource languages and domains. Recent research has explored large language models, self-supervised learning, and data augmentation to improve performance on tasks such as language identification (LID), automatic speech recognition (ASR), and speech translation. Notable advances include few-shot learning methods that let models learn from limited labeled data, and grapheme-coherent phonemic and prosodic annotation to improve speech recognition accuracy. Noteworthy papers include:
- Improving Multilingual Speech Models on ML-SUPERB 2.0, which achieved a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR character error rate (CER).
- Fewer Hallucinations, More Verification, which proposed a three-stage LLM-based framework for ASR error correction and achieved relative CER/WER reductions of 21%, 11%, 9%, and 11.4% across its evaluation sets.
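
To make the error-correction idea concrete, below is a minimal Python sketch of a detect-correct-verify pipeline in the spirit of a three-stage LLM-based framework. The stage names, prompts, and the stand-in `LLMFn` callable are illustrative assumptions, not the exact design of the paper above.

```python
"""Minimal sketch of a three-stage (detect -> correct -> verify) LLM-based
ASR error-correction pipeline. Stage design and prompts are assumptions."""
from typing import Callable

LLMFn = Callable[[str], str]  # maps a prompt string to the model's text response


def detect(hypothesis: str, llm: LLMFn) -> bool:
    """Stage 1: ask the LLM whether the ASR hypothesis likely contains errors."""
    prompt = (
        "Does the following ASR transcript contain recognition errors? "
        f"Answer yes or no.\nTranscript: {hypothesis}"
    )
    return llm(prompt).strip().lower().startswith("yes")


def correct(hypothesis: str, llm: LLMFn) -> str:
    """Stage 2: ask the LLM for a corrected transcript with minimal edits."""
    prompt = (
        "Correct only clear recognition errors in this ASR transcript, "
        f"changing as little as possible.\nTranscript: {hypothesis}\nCorrected:"
    )
    return llm(prompt).strip()


def verify(hypothesis: str, candidate: str, llm: LLMFn) -> str:
    """Stage 3: keep the correction only if the LLM judges it faithful to the
    original hypothesis; otherwise fall back, which limits hallucinated edits."""
    prompt = (
        f"Original ASR transcript: {hypothesis}\n"
        f"Proposed correction: {candidate}\n"
        "Is the correction faithful to the original content? Answer yes or no."
    )
    return candidate if llm(prompt).strip().lower().startswith("yes") else hypothesis


def correct_transcript(hypothesis: str, llm: LLMFn) -> str:
    """Run detect -> correct -> verify; skip correction when no errors are detected."""
    if not detect(hypothesis, llm):
        return hypothesis
    return verify(hypothesis, correct(hypothesis, llm), llm)


if __name__ == "__main__":
    # Dummy LLM so the sketch runs without any API; swap in a real model call.
    dummy_llm: LLMFn = lambda p: "yes" if "Answer yes or no" in p else "i recognize speech"
    print(correct_transcript("i wreck a nice beach", dummy_llm))
```

The verification stage is the design choice that targets hallucination: an unfaithful correction is discarded in favor of the original hypothesis, trading some potential gains for stability on already-correct transcripts.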