Speech recognition and processing research is moving towards more efficient and robust models, with particular attention to low-resource languages and edge devices. Recent studies use transfer learning, attention mechanisms, and grapheme-to-phoneme conversion to improve recognition accuracy. There is also growing interest in lightweight models that can run on edge devices and support real-time speech recognition and processing. Noteworthy papers include a unified denoising and adaptation framework for self-supervised Bengali dialectal ASR, which reports state-of-the-art results, and ArabEmoNet, a lightweight hybrid 2D CNN-BiLSTM model for robust Arabic speech emotion recognition. Tiny specialized ASR models, such as Flavors of Moonshine, have also shown promising results for underrepresented languages. Together, this work points towards more accurate, efficient, and accessible speech recognition and processing systems.
Advancements in Speech Recognition and Processing
Sources
Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks
ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition
CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-Car Speech Separation with Distributed Heterogeneous Arrays
Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition
TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition