The field of speech recognition and multimodal processing is rapidly advancing, with a focus on improving the accuracy and robustness of systems in low-resource languages and multimodal settings. Recent work has explored the use of intermediate representations in spoken language models, demonstrating the importance of modality adapters in transforming representations. Additionally, researchers have made significant progress in facial emotion recognition, achieving competitive results with efficient architectures and imbalance-aware optimization. The development of unsupervised speech recognition frameworks has also shown promise, with syllable-level approaches achieving significant reductions in character error rates. Furthermore, the application of machine learning to Parkinson's disease diagnosis has yielded promising results, with cross-lingual multi-granularity frameworks and modified transfer learning approaches demonstrating enhanced diagnostic capability.

Noteworthy papers include:
Transcribe, Translate, or Transliterate, which examines the output representation of modality adapters in spoken language models.
InsideOut, which presents a reproducible facial emotion recognition framework built on EfficientNetV2-S with transfer learning and strong data augmentation.
Revisiting Direct Speech-to-Text Translation with Speech LLMs, which systematically compares Chain-of-Thought and Direct prompting under increasing amounts of data.
Towards Unsupervised Speech Recognition at the Syllable-Level, which introduces a syllable-level unsupervised speech recognition framework based on masked language modeling.
Cross-Lingual Multi-Granularity Framework for Interpretable Parkinson's Disease Diagnosis from Speech, which develops a granularity-aware approach for multilingual PD detection using an automated pipeline.
Advances in Speech Recognition and Multimodal Processing
Sources
Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models
InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition
Cross-Lingual Multi-Granularity Framework for Interpretable Parkinson's Disease Diagnosis from Speech
Exploring the Efficacy of Modified Transfer Learning in Identifying Parkinson's Disease Through Drawn Image Patterns
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation