Advances in Speech Recognition and Multimodal Processing

The field of speech recognition and multimodal processing is advancing rapidly, with particular attention to the accuracy and robustness of systems for low-resource languages and multimodal settings. Recent work probes the intermediate representations of spoken language models, highlighting the role modality adapters play in transforming those representations. Researchers have also made notable progress in facial emotion recognition, achieving competitive results with efficient architectures and imbalance-aware optimization. Unsupervised speech recognition frameworks show promise as well, with syllable-level approaches delivering substantial reductions in character error rate. Finally, applying machine learning to Parkinson's disease diagnosis has yielded promising results: cross-lingual multi-granularity frameworks and modified transfer-learning approaches both demonstrate improved diagnostic capability.

Noteworthy papers include:

Transcribe, Translate, or Transliterate, which examines the output representation of modality adapters in spoken language models.

InsideOut, which presents a reproducible facial emotion recognition framework built on EfficientNetV2-S with transfer learning and strong data augmentation (a training sketch follows below).

Revisiting Direct Speech-to-Text Translation with Speech LLMs, which systematically compares Chain-of-Thought and Direct prompting under increasing amounts of data (the two prompt styles are contrasted below).

Towards Unsupervised Speech Recognition at the Syllable-Level, which introduces a syllable-level unsupervised speech recognition framework based on masked language modeling (sketched below).

Cross-Lingual Multi-Granularity Framework for Interpretable Parkinson's Disease Diagnosis from Speech, which develops a granularity-aware approach for multilingual PD detection using an automated pipeline.
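The InsideOut entry pairs EfficientNetV2-S transfer learning with imbalance-aware optimization. Below is a minimal PyTorch sketch of that general recipe, assuming a class-weighted cross-entropy as the imbalance mechanism and a seven-class emotion setup; the class counts and hyperparameters are hypothetical illustrations, not the authors' configuration.

```python
# Sketch: EfficientNetV2-S transfer learning for facial emotion recognition
# with an imbalance-aware (class-weighted) loss. All counts and
# hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # assumption: the common seven-emotion setup (e.g. FER-2013)

# Start from ImageNet-pretrained weights and swap in a new classifier head.
model = models.efficientnet_v2_s(
    weights=models.EfficientNet_V2_S_Weights.IMAGENET1K_V1
)
in_features = model.classifier[1].in_features  # 1280 for EfficientNetV2-S
model.classifier[1] = nn.Linear(in_features, NUM_CLASSES)

# Imbalance-aware optimization: weight each class inversely to its frequency.
class_counts = torch.tensor(  # hypothetical per-class training counts
    [4953, 547, 5121, 8989, 6077, 4002, 6198], dtype=torch.float
)
class_weights = class_counts.sum() / (NUM_CLASSES * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; images are (B, 3, H, W), labels are (B,)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Inverse-frequency weighting is one standard choice here; the paper's "strong data augmentation" would additionally sit in the data-loading pipeline rather than in this loss.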
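For the speech-to-text translation comparison, the two prompting regimes differ chiefly in whether the model is asked to transcribe before translating. The templates below illustrate that contrast only; the wording and the audio placeholder token are assumptions, not the paper's actual prompts.

```python
# Sketch: Direct vs. Chain-of-Thought prompting for speech-to-text
# translation with a speech LLM. Templates and placeholder are hypothetical.

def direct_prompt(audio_placeholder: str, tgt_lang: str) -> str:
    """Direct: map speech straight to the target-language translation."""
    return f"{audio_placeholder}\nTranslate the speech into {tgt_lang}:"

def cot_prompt(audio_placeholder: str, tgt_lang: str) -> str:
    """Chain-of-Thought: transcribe first, then translate the transcript."""
    return (
        f"{audio_placeholder}\n"
        f"First transcribe the speech, then translate the transcript "
        f"into {tgt_lang}.\nTranscript:"
    )
```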
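The syllable-level unsupervised ASR framework builds on masked language modeling over discrete syllable units. Below is a minimal sketch of that objective, assuming the units come from some upstream clustering of speech features; the vocabulary size, mask rate, and model dimensions are illustrative, and positional encodings are omitted for brevity.

```python
# Sketch: masked language modeling over discrete syllable units, the
# objective the syllable-level unsupervised ASR framework builds on.
# All sizes and rates are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 2048     # assumption: discrete syllable units from clustering
MASK_ID = VOCAB_SIZE  # reserve one extra id for the [MASK] token
MASK_RATE = 0.15

class SyllableMLM(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 4):
        super().__init__()
        # +1 embedding slot for [MASK]; positional encodings omitted here.
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) syllable-unit ids -> (B, T, VOCAB_SIZE) logits
        return self.head(self.encoder(self.embed(tokens)))

def mlm_loss(model: SyllableMLM, tokens: torch.Tensor) -> torch.Tensor:
    """Mask a random subset of syllable tokens and score only those positions."""
    mask = torch.rand(tokens.shape) < MASK_RATE
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    # Unmasked positions get target -100 so cross_entropy ignores them.
    targets = tokens.masked_fill(~mask, -100)
    return nn.functional.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=-100
    )
```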

Sources

Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Error correction in multiclass image classification of facial emotion on unbalanced samples

Towards Unsupervised Speech Recognition at the Syllable-Level

In-Vivo Training for Deep Brain Stimulation

Cross-Lingual Multi-Granularity Framework for Interpretable Parkinson's Disease Diagnosis from Speech

CARE-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment

A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance

Exploring the Efficacy of Modified Transfer Learning in Identifying Parkinson's Disease Through Drawn Image Patterns

How I Built ASR for Endangered Languages with a Spoken Dictionary

Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation

How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu
