Advances in Speech and Language Understanding

The field of speech and language understanding is advancing rapidly, with a focus on improving the performance of large language models and speech recognition systems. Recent work highlights the importance of fine-tuning these models for specific tasks and languages, particularly in low-resource settings. Multimodal fusion frameworks and attention mechanisms have also shown promise in improving the accuracy and robustness of speech and language models. Researchers are further exploring new methods for speech translation, pronunciation assessment, and speaker verification, with an emphasis on developing more efficient and effective algorithms. Noteworthy papers in this area include the introduction of VOX-KRIKRI, a multimodal fusion framework for speech and language understanding, and the development of PART, a progressive alignment representation training method for multilingual speech-to-text systems. Overall, the field is moving toward more sophisticated and nuanced models of speech and language understanding, with potential applications ranging from speech recognition and translation to dialogue systems and language learning.
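To make the fusion idea concrete, the sketch below shows generic cross-modal attention: text-side queries attend over speech-frame keys and values, producing text representations enriched with acoustic context. This is a minimal, dependency-free illustration of the general mechanism, not the specific architecture of VOX-KRIKRI or any other listed paper; all names and dimensions are invented for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(text_queries, speech_keys, speech_values):
    """For each text query, compute scaled dot-product attention over
    the speech frames and return the weighted sum of speech values."""
    dim = len(speech_keys[0])
    fused = []
    for q in text_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in speech_keys]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, speech_values))
               for d in range(len(speech_values[0]))]
        fused.append(out)
    return fused

# Toy example: 2 text tokens attending over 3 speech frames (dim 2).
text_q = [[1.0, 0.0], [0.0, 1.0]]
speech_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
speech_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = cross_modal_attention(text_q, speech_k, speech_v)
```

In practice the fused representations would be fed into a language model alongside (or in place of) text embeddings; real systems learn the query, key, and value projections rather than using raw features as here.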

Sources

Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data

Speech Language Models for Under-Represented Languages: Insights from Wolof

Frustratingly Easy Data Augmentation for Low-Resource ASR

Direct Simultaneous Translation Activation for Large Audio-Language Models

VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition

Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning

Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech

Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models

DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment

WolBanking77: Wolof Banking Speech Intent Classification Dataset

Part-of-speech tagging for Nagamese Language using CRF

PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs

Can Audio Large Language Models Verify Speaker Identity?

Eliminating Stability Hallucinations in LLM-Based TTS Models via Attention Guidance

WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

OLaPh: Optimal Language Phonemizer

Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning

Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

DRES: Benchmarking LLMs for Disfluency Removal
