Advancements in Multilingual and Speech-Driven Technologies

The field of speech and language technologies is moving toward more inclusive and personalized solutions. Researchers are building tools that support non-native English speakers in STEM education, such as real-time lexical cues and interactive rhythm training systems. There is also growing work on multilingual vision-language models, with advances in retrieval-augmented generation and concept-aware captioning. Speech-driven 3D facial animation is becoming increasingly sophisticated, with new methods for personalized animation and phonetic context-dependent viseme modeling. In parallel, technologies are being developed to assist low-vision learners, including personalized visual guidance tools.

Noteworthy papers include CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts; MemoryTalker, which synthesizes realistic and accurate 3D facial motion while capturing a speaker's style from audio input alone; and VeasyGuide, which provides low-vision learners with personalized visual guidance on instructor actions in presentation videos.
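To make the retrieval-augmented captioning idea concrete, the sketch below shows how a caption can be conditioned on both retrieved captions and detected image concepts, in the spirit of CONCAP. It is a minimal illustration under stated assumptions, not the paper's actual pipeline: every component name here (`caption_index`, `concept_detector`, `captioner`) is a hypothetical placeholder interface.

```python
# Minimal sketch of retrieval-augmented, concept-aware captioning.
# All components passed in (caption_index, concept_detector, captioner)
# are hypothetical placeholders, not CONCAP's real API.

from typing import List


def retrieval_augmented_caption(
    image,
    caption_index,      # hypothetical: nearest-neighbour index over caption embeddings
    concept_detector,   # hypothetical: tags an image with salient concepts
    captioner,          # hypothetical: multilingual conditional text generator
    target_lang: str = "de",
    k: int = 3,
) -> str:
    """Generate a caption conditioned on retrieved captions and detected concepts."""
    # 1. Retrieve the k captions closest to the image in a shared embedding space.
    retrieved: List[str] = caption_index.nearest(image, k=k)

    # 2. Detect image-specific concepts (objects, attributes) to ground the output.
    concepts: List[str] = concept_detector(image)

    # 3. Condition generation on both signals plus the target language.
    prompt = (
        f"Concepts: {', '.join(concepts)}\n"
        f"Related captions: {' | '.join(retrieved)}\n"
        f"Write a {target_lang} caption for the image:"
    )
    return captioner.generate(prompt, image=image)
```

The design point this illustrates is that retrieval supplies fluent phrasing from similar images while the detected concepts keep the output anchored to what is actually depicted, which is particularly useful when generating captions in lower-resource languages.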

Sources

CueBuddy: helping non-native English speakers navigate English-centric STEM education

RhythmTA: A Visual-Aided Interactive System for ESL Rhythm Training via Dubbing Practice

DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns

CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation

VeasyGuide: Personalized Visual Guidance for Low-vision Learners on Instructor Actions in Presentation Videos
