Speech Translation and Recognition Advancements

Speech processing research is moving towards more efficient and accurate models, particularly for low-resource languages. Researchers are using weakly labeled data and small-scale language models to build end-to-end speech-to-text translation systems. Interest is also growing in simultaneous translation and speech recognition, with a focus on improving output quality while reducing latency. Noteworthy papers include:

  • One paper demonstrates that end-to-end speech translation systems can be built using weakly labeled data, achieving performance comparable to massive multi-modal multilingual baselines.
  • Another paper presents a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder, achieving state-of-the-art results on the IWSLT 2025 Shared Task.
  • A third paper describes a simultaneous speech translation system that uses an offline speech model and a large language model to improve performance and accommodate context.

Sources

End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data

Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning

Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025

Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data
