Speech processing research is moving toward more efficient and accurate models, particularly for low-resource languages. Researchers are exploring weakly labeled data and small-scale language models as a basis for end-to-end speech-to-text translation systems. Interest is also growing in simultaneous translation and speech recognition, with a focus on improving performance while reducing latency. Noteworthy papers include:
- One paper demonstrates that end-to-end speech translation systems can be built from weakly labeled data and still achieve performance comparable to massive multimodal, multilingual baselines.
- Another paper presents a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder, achieving state-of-the-art results on the IWSLT 2025 Shared Task.
- A third paper describes a simultaneous speech translation system that combines an offline speech model with a large language model to improve performance and accommodate context.