Advances in Multimodal Speech Recognition and Document Analysis

The fields of multimodal speech recognition and document analysis are seeing significant advances through the integration of large language models, reinforcement learning, and multimodal fusion techniques. Researchers are exploring novel approaches to improve the accuracy and robustness of speech recognition in challenging conditions such as cocktail-party scenarios. Large-scale datasets, such as SARD for Arabic OCR and MegaHan97K for mega-category Chinese character recognition, are addressing data scarcity and enabling substantial gains in model performance.

Noteworthy papers include QARI-OCR, which achieves state-of-the-art results in Arabic OCR, and MonkeyOCR, which introduces a Structure-Recognition-Relation triplet paradigm for document parsing that outperforms existing models. In addition, UniCUE proposes a unified framework for Chinese Cued Speech Video-to-Speech generation, substantially reducing word error rate and improving lip-speech synchronization.
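To make the multimodal fusion theme above concrete, the following is a minimal, illustrative sketch of audio-visual feature fusion with an alignment objective. It is not the implementation from PAIR-Net or any paper listed below; the module, dimensions, and loss are hypothetical choices assumed for the example.

```python
# Illustrative sketch only: generic audio-visual fusion with a
# cosine-similarity alignment loss. Names and dimensions are hypothetical,
# not taken from PAIR-Net or any cited paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Simple late fusion over the concatenated embeddings.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, audio_feats, visual_feats):
        a = self.audio_proj(audio_feats)    # (batch, time, hidden)
        v = self.visual_proj(visual_feats)  # (batch, time, hidden)
        fused = self.fusion(torch.cat([a, v], dim=-1))
        # Alignment loss: encourage paired audio/visual frames to agree
        # in the shared space (higher cosine similarity -> lower loss).
        align_loss = 1.0 - F.cosine_similarity(a, v, dim=-1).mean()
        return fused, align_loss


# Example usage with dummy features standing in for pretrained encoder outputs.
audio = torch.randn(4, 100, 512)   # 4 clips, 100 frames, 512-dim audio features
visual = torch.randn(4, 100, 512)  # matching visual features
fused, align_loss = AudioVisualFusion()(audio, visual)
```

In practice, the alignment term would be added to the task loss (e.g., speaker detection or recognition) so that the two modalities remain synchronized in the shared embedding space.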
Sources
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation