The field of multimodal captioning is evolving rapidly, with a focus on models that integrate and describe multiple input modalities, including audio, visual, and combined audio-visual data. Recent research emphasizes temporal alignment, fine-grained perception, and detailed description in captioning tasks. New models and pipelines report state-of-the-art results across a range of benchmarks, with notable examples spanning audiovisual video captioning, generalist visual captioning, and omni detailed perception. These advances stand to benefit applications ranging from video understanding and generation to human-AI interaction.
Notable papers: AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks. MetaCaptioner delivers captioning capabilities comparable to commercial models at a lower cost. Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Shot2Tactic-Caption generates shot-level and tactic-level captions for badminton videos, demonstrating effective tactical understanding. MaskCaptioner achieves state-of-the-art dense video object captioning (DVOC) results on three existing benchmarks, jointly segmenting and captioning object trajectories in videos.