The field of multimodal perception is moving toward more sophisticated, fine-grained understanding of multimedia content. Researchers are exploring new architectures and methodologies that integrate and fuse visual, audio, and textual modalities more effectively, improving tasks such as object segmentation, event localization, and moment retrieval. Noteworthy papers include:
- A survey on multimodal referring segmentation, which provides a comprehensive overview of the field and its applications.
- The proposal of a novel Importance-aware Multi-Granularity fusion model for video moment retrieval, which achieves state-of-the-art results by selectively aggregating audio-vision-text contexts (see the fusion sketch after this list).
- The introduction of the Learned User Significance Tracker (LUST) framework, which uses a hierarchical, two-stage relevance scoring mechanism built on Large Language Models (LLMs) to analyze video content and quantify thematic relevance (see the scoring sketch after this list).
- The development of the Think-Ground-Segment (TGS) Agent, which decomposes referring audio-visual segmentation into sequential thinking, grounding, and segmentation stages that mimic human reasoning.
- The proposal of the Cross-modal Salient Anchor-based Semantic Propagation (CLASP) method for weakly-supervised dense audio-visual event localization, which exploits cross-modal salient anchors to enhance event semantic encoding.
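To make the "selectively aggregating audio-vision-text contexts" idea concrete, here is a minimal PyTorch sketch of importance-weighted fusion: a small gating head scores each modality and the fused representation down-weights modalities judged less relevant. All module and variable names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ImportanceAwareFusion(nn.Module):
    """Fuse audio, vision, and text features with learned importance weights."""

    def __init__(self, dim: int):
        super().__init__()
        # One scalar importance score per modality, predicted from its features.
        self.scorers = nn.ModuleDict(
            {m: nn.Linear(dim, 1) for m in ("audio", "vision", "text")}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        # feats[m] has shape (batch, dim) for each modality m.
        modalities = ("audio", "vision", "text")
        scores = torch.cat(
            [self.scorers[m](feats[m]) for m in modalities], dim=-1
        )                                 # (batch, 3)
        weights = scores.softmax(dim=-1)  # normalized importance per modality
        stacked = torch.stack([feats[m] for m in modalities], dim=1)  # (batch, 3, dim)
        # Weighted sum: less relevant modalities contribute less to the fused context.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)

# Usage:
#   fuse = ImportanceAwareFusion(dim=256)
#   fused = fuse({"audio": a, "vision": v, "text": t})  # each tensor is (B, 256)
```

The actual model also operates at multiple temporal granularities; this sketch only illustrates the importance-gated aggregation step.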
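For the LUST-style hierarchical scoring, the sketch below shows one plausible two-stage pipeline: each video segment's caption is first scored against the user's theme in isolation, then re-scored with its temporal neighborhood. The `query_llm` callable and the averaging rule are assumptions, not details from the paper.

```python
from typing import Callable, List

def score_segments(
    captions: List[str],
    theme: str,
    query_llm: Callable[[str], float],  # stand-in LLM client returning a score in [0, 1]
    window: int = 2,
) -> List[float]:
    # Stage 1: context-free relevance of each segment caption to the theme.
    stage1 = [
        query_llm(f"Rate the relevance of '{c}' to the theme '{theme}'.")
        for c in captions
    ]
    # Stage 2: re-score each segment given its temporal neighborhood, so that
    # thematically consistent stretches of video reinforce one another.
    refined_scores = []
    for i, c in enumerate(captions):
        context = " ".join(captions[max(0, i - window): i + window + 1])
        refined = query_llm(
            f"Given the surrounding context '{context}', rate the relevance of "
            f"'{c}' to the theme '{theme}'."
        )
        # Simple average of the two stages; the paper's combination rule may differ.
        refined_scores.append(0.5 * (stage1[i] + refined))
    return refined_scores
```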