Multimodal Narrative Understanding and Generation

Multimodal narrative understanding and generation is advancing rapidly, with new methods for analyzing and generating multimodal content such as comics, documents, and images. Recent work highlights the potential of multimodal large language models (MLLMs) to extend information retrieval and generation beyond purely textual inputs, while new datasets and frameworks, such as scene-level narrative-arc annotations and retrieval-augmented generation (RAG), have improved the state of the art in multimodal narrative understanding. Noteworthy papers include:

  • ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books, which provides a resource for advancing computational methods in multimodal narrative understanding.
  • PREMIR, a framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions before retrieval, achieving state-of-the-art performance on out-of-distribution benchmarks (first sketch below).
  • MMCIG, a cover image generation task that produces both a concise summary and a visually corresponding image from a text-only document, using a multimodal pseudo-labeling method to construct high-quality datasets at low cost.
  • Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations via cosine-similarity rejection while enforcing lexical novelty and cross-lingual fidelity, improving lexical diversity and reducing redundancy in multilingual riddle generation (second sketch below).
  • ChronoRAG, a RAG framework specialized for narrative texts that refines dispersed document information into coherent, structured passages and preserves narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages (third sketch below).
  • A study on enhancing Document Visual Question Answering (Document VQA) via retrieval-augmented generation, which systematically evaluates several retrieval variants and shows that careful evidence selection consistently boosts accuracy across model sizes and multi-page benchmarks.
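
The pre-question idea behind PREMIR can be sketched as a two-step pipeline: offline, an MLLM writes short questions each document could answer across its modalities; online, the user query is matched against those pre-questions rather than against raw document content. The callables `mllm_generate` and `similarity` and the max-pooling scoring below are illustrative assumptions, not PREMIR's actual interface.

```python
def build_preq_index(documents, mllm_generate):
    """Offline step: for each document (which may mix text, images, and
    tables), ask an MLLM to write short questions the document could
    answer, covering each modality. `mllm_generate` (doc -> list[str])
    is a hypothetical stand-in for the real MLLM interface."""
    return {doc_id: mllm_generate(doc) for doc_id, doc in documents.items()}

def retrieve_by_preq(query, preq_index, similarity, top_k=5):
    """Online step: score each document by the best match between the
    user query and any of its pre-questions, then return the top-k doc
    ids. `similarity` (str, str -> float) is a placeholder matcher."""
    scores = {
        doc_id: max(similarity(query, q) for q in preqs)
        for doc_id, preqs in preq_index.items() if preqs
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```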
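
AOF's core filtering step can be sketched as cosine-similarity rejection combined with a lexical-novelty check: a generation is kept only if it is dissimilar enough from everything already accepted and contributes new vocabulary. The `embed` function, the 0.85 threshold, and the token criterion are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def originality_filter(candidates, embed, sim_threshold=0.85, min_new_tokens=2):
    """Accept a candidate only if (a) its embedding stays below
    `sim_threshold` cosine similarity to every accepted generation and
    (b) it contributes at least `min_new_tokens` unseen tokens.
    `embed` (str -> np.ndarray) and both thresholds are illustrative."""
    accepted, vectors, seen = [], [], set()
    for text in candidates:
        vec = embed(text)
        redundant = any(cosine_sim(vec, v) >= sim_threshold for v in vectors)
        novel_tokens = set(text.lower().split()) - seen
        if not redundant and len(novel_tokens) >= min_new_tokens:
            accepted.append(text)
            vectors.append(vec)
            seen |= novel_tokens
    return accepted
```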
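
ChronoRAG's chronological assembly can likewise be sketched as a two-stage sort: select passages by retrieval relevance, then reorder the selection by original document position so the assembled context preserves temporal order. The `Passage` schema and `top_k` default here are hypothetical stand-ins for whatever metadata the actual framework tracks.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    doc_id: str
    position: int  # passage's original position within its source document
    score: float   # retrieval relevance score

def assemble_chronologically(retrieved: list[Passage], top_k: int = 5) -> str:
    """Pick the top-k passages by relevance, then re-sort them by source
    document and original position so the concatenated context follows
    the narrative's temporal order rather than the retrieval ranking."""
    top = sorted(retrieved, key=lambda p: p.score, reverse=True)[:top_k]
    ordered = sorted(top, key=lambda p: (p.doc_id, p.position))
    return "\n\n".join(p.text for p in ordered)
```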

Sources

ComicScene154: A Scene Dataset for Comic Analysis

Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

Chronological Passage Assembling in RAG framework for Temporal Question Answering

Enhancing Document VQA Models via Retrieval-Augmented Generation
