Multimodal Narrative Understanding and Generation

Multimodal narrative understanding and generation is advancing rapidly, with new methods for analyzing and generating multimodal content such as comics, documents, and images. Recent work highlights the potential of multimodal large language models (MLLMs) to extend information retrieval and generation beyond purely textual inputs, while new datasets and frameworks, such as scene-level narrative-arc annotations and retrieval-augmented generation (RAG), have improved the state of the art in multimodal narrative understanding. Noteworthy papers include:

  • ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books, which provides a resource for advancing computational methods in multimodal narrative understanding.
  • PREMIR, a framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions before retrieval, achieving state-of-the-art performance on out-of-distribution benchmarks (first sketch below).
  • MMCIG, a cover image generation task that produces both a concise summary and a visually corresponding image from a text-only document, using a multimodal pseudo-labeling method to construct high-quality datasets at low cost.
  • Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations via cosine-similarity rejection while enforcing lexical novelty and cross-lingual fidelity, improving lexical diversity and reducing redundancy in multilingual riddle generation (second sketch below).
  • ChronoRAG, a RAG framework specialized for narrative texts that refines dispersed document information into coherent, structured passages and preserves narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages (third sketch below).
  • A study on enhancing Document Visual Question Answering (Document VQA) via retrieval-augmented generation, which systematically evaluates several retrieval variants and shows that careful evidence selection consistently boosts accuracy across model sizes and multi-page benchmarks.
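
The pre-question idea behind PREMIR can be sketched as a two-step pipeline: offline, an MLLM writes short questions each document could answer across its modalities; online, the user query is matched against those pre-questions rather than against raw document content. The callables `mllm_generate` and `similarity` and the max-pooling scoring below are illustrative assumptions, not PREMIR's actual interface.

```python
def build_preq_index(documents, mllm_generate):
    """Offline step: for each document (which may mix text, images, and
    tables), ask an MLLM to write short questions the document could
    answer, covering each modality. `mllm_generate` (doc -> list[str])
    is a hypothetical stand-in for the real MLLM interface."""
    return {doc_id: mllm_generate(doc) for doc_id, doc in documents.items()}

def retrieve_by_preq(query, preq_index, similarity, top_k=5):
    """Online step: score each document by the best match between the
    user query and any of its pre-questions, then return the top-k doc
    ids. `similarity` (str, str -> float) is a placeholder matcher."""
    scores = {
        doc_id: max(similarity(query, q) for q in preqs)
        for doc_id, preqs in preq_index.items() if preqs
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```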
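
AOF's core filtering step can be sketched as cosine-similarity rejection combined with a lexical-novelty check: a generation is kept only if it is dissimilar enough from everything already accepted and contributes new vocabulary. The `embed` function, the 0.85 threshold, and the token criterion are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def originality_filter(candidates, embed, sim_threshold=0.85, min_new_tokens=2):
    """Accept a candidate only if (a) its embedding stays below
    `sim_threshold` cosine similarity to every accepted generation and
    (b) it contributes at least `min_new_tokens` unseen tokens.
    `embed` (str -> np.ndarray) and both thresholds are illustrative."""
    accepted, vectors, seen = [], [], set()
    for text in candidates:
        vec = embed(text)
        redundant = any(cosine_sim(vec, v) >= sim_threshold for v in vectors)
        novel_tokens = set(text.lower().split()) - seen
        if not redundant and len(novel_tokens) >= min_new_tokens:
            accepted.append(text)
            vectors.append(vec)
            seen |= novel_tokens
    return accepted
```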
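
ChronoRAG's chronological assembly can likewise be sketched as a two-stage sort: select passages by retrieval relevance, then reorder the selection by original document position so the assembled context preserves temporal order. The `Passage` schema and `top_k` default here are hypothetical stand-ins for whatever metadata the actual framework tracks.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    doc_id: str
    position: int  # passage's original position within its source document
    score: float   # retrieval relevance score

def assemble_chronologically(retrieved: list[Passage], top_k: int = 5) -> str:
    """Pick the top-k passages by relevance, then re-sort them by source
    document and original position so the concatenated context follows
    the narrative's temporal order rather than the retrieval ranking."""
    top = sorted(retrieved, key=lambda p: p.score, reverse=True)[:top_k]
    ordered = sorted(top, key=lambda p: (p.doc_id, p.position))
    return "\n\n".join(p.text for p in ordered)
```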

Sources

ComicScene154: A Scene Dataset for Comic Analysis

Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

Chronological Passage Assembling in RAG framework for Temporal Question Answering

Enhancing Document VQA Models via Retrieval-Augmented Generation
