Visual reasoning research is moving toward digital twin representations as a basis for more effective and unified solutions. In this approach, complex multi-modal visual inputs are converted into high-level representations that large language models can then reason over. Combined with reinforcement learning, digital twin representations are showing promising results, including improvements over state-of-the-art task-specific models. Notably, the approach handles implicit queries and supports reasoning over long-horizon video content without visual token compression.
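As a rough illustration of the general idea, and not the method of any paper listed here, a digital twin can be pictured as structured per-frame records that are serialized to text and placed in a language model prompt. Every name in this sketch (`ObjectState`, `FrameRecord`, `serialize_twin`, `build_prompt`) and the record schema itself are assumptions made for illustration; the papers' actual representations may differ substantially.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectState:
    """Hypothetical record for one object in one frame."""
    object_id: str
    category: str
    bbox: Tuple[float, float, float, float]  # assumed (x, y, w, h), normalized
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class FrameRecord:
    """Hypothetical record for one sampled frame."""
    timestamp_s: float
    objects: List[ObjectState]


def serialize_twin(frames: List[FrameRecord]) -> str:
    """Render each frame as one JSON line so the scene becomes plain text."""
    lines = []
    for frame in frames:
        payload = {
            "t": frame.timestamp_s,
            "objects": [
                {
                    "id": obj.object_id,
                    "cat": obj.category,
                    "bbox": list(obj.bbox),
                    "attrs": obj.attributes,
                }
                for obj in frame.objects
            ],
        }
        lines.append(json.dumps(payload, sort_keys=True))
    return "\n".join(lines)


def build_prompt(frames: List[FrameRecord], query: str) -> str:
    """Combine the serialized records with a natural-language query."""
    return (
        "Scene records (one JSON object per frame):\n"
        f"{serialize_twin(frames)}\n\n"
        f"Question: {query}\n"
        "Answer using only the records above."
    )


if __name__ == "__main__":
    twin = [
        FrameRecord(0.0, [ObjectState("obj1", "person", (0.1, 0.2, 0.3, 0.4),
                                      {"action": "walking"})]),
        FrameRecord(1.0, []),
    ]
    print(build_prompt(twin, "Who is moving?"))
```

Because the model in this sketch sees text records rather than image tokens, the prompt grows with the number of tracked objects and sampled frames rather than with pixel count, which is one plausible reading of why long videos can be handled without visual token compression.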
Particularly noteworthy papers in this area include the following. Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning proposes a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs. Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models achieves state-of-the-art results on three conventional benchmarks and outperforms the strongest baseline by more than 50 percentage points on a newly constructed benchmark. Fast Reasoning Segmentation for Images and Videos proposes a distillation scheme that re-frames the problem to make distillation more effective, achieving state-of-the-art reasoning segmentation performance while remaining efficient enough for deployment in resource-constrained environments. Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations introduces the new task of reasoning video editing and proposes a model that decouples reasoning from generation through digital twin representations of video content.