Advances in Composed Image Retrieval

The field of Composed Image Retrieval (CIR), in which a query combines a reference image with a textual modification describing the desired target, is advancing rapidly. Recent work centers on two challenges: the scarcity of annotated training triplets and the need to capture fine-grained modification semantics. Researchers are exploring generative models, prediction-based mapping networks, and fine-grained textual inversion networks to improve retrieval performance. There is also growing emphasis on building robust data annotation pipelines and on leveraging large language models to generate high-quality training data. Together, these advances stand to improve both the precision and the recall of CIR systems. Noteworthy papers include:

  • Generative Compositor, which proposes a novel generative model for few-shot visual information extraction, achieving highly competitive results in full-sample training and outperforming baselines in few-shot settings.
  • FineCIR, which introduces a robust fine-grained CIR data annotation pipeline and a framework that explicitly parses modification text, consistently outperforming state-of-the-art CIR baselines on fine-grained and traditional CIR benchmark datasets.
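The composed-query setup described above can be illustrated with a minimal late-fusion sketch: the reference-image embedding and the modification-text embedding are combined into a single query vector, which is then matched against a gallery by cosine similarity. This is an illustrative baseline, not the method of any paper listed here; `compose_query` and `retrieve` are hypothetical names, and producing the embeddings themselves (e.g., with a CLIP-style encoder) is assumed to happen elsewhere.

```python
import numpy as np

def compose_query(img_emb, txt_emb, alpha=0.5):
    """Late-fusion composition: a weighted sum of the reference-image
    and modification-text embeddings, L2-normalised so that a dot
    product with unit gallery vectors equals cosine similarity."""
    q = alpha * img_emb + (1 - alpha) * txt_emb
    return q / np.linalg.norm(q)

def retrieve(query, gallery, k=3):
    """Rank gallery embeddings (one row per image, assumed
    L2-normalised) by cosine similarity to the composed query and
    return the indices of the top-k matches."""
    sims = gallery @ query          # cosine similarities
    return np.argsort(-sims)[:k]    # highest similarity first
```

More sophisticated systems replace the fixed weighted sum with a learned mapping network or textual inversion, but the retrieval step, ranking a gallery against one composed query vector, has the same shape.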

Sources

Generative Compositor for Few-Shot Visual Information Extraction

Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval

Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

CoLLM: A Large Language Model for Composed Image Retrieval

FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

Unicorn: Text-Only Data Synthesis for Vision Language Model Training
