Advances in Composed Image Retrieval and Text-to-Image Generation

The field of composed image retrieval and text-to-image generation is evolving rapidly, with a focus on making models more flexible and accurate. Researchers are addressing the limitations of current approaches by, for example, using large language models to parse user instructions and decide which task to execute. Other notable trends are scalable pipelines for automatically generating training triplets and the construction of large-scale fashion datasets. These advances stand to improve model performance in real-world settings such as e-commerce. Noteworthy papers include OFFSET, which proposes a focus-mapping-based feature extractor to reduce the impact of noise interference, and TalkFashion, which introduces an intelligent virtual try-on assistant built on multimodal large language models. Additionally, Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval and FACap, a large-scale fashion dataset, make significant contributions to the field.
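To make the task concrete: in composed image retrieval, a query consists of a reference image plus a modification text, and the system ranks candidate images by how well they match the combined intent. The sketch below is a minimal, hypothetical illustration using toy embedding vectors and simple additive fusion; real systems use learned vision-language encoders and trained fusion modules, and all names here are invented for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length (avoid division by zero)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse(image_vec, text_vec):
    # Element-wise addition is a simple fusion baseline; learned
    # fusion networks would replace this in a real system.
    return normalize([i + t for i, t in zip(image_vec, text_vec)])

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def rank_targets(reference_img, modification_text, candidates):
    """Rank (name, embedding) candidates against the fused query."""
    query = fuse(reference_img, modification_text)
    return sorted(candidates, key=lambda c: cosine(query, c[1]), reverse=True)

# Toy embeddings; in practice these come from a vision-language encoder.
reference = [1.0, 0.0, 0.0]       # e.g. a red dress
modification = [0.0, 1.0, 0.0]    # e.g. "make it blue"
candidates = [
    ("blue_dress", [0.7, 0.7, 0.0]),
    ("red_shirt",  [0.9, 0.0, 0.3]),
    ("green_hat",  [0.0, 0.1, 0.9]),
]
ranking = rank_targets(reference, modification, candidates)
print(ranking[0][0])  # prints "blue_dress"
```

The triplet structure this code queries over (reference image, modification text, target image) is exactly what the automatic triplet-synthesis pipelines mentioned above aim to generate at scale.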

Sources

OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval

TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model

Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval

MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval

Evaluating Attribute Confusion in Fashion Text-to-Image Generation

FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval
