Advances in Vision-Language Segmentation and Counting

The field of computer vision is moving toward more flexible, generalizable models that can handle open-vocabulary scenarios and unseen categories. Recent work leverages large vision-language models and novel prompting techniques to achieve state-of-the-art performance on tasks such as semantic segmentation, instance segmentation, and object counting.

Noteworthy papers include OpenWorldSAM, which extends SAM2 with language prompts and delivers strong resource efficiency and generalization in open-vocabulary semantic and instance segmentation. QUANet introduces quantity-oriented text prompts and a dual-stream adaptive counting decoder, yielding strong generalization for zero-shot class-agnostic counting. CoPT proposes a Covariance-based Pixel-Text loss that uses domain-agnostic text embeddings for unsupervised domain-adaptive segmentation. Description-free Multi-prompt Learning (DeMul) removes the need to extract class descriptions from large language models, instead distilling their knowledge directly into learned prompts.
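To make the pixel-text alignment idea concrete, here is a minimal sketch of what a covariance-style loss between pixel features and class text embeddings might look like. This is an illustrative assumption, not CoPT's actual formulation: the function name, the decorrelation objective, and the array shapes are all hypothetical.

```python
import numpy as np

def pixel_text_covariance_loss(pixel_feats, text_embeds):
    """Hypothetical sketch (not the paper's implementation):
    compare pixel features against class text embeddings and
    penalize correlated similarity profiles across classes.

    pixel_feats: (N, D) array of per-pixel features
    text_embeds: (K, D) array of class text embeddings
    """
    # L2-normalize both modalities so similarities are cosine-based
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)

    # Cross-modal similarity: each pixel vs. each class prompt
    sim = p @ t.T                      # (N, K)

    # Covariance of the per-class similarity profiles across pixels
    cov = np.cov(sim, rowvar=False)    # (K, K)

    # Penalize off-diagonal covariance mass so that different
    # classes respond to distinct sets of pixels
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2))
```

The appeal of driving such a loss with text embeddings, as CoPT does, is that the class prompts stay fixed across domains, giving the segmentation model a domain-agnostic target to align to.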

Sources

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Text-promptable Object Counting via Quantity Awareness Enhancement

CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation
