The field of computer vision is moving toward more efficient, human-aligned vision systems. Researchers are exploring new approaches to visual learning, such as structure-first pretraining, that induce more compact and generalizable representations. This direction is motivated by the observation that humans readily understand sparse, minimal depictions such as line drawings, suggesting that structure underlies efficient visual understanding. Pretraining on line drawings has been shown to produce models with stronger shape bias, more focused attention, and greater data efficiency. In parallel, there is growing interest in lightweight human-centric vision models that acquire strong generalization from large models through distillation-based pretraining. Researchers are also examining how model complexity and training strategy trade off against alignment with human perception, underscoring the importance of human-like visual understanding in downstream applications. Noteworthy papers in this area include:
- Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision, which proposes using line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations (the first sketch after the list illustrates the general idea).
- Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models, which presents a novel distillation-based pretraining framework for efficiently training lightweight human-centric vision models (the second sketch after the list illustrates generic feature distillation).
- Evolution of Low-Level and Texture Human-CLIP Alignment, which examines how CLIP's alignment with low-level human perception relates to human image quality assessments.
- Do Vision Transformers See Like Humans?, which systematically analyzes the impact of model size, dataset size, data augmentation, and regularization on Vision Transformers' perceptual alignment with human judgments.
- Contrast Sensitivity Function of Multimodal Vision-Language Models, which introduces a novel method to estimate the contrast sensitivity function of multimodal vision-language models and assess their alignment with human perception (the third sketch after the list illustrates one way such probing can be set up).
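
The first sketch below illustrates what "structure-first" pretraining on line-drawing-like inputs could look like in practice. It is a minimal sketch under stated assumptions, not the paper's actual pipeline: the Sobel-based edge extractor is a crude stand-in for proper line drawings, and the tiny CNN and training step are generic placeholders.

```python
# Hypothetical sketch: pretraining on line-drawing-like inputs instead of RGB.
# The edge extractor (Sobel magnitude + threshold) is a stand-in for real line drawings;
# the backbone and training step are generic placeholders, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_line_drawing(rgb: torch.Tensor, thresh: float = 0.25) -> torch.Tensor:
    """Map a batch of RGB images (B,3,H,W in [0,1]) to binary edge maps (B,1,H,W)."""
    gray = rgb.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    mag = mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return (mag > thresh).float()

# Any image backbone works; a tiny CNN keeps the sketch self-contained.
backbone = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000),
)
opt = torch.optim.AdamW(backbone.parameters(), lr=3e-4)

images = torch.rand(8, 3, 224, 224)          # placeholder batch
labels = torch.randint(0, 1000, (8,))        # placeholder labels
logits = backbone(to_line_drawing(images))   # pretrain on structure rather than texture
loss = F.cross_entropy(logits, labels)
loss.backward(); opt.step()
```

Because the input carries only structural information, the model cannot rely on texture shortcuts, which is the intuition behind the reported stronger shape bias.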
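The second sketch shows the basic shape of distillation-based pretraining for a lightweight model: a frozen large teacher supervises a small student through feature matching. This is plain feature distillation under assumed teacher/student architectures, not the specific dynamic pattern alignment objective of the paper.

```python
# Hedged sketch of distillation-based pretraining: a frozen large "teacher" supervises a
# lightweight "student" by feature matching. Architectures and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Conv2d(3, 256, 7, stride=4, padding=3), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())  # stand-in for a large pretrained model
student = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())  # lightweight model being pretrained
proj = nn.Linear(64, 256)                                       # maps student features to teacher space

for p in teacher.parameters():
    p.requires_grad_(False)                                     # teacher stays frozen

opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-3)

images = torch.rand(16, 3, 224, 224)                            # placeholder human-centric images
with torch.no_grad():
    t_feat = teacher(images)
s_feat = proj(student(images))
loss = F.mse_loss(F.normalize(s_feat, dim=-1), F.normalize(t_feat, dim=-1))
loss.backward(); opt.step()
```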
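The third sketch shows one plausible way to probe a vision model for a contrast sensitivity function: present sinusoidal gratings across spatial frequencies and contrasts, and take the lowest contrast at which the model's embedding of the grating departs from that of a uniform gray field. The detection criterion, the embedding-distance readout, and the randomly initialized stand-in encoder are all assumptions; the paper's actual estimation method may differ.

```python
# Hedged sketch of CSF estimation for a vision model. The encoder is a random stand-in
# (in practice one would use e.g. a CLIP image encoder); criterion and readout are assumed.
import math
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())

def grating(freq_cpd: float, contrast: float, size: int = 224, deg: float = 4.0) -> torch.Tensor:
    """Sinusoidal grating of given spatial frequency (cycles/deg) and Michelson contrast."""
    x = torch.linspace(0, deg, size)
    wave = 0.5 + 0.5 * contrast * torch.sin(2 * math.pi * freq_cpd * x)
    img = wave.expand(size, size)                     # bars vary along one axis
    return img.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)  # (1,3,H,W) in [0,1]

gray = grating(1.0, 0.0)   # uniform mid-gray reference
criterion = 0.05           # assumed detection threshold on embedding distance

for freq in [0.5, 1, 2, 4, 8, 16]:
    threshold = None
    for contrast in torch.logspace(-3, 0, 30):        # sweep contrast from 0.001 to 1
        with torch.no_grad():
            d = torch.norm(encoder(grating(freq, contrast.item())) - encoder(gray))
        if d > criterion:
            threshold = contrast.item()
            break
    sensitivity = 1.0 / threshold if threshold else float("nan")
    print(f"{freq:5.1f} cyc/deg -> sensitivity {sensitivity:.1f}")
```

Plotting sensitivity against spatial frequency yields a model CSF that can be compared against the characteristic band-pass shape of the human CSF.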