Advances in Vision Transformers and State Space Models

The field of computer vision is advancing rapidly through the development of Vision Transformers (ViTs) and State Space Models (SSMs). Researchers are exploring new architectures and techniques to improve these models across tasks such as image classification, object detection, and segmentation. One notable direction integrates spatial awareness into attention mechanisms to better model complex relationships between objects in images. There is also growing interest in applying SSMs to point cloud representation learning, with new methods proposed to address the limitations of existing approaches.

Noteworthy papers include Polyline Path Masked Attention for Vision Transformer, which integrates the self-attention mechanism of ViTs with an enhanced structured mask; LBMamba, which introduces a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan; and OTSurv, a multiple instance learning framework for survival prediction on whole slide images (WSIs) that uses heterogeneity-aware optimal transport to model pathological heterogeneity within WSIs.
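To make the "self-attention plus structured mask" idea concrete, the following is a minimal PyTorch sketch of multi-head self-attention with a precomputed additive mask bias over patch tokens. It is a generic illustration under assumed shapes and a hypothetical distance-based bias, not the actual Polyline Path Masked Attention formulation from the paper.

```python
import torch
import torch.nn.functional as F


def masked_self_attention(x, mask_bias, num_heads=4):
    """Self-attention with an additive structured mask bias (illustrative sketch).

    x:         (batch, num_tokens, dim) patch embeddings
    mask_bias: (num_tokens, num_tokens) additive bias encoding spatial structure
    """
    b, n, d = x.shape
    head_dim = d // num_heads

    # Identity projections keep the sketch self-contained; a real block would
    # use learned nn.Linear layers for queries, keys, and values.
    q = k = v = x.reshape(b, n, num_heads, head_dim).transpose(1, 2)

    # Scaled dot-product scores, plus the structured mask bias so spatially
    # related tokens attend to each other more strongly.
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    scores = scores + mask_bias  # broadcasts over batch and heads

    attn = F.softmax(scores, dim=-1)
    out = attn @ v
    return out.transpose(1, 2).reshape(b, n, d)


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)  # 2 images, 16 patch tokens, dim 64
    # Hypothetical structured bias: nearby token indices get a smaller penalty.
    idx = torch.arange(16)
    bias = -0.1 * (idx[None, :] - idx[:, None]).abs().float()
    print(masked_self_attention(tokens, bias).shape)  # torch.Size([2, 16, 64])
```

The design point is that the structured prior enters as an additive bias on the attention logits, so the standard softmax attention machinery is unchanged while spatial relationships reshape where attention mass goes.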

Sources

Polyline Path Masked Attention for Vision Transformer

LBMamba: Locally Bi-directional Mamba

MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval

Soft decision trees for survival analysis

Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning

ToSA: Token Merging with Spatial Awareness

OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
