Advances in Vision Transformers and State Space Models

The field of computer vision is advancing rapidly through the development of Vision Transformers (ViTs) and State Space Models (SSMs). Researchers are exploring new architectures and techniques to improve these models across tasks such as image classification, object detection, and segmentation. One notable direction integrates spatial awareness into attention mechanisms to better model complex relationships between objects in images. There is also growing interest in applying SSMs to point cloud representation learning, with new methods proposed to address the limitations of existing approaches.

Noteworthy papers include Polyline Path Masked Attention for Vision Transformer, which integrates the self-attention mechanism of ViTs with an enhanced structured mask; LBMamba, which introduces a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan; and OTSurv, a multiple instance learning framework for survival prediction on whole slide images (WSIs) that uses heterogeneity-aware optimal transport to model pathological heterogeneity within WSIs.
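To make the "self-attention plus structured mask" idea concrete, the following is a minimal PyTorch sketch of multi-head self-attention with a precomputed additive mask bias over patch tokens. It is a generic illustration under assumed shapes and a hypothetical distance-based bias, not the actual Polyline Path Masked Attention formulation from the paper.

```python
import torch
import torch.nn.functional as F


def masked_self_attention(x, mask_bias, num_heads=4):
    """Self-attention with an additive structured mask bias (illustrative sketch).

    x:         (batch, num_tokens, dim) patch embeddings
    mask_bias: (num_tokens, num_tokens) additive bias encoding spatial structure
    """
    b, n, d = x.shape
    head_dim = d // num_heads

    # Identity projections keep the sketch self-contained; a real block would
    # use learned nn.Linear layers for queries, keys, and values.
    q = k = v = x.reshape(b, n, num_heads, head_dim).transpose(1, 2)

    # Scaled dot-product scores, plus the structured mask bias so spatially
    # related tokens attend to each other more strongly.
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    scores = scores + mask_bias  # broadcasts over batch and heads

    attn = F.softmax(scores, dim=-1)
    out = attn @ v
    return out.transpose(1, 2).reshape(b, n, d)


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)  # 2 images, 16 patch tokens, dim 64
    # Hypothetical structured bias: nearby token indices get a smaller penalty.
    idx = torch.arange(16)
    bias = -0.1 * (idx[None, :] - idx[:, None]).abs().float()
    print(masked_self_attention(tokens, bias).shape)  # torch.Size([2, 16, 64])
```

The design point is that the structured prior enters as an additive bias on the attention logits, so the standard softmax attention machinery is unchanged while spatial relationships reshape where attention mass goes.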

Sources

Polyline Path Masked Attention for Vision Transformer

LBMamba: Locally Bi-directional Mamba

MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval

Soft decision trees for survival analysis

Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning

ToSA: Token Merging with Spatial Awareness

OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
