Advances in Vision Transformers and Autoencoders

The field of computer vision is seeing significant developments from the integration of Vision Transformers (ViTs) and autoencoders. Recent studies show that ViTs can be improved by incorporating explicit object modeling and sparse autoencoders, leading to better performance and generalization. Auxiliary losses and multiscale masking have also been explored to further enhance ViT performance. In addition, sparse autoencoders have proven effective for precision unlearning and for extracting steerable features, enabling more efficient and interpretable models; a minimal sketch of this sparse-autoencoder mechanism follows the paper list below. Analyses of ViTs have also revealed that computational redundancy amplifies adversarial transferability, highlighting the need for more robust models. Furthermore, real-time anomaly detection methods based on flexible, sparse latent spaces have been proposed, demonstrating improved performance and applicability in robotic safety systems. Noteworthy papers include:

  • One that proposes an adaptation to the training of ViTs, allowing for explicit modeling of objects during attention computation.
  • Another that introduces Dynamic SAE Guardrails, a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier, achieving superior forget-utility trade-offs.

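The mechanism shared by the unlearning and steering papers above is a sparse autoencoder (SAE) trained on transformer activations, whose individual latent features can then be ablated (a guardrail/unlearning-style intervention) or amplified (a steering-style intervention). The sketch below is a minimal, hypothetical PyTorch version of that idea under assumed settings; the names and hyperparameters (SparseAutoencoder, l1_coeff, 768-dimensional ViT activations, the chosen feature indices) are illustrative and do not reproduce the methods of the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over transformer activations (illustrative sketch)."""

    def __init__(self, d_model: int = 768, d_latent: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        # ReLU keeps latent codes non-negative; combined with the L1 penalty
        # below, most latent features are zero for any given activation.
        z = F.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction term keeps the SAE faithful to the original activations;
    # the L1 term encourages each activation to use only a few latent features.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()


def edit_activations(sae, acts, suppress=(), amplify=None):
    """Ablate or rescale chosen latent features, then decode back to activation space.

    Suppressing features corresponds to the guardrail/unlearning-style use;
    amplifying them corresponds to steering. Feature indices are placeholders.
    """
    amplify = amplify or {}
    _, z = sae(acts)
    z = z.clone()
    for i in suppress:
        z[..., i] = 0.0
    for i, scale in amplify.items():
        z[..., i] = z[..., i] * scale
    return sae.decoder(z)


if __name__ == "__main__":
    # Stand-in for a batch of ViT token activations: (batch, tokens, d_model).
    acts = torch.randn(4, 197, 768)
    sae = SparseAutoencoder()
    x_hat, z = sae(acts)
    loss = sae_loss(acts, x_hat, z)
    edited = edit_activations(sae, acts, suppress=[10, 42], amplify={7: 2.0})
    print(loss.item(), edited.shape)
```

In practice such an SAE would be trained on cached activations from a frozen ViT or CLIP backbone, and the edited activations would be patched back into the model's forward pass; the random tensor above merely stands in for those activations.
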
Sources

Learning Object Focused Attention

SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

Steering CLIP's vision transformer with sparse autoencoders

The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability

A Real-time Anomaly Detection Method for Robots based on a Flexible and Sparse Latent Space
