The field of multimodal models is rapidly evolving, with a growing focus on adversarial robustness. Recent research has highlighted the vulnerabilities of these models to various types of attacks, including vision-language-action models, text-to-video models, and multimodal retrieval-augmented generation models. A key trend in this area is the development of novel attack frameworks and methodologies that can effectively manipulate and disrupt the performance of these models. Notably, researchers are exploring the use of implicit language cues, visual perturbations, and cross-modal misalignment attacks to compromise the security of multimodal models.
Some noteworthy papers in this area include: AttackVLA, which proposes a unified framework for evaluating the vulnerabilities of vision-language-action models and introduces a targeted backdoor attack that can compel a model to execute a specific action sequence. VEIL, which presents a jailbreak framework that leverages implicit language cues to induce text-to-video models to generate semantically unsafe videos. HV-Attack, which proposes a hierarchical visual attack that misaligns and disrupts the inputs of multimodal retrieval-augmented generation models. When Alignment Fails, which introduces a comprehensive study of multimodal adversarial robustness in vision-language-action models under both white-box and black-box settings. The Shawshank Redemption of Embodied AI, which proposes a novel indirect environmental jailbreak attack that can jailbreak embodied AI agents via indirect prompts injected into the environment.