Security Risks and Mitigations in Large Language Models

The field of Large Language Models (LLMs) is rapidly advancing, with a growing focus on security risks and mitigations. Recent research has highlighted the vulnerability of LLMs to indirect prompt injection attacks, which can compromise the core of these agents by manipulating contextual information. To address these risks, researchers are developing novel approaches to detect and defend against such attacks, including the use of behavioral state detection and instruction detection methods. Additionally, the integration of external knowledge sources, such as Retrieval-Augmented Generation (RAG), has introduced new security challenges, including the potential for poisoning attacks and hallucinations. To mitigate these risks, researchers are proposing frameworks for securing RAG systems, including graph-based reranking and risk assessment and mitigation frameworks. Noteworthy papers in this area include AgentXploit, which proposes a generic black-box fuzzing framework to discover and exploit indirect prompt injection vulnerabilities, and Defending against Indirect Prompt Injection by Instruction Detection, which demonstrates a novel approach to detecting potential IPI attacks with high accuracy. POISONCRAFT is also a significant contribution, as it presents a practical poisoning attack on RAG systems that can mislead the model to refer to fraudulent websites. Furthermore, Securing RAG: A Risk Assessment and Mitigation Framework provides a comprehensive overview of the vulnerabilities of RAG pipelines and outlines a framework for guiding the implementation of robust, compliant, secure, and trustworthy RAG systems.

Sources

AgentXploit: End-to-End Redteaming of Black-Box AI Agents

Defending against Indirect Prompt Injection by Instruction Detection

System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models

Practical Reasoning Interruption Attacks on Reasoning Large Language Models

SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion

GRADA: Graph-based Reranker against Adversarial Documents Attack

OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit

Securing RAG: A Risk Assessment and Mitigation Framework

Built with on top of