Advancements in Securing Large Language Models

The field of Large Language Models (LLMs) is rapidly evolving, with a growing focus on addressing the security challenges associated with their deployment. Recent research has highlighted the vulnerability of LLMs to various types of attacks, including prompt injection, jailbreaking, and data poisoning. In response, researchers are developing innovative defense strategies, such as co-evolutionary frameworks, adversarial training, and embedding-level integrity checks. Notably, the use of probing-based approaches for safety detection has been found to be limited, and more robust evaluation frameworks are being proposed to accurately gauge true model alignment. Furthermore, the development of real-time scam detection and conversational scambaiting systems leveraging LLMs and federated learning has shown promising results. Overall, the field is moving towards the development of more secure, reliable, and transparent LLMs. Noteworthy papers include AEGIS, which proposes an automated co-evolutionary framework for guarding prompt injections, and StealthEval, which introduces a probe-rewrite-evaluate workflow for reliable benchmarks and quantifying evaluation awareness.

Sources

AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

The Resurgence of GCG Adversarial Attacks on Large Language Models

StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation

LLMHoney: A Real-Time SSH Honeypot with Large Language Model-Driven Dynamic Response Generation

Poisoned at Scale: A Scalable Audit Uncovers Hidden Scam Endpoints in Production LLMs

A Survey: Towards Privacy and Security in Mobile Large Language Models

Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs

AI-in-the-Loop: Privacy Preserving Real-Time Scam Detection and Conversational Scambaiting by Leveraging LLMs and Federated Learning

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization

Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs

Send to which account? Evaluation of an LLM-based Scambaiting System

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Built with on top of