Vulnerabilities and Defenses in Large Language Models

Research on large language models is placing greater emphasis on security and robustness, with a focus on identifying and mitigating vulnerabilities to adversarial attacks. Recent work has highlighted attacks that manipulate a model's reasoning process, inflating computational cost and degrading performance. In response, researchers are developing defenses that add transparency and interpretability to the model's decision-making. Notable papers in this area include BadThink, which proposes a backdoor attack that triggers overthinking in chain-of-thought reasoning, and ExplainableGuard, which introduces an interpretable adversarial defense framework built on chain-of-thought reasoning. Other papers, such as Output Supervision Can Obfuscate the Chain of Thought and Explainable Transformer-Based Email Phishing Classification with Adversarial Robustness, further examine how training signals affect the transparency of model reasoning and how classifiers can be hardened against adversarial inputs.
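As a rough illustration of how overthinking-style attacks might be monitored in deployment, the sketch below flags chain-of-thought traces whose length is a statistical outlier relative to a benign baseline. It is not taken from BadThink, ExplainableGuard, or any of the other cited papers; the CoTBudgetMonitor class, its z-score threshold, and the whitespace-based token count are all illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class CoTBudgetMonitor:
    """Flags chain-of-thought traces that are anomalously long versus a benign baseline."""
    baseline_lengths: List[int]   # token counts observed on trusted, benign prompts
    z_threshold: float = 3.0      # std-devs above the baseline mean that counts as suspicious

    def __post_init__(self):
        if not self.baseline_lengths:
            raise ValueError("Calibrate with at least one benign chain-of-thought length.")
        self._mu = mean(self.baseline_lengths)
        self._sigma = pstdev(self.baseline_lengths) or 1.0

    @staticmethod
    def _count_tokens(text: str) -> int:
        # Crude proxy: whitespace-delimited tokens. A real deployment would use
        # the serving model's own tokenizer.
        return len(text.split())

    def is_suspicious(self, cot_trace: str) -> bool:
        """Return True if the trace length is a z-score outlier relative to the baseline."""
        z = (self._count_tokens(cot_trace) - self._mu) / self._sigma
        return z > self.z_threshold


# Example usage: calibrate on benign traces, then screen new outputs.
monitor = CoTBudgetMonitor(baseline_lengths=[120, 135, 110, 128, 140])
print(monitor.is_suspicious("Step 1: parse the question. Step 2: answer."))  # short trace -> False
print(monitor.is_suspicious("reconsider the assumption " * 1000))            # bloated trace -> True
```

The z-score cutoff is deliberately simple; any calibrated anomaly detector over trace length or per-request compute would serve the same monitoring role, and it only catches attacks that inflate reasoning volume rather than ones that corrupt reasoning content.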

Sources

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity

Output Supervision Can Obfuscate the Chain of Thought

Explainable Transformer-Based Email Phishing Classification with Adversarial Robustness

Interpretable Ransomware Detection Using Hybrid Large Language Models: A Comparative Analysis of BERT, RoBERTa, and DeBERTa Through LIME and SHAP

ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning
