The field of large language models (LLMs) is evolving rapidly, with a growing focus on safety and security. Recent research has highlighted risks and vulnerabilities in LLMs, including susceptibility to jailbreaking and the potential to generate harmful outputs. The development of more effective attack methods and the discovery of security issues in LLM integrations underscore the need for more robust safeguards. Notably, studies have shown that fine-tuning LLMs even on benign datasets can significantly increase harmful outputs, and that existing mitigation strategies can be ineffective against certain classes of attacks. To address these concerns, researchers are also exploring new approaches to safety alignment, including dedicated resources for improving contextual safety and reducing over-refusals.

Noteworthy papers in this area include: Benign Samples Matter!, which develops a more effective attack by fine-tuning LLMs exclusively on outlier benign samples (sketched below); FalseReject, which introduces a comprehensive resource for improving contextual safety and mitigating over-refusals in LLMs; and LM-Scout, which presents a systematic study of insecure LLM usage in Android apps and a tool for detecting vulnerable integrations.
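To make the outlier-based fine-tuning attack concrete, the minimal sketch below selects the benign samples that sit farthest from the rest of the dataset and would hand only that subset to a standard fine-tuning pipeline. The selection criterion used in Benign Samples Matter! is not specified here; this sketch assumes outliers are identified by cosine distance from the embedding centroid, and the embedding model and function names are illustrative choices, not the paper's method.

```python
# Hypothetical sketch: picking "outlier" benign samples to use as a fine-tuning set.
# Assumption: outliers = samples whose embeddings are farthest (cosine distance)
# from the dataset centroid. The actual criterion in the paper may differ.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend


def select_outlier_samples(texts: list[str], k: int = 100) -> list[str]:
    """Return the k benign samples farthest from the dataset centroid."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)  # (n, d), unit-norm rows
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Cosine distance to the centroid; larger means more of an outlier.
    dist = 1.0 - emb @ centroid
    top = np.argsort(dist)[::-1][:k]
    return [texts[i] for i in top]


# The selected subset would then be fed to an ordinary supervised fine-tuning
# run (not shown); the finding is that such benign-looking data alone can
# degrade the model's safety behavior.
```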