Large Language Model Security and Watermarking

The field of large language model security is moving toward more robust mechanisms for protecting intellectual property and preventing misuse. Recent research has exposed vulnerabilities in existing watermarking schemes and underscored the need for stronger defenses. Character-level perturbations have proven particularly effective at disrupting watermarks, and new fingerprinting frameworks have been proposed to balance the trade-offs among stealthiness, robustness, and generalizability. In addition, studies evaluating the resilience of large language models against adversarial text attacks reveal significant variation in robustness across models. Noteworthy papers include:

  • Character-Level Perturbations Disrupt LLM Watermarks, which demonstrates that character-level perturbations can remove watermarks under realistic constraints; see the illustrative sketch after this list.
  • CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models, which introduces a novel rule-driven fingerprinting framework that achieves stronger stealth and robustness than prior work.
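The intuition behind character-level attacks is that small edits such as homoglyph substitution, case flips, or inserted characters change how text is tokenized, and with it the token statistics that watermark detectors rely on. The snippet below is a minimal, hypothetical Python sketch of this class of perturbation; the perturbation types, rates, and helper names are illustrative assumptions, not the implementation from the paper.

```python
import random

# Hypothetical illustration of character-level perturbations: homoglyph
# substitution, case flips, and character insertion. These edits alter
# tokenization and thus the token statistics a watermark detector depends on.
# Types, rates, and names here are assumptions, not the paper's attack.

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}  # Cyrillic look-alikes

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Apply random character-level edits to roughly `rate` of the characters."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(["homoglyph", "case", "insert"])
        if op == "homoglyph" and ch.lower() in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch.lower()])   # swap in a visually similar character
        elif op == "case":
            out.append(ch.swapcase())            # flip letter case
        else:
            out.append(ch + "-")                 # insert a benign extra character
    return "".join(out)

if __name__ == "__main__":
    watermarked_output = "The quick brown fox jumps over the lazy dog."
    print(perturb(watermarked_output, rate=0.1))
```

In practice such attacks trade the perturbation rate against readability: heavier edits disrupt detection more reliably but degrade the text, which is why the realistic constraints studied in the paper matter.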

Sources

Character-Level Perturbations Disrupt LLM Watermarks

CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor

Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks

Yet Another Watermark for Large Language Models

Watermarking and Anomaly Detection in Machine Learning Models for LORA RF Fingerprinting
