The field of large language models is moving toward more robust and secure mechanisms for protecting intellectual property and preventing misuse. Recent research has highlighted vulnerabilities in existing watermarking schemes and the need for more effective and efficient defenses. Character-level perturbations have proven particularly effective at disrupting watermarks, and new fingerprinting frameworks have been proposed to address the trade-offs among stealth, robustness, and generalizability. In addition, studies evaluating the resilience of large language models against adversarial attacks reveal significant variation in robustness across models and underscore the need for more efficient and effective defensive mechanisms. Noteworthy papers include:
- Character-Level Perturbations Disrupt LLM Watermarks, which demonstrates that small character-level edits can remove watermarks under realistic constraints (an illustrative sketch of this style of perturbation follows the list).
- CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models, which introduces a novel rule-driven fingerprinting framework that achieves stronger stealth and robustness than prior work.
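As a rough illustration of the attack style, not the paper's exact method, the sketch below shows how character-level perturbations such as homoglyph swaps and zero-width insertions can alter a text's tokenization while leaving it visually near-identical, which is the property such attacks exploit to break watermark detection statistics. The `perturb` function, the substitution table, and the perturbation rate are hypothetical choices for this example.

```python
import random

# Hypothetical character-level perturbation sketch: swap a few characters for
# visually similar Unicode look-alikes or insert invisible characters, so the
# tokenization (and thus the watermark signal) changes while the text still
# reads the same to a human.

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes
ZERO_WIDTH_SPACE = "\u200b"


def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Apply homoglyph swaps or zero-width insertions to a small fraction of
    characters, chosen at random with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            if ch in HOMOGLYPHS and rng.random() < 0.5:
                out.append(HOMOGLYPHS[ch])          # swap for a look-alike glyph
            else:
                out.append(ch + ZERO_WIDTH_SPACE)   # insert an invisible character
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    watermarked = "The quick brown fox jumps over the lazy dog."
    print(perturb(watermarked))
```

Even at a low perturbation rate, edits like these can shift token boundaries enough to disturb token-level watermark statistics, which is why character-level attacks are hard to defend against without normalizing the input first.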