Protecting Large Language Models from Misuse

Research on large language models (LLMs) is placing growing emphasis on security and intellectual-property protection. Researchers are developing methods to detect plagiarism, embed watermarks, and establish model ownership, with the goals of preventing misuse of LLMs, such as unauthorized copying or false claims of authorship, and of ensuring the integrity of generated content. Noteworthy papers in this area include Matrix-Driven Instant Review, which accurately reconstructs weight relationships between models and provides rigorous p-value estimation, and SAEMark, which establishes a new paradigm for scalable multi-bit watermarking that works out of the box with closed-source LLMs. EditMF is also notable for its highly imperceptible fingerprint embedding with minimal computational overhead. Additionally, studies on attacks and defenses against LLM fingerprinting show how fingerprinting tools' capabilities can be improved while also providing practical mitigation strategies against fingerprinting attacks, and work on pruning with malicious injection demonstrates a retraining-free backdoor attack on Transformer models, underscoring the threat side of this landscape.
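To make the plagiarism-detection idea concrete, the following is a minimal sketch, not the Matrix-Driven Instant Review algorithm itself, of how a statistical relationship between a suspect model's weights and a reference model's weights might be tested with a permutation-based p-value. All function names, shapes, and parameters here are illustrative assumptions, and real methods operate on far richer weight relationships than this single-matrix test.

```python
# Minimal sketch (not the paper's algorithm): test whether a suspect
# model's weight matrix is suspiciously aligned with a reference
# model's, using a permutation test to estimate a p-value.
import numpy as np

def weight_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Absolute cosine similarity between two flattened weight matrices."""
    a, b = a.ravel(), b.ravel()
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def plagiarism_p_value(suspect: np.ndarray, reference: np.ndarray,
                       n_permutations: int = 10_000,
                       seed: int = 0) -> float:
    """Estimate how often a randomly permuted reference matrix matches
    the suspect as well as the true reference does. A tiny p-value
    suggests the suspect weights were derived from the reference rather
    than trained independently."""
    rng = np.random.default_rng(seed)
    observed = weight_similarity(suspect, reference)
    flat = reference.ravel()
    hits = 0
    for _ in range(n_permutations):
        permuted = rng.permutation(flat)
        if weight_similarity(suspect, permuted) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate conservative (never exactly 0).
    return (hits + 1) / (n_permutations + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    base = rng.normal(size=(64, 64))
    derived = base + 0.05 * rng.normal(size=base.shape)   # lightly fine-tuned copy
    independent = rng.normal(size=base.shape)             # independently trained
    print("derived copy p =", plagiarism_p_value(derived, base, 1_000))
    print("independent  p =", plagiarism_p_value(independent, base, 1_000))
```

In this toy setting, a lightly fine-tuned copy stays almost perfectly aligned with the base weights, so no random permutation matches it and the estimated p-value bottoms out near the smallest reportable value, whereas an independently trained matrix produces a large, unremarkable p-value.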

Sources

Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC

SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

EditMF: Drawing an Invisible Fingerprint for Your Large Language Models

Attacks and Defenses Against LLM Fingerprinting

Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models
