Feb. 8, 2024, 5:10 a.m. | Javier Rando Florian Tram\`er

cs.CR updates on arXiv.org arxiv.org

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": …

adversarial attacker backdoors can cs.ai cs.cl cs.cr cs.lg feedback human jailbreak language language models large prompts threat work

Director of the Air Force Cyber Technical Center of Excellence (CyTCoE)

@ Air Force Institute of Technology | Dayton, OH, USA

Senior Cyber Security Analyst

@ Valley Water | San Jose, CA

Senior Cybersecurity Engineer

@ Hitachi | (STS) Perth - Belmont

Cyber Security Expert (W/M)

@ Worldline | Seclin - 59, Nord, France

Senior CISO

@ Alter Solutions | Madrid, Spain

IT Security Specialist

@ BDO | Eindhoven, Netherlands