all InfoSec news
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
April 9, 2024, 4:11 a.m. | Tim Baumg\"artner, Yang Gao, Dana Alon, Donald Metzler
cs.CR updates on arXiv.org arxiv.org
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LM) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both the Supervised Fine-Tuning and Reward Model training, and therefore publicly available datasets are commonly used. In this work, we study to what extent a malicious actor can manipulate the LMs generations by poisoning the preferences, i.e., injecting poisonous …
arxiv cs.ai cs.cl cs.cr cs.lg data feedback fine-tuning human human values language language models large model training popular reward training training data venom
More from arxiv.org / cs.CR updates on arXiv.org
Jobs in InfoSec / Cybersecurity
CyberSOC Technical Lead
@ Integrity360 | Sandyford, Dublin, Ireland
Cyber Security Strategy Consultant
@ Capco | New York City
Cyber Security Senior Consultant
@ Capco | Chicago, IL
Sr. Product Manager
@ MixMode | Remote, US
Corporate Intern - Information Security (Year Round)
@ Associated Bank | US WI Remote
Senior Offensive Security Engineer
@ CoStar Group | US-DC Washington, DC