Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
April 9, 2024, 4:11 a.m. | Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler
cs.CR updates on arXiv.org arxiv.org
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LMs) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both Supervised Fine-Tuning and Reward Model training, and therefore publicly available datasets are commonly used. In this work, we study to what extent a malicious actor can manipulate the LM's generations by poisoning the preferences, i.e., injecting poisonous …