April 9, 2024, 4:11 a.m. | Tim Baumg\"artner, Yang Gao, Dana Alon, Donald Metzler

cs.CR updates on arXiv.org arxiv.org

arXiv:2404.05530v1 Announce Type: cross
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LM) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both the Supervised Fine-Tuning and Reward Model training, and therefore publicly available datasets are commonly used. In this work, we study to what extent a malicious actor can manipulate the LMs generations by poisoning the preferences, i.e., injecting poisonous …

arxiv cs.ai cs.cl cs.cr cs.lg data feedback fine-tuning human human values language language models large model training popular reward training training data venom

Security Engineer

@ Celonis | Munich, Germany

Security Engineer, Cloud Threat Intelligence

@ Google | Reston, VA, USA; Kirkland, WA, USA

IT Security Analyst*

@ EDAG Group | Fulda, Hessen, DE, 36037

Scrum Master/ Agile Project Manager for Information Security (Temporary)

@ Guidehouse | Lagunilla de Heredia

Waste Incident Responder (Tanker Driver)

@ Severn Trent | Derby , England, GB

Risk Vulnerability Analyst w/Clearance - Colorado

@ Rothe | Colorado Springs, CO, United States