Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data | allinfosecnews.com

April 9, 2024, 4:11 a.m. | Tim Baumg\"artner, Yang Gao, Dana Alon, Donald Metzler

cs.CR updates on arXiv.org arxiv.org

arXiv:2404.05530v1 Announce Type: cross
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LM) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both the Supervised Fine-Tuning and Reward Model training, and therefore publicly available datasets are commonly used. In this work, we study to what extent a malicious actor can manipulate the LMs generations by poisoning the preferences, i.e., injecting poisonous …

arxiv cs.ai cs.cl cs.cr cs.lg data feedback fine-tuning human human values language language models large model training popular reward training training data venom

More from arxiv.org / cs.CR updates on arXiv.org

Quantum $X$-Secure $B$-Byzantine $T$-Colluding Private Information Retrieval 20 hours ago | arxiv.org

arxiv capabilities context cs.cr +13

Approximation of Pufferfish Privacy for Gaussian Priors 20 hours ago | arxiv.org

adversary arxiv cs.cr cs.it +9

Polynomial XL: A Variant of the XL Algorithm Using Macaulay Matrices over Polynomial Rings 20 hours ago | arxiv.org

algorithm arxiv computer computer science +11

Families of sequences with good family complexity and cross-correlation measure 20 hours ago | arxiv.org

alphabet arxiv binary complexity +12

Terrapin Attack: Breaking SSH Channel Integrity By Sequence Number Manipulation 20 hours ago | arxiv.org

access arxiv attack breaking +26

Seeing Is Not Always Believing: Invisible Collision Attack and Defence on Pre-Trained Models 20 hours ago | arxiv.org

arxiv attack bert big +14

Proceedings of the 2nd International Workshop on Adaptive Cyber Defense 20 hours ago | arxiv.org

applications artificial artificial intelligence arxiv +16

Sharpness-Aware Data Poisoning Attack 20 hours ago | arxiv.org

aim arxiv attack attacks +18

Fant\^omas: Understanding Face Anonymization Reversibility 20 hours ago | arxiv.org

anonymization arxiv can claims +14

Security Engineer

@ Celonis | Munich, Germany

View on infosec-jobs.com

Security Engineer, Cloud Threat Intelligence

@ Google | Reston, VA, USA; Kirkland, WA, USA

View on infosec-jobs.com

IT Security Analyst*

@ EDAG Group | Fulda, Hessen, DE, 36037

View on infosec-jobs.com

Scrum Master/ Agile Project Manager for Information Security (Temporary)

@ Guidehouse | Lagunilla de Heredia

View on infosec-jobs.com

Waste Incident Responder (Tanker Driver)

@ Severn Trent | Derby , England, GB

View on infosec-jobs.com

Risk Vulnerability Analyst w/Clearance - Colorado

@ Rothe | Colorado Springs, CO, United States

View on infosec-jobs.com