April 2, 2024, 7:12 p.m. | Luxi He, Mengzhou Xia, Peter Henderson

cs.CR updates on arXiv.org

arXiv:2404.01099v1 Announce Type: cross
Abstract: Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Furthermore, we propose a bi-directional anchoring method that …

Tags: alignment, cs.AI, cs.CL, cs.CR, cs.LG, data, fine-tuning, jailbreaking, language models, LLMs, safety
