April 2, 2024, 7:12 p.m. | Luxi He, Mengzhou Xia, Peter Henderson

cs.CR updates on arXiv.org

arXiv:2404.01099v1 Announce Type: cross
Abstract: Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Furthermore, we propose a bi-directional anchoring method that …

Tags: alignment, cs.AI, cs.CL, cs.CR, cs.LG, data, fine-tuning, jailbreaking, language models, LLMs, safety
