June 11, 2024, 4:12 a.m. | Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson

cs.CR updates on arXiv.org

arXiv:2406.05946v1 Announce Type: new
Abstract: The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to …

