Safety Alignment Should Be Made More Than Just a Few Tokens Deep
June 11, 2024, 4:12 a.m. | Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
cs.CR updates on arXiv.org arxiv.org
Abstract: The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to …
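The "shallow alignment" failure mode the abstract describes can be sketched with a toy simulation (not the paper's code; all names are hypothetical): a model whose safety check effectively fires only when choosing the very first response tokens can be bypassed by prefilling a compliant opening.

```python
# Toy sketch of "shallow safety alignment": the refusal decision is made
# only at the first output token, so prefilling a compliant opening
# bypasses it. Hypothetical names; not the paper's implementation.

REFUSAL = "I cannot help with that."

def shallow_aligned_generate(harmful_prompt: bool, prefill: str = "") -> str:
    """Simulate a model whose alignment only steers the initial tokens.

    If the response already begins with attacker-supplied tokens, the
    safety check never runs and the underlying behavior resumes.
    """
    if harmful_prompt and prefill == "":
        # Alignment shifted the distribution over the *first* token only.
        return REFUSAL
    # Past the opening tokens, generation continues unguarded.
    return prefill + " [continues with the requested content]"

# Without a prefill the model refuses; with one, the refusal is skipped.
print(shallow_aligned_generate(True))
print(shallow_aligned_generate(True, prefill="Sure,"))
```

The point of the sketch is the asymmetry: the same harmful request yields a refusal or compliance depending only on whether the first few response tokens are already fixed, which is exactly the shortcut the authors argue deeper alignment should close.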