June 11, 2024, 4:12 a.m. | Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson

cs.CR updates on arXiv.org

arXiv:2406.05946v1 Announce Type: new
Abstract: The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to …

