Dec. 15, 2023, 2:25 a.m. | Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun

cs.CR updates on arXiv.org

Safety alignment of Large Language Models (LLMs) can be compromised by
manual jailbreak attacks and (automatic) adversarial attacks. Recent studies
suggest that defending against these attacks is possible: adversarial attacks
generate unlimited but unreadable gibberish prompts, which are detectable by
perplexity-based filters; manual jailbreak attacks craft readable prompts, but
their limited number, owing to the need for human creativity, allows for easy
blocking. In this paper, we show that these solutions may be too optimistic. We
introduce AutoDAN, an interpretable, gradient-based adversarial …
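The perplexity-based filtering mentioned above can be illustrated with a minimal sketch (not the paper's implementation): score an incoming prompt with a small causal language model and reject it when its perplexity exceeds a threshold. The scorer model (`gpt2`) and the threshold value are illustrative assumptions; in practice the threshold would be tuned on benign prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # assumption: any causal LM can serve as the scorer
PPL_THRESHOLD = 500.0      # assumption: tuned on a set of benign prompts in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Return exp(mean token cross-entropy) of the prompt under the scorer LM."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean negative log-likelihood.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose high perplexity suggests unreadable, gibberish text."""
    return prompt_perplexity(prompt) > PPL_THRESHOLD
```

Such a filter catches gibberish adversarial suffixes precisely because they are far less likely under a language model than fluent text, which is why a readable, gradient-based attack like AutoDAN is designed to evade it.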

Tags: adversarial attacks, alignment, automatic attacks, defending, jailbreak, large language models (LLMs), prompts, safety
