Dec. 15, 2023, 2:25 a.m. | Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun

cs.CR updates on arXiv.org

Safety alignment of Large Language Models (LLMs) can be compromised by manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, which perplexity-based filters can detect; manual jailbreak attacks craft readable prompts, but because they depend on human creativity, they are limited in number and easy to block. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial …
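To make the perplexity-based defense mentioned in the abstract concrete, here is a minimal sketch of such a filter. It is not the filter from any specific prior work; the GPT-2 scoring model and the threshold value are assumptions chosen purely for illustration.

```python
# Minimal sketch of a perplexity-based prompt filter (illustrative only).
# Assumptions: GPT-2 as the scoring LM and a hand-picked threshold; the
# abstract does not specify either.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


@torch.no_grad()
def perplexity(prompt: str) -> float:
    """Return the scoring LM's perplexity of `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Cross-entropy of next-token predictions; exponentiate to get perplexity.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()


def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds a (hypothetical) threshold."""
    return perplexity(prompt) > threshold
```

Unreadable gibberish suffixes produced by earlier automatic attacks score far above natural text under such a filter, whereas readable, low-perplexity prompts of the kind AutoDAN aims to generate would pass it, which is the gap the paper highlights.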

