AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. (arXiv:2310.15140v2 [cs.CR] UPDATED)
cs.CR updates on arXiv.org
Safety alignment of Large Language Models (LLMs) can be compromised with
manual jailbreak attacks and (automatic) adversarial attacks. Recent studies
suggest that defending against these attacks is possible: adversarial attacks
generate unlimited but unreadable gibberish prompts, detectable by
perplexity-based filters; manual jailbreak attacks craft readable prompts, but
because they depend on human creativity, their number is limited and they are
easy to block. In this paper, we show that these solutions may be too optimistic. We
introduce AutoDAN, an interpretable, gradient-based adversarial …
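
The perplexity-based filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's method or any specific published filter: a real filter would score prompts with a large language model, whereas this toy version swaps in a character-bigram model with add-one smoothing. The reference text, the function names (`train_bigram_model`, `perplexity`, `is_flagged`), and the threshold parameter are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical reference text: a tiny stand-in for the corpus a real language
# model would be trained on. It is an assumption made for this sketch only.
REFERENCE_TEXT = (
    "please explain how large language models are trained and aligned. "
    "write a short story about a friendly robot that helps people. "
    "what are the best ways to defend a system against attacks? "
    "tell me how to write readable and safe prompts for a model. "
)


def train_bigram_model(text):
    """Count character bigrams and the characters that start them."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    # One extra bucket so characters never seen in training get some mass.
    vocab_size = len(set(text)) + 1
    return pair_counts, first_counts, vocab_size


def perplexity(prompt, model):
    """Per-character perplexity of `prompt` under the smoothed bigram model."""
    pair_counts, first_counts, vocab_size = model
    chars = prompt.lower()
    if len(chars) < 2:
        raise ValueError("prompt must contain at least two characters")
    log_prob = 0.0
    for a, b in zip(chars, chars[1:]):
        # Add-one (Laplace) smoothing: unseen bigrams get a small probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


def is_flagged(prompt, model, threshold):
    """Flag a prompt whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(prompt, model) > threshold


MODEL = train_bigram_model(REFERENCE_TEXT)
readable = "please explain how large language models are trained"
gibberish = "xq!!zv##kj@@wq~~"

print("readable :", round(perplexity(readable, MODEL), 2))
print("gibberish:", round(perplexity(gibberish, MODEL), 2))
```

In this toy setup the gibberish string scores a higher perplexity than the readable one, which is the signal such a filter relies on. By the same logic, a filter of this kind has little to work with when an attack prompt reads like ordinary text, which is the situation the abstract warns about.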