May 16, 2024, 4:13 a.m. | Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

arXiv:2308.03825v2 Announce Type: replace
Abstract: The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as a jailbreak prompt, has emerged as the main attack vector for bypassing safeguards and eliciting harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and …

