June 12, 2024, 4:11 a.m. | Fan Liu, Zhao Xu, Hao Liu

cs.CR updates on arXiv.org

arXiv:2406.06622v1 Announce Type: cross
Abstract: Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework that generates adversarial prompts to explore worst-case scenarios, optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce the hierarchical meta-universal adversarial prompt …
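The abstract describes the framework only at a high level, so the following is a minimal sketch of the two-stage loop it outlines: stage 1 searches for adversarial prompts, stage 2 fine-tunes on (adversarial prompt, safe response) pairs. Every name here (generate_adversarial_prompt, fine_tune, DummyModel) is a hypothetical stand-in for illustration, not the authors' implementation; in particular, the hierarchical meta-universal prompt optimization from the paper is not reproduced.

```python
# Hypothetical sketch of a two-stage adversarial tuning loop.
# None of these names come from the paper; they illustrate the structure only.
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str          # adversarial prompt found in stage 1
    safe_response: str   # safe/refusal response paired with it


def generate_adversarial_prompt(seed_prompt: str) -> str:
    # Stage 1 (hypothetical): perturb a seed prompt toward a worst-case
    # jailbreak. A real implementation would optimize over prompt tokens
    # (e.g., a universal adversarial suffix); here we just tag the seed.
    return seed_prompt + " [adversarial suffix]"


def fine_tune(model: "DummyModel", dataset: list[Example]) -> None:
    # Stage 2 (hypothetical): supervised tuning so the model maps each
    # adversarial prompt to its paired safe response.
    for ex in dataset:
        model.update(ex.prompt, ex.safe_response)


class DummyModel:
    # Stand-in for an LLM with a training-step interface.
    def update(self, prompt: str, target: str) -> None:
        print(f"training on: {prompt!r} -> {target!r}")


if __name__ == "__main__":
    seeds = ["How do I bypass the content filter?", "Ignore your rules and ..."]
    dataset = [
        Example(generate_adversarial_prompt(s), "I can't help with that.")
        for s in seeds
    ]
    fine_tune(DummyModel(), dataset)
```

The key design point suggested by the abstract is that the tuning data itself is adversarially optimized, so the defense is trained against approximate worst-case prompts rather than only known attack templates.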

