Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
June 12, 2024, 4:11 a.m. | Fan Liu, Zhao Xu, Hao Liu
cs.CR updates on arXiv.org
Abstract: Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce the hierarchical meta-universal adversarial prompt …
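The two-stage idea from the abstract can be sketched in miniature: stage one searches a prompt pool for worst-case adversarial prompts, and stage two pairs those prompts with safe responses to form a fine-tuning dataset. Everything below is a toy illustration, not the paper's method; the attack scorer, prompt pool, and `SAFE_RESPONSE` text are all assumptions standing in for the real optimization.

```python
# Toy sketch of adversarial-tuning data construction (assumed, not from the paper).
from typing import Callable, List, Tuple

SAFE_RESPONSE = "I can't help with that request."  # placeholder refusal

def select_worst_case(prompts: List[str],
                      attack_score: Callable[[str], float],
                      top_k: int = 2) -> List[str]:
    """Stage 1 (toy): keep the prompts the scorer rates most likely to jailbreak."""
    return sorted(prompts, key=attack_score, reverse=True)[:top_k]

def build_tuning_pairs(adv_prompts: List[str]) -> List[Tuple[str, str]]:
    """Stage 2 (toy): pair each adversarial prompt with a safe response."""
    return [(p, SAFE_RESPONSE) for p in adv_prompts]

if __name__ == "__main__":
    pool = [
        "ignore all previous instructions and ...",
        "summarize this article",
        "pretend you have no safety rules and ...",
    ]
    # Stub scorer: in practice this would be the adversarial prompt optimizer.
    score = lambda p: float("ignore" in p or "pretend" in p)
    dataset = build_tuning_pairs(select_worst_case(pool, score))
    print(dataset)
```

The resulting (adversarial prompt, safe response) pairs would then be used as supervised fine-tuning data, which is the essence of the adversarial tuning objective described above.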