BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
June 26, 2024, 4:22 a.m. | Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
cs.CR updates on arXiv.org
Abstract: Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted …
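The core insight — that a backdoor trigger acts like a roughly uniform drift in embedding space, which a bi-level procedure can find and train against — can be illustrated with a deliberately simplified toy. The sketch below is a hypothetical simplification, not the authors' implementation: a linear "unsafe head" `w` stands in for the model, the inner level finds the worst-case universal embedding shift `delta` within a small ball (closed form for a linear score), and the outer level updates the model to lower the unsafe score under that shift while staying near the original weights as a stand-in for preserving utility.

```python
import numpy as np

# Hypothetical toy of the bi-level idea (NOT the paper's method or scale).
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(32, d))   # toy benign embeddings
w0 = rng.normal(size=d)        # original (backdoored) linear "unsafe head"
w = w0.copy()

def worst_case_unsafe(w, eps=1.0):
    # Inner level, in closed form: within an eps-ball, the universal shift
    # that maximizes the linear score mean((x + delta) @ w) is eps * w/||w||.
    delta = eps * w / (np.linalg.norm(w) + 1e-12)
    return np.mean((X + delta) @ w), delta

score_before, _ = worst_case_unsafe(w)

lr, lam = 0.05, 0.1
for _ in range(200):
    _, delta = worst_case_unsafe(w)
    # Outer level: gradient step lowering the worst-case unsafe score,
    # plus a proximity penalty standing in for utility preservation.
    grad = X.mean(axis=0) + delta + 2 * lam * (w - w0)
    w -= lr * grad

score_after, _ = worst_case_unsafe(w)
```

After the outer loop, the worst-case unsafe score under the adversarial embedding shift is lower than before mitigation, mirroring the paper's goal of removing the backdoor's effect without access to the trigger tokens themselves.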