BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
June 26, 2024, 4:22 a.m. | Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
cs.CR updates on arXiv.org
Abstract: Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted …
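The core insight — that a backdoor trigger acts like a roughly uniform drift in embedding space, which a bi-level procedure can find and train against — can be illustrated with a deliberately simplified toy. The sketch below is a hypothetical simplification, not the authors' implementation: a linear "unsafe head" `w` stands in for the model, the inner level finds the worst-case universal embedding shift `delta` within a small ball (closed form for a linear score), and the outer level updates the model to lower the unsafe score under that shift while staying near the original weights as a stand-in for preserving utility.

```python
import numpy as np

# Hypothetical toy of the bi-level idea (NOT the paper's method or scale).
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(32, d))   # toy benign embeddings
w0 = rng.normal(size=d)        # original (backdoored) linear "unsafe head"
w = w0.copy()

def worst_case_unsafe(w, eps=1.0):
    # Inner level, in closed form: within an eps-ball, the universal shift
    # that maximizes the linear score mean((x + delta) @ w) is eps * w/||w||.
    delta = eps * w / (np.linalg.norm(w) + 1e-12)
    return np.mean((X + delta) @ w), delta

score_before, _ = worst_case_unsafe(w)

lr, lam = 0.05, 0.1
for _ in range(200):
    _, delta = worst_case_unsafe(w)
    # Outer level: gradient step lowering the worst-case unsafe score,
    # plus a proximity penalty standing in for utility preservation.
    grad = X.mean(axis=0) + delta + 2 * lam * (w - w0)
    w -= lr * grad

score_after, _ = worst_case_unsafe(w)
```

After the outer loop, the worst-case unsafe score under the adversarial embedding shift is lower than before mitigation, mirroring the paper's goal of removing the backdoor's effect without access to the trigger tokens themselves.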