June 26, 2024, 4:22 a.m. | Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia

cs.CR updates on arXiv.org

arXiv:2406.17092v1 Announce Type: new
Abstract: Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted …

