Feb. 20, 2024, 5:11 a.m. | Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tram\`er, Milad Nasr

cs.CR updates on arXiv.org arxiv.org

arXiv:2402.12329v1 Announce Type: cross
Abstract: Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language …

access adversarial arxiv attacks box cs.ai cs.cl cs.cr cs.lg examples language prompt query strings work

CyberSOC Technical Lead

@ Integrity360 | Sandyford, Dublin, Ireland

Cyber Security Strategy Consultant

@ Capco | New York City

Cyber Security Senior Consultant

@ Capco | Chicago, IL

Sr. Product Manager

@ MixMode | Remote, US

Corporate Intern - Information Security (Year Round)

@ Associated Bank | US WI Remote

Senior Offensive Security Engineer

@ CoStar Group | US-DC Washington, DC