July 4, 2024, 11:02 a.m. | Hannah Brown, Leon Lin, Kenji Kawaguchi, Michael Shieh

cs.CR updates on arXiv.org arxiv.org

arXiv:2407.03234v1 Announce Type: cross
Abstract: When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study …

adversarial adversarial attacks arxiv attacks cs.cl cs.cr cs.lg defense evaluation llms

Cyber Security Project Engineer

@ Dezign Concepts LLC | Chantilly, VA

Cloud Cybersecurity Incident Response Lead

@ Maveris | Martinsburg, West Virginia, United States

Sr Staff Security Researcher (Malware Research - Antivirus Systems)

@ Palo Alto Networks | Santa Clara, CA, United States

Identity & Access Management, Senior Associate

@ PwC | Toronto - 18 York Street

Senior Manager, AI Security

@ Lloyds Banking Group | London 10 Gresham Street

Senior Red Team Engineer

@ Adobe | Remote California