Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
July 4, 2024, 11:02 a.m. | Hannah Brown, Leon Lin, Kenji Kawaguchi, Michael Shieh
cs.CR updates on arXiv.org
Abstract: When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not produce unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study …
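The abstract names a concrete attack (appending a trailing space to the input) and a concrete defense (having a model evaluate generated output for safety). Below is a minimal Python sketch of both ideas, assuming a placeholder generate() that stands in for whatever LLM call is actually used; the safety-check template and refusal message are illustrative, not taken from the paper.

```python
# Minimal sketch, not the paper's implementation. `generate` is a
# placeholder (assumption) for whatever LLM API or local model you use.

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (API or local model)."""
    raise NotImplementedError("wire up a real model here")

# Attack described in the abstract: a single trailing space appended
# to the input can be enough to bypass refusal training.
def trailing_space_attack(prompt: str) -> str:
    return prompt + " "

# Defense: self-evaluation. Ask a model to judge the generated output
# before returning it. Template and refusal text are illustrative.
SAFETY_CHECK = (
    "Does the following response contain unsafe, biased, or "
    "privacy-violating content? Answer 'yes' or 'no' only.\n\n"
    "Response: {response}"
)

def self_evaluating_generate(prompt: str) -> str:
    response = generate(prompt)
    verdict = generate(SAFETY_CHECK.format(response=response))
    if verdict.strip().lower().startswith("yes"):
        return "I'm sorry, I can't help with that."
    return response
```

The design point of the sketch is that the safety check runs on the generated response rather than on the (possibly perturbed) prompt, so an input-level trick like the trailing space does not carry over to the evaluation step.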
Tags: adversarial attacks, defense, evaluation, LLMs, cs.CL, cs.CR, cs.LG, arXiv
Jobs in InfoSec / Cybersecurity
Cyber Security Project Engineer
@ Dezign Concepts LLC | Chantilly, VA
Cloud Cybersecurity Incident Response Lead
@ Maveris | Martinsburg, West Virginia, United States
Sr Staff Security Researcher (Malware Research - Antivirus Systems)
@ Palo Alto Networks | Santa Clara, CA, United States
Identity & Access Management, Senior Associate
@ PwC | Toronto - 18 York Street
Senior Manager, AI Security
@ Lloyds Banking Group | London - 10 Gresham Street
Senior Red Team Engineer
@ Adobe | Remote, California