Sept. 4, 2023, 1:10 a.m. | Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons

cs.CR updates on arXiv.org

Are foundation models secure from malicious actors? In this work, we focus on
the image input to a vision-language model (VLM). We discover image hijacks,
adversarial images that control generative models at runtime. We introduce
Behavior Matching, a general method for creating image hijacks, and we use it
to explore three types of attacks. Specific string attacks generate arbitrary
output of the adversary's choosing. Leak context attacks leak information from
the context window into the output. Jailbreak attacks circumvent a …
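
The abstract names Behavior Matching as the general recipe behind these attacks: optimize an adversarial image so that the VLM's generated output matches a target behavior chosen by the adversary. Below is a minimal, hedged sketch of that idea in PyTorch. The `vlm` callable and its signature, the teacher-forced cross-entropy loss, and the signed-gradient step with an L-infinity projection are illustrative assumptions for this sketch, not the paper's exact method.

```python
import torch

def image_hijack_sketch(vlm, image, prompt_ids, target_ids,
                        steps=500, step_size=1 / 255, eps=8 / 255):
    """Hedged sketch of optimizing an image to force a target VLM output.

    Assumptions (not taken from the abstract): `vlm(image, input_ids)`
    returns per-token logits for the concatenated prompt + target sequence,
    `image` is a float tensor in [0, 1], and the perturbation is kept inside
    an L-infinity ball of radius `eps`.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(steps):
        adv_image = (image + delta).clamp(0, 1)
        # Teacher-forced pass: score the target continuation given the prompt.
        logits = vlm(adv_image, torch.cat([prompt_ids, target_ids]))
        # Positions P-1 .. P+T-2 predict the T target tokens.
        start = len(prompt_ids) - 1
        loss = loss_fn(logits[start : start + len(target_ids)], target_ids)

        loss.backward()
        with torch.no_grad():
            # Signed-gradient descent step, then project back into the eps-ball.
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad = None

    return (image + delta).clamp(0, 1).detach()
```

Under this framing, a specific string attack would set `target_ids` to the adversary's chosen output, while a leak context attack would instead target text that echoes the model's context window; both are just different target behaviors plugged into the same matching loop.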
