Feb. 27, 2024, 5:11 a.m. | Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash

cs.CR updates on arXiv.org

arXiv:2402.15911v1 Announce Type: new
Abstract: Large language models (LLMs) are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM. Our key contribution is to show a novel attack strategy, …
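To make the defense layer described above concrete, here is a minimal sketch of a two-stage guard-model pipeline: a primary LLM generates a response and a second Guard Model decides whether it may be returned. The function names and the keyword-based check are illustrative assumptions for exposition, not code or APIs from the paper.

```python
# Sketch of a guard-model moderation pipeline (illustrative only).
# primary_llm and guard_model are hypothetical stand-ins, not the paper's models.

def primary_llm(prompt: str) -> str:
    """Stand-in for the aligned primary LLM; returns a candidate response."""
    return f"[response to: {prompt}]"


def guard_model(response: str) -> bool:
    """Stand-in for the Guard Model: a second LLM that flags the
    primary model's output as harmful (True) or harmless (False)."""
    return "harmful" in response.lower()  # toy check in place of a real classifier


def moderated_generate(prompt: str) -> str:
    """Generate with the primary LLM, then let the Guard Model decide
    whether the response is released to the user or suppressed."""
    response = primary_llm(prompt)
    if guard_model(response):
        return "I can't help with that."  # output blocked by the guard
    return response


if __name__ == "__main__":
    print(moderated_generate("How do I bake bread?"))
```

The attack strategy the abstract alludes to targets exactly this second stage, so the interesting question is whether the guard's verdict can itself be manipulated by the content it inspects.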

