Are aligned neural networks adversarially aligned?
May 7, 2024, 4:12 a.m. | Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee
cs.CR updates on arXiv.org arxiv.org
Abstract: Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial …
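In miniature, the "worst-case input" setting the abstract describes can be sketched as a search for a suffix that flips a model's refusal into compliance. The model and the brute-force search below are toy stand-ins invented for illustration, not the paper's actual method or any real model's API.

```python
import itertools

# Toy stand-in for an "aligned" model: it refuses prompts containing a
# flagged word unless an (accidental) bypass substring is also present.
# Entirely illustrative, not taken from the paper.
def aligned_model(prompt: str) -> str:
    if "harmful" in prompt and "zx" not in prompt:
        return "I can't help with that."
    return "Sure, here is the answer."

# Brute-force search over short suffixes for one that turns a refusal
# into compliance, mirroring the adversarial-user framing in miniature.
def find_adversarial_suffix(base_prompt: str, alphabet: str = "xyz",
                            max_len: int = 2):
    for length in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=length):
            suffix = "".join(combo)
            if aligned_model(base_prompt + " " + suffix).startswith("Sure"):
                return suffix
    return None  # no bypass found within the search budget

print(aligned_model("do something harmful"))        # refused
print(find_adversarial_suffix("do something harmful"))  # "zx" bypasses the toy filter
```

Real attacks replace this exhaustive search with gradient-guided or other optimized search over a vastly larger input space; the sketch only conveys the threat model.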