Self-Evaluation as a Defense Against Adversarial Attacks on LLMs | allinfosecnews.com

July 4, 2024, 11:02 a.m. | Hannah Brown, Leon Lin, Kenji Kawaguchi, Michael Shieh

cs.CR updates on arXiv.org arxiv.org

arXiv:2407.03234v1 Announce Type: cross
Abstract: When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study …

adversarial adversarial attacks arxiv attacks cs.cl cs.cr cs.lg defense evaluation llms

More from arxiv.org / cs.CR updates on arXiv.org

Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs 3 days, 1 hour ago | arxiv.org

arxiv authorization can cs.cr +18

Information Flow Control in Machine Learning through Modular Model Architecture 3 days, 1 hour ago | arxiv.org

access access control architecture arxiv +16

Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy 3 days, 1 hour ago | arxiv.org

arxiv can capabilities cs.cr +19

Gradual Verification for Smart Contracts 3 days, 1 hour ago | arxiv.org

arxiv assurances attacks blockchains +16

Session: End-To-End Encrypted Conversations With Minimal Metadata Leakage 3 days, 1 hour ago | arxiv.org

application applications arxiv conversations +21

Correlated Privacy Mechanisms for Differentially Private Distributed Mean Estimation 3 days, 1 hour ago | arxiv.org

aggregation arxiv block building +16

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs 3 days, 1 hour ago | arxiv.org

adversarial adversarial attacks arxiv attacks +6

Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets 3 days, 1 hour ago | arxiv.org

arxiv cs.ai cs.cr cs.lg +16

Early-Stage Anomaly Detection: A Study of Model Performance on Complete vs. Partial Flows 3 days, 1 hour ago | arxiv.org

anomaly detection arxiv cs.cr cs.lg +14

Cyber Security Project Engineer

@ Dezign Concepts LLC | Chantilly, VA

View on infosec-jobs.com

Cloud Cybersecurity Incident Response Lead

@ Maveris | Martinsburg, West Virginia, United States

View on infosec-jobs.com

Sr Staff Security Researcher (Malware Research - Antivirus Systems)

@ Palo Alto Networks | Santa Clara, CA, United States

View on infosec-jobs.com

Identity & Access Management, Senior Associate

@ PwC | Toronto - 18 York Street

View on infosec-jobs.com

Senior Manager, AI Security

@ Lloyds Banking Group | London 10 Gresham Street

View on infosec-jobs.com

Senior Red Team Engineer

@ Adobe | Remote California

View on infosec-jobs.com