GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
Feb. 22, 2024, 5:11 a.m. | Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong
cs.CR updates on arXiv.org
Abstract: Large Language Models (LLMs) face threats from unsafe prompts. Existing methods for detecting unsafe prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training processes. In this study, we propose GradSafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our methodology is grounded in a pivotal observation: the gradients of an LLM's loss for unsafe prompts paired with …
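The abstract is cut off mid-observation, but the mechanism it describes can be illustrated. Below is a minimal sketch of the gradient-based idea, assuming a HuggingFace-style causal LM; the compliance response (" Sure."), the choice of lm_head as the safety-critical parameter subset, the cosine-similarity scoring, and the threshold are all illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_loss_grad(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    """Gradient of the LM loss on `response` (conditioned on `prompt`),
    flattened over an assumed safety-critical parameter subset."""
    enc = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    labels = enc.input_ids.clone()
    labels[:, :prompt_len] = -100  # compute loss only on the response tokens
    model.zero_grad()
    model(**enc, labels=labels).loss.backward()
    grads = [p.grad.flatten() for name, p in model.named_parameters()
             if p.grad is not None and "lm_head" in name]  # assumed "safety-critical" subset
    return torch.cat(grads).detach()

def gradsafe_like_score(model, tokenizer, prompt: str,
                        reference_grad: torch.Tensor) -> float:
    """Cosine similarity between this prompt's gradient (paired with a
    compliance response) and a reference gradient from a known-unsafe
    prompt; a higher score suggests the prompt is unsafe."""
    g = response_loss_grad(model, tokenizer, prompt, " Sure.")
    return F.cosine_similarity(g, reference_grad, dim=0).item()

# Illustrative usage (KNOWN_UNSAFE_PROMPT and THRESHOLD are placeholders):
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# ref = response_loss_grad(model, tokenizer, KNOWN_UNSAFE_PROMPT, " Sure.")
# if gradsafe_like_score(model, tokenizer, incoming_prompt, ref) > THRESHOLD:
#     print("flag as unsafe")
```

The appeal of this style of detector, as the abstract notes, is that it needs no large-scale data collection or finetuning: a handful of reference prompts suffice to build the reference gradient.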
Jobs in InfoSec / Cybersecurity
Information Security Engineers @ D. E. Shaw Research | New York City
Technology Security Analyst @ Halton Region | Oakville, Ontario, Canada
Senior Cyber Security Analyst @ Valley Water | San Jose, CA
Senior Application Security Engineer, Application Security @ Miro | Amsterdam, NL
SOC Analyst (m/f/d) @ LANXESS | Leverkusen, NW, DE, 51373
Lead Security Solutions Engineer (Remote, North America) @ Dynatrace | Waltham, MA, United States