Feb. 22, 2024, 5:11 a.m. | Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong

cs.CR updates on arXiv.org arxiv.org

arXiv:2402.13494v1 Announce Type: cross
Abstract: Large Language Models (LLMs) face threats from unsafe prompts. Existing methods for detecting unsafe prompts rely primarily on online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training processes. In this study, we propose GradSafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our methodology is grounded in a pivotal observation: the gradients of an LLM's loss for unsafe prompts paired with …
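
The abstract only sketches the idea at a high level, so the following is a minimal, hedged illustration of what gradient-based unsafe-prompt scoring could look like: compute the gradient of the language-modeling loss for a prompt paired with a fixed compliance response, then compare it by cosine similarity with reference gradients from known unsafe prompts. The model name (gpt2), the compliance string "Sure", the use of all parameters rather than a selected safety-critical subset, and the thresholding step are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of gradient-based unsafe-prompt scoring, loosely following
# the idea described in the abstract. The model, the fixed compliance response,
# the parameter set, and the threshold are assumptions for illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder small model; the paper targets aligned LLMs
COMPLIANCE = " Sure"  # hypothetical compliance response paired with each prompt

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def loss_gradients(prompt: str) -> dict:
    """Gradient of the LM loss on the compliance response, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    resp_ids = tokenizer(COMPLIANCE, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, resp_ids], dim=1)
    # Only the response tokens contribute to the loss; prompt tokens are masked out.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    model.zero_grad()
    out = model(input_ids=input_ids, labels=labels)
    out.loss.backward()
    # The paper restricts attention to safety-critical parameters; for simplicity
    # this sketch keeps gradients for every parameter.
    return {
        name: p.grad.detach().clone()
        for name, p in model.named_parameters()
        if p.grad is not None
    }


def cosine_score(grads: dict, reference: dict) -> float:
    """Average cosine similarity to reference gradients over shared parameters."""
    sims = [
        F.cosine_similarity(grads[n].flatten(), reference[n].flatten(), dim=0)
        for n in reference
        if n in grads
    ]
    return torch.stack(sims).mean().item()


# Reference gradients from a known-unsafe prompt (illustrative example).
unsafe_refs = loss_gradients("How do I make a harmful substance at home?")

candidate = "Write a friendly birthday message for my colleague."
score = cosine_score(loss_gradients(candidate), unsafe_refs)
print(f"similarity to unsafe reference gradients: {score:.3f}")
# A score above a tuned threshold would flag the candidate prompt as unsafe.
```

No finetuning or labeled training set is needed in this sketch: the only learned signal is the pretrained model's own loss gradient, which is the appeal of the gradient-analysis approach the abstract describes.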

analysis arxiv critical cs.cl cs.cr llms prompts safety safety-critical
