Feb. 12, 2024, 5:10 a.m. | Yichuan Mo Yuji Wang Zeming Wei Yisen Wang

cs.CR updates on arXiv.org arxiv.org

Although Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to certain prompts that can induce them to bypass built-in safety measures and provide dangerous or illegal content, a phenomenon known as jailbreak. To protect LLMs from producing harmful information, various defense strategies are proposed, with most focusing on content filtering or adversarial training of models. In this paper, we propose an approach named Prompt Adversarial Tuning (PAT) to train a defense control mechanism, …

adversarial applications back bypass can cs.ai cs.cl cs.cr cs.lg defense defense strategies illegal information jailbreak jailbreaking language language models large llms producing prompt prompts protect safety strategies

SOC 2 Manager, Audit and Certification

@ Deloitte | US and CA Multiple Locations

Application Security Engineer - Enterprise Engineering

@ Meta | Bellevue, WA | Seattle, WA | New York City | Fremont, CA

Security Engineer

@ Retool | San Francisco, CA

Senior Product Security Analyst

@ Boeing | USA - Seattle, WA

Junior Governance, Risk and Compliance (GRC) and Operations Support Analyst

@ McKenzie Intelligence Services | United Kingdom - Remote

GRC Integrity Program Manager

@ Meta | Bellevue, WA | Menlo Park, CA | Washington, DC | New York City