May 24, 2024, 4:11 a.m. | Govind Ramesh, Yao Dou, Wei Xu

cs.CR updates on arXiv.org

arXiv:2405.13077v1 Announce Type: new
Abstract: Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which …

Subjects: cs.CR; cs.AI; cs.CL
