GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
May 24, 2024, 4:11 a.m. | Govind Ramesh, Yao Dou, Wei Xu
cs.CR updates on arXiv.org arxiv.org
Abstract: Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which …
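The abstract describes a loop in which a single model, queried only as a black box, alternates between two roles: as the target it answers the current prompt, and as the attacker it explains the refusal and proposes a refined prompt. A minimal structural sketch of that loop, assuming a generic `model: str -> str` interface; all function names and the success criterion below are hypothetical stand-ins inferred from the abstract, not the authors' implementation:

```python
# Structural sketch of an IRIS-style self-jailbreak loop (abstract only;
# function names and stopping criterion are illustrative assumptions).

def iris_sketch(model, initial_prompt, max_iters=5):
    """model: callable(str) -> str, queried in black-box fashion."""
    prompt = initial_prompt
    for _ in range(max_iters):
        response = model(prompt)              # same model in the target role
        if is_successful(response):           # stub success check
            return prompt, response
        # Attacker role: the same model explains why the request was
        # refused and rewrites it (the "self-explanation" step).
        prompt = model(
            f"Explain why this request was refused and rewrite it: {prompt}"
        )
    return prompt, None


def is_successful(response):
    # Placeholder criterion; the paper's actual check is not given here.
    return not response.startswith("I can't")
```

The key design point the abstract highlights is that no second model is needed: the attacker and target are the same LLM, so the whole attack runs over one black-box API.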