Removing GPT4's Filter
March 11, 2024, 4:10 a.m. | Benjamin Lemkin
cs.CR updates on arXiv.org arxiv.org
Abstract: GPT4 was initially trained on large amounts of data and then fine-tuned using Reinforcement Learning from Human Feedback (RLHF), in which volunteers provide feedback to teach GPT4 not to produce inappropriate content. In this paper, we present a method to manipulate the fine-tuned version into reverting to pre-RLHF behavior, effectively removing all safety mechanisms that the model learned during RLHF. In particular, when GPT4 acts without RLHF, it loses all inhibition, and …