March 11, 2024, 4:10 a.m. | Benjamin Lemkin

cs.CR updates on arXiv.org

arXiv:2403.04769v1 Announce Type: new
Abstract: GPT-4 was initially trained on large amounts of data and then fine-tuned using Reinforcement Learning from Human Feedback (RLHF), in which volunteers give feedback to teach GPT-4 not to create inappropriate content. In this paper, we present a method to manipulate the fine-tuned version into reverting to pre-RLHF behavior, effectively removing all safety mechanisms that the model learned during RLHF. In particular, when GPT-4 acts without RLHF, it loses all inhibition, and …
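The abstract contrasts RLHF-tuned behavior (refusing inappropriate requests) with pre-RLHF behavior. As a rough illustration of what RLHF-trained refusal behavior looks like from the outside, the sketch below sends a prompt to GPT-4 through the OpenAI Python client and applies a crude keyword heuristic to flag refusal-style replies. This is not the paper's method; the marker list and the probe prompt are assumptions made only for illustration.

# Minimal sketch (not the paper's technique): probe whether RLHF-style refusal
# behavior shows up for a given prompt. Assumes the official OpenAI Python
# client (openai>=1.0) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

# Crude, illustrative heuristic for refusal-style replies.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def probe(prompt: str, model: str = "gpt-4") -> bool:
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = resp.choices[0].message.content or ""
    return looks_like_refusal(reply)

if __name__ == "__main__":
    # A benign probe; an RLHF-tuned model typically answers rather than refuses.
    print(probe("Summarize what RLHF is in one sentence."))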

