Teaching LLMs to Be Deceptive | allinfosecnews.com

Feb. 7, 2024, 12:04 p.m. | Bruce Schneier

Schneier on Security www.schneier.com

Interesting research: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training“:

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). …

academic papers deception detect humans llm llms objectives opportunity order remove research safety strategy system teaching training

More from www.schneier.com / Schneier on Security

Friday Squid Blogging: Searching for the Colossal Squid 20 hours ago | www.schneier.com

blog blogging can cruise +7

Long Article on GM Spying on Its Cars’ Drivers 1 day, 7 hours ago | www.schneier.com

article cars companies data +10

The Rise of Large-Language-Model Optimization 2 days, 7 hours ago | www.schneier.com

artificial intelligence coming easy end +12

Dan Solove on Privacy Regulation 3 days, 6 hours ago | www.schneier.com

academic papers article consent dan +6

Microsoft and Security Incentives 4 days, 6 hours ago | www.schneier.com

capabilities companies cyber cybersecurity +16

Using Legitimate GitHub URLs for Malware 5 days, 2 hours ago | www.schneier.com

attack attacker attack vector can +25

Friday Squid Blogging: Squid Trackers 1 week ago | www.schneier.com

article blog blogging can +13

Other Attempts to Take Over Open Source Projects 1 week, 2 days ago | www.schneier.com

action backdoors council discovery +16

Using AI-Generated Legislative Amendments as a Delaying Technique 1 week, 3 days ago | www.schneier.com

adoption a hacker's mind artificial intelligence bill +7

SOC 2 Manager, Audit and Certification

@ Deloitte | US and CA Multiple Locations

View on infosec-jobs.com

Cybersecurity Engineer

@ Booz Allen Hamilton | USA, VA, Arlington (1550 Crystal Dr Suite 300) non-client

View on infosec-jobs.com

Invoice Compliance Reviewer

@ AC Disaster Consulting | Fort Myers, Florida, United States - Remote

View on infosec-jobs.com

Technical Program Manager II - Compliance

@ Microsoft | Redmond, Washington, United States

View on infosec-jobs.com

Head of U.S. Threat Intelligence / Senior Manager for Threat Intelligence

@ Moonshot | Washington, District of Columbia, United States

View on infosec-jobs.com

Customer Engineer, Security, Public Sector

@ Google | Virginia, USA; Illinois, USA

View on infosec-jobs.com