Feb. 7, 2024, 12:04 p.m. | Bruce Schneier

Schneier on Security www.schneier.com

Interesting research: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training“:


Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). …

academic papers deception detect humans llm llms objectives opportunity order remove research safety strategy system teaching training

CyberSOC Technical Lead

@ Integrity360 | Sandyford, Dublin, Ireland

Cyber Security Strategy Consultant

@ Capco | New York City

Cyber Security Senior Consultant

@ Capco | Chicago, IL

Senior Security Researcher - Linux MacOS EDR (Cortex)

@ Palo Alto Networks | Tel Aviv-Yafo, Israel

Sr. Manager, NetSec GTM Programs

@ Palo Alto Networks | Santa Clara, CA, United States

SOC Analyst I

@ Fortress Security Risk Management | Cleveland, OH, United States