Jan. 24, 2024, 12:06 p.m. | Bruce Schneier

Schneier on Security www.schneier.com

New research into poisoning AI models:


The researchers first trained the AI models using supervised learning and then used additional “safety training” methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidden behaviors. They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.


During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three …

academic papers adversarial ai models artificial intelligence code found hidden llm machine learning poisoning prompts research researchers safety threat models training

Senior Security Engineer - Detection and Response

@ Fastly, Inc. | US (Remote)

Application Security Engineer

@ Solidigm | Zapopan, Mexico

Defensive Cyber Operations Engineer-Mid

@ ISYS Technologies | Aurora, CO, United States

Manager, Information Security GRC

@ OneTrust | Atlanta, Georgia

Senior Information Security Analyst | IAM

@ EBANX | Curitiba or São Paulo

Senior Information Security Engineer, Cloud Vulnerability Research

@ Google | New York City, USA; New York, USA