all InfoSec news
Poisoning AI Models
Schneier on Security www.schneier.com
New research into poisoning AI models:
The researchers first trained the AI models using supervised learning and then used additional “safety training” methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidden behaviors. They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.
During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three …
academic papers adversarial ai models artificial intelligence code found hidden llm machine learning poisoning prompts research researchers safety threat models training