all InfoSec news
Forbidden Facts: An Investigation of Competing Objectives in Llama-2. (arXiv:2312.08793v1 [cs.LG])
cs.CR updates on arXiv.org arxiv.org
LLMs often face competing pressures (for example helpfulness vs.
harmlessness). To understand how models resolve such conflicts, we study
Llama-2-chat models on the forbidden fact task. Specifically, we instruct
Llama-2 to truthfully complete a factual recall statement while forbidding it
from saying the correct answer. This often makes the model give incorrect
answers. We decompose Llama-2 into 1000+ components, and rank each one with
respect to how useful it is for forbidding the correct answer. We find that in
aggregate, …
chat fact facts forbidden investigation llama llms objectives recall statement study task understand