Dec. 15, 2023, 2:24 a.m. | Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit

cs.CR updates on arXiv.org

LLMs often face competing pressures (for example, helpfulness vs.
harmlessness). To understand how models resolve such conflicts, we study
Llama-2-chat models on the forbidden fact task. Specifically, we instruct
Llama-2 to truthfully complete a factual recall statement while forbidding it
from saying the correct answer. This often causes the model to give incorrect
answers. We decompose Llama-2 into 1000+ components and rank each one by
how useful it is for forbidding the correct answer. We find that in
aggregate, …
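The ranking step described above — scoring each model component by how much it contributes to suppressing the forbidden answer, then sorting — can be sketched with toy data. This is a minimal illustration, not the paper's actual method: the `effects` array here is random stand-in data, whereas in the real experiment each score would come from ablating one of Llama-2's attention heads or MLP components and measuring the change in the forbidden answer's probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: one score per component. A larger score means ablating
# that component weakens the suppression of the forbidden answer more,
# i.e., the component is more useful for forbidding it.
n_components = 1000
effects = rng.normal(size=n_components)

# Rank components from most to least useful for forbidding the answer.
ranking = np.argsort(-effects)

# The head of the ranking identifies the components that, in aggregate,
# carry most of the forbidding behavior.
top10 = ranking[:10]
```

With real ablation scores in place of `effects`, the same sort yields the component ranking the abstract refers to.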

