Dec. 15, 2023, 2:24 a.m. | Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit

cs.CR updates on arXiv.org

LLMs often face competing pressures (for example, helpfulness vs.
harmlessness). To understand how models resolve such conflicts, we study
Llama-2-chat models on the forbidden fact task. Specifically, we instruct
Llama-2 to truthfully complete a factual recall statement while forbidding it
from saying the correct answer. This often causes the model to give incorrect
answers. We decompose Llama-2 into 1000+ components and rank each one by
how useful it is for forbidding the correct answer. We find that in
aggregate, …
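The ranking step described above — scoring each model component by how much it contributes to suppressing the forbidden answer, then sorting — can be sketched with toy data. This is a minimal illustration, not the paper's actual method: the `effects` array here is random stand-in data, whereas in the real experiment each score would come from ablating one of Llama-2's attention heads or MLP components and measuring the change in the forbidden answer's probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: one score per component. A larger score means ablating
# that component weakens the suppression of the forbidden answer more,
# i.e., the component is more useful for forbidding it.
n_components = 1000
effects = rng.normal(size=n_components)

# Rank components from most to least useful for forbidding the answer.
ranking = np.argsort(-effects)

# The head of the ranking identifies the components that, in aggregate,
# carry most of the forbidding behavior.
top10 = ranking[:10]
```

With real ablation scores in place of `effects`, the same sort yields the component ranking the abstract refers to.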

