Forbidden Facts: An Investigation of Competing Objectives in Llama-2. (arXiv:2312.08793v1 [cs.LG]) | allinfosecnews.com

Dec. 15, 2023, 2:24 a.m. | Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit

cs.CR updates on arXiv.org arxiv.org

LLMs often face competing pressures (for example helpfulness vs.
harmlessness). To understand how models resolve such conflicts, we study
Llama-2-chat models on the forbidden fact task. Specifically, we instruct
Llama-2 to truthfully complete a factual recall statement while forbidding it
from saying the correct answer. This often makes the model give incorrect
answers. We decompose Llama-2 into 1000+ components, and rank each one with
respect to how useful it is for forbidding the correct answer. We find that in
aggregate, …

chat fact facts forbidden investigation llama llms objectives recall statement study task understand

More from arxiv.org / cs.CR updates on arXiv.org

Differentially private Bayesian tests 3 hours ago | arxiv.org

arxiv confidential cornerstone cs.cr +16

On the Learnability of Watermarks for Language Models 3 hours ago | arxiv.org

arxiv ask can cs.cl +12

Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image … 3 hours ago | arxiv.org

applications arxiv attack cs.cr +14

On the Reliability of Watermarks for Large Language Models 3 hours ago | arxiv.org

arxiv bots cs.cl cs.cr +23

A Watermark for Large Language Models 3 hours ago | arxiv.org

arxiv can cs.cl cs.cr +13

Asymmetric Distributed Trust 3 hours ago | arxiv.org

abstraction algorithms arxiv can +12

Read Disturbance in High Bandwidth Memory: A Detailed Experimental Study on HBM2 DRAM Chips 3 hours ago | arxiv.org

arxiv bandwidth chips cs.ar +5

ABACuS: All-Bank Activation Counters for Scalable and Low Overhead RowHammer Mitigation 3 hours ago | arxiv.org

access address area arxiv +17

A Case Study of Large Language Models (ChatGPT and CodeBERT) for Security-Oriented Code Analysis 3 hours ago | arxiv.org

analysis arxiv can capabilities +17

Social Engineer For Reverse Engineering Exploit Study

@ Independent study | Remote

View on infosec-jobs.com

Data Privacy Manager m/f/d)

@ Coloplast | Hamburg, HH, DE

View on infosec-jobs.com

Cybersecurity Sr. Manager

@ Eastman | Kingsport, TN, US, 37660

View on infosec-jobs.com

KDN IAM Associate Consultant

@ KPMG India | Hyderabad, Telangana, India

View on infosec-jobs.com

Learning Experience Designer in Cybersecurity (f/m/div.) (Salary: ~113.000 EUR p.a.*)

@ Bosch Group | Stuttgart, Germany

View on infosec-jobs.com

Senior Security Engineer - SIEM

@ Samsara | Remote - US

View on infosec-jobs.com