March 19, 2024, 4:11 a.m. | Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, Tomek Strzalkowski, Mei Si

cs.CR updates on arXiv.org arxiv.org

arXiv:2312.00029v2 Announce Type: replace
Abstract: Research into AI alignment has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when models are deliberately attacked. These attacks can trick seemingly aligned models into giving manufacturing instructions for dangerous materials, inciting violence, or recommending other immoral acts. To help mitigate this issue, we introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks …

