Detecting Pretraining Data from Large Language Models
March 12, 2024, 4:11 a.m. | Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
Source: cs.CR updates on arXiv.org
Abstract: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In …
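The abstract is cut off before it describes the detection method, but the general idea in this line of work is a membership-inference score computed from token-level log-probabilities: text the model saw during pretraining tends to contain fewer "surprising" (low-probability) tokens. A minimal sketch of such a min-k%-probability score, assuming access to per-token log-probs (the function name and toy numbers below are illustrative, not taken from the paper):

```python
import numpy as np

def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k% least likely tokens in a text.

    Very negative scores suggest the text was NOT in the pretraining
    data; scores closer to zero suggest the model may have seen it.
    """
    logprobs = np.sort(np.asarray(token_logprobs))  # ascending: least likely first
    n = max(1, int(len(logprobs) * k))              # keep the bottom k%
    return float(np.mean(logprobs[:n]))

# Toy per-token log-probs a model might assign (hypothetical values):
seen_text   = [-0.1, -0.3, -0.2, -0.5, -0.4, -0.1]  # uniformly likely tokens
unseen_text = [-0.1, -2.5, -0.2, -3.1, -0.4, -2.8]  # several surprising tokens

print(min_k_percent_prob(seen_text))    # → -0.5
print(min_k_percent_prob(unseen_text))  # → -3.1
```

In practice a threshold on this score (calibrated on texts known to be in or out of the corpus) turns it into a binary member/non-member classifier.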