Detecting Pretraining Data from Large Language Models. (arXiv:2310.16789v2 [cs.CL] UPDATED) | allinfosecnews.com

Nov. 6, 2023, 2:10 a.m. | Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

cs.CR updates on arXiv.org arxiv.org

Although large language models (LLMs) are widely deployed, the data used to
train them is rarely disclosed. Given the incredible scale of this data, up to
trillions of tokens, it is all but certain that it includes potentially
problematic text such as copyrighted materials, personally identifiable
information, and test data for widely reported reference benchmarks. However,
we currently have no way to know which data of these types is included or in
what proportions. In this paper, we study the …

data information language language models large llms materials personally identifiable information scale test text tokens train

More from arxiv.org / cs.CR updates on arXiv.org

Causal Inference with Differentially Private (Clustered) Outcomes 16 hours ago | arxiv.org

algorithm arxiv cs.cr cs.lg +12

An artificial neural network approach to finding the key length of the Vigen\`{e}re cipher 16 hours ago | arxiv.org

accuracy article artificial arxiv +9

Generic Selfish Mining MDP for DAG Protocols 16 hours ago | arxiv.org

analysis arxiv bitcoin breaking +15

Tight Differential Privacy Guarantees for the Shuffle Model with $k$-Randomized Response 16 hours ago | arxiv.org

algorithms arxiv cs.cr data +14

Succinct arguments for QMA from standard assumptions via compiled nonlocal games 16 hours ago | arxiv.org

argument arxiv building crypto +8

On Training a Neural Network to Explain Binaries 16 hours ago | arxiv.org

aid arxiv binary code +15

Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning 16 hours ago | arxiv.org

arxiv attacks backdoor backdoor attacks +14

Leveraging Label Information for Stealthy Data Stealing in Vertical Federated Learning 16 hours ago | arxiv.org

arxiv attack attacks cs.cr +16

An Extensive Survey of Digital Image Steganography: State of the Art 16 hours ago | arxiv.org

adoption art arxiv attention +21

Sr. Cloud Security Engineer

@ BLOCKCHAINS | USA - Remote

View on infosec-jobs.com

Network Security (SDWAN: Velocloud) Infrastructure Lead

@ Sopra Steria | Noida, Uttar Pradesh, India

View on infosec-jobs.com

Senior Python Engineer, Cloud Security

@ Darktrace | Cambridge

View on infosec-jobs.com

Senior Security Consultant

@ Nokia | United States

View on infosec-jobs.com

Manager, Threat Operations

@ Ivanti | United States, Remote

View on infosec-jobs.com

Lead Cybersecurity Architect - Threat Modeling | AWS Cloud Security

@ JPMorgan Chase & Co. | Columbus, OH, United States

View on infosec-jobs.com