Nov. 6, 2023, 2:10 a.m. | Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

cs.CR updates on arXiv.org

Although large language models (LLMs) are widely deployed, the data used to
train them is rarely disclosed. Given the incredible scale of this data, up to
trillions of tokens, it is all but certain that it includes potentially
problematic text such as copyrighted materials, personally identifiable
information, and test data for widely reported reference benchmarks. However,
we currently have no way to know which data of these types is included or in
what proportions. In this paper, we study the …

