March 12, 2024, 4:11 a.m. | Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

cs.CR updates on arXiv.org

arXiv:2310.16789v3 Announce Type: replace-cross
Abstract: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In …
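The abstract is truncated above; the paper behind this announcement (arXiv:2310.16789, "Detecting Pretraining Data from Large Language Models") proposes the Min-K% Prob detection method: score a candidate text by the average log-probability of its k% least-likely tokens under the model, on the hypothesis that unseen text tends to contain more low-probability outlier tokens than text the model was trained on. Below is a minimal sketch of that scoring rule, assuming a Hugging Face causal LM; the model name ("gpt2"), the choice k=0.2, and the decision threshold are illustrative stand-ins, not values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text, model, tokenizer, k=0.2):
    """Average log-probability of the k% least-likely tokens in `text`.

    Higher scores suggest the text is more likely to have appeared in
    pretraining (Min-K% Prob, Shi et al., arXiv:2310.16789).
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab_size)
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the k% lowest-probability tokens and average them.
    n = max(1, int(k * token_lp.numel()))
    lowest, _ = torch.topk(token_lp, n, largest=False)
    return lowest.mean().item()

if __name__ == "__main__":
    # "gpt2" stands in for the target LLM; the paper evaluates larger models.
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    score = min_k_prob("The quick brown fox jumps over the lazy dog.", lm, tok)
    print(f"Min-20% Prob score: {score:.3f}")  # compare against a calibrated threshold
```

In the paper this score is thresholded to classify membership, with the threshold calibrated on a benchmark of known member and non-member texts; the method needs only black-box access to token probabilities, not the pretraining corpus itself.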
