Detecting Pretraining Data from Large Language Models
March 12, 2024, 4:11 a.m. | Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
Source: cs.CR updates on arXiv.org
Abstract: Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In …
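The abstract is cut off before it describes the detection method, but the general idea in this line of work is a membership-inference score computed from token-level log-probabilities: text the model saw during pretraining tends to contain fewer "surprising" (low-probability) tokens. A minimal sketch of such a min-k%-probability score, assuming access to per-token log-probs (the function name and toy numbers below are illustrative, not taken from the paper):

```python
import numpy as np

def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k% least likely tokens in a text.

    Very negative scores suggest the text was NOT in the pretraining
    data; scores closer to zero suggest the model may have seen it.
    """
    logprobs = np.sort(np.asarray(token_logprobs))  # ascending: least likely first
    n = max(1, int(len(logprobs) * k))              # keep the bottom k%
    return float(np.mean(logprobs[:n]))

# Toy per-token log-probs a model might assign (hypothetical values):
seen_text   = [-0.1, -0.3, -0.2, -0.5, -0.4, -0.1]  # uniformly likely tokens
unseen_text = [-0.1, -2.5, -0.2, -3.1, -0.4, -2.8]  # several surprising tokens

print(min_k_percent_prob(seen_text))    # → -0.5
print(min_k_percent_prob(unseen_text))  # → -3.1
```

In practice a threshold on this score (calibrated on texts known to be in or out of the corpus) turns it into a binary member/non-member classifier.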