Detecting Pretraining Data from Large Language Models. (arXiv:2310.16789v2 [cs.CL] UPDATED)
cs.CR updates on arXiv.org
Although large language models (LLMs) are widely deployed, the data used to
train them is rarely disclosed. Given the incredible scale of this data, up to
trillions of tokens, it is all but certain that it includes potentially
problematic text such as copyrighted materials, personally identifiable
information, and test data for widely reported reference benchmarks. However,
we currently have no way to know which data of these types is included or in
what proportions. In this paper, we study the …