Feb. 13, 2024, 5:11 a.m. | Jasper Dekoninck Mark Niklas M\"uller Maximilian Baader Marc Fischer Martin Vechev

cs.CR updates on arXiv.org arxiv.org

Large language models are widespread, with their performance on benchmarks frequently guiding user preferences for one model over another. However, the vast amount of data these models are trained on can inadvertently lead to contamination with public benchmarks, thus compromising performance measurements. While recently developed contamination detection methods try to address this issue, they overlook the possibility of deliberate contamination by malicious model providers aiming to evade detection. We argue that this setting is of crucial importance as it casts …

benchmarks can cs.ai cs.cl cs.cr cs.lg data detection easy language language models large performance public try vast

Information Security Engineers

@ D. E. Shaw Research | New York City

Technology Security Analyst

@ Halton Region | Oakville, Ontario, Canada

Senior Cyber Security Analyst

@ Valley Water | San Jose, CA

COMM Penetration Tester (PenTest-2), Chantilly, VA OS&CI Job #368

@ Allen Integrated Solutions | Chantilly, Virginia, United States

Consultant Sécurité SI H/F Gouvernance - Risques - Conformité

@ Hifield | Sèvres, France

Infrastructure Consultant

@ Telefonica Tech | Belfast, United Kingdom