March 4, 2024, 5:10 a.m. | Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan

cs.CR updates on arXiv.org

arXiv:2403.00393v1 Announce Type: new
Abstract: Benchmarking is the de facto standard for evaluating LLMs due to its speed, replicability, and low cost. However, recent work has pointed out that the majority of open-source benchmarks available today have been contaminated or leaked into LLMs, meaning that LLMs had access to the test data during pretraining and/or fine-tuning. This raises serious concerns about the validity of benchmarking studies conducted so far and about the future of evaluation using benchmarks. To solve this problem, …
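The contamination the abstract refers to is commonly screened for by measuring textual overlap between benchmark test items and a model's training corpus. As a rough illustration only (this is not the mitigation the paper goes on to propose, and the function names, variable names, and n-gram size below are assumptions), a word-level n-gram overlap check might look like this:

```python
# Minimal sketch of a contamination screen: flag benchmark items whose
# word-level n-grams also appear in a sample of the pretraining corpus.
# All names and thresholds here are illustrative assumptions, not the
# paper's proposed method.

from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: Iterable[str],
                       corpus_docs: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)

    items = list(benchmark_items)
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(items) if items else 0.0


if __name__ == "__main__":
    test_set = ["What is the capital of France? Paris is the capital of France."]
    corpus = ["trivia page: Paris is the capital of France and its largest city"]
    print(f"Contaminated items: {contamination_rate(test_set, corpus, n=5):.0%}")
```

Checks like this can only detect contamination after the fact; the concern raised in the abstract is that once test data has leaked into training corpora, reported benchmark scores no longer measure generalization.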
