April 4, 2023, 1:10 a.m. | Yizheng Chen, Zhoujie Ding, Xinyun Chen, David Wagner

cs.CR updates on arXiv.org

We propose and release a new vulnerable source code dataset. We curate the
dataset by crawling security issue websites and extracting vulnerability-fixing
commits and source code from the corresponding projects. Our new dataset
contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable
functions extracted from 7,861 commits. Our dataset covers 305 more projects
than all previous datasets combined. We show that increasing the diversity and
volume of training data improves the performance of deep learning models for
vulnerability detection.


Combining our …
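
To put the reported counts in perspective, the short Python sketch below works out the class balance implied by the figures in the abstract (26,635 vulnerable and 352,606 non-vulnerable functions). The function names and the inverse-frequency weighting scheme are illustrative assumptions, not something the abstract describes; the weights only show one common way such an imbalance might be handled during training.

```python
# Counts quoted in the abstract.
VULNERABLE = 26_635
NON_VULNERABLE = 352_606


def class_balance(pos: int, neg: int) -> dict:
    """Summarize how skewed a binary dataset is."""
    total = pos + neg
    return {
        "total": total,
        "positive_fraction": pos / total,  # share of vulnerable functions
        "neg_per_pos": neg / pos,          # non-vulnerable functions per vulnerable one
    }


def inverse_frequency_weights(pos: int, neg: int) -> dict:
    """Hypothetical per-class loss weights: total / (num_classes * class_count).

    This is an assumed, generic weighting scheme, not the authors' method.
    """
    total = pos + neg
    return {
        "vulnerable": total / (2 * pos),
        "non_vulnerable": total / (2 * neg),
    }


stats = class_balance(VULNERABLE, NON_VULNERABLE)
weights = inverse_frequency_weights(VULNERABLE, NON_VULNERABLE)

print(f"total functions:    {stats['total']}")
print(f"vulnerable share:   {stats['positive_fraction']:.2%}")
print(f"non-vuln per vuln:  {stats['neg_per_pos']:.2f}")
print(f"example weights:    {weights}")
```

Roughly 7% of the functions are labeled vulnerable, so a detector trained on the data sees about 13 non-vulnerable functions for every vulnerable one.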
