Mathematics and Statistics Faculty Publications

Improved Learning from Data Competitions through Strategic Design of Training and Test Data Sets

Christine M. Anderson-Cook, Los Alamos National Laboratory
Lu Lu, University of South Florida
Kary L. Myers, Los Alamos National Laboratory
Kevin R. Quinlan, Pennsylvania State University
Norma Pawley, Los Alamos National Laboratory

Document Type

Article

Publication Date

2019

Keywords

Kaggle competition, supervised learning, design of experiments, simulated data, competition leaderboard, detection, identification, location

Digital Object Identifier (DOI)

https://doi.org/10.1080/08982112.2019.1572186

Abstract

Leveraging the depth and breadth of solutions generated through crowdsourcing can be a powerful accelerator to method development for high consequence problems. While data competitions have become quite popular and prevalent, particularly in supervised learning formats, their implementations by the host are highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the general underlying problem of interest to the host. This article outlines important considerations for strategically designing relevant and informative data sets to maximize the learning outcome from hosting a competition. These include: 1. precisely defining the scope of the problem, 2. encouraging participation by competitors from diverse technical backgrounds, 3. specifying the most interesting solution space in order to encourage improvement and distinguish between competitors, 4. strategically generating data sets that enable testing for interpolation and extrapolation to new scenarios of interest, 5. leveraging design of experiment principles for strategic data design while preventing unintentional artifacts in the competition data sets that competitors could exploit without addressing the real problem of interest, and 6. carefully designing the leaderboard scoring metric to select top solutions that closely match the overall competition goals. The methods are illustrated with a recently completed competition in the context of urban radiological search to evaluate algorithms capable of detecting, identifying, and locating radioactive materials. Simulated data were used in the urban search competition. Ideas for using measured real data in competitions are also suggested.

Was this content written or created while at USF?

Yes

Citation / Publisher Attribution

Quality Engineering, v. 31, issue 4, p. 564-580

Scholar Commons Citation

Anderson-Cook, Christine M.; Lu, Lu; Myers, Kary L.; Quinlan, Kevin R.; and Pawley, Norma, "Improved Learning from Data Competitions through Strategic Design of Training and Test Data Sets" (2019). Mathematics and Statistics Faculty Publications. 136.
https://digitalcommons.usf.edu/mth_facpub/136

