Improved Learning from Data Competitions through Strategic Design of Training and Test Data Sets
Keywords
Kaggle competition, supervised learning, design of experiments, simulated data, competition leaderboard, detection, identification, location
Digital Object Identifier (DOI)
Leveraging the depth and breadth of solutions generated through crowdsourcing can be a powerful accelerator for method development on high-consequence problems. While data competitions have become popular and prevalent, particularly in supervised learning formats, their implementation by the host is highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the broader underlying problem of interest to the host. This article outlines important considerations for strategically designing relevant and informative data sets to maximize the learning outcome from hosting a competition. These include: 1. precisely defining the scope of the problem; 2. encouraging participation by competitors from diverse technical backgrounds; 3. specifying the most interesting solution space in order to encourage improvement and distinguish between competitors; 4. strategically generating data sets that enable testing for interpolation and extrapolation to new scenarios of interest; 5. leveraging design of experiments principles for strategic data design while preventing unintentional artifacts in the competition data sets that competitors could exploit without addressing the real problem of interest; and 6. carefully designing the leaderboard scoring metric to select top solutions that closely match the overall competition goals. The methods are illustrated with a recently completed competition on urban radiological search, which evaluated algorithms for detecting, identifying, and locating radioactive materials. Simulated data were used in the urban search competition; ideas for using measured real data in competitions are also suggested.
Citation / Publisher Attribution
Quality Engineering, vol. 31, no. 4, pp. 564-580
Scholar Commons Citation
Anderson-Cook, Christine M.; Lu, Lu; Myers, Kary L.; Quinlan, Kevin R.; and Pawley, Norma, "Improved Learning from Data Competitions through Strategic Design of Training and Test Data Sets" (2019). Mathematics and Statistics Faculty Publications. 136.