Leveraging Unlabeled Data for Classification
Classification is a form of data analysis that can be used to extract models that predict categorical class labels (Han & Kamber, 2001). Data classification has proven useful in a wide variety of applications. For example, a classification model can be built to categorize bank loan applications as either safe or risky. Building a classification model requires training data that contains multiple independent variables and a dependent variable (the class label). A data record whose class label has a known value is termed “labeled”; a record whose class label is unknown is “unlabeled”. In some situations a large amount of unlabeled data is available alongside only a small amount of labeled data. Using only the labeled data to build classification models can ignore useful information contained in the unlabeled data. Furthermore, unlabeled data is often much cheaper and more plentiful than labeled data, so extracting useful information from it that reduces the need for labeled examples can be a significant benefit (Balcan & Blum, 2005). The default practice is to build a classification model from the labeled data alone and then use it to assign class labels to the unlabeled data. However, when there is not enough labeled data, a model built this way can be biased and far from accurate, and the class labels it assigns to the unlabeled data can then be inaccurate as well. How to leverage the information contained in the unlabeled data to improve the accuracy of the classification model is therefore an important research question. Two streams of research address the challenging issue of how to use unlabeled data appropriately when building classification models. The details are discussed below.
Citation / Publisher Attribution
Leveraging Unlabeled Data for Classification, Y. Yang & B. Padmanabhan (Eds.), Encyclopedia of Data Warehousing and Mining, Second Edition, IGI Global, pp. 1164-1169
Scholar Commons Citation
Yang, Yinghui and Padmanabhan, Balaji, "Leveraging Unlabeled Data for Classification" (2009). School of Information Systems and Management Faculty Publications. 36.