Graduation Year
2004
Document Type
Thesis
Degree
M.S.C.S.
Degree Granting Department
Computer Science
Major Professor
Lawrence O. Hall, Ph.D.
Committee Member
Kevin W. Bowyer, Ph.D.
Committee Member
Dmitry Goldgof, Ph.D.
Keywords
data mining, decision tree, nearest neighbor, distributed learning, classification
Abstract
Committees of classifiers, also called mixtures or ensembles of classifiers, have become popular because they have the potential to improve on the performance of a single classifier constructed from the same set of training data. Bagging and boosting are among the better-known methods of constructing a committee of classifiers. Committees of classifiers are also important because they have the potential to provide a computationally scalable approach to handling massive datasets. When the emphasis is on computationally scalable approaches to handling massive datasets, the individual classifiers are often constructed from a small fraction of the total data. In this context, the ability to improve on the accuracy of a hypothetical single classifier created from all of the training data may be sacrificed.
The design of a committee of classifiers typically assumes that all of the training data is equally available to be assigned to subsets as desired, and that each subset is used to train a classifier in the committee. However, there are some important application contexts in which this assumption is not valid. In many real-life situations, massive datasets are created on a distributed computer, recording the simulation of important physical processes.
Currently, experts visually browse such datasets to search for interesting events in the simulation. This sort of manual search for interesting events in massive datasets is time-consuming. Therefore, one would like to construct a classifier that could automatically label the "interesting" events. The problem is that the dataset is distributed across a large number of processors in chunks that are spatially homogeneous with respect to the underlying physical context in the simulation. Here, a potential solution to this problem using ensembles is explored.
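To make the idea concrete, the following is a minimal sketch (not the thesis's actual method) of training one classifier per disjoint chunk of data and combining the members' predictions by majority vote. It assumes scikit-learn decision trees, and the dataset and chunking below are hypothetical placeholders; in the real setting each chunk would reside on a different processor and be spatially homogeneous rather than a random split.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_committee(chunks):
    # Train one decision tree per disjoint chunk of (X, y) training data.
    committee = []
    for X, y in chunks:
        member = DecisionTreeClassifier()
        member.fit(X, y)
        committee.append(member)
    return committee

def committee_predict(committee, X):
    # Combine the members' predictions by unweighted majority vote.
    votes = np.stack([member.predict(X) for member in committee])  # (n_members, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# Hypothetical usage: simulate three disjoint chunks of a labeled dataset.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(3000, 10))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)
chunks = [(X_all[i::3], y_all[i::3]) for i in range(3)]

committee = train_committee(chunks)
predictions = committee_predict(committee, X_all[:5])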
Scholar Commons Citation
Bhadoria, Divya, "Learning From Spatially Disjoint Data" (2004). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/958