Graduation Year
2010
Document Type
Dissertation
Degree
Ph.D.
Degree Granting Department
Computer Science and Engineering
Major Professor
Lawrence O. Hall, Ph.D.
Co-Major Professor
Dmitry B. Goldgof, Ph.D.
Committee Member
Sudeep Sarkar, Ph.D.
Committee Member
Kevin W. Bowyer, Ph.D.
Keywords
Random Forest, Saliency, Probabilistic Voting, Imbalanced Training Data, Lift
Abstract
We describe an ensemble approach to learning salient spatial regions from arbitrarily
partitioned simulation data. Ensemble approaches for anomaly detection
are also explored. The partitioning comes from the distributed processing requirements
of large-scale simulations. The volume of the data is such that classifiers
can train only on data local to a given partition. Since the data partition reflects
the needs of the simulation, the class statistics can vary from partition to partition.
Some classes will likely be missing from some or even most partitions. We combine
a fast ensemble learning algorithm with scaled probabilistic majority voting in
order to learn an accurate classifier from such data. Since some simulations are
difficult to model without a considerable number of false positive errors, and since
we are essentially building a search engine for simulation data, we order predicted
regions to increase the likelihood that most of the top-ranked predictions are correct
(salient). Results from simulation runs of a canister being torn and from a casing
being dropped show that regions of interest are successfully identified in spite of
the class imbalance in the individual training sets. Lift curve analysis shows that the
use of data-driven ordering methods provides a statistically significant improvement
over the use of the default, natural time step ordering. Significant time is saved for
the end user by allowing an improved focus on areas of interest without the need to
conventionally search all of the data. We have also found that random forests
weighted and distance-based outlier ensemble methods for supervised learning of
anomaly detection provide significant accuracy improvements when compared to
existing methods on the same dataset. Further, distance-based outlier and local
outlier factor ensemble methods for unsupervised learning of anomaly detection
also compare favorably to existing methods.
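
The lift comparison mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard definition of cumulative lift (precision among the top k ranked items divided by the overall fraction of salient items) and is not taken from the dissertation. The function name `cumulative_lift`, the example scores, and the example labels are all hypothetical, and the scores are assumed to come from some ensemble whose details are not shown here.

```python
def cumulative_lift(scores, labels):
    """Cumulative lift after each of the top-k items, ranked by descending score.

    scores: one number per region (higher means predicted more salient).
    labels: 1 if the region is truly salient, 0 otherwise.
    Both inputs are hypothetical; this is the textbook lift definition only.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_salient = sum(labels)
    if total_salient == 0:
        raise ValueError("lift is undefined when no item is salient")

    base_rate = total_salient / len(labels)
    # Sort indices so the highest-scoring regions come first (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    lifts = []
    hits = 0
    for k, idx in enumerate(order, start=1):
        hits += labels[idx]
        lifts.append((hits / k) / base_rate)
    return lifts


# Hypothetical example: four regions listed in time-step order.
labels = [0, 1, 0, 1]
data_driven_scores = [0.1, 0.9, 0.3, 0.8]   # assumed ensemble confidence
time_step_scores = [4, 3, 2, 1]             # keeps the original time-step order

data_driven = cumulative_lift(data_driven_scores, labels)
time_step = cumulative_lift(time_step_scores, labels)
```

A lift above 1 at small k means the top of the ranking is richer in salient regions than a random sample would be. Both curves end at exactly 1 once every item has been examined, so the orderings differ only in how early the salient regions appear.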
Scholar Commons Citation
Shoemaker, Larry, "Ensemble Learning With Imbalanced Data" (2010). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/3589