Graduation Year

2010

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall, Ph.D.

Co-Major Professor

Dmitry B. Goldgof, Ph.D.

Committee Member

Sudeep Sarkar, Ph.D.

Committee Member

Kevin W. Bowyer, Ph.D.

Keywords

Random Forest, Saliency, Probabilistic Voting, Imbalanced Training Data, Lift

Abstract

We describe an ensemble approach to learning salient spatial regions from arbitrarily partitioned simulation data. Ensemble approaches for anomaly detection are also explored. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting in order to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and from a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that the use of data-driven ordering methods provides a statistically significant improvement over the use of the default, natural time step ordering. Significant time is saved for the end user by allowing an improved focus on areas of interest without the need to conventionally search all of the data. We have also found that random forests weighted and distance-based outlier ensemble methods for supervised learning of anomaly detection provide significant accuracy improvements when compared to existing methods on the same dataset. Further, distance-based outlier and local outlier factor ensemble methods for unsupervised learning of anomaly detection also compare favorably to existing methods.
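
The voting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `scaled_probabilistic_vote`, the class labels, and the choice of scaling each partition's vote by a per-partition weight (here, a hypothetical training-set size) are all assumptions made for illustration. Each partition's classifier reports probabilities only for the classes it saw during training, so a class missing from a partition simply contributes nothing from that classifier.

```python
from typing import Dict, List, Tuple

# One vote = (class -> probability from one partition's classifier, weight).
# The weight is a hypothetical scaling factor, e.g. the partition's training size.
Vote = Tuple[Dict[str, float], float]


def scaled_probabilistic_vote(votes: List[Vote]) -> Tuple[str, Dict[str, float]]:
    """Combine per-partition class probabilities into one weighted decision."""
    weight_sum = sum(weight for _, weight in votes)
    if weight_sum <= 0:
        raise ValueError("at least one vote must have positive weight")

    totals: Dict[str, float] = {}
    for probs, weight in votes:
        # Classes a partition never saw are absent from probs and add nothing.
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p

    # Normalize so the combined scores sum to 1 when each input distribution does.
    scores = {label: total / weight_sum for label, total in totals.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Three partitions; the second one contained no "salient" training examples.
votes = [
    ({"salient": 0.9, "unknown": 0.1}, 100.0),
    ({"unknown": 1.0}, 300.0),
    ({"salient": 0.8, "unknown": 0.2}, 200.0),
]
winner, scores = scaled_probabilistic_vote(votes)
```

Under these assumptions, the combined `scores["salient"]` value could also serve as a ranking key, so that regions with higher scores are presented to the user first.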

Scholar Commons Citation

Shoemaker, Larry, "Ensemble Learning With Imbalanced Data" (2010). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/3589

Included in

American Studies Commons
