Graduation Year

2013

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall

Keywords

fuzzy c-means, representative object, scalable, statistical sampling, stopping criterion

Abstract

Clustering algorithms are a primary tool in data analysis, facilitating the discovery of groups and structure in unlabeled data. They are used in a wide variety of industries and applications. Despite their ubiquity, clustering algorithms have a flaw: their run time becomes unacceptable as the number of data objects increases. The need to compensate for this flaw has led to the development of a large number of techniques intended to accelerate them. This need grows every day as collections of unlabeled data grow larger. How does one increase the speed of a clustering algorithm as the number of data objects increases while preserving the quality of the results? This question was studied using the fuzzy c-means clustering algorithm as a baseline. Its performance was compared to that of four of its accelerated variants. Four key design principles of accelerated clustering algorithms were identified. Further study of these principles led to four new contributions to the field of accelerated fuzzy clustering. The first was the identification of a statistical technique that can estimate the minimum amount of data needed to ensure a multinomial, proportional sample. This technique was adapted to work with accelerated clustering algorithms. The second was the development of a stopping criterion for incremental algorithms that minimizes the amount of data required while maximizing quality. The third and fourth were new ways of combining representative data objects. Five new accelerated algorithms were created to demonstrate the value of these contributions. One additional discovery made during the research was that the key design principles most often improve performance when applied in tandem. This discovery was applied during the creation of the new accelerated algorithms.
Experiments show that the new algorithms improve speedup with minimal quality loss, outperform related methods, and occasionally improve on both the speed and the quality of the base algorithm.
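
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard fuzzy c-means iteration, not the dissertation's implementation or any of its accelerated variants. The function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, the tolerance, and the random initialization are all assumptions made for illustration. Every iteration touches every data object for every cluster, which is why run time grows with the size of the data set.

```python
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Illustrative fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centers, u), where u[i][j] is the membership of point i in
    cluster j. All parameter names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalized so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of all points weighted by membership^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            wsum = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / wsum
                for d in range(dim)
            ))

        # Membership update from squared distances to each center.
        shift = 0.0
        for i in range(n):
            d2 = [
                sum((points[i][d] - centers[j][d]) ** 2 for d in range(dim))
                for j in range(c)
            ]
            if min(d2) == 0.0:
                # Point coincides with a center: give it full membership there.
                new = [1.0 if x == 0.0 else 0.0 for x in d2]
            else:
                new = [x ** (-1.0 / (m - 1.0)) for x in d2]
            total = sum(new)
            new = [v / total for v in new]
            shift = max(shift, max(abs(new[j] - u[i][j]) for j in range(c)))
            u[i] = new

        # Stop once no membership changed by more than the tolerance.
        if shift < tol:
            break

    return centers, u
```

Calling `fuzzy_c_means(points, c=2)` on a small list of 2-D tuples returns two centers and one membership row per point. The sketch does not guard against a cluster losing all of its membership weight.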
