Graduation Year

2007

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall, Ph.D.

Keywords

Partitioning, Hard-c-means, Fuzzy-c-means, Scalability, Merging, Streaming

Abstract

Clustering algorithms are an important tool for data mining and data analysis. They are unsupervised learning algorithms: they group patterns using a similarity metric, without an external teacher or labels. Clustering algorithms are generally iterative and computationally intensive. For data sets larger than memory, every iteration requires disk accesses, which makes the algorithms unacceptably slow. Data can instead be processed in chunks that fit into memory, which provides a scalable framework, and multiple processors may be used to process chunks in parallel. The clustering solutions from the chunks together form an ensemble, which can be merged to provide a global solution. Merging an ensemble of clustering solutions is therefore important for providing a scalable framework.
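
The chunk-and-merge idea can be illustrated with a minimal sketch. It is not the dissertation's algorithm: it assumes hard c-means (Lloyd's iterations) on each chunk and a merge step that re-clusters the chunk centroids weighted by cluster size. The function names, the chunk size, and the synthetic two-blob data are all hypothetical, and every chunk is assumed to hold at least k points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_hcm(points, weights, k, iters=20, seed=0):
    """Weighted hard c-means (hypothetical helper).

    Returns the k centroids and the total weight assigned to each one
    during the last assignment step.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = [list(p) for p in rng.sample(points, k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        for j in range(k):
            if totals[j] > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = [s / totals[j] for s in sums[j]]
    return centroids, totals

def cluster_in_chunks(data, k, chunk_size):
    """Cluster each chunk separately, then merge the resulting ensemble."""
    ensemble, sizes = [], []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        cents, wts = weighted_hcm(chunk, [1.0] * len(chunk), k)
        ensemble.extend(cents)
        sizes.extend(wts)
    # Merge step (an assumption): cluster the centroids, weighted by size.
    return weighted_hcm(ensemble, sizes, k)

# Synthetic data: two well-separated 2-D blobs, shuffled.
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]
rng.shuffle(data)

centroids, weights = cluster_in_chunks(data, k=2, chunk_size=100)
```

In this sketch only one chunk plus the small set of centroids is held in memory at a time, and the per-chunk calls are independent, so they could be run on separate processors.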

Combining multiple clustering solutions, or partitions, is also important for obtaining a robust clustering solution, merging distributed clustering solutions, and providing a framework for knowledge reuse and privacy-preserving data mining. Here we address combining multiple clustering solutions in a scalable framework. We also propose algorithms for incrementally clustering large or very large data sets. We propose an algorithm that can cluster large data sets in a single pass, and we extend it to cluster infinite data streams. Such incremental/online algorithms can be used for real-time processing because they do not revisit data and can process data streams under constraints on buffer size and computation time. Thus, different frameworks and algorithms are proposed to address scalability issues in different settings.
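
One way to picture single-pass clustering is the sketch below. It is an illustration under stated assumptions, not the dissertation's exact algorithm: each chunk is clustered with a weighted fuzzy c-means, the result is condensed into k weighted centers, and those centers are carried into the next chunk as weighted points, so no data is revisited. The helper names, the farthest-first initialization, the fuzzifier m = 2, and the synthetic data are all hypothetical choices.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initialization (an assumption): spread out the centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def weighted_fcm(points, weights, k, m=2.0, iters=50):
    """Weighted fuzzy c-means (hypothetical helper); requires m > 1.

    Returns the centers and, for each center, the membership-weighted
    mass of the points, computed from the last iteration's memberships.
    """
    dim = len(points[0])
    centers = farthest_first(points, k)
    power = -1.0 / (m - 1.0)
    memberships = []
    for _ in range(iters):
        memberships = []
        for p in points:
            # Membership is proportional to d^(-2/(m-1)); epsilon avoids 0.
            inv = [(sq_dist(p, c) + 1e-12) ** power for c in centers]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        for i in range(k):
            coef = [w * u[i] ** m for w, u in zip(weights, memberships)]
            denom = sum(coef)
            centers[i] = [sum(c * p[d] for c, p in zip(coef, points)) / denom
                          for d in range(dim)]
    mass = [sum(w * u[i] for w, u in zip(weights, memberships))
            for i in range(k)]
    return centers, mass

def single_pass_fcm(chunks, k):
    """Process chunks in order; carry only k weighted centers forward."""
    centers, weights = [], []
    for chunk in chunks:
        points = centers + [list(p) for p in chunk]
        wts = weights + [1.0] * len(chunk)
        centers, weights = weighted_fcm(points, wts, k)
    return centers, weights

# Synthetic data: two well-separated 2-D blobs, delivered as a stream of chunks.
rng = random.Random(7)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]
rng.shuffle(data)
chunks = (data[i:i + 100] for i in range(0, len(data), 100))

centers, weights = single_pass_fcm(chunks, k=2)
```

Because `chunks` is consumed as an iterator, the same loop could in principle run over an unbounded stream; the buffer then holds one chunk and k centers at any time.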

To our knowledge, we are the first to introduce algorithms for merging cluster ensembles that are scalable, in terms of time and space complexity, on large real-world data sets. We are also the first to introduce single-pass and streaming variants of the fuzzy c-means algorithm. We have evaluated the performance of our proposed frameworks and algorithms on both artificial and large real-world data sets. We compare our algorithms with other relevant algorithms. These comparisons show the scalability of the new algorithms and the effectiveness of the partitions they create.

Scholar Commons Citation

Hore, Prodip, "Scalable frameworks and algorithms for cluster ensembles and clustering data streams" (2007). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/2222

Included in

American Studies Commons
