Graduation Year
2024
Document Type
Dissertation
Degree
Ph.D.
Degree Name
Doctor of Philosophy (Ph.D.)
Degree Granting Department
Mathematics and Statistics
Major Professor
Kandethody Ramachandran, Ph.D.
Committee Member
Lu Lu, Ph.D.
Committee Member
Seung-Yeop Lee, Ph.D.
Committee Member
Feng Cheng, Ph.D.
Keywords
Statistical Machine Learning, Natural Language Processing (NLP), Sentiment Analysis, Data Science, Ensemble Learning, Random Forest, Transformer
Abstract
Artificial Intelligence (AI) is now part of everyday life. Machine Learning (ML), a branch of AI, has developed rapidly over the past two decades, evolving from statistical learning approaches that emphasize probability and statistics for modeling data, such as Support Vector Machines (SVMs) for classification and regression tasks, to ensemble learning techniques such as Random Forest, Gradient Boosting Machines (GBMs), and stacking. Ensemble learning has become a pivotal concept in contemporary machine learning, enabling practitioners to combine multiple models to improve generalization, accuracy, and robustness. As the field of machine learning progresses, ensemble techniques are poised to remain indispensable tools for tackling intricate real-world challenges.
Since the publication of Google's paper “Attention Is All You Need” at Neural Information Processing Systems 30 (NIPS 2017), the transformer architecture has seen widespread adoption across Natural Language Processing (NLP). It has been employed in a range of settings, from applications using the full seq2seq (encoder–decoder) architecture to those built on only the encoder or decoder component, as exemplified by the increasing popularity of models such as BERT and GPT, respectively. In this dissertation, we explore practical scenarios that illustrate the extensive and diverse NLP applications enabled by the transformer architecture.
This dissertation comprises three distinct research studies. The first study applies ensemble learning to a deception detection problem based on the Miami University Deception Detection Database (MU3D); it introduces a novel approach in which we crafted an ensemble learning model built on random forest and evaluated its performance in the domain of deception detection. The second study is a statistically grounded Natural Language Processing (NLP) analysis in the computer science field; its novelty lies in comparing sentiment before and after text summarization of unstructured Twitter (now rebranded as X) data, using a fine-tuned large language model (LLM) for the summarization. The impact of both studies is to combine statistical exploratory data analysis with machine learning (ML) algorithms and large language models to solve real-life problems and to inspire other researchers in this field. The third study focuses on machine learning and natural language processing to address practical challenges in pharmacovigilance analysis, with a particular emphasis on uncovering Drug-Drug Interactions (DDIs). The discussion is organized into multiple sub-sections, ranging from foundational concepts to the experimental framework, providing a theoretical exposition of the subject matter.
Scholar Commons Citation
Bu, Kun, "Advancing Text Summarization and Classification: Deep Insights from Transformer-Based Statistical Learning" (2024). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/10800
Included in
Artificial Intelligence and Robotics Commons, Medicinal Chemistry and Pharmaceutics Commons, Statistics and Probability Commons
