Graduation Year

2024

Document Type

Dissertation

Degree

Ph.D.

Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Mathematics and Statistics

Major Professor

Kandethody Ramachandran, Ph.D.

Committee Member

Lu Lu, Ph.D.

Committee Member

Seung-Yeop Lee, Ph.D.

Committee Member

Feng Cheng, Ph.D.

Keywords

Statistical Machine Learning, Natural Language Processing (NLP), Sentiment Analysis, Data Science, Ensemble Learning, Random Forest, Transformer

Abstract

Artificial Intelligence (AI) is now part of daily human life. Machine Learning (ML), one branch of AI, has developed rapidly over the past two decades, progressing from statistical learning approaches, which emphasize the use of probability and statistics to model data, such as Support Vector Machines (SVMs) for classification and regression tasks, to ensemble learning techniques such as Random Forest, Gradient Boosting Machine (GBM), and stacking. Ensemble learning has become a pivotal concept in contemporary machine learning, allowing practitioners to combine multiple models to improve generalization, accuracy, and robustness. As the field progresses, ensemble techniques are poised to remain indispensable tools for tackling intricate real-world challenges.

Since the publication of Google's paper “Attention Is All You Need” in Advances in Neural Information Processing Systems 30 (NIPS 2017), the transformer architecture has been widely adopted across Natural Language Processing (NLP). It has been used in settings ranging from applications that use the entire sequence-to-sequence (encoder-decoder) architecture to those that use only one component, as exemplified by the increasingly popular encoder-only BERT and decoder-only GPT models. In this dissertation, we explore practical scenarios that illustrate the extensive and diverse applications of NLP enabled by the transformer architecture.

This dissertation contains three distinct research studies. The first study applies ensemble learning to a deception detection problem based on the Miami University Deception Detection Database (MU3D). It introduces a novel approach in which we built an ensemble learning model based on random forest and evaluated its performance in the domain of deception detection. The second study is a statistics-related Natural Language Processing (NLP) study in the computer science field. Its novelty is that we compared sentiment before and after text summarization for unstructured Twitter data (the company has since been renamed X), using a fine-tuned large language model (LLM) for the summarization. The contribution of both studies is to combine statistical exploratory data analysis with machine learning (ML) algorithms and large language models to solve real-life problems and to inspire other researchers in this field. The third study applies machine learning and natural language processing to practical challenges in pharmacovigilance analysis, with a particular focus on uncovering Drug-Drug Interactions (DDIs). The discussion is divided into multiple sub-sections, ranging from the foundational concepts to the experimental framework, providing a theoretical exposition of the subject matter.
