Graduation Year
2024
Document Type
Dissertation
Degree
Ph.D.
Degree Name
Doctor of Philosophy (Ph.D.)
Degree Granting Department
Computer Science and Engineering
Major Professor
Yicheng Y. Tu, Ph.D.
Co-Major Professor
Kaiqi K. Xiong, Ph.D.
Committee Member
Ankur A. Mali, Ph.D.
Committee Member
Seungbae S. Kim, Ph.D.
Committee Member
Hadi H. Gard, Ph.D.
Keywords
Deep Learning, DNA, Metagenomic, Robustness
Abstract
Machine learning (ML) has become a transformative force in high-risk domains such as genomics and cybersecurity, where accurate predictions and robust defenses are essential. This dissertation advances ML frameworks in these areas by developing methods to enhance predictive power in health applications and assess vulnerabilities in machine learning systems.
In genomics, the work addresses challenges in Non-Invasive Prenatal Testing (NIPT) for monogenic disorders by proposing a deep learning model that reconstructs the fetal genome from maternal plasma cell-free DNA (cfDNA) and parental whole-genome sequencing (WGS) data. The model reaches an overall accuracy of 89.3% ± 1.8% in single-nucleotide variation (SNV) prediction, surpassing current methods. Next, we introduce GenomGPT, a large language model adapted for precise classification of metagenomic DNA sequences. This model leverages advanced tokenization and attention mechanisms and achieves 94.7% classification accuracy on mixed-species samples, improving species differentiation in complex genomic analyses and in applications such as environmental monitoring and forensic analysis.
We also investigate the predictive potential of cfDNA as a biomarker for Gestational Diabetes Mellitus (GDM) by analyzing cfDNA profiles across pregnancy trimesters in two independent cohorts. By focusing on cfDNA dynamics, we develop a model that predicts GDM in the first trimester with an area under the curve (AUC) of 0.919. This result underscores the value of cfDNA as an early biomarker for GDM and offers a non-invasive approach to risk stratification. Our results suggest that cfDNA profiling could facilitate timely intervention and personalized treatment strategies, ultimately improving outcomes for both mother and child.
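For readers unfamiliar with the AUC metric reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from binary labels and model scores. It is not the dissertation's evaluation code; the function name `roc_auc` and the example labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example
    (ties count as half a win).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = condition present, 0 = condition absent.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(example_labels, example_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.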
In cybersecurity, this dissertation focuses on the vulnerability of GPT-based classifiers to adversarial attacks, a critical concern for ML in high-stakes applications. We propose the Stepping Autobreaker, a non-targeted adversarial attack that introduces minimal, imperceptible perturbations to mislead GPT-based classifiers with an attack success rate of over 95%. This method, evaluated across image classification tasks, reveals the susceptibility of Transformer-based models to adversarial manipulation and underscores the need for robust defenses in real-time applications.
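To make the idea of a non-targeted, small-perturbation attack concrete, the sketch below applies a generic iterative sign-step attack to a toy linear classifier. This is not the Stepping Autobreaker and makes no claim about how it works; the classifier, the function names, and the parameters `step`, `eps`, and `max_steps` are all assumptions introduced for illustration.

```python
def predict(w, b, x):
    """Toy linear classifier: class 1 if w.x + b > 0, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def stepwise_untargeted_attack(w, b, x, step=0.01, eps=0.1, max_steps=100):
    """Nudge x in small steps until the predicted class changes.

    Each coordinate stays within eps of its original value, so the
    perturbation is bounded. Returns (adversarial_input, success_flag).
    """
    original = predict(w, b, x)
    # For a linear model the gradient of the score w.r.t. x is w, so
    # moving against sign(w) lowers the score and moving along it raises it.
    direction = -1 if original == 1 else 1
    adv = list(x)
    for _ in range(max_steps):
        if predict(w, b, adv) != original:
            return adv, True  # any class other than the original counts
        adv = [
            min(max(ai + direction * step * sign(wi), xi - eps), xi + eps)
            for ai, wi, xi in zip(adv, w, x)
        ]
    return adv, predict(w, b, adv) != original


# Hypothetical weights and input, chosen only for illustration.
w, b, x = [1.0, -2.0], 0.0, [0.5, 0.2]
adv, ok = stepwise_untargeted_attack(w, b, x)
print(ok, max(abs(a - xi) for a, xi in zip(adv, x)))
```

The attack is non-targeted in the sense that it stops as soon as the prediction differs from the original class, without steering toward a particular label.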
This dissertation contributes novel ML methodologies for genomic prediction and security analysis in high-risk domains. It emphasizes the importance of accuracy and robustness in health and security applications and offers insights toward more secure and effective ML frameworks.
Scholar Commons Citation
Hu, Chengbin, "Analyzing and Extending Machine Learning Frameworks on High Risk Domains" (2024). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/10633