Graduation Year
2024
Document Type
Dissertation
Degree
Ph.D.
Degree Name
Doctor of Philosophy (Ph.D.)
Degree Granting Department
Computer Science and Engineering
Major Professor
Yicheng Tu, Ph.D.
Committee Member
Hao Zheng, Ph.D.
Committee Member
Robert Karam, Ph.D.
Committee Member
Rays Jiang, Ph.D.
Committee Member
Ashwin Parthasarathy, Ph.D.
Keywords
Parallel Algorithms, General-Purpose Graphics Processing Unit, Vector Database, Supercomputing
Abstract
Advancements in Next-Generation Sequencing (NGS) have dramatically reduced the cost and increased the speed of DNA sequencing. However, the resulting influx of data necessitates efficient and robust analysis tools, particularly for the complex task of aligning short NGS reads to reference genomes such as the human genome. We explore new computational strategies and hardware acceleration to optimize this critical alignment process. This dissertation is structured around three studies. First, we introduce a novel approach to dynamic memory allocation tailored for massively parallel systems, particularly Graphics Processing Units (GPUs), to support NGS alignment and other applications. Unlike traditional memory allocators, which rely on global data structures that can bottleneck parallel processing, our approach widely distributes memory information and employs thread-level random search procedures for memory allocation. Our design consistently outperformed the current state of the art, by up to two orders of magnitude. Second, we present an implementation of BWA-MEM, one of the gold-standard NGS aligners, on GPUs. Our research addresses and resolves significant challenges in translating BWA-MEM’s CPU-based operations to the GPU architecture, achieving a substantial speedup: up to 5.8 times faster than the original BWA-MEM and 3.2 times faster than its newer CPU-optimized version, BWA-MEM2. Finally, we discuss a novel machine-learning-based approach for NGS alignment that departs from the traditional seed-and-extend approach. We encode NGS sequences as learned latent vectors using machine learning models and design vector-database indexing techniques to efficiently identify alignment locations. An implementation of this new approach on CPUs has comparable accuracy and throughput to BWA-MEM2 for short reads and is more than twenty times faster for long reads.
Scholar Commons Citation
Pham, Minh H., "High-Performance Computing in Next-Generation Sequencing Read Alignment" (2024). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/10551