Graduation Year
2025
Document Type
Dissertation
Degree
Ph.D.
Degree Name
Doctor of Philosophy (Ph.D.)
Degree Granting Department
Computer Science and Engineering
Major Professor
Xinming Ou, Ph.D.
Committee Member
Lawrence Hall, Ph.D.
Committee Member
Jay Ligatti, Ph.D.
Committee Member
Nasir Ghani, Ph.D.
Committee Member
Doina Caragea, Ph.D.
Committee Member
Ankit Shah, Ph.D.
Keywords
App Representation, Data Leakage, Evaluation Metrics
Abstract
Machine learning (ML) algorithms have achieved remarkable success across various domains, including cybersecurity. Inspired by these advancements, the academic security community has explored numerous ML-based approaches for Android malware detection. While ML holds significant promise in this domain, its practical deployment faces substantial challenges, including data collection, feature selection, app representation across different models, performance instability across datasets, and inherent limitations of learning-based malware detection. These challenges can lead to overly optimistic detection results and weaken the reliability of malware detection frameworks.
Android malware detection has been extensively studied using both traditional ML and deep learning (DL) approaches. Although many state-of-the-art detection models, particularly those based on DL, claim superior performance, they are often evaluated on a limited scale without comprehensive benchmarking against traditional ML models across diverse datasets. This raises concerns about the robustness of DL-based approaches and the potential oversight of simpler, more efficient ML models. In this study, we conduct a systematic evaluation of Android malware detection models across four datasets: three publicly available, recently published datasets and a large-scale dataset we systematically collected. We implement a range of traditional ML and advanced DL models, revealing that while DL models can achieve strong performance, they are often compared against an insufficient number of traditional ML baselines. In many cases, simpler and more computationally efficient ML models yield comparable or even superior results, underscoring the need for rigorous benchmarking in Android malware detection research.
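To make the benchmarking point concrete, the sketch below is an illustrative example only (not the dissertation's evaluation code): it trains a few standard scikit-learn baselines on an app feature matrix and reports their F1 scores for comparison against a DL model's reported score. X_train, y_train, X_test, and y_test are assumed to be prepared elsewhere, e.g. binary permission or API-call indicator vectors with malware/benign labels.

    # Illustrative traditional-ML baseline benchmark (not the dissertation's code).
    # Assumes X_train/X_test are app feature matrices and y_train/y_test are labels.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    def benchmark_baselines(X_train, y_train, X_test, y_test):
        baselines = {
            "RandomForest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
            "LinearSVC": LinearSVC(),
            "kNN": KNeighborsClassifier(n_neighbors=5),
        }
        scores = {}
        for name, model in baselines.items():
            model.fit(X_train, y_train)
            scores[name] = f1_score(y_test, model.predict(X_test))
        return scores  # compare these F1 scores against the one reported for a DL model

If simple models such as these match or exceed a DL model's score on the same splits, the added complexity of the DL approach needs stronger justification, which is the kind of benchmarking the abstract argues for.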
A critical aspect of ML-based malware detection is the numerical representation of apps for training and testing. We identify a widespread phenomenon in which distinct Android apps yield identical or nearly identical representations. In particular, a significant portion of test samples may closely resemble or exactly match representations of apps in the training dataset, leading to data leakage. This issue inflates the reported performance of ML models on the test set, creating an illusion of generalizability. Beyond overly optimistic assessments, data leakage can also produce qualitatively different research conclusions. We present two case studies to illustrate this impact and further examine the real-world implications using a leak-aware detection framework. Our findings demonstrate how such qualitatively different conclusions can lead to incorrect recommendations regarding the most suitable ML models for practical deployment.
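As a rough illustration of the leakage described above (a minimal sketch under the assumption that apps are encoded as binary feature vectors, not the dissertation's leak-aware framework), the following Python functions estimate how many test-set representations exactly duplicate, or nearly duplicate, a training-set representation:

    # Illustrative train/test overlap check (not the dissertation's framework).
    # Assumes X_train and X_test are binary feature matrices of the same dtype,
    # with one row per app.
    import numpy as np

    def exact_duplicate_rate(X_train, X_test):
        """Fraction of test rows whose representation also appears in training."""
        train_rows = {row.tobytes() for row in np.ascontiguousarray(X_train)}
        hits = sum(row.tobytes() in train_rows for row in np.ascontiguousarray(X_test))
        return hits / len(X_test)

    def near_duplicate_rate(X_train, X_test, max_diff=2):
        """Fraction of test rows within max_diff differing features of some training row."""
        hits = 0
        for x in X_test:
            dists = np.count_nonzero(X_train != x, axis=1)  # Hamming distance to each training row
            if dists.min() <= max_diff:
                hits += 1
        return hits / len(X_test)

A high rate from either function suggests that reported test performance partly reflects memorization of training apps rather than generalization to genuinely unseen samples.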
Scholar Commons Citation
Liu, Guojun, "Beyond the Hype: The Fundamental Challenges of Machine Learning-Based Android Malware Detection in Cybersecurity" (2025). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/10881
