Doctor of Philosophy (Ph.D.)
Degree Granting Department
Computer Science and Engineering
Lawrence Hall, Ph.D.
Dmitry Goldgof, Ph.D.
Sudeep Sarkar, Ph.D.
Kaiqi Xiong, Ph.D.
Mingyang Li, Ph.D.
Data Mining, Forecasting, Machine Learning, Radiology, Social Networks
Real-world data used to build machine learning models varies widely in size and characteristics. Small datasets, big datasets, and imbalanced datasets each pose different challenges when training machine learning models. Models trained on a small number of observations tend to overfit the training data and produce inaccurate results on new data. With big data, learning efficiently from a very large volume of data in a short time becomes important. With an imbalanced dataset, learning is usually biased towards the majority class, and appropriate metrics are needed to assess model performance.
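The point about choosing appropriate metrics for imbalanced data can be illustrated with a small sketch. The labels below are hypothetical and the two helper functions are written only for illustration; they compare plain accuracy with balanced accuracy (the mean of per-class recall) for a model that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class carries equal weight."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the examples that truly belong to this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical data: 95 majority-class (0) and 5 minority-class (1) examples.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy looks strong even though no minority-class example is recognized, while balanced accuracy exposes the failure.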
Deep learning, a subset of machine learning and one of the fastest-growing areas of AI, is also affected by data size and characteristics. Exploring solutions to these real-world data challenges is therefore important. In this dissertation, we provide multiple solutions for applying deep learning to datasets from multiple fields, covering small datasets, very big datasets, and imbalanced datasets. We focus our exploration on medical image data and social network data. Medical image datasets are usually small for reasons such as privacy and cost. Social network data, on the other hand, is generally big because of the large number of users who interact on social network platforms.
In medical imaging, prior research has shown the feasibility of using a pre-trained deep neural network as a feature extractor when only a small dataset is available. Building on this, we propose a novel image feature extraction method that uses pre-trained deep neural networks to predict survival time from brain tumor magnetic resonance images. We also introduce a novel method for over-sampling minority-class examples at the image level, rather than at the feature-vector level, as a solution to the problem of imbalanced medical imaging data.
For social network analysis and forecasting, we introduce a decomposition approach to address the problem of long-term simulation at fine time granularity, where the goal is to predict different user activities at hourly granularity over a long period of time. In addition, for simulating user activities across multiple platforms, we introduce a sequence model approach that provides efficient long-term cross-platform simulation.
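As a rough illustration of what over-sampling at the image level (rather than the feature-vector level) can look like, the sketch below adds flipped copies of minority-class images until a binary dataset is balanced. The abstract does not describe the dissertation's actual procedure, so the function names, the choice of flips as transformations, and the toy images are all assumptions made for this example.

```python
import random


def flip_horizontal(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror a 2-D image top to bottom."""
    return [row[:] for row in image[::-1]]


def oversample_images(images, labels, minority_label, seed=0):
    """Append transformed copies of minority images until classes balance.

    Assumptions: the task is binary and flips preserve the label. This is
    an illustrative stand-in, not the method from the dissertation.
    """
    rng = random.Random(seed)
    transforms = [flip_horizontal, flip_vertical]
    minority = [img for img, lab in zip(images, labels) if lab == minority_label]
    if not minority:
        raise ValueError("no examples with the minority label")
    n_majority = len(labels) - len(minority)
    out_images, out_labels = list(images), list(labels)
    # Create new pixel-level examples instead of interpolating feature vectors.
    for _ in range(n_majority - len(minority)):
        source = rng.choice(minority)
        out_images.append(rng.choice(transforms)(source))
        out_labels.append(minority_label)
    return out_images, out_labels


# Hypothetical toy data: six 2x2 "images", only the last is minority class.
images = [[[i, i + 1], [i + 2, i + 3]] for i in range(6)]
labels = [0, 0, 0, 0, 0, 1]
balanced_images, balanced_labels = oversample_images(images, labels, minority_label=1)
print(balanced_labels.count(0), balanced_labels.count(1))  # 5 5
```

Because the new examples are images, they can be passed through the same network as the originals; whether a given transformation preserves the label depends on the imaging task.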
We demonstrate that the proposed methods handle real-world data at different extremes of size and characteristics well and outperform baseline approaches and other machine learning approaches. Our explorations focus only on medical image data and social network data. However, the proposed methods are general enough to handle real-world labeled data that is small, big, or imbalanced. As a result, the methods proposed in this dissertation can be applied in other research fields as well.
Scholar Commons Citation
Liu, Renhao, "Deep Learning Predictive Modeling with Data Challenges (Small, Big, or Imbalanced)" (2020). USF Tampa Graduate Theses and Dissertations.