Graduation Year

2022

Document Type

Dissertation

Degree

Ph.D.

Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Lawrence O. Hall, Ph.D.

Committee Member

Adriana Iamnitchi, Ph.D.

Committee Member

John Skvoretz, Ph.D.

Committee Member

Yu Sun, Ph.D.

Committee Member

Mingyang Li, Ph.D.

Keywords

Graph embedding, Machine Learning, Neural Networks, SMOTE, XGBoost

Abstract

In the overall history of technological innovation, social media has existed for only a brief time, yet its influence is undeniable. Researchers have found that it can be used to influence elections, spread health misinformation, and aid financial pump-and-dump schemes. With this in mind, it is clear that more research is needed to predict the spread of information on social media in order to combat its malicious use.

To that end, in this dissertation, we explore the use of Machine Learning algorithms to perform time series forecasting and user-level activity prediction in social media. We address several challenges that come with predicting social media activity, such as (1) accounting for the differences in user engagement among different social media platforms, (2) identifying the data required for accurate predictions, (3) selecting the appropriate prediction framework, and (4) selecting appropriate evaluation metrics.

We address the aforementioned challenges in multiple ways. First, we introduce an end-to-end simulator called the Volume Audience Match simulator, or VAM. VAM comprises two modules: (1) the Volume Prediction Module and (2) the User-Assignment Module. VAM performs both time series prediction and user-assignment. It predicts the overall volume time series of (1) new users, (2) old users, and (3) activities. It then assigns the predicted actions to both old and new users over time.

We evaluate VAM’s predictive performance on two geopolitical datasets: the Venezuela 2019 Twitter dataset (Vz19) and the China-Pakistan Economic Corridor Twitter dataset (CPEC). We show that VAM outperforms various traditional time series baselines for the Volume-Prediction task, specifically the Persistence Baseline, ARIMA, ARMA, AR, and MA models. We also show that it outperforms the Persistence Baseline and several state-of-the-art embedding methods for the user-assignment task, specifically tNE-node2vec-S, tNE-node2vec-H, and tNE-DeepWalk.
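
For readers unfamiliar with the Persistence Baseline, the following is a minimal sketch and not the dissertation's code. It assumes the common definition in which the forecast for the next window simply copies the most recently observed window of the same length. The function names `persistence_forecast` and `rmse`, and the activity counts, are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def persistence_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Forecast the next `horizon` steps by copying the last `horizon` observations.

    Assumption: this is the usual "repeat the most recent window" definition;
    the dissertation may define its Persistence Baseline differently.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    if len(history) < horizon:
        raise ValueError("history must contain at least `horizon` observations")
    return list(history[-horizon:])


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical hourly activity counts for one topic (made-up numbers).
observed = [12, 15, 9, 30, 28, 11]
forecast = persistence_forecast(observed, horizon=3)  # copies the last 3 values
```

A baseline of this kind needs no training, which is why it is a natural reference point when judging whether a learned model adds value.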

We also find that exogenous features from Reddit and YouTube improve VAM’s time series prediction accuracy. Furthermore, we perform an in-depth analysis of VAM’s performance using a wide range of metrics that analyze many dimensions of the resulting predictions, such as magnitude, burstiness, temporal pattern matching, user-level prediction accuracy, and overall network structure. Lastly, we compare the XGBoost-based VAM models to the Recurrent Neural Network (RNN)-based VAM models, and find that the XGBoost models are much faster to train and more accurate. This is notable because RNNs are one of the most frequently used machine learning algorithms for social media prediction. Perhaps this insight will prompt other researchers to consider using XGBoost instead of RNNs for their own modeling purposes.

We also introduce a variant of VAM that performs data augmentation, called SMOTER-VAM. This version of VAM uses data augmentation as a preprocessing step via two different algorithms: SMOTER-Binning (SMOTER-B) and SMOTER-No-Binning (SMOTER-NB). These two algorithms are variants of the SMOTER algorithm (Synthetic Minority Oversampling Technique for Regression).
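
As a rough illustration of the idea behind SMOTER-style augmentation, the sketch below generates synthetic training samples by interpolating between a seed sample and one of its nearest neighbors, applying the same interpolation fraction to the features and to each of the continuous outputs. It is a generic sketch under stated assumptions, not an implementation of SMOTER-B or SMOTER-NB, whose details are not given here. It assumes the inputs already contain only the rare samples to be oversampled, and the name `synthesize` and its parameters are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def nearest_neighbors(seed_idx: int, X: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k samples closest (Euclidean distance) to X[seed_idx]."""
    dists = [(math.dist(X[seed_idx], x), i) for i, x in enumerate(X) if i != seed_idx]
    dists.sort()
    return [i for _, i in dists[:k]]


def synthesize(
    X: Sequence[Vector],
    Y: Sequence[Vector],
    n_new: int,
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Create n_new synthetic (features, outputs) pairs by linear interpolation.

    Assumption: X and Y hold only the rare samples selected for oversampling.
    """
    if len(X) != len(Y) or len(X) < 2:
        raise ValueError("need at least two samples, with matching X and Y lengths")
    rng = rng or random.Random(0)
    new_X: List[List[float]] = []
    new_Y: List[List[float]] = []
    for _ in range(n_new):
        i = rng.randrange(len(X))                  # seed sample
        j = rng.choice(nearest_neighbors(i, X, k))  # one of its neighbors
        g = rng.random()                            # interpolation fraction in [0, 1)
        new_X.append([a + g * (b - a) for a, b in zip(X[i], X[j])])
        # Every continuous output is interpolated with the same fraction.
        new_Y.append([a + g * (b - a) for a, b in zip(Y[i], Y[j])])
    return new_X, new_Y
```

For purely numeric features, using the same fraction for the outputs is equivalent to weighting the two original targets by the synthetic point's distance to each parent sample.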

Two different VAM models are trained on the two augmented datasets. We found that using the SMOTER-B and SMOTER-NB algorithms improves VAM’s performance on time series prediction, especially on low-volume topics. These SMOTER variations are also generalizable to any machine learning algorithm and any dataset that has multiple continuous outputs. Therefore, these variations have many potential applications beyond VAM or social media time series prediction.

Lastly, we analyze the differences between two commonly used baselines within the realm of social media prediction: the Persistence Baseline and ARIMA models. We evaluate their performance on different datasets and in different contexts, and through our analysis, we better understand in which situations the baselines are useful and why.
