Graduation Year

2023

Document Type

Dissertation

Degree

Ph.D.

Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Adriana Iamnitchi, Ph.D.

Co-Major Professor

Lawrence O. Hall, Ph.D.

Committee Member

John Skvoretz, Ph.D.

Committee Member

Giovanni L. Ciampaglia, Ph.D.

Committee Member

Michael Maness, Ph.D.

Keywords

Online Platforms, Information Diffusion, Machine Learning, Contrastive Learning, Deep Learning

Abstract

The main objective of this dissertation is to develop models that predict and investigate the spread of information in social media over time. In this context, we treat topics of discussion as the information that spreads, so the forecasting task is to predict the number of messages per day over a future interval of time. We take a data-driven approach, in which we compare our results with real datasets from a variety of socio-political contexts and from multiple social media platforms, specifically Twitter and YouTube.

We identified a number of challenges related to forecasting social media time series per topic. First, it was not clear how well-established models, tested for time series forecasting in other contexts, would perform in the context of social media. Via data-driven studies on multiple datasets and platforms, we show that different models perform better under different performance metrics, and that the best baseline model for social media activity is generally to simply replay the recent past. However, in the case of unexpected exogenous events (such as political protests), when accurate forecasting might be most helpful, replaying the recent past performs poorly.
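As a rough illustration of the "replay the recent past" baseline mentioned above, the following minimal sketch repeats the most recent observed daily counts as the forecast. The function name `replay_forecast`, the list-of-daily-counts input, and the choice to cycle when the history is shorter than the horizon are assumptions made for this example; the dissertation's exact baseline definition may differ.

```python
from typing import List, Sequence


def replay_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Forecast the next `horizon` days by replaying the most recent days.

    Hypothetical helper for illustration only: `history` is assumed to hold
    one message count per day for a single topic, oldest first.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    if not history:
        raise ValueError("history must contain at least one day")

    # Take the last `horizon` days (or the whole history if it is shorter).
    window = list(history[-horizon:])

    # Repeat the window cyclically so the forecast always has `horizon` days.
    return [window[i % len(window)] for i in range(horizon)]


daily_counts = [120, 95, 130, 400, 380]
print(replay_forecast(daily_counts, 3))
```

Such a baseline needs no training, which is also why it cannot anticipate a spike triggered by an event absent from the recent window.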

The second component of this dissertation focuses on developing methodologies to incorporate features from exogenous events into our models for forecasting the volume of social media activity over time. We thus introduce TAP (topic activity predictors), which employs a dynamic selection of Long Short-Term Memory (LSTM) models, each trained with a specific source of exogenous data. We show that our approach is capable of modeling spikes of activity as they happen in several topics, a characteristic that is often missed by other forecasting approaches.

Third, we discovered the challenge of properly identifying topics as the tokens of information spread. On a Twitter dataset related to COVID-19 discussions, we show that current topic modeling approaches struggle to differentiate between viewpoints on the same topic, resulting in overly broad topics. We propose an approach that enhances the semantic embeddings of tweets by incorporating additional viewpoint information through the use of social proximity signals present in social networks. We show that our resulting embeddings can generate more fine-grained clusters of tweets, thereby enabling the extraction of more targeted topics.

Fourth, this data-driven study led to the discovery of coordinated activity on social media in one of the datasets we analysed. Coordinated activity is represented by efforts in which multiple user accounts, within a short time, post content advancing a shared informational agenda. These orchestrated efforts, due to their inauthenticity and lack of representation in most datasets, add new challenges to the task of activity forecasting. We adopt a methodology that captures this inauthentic behavior through a network of coordinated link sharing, and we extend it to multi-platform settings. This methodology is based on empirically selecting a short interval of time, which is then used as a threshold for differentiating coordinated from non-coordinated activity. We investigate one specific scenario involving the disinformation campaign against the White Helmets organization. We show how pieces of content, in this study YouTube videos, are strategically promoted across various social media platforms, specifically Twitter and Facebook.
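The threshold-based idea described above can be sketched as follows: two accounts are linked when they share the same URL within a short time interval, and the edge weight counts how many distinct URLs they co-shared that way. This is a simplified illustration, not the dissertation's implementation; the `(account, url, timestamp)` tuple format, the function name `coordinated_pairs`, and the threshold value used in the example are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Tuple

Share = Tuple[str, str, float]  # (account, url, timestamp in seconds): assumed format


def coordinated_pairs(
    shares: Iterable[Share], threshold_seconds: float
) -> Dict[Tuple[str, str], int]:
    """Build a weighted account-pair network from near-simultaneous link shares."""
    # Group share events by the URL that was shared.
    by_url = defaultdict(list)
    for account, url, ts in shares:
        by_url[url].append((ts, account))

    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for events in by_url.values():
        linked = set()
        # Compare every pair of shares of the same URL.
        for (t1, a1), (t2, a2) in combinations(sorted(events), 2):
            if a1 != a2 and abs(t2 - t1) <= threshold_seconds:
                linked.add(tuple(sorted((a1, a2))))
        # Count each account pair at most once per URL.
        for pair in linked:
            weights[pair] += 1
    return dict(weights)


example_shares = [
    ("a", "video1", 0), ("b", "video1", 5), ("c", "video1", 100),
    ("a", "video2", 10), ("b", "video2", 12),
]
print(coordinated_pairs(example_shares, threshold_seconds=10))
```

In practice the threshold would be chosen empirically from the data, as the abstract notes, and pairs with low weights would typically be filtered out before interpreting the network.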

Finally, we introduce an approach for automatic detection of coordinated activity that does not rely on predefined thresholds. Our approach leverages temporal information, network structure information, and textual content similarity to identify groups of accounts exhibiting unusual activity patterns. We apply our framework to four distinct datasets collected from Twitter that cover a wide range of contexts, including geopolitical events, social media manipulation, and general topics of discussion. We show that our framework can isolate unusual interactions exhibiting high similarity in multiple dimensions, and can provide valuable insights for further qualitative investigation.
