Graduation Year

2021

Document Type

Dissertation

Degree

Ph.D.

Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Computer Science and Engineering

Major Professor

Adriana Iamnitchi, Ph.D.

Committee Member

John Skvoretz, Ph.D.

Committee Member

Lawrence O. Hall, Ph.D.

Committee Member

Giovanni L. Ciampaglia, Ph.D.

Committee Member

Michael Maness, Ph.D.

Keywords

Anonymization, Graphs, Information Diffusion, Machine Learning, Twitter

Abstract

Social media datasets are fundamental to understanding a variety of phenomena, such as epidemics, adoption of behavior, crowd management, and political uprisings. At the same time, many such datasets capturing computer-mediated social interactions are recorded nowadays by individual researchers or by organizations. However, while the need for real social graphs and the supply of such datasets are well established, the flow of data from data owners to researchers is significantly hampered by privacy risks: even when humans’ identities are removed, or data is anonymized to some extent, studies have repeatedly shown that re-identifying anonymized users (i.e., de-anonymization) is feasible with a high success rate.

A main research challenge is to develop a principled understanding of how to measure the effectiveness of an anonymization scheme and thus, conversely, the likely success of a de-anonymization attack. This dissertation develops methods to understand what makes some graph datasets more resilient to de-anonymization attacks. We propose a data-driven framework to 1) quantify the vulnerability of a graph to a re-identification attack; 2) quantitatively identify which graph structural properties contribute most to graph vulnerability; and 3) propose guidelines to develop new methodologies related to graph anonymization, de-anonymization and graph vulnerability quantification. We show the usefulness of this framework on a large set of synthetically generated graphs with controlled properties inspired by a set of real social networks. Thus, we provide a unified framework to analyze the privacy/utility trade-off imposed on any family of social graphs.

We extend this data-driven framework to networks with node attributes. Using this extended framework, we quantify how much better a node re-identification attack performs when the node attributes are included in the attack compared to when there is no node attribute information available to the attacker. We quantify the privacy impact of node attributes under an attribute attachment model biased towards homophily, and analyze the interplay between graph structure and attribute information. Our results show that binary node attributes increase the chance of revealing node identity independent of their placement in the network. Further, we show that other network properties independent of the degree distribution put node privacy at risk. This improves the current understanding of graph privacy, as it means that protecting graph privacy is much harder than previously considered.

Once privacy is guaranteed to a certain level, social media datasets are useful for various studies. One such important study is to analyze and model the information spreading patterns on social networks. Understanding how information (e.g., opinions and rumours) spreads on social networks has many benefits, such as controlling the spread of harmful rumours, identifying influential spreaders, and reducing the harm of an outbreak. Although a variety of classical diffusion models have been developed for epidemic spreading, they do not adequately capture how information spreads on social media. This dissertation contributes to the development of data-driven models to predict social media activity.

In this line of work, we first develop methods to forecast how conversations will evolve on a social media platform. Given a set of original posts on a social platform, such as posts on Reddit in a continuous interval of time, we predict the conversation trees rooted in these seeds. For each conversation, we predict the final shape of the message tree, the user who posts each message, and the time (in continuous space) of the posting of each message. Our solution uses a probabilistic generative model with the support of a genetic algorithm and Long Short-Term Memory (LSTM) neural networks. We evaluate the proposed approach on real-world conversations from subreddits related to crypto-currency and cyber-security on Reddit. We show that this technique can generate accurate conversation topological structures over time, and can accurately predict the volume of messages and the engagement of users over time.

We extend this technique to predict Twitter activity per topic of interest during a period of political crisis. By their nature, periods of crisis do not include many repeatable events, so it is difficult to learn and predict how social media users react. We use external event information, as seen through the lens of physical conflict and news, to improve the simulator design. Specifically, we use the time-aligned exogenous signals to predict when tweets are posted, in which topic, and by which user. We use the previously developed cascade generation model to predict the resharing activity. We evaluate these finer-granularity simulations by the volume and temporal pattern of Twitter discussions, the engagement of new users, and the structure of the user interaction network. We show on Twitter data collected during the Venezuelan political crisis that our model generates activities that follow the ground truth.
