Social scientific findings, business decisions, and now even public policies are increasingly being made on the basis of digital trace data from sources such as social media platforms, purchase records, emails, and mobile phone sensors. What could go wrong?
In this thesis, I look at ways in which findings made from such digital trace data may be misleading, and how we might make use of such data in light of these limitations. As an example of demographic bias, I take the largest source of combined geographic, temporal, linguistic, and network data, geotagged tweets, and show that such data exhibits heavy geographic and demographic biases. In an empirical demonstration of algorithmic user manipulation, I use causal inference techniques to show how design decisions of social media engineers have a causal impact on user behavior and observed network structure. And I argue that sensors in mobile phones, by measuring proximity and not necessarily interaction, suggest a very different set of theoretically appropriate questions than those that have been proposed so far.
I then give three examples of study designs with scopes that avoid problems of bias. The first is a case study of unique and interesting online organization in Erowid.org, an informational resource for psychoactive drugs, demonstrating ethical research relevant to the studied community. The second is a partnership with public health researchers, where we demonstrate the use of Twitter not as a source of scientific findings but as a means of public engagement. Lastly, I present a study that combines mobile phone sensor data with theoretical directions from the 1940s and 1950s and with Stochastic Actor-Oriented Models for network dynamics.
This thesis also demonstrates how to operationalize critiques from critical algorithm studies, Science and Technology Studies (STS), and sociology into empirical research questions within computer science. By unifying these areas, I demonstrate a model of computational social science that is rigorous both sociologically and in terms of its statistical modeling, and that can responsibly support decision-making, research, and policy.
Jürgen Pfeffer (Co-chair)
Anind K. Dey (HCII/Co-Chair)
Cosma Rohilla Shalizi (Statistics)
David Lazer (Northeastern University)