Social scientific findings, business decisions, and now even public policies are increasingly being made on the basis of digital trace data from sources such as social media platforms, purchase records, emails, and mobile phone sensors. What could go wrong?

In this thesis, I look at ways in which findings made from such digital trace data may be misleading, and how we might make use of such data in light of these limitations. As an example of demographic bias, I take the largest source of combined geographic, temporal, linguistic, and network data, geotagged tweets, and show that such data exhibits heavy geographic and demographic biases. In an empirical demonstration of algorithmic user manipulation, I use causal inference techniques to show how design decisions of social media engineers have a causal impact on user behavior and observed network structure. And I argue that sensors in mobile phones, by measuring proximity and not necessarily interaction, suggest a very different set of theoretically appropriate questions than those proposed so far.

I then give three examples of study designs with scopes that avoid problems of bias. The first is a case study of unique and interesting online organization in, an informational resource for psychoactive drugs, demonstrating ethical research relevant to the studied community. The second is a partnership with public health researchers, where we demonstrate the use of Twitter not as a source of scientific findings but as a means of public engagement. Lastly, I present a study that combines mobile phone sensor data with theoretical directions from the 1940s and 50s and with Stochastic Actor-Oriented Models for network dynamics.

This thesis also demonstrates how to operationalize critiques from critical algorithm studies, Science and Technology Studies (STS), and sociology into empirical research questions within computer science. By unifying these areas, I demonstrate a model of computational social science that is rigorous both sociologically and in terms of its statistical modeling, and that can responsibly support decision-making, research, and policy.

Thesis Committee:
Jürgen Pfeffer (Co-chair)
Anind K. Dey (HCII/Co-Chair)
Cosma Rohilla Shalizi (Statistics)
David Lazer (Northeastern University)

Copy of Proposal Document

Software-intensive systems are increasingly expected to operate under changing and uncertain conditions, including not only varying user needs and workloads, but also fluctuating resource capacity. Self-adaptation aims to address this problem, giving systems the ability to change their behavior and structure to adapt to changes in themselves and their operating environment without human intervention.

Self-adaptive systems tend to be reactive and myopic, adapting in response to changes without anticipating what the subsequent adaptation needs will be. Adapting reactively can result in inefficiencies due to the system performing a suboptimal sequence of adaptations. Furthermore, some adaptation tactics (atomic adaptation actions) have latency and take some time to produce their effect. In that case, reactive adaptation causes the system to lag behind environment changes. What is worse, a long-running adaptation action may prevent the system from performing other adaptations until it completes, further limiting its ability to deal effectively with the environment changes.

To address these limitations and improve the effectiveness of self-adaptation, we present proactive latency-aware adaptation, an approach that considers the timing of adaptation by (i) leveraging predictions of the near future state of the environment to adapt proactively; (ii) considering the latency of adaptation tactics when deciding how to adapt; and (iii) executing tactics concurrently. We have developed three different solution approaches embodying these principles. The first is based on probabilistic model checking, making it inherently able to deal with the stochastic behavior of the environment, and guaranteeing optimal adaptation choices over a finite decision horizon. The second approach uses stochastic dynamic programming to make adaptation decisions, and because it performs part of the computations required to make those decisions off-line, it achieves a speedup of an order of magnitude over the first solution approach without compromising optimality. The third solution approach makes adaptation decisions based on repertoires of adaptation strategies, that is, predefined compositions of adaptation tactics. This approach is more scalable than the other two because the solution space is smaller, allowing an adaptive system to reap some of the benefits of proactive latency-aware adaptation even if the number of ways in which it could adapt is too large for the other approaches.

Thesis Committee:
David Garlan (Chair)
Mark Klein (SEI)
Claire Le Goues
Sam Malek (University of California, Irvine)

Copy of Draft Thesis Document

Some of the most pressing challenges facing humanity today, such as how to respond to climate change or govern the internet, are too complex for any individual or small group to completely understand by themselves, but a lot of people each know a piece of the problem or solution. This is somewhat like a billion-piece jigsaw puzzle where somebody threw away the box and mailed each piece to a different person. Attempts are now being made to build platforms where people can bring their pieces and assemble them into solutions.

This thesis examines such platforms, how people use and perceive them and their content, and how certain design decisions such as the exposure of discussion behind collaboratively produced content can affect those perceptions. Through a set of studies ranging from qualitative interviews to controlled experiments with hundreds or thousands of participants, this thesis adds to our understanding of how humans use and perceive content on such platforms, including when primary content is presented alongside related content or discussion, so that we can better understand how to design these systems to better achieve their users’ intended goals.

The research integrates insights from computer science and social psychology with latent variable modeling techniques in order to increase our understanding of what people are trying to accomplish on an example platform designed to support large-scale collaboration around a complex issue, and experimentally explores how people perceive the content they find on such sites. Data and projects used in this thesis come from a diversity of sources including Wikipedia, the President’s SAVE award ideation contest (facilitated through the IdeaScale ideation platform), and the MIT Climate CoLab.

Thesis Committee:
James D. Herbsleb (Chair)
Carolyn P. Rosé
Daniel B. Neill (Heinz)
Thomas W. Malone (Massachusetts Institute of Technology)

Copy of Draft Thesis Document

Online social networks have become a powerful venue for political activism. In many cases, large, insular communities form that have been shown to be powerful diffusion mechanisms of both misinformation and propaganda. In some cases these groups’ users advocate actions and policies that could be construed as “extreme” along any distribution of opinion, and are thus called Online Extremist Communities (OECs). However, little is known about how these groups form or the methods used to influence them. The work in this thesis provides new ways for researchers to answer three critical research questions with respect to online marketing and its role in geopolitical opinion:

  • How can we detect large dynamic online activist or extremist communities?
  • What automated tools are used to build, isolate, and influence these communities?
  • How can we gain novel insight into large online activist or extremist communities?

I leverage the various affordances Online Social Networks offer for group curation by developing heterogeneous, annotated graph representations of user communities. I then use unsupervised methods to detect extremist discussion cores which can be used to efficiently build training sets for supervised detection of the greater OEC. I also present illustrative knowledge extractions available to researchers when OECs are detected at scale. This methodological pipeline has also proven useful for social botnet detection, and I have observed large, complex social botnets that appear to be a powerful tool used for propaganda dissemination.

Throughout this thesis I provide Twitter case studies including communities focused on the Islamic State of Iraq and al-Sham (ISIS), the ongoing Syrian Revolution, the Euromaidan Movement in Ukraine, as well as the “alt-Right.”

Thesis Committee:
Kathleen Carley (Chair)
Zico Kolter
Randy Garrett (IronNet)
Daniel Neill (Heinz)

Copy of Draft Document

One of the main challenges that modern software developers face is the coordination of dependent agents such as software projects and other developers. Transparent development environments that make software development activities visible hold much promise for assisting developers making coordination decisions at the risk of potentially overwhelming them with information from potentially millions of software developers and repositories. Overcoming the risk of overload requires a principled understanding of what exactly developers need to know about dependencies to make decisions.

My approach to a principled understanding of how developers use transparency information is to model the process using signaling theory as a theoretical lens. Developers making key coordination decisions often must determine qualities about projects and other developers that are not directly observable. Developers infer these unobservable qualities by interpreting information in their environment as signals and use this judgment to inform their decisions. Through this understanding of the signaling process, I can create improved signals that more accurately represent desired unobservable qualities.

My dissertation work examines the qualities and signals that developers use to inform specific coordination tasks through a series of three empirical studies. The specific key coordination tasks studied are evaluating code contributions, negotiating problems around contributions, and evaluating projects. My results suggest that when project managers evaluate code contributions, they prefer social over technical signals. When project managers discuss contributions, I found that they attend to political signals regarding influence from stakeholders to prioritize problems. I found that developers evaluating projects tend to use signals that are related to how the core team works and the potential utility a project provides. In a fourth study, using signaling theory and findings from the qualities and signals that developers use to evaluate projects, I create and evaluate an improved signal called “supportiveness” for community support in projects. I compare this signal against the current signal that developers use, stars count, and find evidence that my designed signal is a stronger and more robust indicator of support. The findings of these studies inform the design of tools and environments that assist developers in these tasks through prioritizing and potentially improving signals.

Thesis Committee:
James Herbsleb (Co-Chair)
Laura Dabbish (Co-Chair, HCII)
Claire Le Goues
André van der Hoek (University of California Irvine)

Copy of Draft Thesis Document

Modern ubiquitous platforms, such as mobile apps, web browsers, social networks, and IoT devices, are providing sophisticated services to users while also increasingly collecting privacy-sensitive data. Service providers do give users fine-grained privacy controls over these sensitive actions; however, the number of privacy settings has reached a point where it is overwhelming to users, preventing them from taking advantage of these controls.

This work addresses this user burden problem by studying to what extent machine learning techniques could simplify user decisions and help users configure their privacy settings. We chose mobile app permissions as a first domain to explore this topic.

Specifically, in our first study, we explored the power of different combinations of features to predict users’ mobile app permission settings based on a dataset of 200K real Android users. We evaluated the profile-based predictions together with individual prediction models and showed that by selectively prompting users on 10% of the permission requests, the system could predict users’ app permission settings with 90% accuracy. We conducted a second study in which we applied nudging to motivate users to engage with the settings, helping to develop strong predictive models even from a small sample of users. The two studies confirmed that a relatively small number of profiles can go a long way toward capturing users’ diverse privacy preferences and predicting their privacy settings.

We then introduced an app that provides personalized recommendations for mobile app permission settings. Our results from a pilot study (N=72) conducted on real Android users showed that participants accepted 78.7% of the recommendations and kept 94.9% of these settings on their phones with comfort. In light of this, we propose a final study exploring the extent to which users’ privacy profiles learned from mobile app permissions could help predict users’ privacy settings across other domains.

Thesis Committee:
Norman Sadeh (Chair)
Lorrie Cranor
Alessandro Acquisti
Florian Schaub (University of Michigan)
Nina Taft (Google)

Join us for CMU Privacy Day 2017 at Carnegie Mellon University. CMU is celebrating International Data Privacy Day by presenting privacy research and practical advice on protecting privacy online. Privacy Day is open to the public, and no registration is required.

Data Privacy Day is an international effort to empower and educate people to protect their privacy and control their digital footprint. For more information, please visit

Privacy Day will feature a Privacy Clinic. Come and learn how to protect your privacy. CMU’s information privacy and security students will educate you and answer your questions about privacy risks and remedies concerning many topics, including:

  • Web Application for Searching and Comparing Financial Companies' Privacy Practices
  • Are you being monitored at Carnegie Mellon?
  • Online Tracking and Targeted Ads
  • Private Browsing
  • The Decline of the Ad Blocker
  • Privacy for IoT Devices
  • How to Avoid In-App Tracking and Advertising
  • Encryption for Messenger Apps
  • Opting Out from Ad Targeting
  • Analyzing Privacy Requirements for Mobile Apps
  • Generating Privacy Policies for Websites and Apps

Refreshments will be provided.

Hosted by the MSIT-Privacy Engineering Program.

Structured probabilistic inference has been shown to be useful in modeling complex latent structures of data. One successful application of this technique is the discovery of latent topical structures in text data, usually referred to as topic modeling. With the recent popularity of mobile devices and social networking, we can now easily acquire text data attached to meta-information, such as geo-spatial coordinates and time stamps. This metadata can provide rich and accurate information that is helpful in answering many research questions related to spatial and temporal reasoning. However, such data must be treated differently from text data. For example, spatial data is usually organized in terms of a two-dimensional region, while temporal information can exhibit periodicities. While some existing work in the topic modeling community utilizes some of the meta-information, these models largely focus on incorporating metadata into text analysis, rather than providing models that make full use of the joint distribution of meta-information and text.

In this thesis, I propose the event detection problem, which is a multi-dimensional latent clustering problem on spatial, temporal, and topical data. I start with a simple parametric model to discover independent events using geo-tagged Twitter data. The model is then improved in two directions. First, I augmented the model using the Recurrent Chinese Restaurant Process (RCRP) to discover events that are dynamic in nature. Second, I studied a model that can detect events using data from multiple media sources. I studied the characteristics of different media in terms of reported event times and linguistic patterns.

The approaches studied in this thesis are largely based on Bayesian non-parametric methods to deal with streaming data and an unpredictable number of clusters. The research will not only serve the event detection problem itself but also shed light on a more general structured clustering problem in spatial, temporal, and textual data.

Thesis Committee:
Kathleen M. Carley (Chair)
Tom Mitchell (MLD)
Alexander Smola (MLD/Amazon)
Huan Liu (Arizona State University)

Copy of Thesis Document

In our recent work, we aggressively modify Java bytecode in order to implement a novel technique called variational execution. As we delved deeply into the bytecode, we realized that bytecode manipulation is a powerful technique that could be applied to various application domains. It has its own unique advantages over similar techniques like source-to-source transformation. It can be useful for simple tasks like performance profiling, refactoring, and runtime checking. It is also widely used in the research community for more complicated tasks like static analysis and dynamic analysis. In this talk, I am going to briefly introduce Java bytecode and show a few examples of how bytecode manipulation could be useful in a variety of scenarios. My hope is that, after this talk, you will have one more implementation option to consider for your research project.

