Generative modeling, a core problem in unsupervised learning, aims at understanding data by learning a model that can generate datapoints that resemble the real-world distribution.  Generative Adversarial Networks (GANs) are an increasingly popular framework that solve this by optimizing two deep networks, a "discriminator" and a "generator", in tandem.

However, this complex optimization procedure is still poorly understood. More specifically, it was not known whether equilibrium points of this system are "locally asymptotically stable" i.e., when initialized sufficiently close to an equilibrium point, does the optimization procedure converge to that point? In this work, we analyze the "gradient descent" form of GAN optimization (i.e., the setting where we simultaneously take small gradient steps in both generator and discriminator parameters). We show that even though GAN optimization does not correspond to a convex-concave game, even for simple parameterizations, under proper conditions, its equilibrium points are still locally asymptotically stable.  On the other hand, we show that for the recently-proposed Wasserstein GAN (WGAN), the optimization procedure might cycle around an equilibrium point without ever converging to it. Finally, motivated by this stability analysis, we propose an additional regularization term for GAN updates, which can guarantee local stability for both the WGAN and for the traditional GAN. Our regularizer also shows practical promise in speeding up convergence and in addressing a well-known failure mode in GANs called mode collapse.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

Planning under uncertainty assumes a model that specifies the probabilistic effects of actions in terms of changes of the state. Given such model, planning proceeds to determine a policy that defines the choice of action at each state that maximizes a reward function. In this work, we realize that the world may have degrees of freedom not necessarily captured in the given model, and that the planning solution based thereon may be sub-optimal when compared to those possible when the additional degrees of freedom are considered. We introduce and formalize the problem of planning while considering feasible modifications to the model of the world that would lead to improved policies. We present an approach that models changes in the world as modifications to the probability transition function, and show that the problem of computing the most rewarding feasible modifications can be reduced to a constrained optimization problem. We then contribute a gradient-based algorithm for solving this optimization problem efficiently. We foresee several possible applications of this work, including some in the field of human-robot interaction.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.

For the first time in 25 years, a new non-volatile memory (NVM) category is being created that is two orders of magnitude faster than current durable storage media. The advent of NVM will fundamentally change the dichotomy between volatile memory and durable storage in database systems. These new NVM devices are almost as fast as DRAM, but all writes to it are potentially persistent even after power loss. Existing database systems are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy database systems are unnecessary and will degrade the performance of the data intensive applications.

This dissertation explores the implications of NVM for database systems. It presents the design and implementation of Peloton, a new database system tailored specifically for NVM. The dissertation focuses on three aspects of a database system: (1) logging and recovery, (2) storage management, and (3) indexing. Our primary contribution in this dissertation is the design of a new logging and recovery protocol, called write-behind logging, that improves the availability of the system by more than two orders of magnitude compared to the ubiquitous write- ahead logging protocol. Besides improving availability, we found that write- behind logging improves the space utilization of the NVM device and extends its lifetime. Second, we propose a new storage engine architecture that leverages the durability and byte-addressability properties of NVM to avoid unnecessary data duplication. Third, the dissertation presents the design of a latch-free range index tailored for NVM that supports near-instantaneous recovery without requiring special-purpose recovery code.

Thesis Committee:
Andy Pavlo (Chair)
Garth Gibson
Todd Mowry
Samuel Madden, MIT
Donald Kossmann (Microsoft Research)

Copy of Thesis Summary

Given the ever-growing prevalence of online social services, leveraging massive datasets has become an increasingly important challenge for businesses and end-users alike.  Online services capture a wealth of information about user behavior and platform interactions, such as who-follows-whom relationships in social networks and who-rates-what-and-when relationships in e-commerce networks.  Since many of these services rely on data-driven algorithms to recommend content to their users, authenticity of user behavior is paramount to success.  But given anonymity on the internet, how do we know which users and actions are real or not?  This thesis focuses on this problem and introduces new techniques to effectively and efficiently discern anomalous and fraudulent behavior in online social graphs. Specifically, we work on three thrusts: plain graphs, dynamic graphs and rich graphs.

Firstly, we focus on plain graphs, in which only static connectivity information is known.
We detail several proposed algorithms spanning the topics of spectral fraud detection in a single graph, blame attribution between graph snapshots, and structurally diverse graph summarization.

Next, we broaden our scope to dynamic graphs, in which we leverage connectivity information over a span of time.  We describe multiple relevant works which describe how to identify and summarize anomalous temporal graph structures, model interarrival time patterns in user queries to find anomalous search behavior, and identify "group" anomalies comprising of users acting in lockstep.

Lastly, we expand our reach to rich graphs, in which connectivity information is supplemented by other attributes, such as time, rating, number of messages sent, etc.  To this end, we discuss several works which focus on ranking abnormality of nodes in edge-attributed graphs, and characterizing multimodality of link fraud.

The techniques described in this thesis span various disciplines including data mining, machine learning, network and social sciences and information theory and are practically applicable to a number of real-world fraud and general anomaly detection scenarios.  They are carefully designed to attain high precision and recall in practice and scale to massive datasets, including social networks, telecommunication networks, e-commerce and collaboration networks with up to millions of nodes and billions of edges.

Thesis Committee:
Christos Faloutsos (Chair)
Roni Rosenfeld
Jaime Carbonell
Jiawei Han (University of Illinois at Urbana-Champaign)

At its core, coding theory studies how many elements of a (finite) vector space one can pack subject to the constraint that no two elements are too close. Typically, the notion of closeness is that of Hamming distance, that is, the distance between two vectors is the number of coordinates on which they differ. In a rank-metric code, codewords are matrices over a finite field and the distance between codewords is the rank of their difference. A linear rank-metric code is a subspace of matrices (over the field to which the matrix entries belong) such that every non-zero matrix in the subspace has large rank. Rank-metric codes have found numerous applications in random network coding, as well as magnetic recording, public-key cryptography, and space-time coding.

The typical algorithmic task for a code is that of unique-decoding: given a received word that may have been corrupted, the decoder should output the closest codeword. List-decoding is a relaxation of this notion which gives the possibility of decoding past half the minimum distance of the code at the cost of returning a (hopefully small) list of candidate codewords. List-decoding has proved to be a highly fruitful avenue of study in the Hamming case, and recently there has also been a great deal of interest in the list-decodability of rank-metric codes.

This work concerns the list-decodability of random linear rank-metric codes, and establishes a trade-off between and establishes a trade-off between list-size and gap to optimal decoding radius that is similar to what is known (and is straightforward to establish) for completely random rank-metric codes. Almost all known constructions of rank-metric codes are linear, and random code ensembles achieve the best known trade-offs, so it is of interest to understand the performance of random linear (rank-metric) codes.

Based on joint work with Venkatesan Guruswami.

CMU Youtube Theory channel.

Cells need to be able to sustain themselves, divide, and adapt to new stimuli. Proteins are
key agents in regulating these processes. Cell behavior is regulated by signaling pathways and proteins called transcription factors which regulate what and how much of a protein should be manufactured. Anytime a new stimulus arises, it can activate multiple signaling pathways by interacting with proteins on the cell surface (if it is an external stimulus) or proteins within the cell (if it is a virus for example).

Disruption in signaling pathways can lead to a myriad of diseases. Knowledge of which signaling pathways play a role in which condition, is thus key to comprehending how cells develop, react to environmental stimulus, and are able to carry out their normal functions. Another biological process that can play a role in such reaction is epigenetics -- modification of the DNA structure that do not involve changing the sequence itself. Epigenetics has been shown to regulate transcription, however, how epigenetic changes are regulated and the exact impact these changes have on transcriptional regulation are still open questions.

In this thesis we present a suite of computational techniques that are focused on modeling the dynamic regulation of biological processes. These methods address the various aspects of the problem mentioned above focusing on the reconstruction of dynamic signaling and regulatory networks. In many cases, the amount of biological data available for a specific condition can be very small compared to the number of variables. We present an algorithm which uses multi-task learning to learn signaling networks from many related conditions. There are also very few tools that attempt to take temporal dynamics into account when inferring signaling networks. The thesis presents a new algorithm that utilizes and extends Integer Programming methods for inferring such dynamic regulation. Finally, we present a new strategy to integrate epigenetic data with other temporal datasets using deep neural networks. We use this new method to reconstruct dynamic disease progression networks in Idiopathic Pulmonary Fibrosis (IPF).

Thesis Committee:
Ziv Bar-Joseph (Chair)
Eric Xing
Jaime Carbonell
Naftali Kaminski (Yale University)

DNA read mapping is one of the fundamental problem in bioinformatics. The general goal is to map billions of short pieces of DNA (reads) to a high quality reference genome, while tolerating errors including genomic variations between individuals and the sequencer imprecision.

The seed-and-extend strategy, a general mapping heuristic, efficiently maps erroneous DNA fragments to the reference by breaking the read into multiple non-overlapping segments, called "seeds", and assume at least one of the seed is error free. By using seed to index into the reference genome, seed-and-extend mappers drastically reduce the search space hence improves the run time.

However, a fundamental trade-off of seed-and-extend mappers is the balance between seed lengths and seed count. Greater seed length further reduces the search space therefore improve the performance. However, greater seed space also renders fewer seeds hence make the mapping process more error prune.

In this talk, I will present an on-going work, VEST (Versatile Error-resilient Seed Technique). VEST aims to adapt long seeds in order to make them error-resilient, hence achieves both high error tolerance as well as high mapping speed.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

In a ridesharing system such as Uber or Lyft, arriving customers must be matched with available drivers. These decisions affect the overall number of customers matched, because they impact whether or not future available drivers will be close to the locations of arriving customers. A common policy used in practice is the closest driver (CD) policy that offers an arriving customer the closest driver. This is an attractive policy because no parameter information is required. However, we expect that a parameter-based policy can achieve better performance.

We propose to base the matching decisions on the solution to a continuous linear program (CLP) that accounts for (i) the differing arrival rates of customers and drivers in different areas of the city, (ii) how long customers are willing to wait for driver pick-up, and (iii) the time-varying nature of all the aforementioned parameters. We prove asymptotic optimality of a CLP-based policy in a large market regime. However, solving the CLP is difficult, thus we also propose matching policies based on a linear program (LP). We prove asymptotic optimality of an LP-based policy in a large market regime in which drivers are fully utilized. We conduct simulation experiments to test the performance of the CD, LP-based, and CLP-based policies.

This is joint work with Erhun Ozkan.

Amy R. Ward is a Professor in the Data Sciences and Operations Department in the Marshall School of Business at the University of Southern California (USC). She received her Ph.D. from Stanford University in 2001. Her research focuses on the approximation and control of systems in which uncertainty and variability are present, and cannot be ignored. Service systems are her main application area. She has over 25 journal publications. She received the Marshall Deans Award for Research Excellence in 2015. She is the chair of the INFORMS Applied Probability Society (term 11/2016-11/2018) and the Service SIG Chair of the INFORMS MSOM Society (term 6/2017-6/2019). She serves as an Associate Editor for Operations Research, Stochastic Systems, and Operations Research Letters.

While only recently developed, the ability to profile expression data in single cells (scRNA-Seq) has already led to several important studies and findings. However, this technology has also raised several new computational challenges. These include questions about the best methods for clustering scRNA-Seq data, how to identify unique group of cells in such experiments, and how to determine the state or function of specific cells based on their expression profile. To address these issues we develop and test a method based on neural networks (NN) for the analysis and retrieval of single cell RNA-Seq data. We tested various NN architectures, some of which incorporate prior biological knowledge, and used these to obtain a reduced dimension representation of the single cell expression data. We show that the NN method improves upon prior methods in both, the ability to correctly group cells in experiments not used in the training and the ability to correctly infer cell type or state by querying a database of tens of thousands of single cell profiles. Such database queries (which can be performed using our web server) will enable researchers to better characterize cells when analyzing heterogeneous scRNA-Seq samples.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

Profesor Karl Crary will discuss the Master of Science in Computer Science Program. The program in computer science offers students with a Bachelor's Degree the opportunity to improve their training wtih advanced study in computer Science.




Subscribe to CSD