We study large-scale multi-label classification (MLC) on two recently released datasets, YouTube-8M and Open Images, which contain millions of data instances and thousands of classes. This unprecedented problem scale poses great challenges for MLC. First, identifying the correct label subset among exponentially many candidates incurs substantial ambiguity and uncertainty. Second, the large data size and class count entail considerable computational cost. To address the first challenge, we investigate two strategies: capturing label correlations from the training data and incorporating label co-occurrence relations obtained from external knowledge, which effectively eliminate semantically inconsistent labels and provide contextual cues to differentiate visually ambiguous labels. Specifically, we propose a Deep Determinantal Point Process (DDPP) model that seamlessly integrates a DPP with deep neural networks (DNNs) and supports end-to-end multi-label learning and deep representation learning. The DPP captures label correlations of any order at polynomial computational cost, while the DNNs learn hierarchical features of images and videos and capture the dependency between input data and labels. To incorporate external knowledge about label co-occurrence relations, we impose relational regularization over the kernel matrix in DDPP. To address the second challenge, we study an efficient low-rank kernel learning algorithm based on inducing-point methods. Experiments on the two datasets demonstrate the efficacy and efficiency of the proposed methods.
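To make the DPP mechanism concrete, here is a minimal sketch (not the authors' implementation) of greedy MAP label selection under a DPP kernel L = diag(q) S diag(q), where the quality scores q and similarity matrix S are illustrative stand-ins for quantities the DNN would produce:

```python
import numpy as np

def dpp_greedy_map(L, max_labels):
    """Greedily build a label subset Y that approximately maximizes
    det(L_Y), the unnormalized DPP probability of Y. Stop when no
    remaining label increases the determinant."""
    selected, cur_det = [], 1.0   # determinant over the empty set is 1
    for _ in range(max_labels):
        best_j, best_det = None, cur_det
        for j in range(L.shape[0]):
            if j in selected:
                continue
            idx = selected + [j]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best_j, best_det = j, d
        if best_j is None:        # no label improves the subset: stop
            break
        selected.append(best_j)
        cur_det = best_det
    return selected

# Toy example: labels 0 and 1 ("cat", "kitten") are nearly redundant,
# label 2 ("car") is distinct; q holds per-label quality scores.
q = np.array([2.0, 1.8, 1.5])
S = np.array([[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
L = np.diag(q) @ S @ np.diag(q)
labels = dpp_greedy_map(L, max_labels=3)   # picks 0 and 2, rejects 1
```

The determinant penalizes subsets containing similar labels, which is how the DPP suppresses semantically redundant predictions; a low-rank (inducing-point) kernel reduces the cost of working with L when the label vocabulary is large.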
Across domains, the scale of data and the complexity of machine learning models have both increased greatly in recent years. For many large-scale models of interest, tractable estimation without access to extensive computational infrastructure remains an open problem. In this talk, we approach the tractability of large-scale estimation problems through the lens of sparse structure inherent in the learning objective, which allows us to develop algorithms sublinear in the size of the domain by greedily searching for the atoms that comprise the optimal structure. Specifically, we address three questions for each model of interest: (a) how to formulate model estimation as a high-dimensional optimization problem with tractable sparse structure; (b) how to efficiently, i.e., in sublinear time, search for the optimal structure; and (c) how to guarantee fast convergence of the resulting optimization algorithm. By answering these questions, we develop state-of-the-art learning algorithms for varied domains such as extreme classification and structured prediction with large output domains, and design new estimators for latent variable models that enjoy polynomial computational and sample complexities without restrictive assumptions.
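Setting aside the sublinear-time search machinery, the greedy search for atoms of the optimal structure can be illustrated with orthogonal matching pursuit on a sparse regression problem; the names and problem sizes below are purely illustrative:

```python
import numpy as np

def greedy_atom_search(X, y, k):
    """Greedy structure search: at each step pick the atom (column of X)
    most correlated with the current residual, add it to the active set,
    and refit on that set (orthogonal matching pursuit). A sublinear-time
    variant would replace the exhaustive argmax scan with an approximate
    maximum-inner-product search structure."""
    active = []
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)
        scores[active] = -np.inf            # never re-pick a chosen atom
        active.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ coef
    return sorted(active)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))         # 500 candidate atoms
beta = np.zeros(500)
beta[[3, 170]] = [2.0, -1.5]                # true structure: 2 atoms
support = greedy_atom_search(X, X @ beta, k=2)
```

With a noiseless 2-sparse target and a random Gaussian design, the greedy scan recovers exactly the two atoms that comprise the true structure.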
Pradeep Ravikumar (Chair)
Geoffrey J. Gordon
Inderjit S. Dhillon (University of Texas at Austin)
Bayesian nonparametric (BN) methods have gained popularity because they provide a Bayesian framework for modeling data with an unknown number of parameters. They have an advantage over parametric methods in that they do not assume a fixed number of parameters but instead adjust model complexity to the given dataset. This means that such models can in principle cope with data sets of arbitrary size and latent dimensionality, which makes BN methods a good choice for modeling real-world datasets. In practice, however, BN methods are slow and have large run-time complexity, which limits their usability. In my thesis I advance the use of Bayesian nonparametrics in various real-world machine learning applications, with a focus on scalability. I dissect BN models into the components that affect scalability and, for different applications, propose to redesign these components so as to achieve scalable learning and inference.
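As a concrete instance of a model whose complexity adjusts to the data, here is a sketch of sampling from the Chinese restaurant process, a standard BN building block (the function and its parameters are illustrative, not drawn from the thesis):

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n items from the Chinese restaurant process.
    The number of clusters is not fixed in advance: item i starts a new
    cluster with probability alpha / (i + alpha), so the expected number
    of clusters grows like alpha * log(n) -- the 'adjusts to the data'
    behaviour characteristic of Bayesian nonparametrics."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of items assigned to cluster k
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        if r < alpha:
            tables.append(1)          # open a new cluster
        else:
            # join an existing cluster with prob. proportional to its size
            r -= alpha
            for k, size in enumerate(tables):
                if r < size:
                    tables[k] += 1
                    break
                r -= size
    return tables
```

For n = 1000 and alpha = 2.0 this yields on the order of a dozen clusters; doubling the data grows the partition only logarithmically, which is also where the run-time costs the thesis targets come from.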
Eric P. Xing (Chair)
Sinead Williamson (University of Texas at Austin)
We consider the problem of estimating a function defined over n locations on a d-dimensional grid. When the function is constrained to have discrete total variation bounded by C_n, we derive the minimax optimal \ell_2 estimation error rate, parametrized by n and C_n. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Linear smoothers are shown to be suboptimal, extending fundamental findings of Donoho and Johnstone to higher dimensions. We also derive minimax rates for discrete Sobolev spaces, which are smaller than the total variation function spaces. Indeed, these spaces are small enough that linear estimators can be optimal over them. We define two higher-order TV classes and derive lower bounds on their minimax errors. We also analyze two naturally associated trend filtering methods; when d=2, each is seen to be rate optimal over the appropriate class.
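The estimator in question, taking the 1-d grid for concreteness, can be sketched via projected gradient ascent on the dual of the total variation denoising problem; this is one simple solver among many, with illustrative parameter choices:

```python
import numpy as np

def tv_denoise_1d(y, lam, iters=5000):
    """1-d total variation denoising (fused lasso signal approximator):
        minimize_x  0.5 * ||y - x||^2 + lam * sum_i |x[i+1] - x[i]|.
    Solved by projected gradient ascent on the dual, where the dual
    variable u lies in the box [-lam, lam]^(n-1) and x = y - D'u for the
    discrete difference operator D."""
    y = np.asarray(y, dtype=float)
    u = np.zeros(len(y) - 1)
    for _ in range(iters):
        x = y + np.diff(u, prepend=0.0, append=0.0)    # x = y - D'u
        u = np.clip(u + 0.25 * np.diff(x), -lam, lam)  # step 1/||DD'|| <= 1/4
    return y + np.diff(u, prepend=0.0, append=0.0)

# A piecewise-constant signal keeps its shape; each block of length m
# is pulled toward the other by lam / m, so the jump shrinks by 2*lam/m.
x_hat = tv_denoise_1d([1.0, 1.0, 1.0, 5.0, 5.0, 5.0], lam=0.3)
```

The nonlinearity of this estimator (the box constraint becomes active exactly at large jumps) is what lets it adapt to heterogeneous smoothness, in contrast to the linear smoothers shown to be suboptimal above.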
Machine learning (ML) has become one of the most powerful classes of tools for artificial intelligence, personalized web services and data science problems across fields. Within the field of machine learning itself, there have been quite a number of paradigm shifts caused by the explosion of data size, computing power, modeling tools, and the new ways people collect, share, and make use of data sets.
Data privacy, for instance, was much less of a problem before personal information became available online that could be used to identify users in anonymized data sets. Images, videos, and observations generated over social networks often have highly localized structures that cannot be captured by standard nonparametric models. Moreover, the "common task framework" adopted by many sub-disciplines of AI has made it possible for many people to collaboratively and repeatedly work on the same data set, leading to implicit overfitting on public benchmarks and questionable scientific discoveries.
This thesis presents technical results under a number of new mathematical frameworks that are designed to partially address these issues. These include differentially private learning, locally adaptive nonparametric regression and sequential selective estimation (Gaussian adaptive data analysis). The talk will highlight a few aspects of this thesis's contribution to these problems relative to what was known in the literature.
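As a concrete example of the differentially private learning component, here is a sketch of the standard Laplace mechanism applied to a mean query (illustrative code, not the thesis's estimators):

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, seed=None):
    """Release true_value with epsilon-differential privacy by adding
    Laplace(sensitivity / epsilon) noise, sampled by inverse-CDF.
    'sensitivity' is the largest possible change in true_value when a
    single individual's record is altered."""
    rng = random.Random(seed)
    u = rng.random() - 0.5                    # uniform on (-0.5, 0.5)
    scale = sensitivity / epsilon
    return true_value - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_mean(data, lower, upper, epsilon, seed=None):
    """epsilon-DP mean of values clipped to [lower, upper]; changing one
    record moves the clipped mean by at most (upper - lower) / n."""
    clipped = [min(max(x, lower), upper) for x in data]
    sensitivity = (upper - lower) / len(clipped)
    return laplace_mechanism(sum(clipped) / len(clipped),
                             sensitivity, epsilon, seed)
```

Clipping bounds the sensitivity, and the noise scale shrinks as 1/n, so for large data sets the private answer stays close to the true mean while protecting any single record.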
Ryan Tibshirani (Chair)
Adam Smith (Boston University)
While Bayesian methods are praised for their ability to incorporate useful prior knowledge, in practice, convenient priors that allow for computationally cheap or tractable inference are commonly used. In this work, we investigate the following question: for a given model, is it possible to compute an inference result with any convenient false prior, and afterwards, given any target prior of interest, quickly transform this result into the target posterior? A potential solution is to use importance sampling (IS). However, we demonstrate that IS will fail for many choices of the target prior, depending on its parametric form and similarity to the false prior. Instead, we propose prior swapping, a method that leverages the pre-inferred false posterior to efficiently generate accurate posterior samples under arbitrary target priors. Prior swapping lets us apply less-costly inference algorithms to certain models, and incorporate new or updated prior information “post-inference”. We give theoretical guarantees about our method, and demonstrate it empirically on a number of models and priors.
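The importance sampling baseline discussed above amounts to reweighting false-posterior samples by the prior ratio; a minimal sketch (with illustrative densities) shows both the mechanics and where it can break down:

```python
import math
import random

def prior_swap_is(samples, log_target_prior, log_false_prior):
    """Importance-sampling reweighting from the false posterior
    p_f(theta|x) ~ L(x|theta) * pi_f(theta) to the target posterior
    ~ L(x|theta) * pi(theta). The likelihood cancels in the ratio, so
    the self-normalized weights are w(theta) ~ pi(theta)/pi_f(theta)."""
    logw = [log_target_prior(t) - log_false_prior(t) for t in samples]
    m = max(logw)                       # subtract max for stability
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return [wi / s for wi in w]

# False posterior N(1, 1), obtained under a flat false prior; target
# prior N(0, 1). The exact target posterior is then N(0.5, 0.5).
rng = random.Random(0)
samples = [rng.gauss(1.0, 1.0) for _ in range(20000)]
w = prior_swap_is(samples, lambda t: -0.5 * t * t, lambda t: 0.0)
post_mean = sum(wi * ti for wi, ti in zip(w, samples))
```

Here the priors are similar enough for IS to work, and the weighted mean lands near the exact posterior mean of 0.5. When the target prior differs sharply from the false one, e.g. much heavier tails, the weights concentrate on a few samples and the estimate degenerates, which is the failure mode motivating prior swapping.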
Recent studies have applied dimensionality reduction methods to understand how the multi-dimensional structure of neural population activity gives rise to brain function. It is unclear, however, how the results obtained from dimensionality reduction generalize to recordings with larger numbers of neurons and trials or how these results relate to the underlying network structure. We address these questions by applying factor analysis to recordings in the visual cortex of non-human primates and to spiking network models that self-generate irregular activity through a balance of excitation and inhibition. We compared the scaling trends of two key outputs of dimensionality reduction, shared dimensionality and percent shared variance, with neuron and trial count. We found that the scaling properties of networks with non-clustered and clustered connectivity differed, and that the in vivo recordings were more consistent with the clustered network. Furthermore, recordings from tens of neurons were sufficient to identify the dominant modes of shared variability that generalize to larger portions of the network. These findings can help guide the interpretation of dimensionality reduction outputs in regimes of limited neuron and trial sampling and help relate these outputs to the underlying network structure.
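A rough sketch of the two outputs, using a plain eigendecomposition of the sample covariance as a stand-in for full factor-analysis EM, on simulated rather than recorded data:

```python
import numpy as np

def shared_structure(X, noise_cutoff):
    """Crude stand-in for factor analysis: eigendecompose the sample
    covariance of X (trials x neurons) and call eigenvalues above
    noise_cutoff 'shared' modes. Returns (shared dimensionality,
    percent shared variance)."""
    C = np.cov(X, rowvar=False)
    evals = np.linalg.eigvalsh(C)[::-1]          # descending order
    shared = evals[evals > noise_cutoff]
    return len(shared), 100.0 * shared.sum() / evals.sum()

# Simulate a population driven by K latent factors plus private noise.
rng = np.random.default_rng(0)
T, N, K = 2000, 50, 3                  # trials, neurons, latent factors
W = rng.standard_normal((K, N))        # factor loadings
X = rng.standard_normal((T, K)) @ W + rng.standard_normal((T, N))
dim, pct = shared_structure(X, noise_cutoff=10.0)
```

Rerunning this while subsampling neurons or trials gives the kind of scaling trends the abstract compares; the cutoff separating shared from private variance is a simplification of what factor analysis estimates explicitly.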
Byron Yu, Matthew Smith, Steve Chase
I will discuss a decade long research project to create the foundations of reinforcement learning with context (aka features). This research project has multiple threads including Contextual Bandits, Learning to Search, and Contextual Decision Processes. The most mature of these (Contextual Bandits) is now driving many real-world RL applications while the least mature (CDPs) is still a theoretician's toy.
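A minimal epsilon-greedy contextual bandit sketch conveys the core loop (observe context, choose action, receive reward for the chosen action only); the names and toy reward structure are illustrative, not any production system:

```python
import random

def run_epsilon_greedy(rounds, n_actions, epsilon=0.1, seed=0):
    """Minimal contextual bandit: keep a running mean reward per
    (context, action) pair; explore uniformly with probability epsilon,
    otherwise exploit the best known action for the current context."""
    rng = random.Random(seed)
    means = {}   # (context, action) -> (count, mean reward)
    total = 0.0
    for context, reward_fn in rounds:
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)          # explore
        else:                                          # exploit
            action = max(range(n_actions),
                         key=lambda a: means.get((context, a), (0, 0.0))[1])
        r = reward_fn(action)      # only the chosen action's reward is seen
        c, m = means.get((context, action), (0, 0.0))
        means[(context, action)] = (c + 1, m + (r - m) / (c + 1))
        total += r
    return total

# Toy problem: two contexts, and action == context pays reward 1.
rounds = [(t % 2, lambda a, c=t % 2: 1.0 if a == c else 0.0)
          for t in range(2000)]
total = run_epsilon_greedy(rounds, n_actions=2, epsilon=0.1, seed=0)
```

The defining constraint, that feedback arrives only for the action taken, is what separates this setting from supervised learning and drives the exploration/exploitation trade-off in deployed contextual bandit systems.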
John Langford studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his Ph.D. from Carnegie Mellon University in 2002. Since then, he has worked at Yahoo!, Toyota Technological Institute, and IBM's Watson Research Center. He is also the primary author of the popular machine learning weblog hunch.net and the principal developer of Vowpal Wabbit. Previous research projects include Isomap, CAPTCHA, Learning Reductions, Cover Trees, and Contextual Bandit learning.
We propose a new family of combinatorial inference problems for graphical models. Unlike classical statistical inference, where the main interest is point estimation or parameter testing of Euclidean parameters, combinatorial inference aims at testing the global structure of the underlying graph. Examples include testing the graph connectivity, the presence of a cycle of certain size, or the maximum degree of the graph. To begin with, we develop a unified theory for the fundamental limits of a large family of combinatorial inference problems. We propose new structural packing entropies to characterize how the complexity of combinatorial graph structures impacts the corresponding minimax lower bounds. On the other hand, we propose a family of practical structural testing algorithms to match the obtained lower bounds. We use a case study of brain network analysis to illustrate the usefulness of these proposed methods. (Based on joint work with Matey Neykov, Kean Ming Tan, and Junwei Lu.)
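To make "testing a global structure" concrete, here is a toy sketch: given estimated edge scores (say, entries of an estimated precision matrix), threshold them and test the combinatorial property, here connectivity, of the surviving graph. The thresholding rule and names are illustrative; the actual tests in this work come with minimax guarantees:

```python
from collections import deque

def is_connected(n, edges):
    """BFS connectivity check on nodes 0..n-1 -- the global combinatorial
    property under test, as opposed to any single edge parameter."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, queue = {0}, deque([0])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n

def connectivity_test(n, edge_scores, tau):
    """Declare the graph connected if the edges whose estimated scores
    survive the threshold tau already connect all n nodes."""
    edges = [e for e, s in edge_scores.items() if abs(s) > tau]
    return is_connected(n, edges)
```

For example, with scores {(0,1): 0.9, (1,2): 0.8, (2,3): 0.7, (0,3): 0.05} and tau = 0.2 the test accepts connectivity, but weakening the (1,2) score to 0.1 breaks the surviving graph into two components and the test rejects.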
Faculty Host: Ryan Tibshirani
Just as machine learning and natural language processing are transforming industries and reshaping the economy, so too are they expanding the ability of the social sciences and humanities to learn about society, both past and present, by extracting insights from large-scale text corpora. This thesis argues that in order to do cross-disciplinary research that is high-quality, relevant, and informative, we will need to incorporate domain-specific theory and adapt our models and objectives to be able to take advantage of the knowledge held by collaborators and domain experts in other fields. While interdisciplinary teams have the potential to bridge the divide between disciplines, we can do much more in terms of both tailoring methods to the specific need of scholars in other fields and in enabling them to have greater control over all stages of the research cycle, especially data exploration, modeling, and visualization. By combining highly general approaches to modeling such as tensor factorization with black-box inference algorithms, we can expand the ability of non-experts to make use of more sophisticated models. In so doing, we can accelerate the social-scientific research cycle and deliver insights that will be useful both for further research and in answering real-world policy questions.
Noah A. Smith (Chair, University of Washington)
Geoffrey J. Gordon
Dan Jurafsky (Stanford University)