Robotics Thesis Proposal

  • Remote Access - Zoom
  • Virtual Presentation - ET

Visual Representation and Recognition without Human Supervision

Visual recognition models have seen great advancements by relying on large-scale, carefully curated datasets with human annotations. Most computer vision models leverage human supervision either to construct strong initial representations (e.g., using the ImageNet dataset) or to model the visual concepts relevant for downstream tasks (e.g., MS-COCO for object detection). In this thesis, we address two key challenges that arise from this observation: 1) can we construct better representations without human supervision? and 2) can we minimize the use of human supervision for downstream tasks?

First, we present a modular neural network architecture for image classification that leverages the compositional nature of visual concepts to construct classifiers for unseen concepts. Next, we present a weakly-supervised approach to describe videos via association, i.e., by identifying dense spatial and temporal correspondences to reference videos. Finally, we present a framework to quantify the various invariances encoded in representations. Based on inferences from this framework, we present an approach that leverages videos to improve the invariances learned by existing self-supervised learning methods.

To further improve representations, we first observe that self-supervised learning methods focus on constructing monolithic representations that are useful for a wide range of downstream tasks. However, visual concepts in the real world are complex and exhibit multiple axes of similarity. We propose a self-supervised learning approach that learns multiple representations, each encoding a diverse and distinct axis of similarity.

So far, the two challenges of minimizing human supervision for representation learning and downstream visual recognition tasks have been addressed as independent problems. In the final part of this thesis, we propose a unified solution for addressing the task of detecting objects while simultaneously learning the representation. To accomplish this, we design a self-supervised learning approach to discover mid-level patches and objects from a collection of images without requiring any pretraining or human supervision.

Thesis Committee
Abhinav Gupta (Chair)
Deva Ramanan
David Held
Kristen Grauman (University of Texas at Austin)
Alexei Efros (University of California, Berkeley)

