Robotics Thesis Defense
- Remote Access - Zoom
- Virtual Presentation
- AAYUSH BANSAL
- Ph.D. Student
- Robotics Institute
- Carnegie Mellon University
Unsupervised Learning of the 4D Audio-Visual World from Sparse Unconstrained Real-World Samples
Humans can easily observe, explore, and analyze the world we live in, yet we struggle to share those observations, explorations, and analyses with others. This thesis introduces Computational Studio, computational machinery that can understand, explore, and create the four-dimensional audio-visual world. This allows: (1) humans to communicate with other humans without any loss of information; and (2) humans to communicate with machines effectively. Computational Studio is an environment that allows non-specialists to construct and creatively edit the 4D audio-visual world from sparse audio and video samples. The thesis has four essential components: (1) how can we capture the 4D visual world? (2) how can we synthesize the audio-visual world using examples? (3) how can we interactively explore the audio-visual world? and (4) how can we continually learn the audio-visual world from sparse real-world samples without any supervision?
The first part of this thesis introduces capturing, browsing, and reconstructing the 4D visual world from sparse real-world samples. We bring together insights from both classical image-based rendering and current neural rendering approaches. Crucial to our work are two components: (1) fusing information from sparse multiple views to create dense 3D point clouds; and (2) fusing multi-view information to create new views. Though the event is captured from discrete viewpoints, the proposed formulation allows us to do dense 3D reconstruction and 4D visualization of dynamic events. It also enables us to move continuously through the space-time of the event, which facilitates: (1) freezing the time and exploring views; (2) freezing a view and moving through time; and (3) simultaneously changing both time and view. Finally, complete control of the 4D visual world allows us to do geometrically consistent content editing.
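As a minimal illustration of the three navigation modes listed above, the sketch below generates (time, view) sample paths. It is not the thesis implementation: the function names are hypothetical, and reducing a camera view to a single scalar (for example, an angle along a camera arc) is an assumption made only to keep the example small.

```python
from typing import List, Tuple

SpaceTimePoint = Tuple[float, float]  # (time, view); the scalar view is an assumption


def linspace(start: float, stop: float, steps: int) -> List[float]:
    """Return `steps` evenly spaced values from start to stop, inclusive."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if steps == 1:
        return [start]
    step = (stop - start) / (steps - 1)
    return [start + i * step for i in range(steps)]


def freeze_time(time: float, view_start: float, view_stop: float,
                steps: int) -> List[SpaceTimePoint]:
    """Mode 1: hold time fixed and sweep through views."""
    return [(time, view) for view in linspace(view_start, view_stop, steps)]


def freeze_view(view: float, time_start: float, time_stop: float,
                steps: int) -> List[SpaceTimePoint]:
    """Mode 2: hold the view fixed and move through time."""
    return [(time, view) for time in linspace(time_start, time_stop, steps)]


def move_both(time_start: float, time_stop: float,
              view_start: float, view_stop: float,
              steps: int) -> List[SpaceTimePoint]:
    """Mode 3: change time and view simultaneously."""
    times = linspace(time_start, time_stop, steps)
    views = linspace(view_start, view_stop, steps)
    return list(zip(times, views))
```

Each returned point would then be handed to whatever renderer produces the image for that time and view; the renderer itself is outside the scope of this sketch.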
The second part of this thesis details the example-based synthesis of the audio-visual world in an unsupervised manner. Example-based audio-visual synthesis allows us to express ourselves easily. In this part, we introduce Recycle-GAN, which combines spatial and temporal information via adversarial losses for unsupervised video retargeting. This enables us to translate content from one domain to another while preserving the style native to the target domain. For example, if we transfer the contents of John Oliver’s speech to Stephen Colbert, then the generated content/speech should be in Stephen Colbert’s style. We then extend our work to audio-visual synthesis using Exemplar Autoencoders, which generalize even to unseen examples at test time. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody (emotions and ambiance), and visual appearance of a specific target exemplar speech. Finally, we introduce PixelNN, a semi-parametric model that enables generating multiple outputs from a given input and examples.
The third part of this thesis introduces human-controllable representations that allow a human user to interact with visual data and create new experiences. First, we introduce OpenShapes, which enables a user to interactively synthesize new images using a paint-brush and a drag-and-drop tool. OpenShapes runs on a single-core CPU to generate multiple pictures from a user-generated label map. We then present simple video-specific autoencoders that enable human-controllable video exploration. This includes a wide variety of analytical tasks such as (but not limited to) spatial and temporal super-resolution, spatial and temporal extrapolation, object removal, video textures, average video exploration, video tapestries, and correspondence estimation within and across videos. Prior work has looked at each of these problems independently and proposed different formulations. In this work, we observe that a simple autoencoder trained (from scratch) on multiple frames of a specific video enables one to perform a large variety of video processing and editing tasks without optimizing for any single task. Finally, we present a framework that allows us to extract a wide range of low-, mid-, and high-level semantic and geometric scene cues that can be understood and expressed by both humans and machines.
The last part of this thesis extends our work on continual learning of the audio-visual world to learning exemplar visual concepts and visual-recognition tasks. We explore semi-supervised learning of deep representations given a few labeled examples of a task and a (potentially) infinite stream of unlabeled examples. Our approach continually evolves task-specific representations by constructing a schedule of learning updates that iterates between pre-training on novel segments of the stream and fine-tuning on the small and fixed labeled dataset. Contrary to popular approaches in semi-supervised learning that use massive computing resources for storing and processing data, streaming learning requires modest computational infrastructure since it naturally breaks up massive datasets into slices that are manageable for processing. From this perspective, continual learning on streams can help democratize research and development for scalable, lifelong ML.
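The schedule of learning updates described above can be sketched as a loop that alternates between the two phases. This is only an illustrative skeleton under stated assumptions: `pretrain` and `finetune` are hypothetical caller-supplied functions, and `segment_size` and `max_rounds` are hypothetical parameters; the announcement does not specify how either phase is carried out in the thesis.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

Sample = TypeVar("Sample")
Labeled = TypeVar("Labeled")
Model = TypeVar("Model")


def stream_segments(stream: Iterable[Sample],
                    segment_size: int) -> Iterator[List[Sample]]:
    """Cut a (possibly unbounded) stream into consecutive segments.

    Only one segment is held in memory at a time, which is what keeps
    the infrastructure requirements modest.
    """
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    iterator = iter(stream)
    while True:
        segment = list(islice(iterator, segment_size))
        if not segment:
            return
        yield segment


def learn_from_stream(model: Model,
                      stream: Iterable[Sample],
                      labeled: Sequence[Labeled],
                      pretrain: Callable[[Model, List[Sample]], Model],
                      finetune: Callable[[Model, Sequence[Labeled]], Model],
                      segment_size: int,
                      max_rounds: int) -> Model:
    """Alternate pre-training on a novel segment with fine-tuning on labels.

    `pretrain` and `finetune` are hypothetical placeholders for the two
    phases; `max_rounds` bounds the loop when the stream never ends.
    """
    for segment in islice(stream_segments(stream, segment_size), max_rounds):
        model = pretrain(model, segment)   # unlabeled, previously unseen data
        model = finetune(model, labeled)   # small, fixed labeled dataset
    return model
```

Because the stream is consumed lazily, the same loop works whether the unlabeled data is a finite list or an unbounded generator.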
Computational Studio is a first step towards unlocking the full degree of creative imagination, which is currently confined to the human mind by the limits of an individual’s expressivity and skills. It has the potential to change the way we communicate audio-visually with other humans and machines.
Deva K. Ramanan (Co-Chair)
Yaser A. Sheikh (Co-Chair)
Martial H. Hebert
David A. Forsyth (University of Illinois Urbana-Champaign)
Alexei A. Efros (University of California, Berkeley)
Zoom Participation. See announcement.