Machine Learning Thesis Proposal
- Gates Hillman Centers
- McWilliams Classroom 4303
- HSIAO-YU (FISH) TUNG
- Ph.D. Student
- Machine Learning Department
- Carnegie Mellon University
Learning Generalizable Visual Representation for Embodied Agents and through Embodied Agents
Humans can use their prior learning to adapt to new environments and acquire new skills. This ability allows us to perform a wide variety of tasks in diverse settings. In contrast, although deep learning models achieve human-level performance on tasks like image classification and text-to-text translation, they show limited ability to generalize to unseen environments and tasks. In this thesis, we are interested in endowing machines with this human-like generalization ability, in particular embodied agents: agents that interact with their environment through a physical body. Our key insight is that, to build machines that can perform any reasonable task in any environment, the machines must possess the ability to move, so they can continually develop knowledge and acquire skills through self-collected data.
To build these embodied agents, we need to address two fundamental questions: How do we improve over existing neural network models to obtain models that generalize better? And how do we learn or improve these models from the data the agents collect? To improve generalization, we propose neural network models that encode the object permanence bias present in the input data directly in their network architectures. Our models construct a view-invariant 3D feature representation of a scene, even though individual 2D observations can change rapidly during camera motion or under occlusion. We show that these models outperform and generalize better than existing methods without such biases on several tasks that require 3D understanding. To answer the second question, we propose to learn or improve the features with self-supervised tasks that use image prediction as the supervisory signal. We show that the models can improve their features, and thereby their task performance, in a self-supervised manner. We further explore unsupervised methods that allow the models to develop their visual perceptual skills and learn intuitive physics models entirely from their own collected data.
We discuss how these models that construct view-invariant 3D feature representations are critical to developing embodied agents that can see, act, and understand language. Besides achieving superior performance over existing methods on several of the tasks involved, the models can perform tasks beyond what they have been trained on: our model can reason about whether an object configuration is physically plausible and can plan directly in the latent space, leveraging the fact that the representations are explicitly 3D, where a certain level of affordance reasoning is immediately available. It is unclear how to perform such reasoning in the 2D feature spaces used by most state-of-the-art computer vision models. This reasoning ability further allows the model to resolve ambiguity in language, determine the plausibility of an object-placement description, and conduct free-space path planning amid complex object configurations. After training, the embodied agents can detect and manipulate objects, follow human instructions, and continually develop their knowledge and acquire skills through self-collected data.
Katarina Fragkiadaki (Chair)
Jitendra Malik (University of California, Berkeley)