Master of Science in Robotics Thesis Talk
- Remote Access - Zoom
- Virtual Presentation - ET
- HAOCHEN WANG
- Master's Student
- Robotics Institute
- Carnegie Mellon University
Audiovisual ontology and robust representations via cross-modal fusion
The wail of an ambulance siren, the flash of its lights, the hum of an accelerating car: important events often come to us simultaneously through sight and sound. We first consider the problem of identifying these events from raw, unlabeled audiovisual data of agents interacting with urban environments. Our goal is to discover a suite of multimodal events that autonomous agents should be aware of. We argue that multimodal events such as emergency vehicle sirens, honks from interacting actors, and reverse backup beepers from large trucks should all be added to current perception ontologies, which tend to be dominated by visual, event-driven categories. We show that this discovery task can be formulated as a multimodal self-supervised learning problem, and demonstrate our technique on a dataset containing hundreds of hours of in-the-wild urban walking videos. In comparisons with baseline methods, we show that the resulting model discovers a significantly larger number of "actionable" events that affect behavior. Next, we note that multimodal signals provide a natural source of redundancy and complementarity in decision making. Increasingly, neural network architectures are trained on these rich input domains, and state-of-the-art designs often fuse information at multiple points along a feedforward pass. However, at test time, when a modality is delayed, corrupted, or lost, as can happen in real-world safety-critical deployments, we find that these networks suffer a significant drop in accuracy because their merged representations lack robustness. We discuss an approach that explicitly models the probabilistic co-occurrence of signals between modalities to provide a simple retrieval-based mechanism that recovers from modality loss. Results on established benchmarks demonstrate that our approach works with off-the-shelf models.
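To give a flavor of the retrieval-based recovery idea described above, the following is a minimal illustrative sketch (not the author's actual code): a memory bank of paired embeddings captures cross-modal co-occurrence, and when one modality is lost at test time, the nearest neighbor in the surviving modality supplies a surrogate embedding for the missing one. All class and variable names here are assumptions for the example.

```python
import numpy as np

class ModalityRecovery:
    """Hypothetical sketch: store paired (visual, audio) embeddings and,
    when audio is lost at test time, retrieve the visual nearest neighbor
    and substitute its co-occurring audio embedding."""

    def __init__(self, visual_bank: np.ndarray, audio_bank: np.ndarray):
        # L2-normalize the visual bank so a dot product is cosine similarity.
        self.visual_bank = visual_bank / np.linalg.norm(
            visual_bank, axis=1, keepdims=True
        )
        self.audio_bank = audio_bank

    def recover_audio(self, visual_query: np.ndarray) -> np.ndarray:
        # Find the stored visual embedding closest to the query ...
        q = visual_query / np.linalg.norm(visual_query)
        idx = int(np.argmax(self.visual_bank @ q))
        # ... and return its paired audio embedding as a surrogate.
        return self.audio_bank[idx]

# Toy usage: two stored pairs; the query is close to the first visual vector,
# so the first pair's audio embedding is returned.
vis = np.array([[1.0, 0.0], [0.0, 1.0]])
aud = np.array([[5.0, 5.0], [7.0, 7.0]])
rec = ModalityRecovery(vis, aud)
print(rec.recover_audio(np.array([0.9, 0.1])))  # → [5. 5.]
```

Because the mechanism only manipulates embeddings, it can sit in front of any off-the-shelf fusion model, which is consistent with the plug-and-play claim in the abstract.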
Deva Ramanan (advisor)
Zoom Participation. See announcement.