Object discovery and multiple object tracking (MOT) are two closely related, fundamental problems in computer vision, and both are crucial for video understanding. Most existing methods rely on supervised training with human annotations, which are laborious and expensive to collect. In this thesis, we propose a self-supervised method for detecting and tracking moving objects in unlabelled RGB-D videos. The method begins with classic handcrafted techniques for segmenting objects using motion cues: we estimate optical flow and camera motion, and conservatively segment regions that appear to be moving independently of the background. Treating these initial segments as pseudo-labels, we train an ensemble of appearance-based 2D and 3D detectors under heavy data augmentation. We use this ensemble to detect new instances of the ``moving'' type, even if they are not currently moving, and add these as new pseudo-labels. Our method is an expectation-maximization algorithm: in the expectation step we fire all modules and look for agreement among them, and in the maximization step we re-train the modules to improve this agreement. The constraint of ensemble agreement helps combat contamination of the generated pseudo-labels (during the E step), and data augmentation helps the modules generalize to yet-unlabelled data (during the M step). We compare against existing unsupervised object discovery and tracking methods on challenging videos from CATER and KITTI, and show strong improvements over the state of the art.
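As a rough illustration of the agreement check in the E step, the sketch below keeps a candidate box as a pseudo-label only when every module in the ensemble proposes an overlapping box. The abstract does not state how agreement is measured, so the 2D intersection-over-union criterion, the 0.5 threshold, and the names `iou` and `agreed_pseudo_labels` are all assumptions made for this example, not the thesis implementation.

```python
from typing import List, Tuple

# Axis-aligned 2D box as (x1, y1, x2, y2). The box format is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreed_pseudo_labels(
    module_outputs: List[List[Box]], thresh: float = 0.5
) -> List[Box]:
    """Hypothetical E-step filter.

    Keeps each box from the first module only if every other module
    proposes at least one box overlapping it with IoU >= thresh.
    """
    if not module_outputs:
        return []
    first, rest = module_outputs[0], module_outputs[1:]
    return [
        box
        for box in first
        if all(
            any(iou(box, other) >= thresh for other in outputs)
            for outputs in rest
        )
    ]


# Toy outputs from three modules on a single frame (made-up numbers).
motion_segments = [(0, 0, 10, 10), (50, 50, 60, 60)]
detector_2d = [(1, 1, 10, 10)]
detector_3d = [(0, 0, 9, 10), (50, 50, 60, 60)]

labels = agreed_pseudo_labels([motion_segments, detector_2d, detector_3d])
```

Here the second motion segment is dropped because the 2D detector proposes nothing that overlaps it, which is the conservative behavior the agreement constraint is meant to provide. In the M step, the surviving boxes would serve as training targets for re-training the modules under augmentation.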
Katerina Fragkiadaki (Advisor)
Kris M. Kitani
Zoom Participation. See announcement.