Master of Science in Robotics Thesis Talk

  • Remote Access - Zoom
  • Virtual Presentation - ET
  • Masters Student
  • Robotics Institute
  • Carnegie Mellon University
Master's Thesis Presentation

Exemplar free video retrieval

Video retrieval of activities has a wide range of applications. In the traditional mode of operation, a collection of example videos describing the activities is given, and the retrieval technique identifies other samples in a dataset that semantically match the examples provided. However, retrieval using a collection of example videos might not always be feasible, especially in the following two scenarios. The first scenario is when we only have a textual description of a class of videos. The second scenario occurs when the activities under consideration are not temporally localized, making them harder to collect and annotate. For instance, most commonly used action recognition datasets, such as Kinetics, rely on public video sources like YouTube for data collection, where all the categories are well localized and can be easily searched and annotated. This strategy does not extend to more complex activities like theft and object abandonment, both of which are not temporally localized and are hard to annotate.

In this thesis, we describe two video retrieval approaches that work in the absence of visual examples. The first is a text-based retrieval approach, where a text query allows us to bypass the use of a visual exemplar. Text embedding models like GPT-2/GPT-3 are not trained in a dataset-specific manner, i.e., they are trained on large amounts of text from the internet and contain generalizable knowledge about real-world activities. We leverage this to develop retrieval models that work in the zero-shot/surprise setup. Since surprise activities are not known at training time, the activity description or activity name is used at test time to construct a textual embedding. First, proposals are extracted from the video database using a TSM-based model. For each proposal, a visual embedding is computed, and the similarity between the visual and textual embeddings is used for retrieval.
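The final retrieval step described above can be illustrated with a minimal sketch. It assumes that the visual and textual embeddings already live in a shared vector space and that cosine similarity is the similarity measure; the abstract does not specify either, and the names `cosine_similarity`, `rank_proposals`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def rank_proposals(text_embedding, proposal_embeddings, top_k=3):
    """Return the top_k (proposal_id, score) pairs, best match first."""
    scored = [
        (proposal_id, cosine_similarity(text_embedding, embedding))
        for proposal_id, embedding in proposal_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data: a hypothetical text-query embedding and three proposal embeddings.
query = [1.0, 0.0, 0.0]
proposals = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.5, 0.5, 0.0],
}
ranking = rank_proposals(query, proposals, top_k=2)
for proposal_id, score in ranking:
    print(proposal_id, round(score, 3))
```

In a real system the toy vectors would be replaced by the outputs of the text model and the proposal encoder; only the ranking logic is shown here.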

The second approach that we consider is a rule-based unsupervised retrieval framework for categories specific to object transfer. This works by first detecting the objects and persons on a frame-by-frame basis, followed by constructing short, high-confidence tracklets. These tracklets are then connected in a soft fashion, where each tracklet can be associated with multiple other tracklets such that its association probabilities sum to 1. For the soft-tracking-based method, an annotation pipeline is built that facilitates fast annotation of tracks. It assumes that high-confidence tracklets are readily available, which can be achieved by using a high association threshold in existing tracking algorithms. The annotation platform then only requires user input to map tracklets to one another.
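One way to picture the soft association between tracklets is the sketch below. It assumes that each tracklet has some affinity score to each candidate successor and that a softmax turns those scores into probabilities summing to 1. The abstract does not say how the probabilities are computed, so the softmax, the `temperature` parameter, and all names here are illustrative assumptions.

```python
import math


def soft_associations(affinities, temperature=1.0):
    """Turn per-tracklet affinity scores into probabilities that sum to 1.

    affinities maps a source tracklet id to {candidate tracklet id: score}.
    The softmax normalization is an assumption made for illustration.
    """
    result = {}
    for source, candidates in affinities.items():
        if not candidates:
            result[source] = {}
            continue
        # Subtract the maximum score for numerical stability.
        max_score = max(candidates.values())
        weights = {
            target: math.exp((score - max_score) / temperature)
            for target, score in candidates.items()
        }
        total = sum(weights.values())
        result[source] = {
            target: weight / total for target, weight in weights.items()
        }
    return result


# Toy data: hypothetical affinity scores between short tracklets.
affinities = {
    "t1": {"t2": 2.0, "t3": 0.5},
    "t4": {"t5": 1.0},
}
links = soft_associations(affinities)
for source, targets in links.items():
    print(source, {t: round(p, 3) for t, p in targets.items()})
```

A hard tracker would keep only the highest-scoring link per tracklet; keeping the full distribution lets later rules reason over several plausible continuations.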

The two approaches discussed successfully avoid using visual exemplars, and thereby also avoid the shortcomings and restrictions that come with needing them. This demonstrates the plausibility and effectiveness of exemplar-free approaches.

Thesis Committee:
Deva Ramanan (Co-Chair)
Aswin Sankaranarayanan (Co-Chair)
David Held
Achal Dave

Zoom Participation. See announcement.

For More Information, Please Contact: