Machine Learning Thesis Defense
- Remote Access - Zoom
- Virtual Presentation - ET
- YAO-HUNG (HUBERT) TSAI
- Ph.D. Candidate
- Machine Learning Department
- Carnegie Mellon University
Cross-view Learning with Limited Supervision
Real-world data is often multi-view, with each view representing a different perspective of the data. These views can be different modalities, different sets of features, or different viewpoints. For instance, human communication contains heterogeneous sources of information (views as different modalities) spanning tone of voice, facial gestures, and spoken words. As another example, autonomous systems collect features from various sensors, such as LiDAR, RADAR, and RGB signals (views as different sets of features). As a third example, surveillance cameras record scenes from multiple angles (views as different viewpoints). Learning representations from multi-view data, dubbed cross-view learning, requires modeling the complementarity within views and understanding the relationships across them, such as knowing which information is shared among different views and which information is unique to a particular view. This process is challenging due to the heterogeneity of the data and the complex structures that link the different views (e.g., asynchrony between views). In this thesis, we study cross-view learning in scenarios where label supervision is not available for downstream tasks but pairing information between views is (i.e., limited supervision). We focus on these scenarios because they reflect reality in many fields, where collecting a large number of labels tends to be expensive, both computationally and effort-wise. To address this significant challenge of cross-view learning with limited supervision, we scaffold it into three core technical challenges.
The first challenge, which we refer to as cross-view heterogeneous structures, focuses on learning to align and synchronize different views and on disentangling complementary factors from multi-view data. For simplicity, we study this first challenge under a fully supervised setup. We then note that another important aspect of modeling the complementarity among views is quantifying the relationships across them. This leads us to the second challenge: relationship quantification. We focus on quantifying these relationships via mutual information, studying tractable and scalable estimators for it. Last, we discuss the third challenge: learning with limited supervision. We transition from the supervised to the unsupervised setting, where the only information comes from pairings between views, without labels for the downstream task. We present how to learn good representations from multi-view data by considering the complementarity across views when labels or downstream supervision are not available. Within this challenge, we may sometimes have access to additional information beyond the data itself. This additional information can be auxiliary or undesirable. For instance, auxiliary information can be the hashtags attached to Instagram images, and undesirable information can be personal information contained in physiological data. We show how to either leverage the auxiliary information to learn better representations or remove the undesirable information from the representations. The thesis discusses our contributions to all three challenges.
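To make the mutual-information estimation in the second challenge concrete, below is a minimal sketch of the InfoNCE lower bound, one widely used tractable and scalable estimator computed from paired samples of two views. This is an illustration, not the thesis's own estimator: the dot-product critic and the toy Gaussian "views" are assumptions made here for brevity, whereas in practice the critic is a learned network trained to maximize the bound.

```python
import numpy as np

def infonce_lower_bound(x, y):
    """InfoNCE lower bound on I(X; Y) from n paired samples.

    x, y: arrays of shape (n, d); (x[i], y[i]) is a positive pair, and
    (x[i], y[j]) with j != i serve as negatives. A simple dot-product
    critic is assumed here for illustration.
    """
    scores = x @ y.T                                     # (n, n) critic scores
    n = scores.shape[0]
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise log-softmax: how strongly each x[i] identifies its pair y[i]
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return np.log(n) + np.diag(log_softmax).mean()

# Toy check: two "views" sharing content vs. two independent views.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 16))
paired = x + 0.1 * rng.standard_normal((128, 16))  # correlated view of x
unpaired = rng.standard_normal((128, 16))          # independent view
b_dep = infonce_lower_bound(x, paired)             # high, near the log(n) cap
b_ind = infonce_lower_bound(x, unpaired)           # much lower, can be negative
```

Note that the estimate can never exceed log(n) for a batch of n pairs, a known limitation of this family of estimators and one reason tractable alternatives are worth studying.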
This thesis opens up many avenues for future research. One direction is scaling multi-view representation learning methods up to a large number of views, such as learning representations from aircraft sensor signals that track oil temperature, fuel pressure, airspeed, lightning, vibration, and so on. Next, since most theoretical analyses of self-supervised learning lie mainly within the visual modality, another direction is establishing theoretical bases for self-supervised learning beyond it, such as in the textual and acoustic modalities. Last, most existing multi-view learning literature focuses primarily on perception and less on action generation (e.g., action generation for navigation). Hence, a future direction is multi-view representation learning for action generation.
Ruslan Salakhutdinov (Co-Chair)
Louis-Philippe Morency (Co-Chair)
Jimmy Ba (University of Toronto)
Zoom Participation. See announcement.