Language Technologies Ph.D. Thesis Defense
- Gates Hillman Centers and Zoom
- Reddy Conference Room 4405 and Virtual
- JUNJIE HU
- Ph.D. Student
- Language Technologies Institute
- Carnegie Mellon University
Leveraging Word and Phrase Alignments for Multilingual Learning
Recent years have witnessed impressive success in natural language processing (NLP) thanks to advances in neural networks and the availability of large amounts of labeled data. However, many NLP systems have predominantly focused on high-resource languages (e.g., English, Chinese) that have large, computationally accessible collections of labeled data for training. While the achievements on high-resource languages are exciting, there are more than 6,900 languages in the world, and the majority of them have far fewer resources for training deep neural networks. In fact, it is often expensive, or sometimes infeasible, to collect labeled data written in all possible languages. As a result, this data scarcity issue limits the generalization of NLP systems in many multilingual scenarios. Moreover, as models may be used to process text from a wide range of domains (e.g., social media or medical articles), the data scarcity issue is further exacerbated by the domain shift between the training and test data.
In this thesis, with the goal of improving the generalization ability of NLP models to alleviate the aforementioned challenges, we exploit word and phrase alignments to train neural NLP models (e.g., neural machine translation or contextualized language models), and provide evaluation methods for examining the generalization capabilities of such models over diverse application scenarios. This thesis contains two parts. The first part explores cross-lingual generalization for language understanding. In particular, we examine the ability of pre-trained multilingual representations to transfer learned knowledge from a high-resource language to other languages. To this end, we first introduce a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. Second, we leverage word and sentence alignments from parallel data to improve the multilingual representations for language understanding tasks such as those included in our benchmark. The second part of the thesis is devoted to leveraging alignment information for machine translation, a popular and useful language generation task. In particular, we focus on learning to translate aligned words and phrases between two languages with fewer parallel sentences. To accomplish this goal, we exploit techniques to obtain aligned words and phrases from monolingual data, knowledge bases, or crowdsourcing, and use them to improve translation systems.
Graham Neubig (Chair)
Kyunghyun Cho (New York University)
Zoom Participation. See announcement.