Language Technologies Ph.D. Thesis Defense
- Remote Access - Zoom
- Virtual Presentation - ET
- JINGZHOU LIU
- Ph.D. Student
- Language Technologies Institute
- Carnegie Mellon University
Improving Text Summarization with Limited Resources
The task of text summarization, condensing a longer document or a set of documents into a shorter textual summary that preserves the main concepts of the input, has been a central problem in natural language processing (NLP). In recent years, great progress has been made in this area, thanks in part to significant advances in sequence-to-sequence learning, reinforcement-learning-based training, and large-scale language models. However, a fundamental limitation of these supervised methods is that their success relies heavily on the availability of large training corpora with high-quality human-written summaries, which are extremely expensive to obtain.
In this thesis, I aim to address this data-cost issue and improve text summarization with limited resources. I focus on three questions: (1) How can we leverage unannotated data to enhance the learning of summarization models? (2) How can we improve a summarization model with less expensive data? (3) How can we define and solve a new multi-level summarization task without additional human annotation cost? To address these questions, the thesis consists of three parts.
(1) The first part investigates unsupervised abstractive and extractive text summarization, leveraging unannotated data. In Chapter 2, I propose three novel auxiliary pre-training tasks that exploit the structure of Wikipedia articles to pre-train an unsupervised abstractive summarization model, which achieves new state-of-the-art results among unsupervised abstractive methods. In Chapter 3, I augment graph-based extractive summarization models with sentence-distance information, enabling the sentence graph to better capture the structure of the input document; with distance information incorporated, the proposed graph-based model achieves new state-of-the-art results in unsupervised extractive summarization.
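The distance-augmented graph idea in Chapter 3 can be illustrated with a minimal TextRank-style sketch. This is not the thesis model: the word-overlap similarity, the exponential distance decay, and all parameter values below are illustrative assumptions.

```python
# Illustrative sketch: graph-based extractive ranking where edge weights
# combine sentence similarity with a distance-based decay, so nearby
# sentences influence each other more. Similarity measure and decay
# factor are assumptions, not the model proposed in the thesis.
import math

def similarity(s1, s2):
    """Word-overlap similarity between two token sets (TextRank-style)."""
    w1, w2 = set(s1), set(s2)
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def rank_sentences(sentences, decay=0.5, d=0.85, iters=50):
    """Power-iteration PageRank over a sentence graph whose edge
    weights are similarity scaled by decay ** |i - j|."""
    n = len(sentences)
    toks = [s.lower().split() for s in sentences]
    # Edge weight: similarity damped by the distance between sentences.
    w = [[similarity(toks[i], toks[j]) * decay ** abs(i - j) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(w[j])
                if w[j][i] > 0 and out > 0:
                    rank += scores[j] * w[j][i] / out
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

doc = [
    "Text summarization condenses a document into a short summary.",
    "Graph-based methods rank sentences by centrality in a sentence graph.",
    "Adding sentence distance lets the graph reflect document structure.",
    "Unrelated filler sentence about the weather today.",
]
scores = rank_sentences(doc)
best = max(range(len(doc)), key=scores.__getitem__)  # top-ranked sentence
```

An extractive summary is then formed by taking the top-scoring sentences in their original document order.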
(2) The second part leverages less expensive text classification data to enhance the text summarization task. In Chapter 4, I study the problem of eXtreme Multi-Label Text Classification (XMTC) and propose the first successful deep learning model for this task, XML-CNN, which achieves new state-of-the-art results in XMTC. In Chapter 5, I use the proposed XML-CNN model and view its output as a topical distribution over the input document. I propose a multi-task and a policy-gradient approach to enforcing topic consistency between the input document and the summary produced by a supervised summarization model. Both approaches significantly improve a state-of-the-art supervised summarization model.
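The policy-gradient approach in Chapter 5 can be sketched as a REINFORCE-style surrogate loss whose reward is the topic consistency between document and summary. The cosine reward, the fixed toy distributions, and the token log-probabilities below are stand-ins for illustration; the thesis uses XML-CNN outputs, which are not reproduced here.

```python
# Hypothetical sketch of a topic-consistency reward in a REINFORCE-style
# update. Distributions and log-probabilities are toy stand-ins; the
# choice of cosine similarity as the reward is an assumption.
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def reinforce_loss(log_probs, doc_topics, summary_topics, baseline=0.0):
    """REINFORCE surrogate: -(reward - baseline) * sum(log_probs).
    Its gradient raises the sampled summary's likelihood in proportion
    to how topically consistent the summary is with the document."""
    reward = cosine(doc_topics, summary_topics)
    return -(reward - baseline) * sum(log_probs)

doc_topics = [0.7, 0.2, 0.1]    # stand-in for the classifier's document topics
good_summary = [0.6, 0.3, 0.1]  # topically consistent sample
bad_summary = [0.1, 0.2, 0.7]   # topically inconsistent sample
log_probs = [-0.5, -0.7, -0.3]  # token log-probs of the sampled summary

loss_good = reinforce_loss(log_probs, doc_topics, good_summary)
loss_bad = reinforce_loss(log_probs, doc_topics, bad_summary)
```

In practice the reward would be computed from the classifier's predictions for the generated summary, and a learned baseline would reduce gradient variance.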
(3) In the last part, I define the new task of Multi-Level Summarization (MLS), in which a system must produce summaries at multiple levels of granularity: higher-level summaries capture the most essential information, and each lower-level summary expands the level above it with more detail. I create the first MLS dataset by leveraging the structure of Wikipedia pages to extract multi-level summaries without human annotation effort. I also propose three novel neural models for jointly optimizing multi-level summaries and show that they significantly and consistently outperform the baseline models.
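One plausible way to derive such multi-level summaries from article structure, shown purely as an illustrative sketch rather than the thesis pipeline, is to treat the lead section as the most condensed level and section leads as the next level down.

```python
# Illustrative sketch (not the thesis's dataset-construction method):
# deriving nested summaries from a Wikipedia-like article structure.
# The 'lead'/'sections' schema is a hypothetical representation.
def multilevel_summaries(article):
    """Return summaries from most condensed (level 0) to most detailed.
    article: dict with 'lead' (str) and 'sections', each a dict with
    'title', 'lead', and 'body'."""
    level0 = article["lead"]
    level1 = " ".join(sec["lead"] for sec in article["sections"])
    level2 = " ".join(sec["lead"] + " " + sec["body"]
                      for sec in article["sections"])
    return [level0, level1, level2]

article = {
    "lead": "Text summarization condenses documents.",
    "sections": [
        {"title": "Methods", "lead": "Models are abstractive or extractive.",
         "body": "Abstractive models generate new text."},
        {"title": "Data", "lead": "Large corpora are costly to annotate.",
         "body": "Wikipedia structure offers free supervision."},
    ],
}
levels = multilevel_summaries(article)
# Each successive level expands the previous one with more detail.
```

A joint model would then be trained to generate all levels together, so that lower levels stay consistent with the higher-level summaries they expand.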
Yiming Yang (Chair)
David R. Mortensen
Dominic Hughes (Apple Inc.)
Zoom Participation. See announcement.