Using Unlabeled Data to Improve Text Classification
One key difficulty with text classification algorithms is that they require many hand-labeled examples to learn accurately. I demonstrate that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum likelihood models and classifiers from all the data---labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse.

Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model likelihood and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-likelihood models. I show that performance significantly improves by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-likelihood local maxima.
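The basic approach can be sketched as follows: train a generative classifier on the labeled documents, then alternate between an E-step (assign probabilistic class labels to the unlabeled documents) and an M-step (re-estimate the model from all documents, weighting unlabeled ones by their class posteriors). This is a minimal illustration using a multinomial naive Bayes model with Laplace smoothing; the function names, the toy data, and the fixed iteration count are my own illustrative choices, not from the thesis.

```python
import math
from collections import Counter

def train_nb(docs, post, vocab, n_classes):
    """M-step: estimate multinomial naive Bayes parameters from
    fractionally labeled documents. post[i][c] is the weight of
    class c for document i (1.0/0.0 for labeled docs)."""
    priors = [1.0] * n_classes            # Laplace-smoothed class priors
    word_counts = [Counter() for _ in range(n_classes)]
    for doc, p in zip(docs, post):
        for c in range(n_classes):
            priors[c] += p[c]
            for w in doc:
                word_counts[c][w] += p[c]
    total = sum(priors)
    priors = [p / total for p in priors]
    word_probs = []
    for c in range(n_classes):
        denom = sum(word_counts[c].values()) + len(vocab)
        word_probs.append({w: (word_counts[c][w] + 1.0) / denom for w in vocab})
    return priors, word_probs

def posterior(doc, priors, word_probs):
    """E-step for one document: P(class | doc) under naive Bayes,
    computed in log space for numerical stability."""
    logs = [math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
            for c in range(len(priors))]
    m = max(logs)
    exps = [math.exp(l - m) for l in logs]
    z = sum(exps)
    return [e / z for e in exps]

def em_classifier(labeled, unlabeled, n_classes, iters=10):
    """Semi-supervised EM: initialize from labeled data only, then
    alternate E and M steps over labeled and unlabeled data."""
    docs = [d for d, _ in labeled] + unlabeled
    vocab = {w for d in docs for w in d}
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)]
            for _, y in labeled]
    priors, word_probs = train_nb([d for d, _ in labeled], hard, vocab, n_classes)
    for _ in range(iters):
        soft = [posterior(d, priors, word_probs) for d in unlabeled]
        priors, word_probs = train_nb(docs, hard + soft, vocab, n_classes)
    return priors, word_probs
```

With one labeled document per class and a handful of unlabeled documents, EM pulls words that never appear in the labeled set (e.g. a word co-occurring with known class-0 words) into the appropriate class distribution, which is the mechanism by which unlabeled data improve the classifier. Note that this sketch converges only to a local maximum of the likelihood, which is exactly the failure mode the abstract's second problem addresses.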