Machine learning models are rapidly growing in size, leading to increased training and deployment costs. While the most popular approach to training compressed models is trying to guess good "lottery tickets," or sparse subnetworks, we revisit the low-rank factorization approach, in which weight matrices are replaced by products of smaller matrices. We extend recent analyses of the optimization of deep networks to motivate simple initialization and regularization schemes for improving the training of these factorized layers. Empirically, these methods yield higher accuracies than popular pruning and lottery ticket approaches at the same compression level. We further demonstrate their usefulness in two settings beyond model compression: simplifying knowledge distillation and training Transformer-based architectures such as BERT. This is joint work with Neil Tenenholtz, Lester Mackey, and Nicolo Fusi.
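To make the core idea concrete: a minimal sketch of low-rank factorization, where a weight matrix is replaced by a product of two smaller matrices. This is an illustration using truncated SVD in NumPy, not the initialization or regularization schemes from the talk; the matrix sizes and rank are arbitrary choices for the example.

```python
import numpy as np

def factorize(W, rank):
    """Approximate W (m x n) by A @ B, with A (m x rank) and B (rank x n),
    using a truncated SVD -- the best rank-r approximation in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))   # a hypothetical dense layer's weights
A, B = factorize(W, rank=64)

# Parameter count drops from 512*512 = 262144 to 2*512*64 = 65536 (4x smaller).
print(W.size, A.size + B.size)
```

Replacing a dense layer's matrix-vector product `W @ x` with `A @ (B @ x)` also cuts the cost of each forward pass by the same factor when the rank is small relative to the layer dimensions.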
Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.
Zoom Participation. See announcement.