Machine Learning Thesis Proposal
- Gates Hillman Centers
- ERIC WONG
- Ph.D. Student
- Machine Learning Department
- Carnegie Mellon University
Provable, structured, and efficient methods for robustness of deep networks to adversarial examples
While deep networks have contributed to major leaps in raw performance across various applications, they are also known to be quite brittle to targeted data perturbations: by adding a small amount of adversarial noise to the data, it is possible to drastically change the output of a deep network. The existence of these so-called adversarial examples, perturbed data points which fool the model, poses a serious risk for safety- and security-centric applications where reliability and robustness are critical. In this thesis, we present a number of approaches for mitigating the effect of adversarial examples, also known as adversarial defenses, which offer varying degrees and types of robustness.
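To make the notion of an adversarial example concrete, here is a minimal sketch (not from the thesis) on a toy linear classifier: a perturbation that is small in the L-infinity sense, constructed by moving each coordinate against the current prediction, flips the model's output. The weights, input, and budget `eps` below are made up for illustration.

```python
import numpy as np

# Toy linear classifier: predict sign(w @ x). Purely illustrative.
w = np.array([1.0, -1.0, 0.5, -0.5])
x = np.array([0.1, 0.0, 0.1, 0.0])  # clean input, classified positive

def predict(x):
    return np.sign(w @ x)

# FGSM-style L-infinity perturbation: step every coordinate by eps
# in the direction that most decreases the (positive-class) score.
eps = 0.2
x_adv = x - eps * np.sign(w)

print(predict(x))      # clean prediction
print(predict(x_adv))  # prediction on the perturbed input
```

Even though each coordinate of `x_adv` differs from `x` by at most 0.2, the predicted sign flips, which is exactly the failure mode adversarial defenses aim to prevent.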
From a practical standpoint, we present adversarial defenses which differ in the strength of the guarantee, the threat model which characterizes the set of potential adversarial examples, and the efficiency and simplicity of the defense. We start with the strongest type of guarantee, called provable adversarial defenses, showing that it is possible to compute duality-based certificates which guarantee that no adversarial examples exist within an L-p bounded region; these certificates are trainable and can be minimized to learn networks which are provably robust to adversarial attacks. In later work, we show that the approach is agnostic to the specific network and is applicable to arbitrary computational graphs, while scaling the approach to medium-sized convolutional networks. The approach is well-suited for L-infinity data perturbations but does not easily capture the correlations found in L-2 perturbations, so in proposed work we design a specialized certificate for L-2 attacks based on the ellipsoidal method.
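The duality-based certificates themselves are beyond the scope of this abstract, but the flavor of certification can be illustrated with a much simpler stand-in: interval bound propagation on a tiny one-layer ReLU network. If the worst-case lower bound on the output over the whole L-infinity ball is still positive, no adversarial example exists in that ball. All weights and the inputs below are invented for the sketch.

```python
import numpy as np

# One-layer ReLU network f(x) = w2 @ relu(W1 @ x + b1), with made-up weights.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, 0.1])
w2 = np.array([1.0, 1.0])

def certify(x, eps):
    """Return True if f is provably positive on the L-infinity ball of radius eps."""
    lo, hi = x - eps, x + eps                      # input interval
    # Linear layer: split weights by sign to bound the pre-activations.
    pos, neg = np.maximum(W1, 0), np.minimum(W1, 0)
    z_lo = pos @ lo + neg @ hi + b1
    z_hi = pos @ hi + neg @ lo + b1
    # ReLU is monotone, so interval bounds pass straight through.
    a_lo, a_hi = np.maximum(z_lo, 0), np.maximum(z_hi, 0)
    pos2, neg2 = np.maximum(w2, 0), np.minimum(w2, 0)
    out_lo = pos2 @ a_lo + neg2 @ a_hi             # worst-case output
    return out_lo > 0

x = np.array([1.0, 0.0])
print(certify(x, 0.05))  # small ball: certified
print(certify(x, 1.0))   # large ball: certificate fails
```

The duality-based certificates in the thesis are tighter than this interval relaxation, but the contract is the same: a sound bound on the worst case that, when it holds, rules out every attack in the ball.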
The latter part of this thesis focuses on defenses based on adversarial training, which do not come with formal guarantees but can learn networks which are empirically robust to adversarial attacks, and are typically more efficient, with better empirical performance, than provable defenses. We define a threat model called the Wasserstein adversarial example, which captures semantically meaningful image transformations, such as translations and rotations, that were not captured by existing threat models, and derive an efficient algorithm for projecting onto Wasserstein balls (a necessary step for performing adversarial training). Additionally, we demonstrate that adversarial training can be generalized to defend against multiple types of threats simultaneously. In ongoing work, we revisit even simpler and more efficient forms of adversarial training which were previously thought to be ineffective, and show that they can be used to learn robust networks at a fraction of the cost.
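The adversarial training recipe alternates an inner maximization (find a worst-case perturbation) with an outer minimization (train on it). A minimal sketch on logistic regression, using a single FGSM inner step in the spirit of the simpler, more efficient variants mentioned above; the data, step sizes, and budget are all invented for illustration, and for a linear model the single sign step is in fact the exact inner maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(5)
eps, lr = 0.1, 0.5
for _ in range(200):
    # Inner maximization: one FGSM step sends each point to the worst
    # corner of its L-infinity ball (exact for a linear model).
    grad_x = (sigmoid(X @ w) - y)[:, None] * w[None, :]
    X_adv = X + eps * np.sign(grad_x)
    # Outer minimization: gradient step on the adversarial loss.
    grad_w = X_adv.T @ (sigmoid(X_adv @ w) - y) / len(y)
    w -= lr * grad_w

acc = ((X @ w > 0) == (y > 0.5)).mean()
print(f"clean accuracy: {acc:.2f}")
```

Replacing the single sign step with several projected gradient steps recovers standard PGD adversarial training; projecting onto a Wasserstein ball instead of the L-infinity box is what requires the specialized projection algorithm developed in the thesis.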
Thesis Committee:
- J. Zico Kolter (Chair)
- Aleksander Madry (Massachusetts Institute of Technology)