Sparse DNNs with Improved Adversarial Robustness

Yiwen Guo, Chao Zhang, Changshui Zhang, Yurong Chen

Introduction

Although deep neural networks (DNNs) have advanced the state-of-the-art of many artificial intelligence techniques, some undesired properties may hinder them from being deployed in real-world applications. With continued proliferation of deep learning powered applications, one major concern raised recently is the heavy computation and storage burden that DNN models shall lay upon mobile platforms. Such burden stems from substantially redundant feature representations and parameterizations . To address this issue and make DNNs less resource-intensive, a variety of solutions have been proposed. In particular, it has been reported that more than 90% of connections in a well-trained DNN can be removed using pruning strategies , while no accuracy loss is observed. Such a remarkable network sparsity leads to considerable compressions and speedups on both GPUs and CPUs . Aside from being efficient, sparse representations are theoretically attractive and have made their way into tremendous applications over the past decade.

Orthogonal to the inefficiency issue, it has also been discovered that DNN models are vulnerable to adversarial examples—maliciously generated images which are perceptually similar to benign ones but can fool classifiers to make arbitrary predictions . Furthermore, generic regularizations (e.g., dropout and weight decay) do not really help on resisting adversarial attacks . Such undesirable property may prohibit DNNs from being applied to security-sensitive applications. The cause of this phenomenon seems mysterious and remains to be an open question. One reasonable explanation is the local linearity of modern DNNs . Quite a lot of attempts, including adversarial training , knowledge distillation , detecting and rejecting , and some gradient masking techniques like randomization , have been made to ameliorate this issue and defend adversarial attacks.

It is crucial to study potential relationships between the inefficiency (i.e., redundancy) and adversarial robustness of classifiers, in consideration of the inclination to avoid “robbing Peter to pay Paul”, if possible. Towards shedding light on such relationships, especially for DNNs, we provide comprehensive analyses in this paper from both the theoretical and empirical perspectives. By introducing reasonable metrics, we reveal, somewhat surprising, that there is a discrepancy between the robustness of sparse linear classifiers and nonlinear DNNs, under $l_{2}$ attacks. Our results also demonstrate that an appropriately higher sparsity implies better robustness of nonlinear DNNs, whereas over-sparsified models can be more difficult to resist adversarial examples, under both the $l_{\infty}$ and $l_{2}$ circumstances.

Related Works

In light of the “Occam’s razor” principle, we presume there exists intrinsic relationships between the sparsity and robustness of classifiers, and thus perform a comprehensive study in this paper. Our theoretical and empirical analyses shall cover both linear classifiers and nonlinear DNNs, in which the middle-layer activations and connection weights can all become sparse.

The (in)efficiency and robustness of DNNs have seldom been discussed together, especially from a theoretical point of view. Very recently, Gopalakrishnan et al. propose to sparsify the input representations as a defense and provide provable evidences on resisting $l_{\infty}$ attacks. Though intriguing, their theoretical analyses are limited to only linear and binary classification cases. Contemporaneous with our work, Wang et al. and Ye et al. experimentally discuss how pruning shall affect the robustness of some DNNs but surprisingly draw opposite conclusions. Galloway et al. focus on binary DNNs instead of the sparse ones and show that the difficulty of performing adversarial attacks on binary networks DNNs remains as that of training.

To some extent, several very recent defense methods also utilize the sparsity of DNNs. For improved model robustness, Gao et al. attempt to detect the feature activations exclusive to the adversarial examples and prune them away. Dhillon et al. choose an alternative way that prunes activations stochastically to mask gradients. These methods focus only on the sparsity of middle-layer activations and pay little attention to the sparsity of connections.

Sparsity and Robustness of Classifiers

This paper aims at analyzing and exploring potential relationships between the sparsity and robustness of classifiers to untargeted white-box adversarial attacks, from both theoretical and practical perspectives. To be more specific, we consider models which learn parameterized mappings $\mathbf{x}_{i}\mapsto y_{i}$ , when given a set of labelled training samples $\{(\mathbf{x}_{i},y_{i})\}$ for supervision. Similar to a bunch of other theoretical efforts, our analyses start from linear classifiers and will be generalized to nonlinear DNNs later in Section 3.2.

Generally, the sparsity of a DNN model can be considered in two aspects: the sparsity of connections among neurons and the sparsity of neuron activations. In particular, the sparsity of activations also include that of middle-layer activations and inputs, which can be treated as a special case. Knowing that the input sparsity has been previously discussed , we shall focus primarily on the weight and activation sparsity for nonlinear DNNs and just study the weight sparsity for linear models.

Adversarial attacks typically minimize an $l_{p}$ norm (e.g., $l_{2}$ , $l_{\infty}$ , $l_{1}$ and $l_{0}$ ) of the required perturbation under certain (box) constraints. Though not completely equivalent with the distinctions in our visual domain, such norms play a crucial role in evaluating adversarial robustness. We study both the $l_{\infty}$ and $l_{2}$ attacks in this paper. With an ambition to totalize them, we propose to evaluate the robustness of linear models using the following metrics that describe the ability of resisting them respectively:

Be aware that although there exists attack-agnostic guarantees on the model robustness , they are all instance-specific. Instead of generalizing them to the entire input space for analysis, we focus on the proposed statistical metrics and present their connections to the guarantees later in Section 3.2. Some other experimentally feasible metrics shall be involved in Section 4. The following theorem sheds light on intrinsic relationships between the described robustness metrics and $l_{p}$ norms of $\mathbf{w}$ .

As for $r_{2}$ , the proof is straightforward by similarly casting its definition into the sum of conditional expectations. That is,

Theorem 3.1 indicates clear relationships between the sparsity and robustness of linear models. In terms of $r_{\infty}$ , optimizing the problem gives rise to a sparse solution of $\mathbf{w}$ . By duality, maximizing the squared upper bound of $r_{\infty}$ also resembles solving a sparse PCA problem . Reciprocally, we might also concur that a highly sparse $\mathbf{w}$ implies relatively robust classification results. Nevertheless, it seems that the defined $r_{2}$ has nothing to do with the sparsity of $\mathbf{w}$ . It gets maximized iff $\mathbf{w}$ approaches $\bm{\mu}_{+1}-\bm{\mu}_{-1}$ or $\bm{\mu}_{-1}-\bm{\mu}_{+1}$ , however, sparsifying $\mathbf{w}$ probably does not help on reaching this goal. In fact, under some assumptions about data distributions, the dense reference model can be nearly optimal in the sense of $r_{2}$ . We will see this phenomenon remains in multi-class linear classifications in Theorem 3.2 but does not remain in nonlinear DNNs in Section 3.2. One can check Section 4.1 and 4.2 for some experimental discussions in more details.

We present in Theorem 3.2 similar bounds for multi-class classifiers to that provided in Theorem 3.1, under some mild assumptions. Our proof is deferred to the supplementary material. We emphasize that the two additional assumptions are intuitively acceptable. First, increasing the classification loss in a more principled way, say using FGS, ought to diminish the expected accuracy more effectively. Second, with high probability, an original misclassification cannot be fixed using the FGS method, as one intends to do precisely the opposite.

Similarly, the presented bound for $r_{\infty}$ also implies sparsity, though it is the sparsity of $\mathbf{w}_{k}-\bar{\mathbf{w}}$ . In fact, this is directly related with the sparsity of $\mathbf{w}_{k}$ , considering that the classifiers can be post-processed to subtract their average simultaneously whilst the classification decision won’t change for any possible input. Particularly, Theorem 3.2 also partially suits linear DNN-based classifications. Let the classifier $g_{k}$ be factorized in a form of $\mathbf{w}^{T}_{k}=(\mathbf{w}_{k}^{\prime})^{T}W^{T}_{d-1}\ldots W^{T}_{1}$ , it is evident to see that higher sparsity of the multipliers encourages higher probability of a sparse $\mathbf{w}_{k}$ .

2 Deep Neural Networks

A nonlinear feedforward DNN is usually specified by a directed acyclic graph $G=(\mathcal{V},\mathcal{E})$ with a single root node for final outputs. According to the forward propagation rule, the activation value of each internal (and also output) node is calculated based on its incoming nodes and learnable weights corresponding to the edges. Nonlinear activation functions are incorporated to ensure the capacity. With biases, some nodes output a special value of one. We omit them for simplicity reasons as before.

Classifications are performed by comparing the prediction scores corresponding to different classes, which means $\hat{y}=\operatorname*{arg\,max}_{k\in\{1,\ldots,c\}}g_{k}(\mathbf{x})$ . Benefit from some very recent theoretic efforts , we can directly utilize well-established robustness guarantees for nonlinear DNNs. Let us first denote by $B_{p}(\mathbf{x},R)$ a close ball centred at $\mathbf{x}$ with radius $R$ and then denote by $L_{q,\mathbf{x}}^{k}$ the (best) local Lipschitz constant of function $g_{\hat{y}}(\mathbf{x})-g_{k}(\mathbf{x})$ over a fixed $B_{p}(\mathbf{x},R)$ , if there exists one. It has been proven that the following lemma offers a reasonable lower bound for the required $l_{p}$ norm of instance-specific perturbations when all classifiers are Lipschitz continuous .

it holds that $\hat{y}=\operatorname*{arg\,max}_{k\in\{1,\ldots,c\}}g_{k}(\mathbf{x}+\Delta_{\mathbf{x}})$ , which means the classification decision does not change on $B_{p}(\mathbf{x},\gamma)$ .

in which $\mathbf{w}_{k}=W_{d}[:,k]$ and $\sigma$ is the nonlinear activation function. Here we mostly focus on “ReLU networks” with rectified-linear-flavoured nonlinearity, so the neuron activations in middle-layers are naturally sparse. Due to clarity reasons, we discuss the weight and activation sparsities separately. Mathematically, we let $\mathbf{a}_{0}=\mathbf{x}$ and $\mathbf{a}_{j}=\sigma(W^{T}_{j}\mathbf{a}_{j-1})$ for $0<j<d$ be the layer-wise activations. We will refer to

which is a diagonal matrix whose entries taking value one correspond to nonzero activations within the $j$ -th layer, and $M_{j}\in\{0,1\}^{n_{j-1}\times n_{j}}$ , which is a binary mask corresponding to each (possibly sparse) $W_{j}$ . Along with some analyses, the following lemma and theorem present intrinsic relationships between the adversarial robustness and (both weight and activation) sparsity of nonlinear DNNs.

in which function $\eta$ is monotonically increasing w.r.t. each $\alpha_{j}$ , $c_{2}=\|\mathbf{w}_{\hat{y}}-\mathbf{w}_{k}\|_{2}\prod_{j}\|W_{j}^{\prime}\|_{F}$ and $c_{1}=\|\mathbf{w}_{\hat{y}}-\mathbf{w}_{k}\|_{1}\prod_{j}\|W_{j}^{\prime}\|_{1,\infty}$ are two constants.

Particularly, $\prod\|D_{j}(\hat{\mathbf{x}})\|_{p}\neq 0$ is fulfilled iff $\|D_{d-1}(\hat{\mathbf{x}})\|_{p}\neq 0$ (i.e., it equals 1 for $q\in\{1,2\}$ ). Under the assumptions on $M_{j}$ , we know that the entries of $W_{j}$ are independent of each other, thus

In Lemma 3.1 we introduce probably smaller local Lipschitz constants than the commonly known ones (i.e., $c_{2}$ and $c_{1}$ ), and subsequently in Theorem 3.3 we build theoretical relationships between $L_{q,\mathbf{x}}^{k}$ and the network sparsity, for $q\in\{1,2\}$ (i.e., $p\in\{\infty,2\}$ ). Apparently, $L_{q,\mathbf{x}}^{k}$ is prone to get smaller if any weight matrix gets more sparse. It is worthy noting that the local Lipschitz constant is of great importance in evaluating the robustness of DNNs, and it is effective to regularize DNNs by just minimizing $L_{q,\mathbf{x}}^{k}$ , or equivalently $\|\nabla g_{\hat{y}}(\mathbf{x})-\nabla g_{k}(\mathbf{x})\|_{q}$ for differentiable continuous functions . Thus we reckon, when the network is over-parameterized, an appropriately higher weight sparsity implies a larger $\gamma$ and stronger robustness. There are similar conclusions if $\mathbf{a}_{j}$ gets more sparse.

Experimental Results

In this section, we conduct experiments to testify our theoretical results. To be consistent, we still start from linear models and turn to nonlinear DNNs afterwards. As previously discussed, we perform both $l_{\infty}$ and $l_{2}$ attacks on the classifiers to evaluate their adversarial robustness. In addition to the FGS and DeepFool attacks which have been thoroughly discussed in Section 3, we introduce two more attacks in this section for extensive comparisons of the model robustness.

We use the FGS and randomized FGS (rFGS) methods to perform $l_{\infty}$ attacks. As a famous $l_{\infty}$ attack, FGS has been widely exploited in the literature. In order to generate adversarial examples, it calculates the gradient of training loss w.r.t. benign inputs and uses its sign as perturbations, in an element-wise manner. The rFGS attack is a computationally efficient alternative to multi-step $l_{\infty}$ attacks with an ability of breaking adversarial training-based defences. We keep its hyper-parameters fixed for all experiments in this paper. For $l_{2}$ attacks, we choose DeepFool and the C&W’s attack . DeepFool linearises nonlinear classifiers locally and approximates the optimal perturbations iteratively. C&W’s method casts the problem of constructing adversarial examples as optimizing an objective function without constraints, such that some recent gradient-descent-based solvers can be adopted. On the base of different attacks, four $r_{2}$ and $r_{\infty}$ values can be calculated for each classification model.

In our experiments on linear classifiers, both the binary and multi-class scenarios shall be evaluated. We choose the well-established MNIST dataset as a benchmark, which consists of 70,000 $28\times 28$ images of handwritten digits. According to the official test protocol, 10,000 of them should be used for performance evaluation and the remaining 60,000 for training. For experiments on the binary cases, we randomly choose a pair of digits (e.g., “0” and “8” or “1” and “7”) as positive and negative classes. Linear classifiers are trained following our previous discussions and utilizing the softplus function: $\min_{\mathbf{w},b}\sum_{i}\log(1+\exp(-y_{i}(\mathbf{w}^{T}\mathbf{x}_{i}+b)))$ . Parameters $\mathbf{w}$ and $b$ are randomly initialized and learnt by means of stochastic gradient descent with momentum. For the “1” and “7” classification case, we train 10 reference models from different initializations and achieve a prediction accuracy of $99.17\pm 0.00\%$ on the benign test set. For the classification of all 10 classes, we train 10 references similarly and achieve a test-set accuracy of $92.26\pm 0.08\%$ .

To produce models with different weight sparsities, we use a progressive pruning strategy . That being said, we follow a pipeline of iteratively pruning and re-training. Within each iteration, a portion ( $\rho$ ) of nonzero entries of $\mathbf{w}$ , whose magnitudes are relatively small in comparison with the others, will be directly set to zero and shall never be activated again. After $m$ times of such “pruning”, we shall collect $10(m+1)$ models from all 10 dense references. Here we set $m=16,\rho=1/3$ so the achieved final percentage of zero weights should be $99.74\%\approx 1-(1-\rho)^{m}$ . We calculate the prediction accuracies on adversarial examples (i.e., $r_{\infty}$ ) under different $l_{\infty}$ attacks and the average Euclidean norm of required perturbations (i.e., $r_{2}$ ) under different $l_{2}$ attacks to evaluate the adversarial robustness of different models in practice. For $l_{\infty}$ attacks, we set $\epsilon=0.1$ .

Figure 1 illustrates how our metrics of robustness vary with the weight sparsity. We only demonstrate the variability of the first 12 points (from left to right) on each curve, to make the bars more resolvable. The upper and lower subfigures correspond to binary and multi-class cases, respectively. Obviously, the experimental results are consistent with our previous theoretical ones. While sparse linear models are prone to be more robust in the sense of $r_{\infty}$ , their $r_{2}$ robustness maintains similar or becomes even slightly weaker than the dense references, until there emerges inevitable accuracy degradations on benign examples (i.e., when $r_{\infty}$ may drop as well). We also observe from Figure 1 that, in both the binary and multi-class cases, $r_{2}$ starts decreasing much earlier than the benign-set accuracy. Though very slight in the binary case, the degradation of $r_{2}$ actually occurs after the first round of pruning (from $2.0103\pm 0.0022$ to $2.0009\pm 0.0016$ with DeepFool incorporated, and from $2.3151\pm 0.0023$ to $2.3061\pm 0.0023$ with the C&W’s attack).

2 Sparse Nonlinear DNNs Can be Consistently More Robust

Regarding nonlinear DNNs, we follow the same experimental pipeline as described in Section 4.1. We train MLPs with 2 hidden fully-connected layers and convolutional networks with 2 convolutional layers, 2 pooling layers and 2 fully-connected layers as references on MNIST, following the “LeNet-300-100” and “LeNet-5” architectures in network compression papers . We also follow the training policy suggested by Caffe and train network models for 50,000 iterations with a batch size of 64 such that the training cross-entropy loss does not decrease any longer. The well-trained reference models achieve much higher prediction accuracies (LeNet-300-100: $98.20\pm 0.07\%$ and LeNet-5: $99.11\pm 0.04\%$ ) than previous tested linear ones on the benign test set.

Then we prune the dense references and illustrate some major results regarding the robustness and weight sparsity in Figure 2 (a)-(d). (See Figure 3 in our supplementary material for results under rFGS and the C&W’s attack.) Weight matrices/tensors within each layer is uniformly pruned so the network sparsity should be approximately equal to the layer-wise sparsity. As expected, we observe similar results to previous linear cases in the context of our $r_{\infty}$ but significantly different results in $r_{2}$ . Unlike previous linear models which behave differently under $l_{\infty}$ and $l_{2}$ attacks, nonlinear DNN models show a consistent trend of adversarial robustness with respect to the sparsity. In particular, we observe increased $r_{\infty}$ and $r_{2}$ values under different attacks when continually pruning the models, until the sparsity reaches some thresholds and leads to inevitable capacity degradations. For additional verifications, we calculate the CLEVER scores that approximate attack-agnostic lower bounds of the $l_{p}$ norms of required perturbations (in Table 3 in the supplementary material).

Experiments are also conducted on CIFAR-10, in which deeper nonlinear networks can be involved. We train 10 VGG-like network models (each incorporates 12 convolutional layers and 2 fully-connected layers) and 10 ResNet models (each incorporates 31 convolutional layers and a single fully-connected layers) from scratch. Such deep architectures lead to average prediction accuracies of $93.01\%$ and $92.89\%$ . Still, we prune dense network models in the progressive manner and illustrate quantitative relationships between the robustness and weight sparsity in Figure 2 (e)-(h). The first and last layers in each network are kept dense to avoid early accuracy degradation on the benign set. The same observations can be made. Note that the ResNets are capable of resisting some DeepFool examples, for which the second and subsequent iterations make little sense and can be disregarded.

Activation sparsity.

Having testified relationship between the robustness and weight sparsity of nonlinear DNNs, we now examine the activation sparsity. As previously mentioned, the middle-layer activations of ReLU incorporated DNNs are naturally sparse. We simply add a $l_{1}$ norm regularization of activation matrices/tensors to the learning objective to encourage higher sparsities and calculate $r_{\infty}$ and $r_{2}$ accordingly. Experiments are conducted on MNIST. Table 1 summarizes the results, in which “Sparsity ( $\mathbf{a}$ )” indicates the percentage of deactivated (i.e., zero) neurons feeding to the last fully-connected layer. Here the $r_{\infty}$ and $r_{2}$ values are calculated using the FGS and DeepFool attacks, respectively. Apparently, we still observe positive correlations between the robustness and (activation) sparsity in a certain range.

3 Avoid “Over-pruning”

We discover from Figure 2 that the sharp decrease of the adversarial robustness, especially in the sense of $r_{2}$ , may occur in advance of the benign-set accuracy degradation. Hence, it can be necessary to evaluate the adversarial robustness of DNNs during an aggressive surgery, even though the prediction accuracy of compressed models may remain competitive with their references on benign test-sets. To further explore this, we collect some off-the-shelf sparse models (including a $56\times$ compressed LeNet-300-100 and a $108\times$ compressed LeNet-5) and their corresponding dense references from the Internet and hereby evaluate their $r_{\infty}$ and $r_{2}$ robustness. Table 2 compares the robustness of different models. Obviously, these extremely sparse models are more vulnerable to the DeepFool attack, and what’s worse, the over $100\times$ pruned LeNet-5 seems also more vulnerable to FGS, which suggests researchers to take care and avoid “over-pruning” if possible. One might also discover the fact with other pruning methods.

Conclusions

In this paper, we study some intrinsic relationships between the adversarial robustness and the sparsity of classifiers, both theoretically and empirically. By introducing plausible metrics, we demonstrate that unlike some linear models which behave differently under $l_{\infty}$ and $l_{2}$ attacks, sparse nonlinear DNNs can be consistently more robust to both of them than their corresponding dense references, until their sparsity reaches certain thresholds and inevitably causes harm to the network capacity. Our results also demonstrate that such sparsity, including sparse connections and middle-layer neuron activations, can be effectively imposed using network pruning and $l_{1}$ regularization of weight tensors.

Acknowledgement

We would like to thank anonymous reviewers for their constructive suggestions. Changshui Zhang is supported by NSFC (Grant No. 61876095, No. 61751308 and No. 61473167) and Beijing Natural Science Foundation (Grant No. L172037).