Stochastic Activation Pruning for Robust Adversarial Defense

Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, Anima Anandkumar

Introduction

While deep neural networks have emerged as dominant tools for supervised learning problems, they remain vulnerable to adversarial examples (Szegedy et al., 2013). Small, carefully chosen perturbations to input data can induce misclassification with high probability. In the image domain, even perturbations so small as to be imperceptible to humans can fool powerful convolutional neural networks (Szegedy et al., 2013; Goodfellow et al., 2014). This fragility presents an obstacle to using machine learning in the wild. For example, a vision system vulnerable to adversarial examples might be fundamentally unsuitable for a computer security application. Even if a vision system is not explicitly used for security, these weaknesses might be critical. Moreover, these problems seem unnecessary. If these perturbations are not perceptible to people, why should they fool a machine?

Since this problem was first identified, a rapid succession of papers has proposed various techniques both for generating and for guarding against adversarial attacks. Goodfellow et al. (2014) introduced a simple method for quickly producing adversarial examples called the fast gradient sign method (FGSM). To produce an adversarial example using FGSM, we update the inputs by taking one step in the direction of the sign of the gradient of the loss with respect to the input.

To defend against adversarial examples, some papers propose training the neural network on adversarial examples themselves, either using the same model (Goodfellow et al., 2014; Madry et al., 2017), or using an ensemble of models (Tramèr et al., 2017a). Taking a different approach, Nayebi & Ganguli (2017) draw inspiration from biological systems. They propose that to harden neural networks against adversarial examples, one should learn flat, compressed representations that are sensitive to a minimal number of input dimensions.

This paper introduces Stochastic Activation Pruning (SAP), a method for guarding pretrained networks against adversarial examples. During the forward pass, we stochastically prune a subset of the activations in each layer, preferentially retaining activations with larger magnitudes. Following the pruning, we scale up the surviving activations to normalize the dynamic range of the inputs to the subsequent layer. Unlike other adversarial defense methods, our method can be applied post-hoc to pretrained networks and requires no additional fine-tuning.

Preliminaries

We denote an $n$-layered neural network $h:\mathcal{X}\rightarrow\mathcal{Y}$ as a chain of functions $h=h^{n}\circ h^{n-1}\circ\ldots\circ h^{1}$, where each $h^{i}$ consists of a linear transformation $W^{i}$ followed by a non-linearity $\phi^{i}$. Given a set of nonlinearities and weight matrices, a neural network provides a nonlinear mapping from inputs $x\in\mathcal{X}$ to outputs $\hat{y}\in\mathcal{Y}$, i.e.

$$\hat{y}:=h(x)=\phi^{n}(W^{n}\phi^{n-1}(W^{n-1}\ldots\phi^{1}(W^{1}x))).$$

In supervised classification and regression problems, we are given a data set $\mathcal{D}$ of pairs $(x,y)$, where each pair is drawn from an unknown joint distribution. For classification problems, $y$ is a categorical random variable, and for regression, $y$ is a real-valued vector. Conditioned on a given dataset, network architecture, and a loss function such as cross entropy, a learning algorithm, e.g. stochastic gradient descent, learns parameters $\theta:=\{W^{i}\}_{i=1}^{n}$ in order to minimize the loss. We denote by $J(\theta,x,y)$ the loss of a learned network, parameterized by $\theta$, on a pair $(x,y)$. To simplify notation, we focus on classification problems, although our methods are broadly applicable.

Consider an input $x$ that is correctly classified by the model $h$. An adversary seeks to apply a small additive perturbation, $\Delta x$, such that $h(x)\neq h(x+\Delta x)$, subject to the constraint that the perturbation is imperceptible to a human. For perturbations applied to images, the $l_{\infty}$-norm is considered a better measure of human perceptibility than the more familiar $l_{2}$-norm (Goodfellow et al., 2014). Throughout this paper, we assume that the manipulative power of the adversary, the perturbation $\Delta x$, is of bounded norm $\|\Delta x\|_{\infty}\leq\lambda$. Given a classifier, one common way to generate an adversarial example is to perturb the input in the direction that increases the cross-entropy loss. This is equivalent to minimizing the probability assigned to the true label. Given the neural network $h$, network parameters $\theta$, input data $x$, and corresponding true output $y$, an adversary could create a perturbation $\Delta x$ as follows

$$\Delta x=\arg\max_{\|r\|_{\infty}\leq\lambda}J(\theta,x+r,y).\qquad(1)$$

Due to nonlinearities in the underlying neural network, and therefore in the objective function $J$, the optimization in Eq. 1 is, in general, a non-convex problem. Following Madry et al. (2017) and Goodfellow et al. (2014), we use the first-order approximation of the loss function

$$\Delta x=\arg\max_{\|r\|_{\infty}\leq\lambda}\left[J(\theta,x,y)+r^{\top}\mathcal{J}(\theta,x,y)\right],$$

where $\mathcal{J}=\nabla_{x}J$ denotes the gradient of the loss with respect to the input. The first term in the optimization is not a function of the adversary's perturbation, so the problem reduces to

$$\Delta x=\arg\max_{\|r\|_{\infty}\leq\lambda}r^{\top}\mathcal{J}(\theta,x,y).$$

An adversary chooses $r$ to be in the direction of the sign of $\mathcal{J}(\theta,x,y)$, i.e. $\Delta x=\lambda\cdot\text{sign}(\mathcal{J}(\theta,x,y))$. This is the FGSM technique due to Goodfellow et al. (2014). Note that FGSM requires an adversary to access the model in order to compute the gradient.
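As a minimal sketch of the FGSM step above, the following Python snippet applies it to a toy logistic-regression classifier rather than a deep network. The model, the weights `w`, the input `x`, and all helper names are illustrative assumptions; only the update rule, a step of size $\lambda$ along the sign of the input gradient, comes from the text.

```python
import math


def loss(w, x, y):
    """Logistic loss -log(sigmoid(y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))


def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. <w, x>
    return [coeff * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, x, y, lam):
    """One-step perturbation: lam * sign(input gradient)."""
    return [lam * sign(g) for g in input_gradient(w, x, y)]


w = [0.5, -1.0, 2.0]   # toy weights (assumption)
x = [1.0, 0.5, 0.25]   # toy input (assumption)
y = 1
delta = fgsm(w, x, y, lam=0.1)
x_adv = [xi + di for xi, di in zip(x, delta)]
```

Every coordinate of `delta` has magnitude exactly `lam`, so the perturbation sits on the boundary of the $l_{\infty}$-ball, and the loss at `x_adv` is larger than at `x` for this toy model.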

Stochastic activation pruning

Consider the defense problem from a game-theoretic perspective (Osborne & Rubinstein, 1994). The adversary designs a policy in order to maximize the defender's loss, while knowing the defender's policy. At the same time, the defender aims to come up with a strategy to minimize the maximized loss. Therefore, we can rewrite Eq. 1 as follows

$$(\pi^{*},\rho^{*}):=\arg\min_{\pi}\max_{\rho}\;\mathbb{E}_{p\sim\pi,\,r\sim\rho}\left[J(M_{p}(\theta),x+r,y)\right],\qquad(2)$$

where $\rho$ is the adversary's policy, which provides $r\sim\rho$ in the space of bounded (allowed) perturbations (for any $r$ in the range of $\rho$, $\|r\|_{\infty}\leq\lambda$), and $\pi$ is the defender's policy, which provides $p\sim\pi$, an instantiation of its policy. The adversary's goal is to maximize the loss of the defender by perturbing the input under a strategy $\rho$, and the defender's goal is to minimize the loss by changing model parameters $\theta$ to $M_{p}(\theta)$ under strategy $\pi$. The optimization problem in Eq. 2 is a minimax zero-sum game between the adversary and defender, where the optimal strategies $(\pi^{*},\rho^{*})$, in general, form a mixed Nash equilibrium, i.e. stochastic policies.

Intuitively, the idea of SAP is to stochastically drop out nodes in each layer during forward propagation. We retain nodes with probabilities proportional to the magnitude of their activation and scale up the surviving nodes to preserve the dynamic range of the activations in each layer. Empirically, the approach preserves the accuracy of the original model. Notably, the method can be applied post-hoc to already-trained models.

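We draw random samples with replacement from the activation map given the probability distribution described above, in which the $j$'th activation of the $i$'th layer, $(h^{i})_{j}$, is sampled with probability $p^{i}_{j}$ proportional to $|(h^{i})_{j}|$. This makes it convenient to determine whether an activation would be sampled at all. If an activation is sampled, we scale it up by the inverse of the probability of sampling it over all the draws. If not, we set the activation to $0$. In this way, SAP preserves inverse propensity scoring of each activation. Under an instance $p$ of policy $\pi$, we draw $r^{i}_{p}$ samples with replacement from this multinomial distribution. The new activation map, $M_{p}(h^{i})$, is given by

$$M_{p}(h^{i})=h^{i}\odot m^{i}_{p},\qquad(m^{i}_{p})_{j}=\frac{\mathbb{I}((h^{i})_{j})}{1-(1-p^{i}_{j})^{r^{i}_{p}}},$$

where $\mathbb{I}((h^{i})_{j})$ is $1$ if $(h^{i})_{j}$ was sampled at least once and $0$ otherwise, and $1-(1-p^{i}_{j})^{r^{i}_{p}}$ is the probability of sampling it at least once over all the draws.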

We attempt to explain the advantages of SAP under the assumption that we are applying it to a pre-trained model that achieves high generalization accuracy. For instance $p$ under policy $\pi$, if the number of samples drawn for each layer $i$, $r^{i}_{p}$, is large, then fewer parameters of the neural network are pruned, and the scaling factor gets closer to $1$. Under this scenario, the stochastically pruned model performs almost identically to the original model. The stochasticity is not advantageous in this case, but there is no loss in accuracy in the pruned model as compared to the original model.

On the other hand, with fewer samples in each layer, $r^{i}_{p}$, a large number of parameters of the neural network are pruned. Under this scenario, the SAP model's accuracy will drop compared to the original model's accuracy. But this model is stochastic and has more freedom to deceive the adversary. So the advantage of SAP comes if we can balance the number of samples drawn in a way that negligibly impacts accuracy but still confers robustness against adversarial attacks.
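The pruning procedure and the trade-off in the number of draws can be illustrated with a small sketch. It operates on a single flattened activation map stored as a Python list; the function name `sap`, its arguments, and the example values are assumptions made for illustration, and this is not the authors' MXNet implementation.

```python
import random


def sap(activations, num_draws, rng):
    """Stochastically prune one flattened activation map.

    activations: list of floats (one layer's activation map)
    num_draws:   number of draws with replacement
    rng:         random.Random instance supplying the randomness
    """
    total = sum(abs(a) for a in activations)
    if total == 0.0:
        return list(activations)  # all-zero map: nothing to sample
    # Sampling probabilities proportional to activation magnitude.
    probs = [abs(a) / total for a in activations]
    # Draw indices with replacement; only "drawn at least once" matters.
    drawn = set(rng.choices(range(len(activations)), weights=probs, k=num_draws))
    pruned = []
    for j, (a, p) in enumerate(zip(activations, probs)):
        if j in drawn and p > 0.0:
            keep_prob = 1.0 - (1.0 - p) ** num_draws  # P(drawn at least once)
            pruned.append(a / keep_prob)  # inverse-propensity rescaling
        else:
            pruned.append(0.0)
    return pruned


rng = random.Random(0)
h = [0.0, 1.0, 3.0, 0.5]                    # toy activation map (assumption)
few = sap(h, num_draws=2, rng=rng)          # heavy pruning, large rescaling
many = sap(h, num_draws=10_000, rng=rng)    # almost no pruning, scale close to 1
```

With many draws the output is essentially the original map, while with few draws most entries are zeroed and the survivors are scaled up. Averaged over many random instances, each entry of the pruned map matches the original activation, which is the inverse propensity scoring property mentioned in the text.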

SAP is similar to the dropout technique due to Srivastava et al. (2014). However, there is a crucial difference: SAP is more likely to sample activations that are high in absolute value, whereas dropout samples each activation with the same probability. Because of this difference, SAP, unlike dropout, can be applied post-hoc to pretrained models without significantly decreasing the accuracy of the model. Experiments comparing SAP and dropout are included in section 4. Interestingly, dropout confers little advantage over the baseline. We suspect that the reason for this is that the dropout training procedure encourages all possible dropout masks to result in similar mappings.

3.2 Adversarial attack on SAP

If the adversary knows that our defense policy is to apply SAP, it might try to calculate the best strategy against the SAP model. Given the neural network $h$, input data $x$, corresponding true output $y$, a policy $\rho$ over the allowed perturbations, and a policy $\pi$ over the model parameters that come from SAP (this result holds true for any stochastic policy chosen over the model parameters), the adversary determines the optimal policy $\rho^{*}$

$$\rho^{*}=\arg\max_{\rho}\;\mathbb{E}_{p\sim\pi,\,r\sim\rho}\left[J(M_{p}(\theta),x+r,y)\right].$$

Therefore, using the result from section 2, the adversary determines the perturbation $\Delta x$ as follows:

$$\Delta x=\arg\max_{\|r\|_{\infty}\leq\lambda}\;r^{\top}\,\mathbb{E}_{p\sim\pi}\left[\mathcal{J}(M_{p}(\theta),x,y)\right].$$

Experiments

Our experiments to evaluate SAP address two tasks: image classification and reinforcement learning. We apply the method to the ReLU activation maps at each layer of the pretrained neural networks. To create adversarial examples in our evaluation, we use FGSM, $\Delta x=\lambda\cdot\text{sign}(\mathcal{J}(M_{p}(\theta),x,y))$. For stochastic models, the adversary estimates $\mathcal{J}(M_{p}(\theta),x,y)$ using Monte Carlo (MC) sampling unless otherwise mentioned. All perturbations are applied to the pixel values of images, which normally take values in the range $0$-$255$. So the fraction of perturbation with respect to the data's dynamic range would be $\frac{\lambda}{256}$. To ensure that all images are valid, even following perturbation, we clip the resulting pixel values so that they remain within the range $[0,255]$. In all plots, we consider perturbations of the following magnitudes: $\lambda=\{0,1,2,4,8,16,32,64\}$. All the implementations were coded in the MXNet framework (Chen et al., 2015), and sample code is available at https://github.com/Guneet-Dhillon/Stochastic-Activation-Pruning
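The attack used against stochastic models can be sketched as follows: average several stochastic input gradients, take an FGSM step along the sign of the average, and clip to the valid pixel range. The callable `stochastic_grad`, the toy gradient, and the noise model are hypothetical stand-ins for a real SAP network, chosen only so that the snippet is self-contained.

```python
import random


def sign(v):
    return (v > 0) - (v < 0)


def mc_fgsm(x, stochastic_grad, lam, num_samples):
    """FGSM step using a Monte Carlo average of stochastic input gradients.

    stochastic_grad(x) is a hypothetical callable that returns one sample of
    the input gradient under a freshly drawn pruning instance.
    """
    avg = [0.0] * len(x)
    for _ in range(num_samples):
        sample = stochastic_grad(x)
        avg = [a + g / num_samples for a, g in zip(avg, sample)]
    # Perturb by lam along the sign of the averaged gradient, then clip
    # so that every pixel stays in [0, 255].
    return [min(255.0, max(0.0, xi + lam * sign(a))) for xi, a in zip(x, avg)]


# Toy stand-in for a stochastic model: a fixed gradient plus Gaussian noise.
rng = random.Random(0)
true_grad = [1.0, -1.0, 0.5]


def noisy_grad(x):
    return [g + rng.gauss(0.0, 1.0) for g in true_grad]


x = [254.0, 2.0, 100.0]
x_adv = mc_fgsm(x, noisy_grad, lam=4, num_samples=1000)
```

In this toy setting the first two pixels hit the boundaries of the valid range after clipping, while the third moves by the full $\lambda=4$.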

To evaluate models in the image classification domain, we look at two aspects: the model accuracy for varying values of $\lambda$, and the calibration of the models (Guo et al., 2017). Calibration of a model is the relation between the confidence level of the model's output and its accuracy. A linear calibration is ideal, as it suggests that the accuracy of the model is proportional to the confidence level of its output. To evaluate models in the reinforcement learning domain, we look at the average score that each model achieves on the games played, for varying values of $\lambda$. The higher the score, the better the model's performance. Because the units of reward are arbitrary, we report results in terms of the relative percent change in rewards. In both cases, the outputs of stochastic models are computed as an average over multiple forward passes.
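One common way to obtain a calibration curve is to group predictions into confidence bins and compare the mean confidence with the accuracy in each bin. The sketch below assumes equal-width bins and hypothetical inputs (`confidences` as the predicted-class probabilities, `correct` as booleans); the text does not specify the binning used for its plots.

```python
def calibration_curve(confidences, correct, num_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    curve = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            curve.append((mean_conf, accuracy, len(members)))
    return curve


# Toy predictions (assumption): two confident and correct, two uncertain.
curve = calibration_curve([0.95, 0.92, 0.55, 0.51], [True, True, True, False])
```

A well calibrated model yields points close to the diagonal, i.e. accuracy close to mean confidence in every bin.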

The CIFAR-$10$ dataset (Krizhevsky & Hinton, 2009) was used for the image classification domain. We trained a ResNet-$20$ model (He et al., 2016) using SGD, with minibatches of size $512$, momentum of $0.9$, weight decay of $0.0001$, and a learning rate of $0.5$ for the first $100$ epochs, then $0.05$ for the next $30$ epochs, and then $0.005$ for the next $20$ epochs. This achieved an accuracy of $89.8\%$ with cross-entropy loss and ReLU non-linearity. For all the figures in this section, we refer to this model as the dense model.

The accuracy of the dense model degrades quickly with $\lambda$. For $\lambda=1$, the accuracy drops down to $66.3\%$, and for $\lambda=2$ it is $56.4\%$. These are small (hardly perceptible) perturbations in the input images, but the dense model's accuracy decreases significantly.

4.1.2 Dropout (DRO)

Dropout, a technique due to Srivastava et al. (2014), was also tested to compare with SAP. Similar to the SAP setting, this method was added to the ReLU activation maps of the dense model. We see that models with low dropout rates perform similarly to the dense model for small $\lambda$ values, but their accuracy starts decreasing very quickly for higher $\lambda$ values (Fig. 2(a)). We also trained ResNet-$20$ models, similar to the dense model, but with different dropout rates. This time, the models were trained for $250$ epochs, with an initial learning rate of $0.5$, reduced by a factor of $0.1$ after $100$, $150$, $190$ and $220$ epochs. These models were tested against adversarial examples with and without dropout during validation (Figs. 2(b) and 2(c) respectively). The models perform similarly to the dense model, but do not provide additional robustness.

4.1.3 Adversarial training (ADV)

Adversarial training (Goodfellow et al., 2014) has emerged as a standard method for defending against adversarial examples. It has been adopted by Madry et al. (2017) and Tramèr et al. (2017a) to maintain high accuracy levels even for large $\lambda$ values. We trained a ResNet-$20$ model, similar to the dense model, with an initial learning rate of $0.5$, which was halved every $10$ epochs, for a total of $100$ epochs. It was trained on a dataset consisting of $80\%$ un-perturbed data and $20\%$ adversarially perturbed data, generated on the model from the previous epoch, with $\lambda=2$. This achieved an accuracy of $75.0\%$ on the un-perturbed validation set. Note that the model capacity was not changed. When tested against adversarial examples, the accuracy dropped to $72.9\%$, $70.9\%$ and $67.5\%$ for $\lambda=1$, $2$ and $4$ respectively. We ran SAP-$100$ on the ADV model (referred to as ADV$+$SAP-$100$). The accuracy in the no-perturbation case was $74.1\%$. For adversarial examples, both models behave similarly for small values of $\lambda$. But for $\lambda=16$ and $32$, ADV$+$SAP-$100$ achieves higher accuracy than ADV, by an absolute increase of $7.8\%$ and $7.9\%$ respectively.

We compare the accuracy-$\lambda$ plot for the dense, SAP-$100$, ADV and ADV$+$SAP-$100$ models. This is illustrated in Fig. 3. For smaller values of $\lambda$, SAP-$100$ achieves high accuracy. As $\lambda$ gets larger, ADV$+$SAP-$100$ performs better than all the other models. We also compare the calibration plots for these models in Fig. 4. The dense model is not linear for any $\lambda\neq 0$. The other models are well calibrated (close to linear), and behave similarly to each other for $\lambda\leq 4$. For higher values of $\lambda$, we see that ADV$+$SAP-$100$ is the closest to a linearly calibrated model.

4.2 Adversarial attacks in deep reinforcement learning (RL)

Previously, Behzadan & Munir (2017), Huang et al. (2017), and Kos & Song (2017) have shown that reinforcement learning agents can also be easily manipulated by adversarial examples. The RL agent learns the long-term value $Q(a,s)$ of each state-action pair $(s,a)$ through interaction with an environment, where given a state $s$, the optimal action is $\arg\max_{a}Q(a,s)$. A regression-based algorithm, Deep Q-Network (DQN) (Mnih et al., 2015), and an improved variant, Double DQN (DDQN), have been proposed, with the popular Atari games (Bellemare et al., 2013) serving as benchmarks. We deploy the DDQN algorithm and train an RL agent in a variety of different Atari game settings.

Similar to the image classification experiments, we tested SAP on a pretrained model (the model is described in Appendix A), by applying the method on the ReLU activation maps. SAP-$100$ was used for these experiments. Table 1 specifies the relative percentage increase in rewards of SAP-$100$ as compared to the original model. For all the games, we observe a drop in performance for the no-perturbation case. But for $\lambda\neq 0$, the relative increase in rewards is positive (except for $\lambda=1$ in the BattleZone game), and is very high in some cases ($3425.9\%$ for $\lambda=1$ for the Bowling game).

4.3 Additional baselines

In addition to experimenting with SAP, dropout, and adversarial training, we conducted extensive experiments with other methods for introducing stochasticity into a neural network. These techniques included $0$-mean Gaussian noise added to weights (RNW), $1$-mean multiplicative Gaussian noise for the weights (RSW), and corresponding additive (RNA) and multiplicative (RSA) noise added to the activations. We describe each method in detail in Appendix B. Each of these models performs worse than the dense baseline at most levels of perturbation and none matches the performance of SAP. Precisely why SAP works while other methods introducing stochasticity do not remains an open question that we continue to explore in future work.

4.4 SAP attacks with varying numbers of MC samples

In the previous experiments the SAP adversary used $100$ MC samples to estimate the gradient. Additionally, we compared the performance of SAP-$100$ against various attacks: the standard attack calculated on the dense model, and attacks generated on SAP-$100$ by estimating the gradient with various numbers of MC samples. We see that if the adversary uses the dense model to generate adversarial examples, the SAP-$100$ model's accuracy decreases. Additionally, if the adversary uses the SAP-$100$ model to generate adversarial examples, greater numbers of MC samples lower the accuracy more. Still, even with $1000$ MC samples, for low amounts of perturbation ($\lambda=1$ and $2$), SAP-$100$ retains higher accuracy than the dense model.

Computing a single backward pass of the SAP-$100$ model for $512$ examples takes $\sim 20$ seconds on $8$ GPUs. Using $100$ and $1000$ MC samples would take $\sim 0.6$ and $\sim 6$ hours respectively.
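The quoted running times are consistent with a simple back-of-the-envelope calculation, sketched below under the assumption that each MC sample costs one backward pass of about 20 seconds and that the passes run one after another. The function name is hypothetical.

```python
def attack_hours(num_mc_samples, seconds_per_pass=20.0):
    """Hours needed if every MC sample costs one sequential backward pass."""
    return num_mc_samples * seconds_per_pass / 3600.0


for n in (100, 1000):
    print(n, round(attack_hours(n), 1))
```

This gives roughly 0.6 hours for 100 samples and 5.6 hours for 1000 samples, in line with the approximate figures in the text.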

4.5 Iterative adversarial attack

A more sophisticated technique for producing adversarial perturbations (than FGSM) is to apply multiple, smaller updates to the input in the direction of the local sign-gradients. This can be done by taking small steps of size $k\leq\lambda$ in the direction of the sign-gradient at the updated point and repeating the procedure $\lceil\frac{\lambda}{k}\rceil$ times (Kurakin et al., 2016) as follows

$$x^{0}=x,\qquad x^{t}=\textit{clip}_{x,\lambda}\left(x^{t-1}+k\cdot\text{sign}(\mathcal{J}(\theta,x^{t-1},y))\right),\qquad\Delta x=x^{\lceil\frac{\lambda}{k}\rceil}-x,$$

where the function $\textit{clip}_{x,\lambda}$ is a projection onto an $L_{\infty}$-ball of radius $\lambda$ centered at $x$, and also onto the hyper-cube of image space (each pixel is clipped to the range $[0,255]$). The dense and SAP-$100$ models are tested against this adversarial attack, with $k=1.0$ (Fig. 1(c)). The accuracies of the dense model at $\lambda=0$, $1$, $2$ and $4$ are $89.8\%$, $66.3\%$, $50.1\%$ and $31.0\%$ respectively. The accuracies of the SAP-$100$ model against attacks computed on the same model (with $10$ MC samples taken at each step to estimate the gradient) are $83.3\%$, $82.0\%$, $80.2\%$ and $76.7\%$, for $\lambda=0,1,2,4$ respectively. The SAP-$100$ model provides accuracies of $83.3\%$, $75.2\%$, $67.0\%$ and $50.8\%$ against attacks computed on the dense model, with the perturbations $\lambda=0,1,2,4$ respectively. Iterative attacks on the SAP models are much more expensive to compute and noisier than iterative attacks on dense models. This is why the adversarial attack computed on the dense model results in lower accuracies on the SAP-$100$ model than the adversarial attack computed on the SAP-$100$ model itself.
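The iterative procedure can be sketched as repeated sign-gradient steps of size $k$, each followed by the clipping described above. The gradient callable `grad_fn` and the toy quadratic loss are assumptions used only to make the example runnable; a real attack would differentiate the network's loss instead.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def clip(x_new, x_orig, lam, lo=0.0, hi=255.0):
    """Project onto the L-infinity ball of radius lam around x_orig, then onto [lo, hi]."""
    out = []
    for xn, xo in zip(x_new, x_orig):
        xn = min(xo + lam, max(xo - lam, xn))
        out.append(min(hi, max(lo, xn)))
    return out


def iterative_attack(x, grad_fn, lam, k=1.0):
    """Take ceil(lam / k) sign-gradient steps of size k, clipping after each."""
    x_adv = list(x)
    for _ in range(math.ceil(lam / k)):
        g = grad_fn(x_adv)  # gradient evaluated at the updated point
        stepped = [xi + k * sign(gi) for xi, gi in zip(x_adv, g)]
        x_adv = clip(stepped, x, lam)
    return x_adv


# Toy loss 0.5 * ||x - c||^2 (assumption); its input gradient is x - c.
c = [0.0, 0.0, 255.0]


def grad(x):
    return [xi - ci for xi, ci in zip(x, c)]


x = [100.0, 254.0, 1.0]
x_adv = iterative_attack(x, grad, lam=4, k=1.0)
```

Here the first pixel moves by the full budget of 4, while the other two are stopped early by the image-space bounds, so the final perturbation stays inside both the $L_{\infty}$-ball and the valid pixel range.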

Related work

Robustness to adversarial attack has recently emerged as a serious topic in machine learning (Goodfellow et al., 2014; Kurakin et al., 2016; Papernot & McDaniel, 2016; Tramèr et al., 2017b; Fawzi et al., 2018). Goodfellow et al. (2014) introduced FGSM. Kurakin et al. (2016) proposed an iterative method where FGSM is used for smaller step sizes, which leads to a better approximation of the gradient. Papernot et al. (2017) observed that adversarial examples could be transferred to other models as well. Madry et al. (2017) introduce adding random noise to the image and then using the FGSM method to come up with adversarial examples.

Work on robustness against adversarial examples has primarily focused on training on the adversarial examples. Goodfellow et al. (2014) use FGSM to inject adversarial examples into their training dataset. Madry et al. (2017) use an iterative FGSM approach to create adversarial examples to train on. Tramèr et al. (2017a) introduced an ensemble adversarial training method of training on the adversarial examples created on the model itself and an ensemble of other pre-trained models. These works have been successful, achieving only a small drop in accuracy from the clean to the adversarially generated data. Nayebi & Ganguli (2017) propose a method to produce a smooth input-output mapping by using saturating activation functions and causing the activations to become saturated.

Conclusion

The SAP approach guards networks against adversarial examples without requiring any additional training. We showed that in the adversarial setting, applying SAP to image classifiers improves both the accuracy and calibration. Notably, combining SAP with adversarial training yields additive benefits. Additional experiments show that SAP can also be effective against adversarial examples in reinforcement learning.

References

Appendix A Reinforcement learning model architecture

For the experiments in section 4.2, we trained the network with RMSProp, minibatches of size $32$, a learning rate of $0.00025$, and a momentum of $0.95$, as in Mnih et al. (2015); the discount factor is $\gamma=0.99$ and the number of steps between target updates is $10000$. We updated the network every $4$ steps by randomly sampling a minibatch of $32$ samples from the replay buffer, and trained the agents for a total of $100M$ steps per game. The experience replay contains the $1M$ most recent transitions. For training we used an $\varepsilon$-greedy policy with $\varepsilon$ annealed linearly from $1$ to $0.1$ over the first $1M$ time steps and fixed at $0.1$ thereafter.
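The exploration schedule described above can be written as a short function. This is a sketch of the linear annealing only; the function name and argument names are illustrative.

```python
def epsilon(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    if step >= anneal_steps:
        return end
    return start + (end - start) * step / anneal_steps
```

For example, `epsilon(0)` is 1.0, the value halfway through annealing is 0.55, and any step at or beyond one million returns 0.1.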

The input to the network is a $4\times 84\times 84$ tensor with a rescaled, gray-scale version of the last four observations. The first convolution layer has $32$ filters of size $8$ with a stride of $4$. The second convolution layer has $64$ filters of size $4$ with stride $2$. The last convolution layer has $64$ filters of size $3$, followed by two fully connected layers: the first has size $512$ and the final one outputs the Q-value of each action. A ReLU non-linearity is applied at each layer.
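As a quick check of the resulting feature-map sizes, the sketch below propagates the $84\times 84$ input through the three convolutions. It assumes unpadded convolutions and a stride of 1 for the last convolution layer, neither of which is stated in the text.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1


size = 84
sizes = []
# (kernel, stride) per layer; stride 1 for the last layer is an assumption.
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# Number of features entering the first fully connected layer (64 filters).
flat_features = 64 * sizes[-1] ** 2
```

Under these assumptions the spatial sizes are 20, 9 and 7, so the first fully connected layer receives 3136 features.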

Appendix B Other methods

We tried a variety of different methods that could be added to pretrained models and tested their performance against adversarial examples. The following is a continuation of section 4.1, where we use the dense model again on the CIFAR-$10$ dataset.

B.1 Random noisy weights (RNW)

One simple way of introducing stochasticity to the activations is by adding random Gaussian noise to each weight, with mean $0$ and constant standard deviation $s$. So each weight tensor $W^{i}$ now changes to $M(W^{i})$, where the $j$'th entry is given by

$$M(W^{i})_{j}=(W^{i})_{j}+\eta,\qquad\eta\sim\mathcal{N}(0,s^{2}).$$

These models behave very similarly to the dense model (Fig. 5(a), the legend indicates the value of $s$). While we test several different values of $s$, we do not observe any significant improvements regarding robustness against adversarial examples. As $s$ increased, the accuracy for non-zero $\lambda$ decreased.

B.2 Randomly scaled weights (RSW)

Instead of using additive noise, we also try multiplicative noise. The scale factor can be picked from a Gaussian distribution, with mean $1$ and constant standard deviation $s$. So each weight tensor $W^{i}$ now changes to $M(W^{i})$, where the $j$'th entry is given by

$$M(W^{i})_{j}=(W^{i})_{j}\cdot\eta,\qquad\eta\sim\mathcal{N}(1,s^{2}).$$

These models perform similarly to the dense model, but again, no robustness is offered against adversarial examples. They follow a similar trend as the RNW models (Figure 5(b), the legend indicates the value of $s$).

B.3 Deterministic weight pruning (DWP)

Following from the motivation of preventing perturbations from propagating forward in the network, we tested deterministic weight pruning, where the top $k\%$ entries of a weight matrix, according to their absolute values, were kept, while the rest were pruned to $0$. This method was prompted by the success achieved by this pruning method, introduced by Han et al. (2015), where they also fine-tuned the model.

For low levels of pruning, these models perform very similarly to the dense model, even against adversarial examples (Fig. 5(c), the legend indicates the value of $k$). The adversary can compute the gradient of the sparse model, and the perturbations propagate forward through the surviving weights. For higher levels of sparsity, the accuracy in the no-perturbation case drops down quickly.
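Deterministic weight pruning can be sketched on a flattened weight matrix as follows. The function name, the rounding of the number of kept entries, and the handling of ties are assumptions; the text only specifies that the top $k\%$ of entries by absolute value are kept.

```python
def prune_weights(weights, keep_percent):
    """Keep the keep_percent% largest-magnitude entries; set the rest to 0."""
    n_keep = round(len(weights) * keep_percent / 100.0)
    # Indices sorted from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    keep = set(order[:n_keep])
    return [w if j in keep else 0.0 for j, w in enumerate(weights)]


pruned = prune_weights([0.1, -2.0, 0.5, 3.0], keep_percent=50)
```

Unlike SAP, this mask is fixed for a given weight matrix, so the adversary sees the same sparse model on every forward pass.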

B.4 Stochastic weight pruning (SWP)

The new weight entry, $M(W^{i})_{j}$, is given by

These models behave very similarly to the dense model. We tried drawing a range of percentages of samples, but no evident robustness could be seen against adversarial examples (Figure 5(d), the legend indicates the value of $k$). For a small $s$, it is very similar to the dense model. As $s$ increases, these models do marginally better for low non-zero $\lambda$ values, and then their accuracy drops again (similar to the SAP case).

B.5 Random noisy activations (RNA)

Next we turn our attention to the activation maps in the dense model. One simple way of introducing stochasticity to the activations is by adding random Gaussian noise to each activation entry, with mean $0$ and constant standard deviation $s$. So each activation map $h^{i}$ now changes to $M(h^{i})$, where the $j$'th entry is given by

$$M(h^{i})_{j}=(h^{i})_{j}+\eta,\qquad\eta\sim\mathcal{N}(0,s^{2}).$$

These models too do not offer any robustness against adversarial examples. Their accuracy drops quickly with $\lambda$ and $s$ (Fig. 5(e), the legend indicates the value of $s$).

B.6 Randomly scaled activations (RSA)

Instead of having additive noise, we can also make the model stochastic by scaling the activations. The scale factor can be picked from a Gaussian distribution, with mean $1$ and constant standard deviation $s$. So each activation map $h^{i}$ now changes to $M(h^{i})$, where the $j$'th entry is given by

$$M(h^{i})_{j}=(h^{i})_{j}\cdot\eta,\qquad\eta\sim\mathcal{N}(1,s^{2}).$$

These models perform similarly to the dense model, exhibiting no additional robustness against adversarial examples (Figure 5(f), the legend indicates the value of $s$).
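The additive and multiplicative Gaussian noise baselines of this appendix share the same two operations, applied either to weights (RNW, RSW) or to activations (RNA, RSA). A minimal sketch on flattened lists is given below; the function names and the choice to draw independent noise per entry are assumptions.

```python
import random


def add_noise(values, s, rng):
    """Additive noise: each entry gets independent N(0, s^2) noise (RNW / RNA)."""
    return [v + rng.gauss(0.0, s) for v in values]


def scale_noise(values, s, rng):
    """Multiplicative noise: each entry is scaled by an independent N(1, s^2) factor (RSW / RSA)."""
    return [v * rng.gauss(1.0, s) for v in values]


rng = random.Random(0)
h = [0.0, 1.0, 3.0, 0.5]          # toy activation map (assumption)
noisy = add_noise(h, s=0.1, rng=rng)
scaled = scale_noise(h, s=0.1, rng=rng)
```

One visible difference from SAP is that multiplicative noise leaves zero entries at zero and neither variant prunes anything, so every entry still carries the adversary's perturbation forward.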