Investigating the influence of noise and distractors on the interpretation of neural networks

Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, Sven Dähne

Introduction

In recent years, deep learning has had a huge impact on the machine learning community, beating the state of the art across a wide range of applications (e.g. ImageNet classification). Beyond applying deep learning to novel tasks and improving performance, considerable effort has been put into making machine learning models more interpretable. In one of the first works on this topic, Zeiler et al. propose the Deconvnet as a means to visualize the features learned by the neural network. Simonyan et al. propose class-specific saliency maps corresponding to the gradient of an output neuron with respect to the input. Both approaches are related and differ mainly in the positioning of the ReLU units. Bach et al. define relevance as the contribution of each input variable to the prediction, introducing layer-wise relevance propagation (LRP). Montavon et al. provide a theoretical basis for LRP and generalize the method to the deep Taylor decomposition. It is striking that, while neural network explanation methods are mainly based on the gradient, such a procedure is considered unsuitable for obtaining an interpretation in neuroimaging due to noise and distractor signals in the data. This motivated the analysis of explanation methods below.

Linear projections in the presence of noise and distractors

We start with the following assumptions, adopted from prior work: The observed data $\bm{x}$ is generated by a model $\bm{x}=\bm{a}_{t}s_{t}+A_{n}\bm{s}_{n}^{T}+\mathbf{\epsilon}$, composed of a task-related component $\bm{a}_{t}s_{t}$ and a noise component $A_{n}\bm{s}_{n}^{T}+\mathbf{\epsilon}$. Here, $s_{t}$ is the hidden signal to be recovered, i.e., the class label, regression output or hidden neuron activation prior to the non-linearity. The vector $\bm{s}_{n}$ comprises uncorrelated latent signals with $E\left[\bm{s}_{n}\right]=\mathbf{0}$ and $Cov[s_{t},\bm{s}_{n}]=\mathbf{0}$, and $\mathbf{\epsilon}$ contains zero-mean Gaussian noise. While $s_{t}$, $\bm{s}_{n}$ and $\mathbf{\epsilon}$ are example-specific, the patterns $\bm{a}_{t}$, $A_{n}$ are shared across the data set. The vector $\bm{a}_{t}$ describes how changes to the task-related signal result in changes to our observed data $\bm{x}$. The matrix $A_{n}$ describes in which directions the data can change without changing the desired output. Note that there is no requirement that $\bm{a}_{t}$ and the components in $A_{n}$ are orthogonal to each other. A linear projection $\bm{w}$, trained to detect a specific feature or class membership indicated by $s_{t}$, has the following expected output:

$$E_{\mathbf{\epsilon}}\left[\bm{w}^{T}\bm{x}\right]=\bm{w}^{T}\bm{a}_{t}s_{t}+\bm{w}^{T}A_{n}\bm{s}_{n}^{T}.$$

Hence, it is required that $\bm{w}^{T}\bm{a}_{t}=1$ such that the desired target signal $s_{t}$ can be recovered. Given $\bm{a}_{t}$, this can be fulfilled by infinitely many $\bm{w}$. Additionally, $\bm{w}$ has to be orthogonal to task-unrelated variations: $\bm{w}^{T}A_{n}=\mathbf{0}$. This is critical to the interpretation of $\bm{w}$. The weight vector of a linear projection yields the direction of steepest ascent concerning the output of the classifier. It is important to realize that this is not necessarily the task-related direction of variation because of the orthogonality requirement between $\bm{w}$ and $A_{n}$. For this reason, gradient-based approaches, like deconvnet and saliency maps, are known to be uninformative for linear classifiers in noisy environments (e.g. EEG analysis). On top of that, the reasoning above makes clear that we implicitly introduce additional assumptions about our data and networks when using different explanation rules or visualisations.
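The two requirements on $\bm{w}$ can be checked numerically. The following is a minimal sketch, assuming a synthetic data set drawn from the generative model above. The dimensions, the random seed, the helper names `simulate` and `extraction_filter`, and the use of a pseudo-inverse to obtain one admissible $\bm{w}$ are illustrative assumptions and not part of the original analysis.

```python
import numpy as np

def simulate(n_samples=5000, dim=5, n_distractors=2, noise_std=0.01, seed=0):
    """Draw data from x = a_t s_t + A_n s_n + eps (one example per column)."""
    rng = np.random.default_rng(seed)
    a_t = rng.normal(size=dim)                      # task-related pattern
    A_n = rng.normal(size=(dim, n_distractors))     # distractor patterns
    s_t = rng.normal(size=n_samples)                # hidden target signal
    s_n = rng.normal(size=(n_distractors, n_samples))
    eps = noise_std * rng.normal(size=(dim, n_samples))
    X = np.outer(a_t, s_t) + A_n @ s_n + eps
    return X, s_t, a_t, A_n

def extraction_filter(a_t, A_n):
    """Minimum-norm w with w^T a_t = 1 and w^T A_n = 0 (an illustrative choice)."""
    M = np.column_stack([a_t, A_n])
    target = np.zeros(M.shape[1])
    target[0] = 1.0
    return np.linalg.pinv(M.T) @ target

X, s_t, a_t, A_n = simulate()
w = extraction_filter(a_t, A_n)
cosine = w @ a_t / (np.linalg.norm(w) * np.linalg.norm(a_t))

print("w^T a_t          =", w @ a_t)
print("max |w^T A_n|    =", np.abs(w @ A_n).max())
print("corr(w^T x, s_t) =", np.corrcoef(w @ X, s_t)[0, 1])
print("cosine(w, a_t)   =", cosine)
```

In this sketch the projection recovers $s_{t}$ while the cosine between $\bm{w}$ and $\bm{a}_{t}$ is expected to stay below one, because $\bm{a}_{t}$ is in general not orthogonal to the distractor patterns in $A_{n}$.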

What assumptions do explanation rules make implicitly?

The deep Taylor decomposition performs Taylor expansions in a layer-wise fashion, distributing the relevance to the inputs of the neural network. For simplicity we will assume that biases are introduced by a constant input neuron with activation $1$. It is easy to show that in this setting, for ReLU non-linearities and max-pooling, the saliency map multiplied elementwise with the input corresponds to the explanation obtained by the $z$-rule. This proof is included in the appendix.

The goal of the deep Taylor decomposition is to decompose the output of a neural network into contributions that can be assigned to the different input variables (i.e. pixels when considering images). Formally this is described as follows. Let $R^{\text{out}}_{j}=f_{j}(\bm{x})$ be the $j$th output of the neural network for a $D$-dimensional input $\bm{x}=[x_{1},\ldots,x_{D}]^{T}$. We want a decomposition $R^{\text{out}}_{j}=\sum_{i=1}^{D}R^{\text{in}}_{i}$, where we sum over the $D$ inputs of the neural network. The key idea is that input pixels that contribute more to the final output of the network are more important than pixels that contribute less.

The selection of a root point can be done based on different search directions. In the original paper, a search direction is chosen based on constraints of the domain of the previous layer, e.g., if the input neuron is limited in range or constrained to be positive as it is the output of a ReLU layer. In contrast, we argue that the search direction for the root point should be chosen with respect to the generative model introduced above instead of the input domain. Ideally, we find the direction in which the data shows task-related variations. In the remainder of this section, we will take a look at how various propagation rules for deep Taylor decomposition correspond to different generative models defined by $\bm{a}_{t}$, $A_{n}$ and propose two new rules. Not all known explanation rules are considered due to space constraints.

The $z$-rule was proposed before the deep Taylor framework and later integrated into it. It corresponds to a root point which is a constant $\mathbf{0}$, yielding $\bm{x}=\bm{a}_{t}s_{t}+A_{n}\bm{s}_{n}^{T}$ with $\bm{a}_{t}=\frac{\bm{x}}{\bm{w}^{T}\bm{x}}$ and $A_{n}\bm{s}_{n}=\mathbf{0}$. Hence, when using the $z$-rule, we assume that noise is not present in our data and that all input directions are informative. Previous work by Montavon et al. has shown that the amount of noise decreases as one goes to higher layers of the neural network. Consequently, this rule can be used reliably for decomposing higher-level neurons, but one should be careful when applying it to lower-layer neurons on noisy data.
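For a single linear neuron $y=\bm{w}^{T}\bm{x}$, the $z$-rule and its implied generative model can be sketched as follows. This is a minimal illustration, assuming the usual form of the rule in which input $i$ receives the share $w_{i}x_{i}/\bm{w}^{T}\bm{x}$ of the relevance. The function name `z_rule` and the random example are hypothetical.

```python
import numpy as np

def z_rule(w, x, relevance_out=None):
    """Redistribute the relevance of a linear neuron y = w^T x with the z-rule."""
    y = w @ x
    if relevance_out is None:
        relevance_out = y          # start the explanation from the neuron output
    return (w * x) / y * relevance_out

rng = np.random.default_rng(1)
w = rng.normal(size=8)
x = np.abs(rng.normal(size=8))     # e.g. outputs of a preceding ReLU layer
x[:3] = 0.0                        # inactive inputs

R = z_rule(w, x)
s_t = w @ x                        # signal assigned to the task
a_t = x / s_t                      # implied pattern: the input itself, rescaled

print("relevance:", R)
print("sum of relevance vs. output:", R.sum(), s_t)
print("residual x - a_t s_t:", np.abs(x - a_t * s_t).max())
```

Because $\bm{a}_{t}s_{t}$ reproduces $\bm{x}$ exactly, nothing is left over for a distractor or noise term, which is the assumption discussed above.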

The $w^{2}$-rule and a new $w^{+}$-rule. The $w^{2}$-rule was proposed for neurons with unbounded domain (i.e., the input layer). This rule chooses the closest root point lying along the direction of the gradient. In the generative model this corresponds to $\bm{a}_{t}=\frac{\bm{w}}{\bm{w}^{T}\bm{w}}$ and $A_{n}\bm{s}_{n}=\bm{x}-\frac{\bm{w}}{\bm{w}^{T}\bm{w}}s_{t}$. While it was originally intended to be used for the input layer, we argue that the $w^{2}$-rule can be applied across the entire network with a slight modification: the search direction is now chosen such that relevance can only be propagated to neurons with non-zero activation: $\bm{w}^{+}=\bm{w}\odot\mathbf{1}_{\bm{x}\neq 0}$. This yields $\bm{a}_{t}=\frac{\bm{w}^{+}}{{\bm{w}^{+}}^{T}\bm{w}^{+}}$ and assumes that the detected features lie along the gradient with respect to the input in the active input neurons. The assumption that the data varies along the direction of the gradient is also used by the saliency map and the deconvnet.
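Both rules can be written as one propagation step with different search directions. The sketch below assumes a single linear neuron and a first-order Taylor expansion at the root point $\bm{x}-\bm{a}_{t}s_{t}$, which gives the relevance $\bm{w}\odot\bm{a}_{t}s_{t}$. The helper names `propagate`, `w2_rule` and `w_plus_rule` are hypothetical.

```python
import numpy as np

def propagate(w, x, direction):
    """Relevance of y = w^T x for a root point found along `direction`.

    With a_t = direction / (w^T direction) and s_t = w^T x, the root point is
    x - a_t * s_t and the first-order Taylor terms are w * a_t * s_t.
    """
    a_t = direction / (w @ direction)
    return w * a_t * (w @ x)

def w2_rule(w, x):
    """Search along the gradient w."""
    return propagate(w, x, w)

def w_plus_rule(w, x):
    """Search along the gradient restricted to inputs with non-zero activation."""
    return propagate(w, x, w * (x != 0))

rng = np.random.default_rng(1)
w = rng.normal(size=8)
x = np.abs(rng.normal(size=8))
x[:3] = 0.0                        # inactive inputs

R_w2 = w2_rule(w, x)
R_wp = w_plus_rule(w, x)
print("w^2-rule:", R_w2)
print("w^+-rule:", R_wp)
print("sums vs. output:", R_w2.sum(), R_wp.sum(), w @ x)
```

Both variants conserve the output of the neuron. The difference is that the $w^{2}$-rule also assigns relevance to inactive inputs, whereas the $w^{+}$-rule leaves them at zero.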

The $a$-rule and $a^{+}$-rule. Finally, we propose a new rule based on what is commonly applied in neuroimaging. We learn the vector $\bm{a}_{t}$ by regressing from the output of the layer (before the non-linearity) to each input neuron. We make the simplifying assumption that all neurons are independent. Thus, the task-relevant direction is approximated for each neuron as $\hat{\bm{a}}=(\bm{w}^{T}XX^{T}\bm{w})^{-1}\bm{w}^{T}XX^{T}$, where the matrix $X=[\bm{x}_{1},\ldots,\bm{x}_{N}]$ contains the inputs to the neuron, one data point $\bm{x}_{n}$ per column. In the $a$-rule, we use this direction directly. As a result, the explanation would not be data-point dependent. Therefore, in the $a^{+}$-rule, we make it adaptive by limiting the directions to the active input neurons and rescaling if the layer was preceded by a ReLU activation.
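The estimator for $\hat{\bm{a}}$ can be tried out on synthetic data. The following is a minimal sketch, assuming data drawn from the generative model with a known pattern and a weight vector that satisfies $\bm{w}^{T}\bm{a}_{t}=1$ and $\bm{w}^{T}A_{n}=\mathbf{0}$. The function names are hypothetical, and the rescaling in `a_plus_direction` (normalising so that $\bm{w}^{T}\bm{a}=1$) is an assumption, since the text does not spell out the rescaling.

```python
import numpy as np

def estimate_pattern(w, X):
    """a_hat = (w^T X X^T w)^{-1} w^T X X^T for X with one example per column."""
    y = w @ X                      # neuron output before the non-linearity
    return (X @ y) / (y @ y)

def a_plus_direction(a_hat, w, x):
    """Assumed a+ variant: keep active inputs, rescale so that w^T a = 1."""
    masked = a_hat * (x != 0)
    return masked / (w @ masked)

# Synthetic data from x = a_t s_t + A_n s_n + eps (illustrative sizes).
rng = np.random.default_rng(2)
dim, n_distractors, n_samples = 5, 2, 50000
a_t = rng.normal(size=dim)
A_n = rng.normal(size=(dim, n_distractors))
s_t = rng.normal(size=n_samples)
s_n = rng.normal(size=(n_distractors, n_samples))
X = np.outer(a_t, s_t) + A_n @ s_n + 0.01 * rng.normal(size=(dim, n_samples))

# One weight vector with w^T a_t = 1 and w^T A_n = 0 (minimum-norm choice).
M = np.column_stack([a_t, A_n])
w = np.linalg.pinv(M.T) @ np.eye(n_distractors + 1)[0]

a_hat = estimate_pattern(w, X)
x = X[:, 0].copy()
x[0] = 0.0                         # pretend the first input is inactive
a_plus = a_plus_direction(a_hat, w, x)

print("true pattern     :", a_t)
print("estimated pattern:", a_hat)
print("w^T a_hat =", w @ a_hat, " w^T a_plus =", w @ a_plus)
```

By construction $\bm{w}^{T}\hat{\bm{a}}=1$, so the estimated direction can be used directly as a search direction for the root point.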

Experiments and Discussion

We present a preliminary evaluation of the introduced rules based on the classification of digits on the MNIST dataset with an MLP. The images are scaled to a fixed range and normal noise has been added with a standard deviation ranging from 0.0 to 0.8. The MLP possesses a hidden layer with 200 fully connected neurons and ReLU non-linearity and a softmax output layer. As suggested by Simonyan et al. and Bach et al., the explanation is started from the output layer before applying the softmax. We compare the saliency map and the $z$-rule to the new $w^{+}$ and $a^{+}$ rules we proposed here.

We look at Figure 1 first. This figure highlights the influence of increasing noise levels on the different relevance methods. The saliency map, being essentially the gradient, mimics the structure of the input. While there is considerable background noise in the explanation, it remains quite stable when the noise level is increased, indicating that the network keeps using the same input pixels. As predicted, the $z$-rule is very crisp when no noise is present. This is partially caused by the inputs being zero, since the $z$-rule explanation is equal to the saliency map multiplied with the input. However, this also causes the quality of the heatmap to degrade when the noise level is increased. This is not surprising, since the implied generative model assumes that there is no noise present.

Compared to the $z$-rule, the explanations of the $w^{+}$ and $a^{+}$ rules are much less impacted as the noise level increases. The $w^{+}$ rule stands out by assigning positive relevance to the entire input image. On the other hand, the $a^{+}$ method focusses more on the center region of the image, where there is most variation. This difference is caused by the gradient, which has non-zero weights over the entire input image. The learned $\bm{a}$ vectors show only the directions with class-relevant variation in the data. We argue that for MNIST focussing on the center is reasonable, since this is where the input images differ and class information is present.

From a qualitative point of view, we observe in this setting that both Simonyan's saliency map approach and the $a^{+}$-rule indicate that the gaps between the two vertical lines at the upper part as well as the bottom of the four are important. This intuitively makes sense, since filling them would change the four into a nine or eight, respectively.

In Figure 2 we present more images from the MNIST dataset. Here we observe a general trend. The $z$-rule focusses on the symbol and assigns almost no relevance to the region outside of the actual number. The Simonyan saliency map and the two other decomposition rules assign much relevance to the region outside of the actual number. For the $a^{+}$-rule this effect is particularly strong and the visualisation indicates that the black regions next to the number itself are more important than the actual number.

Conclusion

In this abstract, we have taken a different theoretical view on the explanation of neural networks using the deep Taylor decomposition. We have shown how different explanation rules correspond to different generative models of our observed data and how this influences the explanation. Based on our observations, we have also introduced a new explanation rule that learns the directions of variation related to the task. While this abstract has generated new insights into the explanation process, further research on large networks and additional datasets is needed to pick the appropriate model for each task and network architecture. Furthermore, benchmark tasks where the relevant component is known are needed to make the evaluation of relevance visualisations on neural networks more reliable.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 657679. Additional support was provided by the Federal Ministry of Education and Research (BMBF) for the Berlin Big Data Center BBDC (01IS14013A).

Appendix A The connection between the $z$-rule and the Simonyan saliency map

In this section, we will show that the $z$-rule with biases included as input neurons is equivalent to the gradient (Simonyan’s saliency map) multiplied elementwise with the input. We assume that the only non-linearities are ReLU activations and max-pooling. The effect of non-active ReLU neurons and paths blocked by max-pooling is included by taking the sum only over active neurons.

By definition, the relevance for a neuron in the top layer $t$ is given as follows:

This can be expressed as the gradient multiplied with the neuron activation of the previous layer:

Given that for layer $t-n$ the $z$-rule explanation is

This shows that the $z$-rule is equivalent to the saliency map multiplied with the input.
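The equivalence can also be verified numerically on a small network. The following is a minimal sketch, assuming a network with one hidden ReLU layer, a linear output neuron, random weights and biases modelled as a constant input of $1$ in each layer. Max-pooling is not covered, and all function names are hypothetical.

```python
import numpy as np

def forward(x, W1, w2):
    """One-hidden-layer ReLU network; biases enter through a constant input of 1."""
    x_aug = np.append(x, 1.0)
    z = x_aug @ W1                               # hidden pre-activations
    h_aug = np.append(np.maximum(z, 0.0), 1.0)   # ReLU outputs plus bias unit
    return x_aug, z, h_aug, h_aug @ w2

def z_rule_explanation(x, W1, w2):
    """Propagate the output through both layers with the z-rule."""
    x_aug, z, h_aug, f = forward(x, W1, w2)
    R_hidden = h_aug * w2            # (h * w2 / f) * f at the linear output
    R_input = np.zeros_like(x_aug)
    for j in np.flatnonzero(z > 0):  # sum only over active ReLU neurons
        R_input += x_aug * W1[:, j] / z[j] * R_hidden[j]
    return R_input, R_hidden

def gradient_times_input(x, W1, w2):
    """Gradient of the output w.r.t. the augmented input, times that input."""
    x_aug, z, _, _ = forward(x, W1, w2)
    active = z > 0
    grad = W1[:, active] @ w2[:-1][active]
    return grad * x_aug

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 16
x = rng.normal(size=n_in)
W1 = rng.normal(size=(n_in + 1, n_hidden))   # last row holds the hidden biases
w2 = rng.normal(size=n_hidden + 1)           # last entry holds the output bias

R_input, R_hidden = z_rule_explanation(x, W1, w2)
f = forward(x, W1, w2)[3]
print("z-rule          :", R_input)
print("gradient * input:", gradient_times_input(x, W1, w2))
print("conservation    :", R_input.sum() + R_hidden[-1], "vs. output", f)
```

In this sketch the relevance assigned to the hidden-layer bias unit stays with that unit, so the input relevances together with it add up to the network output.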

A.2 Implications

The gradient of a network with only ReLU and max-pooling units yields a linear projection that performs the identical operation for this specific data point as the neural network itself. Hence, the fact that the $z$-rule can be decomposed into the gradient multiplied with the input proves that for these networks the $z$-rule results in a valid decomposition of the actual function the neural network computes. However, this decomposition, as discussed before, assumes that there is no (structured) noise or distracting pattern present in the data. This approach sees the neural network as a data-point-specific function generator.