Investigating the influence of noise and distractors on the interpretation of neural networks

Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, Sven Dähne

Introduction

In recent years, deep learning had a huge impact on the machine learning community beating the state of the art across a wide range of applications (e.g. ImageNet classification ). Beyond applying deep learning to novel tasks and improving performance, considerable effort has been put into making machine learning models more interpretable. One of the first works on this topic, Zeiler et al. propose the Deconvnet as a means to visualize the features learned by the neural network. Simonyan et al. propose class-specific saliency maps corresponding to the gradient of an output neutron with respect to the input. Both approaches are related and differ mainly in the positioning of the ReLu units . Bach et al. define relevance as the contribution of each input variable to the prediction introducing layer-wise relevance propagation (LRP). Montavon et al. provide a theoretical basis for LRP and generalize the method to the deep Taylor decomposition. It is striking that, while neural networks explanation methods are mainly based on the gradient, such a procedure is considered unsuitable to obtain an interpretation in Neuroimaging due to noise and distractor signals in the data . This motivated the analysis of explanation methods below.

Linear projections in the presence of noise and distractors

We start with the following assumptions based on : The observed data $\bm{x}$ is generated by a model $\bm{x}=\bm{a}_{t}s_{t}+A_{n}\bm{s}_{n}^{T}+\mathbf{\epsilon}$ , being composed of a task-related component $\bm{a}_{t}s_{t}$ and a noise component $A_{n}\bm{s}_{n}^{T}+\mathbf{\epsilon}$ . Here, $s_{t}$ is the hidden signal to be recovered, i.e., the class label, regression output or hidden neuron activation prior to the non-linearity. The vector $\bm{s}_{n}$ comprises uncorrelated latent signals with $E\left[\bm{s}_{n}\right]=\mathbf{0}$ and $Cov[s_{t},\bm{s}_{n}]=\mathbf{0}$ , $\mathbf{\epsilon}$ contains zero mean Gaussian noise. While $s_{t},\mathbf{s_{n}}$ and $\mathbf{\epsilon}$ are example-specific, the patterns $\bm{a}_{t},A_{n}$ are shared across the data set. The vector $\bm{a}_{t}$ describes how changes to the task-related signal result in changes to our observed data $\bm{x}$ . The matrix $A_{t}$ describes in which directions the data can change without changing the desired output. Note that there is no requirement that $\bm{a}_{t}$ and the components in $A_{n}$ are orthogonal to each other. A linear projection $\bm{w}$ , trained to detect a specific feature or class membership indicated by $s_{t}$ , has the following expected output:

Hence, it is required that $\bm{w}^{T}\bm{a}_{t}=1$ such that the desired target signal $s_{t}$ can be recovered. Given $\bm{a}_{t}$ , this can be fulfilled by infinitely many $\bm{w}$ . Additionally, $\bm{w}$ has to be orthogonal to task-unrelated variations: $\bm{w}^{T}A_{n}=\mathbf{0}$ . This is critical to the interpretation of $\bm{w}$ . The weight vector of a linear projection yields the direction of steepest ascent concerning the output of the classifier. It is important to realize that this is not necessarily the task-related direction of variation because of the orthogonality requirement between $\bm{w}$ and $A_{n}$ . For this reason, gradient-based approaches, like deconvnet and saliency maps, are known to be uninformative for linear classifiers in noisy environments (e.g. EEG analysis). On top of that, the reasoning above makes clear that we implicitly introduce additional assumptions about our data and networks when using different explanation rules or visualisations.

What assumptions do explanation rules make implicitly?

The deep taylor decomposition performs Taylor expansions in a layer-wise fashion distributing the relevance to the inputs of the neural networkFor simplicity we will assume that biases are introduced by a constant input neuron with activation $1$ . It is easy to show that in this setting, for ReLu non-linearities and max-pooling, the saliency map multiplied elementwise with the input corresponds to the explanation obtained by the $z$ -rule. This proof is included in the appendix.

The goal of the deep taylor decomposition is to decompose the output of a neural network into contributions that can be assigned to the different input variables (i.e. pixels when considering images). Formally this is described as follows. Let $R^{\text{out}}_{j}=f_{i}(\bm{x})$ be the $j$ th output of the neural network for $D$ -dimensionl input $\bm{x}=[x_{1},\ldots,x_{D}]^{T}$ . We want to have a decomposition of $R^{\text{out}}_{j}=\sum_{i=1}^{D}R^{in}_{i}$ where we sum over the $D$ inputs of the neural network. The key idea being that inputs pixels that contribute more to the final output of the network are more important than pixels that contribute less.

The selection of a root point can be done based on different search directions. In the original paper, a search direction is chosen based on constraints of the domain of the previous layer, e.g., if the input neuron is limited in range or constrained to be positive as it is the output of a ReLu layer. In contrast, we argue to think about the search direction of the root point with respect to the generative model introduced above instead of the input domain. Ideally, we find the direction in which the data shows task-related variations. In the remainder of this section, we will take a look at how various propagation rules for deep Taylor decomposition correspond to different generative models defined by $\bm{a}_{t},A_{n}$ and propose two new rules. Not all known explanation rules are considered due to space constraints.

The $z$ -rule was originally proposed in and later integrated in the deep Taylor framework . It corresponds to a root point which is a constant $\mathbf{0}$ , yielding $\bm{x}=\bm{a}_{t}s_{t}+A_{n}\bm{s}_{n}^{T}$ with $\bm{a}_{t}=\frac{\bm{x}}{\bm{w}^{T}\bm{x}}$ and $A_{n}\bm{s}_{n}=\mathbf{0}$ . Hence, when using the z-rule, we assume that noise is not present in our data and that all input directions are informativePrevious work by Montavon et al. has shown that the amount of noise decreases as one goes to higher layers of the neural network. Consequently, this rule can be used reliably for decomposing higher level neurons but one should be careful when applying it to lower layer neurons on noisy data..

The $w^{2}$ -rule and a new $w^{+}$ -rule The $w^{2}$ -rule was proposed for neurons with unbounded domain (i.e., the input layer). This rule chooses the closest root point lying along the direction of the gradient. In the generative model this corresponds to $\bm{a}_{t}=\frac{\bm{w}}{\bm{w}^{T}\bm{w}}$ and $A_{n}\bm{s}_{n}=\bm{x}-\frac{\bm{w}}{\bm{w}^{T}\bm{w}}s_{t}$ . While it was originally intended to be used for the input layer, we argue that the $w^{2}$ -rule can be applied across the entire network with a slight modification: the search direction is now chosen such that relevance can only be propagated to neurons with non-zero activation: $\bm{w}^{+}=\bm{w}\odot\mathbf{1}_{\bm{x}\neq 0}$ . This yields $\bm{a}_{t}=\frac{\mathbf{w^{+}}}{\mathbf{w^{+}}^{T}\mathbf{w^{+}}}$ and assumes that the detected features lie along the gradient with respect to the input in the active input neurons. The assumption that the data varies along the direction of the gradient is also used by the saliency map and the deconvnet.

The $a$ -rule and $a^{+}$ rule. Finally, we propose a new rule based on what is commonly applied in Neuroimaging . We will learn the vector $\bm{a}_{t}$ by regressing from the output of the layer (before non-linearity) to each input neuron. We make the simplifying assumption that all neurons are independent. Thus, the task-relevant direction is approximated for each neuron as: $\hat{\bm{a}}=(\bm{w}^{T}XX^{T}\bm{w})^{-1}\bm{w}^{T}XX^{T},$ where $X$ contains the inputs to the neuron. The matrix $X=[\bm{x}_{1},\ldots,\bm{x}_{N}]$ contains one data point $\bm{x}_{n}$ per column. In the $a$ -rule, we use this direction directly. As a result, the explanation would not be data-point dependent. Therefore, we make it adaptive by limiting the directions to the active input neurons and rescaling in the $a^{+}$ -rule if it was preceeded by a ReLu activation.

Experiments and Discussion

We present a preliminary evaluation of the introduced rules based on the classification of digits on the MNIST dataset with an MLP. The images are scaled in the range $ $and normal noise has been added with a standard deviation ranging 0.0 to 0.8. The MLP possesses a hidden layer with 200 fully connected neurons and ReLu non-linearity and a softmax output layer. As suggested by Simonyan et al. and Bach et al. , the explanation is started from the output layer before applying the softmax. We compare the saliency map and the$ z $-rule, to the new$ w^{+} $and$ a^{+}$ rule we proposed here.

We look at Figure: reffig:decoding first. This figures highlights the influence of increasing noise levels on the different relevance methods. The saliency map, being essentially the gradient, is mimicking the structure of the input. While there is some considerable background noise in the explanation, it remains quite stable when the noise level is increased. Indicating that the network keeps using the same input pixels. As predicted, the $z$ -rule is very crisp when no noise is present. This is partially caused by the inputs being zero since it is equal to the saliency map multiplied with the input. However, this also causes the quality of the heatmap to degrade when the noise level is increased. This is not surprising since the implied, generative model assumes that there is no noise present.

Compared to $z$ -rule, the explanation of the $w^{+}$ and $a^{+}$ rules are much less inpacted as the noise level increases. The $w^{+}$ rule stands out by assigning positive relevance to the entire input image. On the other hand, the $a^{+}$ method focusses more on the center region of the image where there is most variation. This difference is caused by the gradient, which has non-zero weights over the entire input image. The learned $\bm{a}$ vectors show only the directions with class-relevant variation in the data. We argue that for MNIST focussing on the center is reasonable since this is where the input images differ and class information is present.

From a qualitative point, we observe in this setting that both Simonyan’s saliency map approach and the $a^{+}$ -rule indicate that the gaps between the two vertical lines at the upper part as well as the bottom of the four are important. This intuitively makes sense since this would change the four into a nine or eight, respectively.

In Figure: 2 we present more images from the MNIST dataset. Here we observe a general trend. The $z$ -rule focusses on the symbol and assigns almost no relevance to the region outside of the actual number. The Simonyan saliency map and the two other decomposition rules assign much relevance to the region outside of the actual number. For the $a^{+}$ -rule this effect is particularly strong and the visualisation indicates that the black regions next to the number itself are more important than the actual number.

Conclusion

In this abstract, we have taken a different theoretical view on the explanation of neural networks using the deep taylor decomposition. We have shown how different explanation rules correspond to different generative models of our observed data and how this influences the explanation. Based on our observations, we have also introduced a new explanation rule that learns the directions of variation related to the task. While this abstract has generated new insights into the explanation process, further research on large networks and additional datasets is needed to pick the appropriate model for each task and network architecture. Furthermore, benchmark taks where the relevant component is known are needed to make the evaluation of relevance visualisations on neural networks more reliable.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement NO 657679. Additional support was provided by the Federal Ministry of Education and Research (BMBF) for the Berlin Big Data Center BBDC (01IS14013A).

References

Appendix A The connection between the z𝑧z-rule and the Simonyan saliancy map.

In this section, we will show that the $z$ -rule with biases included as input neurons is equivalent to the gradient (Simonyan’s saliency map) multiplied elementwise with the input. We are assuming that there is only a ReLu activation or max-pooling. The effect of non-active ReLu neurons and paths blocked by max-pooling is included taking the sum only over active neurons.

By definition, the relevance for a neuron in the top layer $t$ is defined as follows:

This can be expressed as the gradient multiplied with the neuron activation of the previous layer:

Given that for layer $t-n$ the $z$ -rule explanation is

This shows that the $z$ -rule is equivalent to the saliency map multiplied with the input.

A.2 Implications

Since the gradient of the weight vector for a network with only ReLu and max-pooling units yields a linear projection that performs the identical operation for this specific datapoint as the neural network itself. Hence, the fact that the $z$ -rule can be decomposed into the gradient multiplied with the input proves that for these networks the $z$ -rule results in a valid decomposition of the actual function the neural network computes. However, this decomposition, as discussed before, assumes that there is no (structured) noise or distracting pattern present in the data. This approach sees the neural network as a data-point specific function generator.