Not Just a Black Box: Learning Important Features Through Propagating Activation Differences

Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, Anshul Kundaje

Introduction

As neural networks become increasingly popular, their “black box” reputation is a barrier to adoption when interpretability is paramount. Understanding the features that lead to a particular output builds trust with users and can lead to novel scientific discoveries. Simonyan et al. (2013) proposed using gradients to generate saliency maps and showed that this is closely related to the deconvolutional nets of Zeiler & Fergus (2014). Guided backpropagation (Springenberg et al., 2014) is another variant which only considers gradients that have positive error signal. As shown in Figure 2, saliency maps can be substantially improved by simply multiplying the gradient with the input signal, which corresponds to a first-order Taylor approximation of how the output would change if the input were set to zero; as we show, the layer-wise relevance propagation rules described in Bach et al. (2015) reduce to this approach, assuming bias terms are included in the denominators.

Gradient-based approaches are problematic because activation functions such as Rectified Linear Units (ReLUs) have a gradient of zero when they are not firing, and yet a ReLU that does not fire can still carry information (Figure 1). Similarly, sigmoid or tanh activations are popular choices for the activation functions of gates in memory units of recurrent neural networks such as GRUs and LSTMs (Chung et al., 2014; Hochreiter & Schmidhuber, 1997), but these activations have a near-zero gradient at high or low inputs even though such inputs can be very significant.

We present DeepLIFT, a method for assigning feature importance that compares a neuron’s activation to its ‘reference’, where the reference is the activation that the neuron has when the network is provided a ‘reference input’ (the reference input is defined according to what is appropriate for the task at hand). This addresses the limitation of gradient-based approaches because the difference from the reference may be non-zero even when the gradient is zero.

DeepLIFT Method

We denote the contribution of neuron $x$ to neuron $y$ as $C_{xy}$ . Let the activation of a neuron $n$ be denoted as $A_{n}$ . Further, let the reference activation of neuron $n$ be denoted $A_{n}^{0}$ , and let the $A_{n}-A_{n}^{0}$ be denoted as $\delta_{n}$ . We define our contributions $C_{xy}$ to satisfy the following properties.

For any set of neurons $S$ whose activations are minimally sufficient to compute the activation of $y$ (that is, if we know the activations of $S$ , we can compute the activation of $y$ , and there is no set $S^{\prime}\subset S$ such that $S^{\prime}$ is sufficient to compute the activation of $y$ - in layman’s terms, $S$ is a full set of non-redundant inputs to $y$ ), the following property holds:

That is, the sum over all the contributions of neurons in $S$ to $y$ equals the difference-from-reference of $y$ .

2 Linear composition

Let $O_{x}$ represent the output neurons of $x$ . The following property holds:

In layman’s terms, each neuron ‘inherits’ a contribution through its outputs in proportion to how much that neuron contributes to the difference-from-reference of the output.

3 Backpropagation Rules

We show that the contributions as defined above can be computed using the following rules (which can be implemented to run on a GPU). The computation is reminiscent of the chain rule used during gradient backpropagation, as equation $2$ makes it possible to start with contribution scores of later layers and use them to find the contribution scores of preceding layers. To avoid issues of numerical stability when $\delta_{n}$ for a particular neuron is small, rather than computing the contribution scores explicitly, we instead compute multipliers $m_{xy}$ that, when multiplied with the difference-from-reference, give the contribution:

Let $t$ represent the target neuron that we intend to compute contributions to, and let $O_{x}$ represent the set of outputs of $x$ . We show that:

The equation above follows from the linear composition property and the definition of the multipliers, as proved below:

In the equations below, $I_{y}$ denotes the set of inputs of $y$ .

Proof. We show that $\delta_{y}=\sum_{x\in I_{y}}m_{xy}\delta_{x}$ .

Using the fact that $A_{n}=A_{n}^{0}+\delta_{n}$ , we have:

We also note that the reference activation $A_{y}^{0}$ can be found as follows:

3.2 Max operation

We consider the case of max operation such as a maxpool:

Where $\bm{1}\{\}$ is the indicator function. If a symbolic computation package is used, then the gradient of $y$ with respect to $x$ can be used in place of $\bm{1}\{A_{x}=A_{y}\}$ .

3.3 Maxout units

i.e. it is the max over $n$ affine functions of the input vector $\vec{x}$ . For a given vector of activations $A_{\vec{x}}$ of the inputs, we split $A_{\vec{x}}-A_{\vec{x}}^{0}$ into segments such that over each segment $s$ , a unique affine function dominates the maxout and the coefficient of an individual input $x$ over that segment is $w(s)_{xy}$ . Let $l(s)$ denote the fraction of $A_{\vec{x}}-A_{\vec{x}}^{0}$ in segment $s$ . We have:

Intuitively speaking, we simply split the piecewise-linear maxout function into regions where it is linear, and do a weighted sum of the coefficients of $x$ in each region according to how much of $A_{\vec{x}}-A_{\vec{x}}^{0}$ falls in that region.

3.4 Other activations

The following choice for $m_{xy}$ , which is the same for all inputs to $y$ , satisfies summation-to-delta:

This rule may be used for nonlinearities like ReLUs, PReLUs, sigmoid and tanh (where $y$ has only one input). Situations where the denominator is near zero can be handled by applying L’hopital’s rule, because by definition:

3.5 Element-wise products

Thus, viable choices for the multipliers are $m_{x_{1}y}=A_{x_{2}}^{0}+0.5\delta_{x_{2}}$ and $m_{x_{2}y}=A_{x_{1}}^{0}+0.5\delta_{x_{1}}$

4 A note on final activation layers

Activation functions such as a softmax or a sigmoid have a maximum $\delta$ of 1.0. Due to the summation to $\delta$ property, the contribution scores for individual features are lower when there are several redundant features present. As an example, consider $A_{t}=\sigma(A_{y})$ (where $sigma$ is the sigmoid transformation) and $A_{y}=A_{x_{1}}+A_{x_{2}}$ . Let the default activations of the inputs be $A_{x_{1}}^{0}=A_{x_{2}}^{0}=0$ . When $x_{1}=100$ and $x_{2}=0$ , we have $C_{x_{1}t}=0.5$ . However, when both $x_{1}=100$ and $x_{2}=100$ , we have $C_{x_{1}t}=C_{x_{2}t}=0.25$ . To avoid this attenuation of contribution in the presence of redundant inputs, we can use the contributions to $y$ rather than $t$ ; in both cases, $C_{x_{1}y}=100$ .

5 A note on Softmax activation

Let ${t_{1},t_{2}...t_{n}}$ represent the output of a softmax transformation on the nodes ${y_{1},y_{2}...y_{n}}$ , such that:

Here, $A_{y_{1}}...A_{y_{n}}$ are affine functions of their inputs. Let $x$ represent a neuron that is an input to $A_{y_{1}}...A_{y_{n}}$ , and let $w_{xy_{i}}$ represent the coefficient of $A_{x}$ in $A_{y_{i}}$ . Because $A_{y_{1}}...A_{y_{n}}$ are followed by a softmax transformation, if $w_{xy_{i}}$ is the same for all $y_{i}$ (that is, $x$ contributes equally to all $y_{i}$ ), then $x$ effectively has zero contribution to $A_{t_{i}}$ . This can be observed by substituting $A_{y_{i}}=w_{xy_{i}}A_{x}+r_{y_{i}}$ in the expression for $A_{t_{i}}$ and canceling out $e^{w_{xy_{i}}A_{x}}$ (here, $r_{y_{i}}$ is the sum of all the remaining terms in the affine expression for $A_{y_{i}}$ )

As mentioned in the previous subsection, in order to avoid attenuation of signal for highly confident predictions, we should compute $C_{xy_{i}}$ rather than $C_{xt_{i}}$ . One way to ensure that $C_{xy_{i}}$ is zero if $w_{xy_{i}}$ is the same for all $y_{i}$ is to mean-normalized the weights as follows:

This transformation will not affect the output of the softmax, but will ensure that the DeepLIFT scores are zero when a particular node contributes equally to all softmax classes.

6 Weight normalization for constrained inputs

Let $y$ be a neuron with some subset of inputs $S_{y}$ that are constrained such that $\sum_{x\in S_{y}}A_{x}=c$ (for example, one-hot encoded input satisfies the constraint $\sum_{x\in S_{y}}A_{x}=1$ , and a convolutional neuron operating on one-hot encoded rows has one constraint per column that it sees). Let the weights from $x$ to $y$ be denoted $w_{xy}$ and let $b_{y}$ be the bias of $y$ . It is advisable to use normalized weights $\bar{w}_{xy}=w_{xy}-\mu$ and bias $\bar{b}_{y}=b_{y}+c\mu$ , where $\mu$ is the mean over all $w_{xy}$ . We note that this maintains the output of the neural net because, for any constant $\mu$ :

The normalization is desirable because, for affine functions, the multipliers $m_{xy}$ are equal to the weights $w_{xy}$ and are thus sensitive to $\mu$ . To take the example of a convolutional neuron operating on one-hot encoded rows: by mean-normalizing $w_{xy}$ for each column in the filter, one can ensure that the contributions $C_{xy}$ from some columns are not systematically overestimated or underestimated relative to the contributions from other columns.

Results

A model with the VGG16 (Long et al., 2015) architecture was trained using the Keras framework (Chollet, 2015) on a scaled-down version of the Imagenet dataset, dubbed ‘Tiny Imagenet’. The images were $64\times 64$ in dimension and belonged to one of 200 output classes. Results shown in Figure 2; the reference input was an input of all zeros after preprocessing.

2 Genomics

We apply DeepLIFT to models trained on genomic sequence. The positive class requires that the DNA patterns ’GATA’ and ’CAGATG’ appear in the length-200 sequence together. The negative class has only one of the two patterns appearing once or twice. Outside the core patterns (which were sampled from a generative model) we randomly sample the four bases A, C, G and T. A CNN was trained using the Keras framework (Chollet, 2015) on one-hot encoded sequences with 20 convolutional filters of length 15 and stride 1 and a max pool layer of width and stride 50, followed by two fully connected layers of size 200. PReLU nonlinearities were used for the hidden layers. This model performs well with auROC of 0.907. The misclassified examples primarily occur when one of the patterns erroneously arises in the randomly sampled background.

We then run DeepLIFT to assign an importance score to each base in the correctly predicted sequences. The reference input is an input of all zeros post weight-normalization (see 2.6) of the first convolutional layer (after weight normalization, the linear activation of a convolutional neuron for an input of all zeros is the bias, which is the same as the average activation across all four bases at each position). We compared the results to the gradient*input (Figure 3).

Discussion

Prevailing feature importance methods such as the saliency maps of Simonyan et al., the deconvolutional nets of Zeiler et al. and the guided backpropagation of Springenberg et al. are variants of computing gradients. As shown in Figure 1, this can give misleading results when the local gradient is zero. DeepLIFT instead considers the deviation from a neuron’s reference activity. This makes it capable of handling RNN memory units gated by activations that have vanishing gradients (eg: sigmoid, tanh).

Layer-wise Relevance Propagation (LRP), proposed by Bach et al., does not obviously rely on gradients; however, as we show, if all activations are piecewise linear, LRP reduces to gradient*input (a first-order Taylor approximation of the change in output if the input is set to zero). If all reference activations are zero (as happens when all bias terms are zero and all reference input values are zero), DeepLIFT and LRP give similar results (except that by computing contributions using multipliers, DeepLIFT circumvents the numerical stability problems that LRP faces). In practice, biases are often non-zero, which is why DeepLIFT produces superior results (Figures 2 & 3).

We show when all activations are piecewise linear and bias terms are included in the calculation, the Layer-wise Relevance Propagation (LRP) of Bach et al., reduces to gradient*input. We refer to Samek et al. (2015) for the concise description of LRP:

Unpooling: “The backwards signal is redirected proportionally onto the location for which the activation was recorded in the forward pass”: This is trivially the same as gradient*input, because the gradient*input will be zero for all locations which do not activation the pooling layer, and equal to the output for the location that does.

Filtering: We consider the first rule described in Samek et al., where $z_{ij}=a_{i}^{(l)}w_{ij}^{(l,l+1)}$ is the weighted activation of neuron $i$ onto neuron $j$ in the next layer, and $l$ is the index of the layer:

The term involving $\epsilon$ is included to avoid issues of numerical instability when $\sum_{i}^{\prime}z_{i^{\prime}j}$ is near zero. The second rule described in Samek et al. is another variant designed to address the problem of numerical instability. We show that gradient*input gives the exact result as $\epsilon\rightarrow 0$ (i.e. it solves the issue of numerical instability altogether).

Dropping the term for $\epsilon$ and substituting $z_{ij}=a_{i}^{(l)}w_{ij}^{(l,l+1)}$ , we have:

Assuming the bias term is included (which would be necessary for the conservation property described in Bach et al. to hold), the denominator is simply the activation of neuron $j$ , i.e.:

Let us now consider what happens when there are two filtering operations applied sequentially. Let $R_{ik}$ denote the relevance inherited by neuron $i$ in layer $l$ from neuron $k$ in layer $l+2$ , passing through the neurons in layer $l+1$ . We have:

Thus, we see that denominator $a_{j}^{(l+1)}$ for the intermediate layer cancelled out, leaving us with $a_{i}^{(l)}w_{ij}^{(l,l+1)}w_{jk}^{(l+1,l+2)}$ , where $w_{ij}^{(l,l+1)}w_{jk}^{(l+1,l+2)}$ is the gradient of $a_{k}^{(l+1)}$ with respect to $a_{i}^{(l)}$ . The only term left in the denominator is the activation of the last layer, $a_{k}^{(l+1)}$ ; if we set the relevance of neurons in the final layer to be equal to their own activation, then $R_{k}^{(l+2)}$ (assuming $k$ is the last layer) would cancel out $a_{k}^{(l+1)}$ in the denominator, leaving us with:

Which is simply equal to the activation $a_{i}^{(l)}$ multiplied by the gradient of $a_{k}$ with respect to $a_{i}^{(l)}$ . In situations where the relevance of the last layer is not the same as its activation (which may happen if there is a nonlinear transformation such as a sigmoid, as a sigmoid output of 0.5 occurs when the input is 0), one can simply compute gradient*input with respect to the linear term before the final nonlinearity (which is what we did; for softmax layers, we apply the normalization described in 2.5).

Nonlinearity: “The backward signal is simply propagated onto the lower layer, ignoring the rectification operation”: While this is not obviously the same as gradient*input, it should be noted that when a rectified linear unit is inactive, it has an activation of zero and the rule for filtering (described above) would assign it zero importance. Furthermore, when the rectified linear unit is active, its gradient is 1. Thus, when the unit is inactive, gradient*input is 0 and LRP assigns 0 signal; when a unit is active, gradient*input is equal to the output and LRP assigns all signal. The two approaches converge.

Author contributions

AS & PG conceived of DeepLIFT. AS implemented DeepLIFT in software. PG led application to genomics. AYS led application to Tiny Imagenet. AK provided guidance and feedback. AS, PG, AYS & AK prepared the manuscript.