Not Just a Black Box: Learning Important Features Through Propagating Activation Differences

Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, Anshul Kundaje

Introduction

As neural networks become increasingly popular, their “black box” reputation is a barrier to adoption when interpretability is paramount. Understanding the features that lead to a particular output builds trust with users and can lead to novel scientific discoveries. Simonyan et al. (2013) proposed using gradients to generate saliency maps and showed that this is closely related to the deconvolutional nets of Zeiler & Fergus (2014). Guided backpropagation (Springenberg et al., 2014) is another variant which only considers gradients that have positive error signal. As shown in Figure 2, saliency maps can be substantially improved by simply multiplying the gradient with the input signal, which corresponds to a first-order Taylor approximation of how the output would change if the input were set to zero; as we show, the layer-wise relevance propagation rules described in Bach et al. (2015) reduce to this approach, assuming bias terms are included in the denominators.

Gradient-based approaches are problematic because activation functions such as Rectified Linear Units (ReLUs) have a gradient of zero when they are not firing, and yet a ReLU that does not fire can still carry information (Figure 1). Similarly, sigmoid or tanh activations are popular choices for the activation functions of gates in memory units of recurrent neural networks such as GRUs and LSTMs (Chung et al., 2014; Hochreiter & Schmidhuber, 1997), but these activations have a near-zero gradient at high or low inputs even though such inputs can be very significant.

We present DeepLIFT, a method for assigning feature importance that compares a neuron’s activation to its ‘reference’, where the reference is the activation that the neuron has when the network is provided a ‘reference input’ (the reference input is defined according to what is appropriate for the task at hand). This addresses the limitation of gradient-based approaches because the difference from the reference may be non-zero even when the gradient is zero.

DeepLIFT Method

We denote the contribution of neuron xx to neuron yy as CxyC_{xy}. Let the activation of a neuron nn be denoted as AnA_{n}. Further, let the reference activation of neuron nn be denoted An0A_{n}^{0}, and let the AnAn0A_{n}-A_{n}^{0} be denoted as δn\delta_{n}. We define our contributions CxyC_{xy} to satisfy the following properties.

For any set of neurons SS whose activations are minimally sufficient to compute the activation of yy (that is, if we know the activations of SS, we can compute the activation of yy, and there is no set SSS^{\prime}\subset S such that SS^{\prime} is sufficient to compute the activation of yy - in layman’s terms, SS is a full set of non-redundant inputs to yy), the following property holds:

That is, the sum over all the contributions of neurons in SS to yy equals the difference-from-reference of yy.

2 Linear composition

Let OxO_{x} represent the output neurons of xx. The following property holds:

In layman’s terms, each neuron ‘inherits’ a contribution through its outputs in proportion to how much that neuron contributes to the difference-from-reference of the output.

3 Backpropagation Rules

We show that the contributions as defined above can be computed using the following rules (which can be implemented to run on a GPU). The computation is reminiscent of the chain rule used during gradient backpropagation, as equation 22 makes it possible to start with contribution scores of later layers and use them to find the contribution scores of preceding layers. To avoid issues of numerical stability when δn\delta_{n} for a particular neuron is small, rather than computing the contribution scores explicitly, we instead compute multipliers mxym_{xy} that, when multiplied with the difference-from-reference, give the contribution:

Let tt represent the target neuron that we intend to compute contributions to, and let OxO_{x} represent the set of outputs of xx. We show that:

The equation above follows from the linear composition property and the definition of the multipliers, as proved below:

In the equations below, IyI_{y} denotes the set of inputs of yy.

Proof. We show that δy=xIymxyδx\delta_{y}=\sum_{x\in I_{y}}m_{xy}\delta_{x}.

Using the fact that An=An0+δnA_{n}=A_{n}^{0}+\delta_{n}, we have:

We also note that the reference activation Ay0A_{y}^{0} can be found as follows:

3.2 Max operation

We consider the case of max operation such as a maxpool:

Where 1{}\bm{1}\{\} is the indicator function. If a symbolic computation package is used, then the gradient of yy with respect to xx can be used in place of 1{Ax=Ay}\bm{1}\{A_{x}=A_{y}\}.

3.3 Maxout units

i.e. it is the max over nn affine functions of the input vector x\vec{x}. For a given vector of activations AxA_{\vec{x}} of the inputs, we split AxAx0A_{\vec{x}}-A_{\vec{x}}^{0} into segments such that over each segment ss, a unique affine function dominates the maxout and the coefficient of an individual input xx over that segment is w(s)xyw(s)_{xy}. Let l(s)l(s) denote the fraction of AxAx0A_{\vec{x}}-A_{\vec{x}}^{0} in segment ss. We have:

Intuitively speaking, we simply split the piecewise-linear maxout function into regions where it is linear, and do a weighted sum of the coefficients of xx in each region according to how much of AxAx0A_{\vec{x}}-A_{\vec{x}}^{0} falls in that region.

3.4 Other activations

The following choice for mxym_{xy}, which is the same for all inputs to yy, satisfies summation-to-delta:

This rule may be used for nonlinearities like ReLUs, PReLUs, sigmoid and tanh (where yy has only one input). Situations where the denominator is near zero can be handled by applying L’hopital’s rule, because by definition:

3.5 Element-wise products

Thus, viable choices for the multipliers are mx1y=Ax20+0.5δx2m_{x_{1}y}=A_{x_{2}}^{0}+0.5\delta_{x_{2}} and mx2y=Ax10+0.5δx1m_{x_{2}y}=A_{x_{1}}^{0}+0.5\delta_{x_{1}}

4 A note on final activation layers

Activation functions such as a softmax or a sigmoid have a maximum δ\delta of 1.0. Due to the summation to δ\delta property, the contribution scores for individual features are lower when there are several redundant features present. As an example, consider At=σ(Ay)A_{t}=\sigma(A_{y}) (where sigmasigma is the sigmoid transformation) and Ay=Ax1+Ax2A_{y}=A_{x_{1}}+A_{x_{2}}. Let the default activations of the inputs be Ax10=Ax20=0A_{x_{1}}^{0}=A_{x_{2}}^{0}=0. When x1=100x_{1}=100 and x2=0x_{2}=0, we have Cx1t=0.5C_{x_{1}t}=0.5. However, when both x1=100x_{1}=100 and x2=100x_{2}=100, we have Cx1t=Cx2t=0.25C_{x_{1}t}=C_{x_{2}t}=0.25. To avoid this attenuation of contribution in the presence of redundant inputs, we can use the contributions to yy rather than tt; in both cases, Cx1y=100C_{x_{1}y}=100.

5 A note on Softmax activation

Let t1,t2...tn{t_{1},t_{2}...t_{n}} represent the output of a softmax transformation on the nodes y1,y2...yn{y_{1},y_{2}...y_{n}}, such that:

Here, Ay1...AynA_{y_{1}}...A_{y_{n}} are affine functions of their inputs. Let xx represent a neuron that is an input to Ay1...AynA_{y_{1}}...A_{y_{n}}, and let wxyiw_{xy_{i}} represent the coefficient of AxA_{x} in AyiA_{y_{i}}. Because Ay1...AynA_{y_{1}}...A_{y_{n}} are followed by a softmax transformation, if wxyiw_{xy_{i}} is the same for all yiy_{i} (that is, xx contributes equally to all yiy_{i}), then xx effectively has zero contribution to AtiA_{t_{i}}. This can be observed by substituting Ayi=wxyiAx+ryiA_{y_{i}}=w_{xy_{i}}A_{x}+r_{y_{i}} in the expression for AtiA_{t_{i}} and canceling out ewxyiAxe^{w_{xy_{i}}A_{x}} (here, ryir_{y_{i}} is the sum of all the remaining terms in the affine expression for AyiA_{y_{i}})

As mentioned in the previous subsection, in order to avoid attenuation of signal for highly confident predictions, we should compute CxyiC_{xy_{i}} rather than CxtiC_{xt_{i}}. One way to ensure that CxyiC_{xy_{i}} is zero if wxyiw_{xy_{i}} is the same for all yiy_{i} is to mean-normalized the weights as follows:

This transformation will not affect the output of the softmax, but will ensure that the DeepLIFT scores are zero when a particular node contributes equally to all softmax classes.

6 Weight normalization for constrained inputs

Let yy be a neuron with some subset of inputs SyS_{y} that are constrained such that xSyAx=c\sum_{x\in S_{y}}A_{x}=c (for example, one-hot encoded input satisfies the constraint xSyAx=1\sum_{x\in S_{y}}A_{x}=1, and a convolutional neuron operating on one-hot encoded rows has one constraint per column that it sees). Let the weights from xx to yy be denoted wxyw_{xy} and let byb_{y} be the bias of yy. It is advisable to use normalized weights wˉxy=wxyμ\bar{w}_{xy}=w_{xy}-\mu and bias bˉy=by+cμ\bar{b}_{y}=b_{y}+c\mu, where μ\mu is the mean over all wxyw_{xy}. We note that this maintains the output of the neural net because, for any constant μ\mu:

The normalization is desirable because, for affine functions, the multipliers mxym_{xy} are equal to the weights wxyw_{xy} and are thus sensitive to μ\mu. To take the example of a convolutional neuron operating on one-hot encoded rows: by mean-normalizing wxyw_{xy} for each column in the filter, one can ensure that the contributions CxyC_{xy} from some columns are not systematically overestimated or underestimated relative to the contributions from other columns.

Results

A model with the VGG16 (Long et al., 2015) architecture was trained using the Keras framework (Chollet, 2015) on a scaled-down version of the Imagenet dataset, dubbed ‘Tiny Imagenet’. The images were 64×6464\times 64 in dimension and belonged to one of 200 output classes. Results shown in Figure 2; the reference input was an input of all zeros after preprocessing.

2 Genomics

We apply DeepLIFT to models trained on genomic sequence. The positive class requires that the DNA patterns ’GATA’ and ’CAGATG’ appear in the length-200 sequence together. The negative class has only one of the two patterns appearing once or twice. Outside the core patterns (which were sampled from a generative model) we randomly sample the four bases A, C, G and T. A CNN was trained using the Keras framework (Chollet, 2015) on one-hot encoded sequences with 20 convolutional filters of length 15 and stride 1 and a max pool layer of width and stride 50, followed by two fully connected layers of size 200. PReLU nonlinearities were used for the hidden layers. This model performs well with auROC of 0.907. The misclassified examples primarily occur when one of the patterns erroneously arises in the randomly sampled background.

We then run DeepLIFT to assign an importance score to each base in the correctly predicted sequences. The reference input is an input of all zeros post weight-normalization (see 2.6) of the first convolutional layer (after weight normalization, the linear activation of a convolutional neuron for an input of all zeros is the bias, which is the same as the average activation across all four bases at each position). We compared the results to the gradient*input (Figure 3).

Discussion

Prevailing feature importance methods such as the saliency maps of Simonyan et al., the deconvolutional nets of Zeiler et al. and the guided backpropagation of Springenberg et al. are variants of computing gradients. As shown in Figure 1, this can give misleading results when the local gradient is zero. DeepLIFT instead considers the deviation from a neuron’s reference activity. This makes it capable of handling RNN memory units gated by activations that have vanishing gradients (eg: sigmoid, tanh).

Layer-wise Relevance Propagation (LRP), proposed by Bach et al., does not obviously rely on gradients; however, as we show, if all activations are piecewise linear, LRP reduces to gradient*input (a first-order Taylor approximation of the change in output if the input is set to zero). If all reference activations are zero (as happens when all bias terms are zero and all reference input values are zero), DeepLIFT and LRP give similar results (except that by computing contributions using multipliers, DeepLIFT circumvents the numerical stability problems that LRP faces). In practice, biases are often non-zero, which is why DeepLIFT produces superior results (Figures 2 & 3).

We show when all activations are piecewise linear and bias terms are included in the calculation, the Layer-wise Relevance Propagation (LRP) of Bach et al., reduces to gradient*input. We refer to Samek et al. (2015) for the concise description of LRP:

Unpooling: “The backwards signal is redirected proportionally onto the location for which the activation was recorded in the forward pass”: This is trivially the same as gradient*input, because the gradient*input will be zero for all locations which do not activation the pooling layer, and equal to the output for the location that does.

Filtering: We consider the first rule described in Samek et al., where zij=ai(l)wij(l,l+1)z_{ij}=a_{i}^{(l)}w_{ij}^{(l,l+1)} is the weighted activation of neuron ii onto neuron jj in the next layer, and ll is the index of the layer:

The term involving ϵ\epsilon is included to avoid issues of numerical instability when izij\sum_{i}^{\prime}z_{i^{\prime}j} is near zero. The second rule described in Samek et al. is another variant designed to address the problem of numerical instability. We show that gradient*input gives the exact result as ϵ0\epsilon\rightarrow 0 (i.e. it solves the issue of numerical instability altogether).

Dropping the term for ϵ\epsilon and substituting zij=ai(l)wij(l,l+1)z_{ij}=a_{i}^{(l)}w_{ij}^{(l,l+1)}, we have:

Assuming the bias term is included (which would be necessary for the conservation property described in Bach et al. to hold), the denominator is simply the activation of neuron jj, i.e.:

Let us now consider what happens when there are two filtering operations applied sequentially. Let RikR_{ik} denote the relevance inherited by neuron ii in layer ll from neuron kk in layer l+2l+2, passing through the neurons in layer l+1l+1. We have:

Thus, we see that denominator aj(l+1)a_{j}^{(l+1)} for the intermediate layer cancelled out, leaving us with ai(l)wij(l,l+1)wjk(l+1,l+2)a_{i}^{(l)}w_{ij}^{(l,l+1)}w_{jk}^{(l+1,l+2)}, where wij(l,l+1)wjk(l+1,l+2)w_{ij}^{(l,l+1)}w_{jk}^{(l+1,l+2)} is the gradient of ak(l+1)a_{k}^{(l+1)} with respect to ai(l)a_{i}^{(l)}. The only term left in the denominator is the activation of the last layer, ak(l+1)a_{k}^{(l+1)}; if we set the relevance of neurons in the final layer to be equal to their own activation, then Rk(l+2)R_{k}^{(l+2)} (assuming kk is the last layer) would cancel out ak(l+1)a_{k}^{(l+1)} in the denominator, leaving us with:

Which is simply equal to the activation ai(l)a_{i}^{(l)} multiplied by the gradient of aka_{k} with respect to ai(l)a_{i}^{(l)}. In situations where the relevance of the last layer is not the same as its activation (which may happen if there is a nonlinear transformation such as a sigmoid, as a sigmoid output of 0.5 occurs when the input is 0), one can simply compute gradient*input with respect to the linear term before the final nonlinearity (which is what we did; for softmax layers, we apply the normalization described in 2.5).

Nonlinearity: “The backward signal is simply propagated onto the lower layer, ignoring the rectification operation”: While this is not obviously the same as gradient*input, it should be noted that when a rectified linear unit is inactive, it has an activation of zero and the rule for filtering (described above) would assign it zero importance. Furthermore, when the rectified linear unit is active, its gradient is 1. Thus, when the unit is inactive, gradient*input is 0 and LRP assigns 0 signal; when a unit is active, gradient*input is equal to the output and LRP assigns all signal. The two approaches converge.

Author contributions

AS & PG conceived of DeepLIFT. AS implemented DeepLIFT in software. PG led application to genomics. AYS led application to Tiny Imagenet. AK provided guidance and feedback. AS, PG, AYS & AK prepared the manuscript.

References