Learning to Explain: An Information-Theoretic Perspective on Model Interpretation

Jianbo Chen, Le Song, Martin J. Wainwright, Michael I. Jordan

Introduction

Interpretability is an extremely important criterion when a machine learning model is applied in areas such as medicine, financial markets, and criminal justice (e.g., see the discussion paper by Lipton ((Lipton, 2016)), as well as references therein). Many complex models, such as random forests, kernel methods, and deep neural networks, have been developed and employed to optimize prediction accuracy, which can compromise their ease of interpretation.

In this paper, we focus on instancewise feature selection as a specific approach for model interpretation. Given a machine learning model, instancewise feature selection asks for the importance scores of each feature on the prediction of a given instance, and the relative importance of each feature are allowed to vary across instances. Thus, the importance scores can act as an explanation for the specific instance, indicating which features are the key for the model to make its prediction on that instance. A related concept in machine learning is feature selection, which selects a subset of features that are useful to build a good predictor for a specified response variable (Guyon & Elisseeff, 2003). While feature selection produces a global importance of features with respect to the entire labeled data set, instancewise feature selection measures feature importance locally for each instance labeled by the model.

Existing work on interpreting models approach the problem from two directions. The first line of work computes the gradient of the output of the correct class with respect to the input vector for the given model, and uses it as a saliency map for masking the input (Simonyan et al., 2013; Springenberg et al., 2014). The gradient is computed using a Parzen window approximation of the original classifier if the original one is not available (Baehrens et al., 2010). Another line of research approximates the model to be interpreted via a locally additive model in order to explain the difference between the model output and some “reference” output in terms of the difference between the input and some “reference” input (Bach et al., 2015; Kindermans et al., 2016; Ribeiro et al., 2016; Lundberg & Lee, 2017; Shrikumar et al., 2017). Ribeiro et al. (2016) proposed the LIME, methods which randomly draws instances from a density centered at the sample to be explained, and fits a sparse linear model to predict the model outputs for these instances. Shrikumar et al. (2017) presented DeepLIFT, a method designed specifically for neural networks, which decomposes the output of a neural network on a specific input by backpropagating the contribution back to every feature of the input. Lundberg & Lee (2017) used Shapley values to quantify the importance of features of a given input, and proposed a sampling based method “kernel SHAP” for approximating Shapley values. Essentially, the two directions both approximate the model locally via an additive model, with different definitions of locality. While the first one considers infinitesimal regions on the decision surface and takes the first-order term in the Taylor expansion as the additive model, the second one considers the finite difference between an input vector and a reference vector.

In this paper, our approach to instancewise feature selection is via mutual information, a conceptually different perspective from existing approaches. We define an “explainer,” or instancewise feature selector, as a model which returns a distribution over the subset of features given the input vector. For a given instance, an ideal explainer should assign the highest probability to the subset of features that are most informative for the associated model response. This motivates us to maximize the mutual information between the selected subset of features and the response variable with respect to the instancewise feature selector. Direct estimation of mutual information and discrete feature subset sampling are intractable; accordingly, we derive a tractable method by first applying a variational lower bound for mutual information, and then developing a continuous reparametrization of the sampling distribution.

At a high level, the primary differences between our approach and past work are the following. First, our framework globally learns a local explainer, and therefore takes the distribution of inputs into consideration. Second, our framework removes the constraint of local feature additivity on an explainer. These distinctions enable our framework to yield a more efficient, flexible, and natural approach for instancewise feature selection. In summary, our contributions in this work are as follows (see also Table 1 for systematic comparisons):

We propose an information-based framework for instancewise feature selection.

We introduce a learning-based method for instancewise feature selection, which is both efficient and model-agnostic.

Furthermore, we show that the effectiveness of our method on a variety of synthetic and real data sets using both quantitative metric and human evaluation on Amazon Mechanical Turk.

A framework

Our method is derived from considering the mutual information between a particular pair of random vectors, so we begin by providing some basic background. Given two random vectors $X$ and $Y$ , the mutual information $I(X;Y)$ is a measure of dependence between them; intuitively, it corresponds to how much knowledge of one random vector reduces the uncertainty about the other. More precisely, the mutual information is given by the Kullback-Leibler divergence of the product of marginal distributions of $X$ and $Y$ from the joint distribution of $X$ and $Y$ (Cover & Thomas, 2012); it takes the form

where $p_{XY}$ and $p_{X},p_{Y}$ are the joint and marginal probability densities if $X,Y$ are continuous, or the joint and marginal probability mass functions if they are discrete. The expectation is taken with respect to the joint distribution of $X$ and $Y$ . One can show the mutual information is nonnegative and symmetric in two random variables. The mutual information has been a popular criteria in feature selection, where one selects the subset of features that approximately maximizes the mutual information between the response variable and the selected features (Gao et al., 2016; Peng et al., 2005). Here we propose to use mutual information as a criteria for instancewise feature selection.

2 How to construct explanations

We have thus defined a new random vector $X_{S}\in{}^{k}$ ; see Figure 1 for a probabilistic graphical model representing its construction. We formulate instancewise feature selection as seeking explainer that optimizes the criterion

In words, we aim to maximize the mutual information between the response variable from the model and the selected features, as a function of the choice of selection rule.

Proposed method

We now describe the steps taken to obtain a tractable variational formulation.

Mutual information between $X_{S}$ and $Y$ can be expressed in terms of the conditional distribution of $Y$ given $X_{S}$ :

When $X$ contains discrete features, we embed each discrete feature with a vector, and the vector representing a specific feature is set to zero simultaneously when the corresponding feature is not in $S$ .

2 Continuous relaxation of subset sampling

Direct estimation of the objective function in equation (4) requires summing over ${d\choose k}$ combinations of feature subsets after the variational approximation. Several tricks exist for tackling this issue, like REINFORCE-type Algorithms (Williams, 1992), or weighted sum of features parametrized by deterministic functions of $X$ . (A similar concept to the second trick is the “soft attention” structure in vision (Ba et al., 2014) and NLP (Bahdanau et al., 2014) where the weight of each feature is parametrized by a function of the respective feature itself.) We employ an alternative approach generalized from Concrete Relaxation (Gumbel-softmax trick) (Jang et al., 2017; Maddison et al., 2014, 2016), which empirically has a lower variance than REINFORCE and encourages discreteness (Raffel et al., 2017).

We add the random perturbation to the log probability of each category and take a temperature-dependent softmax over the $d$ -dimensional vector:

The resulting random vector $C=(C_{1},\dots,C_{d})$ is called a Concrete random vector, which we denote by

We apply the Gumbel-softmax trick to approximate weighted subset sampling. We would like to sample a subset $S$ of $k$ distinct features out of the $d$ dimensions. The sampling scheme for $S$ can be equivalently viewed as sampling a $k$ -hot random vector $Z$ from $D^{d}_{k}:\,=\{z\in\{0,1\}^{d}\mid\sum z_{i}=k\}$ , with each entry of $z$ being one if it is in the selected subset $S$ and being zero otherwise. An importance score which depends on the input vector is assigned for each feature. Concretely, we define $w_{\theta}\colon{}^{d}\to{}^{d}$ that maps the input to a $d$ -dimensional vector, with the $i$ th entry of $w_{\theta}(X)$ representing the importance score of the $i$ th feature.

We start with approximating sampling $k$ distinct features out of $d$ features by the sampling scheme below: Sample a single feature out of $d$ features independently for $k$ times. Discard the overlapping features and keep the rest. Such a scheme samples at most $k$ features, and is easier to approximate by a continuous relaxation. We further approximate the above scheme by independently sampling $k$ independent Concrete random vectors, and then we define a $d$ -dimensional random vector $V$ that is the elementwise maximum of $C^{1},C^{2},\dots,C^{k}$ :

The random vector $V$ is then used to approximate the $k$ -hot random vector $Z$ during training.

3 The final objective and its optimization

After having applied the continuous approximation of feature subset sampling, we have reduced Problem (4) to the following:

where $g_{\alpha}$ denotes the neural network used to approximate the model conditional distribution, and the quantity $\theta$ is used to parametrize the explainer. In the case of classification with $c$ classes, we can write

4 The explaining stage

During the explaining stage, the learned explainer maps each sample $X$ to a weight vector $w_{\theta}(X)$ of dimension $d$ , each entry representing the importance of the corresponding feature for the specific sample $X$ . In order to provide a deterministic explanation for a given sample, we rank features according to the weight vector, and the $k$ features with the largest weights are picked as the explaining features.

For each sample, only a single forward pass through the neural network parametrizing the explainer is required to yield explanation. Thus our algorithm is much more efficient in the explaining stage compared to other model-agnostic explainers like LIME or Kernel SHAP which require thousands of evaluations of the original model per sample.

Experiments

We carry out experiments on both synthetic and real data sets. For all experiments, we use RMSprop (Maddison et al., 2016) with the default hyperparameters for optimization. We also fix the step size to be $0.001$ across experiments. The temperature for Gumbel-softmax approximation is fixed to be $0.1$ . Codes for reproducing the key results are available online at https://github.com/Jianbo-Lab/L2X.

We begin with experiments on four synthetic data sets:

$2$ -dimensional XOR as binary classification. The input vector $X$ is generated from a $10$ -dimensional standard Gaussian. The response variable $Y$ is generated from $P(Y=1|X)\propto\exp\{X_{1}X_{2}\}$ .

Orange Skin. The input vector $X$ is generated from a $10$ -dimensional standard Gaussian. The response variable $Y$ is generated from $P(Y=1|X)\propto\exp\{\sum_{i=1}^{4}X_{i}^{2}-4\}$ .

Nonlinear additive model. Generate $X$ from a 10-dimensional standard Gaussian. The response variable $Y$ is generated from $P(Y=1|X)\propto\exp\{-100\sin(2X_{1})+2|X_{2}|+X_{3}+\exp\{-X_{4}\}\}$ .

Switch feature. Generate $X_{1}$ from a mixture of two Gaussians centered at $\pm 3$ respectively with equal probability. If $X_{1}$ is generated from the Gaussian centered at $3$ , the $2-5$ th dimensions are used to generate $Y$ like the orange skin model. Otherwise, the $6-9th$ dimensions are used to generate $Y$ from the nonlinear additive model.

The first three data sets are modified from commonly used data sets in the feature selection literature (Chen et al., 2017). The fourth data set is designed specifically for instancewise feature selection. Every sample in the first data set has the first two dimensions as true features, where each dimension itself is independent of the response variable $Y$ but the combination of them has a joint effect on $Y$ . In the second data set, the samples with positive labels centered around a sphere in a four-dimensional space. The sufficient statistic is formed by an additive model of the first four features. The response variable in the third data set is generated from a nonlinear additive model using the first four features. The last data set switches important features (roughly) based on the sign of the first feature. The $1-5$ features are true for samples with $X_{1}$ generated from the Gaussian centered at $-3$ , and the $1,6-9$ features are true otherwise.

We compare our method L2X (for “Learning to Explain”) with several strong existing algorithms for instancewise feature selection, including Saliency (Simonyan et al., 2013), DeepLIFT (Shrikumar et al., 2017), SHAP (Lundberg & Lee, 2017), LIME (Ribeiro et al., 2016). Saliency refers to the method that computes the gradient of the selected class with respect to the input feature and uses the absolute values as importance scores. SHAP refers to Kernel SHAP. The number of samples used for explaining each instance for LIME and SHAP is set as default for all experiments. We also compare with a method that ranks features by the input feature times the gradient of the selected class with respect to the input feature. Shrikumar et al. (2017) showed it is equivalent to LRP (Bach et al., 2015) when activations are piecewise linear, and used it in Shrikumar et al. (2017) as a strong baseline. We call it “Taylor” as it is the first-order Taylor approximation of the model.

Our experimental setup is as follows. For each data set, we train a neural network model with three hidden dense layers. We can safely assume the neural network has successfully captured the important features, and ignored noise features, based on its error rate. Then we use Taylor, Saliency, DeepLIFT, SHAP, LIME, and L2X for instancewise feature selection on the trained neural network models. For L2X, the explainer is a neural network composed of two hidden layers. The variational family is composed of three hidden layers. All layers are linear with dimension $200$ . The number of desired features $k$ is set to the number of true features.

The underlying true features are known for each sample, and hence the median ranks of selected features for each sample in a validation data set are reported as a performance metric, the box plots of which have been plotted in Figure 3. We observe that L2X outperforms all other methods on nonlinear additive and feature switching data sets. On the XOR model, DeepLIFT, SHAP and L2X achieve the best performance. On the orange skin model, all algorithms have near optimal performance, with L2X and LIME achieving the most stable performance across samples.

We also report the clock time of each method in Figure 2, where all experiments were performed on a single NVidia Tesla k80 GPU, coded in TensorFlow. Across all the four data sets, SHAP and LIME are the least efficient as they require multiple evaluations of the model. DeepLIFT, Taylor and Saliency requires a backward pass of the model. DeepLIFT is the slowest among the three, probably due to the fact that backpropagation of gradients for Taylor and Saliency are built-in operations of TensorFlow, while backpropagation in DeepLIFT is implemented with high-level operations in TensorFlow. Our method L2X is the most efficient in the explanation stage as it only requires a forward pass of the subset sampler. It is much more efficient compared to SHAP and LIME even after the training time has been taken into consideration, when a moderate number of samples (10,000) need to be explained. As the scale of the data to be explained increases, the training of L2X accounts for a smaller proportion of the over-all time. Thus the relative efficiency of L2X to other algorithms increases with the size of a data set.

2 IMDB

The Large Movie Review Dataset (IMDB) is a dataset of movie reviews for sentiment classification (Maas et al., 2011). It contains $50,000$ labeled movie reviews, with a split of $25,000$ for training and $25,000$ for testing. The average document length is $231$ words, and $10.7$ sentences. We use L2X to study two popular classes of models for sentiment analysis on the IMDB data set.

Convolutional neural networks (CNN) have shown excellent performance for sentiment analysis (Kim, 2014; Zhang & Wallace, 2015). We use a simple CNN model on Keras (Chollet et al., 2015) for the IMDB data set, which is composed of a word embedding of dimension $50$ , a $1$ -D convolutional layer of kernel size $3$ with $250$ filters, a max-pooling layer and a dense layer of dimension $250$ as hidden layers. Both the convolutional and the dense layers are followed by ReLU as nonlinearity, and Dropout (Srivastava et al., 2014) as regularization. Each review is padded/cut to $400$ words. The CNN model achieves $90\%$ accuracy on the test data, close to the state-of-the-art performance (around $94\%$ ). We would like to find out which $k$ words make the most influence on the decision of the model in a specific review. The number of key words is fixed to be $k=10$ for all the experiments.

The explainer of L2X is composed of a global component and a local component (See Figure 2 in Yang et al. (2018)). The input is initially fed into a common embedding layer followed by a convolutional layer with $100$ filters. Then the local component processes the common output using two convolutional layers with $50$ filters, and the global component processes the common output using a max-pooling layer followed by a $100$ -dimensional dense layer. Then we concatenate the global and local outputs corresponding to each feature, and process them through one convolutional layer with $50$ filters, followed by a Dropout layer (Srivastava et al., 2014). Finally a convolutional network with kernel size $1$ is used to yield the output. All previous convolutional layers are of kernel size 3, and ReLU is used as nonlinearity. The variational family is composed of an word embedding layer of the same size, followed by an average pooling and a $250$ -dimensional dense layer. Each entry of the output vector $V$ from the explainer is multiplied with the embedding of the respective word in the variational family. We use both automatic metrics and human annotators to validate the effectiveness of L2X.

When designing human experiments, we assume that the key words convey an attitude toward a movie, and can thus be used by a human to infer the review sentiment. This assumption has been partially validated given the aligned outcomes provided by post-hoc accuracy and by human judges, because the alignment implies the consistency between the sentiment judgement based on selected words from the original model and that from humans. Based on this assumption, we ask humans on Amazon Mechanical Turk (AMT) to infer the sentiment of a review given the ten key words selected by each explainer. The words adjacent to each other, like “not good at all,” keep their adjacency on the AMT interface if they are selected simultaneously. The reviews from different explainers have been mixed randomly, and the final sentiment of each review is averaged over the results of multiple human annotators. We measure whether the labels from human based on selected words align with the labels provided by the model, in terms of the average accuracy over $500$ reviews in the test data set. Some reviews are labeled as “neutral” based on selected words, which is because the selected key words do not contain sentiment, or the selected key words contain comparable numbers of positive and negative words. Thus these reviews are neither put in the positive nor in the negative class when we compute accuracy. We call this metric human accuracy.

The result is reported in Table 4. We observe that the model prediction based on only ten words selected by L2X align with the original prediction for over $90\%$ of the data. The human judgement given ten words also aligns with the model prediction for $84.4\%$ of the data. The human accuracy is even higher than that based on the original review, which is $83.3\%$ (Yang et al., 2018). This indicates the selected words by L2X can serve as key words for human to understand the model behavior. Table 2 shows the results of our model on four examples.

2.2 Explaining hierarchical LSTM

Another competitive class of models in sentiment analysis uses hierarchical LSTM (Hochreiter & Schmidhuber, 1997; Li et al., 2015). We build a simple hierarchical LSTM by putting one layer of LSTM on top of word embeddings, which yields a representation vector for each sentence, and then using another LSTM to encoder all sentence vectors. The output representation vector by the second LSTM is passed to the class distribution via a linear layer. Both the two LSTMs and the word embedding are of dimension $100$ . The word embedding is pretrained on a large corpus (Mikolov et al., 2013). Each review is padded to contain $15$ sentences. The hierarchical LSTM model gets around 90% accuracy on the test data. We take each sentence as a single feature group, and study which sentence is the most important in each review for the model.

The explainer of L2X is composed of a $100$ -dimensional word embedding followed by a convolutional layer and a max pooling layer to encode each sentence. The encoded sentence vectors are fed through three convolutional layers and a dense layer to get sampling weights for each sentence. The variational family also encodes each sentence with a convolutional layer and a max pooling layer. The encoding vectors are weighted by the output of the subset sampler, and passed through an average pooling layer and a dense layer to the class probability. All convolutional layers are of filter size $150$ and kernel size $3$ . In this setting, L2X can be interpreted as a hard attention model (Xu et al., 2015) that employs the Gumbel-softmax trick.

Comparison is carried out with the same metrics. For human accuracy, one selected sentence for each review is shown to human annotators. The other experimental setups are kept the same as above. We observe that post-hoc accuracy reaches $84.4\%$ with one sentence selected by L2X, and human judgements using one sentence align with the original model prediction for $77.4\%$ of data. Table 3 shows the explanations from our model on four examples.

3 MNIST

The MNIST data set contains $28\times 28$ images of handwritten digits (LeCun et al., 1998). We form a subset of the MNIST data set by choosing images of digits $3$ and $8$ , with $11,982$ images for training and $1,984$ images for testing. Then we train a simple neural network for binary classification over the subset, which achieves accuracy $99.7\%$ on the test data set. The neural network is composed of two convolutional layers of kernel size $5$ and a dense linear layer at last. The two convolutional layers contains $8$ and $16$ filters respectively, and both are followed by a max pooling layer of pool size $2$ . We try to explain each sample image with $k=4$ image patches on the neural network model, where each patch contains $4\times 4$ pixels, obtained by dividing each $28\times 28$ image into $7\times 7$ patches. We use patches instead of raw pixels as features for better visualization.

We parametrize the explainer and the variational family with three-layer and two-layer convolutional networks respectively, with max pooling added after each hidden layer. The $7\times 7$ vector sampled from the explainer is upsampled (with repetition) to size $28\times 28$ and multiplied with the input raw pixels.

We use only the post-hoc accuracy for experiment, with results shown in Table 4. The predictions based on 4 patches selected by L2X out of 49 align with those from original images for $95.8\%$ of data. Randomly selected examples with explanations are shown in Figure 4. We observe that L2X captures most of the informative patches, in particular those containing patterns that can distinguish 3 and 8.

Conclusion

We have proposed a framework for instancewise feature selection via mutual information, and a method L2X which seeks a variational approximation of the mutual information, and makes use of a Gumbel-softmax relaxation of discrete subset sampling during training. To our best knowledge, L2X is the first method to realize real-time interpretation of a black-box model. We have shown the efficiency and the capacity of L2X for instancewise feature selection on both synthetic and real data sets.

Acknowledgements

L.S. was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA and Amazon AWS. We thank Nilesh Tripuraneni for comments about the Gumbel trick.

Appendix A Proof of Theorem 1

Any explanation is represented as a conditional distribution of the feature subset over the input vector. Given the definition of $S^{*}$ , we have for any $X$ , and any explanation $\mathcal{E}:S|X$ ,

In the case when $S^{*}(X)$ is a set instead of a singleton, we identify $S^{*}(X)$ with any distribution that assigns arbitrary probability to each elements in $S^{*}(X)$ with zero probability outside $S^{*}(X)$ . With abuse of notation, $S^{*}$ indicates both the set function that maps every $X$ to a set $S^{*}(X)$ and any real-valued function that maps $X$ to an element in $S^{*}(X)$ .

The reverse direction is proved by contradiction. Assume the optimal explanation $P(S|X)$ is such that there exists a set $M$ of nonzero probability, over which $P(S|X)$ does not degenerates to an element in $S^{*}(X)$ . Concretely, we define $M$ as

where $S^{*}(x)$ is a deterministic function in the set of distributions that assign arbitrary probability to each elements in $S^{*}(x)$ with zero probability outside $S^{*}(x)$ . Outside $M$ , we always have

from the definition of $S^{*}$ . As $M$ is of nonzero size over $P(X)$ , combining Equation A and Equation A and taking expectation with respect to $P(X)$ , we have