Interpreting Black Box Predictions using Fisher Kernels

Rajiv Khanna, Been Kim, Joydeep Ghosh, Oluwasanmi Koyejo

Introduction

It has long been established that using examples to enable interpretability is one of the most effective approaches for human learning and understanding . The ability to interpret using examples from the data can lead to more informed decision based systems and a better understanding of the inner workings of the model . In this work, we are interested in finding data points or prototypes that are “most responsible” for the underlying model making specific predictions of interest. To this end, we develop a novel method that is model agnostic and only requires an access to the function and gradient oracles.

In a more formal sense, we aim to approximate the empiricial test data distribution using samples from the training data. Our approach is to first embed all the points in the space induced by the Fisher kernels . This provides a principled way to quantify closeness of two points with respect to the similarity induced by the trained model. If two points in this space are close, then intuitively the model treats them similarly. We formally show that influence function based approach to interpretability is essentially doing the same thing.

Thus, our goal is to find a subset of the training data such that, when also embedded in a model-induced space, is close to the test set in the distribution sense. We build this subset from the training data sequentially using a greedy method called Sequential Bayesian Quadrature (SBQ) . SBQ is an importance-sampling based algorithm to estimate the expected value of a function under a distribution using discrete sample points drawn from it. To the best of our knowledge SBQ has not been used in conjunction with Fisher kernels for interpretability. Moreover, we leverage recent research in discrete optimization to provide novel convergence rates for the algorithm over discrete atomic sets. Our analysis also yields novel and more scalable algorithm variants of SBQ with corresponding constant factor guarantees.

We propose a novel method to select salient training data points that explain test set predictions for black box models.

To solve the resulting combinatorial problem, we develop new faster convergence guarantees for greedy Sequential Bayesian Quadrature on discrete candidate sets. One novel insight that results is the applicability of more scalable algorithm variants for SBQ with provable bounds. These theoretical insights may be of independent interest.

We recover the influence function based approach of Koh and Liang as a special case. This connection again yields several novel insights about using influence functions for model interpretation and training side adversarial attacks. Most importantly, we establish the importance of the Fisher space for robust learning that can hopefully lead to promising future research directions.

To highlight the practical impact of the our interpretability framework, we present its application to three different real world use-cases.

Related work: There has been a lot of interest lately in model interpretation in various ways and their corresponding applications. Thus, we focus our related work on the subset of most closely related research. Our approach has a similar motivation as Koh and Liang , who proposed the use of influence functions for finding the most influential training data point for a test data point prediction. The intuition revolves around infinitesimally perturbing the training data point and evaluating the corresponding impact on the test point. The method is only designed for single data points – thus their extension to selecting multiple data points required an unmotivated heuristic approach. A complementary line of research revolves around feature based interpretation of models. Instead of focusing on choosing representative data points, the goal is to reveal which features are important for the prediction Ribeiro et al. . Recently, Kim et al. also made use of the unweighted MMD function to propose selection of prototypes and criticisms. While their approach can be used for exploratory analysis of the data, it has not been extended for explaining a model. Their focus, moreover, is on the use of criticisms in addition to examples as a vital component of exploring datasets.

Fisher kernels were proposed to exploit the implicit embedding of a structured object in a generative model for discriminative purposes , and have since been applied successfully in a variety of applications . The goal is to design a kernel for generative models of structured objects that captures the “similarity" for the said objects in the corresponding embedding space. The kernel itself can then be used out of the box in discriminative models such as Support vector machines.

Background

In this section, we provide an overview of the technical background required for our setup. We begin by fixing some notation. We represent sets using sans script fonts e.g. $\mathsf{A},\mathsf{B}$ . Vectors are represented using lower case bold letters e.g. ${\mathbf{x}},{\mathbf{y}}$ , and matrices are represented using upper case bold letters e.g. ${\mathbf{X}},{\mathbf{Y}}$ . Non-bold face letters are used for scalars e.g. $j,M,r$ and function names e.g. $f(\cdot)$ .

The notion of similarity that Fisher kernels employ is that if two objects are structurally similar, then slight perturbations in the neighborhood of the fitted parameters $\hat{\theta}:=\arg\max\log p({\mathbf{X}}|\theta)$ , would impact the fit of the two objects similarly. In other words, the feature embedding ${\mathbf{f}}_{i}:=\frac{\partial\log p({\mathbf{X}}_{i}|\theta)}{\partial\theta}|_{\theta=\hat{\theta}}$ , for an object ${\mathbf{X}}_{i}\rightarrow{\mathbf{f}}_{i}$ can be interpreted as a feature mapping which can then be used to define a similarity kernel by a weighted dot product:

While appropriate feature mapping is crucial for predictive tasks, we observe that it is also is vital for interpretability. Fisher kernels are ideal for our task because they seamlessly extract model-induced data similarity from trained model that we wish to interpret. To further motivate that such a task can not be trivially performed by a something like a parameter sweep over RBF kernels i.e. without supervision, we perform a simple toy experiment illustrated in Figure 1.

2 Bayesian Quadrature

where $w_{i}$ are the weights associated with function evaluations at ${\mathbf{x}}_{i}$ . Using $w_{i}=\nicefrac{{1}}{{n}}$ and randomly sampling ${\mathbf{x}}_{i}$ recovers the standard Monte Carlo integration. Other methods include kernel herding and quasi-Monte carlo , both of which use $w_{i}=\nicefrac{{1}}{{n}}$ but use specific schemes to draw ${\mathbf{x}}_{i}$ . Bayesian quadrature allows one to consider a non-uniform $w_{i}$ given a functional prior for $f(\cdot)$ . The samples ${\mathbf{x}}_{i}$ can then be chosen as the ones that minimize the posterior variance as we shall see in the sequel. The corresponding weights can be calculated directly from the posterior mean. We impose a Gaussian Process prior on the function as $f\sim\text{GP}(0,k)$ with a kernel function $k(\cdot,\cdot)$ . The algorithm SBQ proceeds as follows. Say we have already chosen $n$ points: ${\mathbf{x}}_{i},i\in[n]$ . The posterior of $f$ given the evaluations $f({\mathbf{x}}_{i})$ has the mean function:

where ${\mathbf{f}}$ is the vector of function evaluations $f({\mathbf{x}}_{i})$ , ${\mathbf{k}}$ is the vector of kernel evaluations $k({\mathbf{x}},{\mathbf{x}}_{i})$ , and ${\mathbf{K}}$ is the kernel matrix with ${\mathbf{K}}_{ij}:=k({\mathbf{x}}_{i},{\mathbf{x}}_{j})$ .

We now focus on sampling the points ${\mathbf{x}}_{i}$ . The quadrature estimate provides not only the mean, but the full distribution as its posterior. The posterior variance can be written as:

We can write the variance of $Z(\mathsf{S}_{n})$ as:

The algorithm Sequential Bayesian Quadrature (SBQ) samples for the points ${\mathbf{x}}_{i}$ in a greedy fashion with the goal of minimizing the posterior variance of the computed approximate integral:

Prototype Selection using Fisher Kernels

where $p({\mathbf{x}})$ is the data distribution. Since we usually do not have access to the true data distribution, $p({\mathbf{x}})$ is typically the empirical data distribution $p({\mathbf{x}})=\frac{1}{n}\delta({\mathbf{x}})$ , where $\delta(\cdot)$ is $1$ if ${\mathbf{x}}$ exists in the dataset, and otherwise, and $n$ is the size of the dataset. Our goal in this work is to approximate the integral (3) over the test or validation set (which specifies the distribution $p$ for us) using a weighted sum of a few points from the training dataset (1). Note that while the training samples in general have measure 0 in the test or the validation set distribution in the euclidean space, the smoothening GP prior over the embedding space still allows for samples to be generated from the former to approximate the latter.

For the kernel function in the GP prior in Bayesian Quadrature, we use the Fisher kernel of the trained parametric model. SBQ selection strategy inherently establishes a trade off between selecting data points that are representative of the parametric fit and diversity of the selected points. To see this, consider the SBQ cost function (2). At every new selection ${\mathbf{x}}_{j+1}$ , one one hand, the cost function rewards the selection of data points which are clustered closer together in the feature mapping space to increase the value of ${\mathbf{z}}$ which in turn decreases variance. However, on the other hand, selecting points close to each other decreases the eigenvalues of ${\mathbf{K}}^{-1}$ thereby increasing variance . Thus, the SBQ seeks a tradeoff between these terms.

In this section, we provide a practical greedy algorithm to select representative prototypes using SBQ to optimize (2). Note that the first term is constant w.r.t to $\mathsf{S}_{n}$ . Moreover, $p({\mathbf{x}})=\frac{1}{n}\delta({\mathbf{x}})$ . Thus, we can re-write ${\mathbf{z}}_{i}=\frac{1}{n}\sum_{j=1}^{n}k({\mathbf{x}}_{i},{\mathbf{x}}_{j})$ for each $i$ in training and each $j$ in the test set. This can be pre-computed by a row or column sum over the kernel of the entire dataset in $O(nt)$ time and stored as vector of size $t$ to speed up later computation, where $t$ is the size of the training set and $n$ is the size of the test set. Our greedy cost function at step $j+1$ is thus:

The solution set is then updated as $\mathsf{S}_{j+1}=\mathsf{S}_{j}\cup\{i^{\star}_{j+1}\}$ . The optimization (4) requires an inverse of the kernel matrix of already selected data points which can be computationally expensive. However, we can use the following result from linear algebra about block matrix inverses to speed up operations.

For an invertible matrix ${\mathbf{A}}$ , a column vector ${\mathbf{b}}$ , and a scalar $c$ , let $d=c-{\mathbf{b}}^{\top}{\mathbf{A}}^{-1}{\mathbf{b}}$ , then

Proposition 1 allows us to build the inverse of the kernel ${\mathbf{K}}$ in (4) greedily. The full algorithm is presented in Algorithm 1.

Algorithm 1 obviates the need for taking explicit inverses and only requires an oracle access to the kernel function. The algorithm itself is inherently embarrassingly parallelizable over multiple cores. We study guarantees for the algorithm in Section 4 which also motivates its more scalable variants.

Analysis

The greedy algorithm described in Algorithm 1 while being simple also has interesting optimization guarantees that make it attractive to use in practice. In this section, we provide convergence guarantees for the cost function (2) as $n$ increases. Typically for functions like these in the general case, the candidate set of atoms used to build the approximation is uncountably infinite - any possible sample from the underlying density is a candidate. As such, the convergence results are based on using Frank-Wolfe analysis on the marginal polytope . However, for us the underlying set of candidate atoms are discrete points, which are at worst countably infinite. As such, for this special case, it is worth analyzing if we can provide better rates than the general available guarantees. It turns out that this is indeed possible. We are able to leverage recent research in discrete optimization to indeed provide a linear convergence rate for the forward greedy algorithm.

Recall our set optimization function (from (4)) is:

where $n$ is the set of candidate training data points. We write $\mu_{p}:=\iint k({\mathbf{x}},{\mathbf{y}})p({\mathbf{x}})p({\mathbf{y}})d{\mathbf{x}}d{\mathbf{y}}$ . For the RKHS induced by the kernel $\mathcal{H}$ , we can equivalently re-write the cost function as :

For a matrix ${\mathbf{A}}$ , the smallest (largest) $k$ -sparse eigenvalues is min (max) of $\frac{{\mathbf{x}}^{\top}{\mathbf{A}}{\mathbf{x}}}{{\mathbf{x}}^{\top}{\mathbf{x}}}$ under the constraints $\|{\mathbf{x}}\|_{0}\leq k$ , and ${\mathbf{x}}\neq 0$ . Note that we can write $v(\emptyset)=\mu_{p}$ . We present our convergence guarantee next.

Say $\mathcal{H}$ is finite dimensional and has bounded norm i.e. $\forall\nu\in\mathcal{H}$ , $\|\nu\|_{\mathcal{H}}<\infty$ . Let $m$ be the smallest $2r$ sparse eigenvalue and $M$ be the largest $r+1$ -sparse eigenvalues of the kernel matrix ${\mathbf{K}}$ of the training set. If $\mathsf{S}_{G}$ of size $k$ is the set returned by Algorithm 1 and $\mathsf{S}^{\star}$ of size $r$ is the optimal solution of (6), then if $k\geq\frac{M}{m}r\log\frac{1}{\epsilon}$ , $v(\mathsf{S}_{G})-v(\mathsf{S}^{\star})\leq\epsilon(v(\emptyset)-v(\mathsf{S}^{\star}))$ .

Theorem 2 provides exponential convergence for the cost function $v(\cdot)$ . For the same objective, using Frank-Wolfe on the marginal polytope, the best known guarantees in the most general case are $O(1/T)$ for finite dimensional bounded Hilbert spaces . In the special case when the optimum lies in the relative interior, we do get faster exponential convergence. Theorem 2 provides an alternative condition that is sufficient for exponential convergence for the case when the optimum $\mu_{p}$ lies at the boundary of the marginal polytope instead of in its relative interior i.e. it is linear combination of $r$ atoms. The lower sparse eigenvalue condition is a union bound, and only requires to hold over the greedy selection set plus any $r$ sized subset.

1 Scalability

For massively large real world datasets, the standard greedy algorithm (SBQ) may be prohibitively slow. In addition to run time, there are also memory considerations. SBQ requires building and storing an $O(m^{2})$ sized kernel matrix over the training set of size $m$ . We can use alternative variants of the greedy algorithm that are either faster with some compromise on the convergence rate or can distribute the kernel over multiple machines. These variants are presented in Table 1 with their corresponding references. To the best of our knowledge, these variants have not been suggested for solving the problem (1) before and may be of independent interest. The convergence rates are obtained similar to the proof of Theorem 2 by plugging in respective approximation guarantees in lieu of Lemma 7 in the appendix.

Relationship with Influence functions

Influence functions have recently been proposed as a tool for interpreting model predictions . Since our goal is also the same, it is interesting to ask if there is a relationship between the two approaches. For selecting the most influential training point for a given test point, influence functions approximate infinitesimal upweighting of which training point has the most effect on prediction of the test point in question. In this section, we show that our method recovers this influence function approach used by Koh and Liang for selecting influential training data points. In addition, we also show how adversarial training side attacks proposed by Koh and Liang by perturbing features of training data points can be re-interpreted as a standard adversarial attack in the RKHS induced by the Fisher Kernel. Our analysis yields new insights about the influence function based approach and also establishes the importance of the Fisher space for robust learning.

We briefly introduce the influence function approach for model interpretation. For simplicity, we re-use the notation suggested by Koh and Liang . Let $\operatorname*{{\mathbf{z}}_{\text{test}}}$ be the test data point in question, $\mathsf{S}_{\text{train}}$ be the training set, $L({\mathbf{z}},\theta)$ be the loss function fitted on the training set, $\hat{\theta}$ be the optimizer of $L(\mathsf{S}_{\text{train}},\theta)$ , ${\mathbf{H}}_{\theta}$ be the Hessian of the loss function evaluated at $\theta$ , then the most influential training data point is the solution of the optimization problem:

We compare the two discrete optimization problems (5) and (7). Even though (5) uses first order information only while (7) uses both first order and second order information about the loss function, the following proposition illustrates a connection.

If the loss function $L(\cdot)$ takes the form of a negative log-likelihood function, $[H_{\hat{\theta}}]_{ij}=\nabla_{\theta_{i}}L(\mathsf{S}_{\text{train}},\hat{\theta})^{\top}\nabla_{\theta_{j}}L(\mathsf{S}_{\text{train}},\hat{\theta})$ , where we have overloaded the notation $L(\mathsf{S}_{\text{train}},\theta)=\frac{1}{|\mathsf{S}_{\text{train}}|}\sum_{t}L({\mathbf{z}}_{t},\theta)$ .

Let $L({\mathbf{z}},\theta):=-\log p({\mathbf{z}},\theta)$ , since it takes form of a negative LL function. Then, since $\hat{\theta}$ is the optimizer of $L(\mathsf{S}_{\text{train}},\theta)$ ,

from which the result directly follows. ∎

From Proposition 1, it is easy to see that the optimization problems (5) and (7) are the same under some conditions. To be more precise, we can make the following statement. If the cost function $L(\cdot,\cdot)$ is in the form of a negative log-likelihood function, (7) is a special case of (5) with the practical Fisher kernel (see Section 2.1) when the test set is of size 1, and $r=1$ .

This equivalence gives several insights about influence functions that were not known before: (1) it generalizes influence functions to multiple data points for both test and training sets in a principled way and provides a probabilistic foundation to the method, (2) it establishes the importance of the induced RKHS by the Fisher kernel by re-interpreting the influence function optimization problem as $\min_{{\mathbf{z}}\in\mathsf{S}_{\text{train}}}\|\operatorname*{{\mathbf{z}}_{\text{test}}}-{\mathbf{z}}\|_{\mathcal{H}}$ (see Lemma 4 in the appendix), (3) for negative LL functions, it renders the expensive the calculation of the Hessian in the work by Koh and Liang as redundant since by Proposition 3, first order information suffices, (4) it provides theoretical approximation guarantees (see Lemma 7 in the appendix) for selection of multiple training data points, in constrast to Koh and Liang who made multiple selections greedily only as a heuristic.

2 Unified view of adversarial attacks

While the optimization (8) is hard in general, typically a few iterations of projected gradient ascent or FGSM are applied. We refer to the recent work by Madry et al. for details.

For training side attacks, Koh and Liang perform the following iterative update:

This equivalence provides a unified view of both training and test side attacks. As such, the large literature on robust learning against test side attacks can be applied to robustness against training side attacks as well. Moreover our framework also provides a principled way to do training side attacks to target multiple test set examples, instead of attacking individual test points separately.

Experiments

We present empirical use cases of our framework. We chose the experiments to illustrate the flexibility of our framework, as well as to emphasize its generalization capacity over and above influence functions. As such, we present experiments that make use of set influence (as opposed to single data point influence) for data cleaning and summarization (Sections 6.1,6.3). To illustrate potential benefit of using the full Fisher kernel as opposed to the simplified practical Fisher kernel as used by the influence functions, we present evaluation for a use case for fixing mislabelled examples as presented by Koh and Liang (Section 6.2).

In this section, we present experiments on the MNIST dataset to illustrate the effectiveness of our method in interpreting model behavior for the test population. Some of the handwritten digits in MNIST are hard even for a human to classify correctly. Such points can adversely affect the training of the classifier, leading to lower predictive accuracy. Our goal in this experiment is to try to identify some such misleading training data points, and remove them to see if it improves predictive accuracy. To illustrate the flexibility of our approach, we focus only on the digits $4$ and $9$ in the test data which were misclassified by our model, and then select the training data points responsible for those misclassifications.

The MNIST data set consists of images of handwritten digits and their respective labels. Each image is a $28\times 28$ pixel array. There are $70000$ images in total, split into $60000$ training examples and $10000$ test examples. The $10$ digits are about evenly represented in both the training and the test data.

For the classification task, we use tensorflow to build a 2 layer convolutional network with $2\times 2$ max pooling followed by a fully connected layer and the softmax layer. The convolutions use a stride of $1$ followed by padding of zeros to match the input size. We use dropout to avoid overfitting. The network was trained using the built-in Adam Optimizer for $20000$ steps of batch size $100$ each. For the entire test set, we obtain an accuracy of $0.9922$ , while for the subset of the test set consisting only of the chosen two digits $4$ and $9$ , the accuracy is $0.9889$ .

After the training is completed, we obtain the gradients of the training and test data points w.r.t the parameters of the network by passing each point through the trained (and subsequently frozen) network. The obtained gradient vectors are used to calculate the Fisher kernel as detailed in Section 2.1. We then employ Algorithm 1 using the newly built Fisher kernel matrix between training and test datasets to obtain the top $300$ prototypes i.e. data points from the training set that our algorithm deems most responsible for misclassifying $4$ s and $9$ s.

To check if these points are indeed misguiding the model, we remove the top $50,100,200,300$ of the selected points from the training data and retrain the model to retest on the test set. These numbers are reported as Sel50, Sel100, Sel200, Sel300 in Figure 2(b). Indeed we see an improvement in the test accuracy till Sel200 indicating the importance of removing the selected potential malicious points from the training set, and a subsequent decay in performance for Sel300 most likely due to removal of too many useful points in addition to malicious ones. To compare, we also remove the respective number of points randomly and repeat the experiment. Removal of random points from the training data led to a general decay in the predictive accuracy.

Finally, we manually selected $50$ points from the chosen $300$ points as the curated set based on how ill-formed the digits were (see Figure 2(a)). Removing these points from the training set before re-training and testing gives predictive accuracy is reported as Cur50 comparable to Sel100, but still worse than Sel200, indicating that the algorithm identified more malicious points in top-200 selected than our manually chosen $50$ points.

2 Fixing Mislabeled Examples

In this experiment, we use our framework to detect and fix mislabeled examples. Labor intenstive labeling tasks naturally result in mislabels, especially in real-world datasets. These data points may cause poor performance and degradation of the model. We show that our method can be successfully used for this purpose, showing improvement over the recent results by Koh and Liang .

We use a small correctly labeled validation set to identify examples from the large training set that are likely mislabeled. We first train a classifier on the noisy training set, and predict on the validation set. We then employ Algorithm 1 to identify training examples that were responsible for making incorrect predictions on the validation set. The potentially mislabeled data points are then chosen by the output of our method. Curation is then simulated on the selected examples in order of selections made (similar to the approach by Koh and Liang ), and if the label was indeed wrong, it is fixed. We report on the number of training data points selected vs fixed (the precision metric for incorrectly labeled points) and the respective improvement in unseen test data accuracy.

For evaluation, we use enron1 email spam dataset used by Koh and Liang and compare our results to their reported results. The dataset consists of $4137$ training points and $1035$ test points. We randomly select $500$ data points from the training set as the clean curated data. From the remaining training data points, we randomly flip the labels of $20\%$ of the data. We then use our method and the baselines to select several candidates for curation. We report the number of fixes made after these selections and the corresponding test predictive accuracy. The baselines are selection by (i) top self influence measures , and (ii) random selection of datapoints. The curation data is used as part of the training by all the methods. No method had access to the test data. As showing in Figure 3, our algorithm consistently performs better in test accuracy and the fraction of flips fixed as more and more data is curated.

3 Data Summarization

In this section, we perform the task of training data summarization. Our goal is to select a few data samples that represent the data distribution sufficiently well, so that a model built on the selected subsample of the training data does not degrade too much in performance on the unseen test data. This task is complimentary to the task of interpretation, wherein one is interested in selecting training samples that explain some particular predictions on the test set. Since we are interested in approximating the test distribution using a few samples from a training set with the goal of predictive accuracy under a given model, our framework of Sequential Bayesian Quadrature using Fisher kernels is directly applicable.

Another method that also aims to do training data summarization is that of coreset selection , albeit with a different goal of reducing the training data size for optimization speedup while still maintaining guaranteed approximation to the training likelihood. Since the goal itself is optimization speedup, coreset selection algorithms typically employ fast methods while still trying to capture the data distribution by proxy of the training likelihood. Moreover, the coreset selection algorithm is usually closely tied with the respective model as opposed to being a model-agnostic method like ours.

To illustrate that coreset selection falls short on the goal of competitively estimating the data distribution, we employ our framework to the problem of training data summarization under logistic regression, as considered by Huggins et al. using coreset construction. We experiment using two datasets ChemReact and CovType. ChemReact consists of $26733$ chemicals each of feature size $100$ . Out of these, $2500$ are test data points. The prediction variable is $0/1$ and signifies if a chemical is reactive. CovType has $581012$ webpages each of feature size $54$ . Out of these, $29000$ are test points. The task is to predict whether a type of tree is present in each location or not.

In each of the datasets, we further randomly split the training data into $10\%$ validation and $90\%$ training. For the larger CovType data, we note that selecting about 20,000 training points out of the training set achieves about the same performance as the full set. Hence, we work with randomly selected 20,000 points for speedup. We train the logistic regression model on the new training data, and use the validation set as a proxy to the unseen test set. We build the kernel matrix ${\mathbf{K}}$ and the affinity vector ${\mathbf{z}}$ , and run Algorithm 1 for various values of $k$ . For the baselines, we use the coreset selection algorithm and random data selection as implemented by Huggins et al. . The results are presented in Figure 4. We note that our algorithm yields a significantly better predictive performance compared to random subsets and coresets with the same size of the training subset across different subset sizes.

This manuscript proposed a novel principled approach for examining sets of training examples that influence an entire test set given a trained black-box model – extending a notable recently proposed per-example influence to set-wise influence. We also presented novel convergence guarantees for SBQ and more scalable algorithm variants.. Empirical results were presented to highlight the utility of the proposed approach for black-box model interpretability and related tasks. For future work, we plan to investigate the use of model criticisms to provide additional insights into the trained models.

References

Appendix A Appendix

. Our proof follows the following sketch. We show that the given problem can be written as a linear regression problem in the induced RKHS. The greedy SBQ algorithm to choose data points is then equivalent to forward greedy feature selection in the transformed space (Lemma 4). After the selection is made, the weight optimization obtained through the posterior calculation ensures orthogonal projection (Lemma 5) which means the posterior calculation is nothing but fitting of the least squares regression on the chosen set of features. Finally we draw upon research in discrete optimization to get approximation guarantees for greedy feature selection for least squares regression (Lemma 7) that we use to obtain the convergence rates.

We will require the following definition of the Maximum Mean Discrepancy (MMD). MMD is a divergence measure between two distributions $p$ and $q$ over a class of functions $\mathcal{F}$ . We restrict our attention to cases when $\mathcal{F}$ is a Reproducing Kernel Hilbert Space (RKHS), which allows MMD evaluation based only on kernels, rather than explicit function evaluations.

where $\mu_{p}$ and $\mu_{q}$ are the mean function mappings under $p$ and $q$ respectively.

We make use of the following lemma that establishes a connection between MMD and Bayesian Quadrature.

Let $q$ be the distribution established by weights $w_{i}$ of the Bayesian Quadrature over the selected points. Then, the expected variance of the weighted sum in Bayesian Quadration (2) is equal to $\text{MMD}^{2}(p,q)$ .

We can make this explicit in our notation. If $\mathcal{F}$ is an RKHS, we can write the MMD cost function using only the kernel function $K(\cdot,\cdot)$ associated with the RKHS as:-

where, $\phi(\cdot)$ represents the feature mapping under the kernel function $K(\cdot,\cdot)$ , and $i$ ranges over the selected points that define our discrete distribution $q$ . Recall that Bayesian Quadrature deviates from simple kernel herding by allowing for and optimizing over non-uniform weights $w_{i}$ . We can formally show that the weight optimization obtained through the posterior calculation performs an orthogonal projection of $\mu_{p}$ onto the span of selected points to get $\mu_{q}$ in the induced kernel space.

The weights obtained $w_{i}$ through the posterior evaluation of $Z(\mathsf{S}_{n})$ guarantee that $\sum_{i}w_{i}\phi({\mathbf{x}}_{i})$ is the orthogonal projection of $\mu_{p}$ onto span( $\phi({\mathbf{x}}_{i})$ .

Note that it suffices to show that the residual of the projection $\mu_{p}-\sum_{j}w_{j}\phi({\mathbf{x}}_{j})$ is orthogonal to $\phi({\mathbf{x}}_{i})$ for all $i$ in $\mathcal{H}$ . Recall that $w_{i}=\sum_{j}[{\mathbf{K}}^{-1}]_{ij}{\mathbf{z}}_{j}$ , and ${\mathbf{z}}_{i}=\int k({\mathbf{x}},{\mathbf{x}}_{i})p({\mathbf{x}})d({\mathbf{x}})$ . For an arbitrary index $i$ ,

where the last equality follows by noting that $\sum_{j}{\mathbf{K}}_{ji}[{\mathbf{K}}^{-1}]_{tj}$ is inner product of row $i$ of ${\mathbf{K}}$ and row $t$ of ${\mathbf{K}}^{-1}$ which is $1$ if $t=i$ and otherwise. This completes the proof. ∎

Lemma 5 implies that given the selected points, the posterior evaluation is equivalent to the optimizing for ${\mathbf{w}}$ to minimize MMD $(p,q)^{2}$ . In other words, the weight optimization is a simple linear regression in the mapped space $\mathcal{F}$ , and SBQ is equivalent to a greedy forward selection algorithm in $\mathcal{F}$ .

We shall also make use of recent results in generalization of submodular functions. Let ${\mathfrak{p}}(\mathsf{S})$ be the power set of the set $\mathsf{S}$ .

Weak submodularity generalizes submodularity so that a greedy forward selection algorithm guarantees a $(1-\nicefrac{{1}}{{e^{\lambda}}})$ approximation for $\lambda$ -weak submodular functions . Standard submodular functions have a guarantee of $(1-\nicefrac{{1}}{{e}})$ . Thus, submodular functions are $1$ -weak submodular. To provide guarantees for Algorithm 1, we show that the normalized set optimization function is $\frac{m}{M}$ -weak submodular, where $m,M$ depend on the spectrum of the kernel matrix.

The linear regression function is $\frac{m}{M}$ -weak submodular where $m$ is the smallest $2r$ sparse eigenvalue and $M$ is the largest $r+1$ -sparse eigenvalues of the dot product matrix of the features.

We note that Lemma 7 as proposed and proved by Das and Kempe is for the euclidean space. However, their results directly translate to general RKHS as long as the RKHS is bounded, or the candidate atoms have bounded norm. Hence under additional assumptions of bounded norm, the proofs and results of Das and Kempe directly translate to general RKHS. From Lemma 7 and recent results on weakly submodular functions, ([8, Corollary 1]), we get the following approximation guarantee for $g(\cdot)$ under the assumptions of Lemma 7.

Setting $g(\mathsf{S})=v(\phi)-v(\mathsf{S})$ , we get the final result.