Fast Randomized Kernel Methods With Statistical Guarantees

Ahmed El Alaoui, Michael W. Mahoney

Introduction

We consider the low-rank approximation of symmetric positive semi-definite (SPSD) matrices that arise in machine learning and data analysis, with an emphasis on obtaining good statistical guarantees. This is of interest primarily in connection with kernel-based machine learning methods. Recent work in this area has focused on one or the other of two very different perspectives: an algorithmic perspective, where the focus is on running time issues and worst-case quality-of-approximation guarantees, given a fixed input matrix; and a statistical perspective, where the goal is to obtain good inferential properties, under some hypothesized model, by using the low-rank approximation in place of the full kernel matrix. The recent results of Gittens and Mahoney provide the strongest example of the former, and the recent results of Bach are an excellent example of the latter. In this paper, we combine ideas from these two lines of work in order to obtain a fast randomized kernel method with statistical guarantees that are improved relative to the state-of-the-art.

In this paper, we extend these ideas, and we show that—from the statistical perspective—we are able to obtain a low-rank approximation that comes with improved statistical guarantees by using a variant of this more traditional notion of statistical leverage. In particular, we improve the recent bounds of Bach , which provides the first known statistical convergence result when substituting the kernel matrix by its low-rank approximation. To understand the connection, recall that a key component of Bach’s approach is the quantity $d_{\text{mof}}=n\|\,\textup{diag}(\,K(K+n\lambda I)^{-1})\|_{\infty}$ , which he calls the maximal marginal degrees of freedom.We will refer to it as the maximal degrees of freedom. Bach’s main result is that by constructing a low-rank approximation of the original kernel matrix by sampling uniformly at random $p=O(d_{\text{mof}}/\epsilon)$ columns, i.e., performing the vanilla Nyström method, and then by using this low-rank approximation in a prediction task, the statistical performance is within a factor of $1+\epsilon$ of the performance when the entire kernel matrix is used. Here, we show that this uniform sampling is suboptimal. We do so by sampling with respect to a coarse but quickly-computable approximation of a variant to the statistical leverage scores, given in Definition 8 below, and we show that we can obtain similar $1+\epsilon$ guarantees by sampling only $O(d_{\text{eff}}/\epsilon)$ columns, where $d_{\text{eff}}=\mathsf{Tr}(K(K+n\lambda I)^{-1})<d_{\text{mof}}$ . The quantity $d_{\text{eff}}$ is called the effective dimensionality of the learning problem, and it can be interpreted as the implicit number of parameters in this nonparametric setting .

We expect that our results and insights will be useful much more generally. As an example of this, we can directly compare the Nyström sampling method to a related divide-and-conquer approach, thereby answering an open problem of Zhang et al. . Recall that the Zhang et al. divide-and-conquer method consists of dividing the dataset $\{(x_{i},y_{i})\}_{i=1}^{n}$ into $m$ random partitions of equal size, computing estimators on each partition in parallel, and then averaging the estimators. They prove the minimax optimality of their estimator, although their multiplicative constants are suboptimal; and, in terms of the number of kernel evaluations, their method requires $m\times(n/m)^{2}$ , with $m$ in the order of $n/d_{\text{eff}}^{2}$ , which gives a total number of $O(nd_{\text{eff}}^{2})$ evaluations. They noticed that the scaling of their estimator was not directly comparable to that of the Nyström sampling method (which was proven to only require $O(nd_{\text{mof}})$ evaluations, if the sampling is uniform ), and they left it as an open problem to determine which if either method is fundamentally better than the other. Using our Theorem 3, we are able to put both results on a common ground for comparison. Indeed, the estimator obtained by our non-uniform Nyström sampling requires only $O(nd_{\text{eff}})$ kernel evaluations (compared to $O(nd_{\text{eff}}^{2})$ and $O(nd_{\text{mof}})$ ), and it obtains the same bound on the statistical predictive performance as in . In this sense, our result combines “the best of both worlds,” by having the reduced sample complexity of and the sharp approximation bound of .

Preliminaries and notation

Let $\{(x_{i},y_{i})\}_{i=1}^{n}$ be $n$ pairs of points in $\mathcal{X}\times\mathcal{Y}$ , where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the response space. The kernel-based learning problem can be cast as the following minimization problem:

where $K_{ij}=k(x_{i},x_{j})$ . We let $U\Sigma U^{\top}$ be the eigenvalue decomposition of $K$ , with $\Sigma=\text{Diag}(\sigma_{1},\cdots,\sigma_{n})$ , $\sigma_{1}\geq\cdots\geq\sigma_{n}\geq 0$ , and $U$ an orthogonal matrix. The underlying data model is

More generally, if we let $S\in R^{n\times p}$ be an arbitrary sketching matrix, i.e., a tall and skinny matrix that, when left-multiplied by $K$ , produces a “sketch” of $K$ that preserves some desirable properties, then the Nyström approximation associated with $S$ is

For instance, for random sampling algorithms, $S$ would contain a non-zero entry at position $(i,j)$ if the $i$ -th column of $K$ is chosen at the $j$ -th trial of the sampling process. Alternatively, $S$ could also be a random projection matrix; or $S$ could be constructed with some other (perhaps deterministic) method, as long as it verifies some structural properties, depending on the application .

We will focus in this paper on analyzing this approximation in the statistical prediction context related to the estimation of $f^{*}$ by solving Problem (2). We proceed by revisiting and improving upon prior results from three different areas. The first result (Theorem 5) is on the behavior of the bias of $\hat{f}_{L}$ , when $L$ is constructed using a general sketching matrix $S$ . This result underlies the statistical analysis of the Nyström method. To see this, first, it is not hard to prove that $L\preceq K$ in the sense of usual the order on the positive semi-definite cone. Second, one can prove that the variance is matrix-increasing, hence the variance will decrease when replacing $K$ by $L$ . On the other hand, the bias (while not matrix monotone in general) can be proven to not increase too much when replacing $K$ by $L$ . This latter statement will be the main technical difficulty for obtaining a bound on $\mathcal{R}(\hat{f}_{L})$ (see Appendix A). A form of this result is due to Bach in the case where $S$ is a uniform sampling matrix. The second result (Theorem 7) is a concentration bound for approximating matrix multiplication when the rank-one components of the product are sampled non uniformly. This result is derived from the matrix Bernstein inequality, and yields a sharp quantification of the deviation of the approximation from the true product. The third result (Definition 8) is an extension of the definition of the leverage scores to the context of kernel ridge regression. Whereas the notion of leverage is established as an algorithmic tool in randomized linear algebra, we introduce a natural counterpart of it to this statistical setting. By combining these contributions, we are able to give a sharp statistical statement on the behavior of the Nyström method if one is allowed to sample non uniformly. All the proofs are deferred to the appendix.

Revisiting prior work and new results

We begin by stating a “structural” result that upper-bounds the bias of the estimator constructed using the approximation $L$ . This result is deterministic: it only depends on the properties of the input data, and holds for any sketching matrix $S$ that satisfies certain conditions. This way the randomness of the construction of $S$ is decoupled from the rest of the analysis. We highlight the fact that this view offers a possible way of improving the current results since a better construction of $S$ -whether deterministic or random- satisfying the data-related conditions would immediately lead to down stream algorithmic and statistical improvements in this setting.

In the special case where $S$ contains one non zero entry equal to $1/\sqrt{pn}$ in every column with $p$ the number of sampled columns, the result and its proof can be found in (appendix B.2), although we believe that their argument contains a problematic statement. We propose an alternative and complete proof in Appendix A. The subsequent analysis unfolds in two steps: (1) assuming the sketching matrix $S$ satisfies the conditions stated in Theorem 5, we will have $\mathcal{R}(\hat{f}_{L})\lesssim\mathcal{R}(\hat{f}_{K})$ , and (2) matrix concentration is used to show that an appropriate random construction of $S$ satisfies the said conditions. We start by stating the concentration result that is the source of our improvement (section 3.2), define a notion of statistical leverage scores (section 3.3), and then state and prove the main statistical result (Theorem 3 section 3.4). We then present our main algorithmic result consisting of a fast approximation to this new notion of leverage scores (section 3.5).

2 A concentration bound on matrix multiplication

Next, we state our result for approximating matrix products of the form $\Psi\Psi^{\top}$ when a few columns from $\Psi$ are sampled to form the approximate product $\Psi_{I}\Psi_{I}^{\top}$ where $\Psi_{I}$ contains the chosen columns. The proof relies on a matrix Bernstein inequality (see e.g. ) and is presented at the end of the paper (Appendix B).

Remarks: 1. This result will be used for $\Psi=\Phi^{1/2}U^{\top}$ , in conjunction with Theorem 5 to prove our main result in Theorem 3. Notice that $\Psi^{\top}$ is a scaled version of the eigenvectors, with a scaling given by the diagonal matrix $\Phi=\Sigma(\Sigma+n\gamma I)^{-1}$ which should be considered as “soft projection” matrix that smoothly selects the top part of the spectrum of $K$ . The setting of Gittens et al. , in which $\Phi$ is a 0-1 diagonal is the closest analog of our setting.

2. It is known that $p_{i}=\frac{\|\psi_{i}\|_{2}^{2}}{\|\Psi\|_{F}^{2}}$ is the optimal sampling distribution in terms of minimizing the expected error $\mathsf{E}\|\Psi\Psi^{\top}-\Psi SS^{\top}\Psi^{\top}\|_{F}^{2}$ . The above result exhibits a robustness property by allowing the chosen sampling distribution to be different from the optimal one by a factor $\beta$ .In their work , Drineas et al. have a comparable robust statement for controlling the expected error. Our result is a robust quantification of the tail probability of the error, which is a much stronger statement. The sub-optimality of such a distribution is reflected in the upper bound (7) by the amplification of the squared Frobenius norm of $\Psi$ by a factor $1/\beta$ . For instance, if the sampling distribution is chosen to be uniform, i.e. $p_{i}=1/m$ , then the value of $\beta$ for which (6) is tight is $\frac{\|\Psi\|_{F}^{2}}{m\max_{i}\|\psi_{i}\|_{2}^{2}},$ in which case we recover a concentration result proven by Bach . Note that Theorem 7 is derived from one of the state-of-the-art bounds on matrix concentration, but it is one among many others in the literature; and while it constitutes the base of our improvement, it is possible that a concentration bound more tailored to the problem might yield sharper results.

3 An extended definition of leverage

We introduce an extended notion of leverage scores that is specifically tailored to the ridge regression problem, and that we call the $\lambda$ -ridge leverage scores.

For $\lambda>0$ , the $\lambda$ -ridge leverage scores associated with the kernel matrix $K$ and the parameter $\lambda$ are

Note that $l_{i}(\lambda)$ is the $i^{th}$ diagonal entry of $K(K+n\lambda I)^{-1}$ . The quantities $(l_{i}(\lambda))_{1\leq i\leq n}$ are in this setting the analogs of the so-called leverage scores in the statistical literature, as they characterize the data points that “stick out”, and consequently that most affect the result of a statistical procedure. They are classically defined as the row norms of the left singular matrix $U$ of the input matrix, and they have been used in regression diagnostics for outlier detection , and more recently in randomized matrix algorithms as they often provide an optimal importance sampling distribution for constructing random sketches for low rank approximation and least squares regression when the input matrix is tall and skinny ( $n\geq m$ ). In the case where the input matrix is square, this definition is vacuous as the row norms of $U$ are all equal to 1. Recently, Gittens and Mahoney used a truncated version of these scores (that they called leverage scores relative to the best rank- $k$ space) to obtain the best algorithmic results known to date on low rank approximation of positive semi-definite matrices. Definition 8 is a weighted version of the classical leverage scores, where the weights depend on the spectrum of $K$ and a regularization parameter $\lambda$ . In this sense, it is an interpolation between Gittens’ scores and the classical (tall-and-skinny) leverage scores, where the parameter $\lambda$ plays the role of a rank parameter. In addition, we point out that Bach’s maximal degrees of freedom $d_{\text{mof}}$ is to the $\lambda$ -ridge leverage scores what the coherence is to Gittens’ leverage scores, i.e. their (scaled) maximum value: $d_{\text{mof}}/n=\max_{i}l_{i}(\lambda)$ ; and that while the sum of Gittens’ scores is the rank parameter $k$ , the sum of the $\lambda$ -ridge leverage scores is the effective dimensionality $d_{\text{eff}}$ . We argue in the following that Definition 8 provides a relevant notion of leverage in the context of kernel ridge regression. It is the natural counterpart of the algorithmic notion of leverage in the prediction context. We use it in the next section to make a statistical statement on the performance of the Nyström method.

4 Main statistical result: an error bound on approximate kernel ridge regression

Now we are able to give an improved version of a theorem by Bach that establishes a performance guaranty on the use of the Nyström method in the context of kernel ridge regression. It is improved in the sense that the sufficient number of columns that should be sampled in order to incur no (or little) loss in the prediction performance is lower. This is due to a more data-sensitive way of sampling the columns of $K$ (depending on the $\lambda$ -ridge leverage scores) during the construction of the approximation $L$ . The proof is in Appendix C.

Let $\lambda,\epsilon>0$ , $\rho\in(0,1/2)$ , $n\geq 2$ and $L$ be a Nyström approximation of $K$ by choosing $p$ columns randomly with replacement according to a probability distribution $(p_{i})_{1\leq i\leq n}$ such that $\forall i\in\{1,\cdots,n\},~{}~{}~{}p_{i}\geq\beta\cdot l_{i}(\lambda\epsilon)/\sum_{i=1}^{n}l_{i}(\lambda\epsilon)$ for some $\beta\in(0,1]$ . Let $\underline{l}\leq\min_{i}l_{i}(\lambda\epsilon)$ . If

with $d_{\text{eff}}=\sum_{i=1}^{n}l_{i}(\lambda\epsilon)=\mathsf{Tr}(K(K+n\lambda\epsilon I)^{-1})$ then

with probability at least $1-2\rho$ , where $(l_{i})_{i}$ are introduced in Definition 8 and $\mathcal{R}$ is defined in (3).

Theorem 3 asserts that substituting the kernel matrix $K$ by a Nyström approximation of rank $p$ in the KRR problem induces an arbitrarily small prediction loss, provided that $p$ scales linearly with the effective dimensionality $d_{\text{eff}}$ Note that $d_{\text{eff}}$ depends on the precision parameter $\epsilon$ , which is absent in the classical definition of the effective dimensionality However, the following bound holds: $d_{\text{eff}}\leq\frac{1}{\epsilon}\mathsf{Tr}(K(K+n\lambda I)^{-1})$ . and that $\lambda$ is not too smallThis condition on $\lambda$ is not necessary if one constructs $L$ as $KS(S^{\top}KS+n\lambda\epsilon I)^{-1}S^{\top}K$ (see proof).. The leverage-based sampling appears to be crucial for obtaining this dependence, as the $\lambda$ -ridge leverage scores provide information on which columns -and hence which data points- capture most of the difficulty of the estimation problem. Also, as a sanity check, the smaller the target accuracy $\epsilon$ , the higher $d_{\text{eff}}$ , and the more uniform the sampling distribution $(l_{i}(\lambda\epsilon))_{i}$ becomes. In the limit $\epsilon\rightarrow 0$ , $p$ is in the order of $n$ and the scores are uniform, and the method is essentially equivalent to using the entire matrix $K$ . Moreover, if the sampling distribution $(p_{i})_{i}$ is a factor $\beta$ away from optimal, a slight oversampling (i.e. increase $p$ by $1/\beta$ ) achieves the same performance. In this sense, the above result shows robustness to the sampling distribution. This property is very beneficial from an implementation point of view, as the error bounds still hold when only an approximation of the leverage scores is available. If the columns are sampled uniformly, a worse lower bound on $p$ that depends on $d_{\text{mof}}$ is obtained .

5 Main algorithmic result: a fast approximation to the λ𝜆\lambda-ridge leverage scores

Although the $\lambda$ -ridge leverage scores can be naively computed using SVD, the exact computation is as costly as solving the original Problem (2). Therefore, the central role they play in the above result motivates the problem of a fast approximation, in a similar way the importance of the usual leverage scores has motivated Drineas et al. to approximate them is random projection time . A success in this task will allow us to combine the running time benefits with the improved statistical guarantees we have provided.

Inputs: data points $(x_{i})_{1\leq i\leq n}$ , probability vector $(p_{i})_{1\leq i\leq n}$ , sampling parameter $p\in\{1,2,\cdots\}$ , $\lambda>0$ , $\epsilon\in(0,1/2)$ .

Sample $p$ data points from $(x_{i})_{1\leq i\leq n}$ with replacement with probabilities $(p_{i})_{1\leq i\leq n}$ .

Compute the corresponding columns $K_{1},\cdots,K_{p}$ of the kernel matrix.

where $B_{i}$ is the $i$ -th row of $B$ , and return it.

Let $\epsilon\in(0,1/2)$ , $\rho\in(0,1)$ and $\lambda>0$ . Let $L$ be a Nyström approximation of $K$ by choosing $p$ columns at random with probabilities $p_{i}=K_{ii}/\mathsf{Tr}(K)$ , $i=1,\cdots,n$ . If

then we have $\forall i\in\{1,\cdots,n\}$

Remarks: 1. Theorem 4 states that if the columns of $K$ are sampled proportionally to $K_{ii}$ then $O(\frac{\mathsf{Tr}(K)}{n\lambda})$ is a sufficient number of samples. Recall that $K_{ii}=\|\phi(x_{i})\|_{\mathcal{F}}^{2}$ , so our procedure is akin to sampling according to the squared lengths of the data vectors, which has been extensively used in different contexts of randomized matrix approximation .

2. Due to how $\lambda$ is defined in eq. (1) the $n$ in the denominator is artificial: $n\lambda$ should be thought of as a “rescaled” regularization parameter $\lambda^{\prime}$ . In some settings, the $\lambda$ that yields the best generalization error scales like $O(1/\sqrt{n})$ , hence $p=O(\mathsf{Tr}(K)/\sqrt{n})$ is sufficient. On the other hand, if the columns are sampled uniformly, one would get $p=O(d_{\text{mof}})=O(n\max_{i}l_{i}(\lambda))$ .

Experiments

We test our results based on several datasets: one synthetic regression problem from to illustrate the importance of the $\lambda$ -ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh http://www.cs.toronto.edu/~delve/data/pumadyn/desc.html and the Gas Sensor Array Drift Dataset from the UCI databasehttps://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset. The synthetic case consists of a regression problem on the interval $\mathcal{X}=$ where, given a sequence $(x_{i})_{1\leq i\leq n}$ and a sequence of noise $(\epsilon_{i})_{1\leq i\leq n}$ , we observe the sequence

The function $f$ belongs to the RKHS $\mathcal{F}$ generated by the kernel $k(x,y)=\frac{1}{(2\beta)!}B_{2\beta}(x-y-\lfloor x-y\rfloor)$ where $B_{2\beta}$ is the $2\beta$ -th Bernoulli polynomial . One important feature of this regression problem is the distribution of the points $(x_{i})_{1\leq i\leq n}$ on the interval $\mathcal{X}$ : if they are spread uniformly over the interval, the $\lambda$ -ridge leverage scores $(l_{i}(\lambda))_{1\leq i\leq n}$ are uniform for every $\lambda>0$ , and uniform column sampling is optimal in this case. In fact, if $x_{i}=\frac{i-1}{n}$ for $i=1,\cdots,n$ , the kernel matrix $K$ is a circulant matrix , in which case, we can prove that the $\lambda$ -ridge leverage scores are constant. Otherwise, if the data points are distributed asymmetrically on the interval, the $\lambda$ -ridge leverage scores are non uniform, and importance sampling is beneficial (Figure 1). In this experiment, the data points $x_{i}\in(0,1)$ have been generated with a distribution symmetric about $\frac{1}{2}$ , having a high density on the borders of the interval $(0,1)$ and a low density on the center of the interval. The number of observations is $n=500$ . On Figure 1, we can see that there are few data points with high leverage, and those correspond to the region that is underrepresented in the data sample (i.e. the region close to the center of the interval since it is the one that has the lowest density of observations). The $\lambda$ -ridge leverage scores are able to capture the importance of these data points, thus providing a way to detect them (e.g. with an analysis of outliers), had we not known their existence.

For all datasets, we determine $\lambda$ and the band width of $k$ by cross validation, and we compute the effective dimensionality $d_{\text{eff}}$ and the maximal degrees of freedom $d_{\text{mof}}$ . Table 1 summarizes the experiments. It is often the case that $d_{\text{eff}}\ll d_{\text{mof}}$ and $\mathcal{R}(\hat{f}_{L})/\mathcal{R}(\hat{f}_{K})\simeq 1$ , in agreement with Theorem 3.

Conclusion

We showed in this paper that in the case of kernel ridge regression, the sampling complexity of the Nyström method can be reduced to the effective dimensionality of the problem, hence bridging and improving upon different previous attempts that established weaker forms of this result. This was achieved by defining a natural analog to the notion of leverage scores in this statistical context, and using it as a column sampling distribution. We obtained this result by combining and improving upon results that have emerged from two different perspectives on low rank matrix approximation. We also present a way to approximate these scores that is computationally tractable, i.e. runs in time $O(np^{2})$ with $p$ depending only on the trace of the kernel matrix and the regularization parameter. One natural unanswered question is whether it is possible to further reduce the sampling complexity, or is the effective dimensionality also a lower bound on $p$ ? And as pointed out by previous work , it is likely that the same results hold for smooth losses beyond the squared loss (e.g. logistic regression). However the situation is unclear for non-smooth losses (e.g. support vector regression).

An earlier draft of this paper contained a mistake in the proof of Theorem 5. We thank Xixian Chen for signaling it to us. We thank Francis Bach for stimulating discussions and for contributing to a rectified proof. We thank Jason Lee and Aaditya Ramdas for fruitful discussions regarding the same issue. We thank Yuchen Zhang for pointing out the connection to his work.

References

Appendix A Proof of Theorem 5

Note: This proof is inspired by one of Bach . We extend their result to the case of a general sketching matrix $S$ . Moreover, we believe their argument contains two problematic statements (about monotonicity of the bias) that we rectify with Lemma 2 and Lemma 3 below. Their result therefore holds also true with minimal change based on this argument.

For kernel ridge regression, the bias of the estimator $\hat{f}_{K}$ can be expressed as

Let $K=U\Sigma U^{\top}$ where $U$ is orthogonal and $\Sigma$ diagonal positive. We have

with $\Phi=\Sigma(\Sigma+n\gamma I)^{-1}$ . If $\lambda_{\max}(D)\leq t$ for $t\in(0,1)$ then

If $0\preceq K-L_{\gamma}\preceq\frac{n\gamma}{1-t}I$ then $\text{bias}(L_{\gamma})\leq\left(1+\frac{\gamma/\lambda}{1-t}\right)\text{bias}(K)$ .

If $0\preceq K-L_{\gamma}\preceq\frac{n\gamma}{1-t}I$ and $\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\cdot\frac{\lambda_{\max}(K)}{n}$ then the map $\gamma\rightarrow\text{bias}(L_{\gamma})$ is increasing. This in particular implies that under the same conditions, $\text{bias}(L)\leq\text{bias}(L_{\gamma})$ .

Proof of Lemma 1. With $K=U\Sigma U^{\top}$ and $R=\Sigma^{1/2}U^{\top}S$ , $\bar{L}_{\gamma}=R(R^{\top}R+n\gamma I)^{-1}R^{\top}$ , we have

Due to the matrix inversion lemma, we have

and $\Phi=\Sigma(\Sigma+n\gamma I)^{-1}$ . This shows that for any $\gamma\geq 0$

Now if $\lambda_{\max}(D)\leq t$ for $t\in(0,1)$ ,

Proof of Lemma 2. This proof was communicated to us by Francis Bach .

Since $K-L_{\gamma}$ commutes with the identity, we have

Proof of Lemma 3. Let $\varphi(\gamma)={f^{*}}^{\top}(L_{\gamma}+n\lambda I)^{-2}f^{*}$ . The task is to prove that $\varphi$ is increasing if $\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\frac{\lambda_{\max}(K)}{n}$ . We do so by computing the derivative of $\varphi$ and showing that $\varphi^{\prime}\geq 0$ . Let $\gamma,\gamma^{\prime}>0$ . We have

Now we compute the terms $L_{\gamma^{\prime}}-L_{\gamma}$ and $L_{\gamma^{\prime}}^{2}-L_{\gamma}^{2}$ :

The first term is the last equality above is equal to

Now combining the above and taking the limit $\gamma^{\prime}\rightarrow\gamma$ we have

Therefore, the function $\varphi$ is increasing for all $\gamma$ such that $Q\succeq 0$ , and the latter is true if $2n\lambda\geq-\lambda_{\min}(\bar{Q})$ . Moreover, since $\bar{Q}$ is symmetric we have

and it is sufficient to verify the condition

Now we finish the proof by showing that the above operator norm is smaller than $\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K)$ . We have

Taking operator norms, and using the assumption $0\preceq K-L_{\gamma}\preceq\frac{n\gamma}{1-t}I$ ,

Hence, (11) is satisfied if $n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K)$ therefore concluding the proof. ∎

Appendix B Proof of Theorem 7

The proof uses the matrix Bernstein inequality (see e.g. Theorem 6.1.1 in ):

Consider a sequence $(X_{k})$ of independent random symmetric matrices with dimension $d$ . Assume that $\mathsf{E}(X_{k})=0$ , $\lambda_{\max}(X_{k})\leq R$ , and let $Y=\sum_{k}X_{k}$ . Furthermore, assume that there exists $\sigma>0$ such that $\|\mathsf{E}(Y^{2})\|_{2}\leq\sigma^{2}$ . Then

Next , we exhibit the sequence $(X_{k})$ and $Y$ in our case. We have

where $(z_{ik})_{1\leq i\leq m}$ are i.i.d. binary random vectors for $k\in\{1,\cdots,p\}$ with $\mathsf{Pr}(z_{ik}=1)=p_{i}$ (i.e. $(z_{ik})_{1\leq i\leq m}$ is the indicator of the chosen column at trial $k$ ). Let $Y=\Psi\Psi^{\top}-\Psi SS^{\top}\Psi^{\top}$ , then

We choose $X_{k}$ to be $\frac{1}{p}\sum_{i=1}^{m}(1-\frac{z_{ik}}{p_{i}})\psi_{i}\psi_{i}^{\top}$ for every $k\in\{1,\cdots,p\}$ . Now we verify the assumptions of the above theorem. The matrices $X_{k}$ inherit independence from the random vectors $(z_{ik})_{1\leq i\leq m}$ , and we have $\mathsf{E}(X_{k})=0$ , and $\lambda_{\max}(X_{k})\leq\frac{1}{p}\lambda_{\max}(\sum_{i=1}^{m}\psi_{i}\psi_{i}^{\top})=\frac{1}{p}\lambda_{\max}(\Psi\Psi^{\top})$ . Now we control the spectral norm of the second moment of $Y$ . Again with $\mathsf{E}(X_{k})=0$ we have $\mathsf{E}(Y^{2})=\sum_{k,k^{\prime}=1}^{p}\mathsf{E}(X_{k}X_{k^{\prime}})=\sum_{k=1}^{p}\mathsf{E}(X_{k}^{2}).$ And for $k\in\{1,\cdots,p\}$

To proceed, observe that for $i\neq i^{\prime}$ , $z_{ik}z_{i^{\prime}k}=0$ since only one column is chosen at a time. This yields

Given that the probability distribution $(p_{i})$ verifies $p_{i}\geq\beta\frac{\|\psi_{i}\|_{2}^{2}}{\|\Psi\|_{F}^{2}}$ , we get $\mathsf{E}(Y^{2})\preceq\frac{\|\Psi\|_{F}^{2}}{\beta p}\sum_{i=1}^{m}\psi_{i}\psi_{i}^{\top}=\frac{\|\Psi\|_{F}^{2}}{\beta p}\Psi\Psi^{\top}.$ Hence $\|\mathsf{E}(Y^{2})\|_{2}\leq\frac{\|\Psi\|_{F}^{2}}{\beta p}\lambda_{\max}(\Psi\Psi^{\top}).$ We now apply the theorem with $R=\frac{1}{p}\lambda_{\max}(\Psi\Psi^{\top})$ and $\sigma^{2}=\frac{\|\Psi\|_{F}^{2}}{\beta p}\lambda_{\max}(\Psi\Psi^{\top})$ which leads to the desired result.

Appendix C Proof of Theorem 3

First of all, we observe that the variance of the estimator $\hat{f}_{K}$ is matrix-increasing as a function of $K$ . Indeed, we have

where $\lambda_{j}(K)$ is the $j$ th eigenvalue of $K$ arranged in a decreasing order. The function $x\rightarrow\frac{x^{2}}{(x+n\lambda)^{2}}$ is increasing for $x\geq 0$ . Moreover, if $L\preceq K$ then by the Courant-Fischer minimax principle $\lambda_{j}(L)\leq\lambda_{j}(K)$ for all $j$ (e.g. see Corollary III.1.2 in ).

Now, using Theorem 5 combined with the above fact, we have

We set $\gamma=\lambda\epsilon$ and $t=1/2$ . The above holds if $\lambda_{\max}\Big{(}\Phi-\Phi^{1/2}U^{\top}SS^{\top}U\Phi^{1/2}\Big{)}\leq t$ and $n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K)$ . Now let $\Psi=\Phi^{1/2}U^{\top}$ . Then we have $\|\psi_{i}\|_{2}^{2}=l_{i}(\gamma)$ and $\|\Psi\|_{F}^{2}=d_{\text{eff}}$ . Using Theorem 7 on $\Psi$ , and given that $\lambda_{\max}(\Psi\Psi^{\top})=\lambda_{\max}(\Phi)\leq 1$ , for the result to hold with probability at least $1-\rho$ , it is sufficient to set $p$ such that $n\exp\Big{(}\frac{-p(1/2)^{2}/2}{d_{\text{eff}}/\beta+1/6}\Big{)}\leq\rho$ which gives the desired lower bound $p\geq 8(d_{\text{eff}}/\beta+1/6)\log\left(\frac{n}{\rho}\right)$ .

Remark: Note that if one uses the regularized Nyström approximation $L_{\gamma}=KS(S^{\top}KS+n\gamma I)^{-1}S^{\top}K$ with $\gamma=\lambda\epsilon$ instead of $L=KS(S^{\top}KS)^{\dagger}S^{\top}K$ in the algorithm then the proof would now be complete and the condition condition $n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K)$ is not necessary. If one uses $L$ , then this latter condition needs to be verified to insure monotonicity of the bias (see Lemma 3).

Now it remains to control the operator norm of the sketching matrix $S$ appearing in the lower bound on $\lambda$ . To this end we use a variant of the matrix Bernstein inequality (Theorem 5) for controlling operator norms of random matrices (see Corollary 6.2.1 in ).

Consider a sequence $(X_{k})$ of independent random symmetric matrices with dimension $d\times d$ . Assume that $\mathsf{E}(X_{k})=0$ , $\|X_{k}\|_{\text{op}}\leq R$ , and let $Y=\sum_{k}X_{k}$ . Furthermore, assume that there exists $\sigma>0$ such that $\|\mathsf{E}(Y^{2})\|_{\text{op}}\leq\sigma^{2}$ . Then

with $\underline{l}=\min_{i}l_{i}(\lambda\epsilon)$ . On the other hand,

By choosing $\sigma^{2}=R=\frac{1}{p}\frac{d_{\text{eff}}}{\beta\underline{l}}$ , we have $\|SS^{\top}-I\|_{\text{op}}\leq t$ with probability at least $1-2n\exp\left(-\frac{t^{2}/2}{R(1+t/3)}\right)$ . Taking $t=\max\left\{1~{},~{}\frac{8d_{\text{eff}}}{3\beta\underline{l}\cdot p}\log\left(\frac{2n}{\rho}\right)\right\}$ , the latter probability is greater than $1-\rho$ , and by the triangle inequality: $\|S\|_{\text{op}}^{2}\leq 1+t$ with the same probability. By taking $p\geq 8(d_{\text{eff}}/\beta+1/6)\log\left(\frac{n}{\rho}\right)$ (thereby verifying the condition from the previous paragraph) we have

if $n\geq 2$ , and therefore $\|S\|_{\text{op}}^{2}\leq 1+1/\underline{l}$ (since $\underline{l}\leq 1$ ) with probability at least $1-\rho$ .

Appendix D Proof of Theorem 4

for $t\geq 0$ with $\Psi=\Phi^{1/2}U^{\top}$ , $\Phi=\Sigma(\Sigma+n\gamma I)^{-1}$ then

by choosing $p_{i}=K_{ii}/\mathsf{Tr}(K)$ , we have $p_{i}\geq\beta~{}l_{i}(\lambda\epsilon)/\sum_{i=1}^{n}l_{i}(\lambda\epsilon)$ with $\beta=n\lambda\epsilon d_{\text{eff}}/\mathsf{Tr}(K)$ , which yields $d_{\text{eff}}/\beta=\mathsf{Tr}(K)/(n\lambda\epsilon)$ .

As for the multiplicative error bound, using $K-L_{\gamma}\preceq\frac{n\gamma}{1-t}K(K+n\gamma)^{-1}$ we get

For $t=1/2$ , $I-\frac{n\gamma}{1-t}(K+n\gamma I)^{-1}=(K-n\gamma I)(K+n\gamma I)^{-1}\succeq\frac{\sigma_{n}-n\gamma}{\sigma_{n}+n\gamma}I$ . The result follows.