Fast Randomized Kernel Methods With Statistical Guarantees

Ahmed El Alaoui, Michael W. Mahoney

Introduction

We consider the low-rank approximation of symmetric positive semi-definite (SPSD) matrices that arise in machine learning and data analysis, with an emphasis on obtaining good statistical guarantees. This is of interest primarily in connection with kernel-based machine learning methods. Recent work in this area has focused on one or the other of two very different perspectives: an algorithmic perspective, where the focus is on running time issues and worst-case quality-of-approximation guarantees, given a fixed input matrix; and a statistical perspective, where the goal is to obtain good inferential properties, under some hypothesized model, by using the low-rank approximation in place of the full kernel matrix. The recent results of Gittens and Mahoney provide the strongest example of the former, and the recent results of Bach are an excellent example of the latter. In this paper, we combine ideas from these two lines of work in order to obtain a fast randomized kernel method with statistical guarantees that are improved relative to the state-of-the-art.

In this paper, we extend these ideas, and we show that—from the statistical perspective—we are able to obtain a low-rank approximation that comes with improved statistical guarantees by using a variant of this more traditional notion of statistical leverage. In particular, we improve the recent bounds of Bach , which provides the first known statistical convergence result when substituting the kernel matrix by its low-rank approximation. To understand the connection, recall that a key component of Bach’s approach is the quantity dmof=ndiag(K(K+nλI)1)d_{\text{mof}}=n\|\,\textup{diag}(\,K(K+n\lambda I)^{-1})\|_{\infty}, which he calls the maximal marginal degrees of freedom.We will refer to it as the maximal degrees of freedom. Bach’s main result is that by constructing a low-rank approximation of the original kernel matrix by sampling uniformly at random p=O(dmof/ϵ)p=O(d_{\text{mof}}/\epsilon) columns, i.e., performing the vanilla Nyström method, and then by using this low-rank approximation in a prediction task, the statistical performance is within a factor of 1+ϵ1+\epsilon of the performance when the entire kernel matrix is used. Here, we show that this uniform sampling is suboptimal. We do so by sampling with respect to a coarse but quickly-computable approximation of a variant to the statistical leverage scores, given in Definition 8 below, and we show that we can obtain similar 1+ϵ1+\epsilon guarantees by sampling only O(deff/ϵ)O(d_{\text{eff}}/\epsilon) columns, where deff=Tr(K(K+nλI)1)<dmofd_{\text{eff}}=\mathsf{Tr}(K(K+n\lambda I)^{-1})<d_{\text{mof}}. The quantity deffd_{\text{eff}} is called the effective dimensionality of the learning problem, and it can be interpreted as the implicit number of parameters in this nonparametric setting .

We expect that our results and insights will be useful much more generally. As an example of this, we can directly compare the Nyström sampling method to a related divide-and-conquer approach, thereby answering an open problem of Zhang et al. . Recall that the Zhang et al. divide-and-conquer method consists of dividing the dataset {(xi,yi)}i=1n\{(x_{i},y_{i})\}_{i=1}^{n} into mm random partitions of equal size, computing estimators on each partition in parallel, and then averaging the estimators. They prove the minimax optimality of their estimator, although their multiplicative constants are suboptimal; and, in terms of the number of kernel evaluations, their method requires m×(n/m)2m\times(n/m)^{2}, with mm in the order of n/deff2n/d_{\text{eff}}^{2}, which gives a total number of O(ndeff2)O(nd_{\text{eff}}^{2}) evaluations. They noticed that the scaling of their estimator was not directly comparable to that of the Nyström sampling method (which was proven to only require O(ndmof)O(nd_{\text{mof}}) evaluations, if the sampling is uniform ), and they left it as an open problem to determine which if either method is fundamentally better than the other. Using our Theorem 3, we are able to put both results on a common ground for comparison. Indeed, the estimator obtained by our non-uniform Nyström sampling requires only O(ndeff)O(nd_{\text{eff}}) kernel evaluations (compared to O(ndeff2)O(nd_{\text{eff}}^{2}) and O(ndmof)O(nd_{\text{mof}})), and it obtains the same bound on the statistical predictive performance as in . In this sense, our result combines “the best of both worlds,” by having the reduced sample complexity of and the sharp approximation bound of .

Preliminaries and notation

Let {(xi,yi)}i=1n\{(x_{i},y_{i})\}_{i=1}^{n} be nn pairs of points in X×Y\mathcal{X}\times\mathcal{Y}, where X\mathcal{X} is the input space and Y\mathcal{Y} is the response space. The kernel-based learning problem can be cast as the following minimization problem:

where Kij=k(xi,xj)K_{ij}=k(x_{i},x_{j}). We let UΣUU\Sigma U^{\top} be the eigenvalue decomposition of KK, with Σ=Diag(σ1,,σn)\Sigma=\text{Diag}(\sigma_{1},\cdots,\sigma_{n}), σ1σn0\sigma_{1}\geq\cdots\geq\sigma_{n}\geq 0, and UU an orthogonal matrix. The underlying data model is

More generally, if we let SRn×pS\in R^{n\times p} be an arbitrary sketching matrix, i.e., a tall and skinny matrix that, when left-multiplied by KK, produces a “sketch” of KK that preserves some desirable properties, then the Nyström approximation associated with SS is

For instance, for random sampling algorithms, SS would contain a non-zero entry at position (i,j)(i,j) if the ii-th column of KK is chosen at the jj-th trial of the sampling process. Alternatively, SS could also be a random projection matrix; or SS could be constructed with some other (perhaps deterministic) method, as long as it verifies some structural properties, depending on the application .

We will focus in this paper on analyzing this approximation in the statistical prediction context related to the estimation of ff^{*} by solving Problem (2). We proceed by revisiting and improving upon prior results from three different areas. The first result (Theorem 5) is on the behavior of the bias of f^L\hat{f}_{L}, when LL is constructed using a general sketching matrix SS. This result underlies the statistical analysis of the Nyström method. To see this, first, it is not hard to prove that LKL\preceq K in the sense of usual the order on the positive semi-definite cone. Second, one can prove that the variance is matrix-increasing, hence the variance will decrease when replacing KK by LL. On the other hand, the bias (while not matrix monotone in general) can be proven to not increase too much when replacing KK by LL. This latter statement will be the main technical difficulty for obtaining a bound on R(f^L)\mathcal{R}(\hat{f}_{L}) (see Appendix A). A form of this result is due to Bach in the case where SS is a uniform sampling matrix. The second result (Theorem 7) is a concentration bound for approximating matrix multiplication when the rank-one components of the product are sampled non uniformly. This result is derived from the matrix Bernstein inequality, and yields a sharp quantification of the deviation of the approximation from the true product. The third result (Definition 8) is an extension of the definition of the leverage scores to the context of kernel ridge regression. Whereas the notion of leverage is established as an algorithmic tool in randomized linear algebra, we introduce a natural counterpart of it to this statistical setting. By combining these contributions, we are able to give a sharp statistical statement on the behavior of the Nyström method if one is allowed to sample non uniformly. All the proofs are deferred to the appendix.

Revisiting prior work and new results

We begin by stating a “structural” result that upper-bounds the bias of the estimator constructed using the approximation LL. This result is deterministic: it only depends on the properties of the input data, and holds for any sketching matrix SS that satisfies certain conditions. This way the randomness of the construction of SS is decoupled from the rest of the analysis. We highlight the fact that this view offers a possible way of improving the current results since a better construction of SS -whether deterministic or random- satisfying the data-related conditions would immediately lead to down stream algorithmic and statistical improvements in this setting.

In the special case where SS contains one non zero entry equal to 1/pn1/\sqrt{pn} in every column with pp the number of sampled columns, the result and its proof can be found in (appendix B.2), although we believe that their argument contains a problematic statement. We propose an alternative and complete proof in Appendix A. The subsequent analysis unfolds in two steps: (1) assuming the sketching matrix SS satisfies the conditions stated in Theorem 5, we will have R(f^L)R(f^K)\mathcal{R}(\hat{f}_{L})\lesssim\mathcal{R}(\hat{f}_{K}), and (2) matrix concentration is used to show that an appropriate random construction of SS satisfies the said conditions. We start by stating the concentration result that is the source of our improvement (section 3.2), define a notion of statistical leverage scores (section 3.3), and then state and prove the main statistical result (Theorem 3 section 3.4). We then present our main algorithmic result consisting of a fast approximation to this new notion of leverage scores (section 3.5).

2 A concentration bound on matrix multiplication

Next, we state our result for approximating matrix products of the form ΨΨ\Psi\Psi^{\top} when a few columns from Ψ\Psi are sampled to form the approximate product ΨIΨI\Psi_{I}\Psi_{I}^{\top} where ΨI\Psi_{I} contains the chosen columns. The proof relies on a matrix Bernstein inequality (see e.g. ) and is presented at the end of the paper (Appendix B).

Remarks: 1. This result will be used for Ψ=Φ1/2U\Psi=\Phi^{1/2}U^{\top}, in conjunction with Theorem 5 to prove our main result in Theorem 3. Notice that Ψ\Psi^{\top} is a scaled version of the eigenvectors, with a scaling given by the diagonal matrix Φ=Σ(Σ+nγI)1\Phi=\Sigma(\Sigma+n\gamma I)^{-1} which should be considered as “soft projection” matrix that smoothly selects the top part of the spectrum of KK. The setting of Gittens et al. , in which Φ\Phi is a 0-1 diagonal is the closest analog of our setting.

2. It is known that pi=ψi22ΨF2p_{i}=\frac{\|\psi_{i}\|_{2}^{2}}{\|\Psi\|_{F}^{2}} is the optimal sampling distribution in terms of minimizing the expected error EΨΨΨSSΨF2\mathsf{E}\|\Psi\Psi^{\top}-\Psi SS^{\top}\Psi^{\top}\|_{F}^{2} . The above result exhibits a robustness property by allowing the chosen sampling distribution to be different from the optimal one by a factor β\beta.In their work , Drineas et al. have a comparable robust statement for controlling the expected error. Our result is a robust quantification of the tail probability of the error, which is a much stronger statement. The sub-optimality of such a distribution is reflected in the upper bound (7) by the amplification of the squared Frobenius norm of Ψ\Psi by a factor 1/β1/\beta. For instance, if the sampling distribution is chosen to be uniform, i.e. pi=1/mp_{i}=1/m, then the value of β\beta for which (6) is tight is ΨF2mmaxiψi22,\frac{\|\Psi\|_{F}^{2}}{m\max_{i}\|\psi_{i}\|_{2}^{2}}, in which case we recover a concentration result proven by Bach . Note that Theorem 7 is derived from one of the state-of-the-art bounds on matrix concentration, but it is one among many others in the literature; and while it constitutes the base of our improvement, it is possible that a concentration bound more tailored to the problem might yield sharper results.

3 An extended definition of leverage

We introduce an extended notion of leverage scores that is specifically tailored to the ridge regression problem, and that we call the λ\lambda-ridge leverage scores.

For λ>0\lambda>0, the λ\lambda-ridge leverage scores associated with the kernel matrix KK and the parameter λ\lambda are

Note that li(λ)l_{i}(\lambda) is the ithi^{th} diagonal entry of K(K+nλI)1K(K+n\lambda I)^{-1}. The quantities (li(λ))1in(l_{i}(\lambda))_{1\leq i\leq n} are in this setting the analogs of the so-called leverage scores in the statistical literature, as they characterize the data points that “stick out”, and consequently that most affect the result of a statistical procedure. They are classically defined as the row norms of the left singular matrix UU of the input matrix, and they have been used in regression diagnostics for outlier detection , and more recently in randomized matrix algorithms as they often provide an optimal importance sampling distribution for constructing random sketches for low rank approximation and least squares regression when the input matrix is tall and skinny (nmn\geq m). In the case where the input matrix is square, this definition is vacuous as the row norms of UU are all equal to 1. Recently, Gittens and Mahoney used a truncated version of these scores (that they called leverage scores relative to the best rank-kk space) to obtain the best algorithmic results known to date on low rank approximation of positive semi-definite matrices. Definition 8 is a weighted version of the classical leverage scores, where the weights depend on the spectrum of KK and a regularization parameter λ\lambda. In this sense, it is an interpolation between Gittens’ scores and the classical (tall-and-skinny) leverage scores, where the parameter λ\lambda plays the role of a rank parameter. In addition, we point out that Bach’s maximal degrees of freedom dmofd_{\text{mof}} is to the λ\lambda-ridge leverage scores what the coherence is to Gittens’ leverage scores, i.e. their (scaled) maximum value: dmof/n=maxili(λ)d_{\text{mof}}/n=\max_{i}l_{i}(\lambda); and that while the sum of Gittens’ scores is the rank parameter kk, the sum of the λ\lambda-ridge leverage scores is the effective dimensionality deffd_{\text{eff}}. We argue in the following that Definition 8 provides a relevant notion of leverage in the context of kernel ridge regression. It is the natural counterpart of the algorithmic notion of leverage in the prediction context. We use it in the next section to make a statistical statement on the performance of the Nyström method.

4 Main statistical result: an error bound on approximate kernel ridge regression

Now we are able to give an improved version of a theorem by Bach that establishes a performance guaranty on the use of the Nyström method in the context of kernel ridge regression. It is improved in the sense that the sufficient number of columns that should be sampled in order to incur no (or little) loss in the prediction performance is lower. This is due to a more data-sensitive way of sampling the columns of KK (depending on the λ\lambda-ridge leverage scores) during the construction of the approximation LL. The proof is in Appendix C.

Let λ,ϵ>0\lambda,\epsilon>0, ρ(0,1/2)\rho\in(0,1/2), n2n\geq 2 and LL be a Nyström approximation of KK by choosing pp columns randomly with replacement according to a probability distribution (pi)1in(p_{i})_{1\leq i\leq n} such that i{1,,n},   piβli(λϵ)/i=1nli(λϵ)\forall i\in\{1,\cdots,n\},~{}~{}~{}p_{i}\geq\beta\cdot l_{i}(\lambda\epsilon)/\sum_{i=1}^{n}l_{i}(\lambda\epsilon) for some β(0,1]\beta\in(0,1]. Let lminili(λϵ)\underline{l}\leq\min_{i}l_{i}(\lambda\epsilon). If

with deff=i=1nli(λϵ)=Tr(K(K+nλϵI)1)d_{\text{eff}}=\sum_{i=1}^{n}l_{i}(\lambda\epsilon)=\mathsf{Tr}(K(K+n\lambda\epsilon I)^{-1}) then

with probability at least 12ρ1-2\rho, where (li)i(l_{i})_{i} are introduced in Definition 8 and R\mathcal{R} is defined in (3).

Theorem 3 asserts that substituting the kernel matrix KK by a Nyström approximation of rank pp in the KRR problem induces an arbitrarily small prediction loss, provided that pp scales linearly with the effective dimensionality deffd_{\text{eff}}Note that deffd_{\text{eff}} depends on the precision parameter ϵ\epsilon, which is absent in the classical definition of the effective dimensionality However, the following bound holds: deff1ϵTr(K(K+nλI)1)d_{\text{eff}}\leq\frac{1}{\epsilon}\mathsf{Tr}(K(K+n\lambda I)^{-1}). and that λ\lambda is not too smallThis condition on λ\lambda is not necessary if one constructs LL as KS(SKS+nλϵI)1SKKS(S^{\top}KS+n\lambda\epsilon I)^{-1}S^{\top}K (see proof).. The leverage-based sampling appears to be crucial for obtaining this dependence, as the λ\lambda-ridge leverage scores provide information on which columns -and hence which data points- capture most of the difficulty of the estimation problem. Also, as a sanity check, the smaller the target accuracy ϵ\epsilon, the higher deffd_{\text{eff}}, and the more uniform the sampling distribution (li(λϵ))i(l_{i}(\lambda\epsilon))_{i} becomes. In the limit ϵ0\epsilon\rightarrow 0, pp is in the order of nn and the scores are uniform, and the method is essentially equivalent to using the entire matrix KK. Moreover, if the sampling distribution (pi)i(p_{i})_{i} is a factor β\beta away from optimal, a slight oversampling (i.e. increase pp by 1/β1/\beta) achieves the same performance. In this sense, the above result shows robustness to the sampling distribution. This property is very beneficial from an implementation point of view, as the error bounds still hold when only an approximation of the leverage scores is available. If the columns are sampled uniformly, a worse lower bound on pp that depends on dmofd_{\text{mof}} is obtained .

5 Main algorithmic result: a fast approximation to the λ𝜆\lambda-ridge leverage scores

Although the λ\lambda-ridge leverage scores can be naively computed using SVD, the exact computation is as costly as solving the original Problem (2). Therefore, the central role they play in the above result motivates the problem of a fast approximation, in a similar way the importance of the usual leverage scores has motivated Drineas et al. to approximate them is random projection time . A success in this task will allow us to combine the running time benefits with the improved statistical guarantees we have provided.

Inputs: data points (xi)1in(x_{i})_{1\leq i\leq n}, probability vector (pi)1in(p_{i})_{1\leq i\leq n}, sampling parameter p{1,2,}p\in\{1,2,\cdots\}, λ>0\lambda>0, ϵ(0,1/2)\epsilon\in(0,1/2).

Sample pp data points from (xi)1in(x_{i})_{1\leq i\leq n} with replacement with probabilities (pi)1in(p_{i})_{1\leq i\leq n}.

Compute the corresponding columns K1,,KpK_{1},\cdots,K_{p} of the kernel matrix.

where BiB_{i} is the ii-th row of BB, and return it.

Let ϵ(0,1/2)\epsilon\in(0,1/2), ρ(0,1)\rho\in(0,1) and λ>0\lambda>0. Let LL be a Nyström approximation of KK by choosing pp columns at random with probabilities pi=Kii/Tr(K)p_{i}=K_{ii}/\mathsf{Tr}(K), i=1,,ni=1,\cdots,n. If

then we have i{1,,n}\forall i\in\{1,\cdots,n\}

Remarks: 1. Theorem 4 states that if the columns of KK are sampled proportionally to KiiK_{ii} then O(Tr(K)nλ)O(\frac{\mathsf{Tr}(K)}{n\lambda}) is a sufficient number of samples. Recall that Kii=ϕ(xi)F2K_{ii}=\|\phi(x_{i})\|_{\mathcal{F}}^{2}, so our procedure is akin to sampling according to the squared lengths of the data vectors, which has been extensively used in different contexts of randomized matrix approximation .

2. Due to how λ\lambda is defined in eq. (1) the nn in the denominator is artificial: nλn\lambda should be thought of as a “rescaled” regularization parameter λ\lambda^{\prime}. In some settings, the λ\lambda that yields the best generalization error scales like O(1/n)O(1/\sqrt{n}), hence p=O(Tr(K)/n)p=O(\mathsf{Tr}(K)/\sqrt{n}) is sufficient. On the other hand, if the columns are sampled uniformly, one would get p=O(dmof)=O(nmaxili(λ))p=O(d_{\text{mof}})=O(n\max_{i}l_{i}(\lambda)).

Experiments

We test our results based on several datasets: one synthetic regression problem from to illustrate the importance of the λ\lambda-ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh http://www.cs.toronto.edu/~delve/data/pumadyn/desc.html and the Gas Sensor Array Drift Dataset from the UCI databasehttps://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset. The synthetic case consists of a regression problem on the interval X=\mathcal{X}= where, given a sequence (xi)1in(x_{i})_{1\leq i\leq n} and a sequence of noise (ϵi)1in(\epsilon_{i})_{1\leq i\leq n}, we observe the sequence

The function ff belongs to the RKHS F\mathcal{F} generated by the kernel k(x,y)=1(2β)!B2β(xyxy)k(x,y)=\frac{1}{(2\beta)!}B_{2\beta}(x-y-\lfloor x-y\rfloor) where B2βB_{2\beta} is the 2β2\beta-th Bernoulli polynomial . One important feature of this regression problem is the distribution of the points (xi)1in(x_{i})_{1\leq i\leq n} on the interval X\mathcal{X}: if they are spread uniformly over the interval, the λ\lambda-ridge leverage scores (li(λ))1in(l_{i}(\lambda))_{1\leq i\leq n} are uniform for every λ>0\lambda>0, and uniform column sampling is optimal in this case. In fact, if xi=i1nx_{i}=\frac{i-1}{n} for i=1,,ni=1,\cdots,n, the kernel matrix KK is a circulant matrix , in which case, we can prove that the λ\lambda-ridge leverage scores are constant. Otherwise, if the data points are distributed asymmetrically on the interval, the λ\lambda-ridge leverage scores are non uniform, and importance sampling is beneficial (Figure 1). In this experiment, the data points xi(0,1)x_{i}\in(0,1) have been generated with a distribution symmetric about 12\frac{1}{2}, having a high density on the borders of the interval (0,1)(0,1) and a low density on the center of the interval. The number of observations is n=500n=500. On Figure 1, we can see that there are few data points with high leverage, and those correspond to the region that is underrepresented in the data sample (i.e. the region close to the center of the interval since it is the one that has the lowest density of observations). The λ\lambda-ridge leverage scores are able to capture the importance of these data points, thus providing a way to detect them (e.g. with an analysis of outliers), had we not known their existence.

For all datasets, we determine λ\lambda and the band width of kk by cross validation, and we compute the effective dimensionality deffd_{\text{eff}} and the maximal degrees of freedom dmofd_{\text{mof}}. Table 1 summarizes the experiments. It is often the case that deffdmofd_{\text{eff}}\ll d_{\text{mof}} and R(f^L)/R(f^K)1\mathcal{R}(\hat{f}_{L})/\mathcal{R}(\hat{f}_{K})\simeq 1, in agreement with Theorem 3.

Conclusion

We showed in this paper that in the case of kernel ridge regression, the sampling complexity of the Nyström method can be reduced to the effective dimensionality of the problem, hence bridging and improving upon different previous attempts that established weaker forms of this result. This was achieved by defining a natural analog to the notion of leverage scores in this statistical context, and using it as a column sampling distribution. We obtained this result by combining and improving upon results that have emerged from two different perspectives on low rank matrix approximation. We also present a way to approximate these scores that is computationally tractable, i.e. runs in time O(np2)O(np^{2}) with pp depending only on the trace of the kernel matrix and the regularization parameter. One natural unanswered question is whether it is possible to further reduce the sampling complexity, or is the effective dimensionality also a lower bound on pp? And as pointed out by previous work , it is likely that the same results hold for smooth losses beyond the squared loss (e.g. logistic regression). However the situation is unclear for non-smooth losses (e.g. support vector regression).

An earlier draft of this paper contained a mistake in the proof of Theorem 5. We thank Xixian Chen for signaling it to us. We thank Francis Bach for stimulating discussions and for contributing to a rectified proof. We thank Jason Lee and Aaditya Ramdas for fruitful discussions regarding the same issue. We thank Yuchen Zhang for pointing out the connection to his work.

References

Appendix A Proof of Theorem 5

Note: This proof is inspired by one of Bach . We extend their result to the case of a general sketching matrix SS. Moreover, we believe their argument contains two problematic statements (about monotonicity of the bias) that we rectify with Lemma 2 and Lemma 3 below. Their result therefore holds also true with minimal change based on this argument.

For kernel ridge regression, the bias of the estimator f^K\hat{f}_{K} can be expressed as

Let K=UΣUK=U\Sigma U^{\top} where UU is orthogonal and Σ\Sigma diagonal positive. We have

with Φ=Σ(Σ+nγI)1\Phi=\Sigma(\Sigma+n\gamma I)^{-1}. If λmax(D)t\lambda_{\max}(D)\leq t for t(0,1)t\in(0,1) then

If 0KLγnγ1tI0\preceq K-L_{\gamma}\preceq\frac{n\gamma}{1-t}I then bias(Lγ)(1+γ/λ1t)bias(K)\text{bias}(L_{\gamma})\leq\left(1+\frac{\gamma/\lambda}{1-t}\right)\text{bias}(K).

If 0KLγnγ1tI0\preceq K-L_{\gamma}\preceq\frac{n\gamma}{1-t}I and λ11tSop2λmax(K)n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\cdot\frac{\lambda_{\max}(K)}{n} then the map γbias(Lγ)\gamma\rightarrow\text{bias}(L_{\gamma}) is increasing. This in particular implies that under the same conditions, bias(L)bias(Lγ)\text{bias}(L)\leq\text{bias}(L_{\gamma}).

Proof of Lemma 1. With K=UΣUK=U\Sigma U^{\top} and R=Σ1/2USR=\Sigma^{1/2}U^{\top}S, Lˉγ=R(RR+nγI)1R\bar{L}_{\gamma}=R(R^{\top}R+n\gamma I)^{-1}R^{\top}, we have

Due to the matrix inversion lemma, we have

and Φ=Σ(Σ+nγI)1\Phi=\Sigma(\Sigma+n\gamma I)^{-1}. This shows that for any γ0\gamma\geq 0

Now if λmax(D)t\lambda_{\max}(D)\leq t for t(0,1)t\in(0,1),

Proof of Lemma 2. This proof was communicated to us by Francis Bach .

Since KLγK-L_{\gamma} commutes with the identity, we have

Proof of Lemma 3. Let φ(γ)=f(Lγ+nλI)2f\varphi(\gamma)={f^{*}}^{\top}(L_{\gamma}+n\lambda I)^{-2}f^{*}. The task is to prove that φ\varphi is increasing if λ11tSop2λmax(K)n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\frac{\lambda_{\max}(K)}{n}. We do so by computing the derivative of φ\varphi and showing that φ0\varphi^{\prime}\geq 0. Let γ,γ>0\gamma,\gamma^{\prime}>0. We have

Now we compute the terms LγLγL_{\gamma^{\prime}}-L_{\gamma} and Lγ2Lγ2L_{\gamma^{\prime}}^{2}-L_{\gamma}^{2}:

The first term is the last equality above is equal to

Now combining the above and taking the limit γγ\gamma^{\prime}\rightarrow\gamma we have

Therefore, the function φ\varphi is increasing for all γ\gamma such that Q0Q\succeq 0, and the latter is true if 2nλλmin(Qˉ)2n\lambda\geq-\lambda_{\min}(\bar{Q}). Moreover, since Qˉ\bar{Q} is symmetric we have

and it is sufficient to verify the condition

Now we finish the proof by showing that the above operator norm is smaller than 11tSop2λmax(K)\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K). We have

Taking operator norms, and using the assumption 0KLγnγ1tI0\preceq K-L_{\gamma}\preceq\frac{n\gamma}{1-t}I,

Hence, (11) is satisfied if nλ11tSop2λmax(K)n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K) therefore concluding the proof. ∎

Appendix B Proof of Theorem 7

The proof uses the matrix Bernstein inequality (see e.g. Theorem 6.1.1 in ):

Consider a sequence (Xk)(X_{k}) of independent random symmetric matrices with dimension dd. Assume that E(Xk)=0\mathsf{E}(X_{k})=0, λmax(Xk)R\lambda_{\max}(X_{k})\leq R, and let Y=kXkY=\sum_{k}X_{k}. Furthermore, assume that there exists σ>0\sigma>0 such that E(Y2)2σ2\|\mathsf{E}(Y^{2})\|_{2}\leq\sigma^{2}. Then

Next , we exhibit the sequence (Xk)(X_{k}) and YY in our case. We have

where (zik)1im(z_{ik})_{1\leq i\leq m} are i.i.d. binary random vectors for k{1,,p}k\in\{1,\cdots,p\} with Pr(zik=1)=pi\mathsf{Pr}(z_{ik}=1)=p_{i} (i.e. (zik)1im(z_{ik})_{1\leq i\leq m} is the indicator of the chosen column at trial kk). Let Y=ΨΨΨSSΨY=\Psi\Psi^{\top}-\Psi SS^{\top}\Psi^{\top}, then

We choose XkX_{k} to be 1pi=1m(1zikpi)ψiψi\frac{1}{p}\sum_{i=1}^{m}(1-\frac{z_{ik}}{p_{i}})\psi_{i}\psi_{i}^{\top} for every k{1,,p}k\in\{1,\cdots,p\}. Now we verify the assumptions of the above theorem. The matrices XkX_{k} inherit independence from the random vectors (zik)1im(z_{ik})_{1\leq i\leq m}, and we have E(Xk)=0\mathsf{E}(X_{k})=0, and λmax(Xk)1pλmax(i=1mψiψi)=1pλmax(ΨΨ)\lambda_{\max}(X_{k})\leq\frac{1}{p}\lambda_{\max}(\sum_{i=1}^{m}\psi_{i}\psi_{i}^{\top})=\frac{1}{p}\lambda_{\max}(\Psi\Psi^{\top}). Now we control the spectral norm of the second moment of YY. Again with E(Xk)=0\mathsf{E}(X_{k})=0 we have E(Y2)=k,k=1pE(XkXk)=k=1pE(Xk2).\mathsf{E}(Y^{2})=\sum_{k,k^{\prime}=1}^{p}\mathsf{E}(X_{k}X_{k^{\prime}})=\sum_{k=1}^{p}\mathsf{E}(X_{k}^{2}). And for k{1,,p}k\in\{1,\cdots,p\}

To proceed, observe that for iii\neq i^{\prime}, zikzik=0z_{ik}z_{i^{\prime}k}=0 since only one column is chosen at a time. This yields

Given that the probability distribution (pi)(p_{i}) verifies piβψi22ΨF2p_{i}\geq\beta\frac{\|\psi_{i}\|_{2}^{2}}{\|\Psi\|_{F}^{2}}, we get E(Y2)ΨF2βpi=1mψiψi=ΨF2βpΨΨ.\mathsf{E}(Y^{2})\preceq\frac{\|\Psi\|_{F}^{2}}{\beta p}\sum_{i=1}^{m}\psi_{i}\psi_{i}^{\top}=\frac{\|\Psi\|_{F}^{2}}{\beta p}\Psi\Psi^{\top}. Hence E(Y2)2ΨF2βpλmax(ΨΨ).\|\mathsf{E}(Y^{2})\|_{2}\leq\frac{\|\Psi\|_{F}^{2}}{\beta p}\lambda_{\max}(\Psi\Psi^{\top}). We now apply the theorem with R=1pλmax(ΨΨ)R=\frac{1}{p}\lambda_{\max}(\Psi\Psi^{\top}) and σ2=ΨF2βpλmax(ΨΨ)\sigma^{2}=\frac{\|\Psi\|_{F}^{2}}{\beta p}\lambda_{\max}(\Psi\Psi^{\top}) which leads to the desired result.

Appendix C Proof of Theorem 3

First of all, we observe that the variance of the estimator f^K\hat{f}_{K} is matrix-increasing as a function of KK. Indeed, we have

where λj(K)\lambda_{j}(K) is the jjth eigenvalue of KK arranged in a decreasing order. The function xx2(x+nλ)2x\rightarrow\frac{x^{2}}{(x+n\lambda)^{2}} is increasing for x0x\geq 0. Moreover, if LKL\preceq K then by the Courant-Fischer minimax principle λj(L)λj(K)\lambda_{j}(L)\leq\lambda_{j}(K) for all jj (e.g. see Corollary III.1.2 in ).

Now, using Theorem 5 combined with the above fact, we have

We set γ=λϵ\gamma=\lambda\epsilon and t=1/2t=1/2. The above holds if \lambda_{\max}\Big{(}\Phi-\Phi^{1/2}U^{\top}SS^{\top}U\Phi^{1/2}\Big{)}\leq t and nλ11tSop2λmax(K)n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K). Now let Ψ=Φ1/2U\Psi=\Phi^{1/2}U^{\top}. Then we have ψi22=li(γ)\|\psi_{i}\|_{2}^{2}=l_{i}(\gamma) and ΨF2=deff\|\Psi\|_{F}^{2}=d_{\text{eff}}. Using Theorem 7 on Ψ\Psi, and given that λmax(ΨΨ)=λmax(Φ)1\lambda_{\max}(\Psi\Psi^{\top})=\lambda_{\max}(\Phi)\leq 1, for the result to hold with probability at least 1ρ1-\rho, it is sufficient to set pp such that n\exp\Big{(}\frac{-p(1/2)^{2}/2}{d_{\text{eff}}/\beta+1/6}\Big{)}\leq\rho which gives the desired lower bound p8(deff/β+1/6)log(nρ)p\geq 8(d_{\text{eff}}/\beta+1/6)\log\left(\frac{n}{\rho}\right).

Remark: Note that if one uses the regularized Nyström approximation Lγ=KS(SKS+nγI)1SKL_{\gamma}=KS(S^{\top}KS+n\gamma I)^{-1}S^{\top}K with γ=λϵ\gamma=\lambda\epsilon instead of L=KS(SKS)SKL=KS(S^{\top}KS)^{\dagger}S^{\top}K in the algorithm then the proof would now be complete and the condition condition nλ11tSop2λmax(K)n\lambda\geq\frac{1}{1-t}\|S\|_{\text{op}}^{2}\lambda_{\max}(K) is not necessary. If one uses LL, then this latter condition needs to be verified to insure monotonicity of the bias (see Lemma 3).

Now it remains to control the operator norm of the sketching matrix SS appearing in the lower bound on λ\lambda. To this end we use a variant of the matrix Bernstein inequality (Theorem 5) for controlling operator norms of random matrices (see Corollary 6.2.1 in ).

Consider a sequence (Xk)(X_{k}) of independent random symmetric matrices with dimension d×dd\times d. Assume that E(Xk)=0\mathsf{E}(X_{k})=0, XkopR\|X_{k}\|_{\text{op}}\leq R, and let Y=kXkY=\sum_{k}X_{k}. Furthermore, assume that there exists σ>0\sigma>0 such that E(Y2)opσ2\|\mathsf{E}(Y^{2})\|_{\text{op}}\leq\sigma^{2}. Then

with l=minili(λϵ)\underline{l}=\min_{i}l_{i}(\lambda\epsilon). On the other hand,

By choosing σ2=R=1pdeffβl\sigma^{2}=R=\frac{1}{p}\frac{d_{\text{eff}}}{\beta\underline{l}}, we have SSIopt\|SS^{\top}-I\|_{\text{op}}\leq t with probability at least 12nexp(t2/2R(1+t/3))1-2n\exp\left(-\frac{t^{2}/2}{R(1+t/3)}\right). Taking t=max{1 , 8deff3βlplog(2nρ)}t=\max\left\{1~{},~{}\frac{8d_{\text{eff}}}{3\beta\underline{l}\cdot p}\log\left(\frac{2n}{\rho}\right)\right\}, the latter probability is greater than 1ρ1-\rho, and by the triangle inequality: Sop21+t\|S\|_{\text{op}}^{2}\leq 1+t with the same probability. By taking p8(deff/β+1/6)log(nρ)p\geq 8(d_{\text{eff}}/\beta+1/6)\log\left(\frac{n}{\rho}\right) (thereby verifying the condition from the previous paragraph) we have

if n2n\geq 2, and therefore Sop21+1/l\|S\|_{\text{op}}^{2}\leq 1+1/\underline{l} (since l1\underline{l}\leq 1) with probability at least 1ρ1-\rho.

Appendix D Proof of Theorem 4

for t0t\geq 0 with Ψ=Φ1/2U\Psi=\Phi^{1/2}U^{\top}, Φ=Σ(Σ+nγI)1\Phi=\Sigma(\Sigma+n\gamma I)^{-1} then

by choosing pi=Kii/Tr(K)p_{i}=K_{ii}/\mathsf{Tr}(K), we have piβ li(λϵ)/i=1nli(λϵ)p_{i}\geq\beta~{}l_{i}(\lambda\epsilon)/\sum_{i=1}^{n}l_{i}(\lambda\epsilon) with β=nλϵdeff/Tr(K)\beta=n\lambda\epsilon d_{\text{eff}}/\mathsf{Tr}(K), which yields deff/β=Tr(K)/(nλϵ)d_{\text{eff}}/\beta=\mathsf{Tr}(K)/(n\lambda\epsilon).

As for the multiplicative error bound, using KLγnγ1tK(K+nγ)1K-L_{\gamma}\preceq\frac{n\gamma}{1-t}K(K+n\gamma)^{-1} we get

For t=1/2t=1/2, Inγ1t(K+nγI)1=(KnγI)(K+nγI)1σnnγσn+nγII-\frac{n\gamma}{1-t}(K+n\gamma I)^{-1}=(K-n\gamma I)(K+n\gamma I)^{-1}\succeq\frac{\sigma_{n}-n\gamma}{\sigma_{n}+n\gamma}I. The result follows.