Fast Randomized Kernel Methods With Statistical Guarantees
Ahmed El Alaoui, Michael W. Mahoney
Introduction
We consider the low-rank approximation of symmetric positive semi-definite (SPSD) matrices that arise in machine learning and data analysis, with an emphasis on obtaining good statistical guarantees. This is of interest primarily in connection with kernel-based machine learning methods. Recent work in this area has focused on one or the other of two very different perspectives: an algorithmic perspective, where the focus is on running time issues and worst-case quality-of-approximation guarantees, given a fixed input matrix; and a statistical perspective, where the goal is to obtain good inferential properties, under some hypothesized model, by using the low-rank approximation in place of the full kernel matrix. The recent results of Gittens and Mahoney provide the strongest example of the former, and the recent results of Bach are an excellent example of the latter. In this paper, we combine ideas from these two lines of work in order to obtain a fast randomized kernel method with statistical guarantees that are improved relative to the state-of-the-art.
In this paper, we extend these ideas, and we show that—from the statistical perspective—we are able to obtain a low-rank approximation that comes with improved statistical guarantees by using a variant of this more traditional notion of statistical leverage. In particular, we improve the recent bounds of Bach , which provides the first known statistical convergence result when substituting the kernel matrix by its low-rank approximation. To understand the connection, recall that a key component of Bach’s approach is the quantity , which he calls the maximal marginal degrees of freedom.We will refer to it as the maximal degrees of freedom. Bach’s main result is that by constructing a low-rank approximation of the original kernel matrix by sampling uniformly at random columns, i.e., performing the vanilla Nyström method, and then by using this low-rank approximation in a prediction task, the statistical performance is within a factor of of the performance when the entire kernel matrix is used. Here, we show that this uniform sampling is suboptimal. We do so by sampling with respect to a coarse but quickly-computable approximation of a variant to the statistical leverage scores, given in Definition 8 below, and we show that we can obtain similar guarantees by sampling only columns, where . The quantity is called the effective dimensionality of the learning problem, and it can be interpreted as the implicit number of parameters in this nonparametric setting .
We expect that our results and insights will be useful much more generally. As an example of this, we can directly compare the Nyström sampling method to a related divide-and-conquer approach, thereby answering an open problem of Zhang et al. . Recall that the Zhang et al. divide-and-conquer method consists of dividing the dataset into random partitions of equal size, computing estimators on each partition in parallel, and then averaging the estimators. They prove the minimax optimality of their estimator, although their multiplicative constants are suboptimal; and, in terms of the number of kernel evaluations, their method requires , with in the order of , which gives a total number of evaluations. They noticed that the scaling of their estimator was not directly comparable to that of the Nyström sampling method (which was proven to only require evaluations, if the sampling is uniform ), and they left it as an open problem to determine which if either method is fundamentally better than the other. Using our Theorem 3, we are able to put both results on a common ground for comparison. Indeed, the estimator obtained by our non-uniform Nyström sampling requires only kernel evaluations (compared to and ), and it obtains the same bound on the statistical predictive performance as in . In this sense, our result combines “the best of both worlds,” by having the reduced sample complexity of and the sharp approximation bound of .
Preliminaries and notation
Let be pairs of points in , where is the input space and is the response space. The kernel-based learning problem can be cast as the following minimization problem:
where . We let be the eigenvalue decomposition of , with , , and an orthogonal matrix. The underlying data model is
More generally, if we let be an arbitrary sketching matrix, i.e., a tall and skinny matrix that, when left-multiplied by , produces a “sketch” of that preserves some desirable properties, then the Nyström approximation associated with is
For instance, for random sampling algorithms, would contain a non-zero entry at position if the -th column of is chosen at the -th trial of the sampling process. Alternatively, could also be a random projection matrix; or could be constructed with some other (perhaps deterministic) method, as long as it verifies some structural properties, depending on the application .
We will focus in this paper on analyzing this approximation in the statistical prediction context related to the estimation of by solving Problem (2). We proceed by revisiting and improving upon prior results from three different areas. The first result (Theorem 5) is on the behavior of the bias of , when is constructed using a general sketching matrix . This result underlies the statistical analysis of the Nyström method. To see this, first, it is not hard to prove that in the sense of usual the order on the positive semi-definite cone. Second, one can prove that the variance is matrix-increasing, hence the variance will decrease when replacing by . On the other hand, the bias (while not matrix monotone in general) can be proven to not increase too much when replacing by . This latter statement will be the main technical difficulty for obtaining a bound on (see Appendix A). A form of this result is due to Bach in the case where is a uniform sampling matrix. The second result (Theorem 7) is a concentration bound for approximating matrix multiplication when the rank-one components of the product are sampled non uniformly. This result is derived from the matrix Bernstein inequality, and yields a sharp quantification of the deviation of the approximation from the true product. The third result (Definition 8) is an extension of the definition of the leverage scores to the context of kernel ridge regression. Whereas the notion of leverage is established as an algorithmic tool in randomized linear algebra, we introduce a natural counterpart of it to this statistical setting. By combining these contributions, we are able to give a sharp statistical statement on the behavior of the Nyström method if one is allowed to sample non uniformly. All the proofs are deferred to the appendix.
Revisiting prior work and new results
We begin by stating a “structural” result that upper-bounds the bias of the estimator constructed using the approximation . This result is deterministic: it only depends on the properties of the input data, and holds for any sketching matrix that satisfies certain conditions. This way the randomness of the construction of is decoupled from the rest of the analysis. We highlight the fact that this view offers a possible way of improving the current results since a better construction of -whether deterministic or random- satisfying the data-related conditions would immediately lead to down stream algorithmic and statistical improvements in this setting.
In the special case where contains one non zero entry equal to in every column with the number of sampled columns, the result and its proof can be found in (appendix B.2), although we believe that their argument contains a problematic statement. We propose an alternative and complete proof in Appendix A. The subsequent analysis unfolds in two steps: (1) assuming the sketching matrix satisfies the conditions stated in Theorem 5, we will have , and (2) matrix concentration is used to show that an appropriate random construction of satisfies the said conditions. We start by stating the concentration result that is the source of our improvement (section 3.2), define a notion of statistical leverage scores (section 3.3), and then state and prove the main statistical result (Theorem 3 section 3.4). We then present our main algorithmic result consisting of a fast approximation to this new notion of leverage scores (section 3.5).
2 A concentration bound on matrix multiplication
Next, we state our result for approximating matrix products of the form when a few columns from are sampled to form the approximate product where contains the chosen columns. The proof relies on a matrix Bernstein inequality (see e.g. ) and is presented at the end of the paper (Appendix B).
Remarks: 1. This result will be used for , in conjunction with Theorem 5 to prove our main result in Theorem 3. Notice that is a scaled version of the eigenvectors, with a scaling given by the diagonal matrix which should be considered as “soft projection” matrix that smoothly selects the top part of the spectrum of . The setting of Gittens et al. , in which is a 0-1 diagonal is the closest analog of our setting.
2. It is known that is the optimal sampling distribution in terms of minimizing the expected error . The above result exhibits a robustness property by allowing the chosen sampling distribution to be different from the optimal one by a factor .In their work , Drineas et al. have a comparable robust statement for controlling the expected error. Our result is a robust quantification of the tail probability of the error, which is a much stronger statement. The sub-optimality of such a distribution is reflected in the upper bound (7) by the amplification of the squared Frobenius norm of by a factor . For instance, if the sampling distribution is chosen to be uniform, i.e. , then the value of for which (6) is tight is in which case we recover a concentration result proven by Bach . Note that Theorem 7 is derived from one of the state-of-the-art bounds on matrix concentration, but it is one among many others in the literature; and while it constitutes the base of our improvement, it is possible that a concentration bound more tailored to the problem might yield sharper results.
3 An extended definition of leverage
We introduce an extended notion of leverage scores that is specifically tailored to the ridge regression problem, and that we call the -ridge leverage scores.
For , the -ridge leverage scores associated with the kernel matrix and the parameter are
Note that is the diagonal entry of . The quantities are in this setting the analogs of the so-called leverage scores in the statistical literature, as they characterize the data points that “stick out”, and consequently that most affect the result of a statistical procedure. They are classically defined as the row norms of the left singular matrix of the input matrix, and they have been used in regression diagnostics for outlier detection , and more recently in randomized matrix algorithms as they often provide an optimal importance sampling distribution for constructing random sketches for low rank approximation and least squares regression when the input matrix is tall and skinny (). In the case where the input matrix is square, this definition is vacuous as the row norms of are all equal to 1. Recently, Gittens and Mahoney used a truncated version of these scores (that they called leverage scores relative to the best rank- space) to obtain the best algorithmic results known to date on low rank approximation of positive semi-definite matrices. Definition 8 is a weighted version of the classical leverage scores, where the weights depend on the spectrum of and a regularization parameter . In this sense, it is an interpolation between Gittens’ scores and the classical (tall-and-skinny) leverage scores, where the parameter plays the role of a rank parameter. In addition, we point out that Bach’s maximal degrees of freedom is to the -ridge leverage scores what the coherence is to Gittens’ leverage scores, i.e. their (scaled) maximum value: ; and that while the sum of Gittens’ scores is the rank parameter , the sum of the -ridge leverage scores is the effective dimensionality . We argue in the following that Definition 8 provides a relevant notion of leverage in the context of kernel ridge regression. It is the natural counterpart of the algorithmic notion of leverage in the prediction context. We use it in the next section to make a statistical statement on the performance of the Nyström method.
4 Main statistical result: an error bound on approximate kernel ridge regression
Now we are able to give an improved version of a theorem by Bach that establishes a performance guaranty on the use of the Nyström method in the context of kernel ridge regression. It is improved in the sense that the sufficient number of columns that should be sampled in order to incur no (or little) loss in the prediction performance is lower. This is due to a more data-sensitive way of sampling the columns of (depending on the -ridge leverage scores) during the construction of the approximation . The proof is in Appendix C.
Let , , and be a Nyström approximation of by choosing columns randomly with replacement according to a probability distribution such that for some . Let . If
with then
with probability at least , where are introduced in Definition 8 and is defined in (3).
Theorem 3 asserts that substituting the kernel matrix by a Nyström approximation of rank in the KRR problem induces an arbitrarily small prediction loss, provided that scales linearly with the effective dimensionality Note that depends on the precision parameter , which is absent in the classical definition of the effective dimensionality However, the following bound holds: . and that is not too smallThis condition on is not necessary if one constructs as (see proof).. The leverage-based sampling appears to be crucial for obtaining this dependence, as the -ridge leverage scores provide information on which columns -and hence which data points- capture most of the difficulty of the estimation problem. Also, as a sanity check, the smaller the target accuracy , the higher , and the more uniform the sampling distribution becomes. In the limit , is in the order of and the scores are uniform, and the method is essentially equivalent to using the entire matrix . Moreover, if the sampling distribution is a factor away from optimal, a slight oversampling (i.e. increase by ) achieves the same performance. In this sense, the above result shows robustness to the sampling distribution. This property is very beneficial from an implementation point of view, as the error bounds still hold when only an approximation of the leverage scores is available. If the columns are sampled uniformly, a worse lower bound on that depends on is obtained .
5 Main algorithmic result: a fast approximation to the λ𝜆\lambda-ridge leverage scores
Although the -ridge leverage scores can be naively computed using SVD, the exact computation is as costly as solving the original Problem (2). Therefore, the central role they play in the above result motivates the problem of a fast approximation, in a similar way the importance of the usual leverage scores has motivated Drineas et al. to approximate them is random projection time . A success in this task will allow us to combine the running time benefits with the improved statistical guarantees we have provided.
Inputs: data points , probability vector , sampling parameter , , .
Sample data points from with replacement with probabilities .
Compute the corresponding columns of the kernel matrix.
where is the -th row of , and return it.
Let , and . Let be a Nyström approximation of by choosing columns at random with probabilities , . If
then we have
Remarks: 1. Theorem 4 states that if the columns of are sampled proportionally to then is a sufficient number of samples. Recall that , so our procedure is akin to sampling according to the squared lengths of the data vectors, which has been extensively used in different contexts of randomized matrix approximation .
2. Due to how is defined in eq. (1) the in the denominator is artificial: should be thought of as a “rescaled” regularization parameter . In some settings, the that yields the best generalization error scales like , hence is sufficient. On the other hand, if the columns are sampled uniformly, one would get .
Experiments
We test our results based on several datasets: one synthetic regression problem from to illustrate the importance of the -ridge leverage scores, the Pumadyn family consisting of three datasets pumadyn-32fm, pumadyn-32fh and pumadyn-32nh http://www.cs.toronto.edu/~delve/data/pumadyn/desc.html and the Gas Sensor Array Drift Dataset from the UCI databasehttps://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset. The synthetic case consists of a regression problem on the interval where, given a sequence and a sequence of noise , we observe the sequence
The function belongs to the RKHS generated by the kernel where is the -th Bernoulli polynomial . One important feature of this regression problem is the distribution of the points on the interval : if they are spread uniformly over the interval, the -ridge leverage scores are uniform for every , and uniform column sampling is optimal in this case. In fact, if for , the kernel matrix is a circulant matrix , in which case, we can prove that the -ridge leverage scores are constant. Otherwise, if the data points are distributed asymmetrically on the interval, the -ridge leverage scores are non uniform, and importance sampling is beneficial (Figure 1). In this experiment, the data points have been generated with a distribution symmetric about , having a high density on the borders of the interval and a low density on the center of the interval. The number of observations is . On Figure 1, we can see that there are few data points with high leverage, and those correspond to the region that is underrepresented in the data sample (i.e. the region close to the center of the interval since it is the one that has the lowest density of observations). The -ridge leverage scores are able to capture the importance of these data points, thus providing a way to detect them (e.g. with an analysis of outliers), had we not known their existence.
For all datasets, we determine and the band width of by cross validation, and we compute the effective dimensionality and the maximal degrees of freedom . Table 1 summarizes the experiments. It is often the case that and , in agreement with Theorem 3.
Conclusion
We showed in this paper that in the case of kernel ridge regression, the sampling complexity of the Nyström method can be reduced to the effective dimensionality of the problem, hence bridging and improving upon different previous attempts that established weaker forms of this result. This was achieved by defining a natural analog to the notion of leverage scores in this statistical context, and using it as a column sampling distribution. We obtained this result by combining and improving upon results that have emerged from two different perspectives on low rank matrix approximation. We also present a way to approximate these scores that is computationally tractable, i.e. runs in time with depending only on the trace of the kernel matrix and the regularization parameter. One natural unanswered question is whether it is possible to further reduce the sampling complexity, or is the effective dimensionality also a lower bound on ? And as pointed out by previous work , it is likely that the same results hold for smooth losses beyond the squared loss (e.g. logistic regression). However the situation is unclear for non-smooth losses (e.g. support vector regression).
An earlier draft of this paper contained a mistake in the proof of Theorem 5. We thank Xixian Chen for signaling it to us. We thank Francis Bach for stimulating discussions and for contributing to a rectified proof. We thank Jason Lee and Aaditya Ramdas for fruitful discussions regarding the same issue. We thank Yuchen Zhang for pointing out the connection to his work.
References
Appendix A Proof of Theorem 5
Note: This proof is inspired by one of Bach . We extend their result to the case of a general sketching matrix . Moreover, we believe their argument contains two problematic statements (about monotonicity of the bias) that we rectify with Lemma 2 and Lemma 3 below. Their result therefore holds also true with minimal change based on this argument.
For kernel ridge regression, the bias of the estimator can be expressed as
Let where is orthogonal and diagonal positive. We have
with . If for then
If then .
If and then the map is increasing. This in particular implies that under the same conditions, .
Proof of Lemma 1. With and , , we have
Due to the matrix inversion lemma, we have
and . This shows that for any
Now if for ,
Proof of Lemma 2. This proof was communicated to us by Francis Bach .
Since commutes with the identity, we have
Proof of Lemma 3. Let . The task is to prove that is increasing if . We do so by computing the derivative of and showing that . Let . We have
Now we compute the terms and :
The first term is the last equality above is equal to
Now combining the above and taking the limit we have
Therefore, the function is increasing for all such that , and the latter is true if . Moreover, since is symmetric we have
and it is sufficient to verify the condition
Now we finish the proof by showing that the above operator norm is smaller than . We have
Taking operator norms, and using the assumption ,
Hence, (11) is satisfied if therefore concluding the proof. ∎
Appendix B Proof of Theorem 7
The proof uses the matrix Bernstein inequality (see e.g. Theorem 6.1.1 in ):
Consider a sequence of independent random symmetric matrices with dimension . Assume that , , and let . Furthermore, assume that there exists such that . Then
Next , we exhibit the sequence and in our case. We have
where are i.i.d. binary random vectors for with (i.e. is the indicator of the chosen column at trial ). Let , then
We choose to be for every . Now we verify the assumptions of the above theorem. The matrices inherit independence from the random vectors , and we have , and . Now we control the spectral norm of the second moment of . Again with we have And for
To proceed, observe that for , since only one column is chosen at a time. This yields
Given that the probability distribution verifies , we get Hence We now apply the theorem with and which leads to the desired result.
Appendix C Proof of Theorem 3
First of all, we observe that the variance of the estimator is matrix-increasing as a function of . Indeed, we have
where is the th eigenvalue of arranged in a decreasing order. The function is increasing for . Moreover, if then by the Courant-Fischer minimax principle for all (e.g. see Corollary III.1.2 in ).
Now, using Theorem 5 combined with the above fact, we have
We set and . The above holds if \lambda_{\max}\Big{(}\Phi-\Phi^{1/2}U^{\top}SS^{\top}U\Phi^{1/2}\Big{)}\leq t and . Now let . Then we have and . Using Theorem 7 on , and given that , for the result to hold with probability at least , it is sufficient to set such that n\exp\Big{(}\frac{-p(1/2)^{2}/2}{d_{\text{eff}}/\beta+1/6}\Big{)}\leq\rho which gives the desired lower bound .
Remark: Note that if one uses the regularized Nyström approximation with instead of in the algorithm then the proof would now be complete and the condition condition is not necessary. If one uses , then this latter condition needs to be verified to insure monotonicity of the bias (see Lemma 3).
Now it remains to control the operator norm of the sketching matrix appearing in the lower bound on . To this end we use a variant of the matrix Bernstein inequality (Theorem 5) for controlling operator norms of random matrices (see Corollary 6.2.1 in ).
Consider a sequence of independent random symmetric matrices with dimension . Assume that , , and let . Furthermore, assume that there exists such that . Then
with . On the other hand,
By choosing , we have with probability at least . Taking , the latter probability is greater than , and by the triangle inequality: with the same probability. By taking (thereby verifying the condition from the previous paragraph) we have
if , and therefore (since ) with probability at least .
Appendix D Proof of Theorem 4
for with , then
by choosing , we have with , which yields .
As for the multiplicative error bound, using we get
For , . The result follows.