Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels

Haim Avron, Vikas Sindhwani, Jiyan Yang, Michael Mahoney

Introduction

Kernel methods (Schölkopf and Smola, 2002; Wahba, 1990; Cucker and Smale, 2001) offer a comprehensive suite of mathematically well-founded non-parametric modeling techniques for a wide range of problems in machine learning. These include nonlinear classification, regression, clustering, semi-supervised learning (Belkin et al., 2006), time-series analysis (Parzen, 1970), sequence modeling (Song et al., 2010), dynamical systems (Boots et al., 2013), hypothesis testing (Harchaoui et al., 2013), causal modeling (Zhang et al., 2011) and many more.

Standard regularized linear statistical models in $\mathcal{H}$ then provide non-linear inference with respect to the original input representation. The algorithmic basis of such constructions are classical Representer Theorems (Wahba, 1990; Schölkopf and Smola, 2002) that guarantee finite-dimensional solutions of associated optimization problems, even if $\mathcal{H}$ is infinite-dimensional.

However, there is a steep price of these elegant generalizations in terms of scalability. Consider, for example, least squares regression given $n$ data points $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ and assume that $n\gg d$ . The complexity of linear regression training using standard least squares solvers is $O(nd^{2})$ , with $O(nd)$ memory requirements, and $O(d)$ prediction speed on a test point. Its kernel-based nonlinear counterpart, however, requires solving a linear system involving the Gram matrix of the kernel function (defined by $\mathbf{K}_{ij}=k(\mathbf{x}_{i},\mathbf{x}_{j})$ ). In general, this incurs $O(n^{3}+n^{2}d)$ complexity for training, $O(n^{2})$ memory requirements, and $O(nd)$ prediction time for a single test point – none of which are particularly appealing in “big data” settings. Similar conclusions apply to other algorithms such as Kernel PCA.

This is rather unfortunate, since non-parametric models, such as the ones produced by kernel methods, are particularly appealing in a big data settings as they can adapt to the full complexity of the underlying domain, as uncovered by increasing dataset sizes. It is well-known that imposing strong structural constraints upfront for the purpose of allowing an efficient solution (in the above example: a linear hypothesis space) often limits, both theoretically and empirically, the potential to deliver value on large amounts of data. Thus, as big data becomes pervasive across a number of application domains, it has become necessary to be able to develop highly scalable algorithms for kernel methods.

The starting point of Rahimi and Recht (2008) is a celebrated result that characterizes the class of positive definite functions:

Without loss of generality, we assume henceforth that $\mu(\cdot)$ is a probability measure with associated probability density function $p(\cdot)$ .

For the most notable member of the shift-invariant family of kernels – the Gaussian kernel:

the associated density is again Gaussian $\mathcal{N}(0,\sigma^{-2}\mathbf{I}_{d})$ .

The integral representation of the kernel (2) may be approximated as follows:

The goal of this work is to improve the convergence behavior of this approximation, so that a smaller $s$ can be used to get the same quality of approximation to the kernel function. This is motivated by recent work that showed that in order to obtain state-of-the-art accuracy on some important datasets, a very large number of random features is needed (Huang et al., 2014; Sindhwani and Avron, 2014).

Our point of departure from the work of Rahimi and Recht (2008) is the simple observation that when $\mathbf{w}_{1},\dots,\mathbf{w}_{s}$ are are drawn from the distribution defined by the density function $p(\cdot)$ , the approximation in (1) may be viewed as a standard Monte Carlo (MC) approximation to the integral representation of the kernel. Instead of using plain MC approximation, we propose to use the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce the integration error in approximations of the form (1). A self-contained overview of Quasi-Monte Carlo techniques for high-dimensional integration problems is provided in Section 2. In Section 3, we describe how QMC techniques apply to our setting.

We then proceed to apply an average case theoretical analysis of the integration error for any given sequence $S$ (Section 4). This bound motivates an optimization problem over the sequence $S$ whose minimizer provides adaptive QMC sequences fine tuned to our kernels (Section 5).

Finally, empirical results (Section 6) clearly demonstrate the superiority of QMC techniques over the MC feature maps (Rahimi and Recht, 2008), the correctness of our theoretical analysis and the potential value of adaptive QMC techniques for large-scale kernel methods.

Preliminaries

In “MC sequence” we mean points drawn randomly either from the unit cube or certain distribution that will be clear from the text. For “QMC sequence” we mean a deterministic sequence designed to reduce the integration error. Typically, it will be a low-discrepancy sequence on the unit cube.

It is also useful to recall the definition of Reproducing Kernel Hilbert Space (RKHS).

$\langle f,h(\mathbf{x},\cdot)\rangle_{\mathcal{H}}=f(\mathbf{x})$ (Reproducing Property)

Equivalently, RKHSs are Hilbert spaces with bounded, continuous evaluation functionals. Informally, they are Hilbert spaces with the nice property that if two functions $f,g\in\mathcal{H}$ are close in the sense of the distance derived from the norm in $\mathcal{H}$ (i.e., $\|f-g\|_{\mathcal{H}}$ is small), then their values $f(\mathbf{x}),g(\mathbf{x})$ are also close for all $\mathbf{x}\in\mathcal{X}$ ; in other words, the norm controls the pointwise behavior of functions in $\mathcal{H}$ (Berlinet and Thomas-Agnan, 2004).

2 Related Work

In this section we discuss related work on scalable kernel methods. Relevant work on QMC methods is discussed in the next subsection.

Scalability has long been identified as a key challenge associated with deploying kernel methods in practice. One dominant line of work constructs low-rank approximations of the Gram matrix, either using data-oblivious randomized feature maps to approximate the kernel function, or using sampling techniques such as the classical Nyström method (Williams and Seeger, 2001). In its vanilla version, the latter approach - Nyström method - samples points from the dataset, computes the columns of the Gram matrix that corresponds to the sampled data points, and uses this partial computation of the Gram matrix to construct an approximation to the entire Gram matrix. More elaborate techniques exist, both randomized and deterministic; see Gittens and Mahoney (2013) for a thorough treatment.

Subsequently, there has been considerable effort given to extending this technique to other classes of kernels. Li et al. (2010) use Bochner’s theorem to provide random features to the wider class of group-invariant kernels. Maji and Berg (2009) suggested random features for the intersection kernel $k(\mathbf{x},\mathbf{z})=\sum^{d}_{i=1}\min(x_{i},z_{i})$ . Vedaldi and Zisserman (2012) developed feature maps for $\gamma$ -homogeneous kernels. Sreekanth et al. (2010) developed feature maps for generalized RBF kernels $k(\mathbf{x},\mathbf{z})=g(D(\mathbf{x},\mathbf{z})^{2})$ where $g(\cdot)$ is a positive definite function, and $D(\cdot,\cdot)$ is a distance metric. Kar and Karnick (2012) suggested feature maps for dot-product kernels. The feature maps are based on the Maclaurin expansion, which is guaranteed to be non-negative due to a classical result of Schoenberg (1942). Pham and Pagh (2013) suggested feature maps for the polynomial kernels. Their construction leverages known techniques from sketching theory. It can also be shown that their feature map is an oblivious subspace embedding, and this observation provides stronger theoretical guarantees than point-wise error bounds prevalent in the feature map literature (Avron et al., 2014). By invoking a variant of Bochner’s theorem that replaces the Fourier transform with the Laplace transform, Yang et al. (2014) obtained randomized feature maps for semigroup kernels on histograms. We note that while the original feature maps suggested by Rahimi and Recht were randomized, some of the aforementioned maps are deterministic.

Our work is more in-line with recent efforts on scaling up the random features, so that learning and prediction can be done faster. Le et al. (2013) return to the original construction of Rahimi and Recht (2008), and devise a clever distribution of random samples $\mathbf{w}_{1},\mathbf{w}_{2},\dots,\mathbf{w}_{s}$ that is structured so that the generation of random features can be done much faster. They showed that only a very limited concession in term of convergence rate is made. Hamid et al. (2014), working on the polynomial kernel, suggest first generating a very large amount of random features, and then applying them a low-distortion embedding based the Fast Johnson-Lindenstruass Transform, so the make the final size of the mapped vector rather small. In contrast, our work tries to design $\mathbf{w}_{1},\dots,\mathbf{w}_{s}$ so that less features will be necessary to get the same quality of kernel approximation.

Several other scalable approaches for large-scale kernel methods have been suggested over the years, starting from approaches such as chunking and decomposition methods proposed in the early days of SVM optimization literature. Raykar and Duraiswami (2007) use an improved fast Gauss transform for large scale Gaussian Process regression. There are also approaches that are more specific to the objective function at hand, e.g., Keerthi et al. (2006) builds a kernel expansion greedily to optimize the SVM objective function. Another well known approach is the Core Vector Machines (Tsang et al., 2005) which draws on approximation algorithms from computational geometry to scale up a class of kernel methods that can be reformulated in terms of the minimum enclosing ball problem.

For a broader discussion of these methods, and others, see Bottou et al. (2007).

3 Quasi-Monte Carlo Techniques: an Overview

In this section we provide a self-contained overview of Quasi-Monte Carlo (QMC) techniques. For brevity, we restrict our discussion to background that is necessary for understanding subsequent sections. We refer the interested reader to the excellent reviews by Caflisch (1998) and Dick et al. (2013), and the recent book Leobacher and Pillichschammer (2014) for a much more detailed exposition.

Consider the task of computing an approximation of the following integral

Define the integration error with respect to the point set $S$ as,

When $S$ is drawn randomly, the Central Limit Theorem asserts that if $s=\lvert S\rvert$ is large enough then $\epsilon_{S}[f]\approx\sigma[f]s^{-1/2}\boldsymbol{\nu}$ where $\boldsymbol{\nu}$ is a standard normal random variable, and $\sigma[f]$ is the square-root of the variance of $f$ ; that is,

In other words, the root mean square error of the Monte Carlo method is,

Therefore, the Monte Carlo method converges at a rate of $O(s^{-1/2})$ .

The aim of QMC methods is to improve the convergence rate by using a deterministic low-discrepancy sequence to construct $S$ , instead of randomly sampling points. The underlying intuition is illustrated in Figure 1, where we plot a set of 1000 two-dimensional random points (left graph), and a set of 1000 two-dimensional points from a quasi-random sequence (Halton sequence; right graph).

In the random sequence we see that there is an undesired clustering of points, and as a consequence empty spaces. Clusters add little to the approximation of the integral in those regions, while in the empty spaces the integrand is not sampled. This lack of uniformity is due to the fact that Monte Carlo samples are independent of each other. By carefully designing a sequence of correlated points to avoid such clustering effects, QMC attempts to avoid this phenomena, and thus provide faster convergence to the integral.

The theoretical apparatus for designing such sequences are inequalities of the form

in which $V(f)$ is a measure of the variation or difficulty of integrating $f(\cdot)$ and $D(S)$ is a sequence-dependent term that typically measures the discrepancy, or degree of deviation from uniformity, of the sequence $S$ . For example, the expected Monte Carlo integration error decouples into a variance term, and $s^{-1/2}$ as in (5).

A prototypical inequality of this sort is the following remarkable and classical result:

For any function $f$ with bounded variation, and sequence $S=\{\mathbf{w}_{1},\ldots,\mathbf{w}_{s}\}$ , the integration error is bounded above as follows,

where $V_{HK}$ is the Hardy-Krause variation of $f$ (see Niederreiter (1992)), which is defined in terms of the following partial derivatives,

and $D^{\star}$ is the star discrepancy defined by

where $\operatorname{disr}_{S}$ is the local discrepancy function

with $J_{\mathbf{x}}=[0,x_{1})\times[0,x_{2})\times\dots\times[0,x_{d})$ with $\operatorname{Vol}(J_{\mathbf{x}})=\prod_{j=1}^{d}x_{j}$ .

Given $\mathbf{x}$ , the second term in $\operatorname{disr}_{S}(\mathbf{x})$ is an estimate of the volume of $J_{\mathbf{x}}$ , which will be accurate if the points in $S$ are uniform enough. $D^{\star}(S)$ measures the maximum difference between the actual volume of the subregion $J_{\mathbf{x}}$ and its estimate for all $\mathbf{x}$ in $^{d}$ .

An infinite sequence $\mathbf{w}_{1},\mathbf{w}_{2},\dots$ is defined to be a low-discrepancy sequence if, as a function of $s$ , $D^{\star}(\{\mathbf{w}_{1},\dots,\mathbf{w}_{s}\})=O((\log s)^{d}/s)$ . Several constructions are know to be low-discrepancy sequences. One notable example is the Halton sequences, which are defined as follows. Let $p_{1},\dots,p_{d}$ be the first $d$ prime numbers. The Halton sequence $\mathbf{w}_{1},\mathbf{w}_{2},\dots$ of dimension $d$ is defined by

where for integers $i\geq 0$ and $b\geq 2$ we have

in which $i_{0},i_{1},\dots\in\{0,1,\dots,b-1\}$ is given by the unique decomposition

It is outside the scope of this paper to describe all these constructions in detail. However we mention that in addition to the Halton sequences, other notable members are Sobol’ sequences, Faure sequences, Niederreiter sequences, and more (see Dick et al. (2013), Section $2$ ). We also mention that it is conjectured that the $O((\log s)^{d}/s)$ rate for star discrepancy decay is optimal.

The classical QMC theory, which is based on the Koksma-Hlawka inequality and low discrepancy sequences, thus achieves a convergence rate of $O((\log s)^{d}/s)$ . While this is asymptotically superior to $O(s^{-1/2})$ for a fixed $d$ , it requires $s$ to be exponential in $d$ for the improvement to manifest. As such, in the past QMC methods were dismissed as unsuitable for very high-dimensional integration.

However, several authors noticed that QMC methods perform better than MC even for very high-dimensional integration (Sloan and Wozniakowski, 1998; Dick et al., 2013).Also see: “On the unreasonable effectiveness of QMC”, I.H. Sloan https://mcqmc.mimuw.edu.pl/Presentations/sloan.pdf Contemporary QMC literature explains and expands on these empirical observations, by leveraging the structure of the space in which the integrand function lives, to derive more refined bounds and discrepancy measures, even when classical measures of variation such as (6) are unbounded. This literature has evolved along at least two directions: one, where worst-case analysis is provided under the assumption that the integrands live in a Reproducing Kernel Hilbert Space (RKHS) of sufficiently smooth and well-behaved functions (see Dick et al. (2013), Section $3$ ) and second, where the analysis is done in terms of average-case error, under an assumed probability distribution over the integrands, instead of worst-case error (Wozniakowski, 1991; Traub and Wozniakowski, 1994). We refrain from more details, as these are essentially the paths that the analysis in Section 4 follows for our specific setting.

QMC Feature Maps: Our Algorithm

We assume that the density function in (2) can be written as $p(\mathbf{x})=\prod_{j=1}^{d}p_{j}(x_{j})$ , where $p_{j}(\cdot)$ is a univariate density function. The density functions associated to many shift-invariant kernels, e.g., Gaussian, Laplacian and Cauchy, admits such a form.

where $\Phi_{j}(\cdot)$ is the cumulative distribution function (CDF) of $p_{j}(\cdot)$ , for $j=1,\ldots,d$ . By setting $\mathbf{w}=\Phi^{-1}(\mathbf{t})$ , then (2) can be equivalently written as

Thus, a low discrepancy sequence $\mathbf{t}_{1},\dots,\mathbf{t}_{s}\in^{d}$ can be transformed using $\mathbf{w}_{i}=\Phi^{-1}(\mathbf{\mathbf{t}}_{i})$ , which is then plugged into (3) to yield the QMC feature map. This simple procedure is summarized in Algorithm 1. QMC feature maps are analyzed in the next section.

Theoretical Analysis and Average Case Error Bounds

The proofs for assertions made in this section and the next can be found in the Appendix.

The goal of this section is to develop a framework for analyzing the approximation quality of the QMC feature maps described in the previous section (Algorithm 1). We need to develop such a framework since the classical Koksma-Hlawka inequality cannot be applied to our setting, as the following proposition shows:

For any $p(\mathbf{x})=\prod_{j=1}^{d}p_{j}(x_{j})$ , where $p_{j}(\cdot)$ is a univariate density function, let

Our framework is based on a new discrepancy measure, box discrepancy, that characterizes integration error for the set of integrals defined with respect to the underlying data domain. Throughout this section we use the convention that $S=\{\mathbf{w}_{1},\ldots,\mathbf{w}_{s}\}$ , and the notation $\bar{\mathcal{X}}=\left\{\mathbf{x}-\mathbf{z}~{}|~{}\mathbf{x},\mathbf{z}\in\mathcal{X}\right\}$ .

Given a probability density function $p(\cdot)$ and $S$ , we define the integration error $\epsilon_{S,p}[f]$ of a function $f(\cdot)$ with respect to $p(\cdot)$ and the $s$ samples as,

We are interested in characterizing the behavior of $\epsilon_{S,p}[f]$ on $f\in\mathcal{F}_{\bar{\mathcal{X}}}$ where

As is common in modern QMC analysis (Leobacher and Pillichschammer, 2014; Dick et al., 2013), our analysis is based on setting up a Reproducing Kernel Hilbert Space of “nice” functions that is related to integrands that we are interested in, and using properties of the RKHS to derive bounds on the integration error. In particular, the integration error of integrands in an RKHS can be bounded using the following proposition.

In the theory of RKHS embeddings of probability distributions (Smola et al., 2007; Sriperumbudur et al., 2010), the function

is known as the kernel mean embedding of $p(\cdot)$ . The function

and consider the space of functions that admit an integral representation over $\mathcal{F}_{\Box\mathbf{b}}$ of the form

This space is associated with bandlimited functions, i.e., functions with compactly-supported inverse Fourier transforms, which are of fundamental importance in the Shannon-Nyquist sampling theory. Under a natural choice of inner product, these spaces are called Paley-Wiener spaces and they constitute an RKHS.

By $PW_{\mathbf{b}}$ , denote the space of functions which admit the representation in (10), with the inner product $\langle f,g\rangle_{PW_{\mathbf{b}}}=(2\pi)^{2d}\langle\hat{f},\hat{g}\rangle_{L_{2}(\Box\mathbf{b})}$ . $PW_{\mathbf{b}}$ is an RKHS with kernel function,

For notational convenience, in the above we define $\sin(b\cdot 0)/0$ to be $b$ . Furthermore, $\langle f,g\rangle_{PW_{\mathbf{b}}}=\langle f,g\rangle_{L_{2}(\Box\mathbf{b})}$ .

For $PW_{\mathbf{b}}$ the discrepancy measure $D_{h,S}$ in Proposition 6 can be written explicitly.

Suppose that $p(\cdot)$ is a probability density function, and that we can write $p(\mathbf{x})=\prod^{d}_{j=1}p_{j}(x_{j})$ where each $p_{j}(\cdot)$ is a univariate probability density function as well. Let $\varphi_{j}(\cdot)$ be the characteristic function associated with $p_{j}(\cdot)$ . Then,

This naturally leads to the definition of the box discrepancy, analogous to the star discrepancy described in Theorem 4.

The box discrepancy of a sequence $S$ with respect to $p(\cdot)$ is defined as,

For notational convenience, we generally omit the $\mathbf{b}$ from $D^{\Box\mathbf{b}}_{p}(S)$ as long as it is clear from the context.

The worse-case integration error bound for Paley-Wiener spaces is stated in the following as a corollary of Proposition 6. As explained earlier, this result not yet apply to functions in $\mathcal{F}_{\Box\mathbf{b}}$ because these functions are not part of $PW_{\mathbf{b}}$ . Nevertheless, we state it here for completeness.

Our main result shows that the expected square error of an integrand drawn from a uniform distribution over $\mathcal{F}_{\Box\mathbf{b}}$ is proportional to the square discrepancy measure $D^{\Box}_{p}(S)$ . This result is in the spirit of similar average case analysis in the QMC literature (Wozniakowski, 1991; Traub and Wozniakowski, 1994).

Let ${\cal U}(\mathcal{F}_{\Box\mathbf{b}})$ denote the uniform distribution on $\mathcal{F}_{\Box\mathbf{b}}$ . That is, $f\sim{\cal U}(\mathcal{F}_{\Box\mathbf{b}})$ denotes $f=f_{\mathbf{u}}$ where $f_{\mathbf{u}}(\mathbf{x})=e^{-i\mathbf{u}^{T}\mathbf{x}}$ and $\mathbf{u}$ is randomly drawn from a uniform distribution on $\Box\mathbf{b}$ . We have,

We now give an explicit formula for $D^{\Box}_{p}(S)$ for the case that $p(\cdot)$ is the density function of the multivariate Gaussian distribution with zero mean and independent components. This is an important special case since this is the density function that is relevant for the Gaussian kernel.

Let $p(\cdot)$ be the $d$ -dimensional multivariate Gaussian density function with zero mean and covariance matrix equal to $\operatorname{diag}(\sigma_{1}^{-2},\dots,\sigma_{d}^{-2})$ . We have,

Two other shift-invariant kernel that have been mentioned in the machine learning literature is the Laplacian kernel (Rahimi and Recht, 2008) and Matern kernel (Le et al., 2013). The distribution associated with the Laplacian kernel can be written as a product $p(\mathbf{x})=\prod_{j=1}^{d}p_{j}(x_{j})$ , where $p_{j}(\cdot)$ is density associated with the Cauchy distribution. The characteristic function is simple ( $\phi_{j}(\beta)=e^{-\left|\beta\right|/\sigma_{j}}$ ) so analytic formulas like (12) can be derived. The distribution associated with the Matern kernel, on the other hand, is the multivariate t-distribution, which cannot be written as a product $p(\mathbf{x})=\prod_{j=1}^{d}p_{j}(x_{j})$ , so the presented theory does not apply to it.

We now derive an expression for the expected discrepancy of Monte-Carlo sequences, and show that it decays as $O(s^{-1/2})$ . This is useful since via an averaging argument we are guaranteed that there exists sets for which the discrepancy behaves $O(s^{-1/2})$ .

Again, we can derive specific formulas for the Gaussian density. The following is straightforward from Corollary 14. We omit the proof.

Let $p(\cdot)$ be the $d$ -dimensional multivariate Gaussian density function with zero mean and covariance matrix equal to $\operatorname{diag}(\sigma_{1}^{-2},\dots,\sigma_{d}^{-2})$ . Suppose $\mathbf{t}_{1},\ldots,\mathbf{t}_{s}$ are chosen uniformly from $^{d}$ . Let $\mathbf{w}_{i}=\Phi^{-1}(\mathbf{t}_{i})$ , for $i=1,\ldots,s$ . Then,

Learning Adaptive QMC Sequences

Error characterization via discrepancy measures like (12) is typically used in the QMC literature to prescribe sequences whose discrepancy behaves favorably. It is clear that for the box discrepancy, a meticulous design is needed for a high quality sequence and we leave this to future work. Instead, in this work, we use the fact that unlike the star discrepancy (4), the box discrepancy is a smooth function of the sequence with a closed-form formula. This allows us to both evaluate various candidate sequences, and select the one with the lowest discrepancy, as well as to adaptively learn a QMC sequence that is specialized for our problem setting via numerical optimization. The basis is the following proposition, which gives an expression for the gradient of $D^{\Box}(S)$ .

Define the following scalar functions and variables,

In the above ,we define $\operatorname{sinc}^{\prime}(0)$ to be . Then, the elements of the gradient vector of $D^{\Box}$ are given by,

We explore two possible approaches for finding sequences based on optimizing the box discrepancy, namely global optimization and greedy optimization. The latter is closely connected to herding algorithms (Welling, 2009).

The gradient can be plugged into any first order numerical solver for non-convex optimization. We use non-linear conjugate gradient in our experiments (Section 6.2).

The above learning mechanism can be extended in various directions. For example, QMC sequences for $n$ -point rank-one Lattice Rules (Dick et al., 2013) are integral fractions of a lattice defined by a single generating vector $\mathbf{v}$ . This generating vector may be learnt via local minimization of the box discrepancy.

Greedy Adaptive Sequences.

Starting with $S_{0}=\emptyset$ , for $t\geq 1$ , let $S_{t}=\{\mathbf{w}_{1},\ldots,\mathbf{w}_{t}\}$ . At step $t+1$ , we solve the following optimization problem,

Set $S_{t+1}=S_{t}\cup\{w_{t+1}\}$ and repeat the above procedure. The gradient of the above objective is also given in (14). Again, we use non-linear conjugate gradient in our experiments (Section 6.2).

The greedy adaptive procedure is closely related to the herding algorithm, recently presented by Welling (2009). Applying the herding algorithm to $PW_{\mathbf{b}}$ and $p(\cdot)$ , and using our notation, the points $\mathbf{w}_{1},\mathbf{w}_{2},\dots$ are generated using the following iteration

In the above, $\mathbf{z}_{0},\mathbf{z}_{1},\dots$ is a series of functions in $PW_{\mathbf{b}}$ . The literature is not specific on the initial value of $\mathbf{z}_{0}$ , with both $\mathbf{z}_{0}=0$ and $\mathbf{z}_{0}=\mu_{h,p}$ suggested. Either way, it is always the case that $\mathbf{z}_{t}=\mathbf{z}_{0}+t(\mu_{h,p}-\hat{\mu}_{h,p,S_{t}})$ where $S_{t}=\{\mathbf{w}_{1},\dots,\mathbf{w}_{t}\}$ .

Weighted Sequences.

Classically, Monte-Carlo and Quasi-Monte Carlo approximations of integrals are unweighted, or more precisely, have a uniform weights. However, it is quite plausible to weight the approximations, i.e. approximate $I_{d}[f]=\int_{^{d}}f(\mathbf{x})d\mathbf{x}$ using

This construction requires $\xi_{i}\geq 0$ for $i=1,\dots,s$ , although we note that (16) itself does not preclude negative weights. We do not require the weights to be normalized, that is it is possible that $\sum^{s}_{i=1}\xi_{i}\neq 1$ .

One can easily generalize the result of the previous section to derive the following discrepancy measure that takes into consideration the weights

Using this discrepancy measure, global adaptive and greedy adaptive sequences of points and weights can be found.

However, we note that if we fix the points, then optimizing just the weights is a simple convex optimization problem. The box discrepancy can be written as

The $\mathbf{\xi}$ that minimizes $D^{\Box\mathbf{b}}_{p}(S,\Xi)^{2}$ is equal to $\mathbf{H}^{-1}\mathbf{v}$ , but there is no guarantee that $\xi_{i}\geq 0$ for all $i$ . We need to explicitly impose these conditions. Thus, the optimal weights can be found by solving the following convex optimization problem

Selecting the weights in such a way is closely connected to the so-called Bayesian Monte Carlo (BMC) method, originally suggested by Ghahramani and Rasmussen (2003). In BMC, a Bayesian approach is utilized in which the function is a assumed to be random with a prior that is a Gaussian Process. Combining with the observations, a posterior is obtained, which naturally leads to the selection of weights. Huszár and Duvenaud (2012) subsequently pointed out the connection between this approach and the herding algorithm discussed earlier.

Experiments

In this section we report experiments with both classical QMC sequences and adaptive sequences learnt from box discrepancy minimization.

We examine the behavior of classical low-discrepancy sequences when compared to random Fourier features (i.e., MC). We consider four sequences: Halton, Sobol’, Lattice Rules, and Digital Nets. For Halton and Sobol’, we use the implementation available in MATLAB.http://www.mathworks.com/help/stats/quasi-random-numbers.html For Lattice Rules and Digital Nets, we use publicly available implementations.http://people.cs.kuleuven.be/ dirk.nuyens/qmc-generators/ For all four low-discrepancy sequences, we use scrambling and shifting techniques recommended in the QMC literature (see Dick et al. (2013) for details). For Sobol’, Lattice Rules and Digital Nets, scrambling introduces randomization and hence variance. For Halton sequence, scrambling is deterministic, and there is no variance. The generation of these sequences is extremely fast, and quite negligible when compared to the time for any reasonable downstream use. For example, for census dataset with size 18,000 by 119, if we choose the number of random features $s=2000$ , the running time for performing kernel ridge regression model is more than 2 minutes, while the time of generating the QMC sequences is only around 0.2 seconds (Digital Nets sequence takes longer, but not much longer) and that of MC sequence is around 0.01 seconds. Therefore, we do not report running times as these are essentially the same across methods.

In all experiments, we work with a Gaussian kernel. For learning, we use regularized least square classification on the feature mapped dataset, which can be thought of as a form of approximate kernel ridge regression. For each dataset, we performed 5-fold cross-validation when using random Fourier features (MC sequence) to set the bandwidth $\sigma$ , and then used the same $\sigma$ for all other sequences.

We examine four datasets: cpu (6554 examples, 21 dimensions), census (a subset chosen randomly with 5,000 examples, 119 dimensions), USPST (1,506 examples, 250 dimensions after PCA) and MNIST (a subset chosen randomly with 5,000 examples, 250 dimensions after PCA). The reason we do subsampling on large datasets is to be able to compute the full exact Gram matrix for comparison purposes. The reason we use dimensionality reduction on MNIST is that the maximum dimension supported by the Lattice Rules implementation we use is 250.

We can clearly see that except Sobol’ sequences classical low-discrepancy sequences consistently produce better approximations to the Gram matrix than the approximations produced using MC sequences. Among the four classical QMC sequences, the Digital Nets, Lattice Rules and Halton sequences yield much lower error. Similar results were observed for other datasets (not reported here). Although using scrambled variants of QMC sequences may incur some variance, the variance is quite small compared to that of the MC random features.

Scrambled (whether deterministic or randomized) QMC sequences tend to yield higher accuracies than non-scambled QMC sequences. In Figure 3, we show the ratio between the relative errors achieved by using both scrambled and non-scrambled QMC sequences. As can be seen, scrambled QMC sequences provide more accurate approximations in most cases as the ratio value tends to be less than one. In particular, scrambled Lattice sequence outperforms the non-scrambled one across all the cases for larger values of $s$ . Therefore, in the rest of the experiments we use scrambled sequences.

Does better Gram matrix approximation translate to lower generalization errors?

We consider two regression datasets, cpu and census, and use (approximate) kernel ridge regression to build a regression model. The ridge parameter is set by the optimal value we obtain via 5-fold cross-validation on the training set by using the MC sequence. Table 1 summarizes the results.

As we see, for cpu, all the sequences behave similarly, with the Halton sequence yielding the lowest test error. For census, the advantage of using Halton sequence is significant (almost 20% reduction in generalization error) followed by Digital Nets and Sobol’. In addition, MC sequence tends to generate higher variance across all the sampling size. Overall, QMC sequences, especially Halton, outperform MC sequences on these datasets.

When performed on classification datasets by using the same learning model, with a moderate range of $s$ , e.g., less than $2000$ , the QMC sequences do not yield accuracy improvements over the MC sequence with the same consistency as in the regression case. The connection between kernel approximation and the performance in downstream applications is outside the scope of the current paper. Worth mentioning in this regard, is the recent work by Bach (2013), which analyses the connection between Nyström approximations of the Gram matrix, and the regression error, and the work of El Alaoui and Mahoney (2014) on kernel methods with statistical guarantees.

Behavior of Box Discrepancy

Next, we examine if $D^{\Box}$ is predictive of the quality of approximation. We compute the normalized square box discrepancy values (i.e., $\pi^{d}(\prod^{d}_{j=1}b_{j})^{-1}D^{\Box}(S)^{2}$ ) as well as Gram matrix approximation error for the different sequences with different sample sizes $s$ . The expected normalized square box discrepancy values for MC are computed using (13).

Our experiments revealed that using the full $\Box\mathbf{b}$ does not yield box discrepancy values that are very useful. Either the values were not predictive of the kernel approximation, or they tended to stay constant. Recall, that while the bounding box $\Box\mathbf{b}$ is set based on observed ranges of feature values in the dataset, the actual distribution of points $\bar{\mathcal{X}}$ encountered inside that box might be far from uniform. This lead us to consider the discrepancy measure when measured on the central part of the bounding box (i.e., $\Box\mathbf{b}/2$ instead of $\Box\mathbf{b}$ ), which is equal to the integration error averaged over that part of the bounding box. Presumably, points from $\bar{\mathcal{X}}$ concentrate in that region, and they may be more relevant for downstream predictive task.

2 Experiments With Adaptive QMC

The goal of this subsection is to provide a proof-of-concept for learning adaptive QMC sequences, using the three schemes described in Section 5. We demonstrate that QMC sequences can be improved to produce better approximation to the Gram matrix, and that can sometimes lead to improved generalization error.

Note that the running time of learning the adaptive sequences is less relevant in our experimental setting for the following reasons. Given the values of $s$ , $d$ , $\mathbf{b}$ and $\sigma$ the optimization of a sequence needs only to be done once. There is some flexibility in these parameters: $d$ can be adjusted by adding zero features or by doing PCA on the input; one can use longer or shorter sequences; and the data can be forced to a fit a particular bounding box using (possibly non-equal) scaling of the features (this, in turn, affects the choice of the $\sigma$ ) . Since designing adaptive QMC sequences is data-independent with applicability to a variety of downstream applications of kernel methods, it is quite conceivable to generate many point sets in advance and to use them for many learning tasks. Furthermore, the total size of the sequences ( $s\times d$ ) is independent of the number of examples $n$ , which is the dominant term in large scale learning settings.

We name the three sequences as Global Adaptive, Greedy Adaptive and Weighted respectively. For Global Adaptive, the Halton sequence is used as the initial setting of the optimization variables $S$ . For Greedy Adaptive, when optimizing for $\mathbf{w}_{t}$ , the $t$ -th point in the Halton sequence is used as the initial point. In both cases, we use non-linear conjugate gradient to perform numerical optimization. For Weighted, the initial features are generated using the Halton sequence and we optimize for the weights. We used CVX (Grant and Boyd, 2014, 2008) to compute the sequence (solve (17)).

We begin by examining the integration error over the unit square by using three different sequences, namely, MC, Halton and global adaptive QMC sequences. The integral is of the form $\int_{^{2}}e^{-i\mathbf{u}^{T}\mathbf{t}}d\mathbf{t}$ where $\mathbf{u}$ spans the unit square. The error is plotted in Figure 5. We see that MC sequences concentrate most of the error reduction near the origin. The Halton sequence gives significant improvement expanding the region of low integration error. Global adaptive QMC sequences give another order of magnitude improvement in integration error which is now diffused over the entire unit square; the estimation of such sequences is “aware” of the full integration region. In fact, by controlling the box size (see plot labeled b/2), adaptive sequences can be made to focus in a specified sub-box which can help with generalization if the actual data distribution is better represented by this sub-box.

Quality of Kernel Approximation

In Figure 6 and Figure 7 we examine how various metrics (discrepancy, maximum squared error, mean squared error, norm of the error) on the Gram matrix approximation evolve during the optimization process for both adaptive sequences. Since learning the adaptive sequences on dataset with low dimensional features is more affordable, the experiment is performed on two such datasets, namely, cpu and housing.

For Global Adaptive, we fixed $s=100$ and examine how the performance evolves as the number of iterations grows. In Figure 6 (a) we examine the behavior on cpu. We see that all metrics go down as the iteration progresses. This supports our hypothesis that by optimizing the box discrepancy we can improve the approximation of the Gram matrix. Figure 6 (b), which examines the same metrics on the scaled version of the housing dataset, has some interesting behaviors. Initially all metrics go down, but eventually all the metrics except the box-discrepancy start to go up; the box-discrepancy continues to go down. One plausible explanation is that the integrands are not uniformly distributed in the bounding box, and that by optimizing the expectation over the entire box we start to overfit it, thereby increasing the error in those regions of the box where integrands actually concentrate. One possible way to handle this is to optimize closer to the center of the box (e.g., on $\Box\mathbf{b}/2$ ), under the assumption that integrands concentrate there. In Figure 6 (c) we try this on the housing dataset. We see that now the mean error and the norm error are much improved, which supports the interpretation above. But the maximum error eventually goes up. This is quite reasonable as the outer parts of the bounding box are harder to approximate, so the maximum error is likely to originate from there. Subsequently, we stop the adaptive learning of the QMC sequences early, to avoid the actual error from going up due to averaging.

For Greedy Adaptive, we examine its behavior as the number of points increases. In Figure 7 (a) and (b), as expected, as the number of points in the sequence increases, the box discrepancy goes down. This is also translated to non-monotonic decrease in the other metrics of Gram matrix approximation. However, unlike the global case, we see in Figure 7 (c), when the points are generated by optimizing on a smaller box $\Box\mathbf{b}/2$ , the resulting metrics become higher for a fixed number of points. Although the Greedy Adaptive sequence can be computed faster than the adaptive sequence, potentially it might need a large number of points to achieve certain low magnitude of discrepancy. Hence, as shown in the plots, when the number of points is below 500, the quality of the optimization is not good enough to provide a good approximation the Gram matrix. For example, one can check when the number of points is 100, the discrepancy value of the Greedy Adaptive sequence is higher than that of the Global Adaptive sequence with more than 10 iterations.

Table 2 also shows the discrepancy values of various sequences on cpu and census. Using adaptive sequences improves the discrepancy values by orders-of-magnitude. We note that a significant reduction in terms of discrepancy values can be achieved using only weights, sometimes yielding discrepancy values that are better than the hard-to-compute global or greedy sequences.

Generalization Error

We use the three algorithms for learning adaptive sequences as described in the previous subsections, and use them for doing approximate kernel ridge regression. The ridge parameter is set by the value which is near-optimal for both sequences in 5-fold cross-validation on the training set. Table 3 summarizes the results.

For both cpu and census, at least one of the adaptive sequences sequences can yield lower test error for each sampling size (since the test error is already low, around 3% or 5%, such improvement in accuracy is not trivial). For cpu, greedy approach seems to give slightly better results. When $s=500$ or even larger (not reported here), the performance of the sequences are very close. For census, the weighted sequence yields the lowest generalization error when $s=400,800$ . Afterwards we can see global adaptive sequence outperforms the rest of the sequences, even though it has better discrepancy values. In some cases, adaptive sequences sometimes produce errors that are bigger than the unoptimized sequences.

In most cases, the adaptive sequence on the central part of the bounding box outperforms the adaptive sequence on the entire box. This is likely due to the non-uniformity phenomena discussed earlier.

Conclusion and Future Work

Recent work on applying kernel methods to very large datasets, has shown their ability to achieve state-of-the-art accuracies that sometimes match those attained by Deep Neural Networks (DNN) (Huang et al., 2014). Key to these results is the ability to apply kernel method to such datasets. The random features approach, originally due to Rahimi and Recht (2008), as emerged as a key technology for scaling up kernel methods (Sindhwani and Avron, 2014).

Close examination of those empirical results reveals that to achieve state-of-the-art accuracies, a very large number of random features was needed. For example, on TIMIT, a classical speech recognition dataset, over 200,000 random features were used in order to match DNN performance (Huang et al., 2014). It is clear that improving the efficiency of random features can have a significant impact on our ability to scale up kernel methods, and potentially get even higher accuracies.

This paper is the first to exploit high-dimensional approximate integration techniques from the QMC literature in this context, with promising empirical results backed by rigorous theoretical analyses. Avenues for future work include incorporating stronger data-dependence in the estimation of adaptive sequences and analyzing how resulting Gram matrix approximations translate into downstream performance improvements for a variety of large-scale learning tasks.

Acknowledgements

The authors would like to thank Josef Dick for useful pointers to literature about improvement of the QMC sequences; Ha Quang Minh for several discussions on Paley-Wiener spaces and RKHS theory; the anonymous ICML reviewers for pointing out the connection to herding and other helpful comments; the anonymous JMLR reviewers for suggesting weighted sequences and other helpful comments. This research was supported by the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C-0323. This work was done while J. Yang was a summer intern at IBM Research.

Appendix A Technical Details

In this section we give detailed proofs of the assertions made in Section 4 and 5.

From $f_{\mathbf{u}}(\mathbf{t})=e^{-i\mathbf{u}^{T}\Phi^{-1}(\mathbf{t})}$ , for any $j=1,\ldots,d$ , we have

With a change of variable, $\Phi_{j}(t_{j})=v_{j}$ , for $j=1,\ldots,d$ , (18) becomes

As this is a term in (6), we know that $V_{HK}[f_{\mathbf{u}}(\mathbf{t})]$ is unbounded.

A.2 Proof of Proposition 6

We need the following lemmas, across which we share some notation.

The linear functional $T$ is a bounded linear functional on the RKHS $\mathcal{H}$ . To see this:

From the Riesz Representation Theorem, every bounded linear functional on $\mathcal{H}$ admits an inner product representation. Therefore, for $T$ defined in (19), there exists $\mu_{h,p}\in\mathcal{H}$ such that,

Therefore we have, $\langle f,\mu_{h,p}\rangle_{\mathcal{H}}=\int f(\mathbf{x})p(\mathbf{x})d\mathbf{x}$ for all $f\in\mathcal{H}$ . For any $\mathbf{z}$ , choosing $f(\cdot)=h(\mathbf{z},\cdot)$ , where $h(\cdot,\cdot)$ is the kernel associated with $\mathcal{H}$ , and invoking the reproducing property we see that,

The proof of Proposition 6 follows from the existence Lemmas above, and the following steps.

A.3 Proof of Theorem 9

We apply (6) to the particular case of $h=\operatorname{sinc}_{\mathbf{b}}$ . We have

So we can consider each coordinate on its own.

The interchange in the second line is allowed since the $p_{j}(x)$ makes the function integrable (with respect to $x$ ).

where the last equality follows from first noticing that the characteristic function associated with the density function $x\mapsto p_{j}(x+w)$ is $\beta\mapsto\varphi(\beta)e^{iw\beta}$ , and then applying the previous inequality.

The interchange at the third line is allowed because of $p_{j}(x)p_{j}(y)$ . In the last line we use the fact that the $\varphi_{j}(\cdot)$ is Hermitian.

A.4 Proof of Theorem 12

In the above, $\operatorname{rect}$ is the function that is 1 on $[-1/2,1/2]$ and zero elsewhere.

We now have for every $\mathbf{u}\in\Box\mathbf{b}$ ,

The function $r_{S}(\cdot)$ is square-integrable, so it has a Fourier transform $\hat{r}_{S}(\cdot)$ . The above formula is exactly the value of $\hat{r}_{S}(\mathbf{u})$ . That is,

The equality before the last follows from Plancherel formula and the equality of the norm in $PW_{\mathbf{b}}$ to the $L2$ -norm. The last equality follows from the fact that $r_{S}$ is exactly the expression used in the proof of Proposition 6 to derive $D^{\Box}_{p}$ .

A.5 Proof of Corollary 13

In this case, $p(\mathbf{x})=\prod_{j=1}^{d}p_{j}(x_{j})$ where $p_{j}(\cdot)$ is the density function of $\mathcal{N}(0,1/\sigma_{j})$ . The characteristic function associated with $p_{j}(\cdot)$ is $\varphi_{j}(\beta)=e^{-\frac{\beta^{2}}{2\sigma_{j}^{2}}}$ . We apply (11) directly.

Combining (21), (22) and (11), (12) follows.

A.6 Proof of Corollary 14

Obviously, the first is a constant which is independent to $\mathbf{t}_{1},\ldots,\mathbf{t}_{s}$ . Since all the terms are finite, we can interchange the integral and the sum among rest terms. In the second term, for each $l$ , the only dependence on $\mathbf{t}_{1},\ldots,\mathbf{t}_{s}$ is $\mathbf{t}_{l}$ , hence all the other $\mathbf{t}_{j}$ can be integrated out. That is,

Above, the last equality comes from a change of variable, i.e., $\mathbf{t}_{l}=\left(\Phi_{1}(\phi_{1}),\ldots,\Phi_{d}(\phi_{d})\right)$ .

Similar operations can be done for the third and fourth term. Combining all of these, we have the following,

A.7 Proof of Proposition 16

Before we compute the derivative, we prove two auxiliary lemmas.

We omit the proof as it is a simple computation that follows from the definition of $\operatorname{sinc}_{\mathbf{b}}$ .

The derivative of the scalar function $f(x)=\operatorname{Re}\left[e^{-ax^{2}}\operatorname{erf}\left(c+idx\right)\right]$ , for real scalars $a,c,d$ is given by,

it suffices to compute the the derivative $g(x)=e^{-ax^{2}}\operatorname{erf}(c+idx)$ .

Let $k(x)=\operatorname{erf}(c+idx)$ . We have

For the first term in (12), that is $\frac{1}{s^{2}}\sum_{m=1}^{s}\sum_{r=1}^{s}\operatorname{sinc}_{\mathbf{b}}(\mathbf{w}_{m},\mathbf{w}_{r})$ , to compute the partial derivative of $w_{lj}$ , we only have to consider when at least $m$ or $r$ is equal to $l$ . If $m=j=l$ , by definition, the corresponding term in the summation is one. Hence, we only have to consider the case when $m\neq r$ . By symmetry, it is equivalent to compute the partial derivative of the following function $\frac{2}{s^{2}}\sum_{m=1,m\neq l}^{s}\operatorname{sinc}_{\mathbf{b}}(\mathbf{w}_{l},\mathbf{w}_{m})$ . Applying Lemma 19, we get the first term in (14).

Next, for the last term in (12), we only have consider the term associated with one in the summation and the term associated with $j$ in the product. Since $\left(\frac{\sigma_{j}}{\sqrt{2\pi}}\right)e^{-\frac{\sigma_{j}^{2}w^{2}_{lj}}{2}}\operatorname{Re}\left(\operatorname{erf}\left(\frac{b_{j}}{\sigma_{j}\sqrt{2}}-i\frac{\sigma_{j}w_{lj}}{\sqrt{2}}\right)\right)$ satisfies the formulation in Lemma 20, we can simply apply Lemma 20 and get its derivative with respect to $w_{lj}$ .

Equation (14) follows by combining these terms. ∎