Breaking the Curse of Dimensionality with Convex Neural Networks

Francis Bach

Introduction

Supervised learning methods come in a variety of ways. They are typically based on local averaging methods, such as kk-nearest neighbors, decision trees, or random forests, or on optimization of the empirical risk over a certain function class, such as least-squares regression, logistic regression or support vector machine, with positive definite kernels, with model selection, structured sparsity-inducing regularization, or boosting (see, e.g., Györfi and Krzyzak, 2002; Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014, and references therein).

At the other end of the spectrum, parametric methods such as linear supervised learning make strong assumptions regarding the problem and generalization bounds based on estimation errors typically assume that the model is well-specified, and the sample complexity to attain an excess risk of ε\varepsilon grows as n=Ω(d/ε2)n=\Omega(d/\varepsilon^{2}), for linear functions in dd dimensions and Lipschitz-continuous loss functions (Shalev-Shwartz and Ben-David, 2014, Chapter 9). While the sample complexity is much lower, when the assumptions are not met, the methods underfit and more complex models would provide better generalization performances.

Affine functions: f(x)=wx+bf(x)=w^{\top}x+b, leading to potential severe underfitting, but easy optimization and good (i.e., non exponential) sample complexity.

Single hidden-layer neural networks: f(x)=j=1kσ(wjx+bj)f(x)=\sum_{j=1}^{k}\sigma(w_{j}^{\top}x+b_{j}), where kk is the number of units in the hidden layer (see, e.g., Rumelhart et al., 1986; Haykin, 1994). The activation function σ\sigma is here assumed to be fixed. While the learning problem may be cast as a (sub)differentiable optimization problem, techniques based on gradient descent may not find the global optimum. If the number of hidden units is fixed, this is a parametric problem.

In this paper, our main aim is to answer the following question: Is there a single learning method that can deal efficiently with all situations above with provable adaptivity? We consider single-hidden-layer neural networks, with non-decreasing homogeneous activation functions such as

for α{0,1,}\alpha\in\{0,1,\dots\}, with a particular focus on α=0\alpha=0 (with the convention that 00=00^{0}=0), that is σ(u)=1u>0\sigma(u)=1_{u>0} (a threshold at zero), and α=1\alpha=1, that is, σ(u)=max{u,0}=(u)+\sigma(u)=\max\{u,0\}=(u)_{+}, the so-called rectified linear unit (Nair and Hinton, 2010; Krizhevsky et al., 2012). We follow the convexification approach of Bengio et al. (2006); Rosset et al. (2007), who consider potentially infinitely many units and let a sparsity-inducing norm choose the number of units automatically. This leads naturally to incremental algorithms such as forward greedy selection approaches, which have a long history for single-hidden-layer neural networks (see, e.g. Breiman, 1993; Lee et al., 1996).

We provide in Section 2 a review of functional analysis tools used for learning from continuously infinitely many basis functions, by studying carefully the similarities and differences between L1L_{1}- and L2L_{2}-penalties on the output weights. For L2L_{2}-penalties, this corresponds to a positive definite kernel and may be interpreted through random sampling of hidden weights. We also review incremental algorithms (i.e., forward greedy approaches) to learn from these infinite sets of basis functions when using L1L_{1}-penalties.

The results are specialized in Section 3 to neural networks with a single hidden layer and activation functions which are positively homogeneous (such as the rectified linear unit). In particular, in Sections 3.2, 3.3 and 3.4, we provide simple geometric interpretations to the non-convex problems of additions of new units, in terms of separating hyperplanes or Hausdorff distance between convex sets. They constitute the core potentially hard computational tasks in our framework of learning from continuously many basis functions.

In Section 4, we provide a detailed theoretical analysis of the approximation properties of (single hidden layer) convex neural networks with monotonic homogeneous activation functions, with explicit bounds. We relate these new results to the extensive literature on approximation properties of neural networks (see, e.g., Pinkus, 1999, and references therein) in Section 4.7, and show that these neural networks are indeed adaptive to linear structures, by replacing the exponential dependence in dimension by an exponential dependence in the dimension of the subspace of the data can be projected to for good predictions.

We provide in Section 5.5 simple conditions for convex relaxations to achieve the same generalization error bounds, even when constant-factor approximation cannot be found (e.g., because it is NP-hard such as for the threshold activation function and the rectified linear unit). We present in Section 6 convex relaxations based on semi-definite programming, but we were not able to find strong enough convex relaxations (they provide only a provable sample complexity with a polynomial time algorithm which is exponential in the dimension dd) and leave open the existence or non-existence of polynomial-time algorithms that preserve the non-exponential sample complexity.

Learning from continuously infinitely many basis functions

We consider the space F1\mathcal{F}_{1} of functions ff that can be written as

where μ\mu is a signed Radon measure on V\mathcal{V} with finite total variation μ(V)|\mu|(\mathcal{V}).

When V\mathcal{V} is finite, this corresponds to

with total variation vVμv\sum_{v\in\mathcal{V}}|\mu_{v}|, where the proper formalization for infinite sets V\mathcal{V} is done through measure theory.

The infimum of μ(V)|\mu|(\mathcal{V}) over all decompositions of ff as f=Vφvdμ(v)f=\int_{\mathcal{V}}\varphi_{v}d\mu(v), turns out to be a norm γ1\gamma_{1} on F1\mathcal{F}_{1}, often called the variation norm of ff with respect to the set of basis functions (see, e.g., Kurkova and Sanguineti, 2001; Mhaskar, 2004).

Given our assumptions regarding the compactness of V\mathcal{V}, for any fF1f\in\mathcal{F}_{1}, the infimum defining γ1(f)\gamma_{1}(f) is in fact attained by a signed measure μ\mu, as a consequence of the compactness of measures for the weak topology (see Evans and Gariepy, 1991, Section 1.9).

In the definition above, if we assume that the signed measure μ\mu has a density with respect to a fixed probability measure τ\tau with full support on V\mathcal{V}, that is, dμ(v)=p(v)dτ(v)d\mu(v)={p}(v)d\tau(v), then, the variation norm γ1(f)\gamma_{1}(f) is also equal to the infimal value of

over all integrable functions pp such that f(x)=Vp(v)φv(x)dτ(v)f(x)=\int_{\mathcal{V}}{p}(v)\varphi_{v}(x)d\tau(v). Note however that not all measures have densities, and that the two infimums are the same as all Radon measures are limits of measures with densities. Moreover, the infimum in the definition above is not attained in general (for example when the optimal measure is singular with respect to dτd\tau); however, it often provides a more intuitive definition of the variation norm, and leads to easier comparisons with Hilbert spaces in Section 2.3.

2 Representation from finitely many functions

When minimizing any functional JJ that depends only on the function values taken at a subset X^\hat{\mathcal{X}} of values in X\mathcal{X}, over the ball {fF1, γ1(f)δ}\{f\in\mathcal{F}_{1},\ \gamma_{1}(f)\leqslant\delta\}, then we have a “representer theorem” similar to the reproducing kernel Hilbert space situation, but also with significant differences, which we now present.

we can then build a function defined over all X\mathcal{X}, through the optimal measure μ\mu above.

Moreover, by Carathéodory’s theorem for cones (Rockafellar, 1997), if X^\hat{\mathcal{X}} is composed of only nn elements (e.g., nn is the number of observations in machine learning), the optimal function fX^f_{|\hat{\mathcal{X}}} above (and hence ff) may be decomposed into at most nn functions φv\varphi_{v}, that is, μ\mu is supported by at most nn points in V\mathcal{V}, among a potential continuum of possibilities.

Note however that the identity of these nn functions is not known in advance, and thus there is a significant difference with the representer theorem for positive definite kernels and Hilbert spaces (see, e.g., Shawe-Taylor and Cristianini, 2004), where the set of nn functions are known from the knowledge of the points xX^x\in\hat{\mathcal{X}} (i.e., kernel functions evaluated at xx).

3 Corresponding reproducing kernel Hilbert space (RKHS)

We have seen above that if the real-valued measures μ\mu are restricted to have density pp with respect to a fixed probability measure τ\tau with full support on V\mathcal{V}, that is, dμ(v)=p(v)dτ(v)d\mu(v)={p}(v)d\tau(v), then, the norm γ1(f)\gamma_{1}(f) is the infimum of the total variation μ(V)=Vp(v)dτ(v),|\mu|(\mathcal{V})=\int_{\mathcal{V}}|{p}(v)|d\tau(v), over all decompositions f(x)=Vp(v)φv(x)dτ(v)f(x)=\int_{\mathcal{V}}{p}(v)\varphi_{v}(x)d\tau(v).

We may also define the infimum of Vp(v)2dτ(v)\int_{\mathcal{V}}|{p}(v)|^{2}d\tau(v) over the same decompositions (squared L2L_{2}-norm instead of L1L_{1}-norm). It turns out that it defines a squared norm γ22\gamma_{2}^{2} and that the function space F2\mathcal{F}_{2} of functions with finite norm happens to be a reproducing kernel Hilbert space (RKHS). When V\mathcal{V} is finite, then it is well-known (see, e.g., Berlinet and Thomas-Agnan, 2004, Section 4.1) that the infimum of vVμv2\sum_{v\in\mathcal{V}}\mu_{v}^{2} over all vectors μ\mu such that f=vVμvφvf=\sum_{v\in V}\mu_{v}\varphi_{v} defines a squared RKHS norm with positive definite kernel k(x,y)=vVφv(x)φv(y)k(x,y)=\sum_{v\in V}\varphi_{v}(x)\varphi_{v}(y).

We show in Appendix A that for any compact set V\mathcal{V}, we have defined a squared RKHS norm γ22\gamma_{2}^{2} with positive definite kernel k(x,y)=Vφv(x)φv(y)dτ(v)\displaystyle k(x,y)=\int_{\mathcal{V}}\varphi_{v}(x)\varphi_{v}(y)d\tau(v).

When mm tends to infinity, then k^(x,y)\hat{k}(x,y) tends to k(x,y)k(x,y) and random sampling provides a way to work efficiently with explicit mm-dimensional feature spaces. See Rahimi and Recht (2007) for a analysis of the number of units needed for an approximation with error ε\varepsilon, typically of order 1/ε21/\varepsilon^{2}. See also Bach (2015) for improved results with a better dependence on ε\varepsilon when making extra assumptions on the eigenvalues of the associated covariance operator.

The corresponding RKHS norm is always greater than the variation norm (because of Jensen’s inequality), and thus the RKHS F2\mathcal{F}_{2} is included in F1\mathcal{F}_{1}. However, as shown in this paper, the two spaces F1\mathcal{F}_{1} and F2\mathcal{F}_{2} have very different properties; e.g., γ2\gamma_{2} may be computed easily in several cases, while γ1\gamma_{1} does not; also, learning with F2\mathcal{F}_{2} may either be done by random sampling of sufficiently many weights or using kernel methods, while F1\mathcal{F}_{1} requires dedicated convex optimization algorithms with potentially non-polynomial-time steps (see Section 2.5).

Moreover, for any vVv\in\mathcal{V}, φvF1\varphi_{v}\in\mathcal{F}_{1} with a norm γ1(φv)1\gamma_{1}(\varphi_{v})\leqslant 1, while in general φvF2\varphi_{v}\notin\mathcal{F}_{2}. This is a simple illustration of the fact that F2\mathcal{F}_{2} is too small and thus will lead to a lack of adaptivity that will be further studied in Section 5.4 for neural networks with certain activation functions.

4 Supervised machine learning

that is, the excess risk J(f^)inffFJ(f)J(\hat{f})-\inf_{f\in\mathcal{F}}J(f) is upper-bounded by a sum of an approximation error inffFδJ(f)inffFJ(f)\inf_{f\in\mathcal{F}^{\delta}}J(f)-\inf_{f\in\mathcal{F}}J(f), an estimation error 2supfFδJ^(f)J(f)2\sup_{f\in\mathcal{F}^{\delta}}|\hat{J}(f)-J(f)| and an optimization error ε\varepsilon (see also Bottou and Bousquet, 2008). In this paper, we will deal with all three errors, starting from the optimization error which we now consider for the space F1\mathcal{F}_{1} and its variation norm.

5 Incremental conditional gradient algorithms

We assume the functional JJ is convex and LL-smooth, that is for all hL2(dρ)h\in L_{2}(d\rho), there exists a gradient J(h)L2(dρ)J^{\prime}(h)\in L_{2}(d\rho) such that for all fL2(dρ)f\in L_{2}(d\rho),

When X\mathcal{X} is finite, this corresponds to the regular notion of smoothness from convex optimization (Nesterov, 2004).

The conditional gradient algorithm (a.k.a. Frank-Wolfe algorithm) is an iterative algorithm, starting from any function f0F1δf_{0}\in\mathcal{F}_{1}^{\delta} and with the following recursion, for t0t\geqslant 0:

See an illustration in Figure 1. We may choose either ρt=2t+1\rho_{t}=\frac{2}{t+1} or perform a line search for ρt\rho_{t}\in. For all of these strategies, the tt-th iterate is a convex combination of the functions fˉ0,,fˉt1\bar{f}_{0},\dots,\bar{f}_{t-1}, and is thus an element of F1δ\mathcal{F}_{1}^{\delta}. It is known that for these two strategies for ρt\rho_{t}, we have the following convergence rate (see, e.g. Jaggi, 2013):

When, r2=supvVφvL2(dρ)2r^{2}=\sup_{v\in\mathcal{V}}\|\varphi_{v}\|^{2}_{L_{2}(d\rho)} is finite, we have fL2(dρ)2r2γ1(f)2\|f\|^{2}_{L_{2}(d\rho)}\leqslant r^{2}\gamma_{1}(f)^{2} and thus we get a convergence rate of 2Lr2δ2t+1\frac{2Lr^{2}\delta^{2}}{t+1}.

Moreover, the basic Frank-Wolfe (FW) algorithm may be extended to handle the regularized problem as well (Harchaoui et al., 2013; Bach, 2013; Zhang et al., 2012), with similar convergence rates in O(1/t)O(1/t). Also, the second step in the algorithm, where the function ft+1f_{t+1} is built in the segment between ftf_{t} and the newly found extreme function, may be replaced by the optimization of J{J} over the convex hull of all functions fˉ0,,fˉt\bar{f}_{0},\dots,\bar{f}_{t}, a variant which is often referred to as fully corrective. Moreover, in our context where V\mathcal{V} is a space where local search techniques may be considered, there is also the possibility of “fine-tuning” the vectors vv as well (Bengio et al., 2006), that is, we may optimize the function (v_{1},\dots,v_{t},\alpha_{1},\dots,\alpha_{t})\mapsto J\big{(}\sum_{i=1}^{t}\alpha_{i}\varphi_{v_{i}}\big{)}, through local search techniques, starting from the weights (αi)(\alpha_{i}) and points (vi)(v_{i}) obtained from the conditional gradient algorithm.

Adding a new basis function.

The conditional gradient algorithm presented above relies on solving at each iteration the “Frank-Wolfe step”:

for g=J(ft)L2(dρ)g=-J^{\prime}(f_{t})\in L_{2}(d\rho). For the norm γ1\gamma_{1} defined through an L1L_{1}-norm, we have for f=Vφvdμ(v)f=\int_{\mathcal{V}}\varphi_{v}d\mu(v) such that γ1(f)=μ(V)\gamma_{1}(f)=|\mu|(\mathcal{V}):

with equality if and only if μ=μ+μ\mu=\mu_{+}-\mu_{-} with μ+\mu_{+} and μ\mu_{-} two non-negative measures, with μ+\mu_{+} (resp. μ\mu_{-}) supported in the set of maximizers vv of \big{|}\int_{\mathcal{X}}\varphi_{v}(x)g(x)d\rho(x)\big{|} where the value is positive (resp. negative).

with the maximizers ff of the first optimization problem above (left-hand side) obtained as δ\delta times convex combinations of φv\varphi_{v} and φv-\varphi_{v} for maximizers vv of the second problem (right-hand side).

A common difficulty in practice is the hardness of the Frank-Wolfe step, that is, the optimization problem above over V\mathcal{V} may be difficult to solve. See Section 3.2, 3.3 and 3.4 for neural networks, where this optimization is usually difficult.

Finitely many observations.

where the set of solutions of the first problem is in the convex hull of the solutions of the second problem.

Non-smooth loss functions.

In this paper, in our theoretical results, we consider non-smooth loss functions for which conditional gradient algorithms do not converge in general. One possibility is to smooth the loss function, as done by Nesterov (2005): an approximation error of ε\varepsilon may be obtained with a smoothness constant proportional to 1/ε1/\varepsilon. By choosing ε\varepsilon as 1/t1/\sqrt{t}, we obtain a convergence rate of O(1/t)O(1/\sqrt{t}) after tt iterations. See also Lan (2013).

Approximate oracles.

The conditional gradient algorithm may deal with approximate oracles; however, what we need in this paper is not the additive errors situations considered by Jaggi (2013), but multiplicative ones on the computation of the dual norm (similar to ones derived by Bach (2013) for the regularized problem).

Indeed, in our context, we minimize a function J(f)J(f) on fL2(dρ)f\in L_{2}(d\rho) over a norm ball {γ1(f)δ}\{\gamma_{1}(f)\leqslant\delta\}. A multiplicative approximate oracle outputs for any gL2(dρ)g\in L_{2}(d\rho), a vector f^L2(dρ)\hat{f}\in L_{2}(d\rho) such that γ1(f^)=1\gamma_{1}(\hat{f})=1, and

for a fixed κ1\kappa\geqslant 1. In Appendix B, we propose a modification of the conditional gradient algorithm that converges to a certain hL2(dρ)h\in L_{2}(d\rho) such that γ1(h)δ\gamma_{1}(h)\leqslant\delta and for which infγ1(f)δJ(f)J(h)infγ1(f)δ/κJ(f)\inf_{\gamma_{1}(f)\leqslant\delta}J(f)\leqslant J(h)\leqslant\inf_{\gamma_{1}(f)\leqslant\delta/\kappa}J(f).

Such approximate oracles are not available in general, because they require uniform bounds over all possible values of gL2(dρ)g\in L_{2}(d\rho). In Section 5.5, we show that a weaker form of oracle is sufficient to preserve our generalization bounds from Section 5.

Approximation of any function by a finite number of basis functions.

Neural networks with non-decreasing positively homogeneous activation functions

In this paper, we focus on a specific family of basis functions, that is, of the form

for specific activation functions σ\sigma. We assume that σ\sigma is non-decreasing and positively homogeneous of some integer degree, i.e., it is equal to σ(u)=(u)+α\sigma(u)=(u)_{+}^{\alpha}, for some α{0,1,}\alpha\in\{0,1,\dots\}. We focus on these functions for several reasons:

Since they are not polynomials, linear combinations of these functions can approximate any measurable function (Leshno et al., 1993).

By homogeneity, they are invariant by a change of scale of the data; indeed, if all observations xx are multiplied by a constant, we may simply change the measure μ\mu defining the expansion of ff by the appropriate constant to obtain exactly the same function. This allows us to study functions defined on the unit-sphere.

The special case α=1\alpha=1, often referred to as the rectified linear unit, has seen considerable recent empirical success (Nair and Hinton, 2010; Krizhevsky et al., 2012), while the case α=0\alpha=0 (hard thresholds) has some historical importance (Rosenblatt, 1958).

The goal of this section is to specialize the results from Section 2 to this particular case and show that the “Frank-Wolfe” steps have simple geometric interpretations.

We first show that the positive homogeneity of the activation functions allows to transfer the problem to a unit sphere.

This implies by Hölder’s inequality that φv(x)22α\varphi_{v}(x)^{2}\leqslant 2^{\alpha}. Moreover this leads to functions in F1\mathcal{F}_{1} that are bounded everywhere, that is, fF1\forall f\in\mathcal{F}_{1}, f(x)22αγ1(f)2f(x)^{2}\leqslant 2^{\alpha}\gamma_{1}(f)^{2}. Note that the functions in F1\mathcal{F}_{1} are also Lipschitz-continuous for α1\alpha\geqslant 1.

Homogeneous reformulation.

that is γ1(f)γ1(g)\gamma_{1}(f)\leqslant\gamma_{1}(g), because we have assumed that (w,b/R)(w^{\top},b/R)^{\top} is on the (1/R)(1/R)-sphere.

The only remaining important aspect is to define gg on the entire sphere, so that (a) its regularity constants are controlled by a constant times the ones on the portion of the sphere where it is already defined, (b) gg is either even or odd (this will be important in Section 4). Ensuring that the regularity conditions can be met is classical when extending to the full sphere (see, e.g., Whitney, 1934). Ensuring that the function may be chosen as odd or even may be obtained by multiplying the function gg by an infinitely differentiable function which is equal to one for a1/2a\geqslant 1/\sqrt{2} and zero for a0a\leqslant 0, and extending by g-g or gg on the hemi-sphere a<0a<0.

In summary, we may consider in Section 4 functions defined on the sphere, which are much easier to analyze. In the rest of the section, we specialize some of the general concepts reviewed in Section 2 to our neural network setting with specific activation functions, namely, in terms of corresponding kernel functions and geometric reformulations of the Frank-Wolfe steps.

1 Corresponding positive-definite kernels

There are key differences and similarities between the RKHS F2\mathcal{F}_{2} and our space of functions F1\mathcal{F}_{1}. The RKHS is smaller than F1\mathcal{F}_{1} (i.e., the norm in the RKHS is larger than the norm in F1\mathcal{F}_{1}); this implies that approximation properties of the RKHS are transferred to F1\mathcal{F}_{1}. In fact, our proofs rely on this fact.

However, the RKHS norm does not lead to any adaptivity, while the function space F1\mathcal{F}_{1} does (see more details in Section 5). This may come as a paradox: both the RKHS F2\mathcal{F}_{2} and F1\mathcal{F}_{1} have similar properties, but one is adaptive while the other one is not. A key intuitive difference is as follows: given a function ff expressed as f(x)=Vφv(x)p(v)dτ(v)f(x)=\int_{\mathcal{V}}\varphi_{v}(x){p}(v)d\tau(v), then γ1(f)=Vp(v)dτ(v)\gamma_{1}(f)=\int_{\mathcal{V}}|{p}(v)|d\tau(v), while the squared RKHS norm is γ2(f)2=Vp(v)2dτ(v)\gamma_{2}(f)^{2}=\int_{\mathcal{V}}|{p}(v)|^{2}d\tau(v). For the L1L_{1}-norm, the measure p(v)dτ(v){p}(v)d\tau(v) may tend to a singular distribution with a bounded norm, while this is not true for the L2L_{2}-norm. For example, the function (wx+b)+α(w^{\top}x+b)_{+}^{\alpha} is in F1\mathcal{F}_{1}, while it is not in F2\mathcal{F}_{2} in general.

2 Incremental optimization problem for α=0𝛼0\alpha=0

where I+={i,yi0}I_{+}=\{i,y_{i}\geqslant 0\} and I={i,yi<0}I_{-}=\{i,y_{i}<0\}. As outlined by Bengio et al. (2006), this is equivalent to finding an hyperplane parameterized by vv that minimizes a weighted mis-classification rate (when doing linear classification). Note that the norm of vv has no effect.

Convex relaxation.

Given linear binary classification problems, there are several algorithms to approximately find a good half-space. These are based on using convex surrogates (such as the hinge loss or the logistic loss). Although some theoretical results do exist regarding the classification performance of estimators obtained from convex surrogates (Bartlett et al., 2006), they do not apply in the context of linear classification.

3 Incremental optimization problem for α=1𝛼1\alpha=1

For the problem of maximizing \big{|}\sum_{i=1}^{n}y_{i}(v^{\top}z_{i})_{+}\big{|}, then this corresponds to

This is exactly the Hausdorff distance between the two convex sets {T+b+, b+n+}\{T_{+}^{\top}b_{+},\ b_{+}\in^{n_{+}}\} and {Tb, bn}\{T_{-}^{\top}b_{-},\ b_{-}\in^{n_{-}}\} (referred to as zonotopes, see below).

Given the pair (b+,b)(b_{+},b_{-}) achieving the Hausdorff distance, then we may compute the optimal vv as v=\arg\max_{\|v\|_{p}\leqslant 1}v^{\top}\big{(}T_{+}^{\top}b_{+}-T_{-}^{\top}b_{-}\big{)}. Note this has not changed the problem at all, since it is equivalent. It is still NP-hard in general (König, 2014). But we now have a geometric interpretation with potential approximation algorithms. See below and Section 6.

A zonotope AA is the Minkowski sum of a finite number of segments from the origin, that is, of the form

In our context, the two convex sets {T+b+, b+n+}\{T_{+}^{\top}b_{+},\ b_{+}\in^{n_{+}}\} and {Tb, bn}\{T_{-}^{\top}b_{-},\ b_{-}\in^{n_{-}}\} defined above are thus zonotopes. See an illustration of the Hausdorff distance computation in Figure 4 (middle plot), which is the core computational problem for α=1\alpha=1.

Approximation by ellipsoids.

Centrally symmetric convex polytopes (w.l.o.g. centered around zero) may be approximated by ellipsoids. In our set-up, we could use the minimum volume enclosing ellipsoid (see, e.g. Barvinok, 2002), which can be computed exactly when the polytope is given through its vertices, or up to a constant factor when the polytope is such that quadratic functions may be optimized with a constant factor approximation. For zonotopes, the standard semi-definite relaxation of Nesterov (1998) leads to such constant-factor approximations, and thus the minimum volume inscribed ellipsoid may be computed up to a constant. Given standard results (see, e.g. Barvinok, 2002), a (1/d)(1/\sqrt{d})-scaled version of the ellipsoid is inscribed in this polytope, and thus the ellipsoid is a provably good approximation of the zonotope with a factor scaling as d\sqrt{d}. However, the approximation ratio is not good enough to get any relevant bound for our purpose (see Section 5.5), as for computing the Haussdorff distance, we care about potentially vanishing differences that are swamped by constant factor approximations.

NP-hardness.

Given the reduction of the case α=1\alpha=1 (rectified linear units) to α=0\alpha=0 (exact thresholds) (Livni et al., 2014), the incremental problem is also NP-hard, so as obtaining a constant-factor approximation. However, this does not rule out convex relaxations with non-constant approximation ratios (see Section 6 for more details).

4 Incremental optimization problem for α⩾2𝛼2\alpha\geqslant 2

Note that, while for α=1\alpha=1 we can use the identity 2u+=u+u2u_{+}=u+|u| to replace the rectified linear unit by the absolute value and obtain the same function space, this is not possible for α=2\alpha=2, as (u+)2(u_{+})^{2} and u2u^{2} do not differ by a linear function. This implies that the results from Livni et al. (2014), which state that for the quadratic activation function, the incremental problems is equivalent to an eigendecomposition (and hence solvable in polynomial time), do not apply.

Approximation properties

In this section, we first consider approximation properties of functions in G1\mathcal{G}_{1} by a finite number of neurons (only for α=1\alpha=1). We then study approximation properties of functions on the sphere by functions in G1\mathcal{G}_{1}. It turns out that all our results are based on the approximation properties of the corresponding RKHS G2\mathcal{G}_{2}: we give sufficient conditions for being in G2\mathcal{G}_{2}, and then approximation bounds for functions which are not in G2\mathcal{G}_{2}. Finally we transfer these to the spaces F1\mathcal{F}_{1} and F2\mathcal{F}_{2}, and consider in particular functions which only depend on projections on a low-dimensional subspace, for which the properties of G1\mathcal{G}_{1} and G2\mathcal{G}_{2} (and of F1\mathcal{F}_{1} and F2\mathcal{F}_{2}) differ. This property is key to obtaining generalization bounds that show adaptivity to linear structures in the prediction functions (as done in Section 5).

Approximation properties of neural networks with finitely many neurons have been studied extensively (see, e.g., Petrushev, 1998; Pinkus, 1999; Makovoz, 1998; Burger and Neubauer, 2001). In Section 4.7, we relate our new results to existing work from the literature on approximation theory, by showing that our results provide an explicit control of the various weight vectors which are needed for bounding the estimation error in Section 5.

Positively homogenous convex functions hh may be written as the support function of a compact convex set KK (Rockafellar, 1997), that is, h(z)=maxyKyzh(z)=\max_{y\in K}y^{\top}z, and the set KK characterizes the function hh. The functions g+g_{+} and gg_{-} defined above are not any convex positively homogeneous functions, as we now describe.

If the measure μ+\mu_{+} is supported by finitely many points, that is, μ+(v)=i=1rηiδ(vvi)\mu_{+}(v)=\sum_{i=1}^{r}\eta_{i}\delta(v-v_{i}) with η0\eta\geqslant 0, then g+(z)=i=1tηi(viz)+=i=1t(ηiviz)+=i=1t(tiz)+g_{+}(z)=\sum_{i=1}^{t}\eta_{i}(v_{i}^{\top}z)_{+}=\sum_{i=1}^{t}(\eta_{i}v_{i}^{\top}z)_{+}=\sum_{i=1}^{t}(t_{i}^{\top}z)_{+} for ti=ηivit_{i}=\eta_{i}v_{i}. Thus the corresponding set K+K_{+} is the zonotope [0,t_{1}]+\cdots+[0,t_{r}]=\big{\{}\sum_{i=1}^{r}b_{i}t_{i},\ b\in^{r}\big{\}} already defined in Section 3.3. Thus the functions g+G1g_{+}\in\mathcal{G}_{1} and gG1g_{-}\in\mathcal{G}_{1} for finitely supported measures μ\mu are support functions of zonotopes.

When the measure μ\mu is not constrained to have finite support, then the sets K+K_{+} and KK_{-} are limits of zonotopes, and thus, by definition, zonoids (Bolker, 1969), and thus functions in G1\mathcal{G}_{1} are differences of support functions of zonoids. Zonoids are a well-studied set of convex bodies. They are centrally symmetric, and in two dimensions, all centrally symmetric compact convexs sets are (up to translation) zonoids, which is not true in higher dimensions (Bolker, 1969). Moreover, the problem of approximating a zonoid by a zonotope with a small number of segments (Bourgain et al., 1989; Matoušek, 1996) is essentially equivalent to the approximation of a function gg by finitely many neurons. The number of neurons directly depends on the norm γ1\gamma_{1}, as we now show.

with rC(d)ε2+6/(d+3)=C(d)ε2d/(d+3)r\leqslant C(d)\varepsilon^{-2+6/(d+3)}=C(d)\varepsilon^{-2d/(d+3)}, for some constant C(d)C(d) that depends only on dd. We may then simply write

Finally, the number of neurons needed to express a function with a bound on the γ2\gamma_{2}-norm can be estimated from general results on approximating reproducing kernel Hilbert space described in Section 2.3, whose kernel can be expressed as an expectation. Indeed, Bach (2015) shows that with kk neurons, one can approximate a function in F2\mathcal{F}_{2} with unit γ2\gamma_{2}-norm with an error measured in L2L_{2} of ε=k(d+3)/(2d)\varepsilon=k^{-(d+3)/(2d)}. When inverting the relationship between kk and ε\varepsilon, we get a number of neurons scaling as ε2d/(d+3)\varepsilon^{-2d/(d+3)}, which is the same as in Prop. 1 but with an error in L2L_{2}-norm instead of LL^{\infty}-norm.

2 Sufficient conditions for finite variation

Tightness of conditions: as shown in Appendix D.5, there are functions gg, which have bounded first ss derivatives and do not belong to G2\mathcal{G}_{2} while sd2+αs\leqslant\frac{d}{2}+\alpha (at least when sαs-\alpha is even). Therefore, when sαs-\alpha is even, the scaling in (d1)/2+α{(d-1)}/{2}+\alpha is optimal.

Dependence on α\alpha: for any dd, the higher the α\alpha, the stricter the sufficient condition. Given that the estimation error grows slowly with α\alpha (see Section 5.1), low values of α\alpha would be preferred in practice.

Dependence on dd: a key feature of the sufficient condition is the dependence on dd, that is, as dd increases the number of derivatives has to increase in d/2d/2—like for Sobolev spaces in dimension dd (Adams and Fournier, 2003). This is another instantiation of the curse of dimensionality: only very smooth functions in high dimensions are allowed.

3 Approximation of Lipschitz-continuous functions

In order to derive generalization bounds for target functions which are not sufficiently differentiable (and may not be in G2\mathcal{G}_{2} or G1\mathcal{G}_{1}), we need to approximate any Lipschitz-continuous function, with a function gG2g\in\mathcal{G}_{2} with a norm γ2(g)\gamma_{2}(g) that will grow as the approximation gets tighter. We give precise rates in the proposition below. Note the requirement for parity of the function gg. The result below notably shows the density of G1\mathcal{G}_{1} in uniform norm in the space of Lipschitz-continuous functions of the given parity, which is already known since our activation functions are not polynomials (Leshno et al., 1993).

This proposition is shown in Appendix C.3 for d=1d=1 (using Fourier series) and in Appendix D.4 for all d1d\geqslant 1 (using spherical harmonics). We can make the following observations:

Dependence in δ\delta and η\eta: as expected, the main term in the error bound \big{(}{\delta}/{\eta}\big{)}^{-1/(\alpha+(d-1)/2)} is a decreasing function of δ/η\delta/\eta, that is when the norm γ2(h)\gamma_{2}(h) is allowed to grow, the approximation gets tighter, and when the Lipschitz constant of gg increases, the approximation is less tight.

Dependence on dd and α\alpha: the rate of approximation is increasing in dd and α\alpha. In particular the approximation properties are better for low α\alpha.

Special case d=1d=1 and α=0\alpha=0: up to the logarithmic term we recover the result of Prop. 2, that is, the function gg is in G2\mathcal{G}_{2}.

Tightness: in Appendix D.5, we provide a function which is not in the RKHS and for which the tightest possible approximation scales as δ2/(d/2+α2)\delta^{-2/(d/2+\alpha-2)}. Thus the linear scaling of the rate as d/2+αd/2+\alpha is not improvable (but constants are).

4 Linear functions

We see that for α=1\alpha=1, the γ1\gamma_{1}-norm is less than a constant, and is much smaller than the γ2\gamma_{2}-norm (which scales as d\sqrt{d}). For α2\alpha\geqslant 2, we were not able to derive better bounds for γ1\gamma_{1} (other than the value of γ2)\gamma_{2}).

5 Functions of projections

However, for the RKHS norm γ2\gamma_{2}, this reasoning does not apply. For example, a certain function φ\varphi exists, which is ss-times differentiable, as shown in Appendix D.5, for sd2+αs\leqslant\frac{d}{2}+\alpha (when sαs-\alpha is even), and is not in G2\mathcal{G}_{2}. Thus, given Prop. 2, the dependence on a uni-dimensional projection does not make a difference regarding the level of smoothness which is required to belong to G2\mathcal{G}_{2}.

Proof By assumption, the function xf(Rx)x\mapsto f(Rx) has all its derivatives bounded by a constant times η\eta. Moreover, we have defined g(t,a)=f\big{(}\frac{Rt}{a}\big{)}a^{\alpha} so that all derivatives are bounded by η\eta. The result then follows immediately from Prop. 2.

Proof With the same reasoning as above, we obtain that gg is Lipschitz-continuous with constant η\eta, we thus get the desired approximation error from Prop. 3.

If f(x)=wx+bf(x)=w^{\top}x+b, with w2η\|w\|_{2}\leqslant\eta and bηRb\leqslant\eta R, then for α=1\alpha=1, it is straightforward that γ1(f)2Rη\gamma_{1}(f)\leqslant 2R\eta. Moreover, we have γ2(f)CRη\gamma_{2}(f)\sim CR\eta. For other values of α\alpha, we also have γ1\gamma_{1}-norms less than a constant (depending only of α\alpha) times RηR\eta. The RKHS norms are bit harder to compute since linear functions for ff leads to linear functions for gg only for α=1\alpha=1.

Functions of projections.

7 Related work

In this section, we show how our results from the previous sections relate to existing work on neural network approximation theory.

In this section, we only consider the case α=1\alpha=1, for which we have two approximation bounds: Prop. 6 which approximates any η\eta-Lipschitz-continuous function by a function with finite γ1\gamma_{1}-norm less than δ\delta and uniform error less than \displaystyle\eta\big{(}{\delta}/{\eta}\big{)}^{-2/(d+1)}\log\big{(}{\delta}/{\eta}\big{)}, and Prop. 1 which shows that a function with γ1\gamma_{1}-norm less than δ\delta, may be approximated with rr neurons with uniform error δr(d+3)/(2d)\delta r^{-(d+3)/(2d)}.

Thus, given rr neurons, we get an approximation of the original function with uniform error

We can optimize over δ\delta, and use δ=ηn(d+1)/(2d)\delta=\eta n^{(d+1)/(2d)}, to obtain a uniform approximation bound proportional to η(logn)n1/d{\eta(\log n)}{n^{-1/d}}, for approximating an η\eta-Lipschitz-continuous function with nn neurons.

Approximation by ridge functions.

The approximation properties of single hidden layer neural networks have been studied extensively, where they are often referred to as “ridge function” approximations. As shown by Pinkus (1999, Corollary 6.10)—based on a result from Petrushev (1998), the approximation order of n1/dn^{-1/d} for the rectified linear unit was already known, but only in L2L_{2}-norm (and without the factor logn\log n), and without any constraints on the input and output weights. In this paper, we provide an explicit control of the various weights, which is needed for computing estimation errors. Moreover, while the two proof techniques use spherical harmonics, the proof of Petrushev (1998) relies on quadrature formulas for the associated Legendre polynomials, while ours relies on the relationship with the associated positive definite kernels, is significantly simpler, and offers additional insights into the problem (relationship with convex neural networks and zonoids). Maiorov (2006, Theorem 2.3) also derives a similar result, but in L2L_{2}-norm (rather than uniform norm), and for sigmoidal activation functions (which are bounded). Note finally, that the order O(n1/d)O(n^{-1/d}) cannot be improved (DeVore et al., 1989, Theorem 4.2). Also, Maiorov and Meir (2000, Theorem 5) derive similar upper and lower bounds based on a random sampling argument which is close to using random features in the RKHS setting described in Section 2.3.

Relationship to hardness results for Boolean-valued functions.

In this paper, we consider a particular view of the curse of dimensionality and ways of circumventing it, that is, our distribution over inputs is arbitrary, but our aim is to approximate a real-valued function. Thus, all hardness results depending on functions with values in {0,1}\{0,1\} do not apply there directly—see, e.g., Shalev-Shwartz and Ben-David (2014, Chapter 20), for the need of exponentially many hidden units for approximating most of the functions from {0,1}d\{0,1\}^{d} to {0,1}\{0,1\}.

Our approximation bounds show that, without any assumption beyond Lipschitz-continuity of the target function, it sufficient to have a number of hidden units which is still exponential in dimension (hence we also suffer from the curse of dimensionality), but a soon as the target function depends on linear low-dimensional structure, then we lose this exponential dependence. It would be interesting to study an extension to {0,1}\{0,1\}-valued functions, and also to relate our results to the number of linear regions delimited by neural networks with rectified linear units (Montufar et al., 2014).

Generalization bounds

Our goal is to derive the generalization bounds outlined in Section 2.4 for neural networks with a single hidden layer. The main results that we obtain are summarized in Table 2 and show adaptivity to assumptions that avoid the curse of dimensionality.

Logistic regression and support vector machines: we have G=1G=1.

Least-squares regression: we take G=\max\big{\{}\sqrt{2}\delta+\|y\|_{\infty},\frac{\|y\|_{\infty}^{2}}{\sqrt{2}\delta}\big{\}}.

Approximation errors inffFδJ(f)inffFJ(f)\inf_{f\in\mathcal{F}^{\delta}}J(f)-\inf_{f\in\mathcal{F}}J(f) will be obtained from the approximation results from Section 4 by assuming that the optimal target function ff_{\ast} has a specific form. Indeed, we have:

We now deal with estimation errors supfFδJ^(f)J(f)\sup_{f\in\mathcal{F}^{\delta}}|\hat{J}(f)-J(f)| using Rademacher complexities.

The following proposition bounds the uniform deviation between JJ and its empirical counterpart J^\hat{J}. This result is standard (see, e.g., Koltchinskii, 2001; Bartlett and Mendelson, 2003) and may be extended in bounds that hold with high-probability.

We have the following bound on the expected uniform deviation:

for α1\alpha\geqslant 1, C(p,d,α)α2log(d+1)C(p,d,\alpha)\leqslant\alpha\sqrt{2\log(d+1)} for p=1p=1 and C(p,d,α)αp1C(p,d,\alpha)\leqslant\frac{\alpha}{\sqrt{p-1}} for p(1,2]p\in(1,2]

for α=0\alpha=0, C(p,d,α)Cd+1C(p,d,\alpha)\leqslant C\sqrt{d+1}, where CC is a universal constant.

Proof We use the standard framework of Rademacher complexities and get:

We then take different routes for α1\alpha\geqslant 1 and α=0\alpha=0.

For α1\alpha\geqslant 1, we have the upper-bound

From Kakade et al. (2009), we get the following bounds on Rademacher complexities:

Our generalization bounds are expected values of the excess expected risk for a our estimator (where the expectation is taken over the data).

We assume f(x)=wx+bf_{\ast}(x)=w^{\top}x+b, with w2η\|w\|_{2}\leqslant\eta and bRη|b|\leqslant R\eta. Then, as seen in Section 4.6, fF1f_{\ast}\in\mathcal{F}_{1} with γ1(f)C(α)ηR\gamma_{1}(f_{\ast})\leqslant C(\alpha)\eta R (the constant is independent of dd because we approximate an affine function). From Prop. 7, we thus get a generalization bound proportional to GRηn\frac{GR\eta}{\sqrt{n}} times a constant (that may depend on α\alpha), which is the same as assuming directly that we optimize over linear predictors only. The chosen δ\delta is then a constant times RηR\eta, and does not grow with nn, like in parametric estimation (although we do use a non-parametric estimation procedure).

Projection pursuit.

We assume f(x)=j=1kfj(wjx)f_{\ast}(x)=\sum_{j=1}^{k}f_{j}(w_{j}^{\top}x), with wj2η\|w_{j}\|_{2}\leqslant\eta and each fjf_{j} bounded by ηR\eta R and 11-Lipschitz continuous. From Prop. 6, we may approach each xfj(wjx)x\mapsto f_{j}(w_{j}^{\top}x) by a function with γ1\gamma_{1}-norm less than δηR\delta\eta R and uniform approximation C(α)ηRδ1/αlogδC(\alpha)\eta R\delta^{-1/\alpha}\log\delta. This leads to a total approximation error of kC(α)GηRδ1/αlogδkC(\alpha)G\eta R\delta^{-1/\alpha}\log\delta for a norm less than kδηRk\delta\eta R (the constant is independent of dd because we approximate a function of one-dimensional projection).

For α=0\alpha=0, from Prop. 5, the target function belongs to F1\mathcal{F}_{1} with a norm less than kGRηkGR\eta, leading to an overall generalization bound of kGRηdn\frac{kGR\eta\sqrt{d}}{\sqrt{n}}.

Note that when the functions fjf_{j} are exactly the activation functions, the bound is better, as these functions directly belong to the space F1\mathcal{F}_{1}.

Multi-dimensional projection pursuit.

For α1\alpha\geqslant 1, the estimation error is kGRηδ/n{kGR\eta\delta}/{\sqrt{n}}, with an overall bound of C(\alpha,s)kGR\eta\big{(}{\delta}/{\sqrt{n}}+\delta^{-1/(\alpha+(s-1)/2)}\log\delta\big{)}. With δ=n(α+(s1)/2)/(2α+s1)\delta=n^{(\alpha+(s-1)/2)/(2\alpha+s-1)}, we get an optimized bound of C(α,s)kGRηn1/(2α+s+1)logn\frac{C(\alpha,s)kGR\eta}{n^{1/(2\alpha+s+1)}}\log n.

For α=0\alpha=0, we have an overall bound of C(s)kGR\eta\big{(}\delta^{-2/(s-1)}\log\delta+\frac{\delta\sqrt{d}}{\sqrt{n}}\big{)}, and with δ=(n/d)(s1)/(s+1)\delta=(n/d)^{(s-1)/(s+1)}, we get a generalization bound scaling as C(s)kGRη(n/d)1/(s+1)log(n/d)\frac{C(s)kGR\eta}{(n/d)^{1/(s+1)}}\log(n/d).

Note that for s=ds=d and k=1k=1, we recover the usual Lipschitz-continuous assumption, with a rate of C(α,d)kGRηn1/(2α+d+1)logn\frac{C(\alpha,d)kGR\eta}{n^{1/(2\alpha+d+1)}}\log n.

Summary table: when we know a bound rr on all dimensions of xx, then we may take R=rdR=r\sqrt{d}; this is helpful in comparisons in Table 2, where RR is replaced by rdr\sqrt{d} and the dependence in rr is removed as it is the same for all models.

Dependence on dd: when making only a global Lipschitz-continuity assumption, the generalization bound has a bad scaling in nn, i.e., as n1/(2α+d+1)n^{-1/(2\alpha+d+1)}, which goes down to zero slowly when dd increases. However, when making structural assumptions regarding the dependence on unknown lower-dimensional subspaces, the scaling in dd disappears.

Comparing different values of α\alpha: the value α=0\alpha=0 always has the best scaling in nn, but constants are better for α1\alpha\geqslant 1 (among which α=1\alpha=1 has the better scaling in nn).

Bounds for F2\mathcal{F}_{2}: The simplest upper bound for the penalization by the space F2\mathcal{F}_{2} depends on the approximation properties of F2\mathcal{F}_{2}. For linear functions and α=1\alpha=1, it is less than dηR\sqrt{d}\eta R, with a bound GRηdn\frac{GR\eta\sqrt{d}}{\sqrt{n}}. For the other values of α\alpha, there is a constant C(d)C(d). Otherwise, there is no adaptivity and all other situations only lead to upper-bounds of O(n1/(2α+d+1))O(n^{-1/(2\alpha+d+1)}). See more details in Section 5.4.

Sample complexity: Note that the generalization bounds above may be used to obtain sample complexity results such as dε2d\varepsilon^{-2} for affine functions, (εk1d1/2)2α2(\varepsilon k^{-1}d^{-1/2})^{-2\alpha-2} for projection pursuit, and (εk1d1/2)s12α(\varepsilon k^{-1}d^{-1/2})^{-s-1-2\alpha} for the generalized version (up to logarithmic terms).

Relationship to existing work: Maiorov (2006, Theorem 1.1) derives similar results for neural networks with sigmoidal activation functions (that tend to one at infinity) and the square loss only, and for a level of smoothness of the target function which grows with dimension (in this case, once can get easily rates of n1/2n^{-1/2}). Our result holds for problems where only bounded first-order derivatives are assumed, but by using Prop. 2, we would get similar rate by ensuring the target function belongs to F2\mathcal{F}_{2} and hence to F1\mathcal{F}_{1}.

Lower bounds.

In the sections above, we have only provided generalization bounds. Although interesting, deriving lower-bounds for the generalization performance when the target function belongs to certain function classes is out of the scope of this paper. Note however, that results from Sridharan (2012) suggest that the Rademacher complexities of the associated function classes provide such lower-bounds. For general Lipschitz-functions, these Rademacher complexities decreases as nmax{d,2}n^{-\max\{d,2\}} (von Luxburg and Bousquet, 2004).

We assume f(x)=wx+bf_{\ast}(x)=w^{\top}x+b, with w2η\|w\|_{2}\leqslant\eta and bRη|b|\leqslant R\eta. Given that we have assumed that ww has at most qq non-zeros, we have w1qη\|w\|_{1}\leqslant\sqrt{q}\eta.

Then, fF1f_{\ast}\in\mathcal{F}_{1} with γ1(f)C(α)ηrq\gamma_{1}(f)\leqslant C(\alpha)\eta r\sqrt{q}, with a constant that is independent of dd because we have an affine function.

From Prop. 7, we thus get a rate of Grηqlog(d)n\frac{Gr\eta\sqrt{q\log(d)}}{\sqrt{n}} times a constant (that may depend on α\alpha), which is the same as assuming directly that we optimize over linear predictors only (see, for example, Bühlmann and Van De Geer, 2011). We recover a high-dimensional phenomenon (although with a slow rate in 1/n1/\sqrt{n}), where dd may be much larger than nn, as long as logd\log d is small compared to nn. The chosen δ\delta is then a constant times rηqr\eta\sqrt{q} (and does not grow with nn).

Projection pursuit.

We assume f(x)=j=1kfj(wjx)f_{\ast}(x)=\sum_{j=1}^{k}f_{j}(w_{j}^{\top}x), with wj2η\|w_{j}\|_{2}\leqslant\eta (which implies wj1qη\|w_{j}\|_{1}\leqslant\sqrt{q}\eta given our sparsity assumption) and each fjf_{j} bounded by ηrq\eta r\sqrt{q} and 11-Lipschitz continuous. We may approach each xfj(wjx)x\mapsto f_{j}(w_{j}^{\top}x) by a function with γ1\gamma_{1}-norm less than δηrq\delta\eta r\sqrt{q} and uniform approximation C(α)ηrqδ1/αlogδC(\alpha)\eta r\sqrt{q}\delta^{-1/\alpha}\log\delta, with a constant that is independent of dd because we have a function of one-dimensional projection. This leads to a total approximation error of kC(α)Gηrqδ1/αlogδkC(\alpha)G\eta r\sqrt{q}\delta^{-1/\alpha}\log\delta for a norm less than kδηrqk\delta\eta r\sqrt{q}.

For α1\alpha\geqslant 1, the estimation error is kGrηδqlogdn\frac{kGr\eta\delta\sqrt{q\log d}}{\sqrt{n}}, with an overall bound of C(\alpha)kGr\sqrt{q}\eta\big{(}\delta^{-1/\alpha}\log\delta+\frac{\delta\sqrt{\log d}}{\sqrt{n}}\big{)}. With δ=(n/logd)α/2(α+1)\delta=(n/\log d)^{\alpha/2(\alpha+1)}, we get an optimized bound of C(α)kGrqηlogn(logd)1/(2α+2)n1/(2α+2)C(\alpha)kGr\sqrt{q}\eta\frac{\log n(\log d)^{1/(2\alpha+2)}}{n^{1/(2\alpha+2)}}, with a scaling only dependent in dd with a logarithmic factor.

For α=0\alpha=0, the target function belongs to F1\mathcal{F}_{1} with a norm less than kGrqηkGr\sqrt{q}\eta, leading to an overal bound of kGrηqlogdn\frac{kGr\eta\sqrt{q\log d}}{\sqrt{n}} (the sparsity is not helpful in this case).

Multi-dimensional projection pursuit.

We may approach each xFj(Wjx)x\mapsto F_{j}(W_{j}^{\top}x) by a function with γ1\gamma_{1}-norm less than δηrq\delta\eta r\sqrt{q} and uniform approximation C(α,s)ηrqδ1/(α+(s1)/2)logδC(\alpha,s)\eta r\sqrt{q}\delta^{-1/(\alpha+(s-1)/2)}\log\delta. This leads to a total approximation error of kC(α,s)Gηrqδ1/(α+(s1)/2)logδkC(\alpha,s)G\eta r\sqrt{q}\delta^{-1/(\alpha+(s-1)/2)}\log\delta.

For α1\alpha\geqslant 1, the estimation error is kGrqηδlogd/n{kGr\sqrt{q}\eta\delta\sqrt{\log d}}/{\sqrt{n}}, with an overall bound which is equal to C(\alpha,s)kGr\sqrt{q}\eta\big{(}\delta^{-1/(\alpha+(s-1)/2)}\log\delta+\frac{\delta\sqrt{\log d}}{\sqrt{n}}\big{)}. With δ=(n/logd)(α+(s1)/2)/(2α+s1)\delta=(n/\log d)^{(\alpha+(s-1)/2)/(2\alpha+s-1)}, we get an optimized bound of C(α,s)kGrqη(logd)1/(2α+s+1)n1/(2α+s+1)logn\displaystyle\frac{C(\alpha,s)kGr\sqrt{q}\eta(\log d)^{1/(2\alpha+s+1)}}{n^{1/(2\alpha+s+1)}}\log n.

For α=0\alpha=0, we have the bound C(s)kGrqη(n/d)1/(s+1)log(n/d)\frac{C(s)kGr\sqrt{q}\eta}{(n/d)^{1/(s+1)}}\log(n/d), that is we cannot use the sparsity as the problem is invariant to the chosen norm on hidden weights.

High-dimensional variable selection: when k=1k=1, s=qs=q and W1W_{1} is a projection onto qq variables, then we obtain a bound proportional to qη(logd)1/(2α+s+1)n1/(2α+s+1)logn\frac{\sqrt{q}\eta(\log d)^{1/(2\alpha+s+1)}}{n^{1/(2\alpha+s+1)}}\log n, which exhibits a high-dimensional scaling in a non-linear setting. Note that beyond sparsity, no assumption is made (in particular regarding correlations between input variables), and we obtain a high-dimensional phenomenon where dd may be much larger than nn.

4 Relationship to kernel methods and random sampling

5 Sufficient condition for polynomial-time algorithms

up to a constant factor. That is, there exists κ1\kappa\geqslant 1, such that for all yy and zz, we may compute v^\hat{v} such that v^p=1\|\hat{v}\|_{p}=1 and

This is provably NP-hard for α=0\alpha=0 (see Section 3.2), and for α=1\alpha=1 (see Section 3.3). If such an algorithm is available, the approximate conditional gradient presented in Section 2.5 leads to an estimator with the same generalization bound. Moreover, given the strong hardness results for improper learning in the situation α=0\alpha=0 (Klivans and Sherstov, 2006; Livni et al., 2014), a convex relaxation that would consider a larger set of predictors (e.g., by relaxing vvvv^{\top} into a symmetric positive-definite matrix), and obtained a constant approximation guarantee, is also ruled out.

Lower-bound on γ1\gamma_{1}: It is defined from functions φ^v^\hat{\varphi}_{\hat{v}}, for v^V^\hat{v}\in\hat{\mathcal{V}}, where for any vVv\in\mathcal{V}, there exists v^V^\hat{v}\in\hat{\mathcal{V}} such that φv=φ^v^\varphi_{v}=\hat{\varphi}_{\hat{v}}. This implies that the corresponding space F^1\hat{\mathcal{F}}_{1} is larger than F1\mathcal{F}_{1} and that if fF1f\in\mathcal{F}_{1}, then γ^1(f)γ1(f)\hat{\gamma}_{1}(f)\leqslant\gamma_{1}(f).

Polynomial-time algorithm for dual norm: The dual norm \displaystyle\sup_{\hat{v}\in\hat{\mathcal{V}}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}y_{i}\hat{\varphi}_{\hat{v}}(z_{i})\bigg{|} may be computed in polynomial time.

We may also replace the standard Gaussian vectors by Rademacher random variables.

We can then penalize by γ^\hat{\gamma} instead of γ\gamma. Since γ^1γ1\hat{\gamma}_{1}\leqslant\gamma_{1}, approximation properties are transferred, and because of the result above, the Rademacher complexity for γ^1\hat{\gamma}_{1}-balls scales as well as for γ1\gamma_{1}-balls. In the next section, we show convex relaxations which cannot achieve these and leave the existence or non-existence of such norm γ^1\hat{\gamma}_{1} as an open problem.

Convex relaxations of the Frank-Wolfe step

We present two relaxations, which are of the form described in Section 5.5 (leading to potential generalization bounds) but do not attain the proper approximation scaling (as was checked empirically).

Note that all relaxations that end up being Lipschitz-continuous functions of zz, will have at least the same scaling than the set of these functions. The Rademacher complexity of such functions is well-known, that is 1/n1/\sqrt{n} for d=1d=1, lognn\sqrt{\frac{\log n}{n}} for d=2d=2 and n1/dn^{-1/d} for larger dd (von Luxburg and Bousquet, 2004). Unfortunately, the decay in nn is too slow to preserve generalization bounds (which would require a scaling in 1/n1/\sqrt{n}).

We denote ui=(vzi)+=12vzi+12vziu_{i}=(v^{\top}z_{i})_{+}=\frac{1}{2}v^{\top}z_{i}+\frac{1}{2}|v^{\top}z_{i}|. We may then use 2uivzi=vzi2u_{i}-v^{\top}z_{i}=|v^{\top}z_{i}| and, for v2=1\|v\|_{2}=1, vvzi2=vzi=zivvzi\|vv^{\top}z_{i}\|_{2}=|v^{\top}z_{i}|=\sqrt{z_{i}^{\top}vv^{\top}z_{i}}. By denoting V=vvV=vv^{\top}, the constraint that ui=(vzi)+=12vzi+12vziu_{i}=(v^{\top}z_{i})_{+}=\frac{1}{2}v^{\top}z_{i}+\frac{1}{2}|v^{\top}z_{i}| is equivalent to

We obtain a convex relaxation when removing the rank constraint, that is

(n+d)𝑛𝑑(n+d)-dimensional relaxation.

which leads to a convex program in U=uuU=uu^{\top}, V=vvV=vv^{\top} and J=uvJ=uv^{\top}, that is a semidefinite program with d+nd+n dimensions, with the constraints

and the usual semi-definite contraints \displaystyle\left(\begin{array}[]{cc}U&J\\ J^{\top}&V\end{array}\right)\succcurlyeq\left(\begin{array}[]{c}u\\ v\end{array}\right)\left(\begin{array}[]{c}u\\ v\end{array}\right)^{\top}, with the additional constraint that 4Uii+ziVzi4δiJzi=trVzizi4U_{ii}+z_{i}^{\top}Vz_{i}-4\delta_{i}^{\top}Jz_{i}=\mathop{\rm tr}Vz_{i}z_{i}^{\top}.

If we add these constraints on top of the ones above, we obtain a tighter relaxation. Note that for this relaxation, we must have \big{[}(2u_{i}-v^{\top}z_{i})-(2u_{j}-v^{\top}z_{j})\big{]} less than a constant times zizj2\|z_{i}-z_{j}\|_{2}. Hence, the result mentioned above regarding Lipschitz-continuous functions and the scaling of the upper-bound for random yy holds (with the dependence on nn which is not good enough to preserve the generalization bounds with a polynomial-time algorithm).

2 Relaxation of sign vectors

Usual semi-definite constraint: \displaystyle\left(\begin{array}[]{cc}S&J\\ J^{\top}&V\end{array}\right)\succcurlyeq\left(\begin{array}[]{c}s\\ v\end{array}\right)\left(\begin{array}[]{c}s\\ v\end{array}\right)^{\top},

Unit/trace constraints: diag(S)=1\mathop{\rm diag}(S)=1 and trV=1\mathop{\rm tr}V=1,

Sign constraint: δiJximaxjiδjJxi\delta_{i}^{\top}Jx_{i}\geqslant\max_{j\neq i}|\delta_{j}^{\top}Jx_{i}|.

Additional constraint: (xiVxi)1/2δiJxi(x_{i}^{\top}Vx_{i})^{1/2}\leqslant\delta_{i}^{\top}Jx_{i}.

Conclusion

In this paper, we have provided a detailed analysis of the generalization properties of convex neural networks with positively homogenous non-decreasing activation functions. Our main new result is the adaptivity of the method to underlying linear structures such as the dependence on a low-dimensional subspace, a setting which includes non-linear variable selection in presence of potentially many input variables.

All our current results apply to estimators for which no polynomial-time algorithm is known to exist and we have proposed sufficient conditions under which convex relaxations could lead to the same bounds, leaving open the existence or non-existence of such algorithms. Interestingly, these problems have simple geometric interpretations, either as binary linear classification, or computing the Haussdorff distance between two zonotopes.

In this work, we have considered a single real-valued output; the functional analysis framework readily extends to outputs in a finite-dimensional vector-space where vector-valued measures could be used, and then apply to multi-task or multi-class problems. However, the extension to multiple hidden layers does not appear straightforward as the units of the last hidden layers share the weights of the first hidden layers, which should require a new functional analysis framework.

References

We follow the proof of Berlinet and Thomas-Agnan (2004, Section 4.1) and extend it to integrals rather than finite sums. We consider the linear mapping T:L2(dτ)F2T:L_{2}(d\tau)\to\mathcal{F}_{2} defined by (Tp)(x)=Vp(v)φv(x)dτ(v)(Tp)(x)=\int_{\mathcal{V}}p(v)\varphi_{v}(x)d\tau(v), with null space K\mathcal{K}. When restricted to the orthogonal complement K\mathcal{K}^{\perp}, we obtain a bijection UU from K\mathcal{K}^{\perp} to F2\mathcal{F}_{2}. We then define a dot-product on F2\mathcal{F}_{2} as f,g=V(U1f)(v)(U1g)(v)dτ(v)\langle f,g\rangle=\int_{\mathcal{V}}(U^{-1}f)(v)(U^{-1}g)(v)d\tau(v).

We first show that this defines an RKHS with kernel kk defined above. For this, we trivially have k(,y)F2k(\cdot,y)\in\mathcal{F}_{2} for all yXy\in\mathcal{X}. Moreover, for any yXy\in\mathcal{X}, we have with p=U1k(,y)Kp=U^{-1}k(\cdot,y)\in\mathcal{K}^{\perp} and q:vφv(y)q:v\mapsto\varphi_{v}(y), pqKp-q\in\mathcal{K}, which implies that f,k(,y)=V(U1f)(v)p(v)dτ(v)=V(U1f)(v)q(v)dτ(v)=V(U1f)(v)φv(y)dτ(v)=T(U1f)(y)=f(y)\langle f,k(\cdot,y)\rangle=\int_{\mathcal{V}}(U^{-1}f)(v)p(v)d\tau(v)=\int_{\mathcal{V}}(U^{-1}f)(v)q(v)d\tau(v)=\int_{\mathcal{V}}(U^{-1}f)(v)\varphi_{v}(y)d\tau(v)=T(U^{-1}f)(y)=f(y), hence the reproducing property is satisfied. Thus, F2\mathcal{F}_{2} is an RKHS.

We now need to show that the RKHS norm which we have defined is actually γ2\gamma_{2}. For any fF2f\in\mathcal{F}_{2} such that f=Tpf=Tp, for pL2(dτ)p\in L_{2}(d\tau), we have p=U1f+qp=U^{-1}f+q, where qKq\in\mathcal{K}. Thus, Vp(v)2dτ(v)=pL2(dτ)2=U1fL2(dτ)2+qL2(dτ)2=f2+qL2(dτ)2\int_{\mathcal{V}}p(v)^{2}d\tau(v)=\|p\|^{2}_{L_{2}(d\tau)}=\|U^{-1}f\|^{2}_{L_{2}(d\tau)}+\|q\|^{2}_{L_{2}(d\tau)}=\|f\|^{2}+\|q\|^{2}_{L_{2}(d\tau)}. This implies that Vp(v)2dτ(v)f2\int_{\mathcal{V}}p(v)^{2}d\tau(v)\geqslant\|f\|^{2} with equality if and only if q=0q=0. This shows that γ2(f)=f\gamma_{2}(f)=\|f\|.

B Approximate conditional gradient with multiplicative oracle

for a fixed κ1\kappa\geqslant 1. We now propose a modification of the conditional gradient algorithm that converges to a certain hh such that γ(h)δ\gamma(h)\leqslant\delta and for which infγ(h)δJ(h)J(h^)infγ(h)δ/κJ(h)\inf_{\gamma(h)\leqslant\delta}J(h)\leqslant J(\hat{h})\leqslant\inf_{\gamma(h)\leqslant\delta/\kappa}J(h).

We assume the smoothness of the function JJ with respect to the norm γ\gamma, that is, for a certain L>0L>0, for all h,hh,h^{\prime} such that γ(h)δ\gamma(h)\leqslant\delta, then

In the previous recursion, one may replace the minimization of JJ on the segment [ht,h^t][h_{t},\hat{h}_{t}] with the minimization of its upper-bound of Eq. (7) taken at h=hth=h_{t}. From the recursion, all iterates are in the γ\gamma-ball of radius δ\delta. Following the traditional convergence proof for the conditional gradient method (Dunn and Harshbarger, 1978; Jaggi, 2013), we have, for any ρ\rho in $$:

If we take hh_{\ast} the minimizer of JJ on {γ(h)δ/κ}\{\gamma(h)\leqslant\delta/\kappa\}, we get:

Then, by using J(ht)J(h)+J(ht),hhtJ(h_{t})\geqslant J(h_{\ast})+\langle J^{\prime}(h_{t}),h_{\ast}-h_{t}\rangle, we get:

This is valid for any ρ\rho\in. If J(ht)J(h)0J(h_{t})-J(h_{\ast})\leqslant 0 for some tt, then by taking ρ=0\rho=0 it remains the same of all greater tt. Therefore, up to (the potentially never happening) point where J(ht)J(h)0J(h_{t})-J(h_{\ast})\leqslant 0, we can apply the regular proof of the conditional gradien to obtain: J(ht)infγ(h)δ/κJ(h)+4Lρ2δ2tJ(h_{t})\leqslant\inf_{\gamma(h)\leqslant\delta/\kappa}J(h)+\frac{4L\rho^{2}\delta^{2}}{t}, which leads to the desired result. Note that a similar reasoning may be used for ρ=2/(t+1)\rho=2/(t+1).

C Proofs for the 222-dimensional sphere (d=1𝑑1d=1)

For k=0k=0, the same equality holds (except that the two coefficients g0g_{0} and p0p_{0} are divided by 2π2\pi except of π\pi).

We thus simply need to compute λk\lambda_{k} and its decay for all values of α\alpha, and then relate them to the smoothness properties of gg, which is standard for Fourier series.

We now detail the computation of λk=12π02πσ(cosη)coskηdη\lambda_{k}=\frac{1}{2\pi}\int_{0}^{2\pi}\sigma(\cos\eta)\cos k\eta\,d\eta for the different functions σ=()+α\sigma=(\cdot)_{+}^{\alpha}. We have for α=0\alpha=0:

For k=0k=0 it is equal to 12\frac{1}{2}. It is equal to zero for all other even kk, and different from zero for all odd kk, with λk\lambda_{k} going to zero as 1/k1/k.

For k=1k=1, it is equal to 1/41/4. It is equal to zero for all other odd kk, and different from zero for all even kk, with λk\lambda_{k} going to zero as 1/k21/k^{2}.

For k=0k=0, it is equal to 1/41/4, and for k=2k=2, it is equal to 1/81/8. It is equal to zero for all other even kk, and different from zero for all odd kk, with λk\lambda_{k} going to zero as 1/k31/k^{3}.

C.2 Proof of Prop. 2 for d=1𝑑1d=1

We only consider the proof for d=1d=1. For the proof for general dd, see Appendix D.3.

Given the zero values of λk\lambda_{k} given above, if gg has the opposite parity than α\alpha (that is, is even when α\alpha is odd, and vice-versa), then we may define p{p} through its Fourier series, which is obtained by multiplying the one of gg by a strictly positive sequence growing as kα+1k^{\alpha+1}.

Note that we could relax the assumption that gg is even (resp. odd) by adding all trigonometric polynomials of order less than α\alpha.

C.3 Proof of Prop. 3 for d=1𝑑1d=1

Again, we only consider the proof for d=1d=1. For the proof for general dd, see Appendix D.4.

Without loss of generality, we assume that η=1\eta=1. For d=1d=1, we essentially want to approximate a Lipschitz-continuous function by a function which is (α+1)(\alpha+1)-times differentiable.

For α=0\alpha=0, then the function gg is already in G2\mathcal{G}_{2} with a norm less than one, because Lipschitz-continuous functions are almost everywhere differentiable with bounded derivative (Adams and Fournier, 2003). We thus now consider α>0\alpha>0.

Given λk\lambda_{k} defined above and r(0,1)r\in(0,1), we define p^\hat{{p}} through

Since gg is 11-Lipschitz-continuous with constant 11, then it has a squared-integrable derivative f=gf=g^{\prime} with norm less than 1 (Adams and Fournier, 2003), so that

This implies that using λk1=O(kα+1)\lambda_{k}^{-1}=O(k^{\alpha+1}):

Computing distance between g^^𝑔\hat{g} and g𝑔g.

It can be easily checked that for any r(1/2,1)r\in(1/2,1), the last function is less than a constant times 52C(1r)log11r\frac{5}{2}C(1-r)\log\frac{1}{1-r}. We thus get for δ\delta large enough, by taking r=1(C/δ)1/α(1/2,1)r=1-(C/\delta)^{1/\alpha}\in(1/2,1), an error of

D Approximations on the d𝑑d-dimensional sphere

In this appendix, we first review tools from spherical harmonic analysis, before proving the two main propositions regarding the approximation properties of the Hilbert space G2\mathcal{G}_{2}. Using spherical harmonics in our set-up is natural and is common in the analysis of ridge functions (Petrushev, 1998) and zonotopes (Bourgain and Lindenstrauss, 1988).

In this section, we review relevant concepts from spherical harmonics. See Frye and Efthimiou (2012); Atkinson and Han (2012) for more details. Spherical harmonics may be seen as extension of Fourier series to spheres in dimensions more than 2 (i.e., with our convention d1d\geqslant 1).

Legendre polynomials.

where PkP_{k} is a Legendre polynomial of degree kk and dimension d+1d+1, defined as (Rodrigues’ formula):

They are also referred to as Gegenbauer polynomials. For d=1d=1, PkP_{k} is the kk-th Chebyshev polynomial, such that Pk(cosθ)=cos(kθ)P_{k}(\cos\theta)=\cos(k\theta) for all θ\theta (and we thus recover the Fourier series framework of Appendix C). For d=2d=2, PkP_{k} is the usual Legendre polynomial.

The polynomial PkP_{k} is even (resp. odd) when kk is even (resp. odd), and we have

For small kk, we have P0(t)=1P_{0}(t)=1, P1(t)=tP_{1}(t)=t, and P2(t)=(d+1)t21dP_{2}(t)=\frac{(d+1)t^{2}-1}{d}.

The Hecke-Funk formula leads to, for any linear combination YkY_{k} of YkjY_{kj}, j{1,,N(d,k)}j\in\{1,\dots,N(d,k)\}:

This is the decomposition in harmonics of degree kk. Note that

Decomposition of functions of one-dimensional projections.

If kα\mboxmod.2k\equiv\alpha\mbox{ mod. }2, then λk1211tαPk(t)(1t2)(d2)/2dt=0\lambda_{k}\propto\frac{1}{2}\int_{-1}^{1}t^{\alpha}P_{k}(t)(1-t^{2})^{(d-2)/2}dt=0, for k>αk>\alpha since PkP_{k} is orthogonal to all polynomials of degree strictly less than kk for that dot-product. Otherwise, λk0\lambda_{k}\neq 0, since tαt^{\alpha} may be decomposed as combination with non-zero coefficients of polynomials PjP_{j} for jα\mboxmod.2j\equiv\alpha\mbox{ mod. }2, jαj\leqslant\alpha.

We now provide an explicit formula extending the proof technique (for α=1\alpha=1) of Schneider (1967) and Bourgain and Lindenstrauss (1988) to all values of α\alpha. See also Mhaskar (2006).

We have, by α\alpha successive integration by parts, for kα+1k\geqslant\alpha+1:

By using Stirling formula Γ(x)xx1/2ex2π\Gamma(x)\approx x^{x-1/2}e^{-x}\sqrt{2\pi}, we get an equivalent when kk or dd tends to infinity as a constant (that depends on α\alpha) times

Note that all exponential terms cancel out. Moreover, when kk tends to infinity and dd is considered constant, then we get the equivalent kd/2α1/2k^{-d/2-\alpha-1/2}, which we need for the following sections. Finally, when dd tends to infinity and kk is considered constant, then we get the equivalent dα/2k/2+1/2d^{-\alpha/2-k/2+1/2}.

We will also need expressions of λk\lambda_{k} for k=0k=0 and k=1k=1. For k=0k=0, we have:

using the normalization factor of the Beta distribution. This leads to

which is equivalent to d1/2α/2d^{1/2-\alpha/2} as dd tends to infinity.

Moreover, for k=1k=1, we have (for α>0\alpha>0):

which is equivalent to dα/2d^{-\alpha/2} as dd tends to infinity.

Finally, for α=0\alpha=0, λ1=d12dπ\lambda_{1}=\frac{d-1}{2d\pi}. More generally, we have λkC(d)k(d1)/2α1|\lambda_{k}|\sim C(d)k^{-(d-1)/2-\alpha-1}.

Given gg with the correct parity, then we have

D.3 Proof of Prop. 2 for d>1𝑑1d>1

Moreover, since gg has the correct parity,

Also, gkg_{k} are eigenfunctions of the Laplacian with eigenvalues k(k+d1)k(k+d-1). Thus, we have

D.4 Proof of Prop. 3 for d>1𝑑1d>1

which is always defined when r(0,1)r\in(0,1) because the series is absolutely convergent. This defines a function g^\hat{g} that will have a finite γ2\gamma_{2}-norm and be close to gg.

The function p^\hat{{p}} thus defines a function g^G1\hat{g}\in\mathcal{G}_{1} by g^k=λkpk\hat{g}_{k}=\lambda_{k}{p}_{k}, for which γ2(g)C(d,α)(1r)(d+1)/2α\gamma_{2}(g)\leqslant C(d,\alpha)(1-r)^{(-d+1)/2-\alpha}.

Approximation properties.

We now show that gg and g^\hat{g} are close to each other. Because of the parity of gg, we have g^k=rkgk\hat{g}_{k}=r^{k}g_{k}. We have, using Theorem 4.28 from Frye and Efthimiou (2012):

Moreover, following Bourgain and Lindenstrauss (1988), we have:

As shown by Bourgain and Lindenstrauss (1988, Eq. (2.13)), this is less than a constant that depends on dd times (1r)log11r(1-r)\log\frac{1}{1-r}. We thus get for δ\delta large enough, by taking 1r=(C/δ)1/(α+(d1)/2)(0,1)1-r=(C/\delta)^{1/(\alpha+(d-1)/2)}\in(0,1), an error of

In this section, we consider functions on the sphere which have the proper parity with respect to α\alpha, which are ss-times differentiable with bounded derivatives, but which are not in G2\mathcal{G}_{2}. We then provide optimal approximation rates for these functions.

The summand has an asymptotic equivalent proportional to kd2s1kd1kd+2α+1=kd+2α2s1k^{-d-2s-1}k^{d-1}k^{d+2\alpha+1}=k^{d+2\alpha-2s-1}. Thus if d+2α2s0d+2\alpha-2s\geqslant 0, the series is divergent (the function is not in the RKHS), i.e., if sα+d2s\leqslant\alpha+\frac{d}{2}.

which should be of order δ2\delta^{2} (this gives the scaling of λ\lambda as a function of δ\delta). Then the squared error is

and thus the (non-squared) approximation error scales as δ(2s+1)/(d+2α2s).\delta^{-(2s+1)/(d+2\alpha-2s)}. For s=1s=1, this leads to a scaling as δ3/(d+2α2)\delta^{-3/(d+2\alpha-2)}.

D.6 Proof of Prop. 4

For α=1\alpha=1, by writing vx=(vx)+(vx)+v^{\top}x=(v^{\top}x)_{+}-(-v^{\top}x)_{+} we obtain the upperbound γ1(g)2\gamma_{1}(g)\leqslant 2. For all other situations, we may compute

We assume that we are given two ellipsoids defined as (xa)A1(xa)1(x-a)^{\top}A^{-1}(x-a)\leqslant 1 and (xb)B1(xb)1(x-b)^{\top}B^{-1}(x-b)\leqslant 1 and we want to compute their Hausdorff distance. This leads to the two equivalent problems

We consider the following convex optimization problem, with Q0Q\succcurlyeq 0; we have by Lagrangian duality:

If Q1q21\|Q^{-1}q\|_{2}\leqslant 1, then λ=0\lambda=0 and x=Q1qx=Q^{-1}q. Otherwise, at the optimum, λ>0\lambda>0 and x22=q(Q+λI)2q=1\|x\|_{2}^{2}=q^{\top}(Q+\lambda I)^{-2}q=1, which implies 11λ+λmin(Q)qQ1q1\leqslant\frac{1}{\lambda+\lambda_{\min}(Q)}q^{\top}Q^{-1}q, which leads to λqQ1qλmin(Q)\lambda\leqslant q^{\top}Q^{-1}q-\lambda_{\min}(Q), which is important to reduce the interval of possible λ\lambda. The optimal λ\lambda may then be obtained by binary search (from a single SVD of QQ).

Minimizing concave quadratic forms over the sphere.

We consider the following non-convex optimization problem, with Q0Q\succcurlyeq 0, for which strong Lagrangian duality is known to hold (Boyd and Vandenberghe, 2004):

At the optimum, we have q(λIQ)2q=1q^{\top}(\lambda I-Q)^{-2}q=1, which implies 11[λλmax(Q)]2q221\leqslant\frac{1}{[\lambda-\lambda_{\max}(Q)]^{2}}\|q\|_{2}^{2}, which leads to 0λλmax(Q)q20\leqslant\lambda-\lambda_{\max}(Q)\leqslant\|q\|_{2}. We may perform binary search on λ\lambda from a single SVD of QQ.

Computing the Haussdorff distance.

with v=(B+λI)1B1/2(ab+A1/2u)v=(B+\lambda I)^{-1}B^{1/2}(a-b+A^{1/2}u). The interval in λ\lambda which is sufficient to explore is

which are bounds that are independent of uu.

Given λ0\lambda\geqslant 0, we have the problem of

We have u=(μλIA1/2(B+λI)1A1/2)1A1/2(B+λI)1(ab),\displaystyle u=(\frac{\mu}{\lambda}I-A^{1/2}(B+\lambda I)^{-1}A^{1/2})^{-1}A^{1/2}(B+\lambda I)^{-1}(a-b), leading to w(λ1Bμ1A+I)(ab)\displaystyle w\propto(\lambda^{-1}B-\mu^{-1}A+I)(a-b). We need μλλmax(A1/2(B+λI)1A1/2)\frac{\mu}{\lambda}\geqslant\lambda_{\max}(A^{1/2}(B+\lambda I)^{-1}A^{1/2}). Moreover

The author was partially supported by the European Research Council (SIERRA Project), and thanks Nicolas Le Roux for helpful discussions. The author also thanks Varun Kanade for pointing the NP-hardness linear classification result. The main part of this work was carried through while visiting the Centre de Recerca Matemàtica (CRM) in Barcelona.