Breaking the Curse of Dimensionality with Convex Neural Networks

Francis Bach

Introduction

Supervised learning methods come in a variety of ways. They are typically based on local averaging methods, such as $k$ -nearest neighbors, decision trees, or random forests, or on optimization of the empirical risk over a certain function class, such as least-squares regression, logistic regression or support vector machine, with positive definite kernels, with model selection, structured sparsity-inducing regularization, or boosting (see, e.g., Györfi and Krzyzak, 2002; Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014, and references therein).

At the other end of the spectrum, parametric methods such as linear supervised learning make strong assumptions regarding the problem and generalization bounds based on estimation errors typically assume that the model is well-specified, and the sample complexity to attain an excess risk of $\varepsilon$ grows as $n=\Omega(d/\varepsilon^{2})$ , for linear functions in $d$ dimensions and Lipschitz-continuous loss functions (Shalev-Shwartz and Ben-David, 2014, Chapter 9). While the sample complexity is much lower, when the assumptions are not met, the methods underfit and more complex models would provide better generalization performances.

Affine functions: $f(x)=w^{\top}x+b$ , leading to potential severe underfitting, but easy optimization and good (i.e., non exponential) sample complexity.

Single hidden-layer neural networks: $f(x)=\sum_{j=1}^{k}\sigma(w_{j}^{\top}x+b_{j})$ , where $k$ is the number of units in the hidden layer (see, e.g., Rumelhart et al., 1986; Haykin, 1994). The activation function $\sigma$ is here assumed to be fixed. While the learning problem may be cast as a (sub)differentiable optimization problem, techniques based on gradient descent may not find the global optimum. If the number of hidden units is fixed, this is a parametric problem.

In this paper, our main aim is to answer the following question: Is there a single learning method that can deal efficiently with all situations above with provable adaptivity? We consider single-hidden-layer neural networks, with non-decreasing homogeneous activation functions such as

for $\alpha\in\{0,1,\dots\}$ , with a particular focus on $\alpha=0$ (with the convention that $0^{0}=0$ ), that is $\sigma(u)=1_{u>0}$ (a threshold at zero), and $\alpha=1$ , that is, $\sigma(u)=\max\{u,0\}=(u)_{+}$ , the so-called rectified linear unit (Nair and Hinton, 2010; Krizhevsky et al., 2012). We follow the convexification approach of Bengio et al. (2006); Rosset et al. (2007), who consider potentially infinitely many units and let a sparsity-inducing norm choose the number of units automatically. This leads naturally to incremental algorithms such as forward greedy selection approaches, which have a long history for single-hidden-layer neural networks (see, e.g. Breiman, 1993; Lee et al., 1996).

We provide in Section 2 a review of functional analysis tools used for learning from continuously infinitely many basis functions, by studying carefully the similarities and differences between $L_{1}$ - and $L_{2}$ -penalties on the output weights. For $L_{2}$ -penalties, this corresponds to a positive definite kernel and may be interpreted through random sampling of hidden weights. We also review incremental algorithms (i.e., forward greedy approaches) to learn from these infinite sets of basis functions when using $L_{1}$ -penalties.

The results are specialized in Section 3 to neural networks with a single hidden layer and activation functions which are positively homogeneous (such as the rectified linear unit). In particular, in Sections 3.2, 3.3 and 3.4, we provide simple geometric interpretations to the non-convex problems of additions of new units, in terms of separating hyperplanes or Hausdorff distance between convex sets. They constitute the core potentially hard computational tasks in our framework of learning from continuously many basis functions.

In Section 4, we provide a detailed theoretical analysis of the approximation properties of (single hidden layer) convex neural networks with monotonic homogeneous activation functions, with explicit bounds. We relate these new results to the extensive literature on approximation properties of neural networks (see, e.g., Pinkus, 1999, and references therein) in Section 4.7, and show that these neural networks are indeed adaptive to linear structures, by replacing the exponential dependence in dimension by an exponential dependence in the dimension of the subspace of the data can be projected to for good predictions.

We provide in Section 5.5 simple conditions for convex relaxations to achieve the same generalization error bounds, even when constant-factor approximation cannot be found (e.g., because it is NP-hard such as for the threshold activation function and the rectified linear unit). We present in Section 6 convex relaxations based on semi-definite programming, but we were not able to find strong enough convex relaxations (they provide only a provable sample complexity with a polynomial time algorithm which is exponential in the dimension $d$ ) and leave open the existence or non-existence of polynomial-time algorithms that preserve the non-exponential sample complexity.

Learning from continuously infinitely many basis functions

We consider the space $\mathcal{F}_{1}$ of functions $f$ that can be written as

where $\mu$ is a signed Radon measure on $\mathcal{V}$ with finite total variation $|\mu|(\mathcal{V})$ .

When $\mathcal{V}$ is finite, this corresponds to

with total variation $\sum_{v\in\mathcal{V}}|\mu_{v}|$ , where the proper formalization for infinite sets $\mathcal{V}$ is done through measure theory.

The infimum of $|\mu|(\mathcal{V})$ over all decompositions of $f$ as $f=\int_{\mathcal{V}}\varphi_{v}d\mu(v)$ , turns out to be a norm $\gamma_{1}$ on $\mathcal{F}_{1}$ , often called the variation norm of $f$ with respect to the set of basis functions (see, e.g., Kurkova and Sanguineti, 2001; Mhaskar, 2004).

Given our assumptions regarding the compactness of $\mathcal{V}$ , for any $f\in\mathcal{F}_{1}$ , the infimum defining $\gamma_{1}(f)$ is in fact attained by a signed measure $\mu$ , as a consequence of the compactness of measures for the weak topology (see Evans and Gariepy, 1991, Section 1.9).

In the definition above, if we assume that the signed measure $\mu$ has a density with respect to a fixed probability measure $\tau$ with full support on $\mathcal{V}$ , that is, $d\mu(v)={p}(v)d\tau(v)$ , then, the variation norm $\gamma_{1}(f)$ is also equal to the infimal value of

over all integrable functions $p$ such that $f(x)=\int_{\mathcal{V}}{p}(v)\varphi_{v}(x)d\tau(v)$ . Note however that not all measures have densities, and that the two infimums are the same as all Radon measures are limits of measures with densities. Moreover, the infimum in the definition above is not attained in general (for example when the optimal measure is singular with respect to $d\tau$ ); however, it often provides a more intuitive definition of the variation norm, and leads to easier comparisons with Hilbert spaces in Section 2.3.

2 Representation from finitely many functions

When minimizing any functional $J$ that depends only on the function values taken at a subset $\hat{\mathcal{X}}$ of values in $\mathcal{X}$ , over the ball $\{f\in\mathcal{F}_{1},\ \gamma_{1}(f)\leqslant\delta\}$ , then we have a “representer theorem” similar to the reproducing kernel Hilbert space situation, but also with significant differences, which we now present.

we can then build a function defined over all $\mathcal{X}$ , through the optimal measure $\mu$ above.

Moreover, by Carathéodory’s theorem for cones (Rockafellar, 1997), if $\hat{\mathcal{X}}$ is composed of only $n$ elements (e.g., $n$ is the number of observations in machine learning), the optimal function $f_{|\hat{\mathcal{X}}}$ above (and hence $f$ ) may be decomposed into at most $n$ functions $\varphi_{v}$ , that is, $\mu$ is supported by at most $n$ points in $\mathcal{V}$ , among a potential continuum of possibilities.

Note however that the identity of these $n$ functions is not known in advance, and thus there is a significant difference with the representer theorem for positive definite kernels and Hilbert spaces (see, e.g., Shawe-Taylor and Cristianini, 2004), where the set of $n$ functions are known from the knowledge of the points $x\in\hat{\mathcal{X}}$ (i.e., kernel functions evaluated at $x$ ).

3 Corresponding reproducing kernel Hilbert space (RKHS)

We have seen above that if the real-valued measures $\mu$ are restricted to have density $p$ with respect to a fixed probability measure $\tau$ with full support on $\mathcal{V}$ , that is, $d\mu(v)={p}(v)d\tau(v)$ , then, the norm $\gamma_{1}(f)$ is the infimum of the total variation $|\mu|(\mathcal{V})=\int_{\mathcal{V}}|{p}(v)|d\tau(v),$ over all decompositions $f(x)=\int_{\mathcal{V}}{p}(v)\varphi_{v}(x)d\tau(v)$ .

We may also define the infimum of $\int_{\mathcal{V}}|{p}(v)|^{2}d\tau(v)$ over the same decompositions (squared $L_{2}$ -norm instead of $L_{1}$ -norm). It turns out that it defines a squared norm $\gamma_{2}^{2}$ and that the function space $\mathcal{F}_{2}$ of functions with finite norm happens to be a reproducing kernel Hilbert space (RKHS). When $\mathcal{V}$ is finite, then it is well-known (see, e.g., Berlinet and Thomas-Agnan, 2004, Section 4.1) that the infimum of $\sum_{v\in\mathcal{V}}\mu_{v}^{2}$ over all vectors $\mu$ such that $f=\sum_{v\in V}\mu_{v}\varphi_{v}$ defines a squared RKHS norm with positive definite kernel $k(x,y)=\sum_{v\in V}\varphi_{v}(x)\varphi_{v}(y)$ .

We show in Appendix A that for any compact set $\mathcal{V}$ , we have defined a squared RKHS norm $\gamma_{2}^{2}$ with positive definite kernel $\displaystyle k(x,y)=\int_{\mathcal{V}}\varphi_{v}(x)\varphi_{v}(y)d\tau(v)$ .

When $m$ tends to infinity, then $\hat{k}(x,y)$ tends to $k(x,y)$ and random sampling provides a way to work efficiently with explicit $m$ -dimensional feature spaces. See Rahimi and Recht (2007) for a analysis of the number of units needed for an approximation with error $\varepsilon$ , typically of order $1/\varepsilon^{2}$ . See also Bach (2015) for improved results with a better dependence on $\varepsilon$ when making extra assumptions on the eigenvalues of the associated covariance operator.

The corresponding RKHS norm is always greater than the variation norm (because of Jensen’s inequality), and thus the RKHS $\mathcal{F}_{2}$ is included in $\mathcal{F}_{1}$ . However, as shown in this paper, the two spaces $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ have very different properties; e.g., $\gamma_{2}$ may be computed easily in several cases, while $\gamma_{1}$ does not; also, learning with $\mathcal{F}_{2}$ may either be done by random sampling of sufficiently many weights or using kernel methods, while $\mathcal{F}_{1}$ requires dedicated convex optimization algorithms with potentially non-polynomial-time steps (see Section 2.5).

Moreover, for any $v\in\mathcal{V}$ , $\varphi_{v}\in\mathcal{F}_{1}$ with a norm $\gamma_{1}(\varphi_{v})\leqslant 1$ , while in general $\varphi_{v}\notin\mathcal{F}_{2}$ . This is a simple illustration of the fact that $\mathcal{F}_{2}$ is too small and thus will lead to a lack of adaptivity that will be further studied in Section 5.4 for neural networks with certain activation functions.

4 Supervised machine learning

that is, the excess risk $J(\hat{f})-\inf_{f\in\mathcal{F}}J(f)$ is upper-bounded by a sum of an approximation error $\inf_{f\in\mathcal{F}^{\delta}}J(f)-\inf_{f\in\mathcal{F}}J(f)$ , an estimation error $2\sup_{f\in\mathcal{F}^{\delta}}|\hat{J}(f)-J(f)|$ and an optimization error $\varepsilon$ (see also Bottou and Bousquet, 2008). In this paper, we will deal with all three errors, starting from the optimization error which we now consider for the space $\mathcal{F}_{1}$ and its variation norm.

5 Incremental conditional gradient algorithms

We assume the functional $J$ is convex and $L$ -smooth, that is for all $h\in L_{2}(d\rho)$ , there exists a gradient $J^{\prime}(h)\in L_{2}(d\rho)$ such that for all $f\in L_{2}(d\rho)$ ,

When $\mathcal{X}$ is finite, this corresponds to the regular notion of smoothness from convex optimization (Nesterov, 2004).

The conditional gradient algorithm (a.k.a. Frank-Wolfe algorithm) is an iterative algorithm, starting from any function $f_{0}\in\mathcal{F}_{1}^{\delta}$ and with the following recursion, for $t\geqslant 0$ :

See an illustration in Figure 1. We may choose either $\rho_{t}=\frac{2}{t+1}$ or perform a line search for $\rho_{t}\in$ . For all of these strategies, the $t$ -th iterate is a convex combination of the functions $\bar{f}_{0},\dots,\bar{f}_{t-1}$ , and is thus an element of $\mathcal{F}_{1}^{\delta}$ . It is known that for these two strategies for $\rho_{t}$ , we have the following convergence rate (see, e.g. Jaggi, 2013):

When, $r^{2}=\sup_{v\in\mathcal{V}}\|\varphi_{v}\|^{2}_{L_{2}(d\rho)}$ is finite, we have $\|f\|^{2}_{L_{2}(d\rho)}\leqslant r^{2}\gamma_{1}(f)^{2}$ and thus we get a convergence rate of $\frac{2Lr^{2}\delta^{2}}{t+1}$ .

Moreover, the basic Frank-Wolfe (FW) algorithm may be extended to handle the regularized problem as well (Harchaoui et al., 2013; Bach, 2013; Zhang et al., 2012), with similar convergence rates in $O(1/t)$ . Also, the second step in the algorithm, where the function $f_{t+1}$ is built in the segment between $f_{t}$ and the newly found extreme function, may be replaced by the optimization of ${J}$ over the convex hull of all functions $\bar{f}_{0},\dots,\bar{f}_{t}$ , a variant which is often referred to as fully corrective. Moreover, in our context where $\mathcal{V}$ is a space where local search techniques may be considered, there is also the possibility of “fine-tuning” the vectors $v$ as well (Bengio et al., 2006), that is, we may optimize the function $(v_{1},\dots,v_{t},\alpha_{1},\dots,\alpha_{t})\mapsto J\big{(}\sum_{i=1}^{t}\alpha_{i}\varphi_{v_{i}}\big{)}$ , through local search techniques, starting from the weights $(\alpha_{i})$ and points $(v_{i})$ obtained from the conditional gradient algorithm.

Adding a new basis function.

The conditional gradient algorithm presented above relies on solving at each iteration the “Frank-Wolfe step”:

for $g=-J^{\prime}(f_{t})\in L_{2}(d\rho)$ . For the norm $\gamma_{1}$ defined through an $L_{1}$ -norm, we have for $f=\int_{\mathcal{V}}\varphi_{v}d\mu(v)$ such that $\gamma_{1}(f)=|\mu|(\mathcal{V})$ :

with equality if and only if $\mu=\mu_{+}-\mu_{-}$ with $\mu_{+}$ and $\mu_{-}$ two non-negative measures, with $\mu_{+}$ (resp. $\mu_{-}$ ) supported in the set of maximizers $v$ of $\big{|}\int_{\mathcal{X}}\varphi_{v}(x)g(x)d\rho(x)\big{|}$ where the value is positive (resp. negative).

with the maximizers $f$ of the first optimization problem above (left-hand side) obtained as $\delta$ times convex combinations of $\varphi_{v}$ and $-\varphi_{v}$ for maximizers $v$ of the second problem (right-hand side).

A common difficulty in practice is the hardness of the Frank-Wolfe step, that is, the optimization problem above over $\mathcal{V}$ may be difficult to solve. See Section 3.2, 3.3 and 3.4 for neural networks, where this optimization is usually difficult.

Finitely many observations.

where the set of solutions of the first problem is in the convex hull of the solutions of the second problem.

Non-smooth loss functions.

In this paper, in our theoretical results, we consider non-smooth loss functions for which conditional gradient algorithms do not converge in general. One possibility is to smooth the loss function, as done by Nesterov (2005): an approximation error of $\varepsilon$ may be obtained with a smoothness constant proportional to $1/\varepsilon$ . By choosing $\varepsilon$ as $1/\sqrt{t}$ , we obtain a convergence rate of $O(1/\sqrt{t})$ after $t$ iterations. See also Lan (2013).

Approximate oracles.

The conditional gradient algorithm may deal with approximate oracles; however, what we need in this paper is not the additive errors situations considered by Jaggi (2013), but multiplicative ones on the computation of the dual norm (similar to ones derived by Bach (2013) for the regularized problem).

Indeed, in our context, we minimize a function $J(f)$ on $f\in L_{2}(d\rho)$ over a norm ball $\{\gamma_{1}(f)\leqslant\delta\}$ . A multiplicative approximate oracle outputs for any $g\in L_{2}(d\rho)$ , a vector $\hat{f}\in L_{2}(d\rho)$ such that $\gamma_{1}(\hat{f})=1$ , and

for a fixed $\kappa\geqslant 1$ . In Appendix B, we propose a modification of the conditional gradient algorithm that converges to a certain $h\in L_{2}(d\rho)$ such that $\gamma_{1}(h)\leqslant\delta$ and for which $\inf_{\gamma_{1}(f)\leqslant\delta}J(f)\leqslant J(h)\leqslant\inf_{\gamma_{1}(f)\leqslant\delta/\kappa}J(f)$ .

Such approximate oracles are not available in general, because they require uniform bounds over all possible values of $g\in L_{2}(d\rho)$ . In Section 5.5, we show that a weaker form of oracle is sufficient to preserve our generalization bounds from Section 5.

Approximation of any function by a finite number of basis functions.

Neural networks with non-decreasing positively homogeneous activation functions

In this paper, we focus on a specific family of basis functions, that is, of the form

for specific activation functions $\sigma$ . We assume that $\sigma$ is non-decreasing and positively homogeneous of some integer degree, i.e., it is equal to $\sigma(u)=(u)_{+}^{\alpha}$ , for some $\alpha\in\{0,1,\dots\}$ . We focus on these functions for several reasons:

Since they are not polynomials, linear combinations of these functions can approximate any measurable function (Leshno et al., 1993).

By homogeneity, they are invariant by a change of scale of the data; indeed, if all observations $x$ are multiplied by a constant, we may simply change the measure $\mu$ defining the expansion of $f$ by the appropriate constant to obtain exactly the same function. This allows us to study functions defined on the unit-sphere.

The special case $\alpha=1$ , often referred to as the rectified linear unit, has seen considerable recent empirical success (Nair and Hinton, 2010; Krizhevsky et al., 2012), while the case $\alpha=0$ (hard thresholds) has some historical importance (Rosenblatt, 1958).

The goal of this section is to specialize the results from Section 2 to this particular case and show that the “Frank-Wolfe” steps have simple geometric interpretations.

We first show that the positive homogeneity of the activation functions allows to transfer the problem to a unit sphere.

This implies by Hölder’s inequality that $\varphi_{v}(x)^{2}\leqslant 2^{\alpha}$ . Moreover this leads to functions in $\mathcal{F}_{1}$ that are bounded everywhere, that is, $\forall f\in\mathcal{F}_{1}$ , $f(x)^{2}\leqslant 2^{\alpha}\gamma_{1}(f)^{2}$ . Note that the functions in $\mathcal{F}_{1}$ are also Lipschitz-continuous for $\alpha\geqslant 1$ .

Homogeneous reformulation.

that is $\gamma_{1}(f)\leqslant\gamma_{1}(g)$ , because we have assumed that $(w^{\top},b/R)^{\top}$ is on the $(1/R)$ -sphere.

The only remaining important aspect is to define $g$ on the entire sphere, so that (a) its regularity constants are controlled by a constant times the ones on the portion of the sphere where it is already defined, (b) $g$ is either even or odd (this will be important in Section 4). Ensuring that the regularity conditions can be met is classical when extending to the full sphere (see, e.g., Whitney, 1934). Ensuring that the function may be chosen as odd or even may be obtained by multiplying the function $g$ by an infinitely differentiable function which is equal to one for $a\geqslant 1/\sqrt{2}$ and zero for $a\leqslant 0$ , and extending by $-g$ or $g$ on the hemi-sphere $a<0$ .

In summary, we may consider in Section 4 functions defined on the sphere, which are much easier to analyze. In the rest of the section, we specialize some of the general concepts reviewed in Section 2 to our neural network setting with specific activation functions, namely, in terms of corresponding kernel functions and geometric reformulations of the Frank-Wolfe steps.

1 Corresponding positive-definite kernels

There are key differences and similarities between the RKHS $\mathcal{F}_{2}$ and our space of functions $\mathcal{F}_{1}$ . The RKHS is smaller than $\mathcal{F}_{1}$ (i.e., the norm in the RKHS is larger than the norm in $\mathcal{F}_{1}$ ); this implies that approximation properties of the RKHS are transferred to $\mathcal{F}_{1}$ . In fact, our proofs rely on this fact.

However, the RKHS norm does not lead to any adaptivity, while the function space $\mathcal{F}_{1}$ does (see more details in Section 5). This may come as a paradox: both the RKHS $\mathcal{F}_{2}$ and $\mathcal{F}_{1}$ have similar properties, but one is adaptive while the other one is not. A key intuitive difference is as follows: given a function $f$ expressed as $f(x)=\int_{\mathcal{V}}\varphi_{v}(x){p}(v)d\tau(v)$ , then $\gamma_{1}(f)=\int_{\mathcal{V}}|{p}(v)|d\tau(v)$ , while the squared RKHS norm is $\gamma_{2}(f)^{2}=\int_{\mathcal{V}}|{p}(v)|^{2}d\tau(v)$ . For the $L_{1}$ -norm, the measure ${p}(v)d\tau(v)$ may tend to a singular distribution with a bounded norm, while this is not true for the $L_{2}$ -norm. For example, the function $(w^{\top}x+b)_{+}^{\alpha}$ is in $\mathcal{F}_{1}$ , while it is not in $\mathcal{F}_{2}$ in general.

2 Incremental optimization problem for α=0𝛼0\alpha=0

where $I_{+}=\{i,y_{i}\geqslant 0\}$ and $I_{-}=\{i,y_{i}<0\}$ . As outlined by Bengio et al. (2006), this is equivalent to finding an hyperplane parameterized by $v$ that minimizes a weighted mis-classification rate (when doing linear classification). Note that the norm of $v$ has no effect.

Convex relaxation.

Given linear binary classification problems, there are several algorithms to approximately find a good half-space. These are based on using convex surrogates (such as the hinge loss or the logistic loss). Although some theoretical results do exist regarding the classification performance of estimators obtained from convex surrogates (Bartlett et al., 2006), they do not apply in the context of linear classification.

3 Incremental optimization problem for α=1𝛼1\alpha=1

For the problem of maximizing $\big{|}\sum_{i=1}^{n}y_{i}(v^{\top}z_{i})_{+}\big{|}$ , then this corresponds to

This is exactly the Hausdorff distance between the two convex sets $\{T_{+}^{\top}b_{+},\ b_{+}\in^{n_{+}}\}$ and $\{T_{-}^{\top}b_{-},\ b_{-}\in^{n_{-}}\}$ (referred to as zonotopes, see below).

Given the pair $(b_{+},b_{-})$ achieving the Hausdorff distance, then we may compute the optimal $v$ as $v=\arg\max_{\|v\|_{p}\leqslant 1}v^{\top}\big{(}T_{+}^{\top}b_{+}-T_{-}^{\top}b_{-}\big{)}$ . Note this has not changed the problem at all, since it is equivalent. It is still NP-hard in general (König, 2014). But we now have a geometric interpretation with potential approximation algorithms. See below and Section 6.

A zonotope $A$ is the Minkowski sum of a finite number of segments from the origin, that is, of the form

In our context, the two convex sets $\{T_{+}^{\top}b_{+},\ b_{+}\in^{n_{+}}\}$ and $\{T_{-}^{\top}b_{-},\ b_{-}\in^{n_{-}}\}$ defined above are thus zonotopes. See an illustration of the Hausdorff distance computation in Figure 4 (middle plot), which is the core computational problem for $\alpha=1$ .

Approximation by ellipsoids.

Centrally symmetric convex polytopes (w.l.o.g. centered around zero) may be approximated by ellipsoids. In our set-up, we could use the minimum volume enclosing ellipsoid (see, e.g. Barvinok, 2002), which can be computed exactly when the polytope is given through its vertices, or up to a constant factor when the polytope is such that quadratic functions may be optimized with a constant factor approximation. For zonotopes, the standard semi-definite relaxation of Nesterov (1998) leads to such constant-factor approximations, and thus the minimum volume inscribed ellipsoid may be computed up to a constant. Given standard results (see, e.g. Barvinok, 2002), a $(1/\sqrt{d})$ -scaled version of the ellipsoid is inscribed in this polytope, and thus the ellipsoid is a provably good approximation of the zonotope with a factor scaling as $\sqrt{d}$ . However, the approximation ratio is not good enough to get any relevant bound for our purpose (see Section 5.5), as for computing the Haussdorff distance, we care about potentially vanishing differences that are swamped by constant factor approximations.

NP-hardness.

Given the reduction of the case $\alpha=1$ (rectified linear units) to $\alpha=0$ (exact thresholds) (Livni et al., 2014), the incremental problem is also NP-hard, so as obtaining a constant-factor approximation. However, this does not rule out convex relaxations with non-constant approximation ratios (see Section 6 for more details).

4 Incremental optimization problem for α⩾2𝛼2\alpha\geqslant 2

Note that, while for $\alpha=1$ we can use the identity $2u_{+}=u+|u|$ to replace the rectified linear unit by the absolute value and obtain the same function space, this is not possible for $\alpha=2$ , as $(u_{+})^{2}$ and $u^{2}$ do not differ by a linear function. This implies that the results from Livni et al. (2014), which state that for the quadratic activation function, the incremental problems is equivalent to an eigendecomposition (and hence solvable in polynomial time), do not apply.

Approximation properties

In this section, we first consider approximation properties of functions in $\mathcal{G}_{1}$ by a finite number of neurons (only for $\alpha=1$ ). We then study approximation properties of functions on the sphere by functions in $\mathcal{G}_{1}$ . It turns out that all our results are based on the approximation properties of the corresponding RKHS $\mathcal{G}_{2}$ : we give sufficient conditions for being in $\mathcal{G}_{2}$ , and then approximation bounds for functions which are not in $\mathcal{G}_{2}$ . Finally we transfer these to the spaces $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ , and consider in particular functions which only depend on projections on a low-dimensional subspace, for which the properties of $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ (and of $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ ) differ. This property is key to obtaining generalization bounds that show adaptivity to linear structures in the prediction functions (as done in Section 5).

Approximation properties of neural networks with finitely many neurons have been studied extensively (see, e.g., Petrushev, 1998; Pinkus, 1999; Makovoz, 1998; Burger and Neubauer, 2001). In Section 4.7, we relate our new results to existing work from the literature on approximation theory, by showing that our results provide an explicit control of the various weight vectors which are needed for bounding the estimation error in Section 5.

Positively homogenous convex functions $h$ may be written as the support function of a compact convex set $K$ (Rockafellar, 1997), that is, $h(z)=\max_{y\in K}y^{\top}z$ , and the set $K$ characterizes the function $h$ . The functions $g_{+}$ and $g_{-}$ defined above are not any convex positively homogeneous functions, as we now describe.

If the measure $\mu_{+}$ is supported by finitely many points, that is, $\mu_{+}(v)=\sum_{i=1}^{r}\eta_{i}\delta(v-v_{i})$ with $\eta\geqslant 0$ , then $g_{+}(z)=\sum_{i=1}^{t}\eta_{i}(v_{i}^{\top}z)_{+}=\sum_{i=1}^{t}(\eta_{i}v_{i}^{\top}z)_{+}=\sum_{i=1}^{t}(t_{i}^{\top}z)_{+}$ for $t_{i}=\eta_{i}v_{i}$ . Thus the corresponding set $K_{+}$ is the zonotope $[0,t_{1}]+\cdots+[0,t_{r}]=\big{\{}\sum_{i=1}^{r}b_{i}t_{i},\ b\in^{r}\big{\}}$ already defined in Section 3.3. Thus the functions $g_{+}\in\mathcal{G}_{1}$ and $g_{-}\in\mathcal{G}_{1}$ for finitely supported measures $\mu$ are support functions of zonotopes.

When the measure $\mu$ is not constrained to have finite support, then the sets $K_{+}$ and $K_{-}$ are limits of zonotopes, and thus, by definition, zonoids (Bolker, 1969), and thus functions in $\mathcal{G}_{1}$ are differences of support functions of zonoids. Zonoids are a well-studied set of convex bodies. They are centrally symmetric, and in two dimensions, all centrally symmetric compact convexs sets are (up to translation) zonoids, which is not true in higher dimensions (Bolker, 1969). Moreover, the problem of approximating a zonoid by a zonotope with a small number of segments (Bourgain et al., 1989; Matoušek, 1996) is essentially equivalent to the approximation of a function $g$ by finitely many neurons. The number of neurons directly depends on the norm $\gamma_{1}$ , as we now show.

with $r\leqslant C(d)\varepsilon^{-2+6/(d+3)}=C(d)\varepsilon^{-2d/(d+3)}$ , for some constant $C(d)$ that depends only on $d$ . We may then simply write

Finally, the number of neurons needed to express a function with a bound on the $\gamma_{2}$ -norm can be estimated from general results on approximating reproducing kernel Hilbert space described in Section 2.3, whose kernel can be expressed as an expectation. Indeed, Bach (2015) shows that with $k$ neurons, one can approximate a function in $\mathcal{F}_{2}$ with unit $\gamma_{2}$ -norm with an error measured in $L_{2}$ of $\varepsilon=k^{-(d+3)/(2d)}$ . When inverting the relationship between $k$ and $\varepsilon$ , we get a number of neurons scaling as $\varepsilon^{-2d/(d+3)}$ , which is the same as in Prop. 1 but with an error in $L_{2}$ -norm instead of $L^{\infty}$ -norm.

2 Sufficient conditions for finite variation

Tightness of conditions: as shown in Appendix D.5, there are functions $g$ , which have bounded first $s$ derivatives and do not belong to $\mathcal{G}_{2}$ while $s\leqslant\frac{d}{2}+\alpha$ (at least when $s-\alpha$ is even). Therefore, when $s-\alpha$ is even, the scaling in ${(d-1)}/{2}+\alpha$ is optimal.

Dependence on $\alpha$ : for any $d$ , the higher the $\alpha$ , the stricter the sufficient condition. Given that the estimation error grows slowly with $\alpha$ (see Section 5.1), low values of $\alpha$ would be preferred in practice.

Dependence on $d$ : a key feature of the sufficient condition is the dependence on $d$ , that is, as $d$ increases the number of derivatives has to increase in $d/2$ —like for Sobolev spaces in dimension $d$ (Adams and Fournier, 2003). This is another instantiation of the curse of dimensionality: only very smooth functions in high dimensions are allowed.

3 Approximation of Lipschitz-continuous functions

In order to derive generalization bounds for target functions which are not sufficiently differentiable (and may not be in $\mathcal{G}_{2}$ or $\mathcal{G}_{1}$ ), we need to approximate any Lipschitz-continuous function, with a function $g\in\mathcal{G}_{2}$ with a norm $\gamma_{2}(g)$ that will grow as the approximation gets tighter. We give precise rates in the proposition below. Note the requirement for parity of the function $g$ . The result below notably shows the density of $\mathcal{G}_{1}$ in uniform norm in the space of Lipschitz-continuous functions of the given parity, which is already known since our activation functions are not polynomials (Leshno et al., 1993).

This proposition is shown in Appendix C.3 for $d=1$ (using Fourier series) and in Appendix D.4 for all $d\geqslant 1$ (using spherical harmonics). We can make the following observations:

Dependence in $\delta$ and $\eta$ : as expected, the main term in the error bound $\big{(}{\delta}/{\eta}\big{)}^{-1/(\alpha+(d-1)/2)}$ is a decreasing function of $\delta/\eta$ , that is when the norm $\gamma_{2}(h)$ is allowed to grow, the approximation gets tighter, and when the Lipschitz constant of $g$ increases, the approximation is less tight.

Dependence on $d$ and $\alpha$ : the rate of approximation is increasing in $d$ and $\alpha$ . In particular the approximation properties are better for low $\alpha$ .

Special case $d=1$ and $\alpha=0$ : up to the logarithmic term we recover the result of Prop. 2, that is, the function $g$ is in $\mathcal{G}_{2}$ .

Tightness: in Appendix D.5, we provide a function which is not in the RKHS and for which the tightest possible approximation scales as $\delta^{-2/(d/2+\alpha-2)}$ . Thus the linear scaling of the rate as $d/2+\alpha$ is not improvable (but constants are).

4 Linear functions

We see that for $\alpha=1$ , the $\gamma_{1}$ -norm is less than a constant, and is much smaller than the $\gamma_{2}$ -norm (which scales as $\sqrt{d}$ ). For $\alpha\geqslant 2$ , we were not able to derive better bounds for $\gamma_{1}$ (other than the value of $\gamma_{2})$ .

5 Functions of projections

However, for the RKHS norm $\gamma_{2}$ , this reasoning does not apply. For example, a certain function $\varphi$ exists, which is $s$ -times differentiable, as shown in Appendix D.5, for $s\leqslant\frac{d}{2}+\alpha$ (when $s-\alpha$ is even), and is not in $\mathcal{G}_{2}$ . Thus, given Prop. 2, the dependence on a uni-dimensional projection does not make a difference regarding the level of smoothness which is required to belong to $\mathcal{G}_{2}$ .

Proof By assumption, the function $x\mapsto f(Rx)$ has all its derivatives bounded by a constant times $\eta$ . Moreover, we have defined $g(t,a)=f\big{(}\frac{Rt}{a}\big{)}a^{\alpha}$ so that all derivatives are bounded by $\eta$ . The result then follows immediately from Prop. 2.

Proof With the same reasoning as above, we obtain that $g$ is Lipschitz-continuous with constant $\eta$ , we thus get the desired approximation error from Prop. 3.

If $f(x)=w^{\top}x+b$ , with $\|w\|_{2}\leqslant\eta$ and $b\leqslant\eta R$ , then for $\alpha=1$ , it is straightforward that $\gamma_{1}(f)\leqslant 2R\eta$ . Moreover, we have $\gamma_{2}(f)\sim CR\eta$ . For other values of $\alpha$ , we also have $\gamma_{1}$ -norms less than a constant (depending only of $\alpha$ ) times $R\eta$ . The RKHS norms are bit harder to compute since linear functions for $f$ leads to linear functions for $g$ only for $\alpha=1$ .

Functions of projections.

7 Related work

In this section, we show how our results from the previous sections relate to existing work on neural network approximation theory.

In this section, we only consider the case $\alpha=1$ , for which we have two approximation bounds: Prop. 6 which approximates any $\eta$ -Lipschitz-continuous function by a function with finite $\gamma_{1}$ -norm less than $\delta$ and uniform error less than $\displaystyle\eta\big{(}{\delta}/{\eta}\big{)}^{-2/(d+1)}\log\big{(}{\delta}/{\eta}\big{)}$ , and Prop. 1 which shows that a function with $\gamma_{1}$ -norm less than $\delta$ , may be approximated with $r$ neurons with uniform error $\delta r^{-(d+3)/(2d)}$ .

Thus, given $r$ neurons, we get an approximation of the original function with uniform error

We can optimize over $\delta$ , and use $\delta=\eta n^{(d+1)/(2d)}$ , to obtain a uniform approximation bound proportional to ${\eta(\log n)}{n^{-1/d}}$ , for approximating an $\eta$ -Lipschitz-continuous function with $n$ neurons.

Approximation by ridge functions.

The approximation properties of single hidden layer neural networks have been studied extensively, where they are often referred to as “ridge function” approximations. As shown by Pinkus (1999, Corollary 6.10)—based on a result from Petrushev (1998), the approximation order of $n^{-1/d}$ for the rectified linear unit was already known, but only in $L_{2}$ -norm (and without the factor $\log n$ ), and without any constraints on the input and output weights. In this paper, we provide an explicit control of the various weights, which is needed for computing estimation errors. Moreover, while the two proof techniques use spherical harmonics, the proof of Petrushev (1998) relies on quadrature formulas for the associated Legendre polynomials, while ours relies on the relationship with the associated positive definite kernels, is significantly simpler, and offers additional insights into the problem (relationship with convex neural networks and zonoids). Maiorov (2006, Theorem 2.3) also derives a similar result, but in $L_{2}$ -norm (rather than uniform norm), and for sigmoidal activation functions (which are bounded). Note finally, that the order $O(n^{-1/d})$ cannot be improved (DeVore et al., 1989, Theorem 4.2). Also, Maiorov and Meir (2000, Theorem 5) derive similar upper and lower bounds based on a random sampling argument which is close to using random features in the RKHS setting described in Section 2.3.

Relationship to hardness results for Boolean-valued functions.

In this paper, we consider a particular view of the curse of dimensionality and ways of circumventing it, that is, our distribution over inputs is arbitrary, but our aim is to approximate a real-valued function. Thus, all hardness results depending on functions with values in $\{0,1\}$ do not apply there directly—see, e.g., Shalev-Shwartz and Ben-David (2014, Chapter 20), for the need of exponentially many hidden units for approximating most of the functions from $\{0,1\}^{d}$ to $\{0,1\}$ .

Our approximation bounds show that, without any assumption beyond Lipschitz-continuity of the target function, it sufficient to have a number of hidden units which is still exponential in dimension (hence we also suffer from the curse of dimensionality), but a soon as the target function depends on linear low-dimensional structure, then we lose this exponential dependence. It would be interesting to study an extension to $\{0,1\}$ -valued functions, and also to relate our results to the number of linear regions delimited by neural networks with rectified linear units (Montufar et al., 2014).

Generalization bounds

Our goal is to derive the generalization bounds outlined in Section 2.4 for neural networks with a single hidden layer. The main results that we obtain are summarized in Table 2 and show adaptivity to assumptions that avoid the curse of dimensionality.

Logistic regression and support vector machines: we have $G=1$ .

Least-squares regression: we take $G=\max\big{\{}\sqrt{2}\delta+\|y\|_{\infty},\frac{\|y\|_{\infty}^{2}}{\sqrt{2}\delta}\big{\}}$ .

Approximation errors $\inf_{f\in\mathcal{F}^{\delta}}J(f)-\inf_{f\in\mathcal{F}}J(f)$ will be obtained from the approximation results from Section 4 by assuming that the optimal target function $f_{\ast}$ has a specific form. Indeed, we have:

We now deal with estimation errors $\sup_{f\in\mathcal{F}^{\delta}}|\hat{J}(f)-J(f)|$ using Rademacher complexities.

The following proposition bounds the uniform deviation between $J$ and its empirical counterpart $\hat{J}$ . This result is standard (see, e.g., Koltchinskii, 2001; Bartlett and Mendelson, 2003) and may be extended in bounds that hold with high-probability.

We have the following bound on the expected uniform deviation:

for $\alpha\geqslant 1$ , $C(p,d,\alpha)\leqslant\alpha\sqrt{2\log(d+1)}$ for $p=1$ and $C(p,d,\alpha)\leqslant\frac{\alpha}{\sqrt{p-1}}$ for $p\in(1,2]$

for $\alpha=0$ , $C(p,d,\alpha)\leqslant C\sqrt{d+1}$ , where $C$ is a universal constant.

Proof We use the standard framework of Rademacher complexities and get:

We then take different routes for $\alpha\geqslant 1$ and $\alpha=0$ .

For $\alpha\geqslant 1$ , we have the upper-bound

From Kakade et al. (2009), we get the following bounds on Rademacher complexities:

Our generalization bounds are expected values of the excess expected risk for a our estimator (where the expectation is taken over the data).

We assume $f_{\ast}(x)=w^{\top}x+b$ , with $\|w\|_{2}\leqslant\eta$ and $|b|\leqslant R\eta$ . Then, as seen in Section 4.6, $f_{\ast}\in\mathcal{F}_{1}$ with $\gamma_{1}(f_{\ast})\leqslant C(\alpha)\eta R$ (the constant is independent of $d$ because we approximate an affine function). From Prop. 7, we thus get a generalization bound proportional to $\frac{GR\eta}{\sqrt{n}}$ times a constant (that may depend on $\alpha$ ), which is the same as assuming directly that we optimize over linear predictors only. The chosen $\delta$ is then a constant times $R\eta$ , and does not grow with $n$ , like in parametric estimation (although we do use a non-parametric estimation procedure).

Projection pursuit.

We assume $f_{\ast}(x)=\sum_{j=1}^{k}f_{j}(w_{j}^{\top}x)$ , with $\|w_{j}\|_{2}\leqslant\eta$ and each $f_{j}$ bounded by $\eta R$ and $1$ -Lipschitz continuous. From Prop. 6, we may approach each $x\mapsto f_{j}(w_{j}^{\top}x)$ by a function with $\gamma_{1}$ -norm less than $\delta\eta R$ and uniform approximation $C(\alpha)\eta R\delta^{-1/\alpha}\log\delta$ . This leads to a total approximation error of $kC(\alpha)G\eta R\delta^{-1/\alpha}\log\delta$ for a norm less than $k\delta\eta R$ (the constant is independent of $d$ because we approximate a function of one-dimensional projection).

For $\alpha=0$ , from Prop. 5, the target function belongs to $\mathcal{F}_{1}$ with a norm less than $kGR\eta$ , leading to an overall generalization bound of $\frac{kGR\eta\sqrt{d}}{\sqrt{n}}$ .

Note that when the functions $f_{j}$ are exactly the activation functions, the bound is better, as these functions directly belong to the space $\mathcal{F}_{1}$ .

Multi-dimensional projection pursuit.

For $\alpha\geqslant 1$ , the estimation error is ${kGR\eta\delta}/{\sqrt{n}}$ , with an overall bound of $C(\alpha,s)kGR\eta\big{(}{\delta}/{\sqrt{n}}+\delta^{-1/(\alpha+(s-1)/2)}\log\delta\big{)}$ . With $\delta=n^{(\alpha+(s-1)/2)/(2\alpha+s-1)}$ , we get an optimized bound of $\frac{C(\alpha,s)kGR\eta}{n^{1/(2\alpha+s+1)}}\log n$ .

For $\alpha=0$ , we have an overall bound of $C(s)kGR\eta\big{(}\delta^{-2/(s-1)}\log\delta+\frac{\delta\sqrt{d}}{\sqrt{n}}\big{)}$ , and with $\delta=(n/d)^{(s-1)/(s+1)}$ , we get a generalization bound scaling as $\frac{C(s)kGR\eta}{(n/d)^{1/(s+1)}}\log(n/d)$ .

Note that for $s=d$ and $k=1$ , we recover the usual Lipschitz-continuous assumption, with a rate of $\frac{C(\alpha,d)kGR\eta}{n^{1/(2\alpha+d+1)}}\log n$ .

Summary table: when we know a bound $r$ on all dimensions of $x$ , then we may take $R=r\sqrt{d}$ ; this is helpful in comparisons in Table 2, where $R$ is replaced by $r\sqrt{d}$ and the dependence in $r$ is removed as it is the same for all models.

Dependence on $d$ : when making only a global Lipschitz-continuity assumption, the generalization bound has a bad scaling in $n$ , i.e., as $n^{-1/(2\alpha+d+1)}$ , which goes down to zero slowly when $d$ increases. However, when making structural assumptions regarding the dependence on unknown lower-dimensional subspaces, the scaling in $d$ disappears.

Comparing different values of $\alpha$ : the value $\alpha=0$ always has the best scaling in $n$ , but constants are better for $\alpha\geqslant 1$ (among which $\alpha=1$ has the better scaling in $n$ ).

Bounds for $\mathcal{F}_{2}$ : The simplest upper bound for the penalization by the space $\mathcal{F}_{2}$ depends on the approximation properties of $\mathcal{F}_{2}$ . For linear functions and $\alpha=1$ , it is less than $\sqrt{d}\eta R$ , with a bound $\frac{GR\eta\sqrt{d}}{\sqrt{n}}$ . For the other values of $\alpha$ , there is a constant $C(d)$ . Otherwise, there is no adaptivity and all other situations only lead to upper-bounds of $O(n^{-1/(2\alpha+d+1)})$ . See more details in Section 5.4.

Sample complexity: Note that the generalization bounds above may be used to obtain sample complexity results such as $d\varepsilon^{-2}$ for affine functions, $(\varepsilon k^{-1}d^{-1/2})^{-2\alpha-2}$ for projection pursuit, and $(\varepsilon k^{-1}d^{-1/2})^{-s-1-2\alpha}$ for the generalized version (up to logarithmic terms).

Relationship to existing work: Maiorov (2006, Theorem 1.1) derives similar results for neural networks with sigmoidal activation functions (that tend to one at infinity) and the square loss only, and for a level of smoothness of the target function which grows with dimension (in this case, once can get easily rates of $n^{-1/2}$ ). Our result holds for problems where only bounded first-order derivatives are assumed, but by using Prop. 2, we would get similar rate by ensuring the target function belongs to $\mathcal{F}_{2}$ and hence to $\mathcal{F}_{1}$ .

Lower bounds.

In the sections above, we have only provided generalization bounds. Although interesting, deriving lower-bounds for the generalization performance when the target function belongs to certain function classes is out of the scope of this paper. Note however, that results from Sridharan (2012) suggest that the Rademacher complexities of the associated function classes provide such lower-bounds. For general Lipschitz-functions, these Rademacher complexities decreases as $n^{-\max\{d,2\}}$ (von Luxburg and Bousquet, 2004).

We assume $f_{\ast}(x)=w^{\top}x+b$ , with $\|w\|_{2}\leqslant\eta$ and $|b|\leqslant R\eta$ . Given that we have assumed that $w$ has at most $q$ non-zeros, we have $\|w\|_{1}\leqslant\sqrt{q}\eta$ .

Then, $f_{\ast}\in\mathcal{F}_{1}$ with $\gamma_{1}(f)\leqslant C(\alpha)\eta r\sqrt{q}$ , with a constant that is independent of $d$ because we have an affine function.

From Prop. 7, we thus get a rate of $\frac{Gr\eta\sqrt{q\log(d)}}{\sqrt{n}}$ times a constant (that may depend on $\alpha$ ), which is the same as assuming directly that we optimize over linear predictors only (see, for example, Bühlmann and Van De Geer, 2011). We recover a high-dimensional phenomenon (although with a slow rate in $1/\sqrt{n}$ ), where $d$ may be much larger than $n$ , as long as $\log d$ is small compared to $n$ . The chosen $\delta$ is then a constant times $r\eta\sqrt{q}$ (and does not grow with $n$ ).

Projection pursuit.

We assume $f_{\ast}(x)=\sum_{j=1}^{k}f_{j}(w_{j}^{\top}x)$ , with $\|w_{j}\|_{2}\leqslant\eta$ (which implies $\|w_{j}\|_{1}\leqslant\sqrt{q}\eta$ given our sparsity assumption) and each $f_{j}$ bounded by $\eta r\sqrt{q}$ and $1$ -Lipschitz continuous. We may approach each $x\mapsto f_{j}(w_{j}^{\top}x)$ by a function with $\gamma_{1}$ -norm less than $\delta\eta r\sqrt{q}$ and uniform approximation $C(\alpha)\eta r\sqrt{q}\delta^{-1/\alpha}\log\delta$ , with a constant that is independent of $d$ because we have a function of one-dimensional projection. This leads to a total approximation error of $kC(\alpha)G\eta r\sqrt{q}\delta^{-1/\alpha}\log\delta$ for a norm less than $k\delta\eta r\sqrt{q}$ .

For $\alpha\geqslant 1$ , the estimation error is $\frac{kGr\eta\delta\sqrt{q\log d}}{\sqrt{n}}$ , with an overall bound of $C(\alpha)kGr\sqrt{q}\eta\big{(}\delta^{-1/\alpha}\log\delta+\frac{\delta\sqrt{\log d}}{\sqrt{n}}\big{)}$ . With $\delta=(n/\log d)^{\alpha/2(\alpha+1)}$ , we get an optimized bound of $C(\alpha)kGr\sqrt{q}\eta\frac{\log n(\log d)^{1/(2\alpha+2)}}{n^{1/(2\alpha+2)}}$ , with a scaling only dependent in $d$ with a logarithmic factor.

For $\alpha=0$ , the target function belongs to $\mathcal{F}_{1}$ with a norm less than $kGr\sqrt{q}\eta$ , leading to an overal bound of $\frac{kGr\eta\sqrt{q\log d}}{\sqrt{n}}$ (the sparsity is not helpful in this case).

Multi-dimensional projection pursuit.

We may approach each $x\mapsto F_{j}(W_{j}^{\top}x)$ by a function with $\gamma_{1}$ -norm less than $\delta\eta r\sqrt{q}$ and uniform approximation $C(\alpha,s)\eta r\sqrt{q}\delta^{-1/(\alpha+(s-1)/2)}\log\delta$ . This leads to a total approximation error of $kC(\alpha,s)G\eta r\sqrt{q}\delta^{-1/(\alpha+(s-1)/2)}\log\delta$ .

For $\alpha\geqslant 1$ , the estimation error is ${kGr\sqrt{q}\eta\delta\sqrt{\log d}}/{\sqrt{n}}$ , with an overall bound which is equal to $C(\alpha,s)kGr\sqrt{q}\eta\big{(}\delta^{-1/(\alpha+(s-1)/2)}\log\delta+\frac{\delta\sqrt{\log d}}{\sqrt{n}}\big{)}$ . With $\delta=(n/\log d)^{(\alpha+(s-1)/2)/(2\alpha+s-1)}$ , we get an optimized bound of $\displaystyle\frac{C(\alpha,s)kGr\sqrt{q}\eta(\log d)^{1/(2\alpha+s+1)}}{n^{1/(2\alpha+s+1)}}\log n$ .

For $\alpha=0$ , we have the bound $\frac{C(s)kGr\sqrt{q}\eta}{(n/d)^{1/(s+1)}}\log(n/d)$ , that is we cannot use the sparsity as the problem is invariant to the chosen norm on hidden weights.

High-dimensional variable selection: when $k=1$ , $s=q$ and $W_{1}$ is a projection onto $q$ variables, then we obtain a bound proportional to $\frac{\sqrt{q}\eta(\log d)^{1/(2\alpha+s+1)}}{n^{1/(2\alpha+s+1)}}\log n$ , which exhibits a high-dimensional scaling in a non-linear setting. Note that beyond sparsity, no assumption is made (in particular regarding correlations between input variables), and we obtain a high-dimensional phenomenon where $d$ may be much larger than $n$ .

4 Relationship to kernel methods and random sampling

5 Sufficient condition for polynomial-time algorithms

up to a constant factor. That is, there exists $\kappa\geqslant 1$ , such that for all $y$ and $z$ , we may compute $\hat{v}$ such that $\|\hat{v}\|_{p}=1$ and

This is provably NP-hard for $\alpha=0$ (see Section 3.2), and for $\alpha=1$ (see Section 3.3). If such an algorithm is available, the approximate conditional gradient presented in Section 2.5 leads to an estimator with the same generalization bound. Moreover, given the strong hardness results for improper learning in the situation $\alpha=0$ (Klivans and Sherstov, 2006; Livni et al., 2014), a convex relaxation that would consider a larger set of predictors (e.g., by relaxing $vv^{\top}$ into a symmetric positive-definite matrix), and obtained a constant approximation guarantee, is also ruled out.

Lower-bound on $\gamma_{1}$ : It is defined from functions $\hat{\varphi}_{\hat{v}}$ , for $\hat{v}\in\hat{\mathcal{V}}$ , where for any $v\in\mathcal{V}$ , there exists $\hat{v}\in\hat{\mathcal{V}}$ such that $\varphi_{v}=\hat{\varphi}_{\hat{v}}$ . This implies that the corresponding space $\hat{\mathcal{F}}_{1}$ is larger than $\mathcal{F}_{1}$ and that if $f\in\mathcal{F}_{1}$ , then $\hat{\gamma}_{1}(f)\leqslant\gamma_{1}(f)$ .

Polynomial-time algorithm for dual norm: The dual norm $\displaystyle\sup_{\hat{v}\in\hat{\mathcal{V}}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}y_{i}\hat{\varphi}_{\hat{v}}(z_{i})\bigg{|}$ may be computed in polynomial time.

We may also replace the standard Gaussian vectors by Rademacher random variables.

We can then penalize by $\hat{\gamma}$ instead of $\gamma$ . Since $\hat{\gamma}_{1}\leqslant\gamma_{1}$ , approximation properties are transferred, and because of the result above, the Rademacher complexity for $\hat{\gamma}_{1}$ -balls scales as well as for $\gamma_{1}$ -balls. In the next section, we show convex relaxations which cannot achieve these and leave the existence or non-existence of such norm $\hat{\gamma}_{1}$ as an open problem.

Convex relaxations of the Frank-Wolfe step

We present two relaxations, which are of the form described in Section 5.5 (leading to potential generalization bounds) but do not attain the proper approximation scaling (as was checked empirically).

Note that all relaxations that end up being Lipschitz-continuous functions of $z$ , will have at least the same scaling than the set of these functions. The Rademacher complexity of such functions is well-known, that is $1/\sqrt{n}$ for $d=1$ , $\sqrt{\frac{\log n}{n}}$ for $d=2$ and $n^{-1/d}$ for larger $d$ (von Luxburg and Bousquet, 2004). Unfortunately, the decay in $n$ is too slow to preserve generalization bounds (which would require a scaling in $1/\sqrt{n}$ ).

We denote $u_{i}=(v^{\top}z_{i})_{+}=\frac{1}{2}v^{\top}z_{i}+\frac{1}{2}|v^{\top}z_{i}|$ . We may then use $2u_{i}-v^{\top}z_{i}=|v^{\top}z_{i}|$ and, for $\|v\|_{2}=1$ , $\|vv^{\top}z_{i}\|_{2}=|v^{\top}z_{i}|=\sqrt{z_{i}^{\top}vv^{\top}z_{i}}$ . By denoting $V=vv^{\top}$ , the constraint that $u_{i}=(v^{\top}z_{i})_{+}=\frac{1}{2}v^{\top}z_{i}+\frac{1}{2}|v^{\top}z_{i}|$ is equivalent to

We obtain a convex relaxation when removing the rank constraint, that is

(n+d)𝑛𝑑(n+d)-dimensional relaxation.

which leads to a convex program in $U=uu^{\top}$ , $V=vv^{\top}$ and $J=uv^{\top}$ , that is a semidefinite program with $d+n$ dimensions, with the constraints

and the usual semi-definite contraints $\displaystyle\left(\begin{array}[]{cc}U&J\\ J^{\top}&V\end{array}\right)\succcurlyeq\left(\begin{array}[]{c}u\\ v\end{array}\right)\left(\begin{array}[]{c}u\\ v\end{array}\right)^{\top}$ , with the additional constraint that $4U_{ii}+z_{i}^{\top}Vz_{i}-4\delta_{i}^{\top}Jz_{i}=\mathop{\rm tr}Vz_{i}z_{i}^{\top}$ .

If we add these constraints on top of the ones above, we obtain a tighter relaxation. Note that for this relaxation, we must have $\big{[}(2u_{i}-v^{\top}z_{i})-(2u_{j}-v^{\top}z_{j})\big{]}$ less than a constant times $\|z_{i}-z_{j}\|_{2}$ . Hence, the result mentioned above regarding Lipschitz-continuous functions and the scaling of the upper-bound for random $y$ holds (with the dependence on $n$ which is not good enough to preserve the generalization bounds with a polynomial-time algorithm).

2 Relaxation of sign vectors

Usual semi-definite constraint: $\displaystyle\left(\begin{array}[]{cc}S&J\\ J^{\top}&V\end{array}\right)\succcurlyeq\left(\begin{array}[]{c}s\\ v\end{array}\right)\left(\begin{array}[]{c}s\\ v\end{array}\right)^{\top}$ ,

Unit/trace constraints: $\mathop{\rm diag}(S)=1$ and $\mathop{\rm tr}V=1$ ,

Sign constraint: $\delta_{i}^{\top}Jx_{i}\geqslant\max_{j\neq i}|\delta_{j}^{\top}Jx_{i}|$ .

Additional constraint: $(x_{i}^{\top}Vx_{i})^{1/2}\leqslant\delta_{i}^{\top}Jx_{i}$ .

Conclusion

In this paper, we have provided a detailed analysis of the generalization properties of convex neural networks with positively homogenous non-decreasing activation functions. Our main new result is the adaptivity of the method to underlying linear structures such as the dependence on a low-dimensional subspace, a setting which includes non-linear variable selection in presence of potentially many input variables.

All our current results apply to estimators for which no polynomial-time algorithm is known to exist and we have proposed sufficient conditions under which convex relaxations could lead to the same bounds, leaving open the existence or non-existence of such algorithms. Interestingly, these problems have simple geometric interpretations, either as binary linear classification, or computing the Haussdorff distance between two zonotopes.

In this work, we have considered a single real-valued output; the functional analysis framework readily extends to outputs in a finite-dimensional vector-space where vector-valued measures could be used, and then apply to multi-task or multi-class problems. However, the extension to multiple hidden layers does not appear straightforward as the units of the last hidden layers share the weights of the first hidden layers, which should require a new functional analysis framework.

References

We follow the proof of Berlinet and Thomas-Agnan (2004, Section 4.1) and extend it to integrals rather than finite sums. We consider the linear mapping $T:L_{2}(d\tau)\to\mathcal{F}_{2}$ defined by $(Tp)(x)=\int_{\mathcal{V}}p(v)\varphi_{v}(x)d\tau(v)$ , with null space $\mathcal{K}$ . When restricted to the orthogonal complement $\mathcal{K}^{\perp}$ , we obtain a bijection $U$ from $\mathcal{K}^{\perp}$ to $\mathcal{F}_{2}$ . We then define a dot-product on $\mathcal{F}_{2}$ as $\langle f,g\rangle=\int_{\mathcal{V}}(U^{-1}f)(v)(U^{-1}g)(v)d\tau(v)$ .

We first show that this defines an RKHS with kernel $k$ defined above. For this, we trivially have $k(\cdot,y)\in\mathcal{F}_{2}$ for all $y\in\mathcal{X}$ . Moreover, for any $y\in\mathcal{X}$ , we have with $p=U^{-1}k(\cdot,y)\in\mathcal{K}^{\perp}$ and $q:v\mapsto\varphi_{v}(y)$ , $p-q\in\mathcal{K}$ , which implies that $\langle f,k(\cdot,y)\rangle=\int_{\mathcal{V}}(U^{-1}f)(v)p(v)d\tau(v)=\int_{\mathcal{V}}(U^{-1}f)(v)q(v)d\tau(v)=\int_{\mathcal{V}}(U^{-1}f)(v)\varphi_{v}(y)d\tau(v)=T(U^{-1}f)(y)=f(y)$ , hence the reproducing property is satisfied. Thus, $\mathcal{F}_{2}$ is an RKHS.

We now need to show that the RKHS norm which we have defined is actually $\gamma_{2}$ . For any $f\in\mathcal{F}_{2}$ such that $f=Tp$ , for $p\in L_{2}(d\tau)$ , we have $p=U^{-1}f+q$ , where $q\in\mathcal{K}$ . Thus, $\int_{\mathcal{V}}p(v)^{2}d\tau(v)=\|p\|^{2}_{L_{2}(d\tau)}=\|U^{-1}f\|^{2}_{L_{2}(d\tau)}+\|q\|^{2}_{L_{2}(d\tau)}=\|f\|^{2}+\|q\|^{2}_{L_{2}(d\tau)}$ . This implies that $\int_{\mathcal{V}}p(v)^{2}d\tau(v)\geqslant\|f\|^{2}$ with equality if and only if $q=0$ . This shows that $\gamma_{2}(f)=\|f\|$ .

B Approximate conditional gradient with multiplicative oracle

for a fixed $\kappa\geqslant 1$ . We now propose a modification of the conditional gradient algorithm that converges to a certain $h$ such that $\gamma(h)\leqslant\delta$ and for which $\inf_{\gamma(h)\leqslant\delta}J(h)\leqslant J(\hat{h})\leqslant\inf_{\gamma(h)\leqslant\delta/\kappa}J(h)$ .

We assume the smoothness of the function $J$ with respect to the norm $\gamma$ , that is, for a certain $L>0$ , for all $h,h^{\prime}$ such that $\gamma(h)\leqslant\delta$ , then

In the previous recursion, one may replace the minimization of $J$ on the segment $[h_{t},\hat{h}_{t}]$ with the minimization of its upper-bound of Eq. (7) taken at $h=h_{t}$ . From the recursion, all iterates are in the $\gamma$ -ball of radius $\delta$ . Following the traditional convergence proof for the conditional gradient method (Dunn and Harshbarger, 1978; Jaggi, 2013), we have, for any $\rho$ in $$:

If we take $h_{\ast}$ the minimizer of $J$ on $\{\gamma(h)\leqslant\delta/\kappa\}$ , we get:

Then, by using $J(h_{t})\geqslant J(h_{\ast})+\langle J^{\prime}(h_{t}),h_{\ast}-h_{t}\rangle$ , we get:

This is valid for any $\rho\in$ . If $J(h_{t})-J(h_{\ast})\leqslant 0$ for some $t$ , then by taking $\rho=0$ it remains the same of all greater $t$ . Therefore, up to (the potentially never happening) point where $J(h_{t})-J(h_{\ast})\leqslant 0$ , we can apply the regular proof of the conditional gradien to obtain: $J(h_{t})\leqslant\inf_{\gamma(h)\leqslant\delta/\kappa}J(h)+\frac{4L\rho^{2}\delta^{2}}{t}$ , which leads to the desired result. Note that a similar reasoning may be used for $\rho=2/(t+1)$ .

C Proofs for the 222-dimensional sphere (d=1𝑑1d=1)

For $k=0$ , the same equality holds (except that the two coefficients $g_{0}$ and $p_{0}$ are divided by $2\pi$ except of $\pi$ ).

We thus simply need to compute $\lambda_{k}$ and its decay for all values of $\alpha$ , and then relate them to the smoothness properties of $g$ , which is standard for Fourier series.

We now detail the computation of $\lambda_{k}=\frac{1}{2\pi}\int_{0}^{2\pi}\sigma(\cos\eta)\cos k\eta\,d\eta$ for the different functions $\sigma=(\cdot)_{+}^{\alpha}$ . We have for $\alpha=0$ :

For $k=0$ it is equal to $\frac{1}{2}$ . It is equal to zero for all other even $k$ , and different from zero for all odd $k$ , with $\lambda_{k}$ going to zero as $1/k$ .

For $k=1$ , it is equal to $1/4$ . It is equal to zero for all other odd $k$ , and different from zero for all even $k$ , with $\lambda_{k}$ going to zero as $1/k^{2}$ .

For $k=0$ , it is equal to $1/4$ , and for $k=2$ , it is equal to $1/8$ . It is equal to zero for all other even $k$ , and different from zero for all odd $k$ , with $\lambda_{k}$ going to zero as $1/k^{3}$ .

C.2 Proof of Prop. 2 for d=1𝑑1d=1

We only consider the proof for $d=1$ . For the proof for general $d$ , see Appendix D.3.

Given the zero values of $\lambda_{k}$ given above, if $g$ has the opposite parity than $\alpha$ (that is, is even when $\alpha$ is odd, and vice-versa), then we may define ${p}$ through its Fourier series, which is obtained by multiplying the one of $g$ by a strictly positive sequence growing as $k^{\alpha+1}$ .

Note that we could relax the assumption that $g$ is even (resp. odd) by adding all trigonometric polynomials of order less than $\alpha$ .

C.3 Proof of Prop. 3 for d=1𝑑1d=1

Again, we only consider the proof for $d=1$ . For the proof for general $d$ , see Appendix D.4.

Without loss of generality, we assume that $\eta=1$ . For $d=1$ , we essentially want to approximate a Lipschitz-continuous function by a function which is $(\alpha+1)$ -times differentiable.

For $\alpha=0$ , then the function $g$ is already in $\mathcal{G}_{2}$ with a norm less than one, because Lipschitz-continuous functions are almost everywhere differentiable with bounded derivative (Adams and Fournier, 2003). We thus now consider $\alpha>0$ .

Given $\lambda_{k}$ defined above and $r\in(0,1)$ , we define $\hat{{p}}$ through

Since $g$ is $1$ -Lipschitz-continuous with constant $1$ , then it has a squared-integrable derivative $f=g^{\prime}$ with norm less than 1 (Adams and Fournier, 2003), so that

This implies that using $\lambda_{k}^{-1}=O(k^{\alpha+1})$ :

Computing distance between g^^𝑔\hat{g} and g𝑔g.

It can be easily checked that for any $r\in(1/2,1)$ , the last function is less than a constant times $\frac{5}{2}C(1-r)\log\frac{1}{1-r}$ . We thus get for $\delta$ large enough, by taking $r=1-(C/\delta)^{1/\alpha}\in(1/2,1)$ , an error of

D Approximations on the d𝑑d-dimensional sphere

In this appendix, we first review tools from spherical harmonic analysis, before proving the two main propositions regarding the approximation properties of the Hilbert space $\mathcal{G}_{2}$ . Using spherical harmonics in our set-up is natural and is common in the analysis of ridge functions (Petrushev, 1998) and zonotopes (Bourgain and Lindenstrauss, 1988).

In this section, we review relevant concepts from spherical harmonics. See Frye and Efthimiou (2012); Atkinson and Han (2012) for more details. Spherical harmonics may be seen as extension of Fourier series to spheres in dimensions more than 2 (i.e., with our convention $d\geqslant 1$ ).

Legendre polynomials.

where $P_{k}$ is a Legendre polynomial of degree $k$ and dimension $d+1$ , defined as (Rodrigues’ formula):

They are also referred to as Gegenbauer polynomials. For $d=1$ , $P_{k}$ is the $k$ -th Chebyshev polynomial, such that $P_{k}(\cos\theta)=\cos(k\theta)$ for all $\theta$ (and we thus recover the Fourier series framework of Appendix C). For $d=2$ , $P_{k}$ is the usual Legendre polynomial.

The polynomial $P_{k}$ is even (resp. odd) when $k$ is even (resp. odd), and we have

For small $k$ , we have $P_{0}(t)=1$ , $P_{1}(t)=t$ , and $P_{2}(t)=\frac{(d+1)t^{2}-1}{d}$ .

The Hecke-Funk formula leads to, for any linear combination $Y_{k}$ of $Y_{kj}$ , $j\in\{1,\dots,N(d,k)\}$ :

This is the decomposition in harmonics of degree $k$ . Note that

Decomposition of functions of one-dimensional projections.

If $k\equiv\alpha\mbox{ mod. }2$ , then $\lambda_{k}\propto\frac{1}{2}\int_{-1}^{1}t^{\alpha}P_{k}(t)(1-t^{2})^{(d-2)/2}dt=0$ , for $k>\alpha$ since $P_{k}$ is orthogonal to all polynomials of degree strictly less than $k$ for that dot-product. Otherwise, $\lambda_{k}\neq 0$ , since $t^{\alpha}$ may be decomposed as combination with non-zero coefficients of polynomials $P_{j}$ for $j\equiv\alpha\mbox{ mod. }2$ , $j\leqslant\alpha$ .

We now provide an explicit formula extending the proof technique (for $\alpha=1$ ) of Schneider (1967) and Bourgain and Lindenstrauss (1988) to all values of $\alpha$ . See also Mhaskar (2006).

We have, by $\alpha$ successive integration by parts, for $k\geqslant\alpha+1$ :

By using Stirling formula $\Gamma(x)\approx x^{x-1/2}e^{-x}\sqrt{2\pi}$ , we get an equivalent when $k$ or $d$ tends to infinity as a constant (that depends on $\alpha$ ) times

Note that all exponential terms cancel out. Moreover, when $k$ tends to infinity and $d$ is considered constant, then we get the equivalent $k^{-d/2-\alpha-1/2}$ , which we need for the following sections. Finally, when $d$ tends to infinity and $k$ is considered constant, then we get the equivalent $d^{-\alpha/2-k/2+1/2}$ .

We will also need expressions of $\lambda_{k}$ for $k=0$ and $k=1$ . For $k=0$ , we have:

using the normalization factor of the Beta distribution. This leads to

which is equivalent to $d^{1/2-\alpha/2}$ as $d$ tends to infinity.

Moreover, for $k=1$ , we have (for $\alpha>0$ ):

which is equivalent to $d^{-\alpha/2}$ as $d$ tends to infinity.

Finally, for $\alpha=0$ , $\lambda_{1}=\frac{d-1}{2d\pi}$ . More generally, we have $|\lambda_{k}|\sim C(d)k^{-(d-1)/2-\alpha-1}$ .

Given $g$ with the correct parity, then we have

D.3 Proof of Prop. 2 for d>1𝑑1d>1

Moreover, since $g$ has the correct parity,

Also, $g_{k}$ are eigenfunctions of the Laplacian with eigenvalues $k(k+d-1)$ . Thus, we have

D.4 Proof of Prop. 3 for d>1𝑑1d>1

which is always defined when $r\in(0,1)$ because the series is absolutely convergent. This defines a function $\hat{g}$ that will have a finite $\gamma_{2}$ -norm and be close to $g$ .

The function $\hat{{p}}$ thus defines a function $\hat{g}\in\mathcal{G}_{1}$ by $\hat{g}_{k}=\lambda_{k}{p}_{k}$ , for which $\gamma_{2}(g)\leqslant C(d,\alpha)(1-r)^{(-d+1)/2-\alpha}$ .

Approximation properties.

We now show that $g$ and $\hat{g}$ are close to each other. Because of the parity of $g$ , we have $\hat{g}_{k}=r^{k}g_{k}$ . We have, using Theorem 4.28 from Frye and Efthimiou (2012):

Moreover, following Bourgain and Lindenstrauss (1988), we have:

As shown by Bourgain and Lindenstrauss (1988, Eq. (2.13)), this is less than a constant that depends on $d$ times $(1-r)\log\frac{1}{1-r}$ . We thus get for $\delta$ large enough, by taking $1-r=(C/\delta)^{1/(\alpha+(d-1)/2)}\in(0,1)$ , an error of

In this section, we consider functions on the sphere which have the proper parity with respect to $\alpha$ , which are $s$ -times differentiable with bounded derivatives, but which are not in $\mathcal{G}_{2}$ . We then provide optimal approximation rates for these functions.

The summand has an asymptotic equivalent proportional to $k^{-d-2s-1}k^{d-1}k^{d+2\alpha+1}=k^{d+2\alpha-2s-1}$ . Thus if $d+2\alpha-2s\geqslant 0$ , the series is divergent (the function is not in the RKHS), i.e., if $s\leqslant\alpha+\frac{d}{2}$ .

which should be of order $\delta^{2}$ (this gives the scaling of $\lambda$ as a function of $\delta$ ). Then the squared error is

and thus the (non-squared) approximation error scales as $\delta^{-(2s+1)/(d+2\alpha-2s)}.$ For $s=1$ , this leads to a scaling as $\delta^{-3/(d+2\alpha-2)}$ .

D.6 Proof of Prop. 4

For $\alpha=1$ , by writing $v^{\top}x=(v^{\top}x)_{+}-(-v^{\top}x)_{+}$ we obtain the upperbound $\gamma_{1}(g)\leqslant 2$ . For all other situations, we may compute

We assume that we are given two ellipsoids defined as $(x-a)^{\top}A^{-1}(x-a)\leqslant 1$ and $(x-b)^{\top}B^{-1}(x-b)\leqslant 1$ and we want to compute their Hausdorff distance. This leads to the two equivalent problems

We consider the following convex optimization problem, with $Q\succcurlyeq 0$ ; we have by Lagrangian duality:

If $\|Q^{-1}q\|_{2}\leqslant 1$ , then $\lambda=0$ and $x=Q^{-1}q$ . Otherwise, at the optimum, $\lambda>0$ and $\|x\|_{2}^{2}=q^{\top}(Q+\lambda I)^{-2}q=1$ , which implies $1\leqslant\frac{1}{\lambda+\lambda_{\min}(Q)}q^{\top}Q^{-1}q$ , which leads to $\lambda\leqslant q^{\top}Q^{-1}q-\lambda_{\min}(Q)$ , which is important to reduce the interval of possible $\lambda$ . The optimal $\lambda$ may then be obtained by binary search (from a single SVD of $Q$ ).

Minimizing concave quadratic forms over the sphere.

We consider the following non-convex optimization problem, with $Q\succcurlyeq 0$ , for which strong Lagrangian duality is known to hold (Boyd and Vandenberghe, 2004):

At the optimum, we have $q^{\top}(\lambda I-Q)^{-2}q=1$ , which implies $1\leqslant\frac{1}{[\lambda-\lambda_{\max}(Q)]^{2}}\|q\|_{2}^{2}$ , which leads to $0\leqslant\lambda-\lambda_{\max}(Q)\leqslant\|q\|_{2}$ . We may perform binary search on $\lambda$ from a single SVD of $Q$ .

Computing the Haussdorff distance.

with $v=(B+\lambda I)^{-1}B^{1/2}(a-b+A^{1/2}u)$ . The interval in $\lambda$ which is sufficient to explore is

which are bounds that are independent of $u$ .

Given $\lambda\geqslant 0$ , we have the problem of

We have $\displaystyle u=(\frac{\mu}{\lambda}I-A^{1/2}(B+\lambda I)^{-1}A^{1/2})^{-1}A^{1/2}(B+\lambda I)^{-1}(a-b),$ leading to $\displaystyle w\propto(\lambda^{-1}B-\mu^{-1}A+I)(a-b)$ . We need $\frac{\mu}{\lambda}\geqslant\lambda_{\max}(A^{1/2}(B+\lambda I)^{-1}A^{1/2})$ . Moreover

The author was partially supported by the European Research Council (SIERRA Project), and thanks Nicolas Le Roux for helpful discussions. The author also thanks Varun Kanade for pointing the NP-hardness linear classification result. The main part of this work was carried through while visiting the Centre de Recerca Matemàtica (CRM) in Barcelona.