Breaking the Curse of Dimensionality with Convex Neural Networks
Francis Bach
Introduction
Supervised learning methods come in a variety of ways. They are typically based on local averaging methods, such as -nearest neighbors, decision trees, or random forests, or on optimization of the empirical risk over a certain function class, such as least-squares regression, logistic regression or support vector machine, with positive definite kernels, with model selection, structured sparsity-inducing regularization, or boosting (see, e.g., Györfi and Krzyzak, 2002; Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014, and references therein).
At the other end of the spectrum, parametric methods such as linear supervised learning make strong assumptions regarding the problem and generalization bounds based on estimation errors typically assume that the model is well-specified, and the sample complexity to attain an excess risk of grows as , for linear functions in dimensions and Lipschitz-continuous loss functions (Shalev-Shwartz and Ben-David, 2014, Chapter 9). While the sample complexity is much lower, when the assumptions are not met, the methods underfit and more complex models would provide better generalization performances.
Affine functions: , leading to potential severe underfitting, but easy optimization and good (i.e., non exponential) sample complexity.
Single hidden-layer neural networks: , where is the number of units in the hidden layer (see, e.g., Rumelhart et al., 1986; Haykin, 1994). The activation function is here assumed to be fixed. While the learning problem may be cast as a (sub)differentiable optimization problem, techniques based on gradient descent may not find the global optimum. If the number of hidden units is fixed, this is a parametric problem.
In this paper, our main aim is to answer the following question: Is there a single learning method that can deal efficiently with all situations above with provable adaptivity? We consider single-hidden-layer neural networks, with non-decreasing homogeneous activation functions such as
for , with a particular focus on (with the convention that ), that is (a threshold at zero), and , that is, , the so-called rectified linear unit (Nair and Hinton, 2010; Krizhevsky et al., 2012). We follow the convexification approach of Bengio et al. (2006); Rosset et al. (2007), who consider potentially infinitely many units and let a sparsity-inducing norm choose the number of units automatically. This leads naturally to incremental algorithms such as forward greedy selection approaches, which have a long history for single-hidden-layer neural networks (see, e.g. Breiman, 1993; Lee et al., 1996).
We provide in Section 2 a review of functional analysis tools used for learning from continuously infinitely many basis functions, by studying carefully the similarities and differences between - and -penalties on the output weights. For -penalties, this corresponds to a positive definite kernel and may be interpreted through random sampling of hidden weights. We also review incremental algorithms (i.e., forward greedy approaches) to learn from these infinite sets of basis functions when using -penalties.
The results are specialized in Section 3 to neural networks with a single hidden layer and activation functions which are positively homogeneous (such as the rectified linear unit). In particular, in Sections 3.2, 3.3 and 3.4, we provide simple geometric interpretations to the non-convex problems of additions of new units, in terms of separating hyperplanes or Hausdorff distance between convex sets. They constitute the core potentially hard computational tasks in our framework of learning from continuously many basis functions.
In Section 4, we provide a detailed theoretical analysis of the approximation properties of (single hidden layer) convex neural networks with monotonic homogeneous activation functions, with explicit bounds. We relate these new results to the extensive literature on approximation properties of neural networks (see, e.g., Pinkus, 1999, and references therein) in Section 4.7, and show that these neural networks are indeed adaptive to linear structures, by replacing the exponential dependence in dimension by an exponential dependence in the dimension of the subspace of the data can be projected to for good predictions.
We provide in Section 5.5 simple conditions for convex relaxations to achieve the same generalization error bounds, even when constant-factor approximation cannot be found (e.g., because it is NP-hard such as for the threshold activation function and the rectified linear unit). We present in Section 6 convex relaxations based on semi-definite programming, but we were not able to find strong enough convex relaxations (they provide only a provable sample complexity with a polynomial time algorithm which is exponential in the dimension ) and leave open the existence or non-existence of polynomial-time algorithms that preserve the non-exponential sample complexity.
Learning from continuously infinitely many basis functions
We consider the space of functions that can be written as
where is a signed Radon measure on with finite total variation .
When is finite, this corresponds to
with total variation , where the proper formalization for infinite sets is done through measure theory.
The infimum of over all decompositions of as , turns out to be a norm on , often called the variation norm of with respect to the set of basis functions (see, e.g., Kurkova and Sanguineti, 2001; Mhaskar, 2004).
Given our assumptions regarding the compactness of , for any , the infimum defining is in fact attained by a signed measure , as a consequence of the compactness of measures for the weak topology (see Evans and Gariepy, 1991, Section 1.9).
In the definition above, if we assume that the signed measure has a density with respect to a fixed probability measure with full support on , that is, , then, the variation norm is also equal to the infimal value of
over all integrable functions such that . Note however that not all measures have densities, and that the two infimums are the same as all Radon measures are limits of measures with densities. Moreover, the infimum in the definition above is not attained in general (for example when the optimal measure is singular with respect to ); however, it often provides a more intuitive definition of the variation norm, and leads to easier comparisons with Hilbert spaces in Section 2.3.
2 Representation from finitely many functions
When minimizing any functional that depends only on the function values taken at a subset of values in , over the ball , then we have a “representer theorem” similar to the reproducing kernel Hilbert space situation, but also with significant differences, which we now present.
we can then build a function defined over all , through the optimal measure above.
Moreover, by Carathéodory’s theorem for cones (Rockafellar, 1997), if is composed of only elements (e.g., is the number of observations in machine learning), the optimal function above (and hence ) may be decomposed into at most functions , that is, is supported by at most points in , among a potential continuum of possibilities.
Note however that the identity of these functions is not known in advance, and thus there is a significant difference with the representer theorem for positive definite kernels and Hilbert spaces (see, e.g., Shawe-Taylor and Cristianini, 2004), where the set of functions are known from the knowledge of the points (i.e., kernel functions evaluated at ).
3 Corresponding reproducing kernel Hilbert space (RKHS)
We have seen above that if the real-valued measures are restricted to have density with respect to a fixed probability measure with full support on , that is, , then, the norm is the infimum of the total variation over all decompositions .
We may also define the infimum of over the same decompositions (squared -norm instead of -norm). It turns out that it defines a squared norm and that the function space of functions with finite norm happens to be a reproducing kernel Hilbert space (RKHS). When is finite, then it is well-known (see, e.g., Berlinet and Thomas-Agnan, 2004, Section 4.1) that the infimum of over all vectors such that defines a squared RKHS norm with positive definite kernel .
We show in Appendix A that for any compact set , we have defined a squared RKHS norm with positive definite kernel .
When tends to infinity, then tends to and random sampling provides a way to work efficiently with explicit -dimensional feature spaces. See Rahimi and Recht (2007) for a analysis of the number of units needed for an approximation with error , typically of order . See also Bach (2015) for improved results with a better dependence on when making extra assumptions on the eigenvalues of the associated covariance operator.
The corresponding RKHS norm is always greater than the variation norm (because of Jensen’s inequality), and thus the RKHS is included in . However, as shown in this paper, the two spaces and have very different properties; e.g., may be computed easily in several cases, while does not; also, learning with may either be done by random sampling of sufficiently many weights or using kernel methods, while requires dedicated convex optimization algorithms with potentially non-polynomial-time steps (see Section 2.5).
Moreover, for any , with a norm , while in general . This is a simple illustration of the fact that is too small and thus will lead to a lack of adaptivity that will be further studied in Section 5.4 for neural networks with certain activation functions.
4 Supervised machine learning
that is, the excess risk is upper-bounded by a sum of an approximation error , an estimation error and an optimization error (see also Bottou and Bousquet, 2008). In this paper, we will deal with all three errors, starting from the optimization error which we now consider for the space and its variation norm.
5 Incremental conditional gradient algorithms
We assume the functional is convex and -smooth, that is for all , there exists a gradient such that for all ,
When is finite, this corresponds to the regular notion of smoothness from convex optimization (Nesterov, 2004).
The conditional gradient algorithm (a.k.a. Frank-Wolfe algorithm) is an iterative algorithm, starting from any function and with the following recursion, for :
See an illustration in Figure 1. We may choose either or perform a line search for . For all of these strategies, the -th iterate is a convex combination of the functions , and is thus an element of . It is known that for these two strategies for , we have the following convergence rate (see, e.g. Jaggi, 2013):
When, is finite, we have and thus we get a convergence rate of .
Moreover, the basic Frank-Wolfe (FW) algorithm may be extended to handle the regularized problem as well (Harchaoui et al., 2013; Bach, 2013; Zhang et al., 2012), with similar convergence rates in . Also, the second step in the algorithm, where the function is built in the segment between and the newly found extreme function, may be replaced by the optimization of over the convex hull of all functions , a variant which is often referred to as fully corrective. Moreover, in our context where is a space where local search techniques may be considered, there is also the possibility of “fine-tuning” the vectors as well (Bengio et al., 2006), that is, we may optimize the function (v_{1},\dots,v_{t},\alpha_{1},\dots,\alpha_{t})\mapsto J\big{(}\sum_{i=1}^{t}\alpha_{i}\varphi_{v_{i}}\big{)}, through local search techniques, starting from the weights and points obtained from the conditional gradient algorithm.
Adding a new basis function.
The conditional gradient algorithm presented above relies on solving at each iteration the “Frank-Wolfe step”:
for . For the norm defined through an -norm, we have for such that :
with equality if and only if with and two non-negative measures, with (resp. ) supported in the set of maximizers of \big{|}\int_{\mathcal{X}}\varphi_{v}(x)g(x)d\rho(x)\big{|} where the value is positive (resp. negative).
with the maximizers of the first optimization problem above (left-hand side) obtained as times convex combinations of and for maximizers of the second problem (right-hand side).
A common difficulty in practice is the hardness of the Frank-Wolfe step, that is, the optimization problem above over may be difficult to solve. See Section 3.2, 3.3 and 3.4 for neural networks, where this optimization is usually difficult.
Finitely many observations.
where the set of solutions of the first problem is in the convex hull of the solutions of the second problem.
Non-smooth loss functions.
In this paper, in our theoretical results, we consider non-smooth loss functions for which conditional gradient algorithms do not converge in general. One possibility is to smooth the loss function, as done by Nesterov (2005): an approximation error of may be obtained with a smoothness constant proportional to . By choosing as , we obtain a convergence rate of after iterations. See also Lan (2013).
Approximate oracles.
The conditional gradient algorithm may deal with approximate oracles; however, what we need in this paper is not the additive errors situations considered by Jaggi (2013), but multiplicative ones on the computation of the dual norm (similar to ones derived by Bach (2013) for the regularized problem).
Indeed, in our context, we minimize a function on over a norm ball . A multiplicative approximate oracle outputs for any , a vector such that , and
for a fixed . In Appendix B, we propose a modification of the conditional gradient algorithm that converges to a certain such that and for which .
Such approximate oracles are not available in general, because they require uniform bounds over all possible values of . In Section 5.5, we show that a weaker form of oracle is sufficient to preserve our generalization bounds from Section 5.
Approximation of any function by a finite number of basis functions.
Neural networks with non-decreasing positively homogeneous activation functions
In this paper, we focus on a specific family of basis functions, that is, of the form
for specific activation functions . We assume that is non-decreasing and positively homogeneous of some integer degree, i.e., it is equal to , for some . We focus on these functions for several reasons:
Since they are not polynomials, linear combinations of these functions can approximate any measurable function (Leshno et al., 1993).
By homogeneity, they are invariant by a change of scale of the data; indeed, if all observations are multiplied by a constant, we may simply change the measure defining the expansion of by the appropriate constant to obtain exactly the same function. This allows us to study functions defined on the unit-sphere.
The special case , often referred to as the rectified linear unit, has seen considerable recent empirical success (Nair and Hinton, 2010; Krizhevsky et al., 2012), while the case (hard thresholds) has some historical importance (Rosenblatt, 1958).
The goal of this section is to specialize the results from Section 2 to this particular case and show that the “Frank-Wolfe” steps have simple geometric interpretations.
We first show that the positive homogeneity of the activation functions allows to transfer the problem to a unit sphere.
This implies by Hölder’s inequality that . Moreover this leads to functions in that are bounded everywhere, that is, , . Note that the functions in are also Lipschitz-continuous for .
Homogeneous reformulation.
that is , because we have assumed that is on the -sphere.
The only remaining important aspect is to define on the entire sphere, so that (a) its regularity constants are controlled by a constant times the ones on the portion of the sphere where it is already defined, (b) is either even or odd (this will be important in Section 4). Ensuring that the regularity conditions can be met is classical when extending to the full sphere (see, e.g., Whitney, 1934). Ensuring that the function may be chosen as odd or even may be obtained by multiplying the function by an infinitely differentiable function which is equal to one for and zero for , and extending by or on the hemi-sphere .
In summary, we may consider in Section 4 functions defined on the sphere, which are much easier to analyze. In the rest of the section, we specialize some of the general concepts reviewed in Section 2 to our neural network setting with specific activation functions, namely, in terms of corresponding kernel functions and geometric reformulations of the Frank-Wolfe steps.
1 Corresponding positive-definite kernels
There are key differences and similarities between the RKHS and our space of functions . The RKHS is smaller than (i.e., the norm in the RKHS is larger than the norm in ); this implies that approximation properties of the RKHS are transferred to . In fact, our proofs rely on this fact.
However, the RKHS norm does not lead to any adaptivity, while the function space does (see more details in Section 5). This may come as a paradox: both the RKHS and have similar properties, but one is adaptive while the other one is not. A key intuitive difference is as follows: given a function expressed as , then , while the squared RKHS norm is . For the -norm, the measure may tend to a singular distribution with a bounded norm, while this is not true for the -norm. For example, the function is in , while it is not in in general.
2 Incremental optimization problem for α=0𝛼0\alpha=0
where and . As outlined by Bengio et al. (2006), this is equivalent to finding an hyperplane parameterized by that minimizes a weighted mis-classification rate (when doing linear classification). Note that the norm of has no effect.
Convex relaxation.
Given linear binary classification problems, there are several algorithms to approximately find a good half-space. These are based on using convex surrogates (such as the hinge loss or the logistic loss). Although some theoretical results do exist regarding the classification performance of estimators obtained from convex surrogates (Bartlett et al., 2006), they do not apply in the context of linear classification.
3 Incremental optimization problem for α=1𝛼1\alpha=1
For the problem of maximizing \big{|}\sum_{i=1}^{n}y_{i}(v^{\top}z_{i})_{+}\big{|}, then this corresponds to
This is exactly the Hausdorff distance between the two convex sets and (referred to as zonotopes, see below).
Given the pair achieving the Hausdorff distance, then we may compute the optimal as v=\arg\max_{\|v\|_{p}\leqslant 1}v^{\top}\big{(}T_{+}^{\top}b_{+}-T_{-}^{\top}b_{-}\big{)}. Note this has not changed the problem at all, since it is equivalent. It is still NP-hard in general (König, 2014). But we now have a geometric interpretation with potential approximation algorithms. See below and Section 6.
A zonotope is the Minkowski sum of a finite number of segments from the origin, that is, of the form
In our context, the two convex sets and defined above are thus zonotopes. See an illustration of the Hausdorff distance computation in Figure 4 (middle plot), which is the core computational problem for .
Approximation by ellipsoids.
Centrally symmetric convex polytopes (w.l.o.g. centered around zero) may be approximated by ellipsoids. In our set-up, we could use the minimum volume enclosing ellipsoid (see, e.g. Barvinok, 2002), which can be computed exactly when the polytope is given through its vertices, or up to a constant factor when the polytope is such that quadratic functions may be optimized with a constant factor approximation. For zonotopes, the standard semi-definite relaxation of Nesterov (1998) leads to such constant-factor approximations, and thus the minimum volume inscribed ellipsoid may be computed up to a constant. Given standard results (see, e.g. Barvinok, 2002), a -scaled version of the ellipsoid is inscribed in this polytope, and thus the ellipsoid is a provably good approximation of the zonotope with a factor scaling as . However, the approximation ratio is not good enough to get any relevant bound for our purpose (see Section 5.5), as for computing the Haussdorff distance, we care about potentially vanishing differences that are swamped by constant factor approximations.
NP-hardness.
Given the reduction of the case (rectified linear units) to (exact thresholds) (Livni et al., 2014), the incremental problem is also NP-hard, so as obtaining a constant-factor approximation. However, this does not rule out convex relaxations with non-constant approximation ratios (see Section 6 for more details).
4 Incremental optimization problem for α⩾2𝛼2\alpha\geqslant 2
Note that, while for we can use the identity to replace the rectified linear unit by the absolute value and obtain the same function space, this is not possible for , as and do not differ by a linear function. This implies that the results from Livni et al. (2014), which state that for the quadratic activation function, the incremental problems is equivalent to an eigendecomposition (and hence solvable in polynomial time), do not apply.
Approximation properties
In this section, we first consider approximation properties of functions in by a finite number of neurons (only for ). We then study approximation properties of functions on the sphere by functions in . It turns out that all our results are based on the approximation properties of the corresponding RKHS : we give sufficient conditions for being in , and then approximation bounds for functions which are not in . Finally we transfer these to the spaces and , and consider in particular functions which only depend on projections on a low-dimensional subspace, for which the properties of and (and of and ) differ. This property is key to obtaining generalization bounds that show adaptivity to linear structures in the prediction functions (as done in Section 5).
Approximation properties of neural networks with finitely many neurons have been studied extensively (see, e.g., Petrushev, 1998; Pinkus, 1999; Makovoz, 1998; Burger and Neubauer, 2001). In Section 4.7, we relate our new results to existing work from the literature on approximation theory, by showing that our results provide an explicit control of the various weight vectors which are needed for bounding the estimation error in Section 5.
Positively homogenous convex functions may be written as the support function of a compact convex set (Rockafellar, 1997), that is, , and the set characterizes the function . The functions and defined above are not any convex positively homogeneous functions, as we now describe.
If the measure is supported by finitely many points, that is, with , then for . Thus the corresponding set is the zonotope [0,t_{1}]+\cdots+[0,t_{r}]=\big{\{}\sum_{i=1}^{r}b_{i}t_{i},\ b\in^{r}\big{\}} already defined in Section 3.3. Thus the functions and for finitely supported measures are support functions of zonotopes.
When the measure is not constrained to have finite support, then the sets and are limits of zonotopes, and thus, by definition, zonoids (Bolker, 1969), and thus functions in are differences of support functions of zonoids. Zonoids are a well-studied set of convex bodies. They are centrally symmetric, and in two dimensions, all centrally symmetric compact convexs sets are (up to translation) zonoids, which is not true in higher dimensions (Bolker, 1969). Moreover, the problem of approximating a zonoid by a zonotope with a small number of segments (Bourgain et al., 1989; Matoušek, 1996) is essentially equivalent to the approximation of a function by finitely many neurons. The number of neurons directly depends on the norm , as we now show.
with , for some constant that depends only on . We may then simply write
Finally, the number of neurons needed to express a function with a bound on the -norm can be estimated from general results on approximating reproducing kernel Hilbert space described in Section 2.3, whose kernel can be expressed as an expectation. Indeed, Bach (2015) shows that with neurons, one can approximate a function in with unit -norm with an error measured in of . When inverting the relationship between and , we get a number of neurons scaling as , which is the same as in Prop. 1 but with an error in -norm instead of -norm.
2 Sufficient conditions for finite variation
Tightness of conditions: as shown in Appendix D.5, there are functions , which have bounded first derivatives and do not belong to while (at least when is even). Therefore, when is even, the scaling in is optimal.
Dependence on : for any , the higher the , the stricter the sufficient condition. Given that the estimation error grows slowly with (see Section 5.1), low values of would be preferred in practice.
Dependence on : a key feature of the sufficient condition is the dependence on , that is, as increases the number of derivatives has to increase in —like for Sobolev spaces in dimension (Adams and Fournier, 2003). This is another instantiation of the curse of dimensionality: only very smooth functions in high dimensions are allowed.
3 Approximation of Lipschitz-continuous functions
In order to derive generalization bounds for target functions which are not sufficiently differentiable (and may not be in or ), we need to approximate any Lipschitz-continuous function, with a function with a norm that will grow as the approximation gets tighter. We give precise rates in the proposition below. Note the requirement for parity of the function . The result below notably shows the density of in uniform norm in the space of Lipschitz-continuous functions of the given parity, which is already known since our activation functions are not polynomials (Leshno et al., 1993).
This proposition is shown in Appendix C.3 for (using Fourier series) and in Appendix D.4 for all (using spherical harmonics). We can make the following observations:
Dependence in and : as expected, the main term in the error bound \big{(}{\delta}/{\eta}\big{)}^{-1/(\alpha+(d-1)/2)} is a decreasing function of , that is when the norm is allowed to grow, the approximation gets tighter, and when the Lipschitz constant of increases, the approximation is less tight.
Dependence on and : the rate of approximation is increasing in and . In particular the approximation properties are better for low .
Special case and : up to the logarithmic term we recover the result of Prop. 2, that is, the function is in .
Tightness: in Appendix D.5, we provide a function which is not in the RKHS and for which the tightest possible approximation scales as . Thus the linear scaling of the rate as is not improvable (but constants are).
4 Linear functions
We see that for , the -norm is less than a constant, and is much smaller than the -norm (which scales as ). For , we were not able to derive better bounds for (other than the value of .
5 Functions of projections
However, for the RKHS norm , this reasoning does not apply. For example, a certain function exists, which is -times differentiable, as shown in Appendix D.5, for (when is even), and is not in . Thus, given Prop. 2, the dependence on a uni-dimensional projection does not make a difference regarding the level of smoothness which is required to belong to .
Proof By assumption, the function has all its derivatives bounded by a constant times . Moreover, we have defined g(t,a)=f\big{(}\frac{Rt}{a}\big{)}a^{\alpha} so that all derivatives are bounded by . The result then follows immediately from Prop. 2.
Proof With the same reasoning as above, we obtain that is Lipschitz-continuous with constant , we thus get the desired approximation error from Prop. 3.
If , with and , then for , it is straightforward that . Moreover, we have . For other values of , we also have -norms less than a constant (depending only of ) times . The RKHS norms are bit harder to compute since linear functions for leads to linear functions for only for .
Functions of projections.
7 Related work
In this section, we show how our results from the previous sections relate to existing work on neural network approximation theory.
In this section, we only consider the case , for which we have two approximation bounds: Prop. 6 which approximates any -Lipschitz-continuous function by a function with finite -norm less than and uniform error less than \displaystyle\eta\big{(}{\delta}/{\eta}\big{)}^{-2/(d+1)}\log\big{(}{\delta}/{\eta}\big{)}, and Prop. 1 which shows that a function with -norm less than , may be approximated with neurons with uniform error .
Thus, given neurons, we get an approximation of the original function with uniform error
We can optimize over , and use , to obtain a uniform approximation bound proportional to , for approximating an -Lipschitz-continuous function with neurons.
Approximation by ridge functions.
The approximation properties of single hidden layer neural networks have been studied extensively, where they are often referred to as “ridge function” approximations. As shown by Pinkus (1999, Corollary 6.10)—based on a result from Petrushev (1998), the approximation order of for the rectified linear unit was already known, but only in -norm (and without the factor ), and without any constraints on the input and output weights. In this paper, we provide an explicit control of the various weights, which is needed for computing estimation errors. Moreover, while the two proof techniques use spherical harmonics, the proof of Petrushev (1998) relies on quadrature formulas for the associated Legendre polynomials, while ours relies on the relationship with the associated positive definite kernels, is significantly simpler, and offers additional insights into the problem (relationship with convex neural networks and zonoids). Maiorov (2006, Theorem 2.3) also derives a similar result, but in -norm (rather than uniform norm), and for sigmoidal activation functions (which are bounded). Note finally, that the order cannot be improved (DeVore et al., 1989, Theorem 4.2). Also, Maiorov and Meir (2000, Theorem 5) derive similar upper and lower bounds based on a random sampling argument which is close to using random features in the RKHS setting described in Section 2.3.
Relationship to hardness results for Boolean-valued functions.
In this paper, we consider a particular view of the curse of dimensionality and ways of circumventing it, that is, our distribution over inputs is arbitrary, but our aim is to approximate a real-valued function. Thus, all hardness results depending on functions with values in do not apply there directly—see, e.g., Shalev-Shwartz and Ben-David (2014, Chapter 20), for the need of exponentially many hidden units for approximating most of the functions from to .
Our approximation bounds show that, without any assumption beyond Lipschitz-continuity of the target function, it sufficient to have a number of hidden units which is still exponential in dimension (hence we also suffer from the curse of dimensionality), but a soon as the target function depends on linear low-dimensional structure, then we lose this exponential dependence. It would be interesting to study an extension to -valued functions, and also to relate our results to the number of linear regions delimited by neural networks with rectified linear units (Montufar et al., 2014).
Generalization bounds
Our goal is to derive the generalization bounds outlined in Section 2.4 for neural networks with a single hidden layer. The main results that we obtain are summarized in Table 2 and show adaptivity to assumptions that avoid the curse of dimensionality.
Logistic regression and support vector machines: we have .
Least-squares regression: we take G=\max\big{\{}\sqrt{2}\delta+\|y\|_{\infty},\frac{\|y\|_{\infty}^{2}}{\sqrt{2}\delta}\big{\}}.
Approximation errors will be obtained from the approximation results from Section 4 by assuming that the optimal target function has a specific form. Indeed, we have:
We now deal with estimation errors using Rademacher complexities.
The following proposition bounds the uniform deviation between and its empirical counterpart . This result is standard (see, e.g., Koltchinskii, 2001; Bartlett and Mendelson, 2003) and may be extended in bounds that hold with high-probability.
We have the following bound on the expected uniform deviation:
for , for and for
for , , where is a universal constant.
Proof We use the standard framework of Rademacher complexities and get:
We then take different routes for and .
For , we have the upper-bound
From Kakade et al. (2009), we get the following bounds on Rademacher complexities:
Our generalization bounds are expected values of the excess expected risk for a our estimator (where the expectation is taken over the data).
We assume , with and . Then, as seen in Section 4.6, with (the constant is independent of because we approximate an affine function). From Prop. 7, we thus get a generalization bound proportional to times a constant (that may depend on ), which is the same as assuming directly that we optimize over linear predictors only. The chosen is then a constant times , and does not grow with , like in parametric estimation (although we do use a non-parametric estimation procedure).
Projection pursuit.
We assume , with and each bounded by and -Lipschitz continuous. From Prop. 6, we may approach each by a function with -norm less than and uniform approximation . This leads to a total approximation error of for a norm less than (the constant is independent of because we approximate a function of one-dimensional projection).
For , from Prop. 5, the target function belongs to with a norm less than , leading to an overall generalization bound of .
Note that when the functions are exactly the activation functions, the bound is better, as these functions directly belong to the space .
Multi-dimensional projection pursuit.
For , the estimation error is , with an overall bound of C(\alpha,s)kGR\eta\big{(}{\delta}/{\sqrt{n}}+\delta^{-1/(\alpha+(s-1)/2)}\log\delta\big{)}. With , we get an optimized bound of .
For , we have an overall bound of C(s)kGR\eta\big{(}\delta^{-2/(s-1)}\log\delta+\frac{\delta\sqrt{d}}{\sqrt{n}}\big{)}, and with , we get a generalization bound scaling as .
Note that for and , we recover the usual Lipschitz-continuous assumption, with a rate of .
Summary table: when we know a bound on all dimensions of , then we may take ; this is helpful in comparisons in Table 2, where is replaced by and the dependence in is removed as it is the same for all models.
Dependence on : when making only a global Lipschitz-continuity assumption, the generalization bound has a bad scaling in , i.e., as , which goes down to zero slowly when increases. However, when making structural assumptions regarding the dependence on unknown lower-dimensional subspaces, the scaling in disappears.
Comparing different values of : the value always has the best scaling in , but constants are better for (among which has the better scaling in ).
Bounds for : The simplest upper bound for the penalization by the space depends on the approximation properties of . For linear functions and , it is less than , with a bound . For the other values of , there is a constant . Otherwise, there is no adaptivity and all other situations only lead to upper-bounds of . See more details in Section 5.4.
Sample complexity: Note that the generalization bounds above may be used to obtain sample complexity results such as for affine functions, for projection pursuit, and for the generalized version (up to logarithmic terms).
Relationship to existing work: Maiorov (2006, Theorem 1.1) derives similar results for neural networks with sigmoidal activation functions (that tend to one at infinity) and the square loss only, and for a level of smoothness of the target function which grows with dimension (in this case, once can get easily rates of ). Our result holds for problems where only bounded first-order derivatives are assumed, but by using Prop. 2, we would get similar rate by ensuring the target function belongs to and hence to .
Lower bounds.
In the sections above, we have only provided generalization bounds. Although interesting, deriving lower-bounds for the generalization performance when the target function belongs to certain function classes is out of the scope of this paper. Note however, that results from Sridharan (2012) suggest that the Rademacher complexities of the associated function classes provide such lower-bounds. For general Lipschitz-functions, these Rademacher complexities decreases as (von Luxburg and Bousquet, 2004).
We assume , with and . Given that we have assumed that has at most non-zeros, we have .
Then, with , with a constant that is independent of because we have an affine function.
From Prop. 7, we thus get a rate of times a constant (that may depend on ), which is the same as assuming directly that we optimize over linear predictors only (see, for example, Bühlmann and Van De Geer, 2011). We recover a high-dimensional phenomenon (although with a slow rate in ), where may be much larger than , as long as is small compared to . The chosen is then a constant times (and does not grow with ).
Projection pursuit.
We assume , with (which implies given our sparsity assumption) and each bounded by and -Lipschitz continuous. We may approach each by a function with -norm less than and uniform approximation , with a constant that is independent of because we have a function of one-dimensional projection. This leads to a total approximation error of for a norm less than .
For , the estimation error is , with an overall bound of C(\alpha)kGr\sqrt{q}\eta\big{(}\delta^{-1/\alpha}\log\delta+\frac{\delta\sqrt{\log d}}{\sqrt{n}}\big{)}. With , we get an optimized bound of , with a scaling only dependent in with a logarithmic factor.
For , the target function belongs to with a norm less than , leading to an overal bound of (the sparsity is not helpful in this case).
Multi-dimensional projection pursuit.
We may approach each by a function with -norm less than and uniform approximation . This leads to a total approximation error of .
For , the estimation error is , with an overall bound which is equal to C(\alpha,s)kGr\sqrt{q}\eta\big{(}\delta^{-1/(\alpha+(s-1)/2)}\log\delta+\frac{\delta\sqrt{\log d}}{\sqrt{n}}\big{)}. With , we get an optimized bound of .
For , we have the bound , that is we cannot use the sparsity as the problem is invariant to the chosen norm on hidden weights.
High-dimensional variable selection: when , and is a projection onto variables, then we obtain a bound proportional to , which exhibits a high-dimensional scaling in a non-linear setting. Note that beyond sparsity, no assumption is made (in particular regarding correlations between input variables), and we obtain a high-dimensional phenomenon where may be much larger than .
4 Relationship to kernel methods and random sampling
5 Sufficient condition for polynomial-time algorithms
up to a constant factor. That is, there exists , such that for all and , we may compute such that and
This is provably NP-hard for (see Section 3.2), and for (see Section 3.3). If such an algorithm is available, the approximate conditional gradient presented in Section 2.5 leads to an estimator with the same generalization bound. Moreover, given the strong hardness results for improper learning in the situation (Klivans and Sherstov, 2006; Livni et al., 2014), a convex relaxation that would consider a larger set of predictors (e.g., by relaxing into a symmetric positive-definite matrix), and obtained a constant approximation guarantee, is also ruled out.
Lower-bound on : It is defined from functions , for , where for any , there exists such that . This implies that the corresponding space is larger than and that if , then .
Polynomial-time algorithm for dual norm: The dual norm \displaystyle\sup_{\hat{v}\in\hat{\mathcal{V}}}\bigg{|}\frac{1}{n}\sum_{i=1}^{n}y_{i}\hat{\varphi}_{\hat{v}}(z_{i})\bigg{|} may be computed in polynomial time.
We may also replace the standard Gaussian vectors by Rademacher random variables.
We can then penalize by instead of . Since , approximation properties are transferred, and because of the result above, the Rademacher complexity for -balls scales as well as for -balls. In the next section, we show convex relaxations which cannot achieve these and leave the existence or non-existence of such norm as an open problem.
Convex relaxations of the Frank-Wolfe step
We present two relaxations, which are of the form described in Section 5.5 (leading to potential generalization bounds) but do not attain the proper approximation scaling (as was checked empirically).
Note that all relaxations that end up being Lipschitz-continuous functions of , will have at least the same scaling than the set of these functions. The Rademacher complexity of such functions is well-known, that is for , for and for larger (von Luxburg and Bousquet, 2004). Unfortunately, the decay in is too slow to preserve generalization bounds (which would require a scaling in ).
We denote . We may then use and, for , . By denoting , the constraint that is equivalent to
We obtain a convex relaxation when removing the rank constraint, that is
(n+d)𝑛𝑑(n+d)-dimensional relaxation.
which leads to a convex program in , and , that is a semidefinite program with dimensions, with the constraints
and the usual semi-definite contraints \displaystyle\left(\begin{array}[]{cc}U&J\\ J^{\top}&V\end{array}\right)\succcurlyeq\left(\begin{array}[]{c}u\\ v\end{array}\right)\left(\begin{array}[]{c}u\\ v\end{array}\right)^{\top}, with the additional constraint that .
If we add these constraints on top of the ones above, we obtain a tighter relaxation. Note that for this relaxation, we must have \big{[}(2u_{i}-v^{\top}z_{i})-(2u_{j}-v^{\top}z_{j})\big{]} less than a constant times . Hence, the result mentioned above regarding Lipschitz-continuous functions and the scaling of the upper-bound for random holds (with the dependence on which is not good enough to preserve the generalization bounds with a polynomial-time algorithm).
2 Relaxation of sign vectors
Usual semi-definite constraint: \displaystyle\left(\begin{array}[]{cc}S&J\\ J^{\top}&V\end{array}\right)\succcurlyeq\left(\begin{array}[]{c}s\\ v\end{array}\right)\left(\begin{array}[]{c}s\\ v\end{array}\right)^{\top},
Unit/trace constraints: and ,
Sign constraint: .
Additional constraint: .
Conclusion
In this paper, we have provided a detailed analysis of the generalization properties of convex neural networks with positively homogenous non-decreasing activation functions. Our main new result is the adaptivity of the method to underlying linear structures such as the dependence on a low-dimensional subspace, a setting which includes non-linear variable selection in presence of potentially many input variables.
All our current results apply to estimators for which no polynomial-time algorithm is known to exist and we have proposed sufficient conditions under which convex relaxations could lead to the same bounds, leaving open the existence or non-existence of such algorithms. Interestingly, these problems have simple geometric interpretations, either as binary linear classification, or computing the Haussdorff distance between two zonotopes.
In this work, we have considered a single real-valued output; the functional analysis framework readily extends to outputs in a finite-dimensional vector-space where vector-valued measures could be used, and then apply to multi-task or multi-class problems. However, the extension to multiple hidden layers does not appear straightforward as the units of the last hidden layers share the weights of the first hidden layers, which should require a new functional analysis framework.
References
We follow the proof of Berlinet and Thomas-Agnan (2004, Section 4.1) and extend it to integrals rather than finite sums. We consider the linear mapping defined by , with null space . When restricted to the orthogonal complement , we obtain a bijection from to . We then define a dot-product on as .
We first show that this defines an RKHS with kernel defined above. For this, we trivially have for all . Moreover, for any , we have with and , , which implies that , hence the reproducing property is satisfied. Thus, is an RKHS.
We now need to show that the RKHS norm which we have defined is actually . For any such that , for , we have , where . Thus, . This implies that with equality if and only if . This shows that .
B Approximate conditional gradient with multiplicative oracle
for a fixed . We now propose a modification of the conditional gradient algorithm that converges to a certain such that and for which .
We assume the smoothness of the function with respect to the norm , that is, for a certain , for all such that , then
In the previous recursion, one may replace the minimization of on the segment with the minimization of its upper-bound of Eq. (7) taken at . From the recursion, all iterates are in the -ball of radius . Following the traditional convergence proof for the conditional gradient method (Dunn and Harshbarger, 1978; Jaggi, 2013), we have, for any in $$:
If we take the minimizer of on , we get:
Then, by using , we get:
This is valid for any . If for some , then by taking it remains the same of all greater . Therefore, up to (the potentially never happening) point where , we can apply the regular proof of the conditional gradien to obtain: , which leads to the desired result. Note that a similar reasoning may be used for .
C Proofs for the 222-dimensional sphere (d=1𝑑1d=1)
For , the same equality holds (except that the two coefficients and are divided by except of ).
We thus simply need to compute and its decay for all values of , and then relate them to the smoothness properties of , which is standard for Fourier series.
We now detail the computation of for the different functions . We have for :
For it is equal to . It is equal to zero for all other even , and different from zero for all odd , with going to zero as .
For , it is equal to . It is equal to zero for all other odd , and different from zero for all even , with going to zero as .
For , it is equal to , and for , it is equal to . It is equal to zero for all other even , and different from zero for all odd , with going to zero as .
C.2 Proof of Prop. 2 for d=1𝑑1d=1
We only consider the proof for . For the proof for general , see Appendix D.3.
Given the zero values of given above, if has the opposite parity than (that is, is even when is odd, and vice-versa), then we may define through its Fourier series, which is obtained by multiplying the one of by a strictly positive sequence growing as .
Note that we could relax the assumption that is even (resp. odd) by adding all trigonometric polynomials of order less than .
C.3 Proof of Prop. 3 for d=1𝑑1d=1
Again, we only consider the proof for . For the proof for general , see Appendix D.4.
Without loss of generality, we assume that . For , we essentially want to approximate a Lipschitz-continuous function by a function which is -times differentiable.
For , then the function is already in with a norm less than one, because Lipschitz-continuous functions are almost everywhere differentiable with bounded derivative (Adams and Fournier, 2003). We thus now consider .
Given defined above and , we define through
Since is -Lipschitz-continuous with constant , then it has a squared-integrable derivative with norm less than 1 (Adams and Fournier, 2003), so that
This implies that using :
Computing distance between g^^𝑔\hat{g} and g𝑔g.
It can be easily checked that for any , the last function is less than a constant times . We thus get for large enough, by taking , an error of
D Approximations on the d𝑑d-dimensional sphere
In this appendix, we first review tools from spherical harmonic analysis, before proving the two main propositions regarding the approximation properties of the Hilbert space . Using spherical harmonics in our set-up is natural and is common in the analysis of ridge functions (Petrushev, 1998) and zonotopes (Bourgain and Lindenstrauss, 1988).
In this section, we review relevant concepts from spherical harmonics. See Frye and Efthimiou (2012); Atkinson and Han (2012) for more details. Spherical harmonics may be seen as extension of Fourier series to spheres in dimensions more than 2 (i.e., with our convention ).
Legendre polynomials.
where is a Legendre polynomial of degree and dimension , defined as (Rodrigues’ formula):
They are also referred to as Gegenbauer polynomials. For , is the -th Chebyshev polynomial, such that for all (and we thus recover the Fourier series framework of Appendix C). For , is the usual Legendre polynomial.
The polynomial is even (resp. odd) when is even (resp. odd), and we have
For small , we have , , and .
The Hecke-Funk formula leads to, for any linear combination of , :
This is the decomposition in harmonics of degree . Note that
Decomposition of functions of one-dimensional projections.
If , then , for since is orthogonal to all polynomials of degree strictly less than for that dot-product. Otherwise, , since may be decomposed as combination with non-zero coefficients of polynomials for , .
We now provide an explicit formula extending the proof technique (for ) of Schneider (1967) and Bourgain and Lindenstrauss (1988) to all values of . See also Mhaskar (2006).
We have, by successive integration by parts, for :
By using Stirling formula , we get an equivalent when or tends to infinity as a constant (that depends on ) times
Note that all exponential terms cancel out. Moreover, when tends to infinity and is considered constant, then we get the equivalent , which we need for the following sections. Finally, when tends to infinity and is considered constant, then we get the equivalent .
We will also need expressions of for and . For , we have:
using the normalization factor of the Beta distribution. This leads to
which is equivalent to as tends to infinity.
Moreover, for , we have (for ):
which is equivalent to as tends to infinity.
Finally, for , . More generally, we have .
Given with the correct parity, then we have
D.3 Proof of Prop. 2 for d>1𝑑1d>1
Moreover, since has the correct parity,
Also, are eigenfunctions of the Laplacian with eigenvalues . Thus, we have
D.4 Proof of Prop. 3 for d>1𝑑1d>1
which is always defined when because the series is absolutely convergent. This defines a function that will have a finite -norm and be close to .
The function thus defines a function by , for which .
Approximation properties.
We now show that and are close to each other. Because of the parity of , we have . We have, using Theorem 4.28 from Frye and Efthimiou (2012):
Moreover, following Bourgain and Lindenstrauss (1988), we have:
As shown by Bourgain and Lindenstrauss (1988, Eq. (2.13)), this is less than a constant that depends on times . We thus get for large enough, by taking , an error of
In this section, we consider functions on the sphere which have the proper parity with respect to , which are -times differentiable with bounded derivatives, but which are not in . We then provide optimal approximation rates for these functions.
The summand has an asymptotic equivalent proportional to . Thus if , the series is divergent (the function is not in the RKHS), i.e., if .
which should be of order (this gives the scaling of as a function of ). Then the squared error is
and thus the (non-squared) approximation error scales as For , this leads to a scaling as .
D.6 Proof of Prop. 4
For , by writing we obtain the upperbound . For all other situations, we may compute
We assume that we are given two ellipsoids defined as and and we want to compute their Hausdorff distance. This leads to the two equivalent problems
We consider the following convex optimization problem, with ; we have by Lagrangian duality:
If , then and . Otherwise, at the optimum, and , which implies , which leads to , which is important to reduce the interval of possible . The optimal may then be obtained by binary search (from a single SVD of ).
Minimizing concave quadratic forms over the sphere.
We consider the following non-convex optimization problem, with , for which strong Lagrangian duality is known to hold (Boyd and Vandenberghe, 2004):
At the optimum, we have , which implies , which leads to . We may perform binary search on from a single SVD of .
Computing the Haussdorff distance.
with . The interval in which is sufficient to explore is
which are bounds that are independent of .
Given , we have the problem of
We have leading to . We need . Moreover
The author was partially supported by the European Research Council (SIERRA Project), and thanks Nicolas Le Roux for helpful discussions. The author also thanks Varun Kanade for pointing the NP-hardness linear classification result. The main part of this work was carried through while visiting the Centre de Recerca Matemàtica (CRM) in Barcelona.