On the Complexity of Learning Neural Networks

Le Song, Santosh Vempala, John Wilmes, Bo Xie

Introduction

It is well-known that Neural Networks (NN’s) provide universal approximate representations and under mild assumptions, i.e., any real-valued function can be approximated by a NN. This holds for a wide class of activation functions (hidden layer units) and even with only a single hidden layer (although there is a trade-off between depth and width ). Typically learning a NN is done by stochastic gradient descent applied to a loss function comparing the network’s current output to the values of the given training data; for regression, typically the function is just the least-squares error. Variants of gradient descent include drop-out, regularization, perturbation, batch gradient descent etc. In all cases, the training algorithm has the following form:

Repeat: 1. Compute a fixed function $F_{W}(.)$ defined by the current network weights $W$ on a subset of training examples. 2. Use $F_{W}(.)$ to update the current weights $W$ .

The empirical success of this approach raises the question: what can NN’s learn efficiently in theory? In spite of much effort, at the moment there are no satisfactory answers to this question, even with reasonable assumptions on the function being learned and the input distribution.

When learning involves some computationally intractable optimization problem, e.g., learning an intersection of halfspaces over the uniform distribution on the Boolean hypercube, then any training algorithm is unlikely to be efficient. This is the case even for improper learning (when the complexity of the hypothesis class being used to learn can be greater than the target class). Such lower bounds are unsatisfactory to the extent they rely on discrete (or at least nonsmooth) functions and distributions. What if we assume that the function to be learned is generated by a NN with a single hidden layer of smooth activation units, and the input distribution is benign? Can such functions be learned efficiently by gradient descent?

Our main result is a lower bound, showing a simple and natural family of functions generated by $1$ -hidden layer NN’s using any known activation function (e.g., sigmoid, ReLU), with each input drawn from a logconcave input distribution (e.g., Gaussian, uniform in an interval), are hard to learn by a wide class of algorithms, including those in the general form above. Our finding implies that efficient NN training algorithms need to use stronger assumptions on the target function and input distribution, more so than Lipschitzness and smoothness even when the true data is generated by a NN with a single hidden layer.

The idea of the lower bound has two parts. First, NN updates can be viewed as statistical queries to the input distribution. Second, there are many very different $1$ -layer networks, and in order to learn the correct one, any algorithm that makes only statistical queries of not too small accuracy has to make an exponential number of queries. The lower bound uses the SQ framework of Kearns as generalized by Feldman et al. .

The bound on the RHS is the standard deviation of $t$ independent Bernoulli coins with desired expectation, i.e., the error that even a random sample of size $t$ would yield. In this paper, we study SQ algorithms that access the input distribution only via the $\operatorname{VSTAT}(t)$ oracle. The remaining computation is unrestricted and can use randomization (e.g., to determine which query to ask next).

In the case of an algorithm training a neural network via gradient descent, the relevant query functions are derivatives of the loss function.

2 Main result

Note that the parameter $s$ also bounds the Lipschitz constant of $\sigma(x)$ .

As mentioned in the introduction, we can view a NN learning algorithm as a statistical query (SQ) algorithm: in each iteration, the algorithm constructs a function based on its current weights (typically a gradient or subgradient), evaluates it on a batch of random examples from the input distribution, then uses the evaluations to update the weights of the NN. Then we have the following result.

The Lipschitz assumption on the statistical queries is satisfied by all commonly used algorithms for training neural networks can be simulated with Lipschitz queries (e.g., gradients of natural loss functions with regularizers). This assumption can be omitted if the output of the hard-to-learn family $\mathcal{C}$ is represented with finite precision (see Corollary 5).

Informally, Theorem 1.1 shows that there exist simple realizable functions that are not efficiently learnable by NN training algorithms with polynomial batch sizes, assuming the algorithm allows for error as much as the standard deviation of random samples for each query. We remark that in practice, large batch sizes are seldom used for training NNs, not just for efficiency, but also since moderately noisy gradient estimates are believed to be useful for avoiding bad local minima. Even NN training algorithms with larger batch sizes will require $\Omega(t)$ samples to achieve lower error, whereas the NNs that represent functions in our class $\mathcal{C}$ have only $\widetilde{O}(\sqrt{t})$ parameters.

Our lower bound extends to a broad family of activation units, including all the well-known ones (ReLU, sigmoid, softplus etc., see Section 3.1). In the case of sigmoid gates, the functions of $\mathcal{C}$ take the following form (cf. Figure 1.1). For a set $S\subseteq\{1,\ldots,n\}$ , we define $f_{m,S}(x_{1},\ldots,x_{n})=\phi_{m}(\sum_{i\in S}x_{i})$ , where

As we show empirically in Section 4, these lower bounds hold even for small values of $n$ and $s$ , across choices of gates, architecture used to learn, learning rate, batch size, etc. As suggested by the statement of Theorem 1.1, there is threshold for the quantity $s\sqrt{n}$ , above which stochastic gradient descent fails to train the NN to low error — the regression error of the trained NN does not even improve on that of a constant function.

The condition number upper bound for $\mathcal{C}$ is significant in part because there do exist SQ algorithms for learning certain families of simple NNs with time complexity polynomial in the condition number of the weight matrix (the tensor factorization based algorithm of Janzamin et al. can easily be seen to be SQ). Our results imply that this dependence cannot be substantially improved (see Section 1.3).

The class of input distributions can be relaxed further. Rather than being a product distribution, it suffices if the distribution is in isotropic position and invariant under reflections across and permutations of coordinate axes. And instead of being logconcave, it suffices for marginals to be unimodal with variance $\sigma$ , density $O(1/\sigma)$ at the mode, and density $\Omega(1/\sigma)$ within a standard deviation of the mode.

Overall, our lower bounds suggest that even the combination of small network size, smooth, standard activation functions, and benign input distributions is insufficient to make learning a NN easy, even improperly via a very general family of algorithms. Instead, stronger structural assumptions on the NN, such as a small condition number, and very strong structural properties on the input distribution, are necessary to make learning tractable. It is our hope that these insights will guide the discovery of provable efficiency guarantees.

3 Related Work

There is much work on complexity-theoretic hardness of learning neural networks . These results have shown the hardness of learning functions representable as small (depth $2$ ) neural networks over discrete input distributions. Since these input distributions bear little resemblance to the real-world data sets on which NNs have seen great recent empirical success, it is natural to wonder whether more realistic distributional assumptions might make learning NNs tractable. Our results suggest that benign input distributions are insufficient, even for functions realized as small networks with standard, smooth activation units.

Recent independent work of Shamir shows a smooth family of functions for which the gradient of the squared loss function is not informative for training a NN over a Gaussian input distribution (more generally, for distributions with rapidly decaying Fourier coefficients). In fact, for this setting the paper shows an exponentially small bound on the gradient, relying on the fine structure of the Gaussian distribution and of the smooth functions (see for a follow-up with experiments and further ideas). These smooth functions cannot be realized in small NNs using the most commonly studied activation units (though a related non-smooth family of functions for which the bounds apply can be realized by larger NNs using ReLU units). In contrast our bounds are (a) in the more general SQ framework, and in particular apply regardless of the loss function, regularization scheme, or specific variant of gradient descent (b) apply to functions actually realized as small NNs using any of a wide family of activation units (c) apply to any logconcave input distribution and (d) are robust to small perturbations of the input layer weights.

Also related is the tensor-based algorithm of Janzamin et al. to learn a 1-layer network under nondegeneracy assumptions on the weight matrix. The complexity is polynomial in the dimension, size of network being learned and condition number of the weight matrix. Since their tensor decomposition can also be implemented as a statistical query algorithm, our results give a lower bound indicating that such a polynomial dependence on the dimension and condition number is unavoidable.

Other algorithmic results for learning NNs apply in very restricted settings. For example, polynomial-time bounds are known for learning NNs with a single hidden ReLU layer over Gaussian inputs under the assumption that the hidden units use disjoint sets of inputs , as well as for learning a single ReLU and for learning sparse polynomials via NNs .

4 Proof ideas

Our starting point is the observation is that NN training algorithms are inherently statistical and can be simulated by $\operatorname{VSTAT}$ . This is because in each step a gradient is computed by averaging over a batch of random examples. The number of samples needed in the average depends only on the range of the functions being queried. In addition to these queries, the algorithm is allowed to perform any other computations that do not query the input.

One way to prove a lower bound on the number of queries is to exhibit neural networks approximating parity functions. Since parity of an unknown subset is hard-to-learn for SQ algorithms, this would give a lower bound in the setting of neural networks as well. However, the resulting bound on tolerance is much worse since the discrete parity function would have to be approximated by a continuous function.

Instead, we directly estimate the statistical dimension of the family of $s$ -wave functions. Statistical dimension is a key concept in the study of SQ algorithms, and is known to characterize the query complexity of supervised learning via SQ algorithms . Briefly, a family $\mathcal{C}$ of distributions (e.g., over labeled examples) has statistical dimension $d$ with average correlation $\bar{\gamma}$ if every $(1/d)$ -fraction of $\mathcal{C}$ has average correlation $\bar{\gamma}$ ; this condition implies that $\mathcal{C}$ cannot be learned with fewer than $O(d)$ queries to $\operatorname{VSTAT}(O(1/\bar{\gamma}))$ . See Section 2.1 for precise statements.

The SQ literature for supervised learning of boolean functions is rich. However, lower bounds for regression problems in the SQ framework have so far not appeared in the literature, and the existing notions of statistical dimension are too weak for this setting. We state a new, strengthened notion of statistical dimension for regression problems (Definition 2), and show that lower bounds for this dimension transfer to query complexity bounds (Theorem 2.1). The essential difference from the statistical dimension for learning is that we must additionally bound the average covariances of indicator functions (or, rather, continuous analogues of indicators) on the outputs of functions in $\mathcal{C}$ . The essential claim in our lower bounds is therefore in showing that a typical pair of (indicator functions on outputs of) $s$ -wave functions has small covariance.

So it suffices to show that the expectation of $f(\sum_{i\in S}x_{i})$ doesn’t change much when we condition on the value of $z=\sum_{i\in S\cap T}x_{i}$ .

In particular, conditioning on the value of $z=\sum_{i\in S\cap T}x_{i}$ has little effect on the value of $f(\sum_{i\in S}x_{i})$ . The combination of these observations gives the query complexity lower bound. Precise statements of some of the technical lemmas are given in Section 3, and the complete proof appears in Section 5.

The complexity of learning smooth one-layer networks

We now give a precise definition of the statistical dimension with average correlation for regression problems, extending the concept introduced in .

where $\rho_{D}(f,g)=\operatorname{Cov}_{D}(f,g)/\sqrt{\operatorname{Var}(f)\operatorname{Var}(g)}$ when both $\operatorname{Var}(f)$ and $\operatorname{Var}(g)$ are nonzero, and $\rho_{D}(f,g)=0$ otherwise.

So $\chi_{y}$ is $(1/\epsilon)^{2}$ -Lipschitz, is supported on $(y-\epsilon,y+\epsilon)$ , and has norm $\|\chi_{y}\|_{1}=1$ .

Note that the parameter $\mu(y)$ is independent of the choice of $f\in\mathcal{C}$ . The application of this notion of dimension is given by the following theorem.

A version of the theorem for Boolean functions is proved in . For completeness, we include a proof of the version used in this paper, following ideas in [18, Theorem 2]. The proof is given in Section 6.

As a consequence of Theorem 2.1, there is no need to consider an SQ algorithm’s query strategy in order to obtain lower bounds on its query complexity. Instead, the lower bounds follow directly from properties of the concept class itself, in particular from bounds on average covariances of indicator functions. Theorem 1.1 will therefore follow from Theorem 2.1 by analyzing the statistical dimension of the $s$ -wave functions.

Statistical dimension of one-layer functions

We now present the most general context in which we obtain SQ lower bounds.

In other words, we have statistic dimension bounds (and hence query complexity bounds) for functions that are sufficiently close to periodic. However, the activation units of interest are generally monotonic increasing functions such as sigmoids and ReLUs that are quite far from periodic. Hence, in order to apply Lemma 3.1 in our context, we must show that the activation units of interest can be combined to make nearly periodic functions.

The definition of our hard-to-learn family $f_{m,S}$ is exactly in this form (1.1).

We now describe the properties of the integrable functions $\psi$ that will be used in the proof.

In particular, if $\psi$ is bounded, and monotonic nonincreasing (resp. nondecreasing) for sufficiently large positive (resp. negative) inputs, then $\psi$ satisfies Definition 4. Alternatively, it suffices for $\psi$ to have bounded first derivative.

To complete the proof of Theorem 1.1, we show that we can combine activation units $\psi$ satisfying the above properties in a function which is close to periodic, i.e., which satisfies the hypotheses of Lemma 3.1 above.

We now sketch how Lemmas 3.1 and 3.2 imply Theorem 1.1 for sigmoid units.

Then $\psi$ satisfies the hypotheses of Lemma 3.2.

Similar proofs give corresponding lower bounds for activation functions other than sigmoids. In every case, we reduce to gates satisfying the hypotheses of Lemma 3.2 by constructing an appropriate $L^{1}$ -function $\psi$ as an affine combination of of the activation functions.

For example, let $\sigma(x)=\sigma_{s}(x)=\max\{0,sx\}$ denote the ReLU unit with slope $s$ . Then the affine combination

In the case of softsign functions, this function $\psi$ converges much more slowly to zero as $|x|\to\infty$ compared to sigmoid units. Hence, in order to obtain an adequate quasiperiodic function as an affine combination of $\psi$ -units, a much larger number of $\psi$ -units is needed: the bound on the number $m$ of units in this case is polynomial in the Lipschitz parameter $\lambda$ of the query functions, and a larger polynomial in the input dimension $n$ . The case of other commonly used activation functions, such as ELU (exponential linear) or LReLU (Leaky ReLU), is similar to those discussed above.

Experiments

For a given sharpness parameter $s\in\{0.01,0.02,0.05,0.1,0.2,0.5,1,2\}$ , input dimension $d\in\{50,100,200\}$ and input distribution, we generate the true function according to Eqn. 1.1. There are a total of 50,000 training data points and 1000 test data points. We then learn the true function with fully-connected neural networks of both ReLU and sigmoid activation functions. The best test error is reported among the following different hyper-parameters.

The number of hidden layers we used is 1, 2, and 4. The number of hidden units per layer varies from $4n$ to $8n$ . The training is carried out using SGD with 0.9 momentum, and we enumerate learning rates from 0.1, 0.01 and 0.001 and batch sizes from 64, 128 and 256.

From Theorem 1.1, learning such functions should become difficult as $s\sqrt{n}$ increases over a threshold. In Figure 4.1, we illustrate this phenomenon. Each curve corresponds to a particular input dimension $n$ and each point in the curve corresponds to a particular smoothness parameter $s$ . The x-axis is $s\sqrt{n}$ and the y-axis denotes the test errors. We can see that at roughly $s\sqrt{n}=5$ , the problem becomes hard even empirically.

Complete proof of Theorem 1.1

Since $\phi$ has period $\theta$ , by redefining $z$ we may assume without loss of generality that $0<z<\theta$ , and that $f$ achieves its mode at .

Since the quantity $\int_{I}|\phi(x)|$ is the same for every interval $I$ of length $\theta$ , we may assume without loss of generality that $f$ has its mode at . We compute

By the tail bound for logconcave distributions,

Note that the probability density function $f$ of $D$ satisfies $\|f\|_{\infty}=O(1/\sigma)$ since $D$ is logconcave. Therefore, using the above estimate for both $z^{\prime}=z$ and $z^{\prime}=0$ , and applying Lemmas 5.1 and 5.2, we have

using Eq. (5.1) again for the last estimate. ∎

The following proposition follows from a straightforward application of the probabilistic method.

Let $X=(x_{1},\ldots,x_{n})\sim D$ . Without loss of generality, by shifting the affine functions $g$ we will define, we may assume without loss of generality that each $x_{i}$ has mean zero. For every subset $S$ of $\{1,\ldots,n\}$ let

Let $c>0$ be the absolute constant and $\mathcal{F}$ the family of subsets of $\{1,\ldots n\}$ given by Proposition 5.5. Let $\mathcal{C}=\{\phi\ \circ\zeta_{S}:S\in\mathcal{F}\}$ . We show $\operatorname{\epsilon-SDA}(\mathcal{C},D,\bar{\gamma})\geq\epsilon\theta^{2}2^{\Omega(n)}$ for some $\bar{\gamma}=O(\theta^{2}/n)$ .

This suffices to prove the lemma, as we now show.

First, suppose $h$ is the identity function $h(x)=x$ . Observe that since $\operatorname{Var}_{x\sim U(0,\theta)}(\phi(x))=\Omega(1)$ , for $\theta=O(\sqrt{n})$ we also have $\operatorname{Var}_{X\sim D}(f(X))=\Omega(1)$ for any $f\in\mathcal{C}$ . Then by Eq. (5.2), for any $S,T\in\mathcal{F}$ , we have

Hence, for any subset $\mathcal{C}^{\prime}\subseteq\mathcal{C}$ we have

This quantity is at most $O(\theta^{2}/n)$ , assuming $|\mathcal{C}^{\prime}|\geq n/\theta^{2}$ . In particular, it holds whenever $|\mathcal{C}^{\prime}|\geq|\mathcal{C}|/d$ where $d=(\theta^{2}/n)2^{cn}$ .

even when $|\mathcal{C}^{\prime}_{y}|=|\mathcal{C}_{y}|/d$ for some $d=\epsilon\theta^{2}2^{\Omega(n)}$ . This proves that $\operatorname{\epsilon-SDA}(\mathcal{C},D,\bar{\gamma})=\epsilon\theta^{2}2^{\Omega(n)}$ .

Note that since $h$ has Lipschitz constant $(1/\epsilon)^{2}$ and $\phi$ is $(M,\delta,\theta)$ -quasiperiodic, we have that $h\circ\phi$ is $(M,\delta^{\prime},\theta)$ -quasiperiodic, where $\delta^{\prime}=\delta/\epsilon^{2}=O(\epsilon\theta/\sqrt{n})$ . Let $\hat{\phi}=h\circ\phi$ .

By the tail bound for logconcave distributions and the definition of of $M$ , we have $P_{X}(|\zeta_{S\cap T}(X)|\geq M/2)<O(\epsilon^{\prime})$ for adequate choice of the constant hidden in the definition of $M$ . Therefore,

By Eq. (5.3) and Eq. (5.5) we thus have $|m_{S}-m_{S\cap T}|=O(\epsilon^{\prime})$ . Using Eq. (5.3) again, we have for all $|z|<M/2$ that

Using the fact that $f_{T}(X)$ and $f_{S}(X)$ are independent random variables after conditioning on the value of $\zeta_{S\cap T}(X)$ , we have

Now using a tail bound of the same flavor as in Eq. (5.5), we have

By the periodicity of $F$ , we may assume without loss of generality that $I=[-r,3r]$ . By the monotone convergence theorem, we have

By the definition of $r$ , we therefore have $\int_{-r}^{r}\left|F(x)\right|dx\geq(2/3)\|\psi\|_{1}$ . Similarly

For any $m>0$ , we define the truncated $(\psi,\theta)$ -periodic function

Despite its name, the truncated $\psi$ -periodic function is not in general periodic. Nevertheless, it approximates $F_{\psi,\theta}$ .

By Lemma 5.8, the function $F^{(m)}_{\psi,\theta}$ is $(M,\delta,\theta)$ -quasiperiodic for appropriate choice of $m$ . Hence, by Lemma 5.6, taking $\theta=4r$ , the function $F^{(m)}_{\psi,\theta}$ also has the desired variance. Finally, by Lemma 5.7, we may rescale $F^{(m)}_{\psi,\theta}$ by a constant factor to ensure that its range is in $ $, preserving the variance (up to constant factors) and quasiperiodicity (for appropriate choice of$ m$). ∎

3 Proof of Main Theorem and Corollary

We now give the full proof of Theorem 1.1, sketched previously in Section 3. First, we prove a small lemma that will be useful for the condition number guarantee.

The lemma follows by setting $h$ to be either the identity function or a soft indicator $\chi_{y}^{(\epsilon)}$ , and averaging over all pairs $f,g\in\mathcal{C}$ . ∎

for some $r=\Theta(1/s)$ and therefore this is a bound on the essential radius of $\psi$ . Furthermore, since $\psi$ is monotonic for sufficiently large positive or negative inputs, $\psi$ has the mean bound property. Thus, $\psi$ satisfies the hypotheses of Lemma 3.2.

Note that the functions in $\mathcal{C}$ are represented by single-layer neural networks, the composition of any $f\in\mathcal{C}_{0}$ with $g$ is again an affine function.

It remains to estimate the number $m$ of $\psi$ -units used to represent the functions in $\mathcal{C}$ , which is half the number of $\sigma$ activation units used.

By Lemma 3.2, we may take $m=O(\max\{m_{1},s\sqrt{n}\log(n/(\epsilon\theta))\})$ , where $m_{1}$ satisfies

Therefore, some $m_{1}=O(\log(\lambda sn)/s)$ suffices and this implies that the number of $\psi$ -units used is $m=O(s\sqrt{n}\log(\lambda sn))$ .

Up to this point, the weight matrices of the NNs produced by the direct application of Lemmas 3.2 and 3.1 have rank $1$ . In order to obtain the condition number guarantee of Theorem 1.1, it is therefore necessary to modify the weight matrices, which we accomplish by adding Gaussian noise with variance $1/\operatorname{poly}(\lambda,s,n)$ in each coordinate. We now sketch this analysis.

We conclude with a corollary showing that the Lipschitz assumption on the statistical queries can be omitted when the function outputs are represented with finite precision.

Suppose the family $\mathcal{C}$ of functions from Theorem 1.1 have outputs rounded to some uniformly-spaced finite set $Y\subseteq$ , and let $n$ , $D$ , $s$ , and $t$ be as in the statement of the theorem. Let $A$ be a randomized SQ algorithm learning $\mathcal{C}$ over $D$ to regression error less than $\Omega(1)-\sqrt{1/t}$ with probability at least $1/2$ . Then if $A$ uses arbitrary queries to $\operatorname{VSTAT}(t)$ , it requires at least $2^{\Omega(n)}/(|Y|s^{2})$ queries.

The bound on the number of sigmoid gates used is $O(s\sqrt{n}\log(|Y|sn))$ , similar to the bound in Theorem 1.1. On the other hand, the weight matrices used in the neural networks defining the family $\mathcal{C}$ of Corollary 5 have rank $1$ , assuming they are represented with the same precision as the function outputs.

Query Complexity via Statistical Dimension

We now give the proof of Theorem 2.1, the query complexity bound that follows from a bound on statistical dimension.

Hence, by Markov’s inequality, given a random $g\in\mathcal{C}^{\prime}$ , the probability that $\rho_{D}(f,g)\geq 2\sqrt{\bar{\gamma}}$ is at most $1/2$ . Thus, in order to learn $f$ with regression error less than $1-2\sqrt{\bar{\gamma}}$ with probability greater than $1/2$ , the statistical algorithm must first rule out all but at most $|\mathcal{C}|/d$ of the functions in $\mathcal{C}$ .

We will examine the situation in which the oracle responds to query $h$ with value $v$ . Let $\mathcal{C}^{\prime}\subseteq\mathcal{C}$ , and let $h_{y}(x)=h(x,y)$ . We estimate using Cauchy-Schwarz,

Suppose that $|\mathcal{C}^{\prime}|>|\mathcal{C}|/d$ . Then

Hence, for every subset $\mathcal{C}^{\prime}\subseteq\mathcal{C}$ of size $|\mathcal{C}^{\prime}|\geq|\mathcal{C}|/d$ , we have

It follows that either $|\mathcal{C}^{\prime}|\leq|\mathcal{C}|/d$ , or we have (cf. [9, Lemma 3.5])

whence $t>\Omega(1/\bar{\gamma})$ . Hence, no query strategy can rule out more than $2|\mathcal{C}|/d$ functions per query to $\operatorname{VSTAT}(c/\bar{\gamma})$ for some constant $c$ . Hence, any statistical algorithm using queries to $\operatorname{VSTAT}(c/\bar{\gamma})$ requires at least $\Omega(d)$ queries to learn $\mathcal{C}$ . ∎

From the theorem, we see that a lower bound on the statistical dimension of a distributional problem implies a lower bound on the complexity of any statistical query algorithm for the problem.

The authors are grateful to Vitaly Feldman for discussions about statistical query lower bounds, and for suggestions that simplified the presentation of our results.