Distribution-Specific Hardness of Learning Neural Networks

Ohad Shamir

Introduction

Artificial neural networks have seen a dramatic resurgence in recent years, and have proven to be a highly effective machine learning method in computer vision, natural language processing, and other challenging AI problems. Moreover, successfully training such networks is routinely performed using simple and scalable gradient-based methods, in particular stochastic gradient descent.

Despite this practical success, our theoretical understanding of the computational tractability of such methods is quite limited, with most results being negative. For example, as discussed in , learning even depth-2 networks in a formal PAC learning framework is computationally hard in the worst case, and even if the algorithm is allowed to return arbitrary predictors. As common in such worst-case results, these are proven using rather artificial constructions, quite different than the real-world problems on which neural networks are highly successful. In particular, since the PAC framework focuses on distribution-free learning (where the distribution generating the examples is unknown and rather arbitrary), the hardness results rely on carefully crafted distributions, which allows one to relate the learning problem to (say) an NP-hard problem or breaking a cryptographic system. However, what if we insist on “natural” distributions? Is it possible to show that neural networks learning becomes computationally tractable? Can we show that they can be learned using the standard heuristics employed in practice, such as stochastic gradient descent?

To understand what a “natural” distribution refers to, we need to separate the distribution over examples (given as input-output pairs $(\mathbf{x},y)$ ) into two components:

The input distribution $p(\mathbf{x})$ : “Natural” input distributions on Euclidean space tend to have properties such as smoothness, non-degeneracy, incoherence etc.

The target function $h(\mathbf{x})$ : In PAC learning, it is assumed that the output $y$ equals $h(\mathbf{x})$ , where $h$ is some unkown target function from the hypothesis class we are considering. In studying neural networks, it is common to consider the class of all networks which share some fixed architecture (e.g. feedforward networks of a given depth and width). However, one may argue that the parameters of real-world networks (e.g. the weights of each neuron) are not arbitrary, but exhibit various features such as non-degeneracy or some “random like” appearance. Indeed, networks with a random structure have been shown to be more amenable to analysis in various situations (see for instance and references therein).

Empirical evidence clearly suggest that many pairs of input distributions and target functions are computationally tractable, using standard methods. However, how do we characterize these pairs? Would appropriate assumptions on one of them be sufficient to show learnability?

In this paper, we investigate these two components, and provide evidence that neither one of them alone is enough to guarantee computationally tractable learning, at least with methods resembling those used in practice. Specifically, we focus on simple, shallow ReLU networks, assume that the data can be perfectly predicted by some such network, and even allow over-specification (a.k.a. improper learning), in the sense that we allow the learning algorithm to output a predictor which is possibly larger and more complex than the target function (this technique increases the power of the learner, and was shown to make the learning problem easier in theory and in practice ). Even under such favorable conditions, we show the following:

Hardness for “natural” target functions. For each individual target function coming from a simple class of small, shallow ReLU networks (even if its parameters are chosen randomly or in some other oblivious way), we show that no algorithm invariant to linear transformations can successfully learn it w.r.t. all input distributions in polynomial time (this corresponds, for instance, to standard gradient-based methods together with data whitening or preconditioning). This result is based on a reduction from learning intersections of halfspaces. Although that problem is known to be hard in the worst-case over both input distributions and target functions, we essentially show that invariant algorithms as above do not “distinguish” between worst-case and average-case: If one can learn a particular target function with such an algorithm, then the algorithm can learn nearly all target functions in that class.

Hardness for “natural” input distributions. We show that target functions of the form $\mathbf{x}\mapsto\psi(\left\langle\mathbf{w},\mathbf{x}\right\rangle)$ for any periodic $\psi$ are generally difficult to learn using gradient-based methods, even if the input distribution is fixed and belongs to a very broad class of smooth input distributions (including, for instance, Gaussians and mixtures of Gaussians). Note that such functions can be constructed by simple shallow networks, and can be seen as an extension of generalized linear models . Unlike the previous result, which relies on a computational hardness assumption, the results here are geometric in nature, and imply that the gradient of the objective function, nearly everywhere, contains virtually no signal on the underlying target function. Therefore, any algorithm which relies on gradient information cannot learn such functions. Interestingly, the difficulty here is not in having a plethora of spurious local minima or saddle points – the associated stochastic optimization problem may actually have no such critical points. Instead, the objective function may exhibit properties such as flatness nearly everywhere, unless one is already very close to the global optimum. This highlights a potential pitfall in non-convex learning, which occurs already for a slight extension of generalized linear models, and even for “nice” input distributions.

Together, these results indicate that in order to explain the practical success of neural network learning with gradient-based methods, one would need to employ a careful combination of assumptions on both the input distribution and the target function, and that results with even a “partially” distribution-free flavor (which are common, for instance, in convex learning problems) may be difficult to attain here.

To prove our results, we develop some tools which may be of independent interest. In particular, the techniques used to prove hardness of learning functions of the form $\mathbf{x}\mapsto\psi(\left\langle\mathbf{w},\mathbf{x}\right\rangle)$ are based on Fourier analysis, and have some close connections to hardness results on learning parities in the well-known framework of learning from statistical queries : In both cases, one essentially shows that the Fourier transform of the target function has very small support, and hence does not “correlate” with most functions, making it difficult to learn using certain methods. However, we consider a more general and arguably more natural class of input distributions over Euclidean space, rather than distributions on the Boolean cube. In a sense, we show that learning general periodic functions over Euclidean space is difficult (at least with gradient-based methods), for the same reasons that learning parities over the Boolean cube is difficult in the statistical queries framework.

Recent years have seen quite a few papers on the theory of neural network learning. Below, we only briefly mention those most relevant to our paper.

In a very elegant work, Janzamin et al. have shown that a certain method based on tensor decompositions allows one to provably learn simple neural networks by a combination of assumptions on the input distribution and the target function. However, a drawback of their method is that it requires rather precise knowledge of the input distribution and its derivatives, which is rarely available in practice. In contrast, our focus is on algorithms which do not utilize such knowledge. Other works which show computationally-efficient learnability of certain neural networks under sufficiently strong distributional assumptions include .

In the context of learning functions over the Boolean cube, it is known that even if we restrict ourself to a particular input distribution (as long as it satisfies some mild conditions), it is difficult to learn parity functions using statistical query algorithms . Moreover, it was recently shown that stochastic gradient descent methods can be approximately posed as such algorithms . Since parities can be implemented with small real-valued networks, this implies that for “most” input distributions on the Boolean cube, there are neural networks which are unlikely to be learnable with gradient-based methods. However, data provided to neural networks in practice are not in the form of Boolean vectors, but rather vectors of floating-point numbers. Moreover, some assumptions on the input distribution, such as smoothness and Gaussianity, only make sense once we consider the support to be Euclidean space rather than Boolean cube. Perhaps these are enough to guarantee computational tractability? A contribution of this paper is to show that this is not the case, and to formally demonstrate how phenomena similar to the Boolean case also occurs in Euclidean space, using appropriate target functions and distributions.

Finally, we note that provides improper-learning hardness results, which hold even for a standard Gaussian distribution on Euclidean space, and for any algorithm. However, unlike our paper, their focus is on hardness of agnostic learning (where the target function is arbitrary and does not have to correspond to a given class), the results are specific to the standard Gaussian distribution, and the proofs are based on a reduction from the Boolean case.

The paper is structured as follows: In Sec. 2, we formally present some notation and concepts used throughout the paper. In Sec. 4, we provide our hardness results for natural input distributions, and in Sec. 3, we provide our hardness results for natural target functions. All proofs are presented in Sec. 5.

Preliminaries

We generally let bold-faced letters denote vectors. Given a complex-valued number $z=a+ib$ , we let $\overline{z}=a-ib$ denote its complex conjugate, and $|z|=\sqrt{a^{2}+b^{2}}$ denote its modulus. Given a function $f$ , we let $\nabla f$ denote its gradient and $\nabla^{2}f$ denote its Hessian (assuming they exist).

Neural Networks. The focus of our results will be on learning predictors which can be described by simple and shallow (depth 2 or 3) neural networks. A standard feedforward neural network is composed of neurons, each of which computes the mapping $\mathbf{x}\mapsto\sigma(\left\langle\mathbf{w},\mathbf{x}\right\rangle+b)$ , where $\mathbf{w},b$ are parameters and $\sigma$ is a scalar activation function, for example the popular ReLU function $[z]_{+}=\max\{0,z\}$ . These neurons are arranged in parallel in layers, so the output of each layer can be compactly represented as $\mathbf{x}\mapsto\sigma(W^{\top}\mathbf{x}+\mathbf{b})$ , where $W$ is a matrix (each column corresponding to the parameter vector of one of the neurons), $\mathbf{b}$ is a vector, and $\sigma$ applies an activation function on the coordinates of $W^{\top}\mathbf{x}$ . In vanilla feedforward networks, such layers are connected to each other, so given an input $\mathbf{x}$ , the output equals

PAC Learning. For the results of Sec. 3, we will rely on the following standard definition of PAC learning with respect to Boolean functions: Given a hypothesis class $\mathcal{H}$ of functions from $\{0,1\}^{d}$ to $\{0,1\}$ , we say that a learning algorithm PAC-learns $\mathcal{H}$ if for any $\epsilon\in(0,1)$ , any distribution $\mathcal{D}$ over $\{0,1\}^{d}$ , and any $h^{\star}\in\mathcal{H}$ , if the algorithm is given oracle access to i.i.d. samples $(\mathbf{x},h^{\star}(\mathbf{x}))$ where $\mathbf{x}$ is sampled according to $\mathcal{D}$ , then in time $\text{poly}(d,1/\epsilon)$ , the algorithm returns a function $f:\{0,1\}^{d}\mapsto\{0,1\}$ such that $\Pr_{\mathbf{x}\sim\mathcal{D}}(f(\mathbf{x})\neq h^{\star}(\mathbf{x}))\leq\epsilon$ with high probability (for our purposes, it will be enough to consider any constant close to $1$ ). Note that in the definition above, we allow $f$ not to belong to the hypothesis class $\mathcal{H}$ . This is often denoted as “improper” learning, and allows the learning algorithm more power than in “proper” learning, where $f$ must be a member of $\mathcal{H}$ .

where $\exp(iz)=\cos(z)+i\cdot\sin(z)$ , $i$ being the imaginary unit. In the proofs, we will use the following well-known properties of the Fourier transform:

Linearity: For scalars $a,b$ and functions $f,g$ , $\widehat{af+bg}=a\hat{f}+b\hat{g}$ .

Isometry: $\left\langle f,g\right\rangle=\langle\hat{f},\hat{g}\rangle$ and $\left\|f\right\|=\|\hat{f}\|$ .

Convolution: $\widehat{fg}=\hat{f}*\hat{g}$ , where $*$ denotes the convolution operation: $(f*g)(\mathbf{w})=\int f(\mathbf{z})\cdot g(\mathbf{w}-\mathbf{z})~{}d\mathbf{z}$ .

Natural Target Functions

In this section, we consider simple target functions of the form $\mathbf{x}\mapsto\left[\sum_{i=1}^{n}[\left\langle\mathbf{w}_{i},\mathbf{x}\right\rangle]_{+}\right]_{}$ , where $[z]_{+}=\max\{0,z\}$ is the ReLU function, and $[z]_{}=\min\{1,\max\{0,z\}\}$ is the clipping operation on the interval $ $. This corresponds to depth-2 networks with no bias in the first layer, and where the outputs of the first layer are simply summed and moved through a clipping non-linearity (this operation can also be easily implemented using a second layer composed of two ReLU neurons). Letting$ W=[\mathbf{w}_{1},\ldots,\mathbf{w}_{n}] $, we can write such predictors as$ \mathbf{x}\mapsto h(W^{\top}\mathbf{x}) $for an appropriate fixed function$ h $. Our goal would be to show that for such a target function, with virtually any choice of$ W$ (essentially, as long as its columns are linearly independent), and any polynomial-time learning algorithm satisfying some conditions, there exists an input distribution on which it must fail.

As the careful reader may have noticed, it is impossible to provide such a target-function-specific result which holds for any algorithm. Indeed, if we fix the target function in advance, we can always “learn” by returning the target function, regardless of the training data. Thus, imposing some constraints on the algorithm is necessary. Specifically, we will consider algorithms which exhibit certain natural invariances to the coordinate system used. One very natural invariance is with respect to orthogonal transformations: For example, if we rotate the input instances $\mathbf{x}_{i}$ in a fixed manner, then an orthogonally-invariant algorithm will return a predictor which still makes the same predictions on those instances. Formally, this invariance is defined as follows:

The definition as stated refers to deterministic algorithms. For stochastic algorithms, we will understand orthogonal invariance to mean orthogonal invariance conditioned on any realization of the algorithm’s random coin flips.

For example, standard gradient and stochastic gradient descent methods (possibly with coordinate-oblivious regularization, such as $L_{2}$ regularization) can be easily shown to be orthogonally-invariantEssentially, this is because the gradient of any function $g(W^{\top}\mathbf{x})=g(\left\langle\mathbf{w}_{1},\mathbf{x}\right\rangle,\ldots,\left\langle\mathbf{w}_{k},\mathbf{x}\right\rangle)$ w.r.t. any $\mathbf{w}_{i}$ is proportional to $\mathbf{x}$ . Thus, if we multiply $\mathbf{x}$ by an orthogonal $M$ , the gradient also gets multiplied by $M$ . Since $M^{\top}M=I$ , the inner products of instances $\mathbf{x}$ and gradients remain the same. Therefore, by induction, it can be shown that any algorithm which operates by incrementally updating some iterate by linear combinations of gradients will be rotationally invariant.. However, for our results we will need to make a somewhat stronger invariance assumption, namely invariance to general invertible linear transformations of the data (not necessarily just orthogonal). This is formally defined as follows:

One well-known example of such an algorithm (which is also invariant to affine transformations) is the Newton method . More relevant to our purposes, linear invariance occurs whenever an orthogonally-invariant algorithm preconditions or “whitens” the data so that its covariance has a fixed structure (e.g. the identity matrix, possibly after a dimensionality reduction if the data is rank-deficient). For example, even though gradient descent methods are not linearly invariant, they become so if we precede them by such a preconditioning step. This is formalized in the following theorem:

Let $\mathcal{A}$ be any algorithm which given $\{\mathbf{x}_{i},y_{i}\}_{i=1}^{m}$ , computes the whitening matrix $P=D^{-1}U^{\top}$ (where $X=[\mathbf{x}_{1}~{}\mathbf{x}_{2}~{}\ldots~{}\mathbf{x}_{m}]$ , $X=UDV^{\top}$ is a thinThat is, if $X$ is of size $d\times m$ , then $U$ is of size $d\times\text{Rank}(X)$ , $D$ is of size $\text{Rank}(X)\times\text{Rank}(X)$ , and $V$ is of size $m\times\text{Rank}(X)$ . SVD decomposition of $X$ ), feeds $\{P\mathbf{x}_{i},y_{i}\}_{i=1}^{m}$ to an orthogonally-invariant algorithm, and given the output predictor $\mathbf{x}\mapsto f(W^{\top}\mathbf{x})$ , returns the predictor $\mathbf{x}\mapsto f((P^{\top}W)^{\top}\mathbf{x})$ . Then $\mathcal{A}$ is linearly-invariant.

It is easily verified that the covariance matrix of the transformed instances $P\mathbf{x}_{1},\ldots,P\mathbf{x}_{m}$ is the $r\times r$ identity matrix (where $r=\text{Rank}(X)$ ), so this is indeed a whitening transform. We note that whitening is a very common preprocessing heuristic, and even when not done explicitly, scalable approximate whitening and preconditioning methods (such as Adagrad and batch normalization ) are very common and widely recognized as useful for training neural networks.

(where we let $1$ correspond to ‘true’ and to ‘false’). The problem of PAC-learning intersections of halfspaces over the Boolean cube ( $\mathbf{x}\in\{0,1\}^{d}$ ) has been well-studied. In particular, two known hardness results are the following:

Klivans and Sherstov show that under a certain well-studied cryptographic assumption (hardness of finding unique shortest vectors in a high-dimensional lattice), no algorithm can PAC-learn intersection of $n_{d}=d^{\delta}$ halfspaces (where $\delta$ is any positive constant), even if the coordinates of $\mathbf{w}_{i}$ and $b_{i}$ are all integers, and $\max_{i}\left\|(\mathbf{w}_{i},b_{i})\right\|\leq\text{poly}(d)$ .

Daniely and Shalev-Shwartz show that under an assumption related to the hardness of refuting random K-SAT formulas, no algorithm can PAC-learn intersections of $n_{d}=\omega(1)$ halfspaces (as $d\rightarrow\infty$ ), even if the coordinates of $\mathbf{w}_{i}$ and $b_{i}$ are all integers, and $\max_{i}\left\|(\mathbf{w}_{i},b_{i})\right\|\leq\mathcal{O}(d)$ .

In the theorem below, we will use the result of , which applies to an intersection of a smaller number of halfspaces, and with smaller norms. However, similar results can be shown using , at the cost of worse polynomial dependencies on $d$ .

The main result of this section is the following:

Consider any network $h(W_{\star}^{\top}\mathbf{x})=\left[\sum_{i=1}^{n_{d}}[\left\langle\mathbf{w}^{\star}_{i},\mathbf{x}\right\rangle]_{+}\right]_{}$ (where $W_{\star}=[\mathbf{w}^{\star}_{1},\ldots,\mathbf{w}^{\star}_{n}]$ ), which satisfies the following:

$n_{d}\geq\omega(1)$ as $d\rightarrow\infty$

$\max_{i}\left\|\mathbf{w}^{\star}_{i}\right\|\leq\mathcal{O}(d)$

$\mathbf{w}_{1}^{\star}\ldots\mathbf{w}_{n}^{\star}$ are linearly independent, so the smallest singular value $s_{\min}(W_{\star})$ of $W_{\star}$ is strictly positive.

Then under the assumption stated in , there is no linearly-invariant algorithm which for any $\epsilon>0$ and any distribution $\mathcal{D}$ over vectors of norm at most $\frac{\mathcal{O}(d\sqrt{dn_{d}})}{\min\{1,s_{\min}(W_{\star})\}}$ , given only access to samples $(\mathbf{x},h(W_{\star}^{\top}\mathbf{x}))$ where $\mathbf{x}\sim\mathcal{D}$ , runs in time $\text{poly}(d,1/\epsilon)$ and returns with high probability a predictor $\mathbf{x}\mapsto f(W^{\top}\mathbf{x})$ such that

Note that the result holds even if the returned predictor $f(W^{\top}\mathbf{x})$ has a different structure than $h_{W_{\star}}(\cdot)$ , and $W$ is of a larger size than $W_{\star}$ . Thus, it applies even if the algorithm is allowed to train a larger network or more complicated predictor than $h_{W_{\star}}(\cdot)$ .

Since the algorithm is linearly-invariant, it can be shown that this implies successful learning from $(\mathbf{x},h(W^{\top}\mathbf{x}))$ where $\mathbf{x}\sim\mathcal{D}$ , as required.

In the sketch above, we have ignored some technical issues. For example, we need to be careful that $M$ has a bounded spectral norm, so that it induces a linear transformation which does not distort norms by too much (as all our arguments apply for input distributions supported on a bounded domain). A second issue is that if we apply a linearly-invariant algorithm on a dataset transformed by $M$ , then the invariance is only with respect to the data, not necessarily with respect to new instances $\mathbf{x}$ sampled from the same distribution (and this restriction is necessary for results such as Thm. 1 to hold without further assumptions). However, it can be shown that if the dataset is large enough, invariance will still occur with high probability over the sampling of $\mathbf{x}$ , which is sufficient for our purposes.

Natural Input Distributions

Our formal results rely on Fourier analysis and are a bit technical. Hence, we precede them with an informal description, outlining the main ideas and techniques, and presenting a specific case study which may be of independent interest (Subsection 4.1). The formal results are presented in Subsection 4.2.

Consider a target function of the form $\mathbf{x}\mapsto\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle)$ , and any input distribution whose density function can be written as the square $\varphi^{2}$ of some function $\varphi$ (the reason for this will become apparent shortly). Suppose we attempt to learn this target function (with respect to the squared loss) using some hypothesis class, which can be parameterized by a bounded-norm vector $\mathbf{w}$ in some subset $\mathcal{W}$ of an Euclidean space (not necessarily of the same dimensionality as $\mathbf{w}^{\star}$ ), so each predictor in the class can be written as $\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x})$ for some fixed mapping $f$ . Thus, our goal is essentially to solve the stochastic optimization problem

In this section, we study the geometry of this objective function, and show that under mild conditions on $f$ , and assuming the norm of $\mathbf{w}^{\star}$ is reasonably large, the following holds:

For any fixed $\mathbf{w}$ , the value of the objective function is almost independent of $\mathbf{w}^{\star}$ , in the sense that if we pick the direction of $\mathbf{w}^{\star}$ uniformly at random, the value is extremely concentrated around a fixed value independent of $\mathbf{w}^{\star}$ (e.g. exponentially small in $\left\|\mathbf{w}^{\star}\right\|^{2}$ for a Gaussian or a mixture of Gaussians).

Similarly, the gradient of the objective function with respect to $\mathbf{w}$ is almost independent of $\mathbf{w}^{\star}$ , and is extremely concentrated around a fixed value (again, exponentially small in $\left\|\mathbf{w}^{\star}\right\|^{2}$ for, say, a mixture of Gaussians).

Therefore, assuming $\left\|\mathbf{w}^{\star}\right\|$ is reasonably large, any standard gradient-based method will follow a trajectory nearly independent of $\mathbf{w}^{\star}$ . In fact, in practice we do not even have access to exact gradients of Eq. (2), but only to noisy and biased versions of it (e.g. if we perform stochastic gradient descent, and certainly if we use finite-precision computations). In that case, the noise will completely obliterate the exponentially small signal about $\mathbf{w}^{\star}$ in the gradients, and will make the trajectory essentially independent of $\mathbf{w}^{\star}$ . As a result, assuming $\psi$ and the distribution is such that the function $\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle)$ is sensitive to the direction of $\mathbf{w}^{\star}$ , it follows that these methods will fail to optimize Eq. (2) successfully. Finally, we note that in practice, it is common to solve not Eq. (2) directly, but rather its empirical approximation with respect to some fixed finite training set. Still, by concentration of measure, this empirical objective would converge to the one in Eq. (2) given enough data, so the same issues will occur.

An important feature of our results is that they make virtually no structural assumptions on the predictors $\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x})$ . In particular, they can represent arbitrary classes of neural networks (as well as other predictor classes). Thus, our results imply that target functions of the form $\mathbf{x}\mapsto\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle)$ , where $\psi$ is periodic, would be difficult to learn using gradient-based methods, even if we allow improper learning and consider predictor classes of a different structure.

To explain how such results are attained, let us study a concrete special case (not necessarily in the context of neural networks). Consider the target function $\mathbf{x}\mapsto\cos(2\pi\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle)$ , and the hypothesis class (parameterized by $\mathbf{w}$ ) of functions $\mathbf{x}\mapsto\cos(2\pi\left\langle\mathbf{w},\mathbf{x}\right\rangle)$ . Thus, Eq. (2) takes the form

We now turn to explain why Eq. (3) has the form shown in Figure 2. This will also help to illustrate our proof techniques, which apply much more generally. The main idea is to analyze the Fourier transform of Eq. (3). Letting $\cos_{\mathbf{w}}$ denote the function $\mathbf{x}\mapsto\cos(2\pi\left\langle\mathbf{w},\mathbf{x}\right\rangle)$ , we can write Eq. (3) as

$\widehat{\cos_{\mathbf{w}}}(\boldsymbol{\xi})$ can be shown to equal $\frac{1}{2}\left(\delta(\boldsymbol{\xi}-\mathbf{w})+\delta(\boldsymbol{\xi}+\mathbf{w})\right)$ , where $\delta(\cdot)$ is Dirac’s delta function (a “generalized” function which satisfies $\delta(\mathbf{z})=0$ for all $\mathbf{z}\neq\mathbf{0}$ , and $\int\delta(\mathbf{z})d\mathbf{z}=1$ ). Plugging this into the above and simplifying, we get

If $\varphi^{2}$ is a standard Gaussian, $\hat{\varphi}(\boldsymbol{\xi})$ can be shown to equal the Gaussian-like function $(4\pi)^{d/2}a^{-\left\|\boldsymbol{\xi}\right\|^{2}}$ where $a=\exp(4\pi^{2})$ . Plugging back, the expression above is proportional to

The expression in each inner parenthesis can be viewed as a mixture of two Gaussian-like functions, with centers at $\mathbf{w},-\mathbf{w}$ (or $\mathbf{w}^{\star},-\mathbf{w}^{\star}$ ). Thus, if $\mathbf{w}$ is far from $\mathbf{w}^{\star}$ , these two mixtures will have nearly disjoint support, and Eq. (5) will have nearly the same value regardless of $\mathbf{w}$ – in other words, it is very flat. Since this equation is nothing more than a re-formulation of the original objective function in Eq. (3) (up to a constant), we get a similar behavior for Eq. (3) as well.

This behavior extends, however, much more generally than the specific objective in Eq. (3). First of all, we can replace the standard Gaussian distribution $\varphi^{2}$ by any distribution such that $\hat{\varphi}$ has a localized support. This would still imply that Eq. (4) refers to the difference of two functions with nearly disjoint support, and the same flatness phenomenon will occur. Second, we can replace the $\cos$ function by any periodic function $\psi$ . By properties of the Fourier transform of periodic functions, we still get localized functions in the Fourier domain (more precisely, the Fourier transform will be localized around integer multiples of $\mathbf{w}$ , up to scaling). Finally, instead of considering hypothesis classes of predictors $\mathbf{x}\mapsto\psi(\left\langle\mathbf{w},\mathbf{x}\right\rangle)$ similar to the target function, we can consider quite arbitrary mappings $\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x})$ . Even though this function may no longer be localized in the Fourier domain, it is enough that only the target function $\mathbf{x}\mapsto\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle)$ will be localized: That implies that regardless how $f$ looks like, under a random choice of $\mathbf{w}^{\star}$ , only a minuscule portion of the $L_{2}$ mass of $f$ overlaps with the target function, hence getting sufficient signal on $\mathbf{w}^{\star}$ will be difficult.

As mentioned in the introduction, these techniques and observations have some close resemblance to hardness results for learning parities over the Boolean cube in the statistical queries learning model . There as well, one considers a Fourier transform (but on the Boolean cube rather than Euclidean space), and essentially show that functions with a “localized” Fourier transform are difficult to “detect” using any fixed function. However, our results are different and more general, in the sense that they apply to generic smooth distributions over Euclidean space, and to a general class of periodic functions, rather than just parities. On the flip side, our results are constrained to methods which are based on gradients of the objective, whereas the statistical queries framework is more general and considers algorithms which are based on computing (approximate) expectations of arbitrary functions of the data. Extending our results to this generality is an interesting topic for future research.

2 Formal Results

We now turn to provide a more formal statement of our results. The distributions we will consider consist of arbitrary mixtures of densities, whose square roots have rapidly decaying tails in the Fourier domain. More precisely, we have the following definition:

where $\mathbf{1}_{\geq r}$ is the indicator function of $\{\mathbf{x}:\left\|\mathbf{x}\right\|\geq r\}$ .

A canonical example is Gaussian distributions: Given a (non-degenerate, zero-mean) Gaussian density function $\varphi^{2}$ with covariance matrix $\Sigma$ , its square root $\varphi$ is proportional to a Gaussian with covariance $2\Sigma$ , and its Fourier transform $\hat{\varphi}$ is well-known to be proportional to a Gaussian with covariance $(2\Sigma)^{-1}$ . By standard Gaussian concentration results, it follows that $\varphi^{2}$ is Fourier-concentrated with $\epsilon(r)=\exp(-\Omega(\lambda_{\min}r^{2}))$ where $\lambda_{\min}$ is the minimal eigenvalue of $\Sigma$ . A similar bound can be shown when the Gaussian has some arbitrary mean. More generally, it is well-known that smooth functions (differentiable to sufficiently high order with integrable derivatives) have Fourier transforms with rapidly decaying tails. For example, if we consider the broad class of Schwartz functions (characterized by having values and all derivatives decaying faster than polynomially in $r$ ), then the Fourier transform of any such function is also a Schwartz function, which implies super-polynomial decay of $\epsilon(r)$ (see for instance , Chapter 11 and Proposition 11.25).

We now formally state our main result for this section. We consider any predictor of the form $\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x})$ , where $f$ is some fixed function and $\mathbf{w}$ is a parameter vector coming from some domain $\mathcal{W}$ , which we will assume w.l.o.g. to be a subset of some Euclidean spaceMore generally, our analysis is applicable to any separable Hilbert space. (for example, $f$ can represent a network of a given architecture, with weights specified by $\mathbf{w}$ ). When learning $f$ based on data coming from an underlying distribution, we are essentially attempting to solve the optimization problem

Assume that $F$ is differentiable w.r.t. $\mathbf{w}$ , any gradient-based method to solve this problem proceeds by computing (or approximating) $\nabla F_{\mathbf{w}^{\star}}(\mathbf{w})$ at various points $\mathbf{w}$ . However, the following theorem shows that at any $\mathbf{w}$ , and regardless of the type of predictor or network one is attempting to train, the gradient at $\mathbf{w}$ is virtually independent of the underlying target function, and hence provides very little signal:

We note that bounded variation is weaker than, say, Lipschitz continuity. Assuming $\epsilon(r)$ decays rapidly with $r$ – say, exponentially in $r^{2}$ as is the case for a Gaussian mixture – we get that the bound in the theorem is on the order of $\exp(-\Omega(\min\{d,r^{2}\}))$ .

Overall, the theorem implies that if $r,d$ are moderately large, the gradient of $F_{\mathbf{w}^{\star}}$ at any point $\mathbf{w}$ is extremely concentrated around a fixed value, independent of $\mathbf{w}^{\star}$ . This implies that gradient-based methods, which attempt to optimize $F_{\mathbf{w}^{\star}}$ via gradient information, are unlikely to succeed. One way to formalize this is to consider any iterative algorithm (possibly randomized), which relies on an $\varepsilon$ -approximate gradient oracle to optimize $F_{\mathbf{w}^{\star}}$ : At every iteration $t$ , the algorithm chooses a point $\mathbf{w}_{t}\in\mathcal{W}$ , and receives a vector $\mathbf{g}_{t}$ such that $|\nabla F_{\mathbf{w}^{\star}}-\mathbf{g}_{t}|\leq\varepsilon$ . In our case, we will be interested in $\varepsilon$ such that $\varepsilon^{3}$ is on the order of the bound in Thm. 3. Since the bound is extremely small for moderate $d,r$ (say, smaller than machine precision), this is a realistic model of gradient-based methods on finite-precision machines, even if one attempts to compute the gradients accurately. The following theorem implies that if the number of iterations is not extremely large (on the order of $1/\varepsilon$ , e.g. $\exp(\Omega(d,r^{2}))$ iterations for Gaussian mixtures), then with high probability, a gradient-based method will return the same predictor independent of $\mathbf{w}^{\star}$ . However, since the objective function $F_{\mathbf{w}^{\star}}$ is highly sensitive to the choice of $\mathbf{w}^{\star}$ , this means that no such gradient-based method can train a reasonable predictor.

Assume the conditions of Thm. 3, and let $\varepsilon=\sqrt{c_{2}(\sup_{\mathbf{w}\in\mathcal{W}}G_{\mathbf{w}})\left(\exp(-c_{3}d)+\sum_{n=1}^{\infty}\epsilon(nr)\right)}$ be the cube root of the bound specified there (uniformly over all $\mathbf{w}\in\mathcal{W}$ ). Then for any algorithm as above and any $p\in(0,1)$ , conditioned on an event which holds with probability $1-p$ over the choice of $\mathbf{w}^{\star}$ , its output after at most $p/\varepsilon$ iterations will be independent of $\mathbf{w}^{\star}$ .

Proofs

Let $P_{M}$ denote the whitening matrix employed if we transform the instances $X$ by some invertible $d\times d$ matrix $M$ (that is, $X$ becomes $MX$ ), and $P$ the whitening matrix employed for the original data.

Using the same notation as in the theorem, it is easily verified that $PX=V^{\top}$ , and $P_{M}MX=V_{M}^{\top}$ , where $U_{M}D_{M}V_{M}^{\top}$ is an SVD decomposition of the matrix $MX$ . Since both $V^{\top}$ and $V_{M}^{\top}$ are $\text{Rank}(X)\times m$ matrices with rows consisting of orthonormal vectors, they are related by an orthogonal transformation (i.e. there is an orthogonal matrix $R_{M}$ such that $R_{M}V^{\top}=V_{M}^{\top}$ ). Therefore, $R_{M}PX=P_{M}MX$ . Since the data is fed to an orthogonally-invariant algorithm, its output $W_{M}$ satisfies $W_{M}^{\top}P_{M}MX=W^{\top}PX$ . This in turn implies $W_{M}^{\top}R_{M}PX=W^{\top}PX$ , and hence $W_{M}^{\top}R_{M}V^{\top}=W^{\top}V^{\top}$ . Multiplying both sides on the right by $V$ and taking a transpose, we get that $R_{M}^{\top}W_{M}=W$ , and hence $W_{M}=R_{M}W$ . In words, $W$ and $W_{M}$ are the same up to an orthogonal transformation $R_{M}$ depending on $M$ . Therefore,

so we see that the returned predictor makes the same predictions over the dataset, independent of the transformation matrix $M$ .

2 Proof of Thm. 2

We start with the following auxiliary theorem, which reduces the hardness result of to one about neural networks of the type we discuss here:

Under the assumption stated in , the following holds for any $n_{d}=\omega(1)$ (as $d\rightarrow\infty$ ):

There is no algorithm running in time $\text{poly}(d,1/\epsilon)$ , which for any distribution $\mathcal{D}$ on $\{0,1\}^{d}$ , and any $h(W^{\top}\mathbf{x})=\sigma(\sum_{i=1}^{n_{d}}[\left\langle\mathbf{w}_{i},\mathbf{x}\right\rangle]_{+})$ (where $W=[\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{n_{d}}]$ and $\max_{i}\left\|\mathbf{w}_{i}\right\|\leq\mathcal{O}(d)$ ), given only access to samples $(\mathbf{x},h(W^{\top}\mathbf{x}))$ where $\mathbf{x}\sim\mathcal{D}$ , returns with high probability a function $f$ such that

Thm. 5 holds even if we restrict $\mathbf{w}_{1},\ldots,\mathbf{w}_{n_{d}}$ to be linearly independent, with $s_{\min}(W)\geq 1$ .

Suppose by contradiction that there exists an algorithm $\mathcal{A}$ which succeeds for any $W$ as stated above. We will describe how to use $\mathcal{A}$ to get an algorithm which succeeds for any $W$ as described in Thm. 5, hence reaching a contradiction.

This contradicts Thm. 5, which states that no efficient algorithm can return such a predictor for any sufficiently large dimension $d$ and norm bound $\mathcal{O}(d)$ . ∎

It is enough to prove that with probability at least $1-\delta$ over the sampling of $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ ,

This is because the event $W_{M}^{\top}M\mathbf{x}_{i}=W^{\top}\mathbf{x}_{i}$ for all $i$ means that $W_{M}^{\top}M\mathbf{x}=W^{\top}\mathbf{x}$ for any $\mathbf{x}$ in the span of $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ .

Let $\mathbf{x}_{1},\ldots,\mathbf{x}_{m+1}$ be sampled i.i.d. according to $\mathcal{D}$ . Considering probabilities over this sample, we have

where the latter inequality is because each $\mathbf{x}_{j}$ is a $d$ -dimensional vector, hence the number of times we can get a vector not in the span of the previous ones is at most $d$ . Moreover, since the vectors are sampled i.i.d, we have

so the probabilities in Eq. (7) monotonically decrease with $j$ . Thus, Eq. (7) implies

so by Markov’s inequality, with probability at least $1-\delta$ over the sampling of $\mathbf{x}_{1},\ldots,\mathbf{x}_{m}$ ,

Since $\mathbf{x}_{m+1}$ is sampled independently, Eq. (6) and hence the lemma follows. ∎

We will show that the very same algorithm, if given $\text{poly}(d,1/\epsilon)$ samples, can successfully learn w.r.t. any $d\times n_{d}$ matrix $W$ and any distribution $\mathcal{D}$ satisfying Proposition 1 and Thm. 5, contradicting those results.

Indeed, let $W$ and $\mathcal{D}$ be an arbitrary matrix and distribution as above. We first argue that there exists a $d\times d$ invertible matrix $M$ such that

To see this, note that $W$ and $W_{\star}$ are of the same size and our conditions imply that both of them have full column rank. Thus, we can simply augment them to invertible $d\times d$ matrices $[W~{}\hat{W}]$ and $[W_{\star}~{}\hat{W}_{\star}]$ , where the columns of $\hat{W}$ (respectively $\hat{W}_{\star}$ ) are an orthonormal basis for the subspace orthogonal to the column space of $W$ (respectively $W_{\star}$ ), and choosing $M^{\top}=[W~{}\hat{W}][W_{\star}~{}\hat{W}_{\star}]^{-1}$ . Thus, $\left\|M\right\|\leq\left\|[W~{}\hat{W}]\right\|\cdot\left\|[W_{\star}~{}\hat{W}_{\star}]^{-1}\right\|$ . The spectral norm of $[W~{}\hat{W}]$ can be bounded by the Frobenius norm, which by the assumption on $W$ from Thm. 5 and the fact that $\hat{W}$ consist of an orthonormal basis, is at most $\sqrt{\mathcal{O}(d)^{2}\cdot n_{d}+1\cdot(d-n_{d})}=\mathcal{O}(\sqrt{d^{2}n_{d}})=\mathcal{O}(d\sqrt{n_{d}})$ . The spectral norm of $[W_{\star}~{}\hat{W}_{\star}]^{-1}$ can be bounded by $1/s_{\min}([W_{\star}~{}\hat{W}_{\star}])$ , where $s_{\min}([W_{\star}~{}\hat{W}_{\star}])$ equals the square root of the smallest eigenvalue of $[W_{\star}~{}\hat{W}_{\star}]^{\top}[W_{\star}~{}\hat{W}_{\star}]$ , which can be easily verified to be $\min\{1,s_{\min}(W_{\star})\}$ .

By Eq. (8), $W_{\star}^{\top}M=(M^{\top}W_{\star})^{\top}=W^{\top}$ , so this is equivalent to

This means that the algorithm succesfully learns the hypothesis $\mathbf{x}\mapsto h(W^{\top}\mathbf{x})$ with respect to the distribution $\mathcal{D}$ . Since $\epsilon$ is arbitrarily small and $W,\mathcal{D}$ were chosen arbitrarily, the result follows.

3 Proof of Thm. 3

The proof is a combination of a few lemmas, presented below.

where $i$ is the imaginary unit (note that since $\psi$ is real-valued, the imaginary components eventually cancel out, but it will be more convenient for us to represent the Fourier series in this compact form). By Parseval’s identity, $\sum_{z}|a_{z}|^{2}=\int_{-1/2}^{1/2}\psi^{2}(x)dx$ , which is at most $1$ (since $\psi(x)\in[-1,+1]$ ).

We now wish to compute the Fourier transform of the aboveStrictly speaking, this function does not have a Fourier transform in the sense of Eq. (1), since the associated integrals do not converge. However, the function still has a well-defined Fourier transform in the more general sense of a generalized function or distribution (see e.g. for a survey). In the derivation below, we will simply rely on some standard formulas from the Fourier analysis literature, and refer to for their formal justifications.. First, we note that the Fourier transform of $\exp(2\pi i\left\langle\mathbf{v},\cdot\right\rangle)$ is given by $\delta(\cdot-\mathbf{v})$ , where $\delta$ is the Dirac delta function (a so-called generalized function which satisfies $\delta(\mathbf{x})=0$ for all $\mathbf{x}\neq\mathbf{0}$ , and $\int\delta(\mathbf{x})d\mathbf{x}=1$ ). Based on this and the linearity of the Fourier transform, we have that

and therefore, by the convolution property of the Fourier transform, we have

For any distinct integers $z_{1}\neq z_{2}$ and any $\mathbf{w}$ such that $\left\|\mathbf{w}\right\|=2r$ , it holds that

Let $\Delta=|z_{2}-z_{1}|r$ , and $\mathbf{v}=(z_{2}-z_{1})\mathbf{w}$ , so $\mathbf{v}$ is a vector of norm $2\Delta$ . Since the inner product is invariant to shifting the coordinates, we can assume without loss of generality that $z_{1}=0$ , and our goal is to bound $\left\langle|\hat{\varphi}|,|\hat{\varphi}(\cdot-\mathbf{v})|\right\rangle$ .

Using the convention that $\mathbf{1}_{\leq\Delta}$ is the indicator of $\{\mathbf{x}:\left\|\mathbf{x}\right\|\leq\Delta\}$ , and $\mathbf{1}_{>\Delta}$ is the indicator for its complement, we have

where in the last step we used Cauchy-Schwartz. Using the fact that norms and inner products are invariant to coordinate shifting, the above is at most

By the triangle inequality and the assumption $\left\|\mathbf{v}\right\|=2\Delta$ , the event $\left\|\mathbf{x}+\mathbf{v}\right\|\leq\Delta$ implies $\left\|\mathbf{x}\right\|\geq\Delta$ . Therefore, the above can be upper bounded by

Since $\varphi$ is Fourier-concentrated, this is at most $2\epsilon(\Delta)\left\|\hat{\varphi}\right\|^{2}=2\epsilon(\Delta)\left\|\varphi\right\|^{2}=2\epsilon(\Delta)$ , where we use the isometry of the Fourier transform and the assumption that $\left\|\varphi\right\|^{2}=\int\varphi^{2}(\mathbf{x})d\mathbf{x}=1$ . Plugging back the definition of $\Delta$ , the result follows. ∎

For simplicity, define $\epsilon^{\prime}(x)=\epsilon(x)$ for all $x>0$ , and $\epsilon(0)=0$ . Then the expression in the lemma equals

where in the last step we used the fact that the two inner square roots are the same up to a different indexing. Recalling the definition of $\epsilon^{\prime}$ and that $\sum_{z}|a_{z}|^{2}\leq 1$ , the above is at most

where $a_{0}$ is the coefficient from Lemma 2 and $c$ is a universal positive constant.

The existence of such a set follows from standard concentration of measure arguments (i.e. if we pick that many vectors uniformly at random from $\left\{-\frac{2r}{\sqrt{d}},+\frac{2r}{\sqrt{d}}\right\}^{d}$ , and $c$ is small enough, the vectors will satisfy the above with overwhelming probability, hence such a set must exist).

where we used the fact that $x+1/x\geq 2$ for all $x>0$ . This contradicts the assumption on $\mathcal{W}$ (see Eq. (10)), and establishes that $\{A_{\mathbf{w}}\}_{\mathbf{w}\in\mathcal{W}}$ are indeed disjoint sets.

We now continue by analyzing Eq. (11). Letting $\mathbf{1}_{A_{\mathbf{w}}}$ be the indicator function to the set $A_{\mathbf{w}}$ , and $\mathbf{1}_{A^{C}_{\mathbf{w}}}$ be the indicator of its complement, and recalling that $(a+b)^{2}\leq 2(a^{2}+b^{2})$ , we can upper bound Eq. (11) by

We consider each expectation separately. Starting with the first one, we have

Since we have $\left\|\hat{\varphi}\right\|=\left\|\varphi\right\|=1$ , $|a_{0}|^{2}\leq\sum_{z}|a_{z}|^{2}\leq 1$ , and $\left\|\psi_{\mathbf{w}}\varphi\right\|^{2}=\int\psi^{2}_{\mathbf{w}}(\mathbf{x})\varphi^{2}(\mathbf{x})d\mathbf{x}\leq\int\varphi^{2}(\mathbf{x})d\mathbf{x}=1$ , the above is at most

Since $A_{\mathbf{w}}$ are disjoint sets, $\sum_{\mathbf{w}\in\mathcal{W}}\mathbf{1}_{A_{\mathbf{w}}}(\mathbf{x})\leq 1$ for any $\mathbf{x}$ , so the above is at most

We now divide the terms in the sum above to two cases:

and by definition of $A_{\mathbf{w}}^{C}$ and the assumption $z_{1}\neq 0$ , we have $\mathbf{1}_{A_{\mathbf{w}}^{C}}(\mathbf{x}+z_{1}\mathbf{w})=1$ only if $\left\|\mathbf{x}\right\|\geq r$ . Therefore, as $\varphi$ is Fourier-concentrated, the above is at most

Plugging these two cases back into Eq. (14), we get the upper bound

Noting that $\sum_{z}|a_{z}|^{2}\leq 1$ , and applying Lemma 4, the above is at most

where we used the fact that $\epsilon^{2}(r)\leq\epsilon(r)\leq\sum_{n=1}^{\infty}\epsilon(nr)$ . Recalling this is an upper bound on the second expectation in Eq. (12), and that the first expectation is upper bounded as in Eq. (13), we get that Eq. (12) (and hence the expression in the lemma statement) is at most

With these lemmas in hand, we can now turn to prove the theorem. We have that

for any vector $\mathbf{p}$ which is not dependent of $\mathbf{w}^{\star}$ (this $\mathbf{p}$ will be determined later). Recalling the definition of the objective function $F$ , and letting $\mathbf{g}(\mathbf{x})=(g_{1}(\mathbf{x}),g_{2}(\mathbf{x}),\ldots)=\frac{\partial}{\partial\mathbf{w}}f(\mathbf{w},\mathbf{x})$ , the above equals

Let us now choose $\mathbf{p}$ so that $p_{i}=\left\langle\varphi g_{i},\varphi f(\mathbf{w},\cdot)\right\rangle-\left\langle\varphi g_{i},a_{0}\varphi\right\rangle$ (note that this choice is indeed independent of $\mathbf{w}^{\star}$ ). Plugging back and applying Lemma 5 (using the $L^{2}$ function $\widehat{\varphi g_{i}}$ for each $i$ ), we get

4 Proof of Thm. 4

We will assume w.l.o.g. that the algorithm is deterministic: If it is randomized, we can simply prove the statement for any possible realization of its random coin flips.

The algorithm’s first point $\mathbf{w}_{1}$ is fixed before receiving any information from the oracle, and is therefore independent of $\mathbf{w}^{\star}$ . By Thm. 3, we have that $\text{Var}_{\mathbf{w}^{\star}}(\nabla F_{\mathbf{w}^{\star}}(\mathbf{w}_{1}))\leq\varepsilon$ , which by Chebyshev’s inequality, implies that

Repeating this argument and applying a union bound, it follows that as long as the number of iterations $T$ satisfies $T\varepsilon\leq p$ (or equivalently $T\leq p/\varepsilon$ ), the oracle reveals no information whatsoever on the choice of $\mathbf{w}^{\star}$ all point chosen by the algorithm (and hence also its output) are independent of $\mathbf{w}^{\star}$ as required.

This research is supported in part by an FP7 Marie Curie CIG grant, Israel Science Foundation grant 425/13, and the Intel ICRI-CI Institute.