Distribution-Specific Hardness of Learning Neural Networks

Ohad Shamir

Introduction

Artificial neural networks have seen a dramatic resurgence in recent years, and have proven to be a highly effective machine learning method in computer vision, natural language processing, and other challenging AI problems. Moreover, successfully training such networks is routinely performed using simple and scalable gradient-based methods, in particular stochastic gradient descent.

Despite this practical success, our theoretical understanding of the computational tractability of such methods is quite limited, with most results being negative. For example, as discussed in , learning even depth-2 networks in a formal PAC learning framework is computationally hard in the worst case, and even if the algorithm is allowed to return arbitrary predictors. As common in such worst-case results, these are proven using rather artificial constructions, quite different than the real-world problems on which neural networks are highly successful. In particular, since the PAC framework focuses on distribution-free learning (where the distribution generating the examples is unknown and rather arbitrary), the hardness results rely on carefully crafted distributions, which allows one to relate the learning problem to (say) an NP-hard problem or breaking a cryptographic system. However, what if we insist on “natural” distributions? Is it possible to show that neural networks learning becomes computationally tractable? Can we show that they can be learned using the standard heuristics employed in practice, such as stochastic gradient descent?

To understand what a “natural” distribution refers to, we need to separate the distribution over examples (given as input-output pairs (x,y)(\mathbf{x},y)) into two components:

The input distribution p(x)p(\mathbf{x}): “Natural” input distributions on Euclidean space tend to have properties such as smoothness, non-degeneracy, incoherence etc.

The target function h(x)h(\mathbf{x}): In PAC learning, it is assumed that the output yy equals h(x)h(\mathbf{x}), where hh is some unkown target function from the hypothesis class we are considering. In studying neural networks, it is common to consider the class of all networks which share some fixed architecture (e.g. feedforward networks of a given depth and width). However, one may argue that the parameters of real-world networks (e.g. the weights of each neuron) are not arbitrary, but exhibit various features such as non-degeneracy or some “random like” appearance. Indeed, networks with a random structure have been shown to be more amenable to analysis in various situations (see for instance and references therein).

Empirical evidence clearly suggest that many pairs of input distributions and target functions are computationally tractable, using standard methods. However, how do we characterize these pairs? Would appropriate assumptions on one of them be sufficient to show learnability?

In this paper, we investigate these two components, and provide evidence that neither one of them alone is enough to guarantee computationally tractable learning, at least with methods resembling those used in practice. Specifically, we focus on simple, shallow ReLU networks, assume that the data can be perfectly predicted by some such network, and even allow over-specification (a.k.a. improper learning), in the sense that we allow the learning algorithm to output a predictor which is possibly larger and more complex than the target function (this technique increases the power of the learner, and was shown to make the learning problem easier in theory and in practice ). Even under such favorable conditions, we show the following:

Hardness for “natural” target functions. For each individual target function coming from a simple class of small, shallow ReLU networks (even if its parameters are chosen randomly or in some other oblivious way), we show that no algorithm invariant to linear transformations can successfully learn it w.r.t. all input distributions in polynomial time (this corresponds, for instance, to standard gradient-based methods together with data whitening or preconditioning). This result is based on a reduction from learning intersections of halfspaces. Although that problem is known to be hard in the worst-case over both input distributions and target functions, we essentially show that invariant algorithms as above do not “distinguish” between worst-case and average-case: If one can learn a particular target function with such an algorithm, then the algorithm can learn nearly all target functions in that class.

Hardness for “natural” input distributions. We show that target functions of the form xψ(w,x)\mathbf{x}\mapsto\psi(\left\langle\mathbf{w},\mathbf{x}\right\rangle) for any periodic ψ\psi are generally difficult to learn using gradient-based methods, even if the input distribution is fixed and belongs to a very broad class of smooth input distributions (including, for instance, Gaussians and mixtures of Gaussians). Note that such functions can be constructed by simple shallow networks, and can be seen as an extension of generalized linear models . Unlike the previous result, which relies on a computational hardness assumption, the results here are geometric in nature, and imply that the gradient of the objective function, nearly everywhere, contains virtually no signal on the underlying target function. Therefore, any algorithm which relies on gradient information cannot learn such functions. Interestingly, the difficulty here is not in having a plethora of spurious local minima or saddle points – the associated stochastic optimization problem may actually have no such critical points. Instead, the objective function may exhibit properties such as flatness nearly everywhere, unless one is already very close to the global optimum. This highlights a potential pitfall in non-convex learning, which occurs already for a slight extension of generalized linear models, and even for “nice” input distributions.

Together, these results indicate that in order to explain the practical success of neural network learning with gradient-based methods, one would need to employ a careful combination of assumptions on both the input distribution and the target function, and that results with even a “partially” distribution-free flavor (which are common, for instance, in convex learning problems) may be difficult to attain here.

To prove our results, we develop some tools which may be of independent interest. In particular, the techniques used to prove hardness of learning functions of the form xψ(w,x)\mathbf{x}\mapsto\psi(\left\langle\mathbf{w},\mathbf{x}\right\rangle) are based on Fourier analysis, and have some close connections to hardness results on learning parities in the well-known framework of learning from statistical queries : In both cases, one essentially shows that the Fourier transform of the target function has very small support, and hence does not “correlate” with most functions, making it difficult to learn using certain methods. However, we consider a more general and arguably more natural class of input distributions over Euclidean space, rather than distributions on the Boolean cube. In a sense, we show that learning general periodic functions over Euclidean space is difficult (at least with gradient-based methods), for the same reasons that learning parities over the Boolean cube is difficult in the statistical queries framework.

Recent years have seen quite a few papers on the theory of neural network learning. Below, we only briefly mention those most relevant to our paper.

In a very elegant work, Janzamin et al. have shown that a certain method based on tensor decompositions allows one to provably learn simple neural networks by a combination of assumptions on the input distribution and the target function. However, a drawback of their method is that it requires rather precise knowledge of the input distribution and its derivatives, which is rarely available in practice. In contrast, our focus is on algorithms which do not utilize such knowledge. Other works which show computationally-efficient learnability of certain neural networks under sufficiently strong distributional assumptions include .

In the context of learning functions over the Boolean cube, it is known that even if we restrict ourself to a particular input distribution (as long as it satisfies some mild conditions), it is difficult to learn parity functions using statistical query algorithms . Moreover, it was recently shown that stochastic gradient descent methods can be approximately posed as such algorithms . Since parities can be implemented with small real-valued networks, this implies that for “most” input distributions on the Boolean cube, there are neural networks which are unlikely to be learnable with gradient-based methods. However, data provided to neural networks in practice are not in the form of Boolean vectors, but rather vectors of floating-point numbers. Moreover, some assumptions on the input distribution, such as smoothness and Gaussianity, only make sense once we consider the support to be Euclidean space rather than Boolean cube. Perhaps these are enough to guarantee computational tractability? A contribution of this paper is to show that this is not the case, and to formally demonstrate how phenomena similar to the Boolean case also occurs in Euclidean space, using appropriate target functions and distributions.

Finally, we note that provides improper-learning hardness results, which hold even for a standard Gaussian distribution on Euclidean space, and for any algorithm. However, unlike our paper, their focus is on hardness of agnostic learning (where the target function is arbitrary and does not have to correspond to a given class), the results are specific to the standard Gaussian distribution, and the proofs are based on a reduction from the Boolean case.

The paper is structured as follows: In Sec. 2, we formally present some notation and concepts used throughout the paper. In Sec. 4, we provide our hardness results for natural input distributions, and in Sec. 3, we provide our hardness results for natural target functions. All proofs are presented in Sec. 5.

Preliminaries

We generally let bold-faced letters denote vectors. Given a complex-valued number z=a+ibz=a+ib, we let z=aib\overline{z}=a-ib denote its complex conjugate, and z=a2+b2|z|=\sqrt{a^{2}+b^{2}} denote its modulus. Given a function ff, we let f\nabla f denote its gradient and 2f\nabla^{2}f denote its Hessian (assuming they exist).

Neural Networks. The focus of our results will be on learning predictors which can be described by simple and shallow (depth 2 or 3) neural networks. A standard feedforward neural network is composed of neurons, each of which computes the mapping xσ(w,x+b)\mathbf{x}\mapsto\sigma(\left\langle\mathbf{w},\mathbf{x}\right\rangle+b), where w,b\mathbf{w},b are parameters and σ\sigma is a scalar activation function, for example the popular ReLU function [z]+=max{0,z}[z]_{+}=\max\{0,z\}. These neurons are arranged in parallel in layers, so the output of each layer can be compactly represented as xσ(Wx+b)\mathbf{x}\mapsto\sigma(W^{\top}\mathbf{x}+\mathbf{b}), where WW is a matrix (each column corresponding to the parameter vector of one of the neurons), b\mathbf{b} is a vector, and σ\sigma applies an activation function on the coordinates of WxW^{\top}\mathbf{x}. In vanilla feedforward networks, such layers are connected to each other, so given an input x\mathbf{x}, the output equals

PAC Learning. For the results of Sec. 3, we will rely on the following standard definition of PAC learning with respect to Boolean functions: Given a hypothesis class H\mathcal{H} of functions from {0,1}d\{0,1\}^{d} to {0,1}\{0,1\}, we say that a learning algorithm PAC-learns H\mathcal{H} if for any ϵ(0,1)\epsilon\in(0,1), any distribution D\mathcal{D} over {0,1}d\{0,1\}^{d}, and any hHh^{\star}\in\mathcal{H}, if the algorithm is given oracle access to i.i.d. samples (x,h(x))(\mathbf{x},h^{\star}(\mathbf{x})) where x\mathbf{x} is sampled according to D\mathcal{D}, then in time poly(d,1/ϵ)\text{poly}(d,1/\epsilon), the algorithm returns a function f:{0,1}d{0,1}f:\{0,1\}^{d}\mapsto\{0,1\} such that PrxD(f(x)h(x))ϵ\Pr_{\mathbf{x}\sim\mathcal{D}}(f(\mathbf{x})\neq h^{\star}(\mathbf{x}))\leq\epsilon with high probability (for our purposes, it will be enough to consider any constant close to 11). Note that in the definition above, we allow ff not to belong to the hypothesis class H\mathcal{H}. This is often denoted as “improper” learning, and allows the learning algorithm more power than in “proper” learning, where ff must be a member of H\mathcal{H}.

where exp(iz)=cos(z)+isin(z)\exp(iz)=\cos(z)+i\cdot\sin(z), ii being the imaginary unit. In the proofs, we will use the following well-known properties of the Fourier transform:

Linearity: For scalars a,ba,b and functions f,gf,g, af+bg^=af^+bg^\widehat{af+bg}=a\hat{f}+b\hat{g}.

Isometry: f,g=f^,g^\left\langle f,g\right\rangle=\langle\hat{f},\hat{g}\rangle and f=f^\left\|f\right\|=\|\hat{f}\|.

Convolution: fg^=f^g^\widehat{fg}=\hat{f}*\hat{g}, where * denotes the convolution operation: (fg)(w)=f(z)g(wz) dz(f*g)(\mathbf{w})=\int f(\mathbf{z})\cdot g(\mathbf{w}-\mathbf{z})~{}d\mathbf{z}.

Natural Target Functions

In this section, we consider simple target functions of the form x[i=1n[wi,x]+]\mathbf{x}\mapsto\left[\sum_{i=1}^{n}[\left\langle\mathbf{w}_{i},\mathbf{x}\right\rangle]_{+}\right]_{}, where [z]+=max{0,z}[z]_{+}=\max\{0,z\} is the ReLU function, and [z]=min{1,max{0,z}}[z]_{}=\min\{1,\max\{0,z\}\} is the clipping operation on the interval $.Thiscorrespondstodepth2networkswithnobiasinthefirstlayer,andwheretheoutputsofthefirstlayeraresimplysummedandmovedthroughaclippingnonlinearity(thisoperationcanalsobeeasilyimplementedusingasecondlayercomposedoftwoReLUneurons).Letting. This corresponds to depth-2 networks with no bias in the first layer, and where the outputs of the first layer are simply summed and moved through a clipping non-linearity (this operation can also be easily implemented using a second layer composed of two ReLU neurons). LettingW=[\mathbf{w}_{1},\ldots,\mathbf{w}_{n}],wecanwritesuchpredictorsas, we can write such predictors as\mathbf{x}\mapsto h(W^{\top}\mathbf{x})foranappropriatefixedfunctionfor an appropriate fixed functionh.Ourgoalwouldbetoshowthatforsuchatargetfunction,withvirtuallyanychoiceof. Our goal would be to show that for such a target function, with virtually any choice ofW$ (essentially, as long as its columns are linearly independent), and any polynomial-time learning algorithm satisfying some conditions, there exists an input distribution on which it must fail.

As the careful reader may have noticed, it is impossible to provide such a target-function-specific result which holds for any algorithm. Indeed, if we fix the target function in advance, we can always “learn” by returning the target function, regardless of the training data. Thus, imposing some constraints on the algorithm is necessary. Specifically, we will consider algorithms which exhibit certain natural invariances to the coordinate system used. One very natural invariance is with respect to orthogonal transformations: For example, if we rotate the input instances xi\mathbf{x}_{i} in a fixed manner, then an orthogonally-invariant algorithm will return a predictor which still makes the same predictions on those instances. Formally, this invariance is defined as follows:

The definition as stated refers to deterministic algorithms. For stochastic algorithms, we will understand orthogonal invariance to mean orthogonal invariance conditioned on any realization of the algorithm’s random coin flips.

For example, standard gradient and stochastic gradient descent methods (possibly with coordinate-oblivious regularization, such as L2L_{2} regularization) can be easily shown to be orthogonally-invariantEssentially, this is because the gradient of any function g(Wx)=g(w1,x,,wk,x)g(W^{\top}\mathbf{x})=g(\left\langle\mathbf{w}_{1},\mathbf{x}\right\rangle,\ldots,\left\langle\mathbf{w}_{k},\mathbf{x}\right\rangle) w.r.t. any wi\mathbf{w}_{i} is proportional to x\mathbf{x}. Thus, if we multiply x\mathbf{x} by an orthogonal MM, the gradient also gets multiplied by MM. Since MM=IM^{\top}M=I, the inner products of instances x\mathbf{x} and gradients remain the same. Therefore, by induction, it can be shown that any algorithm which operates by incrementally updating some iterate by linear combinations of gradients will be rotationally invariant.. However, for our results we will need to make a somewhat stronger invariance assumption, namely invariance to general invertible linear transformations of the data (not necessarily just orthogonal). This is formally defined as follows:

One well-known example of such an algorithm (which is also invariant to affine transformations) is the Newton method . More relevant to our purposes, linear invariance occurs whenever an orthogonally-invariant algorithm preconditions or “whitens” the data so that its covariance has a fixed structure (e.g. the identity matrix, possibly after a dimensionality reduction if the data is rank-deficient). For example, even though gradient descent methods are not linearly invariant, they become so if we precede them by such a preconditioning step. This is formalized in the following theorem:

Let A\mathcal{A} be any algorithm which given {xi,yi}i=1m\{\mathbf{x}_{i},y_{i}\}_{i=1}^{m}, computes the whitening matrix P=D1UP=D^{-1}U^{\top} (where X=[x1 x2  xm]X=[\mathbf{x}_{1}~{}\mathbf{x}_{2}~{}\ldots~{}\mathbf{x}_{m}], X=UDVX=UDV^{\top} is a thinThat is, if XX is of size d×md\times m, then UU is of size d×Rank(X)d\times\text{Rank}(X), DD is of size Rank(X)×Rank(X)\text{Rank}(X)\times\text{Rank}(X), and VV is of size m×Rank(X)m\times\text{Rank}(X). SVD decomposition of XX), feeds {Pxi,yi}i=1m\{P\mathbf{x}_{i},y_{i}\}_{i=1}^{m} to an orthogonally-invariant algorithm, and given the output predictor xf(Wx)\mathbf{x}\mapsto f(W^{\top}\mathbf{x}), returns the predictor xf((PW)x)\mathbf{x}\mapsto f((P^{\top}W)^{\top}\mathbf{x}). Then A\mathcal{A} is linearly-invariant.

It is easily verified that the covariance matrix of the transformed instances Px1,,PxmP\mathbf{x}_{1},\ldots,P\mathbf{x}_{m} is the r×rr\times r identity matrix (where r=Rank(X)r=\text{Rank}(X)), so this is indeed a whitening transform. We note that whitening is a very common preprocessing heuristic, and even when not done explicitly, scalable approximate whitening and preconditioning methods (such as Adagrad and batch normalization ) are very common and widely recognized as useful for training neural networks.

(where we let 11 correspond to ‘true’ and to ‘false’). The problem of PAC-learning intersections of halfspaces over the Boolean cube (x{0,1}d\mathbf{x}\in\{0,1\}^{d}) has been well-studied. In particular, two known hardness results are the following:

Klivans and Sherstov show that under a certain well-studied cryptographic assumption (hardness of finding unique shortest vectors in a high-dimensional lattice), no algorithm can PAC-learn intersection of nd=dδn_{d}=d^{\delta} halfspaces (where δ\delta is any positive constant), even if the coordinates of wi\mathbf{w}_{i} and bib_{i} are all integers, and maxi(wi,bi)poly(d)\max_{i}\left\|(\mathbf{w}_{i},b_{i})\right\|\leq\text{poly}(d).

Daniely and Shalev-Shwartz show that under an assumption related to the hardness of refuting random K-SAT formulas, no algorithm can PAC-learn intersections of nd=ω(1)n_{d}=\omega(1) halfspaces (as dd\rightarrow\infty), even if the coordinates of wi\mathbf{w}_{i} and bib_{i} are all integers, and maxi(wi,bi)O(d)\max_{i}\left\|(\mathbf{w}_{i},b_{i})\right\|\leq\mathcal{O}(d).

In the theorem below, we will use the result of , which applies to an intersection of a smaller number of halfspaces, and with smaller norms. However, similar results can be shown using , at the cost of worse polynomial dependencies on dd.

The main result of this section is the following:

Consider any network h(Wx)=[i=1nd[wi,x]+]h(W_{\star}^{\top}\mathbf{x})=\left[\sum_{i=1}^{n_{d}}[\left\langle\mathbf{w}^{\star}_{i},\mathbf{x}\right\rangle]_{+}\right]_{} (where W=[w1,,wn]W_{\star}=[\mathbf{w}^{\star}_{1},\ldots,\mathbf{w}^{\star}_{n}]), which satisfies the following:

ndω(1)n_{d}\geq\omega(1) as dd\rightarrow\infty

maxiwiO(d)\max_{i}\left\|\mathbf{w}^{\star}_{i}\right\|\leq\mathcal{O}(d)

w1wn\mathbf{w}_{1}^{\star}\ldots\mathbf{w}_{n}^{\star} are linearly independent, so the smallest singular value smin(W)s_{\min}(W_{\star}) of WW_{\star} is strictly positive.

Then under the assumption stated in , there is no linearly-invariant algorithm which for any ϵ>0\epsilon>0 and any distribution D\mathcal{D} over vectors of norm at most O(ddnd)min{1,smin(W)}\frac{\mathcal{O}(d\sqrt{dn_{d}})}{\min\{1,s_{\min}(W_{\star})\}}, given only access to samples (x,h(Wx))(\mathbf{x},h(W_{\star}^{\top}\mathbf{x})) where xD\mathbf{x}\sim\mathcal{D}, runs in time poly(d,1/ϵ)\text{poly}(d,1/\epsilon) and returns with high probability a predictor xf(Wx)\mathbf{x}\mapsto f(W^{\top}\mathbf{x}) such that

Note that the result holds even if the returned predictor f(Wx)f(W^{\top}\mathbf{x}) has a different structure than hW()h_{W_{\star}}(\cdot), and WW is of a larger size than WW_{\star}. Thus, it applies even if the algorithm is allowed to train a larger network or more complicated predictor than hW()h_{W_{\star}}(\cdot).

Since the algorithm is linearly-invariant, it can be shown that this implies successful learning from (x,h(Wx))(\mathbf{x},h(W^{\top}\mathbf{x})) where xD\mathbf{x}\sim\mathcal{D}, as required.

In the sketch above, we have ignored some technical issues. For example, we need to be careful that MM has a bounded spectral norm, so that it induces a linear transformation which does not distort norms by too much (as all our arguments apply for input distributions supported on a bounded domain). A second issue is that if we apply a linearly-invariant algorithm on a dataset transformed by MM, then the invariance is only with respect to the data, not necessarily with respect to new instances x\mathbf{x} sampled from the same distribution (and this restriction is necessary for results such as Thm. 1 to hold without further assumptions). However, it can be shown that if the dataset is large enough, invariance will still occur with high probability over the sampling of x\mathbf{x}, which is sufficient for our purposes.

Natural Input Distributions

Our formal results rely on Fourier analysis and are a bit technical. Hence, we precede them with an informal description, outlining the main ideas and techniques, and presenting a specific case study which may be of independent interest (Subsection 4.1). The formal results are presented in Subsection 4.2.

Consider a target function of the form xψ(w,x)\mathbf{x}\mapsto\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle), and any input distribution whose density function can be written as the square φ2\varphi^{2} of some function φ\varphi (the reason for this will become apparent shortly). Suppose we attempt to learn this target function (with respect to the squared loss) using some hypothesis class, which can be parameterized by a bounded-norm vector w\mathbf{w} in some subset W\mathcal{W} of an Euclidean space (not necessarily of the same dimensionality as w\mathbf{w}^{\star}), so each predictor in the class can be written as xf(w,x)\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x}) for some fixed mapping ff. Thus, our goal is essentially to solve the stochastic optimization problem

In this section, we study the geometry of this objective function, and show that under mild conditions on ff, and assuming the norm of w\mathbf{w}^{\star} is reasonably large, the following holds:

For any fixed w\mathbf{w}, the value of the objective function is almost independent of w\mathbf{w}^{\star}, in the sense that if we pick the direction of w\mathbf{w}^{\star} uniformly at random, the value is extremely concentrated around a fixed value independent of w\mathbf{w}^{\star} (e.g. exponentially small in w2\left\|\mathbf{w}^{\star}\right\|^{2} for a Gaussian or a mixture of Gaussians).

Similarly, the gradient of the objective function with respect to w\mathbf{w} is almost independent of w\mathbf{w}^{\star}, and is extremely concentrated around a fixed value (again, exponentially small in w2\left\|\mathbf{w}^{\star}\right\|^{2} for, say, a mixture of Gaussians).

Therefore, assuming w\left\|\mathbf{w}^{\star}\right\| is reasonably large, any standard gradient-based method will follow a trajectory nearly independent of w\mathbf{w}^{\star}. In fact, in practice we do not even have access to exact gradients of Eq. (2), but only to noisy and biased versions of it (e.g. if we perform stochastic gradient descent, and certainly if we use finite-precision computations). In that case, the noise will completely obliterate the exponentially small signal about w\mathbf{w}^{\star} in the gradients, and will make the trajectory essentially independent of w\mathbf{w}^{\star}. As a result, assuming ψ\psi and the distribution is such that the function ψ(w,x)\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle) is sensitive to the direction of w\mathbf{w}^{\star}, it follows that these methods will fail to optimize Eq. (2) successfully. Finally, we note that in practice, it is common to solve not Eq. (2) directly, but rather its empirical approximation with respect to some fixed finite training set. Still, by concentration of measure, this empirical objective would converge to the one in Eq. (2) given enough data, so the same issues will occur.

An important feature of our results is that they make virtually no structural assumptions on the predictors xf(w,x)\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x}). In particular, they can represent arbitrary classes of neural networks (as well as other predictor classes). Thus, our results imply that target functions of the form xψ(w,x)\mathbf{x}\mapsto\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle), where ψ\psi is periodic, would be difficult to learn using gradient-based methods, even if we allow improper learning and consider predictor classes of a different structure.

To explain how such results are attained, let us study a concrete special case (not necessarily in the context of neural networks). Consider the target function xcos(2πw,x)\mathbf{x}\mapsto\cos(2\pi\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle), and the hypothesis class (parameterized by w\mathbf{w}) of functions xcos(2πw,x)\mathbf{x}\mapsto\cos(2\pi\left\langle\mathbf{w},\mathbf{x}\right\rangle). Thus, Eq. (2) takes the form

We now turn to explain why Eq. (3) has the form shown in Figure 2. This will also help to illustrate our proof techniques, which apply much more generally. The main idea is to analyze the Fourier transform of Eq. (3). Letting cosw\cos_{\mathbf{w}} denote the function xcos(2πw,x)\mathbf{x}\mapsto\cos(2\pi\left\langle\mathbf{w},\mathbf{x}\right\rangle), we can write Eq. (3) as

cosw^(ξ)\widehat{\cos_{\mathbf{w}}}(\boldsymbol{\xi}) can be shown to equal 12(δ(ξw)+δ(ξ+w))\frac{1}{2}\left(\delta(\boldsymbol{\xi}-\mathbf{w})+\delta(\boldsymbol{\xi}+\mathbf{w})\right), where δ()\delta(\cdot) is Dirac’s delta function (a “generalized” function which satisfies δ(z)=0\delta(\mathbf{z})=0 for all z0\mathbf{z}\neq\mathbf{0}, and δ(z)dz=1\int\delta(\mathbf{z})d\mathbf{z}=1). Plugging this into the above and simplifying, we get

If φ2\varphi^{2} is a standard Gaussian, φ^(ξ)\hat{\varphi}(\boldsymbol{\xi}) can be shown to equal the Gaussian-like function (4π)d/2aξ2(4\pi)^{d/2}a^{-\left\|\boldsymbol{\xi}\right\|^{2}} where a=exp(4π2)a=\exp(4\pi^{2}). Plugging back, the expression above is proportional to

The expression in each inner parenthesis can be viewed as a mixture of two Gaussian-like functions, with centers at w,w\mathbf{w},-\mathbf{w} (or w,w\mathbf{w}^{\star},-\mathbf{w}^{\star}). Thus, if w\mathbf{w} is far from w\mathbf{w}^{\star}, these two mixtures will have nearly disjoint support, and Eq. (5) will have nearly the same value regardless of w\mathbf{w} – in other words, it is very flat. Since this equation is nothing more than a re-formulation of the original objective function in Eq. (3) (up to a constant), we get a similar behavior for Eq. (3) as well.

This behavior extends, however, much more generally than the specific objective in Eq. (3). First of all, we can replace the standard Gaussian distribution φ2\varphi^{2} by any distribution such that φ^\hat{\varphi} has a localized support. This would still imply that Eq. (4) refers to the difference of two functions with nearly disjoint support, and the same flatness phenomenon will occur. Second, we can replace the cos\cos function by any periodic function ψ\psi. By properties of the Fourier transform of periodic functions, we still get localized functions in the Fourier domain (more precisely, the Fourier transform will be localized around integer multiples of w\mathbf{w}, up to scaling). Finally, instead of considering hypothesis classes of predictors xψ(w,x)\mathbf{x}\mapsto\psi(\left\langle\mathbf{w},\mathbf{x}\right\rangle) similar to the target function, we can consider quite arbitrary mappings xf(w,x)\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x}). Even though this function may no longer be localized in the Fourier domain, it is enough that only the target function xψ(w,x)\mathbf{x}\mapsto\psi(\left\langle\mathbf{w}^{\star},\mathbf{x}\right\rangle) will be localized: That implies that regardless how ff looks like, under a random choice of w\mathbf{w}^{\star}, only a minuscule portion of the L2L_{2} mass of ff overlaps with the target function, hence getting sufficient signal on w\mathbf{w}^{\star} will be difficult.

As mentioned in the introduction, these techniques and observations have some close resemblance to hardness results for learning parities over the Boolean cube in the statistical queries learning model . There as well, one considers a Fourier transform (but on the Boolean cube rather than Euclidean space), and essentially show that functions with a “localized” Fourier transform are difficult to “detect” using any fixed function. However, our results are different and more general, in the sense that they apply to generic smooth distributions over Euclidean space, and to a general class of periodic functions, rather than just parities. On the flip side, our results are constrained to methods which are based on gradients of the objective, whereas the statistical queries framework is more general and considers algorithms which are based on computing (approximate) expectations of arbitrary functions of the data. Extending our results to this generality is an interesting topic for future research.

2 Formal Results

We now turn to provide a more formal statement of our results. The distributions we will consider consist of arbitrary mixtures of densities, whose square roots have rapidly decaying tails in the Fourier domain. More precisely, we have the following definition:

where 1r\mathbf{1}_{\geq r} is the indicator function of {x:xr}\{\mathbf{x}:\left\|\mathbf{x}\right\|\geq r\}.

A canonical example is Gaussian distributions: Given a (non-degenerate, zero-mean) Gaussian density function φ2\varphi^{2} with covariance matrix Σ\Sigma, its square root φ\varphi is proportional to a Gaussian with covariance 2Σ2\Sigma, and its Fourier transform φ^\hat{\varphi} is well-known to be proportional to a Gaussian with covariance (2Σ)1(2\Sigma)^{-1}. By standard Gaussian concentration results, it follows that φ2\varphi^{2} is Fourier-concentrated with ϵ(r)=exp(Ω(λminr2))\epsilon(r)=\exp(-\Omega(\lambda_{\min}r^{2})) where λmin\lambda_{\min} is the minimal eigenvalue of Σ\Sigma. A similar bound can be shown when the Gaussian has some arbitrary mean. More generally, it is well-known that smooth functions (differentiable to sufficiently high order with integrable derivatives) have Fourier transforms with rapidly decaying tails. For example, if we consider the broad class of Schwartz functions (characterized by having values and all derivatives decaying faster than polynomially in rr), then the Fourier transform of any such function is also a Schwartz function, which implies super-polynomial decay of ϵ(r)\epsilon(r) (see for instance , Chapter 11 and Proposition 11.25).

We now formally state our main result for this section. We consider any predictor of the form xf(w,x)\mathbf{x}\mapsto f(\mathbf{w},\mathbf{x}), where ff is some fixed function and w\mathbf{w} is a parameter vector coming from some domain W\mathcal{W}, which we will assume w.l.o.g. to be a subset of some Euclidean spaceMore generally, our analysis is applicable to any separable Hilbert space. (for example, ff can represent a network of a given architecture, with weights specified by w\mathbf{w}). When learning ff based on data coming from an underlying distribution, we are essentially attempting to solve the optimization problem

Assume that FF is differentiable w.r.t. w\mathbf{w}, any gradient-based method to solve this problem proceeds by computing (or approximating) Fw(w)\nabla F_{\mathbf{w}^{\star}}(\mathbf{w}) at various points w\mathbf{w}. However, the following theorem shows that at any w\mathbf{w}, and regardless of the type of predictor or network one is attempting to train, the gradient at w\mathbf{w} is virtually independent of the underlying target function, and hence provides very little signal:

We note that bounded variation is weaker than, say, Lipschitz continuity. Assuming ϵ(r)\epsilon(r) decays rapidly with rr – say, exponentially in r2r^{2} as is the case for a Gaussian mixture – we get that the bound in the theorem is on the order of exp(Ω(min{d,r2}))\exp(-\Omega(\min\{d,r^{2}\})).

Overall, the theorem implies that if r,dr,d are moderately large, the gradient of FwF_{\mathbf{w}^{\star}} at any point w\mathbf{w} is extremely concentrated around a fixed value, independent of w\mathbf{w}^{\star}. This implies that gradient-based methods, which attempt to optimize FwF_{\mathbf{w}^{\star}} via gradient information, are unlikely to succeed. One way to formalize this is to consider any iterative algorithm (possibly randomized), which relies on an ε\varepsilon-approximate gradient oracle to optimize FwF_{\mathbf{w}^{\star}}: At every iteration tt, the algorithm chooses a point wtW\mathbf{w}_{t}\in\mathcal{W}, and receives a vector gt\mathbf{g}_{t} such that Fwgtε|\nabla F_{\mathbf{w}^{\star}}-\mathbf{g}_{t}|\leq\varepsilon. In our case, we will be interested in ε\varepsilon such that ε3\varepsilon^{3} is on the order of the bound in Thm. 3. Since the bound is extremely small for moderate d,rd,r (say, smaller than machine precision), this is a realistic model of gradient-based methods on finite-precision machines, even if one attempts to compute the gradients accurately. The following theorem implies that if the number of iterations is not extremely large (on the order of 1/ε1/\varepsilon, e.g. exp(Ω(d,r2))\exp(\Omega(d,r^{2})) iterations for Gaussian mixtures), then with high probability, a gradient-based method will return the same predictor independent of w\mathbf{w}^{\star}. However, since the objective function FwF_{\mathbf{w}^{\star}} is highly sensitive to the choice of w\mathbf{w}^{\star}, this means that no such gradient-based method can train a reasonable predictor.

Assume the conditions of Thm. 3, and let ε=c2(supwWGw)(exp(c3d)+n=1ϵ(nr))\varepsilon=\sqrt{c_{2}(\sup_{\mathbf{w}\in\mathcal{W}}G_{\mathbf{w}})\left(\exp(-c_{3}d)+\sum_{n=1}^{\infty}\epsilon(nr)\right)} be the cube root of the bound specified there (uniformly over all wW\mathbf{w}\in\mathcal{W}). Then for any algorithm as above and any p(0,1)p\in(0,1), conditioned on an event which holds with probability 1p1-p over the choice of w\mathbf{w}^{\star}, its output after at most p/εp/\varepsilon iterations will be independent of w\mathbf{w}^{\star}.

Proofs

Let PMP_{M} denote the whitening matrix employed if we transform the instances XX by some invertible d×dd\times d matrix MM (that is, XX becomes MXMX), and PP the whitening matrix employed for the original data.

Using the same notation as in the theorem, it is easily verified that PX=VPX=V^{\top}, and PMMX=VMP_{M}MX=V_{M}^{\top}, where UMDMVMU_{M}D_{M}V_{M}^{\top} is an SVD decomposition of the matrix MXMX. Since both VV^{\top} and VMV_{M}^{\top} are Rank(X)×m\text{Rank}(X)\times m matrices with rows consisting of orthonormal vectors, they are related by an orthogonal transformation (i.e. there is an orthogonal matrix RMR_{M} such that RMV=VMR_{M}V^{\top}=V_{M}^{\top}). Therefore, RMPX=PMMXR_{M}PX=P_{M}MX. Since the data is fed to an orthogonally-invariant algorithm, its output WMW_{M} satisfies WMPMMX=WPXW_{M}^{\top}P_{M}MX=W^{\top}PX. This in turn implies WMRMPX=WPXW_{M}^{\top}R_{M}PX=W^{\top}PX, and hence WMRMV=WVW_{M}^{\top}R_{M}V^{\top}=W^{\top}V^{\top}. Multiplying both sides on the right by VV and taking a transpose, we get that RMWM=WR_{M}^{\top}W_{M}=W, and hence WM=RMWW_{M}=R_{M}W. In words, WW and WMW_{M} are the same up to an orthogonal transformation RMR_{M} depending on MM. Therefore,

so we see that the returned predictor makes the same predictions over the dataset, independent of the transformation matrix MM.

2 Proof of Thm. 2

We start with the following auxiliary theorem, which reduces the hardness result of to one about neural networks of the type we discuss here:

Under the assumption stated in , the following holds for any nd=ω(1)n_{d}=\omega(1) (as dd\rightarrow\infty):

There is no algorithm running in time poly(d,1/ϵ)\text{poly}(d,1/\epsilon), which for any distribution D\mathcal{D} on {0,1}d\{0,1\}^{d}, and any h(Wx)=σ(i=1nd[wi,x]+)h(W^{\top}\mathbf{x})=\sigma(\sum_{i=1}^{n_{d}}[\left\langle\mathbf{w}_{i},\mathbf{x}\right\rangle]_{+}) (where W=[w1,w2,,wnd]W=[\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{n_{d}}] and maxiwiO(d)\max_{i}\left\|\mathbf{w}_{i}\right\|\leq\mathcal{O}(d)), given only access to samples (x,h(Wx))(\mathbf{x},h(W^{\top}\mathbf{x})) where xD\mathbf{x}\sim\mathcal{D}, returns with high probability a function ff such that

Thm. 5 holds even if we restrict w1,,wnd\mathbf{w}_{1},\ldots,\mathbf{w}_{n_{d}} to be linearly independent, with smin(W)1s_{\min}(W)\geq 1.

Suppose by contradiction that there exists an algorithm A\mathcal{A} which succeeds for any WW as stated above. We will describe how to use A\mathcal{A} to get an algorithm which succeeds for any WW as described in Thm. 5, hence reaching a contradiction.

This contradicts Thm. 5, which states that no efficient algorithm can return such a predictor for any sufficiently large dimension dd and norm bound O(d)\mathcal{O}(d). ∎

It is enough to prove that with probability at least 1δ1-\delta over the sampling of x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m},

This is because the event WMMxi=WxiW_{M}^{\top}M\mathbf{x}_{i}=W^{\top}\mathbf{x}_{i} for all ii means that WMMx=WxW_{M}^{\top}M\mathbf{x}=W^{\top}\mathbf{x} for any x\mathbf{x} in the span of x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m}.

Let x1,,xm+1\mathbf{x}_{1},\ldots,\mathbf{x}_{m+1} be sampled i.i.d. according to D\mathcal{D}. Considering probabilities over this sample, we have

where the latter inequality is because each xj\mathbf{x}_{j} is a dd-dimensional vector, hence the number of times we can get a vector not in the span of the previous ones is at most dd. Moreover, since the vectors are sampled i.i.d, we have

so the probabilities in Eq. (7) monotonically decrease with jj. Thus, Eq. (7) implies

so by Markov’s inequality, with probability at least 1δ1-\delta over the sampling of x1,,xm\mathbf{x}_{1},\ldots,\mathbf{x}_{m},

Since xm+1\mathbf{x}_{m+1} is sampled independently, Eq. (6) and hence the lemma follows. ∎

We will show that the very same algorithm, if given poly(d,1/ϵ)\text{poly}(d,1/\epsilon) samples, can successfully learn w.r.t. any d×ndd\times n_{d} matrix WW and any distribution D\mathcal{D} satisfying Proposition 1 and Thm. 5, contradicting those results.

Indeed, let WW and D\mathcal{D} be an arbitrary matrix and distribution as above. We first argue that there exists a d×dd\times d invertible matrix MM such that

To see this, note that WW and WW_{\star} are of the same size and our conditions imply that both of them have full column rank. Thus, we can simply augment them to invertible d×dd\times d matrices [W W^][W~{}\hat{W}] and [W W^][W_{\star}~{}\hat{W}_{\star}], where the columns of W^\hat{W} (respectively W^\hat{W}_{\star}) are an orthonormal basis for the subspace orthogonal to the column space of WW (respectively WW_{\star}), and choosing M=[W W^][W W^]1M^{\top}=[W~{}\hat{W}][W_{\star}~{}\hat{W}_{\star}]^{-1}. Thus, M[W W^][W W^]1\left\|M\right\|\leq\left\|[W~{}\hat{W}]\right\|\cdot\left\|[W_{\star}~{}\hat{W}_{\star}]^{-1}\right\|. The spectral norm of [W W^][W~{}\hat{W}] can be bounded by the Frobenius norm, which by the assumption on WW from Thm. 5 and the fact that W^\hat{W} consist of an orthonormal basis, is at most O(d)2nd+1(dnd)=O(d2nd)=O(dnd)\sqrt{\mathcal{O}(d)^{2}\cdot n_{d}+1\cdot(d-n_{d})}=\mathcal{O}(\sqrt{d^{2}n_{d}})=\mathcal{O}(d\sqrt{n_{d}}). The spectral norm of [W W^]1[W_{\star}~{}\hat{W}_{\star}]^{-1} can be bounded by 1/smin([W W^])1/s_{\min}([W_{\star}~{}\hat{W}_{\star}]), where smin([W W^])s_{\min}([W_{\star}~{}\hat{W}_{\star}]) equals the square root of the smallest eigenvalue of [W W^][W W^][W_{\star}~{}\hat{W}_{\star}]^{\top}[W_{\star}~{}\hat{W}_{\star}], which can be easily verified to be min{1,smin(W)}\min\{1,s_{\min}(W_{\star})\}.

By Eq. (8), WM=(MW)=WW_{\star}^{\top}M=(M^{\top}W_{\star})^{\top}=W^{\top}, so this is equivalent to

This means that the algorithm succesfully learns the hypothesis xh(Wx)\mathbf{x}\mapsto h(W^{\top}\mathbf{x}) with respect to the distribution D\mathcal{D}. Since ϵ\epsilon is arbitrarily small and W,DW,\mathcal{D} were chosen arbitrarily, the result follows.

3 Proof of Thm. 3

The proof is a combination of a few lemmas, presented below.

where ii is the imaginary unit (note that since ψ\psi is real-valued, the imaginary components eventually cancel out, but it will be more convenient for us to represent the Fourier series in this compact form). By Parseval’s identity, zaz2=1/21/2ψ2(x)dx\sum_{z}|a_{z}|^{2}=\int_{-1/2}^{1/2}\psi^{2}(x)dx, which is at most 11 (since ψ(x)[1,+1]\psi(x)\in[-1,+1]).

We now wish to compute the Fourier transform of the aboveStrictly speaking, this function does not have a Fourier transform in the sense of Eq. (1), since the associated integrals do not converge. However, the function still has a well-defined Fourier transform in the more general sense of a generalized function or distribution (see e.g. for a survey). In the derivation below, we will simply rely on some standard formulas from the Fourier analysis literature, and refer to for their formal justifications.. First, we note that the Fourier transform of exp(2πiv,)\exp(2\pi i\left\langle\mathbf{v},\cdot\right\rangle) is given by δ(v)\delta(\cdot-\mathbf{v}), where δ\delta is the Dirac delta function (a so-called generalized function which satisfies δ(x)=0\delta(\mathbf{x})=0 for all x0\mathbf{x}\neq\mathbf{0}, and δ(x)dx=1\int\delta(\mathbf{x})d\mathbf{x}=1). Based on this and the linearity of the Fourier transform, we have that

and therefore, by the convolution property of the Fourier transform, we have

For any distinct integers z1z2z_{1}\neq z_{2} and any w\mathbf{w} such that w=2r\left\|\mathbf{w}\right\|=2r, it holds that

Let Δ=z2z1r\Delta=|z_{2}-z_{1}|r, and v=(z2z1)w\mathbf{v}=(z_{2}-z_{1})\mathbf{w}, so v\mathbf{v} is a vector of norm 2Δ2\Delta. Since the inner product is invariant to shifting the coordinates, we can assume without loss of generality that z1=0z_{1}=0, and our goal is to bound φ^,φ^(v)\left\langle|\hat{\varphi}|,|\hat{\varphi}(\cdot-\mathbf{v})|\right\rangle.

Using the convention that 1Δ\mathbf{1}_{\leq\Delta} is the indicator of {x:xΔ}\{\mathbf{x}:\left\|\mathbf{x}\right\|\leq\Delta\}, and 1>Δ\mathbf{1}_{>\Delta} is the indicator for its complement, we have

where in the last step we used Cauchy-Schwartz. Using the fact that norms and inner products are invariant to coordinate shifting, the above is at most

By the triangle inequality and the assumption v=2Δ\left\|\mathbf{v}\right\|=2\Delta, the event x+vΔ\left\|\mathbf{x}+\mathbf{v}\right\|\leq\Delta implies xΔ\left\|\mathbf{x}\right\|\geq\Delta. Therefore, the above can be upper bounded by

Since φ\varphi is Fourier-concentrated, this is at most 2ϵ(Δ)φ^2=2ϵ(Δ)φ2=2ϵ(Δ)2\epsilon(\Delta)\left\|\hat{\varphi}\right\|^{2}=2\epsilon(\Delta)\left\|\varphi\right\|^{2}=2\epsilon(\Delta), where we use the isometry of the Fourier transform and the assumption that φ2=φ2(x)dx=1\left\|\varphi\right\|^{2}=\int\varphi^{2}(\mathbf{x})d\mathbf{x}=1. Plugging back the definition of Δ\Delta, the result follows. ∎

For simplicity, define ϵ(x)=ϵ(x)\epsilon^{\prime}(x)=\epsilon(x) for all x>0x>0, and ϵ(0)=0\epsilon(0)=0. Then the expression in the lemma equals

where in the last step we used the fact that the two inner square roots are the same up to a different indexing. Recalling the definition of ϵ\epsilon^{\prime} and that zaz21\sum_{z}|a_{z}|^{2}\leq 1, the above is at most

where a0a_{0} is the coefficient from Lemma 2 and cc is a universal positive constant.

The existence of such a set follows from standard concentration of measure arguments (i.e. if we pick that many vectors uniformly at random from {2rd,+2rd}d\left\{-\frac{2r}{\sqrt{d}},+\frac{2r}{\sqrt{d}}\right\}^{d}, and cc is small enough, the vectors will satisfy the above with overwhelming probability, hence such a set must exist).

where we used the fact that x+1/x2x+1/x\geq 2 for all x>0x>0. This contradicts the assumption on W\mathcal{W} (see Eq. (10)), and establishes that {Aw}wW\{A_{\mathbf{w}}\}_{\mathbf{w}\in\mathcal{W}} are indeed disjoint sets.

We now continue by analyzing Eq. (11). Letting 1Aw\mathbf{1}_{A_{\mathbf{w}}} be the indicator function to the set AwA_{\mathbf{w}}, and 1AwC\mathbf{1}_{A^{C}_{\mathbf{w}}} be the indicator of its complement, and recalling that (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}), we can upper bound Eq. (11) by

We consider each expectation separately. Starting with the first one, we have

Since we have φ^=φ=1\left\|\hat{\varphi}\right\|=\left\|\varphi\right\|=1, a02zaz21|a_{0}|^{2}\leq\sum_{z}|a_{z}|^{2}\leq 1, and ψwφ2=ψw2(x)φ2(x)dxφ2(x)dx=1\left\|\psi_{\mathbf{w}}\varphi\right\|^{2}=\int\psi^{2}_{\mathbf{w}}(\mathbf{x})\varphi^{2}(\mathbf{x})d\mathbf{x}\leq\int\varphi^{2}(\mathbf{x})d\mathbf{x}=1, the above is at most

Since AwA_{\mathbf{w}} are disjoint sets, wW1Aw(x)1\sum_{\mathbf{w}\in\mathcal{W}}\mathbf{1}_{A_{\mathbf{w}}}(\mathbf{x})\leq 1 for any x\mathbf{x}, so the above is at most

We now divide the terms in the sum above to two cases:

and by definition of AwCA_{\mathbf{w}}^{C} and the assumption z10z_{1}\neq 0, we have 1AwC(x+z1w)=1\mathbf{1}_{A_{\mathbf{w}}^{C}}(\mathbf{x}+z_{1}\mathbf{w})=1 only if xr\left\|\mathbf{x}\right\|\geq r. Therefore, as φ\varphi is Fourier-concentrated, the above is at most

Plugging these two cases back into Eq. (14), we get the upper bound

Noting that zaz21\sum_{z}|a_{z}|^{2}\leq 1, and applying Lemma 4, the above is at most

where we used the fact that ϵ2(r)ϵ(r)n=1ϵ(nr)\epsilon^{2}(r)\leq\epsilon(r)\leq\sum_{n=1}^{\infty}\epsilon(nr). Recalling this is an upper bound on the second expectation in Eq. (12), and that the first expectation is upper bounded as in Eq. (13), we get that Eq. (12) (and hence the expression in the lemma statement) is at most

With these lemmas in hand, we can now turn to prove the theorem. We have that

for any vector p\mathbf{p} which is not dependent of w\mathbf{w}^{\star} (this p\mathbf{p} will be determined later). Recalling the definition of the objective function FF, and letting g(x)=(g1(x),g2(x),)=wf(w,x)\mathbf{g}(\mathbf{x})=(g_{1}(\mathbf{x}),g_{2}(\mathbf{x}),\ldots)=\frac{\partial}{\partial\mathbf{w}}f(\mathbf{w},\mathbf{x}), the above equals

Let us now choose p\mathbf{p} so that pi=φgi,φf(w,)φgi,a0φp_{i}=\left\langle\varphi g_{i},\varphi f(\mathbf{w},\cdot)\right\rangle-\left\langle\varphi g_{i},a_{0}\varphi\right\rangle (note that this choice is indeed independent of w\mathbf{w}^{\star}). Plugging back and applying Lemma 5 (using the L2L^{2} function φgi^\widehat{\varphi g_{i}} for each ii), we get

4 Proof of Thm. 4

We will assume w.l.o.g. that the algorithm is deterministic: If it is randomized, we can simply prove the statement for any possible realization of its random coin flips.

The algorithm’s first point w1\mathbf{w}_{1} is fixed before receiving any information from the oracle, and is therefore independent of w\mathbf{w}^{\star}. By Thm. 3, we have that Varw(Fw(w1))ε\text{Var}_{\mathbf{w}^{\star}}(\nabla F_{\mathbf{w}^{\star}}(\mathbf{w}_{1}))\leq\varepsilon, which by Chebyshev’s inequality, implies that

Repeating this argument and applying a union bound, it follows that as long as the number of iterations TT satisfies TεpT\varepsilon\leq p (or equivalently Tp/εT\leq p/\varepsilon), the oracle reveals no information whatsoever on the choice of w\mathbf{w}^{\star} all point chosen by the algorithm (and hence also its output) are independent of w\mathbf{w}^{\star} as required.

This research is supported in part by an FP7 Marie Curie CIG grant, Israel Science Foundation grant 425/13, and the Intel ICRI-CI Institute.

References