On the Computational Efficiency of Training Neural Networks

Roi Livni, Shai Shalev-Shwartz, Ohad Shamir

Introduction

One of the most significant recent developments in machine learning has been the resurgence of “deep learning”, usually in the form of artificial neural networks. A combination of algorithmic advancements, as well as increasing computational power and data size, has led to a breakthrough in the effectiveness of neural networks, and they have been used to obtain very impressive practical performance on a variety of domains (a few recent examples include ).

From the perspective of statistical learning theory, by specifying a neural network architecture (i.e. the underlying graph and the activation function) we obtain a hypothesis class, namely, the set of all prediction rules obtained by using the same network architecture while changing the weights of the network. Learning the class involves finding a specific set of weights, based on training examples, which yields a predictor that has good performance on future examples. When studying a hypothesis class we are usually concerned with three questions:

Sample complexity: how many examples are required to learn the class.

Expressiveness: what type of functions can be expressed by predictors in the class.

Training time: how much computation time is required to learn the class.

For simplicity, let us first consider neural networks with a threshold activation function (i.e. σ(z)=1\sigma(z)=1 if z>0z>0 and otherwise), over the boolean input space, {0,1}d\{0,1\}^{d}, and with a single output in {0,1}\{0,1\}. The sample complexity of such neural networks is well understood . It is known that the VC dimension grows linearly with the number of edges (up to log factors). It is also easy to see that no matter what the activation function is, as long as we represent each weight of the network using a constant number of bits, the VC dimension is bounded by a constant times the number of edges. This implies that empirical risk minimization - or finding weights with small average loss over the training data - can be an effective learning strategy from a statistical point of view.

As to the expressiveness of such networks, it is easy to see that neural networks of depth 22 and sufficient size can express all functions from {0,1}d\{0,1\}^{d} to {0,1}\{0,1\}. However, it is also possible to show that for this to happen, the size of the network must be exponential in dd (e.g. [19, Chapter 20]). Which functions can we express using a network of polynomial size? The theorem below shows that all boolean functions that can be calculated in time O(T(d))O(T(d)), can also be expressed by a network of depth O(T(d))O(T(d)) and size O(T(d)2)O(T(d)^{2}).

The proof of the theorem follows directly from the relation between the time complexity of programs and their circuit complexity (see, e.g., ), and the fact that we can simulate the standard boolean gates using a fixed number of neurons.

We see that from the statistical perspective, neural networks form an excellent hypothesis class; On one hand, for every runtime T(d)T(d), by using depth of O(T(d))O(T(d)) we contain all predictors that can be run in time at most T(d)T(d). On the other hand, the sample complexity of the resulting class depends polynomially on T(d)T(d).

The main caveat of neural networks is the training time. Existing theoretical results are mostly negative, showing that successfully learning with these networks is computationally hard in the worst case. For example, neural networks of depth 22 contain the class of intersection of halfspaces (where the number of halfspaces is the number of neurons in the hidden layer). By reduction to kk-coloring, it has been shown that finding the weights that best fit the training set is NP-hard (). has shown that even finding weights that result in close-to-minimal empirical error is computationally infeasible. These hardness results focus on proper learning, where the goal is to find a nearly-optimal predictor with a fixed network architecture AA. However, if our goal is to find a good predictor, there is no reason to limit ourselves to predictors with one particular architecture. Instead, we can try, for example, to find a network with a different architecture AA^{\prime}, which is almost as good as the best network with architecture AA. This is an example of the powerful concept of improper learning, which has often proved useful in circumventing computational hardness results. Unfortunately, there are hardness results showing that even with improper learning, and even if the data is generated exactly from a small, depth-22 neural network, there are no efficient algorithms which can find a predictor that performs well on test data. In particular, and have shown this in the case of learning intersections of halfspaces, using cryptographic and average case complexity assumptions. On a related note, recently showed positive results on learning from data generated by a neural network of a certain architecture and randomly connected weights. However, the assumptions used are strong and unlikely to hold in practice.

Despite this theoretical pessimism, in practice, modern-day neural networks are trained successfully in many learning problems. There are several tricks that enable successful training:

Changing the activation function: The threshold activation function, σ(a)=1a>0\sigma(a)=\mathbf{1}_{a>0}, has zero derivative almost everywhere. Therefore, we cannot apply gradient-based methods with this activation function. To circumvent this problem, we can consider other activation functions. Most widely known is a sigmoidal activation, e.g. σ(a)=11+ea\sigma(a)=\frac{1}{1+e^{a}}, which forms a smooth approximation of the threshold function. Another recent popular activation function is the rectified linear unit (ReLU) function, σ(a)=max{0,a}\sigma(a)=\max\{0,a\}. Note that subtracting a shifted ReLU from a ReLU yields an approximation of the threshold function, so by doubling the number of neurons we can approximate a network with threshold activation by a network with ReLU activation.

Over-specification: It was empirically observed that it is easier to train networks which are larger than needed. Indeed, we empirically demonstrate this phenomenon in Sec. 5.

Regularization: It was empirically observed that regularizing the weights of the network speeds up the convergence (e.g. ).

The goal of this paper is to revisit and re-raise the question of neural network’s computational efficiency, from a modern perspective. This is a challenging topic, and we do not pretend to give any definite answers. However, we provide several results, both positive and negative. Most of them are new, although a few appeared in the literature in other contexts. Our contributions are as follows:

We make a simple observation that for sufficiently over-specified networks, global optima are ubiquitous and in general computationally easy to find. Although this holds only for extremely large networks which will overfit, it can be seen as an indication that the computational hardness of learning does decrease with the amount of over-specification. This is also demonstrated empirically in Sec. 5.

Networks with quadratic activation are as expressive as networks with threshold activation.

Constant depth networks with quadratic activation can be learned in polynomial time.

The aforementioned positive results are interesting theoretically, but lead to impractical algorithms. We provide a practical, provably correct, algorithm for training depth-22 polynomial networks. While such networks can also be learned using a linearization trick, our algorithm is more efficient and returns networks whose size does not depend on the data dimension. Our algorithm follows a forward greedy selection procedure, where each step of the greedy selection procedure builds a new neuron by solving an eigenvalue problem.

We generalize the above algorithm to depth-33, in which each forward greedy step involves an efficient approximate solution to a tensor approximation problem. The algorithm can learn a rich sub-class of depth-33 polynomial networks.

We describe some experimental evidence, showing that our practical algorithm is competitive with state-of-the-art neural network training methods for depth-22 networks.

Sufficiently Over-Specified Networks Are Easy to Train

We begin by considering the idea of over-specification, and make an observation that for sufficiently over-specified networks, the optimization problem associated with training them is generally quite easy to solve, and that global optima are in a sense ubiquitous. As an interesting contrast, note that for very small networks (such as a single neuron with a non-convex activation function), the associated optimization problem is generally hard, and can exhibit exponentially many local (non-global) minima . We emphasize that our observation only holds for extremely large networks, which will overfit in any reasonable scenario, but it does point to a possible spectrum where computational cost decreases with the amount of over-specification.

The Hardness of Learning Neural Networks

We now review several known hardness results and apply them to our learning setting. For simplicity, throughout most of this section we focus on the PAC model in the binary classification case, over the Boolean cube, in the realizable case, and with a fixed target accuracy.While we focus on the realizable case (i.e., there exists fHf^{*}\in H that provides perfect predictions), with a fixed accuracy (ϵ\epsilon) and confidence (δ\delta), since we are dealing with hardness results, the results trivially apply to the agnostic case and to learning with arbitrarily small accuracy and confidence parameters.

In the context of neural networks, every network architecture defines a hypothesis class, Nt,n,σ\mathcal{N}_{t,n,\sigma}, that contains all target functions ff that can be implemented using a neural network with tt layers, nn neurons (excluding input neurons), and an activation function σ\sigma. The immediate question is which Nt,n,σ\mathcal{N}_{t,n,\sigma} are efficiently learnable. We will first address this question for the threshold activation function, σ0,1(z)=1\sigma_{0,1}(z)=1 if z>0z>0 and otherwise.

Observing that depth-22 networks with the threshold activation function can implement intersections of halfspaces, we will rely on the following hardness results, due to .

and let Hka={xh1(x)h2(x)hk(x):i,hiHa}H^{a}_{k}=\{\mathbf{x}\to h_{1}(\mathbf{x})\wedge h_{2}(\mathbf{x})\wedge\ldots\wedge h_{k}(\mathbf{x}):\forall i,h_{i}\in H^{a}\}, where k=dρk=d^{\rho} for some constant ρ>0\rho>0. Then under a certain cryptographic assumption, HkaH^{a}_{k} is not efficiently learnable.

Under a different complexity assumption, showed a similar result even for k=ω(1)k=\omega(1).

As mentioned before, neural networks of depth 2\geq 2 and with the σ0,1\sigma_{0,1} activation function can express intersections of halfspaces: For example, the first layer consists of kk neurons computing the kk halfspaces, and the second layer computes their conjunction by the mapping xσ0,1(ixik+1/2)\mathbf{x}\mapsto\sigma_{0,1}\left(\sum_{i}x_{i}-k+1/2\right). Trivially, if some class HH is not efficiently learnable, then any class containing it is also not efficiently learnable. We thus obtain the following corollary:

For every t2,n=ω(1)t\geq 2,n=\omega(1), the class Nt,n,σ0,1\mathcal{N}_{t,n,\sigma_{0,1}} is not efficiently learnable (under the complexity assumption given in ).

What happens when we do regularize the weights? Let Nt,n,σ,L\mathcal{N}_{t,n,\sigma,L} be all target functions that can be implemented using a neural network of depth tt, size nn, activation function σ\sigma, and when we restrict the input weights of each neuron to be w1+bL\|\mathbf{w}\|_{1}+|b|\leq L.

One may argue that in many real world distributions, the difference between the two classes, Nt,n,σ,L\mathcal{N}_{t,n,\sigma,L} and Nt,n,σ0,1\mathcal{N}_{t,n,\sigma_{0,1}} is small. Roughly speaking, when the distribution density is low around the decision boundary of neurons (similarly to separation with margin assumptions), then sigmoidal neurons will be able to effectively simulate threshold activated neurons.

A proof is provided in the appendix. What happens when LL is much smaller? Later on in the paper we will show positive results for LL being a constant and the depth being fixed. These results will be obtained using polynomial networks, which we study in the next section.

Polynomial Networks

Below we study the expressiveness and computational complexity of polynomial networks. We note that algorithms for efficiently learning (real-valued) sparse or low-degree polynomials has been studied in several previous works (e.g. ). However, these rely on strong distributional assumptions, such as the data instances having a uniform or log-concave distribution, while we are interested in a distribution-free setting.

We first show that, similarly to networks with threshold activation, polynomial networks of polynomial size can express all functions that can be implemented efficiently using a Turing machine.

The proof of the theorem relies on the result of and is given in the appendix.

Another relevant expressiveness result, which we will use later, shows that polynomial networks can approximate networks with sigmoidal activation functions:

The proof relies on an approximation of the sigmoid function based on Chebyshev polynomials, as was done in , and is given in the appendix.

2 Training Time

We now turn to the computational complexity of learning polynomial networks. We first show that it is hard to learn polynomial networks of depth Ω(log(d))\Omega(\log(d)). Indeed, by combining Thm. 4 and Corollary 2 we obtain the following:

The class Nt,n,σ2\mathcal{N}_{t,n,\sigma_{2}}, where t=Ω(log(d))t=\Omega(\log(d)) and n=Ω(d)n=\Omega(d), is not efficiently learnable.

An interesting application of this observation is that depth-22 sigmoidal networks are efficiently learnable with sufficient regularization, as formalized in the result below. This contrasts with corollary 2, which provides a hardness result without regularization.

3 Learning 2-layer and 3-layer Polynomial Networks

While interesting theoretically, the above results are not very practical, since the time and sample complexity grow very fast with the depth of the network.If one uses SVM with polynomial kernels, the time and sample complexity may be small under margin assumptions in a feature space corresponding to a given kernel. Note, however, that large margin in that space is very different than the assumption we make here, namely, that there is a network with a small number of hidden neurons that works well on the data. In this section we describe practical, provably correct, algorithms for the special case of depth-22 and depth-33 polynomial networks, with some additional constraints. Although such networks can be learned in polynomial time via explicit linearization (as described in section 4.2), the runtime and resulting network size scales quadratically (for depth-2) or cubically (for depth-3) with the data dimension dd. In contrast, our algorithms and guarantees have a much milder dependence on dd.

We first consider 2 layer polynomial networks, of the following form:

We’ll describe an efficient algorithm for learning this class, which is based on the GECO algorithm for convex optimization with low-rank constraints .

The goal of the algorithm is to find ff that minimizes the objective

The basic idea of the algorithm is to gradually add hidden neurons to the hidden layer, in a greedy manner, so as to decrease the loss function over the data. To do so, define V={x(wx)2:w2=1}\mathcal{V}=\{\mathbf{x}\mapsto(\mathbf{w}^{\top}\mathbf{x})^{2}:\|\mathbf{w}\|_{2}=1\} the set of functions that can be implemented by hidden neurons. Then every fP2,rf\in\mathcal{P}_{2,r} is an affine function plus a weighted sum of functions from V\mathcal{V}. The algorithm starts with ff being the minimizer of RR over all affine functions. Then at each greedy step, we search for gVg\in\mathcal{V} that minimizes a first order approximation of R(f+ηg)R(f+\eta g):

The following theorem, which follows directly from Theorem 1 of , provides convergence guarantee for GECO. Observe that the theorem gives guarantee for learning P2,k\mathcal{P}_{2,k} if we allow to output an over-specified network.

Fix some ϵ>0\epsilon>0. Assume that the loss function is convex and β\beta-smooth. Then if the GECO Algorithm is run for r>2βk2ϵr>\frac{2\beta k^{2}}{\epsilon} iterations, it outputs a network fN2,r,σ2f\in\mathcal{N}_{2,r,\sigma_{2}} for which R(f)minfP2,kR(f)+ϵR(f)\leq\min_{f^{*}\in\mathcal{P}_{2,k}}R(f^{*})+\epsilon.

We next consider a hypothesis class consisting of third degree polynomials, which is a subset of 33-layer polynomial networks (see Lemma 11 in the appendix) . The hidden neurons will be functions from the class: V=i=13Vi   where   Vi={xj=1i(wjx):j, wj2=1} .\mathcal{V}=\cup_{i=1}^{3}\mathcal{V}_{i}~{}~{}~{}\textrm{where}~{}~{}~{}\mathcal{V}_{i}=\left\{\mathbf{x}\mapsto\prod_{j=1}^{i}(\mathbf{w}_{j}^{\top}\mathbf{x}):\forall j,~{}\|\mathbf{w}_{j}\|_{2}=1\right\}~{}. The hypothesis class we consider is P3,k = {xi=1kαigi(x):i, αi1,giV}.\mathcal{P}_{3,k}~{}=~{}\left\{\mathbf{x}\mapsto\sum_{i=1}^{k}\alpha_{i}g_{i}(\mathbf{x}):\forall i,~{}|\alpha_{i}|\leq 1,g_{i}\in\mathcal{V}\right\}.

The basic idea of the algorithm is the same as for 2-layer networks. However, while in the 2-layer case we could implement efficiently each greedy step by solving an eigenvalue problem, we now face the following tensor approximation problem at each greedy step:

While this is in general a hard optimization problem, we can approximate it – and luckily, an approximate greedy step suffices for success of the greedy procedure. This procedure is given in Figure 1, and is again based on an approximate eigenvector computation. A guarantee for the quality of approximation is given in the appendix, and this leads to the following theorem, whose proof is given in the appendix.

Fix some δ,ϵ>0\delta,\epsilon>0. Assume that the loss function is convex and β\beta-smooth. Then if the GECO Algorithm is run for r>4dβk2ϵ(1τ)2r>\frac{4d\beta k^{2}}{\epsilon(1-\tau)^{2}} iterations, where each iteration relies on the approximation procedure given in Fig. 4.3, then with probability (1δ)r(1-\delta)^{r}, it outputs a network fN3,5r,σ2f\in\mathcal{N}_{3,5r,\sigma_{2}} for which R(f)minfP3,kR(f)+ϵR(f)\leq\min_{f^{*}\in\mathcal{P}_{3,k}}R(f^{*})+\epsilon.

Experiments

To demonstrate the practicality of GECO to train neural networks for real world problems, we considered a pedestrian detection problem as follows. We collected 200k training examples of image patches of size 88x40 pixels containing either pedestrians (positive examples) or hard negative examples (containing images that were classified as pedestrians by applying a simple linear classifier in a sliding window manner). See a few examples of images above. We used half of the examples as a

training set and the other half as a test set. We calculated HoG features () from the imagesUsing the Matlab implementation provided in http://www.mathworks.com/matlabcentral/fileexchange/33863-histograms-of-oriented-gradients.. We then trained, using GECO, a depth-22 polynomial network on the resulting features. We used 4040 neurons in the hidden layer. For comparison we trained the same network architecture (i.e. 4040 hidden neurons with a squared activation function) by SGD. We also trained a similar network (4040 hidden neurons again) with the ReLU activation function. For the SGD implementation we tried the following tricks to speed up the convergence: heuristics for initialization of the weights, learning rate rules, mini-batches, Nesterov’s momentum (as explained in ), and dropout. The test errors of SGD as a function of the number of iterations are depicted on the top plot of the Figure on the side. We also mark the performance of GECO as a straight line (since it doesn’t involve SGD iterations). As can be seen, the error of GECO is slightly better than SGD. It should be also noted that we had to perform a very large number of SGD iterations to obtain a good solution, while the runtime of GECO was much faster. This indicates that GECO may be a valid alternative approach to SGD for training depth-22 networks. It is also apparent that the squared activation function is slightly better than the ReLU function for this task.

Acknowledgements: This research is supported by Intel (ICRI-CI). OS was also supported by an ISF grant (No. 425/13), and a Marie-Curie Career Integration Grant. SSS and RL were also supported by the MOS center of Knowledge for AI and ML (No. 3-9243). RL is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. We thank Itay Safran for spotting a mistake in a previous version of Sec. 2 and to James Martens for helpful discussions.

References

Appendix A Proofs

Consider HaH^{a} as defined in Thm. 2. Note that for every hHah\in H^{a} there are integral w\mathbf{w} and bb such that h(x)=wxb12h(\mathbf{x})=\mathbf{w}^{\top}\mathbf{x}-b-\frac{1}{2} and we have that h(x)1/2|h(\mathbf{x})|\geq 1/2. Given kk hyperplanes {hi}i=1k\{h_{i}\}_{i=1}^{k} consider the neurons

where Cω(1)C\in\omega(1) is to be chosen later. Let

Since hi(x)12|h_{i}(\mathbf{x})|\geq\frac{1}{2} for all ii, if the output of all neurons {gi}\{g_{i}\} is positive we have

On the other hand, if hi(x)<12h_{i}(\mathbf{x})<-\frac{1}{2} for some ii we have that

Again, given kk hyperplanes {hi}i=1k\{h_{i}\}_{i=1}^{k}, for every kk consider the two neurons:

A.2 Proof of Thm. 3

and that x1x2=14((x2+x1)2(x2x1)2).\mathbf{x}_{1}\cdot\mathbf{x}_{2}=\frac{1}{4}\left((\mathbf{x}_{2}+\mathbf{x}_{1})^{2}-(\mathbf{x}_{2}-\mathbf{x}_{1})^{2}\right). Thus we can implement with two layers a conjunction and disjunction of 22 neurons. By adding a fixed number of layers, we can also implement the conjunction and disjunction of any fixed number of neurons. Therefore, if BB is a circuit with fixed number of fan-ins, of size T, we can implement it using a polynomial network with O(T)O(T) layers and O(T2)O(T^{2}) neurons, where layer tt simulates the calculation of all gates at depth tt.

Now, by , any Turing machine with runtime TT can be simulated by an oblivious Turing machine with O(TlogT)O(T\log T)-steps. An oblivious Turing machine is a machine such that the position of the machine head at time tt does not depend on the input of the machine (and therefore is known ahead of time). We can now easily simulate the machine by a network of depth O(TlogT)O(T\log T), where the nodes at each layer contain the state of the turing machine (the content of the tape and the position at the state machine), and the transition from layer to layer depends on a constant size circuit, and hence can be implemented by a constant depth polynomial network.

A.3 Proof of Thm. 4

The idea of proof of Thm. 4 is as follows: First we show that we can express any TT-degree polynomial using O(logT)O(\log T) layers and O(T)O(T) neurons. This is done in Lemma 1 part 4. As a second step, we show in Lemma 2 that a sigmoidal function can be approximated in a ball of radius LL by a O(logLϵ)O(\log\frac{L}{\epsilon})-degree polynomial. The result follows by replacing each sigmoid activation unit with added layers that approximate the sigmoidal function on the output of the previous layer. We will first prove the two Lemmas. The proof of Thm. 4 is then given at the end of the section.

If gNt,n,σ2,Lg\in\mathcal{N}_{t,n,\sigma_{2},L} for some L2L\geq 2 then gNt,n+2(tt),σ2,Lg\in\mathcal{N}_{t^{\prime},n+2(t^{\prime}-t),\sigma_{2},L} for every ttt^{\prime}\geq t.

If GNt,n,σ2,LG\in\mathcal{N}_{t,n,\sigma_{2},L} for some L2L\geq 2 and GG is a network with two output neurons g1g_{1} and g2g_{2} then g1g2Nt+1,n+1,σ2g_{1}\cdot g_{2}\in\mathcal{N}_{t+1,n+1,\sigma_{2}}.

If gNt,n,σ2,Lg\in\mathcal{N}_{t,n,\sigma_{2},L} for some L2L\geq 2 then (g)TNt,n,σ2,L(g)^{T}\in\mathcal{N}_{t^{\prime},n^{\prime},\sigma_{2},L}. where t=t+logT+loglogTt^{\prime}=t+\log T+\log\log T and n=n+2logT+logT(loglogT)n^{\prime}=n+2\log T+\log T(\log\log T).

If gNt,n,σ2,Lg\in\mathcal{N}_{t,n,\sigma_{2},L} then i=1Tai(g(x))k\sum_{i=1}^{T}a_{i}(g(\mathbf{x}))^{k} is in Nt,n,σ2,L\mathcal{N}_{t^{\prime},n^{\prime},\sigma_{2},L^{\prime}} where

where a0={k:ak0}\|a\|_{0}=|\{k:a_{k}\neq 0\}|. And L=max{a1,L,2}L^{\prime}=\max\{\|a\|_{1},L,2\}.

Proof of 1]: Note that 14((x+1)2(x1)2)=x\frac{1}{4}((x+1)^{2}-(x-1)^{2})=x. Next we prove the statement by induction. For t=tt^{\prime}=t, the satement is trivial. Next assume that gNt,n+2(tt),σ2,Lg\in\mathcal{N}_{t,n+2(t^{\prime}-t),\sigma_{2},L}. Let

Let h(x)=h1(x)h2(x)h(\mathbf{x})=h_{1}(\mathbf{x})-h_{2}(\mathbf{x}) then h(x)=g(x)h(\mathbf{x})=g(\mathbf{x}). By taking the network that implements gg, removing the output neuron, adding an additional hidden layer that consists of h1h_{1} and h2h_{2} and finally adding an additional output neuron we have that h(x)Nt+1,n+2(tt)+2,σ2,Lh(\mathbf{x})\in\mathcal{N}_{t^{\prime}+1,n+2(t^{\prime}-t)+2,\sigma_{2},L}.

Proof of 2: Like before, note that x1x2=14(x1+x2)214(x1x2)2x_{1}\cdot x_{2}=\frac{1}{4}(x_{1}+x_{2})^{2}-\frac{1}{4}(x_{1}-x_{2})^{2}. Let

As before we remove from the network that implements GG the two nodes at the output layer, add an additional hidden layer that implements h1h_{1} and h2h_{2} and finally add an output neuron h(x)=h1(x)h2(x)h(\mathbf{x})=h_{1}(\mathbf{x})-h_{2}(\mathbf{x}).

Proof of 3:Write T=i=1logTϵi2iT=\sum_{i=1}^{\log T}\epsilon_{i}2^{i} where ϵi={0,1}\epsilon_{i}=\{0,1\}.

We will first show that we can construct a polynomial network that contains in layer t+logTt+\log T neurons h1,,hlogTh_{1},\ldots,h_{\log T} such that hk(x)=(g(x))2kh_{k}(\mathbf{x})=(g(\mathbf{x}))^{2^{k}}. It is easy to see that we can implement a neuron h(x)kh^{\prime}(\mathbf{x})_{k} at layer t+kt+k such that hk(x)=(g(x))2kh_{k}^{\prime}(\mathbf{x})=(g(\mathbf{x}))^{2^{k}}. Next, using 1 we add 2(logTk)2(\log T-k) neurons and implement hkh_{k}^{\prime} in layer t+logTt+\log T.

Finally, we implement {i:ϵi0}hi(x)\prod_{\{i:\epsilon_{i}\neq 0\}}h_{i}(\mathbf{x}) using loglogT\log\log T layers and logTloglogT\log T\log\log T additional neurons, this can be done by applying 2 where at each layer we pair the neurons at previous layer and do their product (e.g. if for every ii ϵi0\epsilon_{i}\neq 0 then at the next layer we implement (h1h2,h3h4,ht1hth_{1}\cdot h_{2},h_{3}\cdot h_{4},\ldots h_{t-1}h_{t}) then at the next layer we implement (h1h2h3h4,,ht4hth_{1}\cdot h_{2}\cdot h_{3}\cdot h_{4},\ldots,h_{t-4}\cdots h_{t}) etc..)

The following holds for any ϵ0\epsilon\geq 0 and (for simplicity) L3L\geq 3: Set

There is a polynomial p(x)=j=1Tajxjp(x)=\sum_{j=1}^{T}a_{j}x^{j}, such that:

According to Lemma 2, there is an analytic function qq such that

Note that for every jj we have βj2tj2|\beta_{j}|\leq 2^{\frac{t^{\prime}-j}{2}}. Thus

Recalling that T=t+2log8ϵT=t^{\prime}+2\log\frac{8}{\epsilon} and letting p0(x)=j=0Tβjxjp_{0}(x)=\sum_{j=0}^{T}\beta_{j}x^{j}, we have by triangular inequality that

Finally, take p(x)=p0(x4L)p(x)=p_{0}(\frac{x}{4L}). ∎

By induction, there is some P(t1)N(t1)Bt,(ns)Bn,σ2P^{(t-1)}\in\mathcal{N}_{(t-1)B_{t},(n-s)B_{n},\sigma_{2}} such that

By Lemma 1 part 4 and Lemma 2 we can add BtB_{t} layers and BnB_{n} neurons and implement a new target function PiP_{i} such that

where pp is taken from Lemma 2 and satisfies

Taken together we can add sBnsB_{n} nodes to implement a target function P=P1,,PsP=P_{1},\ldots,P_{s}. Next,

A.4 Proof of Thm. 7

That fN3,5r,σ2f\in\mathcal{N}_{3,5r,\sigma_{2}} can be shown using Lemma 1 and the output’s structure.

Let us denote by w,u,v\mathbf{w}^{*},\mathbf{u}^{*},\mathbf{v}^{*} the maximizers of F(w,u,v)F(\mathbf{w},\mathbf{u},\mathbf{v}), over all w,u,v=1\|\mathbf{w}\|,\|\mathbf{u}\|,\|\mathbf{v}\|=1.

For each u,v\mathbf{u},\mathbf{v} let f(u,v)f(\mathbf{u},\mathbf{v}) be the vector

First, we claim that f(u,v)wf(\mathbf{u}^{*},\mathbf{v}^{*})\propto\mathbf{w}^{*} and that F(w,u,v)=f(u,v)F(\mathbf{w}^{*},\mathbf{u}^{*},\mathbf{v}^{*})=\|f(\mathbf{u}^{*},\mathbf{v}^{*})\|. Indeed for every w1\|\mathbf{w}\|\leq 1, by the Cauchy-Schwartz inequality:

Again by Cauchy-Schwartz, equality is attained if and only if wf(u,v)\mathbf{w}\propto f(\mathbf{u}^{*},\mathbf{v}^{*}).

Letting slog1δlog(112d)2dlog1δs\geq-\frac{\log\frac{1}{\delta}}{\log(1-\frac{1}{2d})}\approx 2d\log\frac{1}{\delta} we have that with probability at least (1δ)(1-\delta) for some wi\mathbf{w}_{i} we have wiw12d|\mathbf{w}_{i}^{\top}\mathbf{w}^{*}|\geq\frac{1}{\sqrt{2d}}, say w1\mathbf{w}_{1}.

Finally note that by definition of ui,vi\mathbf{u}_{i},\mathbf{v}_{i}: