On the Computational Efficiency of Training Neural Networks
Roi Livni, Shai Shalev-Shwartz, Ohad Shamir
Introduction
One of the most significant recent developments in machine learning has been the resurgence of “deep learning”, usually in the form of artificial neural networks. A combination of algorithmic advancements, as well as increasing computational power and data size, has led to a breakthrough in the effectiveness of neural networks, and they have been used to obtain very impressive practical performance on a variety of domains (a few recent examples include ).
From the perspective of statistical learning theory, by specifying a neural network architecture (i.e. the underlying graph and the activation function) we obtain a hypothesis class, namely, the set of all prediction rules obtained by using the same network architecture while changing the weights of the network. Learning the class involves finding a specific set of weights, based on training examples, which yields a predictor that has good performance on future examples. When studying a hypothesis class we are usually concerned with three questions:
Sample complexity: how many examples are required to learn the class.
Expressiveness: what type of functions can be expressed by predictors in the class.
Training time: how much computation time is required to learn the class.
For simplicity, let us first consider neural networks with a threshold activation function (i.e. if and otherwise), over the boolean input space, , and with a single output in . The sample complexity of such neural networks is well understood . It is known that the VC dimension grows linearly with the number of edges (up to log factors). It is also easy to see that no matter what the activation function is, as long as we represent each weight of the network using a constant number of bits, the VC dimension is bounded by a constant times the number of edges. This implies that empirical risk minimization - or finding weights with small average loss over the training data - can be an effective learning strategy from a statistical point of view.
As to the expressiveness of such networks, it is easy to see that neural networks of depth and sufficient size can express all functions from to . However, it is also possible to show that for this to happen, the size of the network must be exponential in (e.g. [19, Chapter 20]). Which functions can we express using a network of polynomial size? The theorem below shows that all boolean functions that can be calculated in time , can also be expressed by a network of depth and size .
The proof of the theorem follows directly from the relation between the time complexity of programs and their circuit complexity (see, e.g., ), and the fact that we can simulate the standard boolean gates using a fixed number of neurons.
We see that from the statistical perspective, neural networks form an excellent hypothesis class; On one hand, for every runtime , by using depth of we contain all predictors that can be run in time at most . On the other hand, the sample complexity of the resulting class depends polynomially on .
The main caveat of neural networks is the training time. Existing theoretical results are mostly negative, showing that successfully learning with these networks is computationally hard in the worst case. For example, neural networks of depth contain the class of intersection of halfspaces (where the number of halfspaces is the number of neurons in the hidden layer). By reduction to -coloring, it has been shown that finding the weights that best fit the training set is NP-hard (). has shown that even finding weights that result in close-to-minimal empirical error is computationally infeasible. These hardness results focus on proper learning, where the goal is to find a nearly-optimal predictor with a fixed network architecture . However, if our goal is to find a good predictor, there is no reason to limit ourselves to predictors with one particular architecture. Instead, we can try, for example, to find a network with a different architecture , which is almost as good as the best network with architecture . This is an example of the powerful concept of improper learning, which has often proved useful in circumventing computational hardness results. Unfortunately, there are hardness results showing that even with improper learning, and even if the data is generated exactly from a small, depth- neural network, there are no efficient algorithms which can find a predictor that performs well on test data. In particular, and have shown this in the case of learning intersections of halfspaces, using cryptographic and average case complexity assumptions. On a related note, recently showed positive results on learning from data generated by a neural network of a certain architecture and randomly connected weights. However, the assumptions used are strong and unlikely to hold in practice.
Despite this theoretical pessimism, in practice, modern-day neural networks are trained successfully in many learning problems. There are several tricks that enable successful training:
Changing the activation function: The threshold activation function, , has zero derivative almost everywhere. Therefore, we cannot apply gradient-based methods with this activation function. To circumvent this problem, we can consider other activation functions. Most widely known is a sigmoidal activation, e.g. , which forms a smooth approximation of the threshold function. Another recent popular activation function is the rectified linear unit (ReLU) function, . Note that subtracting a shifted ReLU from a ReLU yields an approximation of the threshold function, so by doubling the number of neurons we can approximate a network with threshold activation by a network with ReLU activation.
Over-specification: It was empirically observed that it is easier to train networks which are larger than needed. Indeed, we empirically demonstrate this phenomenon in Sec. 5.
Regularization: It was empirically observed that regularizing the weights of the network speeds up the convergence (e.g. ).
The goal of this paper is to revisit and re-raise the question of neural network’s computational efficiency, from a modern perspective. This is a challenging topic, and we do not pretend to give any definite answers. However, we provide several results, both positive and negative. Most of them are new, although a few appeared in the literature in other contexts. Our contributions are as follows:
We make a simple observation that for sufficiently over-specified networks, global optima are ubiquitous and in general computationally easy to find. Although this holds only for extremely large networks which will overfit, it can be seen as an indication that the computational hardness of learning does decrease with the amount of over-specification. This is also demonstrated empirically in Sec. 5.
Networks with quadratic activation are as expressive as networks with threshold activation.
Constant depth networks with quadratic activation can be learned in polynomial time.
The aforementioned positive results are interesting theoretically, but lead to impractical algorithms. We provide a practical, provably correct, algorithm for training depth- polynomial networks. While such networks can also be learned using a linearization trick, our algorithm is more efficient and returns networks whose size does not depend on the data dimension. Our algorithm follows a forward greedy selection procedure, where each step of the greedy selection procedure builds a new neuron by solving an eigenvalue problem.
We generalize the above algorithm to depth-, in which each forward greedy step involves an efficient approximate solution to a tensor approximation problem. The algorithm can learn a rich sub-class of depth- polynomial networks.
We describe some experimental evidence, showing that our practical algorithm is competitive with state-of-the-art neural network training methods for depth- networks.
Sufficiently Over-Specified Networks Are Easy to Train
We begin by considering the idea of over-specification, and make an observation that for sufficiently over-specified networks, the optimization problem associated with training them is generally quite easy to solve, and that global optima are in a sense ubiquitous. As an interesting contrast, note that for very small networks (such as a single neuron with a non-convex activation function), the associated optimization problem is generally hard, and can exhibit exponentially many local (non-global) minima . We emphasize that our observation only holds for extremely large networks, which will overfit in any reasonable scenario, but it does point to a possible spectrum where computational cost decreases with the amount of over-specification.
The Hardness of Learning Neural Networks
We now review several known hardness results and apply them to our learning setting. For simplicity, throughout most of this section we focus on the PAC model in the binary classification case, over the Boolean cube, in the realizable case, and with a fixed target accuracy.While we focus on the realizable case (i.e., there exists that provides perfect predictions), with a fixed accuracy () and confidence (), since we are dealing with hardness results, the results trivially apply to the agnostic case and to learning with arbitrarily small accuracy and confidence parameters.
In the context of neural networks, every network architecture defines a hypothesis class, , that contains all target functions that can be implemented using a neural network with layers, neurons (excluding input neurons), and an activation function . The immediate question is which are efficiently learnable. We will first address this question for the threshold activation function, if and otherwise.
Observing that depth- networks with the threshold activation function can implement intersections of halfspaces, we will rely on the following hardness results, due to .
and let , where for some constant . Then under a certain cryptographic assumption, is not efficiently learnable.
Under a different complexity assumption, showed a similar result even for .
As mentioned before, neural networks of depth and with the activation function can express intersections of halfspaces: For example, the first layer consists of neurons computing the halfspaces, and the second layer computes their conjunction by the mapping . Trivially, if some class is not efficiently learnable, then any class containing it is also not efficiently learnable. We thus obtain the following corollary:
For every , the class is not efficiently learnable (under the complexity assumption given in ).
What happens when we do regularize the weights? Let be all target functions that can be implemented using a neural network of depth , size , activation function , and when we restrict the input weights of each neuron to be .
One may argue that in many real world distributions, the difference between the two classes, and is small. Roughly speaking, when the distribution density is low around the decision boundary of neurons (similarly to separation with margin assumptions), then sigmoidal neurons will be able to effectively simulate threshold activated neurons.
A proof is provided in the appendix. What happens when is much smaller? Later on in the paper we will show positive results for being a constant and the depth being fixed. These results will be obtained using polynomial networks, which we study in the next section.
Polynomial Networks
Below we study the expressiveness and computational complexity of polynomial networks. We note that algorithms for efficiently learning (real-valued) sparse or low-degree polynomials has been studied in several previous works (e.g. ). However, these rely on strong distributional assumptions, such as the data instances having a uniform or log-concave distribution, while we are interested in a distribution-free setting.
We first show that, similarly to networks with threshold activation, polynomial networks of polynomial size can express all functions that can be implemented efficiently using a Turing machine.
The proof of the theorem relies on the result of and is given in the appendix.
Another relevant expressiveness result, which we will use later, shows that polynomial networks can approximate networks with sigmoidal activation functions:
The proof relies on an approximation of the sigmoid function based on Chebyshev polynomials, as was done in , and is given in the appendix.
2 Training Time
We now turn to the computational complexity of learning polynomial networks. We first show that it is hard to learn polynomial networks of depth . Indeed, by combining Thm. 4 and Corollary 2 we obtain the following:
The class , where and , is not efficiently learnable.
An interesting application of this observation is that depth- sigmoidal networks are efficiently learnable with sufficient regularization, as formalized in the result below. This contrasts with corollary 2, which provides a hardness result without regularization.
3 Learning 2-layer and 3-layer Polynomial Networks
While interesting theoretically, the above results are not very practical, since the time and sample complexity grow very fast with the depth of the network.If one uses SVM with polynomial kernels, the time and sample complexity may be small under margin assumptions in a feature space corresponding to a given kernel. Note, however, that large margin in that space is very different than the assumption we make here, namely, that there is a network with a small number of hidden neurons that works well on the data. In this section we describe practical, provably correct, algorithms for the special case of depth- and depth- polynomial networks, with some additional constraints. Although such networks can be learned in polynomial time via explicit linearization (as described in section 4.2), the runtime and resulting network size scales quadratically (for depth-2) or cubically (for depth-3) with the data dimension . In contrast, our algorithms and guarantees have a much milder dependence on .
We first consider 2 layer polynomial networks, of the following form:
We’ll describe an efficient algorithm for learning this class, which is based on the GECO algorithm for convex optimization with low-rank constraints .
The goal of the algorithm is to find that minimizes the objective
The basic idea of the algorithm is to gradually add hidden neurons to the hidden layer, in a greedy manner, so as to decrease the loss function over the data. To do so, define the set of functions that can be implemented by hidden neurons. Then every is an affine function plus a weighted sum of functions from . The algorithm starts with being the minimizer of over all affine functions. Then at each greedy step, we search for that minimizes a first order approximation of :
The following theorem, which follows directly from Theorem 1 of , provides convergence guarantee for GECO. Observe that the theorem gives guarantee for learning if we allow to output an over-specified network.
Fix some . Assume that the loss function is convex and -smooth. Then if the GECO Algorithm is run for iterations, it outputs a network for which .
We next consider a hypothesis class consisting of third degree polynomials, which is a subset of -layer polynomial networks (see Lemma in the appendix) . The hidden neurons will be functions from the class: The hypothesis class we consider is
The basic idea of the algorithm is the same as for 2-layer networks. However, while in the 2-layer case we could implement efficiently each greedy step by solving an eigenvalue problem, we now face the following tensor approximation problem at each greedy step:
While this is in general a hard optimization problem, we can approximate it – and luckily, an approximate greedy step suffices for success of the greedy procedure. This procedure is given in Figure 1, and is again based on an approximate eigenvector computation. A guarantee for the quality of approximation is given in the appendix, and this leads to the following theorem, whose proof is given in the appendix.
Fix some . Assume that the loss function is convex and -smooth. Then if the GECO Algorithm is run for iterations, where each iteration relies on the approximation procedure given in Fig. 4.3, then with probability , it outputs a network for which .
Experiments
To demonstrate the practicality of GECO to train neural networks for real world problems, we considered a pedestrian detection problem as follows. We collected 200k training examples of image patches of size 88x40 pixels containing either pedestrians (positive examples) or hard negative examples (containing images that were classified as pedestrians by applying a simple linear classifier in a sliding window manner). See a few examples of images above. We used half of the examples as a
training set and the other half as a test set. We calculated HoG features () from the imagesUsing the Matlab implementation provided in http://www.mathworks.com/matlabcentral/fileexchange/33863-histograms-of-oriented-gradients.. We then trained, using GECO, a depth- polynomial network on the resulting features. We used neurons in the hidden layer. For comparison we trained the same network architecture (i.e. hidden neurons with a squared activation function) by SGD. We also trained a similar network ( hidden neurons again) with the ReLU activation function. For the SGD implementation we tried the following tricks to speed up the convergence: heuristics for initialization of the weights, learning rate rules, mini-batches, Nesterov’s momentum (as explained in ), and dropout. The test errors of SGD as a function of the number of iterations are depicted on the top plot of the Figure on the side. We also mark the performance of GECO as a straight line (since it doesn’t involve SGD iterations). As can be seen, the error of GECO is slightly better than SGD. It should be also noted that we had to perform a very large number of SGD iterations to obtain a good solution, while the runtime of GECO was much faster. This indicates that GECO may be a valid alternative approach to SGD for training depth- networks. It is also apparent that the squared activation function is slightly better than the ReLU function for this task.
Acknowledgements: This research is supported by Intel (ICRI-CI). OS was also supported by an ISF grant (No. 425/13), and a Marie-Curie Career Integration Grant. SSS and RL were also supported by the MOS center of Knowledge for AI and ML (No. 3-9243). RL is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. We thank Itay Safran for spotting a mistake in a previous version of Sec. 2 and to James Martens for helpful discussions.
References
Appendix A Proofs
Consider as defined in Thm. 2. Note that for every there are integral and such that and we have that . Given hyperplanes consider the neurons
where is to be chosen later. Let
Since for all , if the output of all neurons is positive we have
On the other hand, if for some we have that
Again, given hyperplanes , for every consider the two neurons:
A.2 Proof of Thm. 3
and that Thus we can implement with two layers a conjunction and disjunction of neurons. By adding a fixed number of layers, we can also implement the conjunction and disjunction of any fixed number of neurons. Therefore, if is a circuit with fixed number of fan-ins, of size T, we can implement it using a polynomial network with layers and neurons, where layer simulates the calculation of all gates at depth .
Now, by , any Turing machine with runtime can be simulated by an oblivious Turing machine with -steps. An oblivious Turing machine is a machine such that the position of the machine head at time does not depend on the input of the machine (and therefore is known ahead of time). We can now easily simulate the machine by a network of depth , where the nodes at each layer contain the state of the turing machine (the content of the tape and the position at the state machine), and the transition from layer to layer depends on a constant size circuit, and hence can be implemented by a constant depth polynomial network.
A.3 Proof of Thm. 4
The idea of proof of Thm. 4 is as follows: First we show that we can express any -degree polynomial using layers and neurons. This is done in Lemma 1 part 4. As a second step, we show in Lemma 2 that a sigmoidal function can be approximated in a ball of radius by a -degree polynomial. The result follows by replacing each sigmoid activation unit with added layers that approximate the sigmoidal function on the output of the previous layer. We will first prove the two Lemmas. The proof of Thm. 4 is then given at the end of the section.
If for some then for every .
If for some and is a network with two output neurons and then .
If for some then . where and .
If then is in where
where . And .
Proof of 1]: Note that . Next we prove the statement by induction. For , the satement is trivial. Next assume that . Let
Let then . By taking the network that implements , removing the output neuron, adding an additional hidden layer that consists of and and finally adding an additional output neuron we have that .
Proof of 2: Like before, note that . Let
As before we remove from the network that implements the two nodes at the output layer, add an additional hidden layer that implements and and finally add an output neuron .
Proof of 3:Write where .
We will first show that we can construct a polynomial network that contains in layer neurons such that . It is easy to see that we can implement a neuron at layer such that . Next, using 1 we add neurons and implement in layer .
Finally, we implement using layers and additional neurons, this can be done by applying 2 where at each layer we pair the neurons at previous layer and do their product (e.g. if for every then at the next layer we implement () then at the next layer we implement () etc..)
The following holds for any and (for simplicity) : Set
There is a polynomial , such that:
According to Lemma 2, there is an analytic function such that
Note that for every we have . Thus
Recalling that and letting , we have by triangular inequality that
Finally, take . ∎
By induction, there is some such that
By Lemma 1 part 4 and Lemma 2 we can add layers and neurons and implement a new target function such that
where is taken from Lemma 2 and satisfies
Taken together we can add nodes to implement a target function . Next,
A.4 Proof of Thm. 7
That can be shown using Lemma 1 and the output’s structure.
Let us denote by the maximizers of , over all .
For each let be the vector
First, we claim that and that . Indeed for every , by the Cauchy-Schwartz inequality:
Again by Cauchy-Schwartz, equality is attained if and only if .
Letting we have that with probability at least for some we have , say .
Finally note that by definition of :