Training Neural Networks Without Gradients: A Scalable ADMM Approach

Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, Tom Goldstein

Introduction

As hardware and algorithms advance, neural network performance is constantly improving for many machine learning tasks. This is particularly true in applications where extremely large datasets are available to train models with many parameters. Because big datasets provide results that (often dramatically) outperform the prior state-of-the-art in many machine learning tasks, researchers are willing to purchase specialized hardware such as GPUs, and commit large amounts of time to training models and tuning hyper-parameters.

Gradient-based training methods have several properties that contribute to this need for specialized hardware. First, while large amounts of data can be shared amongst many cores, existing optimization methods suffer when parallelized. Second, training neural nets requires optimizing highly non-convex objectives that exhibit saddle points, poor conditioning, and vanishing gradients, all of which slow down gradient-based methods such as stochastic gradient descent, conjugate gradients, and BFGS. Several mitigating approaches to avoiding this issue have been introduced, including rectified linear units (ReLU) (Nair & Hinton, 2010), Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997), RPROP (Riedmiller & Braun, 1993), and others, but the fundamental problem remains.

In this paper, we introduce a new method for training the parameters of neural nets using the Alternating Direction Method of Multipliers (ADMM) and Bregman iteration. This approach addresses several problems facing classical gradient methods; the proposed method exhibits linear scaling when data is parallelized across cores, and is robust to gradient saturation and poor conditioning. The method decomposes network training into a sequence of sub-steps that are each solved to global optimality. The scalability of the proposed method, combined with the ability to avoid local minima by globally solving each substep, can lead to dramatic speedups.

We begin in Section 2 by describing the mathematical notation and context, and providing a discussion of several weaknesses of gradient-based methods that we hope to address. Sections 3 and 4 introduce and describe the optimization approach, and Sections 5 and 6 describe in detail the distributed implementation. Section 7 provides an experimental comparison of the new approach with standard implementations of several gradient-based methods on two problems of differing size and difficulty. Finally, Section 8 contains a closing discussion of the paper’s contributions and the future work needed.

Background and notation

Though there are many variations, a typical neural network consists of $L$ layers, each of which is defined by a linear operator $W_{l},$ and a non-linear neural activation function $h_{l}.$ Given a (column) vector of input activations, $a_{l-1},$ a single layer computes and outputs the non-linear function $a_{l}=h_{l}(W_{l}a_{l-1}).$ A network is formed by layering these units together in a nested fashion to compute a composite function; in the 3-layer case, for example, this would be

where $W=\{W_{l}\}$ denotes the ensemble of weight matrices, and $a_{0}$ contains input activations for every training sample (one sample per column). The function $h_{3}$ is absent as it is common for the last layer to not have an activation function.

Note that in our notation, we have included all input activations for all training data into the matrix/tensor $a_{0}.$ This notation benefits our discussion of the proposed algorithm, which operates on all training data simultaneously as a batch.

Also, in our formulation the tensor $W$ contains linear operators, but not necessarily dense matrices. These linear operators can be convolutions with an ensemble of filters, in which case (1) represents a convolutional net.

Finally, the formulation used here assumes a feed-forward architecture. However, our proposed methods can handle more complex network topologies (such as recurrent networks) with little modification.

Most networks are trained using stochastic gradient descent (SGD, i.e. backpropagation) in which the gradient of the network loss function is approximated using a small number of training samples, and then a descent step is taken using this approximate gradient. Stochastic gradient methods work extremely well in the serial setting, but lack scalability. Recent attempts to scale SGD include Downpour, which runs SGD simultaneously on multiple cores. This model averages parameters across cores using multiple communication nodes that store copies of the model. A conceptually similar approach is elastic averaging (Zhang et al., 2015), in which different processors simultaneously run SGD using a quadratic penalty term that prevents different processes from drifting too far from the central average. These methods have found success with modest numbers of processors, but fail to maintain strong scaling for large numbers of cores. For example, for several experiments reported in (Dean et al., 2012), the Downpour distributed SGD method runs slower with 1500 cores than with 500 cores.

The scalability of SGD is limited because it relies on a large number of inexpensive minimization steps that each use a small amount of data. Forming a noisy gradient from a small mini-batch requires very little computation. The low cost of this step is an asset in the serial setting where it enables the algorithm to move quickly, but disadvantageous in the parallel setting where each step is too inexpensive to be split over multiple processors. For this reason, SGD is ideally suited for computation on GPUs, where multiple cores can simultaneously work on a small batch of data using a shared memory space with virtually no communication overhead.

When parallelizing over CPUs, it is preferable to have methods that use a small number of expensive minimization steps, preferably involving a large number of data. The work required on each minimization step can then be split across many worker nodes, and the latency of communication is amortized over a large amount of computation. This approach has been suggested by numerous authors who propose batch computation methods (Ngiam et al., 2011), which compute exact gradients on each iteration using the entire dataset, including conjugate gradients (Towsey et al., 1995; Møller, 1993), BFGS, and Hessian-free (Martens & Sutskever, 2011; Sainath et al., 2013) methods.

Unfortunately, all gradient-based approaches, whether batched or stochastic, also suffer from several other critical drawbacks. First, gradient-based methods suffer from the vanishing gradients. During backpropagation, the derivative of shallow layers in a network are formed using products of weight matrices and derivatives of non-linearities from downstream layers. When the eigenvalues of the weight matrices are small and the derivatives of non-linearities are nearly zero (as they often are for sigmoid and ReLU non-linearities), multiplication by these terms annihilates information. The resulting gradients in shallow layers contain little information about the error (Bengio et al., 1994; Riedmiller & Braun, 1993; Hochreiter & Schmidhuber, 1997).

Second, backprop has the potential to get stuck at local minima and saddle points. While recent results suggest that local minimizers of SGD are close to global minima (Choromanska et al., 2014), in practice SGD often lingers near saddle points where gradients are small (Dauphin et al., 2014).

Finally, backprop does not easily parallelize over layers, a significant bottleneck when considering deep architectures. However, recent work on SGD has successfully used model parallelism by using multiple replicas of the entire network (Dean et al., 2012).

We propose a solution that helps alleviate these problems by separating the objective function at each layer of a neural network into two terms: one term measuring the relation between the weights and the input activations, and the other term containing the nonlinear activation function. We then apply an alternating direction method that addresses each term separately. The first term allows the weights to be updated without the effects of vanishing gradients. In the second step, we have a non-convex minimization problem that can be solved globally in closed-form. Also, the form of the objective allows the weights of every layer to be updated independently, enabling parallelization over layers.

This approach does not require any gradient steps at all. Rather, the problem of training network parameters is reduced to a series of minimization sub-problems using the alternating direction methods of multipliers. These minimization sub-problems are solved globally in closed form.

2 Related work

Other works have applied least-squares based methods to neural networks. One notable example is the method of auxiliary coordinates (MAC) (Carreira-Perpinán & Wang, 2012) which uses quadratic penalties to approximately enforce equality constraints. Unlike our method, MAC requires iterative solvers for sub-problems, whereas the method proposed here is designed so that all sub-problems have closed form solutions. Also unlike MAC, the method proposed here uses Lagrange multipliers to exactly enforce equality constraints, which we have found to be necessary for training deeper networks.

Another related approach is the expectation-maximization (EM) algorithm of (Patel et al., 2015), which is derived from the Deep Rendering Model (DRM), a hierarchical generative model for natural images. They show that feedforward propagation in a deep convolutional net corresponds to inference on their proposed DRM. They derive a new EM learning algorithm for their proposed DRM that employs least-squares parameter updates that are conceptually similar to (but different from) the Parallel Weight Update proposed here (see Section 5). However, there is currently no implementation nor any training results to compare against.

Note that our work is the first to consider alternating least squares as a method to distribute computation across a cluster, although the authors of (Carreira-Perpinán & Wang, 2012) do consider implementations that are “distributed” in the sense of using multiple threads on a single machine via the Matlab matrix toolbox.

Alternating minimization for neural networks

The idea behind our method is to decouple the weights from the nonlinear link functions using a splitting technique. Rather than feeding the output of the linear operator $W_{l}$ directly into the activation function $h_{l},$ we store the output of layer $l$ in a new variable $z_{l}=W_{l}a_{l-1}.$ We also represent the output of the link function as a vector of activations $a_{l}=h_{l}(z_{l}).$ We then wish to solve the following problem

where $\{\gamma_{l}\}$ and $\{\beta_{l}\}$ are constants that control the weight of each constraint. The formulation (3) only approximately enforces the constraints in (3). To obtain exact enforcement of the constraints, we add a Lagrange multiplier term to (3), which yields

where $\lambda$ is a vector of Lagrange multipliers with the same dimensions as $z_{L}.$ Note that in a classical ADMM formulation, a Lagrange multiplier would be added for each constraint in (3). The formulation above corresponds more closely to Bregman iteration, which only requires a Lagrange correction to be added to the objective term (and not the constraint terms), rather than classical ADMM. We have found the Bregman formulation to be far more stable than a full scale ADMM formulation. This issue will be discussed in detail in Section 4.

The split formulation (3) is carefully designed to be easily minimized using an alternating direction method in which each sub-step has a simple closed-form solution. The alternating direction scheme proceeds by updating one set of variables at a time – either $\{W_{l}\},$ $\{a_{l}\},$ or $\{z_{l}\}$ – while holding the others constant. The simplicity of the proposed scheme comes from the following observation: The minimization of (3) with respect to both $\{W_{l}\}$ and $\{a_{l-1}\}$ is a simple linear least-squares problem. Only the minimization of (3) with respect $\{z_{l}\}$ is nonlinear. However, there is no coupling between the entries of $\{z_{l}\},$ and so the problem of minimizing for $\{z_{l}\}$ decomposes into solving a large number of one-dimensional problems, one for each entry in $\{z_{l}\}.$ Because each sub-problem has a simple form and only 1 variable, these problems can be solved globally in closed form.

The full alternating direction method is listed in Algorithm 1. We discuss the details below.

In this section, we consider the updates for each variable in (5). The algorithm proceeds by minimizing for $W_{l},$ $a_{l},$ and $z_{l},$ and then updating the Lagrange multipliers $\lambda.$

We first consider the minimization of (3) with respect to $\{W_{l}\}.$ For each layer $l,$ the optimal solution minimizes $\|z_{l}-W_{l}a_{l-1}\|^{2}.$ This is simply a least squares problem, and the solution is given by $W_{l}\leftarrow z_{l}a_{l-1}^{\dagger}$ where $a_{l-1}^{\dagger}$ represents the pseudoinverse of the (rectangular) activation matrix $a_{l-1}.$

Activations update

Minimization for $a_{l}$ is a simple least-squares problem similar to the weight update. However, in this case the matrix $a_{l}$ appears in two penalty terms in (3), and so we must minimize $\beta_{l}\|z_{l+1}-W_{l+1}a_{l}\|+\gamma_{l}\|a_{l}-h_{l}(z_{l})\|^{2}$ for $a_{l},$ holding all other variables fixed. The new value of $a_{l}$ is given by

where $W_{l+1}^{T}$ is the adjoint (transpose) of $W_{l+1}.$

Outputs update

The update for $z_{l}$ requires minimizing

This problem is non-convex and non-quadratic (because of the non-linear term $h$ ). Fortunately, because the non-linearity $h$ works entry-wise on its argument, the entries in $z_{l}$ are de-coupled. Solving (7) is particularly easy when $h$ is piecewise linear, as it can be solved in closed form; common piecewise linear choices for $h$ include rectified linear units (ReLUs) and non-differentiable sigmoid functions given by

For such choices of $h,$ the minimizer of (7) is easily computed using simple if-then logic. For more sophistical choices of $h,$ including smooth sigmoid curves, the problem can be solved quickly with a lookup table of pre-computed solutions because each 1-dimensional problem only depends on two inputs.

Lagrange multiplier update

After minimizing for $\{W_{l}\},$ $\{a_{l}\},$ and $\{z_{l}\},$ the Lagrange multiplier update is given simply by

We discuss this update further in Section 4.

Lagrange multiplier updates via method of multipliers and Bregman iteration

The proposed method can be viewed as solving the constrained problem (3) using Bregman iteration, which is closely related to ADMM. The convergence of Bregman iteration is fairly well understood in the presence of linear constraints (Yin et al., 2008). The convergence of ADMM is fairly well understood for convex problems involving only two separate variable blocks (He & Yuan, 2015). Convergence results also guarantee that a local minima is obtained for two-block non-convex objectives under certain smoothness assumptions (Nocedal & Wright, 2006).

Because the proposed scheme involves more than two coupled variable blocks and a non-smooth penalty function, it lies outside the scope of known convergence results for ADMM. In fact, when ADMM is applied to (3) in a conventional way using separate Lagrange multiplier vectors for each constraint, the method is highly unstable because of the de-stabilizing effect of a large number of coupled, non-smooth, non-convex terms.

Fortunately, we will see below that the Bregman Lagrange update method (13) does not involve any non-smooth constraint terms, and the resulting method seems to be extremely stable.

Bregman iteration (also known as the method of multipliers) is a general framework for solving constrained optimization problems. Methods of this type have been used extensively in the sparse optimization literature (Yin et al., 2008). Consider the general problem of minimizing

for some convex function $J$ and linear operator $A.$ Bregman iteration repeatedly solves

where $p\in\partial J(u^{k})$ is a (sub-)gradient of $J$ at $u^{k},$ and $D_{J}(u,u^{k})=J(u)-J(u^{k})-\langle u-u^{k},p\rangle$ is the so-called Bregman distance. The iterative process (10) can be viewed as minimizing the objective $J$ subject to an inexact penalty that approximately obtains $Ax\approx b,$ and then adding a linear term to the objective to weaken it so that the quadratic penalty becomes more influential on the next iteration.

For this reason, the Lagrange update (13) can be interpreted as updating the sub-gradient in the Bregman iterative method for solving (3). The combination of the Bregman iterative update with an alternating minimization strategy makes the proposed algorithm an instance of the split Bregman method (Goldstein & Osher, 2009).

2 Interpretation as method of multipliers

In addition to the Bregman interpretation, the proposed method can also be viewed as an approximation to the method of multipliers, which solves constrained problems of the form

for some convex function $J$ and (possibly non-linear) operator $A.$ In its most general form (which does not assume linear constraints) the method proceeds using the iterative updates

where $\lambda^{k}$ is a vector of Lagrange multipliers that is generally initialized to zero, and $\frac{\beta}{2}\|A(u)-b\|^{2}$ is a quadratic penalty term. After each minimization sub-problem, the gradient of the penalty term is added to the Lagrange multipliers. When the operator $A$ is linear, this update takes the form $\lambda^{k+1}\leftarrow\lambda^{k}+\beta A^{T}(Au-b),$ which is the most common form of the method of multipliers.

When the objective is approximately minimized by alternately updating separate blocks of variables (as in the proposed method), this becomes an instance of the ADMM (Boyd et al., 2011).

Distributed implementation using data parallelism

The main advantage of the proposed alternating minimization method is its high degree of scalability. In this section, we explain how the method is distributed.

Consider distributing the algorithm across $N$ worker nodes. The ADMM method is scaled using a data parallelization strategy, in which different nodes store activations and outputs corresponding to different subsets of the training data. For each layer, the activation matrix is broken into columns subsets as $a_{i}=(a_{1},a_{2},\cdots,a_{N}).$ The output matrix $z_{l}$ and Lagrange multipliers $\lambda$ decompose similarly.

The optimization sub-steps for updating $\{a_{l}\}$ and $\{z_{l}\}$ do not require any communication and parallelize trivially. The weight matrix update requires the computation of pseudo-inverses and products involving the matrices $\{a_{l}\}$ and $\{z_{l}\}.$ This can be done effectively using transpose reduction strategies that reduce the dimensionality of matrices before they are transmitted to a central node.

The weight update has the form $W_{l}\leftarrow z_{l}a_{l}^{\dagger},$ where $a_{l}^{\dagger}$ represents the pseudoinverse of the activation matrix $a_{l}.$ This pseudoinverse can be written $a_{l}^{\dagger}=a_{l}^{T}(a_{l}a_{l}^{T})^{-1}.$ Using this expansion, the W update decomposes across nodes as

The individual products $z_{l}^{n}(a_{l}^{n})^{T}$ and $a_{l}^{n}(a_{l}^{n})^{T}$ are computed separately on each node, and then summed across nodes using a single reduce operation. Note that the width of $a_{l}^{n}$ equals the number of training vectors that are stored on node $n,$ which is potentially very large for big data sets. When the number of features (the number of rows in $a_{l}^{n}$ ) is less than the number of training data (columns of $a_{l}^{n}$ ), we can exploit transpose reduction when forming these products – the product $a_{l}^{n}(a_{l}^{n})^{T}$ is much smaller than the matrix $a_{l}^{n}$ alone. This dramatically reduces the quantity of data transmitted during the reduce operation.

Once these products have been formed and reduced onto a central server, the central node computes the inverse of $a_{l}a_{l}^{T},$ updates $W_{l},$ and then broadcasts the result to the worker nodes.

Parallel Activations update

The update (6) trivially decomposes across workers, with each worker computing

Each server maintains a full representation of the entire weight matrix, and can formulate its own local copy of the matrix inverse $(\beta_{l+1}W_{l+1}^{T}W_{l+1}+\gamma I)^{-1}.$

Parallel Outputs update

Like the activations update, the update for $z_{l}$ trivially parallelizes and each worker node solves

Each worker node simply computes $W_{l}a_{l-1}^{n}$ using local data, and then updates each of the (decoupled) entries in $z_{l}^{n}$ by solving a 1-dimensional problem in closed form.

Parallel Lagrange multiplier update

The Lagrange multiplier update also trivially splits across nodes, with worker $n$ computing

Implementation details

Like many training methods for neural networks, the ADMM approach requires several tips and tricks to get maximum performance. The convergence theory for the method of multipliers requires a good minimizer to be computed before updating the Lagrange multipliers. When the method is initialized with random starting values, the initial iterates are generally far from optimal. For this reason, we frequently “warm start” the ADMM method by running several iterations without Lagrange multiplier updates.

The method potentially requires the user to choose a large number of parameters $\{\gamma_{i}\}$ and $\{\beta_{i}\}.$ We choose $\gamma_{i}=10$ and $\beta_{i}=1$ for all trials runs reported here, and we have found that this choice works reliably for a wide range of problems and network architectures. Note that in the classical ADMM method, convergence is guaranteed for any choice of the quadratic penalty parameters.

We use training data with binary class labels, in which each output entry $a_{L}$ is either 1 or 0. We use a separable loss function with a hinge penalty of the form

This loss function works well in practice, and yields minimization sub-problems that are easily solved in closed form.

Finally, our implementation simply initializes the activation matrices $\{a_{l}\}$ and output matrices $\{z_{l}\}$ using i.i.d Gaussian random variables. Because our method updates the weights before anything else, the weight matrices do not require any initialization. The results presented here are using Gaussian random variables with unit variance, and the results seem to be fairly insensitive to the variance of this distribution. This seems to be because the output updates are solved to global optimality on each iteration.

Experiments

In this section, we present experimental results that compare the performance of the ADMM method to other approaches, including SGD, conjugate gradients, and L-BFGS on benchmark classification tasks. Comparisons are made across multiple axes. First, we illustrate the scaling of the approach, by varying the number of cores available and clocking the compute time necessary to meet an accuracy threshold on the test set of the problem. Second, we show test set classification accuracy as a function of time to compare the rate of convergence of the optimization methods. Finally, we show these comparisons on two different data sets, one small and relatively easy, and one large and difficult.

The new ADMM approach was implemented in Python on a Cray XC30 supercomputer with Ivy Bridge processors, and communication between cores performed via MPI. SGD, conjugate gradients, and L-BFGS are run as implemented in the Torch optim package on NVIDIA Tesla K40 GPUs. These methods underwent a thorough hyperparameter grid search to identify the algorithm parameters that produced the best results. In all cases, timings indicate only the time spent optimizing, excluding time spent loading data and setting up the network.

Experiments were run on two datasets. The first is a subset of the Street View House Numbers (SVHN) dataset (Netzer et al., 2011). Neural nets were constructed to classify pictures of 0s from 2s using histogram of gradient (HOG) features of the original dataset. Using the “extra” dataset to train, this meant 120,290 training datapoints of 648 features each. The testing set contained 5,893 data points.

The second dataset is the far more difficult Higgs dataset (Baldi et al., 2014), consisting of a training set of 10,500,000 datapoints of 28 features each, with each datapoint labelled as either a signal process producing a Higgs boson or a background process which does not. The testing set consists of 500,000 datapoints.

First, we focus on the problem posed by the SVHN dataset. For this dataset, we optimized a net with two hidden layers of 100 and 50 nodes and ReLU activation functions. This is an easy problem (test accuracy rises quickly) that does not require a large volume of data and is easily handled by gradient-based methods on a GPU. However, Figure 1(a) demonstrates that ADMM exhibits linear scaling with cores. Even though the implementations of the gradient-based methods enjoy communication via shared memory on the GPU while ADMM required CPU-to-CPU communication, strong scaling allows ADMM on CPU cores to compete with the gradient-based methods on a GPU.

This is illustrated clearly in Figure 1(b), which shows each method’s performance on the test set as a function of time. With 1,024 compute cores, on an average of 10 runs, ADMM was able to meet the 95% test set accuracy threshold in 13.3 seconds. After an extensive hyperparameter search to find the settings which resulted in the fastest convergence, SGD converged on average in 28.3 seconds, L-BFGS in 3.3 seconds, and conjugate gradients in 10.1 seconds. Though the small dataset kept ADMM from taking full advantage of its scalability, it was nonetheless sufficient to allow it to be competitive with GPU implementations.

2 Higgs

For the much larger and more difficult Higgs dataset, we optimized a simple network with ReLU activation functions and a hidden layer of 300 nodes, as suggested in (Baldi et al., 2014). The graph illustrates the amount of time required to optimize the network to a test set prediction accuracy of 64%; this parameter was chosen as all batch methods being tested reliably hit this accuracy benchmark over numerous trials. As is clear from Figure 2(a), parallelizing over additional cores decreases the time required dramatically, and again exhibits linear scaling.

In this much larger problem, the advantageous scaling allowed ADMM to reach the 64% benchmark much faster than the other approaches. Figure 2(b) illustrates this clearly, with ADMM running on 7200 cores reaching this benchmark in 7.8 seconds. In comparison, L-BFGS required 181 seconds, and conjugate gradients required 44 minutes.It is worth noting that though L-BFGS required substantially more time to reach 64% than did ADMM, it was the only method to produce a superior classifier, doing as well as 75% accuracy on the test set. In seven hours of training, SGD never reached 64% accuracy on the test set. These results suggest that, for large and difficult problems, the strong linear scaling of ADMM enables it to leverage large numbers of cores to (dramatically) out-perform GPU implementations.

Discussion & Conclusion

We present a method for training neural networks without using gradient steps. In addition to avoiding many difficulties of gradient methods (like saturation and choice of learning rates), performance of the proposed method scales linearly up to thousands of cores. This strong scaling enables the proposed approach to out-perform other methods on problems involving extremely large datasets.

The experiments shown here represent a fairly narrow range of classification problems and are not meant to demonstrate the absolute superiority of ADMM as a training method. Rather, this study is meant to be a proof of concept demonstrating that the caveats of gradient-based methods can be avoided using alternative minimization schemes. Future work will explore the behavior of alternating direction methods in broader contexts.

We are particularly interested in focusing future work on recurrent nets and convolutional nets. Recurrent nets, which complicate standard gradient methods (Jaeger, 2002; Lukoševičius, 2012), pose no difficulty for ADMM schemes whatsoever because they decouple layers using auxiliary variables. Convolutional networks are also of interest because ADMM can, in principle, handle them very efficiently. When the linear operators $\{W_{l}\}$ represent convolutions rather than dense weight matrices, the least squares problems that arise in the updates for $\{W_{l}\}$ and $\{a_{l}\}$ can be solved efficiently using fast Fourier transforms.

Finally, there are avenues to explore to potentially improve convergence speed. These include adding momentum terms to the weight updates and studying different initialization schemes, both of which are known to be important for gradient-based schemes (Sutskever et al., 2013).

Acknowledgements

This work was supported by the National Science Foundation (#1535902), the Office of Naval Research (#N00014-15-1-2676 and #N0001415WX01341), and the DoD High Performance Computing Center.