Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence
Mert Pilanci, Martin J. Wainwright
Introduction
Relative to first-order methods, second-order methods for convex optimization enjoy superior convergence in both theory and practice. For instance, Newton’s method converges at a quadratic rate for strongly convex and smooth problems, and moreover, even for weakly convex functions (i.e. not strongly convex), modifications of Newton’s method has super-linear convergence compared to the much slower convergence rate that can be achieved by a first-order method like accelerated gradient descent (see e.g. ). More importantly, at least in a uniform sense, the -rate is known to be unimprovable for first-order methods . Yet another issue in first-order methods is the tuning of step size, whose optimal choice depends on the strong convexity parameter and/or smoothness of the underlying problem. For example, consider the problem of optimizing a function of the form , where is a “data matrix”, and is a twice-differentiable function. Here the performance of first-order methods will depend on both the convexity/smoothness of , as well as the conditioning of the data matrix. In contrast, whenever the function is self-concordant, then Newton’s method with suitably damped steps has a global complexity guarantee that is provably independent of such problem-dependent parameters.
On the other hand, each step of Newton’s method requires solving a linear system defined by the Hessian matrix. For instance, in application to the problem family just described involving an data matrix, each of these steps has complexity scaling as . For this reason, both forming the Hessian and solving the corresponding linear system pose a tremendous numerical challenge for large values of — for instance, values of thousands to millions, as is common in big data applications, In order to address this issue, a multitude of different approximations to Newton’s method have been proposed and studied in the literature. Quasi-Newton methods form estimates of the Hessian by successive evaluations of the gradient vectors and are computationally cheaper. Examples of such methods include DFP and BFGS schemes and also their limited memory versions (see the book for further details). A disadvantage of such approximations based on first-order information is that the associated convergence guarantees are typically much weaker than those of Newton’s method and require stronger assumptions. Under restrictions on the eigenvalues of the Hessian (strong convexity and smoothness), Quasi-Newton methods typically exhibit local super-linear convergence.
In this paper, we propose and analyze a randomized approximation of Newton’s method, known as the Newton Sketch. Instead of explicitly computing the Hessian, the Newton Sketch method approximates it via a random projection of dimension . When these projections are carried out using the randomized Hadamard transform, each iteration has complexity . Our results show that it is always sufficient to choose proportional to , and moreover, that the sketch dimension can be much smaller for certain types of constrained problems. Thus, in the regime and with , the complexity per iteration can be substantially lower than the complexity of each Newton step. Specifically for , the complexity of Newton Sketch per iteration is , which is linear in the input size () and comparable to first order methods which only access the derivative . Moreover, we show that for self-concordant functions, the total complexity of obtaining a -optimal solution is , and does not depend on constants such as strong convexity or smoothness parameters unlike first order methods. On the other hand, for problems with , we also provide a dual strategy which effectively has the same guarantees with roles of and exchanged.
We also consider other random projection matrices and sub-sampling strategies, including partial forms of random projection that exploit known structure in the Hessian. For self-concordant functions, we provide an affine invariant analysis proving that the convergence is linear-quadratic and the guarantees are independent of the function and data, such as condition numbers of matrices involved in the objective function. Finally, we describe an interior point method to deal with arbitrary convex constraints which combines the Newton sketch with the barrier method. We provide an upper bound on the total number of iterations required to obtain a solution with a pre-specified target accuracy.
The remainder of this paper is organized as follows. We begin in Section 2 with some background on the classical form of Newton’s method, random matrices for sketching, and Gaussian widths as a measure of the size of a set. In Section 3, we formally introduce the Newton Sketch, including both fully and partially sketched versions for unconstrained and constrained problems. We provide some illustrative examples in Section 3.2 before turning to local convergence theory in Section 3.3. Section 4 is devoted to global convergence results for self-concordant functions, in both the constrained and unconstrained settings. In Section 5, we consider a number of applications and provide additional numerical results. The bulk of our proofs are in given in Section 6, with some more technical aspects deferred to the appendices.
Background
We begin with some background material on the standard form of Newton’s method, various types of random sketches, and the notion of Gaussian width as a complexity measure.
In this section, we briefly review the convergence properties and complexity of the classical form of Newton’s method; see the sources for further background.
Let be a closed, convex and twice-differentiable function that is bounded below. Given a convex set , we assume that the constrained minimizer
is uniquely defined, and we define the minimum and maximum eigenvalues and of the Hessian evaluated at the minimum.
We assume moreover that the Hessian map is Lipschitz continuous with modulus , meaning that
This result is classical: for instance, see Boyd and Vandenberghe for a proof. Newton’s method can be slightly modified to be globally convergent by choosing the step sizes via a simple backtracking line-search procedure.
The following result characterizes the complexity of Newton’s method when applied to self-concordant functions and is central in the development of interior point methods (for instance, see the books ). We defer the definitions of self-concordance and the line-search procedure in the following sections. The number of iterations needed to obtain a approximate minimizer of a strictly convex self-concordant function is bounded by
where are constants in the line-search procedure.Typical values of these constants are and .
2 Different types of randomized sketches
The most classical sketch is based on a random matrix with i.i.d. standard Gaussian entries, or somewhat more generally, sketch matrices based on i.i.d. sub-Gaussian rows. In particular, a zero-mean random vector is -sub-Gaussian if for any , we have
For instance, a vector with i.i.d. entries is -sub-Gaussian, as is a vector with i.i.d. Rademacher entries (uniformly distributed over ). We use the terminology sub-Gaussian sketch to mean a random matrix with i.i.d. rows that are zero-mean, -sub-Gaussian, and with .
From a theoretical perspective, sub-Gaussian sketches are attractive because of the well-known concentration properties of sub-Gaussian random matrices (e.g., ). On the other hand, from a computational perspective, a disadvantage of sub-Gaussian sketches is that they require matrix-vector multiplications with unstructured random matrices. In particular, given a data matrix , computing its sketched version requires basic operations in general (using classical matrix multiplication).
The second type of randomized sketch we consider is randomized orthonormal system (ROS), for which matrix multiplication can be performed much more efficiently. In order to define a ROS sketch, we first let be an orthonormal matrix with entries . Standard classes of such matrices are the Hadamard or Fourier bases, for which matrix-vector multiplication can be performed in time via the fast Hadamard or Fourier transforms, respectively. Based on any such matrix, a sketching matrix from a ROS ensemble is obtained by sampling i.i.d. rows of the form
where the random vector is chosen uniformly at random from the set of all canonical basis vectors, and is a diagonal matrix of i.i.d. Rademacher variables . Given a fast routine for matrix-vector multiplication, the sketch for a data matrix can be formed in time (for instance, see the papers ).
Given a probability distribution over , another choice of sketch is to randomly sample the rows of a data matrix a total of times with replacement from the given probability distribution. Thus, the rows of are independent and take on the values
3 Gaussian widths
In this section, we introduce some background on the notion of Gaussian width, a way of measuring the size of a compact set in d. These width measures play a key role in the analysis of randomized sketches. Given a compact subset , its Gaussian width is given by
where is an i.i.d. sequence of variables. This complexity measure plays an important role in Banach space theory, learning theory and statistics (e.g., ).
Of particular interest in this paper are sets that are obtained by intersecting a given cone with the Euclidean sphere . It is easy to show that the Gaussian width of any such set is at most , but the it can be substantially smaller, depending on the nature of the underlying cone. For instance, if is a subspace of dimension , then a simple calculation yields that .
Newton sketch and local convergence
With the basic background in place, let us now introduce the Newton sketch algorithm, and then develop a number of convergence guarantees associated with it. It applies to an optimization problem of the form , where is a twice-differentiable convex function, and is a convex constraint set.
Now suppose that we have available a Hessian matrix square root —that is, a matrix of dimensions such that
In many cases, such a matrix square root can be computed efficiently. For instance, consider a function of the form where , and the function has the separable form . In this case, a suitable Hessian matrix square root is given by the matrix \nabla^{2}f(x)^{1/2}:\,=\mbox{diag}\big{\{}g_{i}^{\prime\prime}(\langle a_{i},\,x\rangle)\big{\}}_{i=1}^{n}A. In Section 3.2, we discuss various concrete instantiations of such functions.
In terms of this notation, the ordinary Newton update can be re-written as
where is an independent realization of a sketching matrix. When the problem is unconstrained, i.e., and the matrix is invertible, the Newton sketch update takes the simpler form to
In this paper, we also analyze a partially sketched Newton update, which takes the following form. Given an additive decomposition of the form , we perform a sketch of of the Hessian while retaining the exact form of the Hessian . This leads to the partially sketched update
where .
For either the fully sketched (6) or partially sketched updates (8), our analysis shows that there are many settings in which the sketch dimension can be chosen to be substantially smaller than , in which cases the sketched Newton updates will be much cheaper than a standard Newton update. For instance, the unconstrained update (7) can be computed in at most time, as opposed to the time of the standard Newton update. In constrained settings, we show that the sketch dimension can often be chosen even smaller—even —which leads to further savings.
2 Some examples
In order to provide some intuition, let us provide some simple examples to which the sketched Newton updates can be applied.
Consider a linear program (LP) in the standard form
where is a given constraint matrix. We assume that the polytope is bounded so that the minimum achieved. A barrier method approach to this LP is based on solving a sequence of problems of the form
where denotes the row of , and is a weight parameter that is adjusted during the algorithm. By inspection, the function is twice-differentiable, and its Hessian is given by \nabla^{2}f(x)=A^{T}\mbox{diag}\big{\{}\frac{1}{(b_{i}-\langle a_{i},\,x\rangle)^{2}}\big{\}}A. A Hessian square root is given by , which allows us to compute a sketched version of the Hessian square root
With a ROS sketch matrix, computing this matrix requires basic operations. The complexity of each Newton sketch iteration scales as , where is at most . In contrast, the standard unsketched form of the Newton update has complexity , so that the sketched method is computationally cheaper whenever there are more constraints than dimensions ().
By increasing the barrier parameter , we obtain a sequence of solutions that approach the optimum to the LP, which we refer to as the central path. As a simple illustration, Figure 1 compares the central paths generated by the ordinary and sketched Newton updates for a polytope defined by constraints in dimension . Each row shows three independent trials of the method for a given sketch dimension ; the top, middle and bottom rows correspond to sketch dimensions respectively. Note that as the sketch dimension is increased, the central path taken by the sketched updates converges to the standard central path.
As a second example, we consider the problem of maximum likelihood estimation for generalized linear models.
The class of generalized linear models (GLMs) is used to model a wide variety of prediction and classification problems, in which the goal is to predict some output variable on the basis of a covariate vector . it includes as special cases the standard linear Gaussian model (in which ), as well as logistic models for classification (in which ), as well as as Poisson models for count-valued responses (in which ). See the book for further details and applications.
Given a collection of observations of response-covariate pairs from some GLM, the problem of constrained maximum likelihood estimation be written in the form
where is a given convex function, and is a convex constraint set, chosen by the user to enforce a certain type of structure in the solution. Important special cases of GLMs include the linear Gaussian model, in which , and the problem (10) corresponds to a regularized form of least-squares, as well as the problem of logistic regression, obtained by setting .
Letting denote the data matrix with as its row, the Hessian of the objective (10) takes the form
Since the function is convex, we are guaranteed that , and hence the quantity can be used as an matrix square-root. We return to explore this class of examples in more depth in Section 5.1.
3 Local convergence analysis using strong convexity
Returning now to the general setting, we now begin by proving a local convergence guarantee for the sketched Newton updates. In particular, this theorem provides insight into how large the sketch dimension must be in order to guarantee good local behavior of the sketched Newton algorithm.
This choice of sketch dimension is determined by geometry of the problem, in particular in terms of the tangent cone defined by the optimum. Given a constraint set and the minimizer , the tangent cone at is given by
Recalling the definition of the Gaussian width from Section 2.3, our first main result requires the sketch dimension to satisfy a lower bound of the form
where is a user-defined tolerance, and is a universal constant. Since the Hessian square-root has dimensions , this squared Gaussian width is at at most . This worst-case bound is achieved for an unconstrained problem (in which case ), but the Gaussian width can be substantially smaller for constrained problems. See the example following Theorem 1 for an illustration.
In addition to this Gaussian width, our analysis depends on the cone-constrained eigenvalues of the Hessian , which are defined as
In the unconstrained case (), we have , and so that and reduce to the minimum and maximum eigenvalues of the Hessian . In the classical analysis of Newton’s method, these quantities measure the strong convexity and smoothness parameters of the function .
With this set-up, the following theorem is applicable to any twice-differentiable objective with cone-constrained eigenvalues defined in equation (13), and with Hessian that is -Lipschitz continuous, as defined in equation (2).
The bound (14) shows that when is set to a fixed constant—say —the algorithm displays a linear-quadratic convergence rate in terms of the error . More specifically, the rate is initially quadratic—that is, when is large. However, as the iterations progress and becomes substantially less than 1, then the rate becomes linear—meaning that —since the term becomes negligible compared to . If we perform steps in total, the linear rate guarantees the conservative error bounds
A notable feature of Theorem 1 is that, depending on the structure of the problem, the linear-quadratic convergence can be obtained using a sketch dimension that is substantially smaller than . As an illustrative example, we performed simulations for some instantiations of a portfolio optimization problem: it is a linearly-constrained quadratic program of the form
where and are empirically estimated matrices and vectors (see Section 5.3 for more details). We used the Newton sketch to solve different sizes of this problem , and with in each case. Each problem was constructed so that the optimum had at most non-zero entries. A calculation of the Gaussian width for this problem (see Appendix C for the details) shows that it suffices to take a sketch dimension , and we implemented the algorithm with this choice.
Figure 2 shows the convergence rate of the Newton sketch algorithm for the six different problem sizes: consistent with our theory, the sketch dimension suffices to guarantee linear convergence in all cases.
It is also possible obtain an asymptotically super-linear rate by using an iteration-dependent sketching accuracy . The following corollary summarizes one such possible guarantee:
Consider the Newton sketch iterates using the iteration-dependent sketching accuracy . Then with the same probability as in Theorem 1, we have
and consequently, super-linear convergence is obtained—namely, .
Note that the price for this super-linear convergence is that the sketch size is inflated by the factor , so it is only logarithmic in the iteration number.
Newton sketch for self-concordant functions
The analysis and complexity estimates given in the previous section involve the curvature constants and the Lipschitz constant , which are seldom known in practice. Moreover, as with the analysis of classical Newton method, the theory is local, in that the linear-quadratic convergence takes place once the iterates enter a suitable basin of the origin.
In this section, we seek to obtain global convergence results that do not depend on unknown problem parameters. As in the classical analysis, the appropriate setting in which to seek such results is for self-concordant functions, and using an appropriate form of backtracking line search. We begin by analyzing the unconstrained case, and then discuss extensions to constrained problems with self-concordant barriers. In each case, we show that given a suitable lower bound on the sketch dimension, the sketched Newton updates can be equipped with global convergence guarantees that hold with exponentially high probability. Moreover, the total number of iterations does not depend on any unknown constants such as strong convexity and Lipschitz parameters.
In this section, we consider the unconstrained optimization problem , where is a closed convex self-concordant function which is bounded below. Note that a closed convex function is self-concordant if
This definition can be extended to a function by imposing this requirement on the univariate functions , for all choices of in the domain of . Examples of self-concordant functions include linear and quadratic functions and negative logarithm. Self concordance is preserved under addition and affine transformations.
Our main result provide a bound on the total number of Newton sketch iterations required to obtain a -accurate solution without imposing any sort of initialization condition (as was done in our previous analysis). This bound scales proportionally to and inversely in a parameter that depends on sketching accuracy and backtracking parameters via
Let be a strictly convex self-concordant function. Given a sketching matrix with , the number of total iterations for obtaining an approximate solution in function value via Algorithm 1 is bounded by
with probability at least .
The bound in the above theorem shows that the convergence of the Newton Sketch is independent of the properties of the function and problem parameters, similar to classical Newton’s method. Note that for problems with , the complexity of each Newton sketch step is at most ), which is smaller than that of Newton’s Method (), and also smaller than typical first-order optimization methods () whenever .
2 Newton Sketch with self-concordant barriers
We now turn to the more general constrained case. Given a closed, convex self-concordant function , let be a convex subset of d, and consider the constrained optimization problem . If we are given a convex self-concordant barrier function for the constraint set , it is equivalent to consider the unconstrained problem
In all such cases, an attractive strategy is to apply a partial Newton sketch, in which we sketch the Hessian term and retain the exact Hessian , as in the previously described updates (8). More formally, Algorithm 2 provides a summary of the steps, including the choice of the line search parameters. The main result of this section provides a guarantee on this algorithm, assuming that the sequence of sketch dimensions is appropriately chosen.
The choice of sketch dimensions depends on the tangent cones defined by the iterates, namely the sets
For a given sketch accuracy , we require that the sequence of sketch dimensions satisfies the lower bound
Finally, the reader should recall the parameter was defined in equation (18), which depends only on the sketching accuracy and the line search parameters. Given this set-up, we have the following guarantee:
Let be a convex and self-concordant function, and let be a convex and self-concordant barrier for the convex set . Suppose that we implement Algorithm 2 with sketch dimensions satisfying the lower bound (19). Then taking
suffices to obtain -approximate solution in function value with probability at least .
Thus, we see that the Newton Sketch method can also be used with self-concordant barrier functions, which considerably extends its scope. Section 5.5 provides a numerical illustration of its performance in this context. As we discuss in the next section, there is a flexibility in choosing the decomposition and corresponding to objective and barrier, which enables us to also sketch the constraints.
3 Sketching with interior point methods
In this section, we discuss the application of Newton Sketch to a form of barrier or interior point methods. In particular we discuss two different strategies and provide rigorous worst-case complexity results when the functions in the objective and constraints are self-concordant. More precisely, let us consider a problem of the form
where and are twice-differentiable convex functions. We assume that there exists a unique solution to the above problem.
The barrier method for computing is based on solving a sequence of problems of the form
for increasing values of the parameter . The family of solutions trace out what is known as the central path. A standard bound (e.g., ) on the sub-optimality of is given by
The barrier method successively updates the penalty parameter and also the starting points supplied to Newton’s method using previous solutions.
Since Newton’s method lies at the heart of the barrier method, we can obtain a fast version by replacing the exact Newton minimization with the Newton sketch. Algorithm 3 provides a precise description of this strategy. As noted in Step 1, there are two different strategies in dealing with the convex constraints for :
Full sketch: Sketch the full Hessian of the objective function (21) using Algorithm 1 ,
Partial sketch: Sketch only the Hessians corresponding to a subset of the functions , and use exact Hessians for the other functions. Apply Algorithm 2.
As shown by our theory, either approach leads to the same convergence guarantees, but the associated computational complexity can vary depending both on how data enters the objective and constraints, as well as the Hessian structure arising from particular functions. The following theorem is an application of the classical results on the barrier method tailored for Newton Sketch using any of the above strategies (see e.g., ). As before, the key parameter was defined in Theorem 2.
For a given target accuracy and any , the total number of Newton Sketch iterations required to obtain a -accurate solution using Algorithm 3 is at most
If the parameter is set to minimize the above upper-bound, the choice yields iterations. However, when applying the standard Newton method, this “optimal” choice is typically not used in practice: instead, it is common to use a fixed value of . In experiments, experience suggests that the number of Newton iterations needed is a constant independent of and other parameters. Theorem 4 allows us to obtain faster interior point solvers with rigorous worst-case complexity results. We show different applications of Algorithm 3 in the following section.
Applications and numerical results
In this section, we discuss some applications of the Newton sketch to different optimization problems. In particular, we show various forms of Hessian structure that arise in applications, and how the Newton sketch can be computed. When the objective and/or the constraints contain more than one term, the barrier method with Newton Sketch has some flexibility in sketching. We discuss the choices of partial Hessian sketching strategy in the barrier method. It is also possible to apply the sketch in the primal or dual form, and we provide illustrations of both strategies here.
Suppose that we apply the Newton sketch algorithm to the optimization problem (10). Given the current iterate , computing the next iterate requires solving the constrained quadratic program
Note that we are always guaranteed that . It also involves certain quantities that depend on the function , namely
where is the row of . With this set-up, supposing that the optimal solution has cardinality at most , then it can be shown (see Lemma 8 in Appendix C) that it suffices to take a sketch size
where is a universal constant. Let us consider some examples to illustrate:
Least-Squares regression: , and .
Poisson regression: , and
Logistic regression: , and ,
where , and .
For typical distributions of the data matrices, the sketch size choice given in equation (25) is . As an example, consider data matrices where each row is independently sampled from a sub-Gaussian distribution with variance . Then standard results on random matrices show that as long as for a sufficiently large constant . In addition, we have , as well as . For such problems, the per iteration complexity of Newton Sketch update scales as using standard Lasso solvers (e.g., ) or as using projected gradient descent. Both of these scalings are substantially smaller than conventional algorithms that fail to exploit the small intrinsic dimension of the tangent cone.
2 Semidefinite programs
Here the term , along with its multiplicative pre-factor that can be adjusted by the user, is a regularization term for encouraging a relatively low-rank solution. Using the standard self-concordant barrier for the PSD cone, the barrier method involves solving a sequence of sub-problems of the form
Now the Hessian of the function is a matrix given by
where . Then we can apply the barrier method with partial Hessian sketch on the first term, and exact Hessian for the second term. Since the vectorized decision variable is the complexity of Newton Sketch is while the complexity of a classical SDP interior-point solver is .
3 Portfolio optimization and SVMs
Here we consider the Markowitz formulation of the portfolio optimization problem . The objective is to find belonging to the unit simplex, which corresponds to non-negative weights associated with each of possible assets, so as to maximize the expected return minus a coefficient times the variance of the return. Letting denote a vector corresponding to mean return of the assets, and we let be a symmetric, positive semidefinite matrix, covariance of the returns. The optimization problem is given by
The covariance of returns is often estimated from past stock data via empirical covariance, where the columns of are time series corresponding to assets normalized by , where is the length of the observation window.
The barrier method can be used solve the above problem by solving penalized problems of the
where is the element of the canonical basis and is row vector of all-ones. Then the Hessian of the above barrier penalized formulation can be written as
Consequently we can sketch the data dependent part of the Hessian via which has at most rank and keep the remaining terms in the Hessian exact. Since the matrix is rank one, the resulting sketched estimate is therefore diagonal plus rank where the matrix inversion lemma can be applied for efficient computation of the Newton Sketch update (see e.g. ). Therefore, as long as , the complexity per iteration scales as , which is cheaper than the per step complexity associated with classical interior point methods. We also note that support vector machine classification problems with squared hinge loss also has the same form as in (26) (see e.g. ) where the same strategy can be applied.
4 Unconstrained logistic regression with d≪nmuch-less-than𝑑𝑛d\ll n
Let us now turn to some numerical comparisons of the Newton Sketch with other popular optimization methods for large-scale instances of logistic regression. More specifically, we generated a feature matrix based on features and observations. Each row was generated from the -variate Gaussian distribution where . As shown in Figure 3, the convergence of the algorithm per iteration is very similar to Newton’s method. Besides the original Newton’s method, the other algorithms compared are
Gradient Descent (GD) with backtracking line search
Accelerated Gradient Descent (Acc. GD) adapted for strongly convex functions with manually tuned parameters.
Stochastic Gradient Descent (SGD) with the classical step size choice
Broyden–-Fletcher–-Goldfarb-–Shanno algorithm (BFGS) approximating the Hessian with gradients.
For each problem, we averaged the performance of the randomized algorithms (Newton sketch and SGD) over independent trials. We ran the Newton sketch algorithm with sketch size . To be fair in comparisons, we performed hand-tuning of the stepsize parameters in the gradient-based methods so as to optimize their performance. The top panel in Figure 3 plots the log duality gap versus the number of iterations: as expected, on this scale, the classical form of Newton’s method is the fastest, whereas the SGD method is the slowest. However, when the log optimality gap is plotted versus the wall-clock time in the bottom panel, we now see that the Newton sketch is the fastest.
5 A dual example: Lasso with d≫nmuch-greater-than𝑑𝑛d\gg n
The regularized Lasso problem takes the form \min\limits_{x\in{}^{d}}\big{\{}\frac{1}{2}\,\|Ax-y\|_{2}^{2}+\lambda\|x\|_{1}\big{\}}, where is a user-specified regularization parameter. In this section, we consider efficient sketching strategies for this class of problems in the regime . In particular, let us consider the corresponding dual program, given by
By construction, the number of constraints in the dual program is larger than the number of optimization variables . If we apply the barrier method to solve this dual formulation, then we need to solve a sequence of problems of the form
where denotes the column of . The Hessian of the above barrier penalized formulation can be written as
Consequently we can keep the first term in the Hessian, exact and apply partial sketching to the Hessians of the last two terms via
Since the partially sketched Hessian is of the form , where is rank at most , we can use matrix inversion lemma for efficiently calculating Newton Sketch updates. The complexity of the above strategy for is , where is at most , whereas traditional interior point solvers are typically per iteration.
In order to test this algorithm, we generated a feature matrix with features and observations. Each row was generated from the multivariate Gaussian distribution with . For a given problem instance, we ran independent trials of the sketched barrier method, and compared the results to the original barrier method. Figure 4 plots the the duality gap versus iteration number (top panel) and versus the wall-clock time (bottom panel) for the original barrier method (blue) and sketched barrier method (red): although the sketched algorithm requires more iterations, these iterations are cheaper, leading to a smaller wall-clock time. This point is reinforced by Figure 5, where we plot the wall-clock time required to reach a duality gap of versus the number of features in problem families of increasing size. Note that the sketched barrier method outperforms the original barrier method, with significantly less computation time for obtaining similar accuracy.
Proofs
We now turn to the proofs of our theorems, with more technical details deferred to the appendices.
Throughout this proof, we let denote a fixed vector that is independent of the sketch matrix and the current iterate . We then define the following pair of random variables
These random variables are significant, because the core of our proof is based on establishing that the error vector satisfies the recursive bound
where and . We then combine this recursion with the following probabilistic guarantee on and . For a given tolerance parameter , consider the ”good event”
For sub-Gaussian sketch matrices, given a sketch size , we have
For randomized orthogonal system (ROS) sketches over the class of self-bounding cones, given a sketch size , we have
Combining Lemma 1 with the recursion (27) and re-scaling appropriately yields the claim of the theorem.
Accordingly, it remains to prove the recursion (27), and we do so via a basic inequality argument. Recall the function that underlies the sketch Newton update (6): since and are optimal and feasible for the constrained optimization problem, we have . Introducing the error vector , some straightforward algebra then then leads to the basic inequality
Let us first upper bound the right-hand side. By using the integral form of Taylor’s expansion, we have , and hence
By adding and subtracting terms and then applying triangle inequality, we have the bound , where
Now observe that the vector is independent of the randomness in , whereas the vector belongs to the cone . Consequently, by the definition of , we have
Now note that using the fact that controls the smoothness of the gradient and the Lipschitz continuity of Hessian we can upper bound the terms on the above right-hand side as follows
and similarly, . Combining the above bounds with (32) we obtain
On the other hand, by the -Lipschitz condition on the Hessian, we have
Substituting these two bounds into our basic inequality, we have
Our final step is to lower bound the left-hand side (LHS) of this inequality. By definition of , we have
Substituting this lower bound into the previous inequality (34) and then rearranging, we find that, as long as , we also have and consequently
2 Proof of Theorem 2
Recall that in this case, we assume that is a self-concordant strictly convex function. We adopt the following notation and conventions from the book . For a given , we define the pair of dual norms
Note that is well-defined for strictly convex self-concordant functions. In terms of this notation, the exact Newton update is given by , where
whereas the Newton sketch update is given by , where
The proof of Theorem 2 given in this section involves the unconstrained case (), whereas the proofs of later theorems involve the more general constrained case. In the unconstrained case, the two updates take the simpler forms
For a self-concordant function, the sub-optimality of the Newton iterate in function value satisfies the bound
This classical bound is not directly applicable to the Newton sketch update, since it involves the approximate Newton decrement , as opposed to the exact one . Thus, our strategy is to prove that with high probability over the randomness in the sketch matrix, the approximate Newton decrement can be used as an exit condition.
Recall the definitions (35) and (36) of the exact and sketched Newton update directions, as well as the definition of the tangent cone at . Let be the tangent cone at . The following lemma provides a high probability bound on their difference:
Let be a sub-Gaussian or ROS sketch matrix, and consider any fixed vector independent of the sketch matrix. If , then
with probability at least .
Similar to the standard analysis of Newton’s method, our analysis of the Newton sketch algorithm is split into two phases defined by the magnitude of the decrement . In particular, the following lemma constitute the core of our proof:
For , there exist constants and such that:
If , then with probability at least .
Conversely, if , then
where both bounds hold with probability .
Using this lemma, let us now complete the proof of the theorem, dividing our analysis into the two phases of the algorithm.
By Lemma 3(a) each iteration in the first phase decreases the function value by at least , the number of first phase iterations is at most
with probability at least .
Next, let us suppose that at some iteration , the condition holds, so that part (b) of Lemma 3 can be applied. In fact, the bound (38a) then guarantees that , so that we may apply the contraction bound (38b) repeatedly for rounds so as to obtain that
with probability .
Since by assumption, the self-concordance of then implies that
Therefore, in order to ensure that and consequently for achieving , it suffices to the number of second phase iterations lower bounded as .
Putting together the two phases, we conclude that the total number of iterations required to achieve - accuracy is at most
and moreover, this guarantee holds with probability at least .
The final step in our proof of the theorem is to establish Lemma 3, and we do in the next two subsections.
2.1 Proof of Lemma 3(a)
Our proof of this part is performed conditionally on the event . Our strategy is to show that the backtracking line search leads to a stepsize such that function decrement in moving from the current iterate to the new sketched iterate is at least
The outline of our proof is as follows. Defining the univariate function and , we first show that satisfies the bound
Since by assumption and the function is monotone increasing, this bound implies that inequality (39) holds with .
It remains to prove the claims (40a) and (40b), for which we make use of the following auxiliary lemma:
For , we have the decrement bound
provided that .
With probability at least , we have
The proof of these lemmas are provided in Appendices A.2 and A.3. Using them, let us prove the claims (40a) and (40b). Recalling our shorthand , substituting inequality (42) into the decrement formula (41) yields
where we added and subtracted so as to obtain the final equality.
We now prove inequality (40a). Now setting , which satisfies the conditions of Lemma 4 yields
Making use of the standard inequality (for instance, see the book ), we find that
where the final inequality follows from our assumption . This completes the proof of the bound (40a). Finally, the lower bound (40b) follows by setting into the decrement inequality (41).
2.2 Proof of Lemma 3(b)
The proof of this part hinges on the following auxiliary lemma:
where all bounds hold with probability at least .
We now use Lemma 6 to prove the two claims in the lemma statement.
Recall from the theorem statement that . By examining the roots of a polynomial in , it can be seen that .
By applying the inequalities (44b), we have
Here the final inequality holds for all . Combining the bound (44b) with inequality (46) yields
where the final inequality again uses the condition . This completes the proof of the bound (38a).
This inequality has been established as a consequence of proving the bound (46).
3 Proof of Theorem 3
Given the proof of Theorem 2, it remains only to prove the following modified version of Lemma 2. It applies to the exact and sketched Newton directions that are defined as follows
Thus, the only difference is that the Hessian is sketched, whereas the term remains unsketched.
Let be a sub-Gaussian or ROS sketching matrix, and let be a (possibly random) vector independent of . If , then
with probability at least .
Discussion
In this paper, we introduced and analyzed the Newton sketch, a randomized approximation to the classical Newton updates. This algorithm is a natural generalization of the Iterative Hessian Sketch (IHS) updates analyzed in our earlier work . The IHS applies only to constrained least-squares problems (for which the Hessian is independent of the iteration number), whereas the Newton Sketch applies to any any twice differentiable function subject to a closed convex constraint set. We described various applications of the Newton sketch, including its use with barrier methods to solve various forms of constrained problems. For the minimization of self-concordant functions, the combination of the Newton sketch within interior point updates leads to much faster algorithms for an extensive body of convex optimization problems.
Each iteration of the Newton sketch always has lower computational complexity than classical Newton’s method. Moreover, it has lower computational complexity than first-order methods when either or (using the dual strategy); here and denote the dimensions of the data matrix . In the context of barrier methods, the parameters and typically correspond to the number of constraints and number of variables, respectively. In many “big data” problems, one of the dimensions is much larger than the other, in which case the Newton sketch is advantageous. Moreover, sketches based on the randomized Hadamard transform are well-suited to in parallel environments: in this case, the sketching step can be done in time with processors. This scheme significantly decreases the amount of central computation—namely, from to .
There are a number of open problems associated with the Newton sketch. Here we focused our analysis on the cases of sub-Gaussian and randomized orthogonal system (ROS) sketches. It would also be interesting to analyze sketches based on coordinate sampling, or other forms of “sparse” sketches (for instance, see the paper ). Such techniques might lead to significant gains in cases where the data matrix is itself sparse: more specifically, it may be possible to obtain sketched optimization algorithms whose computational complexity only scales with number of nonzero entries in the data matrices the full dimensionality . Finally, it would be interesting to explore the problem of lower bounds on the sketch dimension . In particular, is there a threshold below which any algorithm that has access only to gradients and -sketched Hessians must necessarily converge at a sub-linear rate, or in a way that depends on the strong convexity and smoothness parameters? Such a result would clarify whether or not the guarantees in this paper are improvable.
Both authors were partially supported by Office of Naval Research MURI grant N00014-11-1-0688, and National Science Foundation Grants CIF-31712-23800 and DMS-1107000. In addition, MP was supported by a Microsoft Research Fellowship.
Appendix A Technical results for Theorem 2
In this appendix, we collect together various technical results and proofs that are required in the proof of Theorem 2.
Let be a unit-norm vector independent of , and consider the random quantities
By the optimality and feasibility of and (respectively) for the sketched Newton update (36), we have
Defining the difference vector , some algebra leads to the basic inequality
Moreover, by the optimality and feasibility of and for the exact Newton update (35), we have
Consequently, by adding and subtracting , we find that
By definition, the error vector belongs to the cone and the vector is fixed and independent of the sketch. Consequently, invoking definitions (49a) and (49b) of the random variables and yields
Putting together the pieces, we find that
By setting , the claim follows.
A.2 Proof of Lemma 4
By construction, the function is strictly convex and self-concordant. Consequently, it satisfies the bound , whence
or equivalently for . Integrating this inequality twice yields the bound
Since and , the decrement bound (41) follows.
A.3 Proof of Lemma 5
We perform this analysis conditional on the bound (37) from Lemma 2. We begin by observing that
Lemma 2 implies that . In conjunction with the bound (56), we see that
Our next step is to lower bound the term : in particular, by adding and subtracting a factor of the original Newton step , we find that
where the final step again makes use of Lemma 2. Repeating the above argument in the reverse direction yields the lower bound , so that we may conclude that
Finally, by squaring both sides of the inequality (56) and combining with the above bounds gives
A.4 Proof of Lemma 6
We have already proved the bound (44b) during our proof of Lemma 5—in particular, see equation (58). Accordingly, it remains only to prove the inequality (44a).
Introducing the shorthand , we first claim that the Hessian satisfies the sandwich relation
for where , with probability at least . Let us recall Theorem 4.1.6 of Nesterov : it guarantees that
Now recall the bound (37) from Lemma 2: combining it with an application of the triangle inequality (in terms of the semi-norm ) yields
with probability at least , and substituting this inequality into the bound (60) yields the sandwich relation (59) for the Hessian.
Using this sandwich relation (59), the Newton decrement can be bounded as
where we have defined . By the triangle inequality, we can write \lambda_{f}(x_{\tiny{\mbox{NSK}}})\leq\frac{1}{\left(1-(1+\epsilon)\lambda_{f}(x)\right)}\big{(}M_{1}+M_{2}\big{)}, where
In order to complete the proof, it suffices to show that
Re-arranging and then invoking the Hessian sandwich relation (59) yields
where the inequality in step (i) follows from Lemma 2.
Appendix B Proof of Lemma 7
The proof follows the basic inequality argument of the proof of Lemma 2. Since and are optimal and feasible (respectively) for the sketched Newton problem (47b), we have . Defining the difference vector , some algebra leads to the basic inequality
On the other hand since and are optimal and feasible (respectively) for the Newton step (47a), we have
Consequently, by adding and subtracting , we find that
By convexity of , we have , whence
Given this inequality, the remainder of the proof follows as in the proof of Lemma 2.
Here the random variables are zero-mean Gaussians with variance at most
Consequently, applying standard bounds on the suprema of Gaussian variates , we obtain
When combined with the previous inequality, the claim follows. ∎