Newton Sketch: A Linear-time Optimization Algorithm with Linear-Quadratic Convergence

Mert Pilanci, Martin J. Wainwright

Introduction

Relative to first-order methods, second-order methods for convex optimization enjoy superior convergence in both theory and practice. For instance, Newton’s method converges at a quadratic rate for strongly convex and smooth problems. Moreover, even for weakly convex functions (i.e., not strongly convex), modifications of Newton’s method have super-linear convergence, compared to the much slower $1/T^{2}$ convergence rate that can be achieved by a first-order method like accelerated gradient descent. More importantly, at least in a uniform sense, the $1/T^{2}$-rate is known to be unimprovable for first-order methods. Yet another issue in first-order methods is the tuning of step size, whose optimal choice depends on the strong convexity parameter and/or smoothness of the underlying problem. For example, consider the problem of optimizing a function of the form $x\mapsto g(Ax)$, where $A\in\mathbb{R}^{n\times d}$ is a “data matrix”, and $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a twice-differentiable function. Here the performance of first-order methods will depend on both the convexity/smoothness of $g$ and the conditioning of the data matrix. In contrast, whenever the function $g$ is self-concordant, Newton’s method with suitably damped steps has a global complexity guarantee that is provably independent of such problem-dependent parameters.

On the other hand, each step of Newton’s method requires solving a linear system defined by the Hessian matrix. For instance, in application to the problem family just described involving an $n\times d$ data matrix, each of these steps has complexity scaling as $\mathcal{O}(nd^{2})$. For this reason, both forming the Hessian and solving the corresponding linear system pose a tremendous numerical challenge for large values of $(n,d)$ (for instance, values of thousands to millions, as is common in big data applications). In order to address this issue, a multitude of different approximations to Newton’s method have been proposed and studied in the literature. Quasi-Newton methods form estimates of the Hessian by successive evaluations of the gradient vectors and are computationally cheaper. Examples of such methods include DFP and BFGS schemes and also their limited memory versions (see the book for further details). A disadvantage of such approximations based on first-order information is that the associated convergence guarantees are typically much weaker than those of Newton’s method and require stronger assumptions. Under restrictions on the eigenvalues of the Hessian (strong convexity and smoothness), quasi-Newton methods typically exhibit local super-linear convergence.

In this paper, we propose and analyze a randomized approximation of Newton’s method, known as the Newton Sketch. Instead of explicitly computing the Hessian, the Newton Sketch method approximates it via a random projection of dimension $m$ . When these projections are carried out using the randomized Hadamard transform, each iteration has complexity $\mathcal{O}(nd\log(m)+dm^{2})$ . Our results show that it is always sufficient to choose $m$ proportional to $\min\{d,n\}$ , and moreover, that the sketch dimension $m$ can be much smaller for certain types of constrained problems. Thus, in the regime $n>d$ and with $m\asymp d$ , the complexity per iteration can be substantially lower than the $\mathcal{O}(nd^{2})$ complexity of each Newton step. Specifically for $n\geq d^{2}$ , the complexity of Newton Sketch per iteration is $\mathcal{O}(nd\log d)$ , which is linear in the input size ( $nd$ ) and comparable to first order methods which only access the derivative $g^{\prime}(Ax)$ . Moreover, we show that for self-concordant functions, the total complexity of obtaining a $\delta$ -optimal solution is $\mathcal{O}(nd\log d\log(1/\delta))$ , and does not depend on constants such as strong convexity or smoothness parameters unlike first order methods. On the other hand, for problems with $d>n$ , we also provide a dual strategy which effectively has the same guarantees with roles of $d$ and $n$ exchanged.

We also consider other random projection matrices and sub-sampling strategies, including partial forms of random projection that exploit known structure in the Hessian. For self-concordant functions, we provide an affine invariant analysis proving that the convergence is linear-quadratic and the guarantees are independent of the function and data, such as condition numbers of matrices involved in the objective function. Finally, we describe an interior point method to deal with arbitrary convex constraints which combines the Newton sketch with the barrier method. We provide an upper bound on the total number of iterations required to obtain a solution with a pre-specified target accuracy.

The remainder of this paper is organized as follows. We begin in Section 2 with some background on the classical form of Newton’s method, random matrices for sketching, and Gaussian widths as a measure of the size of a set. In Section 3, we formally introduce the Newton Sketch, including both fully and partially sketched versions for unconstrained and constrained problems. We provide some illustrative examples in Section 3.2 before turning to local convergence theory in Section 3.3. Section 4 is devoted to global convergence results for self-concordant functions, in both the constrained and unconstrained settings. In Section 5, we consider a number of applications and provide additional numerical results. The bulk of our proofs are given in Section 6, with some more technical aspects deferred to the appendices.

Background

We begin with some background material on the standard form of Newton’s method, various types of random sketches, and the notion of Gaussian width as a complexity measure.

In this section, we briefly review the convergence properties and complexity of the classical form of Newton’s method; see the sources for further background.

Let $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a closed, convex and twice-differentiable function that is bounded below. Given a convex set $\mathcal{C}$, we assume that the constrained minimizer $x^{*}:\,=\arg\min\limits_{x\in\mathcal{C}}f(x)$ is uniquely defined, and we define the minimum and maximum eigenvalues $\gamma=\lambda_{\min}(\nabla^{2}f(x^{*}))$ and $\beta=\lambda_{\max}(\nabla^{2}f(x^{*}))$ of the Hessian evaluated at the minimizer.

We assume moreover that the Hessian map $x\mapsto\nabla^{2}f(x)$ is Lipschitz continuous with modulus $L$, meaning that $\|\nabla^{2}f(x)-\nabla^{2}f(y)\|_{\rm op}\leq L\,\|x-y\|_{2}$ for all $x,y\in\mathcal{C}$.

Under these conditions, the local quadratic convergence of Newton’s method is classical: for instance, see Boyd and Vandenberghe for a proof. Newton’s method can be slightly modified to be globally convergent by choosing the step sizes via a simple backtracking line-search procedure.

The following result characterizes the complexity of Newton’s method when applied to self-concordant functions and is central in the development of interior point methods. We defer the definitions of self-concordance and the line-search procedure to the following sections. The number of iterations needed to obtain a $\delta$-approximate minimizer of a strictly convex self-concordant function $f$ is bounded by

where $a,b$ are constants in the line-search procedure. Typical values of these constants are $a=0.1$ and $b=0.5$.

2.2 Different types of randomized sketches

The most classical sketch is based on a random matrix $S\in\mathbb{R}^{m\times n}$ with i.i.d. standard Gaussian entries, or somewhat more generally, sketch matrices based on i.i.d. sub-Gaussian rows. In particular, a zero-mean random vector $s\in\mathbb{R}^{n}$ is $1$-sub-Gaussian if for any $u\in\mathbb{R}^{n}$, we have

For instance, a vector with i.i.d. $N(0,1)$ entries is $1$-sub-Gaussian, as is a vector with i.i.d. Rademacher entries (uniformly distributed over $\{-1,+1\}$). We use the terminology sub-Gaussian sketch to mean a random matrix $S\in\mathbb{R}^{m\times n}$ with i.i.d. rows that are zero-mean, $1$-sub-Gaussian, and with $\operatorname{cov}(s)=I_{n}$.

From a theoretical perspective, sub-Gaussian sketches are attractive because of the well-known concentration properties of sub-Gaussian random matrices. On the other hand, from a computational perspective, a disadvantage of sub-Gaussian sketches is that they require matrix-vector multiplications with unstructured random matrices. In particular, given a data matrix $A\in\mathbb{R}^{n\times d}$, computing its sketched version $SA$ requires $\mathcal{O}(mnd)$ basic operations in general (using classical matrix multiplication).

The second type of randomized sketch we consider is the randomized orthonormal system (ROS), for which matrix multiplication can be performed much more efficiently. In order to define a ROS sketch, we first let $H\in\mathbb{R}^{n\times n}$ be an orthonormal matrix with entries $H_{ij}\in[-\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}}]$. Standard classes of such matrices are the Hadamard or Fourier bases, for which matrix-vector multiplication can be performed in $\mathcal{O}(n\log n)$ time via the fast Hadamard or Fourier transforms, respectively. Based on any such matrix, a sketching matrix $S\in\mathbb{R}^{m\times n}$ from a ROS ensemble is obtained by sampling i.i.d. rows of the form $s^{T}=\sqrt{n}\,e_{j}^{T}HD$,

where the random vector $e_{j}\in\mathbb{R}^{n}$ is chosen uniformly at random from the set of all $n$ canonical basis vectors, and $D=\mbox{diag}(\nu)$ is a diagonal matrix of i.i.d. Rademacher variables $\nu\in\{-1,+1\}^{n}$. Given a fast routine for matrix-vector multiplication, the sketch $SM$ for a data matrix $M\in\mathbb{R}^{n\times d}$ can be formed in $\mathcal{O}(n\,d\log m)$ time.
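The ROS construction described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fwht` and `ros_sketch` are hypothetical, $n$ is assumed to be a power of two, each row is assumed to have the form $\sqrt{n}\,e_{j}^{T}HD$, and an extra $1/\sqrt{m}$ normalization is applied so that $(SM)^{T}SM$ equals $M^{T}M$ in expectation. For simplicity the full transform is applied, which costs $\mathcal{O}(nd\log n)$ rather than the $\mathcal{O}(nd\log m)$ of a subsampled transform.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along axis 0.

    The length of axis 0 must be a power of two.
    """
    x = np.array(x, dtype=float)          # work on a copy
    n = x.shape[0]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b            # butterfly: sums
            x[i + h:i + 2 * h] = a - b    # butterfly: differences
        h *= 2
    return x / np.sqrt(n)                 # scale so that H is orthonormal

def ros_sketch(M, m, rng):
    """Return S @ M for an m-row ROS sketch S (rows normalized by 1/sqrt(m))."""
    n = M.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)   # diagonal of D (Rademacher)
    HDM = fwht(signs[:, None] * M)            # H D M
    rows = rng.integers(0, n, size=m)         # i.i.d. uniform row indices j
    return np.sqrt(n / m) * HDM[rows]         # sqrt(n) e_j^T H D M / sqrt(m)
```

A call such as `ros_sketch(M, m, np.random.default_rng(0))` returns an $m\times d$ matrix that can stand in for $M$ wherever only $M^{T}M$ is needed.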

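Given a probability distribution $\{p_{j}\}_{j=1}^{n}$ over $[n]=\{1,\ldots,n\}$, another choice of sketch is to randomly sample the rows of a data matrix $M$ a total of $m$ times with replacement from the given probability distribution. Thus, the rows of $S$ are independent and take on the values $s^{T}=\frac{e_{j}^{T}}{\sqrt{p_{j}}}$ with probability $p_{j}$, for $j=1,\ldots,n$, where $e_{j}\in\mathbb{R}^{n}$ is the $j^{th}$ canonical basis vector.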

2.3 Gaussian widths

In this section, we introduce some background on the notion of Gaussian width, a way of measuring the size of a compact set in $\mathbb{R}^{d}$. These width measures play a key role in the analysis of randomized sketches. Given a compact subset $\mathcal{L}\subseteq\mathbb{R}^{d}$, its Gaussian width is given by $\mathcal{W}(\mathcal{L}):\,=\mathbb{E}_{g}\big[\sup_{z\in\mathcal{L}}|\langle g,\,z\rangle|\big]$,

where $g\in\mathbb{R}^{d}$ is an i.i.d. sequence of $N(0,1)$ variables. This complexity measure plays an important role in Banach space theory, learning theory and statistics.

Of particular interest in this paper are sets $\mathcal{L}$ that are obtained by intersecting a given cone $\mathcal{K}$ with the Euclidean sphere $\mathcal{S}^{d-1}=\{z\in\mathbb{R}^{d}\,\mid\,\|z\|_{2}=1\}$. It is easy to show that the Gaussian width of any such set is at most $\sqrt{d}$, but it can be substantially smaller, depending on the nature of the underlying cone. For instance, if $\mathcal{K}$ is a subspace of dimension $r<d$, then a simple calculation yields that $\mathcal{W}(\mathcal{K}\cap\mathcal{S}^{d-1})\leq\sqrt{r}$.
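The subspace bound can be checked numerically. The following is a small Monte Carlo sketch under stated assumptions: the subspace is taken to be the span of the first $r$ coordinate vectors (an illustrative choice, with no loss of generality by rotation invariance), and the function names are hypothetical. For a subspace, the supremum of $\langle g,\,z\rangle$ over unit vectors is the norm of the projection of $g$, so the width is the mean of a chi random variable with $r$ degrees of freedom.

```python
import math
import random

def width_coordinate_subspace(r, trials=20000, seed=0):
    """Monte Carlo estimate of W(K ∩ S^{d-1}) for K = span(e_1, ..., e_r).

    The supremum over unit vectors in K equals the Euclidean norm of the
    first r coordinates of g, so the ambient dimension d does not enter.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(r)))
    return total / trials

def exact_width(r):
    """Mean of a chi distribution with r degrees of freedom."""
    return math.sqrt(2.0) * math.gamma((r + 1) / 2) / math.gamma(r / 2)

estimate = width_coordinate_subspace(5)
```

Both `estimate` and `exact_width(5)` fall slightly below $\sqrt{5}$, in line with the bound $\sqrt{r}$, which follows from Jensen's inequality.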

Newton sketch and local convergence

With the basic background in place, let us now introduce the Newton sketch algorithm, and then develop a number of convergence guarantees associated with it. It applies to an optimization problem of the form $\min_{x\in\mathcal{C}}f(x)$, where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a twice-differentiable convex function, and $\mathcal{C}\subseteq\mathbb{R}^{d}$ is a convex constraint set.

Now suppose that we have available a Hessian matrix square root $\nabla^{2}f(x)^{1/2}$, that is, a matrix of dimensions $n\times d$ such that $(\nabla^{2}f(x)^{1/2})^{T}\nabla^{2}f(x)^{1/2}=\nabla^{2}f(x)$.

In many cases, such a matrix square root can be computed efficiently. For instance, consider a function of the form $f(x)=g(Ax)$ where $A\in\mathbb{R}^{n\times d}$, and the function $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ has the separable form $g(Ax)=\sum_{i=1}^{n}g_{i}(\langle a_{i},\,x\rangle)$. In this case, a suitable Hessian matrix square root is given by the $n\times d$ matrix $\nabla^{2}f(x)^{1/2}:\,=\mbox{diag}\big\{g_{i}^{\prime\prime}(\langle a_{i},\,x\rangle)^{1/2}\big\}_{i=1}^{n}A$. In Section 3.2, we discuss various concrete instantiations of such functions.

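In terms of this notation, the ordinary Newton update can be re-written as $x^{t+1}=\arg\min\limits_{x\in\mathcal{C}}\big\{\frac{1}{2}\|\nabla^{2}f(x^{t})^{1/2}(x-x^{t})\|_{2}^{2}+\langle\nabla f(x^{t}),\,x-x^{t}\rangle\big\}$, whereas the Newton sketch update takes the form $x^{t+1}=\arg\min\limits_{x\in\mathcal{C}}\big\{\frac{1}{2}\|S^{t}\nabla^{2}f(x^{t})^{1/2}(x-x^{t})\|_{2}^{2}+\langle\nabla f(x^{t}),\,x-x^{t}\rangle\big\}$,

where $S^{t}\in\mathbb{R}^{m\times n}$ is an independent realization of a sketching matrix. When the problem is unconstrained, i.e., $\mathcal{C}=\mathbb{R}^{d}$, and the matrix $(S^{t}\nabla^{2}f(x^{t})^{1/2})^{T}S^{t}\nabla^{2}f(x^{t})^{1/2}$ is invertible, the Newton sketch update takes the simpler form $x^{t+1}=x^{t}-\big((S^{t}\nabla^{2}f(x^{t})^{1/2})^{T}S^{t}\nabla^{2}f(x^{t})^{1/2}\big)^{-1}\nabla f(x^{t})$.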

In this paper, we also analyze a partially sketched Newton update, which takes the following form. Given an additive decomposition of the form $f=f_{0}+g$, we perform a sketch of the Hessian $\nabla^{2}f_{0}$ while retaining the exact form of the Hessian $\nabla^{2}g$. This leads to the partially sketched update $x^{t+1}=\arg\min\limits_{x\in\mathcal{C}}\big\{\frac{1}{2}(x-x^{t})^{T}Q^{t}(x-x^{t})+\langle\nabla f(x^{t}),\,x-x^{t}\rangle\big\}$,

where $Q^{t}:\,=(S^{t}\nabla^{2}f_{0}(x^{t})^{1/2})^{T}S^{t}\nabla^{2}f_{0}(x^{t})^{1/2}+\nabla^{2}g(x^{t})$ .

For either the fully sketched (6) or partially sketched updates (8), our analysis shows that there are many settings in which the sketch dimension $m$ can be chosen to be substantially smaller than $n$ , in which cases the sketched Newton updates will be much cheaper than a standard Newton update. For instance, the unconstrained update (7) can be computed in at most $\mathcal{O}(md^{2})$ time, as opposed to the $\mathcal{O}(nd^{2})$ time of the standard Newton update. In constrained settings, we show that the sketch dimension $m$ can often be chosen even smaller—even $m\ll d$ —which leads to further savings.
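The unconstrained sketched update can be illustrated with a short NumPy sketch. Several choices here are assumptions made for illustration only: the function name `newton_sketch_step` is hypothetical, a Gaussian sketch with $1/\sqrt{m}$ normalization is used (which costs $\mathcal{O}(mnd)$ to apply and therefore shows the mechanics rather than the speed-up), a unit step is taken with no line search, and the demonstration problem is ordinary least squares, for which the data matrix itself is a Hessian square root.

```python
import numpy as np

def newton_sketch_step(x, grad, hess_sqrt, m, rng):
    """One unconstrained Newton sketch step with a Gaussian sketch.

    grad(x) returns the gradient, shape (d,); hess_sqrt(x) returns an
    n x d matrix B with B.T @ B equal to the Hessian at x.
    """
    B = hess_sqrt(x)
    n = B.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # E[S.T @ S] = I_n
    SB = S @ B                                     # m x d sketched square root
    return x - np.linalg.solve(SB.T @ SB, grad(x))

# Demo: f(x) = 0.5 * ||A x - y||^2, whose Hessian square root is A.
rng = np.random.default_rng(0)
n, d, m = 2000, 10, 400
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, y, rcond=None)[0]      # exact minimizer

x = np.zeros(d)
errors = [np.linalg.norm(x - x_star)]
for _ in range(20):
    x = newton_sketch_step(x, lambda z: A.T @ (A @ z - y), lambda z: A, m, rng)
    errors.append(np.linalg.norm(x - x_star))
```

Because the objective is quadratic, the Lipschitz constant of the Hessian is zero and only the linear part of the rate is visible: `errors` shrinks geometrically, with a fresh sketch drawn at every iteration.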

3.2 Some examples

In order to provide some intuition, let us provide some simple examples to which the sketched Newton updates can be applied.

Consider a linear program (LP) in the standard form $\min_{Ax\leq b}\langle c,\,x\rangle$,

where $A\in\mathbb{R}^{n\times d}$ is a given constraint matrix. We assume that the polytope $\{x\in\mathbb{R}^{d}\,\mid Ax\leq b\}$ is bounded so that the minimum is achieved. A barrier method approach to this LP is based on solving a sequence of problems of the form $\min_{x\in\mathbb{R}^{d}}\big\{f(x):\,=\tau\langle c,\,x\rangle-\sum_{i=1}^{n}\log(b_{i}-\langle a_{i},\,x\rangle)\big\}$,

where $a_{i}\in\mathbb{R}^{d}$ denotes the $i^{th}$ row of $A$, and $\tau>0$ is a weight parameter that is adjusted during the algorithm. By inspection, the function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\}$ is twice-differentiable, and its Hessian is given by $\nabla^{2}f(x)=A^{T}\mbox{diag}\big\{\frac{1}{(b_{i}-\langle a_{i},\,x\rangle)^{2}}\big\}A$. A Hessian square root is given by $\nabla^{2}f(x)^{1/2}:\,=\mbox{diag}\left(\frac{1}{|b_{i}-\langle a_{i},\,x\rangle|}\right)A$, which allows us to compute a sketched version of the Hessian square root $S\nabla^{2}f(x)^{1/2}=S\,\mbox{diag}\left(\frac{1}{|b_{i}-\langle a_{i},\,x\rangle|}\right)A$.

With a ROS sketch matrix, computing this matrix requires $\mathcal{O}(nd\log(m))$ basic operations. The complexity of each Newton sketch iteration scales as $\mathcal{O}(md^{2})$ , where $m$ is at most $d$ . In contrast, the standard unsketched form of the Newton update has complexity $\mathcal{O}(nd^{2})$ , so that the sketched method is computationally cheaper whenever there are more constraints than dimensions ( $n>d$ ).

By increasing the barrier parameter $\tau$ , we obtain a sequence of solutions that approach the optimum to the LP, which we refer to as the central path. As a simple illustration, Figure 1 compares the central paths generated by the ordinary and sketched Newton updates for a polytope defined by $n=32$ constraints in dimension $d=2$ . Each row shows three independent trials of the method for a given sketch dimension $m$ ; the top, middle and bottom rows correspond to sketch dimensions $m\in\{d,4d,16d\}$ respectively. Note that as the sketch dimension $m$ is increased, the central path taken by the sketched updates converges to the standard central path.

As a second example, we consider the problem of maximum likelihood estimation for generalized linear models.

The class of generalized linear models (GLMs) is used to model a wide variety of prediction and classification problems, in which the goal is to predict some output variable $y\in\mathcal{Y}$ on the basis of a covariate vector $a\in\mathbb{R}^{d}$. It includes as special cases the standard linear Gaussian model (in which $\mathcal{Y}=\mathbb{R}$), logistic models for classification (in which $\mathcal{Y}=\{-1,+1\}$), and Poisson models for count-valued responses (in which $\mathcal{Y}=\{0,1,2,\ldots\}$). See the book for further details and applications.

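Given a collection of $n$ observations $\{(y_{i},a_{i})\}_{i=1}^{n}$ of response-covariate pairs from some GLM, the problem of constrained maximum likelihood estimation can be written in the form $\min_{x\in\mathcal{C}}\big\{f(x):\,=\sum_{i=1}^{n}\psi(\langle a_{i},\,x\rangle,\,y_{i})\big\}$,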

where $\psi:\mathbb{R}\times\mathcal{Y}\rightarrow\mathbb{R}$ is a given convex function, and $\mathcal{C}\subset\mathbb{R}^{d}$ is a convex constraint set, chosen by the user to enforce a certain type of structure in the solution. Important special cases of GLMs include the linear Gaussian model, in which $\psi(u,y)=\frac{1}{2}(y-u)^{2}$, and the problem (10) corresponds to a regularized form of least-squares, as well as the problem of logistic regression, obtained by setting $\psi(u,y)=\log(1+\exp(-yu))$.

Letting $A\in\mathbb{R}^{n\times d}$ denote the data matrix with $a_{i}\in\mathbb{R}^{d}$ as its $i^{th}$ row, the Hessian of the objective (10) takes the form $\nabla^{2}f(x)=A^{T}\mbox{diag}\left(\psi^{\prime\prime}(a_{i}^{T}x)\right)A$.

Since the function $\psi$ is convex, we are guaranteed that $\psi^{\prime\prime}(a_{i}^{T}x)\geq 0$ , and hence the quantity $\mbox{diag}\left(\psi^{\prime\prime}(a_{i}^{T}x)\right)^{1/2}A$ can be used as an $n\times d$ matrix square-root. We return to explore this class of examples in more depth in Section 5.1.
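For logistic regression this square root has a closed form, sketched below in NumPy. The function name `logistic_hessian_sqrt` is hypothetical, and the labels are assumed to take values in $\{-1,+1\}$, so that the second derivative of $\psi(u,y)=\log(1+\exp(-yu))$ in $u$ is $\sigma(yu)(1-\sigma(yu))$ with $\sigma$ the logistic sigmoid.

```python
import numpy as np

def logistic_hessian_sqrt(A, y, x):
    """n x d square root of the Hessian of sum_i log(1 + exp(-y_i <a_i, x>)).

    Assumes y has entries in {-1, +1}.
    """
    u = y * (A @ x)
    p = 1.0 / (1.0 + np.exp(-u))      # sigmoid(y_i <a_i, x>)
    w = p * (1.0 - p)                 # psi''(<a_i, x>, y_i) >= 0
    return np.sqrt(w)[:, None] * A    # diag(w)^{1/2} A
```

Given `B = logistic_hessian_sqrt(A, y, x)`, the product `B.T @ B` reproduces the $d\times d$ Hessian, while `B` itself is the $n\times d$ matrix to which a sketch would be applied.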

3.3 Local convergence analysis using strong convexity

Returning to the general setting, we begin by proving a local convergence guarantee for the sketched Newton updates. In particular, this theorem provides insight into how large the sketch dimension $m$ must be in order to guarantee good local behavior of the sketched Newton algorithm.

This choice of sketch dimension is determined by the geometry of the problem, in particular in terms of the tangent cone defined by the optimum. Given a constraint set $\mathcal{C}$ and the minimizer $x^{*}:\,=\arg\min\limits_{x\in\mathcal{C}}f(x)$, the tangent cone at $x^{*}$ is given by $\mathcal{K}:\,=\big\{\Delta\in\mathbb{R}^{d}\,\mid\,x^{*}+t\Delta\in\mathcal{C}\ \mbox{for some}\ t>0\big\}$.

Recalling the definition of the Gaussian width from Section 2.3, our first main result requires the sketch dimension to satisfy a lower bound of the form

where $\epsilon\in(0,1)$ is a user-defined tolerance, and $c$ is a universal constant. Since the Hessian square-root $\nabla^{2}f(x)^{1/2}$ has dimensions $n\times d$, this squared Gaussian width is at most $\min\{n,d\}$. This worst-case bound is achieved for an unconstrained problem (in which case $\mathcal{K}=\mathbb{R}^{d}$), but the Gaussian width can be substantially smaller for constrained problems. See the example following Theorem 1 for an illustration.

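In addition to this Gaussian width, our analysis depends on the cone-constrained eigenvalues of the Hessian $\nabla^{2}f(x^{*})$, which are defined as $\gamma:\,=\inf\limits_{z\in\mathcal{K}\cap\mathcal{S}^{d-1}}\langle z,\,\nabla^{2}f(x^{*})z\rangle$ and $\beta:\,=\sup\limits_{z\in\mathcal{K}\cap\mathcal{S}^{d-1}}\langle z,\,\nabla^{2}f(x^{*})z\rangle$.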

In the unconstrained case ($\mathcal{C}=\mathbb{R}^{d}$), we have $\mathcal{K}=\mathbb{R}^{d}$, so that $\gamma$ and $\beta$ reduce to the minimum and maximum eigenvalues of the Hessian $\nabla^{2}f(x^{*})$. In the classical analysis of Newton’s method, these quantities measure the strong convexity and smoothness parameters of the function $f$.

With this set-up, the following theorem is applicable to any twice-differentiable objective $f$ with cone-constrained eigenvalues $(\gamma,\beta)$ defined in equation (13), and with Hessian that is $L$ -Lipschitz continuous, as defined in equation (2).

The bound (14) shows that when $\epsilon$ is set to a fixed constant—say $\epsilon=1/4$ —the algorithm displays a linear-quadratic convergence rate in terms of the error $\Delta^{t}=x^{t}-x^{*}$ . More specifically, the rate is initially quadratic—that is, $\|\Delta^{t+1}\|_{2}\approx\frac{4L}{\gamma}\|\Delta^{t}\|_{2}^{2}$ when $\|\Delta^{t}\|_{2}$ is large. However, as the iterations progress and $\|\Delta^{t}\|_{2}$ becomes substantially less than 1, then the rate becomes linear—meaning that $\|\Delta^{t+1}\|_{2}\approx\epsilon\frac{\beta}{\gamma}\|\Delta^{t}\|_{2}$ —since the term $\frac{4L}{\gamma}\|\Delta^{t}\|_{2}^{2}$ becomes negligible compared to $\epsilon\frac{\beta}{\gamma}\|\Delta^{t}\|_{2}$ . If we perform $N$ steps in total, the linear rate guarantees the conservative error bounds
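The two regimes can be seen by iterating the scalar recursion $e_{t+1}=\epsilon\frac{\beta}{\gamma}e_{t}+\frac{4L}{\gamma}e_{t}^{2}$ built from the two terms discussed above. The following sketch treats the bound as an equality, and the parameter values are illustrative assumptions chosen so that the quadratic term dominates at the start; they are not taken from the paper's experiments.

```python
def linear_quadratic_bound(e0, eps, gamma, beta, L, steps):
    """Iterate e_{t+1} = eps*(beta/gamma)*e_t + (4L/gamma)*e_t**2; return all e_t."""
    errs = [e0]
    for _ in range(steps):
        e = errs[-1]
        errs.append(eps * (beta / gamma) * e + (4.0 * L / gamma) * e * e)
    return errs

# Illustrative values: linear coefficient 0.05, quadratic coefficient 0.5.
errs = linear_quadratic_bound(e0=1.0, eps=0.05, gamma=1.0, beta=1.0, L=0.125, steps=12)
ratios = [b / a for a, b in zip(errs, errs[1:])]
```

The successive ratios in `ratios` start well above $\epsilon\beta/\gamma$, where the quadratic term drives the decrease, and then settle at $\epsilon\beta/\gamma$ once the error is small, which is the linear phase.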

A notable feature of Theorem 1 is that, depending on the structure of the problem, the linear-quadratic convergence can be obtained using a sketch dimension $m$ that is substantially smaller than $\min\{n,d\}$ . As an illustrative example, we performed simulations for some instantiations of a portfolio optimization problem: it is a linearly-constrained quadratic program of the form

where $A\in\mathbb{R}^{n\times d}$ and $c\in\mathbb{R}^{d}$ are empirically estimated matrices and vectors (see Section 5.3 for more details). We used the Newton sketch to solve different sizes of this problem $d\in\{10,20,30,40,50,60\}$, and with $n=d^{3}$ in each case. Each problem was constructed so that the optimum $x^{*}$ had at most $s=\lceil 2\log(d)\rceil$ non-zero entries. A calculation of the Gaussian width for this problem (see Appendix C for the details) shows that it suffices to take a sketch dimension $m\succsim s\log d$, and we implemented the algorithm with this choice.

Figure 2 shows the convergence rate of the Newton sketch algorithm for the six different problem sizes: consistent with our theory, the sketch dimension $m\ll\min\{d,n\}$ suffices to guarantee linear convergence in all cases.

It is also possible to obtain an asymptotically super-linear rate by using an iteration-dependent sketching accuracy $\epsilon=\epsilon(t)$. The following corollary summarizes one such possible guarantee:

Consider the Newton sketch iterates using the iteration-dependent sketching accuracy $\epsilon(t)=\frac{1}{\log(1+t)}$ . Then with the same probability as in Theorem 1, we have

and consequently, super-linear convergence is obtained—namely, $\lim_{t\rightarrow\infty}~{}\frac{\|x^{t+1}-x^{*}\|_{2}}{\|x^{t}-x^{*}\|_{2}}=0$ .

Note that the price for this super-linear convergence is that the sketch size is inflated by the factor $\epsilon^{-2}(t)=\log^{2}(1+t)$ , so it is only logarithmic in the iteration number.

Newton sketch for self-concordant functions

The analysis and complexity estimates given in the previous section involve the curvature constants $(\gamma,\beta)$ and the Lipschitz constant $L$ , which are seldom known in practice. Moreover, as with the analysis of classical Newton method, the theory is local, in that the linear-quadratic convergence takes place once the iterates enter a suitable basin of the origin.

In this section, we seek to obtain global convergence results that do not depend on unknown problem parameters. As in the classical analysis, the appropriate setting in which to seek such results is for self-concordant functions, and using an appropriate form of backtracking line search. We begin by analyzing the unconstrained case, and then discuss extensions to constrained problems with self-concordant barriers. In each case, we show that given a suitable lower bound on the sketch dimension, the sketched Newton updates can be equipped with global convergence guarantees that hold with exponentially high probability. Moreover, the total number of iterations does not depend on any unknown constants such as strong convexity and Lipschitz parameters.

In this section, we consider the unconstrained optimization problem $\min_{x\in\mathbb{R}^{d}}f(x)$, where $f$ is a closed convex self-concordant function which is bounded below. Note that a closed convex function $\phi:\mathbb{R}\rightarrow\mathbb{R}$ is self-concordant if $|\phi^{\prime\prime\prime}(x)|\leq 2\left(\phi^{\prime\prime}(x)\right)^{3/2}$.

This definition can be extended to a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ by imposing this requirement on the univariate functions $\phi_{x,y}(t):\,=f(x+ty)$, for all choices of $x,y$ in the domain of $f$. Examples of self-concordant functions include linear and quadratic functions and the negative logarithm. Self-concordance is preserved under addition and affine transformations.
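The univariate condition is easy to test pointwise. The sketch below assumes the standard form of the definition, $|\phi^{\prime\prime\prime}(x)|\leq 2\,\phi^{\prime\prime}(x)^{3/2}$, and uses a hypothetical helper name; derivatives are supplied analytically, and a small relative tolerance absorbs floating-point rounding in the equality case.

```python
import math

def is_self_concordant_at(d2, d3, x, rtol=1e-9):
    """Check |phi'''(x)| <= 2 * phi''(x)**1.5 at a single point x."""
    return abs(d3(x)) <= 2.0 * d2(x) ** 1.5 * (1.0 + rtol)

# phi(x) = -log(x) on x > 0: phi'' = 1/x^2 and phi''' = -2/x^3, so equality holds.
neg_log_ok = all(
    is_self_concordant_at(lambda x: 1.0 / x**2, lambda x: -2.0 / x**3, x)
    for x in (0.5, 1.0, 4.0)
)

# phi(x) = exp(x): phi'' = phi''' = exp(x), so the inequality fails once
# exp(x) < 1/4, for example at x = -3.
exp_ok_at_minus_3 = is_self_concordant_at(math.exp, math.exp, -3.0)
```

Here `neg_log_ok` is `True` and `exp_ok_at_minus_3` is `False`, matching the fact that the negative logarithm is self-concordant while the exponential is not self-concordant on the whole real line.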

Our main result provides a bound on the total number of Newton sketch iterations required to obtain a $\delta$-accurate solution without imposing any sort of initialization condition (as was done in our previous analysis). This bound scales proportionally to $\log(1/\delta)$ and inversely in a parameter $\nu$ that depends on the sketching accuracy $\epsilon\in(0,\frac{1}{4})$ and backtracking parameters $(a,b)$ via

Let $f$ be a strictly convex self-concordant function. Given a sketching matrix $S\in\mathbb{R}^{m\times n}$ with $m\geq\frac{c_{3}}{\epsilon^{2}}\max_{x\in\mathcal{C}}\operatorname{rank}(\nabla^{2}f(x))=\frac{c_{3}}{\epsilon^{2}}\,d$, the number of total iterations $T$ for obtaining a $\delta$-approximate solution in function value via Algorithm 1 is bounded by

with probability at least $1-c_{1}Ne^{-c_{2}m}$ .

The bound in the above theorem shows that the convergence of the Newton Sketch is independent of the properties of the function $f$ and problem parameters, similar to classical Newton’s method. Note that for problems with $n>d$, the complexity of each Newton sketch step is at most $\mathcal{O}(d^{3}+nd\log d)$, which is smaller than that of Newton’s method ($\mathcal{O}(nd^{2})$) and, whenever $n>d^{2}$, comparable up to a logarithmic factor to the $\mathcal{O}(nd)$ per-iteration cost of typical first-order optimization methods.

4.2 Newton Sketch with self-concordant barriers

We now turn to the more general constrained case. Given a closed, convex self-concordant function $f_{0}:\mathbb{R}^{d}\rightarrow\mathbb{R}$, let $\mathcal{C}$ be a convex subset of $\mathbb{R}^{d}$, and consider the constrained optimization problem $\min_{x\in\mathcal{C}}f_{0}(x)$. If we are given a convex self-concordant barrier function $g$ for the constraint set $\mathcal{C}$, it is equivalent to consider the unconstrained problem $\min_{x\in\mathbb{R}^{d}}\big\{f(x):\,=f_{0}(x)+g(x)\big\}$.

In all such cases, an attractive strategy is to apply a partial Newton sketch, in which we sketch the Hessian term $\nabla^{2}f_{0}(x)$ and retain the exact Hessian $\nabla^{2}g(x)$ , as in the previously described updates (8). More formally, Algorithm 2 provides a summary of the steps, including the choice of the line search parameters. The main result of this section provides a guarantee on this algorithm, assuming that the sequence of sketch dimensions $\{m^{t}\}_{t=0}^{\infty}$ is appropriately chosen.

The choice of sketch dimensions depends on the tangent cones defined by the iterates, namely the sets

For a given sketch accuracy $\epsilon\in(0,1)$ , we require that the sequence of sketch dimensions satisfies the lower bound

Finally, the reader should recall the parameter $\nu$ was defined in equation (18), which depends only on the sketching accuracy $\epsilon$ and the line search parameters. Given this set-up, we have the following guarantee:

Let $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a convex and self-concordant function, and let $g:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a convex and self-concordant barrier for the convex set $\mathcal{C}$. Suppose that we implement Algorithm 2 with sketch dimensions $\{m^{t}\}_{t\geq 0}$ satisfying the lower bound (19). Then taking

suffices to obtain a $\delta$-approximate solution in function value with probability at least $1-c_{1}Ne^{-c_{2}m}$.

Thus, we see that the Newton Sketch method can also be used with self-concordant barrier functions, which considerably extends its scope. Section 5.5 provides a numerical illustration of its performance in this context. As we discuss in the next section, there is a flexibility in choosing the decomposition $f_{0}$ and $g$ corresponding to objective and barrier, which enables us to also sketch the constraints.

4.3 Sketching with interior point methods

In this section, we discuss the application of Newton Sketch to a form of barrier or interior point methods. In particular we discuss two different strategies and provide rigorous worst-case complexity results when the functions in the objective and constraints are self-concordant. More precisely, let us consider a problem of the form $\min_{x\in\mathbb{R}^{d}}f_{0}(x)$ subject to $g_{j}(x)\leq 0$ for $j=1,\ldots,r$,

where $f_{0}$ and $\{g_{j}\}_{j=1}^{r}$ are twice-differentiable convex functions. We assume that there exists a unique solution $x^{*}$ to the above problem.

The barrier method for computing $x^{*}$ is based on solving a sequence of problems of the form $\widehat{x}(\tau):\,=\arg\min\limits_{x\in\mathbb{R}^{d}}\big\{\tau f_{0}(x)-\sum_{j=1}^{r}\log(-g_{j}(x))\big\}$,

for increasing values of the parameter $\tau\geq 1$. The family of solutions $\{\widehat{x}(\tau)\}_{\tau\geq 1}$ traces out what is known as the central path. A standard bound on the sub-optimality of $\widehat{x}(\tau)$ is given by $f_{0}(\widehat{x}(\tau))-f_{0}(x^{*})\leq\frac{r}{\tau}$.

The barrier method successively updates the penalty parameter $\tau$ and also the starting points supplied to Newton’s method using previous solutions.

Since Newton’s method lies at the heart of the barrier method, we can obtain a fast version by replacing the exact Newton minimization with the Newton sketch. Algorithm 3 provides a precise description of this strategy. As noted in Step 1, there are two different strategies in dealing with the convex constraints $g_{j}(x)\leq 0$ for $j=1,\ldots,r$ :

Full sketch: Sketch the full Hessian of the objective function (21) using Algorithm 1.

Partial sketch: Sketch only the Hessians corresponding to a subset of the functions $\{f_{0},g_{j},j=1,\ldots,r\}$ , and use exact Hessians for the other functions. Apply Algorithm 2.

As shown by our theory, either approach leads to the same convergence guarantees, but the associated computational complexity can vary depending both on how data enters the objective and constraints, as well as the Hessian structure arising from particular functions. The following theorem is an application of the classical results on the barrier method tailored for Newton Sketch using any of the above strategies. As before, the key parameter $\nu$ was defined in Theorem 2.

For a given target accuracy $\delta\in(0,1)$ and any $\mu>1$ , the total number of Newton Sketch iterations required to obtain a $\delta$ -accurate solution using Algorithm 3 is at most

If the parameter $\mu$ is set to minimize the above upper-bound, the choice $\mu=1+\frac{1}{\sqrt{r}}$ yields $\mathcal{O}(\sqrt{r})$ iterations. However, when applying the standard Newton method, this “optimal” choice is typically not used in practice: instead, it is common to use a fixed value of $\mu\in$ . In experiments, experience suggests that the number of Newton iterations needed is a constant independent of $r$ and other parameters. Theorem 4 allows us to obtain faster interior point solvers with rigorous worst-case complexity results. We show different applications of Algorithm 3 in the following section.
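The dependence of the outer loop on $\mu$ can be made concrete with a few lines of Python. This sketch rests on an assumption that is not derived here: it uses the textbook sub-optimality gap $r/\tau$ for a logarithmic barrier with $r$ constraints, and it counts only the updates $\tau\leftarrow\mu\tau$, not the Newton sketch iterations performed inside each one. The function names are hypothetical.

```python
import math

def outer_iterations(r, tau0, delta, mu):
    """Count updates tau <- mu * tau until the assumed gap r / tau is <= delta."""
    tau, count = tau0, 0
    while r / tau > delta:
        tau *= mu
        count += 1
    return count

def outer_iterations_closed_form(r, tau0, delta, mu):
    """Closed form of the same count: ceil(log(r / (tau0 * delta)) / log(mu))."""
    return max(0, math.ceil(math.log(r / (tau0 * delta)) / math.log(mu)))

r, tau0, delta = 50, 1.0, 1e-3
counts = {mu: outer_iterations(r, tau0, delta, mu)
          for mu in (1.0 + 1.0 / math.sqrt(r), 2.0, 10.0)}
```

A value of $\mu$ close to one needs many more outer updates than a fixed moderate value, since $\log\mu\approx 1/\sqrt{r}$ when $\mu=1+1/\sqrt{r}$; the trade-off is that each update then moves the barrier problem only slightly, so fewer inner iterations are needed per update.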

Applications and numerical results

In this section, we discuss some applications of the Newton sketch to different optimization problems. In particular, we show various forms of Hessian structure that arise in applications, and how the Newton sketch can be computed. When the objective and/or the constraints contain more than one term, the barrier method with Newton Sketch has some flexibility in sketching. We discuss the choices of partial Hessian sketching strategy in the barrier method. It is also possible to apply the sketch in the primal or dual form, and we provide illustrations of both strategies here.

Suppose that we apply the Newton sketch algorithm to the optimization problem (10). Given the current iterate $x^{t}$ , computing the next iterate $x^{t+1}$ requires solving the constrained quadratic program

Note that we are always guaranteed that $\gamma^{-}_{s}(A)\geq\lambda_{\min}(A^{T}A)$ . It also involves certain quantities that depend on the function $\psi$ , namely

where $a_{i}\in\mathbb{R}^{d}$ is the $i^{th}$ row of $A$. With this set-up, supposing that the optimal solution $x^{*}$ has cardinality $\|x^{*}\|_{0}\leq s$, it can be shown (see Lemma 8 in Appendix C) that it suffices to take a sketch size

where $c_{0}$ is a universal constant. Let us consider some examples to illustrate:

Least-Squares regression: $\psi(u)=\frac{1}{2}u^{2}$ , $\psi^{\prime\prime}(u)=1$ and $\psi^{\prime\prime}_{\min}=\psi^{\prime\prime}_{\max}=1$ .

Poisson regression: $\psi(u)=e^{u}$ , $\psi^{\prime\prime}(u)=e^{u}$ and $\frac{\psi^{\prime\prime}_{\max}}{\psi^{\prime\prime}_{\min}}=\frac{e^{RA_{\max}}}{e^{-RA_{\min}}}$

Logistic regression: $\psi(u)=\log(1+e^{u})$ , $\psi^{\prime\prime}(u)=\frac{e^{u}}{(e^{u}+1)^{2}}$ and $\frac{\psi^{\prime\prime}_{\max}}{\psi^{\prime\prime}_{\min}}=\frac{e^{RA_{\min}}}{e^{-RA_{\max}}}\frac{(e^{-RA_{\max}}+1)^{2}}{(e^{RA_{\min}}+1)^{2}}$ ,

where $A_{\max}:\,=\max\limits_{i=1,\ldots,n}\|a_{i}\|_{\infty}$ , and $A_{\min}:\,=\min\limits_{i=1,\ldots,n}\|a_{i}\|_{\infty}$ .
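To get a feel for how the curvature ratio $\psi^{\prime\prime}_{\max}/\psi^{\prime\prime}_{\min}$ behaves in the three examples above, the following sketch evaluates $\psi^{\prime\prime}$ on a grid over an interval $[-B,B]$. The bound `bound` (standing in for a bound on $|\langle a_{i},x\rangle|$), the grid size, and the helper names are illustrative assumptions and do not come from the analysis itself.

```python
import numpy as np

# Second derivatives psi''(u) for the three generalized linear models above.
PSI2 = {
    "least_squares": lambda u: np.ones_like(u),
    "poisson": lambda u: np.exp(u),
    "logistic": lambda u: np.exp(u) / (1.0 + np.exp(u)) ** 2,
}

def curvature_ratio(name, bound, num=2001):
    """Grid estimate of max psi'' / min psi'' over the interval [-bound, bound]."""
    u = np.linspace(-bound, bound, num)
    vals = PSI2[name](u)
    return vals.max() / vals.min()

for name in PSI2:
    print(name, curvature_ratio(name, bound=2.0))
```

The least-squares ratio is identically one, whereas the Poisson ratio grows exponentially in the bound, which is why the sketch size depends on the range of the linear predictions for that model.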

For typical distributions of the data matrices, the sketch size choice given in equation (25) is $\mathcal{O}(s\log d)$. As an example, consider data matrices $A\in\mathbb{R}^{n\times d}$ where each row is independently sampled from a sub-Gaussian distribution with variance $1$. Then standard results on random matrices show that $\gamma^{-}_{s}(A)>1/2$ as long as $n>c_{1}s\log d$ for a sufficiently large constant $c_{1}$. In addition, we have $\max\limits_{j=1,\ldots,d}\|A_{j}\|_{2}^{2}=\mathcal{O}(n)$, as well as $\frac{\psi^{\prime\prime}_{\max}}{\psi^{\prime\prime}_{\min}}=\mathcal{O}(\log(n))$. For such problems, the per-iteration complexity of the Newton Sketch update scales as $\mathcal{O}(s^{2}d\log^{2}(d))$ using standard Lasso solvers (e.g., ) or as $\mathcal{O}(sd\log(d))$ using projected gradient descent. Both of these scalings are substantially smaller than those of conventional algorithms that fail to exploit the small intrinsic dimension of the tangent cone.

2 Semidefinite programs

Here the term $\operatorname{trace}(X)$, along with its multiplicative pre-factor $\lambda>0$ that can be adjusted by the user, is a regularization term for encouraging a relatively low-rank solution. Using the standard self-concordant barrier $X\mapsto-\log\det(X)$ for the PSD cone, the barrier method involves solving a sequence of sub-problems of the form

Now the Hessian of the function $\mbox{vec}(X)\mapsto f(\mbox{vec}(X))$ is a $d^{2}\times d^{2}$ matrix given by

where $A_{ij}:\,=(a_{i}-a_{j})(a_{i}-a_{j})^{T}$. Then we can apply the barrier method with a partial Hessian sketch of the first term, $\{S_{ij}\mbox{vec}(A_{ij})\}_{i\neq j}$, and the exact Hessian for the second term. Since the vectorized decision variable is $\mbox{vec}(X)\in\mathbb{R}^{d^{2}}$, the complexity of Newton Sketch is $\mathcal{O}(m^{2}d^{2})$, while the complexity of a classical SDP interior-point solver is $\mathcal{O}(nd^{4})$.

3 Portfolio optimization and SVMs

Here we consider the Markowitz formulation of the portfolio optimization problem. The objective is to find $x\in\mathbb{R}^{d}$ belonging to the unit simplex, which corresponds to non-negative weights associated with each of $d$ possible assets, so as to maximize the expected return minus a coefficient times the variance of the return. Let $\mu\in\mathbb{R}^{d}$ denote the vector of mean returns of the assets, and let $\Sigma\in\mathbb{R}^{d\times d}$ be the symmetric, positive semidefinite covariance matrix of the returns. The optimization problem is given by

The covariance of returns is often estimated from past stock data via empirical covariance, $\Sigma=A^{T}A$ where the columns of $A$ are time series corresponding to assets normalized by $\sqrt{n}$ , where $n$ is the length of the observation window.

The barrier method can be used to solve the above problem by solving penalized problems of the form

where $e_{i}\in\mathbb{R}^{d}$ is the $i^{th}$ element of the canonical basis and $1$ is the row vector of all ones. Then the Hessian of the above barrier penalized formulation can be written as

Consequently, we can sketch the data-dependent part of the Hessian via $\tau\lambda SA$, which has rank at most $m$, and keep the remaining terms in the Hessian exact. Since the matrix $11^{T}$ is rank one, the resulting sketched estimate is diagonal plus rank $(m+1)$, so the matrix inversion lemma can be applied for efficient computation of the Newton Sketch update (see e.g. ). Therefore, as long as $m\leq d$, the complexity per iteration scales as $\mathcal{O}(md^{2})$, which is cheaper than the $\mathcal{O}(nd^{2})$ per-step complexity associated with classical interior point methods. We also note that support vector machine classification problems with squared hinge loss have the same form as in (26) (see e.g. ), so the same strategy can be applied.
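The diagonal-plus-low-rank solve mentioned above can be sketched as follows. This is a minimal illustration of the matrix inversion lemma, not the authors' implementation: the function name `solve_diag_plus_lowrank`, the generic factor `U` (which in the portfolio setting would collect the sketched data term and the all-ones direction, giving $k=m+1$ columns), and the random test data are all assumptions made for the example.

```python
import numpy as np

def solve_diag_plus_lowrank(diag, U, g):
    """Solve (diag(diag) + U @ U.T) v = g via the matrix inversion lemma.

    diag : (d,) strictly positive entries of the diagonal part
    U    : (d, k) low-rank factor
    g    : (d,) right-hand side
    """
    Dinv_g = g / diag
    Dinv_U = U / diag[:, None]
    k = U.shape[1]
    # Only a small k x k system has to be factorized: I_k + U^T D^{-1} U.
    cap = np.eye(k) + U.T @ Dinv_U
    return Dinv_g - Dinv_U @ np.linalg.solve(cap, U.T @ Dinv_g)

# Hypothetical usage: compare against a dense solve on random data.
rng = np.random.default_rng(0)
d, k = 400, 21
diag = rng.uniform(1.0, 2.0, size=d)
U = rng.standard_normal((d, k))
g = rng.standard_normal(d)
v_fast = solve_diag_plus_lowrank(diag, U, g)
v_direct = np.linalg.solve(np.diag(diag) + U @ U.T, g)
print(np.max(np.abs(v_fast - v_direct)))
```

Once `U` is available, the solve itself costs $\mathcal{O}(dk^{2})$ operations, since the $d\times d$ matrix is never formed or factorized.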

4 Unconstrained logistic regression with $d\ll n$

Let us now turn to some numerical comparisons of the Newton Sketch with other popular optimization methods for large-scale instances of logistic regression. More specifically, we generated a feature matrix $A\in\mathbb{R}^{n\times d}$ based on $d=100$ features and $n=16384$ observations. Each row $a_{i}\in\mathbb{R}^{d}$ was generated from the $d$-variate Gaussian distribution $N(0,\Sigma)$ where $\Sigma_{ij}=2\,(0.99)^{|i-j|}$. As shown in Figure 3, the convergence of the Newton Sketch per iteration is very similar to that of Newton’s method. Besides the original Newton’s method, the other algorithms compared are

Gradient Descent (GD) with backtracking line search

Accelerated Gradient Descent (Acc. GD) adapted for strongly convex functions with manually tuned parameters.

Stochastic Gradient Descent (SGD) with the classical step size choice $1/\sqrt{t}$

Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) approximating the Hessian with gradients.

For each problem, we averaged the performance of the randomized algorithms (Newton sketch and SGD) over $10$ independent trials. We ran the Newton sketch algorithm with sketch size $m=6d$ . To be fair in comparisons, we performed hand-tuning of the stepsize parameters in the gradient-based methods so as to optimize their performance. The top panel in Figure 3 plots the log duality gap versus the number of iterations: as expected, on this scale, the classical form of Newton’s method is the fastest, whereas the SGD method is the slowest. However, when the log optimality gap is plotted versus the wall-clock time in the bottom panel, we now see that the Newton sketch is the fastest.
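A small-scale version of this experiment can be sketched as follows. The code is an illustrative reimplementation under stated assumptions, not the code used for Figure 3: it uses an identity covariance instead of the correlated design above, a Gaussian sketch redrawn at every iteration, labels in $\{0,1\}$ with $\psi(u)=\log(1+e^{u})$, and backtracking parameters `a` and `b` chosen arbitrarily. The names `logistic_loss_grad` and `newton_sketch_logistic` are hypothetical.

```python
import numpy as np

def logistic_loss_grad(A, y, x):
    """Average logistic loss, its gradient, and the Hessian weights psi''(a_i^T x)."""
    z = A @ x
    p = 0.5 * (1.0 + np.tanh(0.5 * z))            # numerically stable sigmoid
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    grad = A.T @ (p - y) / len(y)
    return loss, grad, p * (1.0 - p)

def newton_sketch_logistic(A, y, m, rng, a=0.1, b=0.5, tol=1e-6, max_iter=200):
    """Unconstrained Newton sketch with a fresh Gaussian sketch per iteration."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(max_iter):
        loss, grad, w = logistic_loss_grad(A, y, x)
        if np.linalg.norm(grad) <= tol:
            break
        # Hessian square root R with R^T R equal to the exact Hessian.
        R = np.sqrt(w / n)[:, None] * A
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SR = S @ R                                 # m x d sketched square root
        v = -np.linalg.solve(SR.T @ SR, grad)      # sketched Newton direction
        # Backtracking line search on the exact objective.
        step, slope = 1.0, grad @ v
        for _ in range(50):
            if logistic_loss_grad(A, y, x + step * v)[0] <= loss + a * step * slope:
                break
            step *= b
        x = x + step * v
    return x

# Hypothetical small instance with sketch size m = 6d, as in the text.
rng = np.random.default_rng(0)
n, d = 2048, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d) / np.sqrt(d)
y = (rng.random(n) < 0.5 * (1.0 + np.tanh(0.5 * (A @ x_true)))).astype(float)
x_hat = newton_sketch_logistic(A, y, m=6 * d, rng=rng)
final_grad_norm = np.linalg.norm(logistic_loss_grad(A, y, x_hat)[1])
print(final_grad_norm)
```

Each iteration only factorizes the $d\times d$ matrix built from the $m\times d$ sketched square root, whereas the exact Newton step would form the Hessian from all $n$ rows.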

5 A dual example: Lasso with $d\gg n$

The regularized Lasso problem takes the form $\min\limits_{x\in\mathbb{R}^{d}}\big{\{}\frac{1}{2}\,\|Ax-y\|_{2}^{2}+\lambda\|x\|_{1}\big{\}}$, where $\lambda>0$ is a user-specified regularization parameter. In this section, we consider efficient sketching strategies for this class of problems in the regime $d\gg n$. In particular, let us consider the corresponding dual program, given by

By construction, the number of constraints $d$ in the dual program is larger than the number of optimization variables $n$ . If we apply the barrier method to solve this dual formulation, then we need to solve a sequence of problems of the form

where $A_{j}\in\mathbb{R}^{n}$ denotes the $j^{th}$ column of $A$. The Hessian of the above barrier penalized formulation can be written as

Consequently we can keep the first term in the Hessian, $\tau I$ exact and apply partial sketching to the Hessians of the last two terms via

Since the partially sketched Hessian is of the form $\tau I_{n}+VV^{T}$, where $V$ is rank at most $m$, we can use the matrix inversion lemma to efficiently calculate Newton Sketch updates. The complexity of the above strategy for $d>n$ is $\mathcal{O}(dm^{2})$, where $m$ is at most $d$, whereas traditional interior point solvers are typically $\mathcal{O}(dn^{2})$ per iteration.
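For concreteness, one standard form of the matrix inversion lemma for this structure is the following, assuming $\tau>0$ and a factor $V\in\mathbb{R}^{n\times m}$ (the dimensions of $V$ are an assumption made here for illustration):

$$\big(\tau I_{n}+VV^{T}\big)^{-1}=\frac{1}{\tau}\Big(I_{n}-V\big(\tau I_{m}+V^{T}V\big)^{-1}V^{T}\Big).$$

With this identity, only the $m\times m$ matrix $\tau I_{m}+V^{T}V$ has to be factorized in each Newton Sketch update.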

In order to test this algorithm, we generated a feature matrix $A\in\mathbb{R}^{n\times d}$ with $d=4096$ features and $n=50$ observations. Each row $a_{i}\in\mathbb{R}^{d}$ was generated from the multivariate Gaussian distribution $N(0,\Sigma)$ with $\Sigma_{ij}=2\,(0.99)^{|i-j|}$. For a given problem instance, we ran $10$ independent trials of the sketched barrier method, and compared the results to the original barrier method. Figure 4 plots the duality gap versus iteration number (top panel) and versus the wall-clock time (bottom panel) for the original barrier method (blue) and sketched barrier method (red): although the sketched algorithm requires more iterations, these iterations are cheaper, leading to a smaller wall-clock time. This point is reinforced by Figure 5, where we plot the wall-clock time required to reach a duality gap of $10^{-6}$ versus the number of features $n$ in problem families of increasing size. Note that the sketched barrier method outperforms the original barrier method, with significantly less computation time for obtaining similar accuracy.

Proofs

We now turn to the proofs of our theorems, with more technical details deferred to the appendices.

Throughout this proof, we let $r\in\mathcal{S}^{d-1}$ denote a fixed vector that is independent of the sketch matrix $S^{t}$ and the current iterate $x^{t}$ . We then define the following pair of random variables

These random variables are significant, because the core of our proof is based on establishing that the error vector $\Delta^{t}=x^{t}-x^{*}$ satisfies the recursive bound

where $Z_{1}^{t}:\,=Z_{1}(S^{t};\,x^{t})$ and $Z_{2}^{t}:\,=Z_{2}(S^{t};\,x^{t})$. We then combine this recursion with the following probabilistic guarantee on $Z_{1}^{t}$ and $Z_{2}^{t}$. For a given tolerance parameter $\epsilon\in(0,\frac{1}{2}]$, consider the “good event”

For sub-Gaussian sketch matrices, given a sketch size $m>\frac{c_{0}}{\epsilon^{2}}\max_{x\in\mathcal{C}}\mathcal{W}^{2}(\nabla^{2}f(x)^{1/2}\mathcal{K})$ , we have

For randomized orthogonal system (ROS) sketches over the class of self-bounding cones, given a sketch size $m>\frac{c_{0}\,\log^{4}n}{\epsilon^{2}}\max_{x\in\mathcal{C}}\mathcal{W}^{2}(\nabla^{2}f(x)^{1/2}\mathcal{K})$ , we have

Combining Lemma 1 with the recursion (27) and re-scaling $\epsilon$ appropriately yields the claim of the theorem.
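The two sketch families covered by this lemma can be illustrated as follows. This is a sketch under stated assumptions rather than the construction used in the analysis: the ROS variant below uses a Walsh-Hadamard transform with random sign flips and rows sampled uniformly with replacement, requires the number of rows $n$ to be a power of two, and the function names `fwht`, `gaussian_sketch` and `ros_sketch` are hypothetical.

```python
import numpy as np

def fwht(X):
    """Orthonormal fast Walsh-Hadamard transform along axis 0 (length a power of 2)."""
    X = np.array(X, dtype=float)                  # work on a copy
    n = X.shape[0]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            top = X[i:i + h].copy()
            bot = X[i + h:i + 2 * h].copy()
            X[i:i + h] = top + bot
            X[i + h:i + 2 * h] = top - bot
        h *= 2
    return X / np.sqrt(n)

def gaussian_sketch(A, m, rng):
    """Return S @ A for S with i.i.d. N(0, 1/m) entries, so that E[S^T S] = I."""
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)
    return S @ A

def ros_sketch(A, m, rng):
    """Return S @ A for a randomized Hadamard sketch, without forming S."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)       # random diagonal sign matrix D
    HDA = fwht(signs[:, None] * A)                # apply H D in O(n log n) per column
    rows = rng.integers(0, n, size=m)             # sample m rows with replacement
    return np.sqrt(n / m) * HDA[rows]             # rescale so that E[S^T S] = I

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 5))
print(gaussian_sketch(A, 16, rng).shape, ros_sketch(A, 16, rng).shape)
```

The practical difference is cost: the Gaussian sketch needs a dense $m\times n$ multiply, whereas the Hadamard-based sketch is applied through the fast transform.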

Accordingly, it remains to prove the recursion (27), and we do so via a basic inequality argument. Recall the function $x\mapsto\Phi(x;S^{t})$ that underlies the sketch Newton update (6): since $x^{t+1}$ and $x^{*}$ are optimal and feasible, respectively, for the constrained optimization problem, we have $\Phi(x^{t+1};S^{t})\leq\Phi(x^{*};S^{t})$. Introducing the error vector $\Delta^{t}:\,=x^{t}-x^{*}$, some straightforward algebra then leads to the basic inequality

Let us first upper bound the right-hand side. By using the integral form of Taylor’s expansion, we have $\langle\nabla f(x^{t})-\nabla f(x^{*}),\,\Delta^{t+1}\rangle=\int_{0}^{1}\langle\nabla^{2}f(x^{t}+u(x^{*}-x^{t}))\Delta^{t},\,\Delta^{t+1}\rangle du$ , and hence

By adding and subtracting terms and then applying triangle inequality, we have the bound $\mbox{RHS}\leq T_{1}+T_{2}$ , where

Now observe that the vector $r:\,=\nabla^{2}f(x^{t})^{1/2}\Delta^{t}$ is independent of the randomness in $S^{t}$ , whereas the vector $\nabla^{2}f(x^{t})^{1/2}\Delta^{t+1}$ belongs to the cone $\nabla^{2}f(x^{t})^{1/2}\mathcal{K}$ . Consequently, by the definition of $Z_{1}$ , we have

Now note that using the fact that $\beta$ controls the smoothness of the gradient and the Lipschitz continuity of Hessian we can upper bound the terms on the above right-hand side as follows

and similarly, $\langle\Delta^{t+1},\,\nabla^{2}f(x^{*})\Delta^{t+1}\rangle\leq\left\{\beta+L\|\Delta^{t}\|_{2}\right\}\|\Delta^{t+1}\|_{2}^{2}$ . Combining the above bounds with (32) we obtain

On the other hand, by the $L$ -Lipschitz condition on the Hessian, we have

Substituting these two bounds into our basic inequality, we have

Our final step is to lower bound the left-hand side (LHS) of this inequality. By definition of $Z_{2}$ , we have

Substituting this lower bound into the previous inequality (34) and then rearranging, we find that, as long as $\|\Delta^{t}\|_{2}<\frac{\gamma}{2L}$ , we also have $\|\Delta^{t}\|_{2}<\frac{\beta}{2L}$ and consequently

2 Proof of Theorem 2

Recall that in this case, we assume that $f$ is a self-concordant strictly convex function. We adopt the following notation and conventions from the book . For a given $x\in\mathbb{R}^{d}$, we define the pair of dual norms

Note that $\nabla^{2}f(x)^{-1}$ is well-defined for strictly convex self-concordant functions. In terms of this notation, the exact Newton update is given by $x\mapsto x_{\tiny{\mbox{NE}}}:\,=x+v$ , where

whereas the Newton sketch update is given by $x\mapsto x_{\tiny{\mbox{NSK}}}:\,=x+v_{\tiny{\mbox{NSK}}}$ , where

The proof of Theorem 2 given in this section involves the unconstrained case ($\mathcal{C}=\mathbb{R}^{d}$), whereas the proofs of later theorems involve the more general constrained case. In the unconstrained case, the two updates take the simpler forms
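For reference, under the standard definitions of the Newton step and of a Hessian sketch based on the square root $\nabla^{2}f(x)^{1/2}$, the unconstrained directions can be written in closed form as below. This display is reconstructed from the surrounding definitions rather than copied from the original, and it assumes that the sketched Hessian is invertible:

$$v_{\tiny{\mbox{NE}}}=-\big[\nabla^{2}f(x)\big]^{-1}\nabla f(x),\qquad v_{\tiny{\mbox{NSK}}}=-\Big[\big(S\nabla^{2}f(x)^{1/2}\big)^{T}S\nabla^{2}f(x)^{1/2}\Big]^{-1}\nabla f(x).$$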

For a self-concordant function, the sub-optimality of the Newton iterate $x_{\tiny{\mbox{NE}}}$ in function value satisfies the bound

This classical bound is not directly applicable to the Newton sketch update, since it involves the approximate Newton decrement $\widetilde{\lambda}_{f}(x):\,=-\langle\nabla f(x),\,v_{\tiny{\mbox{NSK}}}\rangle$ , as opposed to the exact one ${\lambda}_{f}(x):\,=-\langle\nabla f(x),\,v_{\tiny{\mbox{NE}}}\rangle$ . Thus, our strategy is to prove that with high probability over the randomness in the sketch matrix, the approximate Newton decrement can be used as an exit condition.

Recall the definitions (35) and (36) of the exact $v_{\tiny{\mbox{NE}}}$ and sketched Newton $v_{\tiny{\mbox{NSK}}}$ update directions, as well as the definition of the tangent cone $\mathcal{K}$ at $x\in\mathcal{C}$ . Let $\mathcal{K}^{t}$ be the tangent cone at $x^{t}$ . The following lemma provides a high probability bound on their difference:

Let $S\in\mathbb{R}^{m\times n}$ be a sub-Gaussian or ROS sketch matrix, and consider any fixed vector $x\in\mathcal{C}$ independent of the sketch matrix. If $m\geq c_{0}\frac{\mathcal{W}(\nabla^{2}f(x)^{1/2}\mathcal{K}^{t})^{2}}{\epsilon^{2}}$, then

with probability at least $1-c_{1}e^{-c_{2}m\epsilon^{2}}$ .

Similar to the standard analysis of Newton’s method, our analysis of the Newton sketch algorithm is split into two phases defined by the magnitude of the decrement $\widetilde{\lambda}_{f}(x)$. In particular, the following lemma constitutes the core of our proof:

For $\epsilon\in(0,1/2)$ , there exist constants $\nu>0$ and $\eta\in(0,1/16)$ such that:

If $\widetilde{\lambda}_{f}(x)>\eta$ , then $f(x_{\tiny{\mbox{NSK}}})-f(x)\leq-\nu$ with probability at least $1-c_{1}e^{-c_{2}m\epsilon^{2}}$ .

Conversely, if $\widetilde{\lambda}_{f}(x)\leq\eta$ , then

where both bounds hold with probability $1-c_{1}e^{-c_{2}m\epsilon^{2}}$.

Using this lemma, let us now complete the proof of the theorem, dividing our analysis into the two phases of the algorithm.

Since, by Lemma 3(a), each iteration in the first phase decreases the function value by at least $\nu>0$, the number of first phase iterations $N_{1}$ is at most

with probability at least $1-N_{1}c_{1}e^{-c_{2}m}$ .

Next, let us suppose that at some iteration $t$ , the condition $\widetilde{\lambda}_{f}(x^{t})\leq\eta$ holds, so that part (b) of Lemma 3 can be applied. In fact, the bound (38a) then guarantees that $\widetilde{\lambda}_{f}(x^{t+1})\leq\eta$ , so that we may apply the contraction bound (38b) repeatedly for $N_{2}$ rounds so as to obtain that

with probability $1-N_{2}c_{1}e^{-c_{2}m}$.

Since ${\lambda}_{f}(x^{t})\leq\eta\leq 1/16$ by assumption, the self-concordance of $f$ then implies that

Therefore, in order to achieve $f(x^{t+k})-f(x^{*})\leq\epsilon$, it suffices for the number of second phase iterations to be lower bounded as $N_{2}\geq 0.65\log_{2}(\frac{1}{16\epsilon})$.
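As a quick arithmetic illustration of this bound, the following snippet evaluates the smallest integer $N_{2}$ satisfying it for a given accuracy. The helper name `second_phase_iterations` and the example accuracy $\epsilon=10^{-6}$ are assumptions chosen for illustration.

```python
import math

def second_phase_iterations(eps):
    """Smallest integer N2 with N2 >= 0.65 * log2(1 / (16 * eps)), floored at zero."""
    return max(0, math.ceil(0.65 * math.log2(1.0 / (16.0 * eps))))

print(second_phase_iterations(1e-6))
```

The count grows only logarithmically in $1/\epsilon$, in line with the fast convergence of the second phase.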

Putting together the two phases, we conclude that the total number of iterations $N$ required to achieve $\epsilon$-accuracy is at most

and moreover, this guarantee holds with probability at least $1-Nc_{1}e^{-c_{2}m\epsilon^{2}}$ .

The final step in our proof of the theorem is to establish Lemma 3, and we do so in the next two subsections.

2.1 Proof of Lemma 3(a)

Our proof of this part is performed conditionally on the event $\mathcal{D}:\,=\{\widetilde{\lambda}_{f}(x)>\eta\}$. Our strategy is to show that the backtracking line search leads to a stepsize $s>0$ such that the function decrement in moving from the current iterate $x$ to the new sketched iterate $x_{\tiny{\mbox{NSK}}}=x+sv_{\tiny{\mbox{NSK}}}$ is at least

The outline of our proof is as follows. Defining the univariate function $g(u):\,=f(x+uv_{\tiny{\mbox{NSK}}})$ and $\epsilon^{\prime}=\frac{2\epsilon}{1-\epsilon}$ , we first show that $\widehat{u}=\frac{1}{1+(1+\epsilon^{\prime})\widetilde{\lambda}_{f}(x)}$ satisfies the bound

Since $\widetilde{\lambda}_{f}(x)>\eta$ by assumption and the function $u\rightarrow\frac{u^{2}}{1+(1+\frac{2\epsilon}{1-\epsilon})u}$ is monotone increasing, this bound implies that inequality (39) holds with $\nu=ab\frac{\eta^{2}}{1+(1+\frac{2\epsilon}{1-\epsilon})\eta}$ .

It remains to prove the claims (40a) and (40b), for which we make use of the following auxiliary lemma:

For $u\in{\rm dom\,}g\cap\mathbb{R}^{+}$, we have the decrement bound

provided that $u\|[\nabla^{2}f(x)]^{1/2}v_{\tiny{\mbox{NSK}}}\|_{2}<1$ .

With probability at least $1-c_{1}e^{-c_{2}m}$ , we have

The proofs of these lemmas are provided in Appendices A.2 and A.3. Using them, let us prove the claims (40a) and (40b). Recalling our shorthand $\epsilon^{\prime}:\,=\frac{1+\epsilon}{1-\epsilon}-1=\frac{2\epsilon}{1-\epsilon}$, substituting inequality (42) into the decrement formula (41) yields

where we added and subtracted $u(1+\epsilon^{\prime})^{2}\widetilde{\lambda}_{f}(x)^{2}$ so as to obtain the final equality.

We now prove inequality (40a). Setting $u=\widehat{u}:\,=\frac{1}{1+(1+\epsilon^{\prime})\widetilde{\lambda}_{f}(x)}$, which satisfies the conditions of Lemma 4, yields

Making use of the standard inequality $-u+\log(1+u)\leq-\frac{\frac{1}{2}u^{2}}{(1+u)}$ (for instance, see the book ), we find that

where the final inequality follows from our assumption $\alpha\leq\frac{1}{2}-\frac{1}{2}{\epsilon^{\prime}}^{2}-\epsilon^{\prime}$ . This completes the proof of the bound (40a). Finally, the lower bound (40b) follows by setting $u=b\widehat{u}$ into the decrement inequality (41).

2.2 Proof of Lemma 3(b)

The proof of this part hinges on the following auxiliary lemma:

where all bounds hold with probability at least $1-c_{1}e^{-c_{2}m\epsilon^{2}}$ .

We now use Lemma 6 to prove the two claims in the lemma statement.

Recall from the theorem statement that $\eta:\,=\frac{1}{8}\,\frac{1-\frac{1}{2}(\frac{1+\epsilon}{1-\epsilon})^{2}-a}{(\frac{1+\epsilon}{1-\epsilon})^{3}}$ . By examining the roots of a polynomial in $\epsilon$ , it can be seen that $\eta\leq\frac{1-\epsilon}{1+\epsilon}\,\frac{1}{16}$ .

By applying the inequalities (44b), we have

Here the final inequality holds for all $\epsilon\in(0,1/2)$ . Combining the bound (44b) with inequality (46) yields

where the final inequality again uses the condition $\epsilon\in(0,\frac{1}{2})$ . This completes the proof of the bound (38a).

This inequality has been established as a consequence of proving the bound (46).

3 Proof of Theorem 3

Given the proof of Theorem 2, it remains only to prove the following modified version of Lemma 2. It applies to the exact and sketched Newton directions $v_{\tiny{\mbox{NE}}},v_{\tiny{\mbox{NSK}}}\in\mathbb{R}^{d}$ that are defined as follows

Thus, the only difference is that the Hessian $\nabla^{2}f(x)$ is sketched, whereas the term $\nabla^{2}g(x)$ remains unsketched.

Let $S\in\mathbb{R}^{m\times n}$ be a sub-Gaussian or ROS sketching matrix, and let $x\in\mathbb{R}^{d}$ be a (possibly random) vector independent of $S$. If $m\geq c_{0}\max_{x\in\mathcal{C}}\frac{\mathcal{W}(\nabla^{2}f(x)^{1/2}\mathcal{K})^{2}}{\epsilon^{2}}$, then

with probability at least $1-c_{1}e^{-c_{2}m\epsilon^{2}}$ .

Discussion

In this paper, we introduced and analyzed the Newton sketch, a randomized approximation to the classical Newton updates. This algorithm is a natural generalization of the Iterative Hessian Sketch (IHS) updates analyzed in our earlier work . The IHS applies only to constrained least-squares problems (for which the Hessian is independent of the iteration number), whereas the Newton Sketch applies to any twice differentiable function subject to a closed convex constraint set. We described various applications of the Newton sketch, including its use with barrier methods to solve various forms of constrained problems. For the minimization of self-concordant functions, the combination of the Newton sketch within interior point updates leads to much faster algorithms for an extensive body of convex optimization problems.

Each iteration of the Newton sketch always has lower computational complexity than classical Newton’s method. Moreover, it has lower computational complexity than first-order methods when either $n\geq d^{2}$ or $d\geq n^{2}$ (using the dual strategy); here $n$ and $d$ denote the dimensions of the data matrix $A$. In the context of barrier methods, the parameters $n$ and $d$ typically correspond to the number of constraints and number of variables, respectively. In many “big data” problems, one of the dimensions is much larger than the other, in which case the Newton sketch is advantageous. Moreover, sketches based on the randomized Hadamard transform are well-suited to parallel environments: in this case, the sketching step can be done in $\mathcal{O}(\log m)$ time with $\mathcal{O}(nd)$ processors. This scheme significantly decreases the amount of central computation, namely from $\mathcal{O}(m^{2}d+nd\log m)$ to $\mathcal{O}(m^{2}d+\log d)$.

There are a number of open problems associated with the Newton sketch. Here we focused our analysis on the cases of sub-Gaussian and randomized orthogonal system (ROS) sketches. It would also be interesting to analyze sketches based on coordinate sampling, or other forms of “sparse” sketches (for instance, see the paper ). Such techniques might lead to significant gains in cases where the data matrix $A$ is itself sparse: more specifically, it may be possible to obtain sketched optimization algorithms whose computational complexity scales only with the number of nonzero entries in the data matrices, rather than with the full dimensionality $nd$. Finally, it would be interesting to explore the problem of lower bounds on the sketch dimension $m$. In particular, is there a threshold below which any algorithm that has access only to gradients and $m$-sketched Hessians must necessarily converge at a sub-linear rate, or in a way that depends on the strong convexity and smoothness parameters? Such a result would clarify whether or not the guarantees in this paper are improvable.

Both authors were partially supported by Office of Naval Research MURI grant N00014-11-1-0688, and National Science Foundation Grants CIF-31712-23800 and DMS-1107000. In addition, MP was supported by a Microsoft Research Fellowship.

Appendix A Technical results for Theorem 2

In this appendix, we collect together various technical results and proofs that are required in the proof of Theorem 2.

Let $u$ be a unit-norm vector independent of $S$ , and consider the random quantities

By the optimality and feasibility of $v_{\tiny{\mbox{NSK}}}$ and $v_{\tiny{\mbox{NE}}}$ (respectively) for the sketched Newton update (36), we have

Defining the difference vector $\widehat{e}:\,=v_{\tiny{\mbox{NSK}}}-v_{\tiny{\mbox{NE}}}$ , some algebra leads to the basic inequality

Moreover, by the optimality and feasibility of $v_{\tiny{\mbox{NE}}}$ and $v_{\tiny{\mbox{NSK}}}$ for the exact Newton update (35), we have

Consequently, by adding and subtracting $\langle\nabla^{2}f(x)v_{\tiny{\mbox{NE}}},\,\widehat{e}\rangle$ , we find that

By definition, the error vector $\widehat{e}$ belongs to the cone $\mathcal{K}^{t}$ and the vector $\nabla^{2}f(x)^{1/2}v_{\tiny{\mbox{NE}}}$ is fixed and independent of the sketch. Consequently, invoking definitions (49a) and (49b) of the random variables $Z_{1}$ and $Z_{2}$ yields

Putting together the pieces, we find that

By setting $\delta=\frac{\epsilon}{4}$ , the claim follows.

A.2 Proof of Lemma 4

By construction, the function $g(u)=f(x+uv_{\tiny{\mbox{NSK}}})$ is strictly convex and self-concordant. Consequently, it satisfies the bound $\frac{d}{du}\left(g^{\prime\prime}(u)^{-1/2}\right)\leq 1$ , whence

or equivalently $g^{\prime\prime}(s)\leq\frac{g^{\prime\prime}(0)}{(1-sg^{\prime\prime}(0)^{1/2})^{2}}$ for $s\in{\rm dom\,}g\cap[0,g^{\prime\prime}(0)^{-1/2})$ . Integrating this inequality twice yields the bound
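To spell out the two integrations, write $c:\,=g^{\prime\prime}(0)^{1/2}$ (a shorthand introduced only for this display) and take $u\in{\rm dom\,}g\cap[0,1/c)$. Integrating the bound on $g^{\prime\prime}$ once and then a second time gives

$$\begin{aligned}
g^{\prime}(u)&\leq g^{\prime}(0)+\int_{0}^{u}\frac{c^{2}}{(1-sc)^{2}}\,ds=g^{\prime}(0)+\frac{c}{1-uc}-c,\\
g(u)&\leq g(0)+u\,g^{\prime}(0)+\int_{0}^{u}\Big(\frac{c}{1-sc}-c\Big)ds=g(0)+u\,g^{\prime}(0)-u\,c-\log(1-u\,c).
\end{aligned}$$

This is the classical upper bound for self-concordant functions along a line, stated here in terms of $g^{\prime}(0)$ and $g^{\prime\prime}(0)$.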

Since $g^{\prime}(u)=\langle\nabla f(x+uv_{\tiny{\mbox{NSK}}}),\,v_{\tiny{\mbox{NSK}}}\rangle$ and $g^{{\prime\prime}}(u)=\langle v_{\tiny{\mbox{NSK}}},\,\nabla^{2}f(x+uv_{\tiny{\mbox{NSK}}})v_{\tiny{\mbox{NSK}}}\rangle$ , the decrement bound (41) follows.

A.3 Proof of Lemma 5

We perform this analysis conditional on the bound (37) from Lemma 2. We begin by observing that

Lemma 2 implies that $\|[\nabla^{2}f(x)]^{1/2}(v_{\tiny{\mbox{NSK}}}-v_{\tiny{\mbox{NE}}})\|_{2}\leq\epsilon\|[\nabla^{2}f(x)]^{1/2}v_{\tiny{\mbox{NE}}}\|_{2}=\epsilon{\lambda}_{f}(x)$. In conjunction with the bound (56), we see that

Our next step is to lower bound the term $\langle\nabla f(x),\,v_{\tiny{\mbox{NSK}}}\rangle$ : in particular, by adding and subtracting a factor of the original Newton step $v_{\tiny{\mbox{NE}}}$ , we find that

where the final step again makes use of Lemma 2. Repeating the above argument in the reverse direction yields the lower bound $\langle\nabla f(x),\,v_{\tiny{\mbox{NSK}}}\rangle\geq-{\lambda}_{f}(x)^{2}(1+\epsilon)$ , so that we may conclude that

Finally, squaring both sides of the inequality (56) and combining with the above bounds gives

A.4 Proof of Lemma 6

We have already proved the bound (44b) during our proof of Lemma 5—in particular, see equation (58). Accordingly, it remains only to prove the inequality (44a).

Introducing the shorthand $\widetilde{\lambda}:\,=(1+\epsilon)\lambda_{f}(x)$ , we first claim that the Hessian satisfies the sandwich relation

for $|1-s\alpha|<1$ where $\alpha=(1+\epsilon)\lambda_{f}(x)$ , with probability at least $1-c_{1}e^{-c_{2}m\epsilon^{2}}$ . Let us recall Theorem 4.1.6 of Nesterov : it guarantees that

Now recall the bound (37) from Lemma 2: combining it with an application of the triangle inequality (in terms of the semi-norm $\|v\|_{x}=\|\nabla^{2}f(x)^{1/2}v\|_{2}$ ) yields

with probability at least $1-e^{-c_{1}m\epsilon^{2}}$ , and substituting this inequality into the bound (60) yields the sandwich relation (59) for the Hessian.

Using this sandwich relation (59), the Newton decrement can be bounded as

where we have defined $\Delta=\int_{0}^{1}\nabla^{2}f(x+sv_{\tiny{\mbox{NSK}}})\,(v_{\tiny{\mbox{NSK}}}-v_{\tiny{\mbox{NE}}})\,ds$ . By the triangle inequality, we can write $\lambda_{f}(x_{\tiny{\mbox{NSK}}})\leq\frac{1}{\left(1-(1+\epsilon)\lambda_{f}(x)\right)}\big{(}M_{1}+M_{2}\big{)}$ , where

In order to complete the proof, it suffices to show that

Re-arranging and then invoking the Hessian sandwich relation (59) yields

where the inequality in step (i) follows from Lemma 2.

Appendix B Proof of Lemma 7

The proof follows the basic inequality argument of the proof of Lemma 2. Since $v_{\tiny{\mbox{NSK}}}$ and $v_{\tiny{\mbox{NE}}}$ are optimal and feasible (respectively) for the sketched Newton problem (47b), we have $\Psi(v_{\tiny{\mbox{NSK}}};S)\leq\Psi(v_{\tiny{\mbox{NE}}};S)$. Defining the difference vector $\widehat{e}:\,=v_{\tiny{\mbox{NSK}}}-v_{\tiny{\mbox{NE}}}$, some algebra leads to the basic inequality

On the other hand since $v_{\tiny{\mbox{NE}}}$ and $v_{\tiny{\mbox{NSK}}}$ are optimal and feasible (respectively) for the Newton step (47a), we have