SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

Zheng Qu, Peter Richtárik, Martin Takáč, Olivier Fercoq

Introduction

Empirical risk minimization (ERM) is a fundamental paradigm in the theory and practice of statistical inference and machine learning Shalev-Shwartz & Ben-David (2014). In the “big data” era it is increasingly common in practice to solve ERM problems with a massive number of examples, which leads to new algorithmic challenges.

State-of-the-art optimization methods for ERM include i) stochastic (sub)gradient descent Shalev-Shwartz et al. (2011); Takáč et al. (2013), ii) methods based on stochastic estimates of the gradient with diminishing variance such as SAG Schmidt et al. (2013), SVRG Johnson & Zhang (2013), S2GD Konečný & Richtárik (2014), proxSVRG Xiao & Zhang (2014), MISO Mairal (2014), SAGA Defazio et al. (2014), minibatch S2GD Konečný et al. (2014a), S2CD Konečný et al. (2014b), and iii) variants of stochastic dual coordinate ascent Shalev-Shwartz & Zhang (2013d); Zhao & Zhang (2014); Takáč et al. (2013); Shalev-Shwartz & Zhang (2013b; a); Lin et al. (2014); Qu et al. (2014); Shalev-Shwartz & Zhang (2013c).

There have been several attempts at designing methods that combine randomization with the use of curvature (second-order) information. For example, methods based on running coordinate ascent in the dual such as those mentioned above and also Richtárik & Takáč (2014; 2012); Fercoq & Richtárik (2013b); Tappenden et al. (2014); Richtárik & Takáč (2013a; b); Fercoq & Richtárik (2013a); Fercoq et al. (2014); Qu et al. (2014); Qu & Richtárik (2014a) use curvature information contained in the diagonal of a bound on the Hessian matrix. Block coordinate descent methods, when equipped with suitable data-dependent norms for the blocks, use information contained in the block diagonal of the Hessian Tappenden et al. (2013).

A more direct route to incorporating curvature information was taken by Schraudolph et al. (2007) in their stochastic L-BFGS method, by Byrd et al. (2014) and Sohl-Dickstein et al. (2014) in their stochastic quasi-Newton methods and by Fountoulakis & Tappenden (2014) who proposed a stochastic block coordinate descent methods. While typically efficient in practice, none of the methods mentioned above are equipped with complexity bounds (bounds on the number of iterations). An exception in this regard is the work of Bordes et al. (2009), who give a O(1/ϵ)O(1/\epsilon) complexity bound for a Quasi-Newton SGD method.

The main contribution of this paper is the design and analysis of a new algorithm—stochastic dual Newton ascent (SDNA)—for solving a regularized ERM problem with smooth loss functions and a strongly convex regularizer (primal problem). Our method is stochastic in nature and has the capacity to utilize all curvature information inherent in the data. While we do our analysis for an arbitrary strongly convex regularizer, for the purposes of the introduction we shall describe the method in the case of the L2 regularizer. In this case, the dual problem is a concave quadratic maximization problem with a strongly concave separable penalty.

SDNA in each iteration picks a random subset of the dual variables (which corresponds to picking a minibatch of examples in the primal problem), following an arbitrary probability law, and maximizes, exactly, the dual objective restricted to the random subspace spanned by the coordinates. Equivalently, this can be seen as the solution of a proximal subproblem involving a random principal submatrix of the Hessian of the quadratic function. Hence, SDNA utilizes all curvature information available in the random subspace in which it operates. Note that this is very different from the update strategy of parallel / minibatch coordinate descent methods. Indeed, while these methods also update a random subset of variables in each iteration, they instead only utilize curvature information present in the diagonal of the Hessian.

As we will explain in detail in the text, SDCA-like methods need more iterations (and hence more passes through data) to convergence as the minibatch size increases. However, SDNA enjoys the opposite behavior: with increasing minibatch size, SDNA needs fewer iterations (and hence fewer passes over data) to convergence. This observation can be deduced from the complexity results we prove for SDNA, and is also confirmed by our numerical experiments. In particular, we show that the expected duality gap decreases at a geometric rate which i) is better than that of SDCA-like methods such as SDCA Shalev-Shwartz & Zhang (2013d) and QUARTZ Qu et al. (2014), and ii) improves with increasing minibatch size. This improvement does not come for free: as we increase the minibatch size, the subproblems grow in size as they involve larger portions of the Hessian. We find through experiments that for some, especially dense problems, even relatively small minibatch sizes lead to dramatic speedups in actual runtime.

We show that in the case of quadratic loss, and when viewed as a primal method, SDNA can be interpreted as a variant of the recently introduced Iterative Hessian Sketch algorithm Pilanci & Wainwright (2014).

En route to developing SDNA which we describe in Section 4, we also develop several other new algorithms: two in Section 2 (where we focus on smooth problems), one in Section 3 (where we focus on composite problems). Besides SDNA, we also develop and analyze a novel minibatch variant of SDCA in Section 4, for the sake of finding suitable method to compare SDNA to. SDNA is equivalent to applying the new method developed in Section 3 to the dual of the ERM problem. However, as we are mainly interested in solving the ERM (primal) problem, we additionally prove that the expected duality gap decreases at a geometric rate. Our technique for doing this is a variant of the one use by Shalev-Shwartz & Zhang (2013d), but generalized to an arbitrary sampling.

2 Notation

Minimization of a Smooth Function

In this section we consider unconstrained minimization of a differentiable convex function:

In particular, we shall assume smoothness (Lipschitz continuity of the gradient) and strong convexity of ff:

We now describe three algorithmic strategies for solving problem (5), the first two of which are new. All these methods have the form

where hikh_{i}^{k} is only allowed to be nonzero for iSki\in S_{k}, where {Sk}k0\{S_{k}\}_{k\geq 0} are i.i.d. random subsets of [n]:={1,2,,n}[n]:=\{1,2,\dots,n\} (“samplings”). That is, all methods in each iteration update a random subset of the variables. The four methods will only differ in how the update elements hikh_{i}^{k} for iSki\in S_{k} are computed. If we wish the methods to work, we necessarily need to require that every coordinate has a positive probability of being sampled. For certain technical reasons that will be apparent later, we will also assume that SkS_{k} is nonempty with probability 1.

The random sets {Sk}k0\{S_{k}\}_{k\geq 0} are i.i.d., proper (i.e., Prob(iSk)>0\mathbf{Prob}(i\in S_{k})>0 for all i[n]i\in[n]) and nonvacuous (i.e., Prob(Sk=)=0\mathbf{Prob}(S_{k}=\emptyset)=0).

Much of our discussion will depend on the distribution of SkS_{k} rather than on kk. As {Sk}k0\{S_{k}\}_{k\geq 0} are i.i.d., we will write S^\hat{S} for a sampling which shares their distribution. We will write p=(p1,,pn)p=(p_{1},\dots,p_{n}) where

By Assumption 3, we have pi>0p_{i}>0 for all ii. We now describe the methods.

Method 1. We compute (MSk)1(\mathbf{M}_{S_{k}})^{-1} and set

Note that the update only involves the inversion of a random principal submatrix of M\mathbf{M} of size Sk×Sk|S_{k}|\times|S_{k}|. Also, we only need to compute elements iSki\in S_{k} of the gradient f(xk)\nabla f(x_{k}). If Sk|S_{k}| is reasonably small, the update step is cheap.

Assuming vv is easily computable (this should be done before the methods starts), the update is clearly very easy to perform. Indeed, the update can be equivalently written as hik=1viei,f(xk)h_{i}^{k}=-\tfrac{1}{v_{i}}\langle e_{i},\nabla f(x^{k})\rangle for iSki\in S_{k} and hik=0h_{i}^{k}=0 for iSki\notin S_{k}. Method 3 is known as NSync Richtárik & Takáč (2013b). For a calculus allowing the computation of closed form formulas for vv as a function of the sampling S^\hat{S} we refer the reader to Qu & Richtárik (2014b).

Note that all three methods coincide if S^=1|\hat{S}|=1 with probability 1.

2 Three linear convergence rates

We shall now show that, putting the issue of the cost of each iteration of the three methods aside, all enjoy a linear rate of convergence. In particular, we shall show that Method 1 has the fastest rate, followed by Method 22 and finally, Method 33.

Let Assumptions 1, 2 and 3 be satisfied. Let {xk}k0\{x^{k}\}_{k\geq 0} be the sequence of random vectors produced by Method mm, for m=1,2,3m=1,2,3 and let xx^{*} be the optimal solution of (5). Then

The above result means that the number of iterations sufficient for Method mm to obtain an ϵ\epsilon-solution (in expectation) is O(1σmlog(1/ϵ))O(\tfrac{1}{\sigma_{m}}\log(1/\epsilon)).

If S^\hat{S} is proper and nonvacuous, then

Substituting S=S^S=\hat{S} and taking expectations, we obtain

We now establish an important relationship between the quantities σ1,σ2\sigma_{1},\sigma_{2} and σ3\sigma_{3}, which sheds light on the convergence rates of the three methods.

0<σ3σ2σ110<\sigma_{3}\leq\sigma_{2}\leq\sigma_{1}\leq 1.

3 Example

Note that Assumption 1 holds, and Assumption 2 holds with G=M\mathbf{G}=\mathbf{M}. Let S^\hat{S} be the “22-nice sampling” on [n]={1,2,3}[n]=\{1,2,3\}. That is, we set Prob(S^={i,j})=13.\mathbf{Prob}(\hat{S}=\{i,j\})=\frac{1}{3}. for (i,j)=(1,2),(2,3),(3,1)(i,j)=(1,2),(2,3),(3,1). A straightforward calculation reveals that:

It can be verified that (9) holds with v=(2,2,2)v=(2,2,2); see Richtárik & Takáč (2012) or Qu & Richtárik (2014b). Therefore, D(p)D(v1)=13I\mathbf{D}(p)\mathbf{D}(v^{-1})=\tfrac{1}{3}\mathbf{I}. Finally, we obtain:

Note that: theoretical rate, σ1\sigma_{1}, of Method 1 is 10,000 times better than the rate, σ3\sigma_{3}, of parallel coordinate descent (Method 3).

4 Proof of Theorem 1

By minimizing both sides of (7) in hh, we get:

Method 1: If we use (15) with hhk:=(MSk)1f(xk)h\leftarrow h^{k}:=-(\mathbf{M}_{S_{k}})^{-1}\nabla f(x^{k}), and apply (4), we get:

Taking expectations on both sides with respect to SkS_{k} yields:

Minimization of a Composite Function

In this section we consider the following composite minimization problem:

We assume that ff satisfies Assumptions 6 and 7. The difference from the setup in the previous section is in the inclusion of the separable term iψi\sum_{i}\psi_{i}.

We now propose Algorithm 1, which a variant of Method 1 applicable to problem (16). If ψi0\psi_{i}\equiv 0 for all ii, the methods coincide. The following result states that the method converges at a geometric rate, in expectation.

Let Assumptions 1, 2, 3 and 4 be satisfied. Then the output sequence {xk}k0\{x^{k}\}_{k\geq 0} of Algorithm 1 satisfies:

where xx^{*} is the solution of (16), σ1prox:=τmin(1,s1)n\sigma^{prox}_{1}:=\frac{\tau\min(1,s_{1})}{n} and

So, the rate we can prove for the composite version of Method 1 (σ1prox\sigma^{prox}_{1}) is weaker than the rate we get for Method 2 (σ2\sigma_{2}), which by Theorem 2 is weaker than the rate of Method 1 (σ1\sigma_{1}). We believe this is a byproduct of our analysis rather than the weakness of Algorithm 1.

2 PCDM

We will now compare our new Algorithm 1 with the Parallel Coordinate Descent Method (PCDM) of Richtárik & Takáč (2012), which can also be applied to problem (16).

where σ3prox:=τmin(1,s3)n\sigma_{3}^{prox}:=\frac{\tau\min(1,s_{3})}{n} and

Sketch: The proof is a minor modification of the arguments in Richtárik & Takáč (2012). ∎

3 Comparison of the rates of Algorithms 1 and 2

We now show that the rate of linear (geometric) convergence of our method is better than that of PCDM.

σ1proxσ3prox\sigma^{prox}_{1}\geq\sigma^{prox}_{3}.

Since pi=τnp_{i}=\tfrac{\tau}{n} for all ii, we have D(p)=τnI\mathbf{D}(p)=\tfrac{\tau}{n}\mathbf{I} and hence from (9) we deduce that:

whence s1s3s_{1}\geq s_{3}, and the claim follows. ∎

Empirical Risk Minimization

We now turn our attention to the empirical risk minimization problem:

Note that the dual problem has the form (16)

where F(α)=D(α)F(\alpha)=-D(\alpha), f(α)=λg(1λnAα)f(\alpha)=\lambda g^{*}(\tfrac{1}{\lambda n}\mathbf{A}\alpha) and ψ(αi)=1nϕi(αi)\psi(\alpha_{i})=\tfrac{1}{n}\phi_{i}^{*}(-\alpha_{i}). It is easy to see that ff satisfies Assumption 1 with M:=1nX\mathbf{M}:=\tfrac{1}{n}\mathbf{X}, where X:=1λnAA\mathbf{X}:=\frac{1}{\lambda n}\mathbf{A}^{\top}\mathbf{A}. Moreover, ψi\psi_{i} is γn\tfrac{\gamma}{n}-strongly convex. We can therefore apply Algorithm 1 to solve the dual (19). This is what Algorithm 3 does.

If α\alpha^{*} is the optimal solution of (18), then the optimal solution of (17) is given by:

With each proper sampling S^\hat{S} we associate the number:

Closed-form expressions for vv satisfying this inequality, as a function of the sampling S^\hat{S} chosen, can be found in Qu & Richtárik (2014b). A rather conservative choice which works for any S^\hat{S}, irrespective of its distribution, is vi=min{τ,λ(AA)}ai2v_{i}=\min\{\tau,\lambda^{\prime}(\mathbf{A}^{\top}\mathbf{A})\}\|a_{i}\|^{2}, where λ(Y):=maxh{hYh  :  hD(Y)h1}\lambda^{\prime}(\mathbf{Y}):=\max_{h}\{h^{\top}\mathbf{Y}h\;:\;h^{\top}\mathbf{D}(\mathbf{Y})h\leq 1\} and τ\tau is a number for which S^τ|\hat{S}|\leq\tau with probability 1 (see Theorem 5.1 in the aforementioned reference). Better bounds (with smaller vv) can be derived for special classes of samplings.

Now we can state the main result of this section:

where σ1prox:=τmin(1,s1)n\sigma_{1}^{prox}:=\frac{\tau\min(1,s_{1})}{n} and

In the case of quadratic losses and quadratic regularizer, we can sharpen the complexity bound:

When both ϕi\phi_{i} and gg are quadratic functions, the output sequence {wk,αk}k0\{w^{k},\alpha^{k}\}_{k\geq 0} of Algorithm 3 satisfies:

2 Complexity analysis

We first establish that SDNA is able to solve the dual.

where σ1prox\sigma_{1}^{prox} is as in Theorem 4.

If S^\hat{S} is uniform, then the output of Algorithm 3 is equivalent to the output of Algorithm 1 applied to (19). Therefore, the result is obtained by applying Theorem 3 with M=1λn2AA\mathbf{M}=\tfrac{1}{\lambda n^{2}}\mathbf{A}^{\top}\mathbf{A}, G=0\mathbf{G}=0 and γi=γn\gamma_{i}=\tfrac{\gamma}{n} for all ii. ∎

We now prove a sharper result in the case of quadratic loss and quadratic regularizer.

If {ϕi}i\{\phi_{i}\}_{i} and gg are quadratic, then the output sequence {αk}k0\{\alpha^{k}\}_{k\geq 0} of Algorithm 3 satisfies:

If {ϕi}i\{\phi_{i}\}_{i} and gg are all quadratic functions, then the dual objective function is quadratic with Hessian matrix given by 2D(α)1λn2AA+γnI\nabla^{2}D(\alpha)\equiv\frac{1}{\lambda n^{2}}\mathbf{A}^{\top}\mathbf{A}+\frac{\gamma}{n}\mathbf{I}. It suffices to apply Theorem 1(10), with M=G=2D(α)\mathbf{M}=\mathbf{G}=\nabla^{2}D(\alpha). ∎

We now prove a more general version of a classical result in dual coordinate ascent methods which bounds the duality gap from above by the expected dual increase.

The output sequence {wk,αk}k0\{w^{k},\alpha^{k}\}_{k\geq 0} of Algorithm 3 satisfies:

The proof of the this lemma is provided in the supplementary material. Theorem 4 (resp. Theorem 5) now follows by combining Lemma 3 (resp. Lemma 4) and Lemma 5.

3 New Algorithm: SDCA with Arbitrary Sampling

When S^=1|\hat{S}|=1 with probability 1, SDNA reduces to a proximal variant of stochastic dual coordinate ascent (SDCA) Shalev-Shwartz & Zhang (2013d). However, a minibatch version of standard SDCA in the ERM setup we consider here has not been previously studied in the literature. Takáč et al. (2013) developed such a method but in the special case of hinge-loss (which is not smooth and hence does not fit our setup). Shalev-Shwartz & Zhang (2013b) studied minibatching but in conjunction with acceleration and the QUARTZ method of Qu et al. (2014), which has been analyzed for an arbitrary sampling S^\hat{S}, uses a different primal update than SDNA. Hence, in order to compare SDNA with an SDCA-like method which is as close a match to SDNA as possible, we need to develop a new method. Algorithm 4 is an extension of SDCA to allow it handle an arbitrary uniform sampling S^\hat{S}.

The complexity of Minibatch SDCA (we henceforth just write SDCA) is given in Theorem 6.

If (22) holds, then the output sequence {wk,αk}k0\{w^{k},\alpha^{k}\}_{k\geq 0} of Algorithm 4 satisfies:

4 SDNA vs SDCA

We now compare the rates of SDNA and SDCA. The next result says that the rate of SDNA is always superior to that of SDCA. We also see that the rate is better in the quadratic case covered by Theorem 5.

Since S^\hat{S} is a uniform sampling, we have pi=τnp_{i}=\frac{\tau}{n} for all i[n]i\in[n]. In view of (21), we have 1nτθ(S^)1\leq\tfrac{n}{\tau}\theta(\hat{S}). Next,

SDNA as Iterative Hessian Sketch

We now apply SDNA to the least squares problem:

and show that the resulting primal update can be interpreted as an iterative Hessian sketch, alternative to the one proposed by Pilanci & Wainwright (2014). We first need to establish a simple duality result.

Let α\alpha^{*} be the optimal solution of

then the optimal solution ww^{*} of (24) is w=1λnAα.w^{*}=\frac{1}{\lambda n}\mathbf{A}\alpha^{*}.

Problem (24) is a special case of (17) for g(w)12w2g(w)\equiv\frac{1}{2}\|w\|^{2} and ϕi(a)12(abi)2\phi_{i}(a)\equiv\frac{1}{2}(a-b_{i})^{2} for all i[n]i\in[n]. Problem (25) is the dual of (24) and the result follows from (20). ∎

The interpretation of SDNA as a variant of the Iterative Hessian sketch method of Pilanci & Wainwright (2014) follows immediately from the following theorem.

The output sequence {wk,αk}k0\{w^{k},\alpha^{k}\}_{k\geq 0} of Algorithm 3 applied on problem (24) satisfies:

where Sk\mathbf{S}_{k} denotes the nn-by-Sk|S_{k}| submatrix of the identity matrix In\mathbf{I}_{n} with columns in the random subset SkS_{k}.

We know that SkΔαk\mathbf{S}_{k}^{\top}\Delta\alpha^{k} is the optimal solution of

Let τ=Sk\tau=|S_{k}|. By Lemma 6, the optimal solution of

is given by 1λnASkSkαk\frac{1}{\lambda n}\mathbf{A}\mathbf{S}_{k}\mathbf{S}_{k}^{\top}\nabla\alpha^{k}, which equals αˉk+1αˉk\bar{\alpha}^{k+1}-\bar{\alpha}^{k} and thus equals wk+1wkw^{k+1}-w^{k}. Hence,

which is equivalent to (26) since (In)Sk=SkSk(\mathbf{I}_{n})_{S_{k}}=\mathbf{S}_{k}\mathbf{S}_{k}^{\top}. ∎

Numerical Experiments

In our first experiment (Figure 1) we compare SDNA and our new minibatch version of SDCA on one real (mushrooms; d=112d=112, n=8,124n=8,124) and one synthetic (d=1,024d=1,024, n=2,048n=2,048) dataset. In both cases, we used λ=1/n\lambda=1/n as the regularization parameter and g(w)=12w2g(w)=\tfrac{1}{2}\|w\|^{2}. As τ\tau increases, SDNA requires less passes over data (epochs), while SDCA requires more passes over data. It can be shown that this behavior can be predicted from the complexity results for these two methods. The difference in performance depends on the choice of the dataset and can be quite dramatic.

In the second experiment (Figure 2), we investigate how much time it takes for the methods to process a single epoch, using the same datasets as before. As τ\tau increases, SDNA does more work as the subproblems it needs to solve in each iteration involve a τ×τ\tau\times\tau submatrix of the Hessian of the smooth part of the dual objective function. On the other hand, the work SDCA needs to do is much smaller, and does nearly does not increase with the minibatch size τ\tau. This is because the subproblems are separable. As before, all experiments are done using a single core (however, both methods would benefit from a parallel implementation).

Finally, in Figure 3 we put the insights gained from the previous two experiments together: we look at the performance of SDNA for various choices of τ\tau by comparing runtime and duality gap error. We should expect that increasing τ\tau would lead to faster method in terms of passes over data, but that this would also lead to slower iterations. The question is, is does the gain outweight the loss? The answer is: yes, for small enough minibatch sizes. Indeed, looking at Figure 3, we see that the runtime of SDNA improved up to the point τ=16\tau=16 for both datasets, and then starts to deteriorate. In situations where it is costly to fetch data from memory to a (fast) processor, much larger minibatch sizes would be optimal.

References

APPENDIX: Proof of Theorem 3

It follows directly from Assumption 1 and the update rule xk+1=xk+(hk)Skx^{k+1}=x^{k}+(h^{k})_{S_{k}} in Algorithm 1 that:

Since hkh^{k} is defined as the minimizer of the right hand side in the last inequality, we can further bound this term by replacing hkh^{k} with h=λ(xxk)h=\lambda(x^{*}-x^{k}) for arbitrary λ\lambda\in:

Now we use the fact that ψi\psi_{i} is γi\gamma_{i}-strongly convex to obtain:

By taking expectations in SkS_{k} on both sides of the last inequality, we see that for any λ\lambda\in, the following holds:

APPENDIX: Proof of Lemma 5

Recall that M=1nX\mathbf{M}=\tfrac{1}{n}\mathbf{X}, where X=1λnAA\mathbf{X}=\tfrac{1}{\lambda n}\mathbf{A}^{\top}\mathbf{A}.

For simplicity in this proof we write θ=θ(S^)\theta=\theta(\hat{S}). First, by the 1-strong convexity of the function gg we obtain the 1-smoothness of the function gg^{*}, from which we deduce:

Now we replace g(αˉk)\nabla g^{*}(\bar{\alpha}^{k}) by wkw^{k} and αˉ\bar{\alpha} by 1λnAα\frac{1}{\lambda n}\mathbf{A}\alpha to obtain:

From γ\gamma-strong convexity of the functions ϕi\phi_{i}^{*} we deduce that:

where the equality follows from uik=ϕi(aiwk)u_{i}^{k}=\nabla\phi_{i}(a_{i}^{\top}w^{k}). Next, by the definition of θ\theta in (21), we know that:

In the main text we have shown that σ2σ3\sigma_{2}\geq\sigma_{3}, where σ2\sigma_{2} is the rate of Method 2 and σ3\sigma_{3} is the rate of Method 3: NSync Richtárik & Takáč (2013b). In this section we give a more detailed description of the relationship between these two quantities in the case when S^\hat{S} is the τ\tau-nice sampling Richtárik & Takáč (2012). That is, S^\hat{S} picks subsets of [n][n] of cardinality τ\tau, uniformly at random. For this sampling,

Suppose that G=M\mathbf{G}=\mathbf{M} and S^\hat{S} be the τ\tau-nice sampling. Then there exists β[1,τ]\beta\in[1,\tau] such that one can choose vi=βMi,iv_{i}=\beta\mathbf{M}_{i,i} and

As explained in Richtárik & Takáč (2012), (9) is always true if we take vi=βMi,iv_{i}=\beta\mathbf{M}_{i,i} with β=τ\beta=\tau but smaller values (leading to a faster algorithm) may be computable if the problem exhibits a property called “partial separability”.

Let us denote by D\mathbf{D} the diagonal matrix whose entries are the diagonal entries of M\mathbf{M}.

Let us denote A=M1/2DM1/2\mathbf{A}=\mathbf{M}^{-1/2}\mathbf{D}\mathbf{M}^{-1/2} and α=τ1n1\alpha=\frac{\tau-1}{n-1}.

But we have λmax((1α)A+αI)=(1α)λmax(A)+α\lambda_{\max}((1-\alpha)\mathbf{A}+\alpha\mathbf{I})=(1-\alpha)\lambda_{\max}(\mathbf{A})+\alpha, so

Note that if σ3\sigma_{3} is small, then σ2\sigma_{2} is of the order of βσ31τ1n1>βσ3\frac{\beta\sigma_{3}}{1-\frac{\tau-1}{n-1}}>\beta\sigma_{3} . ∎