Algorithms of Robust Stochastic Optimization Based on Mirror Descent Method

Anatoli Juditsky, Alexander Nazin, Arkadi Nemirovsky, Alexandre Tsybakov

Introduction

In this paper, we consider the problem of convex composite stochastic optimization:

where XX is a compact convex subset of a finite-dimensional real vector space EE with norm \|\cdot\|, ω\omega is a random variable on a probability space Ω\Omega with distribution PP, function ψ\psi is convex and continuous, and function Φ:  X×ΩR\Phi:\;X\times\Omega\to{\bf R}. Suppose that the expectation

is finite for all xXx\in X, and is a convex and differentiable function of xx. Under these assumptions, the problem (1) has a solution with optimal value F=minxXF(x)F_{*}=\min_{x\in{X}}F(x).

Assume that there is an oracle, which for any input (x,ω)X×Ω(x,\omega)\in X\times\Omega returns a stochastic gradient that is a vector G(x,ω)G(x,\omega) satisfying

where \|\cdot\|_{*} is conjugate norm to \|\cdot\|, and σ>0\sigma>0 is a constant. The aim of this paper is to construct (1α)(1-\alpha)-reliable approximate solutions of the problem (1), i.e., solutions x^N{\widehat{x}}_{N}, based on NN queries of the oracle and satisfying the condition

with as small as possible δN(α)>0\delta_{N}(\alpha)>0.

Note that stochastic optimization problems of the form (1) arise in the context of penalized risk minimization, where the confidence bounds (3) are directly converted into confidence bounds for the accuracy of the obtained estimators. In this paper, the bounds (3) are derived with δN(α)\delta_{N}(\alpha) of order ln(1/α)/N\sqrt{\ln(1/\alpha)/N}. Such bounds are often called sub-Gaussian confidence bounds. Standard results on sub-Gaussian confidence bounds for stochastic optimization algorithms assume boundedness of exponential or subexponential moments of the stochastic noise of the oracle G(x,ω)ϕ(x)G(x,\omega)-\nabla\phi(x) (cf. ). In the present paper, we propose robust stochastic algorithms that satisfy sub-Gaussian bounds of type (3) under a significantly less restrictive condition (2).

Recall that the notion of robustness of statistical decision procedures was introduced by J. Tukey and P. Huber in the 19601960ies, which led to the subsequent development of robust stochastic approximation algorithms. In particular, in the 1970ies–1980ies, algorithms that are robust for wide classes of noise distributions were proposed for problems of stochastic optimization and parametric identification. Their asymptotic properties when the sample size increases have been well studied, see, for example, and references therein. An important contribution to the development of the robust approach was made by Ya.Z. Tsypkin. Thus, a significant place in the monographs is devoted to the study of iterative robust identification algorithms.

The interest in robust estimation resumed in the 2010ies due to the need to develop statistical procedures that are resistant to noise with heavy tails in high-dimensional problems. Some recent work develops the method of median of means for constructing estimates that satisfy sub-Gaussian confidence bounds for noise with heavy tails. Thus, in the median of means approach was used to construct an (1α)(1-\alpha)-reliable version of stochastic approximation with averaging (“batch” algorithm) in a stochastic optimization setting similar to (1). Other original approaches were developed in , in particular, the geometric median techniques for robust estimation of signals and covariance matrices with sub-Gaussian guarantees . Also there was a renewal of interest in robust iterative algorithms. Thus, it was shown that robustness of stochastic approximation algorithms can be enhanced by using the geometric median of stochastic gradients . Another variant of the stochastic approximation procedure for calculating the geometric median was studied in , where a specific property of the problem (boundedness of the stochastic gradients) allowed the authors to construct (1α)(1-\alpha)-reliable bounds under a very weak assumption about the tails of the noise distribution.

This paper discusses an approach to the construction of robust stochastic algorithms based on truncation of the stochastic gradients. It is shown that this method satisfies sub-Gaussian confidence bounds. In Sections 2 and 3, we define the main components of the optimization problem under consideration. In Section 4, we define the robust stochastic mirror descent algorithm and establish confidence bounds for it. Section 5 is devoted to robust accuracy estimates for general stochastic algorithms. Finally, Section 6 establishes robust confidence bounds for problems, in which FF has a quadratic growth. The Appendix contains the proofs of the results of the paper.

Notation and Definitions

Let EE be a finite-dimensional real vector space with norm \|\cdot\| and let EE^{*} be the conjugate space to EE. Denote by s,x\langle s,x\rangle the value of linear function sEs\in E^{*} at point xEx\in E and by \|\cdot\|_{*} the conjugate to norm \|\cdot\| on EE^{*}, i.e.,

we consider a continuous convex function θ:BR\theta:B\to\textbf{R} with the following property:

where θ()\theta^{\prime}(\cdot) is a continuous in Bo={xB:θ(x)}B^{o}=\{x\in B:\,\partial\theta(x)\neq\emptyset\} version of the subgradient of θ()\theta(\cdot) and θ(x)\partial\theta(x) denotes the subdifferential of function θ()\theta(\cdot) at point xx, i.e., the set of all subgradients at this point. In other words, function θ()\theta(\cdot) is strongly convex on BB with coefficient 1 with respect to the norm \|\cdot\|. We will call θ()\theta(\cdot) the normalized proxy function. Examples of such functions are:

\theta(x)=\mbox{\small\frac{1}{2}}\|x\|_{2}^{2} for (E,)=(Rn,2)\left(E,\|\cdot\|\right)=\left(\textbf{R}^{n},\|\cdot\|_{2}\right);

θ(x)=2e(lnn)xpp\theta(x)={2\rm{e}(\ln n)}\|x\|_{p}^{p} with p=p(n):=1+12lnnp=p(n):=1+{1\over 2\ln n} for (E,)=(Rn,1)\left(E,\|\cdot\|\right)=\left(\textbf{R}^{n},\|\cdot\|_{1}\right);

θ(x)=4e(lnn)i=1nλi(x)p\theta(x)=4\rm{e}(\ln n)\sum_{i=1}^{n}|\lambda_{i}(x)|^{p} with p=p(n)p=p(n) for E=SnE=S_{n}, where SnS_{n} is the space of symmetric n×nn\times n matrices equipped with the nuclear norm x=i=1nλi(x)\|x\|=\sum_{i=1}^{n}|\lambda_{i}(x)| and λi(x)\lambda_{i}(x) are eigenvalues of matrix xx.

Now, let XX be a convex compact subset in EE and let x0Xx_{0}\in X and R>0R>0 be such that maxxXxx0R\max_{x\in X}\|x-x_{0}\|\leq R. We equip XX with a proxy function

Note that ϑ()\vartheta(\cdot) is strongly convex with coefficient 1 and

Let D:=maxx,xXxxD:=\max_{x,x^{\prime}\in X}\|x-x^{\prime}\| be the diameter of the set XX. Then D2RD\leq 2R.

In the following, we denote by CC and CC^{\prime} positive numerical constants, not necessarily the same in different cases.

Assumptions

Consider a convex composite stochastic optimisation problem (1) on a convex compact set XEX\subset E. Assume in the following that the function

is convex on XX, differentiable at each point of the set XX and its gradient satisfies the Lipschitz condition

Assume also that function ψ\psi is convex and continuous. In what follows, we assume that we have at our disposal a stochastic oracle, which for any input (x,ω)X×Ω(x,\omega)\in X\times\Omega, returns a random vector G(x,ω)G(x,\omega), satisfying the conditions (2). In addition, it is assumed that for any aEa\in E^{*} and β>0\beta>0 an exact solution of the minimization problem

with a constant υ0\upsilon\geq 0. This assumption is motivated as follows.

First, if we a priori know that the global minimum of function ϕ\phi is attained at an interior point xϕx_{\phi} of the set XX (what is common in statistical applications of stochastic approximation), we have ϕ(xϕ)=0\nabla\phi(x_{\phi})=0. Therefore, choosing xˉ=xϕ\bar{x}=x_{\phi}, one can put g(xˉ)=0g(\bar{x})=0 and assumption (6) holds automatically with υ=0\upsilon=0.

Second, in general, one can choose xˉ\bar{x} as any point of the set XX and g(xˉ)g(\bar{x}) as a geometric median of stochastic gradients G(xˉ,ωi)G(\bar{x},\omega_{i}), i=1,,mi=1,\dots,m, over mm oracle queries. It follows from that if mm is of order ln(ε1)\ln\left(\varepsilon^{-1}\right) with some sufficiently small ε>0\varepsilon>0, then

Thus, the confidence bounds obtained below will remain valid up to an ε\varepsilon-correction in the probability of deviations.

Accuracy bounds for Algorithm RSMD

In what follows, we consider that the assumptions of Section 3 are fulfilled. Introduce a composite proximal transform

For i=1,2,i=1,2,\dots, define the algorithm of Robust Stochastic Mirror Descent (RSMD) by the recursion

Here βi>0\beta_{i}>0, i=0,1,i=0,1,\dots, and λ>0\lambda>0 are tuning parameters that will be defined below, and ω1,ω2,\omega_{1},\omega_{2},\dots are independent identically distributed (i.i.d.) realizations of a random variable ω\omega, corresponding to the oracle queries at each step of the algorithm.

The approximate solution of problem (1) after NN iterations is defined as the weighted average

If the global minimum of function ϕ\phi is attained at an interior point of the set XX and υ=0\upsilon=0, then definition (12) is simplified. In this case, replacing xˉxi1\|\bar{x}-x_{i-1}\| by the upper bound DD and putting υ=0\upsilon=0 and g(xˉ)=0g(\bar{x})=0 in (12), we define the truncated stochastic gradient by the formula

The next result describes some useful properties of mirror descent recursion (9). Define

Let βi2L\beta_{i}\geq 2L for all i=0,1,...i=0,1,..., and let x^N\widehat{x}_{N} be defined in (13), where xix_{i} are iterations (9) for any values yiy_{i}, not necessarily given by (12). Then for any zXz\in X we have

where ziz_{i} is a random vector with values in XX depending only on x0,ξ1,,ξix_{0},\xi_{1},\dots,\xi_{i}.

Using Proposition 1 we obtain the following bounds on the expected error F(x^N)FF({\widehat{x}}_{N})-F_{*} of the approximate solution of problem (1) based on the RSMD algorithm. In what follows, we denote by E{}{\bf E}\{\cdot\} the expectation with respect to the distribution of ωN=(ω1,...,ωN)ΩN\omega^{N}=(\omega_{1},...,\omega_{N})\in\Omega^{\otimes N}.

Set M=LRM=LR. Assume that λmax{M,σN}+υσ\lambda\geq\max\{M,\sigma\sqrt{N}\}+\upsilon\sigma and βi2L\beta_{i}\geq 2L for all i=0,1,...i=0,1,.... Let x^N{\widehat{x}}_{N} be the approximate solution (13), where xix_{i} are the iterations of the RSMD algorithm defined by relations (9) and (12). Then

In particular, if βi=βˉ\beta_{i}=\bar{\beta} for all i=0,1,...i=0,1,..., where

Moreover, in this case we have the following inequality with explicit constants:

This result shows that if the truncation threshold λ\lambda is large enough, then the expected error of the proposed algorithm is bounded similarly to the expected error of the standard mirror descent algorithm with averaging, i.e., the algorithm in which stochastic gradients are taken without truncation: yi=G(xi1,ωi)y_{i}=G(x_{i-1},\omega_{i}).

The following theorem gives confidence bounds for the proposed algorithm.

Let βi=βˉ2L\beta_{i}=\bar{\beta}\geq 2L for all i=0,1,...i=0,1,..., and let 1τN/υ21\leq\tau\leq N/\upsilon^{2},

Let x^N{\widehat{x}}_{N} be the approximate solution (13), where xix_{i} are the RSMD iterations defined by relations (9) and (12). Then there is a random event ANΩN{\cal A}_{N}\subset\Omega^{\otimes N} of probability at least 12eτ1-2e^{-\tau} such that for all ωNAN\omega^{N}\in{\cal A}_{N} the following inequalities hold:

In paticular, chosing βˉ\bar{\beta} as in formula (18) we have, for all ωNAN\omega^{N}\in{\cal A}_{N},

where C1>0C_{1}>0 and C2>0C_{2}>0 are numerical constants.

The values of the numerical constants C1C_{1} and C2C_{2} in (21) can be obtained from the proof of the theorem, cf. the bound in (51).

Confidence bound (21) in Theorem 1 contains two terms corresponding to the deterministic error and to the stochastic error. Unlike the case of noise with a “light tail” (see, for example, ) and the bound in expectation (19), the deterministic error LR2[τΘ]/NLR^{2}[\tau\vee\Theta]/N depends on τ\tau. Note also that Theorem 1 gives a sub-Gaussian confidence bound (the order of the stochastic error is σR[τΘ]/N\sigma R\sqrt{[\tau\vee\Theta]/N}). However, the truncation threshold λ\lambda depends on the confidence level τ\tau. This can be inconvenient for the implementation of the algorithms. Some simple but coarser confidence bounds can be obtained by using a universal threshold independent of τ\tau, which is λ=max{σN,M}+υσ\lambda=\max\{\sigma\sqrt{N},M\}+\upsilon\sigma. In particular, we have the following result.

Let βi=βˉ2L\beta_{i}=\bar{\beta}\geq 2L for all i=0,1,...i=0,1,..., and let Nυ2N\geq\upsilon^{2}. Set

Let x^N=N1i=1Nxi{\widehat{x}}_{N}=N^{-1}\sum_{i=1}^{N}x_{i}, where xix_{i} are the iterations of the RSMD algorithm defined by relations (9) and (12). Then there is a random event ANΩN{\cal A}_{N}\subset\Omega^{\otimes N} of probability at least 12eτ1-2e^{-\tau} such that for all ωNAN\omega^{N}\in{\cal A}_{N} the following inequalities hold:

In particular, choosing βˉ\bar{\beta} as in formula (18) we have

The values of the numerical constants CC in Theorem 2 can be obtained from the proof, cf. the bound in (51).

Robust Confidence Bounds for Stochastic Optimization Methods

Consider an arbitrary algorithm for solving the problem (1) based on NN queries of the stochastic oracle. Assume that we have a sequence \big{(}x_{i},G(x_{i},\omega_{i+1})\big{)},\;i=0,...,N, where xiXx_{i}\in X are the search points of some stochastic algorithm and G(xi,ωi+1)G(x_{i},\omega_{i+1}) are the corresponding observations of the stochastic gradient. It is assumed that xix_{i} depends only on {(xj1,ωj),j=1,,i}\{(x_{j-1},\omega_{j}),j=1,\dots,i\}. The approximate solution of the problem (1) is defined in the form:

Our goal is to construct a confidence interval with sub-Gaussian accuracy for F(x^N)FF({\widehat{x}}_{N})-F_{*}. To do this, we use the following fact. Note that for any tLt\geq L the value

is an upper bound on the accuracy of the approximate solution x^N{\widehat{x}}_{N}:

(see Lemma 1 in Appendix). This fact is true for any sequence of points x0,,xNx_{0},\dots,x_{N} in XX, regardless of how they are obtained. However, since the function ϕ()\nabla\phi(\cdot) is not known, the estimate (24) cannot be used in practice. Replacing the gradients ϕ(xi1)\nabla\phi(x_{i-1}) in (23) with their truncated estimates yiy_{i} defined in (12) we get an implementable analogue of ϵN(t)\epsilon_{N}(t):

Note that computing ϵ^N(t){\widehat{\epsilon}}_{N}(t) reduces to solving a problem of the form (4) with β=0\beta=0. Thus, it is computationally not more complex than, for example, one step of the RSMD algorithm. Replacing ϕ(xi1)\nabla\phi(x_{i-1}) with yiy_{i} introduces a random error. In order to get a reliable upper bound for ϵN(t)\epsilon_{N}(t), we need to compensate this error by slightly increasing ϵ^N(t){\widehat{\epsilon}}_{N}(t). Specifically, we add to ϵ^N(t){\widehat{\epsilon}}_{N}(t) the value

Let \big{(}x_{i},G(x_{i},\omega_{i+1})\big{)}_{i=0}^{N} be the trajectory of a stochastic algorithm for which xix_{i} depends only on {(xj1,ωj),j=1,,i}\{(x_{j-1},\omega_{j}),j=1,\dots,i\}. Let 0<τN/υ20<\tau\leq N/\upsilon^{2} and let yi=yi(τ)y_{i}=y_{i}(\tau) be truncated stochastic gradients defined in (12), where the threshold λ=λ(τ)\lambda=\lambda(\tau) is chosen in the form (20). Then for any tLt\geq L the value

Since ΔN(τ,t)\Delta_{N}(\tau,t) monotonically increases in tt it suffices to use this bound for t=Lt=L when LL is known. Note that, although ΔN(τ,t)\Delta_{N}(\tau,t) gives an upper bound for ϵN(t)\epsilon_{N}(t), Proposition 2 does not guarantee that ΔN(τ,t)\Delta_{N}(\tau,t) is sufficiently close to ϵN(t)\epsilon_{N}(t). However, this property holds for the RSMD algorithm with a constant step, as follows from the next result.

Under the conditions of Proposition 2, let the vectors x0,,xNx_{0},\dots,x_{N} be given by the RSMD recursion (9)–(12), where βi=βˉ2L\beta_{i}=\bar{\beta}\geq 2L, i=0,...,N1i=0,...,N-1. Then

Moreover, if βˉmax{2L,σNRΘ}\bar{\beta}\geq\max\left\{2L,{\sigma\sqrt{N}\over R\sqrt{\Theta}}\right\} then

where C3>0C_{3}>0 and C4>0C_{4}>0 are numerical constants.

The values of the numerical constants C3C_{3} and C4C_{4} can be derived from the proof of this corollary.

Robust Confidence Bounds for Quadratic Growth Problems

In this section, it is assumed that FF is a function with quadratic growth on XX in the following sense (cf. ). Let FF be a continuous function on XX and let XXX_{*}\subset X be the set of its minimizers on XX. Then FF is called a function with quadratic growth on XX if there is a constant κ>0\kappa>0 such that for any xXx\in X there exists xˉ(x)X\bar{x}(x)\in X_{*} such that the following inequality holds:

The RSMD algorithm for quadratically growing functions will be defined in stages. At each stage, for specially selected r>0r>0 and yXy\in X it solves an auxiliary problem

We initialize the algorithm by choosing arbitrary y0=x0Xy_{0}=x_{0}\in X and r0maxzXzx0r_{0}\geq\max_{z\in X}\|z-x_{0}\|. We set rk2=2kr02r^{2}_{k}=2^{-k}r^{2}_{0}, k=1,2,...k=1,2,.... Let C1C_{1} and C2C_{2} be the numerical constants in the bound (21) of Theorem 1. For a given parameter τ>0\tau>0, and k=1,2,k=1,2,\dots we define the values

Here t\rfloor t\lfloor denotes the smallest integer greater than or equal to tt. Set

Now, let k{1,2,,m(N)}k\in\{1,2,\dots,m(N)\}. At the kk-th stage of the algorithm, we solve the problem of minimization of FF on the ball Xrk1(yk1)X_{r_{k-1}}(y_{k-1}), we find its approximate solution x^Nk{\widehat{x}}_{N_{k}} according to (9)–(13), where we replace x0x_{0} by yk1y_{k-1}, XX by Xrk1(yk1)X_{r_{k-1}}(y_{k-1}), RR by rk1r_{k-1}, NN by NkN_{k}, and set

It is assumed that, at each stage kk of the algorithm, an exact solution of the minimization problem

is available for any aEa\in E and β>0\beta>0. At the output of the kk-th stage of the algorithm, we obtain yk:=x^Nky_{k}:={\widehat{x}}_{N_{k}}.

Assume that m(N)1m(N)\geq 1, i.e. at least one stage of the algorithm described above is completed. Then there is a random event BNΩN{\cal B}_{N}\subset\Omega^{\otimes N} of probability at least 12m(N)eτ1-2m(N)e^{-\tau} such that for ωNBN\omega^{N}\in{\cal B}_{N} the approximate solution ym(N)y_{m(N)} after m(N)m(N) stages of the algorithm satisfies the inequality

Theorem 3 shows that, for functions with quadratic growth, the deterministic error component can be significantly reduced – it becomes exponentially decreasing in NN. The stochastic error component is also significantly reduced. Note that the factor m(N)m(N) is of logarithmic order and has little effect on the probability of deviations. Indeed, it follows from (29) that m(N)Cln(Cκ2r02Nσ2(τΘ))m(N)\leq C\ln\left(\frac{C^{\prime}\kappa^{2}r_{0}^{2}N}{\sigma^{2}(\tau\vee\Theta)}\right). Neglecting this factor in the probability of deviations and considering the stochastic component of the error, we see that the confidence bound of Theorem 3 is approximately sub-exponential rather than sub-Gaussian.

Conclusion

We have considered algorithms of smooth stochastic optimization when the distribution of noise in observations has heavy tails. It is shown that by truncating the observed gradients with a suitable threshold one can construct confidence sets for the approximate solutions that are similar to those in the case of “light tails”. It should be noted that the order of the deterministic error in the obtained bounds is suboptimal — it is substantially greater than the optimal rates achieved by the accelerated algorithms , namely, O(LR2N2)O(LR^{2}N^{-2}) in the case of convex objective function and O(exp(Nκ/L))O(\exp(-N\sqrt{\kappa/L})) in the strongly convex case. On the other hand, the proposed approach cannot be used to obtain robust versions of the accelerated algorithms since applying it to such algorithms leads to accumulation of the bias caused by the truncation of the gradients. The problem of constructing accelerated robust stochastic algorithms with optimal guarantees remains open.

A.1. Preliminary remarks. We start with the following known result.

Assume that ϕ\phi and ψ\psi satisfy the assumptions of Section 3, and let x0,,xNx_{0},\dots,x_{N} be some points of the set XX. Define

Then for any zXz\in X the following inequality holds:

In addition, for x^N=1Ni=1Nxi\widehat{x}_{N}={1\over N}\sum_{i=1}^{N}x_{i} we have

Proof Using the property V_{x}(z)\geq\mbox{\small\frac{1}{2}}\|x-z\|^{2}, the convexity of functions ϕ\phi and ψ\psi and the Lipschitz condition on ϕ\nabla\phi we get that, for any zXz\in X,

Summing up over ii from 0 to N1N-1 and using the convexity of FF we obtain the second result of the lemma. \square

In what follows, we denote by Exi{}{\bf E}_{x_{i}}\{\cdot\} the conditional expectation for fixed xix_{i}.

Let the assumptions of Section 3 be fulfilled and let xix_{i} and yiy_{i} satisfy the RSMD recursion, cf. (9) and (12). Then

Proof Set χi=1G(xi1,ωi)g(xˉ)>Lxi1xˉ+λ+υσ\chi_{i}=1_{\|G(x_{i-1},\omega_{i})-g(\bar{x})\|_{*}>L\|x_{i-1}-\bar{x}\|+\lambda+\upsilon\sigma}. Note that by construction χiηi:=1G(xi1,ωi)f(xi1>λ\chi_{i}\leq\eta_{i}:=1_{\|G(x_{i-1},\omega_{i})-\nabla f(x_{i-1}\|_{*}>\lambda}. We have

Moreover, since Exi1{G(xi1,ωi)}=ϕ(xi1){\bf E}_{x_{i-1}}\{G(x_{i-1},\omega_{i})\}=\nabla\phi(x_{i-1}) we have

The following lemma gives bounds for the deviations of the sums iξi,xi1z\sum_{i}\langle\xi_{i},x_{i-1}-z\rangle and iξi2\sum_{i}\|\xi_{i}\|_{*}^{2}.

Let the assumptions of Section 3 be fulfilled and let xix_{i} and yiy_{i} satisfy the recursion of RSMD, cf. (9) and (12).

(i) If τN/υ2\tau\leq{N/\upsilon^{2}} and λ=max{σNτ,M}+υσ\lambda=\max\left\{\sigma\sqrt{N\over\tau},M\right\}+\upsilon\sigma then, for any zXz\in X,

(ii) If Nυ2N\geq\upsilon^{2} and λ=max{σN,M}+υσ\lambda=\max\left\{\sigma\sqrt{N},M\right\}+\upsilon\sigma then, for any zXz\in X,

Proof Set ζi=ξi,zxi1\zeta_{i}=\langle\xi_{i},z-x_{i-1}\rangle and ςi=ξi2\varsigma_{i}=\|\xi_{i}\|_{*}^{2}, i=1,2,i=1,2,\dots Using Lemma 2 it is easy to check that the following inequalities are fulfilled

In what follows, we apply several times the Bernstein inequality, and each time we will use the same notation rr, AA, ss for the values that are, respectively, the uniform upper bound of the expectation, the maximum absolute value, and the standard deviation of a random variable.

1o. We first prove the statement (i)(i). We start with the case MσNτM\leq\sigma\sqrt{N\over\tau}. It follows from (39) that in this case

Using (47) and Bernstein’s inequality for martingales (see, for example, ) we get

for all τ>0\tau>0 satisfying the condition τ16N/(9υ2)\tau\leq{16N/(9\upsilon^{2})}. On the other hand, in the case under consideration, the following inequalities hold (cf. (43) and (47))

for 0<τN/υ20<\tau\leq{N/\upsilon^{2}}. Applying again the Bernstein inequality, we get

for all τ>0\tau>0 satisfying the condition τN/υ2\tau\leq N/\upsilon^{2}.

2o. Assume now that M>σNτM>\sigma\sqrt{N\over\tau}, so that λ=M+υσ\lambda=M+\upsilon\sigma and σ2M2τ/N\sigma^{2}\leq{M^{2}\tau/N}. Then

and applying again the Bernstein inequality we get

for all τ>0\tau>0, satisfying the condition τ16N/(9υ2)\tau\leq 16N/(9\upsilon^{2}). Next, in this case

for τN/υ2\tau\leq{N/\upsilon^{2}}. Applying once again the Bernstein inequality we get

for all τ>0\tau>0 satisfying the condition τN/υ2\tau\leq N/\upsilon^{2}.

3o. Now, consider the case λ=max{M,σN}+συ\lambda=\max\{M,\sigma\sqrt{N}\}+\sigma\upsilon. Let MσNM\leq\sigma\sqrt{N}, so that λ=σ(N+υ)\lambda=\sigma(\sqrt{N}+\upsilon). We argue in the same way as in the proof of (i)(i). By virtue of (39) we have

Hence, using the Bernstein inequality we get

Now, applying again the Bernstein inequality we get

Proofs of the bounds (34) and (35) in the case M>σNM>\sigma\sqrt{N} and λ=M+συ\lambda=M+\sigma\upsilon follow the same lines. \square

A.2. Proof of Proposition 1. We first prove inequality (15). In view of (4), the optimality condition for (9) has the form

where the last equality follows from the following remarkable identity (see, for example, ): for any u,uu,u^{\prime} and wXw\in X

Since, by definition, ξi=yiϕ(xi1)\xi_{i}=y_{i}-\nabla\phi(x_{i-1}) we get

It follows from Lemma 1 and the condition βi2L\beta_{i}\geq 2L that

Together with (48), this inequality implies

On the other hand, due to the strong convexity of Vx()V_{x}(\cdot) we have

for all zXz\in X. Dividing (49) by βi\beta_{i} and taking the sum over ii from to N1N-1 we obtain (15).

We now prove the bound (16). Applying Lemma 6.1 of with z0=x0z_{0}=x_{0} we get

where z_{i}={\rm arg\,min}_{z\in X}\big{\{}\mu_{i-1}\langle\xi_{i},z\rangle+V_{z_{i-1}}(z)\big{\}} depends only on z0,ξ1,,ξiz_{0},\xi_{1},\dots,\xi_{i}. Further,

Combining this inequality with (15), we get (16). \square

A.3. Proof of Corollary 1. Note that (17) is an immediate consequence of (15) and of the bounds for the moments of ξi\|\xi_{i}\|_{*} given in Lemma 2. Indeed, (31)(b) implies that, under the conditions of Corollary 1,

Taking the expectation of both sides of (15) and using the last two inequalities we get (17). The bound (19) is proved in a similar way, with the only difference that instead of inequality (15) we use (16). \square

A.4. Proof of Theorem 1. By virtue of part (i) of Lemma 3, under the condition τN/υ2\tau\leq N/\upsilon^{2} we have that, with probability of at least 12eτ1-2e^{-\tau},

Plugging these bounds in (16) we obtain that, with probability at least 12eτ1-2e^{-\tau}, the following holds:

Next, taking \bar{\beta}=\max\big{\{}2L,{\sigma\over R}\sqrt{N\over\Theta}\big{\}} we get

for 1τN/υ21\leq\tau\leq N/\upsilon^{2}. This implies (21). \square

A.5. Proof of Theorem 2. We act in the same way as in the proof of Theorem 1 with the only difference that instead of part (i) of Lemma 3 we use part (ii) of that lemma, which implies that if Nυ2N\geq\upsilon^{2} then with probability at least 12eτ1-2e^{-\tau} the following inequalities hold:

The proposition is a direct consequence of the following result.

Then, for 0<τN/υ20<\tau\leq N/\upsilon^{2} and tLt\geq L the following inequalities hold

Proof of Lemma ​​. Let us prove the first inequality in (55). Recall that ξi=yiϕ(xi1)\xi_{i}=y_{i}-\nabla\phi(x_{i-1}), i=1,...,Ni=1,...,N. Due to the strong convexity of Vx()V_{x}(\cdot), for any zXz\in X and μ>0\mu>0 we have

Recalling that Vx0(z)R2ΘV_{x_{0}}(z)\leq R^{2}\Theta, we conclude that i=1Nξi,zxiρN(τ;μ,ν)\sum_{i=1}^{N}\langle\xi_{i},z-x_{i}\rangle\leq\rho_{N}(\tau;\mu,\nu) for all zXz\in X and all ωNAN\omega^{N}\in{\cal A}_{N}. Therefore, for ωNAN\omega^{N}\in{\cal A}_{N} we have

which proves the first inequality in (55). The proof of the second inequality in (55) is similar and therefore it is omitted. \square

A.7. Proof of Corollary 2. From the definition of ϵN()\epsilon_{N}(\cdot) we deduce that

and we get (26) by taking μ=1/βˉ\mu=1/\bar{\beta}. On the other hand, one can check that for βˉmax{2L,σNRΘ}\bar{\beta}\geq\max\left\{2L,{\sigma\sqrt{N}\over R\sqrt{\Theta}}\right\} the following inequalities hold:

with the same probability. This implies (27). \square

1o. We first show that for each k=1,,m=m(N)k=1,\dots,m=m(N), the following is true.

Fact Ik{I_{k}}. There is a random event BkΩN{\cal B}_{k}\subseteq\Omega^{\otimes N} of probability at least 12keτ1-2ke^{-\tau} such that for all ωNBk\omega^{N}\in{\cal B}_{k} the following inequalities hold:

The proof of Fact Ik{I_{k}} is carried out by induction. Note that (58)(a) holds with probability 1 for k=0k=0. Set B0=ΩN{\cal B}_{0}=\Omega^{\otimes N}. Assume that (58)(a) holds for some k{0,,m1}k\in\{0,\dots,m-1\} with probability at least 12keτ1-2ke^{-\tau}, and let us show that then Fact Ik+1{I_{k+1}} is true.

Define Fk=minxXrk(yk)F(x)F_{*}^{k}=\min_{x\in X_{r_{k}}(y_{k})}F(x) and let XkX_{*}^{k} be the set of all minimizers of function FF on Xrk(yk)X_{r_{k}}(y_{k}). By Theorem 1 and the definition of NkN_{k} (cf. (29)), there is an event Ak{\cal A}_{k} of probability at least 12eτ1-2e^{-\tau} such that for ωNAk\omega^{N}\in{\cal A}_{k} after the (k+1)(k+1)-th stage of the algorithm we have

where xˉk(yk+1)\bar{x}_{k}(y_{k+1}) is the projection of yk+1y_{k+1} onto XkX_{*}^{k}. Set Bk+1=BkAk{\cal B}_{k+1}={\cal B}_{k}\cap{\cal A}_{k}. Then

In addition, due to the assumption of induction, on the set Bk{\cal B}_{k} (and, therefore, on Bk+1{\cal B}_{k+1}) we have

i.e., the distance between yky_{k} and the set XX_{*} of global minimizers does not exceed rkr_{k}. Therefore, the set Xrk(yk)X_{r_{k}}(y_{k}) has a non-empty intersection with XX_{*}. Thus, XkXX^{k}_{*}\subseteq X_{*}, the point xˉk(yk+1)\bar{x}_{k}(y_{k+1}) is contained in XX_{*} and FkF_{*}^{k} coincides with the optimal value FF_{*} of the initial problem. We conclude that

2o. We now prove the theorem in the case N11{\overline{N}}_{1}\geq 1. This condition is equivalent to the fact that Nk1{\overline{N}}_{k}\geq 1 for all k=1,,m(N)k=1,\dots,m(N), since N1N2Nm(N){\overline{N}}_{1}\leq{\overline{N}}_{2}\leq\cdots\leq{\overline{N}}_{m(N)} by construction. Assume that ωNBm(N)\omega^{N}\in{\cal B}_{m(N)}, so that (58) holds with k=m(N)k=m(N). Since N11{\overline{N}}_{1}\geq 1 we have Nk2NkN_{k}\leq 2{\overline{N}}_{k}. In addition, Nk+12Nk{\overline{N}}_{k+1}\leq 2{\overline{N}}_{k}. Using these remarks and the definition of m(N)m(N) we get

Thus, using the definition of Nk{\overline{N}}_{k} (cf. (29)) we obtain

Two cases are possible: S1N/2S_{1}\geq N/2 or S2N/2S_{2}\geq N/2. If S1N/2S_{1}\geq N/2, then

so that if ωNBm(N)\omega^{N}\in{\cal B}_{m(N)} then

If S2N/2S_{2}\geq N/2 the following inequalities hold:

Therefore, in this case for ωNBm(N)\omega^{N}\in{\cal B}_{m(N)} we have

Let k2k_{*}\geq 2 be the smallest integer kk such that Nk1{\overline{N}}_{k}\geq 1. If k>N/4k_{*}>N/4 it is not difficult to see that m(N)N/4m(N)\geq N/4 and therefore for ωNBm(N)\omega^{N}\in{\cal B}_{m(N)} we have

If 2kN/42\leq k_{*}\leq N/4 we have the following chain of inequalities:

where the first inequality uses the fact that Nm(N)+12Nm(N){\overline{N}}_{m(N)+1}\leq 2{\overline{N}}_{m(N)} and the last inequality follows from the definition of m(N)m(N). Based on this remark and on the fact that Nk/2kNk/2k{\overline{N}}_{k}/2^{k}\leq{\overline{N}}_{k_{*}}/2^{k_{*}} for kkk\geq k_{*} we obtain

where the last inequality follows by noticing that Nk2Nk1<2{\overline{N}}_{k_{*}}\leq 2{\overline{N}}_{k_{*}-1}<2. Hence, taking into account (58)(b) we get that, for ωNBm(N)\omega^{N}\in{\cal B}_{m(N)},

Combining this bound with (60), (61) and (63) we get (30). \square

References