On Stochastic Gradient and Subgradient Methods with Adaptive Steplength Sequences

Farzad Yousefian, Angelia Nedić, Uday V. Shanbhag

I Introduction

The use of stochastic gradient and subgradient schemes for the solution of stochastic convex optimization problems has a long tradition, beginning with an iterative scheme, first proposed by Robbins and Monro , that relied primarily on noisy gradient observations. Research by Ermoliev and his coauthors focused largely on quasigradient (subgradient) methods and considered a host of stochastic programming problems, amongst them being two-period recourse-based problems (see ). To accelerate the convergence of stochastic subgradient methods, ergodic sequences, arising from the averaging of iterates, have been employed in . Often gradient computations are either costly or unavailable; in such instances, a finite-difference approximation of the gradient can be constructed as first observed by Kiefer and Wolfowitz . While standard finite-difference techniques perturb one direction at a time to obtain gradient estimates, simultaneous perturbation stochastic approximation techniques simultaneously perturb all directions and general require fewer function evaluations . More recently, there has been a significant interest in the application of ODE-based methods for investigating the stability and convergence of the associated stochastic approximation schemes . An elegant exposition of these methods may be found in the monographs by Polyak , Kushner and Yin , and Borkar .

Sample-average approximation (SAA) techniques are often viewed as an alternative to stochastic approximation techniques and are particularly attractive when approximate solutions to the problem are desired in an offline manner. This approach relies on using a sample from the underlying distribution to construct a deterministic sample-average problem, which can be subsequently solved via standard nonlinear programming solvers, as seen in . In , the authors demonstrate that stochastic approximation schemes are shown to be competitive with SAA techniques. Importantly in , Nemirovski et al. develop a robust SA scheme that determines an optimal constant steplength for minimizing the theoretical error over a pre-specified number of steps. Mirror-descent generalizations of SA, that rely on a suitably defined prox-mapping, are also presented in (also see ), while validation analysis is provided in .

Stochastic gradient algorithms have also been found to be effective in solving large deterministic problems such as convex feasibility problems , feasibility problems arising in control and some specially structured large-scale convex problems in . Distributed consensus-based stochastic subgradient methods for minimizing a convex objective over a network have been recently developed and studied in . The success of gradient-based methods in solving monotone variational inequalities has prompted the study of similar techniques for contending with stochastic variational inequalities. In fact, Jiang and Xu develop precisely such a scheme for the solution of strongly monotone stochastic variational inequalities and regularized variants were presented in to allow for application to monotone stochastic variational inequalities. Finally, stochastic generalizations of the mirror-prox schemes were examined in and allowed for the solution of monotone variational inequalities.

While stochastic approximation schemes have proved successful, other avenues exist for addressing stochastic programs. For instance, an alternate approach lies in using sample-average approximation methods, that obtain estimators to the optimal value and solution of the problem through the solution of deterministic problem in which the expectation is replaced by a sample-average. Convergence theory for the obtained estimators is examined by Shapiro . Decomposition schemes, that leverage cutting-plane methods, have also been particularly successful in addressing two-period stochastic linear , convex and nonconvex programs while a scalable matrix-splitting decomposition scheme is presented in for two-period stochastic Nash games.

In this paper, we consider adaptive stochastic gradient and subgradient methods for solving constrained stochastic convex optimization problems. The novelty of our work can be categorized as follows: (1) the development and analysis of two adaptive stepsize rules; and (2) the development of a local function smoothing technique. Next, we provide some motivation and a more elaborate description of each.

In stochastic gradient methods (cf. ), the almost-sure convergence of such methods is guaranteed assuming that the stepsize is diminishing but not too rapidly, i.e., the stepsize is proportional to 1ka\frac{1}{k^{a}} with 12<a1\frac{1}{2}<a\leq 1. Typically, there is no guidance on the specific choice of the sequence and problem parameters play little role in refining this choice. In contrast, in this paper, we propose specific (adaptive) rules for the stepsize values that exploit the information about the objective function. Accordingly, our first goal lies in examining whether one can construct a convergent scheme under an adaptive stepsize rule that is more reflective of the problem setting. Through out this part of the paper, we assume that the integrand of the expectation is a random convex differentiable function. Under a Lipschitzian assumption on the gradient, we propose two different adaptive stepsize rules:

Recursive stepsize rule: In attempting to minimize the bound on the expected error, we develop a recursive scheme for specifying the stepsize that requires only the steplength at the previous parameter and some problem parameters. Global convergence and rate estimates for this scheme are developed.

Cascading stepsize rule: It is well-known that under suitable assumptions, fixed-stepsize schemes are guaranteed to converge to a compact region containing the solution set of the original problem. We consider a modified version of such a scheme where the trajectory moves to successively smaller compact regions containing the solution sets. Furthermore, as soon as the trajectory of iterates reaches within a bound of the solution set, the steplength is updated allowing the sequence to make further progress. Effectively, we consider a method in which the steplength sequence can be viewed as one where the stepsize is maintained constant with drops or cascades in stepsize occurring at particular epochs. While the scheme has intuitive appeal, we provide a theoretical support for the convergence of such an algorithmic framework.

When the random integrands arising in such stochastic problems are nonsmooth, direct application of known SA schemes is impossible. Contending with nonsmoothness in mathematical programming is often managed through avenues that rely on the solution of a sequence of smoothed problems (cf. ). In a stochastic regime, an approach for addressing such problems is through a technique of global smoothing, as considered in and more recently in .See for a scheme that develops an approximation method for addressing a class of separable piecewise-linear stochastic optimization problems with integer breakpoints. This involves modifying the original problem by adding a random variable with possibly unbounded support. However, such a technique is not feasible in when the objective is defined over a restricted domain. We present a local smoothing technique which leads to a globally differentiable approximation of the original function with Lipschitz continuous gradients. Furthermore, through such a smoothing, we derive a Lipschitz constant for the gradients and show that the constant grows at the rate of n\sqrt{n} where nn is the dimensionality of the problem space. Importantly, this Lipschitzian property facilitates the construction of a stochastic approximation framework. Consequently, the second part of the paper focuses on computing solutions to approximations with smoothed integrands whose gradients are shown to be provably Lipschitz continuous.

The remainder of the paper is organized as follows. In Section II, we establish the almost-sure convergence of the classical stochastic approximation algorithm for a constrained problem with a differentiable convex function with Lipschitz gradients. In Sections III and IV, for a strongly convex function, we propose and analyze two different stepsize rules, each motivated by a minimization of an estimate on the expected error per iteration of the method. In Section V, we introduce a local randomized smoothing technique for nondifferentiable convex optimization, and derive its approximation properties as well as a bound on the Lipschitz constant of the gradients. In Section VI, we report some numerical results obtained by applying our proposed stepsize rules and the smoothing technique to three test problems and conclude with a discussion in Section VII.

Notation and basic terminology: We view vectors as columns, and write xTx^{T} to denote the transpose of a vector xx. We use x\|x\| to denote the Euclidean vector norm, i.e., x=xTx\|x\|=\sqrt{x^{T}x}. We write ΠX(x)\Pi_{X}(x) to denote the Euclidean projection of a vector xx on a set XX. i.e., xΠX(x)=minyXxy\|x-\Pi_{X}(x)\|=\min_{y\in X}\|x-y\|. For a convex function ff with domain domf{\rm dom}f, a vector gg is a subgradient of ff at xˉdomf\bar{x}\in{\rm dom}f if the following relation holdsFor a differentiable convex ff, the inequality holds with g=f(xˉ)g=\nabla f(\bar{x}).:

The subdifferential set of ff at x=xˉx=\bar{x}, denoted by f(xˉ)\partial f(\bar{x}), is the set of all subgradients of ff at x=xˉx=\bar{x}. Finally, we write a.s. for “almost surely”, and use Prob(Z)\hbox{Prob}(\mathcal{Z}) and E ⁣[Z]\mathsf{E}\!\left[Z\right] to denote the probability of an event Z\mathcal{Z} and the expectation of a random variable ZZ, respectively.

II Problem Formulation and Background

In this section, we begin by describing the problem and iterative scheme of interest (Section II-A). This is followed by Section II-B where we provide a short description on various adaptive schemes in the realm of stochastic approximation.

We consider the following stochastic optimization problem

The set XDX\subset\mathcal{D} is convex and closed. The function F(,ξ)F(\cdot,\xi) is convex on D\mathcal{D} for every ξΩ\xi\in\Omega, and the expected value E ⁣[F(x,ξ)]\mathsf{E}\!\left[F(x,\xi)\right] is finite for every xDx\in\mathcal{D}.

Under Assumption 1, the function ff is convex over XX and the following relation holds

First, we will consider problem (1) where ff is a differentiable function with Lipschitz gradients. Later, we will allow the function ff to be nondifferentiable and we will consider a local smoothing technique yielding a differentiable function that approximates ff over XX. For this reason, we start our discussion by focusing on a differentiable problem (1) and the following iterative algorithm:

Here, x0Xx_{0}\in X is a random initial point, γk>0\gamma_{k}>0 is a (deterministic) stepsize, and wkw_{k} is the random vector given by the difference between the sampled gradient xF(x,ξk)\nabla_{x}F(x,\xi_{k}) and its expectation E ⁣[xF(x,ξ)]\mathsf{E}\!\left[\nabla_{x}F(x,\xi)\right] evaluated at x=xkx=x_{k}. Throughout the paper, we assume that E ⁣[x02]<\mathsf{E}\!\left[\|x_{0}\|^{2}\right]<\infty.

We let Fk\mathcal{F}_{k} denote the history of the method up to time kk, i.e., Fk={x0,ξ0,ξ1,,ξk1}\mathcal{F}_{k}=\{x_{0},\xi_{0},\xi_{1},\ldots,\xi_{k-1}\} for k1k\geq 1 and F0={x0}\mathcal{F}_{0}=\{x_{0}\}. By Assumption 1 and relation (2), it follows that f(xk)=E ⁣[xF(xk,ξ)]\nabla f(x_{k})=\mathsf{E}\!\left[\nabla_{x}F(x_{k},\xi)\right] for a differentiable FF, implying that wkw_{k} has zero-mean, i.e.,

Next, we state some additional assumptions on the stochastic gradient error wkw_{k} and the stepsize γk\gamma_{k}.

The stepsize is such that γk>0\gamma_{k}>0 for all kk. Furthermore, the following hold:

k=0γk=\sum_{k=0}^{\infty}\gamma_{k}=\infty and k=0γk2<\sum_{k=0}^{\infty}\gamma_{k}^{2}<\infty.

The stochastic errors wkw_{k} satisfy k=0γk2E ⁣[wk2Fk]<\sum_{k=0}^{\infty}\gamma_{k}^{2}\mathsf{E}\!\left[\|w_{k}\|^{2}\mid\mathcal{F}_{k}\right]<\infty almost surely.

Assumption 2(b) is satisfied, for example, when k=0γk2<\sum_{k=0}^{\infty}\gamma_{k}^{2}<\infty and the error wkw_{k} is bounded almost surely, i.e., wkc\|w_{k}\|\leq c for all kk and some scalar cc almost surely.

We use the following Lemma in establishing the convergence of method (3) (see , page 50).

(Robbins-Siegmund) Let vk,v_{k}, uk,u_{k}, αk,\alpha_{k}, and βk\beta_{k} be nonnegative random variables, and let the following relations hold almost surely:

We also make use of the following result, which can be found in (see Lemma 11 in page 50).

Let {vk}\{v_{k}\} be a sequence of nonnegative random variables, where E ⁣[v0]<\mathsf{E}\!\left[v_{0}\right]<\infty, and let {αk}\{\alpha_{k}\} and {βk}\{\beta_{k}\} be deterministic scalar sequences such that:

Then, vk0v_{k}\rightarrow 0 almost surely, limkE ⁣[vk]=0\lim_{k\to\infty}\mathsf{E}\!\left[v_{k}\right]=0, and for any ϵ>0\epsilon>0,

In Sections III and IV, we examine adaptive steplength schemes for a strongly convex function ff whose gradients f\nabla f are Lipschitz continuous over XX with constant LL. XX is defined as

II-B Adaptive Stochastic Approximation Schemes

Robbins and Monro proposed the first stochastic approximation algorithm in 1951 while Kiefer and Wolfowitz proposed a variant of this scheme in which finite differences were employed to estimate the gradient. Asymptotic distributions of the Robbins-Monro scheme were first examined by Chung , leading to an asymptotic normality result in the one-dimensional regime while generalizations were subsequently studied by Sacks .

A potential challenge in developing efficient implementations of stochastic approximation implementations lies in choosing an appropriate steplength sequence. Kesten , in 1957, suggested a technique where the steplength sequence adapts to the observed data, which was further extended by Kushner and Gavin to the multi-dimensional regime, while its accelerations were studied in . Sacks proved that, under suitable conditions, a choice of the form a/ka/k (where kk is the iterate index) is optimal from the standpoint of minimizing the asymptotic variance. Yet, the challenge lies in estimating the “optimal” aa. Subsequently, Ventner in what is possibly amongst the first adaptive steplength SA schemes, considered sequences of the form ak/ka_{k}/k where aka_{k} is updated by leveraging past information. Notably, Chung also examined the asymptotic variance properties of SA when steplength choices of the form a/k1αa/k^{1-\alpha} with α<12\alpha<{1\over 2} are used. In related work on adaptive schemes, Lai and Robbins considered schemes of the form ak/ka_{k}/k where aka_{k} is a strongly consistent estimator of f(x)\nabla f(x) in a stochastic root-finding problem. One choice for obtaining aka_{k} is through the use of least-squares estimators. Multivariate generalizations of this analysis were suggested by Wei in 1987 and again, it was observed that the Jacobian of the vector function assumes relevance in constructing efficient steplength sequences.

An alternative to using a single sample was suggested by Spall and relied on obtaining gradient estimates through a simultaneous perturbation of all the parameters. An adaptive generalization of this scheme, proposed by the same author , employed an additional recursion to the standard projected gradient step that attempted to estimate the Jacobian in root finding problems or the Hessian in optimization problems. Accordingly, the modified update rule is of the form

where HkH_{k} is an estimate of the Hessian matrix of the objective. Clearly, this also falls under the regime of an adaptive steplength scheme. Related adaptive schemes may also be found in the work by Bhatnagar .

A final remark is in order regarding the key difference between our proposed schemes and past work. A majority of the adaptive schemes in the literature employ past information to update the steplength. One such avenue involves developing estimates of the Hessian which is subsequently used in scaling the gradient step appropriately. In the sections to appear, we consider two very different approaches that are linked by a crucial property: they rely on using algorithm and problem parameters, and not sample points, to develop adaptive steplength schemes.

II-C Smoothing Techniques

One of the goals of this paper is to address stochastic optimization problems with nonsmooth integrands. Here, we provide some background for accommodating nonsmoothness in optimization problems. In deterministic regimes, subgradient methods and their incremental variants have proved popular (see ), as have bundle methods , amongst others. One approach for managing nonsmoothness is through smoothing approaches. For instance, such avenues have allowed for the solution of variational inequalities and complementarity problems as well as mathematical programs with equilibrium constraints .

In this paper, we also adopt a smoothing technique which bears little similarity to such approaches. We adopt a framework that can be traced back to a class of averaged functions introduced by Steklov in 1907. A general definition of such an averaging over possibly discontinuous functions is provided next .

We pursue an alternative to solving a sequence of smoothed problems and obtain an approximate solution by solving a single smoothed problem with a fixed ϵ\epsilon akin to that employed by Lakshmanan and Farias . However, since we intend to leverage stochastic approximation schemes of the form described earlier in this paper, Lipschitz constants associated with the gradients are a requiem. In , the authors obtain Lipschitz constants assuming that the averaging is achieved through a normal distribution that requires the function be defined everywhere. Instead of “globally smoothing” the function, we employ a uniform distribution, referred to as “local smoothing.”

III A recursive steplength stochastic approximation scheme

In this section, we introduce an adaptive stochastic approximation scheme that overcomes certain challenges associated with implementing standard diminishing steplength schemes and relies on the use of a recursive rule for prescribing steplengths. We begin by examining the standard stochastic gradient method for problem (1) in Section III-A. In general, the convergence of this scheme is guaranteed under the requirement that k=1γk=\sum_{k=1}^{\infty}\gamma_{k}=\infty and k=0γk2<.\sum_{k=0}^{\infty}\gamma_{k}^{2}<\infty. A host of choices exists with one possible choice being γk=θ/k.\gamma_{k}=\theta/k. Yet, the choice of the appropriate θ\theta can have a significant impact on the performance of the algorithm. Motivated by the desire to minimize the “expected error,” we develop a recursive stochastic approximation algorithm (referred to as the RSA scheme) in which the steplength at a particular iteration is a function of the steplength at the previous iteration and some problem parameters. In Section III-B, we motivate and introduce such a scheme and proceed to develop the associated convergence theory in Section III-C.

We consider method (3) as applied to problem (1) where ff has Lipschitz gradients. The method generates a sequence of iterates that converge to an optimal solution almost-surely, as shown in the forthcoming proposition. This result is a straightforward extension of Theorem 1 in [16, Pg. 51] which pertains to an unconstrained problem.

Let Assumptions 1–2 hold, and let ff be differentiable over the set XX with Lipschitz gradients. Assume that the optimal set XX^{*} of problem (1) is nonempty. Then, the sequence {xk}\{x_{k}\} generated by (3) converges almost surely to some random point in XX^{*}.

By definition of the method and the nonexpansive property of the projection operation, we obtain for any xXx^{*}\in X^{*} and k0k\geq 0,

By the convexity of ff and the gradient inequality, we have

Taking the conditional expectation given Fk\mathcal{F}_{k}, using E ⁣[wkFk]=0\mathsf{E}\!\left[w_{k}\mid\mathcal{F}_{k}\right]=0 (see Eq. (4)) and the Lipschitzian property of the gradient, we have

Under Assumption 2, the conditions of Lemma 1 are satisfied. Therefore, almost surely, the sequence {xk+1x}\{\|x_{k+1}-x^{*}\|\} is convergent for any xXx^{*}\in X^{*} and k=0γk(f(xk)f)<\sum_{k=0}^{\infty}\gamma_{k}(f(x_{k})-f^{*})<\infty. The former relation implies that {xk}\{x_{k}\} is bounded a.s., while the latter implies lim infkf(xk)=f\liminf_{k\to\infty}f(x_{k})=f^{*} a.s. in view of the condition k=0γk=\sum_{k=0}^{\infty}\gamma_{k}=\infty. Since the set XX is closed, all accumulation points of {xk}\{x_{k}\} lie in XX. Furthermore, since f(xk)ff(x_{k})\to f^{*} along a subsequence a.s., by continuity of ff it follows that {xk}\{x_{k}\} has a subsequence converging to some random point in XX^{*} a.s. Moreover, since {xk+1x}\{\|x_{k+1}-x^{*}\|\} is convergent for any xXx^{*}\in X^{*} a.s., the entire sequence {xk}\{x_{k}\} converges to some random point in XX^{*} a.s. ∎

Under the Lipschitz continuity of the gradient and the strong convexity of the objective, an expected error bound may also be provided for the method. During the development of the error bound, the following intermediate result assumes relevance.

Let Assumption 1 hold, and let ff be differentiable over the set XX with Lipschitz gradients with constant L>0L>0. Also, assume that the optimal set XX^{*} of problem (1) is nonempty. Let the sequence {xk}\{x_{k}\} be generated by algorithm (3) with any (deterministic) stepsize γk>0\gamma_{k}>0. Then, for any xXx^{*}\in X^{*} and any k0k\geq 0, the following holds almost surely:

By the first-order optimality conditions, a vector xx^{*} is optimal for the problem if and only if xx^{*} satisfies

By the definition of the method and the nonexpansive property of the projection operation, we obtain for all k0k\geq 0,

Taking the expectation conditioned on the past, and using E ⁣[wkFk]=0\mathsf{E}\!\left[w_{k}\mid\mathcal{F}_{k}\right]=0 (cf. Eq. (4)), we have

The Lipschitz gradient property for a convex function is equivalent to co-coercivity of the gradient map with constant 1/L1/L, (see [16, Pg. 24, Lemma 2]), i.e., for all x,yXx,y\in X,

Therefore, for any xXx^{*}\in X^{*} and any k0k\geq 0,

In what follows, we will often use a stronger version of Assumption 2(b), given as follows.

The errors wkw_{k} are such that for some ν>0\nu>0,

Next, we provide an error bound for algorithm (3) under the assumption that f(x)f(x) is a strongly convex function with Lipschitz gradients. Note that requiring that f(x)f(x) is strongly convex over a set KK follows if F(x,ξ(ω))F(x,\xi(\omega)) is a strongly convex function for ωΩˉ\omega\in\bar{\Omega}, where Ωˉ\bar{\Omega} is a set of positive measure defined as

Less formally, we merely require that F(.,ξ)F(.,\xi) is a strongly convex function with positive, but arbitrarily small, probability to ensure that f(x)f(x) is strongly convex over KK (see ).

Let Assumptions 1–2 hold. Also, let ff be differentiable over the set XX with Lipschitz gradients with constant L>0L>0 and strongly convex with constant η>0\eta>0. Then, the sequence {xk}\{x_{k}\} generated by algorithm (3) converges almost surely to the unique optimal solution of problem (1). Furthermore, if the stepsize satisfies 0<γk2L0<\gamma_{k}\leq\frac{2}{L} for all k0k\geq 0, we then have:

The following relation holds almost surely:

If Assumption 2(b) is replaced with Assumption 3, then the following relation holds almost surely:

Moreover, limkE ⁣[xkx2]=0\lim_{k\to\infty}\mathsf{E}\!\left[\|x_{k}-x^{*}\|^{2}\right]=0, and for every ϵ>0\epsilon>0,

The existence and uniqueness of the optimal solution of problem (1) is guaranteed by the strong convexity of f(x)f(x). The convergence of the method follows by Proposition 1. To establish the relation in part (a) for the expected value E ⁣[xk+1x2]\mathsf{E}\!\left[\|x_{k+1}-x^{*}\|^{2}\right], we use Lemma 3, which implies for the optimal xx^{*} and all k0k\geq 0,

By the strong convexity of f(x)f(x), we have (xy)T(f(x)f(y))ηxy2(x-y)^{T}(\nabla f(x)-\nabla f(y))\geq\eta\|x-y\|^{2} for all x,yXx,y\in X, which when combined with the preceding relation implies for all k0k\geq 0,

The relation in part (b), follows from inequality (6) by using Assumption 3 and by taking the total expectation. To show the other results in part (b), we apply Lemma 2. For this, we need to verify that all the conditions of Lemma 2 hold. Since 0<γk2L0<\gamma_{k}\leq\frac{2}{L}, it follows ηγk(2γkL)0\eta\gamma_{k}(2-\gamma_{k}L)\geq 0. Also, in view of ηL\eta\leq L, we have ηγk(2γkL)1\eta\gamma_{k}(2-\gamma_{k}L)\leq 1. Obviously, ν2γk20\nu^{2}\gamma_{k}^{2}\geq 0 for all kk. Since Assumption 2(a) holds, we have k=0ηγk(2γkL)=\sum_{k=0}^{\infty}\eta\gamma_{k}(2-\gamma_{k}L)=\infty and k=0ηγk2<\sum_{k=0}^{\infty}\eta\gamma_{k}^{2}<\infty. Furthermore, since γk0\gamma_{k}\to 0, we have

Hence, the conditions of Lemma 2 hold and the stated results follow. ∎

Lemma 4 will be employed in developing our adaptive stepsize schemes. Before proceeding, we make the following comment regarding the lemma.

Remark 1: The result in part (a) of Lemma 4 is similar to a result in , which was derived by requiring only the strong convexity of the function ff. Here, we make the additional assumption that the gradients are Lipschitz continuous and this assumption gains relevance when we employ local random smoothing in Section V. Furthermore, our result depends on the expectation of gradient errors, E ⁣[wk2]\mathsf{E}\!\left[\|w_{k}\|^{2}\right], with wkw_{k} defined in (3). Note that, in contrast, the result in depends on the expectation of the subgradient norms, E ⁣[G(x,ξ)2]\mathsf{E}\!\left[\|G(x,\xi)\|^{2}\right], where G(x,ξ)xF(x,ξ)G(x,\xi)\in\partial_{x}F(x,\xi).

III-B A recursive steplength scheme

A challenge associated with the implementation of diminishing steplength schemes lies in determining an appropriate sequence {γk}\{\gamma_{k}\}. The key result of this section is the motivation and introduction of a scheme that adaptively optimizes the steplength from iteration to iteration. Our adaptive scheme relies on the minimization of a suitably defined error function at each step. We start with the relation in part (b) of Lemma 4:

When the stepsize is further restricted so that 0<γk1L0<\gamma_{k}\leq\frac{1}{L}, we have

Thus, for 0<γk1L0<\gamma_{k}\leq\frac{1}{L}, inequality (7) yields

We now use relation (8) to develop an adaptive stepsize procedure. Loosely speaking for the moment, let us view the quantity E ⁣[xk+1x2]\mathsf{E}\!\left[\|x_{k+1}-x^{*}\|^{2}\right] as an error ek+1e_{k+1} of the method arising from the use of the stepsize values γ0,γ1,,γk\gamma_{0},\gamma_{1},\ldots,\gamma_{k}. Also, consider the worst case error which is the case when (8) holds with equality. Thus, in the worst case, the error satisfies the following recursive relation:

Then, it seems natural to investigate if the stepsizes γ0,γ1,γk\gamma_{0},\gamma_{1}\ldots,\gamma_{k} can be selected so as to minimize the error ek+1e_{k+1}. It turns out that this can indeed be achieved and minimizing the error ek+2e_{k+2} at the next iteration can also be done by selecting γk+1\gamma_{k+1} as a function of only the most recent stepsize γk\gamma_{k}. To formalize the above discussion, we define real-valued error functions ek(γ0,,γk1)e_{k}(\gamma_{0},\ldots,\gamma_{k-1}) as follows:

where e0e_{0} is a positive scalar, η\eta is the strong convexity parameter and ν2\nu^{2} is the upper bound for the second moments of the error norms wk\|w_{k}\|.

In what follows, we consider the sequence {γk}\{\gamma^{*}_{k}\} given by

We often abbreviate ek(γ0,,γk1)e_{k}(\gamma_{0},\ldots,\gamma_{k-1}) by eke_{k} whenever this is unambiguous. We show that the stepsizes γi\gamma_{i}, i=0,,k1,i=0,\ldots,k-1, minimize the errors eke_{k} over an (0,1L]k(0,\frac{1}{L}]^{k}, where LL is the Lipschitz constant. In particular, we have the following result.

Let ek(γ0,,γk1)e_{k}(\gamma_{0},\ldots,\gamma_{k-1}) be defined as in (9), where e0>0e_{0}>0 is such that η2ν2e01L\frac{\eta}{2\nu^{2}}\,e_{0}\leq\frac{1}{L}, with LL being the Lipschitz constant for the gradients of ff. Let the sequence {γk}\{\gamma^{*}_{k}\} be given by (10)–(11). Then, the following hold:

For each k1k\geq 1, the vector (γ0,γ1,,γk1)(\gamma_{0}^{*},\gamma_{1}^{*},\ldots,\gamma_{k-1}^{*}) is the minimizer of the function ek(γ0,,γk1)e_{k}(\gamma_{0},\ldots,\gamma_{k-1}) over the set

(a) We use induction on kk to prove our result. Note that the result holds trivially for k=0k=0 from (10). Next, assume that we have ek(γ0,,γk1)=2ν2ηγke_{k}(\gamma_{0}^{*},\ldots,\gamma_{k-1}^{*})=\frac{2\nu^{2}}{\eta}\,\gamma_{k}^{*} for some kk, and consider the case for k+1k+1. By the definition of the error eke_{k} in (9), we have

where the second equality follows by the inductive hypothesis. Hence,

where the last equality follows by the definition of γk+1\gamma^{*}_{k+1} in (11).

(b) We now show that (γ0,γ1,,γk1)(\gamma^{*}_{0},\gamma_{1}^{*},\ldots,\gamma_{k-1}^{*}) minimizes the error eke_{k} for all k1k\geq 1. We again use mathematical induction on kk. By the definition of the error e1e_{1} and the relation e1(γ0)=2ν2ηγ1e_{1}(\gamma_{0}^{*})=\frac{2\nu^{2}}{\eta}\gamma_{1}^{*} shown in part (a), we have

Using γ1=γ0(1η2γ0)\gamma_{1}^{*}=\gamma_{0}^{*}\left(1-\frac{\eta}{2}\gamma_{0}^{*}\right), we obtain

where the last equality follows from e0=2ν2ηγ0.e_{0}=\frac{2\nu^{2}}{\eta}\,\gamma_{0}^{*}. Thus, we have

Under the inductive hypothesis, we have ekeke_{k}\geq e_{k}^{*}. Using this, the relation ek=2ν2ηγke_{k}^{*}=\frac{2\nu^{2}}{\eta}\gamma_{k}^{*} of part (a) and the definition of γk+1\gamma_{k+1}^{*}, we obtain

Now, we proceed by induction on kk. For k=1k=1, we have

From the definition of eke_{k} in (9) we can see that

Finally, we consider the partial derivative of ek+1e_{k+1} with respect to γk\gamma_{k}, for which we have

III-C Convergence theory

We next show that the proposed RSA approximation scheme discussed in Section III-B leads to a convergent algorithm. We prove this in a more general setting for a stepsize with a form similar to that seen in constructing the optimal choice. The following proposition holds for any stepsize of a form similar to the optimal scheme of (11).

Let Assumptions 1 and 3 hold. Let the function ff be differentiable over the set XX with Lipschitz gradients and the optimal solution set of problem (1) be nonempty. Assume that the stepsize sequence {γk}\{\gamma_{k}\} is generated by the following self-adaptive scheme:

where c>0c>0 is a scalar and the initial stepsize is such that 0<γ0<1c0<\gamma_{0}<\frac{1}{c}. Then, the sequence {xk}\{x_{k}\} generated by algorithm (3) converges almost surely to a random point that belongs to the optimal set.

We employ Proposition 1. To apply this proposition, it suffices to verify that Assumption 2 holds. First we show that i=0γi=\sum_{i=0}^{\infty}{\gamma_{i}}=\infty. From (14) we obtain

By dividing both sides by (i=1kγi)\left(\prod_{i=1}^{k}\gamma_{i}\right), it follows that

Since γ0(0,1c)\gamma_{0}\in(0,\frac{1}{c}), from (14) it follows that {γk}\{\gamma_{k}\} is positive nonincreasing sequence. Therefore, the limit limkγk\lim_{k\to\infty}\gamma_{k} exists and it is less than 1c\frac{1}{c}. Thus, by taking the limit in (14), we obtain limkγk=0\lim_{k\to\infty}\gamma_{k}=0. Then, by taking limits in (15), we further obtain

To arrive at a contradiction suppose that i=0γi<\sum_{i=0}^{\infty}{\gamma_{i}}<\infty. Then, there is an ϵ(0,1)\epsilon\in(0,1) such that for jj sufficiently large, we have

Since i=jk(1cγi)1ci=jkγi\prod_{i=j}^{k}(1-c\gamma_{i})\geq 1-c\sum_{i=j}^{k}\gamma_{i} for all j<kj<k, by letting kk\rightarrow\infty, we obtain for all jj sufficiently large,

This, however, contradicts the fact limki=0k(1cγi)=0.\lim_{k\rightarrow\infty}\prod_{i=0}^{k}(1-c\gamma_{i})=0. Therefore, we conclude that i=0γi=\sum_{i=0}^{\infty}{\gamma_{i}}=\infty.

Now we show that i=0γi2<\sum_{i=0}^{\infty}{\gamma_{i}}^{2}<\infty. From (14) we have

Summing the preceding relations, we obtain

By taking limits and recalling that limkγk=0\lim_{k\to\infty}\gamma_{k}=0, we obtain

Assumption 3 and relation i=0γi2<\sum_{i=0}^{\infty}{\gamma_{i}}^{2}<\infty yield k=0γk2E ⁣[wk2Fk]<\sum_{k=0}^{\infty}\gamma_{k}^{2}\mathsf{E}\!\left[\|w_{k}\|^{2}\mid\mathcal{F}_{k}\right]<\infty. Hence, Assumption 2 holds. ∎

Note that Proposition 3 applies to algorithm (3) with the stepsize sequence {γk}\{\gamma_{k}^{*}\} generated by the recursive scheme (11). Thus, we immediately have the following corollary.

Let Assumptions 1 and 3 hold. Let the function ff be differentiable over the set XX with Lipschitz gradients with constant L>0L>0 and strongly convex with parameter η>0\eta>0. Let the stepsize sequence {γk}\{\gamma_{k}^{*}\} be generated by the recursive scheme (11) with e0=E ⁣[x0x2]e_{0}=\mathsf{E}\!\left[\|x_{0}-x^{*}\|^{2}\right]. If η2ν2E ⁣[x0x2]<1L,\frac{\eta}{2\nu^{2}}\mathsf{E}\!\left[\|x_{0}-x^{*}\|^{2}\right]<\frac{1}{L}, then the sequence {xk}\{x_{k}\} generated by algorithm (3) converges almost surely to the unique optimal solution xx^{*} of problem (1).

The existence and uniqueness of the optimal solution follows by the strong convexity assumption. Almost sure convergence follows by Proposition 3. ∎

Note that when the set XX is bounded, in Proposition 1 we may use e0=maxx,yXxy2e_{0}=\max_{x,y\in X}\|x-y\|^{2} and the results will hold as long as η2ν2maxx,yXxy2<1L.\frac{\eta}{2\nu^{2}}\max_{x,y\in X}\|x-y\|^{2}<\frac{1}{L}.

In the following, we discuss a recursive stepsize for algorithm (3) as applied to a nonsmooth but strongly convex function f(x)=E ⁣[F(x,ξ)]f(x)=\mathsf{E}\!\left[F(x,\xi)\right]. Let G(x,ξ)G(x,\xi) be a subgradient vector of F(x,ξ)F(x,\xi) with respect to xx, i.e., G(x,ξ)F(x,ξ)G(x,\xi)\in\partial F(x,\xi). Assume that there is a positive number MM such that

We have the following convergence result, which obviously also holds for smooth problems.

Consider problem (1) and let Assumption 1 hold. Also, let the set XX be compact and the function ff be strongly convex over XX with constant η\eta. Assume that there is a scalar M>0M>0 such that E ⁣[G(x,ξ)2]M2\mathsf{E}\!\left[\|G(x,\xi)\|^{2}\right]\leq M^{2} for all xXx\in X. Consider the following algorithm:

where x0Xx_{0}\in X is a random initial point independent of {ξk}\{\xi_{k}\} and γk\gamma_{k} is a (deterministic) stepsize. Consider the self-adaptive stepsize sequence {γk}\{\gamma_{k}^{*}\} defined by

where D=maxx,yXxyD=\max_{x,y\in X}\|x-y\|. Assuming that ηD2M2<12\frac{\eta D^{2}}{M^{2}}<\frac{1}{2}, we have

The proof is based on verifying that, for the algorithm in (16), Proposition 2 holds, where 2ν22\nu^{2} is replaced by M2M^{2} and e0=D2e_{0}=D^{2}. Then, the rest of the proof is similar to that of Proposition 1. ∎

IV A cascading steplength stochastic approximation scheme

In Section III, we presented a stochastic approximation scheme in which the sequence of steplengths is determined via a recursion that relies on optimizing the error estimates. A key benefit of such a recursion is that the steplength choice is not left to the user. In this section, we introduce an alternate avenue for specifying steplengths that also considers a diminishing steplength framework but uses a markedly different approach for determining the steplength. In particular, the scheme relies on reducing the steplength at a set of epochs while the steplengths are maintained as constant between these epochs. The details of this stochastic approximation scheme (called the cascading steplength stochastic approximation (CSA) scheme) are presented in Section IV-A while convergence theory is provided in Section IV-B.

Our technique is based on the properties derived from problems possessing strongly convex objectives. Specifically, we obtain the following result from the inequality in Lemma 4 when the stepsize is maintained as constant.

Let Assumptions 1 and 3 hold. Also, let ff be differentiable over the set XX with Lipschitz gradients with constant L>0L>0 and strongly convex with constant η>0\eta>0. Let the sequence {xk}\{x_{k}\} be generated by (3) with constant stepsize γk=γ\gamma_{k}=\gamma for all k0k\geq 0, where γ(0,2L)\gamma\in(0,\frac{2}{L}). Then, we have

where q(γ)=1ηγ(2γL)q(\gamma)=1-\eta\gamma(2-\gamma L) and xx^{*} is the optimal solution of problem (1).

Follows from the inequality in part (b) of Lemma 4. ∎

From inequality (17), we obtain the following relation

where the expected distance E ⁣[xkx2]\mathsf{E}\!\left[\|x_{k}-x^{*}\|^{2}\right] is bounded by the sum of two error terms:

Transient error: The transient error, given by q(γ)kE ⁣[x0x2]q(\gamma)^{k}\mathsf{E}\!\left[\|x_{0}-x^{*}\|^{2}\right], decays to zero as k.k\to\infty. In effect, the contractive nature of this error, as arising from q(γ)<1q(\gamma)<1, ensures that the transient error can be reduced to an arbitrarily small level.

Persistent error: The persistent error, given by γ2ν21q(γ){\frac{\gamma^{2}\nu^{2}}{1-q(\gamma)}}, is invariant to increasing the number of iterations, denoted by kk. Its reduction, as we proceed to show, necessitates reducing γ.\gamma.

Our cascading steplength scheme basically requires specifying a rule for deciding at what iteration to decrease the steplength and to what extent it should be decreased. The iterations during which the stepsize is kept fixed is referred to as a constant steplength regime or just a regime. Given the two error terms, our scheme can be loosely represented as an infinite sequence of regimes of finite duration. In fact, we proceed to show that the duration of the regimes is an increasing function. Entering a new regime is marked by a reduction in the steplength. In fact, since a finite reduction in the steplength occurs between consecutive regimes, the steplength sequence would naturally converge to zero if there is an infinite number of the regimes. Suppose one is at the beginning of the ttth regime, where the steplength is γt\gamma_{t} and the current iteration number is KK. The steplength γt\gamma_{t} is maintained constant during regime tt. Furthermore, suppose that at the beginning of the ttth regime, the transient error is greater than the persistent error for γt\gamma_{t}, i.e., E ⁣[xKx2]>γt2ν21q(γt)\mathsf{E}\!\left[\|x_{K}-x^{*}\|^{2}\right]>\frac{\gamma_{t}^{2}\nu^{2}}{1-q(\gamma_{t})}. Since 0<q(γt)<10<q(\gamma_{t})<1, E ⁣[xKx2]\mathsf{E}\!\left[\|x_{K}-x^{*}\|^{2}\right] decreases when multiplied with q(γt)kq(\gamma_{t})^{k} for k0k\geq 0. The larger kk, the smaller q(γt)kE ⁣[xKx2]q(\gamma_{t})^{k}\mathsf{E}\!\left[\|x_{K}-x^{*}\|^{2}\right], so there exists k>0k>0 for which q(γt)kE ⁣[xKx2]q(\gamma_{t})^{k}\mathsf{E}\!\left[\|x_{K}-x^{*}\|^{2}\right] will drop and remain below the persistent error γt2ν21q(γt)\frac{\gamma_{t}^{2}\nu^{2}}{1-q(\gamma_{t})}. We let KtK_{t} be the index kk just before this drop takes place, i.e., KtK_{t} is the largest kk for which the following inequality holds:

Therefore, KtK_{t} specifies the duration of regime tt, during which the stepsize is fixed at γt\gamma_{t}.

The next question is how one should go about reducing the persistent error. We observe through the next result that by reducing γt\gamma_{t}, the persistent error does indeed reduce.

Consider the persistent error given by P(γ)=γ2ν21q(γ)P(\gamma)=\frac{\gamma^{2}\nu^{2}}{1-q(\gamma)}, where q(γ)=1ηγ(2γL)q(\gamma)=1-\eta\gamma(2-\gamma L) and γ(0,2L)\gamma\in(0,\frac{2}{L}). Then, this error is an increasing function of γ.\gamma.

By using q(γ)=1ηγ(2γL)q(\gamma)=1-\eta\gamma(2-\gamma L), for the persistent error we obtain P(γ)=γν2η(2γL)P(\gamma)=\frac{\gamma\nu^{2}}{\eta(2-\gamma L)}. Therefore, the derivative of the persistent error with respect to γ\gamma is given by P(γ)=ν2η2(2γL)2>0P^{\prime}(\gamma)=\frac{\nu^{2}}{\eta}\,\frac{2}{(2-\gamma L)^{2}}>0 for all γ2L\gamma\neq\frac{2}{L}. ∎

Therefore, when γt\gamma_{t} is reduced to γt+1\gamma_{t+1}, the persistent error does indeed reduce. This drop in steplength is referred to as the cascading step and marks the commencement of a new regime. As earlier, in this regime, the persistent error will be smaller than the transient error and the process of determining Kt+1K_{t+1} can be repeated. Therefore, we may view the scheme as a diminishing steplength scheme where the steplength is reduced at a sequence of time epochs and between these epochs, it is maintained constant.

We now proceed to describe the scheme more formally. It can be viewed as having two stages, of which the second stage repeats infinitely often in a consecutive fashion. The first of these is an initialization phase. We assume throughout that the constraint set XX is bounded, so that E ⁣[x0x2]D2\mathsf{E}\!\left[\|x_{0}-x^{*}\|^{2}\right]\leq D^{2} with D=maxx,yXxyD=\max_{x,y\in X}\|x-y\|. Next, we describe each of the stages in cascading scheme in some detail.

Cascading steplength stochastic approximation (CSA) scheme:

Finally, we exit this phase by defining Kˉ1=0\bar{K}_{-1}=0, setting t=0t=0, and going to Phase IIt.

Constant steplength phase (Phase IIt): Define Kˉt=j=0tKt\bar{K}_{t}=\sum_{j=0}^{t}K_{t}. For the iteration indices kk with k{Kˉt1+1,,Kˉt}k\in\{\bar{K}_{t-1}+1,\ldots,\bar{K}_{t}\}, the stepsize is kept constant and equal to γt\gamma_{t}, i.e.,

Then, we increase tt by setting t=t+1t=t+1, reduce the stepsize by letting γtγt1θ\gamma_{t}\triangleq\gamma_{t-1}\theta, compute qt=q(γt)q_{t}=q(\gamma_{t}) and determine the integer KtK_{t} as follows:

We then repeat phase IIt until the number kk of iterations (i.e., gradient steps) exceeds a pre-specified threshold, in case of which the algorithm terminates.

We provide a graphical representation of these phases in Figure 1 where the circles around xx^{*} represent thresholds beyond which the transient error is less than the persistent error. For instance, in Figure 1 (plot to the left), phase II0 requires K0K_{0} steps to reach the first circle. Once, the steplength is reduced by a factor θ\theta, the phase II1 commences and requires K1K_{1} steps to reach an analogous error threshold where the transient error is equal to the persistent error; this is illustrated in Figure 1 (plot in the center). Finally, phase II2 requires K2K_{2} to reach an even smaller level of persistent error, as depicted in Figure 1 (plot to the right). Note that whenever the steplength is reduced, the persistent error is immediately reduced (Lemma 5). Thus, the stepsize is essentially a piecewise constant decreasing function of the iteration index kk.

The next result establishes the correctness of the cascading scheme by showing that KtK_{t} in Phase IIt is finite, so the scheme is well defined.

Let Assumptions 1 and 3 hold. Also, let ff be differentiable over the set XX with Lipschitz gradients with constant L>0L>0 and strongly convex with constant η>0\eta>0. Assume that the set XX is compact and let D=maxx,yXxyD=\max_{x,y\in X}\|x-y\|. Then, KtK_{t} is finite for all t0t\geq 0.

We use induction on tt to show that KtK_{t} is well defined and for all t0t\geq 0,

First note that, since γ0(0,2L)\gamma_{0}\in(0,\frac{2}{L}) and the steplength γk\gamma_{k} is non-increasing in kk, we have q(γt)(0,1)q(\gamma_{t})\in(0,1) for all t0t\geq 0.

For t=0t=0, from Proposition 5 and the boundedness of the set XX we have

where we use the fact Kˉ0=K0\bar{K}_{0}=K_{0} (see Phase IIt for t=0t=0).

Now assume that KtK_{t} is well defined and relation (22) holds for tt. We next show that Kt+1K_{t+1} is also well defined and relation (22) holds for t+1t+1. Note that the steplength γk=γt+1\gamma_{k}=\gamma_{t+1} is used for kKˉtk\geq\bar{K}_{t}. From Proposition 5 where we replace x0x_{0} with xKˉtx_{\bar{K}_{t}}, by replacing γ\gamma by γt+1\gamma_{t+1} letting qt+1=q(γt+1)q_{t+1}=q(\gamma_{t+1}), we have for kKˉtk\geq\bar{K}_{t},

By inductive hypothesis relation (22) holds, so it follows

Consequently, Kt+1K_{t+1} is defined as the largest positive integer kk for which term 1 is strictly greater than term 2, i.e.,

(see the definition of KtK_{t} in (21)). Noting that Kˉt+1=Kˉt+Kt+1\bar{K}_{t+1}=\bar{K}_{t}+K_{t+1} (see Phase IIt) and qt+1k2t+1(j=0tqjKj)D2>γt+12ν21qt+1q_{t+1}^{k}2^{t+1}\left(\prod_{j=0}^{t}q_{j}^{K_{j}}\right)D^{2}>\frac{\gamma_{t+1}^{2}\nu^{2}}{1-q_{t+1}} for k=Kˉt+1,,Kˉt+1k=\bar{K}_{t}+1,\ldots,\bar{K}_{t+1}, from (24) with k=Kˉt+1k=\bar{K}_{t+1}, we obtain

thus showing relation (22) for t+1t+1 and completing the proof. ∎

The transient and persistent error trajectories are illustrated in in Figure 2 for a problem discussed later in Section VI-A1. In Figure 2a, the transient and persistent terms of the error are plotted. The persistent error, as expected, is a piecewise constant decreasing function of the iteration count with the jumps occurring whenever the steplengths are reduced. The transient error is a plot of qtk2tj=0t1qjKjD2q_{t}^{k}2^{t}\prod_{j=0}^{t-1}q_{j}^{K_{j}}D^{2} with respect to kk. This function is a decreasing function when k{Kˉt1,,Kˉt1}.k\in\{\bar{K}_{t-1},\ldots,\bar{K}_{t}-1\}. As soon as k=Kˉtk=\bar{K}_{t}, in the transient error the factor 2t2^{t} is replaced with 2t+12^{t+1}, leading to the increase in transient error at that juncture. The total error, which is the summation of two terms, is showed in Figure 2b.

Remark on choice of θ\theta: Recall that θ\theta specifies the rate at which the steplength is dropped over consecutive steps in the cascading scheme. It can be readily observed from the bounds derived on KtK_{t} that if θ1\theta\to 1, then Kt0K_{t}\to 0 thus implying that the steplength is kept constant for a very short period. This is intuitive since a conservative drop in steplengths would imply that these drops have to occur more frequently to ensure that the sequence is driven to zero. Conversely, if θ0\theta\to 0, then KtK_{t} can grow to be quite large.

IV-B Global convergence theory

In this section, we prove that algorithm (3) using the cascading steplength scheme is indeed convergent to the optimal solution of problem (1).

Let q(γ)=12ηγ+ηLγ2q(\gamma)=1-2\eta\gamma+\eta L\gamma^{2} and let η<L\eta<L. Then, we have

Let r(γ)=ln(q(γ))γr(\gamma)=\frac{-\ln(q(\gamma))}{\gamma}. Note that the function q(γ)=12ηγ+ηLγ2q(\gamma)=1-2\eta\gamma+\eta L\gamma^{2} is nonnegative for all γ\gamma since 0<ηL0<\eta\leq L. Furthermore q(γ)<1q(\gamma)<1 for γ<2L\gamma<\frac{2}{L}. Thus, r(γ)>0r(\gamma)>0 for 0<γ<2L0<\gamma<\frac{2}{L}. We next show that r(γ)r(\gamma) is bounded from above as stated. To show that the sequence is bounded, we employ the Taylor expansion of ln(q(γ))\ln(q(\gamma)). First, we write

Noting that β(γ)=1q(γ)(0,1)\beta(\gamma)=1-q(\gamma)\in(0,1), we then use the fact ln(1x)=k=1xkk\ln(1-x)=-\sum_{k=1}^{\infty}\frac{x^{k}}{k} for x<1|x|<1, and obtain

Using β(γ)2ηγ\beta(\gamma)\leq 2\eta\gamma, we further obtain

The relation for the limit is obtained by applying L’Hôpital’s rule, as follows:

Let Assumptions 1 and 3 hold. Also, let ff be differentiable over the set XX with Lipschitz gradients with constant L>0L>0 and strongly convex with constant η>0\eta>0, where L>ηL>\eta. Assume that the set XX is compact and let D=maxx,yXxyD=\max_{x,y\in X}\|x-y\|. Let the sequence {xk}\{x_{k}\} be generated by algorithm (3) and cascading steplength scheme with γ0(0,2L)\gamma_{0}\in\left(0,\frac{2}{L}\right) and θ(0,1)\theta\in(0,1). Then, {xk}\{x_{k}\} converges almost surely to the unique optimal solution of problem (1).

The result will follow from Proposition 1 provided we verify that Assumption 2 holds, i.e., k=0γk=\sum_{k=0}^{\infty}\gamma_{k}=\infty and k=0γk2<\sum_{k=0}^{\infty}\gamma_{k}^{2}<\infty. According to Phase IIt of the cascading scheme, we have γk=γt\gamma_{k}=\gamma_{t} for k=Kˉt1+1,,Kˉtk=\bar{K}_{t-1}+1,\ldots,\bar{K}_{t} with γt=θtγ0\gamma_{t}=\theta^{t}\gamma_{0} and Kˉt=Kˉt1+Kt\bar{K}_{t}=\bar{K}_{t-1}+K_{t}. Therefore

From the definition of KtK_{t} in (21) we have

Relation (26) and the fact γt=θtγ0\gamma_{t}=\theta^{t}\gamma_{0} (see Phase IIt of the cascading scheme) yield

Therefore, by multiplying and dividing by γ0θj\gamma_{0}\theta^{j}, we obtain

Note that qj=12ηγ0θj+ηL(γ0θj)2q_{j}=1-2\eta\gamma_{0}\theta^{j}+\eta L(\gamma_{0}\theta^{j})^{2} with γ0(0,2/l)\gamma_{0}\in(0,2/l) and θ(0,1)\theta\in(0,1). Thus, by Lemma 6 we have ln(qj)γ0θj2ηL/(Lη)\frac{-\ln(q_{j})}{\gamma_{0}\theta^{j}}\leq 2\eta L/(L-\eta), implying

Taking limits on both sides, we have that

The limit on the right can be simplified by substituting qt=12ηγ0θt+ηLγ02θ2tq_{t}=1-2\eta\gamma_{0}\theta^{t}+\eta L\gamma_{0}^{2}\theta^{2t}, leading to

implying that j=0Kjθj=.\sum_{j=0}^{\infty}K_{j}\theta^{j}=\infty.

It remains to show that t=0Ktθ2t<.\sum_{t=0}^{\infty}{K_{t}\theta^{2t}}<\infty. From (25) and the fact qj(0,1)q_{j}\in(0,1) for all jj, we have that

This allows for obtaining an upper bound on KtK_{t}, given by

The desired result will follow by the Cauchy root test, if we show that

By noting that (Ktθ2t)1/t=θ2(Kt)1/t(K_{t}\theta^{2t})^{1/t}=\theta^{2}(K_{t})^{1/t}, it suffices to use the upper bound on KtK_{t} in (27). We proceed to analyze this bound, for which by letting β(γ)=1q(γ)\beta(\gamma)=1-q(\gamma) and recalling that qt=q(γt)q_{t}=q(\gamma_{t}) we have

Noting that β(γ)(0,1)\beta(\gamma)\in(0,1) for all γ\gamma when η<L\eta<L, we have ln(β(γ))<0\ln(\beta(\gamma))<0, implying

Since β(γ)(0,1)\beta(\gamma)\in(0,1), the denominator can be expanded in Taylor series as follows:

Furthermore, since β(γt)=ηγt(2Lγt)\beta(\gamma_{t})=\eta\gamma_{t}(2-L\gamma_{t}) and γt=γ0θt\gamma_{t}=\gamma_{0}\theta^{t} with θ(0,1)\theta\in(0,1), we have γ0θt1\gamma_{0}\theta^{t}\leq 1 for tt large enough, implying β(γt)ηγ0θt\beta(\gamma_{t})\geq\eta\gamma_{0}\theta^{t}. Thus,

By combining (28) and (29), we obtain for tt large enough,

By recalling that limtt1/t=1\lim_{t\to\infty}{t^{1/t}}=1 and limtc1/t=1\lim_{t\to\infty}{c^{1/t}}=1 for any c>0c>0, it follows that

We next examine the limit on the right hand side. Letting a=ln(θ/2)a=-\ln(\theta/2) and b=ln(γ02ν2D2)b=-\ln\left(\frac{\gamma_{0}^{2}\nu^{2}}{D^{2}}\right), we can write

Therefore, limtKt1/t1θ,\lim_{t\to\infty}K_{t}^{1/t}\leq\frac{1}{\theta}, implying that

As a consequence, the Cauchy-root test is satisfied and t=0Ktθ2t<.\sum_{t=0}^{\infty}K_{t}\theta^{2t}<\infty.

V Addressing nondifferentiability through Local Randomized Smoothing

In this section, we develop a smoothing approach for solving stochastic optimization problem with nonsmooth integrands. In Section V-A, given a nondifferentiable function f(x)f(x), we introduce a smooth approximation for f(x)f(x), denoted by f^(x){\hat{f}}(x) by using local random perturbations. In Section V-B, we derive Lipschitz constants for the gradients associated with this smooth approximation when the smoothing is introduced via a uniform distribution. Finally, in Section V-C, the convergence theory of stochastic approximation schemes is examined in this modified regime.

We let ff be nondifferentiable and consider its approximation f^\hat{f}, defined by

We discuss our local smoothing technique under the assumption that the function ff has uniformly bounded subgradients over the set XϵX_{\epsilon}, given as follows.

The subgradients of ff over XϵX_{\epsilon} are uniformly bounded, i.e., there is a scalar C>0C>0 such that gC\|g\|\leq C for all gf(x)g\in\partial f(x) and xXϵx\in X_{\epsilon}.

Assumption 4 is satisfied, for example, when XX is bounded. In the sequel, we let E ⁣[g(x+z)]\mathsf{E}\!\left[g(x+z)\right] denote the vector-valued integral of an element from the set of subdifferentials, which is given by

The following lemma presents properties of the randomized technique (30) with an arbitrary local random distribution over a ball. It states that, under the boundedness of the subgradients of ff, the set E ⁣[g(x+z)]\mathsf{E}\!\left[g(x+z)\right] defined above is a singleton. In particular, the lemma shows that f^\hat{f} is convex and differentiable approximation of ff.

f^\hat{f} is convex and differentiable over XX, with gradient

where the vector E ⁣[g(x+z)]\mathsf{E}\!\left[g(x+z)\right] is as defined in (31). Furthermore, f^(x)C\|\nabla\hat{f}(x)\|\leq C for all xXx\in X.

f(x)f^(x)f(x)+ϵCf(x)\leq\hat{f}(x)\leq f(x)+\epsilon C for all xXx\in X.

(b) By definition of random vector zz, it has zero mean, i.e., E ⁣[x+z]=x\mathsf{E}\!\left[x+z\right]=x, so that f(E ⁣[x+z])=f(x)f(\mathsf{E}\!\left[x+z\right])=f(x). Therefore, by Jensen’s inequality and the definition of f^\hat{f}, we have

To show relation f^(x)f(x)+ϵC\hat{f}(x)\leq f(x)+\epsilon C, we use the subgradient inequality for ff, which in particular implies that, for every xˉXϵ\bar{x}\in X_{\epsilon} and gf(xˉ)g\in\partial f(\bar{x}), we have

Since xˉXϵ\bar{x}\in X_{\epsilon}, we have xˉ=x+z\bar{x}=x+z for some xXx\in X and zz with zϵ\|z\|\leq\epsilon. Using this and the subgradient boundedness, from the preceding relation we obtain

Thus, by taking the expectation, we get f^(x)=E ⁣[f(x+z)]f(x)+ϵC\hat{f}(x)=\mathsf{E}\!\left[f(x+z)\right]\leq f(x)+\epsilon C for all xX.x\in X.

V-B Smoothing via random variables with uniform distributions

In this subsection, we consider a local smoothing technique wherein zz is generated via a uniform distribution. Other distributions may also work such as normal, considered in . However, distributions with finite support seem more appropriate for capturing local behavior of a function, as well as to deal with the problems where the function itself has a restricted domain. Our choice to work with a uniform distribution is due to the uniform distribution lending itself readily for computation of resulting Lipschitz constant and for assessment of the growth of the Lipschitz constant with the size of the problem.

The key result of this section is an examination of the Lipschitz continuity of the gradients of the smooth approximation, particularly in terms of the rate that such a constant grows with problem size.

where cn=πn2Γ(n2+1)c_{n}=\dfrac{\pi^{\frac{n}{2}}}{\Gamma(\frac{n}{2}+1)}, and Γ\Gamma is the gamma function given by

The following lemma shows that f^\hat{f} is convex and differentiable approximation of ff with Lipschitz gradients, where the Lipschitz constant for f^\nabla\hat{f} is related to the norm bound for the subgradients of ff.

where κ=2π\kappa=\frac{2}{\pi} if nn is even, and otherwise κ=1\kappa=1.

From Lemma 7(a) and relation (31), for any xXx\in X, there is a vector g(z+x)g(z+x) such that g(z+x)f(x+z)g(z+x)\in\partial f(x+z) a.s. and

where the last equality follows by letting v=x+zv=x+z. Therefore, for any x,yXx,y\in X,

where the last inequality follows by using the boundedness of the subgradients of ff over XϵX_{\epsilon}.

Now, we let x,yXx,y\in X be arbitrary but fixed, and we estimate Xϵpu(zx)pu(zy)dz\int_{X_{\epsilon}}|p{{}_{u}}(z-x)-p{{}_{u}}(z-y)|dz in (34). For this we consider the cases where xy>2ϵ\|x-y\|>2\epsilon and xy2ϵ\|x-y\|\leq 2\epsilon.

Case 1 (xy>2ϵ\|x-y\|>2\epsilon): For every zz with zxϵ\|z-x\|\leq\epsilon, we have zy>ϵ\|z-y\|>\epsilon, implying that pu(zy)=0p{{}_{u}}(z-y)=0, so that zxϵpu(zx)pu(zy)dz=1\int_{\|z-x\|\leq\epsilon}|p{{}_{u}}(z-x)-p{{}_{u}}(z-y)|dz=1. Likewise, for every zz with zyϵ\|z-y\|\leq\epsilon, we have pu(zx)=0p{{}_{u}}(z-x)=0, implying

Since 2<xy/ϵ2<\|x-y\|/\epsilon, it follows that

It can be further seen that κn!!(n1)!!1\kappa\frac{n!!}{(n-1)!!}\geq 1 for all n1n\geq 1, which combined with (37) and (34) yields the result.

Case 2 (xy2ϵ\|x-y\|\leq 2\epsilon): We decompose the integral in (34) over several regions, as follows:

where VSV_{S} denotes the volume of the set SS.

Now we want to find an upper bound for VSV_{S} in terms of yx\|y-x\|. Let Vcap(d)V_{cap}(d) denote the volume of the spherical cap with the distance dd from the center of the sphere. Therefore,

The volume of the nn-dimensional spherical cap with distance dd from the center of the sphere can be calculated in terms of the volumes of (n1)(n-1)-dimensional spheres, as follows:

with cn=πn/2Γ(n2+1)c_{n}=\dfrac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)} for n1n\geq 1. We have for d[0,ϵ]d\in[0,\epsilon],

where VcapV_{cap}^{\prime} and VcapV_{cap}^{\prime\prime} denote the first and the second derivative, respectively, with respect to dd. Hence, Vcap(d)V_{cap}(d) is convex over [0,ε][0,\varepsilon], and by the subgradient inequality we have

Since Vcap(0)=12cnϵnV_{cap}(0)=\frac{1}{2}c_{n}\epsilon^{n} and Vcap(0)=cn1ϵn1V_{cap}^{\prime}(0)=-c_{n-1}\epsilon^{n-1}, it follows

Noting that xy/2ϵ\|x-y\|/2\leq\epsilon (since xy2ϵ\|x-y\|\leq 2\epsilon), we can let d=xy/2ϵd=\|x-y\|/2\leq\epsilon in (40). By doing so and using (39), we obtain

Finally, substituting the preceding relation in (38), we have

Since cn=πn/2Γ(n2+1)c_{n}=\frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}, it can be seen that

with κ=2π\kappa=\frac{2}{\pi} if nn is even, and otherwise κ=1\kappa=1. Thus, we have

By combining (42) with (34), we obtain the desired result. ∎

It can be seen that the Lipschitz constant κn!!(n1)!!Cϵ\kappa\frac{n!!}{(n-1)!!}\,\frac{C}{\epsilon} established in Lemma 7 for the differentiable approximation f^\hat{f} grows at the rate of n\sqrt{n} with the number nn of the variables, i.e.,

This growth rate is worse than the growth rate ln(n+1)\sqrt{\ln(n+1)} obtained in for the global smoothing approximation, which uses a normally distributed perturbation vector zz. However, it should be emphasized that the smoothing technique in requires the function ff to be defined over the entire space since zz is drawn from a normal distribution, which is a somewhat stringent requirement. Our proposed local smoothing technique removes such a requirement, but suffers from a worse growth rate.

V-C Convergence analysis of the algorithm with local smoothing

In this section, we apply the stochastic approximation scheme presented in Section II to the smooth approximation f^\hat{f} of a nondifferentiable function ff. First, we consider the case when ff is convex but deterministic and then, we consider the case when ff is given as the expectation of a convex function.

We apply the local smoothing technique to the minimization of a convex but not necessarily differentiable function ff. In particular, suppose we want to minimize such a function ff over some set XX. We may first approximate ff by a differentiable function f^\hat{f} and then minimize f^\hat{f} over ff. In this case, by taking the minimum over xXx\in X in the relation in Lemma 7(b), we see that ff^f+ϵCf^{*}\leq\hat{f}^{*}\leq f^{*}+\epsilon C. Thus, we may overestimate the optimal value ff^{*} of the original problem by at most ϵC\epsilon C, where CC is a bound on subgradient norms of ff. So we consider the following optimization problem

We may solve the problem by considering the method (3), which takes the following form

where {zk}\{z_{k}\} is an i.i.d. sequence of random variables with uniform distribution over the nn-dimensional sphere centered at the origin and with the radius ϵ>0\epsilon>0.

We show that the conditions of Proposition 1 are satisfied. In particular, under the given assumptions, the set XϵX_{\epsilon} is convex and closed (Corollary 9.1.2 in ). Furthermore, the function F(x,z)=f(x+z)F(x,z)=f(x+z) is convex and finite on some open set containing the set XϵX_{\epsilon} for any zΩ={ξξϵ}z\in\Omega=\{\xi\mid\|\xi\|\leq\epsilon\}. Since zz is a random variable with uniform distribution on the sphere Ω\Omega, we see that E ⁣[F(x,z)]=E ⁣[f(x+z)]\mathsf{E}\!\left[F(x,z)\right]=\mathsf{E}\!\left[f(x+z)\right] is finite for every xXx\in X. Thus, Assumption 1 is satisfied. Since ff has bounded subgradients on XϵX_{\epsilon} and xkXXϵx_{k}\in X\subset X_{\epsilon}, we have gkC\|g_{k}\|\leq C. By Lemma 7(a), the gradients f^(x)\nabla\hat{f}(x) over XX are also bounded uniformly by CC. Hence,

implying that E ⁣[wk2Fk]4C2\mathsf{E}\!\left[\|w_{k}\|^{2}\mid\mathcal{F}_{k}\right]\leq 4C^{2}. In view of this, and k=0γk2<\sum_{k=0}^{\infty}\gamma_{k}^{2}<\infty (Assumption 2(a)), it follows that k=0γk2E ⁣[wk2Fk]<\sum_{k=0}^{\infty}\gamma_{k}^{2}\mathsf{E}\!\left[\|w_{k}\|^{2}\mid\mathcal{F}_{k}\right]<\infty, thus showing that Assumption 2(b) is satisfied. By Lemma 7, the function f^\hat{f} is differentiable with Lipschitz gradients over XX. Thus, the conditions of Proposition 1 are satisfied and the result follows. ∎

V-C2 Stochastic nondifferentiable optimization

In this section, we apply our local smoothing technique to a nondifferentiable stochastic problem of the form (1). Essentially, this amounts to putting the results of Sections II and V-C together. We thus consider the following problem:

FF is the function as described in section II, and f^\hat{f} is a smooth approximation of ff with zz having a uniform density pup_{u} as discussed in Section V. In view of Lemma 7(a), we know that ϵC\epsilon C is an upper bound for the difference between the optimal value f=minxXf(x)f^{*}=\min_{x\in X}f(x) and f^=minxXf^(x)\hat{f}^{*}=\min_{x\in X}\hat{f}(x), under appropriate conditions to be stated shortly. Under these conditions, we are interested in solving the approximate problem in (45).

where the inner expectation is conditioned on ξ\xi and is with respect to zz while the outer expectation is with respect to ξ\xi. We note that the variables ξ\xi and zz are independent, and by exchanging the order of the expectations, we obtain:

Thus, the problem in (45) is equivalent to

In the following lemma, we provide some conditions ensuring the differentiability of F^\hat{F} with respect to xx, as well as some other properties of F^\hat{F}. The lemma can be viewed as an immediate extension of Lemma 7 to the collection of functions F(,ξ)F(\cdot,\xi).

For every ξΩ\xi\in\Omega, the function F^(,ξ)\hat{F}(\cdot,\xi) is convex and differentiable with respect to xx at every xXx\in X, and the gradient xF^(x,ξ)\nabla_{x}\hat{F}(x,\xi) is given by

Furthermore, xF^(x,ξ)C\|\nabla_{x}\hat{F}(x,\xi)\|\leq C for all xXx\in X and ξΩ\xi\in\Omega.

F(x,ξ)F^(x,ξ)F(x,ξ)+ϵCF(x,\xi)\leq\hat{F}(x,\xi)\leq F(x,\xi)+\epsilon C for all xXx\in X and ξΩ\xi\in\Omega.

xF^(x,ξ)xF^(y,ξ)κn!!(n1)!!Cϵxy\|\nabla_{x}\hat{F}(x,\xi)-\nabla_{x}\hat{F}(y,\xi)\|\leq\kappa\dfrac{n!!}{(n-1)!!}\,\dfrac{C}{\epsilon}\|x-y\| for all x,yXx,y\in X and ξΩ\xi\in\Omega, where κ=2π\kappa=\frac{2}{\pi} if nn is even, and otherwise κ=1\kappa=1.

Under the given assumptions, each of the functions F(,ξ)F(\cdot,\xi) for ξΩ\xi\in\Omega satisfies the conditions of Lemma 7. Thus, the results follow by applying the lemma to each of the functions F(,ξ)F(\cdot,\xi) for ξΩ\xi\in\Omega. ∎

In the light of Lemma 7, the optimal value f^\hat{f}^{*} of the approximate problem in (46) is an overestimate of the optimal value ff^{*} of the original problem (1) within the error ϵC\epsilon C. In particular, by taking the expectation with respect to ξ\xi in the relation of Lemma 7(b), we obtain

This motivates solving approximate problem (46). Since for every ξΩ\xi\in\Omega, the function F^(,ξ)\hat{F}(\cdot,\xi) is convex and differentiable over the set XX, the function f^(x)=E ⁣[F^(x,ξ)]\hat{f}(x)=\mathsf{E}\!\left[\hat{F}(x,\xi)\right] is also convex and differentiable over the set XX (see ). Thus, the objective function f^\hat{f} in (46) is differentiable. To solve the problem, we consider the method in (3), which takes the following form:

We have the following convergence result for the method.

Let the assumptions of Lemma 9 hold, and let Assumption 2 hold. Then, the sequence {xk}\{x_{k}\} generated by method (47) converges almost surely to some optimal solution of problem (46).

It suffices to show that the conditions of Proposition 1 are satisfied for the set XX, and the functions F^(x,ξ)\hat{F}(x,\xi) and f^(x)\hat{f}(x). The result will then follow from Proposition 1.

We first verify that F^(x,ξ)\hat{F}(x,\xi) satisfies Assumption 1 and that f^(x)\hat{f}(x) has Lipschitz gradients over XX. Under the given assumptions, Lemma 9 holds. By Lemma 9(a)–(b), the function F^(x,ξ)\hat{F}(x,\xi) satisfies Assumption 1. Furthermore, by Lemma 9(a) and (c), the function F^(x,ξ)\hat{F}(x,\xi) is differentiable and with Lipschitz gradients for every ξΩ\xi\in\Omega. Hence, f^(x)=E ⁣[F^(x,ξ)]\hat{f}(x)=\mathsf{E}\!\left[\hat{F}(x,\xi)\right] is also differentiable with the gradient given by f^(x)=E ⁣[xF^(x,ξ)]\nabla\hat{f}(x)=\mathsf{E}\!\left[\nabla_{x}\hat{F}(x,\xi)\right](see ). To see that the gradients f^\nabla\hat{f} are Lipschitz continuous, we take the expectation in the relation of Lemma 9(c), and we obtain for all x,yXx,y\in X,

where κ=2π\kappa=\frac{2}{\pi} if nn is even, and otherwise κ=1\kappa=1. Using Jensen’s inequality, we further have for all x,yXx,y\in X,

Since f^(x)=E ⁣[xF^(x,ξ)]\nabla\hat{f}(x)=\mathsf{E}\!\left[\nabla_{x}\hat{F}(x,\xi)\right], it follows that f^(x)\nabla\hat{f}(x) is Lipschitz over the set XX. Thus, the objective function f^\hat{f} satisfies the conditions of Proposition 1.

We now show that Assumption 2(b) is satisfied. In view of the assumption that k=0γk2<\sum_{k=0}^{\infty}\gamma_{k}^{2}<\infty (Assumption 2(a)), it suffices to show that wk\|w_{k}\| is uniformly bounded. By the definition of wkw_{k} in (47), we have for all kk,

where xkXx_{k}\in X and zkϵ\|z_{k}\|\leq\epsilon for all kk. Thus, xk+zkXϵx_{k}+z_{k}\in X_{\epsilon} for all kk. By the assumptions of Lemma 9, the subdifferential set xF(x,ξ)\partial_{x}F(x,\xi) is uniformly bounded over Xϵ×ΩX_{\epsilon}\times\Omega, implying that

We next prove that the gradients f^(x)\nabla\hat{f}(x) are uniformly bounded over the set XX. Taking the expectation in the relation xF^(x,ξ)C\|\nabla_{x}\hat{F}(x,\xi)\|\leq C valid for any xXx\in X and ξΩ\xi\in\Omega (Lemma 9(a)), and using Jensen’s inequality, we obtain

Since f^(x)=E ⁣[xF^(x,ξ)]\nabla\hat{f}(x)=\mathsf{E}\!\left[\nabla_{x}\hat{F}(x,\xi)\right], we see that f^(x)C\|\nabla\hat{f}(x)\|\leq C for xXx\in X. This and relation (48) yields

thus showing that wk\|w_{k}\| is uniformly bounded. ∎

VI Numerical results

In this section, we present computational results of applying our adaptive and smoothing schemes to three test problems. Sections VI-A1, VI-A2 and VI-A3 consider a stochastic utility problem (see ), a bilinear matrix game and a stochastic network utility maximization problem, respectively. In all of these examples, we compare the performance of the recursive steplength SA scheme (RSA) and the cascading steplength SA scheme (CSA) with a standard implementation of stochastic approximation. The standard SA scheme, where the steplength sequence is chosen to be a harmonic sequence is referred to as the HSA scheme and is employed as a benchmark. For each example, we provide this comparison for 9 problems of varying size and problem parameters apart from figures illustrating the difference between theoretical bounds and the obtained results. Notably, the first two problems are nonsmooth convex problems, prompting us to work with a regularized strongly convex form. In Section VI-B, we discuss the sensitivity of the schemes to changes in parameters. Throughout Section VI, we use N,n,ηN,n,\eta and ϵ\epsilon, to denote the no. of iterations, the problem dimension, the strong convexity parameter, and the size of the uniform distribution employed for smoothing, respectively.

Consider the following optimization problem,

where X={xRnx0,i=1nxi=1}X=\{x\in R^{n}|x\geq 0,\sum_{i=1}^{n}x_{i}=1\}, ξi\xi_{i} are independent and normally distributed random variables with mean zero and variance one. The function ϕ()\phi(\cdot) is a piecewise linear convex function given by ϕ(t)=max1im{vi+sit},\phi(t)=\max_{1\leq i\leq m}\{v_{i}+s_{i}t\}, where viv_{i} and sis_{i} are constants between zero and one, and F(x,ξ)=ϕ(i=1n(in+ξi)xi))F(x,\xi)=\phi(\sum_{i=1}^{n}(\frac{i}{n}+\xi_{i})x_{i})). To apply our schemes, we require strong convexity of function ff. Therefore, we regularize ff by adding the term η2x2\frac{\eta}{2}\|x\|^{2} to ff where η>0\eta>0 is the strong convexity parameter. We now apply the randomized smoothing technique discussed in Section V-C. Smoothed regularized problem given by

Table I shows the results of parametric analysis of the simulation of our schemes on problem (50). The table is partitioned into three parts, each corresponding to a variation of parameters nn, NN, η\eta, respectively. In each part, one parameter has been assigned three increasing values while the other parameters are kept fixed, allowing us to ascertain the impact of each parameter on the performance of the schemes. We generated 50 trajectories of the RSA and CSA scheme for a given n,N,η,ϵ.n,N,\eta,\epsilon. Over these realizations, we computed the means and 90%\% confidence intervals. The baseline parameters are chosen as n=20n=20, N=4000N=4000, ϵ=0.5\epsilon=0.5, and η=0.5\eta=0.5 as a reference for each group. Note that in Table I, the confidence intervals employ the logarithm of the error. Recall that we have a theoretical upper bound on the error E ⁣[xkxϵ,η2]\mathsf{E}\!\left[\|x_{k}-x_{\epsilon,\eta}^{*}\|^{2}\right], as given by (8) and (17) for the RSA and CSA schemes. Additionally, we obtain an empirical error bound based on using the scheme in practice. Insights: We observe that the confidence intervals of both the CSA and the RSA schemes are relatively invariant to changes in problem dimension. Furthermore, RSA appears to have provide slightly tighter intervals in comparison with CSA. Expectedly, increasing NN leads to significant improvement in these intervals while larger values of η\eta lead to less accurate solutions (with respect to the unregularized problem) but tighter bounds. Moreover, the CSA schemes in particular give better confidence bounds than RSA when η\eta is larger.

VI-A2 A bilinear matrix game problem

Problem (51) a saddle point problem. Solving saddle point problems by SA algorithm has been discussed extensively (cf. ). The gradient and its sampled variant to be employed in algorithm (3) are given by:

respectively where l(y,ξ1)l(y,\xi_{1}) and \l(x,ξ2)\l(x,\xi_{2}) are random integers between 1 and nn with probabilities

respectively for arbitrary vectors xx and yy. We generate these random variables through two independent random variables ξ1\xi_{1} and ξ2\xi_{2} which are uniformly distributed in $.Now,forany. Now, for any(x,y)\in X\times Y,since, since\min(0,x_{1},\ldots,x_{n})=\min(0,y_{1},\ldots,y_{n})=0,and, and\sum_{i=1}^{n}x_{i}=\sum_{j=1}^{n}y_{j}=1$, we have

implying that wkw_{k} has zero-mean, i.e., E ⁣[wkFk]=0\mathsf{E}\!\left[w_{k}\mid\mathcal{F}_{k}\right]=0 for all k0.k\geq 0. To analyze the behavior of the upper bound of error arising from RSA and CSA, we need a strongly convex function. This is obtained by adding a regularization term η2x2η2y2\frac{\eta}{2}\|x\|^{2}-\frac{\eta}{2}\|y\|^{2} to the function yTAxy^{T}Ax which makes it a strongly convex function with respect to xx and a strongly concave function with respect to yy. To apply the randomized technique in Section V, we consider an (2n2n)-dimensional ball with radius ϵ\epsilon uniformly distributed. We use the following SA algorithm to find the solution to an approximate solution of (51):

From the structure of AA in (52), it is observed that the optimal solution of problem (51) is obtained for x=[1,0,,0]Tx^{*}=[1,0,\ldots,0]^{T} and y=[0,,0,1]Ty^{*}=[0,\ldots,0,1]^{T}. This result can also be obtained quite simply by using a linear programming reformulation. The regularized problem cannot be analyzed as easily and its solution can be obtained by using QP duality and SAA techniques.

Table II presents the results of simulations for RSA and CSA schemes. Similar to the Table I, there are three parts in the Table II for the parameters. For this problem, xϵ,ηx2\|x_{\epsilon,\eta}^{*}-x^{*}\|^{2} is very small and shows that the optimal solution of the approximate problem is very close to the optimal solution of problem (51). We set n=20n=20, N=4000N=4000, ϵ=0.2\epsilon=0.2, and η=0.01\eta=0.01 as the reference setting. Figure 3b shows the theoretical upper bounds and the mean of samples of simulation for RSA and CSA schemes.

Insights: Unlike in the stochastic utility problem, in this instance, the true optimal solution is obtained within the NN gradient steps for most of the test problems. However, it should be remarked that the CSA appears to find solutions faster than RSA, in at least three of the problems (P(i)(i): 3, 5 and 7).

VI-A3 A stochastic network utility problem

In this example, we consider a spatial network and consider the associated network utility maximization problem (See ). Suppose that there are nn users and L1L_{1} links. The overall network maximization problem is characterized by an objective that is a sum of user-specific concave utilities less a congestion cost, which is given by a function of aggregate flow over a link. Let xix_{i} denote the iith user’s flow rate while Fi(x;ξ)F_{i}(x_{;}\xi) denotes its utility function, defined by

where ki(ξi)k_{i}(\xi_{i}) is an uncertain parameter. Suppose that AA denotes the adjacency matrix that captures the set of links traversed by the traffic. More precisely, for every link lLl\in{\mathcal{L}} and user ii, we have Ali=1A_{li}=1 if link ll carries flow of user ii and Ali=0A_{li}=0 otherwise. The congestion cost is given by c(x)=Ax2.c(x)=\|Ax\|^{2}. The total cost at the network level us then given by

We assume that the user traffic rates are restricted by a capacity constraint AxCAx\leq C. Since the objective function FF is smooth, there is no requirement to introduce an additional smoothing.

Table III shows the results of simulations for HSA, RSA, and CSA scheme. Here, we assume that C3=(0.10,0.15,0.20,0.10,0.15,0.20,0.20,0.15,0.25)=0.75C2=0.5C1C_{3}=(0.10,0.15,0.20,0.10,0.15,0.20,0.20,0.15,0.25)=0.75C_{2}=0.5C_{1} and xx is constrained to be nonnegative. We also assume that ki(ξi)k_{i}(\xi_{i}) is drawn from uniform distribution Uni(0.2,1)Uni(0.2,1) for every user. The confidence intervals for the normed error between the terminating iterate and the optimal solution are reported for each problem.

Insights: We observe that both RSA and CSA schemes perform favorably in comparison with the HSA scheme. Importantly, neither scheme appears to deteriorate from a confidence interval standpoint when the problem size grows. Similar to the earlier examples, CSA appears to have slightly tighter confidence intervals in the empirical tests that we carried out.

VI-B Interpretation of numerical results

In this section, we interpret the numerical results obtained in the previous subsections, focusing on a comparison between the theoretical and empirical results and the sensitivity of the schemes to the algorithm parameters.

In Figures 3a, 3b and 3c, we provide schematics of the trajectories associated with the theoretically obtained upper bounds and the empirical means. Several observations can be immediately made. In the context of the stochastic utility problem and the network utility maximization problem, we observe that the RSA scheme displays uniformly better theoretical bounds, in comparison with CSA. It is also worth emphasizing that the “jumps” seen in the theoretical error bound trajectories of CSA correspond to junctures where the steplengths drop. In fact, the cascading nature is also apparent in the empirical trajectories of the network utility maximization game in Fig 3c, albeit in a less obvious fashion. We observe that the overall empirical behavior of both schemes is similar in terms of the final errors for the utility and network utility maximization problems while in the context of the bimatrix game, the CSA scheme performs significantly better for a subset of problems.

VI-B2 Sensitivity to algorithm parameters

Finally, in this section, we discuss the sensitivity of each scheme to algorithm parameters and provide a comparison with a standard stochastic approximation scheme where we assume that the stepsize is γk=αk\gamma_{k}=\frac{\alpha}{k} for k1k\geq 1 and α>0\alpha>0. In HSA, we intend to examine the effect of choosing different values of α\alpha on the performance of the SA algorithm. In the RSA scheme, we have a choice of the first stepsize γ0RSA\gamma_{0}^{RSA} and also parameter cc in the inequality of Proposition 3. We set c=0.5c=0.5 and examine the impact of changing γ0RSA\gamma_{0}^{RSA}. Finally, the CSA scheme performs differently with different choices of the cascading parameter 0<θ<10<\theta<1. We consider three different values for each of α\alpha, γ0RSA\gamma_{0}^{RSA}, and θ\theta and present simulations for HSA, RSA and CSA in the case of the stochastic utility problem. The reference setting is specified by n=20n=20, N=4000N=4000, ϵ=0.5\epsilon=0.5, and η=0.5\eta=0.5. Now suppose α\alpha, γ0RSA\gamma_{0}^{RSA}, and θ\theta are set as follows:

Figure 4 shows the simulations for the specified parameters. Note that “Th. UB” shows the corresponding theoretical upper bound of each scheme and ”Mean” shows the mean of error zkzϵ,η2\|z_{k}-z_{\epsilon,\eta}^{*}\|^{2} where z=(x,y)z=(x,y).

Figure 4a shows the harmonic scheme with α=\alpha=1, 0.5, and 0.25 corresponding to labels 1, 2, and 3 in the legend. This shows that the performance of HSA is extremely sensitive to the choice of α\alpha and HSA implementations with a larger α\alpha performed better for the stochastic utility problem. Furthermore, the error on termination of HSA schemes can vary by nearly a factor of 10 for the problems that we tested. The update rules in the RSA schemes rely on η\eta and LL with γ0RSA\gamma_{0}^{RSA} being the sole user input. Yet, when examining the sensitivity of the RSA scheme to the choice of γ0RSA\gamma_{0}^{RSA} (see Figure 4b with γ0RSA=\gamma_{0}^{RSA}=1, 0.5, and 0.25 corresponding to labels 1, 2, and 3), we observe that the performance is relatively insensitive to the choice of initial stepsize. In effect, the modeler can be relatively less concerned about such parameters when attempting to solve this class of problems. Importantly, both theoretical and numerical aspect of RSA have almost the same performance for three values of γ0RSA\gamma_{0}^{RSA}. Finally, a concern in the implementation of CSA schemes is the choice of θ\theta, the cascading parameter where θ(0,1)\theta\in(0,1). Figure 4c shows the simulation of the cascading scheme with θ=\theta=0.75, 0.50, and 0.25 corresponding to labels 1, 2, and 3. Theoretically, we observe that smaller values of θ\theta (more aggressive reductions in stepsize) lead to slightly superior theoretical bounds but not significantly so. However, the results are far more muted when conducting an empirical examination. In particular, we observe that the CSA scheme appears to be relatively insensitive to diversity in the choice of θ.\theta. The relative robustness of the RSA and CSA schemes to the choice of parameters is seen as a crucial advantage of such schemes.

VII Concluding remarks

This paper is motivated by two shortcomings associated with standard stochastic approximation procedures for stochastic convex programs. First, standard implementations of such schemes provide little guidance in specifying parameters that may prove crucial in practical performance. Furthermore, direct extensions to nonsmooth regimes of such schemes is not immediate. Accordingly, this paper makes two sets of contributions. First, we develop two sets of adaptive steplength schemes and provide the associated global convergence theory. Of these, the former, a recursive steplength scheme (RSA), specifies the steplength at a particular iteration using the previous steplength and certain problem parameters. The second scheme, called a cascading steplength scheme (CSA), differs significantly and is essentially a sequence of constant steplength schemes in which the steplength is reduced at specific points in time. The second set of contributions extends these techniques to settings where the objective is not necessarily differentiable. Through the use of a local smoothing method that perturbs the problem through a uniformly distributed random variable, we propose a stochastic gradient scheme. Notably, Lipschitz bounds are obtained for the gradients and their growth with problem size is found to be modest. Locally smoothed variants of the RSA and CSA scheme were seen to perform well on two classes of nonsmooth stochastic optimization problems and implementations were seen to be relatively insensitive to problem parameters.

References