SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

Cong Fang, Chris Junchi Li, Zhouchen Lin, Tong Zhang

Introduction

In this paper, we study the optimization problem

where the stochastic component F(x;ζ)F(\mathbf{x};\bm{\zeta}), indexed by some random vector ζ\bm{\zeta}, is smooth and possibly non-convex. Non-convex optimization problem of form (1.1) contains many large-scale statistical learning tasks. Optimization methods that solve (1.1) are gaining tremendous popularity due to their favorable computational and statistical efficiencies (Bottou,, 2010; Bubeck et al.,, 2015; Bottou et al.,, 2018). Typical examples of form (1.1) include principal component analysis, estimation of graphical models, as well as training deep neural networks (Goodfellow et al.,, 2016). The expectation-minimization structure of stochastic optimization problem (1.1) allows us to perform iterative updates and minimize the objective using its stochastic gradient F(x;ζ)\nabla F(\mathbf{x};\bm{\zeta}) as an estimator of its deterministic counterpart.

A special case of central interest is when the stochastic vector ζ\bm{\zeta} is finitely sampled. In such finite-sum (or offline) case, we denote each component function as fi(x)f_{i}(x) and (1.1) can be restated as

where nn is the number of individual functions. Another case is when nn is reasonably large or even infinite, running across of the whole dataset is exhaustive or impossible. We refer it as the online (or streaming) case. For simplicity of notations we will study the optimization problem of form (1.2) in both finite-sum and on-line cases till the rest of this paper.

For the task of finding stationary points for which we already achieved a faster convergence rate via our proposed Spider-SFO algorithm, a follow-up question to ask is: is our proposed Spider-SFO algorithm optimal for an appropriate class of smooth functions? In this paper, we provide an affirmative answer to this question in the finite-sum case. To be specific, inspired by a counterexample proposed by Carmon et al., 2017b we are able to prove that the gradient cost upper bound of Spider-SFO algorithm matches the algorithmic lower bound. To put it differently, the gradient cost of Spider-SFO cannot be further improved for finding stationary points for some particular non-convex functions.

In the recent years, there has been a surge of literatures in machine learning community that analyze the convergence property of non-convex optimization algorithms. Limited by space and our knowledge, we have listed all literatures that we believe are mostly related to this work. We refer the readers to the monograph by Jain et al., (2017) and the references therein on recent general and model-specific convergence rate results on non-convex optimization.

For the general problem of finding approximate stationary points, under the smoothness condition of f(x)f(\mathbf{x}), it is known that vanilla Gradient Descent (GD) and Stochastic Gradient Descent (SGD), which can be traced back to Cauchy, (1847) and Robbins & Monro, (1951) and achieve an ϵ\epsilon-approximate stationary point with a gradient cost of O(min(nϵ2,ϵ4))\mathcal{O}(\min(n\epsilon^{-2},\epsilon^{-4})) (Nesterov,, 2004; Ghadimi & Lan,, 2013; Nesterov & Spokoiny,, 2011; Ghadimi & Lan,, 2013; Shamir,, 2017).

First-order method for finding approximate stationary points

Online PCA and the NEON method

In late 2017, two groups Xu et al., (2017); Allen-Zhu & Li, (2018) proposed a generic saddle-point-escaping method called Neon, a Negative-Curvature-Search method using stochastic gradients. Using such Neon method, one can convert a series of optimization algorithms whose update rules use stochastic gradients and Hessian-vector products (GD, SVRG, FastCubic/CDHS, SGD, SCSG, Natasha2, etc.) to the ones using only stochastic gradients without increasing the gradient cost. The idea of Neon was built upon Oja’s iteration for principal component estimation (Oja,, 1982), and its global convergence rate was proved to be near-optimal (Li et al.,, 2017; Jain et al.,, 2016). Allen-Zhu & Li, (2017) later extended such analysis to the rank-kk case as well as the gap-free case, the latter of which serves as the pillar of the Neon method.

Other concurrent works

As the current work is carried out in its final phase, the authors became aware that an idea of resemblance was earlier presented in an algorithm named the StochAstic Recursive grAdient algoritHm (SARAH) (Nguyen et al., 2017a, ; Nguyen et al., 2017b, ). Both our Spider-type of algorithms and theirs adopt the recursive stochastic gradient update framework. Nevertheless, our techniques essentially differ from the works Nguyen et al., 2017a ; Nguyen et al., 2017b in two aspects:

The version of SARAH proposed by Nguyen et al., 2017a ; Nguyen et al., 2017b can be seen as a variant of gradient descent, while ours hybrids the Spider technique with a stochastic version of NGD.

Nguyen et al., 2017a ; Nguyen et al., 2017b adopt a large stepsize setting (in fact their goal was to design a memory-saving variant of SAGA (Defazio et al.,, 2014)), while our algorithms adopt a small stepsize that is proportional to ϵ\epsilon;

2 Our Contributions

In this work, we propose the Stochastic Path-Integrated Differential Estimator (Spider) technique, which significantly avoids excessive access of stochastic oracles and reduces the time complexity. Such technique can be potential applied in many stochastic estimation problems.

As a first application of our Spider technique, we propose the Spider-SFO algorithm (Algorithm 1) for finding an approximate first-order stationary point for non-convex stochastic optimization problem (1.2), and prove the optimality of such rate in at least one case. Inspired by recent works Johnson & Zhang, (2013); Carmon et al., (2016); Carmon et al., 2017b and independent of Zhou et al., 2018b ; Zhou et al., 2018a , this is the first time that the gradient cost of O(min(n1/2ϵ2,ϵ3))\mathcal{O}(\min(n^{1/2}\epsilon^{-2},\epsilon^{-3})) in both upper and lower (finite-sum only) bound for finding first-order stationary points for problem (1.2) were obtained.

As a second application of our Spider technique, we apply it to zeroth-order optimization for problem (1.2) and achieves individual function accesses of O(min(dn1/2ϵ2,dϵ3))\mathcal{O}(\min(dn^{1/2}\epsilon^{-2},d\epsilon^{-3})). To best of our knowledge, this is also the first time that using Variance Reduction technique (Schmidt et al.,, 2017; Johnson & Zhang,, 2013) to reduce the individual function accesses for non-convex problems to the aforementioned complexity.

We propose a much simpler analysis for proving convergence to a stationary point. One can flexibly apply our proof techniques to analyze others algorithms, e.g. SGD, SVRG (Johnson & Zhang,, 2013), and SAGA (Defazio et al.,, 2014).

Organization. The rest of this paper is organized as follows. §2 presents the core idea of stochastic path-integrated differential estimator that can track certain quantities with much reduced computational costs. §3 provides the Spider method for stochastic first-order methods and convergence rate theorems of this paper for finding approximate first-order stationary and second-order stationary points, and details a comparison with concurrent works. §4 provides the Spider method for stochastic zeroth-order methods and relevant convergence rate theorems. §5 concludes the paper with future directions. All the detailed proofs are deferred to the appendix in their order of appearance.

Stochastic Path-Integrated Differential Estimator: Core Idea

In this section, we present in detail the underlying idea of our Stochastic Path-Integrated Differential Estimator (Spider) technique behind the algorithm design. As the readers will see, such technique significantly avoids excessive access of the stochastic oracle and reduces the complexity, which is of independent interest and has potential applications in many stochastic estimation problems.

Then we can integrate (in the discrete sense) the stochastic differential estimate as

Then for any γ>0\gamma>0 and a given k{1,,K}k\in\{1,\dots,K\} we have with probability at least 14γ1-4\gamma

Proposition 1(i) can be easily concluded using the property of square-integrable martingales. To prove the high-probability bound in Proposition 1(ii), we need to apply an Azuma-Hoeffding-type concentration inequality (Pinelis,, 1994). See §A in the Appendix for more details.

At each step kk let SS_{*} be a subset that samples S{\mathcal{S}}_{*} elements in [n][n] with replacement, and let the stochastic estimator BS=(1/S)iSBi\mathcal{B}_{S_{*}}=(1/{\mathcal{S}}_{*})\sum_{i\in S_{*}}\mathcal{B}_{i} satisfy

and xkxk1ϵ1\|\mathbf{x}^{k}-\mathbf{x}^{k-1}\|\leq\epsilon_{1} for all k=1,,Kk=1,\dots,K. Finally, we set our estimator Vk\mathcal{V}^{k} of B(xk)\mathcal{B}(\mathbf{x}^{k}) as

Applying Proposition 1 immediately concludes the following lemma, which gives an error bound of the estimator Vk\mathcal{V}^{k} in terms of the second moment of VkB(xk)\|\mathcal{V}^{k}-\mathcal{B}(\mathbf{x}^{k})\|:

We have under the condition (2.7) that for all k=1,,Kk=1,\dots,K,

It turns out that one can use Spider to track many quantities of interest, such as stochastic gradient, function values, zero-order estimate gradient, functionals of Hessian matrices, etc. Our proposed Spider-based algorithms in this paper take Bi\mathcal{B}_{i} as the stochastic gradient fi\nabla f_{i} and the zeroth-order estimate gradient, separately.

SPIDER for Stochastic First-Order Method

In this section, we apply Spider to the task of finding both first-order and second-order stationary points for non-convex stochastic optimization. The main advantage of Spider-SFO lies in using SPIDER to estimate the gradient with a low computation cots. We introduce the basic settings and assumptions in §3.1 and propose the main error-bound theorems for finding approximate first-order and second-order stationary points, separately in §3.2 and §3.3.

We first introduce the formal definition of approximate first-order and second-order stationary points, as follows.

Also, call x\mathbf{x} an (ϵ,δ)(\epsilon,\delta)-approximate second-order stationary point, or simply an SSP, if

The definition of an (ϵ,δ)(\epsilon,\delta)-approximate second-order stationary point generalizes the classical version where δ=ρϵ\delta=\sqrt{\rho\epsilon}, see e.g. Nesterov & Polyak, (2006). For our purpose of analysis, we also pose the following additional assumption:

The component function fi(x)f_{i}(\mathbf{x}) has an averaged LL-Lipschitz gradient, i.e. for all x,y\mathbf{x},\mathbf{y},

(For on-line case only) the stochastic gradient has a finite variance bounded by σ2<\sigma^{2}<\infty, i.e.

Alternatively, to obtain high-probability results using concentration inequalities, we propose the following more stringent assumptions:

We assume that Assumption 1 holds and, in addition,

(Optional) each component function fi(x)f_{i}(\mathbf{x}) has LL-Lipschitz continuous gradient, i.e. for all i,x,yi,\mathbf{x},\mathbf{y},

(For on-line case only) the gradient of each component function fi(x)f_{i}(\mathbf{x}) has finite bounded variance by σ2<\sigma^{2}<\infty (with probability 11) , i.e. for all i,xi,\mathbf{x},

For the problem of finding an (ϵ,δ)(\epsilon,\delta)-approximate second-order stationary point, we pose in addition to Assumption 1 the following assumption:

We assume that Assumption 2 (including (ii’)) holds and, in addition, each component function fi(x)f_{i}(\mathbf{x}) has ρ\rho-Lipschitz continuous Hessian, i.e. for all i,x,yi,\mathbf{x},\mathbf{y},

We emphasize that Assumptions 1, 2, and 3 are standard for non-convex stochastic optimization (Agarwal et al.,, 2017; Carmon et al., 2017b, ; Jin et al., 2017a, ; Xu et al.,, 2017; Allen-Zhu & Li,, 2018).

2 First-Order Stationary Point

Recall that NGD has iteration update rule

where η\eta is a constant step size. The NGD update rule (3.3) ensures xk+1xk\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\| being constantly equal to the stepsize η\eta, and might fastly escape from saddle points and converge to a second-order stationary point (Levy,, 2016). We propose Spider-SFO in Algorithm 1, which is like a stochastic variant of NGD with the Spider technique applied, so as to maintain an estimator in each epoch f(xk)\nabla f(\mathbf{x}^{k}) at a higher accuracy under limited gradient budgets.

To analyze the convergence rate of Spider-SFO, let us first consider the on-line case for Algorithm 1. We let the input parameters be

where n0[1,2σ/ϵ]n_{0}\in[1,2\sigma/\epsilon] is a free parameter to choose.When n0=1n_{0}=1, the mini-batch size is 2σ/ϵ2\sigma/\epsilon, which is the largest mini-batch size that Algorithm 1 allows to choose. In this case, vk\mathbf{v}^{k} in Line 5 of Algorithm 1 is a Spider for f(xk)\nabla f(\mathbf{x}^{k}). To see this, recall fi(xk1)\nabla f_{i}(\mathbf{x}^{k-1}) is the stochastic gradient drawn at step kk and

Plugging in Vk=vk\mathcal{V}^{k}=\mathbf{v}^{k} and Bi=fi\mathcal{B}_{i}=\nabla f_{i} in Lemma 1 of §2, we can use vk\mathbf{v}^{k} in Algorithm 1 as the Spider and conclude the following lemma that is pivotal to our analysis.

Set the parameters S1S_{1}, S2S_{2}, η\eta, and qq as in (3.4), and k0=k/qqk_{0}=\lfloor k/q\rfloor\cdot q. Then under the Assumption 1, we have

Here we compute the conditional expectation over the randomness of x(k0+1):kx_{(k_{0}+1):k}.

Lemma 2 shows that our Spider vk\mathbf{v}^{k} of f(x)\nabla f(\mathbf{x}) maintains an error of O(ϵ)\mathcal{O}(\epsilon). Using this lemma, we are ready to present the following results for Stochastic First-Order (SFO) method for finding first-order stationary points of (1.2).

For the on-line case, set the parameters S1S_{1}, S2S_{2}, η\eta, and qq as in (3.4), and K=(4LΔn0)ϵ2+1K=\left\lfloor(4L\Delta n_{0})\epsilon^{-2}\right\rfloor+1. Then under the Assumption 1, for Algorithm 1 with OPTION II, after KK iteration, we have

The gradient cost is bounded by 16LΔσϵ3+2σ2ϵ2+4σn01ϵ116L\Delta\sigma\cdot\epsilon^{-3}+2\sigma^{2}\epsilon^{-2}+4\sigma n_{0}^{-1}\epsilon^{-1} for any choice of n0[1,2σ/ϵ]n_{0}\in[1,2\sigma/\epsilon]. Treating Δ\Delta, LL and σ\sigma as positive constants, the stochastic gradient complexity is O(ϵ3)\mathcal{O}(\epsilon^{-3}).

The relatively reduced minibatch size serves as the key ingredient for the superior performance of Spider-SFO. For illustrations, let us compare the sampling efficiency among SGD, SCSG and Spider-SFO in their special cases. With some involved analysis of these algorithms, we can conclude that to ensure a sufficient function value decrease of Ω(ϵ2/L)\Omega(\epsilon^{2}/L) at each iteration,

for SGD the choice of mini-batch size is \mathcal{O}\big{(}\sigma^{2}\cdot\epsilon^{-2}\big{)};

for SCSG (Lei et al.,, 2017) and Natasha2 (Allen-Zhu,, 2018) the mini-batch size is \mathcal{O}\big{(}\sigma\cdot\epsilon^{-1.333}\big{)};

for our Spider-SFO only needs a reduced mini-batch size of \mathcal{O}\big{(}\sigma\cdot\epsilon^{-1}\big{)}

Turning to the finite-sum case, analogous to the on-line case we let

where n0[1,n1/2]n_{0}\in[1,n^{1/2}]. In this case, one computes the full gradient vk=fS1(xk)\mathbf{v}^{k}=\nabla f_{S_{1}}(\mathbf{x}^{k}) in Line 3 of Algorithm 1. We conclude our second upper-bound result:

In the finite-sum case, set the parameters S2{\mathcal{S}}_{2}, η\eta, and qq as in (3.7), K=(4LΔn0)ϵ2+1K=\left\lfloor(4L\Delta n_{0})\epsilon^{-2}\right\rfloor+1 and let S1=[n]S_{1}=[n], i.e. we obtain the full gradient in Line 3. The gradient cost is bounded by n+8(LΔ)n1/2ϵ2+2n01n1/2n+8(L\Delta)\cdot n^{1/2}\epsilon^{-2}+2n_{0}^{-1}n^{1/2} for any choice of n0[1,n1/2]n_{0}\in[1,n^{1/2}]. Treating Δ\Delta, LL and σ\sigma as positive constants, the stochastic gradient complexity is O(n+n1/2ϵ2)\mathcal{O}(n+n^{1/2}\epsilon^{-2}).

Lower Bound for Finding First-Order Stationary Points

Note the condition nO(ϵ4)n\leq\mathcal{O}(\epsilon^{-4}) in Theorem 3 ensures that our lower bound Ω(n1/2ϵ2)=Ω(n+n1/2ϵ2)\Omega(n^{1/2}\epsilon^{-2})=\Omega(n+n^{1/2}\epsilon^{-2}), and hence our upper bound in Theorem 1 matches the lower bound in Theorem 3 up to a constant factor of relevant parameters, and is hence near-optimal. Inspired by Carmon et al., 2017b , our proof of Theorem 3 utilizes a specific counterexample function that requires at least Ω(n1/2ϵ2)\Omega(n^{1/2}\epsilon^{-2}) stochastic gradient accesses. Note Carmon et al., 2017b analyzed such counterexample in the deterministic case n=1n=1 and we generalize such analysis to the finite-sum case n1n\geq 1.

Note by setting n=O(ϵ4)n=\mathcal{O}(\epsilon^{-4}) the lower bound complexity in Theorem 3 can be as large as Ω(ϵ4)\Omega(\epsilon^{-4}). We emphasize that this does not violate the O(ϵ3)\mathcal{O}(\epsilon^{-3}) upper bound in the on-line case [Theorem 1], since the counterexample established in the lower bound depends not on the stochastic gradient variance σ2\sigma^{2} specified in Assumption 1(iii), but on the component number nn. To obtain the lower bound result for the on-line case with the additional Assumption 1(iii), with more efforts one might be able to construct a second counterexample that requires Ω(ϵ3)\Omega(\epsilon^{-3}) stochastic gradient accesses with the knowledge of σ\sigma instead of nn. We leave this as a future work.

Upper Bound for Finding First-Order Stationary Points, in High-Probability

In the finite-sum case, set the parameters S1S_{1}, S2S_{2}, η\eta, and qq as (3.7). let S1=[n]S_{1}=[n], i.e. we obtain the full gradient in Line 3. Then under the Assumption 2 (including (ii’)), with probability at least 1p1-p, Algorithm 1 terminates before K0=4LΔn0/ϵ2+2K_{0}=\lfloor 4L\Delta n_{0}/\epsilon^{2}\rfloor+2 iterations and outputs an xK\mathbf{x}^{\mathcal{K}} satisfying

3 Second-Order Stationary Point

Our final result on the convergence rate of Algorithm 2 is stated as:

Let Assumptions 3 hold. For the on-line case, set q,S1,S2,ηq,S_{1},S_{2},\eta in (3.4), K=δLn0ρϵ\mathscr{K}=\frac{\delta Ln_{0}}{\rho\epsilon} with any choice of n0[1,2σ/ϵ]n_{0}\in[1,2\sigma/\epsilon], then with probability at least 1/21/2By multiple times (at most in O(log(1/p))O(\log(1/p)) times) of verification and restarting Algorithm 2 , one can also obtain a high-probability result., Algorithm 2 outputs an xk\mathbf{x}^{k} with jJ=4max(3ρ2Δδ3,4Δρδϵ)+4j\leq J=4\left\lfloor\max\left(\frac{3\rho^{2}\Delta}{\delta^{3}},\frac{4\Delta\rho}{\delta\epsilon}\right)\right\rfloor+4, and kK0=(4max(3ρ2Δδ3,4Δρδϵ)+4)Ln0δρϵk\leq K_{0}=\left(4\left\lfloor\max\left(\frac{3\rho^{2}\Delta}{\delta^{3}},\frac{4\Delta\rho}{\delta\epsilon}\right)\right\rfloor+4\right)\frac{Ln_{0}\delta}{\rho\epsilon} satisfying

The dominate term in the gradient cost of Neon+ Spider-SFO is the so-called coupling term in the regime of interest: ϵ2δ3\epsilon^{-2}\delta^{-3} for the on-line case and n1/2ϵ1δ3n^{1/2}\epsilon^{-1}\delta^{-3} for the finite-sum case, separately. Due to this term, most convergence rate results in concurrent works for the on-line case such as Reddi et al., (2018); Tripuraneni et al., (2018); Xu et al., (2017); Allen-Zhu & Li, (2018); Zhou et al., 2018a have gradient costs that cannot break the O(ϵ3.5)\mathcal{O}(\epsilon^{-3.5}) barrier when δ\delta is chosen to be O(ϵ0.5)\mathcal{O}(\epsilon^{0.5}). Observe that we always need to run a new Spider-SFO which at least costs \mathcal{O}\big{(}\min(\epsilon^{-2},n)\big{)} stochastic gradient accesses.

Our analysis sharpens the seemingly non-improvable coupling term by modifying the single large Neon step to many mini-steps. Such modification enables us to maintain the Spider estimates and obtain a coupling term O(min(n,ϵ2)δ2)\mathcal{O}\left(\min(n,\epsilon^{-2})\delta^{-2}\right) of Spider-SFO+, which improves upon the Neon coupling term O(min(n,ϵ2)δ3)\mathcal{O}\left(\min(n,\epsilon^{-2})\delta^{-3}\right) by a factor of δ\delta.

For the finite-sum case, Spider-SFO+ enjoys a convergence rate that is faster than existing methods only in the regime n=Ω(ϵ1)n=\Omega(\epsilon^{-1}) [Table 1]. For the case of n=O(ϵ1)n=\mathcal{O}(\epsilon^{-1}), using Spider to track the gradient in the Neon procedure can be more costly than applying appropriate acceleration techniques (Agarwal et al.,, 2017; Carmon et al.,, 2016).Spider-SFO+ enjoys a faster rate than Neon+Spider-SFO where computing the “full” gradient dominates the gradient cost, namely δ=O(1)\delta=\mathcal{O}(1) in the on-line case and δ=O(n1/2ϵ)\delta=\mathcal{O}(n^{1/2}\epsilon) for the finite-sum case. Beacause it is well-known that momentum technique (Nesterov,, 1983) provably ensures faster convergence rates when nn is sufficient small (Shalev-Shwartz & Zhang,, 2016). One can also apply momentum technique to solve the sub-problem in Step 1 and 3 like Carmon et al., (2016); Allen-Zhu & Li, (2018) when nO(ϵ1)n\leq\mathcal{O}(\epsilon^{-1}), and thus can achieve the state-of-the-art gradient cost of

4 Comparison with Concurrent Works

This subsection compares our Spider algorithms with concurrent works. In special, we detail our main result for applying Spider to first-order methods in the list below:

For the problem of finding an ϵ\epsilon-approximate first-order stationary point, under Assumption 1 our results indicate a gradient cost of O(min(ϵ3,n1/2ϵ2))\mathcal{O}(\min(\epsilon^{-3},n^{1/2}\epsilon^{-2})) which supersedes the best-known convergence rate results for stochastic optimization problem (1.2) [Theorems 1 and 2]. Before this work, the best-known result is O(min(ϵ3.333,n2/3ϵ2))\mathcal{O}\left(\min(\epsilon^{-3.333},n^{2/3}\epsilon^{-2})\right), achieved by Allen-Zhu & Hazan, (2016); Reddi et al., (2016) in the finite-sum case and Lei et al., (2017) in the on-line case, separately. Moreover, such a gradient cost achieves the algorithmic lower bound for the finite-sum setting [Theorem 3].

We summarize the comparison with concurrent works that solve (1.2) under similar assumptions in Table 1. In addition, we provide Figure 1 which draws the gradient cost against the magnitude of nn for both an approximate stationary point.One of the results not included in this table is Carmon et al., 2017a , which finds an ϵ\epsilon-approximate first-order stationary point in O(nϵ1.75)\mathcal{O}(n\epsilon^{-1.75}) gradient evaluations. However, their result relies on a more stringent Hessian-Lipschitz condition, in which case a second-order stationary point can be found in similar gradient cost (Jin et al., 2017b, ). For simplicity, we leave out the complexities of the algorithms that has Hessian-vector product access and only record algorithms that use stochastic gradients only.Due to the Neon method (Xu et al.,, 2017; Allen-Zhu & Li,, 2018), nearly all existing Hessian-vector product based algorithms in stochastic optimization can be converted to ones that use stochastic gradients only. Specifically, the yellow-boxed complexity O(nϵ1.5+n3/4ϵ1.75)\mathcal{O}(n\epsilon^{-1.5}+n^{3/4}\epsilon^{-1.75}) in Table 1, which was achieved by Neon+FastCubic/CDHS (Allen-Zhu & Li,, 2018; Jin et al., 2017b, ) for finding an approximate second-order stationary point in the finite-sum case using momentum technique, are the only results that have not been outperformed by our Spider-SFO+ algorithm in certain parameter regimes (nO(ϵ1)n\leq\mathcal{O}(\epsilon^{-1}) in this case).

SPIDER for Stochastic Zeroth-Order Method

For SZO algorithms, (2.3) can be solved only from the Incremental Zeroth-Order Oracle (IZO)(Nesterov & Spokoiny,, 2011), which is defined as:

We use Assumption 2 (including (ii’)) for convergence analysis which is standard for SZO(Nesterov & Spokoiny,, 2011; Ghadimi & Lan,, 2013) algorithms. Because the true gradient are not allowed to obtain for SZO. Most works (Nesterov & Spokoiny,, 2011; Ghadimi & Lan,, 2013; Shamir,, 2017) use the gradient of a smoothed version of the objective function through a two-point feedback in a stochastic setting. Following (Nesterov & Spokoiny,, 2011), we consider the typical Gaussian distribution in the convolution to smooth the function. Define

From the (1), suppose uN(0,Id)\mathbf{u}\sim N(\mathbf{0},\mathbf{I}_{d}), and i[n]i\in[n], with u\mathbf{u} and ii being independent, we have

For non-convex case, the best known result is O(dϵ4)\mathcal{O}(d\epsilon^{-4}) from Ghadimi & Lan, (2013). We has not found a work that applying Variance Reduction technique to significantly reduce the complexity of IZO. This might because that even in finite-sum case, the full gradient is not available (with noise). In this paper, we give a stronger results by Spider technique, directly reducing the IZO from O(dϵ4)\mathcal{O}(d\epsilon^{-4}) to O(min(dn1/2ϵ2,dϵ3))\mathcal{O}(\min(dn^{1/2}\epsilon^{-2},d\epsilon^{-3})).

From (4.6), we can integrate the two-point feed-back to track f^(x)\nabla\hat{f}(\mathbf{x}). The algorithm is shown in Algorithm 3. Then the following lemma shows that vk\mathbf{v}^{k} is a high accurate estimator of f^(xk)\|\nabla\hat{f}(\mathbf{x}^{k})\|:

Under the Assumption 2, suppose ii is random number of the function index, (i[n]i\in[n]) and u\mathbf{u} is a standard Gaussian random vector, i.e. uN(0,Id)\mathbf{u}\sim N(\mathbf{0},\mathbf{I}_{d}), we have

From (4.3), by setting a smaller μ\mu, the smoothed gradient f^(x)\nabla\hat{f}(\mathbf{x}) approximates f(x)\nabla f(\mathbf{x}), which ensures sufficient function descent in each iteration. For simpleness, we only give expectation result, shown in Theorem 8.

Under the Assumption 2 (including (ii’)). For infinite case, set μ=min(ϵ26Ld,ϵ6n0L(d+6)3/2)\mu=\min\left(\frac{\epsilon}{2\sqrt{6}L\sqrt{d}},\frac{\epsilon}{\sqrt{6}n_{0}L(d+6)^{3/2}}\right), S1=96dσ2ϵ2S_{1}=\frac{96d\sigma^{2}}{\epsilon^{2}}, S2=30(2d+9)σϵn0S_{2}=\frac{30(2d+9)\sigma}{\epsilon n_{0}}, q=5n0σϵq=\frac{5n_{0}\sigma}{\epsilon}, where n0[1,30(2d+9)σϵ]n_{0}\in[1,\frac{30(2d+9)\sigma}{\epsilon}]. In the finite-sum case, set the parameters S2=(2d+9)n1/2n0S_{2}=\frac{(2d+9)n^{1/2}}{n_{0}}, and q=n0n1/26q=\frac{n_{0}n^{1/2}}{6}, let S1/d=[n]S_{1}/d=[n], i.e. vjk=f(xk+μej)f(xk)/μv^{k}_{j}=f(\mathbf{x}^{k}+\mu\mathbf{e}_{j})-f(\mathbf{x}^{k})/\mu with j[d]j\in[d], where n0[1,n1/26]n_{0}\in[1,\frac{n^{1/2}}{6}]. Then with ηk=min(12Ln0,ϵLn0vk)\eta^{k}=\min(\frac{1}{2Ln_{0}},\frac{\epsilon}{Ln_{0}\|\mathbf{v}^{k}\|}), K=(4LΔn0)ϵ2+1K=\left\lfloor(4L\Delta n_{0})\epsilon^{-2}\right\rfloor+1, for Algorithm 3 we have

The IZO calls are O(dmin(n1/2ϵ2,ϵ3))\mathcal{O}\left(d\min(n^{1/2}\epsilon^{-2},\epsilon^{-3})\right).

Summary and Future Directions

The authors would like to thank NIPS Reviewer 1 to point out a mistake in the original proof of Theorem 1 and thank Zeyuan Allen-Zhu and Quanquan Gu for relevant discussions and pointing out references Zhou et al., 2018b ; Zhou et al., 2018a , also Jianqiao Wangni for pointing out references Nguyen et al., 2017a ; Nguyen et al., 2017b , and Zebang Shen, Ruoyu Sun, Haishan Ye, Pan Zhou for very helpful discussions and comments. Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (grant nos. 61625301 and 61731018), and Microsoft Research Asia.

References

Appendix A Vector-Martingale Concentration Inequality

We apply a result by Pinelis, (1994) and conclude Proposition 2 which is an Azuma-Hoeffding-type concentration inequality. See also Kallenberg & Sztencel, (1991), Lemma 4.4 in Zhang, (2005) or Theorem 2.1 in Zhang, (2005) and the references therein.

where λ\lambda is an arbitrary real positive number.

A.1 Proof of Proposition 1

is a martingale, and hence (2.2) follows from the property of L2L^{2} martingales (Durrett,, 2010). ∎

A.2 Proof of Lemma 1

where in =a\overset{a}{=} and b\overset{b}{\leq}, we use Eq (2.6), and SS_{*} are random sampled from [n][n] with replacement. Combining (A.2) and (A.3), we have

Telescoping the above display for k=k1,,0k^{\prime}=k-1,\dots,0 and using the iterated law of expectation, we have

Appendix B Deferred Proofs

From Line 14 of Algorithm 1 we have for all k0k\geq 0,

Applying Lemma 1 with ϵ1=ϵ/(Ln0)\epsilon_{1}=\epsilon/(Ln_{0}), S2=2σ/(ϵn0)S_{2}=2\sigma/(\epsilon n_{0}), K=kk0q=σn0/ϵK=k-k_{0}\leq q=\sigma n_{0}/\epsilon, we have

B.2 Proof of Expectation Results for FSP

The rest of this section devotes to the proofs of Theorems 1, 2. To prepare for them, we first conclude via standard analysis the following

Under the Assumption 1, setting k0=k/qqk_{0}=\lfloor k/q\rfloor\cdot q, we have

So f(x)f(\mathbf{x}) has LL-Lipschitz continuous gradient, then

where in a\overset{a}{\leq}, we applied Cauchy-Schwarz inequality. Since ηk=min(ϵLn0vk,12Ln0)12Ln012L\eta^{k}=\min\left(\frac{\epsilon}{Ln_{0}\|\mathbf{v}^{k}\|},\frac{1}{2Ln_{0}}\right)\leq\frac{1}{2Ln_{0}}\leq\frac{1}{2L}, we have

where in a\overset{a}{\geq}, we use V(x)=min(x,x22)x2V(x)=\min\left(|x|,\frac{x^{2}}{2}\right)\geq|x|-2 for all xx. Hence

Taking expectation on the above display and using Lemma 2, we have

The proof is done via the following lemma:

Under Assumption 1, for all k0k\geq 0, we have

By taking the total expectation in Lemma 2, we have

Taking full expectation on Lemma 4, and telescoping the results from k=0k=0 to K1K-1, we have

Diving 4Ln0ϵK\frac{4Ln_{0}}{\epsilon}K both sides of (B.13), and using K=4LΔn0ϵ2+14LΔn0ϵ2K=\lfloor\frac{4L\Delta n_{0}}{\epsilon^{2}}\rfloor+1\geq\frac{4L\Delta n_{0}}{\epsilon^{2}}, we have

To compute the gradient cost, note in each qq iterations we access for one time S1S_{1} stochastic gradients and for qq times of S2S_{2} stochastic gradients, and hence the cost is

This concludes a gradient cost of 16LΔσϵ3+2σ2ϵ2+4σn01ϵ116L\Delta\sigma\epsilon^{-3}+2\sigma^{2}\epsilon^{-2}+4\sigma n_{0}^{-1}\epsilon^{-1}. ∎

With the above display, applying Lemma 1 with ϵ1=ϵLn0\epsilon_{1}=\frac{\epsilon}{Ln_{0}}, and S2=n1/2ϵn0S_{2}=\frac{n^{1/2}}{\epsilon n_{0}}, K=kk0q=n0n1/2K=k-k_{0}\leq q=n_{0}n^{1/2}, we have

So Lemma 2 holds. Then from the same technique of on-line case, we can obtain (B.2) and (5), and (B.15). The gradient cost analysis is computed as:

This concludes a gradient cost of n+8(LΔ)n1/2ϵ2+2n01n1/2n+8(L\Delta)\cdot n^{1/2}\epsilon^{-2}+2n^{-1}_{0}n^{1/2}. ∎

B.3 Proof of High Probability Results for FSP

Set K\mathcal{K} be the time when Algorithm 1 stops. We have K=0\mathcal{K}=0 if v0<2ϵ\|\mathbf{v}^{0}\|<2\epsilon, and K=inf{k0:vk<2ϵ}+1\mathcal{K}=\inf\{k\geq 0:\|\mathbf{v}^{k}\|<2\epsilon\}+1 if v02ϵ\|\mathbf{v}^{0}\|\geq 2\epsilon. It is a random stopping time. Let K0=4LΔn0ϵ2+2K_{0}=\lfloor 4L\Delta n_{0}\epsilon^{-2}\rfloor+2. We have the following lemma:

Set the parameters S1S_{1}, S2S_{2}, η\eta, and qq as in Theorem 4. Then under the Assumption 2, for fixed K0K_{0}, define the event:

we have HK0\bm{\mathcal{H}}_{K_{0}} occurs with probability at least 1p1-p.

Because when kKk\geq\mathcal{K}, the algorithm has already stopped. So if KkK0\mathcal{K}\leq k\leq K_{0}, we can define a virtual update as xk+1=xk\mathbf{x}^{k+1}=\mathbf{x}^{k}, and vk\mathbf{v}^{k} is still generated by Line 3 and Line 5 in Algorithm 1.

Let ξk\bm{\xi}^{k} with k0k\geq 0 denote the randomness in maintaining Spider vk\mathbf{v}^{k} at iteration kk. And Fk=σ{ξ0,ξk}{\bm{\mathcal{F}}}^{k}=\sigma\{\bm{\xi}^{0},\cdots\bm{\xi}^{k}\}, where σ{}\sigma\{\cdot\} denotes the sigma field. We know that xk\mathbf{x}^{k} and vk1\mathbf{v}^{k-1} are measurable on Fk1{\bm{\mathcal{F}}}^{k-1}.

Then given Fk1{\bm{\mathcal{F}}}^{k-1}, if k=k/qqk=\lfloor k/q\rfloor q, we set

where ii is the index with S1(i)\mathcal{S}_{1}(i) denoting the ii-th random component function selected at iteration kk and 1iS11\leq i\leq S_{1}. We have

When kk/qqk\neq\lfloor k/q\rfloor q, set k0=k/qqk_{0}=\lfloor k/q\rfloor q, and

where ii is the index with S2(i)\mathcal{S}_{2}(i) denoting the ii-th random component function selected at iteration kk, 1iS21\leq i\leq S_{2} and k0jkk_{0}\leq j\leq k. We have

For any x\mathbf{x} and y\mathbf{y}, we have

So f(x)f(\mathbf{x}) also have LL-Lipschitz continuous gradient.

Then from the update rule if k<Kk<\mathcal{K}, we have xk+1xk=η(vk/vk)=η=ϵLn0\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|=\|\eta(\mathbf{v}^{k}/\|\mathbf{v}^{k}\|)\|=\eta=\frac{\epsilon}{Ln_{0}}, if kKk\geq\mathcal{K}, we have xk+1xk=0ϵLn0\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|=0\leq\frac{\epsilon}{Ln_{0}}. We have

for all k0<jkk_{0}<j\leq k and 1iS21\leq i\leq S_{2}. On the other hand, we have

Plugging (B.22) and (B.23) together, and using Proposition 2, we have

Under Assumption 2, we have that on HK0(K>K0)\bm{\mathcal{H}}_{K_{0}}\cap(\mathcal{K}>K_{0}), for all 0kK00\leq k\leq K_{0},

Let ηk:=η/vk.\eta^{k}:=\eta/\|\mathbf{v}^{k}\|. Since ff has LL-Lipschitz continuous gradient from (B.21), we have

Because we are on the event HK0(K>K0)\bm{\mathcal{H}}_{K_{0}}\cap(\mathcal{K}>K_{0}), so K1K0\mathcal{K}-1\geq K_{0}, then for all 0kK00\leq k\leq K_{0}, we have vk2ϵ\|\mathbf{v}^{k}\|\geq 2\epsilon, thus

and for HK0\bm{\mathcal{H}}_{K_{0}} happens, we also have

By telescoping (B.28) from to K0K_{0}, we have

Then gradient cost can be bounded by the same way in Theorem 2 as:

When k=k/qqk=\lfloor k/q\rfloor q, we have vk=f(xk)\mathbf{v}^{k}=\nabla f(\mathbf{x}^{k}).

When kk/qqk\neq\lfloor k/q\rfloor q, set k0=k/qqk_{0}=\lfloor k/q\rfloor q, and

where ii is the index with S2(i)\mathcal{S}_{2}(i) denoting the ii-th random component function selected at iteration kk, from (B.22), we have

for all k0<jkk_{0}<j\leq k and 1iS21\leq i\leq S_{2}. On the other hand

B.4 Proof of Theorem 6 for SSP

We first restate the Neon result in Allen-Zhu & Li, (2018) for NC-search in the following Theorem:

Under the Assumption 2 (including (ii’)), for every point x0Rd\mathbf{x}_{0}\in\mathcal{R}^{d}, for every δ(0,L]\delta\in(0,L], and every p(0,1)p\in(0,1), the Neon2 (NC search) output

satisfies that, with probability at least 1p1-p:

if w=\mathbf{w}=\bot, then f2(x0)δI\nabla f^{2}(\mathbf{x}_{0})\succeq-\delta I.

Moreover, the total number of stochastic gradient evaluations are O(log2((d/p))L2δ2)O\left(\log^{2}((d/p))L^{2}\delta^{-2}\right).

One can refer to Allen-Zhu & Li, (2018) for more details.

From Algorithm 2, we can find that all the randomness in iteration kk come from 33 parts: 1) maintaining Spider vk\mathbf{v}^{k} (Line 7-11 and 17-21); 2) to conducting NC-search in Line 2 (if mod(k,K)=0\text{mod}(k,\mathscr{K})=0); 3) choosing a random direction to update xk\mathbf{x}^{k} in Line 5 (if Algorithm 2 performs first-order updates). We denote the randomness from the three parts as ξk1\bm{\xi}^{1}_{k}, ξk2\bm{\xi}^{2}_{k}, ξk3\bm{\xi}^{3}_{k}, respectively. Let Fk{\bm{\mathcal{F}}}^{k} be the filtration involving the full information of x0:k,v0:k\mathbf{x}_{0:k},\mathbf{v}_{0:k}, i.e Fk=σ{ξ0:k1,ξ0:k2,ξ0:k13}{\bm{\mathcal{F}}}^{k}=\sigma\left\{\bm{\xi}^{1}_{0:k},\bm{\xi}^{2}_{0:k},\bm{\xi}^{3}_{0:k-1}\right\}. So the randomness in iteration kk given Fk{\bm{\mathcal{F}}}^{k} only comes from ξk3\bm{\xi}^{3}_{k} (choosing a random direction in Line 5).

we have H1kFk\bm{\mathcal{H}}_{1}^{k}\in{\bm{\mathcal{F}}}^{k}, and H11H12H1k\bm{\mathcal{H}}_{1}^{1}\supseteq\bm{\mathcal{H}}_{1}^{2}\supseteq\cdots\supseteq\bm{\mathcal{H}}_{1}^{k}. Let H2Kj\bm{\mathcal{H}}^{\mathscr{K}j}_{2} denotes the event that the NC-search in iteration Kj\mathscr{K}j runs successfully. And H3k\bm{\mathcal{H}}^{k}_{3} denotes the event that

With the setting of Theorem 6, and under the Assumption 3, we have

when mod(k,p)=0(k,p)=0. For mod(k,p)0(k,p)\neq 0, we have

Because xk\mathbf{x}^{k} is generated by one of the three ways:

Algorithm 2 performs First-order descent, we have xkxk1=η(vk1/vk1)=η=ϵLn0\|\mathbf{x}^{k}-\mathbf{x}^{k-1}\|=\|\eta(\mathbf{v}^{k-1}/\|\mathbf{v}^{k-1}\|)\|=\eta=\frac{\epsilon}{Ln_{0}}.

Algorithm 2 performs Second-order descent, we have xkxk1=η=ϵLn0\|\mathbf{x}^{k}-\mathbf{x}^{k-1}\|=\eta=\frac{\epsilon}{Ln_{0}}.

Algorithm 2 has already stopped. xkxk1=0ϵLn0\|\mathbf{x}^{k}-\mathbf{x}^{k-1}\|=0\leq\frac{\epsilon}{Ln_{0}}.

Let H4k=H1kH3k\bm{\mathcal{H}}_{4}^{k}=\bm{\mathcal{H}}_{1}^{k}\cap\bm{\mathcal{H}}_{3}^{k}. We show that Theorem 6 is essentially to measure the probability of the event that (H1K0)cH3K0\left(\bm{\mathcal{H}}^{K_{0}}_{1}\right)^{c}\bigcap\bm{\mathcal{H}}^{K_{0}}_{3}.

If (H1K0)cH3K0\left(\bm{\mathcal{H}}^{K_{0}}_{1}\right)^{c}\bigcap\bm{\mathcal{H}}^{K_{0}}_{3} happens, Algorithm 2 outputs an xk\mathbf{x}^{k} satisfying (3.12) before K0K_{0} iterations.

So f()f(\cdot) has ρ\rho-Lipschitz Hessian. We have

Thus λmin(f2(xk))3δI\lambda_{\min}(\nabla f^{2}(\mathbf{x}^{k}))\geq-3\delta I. ∎

For all iteration K\mathcal{K} with mod(K,K)=0\text{mod}(\mathcal{K},\mathscr{K})=0, given FK{\bm{\mathcal{F}}}^{\mathcal{K}}, we consider the case when IK=2\mathcal{I}_{\mathcal{K}}=2 and H4K\bm{\mathcal{H}}_{4}^{\mathcal{K}} happens. Because f()f(\cdot) has ρ\rho-Lipschitz Hessian, we have

On the other hand, given FK{\bm{\mathcal{F}}}^{\mathcal{K}}, we consider the case when IK=1\mathcal{I}_{\mathcal{K}}=1 and H4K\bm{\mathcal{H}}_{4}^{\mathcal{K}} happens, then for any kk satisfying Kk<K+K\mathcal{K}\leq k<\mathcal{K}+\mathscr{K}, we know Ik=1\mathcal{I}_{k}=1.

Given Fk{\bm{\mathcal{F}}}^{k} with Kk<K+K\mathcal{K}\leq k<\mathcal{K}+\mathscr{K}, then from (B.26) we have

Taking expectation up to FK{\bm{\mathcal{F}}}^{\mathcal{K}}, we have

By telescoping (B.42) with kk from K\mathcal{K} to K+K1\mathcal{K}+\mathscr{K}-1, we have

From Lemma 9, with probability 1/21/2, the algorithm shall be terminated before KJ\mathscr{K}J iterations, and output a xk\mathbf{x}^{k} satisfying (3.12).

The total stochastic gradient complexity consists of two parts: the Spider maintenance cost and NC-Search cost. We estimate them as follows:

With probability 1/21/2, the algorithm ends in at most K0K_{0} iterations, thus the number of stochastic gradient accesses to maintain Spider can be bounded by

By summing (B.49) and (B.50), using maxa,ba+b\max\lfloor a,b\rfloor\leq a+b with a0a\geq 0 and b0b\geq 0, we have that the total stochastic gradient complexity can be bounded:

For the on-line case, plugging into K=δLn0ρϵ\mathscr{K}=\frac{\delta Ln_{0}}{\rho\epsilon}, S1=2σϵ2S_{1}=\frac{2\sigma}{\epsilon^{2}}, and S2=2σn0ϵS_{2}=\frac{2\sigma}{n_{0}\epsilon}, the stochastic gradient complexity can be bounded:

For the off-line case, plugging into S2=n1/2n0S_{2}=\frac{n^{1/2}}{n_{0}}, we obtain the stochastic gradient complexity is bounded by:

B.5 Proof for SZO

because fif_{i} has LL-Lipschitz continuous gradient ((6) in (Nesterov & Spokoiny,, 2011)); b\overset{b}{\leq}, we use

Under the Assumption 2 (including (ii’)), if k/qq=k\lfloor k/q\rfloor q=k, given xk\mathbf{x}^{k}, we have

where in a\overset{a}{\leq}, we use a1+a2++as2sa12+sa22++sas2|a_{1}+a_{2}+\cdots+a_{s}|^{2}\leq s|a_{1}|^{2}+s|a_{2}|^{2}+\cdots+s|a_{s}|^{2}.

For the first term in the right hand of (B.53), because fi(x)f_{i}(\mathbf{x}) has LL-Lipschitz continuous gradient, we have

For the on-line case, due to μϵ26Ld\mu\leq\frac{\epsilon}{2\sqrt{6}L\sqrt{d}}, and S1=96dσ2ϵ2S_{1}=\frac{96d\sigma^{2}}{\epsilon^{2}}, we have

Also from (4.3), and μϵ6n0L(d+6)3/2\mu\leq\frac{\epsilon}{\sqrt{6}n_{0}L(d+6)^{3/2}}, we have

For k=k0k=k_{0}, from Lemma 10, we obtain the result. When kk0k\geq k_{0},

Using Proposition 1, for on-line case, we have

By taking full expectation on Lemma 11, we have

By using Lemma 4, (B.13), and (B.14), we have

One the other hand, by Jensen’s inequality, we have

B.6 Proof of Theorem 3 for Lower Bound

where xix_{i} denote the value of ii-th coordinate of x\mathbf{x}, with i[d]i\in[d]. f^K(x)\hat{f}_{K}(\mathbf{x}) constructed by Carmon et al., 2017b is a zero-chain function, that is for every i[d]i\in[d], if(x)=0\nabla_{i}f(\mathbf{x})=0 whenever xi1=xi=xi+1x_{i-1}=x_{i}=x_{i+1}. So any deterministic algorithm can only recover “one” dimension in each iteration (Carmon et al., 2017b, ). In addition, it satisfies that : If xi1|x_{i}|\leq 1 for any iKi\leq K,

Then to handle random algorithms, Carmon et al., 2017b further consider the following extensions:

The above properties found by Carmon et al., 2017b is very technical. One can refer to Carmon et al., 2017b for more details.

Our lower bound theorem proof is as follows. The proof mirrors Theorem 2 in Carmon et al., 2017b by further taking the number of individual function nn into account. Set

where we use b=lϵLb=\frac{l\epsilon}{L}. Summing i=1,,ni=1,\ldots,n and using each Ci\mathbf{C}_{i} are orthogonal matrix, we have

From the above analysis, for any algorithm A\mathcal{A}, after running T=nK2=ΔLn1/224lϵ2T=\frac{nK}{2}=\frac{\Delta Ln^{1/2}}{24l\epsilon^{2}} iterations, at least n2\frac{n}{2} functions cannot be solved (the worst case is when A\mathcal{A} exactly solves n2\frac{n}{2} functions), so

where in =a\overset{a}{=}, we use CiCj=0d/n\mathbf{C}_{i}^{\top}\mathbf{C}_{j}=\mathbf{0}_{d/n}, when iji\neq j, and CiCi=Id/n\mathbf{C}_{i}^{\top}\mathbf{C}_{i}=I_{d/n}. ∎

Let {x}0:T\{\mathbf{x}\}_{0:T} with T=nK2T=\frac{nK}{2} is informed by a certain algorithm in the form (3.8). Then when d2max(9n3K2,12n3KR2)log(2n2K2p)+n2Kd\geq 2\max(9n^{3}K^{2},12n^{3}KR^{2})\log(\frac{2n^{2}K^{2}}{p})+n^{2}K, with probability 1p1-p, at each iteration 0tT0\leq t\leq T, xt\mathbf{x}^{t} can only recover one coordinate.

The proof is essentially same to Carmon et al., 2017b and Woodworth & Srebro, (2017). We give a proof here. Before the poof, we give the following definitions:

Let iti^{t} denotes that at iteration tt, the algorithm choses the iti^{t}-th individual function.

Let IitI^{t}_{i} denotes the total times that individual function with index ii has been called before iteration kk. We have Ii0=0I^{0}_{i}=0 with i[n]i\in[n], iiti\neq i^{t}, and Ii00=1I^{0}_{i^{0}}=1. And for t1t\geq 1,

Set Vit\bm{\mathcal{V}}^{t}_{i} be the set that (i=1n{bi,1,bi,min(K,Iit)}){yi0,yi1,,yit}\left(\bigcup_{i=1}^{n}\left\{\mathbf{b}_{i,1},\cdots\mathbf{b}_{i,\min(K,I^{t}_{i})}\right\}\right)\bigcup\left\{\mathbf{y}^{0}_{i},\mathbf{y}^{1}_{i},\cdots,\mathbf{y}^{t}_{i}\right\}, where bi,j\mathbf{b}_{i,j} denotes the jj-th column of BiK\mathbf{B}^{K}_{i}.

Let PitR(d/n)×(d/n)\bm{\mathcal{P}}_{i}^{t}\in\mathcal{R}^{(d/n)\times(d/n)} denote the projection operator to the span of uVit\mathbf{u}\in\bm{\mathcal{V}}_{i}^{t}. And let Pit\bm{\mathcal{P}}^{t\bot}_{i} denote its orthogonal complement.

Because At\mathcal{A}^{t} performs measurable mapping, the above terms are all measurable on ξ\bm{\xi} and BnK\mathbf{B}^{nK}, where ξ\bm{\xi} is the random vector in A\mathcal{A}. It is clear that if for all 0tT0\leq t\leq T and i[n]i\in[n], we have

then at each iteration, we can only recover one index, which is our destination. To prove that (B.84) holds with probability at least 1p1-p, we consider a more hard event Gt\bm{\mathcal{G}}^{t} as

with a=min(13(T+1),12(1+3T)R)a=\min\left(\frac{1}{3(T+1)},\frac{1}{2(1+\sqrt{3T})R}\right). And Gt=j=0tGjG^{\leq t}=\bigcap_{j=0}^{t}\bm{\mathcal{G}}^{j}.

We first show that if GT\bm{\mathcal{G}}^{\leq T} happens, then (B.84) holds for all 0tT0\leq t\leq T. For 0tT0\leq t\leq T, and i[n]i\in[n], if Uit=\bm{\mathcal{U}}^{t}_{i}=\varnothing, (B.84) is right; otherwise for any uUit\mathbf{u}\in\bm{\mathcal{U}}^{t}_{i}, we have

where in the last inequality, we use Pi(t1)yityi(t1)R\|\bm{\mathcal{P}}^{(t-1)\bot}_{i}\mathbf{y}^{t}_{i}\|\leq\|\mathbf{y}^{(t-1)}_{i}\|\leq R.

If t=0t=0, we have Pit1=0d/n×d/n\bm{\mathcal{P}}_{i}^{t-1}=\mathbf{0}_{d/n\times d/n}, then Pit1u=0\left\|\bm{\mathcal{P}}_{i}^{t-1}\mathbf{u}\right\|=0, so (B.84) holds. When t1t\geq 1, suppose at t1t-1, Gt\bm{\mathcal{G}}^{\leq t} happens then (B.84) holds for all to t1t-1. Then we need to prove that Pit1ub=3Ta\|\bm{\mathcal{P}}^{t-1}_{i}\mathbf{u}\|\leq b=\sqrt{3T}a with uUit\mathbf{u}\in\bm{\mathcal{U}}^{t}_{i} and i[n]i\in[n]. Instead, we prove a stronger results: Pit1ub=3Ta\|\bm{\mathcal{P}}^{t-1}_{i}\mathbf{u}\|\leq b=\sqrt{3T}a with all uUt\mathbf{u}\in\bm{\mathcal{U}}^{t} and i[n]i\in[n]. Again, When t=0t=0, we have Pit1u=0\|\bm{\mathcal{P}}^{t-1}_{i}\mathbf{u}\|=0, so it is right, when t1t\geq 1, by Graham-Schmidt procedure on yi0,bi0,min(Ii00,K),,yit1,bit1,min(Iit1t1,K)\mathbf{y}^{0}_{i},\mathbf{b}_{i_{0},\min(I^{0}_{i^{0}},K)},\cdots,\mathbf{y}^{t-1}_{i},\mathbf{b}_{i_{t-1},\min(I^{t-1}_{i^{t-1}},K)}, we have

Using biz,Iizzu\mathbf{b}_{i_{z},I^{z}_{i^{z}}}\bot\mathbf{u} for all uUt\mathbf{u}\in\bm{\mathcal{U}}^{t}, we have

For the first term in the right hand of (B.6), by induction, we have

For the second term in the right hand of (B.6), by assumption (B.85), we have

Substituting (B.6) and (B.6) into (B.87), for all uUt\mathbf{u}\in\bm{\mathcal{U}}^{t}, we have

Thus for (B.86), t1t\geq 1, because uUitUt\mathbf{u}\in\bm{\mathcal{U}}^{t}_{i}\subseteq\bm{\mathcal{U}}^{t}, we have

Denote i^t\hat{i}^{t} be the sequence of i0:t1i_{0:t-1}. Let S^t\hat{{\mathcal{S}}}^{t} be the set that contains all possible ways of i^t\hat{i}^{t} (S^tnt|\hat{{\mathcal{S}}}^{t}|\leq n^{t}).