A Geometric Analysis of Phase Retrieval

Ju Sun, Qing Qu, John Wright

Introduction

This problem has attracted substantial recent interest, due to its connections to fields such as crystallography, optical imaging and astronomy. In these areas, one often has access only to the Fourier magnitudes of a complex signal x\mathbf{x}, i.e., F(x)\left|\mathcal{F}(\mathbf{x})\right| [Mil90, Rob93, Wal63, DF87]. The phase information is hard or infeasible to record due to physical constraints. The problem of recovering the signal x\mathbf{x} from its Fourier magnitudes F(x)\left|\mathcal{F}(\mathbf{x})\right| is naturally termed (Fourier) phase retrieval (PR). It is easy to see PR as a special version of GPR, with the ak\mathbf{a}_{k}’s the Fourier basis vectors. GPR also sees applications in electron microscopy [MIJ+02], diffraction and array imaging [BDP+07, CMP11], acoustics [BCE06, Bal10], quantum mechanics [Cor06, Rei65] and quantum information [HMW13]. We refer the reader to survey papers [SEC+15, JEH15] for accounts of recent developments in the theory, algorithms, and applications of GPR.

For GPR, heuristic methods based on nonconvex optimization often work surprisingly well in practice (e.g., [Fie82, GS72], and many more cited in [SEC+15, JEH15]). However, investigation into provable recovery methods, particularly based on nonconvex optimization, has started only relatively recently [NJS13, CESV13, CSV13, CL14, CLS15a, WdM15, VX14, ABFM14, CLS15b, CC15, WWS15, ZCL16, ZL16, WGE16, KÖ16, GX16, BE16, Wal16]. The surprising effectiveness of nonconvex heuristics on GPR remains largely mysterious. In this paper, we take a step towards bridging this gap.

We focus on a natural least-squares formulationAnother least-squares formulation, minimizez  12mk=1m(ykakz)2\operatorname{minimize}_{\mathbf{z}}\;\frac{1}{2m}\sum_{k=1}^{m}(y_{k}-\left|\mathbf{a}_{k}^{*}\mathbf{z}\right|)^{2}, was first studied in the seminal works [Fie82, GS72]. An obvious advantage of the f(z)f(\mathbf{z}) studied here is that it is differentiable in the sense of Wirtinger calculus introduced later. – discussed systematically in [SEC+15, JEH15] and first studied theoretically in [CLS15b, WWS15],

We assume the ak\mathbf{a}_{k}’s are independent identically distributed (i.i.d.) complex Gaussian:

f(z)f(\mathbf{z}) is a fourth-order polynomial in z\mathbf{z},Strictly speaking, f(z)f(\mathbf{z}) is not a complex polynomial in z\mathbf{z} over the complex field; complex polynomials are necessarily complex differentiable. However, f(z)f(\mathbf{z}) is a fourth order real polynomial in real and complex parts of z\mathbf{z}. and is nonconvex. A-priori, there is little reason to believe that simple iterative methods can solve this problem without special initialization. Typical local convergence (i.e., convergence to a local minimizer) guarantees in optimization require an initialization near the target minimizer [Ber99]. Moreover, existing results on provable recovery using (1.1) and related formulations rely on careful initialization in the vicinity of the ground truth [NJS13, CLS15b, CC15, WWS15, ZCL16, ZL16, WGE16, KÖ16, GX16, BE16, Wal16].

2 A Curious Experiment

We apply gradient descent to f(z)f(\mathbf{z}), starting from a random initialization z(0)\mathbf{z}^{(0)}:

where the step size μ\mu is fixed for simplicityMathematically, f(z)f(\mathbf{z}) is not complex differentiable; here the gradient is defined based on the Wirtinger calculus [KD09]; see also [CLS15b]. This notion of gradient is a natural choice when optimizing real-valued functions of complex variables.. The result is quite striking (Figure 1): for a fixed problem instance (fixed set of random measurements and fixed target x\mathbf{x}), gradient descent seems to always return a global minimizer (i.e., the target x\mathbf{x} up to a global phase shift), across many independent random initializations! This contrasts with the typical “mental picture” of nonconvex objectives as possessing many spurious local minimizers.

3 A Geometric Analysis

The numerical surprise described above is not completely isolated. Simple heuristic methods have been observed to work surprisingly well for practical PR [Fie82, GS72, SEC+15, JEH15]. In this paper, we take a step towards explaining this phenomenon. We show that although the function (1.1) is nonconvex, when mm is reasonably large, it actually has benign global geometry which allows it to be globally optimized by efficient iterative methods, regardless of the initialization.

Notice that (i) the only local minimizers are exactly ±x\pm\mathbf{x} – they are also global minimizers;Note that the global sign cannot be recovered. (ii) there are saddle points (and a local maximizer), but around them there is a negative curvature in the ±x\pm\mathbf{x} direction. Intuitively, any algorithm that can successfully escape from this kind of saddle point (and local maximizer) can in fact find a global minimizer, i.e., recover the target signal x\mathbf{x}.

Because of this global geometry, a wide range of efficient iterative methods can obtain a global minimizer to f(z)f(\mathbf{z}), regardless of initialization. Examples include the noisy gradient and stochastic gradient methods [GHJY15] (see also [LSJR16, PP16]), curvilinear search [Gol80] and trust-region methods [CGT00, NP06, SQW15b]. The key property that the methods must possess is the ability to escape saddle points at which the Hessian has a strictly negative eigenvalueSuch saddle points are called ridable saddles [SQW15b] or strict saddles [GHJY15]; see [AG16] for computational methods for escaping from higher-order saddles also. .

We corroborate this claim by developing a second-order trust-region algorithm for this problem, and prove that (Theorem 3.10) (i) from any initialization, it efficiently obtains a close approximation (i.e., up to numerical precision) of the target x\mathbf{x} (up to a global phase) and (ii) it exhibits quadratic convergence in the vicinity of the global minimizers.

In sum, our geometrical analysis produces the following result.

Informal Statement of Our Main Results; See Theorem 2.2 and Theorem 3.10 and Remark 3.11. When mCnlog3nm\geq Cn\log^{3}n, with probability at least 1cm11-cm^{-1}, the function f(z)f(\mathbf{z}) has no spurious local minimizers. The only global minimizers are the target x\mathbf{x} and its equivalent copies, and at all saddle points the function has directional negative curvature. Moreover, with at least the same probability, the trust-region method with properly set step size parameter find a global minimizer of f(z)f(\mathbf{z}) in polynomial time, from an arbitrary initialization in the zero-centered complex ball with radius R03(1mk=1myk2)1/2R_{0}\doteq 3(\tfrac{1}{m}\sum_{k=1}^{m}y_{k}^{2})^{1/2}. Here CC and cc are absolute positive constants.

The choice of R0R_{0} above allows us to state a result with a concise bound on the number of iterations required to converge. However, under our probability model, w.h.p., the trust- region method succeeds from any initialization. There are two caveats to this claim. First, one must choose the parameters of the method appropriately. Second, the number of iterations depends on how far away from the truth the method starts.

The second-order trust-region method, albeit polynomial-time, may not be the most practical algorithm for solving GPR. Deriving the most practical algorithms is not the focus of this paper. We mentioned above that any iterative method with saddle-escaping capability can be deployed to solve the nonconvex formulation; our geometrical analysis constitutes a solid basis for developing and analyzing much more practical algorithms for GPR.

4 Prior Arts and Connections

The survey papers [SEC+15, JEH15] provide comprehensive accounts of recent progress on GPR. In this section, we focus on provable efficient (particularly, nonconvex) methods for GPR, and draw connections to other work on provable nonconvex heuristics for practical problems.

Although heuristic methods for GPR have been used effectively in practice [GS72, Fie82, SEC+15, JEH15], only recently have researchers begun to develop methods with provable performance guarantees. The first results of this nature were obtained using semidefinite programming (SDP) relaxations [CESV13, CSV13, CL14, CLS15a, WdM15, VX14]. While this represented a substantial advance in theory, the computational complexity of semidefinite programming limits the practicality of this approach.Another line of research [BCE06, BBCE09, ABFM14] seeks to co-design the measurements and recovery algorithms based on frame- or graph-theoretic tools. While revising this work, new convex relaxations based on second-order cone programming have been proposed [GS16, BR16, HV16b, HV16a].

(Global) Geometric analysis of other nonconvex problems.

The approach taken here is similar in spirit to our recent geometric analysis of a nonconvex formulation for complete dictionary learning [SQW15a]. For that problem, we also identified a similar geometric structure that allows efficient global optimization without special initialization. There, by analyzing the geometry of a nonconvex formulation, we derived a provable efficient algorithm for recovering square invertible dictionaries when the coefficient matrix has a constant fraction of nonzero entries. Previous results required the dictionary matrix to have far fewer nonzero entries. [SQW15b] provides a high-level overview of the common geometric structure that arises in dictionary learning, GPR and several other problems. This approach has also been applied to other problems [GHJY15, BBV16, BVB16, SC16, Kaw16, BNS16, GLM16, PKCS16]. Despite these similarities, GPR raises several novel technical challenges: the objective is heavy-tailed, and minimizing the number of measurements is importantThe same challenge is also faced by [CLS15b, CC15]..

Our work sits amid the recent surge of work on provable nonconvex heuristics for practical problems. Besides GPR studied here, this line of work includes low-rank matrix recovery [KMO10, JNS13, Har14, HW14, NNS+14, JN14, SL14, WCCL15, SRO15, ZL15, TBSR15, CW15], tensor recovery [JO14, AGJ14a, AGJ14b, AJSN15, GHJY15], structured element pursuit [QSW14, HSSS15], dictionary learning [AAJ+13, AGM13, AAN13, ABGM14, AGMM15, SQW15a], mixed regression [YCS13, SA14], blind deconvolution [LWB13, LJ15, LLJB15], super resolution [EW15], phase synchronization [Bou16], numerical linear algebra [JJKN15], and so forth. Most of the methods adopt the strategy of initialization plus local refinement we alluded to above. In contrast, our global geometric analysis allows flexible algorithm design (i.e., separation of geometry and algorithms) and gives some clues as to the behavior of nonconvex heuristics used in practice, which often succeed without clever initialization.

Recovering low-rank positive semidefinite matrices.

5 Notations, Organization, and Reproducible Research

Wirtinger calculus.

Note that above the partial derivatives gz\frac{\partial g}{\partial\mathbf{z}} and gz\frac{\partial g}{\partial\overline{\mathbf{z}}} are row vectors. The Wirtinger gradient and Hessian are defined as

where we sometimes write zg(gz)\nabla_{\mathbf{z}}g\doteq\left(\frac{\partial g}{\partial\mathbf{z}}\right)^{*} and naturally zg(gz)\nabla_{\overline{\mathbf{z}}}g\doteq\left(\frac{\partial g}{\partial\overline{\mathbf{z}}}\right)^{*}. With gradient and Hessian, the second-order Taylor expansion of g(z)g(\mathbf{z}) at a point z0\mathbf{z}_{0} is defined as

For numerical optimization, we are most interested in real-valued gg. A real-valued gg is stationary at a point z\mathbf{z} if and only if

This is equivalent to the condition zg=0\nabla_{\overline{\mathbf{z}}}g=\mathbf{0}, as zg=zg\nabla_{\mathbf{z}}g=\overline{\nabla_{\overline{\mathbf{z}}}g} when gg is real-valued. The curvature of gg at a stationary point z\mathbf{z} is dictated by the Wirtinger Hessian 2g(z)\nabla^{2}g(\mathbf{z}). An important technical point is that the Hessian quadratic form involves left and right multiplication with a 2n2n-dimensional vector consisting of a conjugate pair (δ,δˉ)(\mathbf{\delta},\bar{\mathbf{\delta}}).

Following the above notation, we write zf(z)\nabla_{\mathbf{z}}f(\mathbf{z}) and zf(z)\nabla_{\overline{\mathbf{z}}}f(\mathbf{z}) for denoting the first and second half of f(z)\nabla f(\mathbf{z}), respectively.

Organization.

The remainder of this paper is organized as follows. In Section 2, we provide a quantitative characterization of the global geometry for GPR and highlight main technical challenges in establishing the results. Based on this characterization, in Section 3 we present a modified trust-region method for solving GPR from an arbitrary initialization, which leads to our main computational guarantee. In Section 20 we study the empirical performance of our method for GPR. Section 5 concludes the main body with a discussion of open problems. Section 6 and Section 7 collect the detailed proofs to technical results for the geometric analysis and algorithmic analysis, respectively.

Reproducible research.

The code to reproduce all the figures and the experimental results can be found online:

The Geometry of the Objective Function

z=0\mathbf{z}=\mathbf{0} is a critical point, and the Hessian

Hence, z=0\mathbf{z}=\mathbf{0} is a local maximizer.

In the region {z:0<z2<12x2}\left\{\mathbf{z}:0<\left\|\mathbf{z}\right\|^{2}<\tfrac{1}{2}\left\|\mathbf{x}\right\|^{2}\right\}, we have

So there is no critical point in this region.

Similarly, one can show that in z\mathbf{z} direction there is positive curvature. Hence, every zS\mathbf{z}\in\mathcal{S} is a saddle point.

In the region {z:12x2<z2<x2}\left\{\mathbf{z}:\tfrac{1}{2}\left\|\mathbf{x}\right\|^{2}<\left\|\mathbf{z}\right\|^{2}<\left\|\mathbf{x}\right\|^{2}\right\}, any potential critical point must satisfy

In other words, 2z2x22\left\|\mathbf{z}\right\|^{2}-\left\|\mathbf{x}\right\|^{2} is the positive eigenvalue of the rank-one PSD Hermitian matrix xx\mathbf{x}\mathbf{x}^{*}. Hence 2z2x2=x22\left\|\mathbf{z}\right\|^{2}-\left\|\mathbf{x}\right\|^{2}=\left\|\mathbf{x}\right\|^{2}. This would imply that z=x\left\|\mathbf{z}\right\|=\left\|\mathbf{x}\right\|, which does not occur in this region.

When z2=x2\left\|\mathbf{z}\right\|^{2}=\left\|\mathbf{x}\right\|^{2}, critical points must satisfy

Summarizing the above observations completes the proof.

2 The Finite-Sample Landscape

The result is not surprising in view of the above characterization of the “large-sample” landscape. The intuition is as follows: since the objective function is a sum of independent random variables, when mm is sufficiently large, the function values, gradients and Hessians should be uniformly close to their expectations. Some care is required in making this intuition precise, however. Because the objective function contains fourth powers of Gaussian random variables, it is heavy tailed. Ensuring that ff and its derivatives are uniformly close to their expectations requires mCn2m\geq Cn^{2}. This would be quite wasteful, since x\mathbf{x} has only nn degrees of freedom.

There exist positive absolute constants C,cC,c, such that when mCnlog3nm\geq Cn\log^{3}n, it holds with probability at least 1cm11-cm^{-1} that f(z)f(\mathbf{z}) has no spurious local minimizers and the only local/global minimizers are exactly the target set X\mathcal{X}. More precisely, with the same probability,

where, assuming h(z)\mathbf{h}(\mathbf{z}) as defined in (1.3),

Here the regions R1,  R2z,  R2h\mathcal{R}_{1},\;\mathcal{R}_{2}^{\mathbf{z}},\;\mathcal{R}_{2}^{\mathbf{h}} are defined as

By the restricted strong convexity we have established for R3\mathcal{R}_{3}, and the integral form of Taylor’s theorem in Lemma A.2,

Summing up the above two inequalities, we obtain

In the asymptotic version, we characterized only the critical points. In this finite-sample version, we characterize the whole space and particularly provide quantitative control for regions near critical points (i.e., R1R3\mathcal{R}_{1}\cup\mathcal{R}_{3}). These concrete quantities are important for algorithm design and analysis (see Section 3).

3 Key Steps in the Geometric Analysis

Our proof strategy is fairly simple: we work out uniform bounds on the quantities for each of the three regions, and finally show the regions together cover the space. Since (1.1) and associated derivatives take the form of summation of mm independent random variables, the proof involves concentration and covering arguments [Ver12]. The main challenge in our argument will be the heavy-tailedness nature of ff and its gradient.

When mCnlognm\geq Cn\log n, it holds with probability at least 1cm11-cm^{-1} that

for all zR1\mathbf{z}\in\mathcal{R}_{1} defined in (2.4). Here C,cC,c are positive absolute constants.

Next, we show that near X\mathcal{X}, the objective ff is strongly convex in any geometrically normal direction to the target set X\mathcal{X} (which is a one-dimensional circle). Combined with the smoothness property, this allows us to achieve a quadratic asymptotic rate of convergence with the modified trust-region algorithm we propose later.

When mCnlognm\geq Cn\log n for a sufficiently large constant CC, it holds with probability at least 1cm11-cm^{-1} that

for all zR3\mathbf{z}\in\mathcal{R}_{3} defined in (2.5) and for all

Here cc is a positive absolute constant.

This restricted strong convexity result is qualitatively stronger than the local curvature property (i.e., Condition 7.11) established in [CLS15b]. Specifically, our result is equivalent to the following: for any line segment L\mathcal{L} that is normal to the circle X\mathcal{X} and contained in S3\mathcal{S}_{3}, it holds that

In contrast, the local curvature property [CLS15b] only states that in such line segment,

While the local curvature property is sufficient to establish local convergence result of first-order method, the stronger restricted strong convex that provides uniform second-order curvature controls in the above L\mathcal{L}’s are necessary to showing quadratic convergence result of second-order method, as we will do in the next section.

In Proposition 2.7 below, we will show that R2zR2h\mathcal{R}_{2}^{\mathbf{z}}\cup\mathcal{R}_{2}^{\mathbf{h}} indeed cover R2\mathcal{R}_{2}. We first show that the gradients in either region are bounded away from zero.

When mCnlognm\geq Cn\log n, it holds with probability at least 1cm11-cm^{-1} that

for all zR2z\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{z}} defined in (2.7). Here C,cC,c are positive absolute constants.

It follows immediately that zf(z)x2/1000\left\|\nabla_{\mathbf{z}}f(\mathbf{z})\right\|\geq\left\|\mathbf{x}\right\|^{2}/1000 for all zR2z\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{z}}.

When mCnlog3nm\geq Cn\log^{3}n, it holds with probability at least 1cm11-cm^{-1} that

for all zR2h\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}} defined in (2.8). Here C,cC,c are positive absolute constants.

This clearly implies that zf(z)x2/1000\left\|\nabla_{\mathbf{z}}f(\mathbf{z})\right\|\geq\left\|\mathbf{x}\right\|^{2}/1000 for all zR2h\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}.

The quantity we want to control in the above proposition, (h(z)zf(z))\Re\left(\mathbf{h}(\mathbf{z})^{*}\nabla_{\mathbf{z}}f(\mathbf{z})\right), is the same quantity to be controlled in the local curvature condition (e.g., Condition 7.11) in [CLS15b]. There only points near X\mathcal{X} (i.e., roughly our R3\mathcal{R}_{3} below) are considered, and the target is proving the local curvature is in a certain sense positive. Here the points of interest are not close to X\mathcal{X}, and the target is only showing that at these points, the directional derivative in h(z)\mathbf{h}(\mathbf{z}) direction is bounded away from zero.

Finally, we show that the two sub-regions, R2z\mathcal{R}_{2}^{\mathbf{z}} and R2h\mathcal{R}_{2}^{\mathbf{h}}, together cover R2\mathcal{R}_{2}. Formally,

We have R2R2zR2h\mathcal{R}_{2}\subset\mathcal{R}_{2}^{\mathbf{z}}\cup\mathcal{R}_{2}^{\mathbf{h}}.

The main challenge is that the function (1.1) is a fourth-order polynomial, and most quantities arising in the above propositions involve heavy-tailed random variables. For example, we need to control

Optimization by Trust-Region Method (TRM)

Based on the geometric characterization in Section 2.2, we describe a second-order trust-region algorithm that produces a close approximation (i.e., up to numerical precision) to a global minimizer of (1.1) in polynomial number of steps. One interesting aspect of ff in the complex space is that each point has a “circle” of equivalent points that have the same function value. Thus, we constrain each step to move “orthogonal” to the trivial direction. This simple modification helps the algorithm to converge faster in practice, and proves important to the quadratic asymptotic convergence rate in theory.

The basic idea of the trust-region method is simple: we generate a sequence of iterates z(0),z(1),\mathbf{z}^{(0)},\mathbf{z}^{(1)},\dots, by repeatedly constructing quadratic approximations f^(δ;z(r))f(z(r)+δ)\widehat{f}(\mathbf{\delta};\mathbf{z}^{(r)})\approx f(\mathbf{z}^{(r)}+\mathbf{\delta}), minimizing f^\widehat{f} to obtain a step δ\mathbf{\delta}, and setting z(r+1)=z(r)+δ\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}. More precisely, we approximate f(z)f(\mathbf{z}) around z(r)\mathbf{z}^{(r)} using the second-order Taylor expansion,

Then, the quadratic approximation of f(z)f(\mathbf{z}) around z(r)\mathbf{z}^{(r)} can be rewritten as

By structure of the Wirtinger gradient f(z(r))\nabla f(\mathbf{z}^{(r)}) and Wirtinger Hessian 2f(z(r))\nabla^{2}f(\mathbf{z}^{(r)}), g(z(r))\mathbf{g}(\mathbf{z}^{(r)}) and H(z(r))\mathbf{H}(\mathbf{z}^{(r)}) contain only real entries. Thus, the problem (3.2) is in fact an instance of the classical trust-region subproblem w.r.t. real variable ξ\mathbf{\xi}. A minimizer to (3.1) can be obtained from a minimizer of (3.2) ξ\mathbf{\xi}_{\star} as δ=Uξ\mathbf{\delta}_{\star}=\mathbf{U}\mathbf{\xi}_{\star}.

So, any method which can solve the classical trust-region subproblem can be directly applied to the modified problem (3.1). Although the resulting problem can be nonconvex (as H(z(r))\mathbf{H}(\mathbf{z}^{(r)}) in (3.4) can be indefinite), it can be solved in polynomial time, by root-finding [MS83, CGT00] or SDP relaxation [RW97, FW04]. Our convergence guarantees assume an exact solution of this problem; we outline below how to obtain an ε\varepsilon-near solution (for arbitrary ε>0\varepsilon>0) via bisection search in polynomial time (i.e., polynomial in ε1\varepsilon^{-1}), due to [Ye92, VZ90]. At the end of Section 3.3, we will discuss robustness of our convergence guarantee to the numerical imperfection, and provide an estimate of the total time complexity. In practice, though, even very inexact solutions of the trust-region subproblem suffice.This can also be proved, in a relatively straightforward way, using the geometry of the objective ff. In the interest of brevity, we do not pursue this here. Inexact iterative solvers for the trust-region subproblem can be engineered to avoid the need to densely represent the Hessian; these methods have the attractive property that they attempt to optimize the amount of Hessian information that is used at each iteration, in order to balance rate of convergence and computation.

Now we describe briefly how to apply bisection search to find an approximate solution to the classical trust-region subproblem, i.e.,

We first isolate the case when there is a unique interior global minimizer. In this case λ=0\lambda_{\star}=0 and A0\mathbf{A}\succeq\mathbf{0}. When A\mathbf{A} has a zero eigenvalue with an associated eigenvector v\mathbf{v} and w\mathbf{w}_{\star} is an interior minimizer, all feasible w+tv\mathbf{w}_{\star}+t\mathbf{v} are also minimizers. So uniqueness here requires A0\mathbf{A}\succ\mathbf{0}. Thus, one can try to sequentially (1) test if A\mathbf{A} is positive definite; (2) solve for w\mathbf{w}_{\star} from Aw=b\mathbf{A}\mathbf{w}_{\star}=-\mathbf{b}; and (3) test if w\mathbf{w}_{\star} is feasible. If the procedure goes through, a minimizer has been found. It is obvious the arithmetic complexity is O(d3)O(d^{3}).

When A\mathbf{A} is not positive definite, there must be minimizers on the boundary. Then, we need to find a λ0\lambda_{\star}\geq 0 such that

One remarkable fact is that although the minimizers may not be unique, the multiplier λ\lambda_{\star} is unique. The unique λ\lambda_{\star} can be efficiently approximated via a bisection search algorithm (Algorithm 1).

The above algorithm finds a λ^\widehat{\lambda} with λ^λε|\widehat{\lambda}-\lambda_{\star}|\leq\varepsilon with arithmetic complexity O(d3log(1/ε))O(d^{3}\log\left(1/\varepsilon\right)). Moreover, this can be translated into a convergence result in function value. Define w^=(A+λ^I)1b\widehat{\mathbf{w}}=-(\mathbf{A}+\widehat{\lambda}\mathbf{I})^{-1}\mathbf{b} if A+λ^I0\mathbf{A}+\widehat{\lambda}\mathbf{I}\succ\mathbf{0}, and w^=(A+λ^I)b+tv\widehat{\mathbf{w}}=-(\mathbf{A}+\widehat{\lambda}\mathbf{I})^{\dagger}\mathbf{b}+t\mathbf{v} if A+λ^I\mathbf{A}+\widehat{\lambda}\mathbf{I} has zero eigenvalue with an associated eigenvector v\mathbf{v} and tt makes w^=r\left\|\widehat{\mathbf{w}}\right\|=r. Then,

2 Convergence Analysis

holds with probability at least 1cbm11-c_{b}m^{-1}, provided mCnlognm\geq Cn\log n for a sufficiently large CC. It can be checked that when

Thus, we conclude that when mCnlognm\geq Cn\log n, w.h.p., the sublevel set {z:f(z)f(z(0))}\left\{\mathbf{z}:f(\mathbf{z})\leq f(\mathbf{z}^{(0)})\right\} is contained in the set

Lipschitz Properties

with probability at least 1caexp(cbm)1-c_{a}\exp(-c_{b}m), provided mCnm\geq Cn for a sufficiently large absolute constant CC. Here cac_{a} through cec_{e} are positive absolute constants.

Property of Hessians near the Target Set 𝒳𝒳\mathcal{X}.

We will provide spectral upper and lower bounds for the (restricted) Hessian matrices H(z)\mathbf{H}(\mathbf{z}), where H(z)\mathbf{H}(\mathbf{z}) is as defined in (3.3). These bounds follow by bounding H(z)\mathbf{H}(\mathbf{z}) on X\mathcal{X}, and then using the Lipschitz property of the Hessian to extend the bounds to a slightly larger region around X\mathcal{X}.

When mCnlognm\geq Cn\log n, it holds with probability at least 1cm11-cm^{-1} that

for all zR3\mathbf{z}\in\mathcal{R}_{3}^{\prime} with mH=22/25x2m_{H}=22/25\left\|\mathbf{x}\right\|^{2} and MH=9/2x2M_{H}=9/2\left\|\mathbf{x}\right\|^{2}. Here C,cC,c are positive absolute constants.

3 Proof of TRM Convergence

We are now ready to prove the convergence of the TRM algorithm. Throughout, we will assume mCnlog3nm\geq Cn\log^{3}n for a sufficiently large constant CC, so that all the events of interest hold w.h.p..

The next auxiliary lemma makes precise the intuition that whenever there exists a descent direction, the step size parameter Δ\Delta is sufficiently small, a trust-region step will decrease the objective.

For any zΓ\mathbf{z}\in\Gamma, suppose there exists a vector δ\mathbf{\delta} with δΔ\left\|\mathbf{\delta}\right\|\leq\Delta such that

for a certain d>0d>0. Then the trust-region subproblem (3.1) returns a point δ\mathbf{\delta}_{\star} with δΔ\left\|\mathbf{\delta}_{\star}\right\|\leq\Delta and

The next proposition says when Δ\Delta is chosen properly, a trust-region step from a point with negative local curvature decreases the function value by a concrete amount.

Suppose the current iterate z(r)R1Γ\mathbf{z}^{(r)}\in\mathcal{R}_{1}\cap\Gamma, and our trust-region size satisfies

Then an optimizer δ\mathbf{\delta}_{\star} to (3.1) leads to z(r+1)=z(r)+δ\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star} that obeys

The next proposition shows that when Δ\Delta is chosen properly, a trust-region step from a point with strong gradient decreases the objective by a concrete amount.

Suppose our current iterate z(r)R2R1cΓ\mathbf{z}^{(r)}\in\mathcal{R}_{2}\cap\mathcal{R}_{1}^{c}\cap\Gamma, and our trust-region size satisfies

Then an optimizer δ\mathbf{\delta}_{\star} to (3.1) leads to z(r+1)=z(r)+δ\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star} that obeys

Now, we argue about R3\mathcal{R}_{3}, in which the behavior of the algorithm is more complicated. For the region R3R3\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}, the restricted strong convexity in radial directions around X\mathcal{X} as established in Proposition 2.4 implies that the gradient at any point in R3R3\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime} is nonzero. Thus, one can treat this as another strong gradient region, and carry out essentially the same argument as in Proposition 3.5.

Suppose our current iterate z(r)R3R3\mathbf{z}^{(r)}\in\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}, and our trust-region size satisfies

Then an optimizer δ\mathbf{\delta}_{\star} to (3.1) leads to z(r+1)=z(r)+δ\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star} that obeys

Our next several propositions show that when the iterate sequence finally moves into R3\mathcal{R}_{3}^{\prime}, in general it makes a finite number of consecutive constrained steps in which the trust-region constraints are always active, followed by ultimate consecutive unconstrained steps in which the the trust-region constraints are always inactive until convergence. Depending on the initialization and optimization parameters, either constrained or unconstrained steps can be void. The next proposition shows that when Δ\Delta is chosen properly, a constrained step in R3\mathcal{R}_{3}^{\prime} decreases the objective by a concrete amount.

Suppose our current iterate z(r)R3\mathbf{z}^{(r)}\in\mathcal{R}_{3}^{\prime}, and the trust-region subproblem takes a constrained step, i.e., the optimizer to (3.1) satisfies δ=Δ\left\|\mathbf{\delta}_{\star}\right\|=\Delta. We have the δ\mathbf{\delta}_{\star} leads to

Here mHm_{H} and MHM_{H} are as defined in Lemma 3.2.

The next proposition shows that when Δ\Delta is properly tuned, an unconstrained step in R3\mathcal{R}_{3}^{\prime} dramatically reduces the norm of the gradient.

Suppose our current iterate z(r)R3\mathbf{z}^{(r)}\in\mathcal{R}_{3}^{\prime}, and the trust-region subproblem takes an unconstrained step, i.e., the unique optimizer to (3.1) satisfies δ<Δ\left\|\mathbf{\delta}_{\star}\right\|<\Delta. We have the δ\mathbf{\delta}_{\star} leads to z(r+1)=z(r)+δ\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star} that obeys

Here MHM_{H} and mHm_{H} are as defined in Lemma 3.2.

The next proposition shows that when Δ\Delta is properly tuned, as soon as an unconstrained R3\mathcal{R}_{3}^{\prime} step is taken, all future iterations take unconstrained R3\mathcal{R}_{3}^{\prime} steps. Moreover, the sequence converges quadratically to the target set X\mathcal{X}.

for all integers r1r^{\prime}\geq 1, provided that

Now we are ready to piece together the above technical propositions to prove our main algorithmic theorem.

Here cac_{a} through cdc_{d} are positive absolute constants.

Proof When mC1nlog3nm\geq C_{1}n\log^{3}n for a sufficiently large constant C1C_{1}, the assumption of Theorem 2.2 is satisfied. Moreover, with probability at least 1c2m11-c_{2}m^{-1}, the following estimates hold:

for a certain positive absolute constant C3C_{3}. From the technical lemmas and propositions in Section 3.3, it can be verified that when

for a positive absolute constant c4c_{4}, all requirements on Δ\Delta are satisfied.

In fact, the previous argument implies a generic iterate sequence consists of two phases: the first phase that takes consecutive RA\mathcal{R}_{A} or constrained R3\mathcal{R}_{3}^{\prime} steps, and thereafter the second phase that takes consecutive unconstrained R3\mathcal{R}_{3}^{\prime} steps till convergence. Either of the two can be absent depending on the initialization and parameter setting for the TRM algorithm.

Since f0f\geq 0, by Proposition 3.4, 3.5, 3.6, and 3.7, from z(0)\mathbf{z}^{(0)} it takes at most

steps for the iterate sequence to enter R3\mathcal{R}_{3}^{\prime}.It is possible to refine the argument a bit by proving that the sequence does not exit R3\mathcal{R}_{3}^{\prime} once entering it, in which case the bound can be tightened as f(z(0))/min(d1,d2,d3)f(\mathbf{z}^{(0)})/\min(d_{1},d_{2},d_{3}). We prefer to state this crude bound to save the additional technicality. Let r0r_{0} denote the index of the first iteration for which z(r0)R3\mathbf{z}^{(r_{0})}\in\mathcal{R}_{3}^{\prime}. Once the sequence enters R3\mathcal{R}_{3}^{\prime}, there are three possibilities:

The sequence always takes constrained steps in R3\mathcal{R}_{3}^{\prime} and since the function f(z)f(\mathbf{z}) is lower bounded (0\geq 0), it reaches the target set X\mathcal{X} in finitely many steps.

The sequence takes constrained steps until reaching certain point zR3\mathbf{z}^{\prime}\in\mathcal{R}_{3}^{\prime} such that f(z)f(x)+d4=d4f(\mathbf{z}^{\prime})\leq f(\mathbf{x})+d_{4}=d_{4}, where d4d_{4} is defined in Proposition 3.7. Since a constrained step in R3\mathcal{R}_{3}^{\prime} must decrease the function value by at least d4d_{4}, all future steps must be unconstrained. Proposition 3.9 suggests that the sequence will converge quadratically to the target set X\mathcal{X}.

The sequence starts to take unconstrained steps at a certain point zR3\mathbf{z}^{\prime\prime}\in\mathcal{R}_{3}^{\prime} such that f(z)f(x)+d4f(\mathbf{z}^{\prime\prime})\geq f(\mathbf{x})+d_{4}. Again Proposition 3.9 implies that the sequence will converge quadratically to the target set X\mathcal{X}.

In sum, by Proposition 3.4, Proposition 3.5, Proposition 3.6, Proposition 3.7, and Proposition 3.9, the number of iterations to obtain an ε\varepsilon-close solution to the target set X\mathcal{X} can be grossly bounded by

Using our previous estimates of mHm_{H}, MHM_{H}, and LHL_{H}, and taking min{d1,d2,d3,d4}=c5Δ2x2\min\{d_{1},d_{2},d_{3},d_{4}\}=c_{5}\Delta^{2}\left\|\mathbf{x}\right\|^{2}, we arrive at the claimed result.

Our above results are conditioned on exact arithmetic for solving the trust-region subproblem. We first discuss the effect of inexact arithmetics. Fix an ε\varepsilon to be determined later, suppose our trust-region subproblem solver always returns a w^\widehat{\mathbf{w}} such that Q(w^)Q(w)εQ(\widehat{\mathbf{w}})-Q(\mathbf{w}_{\star})\leq\varepsilon – based on our discussion in Section 3.1, such w^\widehat{\mathbf{w}} can always be found in O(n3log(1/ε))O(n^{3}\log(1/\varepsilon)) time. Then, in Lemma 3.3, the trust-region subproblem returns a feasible point δ\left\|\mathbf{\delta}_{\star}\right\| such that

Accordingly, the did_{i} for ii\in in Proposition 3.4 through Proposition 3.7 are changed to diεd_{i}-\varepsilon, with other conditions unaltered. Thus, combining the above with arguments in Theorem 3.10, we can take

such that from an initialization z(0)\mathbf{z}^{(0)}, the iterate sequence takes at most

steps to stay in region R3\mathcal{R}_{3}^{\prime} and possibly to start consecutive unconstrained steps. For the unconstrained steps in R3\mathcal{R}_{3}^{\prime}, since exact arithmetic is possible as we discussed in Section 3.1, the results in Proposition 3.8 and Proposition 3.9 are intact. The step estimate for this part in Theorem 3.10 remains valid. Overall, by our above choice, inexact arithmetics at most double the number of iterations to attain an ε\varepsilon-near solution to X\mathcal{X}.

If we set Δ=cn7/2log7/2mx\Delta=cn^{-7/2}\log^{-7/2}m\left\|\mathbf{x}\right\|, then each trust-region iteration costs O(n3log(nxlogm))O(n^{3}\log\left(n\left\|\mathbf{x}\right\|\log m\right)). Moreover, it takes

iterations to arrive at an ε\varepsilon-near solution to X\mathcal{X}.

Numerical Simulations

Our convergence analysis for the TRM is based on two idealizations: (i) the trust-region subproblem is solved exactly; and (ii) the step-size is fixed to be sufficiently small. These simplifications ease the analysis, but also render the TRM algorithm impractical. In practice, the trust-region subproblem is never exactly solved, and the trust-region step size is adjusted to the local geometry, by backtracking. It is relatively straightforward to modify our analysis to account for inexact subproblem solvers; for sake of brevity, we do not pursue this hereThe proof ideas are contained in Chap 6 of [CGT00]; see also [AMS09]. Intuitively, such result is possible because reasonably good approximate solutions to the TRM subproblem make qualitatively similar progress as the exact solution. Recent work [CGT12, BAC16] has established worst-case polynomial iteration complexity (under reasonable assumptions on the geometric parameters of the functions, of course) of TRM to converge to point verifying the second-order optimality conditions. Their results allow inexact trust-region subproblem solvers, as well as adaptive step sizes. Based on our geometric result, we could have directly called their results, producing slightly worse iteration complexity bounds. It is not hard to adapt their proof taking advantage of the stronger geometric property we established and produce tighter results. .

In this section, we investigate experimentally the number of measurements mm required to ensure that f(z)f(\mathbf{z}) is well-structured, in the sense of our theorems. This entails solving large instances of f(z)f(\mathbf{z}). To this end, we deploy the Manopt toolbox [BMAS14]Available online: http://www.manopt.org. . Manopt is a user-friendly Matlab toolbox that implements several sophisticated solvers for tackling optimization problems on Riemannian manifolds. The most developed solver is based on the TRM. This solver uses the truncated conjugate gradient (tCG; see, e.g., Section 7.5.4 of [CGT00]) method to (approximately) solve the trust-region subproblem (vs. the exact solver in our analysis). It also dynamically adjusts the step size. However, the original implementation (Manopt 2.0) is not adequate for our purposes. Their tCG solver uses the gradient as the initial search direction, which does not ensure that the TRM solver can escape from saddle points [ABG07, AMS09]. We modify the tCG solver, such that when the current gradient is small and there is a negative curvature direction (i.e., the current point is near a saddle point or a local maximizer for f(z)f(\mathbf{z})), the tCG solver explicitly uses the negative curvature direction…adjusted in sign to ensure positive correlation with the gradient – if it does not vanish. as the initial search direction. This modificationSimilar modification is also adopted in the TRM algorithmic framework in the recent work [BAC16] (Algorithm 3). ensures the TRM solver always escapes saddle points/local maximizers with directional negative curvature. Hence, the modified TRM algorithm based on Manopt is expected to have the same qualitative behavior as the idealized version we analyzed.

We fix n=1,000n=1,000 and vary the ratio m/nm/n from 44 to 1010.

For each mm, we generate a fixed instance: a fixed signal x\mathbf{x}, and a fixed set of complex Gaussian vectors. We run the TRM algorithm 2525 times for each problem instance, with independent random initializations. Successfully recovery is declared if at termination the optimization variable z\mathbf{z}_{\infty} satisfies

Discussion

In this work, we provide a complete geometric characterization of the nonconvex formulation (1.1) for the GPR problem. The benign geometric structure allows us to design a second-order trust-region algorithm that efficiently finds a global minimizer of (1.1), without special initializations. We close this paper by discussing possible extensions and relevant open problems.

Our result (Theorem 2.2 and Theorem 3.10) indicates that mC1nlog3(n)m\geq C_{1}n\log^{3}(n) samples are sufficient to guarantee the favorable geometric property and efficient recovery, while our simulations suggested that C2nlog(n)C_{2}n\log(n) or even C3nC_{3}n is enough. For efficient recovery only, mC4nm\geq C_{4}n are known to be sufficient [CC15] (and for all signals; see also [CLS15b, WGE16, ZL16]). It is interesting to see if the gaps can be closed. Our current analysis pertains to Gaussian measurements only which are not practical, it is important to extend the geometric analysis to more practical measurement schemes, such as t-designs [GKK13] and masked Fourier transform measurements [CLS15a]. A preliminary study of the low-dimensional function landscape for the latter scheme (for reduced real version) produces very positive result; see Figure 5.

Sparse phase retrieval.

A special case of GPR is when the underlying signal x\mathbf{x} is known to be sparse, which can be considered as a quadratic compressed sensing problem [OYVS13, OYDS13, OYDS12, LV13, JOH13, SBE14]. Since x\mathbf{x} is sparse, the lifted matrix X=xx\mathbf{X}=\mathbf{x}\mathbf{x}^{*} is sparse and has rank one. Thus, existing convex relaxation methods [OYVS13, OYDS13, LV13, JOH13] formulated it as a simultaneously low-rank and sparse recovery problem. For the latter problem, however, known convex relaxations are suboptimal [OJF+12, MHWG14]. Let kk be the number of nonzeros in the target signal. [LV13, JOH13] showed that natural convex relaxations require C5k2lognC_{5}k^{2}\log n samples for correct recovery, instead of the optimal order O(klog(n/k)O(k\log(n/k). A similar gap is also observed with certain nonconvex methods [CLM15]. It is tempting to ask whether novel nonconvex formulations and analogous geometric analysis as taken here could shed light on this problem.

Other structured nonconvex problems.

We have mentioned recent surge of works on provable nonconvex heuristics [JNS13, Har14, HW14, NNS+14, JN14, SL14, JO14, WCCL15, SRO15, ZL15, TBSR15, CW15, AGJ14a, AGJ14b, AJSN15, GHJY15, QSW14, HSSS15, AAJ+13, AGM13, AAN13, ABGM14, AGMM15, SQW15a, YCS13, SA14, LWB13, LJ15, LLJB15, EW15, Bou16, JJKN15]. While the initialization plus local refinement analyses generally produce interesting theoretical results, they do not explain certain empirical successes that do not rely on special initializations. The geometric structure and analysis we work with in our recent work [SQW15a, SQW15b] (see also [GHJY15, AG16], and [Kaw16, SC16, BNS16, GLM16, PKCS16, BBV16, BVB16]) seem promising in this regard. It is interesting to consider whether analogous geometric structure exists for other practical problems.

Proofs of Technical Results for Function Landscape

Lemma 6.1 to 6.3 first appeared in [CLS15b]; we include the full proofs here for completeness.

We now evaluate the three terms separately. Note that the law CN\mathcal{CN} is invariant to unitary transform. Thus,

Gathering the above results, we obtain (6.1). By taking Wirtinger derivative (1.4) with respect to (6.1), we obtain the Wirtinger gradient and Hessian in (6.2), (6.3) as desired.

Similar calculation yields the second expectation.

Here C(δ)C(\delta) is a constant depending on δ\delta and cac_{a}, cbc_{b} and ccc_{c} are positive absolute constants.

Proof We work out the results on 1mk=1makv2akak\frac{1}{m}\sum_{k=1}^{m}\left|\mathbf{a}_{k}^{*}\mathbf{v}\right|^{2}\mathbf{a}_{k}\mathbf{a}_{k}^{*} first. By the unitary invariance of the Gaussian measure and rescaling, it is enough to consider v=e1\mathbf{v}=\mathbf{e}_{1}. We partition each vector ak\mathbf{a}_{k} as ak=[ak(1);a~k]\mathbf{a}_{k}=[a_{k}(1);\widetilde{\mathbf{a}}_{k}] and upper bound the target quantity as:

By Chebyshev’s inequality, we have with probability at least 1c1δ2m11-c_{1}\delta^{-2}m^{-1},

For all w\mathbf{w} and all k[m]k\in[m] , a~kw\widetilde{\mathbf{a}}_{k}^{*}\mathbf{w} is distributed as CN(1)\mathcal{CN}(1) that is independent of the {ak(1)}\{a_{k}(1)\} sequence. So for one realization of {ak(1)}\{a_{k}(1)\}, the Hoeffding-type inequality of Lemma A.5 implies

for any w\mathbf{w} with w=1\left\|\mathbf{w}\right\|=1 and any t>0t>0. Taking t=δ/8t=\delta/8, together with a union bound on a 1/21/2-net on the sphere, we obtain

Now an application of Chebyshev’s inequality gives that k=1mak(1)620m\sum_{k=1}^{m}\left|a_{k}(1)\right|^{6}\leq 20m with probability at least 1c3m11-c_{3}m^{-1}. Substituting this into the above, we conclude that whenever mC4δ2nm\geq C_{4}\delta^{-2}n for some sufficiently large C4C_{4},

with probability at least 1c3m1exp(c5δ2m)1-c_{3}m^{-1}-\exp\left(-c_{5}\delta^{2}m\right).

For all fixed w\mathbf{w} and all k[m]k\in[m], a~kwCN(1)\widetilde{\mathbf{a}}_{k}^{*}\mathbf{w}\sim\mathcal{CN}(1). Thus, a~kw21\left|\widetilde{\mathbf{a}}_{k}^{*}\mathbf{w}\right|^{2}-1 is centered sub-exponential. So for one realization of {ak(1)}\{a_{k}(1)\}, Bernstein’s inequality (Lemma A.6) implies

for any fixed w\mathbf{w} with w=1\left\|\mathbf{w}\right\|=1 and any t>0t>0. Taking t=δ/8t=\delta/8, together with a union bound on a 1/21/2-net on the sphere, we obtain

Chebyshev’s inequality and the union bound give that

hold with probability at least 1c8m1m41-c_{8}m^{-1}-m^{-4}. To conclude, when mC9(δ)δ2nlognm\geq C_{9}(\delta)\delta^{-2}n\log n for some sufficiently large constant C9(δ)C_{9}(\delta),

with probability at least 1c8m1m42exp(c10δ2m/logm)1-c_{8}m^{-1}-m^{-4}-2\exp\left(-c_{10}\delta^{2}m/\log m\right).

Collecting the above bounds and probabilities yields the claimed results. Similar arguments prove the claim on 1mk=1m(akv)akak\frac{1}{m}\sum_{k=1}^{m}\left(\mathbf{a}_{k}^{*}\mathbf{v}\right)\mathbf{a}_{k}\mathbf{a}_{k}^{\top} also, completing the proof.

Let a1,,am\mathbf{a}_{1},\dots,\mathbf{a}_{m} be i.i.d. copies of aCN(n)\mathbf{a}\sim\mathcal{CN}(n). For any δ(0,1)\delta\in(0,1), when mC(δ)nlognm\geq C(\delta)n\log n, it holds with probability at least 1cexp(c(δ)m)cmn1-c^{\prime}\exp\left(-c(\delta)m\right)-c^{\prime\prime}m^{-n} that

Here C(δ)C(\delta) and c(δ)c(\delta) are constants depending on δ\delta and cc^{\prime} and cc^{\prime\prime} are positive absolute constants.

as maxk[m]ak25nlogm\max_{k\in[m]}\left\|\mathbf{a}_{k}\right\|^{2}\leq 5n\log m with probability at least 1c2mn1-c_{2}m^{-n}, and k=1makak2m\left\|\sum_{k=1}^{m}\mathbf{a}_{k}\mathbf{a}_{k}^{*}\right\|\leq 2m with probability at least 1exp(c3m)1-\exp(-c_{3}m). Thus,

Taking ε=c4(δ)/(nlogm)\varepsilon=c_{4}(\delta)/(n\log m) for a sufficiently small c4(δ)>0c_{4}(\delta)>0, we obtain that with probability at least 1exp(c1δ2m+4nlog(3nlogm/c4(δ)))c5mn1-\exp\left(-c_{1}\delta^{2}m+4n\log(3n\log m/c_{4}(\delta))\right)-c_{5}m^{-n},

which, together with continuity of the function (w,z)wz2(\mathbf{w},\mathbf{z})\mapsto\left|\mathbf{w}^{*}\mathbf{z}\right|^{2}, implies

It is enough to take mC6δ2nlognm\geq C_{6}\delta^{-2}n\log n to ensure the desired event happens w.h.p..

2 Proof of Proposition 2.3

Lemma 6.3 implies that when mC1nlognm\geq C_{1}n\log n, w.h.p.,

On the other hand, by Lemma A.7, we have that

holds with probability at least 1exp(c2m)1-\exp(-c_{2}m). For the second summation, we have

w.h.p., provided mC3nlognm\geq C_{3}n\log n, according to Lemma 6.3.

Collecting the above estimates, we have that when mC4nlognm\geq C_{4}n\log n for a sufficiently large constant C4C_{4}, w.h.p.,

for all zR1\mathbf{z}\in\mathcal{R}_{1}. Dividing both sides of the above by x2\left\|\mathbf{x}\right\|^{2} gives the claimed results.

3 Proof of Proposition 2.4

Lemma 6.4 implies when mC1nlognm\geq C_{1}n\log n for sufficiently large constant C1C_{1}, w.h.p.,

for all gT\mathbf{g}\in\mathcal{T}, where we have used that (gx)=0(x+g)g=0\Im(\mathbf{g}^{*}\mathbf{x})=0\Longrightarrow\Im(\mathbf{x}+\mathbf{g})^{*}\mathbf{g}=0 to simplify the results.

Collecting the above estimates, we obtain that when mC4nlognm\geq C_{4}n\log n, w.h.p.,

To provide a lower bound for the above, we let (xg)=xg=λx\Re(\mathbf{x}^{*}\mathbf{g})=\mathbf{x}^{*}\mathbf{g}=\lambda\left\|\mathbf{x}\right\| with λ\lambda\in and t=ηxt=\eta\left\|\mathbf{x}\right\| with η[0,1/7]\eta\in[0,1/\sqrt{7}]. Then

For any fixed η\eta, it is easy to see that minimizer occurs when λ=599299η\lambda=-\frac{599}{299}\eta. Plugging this into ϕ(λ,η)\phi(\lambda,\eta), one obtains ϕ(λ,η)24120η2241140\phi(\lambda,\eta)\geq-\frac{241}{20}\eta^{2}\geq-\frac{241}{140}. Thus,

4 Proof of Proposition 2.5

By Lemma 6.4, when mC1nlognm\geq C_{1}n\log n for some sufficiently large C1C_{1}, w.h.p.,

for all zR2z\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{z}}, as desired.

5 Proof of Proposition 2.6

Proof We abbreviate ϕ(z)\phi(\mathbf{z}) as ϕ\phi below. Note that

To bound the first term, for a fixed τ\tau to be determined later, define:

Obviously S1(z)S2(z)S_{1}(\mathbf{z})\geq S_{2}(\mathbf{z}) for all z\mathbf{z} as

Because S1(z)S2(z)S_{1}(\mathbf{z})\geq S_{2}(\mathbf{z}) as shown above,

Thus, the unconditional probability can be bounded as

Taking τ=10logmx\tau=\sqrt{10\log m}\left\|\mathbf{x}\right\|, we obtain

where we have used zx\left\|\mathbf{z}\right\|\leq\left\|\mathbf{x}\right\| to simplify the last inequality. Now we use the moment-control Bernstein’s inequality (Lemma A.8) to get a bound for probability on deviation of S2(z)S_{2}(\mathbf{z}). To this end, we have

where the second inequality holds for any integer p3p\geq 3. Hence one can take

in the deviation inequality of S2(z)S_{2}(\mathbf{z}) to obtain

where we have used the fact zx\left\|\mathbf{z}\right\|\leq\left\|\mathbf{x}\right\| and assumed 43m5/21/2004\sqrt{3}m^{-5/2}\leq 1/200 to simplify the probability. Thus, with probability at least 1m4exp(c2m/log2m+2nlog(3x/ε))1-m^{-4}-\exp\left(-c_{2}m/\log^{2}m+2n\log(3\left\|\mathbf{x}\right\|/\varepsilon)\right), it holds that

Moreover, for any z,zR2h\mathbf{z},\mathbf{z}^{\prime}\in\mathcal{R}_{2}^{\mathbf{h}}, we have

for all zR2h\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}, with probability at least 1c6m1c7exp(c2m/log2m+c9nlog(C8nlogm))1-c_{6}m^{-1}-c_{7}\exp\left(-c_{2}m/\log^{2}m+c_{9}n\log(C_{8}n\log m)\right).

Combining the above estimates, when mC10nlog3nm\geq C_{10}n\log^{3}n for sufficiently large constant C10C_{10}, w.h.p.,

for all zR2h\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}, as claimed.

6 Proof of Proposition 2.7

For convenience, we will define a relaxed R2h\mathcal{R}_{2}^{\mathbf{h}} region

In this case, when z2398601x2\left\|\mathbf{z}\right\|^{2}\leq\tfrac{398}{601}\left\|\mathbf{x}\right\|^{2},

On the other hand, when z2626995x2\left\|\mathbf{z}\right\|^{2}\geq\tfrac{626}{995}\left\|\mathbf{x}\right\|^{2},

Since 398601>626995\tfrac{398}{601}>\tfrac{626}{995}, we conclude that RaR1R2z\mathcal{R}_{a}\subset\mathcal{R}_{1}\cup\mathcal{R}_{2}^{\mathbf{z}}.

So Rb\mathcal{R}_{b} is covered by R1\mathcal{R}_{1}.

We show this region is covered by R2z\mathcal{R}_{2}^{\mathbf{z}} and R2h\mathcal{R}_{2}^{\mathbf{h}^{\prime}}. First, for any zRc\mathbf{z}\in\mathcal{R}_{c}, when z19831990x\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1983}{1990}}\left\|\mathbf{x}\right\|,

implying that Rc{z:z19831990x}R2z\mathcal{R}_{c}\cap\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1983}{1990}}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{2}^{\mathbf{z}}. Next we suppose z=λx\left\|\mathbf{z}\right\|=\lambda\left\|\mathbf{x}\right\| and xz=ηxz\left|\mathbf{x}^{*}\mathbf{z}\right|=\eta\left\|\mathbf{x}\right\|\left\|\mathbf{z}\right\|, where λ[1120,19841990]\lambda\in[\tfrac{11}{20},\sqrt{\tfrac{1984}{1990}}] and η[12,99100]\eta\in[\tfrac{1}{2},\tfrac{99}{100}], and show the rest of Rc\mathcal{R}_{c} is covered by R2h\mathcal{R}_{2}^{\mathbf{h}^{\prime}}. To this end, it is enough to verify that

over this subregion. Writing the left as a function of λ,η\lambda,\eta and eliminating x\left\|\mathbf{x}\right\| and z\left\|\mathbf{z}\right\|, it is enough to show

We show that this region is covered by R2z\mathcal{R}_{2}^{\mathbf{z}}, R3\mathcal{R}_{3}, and R2h\mathcal{R}_{2}^{\mathbf{h}^{\prime}} together. First, for any zRd\mathbf{z}\in\mathcal{R}_{d}, when z1001995x\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1001}{995}}\left\|\mathbf{x}\right\|,

So Rd{z:z1001995x}R2z\mathcal{R}_{d}\cap\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1001}{995}}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{2}^{\mathbf{z}}. Next, we show that any zRd\mathbf{z}\in\mathcal{R}_{d} with z24/25x\left\|\mathbf{z}\right\|\leq 24/25\cdot\left\|\mathbf{x}\right\| is contained in R2h\mathcal{R}_{2}^{\mathbf{h}^{\prime}}. Similar to the above argument for Rc\mathcal{R}_{c}, it is enough to show

as 12501+λ22ηλ<0.00185\frac{1}{250}\sqrt{1+\lambda^{2}-2\eta\lambda}<0.00185 in this case. Since the Hessian is again always indefinite, we check the optimal value for η=99/100\eta=99/100 and η=1\eta=1 and do the comparison. It can be verified p0.00627>0.00185p\geq 0.00627>0.00185 in this case. So Rd{z:z2425x}R2h\mathcal{R}_{d}\cap\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\leq\frac{24}{25}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{2}^{\mathbf{h}^{\prime}}. Finally, we consider the case 2325xz1005995x\tfrac{23}{25}\left\|\mathbf{x}\right\|\leq\left\|\mathbf{z}\right\|\leq\sqrt{\tfrac{1005}{995}}\left\|\mathbf{x}\right\|. A λ,η\lambda,\eta argument as above leads to

implying that Rd{z:2325xz1005995x}R3\mathcal{R}_{d}\cap\left\{\mathbf{z}:\tfrac{23}{25}\left\|\mathbf{x}\right\|\leq\left\|\mathbf{z}\right\|\leq\sqrt{\tfrac{1005}{995}}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{3}.

Proofs of Technical Results for Trust-Region Algorithm

When mCnm\geq Cn for a sufficiently large CC, it holds with probability at least 1caexp(cbm)1-c_{a}\exp(-c_{b}m) that

Proof Lemma 3.1 in [CSV13] has shown that when mC1nm\geq C_{1}n, it holds with probability at least 1c2exp(c3m)1-c_{2}\exp(-c_{3}m) that

for all z\mathbf{z} and w\mathbf{w}, where \left\|\cdot\right\|_{\ast} is the nuclear norm that sums up singular values. The claims follows from

When mCnlognm\geq Cn\log n, with probability at least 1cam1cbexp(ccm/logm)1-c_{a}m^{-1}-c_{b}\exp\left(-c_{c}m/\log m\right),

for all ψ[0,2π)\psi\in[0,2\pi). Here CC, cac_{a} to ccc_{c} are positive absolute constants.

holds w.h.p. when mC1nlognm\geq C_{1}n\log n for a sufficiently large C1C_{1}.

2 Proof of Lemma 3.1

Proof For any z,zΓ\mathbf{z},\mathbf{z}^{\prime}\in\Gamma^{\prime}, we have

where in the third line we invoked results of Lemma 7.1, and hence the derived inequality holds w.h.p. when mC1nm\geq C_{1}n. Similarly, for the gradient,

where from the second to the third inequality we used the fact 1mk=1makak2\left\|\frac{1}{m}\sum_{k=1}^{m}\mathbf{a}_{k}\mathbf{a}_{k}^{*}\right\|\leq 2 with probability at least 1exp(c2m)1-\exp(-c_{2}m). Similarly for the Hessian,

where to obtain the third inequality we used that 1mA22\frac{1}{m}\left\|\mathbf{A}^{*}\right\|^{2}\leq 2 with probability at least 1exp(c3m)1-\exp(-c_{3}m) when mC4nm\geq C_{4}n for a sufficiently large constant C4C_{4}.

Since R010xR_{0}\leq 10\left\|\mathbf{x}\right\| with probability at least 1exp(c5m)1-\exp(-c_{5}m) when mC6nm\geq C_{6}n, by definition of R1R_{1}, we have R130(nlogm)1/2xR_{1}\leq 30(n\log m)^{1/2}\left\|\mathbf{x}\right\| w.h.p.. Substituting this estimate into the above bounds yields the claimed results.

3 Proof of Lemma 3.2

Proof For the upper bound, we have that for all zR3\mathbf{z}\in\mathcal{R}_{3}^{\prime},

where to obtain the third line we applied Lemma 7.2. To show the lower bound for all zR3\mathbf{z}\in\mathcal{R}_{3}^{\prime}, it is equivalent to show that

By Lemma 3.1 and Lemma 7.2, w.h.p., we have

Since (wz)=0\Im\left(\mathbf{w}^{*}\mathbf{z}\right)=0, we have ((wz)2)=wz2\Re\left((\mathbf{w}^{*}\mathbf{z})^{2}\right)=\left|\mathbf{w}^{*}\mathbf{z}\right|^{2}. Thus,

4 Proof of Lemma 3.3

5 Proof of Proposition 3.4

Taking t=Δt=\Delta and applying Lemma 3.3, we have

where we obtain the very last inequality using the assumption that Δx2/(400Lh)\Delta\leq\left\|\mathbf{x}\right\|^{2}/(400L_{h}), completing the proof.

6 Proof of Proposition 3.5

Obviously vectors of the form tδt\mathbf{\delta} is feasible for (3.1) for any t[0,Δ]t\in[0,\Delta]. By Lemma A.2, we have

By Proposition 2.5 and Proposition 2.6, we have

Since {z:zx/2}R1\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\leq\left\|\mathbf{x}\right\|/2\right\}\subset\mathcal{R}_{1}, z(r)\mathbf{z}^{(r)} of interest here satisfies z(r)x/2\|\mathbf{z}^{(r)}\|\geq\left\|\mathbf{x}\right\|/2. Thus,

Combining the above with Lemma 3.3, we obtain

7 Proof of Proposition 3.6

Proof By Proposition 2.4 and the integral form of Taylor’s theorem in Lemma A.2, we have that for any g\mathbf{g} satisfying (gx)=0\Im(\mathbf{g}^{*}\mathbf{x})=0 and g=1\left\|\mathbf{g}\right\|=1 and any t[0,x/7]t\in[0,\left\|\mathbf{x}\right\|/\sqrt{7}],

Combining the above two inequalities, we obtain

where to obtain the very last bound we have used the fact minzR3R3h(z)x2/(10Lh)\min_{\mathbf{z}\in\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}}\left\|\mathbf{h}(\mathbf{z})\right\|\geq\left\|\mathbf{x}\right\|^{2}/(10L_{h}) due to (3.9). This implies that for all zR3R3\mathbf{z}\in\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime},

The rest arguments are very similar to that of Proposition 3.5. Take δ=h(z(r))/h(z(r))\mathbf{\delta}=-\mathbf{h}(\mathbf{z}^{(r)})/\left\|\mathbf{h}(\mathbf{z}^{(r)})\right\| and it can checked vectors of the form tδt\mathbf{\delta} for t[0,Δ]t\in[0,\Delta] are feasible for (3.1). By Lemma A.2, we have

where to obtain the last line we have used (7.1). Combining the above with Lemma 3.3, we obtain

8 Proof of Proposition 3.7

Let U\mathbf{U} be an orthogonal (in geometric sense) basis for the space {w:(wz(r))=0}\left\{\mathbf{w}:\Im(\mathbf{w}^{*}\mathbf{z}^{(r)})=0\right\}. In view of the transformed gradient and Hessian in (3.3), it is easy to see

By Lemma 3.2, λmin(H(z(r)))mH\lambda_{\min}(\mathbf{H}(\mathbf{z}^{(r)}))\geq m_{H}. Thus,

where the last simplification is due to that MHmHM_{H}\geq m_{H}. By Lemma A.3, we have

Therefore, for z(r+1)=z(r)+δ\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star}, Lemma 3.3 implies that

The claimed result follows provided ΔmH2/(4MHLh)\Delta\leq m_{H}^{2}/(4M_{H}L_{h}), completing the proof.

9 Proof of Proposition 3.8

Before proceeding, we note one important fact that is useful below. For any z\mathbf{z}, we have

Thus, if U(z)\mathbf{U}(\mathbf{z}) is an (geometrically) orthonormal basis constructed for the space {w:(wz)=0}\left\{\mathbf{w}:\Im(\mathbf{w}^{*}\mathbf{z})=0\right\} (as defined around (3.3)), it is easy to verify that

Proof Throughout the proof, we write g(r)\mathbf{g}^{(r)}, H(r)\mathbf{H}^{(r)} and U(r)\mathbf{U}^{(r)} short for g(z(r))\mathbf{g}(\mathbf{z}^{(r)}), H(z(r))\mathbf{H}(\mathbf{z}^{(r)}) and U(z(r))\mathbf{U}(\mathbf{z}^{(r)}), respectively. Given an orthonormal basis U(r)\mathbf{U}^{(r)} for {w:(wz(r))=0}\left\{\mathbf{w}:\Im(\mathbf{w}^{*}\mathbf{z}^{(r)})=0\right\}, the unconstrained optimality condition of the trust region method implies that

By Taylor’s theorem and Lipschitz property in Lemma 3.1, we have

where to obtain the second equality we have used the optimality condition discussed at start of the proof. Thus, using the result above, we obtain

where from the second to the third line we translate the complex operator norm to the real operator norm. Similarly, we also get

Since z(r)minzR3z(11/7)x3x/5\left\|\mathbf{z}^{(r)}\right\|\geq\min_{\mathbf{z}\in\mathcal{R}_{3}}\left\|\mathbf{z}\right\|\geq(1-1/\sqrt{7})\left\|\mathbf{x}\right\|\geq 3\left\|\mathbf{x}\right\|/5, and z(r+1)z(r)Δx/2\left\|\mathbf{z}^{(r+1)}\right\|\geq\left\|\mathbf{z}^{(r)}\right\|-\Delta\geq\left\|\mathbf{x}\right\|/2 provided

where we used the assumption Δx/10\Delta\leq\left\|\mathbf{x}\right\|/10 again to obtain the last inequality.

Invoking the optimality condition again, we obtain

Here (H(r))1(\mathbf{H}^{(r)})^{-1} is well-defined because Lemma 3.2 shows that H(r)mH\left\|\mathbf{H}^{(r)}\right\|\geq m_{H} for all z(r)R3\mathbf{z}^{(r)}\in\mathcal{R}_{3}^{\prime}. Combining the last two estimates, we complete the proof.

10 Proof of Proposition 3.9

Proof Throughout the proof, we write g(r)\mathbf{g}^{(r)}, H(r)\mathbf{H}^{(r)} and U(r)\mathbf{U}^{(r)} short for g(z(r))\mathbf{g}(\mathbf{z}^{(r)}), H(z(r))\mathbf{H}(\mathbf{z}^{(r)}) and U(z(r))\mathbf{U}(\mathbf{z}^{(r)}), respectively.

We first show z(r+1)\mathbf{z}^{(r+1)} stays in R3\mathcal{R}_{3}^{\prime}. From proof of Proposition 3.6, we know that for all zR3\mathbf{z}\in\mathcal{R}_{3}, the following estimate holds:

provided Δx/10\Delta\leq\left\|\mathbf{x}\right\|/10. Moreover,

where the last inequality followed because step rr is unconstrained. Combining the above estimates, we obtain that

z(r+1)\mathbf{z}^{(r+1)} stays in R3\mathcal{R}_{3}^{\prime}.

Next we show the next step will also be an unconstrained step when Δ\Delta is sufficiently small. We have

where we again applied results of Proposition 3.8 to obtain the third line, and applied the optimality condition to obtain the fourth line. Thus, whenever

the transformed trust-region subproblem has its minimizer ξ(r+1)\mathbf{\xi}^{(r+1)} with ξ(r+1)<Δ\|\mathbf{\xi}^{(r+1)}\|<\Delta. This implies the minimizer δ(r+1)\mathbf{\delta}^{(r+1)} to the original trust-region subproblem satisfies δ(r+1)<Δ\mathbf{\delta}^{(r+1)}<\Delta, as δr+1=ξ(r+1)\|\mathbf{\delta}^{r+1}\|=\|\mathbf{\xi}^{(r+1)}\|. Thus, under the above condition the (r+1)(r+1)-th step is also unconstrained.

Repeating the above arguments for all future steps implies that all future steps will be constrained within R3\mathcal{R}_{3}^{\prime}.

We next provide an explicit estimate for the rate of convergence in terms of distance of the iterate to the target set X\mathcal{X}. Again by Proposition 3.8,

Appendix A Basic Tools and Results

For aCN(1)a\sim\mathcal{CN}(1), it holds that

Proof Since ff is continuous differentiable, by the fundamental theorem of calculus,

Change of variable τ=st(0s1)\tau=st(0\leq s\leq 1) gives the claimed result.

Proof By integral form of Taylor’s theorem in Lemma A.2,

Let X\mathbf{X} be an n1×n2n_{1}\times n_{2} (n1>n2n_{1}>n_{2}) matrices with i.i.d. CN\mathcal{CN} entries. Then,

Moreover, for each t0t\geq 0, it holds with probability at least 12exp(t2)1-2\exp\left(-t^{2}\right) that

Let X1,,XNX_{1},\cdots,X_{N} be independent centered sub-Gaussian random variables, and let K=maxiXiψ2K=\max_{i}\left\|X_{i}\right\|_{\psi_{2}}, where the sub-Gaussian norm

Let X1,,XNX_{1},\cdots,X_{N} be independent centered sub-exponential random variables, and let K=maxiXiψ1K=\max_{i}\left\|X_{i}\right\|_{\psi_{1}}, where the sub-exponential norm

Let X1X_{1}, \dots, XNX_{N} be i.i.d. copies of the nonnegative random variable XX with finite second moment. Then it holds that

Thus, by the usual exponential transform trick, we obtain that for any t>0t>0,

Taking λ=t/(Nσ2)\lambda=t/(N\sigma^{2}) and making change of variable for tt give the claimed result.

Let X1,,XpX_{1},\dots,X_{p} be i.i.d. copies of a real-valued random variable XX Suppose that there exist some positive number RR and σX2\sigma_{X}^{2} such that

Let S1pk=1pXkS\doteq\frac{1}{p}\sum_{k=1}^{p}X_{k}, then for … , it holds that

Proof Proof to i) is similar to that of II. Theorem 4.11 in [SS90]. For 2kn2k\leq n, w.l.o.g., we can assume U\mathbf{U} and V\mathbf{V} are the canonical bases for U\mathcal{U} and V\mathcal{V}, respectively. Then

Note that the upper bound is achieved by taking x=e1\mathbf{x}=\mathbf{e}_{1}. When 2k>n2k>n, by the results from CS decomposition (see, e.g., I Theorem 5.2 of [SS90]).

and the same argument then carries through. To prove ii), note the fact that sinθ1=UUVV\sin\theta_{1}=\left\|\mathbf{U}\mathbf{U}^{*}-\mathbf{V}\mathbf{V}^{*}\right\| (see, e.g., Theorem 4.5 and Corollary 4.6 of [SS90]). Obviously one also has

while IUU\mathbf{I}-\mathbf{U}\mathbf{U}^{*} and IVV\mathbf{I}-\mathbf{V}\mathbf{V}^{*} are projectors onto U\mathcal{U}^{\perp} and V\mathcal{V}^{\perp}, respectively. This completes the proof.

References