A Geometric Analysis of Phase Retrieval

Ju Sun, Qing Qu, John Wright

Introduction

This problem has attracted substantial recent interest, due to its connections to fields such as crystallography, optical imaging and astronomy. In these areas, one often has access only to the Fourier magnitudes of a complex signal $\mathbf{x}$ , i.e., $\left|\mathcal{F}(\mathbf{x})\right|$ [Mil90, Rob93, Wal63, DF87]. The phase information is hard or infeasible to record due to physical constraints. The problem of recovering the signal $\mathbf{x}$ from its Fourier magnitudes $\left|\mathcal{F}(\mathbf{x})\right|$ is naturally termed (Fourier) phase retrieval (PR). It is easy to see PR as a special version of GPR, with the $\mathbf{a}_{k}$ ’s the Fourier basis vectors. GPR also sees applications in electron microscopy [MIJ+02], diffraction and array imaging [BDP+07, CMP11], acoustics [BCE06, Bal10], quantum mechanics [Cor06, Rei65] and quantum information [HMW13]. We refer the reader to survey papers [SEC+15, JEH15] for accounts of recent developments in the theory, algorithms, and applications of GPR.

For GPR, heuristic methods based on nonconvex optimization often work surprisingly well in practice (e.g., [Fie82, GS72], and many more cited in [SEC+15, JEH15]). However, investigation into provable recovery methods, particularly based on nonconvex optimization, has started only relatively recently [NJS13, CESV13, CSV13, CL14, CLS15a, WdM15, VX14, ABFM14, CLS15b, CC15, WWS15, ZCL16, ZL16, WGE16, KÖ16, GX16, BE16, Wal16]. The surprising effectiveness of nonconvex heuristics on GPR remains largely mysterious. In this paper, we take a step towards bridging this gap.

We focus on a natural least-squares formulationAnother least-squares formulation, $\operatorname{minimize}_{\mathbf{z}}\;\frac{1}{2m}\sum_{k=1}^{m}(y_{k}-\left|\mathbf{a}_{k}^{*}\mathbf{z}\right|)^{2}$ , was first studied in the seminal works [Fie82, GS72]. An obvious advantage of the $f(\mathbf{z})$ studied here is that it is differentiable in the sense of Wirtinger calculus introduced later. – discussed systematically in [SEC+15, JEH15] and first studied theoretically in [CLS15b, WWS15],

We assume the $\mathbf{a}_{k}$ ’s are independent identically distributed (i.i.d.) complex Gaussian:

$f(\mathbf{z})$ is a fourth-order polynomial in $\mathbf{z}$ ,Strictly speaking, $f(\mathbf{z})$ is not a complex polynomial in $\mathbf{z}$ over the complex field; complex polynomials are necessarily complex differentiable. However, $f(\mathbf{z})$ is a fourth order real polynomial in real and complex parts of $\mathbf{z}$ . and is nonconvex. A-priori, there is little reason to believe that simple iterative methods can solve this problem without special initialization. Typical local convergence (i.e., convergence to a local minimizer) guarantees in optimization require an initialization near the target minimizer [Ber99]. Moreover, existing results on provable recovery using (1.1) and related formulations rely on careful initialization in the vicinity of the ground truth [NJS13, CLS15b, CC15, WWS15, ZCL16, ZL16, WGE16, KÖ16, GX16, BE16, Wal16].

2 A Curious Experiment

We apply gradient descent to $f(\mathbf{z})$ , starting from a random initialization $\mathbf{z}^{(0)}$ :

where the step size $\mu$ is fixed for simplicityMathematically, $f(\mathbf{z})$ is not complex differentiable; here the gradient is defined based on the Wirtinger calculus [KD09]; see also [CLS15b]. This notion of gradient is a natural choice when optimizing real-valued functions of complex variables.. The result is quite striking (Figure 1): for a fixed problem instance (fixed set of random measurements and fixed target $\mathbf{x}$ ), gradient descent seems to always return a global minimizer (i.e., the target $\mathbf{x}$ up to a global phase shift), across many independent random initializations! This contrasts with the typical “mental picture” of nonconvex objectives as possessing many spurious local minimizers.

3 A Geometric Analysis

The numerical surprise described above is not completely isolated. Simple heuristic methods have been observed to work surprisingly well for practical PR [Fie82, GS72, SEC+15, JEH15]. In this paper, we take a step towards explaining this phenomenon. We show that although the function (1.1) is nonconvex, when $m$ is reasonably large, it actually has benign global geometry which allows it to be globally optimized by efficient iterative methods, regardless of the initialization.

Notice that (i) the only local minimizers are exactly $\pm\mathbf{x}$ – they are also global minimizers;Note that the global sign cannot be recovered. (ii) there are saddle points (and a local maximizer), but around them there is a negative curvature in the $\pm\mathbf{x}$ direction. Intuitively, any algorithm that can successfully escape from this kind of saddle point (and local maximizer) can in fact find a global minimizer, i.e., recover the target signal $\mathbf{x}$ .

Because of this global geometry, a wide range of efficient iterative methods can obtain a global minimizer to $f(\mathbf{z})$ , regardless of initialization. Examples include the noisy gradient and stochastic gradient methods [GHJY15] (see also [LSJR16, PP16]), curvilinear search [Gol80] and trust-region methods [CGT00, NP06, SQW15b]. The key property that the methods must possess is the ability to escape saddle points at which the Hessian has a strictly negative eigenvalueSuch saddle points are called ridable saddles [SQW15b] or strict saddles [GHJY15]; see [AG16] for computational methods for escaping from higher-order saddles also. .

We corroborate this claim by developing a second-order trust-region algorithm for this problem, and prove that (Theorem 3.10) (i) from any initialization, it efficiently obtains a close approximation (i.e., up to numerical precision) of the target $\mathbf{x}$ (up to a global phase) and (ii) it exhibits quadratic convergence in the vicinity of the global minimizers.

In sum, our geometrical analysis produces the following result.

Informal Statement of Our Main Results; See Theorem 2.2 and Theorem 3.10 and Remark 3.11. When $m\geq Cn\log^{3}n$ , with probability at least $1-cm^{-1}$ , the function $f(\mathbf{z})$ has no spurious local minimizers. The only global minimizers are the target $\mathbf{x}$ and its equivalent copies, and at all saddle points the function has directional negative curvature. Moreover, with at least the same probability, the trust-region method with properly set step size parameter find a global minimizer of $f(\mathbf{z})$ in polynomial time, from an arbitrary initialization in the zero-centered complex ball with radius $R_{0}\doteq 3(\tfrac{1}{m}\sum_{k=1}^{m}y_{k}^{2})^{1/2}$ . Here $C$ and $c$ are absolute positive constants.

The choice of $R_{0}$ above allows us to state a result with a concise bound on the number of iterations required to converge. However, under our probability model, w.h.p., the trust- region method succeeds from any initialization. There are two caveats to this claim. First, one must choose the parameters of the method appropriately. Second, the number of iterations depends on how far away from the truth the method starts.

The second-order trust-region method, albeit polynomial-time, may not be the most practical algorithm for solving GPR. Deriving the most practical algorithms is not the focus of this paper. We mentioned above that any iterative method with saddle-escaping capability can be deployed to solve the nonconvex formulation; our geometrical analysis constitutes a solid basis for developing and analyzing much more practical algorithms for GPR.

4 Prior Arts and Connections

The survey papers [SEC+15, JEH15] provide comprehensive accounts of recent progress on GPR. In this section, we focus on provable efficient (particularly, nonconvex) methods for GPR, and draw connections to other work on provable nonconvex heuristics for practical problems.

Although heuristic methods for GPR have been used effectively in practice [GS72, Fie82, SEC+15, JEH15], only recently have researchers begun to develop methods with provable performance guarantees. The first results of this nature were obtained using semidefinite programming (SDP) relaxations [CESV13, CSV13, CL14, CLS15a, WdM15, VX14]. While this represented a substantial advance in theory, the computational complexity of semidefinite programming limits the practicality of this approach.Another line of research [BCE06, BBCE09, ABFM14] seeks to co-design the measurements and recovery algorithms based on frame- or graph-theoretic tools. While revising this work, new convex relaxations based on second-order cone programming have been proposed [GS16, BR16, HV16b, HV16a].

(Global) Geometric analysis of other nonconvex problems.

The approach taken here is similar in spirit to our recent geometric analysis of a nonconvex formulation for complete dictionary learning [SQW15a]. For that problem, we also identified a similar geometric structure that allows efficient global optimization without special initialization. There, by analyzing the geometry of a nonconvex formulation, we derived a provable efficient algorithm for recovering square invertible dictionaries when the coefficient matrix has a constant fraction of nonzero entries. Previous results required the dictionary matrix to have far fewer nonzero entries. [SQW15b] provides a high-level overview of the common geometric structure that arises in dictionary learning, GPR and several other problems. This approach has also been applied to other problems [GHJY15, BBV16, BVB16, SC16, Kaw16, BNS16, GLM16, PKCS16]. Despite these similarities, GPR raises several novel technical challenges: the objective is heavy-tailed, and minimizing the number of measurements is importantThe same challenge is also faced by [CLS15b, CC15]..

Our work sits amid the recent surge of work on provable nonconvex heuristics for practical problems. Besides GPR studied here, this line of work includes low-rank matrix recovery [KMO10, JNS13, Har14, HW14, NNS+14, JN14, SL14, WCCL15, SRO15, ZL15, TBSR15, CW15], tensor recovery [JO14, AGJ14a, AGJ14b, AJSN15, GHJY15], structured element pursuit [QSW14, HSSS15], dictionary learning [AAJ+13, AGM13, AAN13, ABGM14, AGMM15, SQW15a], mixed regression [YCS13, SA14], blind deconvolution [LWB13, LJ15, LLJB15], super resolution [EW15], phase synchronization [Bou16], numerical linear algebra [JJKN15], and so forth. Most of the methods adopt the strategy of initialization plus local refinement we alluded to above. In contrast, our global geometric analysis allows flexible algorithm design (i.e., separation of geometry and algorithms) and gives some clues as to the behavior of nonconvex heuristics used in practice, which often succeed without clever initialization.

Recovering low-rank positive semidefinite matrices.

5 Notations, Organization, and Reproducible Research

Wirtinger calculus.

Note that above the partial derivatives $\frac{\partial g}{\partial\mathbf{z}}$ and $\frac{\partial g}{\partial\overline{\mathbf{z}}}$ are row vectors. The Wirtinger gradient and Hessian are defined as

where we sometimes write $\nabla_{\mathbf{z}}g\doteq\left(\frac{\partial g}{\partial\mathbf{z}}\right)^{*}$ and naturally $\nabla_{\overline{\mathbf{z}}}g\doteq\left(\frac{\partial g}{\partial\overline{\mathbf{z}}}\right)^{*}$ . With gradient and Hessian, the second-order Taylor expansion of $g(\mathbf{z})$ at a point $\mathbf{z}_{0}$ is defined as

For numerical optimization, we are most interested in real-valued $g$ . A real-valued $g$ is stationary at a point $\mathbf{z}$ if and only if

This is equivalent to the condition $\nabla_{\overline{\mathbf{z}}}g=\mathbf{0}$ , as $\nabla_{\mathbf{z}}g=\overline{\nabla_{\overline{\mathbf{z}}}g}$ when $g$ is real-valued. The curvature of $g$ at a stationary point $\mathbf{z}$ is dictated by the Wirtinger Hessian $\nabla^{2}g(\mathbf{z})$ . An important technical point is that the Hessian quadratic form involves left and right multiplication with a $2n$ -dimensional vector consisting of a conjugate pair $(\mathbf{\delta},\bar{\mathbf{\delta}})$ .

Following the above notation, we write $\nabla_{\mathbf{z}}f(\mathbf{z})$ and $\nabla_{\overline{\mathbf{z}}}f(\mathbf{z})$ for denoting the first and second half of $\nabla f(\mathbf{z})$ , respectively.

Organization.

The remainder of this paper is organized as follows. In Section 2, we provide a quantitative characterization of the global geometry for GPR and highlight main technical challenges in establishing the results. Based on this characterization, in Section 3 we present a modified trust-region method for solving GPR from an arbitrary initialization, which leads to our main computational guarantee. In Section 20 we study the empirical performance of our method for GPR. Section 5 concludes the main body with a discussion of open problems. Section 6 and Section 7 collect the detailed proofs to technical results for the geometric analysis and algorithmic analysis, respectively.

Reproducible research.

The code to reproduce all the figures and the experimental results can be found online:

The Geometry of the Objective Function

$\mathbf{z}=\mathbf{0}$ is a critical point, and the Hessian

Hence, $\mathbf{z}=\mathbf{0}$ is a local maximizer.

In the region $\left\{\mathbf{z}:0<\left\|\mathbf{z}\right\|^{2}<\tfrac{1}{2}\left\|\mathbf{x}\right\|^{2}\right\}$ , we have

So there is no critical point in this region.

Similarly, one can show that in $\mathbf{z}$ direction there is positive curvature. Hence, every $\mathbf{z}\in\mathcal{S}$ is a saddle point.

In the region $\left\{\mathbf{z}:\tfrac{1}{2}\left\|\mathbf{x}\right\|^{2}<\left\|\mathbf{z}\right\|^{2}<\left\|\mathbf{x}\right\|^{2}\right\}$ , any potential critical point must satisfy

In other words, $2\left\|\mathbf{z}\right\|^{2}-\left\|\mathbf{x}\right\|^{2}$ is the positive eigenvalue of the rank-one PSD Hermitian matrix $\mathbf{x}\mathbf{x}^{*}$ . Hence $2\left\|\mathbf{z}\right\|^{2}-\left\|\mathbf{x}\right\|^{2}=\left\|\mathbf{x}\right\|^{2}$ . This would imply that $\left\|\mathbf{z}\right\|=\left\|\mathbf{x}\right\|$ , which does not occur in this region.

When $\left\|\mathbf{z}\right\|^{2}=\left\|\mathbf{x}\right\|^{2}$ , critical points must satisfy

Summarizing the above observations completes the proof.

2 The Finite-Sample Landscape

The result is not surprising in view of the above characterization of the “large-sample” landscape. The intuition is as follows: since the objective function is a sum of independent random variables, when $m$ is sufficiently large, the function values, gradients and Hessians should be uniformly close to their expectations. Some care is required in making this intuition precise, however. Because the objective function contains fourth powers of Gaussian random variables, it is heavy tailed. Ensuring that $f$ and its derivatives are uniformly close to their expectations requires $m\geq Cn^{2}$ . This would be quite wasteful, since $\mathbf{x}$ has only $n$ degrees of freedom.

There exist positive absolute constants $C,c$ , such that when $m\geq Cn\log^{3}n$ , it holds with probability at least $1-cm^{-1}$ that $f(\mathbf{z})$ has no spurious local minimizers and the only local/global minimizers are exactly the target set $\mathcal{X}$ . More precisely, with the same probability,

where, assuming $\mathbf{h}(\mathbf{z})$ as defined in (1.3),

Here the regions $\mathcal{R}_{1},\;\mathcal{R}_{2}^{\mathbf{z}},\;\mathcal{R}_{2}^{\mathbf{h}}$ are defined as

By the restricted strong convexity we have established for $\mathcal{R}_{3}$ , and the integral form of Taylor’s theorem in Lemma A.2,

Summing up the above two inequalities, we obtain

In the asymptotic version, we characterized only the critical points. In this finite-sample version, we characterize the whole space and particularly provide quantitative control for regions near critical points (i.e., $\mathcal{R}_{1}\cup\mathcal{R}_{3}$ ). These concrete quantities are important for algorithm design and analysis (see Section 3).

3 Key Steps in the Geometric Analysis

Our proof strategy is fairly simple: we work out uniform bounds on the quantities for each of the three regions, and finally show the regions together cover the space. Since (1.1) and associated derivatives take the form of summation of $m$ independent random variables, the proof involves concentration and covering arguments [Ver12]. The main challenge in our argument will be the heavy-tailedness nature of $f$ and its gradient.

When $m\geq Cn\log n$ , it holds with probability at least $1-cm^{-1}$ that

for all $\mathbf{z}\in\mathcal{R}_{1}$ defined in (2.4). Here $C,c$ are positive absolute constants.

Next, we show that near $\mathcal{X}$ , the objective $f$ is strongly convex in any geometrically normal direction to the target set $\mathcal{X}$ (which is a one-dimensional circle). Combined with the smoothness property, this allows us to achieve a quadratic asymptotic rate of convergence with the modified trust-region algorithm we propose later.

When $m\geq Cn\log n$ for a sufficiently large constant $C$ , it holds with probability at least $1-cm^{-1}$ that

for all $\mathbf{z}\in\mathcal{R}_{3}$ defined in (2.5) and for all

Here $c$ is a positive absolute constant.

This restricted strong convexity result is qualitatively stronger than the local curvature property (i.e., Condition 7.11) established in [CLS15b]. Specifically, our result is equivalent to the following: for any line segment $\mathcal{L}$ that is normal to the circle $\mathcal{X}$ and contained in $\mathcal{S}_{3}$ , it holds that

In contrast, the local curvature property [CLS15b] only states that in such line segment,

While the local curvature property is sufficient to establish local convergence result of first-order method, the stronger restricted strong convex that provides uniform second-order curvature controls in the above $\mathcal{L}$ ’s are necessary to showing quadratic convergence result of second-order method, as we will do in the next section.

In Proposition 2.7 below, we will show that $\mathcal{R}_{2}^{\mathbf{z}}\cup\mathcal{R}_{2}^{\mathbf{h}}$ indeed cover $\mathcal{R}_{2}$ . We first show that the gradients in either region are bounded away from zero.

When $m\geq Cn\log n$ , it holds with probability at least $1-cm^{-1}$ that

for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{z}}$ defined in (2.7). Here $C,c$ are positive absolute constants.

It follows immediately that $\left\|\nabla_{\mathbf{z}}f(\mathbf{z})\right\|\geq\left\|\mathbf{x}\right\|^{2}/1000$ for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{z}}$ .

When $m\geq Cn\log^{3}n$ , it holds with probability at least $1-cm^{-1}$ that

for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}$ defined in (2.8). Here $C,c$ are positive absolute constants.

This clearly implies that $\left\|\nabla_{\mathbf{z}}f(\mathbf{z})\right\|\geq\left\|\mathbf{x}\right\|^{2}/1000$ for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}$ .

The quantity we want to control in the above proposition, $\Re\left(\mathbf{h}(\mathbf{z})^{*}\nabla_{\mathbf{z}}f(\mathbf{z})\right)$ , is the same quantity to be controlled in the local curvature condition (e.g., Condition 7.11) in [CLS15b]. There only points near $\mathcal{X}$ (i.e., roughly our $\mathcal{R}_{3}$ below) are considered, and the target is proving the local curvature is in a certain sense positive. Here the points of interest are not close to $\mathcal{X}$ , and the target is only showing that at these points, the directional derivative in $\mathbf{h}(\mathbf{z})$ direction is bounded away from zero.

Finally, we show that the two sub-regions, $\mathcal{R}_{2}^{\mathbf{z}}$ and $\mathcal{R}_{2}^{\mathbf{h}}$ , together cover $\mathcal{R}_{2}$ . Formally,

We have $\mathcal{R}_{2}\subset\mathcal{R}_{2}^{\mathbf{z}}\cup\mathcal{R}_{2}^{\mathbf{h}}$ .

The main challenge is that the function (1.1) is a fourth-order polynomial, and most quantities arising in the above propositions involve heavy-tailed random variables. For example, we need to control

Optimization by Trust-Region Method (TRM)

Based on the geometric characterization in Section 2.2, we describe a second-order trust-region algorithm that produces a close approximation (i.e., up to numerical precision) to a global minimizer of (1.1) in polynomial number of steps. One interesting aspect of $f$ in the complex space is that each point has a “circle” of equivalent points that have the same function value. Thus, we constrain each step to move “orthogonal” to the trivial direction. This simple modification helps the algorithm to converge faster in practice, and proves important to the quadratic asymptotic convergence rate in theory.

The basic idea of the trust-region method is simple: we generate a sequence of iterates $\mathbf{z}^{(0)},\mathbf{z}^{(1)},\dots$ , by repeatedly constructing quadratic approximations $\widehat{f}(\mathbf{\delta};\mathbf{z}^{(r)})\approx f(\mathbf{z}^{(r)}+\mathbf{\delta})$ , minimizing $\widehat{f}$ to obtain a step $\mathbf{\delta}$ , and setting $\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}$ . More precisely, we approximate $f(\mathbf{z})$ around $\mathbf{z}^{(r)}$ using the second-order Taylor expansion,

Then, the quadratic approximation of $f(\mathbf{z})$ around $\mathbf{z}^{(r)}$ can be rewritten as

By structure of the Wirtinger gradient $\nabla f(\mathbf{z}^{(r)})$ and Wirtinger Hessian $\nabla^{2}f(\mathbf{z}^{(r)})$ , $\mathbf{g}(\mathbf{z}^{(r)})$ and $\mathbf{H}(\mathbf{z}^{(r)})$ contain only real entries. Thus, the problem (3.2) is in fact an instance of the classical trust-region subproblem w.r.t. real variable $\mathbf{\xi}$ . A minimizer to (3.1) can be obtained from a minimizer of (3.2) $\mathbf{\xi}_{\star}$ as $\mathbf{\delta}_{\star}=\mathbf{U}\mathbf{\xi}_{\star}$ .

So, any method which can solve the classical trust-region subproblem can be directly applied to the modified problem (3.1). Although the resulting problem can be nonconvex (as $\mathbf{H}(\mathbf{z}^{(r)})$ in (3.4) can be indefinite), it can be solved in polynomial time, by root-finding [MS83, CGT00] or SDP relaxation [RW97, FW04]. Our convergence guarantees assume an exact solution of this problem; we outline below how to obtain an $\varepsilon$ -near solution (for arbitrary $\varepsilon>0$ ) via bisection search in polynomial time (i.e., polynomial in $\varepsilon^{-1}$ ), due to [Ye92, VZ90]. At the end of Section 3.3, we will discuss robustness of our convergence guarantee to the numerical imperfection, and provide an estimate of the total time complexity. In practice, though, even very inexact solutions of the trust-region subproblem suffice.This can also be proved, in a relatively straightforward way, using the geometry of the objective $f$ . In the interest of brevity, we do not pursue this here. Inexact iterative solvers for the trust-region subproblem can be engineered to avoid the need to densely represent the Hessian; these methods have the attractive property that they attempt to optimize the amount of Hessian information that is used at each iteration, in order to balance rate of convergence and computation.

Now we describe briefly how to apply bisection search to find an approximate solution to the classical trust-region subproblem, i.e.,

We first isolate the case when there is a unique interior global minimizer. In this case $\lambda_{\star}=0$ and $\mathbf{A}\succeq\mathbf{0}$ . When $\mathbf{A}$ has a zero eigenvalue with an associated eigenvector $\mathbf{v}$ and $\mathbf{w}_{\star}$ is an interior minimizer, all feasible $\mathbf{w}_{\star}+t\mathbf{v}$ are also minimizers. So uniqueness here requires $\mathbf{A}\succ\mathbf{0}$ . Thus, one can try to sequentially (1) test if $\mathbf{A}$ is positive definite; (2) solve for $\mathbf{w}_{\star}$ from $\mathbf{A}\mathbf{w}_{\star}=-\mathbf{b}$ ; and (3) test if $\mathbf{w}_{\star}$ is feasible. If the procedure goes through, a minimizer has been found. It is obvious the arithmetic complexity is $O(d^{3})$ .

When $\mathbf{A}$ is not positive definite, there must be minimizers on the boundary. Then, we need to find a $\lambda_{\star}\geq 0$ such that

One remarkable fact is that although the minimizers may not be unique, the multiplier $\lambda_{\star}$ is unique. The unique $\lambda_{\star}$ can be efficiently approximated via a bisection search algorithm (Algorithm 1).

The above algorithm finds a $\widehat{\lambda}$ with $|\widehat{\lambda}-\lambda_{\star}|\leq\varepsilon$ with arithmetic complexity $O(d^{3}\log\left(1/\varepsilon\right))$ . Moreover, this can be translated into a convergence result in function value. Define $\widehat{\mathbf{w}}=-(\mathbf{A}+\widehat{\lambda}\mathbf{I})^{-1}\mathbf{b}$ if $\mathbf{A}+\widehat{\lambda}\mathbf{I}\succ\mathbf{0}$ , and $\widehat{\mathbf{w}}=-(\mathbf{A}+\widehat{\lambda}\mathbf{I})^{\dagger}\mathbf{b}+t\mathbf{v}$ if $\mathbf{A}+\widehat{\lambda}\mathbf{I}$ has zero eigenvalue with an associated eigenvector $\mathbf{v}$ and $t$ makes $\left\|\widehat{\mathbf{w}}\right\|=r$ . Then,

2 Convergence Analysis

holds with probability at least $1-c_{b}m^{-1}$ , provided $m\geq Cn\log n$ for a sufficiently large $C$ . It can be checked that when

Thus, we conclude that when $m\geq Cn\log n$ , w.h.p., the sublevel set $\left\{\mathbf{z}:f(\mathbf{z})\leq f(\mathbf{z}^{(0)})\right\}$ is contained in the set

Lipschitz Properties

with probability at least $1-c_{a}\exp(-c_{b}m)$ , provided $m\geq Cn$ for a sufficiently large absolute constant $C$ . Here $c_{a}$ through $c_{e}$ are positive absolute constants.

Property of Hessians near the Target Set 𝒳𝒳\mathcal{X}.

We will provide spectral upper and lower bounds for the (restricted) Hessian matrices $\mathbf{H}(\mathbf{z})$ , where $\mathbf{H}(\mathbf{z})$ is as defined in (3.3). These bounds follow by bounding $\mathbf{H}(\mathbf{z})$ on $\mathcal{X}$ , and then using the Lipschitz property of the Hessian to extend the bounds to a slightly larger region around $\mathcal{X}$ .

When $m\geq Cn\log n$ , it holds with probability at least $1-cm^{-1}$ that

for all $\mathbf{z}\in\mathcal{R}_{3}^{\prime}$ with $m_{H}=22/25\left\|\mathbf{x}\right\|^{2}$ and $M_{H}=9/2\left\|\mathbf{x}\right\|^{2}$ . Here $C,c$ are positive absolute constants.

3 Proof of TRM Convergence

We are now ready to prove the convergence of the TRM algorithm. Throughout, we will assume $m\geq Cn\log^{3}n$ for a sufficiently large constant $C$ , so that all the events of interest hold w.h.p..

The next auxiliary lemma makes precise the intuition that whenever there exists a descent direction, the step size parameter $\Delta$ is sufficiently small, a trust-region step will decrease the objective.

For any $\mathbf{z}\in\Gamma$ , suppose there exists a vector $\mathbf{\delta}$ with $\left\|\mathbf{\delta}\right\|\leq\Delta$ such that

for a certain $d>0$ . Then the trust-region subproblem (3.1) returns a point $\mathbf{\delta}_{\star}$ with $\left\|\mathbf{\delta}_{\star}\right\|\leq\Delta$ and

The next proposition says when $\Delta$ is chosen properly, a trust-region step from a point with negative local curvature decreases the function value by a concrete amount.

Suppose the current iterate $\mathbf{z}^{(r)}\in\mathcal{R}_{1}\cap\Gamma$ , and our trust-region size satisfies

Then an optimizer $\mathbf{\delta}_{\star}$ to (3.1) leads to $\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star}$ that obeys

The next proposition shows that when $\Delta$ is chosen properly, a trust-region step from a point with strong gradient decreases the objective by a concrete amount.

Suppose our current iterate $\mathbf{z}^{(r)}\in\mathcal{R}_{2}\cap\mathcal{R}_{1}^{c}\cap\Gamma$ , and our trust-region size satisfies

Then an optimizer $\mathbf{\delta}_{\star}$ to (3.1) leads to $\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star}$ that obeys

Now, we argue about $\mathcal{R}_{3}$ , in which the behavior of the algorithm is more complicated. For the region $\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}$ , the restricted strong convexity in radial directions around $\mathcal{X}$ as established in Proposition 2.4 implies that the gradient at any point in $\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}$ is nonzero. Thus, one can treat this as another strong gradient region, and carry out essentially the same argument as in Proposition 3.5.

Suppose our current iterate $\mathbf{z}^{(r)}\in\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}$ , and our trust-region size satisfies

Then an optimizer $\mathbf{\delta}_{\star}$ to (3.1) leads to $\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star}$ that obeys

Our next several propositions show that when the iterate sequence finally moves into $\mathcal{R}_{3}^{\prime}$ , in general it makes a finite number of consecutive constrained steps in which the trust-region constraints are always active, followed by ultimate consecutive unconstrained steps in which the the trust-region constraints are always inactive until convergence. Depending on the initialization and optimization parameters, either constrained or unconstrained steps can be void. The next proposition shows that when $\Delta$ is chosen properly, a constrained step in $\mathcal{R}_{3}^{\prime}$ decreases the objective by a concrete amount.

Suppose our current iterate $\mathbf{z}^{(r)}\in\mathcal{R}_{3}^{\prime}$ , and the trust-region subproblem takes a constrained step, i.e., the optimizer to (3.1) satisfies $\left\|\mathbf{\delta}_{\star}\right\|=\Delta$ . We have the $\mathbf{\delta}_{\star}$ leads to

Here $m_{H}$ and $M_{H}$ are as defined in Lemma 3.2.

The next proposition shows that when $\Delta$ is properly tuned, an unconstrained step in $\mathcal{R}_{3}^{\prime}$ dramatically reduces the norm of the gradient.

Suppose our current iterate $\mathbf{z}^{(r)}\in\mathcal{R}_{3}^{\prime}$ , and the trust-region subproblem takes an unconstrained step, i.e., the unique optimizer to (3.1) satisfies $\left\|\mathbf{\delta}_{\star}\right\|<\Delta$ . We have the $\mathbf{\delta}_{\star}$ leads to $\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star}$ that obeys

Here $M_{H}$ and $m_{H}$ are as defined in Lemma 3.2.

The next proposition shows that when $\Delta$ is properly tuned, as soon as an unconstrained $\mathcal{R}_{3}^{\prime}$ step is taken, all future iterations take unconstrained $\mathcal{R}_{3}^{\prime}$ steps. Moreover, the sequence converges quadratically to the target set $\mathcal{X}$ .

for all integers $r^{\prime}\geq 1$ , provided that

Now we are ready to piece together the above technical propositions to prove our main algorithmic theorem.

Here $c_{a}$ through $c_{d}$ are positive absolute constants.

Proof When $m\geq C_{1}n\log^{3}n$ for a sufficiently large constant $C_{1}$ , the assumption of Theorem 2.2 is satisfied. Moreover, with probability at least $1-c_{2}m^{-1}$ , the following estimates hold:

for a certain positive absolute constant $C_{3}$ . From the technical lemmas and propositions in Section 3.3, it can be verified that when

for a positive absolute constant $c_{4}$ , all requirements on $\Delta$ are satisfied.

In fact, the previous argument implies a generic iterate sequence consists of two phases: the first phase that takes consecutive $\mathcal{R}_{A}$ or constrained $\mathcal{R}_{3}^{\prime}$ steps, and thereafter the second phase that takes consecutive unconstrained $\mathcal{R}_{3}^{\prime}$ steps till convergence. Either of the two can be absent depending on the initialization and parameter setting for the TRM algorithm.

Since $f\geq 0$ , by Proposition 3.4, 3.5, 3.6, and 3.7, from $\mathbf{z}^{(0)}$ it takes at most

steps for the iterate sequence to enter $\mathcal{R}_{3}^{\prime}$ .It is possible to refine the argument a bit by proving that the sequence does not exit $\mathcal{R}_{3}^{\prime}$ once entering it, in which case the bound can be tightened as $f(\mathbf{z}^{(0)})/\min(d_{1},d_{2},d_{3})$ . We prefer to state this crude bound to save the additional technicality. Let $r_{0}$ denote the index of the first iteration for which $\mathbf{z}^{(r_{0})}\in\mathcal{R}_{3}^{\prime}$ . Once the sequence enters $\mathcal{R}_{3}^{\prime}$ , there are three possibilities:

The sequence always takes constrained steps in $\mathcal{R}_{3}^{\prime}$ and since the function $f(\mathbf{z})$ is lower bounded ( $\geq 0$ ), it reaches the target set $\mathcal{X}$ in finitely many steps.

The sequence takes constrained steps until reaching certain point $\mathbf{z}^{\prime}\in\mathcal{R}_{3}^{\prime}$ such that $f(\mathbf{z}^{\prime})\leq f(\mathbf{x})+d_{4}=d_{4}$ , where $d_{4}$ is defined in Proposition 3.7. Since a constrained step in $\mathcal{R}_{3}^{\prime}$ must decrease the function value by at least $d_{4}$ , all future steps must be unconstrained. Proposition 3.9 suggests that the sequence will converge quadratically to the target set $\mathcal{X}$ .

The sequence starts to take unconstrained steps at a certain point $\mathbf{z}^{\prime\prime}\in\mathcal{R}_{3}^{\prime}$ such that $f(\mathbf{z}^{\prime\prime})\geq f(\mathbf{x})+d_{4}$ . Again Proposition 3.9 implies that the sequence will converge quadratically to the target set $\mathcal{X}$ .

In sum, by Proposition 3.4, Proposition 3.5, Proposition 3.6, Proposition 3.7, and Proposition 3.9, the number of iterations to obtain an $\varepsilon$ -close solution to the target set $\mathcal{X}$ can be grossly bounded by

Using our previous estimates of $m_{H}$ , $M_{H}$ , and $L_{H}$ , and taking $\min\{d_{1},d_{2},d_{3},d_{4}\}=c_{5}\Delta^{2}\left\|\mathbf{x}\right\|^{2}$ , we arrive at the claimed result.

Our above results are conditioned on exact arithmetic for solving the trust-region subproblem. We first discuss the effect of inexact arithmetics. Fix an $\varepsilon$ to be determined later, suppose our trust-region subproblem solver always returns a $\widehat{\mathbf{w}}$ such that $Q(\widehat{\mathbf{w}})-Q(\mathbf{w}_{\star})\leq\varepsilon$ – based on our discussion in Section 3.1, such $\widehat{\mathbf{w}}$ can always be found in $O(n^{3}\log(1/\varepsilon))$ time. Then, in Lemma 3.3, the trust-region subproblem returns a feasible point $\left\|\mathbf{\delta}_{\star}\right\|$ such that

Accordingly, the $d_{i}$ for $i\in$ in Proposition 3.4 through Proposition 3.7 are changed to $d_{i}-\varepsilon$ , with other conditions unaltered. Thus, combining the above with arguments in Theorem 3.10, we can take

such that from an initialization $\mathbf{z}^{(0)}$ , the iterate sequence takes at most

steps to stay in region $\mathcal{R}_{3}^{\prime}$ and possibly to start consecutive unconstrained steps. For the unconstrained steps in $\mathcal{R}_{3}^{\prime}$ , since exact arithmetic is possible as we discussed in Section 3.1, the results in Proposition 3.8 and Proposition 3.9 are intact. The step estimate for this part in Theorem 3.10 remains valid. Overall, by our above choice, inexact arithmetics at most double the number of iterations to attain an $\varepsilon$ -near solution to $\mathcal{X}$ .

If we set $\Delta=cn^{-7/2}\log^{-7/2}m\left\|\mathbf{x}\right\|$ , then each trust-region iteration costs $O(n^{3}\log\left(n\left\|\mathbf{x}\right\|\log m\right))$ . Moreover, it takes

iterations to arrive at an $\varepsilon$ -near solution to $\mathcal{X}$ .

Numerical Simulations

Our convergence analysis for the TRM is based on two idealizations: (i) the trust-region subproblem is solved exactly; and (ii) the step-size is fixed to be sufficiently small. These simplifications ease the analysis, but also render the TRM algorithm impractical. In practice, the trust-region subproblem is never exactly solved, and the trust-region step size is adjusted to the local geometry, by backtracking. It is relatively straightforward to modify our analysis to account for inexact subproblem solvers; for sake of brevity, we do not pursue this hereThe proof ideas are contained in Chap 6 of [CGT00]; see also [AMS09]. Intuitively, such result is possible because reasonably good approximate solutions to the TRM subproblem make qualitatively similar progress as the exact solution. Recent work [CGT12, BAC16] has established worst-case polynomial iteration complexity (under reasonable assumptions on the geometric parameters of the functions, of course) of TRM to converge to point verifying the second-order optimality conditions. Their results allow inexact trust-region subproblem solvers, as well as adaptive step sizes. Based on our geometric result, we could have directly called their results, producing slightly worse iteration complexity bounds. It is not hard to adapt their proof taking advantage of the stronger geometric property we established and produce tighter results. .

In this section, we investigate experimentally the number of measurements $m$ required to ensure that $f(\mathbf{z})$ is well-structured, in the sense of our theorems. This entails solving large instances of $f(\mathbf{z})$ . To this end, we deploy the Manopt toolbox [BMAS14]Available online: http://www.manopt.org. . Manopt is a user-friendly Matlab toolbox that implements several sophisticated solvers for tackling optimization problems on Riemannian manifolds. The most developed solver is based on the TRM. This solver uses the truncated conjugate gradient (tCG; see, e.g., Section 7.5.4 of [CGT00]) method to (approximately) solve the trust-region subproblem (vs. the exact solver in our analysis). It also dynamically adjusts the step size. However, the original implementation (Manopt 2.0) is not adequate for our purposes. Their tCG solver uses the gradient as the initial search direction, which does not ensure that the TRM solver can escape from saddle points [ABG07, AMS09]. We modify the tCG solver, such that when the current gradient is small and there is a negative curvature direction (i.e., the current point is near a saddle point or a local maximizer for $f(\mathbf{z})$ ), the tCG solver explicitly uses the negative curvature direction…adjusted in sign to ensure positive correlation with the gradient – if it does not vanish. as the initial search direction. This modificationSimilar modification is also adopted in the TRM algorithmic framework in the recent work [BAC16] (Algorithm 3). ensures the TRM solver always escapes saddle points/local maximizers with directional negative curvature. Hence, the modified TRM algorithm based on Manopt is expected to have the same qualitative behavior as the idealized version we analyzed.

We fix $n=1,000$ and vary the ratio $m/n$ from $4$ to $10$ .

For each $m$ , we generate a fixed instance: a fixed signal $\mathbf{x}$ , and a fixed set of complex Gaussian vectors. We run the TRM algorithm $25$ times for each problem instance, with independent random initializations. Successfully recovery is declared if at termination the optimization variable $\mathbf{z}_{\infty}$ satisfies

Discussion

In this work, we provide a complete geometric characterization of the nonconvex formulation (1.1) for the GPR problem. The benign geometric structure allows us to design a second-order trust-region algorithm that efficiently finds a global minimizer of (1.1), without special initializations. We close this paper by discussing possible extensions and relevant open problems.

Our result (Theorem 2.2 and Theorem 3.10) indicates that $m\geq C_{1}n\log^{3}(n)$ samples are sufficient to guarantee the favorable geometric property and efficient recovery, while our simulations suggested that $C_{2}n\log(n)$ or even $C_{3}n$ is enough. For efficient recovery only, $m\geq C_{4}n$ are known to be sufficient [CC15] (and for all signals; see also [CLS15b, WGE16, ZL16]). It is interesting to see if the gaps can be closed. Our current analysis pertains to Gaussian measurements only which are not practical, it is important to extend the geometric analysis to more practical measurement schemes, such as t-designs [GKK13] and masked Fourier transform measurements [CLS15a]. A preliminary study of the low-dimensional function landscape for the latter scheme (for reduced real version) produces very positive result; see Figure 5.

Sparse phase retrieval.

A special case of GPR is when the underlying signal $\mathbf{x}$ is known to be sparse, which can be considered as a quadratic compressed sensing problem [OYVS13, OYDS13, OYDS12, LV13, JOH13, SBE14]. Since $\mathbf{x}$ is sparse, the lifted matrix $\mathbf{X}=\mathbf{x}\mathbf{x}^{*}$ is sparse and has rank one. Thus, existing convex relaxation methods [OYVS13, OYDS13, LV13, JOH13] formulated it as a simultaneously low-rank and sparse recovery problem. For the latter problem, however, known convex relaxations are suboptimal [OJF+12, MHWG14]. Let $k$ be the number of nonzeros in the target signal. [LV13, JOH13] showed that natural convex relaxations require $C_{5}k^{2}\log n$ samples for correct recovery, instead of the optimal order $O(k\log(n/k)$ . A similar gap is also observed with certain nonconvex methods [CLM15]. It is tempting to ask whether novel nonconvex formulations and analogous geometric analysis as taken here could shed light on this problem.

Proofs of Technical Results for Function Landscape

Lemma 6.1 to 6.3 first appeared in [CLS15b]; we include the full proofs here for completeness.

We now evaluate the three terms separately. Note that the law $\mathcal{CN}$ is invariant to unitary transform. Thus,

Gathering the above results, we obtain (6.1). By taking Wirtinger derivative (1.4) with respect to (6.1), we obtain the Wirtinger gradient and Hessian in (6.2), (6.3) as desired.

Similar calculation yields the second expectation.

Here $C(\delta)$ is a constant depending on $\delta$ and $c_{a}$ , $c_{b}$ and $c_{c}$ are positive absolute constants.

Proof We work out the results on $\frac{1}{m}\sum_{k=1}^{m}\left|\mathbf{a}_{k}^{*}\mathbf{v}\right|^{2}\mathbf{a}_{k}\mathbf{a}_{k}^{*}$ first. By the unitary invariance of the Gaussian measure and rescaling, it is enough to consider $\mathbf{v}=\mathbf{e}_{1}$ . We partition each vector $\mathbf{a}_{k}$ as $\mathbf{a}_{k}=[a_{k}(1);\widetilde{\mathbf{a}}_{k}]$ and upper bound the target quantity as:

By Chebyshev’s inequality, we have with probability at least $1-c_{1}\delta^{-2}m^{-1}$ ,

For all $\mathbf{w}$ and all $k\in[m]$ , $\widetilde{\mathbf{a}}_{k}^{*}\mathbf{w}$ is distributed as $\mathcal{CN}(1)$ that is independent of the $\{a_{k}(1)\}$ sequence. So for one realization of $\{a_{k}(1)\}$ , the Hoeffding-type inequality of Lemma A.5 implies

for any $\mathbf{w}$ with $\left\|\mathbf{w}\right\|=1$ and any $t>0$ . Taking $t=\delta/8$ , together with a union bound on a $1/2$ -net on the sphere, we obtain

Now an application of Chebyshev’s inequality gives that $\sum_{k=1}^{m}\left|a_{k}(1)\right|^{6}\leq 20m$ with probability at least $1-c_{3}m^{-1}$ . Substituting this into the above, we conclude that whenever $m\geq C_{4}\delta^{-2}n$ for some sufficiently large $C_{4}$ ,

with probability at least $1-c_{3}m^{-1}-\exp\left(-c_{5}\delta^{2}m\right)$ .

For all fixed $\mathbf{w}$ and all $k\in[m]$ , $\widetilde{\mathbf{a}}_{k}^{*}\mathbf{w}\sim\mathcal{CN}(1)$ . Thus, $\left|\widetilde{\mathbf{a}}_{k}^{*}\mathbf{w}\right|^{2}-1$ is centered sub-exponential. So for one realization of $\{a_{k}(1)\}$ , Bernstein’s inequality (Lemma A.6) implies

for any fixed $\mathbf{w}$ with $\left\|\mathbf{w}\right\|=1$ and any $t>0$ . Taking $t=\delta/8$ , together with a union bound on a $1/2$ -net on the sphere, we obtain

Chebyshev’s inequality and the union bound give that

hold with probability at least $1-c_{8}m^{-1}-m^{-4}$ . To conclude, when $m\geq C_{9}(\delta)\delta^{-2}n\log n$ for some sufficiently large constant $C_{9}(\delta)$ ,

with probability at least $1-c_{8}m^{-1}-m^{-4}-2\exp\left(-c_{10}\delta^{2}m/\log m\right)$ .

Collecting the above bounds and probabilities yields the claimed results. Similar arguments prove the claim on $\frac{1}{m}\sum_{k=1}^{m}\left(\mathbf{a}_{k}^{*}\mathbf{v}\right)\mathbf{a}_{k}\mathbf{a}_{k}^{\top}$ also, completing the proof.

Let $\mathbf{a}_{1},\dots,\mathbf{a}_{m}$ be i.i.d. copies of $\mathbf{a}\sim\mathcal{CN}(n)$ . For any $\delta\in(0,1)$ , when $m\geq C(\delta)n\log n$ , it holds with probability at least $1-c^{\prime}\exp\left(-c(\delta)m\right)-c^{\prime\prime}m^{-n}$ that

Here $C(\delta)$ and $c(\delta)$ are constants depending on $\delta$ and $c^{\prime}$ and $c^{\prime\prime}$ are positive absolute constants.

as $\max_{k\in[m]}\left\|\mathbf{a}_{k}\right\|^{2}\leq 5n\log m$ with probability at least $1-c_{2}m^{-n}$ , and $\left\|\sum_{k=1}^{m}\mathbf{a}_{k}\mathbf{a}_{k}^{*}\right\|\leq 2m$ with probability at least $1-\exp(-c_{3}m)$ . Thus,

Taking $\varepsilon=c_{4}(\delta)/(n\log m)$ for a sufficiently small $c_{4}(\delta)>0$ , we obtain that with probability at least $1-\exp\left(-c_{1}\delta^{2}m+4n\log(3n\log m/c_{4}(\delta))\right)-c_{5}m^{-n}$ ,

which, together with continuity of the function $(\mathbf{w},\mathbf{z})\mapsto\left|\mathbf{w}^{*}\mathbf{z}\right|^{2}$ , implies

It is enough to take $m\geq C_{6}\delta^{-2}n\log n$ to ensure the desired event happens w.h.p..

2 Proof of Proposition 2.3

Lemma 6.3 implies that when $m\geq C_{1}n\log n$ , w.h.p.,

On the other hand, by Lemma A.7, we have that

holds with probability at least $1-\exp(-c_{2}m)$ . For the second summation, we have

w.h.p., provided $m\geq C_{3}n\log n$ , according to Lemma 6.3.

Collecting the above estimates, we have that when $m\geq C_{4}n\log n$ for a sufficiently large constant $C_{4}$ , w.h.p.,

for all $\mathbf{z}\in\mathcal{R}_{1}$ . Dividing both sides of the above by $\left\|\mathbf{x}\right\|^{2}$ gives the claimed results.

3 Proof of Proposition 2.4

Lemma 6.4 implies when $m\geq C_{1}n\log n$ for sufficiently large constant $C_{1}$ , w.h.p.,

for all $\mathbf{g}\in\mathcal{T}$ , where we have used that $\Im(\mathbf{g}^{*}\mathbf{x})=0\Longrightarrow\Im(\mathbf{x}+\mathbf{g})^{*}\mathbf{g}=0$ to simplify the results.

Collecting the above estimates, we obtain that when $m\geq C_{4}n\log n$ , w.h.p.,

To provide a lower bound for the above, we let $\Re(\mathbf{x}^{*}\mathbf{g})=\mathbf{x}^{*}\mathbf{g}=\lambda\left\|\mathbf{x}\right\|$ with $\lambda\in$ and $t=\eta\left\|\mathbf{x}\right\|$ with $\eta\in[0,1/\sqrt{7}]$ . Then

For any fixed $\eta$ , it is easy to see that minimizer occurs when $\lambda=-\frac{599}{299}\eta$ . Plugging this into $\phi(\lambda,\eta)$ , one obtains $\phi(\lambda,\eta)\geq-\frac{241}{20}\eta^{2}\geq-\frac{241}{140}$ . Thus,

4 Proof of Proposition 2.5

By Lemma 6.4, when $m\geq C_{1}n\log n$ for some sufficiently large $C_{1}$ , w.h.p.,

for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{z}}$ , as desired.

5 Proof of Proposition 2.6

Proof We abbreviate $\phi(\mathbf{z})$ as $\phi$ below. Note that

To bound the first term, for a fixed $\tau$ to be determined later, define:

Obviously $S_{1}(\mathbf{z})\geq S_{2}(\mathbf{z})$ for all $\mathbf{z}$ as

Because $S_{1}(\mathbf{z})\geq S_{2}(\mathbf{z})$ as shown above,

Thus, the unconditional probability can be bounded as

Taking $\tau=\sqrt{10\log m}\left\|\mathbf{x}\right\|$ , we obtain

where we have used $\left\|\mathbf{z}\right\|\leq\left\|\mathbf{x}\right\|$ to simplify the last inequality. Now we use the moment-control Bernstein’s inequality (Lemma A.8) to get a bound for probability on deviation of $S_{2}(\mathbf{z})$ . To this end, we have

where the second inequality holds for any integer $p\geq 3$ . Hence one can take

in the deviation inequality of $S_{2}(\mathbf{z})$ to obtain

where we have used the fact $\left\|\mathbf{z}\right\|\leq\left\|\mathbf{x}\right\|$ and assumed $4\sqrt{3}m^{-5/2}\leq 1/200$ to simplify the probability. Thus, with probability at least $1-m^{-4}-\exp\left(-c_{2}m/\log^{2}m+2n\log(3\left\|\mathbf{x}\right\|/\varepsilon)\right)$ , it holds that

Moreover, for any $\mathbf{z},\mathbf{z}^{\prime}\in\mathcal{R}_{2}^{\mathbf{h}}$ , we have

for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}$ , with probability at least $1-c_{6}m^{-1}-c_{7}\exp\left(-c_{2}m/\log^{2}m+c_{9}n\log(C_{8}n\log m)\right)$ .

Combining the above estimates, when $m\geq C_{10}n\log^{3}n$ for sufficiently large constant $C_{10}$ , w.h.p.,

for all $\mathbf{z}\in\mathcal{R}_{2}^{\mathbf{h}}$ , as claimed.

6 Proof of Proposition 2.7

For convenience, we will define a relaxed $\mathcal{R}_{2}^{\mathbf{h}}$ region

In this case, when $\left\|\mathbf{z}\right\|^{2}\leq\tfrac{398}{601}\left\|\mathbf{x}\right\|^{2}$ ,

On the other hand, when $\left\|\mathbf{z}\right\|^{2}\geq\tfrac{626}{995}\left\|\mathbf{x}\right\|^{2}$ ,

Since $\tfrac{398}{601}>\tfrac{626}{995}$ , we conclude that $\mathcal{R}_{a}\subset\mathcal{R}_{1}\cup\mathcal{R}_{2}^{\mathbf{z}}$ .

So $\mathcal{R}_{b}$ is covered by $\mathcal{R}_{1}$ .

We show this region is covered by $\mathcal{R}_{2}^{\mathbf{z}}$ and $\mathcal{R}_{2}^{\mathbf{h}^{\prime}}$ . First, for any $\mathbf{z}\in\mathcal{R}_{c}$ , when $\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1983}{1990}}\left\|\mathbf{x}\right\|$ ,

implying that $\mathcal{R}_{c}\cap\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1983}{1990}}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{2}^{\mathbf{z}}$ . Next we suppose $\left\|\mathbf{z}\right\|=\lambda\left\|\mathbf{x}\right\|$ and $\left|\mathbf{x}^{*}\mathbf{z}\right|=\eta\left\|\mathbf{x}\right\|\left\|\mathbf{z}\right\|$ , where $\lambda\in[\tfrac{11}{20},\sqrt{\tfrac{1984}{1990}}]$ and $\eta\in[\tfrac{1}{2},\tfrac{99}{100}]$ , and show the rest of $\mathcal{R}_{c}$ is covered by $\mathcal{R}_{2}^{\mathbf{h}^{\prime}}$ . To this end, it is enough to verify that

over this subregion. Writing the left as a function of $\lambda,\eta$ and eliminating $\left\|\mathbf{x}\right\|$ and $\left\|\mathbf{z}\right\|$ , it is enough to show

We show that this region is covered by $\mathcal{R}_{2}^{\mathbf{z}}$ , $\mathcal{R}_{3}$ , and $\mathcal{R}_{2}^{\mathbf{h}^{\prime}}$ together. First, for any $\mathbf{z}\in\mathcal{R}_{d}$ , when $\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1001}{995}}\left\|\mathbf{x}\right\|$ ,

So $\mathcal{R}_{d}\cap\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\geq\sqrt{\tfrac{1001}{995}}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{2}^{\mathbf{z}}$ . Next, we show that any $\mathbf{z}\in\mathcal{R}_{d}$ with $\left\|\mathbf{z}\right\|\leq 24/25\cdot\left\|\mathbf{x}\right\|$ is contained in $\mathcal{R}_{2}^{\mathbf{h}^{\prime}}$ . Similar to the above argument for $\mathcal{R}_{c}$ , it is enough to show

as $\frac{1}{250}\sqrt{1+\lambda^{2}-2\eta\lambda}<0.00185$ in this case. Since the Hessian is again always indefinite, we check the optimal value for $\eta=99/100$ and $\eta=1$ and do the comparison. It can be verified $p\geq 0.00627>0.00185$ in this case. So $\mathcal{R}_{d}\cap\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\leq\frac{24}{25}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{2}^{\mathbf{h}^{\prime}}$ . Finally, we consider the case $\tfrac{23}{25}\left\|\mathbf{x}\right\|\leq\left\|\mathbf{z}\right\|\leq\sqrt{\tfrac{1005}{995}}\left\|\mathbf{x}\right\|$ . A $\lambda,\eta$ argument as above leads to

implying that $\mathcal{R}_{d}\cap\left\{\mathbf{z}:\tfrac{23}{25}\left\|\mathbf{x}\right\|\leq\left\|\mathbf{z}\right\|\leq\sqrt{\tfrac{1005}{995}}\left\|\mathbf{x}\right\|\right\}\subset\mathcal{R}_{3}$ .

Proofs of Technical Results for Trust-Region Algorithm

When $m\geq Cn$ for a sufficiently large $C$ , it holds with probability at least $1-c_{a}\exp(-c_{b}m)$ that

Proof Lemma 3.1 in [CSV13] has shown that when $m\geq C_{1}n$ , it holds with probability at least $1-c_{2}\exp(-c_{3}m)$ that

for all $\mathbf{z}$ and $\mathbf{w}$ , where $\left\|\cdot\right\|_{\ast}$ is the nuclear norm that sums up singular values. The claims follows from

When $m\geq Cn\log n$ , with probability at least $1-c_{a}m^{-1}-c_{b}\exp\left(-c_{c}m/\log m\right)$ ,

for all $\psi\in[0,2\pi)$ . Here $C$ , $c_{a}$ to $c_{c}$ are positive absolute constants.

holds w.h.p. when $m\geq C_{1}n\log n$ for a sufficiently large $C_{1}$ .

2 Proof of Lemma 3.1

Proof For any $\mathbf{z},\mathbf{z}^{\prime}\in\Gamma^{\prime}$ , we have

where in the third line we invoked results of Lemma 7.1, and hence the derived inequality holds w.h.p. when $m\geq C_{1}n$ . Similarly, for the gradient,

where from the second to the third inequality we used the fact $\left\|\frac{1}{m}\sum_{k=1}^{m}\mathbf{a}_{k}\mathbf{a}_{k}^{*}\right\|\leq 2$ with probability at least $1-\exp(-c_{2}m)$ . Similarly for the Hessian,

where to obtain the third inequality we used that $\frac{1}{m}\left\|\mathbf{A}^{*}\right\|^{2}\leq 2$ with probability at least $1-\exp(-c_{3}m)$ when $m\geq C_{4}n$ for a sufficiently large constant $C_{4}$ .

Since $R_{0}\leq 10\left\|\mathbf{x}\right\|$ with probability at least $1-\exp(-c_{5}m)$ when $m\geq C_{6}n$ , by definition of $R_{1}$ , we have $R_{1}\leq 30(n\log m)^{1/2}\left\|\mathbf{x}\right\|$ w.h.p.. Substituting this estimate into the above bounds yields the claimed results.

3 Proof of Lemma 3.2

Proof For the upper bound, we have that for all $\mathbf{z}\in\mathcal{R}_{3}^{\prime}$ ,

where to obtain the third line we applied Lemma 7.2. To show the lower bound for all $\mathbf{z}\in\mathcal{R}_{3}^{\prime}$ , it is equivalent to show that

By Lemma 3.1 and Lemma 7.2, w.h.p., we have

Since $\Im\left(\mathbf{w}^{*}\mathbf{z}\right)=0$ , we have $\Re\left((\mathbf{w}^{*}\mathbf{z})^{2}\right)=\left|\mathbf{w}^{*}\mathbf{z}\right|^{2}$ . Thus,

4 Proof of Lemma 3.3

5 Proof of Proposition 3.4

Taking $t=\Delta$ and applying Lemma 3.3, we have

where we obtain the very last inequality using the assumption that $\Delta\leq\left\|\mathbf{x}\right\|^{2}/(400L_{h})$ , completing the proof.

6 Proof of Proposition 3.5

Obviously vectors of the form $t\mathbf{\delta}$ is feasible for (3.1) for any $t\in[0,\Delta]$ . By Lemma A.2, we have

By Proposition 2.5 and Proposition 2.6, we have

Since $\left\{\mathbf{z}:\left\|\mathbf{z}\right\|\leq\left\|\mathbf{x}\right\|/2\right\}\subset\mathcal{R}_{1}$ , $\mathbf{z}^{(r)}$ of interest here satisfies $\|\mathbf{z}^{(r)}\|\geq\left\|\mathbf{x}\right\|/2$ . Thus,

Combining the above with Lemma 3.3, we obtain

7 Proof of Proposition 3.6

Proof By Proposition 2.4 and the integral form of Taylor’s theorem in Lemma A.2, we have that for any $\mathbf{g}$ satisfying $\Im(\mathbf{g}^{*}\mathbf{x})=0$ and $\left\|\mathbf{g}\right\|=1$ and any $t\in[0,\left\|\mathbf{x}\right\|/\sqrt{7}]$ ,

Combining the above two inequalities, we obtain

where to obtain the very last bound we have used the fact $\min_{\mathbf{z}\in\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}}\left\|\mathbf{h}(\mathbf{z})\right\|\geq\left\|\mathbf{x}\right\|^{2}/(10L_{h})$ due to (3.9). This implies that for all $\mathbf{z}\in\mathcal{R}_{3}\setminus\mathcal{R}_{3}^{\prime}$ ,

The rest arguments are very similar to that of Proposition 3.5. Take $\mathbf{\delta}=-\mathbf{h}(\mathbf{z}^{(r)})/\left\|\mathbf{h}(\mathbf{z}^{(r)})\right\|$ and it can checked vectors of the form $t\mathbf{\delta}$ for $t\in[0,\Delta]$ are feasible for (3.1). By Lemma A.2, we have

where to obtain the last line we have used (7.1). Combining the above with Lemma 3.3, we obtain

8 Proof of Proposition 3.7

Let $\mathbf{U}$ be an orthogonal (in geometric sense) basis for the space $\left\{\mathbf{w}:\Im(\mathbf{w}^{*}\mathbf{z}^{(r)})=0\right\}$ . In view of the transformed gradient and Hessian in (3.3), it is easy to see

By Lemma 3.2, $\lambda_{\min}(\mathbf{H}(\mathbf{z}^{(r)}))\geq m_{H}$ . Thus,

where the last simplification is due to that $M_{H}\geq m_{H}$ . By Lemma A.3, we have

Therefore, for $\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\mathbf{\delta}_{\star}$ , Lemma 3.3 implies that

The claimed result follows provided $\Delta\leq m_{H}^{2}/(4M_{H}L_{h})$ , completing the proof.

9 Proof of Proposition 3.8

Before proceeding, we note one important fact that is useful below. For any $\mathbf{z}$ , we have

Thus, if $\mathbf{U}(\mathbf{z})$ is an (geometrically) orthonormal basis constructed for the space $\left\{\mathbf{w}:\Im(\mathbf{w}^{*}\mathbf{z})=0\right\}$ (as defined around (3.3)), it is easy to verify that

Proof Throughout the proof, we write $\mathbf{g}^{(r)}$ , $\mathbf{H}^{(r)}$ and $\mathbf{U}^{(r)}$ short for $\mathbf{g}(\mathbf{z}^{(r)})$ , $\mathbf{H}(\mathbf{z}^{(r)})$ and $\mathbf{U}(\mathbf{z}^{(r)})$ , respectively. Given an orthonormal basis $\mathbf{U}^{(r)}$ for $\left\{\mathbf{w}:\Im(\mathbf{w}^{*}\mathbf{z}^{(r)})=0\right\}$ , the unconstrained optimality condition of the trust region method implies that

By Taylor’s theorem and Lipschitz property in Lemma 3.1, we have

where to obtain the second equality we have used the optimality condition discussed at start of the proof. Thus, using the result above, we obtain

where from the second to the third line we translate the complex operator norm to the real operator norm. Similarly, we also get

Since $\left\|\mathbf{z}^{(r)}\right\|\geq\min_{\mathbf{z}\in\mathcal{R}_{3}}\left\|\mathbf{z}\right\|\geq(1-1/\sqrt{7})\left\|\mathbf{x}\right\|\geq 3\left\|\mathbf{x}\right\|/5$ , and $\left\|\mathbf{z}^{(r+1)}\right\|\geq\left\|\mathbf{z}^{(r)}\right\|-\Delta\geq\left\|\mathbf{x}\right\|/2$ provided

where we used the assumption $\Delta\leq\left\|\mathbf{x}\right\|/10$ again to obtain the last inequality.

Invoking the optimality condition again, we obtain

Here $(\mathbf{H}^{(r)})^{-1}$ is well-defined because Lemma 3.2 shows that $\left\|\mathbf{H}^{(r)}\right\|\geq m_{H}$ for all $\mathbf{z}^{(r)}\in\mathcal{R}_{3}^{\prime}$ . Combining the last two estimates, we complete the proof.

10 Proof of Proposition 3.9

We first show $\mathbf{z}^{(r+1)}$ stays in $\mathcal{R}_{3}^{\prime}$ . From proof of Proposition 3.6, we know that for all $\mathbf{z}\in\mathcal{R}_{3}$ , the following estimate holds:

provided $\Delta\leq\left\|\mathbf{x}\right\|/10$ . Moreover,

where the last inequality followed because step $r$ is unconstrained. Combining the above estimates, we obtain that

$\mathbf{z}^{(r+1)}$ stays in $\mathcal{R}_{3}^{\prime}$ .

Next we show the next step will also be an unconstrained step when $\Delta$ is sufficiently small. We have

where we again applied results of Proposition 3.8 to obtain the third line, and applied the optimality condition to obtain the fourth line. Thus, whenever

the transformed trust-region subproblem has its minimizer $\mathbf{\xi}^{(r+1)}$ with $\|\mathbf{\xi}^{(r+1)}\|<\Delta$ . This implies the minimizer $\mathbf{\delta}^{(r+1)}$ to the original trust-region subproblem satisfies $\mathbf{\delta}^{(r+1)}<\Delta$ , as $\|\mathbf{\delta}^{r+1}\|=\|\mathbf{\xi}^{(r+1)}\|$ . Thus, under the above condition the $(r+1)$ -th step is also unconstrained.

Repeating the above arguments for all future steps implies that all future steps will be constrained within $\mathcal{R}_{3}^{\prime}$ .

We next provide an explicit estimate for the rate of convergence in terms of distance of the iterate to the target set $\mathcal{X}$ . Again by Proposition 3.8,

Appendix A Basic Tools and Results

For $a\sim\mathcal{CN}(1)$ , it holds that

Proof Since $f$ is continuous differentiable, by the fundamental theorem of calculus,

Change of variable $\tau=st(0\leq s\leq 1)$ gives the claimed result.

Proof By integral form of Taylor’s theorem in Lemma A.2,

Let $\mathbf{X}$ be an $n_{1}\times n_{2}$ ( $n_{1}>n_{2}$ ) matrices with i.i.d. $\mathcal{CN}$ entries. Then,

Moreover, for each $t\geq 0$ , it holds with probability at least $1-2\exp\left(-t^{2}\right)$ that

Let $X_{1},\cdots,X_{N}$ be independent centered sub-Gaussian random variables, and let $K=\max_{i}\left\|X_{i}\right\|_{\psi_{2}}$ , where the sub-Gaussian norm

Let $X_{1},\cdots,X_{N}$ be independent centered sub-exponential random variables, and let $K=\max_{i}\left\|X_{i}\right\|_{\psi_{1}}$ , where the sub-exponential norm

Let $X_{1}$ , $\dots$ , $X_{N}$ be i.i.d. copies of the nonnegative random variable $X$ with finite second moment. Then it holds that

Thus, by the usual exponential transform trick, we obtain that for any $t>0$ ,

Taking $\lambda=t/(N\sigma^{2})$ and making change of variable for $t$ give the claimed result.

Let $X_{1},\dots,X_{p}$ be i.i.d. copies of a real-valued random variable $X$ Suppose that there exist some positive number $R$ and $\sigma_{X}^{2}$ such that

Let $S\doteq\frac{1}{p}\sum_{k=1}^{p}X_{k}$ , then for … , it holds that

Proof Proof to i) is similar to that of II. Theorem 4.11 in [SS90]. For $2k\leq n$ , w.l.o.g., we can assume $\mathbf{U}$ and $\mathbf{V}$ are the canonical bases for $\mathcal{U}$ and $\mathcal{V}$ , respectively. Then

Note that the upper bound is achieved by taking $\mathbf{x}=\mathbf{e}_{1}$ . When $2k>n$ , by the results from CS decomposition (see, e.g., I Theorem 5.2 of [SS90]).

and the same argument then carries through. To prove ii), note the fact that $\sin\theta_{1}=\left\|\mathbf{U}\mathbf{U}^{*}-\mathbf{V}\mathbf{V}^{*}\right\|$ (see, e.g., Theorem 4.5 and Corollary 4.6 of [SS90]). Obviously one also has

while $\mathbf{I}-\mathbf{U}\mathbf{U}^{*}$ and $\mathbf{I}-\mathbf{V}\mathbf{V}^{*}$ are projectors onto $\mathcal{U}^{\perp}$ and $\mathcal{V}^{\perp}$ , respectively. This completes the proof.

Introduction

2 A Curious Experiment

3 A Geometric Analysis

4 Prior Arts and Connections

(Global) Geometric analysis of other nonconvex problems.

Recovering low-rank positive semidefinite matrices.

5 Notations, Organization, and Reproducible Research

Wirtinger calculus.

Organization.

Reproducible research.

The Geometry of the Objective Function

2 The Finite-Sample Landscape

3 Key Steps in the Geometric Analysis

Optimization by Trust-Region Method (TRM)

2 Convergence Analysis

Lipschitz Properties

Property of Hessians near the Target Set 𝒳𝒳\mathcal{X}.

3 Proof of TRM Convergence

Numerical Simulations

Discussion

Sparse phase retrieval.

Other structured nonconvex problems.

Proofs of Technical Results for Function Landscape

2 Proof of Proposition 2.3

3 Proof of Proposition 2.4

4 Proof of Proposition 2.5

5 Proof of Proposition 2.6

6 Proof of Proposition 2.7

Proofs of Technical Results for Trust-Region Algorithm

2 Proof of Lemma 3.1

3 Proof of Lemma 3.2

4 Proof of Lemma 3.3

5 Proof of Proposition 3.4

6 Proof of Proposition 3.5

7 Proof of Proposition 3.6

8 Proof of Proposition 3.7

9 Proof of Proposition 3.8

10 Proof of Proposition 3.9

Appendix A Basic Tools and Results

References