A Geometric Analysis of Phase Retrieval
Ju Sun, Qing Qu, John Wright
Introduction
This problem has attracted substantial recent interest, due to its connections to fields such as crystallography, optical imaging and astronomy. In these areas, one often has access only to the Fourier magnitudes of a complex signal , i.e., [Mil90, Rob93, Wal63, DF87]. The phase information is hard or infeasible to record due to physical constraints. The problem of recovering the signal from its Fourier magnitudes is naturally termed (Fourier) phase retrieval (PR). It is easy to see PR as a special version of GPR, with the ’s the Fourier basis vectors. GPR also sees applications in electron microscopy [MIJ+02], diffraction and array imaging [BDP+07, CMP11], acoustics [BCE06, Bal10], quantum mechanics [Cor06, Rei65] and quantum information [HMW13]. We refer the reader to survey papers [SEC+15, JEH15] for accounts of recent developments in the theory, algorithms, and applications of GPR.
For GPR, heuristic methods based on nonconvex optimization often work surprisingly well in practice (e.g., [Fie82, GS72], and many more cited in [SEC+15, JEH15]). However, investigation into provable recovery methods, particularly based on nonconvex optimization, has started only relatively recently [NJS13, CESV13, CSV13, CL14, CLS15a, WdM15, VX14, ABFM14, CLS15b, CC15, WWS15, ZCL16, ZL16, WGE16, KÖ16, GX16, BE16, Wal16]. The surprising effectiveness of nonconvex heuristics on GPR remains largely mysterious. In this paper, we take a step towards bridging this gap.
We focus on a natural least-squares formulationAnother least-squares formulation, , was first studied in the seminal works [Fie82, GS72]. An obvious advantage of the studied here is that it is differentiable in the sense of Wirtinger calculus introduced later. – discussed systematically in [SEC+15, JEH15] and first studied theoretically in [CLS15b, WWS15],
We assume the ’s are independent identically distributed (i.i.d.) complex Gaussian:
is a fourth-order polynomial in ,Strictly speaking, is not a complex polynomial in over the complex field; complex polynomials are necessarily complex differentiable. However, is a fourth order real polynomial in real and complex parts of . and is nonconvex. A-priori, there is little reason to believe that simple iterative methods can solve this problem without special initialization. Typical local convergence (i.e., convergence to a local minimizer) guarantees in optimization require an initialization near the target minimizer [Ber99]. Moreover, existing results on provable recovery using (1.1) and related formulations rely on careful initialization in the vicinity of the ground truth [NJS13, CLS15b, CC15, WWS15, ZCL16, ZL16, WGE16, KÖ16, GX16, BE16, Wal16].
2 A Curious Experiment
We apply gradient descent to , starting from a random initialization :
where the step size is fixed for simplicityMathematically, is not complex differentiable; here the gradient is defined based on the Wirtinger calculus [KD09]; see also [CLS15b]. This notion of gradient is a natural choice when optimizing real-valued functions of complex variables.. The result is quite striking (Figure 1): for a fixed problem instance (fixed set of random measurements and fixed target ), gradient descent seems to always return a global minimizer (i.e., the target up to a global phase shift), across many independent random initializations! This contrasts with the typical “mental picture” of nonconvex objectives as possessing many spurious local minimizers.
3 A Geometric Analysis
The numerical surprise described above is not completely isolated. Simple heuristic methods have been observed to work surprisingly well for practical PR [Fie82, GS72, SEC+15, JEH15]. In this paper, we take a step towards explaining this phenomenon. We show that although the function (1.1) is nonconvex, when is reasonably large, it actually has benign global geometry which allows it to be globally optimized by efficient iterative methods, regardless of the initialization.
Notice that (i) the only local minimizers are exactly – they are also global minimizers;Note that the global sign cannot be recovered. (ii) there are saddle points (and a local maximizer), but around them there is a negative curvature in the direction. Intuitively, any algorithm that can successfully escape from this kind of saddle point (and local maximizer) can in fact find a global minimizer, i.e., recover the target signal .
Because of this global geometry, a wide range of efficient iterative methods can obtain a global minimizer to , regardless of initialization. Examples include the noisy gradient and stochastic gradient methods [GHJY15] (see also [LSJR16, PP16]), curvilinear search [Gol80] and trust-region methods [CGT00, NP06, SQW15b]. The key property that the methods must possess is the ability to escape saddle points at which the Hessian has a strictly negative eigenvalueSuch saddle points are called ridable saddles [SQW15b] or strict saddles [GHJY15]; see [AG16] for computational methods for escaping from higher-order saddles also. .
We corroborate this claim by developing a second-order trust-region algorithm for this problem, and prove that (Theorem 3.10) (i) from any initialization, it efficiently obtains a close approximation (i.e., up to numerical precision) of the target (up to a global phase) and (ii) it exhibits quadratic convergence in the vicinity of the global minimizers.
In sum, our geometrical analysis produces the following result.
Informal Statement of Our Main Results; See Theorem 2.2 and Theorem 3.10 and Remark 3.11. When , with probability at least , the function has no spurious local minimizers. The only global minimizers are the target and its equivalent copies, and at all saddle points the function has directional negative curvature. Moreover, with at least the same probability, the trust-region method with properly set step size parameter find a global minimizer of in polynomial time, from an arbitrary initialization in the zero-centered complex ball with radius . Here and are absolute positive constants.
The choice of above allows us to state a result with a concise bound on the number of iterations required to converge. However, under our probability model, w.h.p., the trust- region method succeeds from any initialization. There are two caveats to this claim. First, one must choose the parameters of the method appropriately. Second, the number of iterations depends on how far away from the truth the method starts.
The second-order trust-region method, albeit polynomial-time, may not be the most practical algorithm for solving GPR. Deriving the most practical algorithms is not the focus of this paper. We mentioned above that any iterative method with saddle-escaping capability can be deployed to solve the nonconvex formulation; our geometrical analysis constitutes a solid basis for developing and analyzing much more practical algorithms for GPR.
4 Prior Arts and Connections
The survey papers [SEC+15, JEH15] provide comprehensive accounts of recent progress on GPR. In this section, we focus on provable efficient (particularly, nonconvex) methods for GPR, and draw connections to other work on provable nonconvex heuristics for practical problems.
Although heuristic methods for GPR have been used effectively in practice [GS72, Fie82, SEC+15, JEH15], only recently have researchers begun to develop methods with provable performance guarantees. The first results of this nature were obtained using semidefinite programming (SDP) relaxations [CESV13, CSV13, CL14, CLS15a, WdM15, VX14]. While this represented a substantial advance in theory, the computational complexity of semidefinite programming limits the practicality of this approach.Another line of research [BCE06, BBCE09, ABFM14] seeks to co-design the measurements and recovery algorithms based on frame- or graph-theoretic tools. While revising this work, new convex relaxations based on second-order cone programming have been proposed [GS16, BR16, HV16b, HV16a].
(Global) Geometric analysis of other nonconvex problems.
The approach taken here is similar in spirit to our recent geometric analysis of a nonconvex formulation for complete dictionary learning [SQW15a]. For that problem, we also identified a similar geometric structure that allows efficient global optimization without special initialization. There, by analyzing the geometry of a nonconvex formulation, we derived a provable efficient algorithm for recovering square invertible dictionaries when the coefficient matrix has a constant fraction of nonzero entries. Previous results required the dictionary matrix to have far fewer nonzero entries. [SQW15b] provides a high-level overview of the common geometric structure that arises in dictionary learning, GPR and several other problems. This approach has also been applied to other problems [GHJY15, BBV16, BVB16, SC16, Kaw16, BNS16, GLM16, PKCS16]. Despite these similarities, GPR raises several novel technical challenges: the objective is heavy-tailed, and minimizing the number of measurements is importantThe same challenge is also faced by [CLS15b, CC15]..
Our work sits amid the recent surge of work on provable nonconvex heuristics for practical problems. Besides GPR studied here, this line of work includes low-rank matrix recovery [KMO10, JNS13, Har14, HW14, NNS+14, JN14, SL14, WCCL15, SRO15, ZL15, TBSR15, CW15], tensor recovery [JO14, AGJ14a, AGJ14b, AJSN15, GHJY15], structured element pursuit [QSW14, HSSS15], dictionary learning [AAJ+13, AGM13, AAN13, ABGM14, AGMM15, SQW15a], mixed regression [YCS13, SA14], blind deconvolution [LWB13, LJ15, LLJB15], super resolution [EW15], phase synchronization [Bou16], numerical linear algebra [JJKN15], and so forth. Most of the methods adopt the strategy of initialization plus local refinement we alluded to above. In contrast, our global geometric analysis allows flexible algorithm design (i.e., separation of geometry and algorithms) and gives some clues as to the behavior of nonconvex heuristics used in practice, which often succeed without clever initialization.
Recovering low-rank positive semidefinite matrices.
5 Notations, Organization, and Reproducible Research
Wirtinger calculus.
Note that above the partial derivatives and are row vectors. The Wirtinger gradient and Hessian are defined as
where we sometimes write and naturally . With gradient and Hessian, the second-order Taylor expansion of at a point is defined as
For numerical optimization, we are most interested in real-valued . A real-valued is stationary at a point if and only if
This is equivalent to the condition , as when is real-valued. The curvature of at a stationary point is dictated by the Wirtinger Hessian . An important technical point is that the Hessian quadratic form involves left and right multiplication with a -dimensional vector consisting of a conjugate pair .
Following the above notation, we write and for denoting the first and second half of , respectively.
Organization.
The remainder of this paper is organized as follows. In Section 2, we provide a quantitative characterization of the global geometry for GPR and highlight main technical challenges in establishing the results. Based on this characterization, in Section 3 we present a modified trust-region method for solving GPR from an arbitrary initialization, which leads to our main computational guarantee. In Section 20 we study the empirical performance of our method for GPR. Section 5 concludes the main body with a discussion of open problems. Section 6 and Section 7 collect the detailed proofs to technical results for the geometric analysis and algorithmic analysis, respectively.
Reproducible research.
The code to reproduce all the figures and the experimental results can be found online:
The Geometry of the Objective Function
is a critical point, and the Hessian
Hence, is a local maximizer.
In the region , we have
So there is no critical point in this region.
Similarly, one can show that in direction there is positive curvature. Hence, every is a saddle point.
In the region , any potential critical point must satisfy
In other words, is the positive eigenvalue of the rank-one PSD Hermitian matrix . Hence . This would imply that , which does not occur in this region.
When , critical points must satisfy
Summarizing the above observations completes the proof.
2 The Finite-Sample Landscape
The result is not surprising in view of the above characterization of the “large-sample” landscape. The intuition is as follows: since the objective function is a sum of independent random variables, when is sufficiently large, the function values, gradients and Hessians should be uniformly close to their expectations. Some care is required in making this intuition precise, however. Because the objective function contains fourth powers of Gaussian random variables, it is heavy tailed. Ensuring that and its derivatives are uniformly close to their expectations requires . This would be quite wasteful, since has only degrees of freedom.
There exist positive absolute constants , such that when , it holds with probability at least that has no spurious local minimizers and the only local/global minimizers are exactly the target set . More precisely, with the same probability,
where, assuming as defined in (1.3),
Here the regions are defined as
By the restricted strong convexity we have established for , and the integral form of Taylor’s theorem in Lemma A.2,
Summing up the above two inequalities, we obtain
In the asymptotic version, we characterized only the critical points. In this finite-sample version, we characterize the whole space and particularly provide quantitative control for regions near critical points (i.e., ). These concrete quantities are important for algorithm design and analysis (see Section 3).
3 Key Steps in the Geometric Analysis
Our proof strategy is fairly simple: we work out uniform bounds on the quantities for each of the three regions, and finally show the regions together cover the space. Since (1.1) and associated derivatives take the form of summation of independent random variables, the proof involves concentration and covering arguments [Ver12]. The main challenge in our argument will be the heavy-tailedness nature of and its gradient.
When , it holds with probability at least that
for all defined in (2.4). Here are positive absolute constants.
Next, we show that near , the objective is strongly convex in any geometrically normal direction to the target set (which is a one-dimensional circle). Combined with the smoothness property, this allows us to achieve a quadratic asymptotic rate of convergence with the modified trust-region algorithm we propose later.
When for a sufficiently large constant , it holds with probability at least that
for all defined in (2.5) and for all
Here is a positive absolute constant.
This restricted strong convexity result is qualitatively stronger than the local curvature property (i.e., Condition 7.11) established in [CLS15b]. Specifically, our result is equivalent to the following: for any line segment that is normal to the circle and contained in , it holds that
In contrast, the local curvature property [CLS15b] only states that in such line segment,
While the local curvature property is sufficient to establish local convergence result of first-order method, the stronger restricted strong convex that provides uniform second-order curvature controls in the above ’s are necessary to showing quadratic convergence result of second-order method, as we will do in the next section.
In Proposition 2.7 below, we will show that indeed cover . We first show that the gradients in either region are bounded away from zero.
When , it holds with probability at least that
for all defined in (2.7). Here are positive absolute constants.
It follows immediately that for all .
When , it holds with probability at least that
for all defined in (2.8). Here are positive absolute constants.
This clearly implies that for all .
The quantity we want to control in the above proposition, , is the same quantity to be controlled in the local curvature condition (e.g., Condition 7.11) in [CLS15b]. There only points near (i.e., roughly our below) are considered, and the target is proving the local curvature is in a certain sense positive. Here the points of interest are not close to , and the target is only showing that at these points, the directional derivative in direction is bounded away from zero.
Finally, we show that the two sub-regions, and , together cover . Formally,
We have .
The main challenge is that the function (1.1) is a fourth-order polynomial, and most quantities arising in the above propositions involve heavy-tailed random variables. For example, we need to control
Optimization by Trust-Region Method (TRM)
Based on the geometric characterization in Section 2.2, we describe a second-order trust-region algorithm that produces a close approximation (i.e., up to numerical precision) to a global minimizer of (1.1) in polynomial number of steps. One interesting aspect of in the complex space is that each point has a “circle” of equivalent points that have the same function value. Thus, we constrain each step to move “orthogonal” to the trivial direction. This simple modification helps the algorithm to converge faster in practice, and proves important to the quadratic asymptotic convergence rate in theory.
The basic idea of the trust-region method is simple: we generate a sequence of iterates , by repeatedly constructing quadratic approximations , minimizing to obtain a step , and setting . More precisely, we approximate around using the second-order Taylor expansion,
Then, the quadratic approximation of around can be rewritten as
By structure of the Wirtinger gradient and Wirtinger Hessian , and contain only real entries. Thus, the problem (3.2) is in fact an instance of the classical trust-region subproblem w.r.t. real variable . A minimizer to (3.1) can be obtained from a minimizer of (3.2) as .
So, any method which can solve the classical trust-region subproblem can be directly applied to the modified problem (3.1). Although the resulting problem can be nonconvex (as in (3.4) can be indefinite), it can be solved in polynomial time, by root-finding [MS83, CGT00] or SDP relaxation [RW97, FW04]. Our convergence guarantees assume an exact solution of this problem; we outline below how to obtain an -near solution (for arbitrary ) via bisection search in polynomial time (i.e., polynomial in ), due to [Ye92, VZ90]. At the end of Section 3.3, we will discuss robustness of our convergence guarantee to the numerical imperfection, and provide an estimate of the total time complexity. In practice, though, even very inexact solutions of the trust-region subproblem suffice.This can also be proved, in a relatively straightforward way, using the geometry of the objective . In the interest of brevity, we do not pursue this here. Inexact iterative solvers for the trust-region subproblem can be engineered to avoid the need to densely represent the Hessian; these methods have the attractive property that they attempt to optimize the amount of Hessian information that is used at each iteration, in order to balance rate of convergence and computation.
Now we describe briefly how to apply bisection search to find an approximate solution to the classical trust-region subproblem, i.e.,
We first isolate the case when there is a unique interior global minimizer. In this case and . When has a zero eigenvalue with an associated eigenvector and is an interior minimizer, all feasible are also minimizers. So uniqueness here requires . Thus, one can try to sequentially (1) test if is positive definite; (2) solve for from ; and (3) test if is feasible. If the procedure goes through, a minimizer has been found. It is obvious the arithmetic complexity is .
When is not positive definite, there must be minimizers on the boundary. Then, we need to find a such that
One remarkable fact is that although the minimizers may not be unique, the multiplier is unique. The unique can be efficiently approximated via a bisection search algorithm (Algorithm 1).
The above algorithm finds a with with arithmetic complexity . Moreover, this can be translated into a convergence result in function value. Define if , and if has zero eigenvalue with an associated eigenvector and makes . Then,
2 Convergence Analysis
holds with probability at least , provided for a sufficiently large . It can be checked that when
Thus, we conclude that when , w.h.p., the sublevel set is contained in the set
Lipschitz Properties
with probability at least , provided for a sufficiently large absolute constant . Here through are positive absolute constants.
Property of Hessians near the Target Set 𝒳𝒳\mathcal{X}.
We will provide spectral upper and lower bounds for the (restricted) Hessian matrices , where is as defined in (3.3). These bounds follow by bounding on , and then using the Lipschitz property of the Hessian to extend the bounds to a slightly larger region around .
When , it holds with probability at least that
for all with and . Here are positive absolute constants.
3 Proof of TRM Convergence
We are now ready to prove the convergence of the TRM algorithm. Throughout, we will assume for a sufficiently large constant , so that all the events of interest hold w.h.p..
The next auxiliary lemma makes precise the intuition that whenever there exists a descent direction, the step size parameter is sufficiently small, a trust-region step will decrease the objective.
For any , suppose there exists a vector with such that
for a certain . Then the trust-region subproblem (3.1) returns a point with and
The next proposition says when is chosen properly, a trust-region step from a point with negative local curvature decreases the function value by a concrete amount.
Suppose the current iterate , and our trust-region size satisfies
Then an optimizer to (3.1) leads to that obeys
The next proposition shows that when is chosen properly, a trust-region step from a point with strong gradient decreases the objective by a concrete amount.
Suppose our current iterate , and our trust-region size satisfies
Then an optimizer to (3.1) leads to that obeys
Now, we argue about , in which the behavior of the algorithm is more complicated. For the region , the restricted strong convexity in radial directions around as established in Proposition 2.4 implies that the gradient at any point in is nonzero. Thus, one can treat this as another strong gradient region, and carry out essentially the same argument as in Proposition 3.5.
Suppose our current iterate , and our trust-region size satisfies
Then an optimizer to (3.1) leads to that obeys
Our next several propositions show that when the iterate sequence finally moves into , in general it makes a finite number of consecutive constrained steps in which the trust-region constraints are always active, followed by ultimate consecutive unconstrained steps in which the the trust-region constraints are always inactive until convergence. Depending on the initialization and optimization parameters, either constrained or unconstrained steps can be void. The next proposition shows that when is chosen properly, a constrained step in decreases the objective by a concrete amount.
Suppose our current iterate , and the trust-region subproblem takes a constrained step, i.e., the optimizer to (3.1) satisfies . We have the leads to
Here and are as defined in Lemma 3.2.
The next proposition shows that when is properly tuned, an unconstrained step in dramatically reduces the norm of the gradient.
Suppose our current iterate , and the trust-region subproblem takes an unconstrained step, i.e., the unique optimizer to (3.1) satisfies . We have the leads to that obeys
Here and are as defined in Lemma 3.2.
The next proposition shows that when is properly tuned, as soon as an unconstrained step is taken, all future iterations take unconstrained steps. Moreover, the sequence converges quadratically to the target set .
for all integers , provided that
Now we are ready to piece together the above technical propositions to prove our main algorithmic theorem.
Here through are positive absolute constants.
Proof When for a sufficiently large constant , the assumption of Theorem 2.2 is satisfied. Moreover, with probability at least , the following estimates hold:
for a certain positive absolute constant . From the technical lemmas and propositions in Section 3.3, it can be verified that when
for a positive absolute constant , all requirements on are satisfied.
In fact, the previous argument implies a generic iterate sequence consists of two phases: the first phase that takes consecutive or constrained steps, and thereafter the second phase that takes consecutive unconstrained steps till convergence. Either of the two can be absent depending on the initialization and parameter setting for the TRM algorithm.
Since , by Proposition 3.4, 3.5, 3.6, and 3.7, from it takes at most
steps for the iterate sequence to enter .It is possible to refine the argument a bit by proving that the sequence does not exit once entering it, in which case the bound can be tightened as . We prefer to state this crude bound to save the additional technicality. Let denote the index of the first iteration for which . Once the sequence enters , there are three possibilities:
The sequence always takes constrained steps in and since the function is lower bounded (), it reaches the target set in finitely many steps.
The sequence takes constrained steps until reaching certain point such that , where is defined in Proposition 3.7. Since a constrained step in must decrease the function value by at least , all future steps must be unconstrained. Proposition 3.9 suggests that the sequence will converge quadratically to the target set .
The sequence starts to take unconstrained steps at a certain point such that . Again Proposition 3.9 implies that the sequence will converge quadratically to the target set .
In sum, by Proposition 3.4, Proposition 3.5, Proposition 3.6, Proposition 3.7, and Proposition 3.9, the number of iterations to obtain an -close solution to the target set can be grossly bounded by
Using our previous estimates of , , and , and taking , we arrive at the claimed result.
Our above results are conditioned on exact arithmetic for solving the trust-region subproblem. We first discuss the effect of inexact arithmetics. Fix an to be determined later, suppose our trust-region subproblem solver always returns a such that – based on our discussion in Section 3.1, such can always be found in time. Then, in Lemma 3.3, the trust-region subproblem returns a feasible point such that
Accordingly, the for in Proposition 3.4 through Proposition 3.7 are changed to , with other conditions unaltered. Thus, combining the above with arguments in Theorem 3.10, we can take
such that from an initialization , the iterate sequence takes at most
steps to stay in region and possibly to start consecutive unconstrained steps. For the unconstrained steps in , since exact arithmetic is possible as we discussed in Section 3.1, the results in Proposition 3.8 and Proposition 3.9 are intact. The step estimate for this part in Theorem 3.10 remains valid. Overall, by our above choice, inexact arithmetics at most double the number of iterations to attain an -near solution to .
If we set , then each trust-region iteration costs . Moreover, it takes
iterations to arrive at an -near solution to .
Numerical Simulations
Our convergence analysis for the TRM is based on two idealizations: (i) the trust-region subproblem is solved exactly; and (ii) the step-size is fixed to be sufficiently small. These simplifications ease the analysis, but also render the TRM algorithm impractical. In practice, the trust-region subproblem is never exactly solved, and the trust-region step size is adjusted to the local geometry, by backtracking. It is relatively straightforward to modify our analysis to account for inexact subproblem solvers; for sake of brevity, we do not pursue this hereThe proof ideas are contained in Chap 6 of [CGT00]; see also [AMS09]. Intuitively, such result is possible because reasonably good approximate solutions to the TRM subproblem make qualitatively similar progress as the exact solution. Recent work [CGT12, BAC16] has established worst-case polynomial iteration complexity (under reasonable assumptions on the geometric parameters of the functions, of course) of TRM to converge to point verifying the second-order optimality conditions. Their results allow inexact trust-region subproblem solvers, as well as adaptive step sizes. Based on our geometric result, we could have directly called their results, producing slightly worse iteration complexity bounds. It is not hard to adapt their proof taking advantage of the stronger geometric property we established and produce tighter results. .
In this section, we investigate experimentally the number of measurements required to ensure that is well-structured, in the sense of our theorems. This entails solving large instances of . To this end, we deploy the Manopt toolbox [BMAS14]Available online: http://www.manopt.org. . Manopt is a user-friendly Matlab toolbox that implements several sophisticated solvers for tackling optimization problems on Riemannian manifolds. The most developed solver is based on the TRM. This solver uses the truncated conjugate gradient (tCG; see, e.g., Section 7.5.4 of [CGT00]) method to (approximately) solve the trust-region subproblem (vs. the exact solver in our analysis). It also dynamically adjusts the step size. However, the original implementation (Manopt 2.0) is not adequate for our purposes. Their tCG solver uses the gradient as the initial search direction, which does not ensure that the TRM solver can escape from saddle points [ABG07, AMS09]. We modify the tCG solver, such that when the current gradient is small and there is a negative curvature direction (i.e., the current point is near a saddle point or a local maximizer for ), the tCG solver explicitly uses the negative curvature direction…adjusted in sign to ensure positive correlation with the gradient – if it does not vanish. as the initial search direction. This modificationSimilar modification is also adopted in the TRM algorithmic framework in the recent work [BAC16] (Algorithm 3). ensures the TRM solver always escapes saddle points/local maximizers with directional negative curvature. Hence, the modified TRM algorithm based on Manopt is expected to have the same qualitative behavior as the idealized version we analyzed.
We fix and vary the ratio from to .
For each , we generate a fixed instance: a fixed signal , and a fixed set of complex Gaussian vectors. We run the TRM algorithm times for each problem instance, with independent random initializations. Successfully recovery is declared if at termination the optimization variable satisfies
Discussion
In this work, we provide a complete geometric characterization of the nonconvex formulation (1.1) for the GPR problem. The benign geometric structure allows us to design a second-order trust-region algorithm that efficiently finds a global minimizer of (1.1), without special initializations. We close this paper by discussing possible extensions and relevant open problems.
Our result (Theorem 2.2 and Theorem 3.10) indicates that samples are sufficient to guarantee the favorable geometric property and efficient recovery, while our simulations suggested that or even is enough. For efficient recovery only, are known to be sufficient [CC15] (and for all signals; see also [CLS15b, WGE16, ZL16]). It is interesting to see if the gaps can be closed. Our current analysis pertains to Gaussian measurements only which are not practical, it is important to extend the geometric analysis to more practical measurement schemes, such as t-designs [GKK13] and masked Fourier transform measurements [CLS15a]. A preliminary study of the low-dimensional function landscape for the latter scheme (for reduced real version) produces very positive result; see Figure 5.
Sparse phase retrieval.
A special case of GPR is when the underlying signal is known to be sparse, which can be considered as a quadratic compressed sensing problem [OYVS13, OYDS13, OYDS12, LV13, JOH13, SBE14]. Since is sparse, the lifted matrix is sparse and has rank one. Thus, existing convex relaxation methods [OYVS13, OYDS13, LV13, JOH13] formulated it as a simultaneously low-rank and sparse recovery problem. For the latter problem, however, known convex relaxations are suboptimal [OJF+12, MHWG14]. Let be the number of nonzeros in the target signal. [LV13, JOH13] showed that natural convex relaxations require samples for correct recovery, instead of the optimal order . A similar gap is also observed with certain nonconvex methods [CLM15]. It is tempting to ask whether novel nonconvex formulations and analogous geometric analysis as taken here could shed light on this problem.
Other structured nonconvex problems.
We have mentioned recent surge of works on provable nonconvex heuristics [JNS13, Har14, HW14, NNS+14, JN14, SL14, JO14, WCCL15, SRO15, ZL15, TBSR15, CW15, AGJ14a, AGJ14b, AJSN15, GHJY15, QSW14, HSSS15, AAJ+13, AGM13, AAN13, ABGM14, AGMM15, SQW15a, YCS13, SA14, LWB13, LJ15, LLJB15, EW15, Bou16, JJKN15]. While the initialization plus local refinement analyses generally produce interesting theoretical results, they do not explain certain empirical successes that do not rely on special initializations. The geometric structure and analysis we work with in our recent work [SQW15a, SQW15b] (see also [GHJY15, AG16], and [Kaw16, SC16, BNS16, GLM16, PKCS16, BBV16, BVB16]) seem promising in this regard. It is interesting to consider whether analogous geometric structure exists for other practical problems.
Proofs of Technical Results for Function Landscape
Lemma 6.1 to 6.3 first appeared in [CLS15b]; we include the full proofs here for completeness.
We now evaluate the three terms separately. Note that the law is invariant to unitary transform. Thus,
Gathering the above results, we obtain (6.1). By taking Wirtinger derivative (1.4) with respect to (6.1), we obtain the Wirtinger gradient and Hessian in (6.2), (6.3) as desired.
Similar calculation yields the second expectation.
Here is a constant depending on and , and are positive absolute constants.
Proof We work out the results on first. By the unitary invariance of the Gaussian measure and rescaling, it is enough to consider . We partition each vector as and upper bound the target quantity as:
By Chebyshev’s inequality, we have with probability at least ,
For all and all , is distributed as that is independent of the sequence. So for one realization of , the Hoeffding-type inequality of Lemma A.5 implies
for any with and any . Taking , together with a union bound on a -net on the sphere, we obtain
Now an application of Chebyshev’s inequality gives that with probability at least . Substituting this into the above, we conclude that whenever for some sufficiently large ,
with probability at least .
For all fixed and all , . Thus, is centered sub-exponential. So for one realization of , Bernstein’s inequality (Lemma A.6) implies
for any fixed with and any . Taking , together with a union bound on a -net on the sphere, we obtain
Chebyshev’s inequality and the union bound give that
hold with probability at least . To conclude, when for some sufficiently large constant ,
with probability at least .
Collecting the above bounds and probabilities yields the claimed results. Similar arguments prove the claim on also, completing the proof.
Let be i.i.d. copies of . For any , when , it holds with probability at least that
Here and are constants depending on and and are positive absolute constants.
as with probability at least , and with probability at least . Thus,
Taking for a sufficiently small , we obtain that with probability at least ,
which, together with continuity of the function , implies
It is enough to take to ensure the desired event happens w.h.p..
2 Proof of Proposition 2.3
Lemma 6.3 implies that when , w.h.p.,
On the other hand, by Lemma A.7, we have that
holds with probability at least . For the second summation, we have
w.h.p., provided , according to Lemma 6.3.
Collecting the above estimates, we have that when for a sufficiently large constant , w.h.p.,
for all . Dividing both sides of the above by gives the claimed results.
3 Proof of Proposition 2.4
Lemma 6.4 implies when for sufficiently large constant , w.h.p.,
for all , where we have used that to simplify the results.
Collecting the above estimates, we obtain that when , w.h.p.,
To provide a lower bound for the above, we let with and with . Then
For any fixed , it is easy to see that minimizer occurs when . Plugging this into , one obtains . Thus,
4 Proof of Proposition 2.5
By Lemma 6.4, when for some sufficiently large , w.h.p.,
for all , as desired.
5 Proof of Proposition 2.6
Proof We abbreviate as below. Note that
To bound the first term, for a fixed to be determined later, define:
Obviously for all as
Because as shown above,
Thus, the unconditional probability can be bounded as
Taking , we obtain
where we have used to simplify the last inequality. Now we use the moment-control Bernstein’s inequality (Lemma A.8) to get a bound for probability on deviation of . To this end, we have
where the second inequality holds for any integer . Hence one can take
in the deviation inequality of to obtain
where we have used the fact and assumed to simplify the probability. Thus, with probability at least , it holds that
Moreover, for any , we have
for all , with probability at least .
Combining the above estimates, when for sufficiently large constant , w.h.p.,
for all , as claimed.
6 Proof of Proposition 2.7
For convenience, we will define a relaxed region
In this case, when ,
On the other hand, when ,
Since , we conclude that .
So is covered by .
We show this region is covered by and . First, for any , when ,
implying that . Next we suppose and , where and , and show the rest of is covered by . To this end, it is enough to verify that
over this subregion. Writing the left as a function of and eliminating and , it is enough to show
We show that this region is covered by , , and together. First, for any , when ,
So . Next, we show that any with is contained in . Similar to the above argument for , it is enough to show
as in this case. Since the Hessian is again always indefinite, we check the optimal value for and and do the comparison. It can be verified in this case. So . Finally, we consider the case . A argument as above leads to
implying that .
Proofs of Technical Results for Trust-Region Algorithm
When for a sufficiently large , it holds with probability at least that
Proof Lemma 3.1 in [CSV13] has shown that when , it holds with probability at least that
for all and , where is the nuclear norm that sums up singular values. The claims follows from
When , with probability at least ,
for all . Here , to are positive absolute constants.
holds w.h.p. when for a sufficiently large .
2 Proof of Lemma 3.1
Proof For any , we have
where in the third line we invoked results of Lemma 7.1, and hence the derived inequality holds w.h.p. when . Similarly, for the gradient,
where from the second to the third inequality we used the fact with probability at least . Similarly for the Hessian,
where to obtain the third inequality we used that with probability at least when for a sufficiently large constant .
Since with probability at least when , by definition of , we have w.h.p.. Substituting this estimate into the above bounds yields the claimed results.
3 Proof of Lemma 3.2
Proof For the upper bound, we have that for all ,
where to obtain the third line we applied Lemma 7.2. To show the lower bound for all , it is equivalent to show that
By Lemma 3.1 and Lemma 7.2, w.h.p., we have
Since , we have . Thus,
4 Proof of Lemma 3.3
5 Proof of Proposition 3.4
Taking and applying Lemma 3.3, we have
where we obtain the very last inequality using the assumption that , completing the proof.
6 Proof of Proposition 3.5
Obviously vectors of the form is feasible for (3.1) for any . By Lemma A.2, we have
By Proposition 2.5 and Proposition 2.6, we have
Since , of interest here satisfies . Thus,
Combining the above with Lemma 3.3, we obtain
7 Proof of Proposition 3.6
Proof By Proposition 2.4 and the integral form of Taylor’s theorem in Lemma A.2, we have that for any satisfying and and any ,
Combining the above two inequalities, we obtain
where to obtain the very last bound we have used the fact due to (3.9). This implies that for all ,
The rest arguments are very similar to that of Proposition 3.5. Take and it can checked vectors of the form for are feasible for (3.1). By Lemma A.2, we have
where to obtain the last line we have used (7.1). Combining the above with Lemma 3.3, we obtain
8 Proof of Proposition 3.7
Let be an orthogonal (in geometric sense) basis for the space . In view of the transformed gradient and Hessian in (3.3), it is easy to see
By Lemma 3.2, . Thus,
where the last simplification is due to that . By Lemma A.3, we have
Therefore, for , Lemma 3.3 implies that
The claimed result follows provided , completing the proof.
9 Proof of Proposition 3.8
Before proceeding, we note one important fact that is useful below. For any , we have
Thus, if is an (geometrically) orthonormal basis constructed for the space (as defined around (3.3)), it is easy to verify that
Proof Throughout the proof, we write , and short for , and , respectively. Given an orthonormal basis for , the unconstrained optimality condition of the trust region method implies that
By Taylor’s theorem and Lipschitz property in Lemma 3.1, we have
where to obtain the second equality we have used the optimality condition discussed at start of the proof. Thus, using the result above, we obtain
where from the second to the third line we translate the complex operator norm to the real operator norm. Similarly, we also get
Since , and provided
where we used the assumption again to obtain the last inequality.
Invoking the optimality condition again, we obtain
Here is well-defined because Lemma 3.2 shows that for all . Combining the last two estimates, we complete the proof.
10 Proof of Proposition 3.9
Proof Throughout the proof, we write , and short for , and , respectively.
We first show stays in . From proof of Proposition 3.6, we know that for all , the following estimate holds:
provided . Moreover,
where the last inequality followed because step is unconstrained. Combining the above estimates, we obtain that
stays in .
Next we show the next step will also be an unconstrained step when is sufficiently small. We have
where we again applied results of Proposition 3.8 to obtain the third line, and applied the optimality condition to obtain the fourth line. Thus, whenever
the transformed trust-region subproblem has its minimizer with . This implies the minimizer to the original trust-region subproblem satisfies , as . Thus, under the above condition the -th step is also unconstrained.
Repeating the above arguments for all future steps implies that all future steps will be constrained within .
We next provide an explicit estimate for the rate of convergence in terms of distance of the iterate to the target set . Again by Proposition 3.8,
Appendix A Basic Tools and Results
For , it holds that
Proof Since is continuous differentiable, by the fundamental theorem of calculus,
Change of variable gives the claimed result.
Proof By integral form of Taylor’s theorem in Lemma A.2,
Let be an () matrices with i.i.d. entries. Then,
Moreover, for each , it holds with probability at least that
Let be independent centered sub-Gaussian random variables, and let , where the sub-Gaussian norm
Let be independent centered sub-exponential random variables, and let , where the sub-exponential norm
Let , , be i.i.d. copies of the nonnegative random variable with finite second moment. Then it holds that
Thus, by the usual exponential transform trick, we obtain that for any ,
Taking and making change of variable for give the claimed result.
Let be i.i.d. copies of a real-valued random variable Suppose that there exist some positive number and such that
Let , then for … , it holds that
Proof Proof to i) is similar to that of II. Theorem 4.11 in [SS90]. For , w.l.o.g., we can assume and are the canonical bases for and , respectively. Then
Note that the upper bound is achieved by taking . When , by the results from CS decomposition (see, e.g., I Theorem 5.2 of [SS90]).
and the same argument then carries through. To prove ii), note the fact that (see, e.g., Theorem 4.5 and Corollary 4.6 of [SS90]). Obviously one also has
while and are projectors onto and , respectively. This completes the proof.