Phase Retrieval via Wirtinger Flow: Theory and Algorithms

Emmanuel Candes, Xiaodong Li, Mahdi Soltanolkotabi

Introduction

We are interested in solving quadratic equations of the form

On the theoretical side, several combinatorial optimization problems—optimization programs with discrete design variables which take on integer or Boolean values—can be cast as solving quadratic equations or as minimizing a linear objective subject to quadratic inequalities. In their most general form these problems are known to be notoriously difficult (NP-hard) [52, Section 4.3]. Nevertheless, many heuristics have been developed for addressing such problems.For a partial review of some of these heuristics as well as some recent theoretical advances in related problems we refer to our companion paper [16, Section 1.6] and references therein . One popular heuristic is based on a class of convex relaxations known as Shor’s relaxations [52, Section 4.3.1] which can be solved using tractable semi-definite programming (SDP). For certain random models, some recent SDP relaxations such as PhaseLift are known to provide exact solutions (up to global phase) to the generalized phase retrieval problem using a near minimal number of sampling vectors . While in principle SDP based relaxations offer tractable solutions, they become computationally prohibitive as the dimension of the signal increases. Indeed, for a large number of unknowns in the tens of thousands, say, the memory requirements are far out of reach of desktop computers so that these SDP relaxations are de facto impractical.

Algorithm: Wirtinger Flow

This paper introduces an approach to phase retrieval based on non-convex optimization as well as a solution algorithm, which has two components: (1) a careful initialization obtained by means of a spectral method, and (2) a series of updates refining this initial estimate by iteratively applying a novel update rule, much like in a gradient descent scheme. We refer to the combination of these two steps, introduced in reverse order below, as the Wirtinger flow (WF) algorithm.

Our approach to (2.1) is simply stated: start with an initialization z0\bm{z}_{0}, and for τ=0,1,2,\tau=0,1,2,\ldots, inductively define

2 Initialization via a spectral method

Our main result states that for a certain random model, if the initialization z0\bm{z}_{0} is sufficiently accurate, then the sequence {zτ}\{\bm{z}_{\tau}\} will converge toward a solution to the generalized phase problem (1.1). In this paper, we propose computing the initial guess z0\bm{z}_{0} via a spectral method, detailed in Algorithm 1. In words, z0\bm{z}_{0} is the leading eigenvector of the positive semidefinite Hermitian matrix ryrarar\sum_{r}y_{r}\bm{a}_{r}\bm{a}_{r}^{*} constructed from the knowledge of the sampling vectors and observations. (As usual, ar\bm{a}_{r}^{*} is the adjoint of ar\bm{a}_{r}.) Letting A\bm{A} be the m×nm\times n matrix whose rrth row is ar\bm{a}_{r}^{*} so that with obvious notation y=Ax2\bm{y}=|\bm{A}\bm{x}|^{2}, z0\bm{z}_{0} is the leading eigenvector of Adiag{y}A\bm{A}^{*}\,\operatorname{diag}\{\bm{y}\}\,\bm{A} and can be computed via the power method by repeatedly applying A\bm{A}, entrywise multiplication by y\bm{y} and A\bm{A}^{*}. In the theoretical framework we study below, a constant number of power iterations would give machine accuracy because of an eigenvalue gap between the top two eigenvalues, please see Appendix B for additional information.

3 Wirtinger flow as a stochastic gradient scheme

We would like to motivate the Wirtinger flow algorithm and provide some insight as to why we expect it to work in a model where the sampling vectors are random. First, we emphasize that our statements in this section are heuristic in nature; as it will become clear in the proof Section 7, a correct mathematical formalization of these ideas is far more complicated than our heuristic development here may suggest. Second, although our ideas are broadly applicable, it makes sense to begin understanding the algorithm in a setting where everything is real valued, and in which the vectors ara_{r} are i.i.d. N(0,I)\mathcal{N}(\bm{0},\bm{I}). Also without any loss in generality and to simplify exposition in this section we shall assume x=1\left\|\bm{x}\right\|=1.

Let x\bm{x} be a solution to (1.1) so that yr=ar,x2y_{r}=|\langle\bm{a}_{r},\bm{x}\rangle|^{2}, and consider the initialization step first. In the Gaussian model, a simple moment calculation gives

We now turn our attention to the gradient-update (2.2) and define

where here and below, x\bm{x} is once again our planted solution. The first term ensures that the direction of z\bm{z} matches the direction of x\bm{x} and the second term penalizes the deviation of the Euclidean norm of z\bm{z} from that of x\bm{x}. Obviously, the minimizers of this function are ±x\pm\bm{x}. Now consider the gradient scheme

In Section 7.9, we show that if min z0±x1/8x\text{min }\left\|\bm{z}_{0}\pm\bm{x}\right\|\leq 1/8\left\|\bm{x}\right\|, then {zτ}\{\bm{z}_{\tau}\} converges to x\bm{x} up to a global sign. However, this is all ideal as we would need knowledge of x\bm{x} itself to compute the gradient of FF; we simply cannot run this algorithm.

Hence, the average WF update is the same as that in (2.3) so that we can interpret the Wirtinger flow algorithm as a stochastic gradient scheme in which we only get to observe an unbiased estimate f(z)\nabla f(\bm{z}) of the “true” gradient F(z)\nabla F(\bm{z}).

Regarding WF as a stochastic gradient scheme helps us in choosing the learning parameter or step size μτ\mu_{\tau}. Lemma 7.7 asserts that

holds with high probability. Looking at the right-hand side, this says that the uncertainty about the gradient estimate depends on how far we are from the actual solution x\bm{x}. The further away, the larger the uncertainty or the noisier the estimate. This suggests that in the early iterations we should use a small learning parameter as the noise is large since we are not yet close to the solution. However, as the iterations count increases and we make progress, the size of the noise also decreases and we can pick larger values for the learning parameter. This heuristic together with experimentation lead us to consider

shown in Figure 1. Values of τ0\tau_{0} around 330330 and of μmax\mu_{\text{max}} around 0.40.4 worked well in our simulations. This makes sure that μτ\mu_{\tau} is rather small at the beginning (e.g. μ10.003\mu_{1}\approx 0.003 but quickly increases and reaches a maximum value of about 0.40.4 after 200 iterations or so.

Main Results

Our main result establishes the correctness of the Wirtinger flow algorithm in the Gaussian model defined below. Later in Section 5, we shall also develop exact recovery results for a physically inspired diffraction model.

We also need to define the distance to the solution set.

with probability at least 110eγn8/n21-10e^{-\gamma n}-{8}/{n^{2}} (γ\gamma is a fixed positive numerical constant). Further, take a constant learning parameter sequence, μτ=μ\mu_{\tau}=\mu for all τ=1,2,\tau=1,2,\ldots and assume μc1/n\mu\leq c_{1}/n for some fixed numerical constant c1c_{1}. Then there is an event of probability at least 113eγnme1.5m8/n21-13e^{-\gamma n}-me^{-1.5m}-{8}/{n^{2}}, such that on this event, starting from any initial solution z0\bm{z}_{0} obeying (3.1), we have

Setting μ=c1/n\mu=c_{1}/n yields ϵ\epsilon accuracy in a relative sense, namely, dist(z,x)ϵx\operatorname{dist}(\bm{z},\bm{x})\leq\epsilon\left\|\bm{x}\right\|, in O(nlog1/ϵ)\mathcal{O}(n\,\log 1/\epsilon) iterations. The computational work at each iteration is dominated by two matrix-vector products of the form Az\bm{A}\bm{z} and Av\bm{A}^{*}\bm{v}, see Appendix B. It follows that the overall computational complexity of the WF algorithm is O(mn2log1/ϵ)\mathcal{O}(mn^{2}\log{1}/{\epsilon}). Later in the paper, we will exhibit a modification to the WF algorithm of mere theoretical interest, which also yields exact recovery under the same sampling complexity and an O(mnlog1/ϵ)\mathcal{O}(mn\log{1}/{\epsilon}) computational complexity; that is to say, the computational workload is now just linear in the problem size.

2 Comparison with other non-convex schemes

We now pause to comment on a few other non-convex schemes in the literature. Other comparisons may be found in our companion paper .

Earlier, we discussed the Gerchberg-Saxton and Fienup algorithms. These formulations assume that A\bm{A} is a Fourier transform and can be described as follows: suppose zτ\bm{z}_{\tau} is the current guess, then one computes the image of zτ\bm{z}_{\tau} through A\bm{A} and adjust its modulus so that it matches that of the observed data vector: with obvious notation,

where \odot is elementwise multiplication, and b=Ax\bm{b}=|\bm{A}\bm{x}| so that br2=yrb_{r}^{2}=y_{r} for all r=1,,mr=1,\ldots,m. Then

(In the case of Fourier data, the step (3.2)–(3.3) essentially adjusts the modulus of the Fourier transform of the current guess so that it fits the measured data.) Finally, if we know that the solution belongs to a convex set C\mathcal{C} (as in the case where the signal is known to be real-valued, possibly non-negative and of finite support), then the next iterate is

where PCP_{\mathcal{C}} is the projection onto the convex set C\mathcal{C}. If no such information is available, then zτ+1=vτ+1\bm{z}_{\tau+1}=\bm{v}_{\tau+1}. The first step (3.3) is not a projection onto a convex set and, therefore, it is in general completely unclear whether the Gerchberg-Saxton algorithm actually converges. (And if it were to converge, at what speed?) It is also unclear how the procedure should be initialized to yield accurate final estimates. This is in contrast to the Wirtinger flow algorithm, which in the Gaussian model is shown to exhibit geometric convergence to the solution to the phase retrieval problem. Another benefit is that the Wirtinger flow algorithm does not require solving a least-squares problem (3.3) at each iteration; each step enjoys a reduced computational complexity.

A recent contribution related to ours is the interesting paper , which proposes an alternating minimization scheme named AltMinPhase for the general phase retrieval problem. AltMinPhase is inspired by the Gerchberg-Saxton update (3.2)–(3.3) as well as other established alternating projection heuristics . We describe the algorithm in the setup of Theorem 3.3 for which gives theoretical guarantees. To begin with, AltMinPhase partitions the sampling vectors ar\bm{a}_{r} (the rows of the matrix A\bm{A}) and corresponding observations yry_{r} into B+1B+1 disjoint blocks (y(0),A(0))(\bm{y}^{(0)},\bm{A}^{(0)}), (y(1),A(1))(\bm{y}^{(1)},\bm{A}^{(1)}), \ldots, (y(B),A(B))(\bm{y}^{(B)},\bm{A}^{(B)}) of roughly equal size. Hence, distinct blocks are stochastically independent from each other. The first block (y(0),A(0))(\bm{y}^{(0)},\bm{A}^{(0)}) is used to compute an initial estimate z0\bm{z}_{0}. After initialization, AltMinPhase goes through a series of iterations of the form (3.2)–(3.3), however, with the key difference that each iteration uses a fresh set of sampling vectors and observations: in details,

The main point is that in practice, it is not realistic to imagine (1) that we will divide the samples in distinct blocks (how many blocks should we form a priori? of which sizes?) and (2) that we will use measured data only once. With respect to the latter, observe that the Gerchberg-Saxton procedure (3.2)–(3.3) uses all the samples at each iteration. This is the reason why AltMinPhase is of little practical value, and of theoretical interest only. As a matter of fact, its design and study seem merely to stem from analytical considerations: since one uses an independent set of measurements at each iteration, A(τ+1)A^{(\tau+1)} and zτz_{\tau} are stochastically independent, a fact which considerably simplifies the convergence analysis. In stark contrast, the WF iterate uses all the samples at each iteration and thus introduces some dependencies, which makes for some delicate analysis. Overcoming these difficulties is crucial because the community is preoccupied with convergence properties of algorithms one actually runs, like Gerchberg-Saxton (3.2)–(3.3), or would actually want to run. Interestingly, it may be possible to use some of the ideas developed in this paper to develop a rigorous theory of convergence for algorithms in the style of Gerchberg-Saxton and Fienup, please see .

In a recent paper , which appeared on the arXiv preprint server as the final version of this paper was under preparation, the authors explore necessary and sufficient conditions for the global convergence of an alternative minimization scheme with generic sampling vectors. The issue is that we do not know when these conditions hold. Further, even when the algorithm converges, it does not come with an explicit convergence rate so that is is not known whether the algorithm converges in polynomial time. As before, some of our methods as well as those from our companion paper may have some bearing upon the analysis of this algorithm. Similarly, another class of nonconvex algorithms that have recently been proposed in the literature are iterative algorithms based on Generalized Approximate Message Passing (GAMP), see and as well as for some background literature on AMP. In , the authors demonstrate a favorable runtime for an algorithm of this nature. However, this does not come with any theoretical guarantees.

Moving away from the phase retrieval problem, we would like to mention some very interesting work on the matrix completion problem using non-convex schemes by Montanari and coauthors , see also . Although the problems and models are quite different, there are some general similarities between the algorithm named OptSpace in and ours. Indeed, OptSpace operates by computing an initial guess of the solution to a low-rank matrix completion problem by means of a spectral method. It then sets up a nonconvex problem, and proposes an iterative algorithm for solving it. Under suitable assumptions, demonstrates the correctness of this method in the sense that OptSpace will eventually converge to a low-rank solution, although it is not shown to converge in polynomial time.

Numerical Experiments

We present some numerical experiments to assess the empirical performance of the Wirtinger flow algorithm. Here, we mostly consider a model of coded diffraction patterns reviewed below.

We consider an acquisition model, where we collect data of the form

We emphasize that these models are physically realizable in optical applications specially those that arise in microscopy. However, we should note that phase retrieval has many different applications and in some cases other models may be more convenient. We refer to our companion paper Section 2.2 for a discussion of other practically relevant models.

2 The Gaussian and coded diffraction models

Random low-pass signals. Here, x\bm{x} is given by

with M=n/8M=n/8 and XkX_{k} and YkY_{k} are i.i.d. N(0,1)\mathcal{N}(0,1).

where XkX_{k} and YkY_{k} are are i.i.d. N(0,1/8)\mathcal{N}(0,1/8) so that the low-pass model is a ‘bandlimited’ version of this high-pass random model (variances are adjusted so that the expected signal power is the same).

Below, we set n=128n=128, and generate one signal of each type which will be used in all the experiments.

The initialization step of the Wirtinger flow algorithm is run by applying 5050 iterations of the power method outlined in Algorithm 3 from Appendix B. In the iteration (2.2), we use the parameter value μτ=min(1exp(τ/τ0),0.2)\mu_{\tau}=\min(1-\exp(-\tau/\tau_{0}),0.2) where τ0330\tau_{0}\approx 330. We stop after 2,5002,500 iterations, and report the empirical probability of success for the two different signal models. The empirical probability of succcess is an average over 100100 trials, where in each instance, we generate new random sampling vectors according to the Gaussian or CDP models. We declare a trial successful if the relative error of the reconstruction dist(x^,x)/x\operatorname{dist}(\hat{\bm{x}},\bm{x})/\left\|\bm{x}\right\| falls below 10510^{-5}.

Figure 2 shows that around 4.5n4.5n Gaussian phaseless measurements suffice for exact recovery with high probability via the Wirtinger flow algorithm. We also see that about six octanary patterns are sufficient.

3 Performance on natural images

We move on to testing the Wirtinger flow algorithm on various images of different sizes; these are photographs of the Naqsh-e Jahan Square in the central Iranian city of Esfahan, the Stanford main quad, and the Milky Way galaxy. Since each photograph is in color, we run the WF algorithm on each of the three RGB images that make up the photograph. Color images are viewed as n1×n2×3n_{1}\times n_{2}\times 3 arrays, where the first two indices encode the pixel location, and the last the color band.

We generate L=20L=20 random octanary patterns and gather the coded diffraction patterns for each color band using these 20 samples. As before, we run 50 iterations of the power method as the initialization step. The updates use the sequence μτ=min(1exp(τ/τ0),0.4)\mu_{\tau}=\min(1-\exp({-\tau/\tau_{0}}),0.4) where τ0330\tau_{0}\approx 330 as before. In all cases we run 300300 iterations and record the relative recovery error as well as the running time. If x\bm{x} and x^\hat{\bm{x}} are the original and recovered images, the relative error is equal to x^x/x\left\|\hat{\bm{x}}-\bm{x}\right\|/\left\|\bm{x}\right\|, where \left\|\cdot\right\| is the Euclidean norm x2=i,j,kx(i,j,k)2\left\|\bm{x}\right\|^{2}=\sum_{i,j,k}|x(i,j,k)|^{2}. The computational time we report is the the computational time averaged over the three RGB images. All experiments were carried out on a MacBook Pro with a 2.4 GHz Intel Core i7 Processor and 8 GB 1600 MHz DDR3 memory.

Figure 3 shows the images recovered via the Wirtinger flow algorithm. In all cases, WF gets 12 or 13 digits of precision in a matter of minutes. To convey an idea of timing that is platform-independent, we also report time in units of FFTs; one FFT unit is the amount of time it takes to perform a single FFT on an image of the same size. Now all the workload is dominated by matrix vector products of the form Az\bm{A}\bm{z} and Av\bm{A}^{*}\bm{v}. In details, each iteration of the power method in the initialization step, or each update (2.2) requires 4040 FFTs; the factor of 40 comes from the fact that we have 20 patterns and that each iteration involves one FFT and one adjoint or inverse FFT. Hence, the total number of FFTs is equal to

Another way to state this is that the total workload of our algorithm is roughly equal to 350350 applications of the sensing matrix A\bm{A} and its adjoint A\bm{A}^{*}. For about 13 digits of accuracy (relative error of about 101310^{-13}), Figure 3 shows that we need between 21,000 and 42,000 FFT units. This is within a factor between 1.5 and 3 of the optimal number computed above. This increase has to do with the fact that in our implementation, certain variables are copied into other temporary variables and these types of operations cause some overhead. This overhead is non-linear and becomes more prominent as the size of the signal increases.

For comparison, SDP based solutions such as PhaseLift and PhaseCut would be prohibitive on a laptop computer as the lifted signal would not fit into memory. In the SDP approach an nn pixel image become an n2/2n^{2}/2 array, which in the first example already takes storing the lifted signal even for the smallest image requires (189×768)2×1/2×8(189\times 768)^{2}\times 1/2\times 8 Bytes, which is approximately 85 GB of space. (For the image of the Milky Way, storage would be about 17 TB.) These large memory requirements prevent the application of full-blown SDP solvers on desktop computers.

4 3D molecules

Understanding molecular structure is a great contemporary scientific challenge, and several techniques are currently employed to produce 3D images of molecules; these include electron microscopy and X-ray imaging. In X-ray imaging, for instance, the experimentalist illuminates an object of interest, e.g. a molecule, and then collects the intensity of the diffracted rays, please see Figure 4 for an illustrative setup. Figures 5 and 6 show the schematic representation and the corresponding electron density maps for the Caffeine and Nicotine molecules: the density map ρ(x1,x2,x3)\rho(x_{1},x_{2},x_{3}) is the 3D object we seek to infer. In this paper, we do not go as far 3D reconstruction but demonstrate the performance of the Wirtinger flow algorithm for recovering projections of 3D molecule density maps from simulated data. For related simulations using convex schemes we refer the reader to .

To derive signal equations, consider an experimental apparatus as in Figure 4. If we imagine that light propagates in the direction of the x3x_{3}-axis, an approximate model for the collected data reads

Therefore, by imputing the missing phase using phase retrieval algorithms, one can recover a slice of the 3D Fourier transfom of the electron density map, i.e. R(f1,f2,0)R(f_{1},f_{2},0). Viewing the object from different angles or directions gives us different slices. In a second step we do not perform in this paper, one can presumably recover the 3D Fourier transform of the electron density map from all these slices (this is the tomography or blind tomography problem depending upon whether or not the projection angles are known) and, in turn, the 3D electron density map.

We now generate 5151 observation planes by rotating the x1x2x_{1}x_{2}-plane around the x1x_{1}-axis by equally spaced angles in the interval [0,2π][0,2\pi]. Each of these planes is associated with a 2D projection of size 1024×10241024\times 1024, giving us 20 coded diffraction octanary patterns (we use the same patterns for all 51 projections). We run the Wirtinger flow algorithm with exactly the same parameters as in the previous section, and stop after 150 gradient iterations. Figure 8 reports the average relative error over the 51 projections and the total computational time required for reconstructing all 51 images.

Theory for the Coded Diffraction Model

We complement our study with theoretical results applying to the model of coded diffraction patterns. These results concern a variation of the Wirtinger flow algorithm: whereas the iterations are exactly the same as (2.2), the initialization applies an iterative scheme which uses fresh sets of sample at each iteration. This is described in Algorithm 2. In the CDP model, the partitioning assigns to the same group all the observations and sampling vectors corresponding to a given realization of the random code. This is equivalent to partitioning the random patterns into B+1B+1 groups. As a result, sampling vectors in distinct groups are stochastically independent.

with probability at least 1(4L+2)/n31-(4L+2)/{n^{3}}. Further, take a constant learning parameter sequence, μτ=μ\mu_{\tau}=\mu for all τ=1,2,\tau=1,2,\ldots and assume μc1\mu\leq c_{1} for some fixed numerical constant c1c_{1}. Then there is an event of probability at least 1(2L+1)/n31/n21-(2L+1)/{n^{3}}-{1}/{n^{2}}, such that on this event, starting from any initial solution z0\bm{z}_{0} obeying (5.1), we have

In the Gaussian model, both statements also hold with high probability provided that mc0n(logn)2m\geq c_{0}\cdot n\,(\log n)^{2}, where c0c_{0} is a sufficiently large numerical constant.

Hence, we achieve perfect recovery from on the order of n(logn)4n(\log n)^{4} samples arising from a coded diffraction experiment. Our companion paper established that PhaseLift—the SDP relaxation—is also exact with a sampling complexity on the order of n(logn)4n(\log n)^{4} (this has recently been improved to n(logn)2n(\log n)^{2} ). We believe that the sampling complexity of both approaches (WF and SDP) can be further reduced to nlognn\log n (or even nn for certain kind of patterns). We leave this to future research.

Setting μ=c1\mu=c_{1} yields ϵ\epsilon accuracy in O(log1/ϵ)\mathcal{O}(\log 1/\epsilon) iterations. As the computational work at each iteration is dominated by two matrix-vector products of the form Az\bm{A}\bm{z} and Av\bm{A}^{*}\bm{v}, it follows that the overall computational is at most O(nLlognlog1/ϵ)\mathcal{O}(nL\log n\log{1}/{\epsilon}). In particular, this approach yields a near-linear time algorithm in the CDP model (linear in the dimension of the signal nn). In the Gaussian model, the complexity scales like O(mnlog1/ϵ)\mathcal{O}(mn\log{1}/{\epsilon}).

Wirtinger Derivatives

of several complex variables can be written in the form f(z,zˉ)f(\bm{z},\bar{\bm{z}}), where ff is holomorphic in z=x+iy\bm{z}=\bm{x}+i\bm{y} for fixed zˉ\bar{\bm{z}} and holomorphic in zˉ=xiy\bar{\bm{z}}=\bm{x}-i\bm{y} for fixed z\bm{z}. This holds as long as a the real-valued functions uu and vv are differentiable as functions of the real variables x\bm{x} and y\bm{y}. As an example, consider

This fact underlies the development of the Wirtinger calculus. In essence, the conjugate coordinates

Our definitions follow standard notation from multivariate calculus so that derivatives are row vectors and gradients are column vectors. In this new coordinate system the complex gradient is given by

In this coordinate system the complex Hessian is given by

If we were to run gradient descent in this new coordinate system, the iterates would be

Note that when ff is a real-valued function (as in this paper) we have

Therefore, the second set of updates in (6.1) is just the conjugate of the first. Thus, it is sufficient to keep track of the first update, namely,

For real valued functions of complex variables, setting

Proofs

We first note that in the CDP model with admissible CDPs ar6n\left\|\bm{a}_{r}\right\|\leq\sqrt{6n} for all r=1,2,,mr=1,2,\ldots,m, as in our CDP model d3<6\left|d\right|\leq\sqrt{3}<\sqrt{6}. In the Gaussian model the measurements vectors also obey ar6n\left\|\bm{a}_{r}\right\|\leq\sqrt{6n} for all r=1,2,,mr=1,2,\ldots,m with probability at least 1me1.5n1-me^{-1.5n}. Throughout the proofs, we assume we are on this event without explicitly mentioning it each time.

Before we begin with the proofs we should mention that we will prove our result using the update

Since z02x2164x2\left|\left\|\bm{z}_{0}\right\|^{2}-\left\|\bm{x}\right\|^{2}\right|\leq\frac{1}{64}\left\|\bm{x}\right\|^{2} holds with high probability as proven in Section 7.8, we have

Therefore, the results for the update (7.1) automatically carry over to the WF update with a simple rescaling of the upper bound on the learning parameter. More precisely, if we prove that the update (7.1) converges to a global optimum as long as μμ0\mu\leq\mu_{0}, then the convergence of the WF update to a global optimum is guaranteed as long as μWF6364μ0\mu_{\text{WF}}\leq\frac{63}{64}\mu_{0}. Also, the update in (7.1) is invariant to the Euclidean norm of x\bm{x}. Therefore, without loss of generality we will assume throughout the proofs that x=1\left\|\bm{x}\right\|=1.

We remind the reader that throughout x\bm{x} is a solution to our quadratic equations, i.e. obeys y=Ax2y=|\bm{A}\bm{x}|^{2} and that the sampling vectors are independent from x\bm{x}. Define

to be the set of all vectors that differ from the planted solution x\bm{x} only by a global phase factor. We also introduce the set of all points that are close to PP,

2 Formulas for the complex gradient and Hessian

We gather some useful gradient and Hessian calculations that will be used repeatedly. Starting with

3 Expectation and concentration

This section gathers some useful intermediate results whose proofs are deferred to Appendix A. The first two lemmas establish the expectation of the Hessian, gradient and a related random variable in both the Gaussian and admissible CDP models.In the CDP model the expectation is with respect to the random modulation pattern.

Recall that x\bm{x} is a solution obeying x=1\left\|\bm{x}\right\|=1, which is independent from the sampling vectors. Furthermore, assume the sampling vectors ar\bm{a}_{r} are distributed according to either the Gaussian or admissible CDP model. Then

The next lemma gathers some useful identities in the Gaussian model.

The next lemma establishes the concentration of the Hessian around its mean for both the Gaussian and the CDP model.

In the setup of Lemma 7.1, assume the vectors ar\bm{a}_{r} are distributed according to either the Gaussian or admissible CDP model with a sufficiently large number of measurements. This means that the number of samples obeys mc(δ)nlognm\geq c(\delta)\cdot n\log n in the Gaussian model and the number of patterns obeys Lc(δ)log3nL\geq c(\delta)\cdot\log^{3}n in the CDP model. Then

holds with probability at least 110eγn8/n21-10e^{-\gamma n}-{8}/{n^{2}} and 1(2L+1)/n31-(2L+1)/{n^{3}} for the Gaussian and CDP models, respectively.

We will also make use of the two results below, which are corollaries of the three lemmas above. These corollaries are also proven in Appendix A.

The next lemma establishes the concentration of the gradient around its mean for both Gaussian and admissible CDP models.

holds with probability at least 120eγm4m/n41-20e^{-\gamma m}-{4m}/{n^{4}} in the Gaussian model and 1(4L+2)/n31-(4L+2)/{n^{3}} in the CDP model.

We finish with a result concerning the concentration of the sample covariance matrix.

holds with probability at least 12eγm1-2e^{-\gamma m} for the Gaussian model and 11/n21-{1}/{n^{2}} in the CDP model. On this event,

4 General convergence analysis

We will assume that the function ff satisfies a regularity condition on E(ϵ)E(\epsilon), which essentially states that the gradient of the function is well behaved. We remind the reader that E(ϵ)E(\epsilon), as defined in (7.4), is the set of points that are close to the path of global minimizers.

We say that the function ff satisfies the regularity condition or RC(α,β,ϵ)RC(\alpha,\beta,\epsilon) if for all vectors zE(ϵ)\bm{z}\in E(\epsilon) we have

In the lemma below we show that as long as the regularity condition holds on E(ϵ)E(\epsilon) then Wirtinger Flow starting from an initial solution in E(ϵ)E(\epsilon) converges to a global optimizer at a geometric rate. Subsequent sections shall establish that this property holds.

Assume that ff obeys RC(α,β,ϵ)(\alpha,\beta,\epsilon) for all zE(ϵ)\bm{z}\in E(\epsilon). Furthermore, suppose z0E\bm{z}_{0}\in E, and assume 0<μ2/β0<\mu\leq{2}/{\beta}. Consider the following update

Then for all τ\tau we have zτE(ϵ)\bm{z}_{\tau}\in E(\epsilon) and

We note that for αβ<4\alpha\beta<4, (7.10) holds with the direction of the inequality reversed.One can see this by applying Cauchy-Schwarz and calculating the determinant of the resulting quadratic form. Thus, if RC(α,β,ϵ)RC(\alpha,\beta,\epsilon) holds, α\alpha and β\beta must obey αβ4\alpha\beta\geq 4. As a result, under the stated assumptions of Lemma 7.10 above, the factor 12μ/α14/(αβ)1-2\mu/\alpha\geq 1-4/(\alpha\beta) is non-negative.

Proof The proof follows a structure similar to related results in the convex optimization literature e.g. [53, Theorem 2.1.15]. However, unlike these classical results where the goal is often to prove convergence to a unique global optimum, the objective function ff does not have a unique global optimum. Indeed, in our problem, if x\bm{x} is solution, then eiϕxe^{i\phi}\bm{x} is also solution. Hence, proper modification is required to prove convergence results.

We prove that if zE(ϵ)\bm{z}\in E(\epsilon) then for all 0<μ2/β0<\mu\leq{2}/{\beta}

Therefore, if zE(ϵ)\bm{z}\in E(\epsilon) then we also have z+E(ϵ)\bm{z}_{+}\in E(\epsilon). The lemma follows by inductively applying (7.11). Now simple algebraic manipulations together with the regularity condition (7.10) give

where the last line follows from μ2/β\mu\leq{2}/{\beta}. The definition of ϕ(z+)\phi(\bm{z}_{+}) gives

5 Proof of the regularity condition

For any zE(ϵ)\bm{z}\in E(\epsilon), we need to show that

We prove this with δ=0.01\delta=0.01 by establishing that our gradient satisfies the local smoothness and local curvature conditions defined below. Combining both these two properties gives (7.12).

We say that the function ff satisfies the local curvature condition or LCC(α,ϵ,δ)LCC(\alpha,\epsilon,\delta) if for all vectors zE(ϵ)\bm{z}\in E(\epsilon),

This condition essentially states that the function curves sufficiently upwards (along most directions) near the curve of global optimizers.

We say that the function ff satisfies the local smoothness condition or LSC(β,ϵ,δ)LSC(\beta,\epsilon,\delta) if for all vectors zE(ϵ)\bm{z}\in E(\epsilon) we have

This condition essentially states that the gradient of the function is well behaved (the function does not vary too much) near the curve of global optimizers.

6 Proof of the local curvature condition

For any zE(ϵ)\bm{z}\in E(\epsilon), we want to prove the local curvature condition (7.13). Recall that

and define h:=eiϕ(z)zx\bm{h}:=e^{-i\phi(\bm{z})}\bm{z}-\bm{x}. To establish (7.13) it suffices to prove that

holds for all h\bm{h} satisfying Im(hx)=0\operatorname{Im}(\bm{h}^{*}\bm{x})=0, h2ϵ\|\bm{h}\|_{2}\leq\epsilon. Equivalently, we only need to prove that for all h\bm{h} satisfying Im(hx)=0\operatorname{Im}(\bm{h}^{*}\bm{x})=0, h2=1\|\bm{h}\|_{2}=1 and for all ss with 0sϵ0\leq s\leq\epsilon,

holds for all h\bm{h} obeying h=1\left\|\bm{h}\right\|=1. Therefore, to establish the local curvature condition (7.13) it suffices to show that

We will establish (7.17) for different measurement models and different values of ϵ\epsilon. Below, it shall be convenient to use the shorthand

Set ϵ=1/8n\epsilon={1}/{8\sqrt{n}}. We show that with high probability, (7.17) holds for all h\bm{h} satisfying Im(hx)=0(\bm{h}^{*}\bm{x})=0, h2=1\|\bm{h}\|_{2}=1, 0sϵ0\leq s\leq\epsilon, δ0.01\delta\leq 0.01, and α30\alpha\geq 30. First, note that by Cauchy-Schwarz inequality,

The last inequality follows from (ab)2a22b2(a-b)^{2}\geq\frac{a^{2}}{2}-b^{2}. By Corollary 7.5,

holds with high probability for all h\bm{h} obeying h=1\left\|\bm{h}\right\|=1. Furthermore, by applying Lemma 7.8,

holds with high probability. Plugging (7.19) and (7.20) in (7.6.1) yields

(7.17) follows by using α30\alpha\geq 30, ϵ=18n\epsilon=\frac{1}{8\sqrt{n}} and δ=0.01\delta=0.01.

6.2 Proof of (7.17) with ϵ=1/8italic-ϵ18\epsilon={1}/{8} in the Gaussian model

Set ϵ=1/8\epsilon={1}/{8}. We show that with high probability, (7.17) holds for all h\bm{h} satisfying Im(hx)=0\operatorname{Im}(\bm{h}^{*}\bm{x})=0, h2=1\|\bm{h}\|_{2}=1, 0sϵ0\leq s\leq\epsilon, δ2\delta\leq 2, and α8\alpha\geq 8. To this end, we first state a result about the tail of a sum of i.i.d. random variables. Below, Φ\Phi is the cumulative distribution function of a standard normal variable.

To establish (7.17) we first prove it for a fixed h\bm{h}, and then use a covering argument. Observe that

compare (7.5) and (7.6). Therefore, using s18s\leq\frac{1}{8},

We have all the elements to apply Lemma 7.13 with σ2=mmax(92,210)=210m\sigma^{2}=m\max(9^{2},210)=210m and y=m/4y={m}/{4}:

with γ=1/1260\gamma=1/1260. Therefore, with probability at least 1e3γm1-e^{-3\gamma m}, we have

holds with probability at least 1eγm1-e^{\gamma m} as long as mcnlognm\geq c\cdot n\log n. This concludes the proof of (7.17) with α8\alpha\geq 8.

7 Proof of the local smoothness condition

Define h:=eϕ(z)zx\bm{h}:=e^{-\phi(\bm{z})}\bm{z}-\bm{x} and w:=eiϕ(z)u\bm{w}:=e^{-i\phi(\bm{z})}\bm{u}, to establish (7.14) it suffices to prove that

holds for all h\bm{h} and w\bm{w} satisfying Im(hx)=0(\bm{h}^{*}\bm{x})=0, hϵ\left\|\bm{h}\right\|\leq\epsilon and w=1\left\|\bm{w}\right\|=1. Equivalently, we only need to prove for all h\bm{h} and w\bm{w} satisfying Im(hx)=0(\bm{h}^{*}\bm{x})=0, h=w=1\left\|\bm{h}\right\|=\left\|\bm{w}\right\|=1 and s:0sϵ\forall s:0\leq s\leq\epsilon,

Note that since (a+b+c)23(a2+b2+c2)(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})

We now bound each of the terms on the right-hand side. For the first term we use Cauchy-Schwarz and Corollary 7.6, which give

Finally, for the third term we use the Cauchy-Schwarz inequality together with Lemma 7.8 (inequality) (7.9) to derive

We now plug these inequalities into (7.7) and get

which completes the proof of (7.26) and, in turn, establishes the local smoothness condition in (7.14). However, the last line of (7.7) holds as long as

In our theorems we use two different values of ϵ=18n\epsilon=\frac{1}{8\sqrt{n}} and ϵ=18\epsilon=\frac{1}{8}. Using δ0.01\delta\leq 0.01 in (7.32) we conclude that the local smoothness condition (7.7) holds as long as

8 Wirtinger flow initialization

In this section, we prove that the WF initialization obeys (3.1) from Theorem 3.3. Recall that

Also, since λ0\lambda_{0} is the top eigenvalue of Y\bm{Y}, and x=1\left\|\bm{x}\right\|=1, we have

Combining the above two inequalities together, we have

9 Initialization via resampled Wirtinger Flow

In this section, we prove that that the output of Algorithm 2 obeys (5.1) from Theorem 5.1. Introduce

We prove that FF obeys a regularization condition in E(1/8)E(1/8), namely,

Lemma 7.7 implies that for a fixed vector z\bm{z},

holds with high probability. The last inequality follows from (7.33). Applying Lemma 7.7, we also have

It only remains to establish (7.33). First, without loss of generality, we can assume that ϕ(z)=0\phi(\bm{z})=0, which implies Re(zx)=zx\operatorname{Re}(\bm{z}^{*}\bm{x})=\left|\bm{z}^{*}\bm{x}\right| and use zx\left\|\bm{z}-\bm{x}\right\| in lieu of dist(z,x)(\bm{z},\bm{x}). Set h:=zx\bm{h}:=\bm{z}-\bm{x} so that Im(xh)=0\operatorname{Im}(\bm{x}^{*}\bm{h})=0. This implies

where the last inequality is due to hϵ1/8\left\|\bm{h}\right\|\leq\epsilon\leq{1}/{8}. Furthermore,

where the last inequality also holds because hϵ1/8\left\|\bm{h}\right\|\leq\epsilon\leq{1}/{8}. Finally, (7.35) and (7.9) imply

where α=8\alpha^{\prime}=8 and β=200\beta^{\prime}=200.

Acknowledgements

E. C. is partially supported by AFOSR under grant FA9550-09-1-0643 and by gift from the Broadcom Foundation. X. L. is supported by the Wharton Dean’s Fund for Post-Doctoral Research and by funding from the National Institutes of Health. M. S. is supported by a Benchmark Stanford Graduate Fellowship and AFOSR grant FA9550-09-1-0643. M. S. would like to thank Andrea Montanari for helpful discussions and for the class which inspired him to explore provable algorithms based on non-convex schemes. He would also like to thank Alexandre d’Aspremont, Fajwel Fogel and Filipe Maia for sharing some useful code regarding 3D molecule reconstructions, Ben Recht for introducing him to reference , and Amit Singer for bringing the paper to his attention.

References

Appendix A Expectations and deviations

The proof for admissible coded diffraction patterns follows from Lemmas 3.1 and 3.2 in . For the Gaussian model, it is a consequence of the two lemmas below, whose proofs are ommitted.

A.2 Proof of Lemma 7.2

Thus by applying Lemma 3.1 in (for the CDP model) and Lemma A.1 above (for the Gaussian model) we have

A.3 Proof of Lemma 7.3

The identity (7.7) follows from standard normal moment calculations.

A.4 Proof of Lemma 7.4

where RR is a positive scalar whose value will be determined shortly, and define the events

A.4.2 The Gaussian model

By unitary invariance, we may take x=e1\bm{x}=\bm{e}_{1}. Letting z(1)\bm{z}(1), be the first coordinate of a vector z\bm{z}, to prove Lemma 7.4 for the Gaussian model it suffices to prove the two inequalities,

For any ϵ>0\epsilon>0, there is a constant C>0C>0 with the property that mCnm\geq C\cdot n implies

with probability at least 13n21-3n^{-2}; this is a consquence of Chebyshev’s inequality. Moreover a union bound gives

We now turn our attention to the last two terms of (A.4.2). For the second term, the ordinary Hoeffding’s inequality (Proposition 10 in ) gives that for any constants δ0\delta_{0} and γ\gamma, there exists a constant C(δ0,γ)C(\delta_{0},\gamma), such that for mC(δ0,γ)n(r=1mar(1)6)m\geq C(\delta_{0},\gamma)\sqrt{n\left(\sum_{r=1}^{m}\left|\bm{a}_{r}(1)\right|^{6}\right)},

holds with probability at least 13e2γn1-3e^{-2\gamma n}. To control the final term, we apply the Bernstein-type inequality (Proposition 16 in ) to assert the following: for any positive constants δ0\delta_{0} and γ\gamma, there exists a constant C(δ0,γ)C(\delta_{0},\gamma), such that for mC(δ0,γ)(n(r=1mar(1)4)+nmaxr=1mar(1)2)m\geq C(\delta_{0},\gamma)(\sqrt{n(\sum_{r=1}^{m}\left|\bm{a}_{r}(1)\right|^{4})}+n\max_{r=1}^{m}|\bm{a}_{r}(1)|^{2}),

holds with probability at least 12e2γn1-2e^{-2\gamma n}.

Therefore, for any unit norm vector y\bm{y},

holds with probability at least 15e2γn1-5e^{-2\gamma n}. By Lemma 5.4 in , we can bound the operator norm via an ϵ\epsilon-net argument:

The proof of (A.3) is similar. The only difference is that the random matrix is not Hermitian, so we work with

where u\bm{u} and v\bm{v} are unit vectors.

A.5 Proof of Corollary 7.5

The other inequality is established in a similar fashion.

A.6 Proof of Corollary 7.6

In the proof of Lemma 7.4, we established that with high probability,

This concludes the proof of one side. The other side is similar.

A.7 Proof of Lemma 7.7

Define h:=eϕ(z)zx\bm{h}:=e^{-\phi(\bm{z})}\bm{z}-\bm{x} and w:=eiϕ(z)u\bm{w}:=e^{-i\phi(\bm{z})}\bm{u}, we have

Combining (A.7) and (A.8) together with the triangular inequality and Lemma 7.4 give

A.8 Proof of Lemma 7.8

The result for the CDP model follows from Lemma 3.3 in . For the Gaussian model, it is a consequence of standard results, e.g. Theorem 5.39 in , concerning the deviation of the sample covariance matrix from its mean.

Appendix B The Power Method

We use the power method (Algorithm 3) with a random initialization to compute the first eigenvector of Y=Adiag{y}A\bm{Y}=\bm{A}\,\text{diag}\{\bm{y}\}\,\bm{A}^{*}. Since, each iteration of the power method asks to compute the matrix-vector product

we simply need to apply A\bm{A} and A\bm{A}^{*} to an arbitrary vector. In the Gaussian model, this costs 2mn2mn multiplications while in the CDP model the cost is that of 2L2L nn-point FFTs. We now turn our attention to the number of iterations required to achieve a sufficiently accurate initialization.

Standard results from numerical linear algebra show that after kk iterations of the power method, the accuracy of the eigenvector is O(tanθ0(λ2/λ1)k)\mathcal{O}(\tan\theta_{0}(\lambda_{2}/\lambda_{1})^{k}), where λ1\lambda_{1} and λ2\lambda_{2} are the top two eigenvalues of the positive semidefinite matrix Y\bm{Y}, and θ0\theta_{0} is the angle between the initial guess and the top eigenvector. Hence, we would need on the order of log(n/ϵ)/log(λ1/λ2)\log(n/\epsilon)\,/\,\log(\lambda_{1}/\lambda_{2}) for ϵ\epsilon accuracy. Under the stated assumptions, Lemma 7.4 bounds below the eigenvalue gap by a numerical constant so that we can see that few iterations of the power method would yield accurate estimates.