Phase Retrieval from Coded Diffraction Patterns

Emmanuel Candes, Xiaodong Li, Mahdi Soltanolkotabi

Introduction

The phase retrieval problem is that of recovering the missing phase of the data ak,x\langle\bm{a}_{k},\bm{x}\rangle. Once this information is available, one can find the vector x\bm{x} by essentially solving a system of linear equations.

where Ω\Omega is a sampled set of frequencies in $(westatedtheprobleminonedimensiontosimplifymatters).Wethusrecognizeaninstanceof(1.1)inwhichthevectors(we stated the problem in one dimension to simplify matters). We thus recognize an instance of (1.1) in which the vectors\bm{a}_{k}$ are sampled values of complex sinusoids. X-ray diffraction images are of this form, and as is well known, permitted the discovery of the double helix . In addition to X-ray crystallography , the phase problem has numerous other applications in the imaging sciences such as diffraction and array imaging , optics , speckle imaging in astronomy , and microscopy . Other areas where related problems appear include acoustics , blind channel estimation in wireless communications , interferometry , quantum mechanics and quantum information .

2 Convex relaxation

where tr(X){{\operatorname{tr}}}(\bm{X}) is the trace of the matrix X\bm{X}. The idea is then to lift the problem in higher dimensions: introducing the Hermitian matrix variable XSn×n\bm{X}\in\mathcal{S}^{n\times n}, the phase problem is equivalent to finding X\bm{X} obeying

where, here and below, X0\bm{X}\succeq 0 means that X\bm{X} is positive semidefinite. This problem is not tractable and, by dropping the rank constraint, is relaxed into

PhaseLift (1.4) is a semidefinite program (SDP). If its solution happens to have rank one and is equal to xx\bm{x}\bm{x}^{*}, then a simple factorization recovers x\bm{x} up to a global phase/sign.

We pause to emphasize that in different contexts, similar convex relaxations for optimizing quadratic objectives subject to quadratic constraints are known as Schor’s semidefinite relaxations, see [41, Section 4.3] and on the MAXCUT problem from graph theory for spectacular applications of these ideas. For related convex relaxations of quadratic problems, we refer the interested reader to the wonderful tutorial .

3 This paper

Numerical experiments together with emerging theory suggest that the PhaseLift approach is surprisingly effective. On the theoretical side, starting with , a line of work established that if the sampling vectors ak\bm{a}_{k} are sufficiently randomized, then the convex relaxation is provably exact. Assuming that the ak\bm{a}_{k}’s are independent random (complex-valued) Gaussian vectors, shows that on the order of nlognn\log n quadratic measurements are sufficient to guarantee perfect recovery via (1.4) with high probability. A subset of the authors reached the same conclusion from on the order of nn equations only, by solving the SDP feasibility problem; to be sure, establishes that the set of matrices obeying the constraints in (1.4) reduces to a unique point namely, xx\bm{x}\bm{x}^{*}, see for a similar result. also establishes near-optimal estimation bounds from noisy data. Finally, inspired by PhaseLift and the famous MAXCUT relaxation of Goemans and Williamson, proposed another semidefinite relaxation called PhaseCut whose performance from noiseless data—in terms of the number of samples needed to achieve perfect recovery—turns out to be identical to that of PhaseLift.

While this is all reassuring, the problem is that the Gaussian model, in which each measurement gives us the magnitude of the dot product t=0n1x[t]ak[t]\sum_{t=0}^{n-1}x[t]a_{k}[t] between the signal and (complex-valued) Gaussian white noise, is very far from the kind of data one can collect in an XX-ray imaging and many related experiments. The purpose of this paper is to show that the PhaseLift relaxation is still exact in a physically inspired setup where one can modulate the signal of interest and then let diffraction occur.

4 Coded diffraction patterns

Imagine then that we modulate the signal before diffraction. Letting d[t]d[t] be the modulating waveform, we would observe the diffraction pattern

We call this a coded diffraction pattern (CDP) since it gives us information about the spectrum of {x[t]}\{x[t]\} modulated by the code {d[t]}\{d[t]\}. There are several ways of achieving modulations of this type: one can use a phase mask just after the sample, see Figure 1, or use an optical grating to modulate the illumination beam as mentioned in , or even use techniques from ptychography which scan an illumination patch on an extended specimen . We refer to for a more thorough discussion of such approaches.

In this paper, we analyze such a data collection scheme in which one uses multiple modulations. Our model for data acquisition is thus as follows:

We prove that if we use random modulation patterns (random waveforms d[t]d[t]), then the solution to (1.4) is exact with high probability provided that we have sufficiently many CDPs. In fact, we will see that the feasible set in (1.4) equal to

5 Main result

Suppose that the modulation is admissible and that the number LL of coded diffraction patterns obeys

for some fixed numerical constant cc. Then with probability at least 11/n1-{1}/{n}, the feasibility problem (1.7) reduces to a unique point, namely, xx\bm{x}\bm{x}^{*}, and thus recovers x\bm{x} up to a global phase shift. For γ1\gamma\geq 1, setting Lcγlog4nL\geq c\gamma\log^{4}n leads to a probability of success at least 1nγ1-n^{-\gamma}.

Thus, in a stylized physical setting, it is possible to recover an arbitrary signal from a fairly limited number of coded diffraction patterns by solving an SDP feasibility problem. As mentioned earlier, the equivalence from implies that our theoretical guarantees automatically carry over to the PhaseCut formulation.

Mathematically, the phase recovery problem is different than that in which the sampling vectors are Gaussian as in . The reason is that the measurements in Theorem 1.1 are far more structured and far ‘less random’. Loosely speaking, our random modulation model uses on the order of m:=nLm:=nL random bits whereas the Gaussian model with the same number of quadratic equations would use on the order of mnmn random bits (this can be formalized by using the notion of entropy from information theory). A consequence of this difference is that the proof of the theorem requires new techniques and ideas. Having said this, an open and interesting research direction is to close the gap—remove the log factors—and show whether or not perfect recovery can be achieved from a number of coded diffraction patterns independent of dimension.

The first version of this paper was made publicly available at the same time as , which begins to study the performance of PhaseLift from non-Gaussian sampling vectors. There, the authors study sampling schemes from certain finite vector configurations, dubbed t-designs. These models are different from ours and do not include our coded diffraction patterns as a special case. Hence, our results are not comparable. Having said this, there are similarities in the proof techniques, especially in the role played by the robust injectivity property, compare our Lemma 3.7 from Section 3.3 with [32, Section 3.3].

6 Other approaches to phase retrieval and related works

There are of course other approaches to phase retrieval and we mention some literature for completeness and to inform the interested reader of recent progress in this area. Balan studies a problem where the sampling vectors model a short-time Fourier transform. Balan, Casazza and Edidin formulate the phase retrieval problem as nonconvex optimization problem. In , the same authors describe some applications of the phase problem in signal processing and speech analysis and presents some necessary and sufficient conditions which guarantee that the solution to (1.1) is unique. Other articles studying the minimal number of frame coefficient magnitudes for noiseless recovery include . We recommend the two blog posts and by Mixon and the references therein for a comprehensive review and discussion of such results. Lower bounds on the performance of any recovery method from noisy data are studied in .

On the algorithmic side, proposes a nonlinear scheme for phase retrieval having exponential time complexity in the dimension of the signal x\bm{x} while presents a tractable algorithm requiring a number of measurements at least quadratic in the dimension of the signal; that is to say, mcn2m\geq c\cdot n^{2} for some constant c>0c>0.

We also wish to mention some recent works aiming at presenting efficient reconstruction algorithms for generic frames (for certain types of sampling vectors such as those in fast implementations already exist) and give two references. The first introduces an iterative regularized least-square algorithm and establishes convergence of the algorithm. It is however not known whether this algorithm enjoys accurate reconstruction guarantees. The second is recent and studies an alternative minimization approach for phase retrieval, which yields very accurate but not exact reconstructions from a number of measurements that needs to be at least on the order of nlogn3n\log n^{3} Gaussian measurements.

There also is a recent body of work studying the phase retrieval under sparsity assumptions about the signal we wish to recover, see as well as the references therein. Finally, a different line of work studies the phase retrieval by polarization, see also for a related approach. This technique comes with an algorithm that can achieve recovery using on the order of logn\log n specially constructed masks/codes in the noiseless case. However, to the extent of our knowledge, PhaseLift offers more flexibility in terms of the number and types of masks that can be used since it can be applied regardless of the data acquisition scheme. In addition, when dealing with noisy data PhaseLift behaves very well, see Section 2 below and the experiments in .

Numerical Experiments

In this section we carry out some simple numerical experiments to show how the performance of the algorithm depends on the number of measurements/masks and how the algorithm is affected by noise. To solve the optimization problems below we use Auslender and Teboulle sub-gradient optimization method with a solver written in the framework provided by TFOCS (The code is available online at ). The stopping criterion is when the Frobenius norm of the relative error of the objective between two subsequent iterations falls below 101010^{-10} or the number of iterations reaches 50,00050,000, whichever occurs first. Before presenting the results we introduce the signal and measurement models we use.

Random low-pass signals. Here, x\bm{x} is given by

with M=n/8M=n/8 and XkX_{k} and YkY_{k} are i.i.d. N(0,1)\mathcal{N}(0,1).

where XkX_{k} and YkY_{k} are are i.i.d. N(0,1/8)\mathcal{N}(0,1/8) so that the low-pass model is a ‘bandlimited’ version of this high-pass random model (variances are adjusted so that the expected power is the same).

2 Measurement models

We perform simulations based on four different kinds of measurements:

Gaussian measurements. We sample m=nLm=nL random complex Gaussian vectors ak\bm{a}_{k} and use measurements of the form akx2|\bm{a}_{k}^{*}\bm{x}|^{2}.

Binary modulations/codes. We sample (L1L-1) binary codes distributed as

together with a regular diffraction pattern (d[t]=1d[t]=1 for all tt).

Ternary modulations/codes. We sample (L1L-1) ternary codes distributed as (1.10) together with a regular diffraction pattern.

Octanary modulations/codes. Here, the codes are distributed as (1.9).

3 Phase transitions

with λ=103\lambda=10^{-3} (Note that the solution to (2.1) as λ\lambda tends to zero will equal to the optimal solution of (1.4)).

In Figure 2 we report the empirical probability of success for different signal and measurement models with different number of measurements. We declare a trial successful if the relative error of the reconstruction (X^xxF/xxF\left\|\hat{\bm{X}}-\bm{x}\bm{x}^{*}\right\|_{F}/\left\|\bm{x}\bm{x}^{*}\right\|_{F}) falls below 10510^{-5}). These plots suggest that for the type of models studied in this paper six coded patterns are sufficient for exact recovery via convex programming.

4 Noisy measurements

We now study how the performance of the algorithm behaves in the presence of noise. We consider Poisson noise which is the usual noise model in optics. More, specifically we assume that the measurements {yk}k=1m\{y_{k}\}_{k=1}^{m} is a sequence of independent samples from the Poisson distributions Poi(μk)(\mu_{k}), where μk=akx2\mu_{k}=|\bm{a}_{k}^{*}\bm{x}|^{2} correspond to the noiseless measurements. The Poisson log-likelihood for independent samples has the form kyklogμkμk\sum_{k}y_{k}\log\mu_{k}-\mu_{k} (up to an additive constant factor). Following a classical fitting approach we balance a maximum likelihood term with the trace norm in the relaxation (1.4):

Figure 3 shows the average relative MSE (in dB) versus the SNR (also in dB). More precisely, the values of 10log10(rel. MSE)10\log_{10}(\text{rel.~{}MSE}) are plotted, where rel. MSE=X^xxF2/X^F2\text{rel.~{}MSE}=\left\|\hat{\bm{X}}-\bm{x}\bm{x}^{*}\right\|_{F}^{2}/\left\|\hat{\bm{X}}\right\|_{F}^{2}. These figures indicate that the performance of the algorithm degrades linearly as the SNR decreases (on a dB/dB scale). Empirically, the slope is close to -1, which means that the MSE scales like the noise. Together with the low offset, these features indicate that all is as in a well-conditioned-least squares problem.

Proofs

We prove our results in this section. Before we begin, we introduce some notation. We recall that the random variable dd is admissible, i.e. bounded i.e. dM|d|\leq M, symmetric, and obeying moment constraints

Continuing, X\|\bm{X}\| is the spectral or operator norm of a matrix X\bm{X}. Finally, 1\bm{1} is a vector with all entries equal to one.

This subspace may be interpreted as the tangent space at xx\bm{x}\bm{x}^{*} to the manifold of Hermitian matrices of rank 11. Below TT^{\perp} is the orthogonal complement to TT. For a linear subspace VV of Hermitian matrices, we use YV\bm{Y}_{V} or PV(Y)\mathcal{P}_{V}(\bm{Y}) to denote the orthogonal projection of Y\bm{Y} onto VV. With this, the reader will check that YT=(Ixx)Y(Ixx)\bm{Y}_{T^{\perp}}=(\bm{I}-\bm{x}\bm{x}^{*})\bm{Y}(\bm{I}-\bm{x}\bm{x}^{*}).

It is useful to record two identities that shall be used multiple times, and defer the proofs to the Appendix.

Next, we present two simple intermediate results we shall also use. The proofs are also in the Appendix.

Fix δ>0\delta>0 and suppose the number LL of CDP’s obeys LclognL\geq c\log n for some sufficiently large numerical constant cc. Then with probability at least 11/n21-1/n^{2},

For all positive semidefinite matrices X\bm{X}, it holds

Finally, the last piece of mathematics is the matrix Hoeffding inequality

2 Certificates

We now establish sufficient conditions guaranteeing that xx\bm{x}^{*}\bm{x} is the unique feasible point of (1.7). Variants of the lemma below have appeared before in the literature, see .

Suppose the mapping A\mathcal{A} obeys the following two properties:

There exists a self-adjoint matrix of the form Z=A(λ)\bm{Z}=\mathcal{A}^{*}(\bm{\lambda}), with λ\bm{\lambda} real valued (this makes sure that Z\bm{Z} is self adjoint), obeying

Then xx\bm{x}^{*}\bm{x} is the unique element in the feasible set (1.7).

Proof Suppose xx+H\bm{x}\bm{x}^{*}+\bm{H} is feasible. Feasibility implies that H\bm{H} is a self-adjoint matrix in the null space of A\mathcal{A} and HT0\bm{H}_{T^{\perp}}\succeq\bm{0}. This is because for all yx\bm{y}\perp\bm{x},

which says that HT\bm{H}_{T^{\perp}} is positive semidefinite. This gives

where the first inequality above follows from Lemma 3.4. The injectivity property (3.4) gives

and since A(HT)=A(HT)\mathcal{A}(\bm{H}_{T})=-\mathcal{A}(\bm{H}_{T^{\perp}}), we established

In summary, (3.6), (3.7) and (3.8) assert that HT=0\bm{H}_{T}=0. In turn, this gives tr(HT)=0{{\operatorname{tr}}}(\bm{H}_{T^{\perp}})=0 by (3.6), which implies that HT=0\bm{H}_{T^{\perp}}=\bm{0} since HT0\bm{H}_{T^{\perp}}\succeq\bm{0}. This completes the proof.

Property (3.4) can be viewed as a form of robust injectivity of the mapping A\mathcal{A} restricted to elements in TT. It is of course reminiscent of the local restricted isometry property in compressive sensing. Property (3.5) can be interpreted as the existence of an approximate dual certificate. It is well known that injectivity together with an exact dual certificate leads to exact reconstruction. The above lemma essentially asserts that a robust form of injectivity together with an approximate dual certificate leads to exact recovery as in [23, Section 2.1], see also . In the next two sections we show that the two properties stated in Lemma 3.6 above each hold with probability at least 11/(2n)1-1/(2n).

3 Robust injectivity

Fix δ>0\delta>0 and suppose LL obeys Lclog3nL\geq c\log^{3}n for some sufficiently large numerical constant cc. Then with probability at least 11/2n1-1/2n, for all XT\bm{X}\in T,

Proof First, notice that without loss of generality we can assume that xy\bm{x}^{*}\bm{y} is real valued in the definition of TT. That is,

Now for any X=xy+yxT\bm{X}=\bm{x}\bm{y}^{*}+\bm{y}\bm{x}^{*}\in T,

Fix a positive threshold TnT_{n}. We now claim that (3.10) follows from

in which α\alpha is any real valued number. To see why this is true, observe that

The last inequality comes from (3.11) together with

which holds since we assumed that xy\bm{x}^{*}\bm{y} is real valued.

The remainder of the proof justifies (3.11) by means of the matrix Hoeffding inequality. Let <W>\left<\bm{W}\right> be the left-hand side in (3.11) (we use notation from physics to denote empirical averages since a bar denotes complex conjugation and we would like to avoid overloading symbols). By definition <W>\left<\bm{W}\right> is the empirical average of LL i.i.d. copies of

First, Wk(D)0\bm{W}_{k}(\bm{D})\succeq\bm{0} since

We now roughly estimate the mean of W(D)\bm{W}(\bm{D}). Obviously,

Furthermore, a simple calculation shows that since

Setting Tn=2βlognT_{n}=\sqrt{2\beta\log n}, then a simple application of Hoeffding’s inequality gives

and with TnT_{n} as above, collecting our estimates gives

We have done the groundwork to apply the matrix Hoeffding inequality (3.3), which reads

This implies that when β\beta is sufficiently large and Lclog3nL\geq c\log^{3}n for a sufficiently large constant,

with probability at least 11/(2n)1-1/(2n). Now from (3.12), and (3.13) gives

where (3.13) gives that Eϵ/2\|\bm{E}\|\leq\epsilon/2 provided β2+log(16M4ϵ1)/logn\beta\geq 2+\log(16M^{4}\epsilon^{-1})/\log n. Hence, we have established that

since [xxˉ][x,xT]0\begin{bmatrix}\bm{x}\\ \bar{\bm{x}}\end{bmatrix}[\bm{x}^{*},\bm{x}^{T}]\succeq\bm{0}. With ϵ=2δδ2\epsilon=2\delta-\delta^{2}, this is the desired conclusion (3.11).

4 Dual certificate construction via the golfing scheme

We now construct the approximate dual certificate Z\bm{Z} obeying the conditions of Lemma 3.6. For this purpose we use the golfing scheme first presented in the work of Gross . Modifications of this technique have subsequently been used in many other papers e.g. . The special form used here is most closely related to the construction in . The mathematical validity of our construction crucially relies on the lemma below, whose proof is the object of the separate Section 3.5.

Assume that Lclog3nL\geq c\log^{3}n for a sufficiently large constant cc. Then for any fixed XT\bm{X}\in T, there exists Y\bm{Y} of the form Y=A(λ)\bm{Y}=\mathcal{A}^{*}(\bm{\lambda}) with λ\bm{\lambda} real valued such that

holds with probability at least 11/n21-1/n^{2}. This inequality has the immediate consequences

To build our approximate dual certificate Z\bm{Z}, we partition the modulations or CDPs into B+1B+1 different groups so that, from now on, A0\mathcal{A}_{0} corresponds to those measurements from the first L0L_{0} modulations, A1\mathcal{A}_{1} to those from the next L1L_{1} ones, and so on. Clearly, L0+L1++LB=LL_{0}+L_{1}+\ldots+L_{B}=L. The random mappings {Ab}b=0B\{\mathcal{A}_{b}\}_{b=0}^{B} correspond to independent modulations and are thus independent. Our golfing scheme starts with X(0)=2nL0PT(A0(1))\bm{X}^{(0)}=\frac{2}{nL_{0}}\mathcal{P}_{T}(\mathcal{A}_{0}^{*}(\bm{1})) (1\bm{1} is the all-one vector) and for b=1,,Bb=1,\ldots,B, inductively defines

Y(b)Range(Ab)\bm{Y}^{(b)}\in\text{Range}(\mathcal{A}_{b}^{*}) obeying Y(b)X(b1)220X(b1)F\|\bm{Y}^{(b)}-\bm{X}^{(b-1)}\|\leq\frac{\sqrt{2}}{20}\left\|\bm{X}^{(b-1)}\right\|_{F},

and X(b)=X(b1)PT(Y(b))\bm{X}^{(b)}=\bm{X}^{(b-1)}-\mathcal{P}_{T}(\bm{Y}^{(b)}).

Note that Lemma 3.8 asserts that Y(b)\bm{Y}^{(b)} exists with high probability, and that for each bb both

hold on an event of probability at least 11/n21-1/n^{2}.

Then (3.14) implies that with probability at least 1B/n21-B/n^{2}

with probability at least 1B/n21-B/n^{2}. If L0clognL_{0}\geq c\log n for a sufficiently large constant c>0c>0, Lemma 3.3 states that

with probability at least 11/n21-1/n^{2}. Using the fact that for any matrix W\bm{W}, we have WT2W\|\bm{W}_{T}\|\leq 2\|\bm{W}\| and WTW\|\bm{W}_{T^{\perp}}\|\leq\|\bm{W}\| we conclude that

Finally, with (3.17) and \|\bm{I}_{T}\big{\|}\leq 1, we conclude that

Therefore, the assumptions in Lemma 3.6 hold with probability at least 11/2n1-1/2n by applying the union bound and using with the proviso that Bc1lognB\geq c_{1}\log n and Lbc2log3nL_{b}\geq c_{2}\log^{3}n for sufficiently large constants c1c_{1} and c2c_{2} (this is why we require Lclog4nL\geq c\log^{4}n for a sufficiently large constant).

5 Proof of Lemma 3.8

The immediate consequences hold for the following reasons. First, since any matrix in TT has rank at most 2,

where the second inequality follows from MT2M\|\bm{M}_{T}\|\leq 2\|\bm{M}\| for any M\bm{M}. Second, since MTM\|\bm{M}_{T^{\perp}}\|\leq\|\bm{M}\|,

It thus suffices to prove the first property. To this end consider the eigenvalue decomposition of X=λ1u1u1+λ2u2u2\bm{X}=\lambda_{1}\bm{u}_{1}\bm{u}_{1}^{*}+\lambda_{2}\bm{u}_{2}\bm{u}_{2}^{*}. The proof follows from Lemma 3.9 below combined with Lemma 3.3.

Assume Lclog3nL\geq c\log^{3}n for a sufficiently large constant cc. Given any fixed self-adjoint matrix vv\bm{v}\bm{v}^{*}, with probability at least 11/(2n3)1-1/(2n^{3}) there exists Y~Range(A)\widetilde{\bm{Y}}\in\text{Range}(\mathcal{A}^{*}) obeying

Notice that the random positive semi-definite matrix Y\bm{Y} obeys

Using Jensen’s inequality, we have as in the proof of Lemma 3.7

Put Tn=2βlognT_{n}=\sqrt{2\beta\log n}. Hoeffding’s inequality gives

For sufficiently large β\beta, this implies

By using Hoeffding inequality in a similar fashion as in the proof of Lemma 3.7, we obtain (we omit the details)

Combining the latter with (3.22), we conclude

Discussion

In this paper, we proved that a signal could be recovered by convex programming techniques from a few diffraction patterns corresponding to generic modulations obeying an admissibility condition. We expect that our results, methods and proofs extend to more general random modulations although we have not pursued such extensions in this paper. Further, we proved that on the order of (logn)4(\log n)^{4} CDPs suffice for perfect recovery and we expect that further refinements would allow to reduce this number, perhaps all the way down to a figure independent of the number nn of unknowns. Such refinements appear quite involved to us and since our intention is to provide a reasonably short and conceptually simple argument, we leave such refinements to future research.

Appendix

Set ω=e2πin\omega=e^{\frac{2\pi i}{n}} to be the nnth root of unity so that

For two integers aa and bb we use anba\overset{n}{\equiv}b to denote congruence of aa and bb modulo nn (nn divides aba-b).

2 Proof of Lemma 3.2

3 Proof of Lemma 3.3

4 Proof of Lemma 3.4

The proof is straightforward and parallels calculations in . Fix a unit-normed vector v\bm{v}, then

Consider now the eigenvalue decomposition X=j=1nλjvjvj\bm{X}=\sum_{j=1}^{n}\lambda_{j}\bm{v}_{j}\bm{v}_{j}^{*} where λj\lambda_{j} is nonnegative since X0\bm{X}\succeq\bm{0}. Then

Acknowledgements

E C. is partially supported by AFOSR under grant FA9550-09-1-0643, by ONR under grant N00014-09-1-0258 and by a gift from the Broadcom Foundation. M. S. is supported by a a Benchmark Stanford Graduate Fellowship. X. L. is supported by the Wharton Dean’s Fund for Post-Doctoral Research and by funding from the National Institutes of Health. We would like to thank V. Voroninski for helpful discussions, especially for bringing to our attention the difficulty of establishing RIP-1 results in the masked Fourier model. M. S. thanks David Brady and Adam Backer for fruitful discussions about the implementation of structured illuminations in X-ray crystallography and microscopy applications.

References