Phase Retrieval using Alternating Minimization

Praneeth Netrapalli, Prateek Jain, Sujay Sanghavi

Introduction

then the task is to recover $\mathbf{x^{*}}$ using $\mathbf{y}$ and the measurement matrix $\mathbf{A}=[\mathbf{a_{1}}\ \mathbf{a_{2}}\ \dots\ \mathbf{a_{m}}]$ .

The above problem arises in many settings where it is harder / infeasible to record the phase of measurements, while recording the magnitudes is significantly easier. This problem, known as phase retrieval, is encountered in several applications in crystallography, optics, spectroscopy and tomography . Moreover, the problem is broadly studied in the following two settings:

The measurements in (1) correspond to the Fourier transform (the number of measurements here is equal to $n$ ) and there is some apriori information about the signal.

The set of measurements $\mathbf{y}$ are overcomplete (i.e., $m>n$ ), while some apriori information about the signal may or may not be available.

In the first case, various types of apriori information about the underlying signal such as positivity, magnitude information on the signal , sparsity and so on have been studied. In the second case, algorithms for various measurement schemes such as Fourier oversampling , multiple random illuminations and wavelet transform have been suggested.

By and large, the most well known methods for solving this problem are the error reduction algorithms due to Gerchberg and Saxton and Fienup , and variants thereof. These algorithms are alternating projection algorithms that iterate between the unknown phases of the measurements and the unknown underlying vector. Though the empirical performance of these algorithms has been well studied . and they are used in many applications , there are not many theoretical guarantees regarding their performance.

More recently, a line of work has approached this problem from a different angle, based on the realization that recovering $\mathbf{x^{*}}$ is equivalent to recovering the rank-one matrix $\mathbf{x^{*}}\mathbf{x^{*}}^{T}$ , i.e., its outer product. Inspired by the recent literature on trace norm relaxation of the rank constraint, they design SDPs to solve this problem. Refer Section 1.1 for more details.

In this paper we go back to the empirically more popular ideology of alternating minimization; we develop a new alternating minimization algorithm, and show that (a) empirically, it noticeably outperforms convex methods, and (b) analytically, a natural resampled version of this algorithm requires $O(n\log^{3}n\log\frac{1}{\epsilon})$ i.i.d. random Gaussian measurements to geometrically converge to the true vector up to an accuracy of $\epsilon$ . Our contribution:

The iterative part of our algorithm is essentially due to Gerchberg and Saxton and Fienup ; indeed, with out resampling, our algorithm is exactly their famous error reduction algorithm; the novelty in our algorithmic contribution is the initialization step which makes it more likely for the iterative procedure to succeed - see Figures 1, 2 and 3.

Our analytical contribution is the first theoretical guarantee establishing the correctness of alternating minimization (with resampling) in recovering the underlying signal for the phase retrieval problem.

Besides being an empirically better algorithm for this problem, our work is also interesting in a broader sense: there are many problems in machine learning, signal procesing and numerical linear algebra, where the natural formulation of a problem is non-convex; examples include rank constrained problems, applications of EM algorithms etc., and alternating minimization has good empirical performance. However, the methods with the best (or only) analytical guarantees involve convex relaxations (e.g., by relaxing the rank constraint and penalizing the trace norm). In most of these settings, correctness of alternating minimization is an open question. We believe that our results in this paper are of interest, and may have implications, in this larger context.

Difference from standard alternating minimization: The algorithm we analyze in this paper uses different measurements in each iteration and differs from standard alternating minimization approaches in this context, where same measurements are used in each iteration. Since our algorithm decays the error at a geometric rate, an error of $\epsilon$ requires $O\left({\log(1/\epsilon)}\right)$ iterations, increasing the total number of measurements by this factor. Theoretically, this is still competitive with convex optimization approaches under computational constraints. Indeed, for a $\textrm{poly}(n)$ run time, the best known bounds for phase retrieval via convex optimization can guarantee an accuracy of $1/\textrm{poly}(n)$ . For an accuracy of $\epsilon=1/\textrm{poly}(n)$ , the use of different samples in different iterations of our algorithm contributes an extra factor of just $O\left({\log n}\right)$ . Nevertheless, throwing away samples (as our algorithm does) is simply not a viable option in many practical settings. In fact, we empirically observe that using the same samples in all iterations performs significantly better than using different samples in each iteration (indeed, for our numerical experiments, we use the same samples in each iteration). Subsequent to our work, Candès et al. proposed a non-convex iterative algorithm based on Wirtinger flow, that uses same samples in each iteration, and show that it converges to the true underlying vector. See Section 1.1 for more details. The rest of the paper is organized as follows: In section 1.1, we briefly review related work. We clarify our notation in Section 2. We present our algorithm in Section 3 and the main results in Section 4. We present our results for the sparse case in Section 5. Finally, we present experimental results in Section 6.

Phase Retrieval via Non-Convex Procedures: Inspite of the huge amount of work it has attracted, phase retrieval has been a long standing open problem. Early work in this area focused on using holography to capture the phase information along with magnitude measurements . However, computational methods for reconstruction of the signal using only magnitude measurements received a lot of attention due to their applicability in resolving spurious noise, fringes, optical system aberrations and so on and difficulties in the implementation of interferometer setups . Though such methods have been developed to solve this problem in various practical settings , our theoretical understanding of this problem is still far from complete. Many papers have focused on determining conditions under which (1) has a unique solution. However, the uniqueness results of these papers do not resolve the algorithmic question of how to find the solution to (1).

Since the seminal work of Gerchberg and Saxton and Fienup , many iterated projection algorithms have been developed targeted towards various applications . first suggested the use of multiple magnitude measurements to resolve the phase problem. This approach has been successfully used in many practical applications - see and references there in. Following the empirical success of these algorithms, researchers were able to explain its success in some of the instances using Bregman’s theory of iterated projections onto convex sets . However, many instances, such as the one we consider in this paper, are out of reach of this theory since they involve magnitude constraints which are non-convex. To the best of our knowledge, there are no theoretical results on the convergence of these approaches in a non-convex setting.

Subsequent to our work, Candès et al. proposed an iterative algorithm based on Wirtinger flow which is similar to optimizing a non-convex function using gradient descent. Despite using same samples, they manage to show that their algorithm recovers the true underlying vector for Gaussian measurements, albeit with a slow convergence rate. Quite interestingly, they also show that if the initial point is $O\left({\frac{1}{\sqrt{n}}}\right)$ close to the true vector (which can be achieved by using a small amount of resampling), their algorithm (using same samples) achieves exact recovery for Gaussian measurements as well as coded diffraction measurements (which are practically more relevant than Gaussian measurements), with a fast convergence rate matching that of our algorithm. It has also been reported that the Wirtinger flow algorithm has better properties than alternating minimization in some optics settings .

Phase Retrieval via Convex Relaxation: An interesting recent approach for solving this problem formulates it as one of finding the rank-one solution to a system of linear matrix equations. The papers then take the approach of relaxing the rank constraint by a trace norm penalty, making the overall algorithm a convex program (called PhaseLift) over $n\times n$ matrices. Another recent line of work takes a similar but different approach : it uses an SDP relaxation (called PhaseCut) that is inspired by the classical SDP relaxation for the max-cut problem. To date, these convex methods are the only ones with analytical guarantees on statistical performance (i.e. the number $m$ of measurements required to recover $\mathbf{x^{*}}$ ) . However, by “lifting” a vector problem to a matrix one, these methods lead to a much larger representation of the state space, and higher computational cost as a result.

Measurement Schemes: Earlier results on PhaseLift and PhaseCut assumed an i.i.d. random Gaussian model on the measurement vectors $\mathbf{a}_{i}$ . extends these results for PhaseLift for measurement schemes known as t-designs, which are more general than Gaussian measurements. Recently, establishes near-optimal statistical guarantees for PhaseLift under masked Fourier transform measurements.

Alternating Minimization (a.k.a. ALS): Alternating minimization has been successfully applied to many applications in the low-rank matrix setting. For example, clustering , sparse PCA , non-negative matrix factorization , signed network prediction etc. However, despite empirical success, for most of the problems, there are no theoretical guarantees regarding its convergence except to a local minimum. Of late, however, there has been a spurt of work in obtaining provable guarantees for alternating minimization in various settings such as learning sparsely used dictionaries , matrix completion , robust PCA etc. Though earlier results for matrix completion use heavy resampling, subsequent work has obtained similar results with a small amount of resampling.

There has also been some work on designing other non convex optimization algorithms, such as gradient descent for solving some of these problems. For instance, propose a gradient descent algorithm on the Grassmanian manifold to solve the matrix completion problem.

Notation

Algorithm

where $\mathbf{C}^{*}\stackrel{{\scriptstyle\textrm{def}}}{{=}}\mbox{Diag}(\mathbf{c}^{*})$ is the diagonal matrix of phases. Of course we do not know $\mathbf{C}^{*}$ , hence one approach to recovering $\mathbf{x^{*}}$ is to solve:

While the above algorithm is simple and intuitive, it is known that with bad initial points, the solution might not converge to $\mathbf{x^{*}}$ . In fact, this algorithm with a uniformly random initial point has been empirically evaluated for example in , where it performs worse than SDP based methods. Moreover, since the underlying problem is non-convex, standard analysis techniques fail to guarantee convergence to the global optimum, $\mathbf{x^{*}}$ . Hence, the key challenges here are: a) a good initialization step for this method, b) establishing this method’s convergence to $\mathbf{x^{*}}$ .

We address the first key challenge in our AltMinPhase algorithm (Algorithm 1) by initializing $\mathbf{x}$ as the largest singular vector of the matrix $\mathbf{S}=\frac{1}{m}\sum_{i}y_{i}^{2}\mathbf{a_{i}}\mathbf{a_{i}}^{T}$ . This is similar to the initialization in for the matrix completion problem. Theorem 4.1 shows that when $\mathbf{A}$ is sampled from standard complex normal distribution, this initialization is accurate. In particular, if $m\geq C_{1}n\log^{3}n$ for large enough $C_{1}>0$ , then whp we have $\|\mathbf{x^{0}}-\mathbf{x^{*}}\|_{2}\leq 1/100$ (or any other constant).

Theorem 4.2 addresses the second key challenge and shows that a variant of AltMinPhase (see Algorithm 2) actually converges to the global optimum $\mathbf{x^{*}}$ at linear rate. See section 4 for a detailed analysis of our algorithm.

We would like to stress that not only does a natural variant of our proposed algorithm have rigorous theoretical guarantees, it also is effective practically as each of its iterations is fast, has a closed form solution and does not require SVD computation. AltMinPhase has similar statistical complexity to that of PhaseLift and PhaseCut while being much more efficient computationally. In particular, for accuracy $\epsilon$ , we only need to solve each least squares problem only up to accuracy $O\left({\epsilon^{2}}\right)$ . Since the measurement matrix $A$ is Gaussian with $m>Cn$ , it is well conditioned. This means that each such step takes $O\left({mn\log\frac{1}{\epsilon}}\right)$ time using the conjugate gradient method. When $m=O\left({n}\right)$ and we have geometric convergence, the total time taken by the algorithm is $O\left({n^{2}\log^{2}\frac{1}{\epsilon}}\right)$ . SDP based methods on the other hand require $\Omega(n^{3}/\sqrt{\epsilon})$ time. Moreover, our initialization step increases the likelihood of successful recovery as opposed to a random initialization (which has been considered so far in prior work). Refer Figure 1 for an empirical validation of these claims.

A key drawback of our results, however, is the use of resampling. More specifically, our convergence guarantee is obtained for a variant of Algorithm 1 (see Algorithm 2), where we use different samples in each iteration. In practice, this is not feasible since in many applications, taking so many measurements may not be possible. On the other hand, the SDP approaches and a recent non-convex optimization approach do not face this issue. See Section 1 for more details on this aspect.

Main Results: Analysis

In this section we describe the main contribution of this paper: provable statistical guarantees for the success of alternating minimization in solving the phase recovery problem. To this end, we consider the setting where each measurement vector $\mathbf{a_{i}}$ is iid and is sampled from the standard complex normal distribution. We would like to stress that all the existing guarantees for phase recovery also use exactly the same setting . Table 1 presents a comparison of the theoretical guarantees of Algorithm 2 as compared to PhaseLift and PhaseCut.

Our proof for convergence of alternating minimization can be broken into two key results. We first show that if $m\geq Cn\log^{3}n$ , then whp the initialization step used by AltMinPhase returns $\mathbf{x^{0}}$ which is at most a constant distance away from $\mathbf{x^{*}}$ . Furthermore, that constant can be controlled by using more samples (see Theorem 4.1).

We now present the two results mentioned above. For our proofs, wlog, we assume that $\|\mathbf{x^{*}}\|_{2}=1$ .

Our first result guarantees a good initial vector.

There exists a constant $C_{1}$ such that if $m>\frac{C_{1}}{c^{2}}n\log^{3}n$ , then in Algorithm 2, with probability greater than $1-4/m^{2}$ we have:

that is, $\mathbf{\mathbf{x}^{+}}$ can be viewed as a perturbation of $\mathbf{x^{*}}$ and the goal is to bound the error term (the second term above). We break the proof into two main steps:

Using the above bounds and choosing $c<\frac{1}{100c_{1}}$ , we can prove the theorem:

proving the first part of the theorem. The second part follows easily from (4) and Lemma A.2. ∎

Intuition and key challenge: If we look at step 6 of Algorithm 2, we see that, for the measurements, we use magnitudes calculated from $\mathbf{x^{*}}$ and phases calculated from $\mathbf{x}$ . Intuitively, this means that we are trying to push $\mathbf{\mathbf{x}^{+}}$ towards $\mathbf{x^{*}}$ (since we use its magnitudes) and $\mathbf{x}$ (since we use its phases) at the same time. The key intuition behind the success of this procedure is that the push towards $\mathbf{x^{*}}$ is stronger than the push towards $\mathbf{x}$ , when $\mathbf{x}$ is close to $\mathbf{x^{*}}$ . The key lemma that captures this effect is stated below:

See Appendix A for a proof of the above lemma and how we use it to prove Theorem 4.2. Combining Theorems 4.1 and 4.2, we can establish the correctness of Algorithm 2.

Suppose the measurement vectors in (1) are independent standard complex normal vectors. There exists a constant $c$ such that if ${m>cn\log n\left(\log^{2}n+\log\frac{1}{\epsilon}\log\log\frac{1}{\epsilon}\right)}$ then, with probability greater than $1-\frac{1}{n}$ , Algorithm 2 outputs $\mathbf{x^{t_{0}}}$ such that $\|\mathbf{x^{t_{0}}}-\mathbf{x^{*}}\|_{2}<\epsilon$ , for some global phase choice of $\mathbf{x^{*}}$ .

Sparse Phase Retrieval

In this section, we consider the case where $\mathbf{x^{*}}$ is known to be sparse, with sparsity $k$ . A natural and practical question to ask here is: can the sample and computational complexity of the recovery algorithm be improved when $k\ll n$ .

The following lemma shows that if the number of measurements is large enough, step 1 of SparseAltMinPhase recovers the support of $\mathbf{x^{*}}$ correctly.

Suppose $\mathbf{x^{*}}$ is $k$ -sparse with support $S$ and $\left\|{\mathbf{x^{*}}}\right\|_{2}=1$ . If $\mathbf{a_{i}}$ are standard complex Gaussian random vectors and $m>\frac{c}{\left(x_{\textrm{min}}^{*}\right)^{4}}\log\frac{n}{\delta}$ , then Algorithm 3 recovers $S$ with probability greater than $1-\delta$ , where $x_{\textrm{min}}^{*}$ is the minimum non-zero entry of $\mathbf{x^{*}}$ .

The key step of our proof is to show that if $j\in supp(\mathbf{x^{*}})$ , then random variable $Z_{ij}=\sum_{i}|a_{ij}y_{i}|$ has significantly higher mean than for the case when $j\notin supp(\mathbf{x^{*}})$ . Now, by applying appropriate concentration bounds, we can ensure that $\min_{j\in supp(\mathbf{x^{*}})}|Z_{ij}|>\max_{j\notin supp(\mathbf{x^{*}})}|Z_{ij}|$ and hence our algorithm never picks up an element outside the true support set $supp(\mathbf{x^{*}})$ . See Appendix B for a detailed proof of the above lemma.

The correctness of Algorithm 3 now is a direct consequence of Lemma 5.1 and Theorem 4.4. For the special case where each non-zero value in $x^{*}$ is from $\{-\frac{1}{\sqrt{k}},\frac{1}{\sqrt{k}}\}$ , we have the following corollary:

Suppose $\mathbf{x^{*}}$ is $k$ -sparse with non-zero elements $\pm\frac{1}{\sqrt{k}}$ . If the number of measurements $m>c\log{n}\left(k^{2}+k\log^{2}k+k\log\frac{1}{\epsilon}\right)$ , then Algorithm 3 will recover $x^{*}$ up to accuracy $\epsilon$ with probability greater than $1-\frac{1}{n}$ .

Experiments

In this section, we present experimental evaluation of AltMinPhase (Algorithm 1) and compare its performance with the SDP based methods PhaseLift and PhaseCut . We also empirically demonstrate the advantage of our initialization procedure over random initialization (denoted by AltMin (random init)), which has thus far been considered in the literature . AltMin (random init) is the same as AltMinPhase except that step 1 of Algorithm 1 is replaced with: $\mathbf{x^{0}}\leftarrow$ Uniformly random vector from the unit sphere.

In the noiseless setting, a trial is said to succeed if the output $\mathbf{x}$ satisfies $\left\|{\mathbf{x}-\mathbf{x^{*}}}\right\|_{2}<10^{-2}$ . For a given dimension, we do a linear search for smallest $m$ (number of samples) such that empirical success ratio over $20$ runs is at least $0.8$ . We implemented our methods in Matlab, while we obtained the code for PhaseLift and PhaseCut from the authors of and respectively.

We now present results from our experiments in three different settings.

Independent Random Gaussian Measurements: Each measurement vector $\mathbf{a_{i}}$ is generated from the standard complex Gaussian distribution. This measurement scheme was first suggested by as a first step to obtain a theoretical understanding of the problem.

Multiple Random Illumination Filters: We now present our results for the setting where the measurements are obtained using multiple illumination filters; this setting was suggested by . In particular, choose $J$ vectors $\mathbf{z^{(1)}},\cdots,\mathbf{z^{(J)}}$ and compute the following discrete Fourier transforms:

where $\cdot*$ denotes component-wise multiplication. Our measurements will then be the magnitudes of components of the vectors $\mathbf{\widehat{x}^{(1)}},\cdots,\mathbf{\widehat{x}^{(J)}}$ . Note that this gives a total of $Jn$ measurements. The above measurement scheme can be implemented by modulating the light beam or by the use of masks; see for more details.

For this setting, we conduct a similar set of experiments as the previous setting. That is, we vary dimensionality of the true signal $\mathbf{z^{(u)}}$ (generated from the Gaussian distribution)and then empirically determine measurement and computational cost of each algorithm. Figures 2 (a) and (b) present our experimental results for this measurement scheme. Here again, we make similar observations as the last setting. That is, the measurement complexity of AltMinPhase is similar to PhaseCut and PhaseLift, but AltMinPhase is orders of magnitude faster than PhaseLift and PhaseCut. Note that Figure 2 is on a log-scale.

Noisy Phase Retrieval: Finally, we study our method in the following noisy measurement scheme:

Geometric Decay: Finally, we provide empirical results verifying that AltMinPhase reduces the error at a geometric rate as guaranteed by Theorem 4.2 but no faster. The measurement vectors were chosen to be standard complex Gaussian with $n=64$ and $m=6n$ . Figure 3(b) shows the plot of empirical error vs the number of iterations.

Acknowledgment

S. Sanghavi would like to acknowledge support from NSF grants 1302435 and 0954059.

References

Appendix A Proofs for Section 4

Proof of Step (1): We now complete our proof by proving (1). To this end, we use the following matrix concentration result from :

A.2 Proof of per step reduction in error

In all the lemmas in this section, $\delta$ is a small numerical constant (can be taken to be $0.01$ ).

Using (4) and the fact that $\|\mathbf{x^{*}}\|_{2}=1$ , $\mathbf{x}^{*T}\mathbf{x}^{+}=1+\mathbf{x}^{*T}\left(\mathbf{A}\mathbf{A}^{T}\right)^{-1}\mathbf{A}\left(\mathbf{D}-\mathbf{I}\right)\mathbf{A}^{T}\mathbf{x^{*}}$ . That is, $|\mathbf{x}^{*T}\mathbf{x}^{+}|\geq 1-\|\left(\frac{1}{2m}\mathbf{A}\mathbf{A}^{T}\right)^{-1}\|_{2}\|\frac{1}{\sqrt{2m}}A\|_{2}\|\frac{1}{\sqrt{2m}}\left(\mathbf{D}-\mathbf{I}\right)\mathbf{A}^{T}\mathbf{x^{*}}\|_{2}$ . Assuming $m>\widehat{c}\log\frac{1}{\eta}n$ , Standard results in random matrix theory tell us that $\left\|{\frac{1}{2m}\mathbf{A}\mathbf{A}^{T}-\mathbf{I}}\right\|_{2}<\frac{1}{\sqrt{\widehat{c}}}$ , wp $\geq 1-\frac{\eta}{10}$ . This means that $\|\left(\frac{1}{2m}\mathbf{A}\mathbf{A}^{T}\right)^{-1}\|_{2}\leq 1/(1-2/\sqrt{\widehat{c}})^{2}$ and $\|\mathbf{A}\|_{2}\leq 1+2/\sqrt{\widehat{c}}$ . Note that both the quantities can be bounded by constants that are close to $1$ by selecting a large enough $\widehat{c}$ . Also note that $\frac{1}{2m}\mathbf{A}\mathbf{A}^{T}$ converges to $\mathbf{I}$ (the identity matrix), or equivalently $\frac{1}{m}\mathbf{A}\mathbf{A}^{T}$ converges to $2\mathbf{I}$ since the elements of $A$ are standard normal complex random variables and not standard normal real random variables.

Show that $U_{l}$ is a subexponential random variable and use that fact to derive concentration bounds.

where $(\zeta_{1})$ uses Lemma A.7 and $(\zeta_{2})$ , the fact that $a_{2l}$ is a sub-gaussian random variable. This means:

Using this, we have the following bound on the expected value of $U_{l}$ :

From (8), we see that $U_{l}$ is a subexponential random variable with parameter $c\left(1-\alpha^{2}\right)$ . Using Proposition 5.16 from , we obtain:

where the last step uses Lemma A.2. Similarly,

Using (9), (10), (11) along with Lemmas A.5 and A.6, we see that for a fixed $z$ , we have:

with probability greater than $1-\frac{\eta}{10}\exp(-cn)$ .

So far we have proved the result only for a fixed vector $z$ . We now use a covering and union bound argument to extend this result for every $z$ that is orthogonal to $\mathbf{x^{*}}$ .

Union bound argument: Construct an $\epsilon$ -net $S$ for unit vectors in the $(n-1)$ -dimensional space that is orthogonal to $\mathbf{x^{*}}$ . Using standard results (see e.g., Chap. 13 of ), we know that the size of $S$ can be chosen to be $\left(\frac{1}{\epsilon}\right)^{O\left({n}\right)}$ . We choose $\epsilon=1/100$ , and hence the size of $S$ is $\exp\left({cn}\right)$ , for some fixed constant $c$ . Applying (12) for every $\mathbf{z}\in S$ , and taking a union bound, we obtain:

with probability greater than $1-\frac{\eta}{10}\exp(-n)$ .

Now choose a unit vector $\mathbf{\widehat{z}}$ that is orthogonal to $\mathbf{x^{*}}$ (but is not necessarily in S), that maximizes $\left|\langle\mathbf{\widehat{z}},\mathbf{x}^{+}\rangle\right|$ . In other words, $\mathbf{\widehat{z}}$ is such that

Since $S$ is a $\frac{1}{100}$ -net of the orthogonal space to $\mathbf{x^{*}}$ , we know that there is a $\mathbf{z}\in S$ such that $\left\|{\mathbf{z}-\mathbf{\widehat{z}}}\right\|_{2}<\frac{1}{100}$ . So, we have:

where $(\zeta_{1})$ follows from (13) and $(\zeta_{2})$ follows from (14). This means that

Recalling the choice of $\mathbf{\widehat{z}}$ from (14) finishes the proof. ∎

Assume the hypothesis of Theorem 4.2 and the notation therein. Then,

with probability greater than $1-\frac{\eta}{10}e^{-n}$ .

where the last step follows from the fact that $a^{\prime}_{2l}$ is a subgaussian random variable and hence $\left|a^{\prime}_{2l}\right|^{2}$ is a subexponential random variable. Using Proposition 5.16 from , we obtain:

Choosing $\delta=\frac{1}{99}$ and using Lemma 4.3, we obtain:

with probability greater than $1-\frac{\eta}{10}\exp(-n)$ . This proves the lemma. ∎

Let $w_{2}=\left|w_{2}\right|e^{i\theta}$ . Then $\left|w_{1}\right|,\left|w_{2}\right|$ and $\theta$ are all independent random variables. $\theta$ is a uniform random variable over $[-\pi,\pi]$ and $\left|w_{1}\right|$ and $\left|w_{2}\right|$ are identically distributed with probability distribution function, $p(x)=x\exp\left(-\frac{x^{2}}{2}\right)\mathds{1}_{\{x\geq 0\}}$ . We have:

We will first calculate the imaginary part of the above expectation:

since we are taking the expectation of an odd function. Focusing on the real part, we let:

We show this by calculating $F^{\prime}(0)$ and using the continuity of $F^{\prime}(\beta)$ at $\beta=0$ . We first calculate $F^{\prime}(\beta)$ as follows:

From the above, we see that $F^{\prime}(0)=\frac{1}{2}$ and (16) then follows from the continuity of $F^{\prime}(\beta)$ at $\beta=0$ . Getting back to the expected value of $U$ , we have:

where $c$ is some constant. The last step follows from standard bounds on the tail probabilities of gaussian random variables. We now bound the second term of (17) as follows:

where $(\zeta_{1})$ follows from (18), $(\zeta_{2})$ follows from the formulae for second and third absolute moments of gaussian random variables and $(\zeta_{3})$ follows from the fact that $1-\alpha^{2}<\delta$ . Plugging the above inequality in (17), we obtain:

where we used the fact that $\alpha\geq 1-\frac{\delta}{2}$ . This proves the lemma. ∎

Assume the hypothesis of Theorem 4.2 and the notation therein. Then,

with probability greater than $1-\frac{\eta}{10}e^{-n}$ .

The proof of this lemma is very similar to that of Lemma A.5. We have:

where the last step follows from the fact that $a^{\prime}_{2l}$ and $a^{\prime}_{3l}$ are independent subgaussian random variables and hence $\left|a^{\prime}_{2l}a^{\prime}_{3l}\right|$ is a subexponential random variable. Using Proposition 5.16 from , we obtain:

Choosing $\delta=\frac{1}{100}$ , we have:

with probability greater than $1-\frac{\eta}{10}\exp(-n)$ . This proves the lemma. ∎

Appendix B Proofs for Section 5

For every $j\in[n]$ and $i\in[m]$ , consider the random variable $Z_{ij}\stackrel{{\scriptstyle\textrm{def}}}{{=}}\left|a_{ij}y_{i}\right|$ . We have the following:

where the first step follows from Corollary 3.1 in and the second step follows from the Taylor series expansions of $\sqrt{1-x^{2}}$ and $\arcsin(x)$ ,

for every $j\in[n]$ , $Z_{ij}$ is a sub-exponential random variable with parameter $c=O(1)$ (since it is a product of two standard normal random variables).

Using the hypothesis of the theorem about $m$ , we have:

Applying a union bound to the above, we see that with probability greater than $1-\delta$ , there is a separation in the values of $\frac{1}{m}\sum_{i=1}^{m}Z_{ij}$ for $j\in S$ and $j\notin S$ . This proves the theorem. ∎