The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss

Introduction

The question of recovering a probability distribution from a finite set of samples is one of the most fundamental questions of statistical inference. While classically such problems have been considered in low dimension, more recently inference in high dimension has drawn significant attention in statistics and computer science literature.

In particular, an active line of investigation in theoretical computer science has dealt with the question of learning a Gaussian Mixture Model in high dimension. This line of work was started in where the first algorithm to recover parameters using a number of samples polynomial in the dimension was presented. The method relied on random projections to a low dimensional space and required certain separation conditions for the means of the Gaussians. Significant work was done in order to weaken the separation conditions and to generalize the result (see e.g., ). Much of this work has polynomial sample and time complexity but requires strong separation conditions on the Gaussian components. A completion of the attempts to weaken the separation conditions was achieved in and , where it was shown that arbitrarily small separation was sufficient for learning a general mixture with a fixed number of components in polynomial time. Moreover, a one-dimensional example given in showed that an exponential dependence on the number of components was unavoidable unless strong separation requirements were imposed. Thus the question of polynomial learnability appeared to be settled. It is worth noting that while quite different in many aspects, all of these papers used a general scheme similar to that in the original work by reducing high-dimensional inference to a small number of low-dimensional problems through appropriate projections.

However, a surprising result was recently proved in . The authors showed that a mixture of dd Gaussians in dimension dd could be learned using a polynomial number of samples, assuming a non-degeneracy condition on the configuration of the means. The result in is inherently high-dimensional as that condition is never satisfied when the means belong to a lower-dimensional space. Thus the problem of learning a mixture gets progressively computationally easier as the dimension increases, a “blessing of dimensionality!” It is important to note that this was quite different from much of the previous work, which had primarily used projections to lower-dimension spaces.

Still, there remained a large gap between the worst case impossibility of efficiently learning more than a fixed number of Gaussians in low dimension and the situation when the number of components is equal to the dimension. Moreover, it was not completely clear whether the underlying problem was genuinely easier in high dimension or our algorithms in low dimension were suboptimal. The one-dimensional example in cannot answer this question as it is a specific worst-case scenario, which can be potentially ruled out by some genericity condition.

In our paper we take a step to eliminate this gap by showing that even very large mixtures of Gaussians can be polynomially learned. More precisely, we show that a mixture of mm Gaussians with equal known covariance can be polynomially learned as long as mm is bounded from above by a polynomial of the dimension nn and a certain more complex non-degeneracy condition for the means is satisfied. We show that if nn is high enough, these non-degeneracy conditions are generic in the smoothed complexity sense. Thus for any fixed dd, O(nd)O(n^{d}) generic Gaussians can be polynomially learned in dimension nn.

Further, we prove that no such condition can exist in low dimension. A measure of non-degeneracy must be monotone in the sense that adding Gaussian components must make the condition number worse. However, we show that for k2k^{2} points uniformly sampled from $thereare(withhighprobability)twomixturesofunitGaussianswithmeansonnonintersectingsubsetsofthesepoints,whosethere are (with high probability) two mixtures of unit Gaussians with means on non-intersecting subsets of these points, whoseL^{1}distanceisdistance isO^{*}(e^{-{k}})andwhicharethusnotpolynomiallyidentifiable.Moregenerally,indimensionand which are thus not polynomially identifiable. More generally, in dimensionnthedistancebecomesthe distance becomesO^{*}(e^{-\sqrt[n]{k}})$. That is, the conditioning improves as the dimension increases, which is consistent with our algorithmic results.

To summarize, our contributions are as follows:

We show that for any qq, a mixture of nqn^{q} Gaussians in dimension nn can be learned in time and number of samples polynomial in nn and a certain “condition number” σ\sigma. We show that if the dimension is sufficiently high, this results in an algorithm polynomial from the smoothed analysis point of view (Theorem 1). To do that we provide smoothed analysis of the condition number using certain results from and anti-concentration inequalities. The main technical ingredient of the algorithm is a new “Poissonization” technique to reduce Gaussian mixture estimation to a problem of recovering a linear map of a product distribution known as underdetermined Independent Component Analysis (ICA). We combine this with the recent work on efficient algorithms for underdetermined ICA from to obtain the necessary bounds.

We show that in low dimension polynomial identifiability fails in a certain generic sense (see Theorem 3). Thus the efficiency of our main algorithm is truly a consequence of the ”blessing of dimensionality” and no comparable algorithm exists in low dimension. The analysis is based on results from approximation theory and Reproducing Kernel Hilbert Spaces.

Moreover, we combine the approximation theory results with the Poissonization-based technique to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined Independent Component Analysis in low dimension. To the best of our knowledge, this is the first such result in the literature.

We discuss our main contributions more formally now. The notion of Khatri–Rao power AdA^{\odot d} of a matrix AA is defined in Section 2.

where wmaxi(wi)/mini(wi)w\geq\max_{i}(w_{i})/\min_{i}(w_{i}), umaxiμiu\geq\max_{i}\left\|\mu_{i}\right\|, r\geq\big{(}\max_{i}\left\|\mu_{i}\right\|+1)/(\min_{i}\left\|\mu_{i}\right\|)\big{)}, 0<bσm(Bd/2)0<b\leq\sigma_{m}(B^{\odot d/2}) are bounds provided to the algorithm, and σ=λmax(Σ)\sigma=\sqrt{\lambda_{\max}(\Sigma)}.

Given that the means have been estimated, the weights can be recovered using the tensor structure of higher order cumulants (see Section 2 for the definition of cumulants). This is shown in Appendix I.

We show that σmin(Ad)\sigma_{\min}(A^{\odot d}) is large in the smoothed analysis sense, namely, if we start with a base matrix AA and perturb each entry randomly to get AA^{\prime}, then σmin(Ad)\sigma_{\min}(A^{\odot d}) is likely to be large. More precisely,

Finally, in Section 6 we show that in low dimension the situation is very different from the high-dimensional generic efficiency given by Theorems 1 and 2: The problem is generically hard. More precisely, we show:

Let XX be a set of k2k^{2} points uniformly sampled from n^{n}. Then with high probability there exist two mixtures with equal number of unit Gaussians pp, qq centered on disjoint subsets of XX, such that, for some C>0C>0,

Combining the above lower bound with our reduction provides a similar lower bound for ICA; see a discussion on the connection with ICA below. Our lower bound gives an information-theoretic barrier. This is in contrast to conjectured computational barriers that arise in related settings based on the noisy parity problem (see for pointers). The only previous information-theoretic lower bound for learning GMMs we are aware of is due to and holds for two specially designed one-dimensional mixtures.

A key observation of is that methods based on the higher order statistics used in Independent Component Analysis (ICA) can be adapted to the setting of learning a Gaussian Mixture Model. In ICA, samples are of the form X=i=1mAiSiX=\sum_{i=1}^{m}A_{i}S_{i} where the latent random variables SiS_{i} are independent, and the column vectors AiA_{i} give the directions in which each signal SiS_{i} acts. The goal is to recover the vectors AiA_{i} up to inherent ambiguities. The ICA problem is typically posed when mm is at most the dimensionality of the observed space (the “fully determined” setting), as recovery of the directions AiA_{i} then allows one to demix the latent signals. The case where the number of latent source signals exceeds the dimensionality of the observed signal XX is the underdetermined ICA setting.See [12, Chapter 9] for a recent account of algorithms for underdetermined ICA. Two well-known algorithms for underdetermined ICA are given in and . Finally, provides an algorithm with rigorous polynomial time and sampling bounds for underdetermined ICA in high dimension in the presence of Gaussian noise.

Nevertheless, our analysis of the mixture models can be embedded in ICA to show exponential information-theoretic hardness of performing ICA in low-dimension, and thus establishing the blessing of dimensionality for ICA as well.

Let XX be a set of k2k^{2} random nn-dimensional unit vectors. Then with high probability, there exist two disjoint subsets of XX, such that when these two sets form the columns of matrices AA and BB respectively, there exist noisy ICA models AS+ηAS+\eta and BS+ηBS^{\prime}+\eta^{\prime} which are exponentially close as distributions in L1L^{1} distance and satisfying: (1) The coordinate random variables of SS and SS^{\prime} are scaled Poisson random variables. For at least one coordinate random variable, Si=αXS_{i}=\alpha X, where XPoisson(λ)X\sim\mathsf{Poisson}(\lambda) is such that α\alpha and λ\lambda are polynomially bounded away from 0. (2) The Gaussian noises η\eta and η\eta^{\prime} have polynomially bounded directional covariances.

We sketch the proof of Theorem 4 in Appendix G.

Discussion.

Most problems become harder in high dimension, often exponentially harder, a behavior known as “the curse of dimensionality.” Showing that a complex problem does not become exponentially harder often constitutes major progress in its understanding. In this work we demonstrate a reversal of this curse, showing that the lower dimensional instances are exponentially harder than those in high dimension. This seems to be a rare situation in statistical inference and computation. In particular, while high-dimensional concentration of mass can sometimes be a blessing of dimensionality, in our case the generic computational efficiency of our problem comes from anti-concentration.

We hope that this work will enable better understanding of this unusual phenomenon and its applicability to a wider class of computational and statistical problems.

Preliminaries

In this formulation, eh\mathbf{e}_{h} acts as a selector of a Gaussian mean. Conditioning on h=ih=i, we have ZN(μi,Σ)Z\sim\mathcal{N}(\mu_{i},\Sigma), which is consistent with the GMM model.

Given samples from the GMM, the goal is to recover the unknown parameters of the GMM, namely the means μ1,,μm\mu_{1},\dots,\mu_{m} and the weights w1,,wmw_{1},\dots,w_{m}.

Underdetermined ICA.

The ICA algorithm from to which we will be reducing learning a GMM relies on the shared tensor structure of the derivatives of the second characteristic function and the higher order multi-variate cumulants. This tensor structure motivates the following form of the Khatri-Rao product:

This form of the Khatri-Rao product arises when performing a change of coordinates under the ICA model using either higher order cumulants or higher order derivative tensors of the second characteristic function.

ICA Results.

Theorem H.40 (Appendix H.1, from ) allows us to recover AA up to the necessary ambiguities in the noisy ICA setting. The theorem establishes guarantees for an algorithm from for noisy underdetermined ICA, UnderdeterminedICA. This algorithm takes as input a tensor order parameter dd, number of signals mm, access to samples according to the noisy underdetermined ICA model with unknown noise, accuracy parameter ϵ\epsilon, confidence parameter δ\delta, bounds on moments and cumulants MM and Δ\Delta, a bound on the conditioning parameter σm\sigma_{m}, and a bound on the cumulant order kk. It returns approximations to the columns of AA up to sign and permutation.

Learning GMM means using underdetermined ICA: The basic idea

In this section we give an informal outline of the proof of our main result, namely learning the means of the components in GMMs via reduction to the underdetermined ICA problem. Our reduction will be discussed in two parts. The first part gives the main idea of the reduction and will demonstrate how to recover the means μi\mu_{i} up to their norms and signs, i.e. we will get ±μi/μi\pm\mu_{i}/\left\|\mu_{i}\right\|. We will then present the reduction in full. It combines the basic reduction with some preprocessing of the data to recover the μi\mu_{i}’s themselves. The reduction relies on some well-known properties of the Poisson distribution stated in the lemma below; its proof can be found in Appendix B.

Recall the GMM from equation (1) is given by Z=[μ1μm]eh+ηZ=[\mu_{1}|\cdots|\mu_{m}]\mathbf{e}_{h}+\eta. Henceforth, we will set A=[μ1μm]A=[\mu_{1}|\cdots|\mu_{m}]. We can write the GMM in the form Z=Aeh+ηZ=A\mathbf{e}_{h}+\eta, which is similar in form to the noisy ICA model, except that eh\mathbf{e}_{h} does not have independent coordinates. We now describe how a single sample of an approximate noisy ICA problem is generated.

The reduction involves two internal parameters λ\lambda and τ\tau that we will set later. We generate a Poisson random variable RPoisson(λ)R\sim\mathsf{Poisson}(\lambda), and we run the following experiment RR times: At the iith step, generate sample ZiZ_{i} from the GMM. Output the sum of the outcomes of these experiments: Y=Z1++ZRY=Z_{1}+\cdots+Z_{R}.

Let SiS_{i} be the random variable denoting the number of times samples were taken from the iith Gaussian component in the above experiment. Thus, S1++Sm=RS_{1}+\cdots+S_{m}=R. Note that S1,,SmS_{1},\dots,S_{m} are not observable although we know their sum. By Lemma 5, each SiS_{i} has distribution Poisson(wiλ)\mathsf{Poisson}(w_{i}\lambda), and the random variables SiS_{i} are mutually independent. Let S:=(S1,,Sm)TS:=(S_{1},\dots,S_{m})^{T}.

For a non-negative integer tt, we define η(t):=i=1tηi\eta(t):=\sum_{i=1}^{t}\eta_{i} where the ηi\eta_{i} are iid according to ηiN(0,Σ)\eta_{i}\sim\mathcal{N}(0,\Sigma). In this definition, tt can be a random variable, in which case the ηi\eta_{i} are sampled independent of tt. Using \sim to indicate that two random variables have the same distribution, then YAS+η(R)Y\sim AS+\eta(R). If there were no Gaussian noise in the GMM (i.e. if we were sampling from a discrete set of points) then the model becomes simply Y=ASY=AS, which is the ICA model without noise, and so we could recover AA up to necessary ambiguities. However, the model YAS+η(R)Y\sim AS+\eta(R) fails to satisfy even the assumptions of the noisy ICA model, both because η(R)\eta(R) is not independent of SS and because η(R)\eta(R) is not distributed as a Gaussian random vector.

As the covariance of the additive Gaussian noise is known, we may add additional noise to the samples of YY to obtain a good approximation of the noisy ICA model. Parameter τ\tau, the second parameter of the reduction, is chosen so that with high probability we have RτR\leq\tau. Conditioning on the event RτR\leq\tau we draw XX according to the rule X=Y+η(τR)AS+η(R)+η(τR)X=Y+\eta(\tau-R)\sim AS+\eta(R)+\eta(\tau-R), where η(R)\eta(R), η(τR)\eta(\tau-R), and SS are drawn independently conditioned on RR. Then, conditioned on RτR\leq\tau, we have XAS+η(τ)X\sim AS+\eta(\tau).

Note that we have only created an approximation to the ICA model. In particular, restricting i=1mSi=Rτ\sum_{i=1}^{m}S_{i}=R\leq\tau can be accomplished using rejection sampling, but the coordinate random variables S1,,SmS_{1},\dots,S_{m} would no longer be independent. We have two models of interest: (1) XAS+η(τ)X\sim AS+\eta(\tau), a noisy ICA model with no restriction on R=i=1mSiR=\sum_{i=1}^{m}S_{i}, and (2) X(AS+η(τ))RτX\sim(AS+\eta(\tau))|_{R\leq\tau} the restricted model.

We are unable to produce samples from the first model, but it meets the assumptions of the noisy ICA problem. Pretending we have samples from model (1), we can apply Theorem H.40 (Appendix H.1) to recover the Gaussian means up to sign and scaling. On the other hand, we can produce samples from model (2), and depending on the choice of τ\tau, the statistical distance between models (1) and (2) can be made arbitrarily close to zero. It will be demonstrated that given an appropriate choice of τ\tau, running UnderdeterminedICA on samples from model (2) is equivalent to running UnderdeterminedICA on samples from model (1) with high probability, allowing for recovery of the Gaussian mean directions ±μi/μi\pm\mu_{i}/\left\|\mu_{i}\right\| up to some error.

Full reduction.

To be able to recover the μi\mu_{i} without sign or scaling ambiguities, we add an extra coordinate to the GMM as follows. The new means μi\mu_{i}^{\prime} are μi\mu_{i} with an additional coordinate whose value is 11 for all ii, i.e. μi:=(μiT,1)T\mu_{i}^{\prime}:=\left(\mu_{i}^{T},1\right)^{T}. Moreover, this coordinate has no noise. In other words, each Gaussian component now has an (n+1)×(n+1)(n+1)\times(n+1) covariance matrix Σ:=(Σ000)\Sigma^{\prime}:=\left(\begin{smallmatrix}\Sigma&0\\ 0&0\end{smallmatrix}\right). It is easy to construct samples from this new GMM given samples from the original: If the original samples were u1,u2u_{1},u_{2}\ldots, then the new samples are u1,u2u^{\prime}_{1},u^{\prime}_{2}\ldots where ui:=(uiT,1)Tu^{\prime}_{i}:=\left(u_{i}^{T},1\right)^{T}. The reduction proceeds similarly to the above on the new inputs.

Unlike before, we will define the ICA mixing matrix to be A^{\prime}:=\bigl{[}{\mu_{1}^{\prime}}/{\left\|\mu_{1}^{\prime}\right\|}\bigl{\lvert}\cdots\bigr{\rvert}{\mu_{m}^{\prime}}/{\left\|\mu_{m}^{\prime}\right\|}\bigr{]} such that it has unit norm columns. The role of matrix AA in the basic reduction will now be played by AA^{\prime}. Since we are normalizing the columns of AA^{\prime}, we have to scale the ICA signal SS obtained in the basic reduction to compensate for this: Define Si:=μiSiS^{\prime}_{i}:=\left\|\mu^{\prime}_{i}\right\|S_{i}. Thus, the ICA models obtained in the full reduction are:

Correctness of the Algorithm and Reduction

Subroutine 1 captures the sampling process of the reduction: Let Σ\Sigma be the covariance matrix of the GMM, λ\lambda be an integer chosen as input, and a threshold value τ\tau also computed elsewhere and provided as input. Let RPoisson(λ)R\sim\mathsf{Poisson}(\lambda). If RR is larger than τ\tau, the subroutine returns a failure notice and the calling algorithm halts immediately. A requirement, then, should be that the threshold is chosen so that the chance of failure is very small; in our case, τ\tau is chosen so that the chance of failure is half of the confidence parameter given to Algorithm 2. The subroutine then goes through the process described in the full reduction: sampling from the GMM, lifting the sample by appending a 1, then adding a lifted Gaussian so that the total noise has distribution N(0,τΣ)\mathcal{N}(0,\tau\Sigma). The resulting sample is from the model given by (3).

The bounds are used instead of actual values to allow flexibility — in the context under which the algorithm is invoked — on what the algorithm needs to succeed. However, the closer the bounds are to the actual values, the more efficient the algorithm will be.

The proof of correctness of Algorithm 2 has two main parts. For brevity, the details can be found in Appendix A. In the first part, we analyze the sample complexity of recovering the Gaussian means using UnderdeterminedICA when samples are taken from the ideal noisy ICA model (2).

In the second part, we note that we do not have access to the ideal model (2), and that we can only sample from the approximate noisy ICA model (3) using the full reduction. Choosing τ\tau appropriately, we use total variation distance to argue that with high probability, running UnderdeterminedICA with samples from the approximate noisy ICA model will produce equally valid results as running UnderdeterminedICA with samples from the ideal noisy ICA model. The total variation distance bound is explored in section A.2.

These ideas are combined in section A.3 to prove the correctness of Algorithm 2. One additional technicality arises from the implementation of Algorithm 2. Samples can be drawn from the noisy ICA model X=(AS+η(τ))RτX^{\prime}=(AS^{\prime}+\eta^{\prime}(\tau))|_{R\leq\tau} using rejection sampling on RR. In order to guarantee Algorithm 2 executes in polynomial time, when a sample of RR needs to be rejected, Algorithm 2 terminates in explicit failure. To complete the proof, we argue that with high probability, Algorithm 2 does not explicitly fail.

Smoothed Analysis

With the above notation, for any base matrix MM with dimensions as above, we have, for some absolute constant CC,

Theorem 2 follows immediately from the theorem above by noting that σmin(A2)σmin(A2)\sigma_{\min}(A^{\odot 2})\geq\sigma_{\min}(A^{\ominus 2}).

Now note that this is a quadratic polynomial in the random variables NikN_{ik}. We will apply the anticoncentration inequality of \processifversionvstdCarbery–Wright to this polynomial to conclude that the distance between the kk’th column of (M+N)2(M+N)^{\ominus 2} and the span of the rest of the columns is unlikely to be very small (see Appendix H.3 for the precise result).

Using u2=1\left\|u\right\|_{2}=1, the variance of our polynomial in (4) becomes

In our application, our random variables NikN_{ik} for i[n]i\in[n] are not standard Gaussians but are iid Gaussian with variance σ2\sigma^{2}, and our polynomial does not have unit variance. After adjusting for these differences using the estimate on the variance of PP above, Lemma H.42 gives .

Therefore, by the union bound over the choice of kk .

Now choosing ϵ=σ2/n6\epsilon=\sigma^{2}/n^{6}, Lemma H.41 gives .

We note that while the above discussion is restricted to Gaussian perturbation, the same technique would work for a much larger class of perturbations. To this end, we would require a version of the Carbery-Wright anticoncentration inequality which is applicable in more general situations. We omit such generalizations here.

The curse of low dimensionality for Gaussian mixtures

We will need some properties of the Reproducing Kernel Hilbert Space HH corresponding to the kernel KK (see [29, Chapter 10] for an introduction). In particular, we need the bound ffH\|f\|_{\infty}\leq\|f\|_{H} and the reproducing property, f(),K(x,)H=f(x),fH\langle f(\cdot),K(x,\cdot)\rangle_{H}=f(x),\forall f\in H. For a function of the form wiK(xi,x)\sum w_{i}K(x_{i},x) we have wiK(xi,x)H2=wiwjK(xi,xj){\lVert\sum w_{i}K(x_{i},x)\rVert}^{2}_{H}=\sum w_{i}w_{j}K(x_{i},x_{j}).

Let gg be any positive function with L2L_{2} norm 11 supported on n^{n} and let f=Kgf={\cal K}g. If XX has fill hh, then there exists A>0A>0 such that

Let XX and YY be any two subsets of n^{n} with fill hh. Then there exist two Gaussian mixtures pp and qq (with positive coefficients summing to one, but not necessarily the same number of components), which are centered on two disjoint subsets of XYX\cup Y and such that for some B>0B>0

To simplify the notation we assume that n=1n=1. The general case follows verbatim, except that the interval of integration, [1/h,1/h][-1/h,1/h], and its complement need to be replaced by the sphere of radius 1/h1/h and its complement respectively.

where, p1p_{1} and p2p_{2} are mixtures with positive coefficients only.

Put p1=iS1αiK(xi,x)p_{1}=\sum_{i\in S_{1}}\alpha_{i}K(x_{i},x), p2=iS2βiK(xi,x)p_{2}=\sum_{i\in S_{2}}\beta_{i}K(x_{i},x), where S1S_{1} and S2S_{2} are disjoint subsets of XYX\cup Y. Now we need to ensure that the coefficients can be normalized to sum to 11.

Let α=αi\alpha=\sum\alpha_{i}, β=βi\beta=\sum\beta_{i}. From (5) and by integrating over the interval $,andsince, and sincefisstrictlypositiveontheinterval,itiseasytoseethatis strictly positive on the interval, it is easy to see that\alpha,\beta\geq 1$. We have

Noticing that the first summand is bounded by 2hexp(Aloghh)\frac{2}{h}\exp(A\frac{\log h}{h}) and the integral in the second summand is even smaller (in fact, O(e1/h2)O(e^{-1/h^{2}})) , it follows immediately, that 1βα<exp(Aloghh)|1-\frac{\beta}{\alpha}|<\exp(A^{\prime}\frac{\log h}{h}) for some AA^{\prime} and hh sufficiently small.

Collecting exponential inequalities completes the proof.

For convenience we will use a set of 4k24k^{2} points instead of k2k^{2}. Clearly it does not affect the exponential rate.

By a simple covering set argument (cutting the cube into mnm^{n} cubes with size 1/m1/m) and basic probability, we see that the fill hh of nmnlogmnm^{n}\log m points is at most O(n/m)O(\sqrt{n}/m) with probability 1o(1)1-o(1). Hence, given kk points, we have h=O((logkk)1/n)h=O((\frac{\log k}{k})^{1/n}). We see, that with a smaller probability (but still close to 11 for large kk), we can sample kk points 3k23k^{2} times and still have the same fill.

Partitioning the set of 4k24k^{2} points into 2k2k disjoint subsets of 2k2k points and applying Theorem 6.10 (to k+kk+k points) we obtain 2k2k pairs of exponentially close mixtures with at most 2k2k components each. If one of the pairs has the same number of components, we are done. If not, by the pigeon-hole principle for at least two pairs of mixtures p1q1p_{1}\approx q_{1} and p2q2p_{2}\approx q_{2} the differences of the number of components (an integer number between and 2k12k-1) must coincide. Assume without loss of generality that p1p_{1} has no more components that q1q_{1} and p2p_{2} has no more components than q2q_{2}.Taking p=12(p1+q2)p=\frac{1}{2}(p_{1}+q_{2}) and q=12(p2+q1)q=\frac{1}{2}(p_{2}+q_{1}) completes the proof.

References

Appendix A Theorem 1 Proof Details

The proposed full reduction from Section 3 provides us with two models. The first is a noisy ICA model from which we cannot sample:

The second is a model that fails to satisfy the assumption that SS^{\prime} has independent coordinates, but it is a model from which we can sample:

Both models rely on the choice of two parameters, λ\lambda and τ\tau. The dependence on τ\tau is explicit in the models. The dependence on λ\lambda can be summarized in the unrestricted model as Si=1μiSiPoisson(wiλ)S_{i}=\frac{1}{\left\|\mu_{i}^{\prime}\right\|}S_{i}^{\prime}\sim\mathsf{Poisson}(w_{i}\lambda) independently of each other, and R=i=1mSiPoisson(λ)R=\sum_{i=1}^{m}S_{i}\sim\mathsf{Poisson}(\lambda).

The probability of choosing R>τR>\tau will be seen to be exponentially small in τ\tau. For this reason, running UnderdeterminedICA with polynomially many samples from model (6) will with high probability be equivalent to running the ICA Algorithm with samples from model (7). This notion will be made precise later using total variation distance.

For the remainder of this subsection, we proceed as if samples are drawn from the ideal noisy ICA model (6). Thus, to recover the columns of AA^{\prime}, it suffices to run UnderdeterminedICA on samples of XX^{\prime}. Theorem H.40 can be used for this analysis so long as we can obtain the necessary bounds on the cumulants of SS^{\prime}, moments of SS^{\prime}, and the moments of η(τ)\eta^{\prime}(\tau). We define wmin:=miniwiw_{\min}:=\min_{i}w_{i} and wmax:=maxiwiw_{\max}:=\max_{i}w_{i}. Then, the cumulants of SS^{\prime} are bounded by the following lemma:

By construction, Si=μiSiS^{\prime}_{i}=\left\|\mu_{i}^{\prime}\right\|S_{i}. By the homogeneity property of univariate cumulants,

The bounds on the moments of SiS^{\prime}_{i} for each ii can be computed using the following lemma:

Let YY denote a random variable drawn from Poisson(α)\mathsf{Poisson}(\alpha). It is known (see ) that

The absolute moments of Gaussian random variables are well known. For completeness, the bounds are provided in Lemma E.39 of Appendix E.

Suppose that samples of XX^{\prime} are taken from the unrestricted ICA model (3) choosing parameter λ=m\lambda=m and τ\tau a constant. Suppose that UnderdeterminedICA is run using these samples. Suppose σm(Ad/2)>0\sigma_{m}(A^{\prime\odot d/2})>0. Fix ϵ(0,1/2)\epsilon\in(0,1/2) and δ(0,1/2)\delta\in(0,1/2). Then with probability 1δ1-\delta, when the number of samples NN is:

Obtaining the sample bound is an exercise of rewriting the parameters associated with the model X=AS+η(τ)X^{\prime}=A^{\prime}S^{\prime}+\eta^{\prime}(\tau) in a way which can be used by Theorem H.40. In what follows, where new parameters are introduced without being described, they will correspond to parameters of the same name defined in and used by the statement of Theorem H.40.

it suffices that M(μmaxwmaxwmin)d+1(d+1)d+1M\geq(\left\|\mu_{\max}^{\prime}\right\|\frac{w_{\max}}{w_{\min}})^{d+1}(d+1)^{d+1}, giving a more natural condition number.

to guarantee that MM bounds all required order d+1d+1 absolute moments.

We can now apply Theorem H.40, using the parameter values k=d+1k=d+1, Δ\Delta from (9), and MM from (10). Then with probability 1δ1-\delta,

The poly bound in (11) is equivalent to the poly bound in (8).

Theorem A.17 allows us to recover the columns of AA^{\prime} up to sign. However, what we really want to recover are the means of the original Gaussian mixture model, which are the columns of AA. Recalling the correspondence between AA^{\prime} and AA laid out in section 3, the Gaussian means μ1,,μm\mu_{1},\dotsc,\mu_{m} which form the columns of AA are related to the columns μ1,,μm\mu_{1}^{\prime},\dotsc,\mu_{m}^{\prime} of AA^{\prime} by the rule μi=μi(1:n)/μi(n+1)\mu_{i}=\mu_{i}^{\prime}(1:n)/\mu_{i}^{\prime}(n+1). Using this rule, we can construct estimate the Gaussian means from the estimates of the columns of AA^{\prime}. By propagating the errors from Theorem A.17, we arrive at the following result:

Let ϵ>0\epsilon^{*}>0 (to be chosen later) give a desired bound on the errors of the columns of AA^{\prime}. Then, from Theorem A.17, using

By construction, \mu_{\max}^{\prime}=\left(\begin{array}[]{c}\mu_{\max}\\ 1\end{array}\right). By the triangle inequality,

Next, we note that \underline{A}^{\prime}=\left(\begin{array}[]{c}A\\ \mathbf{1}\end{array}\right) where 1\mathbf{1} is an all ones row vector. It follows that the rows of Ad/2A^{\odot d/2} are a strict subset of the rows of Ad/2\underline{A}^{\prime\odot d/2}. Thus,

Thus, it is sufficient to call UnderdeterminedICA with

samples to achieve the desired ϵ\epsilon^{*} accuracy on the returned estimates of the columns of AA^{\prime} with probability 1δ1-\delta.

Step 2: Error propagation.

Recall that A_{i}^{\prime}=\left(\begin{array}[]{c}\mu_{i}\\ 1\end{array}\right)\cdot\left\|\left(\begin{array}[]{c}\mu_{i}\\ 1\end{array}\right)\right\|^{-1}, making Ai(n+1)=11+μi2A_{i}^{\prime}(n+1)=\frac{1}{\sqrt{1+\left\|\mu_{i}\right\|^{2}}}. Thus,

A.2 Distance of the Sampled Model to the Ideal Model

An important part of the reduction is that the coordinates of SS are mutually independent. Without the threshold τ\tau, this is true (c.f. Lemma 5). However, without the threshold, one cannot know how to add more noise so that the total noise on each sample is iid. We show that we can choose the threshold τ\tau large enough that the samples still come from a distribution with arbitrarily small total variation distance to the one with truly independent coordinates.

Fix δ>0\delta>0. Let SPoisson(λ)S\sim\mathsf{Poisson}(\lambda) for λlnδ\lambda\geq\ln\delta. Let b=eλb=e\lambda, If τ>eλ\tau>e\lambda, τ1\tau\geq 1, and τln(1/δ)λ\tau\geq\ln(1/\delta)-\lambda, then .

By the Chernoff bound (See Theorem A.1.15 in ),

For any τ>λ\tau>\lambda, letting ϵ=τ/λ1\epsilon=\tau/\lambda-1, we get

To get , it suffices that ττlogbτlogb(δeλ)\tau-\tau\log_{b}\tau\leq\log_{b}(\delta e^{\lambda}). Note that

If ττlogbτlogb(δeλ)\tau-\tau\log_{b}\tau\leq\log_{b}\left(\delta e^{\lambda}\right), then we have

which holds for τln(1δeλ)=ln(1/δ)λ\tau\geq\ln\left(\frac{1}{\delta e^{\lambda}}\right)=\ln(1/\delta)-\lambda, giving the desired result.

By Lemma A.23 τln(N/δ)λ\tau\geq\ln(N/\delta)-\lambda implies for every ii. The union bound gives us the desired result.

It should now be easy to see that if we choose our threshold τ\tau large enough, our samples can be statistically close (See Appendix F) to ones that would come from the truly independent distribution. This claim is made formal as follows:

Fix δ>0\delta>0. Let τ>0\tau>0. Let FF be a Poisson distribution with parameter λ\lambda and have corresponding density ff. Let GG be a discrete distribution with density g(x)=f(x)/F(τ)g(x)=f(x)/F(\tau) when 0xτ0\leq x\leq\tau and 0 otherwise. Then dTV(F,G)=1F(τ)\operatorname{d}_{TV}(F,G)=1-F(\tau).

Since we are working with discrete distributions, we can write

A.3 Proof of Theorem 1

We now show that after the reduction is applied, we can use the UnderdeterminedICA routine given in to learn the GMM. Instead of requiring exact values of each parameter, we simply require a bound on each. The algorithm remains polynomial on those bounds, and hence polynomial on the true values.

Assume that after drawing samples from Subroutine 1, the signals SiS_{i} are mutually independent (as in the “ideal” model given by (2)) and the mean matrix BB satisfies σm(Bd/2)b>0\sigma_{m}(B^{\odot d/2})\geq b>0. Then by Theorem A.19, with probability of error δ1\delta_{1}, the call to UnderdeterminedICA in Algorithm 2 recovers the columns of BB to within ϵ\epsilon and up to a permutation using NN samples of complexity

Step 2

We need to show that after getting NN samples from the reduction, the resulting distribution is still close in total variation to the independent one. We will choose a new δ=δ2/(2N)\delta^{\prime}=\delta_{2}/(2N). Let RPoisson(λ)R\sim\mathsf{Poisson}(\lambda). Given δ\delta^{\prime}, Lemma A.27 shows that for τln(1/δ)λ\tau\geq\ln(1/\delta^{\prime})-\lambda, with probability 1δ1-\delta^{\prime}, RτR\leq\tau.

Take NN iid random variables X1,X2,,XNX_{1},X_{2},\dots,X_{N} from the Poisson(λ)\mathsf{Poisson}(\lambda) distribution. Let GG be a distribution given by density function g(x)=(f(x)\mathds10xτ)/F(τ)g(x)=(f(x)\mathds{1}_{0\leq x\leq\tau})/F(\tau). Let Y1,Y2,,YNY_{1},Y_{2},\dots,Y_{N} be iid random variables with distribution GG. Denote the joint distribution of the XiX_{i}’s by FF^{\prime} with density ff^{\prime}, and the joint distribution of the YiY_{i}’s as GG^{\prime} with density gg^{\prime}. By the union bound and the fact that total variation distance satisfies the triangle inequality,

Then for our choice of τ\tau, by Lemma A.23 and Lemma A.27, we have

By the same union bound argument, the probability that the algorithm fails (when R>τR>\tau) is at most δ2/2\delta_{2}/2, since it has to draw NN samples. So with high probability, the algorithm does not fail; otherwise, it still does not take more than polynomial time, and will terminate instead of returning a false result.

Step 3

We know that NN is at least a polynomial which can be written in terms of the dependence on τ\tau as p(τd2,Θ)p(\tau^{d^{2}},\Theta). This means there will be a power of τ\tau which dominates all of the τ\tau factors in pp, and in particular, will be τCd2\tau^{Cd^{2}} for some CC. It then suffices to choose CC so that p(τd2,Θ)τCd2q(Θ)Np\left(\tau^{d^{2}},\Theta\right)\leq\tau^{Cd^{2}}q(\Theta)\leq N, where

Then, with the proper choice of τ\tau (to be specified shortly), from step 2 we have

Since λ1\lambda\geq 1 it suffices to choose τ\tau so that

is enough for the desired bound on the sample size. Observe that 4(log(2/δ)+log(q(Θ)))14(\log(2/\delta)+\log(q(\Theta)))\geq 1.

An useful fact is that for general x,a,b1x,a,b\geq 1, xmax(2a,b2)x\geq\max(2a,b^{2}) satisfies xaxx/bxx^{a}\leq x^{x}/b^{x}. This captures the essence of our situation nicely. Letting eλe\lambda play the role of bb, Cd2Cd^{2} play the role of aa and xx play the role of τ\tau, to satisfy (18), it suffices that

We can see that ττ/2(eλ)2\tau^{\tau/2}\geq(e\lambda)^{2} and ττ/4τCd2\tau^{\tau/4}\geq\tau^{Cd^{2}} by construction. But we also get τ/4log(2/δ)+logq(Θ)\tau/4\geq\log(2/\delta)+\log{q(\Theta)} which implies ττ/4eτ/42δq(Θ)\tau^{\tau/4}\geq e^{\tau/4}\geq\frac{2}{\delta}q(\Theta). Thus for our choice of τ\tau, which also preserves the requirement in Step 2, there is a corresponding set of choices for NN, where the required sample size remains polynomial as

Appendix B Lemmas on the Poisson Distribution

The following lemmas are well-known; see, e.g., . We provide proofs for completeness.

The first part of the lemma (i.e., YiPoisson(piλ)Y_{i}\sim\mathsf{Poisson}(p_{i}\lambda) for all ii) follows from Lemma B.30. For the second part, let’s prove it for the binomial case (k=2k=2); the general case is similar.

Appendix C Properties of Cumulants

The following properties of multivariate cumulants are well known and are largely inherited from the definition of the cumulant generating function:

Also, given a scalar random variable ZZ, then

with symmetry implying the additive multilinear property for all other coordinates.

Appendix D Bounds on Stirling Numbers of the Second Kind

The following bound comes from [23, Theorem 3].

If n2n\geq 2 and 1rn11\leq r\leq n-1 are integers, then {nr}12(nr)rnr\genfrac{\{}{\}}{0.0pt}{}{n}{r}\leq\frac{1}{2}{n\choose r}r^{n-r}.

From this, we can derive a somewhat looser bound on the Stirling numbers of the second kind which does not depend on rr:

The Stirling number {nk}\genfrac{\{}{\}}{0.0pt}{}{n}{k} of the second kind gives a count of the number of ways of splitting a set of nn labeled objects into kk unlabeled subsets. In the case where r=nr=n, then {nr}=1\genfrac{\{}{\}}{0.0pt}{}{n}{r}=1 As n1n\geq 1, it is clear that for these choices of nn and rr, {nr}nn1\genfrac{\{}{\}}{0.0pt}{}{n}{r}\leq n^{n-1}. By the restriction 1rn1\leq r\leq n, when n=1n=1, then n=rn=r giving that {nr}=1\genfrac{\{}{\}}{0.0pt}{}{n}{r}=1. As such, the only remaining cases to consider are when n2n\geq 2 and 1rn11\leq r\leq n-1, the cases where Lemma D.34 applies.

When n2n\geq 2 and 1rn11\leq r\leq n-1, then

which is slightly stronger than the desired upper bound.

Appendix E Values of Higher Order Statistics

In this appendix, we gather together some of the explicit values for higher order statistics of the Poisson and Normal distributions required for the analysis of our reduction from learning a Gaussian Mixture Model to learning an ICA model from samples.

The absolute moments of the Gaussian random variable ηN(0,σ2)\eta\sim N(0,\sigma^{2}) are given by:

Appendix F Total Variation Distance

Total variation is a type of statistical distance metric between probability distributions. In words, the total variation between two measures is the largest difference between the measures on a single event. Clearly, this distance is bounded above by 1.

For probability measures FF and GG on a sample space Ω\Omega with sigma-algebra Σ\Sigma, the total variation is denoted and defined as:

Equivalently, when FF and GG are distribution functions having densities ff and gg, respectively,

where μ\mu is an arbitrary positive measure for which FF and GG are absolutely continuous.

More specifically, when FF and GG are discrete distributions with known densities, we can write

where we choose μ\mu that simply assigns unit measure to each atom of Ω\Omega (in this case, absolute continuity is trivial since μ(A)=0\mu(A)=0 only when AA is empty and thus F(A)F(A) must also be 0). For more discussion, one can see Definition 15.3 in and Sect. 11.6 in .

Appendix G Sketch for the proof of Theorem 4

We can use our Poissonization technique to embed difficult instances of learning GMMs into the ICA setting to prove that ICA is information-theoretically hard when the observed dimension nn is a constant using the lower bound for learning GMMs. We are not aware of any existing lower bounds in the literature for this problem. We only provide an informal outline of the argument.

Recall that Rp=iSpiR_{p}=\sum_{i}S_{pi} and Rq=iSqiR_{q}=\sum_{i}S_{qi} are Poisson distributed with parameters mpm_{p} and mqm_{q} denoting the number of columns of ApA_{p} and AqA_{q} respectively. Lemma A.23 implies that for a choice of τp\tau_{p} which is linear in mpm_{p}, the probability of a draw with Rp>τpR_{p}>\tau_{p} is exponentially small, and similarly for τq\tau_{q}. In particular, we choose τ=max(τp,τq)\tau=\max(\tau_{p},\tau_{q}) for the above ICA models.

Now since the L1L^{1} (and hence total variation) distance between pp and qq is exponentially small in k2k^{2} (upper bound on the number of components), the distance between the two resulting ICA models produced by the reduction is also exponentially small (specifically, the total variation distance between the random variables XpX_{p} and XqX_{q}). To see this, we must condition on several cases. First, conditioning either model on R>τR>\tau, we have that Pr(R>τ)\Pr(R>\tau) is exponentially small, and hence its contribution to the overall total variation distance between XpX_{p} and XqX_{q} is exponentially small. Conditioning on R=zR=z where z{0,1,,τ}z\in\{0,1,\ldots,\tau\}, then the facts that pp and qq are close in total variation distance and that total variation distance satisfies a version of the triangle inequality (that is for random variables C,D,E,FC,D,E,F, dTV(C+D,E+F)dTV(C,E)+dTV(D,F)d_{TV}(C+D,E+F)\leq d_{TV}(C,E)+d_{TV}(D,F)) imply that by viewing XpX_{p} (and similarly for XqX_{q}) as the sum of zz draws from the distribution pp and τz\tau-z draws from the additive Gaussian noise distribution, the total variation distance between XpX_{p} and XqX_{q} conditioned on R=zR=z is still exponentially small. Thus, the non-conditional distributions of XpX_{p} and XqX_{q} will be exponentially close in kk in total variation distance. In particular, the sample complexity of distinguishing between XpX_{p} and XqX_{q} is exponential in kk.

Appendix H

H.2 Rudelson-Vershynin subspace bound

where as usual σmin(A)=σmin(m,n)(A)\sigma_{\min}(A)=\sigma_{\min(m,n)}(A).

H.3 Carbery-Wright anticoncentration

The version of the anticoncentration inequality we use is explicitly given in which in turn follows immediately from :

Appendix I Recovery of Gaussian Weights

Our technique for the recovery of the Gaussian weights relies on the tensor properties of multivariate cumulants that have been used in the ICA literature.

The most theoretically justified ICA algorithms have relied on the tensor structure of multivariate cumulants, including the early, popular practical algorithm JADE . In the fully determined ICA setting in which the number source signals does not exceed the ambient dimension, the papers and demonstrate that ICA with additive Gaussian noise can be solved in polynomial time and using polynomial samples. The tensor structure of the cumulants was (to the best of our knowledge) first exploited in and later in to solve underdetermined ICA. Finally, provides an algorithm with rigorous polynomial time and sampling bounds for underdetermined ICA in the presence of Gaussian noise.

Weight recovery (main idea).

Under the basic ICA reduction (see section 3) using the Poisson distribution with parameter λ\lambda, we have that X=AS+ηX=AS+\eta is observed such that A=[μ1μm]A=[\mu_{1}|\cdots|\mu_{m}] and SiPoisson(wiλ)S_{i}\sim\mathsf{Poisson}(w_{i}\lambda). As AA has already been recovered, what remains to be recovered are the weights w1,,wmw_{1},\cdots,w_{m}. These can be recovered using the tensor structure of higher order cumulants. The critical relationship is captured by the following Lemma:

It is easily seen that the Gaussian component has no effect on the cumulant:

In particular, we have that SiPoisson(wiλ)S_{i}\sim\mathsf{Poisson}(w_{i}\lambda) with wiw_{i} the probability of sampling from the iith Gaussian. Given knowledge of AA and the cumulants of the Poisson distribution, we can recover the Gaussian weights.