The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss

Introduction

The question of recovering a probability distribution from a finite set of samples is one of the most fundamental questions of statistical inference. While classically such problems have been considered in low dimension, more recently inference in high dimension has drawn significant attention in statistics and computer science literature.

In particular, an active line of investigation in theoretical computer science has dealt with the question of learning a Gaussian Mixture Model in high dimension. This line of work was started in where the first algorithm to recover parameters using a number of samples polynomial in the dimension was presented. The method relied on random projections to a low dimensional space and required certain separation conditions for the means of the Gaussians. Significant work was done in order to weaken the separation conditions and to generalize the result (see e.g., ). Much of this work has polynomial sample and time complexity but requires strong separation conditions on the Gaussian components. A completion of the attempts to weaken the separation conditions was achieved in and , where it was shown that arbitrarily small separation was sufficient for learning a general mixture with a fixed number of components in polynomial time. Moreover, a one-dimensional example given in showed that an exponential dependence on the number of components was unavoidable unless strong separation requirements were imposed. Thus the question of polynomial learnability appeared to be settled. It is worth noting that while quite different in many aspects, all of these papers used a general scheme similar to that in the original work by reducing high-dimensional inference to a small number of low-dimensional problems through appropriate projections.

However, a surprising result was recently proved in . The authors showed that a mixture of $d$ Gaussians in dimension $d$ could be learned using a polynomial number of samples, assuming a non-degeneracy condition on the configuration of the means. The result in is inherently high-dimensional as that condition is never satisfied when the means belong to a lower-dimensional space. Thus the problem of learning a mixture gets progressively computationally easier as the dimension increases, a “blessing of dimensionality!” It is important to note that this was quite different from much of the previous work, which had primarily used projections to lower-dimensional spaces.

Still, there remained a large gap between the worst case impossibility of efficiently learning more than a fixed number of Gaussians in low dimension and the situation when the number of components is equal to the dimension. Moreover, it was not completely clear whether the underlying problem was genuinely easier in high dimension or our algorithms in low dimension were suboptimal. The one-dimensional example in cannot answer this question as it is a specific worst-case scenario, which can be potentially ruled out by some genericity condition.

In our paper we take a step to eliminate this gap by showing that even very large mixtures of Gaussians can be polynomially learned. More precisely, we show that a mixture of $m$ Gaussians with equal known covariance can be polynomially learned as long as $m$ is bounded from above by a polynomial of the dimension $n$ and a certain more complex non-degeneracy condition for the means is satisfied. We show that if $n$ is high enough, these non-degeneracy conditions are generic in the smoothed complexity sense. Thus for any fixed $d$ , $O(n^{d})$ generic Gaussians can be polynomially learned in dimension $n$ .

Further, we prove that no such condition can exist in low dimension. A measure of non-degeneracy must be monotone in the sense that adding Gaussian components must make the condition number worse. However, we show that for $k^{2}$ points uniformly sampled from $[0,1]$ there are (with high probability) two mixtures of unit Gaussians with means on non-intersecting subsets of these points, whose $L^{1}$ distance is $O^{*}(e^{-k})$ and which are thus not polynomially identifiable. More generally, in dimension $n$ the distance becomes $O^{*}(e^{-\sqrt[n]{k}})$ . That is, the conditioning improves as the dimension increases, which is consistent with our algorithmic results.

To summarize, our contributions are as follows:

We show that for any $q$ , a mixture of $n^{q}$ Gaussians in dimension $n$ can be learned in time and number of samples polynomial in $n$ and a certain “condition number” $\sigma$ . We show that if the dimension is sufficiently high, this results in an algorithm polynomial from the smoothed analysis point of view (Theorem 1). To do that we provide smoothed analysis of the condition number using certain results from and anti-concentration inequalities. The main technical ingredient of the algorithm is a new “Poissonization” technique to reduce Gaussian mixture estimation to a problem of recovering a linear map of a product distribution known as underdetermined Independent Component Analysis (ICA). We combine this with the recent work on efficient algorithms for underdetermined ICA from to obtain the necessary bounds.

We show that in low dimension polynomial identifiability fails in a certain generic sense (see Theorem 3). Thus the efficiency of our main algorithm is truly a consequence of the “blessing of dimensionality” and no comparable algorithm exists in low dimension. The analysis is based on results from approximation theory and Reproducing Kernel Hilbert Spaces.

Moreover, we combine the approximation theory results with the Poissonization-based technique to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined Independent Component Analysis in low dimension. To the best of our knowledge, this is the first such result in the literature.

We discuss our main contributions more formally now. The notion of Khatri–Rao power $A^{\odot d}$ of a matrix $A$ is defined in Section 2.

where $w\geq\max_{i}(w_{i})/\min_{i}(w_{i})$ , $u\geq\max_{i}\left\|\mu_{i}\right\|$ , $r\geq\bigl((\max_{i}\left\|\mu_{i}\right\|+1)/(\min_{i}\left\|\mu_{i}\right\|)\bigr)$ , $0<b\leq\sigma_{m}(B^{\odot d/2})$ are bounds provided to the algorithm, and $\sigma=\sqrt{\lambda_{\max}(\Sigma)}$ .

Given that the means have been estimated, the weights can be recovered using the tensor structure of higher order cumulants (see Section 2 for the definition of cumulants). This is shown in Appendix I.

We show that $\sigma_{\min}(A^{\odot d})$ is large in the smoothed analysis sense, namely, if we start with a base matrix $A$ and perturb each entry randomly to get $A^{\prime}$ , then $\sigma_{\min}(A^{\prime\odot d})$ is likely to be large. More precisely,

Finally, in Section 6 we show that in low dimension the situation is very different from the high-dimensional generic efficiency given by Theorems 1 and 2: The problem is generically hard. More precisely, we show:

Let $X$ be a set of $k^{2}$ points uniformly sampled from $[0,1]^{n}$ . Then with high probability there exist two mixtures $p$ , $q$ with equal numbers of unit Gaussians, centered on disjoint subsets of $X$ , such that, for some $C>0$ ,

Combining the above lower bound with our reduction provides a similar lower bound for ICA; see a discussion on the connection with ICA below. Our lower bound gives an information-theoretic barrier. This is in contrast to conjectured computational barriers that arise in related settings based on the noisy parity problem (see for pointers). The only previous information-theoretic lower bound for learning GMMs we are aware of is due to and holds for two specially designed one-dimensional mixtures.

A key observation of is that methods based on the higher order statistics used in Independent Component Analysis (ICA) can be adapted to the setting of learning a Gaussian Mixture Model. In ICA, samples are of the form $X=\sum_{i=1}^{m}A_{i}S_{i}$ where the latent random variables $S_{i}$ are independent, and the column vectors $A_{i}$ give the directions in which each signal $S_{i}$ acts. The goal is to recover the vectors $A_{i}$ up to inherent ambiguities. The ICA problem is typically posed when $m$ is at most the dimensionality of the observed space (the “fully determined” setting), as recovery of the directions $A_{i}$ then allows one to demix the latent signals. The case where the number of latent source signals exceeds the dimensionality of the observed signal $X$ is the underdetermined ICA setting. See [12, Chapter 9] for a recent account of algorithms for underdetermined ICA. Two well-known algorithms for underdetermined ICA are given in and . Finally, provides an algorithm with rigorous polynomial time and sampling bounds for underdetermined ICA in high dimension in the presence of Gaussian noise.

Nevertheless, our analysis of the mixture models can be embedded in ICA to show exponential information-theoretic hardness of performing ICA in low dimension, thus establishing the blessing of dimensionality for ICA as well.

Let $X$ be a set of $k^{2}$ random $n$ -dimensional unit vectors. Then with high probability, there exist two disjoint subsets of $X$ , such that when these two sets form the columns of matrices $A$ and $B$ respectively, there exist noisy ICA models $AS+\eta$ and $BS^{\prime}+\eta^{\prime}$ which are exponentially close as distributions in $L^{1}$ distance and satisfying: (1) The coordinate random variables of $S$ and $S^{\prime}$ are scaled Poisson random variables. For at least one coordinate random variable, $S_{i}=\alpha X$ , where $X\sim\mathsf{Poisson}(\lambda)$ is such that $\alpha$ and $\lambda$ are polynomially bounded away from 0. (2) The Gaussian noises $\eta$ and $\eta^{\prime}$ have polynomially bounded directional covariances.

We sketch the proof of Theorem 4 in Appendix G.

Discussion.

Most problems become harder in high dimension, often exponentially harder, a behavior known as “the curse of dimensionality.” Showing that a complex problem does not become exponentially harder often constitutes major progress in its understanding. In this work we demonstrate a reversal of this curse, showing that the lower dimensional instances are exponentially harder than those in high dimension. This seems to be a rare situation in statistical inference and computation. In particular, while high-dimensional concentration of mass can sometimes be a blessing of dimensionality, in our case the generic computational efficiency of our problem comes from anti-concentration.

We hope that this work will enable better understanding of this unusual phenomenon and its applicability to a wider class of computational and statistical problems.

Preliminaries

In this formulation, $\mathbf{e}_{h}$ acts as a selector of a Gaussian mean. Conditioning on $h=i$ , we have $Z\sim\mathcal{N}(\mu_{i},\Sigma)$ , which is consistent with the GMM model.

Given samples from the GMM, the goal is to recover the unknown parameters of the GMM, namely the means $\mu_{1},\dots,\mu_{m}$ and the weights $w_{1},\dots,w_{m}$ .

Underdetermined ICA.

The ICA algorithm from to which we will be reducing learning a GMM relies on the shared tensor structure of the derivatives of the second characteristic function and the higher order multi-variate cumulants. This tensor structure motivates the following form of the Khatri-Rao product:

This form of the Khatri-Rao product arises when performing a change of coordinates under the ICA model using either higher order cumulants or higher order derivative tensors of the second characteristic function.
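As a concrete illustration, here is a minimal sketch of a Khatri-Rao power, assuming the standard column-wise definition in which column $i$ of $A\odot B$ is the flattened outer product of column $i$ of $A$ with column $i$ of $B$. The function names `khatri_rao` and `khatri_rao_power` are hypothetical and are not taken from the paper; matrices are plain lists of rows.

```python
def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of two matrices given as lists of rows.

    Column i of the result is the flattened outer product of column i of A
    with column i of B (assumed standard definition).
    """
    m = len(A[0])
    if len(B[0]) != m:
        raise ValueError("A and B must have the same number of columns")
    rows = []
    for a_row in A:
        for b_row in B:
            # Entry (row pair, column i) is A[a][i] * B[b][i].
            rows.append([a_row[i] * b_row[i] for i in range(m)])
    return rows


def khatri_rao_power(A, d):
    """Khatri-Rao power: A combined with itself d times (d >= 1)."""
    result = A
    for _ in range(d - 1):
        result = khatri_rao(result, A)
    return result


A = [[1, 2],
     [3, 4]]
A2 = khatri_rao_power(A, 2)  # a 4 x 2 matrix
```

Under this definition an $n\times m$ matrix has an $n^{d}\times m$ Khatri-Rao power, so the number of columns $m$ may exceed $n$ while the power can still have full column rank.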

ICA Results.

Theorem H.40 (Appendix H.1, from ) allows us to recover $A$ up to the necessary ambiguities in the noisy ICA setting. The theorem establishes guarantees for an algorithm from for noisy underdetermined ICA, UnderdeterminedICA. This algorithm takes as input a tensor order parameter $d$ , number of signals $m$ , access to samples according to the noisy underdetermined ICA model with unknown noise, accuracy parameter $\epsilon$ , confidence parameter $\delta$ , bounds on moments and cumulants $M$ and $\Delta$ , a bound on the conditioning parameter $\sigma_{m}$ , and a bound on the cumulant order $k$ . It returns approximations to the columns of $A$ up to sign and permutation.

Learning GMM means using underdetermined ICA: The basic idea

In this section we give an informal outline of the proof of our main result, namely learning the means of the components in GMMs via reduction to the underdetermined ICA problem. Our reduction will be discussed in two parts. The first part gives the main idea of the reduction and will demonstrate how to recover the means $\mu_{i}$ up to their norms and signs, i.e. we will get $\pm\mu_{i}/\left\|\mu_{i}\right\|$ . We will then present the reduction in full. It combines the basic reduction with some preprocessing of the data to recover the $\mu_{i}$ ’s themselves. The reduction relies on some well-known properties of the Poisson distribution stated in the lemma below; its proof can be found in Appendix B.

Recall the GMM from equation (1) is given by $Z=[\mu_{1}|\cdots|\mu_{m}]\mathbf{e}_{h}+\eta$ . Henceforth, we will set $A=[\mu_{1}|\cdots|\mu_{m}]$ . We can write the GMM in the form $Z=A\mathbf{e}_{h}+\eta$ , which is similar in form to the noisy ICA model, except that $\mathbf{e}_{h}$ does not have independent coordinates. We now describe how a single sample of an approximate noisy ICA problem is generated.

The reduction involves two internal parameters $\lambda$ and $\tau$ that we will set later. We generate a Poisson random variable $R\sim\mathsf{Poisson}(\lambda)$ , and we run the following experiment $R$ times: At the $i$ th step, generate sample $Z_{i}$ from the GMM. Output the sum of the outcomes of these experiments: $Y=Z_{1}+\cdots+Z_{R}$ .

Let $S_{i}$ be the random variable denoting the number of times samples were taken from the $i$ th Gaussian component in the above experiment. Thus, $S_{1}+\cdots+S_{m}=R$ . Note that $S_{1},\dots,S_{m}$ are not observable although we know their sum. By Lemma 5, each $S_{i}$ has distribution $\mathsf{Poisson}(w_{i}\lambda)$ , and the random variables $S_{i}$ are mutually independent. Let $S:=(S_{1},\dots,S_{m})^{T}$ .
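The independence of the counts can be checked empirically. The sketch below simulates only the component choices of the experiment (not the Gaussian draws) and compares the empirical mean of each $S_{i}$ with $w_{i}\lambda$. The weights, the value of $\lambda$, and the helper names are hypothetical; the Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson generator.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poissonized_counts(weights, lam, rng):
    """Draw R ~ Poisson(lam), pick a component R times, return (R, counts S)."""
    r = sample_poisson(lam, rng)
    counts = [0] * len(weights)
    for _ in range(r):
        h = rng.choices(range(len(weights)), weights=weights)[0]
        counts[h] += 1
    return r, counts


rng = random.Random(0)
weights = [0.5, 0.3, 0.2]   # hypothetical mixing weights
lam = 4.0                   # hypothetical Poisson parameter
trials = [poissonized_counts(weights, lam, rng) for _ in range(20000)]
empirical_means = [
    sum(s[i] for _, s in trials) / len(trials) for i in range(len(weights))
]
```

With these settings the empirical means should be close to $(2.0, 1.2, 0.8)$, and the counts always sum to $R$ by construction.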

For a non-negative integer $t$ , we define $\eta(t):=\sum_{i=1}^{t}\eta_{i}$ where the $\eta_{i}$ are iid according to $\eta_{i}\sim\mathcal{N}(0,\Sigma)$ . In this definition, $t$ can be a random variable, in which case the $\eta_{i}$ are sampled independently of $t$ . Using $\sim$ to indicate that two random variables have the same distribution, we have $Y\sim AS+\eta(R)$ . If there were no Gaussian noise in the GMM (i.e. if we were sampling from a discrete set of points) then the model would become simply $Y=AS$ , which is the ICA model without noise, and so we could recover $A$ up to necessary ambiguities. However, the model $Y\sim AS+\eta(R)$ fails to satisfy even the assumptions of the noisy ICA model, both because $\eta(R)$ is not independent of $S$ and because $\eta(R)$ is not distributed as a Gaussian random vector.

As the covariance of the additive Gaussian noise is known, we may add additional noise to the samples of $Y$ to obtain a good approximation of the noisy ICA model. Parameter $\tau$ , the second parameter of the reduction, is chosen so that with high probability we have $R\leq\tau$ . Conditioning on the event $R\leq\tau$ we draw $X$ according to the rule $X=Y+\eta(\tau-R)\sim AS+\eta(R)+\eta(\tau-R)$ , where $\eta(R)$ , $\eta(\tau-R)$ , and $S$ are drawn independently conditioned on $R$ . Then, conditioned on $R\leq\tau$ , we have $X\sim AS+\eta(\tau)$ .
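The sampling rule above can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's subroutine: the covariance is taken to be spherical, $\Sigma=\sigma^{2}I$, the threshold $\tau$ is an integer, and a draw with $R>\tau$ is rejected by returning `None`. The function name `reduction_sample` and the example parameters are hypothetical. The hidden counts $S$ are returned only so the behavior can be inspected.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def reduction_sample(means, weights, sigma, lam, tau, rng):
    """One sample X = Y + eta(tau - R), or None when R > tau (rejection).

    Assumes spherical noise N(0, sigma^2 I) and an integer threshold tau.
    Returns (X, S) on success, where S holds the hidden per-component counts.
    """
    n = len(means[0])
    r = sample_poisson(lam, rng)
    if r > tau:
        return None
    x = [0.0] * n
    counts = [0] * len(means)
    for _ in range(r):
        h = rng.choices(range(len(means)), weights=weights)[0]
        counts[h] += 1
        # One GMM draw Z = mu_h + eta with eta ~ N(0, sigma^2 I), added to Y.
        for j in range(n):
            x[j] += means[h][j] + rng.gauss(0.0, sigma)
    # Top up with tau - R extra noise draws so the total noise is N(0, tau * sigma^2 I).
    for _ in range(tau - r):
        for j in range(n):
            x[j] += rng.gauss(0.0, sigma)
    return x, counts


rng = random.Random(1)
means = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]   # hypothetical means (columns of A)
weights = [0.5, 0.3, 0.2]                        # hypothetical mixing weights
out = reduction_sample(means, weights, sigma=0.5, lam=3.0, tau=12, rng=rng)
```

Setting `sigma=0.0` makes the output equal to $AS$ exactly, which is a convenient sanity check that the noise-free part of the sample has the ICA form.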

Note that we have only created an approximation to the ICA model. In particular, restricting $\sum_{i=1}^{m}S_{i}=R\leq\tau$ can be accomplished using rejection sampling, but the coordinate random variables $S_{1},\dots,S_{m}$ would no longer be independent. We have two models of interest: (1) $X\sim AS+\eta(\tau)$ , a noisy ICA model with no restriction on $R=\sum_{i=1}^{m}S_{i}$ , and (2) $X\sim(AS+\eta(\tau))|_{R\leq\tau}$ the restricted model.

We are unable to produce samples from the first model, but it meets the assumptions of the noisy ICA problem. Pretending we have samples from model (1), we can apply Theorem H.40 (Appendix H.1) to recover the Gaussian means up to sign and scaling. On the other hand, we can produce samples from model (2), and depending on the choice of $\tau$ , the statistical distance between models (1) and (2) can be made arbitrarily close to zero. It will be demonstrated that given an appropriate choice of $\tau$ , running UnderdeterminedICA on samples from model (2) is equivalent to running UnderdeterminedICA on samples from model (1) with high probability, allowing for recovery of the Gaussian mean directions $\pm\mu_{i}/\left\|\mu_{i}\right\|$ up to some error.

Full reduction.

To be able to recover the $\mu_{i}$ without sign or scaling ambiguities, we add an extra coordinate to the GMM as follows. The new means $\mu_{i}^{\prime}$ are $\mu_{i}$ with an additional coordinate whose value is $1$ for all $i$ , i.e. $\mu_{i}^{\prime}:=\left(\mu_{i}^{T},1\right)^{T}$ . Moreover, this coordinate has no noise. In other words, each Gaussian component now has an $(n+1)\times(n+1)$ covariance matrix $\Sigma^{\prime}:=\left(\begin{smallmatrix}\Sigma&0\\ 0&0\end{smallmatrix}\right)$ . It is easy to construct samples from this new GMM given samples from the original: If the original samples were $u_{1},u_{2}\ldots$ , then the new samples are $u^{\prime}_{1},u^{\prime}_{2}\ldots$ where $u^{\prime}_{i}:=\left(u_{i}^{T},1\right)^{T}$ . The reduction proceeds similarly to the above on the new inputs.
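To see why the extra coordinate removes the sign and scaling ambiguities, consider the following small sketch. It lifts a mean, normalizes it to unit length (the form in which a direction-recovery routine would return it, possibly with flipped sign), and then divides by the last coordinate. The helper names and the example mean are hypothetical.

```python
import math


def lift(u):
    """Append a noiseless coordinate equal to 1 to a sample or mean."""
    return list(u) + [1.0]


def normalized_lifted_mean(mu):
    """Unit-norm version of the lifted mean: (mu, 1) / ||(mu, 1)||."""
    lifted = lift(mu)
    norm = math.sqrt(sum(c * c for c in lifted))
    return [c / norm for c in lifted]


def recover_mean(column):
    """Undo sign and scale: divide the first n entries by the last entry."""
    last = column[-1]
    return [c / last for c in column[:-1]]


mu = [3.0, -4.0]                      # hypothetical mean
col = normalized_lifted_mean(mu)
flipped = [-c for c in col]           # the same direction with the opposite sign
recovered = recover_mean(flipped)     # equals mu up to rounding
```

Because the last coordinate of the lifted mean is known to be $1$, any nonzero rescaling of the column, including a sign flip, cancels in the division.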

Unlike before, we will define the ICA mixing matrix to be $A^{\prime}:=\bigl{[}{\mu_{1}^{\prime}}/{\left\|\mu_{1}^{\prime}\right\|}\bigl{\lvert}\cdots\bigr{\rvert}{\mu_{m}^{\prime}}/{\left\|\mu_{m}^{\prime}\right\|}\bigr{]}$ such that it has unit norm columns. The role of matrix $A$ in the basic reduction will now be played by $A^{\prime}$ . Since we are normalizing the columns of $A^{\prime}$ , we have to scale the ICA signal $S$ obtained in the basic reduction to compensate for this: Define $S^{\prime}_{i}:=\left\|\mu^{\prime}_{i}\right\|S_{i}$ . Thus, the ICA models obtained in the full reduction are:

Correctness of the Algorithm and Reduction

Subroutine 1 captures the sampling process of the reduction: Let $\Sigma$ be the covariance matrix of the GMM, let $\lambda$ be an integer chosen as input, and let $\tau$ be a threshold value that is computed elsewhere and also provided as input. Let $R\sim\mathsf{Poisson}(\lambda)$ . If $R$ is larger than $\tau$ , the subroutine returns a failure notice and the calling algorithm halts immediately. The threshold should therefore be chosen so that the chance of failure is very small; in our case, $\tau$ is chosen so that the chance of failure is half of the confidence parameter given to Algorithm 2. The subroutine then goes through the process described in the full reduction: sampling from the GMM, lifting the sample by appending a 1, then adding a lifted Gaussian so that the total noise has distribution $\mathcal{N}(0,\tau\Sigma)$ . The resulting sample is from the model given by (3).

The bounds are used instead of actual values to allow flexibility — in the context under which the algorithm is invoked — on what the algorithm needs to succeed. However, the closer the bounds are to the actual values, the more efficient the algorithm will be.

The proof of correctness of Algorithm 2 has two main parts. For brevity, the details can be found in Appendix A. In the first part, we analyze the sample complexity of recovering the Gaussian means using UnderdeterminedICA when samples are taken from the ideal noisy ICA model (2).

In the second part, we note that we do not have access to the ideal model (2), and that we can only sample from the approximate noisy ICA model (3) using the full reduction. Choosing $\tau$ appropriately, we use total variation distance to argue that with high probability, running UnderdeterminedICA with samples from the approximate noisy ICA model will produce equally valid results as running UnderdeterminedICA with samples from the ideal noisy ICA model. The total variation distance bound is explored in section A.2.

These ideas are combined in section A.3 to prove the correctness of Algorithm 2. One additional technicality arises from the implementation of Algorithm 2. Samples can be drawn from the noisy ICA model $X^{\prime}=(AS^{\prime}+\eta^{\prime}(\tau))|_{R\leq\tau}$ using rejection sampling on $R$ . In order to guarantee Algorithm 2 executes in polynomial time, when a sample of $R$ needs to be rejected, Algorithm 2 terminates in explicit failure. To complete the proof, we argue that with high probability, Algorithm 2 does not explicitly fail.

Smoothed Analysis

With the above notation, for any base matrix $M$ with dimensions as above, we have, for some absolute constant $C$ ,

Theorem 2 follows immediately from the theorem above by noting that $\sigma_{\min}(A^{\odot 2})\geq\sigma_{\min}(A^{\ominus 2})$ .

Now note that this is a quadratic polynomial in the random variables $N_{ik}$ . We will apply the anticoncentration inequality of Carbery–Wright to this polynomial to conclude that the distance between the $k$ ’th column of $(M+N)^{\ominus 2}$ and the span of the rest of the columns is unlikely to be very small (see Appendix H.3 for the precise result).

Using $\left\|u\right\|_{2}=1$ , the variance of our polynomial in (4) becomes

In our application, our random variables $N_{ik}$ for $i\in[n]$ are not standard Gaussians but are iid Gaussian with variance $\sigma^{2}$ , and our polynomial does not have unit variance. After adjusting for these differences using the estimate on the variance of $P$ above, Lemma H.42 gives .

Therefore, by the union bound over the choice of $k$ .

Now choosing $\epsilon=\sigma^{2}/n^{6}$ , Lemma H.41 gives .

We note that while the above discussion is restricted to Gaussian perturbation, the same technique would work for a much larger class of perturbations. To this end, we would require a version of the Carbery–Wright anticoncentration inequality which is applicable in more general situations. We omit such generalizations here.

The curse of low dimensionality for Gaussian mixtures

We will need some properties of the Reproducing Kernel Hilbert Space $H$ corresponding to the kernel $K$ (see [29, Chapter 10] for an introduction). In particular, we need the bound $\|f\|_{\infty}\leq\|f\|_{H}$ and the reproducing property, $\langle f(\cdot),K(x,\cdot)\rangle_{H}=f(x),\forall f\in H$ . For a function of the form $\sum w_{i}K(x_{i},x)$ we have ${\lVert\sum w_{i}K(x_{i},x)\rVert}^{2}_{H}=\sum w_{i}w_{j}K(x_{i},x_{j})$ .
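Both facts can be checked numerically on a small example. The sketch below assumes a one-dimensional Gaussian kernel $K(x,y)=\exp(-(x-y)^{2}/2)$, for which $K(x,x)=1$; the kernel width, the coefficients, the centers, and the function names are all hypothetical choices made for illustration. It computes the RKHS norm through the quadratic form and compares it with the largest absolute value of the function on a grid.

```python
import math


def gaussian_kernel(x, y):
    """Assumed kernel: K(x, y) = exp(-(x - y)^2 / 2) on the real line."""
    return math.exp(-0.5 * (x - y) ** 2)


def rkhs_norm_squared(weights, centers):
    """||sum_i w_i K(x_i, .)||_H^2 = sum_{i,j} w_i w_j K(x_i, x_j)."""
    return sum(
        wi * wj * gaussian_kernel(xi, xj)
        for wi, xi in zip(weights, centers)
        for wj, xj in zip(weights, centers)
    )


def evaluate(weights, centers, x):
    """f(x) = sum_i w_i K(x_i, x)."""
    return sum(w * gaussian_kernel(c, x) for w, c in zip(weights, centers))


weights = [0.7, -0.4, 0.9]      # hypothetical coefficients
centers = [0.1, 0.5, 0.8]       # hypothetical centers
h_norm = math.sqrt(rkhs_norm_squared(weights, centers))
grid = [i / 100.0 - 3.0 for i in range(701)]   # grid on [-3, 4]
sup_on_grid = max(abs(evaluate(weights, centers, x)) for x in grid)
```

The inequality `sup_on_grid <= h_norm` follows from the reproducing property together with Cauchy-Schwarz, since $|f(x)|=|\langle f,K(x,\cdot)\rangle_{H}|\leq\|f\|_{H}\sqrt{K(x,x)}$.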

Let $g$ be any positive function with $L_{2}$ norm $1$ supported on $[0,1]^{n}$ and let $f={\cal K}g$ . If $X$ has fill $h$ , then there exists $A>0$ such that

Let $X$ and $Y$ be any two subsets of $[0,1]^{n}$ with fill $h$ . Then there exist two Gaussian mixtures $p$ and $q$ (with positive coefficients summing to one, but not necessarily the same number of components), which are centered on two disjoint subsets of $X\cup Y$ and such that for some $B>0$

To simplify the notation we assume that $n=1$ . The general case follows verbatim, except that the interval of integration, $[-1/h,1/h]$ , and its complement need to be replaced by the sphere of radius $1/h$ and its complement respectively.

where $p_{1}$ and $p_{2}$ are mixtures with positive coefficients only.

Put $p_{1}=\sum_{i\in S_{1}}\alpha_{i}K(x_{i},x)$ , $p_{2}=\sum_{i\in S_{2}}\beta_{i}K(x_{i},x)$ , where $S_{1}$ and $S_{2}$ are disjoint subsets of $X\cup Y$ . Now we need to ensure that the coefficients can be normalized to sum to $1$ .

Let $\alpha=\sum\alpha_{i}$ , $\beta=\sum\beta_{i}$ . From (5) and by integrating over the interval $[0,1]$ , and since $f$ is strictly positive on the interval, it is easy to see that $\alpha,\beta\geq 1$ . We have

Noticing that the first summand is bounded by $\frac{2}{h}\exp(A\frac{\log h}{h})$ and the integral in the second summand is even smaller (in fact, $O(e^{-1/h^{2}})$ ), it follows immediately that $|1-\frac{\beta}{\alpha}|<\exp(A^{\prime}\frac{\log h}{h})$ for some $A^{\prime}$ and $h$ sufficiently small.

Collecting exponential inequalities completes the proof.

For convenience we will use a set of $4k^{2}$ points instead of $k^{2}$ . Clearly it does not affect the exponential rate.

By a simple covering set argument (cutting the cube into $m^{n}$ cubes with size $1/m$ ) and basic probability, we see that the fill $h$ of $nm^{n}\log m$ points is at most $O(\sqrt{n}/m)$ with probability $1-o(1)$ . Hence, given $k$ points, we have $h=O((\frac{\log k}{k})^{1/n})$ . We see that with a smaller probability (but still close to $1$ for large $k$ ), we can sample $k$ points $3k^{2}$ times and still have the same fill.
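The scaling of the fill can be observed directly in dimension $n=1$. The sketch below assumes the usual definition of fill, namely the largest distance from a point of the domain to its nearest sample, and computes it exactly for points in the unit interval; the function name `fill_1d` and the sample sizes are hypothetical. The last column of each row is $\log k/k$ for comparison.

```python
import math
import random


def fill_1d(points):
    """Fill of a point set in [0, 1]: the largest distance from any x in [0, 1]
    to its nearest sample (assumed definition)."""
    pts = sorted(points)
    if len(pts) > 1:
        widest_half_gap = max((b - a) / 2.0 for a, b in zip(pts, pts[1:]))
    else:
        widest_half_gap = 0.0
    # The farthest point is either an endpoint of [0, 1] or the middle of a gap.
    return max(pts[0], 1.0 - pts[-1], widest_half_gap)


rng = random.Random(0)
rows = []
for k in (100, 1000, 10000):
    sample = [rng.random() for _ in range(k)]
    rows.append((k, fill_1d(sample), math.log(k) / k))
```

In higher dimension an exact computation is harder, and a grid-based estimate of the same quantity would be one possible substitute.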

Partitioning the set of $4k^{2}$ points into $2k$ disjoint subsets of $2k$ points and applying Theorem 6.10 (to $k+k$ points) we obtain $2k$ pairs of exponentially close mixtures with at most $2k$ components each. If one of the pairs has the same number of components, we are done. If not, by the pigeon-hole principle for at least two pairs of mixtures $p_{1}\approx q_{1}$ and $p_{2}\approx q_{2}$ the differences of the number of components (an integer number between and $2k-1$ ) must coincide. Assume without loss of generality that $p_{1}$ has no more components than $q_{1}$ and $p_{2}$ has no more components than $q_{2}$ . Taking $p=\frac{1}{2}(p_{1}+q_{2})$ and $q=\frac{1}{2}(p_{2}+q_{1})$ completes the proof.

References

Appendix A Theorem 1 Proof Details

The proposed full reduction from Section 3 provides us with two models. The first is a noisy ICA model from which we cannot sample:

The second is a model that fails to satisfy the assumption that $S^{\prime}$ has independent coordinates, but it is a model from which we can sample:

Both models rely on the choice of two parameters, $\lambda$ and $\tau$ . The dependence on $\tau$ is explicit in the models. The dependence on $\lambda$ can be summarized in the unrestricted model as $S_{i}=\frac{1}{\left\|\mu_{i}^{\prime}\right\|}S_{i}^{\prime}\sim\mathsf{Poisson}(w_{i}\lambda)$ independently of each other, and $R=\sum_{i=1}^{m}S_{i}\sim\mathsf{Poisson}(\lambda)$ .
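Since each $S^{\prime}_{i}$ is a scaled Poisson variable, its cumulants have a simple closed form: every cumulant of $\mathsf{Poisson}(\alpha)$ equals $\alpha$, and cumulants are homogeneous, so the order-$k$ cumulant of $cY$ is $c^{k}\alpha$. The sketch below checks this numerically by computing raw moments from the probability mass function and converting them to cumulants with the standard moment-cumulant recursion. The parameter values and function names are hypothetical.

```python
import math


def poisson_raw_moments(alpha, scale, order, cutoff=200):
    """Raw moments E[(scale * Y)^j], j = 0..order, for Y ~ Poisson(alpha), via a truncated sum."""
    moments = [0.0] * (order + 1)
    pmf = math.exp(-alpha)              # P(Y = 0)
    for y in range(cutoff):
        for j in range(order + 1):
            moments[j] += pmf * (scale * y) ** j
        pmf *= alpha / (y + 1)          # P(Y = y + 1) from P(Y = y)
    return moments


def cumulants_from_moments(moments):
    """kappa_n = m_n - sum_{k=1}^{n-1} C(n-1, k-1) * kappa_k * m_{n-k}."""
    order = len(moments) - 1
    kappa = [0.0] * (order + 1)
    for n in range(1, order + 1):
        kappa[n] = moments[n] - sum(
            math.comb(n - 1, k - 1) * kappa[k] * moments[n - k] for k in range(1, n)
        )
    return kappa[1:]                    # cumulants of order 1..order


alpha, scale = 1.5, 2.0                 # hypothetical Poisson parameter and scaling
kappa = cumulants_from_moments(poisson_raw_moments(alpha, scale, order=4))
```

In the setting above, the scale plays the role of $\left\|\mu_{i}^{\prime}\right\|$ and $\alpha$ the role of $w_{i}\lambda$.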

The probability of choosing $R>\tau$ will be seen to be exponentially small in $\tau$ . For this reason, running UnderdeterminedICA with polynomially many samples from model (6) will with high probability be equivalent to running the ICA Algorithm with samples from model (7). This notion will be made precise later using total variation distance.

For the remainder of this subsection, we proceed as if samples are drawn from the ideal noisy ICA model (6). Thus, to recover the columns of $A^{\prime}$ , it suffices to run UnderdeterminedICA on samples of $X^{\prime}$ . Theorem H.40 can be used for this analysis so long as we can obtain the necessary bounds on the cumulants of $S^{\prime}$ , moments of $S^{\prime}$ , and the moments of $\eta^{\prime}(\tau)$ . We define $w_{\min}:=\min_{i}w_{i}$ and $w_{\max}:=\max_{i}w_{i}$ . Then, the cumulants of $S^{\prime}$ are bounded by the following lemma:

By construction, $S^{\prime}_{i}=\left\|\mu_{i}^{\prime}\right\|S_{i}$ . By the homogeneity property of univariate cumulants,

The bounds on the moments of $S^{\prime}_{i}$ for each $i$ can be computed using the following lemma:

Let $Y$ denote a random variable drawn from $\mathsf{Poisson}(\alpha)$ . It is known (see ) that

The absolute moments of Gaussian random variables are well known. For completeness, the bounds are provided in Lemma E.39 of Appendix E.

Suppose that samples of $X^{\prime}$ are taken from the unrestricted ICA model (3) choosing parameter $\lambda=m$ and $\tau$ a constant. Suppose that UnderdeterminedICA is run using these samples. Suppose $\sigma_{m}(A^{\prime\odot d/2})>0$ . Fix $\epsilon\in(0,1/2)$ and $\delta\in(0,1/2)$ . Then with probability $1-\delta$ , when the number of samples $N$ is:

Obtaining the sample bound is an exercise of rewriting the parameters associated with the model $X^{\prime}=A^{\prime}S^{\prime}+\eta^{\prime}(\tau)$ in a way which can be used by Theorem H.40. In what follows, where new parameters are introduced without being described, they will correspond to parameters of the same name defined in and used by the statement of Theorem H.40.

it suffices that $M\geq(\left\|\mu_{\max}^{\prime}\right\|\frac{w_{\max}}{w_{\min}})^{d+1}(d+1)^{d+1}$ , giving a more natural condition number.

to guarantee that $M$ bounds all required order $d+1$ absolute moments.

We can now apply Theorem H.40, using the parameter values $k=d+1$ , $\Delta$ from (9), and $M$ from (10). Then with probability $1-\delta$ ,

The poly bound in (11) is equivalent to the poly bound in (8).

Theorem A.17 allows us to recover the columns of $A^{\prime}$ up to sign. However, what we really want to recover are the means of the original Gaussian mixture model, which are the columns of $A$ . Recalling the correspondence between $A^{\prime}$ and $A$ laid out in Section 3, the Gaussian means $\mu_{1},\dotsc,\mu_{m}$ which form the columns of $A$ are related to the columns $\mu_{1}^{\prime},\dotsc,\mu_{m}^{\prime}$ of $A^{\prime}$ by the rule $\mu_{i}=\mu_{i}^{\prime}(1:n)/\mu_{i}^{\prime}(n+1)$ . Using this rule, we can construct estimates of the Gaussian means from the estimates of the columns of $A^{\prime}$ . By propagating the errors from Theorem A.17, we arrive at the following result:

Let $\epsilon^{*}>0$ (to be chosen later) give a desired bound on the errors of the columns of $A^{\prime}$ . Then, from Theorem A.17, using

By construction, $\mu_{\max}^{\prime}=\left(\begin{array}[]{c}\mu_{\max}\\ 1\end{array}\right)$ . By the triangle inequality,

Next, we note that $\underline{A}^{\prime}=\left(\begin{array}[]{c}A\\ \mathbf{1}\end{array}\right)$ where $\mathbf{1}$ is an all ones row vector. It follows that the rows of $A^{\odot d/2}$ are a strict subset of the rows of $\underline{A}^{\prime\odot d/2}$ . Thus,

Thus, it is sufficient to call UnderdeterminedICA with

samples to achieve the desired $\epsilon^{*}$ accuracy on the returned estimates of the columns of $A^{\prime}$ with probability $1-\delta$ .

Step 2: Error propagation.

Recall that $A_{i}^{\prime}=\left(\begin{array}[]{c}\mu_{i}\\ 1\end{array}\right)\cdot\left\|\left(\begin{array}[]{c}\mu_{i}\\ 1\end{array}\right)\right\|^{-1}$ , making $A_{i}^{\prime}(n+1)=\frac{1}{\sqrt{1+\left\|\mu_{i}\right\|^{2}}}$ . Thus,

A.2 Distance of the Sampled Model to the Ideal Model

An important part of the reduction is that the coordinates of $S$ are mutually independent. Without the threshold $\tau$ , this is true (cf. Lemma 5). However, without the threshold, one cannot know how to add more noise so that the total noise on each sample is iid. We show that we can choose the threshold $\tau$ large enough that the samples still come from a distribution with arbitrarily small total variation distance to the one with truly independent coordinates.

Fix $\delta>0$ . Let $S\sim\mathsf{Poisson}(\lambda)$ for $\lambda\geq\ln\delta$ . Let $b=e\lambda$ . If $\tau>e\lambda$ , $\tau\geq 1$ , and $\tau\geq\ln(1/\delta)-\lambda$ , then .

By the Chernoff bound (See Theorem A.1.15 in ),

For any $\tau>\lambda$ , letting $\epsilon=\tau/\lambda-1$ , we get

To get , it suffices that $\tau-\tau\log_{b}\tau\leq\log_{b}(\delta e^{\lambda})$ . Note that

If $\tau-\tau\log_{b}\tau\leq\log_{b}\left(\delta e^{\lambda}\right)$ , then we have

which holds for $\tau\geq\ln\left(\frac{1}{\delta e^{\lambda}}\right)=\ln(1/\delta)-\lambda$ , giving the desired result.
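As a numerical sanity check of the tail estimate underlying this argument, the sketch below compares the exact upper tail of a Poisson variable with the standard Chernoff-type bound $e^{-\lambda}(e\lambda/\tau)^{\tau}$, which is valid for $\tau>\lambda$ and corresponds to $e^{-\lambda}b^{\tau}/\tau^{\tau}$ with $b=e\lambda$. It checks only this bound, not the full statement of the lemma; the parameter values and function names are hypothetical.

```python
import math


def poisson_pmf_values(lam, count):
    """Return [P(S = 0), ..., P(S = count - 1)] for S ~ Poisson(lam)."""
    values = []
    pmf = math.exp(-lam)
    for y in range(count):
        values.append(pmf)
        pmf *= lam / (y + 1)
    return values


def poisson_upper_tail(lam, tau, cutoff=400):
    """P(S > tau), summing the pmf over tau < y < cutoff (the remainder is negligible here)."""
    return sum(poisson_pmf_values(lam, cutoff)[tau + 1:])


def chernoff_bound(lam, tau):
    """Standard bound e^{-lam} * (e * lam / tau)^tau, valid for tau > lam."""
    return math.exp(-lam) * (math.e * lam / tau) ** tau


cases = [(2.0, 6), (1.0, 3), (5.0, 14), (3.0, 20)]   # hypothetical (lam, tau) pairs
comparison = [(lam, tau, poisson_upper_tail(lam, tau), chernoff_bound(lam, tau))
              for lam, tau in cases]
```

In each row of `comparison` the exact tail sits below the bound, and the bound decays quickly once $\tau$ exceeds $e\lambda$.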

By Lemma A.23, $\tau\geq\ln(N/\delta)-\lambda$ implies for every $i$ . The union bound gives us the desired result.

It should now be easy to see that if we choose our threshold $\tau$ large enough, our samples can be statistically close (see Appendix F) to ones that would come from the truly independent distribution. This claim is made formal as follows:

Fix $\delta>0$ . Let $\tau>0$ . Let $F$ be a Poisson distribution with parameter $\lambda$ and have corresponding density $f$ . Let $G$ be a discrete distribution with density $g(x)=f(x)/F(\tau)$ when $0\leq x\leq\tau$ and 0 otherwise. Then $\operatorname{d}_{TV}(F,G)=1-F(\tau)$ .
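The identity $\operatorname{d}_{TV}(F,G)=1-F(\tau)$ can be sanity-checked numerically. The following is a minimal sketch, assuming that $F(\tau)$ denotes the Poisson cumulative distribution function at an integer $\tau$ and that the infinite support can be cut off at a large index (`support_max`, a hypothetical parameter) with negligible error for small $\lambda$ .

```python
import math


def poisson_pmf(k, lam):
    """Pr(X = k) for X ~ Poisson(lam)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def tv_truncated_poisson(lam, tau, support_max=400):
    """Return (d_TV(F, G), 1 - F(tau)) for the truncated distribution G.

    G has density f(x) / F(tau) on 0 <= x <= tau and 0 elsewhere.
    The sum over the support is cut off at support_max (an assumption
    that is harmless when lam is small compared with support_max).
    """
    cdf_at_tau = sum(poisson_pmf(k, lam) for k in range(tau + 1))
    l1 = 0.0
    for k in range(support_max + 1):
        f = poisson_pmf(k, lam)
        g = f / cdf_at_tau if k <= tau else 0.0
        l1 += abs(f - g)
    return 0.5 * l1, 1.0 - cdf_at_tau
```

Calling `tv_truncated_poisson(2.0, 5)` returns two numbers that should agree up to floating-point error.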

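Since we are working with discrete distributions, we can write

$$\operatorname{d}_{TV}(F,G)=\frac{1}{2}\sum_{x\geq 0}|f(x)-g(x)|=\frac{1}{2}\left(\sum_{0\leq x\leq\tau}f(x)\left(\frac{1}{F(\tau)}-1\right)+\sum_{x>\tau}f(x)\right)=\frac{1}{2}\Big((1-F(\tau))+(1-F(\tau))\Big)=1-F(\tau).$$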

A.3 Proof of Theorem 1

We now show that after the reduction is applied, we can use the UnderdeterminedICA routine given in to learn the GMM. Instead of requiring exact values of each parameter, we simply require a bound on each. The algorithm remains polynomial in those bounds, and hence polynomial in the true values.

Assume that after drawing samples from Subroutine 1, the signals $S_{i}$ are mutually independent (as in the “ideal” model given by (2)) and the mean matrix $B$ satisfies $\sigma_{m}(B^{\odot d/2})\geq b>0$ . Then by Theorem A.19, with probability of error $\delta_{1}$ , the call to UnderdeterminedICA in Algorithm 2 recovers the columns of $B$ to within $\epsilon$ and up to a permutation using $N$ samples of complexity

Step 2

We need to show that after getting $N$ samples from the reduction, the resulting distribution is still close in total variation to the independent one. We will choose a new $\delta^{\prime}=\delta_{2}/(2N)$ . Let $R\sim\mathsf{Poisson}(\lambda)$ . Given $\delta^{\prime}$ , Lemma A.27 shows that for $\tau\geq\ln(1/\delta^{\prime})-\lambda$ , with probability $1-\delta^{\prime}$ , $R\leq\tau$ .

Take $N$ iid random variables $X_{1},X_{2},\dots,X_{N}$ from the $\mathsf{Poisson}(\lambda)$ distribution. Let $G$ be a distribution given by density function $g(x)=(f(x)\mathds{1}_{0\leq x\leq\tau})/F(\tau)$ . Let $Y_{1},Y_{2},\dots,Y_{N}$ be iid random variables with distribution $G$ . Denote the joint distribution of the $X_{i}$ ’s by $F^{\prime}$ with density $f^{\prime}$ , and the joint distribution of the $Y_{i}$ ’s as $G^{\prime}$ with density $g^{\prime}$ . By the union bound and the fact that total variation distance satisfies the triangle inequality,

Then for our choice of $\tau$ , by Lemma A.23 and Lemma A.27, we have

By the same union bound argument, the probability that the algorithm fails (when $R>\tau$ ) is at most $\delta_{2}/2$ , since it has to draw $N$ samples. So with high probability, the algorithm does not fail; otherwise, it still does not take more than polynomial time, and will terminate instead of returning a false result.

Step 3

We know that $N$ is at least a polynomial which can be written in terms of the dependence on $\tau$ as $p(\tau^{d^{2}},\Theta)$ . This means there will be a power of $\tau$ which dominates all of the $\tau$ factors in $p$ , and in particular, will be $\tau^{Cd^{2}}$ for some $C$ . It then suffices to choose $C$ so that $p\left(\tau^{d^{2}},\Theta\right)\leq\tau^{Cd^{2}}q(\Theta)\leq N$ , where

Then, with the proper choice of $\tau$ (to be specified shortly), from step 2 we have

Since $\lambda\geq 1$ it suffices to choose $\tau$ so that

is enough for the desired bound on the sample size. Observe that $4(\log(2/\delta)+\log(q(\Theta)))\geq 1$ .

A useful fact is that for general $x,a,b\geq 1$ , $x\geq\max(2a,b^{2})$ satisfies $x^{a}\leq x^{x}/b^{x}$ . This captures the essence of our situation nicely. Letting $e\lambda$ play the role of $b$ , $Cd^{2}$ play the role of $a$ and $x$ play the role of $\tau$ , to satisfy (18), it suffices that

We can see that $\tau^{\tau/2}\geq(e\lambda)^{2}$ and $\tau^{\tau/4}\geq\tau^{Cd^{2}}$ by construction. But we also get $\tau/4\geq\log(2/\delta)+\log{q(\Theta)}$ which implies $\tau^{\tau/4}\geq e^{\tau/4}\geq\frac{2}{\delta}q(\Theta)$ . Thus for our choice of $\tau$ , which also preserves the requirement in Step 2, there is a corresponding set of choices for $N$ , where the required sample size remains polynomial as

Appendix B Lemmas on the Poisson Distribution

The following lemmas are well-known; see, e.g., . We provide proofs for completeness.

The first part of the lemma (i.e., $Y_{i}\sim\mathsf{Poisson}(p_{i}\lambda)$ for all $i$ ) follows from Lemma B.30. For the second part, we prove the binomial case ( $k=2$ ); the general case is similar.

Appendix C Properties of Cumulants

The following properties of multivariate cumulants are well known and are largely inherited from the definition of the cumulant generating function:

Also, given a scalar random variable $Z$ , then

with symmetry implying the additive multilinear property for all other coordinates.

Appendix D Bounds on Stirling Numbers of the Second Kind

The following bound comes from [23, Theorem 3].

If $n\geq 2$ and $1\leq r\leq n-1$ are integers, then $\genfrac{\{}{\}}{0.0pt}{}{n}{r}\leq\frac{1}{2}{n\choose r}r^{n-r}$ .

From this, we can derive a somewhat looser bound on the Stirling numbers of the second kind which does not depend on $r$ :

The Stirling number $\genfrac{\{}{\}}{0.0pt}{}{n}{k}$ of the second kind counts the number of ways of splitting a set of $n$ labeled objects into $k$ nonempty unlabeled subsets. In the case where $r=n$ , we have $\genfrac{\{}{\}}{0.0pt}{}{n}{r}=1$ . As $n\geq 1$ , it is clear that for these choices of $n$ and $r$ , $\genfrac{\{}{\}}{0.0pt}{}{n}{r}\leq n^{n-1}$ . By the restriction $1\leq r\leq n$ , when $n=1$ we have $n=r$ , giving that $\genfrac{\{}{\}}{0.0pt}{}{n}{r}=1$ . As such, the only remaining cases to consider are when $n\geq 2$ and $1\leq r\leq n-1$ , the cases where Lemma D.34 applies.

When $n\geq 2$ and $1\leq r\leq n-1$ , then

which is slightly stronger than the desired upper bound.
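Both bounds in this appendix are easy to check for small parameters. The sketch below is an illustration only: it computes Stirling numbers of the second kind with the standard recurrence $\genfrac{\{}{\}}{0.0pt}{}{n}{k}=k\genfrac{\{}{\}}{0.0pt}{}{n-1}{k}+\genfrac{\{}{\}}{0.0pt}{}{n-1}{k-1}$ and tests the bound $\frac{1}{2}{n\choose r}r^{n-r}$ (for $n\geq 2$ , $1\leq r\leq n-1$ ) and the looser bound $n^{n-1}$ (for $1\leq r\leq n$ ) in exact integer arithmetic. The function names are hypothetical.

```python
from math import comb


def stirling2_table(n_max):
    """Table S[n][k] of Stirling numbers of the second kind for 0 <= k <= n <= n_max."""
    table = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    table[0][0] = 1
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            # Element n either joins one of k existing blocks or forms a new block.
            table[n][k] = k * table[n - 1][k] + table[n - 1][k - 1]
    return table


def bounds_hold(n_max):
    """Check both upper bounds for all 1 <= r <= n <= n_max using integers only."""
    table = stirling2_table(n_max)
    for n in range(1, n_max + 1):
        for r in range(1, n + 1):
            if table[n][r] > n ** (n - 1):
                return False
            # The sharper bound is only claimed for n >= 2 and r <= n - 1.
            if n >= 2 and r <= n - 1 and 2 * table[n][r] > comb(n, r) * r ** (n - r):
                return False
    return True
```

A finite check such as `bounds_hold(12)` does not replace the proof; it only guards against a misreading of the statements.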

Appendix E Values of Higher Order Statistics

In this appendix, we gather together some of the explicit values for higher order statistics of the Poisson and Normal distributions required for the analysis of our reduction from learning a Gaussian Mixture Model to learning an ICA model from samples.

The absolute moments of the Gaussian random variable $\eta\sim N(0,\sigma^{2})$ are given by:
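One standard closed form, stated here as a well-known fact rather than as a reproduction of the display in the original, is $\mathbb{E}|\eta|^{r}=\sigma^{r}2^{r/2}\Gamma\left(\frac{r+1}{2}\right)/\sqrt{\pi}$ for $r>-1$ . The sketch below compares this expression with a direct numerical integration. The function names, the integration window, and the step count are assumptions chosen for illustration.

```python
import math


def abs_moment_closed_form(r, sigma):
    """E|eta|^r for eta ~ N(0, sigma^2), via the Gamma-function formula."""
    return sigma ** r * 2 ** (r / 2) * math.gamma((r + 1) / 2) / math.sqrt(math.pi)


def abs_moment_numeric(r, sigma, steps=20000, width=12.0):
    """Trapezoid-rule estimate of E|eta|^r.

    Integrates x^r times the N(0, sigma^2) density over [0, width * sigma]
    and doubles the result by symmetry; the tail beyond the window is negligible.
    """
    upper = width * sigma
    h = upper / steps

    def integrand(x):
        density = math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
        return x ** r * density

    total = 0.5 * (integrand(0.0) + integrand(upper))
    for i in range(1, steps):
        total += integrand(i * h)
    return 2 * h * total
```

For $r=2$ the closed form reduces to $\sigma^{2}$ , and for $r=1$ to $\sigma\sqrt{2/\pi}$ , which gives two quick checks by hand.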

Appendix F Total Variation Distance

Total variation is a type of statistical distance metric between probability distributions. In words, the total variation between two measures is the largest difference between the measures on a single event. Clearly, this distance is bounded above by 1.

For probability measures $F$ and $G$ on a sample space $\Omega$ with sigma-algebra $\Sigma$ , the total variation is denoted and defined as:

$$\operatorname{d}_{TV}(F,G):=\sup_{A\in\Sigma}|F(A)-G(A)|.$$

Equivalently, when $F$ and $G$ are distribution functions having densities $f$ and $g$ , respectively,

$$\operatorname{d}_{TV}(F,G)=\frac{1}{2}\int_{\Omega}|f-g|\,d\mu$$

where $\mu$ is an arbitrary positive measure for which $F$ and $G$ are absolutely continuous.

More specifically, when $F$ and $G$ are discrete distributions with known densities, we can write

$$\operatorname{d}_{TV}(F,G)=\frac{1}{2}\sum_{x\in\Omega}|f(x)-g(x)|$$

where we choose $\mu$ that simply assigns unit measure to each atom of $\Omega$ (in this case, absolute continuity is trivial since $\mu(A)=0$ only when $A$ is empty and thus $F(A)$ must also be 0). For more discussion, one can see Definition 15.3 in and Sect. 11.6 in .
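For a small finite sample space, the equivalence between the supremum over events and half the $L^{1}$ distance between the densities can be checked by brute force. The following is a minimal sketch with hypothetical function names; distributions are represented as dictionaries mapping atoms to probabilities, and the enumeration over all events is exponential in the support size, so it is meant only for tiny examples.

```python
from itertools import combinations


def tv_half_l1(f, g):
    """Total variation as half the L1 distance between two discrete densities."""
    support = set(f) | set(g)
    return 0.5 * sum(abs(f.get(x, 0.0) - g.get(x, 0.0)) for x in support)


def tv_sup_over_events(f, g):
    """Total variation as the largest |F(A) - G(A)| over all events A."""
    support = sorted(set(f) | set(g))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            mass_f = sum(f.get(x, 0.0) for x in event)
            mass_g = sum(g.get(x, 0.0) for x in event)
            best = max(best, abs(mass_f - mass_g))
    return best
```

With `f = {0: 0.5, 1: 0.3, 2: 0.2}` and `g = {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.1}`, both functions give the same value, attained on the event $\{0\}$ .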

Appendix G Sketch for the proof of Theorem 4

We can use our Poissonization technique to embed difficult instances of learning GMMs into the ICA setting to prove that ICA is information-theoretically hard when the observed dimension $n$ is a constant using the lower bound for learning GMMs. We are not aware of any existing lower bounds in the literature for this problem. We only provide an informal outline of the argument.

Recall that $R_{p}=\sum_{i}S_{pi}$ and $R_{q}=\sum_{i}S_{qi}$ are Poisson distributed with parameters $m_{p}$ and $m_{q}$ denoting the number of columns of $A_{p}$ and $A_{q}$ respectively. Lemma A.23 implies that for a choice of $\tau_{p}$ which is linear in $m_{p}$ , the probability of a draw with $R_{p}>\tau_{p}$ is exponentially small, and similarly for $\tau_{q}$ . In particular, we choose $\tau=\max(\tau_{p},\tau_{q})$ for the above ICA models.

Now since the $L^{1}$ (and hence total variation) distance between $p$ and $q$ is exponentially small in $k^{2}$ (upper bound on the number of components), the distance between the two resulting ICA models produced by the reduction is also exponentially small (specifically, the total variation distance between the random variables $X_{p}$ and $X_{q}$ ). To see this, we must condition on several cases. First, conditioning either model on $R>\tau$ , we have that $\Pr(R>\tau)$ is exponentially small, and hence its contribution to the overall total variation distance between $X_{p}$ and $X_{q}$ is exponentially small. Conditioning on $R=z$ where $z\in\{0,1,\ldots,\tau\}$ , then the facts that $p$ and $q$ are close in total variation distance and that total variation distance satisfies a version of the triangle inequality (that is for random variables $C,D,E,F$ , $d_{TV}(C+D,E+F)\leq d_{TV}(C,E)+d_{TV}(D,F)$ ) imply that by viewing $X_{p}$ (and similarly for $X_{q}$ ) as the sum of $z$ draws from the distribution $p$ and $\tau-z$ draws from the additive Gaussian noise distribution, the total variation distance between $X_{p}$ and $X_{q}$ conditioned on $R=z$ is still exponentially small. Thus, the non-conditional distributions of $X_{p}$ and $X_{q}$ will be exponentially close in $k$ in total variation distance. In particular, the sample complexity of distinguishing between $X_{p}$ and $X_{q}$ is exponential in $k$ .

Appendix H

H.2 Rudelson-Vershynin subspace bound

where as usual $\sigma_{\min}(A)=\sigma_{\min(m,n)}(A)$ .

H.3 Carbery-Wright anticoncentration

The version of the anticoncentration inequality we use is explicitly given in which in turn follows immediately from :

Appendix I Recovery of Gaussian Weights

Our technique for the recovery of the Gaussian weights relies on the tensor properties of multivariate cumulants that have been used in the ICA literature.

The most theoretically justified ICA algorithms have relied on the tensor structure of multivariate cumulants, including the early, popular practical algorithm JADE . In the fully determined ICA setting in which the number of source signals does not exceed the ambient dimension, the papers and demonstrate that ICA with additive Gaussian noise can be solved in polynomial time and using polynomial samples. The tensor structure of the cumulants was (to the best of our knowledge) first exploited in and later in to solve underdetermined ICA. Finally, provides an algorithm with rigorous polynomial time and sampling bounds for underdetermined ICA in the presence of Gaussian noise.

Weight recovery (main idea).

Under the basic ICA reduction (see Section 3) using the Poisson distribution with parameter $\lambda$ , we have that $X=AS+\eta$ is observed such that $A=[\mu_{1}|\cdots|\mu_{m}]$ and $S_{i}\sim\mathsf{Poisson}(w_{i}\lambda)$ . As $A$ has already been recovered, what remains to be recovered are the weights $w_{1},\cdots,w_{m}$ . These can be recovered using the tensor structure of higher order cumulants. The critical relationship is captured by the following lemma:

It is easily seen that the Gaussian component has no effect on the cumulant:

In particular, we have that $S_{i}\sim\mathsf{Poisson}(w_{i}\lambda)$ with $w_{i}$ the probability of sampling from the $i$ th Gaussian. Given knowledge of $A$ and the cumulants of the Poisson distribution, we can recover the Gaussian weights.
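The idea can be illustrated with a small sketch that is not the procedure analyzed in this paper. It assumes exact (population) third cumulants of one-dimensional projections $u^{T}X$ are available and uses three standard facts: cumulants of independent variables add, the third cumulant of $cS$ is $c^{3}$ times that of $S$ , and every cumulant of $\mathsf{Poisson}(w_{i}\lambda)$ equals $w_{i}\lambda$ while Gaussian cumulants of order three vanish. This gives $\kappa_{3}(u^{T}X)=\sum_{i}w_{i}\lambda(u^{T}\mu_{i})^{3}$ , a linear system in the weights. The directions $u$ , the helper names, and the use of third order are all assumptions made for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def projected_third_cumulants(directions, means, weights, lam):
    """Population third cumulant of u^T X for each direction u (Gaussian part contributes 0)."""
    return [
        sum(w * lam * dot(u, mu) ** 3 for mu, w in zip(means, weights))
        for u in directions
    ]


def recover_weights(directions, means, cumulants, lam):
    """Recover w from kappa_3(u_j^T X) = sum_i w_i * lam * (u_j . mu_i)^3."""
    system = [[float(dot(u, mu) ** 3) for mu in means] for u in directions]
    return solve_linear(system, [k / lam for k in cumulants])
```

As a usage example, take three means in the plane, $(1,0)$ , $(0,1)$ and $(1,1)$ , with the same three vectors as directions; the resulting $3\times 3$ system is invertible, so the weights are recovered even though there are more components than dimensions. With cumulants estimated from samples instead of exact values, the recovered weights would only be approximate.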