Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders

Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva

Introduction

We present an algorithm (with rigorous performance guarantees) for a basic statistical problem. Suppose η\eta is an independent nn-dimensional Gaussian random variable with an unknown covariance matrix Σ\Sigma and AA is an unknown n×nn\times n matrix. We are given samples of the form y=Ax+ηy=Ax+\eta where xx is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable. The most natural case is when xx is chosen uniformly at random from {+1,1}n\{+1,-1\}^{n}, although our algorithms in even the more general case above. Our goal is to reconstruct an additive approximation to the matrix AA and the covariance matrix Σ\Sigma running in time and using a number of samples that is polynomial in nn and 1ϵ\frac{1}{\epsilon}, where ϵ\epsilon is the target precision (see Theorem 1.1) This problem arises in several research directions within machine learning: Independent Component Analysis (ICA), Deep Learning, Gaussian Mixture Models (GMM), etc. We describe these connections next, and known results (focusing on algorithms with provable performance guarantees, since that is our goal).

Most obviously, the above problem can be seen as an instance of Independent Component Analysis (ICA) with unknown Gaussian noise. ICA has an illustrious history with applications ranging from econometrics, to signal processing, to image segmentation. The goal generally involves finding a linear transformation of the data so that the coordinates are as independent as possible . This is often accomplished by finding directions in which the projection is “non-Gaussian” . Clearly, if the datapoint yy is generated as AxAx (i.e., with no noise η\eta added) then applying linear transformation A1A^{-1} to the data results in samples A1yA^{-1}y whose coordinates are independent. This restricted case was considered by Comon and Frieze, Jerrum and Kannan , and their goal was to recover an additive approximation to AA efficiently and using a polynomial number of samples. (We will later note a gap in their reasoning, albeit fixable by our methods. See also recent papers by Anandkumar et al., Hsu and Kakade, that do not use local search and avoids this issue.) To the best of our knowledge, there are currently no known algorithms with provable guarantees for the more general case of ICA with Gaussian noise (this is especially true if the covariance matrix is unknown, as in our problem), although many empirical approaches are known. (eg. , the issue of “empirical” vs “rigorous” is elaborated upon after Theorem 1.1.)

The second view of our problem is as a concisely described Gaussian Mixture Model. Our data is generated as a mixture of 2n2^{n} identical Gaussian components (with an unknown covariance matrix) whose centers are the points {Ax:x{1,1}n}\left\{Ax:x\in\left\{-1,1\right\}^{n}\right\}, and all mixing weights are equal. Notice, this mixture of 2n2^{n} Gaussians admits a concise description using O(n2)O(n^{2}) parameters. The problem of learning Gaussian mixtures has a long history, and the popular approach in practice is to use the EM algorithm , though it has no worst-case guarantees (the method may take a very long time to converge, and worse, may not always converge to the correct solution). An influential paper of Dasgupta initiated the program of designing algorithms with provable guarantees, which was improved in a sequence of papers . But in the current setting, it is unclear how to apply any of the above algorithms (including EMEM) since the trivial application would keep track of exponentially many parameters – one for each component. Thus, new ideas seem necessary to achieve polynomial running time.

The third view of our problem is as a simple form of autoencoding . This is a central notion in Deep Learning, where the goal is to obtain a compact representation of a target distribution using a multilayered architecture, where a complicated function (the target) can be built up by composing layers of a simple function (called the autoencoder ). The main tenet is that there are interesting functions which can be represented concisely using many layers, but would need a very large representation if a “shallow” architecture is used instead). This is most useful for functions that are “highly varying” (i.e. cannot be compactly described by piecewise linear functions or other “simple” local representations). Formally, it is possible to represent using just (say) n2n^{2} parameters, some distributions with 2n2^{n} “varying parts” or “interesting regions.” The Restricted Boltzmann Machine (RBM) is an especially popular autoencoder in Deep Learning, though many others have been proposed. However, to the best of our knowledge, there has been no successful attempt to give a rigorous analysis of Deep Learning. Concretely, if the data is indeed generated using the distribution represented by an RBM, then do the popular algorithms for Deep Learning learn the model parameters correctly and in polynomial time? Clearly, if the running time were actually found to be exponential in the number of parameters, then this would erode some of the advantages of the compact representation.

We give a provable algorithm for ICA with unknown Gaussian noise. We have not made an attempt to optimize the quoted running time of this model, but we emphasize that this is in fact the first algorithm with provable guarantees for this problem and moreover we believe that in practice our algorithm will run almost as fast as the usual ICA algorithms, which are its close relatives.

There is an algorithm that recovers the unknown AA and Σ\Sigma up to additive error ϵ\epsilon in each entry in time that is polynomial in n,A2,Σ2,\nicefrac1ϵ,\nicefrac1λmin(A)n,\|A\|_{2},\|\Sigma\|_{2},\nicefrac{{1}}{{\epsilon}},\nicefrac{{1}}{{\lambda_{\min}(A)}} where 2\|\cdot\|_{2} denotes the operator norm and λmin()\lambda_{\min}(\cdot) denotes the smallest eigenvalue.

The classical approach for ICA initiated in Comon and Frieze, Jerrum and Kannan ) is for the noiseless case in which y=Axy=Ax. The first step is whitening, which applies a suitable linear transformation that makes the variance the same in all directions, thus reducing to the case where AA is a rotation matrix. Given samples y=Rxy=Rx where RR is a rotation matrix, the rows of RR can be found in principle by computing the vectors uu that are local minima of E[(uy)4]E[(u\cdot y)^{4}]. Subsequently, a number of works (see e.g. ) have focused on giving algorithms that are robust to noise. A popular approach is to use the fourth order cumulant (as an alternative to the fourth order moment) as a method for “denoising,” or any one of a number of other functionals whose local optima reveal interesting directions. However, theoretical guarantees of these algorithms are not well understood.

The above procedures in the noise-free model can almost be made rigorous (i.e., provably polynomial running time and number of samples), except for one subtlety: it is unclear how to use local search to find all optima in polynomial time. In practice, one finds a single local optimum, projects to the subspace orthogonal to it and continues recursively on a lower-dimensional problem. However, a naive implementation of this idea is unstable since approximation errors can accumulate badly, and to the best of our knowledge no rigorous analysis has been given prior to our work. (This is not a technicality: in some similar settings the errors are known to blow up exponentially .) One of our contributions is a modified local search that avoids this potential instability and finds all local optima in this setting. (Section 4.2.)

Our major new contribution however is dealing with noise that is an unknown Gaussian. This is an important generalization, since many methods used in ICA are quite unstable to noise (and a wrong estimate for the covariance could lead to bad results). Here, we no longer need to assume we know even rough estimates for the covariance. Moreover, in the context of Gaussian Mixture Models this generalization corresponds to learning a mixture of many Gaussians where the covariance of the components is not known in advance.

We design new tools for denoising and especially whitening in this setting. Denoising uses the fourth order cumulant instead of the fourth moment used in and whitening involves a novel use of the Hessian of the cumulant. Even then, we cannot reduce to the simple case y=Rxy=Rx as above, and are left with a more complicated functional form (see “quasi-whitening” in Section 2.) Nevertheless, we can reduce to an optimization problem that can be solved via local search, and which remains amenable to a rigorous analysis. The results of the local optimization step can be then used to simplify the complicated functional form and recover AA as well as the noise Σ\Sigma. We defer many of our proofs to the supplementary material section, due to space constraints.

In order to avoid cluttered notation, we have focused on the case in which xx is chosen uniformly at random from {1,+1}n\{-1,+1\}^{n}, although our algorithm and analysis work under the more general conditions that the coordinates of xx are (i) independent and (ii) have a fourth moment that is less than three (the fourth moment of a Gaussian random variable). In this case, the functional P(u)P(u) (see Lemma 2.2) will take the same form but with weights depending on the exact value of the fourth moment for each coordinate. Since we already carry through an unknown diagonal matrix DD throughout our analysis, this generalization only changes the entries on the diagonal and the same algorithm and proof apply.

Denoising and quasi-whitening

As mentioned, our approach is based on the fourth order cumulant. The cumulants of a random variable are the coefficients of the Taylor expansion of the logarithm of the characteristic function . Let κr(X)\kappa_{r}(X) be the rthr^{th} cumulant of a random variable XX. We make use of:

(i) If XX has mean zero, then κ4(X)=E\/[X4]3E\/[X2]2\kappa_{4}(X)=\mathop{\bf E\/}[X^{4}]-3\mathop{\bf E\/}[X^{2}]^{2}. (ii) If XX is Gaussian with mean μ\mu and variance σ2\sigma^{2}, then κ1(X)=μ\kappa_{1}(X)=\mu, κ2(X)=σ2\kappa_{2}(X)=\sigma^{2} and κr(X)=0\kappa_{r}(X)=0 for all r>2r>2. (iii) If XX and YY are independent, then κr(X+Y)=κr(X)+κr(Y)\kappa_{r}(X+Y)=\kappa_{r}(X)+\kappa_{r}(Y).

Proof: The crucial observation is that uTy=uTAx+uTηu^{T}y=u^{T}Ax+u^{T}\eta is the sum of two independent random variables, AxAx and η\eta and that P(u)=κ4(uTAx+uTη)=κ4(uTAx)κ4(uTη)=κ4(uTAx)P(u)=-\kappa_{4}(u^{T}Ax+u^{T}\eta)=-\kappa_{4}(u^{T}Ax)-\kappa_{4}(u^{T}\eta)=-\kappa_{4}(u^{T}Ax). So in fact, the functional P(u)P(u) is invariant under additive Gaussian noise independent of the variance matrix Σ\Sigma. This vastly simplifies our computation:

Furthermore E\/[(uTAx)2]2=(uTAATu)2\mathop{\bf E\/}[(u^{T}Ax)^{2}]^{2}=(u^{T}AA^{T}u)^{2} and we conclude that

In prior works on ICA, whitening refers to reducing to the case where y=Rxy=Rx for some some rotation matrix RR. Here we give a technique to reduce to the case where y=RDx+ηy=RDx+\eta^{\prime} where η\eta^{\prime} is some other Gaussian noise (still unknown), RR is a rotation matrix and DD is a diagonal matrix that depends upon AA. We call this quasi-whitening. Quasi-whitening suffices for us since local search using the objective function κ4(uTy)\kappa_{4}(u^{T}y) will give us (approximations to) the rows of RDRD, from which we will be able to recover AA.

Quasi-whitening involves computing the Hessian of P(u)P(u), which recall is the matrix of all 2nd order partial derivatives of P(u)P(u). Throughout this section, we will denote the Hessian operator by H\mathcal{H}. In matrix form, the Hessian of P(u)P(u) is

Let DA(u)D_{A}(u) be a diagonal matrix in which the kthk^{th} entry is 24(Aku)224(A_{k}\cdot u)^{2}.

Of course, the exact Hessian of P(u)P(u) is unavailable and we will instead compute an empirical approximation P^(u)\widehat{P}(u) to P(u)P(u) (given many samples from the distribution), and we will show that the Hessian of P^(u)\widehat{P}(u) is a good approximation to the Hessian of P(u)P(u).

Given 2N2N samples y1,y1,y2,y2...,yN,yNy_{1},y^{\prime}_{1},y_{2},y^{\prime}_{2}...,y_{N},y^{\prime}_{N} of the random variable yy, let

Our first step is to show that the expectation of the Hessian of P^(u)\widehat{P}(u) is exactly the Hessian of P(u)P(u). In fact, since the expectation of P^(u)\widehat{P}(u) is exactly P(u)P(u) (and since P^(u)\widehat{P}(u) is an analytic function of the samples and of the vector uu), we can interchange the Hessian operator and the expectation operator. Roughly, one can imagine the expectation operator as an integral over the possible values of the random samples, and as is well-known in analysis, one can differentiate under the integral provided that all functions are suitably smooth over the domain of integration.

E\/y,y[(uTy)4+3(uTy)2(uTy)2]=P(u)\mathop{\bf E\/}_{y,y^{\prime}}[-(u^{T}y)^{4}+3(u^{T}y)^{2}(u^{T}y^{\prime})^{2}]=P(u)

This claim follows immediately from the definition of P(u)P(u), and since yy and yy^{\prime} are independent.

H(P(u))=E\/y,y[H((uTy)4+3(uTy)2(uTy)2)]\mathcal{H}(P(u))=\mathop{\bf E\/}_{y,y^{\prime}}[\mathcal{H}(-(u^{T}y)^{4}+3(u^{T}y)^{2}(u^{T}y^{\prime})^{2})]

Next, we compute the two terms inside the expectation:

H((uTy)4)=12(uTy)2yyT\mathcal{H}((u^{T}y)^{4})=12(u^{T}y)^{2}yy^{T}

H((uTy)2(uTy)2)=2(uTy)2yyT+2(uTy)2y(y)T+4(uTy)(uTy)(y(y)T+(y)yT)\mathcal{H}((u^{T}y)^{2}(u^{T}y^{\prime})^{2})=2(u^{T}y^{\prime})^{2}yy^{T}+2(u^{T}y)^{2}y^{\prime}(y^{\prime})^{T}+4(u^{T}y)(u^{T}y^{\prime})(y(y^{\prime})^{T}+(y^{\prime})y^{T})

Let λmin(A)\lambda_{min}(A) denote the smallest eigenvalue of AA. Our analysis also requires bounds on the entries of DA(u0)D_{A}(u_{0}):

If u0u_{0} is chosen uniformly at random then with high probability for all ii,

Proof: We can bound maxi=1nAiu\max_{i=1}^{n}|A_{i}\cdot u| by maxi=1nAi2lognn\max_{i=1}^{n}\|A_{i}\|_{2}\frac{\log n}{\sqrt{n}} thus the bound for maxi=1n(DA(u0))i,i\max_{i=1}^{n}(D_{A}(u_{0}))_{i,i} follows. Note that with high probability the minimum absolute value of nn Gaussian random variables is at least 1/n21/n^{2}, hence mini=1n(DA(u0))i,imini=1nAi22n4\min_{i=1}^{n}(D_{A}(u_{0}))_{i,i}\geq\min_{i=1}^{n}\|A_{i}\|_{2}^{2}n^{-4}. \blacksquare

Proof: First we consider each entry of the matrix updates. For example, the variance of any entry in H((uTy)4)=12(uTy)2yyT\mathcal{H}((u^{T}y)^{4})=12(u^{T}y)^{2}yy^{T} can be bounded by y28\|y\|_{2}^{8}, which we can bound by E\/[y28]O(E\/[Ax28+η28])\mathop{\bf E\/}[\|y\|_{2}^{8}]\leq O(\mathop{\bf E\/}[\|Ax\|_{2}^{8}+\|\eta\|_{2}^{8}]). This can be bounded by O(n4(A28+Σ24))O(n^{4}(\|A\|_{2}^{8}+\|\Sigma\|_{2}^{4})). This is also an upper bound for the variance (of any entry) of any of the other matrix updates when computing H(P^(u0))\mathcal{H}(\widehat{P}(u_{0})).

Suppose that (1ϵ)ADA(u0)ATM^(1+ϵ)ADA(u0)AT(1-\epsilon)AD_{A}(u_{0})A^{T}\preceq\widehat{M}\preceq(1+\epsilon)AD_{A}(u_{0})A^{T}, and let M^=BBT\widehat{M}=BB^{T}. Then there is a rotation matrix RR^{*} such that B1ADA(u0)1/2RFnϵ\|B^{-1}AD_{A}(u_{0})^{1/2}-R^{*}\|_{F}\leq\sqrt{n}\epsilon.

The intuition is: if any of the singular values of B1ADA(u0)1/2B^{-1}AD_{A}(u_{0})^{1/2} are outside the range [1ϵ,1+ϵ][1-\epsilon,1+\epsilon], we can find a unit vector xx where the quadratic forms xTADA(u0)ATxx^{T}AD_{A}(u_{0})A^{T}x and xTM^xx^{T}\widehat{M}x are too far apart (which contradicts the condition of the lemma). Hence the singular values of B1ADA(u0)1/2B^{-1}AD_{A}(u_{0})^{1/2} can all be set to one without changing the Froebenius norm of B1ADA(u0)1/2B^{-1}AD_{A}(u_{0})^{1/2} too much, and this yields a rotation matrix.

Proof: Let M=ADA(u0)ATM=AD_{A}(u_{0})A^{T} and let C=ADA(u0)1/2C=AD_{A}(u_{0})^{1/2}, and so M=CCTM=CC^{T} and M^=BBT\widehat{M}=BB^{T}. The condition (1ϵ)MM^(1+ϵ)M(1-\epsilon)M\preceq\widehat{M}\preceq(1+\epsilon)M is well-known to be equivalent to the condition that for all vectors xx, (1ϵ)xTMxxTM^x(1+ϵ)xTMx(1-\epsilon)x^{T}Mx\leq x^{T}\widehat{M}x\leq(1+\epsilon)x^{T}Mx.

Suppose for the sake of contradiction that S=B1CS=B^{-1}C has a singular value outside the range [1ϵ,1+ϵ][1-\epsilon,1+\epsilon]. Assume (without loss of generality) that SS has a singular value strictly larger than 1+ϵ1+\epsilon (and the complementary case can be handled analogously). Hence there is a unit vector yy such that yTSSTy>1+ϵy^{T}SS^{T}y>1+\epsilon. But since BSSTBT=CCTBSS^{T}B^{T}=CC^{T}, if we set xT=yTB1x^{T}=y^{T}B^{-1} then we have xTM^x=xTBBTx=yTy=1x^{T}\widehat{M}x=x^{T}BB^{T}x=y^{T}y=1 but xTMx=xTCCTx=xTBSSTBTx=yTSSTy>1+ϵx^{T}Mx=x^{T}CC^{T}x=x^{T}BSS^{T}B^{T}x=y^{T}SS^{T}y>1+\epsilon. This is a contradiction and so we conclude that all of the singular values of B1CB^{-1}C are in the range [1ϵ,1+ϵ][1-\epsilon,1+\epsilon].

Let UΣVTU\Sigma V^{T} be the singular value decomposition of B1CB^{-1}C. If we set all of the diagonal entries in Σ\Sigma to 11 we obtain a rotation matrix R=UVTR^{*}=UV^{T}. And since the singular values of B1CB^{-1}C are all in the range [1ϵ,1+ϵ][1-\epsilon,1+\epsilon], we can bound the Froebenius norm of B1CRB^{-1}C-R^{*}: B1CRFnϵ\|B^{-1}C-R^{*}\|_{F}\leq\sqrt{n}\epsilon, as desired. \blacksquare

Our algorithm (and notation)

In this section we describe our overall algorithm. It uses as a blackbox the denoising and quasi-whitening already described above, as well as a routine for computing all local maxima of some “well-behaved” functions which is described later in Section 4.

Notation: Placing a hat over a function corresponds to an empirical approximation that we obtain from random samples. This approximation introduces error, which we will keep track of.

Step 2: Take 2N2N samples y1,y2,...,yN,y1,y2,...,yNy_{1},y_{2},...,y_{N},y_{1}^{\prime},y_{2}^{\prime},...,y_{N}^{\prime}, and let

which is an empirical estimation of P(u)P^{\prime}(u).

Step 3: Use the procedure AllOPT(P^(u),β,δ,β,δ\widehat{P}^{\prime}(u),\beta,\delta^{\prime},\beta^{\prime},\delta^{\prime}) of Section 4 to compute all nn local maxima of the function P^(u)\widehat{P}^{\prime}(u).

Step 4: Let RR be the matrix whose rows are the nn local optima recovered in the previous step. Use procedure Recover of Section 5 to find AA and Σ\Sigma.

Explanation: Step 1 uses the transformation B1B^{-1} computed in the previous Section to quasi-whiten the data. Namely, we consider the sequence of samples z=B1yz=B^{-1}y, which are therefore of the form RDx+ηR^{\prime}Dx+\eta^{\prime} where η=B1η\eta=B^{-1}\eta, D=DA(u0)D=D_{A}(u_{0}) and RR^{\prime} is close to a rotation matrix RR^{*} (by Lemma 2.11). In Step 2 we look at κ4((uTz))\kappa_{4}((u^{T}z)), which effectively denoises the new samples (see Lemma 2.2), and thus is the same as κ4(RD1/2x)\kappa_{4}(R^{\prime}D^{-1/2}x). Let P(u)=κ4(uTz)=κ4(uTB1y)P^{\prime}(u)=\kappa_{4}(u^{T}z)=\kappa_{4}(u^{T}B^{-1}y) which is easily seen to be E[(uTRD1/2x)4]E[(u^{T}R^{\prime}D^{-1/2}x)^{4}]. Step 2 estimates this function, obtaining P^(u)\widehat{P}^{\prime}(u). Then Step 3 tries to find local optima via local search. Ideally we would have liked access to the functional P(u)=(uTRx)4P^{*}(u)=(u^{T}R^{*}x)^{4} since the procedure for local optima works only for true rotations. But since RR^{\prime} and RR^{*} are close we can make it work approximately with P^(u)\widehat{P}^{\prime}(u), and then in Step 4 use these local optima to finally recover AA.

Proof: In Step 1, by Lemma 2.11 we know once we use z=B1yz=B^{-1}y, the whitened function P(u)P^{\prime}(u) is inverse polynomially close to P(u)P^{*}(u). Then by Lemma 5.3, the function P^(u)\widehat{P^{\prime}}(u) we get in Step 2 is inverse polynomially close to P(u)P^{\prime}(u) and P(u)P^{*}(u). Theorem 4.6 and Lemma 5.5 show that given P^(u)\widehat{P^{\prime}}(u) inverse polynomially close to P(u)P^{*}(u), Algorithm 2: : AllOPT finds all local maxima with inverse polynomial precision. Finally by Theorem 5.6 we know AA and WW are recovered correctly up to additive ϵ\epsilon error in Frobenius norm. The running time and sampling complexity of the algorithm is polynomial because all parameters in these Lemmas are polynomially related. \blacksquare

Note that here we recover AA up to a permutation of the columns and sign-flips. In general, this is all we can hope for since the distribution of xx is also invariant under these same operations. Also, the dependence of our algorithm on the various norms (of AA and Σ\Sigma) seems inherent since our goal is to recover an additive approximation, and as we scale up AA and/or Σ\Sigma, this goal becomes a stronger relative guarantee on the error.

Framework for iteratively finding all local maxima

Given that we can only ever hope for an additive approximation to a local maximum, one should be concerned about how the error accumulates when our goal is to find all local maxima. In fact, a naive strategy is to project onto the subspace orthogonal to the directions found so far, and continue in this subspace. However, such an approach seems to accumulate errors badly (the additive error of the last local maxima found is exponentially larger than the error of the first). Rather, the crux of our analysis is a novel method for bounding how much the error can accumulate (by refining old estimates).

Our strategy is to first find a local maximum in the orthogonal subspace, then run the local optimization algorithm again (in the original nn-dimensional space) to “refine” the local maximum we have found. The intuition is that since we are already close to a particular local maxima, the local search algorithm cannot jump to some other local maxima (since this would entail going through a valley).

Throughout this section, we will assume that we are given oracle access to a function f(u)f(u) and its gradient and Hessian. The procedure is also given a starting point usu_{s}, a search range β\beta, and a step size δ\delta. For simplicity in notation we define the following projection operator.

\mboxProju(v)=v(uTv)u\mbox{Proj}_{\perp u}(v)=v-(u^{T}v)u, \mboxProju(M)=M(uTMu)uuT\mbox{Proj}_{\perp u}(M)=M-(u^{T}Mu)uu^{T}.

The basic step the algorithm is a modification of Newton’s method to find a local improvement that makes progress so long as the current point uu is far from a local maxima. Notice that if we add a small vector to uu, we do not necessarily preserve the norm of uu. In order to have control over how the norm of uu changes, during local optimization step the algorithm projects the gradient f\nabla f and Hessian H(f)\mathcal{H}(f) to the space perpendicular to uu. There is also an additional correction term /uf(u)ξ2/2-\partial/\partial_{u}f(u)\cdot\|\xi\|^{2}/2. This correction term is necessary because the new vector we obtain is (u+ξ)/(u+ξ)2(u+\xi)/\left\lVert(u+\xi)\right\rVert_{2} which is close to uξ22/2u+ξ+O(β3)u-\|\xi\|_{2}^{2}/2\cdot u+\xi+O(\beta^{3}). Step 2 of the algorithm is just maximizing a quadratic function and can be solved exactly using Lagrangian Multiplier method. To increase efficiency it is also acceptable to perform an approximate maximization step by taking ξ\xi to be either aligned with the gradient \mboxProjuf(u)\mbox{Proj}_{\perp u}\nabla f(u) or the largest eigenvector of \mboxProju(H(f(u)))\mbox{Proj}_{\perp u}(\mathcal{H}(f(u))).

The algorithm is guaranteed to succeed in polynomial time when the function is Locally Improvable and Locally Approximable:

A function f(u)f(u) is locally approximable, if its third order derivatives exist and for any uu and any direction vv, the third order derivative of ff at point uu in the direction of vv is bounded by 0.01δ/β30.01\delta/\beta^{3}.

The analysis of the running time of the procedure comes from local Taylor expansion. When a function is Locally Approximable it is well approximated by the gradient and Hessian within a β\beta neighborhood. The following theorem from showed that the two properties above are enough to guarantee the success of a local search algorithm even when the function is only approximated.

If f(u)f(u)δ/8|f(u)-f^{*}(u)|\leq\delta/8, the function f(u)f^{*}(u) is (γ,β,δ)(\gamma,\beta,\delta)-Locally Improvable, f(u)f(u) is (β,δ)(\beta,\delta) Locally Approximable, then Algorithm 1 will find a vector vv that is γ\gamma close to some local maximum. The running time is at most O((n2+T)maxf/δ)O((n^{2}+T)\max f^{*}/\delta) where TT is the time to evaluate the function ff and its gradient and Hessian.

2 Finding all local maxima

Now we consider how to find all local maxima of a given function f(u)f^{*}(u). The crucial condition that we need is that all local maxima are orthogonal (which is indeed true in our problem, and is morally true when using local search more generally in ICA). Note that this condition implies that there are at most nn local maxima.Technically, there are 2n2n local maxima since for each direction uu that is a local maxima, so too is u-u but this is an unimportant detail for our purposes. In fact we will assume that there are exactly nn local maxima. If we are given an exact oracle for ff^{*} and can compute exact local maxima then we can find all local maxima easily: find one local maximum, project the function into the orthogonal subspace, and continue to find more local maxima.

The following theorem gives sufficient conditions under which the above algorithm finds all local maxima, making precise the intuition given at the beginning of this section.

Orthogonal Local Maxima: The function has nn local maxima vi,v^{*}_{i}, and they are orthogonal to each other.

Locally Improvable: ff^{*} is (γ,β,δ)(\gamma,\beta,\delta) Locally Improvable.

Improvable Projection: The projection of the function to any subspace spanned by a subset of local maxima is (γ,β,δ)(\gamma^{\prime},\beta^{\prime},\delta^{\prime}) Locally Improvable. The step size δ10δ\delta^{\prime}\geq 10\delta.

Lipschitz: If two points uu23nγ\left\lVert u-u^{\prime}\right\rVert_{2}\leq 3\sqrt{n}\gamma, then the function value f(u)f(u)δ/20|f^{*}(u)-f^{*}(u^{\prime})|\leq\delta^{\prime}/20.

Attraction Radius: Let Rad3nγ+γRad\geq 3\sqrt{n}\gamma+\gamma^{\prime}, for any local maximum viv^{*}_{i}, let TT be minf(u)\min f^{*}(u) for uvi2Rad\left\lVert u-v^{*}_{i}\right\rVert_{2}\leq Rad, then there exist a set UU containing uvi23nγ+γ\left\lVert u-v^{*}_{i}\right\rVert_{2}\leq 3\sqrt{n}\gamma+\gamma^{\prime} and does not contain any other local maxima, such that for every uu that is not in UU but is β\beta close to UU, f(u)<Tf^{*}(u)<T.

If we are given function ff such that f(u)f(u)δ/8|f(u)-f^{*}(u)|\leq\delta/8 and ff is both (β,δ)(\beta,\delta) and (β,δ)(\beta^{\prime},\delta^{\prime}) Locally Approximable, then Algorithm 2 can find all local maxima of ff^{*} within distance γ\gamma.

To prove this theorem, we first notice the projection of the function ff in Step 3 of the algorithm should be close to the projection of ff^{*} to the remaining local maxima. This is implied by Lipschitz condition and is formally shown in the following two lemmas. First we prove a “coupling” between the orthogonal complement of two close subspaces:

Proof: Let S1S_{1} be span{v1,v2,...,vk}span\{v_{1},v_{2},...,v_{k}\}, S2S_{2} be span{v1,v2,...,vk}span\{v^{*}_{1},v^{*}_{2},...,v^{*}_{k}\} and S1S_{1}^{\perp}, S2S_{2}^{\perp} be their orthogonal subspaces respectively. We first prove that for any unit vector vS1v\in S_{1}^{\perp}, there is another unit vector vS2v^{\prime}\in S_{2}^{\perp} so that vTv14nγ2v^{T}v^{\prime}\geq 1-4n\gamma^{2}. In fact, we can take vv^{\prime} to be the unit vector along the projection of vv in S2S_{2}^{\perp}. To bound the length of the projection, we instead bound the length of projection to S2S_{2}. Since we know viTv=0v_{i}^{T}v^{\prime}=0 for iki\leq k and viviγ\left\lVert v_{i}-v^{*}_{i}\right\rVert\leq\gamma, it must be that (vi)Tv2γ(v^{*}_{i})^{T}v^{\prime}\leq 2\gamma when γ<0.01\gamma<0.01. So the projection of vv^{\prime} in S2S_{2} has length at most 2nγ2\sqrt{n}\gamma and hence the projection of vv^{\prime} in S2S_{2}^{\perp} has length at least 14nγ21-4n\gamma^{2}.

The second equality cannot hold if these vectors are not orthogonal. And for any ww,

So we conclude that the distance between these two vectors is at most 3nγ3\sqrt{n}\gamma. \blacksquare

Using this lemma we see that the projected function is close to the projection of ff^{*} in the span of the rest of local maxima:

Let gg^{*} be the projection of ff^{*} into the space spanned by the rest of local maxima, then g(w)g(w)δ/8+δ/20δ/8|g^{*}(w)-g(w)|\leq\delta/8+\delta^{\prime}/20\leq\delta^{\prime}/8.

Proof: The proof is straight forward because g(w)g(w)f(u)f(u)+f(u)f(u)|g^{*}(w)-g(w)|\leq|f^{*}(u)-f(u)|+|f^{*}(u)-f^{*}(u^{\prime})| for some uu23nγ\left\lVert u-u^{\prime}\right\rVert_{2}\leq 3\sqrt{n}\gamma, we know the first one is at most δ/8\delta/8 and the second one is at most δ/20\delta^{\prime}/20 by Lipschitz Condition. \blacksquare

Now we are ready to prove the main theorem.

Proof: [Theorem 4.6] By Theorem 4.4 the first column is indeed γ\gamma close to a local maximum. We then prove by induction that if v1v_{1}, v2v_{2}, …, vkv_{k} are γ\gamma close to different local maxima, then vk+1v_{k+1} must be close to a new local maximum.

By Lemma 4.8 we know gk+1g_{k+1} is (γ,β,δ)(\gamma^{\prime},\beta^{\prime},\delta^{\prime}) Locally Improvable, and because it is a projection of ff its derivatives are also bounded so it is (β,δ)(\beta^{\prime},\delta^{\prime}) Locally Approximable. By Theorem 4.4 uu^{\prime} must be γ\gamma^{\prime} close to local maximum for the projected function. Then since the projected space is close to the space spanned by the rest of local maxima, uu^{\prime} is in fact γ+3nγ\gamma^{\prime}+3\sqrt{n}\gamma close to vk+1v^{*}_{k+1} (here again we are reindexing the local maxima wlog.).

Now we use the Attraction Radius property, since uu is currently in UU, f(u)Tf^{*}(u)\geq T, and each step we go to a point uu^{\prime} such that uuβ\left\lVert u^{\prime}-u\right\rVert\leq\beta and f(u)>f(u)Tf^{*}(u^{\prime})>f^{*}(u)\geq T. The local search in Algorithm 1 can never go outside UU, therefore it must find the local maximum vk+1v^{*}_{k+1}. \blacksquare

Local search on the fourth order cumulant

Next, we prove that the fourth order cumulant P(u)P^{*}(u) satisfies the properties above. Then the algorithm given in the previous section will find all of the local maxima, which is the missing step in our main goal: learning a noisy linear transformation Ax+ηAx+\eta with unknown Gaussian noise. We first use a theorem from to show that properties for finding one local maxima is satisfied.

Also, for notational convenience we set di=2DA(u0)i,i2d_{i}=2D_{A}(u_{0})_{i,i}^{-2} and let dmind_{min} and dmaxd_{max} denote the minimum and maximum values (bounds on these and their ratio follow from Claim 2.9). Using this notation P(u)=i=1ndi(uTRi)4P^{*}(u)=\sum_{i=1}^{n}d_{i}(u^{T}R^{*}_{i})^{4}.

When β<dmin/10dmaxn2\beta<d_{min}/10d_{max}n^{2}, the function P(u)P^{*}(u) is (3nβ,β,P(u)β2/100)(3\sqrt{n}\beta,\beta,P^{*}(u)\beta^{2}/100) Locally Improvable and (β,dminβ2/100n)(\beta,d_{min}\beta^{2}/100n) Locally Approximable. Moreover, the local maxima of the function is exactly {±Ri}\{\pm R^{*}_{i}\}.

Proof: The proof appears in . Here for completeness we show the proof using our notations.

First we establish that P(u)P^{*}(u) is Locally Improvable.Observe that this desirada is invariant under rotation, so we need only prove the theorem for P(v)=i=1ndivi4P^{*}(v)=\sum_{i=1}^{n}d_{i}v_{i}^{4}. The gradient of the function is P(v)=4(d1v13,d2v23,...,dnvn3)\nabla P^{*}(v)=4(d_{1}v_{1}^{3},d_{2}v_{2}^{3},...,d_{n}v_{n}^{3}). The inner product of P(v)\nabla P^{*}(v) and vv is exactly 4i=1ndivi4=4P(v)4\sum_{i=1}^{n}d_{i}v_{i}^{4}=4P^{*}(v). Therefore the projected gradient ϕ=\mboxProjvP(v)\phi=\mbox{Proj}_{\perp v}\nabla P^{*}(v) has coordinate ϕi=4vi(divi2P(v))\phi_{i}=4v_{i}(d_{i}v_{i}^{2}-P^{*}(v)). Furthermore, the Hessian H=H(P(v))H=\mathcal{H}(P^{*}(v)) is a diagonal matrix whose (i,i)th(i,i)^{th} entry is 12divi212d_{i}v_{i}^{2}.

Consider the case in which ϕP(v)β/4\left\lVert\phi\right\rVert\geq P^{*}(v)\beta/4. We can obtain an improvement to P(v)β2/100P^{*}(v)\beta^{2}/100 because we can take ξ\xi in the direction of ϕ\phi and with ξ2=β/20\left\lVert\xi\right\rVert_{2}=\beta/20. The contribution of the Hessian term is nonnegative and the third term 2P(u)ξ22-2P^{*}(u)\left\lVert\xi\right\rVert_{2}^{2} is small in comparison.

Hence, we can assume ϕP(v)β/4\left\lVert\phi\right\rVert\leq P^{*}(v)\beta/4. Now let us write out the expression of ϕ2\left\lVert\phi\right\rVert^{2}

In particular every term vi2(divi2P(v))2v_{i}^{2}(d_{i}v_{i}^{2}-P^{*}(v))^{2} must be at most β2(P(v))2/16.\beta^{2}(P^{*}(v))^{2}/16.. Thus for any ii, either vi2β2v_{i}^{2}\leq\beta^{2} or (divi2P(v))2(P(v))2/16(d_{i}v_{i}^{2}-P^{*}(v))^{2}\leq(P^{*}(v))^{2}/16.

If there are at least 2 coordinates kk and ll such that (divi2P(v))2(P(v))2/16(d_{i}v_{i}^{2}-P^{*}(v))^{2}\leq(P^{*}(v))^{2}/16, then we know for these two coordinates vi2[0.75P(v)/di,1.25P(v)/di]v_{i}^{2}\in[0.75P^{*}(v)/d_{i},1.25P^{*}(v)/d_{i}]. We choose the vector ξ\xi so that ξk=τvl\xi_{k}=\tau v_{l} and ξl=τvk\xi_{l}=-\tau v_{k}. Wlog assume ξϕ0\xi\cdot\phi\geq 0 otherwise we use ξ-\xi. Take τ\tau so that τ2(vl2+vk2)=β2\tau^{2}(v_{l}^{2}+v_{k}^{2})=\beta^{2}. Clearly ξ=β\left\lVert\xi\right\rVert=\beta and ξv=0\xi\cdot v=0 so ξ\xi is a valid solution. Also τ2\tau^{2} is lower bounded by β2/(vl2+vk2)45β2P(u)(1/dl+1/dk)\beta^{2}/(v_{l}^{2}+v_{k}^{2})\geq\frac{4}{5}\frac{\beta^{2}}{P^{*}(u)(1/d_{l}+1/d_{k})}.

In the remaining case, all of the coordinates except for at most one satisfy vi2β2v_{i}^{2}\leq\beta^{2}. Since we assumed β2<1n\beta^{2}<\frac{1}{n}, there must be one of the coordinate vkv_{k} that is large, and it is at least 1nβ21-n\beta^{2}. Thus the distance of this vector to the local maxima eke_{k} is at most 3nβ3\sqrt{n}\beta. \blacksquare

We then observe that given enough samples, the empirical mean P^(u)\widehat{P}^{\prime}(u) is close to P(u)P^{*}(u). For concentration we require every degree four term zizjzkzlz_{i}z_{j}z_{k}z_{l} has variance at most ZZ.

Z=O(dmin2λmin(A)8Σ24+dmin2)Z=O(d_{min}^{2}\lambda_{min}(A)^{8}\|\Sigma\|_{2}^{4}+d_{min}^{2}).

Proof: We will start by bounding E\/[(zizjzkzl)2]E\/[(zi8+zj8+zk8+zl8)]\mathop{\bf E\/}[(z_{i}z_{j}z_{k}z_{l})^{2}]\leq\mathop{\bf E\/}[(z_{i}^{8}+z_{j}^{8}+z_{k}^{8}+z_{l}^{8})]. Furthermore E\/[zi8]O(E\/[(B1Ax)i8+(B1η)i8])\mathop{\bf E\/}[z_{i}^{8}]\leq O(\mathop{\bf E\/}[(B^{-1}Ax)_{i}^{8}+(B^{-1}\eta)_{i}^{8}]). Next we bound E\/[(B1η)i8]\mathop{\bf E\/}[(B^{-1}\eta)_{i}^{8}], which is just the eighth moment of a Gaussian with variance at most B1ΣBT2B122Σ2dmin1/2λmin(A)2Σ2\|B^{-1}\Sigma B^{-T}\|_{2}\leq\|B^{-1}\|_{2}^{2}\|\Sigma\|_{2}\leq d_{min}^{1/2}\lambda_{min}(A)^{-2}\|\Sigma\|_{2}. Hence we can bound this term byO(B1ΣBT24)=O(dmin2λmin(A)8Σ24)O(\|B^{-1}\Sigma B^{-T}\|_{2}^{4})=O(d_{min}^{2}\lambda_{min}(A)^{8}\|\Sigma\|_{2}^{4}). Finally the remaining term E\/[(B1Ax)i8]\mathop{\bf E\/}[(B^{-1}Ax)_{i}^{8}] can be bounded by O(dmin2)O(d_{min}^{2}) because the variance of this random variable is only larger if we instead replace xx by an nn-dimensional standard Gaussian. \blacksquare

Given 2N2N samples y1,y2,...,yN,y1,y2,...,yNy_{1},y_{2},...,y_{N},y_{1}^{\prime},y_{2}^{\prime},...,y_{N}^{\prime}, suppose columns of R=B1ADA(u0)1/2R^{\prime}=B^{-1}AD_{A}(u_{0})^{1/2} are ϵ\epsilon close to the corresponding columns of RR^{*}, with high probability the function P^(u)\widehat{P}^{\prime}(u) is O(dmaxn1/2ϵ+n2(N/Zlogn)1/2)O(d_{max}n^{1/2}\epsilon+n^{2}(N/Z\log n)^{-1/2}) close to the true function P(u)P^{*}(u).

Proof: P^(u)\widehat{P}^{\prime}(u) is the empirical mean of F(u,y,y)=(uTB1y)4+3(uTB1y)2(uTB1y)2F(u,y,y^{\prime})=-(u^{T}B^{-1}y)^{4}+3(u^{T}B^{-1}y)^{2}(u^{T}B^{-1}y^{\prime})^{2}. In Section 2 we proved that P(u)=E\/y,yF(u,y,y)=i=1n2Di,i1/2(uTRi)4=i=1nλi(uTRi)4P^{\prime}(u)=\mathop{\bf E\/}_{y,y^{\prime}}F(u,y,y^{\prime})=\sum_{i=1}^{n}2D^{-1/2}_{i,i}(u^{T}R_{i})^{4}=\sum_{i=1}^{n}\lambda_{i}(u^{T}R_{i})^{4}. First, we demonstrate that P(u)P^{\prime}(u) is close to P(u)P^{*}(u), and then using concentration bounds we show that P^(u)\widehat{P}^{\prime}(u) is close to P(u)P^{\prime}(u) (with high probability) over all uu.

The first part is a simple application of Cauchy-Schwartz:

The first inequality uses the fact that ((uTRi)2+(uTRi)2)3((u^{T}R^{\prime}_{i})^{2}+(u^{T}R^{*}_{i})^{2})\leq 3, the second inequality uses the fact that when ϵ\epsilon is small enough, uTR22\left\lVert u^{T}R^{\prime}\right\rVert_{2}\leq 2.

Next we prove that the empirical mean P^(u)\widehat{P}^{\prime}(u) is close to P(u)P^{\prime}(u). The key point here is we need to prove this for all points uu since a priori we have no control over which directions local search will choose to explore. We accomplish this by considering P^(u)\widehat{P}^{\prime}(u) as a degree-4 polynomial over uu and prove that the coefficient of each monomial in P^(u)\widehat{P}^{\prime}(u) is close to the corresponding coefficient in P(u)P^{\prime}(u). This is easy: the expectation of each coefficient of F(u,y,y)F(u,y,y^{\prime}) is equal to the correct coefficient, and the variance is bounded by O(Z)O(Z). The coefficients are also sub-Gaussian so by Bernstein’s inequality the probability that any coefficient of P^(u)\widehat{P}^{\prime}(u) deviates by more than ϵ\epsilon^{\prime} (from its expectation) is at most eΩ(ϵ2N/Z)e^{-\Omega(\epsilon^{\prime 2}N/Z)}. Hence when NO(Zlogn/ϵ2)N\geq O(Z\log n/\epsilon^{\prime 2}) with high probability all the coefficients of P^(u)\widehat{P}^{\prime}(u) and P(u)P^{\prime}(u) are ϵ\epsilon^{\prime} close. So for any uu:

Therefore P^(u)\widehat{P}^{\prime}(u) and P(u)P^{*}(u) are O(dmaxn1/2ϵ+n2(N/Zlogn)1/2)O(d_{max}n^{1/2}\epsilon+n^{2}(N/Z\log n)^{-1/2}) close. \blacksquare

This proof can also be used to show that the derivatives of the function P^(u)\widehat{P}^{\prime}(u) is concentrated to the derivatives of the true function P(u)P^{*}(u) because the derivatives are only related to coefficients, therefore P^(u)\widehat{P}^{\prime}(u) is also (β,dminβ2/100n)(\beta,d_{min}\beta^{2}/100n) Locally Approximable.

The other properties required by Theorem 4.6 are also satisfied:

For any uu2r\|u-u^{\prime}\|_{2}\leq r, P(u)P(u)5dmaxn1/2r|P^{*}(u)-P^{*}(u^{\prime})|\leq 5d_{max}n^{1/2}r. All local maxima of PP^{*} has attraction radius Raddmin/100dmaxRad\geq d_{min}/100d_{max}.

Proof: The Lipschitz condition follows from the same Cauchy-Schwartz as appeared above. When two points uu and uu^{\prime} are of distance rr, P(u)P(u)5dmaxn1/2r|P^{*}(u)-P^{*}(u^{\prime})|\leq 5d_{max}n^{1/2}r. Finally for the Attraction Radius, we know when 3nγ+γdmin/100dmax3\sqrt{n}\gamma+\gamma^{\prime}\leq d_{min}/100d_{max}, we can just take the set UU to be uTRi1dmin/50dmaxu^{T}R^{*}_{i}\geq 1-d_{min}/50d_{max}. For all uu such that uTRi[1dmin/25dmax,1dmin/50dmax]u^{T}R^{*}_{i}\in[1-d_{min}/25d_{max},1-d_{min}/50d_{max}] (which contains the β\beta neighborhood of UU), we know the value of P(u)TP^{*}(u)\leq T. \blacksquare

Applying Theorem 4.6 we obtain the following Lemma (the parameters are chosen so that all properties required are satisfied):

Let β=Θ((dmin/dmax)2)\beta^{\prime}=\Theta((d_{min}/d_{max})^{2}), β=min{γn1/2,Ω((dmin/dmax)4n3.5)}\beta=\min\{\gamma n^{-1/2},\Omega((d_{min}/d_{max})^{4}n^{-3.5})\}, then the procedure Recover(f,β,f,\beta, dminβ2/100nd_{min}\beta^{2}/100n ,β,dminβ2/100n,\beta^{\prime},d_{min}\beta^{\prime 2}/100n) finds vectors v1,v2,...,vnv_{1},v_{2},...,v_{n}, so that there is a permutation matrix Π\Pi and ki{±1}k_{i}\in\{\pm 1\} and for all ii: vi(RΠ\mboxDiag(ki))i2γ\left\lVert v_{i}-(R\Pi\mbox{Diag}(k_{i}))^{*}_{i}\right\rVert_{2}\leq\gamma.

After obtaining R^=[v1,v2,...,vn]\widehat{R}=\left[v_{1},v_{2},...,v_{n}\right] we can use Algorithm 3 to find AA and Σ\Sigma:

Given a matrix R^\widehat{R} such that there is permutation matrix Π\Pi and ki{±1}k_{i}\in\{\pm 1\} with R^iki(RΠ)i2γ\|\widehat{R}_{i}-k_{i}(R^{*}\Pi)_{i}\|_{2}\leq\gamma for all ii, Algorithm 3 returns matrix A^\widehat{A} such that A^AΠ\mboxDiag(ki)FO(γA22n3/2/λmin(A))\|\widehat{A}-A\Pi\mbox{Diag}(k_{i})\|_{F}\leq O(\gamma\left\lVert A\right\rVert_{2}^{2}n^{3/2}/\lambda_{min}(A)). If γO(ϵ/A22n3/2λmin(A))×min{1/A2,1}\gamma\leq O(\epsilon/\left\lVert A\right\rVert_{2}^{2}n^{3/2}\lambda_{min}(A))\times\min\{1/\left\lVert A\right\rVert_{2},1\}, we also have Σ^ΣFϵ\|\widehat{\Sigma}-\Sigma\|_{F}\leq\epsilon.

Recall that the diagonal matrix DA(u)D_{A}(u) is unknown (since it depends on AA), but if we are given RR^{*} (or an approximation) and since P(u)=i=1ndi(uTRi)4P^{*}(u)=\sum_{i=1}^{n}d_{i}(u^{T}R^{*}_{i})^{4}, we can recover the matrix DA(u)D_{A}(u) approximately from computing P(Ri)P^{*}(R^{*}_{i}). Then given DA(u)D_{A}(u), we can recover AA and Σ\Sigma and this completes the analysis of our algorithm.

Proof: By Lemma 2.11 we know the columns of RR^{\prime} is close the the columns of RR (the parameters will be set so that the error is much smaller than γ\gamma), thus R^iki(RΠ)i2γ\|\widehat{R}_{i}-k_{i}(R^{\prime}\Pi)_{i}\|_{2}\leq\gamma. Applying Lemma 5.3 we obtain: P^(R^i)P(R^i)γ|\widehat{P}^{\prime}(\widehat{R}_{i})-P^{*}(\widehat{R}_{i})|\ll\gamma. Furthermore, when R^ikiRΠ1(i)2γ\|\widehat{R}_{i}-k_{i}R^{*}_{\Pi^{-1}(i)}\|_{2}\leq\gamma we know that P(R^i)/dΠ1(i)[13γ,1+3γ]P^{*}(\widehat{R}_{i})/d_{\Pi^{-1}(i)}\in[1-3\gamma,1+3\gamma] (here we are abusing notation and use the permutation matrix as a permutation). Hence D^A(u)i,i/(DA(u))Π1(i),Π1(i)[13γ,1+3γ]\widehat{D}_{A}(u)_{i,i}/\left(D_{A}(u)\right)_{\Pi^{-1}(i),\Pi^{-1}(i)}\in[1-3\gamma,1+3\gamma]. We have:

and their difference is at most O(γB2(DA(u))Π1(i),Π1(i)1/2)O(\gamma\left\lVert B\right\rVert_{2}\left(D_{A}(u)\right)_{\Pi^{-1}(i),\Pi^{-1}(i)}^{-1/2}). Hence we can bound the total error by O(γB2DA(u)1/2F)O(\gamma\left\lVert B\right\rVert_{2}\left\lVert D_{A}(u)^{-1/2}\right\rVert_{F}). We also know B2A2DA(u)1/22\left\lVert B\right\rVert_{2}\leq\left\lVert A\right\rVert_{2}\|D_{A}(u)^{1/2}\|_{2} because BBTADA(u)ATBB^{T}\approx AD_{A}(u)A^{T}, so this can be bounded by O(γA2DA(u)21/2DA(u)1/2F)O(\gamma\left\lVert A\right\rVert_{2}\|D_{A}(u)\|_{2}^{1/2}\|D_{A}(u)^{-1/2}\|_{F}). Applying Claim 2.9, we conclude that (with high probability) the ratio of the largest to smallest diagonal entry of DA(u)D_{A}(u) is at most n2A22/λmin(A)2n^{2}\left\lVert A\right\rVert_{2}^{2}/\lambda_{min}(A)^{2}. So we can bound the error by O(γA22n3/2/λmin(A))O(\gamma\left\lVert A\right\rVert_{2}^{2}n^{3/2}/\lambda_{min}(A)).

Consider the error for Σ\Sigma: Using concentration bounds similar but much simpler than those used in Lemma 5.3, we obtain that C^CF1/2ϵ\|\widehat{C}-C\|_{F}\leq 1/2\epsilon, so Σ^ΣFC^CFA^A^TAATFϵ/2+2A2AΠ\mboxDiag(ki)A^F+AΠ\mboxDiag(ki)A^F2ϵ\|\widehat{\Sigma}-\Sigma\|_{F}\leq\|\widehat{C}-C\|_{F}-\|\widehat{A}\widehat{A}^{T}-AA^{T}\|_{F}\leq\epsilon/2+2\left\lVert A\right\rVert_{2}\|A\Pi\mbox{Diag}(k_{i})-\widehat{A}\|_{F}+\|A\Pi\mbox{Diag}(k_{i})-\widehat{A}\|_{F}^{2}\leq\epsilon. \blacksquare

ICA is a vast field with many successful techniques. Most rely on heuristic nonlinear optimization. An exciting question is: can we give a rigorous analysis of those techniques as well, just as we did for local search on cumulants? A rigorous analysis of deep learning —say, an algorithm that provably learns the parameters of an RBM—is another problem that is wide open, and a plausible special case involves subtle variations on the problem we considered here.

References