Phase Transitions in Sparse PCA

Thibault Lesieur, Florent Krzakala, Lenka Zdeborova

I Introduction

Let us denote by $X_{0}$ the true underlying signal matrix. We treat the problem of estimating $X$ from $Y$ . We evaluate and analyze the estimator $\hat{X}$ that minimizes the mean squared error

If the distribution $P_{0}(x)$ is known then such an estimator is given by the mean of the marginals of the posterior probability distribution $P(X|Y)$ . We analyze optimal estimation in the above model in the limit where the system size $N$ is large, but rank is small $r\!=\!O(1)$ , noise variance $\Delta=O(1)$ , and fraction of non-zero elements $\rho=O(1)$ .

We aim to answer the following two questions: (Q1) What is the information theoretically minimal mean-squared error (MMSE) in this limit? (Q2) In what range of parameters can the MMSE be achieved with a computationally tractable algorithm? We answer these question by studying the approximate message passing (AMP) algorithm to estimate the marginals of the posterior likelihood, and its asymptotic state evolution (SE). Our results rely on a conjecture from statistical physics, that the present problem belongs to a class of problems for which the fixed points of the state evolution describe asymptotically exactly both the optimal estimator and the performance of AMP. Moreover current experience from other such problems suggests that when AMP does not reach the MMSE then no other polynomial algorithm will.

Principal component analysis (PCA) is a common dimensionality reduction technique that aims at describing the data as linear combination of a small number of principal components. In sparse PCA we search for principal components with many zero elements to facilitate the interpretation of the result. Only the non-zero elements then correspond to features relevant for describing the variability in the data. The sparsity constraint makes the problem algorithmically challenging. Let us note that when talking about sparse PCA we have more often in mind a model of a type $Y=UV^{T}+W$ , rather than (1). For simplicity of presentation, in this short report, we restrict to the model (1), but applying our method to the $UV^{T}$ setting is straightforward and leads to comparable results.

Abundance of applications of sparse PCA motivated both algorithmic development and theoretical studies of the problem, see e.g. . Many of these existing works are concerned with exact recovery of the support of non-zero elements. However, exact recovery of the support is possible only when the number of non-zeros is subextensive, i.e. the fraction of non-zeros $\rho=o(1)$ . In our setting, we assume $\rho=O(1)$ which is reasonable in many applications. In this paper we model a typical case of sparse PCA by (1) and analyze this model in Bayesian probabilistic setting where we assume the knowledge of the distribution $P_{0}(x)$ .

For sparse PCA of rank one $r=1$ the AMP algorithm and its state evolution have been derived by Rangan and Fletcher . Deshpande and Montanari were able to prove that for Bernoulli distributed coordinates (and always rank one) the state evolution equation indeed describes exactly the evolution of the algorithm at large sizes when the density of non-zeros $\rho$ is large enough. Remarkably, they also proved that the asymptotic MSE achieved by AMP is in this case information theoretically optimal. Additionally, in the regime where their proof is valid they did not observe any phase transition in the MMSE. The corresponding AMP algorithm for generic rank was derived in , however, without the state evolution.

The result of Deshpande and Montanari about asymptotic optimality of AMP is surprising for at least two reasons: First, for the question of support recovery, there is a well known large gap between what is information theoretically possible and what is tractable with current algorithms . Moreover this gap was linked to the problem of planted clique , where it is believed that no polynomial algorithm will be able to achieve the information theoretic performance. Second, in the regime where the rank is small but scales linearly with $N$ an analogous gap between information theoretic and tractable algorithmic performance was predicted to exist . This motivates us to revisit the analysis of the model (1) in particular in the region of small density $\rho$ and for rank larger than one. Indeed, in both these cases we identify phase transitions in the MMSE, as well as regions where AMP is suboptimal. Technique-wise the contribution of the present paper is a generalization of the state evolution to arbitrary rank $r$ , and analysis of the AMP-MSE and the MMSE for a range of distributions $P_{0}(x)$ .

II From AMP to state evolution

Here we remind the approximate message passing algorithm for general rank $r$ , and sketch the derivation of the corresponding state evolution. For rank one our results reduce to the state evolution of .

AMP is a large- $N$ simplification of the belief propagation equations (BP) for a graphical model corresponding to the posterior probability

In the derivation we distinguish between $P(x)\neq P_{0}(x)$ , but later we assume equality of the two. Variables in this graphical model are the $r$ -component vectors $x_{\mu}$ for $\mu=1,\dots,N$ . Belief propagation is written as an iterative procedure on message that are probability distributions over $x_{\mu}$ .

The AMP equations are usually initialized in such a way that $a^{t=0}$ is the mean of the prior distribution $P(x)$ and $v^{t=0}$ its variance.

Finally the AMP approach also provides an approximation for the log-likelihood $\phi=\log Z(Y)$ , where $Z(Y)$ is the normalization of (3). This is related to the Bethe free energy simplified along the very same lines as BP was simplified into AMP. Given a fixed point of the AMP algorithm we compute the Bethe log-likelihood as

where ${\cal N}(A,B)$ is the normalization from (7) and

II-B State evolution

The mean squared error is related to these parameters as

With this definition we have from (5) $A^{t}=Q^{t}/\Delta$ , and from (6) by using (1) to express $y_{\mu\nu}$ and neglecting sub-leading order terms, we derive that $B_{\mu}^{t}$ is a random Gaussian variable with mean $(x_{0})_{\mu}M^{t}/\Delta$ and variance $Q^{t}/\Delta$ . The above order parameters hence follow the state evolution equations

where $W$ is a $r$ -variate Gaussian random variable with zero mean and of covariance $Q^{t}/\Delta$ , the arguments of $f$ and $f^{\top}$ in (13) are the same. The Bethe log-likelihood can then be evaluated from the fixed point of the state evolution as

We recall that $P(x)$ from (3) appears in (13-14) via the definition of the function $f(A,B)$ in (7).

In this paper we work in the so-called Bayes-optimal setting where we assume $P_{0}(x)=P(x)$ , the state evolution then simplifies because $M^{t}=Q^{t}$ for all $t$ . This is called the Nishimori condition in statistical physics and was also derived in . The intuition behind this condition is that in Bayes-optimal inference the ground true signal $X_{0}$ behaves in exactly the same way as a random sample from the posterior distribution and hence $Q$ that describes the overlap between two randomly chosen samples is the same as $M$ that describes the overlap beween a randomly chosen sample and $X_{0}$ .

II-C Statistical physics conjecture

The belief propagation equations from which we derived the AMP and the state evolution assume that certain correlations between incoming messages are weak enough. In statistical physics this assumption is widely accepted to hold for inference in the Bayes optimal setting on models such as (1), that correspond to a fully connected factor graph with weak interactions on factor nodes. This has been used in many works, see e.g. a more detailed discussion in , and it has been proven in a subset of cases, see notably the closely related or .

Under the above assumption and in the limit of large $N$ , the MMSE can be computed from a fixed point of the state evolution equations (13-14) that has the largest log-likelihood (15). And the AMP-MSE can be computed from a stable fixed point of the state evolution that is reached iteratively from initialization $Q^{t=0}=\epsilon$ for a very small $\epsilon$ .

III Analysis of the state evolution

A first observation we make about the general state evolution (13-14) is that $M=Q=0$ is a fixed point if and only if the prior distribution $P(x)$ in (3) has a zero mean. This trivial fixed point corresponds to as large MSE (12) as if all we knew about the signal $X$ was the distribution $P_{0}(x)$ . In case $Q=M=0$ is the fixed point with maximum log-likelihood then the matrix $Y$ did not contain any sign of the low rank perturbation, the information was completely lost in the noise, and we denote the signal $X$ as undetectable.

We now study the linear stability of the fixed point $Q=0$ , $M=0$ in the case where both $P(x)$ and $P_{0}(x)$ have zero mean. We expand the state evolution equations around the trivial fixed point up to the first order in $Q$ and $M$ . Looking at the Taylor expansion of $f(Q/\Delta,Mx_{0}/\Delta+W)$ , it is only the term $W\partial_{B}f(0,0)$ that will matter for eq. (13), and $Mx_{0}\partial_{B}f(0,0)/\Delta$ that will matter for eq. (14). Realizing that from definition (7) $\partial_{B}f(0,0)$ is the covariance of the distribution $P(x)$ we get

where $\Sigma$ and $\Sigma_{0}$ are respectively the covariance matrices of $P(x)$ and $P_{0}(x)$ .

When $\Sigma=\Sigma_{0}$ the above linearization will converge away from the trivial fixed point for $\Delta<\Delta_{u}$ , with

and the linearization is a contraction for $\Delta>\Delta_{u}$ . For distributions of $X$ of zero mean, there is hence a phase transition in the behavior of AMP-MSE at $\Delta_{u}$ . Translated into the behavior of the iterative AMP algorithm, when $\Delta>\Delta_{u}$ AMP will converge to a trivial fixed point $a_{\mu}=0$ for all $\mu$ , and to a fixed point of smaller MSE for $\Delta<\Delta_{u}$ . It is quite remarkable to notice that this stability criteria (17) of the trivial fixed point is universal, in the sense that it does not depend on the details of the distributions $P_{0}(x)$ and $P(x)$ , it only requires their means to be zero and their covariances to agree.

The phase transition at $\Delta_{u}$ (and its universality) remarkably reminds us of a detectability/undetectability spectral phase transition known for the canonical PCA . The difference is that in our setting there is no such phase transition when $P(x)$ has a non-zero mean.

III-B The Gauss-Bernoulli case

A particularly interesting and in our opinion representative example of distribution $P(x)$ that we will (among others) study in this paper is the $r$ -variate Gauss-Bernoulli

Using the criteria (17) we conclude that for $\Delta>\rho^{2}$ AMP (randomly initialized) will not be able to detect that matrix $Y$ was a noisy low-rank matrix. On the other hand for $\Delta<\rho^{2}$ AMP will converge to a fixed point giving informative MSE.

where we used the Nishimori condition $q^{t}=m^{t}$ , and

$S_{r}$ being the surface of a unit sphere in $r$ dimensions. And where $\hat{\rho}(a,b)$ is an estimator of the probability that a given vector component is nonzero.

III-C The limit of large rank r𝑟r

In this section we show that for the Gauss-Bernoulli signal (18) when the rank $r$ is sufficiently large the state evolution has a fixed point at $q_{r}=\rho-\Delta+o_{r}(1)$ whenever $\Delta<\rho$ , and this fixed point has always larger log-likelihood than the trivial fixed point at $q=0$ . Moreover, we derive that the probability that a given component of the support is correctly discovered goes to one exponentially fast in $r$ when $r$ grows. This means, on the one hand, that at large $r$ the MMSE of the sparse PCA behaves exactly as if the support of the non-zero elements was known, i.e. ${\rm MSE}=\rho$ for $\Delta>\rho$ and ${\rm MSE}=\Delta$ for $\Delta<\rho$ as in . AMP, on the other hand, will reach the MMSE only for $\Delta<\rho^{2}$ and will give ${\rm MSE}=\rho$ otherwise. There is hence a wide hard region in sparse PCA between $\rho^{2}<\Delta<\rho$ when $r\rightarrow\infty$ .

To prove the above large- $r$ results we analyze the function $\hat{\rho}(a,\tau u^{2})$ . From the definition (21) we can rewrite

For the distribution (18) the log-likelihood (15) becomes

For the fixed point $q=0$ we get $\phi=0$ . To compare to the log-likelihood of the fixed point $q=\rho-\Delta$ we develop $\psi(a,\tau u^{2})$ for large $r$ and assume $q\neq 0$ . For $K(a,\tau)<0$ we have $\psi(a,\tau u^{2})\approx\log(\rho)-\frac{r}{2}K(a,\tau)$ , from which we obtain

which evaluated at $q=\rho-\Delta$ is positive (and hence larger that the value for the fixed point $q=0$ ) if and only if $\Delta<\rho$ . This tells us that at large enough $r$ and $\Delta<\rho$ the fixed point $q=\rho-\Delta$ corresponds to the MMSE.

III-D Numerical Results

In this section we give several examples of a numerical investigation of the fixed points of the AMP algorithm and of the state evolution. In all these experiments we iterate the corresponding equations till convergence and monitor the corresponding mean squared error and the value of the log-likelihood. We initialize the iterations in two different ways.

In a region where these two initializations converge to a different fixed point the MMSE is the one for which the log-likelihood evaluated from (15) is larger. In Fig. 1 we compare the MSE reached by the state evolution and the AMP algorithm on an instance of size $N=20000$ and as expected we see an excellent agreement.

Figs. 1 and 2 are for the Gauss-Bernoulli signal distribution for which the trivial fixed point exists and is locally stable for $\Delta>\rho^{2}$ . In this case of zero mean $P_{0}(x)$ there is either a single second order (continuous) phase transition at $\Delta_{u}=\rho^{2}$ or a (discontinuous) first order phase transition in the MMSE at $\Delta_{c}>\rho^{2}$ with its two spinodals, $\Delta_{\rm AMP}$ and $\Delta_{\rm 2nd}$ . By the results of the previous section in the limit of large rank we have $\Delta_{c}(r\rightarrow\infty)=\Delta_{\rm 2nd}(r\rightarrow\infty)=\rho$ . Fig. 2 depicts the result for rank $r=50$ .

In Fig. 3 we depict the result for a case of $P_{0}(x)$ with a non-zero mean. In that case either there is a unique fixed point and no phase transition, or there are two fixed points, both with MSE smaller than if the signal was chosen randomly only according to the prior information. Remarkably, the first order phase transition seems always to happen for small densities. Fig. 3 depicts the case of spiked Wigner model from , in a region of densities for which the proof of did not apply. Note that the AMP spinodal $\rho_{\rm AMP}$ and the curve of phase transition in MMSE $\rho_{c}$ arrive with a different slope to $\rho\rightarrow 0$ , this is consistent with the algorithmic gap for support recovery for sub-extensive support size .

IV Conclusions

In this paper we analyzed probabilistic estimation in sparse PCA modeled by (1). We derived the state evolution of the approximate message passing algorithm for general rank $r$ . Relying on a statistical physics conjecture about exactness of this state evolution we analyze it’s fixed points to compute the minimal mean squared error and the error made by AMP in the large $N$ limit. Generalization of the proof technique of to small densities, ranks larger than one and signal distributions having zero mean are an important topic for future work.

For signal distributions of zero mean we unveil an undetectability regime where no algorithm can do better in estimation of the signal than random guessing. We observe a first order (discontinuous) phase transition in regions of small density $\rho$ or large rank $r$ . Existence of such a first order phase transition is related to the existence of a region of parameters where AMP is asymptotically sub-optimal.

In the large rank limit the emerging picture is particularly simple (at least for signal distributions of zero mean). The asymptotic MMSE is equal to the MMSE of the problem with known support, and the probability of false negatives in the support recovery is exponentially small when $r\rightarrow\infty$ . The MMSE is not reachable for AMP unless the noise variance is smaller than the variance of the signal squared. We expect that this hard region will stay computationally hard also for other polynomial algorithm. Proving such a result about algorithmic barrier for a generic class of algorithms is a very interesting challenge.

Acknowledgment

This work has been supported in part by the ERC under the European Union’s 7th Framework Programme Grant Agreement 307087-SPARCS.