Information-theoretically Optimal Sparse PCA

Yash Deshpande, Andrea Montanari

I Introduction

In either case, our data consists of a sparse, rank-one matrix observed through Gaussian noise. We let $X$ denote the clean, underlying signal ( $xx^{\sf T}$ or $uv^{\sf T}$ for the spiked Wigner or Wishart model respectively). Our task is to estimate the signal $X$ from the data $Y_{\lambda}$ in the high dimension asymptotic where $n\to\infty,m\to\infty$ with $m/n\to\alpha\in(0,\infty)$ . This paper focuses on estimation in the sense of the mean squared error, defined for an estimator $\widehat{X}(Y_{\lambda})$ as:

In this paper, we analyze an iterative scheme called approximate message passing (AMP) to estimate the clean signal $X$ . The machinery of approximate message passing reduces the high-dimensional matrix problem in models (1), (2) to the following simpler scalar denoising problem:

where $X_{0}\sim{\sf Ber}({\varepsilon})$ and $N\sim{\sf N}(0,1)$ are independent. The scalar MMSE in estimating $X_{0}$ from $Y_{\lambda}$ is given by:

Our main results, characterize the optimal mean squared error ${\sf M\text{-}mmse}(\lambda,n)$ in the large $n$ asymptotic, when ${\varepsilon}>{\varepsilon}_{c}\approx 0.05$ , and establish that AMP achieves this fundamental limit. For the spiked Wigner model we prove the following.

There exists an ${\varepsilon}_{c}\in(0,1)$ such that for all ${\varepsilon}>{\varepsilon}_{c}$ , and every $\lambda\geq 0$ the squared error of AMP iterates $\widehat{X}^{t}$ satisfies the following:

Further, the limit on the RHS above satisfies, for every $\lambda>0$ :

where $y_{*}=y_{*}(\lambda)$ solves $y_{*}=\lambda({\varepsilon}-{\sf S\text{-}mmse}(X_{0},y_{*}))$ .

The combination of Theorem 1 and Eq. (10) effectively yield a single-letter characterization of Model (1), connecting the limiting matrix MMSE with the MMSE of a calibrated scalar denoising problem ${\sf S\text{-}mmse}(X_{0},y_{*}(\lambda))$ .

It is straightforward to establish that ${\varepsilon}_{c}<1$ . However, numerically we obtain that ${\varepsilon}_{c}\approx 0.05$ . Thus, for most values of ${\varepsilon}$ , our results completely characterize the spiked Wigner model.

Probabilistic models similar to Eqs.(1), (2) have been the focus of much recent work in random matrix theory . The focus in this literature is to analyze the limiting distribution of the eigenvalues of the matrix $Y_{\lambda}/\sqrt{n}$ and, in particular, identifying regimes in which this distribution differs from that of the pure noise matrix $Z$ . The typical picture that emerges from this line of work is that a phase transition occurs at a well-defined critical signal-to-noise ratio $\lambda_{c}=\lambda_{c}(\alpha,{\varepsilon})$ : {LaTeXdescription}

there exists an outlier eigenvalue and the principal eigenvector corresponding to this outlier has a positive correlation with the signal. For instance, in the spiked Wigner case, letting $\widehat{x}_{1}(Y_{\lambda})$ denote the normalized principal eigenvector of $Y_{\lambda}$ we obtain $\langle\widehat{x}_{1}(Y_{\lambda}),x\rangle/\sqrt{n{\varepsilon}}\geq\delta({\varepsilon})>0$ asymptotically.

the spectral distribution of the observation $Y_{\lambda}$ is indistinguishable from that of the pure noise $Z$ . Furthermore, the principal eigenvector is asymptotically orthogonal to the signal factors. For the spiked Wigner case, this implies that $\langle\widehat{x}_{1}(Y_{\lambda}),x\rangle/\sqrt{n{\varepsilon}}\to 0$ asymptotically. This phase transition phenomenon has been demonstrated under considerably fewer assumptions than we make in Eqs. (1), (2). We refer the interested reader to and the references therein for further details.

It is clear from these results that vanilla PCA, which involves using the principal eigenvector is ineffective in estimating the underlying clean signal $X$ when $\lambda<\lambda_{c}$ . Indeed PCA only makes use of the fact that the underlying signal is low-rank, or in fact rank-one in our case. Since we make additional sparsity assumptions in our models (1), (2) it is natural to ask if this can be leveraged when we have a small signal-to-noise ratio $\lambda$ . In the last decade, a considerable amount of work in the statistics community has studied this problem. Our spiked Wishart model Eq. (2) is a special case of the spiked covariance model in statistics, first introduced by Johnstone and Lu . Johnstone and Lu proposed a simple diagonal thresholding scheme that estimates the support of $v$ using the largest diagonal entries of the Gram matrix $Y_{\lambda}^{\sf T}Y_{\lambda}$ . An M-estimator for the underlying factors was proposed by . A number of other practical algorithms have also been proposed to outperform diagonal thresholding.

Some recent work has focused on the support recovery guarantees for such algorithms, or estimating consistently the positions of non-zeros in $v$ . Let $k=n{\varepsilon}$ denote the expected size of the support of $v$ . Amini and Wainwright proved that unless $k\leq cm/\log n$ , no algorithm would be able to consistently estimate the support of $v$ due to information-theoretic obstructions. They further demonstrate that a (computationally intractable) algorithm that searches through all possible $k$ -sized subsets of rows of the data matrix can recover the support provided $k\leq c^{\prime}m/\log n$ .

Since we consider ${\varepsilon}=\Theta(1)$ and $m=\Theta(n)$ , in our case $k=\Theta(m)$ and consequently, estimating the support correctly is impossible. It is for this reason that we instead focus on another natural figure-of-merit: the mean squared error, defined in Eq. (3) above. Somewhat surprisingly, we are able to prove (for a regime ${\varepsilon}_{c}<{\varepsilon}\leq 1$ ) that a computationally efficient algorithm asymptotically achieves the information-theoretically optimal mean squared error for any signal-to-noise ratio $\lambda$ .

Other related work

Rangan et al. considered a model similar to Eq. (2) with general structural assumptions on the factors $u$ and $v$ . They proposed an approximate message passing algorithm analogous to the one we analyze and characterize its high-dimensional behavior. Based on non-rigorous but powerful tools from statistical physics, they conjecture that AMP asymptotically achieves the (optimal) performance of the joint MMSE estimator of $u$ and $v$ . In the restricted setting of sparse PCA, we rigorously confirm this conjecture, and validate the statistical physics arguments.

A model similar to Eq. (1) was considered by , motivated by the “planted clique” problem in theoretical computer science. The sparsity regime of interest in this work was $k=O(\sqrt{n})$ , with a focus on recovering the “clique”, analogous to support recovery in the spiked covariance model.

Organization

The paper is organized as follows. In Section II we give details of the AMP algorithm and formally state our results. For brevity, we only provide the proof of one of our main results in Section III.

II Algorithm and main results

In the interest of exposition, we restrict ourselves to the spiked Wigner model (1) and defer the discussion of the Wishart model (2) to Section II-D.

II-B State evolution

The key property of approximate message passing is that it admits an asymptotically exact characterization in the high-dimensional limit where $n\to\infty$ . The iterates $x^{t}_{i}$ converge as $n\to\infty$ to Gaussian random variables with a prescribed mean and variance. These prescribed mean and variance parameters evolve according to deterministic recursions, jointly termed “state evolution”. We define for $t\geq 0$ :

where $X_{0}\sim{\sf Ber}({\varepsilon})$ and $Z\sim{\sf N}(0,1)$ are independent. The recursion is initialized with $\mu_{0}=\tau_{0}=0$ .

where $\mu_{t},\tau_{t}$ are defined by Eqs. (8), (9). This allows us to track the squared error of the AMP estimator accurately, in the high-dimensional limit, and establish its optimality.

II-C Main Result

We first define the following regime for ${\varepsilon}$ :

Let ${\varepsilon}_{*}\in(0,1)$ be the smallest positive real number such that for every ${\varepsilon}>{\varepsilon}_{*}$ the following is true. For every $\lambda>0$ , the equation below has only one solution in $[0,\infty)$ :

Here $X_{0}\sim{\sf Ber}({\varepsilon})$ .

With a slight abuse of notation, we denote by ${\sf M\text{-}mmse}(\lambda)$ the quantity $\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n)$ , assuming it exists. Also we define the squared error of AMP at iteration $t$ as:

Notice that ${\sf MSE_{AMP}}(\lambda,t)$ is a random variable that depends on the realization $Y_{\lambda}$ . Our first main result strengthens Theorem 1 for the spiked Wigner case:

Under Model (1) we have ${\sf M\text{-}mmse}(\lambda)=\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n)$ exists for every $\lambda\geq 0$ . This limit satisfies, when ${\varepsilon}>{\varepsilon}_{*}$ :

where $y_{*}(\lambda)$ is the unique solution to Eq. (10) above. Further, the symmetric Bayes-optimal AMP algorithm 1 satisfies the following almost surely:

Although this result is asymptotic in nature, simulations show that the predictions are accurate on problems of dimension a few thousands (see Figure 1).

II-D The spiked Wishart model

The following is the analogue of Definition II.1 for the asymmetric Wishart model:

Let $\widetilde{{\varepsilon}}_{*}$ be the smallest positive real number such that for every ${\varepsilon}>\widetilde{{\varepsilon}}_{*}$ the following is true. For every $\lambda>0$ equation below has only one solution in $[0,\infty)$ :

Our second result is for the spiked Wishart model (2):

The limit ${\sf M\text{-}mmse}(\lambda)\equiv\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n)$ exists for every $\lambda\geq 0$ and, for ${\varepsilon}>\widetilde{{\varepsilon}}_{*}$ , is given by:

where $y_{*}(\lambda)$ is the unique solution to Eq. (13). Further, asymmetric Bayes-optimal AMP satisfies the following limits almost surely:

Numerically verifying the fixed point condition Eq. (13) is somewhat more involved than Eq. (10). However it is relatively straightforward to establish that $\widetilde{{\varepsilon}}_{*}\leq\inf\{{\varepsilon}:{\sf S\text{-}mmse}(V,z)\text{ is convex in }z\}\approx 0.05$ .

III Proof of Theorem 2

Owing to space constraints, we restrict ourselves to proving Theorem 2 in this paper. The proof of Theorem 3 follows similar ideas and will be provided in the full version of the present paper. Theorem 2 follows almost immediately from the following two propositions.

Consider the model Eq. (1) with ${\varepsilon}>{\varepsilon}_{*}$ , and the approximate message passing orbit obtained by using the recursively defined scalar functions for $t\geq 0$ :

Here $X_{0}\sim{\sf Ber}({\varepsilon})$ and $Z\sim{\sf N}(0,1)$ are independent. Further $\mu_{0}=\tau_{0}=0$ and $(\mu_{t},\tau_{t})_{t\geq 1}$ are defined using the state evolution recursions (8), (9). Then defining ${\sf mse_{AMP}}(\lambda)\equiv{\varepsilon}^{2}-y_{*}(\lambda)^{2}/\lambda^{2}$ , the RHS of Eq. (11), the following is true:

The first limit holds almost surely and in ${\cal L}_{1}$ and $h({\varepsilon})$ is the binary entropy function $h({\varepsilon})=-{\varepsilon}\log{\varepsilon}-(1-{\varepsilon})\log(1-{\varepsilon})$ .

For every $\lambda\geq 0$ , ${\sf M\text{-}mmse}(\lambda)=\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n)$ exists. Further:

where $h({\varepsilon})$ is defined in Proposition III.1.

where in the first inequality and the last equality we use Propositions III.1, III.2. This implies that ${\sf mse_{AMP}}(\lambda)={\sf M\text{-}mmse}(\lambda)$ for Lebesgue-a.e. $\lambda$ . Further, as ${\sf M\text{-}mmse}(\lambda)$ is the pointwise limit of monotone non-increasing (in $\lambda$ ) functions ${\sf M\text{-}mmse}(\lambda,n)$ , it is monotone non-increasing, which yields the claim for all $\lambda\in[0,\infty)$ . ∎

By the strong law of large numbers, $\lVert{x}\rVert^{4}/n^{2}\to{\varepsilon}^{2}$ almost surely, and in ${\cal L}_{1}$ . It is not hard to prove that the functions $f_{t}(y)$ are $\sqrt{\lambda}$ -Lipschitz continuous. Hence, it is a direct consequence of Theorem 1 of that the following limits hold almost surely and in ${\cal L}_{1}$ :

It follows that $\lim_{t\to\infty}\lim_{n\to\infty}{\sf MSE_{AMP}}(\lambda,t)={\varepsilon}^{2}-\tau_{*}^{2}$ almost surely and in ${\cal L}_{1}$ where $\tau_{*}=\tau_{*}(\lambda)$ denotes the smallest non-negative fixed point of the equation:

Since the right hand side equals ${\varepsilon}(1-{\varepsilon})$ at $\tau=0$ and ${\varepsilon}$ at $\tau=\infty$ , at least one fixed point must exist. Hence $\tau_{*}(\lambda)$ is well defined. Now, note that

Thus $\tau_{*}$ is a fixed point of Eq. (19) iff $\lambda\tau_{*}=y_{*}$ is a fixed point of Eq. (10). It follows from our definition of ${\varepsilon}_{*}$ that when ${\varepsilon}>{\varepsilon}_{*}$ , $\tau_{*}(\lambda)$ is the unique non-negative fixed point of Eq. (19) and Claim (16) follows. To complete the proof of the proposition, it only remains to show claim (17), for which we have the following

Let $\tau_{*}(\lambda)$ denote the unique non-negative fixed point of Eq. (19). Then

where $W(m,\lambda,x,z)=m\sqrt{\lambda}x-m\sqrt{\lambda}/2+\lambda^{1/4}m^{1/2}z$ . Here $X_{0}\sim{\sf Ber}({\varepsilon})$ , $Z\sim{\sf N}(0,1)$ and are independent. Letting $m_{*}=\tau_{*}\sqrt{\lambda}$ , it is not hard to show that:

It follows from the fundamental theorem of calculus that

It is easy to see that $\phi(0,m_{*}(0))=0$ . Further, using the fact that $\tau_{*}(\lambda)\leq 1$ (as the right hand side of Eq. (19) is bounded by 1), we have that $m_{*}(\lambda)=O(\sqrt{\lambda})$ . Using this, it is not hard to check that $\phi(\lambda,m_{*}(\lambda))\to h({\varepsilon})$ as $\lambda\to\infty$ . This concludes the proof of the lemma. ∎

III-B Proof of Proposition III.2

We first prove that $\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n)$ exists for every $\lambda\geq 0$ . Define for $i,j\in[n]$

By the fact that the distribution of $(X,Y_{\lambda})$ is invariant under (identical) row and column permutations, $m_{ij}(\lambda,n)=m_{12}(\lambda,n)$ for every $i,j$ distinct. Consequently:

Since ${\rm Var}(X_{11})={\varepsilon}(1-{\varepsilon})<\infty$ it suffices to prove that $\lim_{n\to\infty}m_{12}(\lambda,n)$ exists for every $\lambda\geq 0$ . To this end, let $Y_{\lambda}^{n-1}$ denote the first principal $(n-1)\times(n-1)$ submatrix of $Y_{\lambda}$ . Clearly:

where the equality follows from the model Eq. (1) and the second inequality from monotonicity of the minimum mean square error in $\lambda$ . Consequently, for every $\lambda\geq 0$ , $m_{12}(\lambda,n)$ is a monotone, bounded sequence and has a limit.

In order to prove the claim 18, we first note that for any finite $n$ the following holds applying the I-MMSE identity of to the upper triangular portion of $X$ :

For $\Lambda=\infty$ we have that $I(X;Y_{\infty})=H(X)-H(X|Y_{\infty})=H(X)$ and $H(X)=H(x)=nh({\varepsilon})$ . Dividing by $n$ on either side:

An application of Fatou’s lemma then yields the result. ∎