Information-theoretically Optimal Sparse PCA
Yash Deshpande, Andrea Montanari
I Introduction
In either case, our data consists of a sparse, rank-one matrix observed through Gaussian noise. We let denote the clean, underlying signal ( or for the spiked Wigner or Wishart model respectively). Our task is to estimate the signal from the data in the high dimension asymptotic where with . This paper focuses on estimation in the sense of the mean squared error, defined for an estimator as:
In this paper, we analyze an iterative scheme called approximate message passing (AMP) to estimate the clean signal . The machinery of approximate message passing reduces the high-dimensional matrix problem in models (1), (2) to the following simpler scalar denoising problem:
where and are independent. The scalar MMSE in estimating from is given by:
Our main results, characterize the optimal mean squared error in the large asymptotic, when , and establish that AMP achieves this fundamental limit. For the spiked Wigner model we prove the following.
There exists an such that for all , and every the squared error of AMP iterates satisfies the following:
Further, the limit on the RHS above satisfies, for every :
where solves .
The combination of Theorem 1 and Eq. (10) effectively yield a single-letter characterization of Model (1), connecting the limiting matrix MMSE with the MMSE of a calibrated scalar denoising problem .
It is straightforward to establish that . However, numerically we obtain that . Thus, for most values of , our results completely characterize the spiked Wigner model.
Probabilistic models similar to Eqs.(1), (2) have been the focus of much recent work in random matrix theory . The focus in this literature is to analyze the limiting distribution of the eigenvalues of the matrix and, in particular, identifying regimes in which this distribution differs from that of the pure noise matrix . The typical picture that emerges from this line of work is that a phase transition occurs at a well-defined critical signal-to-noise ratio : {LaTeXdescription}
there exists an outlier eigenvalue and the principal eigenvector corresponding to this outlier has a positive correlation with the signal. For instance, in the spiked Wigner case, letting denote the normalized principal eigenvector of we obtain asymptotically.
the spectral distribution of the observation is indistinguishable from that of the pure noise . Furthermore, the principal eigenvector is asymptotically orthogonal to the signal factors. For the spiked Wigner case, this implies that asymptotically. This phase transition phenomenon has been demonstrated under considerably fewer assumptions than we make in Eqs. (1), (2). We refer the interested reader to and the references therein for further details.
It is clear from these results that vanilla PCA, which involves using the principal eigenvector is ineffective in estimating the underlying clean signal when . Indeed PCA only makes use of the fact that the underlying signal is low-rank, or in fact rank-one in our case. Since we make additional sparsity assumptions in our models (1), (2) it is natural to ask if this can be leveraged when we have a small signal-to-noise ratio . In the last decade, a considerable amount of work in the statistics community has studied this problem. Our spiked Wishart model Eq. (2) is a special case of the spiked covariance model in statistics, first introduced by Johnstone and Lu . Johnstone and Lu proposed a simple diagonal thresholding scheme that estimates the support of using the largest diagonal entries of the Gram matrix . An M-estimator for the underlying factors was proposed by . A number of other practical algorithms have also been proposed to outperform diagonal thresholding.
Some recent work has focused on the support recovery guarantees for such algorithms, or estimating consistently the positions of non-zeros in . Let denote the expected size of the support of . Amini and Wainwright proved that unless , no algorithm would be able to consistently estimate the support of due to information-theoretic obstructions. They further demonstrate that a (computationally intractable) algorithm that searches through all possible -sized subsets of rows of the data matrix can recover the support provided .
Since we consider and , in our case and consequently, estimating the support correctly is impossible. It is for this reason that we instead focus on another natural figure-of-merit: the mean squared error, defined in Eq. (3) above. Somewhat surprisingly, we are able to prove (for a regime ) that a computationally efficient algorithm asymptotically achieves the information-theoretically optimal mean squared error for any signal-to-noise ratio .
Other related work
Rangan et al. considered a model similar to Eq. (2) with general structural assumptions on the factors and . They proposed an approximate message passing algorithm analogous to the one we analyze and characterize its high-dimensional behavior. Based on non-rigorous but powerful tools from statistical physics, they conjecture that AMP asymptotically achieves the (optimal) performance of the joint MMSE estimator of and . In the restricted setting of sparse PCA, we rigorously confirm this conjecture, and validate the statistical physics arguments.
A model similar to Eq. (1) was considered by , motivated by the “planted clique” problem in theoretical computer science. The sparsity regime of interest in this work was , with a focus on recovering the “clique”, analogous to support recovery in the spiked covariance model.
Organization
The paper is organized as follows. In Section II we give details of the AMP algorithm and formally state our results. For brevity, we only provide the proof of one of our main results in Section III.
II Algorithm and main results
In the interest of exposition, we restrict ourselves to the spiked Wigner model (1) and defer the discussion of the Wishart model (2) to Section II-D.
II-B State evolution
The key property of approximate message passing is that it admits an asymptotically exact characterization in the high-dimensional limit where . The iterates converge as to Gaussian random variables with a prescribed mean and variance. These prescribed mean and variance parameters evolve according to deterministic recursions, jointly termed “state evolution”. We define for :
where and are independent. The recursion is initialized with .
where are defined by Eqs. (8), (9). This allows us to track the squared error of the AMP estimator accurately, in the high-dimensional limit, and establish its optimality.
II-C Main Result
We first define the following regime for :
Let be the smallest positive real number such that for every the following is true. For every , the equation below has only one solution in :
Here .
With a slight abuse of notation, we denote by the quantity , assuming it exists. Also we define the squared error of AMP at iteration as:
Notice that is a random variable that depends on the realization . Our first main result strengthens Theorem 1 for the spiked Wigner case:
Under Model (1) we have exists for every . This limit satisfies, when :
where is the unique solution to Eq. (10) above. Further, the symmetric Bayes-optimal AMP algorithm 1 satisfies the following almost surely:
Although this result is asymptotic in nature, simulations show that the predictions are accurate on problems of dimension a few thousands (see Figure 1).
II-D The spiked Wishart model
The following is the analogue of Definition II.1 for the asymmetric Wishart model:
Let be the smallest positive real number such that for every the following is true. For every equation below has only one solution in :
Our second result is for the spiked Wishart model (2):
The limit exists for every and, for , is given by:
where is the unique solution to Eq. (13). Further, asymmetric Bayes-optimal AMP satisfies the following limits almost surely:
Numerically verifying the fixed point condition Eq. (13) is somewhat more involved than Eq. (10). However it is relatively straightforward to establish that .
III Proof of Theorem 2
Owing to space constraints, we restrict ourselves to proving Theorem 2 in this paper. The proof of Theorem 3 follows similar ideas and will be provided in the full version of the present paper. Theorem 2 follows almost immediately from the following two propositions.
Consider the model Eq. (1) with , and the approximate message passing orbit obtained by using the recursively defined scalar functions for :
Here and are independent. Further and are defined using the state evolution recursions (8), (9). Then defining , the RHS of Eq. (11), the following is true:
The first limit holds almost surely and in and is the binary entropy function .
For every , exists. Further:
where is defined in Proposition III.1.
where in the first inequality and the last equality we use Propositions III.1, III.2. This implies that for Lebesgue-a.e. . Further, as is the pointwise limit of monotone non-increasing (in ) functions , it is monotone non-increasing, which yields the claim for all . ∎
By the strong law of large numbers, almost surely, and in . It is not hard to prove that the functions are -Lipschitz continuous. Hence, it is a direct consequence of Theorem 1 of that the following limits hold almost surely and in :
It follows that almost surely and in where denotes the smallest non-negative fixed point of the equation:
Since the right hand side equals at and at , at least one fixed point must exist. Hence is well defined. Now, note that
Thus is a fixed point of Eq. (19) iff is a fixed point of Eq. (10). It follows from our definition of that when , is the unique non-negative fixed point of Eq. (19) and Claim (16) follows. To complete the proof of the proposition, it only remains to show claim (17), for which we have the following
Let denote the unique non-negative fixed point of Eq. (19). Then
where . Here , and are independent. Letting , it is not hard to show that:
It follows from the fundamental theorem of calculus that
It is easy to see that . Further, using the fact that (as the right hand side of Eq. (19) is bounded by 1), we have that . Using this, it is not hard to check that as . This concludes the proof of the lemma. ∎
III-B Proof of Proposition III.2
We first prove that exists for every . Define for
By the fact that the distribution of is invariant under (identical) row and column permutations, for every distinct. Consequently:
Since it suffices to prove that exists for every . To this end, let denote the first principal submatrix of . Clearly:
where the equality follows from the model Eq. (1) and the second inequality from monotonicity of the minimum mean square error in . Consequently, for every , is a monotone, bounded sequence and has a limit.
In order to prove the claim 18, we first note that for any finite the following holds applying the I-MMSE identity of to the upper triangular portion of :
For we have that and . Dividing by on either side:
An application of Fatou’s lemma then yields the result. ∎