Information-theoretically Optimal Sparse PCA

Yash Deshpande, Andrea Montanari

I Introduction

In either case, our data consists of a sparse, rank-one matrix observed through Gaussian noise. We let XX denote the clean, underlying signal (xxTxx^{\sf T} or uvTuv^{\sf T} for the spiked Wigner or Wishart model respectively). Our task is to estimate the signal XX from the data YλY_{\lambda} in the high dimension asymptotic where n,mn\to\infty,m\to\infty with m/nα(0,)m/n\to\alpha\in(0,\infty). This paper focuses on estimation in the sense of the mean squared error, defined for an estimator X^(Yλ)\widehat{X}(Y_{\lambda}) as:

In this paper, we analyze an iterative scheme called approximate message passing (AMP) to estimate the clean signal XX. The machinery of approximate message passing reduces the high-dimensional matrix problem in models (1), (2) to the following simpler scalar denoising problem:

where X0Ber(ε)X_{0}\sim{\sf Ber}({\varepsilon}) and NN(0,1)N\sim{\sf N}(0,1) are independent. The scalar MMSE in estimating X0X_{0} from YλY_{\lambda} is given by:

Our main results, characterize the optimal mean squared error M-mmse(λ,n){\sf M\text{-}mmse}(\lambda,n) in the large nn asymptotic, when ε>εc0.05{\varepsilon}>{\varepsilon}_{c}\approx 0.05, and establish that AMP achieves this fundamental limit. For the spiked Wigner model we prove the following.

There exists an εc(0,1){\varepsilon}_{c}\in(0,1) such that for all ε>εc{\varepsilon}>{\varepsilon}_{c}, and every λ0\lambda\geq 0 the squared error of AMP iterates X^t\widehat{X}^{t} satisfies the following:

Further, the limit on the RHS above satisfies, for every λ>0\lambda>0:

where y=y(λ)y_{*}=y_{*}(\lambda) solves y=λ(εS-mmse(X0,y))y_{*}=\lambda({\varepsilon}-{\sf S\text{-}mmse}(X_{0},y_{*})).

The combination of Theorem 1 and Eq. (10) effectively yield a single-letter characterization of Model (1), connecting the limiting matrix MMSE with the MMSE of a calibrated scalar denoising problem S-mmse(X0,y(λ)){\sf S\text{-}mmse}(X_{0},y_{*}(\lambda)).

It is straightforward to establish that εc<1{\varepsilon}_{c}<1. However, numerically we obtain that εc0.05{\varepsilon}_{c}\approx 0.05. Thus, for most values of ε{\varepsilon}, our results completely characterize the spiked Wigner model.

Probabilistic models similar to Eqs.(1), (2) have been the focus of much recent work in random matrix theory . The focus in this literature is to analyze the limiting distribution of the eigenvalues of the matrix Yλ/nY_{\lambda}/\sqrt{n} and, in particular, identifying regimes in which this distribution differs from that of the pure noise matrix ZZ. The typical picture that emerges from this line of work is that a phase transition occurs at a well-defined critical signal-to-noise ratio λc=λc(α,ε)\lambda_{c}=\lambda_{c}(\alpha,{\varepsilon}): {LaTeXdescription}

there exists an outlier eigenvalue and the principal eigenvector corresponding to this outlier has a positive correlation with the signal. For instance, in the spiked Wigner case, letting x^1(Yλ)\widehat{x}_{1}(Y_{\lambda}) denote the normalized principal eigenvector of YλY_{\lambda} we obtain x^1(Yλ),x/nεδ(ε)>0\langle\widehat{x}_{1}(Y_{\lambda}),x\rangle/\sqrt{n{\varepsilon}}\geq\delta({\varepsilon})>0 asymptotically.

the spectral distribution of the observation YλY_{\lambda} is indistinguishable from that of the pure noise ZZ. Furthermore, the principal eigenvector is asymptotically orthogonal to the signal factors. For the spiked Wigner case, this implies that x^1(Yλ),x/nε0\langle\widehat{x}_{1}(Y_{\lambda}),x\rangle/\sqrt{n{\varepsilon}}\to 0 asymptotically. This phase transition phenomenon has been demonstrated under considerably fewer assumptions than we make in Eqs. (1), (2). We refer the interested reader to and the references therein for further details.

It is clear from these results that vanilla PCA, which involves using the principal eigenvector is ineffective in estimating the underlying clean signal XX when λ<λc\lambda<\lambda_{c}. Indeed PCA only makes use of the fact that the underlying signal is low-rank, or in fact rank-one in our case. Since we make additional sparsity assumptions in our models (1), (2) it is natural to ask if this can be leveraged when we have a small signal-to-noise ratio λ\lambda. In the last decade, a considerable amount of work in the statistics community has studied this problem. Our spiked Wishart model Eq. (2) is a special case of the spiked covariance model in statistics, first introduced by Johnstone and Lu . Johnstone and Lu proposed a simple diagonal thresholding scheme that estimates the support of vv using the largest diagonal entries of the Gram matrix YλTYλY_{\lambda}^{\sf T}Y_{\lambda}. An M-estimator for the underlying factors was proposed by . A number of other practical algorithms have also been proposed to outperform diagonal thresholding.

Some recent work has focused on the support recovery guarantees for such algorithms, or estimating consistently the positions of non-zeros in vv. Let k=nεk=n{\varepsilon} denote the expected size of the support of vv. Amini and Wainwright proved that unless kcm/lognk\leq cm/\log n, no algorithm would be able to consistently estimate the support of vv due to information-theoretic obstructions. They further demonstrate that a (computationally intractable) algorithm that searches through all possible kk-sized subsets of rows of the data matrix can recover the support provided kcm/lognk\leq c^{\prime}m/\log n.

Since we consider ε=Θ(1){\varepsilon}=\Theta(1) and m=Θ(n)m=\Theta(n), in our case k=Θ(m)k=\Theta(m) and consequently, estimating the support correctly is impossible. It is for this reason that we instead focus on another natural figure-of-merit: the mean squared error, defined in Eq. (3) above. Somewhat surprisingly, we are able to prove (for a regime εc<ε1{\varepsilon}_{c}<{\varepsilon}\leq 1) that a computationally efficient algorithm asymptotically achieves the information-theoretically optimal mean squared error for any signal-to-noise ratio λ\lambda.

Other related work

Rangan et al. considered a model similar to Eq. (2) with general structural assumptions on the factors uu and vv. They proposed an approximate message passing algorithm analogous to the one we analyze and characterize its high-dimensional behavior. Based on non-rigorous but powerful tools from statistical physics, they conjecture that AMP asymptotically achieves the (optimal) performance of the joint MMSE estimator of uu and vv. In the restricted setting of sparse PCA, we rigorously confirm this conjecture, and validate the statistical physics arguments.

A model similar to Eq. (1) was considered by , motivated by the “planted clique” problem in theoretical computer science. The sparsity regime of interest in this work was k=O(n)k=O(\sqrt{n}), with a focus on recovering the “clique”, analogous to support recovery in the spiked covariance model.

Organization

The paper is organized as follows. In Section II we give details of the AMP algorithm and formally state our results. For brevity, we only provide the proof of one of our main results in Section III.

II Algorithm and main results

In the interest of exposition, we restrict ourselves to the spiked Wigner model (1) and defer the discussion of the Wishart model (2) to Section II-D.

II-B State evolution

The key property of approximate message passing is that it admits an asymptotically exact characterization in the high-dimensional limit where nn\to\infty. The iterates xitx^{t}_{i} converge as nn\to\infty to Gaussian random variables with a prescribed mean and variance. These prescribed mean and variance parameters evolve according to deterministic recursions, jointly termed “state evolution”. We define for t0t\geq 0:

where X0Ber(ε)X_{0}\sim{\sf Ber}({\varepsilon}) and ZN(0,1)Z\sim{\sf N}(0,1) are independent. The recursion is initialized with μ0=τ0=0\mu_{0}=\tau_{0}=0.

where μt,τt\mu_{t},\tau_{t} are defined by Eqs. (8), (9). This allows us to track the squared error of the AMP estimator accurately, in the high-dimensional limit, and establish its optimality.

II-C Main Result

We first define the following regime for ε{\varepsilon}:

Let ε(0,1){\varepsilon}_{*}\in(0,1) be the smallest positive real number such that for every ε>ε{\varepsilon}>{\varepsilon}_{*} the following is true. For every λ>0\lambda>0, the equation below has only one solution in [0,)[0,\infty):

Here X0Ber(ε)X_{0}\sim{\sf Ber}({\varepsilon}).

With a slight abuse of notation, we denote by M-mmse(λ){\sf M\text{-}mmse}(\lambda) the quantity limnM-mmse(λ,n)\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n), assuming it exists. Also we define the squared error of AMP at iteration tt as:

Notice that MSEAMP(λ,t){\sf MSE_{AMP}}(\lambda,t) is a random variable that depends on the realization YλY_{\lambda}. Our first main result strengthens Theorem 1 for the spiked Wigner case:

Under Model (1) we have M-mmse(λ)=limnM-mmse(λ,n){\sf M\text{-}mmse}(\lambda)=\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n) exists for every λ0\lambda\geq 0. This limit satisfies, when ε>ε{\varepsilon}>{\varepsilon}_{*}:

where y(λ)y_{*}(\lambda) is the unique solution to Eq. (10) above. Further, the symmetric Bayes-optimal AMP algorithm 1 satisfies the following almost surely:

Although this result is asymptotic in nature, simulations show that the predictions are accurate on problems of dimension a few thousands (see Figure 1).

II-D The spiked Wishart model

The following is the analogue of Definition II.1 for the asymmetric Wishart model:

Let ε~\widetilde{{\varepsilon}}_{*} be the smallest positive real number such that for every ε>ε~{\varepsilon}>\widetilde{{\varepsilon}}_{*} the following is true. For every λ>0\lambda>0 equation below has only one solution in [0,)[0,\infty):

Our second result is for the spiked Wishart model (2):

The limit M-mmse(λ)limnM-mmse(λ,n){\sf M\text{-}mmse}(\lambda)\equiv\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n) exists for every λ0\lambda\geq 0 and, for ε>ε~{\varepsilon}>\widetilde{{\varepsilon}}_{*}, is given by:

where y(λ)y_{*}(\lambda) is the unique solution to Eq. (13). Further, asymmetric Bayes-optimal AMP satisfies the following limits almost surely:

Numerically verifying the fixed point condition Eq. (13) is somewhat more involved than Eq. (10). However it is relatively straightforward to establish that ε~inf{ε:S-mmse(V,z) is convex in z}0.05\widetilde{{\varepsilon}}_{*}\leq\inf\{{\varepsilon}:{\sf S\text{-}mmse}(V,z)\text{ is convex in }z\}\approx 0.05.

III Proof of Theorem 2

Owing to space constraints, we restrict ourselves to proving Theorem 2 in this paper. The proof of Theorem 3 follows similar ideas and will be provided in the full version of the present paper. Theorem 2 follows almost immediately from the following two propositions.

Consider the model Eq. (1) with ε>ε{\varepsilon}>{\varepsilon}_{*}, and the approximate message passing orbit obtained by using the recursively defined scalar functions for t0t\geq 0:

Here X0Ber(ε)X_{0}\sim{\sf Ber}({\varepsilon}) and ZN(0,1)Z\sim{\sf N}(0,1) are independent. Further μ0=τ0=0\mu_{0}=\tau_{0}=0 and (μt,τt)t1(\mu_{t},\tau_{t})_{t\geq 1} are defined using the state evolution recursions (8), (9). Then defining mseAMP(λ)ε2y(λ)2/λ2{\sf mse_{AMP}}(\lambda)\equiv{\varepsilon}^{2}-y_{*}(\lambda)^{2}/\lambda^{2}, the RHS of Eq. (11), the following is true:

The first limit holds almost surely and in L1{\cal L}_{1} and h(ε)h({\varepsilon}) is the binary entropy function h(ε)=εlogε(1ε)log(1ε)h({\varepsilon})=-{\varepsilon}\log{\varepsilon}-(1-{\varepsilon})\log(1-{\varepsilon}).

For every λ0\lambda\geq 0, M-mmse(λ)=limnM-mmse(λ,n){\sf M\text{-}mmse}(\lambda)=\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n) exists. Further:

where h(ε)h({\varepsilon}) is defined in Proposition III.1.

where in the first inequality and the last equality we use Propositions III.1, III.2. This implies that mseAMP(λ)=M-mmse(λ){\sf mse_{AMP}}(\lambda)={\sf M\text{-}mmse}(\lambda) for Lebesgue-a.e. λ\lambda. Further, as M-mmse(λ){\sf M\text{-}mmse}(\lambda) is the pointwise limit of monotone non-increasing (in λ\lambda) functions M-mmse(λ,n){\sf M\text{-}mmse}(\lambda,n), it is monotone non-increasing, which yields the claim for all λ[0,)\lambda\in[0,\infty). ∎

By the strong law of large numbers, x4/n2ε2\lVert{x}\rVert^{4}/n^{2}\to{\varepsilon}^{2} almost surely, and in L1{\cal L}_{1}. It is not hard to prove that the functions ft(y)f_{t}(y) are λ\sqrt{\lambda}-Lipschitz continuous. Hence, it is a direct consequence of Theorem 1 of that the following limits hold almost surely and in L1{\cal L}_{1}:

It follows that limtlimnMSEAMP(λ,t)=ε2τ2\lim_{t\to\infty}\lim_{n\to\infty}{\sf MSE_{AMP}}(\lambda,t)={\varepsilon}^{2}-\tau_{*}^{2} almost surely and in L1{\cal L}_{1} where τ=τ(λ)\tau_{*}=\tau_{*}(\lambda) denotes the smallest non-negative fixed point of the equation:

Since the right hand side equals ε(1ε){\varepsilon}(1-{\varepsilon}) at τ=0\tau=0 and ε{\varepsilon} at τ=\tau=\infty, at least one fixed point must exist. Hence τ(λ)\tau_{*}(\lambda) is well defined. Now, note that

Thus τ\tau_{*} is a fixed point of Eq. (19) iff λτ=y\lambda\tau_{*}=y_{*} is a fixed point of Eq. (10). It follows from our definition of ε{\varepsilon}_{*} that when ε>ε{\varepsilon}>{\varepsilon}_{*}, τ(λ)\tau_{*}(\lambda) is the unique non-negative fixed point of Eq. (19) and Claim (16) follows. To complete the proof of the proposition, it only remains to show claim (17), for which we have the following

Let τ(λ)\tau_{*}(\lambda) denote the unique non-negative fixed point of Eq. (19). Then

where W(m,λ,x,z)=mλxmλ/2+λ1/4m1/2zW(m,\lambda,x,z)=m\sqrt{\lambda}x-m\sqrt{\lambda}/2+\lambda^{1/4}m^{1/2}z. Here X0Ber(ε)X_{0}\sim{\sf Ber}({\varepsilon}), ZN(0,1)Z\sim{\sf N}(0,1) and are independent. Letting m=τλm_{*}=\tau_{*}\sqrt{\lambda}, it is not hard to show that:

It follows from the fundamental theorem of calculus that

It is easy to see that ϕ(0,m(0))=0\phi(0,m_{*}(0))=0. Further, using the fact that τ(λ)1\tau_{*}(\lambda)\leq 1 (as the right hand side of Eq. (19) is bounded by 1), we have that m(λ)=O(λ)m_{*}(\lambda)=O(\sqrt{\lambda}). Using this, it is not hard to check that ϕ(λ,m(λ))h(ε)\phi(\lambda,m_{*}(\lambda))\to h({\varepsilon}) as λ\lambda\to\infty. This concludes the proof of the lemma. ∎

III-B Proof of Proposition III.2

We first prove that limnM-mmse(λ,n)\lim_{n\to\infty}{\sf M\text{-}mmse}(\lambda,n) exists for every λ0\lambda\geq 0. Define for i,j[n]i,j\in[n]

By the fact that the distribution of (X,Yλ)(X,Y_{\lambda}) is invariant under (identical) row and column permutations, mij(λ,n)=m12(λ,n)m_{ij}(\lambda,n)=m_{12}(\lambda,n) for every i,ji,j distinct. Consequently:

Since Var(X11)=ε(1ε)<{\rm Var}(X_{11})={\varepsilon}(1-{\varepsilon})<\infty it suffices to prove that limnm12(λ,n)\lim_{n\to\infty}m_{12}(\lambda,n) exists for every λ0\lambda\geq 0. To this end, let Yλn1Y_{\lambda}^{n-1} denote the first principal (n1)×(n1)(n-1)\times(n-1) submatrix of YλY_{\lambda}. Clearly:

where the equality follows from the model Eq. (1) and the second inequality from monotonicity of the minimum mean square error in λ\lambda . Consequently, for every λ0\lambda\geq 0, m12(λ,n)m_{12}(\lambda,n) is a monotone, bounded sequence and has a limit.

In order to prove the claim 18, we first note that for any finite nn the following holds applying the I-MMSE identity of to the upper triangular portion of XX:

For Λ=\Lambda=\infty we have that I(X;Y)=H(X)H(XY)=H(X)I(X;Y_{\infty})=H(X)-H(X|Y_{\infty})=H(X) and H(X)=H(x)=nh(ε)H(X)=H(x)=nh({\varepsilon}). Dividing by nn on either side:

An application of Fatou’s lemma then yields the result. ∎

References