MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel

Thibault Lesieur, Florent Krzakala, Lenka Zdeborová

I Introduction

Estimation of low-rank matrices from their noisy or incomplete measurements is a problem that has a wide range of applications of practical interest . As for every broadly relevant data processing problem it is of interest to study statistical and computational limits of such an estimation on meaningful model settings. In this paper we evaluate the Bayes optimal and computationally achievable mean-squared error of estimation for the following two models:

In the first model the matrix to be estimated is created as

where XX is a n×rn\times r matrix whose rows xix_{i} were chosen independently at random from some distribution Pprior(xi)P_{\rm prior}(x_{i}), and KK is a r×rr\times r symmetric matrix. The matrix WW is then observed element-wise trough a noisy non-linear output channel Pout(yijwij)P_{\rm out}(y_{ij}|w_{ij}), with i,j=1,,ni,j=1,\dots,n. The goal is to estimate the unknown matrix XX from measured YY.

We consider the problem in the limit of very large systems nn\to\infty, small rank r=Ω(1)r=\Omega(1), and Pout/w=Ω(1)\partial P_{\rm out}/\partial w=\Omega(1). The purpose of the scaling factor 1/n1/\sqrt{n} in (1) is that the inference problem is neither trivially easy nor clearly impossible in this limit. The same model and scaling was considered e.g. in , where the matrix KK was an identity. The main algorithmic difficulty in such a setting comes from the high dimension nn. In this work we study the idealized setting in which the hyper-parameters r,K,Pprior,Poutr,K,P_{\rm prior},P_{\rm out} are independent of the dimension nn and known. Note, however, that extending our approach via expectation maximization seems like a natural way to learn these hyper-parameters.

In the second model the matrix to be estimated is created as

where UU is a n×rn\times r matrix and VV is a m×rm\times r matrix, with rows uiu_{i} (resp. viv_{i}) chosen from some distribution Pprioru(ui)P^{u}_{\rm prior}(u_{i}) (resp. Ppriorv(vi)P^{v}_{\rm prior}(v_{i})). The matrix WW is then observed trough the same output channel Pout(yijwij)P_{\rm out}(y_{ij}|w_{ij}) as above, and the problem is set analogously, adding that m/n=α=Ω(1)m/n=\alpha=\Omega(1). Most of the discussion in this paper uses notation of the first model, but all our results are relevant for the second one as well.

An algorithmic tool that first comes to our mind when seeing the setting above is singular (or eigenvalue) value decomposition (SVD) keeping the leading rr components. SVD is the optimal estimation algorithm if the goal is to minimize the mean squared distance between the observed matrix YY and the estimator irrespectively of the properties of the factors (such as the prior PpriorP_{\rm prior}). Requirements on the factors, such as sparsity or other structure are not incorporated in the basic SVD. Moreover, for a general non-linear output channel PoutP_{\rm out} the sum of squares between YY and its estimator is not the most relevant quantity to minimize.

Setting the problem in a fully probabilistic way is another common approach , that is in principle much more flexible, but algorithmically more challenging in general. To obtain a Bayes-optimal estimator of the factor xix_{i} (rr-dimensional column vector) we need to compute the marginals of the posterior probability distribution

where wij=xiTKxj/nw_{ij}=x^{T}_{i}Kx_{j}/\sqrt{n}. In this paper we will leverage this algorithmic difficulty by realizing that techniques based on approximate message passing and related state evolution (SE) are asymptotically optimal for the above setting. For compressed sensing these techniques are rigorous thanks to series of works . The AMP algorithm and its state evolution for the above setting is different from the one of compressed sensing, and these proofs do not apply. Arguments for the optimality in the present setting come from non-rigorous methods of statistical physics. From a mathematical point of view the present paper provides a set of accurate conjectures. Fully rigorous proof of these conjectures is a natural direction for future work.

The above model includes a number of examples that are commonly considered in the literature. Without trying to be exhaustive, we list a few interesting ones.

Sparse principle component analysis as considered in : The prior distribution PpriorP_{\rm prior} has a weight 1ρ1-\rho on a vector of zeros. The output channel was considered as additive Gaussian noise.

Robust PCA in which the measured matrix is a low-rank matrix plus a sparse large noise: The output channel adds small noise with some probability pp and large noise with 1p1-p.

Submatrix localization : The matrix WW includes rr submatrices (overlapping or not) that have larger mean than the overall mean of the matrix WW. The prior xix_{i} then encodes in a binary manner to which of the rr submatrices does a given variable ii belong. The output is usually considered as Gaussian additive noise.

Detection of communities hidden in dense networks: Stochastic block model is popular for theoretical studies of clustering. Nodes belong to different clusters/communities, the prior PpriorP_{\rm prior} then allows only vectors having one component 11, and elsewhere. The observed matrix YY is binary and the probability to observe yij=1y_{ij}=1 is given by KabK_{ab} if xi(a)=1x_{i}(a)=1 and xj(b)=1x_{j}(b)=1 (i,j=1,,ni,j=1,\dots,n, and a,b=1,,ra,b=1,\dots,r).

Biclustering is a simultaneous clustering of rows and columns, it finds a number of applications e.g. for analysis of microarray data in genomics . Again the prior encodes affinity to a cluster and the output function includes various models of noise.

Poisson noise in the matrix factorization was considered e.g. in .

The labeled stochastic block model is another case that can be reformulated in the present setting .

I-B Contribution and closely related work

As far as we know the only tools that provides asymptotically exact analysis of the minimal mean squared error of models (1-2) is approximate message passing and state evolution as deployed in the present paper. For the output channel being additive Gaussian noise this was done previously in for rank r=1r=1 with part of the results being fully rigorous, in for generic rank and without the state evolution, and in for general rank with the state evolution, but non-rigorously. The main contribution of this paper is the treatment of the case of general non-linear output channel PoutP_{\rm out}.

Approximate message passing and state evolution for a generic output channel PoutP_{\rm out} was derived previously in the context of linear estimation , and later in matrix factorization with r=Ω(n)r=\Omega(n) . In both these cases the resulting equations (both the AMP and the state evolution) are considerably more involved than those for additive Gaussian noise. In the setting of low-rank models (1-2) above the situation is remarkably simpler as the AMP algorithm stays the same up to a change of the matrix YY for the so-called Fisher score matrix SS that depends on the output channel and on YY element-wise, and the inverse of the Fisher information of the channel that we denote Δ\Delta and plays a role of an effective noise variance. In the state evolution the situation is even simpler in the sense that only the effective value Δ\Delta of the noise appears.

The space of all the possible element-wise output channels hence reduces to one dimensional space, parametrized by their inverse Fisher information Δ\Delta. The resulting asymptotic MMSE depends only on Δ\Delta and not on other details of the channel. Consequently, classes of rather differently looking matrix estimation problems all share the same single-letter characterization. The contribution of the present paper is to unveil and quantify this property, and illustrate it on the example of detection of communities hidden in dense networks that is via this one parameter mapping related to localization of submatrices having different mean from the background matrix.

Analogous universality with respect to the output channel was observed in (see e.g. their remark 2.5) in the study of detection of a small hidden clique with approximate message passing.

Our results for the Bayes-optimal estimation error of community detection in dense stochastic block model are of independent interest. Analogous results were derived for the sparse case in . In the dense case only MSE-suboptimal spectral methods were evaluated . We also unveil a hard phase existing in this problem for rank r>4r>4, and becoming very wide for rr\to\infty.

We also recently learned about independently ongoing work of who consider rank r=1r=1 with Radamacher prior, and establish rigorously the relation between the stochastic block model with two groups and low-rank estimation with Gaussian channel.

II AMP and state evolution

In this section we derive the AMP algorithm to compute marginals of the posterior probability distribution (3). We present the derivation for the XKXXKX^{\top} case, for the UVUV^{\top} everything works analogously and we only state the results. For convenience we introduce a function g(y,w)g(y,w) as

We require g(y,w)g(y,w) to be differentiable in ww. For the previously considered Gaussian additive noise we have gGAN(y,w)=(yw)2/(2Δ)g_{\rm GAN}(y,w)=-(y-w)^{2}/(2\Delta).

In the first step, we write the belief propagation equations for the probability distribution (3). For this we introduce messages mijit(xi){m^{t}_{ij\rightarrow i}}(x_{i}) and niijt(xi)n_{i\rightarrow ij}^{t}(x_{i}) between variables xix_{i} and the factors associated to yijy_{ij}. The BP equations read

where ZilitZ^{t}_{il\to i} and ZiijtZ^{t}_{i\to ij} are normalizations. The main assumption behind these BP equations is that the messages niil(xi)n_{i\rightarrow il}(x_{i}) and niij(xi)n_{i\rightarrow ij}(x_{i}) can be interpreted as probabilities that are statistically independent conditioned to the values of the variable xix_{i} in the large nn limit. This is also the main assumption of this paper. Arguments of theoretical statistical physics provide heuristic justification for this assumption. In the above form the BP equations are not useful for implementation. We, however, realize that we can rewrite them into a much simpler form that provides the same marginals in the limit of large nn.

In the large nn limit the function g(yil,xlKxi/n)g\left(y_{il},x_{l}^{\top}Kx_{i}/\sqrt{n}\right) depends only weakly on xix_{i} and xlx_{l}. Therefore we expand this function around w=0w=0 and (5) then becomes

Here we introduced quantities alilta^{t}_{l\rightarrow il} and vliltv^{t}_{l\rightarrow il} as the mean and covariance of nliltn^{t}_{l\rightarrow il} We then take the logarithm of (6) and by using (5) we find

To close these equations we finally compute the new mean and variance of the messages niijt+1n_{i\rightarrow ij}^{t+1} as

where f(A,B)f(A,B) is the mean of the normalized probability distribution

and f/B\partial f/\partial B is its covariance matrix. Above we closed the intractable distributional belief propagation equations on the means and variances of the messages.

As we can already anticipate it will be instrumental to introduce the so-called Fisher score matrix (evaluated at w=0w=0) as

and the Fisher information (evaluated at w=0w=0) of the output channel as

We denote the inverse Fisher information by Δ\Delta because that would be the noise variance for the Gaussian additive channel (in that case S=Y/ΔS=Y/\Delta).

Using dyPout(yw)=1\int{\rm d}yP_{\rm out}(y|w)=1, w\forall w and (4), it follows that

With the above, eqs. (10) can be rewritten as

Here we can recognize AMP equations derived and studied for the low-rank matrix estimation problem with additive Gaussian noise in .

Remarkably the non-linear output channel enters only trough the value of the effective noise (15) and an effective form of the observed matrix (14). This is much simpler than what happens for the linear estimation model for which the generalized output AMP was derived in , or the matrix factorization .

Finally we provide a simplification that is standard to AMP-like algorithms and that could be called TAPyfication . We notice that variables aiijta_{i\rightarrow ij}^{t} and viijtv_{i\rightarrow ij}^{t} depend only weakly on the index jj, and this allow us to reduce further the number of variables to end up with

Where the second term in the expression for BitB_{i}^{t} is the so-called Onsager reaction term, with its time index one iteration earlier, as is usual in AMP-type algorithms. Here we have reduced the number of messages to iterate from O(n2)O(n^{2}) to O(n)O(n).

The procedure carried out above to derive the AMP algorithms can also be used to obtain a corresponding expression for the log-likelihood ϕ=log(Z(Y))/n\phi=\log(Z(Y))/n where Z(Y)Z(Y) is the normalization in (3). This is called the Bethe free energy. Given a fixed point the AMP equations the Bethe free energy reads (for details see Appendix B)

where Z(A,Bi){\cal Z}(A,B_{i}) is the normalisation from (13) and

II-B Summary of the algorithm

For a given output channel we first need to evaluate the effective noise parameter Δ\Delta (15) and the score matrix SS (14). Then at each iteration of the AMP algorithm we store the following variables.

aita_{i}^{t} and ait1a_{i}^{t-1} are the estimators of the mean of the variables xix_{i} at time tt and t1t-1. Every aia_{i} is a vector of size r×1r\times 1.

vitv_{i}^{t} is the estimator of the covariance of the variables xix_{i} at time tt. The viv_{i} are matrices of size r×rr\times r.

We then compute the matrix AtA^{t} and the vectors BitB_{i}^{t} with (20) and (21). Every BitB_{i}^{t} is a vector of size r×1r\times 1 while AtA^{t} is a r×rr\times r matrix. Finally we compute the new estimate of the mean and covariance of variables xix_{i} from (22) and (23). Where the function ff is defined as mean and covariance of the distribution in eq. (13). We initialize the ait=0a_{i}^{t=0} very close to 0 in order to avoid convergence problems. To help convergence we can also damp the iterations with some parameter γ\gamma. More sophisticated damping should be implemented if convergence problems persist on non-synthetic data . This finally gives us the following algorithm.

II-C Remark about spectral algorithms

An interesting remark can be done when noticing that there is an intimate relation between AMP and SVD. First notice that for some prior distributions AMP has a so-called uniform (or factorized) fixed point where the values of aia_{i} do not depend on the index ii (e.g. for zero mean priors). In such cases it is meaningful to linearize AMP around this uniform fixed point. This linearization can then be interpreted as a spectral algorithm. For the most common additive Gaussian output channel this gives the eigen-decomposition of YY (or SVD of YY for the uvuv^{\top} case). Of course this spectral algorithm comes with its advantages (non-parameteric) and disadvantages (hard to ensure constraints on the factors XX).

For a generic output channel the object that comes out from the AMP linearization is the Fisher score matrix SS defined by (14). Following the analogy, the SVD decomposition (or eigen-decomposition) should hence be done on SS rather than on YY. Let us give an example in which the difference between doing SVD on YY or SS indeed improves performance of the spectral method. We consider rank r=1r=1 and Pprior(x)=ex2/2/2πP_{\rm prior}(x)=e^{-x^{2}/2}/\sqrt{2\pi}. The output channel is an additive exponential noise of parameter λ\lambda, i.e. Pout(yw)=exp(yw/λ)/(2λ)P_{\rm out}(y|w)=\exp\left(-|y-w|/\lambda\right)/(2\lambda). The score matrix SS is in this case proportional to Ssign(yij)S\approx{\rm sign}(y_{ij}). And indeed for 1/2<λ<11/\sqrt{2}<\lambda<1 an informative eigenvector gets out of the bulk of the spectrum of SS but not out of YY’s spectrum. Taking the score matrix allowed us to extract meaningful eigenvectors.

II-D State evolution

Remarkably, the AMP algorithm is amenable to asymptotic analysis. In statistical physics this would be called the cavity method , in linear estimation the commonly used term is state evolution. The SE was derived for low-rank matrix estimation for the Gaussian additive channel in and the present case is a straightforward generalization. To write the SE for the present case we introduce two order parameters of dimension r×r{r\times r}.

where XX is the vector we are trying to infer and that is unknown to us. State evolution (or single letter characterization) relies on the computation of the distribution (with respect to the realizations of the matrix YY) of BitB_{i}^{t} given XX (AtA^{t} is a deterministic variable of QtQ^{t}).

Using (10) we notice that BitB_{i}^{t} is a sum of many terms that by the assumptions of the belief propagation are independent. Using the central limit theorem we can characterize BitB_{i}^{t} by its mean and variance as done in Appendix A. As a result one finds that

where ξi\xi_{i} is a Gaussian random variable of mean 0 and of covariance KQtK/ΔKQ^{t}K/\Delta. This allows us to see that the order parameters QtQ^{t} and MtM^{t} will evolve as

where we used abbreviations for Px=Pprior(x)P_{x}=P_{\rm prior}(x) and Pξ=N(0,KQtK/Δ)P_{\xi}={\cal N}(0,KQ^{t}K/\Delta). The remarkable part of these equations is that all the details of the output channel have disappeared and the only thing that remains is the inverse of the Fisher information Δ\Delta. Larger Δ\Delta means larger effective noise, smaller Fisher information and hence harder inference. The estimation error in the large nn limit depends on the channel only trough its Fisher information.

Since in this paper we assume the knowledge of the generative model and its parameters the state evolution simplifies further as we observe that Qt=MtQ^{t}=M^{t}. This is called the Nishimori condition in statistical physics and was derived e.g. in or .

State evolution also allows us to derive the Bethe free energy in the large nn limit as

This formula is useful when there are multiple stable fixed points of equations (28) and (29). In such a situation the one with the greatest log-likelihood ϕ\phi is the Bayes optimal one.

In this section we complete the presentation of the AMP algorithm and the state evolution by stating it for the UVUV^{\top} model (2). We remind that UU and VV are of size n×rn\times r and m×rm\times r respectively and that we consider the following limit. nn\rightarrow\infty, m=αnm=\alpha n, α=Ω(1)\alpha=\Omega(1), r=ω(1)r=\omega(1). Each row of UU and VV has been sampled from a probability distribution Puprior(ui)P^{\rm prior}_{u}(u_{i}) and Pvprior(vi)P^{\rm prior}_{v}(v_{i}) (for simplicity of notation we will omit the upper index ”prior” in this section). Distributions PuP_{u} and PvP_{v} have their corresponding input functions fu(A,B)f_{u}(A,B), fv(A,B)f_{v}(A,B) and their normalizations Zu(A,B){\cal Z}_{u}(A,B), Zv(A,B){\cal Z}_{v}(A,B), defined according to (13). The matrix WW is observed through an output channel defined by some probability Pout(yijwij)P_{\rm out}(y_{ij}|w_{ij}).

The AMP algorithm for estimating UU and VV from YY is as follows

The score matrix SS and the effective noise Δ\Delta were again computed from the output channel following (14) and (15).

The Bethe expression for the log-likelihood is obtained from the fixed point as

where Z(A,Bi){\cal Z}(A,B_{i}) is again the normalisation from (13) and

To write the state evolution equations we introduce the order parameters

where ξu\xi_{u} and ξv\xi_{v} are Gaussian random variables of mean and covariance Qut/ΔQ^{t}_{u}/\Delta and Qvt/ΔQ^{t}_{v}/\Delta. At the fixed point we compute the Bethe free energy as

III Phase diagrams and examples for clustering

In this section we illustrate our findings on the example of clustering into rr equally sized groups. In this case the rr-dimensional rows of the matrix XX take the form of (0,,1,,0)\left(0,\ldots,1,\ldots,0\right) with a 11 located at only one of the rr coordinates. The matrix KK is an identity. The elements of the matrix WW are then wij=1/nw_{ij}=1/\sqrt{n} if xi=xjx_{i}=x_{j} and wij=0w_{ij}=0 otherwise. Correspondingly the prior Pprior(x)P_{\rm prior}(x) is the probability distribution that picks with probability 1/r1/r one of the above rr vectors. The function f(A,B)f(A,B) defined via (13) is in this case

We will consider two different output channels

Gaussian additive noise Pout(yijwij)=exp[(yijwij)2/(2Δ)]/2πΔP_{\rm out}(y_{ij}|w_{ij})=\exp[-(y_{ij}-w_{ij})^{2}/(2\Delta)]/\sqrt{2\pi\Delta}. In this case the problem is interpreted as localization of submatrices having mean 1/n1/\sqrt{n} in an otherwise zero-mean matrix.

Stochastic block model channel where YY is its (dense) adjacency matrix created as Pout(yij=1wij)=pout+μwijP_{\rm out}(y_{ij}=1|w_{ij})=p_{\rm out}+\mu w_{ij} and Pout(yij=0wij)=1poutμwijP_{\rm out}(y_{ij}=0|w_{ij})=1-p_{\rm out}-\mu w_{ij}. That is, nodes from different groups are connected with probability poutp_{\rm out}, and nodes from the same group with probability pin=pout+μ/np_{\rm in}=p_{\rm out}+\mu/\sqrt{n}. Both pout=Ω(1)p_{\rm out}=\Omega(1), and μ=Ω(1)\mu=\Omega(1). The Fisher score matrix has elements Sij=μ/poutS_{ij}=\mu/p_{\rm out} where Yij=1Y_{ij}=1, and Sij=μ/(1pout)S_{ij}=-\mu/(1-p_{\rm out}) where Yij=0Y_{ij}=0. The inverse Fisher information of the stochastic block model channel is given as

In the large nn limit the resulting graph will have average degree npoutnp_{\rm out}.

Bayes-optimal inference in the stochastic block model was studied in the case of sparse networks (bounded average degree) . For the case of dense networks (linear average degree) and μ=Ω(1)\mu=\Omega(1) the Bayes-optimal MMSE and corresponding phase transitions are an original contribution of the present paper.

The state evolution from section II-D simplifies for the case of symmetric clustering in the setting above. One notices that if QtQ^{t} and MtM^{t} are of the symmetric form

then Qt+1Q^{t+1} and Mt+1M^{t+1} will be of the same form. Here JrJ_{r} is a r×rr\times r matrix filled with ones, and IrI_{r} is a r×rr\times r identity matrix. Moreover since the sum of elements of the vectors xx is always 11 we have the additional property that

Therefore in the end we only have one parameter bb in the state evolution, bt=1b^{t}=1 means that we have perfectly reconstructed the communities, bt=0b^{t}=0 is when there is no information and the estimator for every variables is (1r,,1r)\left(\frac{1}{r},\cdots,\frac{1}{r}\right).

The state evolution equation (28) can then be used to derive the evolution of the parameter bb under iterations. It follows that

We now observe that the above state evolution equation has always the uniform or uninformative fixed point bu=0b_{u}=0. It is crucial to study the stability of this fixed point under iterations of (52). Depending on the number of groups rr and the noise parameter Δ\Delta there may be another stable fixed point of (52). When it exists we will call it bfarb_{\rm far} (it obviously depends on Δ\Delta and rr). We expand (53) around bub_{u} to get

the uniform fixed point will be unstable and the iteration will converge away from it.

If we translate condition (56) back to the parameters of the stochastic block model we obtain that inference with AMP is possible for

This is the same condition as known from the sparse SBM and from standard spectral methods . Our contribution to these results is the Bayes-optimal value of the detection error, which is considerably better than the error obtained from the spectral algorithms evaluated in . Note e.g. that the error in the spectral algorithm is always continuous at the transition, whereas the Bayes-optimal error presents a discontinuity at the transitions for r>4r>4 (see below).

By looking at the second order term in (55) we can get some information about where the iterations will converge.

If r<4r<4, r42Δ2r4bt2\frac{r-4}{2\Delta^{2}r^{4}}{b^{t}}^{2} is negative and close to Δc\Delta_{c} we will converge to bfar(Δ)=2(ΔcΔ)/(4r)+O[(ΔcΔ)2]b_{\rm far}(\Delta)=2(\Delta_{c}-\Delta)/(4-r)+O[(\Delta_{c}-\Delta)^{2}]. In this case the fixed point is a continuous function of the noise.

If r>4r>4, things are different. Because the second derivative of (55) is positive then it is impossible for bfarb_{\rm far} to go to zero continuously as Δ\Delta goes to Δc\Delta_{c} (a simple plot of (55) should convince the reader of this fact). There is therefore a jump in the value of the fixed point as Δ\Delta crosses Δc\Delta_{c}. This is the signature of a first order phase transition.

In the case of first order phase transition there are always three transitions to discuss

The different phases have the following properties:

Undetectable phase without clusters, Δspinodal<Δ\Delta_{\rm spinodal}<\Delta. Eq. (52) has only the uniform fixed point. Bayes optimal inference does not provide any information about the labeling of the nodes.

Undetectable phase with clusters, Δstatic<Δ<Δspinodal\Delta_{\rm static}<\Delta<\Delta_{\rm spinodal}. Eq. (52) has two stable fixed point bub_{\rm u} and bfarb_{\rm far}. But bfarb_{\rm far} has a lower log-likelihood than bub_{\rm u} therefore the Bayes-optimal estimator is still not correlated with the planted solution.

Detectable but hard, Δc<Δ<Δstatic\Delta_{c}<\Delta<\Delta_{\rm static}. The fixed point bfarb_{\rm far} now has a larger log-likelihood than bub_{\rm u}. Bayes optimal estimator will find a configuration well correlated with the planted one, but the AMP algorithms will not. Analogous phase appears in many other inference problems and we conjecture that all known polynomial algorithms will fail in this phase.

Easy phase, Δ<Δc\Delta<\Delta_{c}. The uniform fixed point bub_{\rm u} is unstable and the AMP algorithm will converge to bfarb_{\rm far}. The problem is said easy since we have a polynomial algorithm that (at least is conjectured to) asymptotically reaches optimal reconstruction.

A theoretical analysis of Δspinodal(r)\Delta_{\rm spinodal}(r) and Δstatic(r)\Delta_{\rm static}(r) is also possible as rr\to\infty. It relies on analysis of which exponential dominates the others in (53), for details see appendix D. We find

The most notable remark about these results is that in the limit of large rank the gap between what is statistically possible Δstatic\Delta_{\rm static} and what is algorithmically tractable Δc=1/r2\Delta_{c}=1/r^{2} is different even in its order. Note also that the present phase transitions are equal to critical temperatures in the Potts glass model as computed in , this is because of the intimate relation between random and planted models . Note that these works also derived rank r=4r=4 as the point where the second order phase transitions changes into a first order one.

III-B Phase diagrams

In this section we illustrate the behavior of the fixed points of the AMP algorithm and of the state evolution. In all the presented experiments we iterate the corresponding equations till convergence. To investigate the MMSE and the phase transitions we can initialize the AMP algorithm and the SE equations in two different ways:

Uniformative initialization: for the SE equations this means bt=0=δb^{t=0}=\delta, where δ\delta is very small. For the AMP algorithm this means ai=(1r,,1r)+δia_{i}=(\frac{1}{r},\ldots,\frac{1}{r})^{\top}+\delta_{i} and vit=Ir/rv_{i}^{t}=I_{r}/r. This is the initialization with which we would work if we were working with real data.

Informative initialization: for the SE equations this means bt=0=1b^{t=0}=1. For the AMP algorithm this means ai=xia_{i}=x_{i} and vit=0v_{i}^{t}=0, i.e. initiate the estimators to be equal to the planted solution that we are trying to recover.

In community detection we usually evaluate an estimator that maximizes the number of correctly assigned nodes. For this we need to take the index of the maximum component of aia_{i}. For numerical comparisons between the state evolution and the AMP algorithm we will rather evaluate the mean-squared error MSE=1Q{\rm MSE}=1-Q, where QQ is the order parameter (26).

In figure 1 we illustrate the fact that in the large nn limit different output channels having the same inverse Fisher information Δ\Delta have equivalent performance. We plot the MSE obtained from the state evolution, compare to the one obtained from the AMP algorithm on a single instance of size n=20000n=20000 with either a Gaussian additive noise channel or the stochastic block model channel. Indeed the three cases agree.

In figure 2 we depict the first order phase transition that occurs for r>4r>4. We present the MSE obtained from density evolution as function of r2Δr^{2}\Delta for various ranks r=10,15,20r=10,15,20. From the uninformative initialization the MSE jumps to 11/r1-1/r at Δc=1/r2\Delta_{c}=1/r^{2}. From the informative initialization the jump occurs at Δspinodal>Δc\Delta_{\rm spinodal}>\Delta_{c} and by comparing the Bethe free energy of the two fixed points we evaluate the Δstatic\Delta_{\rm static}. We observe that already for these values of the rank the gap between the information theoretical and computationally tractable performance, i.e. between Δc\Delta_{c} and Δstatic\Delta_{\rm static} is very large. Another interesting observation we made concerns the value of the MSE at the Δc=1/r2\Delta_{c}=1/r^{2} transition. The reconstruction is very close to exact even at the phase transition itself. More precisely it decreases roughly exponentially with the rank rr, being close to 101010^{-10} for rank r=100r=100.

In figure 3 we plot the values of Δspinodal\Delta_{\rm spinodal} and Δstatic\Delta_{\rm static} rescaled by rlogrr\log{r} as a function of the rank rr. As the rank grows rr\to\infty the static line will converge to 1/41/4 and the spinodal one to 1/21/2 (59-60). Whereas the static transition converges nicely to its asymptotic value, for the spinodal transition we are still quite far from the asymptotic regime.

IV Conclusions and perspectives

We considered here the problem of estimation of a low rank matrix that was observed element-wise trough an arbitrary noisy channel. Using approximate message passing and its state evolution we compute the Bayes-optimal MMSE in this problem. The most interesting and, comparing to previously studied cases, surprising conclusion is that the MMSE depends on the output channel only trough its Fisher information.

As an example of the considered setting we study the problem of clustering in dense stochastic block model with the difference between probabilities of connection within and between communities scaling as O(1/n)O(1/\sqrt{n}). We evaluate the phase transitions and MMSE in this regime, unveiling a wide region of computational hardness.

Both the AMP algorithms and the state evolution apply to a number of other setting considered in the literature, which is a natural direction of future work.

V Acknowledgements

We thank Madhu Advani for clarifying discussion about Fisher information. The research leading to these results has received funding from the European Research Council under the European Union’s 7th7^{th} Framework Programme (FP/2007-2013)/ERC Grant Agreement 307087-SPARCS.

References

Appendix A

To derive the state evolution let us define

We then expand the probability Pout(yw)P_{\rm out}(y|w) around 0

Using (63) in (62) and using (16) and (16) we get

where aia_{i} is the mean of aiila_{i\rightarrow il} with respect to yijy_{ij}. Using (63) and the fact that alila_{l\rightarrow il} and akika_{k\rightarrow ik} are independent random variables we can compute the second moments of the GijG_{ij} as

From the central limit theorem we deduce.

where WiW_{i} is a Gaussian random variable of mean 0 and of covariance KQtK/ΔKQ^{t}K/\Delta.

Appendix B

The Bethe free energy in the case where all the factors are between two variables reads as

For the second term we expand once again Pout(yijwij)P_{\rm out}\left(y_{ij}|w_{ij}\right) around 0 and then take the mean with respect to each message. We then take the log and find

The TAPification procedure that is not detailed here gives us

We also remove the term g(yij,0)g(y_{ij},0) since it is a constant that does not depend on the messages. We replace (72), (17) and (15) in (71) and we get

Since the sum is on all undirected links ijij we do the sum on all directed link and divide by 22

To get the Bethe-free-energy in the large nn limit we compute the mean of aiijKajijn\frac{a_{i\rightarrow ij}^{\top}Ka_{j\rightarrow ij}}{\sqrt{n}} with respect to the noise. The main idea remains the same we compute the mean with respect to a perturbation of PoutP_{\rm out}

To compute the mean of (70) we use the fact that we know the distribution of the BiB_{i} (66). We replace (76) in (71) and we get

Appendix C

To find the Δstatic\Delta_{\rm static} one can notice that by taking the derivative with respect to QQ and MM of (30) one finds a combination of (28) and (29). Therefore we have

We deduce a way to compute Δstatic\Delta_{\rm static} as

To compute Δstatic\Delta_{\rm static} we evaluate Mr{\cal M}_{r} on a whole interval then for each xx draw a line between point (0,0)(0,0) and (x,H(x))(x,H(x)). We then compute the area between Mr{\cal M}_{r} and this line. When this aera is zero then Mr(x)/x{\cal M}_{r}(x)/{x} gives us Δstatic\Delta_{\rm static}.

Appendix D

In this appendix we explain the derivation of (59), (60). To do this let us study the function Mr(x){\cal M}_{r}(x) where we take x=βrlog(r)x=\beta r\log(r), where β=Ω(1)\beta=\Omega(1). The important part of Mr{\cal M}_{r} is this integral

If F1F_{1} dominates F2F_{2} as r+r\rightarrow+\infty then Mr=1{\cal M}_{r}=1, otherwise if F2F_{2} dominates F1F_{1} then Mr=0{\cal M}_{r}=0.

To estimate F1F_{1}, F2F_{2} let us notice that with high probability the maximum value will be of order 2βlog(r)\sqrt{2\beta\log(r)}. This is a general property of Gaussian variables that the maximum of rr independent Gaussian variables of variance σ2\sigma^{2} is of order σ2log(r)\sigma\sqrt{2\log(r)}. We can therefore compute the mean of F2F_{2} all while conditioning on the fact that all of the uiu_{i} are smaller than 2βlog(r)\sqrt{2\beta\log(r)}. This allows us to compute the typical value of F1F_{1} as F1rβF_{1}\sim r^{\beta}. For F2F_{2} we obtain: when β<1/2\beta<1/2 then F2rβ+12F_{2}\sim r^{\beta+\frac{1}{2}}, and when if β>1/2\beta>1/2 then F2r2βF_{2}\sim r^{\sqrt{2\beta}}. We have to look at which of the F1F_{1} or F2F_{2} has a higher exponent. In these computation we can therefore just assume that

Now let us remind (78) while keeping x=βrlog(r)x=\beta r\log(r)

To get the static transition we use (80). Let us find the β\beta that satisfies equation (80) we get