MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel

Thibault Lesieur, Florent Krzakala, Lenka Zdeborová

I Introduction

Estimation of low-rank matrices from their noisy or incomplete measurements is a problem that has a wide range of applications of practical interest . As for every broadly relevant data processing problem it is of interest to study statistical and computational limits of such an estimation on meaningful model settings. In this paper we evaluate the Bayes optimal and computationally achievable mean-squared error of estimation for the following two models:

In the first model the matrix to be estimated is created as

where $X$ is a $n\times r$ matrix whose rows $x_{i}$ were chosen independently at random from some distribution $P_{\rm prior}(x_{i})$ , and $K$ is a $r\times r$ symmetric matrix. The matrix $W$ is then observed element-wise trough a noisy non-linear output channel $P_{\rm out}(y_{ij}|w_{ij})$ , with $i,j=1,\dots,n$ . The goal is to estimate the unknown matrix $X$ from measured $Y$ .

We consider the problem in the limit of very large systems $n\to\infty$ , small rank $r=\Omega(1)$ , and $\partial P_{\rm out}/\partial w=\Omega(1)$ . The purpose of the scaling factor $1/\sqrt{n}$ in (1) is that the inference problem is neither trivially easy nor clearly impossible in this limit. The same model and scaling was considered e.g. in , where the matrix $K$ was an identity. The main algorithmic difficulty in such a setting comes from the high dimension $n$ . In this work we study the idealized setting in which the hyper-parameters $r,K,P_{\rm prior},P_{\rm out}$ are independent of the dimension $n$ and known. Note, however, that extending our approach via expectation maximization seems like a natural way to learn these hyper-parameters.

In the second model the matrix to be estimated is created as

where $U$ is a $n\times r$ matrix and $V$ is a $m\times r$ matrix, with rows $u_{i}$ (resp. $v_{i}$ ) chosen from some distribution $P^{u}_{\rm prior}(u_{i})$ (resp. $P^{v}_{\rm prior}(v_{i})$ ). The matrix $W$ is then observed trough the same output channel $P_{\rm out}(y_{ij}|w_{ij})$ as above, and the problem is set analogously, adding that $m/n=\alpha=\Omega(1)$ . Most of the discussion in this paper uses notation of the first model, but all our results are relevant for the second one as well.

An algorithmic tool that first comes to our mind when seeing the setting above is singular (or eigenvalue) value decomposition (SVD) keeping the leading $r$ components. SVD is the optimal estimation algorithm if the goal is to minimize the mean squared distance between the observed matrix $Y$ and the estimator irrespectively of the properties of the factors (such as the prior $P_{\rm prior}$ ). Requirements on the factors, such as sparsity or other structure are not incorporated in the basic SVD. Moreover, for a general non-linear output channel $P_{\rm out}$ the sum of squares between $Y$ and its estimator is not the most relevant quantity to minimize.

Setting the problem in a fully probabilistic way is another common approach , that is in principle much more flexible, but algorithmically more challenging in general. To obtain a Bayes-optimal estimator of the factor $x_{i}$ ( $r$ -dimensional column vector) we need to compute the marginals of the posterior probability distribution

where $w_{ij}=x^{T}_{i}Kx_{j}/\sqrt{n}$ . In this paper we will leverage this algorithmic difficulty by realizing that techniques based on approximate message passing and related state evolution (SE) are asymptotically optimal for the above setting. For compressed sensing these techniques are rigorous thanks to series of works . The AMP algorithm and its state evolution for the above setting is different from the one of compressed sensing, and these proofs do not apply. Arguments for the optimality in the present setting come from non-rigorous methods of statistical physics. From a mathematical point of view the present paper provides a set of accurate conjectures. Fully rigorous proof of these conjectures is a natural direction for future work.

The above model includes a number of examples that are commonly considered in the literature. Without trying to be exhaustive, we list a few interesting ones.

Sparse principle component analysis as considered in : The prior distribution $P_{\rm prior}$ has a weight $1-\rho$ on a vector of zeros. The output channel was considered as additive Gaussian noise.

Robust PCA in which the measured matrix is a low-rank matrix plus a sparse large noise: The output channel adds small noise with some probability $p$ and large noise with $1-p$ .

Submatrix localization : The matrix $W$ includes $r$ submatrices (overlapping or not) that have larger mean than the overall mean of the matrix $W$ . The prior $x_{i}$ then encodes in a binary manner to which of the $r$ submatrices does a given variable $i$ belong. The output is usually considered as Gaussian additive noise.

Detection of communities hidden in dense networks: Stochastic block model is popular for theoretical studies of clustering. Nodes belong to different clusters/communities, the prior $P_{\rm prior}$ then allows only vectors having one component $1$ , and elsewhere. The observed matrix $Y$ is binary and the probability to observe $y_{ij}=1$ is given by $K_{ab}$ if $x_{i}(a)=1$ and $x_{j}(b)=1$ ( $i,j=1,\dots,n$ , and $a,b=1,\dots,r$ ).

Biclustering is a simultaneous clustering of rows and columns, it finds a number of applications e.g. for analysis of microarray data in genomics . Again the prior encodes affinity to a cluster and the output function includes various models of noise.

Poisson noise in the matrix factorization was considered e.g. in .

The labeled stochastic block model is another case that can be reformulated in the present setting .

I-B Contribution and closely related work

As far as we know the only tools that provides asymptotically exact analysis of the minimal mean squared error of models (1-2) is approximate message passing and state evolution as deployed in the present paper. For the output channel being additive Gaussian noise this was done previously in for rank $r=1$ with part of the results being fully rigorous, in for generic rank and without the state evolution, and in for general rank with the state evolution, but non-rigorously. The main contribution of this paper is the treatment of the case of general non-linear output channel $P_{\rm out}$ .

Approximate message passing and state evolution for a generic output channel $P_{\rm out}$ was derived previously in the context of linear estimation , and later in matrix factorization with $r=\Omega(n)$ . In both these cases the resulting equations (both the AMP and the state evolution) are considerably more involved than those for additive Gaussian noise. In the setting of low-rank models (1-2) above the situation is remarkably simpler as the AMP algorithm stays the same up to a change of the matrix $Y$ for the so-called Fisher score matrix $S$ that depends on the output channel and on $Y$ element-wise, and the inverse of the Fisher information of the channel that we denote $\Delta$ and plays a role of an effective noise variance. In the state evolution the situation is even simpler in the sense that only the effective value $\Delta$ of the noise appears.

The space of all the possible element-wise output channels hence reduces to one dimensional space, parametrized by their inverse Fisher information $\Delta$ . The resulting asymptotic MMSE depends only on $\Delta$ and not on other details of the channel. Consequently, classes of rather differently looking matrix estimation problems all share the same single-letter characterization. The contribution of the present paper is to unveil and quantify this property, and illustrate it on the example of detection of communities hidden in dense networks that is via this one parameter mapping related to localization of submatrices having different mean from the background matrix.

Analogous universality with respect to the output channel was observed in (see e.g. their remark 2.5) in the study of detection of a small hidden clique with approximate message passing.

Our results for the Bayes-optimal estimation error of community detection in dense stochastic block model are of independent interest. Analogous results were derived for the sparse case in . In the dense case only MSE-suboptimal spectral methods were evaluated . We also unveil a hard phase existing in this problem for rank $r>4$ , and becoming very wide for $r\to\infty$ .

We also recently learned about independently ongoing work of who consider rank $r=1$ with Radamacher prior, and establish rigorously the relation between the stochastic block model with two groups and low-rank estimation with Gaussian channel.

II AMP and state evolution

In this section we derive the AMP algorithm to compute marginals of the posterior probability distribution (3). We present the derivation for the $XKX^{\top}$ case, for the $UV^{\top}$ everything works analogously and we only state the results. For convenience we introduce a function $g(y,w)$ as

We require $g(y,w)$ to be differentiable in $w$ . For the previously considered Gaussian additive noise we have $g_{\rm GAN}(y,w)=-(y-w)^{2}/(2\Delta)$ .

In the first step, we write the belief propagation equations for the probability distribution (3). For this we introduce messages ${m^{t}_{ij\rightarrow i}}(x_{i})$ and $n_{i\rightarrow ij}^{t}(x_{i})$ between variables $x_{i}$ and the factors associated to $y_{ij}$ . The BP equations read

where $Z^{t}_{il\to i}$ and $Z^{t}_{i\to ij}$ are normalizations. The main assumption behind these BP equations is that the messages $n_{i\rightarrow il}(x_{i})$ and $n_{i\rightarrow ij}(x_{i})$ can be interpreted as probabilities that are statistically independent conditioned to the values of the variable $x_{i}$ in the large $n$ limit. This is also the main assumption of this paper. Arguments of theoretical statistical physics provide heuristic justification for this assumption. In the above form the BP equations are not useful for implementation. We, however, realize that we can rewrite them into a much simpler form that provides the same marginals in the limit of large $n$ .

In the large $n$ limit the function $g\left(y_{il},x_{l}^{\top}Kx_{i}/\sqrt{n}\right)$ depends only weakly on $x_{i}$ and $x_{l}$ . Therefore we expand this function around $w=0$ and (5) then becomes

Here we introduced quantities $a^{t}_{l\rightarrow il}$ and $v^{t}_{l\rightarrow il}$ as the mean and covariance of $n^{t}_{l\rightarrow il}$ We then take the logarithm of (6) and by using (5) we find

To close these equations we finally compute the new mean and variance of the messages $n_{i\rightarrow ij}^{t+1}$ as

where $f(A,B)$ is the mean of the normalized probability distribution

and $\partial f/\partial B$ is its covariance matrix. Above we closed the intractable distributional belief propagation equations on the means and variances of the messages.

As we can already anticipate it will be instrumental to introduce the so-called Fisher score matrix (evaluated at $w=0$ ) as

and the Fisher information (evaluated at $w=0$ ) of the output channel as

We denote the inverse Fisher information by $\Delta$ because that would be the noise variance for the Gaussian additive channel (in that case $S=Y/\Delta$ ).

Using $\int{\rm d}yP_{\rm out}(y|w)=1$ , $\forall w$ and (4), it follows that

With the above, eqs. (10) can be rewritten as

Here we can recognize AMP equations derived and studied for the low-rank matrix estimation problem with additive Gaussian noise in .

Remarkably the non-linear output channel enters only trough the value of the effective noise (15) and an effective form of the observed matrix (14). This is much simpler than what happens for the linear estimation model for which the generalized output AMP was derived in , or the matrix factorization .

Finally we provide a simplification that is standard to AMP-like algorithms and that could be called TAPyfication . We notice that variables $a_{i\rightarrow ij}^{t}$ and $v_{i\rightarrow ij}^{t}$ depend only weakly on the index $j$ , and this allow us to reduce further the number of variables to end up with

Where the second term in the expression for $B_{i}^{t}$ is the so-called Onsager reaction term, with its time index one iteration earlier, as is usual in AMP-type algorithms. Here we have reduced the number of messages to iterate from $O(n^{2})$ to $O(n)$ .

The procedure carried out above to derive the AMP algorithms can also be used to obtain a corresponding expression for the log-likelihood $\phi=\log(Z(Y))/n$ where $Z(Y)$ is the normalization in (3). This is called the Bethe free energy. Given a fixed point the AMP equations the Bethe free energy reads (for details see Appendix B)

where ${\cal Z}(A,B_{i})$ is the normalisation from (13) and

II-B Summary of the algorithm

For a given output channel we first need to evaluate the effective noise parameter $\Delta$ (15) and the score matrix $S$ (14). Then at each iteration of the AMP algorithm we store the following variables.

$a_{i}^{t}$ and $a_{i}^{t-1}$ are the estimators of the mean of the variables $x_{i}$ at time $t$ and $t-1$ . Every $a_{i}$ is a vector of size $r\times 1$ .

$v_{i}^{t}$ is the estimator of the covariance of the variables $x_{i}$ at time $t$ . The $v_{i}$ are matrices of size $r\times r$ .

We then compute the matrix $A^{t}$ and the vectors $B_{i}^{t}$ with (20) and (21). Every $B_{i}^{t}$ is a vector of size $r\times 1$ while $A^{t}$ is a $r\times r$ matrix. Finally we compute the new estimate of the mean and covariance of variables $x_{i}$ from (22) and (23). Where the function $f$ is defined as mean and covariance of the distribution in eq. (13). We initialize the $a_{i}^{t=0}$ very close to 0 in order to avoid convergence problems. To help convergence we can also damp the iterations with some parameter $\gamma$ . More sophisticated damping should be implemented if convergence problems persist on non-synthetic data . This finally gives us the following algorithm.

II-C Remark about spectral algorithms

An interesting remark can be done when noticing that there is an intimate relation between AMP and SVD. First notice that for some prior distributions AMP has a so-called uniform (or factorized) fixed point where the values of $a_{i}$ do not depend on the index $i$ (e.g. for zero mean priors). In such cases it is meaningful to linearize AMP around this uniform fixed point. This linearization can then be interpreted as a spectral algorithm. For the most common additive Gaussian output channel this gives the eigen-decomposition of $Y$ (or SVD of $Y$ for the $uv^{\top}$ case). Of course this spectral algorithm comes with its advantages (non-parameteric) and disadvantages (hard to ensure constraints on the factors $X$ ).

For a generic output channel the object that comes out from the AMP linearization is the Fisher score matrix $S$ defined by (14). Following the analogy, the SVD decomposition (or eigen-decomposition) should hence be done on $S$ rather than on $Y$ . Let us give an example in which the difference between doing SVD on $Y$ or $S$ indeed improves performance of the spectral method. We consider rank $r=1$ and $P_{\rm prior}(x)=e^{-x^{2}/2}/\sqrt{2\pi}$ . The output channel is an additive exponential noise of parameter $\lambda$ , i.e. $P_{\rm out}(y|w)=\exp\left(-|y-w|/\lambda\right)/(2\lambda)$ . The score matrix $S$ is in this case proportional to $S\approx{\rm sign}(y_{ij})$ . And indeed for $1/\sqrt{2}<\lambda<1$ an informative eigenvector gets out of the bulk of the spectrum of $S$ but not out of $Y$ ’s spectrum. Taking the score matrix allowed us to extract meaningful eigenvectors.

II-D State evolution

Remarkably, the AMP algorithm is amenable to asymptotic analysis. In statistical physics this would be called the cavity method , in linear estimation the commonly used term is state evolution. The SE was derived for low-rank matrix estimation for the Gaussian additive channel in and the present case is a straightforward generalization. To write the SE for the present case we introduce two order parameters of dimension ${r\times r}$ .

where $X$ is the vector we are trying to infer and that is unknown to us. State evolution (or single letter characterization) relies on the computation of the distribution (with respect to the realizations of the matrix $Y$ ) of $B_{i}^{t}$ given $X$ ( $A^{t}$ is a deterministic variable of $Q^{t}$ ).

Using (10) we notice that $B_{i}^{t}$ is a sum of many terms that by the assumptions of the belief propagation are independent. Using the central limit theorem we can characterize $B_{i}^{t}$ by its mean and variance as done in Appendix A. As a result one finds that

where $\xi_{i}$ is a Gaussian random variable of mean 0 and of covariance $KQ^{t}K/\Delta$ . This allows us to see that the order parameters $Q^{t}$ and $M^{t}$ will evolve as

where we used abbreviations for $P_{x}=P_{\rm prior}(x)$ and $P_{\xi}={\cal N}(0,KQ^{t}K/\Delta)$ . The remarkable part of these equations is that all the details of the output channel have disappeared and the only thing that remains is the inverse of the Fisher information $\Delta$ . Larger $\Delta$ means larger effective noise, smaller Fisher information and hence harder inference. The estimation error in the large $n$ limit depends on the channel only trough its Fisher information.

Since in this paper we assume the knowledge of the generative model and its parameters the state evolution simplifies further as we observe that $Q^{t}=M^{t}$ . This is called the Nishimori condition in statistical physics and was derived e.g. in or .

State evolution also allows us to derive the Bethe free energy in the large $n$ limit as

This formula is useful when there are multiple stable fixed points of equations (28) and (29). In such a situation the one with the greatest log-likelihood $\phi$ is the Bayes optimal one.

In this section we complete the presentation of the AMP algorithm and the state evolution by stating it for the $UV^{\top}$ model (2). We remind that $U$ and $V$ are of size $n\times r$ and $m\times r$ respectively and that we consider the following limit. $n\rightarrow\infty$ , $m=\alpha n$ , $\alpha=\Omega(1)$ , $r=\omega(1)$ . Each row of $U$ and $V$ has been sampled from a probability distribution $P^{\rm prior}_{u}(u_{i})$ and $P^{\rm prior}_{v}(v_{i})$ (for simplicity of notation we will omit the upper index ”prior” in this section). Distributions $P_{u}$ and $P_{v}$ have their corresponding input functions $f_{u}(A,B)$ , $f_{v}(A,B)$ and their normalizations ${\cal Z}_{u}(A,B)$ , ${\cal Z}_{v}(A,B)$ , defined according to (13). The matrix $W$ is observed through an output channel defined by some probability $P_{\rm out}(y_{ij}|w_{ij})$ .

The AMP algorithm for estimating $U$ and $V$ from $Y$ is as follows

The score matrix $S$ and the effective noise $\Delta$ were again computed from the output channel following (14) and (15).

The Bethe expression for the log-likelihood is obtained from the fixed point as

where ${\cal Z}(A,B_{i})$ is again the normalisation from (13) and

To write the state evolution equations we introduce the order parameters

where $\xi_{u}$ and $\xi_{v}$ are Gaussian random variables of mean and covariance $Q^{t}_{u}/\Delta$ and $Q^{t}_{v}/\Delta$ . At the fixed point we compute the Bethe free energy as

III Phase diagrams and examples for clustering

In this section we illustrate our findings on the example of clustering into $r$ equally sized groups. In this case the $r$ -dimensional rows of the matrix $X$ take the form of $\left(0,\ldots,1,\ldots,0\right)$ with a $1$ located at only one of the $r$ coordinates. The matrix $K$ is an identity. The elements of the matrix $W$ are then $w_{ij}=1/\sqrt{n}$ if $x_{i}=x_{j}$ and $w_{ij}=0$ otherwise. Correspondingly the prior $P_{\rm prior}(x)$ is the probability distribution that picks with probability $1/r$ one of the above $r$ vectors. The function $f(A,B)$ defined via (13) is in this case

We will consider two different output channels

Gaussian additive noise $P_{\rm out}(y_{ij}|w_{ij})=\exp[-(y_{ij}-w_{ij})^{2}/(2\Delta)]/\sqrt{2\pi\Delta}$ . In this case the problem is interpreted as localization of submatrices having mean $1/\sqrt{n}$ in an otherwise zero-mean matrix.

Stochastic block model channel where $Y$ is its (dense) adjacency matrix created as $P_{\rm out}(y_{ij}=1|w_{ij})=p_{\rm out}+\mu w_{ij}$ and $P_{\rm out}(y_{ij}=0|w_{ij})=1-p_{\rm out}-\mu w_{ij}$ . That is, nodes from different groups are connected with probability $p_{\rm out}$ , and nodes from the same group with probability $p_{\rm in}=p_{\rm out}+\mu/\sqrt{n}$ . Both $p_{\rm out}=\Omega(1)$ , and $\mu=\Omega(1)$ . The Fisher score matrix has elements $S_{ij}=\mu/p_{\rm out}$ where $Y_{ij}=1$ , and $S_{ij}=-\mu/(1-p_{\rm out})$ where $Y_{ij}=0$ . The inverse Fisher information of the stochastic block model channel is given as

In the large $n$ limit the resulting graph will have average degree $np_{\rm out}$ .

Bayes-optimal inference in the stochastic block model was studied in the case of sparse networks (bounded average degree) . For the case of dense networks (linear average degree) and $\mu=\Omega(1)$ the Bayes-optimal MMSE and corresponding phase transitions are an original contribution of the present paper.

The state evolution from section II-D simplifies for the case of symmetric clustering in the setting above. One notices that if $Q^{t}$ and $M^{t}$ are of the symmetric form

then $Q^{t+1}$ and $M^{t+1}$ will be of the same form. Here $J_{r}$ is a $r\times r$ matrix filled with ones, and $I_{r}$ is a $r\times r$ identity matrix. Moreover since the sum of elements of the vectors $x$ is always $1$ we have the additional property that

Therefore in the end we only have one parameter $b$ in the state evolution, $b^{t}=1$ means that we have perfectly reconstructed the communities, $b^{t}=0$ is when there is no information and the estimator for every variables is $\left(\frac{1}{r},\cdots,\frac{1}{r}\right)$ .

The state evolution equation (28) can then be used to derive the evolution of the parameter $b$ under iterations. It follows that

We now observe that the above state evolution equation has always the uniform or uninformative fixed point $b_{u}=0$ . It is crucial to study the stability of this fixed point under iterations of (52). Depending on the number of groups $r$ and the noise parameter $\Delta$ there may be another stable fixed point of (52). When it exists we will call it $b_{\rm far}$ (it obviously depends on $\Delta$ and $r$ ). We expand (53) around $b_{u}$ to get

the uniform fixed point will be unstable and the iteration will converge away from it.

If we translate condition (56) back to the parameters of the stochastic block model we obtain that inference with AMP is possible for

This is the same condition as known from the sparse SBM and from standard spectral methods . Our contribution to these results is the Bayes-optimal value of the detection error, which is considerably better than the error obtained from the spectral algorithms evaluated in . Note e.g. that the error in the spectral algorithm is always continuous at the transition, whereas the Bayes-optimal error presents a discontinuity at the transitions for $r>4$ (see below).

By looking at the second order term in (55) we can get some information about where the iterations will converge.

If $r<4$ , $\frac{r-4}{2\Delta^{2}r^{4}}{b^{t}}^{2}$ is negative and close to $\Delta_{c}$ we will converge to $b_{\rm far}(\Delta)=2(\Delta_{c}-\Delta)/(4-r)+O[(\Delta_{c}-\Delta)^{2}]$ . In this case the fixed point is a continuous function of the noise.

If $r>4$ , things are different. Because the second derivative of (55) is positive then it is impossible for $b_{\rm far}$ to go to zero continuously as $\Delta$ goes to $\Delta_{c}$ (a simple plot of (55) should convince the reader of this fact). There is therefore a jump in the value of the fixed point as $\Delta$ crosses $\Delta_{c}$ . This is the signature of a first order phase transition.

In the case of first order phase transition there are always three transitions to discuss

The different phases have the following properties:

Undetectable phase without clusters, $\Delta_{\rm spinodal}<\Delta$ . Eq. (52) has only the uniform fixed point. Bayes optimal inference does not provide any information about the labeling of the nodes.

Undetectable phase with clusters, $\Delta_{\rm static}<\Delta<\Delta_{\rm spinodal}$ . Eq. (52) has two stable fixed point $b_{\rm u}$ and $b_{\rm far}$ . But $b_{\rm far}$ has a lower log-likelihood than $b_{\rm u}$ therefore the Bayes-optimal estimator is still not correlated with the planted solution.

Detectable but hard, $\Delta_{c}<\Delta<\Delta_{\rm static}$ . The fixed point $b_{\rm far}$ now has a larger log-likelihood than $b_{\rm u}$ . Bayes optimal estimator will find a configuration well correlated with the planted one, but the AMP algorithms will not. Analogous phase appears in many other inference problems and we conjecture that all known polynomial algorithms will fail in this phase.

Easy phase, $\Delta<\Delta_{c}$ . The uniform fixed point $b_{\rm u}$ is unstable and the AMP algorithm will converge to $b_{\rm far}$ . The problem is said easy since we have a polynomial algorithm that (at least is conjectured to) asymptotically reaches optimal reconstruction.

A theoretical analysis of $\Delta_{\rm spinodal}(r)$ and $\Delta_{\rm static}(r)$ is also possible as $r\to\infty$ . It relies on analysis of which exponential dominates the others in (53), for details see appendix D. We find

The most notable remark about these results is that in the limit of large rank the gap between what is statistically possible $\Delta_{\rm static}$ and what is algorithmically tractable $\Delta_{c}=1/r^{2}$ is different even in its order. Note also that the present phase transitions are equal to critical temperatures in the Potts glass model as computed in , this is because of the intimate relation between random and planted models . Note that these works also derived rank $r=4$ as the point where the second order phase transitions changes into a first order one.

III-B Phase diagrams

In this section we illustrate the behavior of the fixed points of the AMP algorithm and of the state evolution. In all the presented experiments we iterate the corresponding equations till convergence. To investigate the MMSE and the phase transitions we can initialize the AMP algorithm and the SE equations in two different ways:

Uniformative initialization: for the SE equations this means $b^{t=0}=\delta$ , where $\delta$ is very small. For the AMP algorithm this means $a_{i}=(\frac{1}{r},\ldots,\frac{1}{r})^{\top}+\delta_{i}$ and $v_{i}^{t}=I_{r}/r$ . This is the initialization with which we would work if we were working with real data.

Informative initialization: for the SE equations this means $b^{t=0}=1$ . For the AMP algorithm this means $a_{i}=x_{i}$ and $v_{i}^{t}=0$ , i.e. initiate the estimators to be equal to the planted solution that we are trying to recover.

In community detection we usually evaluate an estimator that maximizes the number of correctly assigned nodes. For this we need to take the index of the maximum component of $a_{i}$ . For numerical comparisons between the state evolution and the AMP algorithm we will rather evaluate the mean-squared error ${\rm MSE}=1-Q$ , where $Q$ is the order parameter (26).

In figure 1 we illustrate the fact that in the large $n$ limit different output channels having the same inverse Fisher information $\Delta$ have equivalent performance. We plot the MSE obtained from the state evolution, compare to the one obtained from the AMP algorithm on a single instance of size $n=20000$ with either a Gaussian additive noise channel or the stochastic block model channel. Indeed the three cases agree.

In figure 2 we depict the first order phase transition that occurs for $r>4$ . We present the MSE obtained from density evolution as function of $r^{2}\Delta$ for various ranks $r=10,15,20$ . From the uninformative initialization the MSE jumps to $1-1/r$ at $\Delta_{c}=1/r^{2}$ . From the informative initialization the jump occurs at $\Delta_{\rm spinodal}>\Delta_{c}$ and by comparing the Bethe free energy of the two fixed points we evaluate the $\Delta_{\rm static}$ . We observe that already for these values of the rank the gap between the information theoretical and computationally tractable performance, i.e. between $\Delta_{c}$ and $\Delta_{\rm static}$ is very large. Another interesting observation we made concerns the value of the MSE at the $\Delta_{c}=1/r^{2}$ transition. The reconstruction is very close to exact even at the phase transition itself. More precisely it decreases roughly exponentially with the rank $r$ , being close to $10^{-10}$ for rank $r=100$ .

In figure 3 we plot the values of $\Delta_{\rm spinodal}$ and $\Delta_{\rm static}$ rescaled by $r\log{r}$ as a function of the rank $r$ . As the rank grows $r\to\infty$ the static line will converge to $1/4$ and the spinodal one to $1/2$ (59-60). Whereas the static transition converges nicely to its asymptotic value, for the spinodal transition we are still quite far from the asymptotic regime.

IV Conclusions and perspectives

We considered here the problem of estimation of a low rank matrix that was observed element-wise trough an arbitrary noisy channel. Using approximate message passing and its state evolution we compute the Bayes-optimal MMSE in this problem. The most interesting and, comparing to previously studied cases, surprising conclusion is that the MMSE depends on the output channel only trough its Fisher information.

As an example of the considered setting we study the problem of clustering in dense stochastic block model with the difference between probabilities of connection within and between communities scaling as $O(1/\sqrt{n})$ . We evaluate the phase transitions and MMSE in this regime, unveiling a wide region of computational hardness.

Both the AMP algorithms and the state evolution apply to a number of other setting considered in the literature, which is a natural direction of future work.

V Acknowledgements

We thank Madhu Advani for clarifying discussion about Fisher information. The research leading to these results has received funding from the European Research Council under the European Union’s $7^{th}$ Framework Programme (FP/2007-2013)/ERC Grant Agreement 307087-SPARCS.

References

Appendix A

To derive the state evolution let us define

We then expand the probability $P_{\rm out}(y|w)$ around 0

Using (63) in (62) and using (16) and (16) we get

where $a_{i}$ is the mean of $a_{i\rightarrow il}$ with respect to $y_{ij}$ . Using (63) and the fact that $a_{l\rightarrow il}$ and $a_{k\rightarrow ik}$ are independent random variables we can compute the second moments of the $G_{ij}$ as

From the central limit theorem we deduce.

where $W_{i}$ is a Gaussian random variable of mean 0 and of covariance $KQ^{t}K/\Delta$ .

Appendix B

The Bethe free energy in the case where all the factors are between two variables reads as

For the second term we expand once again $P_{\rm out}\left(y_{ij}|w_{ij}\right)$ around 0 and then take the mean with respect to each message. We then take the log and find

The TAPification procedure that is not detailed here gives us

We also remove the term $g(y_{ij},0)$ since it is a constant that does not depend on the messages. We replace (72), (17) and (15) in (71) and we get

Since the sum is on all undirected links $ij$ we do the sum on all directed link and divide by $2$

To get the Bethe-free-energy in the large $n$ limit we compute the mean of $\frac{a_{i\rightarrow ij}^{\top}Ka_{j\rightarrow ij}}{\sqrt{n}}$ with respect to the noise. The main idea remains the same we compute the mean with respect to a perturbation of $P_{\rm out}$

To compute the mean of (70) we use the fact that we know the distribution of the $B_{i}$ (66). We replace (76) in (71) and we get

Appendix C

To find the $\Delta_{\rm static}$ one can notice that by taking the derivative with respect to $Q$ and $M$ of (30) one finds a combination of (28) and (29). Therefore we have

We deduce a way to compute $\Delta_{\rm static}$ as

To compute $\Delta_{\rm static}$ we evaluate ${\cal M}_{r}$ on a whole interval then for each $x$ draw a line between point $(0,0)$ and $(x,H(x))$ . We then compute the area between ${\cal M}_{r}$ and this line. When this aera is zero then ${\cal M}_{r}(x)/{x}$ gives us $\Delta_{\rm static}$ .

Appendix D

In this appendix we explain the derivation of (59), (60). To do this let us study the function ${\cal M}_{r}(x)$ where we take $x=\beta r\log(r)$ , where $\beta=\Omega(1)$ . The important part of ${\cal M}_{r}$ is this integral

If $F_{1}$ dominates $F_{2}$ as $r\rightarrow+\infty$ then ${\cal M}_{r}=1$ , otherwise if $F_{2}$ dominates $F_{1}$ then ${\cal M}_{r}=0$ .

To estimate $F_{1}$ , $F_{2}$ let us notice that with high probability the maximum value will be of order $\sqrt{2\beta\log(r)}$ . This is a general property of Gaussian variables that the maximum of $r$ independent Gaussian variables of variance $\sigma^{2}$ is of order $\sigma\sqrt{2\log(r)}$ . We can therefore compute the mean of $F_{2}$ all while conditioning on the fact that all of the $u_{i}$ are smaller than $\sqrt{2\beta\log(r)}$ . This allows us to compute the typical value of $F_{1}$ as $F_{1}\sim r^{\beta}$ . For $F_{2}$ we obtain: when $\beta<1/2$ then $F_{2}\sim r^{\beta+\frac{1}{2}}$ , and when if $\beta>1/2$ then $F_{2}\sim r^{\sqrt{2\beta}}$ . We have to look at which of the $F_{1}$ or $F_{2}$ has a higher exponent. In these computation we can therefore just assume that

Now let us remind (78) while keeping $x=\beta r\log(r)$

To get the static transition we use (80). Let us find the $\beta$ that satisfies equation (80) we get