Phase transitions and sample complexity in Bayes-optimal matrix factorization

Yoshiyuki Kabashima, Florent Krzakala, Marc Mézard, Ayaka Sakata, Lenka Zdeborová

math.NA cond-mat.stat-mech cs.IT cs.LG stat.ML

I Introduction

We study in this paper a variety of questions which all deal with the general problem of matrix factorization. Generically, this problem is stated as follows: Given a $M\times P$ dimensional matrix $Y$ , that was obtained from noisy element-wise measurements of a matrix $Z$ , one seeks a factorization $Z=FX$ , where the $M\times N$ dimensional matrix $F$ and the $N\times P$ dimensional matrix $X$ must satisfy some specific requirements like sparsity, low-rank or non-negativity.

From a machine learning point of view, matrix factorization can be applied to unsupervised learning of data representation Bengio et al. (2013). The success of machine learning, including recent progress such as deep learning LeCun et al. (2015), depends largely on data representations. Explicit approaches to efficient data representation, such as matrix factorization, are hence of wide relevance. Other applications that can be formulated as matrix factorization include dictionary learning or sparse coding Olshausen et al. (1996); Olshausen & Field (1997); Kreutz-Delgado et al. (2003), sparse principal component analysis Zou et al. (2006), blind source separation Belouchrani et al. (1997), low rank matrix completion Candès & Recht (2009); Candès & Tao (2010) or robust principal component analysis Candès et al. (2011), that will be described below.

Theoretical limits on when matrix factorization is possible and computationally tractable are still rather poorly understood. In this work we make a step towards this understanding by predicting the limits of matrix factorization and its algorithmic tractability when $Z$ is created using randomly generated matrices $F$ and $X$ , and measured element-wise via a known noisy output channel $P_{\rm out}(Y|Z)$ . Our results are derived in the limit where $N,M,P\to\infty$ with fixed ratios $M/N=\alpha$ , $P/N=\pi$ . We predict the existence of sharp phase transitions in this limit and provide the explicit formalism to locate them.

We use two types of methods in this paper. The first one is based on a generalization of approximate message passing (AMP) Donoho et al. (2009) to the matrix factorization problem, and on its asymptotic analysis which is known in statistical physics as the cavity method Mézard et al. (1987); Mézard & Montanari (2009), and has been called state evolution in the context of compressed sensing Donoho et al. (2009). The second method that we use in the following is the replica method. These two methods are widely believed to be exact in the context of theoretical statistical physics, but most of the results that we shall obtain in the present work are not rigorously established. Our predictions have thus the status of conjectures. A first cross-check of the correctness of these conjectures is the fact that the two methods give identical results. This has been understood first in the context of spin glasses Mézard et al. (1987).

This work builds upon some previous steps that we described in earlier reports Sakata & Kabashima (2013); Krzakala et al. (2013). The message passing algorithm related to our analysis was first presented in Krzakala et al. (2013) and is very closely related to the Big-AMP algorithm developed and tested in Schniter et al. (2012); Parker et al. (2013, 2014); relations and differences with Big-AMP will be mentioned in several places throughout the paper. Our main focus here, beside the detailed derivation of the algorithm, is the asymptotic analysis and phase diagrams which were not studied in Schniter et al. (2012); Parker et al. (2013, 2014). We also discuss several variants of the AMP algorithm that are interesting for theoretical reasons. Several of these variants, however, have convergence problems when implemented straightforwardly. For a robust implementation of the algorithm that can be used on practical benchmarks we refer to the works Parker et al. (2013, 2014).

Our general method provides a unifying framework for the study of computational tractability and identifiability of various matrix factorization problems. The first step for this synergy is the formulation of the problem via a graphical model (see Fig. 1) that is amenable to analysis using the present methods.

The phase diagrams that we shall derive establishes for each problem two types of thresholds in the plane $\alpha-\pi$ : the threshold where the problem of matrix factorization ceases to be solvable in principle, and the threshold where AMP ceases to find the best solution. In most existing works the computation of phase transitions was treated separately for each of the various problems. For instance, redundant dictionaries for sparse representations and low rankness are usually thought as two different kinds of dimensional reduction. Interestingly, in our work a wide class of problems is treated within one unified formalism; this is theoretically interesting in the context of recent developments Elad (2012).

The element-wise measurement $y_{\mu l}$ of $z_{\mu l}$ is then specified by some known probability distribution function $P_{\rm out}(y_{\mu l}|z_{\mu l})$ , so that:

The goal of matrix factorization is to estimate both matrices $F$ and $X$ from the measurements $Y$ .

In this paper we will treat this problem in the framework of Bayesian inference. In particular we will assume that the matrices $F$ and $X$ were both generated from a known separable probability distribution

Although we restrict to separable prior probability distributions $P_{F}(F)$ , $P_{X}(X)$ , it turns out that these priors can encode a broad range constraints such as, for instance sparsity. This is the reason why so many different problems can be studied within our scheme. The output channel $P_{\rm out}(y,z)$ can be of a rather generic nature, thus including various kinds of additive and multiplicative noise; in the spirit of activation functions from neural networks it can degrade the information included in the measurement by e.g. keeping only the sign of elements of $z$ .

The posterior distribution of $F$ and $X$ given the measurements $Y$ is written as

where ${\cal Z}(Y)$ is the normalization constant, known as the partition function in statistical physics.

Notice that, while the original problem of finding $F$ and $X$ , given the measurements $Y$ , is not well determined (because of the possibility to obtain, from a given solution, an infinity of other solutions through the transformation $F\to FU^{-1}$ and $X\to UX$ , where $U$ is any $N\times N$ nonsingular matrix), the fact of using well defined priors $P^{\mu i}_{F}$ and $P^{il}_{X}$ actually lifts the degeneracy: the problem of finding the most probable $F,X$ given the measurements and the priors is well defined. In case the priors $P^{\mu i}_{F}$ and $P^{il}_{X}$ do not depend on the indices $\mu l$ and $il$ we are left with a permutational symmetry between the $N$ column of $F$ and $N$ rows of $X$ . Both in the algorithm and the asymptotic analysis this symmetry is broken and one of the $N\!$ solutions is chosen at random.

Typically, in most applications, the distributions $P_{\rm out}$ , $P_{F}$ and $P_{X}$ will depend on a set of parameters (such as the mean, variance, sparsity, noise strength, etc.) that we usually will not write explicitly in the general case, in order to simplify the notations. The prior knowledge of these parameters is not necessarily required in our approach: these parameters can be learned via an expectation-maximization-like algorithm that we will discuss briefly in section II.5.

Note also that Eq. 1 can be multiplied by an arbitrary constant: with a corresponding change in the output function the problem will not be modified. In the derivations of this paper we choose the above constant in such a way that the elements of matrices $X$ , $Y$ , and $Z$ are of order $O(1)$ , whereas the elements of $F$ scale in a consistent way, meaning that the mean of each $F_{\mu i}$ is of order $O(1/N)$ and its variance is also of order $O(1/N)$ .

I.2 Bayes-optimal inference

Our paper deals with the general case of incomplete information. This happens when the reconstruction assumes that the matrices $X$ , $F$ and $Y$ were generated with some distributions $P_{X}$ , $P_{F}$ and $P_{\rm out}(Y|Z)$ , whereas in reality the matrices were generated using some other distributions $P_{X^{0}}$ , $P_{F^{0}}$ and $P^{0}_{\rm out}(Y|Z)$ . The message passing algorithm and its asymptotic evolution will be derived in this general case.

However, our most important results concern the Bayes-optimal setting, i.e. when we assume that

In this case, an estimator $X^{\star}$ that minimizes the mean-squared error (MSE) with respect to the original signal $X^{0}$ , defined as

is obtained from marginals of $X_{il}$ with respect to the posterior probability measure $P(F,X|Y)$ , i.e.,

is the marginal probability distribution of the variable $il$ . The mean squared error achieved by this optimal estimator is called the minimum mean squared error (MMSE) in this paper.

A similar result holds for the estimator of $F^{0}$ that minimizes the mean-squared error

which is obtained from the mean of $F_{\mu i}$ with respect to the posterior probability measure $P(F,X|Y)$ . In the remainder of this article we will be using these estimators.

I.3 Statement of the main result

The main result of this paper are explicit formulas for the MMSE achievable in the Bayes optimal setting (as defined above) for the matrix factorization problem in the “thermodynamic limit”, i.e. when $N,M,P\to\infty$ with fixed ratios $M/N=\alpha$ , $P/N=\pi$ . When sparsity is involved we consider that a finite fraction of matrix elements are non-zero. Similarly, when we treat matrices with low ranks we consider again the ranks to be a finite fraction of the total dimension. We also derive the AMP-MSE, i.e. the mean square error achievable by the approximate message passing algorithm as derived in this paper.

So far we were characterizing the output channel by the conditional distribution $P^{0}_{\rm out}(y|z^{0})$ or $P_{\rm out}(y|z)$ . It will be useful to think of the output as a deterministic function $h$ of $z$ and of random variables $w$ , i.e. $y_{\mu l}=h(z_{\mu l},w_{\mu l})=h^{0}(z^{0}_{\mu l},w_{\mu l}^{0})$ . The random (“noise”) variables $w$ and $w^{0}$ are specified by their probability distributions $P(w)$ and $P_{0}(w^{0})$ . We can relate $P_{\rm out}$ to $h$ as follows

To compute the MMSE and AMP-MSE we need to analyze the fixed points of the following iterative equation.

The output function $g_{\rm out}$ is defined using the output probability $P_{\rm out}$ as

The AMP-MSE is obtained from a fixed point reached with a so-called uninformative initialization. The uninformative initialization does not use any information about the seeked matrices $F^{0},X^{0}$ , it uses only the prior distributions, it is defined as

To compute the MMSE we need to initialize the iterations in an informative way, using the knowledge of the true $X^{0}$ and $F^{0}$ . This informative initialization is defined as an infinitesimal perturbation of

If the resulting fixed point agrees with the AMP-MSE then this is also the MMSE. In case that this informative initialization leads to a different fixed point than the AMP-MSE then the MMSE is given by the one of them for which the following free entropy is larger

Sections III, IV, V are devoted to the derivation of these formulas for MMSE and AMP-MSE. In section VI we then give a number of examples of phase diagram for specific applications of the matrix factorization problem.

We reiterate at this point that whereas our results are exact results within the theoretical physics scope of the cavity/replica method, we do not provide their proofs and their mathematical status is that of conjectures. We devote section II.3 to explaining the reasons of why this is so. The situation is comparable to the predictions-conjectures that were made for the satisfiability threshold Mézard et al. (2002), which have been partly established rigorously recently Ding et al. (2014), or the predictions-conjectures that were made for the CDMA problem Tanaka (2002); Guo & Verdú (2005), and were later proved by Montanari & Tse (2006); Bayati & Montanari (2011).

I.4 Applications of matrix factorization

Several important problems which have received a lot of attention recently are special cases of the matrix factorization problem as set above. In this paper we will analyse the following ones.

This is a basic example of finding a representation of data that is advantageous in some way. Representation learning Bengio et al. (2013) is a concept behind many machine learning applications, including the deep learning framework which has become very popular in recent time LeCun et al. (2015). In the context of representation learning dictionary learning is sometimes referred to as sparse coding Bengio et al. (2013).

Dictionary learning relies on the fact that signals of interest are often sparse in some basis; this property is widely used in data compression and more recently in compressed sensing. A lot of work has been devoted to analyzing bases in which different data are sparse. The goal of dictionary learning is to infer a basis in which the data are sparse based purely on a large number of samples from the data. The $M\times P$ matrix $Y$ then represents the $P$ samples of $M$ -dimensional data. The goal is to decompose $Y=FX+W$ into a $M\times N$ matrix $F$ , and a $N\times P$ sparse matrix $X$ , $W$ is the noise.

In this paper we will analyse the following teacher-student scenario of dictionary learning. We will generate a random Gaussian matrix $F^{0}$ with iid elements of zero mean and variance $1/N$ , and a random Gauss-Bernoulli matrix $X^{0}$ with fraction $0<\rho<1$ of non-zero elements. The non-zero elements of $X^{0}$ will be iid Gaussian with mean $\overline{X}$ and variance $\sigma$ . The noise $W^{0}$ with elements $w^{0}_{\mu l}$ is also iid Gaussian with zero mean and variance $\Delta$ . We hence have

where $\delta(X)$ is the Dirac delta function. The goal is to infer $F^{0}$ and $X^{0}$ from the knowledge of $Y$ with the smallest possible number of samples $P$ .

In the noiseless case, $\Delta=0$ , exact reconstruction might be possible only when the observed information is larger that the information that we want to infer. This provides a simple counting bound on the number of needed samples $P$ :

Note that the above assumptions on $P_{F}$ , $P_{X}$ and $P_{\rm out}$ likely do not hold in any realistic problem. However, it is of theoretical interest to analyze the average theoretical performance and computational tractability of such a problem, as it gives a well defined benchmark. Moreover, we anticipate that if we develop an algorithm working well in the above case it might also work well in many real applications where the above assumptions are not satisfied, in the same spirit as the approximated message passing algorithm derived for compressed sensing with zero mean Gaussian measurement matrices Donoho et al. (2009) works also for other kinds of matrices.

Typically, we would look for an invertible basis $F^{0}$ with $N=M$ , in that case we speak of a square dictionary. However, with the compressed-sensing application in mind, it is also very interesting to consider that $Y$ might be under-sampled measurements of the actual signal, corresponding then to $\alpha=M/N<1$ . Hence we will be interested in the whole range $0<\alpha\leq 1$ . The regime of $\alpha<1$ corresponds to an overcomplete dictionary, in which case each of the $P$ measurements $\vec{y}_{l}$ is a sparse linear combination of the columns (atoms) of the dictionary.

We remind that the $N$ columns of the matrix $F^{0}$ can always be permuted arbitrarily and multiplied by $\pm 1$ . This is also true for rows of the matrix $X^{0}$ : all these operations do not change $Y$ , nor the posterior probability distribution. This is hence an intrinsic freedom in the dictionary learning problems that we have to keep in mind. Note that many works consider the dictionary to be column normalized, which lifts part of the degeneracy in some optimization formulations of the problem. In our setting the equivalent of column normalization is asymptotically determined by the properties of the prior distribution.

In dictionary learning one does not have any specific information about the elements $F^{0}_{\mu i}$ . However in some applications of compressed sensing one might have an approximate knowledge of the measurement matrix: it is often possible to use known samples of the signal $X$ in order to calibrate the matrix in a supervised manner (i.e. using known training samples of the signal). Sometimes, however, the known training samples are not available and hence the only way to calibrate is to measure a number of unknown samples and perform their reconstruction and calibration of the matrix at the same time, such a scenario is called the blind calibration.

In blind matrix calibration, the properties of the signal and the output function are the same as in dictionary learning, eqs. (22-23). As for the matrix elements $F^{0}_{\mu i}$ one knows a noisy estimation $F^{\prime}_{\mu i}$ . In this work we will assume that this estimation was obtained from $F^{0}_{\mu i}$ as follows

where $\xi_{\mu i}$ is a Gaussian random variables with zero mean and variance $1/N$ . This way, if the matrix elements $F^{0}_{\mu i}$ have zero mean and variance $1/N$ , then the same is true for the elements $F^{\prime}_{\mu i}$ .

The control parameter $\eta$ is then quantifying how well one knows the measurement matrix. It provides a way to interpolate between the pure compressed sensing $\eta=0$ , where one knows the measurement matrix $F^{0}$ , and the dictionary learning problems $\eta\to\infty$ . Explicitly, the prior distribution of a given element of the matrix $F$ is

where ${\cal N}(a,b)$ is a Gaussian distribution with mean $a$ and variance $b$ .

Another special case of matrix factorization that is often studied is the low-rank matrix completion. In that case one “knows” only a small (but in our case finite when $N\to\infty$ ) fraction $\epsilon$ of elements of the $M\times P$ matrix $Y$ . Also one knows which elements are known and which are not; let us call ${\cal M}$ the set on which elements are known, it is a set of size $\epsilon MP$ . In this case the output function is:

The precise choice on the function on the second line is arbitrary as long as it does not depend on the $z_{\mu l}$ . In what follows we will assume that the $\epsilon MP$ known elements were chosen uniformly at random.

In low rank matrix completion $N$ is small compared to $M$ and $P$ , hence both $\pi$ and $\alpha$ are relatively large. Note, however, that the limit we analyse in this paper keeps $\pi=O(1)$ and $\alpha=O(1)$ while $N\to\infty$ , whereas in many previous works on low-rank matrix completion the rank was considered to be $O(1)$ and hence $\alpha$ and $\pi$ of order $O(N)$ . Compared to those works the analysis here applies to “not-so-low-rank” matrix completion. The question is what fraction $\epsilon$ of elements of $Y$ needs to be known in order to be able to reconstruct the two matrices $F^{0}$ and $X^{0}$ .

For negligible measurement noise, $\Delta=0$ , a simple counting bound gives that the fraction of known elements we need for reconstruction is at least

Again we will study the student-teacher scenario when a low-rank $Z$ is generated from $X^{0}$ and $F^{0}$ having iid elements distributed according to eq. (21) and (22) with $\rho=1$ (no sparsity). To construct $Y$ we keep a random fraction $\epsilon$ of elements of $Z$ , and the goal is to reconstruct $X^{0}$ and $F^{0}$ from that knowledge.

We also note that variants on matrix completion where the output channel $P_{\rm out}(y,z)$ keeps only the sign of $z$ were also considered, called 1-bit matrix completion, and have some interesting properties reported in Davenport et al. (2014); Davenport & Romberg (2015).

Principal component analysis (PCA) is a well-known tool for dimensional reduction. One usually considers the singular value decomposition (SVD) of a given matrix and keeps a given number of largest values, thus minimizing the mean square error between the original matrix and its low-rank approximation. The SVD is computationally tractable, and provides the minimization of the mean square error between the original matrix and its low-rank approximation. However, with additional constraints there is no general computationally tractable approach.

A variant of PCA that is relevant for a number practical application requires that one of the low-rank components is sparse. The goal is then to approximate a matrix $Y$ by a product $FX$ where $F$ is a tall matrix, and $X$ a wide sparse matrix. The teacher-student scenario for sparse PCA that we will analyse in this paper uses eq. (21-23) and the matrix dimensions are such that $\pi=P/N$ and $\alpha=M/N$ are both large, but still of order $O(1)$ , and comparable one to the other. Hence it is only the region of interest for $\alpha$ and $\pi$ that makes this problem different from the dictionary learning. Note that, in the same way as for the matrix completion, many works in the literature consider $N=O(1)$ whereas here we have $N\to\infty$ in such a way that $\pi=P/N=O(1)$ and $\alpha=M/N=O(1)$ . Hence we work with low rank, but not as low as most of the existing literature. For a statistical physics study following the lines of the present paper where the rank is $O(1)$ see Lesieur et al. (2015).

In the zero measurement noise case, $\Delta=0$ , the simple counting bound gives that the rank $N$ for which the reconstruction problem may be solvable needs to be smaller than

One important application where sparse PCA is relevant is the blind source separation problem. Given $N$ $\rho$ -sparse (in some known basis) source-signals of dimension $P$ , they are mixed in an unknown way via a matrix $F$ into $M$ channel measurements $Y$ . In blind source separation typically both the number of sources $N$ and the number of channels (sensors) $M$ are small compared to $P$ . When $N<M$ we obtain an overdetermined problem which may be solvable even for $\rho=1$ . More interesting is the undetermined case with the number of sensors smaller than the number of sources, $M<N$ , which would not be solvable unless the signal is sparse $\rho<1$ (in some basis), in that case the bound (29) applies.

Another variant of PCA that arises often in practice is the robust PCA, where the matrix $Y$ is very close to a low rank matrix $FX$ plus a sparse full rank matrix. The interpretation is that $Y$ was created as low rank but then a small fraction of elements was distorted by a large additive noise. The resulting $Y$ is hence not low rank.

In this paper we will analyse a case of robust PCA when $F$ and $X$ are generated from eq. (21-22) with $\rho=1$ and the output function is

where $\epsilon$ is the fraction of elements that were not largely distorted, $\Delta_{s}\ll 1$ is the small measurement noise on the non-distorted elements, $\Delta_{l}$ is the large measurement noise on the distorted elements. We will require $\Delta_{l}\approx\overline{X}^{2}+\sigma$ to be comparable to the variance of $z_{\mu l}$ such that there is no reliable way to tell which elements were distorted by simply looking at the distribution of $y_{\mu l}$ . The parameters regime we are interested in here is $\pi$ and $\alpha$ both relatively large and comparable one to another.

In robust PCA in the zero measurement noise case, $\Delta_{s}=0$ , the simple counting bound gives that the fraction of non-distorted elements $\epsilon$ under which the reconstruction may still be solvable needs to satisfy the same bound as for matrix completion (28). Indeed this counting bound does not distinguish between the case of matrix completion when the position of known elements of $Y$ is known and the case of RPCA when their positions are unknown.

The characteristic feature of FA is to take into account the site dependence of the output function as

In many application of matrix factorization it is known that both the coefficient of the signals $X$ and the coefficients of the dictionary $F$ must be non-negative. In our setting this can be taken into account easily by imposing the distributions $P_{X}$ and $P_{F}$ to have non-negative support. In the student-teacher scenario of this paper we can hence consider the nonzero elements of $X$ to be $P_{X}(X)=0$ for $X<0$ and $P_{X}(X)=N(\overline{X},\sigma)$ for $X>0$ , and analogously for $P_{F}$ .

I.5 Related work and positioning of our contribution

Matrix factorization and its special cases as mentioned above are well-studied problems with extensive theoretical and algorithmic literature that we cannot fully cover here. We will hence only give examples of relevant works and will try to be exhaustive only concerning papers that are very closely related to our work (i.e. papers using message-passing algorithms or analyzing the same phase transitions in the same scaling limits).

The dictionary learning problem was identified in the work of Olshausen et al. (1996); Olshausen & Field (1997) in the context of image representation in the visual cortex, and the problem was studied extensively since Kreutz-Delgado et al. (2003). Learning of overcomplete dictionaries for sparse representations of data has many applications, see e.g. Lewicki & Sejnowski (2000). MOD Engan et al. (1999) and K-SVD Aharon et al. (2006) are two representative algorithms for the dictionary learning. Several authors studied the identifiability of the dictionary under various (in general weaker than ours) assumptions, e.g. Michal Aharon (2006); Vainsencher et al. (2011); Jenatton et al. (2012); Spielman et al. (2013); Arora et al. (2013); Agarwal et al. (2013); Gribonval et al. (2013). An interesting view on the place of sparse and redundant representations in todays signal processing is given in Elad (2012).

A nice overview of recent progress on low-rank matrix factorization from incomplete observations is given in Davenport & Romberg (2015). The two closely related problems of sparse principal component analysis or blind source separation is also explored in a number of works, see e.g. Lee et al. (1999); Zibulevsky & Pearlmutter (2001); Bofill & Zibulevsky (2001); Georgiev et al. (2005); d’Aspremont et al. (2007). A short survey on the topic with relevant references can be found in Gribonval et al. (2006). Matrix completion is another problem that belongs to the class treated in this paper. Again, many important works were devoted to this problem giving theoretical guarantees, algorithm and applications, see e.g. Candès & Recht (2009); Candès & Tao (2010); Candes & Plan (2010); Keshavan et al. (2010). Another related problem is the robust principal component analysis that was also studied by many authors; algorithms and theoretical limits were analyzed in Chandrasekaran et al. (2009); Yuan & Yang (2009); Wright et al. (2009); Candès et al. (2011).

Our work differs from the mainstream of existing literature in its general positioning. Let us mention here some of the main differences with most other works.

Most existing works concentrate on finding theoretical guarantees and algorithms that work in the worst possible case of matrices $F$ and $X$ . In our work we analyze the typical cases when elements of $F$ and $X$ are generated at random. Arguably a worst case analysis is useful for practical applications in that it provides some guarantee. On the other hand, in some large-size applications one can be confronted in practice with a situation which is closer to the typical case that we study here. Our typical-case analysis provides results that are much tighter than those usually obtained in literature in terms of both achievability and computational tractability. For instance our results imply much smaller sample complexity, or much smaller mean-squared error for a given signal-to-noise ratio, etc.

Our main contribution is the asymptotic phase diagram of Bayes-optimal inference in matrix factorization. Special cases of our result cover important problems in signal processing and machine learning. In the present work we do not concentrate on the validation of the associated approximate message-passing algorithm on practical examples, nor on its comparison with existing algorithms. A contribution in this direction can be found in Krzakala et al. (2013); Parker et al. (2014).

A large part of existing machine-learning and signal-processing literature provides theorems. Our work is based on statistical physics methods that are conjectured to give exact results. While many results obtained with these methods on a variety of problems have been indeed proven later on, a general proof that these methods are rigorous is not known yet.

Many existing algorithms are based on convex relaxations of the corresponding problems. In this paper we analyze the Bayes-optimal setting. Not surprisingly, this setting gives a much better performance than convex relaxation. The reason why it is less studied is that it is often considered as hopelessly complicated from the algorithmic perspective; however some recent results using message-passing algorithms which stand at the core of our analysis have shown very good performance in the Bayes-optimal problem. Besides, the Bayes-optimal offers an optimal benchmark for testing algorithmic methods.

When treating low-rank matrices we consider the rank to be a finite fraction of the total dimension, whereas most of existing literature considers the rank to be a finite constant. The AMP algorithm derived in this paper can of course be used as a heuristics also in the constant rank case, but our results about MMSE are exact only in the limit where the rank is a finite fraction of the total dimension. Also note that for the special case of rank one the AMP algorithm was derived and analysed rigorously Rangan & Fletcher (2012); Deshpande & Montanari (2014).

When treating sparsity we consider the number of non-zeros is a finite fraction of the total dimension, whereas existing literature often considers a constant number of non-zeros. Again the AMP algorithm derived in this paper can of course be used as a heuristics also in the constant number of non-zeros case, but our results about MMSE are exact only in the limit where the density is a finite fraction of the total dimension.

The paper is organized as follows: First, Sec. II, we explain the statistical physics background for the Bayes-optimal inference. In Sec. III we give a detailed derivation of the approximate message passing algorithm for the matrix factorization problem. This algorithm is equivalent to maximization of the Bethe free entropy, whose expression is discussed in Sec. IV. These first two sections thus state our algorithmic approach to matrix factorization. The asymptotic performance of the AMP algorithm and the Bayes-optimal MMSE are analyzed in Sec. V using two technics: i) the state evolution of the AMP algorithm and ii) the replica method. As usual, the two methods are found to agree and to give the same predictions. We then use these results in order to study some exemples of matrix factorization problems in Sec. VI. In particular we derive the phase diagram, the MMSE and the sample complexity of dictionary learning, blind matrix calibration, sparse PCA, blind source separation, low rank matrix completion, robust PCA and factor analysis. Our results are summarized and discussed in the conclusion in Sec. VII.

II Elements of the background from statistical physics

There are important identities that hold for the Bayes-optimal inference and that simplify many of the calculations that follow. In the physics of disordered systems these identities are known as the Nishimori identities Opper & Haussler (1991); Iba (1999); Nishimori (2001). Basically, they follow from the fact that the true signal $F^{0}$ , $X^{0}$ is an equilibrium configuration with respect to the Boltzmann measure $P(F,X|Y)$ (5). Hence many properties of the true signal $F^{0}$ , $X^{0}$ can be computed by using averages over the distribution $P(F,X|Y)$ even if one does not know $F^{0},X^{0}$ precisely (5).

In order to derive the Nishimori identities, we need to define three types of averages: the thermodynamic average, the double thermodynamic average, and the disorder average.

Consider a function $A(F,X)$ depending on a “trial” configuration $F$ and $X$ . We define the “thermodynamic average” of $A$ given $Y$ as:

where $P(F,X|Y)$ is given by Eq. (5). The thermodynamic average $\langle A(F,X)\rangle_{F,X|Y}$ is a function of $Y$ .

Similarly, for a function $A(F_{1},X_{1},F_{2},X_{2})$ that depends on two trial configurations $X_{1}$ , $F_{1}$ and $X_{2}$ , $F_{2}$ , we define the “double thermodynamic average” of $A$ given $Y$ as:

For a function $B$ that depends on the measurement $Y$ and on the true signal $F^{0},X^{0}$ , we define the “disorder average” as

Note that if the quantity $B$ depends only on $Y$ , then we have

This is simply because the partition function ${\cal Z}(Y)$ is ${\cal Z}(Y)=\int{\rm d}X^{0}\,{\rm d}F^{0}\,P_{X}(X^{0})\,P_{F}(F^{0})\,P_{\rm out}(Y|F^{0}X^{0})$ .

Let us now derive the Nishimori identities. We consider a function $A(F,X,F^{0},X^{0})$ that depends on the trial configuration $F,X$ and on the true signal $F^{0},X^{0}$ . Its thermodynamic average $\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y}$ is a function of the measurement $Y$ and the true signal $F^{0},X^{0}$ . The disorder average of this quantity can be written as

In this last expression, renaming $F,X$ to $F_{1},X_{1}$ and $F^{0},X^{0}$ to $F_{2},X_{2}$ , we see that the average over $F,X,F^{0},X^{0}$ is nothing but the double thermodynamic average:

where in the last step we have used the form (35) of the disorder average.

is the general form of the Nishimori identity. Written in this way, it holds for many inference problems where the model for signal generation is known.

Self-averaging is a property that is assumed to hold for many quantities in statistical physics. Self-averaging means that in the thermodynamic limit (as introduced in the first paragraph of section I.3) the distribution of $\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y}$ concentrates with respect to the realization of disorder $Y,F^{0},X^{0}$ . In other words for large system sizes the quantity $\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y}$ converges with probability one to its average over disorder, $[\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y}]_{F^{0},X^{0},Y}$ . Self-averaging implies the existence of the thermodynamic limit and is in general very challenging to prove rigorously. Self-averaging makes the Nishimori identity very useful in practice.

To give one particularly useful example, let us define $m_{X}=(1/(PN))\,\sum_{il}X^{0}_{il}X_{il}$ and $q_{X}=(1/(PN))\,\sum_{il}X^{1}_{il}X^{2}_{il}$ . The Nishimori identity states that $[\langle m_{X}\rangle_{F,X|Y}]_{F^{0},X^{0},Y}=[\langle\langle q_{X}\rangle\rangle]_{F_{1},X_{1},F_{2},X_{2}|Y}]_{Y}$ . Assuming that the quantities $m_{X}$ and $q_{X}$ are self-averaging, we obtain in the thermodynamic limit, for almost all $Y$ : $\langle m_{X}\rangle_{F,X|Y}=\langle\langle q_{X}\rangle\rangle_{F_{1},X_{1},F_{2},X_{2}|Y}$ . Explicitly, this gives:

where $o_{N}(1)$ is going to zero as $N\to\infty$ . This is a remarkable identity concerning the mean of $X_{il}$ with the posterior distribution $\nu_{il}(X_{il})$ . The left-hand side measures the overlap between this mean and the sought true value $X^{0}_{il}$ . The right-hand side measures the self overlap of the mean, which can be estimated without any knowledge of the true value $X^{0}_{il}$ , by generating two independent samples from $P(F,X|Y)$ .

By symmetry, all these examples apply also to averages and functions of the matrix $F$ :

Another example of a Nishimori identity used in our derivation is eq. (108). Note that this one is crucial to maintain certain variance-related variables strictly positive as explained in section III.3. Without eq. (108) the AMP algorithm may have pathological numerical behavior causing that some by definition strictly positive quantities become negative. Deeper understanding of this problem in cases where Nishimori identities cannot be imposed is under investigation.

Nishimori identities also greatly simplify the state evolution equations of section V.1.1. In general state evolution involves nine coupled equations. However, after imposing Nishimori identities, eqs. (187-189), we are left with only three coupled equations.

II.2 Easy/hard and possible/impossible phase transitions

Fixed points of equations (12-14) allow us to evaluate the MMSE of the matrix $F$ and of the matrix $X$ . We investigate two fixed points: The one that is reached from the “uninformative” initialization (18), and the fixed point that is reached with the informative (planted) initialization (19). If these two initializations lead to a different fixed point it is the one with the largest value of the Bethe free entropy (I.3) that corresponds to the Bayes-optimal MMSE. The reason for this is that the Bethe free entropy is the exponent of the normalization constant in the posterior probability distribution: a larger Bethe free entropy is hence related to a fixed point that corresponds to an exponentially larger probability. Furthermore, in cases where the uninformative initialization does not lead to this Bayes-optimal MMSE we anticipate that the optimal solution is hard to reach for a large class of algorithms.

Depending on the application in question and the value of parameters ( $\alpha,\pi,\rho,\epsilon,\dots$ ) we can sometimes identify a phase transition, i.e. a sharp change in behavior of the MMSE. As in statistical physics it is instrumental to distinguish two kinds of phase transitions

Second order phase transition: In this case there are two regions of parameters. In one, the recovery performance is poor (positive MMSE); in the other one the recovery is perfect (zero MMSE). This situation can arrive only in zero noise, with positive noise there is a smooth “crossover” and the transition between the phase of good and poor recovery is not sharply defined. Interestingly, as we will see, in all examples where we observed the second order phase transition, its location coincides with the simple counting bounds discussed in Section I.4. In the phase with poor MMSE, there is simply not enough information in the data to recover the original signal (independently of the recovery algorithm or its computational complexity). Experience from statistical physics tells us that in problems where such a second order phase transition happens, and more generally in cases where the state evolution (12-14) has a unique fixed point, there is no fundamental barrier that would prevent good algorithms to attain the MMSE. And in particular our analysis suggest that in the limit of large system sizes the AMP algorithm derived in this paper should be able to achieve this Bayes-optimal MMSE.

First order phase transition: A more subtle phenomenology can be observed in the low-noise regime of dictionary learning, sparse PCA and blind matrix calibration. In some region of parameters, that we call the “spinodal” region, the informative (planted) and uninformative initializations do not lead to the same fixed point of the state evolution equations. The spinodal regime itself may divide into two parts - the one where the uninformative fixed point has a larger free entropy, and the “solvable-but-hard” phase where the informative fixed point has a larger free entropy. The boundary between these two parts is the first order phase transition point. We anticipate that in the solvable-but-hard phase the Bayes-optimal MMSE is not achievable by the AMP algorithm nor by a large class of other known algorithms. The first order phase transition is associated with a discontinuity in the MMSE. The MSE reached from the uninformative initializations will be denoted AMP-MSE and is also discontinuous.

In the case of the first order phase transition it will hence be useful to distinguish in our notations between the minimal achievable MSE that we denote MMSE, and the MSE achievable by the AMP-like algorithms that we denote AMP-MSE. When MMSE $=$ AMP-MSE in the large $N$ limit we say that AMP is asymptotically optimal. The region where in the large size limit MMSE $<$ AMP-MSE is called the spinodal regime/region.

II.3 Why our results are not rigorous? Why do we conjecture that they are exact?

In this section we aim at summarizing the assumptions that we make but do not prove along the derivation of our results presented in sections III, IV, V. In section VI we then give the MMSE and AMP-MSE as obtained from numerical evaluation of eqs. (12-14).

Let us first state the general strategy from a physics point of view. We use two complementary approaches, the cavity method and the replica method.

Our main goal is to derive the MMSE for the matrix factorization problem in the Bayes-optimal setting. Our main tool is the cavity method, and it presents two advantages with respect to the replica method: 1) during the derivation of the MMSE we obtain the AMP algorithm as a side-product and 2) based on experience of the last three decades, it is likely that a rigorous proof of our conjectures will follow very closely this path. In the cavity method we first assume that there is a fixed point of the belief propagation equations that correctly describes the posterior probability distribution, and that the BP equations (initialized randomly, or in the case where a first order phase transition is present, initialized close to the planted configuration) converge to it. Then we are interested in the average evolution of the BP iterations. This can be understood through a statistical analysis of the BP equations in which we keep only the leading terms in the large $N$ limit, an approach called the cavity method Mézard et al. (1987) in the physics literature. The result of this analysis are the state evolution equations that should be asymptotically exact, they do not depend on the size of the system $N$ anymore. There is one known way by which the assumptions made in this derivation can fail, this way is called replica-symmetry-breaking (RSB). Fortunately, in the Bayes optimal inference, RSB is excluded, as we explain below.

Our second analysis uses the replica method. In the replica approach one computes the average of the $n$ -th power of the normalization of the posterior probability distribution. Then one uses the limit $n\to 0$ to obtain the average of its logarithm. Doing this, one needs to evaluate a saddle point of a certain potential and one assumes that this saddle point is restricted to a certain class that is called “replica symmetric”, eq. (227). Again this assumption is justified in the Bayes optimal inference, because we know that there is no RSB in this case.

Our two analyses, using the cavity method and the replica method, lead to the same set of closed equations for the MMSE of the problem. This important crosscheck, and the very reasonable nature of the assumptions described below (based on our experience in using this kind of approach in various settings) lead us to conjecture that the results described in this paper are exact.

II.3.2 What are the assumptions?

Let us start with the cavity method, which is in general much more amenable to a rigorous analysis. Let us hence describe the assumptions made in section III.

Our main assumption is that in the leading order in $N$ the true marginal probabilities can be obtained from a fixed point of the belief propagation equations (41)-(44) as written for the factor graph in Fig. 1. For this to be possible the incoming messages on the right hand side of eqs. (41)-(44) must be conditionally independent (in the thermodynamic limit, as defined is section I.3 ), in which case we can write the joint probabilities as products over the messages. If the factor graph was a tree this assumption would be obviously correct. On factor graphs that are locally tree-like these assumptions can be justified (often also rigorously) by showing that the correlations decay fast enough (see for instance Mézard & Montanari (2009)). The factor graph in Fig. 1 is far from a tree. However, from the point of view of conditional independence between messages, this factor graph resembles the one of compressed sensing for which the corresponding results were proven rigorously Bayati & Montanari (2011); Donoho et al. (2012). It is hence reasonable to expect that matrix factorization belongs to the same class of problems where BP is asymptotically exact in the above sense.

We assume that the iterations of equations (41)-(44) converge for very large systems ( $N\to\infty$ ) to this fixed point that describes the true marginal probabilities, provided that the iteration is started from the right initial condition (in the presence of a first order phase transition we need to consider initialization of the iteration in the planted configuration in order to reach the right fixed point).

Given these assumptions the rest of section III is justified by the fact that in the derivation we only neglect terms that do not contribute to the final MSE in the thermodynamic limit.

The main assumptions made in the replica calculation is that “self-averaging” applies (this is a statement on the concentration of the measure, which basically assumes that the averaged properties are the same as the properties of a generic single instance of the problem) and that we can exchange the limits $\lim_{n\to 0}$ and $\lim_{N\to\infty}$ . On top of these, it relies on the replica symmetric calculation, which is justified here because we are interested in evaluating Bayes optimal MMSE. Unfortunately in the mathematical literature there is so far very little progress in proving these basic assumptions of the replica method, even in problems that are much easier than matrix factorization. It can be remembered that the original motivation for introducing the cavity method Mézard et al. (1986) was precisely to provide an alternative way to the replica computations, which could be stated in clear mathematical terms.

In this paper we follow the physics strategy and we hence provide a conjecture for evaluating the MMSE in matrix factorization. We do anticipate that further work using the path of the cavity approach will eventually provide a proof of our conjecture. Let us mention that rigorous results exist for closely related problems, namely the problem of compressed sensing Bayati & Montanari (2011); Donoho et al. (2012), and also rank-one matrix factorization Rangan & Fletcher (2012); Deshpande & Montanari (2014). None of these apply directly to our case where the rank is extensive and where the matrix $F$ is not known. A non-trivial generalization of these works will be needed in order to prove our results.

II.3.3 Cavity and replica method for other problems

The cavity and replica methods, as we use them use in the present paper, rely on the set of assumptions listed above. It is worth to note that over the years they have been applied successfully to a wide range of problems. There is nowadays a range of results that can be “derived” (and in many of those cases that was the original derivation) using the cavity or/and the replica methods and that have been proven fully rigorously. The list is long, but to cite a couple of interesting examples we can mention the proof of the satisfiability threshold conjecture Ding et al. (2014), performance of low-density parity check codes Richardson & Urbanke (2008), the optimal solution of the linear assignment problem Aldous (2001), the analysis of the compressed sensing problem Donoho et al. (2009); Bayati & Montanari (2011), and many more.

II.4 Absence of replica symmetry breaking in Bayes-optimal inference

The study of the Bayes-optimal inference, i.e. the case when the prior probability distribution matches the true distribution with which the data have been generated, is fundamentally simpler than the general case. The fundamental reason for this simplicity is that, in the Bayes-optimal case, a major complication referred to as static replica symmetry breaking (RSB) in statistical physics literature does not appear Nishimori (2001). RSB is a source of computational complications in many optimization and constraint satisfaction problems such as the random K-satisfiability Mézard et al. (2002) or models of spin glasses Mézard et al. (1987).

One common way to define RSB is via the overlap distribution $P(q)$ Nishimori (2001); Mézard et al. (1987); Mézard & Montanari (2009). We define an overlap between two configurations $X^{1}$ and $X^{2}$ as $q_{X}(X^{1},X^{2})=(1/(PN))\sum_{il}X^{1}_{il}X^{2}_{il}$ . The overlap distribution function is then defined (for a given realization of the problem $X^{0},F^{0},Y$ ) as the probability distribution of $q_{X}(X^{1},X^{2})$ over the posterior measure $P(X,F|Y)$ . For problems that do not have any permutational symmetry (such as the permutation of the $N$ rows of $X$ and columns of $F$ ), we say the system is replica symmetric when $\lim_{N\to\infty}P(q)=\delta(q-q^{0})$ for almost all choices of $X^{0},F^{0},Y$ , and that there is replica symmetry breaking otherwise. The number of steps of RSB then correspond to the number of additional delta peaks in the limiting distribution $P(q)$ . When the limiting distribution $P(q)$ has a continuous support then we talk about full-RSB. For systems with permutational symmetry the corresponding number of peaks gets simply multiplied by the size of the symmetry group.

Let us further define a so-called magnetization of configuration $X^{1}$ with respect to some reference configuration $X^{0}$ as $m_{X}(X^{0},X^{1})=(1/(PN))\sum_{il}X^{1}_{il}X^{2}_{il}$ . Such a magnetization and its variance can be computed from first and second derivative of the free entropy density. The derivative is taken with respect to a so-called magnetic field which is an auxiliary parameter introduced in the posterior probability for this purpose. In statistical physics, we expect (and in some case, actually prove) some standard properties of the free entropy such as its self-averaging and its analyticity (except at phase transition points) with respect to such parameters. From this reasoning it follows that whereas the mean of the magnetization is of order $O(1)$ , its variance is of order $O(1/N)$ . Hence in the limit $N\to\infty$ the distribution (over $P(X,F|Y)$ ) of magnetization is a delta peak. If the reference configuration $X^{0}$ is the planted configuration, then self-averaging implies that the distribution of $m_{X}(X^{0},X^{1})$ with respect to both its arguments converges to a delta function. From the Nishimori identities it follows that the same has to be true for the distribution of the overlaps $P(q)$ . Therefore only a $P(q)$ converging to a delta function in the thermodynamic limit is allowed in the Bayes optimal inference setting. In a more technical side, it also follows from the Nishimori identities that the two-point correlations between variables decay sufficiently fast as proven in Montanari (2008), this excludes the existence of the static RSB for the posterior measure $P(F,X|Y)$ .

II.5 Learning of hyper-parameters

To obtain the main results of this paper we assume that the priors $P_{X}$ and $P_{F}$ are known, including their related parameters. In practice, even if one knows a parametric family of probability distributions that approximates well the matrix elements of $F$ and $X$ , one often does not have the full information about the parameters of this distribution (typically its mean and variance). In the same manner, one might know the nature of the measurement noise, but without a precise knowledge of its amplitude. In this subsection we remark that learning of the parameters can be rather straightforwardly included in the present algorithmic approach. However, its detail implementation and analysis, as well as the analysis of the deterioration of performance when the prior distributions are not known, is left for future work.

In order to learn the hyper-parameters, a common technique in statistical inference is the expectation maximization Dempster et al. (1977), where one updates iteratively the hyper-parameters in order to maximize the posterior likelihood, i.e. the normalization ${\cal Z}(Y)$ in the posterior probability measure. In the context of approximate message passing the expectation maximization has been derived and implemented e.g. in Krzakala et al. (2012); Vila & Schniter (2011). It turns out that the expectation maximization update of hyper-parameters is analogous to the update where one imposes the Nishimori identities as was done for compressed sensing in Krzakala et al. (2012). In other words one updates the hyper-parameters in such a way that the Nishimori identities are satisfied after every iteration. For instance, the mean and variance of the distribution $P_{X}$ is updated to correspond to the empirical mean and variance of the estimators of elements $X_{il}$ . The variance of the noise in a Gaussian output channel is updated to be the same as the squared difference between the observed matrix $Y$ and the estimator of the product $FX$ . Generically, for mismatched priors the Nishimori identities are not satisfied. At the same time the various hyper-parameters appear explicitly in these identities. Therefore, one way of performing learning of parameters is iteratively setting new values of the hyper-parameters in order to satisfy the Nishimori identities. This can be straightforwardly implemented within the GAMP code. In compressed sensing following this strategy is equivalent to an expectation maximization learning algorithm Krzakala et al. (2012).

It is interesting to note that, as the Bayes-optimal setting has several nice properties in terms of simplicity of the analytical approach and in terms of convergence of the message-passing algorithms, bringing the iterations back on the Nishimori line by doing expectation maximization improves quite generically the convergence of the algorithm. In our opinion this is one of the properties observed in Parker et al. (2013, 2014).

III Approximate message passing for matrix factorization

Of course, the factor graph of matrix factorization, shown in Fig. 1, is extremely far from a tree. The above BP equations can, however, still be asymptotically exact if the dependence between the incoming messages is negligible in the leading order in $N$ . This indeed happens in compressed sensing (where the matrix $F$ is a known matrix, generated randomly with zero mean), as follows from the rigorously proven success of approximate message passing Donoho et al. (2009); Bayati & Montanari (2011). In the present case of matrix factorization, we do not have any rigorous proof yet, but based on our experience from studies of mean field spin glass systems Mézard et al. (1987); Mézard & Montanari (2009), we conjecture that the fixed points of the above belief propagation equations describe asymptotically exactly (in the same sense as for compressed sensing) the performance of Bayes-optimal inference. Hence the analysis of the fixed points of the above equations leads to the understanding of information-theoretic limitations for matrix factorization. The associated phase transitions describe possible algorithmic barriers. This analysis is the main goal and result of the present paper.

The above BP iterative equations are written for probability distributions over real values variables and the $2N-1$ -uple integrals from the r.h.s. are mathematically intractable in this form. We now define means and variances of the variable-to-factor messages as

Notice that the factors $\sqrt{N}$ in the definition of $r$ , and $N$ in the definition of $s$ , have been introduced in order to ensure that all the messages $a,c,r,s$ are of order $O(1)$ in the thermodynamic limit.

Using this scaling, we shall now show that the BP equations can be simplified in the thermodynamic limit, and that they can actually be written as a closed set of equations involving only messages $a,c,r,s$ . Our general aim is to design an algorithm which in some region of parameters will asymptotically match the performance of the exact (computationally intractable) Bayes-optimal inference. Belief propagation provides such an algorithm, but in order to make it computationally efficient, writing it in terms of the messages $a,c,r,s$ is crucial, provided one is careful not to loose any terms in the asymptotic analysis of the thermodynamic limit to leading order.

Let us define the Fourrier transform of the output function

Next we perform the integral over variable $\xi$ which is simply a Gaussian integral. This gives for the message

where we introduced auxiliary variables that are both of order $O(1)$

The last integral to be performed in (52) is the one over the matrix element $F_{\mu i}$ . Using again the fact that $F_{\mu i}=O(1/\sqrt{N})$ , we expand the exponential in which $F_{\mu i}$ appears to second order and perform the integration to obtain

Following the notation of Rangan (2011) we now define the output-function as

The following useful identity holds for the average of $(z-\omega)^{2}/V^{2}$ in the above measure

At this point we follow closely the derivation of AMP from Krzakala et al. (2012) and define the probability distributions

where $\hat{Z}_{X}(\Sigma,T)$ , and $\hat{Z}_{F}(Z,W)$ are normalizations. We define the average and variance of ${\cal M}_{X}$ and ${\cal M}_{F}$ as

These are the input auxiliary function of Rangan (2011). It is instrumental to notice that

With these definition we obtain from (41,42) using (58,61) that

It is clear from the above expressions that all the messages $a,c,r,s$ and $A,C,R,S$ scale as $O(1)$ in the thermodynamic limit. For instance, as $A$ are positive, the quantity $\sum_{\nu(\neq\mu)}A^{t}_{\nu l\to il}$ is $O(1)$ . On the other hand, the message ${\hat{f}}_{\mu i\to\mu l}(t)$ is an estimate of $\sqrt{N}F_{\mu i}$ ; this estimate is $O(1)$ , but a sum like $(1/\sqrt{N})\sum_{\nu(\neq\mu)}{\hat{f}}_{\nu i\to\nu l}(t)$ is an estimate of $\sum_{\nu}F_{\nu i}$ . As $P_{F}$ has mean variance of order $O(1/N)$ , this sum is actually of $O(1)$ . The same argument suggests that $(1/\sqrt{N})\sum_{\nu(\neq\mu)}B^{t}_{\nu l\to il}$ is $O(1)$ . Recalling eqs. (59,60) and (62,63), we have derived that in the thermodynamic limit the general belief propagation equations simplify into a closed set of equations in the messages which are the means and variances $a,r,c,s$ defined in (45-48). To iterate this message passing algorithm we initialize as

then we compute $V^{t}_{\mu il}$ and $\omega^{t}_{\mu il}$ from (53-54), then we compute $B^{t}$ , $A^{t}$ , $R^{t}$ and $S^{t}$ according to (59-60) and (62-63) using definition of $g_{\rm out}$ (56). Finally we update the messages according to (70-73) and iterate. Notice, however, that we work with $O(N^{3})$ messages, each of them takes $N$ steps to update, and hence the computational complexity of this algorithm is relatively high. In the next section we will write a simplification that reduces this complexity.

From the fixed point of the belief propagation equations one can also compute the approximated marginal probabilities of the posterior, defined as

One again defines the mean and variance of these two messages, ${\hat{x}}_{il}(t+1)$ , $c_{il}(t+1)$ , ${\hat{f}}_{\mu i}(t+1)$ and $s_{\mu i}(t+1)$ analogously to (80-83):

III.2 GAMP for matrix factorization

The message-passing form the AMP algorithm for matrix factorization derived in the previous section uses $2NMP$ messages, one between each variable component $(il)$ and $(\mu i)$ and each measurement $(\mu l)$ , in each iteration. In fact, exploiting again the simplifications which take place in the thermodynamic limit, always within the assumption that the elements of the matrix $F$ scale as $O(1/\sqrt{N})$ , it is possible to rewrite and close the BP equations in terms of only $2N(P+M)$ messages. In statistical physics terms, the resulting equations correspond to the Thouless-Anderson-Palmer equations (TAP) Thouless et al. (1977) used in the study of spin glasses. In the thermodynamic limit, these are asymptotically equivalent to the BP equations. Going from BP to TAP is, in the compressed sensing literature, the step to go from the rBP Rangan (2010) to the AMP Donoho et al. (2009) algorithm. Let us now show how to take this step for the present problem of matrix factorization.

In the thermodynamic limit, it is clear from (70-73) that the messages ${\hat{x}}_{il\to\mu l}$ , $v_{il\to\mu l}$ and ${\hat{f}}_{\mu i\to\mu l}$ , $s_{\mu i\to\mu l}$ are nearly independent of $\mu l$ . For instance in the equation giving ${\hat{x}}_{il\to\mu l}$ , the only dependence on $\mu$ is through the fact that the sum over $\nu$ avoids the value $\nu=\mu$ . But this is one term in $M$ , and therefore one might expect that this term is negligible. However, one must be careful when these small terms are summed over and their sum might be of the leading order in $N$ . Such terms are called in spin-glass theory the “Onsager reaction terms”. In the following we derive these Onsager terms.

Let us define the following variables all of order $O(1)$ on which we will close the equations

To keep track of all the Onsager terms that will influence the leading order of the final equations we notice that

From these expansions we obtain the GAMP algorithm for matrix factorization

In order to compute $\omega^{t=0}$ , $T^{t=0}$ and $W^{t=0}$ use the above equations as if ${\hat{f}}_{\mu i}(-1)=0$ and ${\hat{x}}_{il}(-1)=0$ .

The interpretation of the terms in the GAMP for matrix factorization is the following: $\omega_{\mu l}^{t}$ is the mean of the current estimate of $z_{\mu l}=\sum_{i}F_{\mu i}X_{il}$ and $V^{t}_{\mu l}$ is the variance of that estimate; $T_{il}$ and $\Sigma_{il}$ is the mean and variance of the current estimate of $X_{il}$ without taking into account the prior information of $X_{il}$ ; the parameters ${\hat{x}}_{il}$ and $c_{il}$ are then the mean and variance of the current estimate of $X_{il}$ with the prior information taken into account. Analogously for $W_{\mu i}$ and $Z_{\mu i}$ being the mean and variance of the estimate for $F_{\mu i}$ before the prior is taken into account, and ${\hat{f}}_{\mu i}$ with $s_{\mu i}$ are the mean and variance once the prior information was accounted for.

A reader familiar with the AMP and GAMP algorithm for compressed sensing Donoho et al. (2009); Rangan (2011); Krzakala et al. (2012) will recognize that the above equations indeed reduce to the compressed sensing GAMP of Rangan (2011) when one sets ${\hat{f}}_{\mu i}(t)=\sqrt{N}F_{\mu i}$ and $s_{\mu i}=0$ .

The above algorithm is closely related to the BiG-AMP of Parker et al. (2013). There are however three differences between our algorithm and BiG-AMP:

We find a $s_{\mu i}$ -dependent term in the expression (98) for $\Sigma_{il}$ which is not present in BiG-AMP.

Similarly, we find a $c_{il}$ -dependent term in the expression (100) for $Z_{\mu i}$ which is not present in BiG-AMP.

Considering the last point, the fact of having different time indices during the iterations does not influence the fixed points, in which we are mainly interested. However, the use of correct time indices is crucial for the assumptions leading to the density evolution of this algorithm (that we derive in section V.1) to hold.

As for the missing terms in the BiG-AMP expressions of Parker et al. (2013) for $\Sigma_{il}$ and $Z_{\mu i}$ , they have a more serious effect as they can change the fixed point. To the best of our understanding, these terms have been neglected in Parker et al. (2013), while they should be kept. It seems to us that some of the leading order terms are missing in eqs. (15-16) from Parker et al. (2013).

III.3 Simplifications due to the Nishimori identities

In the previous section we derived the GAMP algorithm for matrix factorization, eqs. (96-103). This algorithm can in principle be used for any set of matrices $F$ and $X$ . If iterated in the form derived in Section III.2 it often shows problems of convergence. There are ways to slightly improve the convergence of the above algorithm in a wide range of applications by a number of empirical methods suggested in Parker et al. (2014).

We will focus on the particular case when matrices $X$ and $F$ were indeed generated from the separable probability distributions $P_{F}(F)$ and $P_{X}(X)$ described in eqs. (3-4), and the output $y$ was generated by the assumed model.

In this case the belief propagation is a proxy for the optimal Bayes inference algorithms and a number of properties described in section I.2 hold. In analogy with fundamental works on spin glasses Nishimori (2001); Iba (1999) we called these properties the Nishimori identities. The setting where conditions (106) hold will be called the Bayes-optimal setting.

The Nishimori identities hold and the system is on the Nishimori line when one is using the correct priors on $F$ and $X$ and the right output channel in the reconstruction process, i.e. when conditions (106) hold. In the limit $N\to\infty$ and thanks to self-averaging we then have on the Nishimori line at every iteration step $t$

The meaning of this identity is that the mean squared error of the current estimate of $Z=FX$ computed from the current estimates of variances $V_{\mu l}^{t}$ is equal to the mean squared difference between the true $Z$ and its current estimate $\omega_{\mu l}^{t}$ . Using the above expression and eq. (57) we obtain an identity

The above identity holds also if the sum is only over $\mu$ or only over $l$ . Finally using the conditional independence assumed in BP between the incoming messages we get also

Under this condition we can simplify considerably the expressions for $\Sigma_{il}^{t}$ and $Z_{\mu i}^{t}$ and get

Note that the r.h.s. of the two above equations is always strictly positive, which is reassuring given these expressions play the role of a variance of a probability distribution. Note also that the BiG-AMP algorithm of Parker et al. (2013) uses expressions (110,111) instead of (98,100), however, without mentioning the reason.

III.4 Simplification for matrices with random entries

Relying on the definitions of order parameters (161-162) and using part of the results on section V.1 we can write a version of the GAMP for matrix factorization that is in the leading order in $N$ equivalent to (96-103) for matrices $X$ and $F$ having iid elements.

Let us define the analog of $\hat{\chi}^{t}$ and $\hat{q}^{t}$ (168-169) as empirical means of the corresponding functions

Anticipating the reasoning that we shall use later in section V.1, we realize that in the leading order quantities $V^{t}_{il}$ , $\Sigma^{t}_{il}$ and $Z^{t}_{\mu i}$ do not depend on their indices $i,l,\mu$ . We have

These three equations can hence replace (96), (98) and (100) in GAMP. Furthermore, if we focus on the fixed point and hence disregard some of the time indices eqs. (97), (99) and (101) can be simplified as

The set of eqs. (102-103), (114), (119-122) was presented for the simple output channel with white noise in Krzakala et al. (2013). We detail this procedure for the generic output in Alg. 1.

We want to stress here that all these simplifications take place for any output channel $P_{\rm out}(y|z)$ . In contrast with the “uniform variance” approximation of Parker et al. (2013) the above result does not mean that the variances $c_{il}$ and $s_{\mu i}$ are independent in the leading order on their indices. On the contrary, these variances depend on their indices even in the simplest case of GAMP when the matrix $F_{\mu i}$ is known, i.e. for the compressed sensing problem.

IV The Bethe free entropy

The fixed point of the belief propagation equations or its AMP version can be used to estimate the posterior likelihood, i.e. the normalization ${\cal Z}(Y)$ of the posterior probability (5). The logarithm of this normalization is called the Bethe free entropy in statistical physics Yedidia et al. (2003). Negative logarithm of the normalization is called the free energy, in physics there is usually a temperature associated to the free energy. Bethe free entropy is computed from the fixed point of the BP equations (41-44) as Yedidia et al. (2003); Mézard & Montanari (2009):

The derivatives of this expression for $\Phi^{\rm Bethe}$ with respect to the messages give back the full BP equations of (41-44). In this general form, the computation of $\Phi^{\rm Bethe}$ for the present problem is not of practical interest, and it is thus very useful to carry out the same steps that we did in Section III.1 in order to obtain a more mathematically tractable form of $\Phi^{\rm Bethe}$ that is asymptotically equivalent to (123) in the thermodynamic limit, using the set of AMP message passing equations (53-54), (59-60), (62-63), and (70-73). The result is:

Finally we might want to express the free entropy using the fixed point of the GAMP eqs. (96-103). In order to do this we need to rewrite the last two terms in (129). Using an expansion in $1/N$ and keeping the leading order terms we get

We remind that the above expressions give the posterior likelihood given a fixed point on the GAMP equations.

To write the final formula in a more easily interpretable form we use the probability distributions ${\cal M}_{F}$ and ${\cal M}_{X}$ defined in (64-65) with normalizations

The above expression evaluated at the fixed point of the AMP algorithm hence gives the Bethe approximation to the log-likelihood. It is mainly use to decide which fixed point of AMP is better. Indeed, there are cases where there exist more than one AMP fixed point and it is the one with the largest Bethe entropy that corresponds asymptotically to the optimal Bayesian inference.

Since the free entropy has a meaning only at the fixed point, we can transform it by using any of the fixed point identities verified by the BP messages. A well known property of Bethe free entropy and belief propagation Yedidia et al. (2003) is that the BP fixed points are stationary points of the Bethe free entropy. In this section we show that also for the AMP for matrix factorization the Bethe free entropy can be written in a form that allows to generate the fixed-point BP equations as a stationary point. It can be achieved by writing the Bethe free entropy eq. (139) as

In order to derive (140) from (139), we have substituted $g^{2}_{\rm out}$ by its fixed point expression, and imposed the values of the variance $V$ . Under the present form, the Bethe free entropy satisfies the following theorem:

(Bethe/AMP correspondance) The fixed point of the AMP equations eqs. (96-103) are the stationary points of the cost function $\Phi^{\rm Bethe}_{\rm AMP}$ eq. (140).

This can be checked explicitly by setting to zero the derivatives of $\Phi^{\rm Bethe}_{\rm AMP}$ . Indeed, the derivatives with respect to $T,\Sigma,W$ and $Z$ yield

Then, the stationarity with respect to $\omega$ can be expressed easily by noting that $\frac{\partial}{\partial\omega_{\mu z}}\log Z^{\mu l}=g_{\rm out}$ (a consequence of Eq. (130):

which is nothing but the fixed point equation for $\omega$ .

It is convenient to compute the derivative with respect to $V$ (even though this quantity is eventually a function of $r$ , $s$ , $a$ and $c$ ) using $\frac{\partial}{\partial V_{\mu z}}\log Z^{\mu l}=\frac{1}{2}\left(g_{\rm out}^{2}+\partial_{\omega}g_{\rm out}\right)$ so that at the fixed point, when eq. (145) is satisfied, we have

Using this equation, one can finally check explicitly that deriving with respect to $a,c,r$ and $s$ yields the remaning AMP equations for $T,\Sigma,W$ and $Z$ . This concludes the proof. ∎

IV.2 The variational Bethe free entropy

We have shown that the fixed points of the approximate message passing equations are extrema of $\Phi^{\rm Bethe}_{\rm AMP}$ . However they are in general saddle points of this function, and it is very useful to derive an alternative “variational” free entropy, the maxima of which are the fixed points. In particular, this will allow us to find these fixed points by alternative methods which do not rely on iterating the equations as was done for compressed sensing in Krzakala et al. (2014). This variational free entropy can also be used not only at the maximum, but for each possible values of the parameters, as the current estimate of the quality of reconstruction. Such a property has been used to implement a so-called adaptive damping in compressed sensing Vila et al. (2014) and it can hence be anticipated that similar implementation trick will be useful for matrix factorization as well.

In order to derive the variational Bethe free entropy, we impose the fixed point conditions, and express the free entropy only as a function of the parameters of our trial distributions for the two matrices. Then, we simply have

where $a^{*},c^{*},r^{*},s^{*}$ are given in terms of the Eqs. (141-144) by: $a^{*}=f_{X}(\Sigma_{il},T_{il})$ , $c^{*}=f_{c}(\Sigma_{il},T_{il})$ , $r^{*}=f_{F}(Z_{\mu i},W_{\mu i})$ , $s^{*}=f_{s}(Z_{\mu i},W_{\mu i})$ , and $\omega^{*}$ is the solution of (97).

In order to write this variational expression in a nicer form, let us notice that the Kullback-Leibler divergences between ${\cal M}_{X}$ , ${\cal M}_{F}$ (64-65) and the prior distribution are

where ${\cal Z}^{\mu l}$ is given by (130). Then one has

with $V^{*}$ and $\omega^{*}$ satisfying eqs. (114) and (119). Note that this expression has the same form as the one used in Rangan et al. (2013) for the simpler case of GAMP for compressed sensing and for the generalized linear problem. Our expression thus generalizes the formula of Rangan et al. (2013) to the bi-linear case.

IV.2.2 The AWGN output channel

In the case of the additive white Gaussian noise output channel (23) the function $g_{\rm out}$ takes the simple form:

hence $\partial_{\omega}g_{\rm out}$ does not depend on the variable $\omega_{\mu l}$ . The only explicit dependence on $\omega_{\mu l}$ in the free entropy is through eq. (130) which becomes for the AWGN output channel

The free entropy is defined only at the fixed point of the GAMP equations. Given a fixed point we can express from (97) for the AWGN channel

We plug this last expression into (155) to obtain

Simplifying the last two terms of eq. (153) we obtain for the AWGN channel:

The first three terms of this free entropy are clearly negative and the last term cannot be larger than $-MP\log{(2\pi\Delta)}/2$ . Hence the free entropy (158) is bounded from above. This is consistent with its interpretation as a variational expression. The stationary points of (158) are the fixed points of the GAMP algorithm and hence the fixed points corresponding to the maximum likelihood could also be found by direct maximization of the expression (158). This offers an interesting algorithmic alternative to the iterative AMP algorithm that was explored for the compressed sensing problem in Krzakala et al. (2014).

Another use of the expression (158) is that during the iteration of the GAMP algorithm its value should be increasing, hence we can adaptively choose the step-size of the iterations to ensure this increase. Such an adaptive dumping was implemented in Parker et al. (2014) using a different form of the free entropy that does not correspond to the Bethe free entropy but to the variational mean field (VMF) free entropy which reads

It is easy to check that $\Phi^{\rm VMF}_{\rm AWGN}(T_{il},\Sigma_{il},W_{\mu i},Z_{\mu i})<\Phi^{\rm Bethe}_{\rm AWGN}(T_{il},\Sigma_{il},W_{\mu i},Z_{\mu i})$ , which could be expected since the Bethe expression, which is asymptotically exact, should be a better approximation than the mean field approximation.

V Asymptotic analysis

In this section we derive the asymptotic ( $N\to\infty$ ) evolution of the GAMP iterations for matrix factorization. This asymptotic analysis holds as long as all the elements of the true matrix $F$ are iid random variables generated from a distribution $P_{F^{0}}$ , and all elements of the true matrix $X$ are iid random variables generated from a distribution $P_{X^{0}}$ . In general we will not assume $P_{F^{0}}=P_{F}$ and $P_{X^{0}}=P_{X}$ : this special case of Bayes-optimal analysis will be treated in the next section.

In the present section we will also distinguish between the true output channel characterized by the conditional probability distribution $P^{0}_{\rm out}(y_{\mu l}|z^{0}_{\mu l})$ and the output channel that is being used in the GAMP algorithm $P_{\rm out}(y_{\mu l}|z_{\mu l})$ . We remind that $z^{0}_{\mu l}=\sum_{i=1}^{N}F^{0}_{\mu i}X^{0}_{il}$ , where $F^{0}_{\mu i}$ and $X^{0}_{il}$ are the elements of the actual matrices that we do not know and aim to recover, and $z_{\mu l}=\sum_{i=1}^{N}F_{\mu i}X_{il}$ . Again the special case of $P^{0}_{\rm out}=P_{\rm out}$ will be treated the next section.

We will assume that at least one of the probability distributions $P_{F^{0}}$ and $P_{X^{0}}$ (and also at least one of $P_{F}$ and $P_{X}$ ) has zero mean, otherwise there would be additional terms in this asymptotic analysis, as e.g. in Caltagirone et al. (2014).

Let us first recall the definition of the order parameters; all of them are finite, of order $O(1)$ , in the thermodynamic limit:

Note also that the above sums over a pair of indices could also be sums over only one index (and adjusted normalization) and the order parameters would not change in the leading order: for instance, we expect that in the thermodynamic limit, $\frac{1}{N}\sum_{j}{\hat{x}}_{jl}(t)X^{0}_{jl}$ will go to the same limit as $m^{t}_{X}$ defined in (160).

First let us compute the average over realizations of $X^{0}$ , $F^{0}$ and $w^{0}$ of the quantity $V_{\mu l}^{t}$ defined by eq. (90). By the assumptions of the belief propagation equations (43-44), the terms in the product in (90) are statistically independent and we can hence write for the average, to the leading order

Further, we realize that the variance of this quantity (again over the realizations of $X^{0}$ , $F^{0}$ and $w^{0}$ ) is

In order to derive this result, we expand the square and obtain a double sum over $i$ and $j$ . Because of the conditional independence between incoming messages assumed in belief propagation, the terms with $i\neq j$ average exactly to zero. As for the terms with $i=j$ , they add up to a contribution of order $O(1/N)$ . From this we can conclude that, to leading order in the thermodynamic limit, the quantity $V_{\mu l}^{t}=V^{t}$ does not depend on its indices.

Further we are interested in the average $\Sigma^{t}$ of the quantity $\Sigma_{il}^{t}$ over the realization of $F^{0}$ , $X^{0}$ , and $w^{0}$ . Using the definition of $\Sigma_{il}^{t}$ eq. (88) and the expression for $A_{\mu l\to il}^{t}$ eq. (60) we obtain

We proceed analogously for $Z^{t}_{\mu i}$ . Using again the conditional independence between incoming messages assumed in BP equations we obtain

We use $y_{\mu l}=h(z^{0}_{\mu l},w^{0})$ , and we remind that, to leading order, $V^{t}_{\mu il}=V^{t}$ . The function $g_{\rm out}$ above hence depends on two correlated fluctuating variables $\omega_{\mu il}^{t}$ and $z^{0}_{\mu l}$ , and on $w^{0}$ . Both the variables $\omega_{\mu il}^{t}$ and $z^{0}_{\mu l}$ are sums over many independent (for $z$ this is by construction, for $\omega$ by the BP assumptions) terms. Hence, according to the central limit theorem, they are Gaussian random variables. Their mean is zero when at least one of the distributions $P_{X}$ and $P_{F}$ (and one of the $P_{X^{0}}$ and $P_{F^{0}}$ ) have zero mean (which we assume in this section). The covariance matrix between the variables $\omega_{\mu il}^{t}$ and $z^{0}_{\mu l}$ is

where we again used the BP assumption of independence between the incoming messages, but also between $F^{0}_{\mu i}$ and the message ${\hat{x}}_{il\to\mu l}$ , and between $X_{il}^{0}$ and ${\hat{f}}_{\mu i\to\mu l}$ . As for the variance of $z^{0}$ , we denote it by:

Altogether this gives for $\hat{\chi}$ and $\hat{q}$

We now study the asymptotic behavior of $T^{t}_{il}$ defined by eq. (88)

where we used definition of $B^{t}_{\mu l\to\i l}$ in eq. (59). The message ${\hat{f}}_{\mu i\to\mu l}(t)$ are uncorrelated with all the other incoming messages and also with all the $F_{\mu j}^{0}$ for $j\neq i$ . It is, however, correlated with $F_{\mu i}^{0}$ and the dependence on $F_{\mu i}^{0}$ has to be hence treated separately. After an expansion in the leading order we obtain

where in the first term we defined the new parameter $\hat{m}$ as

Using the same kind of analysis as we did for $\hat{q}$ and $\hat{\chi}$ , we find that

The second term of (176) when averaged over realization of $F^{0}$ , $X^{0}$ and $w^{0}$ behaves as a Gaussian random variable. In (176) we moreover assumed that

which is true in all the special cases analysed in this paper, and is also true in general under the Bayes-optimal inference as detailed in the next section. If the zero-mean assumption (179) did not hold the density evolution equations would contain additional terms (similarly as if both $F$ and $X$ had non-zero means), see e.g. the state evolution in compressed sensing for non-zero mean matrices Caltagirone et al. (2014). Under the zero-mean assumption, the variance of the Gaussian variable is $\alpha q^{t}_{F}\hat{q}^{t}$ with $\hat{q}^{t}$ given by (174).

With the use of (166-166), (176), (176), and the expressions for messages (102-103) we obtain

where ${\cal D}\xi={\rm d}\xi e^{-\xi^{2}/2}/{\sqrt{2\pi}}$ is a Gaussian integration measure. Analogously we have for the $F$ -related order parameters

The six equations (181-183) together with (173-178) are the general form of density evolution for GAMP in the general case of matrix factorization. We remind that these equations describe the asymptotic evolution of the algorithm in the “thermodynamic” limit of large sizes, as long as the matrices $X$ , $X^{0}$ , $F$ , $F^{0}$ were generated with iid elements, at least one of the random variables $X$ and $F$ , and at least one of the $X^{0}$ and $F^{0}$ has zero mean, and the output channel satisfies the condition (179). The first of these condition is absolutely essential for our approach; the restriction to zero means is here for convenience, the non-zero means and generic form of output function can be treated with the same formalism that we have used here, with additional terms.

To satisfy the Nishimori identities we have to suppose that all the prior distributions are matching the true distributions from which the signal and noise were generated, i.e. conditions (106) hold. The condition $P^{0}_{\rm out}(y|z)=P_{\rm out}(y|z)$ is equivalent to $P_{0}(w)=P(w)$ and $h(z,w)=h^{0}(z,w)$ . In this Bayes-optimal setting, (106), the asymptotic analysis simplifies considerably since we have

To justify the above statement we need to prove that if (106) is satisfied and (187-189) hold up to iteration $t$ then (187-189) hold also in iteration $t+1$ . This is done in the next two subsections.

In the Bayes-optimal setting, the state evolution simplifies into

where ${\cal D}\xi$ is a Gaussian integral ${\rm d}\xi e^{-\xi^{2}/2}/\sqrt{2\pi}$ . Here we chose to use the expression coming from eq. (182), (185) and (173), but we could have used any of the other expressions that are equivalent on the Nishimori line. Where $m_{F}$ and $m_{X}$ are initialized as squares of the means of the corresponding prior distributions

In case the prior distribution depends on another random variables, e.g. in case of matrix calibration, we take additional average with respect to that variable. If the above initialization gives $m_{F}^{t=0}=0$ and $m_{X}^{t=0}=0$ then this is a fixed point of the state evolution. This is due to the permutational symmetry between the columns of matrix $F$ and rows of matrix $X$ . To obtain a nontrivial fixed point we initialize at $m_{F}^{t=0}=\eta$ for some very small $\eta$ , corresponding to an infinitesimal prior information about the matrix elements of the matrix $F$ . Note that this is needed only in the state evolution, the algorithm breaks the permutational symmetry spontaneously. The same situation appears in an Ising ferromagnet at low temperature where zero magnetization is a fixed point of the equilibrium equations, but the physically correct solution to which dynamical procedures converge had large magnetization in absolute value.

Our general strategy in the asymptotic analysis of optimal Bayesian inference and related phase transition is that the corresponding fixed points must satisfy the Nishimori identities, hence we will restrict our search for fixed points to parameters lying on the Nishimori line, i.e. satisfying the identities (187-189). When these identities are not imposed the iterations of the state evolution equations are not always converging to fixed points on the Nishimori line. This is also reflected in problems with convergence in the GAMP algorithm for matrix factorization. In the algorithm the Nishimori identities are unfortunately not straightforward to impose.

V.1.2 The input Nishimori identities

Assume that in the state evolution the Nishimori identities (187-189) hold for all iteration times smaller or equal to $t$ . Out aim is to show that then $m_{X}^{t+1}$ and $q_{X}^{t+1}$ computed from (183) and (182) are equal. Recall that

The proof of $m_{F}^{t+1}=q_{F}^{t+1}$ is exactly the same.

Using the definition of the function $f_{c}(\Sigma,R)$ (66), the same change of variables, and resulting cancelations as above we get

V.1.3 The output Nishimori identities

Let us now assume that the input Nishimori identities (187-188) are satisfied and we want to show that (189) and (179) hold.

Doing analogous manipulations of expliciting the Gaussian measure and using eq. (10) and the definition of $g_{\rm out}$ in equation (174) we obtain

For $\hat{\chi}^{t}$ we do integration with respect to $p$ by parts and using steps as in the above we obtain

Thanks to a cancelation between the integrals over variables $z$ and $z^{\prime\prime}$ we can perform explicitly the integral over $y$ (keeping in mind that $P_{\rm out}(y|z^{\prime})$ is a normalized probability distribution). The remaining Gaussian integral is then zero.

In a analogous manner we prove eq. (179) by noticing that the expectation with respect to $F^{0}$ , $X^{0}$ and $w^{0}$ is exactly the integral $\int{\rm d}y\,{\rm d}p\,{\rm d}z\,P_{\rm out}(y|z)\,{\cal N}(p,z)$ .

To conclude, identities (189) and (179) hold in the limit $N\to\infty$ . However, as common in statistical physics we can recall the self-averaging property under which quantities on almost every large ( $N\to\infty$ ) instance are equal to their averages aver randomness (disorder) of $F^{0}$ , $X^{0}$ and $w^{0}$ . This self-averaging then for instance justifies the use of eq. (108) on large single instances of the matrix factorization problem.

V.2 Replica method

The replica method is known as a non-rigorous approach to evaluate the typical performance of various Bayesian inference problems. We here show how this is employed for the matrix factorization problem. We show, as expected from other problems, that the result of the replica analysis is fully equivalent to the result of the state evolution.

constitutes the basis of our analysis. In statistical physics, one can generally examine properties of systems via evaluation of the free entropy $\log Z(Y)$ , which statistically fluctuates depending on the realization of $Y$ in the current case. However, as $N,M,P\to\infty$ , one can expect that the self-averaging property holds and, therefore, the free entropy density $N^{-2}\log Z(Y)$ converges to its typical value $\phi\equiv N^{-2}\left[\log Z(Y)\right]_{Y}$ with probability of unity. This is also expected to hold for other macroscopic quantities relevant to the performance of the matrix factorization. Therefore, assessment of $\phi$ is the central issue in our analysis.

with respect to the generative distribution of $Y$

where we assumed that the functional forms of $P_{F^{0}}(F_{\mu i})$ , $P_{X^{0}}(X_{il})$ and $P_{\rm out}^{0}(y_{\mu l}|\sum_{i}F_{\mu i}X_{il})$ may be different from those of the assumed model $P_{F}(F_{\mu i})$ , $P_{X}(X_{il})$ and $P_{\rm out}(y_{\mu l}|\sum_{i}F_{\mu i}X_{il})$ for generality. When they are equal, which correspond to the Bayes-optimal setting, $P_{0}(Y)=Z(Y)$ holds.

In performing the integrals of $2(n+1)$ matrices $(F^{0},\{F^{a}\}_{a=1}^{n})$ , $(X^{0},\{X^{a}\}_{a=1}^{n})$ and $Y$ that come out in evaluating $\left[Z^{n}(Y)\right]_{Y}$ , we insert trivial identities with respect to all combinations of replica indices $a\leq b=0,1,2,\ldots,n$

to the integrand, where $F^{a}\cdot F^{b}\equiv\sum_{\mu,i}F_{\mu i}^{a}F_{\mu i}^{b}$ and similarly for $X^{a}\cdot X^{b}$ . Let us denote ${\cal Q}_{F}\equiv(q_{F}^{ab})$ and ${\cal Q}_{X}\equiv(q_{X}^{ab})$ , and introduce two joint distributions

where $V_{F}({\cal Q}_{F})$ and $V_{X}({\cal Q}_{X})$ are the normalization constants. These yield an expression of $\left[Z^{n}(Y)\right]_{Y}$ as

where $\left[\cdots\right]_{{\cal Q}_{F},{\cal Q}_{X}}$ denotes the average with respect to (212) and (213). In computing $\left[\cdots\right]_{{\cal Q}_{F},{\cal Q}_{X}}$ , it is noteworthy that $\{F^{a}\}$ and $\{X^{a}\}$ follow statistically independent distributions, and either of them has zero mean and both of them have finite variances from our assumption. These allow us to handle $z_{\mu l}^{a}\equiv\sum_{i}F_{\mu i}^{a}X_{\mu i}^{a}$ $(a=0,1,2,\ldots,n;\mu=1,2,\ldots,M;l=1,2,\ldots,P)$ as multivariate Gaussian random variables whose distribution is given by

V.2.2 Replica symmetric free entropy

where the extremization with respect to $Q_{F},Q_{X},q_{F},q_{X},m_{F},m_{X}$ , and their conjugate variables leads to the same equations we obtained in the state evolution. Particularly extremization w.r.t. the six conjugate variables gives eqs. (181-186). Extremization with respect to $Q_{F},Q_{X},q_{F},q_{X},m_{F},m_{X}$ gives $\hat{Q}_{F}=\pi Q_{X}(\hat{\chi}-\hat{q})$ , $\hat{Q}_{X}=\alpha Q_{F}(\hat{\chi}-\hat{q})$ , $\hat{q}_{F}=\pi q_{X}\hat{q}$ , $\hat{q}_{X}=\alpha q_{F}\hat{q}$ , $\hat{m}_{F}=\pi m_{X}\hat{m}$ , $\hat{m}_{X}=\alpha m_{F}\hat{m}$ where $\hat{\chi}$ , $\hat{q}$ , and $\hat{m}$ are given byt eqs. (173-174) and (178).

V.2.3 Simplification in the Bayes-optimal setting

One can generally evaluate thermodynamically dominant values of $Q_{F},Q_{X},q_{F},q_{X},m_{F},m_{X}$ by solving the extremization problem of (237), which is involved with twelve variables including the conjugate variables and therefore is rather complicated to handle. However, the problem is significantly simplified in the Bayes-optimal setting.

This is because the Nishimori identities allows us to handle $F^{0}=(F_{\mu i}^{0})$ and $X^{0}=(X_{il}^{0})$ as if they were the $n+1$ -st replica variables added to the $n$ -replicated system composed of $\{F^{a}\}_{a=1}^{n}=\{(F_{\mu i}^{a})\}_{a=1}^{n}$ and $\{X^{a}\}_{a=1}^{n}=\{(X_{il}^{a})\}_{a=1}^{n}$ in the computation of the moment $\left[Z^{n}(Y)\right]_{Y}=\int dYP_{0}(Y)Z^{n}(Y)=\int dYP_{0}^{n+1}(Y)$ . The replica symmetry among $a=0,1,2,\ldots,n$ ensures the following properties:

$m_{F}=q_{F}$ , $m_{X}=q_{X}$ , $\hat{m}_{F}=\hat{q}_{F}$ , and $\hat{m}_{X}=\hat{q}_{X}$ are satisfied.

$Q_{F}=Q_{F}^{0}$ , $Q_{X}=Q_{X}^{0}$ , $\hat{Q}_{F}=0$ , and $\hat{Q}_{X}=0$ hold for $n\to 0$ .

Substituting these into (237) yields a simplified expression of the free entropy, which is involved with only four macroscopic variables $m_{F}$ , $m_{X}$ , $\hat{m}_{F}$ , and $\hat{m}_{X}$ , as

where we changed integration variables as $\sqrt{N\hat{m}_{F}}\xi+N\hat{m}_{F}F^{0}\to\sqrt{N\hat{m}_{F}}\xi$ and $\sqrt{\hat{m}_{X}}\xi+\hat{m}_{X}X^{0}\to\sqrt{\hat{m}_{X}}\xi$ together with ${\cal D}\xi\to{\cal D}\xi e^{-\frac{N\hat{m}_{F}}{2}(F^{0})^{2}+\sqrt{N\hat{m}_{F}}\xi F^{0}}$ and ${\cal D}\xi\to{\cal D}\xi e^{-\frac{\hat{m}_{X}}{2}(X^{0})^{2}+\sqrt{\hat{m}_{X}}\xi X^{0}}$ , respectively. The saddle point conditions can be summarized via equations that we obtained in the state evolution, notably eqs. (190-192) with $\hat{m}_{F}=\pi m_{X}\hat{m}$ , $\hat{m}_{X}=\alpha m_{F}\hat{m}$ .

The free entropy (241) would be also obtained from an asymptotic limit of the Bethe free entropy of section IV. Overall, as usual in statistical physics, the cavity method and the replica method yield equivalent results for the matrix factorization problem.

VI Examples of asymptotic phase diagrams

In this section we use the state evolution derived in section V.1.1 to analyze the asymptotic MMSE of the Bayes-optimal inference in matrix factorization for applications listed in section I.4. We restrict our analysis to the Bayes-optimal inference, i.e. the case where we generate the data as specified in Sec. I.4 and assume that we know the corresponding distributions. In terms of the AMP algorithm and the state evolution this means we can use all the simplifications that arise under the Nishimori identities. The AMP algorithms for matrix factorization and the asymptotic analysis derived in sections III and V apply to all the examples, the only elements that are application-dependent are the “input” functions $f_{X}$ and $f_{F}$ , and the output function $g_{\rm out}$ .

In terms of our asymptotic analysis the equations for dictionary learning, sparse PCA, and blind source separation are very close, see definitions in section I.4, these problems basically differ by the region of parameters $\alpha,\pi$ that is of interest. Moreover the dictionary learning can be seen as the $\eta\to\infty$ limit of the blind calibration problem (which is trivially taken in the equations). We hence group the discussion of these problems in the present section, they all present the first order phase transition as low measurement noise.

The matrices $F$ and $X$ in our setting of the dictionary learning and sparse PCA problems are generated according to eqs. (21-22). Using the definitions of the input function $f_{X}$ in eq. (66), and $f_{F}$ in eq. (67) we obtain explicitly:

The functions $f_{c}(\Sigma,T)$ and $f_{s}(Z,W)$ are then obtained from eqs. (68-69).

In case of matrix calibration we have some prior knowledge on the matrix $F$ given by eq. (26). This leads to a function $f_{F}$ of the form

Indeed as the uncertainty in the matrix $\eta\to\infty$ this goes to the $f_{F}$ for dictionary learning or sparse PCA (242).

The output function $g_{\rm out}$ defined in eq. (23) is, for the output channel (56) with additive white Gaussian noise of variance $\Delta$ :

For such a simple output function the eqs. (173-178) in the density evolution simplify greatly into

which under the simplification of the Nishimori line gives

This is the only equation in the state evolution that explicitly depends on the variance of the measurement noise $\Delta$ . Also eqs. (181-186) simplify for distributions (21) and (26) and using the Nishimori identities they reduce to a pair of equations

Note that indeed only eq. (247) depends on the matrix uncertainty parameter $\eta$ , and the dictionary learning limit $\eta\to\infty$ is straightforward. The MMSE of the matrix $F$ , defined in (9), predicted by this state evolution is then $E_{F}$ . The MMSE of the matrix $X$ , defined in (7), is found equal to $E_{X}$

The two sets of initial conditions that we will analyze to investigate the MMSE and the associated phase transitions are

Random (uninformative) initialization: $m^{t=0}_{X}=0$ , and $m^{t=0}_{F}=1/(1+\eta)$ . Note that in the limit of dictionary learning $\eta\to\infty$ this initialization corresponds to a fixed point of the state evolution equations. This fixed point reflects the $N\!$ permutational symmetry in the dictionary learning problem, and its instability corresponds to a spontaneous breaking of this symmetry. In the limit of dictionary learning we will hence initialize the state evolution with $m^{t=0}_{F}$ being a very small positive constant, and we will see the behavior will not depend on its precise value.

Planted (informative) initialization: $m^{t=0}_{X}=\rho(\overline{X}^{2}+\sigma)-\delta_{X}$ , and $m^{t=0}_{F}=1-\delta_{F}$ , where $\delta_{X}$ and $\delta_{F}$ are small positive constants to test the “stability” of the zero MMSE point.

The free entropy density from which we compute the limiting performance of the Bayes-optimal inference in case of a first order phase transition is expressed from (241) as follows. For simplicity from now on (till the end of this section VI.1) we analyze only the case where the mean of the elements of $X^{0}$ was zero, $\overline{X}=0$ , and the variance of the nonzero ones was one, $\sigma=1$ )

where $\hat{m}$ is given by eq. (246). The dependence on the matrix uncertainty is only in the last term and the limit of the completely unknown matrix $F$ is easily taken by $\eta\to\infty$ .

The simple Eqs. (246)-(250) is all we need at the end to analyze the MMSEs $E_{X}$ and $E_{F}$ of dictionary learning, blind matrix calibration, sparse PCA and blind source separation problems when the signal $X^{0}$ and $F^{0}$ are generated according to eqs. (21-22) with $\overline{X}=0$ and $\sigma=1$ , $\rho$ is the fraction of nonzero elements in $X^{0}$ , and $\eta$ is the matrix uncertainty from (25). The parameter $\alpha=M/N$ is the ratio between number of lines and the number of columns of $F^{0}$ , $\pi=P/N$ is the ration between number of columns and the number of lines of $X^{0}$ . The output channel has additive white Gaussian noise of variance $\Delta$ , and this information about distributions and their parameters is used in the posterior likelihood in the optimal Bayes inference.

VI.1.2 Phase diagram for blind matrix calibration and dictionary learning

For the noiseless case $\Delta=0$ , and all positive $\eta>0$ , the linear stability analysis of eqs. (246-248) around the informative fixed point $m_{X}=\rho$ , $m_{F}=1$ (i.e. the MMSEs $E_{F}=E_{X}=0$ ) leads to an update for the perturbations $\delta_{F}^{t+1}=(\delta_{F}^{t}\rho+\delta_{X})/(\alpha\pi)$ and $\delta_{X}^{t+1}=\rho(\delta_{F}^{t}\rho+\delta_{X})/\alpha$ . By computations of the largest eigenvalue of the corresponding $2\times 2$ matrix we obtain that this informative fixed point is stable if and only if

In other words the zero MSE fixed point is locally stable above the counting lower bound (24).

Further we notice that in the low noise limit $\Delta\to 0$ , for all positive and finite $\eta$ , and for $m_{X}=\rho-\delta_{X}$ , $m_{F}=1-\delta_{F}$ with $\delta_{X}$ and $\delta_{F}$ being small positive constants of the same order as $\Delta$ the free entropy (250) becomes in the leading order $(\pi\alpha-\pi\rho-\alpha)\log(\Delta+\delta_{X}+\rho\delta_{F})/2$ . For $\pi>\pi^{*}(\alpha,\rho)$ this is a large positive value, and a large negative value for $\pi<\pi^{*}(\alpha,\rho)$ .

Hence for the noiseless measurements $\Delta=0$ the asymptotic Bayes-optimal MMSE is $E_{X}=0$ and $E_{F}=0$ for $\pi>\pi^{*}(\alpha,\rho)$ for all $\eta>0$ . This is a remarkable result as it implies that in the Bayes-optimal setting the dictionary is identifiable (or an exact calibration of the matrix possible) as soon as the number of samples $P$ per signal element is larger than the value $\pi^{*}(\alpha,\rho)$ given by the trivial counting bound (24).

The next question is whether this MMSE is achievable in a computationally tractable way. To answer this we study the state evolution starting in the uninformative initialization. First we analyze the behavior of the state evolution when $m_{X}=\delta_{X}$ , and $m_{F}=\delta_{F}$ where both $\delta_{X}$ , and $\delta_{F}$ are positive and small, while we also consider $\eta$ being very large. The linear expansion of state evolution update then leads to $\delta_{X}^{t+1}=\rho^{2}\alpha\delta_{F}/(\Delta+\rho)$ and $\delta_{F}=1/\eta+\pi\delta_{X}/(\Delta+\rho)$ . Hence for $\eta\to\infty$ the uninformative initialization is in fact a stable fixed point of the state evolution equations as long as

This means that for $\pi^{*}(\rho,\alpha)<\pi<\pi_{F}(\Delta,\rho,\alpha)$ the MMSE is not achievable in the dictionary learning (e.g. when $\alpha<1$ $\pi^{*}(0,\alpha)<\pi_{F}(0,0,\alpha)$ ) with the approximate message passing presented in this paper. This simple analysis leads us to the conclusion that a first order phase transition is in play in the dictionary learning problem, and as we will see also in the blind calibration ( $\eta<\infty$ ) and sparse PCA ( $\alpha>1$ ).

As a side remark let us remind that the limit $\eta\to 0$ should lead to results known from Bayesian compressed sensing. In particular in compressed sensing for low noise the matrix $X$ is identifiable if and only if $\alpha>\rho$ . To reconcile this with the previous results notice that indeed for $\eta=c\Delta\to 0$ with $c=O(1)$ the leading term of the free entropy becomes $\pi(\alpha-\rho)\log(\Delta)$ . Hence compressed sensing result is recovered. Whereas for $1\gg\eta\gg\Delta$ it is the dictionary learning static phase transition $\pi^{*}=\alpha/(\alpha-\rho)$ that is the relevant one.

Due to the close link between the two problems, we shall described the results for dictionary learning together with the case of blind matrix calibration. In both these cases we are typically trying to learn (calibrate) an overcomplete dictionary $\alpha<1$ and a sparse signal $X$ from as few samples $P$ as possible. We hence first plot in Fig. 2 (left) the MSE for $F$ (in red) and for $X$ (in blue) as a function of $\pi=P/N$ and fixed (representative) value of undersampling ratio $\alpha=0.5$ , and density $\rho=0.2$ in zero (or negligible) measurement noise, $\Delta=0$ . We consider several values of matrix uncertainty $\eta$ (the larger the value the less we know about the matrix). The AMP-MSE achieved from the uninformative initialization is depicted in dashed lines, the MMSE achieved in this zero noise case from the planted (informative) initialization is in full lines.

Fig. 2 (right) shows the value of the spinodal transition $\pi^{s}$ as a function of the matrix uncertainty $\eta$ . We see that, as expected, $\lim_{\eta\to 0}\pi^{s}(\eta)=\pi^{*}$ . A result that is less intuitive is that the large $\eta$ limit is also well defined and finite $\lim_{\eta\to\infty}\pi^{s}(\eta)=\pi^{s}_{\rm DL}<0$ . This means that even in the dictionary learning where no prior information about the matrix elements is available the dictionary is identifiable with AMP for large system sizes above the spinodal transition $\pi^{s}_{\rm DL}$ .

In Fig. 2 (right) we also see an interesting behavior in the function $\pi^{s}(\eta)$ for low values of $\rho$ - there is a sharp phase transition from low $\eta$ regime, where the AMP-MSE at the transition has a weak discontinuity towards a relatively low value of MSE, and a high $\eta$ regime where the discontinuity is very abrupt towards a value of MSE that is close to the completely uninformative value. This behavior is also illustrated in Fig. 3 (left) where we plot the AMP-MSE as a function of $\pi$ and $\eta$ (for fixed $\rho=0.05$ , $\alpha=0.5$ , $\Delta=0$ ).

Fig. 3 (right) depicts the phase diagram of dictionary learning ( $\eta\to\infty$ ) and blind matrix calibration (finite $\eta$ ) we plot the static threshold $\pi^{*}$ ( $\eta$ -independent) above which the matrix $F$ is identifiable in full red, and the spinodal threshold $\pi^{s}$ (for various values of $\eta$ ) above which the AMP identifies asymptotically the original matrix $F$ is dashed line. Notice that $\pi^{s}(\rho)$ diverges as $\rho\to\rho^{\rm CS}_{\rm BP}$ , where $\rho^{\rm CS}_{\rm BP}$ is the AMP phase transition in compressed sensing, see e.g. Krzakala et al. (2012). This is expected as for $\rho>\rho^{\rm CS}_{\rm BP}$ the sparse signal cannot be recovered with AMP even if the matrix $F$ is fully known.

From now on (also in following subsections) we will discuss only the case $\eta\to\infty$ when no prior information about the dictionary $F$ is available. In Fig. 4 (left) we illustrate the results for dictionary learning with non-zero measurement noise. The situation is qualitatively similar to what happens in noisy compressed sensing Krzakala et al. (2012). The first order phase transition is becoming weaker as the noise grows, until some value $\Delta^{*}$ above which there is no phase transition anymore and the AMP-MSE $=$ MMSE for the whole range of $\pi$ . In Fig. 4 (right) we then plot the AMP-MSE and the MMSE of the signal $X$ as a function of the noise variance $\Delta$ and $\pi$ for $\alpha=0.5$ and $\rho=0.2$ . This surface plot demonstrates how the sharp transition disappears and is replaced by a continuous evolution of the MSE when the noise is large. This MMSE is compared to the AMP-MSE for the same case in Fig. 5.

Sparse PCA as we set it in Section I.4 is closely related to dictionary learning. Except that from the three sizes of matrices $M$ , $N$ and $P$ the smallest one is the $N$ – corresponding to the matrix $Y$ to be of relatively low rank. Hence for sparse PCA we should only really consider $\alpha>1$ , $\pi>1$ .

Behavior of the state evolution is for this range of parameters qualitatively very similar to the one we just observed in the previous section. In Fig. 6 left we treat the case of $M=P$ , i.e. $\pi=\alpha$ , with zero measurement noise, and compute the smallest value for which the matrices $F$ and $\rho$ -sparse $X$ are recoverable for a given $\rho$ . Just as in the dictionary learning we obtain that the Bayes optimal MMSE is zero everywhere above the counting bound $\pi=\alpha>1+\rho$ , blue line in Fig. 6 left. The AMP-MSE is, however, zero only above the spinodal line, depicted in red in the figure. The gap between the two lines is not very large in this case.

The right part of Fig. 6 shows the AMP-MSE and the MMSE of the signal matrix $X$ for measurement noise variance $\Delta=10^{-10}$ and density $\rho=0.5$ .

In blind source separation, $P$ corresponds to the length of the signal, $N$ is the number of sources, and $M$ the number of sensors. Typically the signal is very long, i.e. one has $P\gg N$ and $P\gg M$ . The particularly interesting case is when there is more sources than sensors $\alpha<1$ in that case the signal can be reconstructed only if the density of the signal $\rho$ is smaller than a certain value. The counting bound gives us $\rho<\alpha(\pi-1)/\pi$ and this also corresponds to the value under which the MMSE drops to zero under zero measurement noise. As in the previous case also here we observe a first order phase transition and with AMP we can reach zero error (in the noiseless case) only below $\rho^{s}$ that we depict for $\pi=10$ as a function of $\alpha$ in Fig. 7.

VI.2 Low rank matrix completion

In the remaining examples we treat cases in which neither $F$ nor $X$ are sparse, $\rho=1$ , we start with low rank matrix completion.

In matrix completion the output function $g_{\rm out}$ is eq. (244) for the known matrix elements $\mu l$ (there is $\epsilon MP$ of them), and $g_{\rm out}(\omega,y,V)=0$ for the unknown elements $\mu l$ (there is $(1-\epsilon)MP$ of them). The input function $f_{X}$ (242) for non-sparse matrix $X$ , $\rho=1$ , becomes

In the state evolution, using the Nishimori identities, we then have

where $\epsilon$ is the fraction of known elements of $Y$ . Moreover for the input function (253) the state evolution equation for $m_{X}$ becomes

The equation for $m_{F}$ is the one of (247).

It is instrumental to analyze the local stability of the informative and the uninformative initialization for low rank matrix completion. For the informative initialization we consider $m_{X}^{t}=1-\delta_{X}^{t}$ , and $m_{F}^{t}=1-\delta_{F}^{t}$ , where $\delta_{X}^{t}$ , and $\delta_{F}^{t}$ are small positive numbers. The state evolution update equations at zero noise $\Delta=0$ lead to $\delta_{X}^{t+1}=(\delta_{X}^{t}+\delta_{F}^{t})/(\epsilon\alpha)$ , and $\delta_{F}^{t+1}=(\delta_{X}^{t}+\delta_{F}^{t})/(\epsilon\pi)$ . The largest eigenvalue of this $2\times 2$ system is $(\alpha+\pi)/(\epsilon\alpha\pi)$ and hence the informative fixed point is stable for $\epsilon>\epsilon^{*}=(\alpha+\pi)/(\alpha\pi)$ , which coincides with the counting bound eq. (28). This means that for $\epsilon>\epsilon^{*}$ the noiseless matrix completion, the matrices $F$ and $X$ can be recovered without error (asymptotically).

For the uninformative initialization we consider $m_{X}^{t}=\delta_{X}^{t}$ , and $m_{F}^{t}=\delta_{F}^{t}$ , where $\delta_{X}^{t}$ , and $\delta_{F}^{t}$ are again small positive numbers. This time the state evolution equations give $\delta_{F}^{t+1}=\pi\epsilon\delta_{X}^{t}/(1+\Delta)$ and $\delta_{X}^{t+1}=\alpha\epsilon\delta_{F}^{t}/(1+\Delta)$ . Hence the uninformative fixed point is stable for $\epsilon<(\Delta+1)/\sqrt{\pi\alpha}$ . This can indeed be verified in Fig. 8.

In matrix completion we treat matrices $Y$ of low rank, hence $N$ is much smaller than both $P$ and $M$ . The main questions concerns the fraction $\epsilon$ of elements that need to be known in order for the recovery of $X$ and $F$ to be possible. In this case we did not identify first order phase transition, as a result we have MMSE $=$ AMP-MSE. At zero measurement noise we observed a phase transition from a phase of perfect recovery to a phase with positive MMSE, its position coincides with the counting bound (28), $\epsilon^{*}=(\alpha+\pi)/{\alpha\pi}$ . With non-zero measurement noise, in the scaling of noise and rank we consider in this paper, the behavior of MMSE as a function of the other parameters is smooth and derivable (no phase transition). In Fig. 8 we plot an example of the MMSE as a function of the fraction of known elements $\epsilon$ for squared matrix $Y$ , i.e $\alpha=\pi$ . We generated the signal elements with zero mean and unit variance, $\overline{X}=0$ , $\sigma=1$ .

Our analysis suggest that compared to the cases with non-zero sparsity the low-rank matrix completion is a much easier problem, at least in the random setting considered in the present paper. The fact that the “counting” threshold can be saturated or close to saturated in the noiseless case by several algorithms can be seen e.g. in the data presented in Parker et al. (2014).

VI.3 Robust PCA

The input functions are the eq. (242) for matrix $F$ and eq. (253) for matrix $X$ . In robust PCA as defined by (30) we get for the output function

The state evolution in the Bayes-optimal setting, i.e. using the Nishimori identities, becomes

Equation for $m_{F}^{t}$ is (247), and for $m_{X}^{t}$ is (255).

In robust PCA the informative initialization is again $m_{X}^{t}=1-\delta_{X}^{t}$ , and $m_{F}^{t}=1-\delta_{F}^{t}$ , where $\delta_{X}^{t}$ , and $\delta_{F}^{t}$ are small positive numbers. For small noise $\Delta_{s}$ and $\Delta_{l}=O(1)$ the corresponding fixed point is stable under the same conditions as for low rank matrix completion, i.e. for $\epsilon>\epsilon^{*}=(\alpha+\pi)/(\alpha\pi)$ .

The uninformative fixed point is $m_{X}^{t}=\delta_{X}^{t}$ , and $m_{F}^{t}=\delta_{F}^{t}$ , where $\delta_{X}^{t}$ , and $\delta_{F}^{t}$ are again small positive numbers. This evolves as $\delta_{X}^{t+1}=\alpha\delta_{F}^{t}\hat{m}$ , $\delta_{F}^{t+1}=\pi\delta_{X}^{t}\hat{m}$ , with $\hat{m}=(1+\delta_{s}=\epsilon\Delta_{l}-\epsilon\Delta_{s})/[(1+\Delta_{s})(1+\Delta_{l})]$ in this limit. Hence for instance for $\Delta_{s}\to 0$ , $\Delta_{l}=1$ and $\pi=\alpha$ we have that the uninformative fixed point is stable for $\epsilon<2/\alpha-1$ , which again corresponds to what we observe in Fig. 9.

In the example of Fig. 9 we plot the MMSE as a function of the fraction of undistorted elements $\epsilon$ in the case of squared matrix $Y$ , $\alpha=\pi$ , the variance of the large distortions $\Delta_{l}=1$ and two different values of the small measurement noise $\Delta_{s}$ . We see a second order phase transition at the counting bound for $\Delta_{s}=0$ and a smooth decay of the MMSE for $\Delta_{X}>0$ .

It is interesting to compare how well robust PCA can be solved with respect to the matrix completion. In both cases $\epsilon$ is the fraction of known elements. The difference is that in matrix completion their position is known, whereas in robust PCA it is not. Intuitively the R-PCA should thus be a much harder problem. This is not confirmed in our analysis that instead suggest that robust PCA is as easy as matrix completion, since the zero noise phase transitions in the two coincide. Moreover, whereas at $\epsilon\to 0$ there is no information left in matrix completion (that is why the MMSE $=1$ ), in robust PCA the largely distorted elements can still be explored and the MMSE $<1$ . Note, however, that algorithmically it seems less easy to saturate this theoretical asymptotic performance in R-PCA, see e.g. Figure 7 in Parker et al. (2014).

VI.4 Factor analysis

In factor analysis, the input functions $f_{F}$ and $f_{X}$ are the same as the dictionary learning (242) at $\rho=1$ . The output function (56) for factor analysis (31) is given by

where $\psi_{\mu}$ is the variance of the $\mu$ -th component of the unique factor. The variance of unique factor $\psi_{\mu}$ depends here on the index $\mu$ and does not on the index $l$ , which leads to a slight modification in the derivation of the state evolution from section V.1. For simplicity, we assume that $\psi_{\mu}$ ’s are known; in practice, these should be estimated by the expectation and maximization scheme in conjunction with GAMP. Then, we obtain, on the Nishimori line

where $\langle\cdot\rangle_{\mu}$ means the average over $\psi_{\mu}$ , the variance of the elements of the matrix $F$ is denoted $Q_{F,\mu}^{0}$ .

To analytically solve (261), one has to specify the distributions of $\psi_{\mu}$ and $Q_{F,\mu}^{0}$ . We set $Q_{F,\mu}^{0}=Q_{F}^{0}=1$ for all $\mu$ , and suppose a two-peak distribution for $\psi$ as

Let us also assume $\overline{X}=0$ and $\sigma=1$ . In this case the state evolution can be summarized as

The total MMSE is given by $E_{F}=1-(\epsilon m_{F1}+(1-\epsilon)m_{F2})$ and $E_{X}=1-m_{X}$ . Fig. 10 (left) shows the $\pi$ -dependence of the MMSE at $\alpha=2$ , $\psi_{1}=0.5$ , and $\psi_{2}=2$ for $\epsilon=0$ , 0.5, 0.9 and 1.

We analyze again the stability of the uninformative fixed point, $(m_{X},m_{F_{1}},m_{F_{2}})=(0,0,0)$ , of the state evolution. Small positive numbers $\delta_{X}$ , $\delta_{F_{1}}$ , and $\delta_{F_{2}}$ that give the uninformative initialization $m_{X}=\delta_{X}$ , $m_{F_{1}}=\delta_{F_{1}}$ , and $m_{F_{2}}=\delta_{F_{2}}$ evolve under the state evolution as

These expressions indicate that the uninformative fixed point becomes unstable when $\alpha\pi\Big{[}\frac{\epsilon}{(\psi_{1}+1)^{2}}+\frac{1-\epsilon}{(\psi_{2}+1)^{2}}\Big{]}>1$ . The critical values of $\pi$ given by this condition coincides with the transition point where the MMSE departs from 1 shown in Fig. 10 (left). As an example of $\epsilon$ -dependence, we show the MMSE of $X$ at $\psi_{1}=10^{-2}$ , $\psi_{1}=10^{-5}$ and $\psi_{2}=1$ for four different values of $\alpha=\pi$ in Fig. 10 (right). Consistently with our analysis, in these cases the uninformative initialization is always unstable.

The transition associated with the stability of the informative fixed point occurs only when at least one of $\psi$ s tends to be zero. For instance when $\psi_{1}=0$ , the informative initialization corresponds to $\delta_{F_{1}}=1-m_{F_{1}}$ and $\delta_{X}=1-m_{X}$ that are given by

without depending on $\psi_{2}$ . These expressions mean that when $\epsilon>\epsilon^{*}=\pi/[\alpha(\pi-1)]$ , the informative fixed point is stable, consistently with Fig. 10 (right).

The state evolution of factor analysis for the two-peak case is qualitatively similar to that of robust PCA and low rank matrix completion, but the values of the phase transition points differ.

VII Discussion and Conclusions

In this paper we have analyzed various examples of the matrix factorization problem. We obtain a matrix $Y$ that is an element-wise noisy measurement of an unknown matrix $Z=FX$ , where both $Y$ , and $Z$ are $M\times P$ matrices, $F$ is a $M\times N$ matrix, and $X$ is a $N\times P$ matrix. We have considered the computational tractability of this problem in the large size limit $N\to\infty$ while $\pi=P/N=O(1)$ , and $\alpha=M/N=O(1)$ . Our analysis concerns the teacher-student scenario where $X$ and $F$ are generated with random independent elements of some known probability distributions and we employ the Bayes-optimal inference scheme to recover $F$ and $X$ from $Y$ .

Let us summarize our contribution: We derived the approximate message passing algorithm for matrix factorization. One version of the algorithm —for calibration and dictionary learning— was reported in Krzakala et al. (2013), and a very related algorithm called Big-AMP was discussed by Parker et al. (2013, 2014). This algorithm is derived from belief propagation. We have presented the AMP for matrix factorization in several forms in Sections III.1, and III.2. We have also discussed simplifications that arise in the Bayes-optimal setting when we can use the Nishimori identities (Section III.3), or when the matrix is large and one uses self-averaging of some of the quantities appearing in GAMP (Section III.4). We focused on the theoretical properties of the AMP algorithm, for a robust practical implementation we refer the reader to the works Parker et al. (2013, 2014) that include a very complete report on its performance on a range of benchmarks.

Next to the AMP algorithm we have also derived the corresponding Bethe free entropy in Section IV. The Bethe free entropy evaluated at a fixed point of the GAMP equations approximates the log-likelihood of the corresponding problem. We mainly use it in situations when we have more than one fixed point of GAMP, it is then the one with the largest values of the Bethe entropy that asymptotically given the MMSE of the Bayes-optimal inference. We also derived a variational Bethe free entropy in Section IV.2. This is a useful quantity that can serve in controlling the convergence of the AMP approach. Alternatively, a direct maximization of this expression is a promising algorithm itself (see Krzakala et al. (2014) for an investigation of this idea for compressed sensing).

The AMP algorithm for matrix factorization is amenable to asymptotic analysis via the state evolution technique that was carried out rigorously for approximate message passing in compressed sensing Bayati & Montanari (2011). We derive the state evolution analysis for matrix factorization using tools of statistical mechanics. In particular we use two approaches leading to equivalent results the cavity method (Section V.1) and the replica method (Section V.2). Our derivation of the state evolution is not rigorous, but we conjecture that it is nevertheless asymptotically exact as is the case in many other systems of this type including the compressed sensing. The main result of this state evolution are simple iterative equations that provide a way to compute the MMSE of the Bayes-optimal inference as well as the MSE reached theoretically in the large size limit by the AMP algorithm. The rigorous proof of the formulas derived in this paper is obviously an important topic for future work.

The main results of this paper concern analysis of MMSE and AMP-MSE for various interesting examples of the matrix factorization problem. We analyze the asymptotic phase diagrams for dictionary learning, blind matrix calibration, sparse PCA, blind source separation, matrix completion, robust PCA and factor analysis. Earlier results on this analysis appeared in Krzakala et al. (2013); Sakata & Kabashima (2013). We find that when one of the matrices $F$ or $X$ is sparse the problems undergo a first order phase transition which is related to an interesting algorithmic barrier known for instance from compressed sensing Krzakala et al. (2012).

It is a generic observation that for most of the problems we analyzed the theoretically achievable performance is much better than the one achievable by existing algorithms. The AMP algorithm should be able to match this performance for very large systems which is the most exciting perspective for further development of this work. If successful it could lead to an algorithmic revolution in various application of the matrix factorization.

In this paper we concentrate on the theoretical analysis and not on the performance of the algorithm itself. Some studies of the performance of some versions of the algorithm can be find in Krzakala et al. (2013); Parker et al. (2014). We, however, observed that the performance depends strongly on the implementation details and we did not yet found a way to match the theoretically predicted performance for systems of treatable (practical) size in all cases.

It is worth discussing some of these algorithmic issues in this conclusion. One of the main problems it that the GAMP algorithm with parallel update presents instabilities that drive its evolution away from the so-called Nishimori line; see a recent study of this issue in the compressed sensing problem Caltagirone et al. (2014). This can be seen even in the state evolution when we do not assume explicitly that the result of the Bayes-optimal inference corresponds to a fixed point that belongs to the Nishimori line. There are ways how to avoid these issues, e.g. we observed that the difficulties basically disappear when the sequential update of the message passing algorithm from Section III.1 is used instead of the parallel one. This, however, does not scale very well with the systems size and our results were hence spoiled by very strong finite size effects. When learning the $M\times N$ matrix $F$ , and the $N\times P$ matrix $X$ we also often observed that a (very small) number of the $P$ signals $X_{l}$ were not correctly reconstructed, and these “rogue” vectors in $X_{l}$ were polluting the reconstruction of $F$ . An exemple of these finite size effects can be observed in Krzakala et al. (2013).

In the work of Parker et al. (2013, 2014) a part of the problems with convergence of the corresponding algorithm was mitigated by adaptive damping (though maybe with not the most suitable cost function, see Sec. IV) and expectation maximization learning. Why this is helpful is theoretically explained in the recent work Caltagirone et al. (2014) for compressed sensing. However, the implementation of Parker et al. (2014) does not match the theoretical performance predicted in this paper either (we have explicitly tried for the dictionary learning and robust PCA examples). This shows that more work is needed in order to reach a practical algorithm able to achieve the prediction at moderate sizes. A more proper understanding of these problems, and further developments of the algorithm are therefore the main direction of our future work.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union’s $7^{th}$ Framework Programme (FP/2007-2013)/ERC Grant Agreement 307087-SPARCS. Financial support by the JSPS Core-to-Core Program “Non-equilibrium dynamics of soft matter and information” and JSPS/MEXT KAKENHI Nos. 23-4665 and 25120013 is also acknowledged.