Phase transitions and sample complexity in Bayes-optimal matrix factorization

Yoshiyuki Kabashima, Florent Krzakala, Marc Mézard, Ayaka Sakata, Lenka Zdeborová

I Introduction

We study in this paper a variety of questions which all deal with the general problem of matrix factorization. Generically, this problem is stated as follows: Given a M×PM\times P dimensional matrix YY, that was obtained from noisy element-wise measurements of a matrix ZZ, one seeks a factorization Z=FXZ=FX, where the M×NM\times N dimensional matrix FF and the N×PN\times P dimensional matrix XX must satisfy some specific requirements like sparsity, low-rank or non-negativity.

From a machine learning point of view, matrix factorization can be applied to unsupervised learning of data representation Bengio et al. (2013). The success of machine learning, including recent progress such as deep learning LeCun et al. (2015), depends largely on data representations. Explicit approaches to efficient data representation, such as matrix factorization, are hence of wide relevance. Other applications that can be formulated as matrix factorization include dictionary learning or sparse coding Olshausen et al. (1996); Olshausen & Field (1997); Kreutz-Delgado et al. (2003), sparse principal component analysis Zou et al. (2006), blind source separation Belouchrani et al. (1997), low rank matrix completion Candès & Recht (2009); Candès & Tao (2010) or robust principal component analysis Candès et al. (2011), that will be described below.

Theoretical limits on when matrix factorization is possible and computationally tractable are still rather poorly understood. In this work we make a step towards this understanding by predicting the limits of matrix factorization and its algorithmic tractability when ZZ is created using randomly generated matrices FF and XX, and measured element-wise via a known noisy output channel Pout(YZ)P_{\rm out}(Y|Z). Our results are derived in the limit where N,M,PN,M,P\to\infty with fixed ratios M/N=αM/N=\alpha, P/N=πP/N=\pi. We predict the existence of sharp phase transitions in this limit and provide the explicit formalism to locate them.

We use two types of methods in this paper. The first one is based on a generalization of approximate message passing (AMP) Donoho et al. (2009) to the matrix factorization problem, and on its asymptotic analysis which is known in statistical physics as the cavity method Mézard et al. (1987); Mézard & Montanari (2009), and has been called state evolution in the context of compressed sensing Donoho et al. (2009). The second method that we use in the following is the replica method. These two methods are widely believed to be exact in the context of theoretical statistical physics, but most of the results that we shall obtain in the present work are not rigorously established. Our predictions have thus the status of conjectures. A first cross-check of the correctness of these conjectures is the fact that the two methods give identical results. This has been understood first in the context of spin glasses Mézard et al. (1987).

This work builds upon some previous steps that we described in earlier reports Sakata & Kabashima (2013); Krzakala et al. (2013). The message passing algorithm related to our analysis was first presented in Krzakala et al. (2013) and is very closely related to the Big-AMP algorithm developed and tested in Schniter et al. (2012); Parker et al. (2013, 2014); relations and differences with Big-AMP will be mentioned in several places throughout the paper. Our main focus here, beside the detailed derivation of the algorithm, is the asymptotic analysis and phase diagrams which were not studied in Schniter et al. (2012); Parker et al. (2013, 2014). We also discuss several variants of the AMP algorithm that are interesting for theoretical reasons. Several of these variants, however, have convergence problems when implemented straightforwardly. For a robust implementation of the algorithm that can be used on practical benchmarks we refer to the works Parker et al. (2013, 2014).

Our general method provides a unifying framework for the study of computational tractability and identifiability of various matrix factorization problems. The first step for this synergy is the formulation of the problem via a graphical model (see Fig. 1) that is amenable to analysis using the present methods.

The phase diagrams that we shall derive establishes for each problem two types of thresholds in the plane απ\alpha-\pi : the threshold where the problem of matrix factorization ceases to be solvable in principle, and the threshold where AMP ceases to find the best solution. In most existing works the computation of phase transitions was treated separately for each of the various problems. For instance, redundant dictionaries for sparse representations and low rankness are usually thought as two different kinds of dimensional reduction. Interestingly, in our work a wide class of problems is treated within one unified formalism; this is theoretically interesting in the context of recent developments Elad (2012).

The element-wise measurement yμly_{\mu l} of zμlz_{\mu l} is then specified by some known probability distribution function Pout(yμlzμl)P_{\rm out}(y_{\mu l}|z_{\mu l}), so that:

The goal of matrix factorization is to estimate both matrices FF and XX from the measurements YY.

In this paper we will treat this problem in the framework of Bayesian inference. In particular we will assume that the matrices FF and XX were both generated from a known separable probability distribution

Although we restrict to separable prior probability distributions PF(F)P_{F}(F), PX(X)P_{X}(X), it turns out that these priors can encode a broad range constraints such as, for instance sparsity. This is the reason why so many different problems can be studied within our scheme. The output channel Pout(y,z)P_{\rm out}(y,z) can be of a rather generic nature, thus including various kinds of additive and multiplicative noise; in the spirit of activation functions from neural networks it can degrade the information included in the measurement by e.g. keeping only the sign of elements of zz.

The posterior distribution of FF and XX given the measurements YY is written as

where Z(Y){\cal Z}(Y) is the normalization constant, known as the partition function in statistical physics.

Notice that, while the original problem of finding FF and XX, given the measurements YY, is not well determined (because of the possibility to obtain, from a given solution, an infinity of other solutions through the transformation FFU1F\to FU^{-1} and XUXX\to UX, where UU is any N×NN\times N nonsingular matrix), the fact of using well defined priors PFμiP^{\mu i}_{F} and PXilP^{il}_{X} actually lifts the degeneracy: the problem of finding the most probable F,XF,X given the measurements and the priors is well defined. In case the priors PFμiP^{\mu i}_{F} and PXilP^{il}_{X} do not depend on the indices μl\mu l and ilil we are left with a permutational symmetry between the NN column of FF and NN rows of XX. Both in the algorithm and the asymptotic analysis this symmetry is broken and one of the N ⁣N\! solutions is chosen at random.

Typically, in most applications, the distributions PoutP_{\rm out}, PFP_{F} and PXP_{X} will depend on a set of parameters (such as the mean, variance, sparsity, noise strength, etc.) that we usually will not write explicitly in the general case, in order to simplify the notations. The prior knowledge of these parameters is not necessarily required in our approach: these parameters can be learned via an expectation-maximization-like algorithm that we will discuss briefly in section II.5.

Note also that Eq. 1 can be multiplied by an arbitrary constant: with a corresponding change in the output function the problem will not be modified. In the derivations of this paper we choose the above constant in such a way that the elements of matrices XX, YY, and ZZ are of order O(1)O(1), whereas the elements of FF scale in a consistent way, meaning that the mean of each FμiF_{\mu i} is of order O(1/N)O(1/N) and its variance is also of order O(1/N)O(1/N).

I.2 Bayes-optimal inference

Our paper deals with the general case of incomplete information. This happens when the reconstruction assumes that the matrices XX, FF and YY were generated with some distributions PXP_{X}, PFP_{F} and Pout(YZ)P_{\rm out}(Y|Z), whereas in reality the matrices were generated using some other distributions PX0P_{X^{0}}, PF0P_{F^{0}} and Pout0(YZ)P^{0}_{\rm out}(Y|Z). The message passing algorithm and its asymptotic evolution will be derived in this general case.

However, our most important results concern the Bayes-optimal setting, i.e. when we assume that

In this case, an estimator XX^{\star} that minimizes the mean-squared error (MSE) with respect to the original signal X0X^{0}, defined as

is obtained from marginals of XilX_{il} with respect to the posterior probability measure P(F,XY)P(F,X|Y), i.e.,

is the marginal probability distribution of the variable ilil. The mean squared error achieved by this optimal estimator is called the minimum mean squared error (MMSE) in this paper.

A similar result holds for the estimator of F0F^{0} that minimizes the mean-squared error

which is obtained from the mean of FμiF_{\mu i} with respect to the posterior probability measure P(F,XY)P(F,X|Y). In the remainder of this article we will be using these estimators.

I.3 Statement of the main result

The main result of this paper are explicit formulas for the MMSE achievable in the Bayes optimal setting (as defined above) for the matrix factorization problem in the “thermodynamic limit”, i.e. when N,M,PN,M,P\to\infty with fixed ratios M/N=αM/N=\alpha, P/N=πP/N=\pi. When sparsity is involved we consider that a finite fraction of matrix elements are non-zero. Similarly, when we treat matrices with low ranks we consider again the ranks to be a finite fraction of the total dimension. We also derive the AMP-MSE, i.e. the mean square error achievable by the approximate message passing algorithm as derived in this paper.

So far we were characterizing the output channel by the conditional distribution Pout0(yz0)P^{0}_{\rm out}(y|z^{0}) or Pout(yz)P_{\rm out}(y|z). It will be useful to think of the output as a deterministic function hh of zz and of random variables ww, i.e. yμl=h(zμl,wμl)=h0(zμl0,wμl0)y_{\mu l}=h(z_{\mu l},w_{\mu l})=h^{0}(z^{0}_{\mu l},w_{\mu l}^{0}). The random (“noise”) variables ww and w0w^{0} are specified by their probability distributions P(w)P(w) and P0(w0)P_{0}(w^{0}). We can relate PoutP_{\rm out} to hh as follows

To compute the MMSE and AMP-MSE we need to analyze the fixed points of the following iterative equation.

The output function goutg_{\rm out} is defined using the output probability PoutP_{\rm out} as

The AMP-MSE is obtained from a fixed point reached with a so-called uninformative initialization. The uninformative initialization does not use any information about the seeked matrices F0,X0F^{0},X^{0}, it uses only the prior distributions, it is defined as

In case the prior distribution depends on another random variables, e.g. in case of matrix calibration, we take additional average with respect to that variable. If the above initialization gives mFt=0=0m_{F}^{t=0}=0 and mXt=0=0m_{X}^{t=0}=0 then this is a fixed point of the above iterative equations. This is due to the permutational symmetry between the columns of matrix FF and rows of matrix XX. To obtain a nontrivial fixed point we initialize at mFt=0=ηm_{F}^{t=0}=\eta for some very small η\eta, corresponding to an infinitesimal prior information about the matrix elements of the matrix FF.

To compute the MMSE we need to initialize the iterations in an informative way, using the knowledge of the true X0X^{0} and F0F^{0}. This informative initialization is defined as an infinitesimal perturbation of

If the resulting fixed point agrees with the AMP-MSE then this is also the MMSE. In case that this informative initialization leads to a different fixed point than the AMP-MSE then the MMSE is given by the one of them for which the following free entropy is larger

Sections III, IV, V are devoted to the derivation of these formulas for MMSE and AMP-MSE. In section VI we then give a number of examples of phase diagram for specific applications of the matrix factorization problem.

We reiterate at this point that whereas our results are exact results within the theoretical physics scope of the cavity/replica method, we do not provide their proofs and their mathematical status is that of conjectures. We devote section II.3 to explaining the reasons of why this is so. The situation is comparable to the predictions-conjectures that were made for the satisfiability threshold Mézard et al. (2002), which have been partly established rigorously recently Ding et al. (2014), or the predictions-conjectures that were made for the CDMA problem Tanaka (2002); Guo & Verdú (2005), and were later proved by Montanari & Tse (2006); Bayati & Montanari (2011).

I.4 Applications of matrix factorization

Several important problems which have received a lot of attention recently are special cases of the matrix factorization problem as set above. In this paper we will analyse the following ones.

This is a basic example of finding a representation of data that is advantageous in some way. Representation learning Bengio et al. (2013) is a concept behind many machine learning applications, including the deep learning framework which has become very popular in recent time LeCun et al. (2015). In the context of representation learning dictionary learning is sometimes referred to as sparse coding Bengio et al. (2013).

Dictionary learning relies on the fact that signals of interest are often sparse in some basis; this property is widely used in data compression and more recently in compressed sensing. A lot of work has been devoted to analyzing bases in which different data are sparse. The goal of dictionary learning is to infer a basis in which the data are sparse based purely on a large number of samples from the data. The M×PM\times P matrix YY then represents the PP samples of MM-dimensional data. The goal is to decompose Y=FX+WY=FX+W into a M×NM\times N matrix FF, and a N×PN\times P sparse matrix XX, WW is the noise.

In this paper we will analyse the following teacher-student scenario of dictionary learning. We will generate a random Gaussian matrix F0F^{0} with iid elements of zero mean and variance 1/N1/N, and a random Gauss-Bernoulli matrix X0X^{0} with fraction 0<ρ<10<\rho<1 of non-zero elements. The non-zero elements of X0X^{0} will be iid Gaussian with mean X\overline{X} and variance σ\sigma. The noise W0W^{0} with elements wμl0w^{0}_{\mu l} is also iid Gaussian with zero mean and variance Δ\Delta. We hence have

where δ(X)\delta(X) is the Dirac delta function. The goal is to infer F0F^{0} and X0X^{0} from the knowledge of YY with the smallest possible number of samples PP.

In the noiseless case, Δ=0\Delta=0, exact reconstruction might be possible only when the observed information is larger that the information that we want to infer. This provides a simple counting bound on the number of needed samples PP:

Note that the above assumptions on PFP_{F}, PXP_{X} and PoutP_{\rm out} likely do not hold in any realistic problem. However, it is of theoretical interest to analyze the average theoretical performance and computational tractability of such a problem, as it gives a well defined benchmark. Moreover, we anticipate that if we develop an algorithm working well in the above case it might also work well in many real applications where the above assumptions are not satisfied, in the same spirit as the approximated message passing algorithm derived for compressed sensing with zero mean Gaussian measurement matrices Donoho et al. (2009) works also for other kinds of matrices.

Typically, we would look for an invertible basis F0F^{0} with N=MN=M, in that case we speak of a square dictionary. However, with the compressed-sensing application in mind, it is also very interesting to consider that YY might be under-sampled measurements of the actual signal, corresponding then to α=M/N<1\alpha=M/N<1. Hence we will be interested in the whole range 0<α10<\alpha\leq 1. The regime of α<1\alpha<1 corresponds to an overcomplete dictionary, in which case each of the PP measurements yl\vec{y}_{l} is a sparse linear combination of the columns (atoms) of the dictionary.

We remind that the NN columns of the matrix F0F^{0} can always be permuted arbitrarily and multiplied by ±1\pm 1. This is also true for rows of the matrix X0X^{0}: all these operations do not change YY, nor the posterior probability distribution. This is hence an intrinsic freedom in the dictionary learning problems that we have to keep in mind. Note that many works consider the dictionary to be column normalized, which lifts part of the degeneracy in some optimization formulations of the problem. In our setting the equivalent of column normalization is asymptotically determined by the properties of the prior distribution.

In dictionary learning one does not have any specific information about the elements Fμi0F^{0}_{\mu i}. However in some applications of compressed sensing one might have an approximate knowledge of the measurement matrix: it is often possible to use known samples of the signal XX in order to calibrate the matrix in a supervised manner (i.e. using known training samples of the signal). Sometimes, however, the known training samples are not available and hence the only way to calibrate is to measure a number of unknown samples and perform their reconstruction and calibration of the matrix at the same time, such a scenario is called the blind calibration.

In blind matrix calibration, the properties of the signal and the output function are the same as in dictionary learning, eqs. (22-23). As for the matrix elements Fμi0F^{0}_{\mu i} one knows a noisy estimation FμiF^{\prime}_{\mu i}. In this work we will assume that this estimation was obtained from Fμi0F^{0}_{\mu i} as follows

where ξμi\xi_{\mu i} is a Gaussian random variables with zero mean and variance 1/N1/N. This way, if the matrix elements Fμi0F^{0}_{\mu i} have zero mean and variance 1/N1/N, then the same is true for the elements FμiF^{\prime}_{\mu i}.

The control parameter η\eta is then quantifying how well one knows the measurement matrix. It provides a way to interpolate between the pure compressed sensing η=0\eta=0, where one knows the measurement matrix F0F^{0}, and the dictionary learning problems η\eta\to\infty. Explicitly, the prior distribution of a given element of the matrix FF is

where N(a,b){\cal N}(a,b) is a Gaussian distribution with mean aa and variance bb.

Another special case of matrix factorization that is often studied is the low-rank matrix completion. In that case one “knows” only a small (but in our case finite when NN\to\infty) fraction ϵ\epsilon of elements of the M×PM\times P matrix YY. Also one knows which elements are known and which are not; let us call M{\cal M} the set on which elements are known, it is a set of size ϵMP\epsilon MP. In this case the output function is:

The precise choice on the function on the second line is arbitrary as long as it does not depend on the zμlz_{\mu l}. In what follows we will assume that the ϵMP\epsilon MP known elements were chosen uniformly at random.

In low rank matrix completion NN is small compared to MM and PP, hence both π\pi and α\alpha are relatively large. Note, however, that the limit we analyse in this paper keeps π=O(1)\pi=O(1) and α=O(1)\alpha=O(1) while NN\to\infty, whereas in many previous works on low-rank matrix completion the rank was considered to be O(1)O(1) and hence α\alpha and π\pi of order O(N)O(N). Compared to those works the analysis here applies to “not-so-low-rank” matrix completion. The question is what fraction ϵ\epsilon of elements of YY needs to be known in order to be able to reconstruct the two matrices F0F^{0} and X0X^{0}.

For negligible measurement noise, Δ=0\Delta=0, a simple counting bound gives that the fraction of known elements we need for reconstruction is at least

Again we will study the student-teacher scenario when a low-rank ZZ is generated from X0X^{0} and F0F^{0} having iid elements distributed according to eq. (21) and (22) with ρ=1\rho=1 (no sparsity). To construct YY we keep a random fraction ϵ\epsilon of elements of ZZ, and the goal is to reconstruct X0X^{0} and F0F^{0} from that knowledge.

We also note that variants on matrix completion where the output channel Pout(y,z)P_{\rm out}(y,z) keeps only the sign of zz were also considered, called 1-bit matrix completion, and have some interesting properties reported in Davenport et al. (2014); Davenport & Romberg (2015).

Principal component analysis (PCA) is a well-known tool for dimensional reduction. One usually considers the singular value decomposition (SVD) of a given matrix and keeps a given number of largest values, thus minimizing the mean square error between the original matrix and its low-rank approximation. The SVD is computationally tractable, and provides the minimization of the mean square error between the original matrix and its low-rank approximation. However, with additional constraints there is no general computationally tractable approach.

A variant of PCA that is relevant for a number practical application requires that one of the low-rank components is sparse. The goal is then to approximate a matrix YY by a product FXFX where FF is a tall matrix, and XX a wide sparse matrix. The teacher-student scenario for sparse PCA that we will analyse in this paper uses eq. (21-23) and the matrix dimensions are such that π=P/N\pi=P/N and α=M/N\alpha=M/N are both large, but still of order O(1)O(1), and comparable one to the other. Hence it is only the region of interest for α\alpha and π\pi that makes this problem different from the dictionary learning. Note that, in the same way as for the matrix completion, many works in the literature consider N=O(1)N=O(1) whereas here we have NN\to\infty in such a way that π=P/N=O(1)\pi=P/N=O(1) and α=M/N=O(1)\alpha=M/N=O(1). Hence we work with low rank, but not as low as most of the existing literature. For a statistical physics study following the lines of the present paper where the rank is O(1)O(1) see Lesieur et al. (2015).

In the zero measurement noise case, Δ=0\Delta=0, the simple counting bound gives that the rank NN for which the reconstruction problem may be solvable needs to be smaller than

One important application where sparse PCA is relevant is the blind source separation problem. Given NN ρ\rho-sparse (in some known basis) source-signals of dimension PP, they are mixed in an unknown way via a matrix FF into MM channel measurements YY. In blind source separation typically both the number of sources NN and the number of channels (sensors) MM are small compared to PP. When N<MN<M we obtain an overdetermined problem which may be solvable even for ρ=1\rho=1. More interesting is the undetermined case with the number of sensors smaller than the number of sources, M<NM<N, which would not be solvable unless the signal is sparse ρ<1\rho<1 (in some basis), in that case the bound (29) applies.

Another variant of PCA that arises often in practice is the robust PCA, where the matrix YY is very close to a low rank matrix FXFX plus a sparse full rank matrix. The interpretation is that YY was created as low rank but then a small fraction of elements was distorted by a large additive noise. The resulting YY is hence not low rank.

In this paper we will analyse a case of robust PCA when FF and XX are generated from eq. (21-22) with ρ=1\rho=1 and the output function is

where ϵ\epsilon is the fraction of elements that were not largely distorted, Δs1\Delta_{s}\ll 1 is the small measurement noise on the non-distorted elements, Δl\Delta_{l} is the large measurement noise on the distorted elements. We will require ΔlX2+σ\Delta_{l}\approx\overline{X}^{2}+\sigma to be comparable to the variance of zμlz_{\mu l} such that there is no reliable way to tell which elements were distorted by simply looking at the distribution of yμly_{\mu l}. The parameters regime we are interested in here is π\pi and α\alpha both relatively large and comparable one to another.

In robust PCA in the zero measurement noise case, Δs=0\Delta_{s}=0, the simple counting bound gives that the fraction of non-distorted elements ϵ\epsilon under which the reconstruction may still be solvable needs to satisfy the same bound as for matrix completion (28). Indeed this counting bound does not distinguish between the case of matrix completion when the position of known elements of YY is known and the case of RPCA when their positions are unknown.

The characteristic feature of FA is to take into account the site dependence of the output function as

In many application of matrix factorization it is known that both the coefficient of the signals XX and the coefficients of the dictionary FF must be non-negative. In our setting this can be taken into account easily by imposing the distributions PXP_{X} and PFP_{F} to have non-negative support. In the student-teacher scenario of this paper we can hence consider the nonzero elements of XX to be PX(X)=0P_{X}(X)=0 for X<0X<0 and PX(X)=N(X,σ)P_{X}(X)=N(\overline{X},\sigma) for X>0X>0, and analogously for PFP_{F}.

I.5 Related work and positioning of our contribution

Matrix factorization and its special cases as mentioned above are well-studied problems with extensive theoretical and algorithmic literature that we cannot fully cover here. We will hence only give examples of relevant works and will try to be exhaustive only concerning papers that are very closely related to our work (i.e. papers using message-passing algorithms or analyzing the same phase transitions in the same scaling limits).

The dictionary learning problem was identified in the work of Olshausen et al. (1996); Olshausen & Field (1997) in the context of image representation in the visual cortex, and the problem was studied extensively since Kreutz-Delgado et al. (2003). Learning of overcomplete dictionaries for sparse representations of data has many applications, see e.g. Lewicki & Sejnowski (2000). MOD Engan et al. (1999) and K-SVD Aharon et al. (2006) are two representative algorithms for the dictionary learning. Several authors studied the identifiability of the dictionary under various (in general weaker than ours) assumptions, e.g. Michal Aharon (2006); Vainsencher et al. (2011); Jenatton et al. (2012); Spielman et al. (2013); Arora et al. (2013); Agarwal et al. (2013); Gribonval et al. (2013). An interesting view on the place of sparse and redundant representations in todays signal processing is given in Elad (2012).

A nice overview of recent progress on low-rank matrix factorization from incomplete observations is given in Davenport & Romberg (2015). The two closely related problems of sparse principal component analysis or blind source separation is also explored in a number of works, see e.g. Lee et al. (1999); Zibulevsky & Pearlmutter (2001); Bofill & Zibulevsky (2001); Georgiev et al. (2005); d’Aspremont et al. (2007). A short survey on the topic with relevant references can be found in Gribonval et al. (2006). Matrix completion is another problem that belongs to the class treated in this paper. Again, many important works were devoted to this problem giving theoretical guarantees, algorithm and applications, see e.g. Candès & Recht (2009); Candès & Tao (2010); Candes & Plan (2010); Keshavan et al. (2010). Another related problem is the robust principal component analysis that was also studied by many authors; algorithms and theoretical limits were analyzed in Chandrasekaran et al. (2009); Yuan & Yang (2009); Wright et al. (2009); Candès et al. (2011).

Our work differs from the mainstream of existing literature in its general positioning. Let us mention here some of the main differences with most other works.

Most existing works concentrate on finding theoretical guarantees and algorithms that work in the worst possible case of matrices FF and XX. In our work we analyze the typical cases when elements of FF and XX are generated at random. Arguably a worst case analysis is useful for practical applications in that it provides some guarantee. On the other hand, in some large-size applications one can be confronted in practice with a situation which is closer to the typical case that we study here. Our typical-case analysis provides results that are much tighter than those usually obtained in literature in terms of both achievability and computational tractability. For instance our results imply much smaller sample complexity, or much smaller mean-squared error for a given signal-to-noise ratio, etc.

Our main contribution is the asymptotic phase diagram of Bayes-optimal inference in matrix factorization. Special cases of our result cover important problems in signal processing and machine learning. In the present work we do not concentrate on the validation of the associated approximate message-passing algorithm on practical examples, nor on its comparison with existing algorithms. A contribution in this direction can be found in Krzakala et al. (2013); Parker et al. (2014).

A large part of existing machine-learning and signal-processing literature provides theorems. Our work is based on statistical physics methods that are conjectured to give exact results. While many results obtained with these methods on a variety of problems have been indeed proven later on, a general proof that these methods are rigorous is not known yet.

Many existing algorithms are based on convex relaxations of the corresponding problems. In this paper we analyze the Bayes-optimal setting. Not surprisingly, this setting gives a much better performance than convex relaxation. The reason why it is less studied is that it is often considered as hopelessly complicated from the algorithmic perspective; however some recent results using message-passing algorithms which stand at the core of our analysis have shown very good performance in the Bayes-optimal problem. Besides, the Bayes-optimal offers an optimal benchmark for testing algorithmic methods.

When treating low-rank matrices we consider the rank to be a finite fraction of the total dimension, whereas most of existing literature considers the rank to be a finite constant. The AMP algorithm derived in this paper can of course be used as a heuristics also in the constant rank case, but our results about MMSE are exact only in the limit where the rank is a finite fraction of the total dimension. Also note that for the special case of rank one the AMP algorithm was derived and analysed rigorously Rangan & Fletcher (2012); Deshpande & Montanari (2014).

When treating sparsity we consider the number of non-zeros is a finite fraction of the total dimension, whereas existing literature often considers a constant number of non-zeros. Again the AMP algorithm derived in this paper can of course be used as a heuristics also in the constant number of non-zeros case, but our results about MMSE are exact only in the limit where the density is a finite fraction of the total dimension.

The paper is organized as follows: First, Sec. II, we explain the statistical physics background for the Bayes-optimal inference. In Sec. III we give a detailed derivation of the approximate message passing algorithm for the matrix factorization problem. This algorithm is equivalent to maximization of the Bethe free entropy, whose expression is discussed in Sec. IV. These first two sections thus state our algorithmic approach to matrix factorization. The asymptotic performance of the AMP algorithm and the Bayes-optimal MMSE are analyzed in Sec. V using two technics: i) the state evolution of the AMP algorithm and ii) the replica method. As usual, the two methods are found to agree and to give the same predictions. We then use these results in order to study some exemples of matrix factorization problems in Sec. VI. In particular we derive the phase diagram, the MMSE and the sample complexity of dictionary learning, blind matrix calibration, sparse PCA, blind source separation, low rank matrix completion, robust PCA and factor analysis. Our results are summarized and discussed in the conclusion in Sec. VII.

II Elements of the background from statistical physics

There are important identities that hold for the Bayes-optimal inference and that simplify many of the calculations that follow. In the physics of disordered systems these identities are known as the Nishimori identities Opper & Haussler (1991); Iba (1999); Nishimori (2001). Basically, they follow from the fact that the true signal F0F^{0}, X0X^{0} is an equilibrium configuration with respect to the Boltzmann measure P(F,XY)P(F,X|Y) (5). Hence many properties of the true signal F0F^{0}, X0X^{0} can be computed by using averages over the distribution P(F,XY)P(F,X|Y) even if one does not know F0,X0F^{0},X^{0} precisely (5).

In order to derive the Nishimori identities, we need to define three types of averages: the thermodynamic average, the double thermodynamic average, and the disorder average.

Consider a function A(F,X)A(F,X) depending on a “trial” configuration FF and XX. We define the “thermodynamic average” of AA given YY as:

where P(F,XY)P(F,X|Y) is given by Eq. (5). The thermodynamic average A(F,X)F,XY\langle A(F,X)\rangle_{F,X|Y} is a function of YY.

Similarly, for a function A(F1,X1,F2,X2)A(F_{1},X_{1},F_{2},X_{2}) that depends on two trial configurations X1X_{1}, F1F_{1} and X2X_{2}, F2F_{2}, we define the “double thermodynamic average” of AA given YY as:

For a function BB that depends on the measurement YY and on the true signal F0,X0F^{0},X^{0}, we define the “disorder average” as

Note that if the quantity BB depends only on YY, then we have

This is simply because the partition function Z(Y){\cal Z}(Y) is Z(Y)=dX0dF0PX(X0)PF(F0)Pout(YF0X0){\cal Z}(Y)=\int{\rm d}X^{0}\,{\rm d}F^{0}\,P_{X}(X^{0})\,P_{F}(F^{0})\,P_{\rm out}(Y|F^{0}X^{0}).

Let us now derive the Nishimori identities. We consider a function A(F,X,F0,X0)A(F,X,F^{0},X^{0}) that depends on the trial configuration F,XF,X and on the true signal F0,X0F^{0},X^{0}. Its thermodynamic average A(F,X,F0,X0)F,XY\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y} is a function of the measurement YY and the true signal F0,X0F^{0},X^{0}. The disorder average of this quantity can be written as

In this last expression, renaming F,XF,X to F1,X1F_{1},X_{1} and F0,X0F^{0},X^{0} to F2,X2F_{2},X_{2}, we see that the average over F,X,F0,X0F,X,F^{0},X^{0} is nothing but the double thermodynamic average:

where in the last step we have used the form (35) of the disorder average.

is the general form of the Nishimori identity. Written in this way, it holds for many inference problems where the model for signal generation is known.

Self-averaging is a property that is assumed to hold for many quantities in statistical physics. Self-averaging means that in the thermodynamic limit (as introduced in the first paragraph of section I.3) the distribution of A(F,X,F0,X0)F,XY\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y} concentrates with respect to the realization of disorder Y,F0,X0Y,F^{0},X^{0}. In other words for large system sizes the quantity A(F,X,F0,X0)F,XY\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y} converges with probability one to its average over disorder, [A(F,X,F0,X0)F,XY]F0,X0,Y[\langle A(F,X,F^{0},X^{0})\rangle_{F,X|Y}]_{F^{0},X^{0},Y}. Self-averaging implies the existence of the thermodynamic limit and is in general very challenging to prove rigorously. Self-averaging makes the Nishimori identity very useful in practice.

To give one particularly useful example, let us define mX=(1/(PN))ilXil0Xilm_{X}=(1/(PN))\,\sum_{il}X^{0}_{il}X_{il} and qX=(1/(PN))ilXil1Xil2q_{X}=(1/(PN))\,\sum_{il}X^{1}_{il}X^{2}_{il}. The Nishimori identity states that [mXF,XY]F0,X0,Y=[qX]F1,X1,F2,X2Y]Y[\langle m_{X}\rangle_{F,X|Y}]_{F^{0},X^{0},Y}=[\langle\langle q_{X}\rangle\rangle]_{F_{1},X_{1},F_{2},X_{2}|Y}]_{Y}. Assuming that the quantities mXm_{X} and qXq_{X} are self-averaging, we obtain in the thermodynamic limit, for almost all YY: mXF,XY=qXF1,X1,F2,X2Y\langle m_{X}\rangle_{F,X|Y}=\langle\langle q_{X}\rangle\rangle_{F_{1},X_{1},F_{2},X_{2}|Y}. Explicitly, this gives:

where oN(1)o_{N}(1) is going to zero as NN\to\infty. This is a remarkable identity concerning the mean of XilX_{il} with the posterior distribution νil(Xil)\nu_{il}(X_{il}). The left-hand side measures the overlap between this mean and the sought true value Xil0X^{0}_{il}. The right-hand side measures the self overlap of the mean, which can be estimated without any knowledge of the true value Xil0X^{0}_{il}, by generating two independent samples from P(F,XY)P(F,X|Y).

By symmetry, all these examples apply also to averages and functions of the matrix FF:

Another example of a Nishimori identity used in our derivation is eq. (108). Note that this one is crucial to maintain certain variance-related variables strictly positive as explained in section III.3. Without eq. (108) the AMP algorithm may have pathological numerical behavior causing that some by definition strictly positive quantities become negative. Deeper understanding of this problem in cases where Nishimori identities cannot be imposed is under investigation.

Nishimori identities also greatly simplify the state evolution equations of section V.1.1. In general state evolution involves nine coupled equations. However, after imposing Nishimori identities, eqs. (187-189), we are left with only three coupled equations.

II.2 Easy/hard and possible/impossible phase transitions

Fixed points of equations (12-14) allow us to evaluate the MMSE of the matrix FF and of the matrix XX. We investigate two fixed points: The one that is reached from the “uninformative” initialization (18), and the fixed point that is reached with the informative (planted) initialization (19). If these two initializations lead to a different fixed point it is the one with the largest value of the Bethe free entropy (I.3) that corresponds to the Bayes-optimal MMSE. The reason for this is that the Bethe free entropy is the exponent of the normalization constant in the posterior probability distribution: a larger Bethe free entropy is hence related to a fixed point that corresponds to an exponentially larger probability. Furthermore, in cases where the uninformative initialization does not lead to this Bayes-optimal MMSE we anticipate that the optimal solution is hard to reach for a large class of algorithms.

Depending on the application in question and the value of parameters (α,π,ρ,ϵ,\alpha,\pi,\rho,\epsilon,\dots) we can sometimes identify a phase transition, i.e. a sharp change in behavior of the MMSE. As in statistical physics it is instrumental to distinguish two kinds of phase transitions

Second order phase transition: In this case there are two regions of parameters. In one, the recovery performance is poor (positive MMSE); in the other one the recovery is perfect (zero MMSE). This situation can arrive only in zero noise, with positive noise there is a smooth “crossover” and the transition between the phase of good and poor recovery is not sharply defined. Interestingly, as we will see, in all examples where we observed the second order phase transition, its location coincides with the simple counting bounds discussed in Section I.4. In the phase with poor MMSE, there is simply not enough information in the data to recover the original signal (independently of the recovery algorithm or its computational complexity). Experience from statistical physics tells us that in problems where such a second order phase transition happens, and more generally in cases where the state evolution (12-14) has a unique fixed point, there is no fundamental barrier that would prevent good algorithms to attain the MMSE. And in particular our analysis suggest that in the limit of large system sizes the AMP algorithm derived in this paper should be able to achieve this Bayes-optimal MMSE.

First order phase transition: A more subtle phenomenology can be observed in the low-noise regime of dictionary learning, sparse PCA and blind matrix calibration. In some region of parameters, that we call the “spinodal” region, the informative (planted) and uninformative initializations do not lead to the same fixed point of the state evolution equations. The spinodal regime itself may divide into two parts - the one where the uninformative fixed point has a larger free entropy, and the “solvable-but-hard” phase where the informative fixed point has a larger free entropy. The boundary between these two parts is the first order phase transition point. We anticipate that in the solvable-but-hard phase the Bayes-optimal MMSE is not achievable by the AMP algorithm nor by a large class of other known algorithms. The first order phase transition is associated with a discontinuity in the MMSE. The MSE reached from the uninformative initializations will be denoted AMP-MSE and is also discontinuous.

In the case of the first order phase transition it will hence be useful to distinguish in our notations between the minimal achievable MSE that we denote MMSE, and the MSE achievable by the AMP-like algorithms that we denote AMP-MSE. When MMSE==AMP-MSE in the large NN limit we say that AMP is asymptotically optimal. The region where in the large size limit MMSE<<AMP-MSE is called the spinodal regime/region.

II.3 Why our results are not rigorous? Why do we conjecture that they are exact?

In this section we aim at summarizing the assumptions that we make but do not prove along the derivation of our results presented in sections III, IV, V. In section VI we then give the MMSE and AMP-MSE as obtained from numerical evaluation of eqs. (12-14).

Let us first state the general strategy from a physics point of view. We use two complementary approaches, the cavity method and the replica method.

Our main goal is to derive the MMSE for the matrix factorization problem in the Bayes-optimal setting. Our main tool is the cavity method, and it presents two advantages with respect to the replica method: 1) during the derivation of the MMSE we obtain the AMP algorithm as a side-product and 2) based on experience of the last three decades, it is likely that a rigorous proof of our conjectures will follow very closely this path. In the cavity method we first assume that there is a fixed point of the belief propagation equations that correctly describes the posterior probability distribution, and that the BP equations (initialized randomly, or in the case where a first order phase transition is present, initialized close to the planted configuration) converge to it. Then we are interested in the average evolution of the BP iterations. This can be understood through a statistical analysis of the BP equations in which we keep only the leading terms in the large NN limit, an approach called the cavity method Mézard et al. (1987) in the physics literature. The result of this analysis are the state evolution equations that should be asymptotically exact, they do not depend on the size of the system NN anymore. There is one known way by which the assumptions made in this derivation can fail, this way is called replica-symmetry-breaking (RSB). Fortunately, in the Bayes optimal inference, RSB is excluded, as we explain below.

Our second analysis uses the replica method. In the replica approach one computes the average of the nn-th power of the normalization of the posterior probability distribution. Then one uses the limit n0n\to 0 to obtain the average of its logarithm. Doing this, one needs to evaluate a saddle point of a certain potential and one assumes that this saddle point is restricted to a certain class that is called “replica symmetric”, eq. (227). Again this assumption is justified in the Bayes optimal inference, because we know that there is no RSB in this case.

Our two analyses, using the cavity method and the replica method, lead to the same set of closed equations for the MMSE of the problem. This important crosscheck, and the very reasonable nature of the assumptions described below (based on our experience in using this kind of approach in various settings) lead us to conjecture that the results described in this paper are exact.

II.3.2 What are the assumptions?

Let us start with the cavity method, which is in general much more amenable to a rigorous analysis. Let us hence describe the assumptions made in section III.

Our main assumption is that in the leading order in NN the true marginal probabilities can be obtained from a fixed point of the belief propagation equations (41)-(44) as written for the factor graph in Fig. 1. For this to be possible the incoming messages on the right hand side of eqs. (41)-(44) must be conditionally independent (in the thermodynamic limit, as defined is section I.3 ), in which case we can write the joint probabilities as products over the messages. If the factor graph was a tree this assumption would be obviously correct. On factor graphs that are locally tree-like these assumptions can be justified (often also rigorously) by showing that the correlations decay fast enough (see for instance Mézard & Montanari (2009)). The factor graph in Fig. 1 is far from a tree. However, from the point of view of conditional independence between messages, this factor graph resembles the one of compressed sensing for which the corresponding results were proven rigorously Bayati & Montanari (2011); Donoho et al. (2012). It is hence reasonable to expect that matrix factorization belongs to the same class of problems where BP is asymptotically exact in the above sense.

We assume that the iterations of equations (41)-(44) converge for very large systems (NN\to\infty) to this fixed point that describes the true marginal probabilities, provided that the iteration is started from the right initial condition (in the presence of a first order phase transition we need to consider initialization of the iteration in the planted configuration in order to reach the right fixed point).

Given these assumptions the rest of section III is justified by the fact that in the derivation we only neglect terms that do not contribute to the final MSE in the thermodynamic limit.

The main assumptions made in the replica calculation is that “self-averaging” applies (this is a statement on the concentration of the measure, which basically assumes that the averaged properties are the same as the properties of a generic single instance of the problem) and that we can exchange the limits limn0\lim_{n\to 0} and limN\lim_{N\to\infty}. On top of these, it relies on the replica symmetric calculation, which is justified here because we are interested in evaluating Bayes optimal MMSE. Unfortunately in the mathematical literature there is so far very little progress in proving these basic assumptions of the replica method, even in problems that are much easier than matrix factorization. It can be remembered that the original motivation for introducing the cavity method Mézard et al. (1986) was precisely to provide an alternative way to the replica computations, which could be stated in clear mathematical terms.

In this paper we follow the physics strategy and we hence provide a conjecture for evaluating the MMSE in matrix factorization. We do anticipate that further work using the path of the cavity approach will eventually provide a proof of our conjecture. Let us mention that rigorous results exist for closely related problems, namely the problem of compressed sensing Bayati & Montanari (2011); Donoho et al. (2012), and also rank-one matrix factorization Rangan & Fletcher (2012); Deshpande & Montanari (2014). None of these apply directly to our case where the rank is extensive and where the matrix FF is not known. A non-trivial generalization of these works will be needed in order to prove our results.

II.3.3 Cavity and replica method for other problems

The cavity and replica methods, as we use them use in the present paper, rely on the set of assumptions listed above. It is worth to note that over the years they have been applied successfully to a wide range of problems. There is nowadays a range of results that can be “derived” (and in many of those cases that was the original derivation) using the cavity or/and the replica methods and that have been proven fully rigorously. The list is long, but to cite a couple of interesting examples we can mention the proof of the satisfiability threshold conjecture Ding et al. (2014), performance of low-density parity check codes Richardson & Urbanke (2008), the optimal solution of the linear assignment problem Aldous (2001), the analysis of the compressed sensing problem Donoho et al. (2009); Bayati & Montanari (2011), and many more.

II.4 Absence of replica symmetry breaking in Bayes-optimal inference

The study of the Bayes-optimal inference, i.e. the case when the prior probability distribution matches the true distribution with which the data have been generated, is fundamentally simpler than the general case. The fundamental reason for this simplicity is that, in the Bayes-optimal case, a major complication referred to as static replica symmetry breaking (RSB) in statistical physics literature does not appear Nishimori (2001). RSB is a source of computational complications in many optimization and constraint satisfaction problems such as the random K-satisfiability Mézard et al. (2002) or models of spin glasses Mézard et al. (1987).

One common way to define RSB is via the overlap distribution P(q)P(q) Nishimori (2001); Mézard et al. (1987); Mézard & Montanari (2009). We define an overlap between two configurations X1X^{1} and X2X^{2} as qX(X1,X2)=(1/(PN))ilXil1Xil2q_{X}(X^{1},X^{2})=(1/(PN))\sum_{il}X^{1}_{il}X^{2}_{il}. The overlap distribution function is then defined (for a given realization of the problem X0,F0,YX^{0},F^{0},Y) as the probability distribution of qX(X1,X2)q_{X}(X^{1},X^{2}) over the posterior measure P(X,FY)P(X,F|Y). For problems that do not have any permutational symmetry (such as the permutation of the NN rows of XX and columns of FF), we say the system is replica symmetric when limNP(q)=δ(qq0)\lim_{N\to\infty}P(q)=\delta(q-q^{0}) for almost all choices of X0,F0,YX^{0},F^{0},Y, and that there is replica symmetry breaking otherwise. The number of steps of RSB then correspond to the number of additional delta peaks in the limiting distribution P(q)P(q). When the limiting distribution P(q)P(q) has a continuous support then we talk about full-RSB. For systems with permutational symmetry the corresponding number of peaks gets simply multiplied by the size of the symmetry group.

Let us further define a so-called magnetization of configuration X1X^{1} with respect to some reference configuration X0X^{0} as mX(X0,X1)=(1/(PN))ilXil1Xil2m_{X}(X^{0},X^{1})=(1/(PN))\sum_{il}X^{1}_{il}X^{2}_{il}. Such a magnetization and its variance can be computed from first and second derivative of the free entropy density. The derivative is taken with respect to a so-called magnetic field which is an auxiliary parameter introduced in the posterior probability for this purpose. In statistical physics, we expect (and in some case, actually prove) some standard properties of the free entropy such as its self-averaging and its analyticity (except at phase transition points) with respect to such parameters. From this reasoning it follows that whereas the mean of the magnetization is of order O(1)O(1), its variance is of order O(1/N)O(1/N). Hence in the limit NN\to\infty the distribution (over P(X,FY)P(X,F|Y)) of magnetization is a delta peak. If the reference configuration X0X^{0} is the planted configuration, then self-averaging implies that the distribution of mX(X0,X1)m_{X}(X^{0},X^{1}) with respect to both its arguments converges to a delta function. From the Nishimori identities it follows that the same has to be true for the distribution of the overlaps P(q)P(q). Therefore only a P(q)P(q) converging to a delta function in the thermodynamic limit is allowed in the Bayes optimal inference setting. In a more technical side, it also follows from the Nishimori identities that the two-point correlations between variables decay sufficiently fast as proven in Montanari (2008), this excludes the existence of the static RSB for the posterior measure P(F,XY)P(F,X|Y).

II.5 Learning of hyper-parameters

To obtain the main results of this paper we assume that the priors PXP_{X} and PFP_{F} are known, including their related parameters. In practice, even if one knows a parametric family of probability distributions that approximates well the matrix elements of FF and XX, one often does not have the full information about the parameters of this distribution (typically its mean and variance). In the same manner, one might know the nature of the measurement noise, but without a precise knowledge of its amplitude. In this subsection we remark that learning of the parameters can be rather straightforwardly included in the present algorithmic approach. However, its detail implementation and analysis, as well as the analysis of the deterioration of performance when the prior distributions are not known, is left for future work.

In order to learn the hyper-parameters, a common technique in statistical inference is the expectation maximization Dempster et al. (1977), where one updates iteratively the hyper-parameters in order to maximize the posterior likelihood, i.e. the normalization Z(Y){\cal Z}(Y) in the posterior probability measure. In the context of approximate message passing the expectation maximization has been derived and implemented e.g. in Krzakala et al. (2012); Vila & Schniter (2011). It turns out that the expectation maximization update of hyper-parameters is analogous to the update where one imposes the Nishimori identities as was done for compressed sensing in Krzakala et al. (2012). In other words one updates the hyper-parameters in such a way that the Nishimori identities are satisfied after every iteration. For instance, the mean and variance of the distribution PXP_{X} is updated to correspond to the empirical mean and variance of the estimators of elements XilX_{il}. The variance of the noise in a Gaussian output channel is updated to be the same as the squared difference between the observed matrix YY and the estimator of the product FXFX. Generically, for mismatched priors the Nishimori identities are not satisfied. At the same time the various hyper-parameters appear explicitly in these identities. Therefore, one way of performing learning of parameters is iteratively setting new values of the hyper-parameters in order to satisfy the Nishimori identities. This can be straightforwardly implemented within the GAMP code. In compressed sensing following this strategy is equivalent to an expectation maximization learning algorithm Krzakala et al. (2012).

It is interesting to note that, as the Bayes-optimal setting has several nice properties in terms of simplicity of the analytical approach and in terms of convergence of the message-passing algorithms, bringing the iterations back on the Nishimori line by doing expectation maximization improves quite generically the convergence of the algorithm. In our opinion this is one of the properties observed in Parker et al. (2013, 2014).

III Approximate message passing for matrix factorization

Of course, the factor graph of matrix factorization, shown in Fig. 1, is extremely far from a tree. The above BP equations can, however, still be asymptotically exact if the dependence between the incoming messages is negligible in the leading order in NN. This indeed happens in compressed sensing (where the matrix FF is a known matrix, generated randomly with zero mean), as follows from the rigorously proven success of approximate message passing Donoho et al. (2009); Bayati & Montanari (2011). In the present case of matrix factorization, we do not have any rigorous proof yet, but based on our experience from studies of mean field spin glass systems Mézard et al. (1987); Mézard & Montanari (2009), we conjecture that the fixed points of the above belief propagation equations describe asymptotically exactly (in the same sense as for compressed sensing) the performance of Bayes-optimal inference. Hence the analysis of the fixed points of the above equations leads to the understanding of information-theoretic limitations for matrix factorization. The associated phase transitions describe possible algorithmic barriers. This analysis is the main goal and result of the present paper.

The above BP iterative equations are written for probability distributions over real values variables and the 2N12N-1-uple integrals from the r.h.s. are mathematically intractable in this form. We now define means and variances of the variable-to-factor messages as

Notice that the factors N\sqrt{N} in the definition of rr, and NN in the definition of ss, have been introduced in order to ensure that all the messages a,c,r,sa,c,r,s are of order O(1)O(1) in the thermodynamic limit.

Using this scaling, we shall now show that the BP equations can be simplified in the thermodynamic limit, and that they can actually be written as a closed set of equations involving only messages a,c,r,sa,c,r,s. Our general aim is to design an algorithm which in some region of parameters will asymptotically match the performance of the exact (computationally intractable) Bayes-optimal inference. Belief propagation provides such an algorithm, but in order to make it computationally efficient, writing it in terms of the messages a,c,r,sa,c,r,s is crucial, provided one is careful not to loose any terms in the asymptotic analysis of the thermodynamic limit to leading order.

Let us define the Fourrier transform of the output function

Next we perform the integral over variable ξ\xi which is simply a Gaussian integral. This gives for the message

where we introduced auxiliary variables that are both of order O(1)O(1)

The last integral to be performed in (52) is the one over the matrix element FμiF_{\mu i}. Using again the fact that Fμi=O(1/N)F_{\mu i}=O(1/\sqrt{N}), we expand the exponential in which FμiF_{\mu i} appears to second order and perform the integration to obtain

Following the notation of Rangan (2011) we now define the output-function as

The following useful identity holds for the average of (zω)2/V2(z-\omega)^{2}/V^{2} in the above measure

At this point we follow closely the derivation of AMP from Krzakala et al. (2012) and define the probability distributions

where Z^X(Σ,T)\hat{Z}_{X}(\Sigma,T), and Z^F(Z,W)\hat{Z}_{F}(Z,W) are normalizations. We define the average and variance of MX{\cal M}_{X} and MF{\cal M}_{F} as

These are the input auxiliary function of Rangan (2011). It is instrumental to notice that

With these definition we obtain from (41,42) using (58,61) that

It is clear from the above expressions that all the messages a,c,r,sa,c,r,s and A,C,R,SA,C,R,S scale as O(1)O(1) in the thermodynamic limit. For instance, as AA are positive, the quantity ν(μ)Aνlilt\sum_{\nu(\neq\mu)}A^{t}_{\nu l\to il} is O(1)O(1). On the other hand, the message f^μiμl(t){\hat{f}}_{\mu i\to\mu l}(t) is an estimate of NFμi\sqrt{N}F_{\mu i}; this estimate is O(1)O(1), but a sum like (1/N)ν(μ)f^νiνl(t)(1/\sqrt{N})\sum_{\nu(\neq\mu)}{\hat{f}}_{\nu i\to\nu l}(t) is an estimate of νFνi\sum_{\nu}F_{\nu i}. As PFP_{F} has mean variance of order O(1/N)O(1/N), this sum is actually of O(1)O(1). The same argument suggests that (1/N)ν(μ)Bνlilt(1/\sqrt{N})\sum_{\nu(\neq\mu)}B^{t}_{\nu l\to il} is O(1)O(1). Recalling eqs. (59,60) and (62,63), we have derived that in the thermodynamic limit the general belief propagation equations simplify into a closed set of equations in the messages which are the means and variances a,r,c,sa,r,c,s defined in (45-48). To iterate this message passing algorithm we initialize as

then we compute VμiltV^{t}_{\mu il} and ωμilt\omega^{t}_{\mu il} from (53-54), then we compute BtB^{t}, AtA^{t}, RtR^{t} and StS^{t} according to (59-60) and (62-63) using definition of goutg_{\rm out} (56). Finally we update the messages according to (70-73) and iterate. Notice, however, that we work with O(N3)O(N^{3}) messages, each of them takes NN steps to update, and hence the computational complexity of this algorithm is relatively high. In the next section we will write a simplification that reduces this complexity.

From the fixed point of the belief propagation equations one can also compute the approximated marginal probabilities of the posterior, defined as

One again defines the mean and variance of these two messages, x^il(t+1){\hat{x}}_{il}(t+1), cil(t+1)c_{il}(t+1), f^μi(t+1){\hat{f}}_{\mu i}(t+1) and sμi(t+1)s_{\mu i}(t+1) analogously to (80-83):

III.2 GAMP for matrix factorization

The message-passing form the AMP algorithm for matrix factorization derived in the previous section uses 2NMP2NMP messages, one between each variable component (il)(il) and (μi)(\mu i) and each measurement (μl)(\mu l), in each iteration. In fact, exploiting again the simplifications which take place in the thermodynamic limit, always within the assumption that the elements of the matrix FF scale as O(1/N)O(1/\sqrt{N}), it is possible to rewrite and close the BP equations in terms of only 2N(P+M)2N(P+M) messages. In statistical physics terms, the resulting equations correspond to the Thouless-Anderson-Palmer equations (TAP) Thouless et al. (1977) used in the study of spin glasses. In the thermodynamic limit, these are asymptotically equivalent to the BP equations. Going from BP to TAP is, in the compressed sensing literature, the step to go from the rBP Rangan (2010) to the AMP Donoho et al. (2009) algorithm. Let us now show how to take this step for the present problem of matrix factorization.

In the thermodynamic limit, it is clear from (70-73) that the messages x^ilμl{\hat{x}}_{il\to\mu l}, vilμlv_{il\to\mu l} and f^μiμl{\hat{f}}_{\mu i\to\mu l}, sμiμls_{\mu i\to\mu l} are nearly independent of μl\mu l. For instance in the equation giving x^ilμl{\hat{x}}_{il\to\mu l}, the only dependence on μ\mu is through the fact that the sum over ν\nu avoids the value ν=μ\nu=\mu. But this is one term in MM, and therefore one might expect that this term is negligible. However, one must be careful when these small terms are summed over and their sum might be of the leading order in NN. Such terms are called in spin-glass theory the “Onsager reaction terms”. In the following we derive these Onsager terms.

Let us define the following variables all of order O(1)O(1) on which we will close the equations

To keep track of all the Onsager terms that will influence the leading order of the final equations we notice that

From these expansions we obtain the GAMP algorithm for matrix factorization

In order to compute ωt=0\omega^{t=0}, Tt=0T^{t=0} and Wt=0W^{t=0} use the above equations as if f^μi(1)=0{\hat{f}}_{\mu i}(-1)=0 and x^il(1)=0{\hat{x}}_{il}(-1)=0.

The interpretation of the terms in the GAMP for matrix factorization is the following: ωμlt\omega_{\mu l}^{t} is the mean of the current estimate of zμl=iFμiXilz_{\mu l}=\sum_{i}F_{\mu i}X_{il} and VμltV^{t}_{\mu l} is the variance of that estimate; TilT_{il} and Σil\Sigma_{il} is the mean and variance of the current estimate of XilX_{il} without taking into account the prior information of XilX_{il}; the parameters x^il{\hat{x}}_{il} and cilc_{il} are then the mean and variance of the current estimate of XilX_{il} with the prior information taken into account. Analogously for WμiW_{\mu i} and ZμiZ_{\mu i} being the mean and variance of the estimate for FμiF_{\mu i} before the prior is taken into account, and f^μi{\hat{f}}_{\mu i} with sμis_{\mu i} are the mean and variance once the prior information was accounted for.

A reader familiar with the AMP and GAMP algorithm for compressed sensing Donoho et al. (2009); Rangan (2011); Krzakala et al. (2012) will recognize that the above equations indeed reduce to the compressed sensing GAMP of Rangan (2011) when one sets f^μi(t)=NFμi{\hat{f}}_{\mu i}(t)=\sqrt{N}F_{\mu i} and sμi=0s_{\mu i}=0.

The above algorithm is closely related to the BiG-AMP of Parker et al. (2013). There are however three differences between our algorithm and BiG-AMP:

We find a sμis_{\mu i}-dependent term in the expression (98) for Σil\Sigma_{il} which is not present in BiG-AMP.

Similarly, we find a cilc_{il}-dependent term in the expression (100) for ZμiZ_{\mu i} which is not present in BiG-AMP.

Considering the last point, the fact of having different time indices during the iterations does not influence the fixed points, in which we are mainly interested. However, the use of correct time indices is crucial for the assumptions leading to the density evolution of this algorithm (that we derive in section V.1) to hold.

As for the missing terms in the BiG-AMP expressions of Parker et al. (2013) for Σil\Sigma_{il} and ZμiZ_{\mu i}, they have a more serious effect as they can change the fixed point. To the best of our understanding, these terms have been neglected in Parker et al. (2013), while they should be kept. It seems to us that some of the leading order terms are missing in eqs. (15-16) from Parker et al. (2013).

III.3 Simplifications due to the Nishimori identities

In the previous section we derived the GAMP algorithm for matrix factorization, eqs. (96-103). This algorithm can in principle be used for any set of matrices FF and XX. If iterated in the form derived in Section III.2 it often shows problems of convergence. There are ways to slightly improve the convergence of the above algorithm in a wide range of applications by a number of empirical methods suggested in Parker et al. (2014).

We will focus on the particular case when matrices XX and FF were indeed generated from the separable probability distributions PF(F)P_{F}(F) and PX(X)P_{X}(X) described in eqs. (3-4), and the output yy was generated by the assumed model.

In this case the belief propagation is a proxy for the optimal Bayes inference algorithms and a number of properties described in section I.2 hold. In analogy with fundamental works on spin glasses Nishimori (2001); Iba (1999) we called these properties the Nishimori identities. The setting where conditions (106) hold will be called the Bayes-optimal setting.

The Nishimori identities hold and the system is on the Nishimori line when one is using the correct priors on FF and XX and the right output channel in the reconstruction process, i.e. when conditions (106) hold. In the limit NN\to\infty and thanks to self-averaging we then have on the Nishimori line at every iteration step tt

The meaning of this identity is that the mean squared error of the current estimate of Z=FXZ=FX computed from the current estimates of variances VμltV_{\mu l}^{t} is equal to the mean squared difference between the true ZZ and its current estimate ωμlt\omega_{\mu l}^{t}. Using the above expression and eq. (57) we obtain an identity

The above identity holds also if the sum is only over μ\mu or only over ll. Finally using the conditional independence assumed in BP between the incoming messages we get also

Under this condition we can simplify considerably the expressions for Σilt\Sigma_{il}^{t} and ZμitZ_{\mu i}^{t} and get

Note that the r.h.s. of the two above equations is always strictly positive, which is reassuring given these expressions play the role of a variance of a probability distribution. Note also that the BiG-AMP algorithm of Parker et al. (2013) uses expressions (110,111) instead of (98,100), however, without mentioning the reason.

III.4 Simplification for matrices with random entries

Relying on the definitions of order parameters (161-162) and using part of the results on section V.1 we can write a version of the GAMP for matrix factorization that is in the leading order in NN equivalent to (96-103) for matrices XX and FF having iid elements.

Let us define the analog of χ^t\hat{\chi}^{t} and q^t\hat{q}^{t} (168-169) as empirical means of the corresponding functions

Anticipating the reasoning that we shall use later in section V.1, we realize that in the leading order quantities ViltV^{t}_{il}, Σilt\Sigma^{t}_{il} and ZμitZ^{t}_{\mu i} do not depend on their indices i,l,μi,l,\mu. We have

These three equations can hence replace (96), (98) and (100) in GAMP. Furthermore, if we focus on the fixed point and hence disregard some of the time indices eqs. (97), (99) and (101) can be simplified as

The set of eqs. (102-103), (114), (119-122) was presented for the simple output channel with white noise in Krzakala et al. (2013). We detail this procedure for the generic output in Alg. 1.

We want to stress here that all these simplifications take place for any output channel Pout(yz)P_{\rm out}(y|z). In contrast with the “uniform variance” approximation of Parker et al. (2013) the above result does not mean that the variances cilc_{il} and sμis_{\mu i} are independent in the leading order on their indices. On the contrary, these variances depend on their indices even in the simplest case of GAMP when the matrix FμiF_{\mu i} is known, i.e. for the compressed sensing problem.

IV The Bethe free entropy

The fixed point of the belief propagation equations or its AMP version can be used to estimate the posterior likelihood, i.e. the normalization Z(Y){\cal Z}(Y) of the posterior probability (5). The logarithm of this normalization is called the Bethe free entropy in statistical physics Yedidia et al. (2003). Negative logarithm of the normalization is called the free energy, in physics there is usually a temperature associated to the free energy. Bethe free entropy is computed from the fixed point of the BP equations (41-44) as Yedidia et al. (2003); Mézard & Montanari (2009):

The derivatives of this expression for ΦBethe\Phi^{\rm Bethe} with respect to the messages give back the full BP equations of (41-44). In this general form, the computation of ΦBethe\Phi^{\rm Bethe} for the present problem is not of practical interest, and it is thus very useful to carry out the same steps that we did in Section III.1 in order to obtain a more mathematically tractable form of ΦBethe\Phi^{\rm Bethe} that is asymptotically equivalent to (123) in the thermodynamic limit, using the set of AMP message passing equations (53-54), (59-60), (62-63), and (70-73). The result is:

Finally we might want to express the free entropy using the fixed point of the GAMP eqs. (96-103). In order to do this we need to rewrite the last two terms in (129). Using an expansion in 1/N1/N and keeping the leading order terms we get

We remind that the above expressions give the posterior likelihood given a fixed point on the GAMP equations.

To write the final formula in a more easily interpretable form we use the probability distributions MF{\cal M}_{F} and MX{\cal M}_{X} defined in (64-65) with normalizations

The above expression evaluated at the fixed point of the AMP algorithm hence gives the Bethe approximation to the log-likelihood. It is mainly use to decide which fixed point of AMP is better. Indeed, there are cases where there exist more than one AMP fixed point and it is the one with the largest Bethe entropy that corresponds asymptotically to the optimal Bayesian inference.

Since the free entropy has a meaning only at the fixed point, we can transform it by using any of the fixed point identities verified by the BP messages. A well known property of Bethe free entropy and belief propagation Yedidia et al. (2003) is that the BP fixed points are stationary points of the Bethe free entropy. In this section we show that also for the AMP for matrix factorization the Bethe free entropy can be written in a form that allows to generate the fixed-point BP equations as a stationary point. It can be achieved by writing the Bethe free entropy eq. (139) as

In order to derive (140) from (139), we have substituted gout2g^{2}_{\rm out} by its fixed point expression, and imposed the values of the variance VV. Under the present form, the Bethe free entropy satisfies the following theorem:

(Bethe/AMP correspondance) The fixed point of the AMP equations eqs. (96-103) are the stationary points of the cost function ΦAMPBethe\Phi^{\rm Bethe}_{\rm AMP} eq. (140).

This can be checked explicitly by setting to zero the derivatives of ΦAMPBethe\Phi^{\rm Bethe}_{\rm AMP}. Indeed, the derivatives with respect to T,Σ,WT,\Sigma,W and ZZ yield

Then, the stationarity with respect to ω\omega can be expressed easily by noting that ωμzlogZμl=gout\frac{\partial}{\partial\omega_{\mu z}}\log Z^{\mu l}=g_{\rm out} (a consequence of Eq. (130):

which is nothing but the fixed point equation for ω\omega.

It is convenient to compute the derivative with respect to VV (even though this quantity is eventually a function of rr,ss,aa and cc) using VμzlogZμl=12(gout2+ωgout)\frac{\partial}{\partial V_{\mu z}}\log Z^{\mu l}=\frac{1}{2}\left(g_{\rm out}^{2}+\partial_{\omega}g_{\rm out}\right) so that at the fixed point, when eq. (145) is satisfied, we have

Using this equation, one can finally check explicitly that deriving with respect to a,c,ra,c,r and ss yields the remaning AMP equations for T,Σ,WT,\Sigma,W and ZZ. This concludes the proof. ∎

IV.2 The variational Bethe free entropy

We have shown that the fixed points of the approximate message passing equations are extrema of ΦAMPBethe\Phi^{\rm Bethe}_{\rm AMP}. However they are in general saddle points of this function, and it is very useful to derive an alternative “variational” free entropy, the maxima of which are the fixed points. In particular, this will allow us to find these fixed points by alternative methods which do not rely on iterating the equations as was done for compressed sensing in Krzakala et al. (2014). This variational free entropy can also be used not only at the maximum, but for each possible values of the parameters, as the current estimate of the quality of reconstruction. Such a property has been used to implement a so-called adaptive damping in compressed sensing Vila et al. (2014) and it can hence be anticipated that similar implementation trick will be useful for matrix factorization as well.

In order to derive the variational Bethe free entropy, we impose the fixed point conditions, and express the free entropy only as a function of the parameters of our trial distributions for the two matrices. Then, we simply have

where a,c,r,sa^{*},c^{*},r^{*},s^{*} are given in terms of the Eqs. (141-144) by: a=fX(Σil,Til)a^{*}=f_{X}(\Sigma_{il},T_{il}), c=fc(Σil,Til)c^{*}=f_{c}(\Sigma_{il},T_{il}), r=fF(Zμi,Wμi)r^{*}=f_{F}(Z_{\mu i},W_{\mu i}), s=fs(Zμi,Wμi)s^{*}=f_{s}(Z_{\mu i},W_{\mu i}), and ω\omega^{*} is the solution of (97).

In order to write this variational expression in a nicer form, let us notice that the Kullback-Leibler divergences between MX{\cal M}_{X}, MF{\cal M}_{F} (64-65) and the prior distribution are

where Zμl{\cal Z}^{\mu l} is given by (130). Then one has

with VV^{*} and ω\omega^{*} satisfying eqs. (114) and (119). Note that this expression has the same form as the one used in Rangan et al. (2013) for the simpler case of GAMP for compressed sensing and for the generalized linear problem. Our expression thus generalizes the formula of Rangan et al. (2013) to the bi-linear case.

IV.2.2 The AWGN output channel

In the case of the additive white Gaussian noise output channel (23) the function goutg_{\rm out} takes the simple form:

hence ωgout\partial_{\omega}g_{\rm out} does not depend on the variable ωμl\omega_{\mu l}. The only explicit dependence on ωμl\omega_{\mu l} in the free entropy is through eq. (130) which becomes for the AWGN output channel

The free entropy is defined only at the fixed point of the GAMP equations. Given a fixed point we can express from (97) for the AWGN channel

We plug this last expression into (155) to obtain

Simplifying the last two terms of eq. (153) we obtain for the AWGN channel:

The first three terms of this free entropy are clearly negative and the last term cannot be larger than MPlog(2πΔ)/2-MP\log{(2\pi\Delta)}/2. Hence the free entropy (158) is bounded from above. This is consistent with its interpretation as a variational expression. The stationary points of (158) are the fixed points of the GAMP algorithm and hence the fixed points corresponding to the maximum likelihood could also be found by direct maximization of the expression (158). This offers an interesting algorithmic alternative to the iterative AMP algorithm that was explored for the compressed sensing problem in Krzakala et al. (2014).

Another use of the expression (158) is that during the iteration of the GAMP algorithm its value should be increasing, hence we can adaptively choose the step-size of the iterations to ensure this increase. Such an adaptive dumping was implemented in Parker et al. (2014) using a different form of the free entropy that does not correspond to the Bethe free entropy but to the variational mean field (VMF) free entropy which reads

It is easy to check that ΦAWGNVMF(Til,Σil,Wμi,Zμi)<ΦAWGNBethe(Til,Σil,Wμi,Zμi)\Phi^{\rm VMF}_{\rm AWGN}(T_{il},\Sigma_{il},W_{\mu i},Z_{\mu i})<\Phi^{\rm Bethe}_{\rm AWGN}(T_{il},\Sigma_{il},W_{\mu i},Z_{\mu i}), which could be expected since the Bethe expression, which is asymptotically exact, should be a better approximation than the mean field approximation.

V Asymptotic analysis

In this section we derive the asymptotic (NN\to\infty) evolution of the GAMP iterations for matrix factorization. This asymptotic analysis holds as long as all the elements of the true matrix FF are iid random variables generated from a distribution PF0P_{F^{0}}, and all elements of the true matrix XX are iid random variables generated from a distribution PX0P_{X^{0}}. In general we will not assume PF0=PFP_{F^{0}}=P_{F} and PX0=PXP_{X^{0}}=P_{X}: this special case of Bayes-optimal analysis will be treated in the next section.

In the present section we will also distinguish between the true output channel characterized by the conditional probability distribution Pout0(yμlzμl0)P^{0}_{\rm out}(y_{\mu l}|z^{0}_{\mu l}) and the output channel that is being used in the GAMP algorithm Pout(yμlzμl)P_{\rm out}(y_{\mu l}|z_{\mu l}). We remind that zμl0=i=1NFμi0Xil0z^{0}_{\mu l}=\sum_{i=1}^{N}F^{0}_{\mu i}X^{0}_{il}, where Fμi0F^{0}_{\mu i} and Xil0X^{0}_{il} are the elements of the actual matrices that we do not know and aim to recover, and zμl=i=1NFμiXilz_{\mu l}=\sum_{i=1}^{N}F_{\mu i}X_{il}. Again the special case of Pout0=PoutP^{0}_{\rm out}=P_{\rm out} will be treated the next section.

We will assume that at least one of the probability distributions PF0P_{F^{0}} and PX0P_{X^{0}} (and also at least one of PFP_{F} and PXP_{X}) has zero mean, otherwise there would be additional terms in this asymptotic analysis, as e.g. in Caltagirone et al. (2014).

Let us first recall the definition of the order parameters; all of them are finite, of order O(1)O(1), in the thermodynamic limit:

Note also that the above sums over a pair of indices could also be sums over only one index (and adjusted normalization) and the order parameters would not change in the leading order: for instance, we expect that in the thermodynamic limit, 1Njx^jl(t)Xjl0\frac{1}{N}\sum_{j}{\hat{x}}_{jl}(t)X^{0}_{jl} will go to the same limit as mXtm^{t}_{X} defined in (160).

First let us compute the average over realizations of X0X^{0}, F0F^{0} and w0w^{0} of the quantity VμltV_{\mu l}^{t} defined by eq. (90). By the assumptions of the belief propagation equations (43-44), the terms in the product in (90) are statistically independent and we can hence write for the average, to the leading order

Further, we realize that the variance of this quantity (again over the realizations of X0X^{0}, F0F^{0} and w0w^{0}) is

In order to derive this result, we expand the square and obtain a double sum over ii and jj. Because of the conditional independence between incoming messages assumed in belief propagation, the terms with iji\neq j average exactly to zero. As for the terms with i=ji=j, they add up to a contribution of order O(1/N)O(1/N). From this we can conclude that, to leading order in the thermodynamic limit, the quantity Vμlt=VtV_{\mu l}^{t}=V^{t} does not depend on its indices.

Further we are interested in the average Σt\Sigma^{t} of the quantity Σilt\Sigma_{il}^{t} over the realization of F0F^{0}, X0X^{0}, and w0w^{0}. Using the definition of Σilt\Sigma_{il}^{t} eq. (88) and the expression for AμliltA_{\mu l\to il}^{t} eq. (60) we obtain

We proceed analogously for ZμitZ^{t}_{\mu i}. Using again the conditional independence between incoming messages assumed in BP equations we obtain

We use yμl=h(zμl0,w0)y_{\mu l}=h(z^{0}_{\mu l},w^{0}), and we remind that, to leading order, Vμilt=VtV^{t}_{\mu il}=V^{t}. The function goutg_{\rm out} above hence depends on two correlated fluctuating variables ωμilt\omega_{\mu il}^{t} and zμl0z^{0}_{\mu l}, and on w0w^{0}. Both the variables ωμilt\omega_{\mu il}^{t} and zμl0z^{0}_{\mu l} are sums over many independent (for zz this is by construction, for ω\omega by the BP assumptions) terms. Hence, according to the central limit theorem, they are Gaussian random variables. Their mean is zero when at least one of the distributions PXP_{X} and PFP_{F} (and one of the PX0P_{X^{0}} and PF0P_{F^{0}}) have zero mean (which we assume in this section). The covariance matrix between the variables ωμilt\omega_{\mu il}^{t} and zμl0z^{0}_{\mu l} is

where we again used the BP assumption of independence between the incoming messages, but also between Fμi0F^{0}_{\mu i} and the message x^ilμl{\hat{x}}_{il\to\mu l}, and between Xil0X_{il}^{0} and f^μiμl{\hat{f}}_{\mu i\to\mu l}. As for the variance of z0z^{0}, we denote it by:

Altogether this gives for χ^\hat{\chi} and q^\hat{q}

We now study the asymptotic behavior of TiltT^{t}_{il} defined by eq. (88)

where we used definition of Bμl\iltB^{t}_{\mu l\to\i l} in eq. (59). The message f^μiμl(t){\hat{f}}_{\mu i\to\mu l}(t) are uncorrelated with all the other incoming messages and also with all the Fμj0F_{\mu j}^{0} for jij\neq i. It is, however, correlated with Fμi0F_{\mu i}^{0} and the dependence on Fμi0F_{\mu i}^{0} has to be hence treated separately. After an expansion in the leading order we obtain

where in the first term we defined the new parameter m^\hat{m} as

Using the same kind of analysis as we did for q^\hat{q} and χ^\hat{\chi}, we find that

The second term of (176) when averaged over realization of F0F^{0}, X0X^{0} and w0w^{0} behaves as a Gaussian random variable. In (176) we moreover assumed that

which is true in all the special cases analysed in this paper, and is also true in general under the Bayes-optimal inference as detailed in the next section. If the zero-mean assumption (179) did not hold the density evolution equations would contain additional terms (similarly as if both FF and XX had non-zero means), see e.g. the state evolution in compressed sensing for non-zero mean matrices Caltagirone et al. (2014). Under the zero-mean assumption, the variance of the Gaussian variable is αqFtq^t\alpha q^{t}_{F}\hat{q}^{t} with q^t\hat{q}^{t} given by (174).

With the use of (166-166), (176), (176), and the expressions for messages (102-103) we obtain

where Dξ=dξeξ2/2/2π{\cal D}\xi={\rm d}\xi e^{-\xi^{2}/2}/{\sqrt{2\pi}} is a Gaussian integration measure. Analogously we have for the FF-related order parameters

The six equations (181-183) together with (173-178) are the general form of density evolution for GAMP in the general case of matrix factorization. We remind that these equations describe the asymptotic evolution of the algorithm in the “thermodynamic” limit of large sizes, as long as the matrices XX, X0X^{0}, FF, F0F^{0} were generated with iid elements, at least one of the random variables XX and FF, and at least one of the X0X^{0} and F0F^{0} has zero mean, and the output channel satisfies the condition (179). The first of these condition is absolutely essential for our approach; the restriction to zero means is here for convenience, the non-zero means and generic form of output function can be treated with the same formalism that we have used here, with additional terms.

To satisfy the Nishimori identities we have to suppose that all the prior distributions are matching the true distributions from which the signal and noise were generated, i.e. conditions (106) hold. The condition Pout0(yz)=Pout(yz)P^{0}_{\rm out}(y|z)=P_{\rm out}(y|z) is equivalent to P0(w)=P(w)P_{0}(w)=P(w) and h(z,w)=h0(z,w)h(z,w)=h^{0}(z,w). In this Bayes-optimal setting, (106), the asymptotic analysis simplifies considerably since we have

To justify the above statement we need to prove that if (106) is satisfied and (187-189) hold up to iteration tt then (187-189) hold also in iteration t+1t+1. This is done in the next two subsections.

In the Bayes-optimal setting, the state evolution simplifies into

where Dξ{\cal D}\xi is a Gaussian integral dξeξ2/2/2π{\rm d}\xi e^{-\xi^{2}/2}/\sqrt{2\pi}. Here we chose to use the expression coming from eq. (182), (185) and (173), but we could have used any of the other expressions that are equivalent on the Nishimori line. Where mFm_{F} and mXm_{X} are initialized as squares of the means of the corresponding prior distributions

In case the prior distribution depends on another random variables, e.g. in case of matrix calibration, we take additional average with respect to that variable. If the above initialization gives mFt=0=0m_{F}^{t=0}=0 and mXt=0=0m_{X}^{t=0}=0 then this is a fixed point of the state evolution. This is due to the permutational symmetry between the columns of matrix FF and rows of matrix XX. To obtain a nontrivial fixed point we initialize at mFt=0=ηm_{F}^{t=0}=\eta for some very small η\eta, corresponding to an infinitesimal prior information about the matrix elements of the matrix FF. Note that this is needed only in the state evolution, the algorithm breaks the permutational symmetry spontaneously. The same situation appears in an Ising ferromagnet at low temperature where zero magnetization is a fixed point of the equilibrium equations, but the physically correct solution to which dynamical procedures converge had large magnetization in absolute value.

Our general strategy in the asymptotic analysis of optimal Bayesian inference and related phase transition is that the corresponding fixed points must satisfy the Nishimori identities, hence we will restrict our search for fixed points to parameters lying on the Nishimori line, i.e. satisfying the identities (187-189). When these identities are not imposed the iterations of the state evolution equations are not always converging to fixed points on the Nishimori line. This is also reflected in problems with convergence in the GAMP algorithm for matrix factorization. In the algorithm the Nishimori identities are unfortunately not straightforward to impose.

V.1.2 The input Nishimori identities

Assume that in the state evolution the Nishimori identities (187-189) hold for all iteration times smaller or equal to tt. Out aim is to show that then mXt+1m_{X}^{t+1} and qXt+1q_{X}^{t+1} computed from (183) and (182) are equal. Recall that

The proof of mFt+1=qFt+1m_{F}^{t+1}=q_{F}^{t+1} is exactly the same.

Using the definition of the function fc(Σ,R)f_{c}(\Sigma,R) (66), the same change of variables, and resulting cancelations as above we get

V.1.3 The output Nishimori identities

Let us now assume that the input Nishimori identities (187-188) are satisfied and we want to show that (189) and (179) hold.

Doing analogous manipulations of expliciting the Gaussian measure and using eq. (10) and the definition of goutg_{\rm out} in equation (174) we obtain

For χ^t\hat{\chi}^{t} we do integration with respect to pp by parts and using steps as in the above we obtain

Thanks to a cancelation between the integrals over variables zz and zz^{\prime\prime} we can perform explicitly the integral over yy (keeping in mind that Pout(yz)P_{\rm out}(y|z^{\prime}) is a normalized probability distribution). The remaining Gaussian integral is then zero.

In a analogous manner we prove eq. (179) by noticing that the expectation with respect to F0F^{0}, X0X^{0} and w0w^{0} is exactly the integral dydpdzPout(yz)N(p,z)\int{\rm d}y\,{\rm d}p\,{\rm d}z\,P_{\rm out}(y|z)\,{\cal N}(p,z).

To conclude, identities (189) and (179) hold in the limit NN\to\infty. However, as common in statistical physics we can recall the self-averaging property under which quantities on almost every large (NN\to\infty) instance are equal to their averages aver randomness (disorder) of F0F^{0}, X0X^{0} and w0w^{0}. This self-averaging then for instance justifies the use of eq. (108) on large single instances of the matrix factorization problem.

V.2 Replica method

The replica method is known as a non-rigorous approach to evaluate the typical performance of various Bayesian inference problems. We here show how this is employed for the matrix factorization problem. We show, as expected from other problems, that the result of the replica analysis is fully equivalent to the result of the state evolution.

constitutes the basis of our analysis. In statistical physics, one can generally examine properties of systems via evaluation of the free entropy logZ(Y)\log Z(Y), which statistically fluctuates depending on the realization of YY in the current case. However, as N,M,PN,M,P\to\infty, one can expect that the self-averaging property holds and, therefore, the free entropy density N2logZ(Y)N^{-2}\log Z(Y) converges to its typical value ϕN2[logZ(Y)]Y\phi\equiv N^{-2}\left[\log Z(Y)\right]_{Y} with probability of unity. This is also expected to hold for other macroscopic quantities relevant to the performance of the matrix factorization. Therefore, assessment of ϕ\phi is the central issue in our analysis.

with respect to the generative distribution of YY

where we assumed that the functional forms of PF0(Fμi)P_{F^{0}}(F_{\mu i}), PX0(Xil)P_{X^{0}}(X_{il}) and Pout0(yμliFμiXil)P_{\rm out}^{0}(y_{\mu l}|\sum_{i}F_{\mu i}X_{il}) may be different from those of the assumed model PF(Fμi)P_{F}(F_{\mu i}), PX(Xil)P_{X}(X_{il}) and Pout(yμliFμiXil)P_{\rm out}(y_{\mu l}|\sum_{i}F_{\mu i}X_{il}) for generality. When they are equal, which correspond to the Bayes-optimal setting, P0(Y)=Z(Y)P_{0}(Y)=Z(Y) holds.

In performing the integrals of 2(n+1)2(n+1) matrices (F0,{Fa}a=1n)(F^{0},\{F^{a}\}_{a=1}^{n}), (X0,{Xa}a=1n)(X^{0},\{X^{a}\}_{a=1}^{n}) and YY that come out in evaluating [Zn(Y)]Y\left[Z^{n}(Y)\right]_{Y}, we insert trivial identities with respect to all combinations of replica indices ab=0,1,2,,na\leq b=0,1,2,\ldots,n

to the integrand, where FaFbμ,iFμiaFμibF^{a}\cdot F^{b}\equiv\sum_{\mu,i}F_{\mu i}^{a}F_{\mu i}^{b} and similarly for XaXbX^{a}\cdot X^{b}. Let us denote QF(qFab){\cal Q}_{F}\equiv(q_{F}^{ab}) and QX(qXab){\cal Q}_{X}\equiv(q_{X}^{ab}), and introduce two joint distributions

where VF(QF)V_{F}({\cal Q}_{F}) and VX(QX)V_{X}({\cal Q}_{X}) are the normalization constants. These yield an expression of [Zn(Y)]Y\left[Z^{n}(Y)\right]_{Y} as

where []QF,QX\left[\cdots\right]_{{\cal Q}_{F},{\cal Q}_{X}} denotes the average with respect to (212) and (213). In computing []QF,QX\left[\cdots\right]_{{\cal Q}_{F},{\cal Q}_{X}}, it is noteworthy that {Fa}\{F^{a}\} and {Xa}\{X^{a}\} follow statistically independent distributions, and either of them has zero mean and both of them have finite variances from our assumption. These allow us to handle zμlaiFμiaXμiaz_{\mu l}^{a}\equiv\sum_{i}F_{\mu i}^{a}X_{\mu i}^{a} (a=0,1,2,,n;μ=1,2,,M;l=1,2,,P)(a=0,1,2,\ldots,n;\mu=1,2,\ldots,M;l=1,2,\ldots,P) as multivariate Gaussian random variables whose distribution is given by

V.2.2 Replica symmetric free entropy

where the extremization with respect to QF,QX,qF,qX,mF,mXQ_{F},Q_{X},q_{F},q_{X},m_{F},m_{X}, and their conjugate variables leads to the same equations we obtained in the state evolution. Particularly extremization w.r.t. the six conjugate variables gives eqs. (181-186). Extremization with respect to QF,QX,qF,qX,mF,mXQ_{F},Q_{X},q_{F},q_{X},m_{F},m_{X} gives Q^F=πQX(χ^q^)\hat{Q}_{F}=\pi Q_{X}(\hat{\chi}-\hat{q}), Q^X=αQF(χ^q^)\hat{Q}_{X}=\alpha Q_{F}(\hat{\chi}-\hat{q}), q^F=πqXq^\hat{q}_{F}=\pi q_{X}\hat{q}, q^X=αqFq^\hat{q}_{X}=\alpha q_{F}\hat{q}, m^F=πmXm^\hat{m}_{F}=\pi m_{X}\hat{m}, m^X=αmFm^\hat{m}_{X}=\alpha m_{F}\hat{m} where χ^\hat{\chi}, q^\hat{q}, and m^\hat{m} are given byt eqs. (173-174) and (178).

V.2.3 Simplification in the Bayes-optimal setting

One can generally evaluate thermodynamically dominant values of QF,QX,qF,qX,mF,mXQ_{F},Q_{X},q_{F},q_{X},m_{F},m_{X} by solving the extremization problem of (237), which is involved with twelve variables including the conjugate variables and therefore is rather complicated to handle. However, the problem is significantly simplified in the Bayes-optimal setting.

This is because the Nishimori identities allows us to handle F0=(Fμi0)F^{0}=(F_{\mu i}^{0}) and X0=(Xil0)X^{0}=(X_{il}^{0}) as if they were the n+1n+1-st replica variables added to the nn-replicated system composed of {Fa}a=1n={(Fμia)}a=1n\{F^{a}\}_{a=1}^{n}=\{(F_{\mu i}^{a})\}_{a=1}^{n} and {Xa}a=1n={(Xila)}a=1n\{X^{a}\}_{a=1}^{n}=\{(X_{il}^{a})\}_{a=1}^{n} in the computation of the moment [Zn(Y)]Y=dYP0(Y)Zn(Y)=dYP0n+1(Y)\left[Z^{n}(Y)\right]_{Y}=\int dYP_{0}(Y)Z^{n}(Y)=\int dYP_{0}^{n+1}(Y). The replica symmetry among a=0,1,2,,na=0,1,2,\ldots,n ensures the following properties:

mF=qFm_{F}=q_{F}, mX=qXm_{X}=q_{X}, m^F=q^F\hat{m}_{F}=\hat{q}_{F}, and m^X=q^X\hat{m}_{X}=\hat{q}_{X} are satisfied.

QF=QF0Q_{F}=Q_{F}^{0}, QX=QX0Q_{X}=Q_{X}^{0}, Q^F=0\hat{Q}_{F}=0, and Q^X=0\hat{Q}_{X}=0 hold for n0n\to 0.

Substituting these into (237) yields a simplified expression of the free entropy, which is involved with only four macroscopic variables mFm_{F}, mXm_{X}, m^F\hat{m}_{F}, and m^X\hat{m}_{X}, as

where we changed integration variables as Nm^Fξ+Nm^FF0Nm^Fξ\sqrt{N\hat{m}_{F}}\xi+N\hat{m}_{F}F^{0}\to\sqrt{N\hat{m}_{F}}\xi and m^Xξ+m^XX0m^Xξ\sqrt{\hat{m}_{X}}\xi+\hat{m}_{X}X^{0}\to\sqrt{\hat{m}_{X}}\xi together with DξDξeNm^F2(F0)2+Nm^FξF0{\cal D}\xi\to{\cal D}\xi e^{-\frac{N\hat{m}_{F}}{2}(F^{0})^{2}+\sqrt{N\hat{m}_{F}}\xi F^{0}} and DξDξem^X2(X0)2+m^XξX0{\cal D}\xi\to{\cal D}\xi e^{-\frac{\hat{m}_{X}}{2}(X^{0})^{2}+\sqrt{\hat{m}_{X}}\xi X^{0}}, respectively. The saddle point conditions can be summarized via equations that we obtained in the state evolution, notably eqs. (190-192) with m^F=πmXm^\hat{m}_{F}=\pi m_{X}\hat{m}, m^X=αmFm^\hat{m}_{X}=\alpha m_{F}\hat{m}.

The free entropy (241) would be also obtained from an asymptotic limit of the Bethe free entropy of section IV. Overall, as usual in statistical physics, the cavity method and the replica method yield equivalent results for the matrix factorization problem.

VI Examples of asymptotic phase diagrams

In this section we use the state evolution derived in section V.1.1 to analyze the asymptotic MMSE of the Bayes-optimal inference in matrix factorization for applications listed in section I.4. We restrict our analysis to the Bayes-optimal inference, i.e. the case where we generate the data as specified in Sec. I.4 and assume that we know the corresponding distributions. In terms of the AMP algorithm and the state evolution this means we can use all the simplifications that arise under the Nishimori identities. The AMP algorithms for matrix factorization and the asymptotic analysis derived in sections III and V apply to all the examples, the only elements that are application-dependent are the “input” functions fXf_{X} and fFf_{F}, and the output function goutg_{\rm out}.

In terms of our asymptotic analysis the equations for dictionary learning, sparse PCA, and blind source separation are very close, see definitions in section I.4, these problems basically differ by the region of parameters α,π\alpha,\pi that is of interest. Moreover the dictionary learning can be seen as the η\eta\to\infty limit of the blind calibration problem (which is trivially taken in the equations). We hence group the discussion of these problems in the present section, they all present the first order phase transition as low measurement noise.

The matrices FF and XX in our setting of the dictionary learning and sparse PCA problems are generated according to eqs. (21-22). Using the definitions of the input function fXf_{X} in eq. (66), and fFf_{F} in eq. (67) we obtain explicitly:

The functions fc(Σ,T)f_{c}(\Sigma,T) and fs(Z,W)f_{s}(Z,W) are then obtained from eqs. (68-69).

In case of matrix calibration we have some prior knowledge on the matrix FF given by eq. (26). This leads to a function fFf_{F} of the form

Indeed as the uncertainty in the matrix η\eta\to\infty this goes to the fFf_{F} for dictionary learning or sparse PCA (242).

The output function goutg_{\rm out} defined in eq. (23) is, for the output channel (56) with additive white Gaussian noise of variance Δ\Delta:

For such a simple output function the eqs. (173-178) in the density evolution simplify greatly into

which under the simplification of the Nishimori line gives

This is the only equation in the state evolution that explicitly depends on the variance of the measurement noise Δ\Delta. Also eqs. (181-186) simplify for distributions (21) and (26) and using the Nishimori identities they reduce to a pair of equations

Note that indeed only eq. (247) depends on the matrix uncertainty parameter η\eta, and the dictionary learning limit η\eta\to\infty is straightforward. The MMSE of the matrix FF, defined in (9), predicted by this state evolution is then EFE_{F}. The MMSE of the matrix XX, defined in (7), is found equal to EXE_{X}

The two sets of initial conditions that we will analyze to investigate the MMSE and the associated phase transitions are

Random (uninformative) initialization: mXt=0=0m^{t=0}_{X}=0, and mFt=0=1/(1+η)m^{t=0}_{F}=1/(1+\eta). Note that in the limit of dictionary learning η\eta\to\infty this initialization corresponds to a fixed point of the state evolution equations. This fixed point reflects the N ⁣N\! permutational symmetry in the dictionary learning problem, and its instability corresponds to a spontaneous breaking of this symmetry. In the limit of dictionary learning we will hence initialize the state evolution with mFt=0m^{t=0}_{F} being a very small positive constant, and we will see the behavior will not depend on its precise value.

Planted (informative) initialization: mXt=0=ρ(X2+σ)δXm^{t=0}_{X}=\rho(\overline{X}^{2}+\sigma)-\delta_{X}, and mFt=0=1δFm^{t=0}_{F}=1-\delta_{F}, where δX\delta_{X} and δF\delta_{F} are small positive constants to test the “stability” of the zero MMSE point.

The free entropy density from which we compute the limiting performance of the Bayes-optimal inference in case of a first order phase transition is expressed from (241) as follows. For simplicity from now on (till the end of this section VI.1) we analyze only the case where the mean of the elements of X0X^{0} was zero, X=0\overline{X}=0, and the variance of the nonzero ones was one, σ=1\sigma=1)

where m^\hat{m} is given by eq. (246). The dependence on the matrix uncertainty is only in the last term and the limit of the completely unknown matrix FF is easily taken by η\eta\to\infty.

The simple Eqs. (246)-(250) is all we need at the end to analyze the MMSEs EXE_{X} and EFE_{F} of dictionary learning, blind matrix calibration, sparse PCA and blind source separation problems when the signal X0X^{0} and F0F^{0} are generated according to eqs. (21-22) with X=0\overline{X}=0 and σ=1\sigma=1, ρ\rho is the fraction of nonzero elements in X0X^{0}, and η\eta is the matrix uncertainty from (25). The parameter α=M/N\alpha=M/N is the ratio between number of lines and the number of columns of F0F^{0}, π=P/N\pi=P/N is the ration between number of columns and the number of lines of X0X^{0}. The output channel has additive white Gaussian noise of variance Δ\Delta, and this information about distributions and their parameters is used in the posterior likelihood in the optimal Bayes inference.

VI.1.2 Phase diagram for blind matrix calibration and dictionary learning

For the noiseless case Δ=0\Delta=0, and all positive η>0\eta>0, the linear stability analysis of eqs. (246-248) around the informative fixed point mX=ρm_{X}=\rho, mF=1m_{F}=1 (i.e. the MMSEs EF=EX=0E_{F}=E_{X}=0) leads to an update for the perturbations δFt+1=(δFtρ+δX)/(απ)\delta_{F}^{t+1}=(\delta_{F}^{t}\rho+\delta_{X})/(\alpha\pi) and δXt+1=ρ(δFtρ+δX)/α\delta_{X}^{t+1}=\rho(\delta_{F}^{t}\rho+\delta_{X})/\alpha. By computations of the largest eigenvalue of the corresponding 2×22\times 2 matrix we obtain that this informative fixed point is stable if and only if

In other words the zero MSE fixed point is locally stable above the counting lower bound (24).

Further we notice that in the low noise limit Δ0\Delta\to 0, for all positive and finite η\eta, and for mX=ρδXm_{X}=\rho-\delta_{X}, mF=1δFm_{F}=1-\delta_{F} with δX\delta_{X} and δF\delta_{F} being small positive constants of the same order as Δ\Delta the free entropy (250) becomes in the leading order (παπρα)log(Δ+δX+ρδF)/2(\pi\alpha-\pi\rho-\alpha)\log(\Delta+\delta_{X}+\rho\delta_{F})/2. For π>π(α,ρ)\pi>\pi^{*}(\alpha,\rho) this is a large positive value, and a large negative value for π<π(α,ρ)\pi<\pi^{*}(\alpha,\rho).

Hence for the noiseless measurements Δ=0\Delta=0 the asymptotic Bayes-optimal MMSE is EX=0E_{X}=0 and EF=0E_{F}=0 for π>π(α,ρ)\pi>\pi^{*}(\alpha,\rho) for all η>0\eta>0. This is a remarkable result as it implies that in the Bayes-optimal setting the dictionary is identifiable (or an exact calibration of the matrix possible) as soon as the number of samples PP per signal element is larger than the value π(α,ρ)\pi^{*}(\alpha,\rho) given by the trivial counting bound (24).

The next question is whether this MMSE is achievable in a computationally tractable way. To answer this we study the state evolution starting in the uninformative initialization. First we analyze the behavior of the state evolution when mX=δXm_{X}=\delta_{X}, and mF=δFm_{F}=\delta_{F} where both δX\delta_{X}, and δF\delta_{F} are positive and small, while we also consider η\eta being very large. The linear expansion of state evolution update then leads to δXt+1=ρ2αδF/(Δ+ρ)\delta_{X}^{t+1}=\rho^{2}\alpha\delta_{F}/(\Delta+\rho) and δF=1/η+πδX/(Δ+ρ)\delta_{F}=1/\eta+\pi\delta_{X}/(\Delta+\rho). Hence for η\eta\to\infty the uninformative initialization is in fact a stable fixed point of the state evolution equations as long as

This means that for π(ρ,α)<π<πF(Δ,ρ,α)\pi^{*}(\rho,\alpha)<\pi<\pi_{F}(\Delta,\rho,\alpha) the MMSE is not achievable in the dictionary learning (e.g. when α<1\alpha<1 π(0,α)<πF(0,0,α)\pi^{*}(0,\alpha)<\pi_{F}(0,0,\alpha)) with the approximate message passing presented in this paper. This simple analysis leads us to the conclusion that a first order phase transition is in play in the dictionary learning problem, and as we will see also in the blind calibration (η<\eta<\infty) and sparse PCA (α>1\alpha>1).

As a side remark let us remind that the limit η0\eta\to 0 should lead to results known from Bayesian compressed sensing. In particular in compressed sensing for low noise the matrix XX is identifiable if and only if α>ρ\alpha>\rho. To reconcile this with the previous results notice that indeed for η=cΔ0\eta=c\Delta\to 0 with c=O(1)c=O(1) the leading term of the free entropy becomes π(αρ)log(Δ)\pi(\alpha-\rho)\log(\Delta). Hence compressed sensing result is recovered. Whereas for 1ηΔ1\gg\eta\gg\Delta it is the dictionary learning static phase transition π=α/(αρ)\pi^{*}=\alpha/(\alpha-\rho) that is the relevant one.

Due to the close link between the two problems, we shall described the results for dictionary learning together with the case of blind matrix calibration. In both these cases we are typically trying to learn (calibrate) an overcomplete dictionary α<1\alpha<1 and a sparse signal XX from as few samples PP as possible. We hence first plot in Fig. 2 (left) the MSE for FF (in red) and for XX (in blue) as a function of π=P/N\pi=P/N and fixed (representative) value of undersampling ratio α=0.5\alpha=0.5, and density ρ=0.2\rho=0.2 in zero (or negligible) measurement noise, Δ=0\Delta=0. We consider several values of matrix uncertainty η\eta (the larger the value the less we know about the matrix). The AMP-MSE achieved from the uninformative initialization is depicted in dashed lines, the MMSE achieved in this zero noise case from the planted (informative) initialization is in full lines.

Fig. 2 (right) shows the value of the spinodal transition πs\pi^{s} as a function of the matrix uncertainty η\eta. We see that, as expected, limη0πs(η)=π\lim_{\eta\to 0}\pi^{s}(\eta)=\pi^{*}. A result that is less intuitive is that the large η\eta limit is also well defined and finite limηπs(η)=πDLs<0\lim_{\eta\to\infty}\pi^{s}(\eta)=\pi^{s}_{\rm DL}<0. This means that even in the dictionary learning where no prior information about the matrix elements is available the dictionary is identifiable with AMP for large system sizes above the spinodal transition πDLs\pi^{s}_{\rm DL}.

In Fig. 2 (right) we also see an interesting behavior in the function πs(η)\pi^{s}(\eta) for low values of ρ\rho - there is a sharp phase transition from low η\eta regime, where the AMP-MSE at the transition has a weak discontinuity towards a relatively low value of MSE, and a high η\eta regime where the discontinuity is very abrupt towards a value of MSE that is close to the completely uninformative value. This behavior is also illustrated in Fig. 3 (left) where we plot the AMP-MSE as a function of π\pi and η\eta (for fixed ρ=0.05\rho=0.05, α=0.5\alpha=0.5, Δ=0\Delta=0).

Fig. 3 (right) depicts the phase diagram of dictionary learning (η\eta\to\infty) and blind matrix calibration (finite η\eta) we plot the static threshold π\pi^{*} (η\eta-independent) above which the matrix FF is identifiable in full red, and the spinodal threshold πs\pi^{s} (for various values of η\eta) above which the AMP identifies asymptotically the original matrix FF is dashed line. Notice that πs(ρ)\pi^{s}(\rho) diverges as ρρBPCS\rho\to\rho^{\rm CS}_{\rm BP}, where ρBPCS\rho^{\rm CS}_{\rm BP} is the AMP phase transition in compressed sensing, see e.g. Krzakala et al. (2012). This is expected as for ρ>ρBPCS\rho>\rho^{\rm CS}_{\rm BP} the sparse signal cannot be recovered with AMP even if the matrix FF is fully known.

From now on (also in following subsections) we will discuss only the case η\eta\to\infty when no prior information about the dictionary FF is available. In Fig. 4 (left) we illustrate the results for dictionary learning with non-zero measurement noise. The situation is qualitatively similar to what happens in noisy compressed sensing Krzakala et al. (2012). The first order phase transition is becoming weaker as the noise grows, until some value Δ\Delta^{*} above which there is no phase transition anymore and the AMP-MSE==MMSE for the whole range of π\pi. In Fig. 4 (right) we then plot the AMP-MSE and the MMSE of the signal XX as a function of the noise variance Δ\Delta and π\pi for α=0.5\alpha=0.5 and ρ=0.2\rho=0.2. This surface plot demonstrates how the sharp transition disappears and is replaced by a continuous evolution of the MSE when the noise is large. This MMSE is compared to the AMP-MSE for the same case in Fig. 5.

Sparse PCA as we set it in Section I.4 is closely related to dictionary learning. Except that from the three sizes of matrices MM, NN and PP the smallest one is the NN – corresponding to the matrix YY to be of relatively low rank. Hence for sparse PCA we should only really consider α>1\alpha>1, π>1\pi>1.

Behavior of the state evolution is for this range of parameters qualitatively very similar to the one we just observed in the previous section. In Fig. 6 left we treat the case of M=PM=P, i.e. π=α\pi=\alpha, with zero measurement noise, and compute the smallest value for which the matrices FF and ρ\rho-sparse XX are recoverable for a given ρ\rho. Just as in the dictionary learning we obtain that the Bayes optimal MMSE is zero everywhere above the counting bound π=α>1+ρ\pi=\alpha>1+\rho, blue line in Fig. 6 left. The AMP-MSE is, however, zero only above the spinodal line, depicted in red in the figure. The gap between the two lines is not very large in this case.

The right part of Fig. 6 shows the AMP-MSE and the MMSE of the signal matrix XX for measurement noise variance Δ=1010\Delta=10^{-10} and density ρ=0.5\rho=0.5.

In blind source separation, PP corresponds to the length of the signal, NN is the number of sources, and MM the number of sensors. Typically the signal is very long, i.e. one has PNP\gg N and PMP\gg M. The particularly interesting case is when there is more sources than sensors α<1\alpha<1 in that case the signal can be reconstructed only if the density of the signal ρ\rho is smaller than a certain value. The counting bound gives us ρ<α(π1)/π\rho<\alpha(\pi-1)/\pi and this also corresponds to the value under which the MMSE drops to zero under zero measurement noise. As in the previous case also here we observe a first order phase transition and with AMP we can reach zero error (in the noiseless case) only below ρs\rho^{s} that we depict for π=10\pi=10 as a function of α\alpha in Fig. 7.

VI.2 Low rank matrix completion

In the remaining examples we treat cases in which neither FF nor XX are sparse, ρ=1\rho=1, we start with low rank matrix completion.

In matrix completion the output function goutg_{\rm out} is eq. (244) for the known matrix elements μl\mu l (there is ϵMP\epsilon MP of them), and gout(ω,y,V)=0g_{\rm out}(\omega,y,V)=0 for the unknown elements μl\mu l (there is (1ϵ)MP(1-\epsilon)MP of them). The input function fXf_{X} (242) for non-sparse matrix XX, ρ=1\rho=1, becomes

In the state evolution, using the Nishimori identities, we then have

where ϵ\epsilon is the fraction of known elements of YY. Moreover for the input function (253) the state evolution equation for mXm_{X} becomes

The equation for mFm_{F} is the one of (247).

It is instrumental to analyze the local stability of the informative and the uninformative initialization for low rank matrix completion. For the informative initialization we consider mXt=1δXtm_{X}^{t}=1-\delta_{X}^{t}, and mFt=1δFtm_{F}^{t}=1-\delta_{F}^{t}, where δXt\delta_{X}^{t}, and δFt\delta_{F}^{t} are small positive numbers. The state evolution update equations at zero noise Δ=0\Delta=0 lead to δXt+1=(δXt+δFt)/(ϵα)\delta_{X}^{t+1}=(\delta_{X}^{t}+\delta_{F}^{t})/(\epsilon\alpha), and δFt+1=(δXt+δFt)/(ϵπ)\delta_{F}^{t+1}=(\delta_{X}^{t}+\delta_{F}^{t})/(\epsilon\pi). The largest eigenvalue of this 2×22\times 2 system is (α+π)/(ϵαπ)(\alpha+\pi)/(\epsilon\alpha\pi) and hence the informative fixed point is stable for ϵ>ϵ=(α+π)/(απ)\epsilon>\epsilon^{*}=(\alpha+\pi)/(\alpha\pi), which coincides with the counting bound eq. (28). This means that for ϵ>ϵ\epsilon>\epsilon^{*} the noiseless matrix completion, the matrices FF and XX can be recovered without error (asymptotically).

For the uninformative initialization we consider mXt=δXtm_{X}^{t}=\delta_{X}^{t}, and mFt=δFtm_{F}^{t}=\delta_{F}^{t}, where δXt\delta_{X}^{t}, and δFt\delta_{F}^{t} are again small positive numbers. This time the state evolution equations give δFt+1=πϵδXt/(1+Δ)\delta_{F}^{t+1}=\pi\epsilon\delta_{X}^{t}/(1+\Delta) and δXt+1=αϵδFt/(1+Δ)\delta_{X}^{t+1}=\alpha\epsilon\delta_{F}^{t}/(1+\Delta). Hence the uninformative fixed point is stable for ϵ<(Δ+1)/πα\epsilon<(\Delta+1)/\sqrt{\pi\alpha}. This can indeed be verified in Fig. 8.

In matrix completion we treat matrices YY of low rank, hence NN is much smaller than both PP and MM. The main questions concerns the fraction ϵ\epsilon of elements that need to be known in order for the recovery of XX and FF to be possible. In this case we did not identify first order phase transition, as a result we have MMSE==AMP-MSE. At zero measurement noise we observed a phase transition from a phase of perfect recovery to a phase with positive MMSE, its position coincides with the counting bound (28), ϵ=(α+π)/απ\epsilon^{*}=(\alpha+\pi)/{\alpha\pi}. With non-zero measurement noise, in the scaling of noise and rank we consider in this paper, the behavior of MMSE as a function of the other parameters is smooth and derivable (no phase transition). In Fig. 8 we plot an example of the MMSE as a function of the fraction of known elements ϵ\epsilon for squared matrix YY, i.e α=π\alpha=\pi. We generated the signal elements with zero mean and unit variance, X=0\overline{X}=0, σ=1\sigma=1.

Our analysis suggest that compared to the cases with non-zero sparsity the low-rank matrix completion is a much easier problem, at least in the random setting considered in the present paper. The fact that the “counting” threshold can be saturated or close to saturated in the noiseless case by several algorithms can be seen e.g. in the data presented in Parker et al. (2014).

VI.3 Robust PCA

The input functions are the eq. (242) for matrix FF and eq. (253) for matrix XX. In robust PCA as defined by (30) we get for the output function

The state evolution in the Bayes-optimal setting, i.e. using the Nishimori identities, becomes

Equation for mFtm_{F}^{t} is (247), and for mXtm_{X}^{t} is (255).

In robust PCA the informative initialization is again mXt=1δXtm_{X}^{t}=1-\delta_{X}^{t}, and mFt=1δFtm_{F}^{t}=1-\delta_{F}^{t}, where δXt\delta_{X}^{t}, and δFt\delta_{F}^{t} are small positive numbers. For small noise Δs\Delta_{s} and Δl=O(1)\Delta_{l}=O(1) the corresponding fixed point is stable under the same conditions as for low rank matrix completion, i.e. for ϵ>ϵ=(α+π)/(απ)\epsilon>\epsilon^{*}=(\alpha+\pi)/(\alpha\pi).

The uninformative fixed point is mXt=δXtm_{X}^{t}=\delta_{X}^{t}, and mFt=δFtm_{F}^{t}=\delta_{F}^{t}, where δXt\delta_{X}^{t}, and δFt\delta_{F}^{t} are again small positive numbers. This evolves as δXt+1=αδFtm^\delta_{X}^{t+1}=\alpha\delta_{F}^{t}\hat{m}, δFt+1=πδXtm^\delta_{F}^{t+1}=\pi\delta_{X}^{t}\hat{m}, with m^=(1+δs=ϵΔlϵΔs)/[(1+Δs)(1+Δl)]\hat{m}=(1+\delta_{s}=\epsilon\Delta_{l}-\epsilon\Delta_{s})/[(1+\Delta_{s})(1+\Delta_{l})] in this limit. Hence for instance for Δs0\Delta_{s}\to 0, Δl=1\Delta_{l}=1 and π=α\pi=\alpha we have that the uninformative fixed point is stable for ϵ<2/α1\epsilon<2/\alpha-1, which again corresponds to what we observe in Fig. 9.

In the example of Fig. 9 we plot the MMSE as a function of the fraction of undistorted elements ϵ\epsilon in the case of squared matrix YY, α=π\alpha=\pi, the variance of the large distortions Δl=1\Delta_{l}=1 and two different values of the small measurement noise Δs\Delta_{s}. We see a second order phase transition at the counting bound for Δs=0\Delta_{s}=0 and a smooth decay of the MMSE for ΔX>0\Delta_{X}>0.

It is interesting to compare how well robust PCA can be solved with respect to the matrix completion. In both cases ϵ\epsilon is the fraction of known elements. The difference is that in matrix completion their position is known, whereas in robust PCA it is not. Intuitively the R-PCA should thus be a much harder problem. This is not confirmed in our analysis that instead suggest that robust PCA is as easy as matrix completion, since the zero noise phase transitions in the two coincide. Moreover, whereas at ϵ0\epsilon\to 0 there is no information left in matrix completion (that is why the MMSE=1=1), in robust PCA the largely distorted elements can still be explored and the MMSE<1<1. Note, however, that algorithmically it seems less easy to saturate this theoretical asymptotic performance in R-PCA, see e.g. Figure 7 in Parker et al. (2014).

VI.4 Factor analysis

In factor analysis, the input functions fFf_{F} and fXf_{X} are the same as the dictionary learning (242) at ρ=1\rho=1. The output function (56) for factor analysis (31) is given by

where ψμ\psi_{\mu} is the variance of the μ\mu-th component of the unique factor. The variance of unique factor ψμ\psi_{\mu} depends here on the index μ\mu and does not on the index ll, which leads to a slight modification in the derivation of the state evolution from section V.1. For simplicity, we assume that ψμ\psi_{\mu}’s are known; in practice, these should be estimated by the expectation and maximization scheme in conjunction with GAMP. Then, we obtain, on the Nishimori line

where μ\langle\cdot\rangle_{\mu} means the average over ψμ\psi_{\mu}, the variance of the elements of the matrix FF is denoted QF,μ0Q_{F,\mu}^{0}.

To analytically solve (261), one has to specify the distributions of ψμ\psi_{\mu} and QF,μ0Q_{F,\mu}^{0}. We set QF,μ0=QF0=1Q_{F,\mu}^{0}=Q_{F}^{0}=1 for all μ\mu, and suppose a two-peak distribution for ψ\psi as

Let us also assume X=0\overline{X}=0 and σ=1\sigma=1. In this case the state evolution can be summarized as

The total MMSE is given by EF=1(ϵmF1+(1ϵ)mF2)E_{F}=1-(\epsilon m_{F1}+(1-\epsilon)m_{F2}) and EX=1mXE_{X}=1-m_{X}. Fig. 10 (left) shows the π\pi-dependence of the MMSE at α=2\alpha=2, ψ1=0.5\psi_{1}=0.5, and ψ2=2\psi_{2}=2 for ϵ=0\epsilon=0, 0.5, 0.9 and 1.

We analyze again the stability of the uninformative fixed point, (mX,mF1,mF2)=(0,0,0)(m_{X},m_{F_{1}},m_{F_{2}})=(0,0,0), of the state evolution. Small positive numbers δX\delta_{X}, δF1\delta_{F_{1}}, and δF2\delta_{F_{2}} that give the uninformative initialization mX=δXm_{X}=\delta_{X}, mF1=δF1m_{F_{1}}=\delta_{F_{1}}, and mF2=δF2m_{F_{2}}=\delta_{F_{2}} evolve under the state evolution as

These expressions indicate that the uninformative fixed point becomes unstable when \alpha\pi\Big{[}\frac{\epsilon}{(\psi_{1}+1)^{2}}+\frac{1-\epsilon}{(\psi_{2}+1)^{2}}\Big{]}>1. The critical values of π\pi given by this condition coincides with the transition point where the MMSE departs from 1 shown in Fig. 10 (left). As an example of ϵ\epsilon-dependence, we show the MMSE of XX at ψ1=102\psi_{1}=10^{-2}, ψ1=105\psi_{1}=10^{-5} and ψ2=1\psi_{2}=1 for four different values of α=π\alpha=\pi in Fig. 10 (right). Consistently with our analysis, in these cases the uninformative initialization is always unstable.

The transition associated with the stability of the informative fixed point occurs only when at least one of ψ\psis tends to be zero. For instance when ψ1=0\psi_{1}=0, the informative initialization corresponds to δF1=1mF1\delta_{F_{1}}=1-m_{F_{1}} and δX=1mX\delta_{X}=1-m_{X} that are given by

without depending on ψ2\psi_{2}. These expressions mean that when ϵ>ϵ=π/[α(π1)]\epsilon>\epsilon^{*}=\pi/[\alpha(\pi-1)], the informative fixed point is stable, consistently with Fig. 10 (right).

The state evolution of factor analysis for the two-peak case is qualitatively similar to that of robust PCA and low rank matrix completion, but the values of the phase transition points differ.

VII Discussion and Conclusions

In this paper we have analyzed various examples of the matrix factorization problem. We obtain a matrix YY that is an element-wise noisy measurement of an unknown matrix Z=FXZ=FX, where both YY, and ZZ are M×PM\times P matrices, FF is a M×NM\times N matrix, and XX is a N×PN\times P matrix. We have considered the computational tractability of this problem in the large size limit NN\to\infty while π=P/N=O(1)\pi=P/N=O(1), and α=M/N=O(1)\alpha=M/N=O(1). Our analysis concerns the teacher-student scenario where XX and FF are generated with random independent elements of some known probability distributions and we employ the Bayes-optimal inference scheme to recover FF and XX from YY.

Let us summarize our contribution: We derived the approximate message passing algorithm for matrix factorization. One version of the algorithm —for calibration and dictionary learning— was reported in Krzakala et al. (2013), and a very related algorithm called Big-AMP was discussed by Parker et al. (2013, 2014). This algorithm is derived from belief propagation. We have presented the AMP for matrix factorization in several forms in Sections III.1, and III.2. We have also discussed simplifications that arise in the Bayes-optimal setting when we can use the Nishimori identities (Section III.3), or when the matrix is large and one uses self-averaging of some of the quantities appearing in GAMP (Section III.4). We focused on the theoretical properties of the AMP algorithm, for a robust practical implementation we refer the reader to the works Parker et al. (2013, 2014) that include a very complete report on its performance on a range of benchmarks.

Next to the AMP algorithm we have also derived the corresponding Bethe free entropy in Section IV. The Bethe free entropy evaluated at a fixed point of the GAMP equations approximates the log-likelihood of the corresponding problem. We mainly use it in situations when we have more than one fixed point of GAMP, it is then the one with the largest values of the Bethe entropy that asymptotically given the MMSE of the Bayes-optimal inference. We also derived a variational Bethe free entropy in Section IV.2. This is a useful quantity that can serve in controlling the convergence of the AMP approach. Alternatively, a direct maximization of this expression is a promising algorithm itself (see Krzakala et al. (2014) for an investigation of this idea for compressed sensing).

The AMP algorithm for matrix factorization is amenable to asymptotic analysis via the state evolution technique that was carried out rigorously for approximate message passing in compressed sensing Bayati & Montanari (2011). We derive the state evolution analysis for matrix factorization using tools of statistical mechanics. In particular we use two approaches leading to equivalent results the cavity method (Section V.1) and the replica method (Section V.2). Our derivation of the state evolution is not rigorous, but we conjecture that it is nevertheless asymptotically exact as is the case in many other systems of this type including the compressed sensing. The main result of this state evolution are simple iterative equations that provide a way to compute the MMSE of the Bayes-optimal inference as well as the MSE reached theoretically in the large size limit by the AMP algorithm. The rigorous proof of the formulas derived in this paper is obviously an important topic for future work.

The main results of this paper concern analysis of MMSE and AMP-MSE for various interesting examples of the matrix factorization problem. We analyze the asymptotic phase diagrams for dictionary learning, blind matrix calibration, sparse PCA, blind source separation, matrix completion, robust PCA and factor analysis. Earlier results on this analysis appeared in Krzakala et al. (2013); Sakata & Kabashima (2013). We find that when one of the matrices FF or XX is sparse the problems undergo a first order phase transition which is related to an interesting algorithmic barrier known for instance from compressed sensing Krzakala et al. (2012).

It is a generic observation that for most of the problems we analyzed the theoretically achievable performance is much better than the one achievable by existing algorithms. The AMP algorithm should be able to match this performance for very large systems which is the most exciting perspective for further development of this work. If successful it could lead to an algorithmic revolution in various application of the matrix factorization.

In this paper we concentrate on the theoretical analysis and not on the performance of the algorithm itself. Some studies of the performance of some versions of the algorithm can be find in Krzakala et al. (2013); Parker et al. (2014). We, however, observed that the performance depends strongly on the implementation details and we did not yet found a way to match the theoretically predicted performance for systems of treatable (practical) size in all cases.

It is worth discussing some of these algorithmic issues in this conclusion. One of the main problems it that the GAMP algorithm with parallel update presents instabilities that drive its evolution away from the so-called Nishimori line; see a recent study of this issue in the compressed sensing problem Caltagirone et al. (2014). This can be seen even in the state evolution when we do not assume explicitly that the result of the Bayes-optimal inference corresponds to a fixed point that belongs to the Nishimori line. There are ways how to avoid these issues, e.g. we observed that the difficulties basically disappear when the sequential update of the message passing algorithm from Section III.1 is used instead of the parallel one. This, however, does not scale very well with the systems size and our results were hence spoiled by very strong finite size effects. When learning the M×NM\times N matrix FF, and the N×PN\times P matrix XX we also often observed that a (very small) number of the PP signals XlX_{l} were not correctly reconstructed, and these “rogue” vectors in XlX_{l} were polluting the reconstruction of FF. An exemple of these finite size effects can be observed in Krzakala et al. (2013).

In the work of Parker et al. (2013, 2014) a part of the problems with convergence of the corresponding algorithm was mitigated by adaptive damping (though maybe with not the most suitable cost function, see Sec. IV) and expectation maximization learning. Why this is helpful is theoretically explained in the recent work Caltagirone et al. (2014) for compressed sensing. However, the implementation of Parker et al. (2014) does not match the theoretical performance predicted in this paper either (we have explicitly tried for the dictionary learning and robust PCA examples). This shows that more work is needed in order to reach a practical algorithm able to achieve the prediction at moderate sizes. A more proper understanding of these problems, and further developments of the algorithm are therefore the main direction of our future work.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union’s 7th7^{th} Framework Programme (FP/2007-2013)/ERC Grant Agreement 307087-SPARCS. Financial support by the JSPS Core-to-Core Program “Non-equilibrium dynamics of soft matter and information” and JSPS/MEXT KAKENHI Nos. 23-4665 and 25120013 is also acknowledged.

References