Implicit Regularization in Deep Matrix Factorization
Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo
Introduction
It is a mystery how deep neural networks generalize despite having far more learnable parameters than training examples. Explicit regularization techniques alone cannot account for this generalization, as they do not prevent the networks from being able to fit random data (see ). A view by which gradient-based optimization induces an implicit regularization has thus arisen. Of course, this view would be uninsightful if “implicit regularization” were treated as synonymous with “promoting generalization” — the question is whether we can characterize the implicit regularization independently of any validation data. Notably, the mere use of the term “regularization” already predisposes us towards characterizations based on known explicit regularizers (e.g. a constraint on some norm of the parameters), but one must also be open to the possibility that something else is afoot.
An old argument (cf. ) traces implicit regularization in deep learning to beneficial effects of noise introduced by small-batch stochastic optimization. The feeling is that solutions that do not generalize correspond to “sharp minima,” and added noise prevents convergence to such solutions. However, recent evidence (e.g. ) suggests that deterministic (or near-deterministic) gradient-based algorithms can also generalize, and thus a different explanation is in order.
A major hurdle in this study is that implicit regularization in deep learning seems to kick in only with certain types of data (not with random data for example), and we lack mathematical tools for reasoning about real-life data. Thus one needs a simple test-bed for the investigation, where data admits a crisp mathematical formulation. Following earlier works, we focus on the problem of matrix completion: given a randomly chosen subset of entries from an unknown matrix , the task is to recover the unseen entries. To cast this as a prediction problem, we may view each entry in as a data point: observed entries constitute the training set, and the average reconstruction error over the unobserved entries is the test error, quantifying generalization. Fitting the observed entries is obviously an underdetermined problem with multiple solutions. However, an extensive body of work (see for a survey) has shown that if is low-rank, certain technical assumptions (e.g. “incoherence”) are satisfied and sufficiently many entries are observed, then various algorithms can achieve approximate or even exact recovery. Of these, a well-known method based upon convex optimization finds the minimal nuclear norm matrix among those fitting all observed entries (see ). Recall that the nuclear norm (also known as trace norm) of a matrix is the sum of its singular values, regarded as a convex relaxation of rank.
One may try to solve matrix completion using shallow neural networks. A natural approach, matrix factorization, boils down to parameterizing the solution as a product of two matrices — — and optimizing the resulting (non-convex) objective for fitting observed entries. Formally, this can be viewed as training a depth- linear neural network. It is possible to explicitly constrain the rank of the produced solution by limiting the shared dimension of and . However, in practice, even when the rank is unconstrained, running gradient descent with small learning rate (step size) and initialization close to zero tends to produce low-rank solutions, and thus allows accurate recovery if is low-rank. This empirical observation led Gunasekar et al. to conjecture in that gradient descent over a matrix factorization induces an implicit regularization minimizing nuclear norm:
With small enough learning rate and initialization close enough to the origin, gradient descent on a full-dimensional matrix factorization converges to the minimum nuclear norm solution.
Could the implicit regularization of deep matrix factorizations be stronger than that of their shallow counterpart (which Conjecture 1 equates with nuclear norm minimization)? Experiments reported in Figure 1 suggest that this is indeed the case — depth leads to more accurate completion of a low-rank matrix when the number of observed entries is small. Our purpose in the current paper is to mathematically analyze this stronger form of implicit regularization. Can it be described by a matrix norm (or quasi-norm) continuing the line of Conjecture 1, or is a paradigm shift required?
1 Paper overview
In Section 2 we investigate the potential of norms for capturing the implicit regularization in deep matrix factorization. Surprisingly, we find that the main theoretical evidence connecting nuclear norm and shallow (depth-) matrix factorization — proof given in for Conjecture 1 in a particular restricted setting — extends to arbitrarily deep factorizations as well. This result disqualifies the natural hypothesis by which Schatten quasi-norms replace nuclear norm as the implicit regularization when one adds depth to a shallow matrix factorization. Instead, when interpreted through the lens of , it brings forth a conjecture by which the implicit regularization is captured by nuclear norm for any depth. Since our experiments (Figure 1) show that depth changes (enhances) the implicit regularization, we are led to question the theoretical direction proposed in , and accordingly conduct additional experiments to evaluate the validity of Conjecture 1.
Typically, when the number of observed entries is sufficiently large with respect to the rank of the matrix to recover, nuclear norm minimization yields exact recovery, and thus it is impossible to distinguish between that and a different implicit regularization which also perfectly recovers. The regime most interesting to evaluate is therefore that in which the number of observed entries is too small for exact recovery by nuclear norm minimization — here there is room for different implicit regularizations to manifest themselves by providing higher quality solutions. Our empirical results show that in this regime, matrix factorizations consistently outperform nuclear norm minimization, suggesting that their implicit regularization admits stronger bias towards low-rank, in contrast to Conjecture 1. Together, our theory and experiments lead us to suspect that the implicit regularization in matrix factorization (shallow or deep) may not be amenable to description by a simple mathematical norm, and a detailed analysis of the dynamics in optimization may be necessary.
Section 3 carries out such an analysis, characterizing how the singular value decomposition of the learned solution evolves during gradient descent. Evolution rates of singular values turn out to be proportional to their size exponentiated by , where is the depth of the factorization. This establishes a tendency towards low rank solutions, which intensifies with depth. Experiments validate the findings, demonstrating the dynamic nature of implicit regularization in deep matrix factorization.
We believe the trajectories traversed in optimization may be key to understanding generalization in deep learning, and hope that our work will inspire further progress along this line.
Can the implicit regularization be captured by norms?
We will focus in this section on matrix sensing — a more general problem than matrix completion. Here, we are given measurement matrices , with corresponding labels generated by , and our goal is to reconstruct the unknown matrix . As in the case of matrix completion, well-known methods, and in particular nuclear norm minimization, can recover if it is low-rank, certain technical conditions are met, and sufficiently many observations are given (see ).
Our first result is that the theory developed by to support Conjecture 1 can be generalized to suggest that nuclear norm captures the implicit regularization in matrix factorization not just for depth , but for arbitrary depth. This is of course inconsistent with the experimental findings reported in Figure 1. We will first recall the existing theory, and then show how to extend it.
studied implicit regularization in shallow (depth-) matrix factorization by considering recovery of a positive semidefinite matrix from sensing via symmetric measurements, namely:
Motivated by Theorem 1 and empirical evidence they provided, raised Conjecture 1, which, formally stated, hypothesizes that the condition in Theorem 1 of commuting is unnecessary, and an identical statement holds for arbitrary (symmetric linearly independent) measurement matrices. Their conjecture also relaxes the requirement from the initialization of gradient flow — an initial value of is believed to suffice, where is an arbitrary full-rank matrix (that does not depend on ).
While the analysis of covers only symmetric matrix factorizations of the form , they noted that it can be extended to also account for asymmetric factorizations of the type considered in the current paper. Specifically, running gradient flow on the objective:
Next, we show that Theorem 1 — the main theoretical justification for the connection between nuclear norm and shallow matrix factorization — extends to arbitrarily deep factorizations as well. Consider gradient flow over the objective:
Our proof is inspired by that of Theorem 1 given in . Using the expression for derived in (Lemma 3 in Appendix A), it can be shown that commutes with , and takes on a particular form. Taking limits and , optimality (minimality) of nuclear norm is then established using a duality argument. ∎
Theorem 2 provides a particular setting where the implicit regularization in deep matrix factorizations boils down to nuclear norm minimization. By Proposition 1 below, there exist instances of this setting for which the minimization of nuclear norm contradicts minimization (even locally) of Schatten- quasi-norm for any . Therefore, one cannot hope to capture the implicit regularization in deep matrix factorizations through Schatten quasi-norms. Instead, if we interpret Theorem 2 through the lens of , we arrive at a conjecture by which the implicit regularization is captured by nuclear norm for any depth.
2 Experiments challenging Conjecture 1
Subsection 2.1 suggests that from the perspective of current theory, it is natural to apply Conjecture 1 to matrix factorizations of arbitrary depth. On the other hand, the experiment reported in Figure 1 implies that depth changes (enhances) the implicit regularization. To resolve this tension we conduct a more refined experiment, which ultimately puts in question the validity of Conjecture 1.
Our experimental protocol is as follows. For different matrix completion tasks with varying number of observed entries, we compare minimum nuclear norm solution to those brought forth by running gradient descent on matrix factorizations of different depths. For each depth, we apply gradient descent with different choices of learning rate and standard deviation for (random, zero-centered) initialization, observing the trends as these become smaller. The outcome of the experiment is presented in Figure 8. As can be seen, when the number of observed entries is sufficiently large with respect to the rank of the matrix to recover, factorizations of all depths indeed admit solutions that tend to minimum nuclear norm. However, when there are less entries observed — precisely the data-poor setting where implicit regularization matters most — neither shallow (depth-) nor deep (depth- with ) factorizations minimize nuclear norm. Instead, they put more emphasis on lowering the effective rank (cf. ), in a manner which is stronger for deeper factorizations.
A close look at the experiments of reveals that there too, in situations where the number of observed entries (or sensing measurements) was small (less than required for reliable recovery), a discernible gap appeared between the minimal nuclear norm and that returned by (gradient descent on) a matrix factorization. In light of Figure 8, we believe that if had included in its plots an accurate surrogate for rank (e.g. effective rank or Schatten- quasi-norm with small ), scenarios where matrix factorization produced sub-optimal (higher than minimum) nuclear norm would have manifested superior (lower) rank. More broadly, our experiments suggest that the implicit regularization in (shallow or deep) matrix factorization is somehow geared towards low rank, and just so happens to minimize nuclear norm in cases with sufficiently many observations, where minimum nuclear norm and minimum rank are known to coincide (cf. ). We note that the theoretical analysis of supporting Conjecture 1 is limited to such cases, and thus cannot truly distinguish between nuclear norm minimization and some other form of implicit regularization favoring low rank.
Given that Conjecture 1 seems to hold in some settings (Theorems 1 and 2; ) but not in other (Figure 8), we hypothesize that capturing implicit regularization in (shallow or deep) matrix factorization through a single mathematical norm (or quasi-norm) may not be possible, and a detailed account for the optimization process might be necessary. This is carried out in Section 3.
Dynamical analysis
This section characterizes trajectories of gradient flow (gradient descent with infinitesimally small learning rate) on deep matrix factorizations. The characterization significantly extends past analyses for linear neural networks (e.g. ) — we derive differential equations governing the dynamics of singular values and singular vectors for the product matrix (Equation (1)). Evolution rates of singular values turn out to be proportional to their size exponentiated by , where is the depth of the factorization. For singular vectors, we show that lack of movement implies a particular form of alignment with the gradient, and by this strengthen past results which have only established the converse. Via theoretical and empirical demonstrations, we explain how our findings imply a tendency towards low-rank solutions, which intensifies with depth.
We study gradient flow over the factorization:
and in accordance with past work, assume that factors are balanced at initialization, i.e.:
Equation (5) is satisfied approximately in the common setting of near-zero initialization (it holds exactly in the “residual” setting of identity initialization — cf. ). The condition played an important role in the analysis of , facilitating derivation of a differential equation governing the product matrix of a linear neural network (see Lemma 3 in Appendix A). It was shown in empirically that there is an excellent match between the theoretical predictions of gradient flow with balanced initialization, and its practical realization via gradient descent with small learning rate and near-zero initialization. Other works (e.g. ) later supported this match theoretically.
Employing results of , we will characterize the evolution of singular values and singular vectors for . As a first step, we show that admits an analytic singular value decomposition ():
The product matrix can be expressed as:
We show that is an analytic function of and then invoke Theorem 1 in . ∎
The diagonal elements of , which we denote by , are signed singular values of ; the columns of and , denoted and , are the corresponding left and right singular vectors (respectively).
With Lemma 1 in place, we are ready to characterize the evolution of singular values:
The signed singular values of the product matrix evolve by:
If the matrix factorization is non-degenerate, i.e. has depth , the singular values need not be signed (we may assume for all ).
Differentiating the analytic singular value decomposition (Equation (6)) with respect to time, multiplying from the left by and from the right by , and using the fact that and have orthonormal columns, we obtain . Equation (7) then follows from plugging in the expression for developed by (Lemma 3 in Appendix A). ∎
Next, we turn to the evolution of singular vectors:
Assume that at initialization, the singular values of the product matrix are distinct and different from zero. This assumption can be relaxed significantly — all that is needed is that no singular value be identically zero (), and no pair of singular values be identical through time (). Then, its singular vectors evolve by:
We follow a series of steps adopted from to obtain expressions for and in terms of , , and . Plugging in the expression for developed by (Lemma 3 in Appendix A) then yields Equations (8), (9). ∎
Earlier papers studying gradient flow for linear neural networks (e.g. ) could show that singular vectors are stationary if they align with the singular vectors of the gradient. Corollary 1 is significantly stronger and implies a converse — if singular vectors are stationary, they must be aligned with the gradient. Qualitatively, this suggests that a “goal” of gradient flow on a deep matrix factorization is to align singular vectors of the product matrix with those of the gradient.
Figure 13 presents empirical demonstrations of our conclusions from Theorem 3 and Corollary 1. It shows that for a non-degenerate deep matrix factorization, i.e. one with depth , under gradient descent with small learning rate and near-zero initialization, singular values of the product matrix are subject to an enhancement/attenuation effect as described above: they progress very slowly after initialization, when close to zero; then, upon reaching a certain threshold, the movement of a singular value becomes rapid, with the transition from slow to rapid movement being sharper with a deeper factorization (larger ). In terms of singular vectors, the figure shows that those of the product matrix indeed align with those of the gradient. Overall, the dynamics promote solutions that have a few large singular values and many small ones, with a gap that is more extreme the deeper the matrix factorization is. This is an implicit regularization towards low rank, which intensifies with depth.
Let . By Equation (10):
Computing the integrals, we may express as a function of : In accordance with Theorem 3, if , we assume without loss of generality that , while disregarding the trivial case of equality to zero.
where , and stands for a value that does not depend on . Equation 11 reveals a gap between and that enhances with depth. For example, consider the case where . If the depth is one, i.e. the matrix factorization is degenerate, will grow linearly with . If — shallow matrix factorization — will grow polynomially more slowly than ( here is positive). Increasing depth further will lead to asymptote when grows, at a value which can be shown to be lower the larger is. Overall, adding depth to the matrix factorization leads to more significant gaps between singular values of the product matrix, i.e. to a stronger implicit bias towards low rank.
Related work
Implicit regularization in deep learning is a highly active area of research. For non-linear neural networks, the topic has thus far been studied empirically (e.g. in ), with theoretical analyses being somewhat scarce (see for some of the few observations that have been derived). The majority of theoretical attention has been devoted to (single-layer) linear predictors and (multi-layer) linear neural networks, often viewed as stepping stones towards non-linear models. Linear predictors were treated in . For linear neural networks, studied settings where the training objective admits a single global minimum, and the question is what path gradient descent (or gradient flow) takes to reach it. and also considered settings where there are multiple global minima, but in these too there was just one solution to which optimization could converge, leaving only the question of what path is taken to reach it. This stands in contrast to the practical deep learning scenario where there are multiple global minima, and implicit regularization refers to the optimizer being biased towards reaching those solutions that generalize well. The latter scenario was treated by and in the context of linear neural networks trained for binary classification via separable data. These works showed that under certain assumptions, gradient descent converges (in direction) to the maximum margin solution. Intriguingly, the bias towards maximum margin holds with any number of layers, so in particular, implicit regularization was found to be oblivious to depth. In addition to standard linear neural networks, also analyzed “linear convolutional networks”, characterized by a particular weight sharing pattern. For such models, the implicit regularization was found to promote sparsity in the frequency domain, in a manner which does depend on depth.
The most extensively studied instance of linear neural networks is matrix factorization, corresponding to a model with multiple inputs, multiple outputs and a single hidden layer, typically trained to recover a low-rank linear mapping. The literature on matrix factorization for low-rank matrix recovery is far too broad to cover here — we refer to for a recent survey, while mentioning that the technique is oftentimes attributed to . Notable works proving successful recovery of a low-rank matrix through matrix factorization trained by gradient descent with no explicit regularization are . Of these, can be viewed as resolving the conjecture of — which we investigate in Section 2 — for the case of sufficiently many linear measurements satisfying the restricted isometry property.
To the best of our knowledge, the current paper is the first to study implicit regularization for deep (three or more layer) linear neural networks with multiple outputs. The latter trait seems to be distinctive, as it is the main differentiator between the setting of , where implicit regularization is oblivious to depth, and ours, for which we show that depth has significant impact. We note that our work is focused on the type of solutions reached by gradient descent, not the complementary questions of whether an optimal solution is found, and how fast that happens. These questions were studied extensively for matrix factorization — cf. — and more recently for linear neural networks of arbitrary depth — see . From a technical perspective, closest to our work are and — we rely on their results and significantly extend them (see Sections 2 and 3).
Conclusion
The implicit regularization of gradient-based optimization is key to generalization in deep learning. As a stepping stone towards understanding this phenomenon, we studied deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Through theory and experiments, we questioned prevalent norm-based explanations for implicit regularization in matrix factorization (cf. ), and offered an alternative, dynamical approach. Our characterization of the dynamics induced by gradient flow on the singular value decomposition of the learned matrix significantly extends prior work on linear neural networks. It reveals an implicit tendency towards low rank which intensifies with depth, supporting the empirical superiority of deeper matrix factorizations.
An emerging view is that understanding optimization in deep learning necessitates a detailed account for the trajectories traversed in training (cf. ). Our work adds another dimension to the potential importance of trajectories — we believe they are necessary for understanding generalization as well, and in particular, may be key to analyzing implicit regularization for non-linear neural networks.
Acknowledgments and Disclosure of Funding
References
References
Appendix A Useful lemmas
We recall the following result from , which characterizes the evolution of the product matrix under gradient flow on a deep matrix factorization:
Suppose we run gradient flow over the factorization:
with factors initialized to be balanced, i.e.:
Then, the product matrix obeys the following dynamics:
An additional result we will use is the following technical lemma:
If , the solution to Equation (12) is:
This solution does not diverge in finite time (regardless of the chosen ), and obviously preserves the sign of its initial value.
If , Equation (12) is solved by:
In this case, divergence in finite time can take place (depending on the choice of ), but nonetheless the sign of is preserved until that happens. ∎
Appendix B Deferred proofs
and its adjoint operator :
Then we can rewrite the loss function in Equation (2) as:
Returning to the original variables (un-diagonalizing by the orthogonal matrix ), recall that:
(by Equations (17) and (21)); and
Lemma 5 below then concludes the proof. ∎
Let be the optimal value for the primal program (Equation (22)):
By duality theory, for any feasible in the dual program (Equation (23)), we have . Since is feasible in the primal, and each is feasible in the dual, it holds that:
Taking the limit , the right hand side above becomes , which implies . ∎
B.2 Proof of Proposition 1
obviously satisfies Equation (24). Additionally, it is symmetric with eigenvalues:
and therefore is positive semidefinite. For any :
B.3 Proof of Lemma 1
B.4 Proof of Theorem 3
Differentiate the analytic singular value decomposition (Equation (6)) with respect to time:
then multiply from the left by and from the right by :
where we used the fact that and have orthonormal columns. Restricting our attention to the diagonal elements of this matrix equation, we have:
Since has constant (unit) length it holds that , and similarly . The latter equation thus simplifies to:
Lemma 3 from Appendix A provides the following expression for :
Left-multiplying by , right-multiplying by , and using the fact that (columns of ) and (columns of ) are orthonormal sets, we obtain:
Combining this with Equation (25) yields the sought-after Equation (7).
To complete the proof, it remains to show that if the matrix factorization is non-degenerate (has depth ), singular values need not be signed, i.e. we may assume for all . Equation (7), along with Lemma 4, imply that if , will never switch sign. Therefore, either for all , or alternatively, this will hold if we take away a minus sign from and absorb it into (or ). ∎
B.5 Proof of Lemma 2
A real analytic function is either identically zero, or admits a zero set with no accumulation points (cf. ). For any , applying this fact to the signed singular value , while taking into account our assumption of it being different from zero at initialization, we conclude that the set of times for which it vanishes has no accumulation points. Similarly, for any , , we assumed that is different from zero at initialization, and thus the set of times for which it vanishes is free from accumulation points. Overall, any time for which for some , or for some , must be isolated, i.e. surrounded by a neighborhood in which none of these conditions are met. Accordingly, hereafter, we assume and , knowing that for times in which this does not hold, and can be inferred by continuity.
We now follow a series of steps adopted from , to derive expressions for and in terms of , , and . Differentiate the analytic singular value decomposition (Equation (6)) with respect to time:
Multiplying from the left by and from the right by , we have:
where we used the fact that and have orthonormal columns. This orthonormality also implies that and are skew-symmetric, To see this, note that is constant, thus its derivative with respect to time is equal to zero, i.e. (by an analogous argument holds as well). and in particular have zero diagonals. Since is diagonal, and have zero diagonals as well. On the other hand holds zeros outside its diagonal, and so we may write:
where stands for Hadamard (element-wise) product, and is a matrix holding zeros on its diagonal and ones elsewhere. Taking transpose of Equation (28), while recalling that and are skew-symmetric, we have:
Right-multiply Equation (28) by , left-multiply Equation (29) by , and add:
Since we assume diagonal elements of are distinct ( for ), this implies:
Multiplying from the left by yields:
with being the projection onto the subspace spanned by the (orthonormal) columns of . Denote by the projection onto the orthogonal complement, i.e. , where is the identity matrix. Apply to both sides of Equation (26):
Note that , and multiply from the right by (the latter is well-defined since we assume diagonal elements of are non-zero — ):
Adding Equations (31) and (32), we obtain an expression for :
By returning to Equations (28) and (29), switching the directions from which they were multiplied by (i.e. multiplying Equation (28) from the left and Equation (29) from the right), and continuing similarly to above, an analogous expression for is derived:
where is the identity matrix.
Next, we invoke Lemma 3 from Appendix A, which provides an expression for :
Since is symmetric (and is diagonal), Equation (37) implies:
Taking Hadamard product by (Equation (30)) we obtain:
where is given by:
The first term on the right-hand side here complies with the result we seek to prove (Equation (8)). To treat the second term, we again invoke Equation (36), noting that the matrix (projection onto the orthogonal complement of the subspace spanned by the columns of ) produces zero when right-multiplied by . This implies:
Plugging this back into Equation (40) yields Equation (8) — sought-after result. The analogous Equation (9) can be derived in a similar fashion (by incorporating Equation (35) into Equation (34), as we have done for Equation (33)). ∎
B.6 Proof of Corollary 1
As stated in the proof of Lemma 2 (Appendix B.5), for all but a set of isolated points it holds that and , meaning Equations (8) and (9) are well-defined. We will initially assume this to be the case, and then treat isolated points by taking limits. Left-multiply Equation (8) by and Equation (9) by :
where we have used the fact that and have orthonormal columns. Right-multiplying the first equation by , left-multiplying the second by , and then subtracting, we obtain:
Appendix C Extension of [20] to asymmetric matrix factorization
Extending Theorem 1 from to asymmetric (depth-) matrix factorizations boils down to proving the following proposition:
We follow the proof of Theorem 2 (Appendix B.1) up until Equation (15). Equation (13), specialized to , yields dynamics for the product matrix :
The dynamics in Equation (43) are precisely those developed in for a symmetric matrix factorization. The proof of Theorem 1 there can now be applied as is, establishing the desired result. ∎
Appendix D Further experiments and implementation details
Figures 4, 5 and 6 present matrix sensing experiments supplementing the matrix completion experiments reported in Figures 1, 8 and 13 respectively.
D.2 Implementation details
In this appendix we provide implementation details omitted from the descriptions of our experiments (Figures 1, 8, 13, 4, 5 and 6). Our implementation is based on Python, with PyTorch () for realizing deep matrix factorizations and CVXPY () for finding minimum nuclear norm solutions. Source code for reproducing our results can be found in https://github.com/roosephu/deep_matrix_factorization.
In figures 1, 8, 4 and 5, each error bar marks standard deviation of the respective result over three trials differing in random seed for initialization of gradient descent. Reconstruction error with respect to a ground truth matrix is based on normalized Frobenius distance, i.e. for a solution it is . In experiments with matrix completion and sensing under varying number of observations (Figures 1, 8, 4 and 5), plots begin at the smallest number for which stable results were obtained, and end when all evaluated methods are close to zero reconstruction error. For the dynamics illustration experiments (Figures 13 and 6), plots showing singular values hold curves corresponding to the largest ones.