Global Optimality in Tensor Factorization, Deep Learning, and Beyond

Benjamin D. Haeffele, Rene Vidal

Introduction

Models involving factorization or decomposition are ubiquitous across a wide variety of technical fields and application areas. As a simple example relevant to machine learning, various forms of matrix factorization are used in classical dimensionality reduction techniques such as Principle Component Analysis (PCA) and in more recent methods like non-negative matrix factorization or dictionary learning (Lee and Seung, 1999; Aharon et al., 2006; Mairal et al., 2010). In a typical matrix factorization problem, we might seek to find matrices (U,V)(U,V) such that the product UVTUV^{T} closely approximates a given data matrix YY while at the same time requiring that UU and VV satisfy certain properties (e.g., non-negativity, sparseness, etc.). This naturally leads to an optimization problem of the form

Given this challenge, a common approach is to relax the non-convex factorization problem into a problem which is convex on the product of the factorized matrices, X=UVTX=UV^{T}. As a concrete example, in low-rank matrix factorization, one might be interested in solving a problem of the form

More generally, tensor factorization models provide a natural extension to matrix factorization and have been employed in a wide variety of applications (Cichocki et al., 2009; Kolda and Bader, 2009). The resulting optimization problem is similar to matrix factorization, with the difference that we now consider more general factorizations which decompose a multidimensional tensor YY into a set of KK different factors (X1,,XK)(X^{1},\ldots,X^{K}), where each factor is also possibly a multidimensional tensor. These factors are then combined via an arbitrary multilinear mapping Φ(X1,,XK)Y\Phi(X^{1},\ldots,X^{K})\approx Y; i.e., Φ\Phi is a linear function in each XiX^{i} term if the other Xj, ijX^{j},\ i\neq j terms are held constant. This model then typically gives optimization problems of the form

where each XiX^{i} factor is an appropriately sized matrix and the ψi()\psi_{i}(\cdot) functions apply some form of non-linearity after each matrix multiplication, e.g., a sigmoid function, rectification, max-pooling. Note that although here we have shown the linear operations to be simple matrix multiplications for notational simplicity, this is easily generalized to other linear operators (e.g., in a convolutional network each linear operator could be a set of convolutions with a group of various kernels with parameters contained in the XiX^{i} variables).

2 Paper Contributions

Our primary contribution is to extend ideas from convex matrix factorization and present a general framework which allows for a wide variety of factorization problems to be analyzed within a convex formulation. Specifically, using this convex framework we are able to show that local minima of the non-convex factorization problem achieve the global minimum if they satisfy a simple condition. Further, we also show that if the factorization is done with factorized variables of sufficient size, then from any initialization it is always possible to reach a global minimizer using purely local descent search strategies.

Two concepts are key to our analysis framework: 1) the size of the factorized elements is not constrained, but instead fit to the data through regularization (for example, the number of columns in UU and VV is allowed to change in matrix factorization) 2) we require that the mapping from the factorized elements to the final output, Φ\Phi, satisfies a positive homogeneity property. Interestingly, the deep learning field has increasingly moved to using non-linearities such as Rectified Linear Units (ReLU) and Max-Pooling, both of which satisfy the positive homogeneity property, and it has been noted empirically that both the speed of training the neural network and the overall performance of the network is increased significantly when ReLU non-linearities are used instead of the more traditional hyperbolic tangent or sigmoid non-linearities (Dahl et al., 2013; Maas et al., 2013; Krizhevsky et al., 2012; Zeiler et al., 2013). We suggest that our framework provides a partial theoretical explanation to this phenomena and also offers directions of future research which might be beneficial in improving the performance of multilayer neural networks.

Prior Work

Despite the significant empirical success and wide ranging applications of the models discussed above (and many others not discussed), as we have mentioned, a vast majority of the above techniques models suffer from the significant disadvantage that the associated optimization problems are non-convex and very challenging to solve. As a result, the numerical optimization algorithms often used to solve factorization problems – including (but certainly not limited to) alternating minimization, gradient descent, stochastic gradient descent, block coordinate descent, back-propagation, and quasi-newton methods – are typically only guaranteed to converge to a critical point or local minimum of the objective function (Mairal et al., 2010; Rumelhart et al., 1988; Ngiam et al., 2011; Wright and Nocedal, 1999; Xu and Yin, 2013). The nuclear norm relaxation of low-rank matrix factorization discussed above provides a means to solve factorization problems with reglarization promoting low-rank solutionsSimilar convex relaxation techniques have also been proposed for low-rank tensor factorizations, but in the case of tensors finding a final factorization Xopt=Φ(X1,,XK)X_{opt}=\Phi(X^{1},\ldots,X^{K}) from a low-rank tensor can still be a challenging problem (Tomioka et al., 2010; Gandy et al., 2011), but it fails to capture the full generality of problems such as (1) as it does not allow one to find factors, (U,V)(U,V), with the desired properties encouraged by Θ(U,V)\Theta(U,V) (sparseness, non-negativity, etc.). To address this issue, several studies have explored a more general convex relaxation via the matrix norm given by

where (Ui,Vi)(U_{i},V_{i}) denotes the ii’th columns of UU and VV, u\|\cdot\|_{u} and v\|\cdot\|_{v} are arbitrary vector norms, and the number of columns (rr) in the UU and VV matrices is allowed to be variable (Bach et al., 2008; Bach, 2013; Haeffele et al., 2014). The norm in (6) has appeared under multiple names in the literature, including the projective tensor norm, decomposition norm, and atomic norm, and by replacing the column norms in (6) with gauge functions the formulation can be generalized to incorporate additional regularization on (U,V)(U,V), such as non-negativity, while still being a convex function of XX (Bach, 2013). Further, it is worth noting that for particular choices of the u\|\cdot\|_{u} and v\|\cdot\|_{v} vector norms, Xu,v\|X\|_{u,v} reverts to several well known matrix norms and thus provides a generalization of many commonly used regularizers. Notably, when the vector norms are both l2l_{2} norms, X2,2=X\|X\|_{2,2}=\|X\|_{*}, and the form in (6) is the well known variational definition of the nuclear norm.

The u,v\|\cdot\|_{u,v} norm has the appealing property that by an appropriate choice of vector norms u\|\cdot\|_{u} and v\|\cdot\|_{v} (or more generally gauge functions), one can promote desired properties in the factorized matrices (U,V)(U,V) while still working with a problem which is convex w.r.t. the product X=UVTX=UV^{T}. Based on this concept, several studies have explored optimization problems over factorized matrices (U,V)(U,V) of the form

Even though the problem is still non-convex w.r.t. the factorized matrices (U,V)(U,V), it can be shown using ideas from Burer and Monteiro (2005) on factorized semidefinite programming that, subject to a few general conditions, then local minima of (7) will be global minima (Bach et al., 2008; Haeffele et al., 2014), which can significantly reduce the dimensionality of some large scale optimization problems. Unfortunately, aside from a few special cases, the norm defined by (6) (and related regularization functions such as those discussed by Bach (2013)) cannot be evaluated efficiently, much less optimized over, due to the complicated and non-convex nature of the definition. As a result, in practice one is often forced to replace (7) by the closely related problem

However, (7) and (8) are not equivalent problems, due to the fact that solutions to (7) include any factorization (U,V)(U,V) such that their product equals the optimal solution, UVT=XoptUV^{T}=X_{opt}, while in (7) one is specifically searching for a factorization (U,V)(U,V) that achieves the infimum in (6); in brief, solutions to (8) will be solutions to (7), but the converse is not true. As a consequence, results guaranteeing that local minima of the form (7) will be global minima cannot be applied to the formulation in (8), which is typically more useful in practice. Here we focus our analysis on the more commonly used family of problems, such as (8), and show that similar guarantees can be provided regarding the global optimality of local minima. Additionally, we show that these ideas can be significantly extended to a very wide range of non-convex models and regularization functions, with applications such as tensor factorization and certain forms of neural network training being additional special cases of our framework.

In the context of neural networks, Bengio et al. (2005) showed that for neural networks with a single hidden layer, if the number of neurons in the hidden layer is not fixed, but instead fit to the data through a sparsity inducing regularization, then the process of training a globally optimal neural network is analgous to selecting a finite number of hidden units from the infinite dimensional space of all possible hidden units and taking a weighted summation of these units to produce the output. Further, these ideas have very recently been used to analyze the generalization performance of such networks (Bach, 2014). Here, our results take a similar approach and extend these ideas to certain forms of multi-layer neural networks. Additionally, our framework provides sufficient conditions on the network architecture to guarantee that from any intialization a globally optimal solution can be found by performing purely local descent on the network weights.

Preliminaries

Before we present our main results, we first describe our notation system and recall a few definitions.

2 Definitions

We now make/recall a few general definitions and well known facts which will be used in our analysis.

The indicator function of a set CC is defined as

Note that this definition also implies that θ(0,,0)=0\theta(0,\ldots,0)=0 for p0p\neq 0.

The one-sided directional derivative of a function θ(x)\theta(x) at a point xx in the direction zz is denoted θ(x)(z)\theta(x)(z) and defined as dθ(x)(z)limϵ0  (θ(x+ϵz)θ(x))ϵ1d\theta(x)(z)\equiv\lim_{\epsilon\searrow 0}\ \ (\theta(x+\epsilon z)-\theta(x))\epsilon^{-1}.

Also, recall that for a differentiable function θ(x)\theta(x), dθ(x)(z)=<θ(x),z>d\theta(x)(z)=\left<\nabla\theta(x),z\right>.

Problem Formulation

Returning to the motivating example from the introduction (4), we now define the family of mapping functions from the factors into the output space and the family of regularization functions on the factors (Φ\Phi and Θ\Theta, respectively) which we will study in our framework.

As we do not place any restrictions on the elemental mapping, ϕ\phi, beyond the requirement that it must be positively homogeneous, there are a wide range of problems that can be captured by a mapping with form (10). Several example problems which can be placed in this framework include:

is positively homogeneous with degree 2 and Φr(U,V)=i=1rUiViT=UVT\Phi_{r}(U,V)=\sum_{i=1}^{r}U_{i}V_{i}^{T}=UV^{T} is simply matrix multiplication for matrices with rr columns.

(where \otimes denotes the tensor outer product) results in Φr(X1,,XK)\Phi_{r}(X^{1},\ldots,X^{K}) being the mapping used in the rank-rr CANDECOMP/PARAFAC (CP) tensor decomposition model (Kolda and Bader, 2009). Further, instead of choosing ϕ\phi to be a simple outer product, we can also generalize this to be any multilinear function of the factors (Xi1,,XiK)(X^{1}_{i},\ldots,X^{K}_{i})We note that more general tensor decompositions, such as the general form of the Tucker decomposition, do not explicitly fit inside the framework we describe here; however, by using similar arguments to the ones we develop here, it is possible to show analogous results to those we derive in this paper for more general tensor decompositions, which we do not show for clarity of presentation..

More general still, since any positively homogenous transformation is a potential elemental mapping, by an appropriate definition of ϕ\phi, one can describe neural networks with very general architectures, provided the non-linearities in the network are compatible with positive homogeneity. Note that max-pooling and rectification are both positively homogeneous and thus fall within our framework. For example, the well-known ImageNet network from (Krizhevsky et al., 2012), which consists of a series of convolutional layers, linear-rectification, max-pooling layers, response normalization layers, and fully connected layers, can be described by taking r=1r=1 and defining ϕ\phi to be the entire transformation of the network (with the removal of the response normalization layers, which are not positively homogenous). Note, however, that our results will rely on rr potentially changing size or being initialized to be sufficiently large, which limits the applicability of our results to current state-of-the-art network architectures (see discussion).

Here we have provided a few examples of common factorization mappings that can be cast in form (10), but certainly there are a wide variety of other problems for which our framework is relevant. Additionally, while all of the mappings described above are positively homogeneous with degree equal to the degree of the factorization (KK), this is not a requirement; p0p\neq 0 is sufficient. For example, non-linearities such as a rectification followed by raising each element to a non-zero power are positively homogeneous but of a possibly different degree. What will turn out to be essential, however, is that we require pp to match the degree of positive homogeneity used to regularize the factors, which we will discuss in the next section.

2 Factorization Regularization

Again, due to the generality of the framework, there are a wide variety of possible elemental regularization functions. We highlight two positive semidefinite, positively homogeneous functions which are commonly used and note that functions can be composed with summations, multiplications, and raising to non-zero powers to change the degree of positive homogeneity and combine various functions.

Norms: Any norm x\|x\| is positively homogeneous with degree 1. Note that because we make no requirement of convexity on gg, this framework can also include functions such as the lql_{q} pseudo-norms for q(0,1)q\in(0,1).

A few typical formulations of a gg which are positively homogeneous with degree KK might include:

where all of the norms, (i)\|\cdot\|_{(i)}, are arbitrary. Forms (15) and (16) can be shown to be equivalent, in the sense that they give rise to the same Ωϕ,g\Omega_{\phi,g} function, for all of the example mappings ϕ\phi we have discussed here and by an appropriate choice of norm can induce various properties in the factorized elements (such as sparsity), while form (17) is similar but additionally constrains each factor to be an element of a conic set CiC_{i} (see Bach et al., 2008; Bach, 2013; Haeffele et al., 2014, for examples from matrix factorization).

To define our regularization function on the output tensor, X=Φ(X1,,XK)X=\Phi(X^{1},\ldots,X^{K}), it will be necessary that the elemental regularization function, gg, and the elemental mapping, ϕ\phi, satisfy a few properties to be considered ’compatible’ for the definition of our regularization function. Specifically, we will require the following definition.

From this, we now define our main regularization function:

with the additional condition that Ωϕ,g(X)=\Omega_{\phi,g}(X)=\infty if XrIm(Φr)X\notin\bigcup_{r}\textnormal{Im}(\Phi_{r}).

We will show that Ωϕ,g\Omega_{\phi,g} is a convex function of XX and that in general the infimum in (18) can always be achieved with a finitely sized factorization (i.e., rr does not need to approach \infty)In particular, the largest rr needs to be is card(D)\textnormal{card}(D), and we note that card(D)\textnormal{card}(D) is a worst case upper bound on the size of the factorization. In certain cases the bound can be shown to be lower. As an example, Ωϕ,g(X)=X\Omega_{\phi,g}(X)=\|X\|_{*} when ϕ(u,v)=uvT\phi(u,v)=uv^{T} and g(u,v)=u2v2g(u,v)=\|u\|_{2}\|v\|_{2}. In this case the infimum can be achieved with rrank(X)min{card(u),card(v)}r\leq\textnormal{rank}(X)\leq\min\{\textnormal{card}(u),\textnormal{card}(v)\}.. While Ωϕ,g\Omega_{\phi,g} suffers from many of the practical issues associated with the matrix norm u,v\|\cdot\|_{u,v} discussed earlier (namely that in general it cannot be evaluated in polynomial time due to the complicated definition), because Ωϕ,g(X)\Omega_{\phi,g}(X) is a convex function on XX, this allows us to use Ωϕ,g\Omega_{\phi,g} purely as an analysis tool to derive results for a more tractable factorized formulation.

3 Problem Definition

To build our analysis, we will start by defining the convex (but typically non-tractable) problem, given by

H(Q)H(Q) is convex (but possibly non-differentiable)

(ϕ,g)(\phi,g) are a nondegenerate pair as defined by Definition 8; Ωϕ,g(X)\Omega_{\phi,g}(X) is as defined by (18); and λ>0\lambda>0

The minimum of F(X,Q)F(X,Q) exists     arg minX,QF(X,Q)\implies\emptyset\neq\operatorname*{arg\,min}_{X,Q}F(X,Q).

As noted above, it is typically impractical to optimize over functions involving Ωϕ,g(X)\Omega_{\phi,g}(X), and, even if one were given an optimal solution to (19), XoptX_{opt}, one would still need to solve the problem given in (18) to recover the desired (X1,,XK)(X^{1},\ldots,X^{K}) factors. Therefore, we use (19) merely as analysis tool and instead tailor our results to the non-convex optimization problem given by

We will show in the next section that any local minima of (20) is a global minima if it satisfies the condition that one slice from each of the factorized tensors is all zero. Further, we will also show that if rr is taken to be large enough then from any initialization we can always find a global minimum of (20) by doing an optimization based purely on local descent.

Main Analysis

We begin our analysis by first showing a few simple properties and lemmas relevant to our framework.

First, from the definition of Φr\Phi_{r} it is easy to verify that if ϕ\phi is positively homogeneous with degree pp, then Φr\Phi_{r} is also positively homogeneous with degree pp and satisfies the following proposition

Given a size-rxr_{x} set of KK factors, (X1,,XK)rx(X^{1},\ldots,X^{K})_{r_{x}}, and a size-ryr_{y} set of KK factors, (Y1,,YK)ry(Y^{1},\ldots,Y^{K})_{r_{y}}, then α0,β0\forall\alpha\geq 0,\beta\geq 0

where recall, [X Y][X\ Y] denotes the concatenation of XX and YY along the final dimension of the tensor.

Further, Ωϕ,g(X)\Omega_{\phi,g}(X) satisfies the following proposition:

Ωϕ,g(0)=0\Omega_{\phi,g}(0)=0 and Ωϕ,g(X)>0  X0\Omega_{\phi,g}(X)>0\ \ \forall X\neq 0.

Ωϕ,g\Omega_{\phi,g} is positively homogeneous with degree 1.

Ωϕ,g(X+Y)Ωϕ,g(X)+Ωϕ,g(Y)  (X,Y)\Omega_{\phi,g}(X+Y)\leq\Omega_{\phi,g}(X)+\Omega_{\phi,g}(Y)\ \ \forall(X,Y)

The infimum in (18) can be achieved with rcard(D)  Xr\leq\textnormal{card}(D)\ \ \forall X s.t. Ωϕ,g(X)<\Omega_{\phi,g}(X)<\infty.

Proof Proposition 11 Many of these properties can be shown in a similar fashion to results from the u,v\|\cdot\|_{u,v} norm discussed previously (Bach et al., 2008; Yu et al., 2014).

1) By definition and the fact that gg is positive semidefinite, we always have Ωϕ,g(X)0  X\Omega_{\phi,g}(X)\geq 0\ \ \forall X. Trivially, Ωϕ,g(0)=0\Omega_{\phi,g}(0)=0 since we can always take (X1,,XK)=(0,,0)(X^{1},\ldots,X^{K})=(0,\ldots,0) to achieve the infimum. For X0X\neq 0, because (ϕ,g)(\phi,g) is a non-degenerate pair then i=1rg(Xi1,,XiK)>0\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i})>0 for any (X1,,XK)r  s.t.  Φr(X1,,XK)=X(X^{1},\ldots,X^{K})_{r}\ \ \textnormal{s.t.}\ \ \Phi_{r}(X^{1},\ldots,X^{K})=X and rr finite. Property 5) shows that the infimum can be achieved with rr finite, completing the result.

2) For all α0\alpha\geq 0 and any (X1,,XK)r(X^{1},\ldots,X^{K})_{r} such that X=Φr(X1,,XK)X=\Phi_{r}(X^{1},\ldots,X^{K}), note that from positive homogeneity Φr(α1/pX1,,α1/pXK)=αX\Phi_{r}(\alpha^{1/p}X^{1},\ldots,\alpha^{1/p}X^{K})=\alpha X and i=1rg(α1/pXi1,,α1/pXiK)=αi=1rg(Xi1,,XiK)\sum_{i=1}^{r}g(\alpha^{1/p}X_{i}^{1},\ldots,\alpha^{1/p}X_{i}^{K})=\alpha\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i}). Applying this fact to the definition of Ωϕ,g\Omega_{\phi,g} gives that Ωϕ,g(αX)=αΩϕ,g,(X)\Omega_{\phi,g}(\alpha X)=\alpha\Omega_{\phi,g,}(X).

3) If either Ωϕ,g(X)=\Omega_{\phi,g}(X)=\infty or Ωϕ,g(Y)=\Omega_{\phi,g}(Y)=\infty then the inequality is trivially satisfied. Considering any (X,Y)(X,Y) pair such that Ωϕ,g\Omega_{\phi,g} is finite for both XX and YY, for any ϵ>0\epsilon>0 let (X1,,XK)rx(X^{1},\ldots,X^{K})_{r_{x}} be an ϵ\epsilon optimal factorization of XX. Specifically, Φrx(X1,,XK)=X\Phi_{r_{x}}(X^{1},\ldots,X^{K})=X and i=1rxg(Xi1,,XiK)Ωϕ,g(X)+ϵ\sum_{i=1}^{r_{x}}g(X^{1}_{i},\ldots,X^{K}_{i})\leq\Omega_{\phi,g}(X)+\epsilon. Similarly, let (Y1,,YK)ry(Y^{1},\ldots,Y^{K})_{r_{y}} be an ϵ\epsilon optimal factorization of YY. From Proposition 10 we have Φrx+ry([X1Y1],,[XKYK])=X+Y\Phi_{r_{x}+r_{y}}([X^{1}\,Y^{1}],\ldots,[X^{K}\,Y^{K}])=X+Y, so Ωϕ,g(X+Y)i=1rxg(Xi1,,XiK)+j=1ryg(Yj1,,XjK)Ωϕ,g(X)+Ωϕ,g(Y)+2ϵ\Omega_{\phi,g}(X+Y)\leq\sum_{i=1}^{r_{x}}g(X^{1}_{i},\ldots,X^{K}_{i})+\sum_{j=1}^{r_{y}}g(Y^{1}_{j},\ldots,X^{K}_{j})\leq\Omega_{\phi,g}(X)+\Omega_{\phi,g}(Y)+2\epsilon. Letting ϵ\epsilon tend to 0 completes the result.

Note that because (ϕ,g)(\phi,g) is a nondegenerate pair, for any non-zero XΓX\in\Gamma there exists α[1,)\alpha\in[1,\infty) such that αX\alpha X is on the boundary of Γ\Gamma, so Γ\Gamma and its convex hull are compact sets.

Further, note that Γ\Gamma contains the origin by definition of ϕ\phi and gg, so as a result, Ωϕ,g\Omega_{\phi,g} is equivalent to a gauge function on the convex hull of Γ\Gamma

This, combined with positive homogeneity, completes the result as we can take (Xi1,,XiK)=((μoptθi)1/pZi1,,(μoptθi)1/pZiK)(X_{i}^{1},\ldots,X_{i}^{K})=((\mu_{opt}\theta_{i})^{1/p}Z^{1}_{i},\ldots,(\mu_{opt}\theta_{i})^{1/p}Z^{K}_{i}), which gives

and shows that a factorization of size-card(D)\textnormal{card}(D) which achieves the infimum must exist.

We next derive the Fenchel dual of Ωϕ,g\Omega_{\phi,g}, which will provide a useful characterization of the subgradient of Ωϕ,g\Omega_{\phi,g}.

The Fenchel dual of Ωϕ,g(X)\Omega_{\phi,g}(X) is given by

Proof Recall, Ωϕ,g(W)supZ<W,Z>Ωϕ,g(Z)\Omega^{*}_{\phi,g}(W)\equiv\sup_{Z}\left<W,Z\right>-\Omega_{\phi,g}(Z), so for ZZ to approach the supremum we must have ZrIm(Φr)Z\in\bigcup_{r}\textnormal{Im}(\Phi_{r}). As result, the problem is equivalent to

If Ωϕ,g(W)1\Omega^{\circ}_{\phi,g}(W)\leq 1 then all the terms in the summation of (28) will be non-positive, so taking (Z1,,ZK)=(0,,0)(Z^{1},\ldots,Z^{K})=(0,\ldots,0) will achieve the supremum. Conversely, if Ωϕ,g(W)>1\Omega^{\circ}_{\phi,g}(W)>1, then (z1,,zK)\exists(z^{1},\ldots,z^{K}) such that <W,ϕ(z1,,zK)>>g(z1,,zK)\left<W,\phi(z^{1},\ldots,z^{K})\right>>g(z^{1},\ldots,z^{K}). This result, combined with the positive homogeneity of ϕ\phi and gg gives that (28) is unbounded by considering (αz1,,αzK)(\alpha z^{1},\ldots,\alpha z^{K}) as α\alpha\rightarrow\infty.

We briefly note that the optimization problem associated with (26) is typically referred to as the polar problem and is a generalization of the concept of a dual norm. In practice solving the polar can still be very challenging and is often the limiting factor in applying our results in practice (see Bach, 2013; Zhang et al., 2013, for further information).

With the above derivation of the Fenchel dual, we now recall that if Ωϕ,g(X)<\Omega_{\phi,g}(X)<\infty then the subgradient of Ωϕ,g(X)\Omega_{\phi,g}(X) can be characterized by Ωϕ,g(X)={W:<X,W>=Ωϕ,g(X)+Ωϕ,g(W)}\partial\Omega_{\phi,g}(X)=\{W:\left<X,W\right>=\Omega_{\phi,g}(X)+\Omega^{*}_{\phi,g}(W)\}. This forms the basis for the following lemma which will be used in our main results

Given a factorization X=Φr(X1,,XK)X=\Phi_{r}(X^{1},\ldots,X^{K}) and a regularization function Ωϕ,g(X)\Omega_{\phi,g}(X), then the following conditions are equivalent:

(X1,,XK)(X^{1},\ldots,X^{K}) is an optimal factorization of XX; i.e., i=1rg(Xi1,,XiK)=Ωϕ,g(X)\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i})=\Omega_{\phi,g}(X)

W\exists W such that Ωϕ,g(W)1\Omega^{\circ}_{\phi,g}(W)\leq 1 and <W,Φr(X1,,XK)>=i=1rg(Xi1,,XiK)\left<W,\Phi_{r}(X^{1},\ldots,X^{K})\right>=\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i}),

W\exists W such that Ωϕ,g(W)1\Omega^{\circ}_{\phi,g}(W)\leq 1 and i{1,,r}\forall i\in\{1,\ldots,r\}, <W,ϕ(Xi1,,XiK)>=g(Xi1,,XiK)\left<W,\phi(X^{1}_{i},\ldots,X^{K}_{i})\right>=g(X^{1}_{i},\ldots,X^{K}_{i})

Further, any WW which satisfies condition 2 or 3 satisfies both conditions 2 and 3 and WΩϕ,g(X)W\in\partial\Omega_{\phi,g}(X).

Proof 2     \iff 3) 3 trivially implies 2 from the definition of Φr\Phi_{r}. For the opposite direction, because Ωϕ,g(W)1\Omega^{\circ}_{\phi,g}(W)\leq 1 we have <W,ϕ(Xi1,,XiK)>g(Xi1,,XiK)  i\left<W,\phi(X^{1}_{i},\ldots,X^{K}_{i})\right>\leq g(X^{1}_{i},\ldots,X^{K}_{i})\ \ \forall i. Taking the sum over ii, we can only achieve equality in 2 if we have equality i\forall i in condition 3. This also shows that any WW which satisfies condition 2 or 3 must also satisfy the other condition.

We next show that if WW satisfies conditions 2/3 then WΩϕ,g(X)W\in\partial\Omega_{\phi,g}(X). First, from condition 2/3 and the definition of Ωϕ,g\Omega_{\phi,g}, we have Ωϕ,g(X)i=1rg(Xi1,,XiK)=<W,X><\Omega_{\phi,g}(X)\leq\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i})=\left<W,X\right><\infty. Thus, recall that because Ωϕ,g(X)\Omega_{\phi,g}(X) is convex and finite at XX, we have <W,X>Ωϕ,g(X)+Ωϕ,g(W)\left<W,X\right>\leq\Omega_{\phi,g}(X)+\Omega^{*}_{\phi,g}(W) with equality iff WΩϕ,g(X)W\in\partial\Omega_{\phi,g}(X). Now, by contradiction assume WW satisfies conditions 2/3 but WΩϕ,g(X)W\notin\partial\Omega_{\phi,g}(X). From condition 2/3 we have Ωϕ,g(W)=0\Omega^{*}_{\phi,g}(W)=0, so Ωϕ,g(X)=Ωϕ,g(X)+Ωϕ,g(W)><X,W>=i=1rg(Xi1,,XiK)\Omega_{\phi,g}(X)=\Omega_{\phi,g}(X)+\Omega^{*}_{\phi,g}(W)>\left<X,W\right>=\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i}) which contradicts the definition of Ωϕ,g(X)\Omega_{\phi,g}(X).

1     \implies 2) Any WΩϕ,g(X)W\in\partial\Omega_{\phi,g}(X) satisfies <X,W>=Ωϕ,g(X)+Ωϕ,g(W)=i=1rg(Xi1,,XiK)\left<X,W\right>=\Omega_{\phi,g}(X)+\Omega^{*}_{\phi,g}(W)=\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i}).

2     \implies 1) By contradiction, assume (X1,,XK)r(X^{1},\ldots,X^{K})_{r} was not an optimal factorization of XX. This gives, Ωϕ,g(X)<i=1rg(Xi1,,XiK)=<W,X>=Ωϕ,g(X)+Ωϕ,g(W)=Ωϕ,g(X)\Omega_{\phi,g}(X)<\sum_{i=1}^{r}g(X^{1}_{i},\ldots,X^{K}_{i})=\left<W,X\right>=\Omega_{\phi,g}(X)+\Omega^{*}_{\phi,g}(W)=\Omega_{\phi,g}(X), producing the contradiction.

Finally, we show one additional lemma before presenting our main results.

Proof Let (Zi1,,ZiK)=(θiXi1,,θiXiK)(Z^{1}_{i},\ldots,Z^{K}_{i})=(\theta_{i}X^{1}_{i},\ldots,\theta_{i}X^{K}_{i}) for all i{1r}i\in\{1\ldots r\} and let Λ=i=1rθiϕ(Xi1,,XiK)\Lambda=\sum_{i=1}^{r}\theta_{i}\phi(X^{1}_{i},\ldots,X^{K}_{i}). From positive homogeneity and the fact that we have a local minimum, then δ>0\exists\delta>0 such that ϵ(0,δ)\forall\epsilon\in(0,\delta) we must have

Taking the first order approximation (1+ϵθi)p=1+pϵθi+O(ϵ2)(1+\epsilon\theta_{i})^{p}=1+p\epsilon\theta_{i}+O(\epsilon^{2}) and rearranging the terms of (LABEL:eq:lem_localmin_inequal), we arrive at

Noting that for ϵ>0\epsilon>0 but sufficiently small, we also must have fr(X1,,XK,Q)fr(X1ϵZ1,,XKϵZK)f_{r}(X^{1},\ldots,X^{K},Q)\leq f_{r}(X^{1}-\epsilon Z^{1},\ldots,X^{K}-\epsilon Z^{K}), using identical steps as before and taking the first order approximation (1ϵθi)p=1pϵθi+O(ϵ2)(1-\epsilon\theta_{i})^{p}=1-p\epsilon\theta_{i}+O(\epsilon^{2}), we get

and taking the limit limϵ0[\eqrefeq:negepsilonϵ]\lim_{\epsilon\searrow 0}[\tfrac{\eqref{eq:neg_epsilon}}{\epsilon}], we arrive at

Combining (33) and (35) and rearranging terms gives the result.

2 Main Results

Based on the above preliminary results, we are now ready to state our main results and several immediate corollaries.

Given a function fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) of the form given in (20), any local minimizer of the optimization problem

such that (Xi01,,Xi0K)=(0,,0)(X^{1}_{i_{0}},\ldots,X^{K}_{i_{0}})=(0,\ldots,0) for some i0{1,,r}i_{0}\in\{1,\ldots,r\} is a global minimizer.

We begin by noting that from the definition of Ωϕ,g(X)\Omega_{\phi,g}(X), for any factorization X=Φr(X1,,XK)X=\Phi_{r}(X^{1},\ldots,X^{K})

with equality at any factorization which achieves the infimum in (18). We will show that a local minimum of fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) satisfying the conditions of the theorem also satisfies the conditions for (Φr(X1,,XK),Q)(\Phi_{r}(X^{1},\ldots,X^{K}),Q) to be a global minimum of the convex function F(X,Q)F(X,Q), which implies a global minimum of fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) due to the global bound in (37).

First, because (19) is a convex function, a simple subgradient condition gives that (X,Q)(X,Q) is a global minimum of F(X,Q)F(X,Q) iff the following two conditions are satisfied

Turning to the factorization objective, if (X1,,XK,Q)(X^{1},\ldots,X^{K},Q) is a local minimum of fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q), then (Z1,,ZK)r\forall(Z^{1},\ldots,Z^{K})_{r} there exists δ>0\delta>0 such that ϵ(0,δ)\forall\epsilon\in(0,\delta) we have fr(X1+ϵ1/pZ1,,XK+ϵ1/pZK,Q)fr(X1,,XK,Q)f_{r}(X^{1}+\epsilon^{1/p}Z^{1},\ldots,X^{K}+\epsilon^{1/p}Z^{K},Q)\geq f_{r}(X^{1},\ldots,X^{K},Q). If we now consider search directions (Z1,,ZK)r(Z^{1},\ldots,Z^{K})_{r} of the form

where i0i_{0} is the index such that (Xi01,,Xi0K)=(0,,0)(X^{1}_{i_{0}},\ldots,X^{K}_{i_{0}})=(0,\ldots,0), then for ϵ(0,δ)\epsilon\in(0,\delta), we have

The equality between (LABEL:eq:z_search_1) and (LABEL:eq:z_search_2) comes from the special form of ZZ given by (40) and the positive homogeneity of ϕ\phi and gg. Rearranging terms, we now have

Because (z1,,zK)(z^{1},\ldots,z^{K}) was arbitrary, we have established that

Further, if we choose θ\theta to be vector of all ones in Lemma 14, we get

From this result, we can then test the global optimality of any local minimum (regardless of whether it has an all-zero slice or not) from the immediate corollary:

Given a function fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) of the form given in (20), any local minimizer of the optimization problem

is a global minimizer if fr+1([X1 0],,[XK 0],Q)f_{r+1}([X^{1}\ 0],\ldots,[X^{K}\ 0],Q) is a local minimizer of fr+1f_{r+1}.

From the results of Theorem 15, we are now also able to show that if we let the size of the factorized variables (rr) become large enough, then from any initialization we can always find a global minimizer of fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) using a purely local descent strategy. Specifically, we have the following result.

Given a function fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) as defined by (20), if r>card(D)r>\textnormal{card}(D) then from any point (Z1,,ZK,Q)(Z^{1},\ldots,Z^{K},Q) such that fr(Z1,,ZK,Q)<f_{r}(Z^{1},\ldots,Z^{K},Q)<\infty there must exist a non-increasing path from (Z1,,ZK,Q)(Z^{1},\ldots,Z^{K},Q) to a global minimizer of fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q).

Without loss of generality, scale θ^\hat{\theta} so that miniθ^i=1\min_{i}\hat{\theta}_{i}=-1. Now, for all (γ,i){}×{1,,r}(\gamma,i)\in\{\}\times\{1,\ldots,r\}, let us define

Further, from the positive homogeneity of (ϕ,g)(\phi,g) we have γ\forall\gamma\in

where the equality between (50) and (51) is seen by recalling that i=1rθ^iϕ(Xi1,,XiK)=0\sum_{i=1}^{r}\hat{\theta}_{i}\phi(X^{1}_{i},\ldots,X^{K}_{i})=0 and i=1rθ^ig(Xi1,,XiK)=0\sum_{i=1}^{r}\hat{\theta}_{i}g(X^{1}_{i},\ldots,X^{K}_{i})=0.

As a result, as γ\gamma goes from 010\rightarrow 1 we can traverse a path from (X1,,XK,Q)(R1(1),,RK(1),Q)(X^{1},\ldots,X^{K},Q)\rightarrow(R^{1}(1),\ldots,R^{K}(1),Q) without changing the value of frf_{r}. Also recall that by construction (Ri01(1),,Ri0K(1))=(0,,0)(R^{1}_{i_{0}}(1),\ldots,R^{K}_{i_{0}}(1))=(0,\ldots,0), so if (R1(1),,RK(1),Q)(R^{1}(1),\ldots,R^{K}(1),Q) is a local minimizer of frf_{r} then it must be a global minimizer due to Theorem 15. If (R1(1),,RK(1),Q)(R^{1}(1),\ldots,R^{K}(1),Q) is not a local minimizer then there must exist a descent direction and we can iteratively apply this result until we reach a global minimizer, completing the proof.

We note that from this proof we have also described a meta-algorithm (outlined in Algorithm 1) which can be used with any local-descent optimization strategy to guarantee convergence to a global minimum. While in general the size of the factorization (rr) might increase as the algorithm proceeds, as a worst case, it is guaranteed that a global minimum can be found with a finite rr never growing larger than card(D)+1\textnormal{card}(D)+1. Also note that this is a worst case upper bound on rr for the most general form of our framework and that for specific choices of ϕ\phi and gg the bound on the maximum rr required can be significantly lowered.

Algorithm 1 will find a global minimum of fr(X1,,XK,Q)f_{r}(X^{1},\ldots,X^{K},Q) as defined in (20). If rr is intialized to be greater than card(D)\textnormal{card}(D), then the size of the factorized variables will not increase. Otherwise, the algorithm will terminate with rcard(D)+1r\leq\textnormal{card}(D)+1.

Discussion and Conclusions

We begin the discussion of our results with a cautionary note; namely, these results can be challenging to apply in practice. In particular, many algorithms based on alternating minimization can typically only guarantee convergence to a critical point, and with the inherent non-convexity of the problem, verifying whether a given critical point is also a local minima can be a challenging problem on its own. Nevertheless, we emphasize that our results guarantee that global minimizers can be found from purely local descent if the optimization problem falls within the general framework we have described here. As a result, even if the particular local descent strategy one chooses for a specific problem does not come with guaranteed convergence to a local minimum, the scope of the problem is still vastly reduced from a full global optimization. There is no need, in theory, to consider multiple initializations or more complicated (and much larger scale) techniques to explore the entire search space.

In addition to the above points, our analysis analysis also provides a few insights into the behavior of factorization problems and offers simple guidance on the design of such problems. The first is that balancing the degree of positive homogeneity between the regularization function and the mapping function is crucial. Here we have analyzed a mapping Φ\Phi with the particular form given in (10). We conjecture our results can likely be generalized to include additional factorization mappings (which we save for future work), but even for more general mappings and regularization functions, requiring the degrees of positive homogeneity to match between the regularization function and the mapping function will be critical to showing results similar to those we present here. In general, if the degrees of positive homogeneity do not match between the factorization mapping and the regularization function, then it either becomes impossible to make guarantees regarding the global optimality of a local minimum, or the regularization function does nothing to limit the size of the factorization, so the degrees of freedom in the model are largely determined by the user defined choice of rr. As a demonstration of these phenomena, first consider the case where we have a general mapping Φ(X1,,XK)\Phi(X^{1},\ldots,X^{K}) which is positively homogeneous with degree pp (but which is not assumed to have form (10)). Now, consider a general regularization function G(X1,,XK)G(X^{1},\ldots,X^{K}) which is positively homogeneous with degree p<pp^{\prime}<p, then the following proposition provides a simple counter-example demonstrating that in general it is not possible to guarantee that a global minimum can be found from local descent.

The above proposition shows that unless we have the special case where (X1,,XK)=(0,,0)(X^{1},\ldots,X^{K})=(0,\ldots,0) happens to be a global minimizer, then there will always exist a local minimum at the origin, and from the origin it will always be necessary to take an increasing path to escape the local minimum. The case described above, where p>pp>p^{\prime}, is arguably the more common situation for mismatched degrees of homogeneity (as opposed to p<pp<p^{\prime}), and a typical example might be an objective function such as

where Φ\Phi is a positively homogeneous mapping with degree K>2K>2 (e.g., the mapping of a deep neural network) but pp^{\prime} is typcially taken to be only 1 or 2 depending on the particular choice of norm.

Conversely, in the situation where p>pp^{\prime}>p, then it is often the case that the regularization function is not sufficient to ’limit’ the size of the factorization, in the sense that the objective function can always be decreased by allowing the size of the factors to grow. As a simple example, consider the case of matrix factorization with the objective function

If the size of the factorization doubles, then we can always take [22U 22U][22V 22V]T=UVT[\tfrac{\sqrt{2}}{2}U\ \tfrac{\sqrt{2}}{2}U][\tfrac{\sqrt{2}}{2}V\ \tfrac{\sqrt{2}}{2}V]^{T}=UV^{T}, so if (22)p([U U]p+[V V]p)<Up+Vp(\tfrac{\sqrt{2}}{2})^{p^{\prime}}(\|[U\ U]\|^{p^{\prime}}+\|[V\ V]\|^{p^{\prime}})<\|U\|^{p^{\prime}}+\|V\|^{p^{\prime}}, then the objective function can always be decreased by simply duplicating and scaling the existing factorization. It is easily verified that the above inequality is satisfied for many choices of norm (for example, all the lql_{q} norms with q1q\geq 1) when p>2p^{\prime}>2. As a result, this implies that the degrees of freedom in the model will be largely dependent on the initial choice of the number of columns in (U,V)(U,V), since in general the objective function is typically decreased by having all entries of (U,V)(U,V) be non-zero.

2 Implications for Neural Networks

Examining our results specifically as they apply to deep neural networks, we first note that from our analysis we have shown that neural networks which are based on positively homogeneous mappings can be regularized in the way we have outlined in our framework so that the optimization problem of training the network can be analyzed from a convex framework. Further, we suggest that our results provide a partial explanation to the recently observed empirical phenomenon where replacing the traditional sigmoid or hyperbolic tangent non-linearities with positively homogeneous non-linearities, such as rectification and max-pooling, significantly boosts the speed of optimization and the performance of the network (Dahl et al., 2013; Maas et al., 2013; Krizhevsky et al., 2012; Zeiler et al., 2013). Namely, by using a positively homogeneous network mapping, the problem then becomes a convex function of the network outputs. Additionally, we have also shown that if the size of the network is allowed to be large enough then for any initialization a global minimizer can be found from purely local descent, and thus local minima are all equivalent. This is a similar conclusion to the work of Choromanska et al. (2014), who analyzed the problem from a statistical standpoint and showed that with sufficiently large networks and appropriate assumptions about the distributions of the data and network weights, then with high probability any family of networks learned with different initializations will have similar objective values, but we note that our results allow for a well defined set of conditions which will be sufficient to guarantee the property. Finally, many modern large scale networks do not use traditional regularization on the network weigh parameters such as an l1l_{1} or l2l_{2} norms during training and instead rely on alternative forms of regularization such as dropout as it tends to achieve better performance in practice(Srivastava et al., 2014; Krizhevsky et al., 2012; Wan et al., 2013). Given our commentary above regarding the critical importance of balancing the degree of homogeneity between the mapping and the regularizer, an immediate prediction of our analysis is that simply ensuring that the degrees of homogeneity are balanced could be a significant factor in improving the performance of deep networks.

We conclude by noting that the main limitation of our current framework in the context of the analysis of currently existing state-of-the-art neural networks is that the form of the mapping we study here (10) implies that the network architecture must consist of rr parallel subnetworks, where each subnetwork has a particular architecture defined by the elemental mapping ϕ\phi. Previously, we mentioned as an example that the well known ImageNet network from (Krizhevsky et al., 2012) can be described by our framework by taking r=1r=1 and using an appropriate definition of ϕ\phi; however, to apply Corollary 16 to then test for global optimality, we must test whether it is possible to reduce the objective function by adding an entire network with the same architecture in parallel to the given network. Clearly, this is a significant limitation for the application of these results and suggests two possibilities for future work. The first is that simply implementing neural networks with a highly parallel network architecture and relatively simple subnetwork architectures could be advantageous and worthy of experimental study. In fact, the ImageNet network already has a certain degree of parallelization as the initial convolutional layers of the network operate largely in parallel on separate GPU units. More generally, here we have focused on mappings with form (10) as it is conducive to analysis, but we believe that many of the results we have presented here can be generalized to more general mappings (and thus more general network architectures) using many of the principles and analysis techniques we have presented here; an extension we reserve for future work.

3 Conclusions

Here we have presented a general framework which allows for a wide variety of non-convex factorization problems to be analyzed with tools from convex analysis. In particular, we have shown that for problems which can be placed in our framework, any local minimum can be guaranteed to be a global minimum of the non-convex factorization problem if one slice of the factorized tensors is all zero. Additionally, we have shown that if the non-convex factorization problem is done with factors of sufficient size, then from any feasible initialization it is always possible to find a global minimizer using a purely local descent algorithm.

References