Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method

Boaz Barak, Jonathan A. Kelner, David Steurer

Introduction

The dictionary learning (also known as “sparse coding”) problem is to recover an unknown n×mn\times m matrix AA (known as a “dictionary”) from examples of the form

This problem has found applications in multiple areas, including computational neuroscience [OF97, OF96a, OF96b], machine learning [EP07, MRBL07], and computer vision and image processing [EA06, MLB+08, YWHM08]. The appeal of this problem is that, intuitively, data should be sparse in the “right” representation (where every coordinate corresponds to a meaningful feature), and finding this representation can be a useful first step for further processing, just as representing sound or image data in the Fourier or wavelet bases is often a very useful preprocessing step in signal or image processing. See [SWW12, AAJ+13, AGM13, ABGM14] and the references therein for further discussion of the history and motivation of this problem.

This is a nonlinear problem, as both AA and xx are unknown, and dictionary learning is a computationally challenging task even in the noiseless case. When AA is known, recovering xx from yy constitutes the sparse recovery / compressed sensing problem, which has efficient algorithms [Don06, CRT06]. Hence, a common heuristic for dictionary learning is to use alternating minimization, using sparse recovery to obtain a guess for xx based on a guess of AA, and vice versa.

Recently there have been several works giving dictionary learning algorithms with rigorous guarantees on their performance [SWW12, AAJ+13, AAN13, AGM13, ABGM14]. These works differ in various aspects, but they all share a common feature: they give no guarantee of recovery unless the distribution {x}\{x\} is over extremely sparse vectors, namely having less than O(n)O(\sqrt{n}) (as opposed to merely o(n)o(n)) nonzero coordinates. (There have been other works dealing with the less sparse case, but only at the expense of making strong assumptions on xx and/or AA; see Section 1.3 for more discussion of related works.)

In this work we give a different algorithm that can be proven to approximately recover the matrix AA even when xx is much denser (up to τn\tau n coordinates for some small constant τ>0\tau>0 in some settings). The algorithm works (in the sense of approximate recovery) even with noise, in the so-called overcomplete case (where m>nm>n), and without making any incoherence assumptions on the dictionary.

Our algorithm is based on the Sum of Squares (SOS) semidefinite programming hierarchy [Sho87, Nes00, Par00, Las01]. The SOS algorithm is a very natural method for solving non-convex optimization problems that has found applications in a variety of scientific fields, including control theory [HG05], quantum information theory [DPS02], game theory [Par06], formal verification [Har07], and more. Nevertheless, to our knowledge this work provides the first rigorous bounds on the SOS algorithm’s running time for a natural unsupervised learning problem.

In this section we formally define the dictionary learning problem and state our result. We define a σ\sigma-dictionary to be an m×nm\times n matrix A=(a1am)A=(a^{1}|\cdots|a^{m}) such that ai=1\lVert a^{i}\rVert=1 for all ii, and AAσIA^{\top}A\preceq\sigma I (where II is the identity matrix). The parameter σ\sigma is an analytical proxy for the overcompleteness m/nm/n of the dictionary AA. In particular, if the columns of AA are in isotropic position (i.e., AAA^{\top}A is proportional to the identity), then the top eigenvalue of AAA^{\top}A is its trace divided by nn, which equals (1/n)iai2=m/n(1/n)\sum_{i}\lVert a^{i}\rVert^{2}=m/n because all of the aia^{i}’s have unit norm.While we do not use it in this paper, we note that in the dictionary learning problem it is always possible to learn a linear “whitening transformation” BB from the samples that would place the columns in isotropic position, at the cost of potentially changing the norms of the vectors. (There also exists a linear transformation that keeps the vectors normalized [Bar98, For01], but we do not know how to learn it from the samples.) In this work we are mostly interested in the case m=O(n)m=O(n), which corresponds to σ=O(1)\sigma=O(1).

Equation (1.2) will motivate us in defining an analytical proxy for the condition that the distribution {x}\{x\} over coefficients is τ\tau-sparse.By using an analytical proxy as opposed to requiring strict sparsity, we are only enlarging the set of distributions under consideration. However, we will make some additional conditions below, and in particular requiring low order non-square moments to vanish, that although seemingly mild compared to prior works, do restrict the family of distributions.

Specifically, in the dictionary learning case, since we are interested in learning all column vectors, we want every coordinate ii to be typical (for example, if the coefficient xix_{i} is always or always 11, we will not be able to learn the corresponding column vector). Moreover, a necessary condition for recovery is that every pair of coordinates is somewhat typical in the sense that the events that xix_{i} and xjx_{j} are nonzero are not perfectly correlated. Indeed, suppose for simplicity that when xix_{i} is nonzero, it is distributed like an independent standard Gaussian. Then if those two events were perfectly correlated, recovery would be impossible since the distribution over examples would be identical if we replaced {ai,aj}\{a_{i},a_{j}\} with any pair of vectors {Πai,Πaj}\{\Pi a^{i},\Pi a^{j}\} where Π\Pi is a rotation in the plane spanned by {ai,aj}\{a^{i},a^{j}\}.

for all iji\neq j and for some τ1\tau\ll 1.

for every non-square monomial xαx^{\alpha} of degree at most dd. (Here, α{0,1,}m\alpha\in\{0,1,\ldots\}^{m} is a multiindex and xαx^{\alpha} denotes the monomial ixiαi\prod_{i}x_{i}^{\alpha_{i}}. The degree of xαx^{\alpha} is α:=iαi|\alpha|\mathrel{\mathop{:}}=\sum_{i}\alpha_{i}; we say that xαx^{\alpha} is non-square if xαx^{\alpha} is not the square of another monomial, i.e.,, if α\alpha has an odd coordinate.)

Another way to justify this notion of nice distributions is that, as our analysis shows, it is a natural way to ensure that if aa is a column of the dictionary then the random variable a,y\langle a,y\rangle for a random sample yy from (1.1) will be “spiky” in the sense that it will have a large dd-norm compared to its 22-norm. Thus it is a fairly clean way to enable recovery, especially in the setting (such as ours) where we don’t assume orthogonality or even incoherence between the dictionary vectors.

Modeling noise

Given a noisy dictionary learning example of the form y=Ax+ey=Ax+e, one can also view it (assuming we are in the non-degenerate case of AA having full rank) as y=A(x+e)y=A(x+e^{\prime}) for some ee^{\prime} (whose magnitude is controlled by the norm of ee and the condition number of AA). If ee^{\prime} has sufficiently small magnitude, and is composed of i.i.d random variables (and even under more general conditions), the distribution {x+e}\{x+e^{\prime}\} will be nice as well. Therefore, we will not explicitly model the noise in the following, but rather treat it as part of the distribution {x}\{x\} which our definition allows to be only “approximately sparse”.

2 Our results

Given samples of the form {y=Ax}\{y=Ax\} for a (d,τ)(d,\tau)-nice {x}\{x\}, with dd a sufficiently large constant (corresponding to having τn\tau n nonzero entries), we can approximately recover the dictionary AA in polynomial time as long as τnδ\tau\leqslant n^{-\delta} for some δ>0\delta>0, and in quasipolynomial time as long as τ\tau is a sufficiently small constant. Prior polynomial-time algorithms required the distribution to range over vectors with less than n\sqrt{n} nonzero entries (and it was not known how to improve upon this even using quasipolynomial time).

For every ε>0,σ1,δ>0\varepsilon>0,\sigma\geqslant 1,\delta>0 there exists dd and a polynomial-time algorithm R\mathcal{R} such that for every σ\sigma-dictionary A=(a1am)A=(a^{1}|\cdots|a^{m}) and (d,τ=nδ)(d,\tau=n^{-\delta})-nice {x}\{x\}, given nO(1)n^{O(1)} samples from from {y=Ax}\{y=Ax\}, R\mathcal{R} outputs with probability at least 0.90.9 a set that is ε\varepsilon-close to {a1,,am}\{a^{1},\ldots,a^{m}\}.

The hidden constants in the O()O(\cdot) notation may depend on ε,σ,δ\varepsilon,\sigma,\delta. The algorithm can recover the dictionary vectors even in the relatively dense case when τ\tau is (a sufficiently small) constant, at the expense of a quasipolynomial (i.e., nO(logn)n^{O(\log n)}) running time. See Theorems 4.2 and 7.6 for a precise statement of the dependencies between the constants.

Our algorithm aims to recover the vectors up to ε\varepsilon-accuracy, with a running time as in a PTAS that depends (polynomially) on ε\varepsilon in the exponent. Prior algorithms achieving exact recovery needed to assume much stronger conditions, such as incoherence of dictionary columns. Because we have not made incoherence assumptions and have only assumed the signals obey an analytic notion of sparsity, exact recovery is not possible, and there are limitations on how precisely one can recover the dictionary vectors (even information theoretically).

We believe that it is important to understand the extent to which dictionary recovery can be performed with only weak assumptions on the model, particularly given that real-world signals are often only approximately sparse and have somewhat complicated distributions of errors. When stronger conditions are present that make better error guarantees possible, our algorithm can provide an initial solution for local search methods (or other recovery algorithms) to boost the approximate solution to a more precise one. We believe that understanding the precise tradeoffs between the model assumptions, achievable precision, and running time is an interesting question for further research.

For every ε>0,σ1\varepsilon>0,\sigma\geqslant 1, there exists d,τd,\tau and a probabilistic nO(logn)n^{O(\log n)}-time algorithm R\mathcal{R} such that for every σ\sigma-dictionary A=(a1am)A=(a^{1}|\cdots|a^{m}), given a polynomial PP such that

R\mathcal{R} outputs with probability at least 0.90.9 a set SS that is ε\varepsilon-close to {a1,,am}\{a^{1},\ldots,a^{m}\}.

(We denote PQP\preceq Q if QPQ-P is a sum of squares of polynomials. Also, as in Theorem 1.2, there are certain conditions under which R\mathcal{R} runs in polynomial time; see Section 7.)

The condition (1.5) implies that the input PP to R\mathcal{R} is τ\tau-close to the tensor Audd\lVert A^{\top}u\rVert_{d}^{d}, in the sense that P(u)Auddτ|P(u)-\lVert A^{\top}u\rVert_{d}^{d}|\leqslant\tau for every unit vector uu. This allows for very significant noise, since for a typical vector uu, we expect Audd\lVert A^{\top}u\rVert_{d}^{d} to be have magnitude roughly mnd/2mn^{-d/2} which would be much smaller than τ\tau for every constant τ>0\tau>0. Thus, on most of its inputs, PP can behave radically differently than Audd\lVert A^{\top}u\rVert_{d}^{d}, and in particular have many local minima that do not correspond to local minima of the latter polynomial. For this reason, it seems unlikely that one can establish a result such as Theorem 1.4 using a local search algorithm.The conditions (1.5) and maxu2=1P(u)Auddτ\max_{\lVert u\rVert_{2}=1}|P(u)-\lVert A^{\top}u\rVert_{d}^{d}|\leqslant\tau are not identical for d>2d>2. Nevertheless, the discussion above applies to both conditions, since (1.5) does allow for PP to have very different behavior than Audd\lVert A^{\top}u\rVert_{d}^{d}.

We give an overview of our algorithm and its analysis in Section 2. Sections 4, 6 and 5 contain the complete formal proofs. In its current form, our algorithm is efficient only in the theoretical/asymptotic sense, but it is very simple to describe (modulo its calls to the SOS solver), see Figure 1. We believe that the Sum of Squares algorithm can be a very useful tool for attacking machine learning problems, yielding a first solution to the problem that can later be tailored and optimized.

3 Related work

Starting with the work of Olshausen and Field [OF96a, OF96b, OF97], there is a vast body of literature using various heuristics (most commonly alternating minimization) to learn dictionaries for sparse coding, and applying this tool to many applications. Here we focus on papers that gave algorithms with proven performance.

Independent Component Analysis (ICA) [Com94] is one method that can be used for the dictionary learning in the case the random variables x1,,xnx_{1},\ldots,x_{n} are statistically independent. For the case of m=nm=n this was shown in [Com94, FJK96, NR09], while the works [LCC07, GVX14] extend it for the overcomplete (i.e. m>nm>n) case.

Another recent line of works analyzed different algorithms, which in some cases are more efficient or handle more general distributions than ICA. Spielman, Wang and Wright [SWW12] give an algorithm to exactly recover the dictionary in the m=nm=n case. Agarwal, Anandkumar, Jain, Netrapalli, and Tandon [AAJ+13] and Arora, Ge and Moitra [AGM13] obtain approximate recovery in the overcomplete (i.e. m>nm>n) case, which can be boosted to exact recovery under some additional conditions on the sparsity and dictionary [AAN13, AGM13]. However, all these works require the distribution xx to be over very sparse vectors, specifically having less than n\sqrt{n} nonzero entries. As discussed in [SWW12, AGM13], n\sqrt{n} sparsity seemed like a natural barrier for this problem, and in fact, Spielman et al [SWW12] proved that every algorithm of similar nature to theirs will fail to recover the dictionary when when the coefficient vector can have Ω(nlogn)\Omega(\sqrt{n\log n}) coordinates. The only work we know of that can handle vectors of support larger than n\sqrt{n} is the recent paper [ABGM14], but it achieves this at the expense of making fairly strong assumptions on the structure of the dictionary, in particular assuming some sparsity conditions on AA itself. In addition to the sparsity restriction, all these works had additional conditions on the distribution that are incomparable or stronger than ours, and the works [AAJ+13, AGM13, AAN13, ABGM14] make additional assumptions on the dictionary (namely incoherence) as well.

The tensor decomposition problem is also very widely studied with a long history (see e.g., [Tuc66, Har70, Kru77]). Some recent works providing algorithms and analysis include [AFH+12, AGM12, BCMV14, BCV14]. However, these works are in a rather different parameter regime than ours— assuming the tensor is given with very little noise (inverse polynomial in the spectral norm), but on the other hand requiring very low order moments (typically three or four, as opposed to the large constant or even logarithmic number we use).

Organization of this paper

In Section 2 we give a high level overview of our ideas. Sections 4–6 contain the full proof for solving the dictionary learning and tensor decomposition problems in quasipolynomial time, where the sparsity parameter τ\tau is a small constant. In Section 7 we show how this can be improved to polynomial time when τnδ\tau\leqslant n^{-\delta} for some constant δ>0\delta>0.

Overview of algorithm and its analysis

The dictionary learning problem can be easily reduced to the noisy tensor decomposition problem. Indeed, it is not too hard to show that for an appropriately chosen parameter dd, given a sufficiently large number of examples y1,,yNy_{1},\ldots,y_{N} from the distribution {y=Ax}\{y=Ax\}, the polynomial

will be roughly τ\tau close (in the spectral norm) to the polynomial Audd\lVert A^{\top}u\rVert_{d}^{d}, where τ\tau is the “niceness”/“sparsity” parameter of the distribution {x}\{x\}. Therefore, if we give PP as input to the tensor decomposition algorithm of Theorem 1.4, we will obtain a set that is close to the columns of AA.The polynomial (2.1) and similar variants have been used before in works on dictionary learning. The crucial difference is that those works made strong assumptions, such as independence of the entries of {x}\{x\}, that ensured this polynomial has a special structure that made it possible to efficiently optimize over it. In contrast, our work applies in a much more general setting.

The challenge is that because τ\tau is a positive constant, no matter how many samples we take, the polynomial PP will always be bounded away from the tensor Audd\lVert A^{\top}u\rVert_{d}^{d}. Hence we must use a tensor decomposition algorithm that can handle a very significant amount of noise. This is where the Sum-of-Squares algorithm comes in. This is a general tool for solving systems of polynomial equations [Sho87, Nes00, Par00, Las01]. Given the SOS algorithm, the description of our tensor decomposition algorithm is extremely simple (see Figure 1 below). We now describe the basic facts we use about the SOS algorithm, and sketch the analysis of our noisy tensor decomposition algorithm. See the survey [BS14] and the references therein for more detail on the SOS algorithm, and Sections 4, 5 and 6 for the full description of our algorithm and its analysis (including its variants that take polynomial time at the expense of requiring dictionary learning examples with sparser coefficients).

The SOS algorithm is a method, based on semidefinite programming, for solving a system of polynomial equations. Alas, since this is a non-convex and NP-hard problem, the algorithm doesn’t always succeed in producing a solution. However, it always returns some object, which in some sense can be interpreted as a “distribution” {u}\{u\} over solutions of the system of equations. It is not an actual distribution, and in particular we cannot sample from {u}\{u\} and get an individual solution, but we can compute low order moments of {u}\{u\}. Specifically, we make the following definition:

Numerical accuracy will never play an important role in our results, and so we can just assume that we can always find in nO(k)n^{O(k)} time a degree-kk pseudo-distribution satisfying given polynomial constraints, if such a pseudo-distribution exists.

2 Noisy tensor decomposition

Our basic noisy tensor decomposition algorithm is described in Figure 1. This algorithm finds (a vector close to) a column of AA with inverse polynomial probability. Using similar ideas, one can extend it to an algorithm that outputs all vectors with high probability; we provide the details in Section 6. Following the approach of [BKS14], our analysis of this algorithm proceeds in two phases:

We show that if the pseudo-distribution {u}\{u\} obtained in Step 1 is an actual distribution, then the vector output in Step 3 is close to one of the columns of AA.

We then show that the arguments used in establishing (i) generalize to the case of pseudo-distributions as well.

The first part is actually not so surprising. For starters, every unit vector uu that maximizes PP must be highly correlated with some column aa of AA. Indeed, Aadd1\lVert A^{\top}a\rVert_{d}^{d}\geqslant 1 for every column aa of AA, and hence the maximum of P(u)P(u) over a unit uu is at least 1τ1-\tau. But if u,a21ε\langle u,a\rangle^{2}\leqslant 1-\varepsilon for every column aa then P(u)P(u) must be much smaller than 11. Indeed, in this case

Since ai,u2σ\sum\langle a^{i},u\rangle^{2}\leqslant\sqrt{\sigma}, this implies that, as long as dlogσεd\gg\tfrac{\log\sigma}{\varepsilon}, Audd\lVert A^{\top}u\rVert_{d}^{d} (and thus also P(u)P(u)) is much smaller than 11.

Therefore, if {u}\{u\} obtained in Step 1 is an actual distribution, then it would be essentially supported on the set A={±a1,,±am}\mathcal{A}=\{\pm a^{1},\ldots,\pm a^{m}\} of the columns of AA and their negations. Let us suppose that {u}\{u\} is simply the uniform distribution over A\mathcal{A}. (It can be shown that this essentially is the hardest case to tackle.) In this case the matrix MM considered in Step 3 can be written as

where W()W(\cdot) is the polynomial selected in Step 2. (This uses the fact that this polynomial is a product of linear functions and hence satisfies W(a)2=W(a)W(-a)^{2}=W(a) for all aa.) If W()W(\cdot) satisfies

Part (ii). The above argument establishes (i), but this is all based on a rather bold piece of wishful thinking— that the object {u}\{u\} we obtained in Step 1 of the algorithm was actually a genuine distribution over unit vectors maximizing PP. In actuality, we can only obtain the much weaker guarantee that {u}\{u\} is a degree kk pseudo-distribution for some k=O(logn)k=O(\log n). (An actual distribution corresponds to a degree-\infty pseudo-distribution.) The technical novelty of our work lies in establishing (ii). The key observation is that in all our arguments above, we never used any higher moments of {u}\{u\}, and that all the inequalities we showed boil down to the simple fact that a square of a polynomial is never negative. (Such proofs are known as Sum of Squares (SOS) proofs.)

applying it to the vector v=Auv=A^{\top}u (where we denote v=maxivi\lVert v\rVert_{\infty}=\max_{i}|v_{i}|). The first (and most major) obstacle in giving a low degree “Sum of Squares” proof for (2.4) is that this is not a polynomial inequality. To turn it into one, we replace the LL_{\infty} norm with the LkL_{k} norm for some large kk (k=O(logm)k=O(\log m) will do). If we replace v\lVert v\rVert_{\infty} with vk\lVert v\rVert_{k} in (2.4), and raise it to the k/(d2)k/(d-2)-th power then we obtain the inequality

which is a valid inequality between polynomials in vv whenever kk is an integer multiple of d2d-2 (which we can ensure).

We now need to find a sum-of-squares proof for this inequality, namely that the right-hand side of (2.5) is equal to the left-hand side plus a sum of squares, that is, we are to show that for s=k/(d2)s=k/(d-2),

By expanding the ss-th powers in this expression, we rewrite this polynomial inequality as

where the summations involving α\alpha are over degree-ss multiindices α{0,,s}n\alpha\in\{0,\dots,s\}^{n}, and (sα)\binom{s}{\alpha} denotes the multinomial coefficient (nα)=s!α1!αm!\binom{n}{\alpha}=\frac{s!}{\alpha_{1}!\dots\alpha_{m}!}. We will prove (2.6) term by term, i.e., we will show that vdαv2αivi(d2)sv^{d\alpha}\preceq v^{2\alpha}\sum_{i}v_{i}^{(d-2)s} for every multiindex α\alpha. Since v2α0v^{2\alpha}\succeq 0, it is enough to show that v(d2)αivi(d2)sv^{(d-2)\alpha}\preceq\sum_{i}v_{i}^{(d-2)s}. This is implied by the following general inequality, which we prove in Appendix A:

Let w1,,wnw_{1},\ldots,w_{n} be polynomials. Suppose w10,,wn0w_{1}\succeq 0,\ldots,w_{n}\succeq 0. Then, for every multiindex α\alpha, wαiwiα.w^{\alpha}\preceq\sum_{i}w_{i}^{\lvert\alpha\rvert}\,.

We note that dd is even, so wi=vid20w_{i}=v_{i}^{d-2}\succeq 0 is a square, as required by the lemma.

For the case that α\lvert\alpha\rvert is a power of 22, the inequality in the lemma follows by repeatedly applying the inequality xy12x2+12y2x\cdot y\preceq\frac{1}{2}x^{2}+\frac{1}{2}y^{2}, which in turn holds because the difference between the two sides equals 12(xy)2\tfrac{1}{2}(x-y)^{2}. As a concrete example, we can derive w13w2w14+w24w_{1}^{3}w_{2}\preceq w_{1}^{4}+w_{2}^{4} in this way,

(The first two steps use the inequality xy12x2+12y2x\cdot y\preceq\tfrac{1}{2}x^{2}+\tfrac{1}{2}y^{2}. The last step uses that both w1w_{1} and w2w_{2} are sum of squares.)

Once we have an SOS proof for (2.5) we can conclude that it holds for pseudo-distributions as well, and in particular that for every pseudo-distribution {u}\{u\} of degree at least k+2k/(d2)k+2k/(d-2) satisfying {u22=1}\{\lVert u\rVert_{2}^{2}=1\},

We use similar ideas to port the rest of the proof to the SOS setting, concluding that whenever {u}\{u\} is a pseudo-distribution that satisfies {u22=1}\{\lVert u\rVert_{2}^{2}=1\} and {P(u)1τ}\{P(u)\geqslant 1-\tau\}, then with inverse polynomial probability it will hold that

Preliminaries

We use the notation of pseudo-expectations and pseudo-distributions from Section 2.1. We now state some basic useful facts about pseudo-distributions, see [BS14, BKS14, BBH+12] for a more comprehensive treatment.

One useful property of pseudo-distributions is that we can find actual distribution that match their first two moments.

Another property we will use is that we can reweigh a pseudo-distribution by a positive polynomial WW to obtain a new pseudo-distribution that corresponds to the operation on actual distributions of reweighing the probability of an element uu proportional to W(u)W(u).

but since WW is a sum of squares, WP2WP^{2} is also a sum of squares and hence the denominator of the left-hand side is non-negative, while the numerator is by assumption positive. ∎

Dictionary Learning

We now state our formal theorem for dictionary learning. The following definition of nice distributions captures formally the conditions needed for recovery. (It is equivalent up to constants to the definition of Section 1.1, see Remark 4.4 below.)

We can now state our result for dictionary learning in quasipolynomial time. The result for polynomial time is stated in Section 7.

There exists an algorithm that for every desired accuracy ε>0\varepsilon>0 and overcompleteness σ1\sigma\geqslant 1 solves the following problem for every (d,τ)(d,\tau)-nice distribution with dd(ε,σ)=O(ε1logσ)d\geqslant d(\varepsilon,\sigma)=O(\varepsilon^{-1}\log\sigma) and ττ(ε,σ)=(ε1logσ)O(ε1logσ)\tau\leqslant\tau(\varepsilon,\sigma)=(\varepsilon^{-1}\log\sigma)^{O(\varepsilon^{-1}\log\sigma)} in time n(1/ε)O(1)(d+logm)n^{(1/\varepsilon)^{O(1)}(d+\log m)}: Given nO(d)/poly(τ)n^{O(d)}/\operatorname{poly}(\tau) samples from a distribution {y=Ax}\{y=Ax\} for a σ\sigma-overcomplete dictionary AA and (d,τ)(d,\tau)-nice distribution {x}\{x\}, output a set of vectors that is ε\varepsilon-close to the set of columns of AA (in symmetrized Hausdorff distance).

output a set of vectors that is ε\varepsilon-close to the set of columns of AA (in symmetrized Hausdorff distance).

We will prove Theorem 4.3 (noisy tensor decomposition) in Section 5 and Section 6. At this point, let us see how it yields Theorem 4.2 (dictionary learning, quasipolynomial time). The following lemma gives the connection between tensor decomposition and dictionary learning.

Therefore, we can apply the algorithm in Theorem 4.3 (noisy tensor decomposition) for noise parameter τ=2τkddd\tau^{\prime}=2\tau k^{d}d^{d} to obtain a set SS of unit vectors that is ε\varepsilon-close to the set of columns of AA (in symmetrized Hausdorff distance). ∎

Sampling pseudo-distributions

In this section we will develop an efficient algorithm that behaves in certain ways like a hypothetical sampling procedure for low-degree pseudo-distributions. (Sampling procedures, even inefficient or approximate ones, cannot exist in general for low-degree pseudo-distributions [Gri01, Sch08].) This algorithm will be a key ingredient of our algorithm for Theorem 4.3 (noisy tensor decomposition, quasipolynomial time).

The algorithm in the following theorem achieves the above property of sampling procedures with the key advantage that it applies to any low-degree pseudo-distributions.

The result follows from the following lemmas.

Furthermore, there exists a randomized algorithm that runs in time nO(k)n^{O(k)} and computes such a polynomial WW with probability at least 2O(k/poly(ε))2^{-O(k/\operatorname{poly}(\varepsilon))}.

Let w(1),,w(k/2)w^{\scriptscriptstyle(1)},\ldots,w^{\scriptscriptstyle(k/2)} be independent samples from the distribution {wc,ξτM+1}\{w\mid\langle c,\xi\rangle\geqslant\tau_{M+1}\}. Then, let W=w(1)w(k/2)/Mk/2W=w^{\scriptscriptstyle(1)}\cdots w^{\scriptscriptstyle(k/2)}/M^{k/2}. The expectation of this random polynomial satisfies

The following bound shows that there exists a polynomial WW that satisfies the conclusion of the lemma,

Since {W}\{W\} has density 2O(M2)2^{-O(M^{2})} in {W0}\{W_{0}\}, it also follows that

Noisy tensor decomposition

In this section we will prove Theorem 4.3 (noisy tensor decomposition, quasi-polynomial time).

output a set of vectors that is ε\varepsilon-close to the set of columns of AA (in symmetrized Hausdorff distance).

First, we claim that the pseudo-distribution {u}\{u\} also satisfies the constraint {Aukkeδk}\{\lVert A^{\top}u\rVert_{k}^{k}\geqslant e^{-\delta^{\prime}k}\} where δ=dd2δ+logσd2\delta^{\prime}=\tfrac{d}{d-2}\delta+\frac{\log\sigma}{d-2}. The proof of this claim follows by a sum-of-squares version of the following form of Hölder’s inequality,

See the overview section for a proof of this fact. By substituting v=Auv=A^{\top}u and using the facts that Au22σu2\lVert A^{\top}u\rVert_{2}^{2}\preceq\sigma\lVert u\rVert^{2} and that {u}\{u\} satisfies the constraint {u2=1}\{\lVert u\rVert^{2}=1\}, we get that {u}\{u\} satisfies {Aukk(Audd)k/(d2)/σk/(d2)}\{\lVert A^{\top}u\rVert_{k}^{k}\geqslant(\lVert A^{\top}u\rVert_{d}^{d})^{k/(d-2)}/\sigma^{k/(d-2)}\}, which implies the claim because {Auddeδd}\{\lVert A^{\top}u\rVert_{d}^{d}\geqslant e^{-\delta d}\}.

While there exists a degree-kk pseudo-distribution {u}\{u\} that satisfies the constraints {P(u)1τ,u22=1}\{P(u)\geqslant 1-\tau,\lVert u\rVert_{2}^{2}=1\} and {s,u21γ}\{\langle s,u\rangle^{2}\leqslant 1-\gamma\} for every sSs\in S:

Add the vector cc^{\prime} to the set SS.

Next we claim that every vector in sSs\in S is close to one of the columns of AA. Indeed, every such vector satisfies Asddeεd2τ\lVert A^{\top}s\rVert_{d}^{d}\geqslant e^{-\varepsilon d}-2\tau, which by Lemma 6.1 implies that s,c21O(ε+τ/d+(logσ)/d)=1O(ε)\langle s,c\rangle^{2}\geqslant 1-O(\varepsilon+\tau/d+(\log\sigma)/d)=1-O(\varepsilon) for a column cc of AA.

Next we claim that if the algorithm terminates then for every column cc of AA there exists a vector sSs\in S with c,s21γ\langle c,s\rangle^{2}\geqslant 1-\gamma. Indeed, if there exists a column that violates this condition, then it would satisfy all constraints for the pseudo-distribution, which means that the algorithm does not terminate at this point.

To finish the proof of the theorem it remains to bound the number of iterations of the algorithm. We claim that the number of iterations is bounded by the number mm of columns of AA because in each iteration the vectors in SS will cover at least one more of the columns of AA. As observed before, every vector sSs\in S is close to a column csc_{s} of AA in the sense that s2cs22=O(ε)\lVert s^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}=O(\varepsilon). However, since cc^{\prime} satisfies c,s21γ/10\langle c^{\prime},s\rangle^{2}\leqslant 1-\gamma/10, we have by triangle inequality γ/5(c)2s222(c)2cs22+2s2cs22\gamma/5\leqslant\lVert(c^{\prime})^{\otimes 2}-s^{\otimes 2}\rVert^{2}\leqslant 2\lVert(c^{\prime})^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}+2\lVert s^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}, which means that (c)2cs22γ/10O(ε)\lVert(c^{\prime})^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}\geqslant\gamma/10-O(\varepsilon). Therefore, the vector cc^{\prime} is not close to any of the vectors csc_{s} for sSs\in S, which means that it has to be close to another column of AA. (Here, we are again assuming that γ\gamma was chosen so that γ/ε\gamma/\varepsilon is a large enough constant.) ∎

By adapting the algorithm somewhat we can also achieve recovery guarantees for columns with significantly smaller norm than the maximum norm. Concretely, we can modify the algorithm so that we ask for pseudo-distributions satisfying P(u)ρP(u)\geqslant\rho, where ρ\rho is a parameter that we gradually decrease so we can get all the vectors. However, we need to also change the right-hand side of the constraint u,s21γ\langle u,s\rangle^{2}\leqslant 1-\gamma to a value that decreases with ρ\rho. Otherwise, the algorithm might not terminate, as there can be exponentially vectors that are somewhat far from a column vector cc, and all of them will have fairly large value for P()P(\cdot). Such a modified algorithm can still obtain all the column vectors (up to a small error) if we assume that the they are sufficiently incoherent. That is, a,aμ\langle a,a^{\prime}\rangle\leqslant\mu for every distinct columns aa, aa^{\prime} of AA with μ\mu depending on the norm ratios. Similar (and in fact often stronger) assumptions were made in prior works on dictionary learning. (However, we need these assumptions only when the vectors have different norms.)

Polynomial-time algorithms

In this section we show how we can improve our tensor decomposition algorithm when we have access to examples of very sparse linear combinations of the dictionary columns, culminating in Theorem 7.6 that gives a polynomial-time algorithm for the dictionary problem for the case the distribution is (d,τ)(d,\tau)-nice for τ=nΩ(1)\tau=n^{-\Omega(1)}.

The following theorem refines Theorem 5.1 (sampling pseudo-distributions) reconstructing a vector cc^{\prime} that is close to a target vector cc. We make an additional assumption about having access to samples from a distribution {W}\{W\} over sum-of-squares polynomials. This distribution comes with a noise parameter τ\tau that controls how well the distribution correlated with the target vector cc. If this noise parameter is sufficiently small, samples from distribution allow the algorithm to work under a more refined but milder condition on the pseudo-distribution {u}\{u\}. For our dictionary learning algorithm, we can satisfy this condition when the noise parameter τ\tau of the distribution {W}\{W\} satisfies τm1/k\tau\ll m^{1/k}. (The noise parameter τ\tau roughly coincides with the niceness parameter of the distribution {x}\{x\}.)

The pseudo-distribution {u}\{u\} has degree 2(1+2k)2(1+2k) and satisfies the polynomial constraint u22=1\lVert u\rVert_{2}^{2}=1 and the conditions

The following lemma is the main new ingredient of the proof of this theorem.

Note that the conclusion of the lemma implies that we can recover a vector cc^{\prime} with c,c21O(ε)\langle c^{\prime},c\rangle^{2}\geqslant 1-O(\varepsilon^{\prime}) using Theorem 5.1 in time nk/poly(ε)n^{k/poly(\varepsilon^{\prime})} with probability 2O(k)/poly(ε)2^{O(k)/\operatorname{poly}(\varepsilon^{\prime})}. Therefore, Theorem 7.1 follows by combining Lemma 7.2 with Theorem 5.1.

Here, the last step uses the assumption (1+k)τ1/2(1+k)\tau\leqslant 1/2 to bound the series i=1k(1+k)iτi2(t+k)τ\sum_{i=1}^{k}(1+k)^{i}\tau^{i}\leqslant 2(t+k)\tau. It follows that

2 Tensor decomposition

The following lemma shows that a pseudo-distribution {u}\{u\} that satisfies the constraints {Au2(t+k)2(t+k)1,u22=1}\{\lVert A^{\top}u\rVert_{2(t+k)}^{2(t+k)}\approx 1,\lVert u\rVert_{2}^{2}=1\} also satisfies the condition of Theorem 7.1 for one of the columns of the dictionary AA.

It follows that the pseudo-distribution {u}\{u\} cannot satisfy the constraint Au2(1+k)2(1+k)e2(k1)εσ\lVert A^{\top}u\rVert_{2(1+k)}^{2(1+k)}\geqslant e^{2(k-1)\varepsilon}\sigma.

3 Dictionary learning

The following lemma shows that up to polynomial reweighing the distribution {y=Ax}\{y=Ax\} gives us access to a distribution {W}\{W\} that satisfies the condition of Theorem 7.1.

The expectation of ww after reweighing by xi2x_{i}^{2} satisfies

The last step uses that all non-square moments of {x}\{x\} vanish. The desired bounds follow because the coefficient of a(i),u2\langle a^{\scriptscriptstyle(i)},u\rangle^{2} is 11 and for all indices jij\neq i, the coefficients of a(j),u2\langle a^{\scriptscriptstyle(j)},u\rangle^{2} are all between and τ\tau. For the final bounds, we also use Au22σu2\lVert A^{\top}u\rVert_{2}^{2}\preceq\sigma\lVert u\rVert^{2}. ∎

We will use Lemma 7.4 to reason about the distribution {W}\{W\}. Without loss of generality, we assume that cc is the first column of the dictionary AA. Let xˉ=(x1,,xk)\bar{x}=(x_{1},\ldots,x_{k^{\prime}}) be kk^{\prime} independent samples from {x}\{x\}. (The distribution {W}\{W\} is the same as {Ax1,u2Axk,u2}\{\langle Ax_{1},u\rangle^{2}\cdots\langle Ax_{k^{\prime}},u\rangle^{2}\}.) We claim that the distribution {W}\{W\} satisfies (7.1) after reweighing by the function r(xˉ)2=x1,12xk,12r(\bar{x})^{2}=x_{1,1}^{2}\cdots x_{k^{\prime},1}^{2} (the product of the square of the first coordinates of x1,,xkx_{1},\ldots,x_{k^{\prime}}). The distribution after reweighing is, up to scaling of the polynomials, equal to the distribution D={W=w1wk}\mathcal{D}=\{W=w_{1}\cdots w_{k^{\prime}}\}, where w1,,wkw_{1},\ldots,w_{k^{\prime}} are independent samples from the distribution D1\mathcal{D}_{1} in Lemma 7.4. By Lemma 7.4, this reweighted distribution satisfies the condition (7.1), that is,

Since we assume {x}\{x\} to be (4,τ)(4,\tau)-nice, the variance of D\mathcal{D} is bounded by nO(k)n^{O(k)}.

The following theorem gives a polynomial time algorithm for dicionary learning under (d,τ)(d,\tau)-nice distributions for all τ=nΩ(1)\tau=n^{\Omega(1)}.

There exists an algorithm that for every desired accuracy ε>0\varepsilon>0 and overcompleteness σ1\sigma\geqslant 1 solves the following problem for every (d,τ)(d,\tau)-nice distribution with dd(ε,σ)=O(d1logσ)d\geqslant d(\varepsilon,\sigma)=O(d^{-1}\log\sigma) and ττ(ε,σ)=(ε1logσ)O(ε1logσ)\tau\leqslant\tau(\varepsilon,\sigma)=(\varepsilon^{-1}\log\sigma)^{O(\varepsilon^{-1}\log\sigma)} in time n(1/ε)O(1)kn^{(1/\varepsilon)^{O(1)}k} for k=d+O(logmlog(1/τ))k=d+O(\tfrac{\log m}{\log(1/\tau)}): Given nO(d)/poly(τ)n^{O(d)}/\operatorname{poly}(\tau) samples from a distribution {y=Ax}\{y=Ax\} for a σ\sigma-overcomplete dictionary AA and (d,τ)(d,\tau)-nice distribution {x}\{x\}, output a set of vectors that is ε\varepsilon-close to the set of columns of AA (in symmetrized Hausdorff distance).

We will show how to use Theorem 7.5 to recover a single vector that is close to one of the columns of AA. By repeating this step in the same way as in the proof of Theorem 4.3 (noisy tensor decomposition) we can recover a set of vectors that is close to the set of columns of AA.

To recover a single vector, we estimate from the samples of {y=Ax}\{y=Ax\} a polynomial PP that is close to Audd\lVert A^{\top}u\rVert_{d}^{d} in the same way as in the proof of Theorem 4.2. (The distance of PP from Audd\lVert A^{\top}u\rVert_{d}^{d} in spectral norm will be O(τdd)=O(ε)O(\tau d^{d})=O(\varepsilon).) Next, we compute a degree-kk pseudo-distribution {u}\{u\} that satisfies the constraints {P1ε,u22=1}\{P\geqslant 1-\varepsilon,\lVert u\rVert_{2}^{2}=1\}.To recover all vectors, we would also add constraints {s,u21γ}\{\langle s,u\rangle^{2}\leqslant 1-\gamma\} for all vectors ss that have already been recovered (see proof of Theorem 4.3). The same argument as in the proof of Lemma 6.1 shows that {u}\{u\} also satisfies the constraint {AukkeO(ε)k}\{\lVert Au\rVert_{k}^{k}\geqslant e^{O(\varepsilon)k}\}, which means that {u}\{u\} satisfies the premise of Theorem 7.5. Therefore, the algorithm in Theorem 7.5 recovers a vector close to one of the columns of AA.

Conclusions and Open Problems

The Sum of Squares method has found many uses across a variety of disciplines, and in this work we demonstrate its potential for solving unsupervised learning problems in regimes that have so far eluded other algorithms. It is an interesting direction to identify other problems that can be solved using this algorithm.

The generality of the SOS method comes at a steep cost of efficiency. It is a fascinating open problem, and one we are quite optimistic about, to use the ideas from the SOS-based algorithm to design practically efficient algorithms.

References

Appendix A Proof of Lemma 2.3

Lemma 2.3 is a consequence of the following sum-of-squares version of the AM-GM inequality.The first sum-of-squares proof of the AM-GM inequality dates back to Hurwitz in 1891 [Hur91]. For related results and sums-of-squares proofs of more general sets of inequalities, see [Rez87, Rez89, FH14].

Let w1,,wnw_{1},\dots,w_{n} be polynomials. Suppose w1,,wn0w_{1},\ldots,w_{n}\succeq 0. Then,

To see that this lemma implies Lemma 2.3, write for a multi-index α\alpha with α=s\lvert\alpha\rvert=s the polynomial wαw^{\alpha} as a product wα=j=1swij,w^{\alpha}=\prod_{j=1}^{s}w_{i_{j}}, where wiw_{i} is repeated αi\alpha_{i} times. (E.g., we would write w12w2w32w_{1}^{2}w_{2}w_{3}^{2} as w1w1w2w3w3w_{1}w_{1}w_{2}w_{3}w_{3} and we would have (i1,,i5)=(1,1,2,3,3)(i_{1},\dots,i_{5})=(1,1,2,3,3).) Then applying Lemma A.1 to the polynomials wi1,,wisw_{i_{1}},\ldots,w_{i_{s}} gives the inequality asserted in Lemma 2.3,

where the second inequality uses that 0αi/α10\leqslant\alpha_{i}/\lvert\alpha\rvert\leqslant 1 and the premise wi0w_{i}\succeq 0.

To prove Lemma A.1, we will give a sequence of polynomials R0,,Rn1R_{0},\dots,R_{n-1} such that R0=(z1n+znn)/nR_{0}=(z_{1}^{n}+\dots z_{n}^{n})/n, Rn1=z1znR_{n-1}=z_{1}\dots z_{n}, and R0Rn1R_{0}\succeq\dots\succeq R_{n-1}. To this end, let

where SnS_{n} denotes the symmetric group on nn elements. So, for instance,

The following claim will then complete the proof:

For any k{1,,n1}k\in\{1,\dots,n-1\}, Rk1RkR_{k-1}-R_{k} is a sum of squares.

For a given permutation σSn\sigma\in S_{n}, the corresponding monomials in RkR_{k} and Rk1R_{k-1} will share many of the same variables, differing only in the exponents of wσ1w_{\sigma_{1}} and wσk+1w_{\sigma_{k+1}}. We will thus try to arrange the terms of Rk1RkR_{k-1}-R_{k} so that we can pull out the common variables, which will let us reduce our inequality to one involving only two variables.

Since the wiw_{i} are sums of squares, the expression inside the braces is as well. It is therefore enough to show that (wankwbnk)(wawb)\left(w_{a}^{n-k}-w_{b}^{n-k}\right)\left(w_{a}-w_{b}\right) is a sum of squares. This follows from the fact that