Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method

Boaz Barak, Jonathan A. Kelner, David Steurer

Introduction

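The dictionary learning (also known as “sparse coding”) problem is to recover an unknown $n\times m$ matrix $A$ (known as a “dictionary”) from examples of the form $y=Ax+e$ (1.1), where $x$ is a sparse coefficient vector and $e$ is a (possibly zero) noise vector.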

This problem has found applications in multiple areas, including computational neuroscience [OF97, OF96a, OF96b], machine learning [EP07, MRBL07], and computer vision and image processing [EA06, MLB+08, YWHM08]. The appeal of this problem is that, intuitively, data should be sparse in the “right” representation (where every coordinate corresponds to a meaningful feature), and finding this representation can be a useful first step for further processing, just as representing sound or image data in the Fourier or wavelet bases is often a very useful preprocessing step in signal or image processing. See [SWW12, AAJ+13, AGM13, ABGM14] and the references therein for further discussion of the history and motivation of this problem.

This is a nonlinear problem, as both $A$ and $x$ are unknown, and dictionary learning is a computationally challenging task even in the noiseless case. When $A$ is known, recovering $x$ from $y$ constitutes the sparse recovery / compressed sensing problem, which has efficient algorithms [Don06, CRT06]. Hence, a common heuristic for dictionary learning is to use alternating minimization, using sparse recovery to obtain a guess for $x$ based on a guess of $A$ , and vice versa.

Recently there have been several works giving dictionary learning algorithms with rigorous guarantees on their performance [SWW12, AAJ+13, AAN13, AGM13, ABGM14]. These works differ in various aspects, but they all share a common feature: they give no guarantee of recovery unless the distribution $\{x\}$ is over extremely sparse vectors, namely having less than $O(\sqrt{n})$ (as opposed to merely $o(n)$ ) nonzero coordinates. (There have been other works dealing with the less sparse case, but only at the expense of making strong assumptions on $x$ and/or $A$ ; see Section 1.3 for more discussion of related works.)

In this work we give a different algorithm that can be proven to approximately recover the matrix $A$ even when $x$ is much denser (up to $\tau n$ coordinates for some small constant $\tau>0$ in some settings). The algorithm works (in the sense of approximate recovery) even with noise, in the so-called overcomplete case (where $m>n$ ), and without making any incoherence assumptions on the dictionary.

Our algorithm is based on the Sum of Squares (SOS) semidefinite programming hierarchy [Sho87, Nes00, Par00, Las01]. The SOS algorithm is a very natural method for solving non-convex optimization problems that has found applications in a variety of scientific fields, including control theory [HG05], quantum information theory [DPS02], game theory [Par06], formal verification [Har07], and more. Nevertheless, to our knowledge this work provides the first rigorous bounds on the SOS algorithm’s running time for a natural unsupervised learning problem.

In this section we formally define the dictionary learning problem and state our result. We define a $\sigma$ -dictionary to be an $n\times m$ matrix $A=(a^{1}|\cdots|a^{m})$ such that $\lVert a^{i}\rVert=1$ for all $i$ , and $A^{\top}A\preceq\sigma I$ (where $I$ is the identity matrix). The parameter $\sigma$ is an analytical proxy for the overcompleteness $m/n$ of the dictionary $A$ . In particular, if the columns of $A$ are in isotropic position (i.e., $A^{\top}A$ is proportional to the identity), then the top eigenvalue of $A^{\top}A$ is its trace divided by $n$ , which equals $(1/n)\sum_{i}\lVert a^{i}\rVert^{2}=m/n$ because all of the $a^{i}$ ’s have unit norm. While we do not use it in this paper, we note that in the dictionary learning problem it is always possible to learn a linear “whitening transformation” $B$ from the samples that would place the columns in isotropic position, at the cost of potentially changing the norms of the vectors. (There also exists a linear transformation that keeps the vectors normalized [Bar98, For01], but we do not know how to learn it from the samples.) In this work we are mostly interested in the case $m=O(n)$ , which corresponds to $\sigma=O(1)$ .
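As a small illustration of the role of $\sigma$, the following sketch computes the top eigenvalue of the Gram-type matrix of a toy dictionary by power iteration. It is only a sanity check under stated assumptions: the helper name `gram_top_eigenvalue` is hypothetical, and the code works with the $n\times n$ matrix $AA^{\top}$, which has the same nonzero eigenvalues (hence the same top eigenvalue) as $A^{\top}A$. The toy dictionary consists of three unit vectors in the plane in isotropic position, so the expected value is $m/n=3/2$.

```python
import math

def gram_top_eigenvalue(columns, iters=200):
    """Top eigenvalue of A A^T for A = (a^1|...|a^m), via power iteration."""
    n = len(columns[0])
    # A A^T = sum_i a^i (a^i)^T is an n x n positive semidefinite matrix.
    G = [[sum(a[r] * a[c] for a in columns) for c in range(n)] for r in range(n)]
    v = [1.0 + 0.1 * r for r in range(n)]  # arbitrary nonzero start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(G[r][c] * v[c] for c in range(n)) for r in range(n)]
        lam = math.sqrt(sum(x * x for x in w))  # norm of G v
        v = [x / lam for x in w]                # renormalize to a unit vector
    return lam

# Toy dictionary (an assumption for illustration): n = 2, m = 3,
# unit vectors at angles 0, 120 and 240 degrees.
angles = (0.0, 2 * math.pi / 3, 4 * math.pi / 3)
columns = [(math.cos(t), math.sin(t)) for t in angles]
sigma = gram_top_eigenvalue(columns)
print(sigma)  # close to m / n = 1.5 for this isotropic example
```

For dictionaries that are not in isotropic position the same routine returns a value larger than $m/n$, which is why $\sigma$ is described as a proxy for overcompleteness rather than as the overcompleteness itself.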

Equation (1.2) will motivate us in defining an analytical proxy for the condition that the distribution $\{x\}$ over coefficients is $\tau$ -sparse. By using an analytical proxy as opposed to requiring strict sparsity, we are only enlarging the set of distributions under consideration. However, we will impose some additional conditions below, in particular requiring low-order non-square moments to vanish, which, although seemingly mild compared to prior works, do restrict the family of distributions.

Specifically, in the dictionary learning case, since we are interested in learning all column vectors, we want every coordinate $i$ to be typical (for example, if the coefficient $x_{i}$ is always $0$ or always $1$ , we will not be able to learn the corresponding column vector). Moreover, a necessary condition for recovery is that every pair of coordinates is somewhat typical in the sense that the events that $x_{i}$ and $x_{j}$ are nonzero are not perfectly correlated. Indeed, suppose for simplicity that when $x_{i}$ is nonzero, it is distributed like an independent standard Gaussian. Then if those two events were perfectly correlated, recovery would be impossible since the distribution over examples would be identical if we replaced $\{a^{i},a^{j}\}$ with any pair of vectors $\{\Pi a^{i},\Pi a^{j}\}$ where $\Pi$ is a rotation in the plane spanned by $\{a^{i},a^{j}\}$ .

for all $i\neq j$ and for some $\tau\ll 1$ .

for every non-square monomial $x^{\alpha}$ of degree at most $d$ . (Here, $\alpha\in\{0,1,\ldots\}^{m}$ is a multiindex and $x^{\alpha}$ denotes the monomial $\prod_{i}x_{i}^{\alpha_{i}}$ . The degree of $x^{\alpha}$ is $|\alpha|\mathrel{\mathop{:}}=\sum_{i}\alpha_{i}$ ; we say that $x^{\alpha}$ is non-square if $x^{\alpha}$ is not the square of another monomial, i.e., if $\alpha$ has an odd coordinate.)
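The multiindex conventions in the parenthetical above can be made concrete with a short sketch. The function names `degree`, `is_non_square`, and `monomial` are hypothetical helpers introduced only for illustration; a multiindex is represented as a tuple of nonnegative integers.

```python
def degree(alpha):
    """|alpha| = sum of the exponents of the monomial x^alpha."""
    return sum(alpha)

def is_non_square(alpha):
    """x^alpha is non-square exactly when alpha has an odd coordinate."""
    return any(a % 2 == 1 for a in alpha)

def monomial(x, alpha):
    """Evaluate x^alpha = prod_i x_i^alpha_i at a point x."""
    value = 1.0
    for xi, ai in zip(x, alpha):
        value *= xi ** ai
    return value

# x_1^2 x_3^4 is the square of x_1 x_3^2, so it is not "non-square".
square_example = (2, 0, 4)
# x_1 x_2 has odd exponents, so it is non-square.
non_square_example = (1, 1, 0)
print(degree(square_example), is_non_square(square_example))
print(degree(non_square_example), is_non_square(non_square_example))
```

Flipping the sign of a coordinate with an odd exponent flips the sign of a non-square monomial, which gives some intuition for why such moments vanish for coefficient distributions with symmetric, independent signs.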

Another way to justify this notion of nice distributions is that, as our analysis shows, it is a natural way to ensure that if $a$ is a column of the dictionary then the random variable $\langle a,y\rangle$ for a random sample $y$ from (1.1) will be “spiky” in the sense that it will have a large $d$ -norm compared to its $2$ -norm. Thus it is a fairly clean way to enable recovery, especially in the setting (such as ours) where we don’t assume orthogonality or even incoherence between the dictionary vectors.

Modeling noise

Given a noisy dictionary learning example of the form $y=Ax+e$ , one can also view it (assuming we are in the non-degenerate case of $A$ having full rank) as $y=A(x+e^{\prime})$ for some $e^{\prime}$ (whose magnitude is controlled by the norm of $e$ and the condition number of $A$ ). If $e^{\prime}$ has sufficiently small magnitude, and is composed of i.i.d. random variables (and even under more general conditions), the distribution $\{x+e^{\prime}\}$ will be nice as well. Therefore, we will not explicitly model the noise in the following, but rather treat it as part of the distribution $\{x\}$ , which our definition allows to be only “approximately sparse”.

Our results

Given samples of the form $\{y=Ax\}$ for a $(d,\tau)$ -nice $\{x\}$ , with $d$ a sufficiently large constant (corresponding to having $\tau n$ nonzero entries), we can approximately recover the dictionary $A$ in polynomial time as long as $\tau\leqslant n^{-\delta}$ for some $\delta>0$ , and in quasipolynomial time as long as $\tau$ is a sufficiently small constant. Prior polynomial-time algorithms required the distribution to range over vectors with less than $\sqrt{n}$ nonzero entries (and it was not known how to improve upon this even using quasipolynomial time).

For every $\varepsilon>0,\sigma\geqslant 1,\delta>0$ there exists $d$ and a polynomial-time algorithm $\mathcal{R}$ such that for every $\sigma$ -dictionary $A=(a^{1}|\cdots|a^{m})$ and $(d,\tau=n^{-\delta})$ -nice $\{x\}$ , given $n^{O(1)}$ samples from $\{y=Ax\}$ , $\mathcal{R}$ outputs with probability at least $0.9$ a set that is $\varepsilon$ -close to $\{a^{1},\ldots,a^{m}\}$ .

The hidden constants in the $O(\cdot)$ notation may depend on $\varepsilon,\sigma,\delta$ . The algorithm can recover the dictionary vectors even in the relatively dense case when $\tau$ is (a sufficiently small) constant, at the expense of a quasipolynomial (i.e., $n^{O(\log n)}$ ) running time. See Theorems 4.2 and 7.6 for a precise statement of the dependencies between the constants.

Our algorithm aims to recover the vectors up to $\varepsilon$ -accuracy, with a running time as in a PTAS that depends (polynomially) on $\varepsilon$ in the exponent. Prior algorithms achieving exact recovery needed to assume much stronger conditions, such as incoherence of dictionary columns. Because we have not made incoherence assumptions and have only assumed the signals obey an analytic notion of sparsity, exact recovery is not possible, and there are limitations on how precisely one can recover the dictionary vectors (even information theoretically).

We believe that it is important to understand the extent to which dictionary recovery can be performed with only weak assumptions on the model, particularly given that real-world signals are often only approximately sparse and have somewhat complicated distributions of errors. When stronger conditions are present that make better error guarantees possible, our algorithm can provide an initial solution for local search methods (or other recovery algorithms) to boost the approximate solution to a more precise one. We believe that understanding the precise tradeoffs between the model assumptions, achievable precision, and running time is an interesting question for further research.

For every $\varepsilon>0,\sigma\geqslant 1$ , there exists $d,\tau$ and a probabilistic $n^{O(\log n)}$ -time algorithm $\mathcal{R}$ such that for every $\sigma$ -dictionary $A=(a^{1}|\cdots|a^{m})$ , given a polynomial $P$ such that

$\mathcal{R}$ outputs with probability at least $0.9$ a set $S$ that is $\varepsilon$ -close to $\{a^{1},\ldots,a^{m}\}$ .

(We denote $P\preceq Q$ if $Q-P$ is a sum of squares of polynomials. Also, as in Theorem 1.2, there are certain conditions under which $\mathcal{R}$ runs in polynomial time; see Section 7.)

The condition (1.5) implies that the input $P$ to $\mathcal{R}$ is $\tau$ -close to the tensor $\lVert A^{\top}u\rVert_{d}^{d}$ , in the sense that $|P(u)-\lVert A^{\top}u\rVert_{d}^{d}|\leqslant\tau$ for every unit vector $u$ . This allows for very significant noise, since for a typical vector $u$ , we expect $\lVert A^{\top}u\rVert_{d}^{d}$ to have magnitude roughly $mn^{-d/2}$ , which would be much smaller than $\tau$ for every constant $\tau>0$ . Thus, on most of its inputs, $P$ can behave radically differently than $\lVert A^{\top}u\rVert_{d}^{d}$ , and in particular have many local minima that do not correspond to local minima of the latter polynomial. For this reason, it seems unlikely that one can establish a result such as Theorem 1.4 using a local search algorithm. The conditions (1.5) and $\max_{\lVert u\rVert_{2}=1}|P(u)-\lVert A^{\top}u\rVert_{d}^{d}|\leqslant\tau$ are not identical for $d>2$ . Nevertheless, the discussion above applies to both conditions, since (1.5) does allow for $P$ to have very different behavior than $\lVert A^{\top}u\rVert_{d}^{d}$ .

We give an overview of our algorithm and its analysis in Section 2. Sections 4, 5 and 6 contain the complete formal proofs. In its current form, our algorithm is efficient only in the theoretical/asymptotic sense, but it is very simple to describe (modulo its calls to the SOS solver); see Figure 1. We believe that the Sum of Squares algorithm can be a very useful tool for attacking machine learning problems, yielding a first solution to the problem that can later be tailored and optimized.

Related work

Starting with the work of Olshausen and Field [OF96a, OF96b, OF97], there is a vast body of literature using various heuristics (most commonly alternating minimization) to learn dictionaries for sparse coding, and applying this tool to many applications. Here we focus on papers that gave algorithms with proven performance.

Independent Component Analysis (ICA) [Com94] is one method that can be used for dictionary learning when the random variables $x_{1},\ldots,x_{m}$ are statistically independent. For the case of $m=n$ this was shown in [Com94, FJK96, NR09], while the works [LCC07, GVX14] extend it to the overcomplete (i.e., $m>n$ ) case.

Another recent line of works analyzed different algorithms, which in some cases are more efficient or handle more general distributions than ICA. Spielman, Wang and Wright [SWW12] give an algorithm to exactly recover the dictionary in the $m=n$ case. Agarwal, Anandkumar, Jain, Netrapalli, and Tandon [AAJ+13] and Arora, Ge and Moitra [AGM13] obtain approximate recovery in the overcomplete (i.e., $m>n$ ) case, which can be boosted to exact recovery under some additional conditions on the sparsity and dictionary [AAN13, AGM13]. However, all these works require the distribution $x$ to be over very sparse vectors, specifically having less than $\sqrt{n}$ nonzero entries. As discussed in [SWW12, AGM13], $\sqrt{n}$ sparsity seemed like a natural barrier for this problem, and in fact, Spielman et al. [SWW12] proved that every algorithm of similar nature to theirs will fail to recover the dictionary when the coefficient vector can have $\Omega(\sqrt{n\log n})$ nonzero coordinates. The only work we know of that can handle vectors of support larger than $\sqrt{n}$ is the recent paper [ABGM14], but it achieves this at the expense of making fairly strong assumptions on the structure of the dictionary, in particular assuming some sparsity conditions on $A$ itself. In addition to the sparsity restriction, all these works had additional conditions on the distribution that are incomparable or stronger than ours, and the works [AAJ+13, AGM13, AAN13, ABGM14] make additional assumptions on the dictionary (namely incoherence) as well.

The tensor decomposition problem is also very widely studied with a long history (see e.g., [Tuc66, Har70, Kru77]). Some recent works providing algorithms and analysis include [AFH+12, AGM12, BCMV14, BCV14]. However, these works are in a rather different parameter regime than ours: they assume the tensor is given with very little noise (inverse polynomial in the spectral norm), but on the other hand require only very low order moments (typically three or four, as opposed to the large constant or even logarithmic number we use).

Organization of this paper

In Section 2 we give a high level overview of our ideas. Sections 4–6 contain the full proof for solving the dictionary learning and tensor decomposition problems in quasipolynomial time, where the sparsity parameter $\tau$ is a small constant. In Section 7 we show how this can be improved to polynomial time when $\tau\leqslant n^{-\delta}$ for some constant $\delta>0$ .

Overview of algorithm and its analysis

The dictionary learning problem can be easily reduced to the noisy tensor decomposition problem. Indeed, it is not too hard to show that for an appropriately chosen parameter $d$ , given a sufficiently large number of examples $y_{1},\ldots,y_{N}$ from the distribution $\{y=Ax\}$ , the polynomial

will be roughly $\tau$ close (in the spectral norm) to the polynomial $\lVert A^{\top}u\rVert_{d}^{d}$ , where $\tau$ is the “niceness”/“sparsity” parameter of the distribution $\{x\}$ . Therefore, if we give $P$ as input to the tensor decomposition algorithm of Theorem 1.4, we will obtain a set that is close to the columns of $A$ . The polynomial (2.1) and similar variants have been used before in works on dictionary learning. The crucial difference is that those works made strong assumptions, such as independence of the entries of $\{x\}$ , that ensured this polynomial has a special structure that made it possible to efficiently optimize over it. In contrast, our work applies in a much more general setting.
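To make the reduction tangible, here is a toy sketch comparing an empirical degree-$d$ moment polynomial with $\lVert A^{\top}u\rVert_{d}^{d}$. Two assumptions are made and should be read as such: first, that the polynomial built from the samples has the form $P(u)=\frac{1}{N}\sum_{i}\langle y_{i},u\rangle^{d}$ (the displayed formula is not reproduced in this text); second, that coefficients are $1$-sparse with a scaling chosen so that each coordinate has $d$-th moment $1$. The helper names are hypothetical, and the "samples" enumerate the support of the toy distribution rather than drawing from it, so the two polynomials agree exactly here; for denser coefficient distributions one would expect a gap governed by $\tau$.

```python
def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def empirical_moment_poly(samples, u, d):
    """Assumed form of P: the average of <y, u>^d over the samples."""
    return sum(inner(y, u) ** d for y in samples) / len(samples)

def target_poly(columns, u, d):
    """||A^T u||_d^d = sum_i <a^i, u>^d (d even)."""
    return sum(inner(a, u) ** d for a in columns)

d = 4
columns = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # unit-norm columns of a toy A
m = len(columns)
scale = m ** (1.0 / d)  # makes the d-th moment of each coefficient equal to 1

# Toy coefficients x = +/- scale * e_i, so that y = A x = +/- scale * a^i.
samples = [tuple(sign * scale * c for c in a)
           for a in columns for sign in (1.0, -1.0)]

u = (0.28, 0.96)  # a unit test direction
P_u = empirical_moment_poly(samples, u, d)
T_u = target_poly(columns, u, d)
print(P_u, T_u)
```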

The challenge is that because $\tau$ is a positive constant, no matter how many samples we take, the polynomial $P$ will always be bounded away from the tensor $\lVert A^{\top}u\rVert_{d}^{d}$ . Hence we must use a tensor decomposition algorithm that can handle a very significant amount of noise. This is where the Sum-of-Squares algorithm comes in. This is a general tool for solving systems of polynomial equations [Sho87, Nes00, Par00, Las01]. Given the SOS algorithm, the description of our tensor decomposition algorithm is extremely simple (see Figure 1 below). We now describe the basic facts we use about the SOS algorithm, and sketch the analysis of our noisy tensor decomposition algorithm. See the survey [BS14] and the references therein for more detail on the SOS algorithm, and Sections 4, 5 and 6 for the full description of our algorithm and its analysis (including its variants that take polynomial time at the expense of requiring dictionary learning examples with sparser coefficients).

The SOS algorithm is a method, based on semidefinite programming, for solving a system of polynomial equations. Alas, since this is a non-convex and NP-hard problem, the algorithm doesn’t always succeed in producing a solution. However, it always returns some object, which in some sense can be interpreted as a “distribution” $\{u\}$ over solutions of the system of equations. It is not an actual distribution, and in particular we cannot sample from $\{u\}$ and get an individual solution, but we can compute low order moments of $\{u\}$ . Specifically, we make the following definition:

Numerical accuracy will never play an important role in our results, and so we can just assume that we can always find in $n^{O(k)}$ time a degree- $k$ pseudo-distribution satisfying given polynomial constraints, if such a pseudo-distribution exists.

Noisy tensor decomposition

Our basic noisy tensor decomposition algorithm is described in Figure 1. This algorithm finds (a vector close to) a column of $A$ with inverse polynomial probability. Using similar ideas, one can extend it to an algorithm that outputs all vectors with high probability; we provide the details in Section 6. Following the approach of [BKS14], our analysis of this algorithm proceeds in two phases:

(i) We show that if the pseudo-distribution $\{u\}$ obtained in Step 1 is an actual distribution, then the vector output in Step 3 is close to one of the columns of $A$ .

(ii) We then show that the arguments used in establishing (i) generalize to the case of pseudo-distributions as well.

The first part is actually not so surprising. For starters, every unit vector $u$ that maximizes $P$ must be highly correlated with some column $a$ of $A$ . Indeed, $\lVert A^{\top}a\rVert_{d}^{d}\geqslant 1$ for every column $a$ of $A$ , and hence the maximum of $P(u)$ over a unit $u$ is at least $1-\tau$ . But if $\langle u,a\rangle^{2}\leqslant 1-\varepsilon$ for every column $a$ then $P(u)$ must be much smaller than $1$ . Indeed, in this case

Since $\sum_{i}\langle a^{i},u\rangle^{2}\leqslant\sigma$ , this implies that, as long as $d\gg\tfrac{\log\sigma}{\varepsilon}$ , $\lVert A^{\top}u\rVert_{d}^{d}$ (and thus also $P(u)$ ) is much smaller than $1$ .
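For reference, one chain of inequalities that yields this conclusion is sketched below. It is a reconstruction under the stated hypotheses, not a quotation of the original display: $u$ is a unit vector, $d$ is even, $\langle a^{i},u\rangle^{2}\leqslant 1-\varepsilon$ for every column $a^{i}$, and $B$ denotes any upper bound on $\sum_{i}\langle a^{i},u\rangle^{2}$.

$$
\lVert A^{\top}u\rVert_{d}^{d}
=\sum_{i}\langle a^{i},u\rangle^{d}
\leqslant\max_{i}\lvert\langle a^{i},u\rangle\rvert^{d-2}\cdot\sum_{i}\langle a^{i},u\rangle^{2}
\leqslant(1-\varepsilon)^{(d-2)/2}\,B
\leqslant\exp\bigl(-\tfrac{\varepsilon(d-2)}{2}+\log B\bigr).
$$

The exponent is a large negative number once $d$ is a sufficiently large multiple of $(\log B)/\varepsilon$, which matches the requirement $d\gg\tfrac{\log\sigma}{\varepsilon}$ when $B$ is polynomial in $\sigma$.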

Therefore, if $\{u\}$ obtained in Step 1 is an actual distribution, then it would be essentially supported on the set $\mathcal{A}=\{\pm a^{1},\ldots,\pm a^{m}\}$ of the columns of $A$ and their negations. Let us suppose that $\{u\}$ is simply the uniform distribution over $\mathcal{A}$ . (It can be shown that this essentially is the hardest case to tackle.) In this case the matrix $M$ considered in Step 3 can be written as

where $W(\cdot)$ is the polynomial selected in Step 2. (This uses the fact that this polynomial is a product of linear functions and hence satisfies $W(-a)^{2}=W(a)^{2}$ for all $a$ .) If $W(\cdot)$ satisfies

Part (ii). The above argument establishes (i), but this is all based on a rather bold piece of wishful thinking: that the object $\{u\}$ we obtained in Step 1 of the algorithm was actually a genuine distribution over unit vectors maximizing $P$ . In actuality, we can only obtain the much weaker guarantee that $\{u\}$ is a degree $k$ pseudo-distribution for some $k=O(\log n)$ . (An actual distribution corresponds to a degree- $\infty$ pseudo-distribution.) The technical novelty of our work lies in establishing (ii). The key observation is that in all our arguments above, we never used any higher moments of $\{u\}$ , and that all the inequalities we showed boil down to the simple fact that a square of a polynomial is never negative. (Such proofs are known as Sum of Squares (SOS) proofs.)

applying it to the vector $v=A^{\top}u$ (where we denote $\lVert v\rVert_{\infty}=\max_{i}|v_{i}|$ ). The first (and main) obstacle in giving a low degree “Sum of Squares” proof for (2.4) is that this is not a polynomial inequality. To turn it into one, we replace the $L_{\infty}$ norm with the $L_{k}$ norm for some large $k$ ( $k=O(\log m)$ will do). If we replace $\lVert v\rVert_{\infty}$ with $\lVert v\rVert_{k}$ in (2.4), and raise it to the $k/(d-2)$ -th power, then we obtain the inequality

which is a valid inequality between polynomials in $v$ whenever $k$ is an integer multiple of $d-2$ (which we can ensure).
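The polynomial inequality described above can be spot-checked numerically. The sketch below assumes, based on the surrounding description and the term-by-term form used later in this section, that the inequality reads $\bigl(\sum_{i}v_{i}^{d}\bigr)^{s}\leqslant\bigl(\sum_{i}v_{i}^{k}\bigr)\bigl(\sum_{i}v_{i}^{2}\bigr)^{s}$ with $k=(d-2)s$ and $d$ even. The function names are hypothetical, and a numerical check on random vectors is only evidence, not a sum-of-squares proof.

```python
import random

def power_sum(v, p):
    """sum_i v_i^p, which equals ||v||_p^p when p is even."""
    return sum(x ** p for x in v)

def check_polynomial_holder(v, d, s):
    """Check (sum v_i^d)^s <= (sum v_i^k) * (sum v_i^2)^s for k = (d-2)*s."""
    k = (d - 2) * s
    lhs = power_sum(v, d) ** s
    rhs = power_sum(v, k) * power_sum(v, 2) ** s
    return lhs <= rhs * (1 + 1e-12)  # small slack for floating-point rounding

rng = random.Random(0)  # fixed seed so the check is reproducible
trials = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(1000)]
all_hold = all(check_polynomial_holder(v, d=4, s=3) for v in trials)

# Equality case: a vector with a single nonzero coordinate.
tight = check_polynomial_holder([1.0, 0.0, 0.0], d=4, s=3)
print(all_hold, tight)
```

Choosing $s$ as an integer is what makes both sides polynomials in $v$, which corresponds to the requirement that $k$ be an integer multiple of $d-2$.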

We now need to find a sum-of-squares proof for this inequality, namely that the right-hand side of (2.5) is equal to the left-hand side plus a sum of squares, that is, we are to show that for $s=k/(d-2)$ ,

By expanding the $s$ -th powers in this expression, we rewrite this polynomial inequality as

where the summations involving $\alpha$ are over degree- $s$ multiindices $\alpha\in\{0,\dots,s\}^{m}$ , and $\binom{s}{\alpha}$ denotes the multinomial coefficient $\binom{s}{\alpha}=\frac{s!}{\alpha_{1}!\dots\alpha_{m}!}$ . We will prove (2.6) term by term, i.e., we will show that $v^{d\alpha}\preceq v^{2\alpha}\sum_{i}v_{i}^{(d-2)s}$ for every multiindex $\alpha$ . Since $v^{2\alpha}\succeq 0$ , it is enough to show that $v^{(d-2)\alpha}\preceq\sum_{i}v_{i}^{(d-2)s}$ . This is implied by the following general inequality, which we prove in Appendix A:

Let $w_{1},\ldots,w_{n}$ be polynomials. Suppose $w_{1}\succeq 0,\ldots,w_{n}\succeq 0$ . Then, for every multiindex $\alpha$ , $w^{\alpha}\preceq\sum_{i}w_{i}^{\lvert\alpha\rvert}\,.$

We note that $d$ is even, so $w_{i}=v_{i}^{d-2}\succeq 0$ is a square, as required by the lemma.

For the case that $\lvert\alpha\rvert$ is a power of $2$ , the inequality in the lemma follows by repeatedly applying the inequality $x\cdot y\preceq\frac{1}{2}x^{2}+\frac{1}{2}y^{2}$ , which in turn holds because the difference between the two sides equals $\tfrac{1}{2}(x-y)^{2}$ . As a concrete example, we can derive $w_{1}^{3}w_{2}\preceq w_{1}^{4}+w_{2}^{4}$ in this way,

(The first two steps use the inequality $x\cdot y\preceq\tfrac{1}{2}x^{2}+\tfrac{1}{2}y^{2}$ . The last step uses that both $w_{1}$ and $w_{2}$ are sums of squares.)
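One chain consistent with this description is $w_{1}^{3}w_{2}\preceq\tfrac{1}{2}w_{1}^{4}+\tfrac{1}{2}w_{1}^{2}w_{2}^{2}\preceq\tfrac{3}{4}w_{1}^{4}+\tfrac{1}{4}w_{2}^{4}\preceq w_{1}^{4}+w_{2}^{4}$; this is offered as a plausible reconstruction rather than the original display. Unrolling it gives the explicit certificate $w_{1}^{4}+w_{2}^{4}-w_{1}^{3}w_{2}=\tfrac{1}{2}(w_{1}^{2}-w_{1}w_{2})^{2}+\tfrac{1}{4}(w_{1}^{2}-w_{2}^{2})^{2}+\tfrac{1}{4}w_{1}^{4}+\tfrac{3}{4}w_{2}^{4}$. The sketch below checks that identity in exact rational arithmetic at random points; the function name is hypothetical, and agreement at many points is supporting evidence for the identity, which can also be confirmed by expanding by hand.

```python
from fractions import Fraction
import random

def sos_certificate_gap(w1, w2):
    """(w1^4 + w2^4 - w1^3*w2) minus an explicit sum of squares; should be 0."""
    lhs = w1 ** 4 + w2 ** 4 - w1 ** 3 * w2
    squares = (Fraction(1, 2) * (w1 ** 2 - w1 * w2) ** 2   # from the first step
               + Fraction(1, 4) * (w1 ** 2 - w2 ** 2) ** 2  # from the second step
               + Fraction(1, 4) * (w1 ** 2) ** 2            # slack used in the last step
               + Fraction(3, 4) * (w2 ** 2) ** 2)
    return lhs - squares

rng = random.Random(1)  # fixed seed for reproducibility
points = [(Fraction(rng.randint(-50, 50), rng.randint(1, 9)),
           Fraction(rng.randint(-50, 50), rng.randint(1, 9)))
          for _ in range(200)]
gaps = [sos_certificate_gap(a, b) for a, b in points]
print(all(g == 0 for g in gaps))
```

Because every term on the right-hand side of the certificate is a nonnegative multiple of a square, the identity remains a valid sum-of-squares certificate when $w_{1}$ and $w_{2}$ are themselves polynomials.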

Once we have an SOS proof for (2.5) we can conclude that it holds for pseudo-distributions as well, and in particular that for every pseudo-distribution $\{u\}$ of degree at least $k+2k/(d-2)$ satisfying $\{\lVert u\rVert_{2}^{2}=1\}$ ,

We use similar ideas to port the rest of the proof to the SOS setting, concluding that whenever $\{u\}$ is a pseudo-distribution that satisfies $\{\lVert u\rVert_{2}^{2}=1\}$ and $\{P(u)\geqslant 1-\tau\}$ , then with inverse polynomial probability it will hold that

Preliminaries

We use the notation of pseudo-expectations and pseudo-distributions from Section 2.1. We now state some basic useful facts about pseudo-distributions; see [BS14, BKS14, BBH+12] for a more comprehensive treatment.

One useful property of pseudo-distributions is that we can find an actual distribution that matches their first two moments.
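A standard way to realize this property is to use a Gaussian with the same mean and covariance. The sketch below illustrates the idea on hypothetical degree-2 pseudo-moments (the numbers are made up for illustration). It assumes the covariance is strictly positive definite so that a plain Cholesky factorization applies; a merely positive semidefinite covariance would need a different factorization. The helper names are assumptions of this sketch.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular L with L L^T = cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_gaussian(mean, L, rng):
    """Draw mean + L z for a standard Gaussian vector z."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mean))]

# Hypothetical pseudo-moments: first[i] plays the role of the pseudo-expectation
# of u_i, and second[i][j] that of u_i * u_j.
first = [0.1, -0.2]
second = [[0.5, 0.1], [0.1, 0.5]]
cov = [[second[i][j] - first[i] * first[j] for j in range(2)] for i in range(2)]

L = cholesky(cov)
rng = random.Random(0)
draws = [sample_gaussian(first, L, rng) for _ in range(5)]
print(draws[0])
```

The resulting Gaussian matches only the first two moments; its higher moments generally differ from those of the pseudo-distribution.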

Another property we will use is that we can reweigh a pseudo-distribution by a positive polynomial $W$ to obtain a new pseudo-distribution that corresponds to the operation on actual distributions of reweighing the probability of an element $u$ proportional to $W(u)$ .
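For an actual distribution with finite support, the reweighing operation is elementary, and the sketch below spells it out in exact arithmetic. The helper name `reweigh` and the toy distribution are assumptions made for illustration. For pseudo-distributions the analogous operation acts on pseudo-expectations, replacing the pseudo-expectation of $f$ by that of $Wf$ divided by that of $W$, which is well defined only when the pseudo-expectation of $W$ is positive.

```python
from fractions import Fraction

def reweigh(dist, W):
    """Reweigh a finite distribution: new probability of u is proportional to p(u) * W(u)."""
    total = sum(p * W(u) for u, p in dist.items())
    if total <= 0:
        raise ValueError("the expectation of W must be positive")
    return {u: p * W(u) / total for u, p in dist.items()}

def W(u):
    """A squared linear function, hence a nonnegative (sum-of-squares) weight."""
    return (2 * u[0] + u[1]) ** 2

# Toy distribution over three points (probabilities sum to 1).
dist = {(1, 0): Fraction(1, 2), (0, 1): Fraction(1, 4), (-1, 0): Fraction(1, 4)}
new_dist = reweigh(dist, W)
print(new_dist)
```

Points where $W$ is large gain probability mass, which is the mechanism used to concentrate a distribution near a single column of the dictionary.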

but since $W$ is a sum of squares, $WP^{2}$ is also a sum of squares and hence the denominator of the left-hand side is non-negative, while the numerator is by assumption positive. ∎

Dictionary Learning

We now state our formal theorem for dictionary learning. The following definition of nice distributions captures formally the conditions needed for recovery. (It is equivalent up to constants to the definition of Section 1.1; see Remark 4.4 below.)

We can now state our result for dictionary learning in quasipolynomial time. The result for polynomial time is stated in Section 7.

There exists an algorithm that for every desired accuracy $\varepsilon>0$ and overcompleteness $\sigma\geqslant 1$ solves the following problem for every $(d,\tau)$ -nice distribution with $d\geqslant d(\varepsilon,\sigma)=O(\varepsilon^{-1}\log\sigma)$ and $\tau\leqslant\tau(\varepsilon,\sigma)=(\varepsilon^{-1}\log\sigma)^{-O(\varepsilon^{-1}\log\sigma)}$ in time $n^{(1/\varepsilon)^{O(1)}(d+\log m)}$ : Given $n^{O(d)}/\operatorname{poly}(\tau)$ samples from a distribution $\{y=Ax\}$ for a $\sigma$ -overcomplete dictionary $A$ and $(d,\tau)$ -nice distribution $\{x\}$ , output a set of vectors that is $\varepsilon$ -close to the set of columns of $A$ (in symmetrized Hausdorff distance).

output a set of vectors that is $\varepsilon$ -close to the set of columns of $A$ (in symmetrized Hausdorff distance).

We will prove Theorem 4.3 (noisy tensor decomposition) in Section 5 and Section 6. At this point, let us see how it yields Theorem 4.2 (dictionary learning, quasipolynomial time). The following lemma gives the connection between tensor decomposition and dictionary learning.

Therefore, we can apply the algorithm in Theorem 4.3 (noisy tensor decomposition) for noise parameter $\tau^{\prime}=2\tau k^{d}d^{d}$ to obtain a set $S$ of unit vectors that is $\varepsilon$ -close to the set of columns of $A$ (in symmetrized Hausdorff distance). ∎
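The notion of closeness between an output set and the set of columns can be illustrated with a short sketch. The precise definition is not reproduced in this text, so the code assumes a sign-invariant version based on squared correlation, in line with the conditions of the form $\langle c,s\rangle^{2}\geqslant 1-\gamma$ used in the analysis: two sets of unit vectors are treated as $\varepsilon$-close if every vector in either set has a partner in the other with squared inner product at least $1-\varepsilon$. The function names are hypothetical.

```python
def correlation_sq(u, v):
    """Squared inner product <u, v>^2, which is invariant under sign flips."""
    return sum(a * b for a, b in zip(u, v)) ** 2

def is_eps_close(S, T, eps):
    """Symmetric check: each vector of S is matched in T and vice versa."""
    def covered(x, Y):
        return any(correlation_sq(x, y) >= 1 - eps for y in Y)
    return all(covered(s, T) for s in S) and all(covered(t, S) for t in T)

columns = [(1.0, 0.0), (0.0, 1.0)]                # toy dictionary columns
estimate = [(0.0, -1.0), (0.995, 0.0998749)]      # sign flip and a small perturbation

print(is_eps_close(columns, estimate, eps=0.05))   # loose tolerance
print(is_eps_close(columns, estimate, eps=0.001))  # tight tolerance
```

The symmetric requirement rules out both spurious output vectors and missed columns, which is the role played by the symmetrized Hausdorff distance in the theorem statements.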

Sampling pseudo-distributions

In this section we will develop an efficient algorithm that behaves in certain ways like a hypothetical sampling procedure for low-degree pseudo-distributions. (Sampling procedures, even inefficient or approximate ones, cannot exist in general for low-degree pseudo-distributions [Gri01, Sch08].) This algorithm will be a key ingredient of our algorithm for Theorem 4.3 (noisy tensor decomposition, quasipolynomial time).

The algorithm in the following theorem achieves the above property of sampling procedures with the key advantage that it applies to any low-degree pseudo-distribution.

The result follows from the following lemmas.

Furthermore, there exists a randomized algorithm that runs in time $n^{O(k)}$ and computes such a polynomial $W$ with probability at least $2^{-O(k/\operatorname{poly}(\varepsilon))}$ .

Let $w^{\scriptscriptstyle(1)},\ldots,w^{\scriptscriptstyle(k/2)}$ be independent samples from the distribution $\{w\mid\langle c,\xi\rangle\geqslant\tau_{M+1}\}$ . Then, let $W=w^{\scriptscriptstyle(1)}\cdots w^{\scriptscriptstyle(k/2)}/M^{k/2}$ . The expectation of this random polynomial satisfies

The following bound shows that there exists a polynomial $W$ that satisfies the conclusion of the lemma,

Since $\{W\}$ has density $2^{-O(M^{2})}$ in $\{W_{0}\}$ , it also follows that

Noisy tensor decomposition

In this section we will prove Theorem 4.3 (noisy tensor decomposition, quasipolynomial time).

output a set of vectors that is $\varepsilon$ -close to the set of columns of $A$ (in symmetrized Hausdorff distance).

First, we claim that the pseudo-distribution $\{u\}$ also satisfies the constraint $\{\lVert A^{\top}u\rVert_{k}^{k}\geqslant e^{-\delta^{\prime}k}\}$ where $\delta^{\prime}=\tfrac{d}{d-2}\delta+\frac{\log\sigma}{d-2}$ . The proof of this claim follows by a sum-of-squares version of the following form of Hölder’s inequality,

See the overview section for a proof of this fact. By substituting $v=A^{\top}u$ and using the facts that $\lVert A^{\top}u\rVert_{2}^{2}\preceq\sigma\lVert u\rVert^{2}$ and that $\{u\}$ satisfies the constraint $\{\lVert u\rVert^{2}=1\}$ , we get that $\{u\}$ satisfies $\{\lVert A^{\top}u\rVert_{k}^{k}\geqslant(\lVert A^{\top}u\rVert_{d}^{d})^{k/(d-2)}/\sigma^{k/(d-2)}\}$ , which implies the claim because $\{\lVert A^{\top}u\rVert_{d}^{d}\geqslant e^{-\delta d}\}$ .
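The arithmetic behind the value of $\delta^{\prime}$ can be written out as follows; this is a worked restatement of the step above, using only the two constraints just mentioned.

$$
\lVert A^{\top}u\rVert_{k}^{k}
\;\geqslant\;\frac{\bigl(\lVert A^{\top}u\rVert_{d}^{d}\bigr)^{k/(d-2)}}{\sigma^{k/(d-2)}}
\;\geqslant\;\frac{e^{-\delta dk/(d-2)}}{e^{k\log\sigma/(d-2)}}
\;=\;\exp\Bigl(-k\Bigl(\tfrac{d}{d-2}\,\delta+\tfrac{\log\sigma}{d-2}\Bigr)\Bigr)
\;=\;e^{-\delta^{\prime}k}.
$$

In particular, $\delta^{\prime}$ is close to $\delta$ whenever $d$ is large compared with both $2$ and $(\log\sigma)/\delta$.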

While there exists a degree- $k$ pseudo-distribution $\{u\}$ that satisfies the constraints $\{P(u)\geqslant 1-\tau,\lVert u\rVert_{2}^{2}=1\}$ and $\{\langle s,u\rangle^{2}\leqslant 1-\gamma\}$ for every $s\in S$ :

Add the vector $c^{\prime}$ to the set $S$ .

Next we claim that every vector $s\in S$ is close to one of the columns of $A$ . Indeed, every such vector satisfies $\lVert A^{\top}s\rVert_{d}^{d}\geqslant e^{-\varepsilon d}-2\tau$ , which by Lemma 6.1 implies that $\langle s,c\rangle^{2}\geqslant 1-O(\varepsilon+\tau/d+(\log\sigma)/d)=1-O(\varepsilon)$ for a column $c$ of $A$ .

Next we claim that if the algorithm terminates then for every column $c$ of $A$ there exists a vector $s\in S$ with $\langle c,s\rangle^{2}\geqslant 1-\gamma$ . Indeed, if there exists a column that violates this condition, then it would satisfy all constraints for the pseudo-distribution, which means that the algorithm does not terminate at this point.

To finish the proof of the theorem it remains to bound the number of iterations of the algorithm. We claim that the number of iterations is bounded by the number $m$ of columns of $A$ because in each iteration the vectors in $S$ will cover at least one more of the columns of $A$ . As observed before, every vector $s\in S$ is close to a column $c_{s}$ of $A$ in the sense that $\lVert s^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}=O(\varepsilon)$ . However, since $c^{\prime}$ satisfies $\langle c^{\prime},s\rangle^{2}\leqslant 1-\gamma/10$ , we have by triangle inequality $\gamma/5\leqslant\lVert(c^{\prime})^{\otimes 2}-s^{\otimes 2}\rVert^{2}\leqslant 2\lVert(c^{\prime})^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}+2\lVert s^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}$ , which means that $\lVert(c^{\prime})^{\otimes 2}-c_{s}^{\otimes 2}\rVert^{2}\geqslant\gamma/10-O(\varepsilon)$ . Therefore, the vector $c^{\prime}$ is not close to any of the vectors $c_{s}$ for $s\in S$ , which means that it has to be close to another column of $A$ . (Here, we are again assuming that $\gamma$ was chosen so that $\gamma/\varepsilon$ is a large enough constant.) ∎
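The distance computations in this argument rest on the identity $\lVert c^{\otimes 2}-s^{\otimes 2}\rVert^{2}=2-2\langle c,s\rangle^{2}$ for unit vectors, which turns $\langle c^{\prime},s\rangle^{2}\leqslant 1-\gamma/10$ into $\lVert(c^{\prime})^{\otimes 2}-s^{\otimes 2}\rVert^{2}\geqslant\gamma/5$. The sketch below checks the identity numerically, interpreting the norm as the Frobenius norm of the difference of outer products (an assumption of this sketch), and also checks the relaxed triangle inequality used in the proof on one example. The helper names are hypothetical.

```python
import math

def outer_sq_distance(c, s):
    """Squared Frobenius distance between the outer products c c^T and s s^T."""
    n = len(c)
    return sum((c[i] * c[j] - s[i] * s[j]) ** 2 for i in range(n) for j in range(n))

def unit(theta):
    """Unit vector in the plane at angle theta."""
    return (math.cos(theta), math.sin(theta))

c, s, t = unit(0.3), unit(1.1), unit(0.7)
inner_sq = sum(a * b for a, b in zip(c, s)) ** 2

lhs = outer_sq_distance(c, s)
rhs = 2 - 2 * inner_sq                      # identity for unit vectors
# Relaxed triangle inequality: ||x - z||^2 <= 2||x - y||^2 + 2||y - z||^2.
relaxed = 2 * outer_sq_distance(c, t) + 2 * outer_sq_distance(t, s)
print(lhs, rhs, relaxed)
```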

By adapting the algorithm somewhat we can also achieve recovery guarantees for columns with significantly smaller norm than the maximum norm. Concretely, we can modify the algorithm so that we ask for pseudo-distributions satisfying $P(u)\geqslant\rho$ , where $\rho$ is a parameter that we gradually decrease so we can get all the vectors. However, we need to also change the right-hand side of the constraint $\langle u,s\rangle^{2}\leqslant 1-\gamma$ to a value that decreases with $\rho$ . Otherwise, the algorithm might not terminate, as there can be exponentially many vectors that are somewhat far from a column vector $c$ , and all of them will have a fairly large value for $P(\cdot)$ . Such a modified algorithm can still obtain all the column vectors (up to a small error) if we assume that they are sufficiently incoherent. That is, $\langle a,a^{\prime}\rangle\leqslant\mu$ for every pair of distinct columns $a$ , $a^{\prime}$ of $A$ with $\mu$ depending on the norm ratios. Similar (and in fact often stronger) assumptions were made in prior works on dictionary learning. (However, we need these assumptions only when the vectors have different norms.)

Polynomial-time algorithms

In this section we show how to improve our tensor decomposition algorithm when we have access to examples of very sparse linear combinations of the dictionary columns, culminating in Theorem 7.6, which gives a polynomial-time algorithm for the dictionary learning problem in the case that the distribution is $(d,\tau)$ -nice for $\tau=n^{-\Omega(1)}$ .

The following theorem refines Theorem 5.1 (sampling pseudo-distributions) for reconstructing a vector $c^{\prime}$ that is close to a target vector $c$ . We make an additional assumption about having access to samples from a distribution $\{W\}$ over sum-of-squares polynomials. This distribution comes with a noise parameter $\tau$ that controls how well the distribution is correlated with the target vector $c$ . If this noise parameter is sufficiently small, samples from this distribution allow the algorithm to work under a more refined but milder condition on the pseudo-distribution $\{u\}$ . For our dictionary learning algorithm, we can satisfy this condition when the noise parameter $\tau$ of the distribution $\{W\}$ satisfies $\tau\ll m^{1/k}$ . (The noise parameter $\tau$ roughly coincides with the niceness parameter of the distribution $\{x\}$ .)

The pseudo-distribution $\{u\}$ has degree $2(1+2k)$ and satisfies the polynomial constraint $\lVert u\rVert_{2}^{2}=1$ and the conditions

The following lemma is the main new ingredient of the proof of this theorem.

Note that the conclusion of the lemma implies that we can recover a vector $c^{\prime}$ with $\langle c^{\prime},c\rangle^{2}\geqslant 1-O(\varepsilon^{\prime})$ using Theorem 5.1 in time $n^{k/\operatorname{poly}(\varepsilon^{\prime})}$ with probability $2^{O(k)/\operatorname{poly}(\varepsilon^{\prime})}$ . Therefore, Theorem 7.1 follows by combining Lemma 7.2 with Theorem 5.1.

Here, the last step uses the assumption $(1+k)\tau\leqslant 1/2$ to bound the geometric series $\sum_{i=1}^{k}(1+k)^{i}\tau^{i}\leqslant 2(1+k)\tau$ . It follows that

Tensor decomposition

The following lemma shows that a pseudo-distribution $\{u\}$ that satisfies the constraints $\{\lVert A^{\top}u\rVert_{2(1+k)}^{2(1+k)}\approx 1,\lVert u\rVert_{2}^{2}=1\}$ also satisfies the condition of Theorem 7.1 for one of the columns of the dictionary $A$ .

It follows that the pseudo-distribution $\{u\}$ cannot satisfy the constraint $\lVert A^{\top}u\rVert_{2(1+k)}^{2(1+k)}\geqslant e^{2(k-1)\varepsilon}\sigma$ .

Dictionary learning

The following lemma shows that up to polynomial reweighing the distribution $\{y=Ax\}$ gives us access to a distribution $\{W\}$ that satisfies the condition of Theorem 7.1.

The expectation of $w$ after reweighing by $x_{i}^{2}$ satisfies

The last step uses that all non-square moments of $\{x\}$ vanish. The desired bounds follow because the coefficient of $\langle a^{\scriptscriptstyle(i)},u\rangle^{2}$ is $1$ and for all indices $j\neq i$ , the coefficients of $\langle a^{\scriptscriptstyle(j)},u\rangle^{2}$ are all between $0$ and $\tau$ . For the final bounds, we also use $\lVert A^{\top}u\rVert_{2}^{2}\preceq\sigma\lVert u\rVert^{2}$ . ∎
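The effect of reweighting can be checked by hand on a small example. The sketch below is an illustration under assumptions that are not taken from the text: the distribution $\{x\}$ is a toy distribution that is uniform over vectors with exactly $s$ nonzero entries in $\{\pm 1\}$ (so all non-square moments vanish by sign symmetry), and the function names `support_sign_vectors` and `reweighted_expectation` are hypothetical. For this toy distribution the coefficient of $\langle a^{\scriptscriptstyle(i)},u\rangle^{2}$ after reweighting by $x_{i}^{2}$ is $1$ and the coefficient for each $j\neq i$ is $(s-1)/(m-1)$ , which plays the role of $\tau$ .

```python
from fractions import Fraction
from itertools import combinations, product


def support_sign_vectors(m, s):
    """All vectors in {-1, 0, 1}^m with exactly s nonzero entries."""
    vectors = []
    for support in combinations(range(m), s):
        for signs in product((-1, 1), repeat=s):
            x = [0] * m
            for idx, sign in zip(support, signs):
                x[idx] = sign
            vectors.append(x)
    return vectors


def reweighted_expectation(A, u, xs, i):
    """E[x_i^2 <Ax,u>^2] / E[x_i^2] for x uniform over xs, computed exactly.

    Also returns the inner products <a_j, u> of u with the columns of A.
    """
    n, m = len(A), len(A[0])
    col_dots = [sum(A[r][j] * u[r] for r in range(n)) for j in range(m)]
    num, den = Fraction(0), Fraction(0)
    for x in xs:
        # <Ax, u> = sum_j x_j <a_j, u>
        w = sum(x[j] * col_dots[j] for j in range(m)) ** 2
        num += x[i] ** 2 * w
        den += x[i] ** 2
    return num / den, col_dots


# Toy instance (illustrative values): n = 2, m = 4, sparsity s = 2.
A = [[1, 0, 1, 2],
     [0, 1, 1, -1]]
u = [1, 2]
m, s, i = 4, 2, 0
value, col_dots = reweighted_expectation(A, u, support_sign_vectors(m, s), i)

# Prediction: coefficient 1 for column i, (s-1)/(m-1) for every other column.
tau = Fraction(s - 1, m - 1)
predicted = col_dots[i] ** 2 + tau * sum(
    col_dots[j] ** 2 for j in range(m) if j != i)
```

In this instance `value` and `predicted` coincide, so the reweighted expectation is dominated by the chosen column up to cross terms of relative size $\tau$ .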

We will use Lemma 7.4 to reason about the distribution $\{W\}$ . Without loss of generality, we assume that $c$ is the first column of the dictionary $A$ . Let $\bar{x}=(x_{1},\ldots,x_{k^{\prime}})$ be $k^{\prime}$ independent samples from $\{x\}$ . (The distribution $\{W\}$ is the same as $\{\langle Ax_{1},u\rangle^{2}\cdots\langle Ax_{k^{\prime}},u\rangle^{2}\}$ .) We claim that the distribution $\{W\}$ satisfies (7.1) after reweighing by the function $r(\bar{x})^{2}=x_{1,1}^{2}\cdots x_{k^{\prime},1}^{2}$ (the product of the squares of the first coordinates of $x_{1},\ldots,x_{k^{\prime}}$ ). The distribution after reweighing is, up to scaling of the polynomials, equal to the distribution $\mathcal{D}=\{W=w_{1}\cdots w_{k^{\prime}}\}$ , where $w_{1},\ldots,w_{k^{\prime}}$ are independent samples from the distribution $\mathcal{D}_{1}$ in Lemma 7.4. By Lemma 7.4, this reweighted distribution satisfies the condition (7.1), that is,

Since we assume $\{x\}$ to be $(4,\tau)$ -nice, the variance of $\mathcal{D}$ is bounded by $n^{O(k)}$ .

The following theorem gives a polynomial-time algorithm for dictionary learning under $(d,\tau)$ -nice distributions for all $\tau=n^{-\Omega(1)}$ .

There exists an algorithm that for every desired accuracy $\varepsilon>0$ and overcompleteness $\sigma\geqslant 1$ solves the following problem for every $(d,\tau)$ -nice distribution with $d\geqslant d(\varepsilon,\sigma)=O(\varepsilon^{-1}\log\sigma)$ and $\tau\leqslant\tau(\varepsilon,\sigma)=(\varepsilon^{-1}\log\sigma)^{O(\varepsilon^{-1}\log\sigma)}$ in time $n^{(1/\varepsilon)^{O(1)}k}$ for $k=d+O(\tfrac{\log m}{\log(1/\tau)})$ : Given $n^{O(d)}/\operatorname{poly}(\tau)$ samples from a distribution $\{y=Ax\}$ for a $\sigma$ -overcomplete dictionary $A$ and $(d,\tau)$ -nice distribution $\{x\}$ , output a set of vectors that is $\varepsilon$ -close to the set of columns of $A$ (in symmetrized Hausdorff distance).

We will show how to use Theorem 7.5 to recover a single vector that is close to one of the columns of $A$ . By repeating this step in the same way as in the proof of Theorem 4.3 (noisy tensor decomposition) we can recover a set of vectors that is close to the set of columns of $A$ .

To recover a single vector, we estimate from the samples of $\{y=Ax\}$ a polynomial $P$ that is close to $\lVert A^{\top}u\rVert_{d}^{d}$ in the same way as in the proof of Theorem 4.2. (The distance of $P$ from $\lVert A^{\top}u\rVert_{d}^{d}$ in spectral norm will be $O(\tau d^{d})=O(\varepsilon)$ .) Next, we compute a degree- $k$ pseudo-distribution $\{u\}$ that satisfies the constraints $\{P\geqslant 1-\varepsilon,\lVert u\rVert_{2}^{2}=1\}$ . To recover all vectors, we would also add constraints $\{\langle s,u\rangle^{2}\leqslant 1-\gamma\}$ for all vectors $s$ that have already been recovered (see proof of Theorem 4.3). The same argument as in the proof of Lemma 6.1 shows that $\{u\}$ also satisfies the constraint $\{\lVert A^{\top}u\rVert_{k}^{k}\geqslant e^{O(\varepsilon)k}\}$ , which means that $\{u\}$ satisfies the premise of Theorem 7.5. Therefore, the algorithm in Theorem 7.5 recovers a vector close to one of the columns of $A$ .
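To illustrate the first step, the following sketch compares the moment polynomial $\mathbb{E}\langle y,u\rangle^{4}$ with $\lVert A^{\top}u\rVert_{4}^{4}$ at a single point $u$ . It is only a toy computation under assumptions that are not from the text: $d=4$ , the distribution $\{x\}$ is uniform over vectors with $s$ nonzero entries of equal magnitude and random signs, scaled so that $\mathbb{E}x_{i}^{4}=1$ , and the function name `fourth_moment_gap` is hypothetical. The expectation is computed exactly by enumeration rather than estimated from samples. For this distribution the gap equals $3\tau\sum_{i\neq j}\langle a^{\scriptscriptstyle(i)},u\rangle^{2}\langle a^{\scriptscriptstyle(j)},u\rangle^{2}$ with $\tau=(s-1)/(m-1)$ , so it vanishes as $\tau\to 0$ .

```python
from fractions import Fraction
from itertools import combinations, product


def fourth_moment_gap(A, u, s):
    """Return (E<y,u>^4, ||A^T u||_4^4, tau) for y = Ax, computed exactly.

    Toy assumption: x is uniform over vectors with s nonzero entries of
    equal magnitude and independent random signs, scaled so E[x_i^4] = 1.
    """
    n, m = len(A), len(A[0])
    dots = [sum(A[r][j] * u[r] for r in range(n)) for j in range(m)]  # <a_j, u>
    scale4 = Fraction(m, s)  # fourth power of the entry magnitude
    total, count = Fraction(0), 0
    for support in combinations(range(m), s):
        for signs in product((-1, 1), repeat=s):
            total += scale4 * sum(
                sg * dots[j] for j, sg in zip(support, signs)) ** 4
            count += 1
    moment = total / count
    norm4 = sum(Fraction(d) ** 4 for d in dots)
    tau = Fraction(s - 1, m - 1)  # equals E[x_i^2 x_j^2] for i != j
    return moment, norm4, tau, dots


# Toy instance (illustrative values): n = 2, m = 4, sparsity s = 2.
A = [[1, 0, 1, 2],
     [0, 1, 1, -1]]
u = [1, 2]
moment, norm4, tau, dots = fourth_moment_gap(A, u, s=2)

# Cross term: sum over i != j of <a_i,u>^2 <a_j,u>^2.
cross = sum(d * d for d in dots) ** 2 - sum(d ** 4 for d in dots)
```

In an actual run of the algorithm the expectation would be replaced by an empirical average over the samples, and the comparison would be made in spectral norm over all $u$ rather than at one point.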

Conclusions and Open Problems

The Sum of Squares method has found many uses across a variety of disciplines, and in this work we demonstrate its potential for solving unsupervised learning problems in regimes that have so far eluded other algorithms. It is an interesting direction to identify other problems that can be solved using this algorithm.

The generality of the SOS method comes at a steep cost of efficiency. It is a fascinating open problem, and one we are quite optimistic about, to use the ideas from the SOS-based algorithm to design practically efficient algorithms.

References

Appendix A Proof of Lemma 2.3

Lemma 2.3 is a consequence of the following sum-of-squares version of the AM-GM inequality. The first sum-of-squares proof of the AM-GM inequality dates back to Hurwitz in 1891 [Hur91]. For related results and sum-of-squares proofs of more general sets of inequalities, see [Rez87, Rez89, FH14].

Let $w_{1},\dots,w_{n}$ be polynomials. Suppose $w_{1},\ldots,w_{n}\succeq 0$ . Then, $\frac{w_{1}^{n}+\dots+w_{n}^{n}}{n}\succeq w_{1}\cdots w_{n}$ .

To see that this lemma implies Lemma 2.3, write for a multi-index $\alpha$ with $\lvert\alpha\rvert=s$ the polynomial $w^{\alpha}$ as a product $w^{\alpha}=\prod_{j=1}^{s}w_{i_{j}},$ where $w_{i}$ is repeated $\alpha_{i}$ times. (E.g., we would write $w_{1}^{2}w_{2}w_{3}^{2}$ as $w_{1}w_{1}w_{2}w_{3}w_{3}$ and we would have $(i_{1},\dots,i_{5})=(1,1,2,3,3)$ .) Then applying Lemma A.1 to the polynomials $w_{i_{1}},\ldots,w_{i_{s}}$ gives the inequality asserted in Lemma 2.3,

where the second inequality uses that $0\leqslant\alpha_{i}/\lvert\alpha\rvert\leqslant 1$ and the premise $w_{i}\succeq 0$ .

To prove Lemma A.1, we will give a sequence of polynomials $R_{0},\dots,R_{n-1}$ such that $R_{0}=(w_{1}^{n}+\dots+w_{n}^{n})/n$ , $R_{n-1}=w_{1}\cdots w_{n}$ , and $R_{0}\succeq\dots\succeq R_{n-1}$ . To this end, let

where $S_{n}$ denotes the symmetric group on $n$ elements. So, for instance,

The following claim will then complete the proof:

For any $k\in\{1,\dots,n-1\}$ , $R_{k-1}-R_{k}$ is a sum of squares.
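The chain $R_{0}\succeq\dots\succeq R_{n-1}$ can be sanity-checked numerically at nonnegative points. The sketch below assumes the definition $R_{k}=\frac{1}{n!}\sum_{\sigma\in S_{n}}w_{\sigma_{1}}^{n-k}w_{\sigma_{2}}\cdots w_{\sigma_{k+1}}$ . This form is an assumption chosen to match the stated endpoints $R_{0}$ and $R_{n-1}$ and the remark that consecutive polynomials differ only in the exponents of $w_{\sigma_{1}}$ and $w_{\sigma_{k+1}}$ ; the function names `R` and `amgm_chain` are illustrative. Evaluating at a point only tests a necessary consequence of the claim and is not a sum-of-squares proof.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def R(k, w):
    """Assumed interpolating polynomial R_k, evaluated at the point w.

    R_k = (1/n!) * sum over permutations s of
          w[s(1)]^(n-k) * w[s(2)] * ... * w[s(k+1)].
    """
    n = len(w)
    total = Fraction(0)
    for s in permutations(range(n)):
        term = Fraction(w[s[0]]) ** (n - k)
        for j in range(1, k + 1):
            term *= w[s[j]]
        total += term
    return total / factorial(n)


def amgm_chain(w):
    """Values R_0(w), ..., R_{n-1}(w), from the power mean down to the product."""
    return [R(k, w) for k in range(len(w))]


chain = amgm_chain([1, 2, 3])
```

For the point $(1,2,3)$ the first value is $(1+8+27)/3=12$ and the last is $1\cdot 2\cdot 3=6$ , with the intermediate value $8$ lying in between, as the claim predicts.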

For a given permutation $\sigma\in S_{n}$ , the corresponding monomials in $R_{k}$ and $R_{k-1}$ will share many of the same variables, differing only in the exponents of $w_{\sigma_{1}}$ and $w_{\sigma_{k+1}}$ . We will thus try to arrange the terms of $R_{k-1}-R_{k}$ so that we can pull out the common variables, which will let us reduce our inequality to one involving only two variables.

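Since the $w_{i}$ are sums of squares, the expression inside the braces is as well. It is therefore enough to show that $\left(w_{a}^{n-k}-w_{b}^{n-k}\right)\left(w_{a}-w_{b}\right)$ is a sum of squares. This follows from the fact that $\left(w_{a}^{n-k}-w_{b}^{n-k}\right)\left(w_{a}-w_{b}\right)=\left(w_{a}-w_{b}\right)^{2}\sum_{i=0}^{n-k-1}w_{a}^{i}w_{b}^{n-k-1-i}$ , where every summand on the right-hand side is a product of sums of squares.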