Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method
Boaz Barak, Jonathan A. Kelner, David Steurer
Introduction
The dictionary learning (also known as “sparse coding”) problem is to recover an unknown matrix (known as a “dictionary”) from examples of the form
This problem has found applications in multiple areas, including computational neuroscience [OF97, OF96a, OF96b], machine learning [EP07, MRBL07], and computer vision and image processing [EA06, MLB+08, YWHM08]. The appeal of this problem is that, intuitively, data should be sparse in the “right” representation (where every coordinate corresponds to a meaningful feature), and finding this representation can be a useful first step for further processing, just as representing sound or image data in the Fourier or wavelet bases is often a very useful preprocessing step in signal or image processing. See [SWW12, AAJ+13, AGM13, ABGM14] and the references therein for further discussion of the history and motivation of this problem.
This is a nonlinear problem, as both and are unknown, and dictionary learning is a computationally challenging task even in the noiseless case. When is known, recovering from constitutes the sparse recovery / compressed sensing problem, which has efficient algorithms [Don06, CRT06]. Hence, a common heuristic for dictionary learning is to use alternating minimization, using sparse recovery to obtain a guess for based on a guess of , and vice versa.
Recently there have been several works giving dictionary learning algorithms with rigorous guarantees on their performance [SWW12, AAJ+13, AAN13, AGM13, ABGM14]. These works differ in various aspects, but they all share a common feature: they give no guarantee of recovery unless the distribution is over extremely sparse vectors, namely having less than (as opposed to merely ) nonzero coordinates. (There have been other works dealing with the less sparse case, but only at the expense of making strong assumptions on and/or ; see Section 1.3 for more discussion of related works.)
In this work we give a different algorithm that can be proven to approximately recover the matrix even when is much denser (up to coordinates for some small constant in some settings). The algorithm works (in the sense of approximate recovery) even with noise, in the so-called overcomplete case (where ), and without making any incoherence assumptions on the dictionary.
Our algorithm is based on the Sum of Squares (SOS) semidefinite programming hierarchy [Sho87, Nes00, Par00, Las01]. The SOS algorithm is a very natural method for solving non-convex optimization problems that has found applications in a variety of scientific fields, including control theory [HG05], quantum information theory [DPS02], game theory [Par06], formal verification [Har07], and more. Nevertheless, to our knowledge this work provides the first rigorous bounds on the SOS algorithm’s running time for a natural unsupervised learning problem.
In this section we formally define the dictionary learning problem and state our result. We define a -dictionary to be an matrix such that for all , and (where is the identity matrix). The parameter is an analytical proxy for the overcompleteness of the dictionary . In particular, if the columns of are in isotropic position (i.e., is proportional to the identity), then the top eigenvalue of is its trace divided by , which equals because all of the ’s have unit norm.While we do not use it in this paper, we note that in the dictionary learning problem it is always possible to learn a linear “whitening transformation” from the samples that would place the columns in isotropic position, at the cost of potentially changing the norms of the vectors. (There also exists a linear transformation that keeps the vectors normalized [Bar98, For01], but we do not know how to learn it from the samples.) In this work we are mostly interested in the case , which corresponds to .
Equation (1.2) will motivate us in defining an analytical proxy for the condition that the distribution over coefficients is -sparse.By using an analytical proxy as opposed to requiring strict sparsity, we are only enlarging the set of distributions under consideration. However, we will make some additional conditions below, and in particular requiring low order non-square moments to vanish, that although seemingly mild compared to prior works, do restrict the family of distributions.
Specifically, in the dictionary learning case, since we are interested in learning all column vectors, we want every coordinate to be typical (for example, if the coefficient is always or always , we will not be able to learn the corresponding column vector). Moreover, a necessary condition for recovery is that every pair of coordinates is somewhat typical in the sense that the events that and are nonzero are not perfectly correlated. Indeed, suppose for simplicity that when is nonzero, it is distributed like an independent standard Gaussian. Then if those two events were perfectly correlated, recovery would be impossible since the distribution over examples would be identical if we replaced with any pair of vectors where is a rotation in the plane spanned by .
for all and for some .
for every non-square monomial of degree at most . (Here, is a multiindex and denotes the monomial . The degree of is ; we say that is non-square if is not the square of another monomial, i.e.,, if has an odd coordinate.)
Another way to justify this notion of nice distributions is that, as our analysis shows, it is a natural way to ensure that if is a column of the dictionary then the random variable for a random sample from (1.1) will be “spiky” in the sense that it will have a large -norm compared to its -norm. Thus it is a fairly clean way to enable recovery, especially in the setting (such as ours) where we don’t assume orthogonality or even incoherence between the dictionary vectors.
Modeling noise
Given a noisy dictionary learning example of the form , one can also view it (assuming we are in the non-degenerate case of having full rank) as for some (whose magnitude is controlled by the norm of and the condition number of ). If has sufficiently small magnitude, and is composed of i.i.d random variables (and even under more general conditions), the distribution will be nice as well. Therefore, we will not explicitly model the noise in the following, but rather treat it as part of the distribution which our definition allows to be only “approximately sparse”.
2 Our results
Given samples of the form for a -nice , with a sufficiently large constant (corresponding to having nonzero entries), we can approximately recover the dictionary in polynomial time as long as for some , and in quasipolynomial time as long as is a sufficiently small constant. Prior polynomial-time algorithms required the distribution to range over vectors with less than nonzero entries (and it was not known how to improve upon this even using quasipolynomial time).
For every there exists and a polynomial-time algorithm such that for every -dictionary and -nice , given samples from from , outputs with probability at least a set that is -close to .
The hidden constants in the notation may depend on . The algorithm can recover the dictionary vectors even in the relatively dense case when is (a sufficiently small) constant, at the expense of a quasipolynomial (i.e., ) running time. See Theorems 4.2 and 7.6 for a precise statement of the dependencies between the constants.
Our algorithm aims to recover the vectors up to -accuracy, with a running time as in a PTAS that depends (polynomially) on in the exponent. Prior algorithms achieving exact recovery needed to assume much stronger conditions, such as incoherence of dictionary columns. Because we have not made incoherence assumptions and have only assumed the signals obey an analytic notion of sparsity, exact recovery is not possible, and there are limitations on how precisely one can recover the dictionary vectors (even information theoretically).
We believe that it is important to understand the extent to which dictionary recovery can be performed with only weak assumptions on the model, particularly given that real-world signals are often only approximately sparse and have somewhat complicated distributions of errors. When stronger conditions are present that make better error guarantees possible, our algorithm can provide an initial solution for local search methods (or other recovery algorithms) to boost the approximate solution to a more precise one. We believe that understanding the precise tradeoffs between the model assumptions, achievable precision, and running time is an interesting question for further research.
For every , there exists and a probabilistic -time algorithm such that for every -dictionary , given a polynomial such that
outputs with probability at least a set that is -close to .
(We denote if is a sum of squares of polynomials. Also, as in Theorem 1.2, there are certain conditions under which runs in polynomial time; see Section 7.)
The condition (1.5) implies that the input to is -close to the tensor , in the sense that for every unit vector . This allows for very significant noise, since for a typical vector , we expect to be have magnitude roughly which would be much smaller than for every constant . Thus, on most of its inputs, can behave radically differently than , and in particular have many local minima that do not correspond to local minima of the latter polynomial. For this reason, it seems unlikely that one can establish a result such as Theorem 1.4 using a local search algorithm.The conditions (1.5) and are not identical for . Nevertheless, the discussion above applies to both conditions, since (1.5) does allow for to have very different behavior than .
We give an overview of our algorithm and its analysis in Section 2. Sections 4, 6 and 5 contain the complete formal proofs. In its current form, our algorithm is efficient only in the theoretical/asymptotic sense, but it is very simple to describe (modulo its calls to the SOS solver), see Figure 1. We believe that the Sum of Squares algorithm can be a very useful tool for attacking machine learning problems, yielding a first solution to the problem that can later be tailored and optimized.
3 Related work
Starting with the work of Olshausen and Field [OF96a, OF96b, OF97], there is a vast body of literature using various heuristics (most commonly alternating minimization) to learn dictionaries for sparse coding, and applying this tool to many applications. Here we focus on papers that gave algorithms with proven performance.
Independent Component Analysis (ICA) [Com94] is one method that can be used for the dictionary learning in the case the random variables are statistically independent. For the case of this was shown in [Com94, FJK96, NR09], while the works [LCC07, GVX14] extend it for the overcomplete (i.e. ) case.
Another recent line of works analyzed different algorithms, which in some cases are more efficient or handle more general distributions than ICA. Spielman, Wang and Wright [SWW12] give an algorithm to exactly recover the dictionary in the case. Agarwal, Anandkumar, Jain, Netrapalli, and Tandon [AAJ+13] and Arora, Ge and Moitra [AGM13] obtain approximate recovery in the overcomplete (i.e. ) case, which can be boosted to exact recovery under some additional conditions on the sparsity and dictionary [AAN13, AGM13]. However, all these works require the distribution to be over very sparse vectors, specifically having less than nonzero entries. As discussed in [SWW12, AGM13], sparsity seemed like a natural barrier for this problem, and in fact, Spielman et al [SWW12] proved that every algorithm of similar nature to theirs will fail to recover the dictionary when when the coefficient vector can have coordinates. The only work we know of that can handle vectors of support larger than is the recent paper [ABGM14], but it achieves this at the expense of making fairly strong assumptions on the structure of the dictionary, in particular assuming some sparsity conditions on itself. In addition to the sparsity restriction, all these works had additional conditions on the distribution that are incomparable or stronger than ours, and the works [AAJ+13, AGM13, AAN13, ABGM14] make additional assumptions on the dictionary (namely incoherence) as well.
The tensor decomposition problem is also very widely studied with a long history (see e.g., [Tuc66, Har70, Kru77]). Some recent works providing algorithms and analysis include [AFH+12, AGM12, BCMV14, BCV14]. However, these works are in a rather different parameter regime than ours— assuming the tensor is given with very little noise (inverse polynomial in the spectral norm), but on the other hand requiring very low order moments (typically three or four, as opposed to the large constant or even logarithmic number we use).
Organization of this paper
In Section 2 we give a high level overview of our ideas. Sections 4–6 contain the full proof for solving the dictionary learning and tensor decomposition problems in quasipolynomial time, where the sparsity parameter is a small constant. In Section 7 we show how this can be improved to polynomial time when for some constant .
Overview of algorithm and its analysis
The dictionary learning problem can be easily reduced to the noisy tensor decomposition problem. Indeed, it is not too hard to show that for an appropriately chosen parameter , given a sufficiently large number of examples from the distribution , the polynomial
will be roughly close (in the spectral norm) to the polynomial , where is the “niceness”/“sparsity” parameter of the distribution . Therefore, if we give as input to the tensor decomposition algorithm of Theorem 1.4, we will obtain a set that is close to the columns of .The polynomial (2.1) and similar variants have been used before in works on dictionary learning. The crucial difference is that those works made strong assumptions, such as independence of the entries of , that ensured this polynomial has a special structure that made it possible to efficiently optimize over it. In contrast, our work applies in a much more general setting.
The challenge is that because is a positive constant, no matter how many samples we take, the polynomial will always be bounded away from the tensor . Hence we must use a tensor decomposition algorithm that can handle a very significant amount of noise. This is where the Sum-of-Squares algorithm comes in. This is a general tool for solving systems of polynomial equations [Sho87, Nes00, Par00, Las01]. Given the SOS algorithm, the description of our tensor decomposition algorithm is extremely simple (see Figure 1 below). We now describe the basic facts we use about the SOS algorithm, and sketch the analysis of our noisy tensor decomposition algorithm. See the survey [BS14] and the references therein for more detail on the SOS algorithm, and Sections 4, 5 and 6 for the full description of our algorithm and its analysis (including its variants that take polynomial time at the expense of requiring dictionary learning examples with sparser coefficients).
The SOS algorithm is a method, based on semidefinite programming, for solving a system of polynomial equations. Alas, since this is a non-convex and NP-hard problem, the algorithm doesn’t always succeed in producing a solution. However, it always returns some object, which in some sense can be interpreted as a “distribution” over solutions of the system of equations. It is not an actual distribution, and in particular we cannot sample from and get an individual solution, but we can compute low order moments of . Specifically, we make the following definition:
Numerical accuracy will never play an important role in our results, and so we can just assume that we can always find in time a degree- pseudo-distribution satisfying given polynomial constraints, if such a pseudo-distribution exists.
2 Noisy tensor decomposition
Our basic noisy tensor decomposition algorithm is described in Figure 1. This algorithm finds (a vector close to) a column of with inverse polynomial probability. Using similar ideas, one can extend it to an algorithm that outputs all vectors with high probability; we provide the details in Section 6. Following the approach of [BKS14], our analysis of this algorithm proceeds in two phases:
We show that if the pseudo-distribution obtained in Step 1 is an actual distribution, then the vector output in Step 3 is close to one of the columns of .
We then show that the arguments used in establishing (i) generalize to the case of pseudo-distributions as well.
The first part is actually not so surprising. For starters, every unit vector that maximizes must be highly correlated with some column of . Indeed, for every column of , and hence the maximum of over a unit is at least . But if for every column then must be much smaller than . Indeed, in this case
Since , this implies that, as long as , (and thus also ) is much smaller than .
Therefore, if obtained in Step 1 is an actual distribution, then it would be essentially supported on the set of the columns of and their negations. Let us suppose that is simply the uniform distribution over . (It can be shown that this essentially is the hardest case to tackle.) In this case the matrix considered in Step 3 can be written as
where is the polynomial selected in Step 2. (This uses the fact that this polynomial is a product of linear functions and hence satisfies for all .) If satisfies
Part (ii). The above argument establishes (i), but this is all based on a rather bold piece of wishful thinking— that the object we obtained in Step 1 of the algorithm was actually a genuine distribution over unit vectors maximizing . In actuality, we can only obtain the much weaker guarantee that is a degree pseudo-distribution for some . (An actual distribution corresponds to a degree- pseudo-distribution.) The technical novelty of our work lies in establishing (ii). The key observation is that in all our arguments above, we never used any higher moments of , and that all the inequalities we showed boil down to the simple fact that a square of a polynomial is never negative. (Such proofs are known as Sum of Squares (SOS) proofs.)
applying it to the vector (where we denote ). The first (and most major) obstacle in giving a low degree “Sum of Squares” proof for (2.4) is that this is not a polynomial inequality. To turn it into one, we replace the norm with the norm for some large ( will do). If we replace with in (2.4), and raise it to the -th power then we obtain the inequality
which is a valid inequality between polynomials in whenever is an integer multiple of (which we can ensure).
We now need to find a sum-of-squares proof for this inequality, namely that the right-hand side of (2.5) is equal to the left-hand side plus a sum of squares, that is, we are to show that for ,
By expanding the -th powers in this expression, we rewrite this polynomial inequality as
where the summations involving are over degree- multiindices , and denotes the multinomial coefficient . We will prove (2.6) term by term, i.e., we will show that for every multiindex . Since , it is enough to show that . This is implied by the following general inequality, which we prove in Appendix A:
Let be polynomials. Suppose . Then, for every multiindex ,
We note that is even, so is a square, as required by the lemma.
For the case that is a power of , the inequality in the lemma follows by repeatedly applying the inequality , which in turn holds because the difference between the two sides equals . As a concrete example, we can derive in this way,
(The first two steps use the inequality . The last step uses that both and are sum of squares.)
Once we have an SOS proof for (2.5) we can conclude that it holds for pseudo-distributions as well, and in particular that for every pseudo-distribution of degree at least satisfying ,
We use similar ideas to port the rest of the proof to the SOS setting, concluding that whenever is a pseudo-distribution that satisfies and , then with inverse polynomial probability it will hold that
Preliminaries
We use the notation of pseudo-expectations and pseudo-distributions from Section 2.1. We now state some basic useful facts about pseudo-distributions, see [BS14, BKS14, BBH+12] for a more comprehensive treatment.
One useful property of pseudo-distributions is that we can find actual distribution that match their first two moments.
Another property we will use is that we can reweigh a pseudo-distribution by a positive polynomial to obtain a new pseudo-distribution that corresponds to the operation on actual distributions of reweighing the probability of an element proportional to .
but since is a sum of squares, is also a sum of squares and hence the denominator of the left-hand side is non-negative, while the numerator is by assumption positive. ∎
Dictionary Learning
We now state our formal theorem for dictionary learning. The following definition of nice distributions captures formally the conditions needed for recovery. (It is equivalent up to constants to the definition of Section 1.1, see Remark 4.4 below.)
We can now state our result for dictionary learning in quasipolynomial time. The result for polynomial time is stated in Section 7.
There exists an algorithm that for every desired accuracy and overcompleteness solves the following problem for every -nice distribution with and in time : Given samples from a distribution for a -overcomplete dictionary and -nice distribution , output a set of vectors that is -close to the set of columns of (in symmetrized Hausdorff distance).
output a set of vectors that is -close to the set of columns of (in symmetrized Hausdorff distance).
We will prove Theorem 4.3 (noisy tensor decomposition) in Section 5 and Section 6. At this point, let us see how it yields Theorem 4.2 (dictionary learning, quasipolynomial time). The following lemma gives the connection between tensor decomposition and dictionary learning.
Therefore, we can apply the algorithm in Theorem 4.3 (noisy tensor decomposition) for noise parameter to obtain a set of unit vectors that is -close to the set of columns of (in symmetrized Hausdorff distance). ∎
Sampling pseudo-distributions
In this section we will develop an efficient algorithm that behaves in certain ways like a hypothetical sampling procedure for low-degree pseudo-distributions. (Sampling procedures, even inefficient or approximate ones, cannot exist in general for low-degree pseudo-distributions [Gri01, Sch08].) This algorithm will be a key ingredient of our algorithm for Theorem 4.3 (noisy tensor decomposition, quasipolynomial time).
The algorithm in the following theorem achieves the above property of sampling procedures with the key advantage that it applies to any low-degree pseudo-distributions.
The result follows from the following lemmas.
Furthermore, there exists a randomized algorithm that runs in time and computes such a polynomial with probability at least .
Let be independent samples from the distribution . Then, let . The expectation of this random polynomial satisfies
The following bound shows that there exists a polynomial that satisfies the conclusion of the lemma,
Since has density in , it also follows that
Noisy tensor decomposition
In this section we will prove Theorem 4.3 (noisy tensor decomposition, quasi-polynomial time).
output a set of vectors that is -close to the set of columns of (in symmetrized Hausdorff distance).
First, we claim that the pseudo-distribution also satisfies the constraint where . The proof of this claim follows by a sum-of-squares version of the following form of Hölder’s inequality,
See the overview section for a proof of this fact. By substituting and using the facts that and that satisfies the constraint , we get that satisfies , which implies the claim because .
While there exists a degree- pseudo-distribution that satisfies the constraints and for every :
Add the vector to the set .
Next we claim that every vector in is close to one of the columns of . Indeed, every such vector satisfies , which by Lemma 6.1 implies that for a column of .
Next we claim that if the algorithm terminates then for every column of there exists a vector with . Indeed, if there exists a column that violates this condition, then it would satisfy all constraints for the pseudo-distribution, which means that the algorithm does not terminate at this point.
To finish the proof of the theorem it remains to bound the number of iterations of the algorithm. We claim that the number of iterations is bounded by the number of columns of because in each iteration the vectors in will cover at least one more of the columns of . As observed before, every vector is close to a column of in the sense that . However, since satisfies , we have by triangle inequality , which means that . Therefore, the vector is not close to any of the vectors for , which means that it has to be close to another column of . (Here, we are again assuming that was chosen so that is a large enough constant.) ∎
By adapting the algorithm somewhat we can also achieve recovery guarantees for columns with significantly smaller norm than the maximum norm. Concretely, we can modify the algorithm so that we ask for pseudo-distributions satisfying , where is a parameter that we gradually decrease so we can get all the vectors. However, we need to also change the right-hand side of the constraint to a value that decreases with . Otherwise, the algorithm might not terminate, as there can be exponentially vectors that are somewhat far from a column vector , and all of them will have fairly large value for . Such a modified algorithm can still obtain all the column vectors (up to a small error) if we assume that the they are sufficiently incoherent. That is, for every distinct columns , of with depending on the norm ratios. Similar (and in fact often stronger) assumptions were made in prior works on dictionary learning. (However, we need these assumptions only when the vectors have different norms.)
Polynomial-time algorithms
In this section we show how we can improve our tensor decomposition algorithm when we have access to examples of very sparse linear combinations of the dictionary columns, culminating in Theorem 7.6 that gives a polynomial-time algorithm for the dictionary problem for the case the distribution is -nice for .
The following theorem refines Theorem 5.1 (sampling pseudo-distributions) reconstructing a vector that is close to a target vector . We make an additional assumption about having access to samples from a distribution over sum-of-squares polynomials. This distribution comes with a noise parameter that controls how well the distribution correlated with the target vector . If this noise parameter is sufficiently small, samples from distribution allow the algorithm to work under a more refined but milder condition on the pseudo-distribution . For our dictionary learning algorithm, we can satisfy this condition when the noise parameter of the distribution satisfies . (The noise parameter roughly coincides with the niceness parameter of the distribution .)
The pseudo-distribution has degree and satisfies the polynomial constraint and the conditions
The following lemma is the main new ingredient of the proof of this theorem.
Note that the conclusion of the lemma implies that we can recover a vector with using Theorem 5.1 in time with probability . Therefore, Theorem 7.1 follows by combining Lemma 7.2 with Theorem 5.1.
Here, the last step uses the assumption to bound the series . It follows that
2 Tensor decomposition
The following lemma shows that a pseudo-distribution that satisfies the constraints also satisfies the condition of Theorem 7.1 for one of the columns of the dictionary .
It follows that the pseudo-distribution cannot satisfy the constraint .
3 Dictionary learning
The following lemma shows that up to polynomial reweighing the distribution gives us access to a distribution that satisfies the condition of Theorem 7.1.
The expectation of after reweighing by satisfies
The last step uses that all non-square moments of vanish. The desired bounds follow because the coefficient of is and for all indices , the coefficients of are all between and . For the final bounds, we also use . ∎
We will use Lemma 7.4 to reason about the distribution . Without loss of generality, we assume that is the first column of the dictionary . Let be independent samples from . (The distribution is the same as .) We claim that the distribution satisfies (7.1) after reweighing by the function (the product of the square of the first coordinates of ). The distribution after reweighing is, up to scaling of the polynomials, equal to the distribution , where are independent samples from the distribution in Lemma 7.4. By Lemma 7.4, this reweighted distribution satisfies the condition (7.1), that is,
Since we assume to be -nice, the variance of is bounded by .
The following theorem gives a polynomial time algorithm for dicionary learning under -nice distributions for all .
There exists an algorithm that for every desired accuracy and overcompleteness solves the following problem for every -nice distribution with and in time for : Given samples from a distribution for a -overcomplete dictionary and -nice distribution , output a set of vectors that is -close to the set of columns of (in symmetrized Hausdorff distance).
We will show how to use Theorem 7.5 to recover a single vector that is close to one of the columns of . By repeating this step in the same way as in the proof of Theorem 4.3 (noisy tensor decomposition) we can recover a set of vectors that is close to the set of columns of .
To recover a single vector, we estimate from the samples of a polynomial that is close to in the same way as in the proof of Theorem 4.2. (The distance of from in spectral norm will be .) Next, we compute a degree- pseudo-distribution that satisfies the constraints .To recover all vectors, we would also add constraints for all vectors that have already been recovered (see proof of Theorem 4.3). The same argument as in the proof of Lemma 6.1 shows that also satisfies the constraint , which means that satisfies the premise of Theorem 7.5. Therefore, the algorithm in Theorem 7.5 recovers a vector close to one of the columns of .
Conclusions and Open Problems
The Sum of Squares method has found many uses across a variety of disciplines, and in this work we demonstrate its potential for solving unsupervised learning problems in regimes that have so far eluded other algorithms. It is an interesting direction to identify other problems that can be solved using this algorithm.
The generality of the SOS method comes at a steep cost of efficiency. It is a fascinating open problem, and one we are quite optimistic about, to use the ideas from the SOS-based algorithm to design practically efficient algorithms.
References
Appendix A Proof of Lemma 2.3
Lemma 2.3 is a consequence of the following sum-of-squares version of the AM-GM inequality.The first sum-of-squares proof of the AM-GM inequality dates back to Hurwitz in 1891 [Hur91]. For related results and sums-of-squares proofs of more general sets of inequalities, see [Rez87, Rez89, FH14].
Let be polynomials. Suppose . Then,
To see that this lemma implies Lemma 2.3, write for a multi-index with the polynomial as a product where is repeated times. (E.g., we would write as and we would have .) Then applying Lemma A.1 to the polynomials gives the inequality asserted in Lemma 2.3,
where the second inequality uses that and the premise .
To prove Lemma A.1, we will give a sequence of polynomials such that , , and . To this end, let
where denotes the symmetric group on elements. So, for instance,
The following claim will then complete the proof:
For any , is a sum of squares.
For a given permutation , the corresponding monomials in and will share many of the same variables, differing only in the exponents of and . We will thus try to arrange the terms of so that we can pull out the common variables, which will let us reduce our inequality to one involving only two variables.
Since the are sums of squares, the expression inside the braces is as well. It is therefore enough to show that is a sum of squares. This follows from the fact that