A Size-Free CLT for Poisson Multinomials and its Applications

Constantinos Daskalakis, Anindya De, Gautam Kamath, Christos Tzamos

Introduction

In theoretical computer science, PMDs are commonly used in the analysis of randomized algorithms, often through large deviation inequalities. They have also found applications in algorithmic problems where one is looking for a collection of random vectors optimizing a certain probabilistic objective, or satisfying probabilistic constraints. For example, understanding the behavior of PMDs has led to polynomial-time approximation schemes for anonymous games [Mil96, Blo99, Blo05, Kal05, DP07, DP08, DP09], despite the PPAD-completeness of their exact equilibria [CDO15]. Anonymous games are games where a large number nn of players share the same kk strategies, and each player’s utility only depends on his own choice of strategy and the number of other players that chose each of the kk strategies. In particular, the expected payoff of each player depends on the PMD resulting from the mixed strategies of the other players. It turns out that understanding the behavior of PMDs provides a handle on the computation of approximate Nash equilibria. One of our main contributions is to advance the state of the art for computing approximate Nash equilibria in anonymous games. We will come to this contribution shortly.

Recently Valiant and Valiant have used PMDs to obtain sample complexity lower bounds for testing symmetric properties of distributions [VV11]. The workhorse in their lower bounds is a new CLT bounding the total variation distance between a (n,k)(n,k)-GMD and a multidimensional Gaussian with the same mean vector and covariance matrix. Since they are comparing a discrete to a continuous distribution under the total variation distance, they need to discretize the Gaussian by rounding its coordinates to their closest point in the integer lattice. If XX is distributed according to some (n,k)(n,k)-GMD with mean vector μ\mu and covariance matrix Σ\Sigma, and YY is distributed according to the multi-dimensional Gaussian N(μ,Σ){\cal N}(\mu,\Sigma), [VV11] shows that:

where σ2\sigma^{2} is the minimum eigenvalue of Σ\Sigma and Y\lfloor Y\rceil denotes the rounding of YY to the closest point in the integer lattice. The dependence of the bound on the dimension kk and the minimum eigenvalue σ2\sigma^{2} is necessary, and quite typical of Berry-Esseen type bounds. Answering a question raised in [VV11], we prove a qualitatively stronger CLT by showing that the explicit dependence of the bound on nn can be removed (hence, the CLT is “size-free”).

Suppose that XX is distributed according to some (n,k)(n,k)-GMD with mean μ\mu and covariance matrix Σ\Sigma, and YN(μ,Σ)Y\sim{\cal N}(\mu,\Sigma). There exists some constant C>0C>0 such that

where σ2\sigma^{2} is the minimum eigenvalue of Σ\Sigma.

Interestingly, Theorem 1 is proven by bootstrapping the Valiant-Valiant CLT itself. Indeed, this CLT was used as one of the key ingredients in a recent structural characterization of PMDs [DKT15], where it was shown that any (n,k)(n,k)-Poisson multinomial random vector is ε\varepsilon-close in total variation distance to the sum of an (appropriately discretized) Gaussian and a (poly(k/ε),k)({\rm poly}(k/\varepsilon),k)-Poisson multinomial random vector; see Theorem 6. In turn, we prove Theorem 1 by using Theorem 6 as a black box.

For more details on our proof’s approach, see Section 3.

In the remainder of this section we discuss the algorithmic applications of our CLT, concluding with our improved algorithms for learning PMDs using Fourier analysis.

We have already discussed anonymous games earlier in this section, where we have also explained their relation to PMDs. In particular, the expected utility uiu_{i} of some player ii in a nn-player kk-strategy anonymous game only depends on his own choice of mixed strategy XiX_{i} and the (n1,k)(n-1,k)-Poisson multinomial random vector jiXj\sum_{j\neq i}X_{j} aggregating the mixed strategies of his opponents. It is therefore natural to expect that a better understanding of the structure of PMDs could lead to improved algorithms for computing Nash equilibria in these games. Indeed, earlier work [DP08, DP15] has exploited this connection to obtain algorithms for approximate Nash equilibria, whose running time is

While clearly of theoretical interest, this bound shows that anonymous games are one of the few classes of games where approximate equilibria can be efficiently computed, while exact equilibria are PPAD-hard [CDO15], even for nn-player 77-strategy anonymous games. Exploiting our CLT we obtain a significant improvement over [DP08].

An ε\varepsilon-approximately well supported Nash equilibrium of an nn-player kk-strategy anonymous games whose utilities are in canbecomputedintime:AsitiscustomaryinNashequilibriumalgorithms,approximateNashequilibriaaredefinedwithrespecttoadditiveapproximationsandtheplayerutilitiesarenormalizedtocan be computed in time:As it is customary in Nash equilibrium algorithms, approximate Nash equilibria are defined with respect to additive approximations and the player utilities are normalized to to make these approximations meaningful.

The salient feature of Theorem 2 is the polynomial dependence of the running time on nn and its quasi-polynomial dependence on ε1\varepsilon^{-1}. In terms of these dependencies our algorithm matches the best known algorithm for 22-strategy anonymous games [DP09], where much more is known given the single-dimensional nature of (n,2)(n,2)-PMDs.

Moreover, the recent hardness results for anonymous games [CDO15] establish that not only finding an exact but also a 2na2^{n^{a}}-approximate Nash equilibrium is PPAD-hard. An interesting corollary of Theorem 2 is that this cannot be pushed to poly(1/n){\rm poly}(1/n)-approximations, unless PPAD can be solved in quasi-polynomial time.

Unless PPAD \subseteq Quasi-PTIME, it is not PPAD-hard to find a poly(1/n){\rm poly}(1/n)-approximately well supported Nash equilibrium in anonymous games, for any poly(){\rm poly}(\cdot).

It is interesting to contrast this corollary with normal-form games where it is known that computing inverse polynomial approximations is PPAD-hard [DGP09, CDT09].

From a technical standpoint, our algorithm for anonymous games uses the structural understanding of PMDs as follows. Since every player views the aggregate strategies of the other players as a PMD, one approach would be to guess each player’s view using a cover as developed in [DKT15]. However, this approach gives a runtime which is exponential in nn, since it requires us to enumerate the cover for each player. An alternative approach is to guess the overall PMD which occurs at a Nash equilibrium, and guess appropriate “corrections” that allow us to infer each player’s view. To do this, we must find an alternative PMD which approximately matches the PMD at Nash in the following sense:

The PMD that results by removing the CRV corresponding to a player should be close to the view that the player observes;

A player’s CRV must only assign probability to strategies which are approximate best responses to his view.

It turns out that these conditions can be satisfied by using a careful dynamic program together with the structural understanding provided by [DKT15] and the CLT of Theorem 1. According to this structural result, we can partition the players into a “sparse” and a “Gaussian” component. Moreover, our CLT implies that matching the first two moments of the Gaussian suffices to approximate this component. This allows us to perform guesses at a different granularity for the sparse and Gaussian components. Roughly speaking, our dynamic program guesses a succinct representation of the two components and tries to compute CRVs which obey this representation and satisfy the conditions outlined above.

For more details on our PTAS, refer to Section 4.

Moreover, we can efficiently enumerate this cover in time polynomial in its size.

It is important to contrast Theorem 3 with Theorem 2 in [DKT15], which provides a non-proper cover whose size is similar, albeit with a leading factor of nO(k2)n^{O(k^{2})}. Instead, our cover is proper, which is important for approximation algorithms that require searching over PMDs. Its dependence on nn is also optimal, as the number of (n,k)(n,k)-PMDs whose summands are deterministic is already nΩ(k)n^{\Omega(k)}. Moreover, we provide a lower bound for the dependence on 1/ε1/\varepsilon, establishing that the quasi-polynomial dependence is also essentially optimal.

We describe our proper cover construction in two parts. First, we give details on how to construct a non-proper cover of size nO(k)n^{O(k)}. The main tool we use is the existence of spectral sparsifiers for Laplacian matrices. Our non-proper cover sparsifies the non-proper cover of [DKT15], showing how its leading factor of nO(k2)n^{O(k^{2})} can be reduced to nO(k)n^{O(k)}. Roughly speaking, the factor of nO(k2)n^{O(k^{2})} was due to spectrally approximating all possible covariance matrices Σ\Sigma, whose O(k2)O(k^{2}) entries are bounded by nn. These covariance matrices corresponded to covariance matrices of (n,k)(n,k)-PMDs, and the cover maintained for each such Σ\Sigma some Σ\Sigma^{\prime} such that vT(ΣΣ)vpoly(ε/k)vTΣv,v|v^{T}(\Sigma-\Sigma^{\prime})v|\leq{\rm poly}(\varepsilon/k)\cdot v^{T}\Sigma v,\forall v. (We call this guarantee a “poly(ε/k){\rm poly}(\varepsilon/k)-spectral approximation.”) The realization leading to our sparsification result is that covariance matrices of PMDs are in fact graph Laplacians. Indeed, a (n,k)(n,k)-PMD, X=iXiX=\sum_{i}X_{i}, has covariance matrix, cov(X)=icov(Xi){\rm\bf cov}(X)=\sum_{i}{\rm\bf cov}(X_{i}), corresponding to the sum of the covariance matrices of its summands. Now the covariance matrix of a kk-CRV, XiX_{i}, is actually the Laplacian of a graph that has one node jj per dimension, along with an edge from node jj to node jj^{\prime} of weight E[Xij]E[Xij]{\bf E}[X_{ij}]\cdot{\bf E}[X_{ij^{\prime}}]; and the covariance matrix of a (n,k)(n,k)-PMD is the Laplacian of the graph with the sum of the weights from each constituent kk-CRV– see Observation 1. We show that Laplacians corresponding to (n,k)(n,k)-PMDs can be poly(ε/k){\rm poly}(\varepsilon/k)-spectrally covered with a set of covariance matrices of size nO(k)(kε)O(k3)n^{O(k)}\cdot\left({k\over\varepsilon}\right)^{O(k^{3})}.

We appeal to recent results in spectral sparsification of Laplacian matrices [ST11, SS11, BSS12, BSST13]. In particular, we use the result of Batson, Spielman, and Srivastava [BSS12] (Theorem 8) to argue that the underlying graph can be sparsified to linearly many edges in the dimension kk. We do this in the hopes that we would have fewer parameters in the covariance matrix to guess. Unfortunately, the [BSS12] sparsification theorem has polynomial dependence in the accuracy. So applying it with a poly(ε/k){\rm poly}(\varepsilon/k)-approximation error, which is what we need, gives a meaningless result (namely no sparsification at all). Instead, we only use this theorem to get a rough O(1)O(1)-spectral cover of (n,k)(n,k)-PMD covariance matrices. Around every covariance matrix in this rough cover we grow a local poly(ε/k){\rm poly}(\varepsilon/k)-spectral cover. Roughly speaking, as the O(1)O(1)-spectral cover provides multiplicative approximation to the variance in every direction vv, every covariance matrix in this cover gives us a multiplicative handle on the eigenvalues of the matrices approximated by it. This is sufficient information to cover these matrices to poly(ε/k){\rm poly}(\varepsilon/k)-spectral error with a “local” spectral cover of size (k/ε)O(k2)(k/\varepsilon)^{O(k^{2})}– see Lemma 6. Putting everything together, we get a poly(ε/k){\rm poly}(\varepsilon/k)-spectral cover of all covariance matrices of (n,k)(n,k)-PMDs of size nO(k)(kε)O(k3)n^{O(k)}\cdot\left({k\over\varepsilon}\right)^{O(k^{3})}– see Section 5.1.3. As covering these matrices was the bottleneck in the size of the non-proper cover, this completes the construction of a non-proper cover whose size is (4).

Further details on our non-proper construction are provided in Section 5.

Details on this conversion process are provided in Section 6.

Our lower bound is described further in Section 7. Our technique shows a lower bound on the metric entropy of a polynomial map of the moments of PMDs using an extension of Bézout’s theorem and other tools from algebraic geometry.

Finally, we give a new learning algorithm for PMDs:

This improves the learning algorithm from [DKT15] by eliminating the superpolynomial dependence on ε\varepsilon in the running time that was obtained in that paper. Our algorithm exploits properties of the continuous Fourier transform of a PMD, as opposed to recent work by Diakonikolas, Kane and Stewart on learning univariate sums of independent integer random variables, which uses the discrete Fourier transform [DKS16b]. They also apply similar discrete Fourier techniques in their simultaneous work on PMDs [DKS16a].

We note that such Fourier-based learning algorithms may simply output a description of the Fourier transform of a distribution. This allows one to compute the PMF of the distribution at any point of interest, but it is not obvious how to sample from such a description. Our algorithm outputs an explicit description of a distribution, which allows one to efficiently (i.e., in time independent of nn) draw samples from the distribution. In contrast, they output the Fourier transform of a distribution and describe how to sample from it.

For more details on our learning algorithm, refer to Section 8.

1 Comparison of Results with [DKS16a]

Simultaneous to our work, Diakonikolas, Kane, and Stewart also studied Poisson Multinomial distributions [DKS16a]. In this section, we describe and compare their results with ours. While both papers independently prove many qualitatively similar results, the techniques are quite different, and thus both may be of independent interest.

Both papers prove new CLTs, which manage to remove the dependence on nn which is found in the CLT of [VV11], while the dependence on kk and 1/σ1/\sigma remains polynomial. Additionally, both works improve upon the previous best covers for PMDs [DKT15]. First, both manage to reduce the size of the cover – interestingly, the two improvements seem to be orthogonal. Our result improves the dependence on nn from nk2n^{k^{2}} to nO(k)n^{O(k)}, while theirs improves the dependence on kk and 1/ε1/\varepsilon from (1/ε)O(k5klogk+1(1/ε))(1/\varepsilon)^{O(k^{5k}\log^{k+1}(1/\varepsilon))} to (1/ε)O(klog(k/ε)/loglog(k/ε))k1(1/\varepsilon)^{O(k\log(k/\varepsilon)/\log\log(k/\varepsilon))^{k-1}}. We note that this upper bound holds for k>2k>2: for k=2k=2, [DKS16b] proves the tight cover size bound of n(1/ε)Θ(log(1/ε))n\cdot(1/\varepsilon)^{\Theta(\log(1/\varepsilon))}. Furthermore, both papers describe how to efficiently achieve a proper cover of this size. These cover sizes are asymptotically optimal, as shown by lower bounds in both papers. In particular, the double-exponential dependence in kk is necessary. Both works also consider the problem of finding approximate Nash equilibria in anonymous games. The complexity of both algorithms is roughly comparable to the PMD cover size. Finally, both papers study the learning of PMDs, obtaining algorithms with sample complexity poly(k,log(1/ε))k/ε2\operatorname*{poly}(k,\log(1/\varepsilon))^{k}/\varepsilon^{2}. The runtime of our algorithm is poly(k/ε)k2\operatorname*{poly}(k/\varepsilon)^{k^{2}}, and the runtime of their algorithm is poly(k,log(1/ε))k/ε2logn\operatorname*{poly}(k,\log(1/\varepsilon))^{k}/\varepsilon^{2}\cdot\log n, both in the standard word RAM model.

Preliminaries

We more formally define several of the distribution classes we consider.

A kk-Categorical Random Variable (kk-CRV) is a random variable that takes values in {e1,,ek}\{e_{1},\dots,e_{k}\} where eje_{j} is the kk-dimensional unit vector along direction jj. π(i)\pi(i) is the probability of observing eie_{i}.

An (n,k)(n,k)-Poisson Multinomial Distribution ((n,k)(n,k)-PMD) is given by the law of the sum of nn independent but not necessarily identical kk-CRVs. An (n,k)(n,k)-PMD is parameterized by a nonnegative matrix πn×k\pi\in^{n\times k} each of whose rows sum to 11 is denoted by MπM^{\pi}, and is defined by the following random process: for each row π(i,)\pi(i,\cdot) of matrix π\pi interpret it as a probability distribution over the columns of π\pi and draw a column index from this distribution. Finally, return a row vector recording the total number of samples falling into each column (the histogram of the samples).

We note that a sample from an (n,k)(n,k)-PMD is redundant – given k1k-1 coordinates of a sample, we can recover the final coordinate by noting that the sum of all kk coordinates is nn. For instance, while a Binomial distribution is over a support of size 22, a sample is 11-dimensional since the frequency of the other coordinate may be inferred given the parameter nn. With this inspiration in mind, we define the Generalized Multinomial Distribution, which is the primary object of study in [VV11].

A Truncated kk-Categorical Random Variable is a random variable that takes values in {0,e1,,ek1}\{0,e_{1},\dots,e_{k-1}\} where eje_{j} is the (k1)(k-1)-dimensional unit vector along direction jj, and is the (k1)(k-1) dimensional zero vector. ρ(0)\rho(0) is the probability of observing the zero vector, and ρ(i)\rho(i) is the probability of observing eie_{i}.

An (n,k)(n,k)-Generalized Multinomial Distribution ((n,k)(n,k)-GMD) is given by the law of the sum of nn independent but not necessarily identical truncated kk-CRVs. A GMD is parameterized by a nonnegative matrix ρn×(k1)\rho\in^{n\times(k-1)} each of whose rows sum to at most 11 is denoted by GρG^{\rho}, and is defined by the following random process: for each row ρ(i,)\rho(i,\cdot) of matrix ρ\rho interpret it as a probability distribution over the columns of ρ\rho – including, if j=1kρ(i,j)<1\sum_{j=1}^{k}\rho(i,j)<1, an “invisible” column – and draw a column index from this distribution. Finally, return a row vector recording the total number of samples falling into each column (the histogram of the samples).

For both (n,k)(n,k)-PMDs and (n,k)(n,k)-GMDs, we will refer to nn and kk as the size and dimension, respectively.

We note that a PMD corresponds to a GMD where the “invisible” column is the zero vector, and thus the definition of GMDs is more general than that of PMDs. However, whenever we refer to a GMD in this paper, it will explicitly have a non-zero invisible column.

While we will approximate the Multinomial distribution with Gaussian distributions, it does not make sense to compare discrete distributions with continuous distributions, since the total variation distance is always 11. As such, we must discretize the Gaussian distributions. We will use the notation x\lfloor x\rceil to say that xx is rounded to the nearest integer (with ties being broken arbitrarily). If xx is a vector, we round each coordinate independently to the nearest integer.

As seen in the definition of an (n,k)(n,k)-GMD, we have one coordinate which is equal to nn minus the sum of the other coordinates. We define a similar notion for a discretized Gaussian. However, we go one step further, to take care of when there are several such Gaussians which live in disjoint dimensions. By this, we mean that given two Gaussians, the set of directions in which they have a non-zero variance are disjoint. Without loss of generality (because we can simply relabel the dimensions), we assume all of a Gaussian’s non-zero variance directions are consecutive, i.e., the covariance matrix is all zeros, except for a single block on the diagonal. Therefore, when we add the covariance matrices, the result is block diagonal. The resulting distribution is described in the following definition.

The structure preserving rounding of a multidimensional Gaussian Distribution takes as input a multi-dimensional Gaussian N(μ,Σ)\mathcal{N}(\mu,\Sigma) with Σ\Sigma in block-diagonal form. It chooses one coordinate as a “pivot” in each block, samples from the Gaussian ignoring these pivots and rounds each value to the nearest integer. Finally, the pivot coordinate of each block is set by taking the difference between the sum of the means and the sum of the values sampled within the block.

Finally, we formally define the notion of a cover.

2 Probability Metrics

To compare probability distributions, we will require the total variation and Kolmogorov distances:

The total variation distance between two probability measures PP and QQ on a σ\sigma-algebra FF is defined by

Unless explicitly stated otherwise, in this paper, when two distributions are said to be ε\varepsilon-close, we mean in total variation distance.

The Kolmogorov distance between two probability measures PP and QQ with CDFs FPF_{P} and FQF_{Q} is defined by

We note that Kolmogorov distance is, in general, weaker than total variation distance. In particular, total variation distance between two distributions is lower bounded by the Kolmogorov distance.

3 Miscellaneous Lemmata

We will use the following tools for bounding total variation distance between various random variables.

Let X,XX,X^{\prime} be two random variables over a domain Ω\Omega. Fix any (possibly randomized) function FF on Ω\Omega (which may be viewed as a distribution over deterministic functions on Ω\Omega) and let F(X)F(X) be the random variable such that a draw from F(X)F(X) is obtained by drawing independently xx from XX and ff from FF and then outputting f(x)f(x) (likewise for F(X)F(X^{\prime})). Then we have

Let X1,,XnX_{1},\dots,X_{n} be independent random variables, with E[Xi]=0,E[Xi2]=σi2>0,E[Xi3]=ρi<E[X_{i}]=0,E[X_{i}^{2}]=\sigma_{i}^{2}>0,E[|X_{i}|^{3}]=\rho_{i}<\infty, and define X=i=1nXi,σ2=i=1nσi2,ρ=i=1nρiX=\sum_{i=1}^{n}X_{i},\sigma^{2}=\sum_{i=1}^{n}\sigma_{i}^{2},\rho=\sum_{i=1}^{n}\rho_{i}. Then for an absolute constant C00.56C_{0}\leq 0.56,

Given two kk-dimensional Gaussians N1=N(μ1,Σ1),N2=N(μ2,Σ2)\mathcal{N}_{1}=\mathcal{N}(\mu_{1},\Sigma_{1}),\mathcal{N}_{2}=\mathcal{N}(\mu_{2},\Sigma_{2}) such that for all i,j[k]i,j\in[k], Σ1(i,j)Σ2(i,j)α|\Sigma_{1}(i,j)-\Sigma_{2}(i,j)|\leq\alpha, and the minimum eigenvalue of Σ1\Sigma_{1} is at least σ2α\sigma^{2}\geq\alpha,

In addition, we prove the following general purpose lemma showing that two multivariate Gaussians with spectrally-close moments are close in total variation distance. This is intended to be a multivariate version of Proposition B.4 of [DDO+13], which proves a similar statement for univariate Gaussians. The proof appears in Section A.

Suppose there exist two kk-dimensional Gaussians, XN(μ1,Σ1)X\sim\mathcal{N}(\mu_{1},\Sigma_{1}) and YN(μ2,Σ2)Y\sim\mathcal{N}(\mu_{2},\Sigma_{2}), such that for all unit vectors vv,

4 Results on PMDs from [DKT15]

Our work builds upon recent structural results on PMDs [DKT15]. We recall some of the key results which we will refer to in this paper.

Two key parameters used in this paper are c=c(ε,k)=poly(ε/k)c=c(\varepsilon,k)=\operatorname*{poly}(\varepsilon/k) and t=t(ε,k)=poly(k/ε)t=t(\varepsilon,k)=\operatorname*{poly}(k/\varepsilon), set as c=(ε2k5)1+δcc=\left(\frac{\varepsilon^{2}}{k^{5}}\right)^{1+\delta_{c}} and t=(k19cε6)1+δtt=\left(\frac{k^{19}}{c\varepsilon^{6}}\right)^{1+\delta_{t}}, for constants δc,δt>0\delta_{c},\delta_{t}>0.

The main tool from this paper we will use is the structural characterization, stating that every PMD is close to the sum of an appropriately discretized Gaussian and a “sparse” PMD.

For parameters cc and tt as described above, every (n,k)(n,k)-Poisson multinomial random vector is ε\varepsilon-close to the sum of a Gaussian with a structure preserving rounding and a (tk2,k)(tk^{2},k)-Poisson multinomial random vector. For each block of the Gaussian, the minimum non-zero eigenvalue of Σi\Sigma_{i} is at least tc2k4\frac{tc}{2k^{4}}.

Finally, we will also use their rounding procedure, which relates a PMD to a nearby PMD with all parameters either equal to or sufficiently far from and 11:

A Size-Free CLT

We overview our proof of Theorem 1. Recall that the Central Limit Theorem of Valiant and Valiant, (1), has a poly-logarithmic dependence on the size parameter of the GMD. Their work raised the question whether this CLT could be made size-independent, and we resolve this conjecture by showing that it can be. This qualitative improvement comes at a quantitative loss in the polynomial dependence of the bound on the parameters kk and σ2\sigma^{2}.

Our CLT builds off of the structural result of [DKT15], Theorem 6, which we use as a black box. This structural result says that every (n,k)(n,k)-PMD is ε\varepsilon-close to the sum of an appropriately discretized Gaussian and a (poly(k/ε),k)(\operatorname*{poly}(k/\varepsilon),k)-PMD. We note that the statement of Theorem 6 does not tell us anything about the moments of this Gaussian and sparse PMD, while our new CLT requires that the discretized Gaussian has the same moments as the original PMD. We prove this CLT in two steps. First, we show that the original PMD XX and the discretized Gaussian from the cover GG are close in total variation distance, i.e., we show that we can “drop” the sparse PMD component from Theorem 6 in the relevant approximation regime. Then, we bound the distance between the discretized Gaussian from the cover, GG, and a discretized Gaussian with the same mean and covariance as the original PMD, GXG_{X}. The proof is concluded by combining these two bounds using the triangle inequality.

Next, we bound the distance between the discretized Gaussian from the cover, GG, and a discretized Gaussian with the same moments as the original PMD, GXG_{X}. At this point, we know that XX and GG are close in total variation distance. By projecting both distributions in some direction and considering true Gaussians with the same moments as XX and GG, it can be shown that the first two moments are similar in this direction – otherwise, the true Gaussians would be far from each other in the Kolmogorov metric. This implies that the first two moments of XX and GG are close in every direction, as guaranteed by Proposition 8. Applying Lemma 2 tells us that bona-fide Gaussians with moments which are close in every direction are therefore close in total variation distance. The proof is concluded by applying the Data Processing inequality, which shows that the corresponding discretized Gaussians GG and GXG_{X} are close as well.

We state and prove many useful lemmas in Section 3.1, which we combine to complete the proof of Theorem 1 in Section 3.2.

The following two propositions bound the Kolmogorov distance between a univariate Gaussian and the projection of a GMD or a discretized Gaussian, respectively.

Suppose that there exists an (n,k)(n,k)-generalized multinomial random vector XX, with mean vector μ\mu and covariance matrix Σ\Sigma. Then for any unit vector vv,

where σ2\sigma^{2} is the minimum eigenvalue of Σ\Sigma.

We apply the Berry-Esseen theorem (Proposition 1). Let Yi=XiE[Xi]Y_{i}=X_{i}-E[X_{i}] to recenter the random variables, and we will now compare Y=iYiY=\sum_{i}Y_{i} with N(0,vTΣv)\mathcal{N}(0,v^{T}\Sigma v). We note that vTYi[2,2]v^{T}Y_{i}\in[-\sqrt{2},\sqrt{2}]. Letting σi2=\mboxVar(vTYi)\sigma_{i}^{2}=\mbox{\bf Var}(v^{T}Y_{i}) and ρi=E[vTYi3]\rho_{i}=E\left[\left|v^{T}Y_{i}\right|^{3}\right], this implies that ρi2σi2\rho_{i}\leq\sqrt{2}\sigma_{i}^{2}, and thus the Berry-Esseen bound gives

Suppose there exists a random variable XN(μ,Σ)X\sim\lfloor\mathcal{N}(\mu,\Sigma)\rceil. Then for any unit vector vv,

where σ2\sigma^{2} is the minimum eigenvalue of Σ\Sigma.

Let YN(μ,Σ)Y\sim\mathcal{N}(\mu,\Sigma). We first show vT(YY)k2|v^{T}(Y-\lfloor Y\rceil)|\leq\frac{\sqrt{k}}{2}, which holds by Cauchy-Schwarz: v2=1\|v\|_{2}=1 and YY2kYYk2\|Y-\lfloor Y\rceil\|_{2}\leq\sqrt{k}\cdot\|Y-\lfloor Y\rceil\|_{\infty}\leq\frac{\sqrt{k}}{2}. Thus,

because the two distributions are univariate Gaussians with the same variance (which is at least σ2)\sigma^{2}) and means shifted by k\sqrt{k}. This implies

The following proposition compares a Gaussian XX and an arbitrary distribution YY. It shows that if YY’s variance is much smaller than XX’s, then they must be far in Kolmogorov distance.

Suppose there exists a univariate Gaussian XX with variance σX2\sigma_{X}^{2}, and a distribution YY with variance σY2<σX2\sigma_{Y}^{2}<\sigma_{X}^{2}. Then the Kolmogorov distance between XX and YY is at least 12(σYσX)2/3\frac{1}{2}-\left(\frac{\sigma_{Y}}{\sigma_{X}}\right)^{2/3}.

We consider the event that a sample falls in an interval of width 2k2k centered at E[Y]E[Y]. As a certificate of a large Kolmogorov distance between XX and YY, we show that the probability assigned to this interval is very different for XX versus YY.

First, by Chebyshev’s inequality, we know that

where the last inequality uses the Taylor expansion of the error function.

The difference in probability assigned to this interval is at least

Setting k=σY2/3σX1/3k=\sigma_{Y}^{2/3}\sigma_{X}^{1/3} gives

The following proposition tells us if we are considering the sum of two random variables, one being a Gaussian with a large variance and one being an arbitrary distribution with a small support, we can remove all contribution from the distribution with small support and not pay a large cost in total variation distance.

We start by applying a law of total probability for total variation distance:

Using the data processing inequality for total variation distance (Lemma 1):

The next proposition tells us that Kolmogorov closeness implies parameter closeness for univariate Gaussians.

Without loss of generality, assume μ1μ2\mu_{1}\leq\mu_{2}. We will first show the conclusion assuming the means are separated, and then assuming the variances are separated.

Our final proposition in this section applies the previous proposition, showing that total variation closeness implies parameter closeness (in any projection) when considering a GMD and a discretized Gaussian.

Consider the projections of XX and YY onto vv. By Propositions 3 and 4 and the triangle inequality, the Kolmogorov distance between the univariate Gaussians with the same mean and variance is at most α+2kσ\alpha+\frac{2\sqrt{k}}{\sigma}. Applying Proposition 7 implies the desired result. ∎

2 Proof of Theorem 1

We will prove the statement for a sufficiently large constant CC. Thus we only need examine the case

otherwise the conclusion of the theorem statement is vacuous since total variation distance is at most 11.

As a starting point, we convert from a GMD to the corresponding (n,k)(n,k)-Poisson multinomial random vector XX and apply Theorem 6 with ε=k3σ1/10\varepsilon=\frac{k^{3}}{\sigma^{1/10}}. This gives us that

where GG is a Gaussian with a structure preserving rounding and PP is a (tk2,k)(tk^{2},k)-Poisson multinomial random vector. By the definition of tt in Section 2.4, we have that tCσ9/10k2t\leq\frac{C^{\prime}\sigma^{9/10}}{k^{2}} for some constant CC^{\prime}. Thus, PP is a (Cσ9/10,k)(C^{\prime}\sigma^{9/10},k)-Poisson multinomial random vector.

Now, applying Proposition 6 and the triangle inequality, we get

where σv2=min{vTΣv,vTΣDv}\sigma_{v}^{2}=\min\{v^{T}\Sigma v,v^{T}\Sigma_{D}v\}.

We use the Data Processing Inequality (Lemma 1) followed by Lemma 2 with these guarantees to give:

Finally, applying the triangle inequality with Equation (7) gives

Choosing the constant CC sufficiently large completes the proof.

A PTAS for Anonymous Games

Indeed, if the above guarantee held, then the expected payoff of every player ii from any pure strategy σ\sigma would not change by more than an additive O(ε)O(\varepsilon) if we changed the strategies of all other players from (Xj)ji(X_{j})_{j\neq i} to (Yj)ji(Y_{j})_{j\neq i}. So, if VXV_{X} were a Nash equilibrium and support(Yi)support(Xi){\rm support}(Y_{i})\subseteq{\rm support}(X_{i}), it would follow that YiY_{i} is an approximate best response of player ii to (Yj)ji(Y_{j})_{j\neq i}. So VYV_{Y} would be an approximate equilibrium.

Unfortunately, we do not know how to construct a proper ε\varepsilon-cover SεS_{\varepsilon} of all (n,k)(n,k)-PMDs that has size of order (3) and such that for any VXV_{X} there exists some VYSεV_{Y}\in S_{\varepsilon} satisfying Condition (8). Nevertheless, we can exploit our CLT and the structural result of [DKT15] (restated as Theorem 6 in this paper) to bypass this difficulty. Roughly speaking [DKT15] approximate a given VX=(X1,,Xn)V_{X}=(X_{1},\ldots,X_{n}) by first discretizing the parameters of all XiX_{i}’s into fine enough accuracy (this is shown to only cost some O(ε)O(\varepsilon) in total variation distance), then partitioning the XiX_{i}’s into a small group L{\cal L} of size poly(k/ε){\rm poly}(k/\varepsilon) that are left intact, and a large group whose sum is approximated by a discretized multidimensional Gaussian (up to another cost of O(ε)O(\varepsilon) in total variation distance). It is further shown that the distribution of the sum of variables in L{\cal L} can be summarized through the vector m\vec{m} of its first O(log1/ε)O(\log{1/\varepsilon}) moments (at a loss of an additional O(ε)O(\varepsilon) in total variation distance), while the discretized Gaussian through its first two moments (μ,Σ)(\mu,\Sigma). Moreover, it is shown that the Gaussian has at least poly(k/ε){\rm poly}(k/\varepsilon) variance in all directions where it has non-zero variance. By enumerating over all possible summary statistics (m,μ,Σ)(\vec{m},\mu,\Sigma), a non-proper cover of all (n,k)(n,k)-PMDs can be obtained, whose size is of the order of (3).

Suppose now that VX=(X1,,Xn)V_{X}=(X_{1},\ldots,X_{n}) is a Nash equilibrium whose approximating statistic in the non-proper cover is some (m,μ,Σ)(\vec{m},\mu,\Sigma). Given a correct guess for this statistic, our goal is to uncover an approximate Nash equilibrium VY=(Y1,,Yn)V_{Y}=(Y_{1},\ldots,Y_{n}) of the game. By the construction of the cover, we know that every player ii either contributed his discretized XiX_{i} to the discretized Gaussian with parameters (μ,Σ)(\mu,\Sigma), or to the small group of variables with moments m\vec{m}. So, letting C{\cal C} be the set of kk-CRVs whose parameters have the discretization accuracy used in the construction of the cover, we need to assign some YiCY_{i}\in{\cal C} to each player ii such that:

There exists a poly(k/ε){\rm poly}(k/\varepsilon)-size subset L{\cal L} of players such that iLYi\sum_{i\in{\cal L}}Y_{i} has vector of moments m\vec{m}, while iLYi\sum_{i\notin{\cal L}}Y_{i} has first two moments (μ,Σ)(\mu,\Sigma).

For all ii, YiY_{i} is a best response to jiYj\sum_{j\neq i}Y_{j}.

To find a good assignment, we first construct a compatibility graph between players and mixed strategies in C{\cal C}. We add an edge between some ii and some YiCY_{i}\in{\cal C} iff at least one of the following two conditions is met. We also annotate the edge with all conditions that are met:

(YiY_{i} is compatible with iLi\in{\cal L}): YiY_{i} is an approximate best response to the “environment” ii would observe if ii contributed to m\vec{m}. If ii contributed to m\vec{m} and Condition (a) were met, then we can deduce what PMD player ii would see in his environment. Indeed, this would be within some O(ε)O(\varepsilon) in total variation distance to a the sum of a Gaussian random vector with parameters (μ,Σ)(\mu,\Sigma) and a PMD whose first O(log(1/ε))O(\log(1/\varepsilon)) moments are the same as m\vec{m} after removing the contribution of YiY_{i}. The updated moment vector can be computed from m\vec{m} and YiY_{i} as moments are symmetric polynomials of the underlying parameters. Given the updated moment vector, the PMD is determined to within ε\varepsilon in total variation distance, so its sum with the discretized Gaussian is also determined, and we can also efficiently determine whether YiY_{i} is an approximate best response of player ii to that distribution.

(YiY_{i} is compatible with iLˉi\in\bar{{\cal L}}): YiY_{i} is an approximate best response to the “environment” ii would observe if ii contributed to the discretized Gaussian with parameters (μ,Σ)(\mu,\Sigma). First, for this to be the case YiY_{i} must be “compatible” with Σ\Sigma, i.e. not correlating uncorrelated pairs of dimensions/adding variance in zero-variance dimensions (or in other words, the block structure of Σ\Sigma should be preserved). Moreover, since all non-zero eigenvalues of Σ\Sigma are at least poly(k/ε){\rm poly}(k/\varepsilon)-large, the discretized Gaussian with parameters (μ,Σ)(\mu,\Sigma) and (μE[Yi],Σcov(Yi))(\mu-{\bf E}[Y_{i}],\Sigma-{\rm\bf cov}(Y_{i})) are approximately the same(Proposition 2). At the same time, due to the largeness of the non-zero eigenvalues of Σ\Sigma, if condition (a) were eventually true, then our CLT (Theorem 1) would imply that jLˉ{i}Yj\sum_{j\in\bar{{\cal L}}\setminus\{i\}}Y_{j} is well-approximated by the discretized Gaussian with parameters (μE[Yi],Σcov(Yi))(\mu-{\bf E}[Y_{i}],\Sigma-{\rm\bf cov}(Y_{i})), and hence by that with parameters (μ,Σ)(\mu,\Sigma). So, if iLˉi\in\bar{\cal L}, ii is assigned YiY_{i}, and Condition (a) is eventually met, then the PMD that player ii sees in his environment is pinned down to within O(ε)O(\varepsilon) in total variation distance: it is approximately the sum of the discretized Gaussian with parameters (μ,Σ)(\mu,\Sigma) and a PMD with moments m\vec{m}. We can therefore check if YiY_{i} is an approximate best response to that distribution.

After constructing the compatibility graph as above, we need to see if there is an assignment of players to compatible mixed strategies from C\cal C so that (a) is satisfied. This looks non-trivial, but it can be done using dynamic programming. We sweep through the players, maintaining as state all possible leftover moments (m,μ,Σ)(\vec{m}^{\prime},\mu^{\prime},\Sigma^{\prime}) that may arise from assignments of a prefix of players to compatible mixed strategies. Given the discretization of C\cal C, the set of possible states is bounded by (3). Importantly, the compatibility graph has the property that player ii is happy when given a compatible strategy as long as the overall assignment matches (m,μ,Σ)(\vec{m},\mu,\Sigma).

A mixed strategy profile ρ\rho is a set of nn distributions {ρiΔk}i[n]\{\rho_{i}\in\Delta^{k}\}_{i\in[n]}, where by Δk\Delta^{k} we denote the (k1)(k-1)-dimensional simplex, or, equivalently, the set of distributions over [k][k]. A mixed strategy profile ρ\rho is an ε\varepsilon-approximately well supported Nash equilibrium (or an ε\varepsilon-Nash equilibrium, for short) if, for all i[n]i\in[n] and j[k]j\in[k],

where ρi\rho_{-i} is the distribution over Πn1k\Pi^{k}_{n-1} obtained by drawing n1n-1 random samples from [k][k] independently according to the distributions ρi,ii\rho_{i^{\prime}},i^{\prime}\neq i, and forming the induced partition. We note that this an ε\varepsilon-Nash equilibrium is stronger than the related concept of an ε\varepsilon-approximate Nash equilibrium (see, i.e., [DGP09] for further discussion of this distinction). Throughout this paper, we solely consider the harder problem of computing an ε\varepsilon-Nash equilibrium.

A -Nash equilibrium is simply called a Nash equilibrium and it is always guaranteed to exist by Nash’s theorem.

2 An Algorithm for Anonymous Games

In a Nash equilibrium ρ\rho of an anonymous game every player uses a mixed strategy ρi\rho_{i} selecting strategy jj with probability ρij\rho_{ij}. The distribution of the number of players which select each of the strategies is an (n,k)(n,k)-PMD. Using the fact that there exist small size ε\varepsilon-covers for PMDs, we can efficiently search over the space of all strategies and identify a mixed strategy profile that produces an ε\varepsilon-Nash equilibrium. We show that there exists an efficient polynomial time approximation scheme (EPTAS) for computing an ε\varepsilon-Nash equilibrium, thus proving Theorem 2.

The algorithm works by guessing an aggregate statistic (m,μ,Σ)(m,\mu,\Sigma) that describes the overall behavior of all players. This statistic is based on the structural theorem shown in [DKT15], which shows that the overall PMD that describes the mixed strategy profile can be approximately written as sum of a discretized Gaussian and a sparse PMD with only poly(k/ε)\operatorname*{poly}(k/\varepsilon) components. Moreover, for the sparse PMD knowledge of the log(1/ε)\log(1/\varepsilon) moments (which is equivalent to knowing the power-sums of all the summands up to poly(1/ε)\operatorname*{poly}(1/\varepsilon), suffices to describe it within ε\varepsilon in total variation distance. Thus, the algorithm requires guessing the power-sums mm of the sparse PMD and the mean μ\mu and covariance Σ\Sigma of the discretized Gaussian.

As we will show, knowledge of an individual’s strategy together with the aggregate statistic (m,μ,Σ)(m,\mu,\Sigma) for the overall mixed strategy profile, allows us to compute an approximate distribution DiD_{i} that discribes the player’s view about the aggregate strategy of everyone else. If we manage to assign strategies ρi\rho_{i} to every player so that ρi\rho_{-i} approximately matched DiD_{i} and additionally each player only chooses strategies that corresponds to approximate best responses with respect to his view DiD_{i} we will obtain an ε\varepsilon-Nash equilibrium. The following lemma formalizes this intuition and is the main tool we use in the proof of Theorem 2.

For all i[n]i\in[n] and j[k]j\in[k], ExDi[uji(x)]<maxj[k]ExDi[uji(x)]ε2ρij=0,E_{x\sim D_{i}}[u^{i}_{j}(x)]<\max_{j^{\prime}\in[k]}E_{x\sim D_{i}}[u^{i}_{j^{\prime}}(x)]-\varepsilon_{2}\Rightarrow\rho_{ij}=0,

Then, ρ\rho is an (2ε1+ε2)(2\varepsilon_{1}+\varepsilon_{2})-Nash equilibrium for the game GG.

Consider the game G=(n,k,{uji})G=(n,k,\{u^{i}_{j}\}). By Nash’s theorem there always exists a Nash equilibrium. Let ρ\rho be such an equilibrium where every player uses a mixed strategy ρi\rho_{i} selecting strategy jj with probability ρij\rho_{ij}. The distribution of vectors which give the number of players which select each of the strategies is an (n,k)(n,k)-PMD.

To get an efficient algorithm, we need to search over a restricted set of strategies for each player. To be able to do that we must show that an ε\varepsilon-Nash equilibrium exists in a more restricted space. To argue that, we begin by a Nash equilibrium ρ\rho and perform a series of operations that maintain the property that the resulting mixed strategy profile is an ε\varepsilon-equilibrium.

We first proceed by rounding the probabilities ρij\rho_{ij} so that they are either or at least cc as done in Lemma 3. This gives a PMD ρ(1)\rho^{(1)} that is O(c1/2k5/2log1/2(1/ck))O(c^{1/2}k^{5/2}\log^{1/2}(1/ck))-close in total variation distance to ρ\rho . Moreover, if we consider the PMD ρi(1)\rho^{(1)}_{-i}, which is the (n1,k)(n-1,k)-PMD obtained by removing the ii-th component from the rounded PMD ρ(1)\rho^{(1)}, this is also O(c1/2k5/2log1/2(1/ck))O(c^{1/2}k^{5/2}\log^{1/2}(1/ck))-close in total variation to ρi\rho_{-i}, i.e. the PMD obtained after removing the ii-th component from the original PMD ρ\rho. The proof of this statement is almost identical to the proof in [DKT15] and is omitted. That proof uses Poisson approximations to bound the total variation between the rounded and the unrounded PMDs and uses the fact that the means of the two PMDs can differ by at most cc in each coordinate. The only difference is that here, the means of the two PMDs can differ by at most 2c2c in each coordinate which results in the same asymptotic bound for total variation distance. Moreover, note the rounding procedure doesn’t change any probabilities that were originally 0, i.e. ρij=0ρij(1)=0\rho_{ij}=0\Rightarrow\rho^{(1)}_{ij}=0.

By the structural theorem of [DKT15], the components ρi(2)\rho^{(2)}_{i} of the PMD ρ(2)\rho^{(2)} can be partitioned into two PMDs:

For every player ii in the sparse PMD, his view about the aggregate strategy of the others is approximately the same as if the large PMD was replaced by the Gaussian, i.e.

With those observations in hand, we proceed to give the algorithm for computing ε\varepsilon-equilibria for anonymous games. To do this, we first guess the mean μ\mu and covariance Σ\Sigma of the Gaussian component as well as all the power-sums mm of the sparse PMD. We then try to construct CRVs for every player so that the overall mean and covariance as well as the power-sums match those that we guessed and moreover every player’s CRV assigns positive probability mass only to approximately optimal strategies. If we are able to do so, Lemma 4 implies that this gives an approximate Nash equilibrium. In more detail, the algorithm performs the following steps:

Guess the mean and covariance of the Gaussian component and the power sums of the sparse PMD. For every guess, we repeat the next steps until a feasible solution is found.

We need to guess the powersums for (4ek3)k(4ek^{3})^{k} different PMDs since CRVs are first clustered according to their value in every coordinate. Since the parameters of the sparse PMD are all multiples of tk3ε1\lceil\frac{tk^{3}}{\varepsilon}\rceil^{-1}, this results in at most 2k5klogk+2(1ε)2^{k^{5k}\log^{k+2}\left(\frac{1}{\varepsilon}\right)} distinct power-sum vectors in totalThis upper bound was derived in [DKT15]. For the gaussian component all entries of the mean and covariance are multiples of nkε\lceil\frac{nk}{\varepsilon}\rceil which requires nkεO(k2)\lceil\frac{nk}{\varepsilon}\rceil^{O(k^{2})} guesses in total.

For every player, we need to compute the contribution of his mixed strategy (CRV) to the overall distribution. If that player is to be assigned in the sparse component, its probabilities are all multiples of tk3ε1\lceil\frac{tk^{3}}{\varepsilon}\rceil^{-1} and we can compute its contribution to the power-sums mm. Similarly, if that player is to be assigned in the gaussian component its probabilities are all multiples of nkε1\lceil\frac{nk}{\varepsilon}\rceil^{-1} and we can easily compute its contribution to the mean and covariance.

However, not all assignments are feasible. We need to consider only CRVs for that player that assign positive probability mass to coordinates that are approximately best responses to the strategy of other players. Even though we don’t know the strategies of the others exactly, we can compute a good approximate description of the players view by subtracting from the power sums mm the players contribution (if any) and computing any PMDPMD that matches those power-sums. Similarly, if the player is mapped to the gaussian component we subtract the players mean and covariance from the overall mean μ\mu and covariance Σ\Sigma and compute a discretized Gaussian with the resulting mean and covariance instead. We say that an assignment of a player to a component (sparse or Gaussian) and a specific distribution over strategies is feasible if it approximately maximizes the player’s utility uu with respect to his approximate view about the strategies of others.

To find if there exists a set of feasible strategies that matches the guessed statistic (m,μ,Σ)(m,\mu,\Sigma), we use dynamic programming. The states of our dynamic program are the following: For any prefix of players, we keep track of the remaining power-sums, mean and covariance we need to account for. We iteratively process players one by one keeping track of which states are reachable. Our estimation is feasible if after processing all players we have accounted for all the power-sums, mean and covariance in our original guess. If we find such a solution, we output the assignment of players to mixed strategies that resulted in this solution.

Applying Lemma 4 directly shows that this is indeed an O(ε)O(\varepsilon)-Nash equilibrium.

The total runtime of the algorithm is polynomial on the number of states of the above dynamic program. Since there are nkεO(k2)\lceil\frac{nk}{\varepsilon}\rceil^{O(k^{2})} Gaussian parameters in total as well as 2k5klogk+2(1ε)2^{k^{5k}\log^{k+2}\left(\frac{1}{\varepsilon}\right)} power sums in total, the overall runtime is nO(k2)2poly(k,log(1/ε))kn^{O(k^{2})}2^{\operatorname*{poly}(k,\log(1/\varepsilon))^{k}} and the theorem follows. ∎

On the road to getting the proper cover described by Theorem 3, we first show Theorem 7. This constructs a non-proper cover of the same size. The main theorem of this section is the following:

Moreover, we can efficiently enumerate this cover in time polynomial in its size.

This theorem should be contrasted with Theorem 3, which provides a proper cover of similar size. It should also be contrasted to Theorem 2 of [DKT15], which provides a cover with a leading factor of nk2n^{k^{2}}, so the cover presented here improves the exponent of nn from quadratic to linear in the dimension. This is the correct order of exponential dependence on kk, as simply counting the number of (n,k)(n,k)-PMDs with deterministic summands gives a lower bound of nΩ(k)n^{\Omega(k)}. We also show in Section 7 that the quasi-polynomial dependence on 1/ε1/\varepsilon with an exponent of Ω(k)\Omega(k) cannot be avoided, as we provide an essentially matching lower bound on the cover size.

The starting point for our cover will be Theorem 6, stating that every (n,k)(n,k)-PMD is ε\varepsilon-close to the sum of an appropriately discretized Gaussian and a (poly(k/ε),k)(\operatorname*{poly}(k/\varepsilon),k)-PMD. We generate an ε/2\varepsilon/2-cover for each and combine them by triangle inequality.

We cover the sparse PMD component using the same methods as in [DKT15]. The first, naive way of covering this component involves gridding over all poly(k/ε)\operatorname*{poly}(k/\varepsilon) parameters with poly(ε/k)\operatorname*{poly}(\varepsilon/k) granularity. This results in a cover size of 2poly(k/ε)2^{\operatorname*{poly}(k/\varepsilon)}.

The more sophisticated way of covering this component uses a “moment matching” technique. A result by Roos [Roo02] shows that the probability mass function can be written as the weighted sum of partial derivatives of a standard multinomial distribution. When analyzed carefully, his result implies that the lower order moments of the distribution are sufficient to characterize the PMD. In other words, any two PMDs with identical “moment profiles” (which describe these lower order moments) are close in total variation distance, and it suffices to keep only one representative for each moment profile. This method results in a cover of size 2O(k5klogk+2(1/ε))2^{O(k^{5k}\log^{k+2}(1/\varepsilon))}. Combining this with the other approach gives a cover of size

For more details, see the proof of Theorem 2 of [DKT15].

To cover the Gaussian component, [DKT15] grid over all O(k2)O(k^{2}) parameters of the Gaussian component, arguing the effectiveness of the gridding using Proposition 2. This gridding results in the leading factor of nO(k2)n^{O(k^{2})} in the size of the cover. In contrast, we use a spectral covering approach: instead of trying to grid over all parameters of the covariance matrix, we first sparsify it and then match the magnitude of its projection in every direction. In particular, we establish a cover of the following nature:

Let Gn,k,ε\mathcal{G}_{n,k,\varepsilon} be the set of all Gaussians with structure preserving roundings which may arise as a consequence of Theorem 6 when applied to (n,k)(n,k)-Poisson multinomial random vectors with parameter ε\varepsilon. Then there exists a set S\mathcal{S} of Gaussians with structure preserving roundings of size at most nO(k)(kε)O(k3)n^{O(k)}\cdot\left(\frac{k}{\varepsilon}\right)^{O(k^{3})} with the following properties:

For any GGn,k,εG\in\mathcal{G}_{n,k,\varepsilon}, there exists a G^S\hat{G}\in\mathcal{S}, such that GG and G^\hat{G} have the same block structure (i.e., the partition of coordinates), and within each block, have the same pivot coordinate and sum for the mean vector coordinates. Furthermore, for each block ii, letting (μi,Σi)(\mu_{i},\Sigma_{i}) and (μ^i,Σ^i)(\hat{\mu}_{i},\hat{\Sigma}_{i}) be the mean and covariance for the block (excluding the pivot coordinate), we have that for all unit vectors vv,

vT(μiμ^i)εσivk|v^{T}(\mu_{i}-\hat{\mu}_{i})|\leq\frac{\varepsilon\sigma_{iv}}{k};

vT(ΣiΣ^i)vεσiv22k3/2|v^{T}(\Sigma_{i}-\hat{\Sigma}_{i})v|\leq\frac{\varepsilon\sigma_{iv}^{2}}{2k^{3/2}};

where σiv2=max{vTΣiv,vTΣ^iv}\sigma_{iv}^{2}=\max\{v^{T}\Sigma_{i}v,v^{T}\hat{\Sigma}_{i}v\}.

This lemma statement is slightly technical due to the nature of the Gaussians with structure preserving roundings. It essentially says that we cover the set of Gaussians arising from the structural theorem by matching their block structure exactly, and within each block, matching the moments spectrally. Plugging these guarantees into Lemma 2 and applying the data processing inequality for total variation distance (Lemma 1) gives the desired closeness.

For simplicity of exposition, for the remainder of this overview section, we assume that the Gaussian’s structure preserving rounding consists of a single block, an assumption we do not make in the full proof (described in Section 5.1). By the guarantees of the structural result, in this case, the minimum eigenvalue of the covariance matrix is at least some poly(k/ε){\rm poly}(k/\varepsilon). So the goal of our exposition in this section is to produce a cover of Gaussians that may result from Theorem 6 and whose covariance matrices have minimum eigenvalue at least poly(k/ε){\rm poly}(k/\varepsilon).

Since the mean vector only has kk parameters, we can grid over the entries. Though we require a spectral guarantee, this naive gridding is sufficient. This gives a set of size (nkε)O(k)\left(\frac{nk}{\varepsilon}\right)^{O(k)}, such that, for any Gaussian which may arise from Theorem 6, its mean vector is approximated by a mean vector in our set with the approximation guarantees required by Lemma 2.

Covering the covariance matrix takes more care. At a high level, our approach views PMDs through the lens of spectral graph theory and exploits the existence of spectral sparsifiers. Recall the definition of the Laplacian matrix of a graph:

Given an undirected weighted graph G=(V,E,w)G=(V,E,w) on nn vertices, its Laplacian matrix is an n×nn\times n matrix LGL_{G} where

To see the connection to PMDs, we observe that the covariance matrix of a PMD is the Laplacian matrix of a graph defined by the parameters. For a single kk-CRV XX with parameter vector π\pi, it can be shown that the variance of XiX_{i} is π(i)(1π(i))\pi(i)(1-\pi(i)) and the covariance of XiX_{i} and XjX_{j} is π(i)π(j)-\pi(i)\pi(j). Since i=1kπ(i)=1\sum_{i=1}^{k}\pi(i)=1, the covariance matrix is equal to the Laplacian matrix of a graph on kk nodes with w(i,j)=π(i)π(j)w(i,j)=\pi(i)\pi(j). This can be extended to (n,k)(n,k)-PMDs by observing that the sum of random variables has a covariance matrix equal to the sum of the individual covariance matrices, and a similar statement holds for graphs and the corresponding Laplacian matrices. We summarize this connection in the following observation:

At the core of our approach, we use the following celebrated result of Batson, Spielman, and Srivastava [BSS12], which says that the Laplacian matrix of a graph on kk vertices can be spectrally approximated by the Laplacian matrix of a graph with only O(k)O(k) edges:

where LGL_{G} is the Laplacian matrix of the graph GG.

Using this tool, the approach will proceed as follows. This theorem implies that, for every true covariance matrix Σ\Sigma, there exists a matrix M1M_{1} with only O(k)O(k) entries which preserves every projection up to a multiplicative factor of 1/51/5. We can obtain a matrix M2M_{2} with the same sparsity pattern as M1M_{1} by guessing which subset of O(k)O(k) entries is non-zero, requiring exp(klogk)\exp(k\cdot\log{k}) guesses. Furthermore, we can grid over the non-zero entries of M2M_{2} to ensure that it approximates every projection of M1M_{1} up to a multiplicative factor of 1/251/25. Since M1M_{1} has minimum eigenvalue poly(k/ε){\rm poly}(k/\varepsilon) and maximum entry O(n)O(n), gridding requires only (nkε)O(k)\left({n\cdot k\over\varepsilon}\right)^{O(k)} guesses, and we get that M2M_{2} gives a 1/41/4 multiplicative spectral approximation to Σ\Sigma. To make our approximation finer, we will O(ε/k)O(\varepsilon/\sqrt{k})-cover the set of PSD matrices within a 1/41/4-neighborhood of M2M_{2}. We first recall the definition of a cover in this context:

Let S be a set of symmetric k×kk\times k PSD matrices. An ε\varepsilon-cover of the set SS, denoted by SεS_{\varepsilon}, is a set of PSD matrices such that for any matrix ASA\in S, there exists a matrix BSεB\in S_{\varepsilon} such that for all vectors yy: yT(AB)yεyTAy|y^{T}(A-B)y|\leq\varepsilon y^{T}Ay.

Now, if we could O(ε/k)O(\varepsilon/\sqrt{k})-cover the set of all matrices 1/41/4-close to M2M_{2}, we would obtain an O(ε/k)O(\varepsilon/\sqrt{k})-approximation to Σ\Sigma. We do so using the following lemma, which provides a method to generate such a cover. A slight generalization of this statement appeared as Lemma 9 in [DKT15], but we give a slightly simpler proof in Section 5.2 for completeness.

Let AA be a symmetric k×kk\times k PSD matrix with minimum eigenvalue at least 11 and let SS be the set of all matrices BB such that yT(AB)yε1yTAy|y^{T}(A-B)y|\leq\varepsilon_{1}y^{T}Ay for all vectors yy, where ε1[0,1/4]\varepsilon_{1}\in[0,1/4]. Then, there exists an ε\varepsilon-cover SεS_{\varepsilon} of SS that has size Sε(kε)O(k2)|S_{\varepsilon}|\leq\left(\frac{k}{\varepsilon}\right)^{O(k^{2})}.

Combining the above, we obtain a set of covariance matrices of size nO(k)(1ε)poly(k)n^{O(k)}\cdot\left(\frac{1}{\varepsilon}\right)^{\operatorname*{poly}(k)} such that, for any Gaussian which may arise in Theorem 6, its covariance matrix is approximated by a covariance matrix in our cover as required by Lemma 2.

Combining the guarantees obtained for the mean and the covariance matrix, we find that they satisfy both conditions of Lemma 2. Therefore, we have described a cover of size nO(k)(1ε)poly(k)n^{O(k)}\cdot\left(\frac{1}{\varepsilon}\right)^{\operatorname*{poly}(k)} for all possible Gaussian components. The proof of Theorem 7 is completed by taking the Cartesian product of this Gaussian cover with the cover for the (poly(k/ε),k)(\operatorname*{poly}(k/\varepsilon),k)-PMD component.

For more details on covering the Gaussian component, see Section 5.1.

1 Details on Covering the Gaussian Component

Recall that the Gaussian component will have a structure preserving rounding. The first step in designing our cover will be to guess the partitioning into blocks. There are kk dimensions, resulting in at most k!k! different block structures. In what follows, we will describe how to cover a single block up to accuracy O(εk)O(\frac{\varepsilon}{k}), taking the Cartesian product of the resulting sets will give an O(ε)O(\varepsilon)-cover of the entire Gaussian at the additional cost of kk in the exponent.

For a single block which consists of dimensions SiS_{i}, we must first guess the size parameter nin_{i} and which dimension is to be used as the pivot. The former is an integer between and nn, and guessing it comes at a cost of nn in our cover size. Guessing the latter comes at a Si|S_{i}| cost in our cover size.

Recall that our strategy will be to spectrally match the parameters of the true Gaussian. We will conclude the two distributions are close using the guarantees provided by Lemma 2. We describe how to obtain such guarantees for both the mean and covariance matrix separately.

1.2 Covering the Covariance Matrix of a Block

We will use the characterization provided by Observation 1, which tells us that the covariance matrix of an (n,k)(n,k)-PMD is the Laplacian matrix of a graph defined by the parameters of the distribution. Recall from the proof of Theorem 6 (which appears in [DKT15]), the covariance matrix of the Gaussian we are attempting to match is also the covariance matrix of an (ni,Si)(n_{i},|S_{i}|)-Generalized Multinomial Distribution. For the remainder of this proof, we let GG be the graph defined by this characterization for the covariance matrix of the corresponding (ni,Si)(n_{i},|S_{i}|)-Poisson Multinomial Distribution.

As a starting point, we use Theorem 8, which shows the existence of spectral sparsifiers. In particular it implies that, if given GG on Si|S_{i}| nodes and we want a subgraph HH such that (11/5)LGLH(1+1/5)LG(1-1/5)L_{G}\preceq L_{H}\preceq(1+1/5)L_{G}, there exists an HH with at most 110Si110|S_{i}| edges which gives this approximation. The first step in covering the covariance matrix is to guess which edges are present in the graph. Since there are (Si2)\binom{|S_{i}|}{2} possible edges in the graph, this requires at most

Further, recall that our structural result implies that 125vTLHvtc100k4\frac{1}{25}v^{T}L_{H}v\geq\frac{tc}{100k^{4}}, so it suffices to obtain a graph MM such that

For a unit vector vv and Si×Si|S_{i}|\times|S_{i}| PSD matrices AA and BB,

Suppose we guess the edge weights of MM such that they are at most tc100k7\frac{tc}{100k^{7}} away from those of HH. This tells us maxijLH(i,j)LM(i,j)tc100k7\max_{i\neq j}|L_{H}(i,j)-L_{M}(i,j)|\leq\frac{tc}{100k^{7}}, and since the diagonal entries of LML_{M} are the sums of the off-diagonal entries, maxiLH(i,i)LM(i,i)tc100k6\max_{i}|L_{H}(i,i)-L_{M}(i,i)|\leq\frac{tc}{100k^{6}}. This implies that it suffices to additively estimate the edge weights up to accuracy tc100k6\frac{tc}{100k^{6}}. Since the maximum entry of LGL_{G} is at most nin_{i}, the spectral guarantee implies that the maximum entry of LHL_{H} is at most 6ni5\frac{6n_{i}}{5}, and similarly, the maximum edge weight. Therefore, gridding over all 110Si110|S_{i}| non-zero edge weights, we define a set with at most

At this point, we have a PSD matrix LML_{M} which, when projected onto the subspace orthogonal to e1e_{1}, is 1/41/4-spectrally close to the target covariance matrix. We wish to ε2k3/2\frac{\varepsilon}{2k^{3/2}}-cover the space of all PSD matrices which are 1/41/4-spectrally close to this matrix. We will use Lemma 6, which we instantiate with parameter “ε\varepsilon” set to ε2k3/2\frac{\varepsilon}{2k^{3/2}}, allowing us to generate a ε2k3/2\frac{\varepsilon}{2k^{3/2}}-cover of a 14\frac{1}{4}-neighborhood of a given PSD matrix with (kε)O(k2)\left(\frac{k}{\varepsilon}\right)^{O(k^{2})} candidates. Since we knew one of the previous candidates was 14\frac{1}{4}-close to the target, this gives us a matrix which satisfies the second condition of Lemma 2 to approximate this block up to εk\frac{\varepsilon}{k} accuracy. The size of this cover is at most

1.3 Putting the Guarantees Together

At this point, to cover a single block up to accuracy O(ε/k)O(\varepsilon/k), we have a set of size at most

Taking the Cartesian product of sets and multiplying by the number of guesses for the block structure of the Gaussian, we get an overall cover of size

Combining with the cover for the (poly(k/ε),k)(\operatorname*{poly}(k/\varepsilon),k)-PMD component, we obtain an overall cover for (n,k)(n,k)-PMDs of size

2 Proof of Lemma 6

To construct the cover, we will make use of the eigenvalues and eigenvectors of the matrix AA. We first show that for any matrix BSB\in S, its eigenvalues are close to the eigenvalues of AA.

Let A,BA,B be two symmetric k×kk\times k PSD matrices such that for all vectors yy with y=1\|y\|=1, yT(AB)yε1yTAy|y^{T}(A-B)y|\leq\varepsilon_{1}y^{T}Ay for some constant ε1>0\varepsilon_{1}>0. Then for the eigenvalues λ1A...λkA\lambda^{A}_{1}\leq...\leq\lambda^{A}_{k} of AA, and the eigenvalues λ1B...λkB\lambda^{B}_{1}\leq...\leq\lambda^{B}_{k} of BB, it holds that:

From Courant’s minimax principle, we have that the ii-th eigenvalue of AA is equal to:

where CC is an (i1)×k(i-1)\times k matrix. For the matrix BB, we have that

Similarly, we have that λiB(1ε1)λiA\lambda^{B}_{i}\geq(1-\varepsilon_{1})\lambda_{i}^{A}, so the result follows. ∎

By computing the eigenvalues μ1μk\mu_{1}\leq\dots\leq\mu_{k} of AA, we have estimates of the eigenvalues λ1,,λk\lambda_{1},\dots,\lambda_{k} of BB within a multiplicative factor of 1±2ε11\pm 2\varepsilon_{1}. We can improve our estimates to a better multiplicative factor 1±ε1\pm\varepsilon by gridding multiplicatively around each eigenvalue. This requires another log1+ε(1+2ε112ε1)=O(1/ε)\log_{1+\varepsilon}\left(\frac{1+2\varepsilon_{1}}{1-2\varepsilon_{1}}\right)=O(1/\varepsilon) guesses per eigenvalue. So in total, we require (1ε)O(k)\left(\frac{1}{\varepsilon}\right)^{O(k)} guesses for obtaining accurate estimates λ1,,λk\lambda^{\prime}_{1},\dots,\lambda^{\prime}_{k} of the eigenvalues of BB.

Once we know (approximately) the eigenvalues of BB, we will try to guess also its eigenvectors v1,,vkv_{1},\dots,v_{k}. We will do this by performing a careful gridding around the eigenvectors of AA which we can assume, without loss of generality (by rotating), to be the standard basis vectors e1,e2,,eke_{1},e_{2},\dots,e_{k}. So for each eigenvector vzv_{z} of BB, we will try to approximate it by guessing its projections to the eigenvectors of AA.

We now bound the projections of eigenvectors of AA to eigenvectors of BB. Since we know that eiTBei(1+ε1)eiTAeie^{T}_{i}Be_{i}\leq(1+\varepsilon_{1})e^{T}_{i}Ae_{i}, we get that zλz(vzei)2(1+ε1)μi\sum_{z}\lambda_{z}(v_{z}e_{i})^{2}\leq(1+\varepsilon_{1})\mu_{i} which implies that vz,i2μiλzv_{z,i}\leq\sqrt{\frac{2\mu_{i}}{\lambda_{z}}}. Moreover, since λzmax{(1ε1)μz,1}max{12μz,1}\lambda_{z}\geq\max\{(1-\varepsilon_{1})\mu_{z},1\}\geq\max\{\frac{1}{2}\mu_{z},1\}, we know that the projection of vzv_{z} to eie_{i} will be smaller than 2μimax{μz,1}2\sqrt{\frac{\mu_{i}}{\max\{\mu_{z},1\}}}. An additional bound for the projection of vzv_{z} to eie_{i} can be obtained by considering the variance of the matrices AA and BB in the direction vzv_{z}. Since we know that vzTBvz(1ε1)vzTAvzv^{T}_{z}Bv_{z}\geq(1-\varepsilon_{1})v^{T}_{z}Av_{z}, we get that iμi(vzei)2λz1ε12λz\sum_{i}\mu_{i}(v_{z}e_{i})^{2}\leq\frac{\lambda_{z}}{1-\varepsilon_{1}}\leq 2\lambda_{z} which implies that vz,i2λzμiv_{z,i}\leq\sqrt{\frac{2\lambda_{z}}{\mu_{i}}}.

We will now show that the covariance matrix B^\hat{B} satisfies the property that it is close in all directions to BB. To do this we will make use of the following lemma from [DKT15]. This roughly states that two PSD matrices spectrally approximate each other in O(k2)O(k^{2}) particular directions, then they spectrally approximate each other in every direction.

For all i[k]i\in[k], \Big{|}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}\Big{)}^{T}\Big{(}\hat{\Sigma}-\Sigma\Big{)}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}\Big{)}\Big{|}\leq\varepsilon,

For all i,j[k]i,j\in[k], \Big{|}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}+\frac{v_{j}}{\sqrt{\lambda}_{j}}\Big{)}^{T}\Big{(}\hat{\Sigma}-\Sigma\Big{)}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}+\frac{v_{j}}{\sqrt{\lambda}_{j}}\Big{)}\Big{|}\leq 4\varepsilon.

We will only consider directions y=vzλzy=\frac{v_{z}}{\sqrt{\lambda}_{z}} for z[k]z\in[k] and y=vzλz+vzλzy=\frac{v_{z}}{\sqrt{\lambda}_{z}}+\frac{v_{z^{\prime}}}{\sqrt{\lambda}_{z^{\prime}}} for z,z[k]z,z^{\prime}\in[k].

We first consider direction y=vzλzy=\frac{v_{z}}{\sqrt{\lambda}_{z}}. We have that:

The first term is in the range [(1ε)(1kε)2,(1+ε)(1+kε)2][(1-\varepsilon)(1-k\varepsilon^{\prime})^{2},(1+\varepsilon)(1+k\varepsilon^{\prime})^{2}], which for εε/k\varepsilon^{\prime}\leq\varepsilon/k, becomes (1±O(ε))(1\pm O(\varepsilon)). The rest of the terms can be bounded as follows:

for ε=O(εk3)\varepsilon^{\prime}=O(\sqrt{\frac{\varepsilon}{k^{3}}}). This means that vzTB^vz(1ε,1+ε)λzv^{T}_{z}\hat{B}v_{z}\in(1-\varepsilon,1+\varepsilon)\lambda_{z}. The proof is similar for directions y=vzλz+vzλzy=\frac{v_{z}}{\sqrt{\lambda}_{z}}+\frac{v_{z^{\prime}}}{\sqrt{\lambda}_{z^{\prime}}} for z,z[k]z,z^{\prime}\in[k].

Overall, we can get an estimate B^\hat{B} of any matrix BSB\in S by making at most (kε)O(k2)\left(\frac{k}{\varepsilon}\right)^{O(k^{2})} guesses, which implies an ε\varepsilon-cover of this size.

A Proper Cover for PMDs

We show how to turn the non-proper cover of Section 5 into a proper one as described by Theorem 3, using Theorem 1. We note that a non-constructive proper cover follows immediately from Theorem 7, since for each element of an improper ε/2\varepsilon/2-cover that lies within ε/2\varepsilon/2 of a PMD, we can match it with such a PMD. The resulting set of PMDs defines then a proper ε\varepsilon-cover. Our focus in this section is to provide an efficient construction of a proper cover.

Our approach will be to enumerate the improper cover of Theorem 7 and convert each distribution to a nearby (n,k)(n,k)-PMD. This cover consists of distributions which are the sum of a Gaussian with a structure preserving rounding and a (poly(k/ε),k)(\operatorname*{poly}(k/\varepsilon),k)-PMD. Since the (poly(k/ε),k)(\operatorname*{poly}(k/\varepsilon),k)-PMD component is already a collection of kk-CRVs, this part of the cover is already proper, and it suffices to convert the Gaussian component into a nearby (npoly(k/ε),k)(n-\operatorname*{poly}(k/\varepsilon),k)-PMD.

The main technical lemma we prove is the following, which states that if a discretized Gaussian GG is spectrally close to a GMD ρ\rho, we can obtain a new GMD ρ\rho^{\prime} which is spectrally close to ρ\rho:

Let N(μ,Σ)\lfloor\mathcal{N}(\mu,\Sigma)\rceil be a discretized Gaussian and suppose there exists a (n,k)(n,k)-GMD ρ\rho with mean μρ\mu^{\rho} and covariance Σρ\Sigma^{\rho} such that for all vectors vv it holds that vT(μμρ)ε1vTΣv|v^{T}(\mu-\mu^{\rho})|\leq\varepsilon_{1}\sqrt{v^{T}\Sigma v} and vT(ΣΣρ)vε2vTΣv|v^{T}(\Sigma-\Sigma^{\rho})v|\leq\varepsilon_{2}v^{T}\Sigma v.

Then, it is possible to compute in time nO(k)n^{O(k)} a (n,k)(n,k)-GMD ρ\rho^{\prime} with mean μρ\mu^{\rho^{\prime}} and covariance Σρ\Sigma^{\rho^{\prime}} such that for all vectors vv it holds that vT(μμρ)ε1vTΣv+3k2.5v2|v^{T}(\mu-\mu^{\rho^{\prime}})|\leq\varepsilon_{1}\sqrt{v^{T}\Sigma v}+3k^{2.5}\|v\|_{2} and vT(ΣΣρ)vε2vTΣv+3k3v22|v^{T}(\Sigma-\Sigma^{\rho^{\prime}})v|\leq\varepsilon_{2}v^{T}\Sigma v+3k^{3}\|v\|_{2}^{2}.

We prove this lemma using the Shapley-Folkman lemma [Sta69], which states that the Minkowski sum of a large number of sets is approximately convex:

With this lemma in hand, the proof of Lemma 8 proceeds as follows. Let M\mathcal{M} be the set of all possible mean and covariances for a single CRV, and Mn\mathcal{M}^{\oplus n} be the Minkowski sum of nn copies of M\mathcal{M}. Given a discretized Gaussian with mean and covariance (μ,Σ)Mn(\mu,\Sigma)\in\mathcal{M}^{\oplus n}, we would ideally like to find {x1,,xn}\{x_{1},\dots,x_{n}\} such that i=1nxi=(μ,Σ)\sum_{i=1}^{n}x_{i}=(\mu,\Sigma). However, since this set is not convex, this optimization problem is not obviously tractable. Instead, we convert (μ,Σ)(\mu,\Sigma) to a spectrally close (μ^,Σ^)(\hat{\mu},\hat{\Sigma}) which lies on the convex hull of Mn\mathcal{M}^{\oplus n}, which can be done using a linear program. At this point, we exploit the “almost convex” characterization provided the Shapley-Folkman lemma, and we will iteratively “peel off” plausible CRVs. More specifically, noting that the moment profile is at most k2+kk^{2}+k dimensional and applying Lemma 9, we can use a linear program to find the parameters of a single CRV such that subtracting its moments gives a moment profile which lies on the convex hull of Mn1\mathcal{M}^{\oplus n-1}. We repeat nk2kn-k^{2}-k times until we are left with a point on the convex hull of Mk2+k\mathcal{M}^{\oplus k^{2}+k}, at which point we may pick the last k2+kk^{2}+k CRVs arbitrarily. The proof is completed by arguing that the resulting GMD satisfies the theorem conditions. For the full proof of Lemma 8, see Section 6.1.

We convert the same block of ρ\rho^{\prime} to a discretized Gaussian in the same way, incurring the same cost. Finally, we relate the two discretized Gaussians in total variation distance. As mentioned in the previous paragraph, the means and covariances are spectrally close up to relative accuracy 2εk\frac{2\varepsilon}{k} and εk3/2\frac{\varepsilon}{k^{3/2}}. We plug this guarantee into Lemma 2 and apply the data processing inequality (Lemma 1) to conclude that the two distributions are O(ε/k)O(\varepsilon/k)-close. The proof is concluded by setting ε=ε10/3/k25/2\varepsilon^{\prime}=\varepsilon^{10/3}/k^{25/2} and rescaling ε\varepsilon by a constant factor.

We first argue that rounding all constituent probability vectors in the (n,k)(n,k)-GMD ρ\rho so that all their coordinates are integer multiples of 1/n1/n to obtain a (n,k)(n,k)-GMD ρ^\hat{\rho} approximately preserves the spectral closeness guarantees with the discretized Gaussian. More specifically, for all vectors vv it holds that:

We know that μρμρ^1\|\mu^{\rho}-\mu^{\hat{\rho}}\|_{\infty}\leq 1 and thus μρμρ^2k\|\mu^{\rho}-\mu^{\hat{\rho}}\|_{2}\leq\sqrt{k}, so

Similarly we have that ΣρΣρ^max1\|\Sigma^{\rho}-\Sigma^{\hat{\rho}}\|_{\textrm{max}}\leq 1 which implies that vT(ΣρΣρ^)vkv2|v^{T}(\Sigma^{\rho}-\Sigma^{\hat{\rho}})v|\leq k\|v\|^{2} for all vectors vv. Thus,

At this point, we have shown that there exists a (n,k)(n,k)-GMD with mean and covariance close to that of the discretized Gaussian such that all its constituent probability vectors have coordinates that are integer multiples of 1/n1/n. Now, for every probability vector p\vec{p} with probabilities that are multiples of 1/n1/n, consider its moment profile (μp,Σp)(\mu^{\vec{p}},\Sigma^{\vec{p}}), where μp=p\mu^{\vec{p}}=\vec{p} and Σp\Sigma^{\vec{p}} are the mean and covariance of the kk-CRV with probabilities p\vec{p}. Let M\cal{M} be the set of all possible moment profiles generated by such probability vectors p\vec{p}. Since there are at most nk1n^{k-1} probability vectors p\vec{p} the set M\cal M has size at most nk1n^{k-1}. Moreover, it is easy to see that for the rounded GMD ρ^\hat{\rho}, it holds that (μρ^,Σρ^)Mn(\mu^{\hat{\rho}},\Sigma^{\hat{\rho}})\in{\cal{M}}^{\oplus n} where Mn={x  x1,...,xnM,x=ixi}{\cal{M}}^{\oplus n}=\{x\ |\ \exists x_{1},...,x_{n}\in{\cal M},x=\sum_{i}x_{i}\} denotes the Minkowski addition of M\cal{M} with itself nn times. This is because the mean and covariance of the GMD is equal to the sum of the means and covariances of its constituent CRVs, which are all in M\cal{M} since each CRV has probabilities that are integer multiples of 1/n1/n.

Naively searching over Mn{\cal{M}}^{\oplus n} for a GMD that satisfies the guarantees of ρ^\hat{\rho} is not easy since it would require time that is exponential in nn. To get a computationally efficient algorithm, we search instead in the set conv(Mn)=conv(M)n\textrm{conv}\left({\cal{M}}^{\oplus n}\right)=\textrm{conv}\left({\cal{M}}\right)^{\oplus n} where, for a set AA, conv(A)\textrm{conv}(A) denotes its convex closure, and the equality is a basic property of Minkowski sums. The reason this is easy is that it is solvable by a linear program as follows:

For mMm\in{\cal{M}} and i{1,...,n}i\in\{1,...,n\}, we assign the variables xi,m0x_{i,m}\geq 0 that denote whether we want to pick the moment profile mm for the ii-th CRV.

For all ii, we need that mxi,m=1\sum_{m}x_{i,m}=1. This ensures that for all ii, mxi,mmconv(M)\sum_{m}x_{i,m}m\in\textrm{conv}\left({\cal{M}}\right).

We need that the aggregate moment profile (μ^,Σ^)=i,mxi,mm(\hat{\mu},\hat{\Sigma})=\sum_{i,m}x_{i,m}m satisfies the closeness constraints with (μ,Σ)(\mu,\Sigma). For all vv we require that:

These are all linear constraints so a solution (μ^,Σ^)=i,mxi,mm(\hat{\mu},\hat{\Sigma})=\sum_{i,m}x_{i,m}m, can be computed by solving the linear program using the Ellipsoid method. Note that the constraints of the third bullet are infinitely many but can be verified efficiently using a separation oracle. To check the first set of constraints, we can check whether the optimization problem

has a negative solution. This is a convex optimization problem which can be solved in polynomial time. To check the second set of constraints, we note that ε2vTΣv+kv22=vT(ε2Σ+kI)v\varepsilon_{2}v^{T}\Sigma v+k\|v\|_{2}^{2}=v^{T}(\varepsilon_{2}\Sigma+kI)v. By setting A(ε2Σ+kI)A\triangleq(\varepsilon_{2}\Sigma+kI) and uA1/2vu\triangleq A^{1/2}v, we can rewrite the constraints as:

This is equivalent to checking whether the maximum eigenvalue of the matrix (A1/2)T(ΣΣ^)A1/2\left(A^{-1/2}\right)^{T}(\Sigma-\hat{\Sigma})A^{-1/2} is greater than 11.

For any m,mconv(M)m,m^{\prime}\in\textrm{conv}\left({\cal{M}}\right), it holds that mm1\|m-m^{\prime}\|_{\infty}\leq 1. Moreover, (μρ,Σρ)=i=1nmi(\mu^{\rho^{\prime}},\Sigma^{\rho^{\prime}})=\sum_{i=1}^{n}m_{i} and (μ^,Σ^)=i=1(nk2k)mi+i=1k2+kmi(\hat{\mu},\hat{\Sigma})=\sum_{i=1}^{(n-k^{2}-k)}m_{i}+\sum_{i=1}^{k^{2}+k}m^{\prime}_{i}. This implies that μρμ^k2+k\|\mu^{\rho^{\prime}}-\hat{\mu}\|_{\infty}\leq k^{2}+k and ΣρΣ^maxk2+k\|\Sigma^{\rho^{\prime}}-\hat{\Sigma}\|_{\max}\leq k^{2}+k. We have that:

Similarly, vT(ΣΣρ)vvT(ΣΣ^)v+vT(Σ^Σρ)vε2vTΣv+(k3+k2+k1)v22.|v^{T}(\Sigma-\Sigma^{\rho^{\prime}})v|\leq|v^{T}(\Sigma-\hat{\Sigma})v|+|v^{T}(\hat{\Sigma}-\Sigma^{\rho^{\prime}})v|\leq\varepsilon_{2}v^{T}\Sigma v+(k^{3}+k^{2}+k^{1})\|v\|_{2}^{2}.

A Lower Bound for Covers of PMDs

In this section, we discuss Theorem 4, the lower bound on the size of any ε\varepsilon-cover of (n,k)(n,k) PMDs. This theorem shows that it is not possible to get significant improvement on the cover size obtained in Theorem 3. In particular, the dependence of the size of the cover on 1/ε1/\varepsilon is tight up to a difference of 33 in the exponent of log(1/ε)\log(1/\varepsilon).

It turns out that it is easy to prove a dependence of O(nk)O(n^{k}) on the size of any ε\varepsilon-cover and most of the work is involved in showing a lower bound of T(k,ε)=2logk1(1/ε)T(k,\varepsilon)=2^{\log^{k-1}(1/\varepsilon)} on the cover size. Thus, in this overview we only focus on the machinery required to show the lower bound of T(k,ε)T(k,\varepsilon) on the ε\varepsilon-cover size. We remark that prior to our work, for k=2k=2 (i.e. PBDs), Diakonikolas, Kane, and Stewart obtained a lower bound of 2log2(1/ε)2^{\log^{2}(1/\varepsilon)} [DKS16b].

We provide the proof of Theorem 4. The proof will use algebraic geometric tools to argue this fact. In particular, the main theorem we will prove will be the following:

We will now prove Theorem 4 using Theorem 9.

Mα(Z)M_{\alpha}(Z) is a multisymmetric polynomial of degree α\|\alpha\| in the variables {PZ[i,j]}\{P_{Z}[i,j]\} where 1im1\leq i\leq m and 1j<k1\leq j<k. By multisymmetric, we mean the polynomial is invariant under permuting the rows of PZP_{Z}.

where the sum is taken over all surjections σ\sigma from {1,,α}[m]\{1,\ldots,\|\alpha\|\}\rightarrow[m] where for i<ji<j, if βi=βj\beta_{i}=\beta_{j}, then σ(i)<σ(j)\sigma(i)<\sigma(j).

The definition above might look a little involved, so to illustrate the concept, we will consider a simple example. For example, if k=3k=3 and α=(1,2,0)\alpha=(1,2,0), then

Likewise, if k=4k=4 and α=(1,1,1,0)\alpha=(1,1,1,0), then

Another family of polynomials which will be very useful in our reasoning will be the family of power sum multisymmetric polynomials.

As we have already observed, Mα(Z)M_{\alpha}(Z) is a multisymmetric polynomial in entries of the matrix PZP_{Z} at degree is at most α\|\alpha\|. To prove that is a linear combination of Eβ\mathbf{E}_{\beta} for βα\beta\preceq\alpha, it suffices to make the following observation: Note that Mα(Z)M_{\alpha}(Z) is

The next lemma implies bounds on the coefficients of Eβ\mathbf{E}_{\beta} in expressing MαM_{\alpha}. Let us now assume that Mα=βαγβEβM_{\alpha}=\sum_{\beta\preceq\alpha}\gamma_{\beta}\cdot\mathbf{E}_{\beta}. It is easy to see the following claim.

For α\alpha with αk=0\alpha_{k}=0, we have γα=j=1k1αj!\gamma_{\alpha}=\prod_{j=1}^{k-1}\alpha_{j}!.

The next claim is also fairly easy to prove.

βαγβ=2O(k)j=1k1αj!\sum_{\beta\preceq\alpha}|\gamma_{\beta}|=2^{O(k)}\cdot\prod_{j=1}^{k-1}\alpha_{j}!.

In other words, cjc_{j} is obtained by a pointwise product of α\alpha and the indicator vector of the singleton set {j}\{j\}. Now, assume that for 1jk11\leq j\leq k-1,

where β=β1++βk1\beta=\beta_{1}+\ldots+\beta_{k-1}. Thus, to prove our claim, it suffices to show that for any particular 1jk11\leq j\leq k-1,

Note that cjc_{j} is just αj\alpha_{j} at the jthj^{th} position and zero everywhere else. We introduce the following notation: For any integer kk, we let P(k)\mathcal{P}(k) denote the set of its partitions i.e. a tuple of strictly positive integers summing to kk ordered in decreasing sequence. For example, for k=5k=5, we have 6 distinct partitions (5),(4,1),(3,2),(3,1,1),(2,2,1),(2,1,1,1)(5),(4,1),(3,2),(3,1,1),(2,2,1),(2,1,1,1). For any partition PP(k)P\in\mathcal{P}(k), we use s(P)s(P) to denote the number of summands in PP. For example, for the partition P=(3,2)P=(3,2), s(P)=2s(P)=2. Further, if P=(x1,,xk)P=(x_{1},\ldots,x_{k}) is a partition of nn, then

With these notations in place, it is easy to see that

Now, it is easy to see that for any integer x>1x>1, (1.4)x<=x!(1.4)^{x}<=x!. If 1(P)1(P) denotes the number of 11 in the partition PP. With this, we have

Now, note that the total number of partitions of αj\alpha_{j} with tt ones in it is upper bounded by P(αt)|\mathcal{P}(\alpha-t)|. However, it is a well-known fact in number theory, that P(αt)2O(αt)|\mathcal{P}(\alpha-t)|\leq 2^{O(\sqrt{\alpha-t})}. Thus,

Let ZZ and ZZ^{\prime} be two (m,k)(m,k)-PMDs such that Eβ(Z)Eβ(Z)δ|\mathbf{E}_{\beta}(Z)-\mathbf{E}_{\beta}(Z^{\prime})|\geq\delta. Then, there exists β0β\beta_{0}\preceq\beta such that

where c=2O(k)c=2^{O(k)} is the constant appearing in Claim 2.

Let c=2O(k)c=2^{O(k)} be the constant appearing in Claim 2. Let β=d|\beta|=d and ii be the smallest integer such that there exists a β0β\beta_{0}\preceq\beta with β0=i|\beta_{0}|=i and

Note that by assumption, there exists such a β0\beta_{0}. Next,

Applying Claim 1 and Claim 2, we get that

Again applying the hypothesis on β0\beta_{0}, we have

The strategy for the rest of the proof is as follows: Instead of showing Theorem 9, we will show the following lemma.

To see why it suffices to prove Lemma 11, we have the following claim.

Assume towards a contradiction that dTV(Z1,Z2)<δmα1d_{TV}(Z_{1},Z_{2})<\delta\cdot m^{-\|\alpha\|_{1}}. By definition, this means that there is a coupling (Z1,Z2)(Z^{\prime}_{1},Z^{\prime}_{2}) such that the marginal Z1Z^{\prime}_{1} is distributed as Z1Z_{1}, the marginal Z2Z^{\prime}_{2} is distributed as Z2Z_{2} and Pr[Z1Z2]<δmα1\Pr[Z^{\prime}_{1}\not=Z^{\prime}_{2}]<\delta\cdot m^{-\|\alpha\|_{1}}. As the support of both Z1Z_{1} and Z2Z_{2} is confined in the box [0,m]k[0,m]^{k}, it easily follows that \big{|}M_{\alpha}(Z_{1})-M_{\alpha}(Z_{2})\big{|}<\Pr[Z^{\prime}_{1}\not=Z^{\prime}_{2}]\cdot m^{\|\alpha\|_{1}}<\delta. This results in a contradiction, thus completing the proof. ∎

In light of Lemma 10, it instead suffices to prove the following lemma.

It is easy to see that this operation defines legitimate (m,k)(m,k) PMD matrices. Further, note that Eα\mathbf{E}_{\alpha} is a homogenous polynomial of degree α\|\alpha\|. Thus, it is easy to see that if Eα(Ai)Eα(Aj)\mathbf{E}_{\alpha}(A_{i})\not=\mathbf{E}_{\alpha}(A_{j}), then

For us, the utility of Jacobian will come from its role in the following theorem.

As a consequence, we have the following corollary.

To show a lower bound on the entropy of MSM_{\mathcal{S}}, we will apply Corollary 2. To apply this, we need to prove that detJMS0\det J_{M_{\mathcal{S}}}\not=0. It is not clear how to show this, so we introduce an intermediate map.

The family {Pα}\{\mathbf{P}_{\alpha}\} is usually referred to as power-sum multisymmetric polynomials.

The idea here will be that we will relate the family Pα\mathbf{P}_{\alpha} and Eα\mathbf{E}_{\alpha} and then argue about the Jacobian of a map defined in terms of Pα\mathbf{P}_{\alpha}. The following relation between Eα\mathbf{E}_{\alpha} and Pα\mathbf{P}_{\alpha} was established in Dalbec [Dal99].

Using induction, the following lemma is immediate.

Immediately, we have that rank(JPS)rank(JES)\mathsf{rank}(J_{\mathbf{P^{\prime}}_{\mathcal{S}}})\leq\mathsf{rank}(J_{\mathbf{E^{\prime}}_{\mathcal{S}}}). Thus, to show a lower bound on rank(JES)\mathsf{rank}(J_{\mathbf{E^{\prime}}_{\mathcal{S}}}), it suffices to show a lower bound on rank(JPS)\mathsf{rank}(J_{\mathbf{P^{\prime}}_{\mathcal{S}}}). We next show the following claim.

This implies that rank(JES)dk/kk\mathsf{rank}(J_{\mathbf{E^{\prime}}_{\mathcal{S}}})\geq d^{k}/k^{k}. This means that we can choose a square submatrix of size (d/k)k×(d/k)k(d/k)^{k}\times(d/k)^{k} of JESJ_{\mathbf{E^{\prime}}_{\mathcal{S}}} of full rank. This means there is a subset of S\mathcal{S} of size dk/kkd^{k}/k^{k} (call it S\mathcal{S}^{\prime}) such that

A Fourier-Based Learning Algorithm for PMDs

We believe the application of Fourier analysis to learn such structured distributions is interesting in its own right and might have application in the future towards obtaining learning algorithms for other interesting classes of distributions. In particular, the recent work on the population recovery problem [WY12, MS13, LZ15] may also be viewed as an example of use of Fourier analysis towards learning of structured distributions.

Refer to Corollary 3 for the precise bounds. Similar exponential decay of Fourier transform is also true around the other points of the form {1,0,1}k\{-1,0,1\}^{k}. Let us use V=i=1k(1+σi)V=\prod_{i=1}^{k}(1+\sigma_{i}) where σi2\sigma_{i}^{2} are the eigenvalues of Σ\Sigma. It is not difficult to show that all but an ε\varepsilon-fraction of the mass of ZZ falls on a set of size Vlogk(1/ε)V\cdot\log^{k}(1/\varepsilon) (Lemma 18). On the other hand, using the exponential decay of the Fourier transform, we have the following crucial claim: We identify a region Sk\mathcal{S}\subseteq^{k} of volume logk(1/ε)/V\log^{k}(1/\varepsilon)/V such that

Refer to Claim 7 for the precise bounds. Also, in this informal description, we use O~\widetilde{O} to hide the dependence on kk as well as the polylogarithmic factors of 1/ε1/\varepsilon. This implies that if HH is another function such that H^(ξ)Z^(ξ)ε|\widehat{H}(\xi)-\widehat{Z}(\xi)|\leq\varepsilon inside S\mathcal{S} and outside S\mathcal{S}, then

The main idea behind learning PMDs is to look at the Fourier spectrum of PMDs. Specifically, we will prove two structural results about PMDs. One is that the Fourier spectrum of PMDs (roughly) has an exponential decay around the origin. The second result we will prove is the Fourier spectrum is a Lipschitz function and thus to estimate the Fourier spectrum in the entire domain, it suffices to compute it at a few points. Combining these two results along with standard statements on Fourier inversion show that if we construct a hypothesis distribution which approximates the Fourier spectrum of the target PMD at the chosen points and also exhbits a similar exponential decay in the Fourier spectrum, then the hypothesis distribution is close to the target PMD. While the condition on Fourier decay is not algorithmically easy to impose, we show that using some ideas from [DKT15], the problem of imposing these constraints reduces to linear programming. We will first quickly review the notion of Fourier spectrum of integer valued distributions.

Let us now recall the setting: P=Z1++ZnP=Z_{1}+\ldots+Z_{n} where ZiZ_{i} are independent random variables supported on {e1,,ek}\{\mathbf{e}_{1},\ldots,\mathbf{e}_{k}\}. Also for 1in1\leq i\leq n and 1jk1\leq j\leq k, let pij=Pr[Xi=j]p_{ij}=\Pr[X_{i}=j]. To specify the next lemma, for any ξk\xi\in^{k}, we will need to define an associated vector ζk\zeta\in^{k}. For any ξ\xi\in, define the associated ζ\zeta as follows:

Let XX be a CRV with covariance matrix Σ\Sigma. Then, X^(ξ)2115ζTΣζ|\widehat{X}(\xi)|^{2}\leq 1-\frac{1}{5}\cdot\zeta^{T}\cdot\Sigma\cdot\zeta.

To prove this, we do a simple case analysis, and use the inequality sin2(πx/2)x2/5\sin^{2}(\pi x/2)\geq x^{2}/5 for x3/2|x|\leq 3/2:

If both ξi\xi_{i} and ξj\xi_{j} are of the same type, then note that ξiξj=ζiζj1|\xi_{i}-\xi_{j}|=|\zeta_{i}-\zeta_{j}|\leq 1 which gives the required inequality.

If ξi\xi_{i} and ξj\xi_{j} are type 22 and 33, then note that ξiξj=2(ζiζj)|\xi_{i}-\xi_{j}|=|2-(\zeta_{i}-\zeta_{j})|. This implies that

Noting that ζiζj1|\zeta_{i}-\zeta_{j}|\leq 1 gives the required inequality.

If ξi\xi_{i} is of type 11 and ξj\xi_{j} is of type 2, then note that the maximum value that ξiξj|\xi_{i}-\xi_{j}| can take is 3/23/2. On the other hand, notice that ξiξjζiζj|\xi_{i}-\xi_{j}|\geq|\zeta_{i}-\zeta_{j}|. These two facts immediately imply that

The exact same situation holds if ξi\xi_{i} is of type 11 and ξj\xi_{j} is of type 3.

Having shown (11), see that this implies that

For any (n,k)(n,k)-PMD PP with covariance matrix Σ\Sigma, we have that for any ξk\xi\in^{k}:

This follows simply by noticing that for a PMD P=X1++XnP=X_{1}+\ldots+X_{n},

where Σi\Sigma_{i} is the covariance matrix of XiX_{i}. Using the inequality, 1x2ex2/21-x^{2}\leq e^{-x^{2}/2} (for x1|x|\leq 1), we have

We also have the following variant of the above lemma which will be useful for us.

where XX^{\prime} and YY^{\prime} are the centered random variables obtained by centering XX and YY. Likewise,

Applying the same for the Y^\widehat{Y^{\prime}} and applying triangle inequality, we get the claim. ∎

We now state the Plancherel identity in this setting. In particular, we have the following easy claim (which can be found in any standard text on Fourier analysis).

The above corollary demonstrates that to learn the PMD to error ε\varepsilon, it suffices to produce another distribution HH whose Fourier spectrum is very close to the Fourier spectrum of FF (the “very small” is quantified by the effective support of FF).

Given sample access to a (n,k)(n,k)-PMD XX with mean μ\mu and covariance matrix Σ\Sigma, there exists an algorithm which can produce estimates μ^\hat{\mu} and Σ^\hat{\Sigma} such that with probability at least 9/109/10 for every vector yy:

The sample and time complexity are O(k4/ε2)O(k^{4}/\varepsilon^{2}).

The following is guaranteed by the multidimensional Chernoff bound.

Let XX be a (n,k)(n,k)-PMD with mean μ\mu and covariance matrix Σ\Sigma. Let Lr={z:(zμ)(zμ)trΣ}L_{r}=\{z:(z-\mu)\cdot(z-\mu)^{t}\preceq r\cdot\Sigma\}. For r=O(log(1/ε)+logk)r=O(\log(1/\varepsilon)+\log k),

Let the eigenvectors of Σ\Sigma be v1,,vkv_{1},\ldots,v_{k} with the corresponding eigenvalues σ12,,σk2\sigma_{1}^{2},\ldots,\sigma_{k}^{2}. Consider any two distinct x,yLrx,y\in L_{r}. Since xx and yy are distinct, hence there must be some 1ik1\leq i\leq k such that the projection of xx and yy along viv_{i} is separated by k1/2k^{-1/2}. Let us denote the projection of xx along viv_{i} by xix_{i}. Then the condition of lying in LrL_{r} implies that xirσi|x_{i}|\leq r\cdot\sigma_{i}. It is then easy to see that if the number of integer points in LrL_{r} is more than i=1k(2σikr+1)\prod_{i=1}^{k}\left(2\sigma_{i}\sqrt{kr}+1\right), then there must be 22 points xx and yy and some 1ik1\leq i\leq k, xiyik1/2|x_{i}-y_{i}|\leq k^{-1/2}. ∎

For a point z{1,0,1}kz\in\{-1,0,1\}^{k}, define Cz,rC_{z,r} as

Note that k^{k} can be partitioned into the regions RzR_{z} (for z{1,0,1}kz\in\{-1,0,1\}^{k}). In other words,

Let Sr=z{1,0,1}k(RzCz,r)S_{r}=\cup_{z\in\{-1,0,1\}^{k}}(R_{z}\cap C_{z,r}) and let Sr=kSr\overline{S}_{r}=^{k}\setminus S_{r}. Then,

We now individually bound each of the summands. Fix any particular zz. Using Corollary 3

The last inequality uses that r>2r>2. This integral is now easily seen to be bounded by

This is exactly the same bound as stated in the claim. ∎

Let Sr=z{1,0,1}k(RzCz,r)S_{r}=\cup_{z\in\{-1,0,1\}^{k}}(R_{z}\cap C_{z,r}). Then,

Doing the exact same calculation as in the proof of Claim 7,

2 Learning algorithm for PMDs

Theorem 6, the structure theorem from [DKT15], allows us to assume that the PMD ZZ is essentially a discretized Gaussian GG convolved with a sparse PMD SS where the sparse PMD is supported on only poly(k/ε)\operatorname*{poly}(k/\varepsilon) summands.

By setting ‘ε\varepsilon’ from the Theorem statement to be ε10\varepsilon^{10}, we get that dTV(Z,G+S)ε10d_{TV}(Z,G+S)\leq\varepsilon^{10}. Because our subsequent learning algorithm will take O(ε10)\ll O(\varepsilon^{-10}) samples, we assume that we are getting samples from G+SG+S instead of ZZ and that Z=G+SZ=G+S. Furthermore, using the following claim from [DKT15], we can get a spectral estimate with accuracy ε10\varepsilon^{10} of the mean and covariance of the Gaussian GG by guessing the partition of coordinates in the covariance matrix of the Gaussian and going through all elements of the spectral cover of PSD matrices around a fine estimate S^\widehat{S} for Σ\Sigma obtained using k/ε2k/\varepsilon^{2} samples from Lemma 17.

Let A be a symmetric k×kk\times k PSD matrix with minimum eigenvalue 1 and let S\mathcal{S} be the set of all matrices BB such that yT(AB)yε1yTAy+ε2yTy|y^{T}\cdot(A-B)\cdot y|\leq\varepsilon_{1}y^{T}\cdot A\cdot y+\varepsilon_{2}y^{T}y where ε1[0,1/4)\varepsilon_{1}\in[0,1/4) and ε2[0,)\varepsilon_{2}\in[0,\infty). Then, there exists a cover Sε\mathcal{S}_{\varepsilon} of size (k(1+ε2)/ε)k2(k\cdot(1+\varepsilon_{2})/\varepsilon)^{k^{2}} such that any BSB\in\mathcal{S} is ε\varepsilon-spectrally close to some element in the cover.

The spectral closeness translates to closeness ε10\varepsilon^{10} in total variation distance between Gaussians (Lemma 2) and again since we will be taking O(ε10)\ll O(\varepsilon^{-10}) samples in the learning algorithm, we can assume that the gaussian GG has exactly the mean μG\mu_{G} and covariance ΣG\Sigma_{G} we guessed.

Similarly, we can assume that the sparse-PMD has known mean and covariance μS\mu_{S} and ΣS\Sigma_{S}. This is because any PMD with nn^{\prime} summands is ε10\varepsilon^{10}-close in total variation to a PMD where all the probabilities are rounded to multiples of nk/ε101\lceil n^{\prime}k/\varepsilon^{10}\rceil^{-1}. This fact follows from union-bounding all the errors of the individual summands. Since n=poly(k/ε)n^{\prime}=\operatorname*{poly}(k/\varepsilon) for the sparse PMD, all coordinates are multiples of poly(ε/k)\operatorname*{poly}(\varepsilon/k), which implies that the mean and covariance coordinates are also multiples of poly(ε/k)\operatorname*{poly}(\varepsilon/k) and we can guess them exactly using poly(k/ε)k2\operatorname*{poly}(k/\varepsilon)^{k^{2}} guesses. Again, since this sparse PMD is ε10\varepsilon^{10} close and we will be getting much fewer samples, we can assume that the sparse PMD has exactly the mean and covariance we guessed.

At this point, we have argued the following:

The PMD ZZ is equal to the sum of a discretized Gaussian GG and a sparse PMD SS with poly(k/ε)\operatorname*{poly}(k/\varepsilon) summands. The mean and covariance of the Gaussian (μG,ΣG)(\mu_{G},\Sigma_{G}) and of the sparse PMD (μS,ΣS)(\mu_{S},\Sigma_{S}) are known, which implies that the mean and covariance of the overall PMD ZZ is equal to (μ,Σ)=(μS,ΣS)+(μG,ΣG)(\mu,\Sigma)=(\mu_{S},\Sigma_{S})+(\mu_{G},\Sigma_{G}).

Our learning algorithm attempts to recover the sparse PMD in order to learn the overall distribution ZZ. However, imposing the condition that the distribution we are trying to estimate is a sparse PMD will involve solving non linear equations making the computation intractable. Rather, we will seek to learn a sparse distribution SS^{\prime} supported on [0,T]k[0,T]^{k} where T=poly(k/ε)T=\operatorname*{poly}(k/\varepsilon).

To learn this distribution, we will attempt to estimate its Fourier Transform. We will be mostly interested in points on the grid:

where (vi,σi2)(\vec{v}_{i},\sigma^{2}_{i}) are the eigenvector, eigenvalue pairs of the matrix Σ\Sigma. From Corollary 3, we know that the Fourier transform decays exponentially as we move away from {1,0,1}k\{-1,0,1\}^{k}, and in particular Claim 7 bounds the total mass contained at a distance at least rr from all the points. For our purposes, we set r=O(klogk+klog(1/ε))r=O(k\log k+k\log(1/\varepsilon)) and perform the following steps to learn the sparse distribution SS^{\prime}.

Create variables pαp_{\alpha} for every α[0,T]k\alpha\in[0,T]^{k} with the constraints 0pα10\leq p_{\alpha}\leq 1 and αTkpα=1\sum_{\alpha\in T^{k}}p_{\alpha}=1.

Let A1=z{1,0,1}k{ξ:σi2(vi((1)z(ξz)))2r}\mathcal{A}_{1}=\cup_{z\in\{-1,0,1\}^{k}}\{\xi:\sum{\sigma}^{2}_{i}\left({v}_{i}\cdot((-1)^{z}\circ(\xi-z))\right)^{2}\leq r\}. Let V1\mathcal{V}_{1} be the points of the grid V\mathcal{V} that lie in A1\mathcal{A}_{1}. For each of those points, get an estimate Z^est\widehat{Z}_{est} of Z^\widehat{Z} such that Z^estZ^<ε6kk2k|\widehat{Z}_{est}-\widehat{Z}|<\frac{\varepsilon}{6^{k}\cdot k^{2k}} and then impose linear constraints on {pα}\{p_{\alpha}\} so that Re[S^(ξ)G^(ξ)Z^est(ξ)]ε6kk2k|Re[\widehat{S^{\prime}}(\xi)\cdot\widehat{G}(\xi)-\widehat{Z}_{est}(\xi)]|\leq\frac{\varepsilon}{6^{k}\cdot k^{2k}} and Im[S^(ξ)G^(ξ)Z^est(ξ)]ε6kk2k|Im[\widehat{S^{\prime}}(\xi)\cdot\widehat{G}(\xi)-\widehat{Z}_{est}(\xi)]|\leq\frac{\varepsilon}{6^{k}\cdot k^{2k}}.

Let σG,i2,vG,i\sigma_{G,i}^{2},\vec{v}_{G,i} be the eigenvalues and eigenvectors of ΣG\Sigma_{G} We note that, since we may have eigenvalues which are both large and small in magnitude, a naive eigendecomposition algorithm would incur a cost which depends on nn. However, as we only require the eigenvalues and eigenvectors approximately, this cost can be avoided by applying an appropriate power-iteration method. The cost in terms of kk and 1/ε1/\varepsilon is dominated by the other steps in our algorithm. and consider the set:

Construct a grid of points in k^{k} with a spacing of ε2kk2k6k\frac{\varepsilon^{2k}}{k^{2k}\cdot 6^{k}} in every direction. Let V2\mathcal{V}_{2} be the subset of these points which fall in A2A_{2}. For all these points impose the conditionss that Re[S^(ξ)]eζTΣSζ|Re[\widehat{S^{\prime}}(\xi)]|\leq e^{-\zeta^{T}\Sigma_{S}\zeta} and Im[S^(ξ)]eζTΣSζ|Im[\widehat{S}(\xi)]|\leq e^{-\zeta^{T}\Sigma_{S}\zeta} that follow from Corollary 3.

Finally, add the constraints αpαα=μS\sum_{\alpha}p_{\alpha}\alpha=\mu_{S} and αpα(αμS)(αμS)T=ΣS\sum_{\alpha}p_{\alpha}(\alpha-\mu_{S})(\alpha-\mu_{S})^{T}=\Sigma_{S}

Note that in Step 2, V1\mathcal{V}_{1} has size at most (rk2k6kε)k\left(\frac{\sqrt{r}\cdot k^{2k}\cdot 6^{k}}{\varepsilon}\right)^{k}. If we naively estimated every Fourier coefficient in V1\mathcal{V}_{1} the number of samples would be too high because every Fourier coefficient requires log(1/δ)/ε2\log(1/\delta)/{\varepsilon^{2}} samples to learn with accuracy ε\varepsilon and probability of failure 1δ1-\delta. However, we can instead take O(klog(r/ε)/ε2)O(k\log(r/\varepsilon)/\varepsilon^{2}) samples and reuse the same samples to compute all the required Fourier coefficients. Since the probability of error is very small a simple union bound among all of the coefficients, shows that with at least constant probability all of them can be estimated within ε\varepsilon.

To complete the learning algorithm, we repeat the steps above for each of the guessed mean and covariance matrices (μG,ΣG),(μS,ΣS)(\mu_{G},\Sigma_{G}),(\mu_{S},\Sigma_{S}). We then perform a hypothesis selection algorithm to choose a distribution within O(ε)O(\varepsilon) from each of the distributions we obtain. We made O(poly(k/ε)k2)O(\operatorname*{poly}(k/\varepsilon)^{k^{2}}) guesses, and thus obtained O(poly(k/ε)k2)O(\operatorname*{poly}(k/\varepsilon)^{k^{2}}) candidate hypotheses. Applying the following tournament theorem for hypothesis selection from [DK14], we can select a good estimate in O((kε)2log(k/ε))O\left(\left(\frac{k}{\varepsilon}\right)^{2}\log(k/\varepsilon)\right) samples in O(poly(k/ε)k2)O(\operatorname*{poly}(k/\varepsilon)^{k^{2}}) runtime.

We first show that there is a solution to {pα}\{p_{\alpha}\} which satisfies all the constraints. Indeed, if we set the sparse distribution SS^{\prime} to be equal to the distribution SS of the sparse PMD we defined above, we get:

αTkpα=1\sum_{\alpha\in T^{k}}p_{\alpha}=1 since SS is a probability distribution supported on [0,T]k[0,T]^{k}.

The constraint Re[S^(ξ)G^(ξ)Z^est(ξ)]ε6kk2k|Re[\widehat{S^{\prime}}(\xi)\cdot\widehat{G}(\xi)-\widehat{Z}_{est}(\xi)]|\leq\frac{\varepsilon}{6^{k}\cdot k^{2k}} is satisfied since for S=SS^{\prime}=S,

The derivation for the constraint on the imaginary part is identical.

From Corollary 3, the sparse PMD satisfies S^(ξ)e(1/5)ζTΣSζ|\widehat{S}(\xi)|\leq e^{-(1/5)\cdot\zeta^{T}\Sigma_{S}\zeta} everywhere in k^{k}. This condition implies the imposed constraints which are only evaluated in few points.

The distribution SS has mean μS\mu_{S} and covariance ΣS\Sigma_{S}, so the last constraint is satisfied.

Consider any point ξ\xi in A1\mathcal{A}_{1}. Then, note that there is some ξV\xi^{\prime}\in\mathcal{V} such that for 1ik1\leq i\leq k, ξξ,viεk2k6kmax{1,σi}\langle\xi-\xi^{\prime},\vec{v}_{i}\rangle\leq\frac{\varepsilon}{k^{2k}\cdot 6^{k}\cdot{\max\{1,\sigma_{i}\}}}. Applying Lemma 16, we get that

We bound the volume of the set B2B_{2}. To do this, we again apply Claim 8, and get that

Note that for any point ξA2\xi\in A_{2}, we there is a point ξ\xi^{\prime} such that ξξ2ε2kk2k6k\|\xi-\xi^{\prime}\|_{2}\leq\frac{\varepsilon^{2k}}{k^{2k}\cdot 6^{k}} and that S^(ξ)e(1/5)ζTΣSζ|\widehat{S^{\prime}}(\xi^{\prime})|\leq e^{-(1/5)\cdot\zeta^{\prime T}\Sigma_{S}\cdot\zeta^{\prime}}. Since the variance of ΣS\Sigma_{S} is at most poly(k/ε)\operatorname*{poly}(k/\varepsilon) in every direction, we get that

By applying Claim 8 to bound the volume of the set A2B2A_{2}\subseteq B_{2} and using the fact that S^(ξ)2|\widehat{S}(\xi)|^{2} is at most er/20e^{-r/20}, we get that the first integral is at most

The last inequality uses the fact that whenever σG,iσi\sigma_{G,i}\leq{\sigma}_{i}, it must imply that all the variance comes from SS and thus σipoly(k/ε){\sigma_{i}}\leq\operatorname*{poly}(k/\varepsilon). By plugging the value of rr, we get that

The calculation for the second integral is similar.

Here the first inequality follows by exactly the same calculation we did for the first integral whereas the second inequality uses that A2B2A_{2}\subseteq B_{2}. Now, that we had derived that

However, max{σG,i,1}εΘ(1)max{σi,1}\max\{\sigma_{G,i},1\}\geq\varepsilon^{\Theta(1)}\cdot\max\{{\sigma_{i}},1\} (because the variance of SS is at most poly(1/ε)\operatorname*{poly}(1/\varepsilon) in any direction. This implies that

Note that S+G^(ξ)=G^(ξ)S^(ξ)\widehat{S^{\prime}+G}(\xi)=\widehat{G}(\xi)\cdot\widehat{S^{\prime}}(\xi). Thus, S+G^(ξ)2G^(ξ)2|\widehat{S^{\prime}+G}(\xi)|^{2}\leq|\widehat{G}(\xi)|^{2}. Applying Claim 7 and noting that

Again using the fact that the variance of SS in any direction is at most poly(k/ε)\operatorname*{poly}(k/\varepsilon),

Plugging in the value of rr, we get that

Combining Claim 11, Claim 12 and Claim 13, we get that

This is at most dTV(S+G,S+G)ε(klog(1/ε))O(k)d_{TV}(S+G,S^{\prime}+G)\leq\varepsilon\cdot(k\log(1/\varepsilon))^{O(k)}. Setting ε\varepsilon to be εpoly(k,log(1/ε))k\frac{\varepsilon^{\prime}}{\operatorname*{poly}(k,\log(1/\varepsilon^{\prime}))^{k}}, we complete the proof of Theorem 5.

Open Problems

A number of interesting questions regarding Poisson Multinomial distributions are left open by this work and [DKS16a]. We outline a few of them here.

The complexity of learning Poisson Multinomials. This work and [DKS16a] both give algorithms for learning PMDs. The sample and time complexities are polynomial in 1/ε1/\varepsilon and exponential in kk. Meanwhile, [DKT15] gives an algorithm with a sample complexity polynomial in both parameters, but the time complexity is exponential in kk and 1/ε1/\varepsilon. Is there an algorithm for learning PMDs with sample and time complexities both polynomial in kk and 1/ε1/\varepsilon?

Exploring the connection between Poisson Multinomials and Laplacian matrices. In this work, we described a cover for the set of (n,k)(n,k)-PMDs of size Ok,ε(nO(k))O_{k,\varepsilon}(n^{O(k)}). Our construction relied crucially on Observation 1 (which states that the covariance matrix of a PMD is Laplacian) and spectral sparsification results for Laplacian matrices. With this connection in hand, can one derive other results for PMDs using the wealth of literature on Laplacian matrices?

A tighter central limit theorem. [VV11] proves a central limit theorem between an (n,k)(n,k)-GMD and a discretized Gaussian with the same mean and covariance, upper bounding their total variation distance by O(k4/3σ1/3log2/3n)O(k^{4/3}\sigma^{-1/3}\log^{2/3}n), where σ2\sigma^{2} is the smallest eigenvector of the covariance matrix of the GMD. Both this paper and [DKS16a] qualitatively improve this bound by removing the dependence on nn, while keeping the dependence on kk and 1/σ1/\sigma still polynomial. How well can a GMD be approximated by a discretized Gaussian? In one dimension, the answer is Θ(1/σ)\Theta(1/\sigma) [CGS10], which implies a the answer for multiple dimensions is at least Ω(k/σ)\Omega(\sqrt{k}/\sigma). [DKS16a] achieves this dependence on 1/σ1/\sigma (up to log factors), but the optimal dependence on kk is currently unknown.

References

Appendix A Proof of Lemma 2

Without loss of generality, assume that Σ1\Sigma_{1} and Σ2\Sigma_{2} are full rank. If not, the guarantees in the statement ensure that their nullspace is identical, and we can project to a lower dimension such that the resulting matrices are full rank.

First, we note that the assumptions in the lemma statement can be converted to be in terms of the minimum of the two variances, instead of the maximum. Define σv2=min{vTΣ1v,vTΣ2v}\sigma_{v}^{2}=\min\{v^{T}\Sigma_{1}v,v^{T}\Sigma_{2}v\}. The second assumption can be rearranged to see that (1ε2)sv2σv2(1-\frac{\varepsilon}{2})s_{v}^{2}\leq\sigma_{v}^{2}. Plugging this back into the second assumption gives that

where the last inequality holds for ε1\varepsilon\leq 1 (otherwise, the lemma’s conclusion is trivial). Similarly, the second assumption also implies 1ε2svσv\sqrt{1-\frac{\varepsilon}{2}}s_{v}\leq\sigma_{v}, when plugged into the first assumption gives

For the remainder of the proof, we will use these guarantees instead of the ones in the lemma statement.

We recall the standard formula for KL-divergence between two Gaussian distributions. Let {λi}\{\lambda_{i}\} be the eigenvalues of Σ21/2Σ1Σ21/2\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}.

We bound the divergence induced by differences in the means and covariances separately. We start with the means. Note that

Now we bound the divergence induced by differences in the covariances. We bound the eigenvalues of Σ21/2Σ1Σ21/2\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}. Note that

Substituting u=Σ21/2vu=\Sigma_{2}^{1/2}v makes the latter condition equivalent to

The Courant-Fischer Theorem implies that 11+ελi1+ε\frac{1}{1+\varepsilon}\leq\lambda_{i}\leq 1+\varepsilon for all ii.

At this point, we note that xlnx1(1x)2x-\ln x-1\leq(1-x)^{2} for all x1x\geq 1. This implies