A Size-Free CLT for Poisson Multinomials and its Applications

Constantinos Daskalakis, Anindya De, Gautam Kamath, Christos Tzamos

Introduction

In theoretical computer science, PMDs are commonly used in the analysis of randomized algorithms, often through large deviation inequalities. They have also found applications in algorithmic problems where one is looking for a collection of random vectors optimizing a certain probabilistic objective, or satisfying probabilistic constraints. For example, understanding the behavior of PMDs has led to polynomial-time approximation schemes for anonymous games [Mil96, Blo99, Blo05, Kal05, DP07, DP08, DP09], despite the PPAD-completeness of their exact equilibria [CDO15]. Anonymous games are games where a large number $n$ of players share the same $k$ strategies, and each player’s utility only depends on his own choice of strategy and the number of other players that chose each of the $k$ strategies. In particular, the expected payoff of each player depends on the PMD resulting from the mixed strategies of the other players. It turns out that understanding the behavior of PMDs provides a handle on the computation of approximate Nash equilibria. One of our main contributions is to advance the state of the art for computing approximate Nash equilibria in anonymous games. We will come to this contribution shortly.

Recently Valiant and Valiant have used PMDs to obtain sample complexity lower bounds for testing symmetric properties of distributions [VV11]. The workhorse in their lower bounds is a new CLT bounding the total variation distance between a $(n,k)$ -GMD and a multidimensional Gaussian with the same mean vector and covariance matrix. Since they are comparing a discrete to a continuous distribution under the total variation distance, they need to discretize the Gaussian by rounding its coordinates to their closest point in the integer lattice. If $X$ is distributed according to some $(n,k)$ -GMD with mean vector $\mu$ and covariance matrix $\Sigma$ , and $Y$ is distributed according to the multi-dimensional Gaussian ${\cal N}(\mu,\Sigma)$ , [VV11] shows that:

where $\sigma^{2}$ is the minimum eigenvalue of $\Sigma$ and $\lfloor Y\rceil$ denotes the rounding of $Y$ to the closest point in the integer lattice. The dependence of the bound on the dimension $k$ and the minimum eigenvalue $\sigma^{2}$ is necessary, and quite typical of Berry-Esseen type bounds. Answering a question raised in [VV11], we prove a qualitatively stronger CLT by showing that the explicit dependence of the bound on $n$ can be removed (hence, the CLT is “size-free”).

Suppose that $X$ is distributed according to some $(n,k)$ -GMD with mean $\mu$ and covariance matrix $\Sigma$ , and $Y\sim{\cal N}(\mu,\Sigma)$ . There exists some constant $C>0$ such that

where $\sigma^{2}$ is the minimum eigenvalue of $\Sigma$ .

Interestingly, Theorem 1 is proven by bootstrapping the Valiant-Valiant CLT itself. Indeed, this CLT was used as one of the key ingredients in a recent structural characterization of PMDs [DKT15], where it was shown that any $(n,k)$ -Poisson multinomial random vector is $\varepsilon$ -close in total variation distance to the sum of an (appropriately discretized) Gaussian and a $({\rm poly}(k/\varepsilon),k)$ -Poisson multinomial random vector; see Theorem 6. In turn, we prove Theorem 1 by using Theorem 6 as a black box.

For more details on our proof’s approach, see Section 3.

In the remainder of this section we discuss the algorithmic applications of our CLT, concluding with our improved algorithms for learning PMDs using Fourier analysis.

We have already discussed anonymous games earlier in this section, where we have also explained their relation to PMDs. In particular, the expected utility $u_{i}$ of some player $i$ in a $n$ -player $k$ -strategy anonymous game only depends on his own choice of mixed strategy $X_{i}$ and the $(n-1,k)$ -Poisson multinomial random vector $\sum_{j\neq i}X_{j}$ aggregating the mixed strategies of his opponents. It is therefore natural to expect that a better understanding of the structure of PMDs could lead to improved algorithms for computing Nash equilibria in these games. Indeed, earlier work [DP08, DP15] has exploited this connection to obtain algorithms for approximate Nash equilibria, whose running time is

While clearly of theoretical interest, this bound shows that anonymous games are one of the few classes of games where approximate equilibria can be efficiently computed, while exact equilibria are PPAD-hard [CDO15], even for $n$ -player $7$ -strategy anonymous games. Exploiting our CLT we obtain a significant improvement over [DP08].

An $\varepsilon$ -approximately well supported Nash equilibrium of an $n$ -player $k$ -strategy anonymous games whose utilities are in $can be computed in time:As it is customary in Nash equilibrium algorithms, approximate Nash equilibria are defined with respect to additive approximations and the player utilities are normalized to$ to make these approximations meaningful.

The salient feature of Theorem 2 is the polynomial dependence of the running time on $n$ and its quasi-polynomial dependence on $\varepsilon^{-1}$ . In terms of these dependencies our algorithm matches the best known algorithm for $2$ -strategy anonymous games [DP09], where much more is known given the single-dimensional nature of $(n,2)$ -PMDs.

Moreover, the recent hardness results for anonymous games [CDO15] establish that not only finding an exact but also a $2^{n^{a}}$ -approximate Nash equilibrium is PPAD-hard. An interesting corollary of Theorem 2 is that this cannot be pushed to ${\rm poly}(1/n)$ -approximations, unless PPAD can be solved in quasi-polynomial time.

Unless PPAD $\subseteq$ Quasi-PTIME, it is not PPAD-hard to find a ${\rm poly}(1/n)$ -approximately well supported Nash equilibrium in anonymous games, for any ${\rm poly}(\cdot)$ .

It is interesting to contrast this corollary with normal-form games where it is known that computing inverse polynomial approximations is PPAD-hard [DGP09, CDT09].

From a technical standpoint, our algorithm for anonymous games uses the structural understanding of PMDs as follows. Since every player views the aggregate strategies of the other players as a PMD, one approach would be to guess each player’s view using a cover as developed in [DKT15]. However, this approach gives a runtime which is exponential in $n$ , since it requires us to enumerate the cover for each player. An alternative approach is to guess the overall PMD which occurs at a Nash equilibrium, and guess appropriate “corrections” that allow us to infer each player’s view. To do this, we must find an alternative PMD which approximately matches the PMD at Nash in the following sense:

The PMD that results by removing the CRV corresponding to a player should be close to the view that the player observes;

A player’s CRV must only assign probability to strategies which are approximate best responses to his view.

It turns out that these conditions can be satisfied by using a careful dynamic program together with the structural understanding provided by [DKT15] and the CLT of Theorem 1. According to this structural result, we can partition the players into a “sparse” and a “Gaussian” component. Moreover, our CLT implies that matching the first two moments of the Gaussian suffices to approximate this component. This allows us to perform guesses at a different granularity for the sparse and Gaussian components. Roughly speaking, our dynamic program guesses a succinct representation of the two components and tries to compute CRVs which obey this representation and satisfy the conditions outlined above.

For more details on our PTAS, refer to Section 4.

Moreover, we can efficiently enumerate this cover in time polynomial in its size.

It is important to contrast Theorem 3 with Theorem 2 in [DKT15], which provides a non-proper cover whose size is similar, albeit with a leading factor of $n^{O(k^{2})}$ . Instead, our cover is proper, which is important for approximation algorithms that require searching over PMDs. Its dependence on $n$ is also optimal, as the number of $(n,k)$ -PMDs whose summands are deterministic is already $n^{\Omega(k)}$ . Moreover, we provide a lower bound for the dependence on $1/\varepsilon$ , establishing that the quasi-polynomial dependence is also essentially optimal.

We describe our proper cover construction in two parts. First, we give details on how to construct a non-proper cover of size $n^{O(k)}$ . The main tool we use is the existence of spectral sparsifiers for Laplacian matrices. Our non-proper cover sparsifies the non-proper cover of [DKT15], showing how its leading factor of $n^{O(k^{2})}$ can be reduced to $n^{O(k)}$ . Roughly speaking, the factor of $n^{O(k^{2})}$ was due to spectrally approximating all possible covariance matrices $\Sigma$ , whose $O(k^{2})$ entries are bounded by $n$ . These covariance matrices corresponded to covariance matrices of $(n,k)$ -PMDs, and the cover maintained for each such $\Sigma$ some $\Sigma^{\prime}$ such that $|v^{T}(\Sigma-\Sigma^{\prime})v|\leq{\rm poly}(\varepsilon/k)\cdot v^{T}\Sigma v,\forall v$ . (We call this guarantee a “ ${\rm poly}(\varepsilon/k)$ -spectral approximation.”) The realization leading to our sparsification result is that covariance matrices of PMDs are in fact graph Laplacians. Indeed, a $(n,k)$ -PMD, $X=\sum_{i}X_{i}$ , has covariance matrix, ${\rm\bf cov}(X)=\sum_{i}{\rm\bf cov}(X_{i})$ , corresponding to the sum of the covariance matrices of its summands. Now the covariance matrix of a $k$ -CRV, $X_{i}$ , is actually the Laplacian of a graph that has one node $j$ per dimension, along with an edge from node $j$ to node $j^{\prime}$ of weight ${\bf E}[X_{ij}]\cdot{\bf E}[X_{ij^{\prime}}]$ ; and the covariance matrix of a $(n,k)$ -PMD is the Laplacian of the graph with the sum of the weights from each constituent $k$ -CRV– see Observation 1. We show that Laplacians corresponding to $(n,k)$ -PMDs can be ${\rm poly}(\varepsilon/k)$ -spectrally covered with a set of covariance matrices of size $n^{O(k)}\cdot\left({k\over\varepsilon}\right)^{O(k^{3})}$ .

We appeal to recent results in spectral sparsification of Laplacian matrices [ST11, SS11, BSS12, BSST13]. In particular, we use the result of Batson, Spielman, and Srivastava [BSS12] (Theorem 8) to argue that the underlying graph can be sparsified to linearly many edges in the dimension $k$ . We do this in the hopes that we would have fewer parameters in the covariance matrix to guess. Unfortunately, the [BSS12] sparsification theorem has polynomial dependence in the accuracy. So applying it with a ${\rm poly}(\varepsilon/k)$ -approximation error, which is what we need, gives a meaningless result (namely no sparsification at all). Instead, we only use this theorem to get a rough $O(1)$ -spectral cover of $(n,k)$ -PMD covariance matrices. Around every covariance matrix in this rough cover we grow a local ${\rm poly}(\varepsilon/k)$ -spectral cover. Roughly speaking, as the $O(1)$ -spectral cover provides multiplicative approximation to the variance in every direction $v$ , every covariance matrix in this cover gives us a multiplicative handle on the eigenvalues of the matrices approximated by it. This is sufficient information to cover these matrices to ${\rm poly}(\varepsilon/k)$ -spectral error with a “local” spectral cover of size $(k/\varepsilon)^{O(k^{2})}$ – see Lemma 6. Putting everything together, we get a ${\rm poly}(\varepsilon/k)$ -spectral cover of all covariance matrices of $(n,k)$ -PMDs of size $n^{O(k)}\cdot\left({k\over\varepsilon}\right)^{O(k^{3})}$ – see Section 5.1.3. As covering these matrices was the bottleneck in the size of the non-proper cover, this completes the construction of a non-proper cover whose size is (4).

Further details on our non-proper construction are provided in Section 5.

Details on this conversion process are provided in Section 6.

Our lower bound is described further in Section 7. Our technique shows a lower bound on the metric entropy of a polynomial map of the moments of PMDs using an extension of Bézout’s theorem and other tools from algebraic geometry.

Finally, we give a new learning algorithm for PMDs:

This improves the learning algorithm from [DKT15] by eliminating the superpolynomial dependence on $\varepsilon$ in the running time that was obtained in that paper. Our algorithm exploits properties of the continuous Fourier transform of a PMD, as opposed to recent work by Diakonikolas, Kane and Stewart on learning univariate sums of independent integer random variables, which uses the discrete Fourier transform [DKS16b]. They also apply similar discrete Fourier techniques in their simultaneous work on PMDs [DKS16a].

We note that such Fourier-based learning algorithms may simply output a description of the Fourier transform of a distribution. This allows one to compute the PMF of the distribution at any point of interest, but it is not obvious how to sample from such a description. Our algorithm outputs an explicit description of a distribution, which allows one to efficiently (i.e., in time independent of $n$ ) draw samples from the distribution. In contrast, they output the Fourier transform of a distribution and describe how to sample from it.

For more details on our learning algorithm, refer to Section 8.

1 Comparison of Results with [DKS16a]

Simultaneous to our work, Diakonikolas, Kane, and Stewart also studied Poisson Multinomial distributions [DKS16a]. In this section, we describe and compare their results with ours. While both papers independently prove many qualitatively similar results, the techniques are quite different, and thus both may be of independent interest.

Both papers prove new CLTs, which manage to remove the dependence on $n$ which is found in the CLT of [VV11], while the dependence on $k$ and $1/\sigma$ remains polynomial. Additionally, both works improve upon the previous best covers for PMDs [DKT15]. First, both manage to reduce the size of the cover – interestingly, the two improvements seem to be orthogonal. Our result improves the dependence on $n$ from $n^{k^{2}}$ to $n^{O(k)}$ , while theirs improves the dependence on $k$ and $1/\varepsilon$ from $(1/\varepsilon)^{O(k^{5k}\log^{k+1}(1/\varepsilon))}$ to $(1/\varepsilon)^{O(k\log(k/\varepsilon)/\log\log(k/\varepsilon))^{k-1}}$ . We note that this upper bound holds for $k>2$ : for $k=2$ , [DKS16b] proves the tight cover size bound of $n\cdot(1/\varepsilon)^{\Theta(\log(1/\varepsilon))}$ . Furthermore, both papers describe how to efficiently achieve a proper cover of this size. These cover sizes are asymptotically optimal, as shown by lower bounds in both papers. In particular, the double-exponential dependence in $k$ is necessary. Both works also consider the problem of finding approximate Nash equilibria in anonymous games. The complexity of both algorithms is roughly comparable to the PMD cover size. Finally, both papers study the learning of PMDs, obtaining algorithms with sample complexity $\operatorname*{poly}(k,\log(1/\varepsilon))^{k}/\varepsilon^{2}$ . The runtime of our algorithm is $\operatorname*{poly}(k/\varepsilon)^{k^{2}}$ , and the runtime of their algorithm is $\operatorname*{poly}(k,\log(1/\varepsilon))^{k}/\varepsilon^{2}\cdot\log n$ , both in the standard word RAM model.

Preliminaries

We more formally define several of the distribution classes we consider.

A $k$ -Categorical Random Variable ( $k$ -CRV) is a random variable that takes values in $\{e_{1},\dots,e_{k}\}$ where $e_{j}$ is the $k$ -dimensional unit vector along direction $j$ . $\pi(i)$ is the probability of observing $e_{i}$ .

An $(n,k)$ -Poisson Multinomial Distribution ( $(n,k)$ -PMD) is given by the law of the sum of $n$ independent but not necessarily identical $k$ -CRVs. An $(n,k)$ -PMD is parameterized by a nonnegative matrix $\pi\in^{n\times k}$ each of whose rows sum to $1$ is denoted by $M^{\pi}$ , and is defined by the following random process: for each row $\pi(i,\cdot)$ of matrix $\pi$ interpret it as a probability distribution over the columns of $\pi$ and draw a column index from this distribution. Finally, return a row vector recording the total number of samples falling into each column (the histogram of the samples).

We note that a sample from an $(n,k)$ -PMD is redundant – given $k-1$ coordinates of a sample, we can recover the final coordinate by noting that the sum of all $k$ coordinates is $n$ . For instance, while a Binomial distribution is over a support of size $2$ , a sample is $1$ -dimensional since the frequency of the other coordinate may be inferred given the parameter $n$ . With this inspiration in mind, we define the Generalized Multinomial Distribution, which is the primary object of study in [VV11].

A Truncated $k$ -Categorical Random Variable is a random variable that takes values in $\{0,e_{1},\dots,e_{k-1}\}$ where $e_{j}$ is the $(k-1)$ -dimensional unit vector along direction $j$ , and is the $(k-1)$ dimensional zero vector. $\rho(0)$ is the probability of observing the zero vector, and $\rho(i)$ is the probability of observing $e_{i}$ .

An $(n,k)$ -Generalized Multinomial Distribution ( $(n,k)$ -GMD) is given by the law of the sum of $n$ independent but not necessarily identical truncated $k$ -CRVs. A GMD is parameterized by a nonnegative matrix $\rho\in^{n\times(k-1)}$ each of whose rows sum to at most $1$ is denoted by $G^{\rho}$ , and is defined by the following random process: for each row $\rho(i,\cdot)$ of matrix $\rho$ interpret it as a probability distribution over the columns of $\rho$ – including, if $\sum_{j=1}^{k}\rho(i,j)<1$ , an “invisible” column – and draw a column index from this distribution. Finally, return a row vector recording the total number of samples falling into each column (the histogram of the samples).

For both $(n,k)$ -PMDs and $(n,k)$ -GMDs, we will refer to $n$ and $k$ as the size and dimension, respectively.

We note that a PMD corresponds to a GMD where the “invisible” column is the zero vector, and thus the definition of GMDs is more general than that of PMDs. However, whenever we refer to a GMD in this paper, it will explicitly have a non-zero invisible column.

While we will approximate the Multinomial distribution with Gaussian distributions, it does not make sense to compare discrete distributions with continuous distributions, since the total variation distance is always $1$ . As such, we must discretize the Gaussian distributions. We will use the notation $\lfloor x\rceil$ to say that $x$ is rounded to the nearest integer (with ties being broken arbitrarily). If $x$ is a vector, we round each coordinate independently to the nearest integer.

As seen in the definition of an $(n,k)$ -GMD, we have one coordinate which is equal to $n$ minus the sum of the other coordinates. We define a similar notion for a discretized Gaussian. However, we go one step further, to take care of when there are several such Gaussians which live in disjoint dimensions. By this, we mean that given two Gaussians, the set of directions in which they have a non-zero variance are disjoint. Without loss of generality (because we can simply relabel the dimensions), we assume all of a Gaussian’s non-zero variance directions are consecutive, i.e., the covariance matrix is all zeros, except for a single block on the diagonal. Therefore, when we add the covariance matrices, the result is block diagonal. The resulting distribution is described in the following definition.

The structure preserving rounding of a multidimensional Gaussian Distribution takes as input a multi-dimensional Gaussian $\mathcal{N}(\mu,\Sigma)$ with $\Sigma$ in block-diagonal form. It chooses one coordinate as a “pivot” in each block, samples from the Gaussian ignoring these pivots and rounds each value to the nearest integer. Finally, the pivot coordinate of each block is set by taking the difference between the sum of the means and the sum of the values sampled within the block.

Finally, we formally define the notion of a cover.

2 Probability Metrics

To compare probability distributions, we will require the total variation and Kolmogorov distances:

The total variation distance between two probability measures $P$ and $Q$ on a $\sigma$ -algebra $F$ is defined by

Unless explicitly stated otherwise, in this paper, when two distributions are said to be $\varepsilon$ -close, we mean in total variation distance.

The Kolmogorov distance between two probability measures $P$ and $Q$ with CDFs $F_{P}$ and $F_{Q}$ is defined by

We note that Kolmogorov distance is, in general, weaker than total variation distance. In particular, total variation distance between two distributions is lower bounded by the Kolmogorov distance.

3 Miscellaneous Lemmata

We will use the following tools for bounding total variation distance between various random variables.

Let $X,X^{\prime}$ be two random variables over a domain $\Omega$ . Fix any (possibly randomized) function $F$ on $\Omega$ (which may be viewed as a distribution over deterministic functions on $\Omega$ ) and let $F(X)$ be the random variable such that a draw from $F(X)$ is obtained by drawing independently $x$ from $X$ and $f$ from $F$ and then outputting $f(x)$ (likewise for $F(X^{\prime})$ ). Then we have

Let $X_{1},\dots,X_{n}$ be independent random variables, with $E[X_{i}]=0,E[X_{i}^{2}]=\sigma_{i}^{2}>0,E[|X_{i}|^{3}]=\rho_{i}<\infty$ , and define $X=\sum_{i=1}^{n}X_{i},\sigma^{2}=\sum_{i=1}^{n}\sigma_{i}^{2},\rho=\sum_{i=1}^{n}\rho_{i}$ . Then for an absolute constant $C_{0}\leq 0.56$ ,

Given two $k$ -dimensional Gaussians $\mathcal{N}_{1}=\mathcal{N}(\mu_{1},\Sigma_{1}),\mathcal{N}_{2}=\mathcal{N}(\mu_{2},\Sigma_{2})$ such that for all $i,j\in[k]$ , $|\Sigma_{1}(i,j)-\Sigma_{2}(i,j)|\leq\alpha$ , and the minimum eigenvalue of $\Sigma_{1}$ is at least $\sigma^{2}\geq\alpha$ ,

In addition, we prove the following general purpose lemma showing that two multivariate Gaussians with spectrally-close moments are close in total variation distance. This is intended to be a multivariate version of Proposition B.4 of [DDO+13], which proves a similar statement for univariate Gaussians. The proof appears in Section A.

Suppose there exist two $k$ -dimensional Gaussians, $X\sim\mathcal{N}(\mu_{1},\Sigma_{1})$ and $Y\sim\mathcal{N}(\mu_{2},\Sigma_{2})$ , such that for all unit vectors $v$ ,

4 Results on PMDs from [DKT15]

Our work builds upon recent structural results on PMDs [DKT15]. We recall some of the key results which we will refer to in this paper.

Two key parameters used in this paper are $c=c(\varepsilon,k)=\operatorname*{poly}(\varepsilon/k)$ and $t=t(\varepsilon,k)=\operatorname*{poly}(k/\varepsilon)$ , set as $c=\left(\frac{\varepsilon^{2}}{k^{5}}\right)^{1+\delta_{c}}$ and $t=\left(\frac{k^{19}}{c\varepsilon^{6}}\right)^{1+\delta_{t}}$ , for constants $\delta_{c},\delta_{t}>0$ .

The main tool from this paper we will use is the structural characterization, stating that every PMD is close to the sum of an appropriately discretized Gaussian and a “sparse” PMD.

For parameters $c$ and $t$ as described above, every $(n,k)$ -Poisson multinomial random vector is $\varepsilon$ -close to the sum of a Gaussian with a structure preserving rounding and a $(tk^{2},k)$ -Poisson multinomial random vector. For each block of the Gaussian, the minimum non-zero eigenvalue of $\Sigma_{i}$ is at least $\frac{tc}{2k^{4}}$ .

Finally, we will also use their rounding procedure, which relates a PMD to a nearby PMD with all parameters either equal to or sufficiently far from and $1$ :

A Size-Free CLT

We overview our proof of Theorem 1. Recall that the Central Limit Theorem of Valiant and Valiant, (1), has a poly-logarithmic dependence on the size parameter of the GMD. Their work raised the question whether this CLT could be made size-independent, and we resolve this conjecture by showing that it can be. This qualitative improvement comes at a quantitative loss in the polynomial dependence of the bound on the parameters $k$ and $\sigma^{2}$ .

Our CLT builds off of the structural result of [DKT15], Theorem 6, which we use as a black box. This structural result says that every $(n,k)$ -PMD is $\varepsilon$ -close to the sum of an appropriately discretized Gaussian and a $(\operatorname*{poly}(k/\varepsilon),k)$ -PMD. We note that the statement of Theorem 6 does not tell us anything about the moments of this Gaussian and sparse PMD, while our new CLT requires that the discretized Gaussian has the same moments as the original PMD. We prove this CLT in two steps. First, we show that the original PMD $X$ and the discretized Gaussian from the cover $G$ are close in total variation distance, i.e., we show that we can “drop” the sparse PMD component from Theorem 6 in the relevant approximation regime. Then, we bound the distance between the discretized Gaussian from the cover, $G$ , and a discretized Gaussian with the same mean and covariance as the original PMD, $G_{X}$ . The proof is concluded by combining these two bounds using the triangle inequality.

Next, we bound the distance between the discretized Gaussian from the cover, $G$ , and a discretized Gaussian with the same moments as the original PMD, $G_{X}$ . At this point, we know that $X$ and $G$ are close in total variation distance. By projecting both distributions in some direction and considering true Gaussians with the same moments as $X$ and $G$ , it can be shown that the first two moments are similar in this direction – otherwise, the true Gaussians would be far from each other in the Kolmogorov metric. This implies that the first two moments of $X$ and $G$ are close in every direction, as guaranteed by Proposition 8. Applying Lemma 2 tells us that bona-fide Gaussians with moments which are close in every direction are therefore close in total variation distance. The proof is concluded by applying the Data Processing inequality, which shows that the corresponding discretized Gaussians $G$ and $G_{X}$ are close as well.

We state and prove many useful lemmas in Section 3.1, which we combine to complete the proof of Theorem 1 in Section 3.2.

The following two propositions bound the Kolmogorov distance between a univariate Gaussian and the projection of a GMD or a discretized Gaussian, respectively.

Suppose that there exists an $(n,k)$ -generalized multinomial random vector $X$ , with mean vector $\mu$ and covariance matrix $\Sigma$ . Then for any unit vector $v$ ,

where $\sigma^{2}$ is the minimum eigenvalue of $\Sigma$ .

We apply the Berry-Esseen theorem (Proposition 1). Let $Y_{i}=X_{i}-E[X_{i}]$ to recenter the random variables, and we will now compare $Y=\sum_{i}Y_{i}$ with $\mathcal{N}(0,v^{T}\Sigma v)$ . We note that $v^{T}Y_{i}\in[-\sqrt{2},\sqrt{2}]$ . Letting $\sigma_{i}^{2}=\mbox{\bf Var}(v^{T}Y_{i})$ and $\rho_{i}=E\left[\left|v^{T}Y_{i}\right|^{3}\right]$ , this implies that $\rho_{i}\leq\sqrt{2}\sigma_{i}^{2}$ , and thus the Berry-Esseen bound gives

Suppose there exists a random variable $X\sim\lfloor\mathcal{N}(\mu,\Sigma)\rceil$ . Then for any unit vector $v$ ,

where $\sigma^{2}$ is the minimum eigenvalue of $\Sigma$ .

Let $Y\sim\mathcal{N}(\mu,\Sigma)$ . We first show $|v^{T}(Y-\lfloor Y\rceil)|\leq\frac{\sqrt{k}}{2}$ , which holds by Cauchy-Schwarz: $\|v\|_{2}=1$ and $\|Y-\lfloor Y\rceil\|_{2}\leq\sqrt{k}\cdot\|Y-\lfloor Y\rceil\|_{\infty}\leq\frac{\sqrt{k}}{2}$ . Thus,

because the two distributions are univariate Gaussians with the same variance (which is at least $\sigma^{2})$ and means shifted by $\sqrt{k}$ . This implies

The following proposition compares a Gaussian $X$ and an arbitrary distribution $Y$ . It shows that if $Y$ ’s variance is much smaller than $X$ ’s, then they must be far in Kolmogorov distance.

Suppose there exists a univariate Gaussian $X$ with variance $\sigma_{X}^{2}$ , and a distribution $Y$ with variance $\sigma_{Y}^{2}<\sigma_{X}^{2}$ . Then the Kolmogorov distance between $X$ and $Y$ is at least $\frac{1}{2}-\left(\frac{\sigma_{Y}}{\sigma_{X}}\right)^{2/3}$ .

We consider the event that a sample falls in an interval of width $2k$ centered at $E[Y]$ . As a certificate of a large Kolmogorov distance between $X$ and $Y$ , we show that the probability assigned to this interval is very different for $X$ versus $Y$ .

First, by Chebyshev’s inequality, we know that

where the last inequality uses the Taylor expansion of the error function.

The difference in probability assigned to this interval is at least

Setting $k=\sigma_{Y}^{2/3}\sigma_{X}^{1/3}$ gives

The following proposition tells us if we are considering the sum of two random variables, one being a Gaussian with a large variance and one being an arbitrary distribution with a small support, we can remove all contribution from the distribution with small support and not pay a large cost in total variation distance.

We start by applying a law of total probability for total variation distance:

Using the data processing inequality for total variation distance (Lemma 1):

The next proposition tells us that Kolmogorov closeness implies parameter closeness for univariate Gaussians.

Without loss of generality, assume $\mu_{1}\leq\mu_{2}$ . We will first show the conclusion assuming the means are separated, and then assuming the variances are separated.

Our final proposition in this section applies the previous proposition, showing that total variation closeness implies parameter closeness (in any projection) when considering a GMD and a discretized Gaussian.

Consider the projections of $X$ and $Y$ onto $v$ . By Propositions 3 and 4 and the triangle inequality, the Kolmogorov distance between the univariate Gaussians with the same mean and variance is at most $\alpha+\frac{2\sqrt{k}}{\sigma}$ . Applying Proposition 7 implies the desired result. ∎

2 Proof of Theorem 1

We will prove the statement for a sufficiently large constant $C$ . Thus we only need examine the case

otherwise the conclusion of the theorem statement is vacuous since total variation distance is at most $1$ .

As a starting point, we convert from a GMD to the corresponding $(n,k)$ -Poisson multinomial random vector $X$ and apply Theorem 6 with $\varepsilon=\frac{k^{3}}{\sigma^{1/10}}$ . This gives us that

where $G$ is a Gaussian with a structure preserving rounding and $P$ is a $(tk^{2},k)$ -Poisson multinomial random vector. By the definition of $t$ in Section 2.4, we have that $t\leq\frac{C^{\prime}\sigma^{9/10}}{k^{2}}$ for some constant $C^{\prime}$ . Thus, $P$ is a $(C^{\prime}\sigma^{9/10},k)$ -Poisson multinomial random vector.

Now, applying Proposition 6 and the triangle inequality, we get

where $\sigma_{v}^{2}=\min\{v^{T}\Sigma v,v^{T}\Sigma_{D}v\}$ .

We use the Data Processing Inequality (Lemma 1) followed by Lemma 2 with these guarantees to give:

Finally, applying the triangle inequality with Equation (7) gives

Choosing the constant $C$ sufficiently large completes the proof.

A PTAS for Anonymous Games

Indeed, if the above guarantee held, then the expected payoff of every player $i$ from any pure strategy $\sigma$ would not change by more than an additive $O(\varepsilon)$ if we changed the strategies of all other players from $(X_{j})_{j\neq i}$ to $(Y_{j})_{j\neq i}$ . So, if $V_{X}$ were a Nash equilibrium and ${\rm support}(Y_{i})\subseteq{\rm support}(X_{i})$ , it would follow that $Y_{i}$ is an approximate best response of player $i$ to $(Y_{j})_{j\neq i}$ . So $V_{Y}$ would be an approximate equilibrium.

Unfortunately, we do not know how to construct a proper $\varepsilon$ -cover $S_{\varepsilon}$ of all $(n,k)$ -PMDs that has size of order (3) and such that for any $V_{X}$ there exists some $V_{Y}\in S_{\varepsilon}$ satisfying Condition (8). Nevertheless, we can exploit our CLT and the structural result of [DKT15] (restated as Theorem 6 in this paper) to bypass this difficulty. Roughly speaking [DKT15] approximate a given $V_{X}=(X_{1},\ldots,X_{n})$ by first discretizing the parameters of all $X_{i}$ ’s into fine enough accuracy (this is shown to only cost some $O(\varepsilon)$ in total variation distance), then partitioning the $X_{i}$ ’s into a small group ${\cal L}$ of size ${\rm poly}(k/\varepsilon)$ that are left intact, and a large group whose sum is approximated by a discretized multidimensional Gaussian (up to another cost of $O(\varepsilon)$ in total variation distance). It is further shown that the distribution of the sum of variables in ${\cal L}$ can be summarized through the vector $\vec{m}$ of its first $O(\log{1/\varepsilon})$ moments (at a loss of an additional $O(\varepsilon)$ in total variation distance), while the discretized Gaussian through its first two moments $(\mu,\Sigma)$ . Moreover, it is shown that the Gaussian has at least ${\rm poly}(k/\varepsilon)$ variance in all directions where it has non-zero variance. By enumerating over all possible summary statistics $(\vec{m},\mu,\Sigma)$ , a non-proper cover of all $(n,k)$ -PMDs can be obtained, whose size is of the order of (3).

Suppose now that $V_{X}=(X_{1},\ldots,X_{n})$ is a Nash equilibrium whose approximating statistic in the non-proper cover is some $(\vec{m},\mu,\Sigma)$ . Given a correct guess for this statistic, our goal is to uncover an approximate Nash equilibrium $V_{Y}=(Y_{1},\ldots,Y_{n})$ of the game. By the construction of the cover, we know that every player $i$ either contributed his discretized $X_{i}$ to the discretized Gaussian with parameters $(\mu,\Sigma)$ , or to the small group of variables with moments $\vec{m}$ . So, letting ${\cal C}$ be the set of $k$ -CRVs whose parameters have the discretization accuracy used in the construction of the cover, we need to assign some $Y_{i}\in{\cal C}$ to each player $i$ such that:

There exists a ${\rm poly}(k/\varepsilon)$ -size subset ${\cal L}$ of players such that $\sum_{i\in{\cal L}}Y_{i}$ has vector of moments $\vec{m}$ , while $\sum_{i\notin{\cal L}}Y_{i}$ has first two moments $(\mu,\Sigma)$ .

For all $i$ , $Y_{i}$ is a best response to $\sum_{j\neq i}Y_{j}$ .

To find a good assignment, we first construct a compatibility graph between players and mixed strategies in ${\cal C}$ . We add an edge between some $i$ and some $Y_{i}\in{\cal C}$ iff at least one of the following two conditions is met. We also annotate the edge with all conditions that are met:

( $Y_{i}$ is compatible with $i\in{\cal L}$ ): $Y_{i}$ is an approximate best response to the “environment” $i$ would observe if $i$ contributed to $\vec{m}$ . If $i$ contributed to $\vec{m}$ and Condition (a) were met, then we can deduce what PMD player $i$ would see in his environment. Indeed, this would be within some $O(\varepsilon)$ in total variation distance to a the sum of a Gaussian random vector with parameters $(\mu,\Sigma)$ and a PMD whose first $O(\log(1/\varepsilon))$ moments are the same as $\vec{m}$ after removing the contribution of $Y_{i}$ . The updated moment vector can be computed from $\vec{m}$ and $Y_{i}$ as moments are symmetric polynomials of the underlying parameters. Given the updated moment vector, the PMD is determined to within $\varepsilon$ in total variation distance, so its sum with the discretized Gaussian is also determined, and we can also efficiently determine whether $Y_{i}$ is an approximate best response of player $i$ to that distribution.

( $Y_{i}$ is compatible with $i\in\bar{{\cal L}}$ ): $Y_{i}$ is an approximate best response to the “environment” $i$ would observe if $i$ contributed to the discretized Gaussian with parameters $(\mu,\Sigma)$ . First, for this to be the case $Y_{i}$ must be “compatible” with $\Sigma$ , i.e. not correlating uncorrelated pairs of dimensions/adding variance in zero-variance dimensions (or in other words, the block structure of $\Sigma$ should be preserved). Moreover, since all non-zero eigenvalues of $\Sigma$ are at least ${\rm poly}(k/\varepsilon)$ -large, the discretized Gaussian with parameters $(\mu,\Sigma)$ and $(\mu-{\bf E}[Y_{i}],\Sigma-{\rm\bf cov}(Y_{i}))$ are approximately the same(Proposition 2). At the same time, due to the largeness of the non-zero eigenvalues of $\Sigma$ , if condition (a) were eventually true, then our CLT (Theorem 1) would imply that $\sum_{j\in\bar{{\cal L}}\setminus\{i\}}Y_{j}$ is well-approximated by the discretized Gaussian with parameters $(\mu-{\bf E}[Y_{i}],\Sigma-{\rm\bf cov}(Y_{i}))$ , and hence by that with parameters $(\mu,\Sigma)$ . So, if $i\in\bar{\cal L}$ , $i$ is assigned $Y_{i}$ , and Condition (a) is eventually met, then the PMD that player $i$ sees in his environment is pinned down to within $O(\varepsilon)$ in total variation distance: it is approximately the sum of the discretized Gaussian with parameters $(\mu,\Sigma)$ and a PMD with moments $\vec{m}$ . We can therefore check if $Y_{i}$ is an approximate best response to that distribution.

After constructing the compatibility graph as above, we need to see if there is an assignment of players to compatible mixed strategies from $\cal C$ so that (a) is satisfied. This looks non-trivial, but it can be done using dynamic programming. We sweep through the players, maintaining as state all possible leftover moments $(\vec{m}^{\prime},\mu^{\prime},\Sigma^{\prime})$ that may arise from assignments of a prefix of players to compatible mixed strategies. Given the discretization of $\cal C$ , the set of possible states is bounded by (3). Importantly, the compatibility graph has the property that player $i$ is happy when given a compatible strategy as long as the overall assignment matches $(\vec{m},\mu,\Sigma)$ .

A mixed strategy profile $\rho$ is a set of $n$ distributions $\{\rho_{i}\in\Delta^{k}\}_{i\in[n]}$ , where by $\Delta^{k}$ we denote the $(k-1)$ -dimensional simplex, or, equivalently, the set of distributions over $[k]$ . A mixed strategy profile $\rho$ is an $\varepsilon$ -approximately well supported Nash equilibrium (or an $\varepsilon$ -Nash equilibrium, for short) if, for all $i\in[n]$ and $j\in[k]$ ,

where $\rho_{-i}$ is the distribution over $\Pi^{k}_{n-1}$ obtained by drawing $n-1$ random samples from $[k]$ independently according to the distributions $\rho_{i^{\prime}},i^{\prime}\neq i$ , and forming the induced partition. We note that this an $\varepsilon$ -Nash equilibrium is stronger than the related concept of an $\varepsilon$ -approximate Nash equilibrium (see, i.e., [DGP09] for further discussion of this distinction). Throughout this paper, we solely consider the harder problem of computing an $\varepsilon$ -Nash equilibrium.

A -Nash equilibrium is simply called a Nash equilibrium and it is always guaranteed to exist by Nash’s theorem.

2 An Algorithm for Anonymous Games

In a Nash equilibrium $\rho$ of an anonymous game every player uses a mixed strategy $\rho_{i}$ selecting strategy $j$ with probability $\rho_{ij}$ . The distribution of the number of players which select each of the strategies is an $(n,k)$ -PMD. Using the fact that there exist small size $\varepsilon$ -covers for PMDs, we can efficiently search over the space of all strategies and identify a mixed strategy profile that produces an $\varepsilon$ -Nash equilibrium. We show that there exists an efficient polynomial time approximation scheme (EPTAS) for computing an $\varepsilon$ -Nash equilibrium, thus proving Theorem 2.

The algorithm works by guessing an aggregate statistic $(m,\mu,\Sigma)$ that describes the overall behavior of all players. This statistic is based on the structural theorem shown in [DKT15], which shows that the overall PMD that describes the mixed strategy profile can be approximately written as sum of a discretized Gaussian and a sparse PMD with only $\operatorname*{poly}(k/\varepsilon)$ components. Moreover, for the sparse PMD knowledge of the $\log(1/\varepsilon)$ moments (which is equivalent to knowing the power-sums of all the summands up to $\operatorname*{poly}(1/\varepsilon)$ , suffices to describe it within $\varepsilon$ in total variation distance. Thus, the algorithm requires guessing the power-sums $m$ of the sparse PMD and the mean $\mu$ and covariance $\Sigma$ of the discretized Gaussian.

As we will show, knowledge of an individual’s strategy together with the aggregate statistic $(m,\mu,\Sigma)$ for the overall mixed strategy profile, allows us to compute an approximate distribution $D_{i}$ that discribes the player’s view about the aggregate strategy of everyone else. If we manage to assign strategies $\rho_{i}$ to every player so that $\rho_{-i}$ approximately matched $D_{i}$ and additionally each player only chooses strategies that corresponds to approximate best responses with respect to his view $D_{i}$ we will obtain an $\varepsilon$ -Nash equilibrium. The following lemma formalizes this intuition and is the main tool we use in the proof of Theorem 2.

For all $i\in[n]$ and $j\in[k]$ , $E_{x\sim D_{i}}[u^{i}_{j}(x)]<\max_{j^{\prime}\in[k]}E_{x\sim D_{i}}[u^{i}_{j^{\prime}}(x)]-\varepsilon_{2}\Rightarrow\rho_{ij}=0,$

Then, $\rho$ is an $(2\varepsilon_{1}+\varepsilon_{2})$ -Nash equilibrium for the game $G$ .

Consider the game $G=(n,k,\{u^{i}_{j}\})$ . By Nash’s theorem there always exists a Nash equilibrium. Let $\rho$ be such an equilibrium where every player uses a mixed strategy $\rho_{i}$ selecting strategy $j$ with probability $\rho_{ij}$ . The distribution of vectors which give the number of players which select each of the strategies is an $(n,k)$ -PMD.

To get an efficient algorithm, we need to search over a restricted set of strategies for each player. To be able to do that we must show that an $\varepsilon$ -Nash equilibrium exists in a more restricted space. To argue that, we begin by a Nash equilibrium $\rho$ and perform a series of operations that maintain the property that the resulting mixed strategy profile is an $\varepsilon$ -equilibrium.

We first proceed by rounding the probabilities $\rho_{ij}$ so that they are either or at least $c$ as done in Lemma 3. This gives a PMD $\rho^{(1)}$ that is $O(c^{1/2}k^{5/2}\log^{1/2}(1/ck))$ -close in total variation distance to $\rho$ . Moreover, if we consider the PMD $\rho^{(1)}_{-i}$ , which is the $(n-1,k)$ -PMD obtained by removing the $i$ -th component from the rounded PMD $\rho^{(1)}$ , this is also $O(c^{1/2}k^{5/2}\log^{1/2}(1/ck))$ -close in total variation to $\rho_{-i}$ , i.e. the PMD obtained after removing the $i$ -th component from the original PMD $\rho$ . The proof of this statement is almost identical to the proof in [DKT15] and is omitted. That proof uses Poisson approximations to bound the total variation between the rounded and the unrounded PMDs and uses the fact that the means of the two PMDs can differ by at most $c$ in each coordinate. The only difference is that here, the means of the two PMDs can differ by at most $2c$ in each coordinate which results in the same asymptotic bound for total variation distance. Moreover, note the rounding procedure doesn’t change any probabilities that were originally 0, i.e. $\rho_{ij}=0\Rightarrow\rho^{(1)}_{ij}=0$ .

By the structural theorem of [DKT15], the components $\rho^{(2)}_{i}$ of the PMD $\rho^{(2)}$ can be partitioned into two PMDs:

For every player $i$ in the sparse PMD, his view about the aggregate strategy of the others is approximately the same as if the large PMD was replaced by the Gaussian, i.e.

With those observations in hand, we proceed to give the algorithm for computing $\varepsilon$ -equilibria for anonymous games. To do this, we first guess the mean $\mu$ and covariance $\Sigma$ of the Gaussian component as well as all the power-sums $m$ of the sparse PMD. We then try to construct CRVs for every player so that the overall mean and covariance as well as the power-sums match those that we guessed and moreover every player’s CRV assigns positive probability mass only to approximately optimal strategies. If we are able to do so, Lemma 4 implies that this gives an approximate Nash equilibrium. In more detail, the algorithm performs the following steps:

Guess the mean and covariance of the Gaussian component and the power sums of the sparse PMD. For every guess, we repeat the next steps until a feasible solution is found.

We need to guess the powersums for $(4ek^{3})^{k}$ different PMDs since CRVs are first clustered according to their value in every coordinate. Since the parameters of the sparse PMD are all multiples of $\lceil\frac{tk^{3}}{\varepsilon}\rceil^{-1}$ , this results in at most $2^{k^{5k}\log^{k+2}\left(\frac{1}{\varepsilon}\right)}$ distinct power-sum vectors in totalThis upper bound was derived in [DKT15]. For the gaussian component all entries of the mean and covariance are multiples of $\lceil\frac{nk}{\varepsilon}\rceil$ which requires $\lceil\frac{nk}{\varepsilon}\rceil^{O(k^{2})}$ guesses in total.

For every player, we need to compute the contribution of his mixed strategy (CRV) to the overall distribution. If that player is to be assigned in the sparse component, its probabilities are all multiples of $\lceil\frac{tk^{3}}{\varepsilon}\rceil^{-1}$ and we can compute its contribution to the power-sums $m$ . Similarly, if that player is to be assigned in the gaussian component its probabilities are all multiples of $\lceil\frac{nk}{\varepsilon}\rceil^{-1}$ and we can easily compute its contribution to the mean and covariance.

However, not all assignments are feasible. We need to consider only CRVs for that player that assign positive probability mass to coordinates that are approximately best responses to the strategy of other players. Even though we don’t know the strategies of the others exactly, we can compute a good approximate description of the players view by subtracting from the power sums $m$ the players contribution (if any) and computing any $PMD$ that matches those power-sums. Similarly, if the player is mapped to the gaussian component we subtract the players mean and covariance from the overall mean $\mu$ and covariance $\Sigma$ and compute a discretized Gaussian with the resulting mean and covariance instead. We say that an assignment of a player to a component (sparse or Gaussian) and a specific distribution over strategies is feasible if it approximately maximizes the player’s utility $u$ with respect to his approximate view about the strategies of others.

To find if there exists a set of feasible strategies that matches the guessed statistic $(m,\mu,\Sigma)$ , we use dynamic programming. The states of our dynamic program are the following: For any prefix of players, we keep track of the remaining power-sums, mean and covariance we need to account for. We iteratively process players one by one keeping track of which states are reachable. Our estimation is feasible if after processing all players we have accounted for all the power-sums, mean and covariance in our original guess. If we find such a solution, we output the assignment of players to mixed strategies that resulted in this solution.

Applying Lemma 4 directly shows that this is indeed an $O(\varepsilon)$ -Nash equilibrium.

The total runtime of the algorithm is polynomial on the number of states of the above dynamic program. Since there are $\lceil\frac{nk}{\varepsilon}\rceil^{O(k^{2})}$ Gaussian parameters in total as well as $2^{k^{5k}\log^{k+2}\left(\frac{1}{\varepsilon}\right)}$ power sums in total, the overall runtime is $n^{O(k^{2})}2^{\operatorname*{poly}(k,\log(1/\varepsilon))^{k}}$ and the theorem follows. ∎

On the road to getting the proper cover described by Theorem 3, we first show Theorem 7. This constructs a non-proper cover of the same size. The main theorem of this section is the following:

Moreover, we can efficiently enumerate this cover in time polynomial in its size.

This theorem should be contrasted with Theorem 3, which provides a proper cover of similar size. It should also be contrasted to Theorem 2 of [DKT15], which provides a cover with a leading factor of $n^{k^{2}}$ , so the cover presented here improves the exponent of $n$ from quadratic to linear in the dimension. This is the correct order of exponential dependence on $k$ , as simply counting the number of $(n,k)$ -PMDs with deterministic summands gives a lower bound of $n^{\Omega(k)}$ . We also show in Section 7 that the quasi-polynomial dependence on $1/\varepsilon$ with an exponent of $\Omega(k)$ cannot be avoided, as we provide an essentially matching lower bound on the cover size.

The starting point for our cover will be Theorem 6, stating that every $(n,k)$ -PMD is $\varepsilon$ -close to the sum of an appropriately discretized Gaussian and a $(\operatorname*{poly}(k/\varepsilon),k)$ -PMD. We generate an $\varepsilon/2$ -cover for each and combine them by triangle inequality.

We cover the sparse PMD component using the same methods as in [DKT15]. The first, naive way of covering this component involves gridding over all $\operatorname*{poly}(k/\varepsilon)$ parameters with $\operatorname*{poly}(\varepsilon/k)$ granularity. This results in a cover size of $2^{\operatorname*{poly}(k/\varepsilon)}$ .

The more sophisticated way of covering this component uses a “moment matching” technique. A result by Roos [Roo02] shows that the probability mass function can be written as the weighted sum of partial derivatives of a standard multinomial distribution. When analyzed carefully, his result implies that the lower order moments of the distribution are sufficient to characterize the PMD. In other words, any two PMDs with identical “moment profiles” (which describe these lower order moments) are close in total variation distance, and it suffices to keep only one representative for each moment profile. This method results in a cover of size $2^{O(k^{5k}\log^{k+2}(1/\varepsilon))}$ . Combining this with the other approach gives a cover of size

For more details, see the proof of Theorem 2 of [DKT15].

To cover the Gaussian component, [DKT15] grid over all $O(k^{2})$ parameters of the Gaussian component, arguing the effectiveness of the gridding using Proposition 2. This gridding results in the leading factor of $n^{O(k^{2})}$ in the size of the cover. In contrast, we use a spectral covering approach: instead of trying to grid over all parameters of the covariance matrix, we first sparsify it and then match the magnitude of its projection in every direction. In particular, we establish a cover of the following nature:

Let $\mathcal{G}_{n,k,\varepsilon}$ be the set of all Gaussians with structure preserving roundings which may arise as a consequence of Theorem 6 when applied to $(n,k)$ -Poisson multinomial random vectors with parameter $\varepsilon$ . Then there exists a set $\mathcal{S}$ of Gaussians with structure preserving roundings of size at most $n^{O(k)}\cdot\left(\frac{k}{\varepsilon}\right)^{O(k^{3})}$ with the following properties:

For any $G\in\mathcal{G}_{n,k,\varepsilon}$ , there exists a $\hat{G}\in\mathcal{S}$ , such that $G$ and $\hat{G}$ have the same block structure (i.e., the partition of coordinates), and within each block, have the same pivot coordinate and sum for the mean vector coordinates. Furthermore, for each block $i$ , letting $(\mu_{i},\Sigma_{i})$ and $(\hat{\mu}_{i},\hat{\Sigma}_{i})$ be the mean and covariance for the block (excluding the pivot coordinate), we have that for all unit vectors $v$ ,

$|v^{T}(\mu_{i}-\hat{\mu}_{i})|\leq\frac{\varepsilon\sigma_{iv}}{k}$ ;

$|v^{T}(\Sigma_{i}-\hat{\Sigma}_{i})v|\leq\frac{\varepsilon\sigma_{iv}^{2}}{2k^{3/2}}$ ;

where $\sigma_{iv}^{2}=\max\{v^{T}\Sigma_{i}v,v^{T}\hat{\Sigma}_{i}v\}$ .

This lemma statement is slightly technical due to the nature of the Gaussians with structure preserving roundings. It essentially says that we cover the set of Gaussians arising from the structural theorem by matching their block structure exactly, and within each block, matching the moments spectrally. Plugging these guarantees into Lemma 2 and applying the data processing inequality for total variation distance (Lemma 1) gives the desired closeness.

For simplicity of exposition, for the remainder of this overview section, we assume that the Gaussian’s structure preserving rounding consists of a single block, an assumption we do not make in the full proof (described in Section 5.1). By the guarantees of the structural result, in this case, the minimum eigenvalue of the covariance matrix is at least some ${\rm poly}(k/\varepsilon)$ . So the goal of our exposition in this section is to produce a cover of Gaussians that may result from Theorem 6 and whose covariance matrices have minimum eigenvalue at least ${\rm poly}(k/\varepsilon)$ .

Since the mean vector only has $k$ parameters, we can grid over the entries. Though we require a spectral guarantee, this naive gridding is sufficient. This gives a set of size $\left(\frac{nk}{\varepsilon}\right)^{O(k)}$ , such that, for any Gaussian which may arise from Theorem 6, its mean vector is approximated by a mean vector in our set with the approximation guarantees required by Lemma 2.

Covering the covariance matrix takes more care. At a high level, our approach views PMDs through the lens of spectral graph theory and exploits the existence of spectral sparsifiers. Recall the definition of the Laplacian matrix of a graph:

Given an undirected weighted graph $G=(V,E,w)$ on $n$ vertices, its Laplacian matrix is an $n\times n$ matrix $L_{G}$ where

To see the connection to PMDs, we observe that the covariance matrix of a PMD is the Laplacian matrix of a graph defined by the parameters. For a single $k$ -CRV $X$ with parameter vector $\pi$ , it can be shown that the variance of $X_{i}$ is $\pi(i)(1-\pi(i))$ and the covariance of $X_{i}$ and $X_{j}$ is $-\pi(i)\pi(j)$ . Since $\sum_{i=1}^{k}\pi(i)=1$ , the covariance matrix is equal to the Laplacian matrix of a graph on $k$ nodes with $w(i,j)=\pi(i)\pi(j)$ . This can be extended to $(n,k)$ -PMDs by observing that the sum of random variables has a covariance matrix equal to the sum of the individual covariance matrices, and a similar statement holds for graphs and the corresponding Laplacian matrices. We summarize this connection in the following observation:

At the core of our approach, we use the following celebrated result of Batson, Spielman, and Srivastava [BSS12], which says that the Laplacian matrix of a graph on $k$ vertices can be spectrally approximated by the Laplacian matrix of a graph with only $O(k)$ edges:

where $L_{G}$ is the Laplacian matrix of the graph $G$ .

Using this tool, the approach will proceed as follows. This theorem implies that, for every true covariance matrix $\Sigma$ , there exists a matrix $M_{1}$ with only $O(k)$ entries which preserves every projection up to a multiplicative factor of $1/5$ . We can obtain a matrix $M_{2}$ with the same sparsity pattern as $M_{1}$ by guessing which subset of $O(k)$ entries is non-zero, requiring $\exp(k\cdot\log{k})$ guesses. Furthermore, we can grid over the non-zero entries of $M_{2}$ to ensure that it approximates every projection of $M_{1}$ up to a multiplicative factor of $1/25$ . Since $M_{1}$ has minimum eigenvalue ${\rm poly}(k/\varepsilon)$ and maximum entry $O(n)$ , gridding requires only $\left({n\cdot k\over\varepsilon}\right)^{O(k)}$ guesses, and we get that $M_{2}$ gives a $1/4$ multiplicative spectral approximation to $\Sigma$ . To make our approximation finer, we will $O(\varepsilon/\sqrt{k})$ -cover the set of PSD matrices within a $1/4$ -neighborhood of $M_{2}$ . We first recall the definition of a cover in this context:

Let S be a set of symmetric $k\times k$ PSD matrices. An $\varepsilon$ -cover of the set $S$ , denoted by $S_{\varepsilon}$ , is a set of PSD matrices such that for any matrix $A\in S$ , there exists a matrix $B\in S_{\varepsilon}$ such that for all vectors $y$ : $|y^{T}(A-B)y|\leq\varepsilon y^{T}Ay$ .

Now, if we could $O(\varepsilon/\sqrt{k})$ -cover the set of all matrices $1/4$ -close to $M_{2}$ , we would obtain an $O(\varepsilon/\sqrt{k})$ -approximation to $\Sigma$ . We do so using the following lemma, which provides a method to generate such a cover. A slight generalization of this statement appeared as Lemma 9 in [DKT15], but we give a slightly simpler proof in Section 5.2 for completeness.

Let $A$ be a symmetric $k\times k$ PSD matrix with minimum eigenvalue at least $1$ and let $S$ be the set of all matrices $B$ such that $|y^{T}(A-B)y|\leq\varepsilon_{1}y^{T}Ay$ for all vectors $y$ , where $\varepsilon_{1}\in[0,1/4]$ . Then, there exists an $\varepsilon$ -cover $S_{\varepsilon}$ of $S$ that has size $|S_{\varepsilon}|\leq\left(\frac{k}{\varepsilon}\right)^{O(k^{2})}$ .

Combining the above, we obtain a set of covariance matrices of size $n^{O(k)}\cdot\left(\frac{1}{\varepsilon}\right)^{\operatorname*{poly}(k)}$ such that, for any Gaussian which may arise in Theorem 6, its covariance matrix is approximated by a covariance matrix in our cover as required by Lemma 2.

Combining the guarantees obtained for the mean and the covariance matrix, we find that they satisfy both conditions of Lemma 2. Therefore, we have described a cover of size $n^{O(k)}\cdot\left(\frac{1}{\varepsilon}\right)^{\operatorname*{poly}(k)}$ for all possible Gaussian components. The proof of Theorem 7 is completed by taking the Cartesian product of this Gaussian cover with the cover for the $(\operatorname*{poly}(k/\varepsilon),k)$ -PMD component.

For more details on covering the Gaussian component, see Section 5.1.

1 Details on Covering the Gaussian Component

Recall that the Gaussian component will have a structure preserving rounding. The first step in designing our cover will be to guess the partitioning into blocks. There are $k$ dimensions, resulting in at most $k!$ different block structures. In what follows, we will describe how to cover a single block up to accuracy $O(\frac{\varepsilon}{k})$ , taking the Cartesian product of the resulting sets will give an $O(\varepsilon)$ -cover of the entire Gaussian at the additional cost of $k$ in the exponent.

For a single block which consists of dimensions $S_{i}$ , we must first guess the size parameter $n_{i}$ and which dimension is to be used as the pivot. The former is an integer between and $n$ , and guessing it comes at a cost of $n$ in our cover size. Guessing the latter comes at a $|S_{i}|$ cost in our cover size.

Recall that our strategy will be to spectrally match the parameters of the true Gaussian. We will conclude the two distributions are close using the guarantees provided by Lemma 2. We describe how to obtain such guarantees for both the mean and covariance matrix separately.

1.2 Covering the Covariance Matrix of a Block

We will use the characterization provided by Observation 1, which tells us that the covariance matrix of an $(n,k)$ -PMD is the Laplacian matrix of a graph defined by the parameters of the distribution. Recall from the proof of Theorem 6 (which appears in [DKT15]), the covariance matrix of the Gaussian we are attempting to match is also the covariance matrix of an $(n_{i},|S_{i}|)$ -Generalized Multinomial Distribution. For the remainder of this proof, we let $G$ be the graph defined by this characterization for the covariance matrix of the corresponding $(n_{i},|S_{i}|)$ -Poisson Multinomial Distribution.

As a starting point, we use Theorem 8, which shows the existence of spectral sparsifiers. In particular it implies that, if given $G$ on $|S_{i}|$ nodes and we want a subgraph $H$ such that $(1-1/5)L_{G}\preceq L_{H}\preceq(1+1/5)L_{G}$ , there exists an $H$ with at most $110|S_{i}|$ edges which gives this approximation. The first step in covering the covariance matrix is to guess which edges are present in the graph. Since there are $\binom{|S_{i}|}{2}$ possible edges in the graph, this requires at most

Further, recall that our structural result implies that $\frac{1}{25}v^{T}L_{H}v\geq\frac{tc}{100k^{4}}$ , so it suffices to obtain a graph $M$ such that

For a unit vector $v$ and $|S_{i}|\times|S_{i}|$ PSD matrices $A$ and $B$ ,

Suppose we guess the edge weights of $M$ such that they are at most $\frac{tc}{100k^{7}}$ away from those of $H$ . This tells us $\max_{i\neq j}|L_{H}(i,j)-L_{M}(i,j)|\leq\frac{tc}{100k^{7}}$ , and since the diagonal entries of $L_{M}$ are the sums of the off-diagonal entries, $\max_{i}|L_{H}(i,i)-L_{M}(i,i)|\leq\frac{tc}{100k^{6}}$ . This implies that it suffices to additively estimate the edge weights up to accuracy $\frac{tc}{100k^{6}}$ . Since the maximum entry of $L_{G}$ is at most $n_{i}$ , the spectral guarantee implies that the maximum entry of $L_{H}$ is at most $\frac{6n_{i}}{5}$ , and similarly, the maximum edge weight. Therefore, gridding over all $110|S_{i}|$ non-zero edge weights, we define a set with at most

At this point, we have a PSD matrix $L_{M}$ which, when projected onto the subspace orthogonal to $e_{1}$ , is $1/4$ -spectrally close to the target covariance matrix. We wish to $\frac{\varepsilon}{2k^{3/2}}$ -cover the space of all PSD matrices which are $1/4$ -spectrally close to this matrix. We will use Lemma 6, which we instantiate with parameter “ $\varepsilon$ ” set to $\frac{\varepsilon}{2k^{3/2}}$ , allowing us to generate a $\frac{\varepsilon}{2k^{3/2}}$ -cover of a $\frac{1}{4}$ -neighborhood of a given PSD matrix with $\left(\frac{k}{\varepsilon}\right)^{O(k^{2})}$ candidates. Since we knew one of the previous candidates was $\frac{1}{4}$ -close to the target, this gives us a matrix which satisfies the second condition of Lemma 2 to approximate this block up to $\frac{\varepsilon}{k}$ accuracy. The size of this cover is at most

1.3 Putting the Guarantees Together

At this point, to cover a single block up to accuracy $O(\varepsilon/k)$ , we have a set of size at most

Taking the Cartesian product of sets and multiplying by the number of guesses for the block structure of the Gaussian, we get an overall cover of size

Combining with the cover for the $(\operatorname*{poly}(k/\varepsilon),k)$ -PMD component, we obtain an overall cover for $(n,k)$ -PMDs of size

2 Proof of Lemma 6

To construct the cover, we will make use of the eigenvalues and eigenvectors of the matrix $A$ . We first show that for any matrix $B\in S$ , its eigenvalues are close to the eigenvalues of $A$ .

Let $A,B$ be two symmetric $k\times k$ PSD matrices such that for all vectors $y$ with $\|y\|=1$ , $|y^{T}(A-B)y|\leq\varepsilon_{1}y^{T}Ay$ for some constant $\varepsilon_{1}>0$ . Then for the eigenvalues $\lambda^{A}_{1}\leq...\leq\lambda^{A}_{k}$ of $A$ , and the eigenvalues $\lambda^{B}_{1}\leq...\leq\lambda^{B}_{k}$ of $B$ , it holds that:

From Courant’s minimax principle, we have that the $i$ -th eigenvalue of $A$ is equal to:

where $C$ is an $(i-1)\times k$ matrix. For the matrix $B$ , we have that

Similarly, we have that $\lambda^{B}_{i}\geq(1-\varepsilon_{1})\lambda_{i}^{A}$ , so the result follows. ∎

By computing the eigenvalues $\mu_{1}\leq\dots\leq\mu_{k}$ of $A$ , we have estimates of the eigenvalues $\lambda_{1},\dots,\lambda_{k}$ of $B$ within a multiplicative factor of $1\pm 2\varepsilon_{1}$ . We can improve our estimates to a better multiplicative factor $1\pm\varepsilon$ by gridding multiplicatively around each eigenvalue. This requires another $\log_{1+\varepsilon}\left(\frac{1+2\varepsilon_{1}}{1-2\varepsilon_{1}}\right)=O(1/\varepsilon)$ guesses per eigenvalue. So in total, we require $\left(\frac{1}{\varepsilon}\right)^{O(k)}$ guesses for obtaining accurate estimates $\lambda^{\prime}_{1},\dots,\lambda^{\prime}_{k}$ of the eigenvalues of $B$ .

Once we know (approximately) the eigenvalues of $B$ , we will try to guess also its eigenvectors $v_{1},\dots,v_{k}$ . We will do this by performing a careful gridding around the eigenvectors of $A$ which we can assume, without loss of generality (by rotating), to be the standard basis vectors $e_{1},e_{2},\dots,e_{k}$ . So for each eigenvector $v_{z}$ of $B$ , we will try to approximate it by guessing its projections to the eigenvectors of $A$ .

We now bound the projections of eigenvectors of $A$ to eigenvectors of $B$ . Since we know that $e^{T}_{i}Be_{i}\leq(1+\varepsilon_{1})e^{T}_{i}Ae_{i}$ , we get that $\sum_{z}\lambda_{z}(v_{z}e_{i})^{2}\leq(1+\varepsilon_{1})\mu_{i}$ which implies that $v_{z,i}\leq\sqrt{\frac{2\mu_{i}}{\lambda_{z}}}$ . Moreover, since $\lambda_{z}\geq\max\{(1-\varepsilon_{1})\mu_{z},1\}\geq\max\{\frac{1}{2}\mu_{z},1\}$ , we know that the projection of $v_{z}$ to $e_{i}$ will be smaller than $2\sqrt{\frac{\mu_{i}}{\max\{\mu_{z},1\}}}$ . An additional bound for the projection of $v_{z}$ to $e_{i}$ can be obtained by considering the variance of the matrices $A$ and $B$ in the direction $v_{z}$ . Since we know that $v^{T}_{z}Bv_{z}\geq(1-\varepsilon_{1})v^{T}_{z}Av_{z}$ , we get that $\sum_{i}\mu_{i}(v_{z}e_{i})^{2}\leq\frac{\lambda_{z}}{1-\varepsilon_{1}}\leq 2\lambda_{z}$ which implies that $v_{z,i}\leq\sqrt{\frac{2\lambda_{z}}{\mu_{i}}}$ .

We will now show that the covariance matrix $\hat{B}$ satisfies the property that it is close in all directions to $B$ . To do this we will make use of the following lemma from [DKT15]. This roughly states that two PSD matrices spectrally approximate each other in $O(k^{2})$ particular directions, then they spectrally approximate each other in every direction.

For all $i\in[k]$ , $\Big{|}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}\Big{)}^{T}\Big{(}\hat{\Sigma}-\Sigma\Big{)}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}\Big{)}\Big{|}\leq\varepsilon$ ,

For all $i,j\in[k]$ , $\Big{|}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}+\frac{v_{j}}{\sqrt{\lambda}_{j}}\Big{)}^{T}\Big{(}\hat{\Sigma}-\Sigma\Big{)}\Big{(}\frac{v_{i}}{\sqrt{\lambda}_{i}}+\frac{v_{j}}{\sqrt{\lambda}_{j}}\Big{)}\Big{|}\leq 4\varepsilon$ .

We will only consider directions $y=\frac{v_{z}}{\sqrt{\lambda}_{z}}$ for $z\in[k]$ and $y=\frac{v_{z}}{\sqrt{\lambda}_{z}}+\frac{v_{z^{\prime}}}{\sqrt{\lambda}_{z^{\prime}}}$ for $z,z^{\prime}\in[k]$ .

We first consider direction $y=\frac{v_{z}}{\sqrt{\lambda}_{z}}$ . We have that:

The first term is in the range $[(1-\varepsilon)(1-k\varepsilon^{\prime})^{2},(1+\varepsilon)(1+k\varepsilon^{\prime})^{2}]$ , which for $\varepsilon^{\prime}\leq\varepsilon/k$ , becomes $(1\pm O(\varepsilon))$ . The rest of the terms can be bounded as follows:

for $\varepsilon^{\prime}=O(\sqrt{\frac{\varepsilon}{k^{3}}})$ . This means that $v^{T}_{z}\hat{B}v_{z}\in(1-\varepsilon,1+\varepsilon)\lambda_{z}$ . The proof is similar for directions $y=\frac{v_{z}}{\sqrt{\lambda}_{z}}+\frac{v_{z^{\prime}}}{\sqrt{\lambda}_{z^{\prime}}}$ for $z,z^{\prime}\in[k]$ .

Overall, we can get an estimate $\hat{B}$ of any matrix $B\in S$ by making at most $\left(\frac{k}{\varepsilon}\right)^{O(k^{2})}$ guesses, which implies an $\varepsilon$ -cover of this size.

A Proper Cover for PMDs

We show how to turn the non-proper cover of Section 5 into a proper one as described by Theorem 3, using Theorem 1. We note that a non-constructive proper cover follows immediately from Theorem 7, since for each element of an improper $\varepsilon/2$ -cover that lies within $\varepsilon/2$ of a PMD, we can match it with such a PMD. The resulting set of PMDs defines then a proper $\varepsilon$ -cover. Our focus in this section is to provide an efficient construction of a proper cover.

Our approach will be to enumerate the improper cover of Theorem 7 and convert each distribution to a nearby $(n,k)$ -PMD. This cover consists of distributions which are the sum of a Gaussian with a structure preserving rounding and a $(\operatorname*{poly}(k/\varepsilon),k)$ -PMD. Since the $(\operatorname*{poly}(k/\varepsilon),k)$ -PMD component is already a collection of $k$ -CRVs, this part of the cover is already proper, and it suffices to convert the Gaussian component into a nearby $(n-\operatorname*{poly}(k/\varepsilon),k)$ -PMD.

The main technical lemma we prove is the following, which states that if a discretized Gaussian $G$ is spectrally close to a GMD $\rho$ , we can obtain a new GMD $\rho^{\prime}$ which is spectrally close to $\rho$ :

Let $\lfloor\mathcal{N}(\mu,\Sigma)\rceil$ be a discretized Gaussian and suppose there exists a $(n,k)$ -GMD $\rho$ with mean $\mu^{\rho}$ and covariance $\Sigma^{\rho}$ such that for all vectors $v$ it holds that $|v^{T}(\mu-\mu^{\rho})|\leq\varepsilon_{1}\sqrt{v^{T}\Sigma v}$ and $|v^{T}(\Sigma-\Sigma^{\rho})v|\leq\varepsilon_{2}v^{T}\Sigma v$ .

Then, it is possible to compute in time $n^{O(k)}$ a $(n,k)$ -GMD $\rho^{\prime}$ with mean $\mu^{\rho^{\prime}}$ and covariance $\Sigma^{\rho^{\prime}}$ such that for all vectors $v$ it holds that $|v^{T}(\mu-\mu^{\rho^{\prime}})|\leq\varepsilon_{1}\sqrt{v^{T}\Sigma v}+3k^{2.5}\|v\|_{2}$ and $|v^{T}(\Sigma-\Sigma^{\rho^{\prime}})v|\leq\varepsilon_{2}v^{T}\Sigma v+3k^{3}\|v\|_{2}^{2}$ .

We prove this lemma using the Shapley-Folkman lemma [Sta69], which states that the Minkowski sum of a large number of sets is approximately convex:

With this lemma in hand, the proof of Lemma 8 proceeds as follows. Let $\mathcal{M}$ be the set of all possible mean and covariances for a single CRV, and $\mathcal{M}^{\oplus n}$ be the Minkowski sum of $n$ copies of $\mathcal{M}$ . Given a discretized Gaussian with mean and covariance $(\mu,\Sigma)\in\mathcal{M}^{\oplus n}$ , we would ideally like to find $\{x_{1},\dots,x_{n}\}$ such that $\sum_{i=1}^{n}x_{i}=(\mu,\Sigma)$ . However, since this set is not convex, this optimization problem is not obviously tractable. Instead, we convert $(\mu,\Sigma)$ to a spectrally close $(\hat{\mu},\hat{\Sigma})$ which lies on the convex hull of $\mathcal{M}^{\oplus n}$ , which can be done using a linear program. At this point, we exploit the “almost convex” characterization provided the Shapley-Folkman lemma, and we will iteratively “peel off” plausible CRVs. More specifically, noting that the moment profile is at most $k^{2}+k$ dimensional and applying Lemma 9, we can use a linear program to find the parameters of a single CRV such that subtracting its moments gives a moment profile which lies on the convex hull of $\mathcal{M}^{\oplus n-1}$ . We repeat $n-k^{2}-k$ times until we are left with a point on the convex hull of $\mathcal{M}^{\oplus k^{2}+k}$ , at which point we may pick the last $k^{2}+k$ CRVs arbitrarily. The proof is completed by arguing that the resulting GMD satisfies the theorem conditions. For the full proof of Lemma 8, see Section 6.1.

We convert the same block of $\rho^{\prime}$ to a discretized Gaussian in the same way, incurring the same cost. Finally, we relate the two discretized Gaussians in total variation distance. As mentioned in the previous paragraph, the means and covariances are spectrally close up to relative accuracy $\frac{2\varepsilon}{k}$ and $\frac{\varepsilon}{k^{3/2}}$ . We plug this guarantee into Lemma 2 and apply the data processing inequality (Lemma 1) to conclude that the two distributions are $O(\varepsilon/k)$ -close. The proof is concluded by setting $\varepsilon^{\prime}=\varepsilon^{10/3}/k^{25/2}$ and rescaling $\varepsilon$ by a constant factor.

We first argue that rounding all constituent probability vectors in the $(n,k)$ -GMD $\rho$ so that all their coordinates are integer multiples of $1/n$ to obtain a $(n,k)$ -GMD $\hat{\rho}$ approximately preserves the spectral closeness guarantees with the discretized Gaussian. More specifically, for all vectors $v$ it holds that:

We know that $\|\mu^{\rho}-\mu^{\hat{\rho}}\|_{\infty}\leq 1$ and thus $\|\mu^{\rho}-\mu^{\hat{\rho}}\|_{2}\leq\sqrt{k}$ , so

Similarly we have that $\|\Sigma^{\rho}-\Sigma^{\hat{\rho}}\|_{\textrm{max}}\leq 1$ which implies that $|v^{T}(\Sigma^{\rho}-\Sigma^{\hat{\rho}})v|\leq k\|v\|^{2}$ for all vectors $v$ . Thus,

At this point, we have shown that there exists a $(n,k)$ -GMD with mean and covariance close to that of the discretized Gaussian such that all its constituent probability vectors have coordinates that are integer multiples of $1/n$ . Now, for every probability vector $\vec{p}$ with probabilities that are multiples of $1/n$ , consider its moment profile $(\mu^{\vec{p}},\Sigma^{\vec{p}})$ , where $\mu^{\vec{p}}=\vec{p}$ and $\Sigma^{\vec{p}}$ are the mean and covariance of the $k$ -CRV with probabilities $\vec{p}$ . Let $\cal{M}$ be the set of all possible moment profiles generated by such probability vectors $\vec{p}$ . Since there are at most $n^{k-1}$ probability vectors $\vec{p}$ the set $\cal M$ has size at most $n^{k-1}$ . Moreover, it is easy to see that for the rounded GMD $\hat{\rho}$ , it holds that $(\mu^{\hat{\rho}},\Sigma^{\hat{\rho}})\in{\cal{M}}^{\oplus n}$ where ${\cal{M}}^{\oplus n}=\{x\ |\ \exists x_{1},...,x_{n}\in{\cal M},x=\sum_{i}x_{i}\}$ denotes the Minkowski addition of $\cal{M}$ with itself $n$ times. This is because the mean and covariance of the GMD is equal to the sum of the means and covariances of its constituent CRVs, which are all in $\cal{M}$ since each CRV has probabilities that are integer multiples of $1/n$ .

Naively searching over ${\cal{M}}^{\oplus n}$ for a GMD that satisfies the guarantees of $\hat{\rho}$ is not easy since it would require time that is exponential in $n$ . To get a computationally efficient algorithm, we search instead in the set $\textrm{conv}\left({\cal{M}}^{\oplus n}\right)=\textrm{conv}\left({\cal{M}}\right)^{\oplus n}$ where, for a set $A$ , $\textrm{conv}(A)$ denotes its convex closure, and the equality is a basic property of Minkowski sums. The reason this is easy is that it is solvable by a linear program as follows:

For $m\in{\cal{M}}$ and $i\in\{1,...,n\}$ , we assign the variables $x_{i,m}\geq 0$ that denote whether we want to pick the moment profile $m$ for the $i$ -th CRV.

For all $i$ , we need that $\sum_{m}x_{i,m}=1$ . This ensures that for all $i$ , $\sum_{m}x_{i,m}m\in\textrm{conv}\left({\cal{M}}\right)$ .

We need that the aggregate moment profile $(\hat{\mu},\hat{\Sigma})=\sum_{i,m}x_{i,m}m$ satisfies the closeness constraints with $(\mu,\Sigma)$ . For all $v$ we require that:

These are all linear constraints so a solution $(\hat{\mu},\hat{\Sigma})=\sum_{i,m}x_{i,m}m$ , can be computed by solving the linear program using the Ellipsoid method. Note that the constraints of the third bullet are infinitely many but can be verified efficiently using a separation oracle. To check the first set of constraints, we can check whether the optimization problem

has a negative solution. This is a convex optimization problem which can be solved in polynomial time. To check the second set of constraints, we note that $\varepsilon_{2}v^{T}\Sigma v+k\|v\|_{2}^{2}=v^{T}(\varepsilon_{2}\Sigma+kI)v$ . By setting $A\triangleq(\varepsilon_{2}\Sigma+kI)$ and $u\triangleq A^{1/2}v$ , we can rewrite the constraints as:

This is equivalent to checking whether the maximum eigenvalue of the matrix $\left(A^{-1/2}\right)^{T}(\Sigma-\hat{\Sigma})A^{-1/2}$ is greater than $1$ .

For any $m,m^{\prime}\in\textrm{conv}\left({\cal{M}}\right)$ , it holds that $\|m-m^{\prime}\|_{\infty}\leq 1$ . Moreover, $(\mu^{\rho^{\prime}},\Sigma^{\rho^{\prime}})=\sum_{i=1}^{n}m_{i}$ and $(\hat{\mu},\hat{\Sigma})=\sum_{i=1}^{(n-k^{2}-k)}m_{i}+\sum_{i=1}^{k^{2}+k}m^{\prime}_{i}$ . This implies that $\|\mu^{\rho^{\prime}}-\hat{\mu}\|_{\infty}\leq k^{2}+k$ and $\|\Sigma^{\rho^{\prime}}-\hat{\Sigma}\|_{\max}\leq k^{2}+k$ . We have that:

Similarly, $|v^{T}(\Sigma-\Sigma^{\rho^{\prime}})v|\leq|v^{T}(\Sigma-\hat{\Sigma})v|+|v^{T}(\hat{\Sigma}-\Sigma^{\rho^{\prime}})v|\leq\varepsilon_{2}v^{T}\Sigma v+(k^{3}+k^{2}+k^{1})\|v\|_{2}^{2}.$

A Lower Bound for Covers of PMDs

In this section, we discuss Theorem 4, the lower bound on the size of any $\varepsilon$ -cover of $(n,k)$ PMDs. This theorem shows that it is not possible to get significant improvement on the cover size obtained in Theorem 3. In particular, the dependence of the size of the cover on $1/\varepsilon$ is tight up to a difference of $3$ in the exponent of $\log(1/\varepsilon)$ .

It turns out that it is easy to prove a dependence of $O(n^{k})$ on the size of any $\varepsilon$ -cover and most of the work is involved in showing a lower bound of $T(k,\varepsilon)=2^{\log^{k-1}(1/\varepsilon)}$ on the cover size. Thus, in this overview we only focus on the machinery required to show the lower bound of $T(k,\varepsilon)$ on the $\varepsilon$ -cover size. We remark that prior to our work, for $k=2$ (i.e. PBDs), Diakonikolas, Kane, and Stewart obtained a lower bound of $2^{\log^{2}(1/\varepsilon)}$ [DKS16b].

We provide the proof of Theorem 4. The proof will use algebraic geometric tools to argue this fact. In particular, the main theorem we will prove will be the following:

We will now prove Theorem 4 using Theorem 9.

$M_{\alpha}(Z)$ is a multisymmetric polynomial of degree $\|\alpha\|$ in the variables $\{P_{Z}[i,j]\}$ where $1\leq i\leq m$ and $1\leq j<k$ . By multisymmetric, we mean the polynomial is invariant under permuting the rows of $P_{Z}$ .

where the sum is taken over all surjections $\sigma$ from $\{1,\ldots,\|\alpha\|\}\rightarrow[m]$ where for $i<j$ , if $\beta_{i}=\beta_{j}$ , then $\sigma(i)<\sigma(j)$ .

The definition above might look a little involved, so to illustrate the concept, we will consider a simple example. For example, if $k=3$ and $\alpha=(1,2,0)$ , then

Likewise, if $k=4$ and $\alpha=(1,1,1,0)$ , then

Another family of polynomials which will be very useful in our reasoning will be the family of power sum multisymmetric polynomials.

As we have already observed, $M_{\alpha}(Z)$ is a multisymmetric polynomial in entries of the matrix $P_{Z}$ at degree is at most $\|\alpha\|$ . To prove that is a linear combination of $\mathbf{E}_{\beta}$ for $\beta\preceq\alpha$ , it suffices to make the following observation: Note that $M_{\alpha}(Z)$ is

The next lemma implies bounds on the coefficients of $\mathbf{E}_{\beta}$ in expressing $M_{\alpha}$ . Let us now assume that $M_{\alpha}=\sum_{\beta\preceq\alpha}\gamma_{\beta}\cdot\mathbf{E}_{\beta}$ . It is easy to see the following claim.

For $\alpha$ with $\alpha_{k}=0$ , we have $\gamma_{\alpha}=\prod_{j=1}^{k-1}\alpha_{j}!$ .

The next claim is also fairly easy to prove.

$\sum_{\beta\preceq\alpha}|\gamma_{\beta}|=2^{O(k)}\cdot\prod_{j=1}^{k-1}\alpha_{j}!$ .

In other words, $c_{j}$ is obtained by a pointwise product of $\alpha$ and the indicator vector of the singleton set $\{j\}$ . Now, assume that for $1\leq j\leq k-1$ ,

where $\beta=\beta_{1}+\ldots+\beta_{k-1}$ . Thus, to prove our claim, it suffices to show that for any particular $1\leq j\leq k-1$ ,

Note that $c_{j}$ is just $\alpha_{j}$ at the $j^{th}$ position and zero everywhere else. We introduce the following notation: For any integer $k$ , we let $\mathcal{P}(k)$ denote the set of its partitions i.e. a tuple of strictly positive integers summing to $k$ ordered in decreasing sequence. For example, for $k=5$ , we have 6 distinct partitions $(5),(4,1),(3,2),(3,1,1),(2,2,1),(2,1,1,1)$ . For any partition $P\in\mathcal{P}(k)$ , we use $s(P)$ to denote the number of summands in $P$ . For example, for the partition $P=(3,2)$ , $s(P)=2$ . Further, if $P=(x_{1},\ldots,x_{k})$ is a partition of $n$ , then

With these notations in place, it is easy to see that

Now, it is easy to see that for any integer $x>1$ , $(1.4)^{x}<=x!$ . If $1(P)$ denotes the number of $1$ in the partition $P$ . With this, we have

Now, note that the total number of partitions of $\alpha_{j}$ with $t$ ones in it is upper bounded by $|\mathcal{P}(\alpha-t)|$ . However, it is a well-known fact in number theory, that $|\mathcal{P}(\alpha-t)|\leq 2^{O(\sqrt{\alpha-t})}$ . Thus,

Let $Z$ and $Z^{\prime}$ be two $(m,k)$ -PMDs such that $|\mathbf{E}_{\beta}(Z)-\mathbf{E}_{\beta}(Z^{\prime})|\geq\delta$ . Then, there exists $\beta_{0}\preceq\beta$ such that

where $c=2^{O(k)}$ is the constant appearing in Claim 2.

Let $c=2^{O(k)}$ be the constant appearing in Claim 2. Let $|\beta|=d$ and $i$ be the smallest integer such that there exists a $\beta_{0}\preceq\beta$ with $|\beta_{0}|=i$ and

Note that by assumption, there exists such a $\beta_{0}$ . Next,

Applying Claim 1 and Claim 2, we get that

Again applying the hypothesis on $\beta_{0}$ , we have

The strategy for the rest of the proof is as follows: Instead of showing Theorem 9, we will show the following lemma.

To see why it suffices to prove Lemma 11, we have the following claim.

Assume towards a contradiction that $d_{TV}(Z_{1},Z_{2})<\delta\cdot m^{-\|\alpha\|_{1}}$ . By definition, this means that there is a coupling $(Z^{\prime}_{1},Z^{\prime}_{2})$ such that the marginal $Z^{\prime}_{1}$ is distributed as $Z_{1}$ , the marginal $Z^{\prime}_{2}$ is distributed as $Z_{2}$ and $\Pr[Z^{\prime}_{1}\not=Z^{\prime}_{2}]<\delta\cdot m^{-\|\alpha\|_{1}}$ . As the support of both $Z_{1}$ and $Z_{2}$ is confined in the box $[0,m]^{k}$ , it easily follows that $\big{|}M_{\alpha}(Z_{1})-M_{\alpha}(Z_{2})\big{|}<\Pr[Z^{\prime}_{1}\not=Z^{\prime}_{2}]\cdot m^{\|\alpha\|_{1}}<\delta$ . This results in a contradiction, thus completing the proof. ∎

In light of Lemma 10, it instead suffices to prove the following lemma.

It is easy to see that this operation defines legitimate $(m,k)$ PMD matrices. Further, note that $\mathbf{E}_{\alpha}$ is a homogenous polynomial of degree $\|\alpha\|$ . Thus, it is easy to see that if $\mathbf{E}_{\alpha}(A_{i})\not=\mathbf{E}_{\alpha}(A_{j})$ , then

For us, the utility of Jacobian will come from its role in the following theorem.

As a consequence, we have the following corollary.

To show a lower bound on the entropy of $M_{\mathcal{S}}$ , we will apply Corollary 2. To apply this, we need to prove that $\det J_{M_{\mathcal{S}}}\not=0$ . It is not clear how to show this, so we introduce an intermediate map.

The family $\{\mathbf{P}_{\alpha}\}$ is usually referred to as power-sum multisymmetric polynomials.

The idea here will be that we will relate the family $\mathbf{P}_{\alpha}$ and $\mathbf{E}_{\alpha}$ and then argue about the Jacobian of a map defined in terms of $\mathbf{P}_{\alpha}$ . The following relation between $\mathbf{E}_{\alpha}$ and $\mathbf{P}_{\alpha}$ was established in Dalbec [Dal99].

Using induction, the following lemma is immediate.

Immediately, we have that $\mathsf{rank}(J_{\mathbf{P^{\prime}}_{\mathcal{S}}})\leq\mathsf{rank}(J_{\mathbf{E^{\prime}}_{\mathcal{S}}})$ . Thus, to show a lower bound on $\mathsf{rank}(J_{\mathbf{E^{\prime}}_{\mathcal{S}}})$ , it suffices to show a lower bound on $\mathsf{rank}(J_{\mathbf{P^{\prime}}_{\mathcal{S}}})$ . We next show the following claim.

This implies that $\mathsf{rank}(J_{\mathbf{E^{\prime}}_{\mathcal{S}}})\geq d^{k}/k^{k}$ . This means that we can choose a square submatrix of size $(d/k)^{k}\times(d/k)^{k}$ of $J_{\mathbf{E^{\prime}}_{\mathcal{S}}}$ of full rank. This means there is a subset of $\mathcal{S}$ of size $d^{k}/k^{k}$ (call it $\mathcal{S}^{\prime}$ ) such that

A Fourier-Based Learning Algorithm for PMDs

We believe the application of Fourier analysis to learn such structured distributions is interesting in its own right and might have application in the future towards obtaining learning algorithms for other interesting classes of distributions. In particular, the recent work on the population recovery problem [WY12, MS13, LZ15] may also be viewed as an example of use of Fourier analysis towards learning of structured distributions.

Refer to Corollary 3 for the precise bounds. Similar exponential decay of Fourier transform is also true around the other points of the form $\{-1,0,1\}^{k}$ . Let us use $V=\prod_{i=1}^{k}(1+\sigma_{i})$ where $\sigma_{i}^{2}$ are the eigenvalues of $\Sigma$ . It is not difficult to show that all but an $\varepsilon$ -fraction of the mass of $Z$ falls on a set of size $V\cdot\log^{k}(1/\varepsilon)$ (Lemma 18). On the other hand, using the exponential decay of the Fourier transform, we have the following crucial claim: We identify a region $\mathcal{S}\subseteq^{k}$ of volume $\log^{k}(1/\varepsilon)/V$ such that

Refer to Claim 7 for the precise bounds. Also, in this informal description, we use $\widetilde{O}$ to hide the dependence on $k$ as well as the polylogarithmic factors of $1/\varepsilon$ . This implies that if $H$ is another function such that $|\widehat{H}(\xi)-\widehat{Z}(\xi)|\leq\varepsilon$ inside $\mathcal{S}$ and outside $\mathcal{S}$ , then

The main idea behind learning PMDs is to look at the Fourier spectrum of PMDs. Specifically, we will prove two structural results about PMDs. One is that the Fourier spectrum of PMDs (roughly) has an exponential decay around the origin. The second result we will prove is the Fourier spectrum is a Lipschitz function and thus to estimate the Fourier spectrum in the entire domain, it suffices to compute it at a few points. Combining these two results along with standard statements on Fourier inversion show that if we construct a hypothesis distribution which approximates the Fourier spectrum of the target PMD at the chosen points and also exhbits a similar exponential decay in the Fourier spectrum, then the hypothesis distribution is close to the target PMD. While the condition on Fourier decay is not algorithmically easy to impose, we show that using some ideas from [DKT15], the problem of imposing these constraints reduces to linear programming. We will first quickly review the notion of Fourier spectrum of integer valued distributions.

Let us now recall the setting: $P=Z_{1}+\ldots+Z_{n}$ where $Z_{i}$ are independent random variables supported on $\{\mathbf{e}_{1},\ldots,\mathbf{e}_{k}\}$ . Also for $1\leq i\leq n$ and $1\leq j\leq k$ , let $p_{ij}=\Pr[X_{i}=j]$ . To specify the next lemma, for any $\xi\in^{k}$ , we will need to define an associated vector $\zeta\in^{k}$ . For any $\xi\in$ , define the associated $\zeta$ as follows:

Let $X$ be a CRV with covariance matrix $\Sigma$ . Then, $|\widehat{X}(\xi)|^{2}\leq 1-\frac{1}{5}\cdot\zeta^{T}\cdot\Sigma\cdot\zeta$ .

To prove this, we do a simple case analysis, and use the inequality $\sin^{2}(\pi x/2)\geq x^{2}/5$ for $|x|\leq 3/2$ :

If both $\xi_{i}$ and $\xi_{j}$ are of the same type, then note that $|\xi_{i}-\xi_{j}|=|\zeta_{i}-\zeta_{j}|\leq 1$ which gives the required inequality.

If $\xi_{i}$ and $\xi_{j}$ are type $2$ and $3$ , then note that $|\xi_{i}-\xi_{j}|=|2-(\zeta_{i}-\zeta_{j})|$ . This implies that

Noting that $|\zeta_{i}-\zeta_{j}|\leq 1$ gives the required inequality.

If $\xi_{i}$ is of type $1$ and $\xi_{j}$ is of type 2, then note that the maximum value that $|\xi_{i}-\xi_{j}|$ can take is $3/2$ . On the other hand, notice that $|\xi_{i}-\xi_{j}|\geq|\zeta_{i}-\zeta_{j}|$ . These two facts immediately imply that

The exact same situation holds if $\xi_{i}$ is of type $1$ and $\xi_{j}$ is of type 3.

Having shown (11), see that this implies that

For any $(n,k)$ -PMD $P$ with covariance matrix $\Sigma$ , we have that for any $\xi\in^{k}$ :

This follows simply by noticing that for a PMD $P=X_{1}+\ldots+X_{n}$ ,

where $\Sigma_{i}$ is the covariance matrix of $X_{i}$ . Using the inequality, $1-x^{2}\leq e^{-x^{2}/2}$ (for $|x|\leq 1$ ), we have

We also have the following variant of the above lemma which will be useful for us.

where $X^{\prime}$ and $Y^{\prime}$ are the centered random variables obtained by centering $X$ and $Y$ . Likewise,

Applying the same for the $\widehat{Y^{\prime}}$ and applying triangle inequality, we get the claim. ∎

We now state the Plancherel identity in this setting. In particular, we have the following easy claim (which can be found in any standard text on Fourier analysis).

The above corollary demonstrates that to learn the PMD to error $\varepsilon$ , it suffices to produce another distribution $H$ whose Fourier spectrum is very close to the Fourier spectrum of $F$ (the “very small” is quantified by the effective support of $F$ ).

Given sample access to a $(n,k)$ -PMD $X$ with mean $\mu$ and covariance matrix $\Sigma$ , there exists an algorithm which can produce estimates $\hat{\mu}$ and $\hat{\Sigma}$ such that with probability at least $9/10$ for every vector $y$ :

The sample and time complexity are $O(k^{4}/\varepsilon^{2})$ .

The following is guaranteed by the multidimensional Chernoff bound.

Let $X$ be a $(n,k)$ -PMD with mean $\mu$ and covariance matrix $\Sigma$ . Let $L_{r}=\{z:(z-\mu)\cdot(z-\mu)^{t}\preceq r\cdot\Sigma\}$ . For $r=O(\log(1/\varepsilon)+\log k)$ ,

Let the eigenvectors of $\Sigma$ be $v_{1},\ldots,v_{k}$ with the corresponding eigenvalues $\sigma_{1}^{2},\ldots,\sigma_{k}^{2}$ . Consider any two distinct $x,y\in L_{r}$ . Since $x$ and $y$ are distinct, hence there must be some $1\leq i\leq k$ such that the projection of $x$ and $y$ along $v_{i}$ is separated by $k^{-1/2}$ . Let us denote the projection of $x$ along $v_{i}$ by $x_{i}$ . Then the condition of lying in $L_{r}$ implies that $|x_{i}|\leq r\cdot\sigma_{i}$ . It is then easy to see that if the number of integer points in $L_{r}$ is more than $\prod_{i=1}^{k}\left(2\sigma_{i}\sqrt{kr}+1\right)$ , then there must be $2$ points $x$ and $y$ and some $1\leq i\leq k$ , $|x_{i}-y_{i}|\leq k^{-1/2}$ . ∎

For a point $z\in\{-1,0,1\}^{k}$ , define $C_{z,r}$ as

Note that $^{k}$ can be partitioned into the regions $R_{z}$ (for $z\in\{-1,0,1\}^{k}$ ). In other words,

Let $S_{r}=\cup_{z\in\{-1,0,1\}^{k}}(R_{z}\cap C_{z,r})$ and let $\overline{S}_{r}=^{k}\setminus S_{r}$ . Then,

We now individually bound each of the summands. Fix any particular $z$ . Using Corollary 3

The last inequality uses that $r>2$ . This integral is now easily seen to be bounded by

This is exactly the same bound as stated in the claim. ∎

Let $S_{r}=\cup_{z\in\{-1,0,1\}^{k}}(R_{z}\cap C_{z,r})$ . Then,

Doing the exact same calculation as in the proof of Claim 7,

2 Learning algorithm for PMDs

Theorem 6, the structure theorem from [DKT15], allows us to assume that the PMD $Z$ is essentially a discretized Gaussian $G$ convolved with a sparse PMD $S$ where the sparse PMD is supported on only $\operatorname*{poly}(k/\varepsilon)$ summands.

By setting ‘ $\varepsilon$ ’ from the Theorem statement to be $\varepsilon^{10}$ , we get that $d_{TV}(Z,G+S)\leq\varepsilon^{10}$ . Because our subsequent learning algorithm will take $\ll O(\varepsilon^{-10})$ samples, we assume that we are getting samples from $G+S$ instead of $Z$ and that $Z=G+S$ . Furthermore, using the following claim from [DKT15], we can get a spectral estimate with accuracy $\varepsilon^{10}$ of the mean and covariance of the Gaussian $G$ by guessing the partition of coordinates in the covariance matrix of the Gaussian and going through all elements of the spectral cover of PSD matrices around a fine estimate $\widehat{S}$ for $\Sigma$ obtained using $k/\varepsilon^{2}$ samples from Lemma 17.

Let A be a symmetric $k\times k$ PSD matrix with minimum eigenvalue 1 and let $\mathcal{S}$ be the set of all matrices $B$ such that $|y^{T}\cdot(A-B)\cdot y|\leq\varepsilon_{1}y^{T}\cdot A\cdot y+\varepsilon_{2}y^{T}y$ where $\varepsilon_{1}\in[0,1/4)$ and $\varepsilon_{2}\in[0,\infty)$ . Then, there exists a cover $\mathcal{S}_{\varepsilon}$ of size $(k\cdot(1+\varepsilon_{2})/\varepsilon)^{k^{2}}$ such that any $B\in\mathcal{S}$ is $\varepsilon$ -spectrally close to some element in the cover.

The spectral closeness translates to closeness $\varepsilon^{10}$ in total variation distance between Gaussians (Lemma 2) and again since we will be taking $\ll O(\varepsilon^{-10})$ samples in the learning algorithm, we can assume that the gaussian $G$ has exactly the mean $\mu_{G}$ and covariance $\Sigma_{G}$ we guessed.

Similarly, we can assume that the sparse-PMD has known mean and covariance $\mu_{S}$ and $\Sigma_{S}$ . This is because any PMD with $n^{\prime}$ summands is $\varepsilon^{10}$ -close in total variation to a PMD where all the probabilities are rounded to multiples of $\lceil n^{\prime}k/\varepsilon^{10}\rceil^{-1}$ . This fact follows from union-bounding all the errors of the individual summands. Since $n^{\prime}=\operatorname*{poly}(k/\varepsilon)$ for the sparse PMD, all coordinates are multiples of $\operatorname*{poly}(\varepsilon/k)$ , which implies that the mean and covariance coordinates are also multiples of $\operatorname*{poly}(\varepsilon/k)$ and we can guess them exactly using $\operatorname*{poly}(k/\varepsilon)^{k^{2}}$ guesses. Again, since this sparse PMD is $\varepsilon^{10}$ close and we will be getting much fewer samples, we can assume that the sparse PMD has exactly the mean and covariance we guessed.

At this point, we have argued the following:

The PMD $Z$ is equal to the sum of a discretized Gaussian $G$ and a sparse PMD $S$ with $\operatorname*{poly}(k/\varepsilon)$ summands. The mean and covariance of the Gaussian $(\mu_{G},\Sigma_{G})$ and of the sparse PMD $(\mu_{S},\Sigma_{S})$ are known, which implies that the mean and covariance of the overall PMD $Z$ is equal to $(\mu,\Sigma)=(\mu_{S},\Sigma_{S})+(\mu_{G},\Sigma_{G})$ .

Our learning algorithm attempts to recover the sparse PMD in order to learn the overall distribution $Z$ . However, imposing the condition that the distribution we are trying to estimate is a sparse PMD will involve solving non linear equations making the computation intractable. Rather, we will seek to learn a sparse distribution $S^{\prime}$ supported on $[0,T]^{k}$ where $T=\operatorname*{poly}(k/\varepsilon)$ .

To learn this distribution, we will attempt to estimate its Fourier Transform. We will be mostly interested in points on the grid:

where $(\vec{v}_{i},\sigma^{2}_{i})$ are the eigenvector, eigenvalue pairs of the matrix $\Sigma$ . From Corollary 3, we know that the Fourier transform decays exponentially as we move away from $\{-1,0,1\}^{k}$ , and in particular Claim 7 bounds the total mass contained at a distance at least $r$ from all the points. For our purposes, we set $r=O(k\log k+k\log(1/\varepsilon))$ and perform the following steps to learn the sparse distribution $S^{\prime}$ .

Create variables $p_{\alpha}$ for every $\alpha\in[0,T]^{k}$ with the constraints $0\leq p_{\alpha}\leq 1$ and $\sum_{\alpha\in T^{k}}p_{\alpha}=1$ .

Let $\mathcal{A}_{1}=\cup_{z\in\{-1,0,1\}^{k}}\{\xi:\sum{\sigma}^{2}_{i}\left({v}_{i}\cdot((-1)^{z}\circ(\xi-z))\right)^{2}\leq r\}$ . Let $\mathcal{V}_{1}$ be the points of the grid $\mathcal{V}$ that lie in $\mathcal{A}_{1}$ . For each of those points, get an estimate $\widehat{Z}_{est}$ of $\widehat{Z}$ such that $|\widehat{Z}_{est}-\widehat{Z}|<\frac{\varepsilon}{6^{k}\cdot k^{2k}}$ and then impose linear constraints on $\{p_{\alpha}\}$ so that $|Re[\widehat{S^{\prime}}(\xi)\cdot\widehat{G}(\xi)-\widehat{Z}_{est}(\xi)]|\leq\frac{\varepsilon}{6^{k}\cdot k^{2k}}$ and $|Im[\widehat{S^{\prime}}(\xi)\cdot\widehat{G}(\xi)-\widehat{Z}_{est}(\xi)]|\leq\frac{\varepsilon}{6^{k}\cdot k^{2k}}$ .

Let $\sigma_{G,i}^{2},\vec{v}_{G,i}$ be the eigenvalues and eigenvectors of $\Sigma_{G}$ We note that, since we may have eigenvalues which are both large and small in magnitude, a naive eigendecomposition algorithm would incur a cost which depends on $n$ . However, as we only require the eigenvalues and eigenvectors approximately, this cost can be avoided by applying an appropriate power-iteration method. The cost in terms of $k$ and $1/\varepsilon$ is dominated by the other steps in our algorithm. and consider the set:

Construct a grid of points in $^{k}$ with a spacing of $\frac{\varepsilon^{2k}}{k^{2k}\cdot 6^{k}}$ in every direction. Let $\mathcal{V}_{2}$ be the subset of these points which fall in $A_{2}$ . For all these points impose the conditionss that $|Re[\widehat{S^{\prime}}(\xi)]|\leq e^{-\zeta^{T}\Sigma_{S}\zeta}$ and $|Im[\widehat{S}(\xi)]|\leq e^{-\zeta^{T}\Sigma_{S}\zeta}$ that follow from Corollary 3.

Finally, add the constraints $\sum_{\alpha}p_{\alpha}\alpha=\mu_{S}$ and $\sum_{\alpha}p_{\alpha}(\alpha-\mu_{S})(\alpha-\mu_{S})^{T}=\Sigma_{S}$

Note that in Step 2, $\mathcal{V}_{1}$ has size at most $\left(\frac{\sqrt{r}\cdot k^{2k}\cdot 6^{k}}{\varepsilon}\right)^{k}$ . If we naively estimated every Fourier coefficient in $\mathcal{V}_{1}$ the number of samples would be too high because every Fourier coefficient requires $\log(1/\delta)/{\varepsilon^{2}}$ samples to learn with accuracy $\varepsilon$ and probability of failure $1-\delta$ . However, we can instead take $O(k\log(r/\varepsilon)/\varepsilon^{2})$ samples and reuse the same samples to compute all the required Fourier coefficients. Since the probability of error is very small a simple union bound among all of the coefficients, shows that with at least constant probability all of them can be estimated within $\varepsilon$ .

To complete the learning algorithm, we repeat the steps above for each of the guessed mean and covariance matrices $(\mu_{G},\Sigma_{G}),(\mu_{S},\Sigma_{S})$ . We then perform a hypothesis selection algorithm to choose a distribution within $O(\varepsilon)$ from each of the distributions we obtain. We made $O(\operatorname*{poly}(k/\varepsilon)^{k^{2}})$ guesses, and thus obtained $O(\operatorname*{poly}(k/\varepsilon)^{k^{2}})$ candidate hypotheses. Applying the following tournament theorem for hypothesis selection from [DK14], we can select a good estimate in $O\left(\left(\frac{k}{\varepsilon}\right)^{2}\log(k/\varepsilon)\right)$ samples in $O(\operatorname*{poly}(k/\varepsilon)^{k^{2}})$ runtime.

We first show that there is a solution to $\{p_{\alpha}\}$ which satisfies all the constraints. Indeed, if we set the sparse distribution $S^{\prime}$ to be equal to the distribution $S$ of the sparse PMD we defined above, we get:

$\sum_{\alpha\in T^{k}}p_{\alpha}=1$ since $S$ is a probability distribution supported on $[0,T]^{k}$ .

The constraint $|Re[\widehat{S^{\prime}}(\xi)\cdot\widehat{G}(\xi)-\widehat{Z}_{est}(\xi)]|\leq\frac{\varepsilon}{6^{k}\cdot k^{2k}}$ is satisfied since for $S^{\prime}=S$ ,

The derivation for the constraint on the imaginary part is identical.

From Corollary 3, the sparse PMD satisfies $|\widehat{S}(\xi)|\leq e^{-(1/5)\cdot\zeta^{T}\Sigma_{S}\zeta}$ everywhere in $^{k}$ . This condition implies the imposed constraints which are only evaluated in few points.

The distribution $S$ has mean $\mu_{S}$ and covariance $\Sigma_{S}$ , so the last constraint is satisfied.

Consider any point $\xi$ in $\mathcal{A}_{1}$ . Then, note that there is some $\xi^{\prime}\in\mathcal{V}$ such that for $1\leq i\leq k$ , $\langle\xi-\xi^{\prime},\vec{v}_{i}\rangle\leq\frac{\varepsilon}{k^{2k}\cdot 6^{k}\cdot{\max\{1,\sigma_{i}\}}}$ . Applying Lemma 16, we get that

We bound the volume of the set $B_{2}$ . To do this, we again apply Claim 8, and get that

Note that for any point $\xi\in A_{2}$ , we there is a point $\xi^{\prime}$ such that $\|\xi-\xi^{\prime}\|_{2}\leq\frac{\varepsilon^{2k}}{k^{2k}\cdot 6^{k}}$ and that $|\widehat{S^{\prime}}(\xi^{\prime})|\leq e^{-(1/5)\cdot\zeta^{\prime T}\Sigma_{S}\cdot\zeta^{\prime}}$ . Since the variance of $\Sigma_{S}$ is at most $\operatorname*{poly}(k/\varepsilon)$ in every direction, we get that

By applying Claim 8 to bound the volume of the set $A_{2}\subseteq B_{2}$ and using the fact that $|\widehat{S}(\xi)|^{2}$ is at most $e^{-r/20}$ , we get that the first integral is at most

The last inequality uses the fact that whenever $\sigma_{G,i}\leq{\sigma}_{i}$ , it must imply that all the variance comes from $S$ and thus ${\sigma_{i}}\leq\operatorname*{poly}(k/\varepsilon)$ . By plugging the value of $r$ , we get that

The calculation for the second integral is similar.

Here the first inequality follows by exactly the same calculation we did for the first integral whereas the second inequality uses that $A_{2}\subseteq B_{2}$ . Now, that we had derived that

However, $\max\{\sigma_{G,i},1\}\geq\varepsilon^{\Theta(1)}\cdot\max\{{\sigma_{i}},1\}$ (because the variance of $S$ is at most $\operatorname*{poly}(1/\varepsilon)$ in any direction. This implies that

Note that $\widehat{S^{\prime}+G}(\xi)=\widehat{G}(\xi)\cdot\widehat{S^{\prime}}(\xi)$ . Thus, $|\widehat{S^{\prime}+G}(\xi)|^{2}\leq|\widehat{G}(\xi)|^{2}$ . Applying Claim 7 and noting that

Again using the fact that the variance of $S$ in any direction is at most $\operatorname*{poly}(k/\varepsilon)$ ,

Plugging in the value of $r$ , we get that

Combining Claim 11, Claim 12 and Claim 13, we get that

This is at most $d_{TV}(S+G,S^{\prime}+G)\leq\varepsilon\cdot(k\log(1/\varepsilon))^{O(k)}$ . Setting $\varepsilon$ to be $\frac{\varepsilon^{\prime}}{\operatorname*{poly}(k,\log(1/\varepsilon^{\prime}))^{k}}$ , we complete the proof of Theorem 5.

Open Problems

A number of interesting questions regarding Poisson Multinomial distributions are left open by this work and [DKS16a]. We outline a few of them here.

The complexity of learning Poisson Multinomials. This work and [DKS16a] both give algorithms for learning PMDs. The sample and time complexities are polynomial in $1/\varepsilon$ and exponential in $k$ . Meanwhile, [DKT15] gives an algorithm with a sample complexity polynomial in both parameters, but the time complexity is exponential in $k$ and $1/\varepsilon$ . Is there an algorithm for learning PMDs with sample and time complexities both polynomial in $k$ and $1/\varepsilon$ ?

Exploring the connection between Poisson Multinomials and Laplacian matrices. In this work, we described a cover for the set of $(n,k)$ -PMDs of size $O_{k,\varepsilon}(n^{O(k)})$ . Our construction relied crucially on Observation 1 (which states that the covariance matrix of a PMD is Laplacian) and spectral sparsification results for Laplacian matrices. With this connection in hand, can one derive other results for PMDs using the wealth of literature on Laplacian matrices?

A tighter central limit theorem. [VV11] proves a central limit theorem between an $(n,k)$ -GMD and a discretized Gaussian with the same mean and covariance, upper bounding their total variation distance by $O(k^{4/3}\sigma^{-1/3}\log^{2/3}n)$ , where $\sigma^{2}$ is the smallest eigenvector of the covariance matrix of the GMD. Both this paper and [DKS16a] qualitatively improve this bound by removing the dependence on $n$ , while keeping the dependence on $k$ and $1/\sigma$ still polynomial. How well can a GMD be approximated by a discretized Gaussian? In one dimension, the answer is $\Theta(1/\sigma)$ [CGS10], which implies a the answer for multiple dimensions is at least $\Omega(\sqrt{k}/\sigma)$ . [DKS16a] achieves this dependence on $1/\sigma$ (up to log factors), but the optimal dependence on $k$ is currently unknown.

References

Appendix A Proof of Lemma 2

Without loss of generality, assume that $\Sigma_{1}$ and $\Sigma_{2}$ are full rank. If not, the guarantees in the statement ensure that their nullspace is identical, and we can project to a lower dimension such that the resulting matrices are full rank.

First, we note that the assumptions in the lemma statement can be converted to be in terms of the minimum of the two variances, instead of the maximum. Define $\sigma_{v}^{2}=\min\{v^{T}\Sigma_{1}v,v^{T}\Sigma_{2}v\}$ . The second assumption can be rearranged to see that $(1-\frac{\varepsilon}{2})s_{v}^{2}\leq\sigma_{v}^{2}$ . Plugging this back into the second assumption gives that

where the last inequality holds for $\varepsilon\leq 1$ (otherwise, the lemma’s conclusion is trivial). Similarly, the second assumption also implies $\sqrt{1-\frac{\varepsilon}{2}}s_{v}\leq\sigma_{v}$ , when plugged into the first assumption gives

For the remainder of the proof, we will use these guarantees instead of the ones in the lemma statement.

We recall the standard formula for KL-divergence between two Gaussian distributions. Let $\{\lambda_{i}\}$ be the eigenvalues of $\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}$ .

We bound the divergence induced by differences in the means and covariances separately. We start with the means. Note that

Now we bound the divergence induced by differences in the covariances. We bound the eigenvalues of $\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}$ . Note that

Substituting $u=\Sigma_{2}^{1/2}v$ makes the latter condition equivalent to

The Courant-Fischer Theorem implies that $\frac{1}{1+\varepsilon}\leq\lambda_{i}\leq 1+\varepsilon$ for all $i$ .

At this point, we note that $x-\ln x-1\leq(1-x)^{2}$ for all $x\geq 1$ . This implies