Learning Poisson Binomial Distributions

Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio

Introduction

We begin by considering a somewhat fanciful scenario: You are the manager of an independent weekly newspaper in a city of nn people. Each week the ii-th inhabitant of the city independently picks up a copy of your paper with probability pip_{i}. Of course you do not know the values p1,,pnp_{1},\dots,p_{n}; each week you only see the total number of papers that have been picked up. For many reasons (advertising, production, revenue analysis, etc.) you would like to have a detailed “snapshot” of the probability distribution (pdf) describing how many readers you have each week. Is there an efficient algorithm to construct a high-accuracy approximation of the pdf from a number of observations that is independent of the population nn? We show that the answer is “yes.”

A Poisson Binomial Distribution of order nn is the distribution of a sum

where X1,,XnX_{1},\dots,X_{n} are independent Bernoulli (0/1) random variables. The expectations (E[Xi]=pi)i({\bf E}[X_{i}]=p_{i})_{i} need not all be the same, and thus these distributions generalize the Binomial distribution Bin(n,p){\rm Bin}(n,p) and, indeed, comprise a much richer class of distributions. (See Section 1.2 below.) It is believed that Poisson [Poi37] was the first to consider this extension of the Binomial distributionWe thank Yuval Peres and Sam Watson for this information [PW11]. and the distribution is sometimes referred to as “Poisson’s Binomial Distribution” in his honor; we shall simply call these distributions PBDs.

PBDs are one of the most basic classes of discrete distributions; indeed, they are arguably the simplest nn-parameter probability distribution that has some nontrivial structure. As such they have been intensely studied in probability and statistics (see Section 1.2) and arise in many settings; for example, we note here that tail bounds on PBDs form an important special case of Chernoff/Hoeffding bounds [Che52, Hoe63, DP09]. In application domains, PBDs have many uses in research areas such as survey sampling, case-control studies, and survival analysis, see e.g., [CL97] for a survey of the many uses of these distributions in applications. Given the simplicity and ubiquity of these distributions, it is quite surprising that the problem of density estimation for PBDs (i.e., learning an unknown PBD from independent samples) is not well understood in the statistics or learning theory literature. This is the problem we consider, and essentially settle, in this paper.

Let X=i=1nXiX=\sum_{i=1}^{n}X_{i} be an unknown PBD.

[Learning PBDs from constantly many samples] There is an algorithm with the following properties: given n,ϵ,δn,\epsilon,\delta and access to independent draws from XX, the algorithm uses

[Properly learning PBDs from constantly many samples] There is an algorithm with the following properties: given n,ϵ,δn,\epsilon,\delta and access to independent draws from XX, the algorithm uses

We note that, since every sample drawn from XX is a log(n)\log(n)-bit string, for constant δ\delta the number of bit-operations performed by our first algorithm is quasilinear in the length of its input. Moreover, the sample complexity of both algorithms is close to optimal, since Ω(1/ϵ2)\Omega(1/\epsilon^{2}) samples are required even to distinguish the (simpler) Binomial distributions Bin(n,1/2){\rm Bin}(n,1/2) and Bin(n,1/2+ϵ/n){\rm Bin}(n,1/2+\epsilon/\sqrt{n}), which have total variation distance Ω(ϵ).\Omega(\epsilon). Indeed, in view of this observation, our second algorithm is essentially sample-optimal.

Let X=i=1naiXiX=\sum_{i=1}^{n}a_{i}X_{i} be a weighted sum of unknown independent Bernoullis such that there are at most kk different values among a1,,an.a_{1},\dots,a_{n}. Then there is an algorithm with the following properties: given n,ϵ,δ,n,\epsilon,\delta, a1,,ana_{1},\dots,a_{n} and access to independent draws from XX, it uses

To complement Theorem 2, we also show that if there are many distinct weights in the sum, then even for weights with a very simple structure any learning algorithm must use many samples:

2 Related work.

At a high level, there has been a recent surge of interest in the theoretical computer science community on fundamental algorithmic problems involving basic types of probability distributions, see e.g., [KMV10, MV10, BS10, VV11] and other recent papers; our work may be considered as an extension of this theme. More specifically, there is a broad literature in probability theory studying various properties of PBDs; see [Wan93] for an accessible introduction to some of this work. In particular, many results study approximations to the Poisson Binomial distribution via simpler distributions. In a well-known result, Le Cam [Cam60] shows that for any PBD X=i=1nXiX=\sum_{i=1}^{n}X_{i} with E[Xi]=pi{\bf E}[X_{i}]=p_{i}, it holds that

Our approach, which removes nn completely from the sample complexity, requires a refined understanding of the structure of the set of all PBDs on nn variables, in fact one that is more refined than the understanding provided by the aforementioned results (approximating a PBD by a Poisson, Normal, or Binomial distribution). We give an outline of the approach in the next section.

3 Our approach.

The general learning result that we use (Lemma 10) is the following: We show that for any class S{\cal S} of target distributions, if S{\cal S} has an ϵ\epsilon-cover of size NN then there is a generic algorithm for learning an unknown distribution from S{\cal S} to accuracy O(ϵ){O(\epsilon)} that uses O((logN)/ϵ2)O((\log N)/\epsilon^{2}) samples. Our approach is rather similar to the algorithm of [DL01] for choosing a density estimate (but different in some details); it works by carrying out a tournament that matches every pair of distributions in the cover against each other. Our analysis shows that with high probability some ϵ\epsilon-accurate distribution in the cover will survive the tournament undefeated, and that any undefeated distribution will with high probability be O(ϵ)O(\epsilon)-accurate.

Applying this general result to the O(ϵ)O(\epsilon)-cover of size (1/ϵ)O(log2(1/ϵ))(1/\epsilon)^{O(\log^{2}(1/\epsilon))} described above, we obtain a PBD that is O(ϵ)O(\epsilon)-close to the target (this accounts for the increased running time in Part (2) versus Part (1)). We stress that for both the non-proper and proper learning algorithms sketched above, many technical subtleties and challenges arise in implementing the high-level plan given above, requiring a careful and detailed analysis.

We prove Theorem 2 using the general approach of Lemma 10 specialized to weighted sums of independent Bernoullis with constantly many distinct weights. We show how the tournament can be implemented efficiently for the class S{\cal S} of weighted sums of independent Bernoullis with constantly many distinct weights, and thus obtain Theorem 2. Finally, the lower bound of Theorem 3 is proved by a direct information-theoretic argument.

4 Preliminaries.

For a distribution XX supported on [n]={0,1,,n}[n]=\{0,1,\dots,n\} we write X(i)X(i) to denote the value Pr[X=i]\Pr[X=i] of the probability density function (pdf) at point ii, and X(i)X(\leq i) to denote the value Pr[Xi]\Pr[X\leq i] of the cumulative density function (cdf) at point ii. For S[n]S\subseteq[n], we write X(S)X(S) to denote iSX(i)\sum_{i\in S}X(i) and XSX_{S} to denote the conditional distribution of XX restricted to S.S. Sometimes we write X(I)X(I) and XIX_{I} for a subset I[0,n]I\subseteq[0,n], meaning X(I[n])X(I\cap[n]) and XI[n]X_{I\cap[n]} respectively.

Total Variation Distance.

Recall that the total variation distance between two distributions XX and YY over a finite domain DD is

Covers.

Poisson Binomial Distribution.

A Poisson binomial distribution DSnD\in{\cal S}_{n} can be represented uniquely as a vector (pi)i=1n(p_{i})_{i=1}^{n} satisfying 0p1p2pn10\leq p_{1}\leq p_{2}\leq\ldots\leq p_{n}\leq 1. To go from DSnD\in{\cal S}_{n} to its corresponding vector, we find a collection X1,,XnX_{1},\ldots,X_{n} of mutually independent Bernoullis such that i=1nXi\sum_{i=1}^{n}X_{i} is distributed according to DD and E[X1]E[Xn]{\bf E}[X_{1}]\leq\ldots\leq{\bf E}[X_{n}]. (Such a collection exists by the definition of a Poisson binomial distribution.) Then we set pi=E[Xi]p_{i}={\bf E}[X_{i}] for all ii. Lemma 1 of [DP13] shows that the resulting vector (p1,,pn)(p_{1},\ldots,p_{n}) is unique.

We denote by PBD(p1,,pn){\rm PBD}(p_{1},\ldots,p_{n}) the distribution of the sum i=1nXi\sum_{i=1}^{n}X_{i} of mutually independent indicators X1,,XnX_{1},\ldots,X_{n} with expectations pi=E[Xi]p_{i}={\bf E}[X_{i}], for all ii. Given the above discussion PBD(p1,,pn){\rm PBD}(p_{1},\ldots,p_{n}) is unique up to permutation of the pip_{i}’s. We also sometimes write {Xi}\{X_{i}\} to denote the distribution of i=1nXi.\sum_{i=1}^{n}X_{i}. Note the difference between {Xi}\{X_{i}\}, which refers to the distribution of iXi\sum_{i}X_{i}, and {Xi}i\{X_{i}\}_{i}, which refers to the underlying collection of mutually independent Bernoulli random variables.

Translated Poisson Distribution.

We will make use of the translated Poisson distribution for approximating the Poisson Binomial distribution. We define the translated Poisson distribution, and state a known result on how well it approximates the Poisson Binomial distribution.

We say that an integer random variable YY is distributed according to the translated Poisson distribution with parameters μ\mu and σ2\sigma^{2}, denoted TP(μ,σ2)TP(\mu,\sigma^{2}), iff YY can be written as

where ZZ is a random variable distributed according to Poisson(σ2+{μσ2}){\rm Poisson}(\sigma^{2}+\{\mu-\sigma^{2}\}), where {μσ2}\{\mu-\sigma^{2}\} represents the fractional part of μσ2\mu-\sigma^{2}.

The following lemma gives a useful bound on the variation distance between a Poisson Binomial Distribution and a suitable translated Poisson distribution. Note that if the variance of the Poisson Binomial Distribution is large, then the lemma gives a strong bound.

Let J1,,JnJ_{1},\ldots,J_{n} be independent random indicators with E[Ji]=pi{\bf E}[J_{i}]=p_{i}. Then

where μ=i=1npi\mu=\sum_{i=1}^{n}p_{i} and σ2=i=1npi(1pi)\sigma^{2}=\sum_{i=1}^{n}p_{i}(1-p_{i}).

The following bound on the total variation distance between translated Poisson distributions will be useful.

Running Times, and Bit Complexity.

Throughout this paper, we measure the running times of our algorithms in numbers of bit operations. For a positive integer nn, we denote by n{\langle{n}\rangle} its description complexity in binary, namely n=log2n{\langle{n}\rangle}=\lceil\log_{2}n\rceil. Moreover, we represent a positive rational number qq as q1q2{q_{1}\over q_{2}}, where q1q_{1} and q2q_{2} are relatively prime positive integers. The description complexity of qq is defined to be q=q1+q2{\langle{q}\rangle}={\langle{q_{1}}\rangle}+{\langle{q_{2}}\rangle}. We will assume that all ϵ\epsilon’s and δ\delta’s input to our algorithms are rational numbers.

In this section, we prove Theorem 1 by providing a sample- and time-efficient algorithm for learning an unknown PBD X=i=1nXiX=\sum_{i=1}^{n}X_{i}. We start with an important ingredient in our analysis.

A cover for PBDs. We make use of the following theorem, which provides a cover of the set S=Sn{\cal S}={\cal S}_{n} of all PBDs of order-nn. The theorem was given implicitly in [DP11] and explicitly as Theorem 1 in [DP13].

For all ϵ>0\epsilon>0, there exists an ϵ\epsilon-cover SϵS{\cal S}_{\epsilon}\subseteq{\cal S} of S{\cal S} such that

Sϵn2+n(1ϵ)O(log21/ϵ)|{\cal S}_{\epsilon}|\leq n^{2}+n\cdot\left({1\over\epsilon}\right)^{O(\log^{2}{1/\epsilon})}; and

Sϵ{\cal S}_{\epsilon} can be constructed in time linear in its representation size, i.e., O(n2logn)+O(nlogn)(1ϵ)O(log21/ϵ){O}(n^{2}\log n)+{O}(n\log n)\cdot\left({1\over\epsilon}\right)^{O(\log^{2}{1/\epsilon})}.

Moreover, if {Yi}Sϵ\{Y_{i}\}\in{\cal S}_{\epsilon}, then the collection of nn Bernoulli random variables {Yi}i=1,,n\{Y_{i}\}_{i=1,\dots,n} has one of the following forms, where k=k(ϵ)C/ϵk=k(\epsilon)\leq C/\epsilon is a positive integer, for some absolute constant C>0C>0:

Finally, for every {Xi}S\{X_{i}\}\in{\cal S} for which there is no ϵ\epsilon-neighbor in Sϵ{\cal S}_{\epsilon} that is in sparse form, there exists some {Yi}Sϵ\{Y_{i}\}\in{\cal S}_{\epsilon} in kk-heavy Binomial form such that

We remark that the cover theorem as stated in [DP13] does not include the part of the above statement following “finally.” We provide a proof of this extension in Appendix A.

The Basic Learning Algorithm. The high-level structure of our learning algorithms which give Theorem 1 is provided in Algorithm Learn-PBD of Figure 1. We instantiate this high-level structure, with appropriate technical modifications, in Section 2.4, where we give more detailed descriptions of the non-proper and proper algorithms that give parts (1) and (2) of Theorem 1.

At a high level, the subroutine Learn-Sparse is given sample access to XX and is designed to find an ϵ\epsilon-accurate hypothesis HSH_{S} with probability at least 1δ/31-\delta/3, if the unknown PBD XX is ϵ\epsilon-close to some sparse form PBD inside the cover Sϵ{\cal S}_{\epsilon}. Similarly, Learn-Poisson is designed to find an ϵ\epsilon-accurate hypothesis HPH_{P}, if XX is not ϵ\epsilon-close to a sparse form PBD (in this case, Theorem 4 implies that XX must be ϵ\epsilon-close to some k(ϵ)k(\epsilon)-heavy Binomial form PBD). Finally, Choose-Hypothesis is designed to choose one of the two hypotheses HS,HPH_{S},H_{P} as being ϵ\epsilon-close to X.X. The following subsections specify these subroutines, as well as how the algorithm can be used to establish Theorem 1. We note that Learn-Sparse and Learn-Poisson do not return the distributions HSH_{S} and HPH_{P} as a list of probabilities for every point in [n][n]. They return instead a succinct description of these distributions in order to keep the running time of the algorithm logarithmic in nn. Similarly, Choose-Hypothesis operates with succinct descriptions of these distributions.

Our starting point here is the simple observation that any PBD is a unimodal distribution over the domain {0,1,,n}\{0,1,\dots,n\}. (There is a simple inductive proof of this, or see Section 2 of [KG71].) This enables us to use the algorithm of Birgé [Bir97] for learning unimodal distributions. We recall Birgé’s result, and refer the reader to Appendix B for an explanation of how Theorem 5 as stated below follows from [Bir97].

For all n,ϵ,δ>0n,\epsilon,\delta>0, there is an algorithm that draws

samples from an unknown unimodal distribution XX over [n][n], does

The main result of this subsection is the following:

For all n,ϵ,δ>0n,\epsilon^{\prime},\delta^{\prime}>0, there is an algorithm Learn-SparseX(n,ϵ,δ){}^{X}(n,\epsilon^{\prime},\delta^{\prime}) that draws

samples from a target PBD XX over [n][n], does

The high-level idea of Lemma 3 is quite simple. We truncate O(ϵ)O(\epsilon^{\prime}) of the probability mass from each end of XX to obtain a conditional distribution X[a^,b^]X_{[\hat{a},\hat{b}]}; since XX is unimodal so is X[a^,b^]X_{[\hat{a},\hat{b}]}. If b^a^\hat{b}-\hat{a} is larger than O(1/ϵ3)O(1/\epsilon^{\prime 3}) then the algorithm outputs “fail” (and XX could not have been close to a sparse-form distribution in the cover). Otherwise, we use Birgé’s algorithm to learn the unimodal distribution X[a^,b^]X_{[\hat{a},\hat{b}]}. A detailed description of the algorithm is given in Figure 2 below.

Proof of Lemma 3: As described in Figure 2, algorithm Learn-SparseX(n,ϵ,δ){}^{X}(n,\epsilon^{\prime},\delta^{\prime}) first draws M=32log(8/δ)/ϵ2M=32\log(8/\delta^{\prime})/\epsilon^{\prime 2} samples from XX and sorts them to obtain a list of values 0s1sMn.0\leq s_{1}\leq\cdots\leq s_{M}\leq n. We claim the following about the values a^\hat{a} and b^\hat{b} defined in Step 2 of the algorithm:

With probability at least 1δ/21-\delta^{\prime}/2, we have X(a^)[3ϵ/2,5ϵ/2]X(\leq\hat{a})\in[3\epsilon^{\prime}/2,5\epsilon^{\prime}/2] and X(b^)[15ϵ/2,13ϵ/2]X(\leq\hat{b})\in[1-5\epsilon^{\prime}/2,1-3\epsilon^{\prime}/2].

We only show that X(a^)3ϵ/2X(\leq\hat{a})\geq 3\epsilon^{\prime}/2 with probability at least 1δ/81-\delta^{\prime}/8, since the arguments for X(a^)5ϵ/2X(\leq\hat{a})\leq 5\epsilon^{\prime}/2, X(b^)13ϵ/2X(\leq\hat{b})\leq 1-3\epsilon^{\prime}/2 and X(b^)15ϵ/2X(\leq\hat{b})\geq 1-5\epsilon^{\prime}/2 are identical. Given that each of these conditions is met with probability at least 1δ/81-\delta^{\prime}/8, the union bound establishes our claim.

To show that X(a^)3ϵ/2X(\leq\hat{a})\geq 3\epsilon^{\prime}/2 is satisfied with probability at least 1δ/81-\delta^{\prime}/8 we argue as follows: Let α=max{i  X(i)<3ϵ/2}\alpha^{\prime}=\max\{i~{}|~{}X(\leq i)<3\epsilon^{\prime}/2\}. Clearly, X(α)<3ϵ/2X(\leq\alpha^{\prime})<3\epsilon^{\prime}/2 while X(α+1)3ϵ/2X(\leq\alpha^{\prime}+1)\geq 3\epsilon^{\prime}/2. Given this, if MM samples are drawn from XX then the expected number of them that are α\leq\alpha^{\prime} is at most 3ϵM/23\epsilon^{\prime}M/2. It follows then from the Chernoff bound that the probability that more than 74ϵM{7\over 4}\epsilon^{\prime}M samples are α\leq\alpha^{\prime} is at most e(ϵ/4)2M/2δ/8e^{-(\epsilon^{\prime}/4)^{2}M/2}\leq\delta^{\prime}/8. Hence except with this failure probability, we have a^α+1\hat{a}\geq\alpha^{\prime}+1, which implies that X(a^)3ϵ/2X(\leq\hat{a})\geq 3\epsilon^{\prime}/2. ∎

As specified in Steps 3 and 4, if b^a^>(C/ϵ)3\hat{b}-\hat{a}>(C/\epsilon^{\prime})^{3}, where CC is the constant in the statement of Theorem 4, the algorithm outputs “fail”, returning the trivial hypothesis which puts probability mass 11 on the point . Otherwise, the algorithm runs Birgé’s unimodal distribution learner (Theorem 5) on the conditional distribution X[a^,b^]X_{[\hat{a},\hat{b}]}, and outputs the result of Birgé’s algorithm. Since XX is unimodal, it follows that X[a^,b^]X_{[\hat{a},\hat{b}]} is also unimodal, hence Birgé’s algorithm is appropriate for learning it. The way we apply Birgé’s algorithm to learn X[a^,b^]X_{[\hat{a},\hat{b}]} given samples from the original distribution XX is the obvious one: we draw samples from XX, ignoring all samples that fall outside of [a^,b^][\hat{a},\hat{b}], until the right O(log(1/δ)log(1/ϵ)/ϵ3)O(\log(1/\delta^{\prime})\log(1/\epsilon^{\prime})/\epsilon^{\prime 3}) number of samples fall inside [a^,b^][\hat{a},\hat{b}], as required by Birgé’s algorithm for learning a distribution of support of size (C/ϵ)3(C/\epsilon^{\prime})^{3} with probability at least 1δ/41-\delta^{\prime}/4. Once we have the right number of samples in [a^,b^][\hat{a},\hat{b}], we run Birgé’s algorithm to learn the conditional distribution X[a^,b^]X_{[\hat{a},\hat{b}]}. Note that the number of samples we need to draw from XX until the right O(log(1/δ)log(1/ϵ)/ϵ3)O(\log(1/\delta^{\prime})\log(1/\epsilon^{\prime})/\epsilon^{\prime 3}) number of samples fall inside [a^,b^][\hat{a},\hat{b}] is still O(log(1/δ)log(1/ϵ)/ϵ3)O(\log(1/\delta^{\prime})\log(1/\epsilon^{\prime})/\epsilon^{\prime 3}), with probability at least 1δ/41-\delta^{\prime}/4. Indeed, since X([a^,b^])=1O(ϵ)X([\hat{a},\hat{b}])=1-O(\epsilon^{\prime}), it follows from the Chernoff bound that with probability at least 1δ/41-\delta^{\prime}/4, if K=Θ(log(1/δ)log(1/ϵ)/ϵ3)K=\Theta(\log(1/\delta^{\prime})\log(1/\epsilon^{\prime})/\epsilon^{\prime 3}) samples are drawn from XX, at least K(1O(ϵ))K(1-O(\epsilon^{\prime})) fall inside [a^,b^][\hat{a},\hat{b}].

Analysis: It is easy to see that the sample complexity of our algorithm is as promised. For the running time, notice that, if Birgé’s algorithm is invoked, it will return two lists of numbers a1a_{1} through aka_{k} and b1b_{1} through bkb_{k}, as well as a list of probability masses q1,,qkq_{1},\ldots,q_{k} assigned to each subinterval [ai,bi][a_{i},b_{i}], i=1,,ki=1,\ldots,k, by the hypothesis distribution HSH_{S}, where k=O(log(1/ϵ)/ϵ)k=O(\log(1/\epsilon^{\prime})/\epsilon^{\prime}). In linear time, we can compute a list of probabilities q^1,,q^k\hat{q}_{1},\ldots,\hat{q}_{k}, representing the probability assigned by HSH_{S} to every point of subinterval [ai,bi][a_{i},b_{i}], for i=1,,ki=1,\ldots,k. So we can represent our output hypothesis HSH_{S} via a data structure that maintains O(1/ϵ3)O(1/\epsilon^{\prime 3}) pointers, having one pointer per point inside [a,b][a,b]. The pointers map points to probabilities assigned by HSH_{S} to these points. Thus turning the output of Birgé’s algorithm into an explicit distribution over [a,b][a,b] incurs linear overhead in our running time, and hence the running time of our algorithm is also as promised. (See Appendix B for an explanation of the running time of Birgé’s algorithm.) Moreover, we also note that the output distribution has the promised structure, since in one case it has a single atom at and in the other case it is the output of Birgé’s algorithm on a distribution of support of size (C/ϵ)3(C/\epsilon^{\prime})^{3}.

2 Learning when X𝑋X is close to a k𝑘k-heavy Binomial Form PBD.

For all n,ϵ,δ>0n,\epsilon^{\prime},\delta^{\prime}>0, there is an algorithm Learn-PoissonX(n,ϵ,δ){}^{X}(n,\epsilon^{\prime},\delta^{\prime}) that draws

samples from a target PBD XX over [n][n], does

Our proof plan is to exploit the structure of the cover of Theorem 4. In particular, if XX is not ϵ\epsilon^{\prime}-close to any sparse form PBD in the cover, it must be ϵ\epsilon^{\prime}-close to a PBD in heavy Binomial form with approximately the same mean and variance as XX, as specified by the final part of the cover theorem. Hence, a natural strategy is to obtain estimates μ^\hat{\mu} and σ^2\hat{\sigma}^{2} of the mean and variance of the unknown PBD XX, and output as a hypothesis a translated Poisson distribution with parameters μ^\hat{\mu} and σ^2\hat{\sigma}^{2}. We show that this strategy is a successful one. Before providing the details, we highlight two facts that we will establish in the subsequent analysis and that will be used later. The first is that, assuming XX is not ϵ\epsilon^{\prime}-close to any sparse form PBD in the cover Sϵ{\cal S}_{\epsilon^{\prime}}, its variance σ2\sigma^{2} satisfies

The second is that under the same assumption, the estimates μ^\hat{\mu} and σ^2\hat{\sigma}^{2} of the mean μ\mu and variance σ2\sigma^{2} of XX that we obtain satisfy the following bounds with probability at least 1δ1-\delta:

See Figure 3 and the associated Figure 4 for a detailed description of the Learn-PoissonX(n,ϵ,δ){}^{X}(n,\epsilon^{\prime},\delta^{\prime}) algorithm.

Proof of Lemma 5: We start by showing that we can estimate the mean and variance of the target PBD XX.

We treat the estimation of μ\mu and σ2\sigma^{2} separately. For both estimation problems we show how to use O(1/ϵ2)O(1/\epsilon^{2}) samples to obtain estimates μ^\hat{\mu} and σ^2\hat{\sigma}^{2} achieving the required guarantees with probability at least 2/32/3 (we refer to these as “weak estimators”). Then a routine procedure allows us to boost the success probability to 1δ1-\delta at the expense of a multiplicative factor O(log1/δ)O(\log 1/\delta) on the number of samples. While we omit the details of the routine boosting argument, we remind the reader that it involves running the weak estimator O(log1/δ)O(\log 1/\delta) times to obtain estimates μ^1,,μ^O(log1/δ)\hat{\mu}_{1},\ldots,\hat{\mu}_{O(\log{1/\delta})} and outputting the median of these estimates, and similarly for estimating σ2\sigma^{2}.

We proceed to specify and analyze the weak estimators for μ\mu and σ2\sigma^{2} separately:

Weak estimator for μ\mu: Let Z1,,ZmZ_{1},\ldots,Z_{m} be independent samples from XX, and let μ^=iZim\hat{\mu}={\sum_{i}Z_{i}\over m}. Then

Choosing t=3t=\sqrt{3} and m=3/ϵ2m=\lceil 3/\epsilon^{2}\rceil, the above imply that μ^μϵσ|\hat{\mu}-\mu|\leq\epsilon\sigma with probability at least 2/32/3.

Weak estimator for σ2\sigma^{2}: Let Z1,,ZmZ_{1},\ldots,Z_{m} be independent samples from XX, and let σ^2=i(Zi1miZi)2m1\hat{\sigma}^{2}={\sum_{i}(Z_{i}-{1\over m}\sum_{i}Z_{i})^{2}\over m-1} be the unbiased sample variance. (Note the use of Bessel’s correction.) Then it can be checked [Joh03] that

where κ\kappa is the excess kurtosis of the distribution of XX (i.e. κ=E[(Xμ)4]σ43\kappa={\frac{{\bf E}[(X-\mu)^{4}]}{\sigma^{4}}}-3). To bound κ\kappa in terms of σ2\sigma^{2} suppose that X=i=1nXiX=\sum_{i=1}^{n}X_{i}, where E[Xi]=pi{\bf E}[X_{i}]=p_{i} for all ii. Then

Choosing t=3t=\sqrt{3} and m=3/ϵ2m=\lceil 3/\epsilon^{2}\rceil, the above imply that σ^2σ2ϵσ24+1σ2|\hat{\sigma}^{2}-\sigma^{2}|\leq\epsilon\sigma^{2}\sqrt{4+{1\over\sigma^{2}}} with probability at least 2/32/3.

We proceed to prove Lemma 5. Learn-PoissonX(n,ϵ,δ){}^{X}(n,\epsilon^{\prime},\delta^{\prime}) runs A(n,ϵ,δ){\cal A}(n,\epsilon,\delta) from Lemma 6 with appropriately chosen ϵ=ϵ(ϵ)\epsilon=\epsilon(\epsilon^{\prime}) and δ=δ(δ)\delta=\delta(\delta^{\prime}), given below, and then outputs the translated Poisson distribution TP(μ^,σ^2)TP(\hat{\mu},\hat{\sigma}^{2}), where μ^\hat{\mu} and σ^2\hat{\sigma}^{2} are the estimated mean and variance of XX output by A{\cal A}. Next, we show how to choose ϵ\epsilon and δ\delta, as well as why the desired guarantees are satisfied by the output distribution.

In particular, conditions (b) and (d) above imply that

for some universal constant θ\theta, establishing (1). In terms of this θ\theta, we choose ϵ=ϵ/4+1θ2\epsilon=\epsilon^{\prime}/\sqrt{4+{1\over\theta^{2}}} and δ=δ\delta=\delta^{\prime} for the application of Lemma 6 to obtain—from O(log(1/δ)/ϵ2)O(\log(1/\delta^{\prime})/\epsilon^{\prime 2}) samples—estimates μ^\hat{\mu} and σ^2\hat{\sigma}^{2} of μ\mu and σ2\sigma^{2}.

From our choice of parameters and the guarantees of Lemma 6, it follows that, if XX is not ϵ\epsilon^{\prime}-close to any PBD in sparse form inside the cover Sϵ{\cal S}_{\epsilon^{\prime}}, then with probability at least 1δ1-\delta^{\prime} the estimates μ^\hat{\mu} and σ^2\hat{\sigma}^{2} satisfy:

establishing (2). Moreover, if YY is a random variable distributed according to the translated Poisson distribution TP(μ^,σ^2)TP(\hat{\mu},\hat{\sigma}^{2}), we show that XX and YY are within O(ϵ)O(\epsilon^{\prime}) in total variation distance, concluding the proof of Lemma 5.

We make use of Lemma 1. Suppose that X=i=1nXiX=\sum_{i=1}^{n}X_{i}, where E[Xi]=pi{\bf E}[X_{i}]=p_{i} for all ii. Lemma 1 implies that

It remains to bound the total variation distance between the translated Poisson distributions TP(μ,σ2)TP(\mu,\sigma^{2}) and TP(μ^,σ^2)TP(\hat{\mu},\hat{\sigma}^{2}). For this we use Lemma 2. Lemma 2 implies

The claim follows from (3), (4) and the triangle inequality. ∎

The proof of Lemma 5 is concluded. We remark that the algorithm described above does not need to know a priori whether or not XX is ϵ\epsilon^{\prime}-close to a PBD in sparse form inside the cover Sϵ{\cal S}_{\epsilon^{\prime}} of Theorem 4. The algorithm simply runs the estimator of Lemma 6 with ϵ=ϵ/4+1θ2\epsilon=\epsilon^{\prime}/\sqrt{4+{1\over\theta^{2}}} and δ=δ\delta^{\prime}=\delta and outputs whatever estimates μ^\hat{\mu} and σ^2\hat{\sigma}^{2} the algorithm of Lemma 6 produces. \square

3 Hypothesis testing.

Our hypothesis testing routine Choose-HypothesisX uses samples from the unknown distribution XX to run a “competition” between two candidate hypothesis distributions H1H_{1} and H2H_{2} over [n][n] that are given in the input. We show that if at least one of the two candidate hypotheses is close to the unknown distribution XX, then with high probability over the samples drawn from XX the routine selects as winner a candidate that is close to XX. This basic approach of running a competition between candidate hypotheses is quite similar to the “Scheffé estimate” proposed by Devroye and Lugosi (see [DL96b, DL96a] and Chapter 6 of [DL01], as well as [Yat85]), but our notion of competition here is different.

We obtain the following lemma, postponing all running-time analysis to the next section.

There is an algorithm Choose-HypothesisX(H1,H2,ϵ,δ){}^{X}({H}_{1},{H}_{2},\epsilon^{\prime},\delta^{\prime}) which is given sample access to distribution XX, two hypothesis distributions H1,H2H_{1},H_{2} for XX, an accuracy parameter ϵ>0\epsilon^{\prime}>0, and a confidence parameter δ>0.\delta^{\prime}>0. It makes

Proof of Lemma 8: Figure 5 describes how the competition between H1H_{1} and H2H_{2} is carried out.

The correctness of Choose-Hypothesis is an immediate consequence of the following claim. (In fact for Lemma 8 we only need item (i) below, but item (ii) will be handy later in the proof of Lemma 10.)

For part (ii): If p1p25ϵp_{1}-p_{2}\leq 5\epsilon^{\prime} then the competition declares a draw, hence H3iH_{3-i} is not the winner. Otherwise we have p1p2>5ϵp_{1}-p_{2}>5\epsilon^{\prime} and the above arguments imply that the competition between H1H_{1} and H2H_{2} will declare H3iH_{3-i} as the winner with probability at most 2emϵ2/22e^{-{m\epsilon^{\prime 2}/2}}.

In view of Claim 9, the proof of Lemma 8 is concluded. \square

Our Choose-Hypothesis algorithm implies a generic learning algorithm of independent interest.

Let S{\cal S} be an arbitrary set of distributions over a finite domain. Moreover, let SϵS{\cal S}_{\epsilon}\subseteq{\cal S} be an ϵ\epsilon-cover of S{\cal S} of size NN, for some ϵ>0\epsilon>0. For all δ>0\delta>0, there is an algorithm that uses

The algorithm performs a tournament, by running Choose-HypothesisX(Hi,Hj,ϵ,{}^{X}(H_{i},H_{j},\epsilon, δ/(4N))\delta/(4N)) for every pair (Hi,Hj)(H_{i},H_{j}), i<ji<j, of distributions in Sϵ{\cal S}_{\epsilon}. Then it outputs any distribution YSϵY^{\star}\in{\cal S}_{\epsilon} that was never a loser (i.e., won or tied against all other distributions in the cover). If no such distribution exists in Sϵ{\cal S}_{\epsilon} then the algorithm says “failure,” and outputs an arbitrary distribution from Sϵ{\cal S}_{\epsilon}.

We note that Devroye and Lugosi (Chapter 7 of [DL01]) prove a similar result, but there are some differences. They also have all pairs of distributions in the cover compete against each other, but they use a different notion of competition between every pair. Moreover, their approach chooses a distribution in the cover that wins the maximum number of competitions, whereas our algorithm chooses a distribution that is never defeated (i.e., won or tied against all other distributions in the cover).

Recent work [DK14, AJOS14, SOAJ14] improves the running time of the tournament approaches of Lemma 10, Devroye-Lugosi and other related tournaments to have a quasilinear dependence of O(NlogN)O(N\log N) on the size N=SϵN=|{\cal S}_{\epsilon}| . In particular, they avoid running Choose-Hypothesis for all pairs of distributions in Sϵ{\cal S}_{\epsilon}.

4 Proof of Theorem 1.

We first show Part (1) of the theorem, where the learning algorithm may output any distribution over [n][n] and not necessarily a PBD. The algorithm for this part of the theorem, Non-Proper-Learn-PBD, is given in Figure 6. This algorithm follows the high-level structure outlined in Figure 1 with the following modifications: (a) first, if the total variation distance to within which we want to learn XX is ϵ\epsilon, the second argument of both Learn-Sparse and Learn-Poisson is set to ϵ12max{c1,c2}{\epsilon\over 12\max\{c_{1},c_{2}\}}, where c1c_{1} and c2c_{2} are respectively the constants from Lemmas 3 and 5; (b) the third step of Learn-PBD is replaced by Choose-HypothesisX(HS,HP^,ϵ/8,δ/3){}^{X}(H_{S},{\widehat{H_{P}}},{\epsilon/8},\delta/3), where HP^{\widehat{H_{P}}} is defined in terms of HPH_{P} as described in Definition 2 below; and (c) if Choose-Hypothesis returns HSH_{S}, then Learn-PBD also returns HSH_{S}, while if Choose-Hypothesis returns HP^\widehat{H_{P}}, then Learn-PBD returns HPH_{P}.

(Definition of HP^{\widehat{H_{P}}}:) HP^{\widehat{H_{P}}} is defined in terms of HPH_{P} and the support of HSH_{S} in three steps:

for all points ii such that HS(i)=0H_{S}(i)=0, we let HP^(i)=HP(i){\widehat{H_{P}}}(i)={{H_{P}}}(i);

for all points ii such that HS(i)0H_{S}(i)\neq 0, we describe in Appendix C an efficient deterministic algorithm that numerically approximates HP(i)H_{P}(i) to within an additive error of ±ϵ/48s\pm\epsilon/48s, where s=O(1/ϵ3)s=O(1/\epsilon^{3}) is the cardinality of the support of HSH_{S}. If HP,i^\widehat{H_{P,i}} is the approximation to HP(i)H_{P}(i) output by the algorithm, we set HP^(i)=max{0,HP,i^ϵ/48s}\widehat{H_{P}}(i)={\max\{0,\widehat{H_{P,i}}-\epsilon/48s\}}; notice then that HP(i)ϵ/24sHP^(i)HP(i)H_{P}(i)-\epsilon/24s\leq\widehat{H_{P}}(i)\leq H_{P}(i); finally,

for an arbitrary point ii such that HS(i)=0H_{S}(i)=0, we set HP^(i)=1jiHP^(j)\widehat{H_{P}}(i)=1-\sum_{j\neq i}\widehat{H_{P}}(j), to make sure that HP^\widehat{H_{P}} is a probability distribution.

We remark that the reason why we do not wish to use HPH_{P} directly in Choose-Hypothesis is purely computational. In particular, since HPH_{P} is a translated Poisson distribution, we cannot compute its probabilities HP(i)H_{P}(i) exactly, and we need to approximate them. On the other hand, we need to make sure that using approximate values will not cause Choose-Hypothesis to make a mistake. Our HP^\widehat{H_{P}} is carefully defined so as to make sure that Choose-Hypothesis selects a probability distribution that is close to the unknown XX, and that all probabilities that Choose-Hypothesis needs to compute can be computed without much overhead. In particular, we remark that, in running Choose-Hypothesis, we do not a priori compute the value of HP^\widehat{H_{P}} at every point; we do instead a lazy evaluation of HP^\widehat{H_{P}}, as explained in the running-time analysis below.

Now we turn to Part (2) of Theorem 1, the proper learning result. The algorithm for this part of the theorem, Proper-Learn-PBD, is given in Figure 7. The algorithm is essentially the same as Non-Proper-Learn-PBD but with the following modifications, to produce a PBD that is within O(ϵ)O(\epsilon) of the unknown XX: First, we replace Learn-Sparse with a different learning algorithm, Proper-Learn-Sparse, which is based on Lemma 10, and always outputs a PBD. Second, we add a post-processing step to Learn-Poisson that converts the translated Poisson distribution HPH_{P} output by this procedure to a PBD (in fact, to a Binomial distribution). After we describe these new ingredients in detail, we explain and analyze our proper learning algorithm.

The procedure Proper-Learn-SparseX(n,ϵ,δ){}^{X}(n,\epsilon,\delta) is given in Figure 8; we explain the procedure in tandem with a proof of correctness. As in Learn-Sparse, we start by truncating Θ(ϵ)\Theta(\epsilon) of the probability mass from each end of XX to obtain a conditional distribution X[a^,b^]X_{[\hat{a},\hat{b}]}. In particular, we compute a^\hat{a} and b^\hat{b} as described in the beginning of the proof of Lemma 3 (setting ϵ=ϵ\epsilon^{\prime}=\epsilon and δ=δ\delta^{\prime}=\delta). Claim 4 implies that, with probability at least 1δ/21-\delta/2, X(a^),1X(b^)[3ϵ/2,5ϵ/2]X(\leq\hat{a}),1-X(\leq\hat{b})\in[3\epsilon/2,5\epsilon/2]. (Let us denote this event by G{\cal G}.) We distinguish the following cases:

If b^a^>ω=(C/ϵ)3\hat{b}-\hat{a}>\omega=(C/\epsilon)^{3}, where CC is the constant in the statement of Theorem 4, the algorithm outputs “fail,” returning the trivial hypothesis that puts probability mass 11 on the point . Observe that, if b^a^>ω\hat{b}-\hat{a}>\omega and X(a^),1X(b^)[3ϵ/2,5ϵ/2]X(\leq\hat{a}),1-X(\leq\hat{b})\in[3\epsilon/2,5\epsilon/2], then XX cannot be ϵ\epsilon-close to a sparse-form distribution in the cover.

Locate-Binomial(μ^,σ^2,n)(\hat{\mu},\hat{\sigma}^{2},n): This routine takes as input the output (μ^,σ^2)(\hat{\mu},\hat{\sigma}^{2}) of Learn-PoissonX(n,ϵ,δ){}^{X}(n,\epsilon,\delta) and computes a Binomial distribution HBH_{B}, without any additional samples from XX. The guarantee is that, if XX is not ϵ\epsilon-close to any sparse form distribution in the cover SϵS_{\epsilon} of Theorem 4, then, with probability at least 1δ1-\delta (over the randomness in the output of Learn-Poisson), HBH_{B} will be O(ϵ)O(\epsilon)-close to XX.

Locate-Binomial is presented in Figure 9; we proceed to explain the algorithm and establish its correctness. This routine has three steps. The first two eliminate corner-cases in the values of μ^\hat{\mu} and σ^2\hat{\sigma}^{2}, while the last step defines a Binomial distribution HBBin(n^,p^)H_{B}\equiv{\rm Bin}(\hat{n},\hat{p}) with n^n\hat{n}\leq n that is O(ϵ)O(\epsilon)-close to HPTP(μ^,σ^2)H_{P}\equiv TP(\hat{\mu},\hat{\sigma}^{2}) and hence to XX under our working assumptions. (We note that a significant portion of the work below is to ensure that n^n\hat{n}\leq n, which does not seem to follow from a more direct approach. Getting n^n\hat{n}\leq n is necessary in order for our learning algorithm for order-nn PBDs to be truly proper.) Throughout (a), (b) and (c) below we assume that our working assumptions hold. In particular, our assumptions are used every time we employ the bounds (1) and (2) of Section 2.2.

Tweaking σ^2\hat{\sigma}^{2}: If σ^2n4\hat{\sigma}^{2}\leq{n\over 4}, we set σ12=σ^2\sigma_{1}^{2}=\hat{\sigma}^{2}; otherwise, we set σ12=n4\sigma_{1}^{2}={n\over 4}. (As intuition for this tweak, observe that the largest possible variance of a Binomial distribution Bin(n,){\rm Bin}(n,\cdot) is n/4.n/4.) We note for future reference that in both cases (2) gives

where the lower bound follows from (2) and the fact that any PBD satisfies σ2n4\sigma^{2}\leq{n\over 4}.

where we used the fact that σ2=Ω(1/ϵ2)\sigma^{2}=\Omega(1/\epsilon^{2}) from (1).

Observe first that σ12>σ22\sigma_{1}^{2}>\sigma_{2}^{2} and σ220\sigma_{2}^{2}\geq 0, where the last assertion follows from the fact that μ^n\hat{\mu}\leq n by construction.

Next, suppose that X=PBD(p1,,pn)X=PBD(p_{1},\ldots,p_{n}). Then from Cauchy-Schwarz we get that

where the first inequality follows from (2), the second inequality follows from (7) and the fact that any PBD over nn variables satisfies μn,\mu\leq n, and the last one from (1).

where we used that σ2=Ω(1/ϵ2)\sigma^{2}=\Omega(1/\epsilon^{2}) from (1).

Note that, from the way that σ22\sigma_{2}^{2} is set in Step (b) above, we have that n^n\hat{n}\leq n and p^\hat{p}\in, as required for Bin(n^,p^){\rm Bin}(\hat{n},\hat{p}) to be a valid Binomial distribution and a valid output for Part 2 of Theorem 1.

Let us bound the total variation distance between Bin(n^,p^){\rm Bin}(\hat{n},\hat{p}) and TP(μ^,σ22)TP(\hat{\mu},\sigma_{2}^{2}). First, using Lemma 1 we have:

where the second inequality uses (8) (or (5) depending on which case of Step (b) we fell into) and the last one uses the fact that σ2=Ω(1/ϵ2)\sigma^{2}=\Omega(1/\epsilon^{2}) from (1). So plugging this into (10) we get:

The next step is to compare TP(n^p^,n^p^(1p^))TP(\hat{n}\hat{p},\hat{n}\hat{p}(1-\hat{p})) and TP(μ^,σ22)TP(\hat{\mu},\sigma_{2}^{2}). Lemma 2 gives:

Learning weighted sums of independent Bernoullis

Theorem 2. Let X=i=1naiXiX=\sum_{i=1}^{n}a_{i}X_{i} be a weighted sum of unknown independent Bernoulli random variables such that there are at most kk different values in the set {a1,,an}.\{a_{1},\dots,a_{n}\}. Then there is an algorithm with the following properties: given n,n, a1,,ana_{1},\dots,a_{n} and access to independent draws from XX, it uses

samples from the target distribution XX, runs in time

Given a vector a=(a1,,an)\overline{a}=(a_{1},\dots,a_{n}) of weights, we refer to a distribution X=i=1naiXiX=\sum_{i=1}^{n}a_{i}X_{i} (where X1,,XnX_{1},\dots,X_{n} are independent Bernoullis which may have arbitrary means) as an a\overline{a}-weighted sum of Bernoullis, and we write Sa{\cal S}_{\overline{a}} to denote the space of all such distributions.

To prove Theorem 2 we first show that Sa{\cal S}_{\overline{a}} has an ϵ\epsilon-cover that is not too large. We then show that by running a “tournament” between all pairs of distributions in the cover, using the hypothesis testing subroutine from Section 2.3, it is possible to identify a distribution in the cover that is close to the target a\overline{a}-weighted sum of Bernoullis.

Let {bj}j=1k\{b_{j}\}_{j=1}^{k} denote the set of distinct weights in a1,,ana_{1},\dots,a_{n}, and let n_{j}=\big{|}\{i\in[n]\mid a_{i}=b_{j}\}\big{|}. With this notation, we can write X=j=1kbjSj=g(S)X=\mathop{{\textstyle\sum}}_{j=1}^{k}b_{j}S_{j}=g(S), where S=(S1,,Sk)S=(S_{1},\ldots,S_{k}) with each SjS_{j} a sum of njn_{j} many independent Bernoulli random variables and g(y1,,yk)=j=1kbjyjg(y_{1},\ldots,y_{k})=\sum_{j=1}^{k}b_{j}y_{j}. Clearly we have j=1knj=n\mathop{{\textstyle\sum}}_{j=1}^{k}n_{j}=n. By Theorem 4, for each j{1,,k}j\in\{1,\dots,k\} the space of all possible SjS_{j}’s has an explicit (ϵ/k)(\epsilon/k)-cover Sϵ/kj\mathcal{S}^{j}_{\epsilon/k} of size Sϵ/kjnj2+n(k/ϵ)O(log2(k/ϵ))|\mathcal{S}^{j}_{\epsilon/k}|\leq n_{j}^{{2}}+n\cdot(k/\epsilon)^{O(\log^{2}(k/\epsilon))}. By independence across SjS_{j}’s, the product Q=j=1kSϵ/kj\mathcal{Q}=\mathop{{\textstyle\prod}}_{j=1}^{k}\mathcal{S}^{j}_{\epsilon/k} is an ϵ\epsilon-cover for the space of all possible SS’s, and hence the set

is an ϵ\epsilon-cover for Sa.{\cal S}_{\overline{a}}. So Sa{\cal S}_{\overline{a}} has an explicit ϵ\epsilon-cover of size Q=j=1kSϵ/kj(n/k)2k(k/ϵ)kO(log2(k/ϵ))|\mathcal{Q}|=\mathop{{\textstyle\prod}}_{j=1}^{k}|\mathcal{S}^{j}_{\epsilon/k}|\leq(n/k)^{{2}k}\cdot(k/\epsilon)^{k\cdot O(\log^{2}(k/\epsilon))}. ∎

where the sum is over all kk-tuples (m1,,mk)(m_{1},\dots,m_{k}) such that 0mjnj0\leq m_{j}\leq n_{j} for all jj and b1m1++bkmk=wb_{1}m_{1}+\cdots+b_{k}m_{k}=w (as noted above there are at most O((n/k)k)O((n/k)^{k}) such kk-tuples). To complete the proof of Theorem 2 we note that PrH[Sj=mj]\Pr_{H}[S_{j}=m_{j}] can be computed in O(nj2)O(n_{j}^{2}) time by standard dynamic programming. \square

2 Sample complexity lower bound for learning sums of weighted independent Bernoulli random variables

The intuition underlying this lower bound is straightforward: Suppose there are n/100n/100 variables XiX_{i}, chosen uniformly at random, which have pi=100/np_{i}=100/n (call these the “relevant variables”), and the rest of the pip_{i}’s are zero. Given at most cnc\cdot n draws from XX for a small constant cc, with high probability some constant fraction of the relevant XiX_{i}’s will not have been “revealed” as relevant, and from this it is not difficult to show that any hypothesis must have constant error. A detailed argument follows.

Proof of Theorem 3: We define a probability distribution over possible target probability distributions XX as follows: A subset S{n/2+1,,n}S\subset\{n/2+1,\dots,n\} of size S=n/100|S|=n/100 is drawn uniformly at random from all (n/2n/100){n/2\choose n/100} possible outcomes.. The vector p=(p1,,pn)\overline{p}=(p_{1},\dots,p_{n}) is defined as follows: for each iSi\in S the value pip_{i} equals 100/n=1/S,100/n=1/|S|, and for all other ii the value pip_{i} equals 0. The ii-th Bernoulli random variable XiX_{i} has E[Xi]=pi{\bf E}[X_{i}]=p_{i}, and the target distribution is X=Xp=i=1niXi.X=X_{\overline{p}}=\sum_{i=1}^{n}iX_{i}.

Fix any S,pS,\overline{p} as described above. For any j{n/2+1,,n}j\in\{n/2+1,\dots,n\} we have Xp(j)0X_{\overline{p}}(j)\neq 0 if and only if jSj\in S. For any jSj\in S the value Xp(j)X_{\overline{p}}(j) is exactly (100/n)(1100/n)n/1001>35/n(100/n)(1-100/n)^{n/100-1}>35/n (for nn sufficiently large), and hence Xp({n/2+1,,n})>0.35X_{\overline{p}}(\{n/2+1,\dots,n\})>0.35 (again for nn sufficiently large).

The first claim of the lemma holds because any set of c2c\geq 2 numbers from {n/2+1,,n}\{n/2+1,\dots,n\} must sum to more than nn. The second claim holds because the only way a draw xx from XpX_{\overline{p}} can have x=jx=j is if Xj=1X_{j}=1 and all other XiX_{i} are 0 (here we are using limx(11/x)x=1/e\lim_{x\rightarrow\infty}(1-1/x)^{x}=1/e).

The next lemma is an easy consequence of Chernoff bounds:

Fix any p\overline{p} as defined above, and consider a sequence of n/2000n/2000 independent draws from Xp=iiXiX_{\overline{p}}=\sum_{i}iX_{i}. With probability 1eΩ(n)1-e^{-\Omega(n)} the total number of indices j[n]j\in[n] such that XjX_{j} is ever 1 in any of the n/2000n/2000 draws is at most n/1000n/1000.

We are now ready to prove Theorem 3. Let LL be a learning algorithm that receives n/2000n/2000 samples. Let S{n/2+1,,n}S\subset\{n/2+1,\dots,n\} and p\overline{p} be chosen randomly as defined above, and set the target to X=Xp.X=X_{\overline{p}}.

We consider an augmented learner LL^{\prime} that is given “extra information.” For each point in the sample, instead of receiving the value of that draw from XX the learner LL^{\prime} is given the entire vector (X1,,Xn){0,1}n(X_{1},\dots,X_{n})\in\{0,1\}^{n}. Let TT denote the set of elements j{n/2+1,,n}j\in\{n/2+1,\dots,n\} for which the learner is ever given a vector (X1,,Xn)(X_{1},\dots,X_{n}) that has Xj=1.X_{j}=1. By Lemma 16 we have Tn/1000|T|\leq n/1000 with probability at least 1eΩ(n)1-e^{-\Omega(n)}; we condition on the event Tn/1000|T|\leq n/1000 going forth.

Conclusion and open problems

References

Appendix A Extension of the Cover Theorem: Proof of Theorem 4

Theorem 4 is restating the main cover theorem (Theorem 1) of [DP13], except that it claims an additional property, namely what follows the word “finally” in the statement of the theorem. (We will sometimes refer to this property as the last part of Theorem 4 in the following discussion.) Our goal is to show that the cover of [DP13] already satisfies this property without any modifications, thereby establishing Theorem 4. To avoid reproducing the involved constructions of [DP13], we will assume that the reader has some familiarity with them. Still, our proof here will be self-contained.

First, we note that the ϵ\epsilon-cover Sϵ{\cal S}_{\epsilon} of Theorem 1 of [DP13] is a subset of a larger ϵ2{\epsilon\over 2}-cover Sϵ/2{\cal S}^{\prime}_{\epsilon/2} of size n2+n(1/ϵ)O(1/ϵ2)n^{2}+n\cdot(1/\epsilon)^{O(1/\epsilon^{2})}, which includes all the kk-sparse and all the kk-heavy Binomial PBDs (up to permutations of the underlying pip_{i}’s), for some k=O(1/ϵ)k=O(1/\epsilon). Let us call Sϵ/2{\cal S}^{\prime}_{\epsilon/2} the “large ϵ2{\epsilon\over 2}-cover” to distinguish it from Sϵ{\cal S}_{\epsilon}, which we will call the “small ϵ\epsilon-cover.” The reader is referred to Theorem 2 in [DP13] (and the discussion following that theorem) for a description of the large ϵ2\epsilon\over 2-cover, and to Section 3.2 of [DP13] for how this cover is used to construct the small ϵ\epsilon-cover. In particular, the small ϵ\epsilon-cover is a subset of the large ϵ/2{\epsilon/2}-cover, including only a subset of the sparse form distributions in the large ϵ/2{\epsilon/2}-cover. Moreover, for every sparse form distribution in the large ϵ/2\epsilon/2-cover, the small ϵ\epsilon-cover includes at least one sparse form distribution that is ϵ/2\epsilon/2-close in total variation distance. Hence, if the large ϵ/2\epsilon/2-cover satisfies the last part of Theorem 4 (with ϵ/2\epsilon/2 instead of ϵ\epsilon and Sϵ/2{\cal S}_{\epsilon/2}^{\prime} instead of Sϵ{\cal S}_{\epsilon}), it follows that the small ϵ\epsilon-cover also satisfies the last part of Theorem 4.

For {Xi}i\{X_{i}\}_{i}, {Yi}i\{Y_{i}\}_{i} and {Zi}i\{Z_{i}\}_{i} as above, let (μ,σ2)(\mu,\sigma^{2}), (μZ,σZ2)(\mu_{Z},\sigma_{Z}^{2}) and (μY,σY2)(\mu_{Y},\sigma_{Y}^{2}) denote respectively the (mean, variance) pairs of the variables X=iXiX=\sum_{i}X_{i}, Z=iZiZ=\sum_{i}Z_{i} and Y=iYiY=\sum_{i}Y_{i}. We argue first that the pair (μZ,σZ2)(\mu_{Z},\sigma_{Z}^{2}) satisfies μμZ=O(ϵ)|\mu-\mu_{Z}|=O(\epsilon) and σ2σZ2=O(ϵ(1+σ2))|\sigma^{2}-\sigma_{Z}^{2}|=O(\epsilon\cdot(1+\sigma^{2})). Next we argue that, if the collection {Yi}i\{Y_{i}\}_{i} output by the Stage 2 filter is in heavy Binomial form, then (μY,σY2)(\mu_{Y},\sigma_{Y}^{2}) satisfies μμY=O(1)|\mu-\mu_{Y}|={O(1)} and σ2σY2=O(1+ϵ(1+σ2))|\sigma^{2}-\sigma_{Y}^{2}|=O(1+\epsilon\cdot(1+\sigma^{2})), concluding the proof.

Proof for (μZ,σZ2)(\mu_{Z},\sigma_{Z}^{2}): The Stage 1 filter only modifies the indicators XiX_{i} with pi(0,1/k)(11/k,1)p_{i}\in(0,1/k)\cup(1-1/k,1), for some well-chosen k=O(1/ϵ)k=O(1/\epsilon). For convenience let us define Lk={i \vline pi(0,1/k)}{\cal L}_{k}=\{i~{}\vline~{}p_{i}\in(0,1/k)\} and Hk={i \vline pi(11/k,1)}{\cal H}_{k}=\{i~{}\vline~{}p_{i}\in(1-1/k,1)\} as in [DP13]. The filter of Stage 1 rounds the expectations of the indicators indexed by Lk{\cal L}_{k} to some value in {0,1/k}\{0,1/k\} so that no single expectation is altered by more than an additive 1/k1/k, and the sum of these expectations is not modified by more than an additive 1/k1/k. Similarly, the expectations of the indicators indexed by Hk{\cal H}_{k} are rounded to some value in {11/k,1}\{1-1/k,1\}. See the details of how the rounding is performed in Section 4.1 of [DP13]. Let us then denote by {pi}i\{p_{i}^{\prime}\}_{i} the expectations of the indicators {Zi}i\{Z_{i}\}_{i} resulting from the rounding. We argue that the mean and variance of Z=iZiZ=\sum_{i}Z_{i} is close to the mean and variance of XX. Indeed,

We proceed to bound the two terms of the RHS separately. Since the argument is symmetric for Lk{\cal L}_{k} and Hk{\cal H}_{k} we only do Lk{\cal L}_{k}. We have

Using the above (and a symmetric argument for index set Hk{\cal H}_{k}) we obtain:

Proof for (μY,σY2)(\mu_{Y},\sigma_{Y}^{2}): After the Stage 1 filter is applied to the collection {Xi}i\{X_{i}\}_{i}, the resulting collection of random variables {Zi}i\{Z_{i}\}_{i} has expectations pi{0,1}[1/k,11/k]p^{\prime}_{i}\in\{0,1\}\cup[1/k,1-1/k], for all ii. The Stage 2 filter has different form depending on the cardinality of the set M={i  pi[1/k,11/k]}{\cal M}=\{i~{}|~{}p_{i}^{\prime}\in[1/k,1-1/k]\}. In particular, if M>k3|{\cal M}|>k^{3} the output of the Stage 2 filter is in heavy Binomial form, while if Mk3|{\cal M}|\leq k^{3} the output of the Stage 2 filter is in sparse form. As we are only looking to provide guarantee for the distributions in heavy Binomial form, it suffices to only consider the former case next.

M>k3|{\cal M}|>k^{3}: Let {Yi}i\{Y_{i}\}_{i} be the collection produced by Stage 2 and let Y=iYiY=\sum_{i}Y_{i}. Then Lemma 4 of [DP13] implies that

Appendix B Birgé’s theorem: Learning unimodal distributions

Here we briefly explain how Theorem 5 follows from [Bir97]. We assume that the reader is moderately familiar with the paper [Bir97].

Birgé (see his Theorem 1 and Corollary 1) upper bounds the expected variation distance between the target distribution (which he denotes ff) and the hypothesis distribution that is constructed by his algorithm (which he denotes f^n\hat{f}_{n}; it should be noted, though, that his “nn” parameter denotes the number of samples used by the algorithm, while we will denote this by “mm”, reserving “nn” for the domain {1,,n}\{1,\dots,n\} of the distribution). More precisely, [Bir97] shows that this expected variation distance is at most that of the Grenander estimator (applied to learn a unimodal distribution when the mode is known) plus a lower-order term. For our Theorem 5 we take Birgé’s “η\eta” parameter to be ϵ\epsilon. With this choice of η,\eta, by the results of [Bir87a, Bir87b] bounding the expected error of the Grenander estimator, if m=O(log(n)/ϵ3)m=O(\log(n)/\epsilon^{3}) samples are used in Birgé’s algorithm then the expected variation distance between the target distribution and his hypothesis distribution is at most O(ϵ).O(\epsilon). To go from expected error O(ϵ){O(\epsilon)} to an O(ϵ){O(\epsilon)}-accurate hypothesis with probability at least 1δ1-\delta, we run the above-described algorithm O(log(1/δ))O(\log(1/\delta)) times so that with probability at least 1δ1-\delta some hypothesis obtained is O(ϵ){O(\epsilon)}-accurate. Then we use our hypothesis testing procedure of Lemma 8, or, more precisely, the extension provided in Lemma 10, to identify an O(ϵ)O(\epsilon)-accurate hypothesis from within this pool of O(log(1/δ))O(\log(1/\delta)) hypotheses. (The use of Lemma 10 is why the running time of Theorem 5 depends quadratically on log(1/δ)\log(1/\delta) and why the sample complexity contains the second 1ϵ2log1δloglog1δ{\frac{1}{\epsilon^{2}}}\log{\frac{1}{\delta}}\log\log{\frac{1}{\delta}} term.)

Appendix C Efficient Evaluation of the Poisson Distribution

In this section we provide an efficient algorithm to compute an additive approximation to the Poisson probability mass function. It seems that this should be a basic operation in numerical analysis, but we were not able to find it explicitly in the literature. Our main result for this section is the following.

There is an algorithm that, on input a rational number λ>0\lambda>0, and integers k0k\geq 0 and t>0t>0, produces an estimate pk^\widehat{p_{k}} such that

Clearly we cannot just compute eλe^{-\lambda}, λk\lambda^{k} and k!k! separately, as this will take time exponential in the description complexity of kk and λ\lambda. We follow instead an indirect approach. We start by rewriting the target probability as follows

Note that Ek0E_{k}\leq 0. Our goal is to approximate EkE_{k} to within high enough accuracy and then use this approximation to approximate pkp_{k}.

In particular, the main part of the argument involves an efficient algorithm to compute an approximation Ek^^\widehat{\widehat{E_{k}}} to EkE_{k} satisfying

We show that if we had such an approximation, then we would be able to complete the proof. For this, we claim that it suffices to approximate eEk^^e^{\widehat{\widehat{{E_{k}}}}} to within an additive error 12t{1\over 2t}. Indeed, if pk^\widehat{p_{k}} were the result of this approximation, then we would have:

To approximate eEk^^e^{\widehat{\widehat{{E_{k}}}}} given Ek^^{\widehat{\widehat{{E_{k}}}}}, we need the following lemma:

Let α0\alpha\leq 0 be a rational number. There is an algorithm that computes an estimate eα^\widehat{e^{\alpha}} such that

Since eαe^{\alpha}\in, the point of the additive grid {i4t}i=14t\{{i\over 4t}\}_{i=1}^{4t} closest to eαe^{\alpha} achieves error at most 1/(4t)1/(4t). Equivalently, in a logarithmic scale, consider the grid {lni4t}i=14t\{\ln{i\over 4t}\}_{i=1}^{4t} and let j^{\ast}:=\arg\min_{j}\left\{\Big{|}\alpha-{\ln({j\over 4t})}\Big{|}\right\}. Then, we have that

The idea of the algorithm is to approximately identify the point jj^{\ast}, by computing approximations to the points of the logarithmic grid combined with a binary search procedure. Indeed, consider the “rounded” grid {lni4t^}i=14t\{\widehat{\ln{i\over 4t}}\}_{i=1}^{4t} where each ln(i4t)^\widehat{\ln({i\over 4t})} is an approximation to ln(i4t)\ln({i\over 4t}) that is accurate to within an additive 116t{1\over 16t}. Notice that, for i=1,,4ti=1,\ldots,4t:

Given that our approximations are accurate to within an additive 1/16t1/16t, it follows that the rounded grid {lni4t^}i=14t\{\widehat{\ln{i\over 4t}}\}_{i=1}^{4t} is monotonic in ii.

The algorithm outputs i4ti^{\ast}\over 4t as its final approximation to eαe^{\alpha}. We argue next that the achieved error is at most an additive 12t1\over 2t. Since the distance between two consecutive points of the grid {lni4t}i=14t\{\ln{i\over 4t}\}_{i=1}^{4t} is more than 1/(8t)1/(8t) and our approximations are accurate to within an additive 1/16t1/16t, a little thought reveals that i{j1,j,j+1}i^{\ast}\in\{j^{\ast}-1,j^{\ast},j^{\ast}+1\}. This implies that i4ti^{\ast}\over 4t is within an additive 1/2t1/2t of eαe^{\alpha} as desired, and the proof of the lemma is complete. ∎

Given Lemma 17, we describe how we could approximate eEk^^e^{\widehat{\widehat{{E_{k}}}}} given Ek^^{\widehat{\widehat{{E_{k}}}}}. Recall that we want to output an estimate pk^\widehat{p_{k}} such that pk^eEk^^1/(2t)|\widehat{p_{k}}-e^{\widehat{\widehat{{E_{k}}}}}|\leq 1/(2t). We distinguish the following cases:

If Ek^^0\widehat{\widehat{{E_{k}}}}\geq 0, we output pk^:=1\widehat{p_{k}}:=1. Indeed, given that \Big{|}\widehat{\widehat{{E}_{k}}}-E_{k}\Big{|}\leq{1\over 4t} and Ek0E_{k}\leq 0, if Ek^^0\widehat{\widehat{{E_{k}}}}\geq 0 then Ek^^[0,14t]\widehat{\widehat{{E_{k}}}}\in[0,{1\over 4t}]. Hence, because t1t\geq 1, eEk^^[1,1+1/2t]e^{\widehat{\widehat{{E_{k}}}}}\in[1,1+1/2t], so 11 is within an additive 1/2t1/2t of the right answer.

In view of the above, we only need to show how to compute Ek^^\widehat{\widehat{E_{k}}}. There are several steps to our approximation:

(Stirling’s Asymptotic Approximation): Recall Stirling’s asymptotic approximation (see e.g., [Whi80] p.193), which says that lnk!\ln k! equals

where BkB_{k} are the Bernoulli numbers. We define an approximation of lnk!\ln{k!} as follows:

for m0:=O(tk+1).m_{0}:=O\left(\left\lceil{{\langle{t}\rangle}\over{\langle{k}\rangle}}\right\rceil+1\right).

(Definition of an approximate exponent Ek^\widehat{E_{k}}): Define Ek^:=λ+kln(λ)ln(k!)^\widehat{E_{k}}:=-\lambda+k\ln(\lambda)-\widehat{\ln(k!)}. Given the above discussion, we can calculate the distance of Ek^\widehat{E_{k}} to the true exponent EkE_{k} as follows:

So we can focus our attention to approximating Ek^\widehat{E_{k}}. Note that Ek^\widehat{E_{k}} is the sum of m0+2=O(logtlogk)m_{0}+2=O({\log t\over\log k}) terms. To approximate it within error 1/(10t)1/(10t), it suffices to approximate each summand within an additive error of O(1/(tlogt))O(1/(t\cdot\log t)). Indeed, we so approximate each summand and our final approximation Ek^^\widehat{\widehat{E_{k}}} will be the sum of these approximations. We proceed with the analysis:

Note that, with this approximation, we have

(Floating-Point Representation): We will also need accurate approximations to ln2π^\ln{\widehat{2\pi}}, lnk\ln k and lnλ\ln\lambda. We think of 2π^\widehat{2\pi} and kk as multiple-precision floating point numbers base 22. In particular,

k2logkk2logkk\equiv 2^{\lceil\log k\rceil}\cdot{k\over 2^{\lceil\log k\rceil}} can be described with a binary fraction of logk\lceil\log k\rceil, i.e., k{\langle{k}\rangle}, bits and an exponent of length O(loglogk)O(\log\log k), i.e., O(logk)O(\log{\langle{k}\rangle}).

Also, since λ\lambda is a positive rational number, λ=λ1λ2\lambda={\lambda_{1}\over\lambda_{2}}, where λ1\lambda_{1} and λ2\lambda_{2} are positive integers of at most λ{\langle{\lambda}\rangle} bits. Hence, for i=1,2i=1,2, we can think of λi\lambda_{i} as a multiple-precision floating point number base 22 with a binary fraction of λ{\langle{\lambda}\rangle} bits and an exponent of length O(logλ)O(\log{\langle{\lambda}\rangle}). Hence, if we choose L=log2(12(3k+1)t2kλ1λ2)=O(k+λ+t)L=\lceil\log_{2}(12(3k+1)t^{2}\cdot k\cdot\lambda_{1}\cdot\lambda_{2})\rceil=O({\langle{k}\rangle}+{\langle{\lambda}\rangle}+{\langle{t}\rangle}), we can represent all numbers 2π^,λ1,λ2,k\widehat{2\pi},\lambda_{1},\lambda_{2},k as multiple precision floating point numbers with a binary fraction of LL bits and an exponent of O(logL)O(\log L) bits.

lnk^lnk2Llnk112(3k+1)t2|\widehat{\ln k}-{\ln k}|\leq 2^{-L}{\ln k}\leq{1\over 12(3k+1)t^{2}}; and similarly

lnλi^lnλi112(3k+1)t2|\widehat{\ln\lambda_{i}}-{\ln\lambda_{i}}|\leq{1\over 12(3k+1)t^{2}}, for i=1,2i=1,2;

ln2π^^ln2π^112(3k+1)t2.|\widehat{\ln\widehat{2\pi}}-{\ln\widehat{2\pi}}|\leq{1\over 12(3k+1)t^{2}}.

(Estimating the terms of the series): To complete the analysis, we also need to approximate each term of the form cj=Bjj(j1)kj1c_{j}=\frac{B_{j}}{j(j-1)\cdot k^{j-1}} up to an additive error of O(1/(tlogt))O(1/(t\cdot\log t)). We do this as follows: We compute the numbers BjB_{j} and kj1k^{j-1} exactly, and we perform the division approximately.

Let Ek^^\widehat{\widehat{E_{k}}} be the approximation arising if we use all the aforementioned approximations. It follows from the above computations that

(Overall Error): Combining the above computations we get: