Testing $k$-Modal Distributions: Optimal Algorithms via Reductions

Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio, Gregory Valiant, Paul Valiant

Introduction

While it is surprising that these tasks can be performed using a sublinear number of samples, for many real-world applications using n\sqrt{n}, n2/3n^{2/3}, or nlogn\frac{n}{\log n} samples is still impractical. As these bounds characterize worst-case instances, one might hope that drastically better performance may be possible for many settings typically encountered in practice. Thus, a natural research direction, which we pursue in this paper, is to understand how structural properties of the distributions in question may be leveraged to yield improved sample complexities.

In this work we consider monotone, unimodal, and more generally kk-modal distributions. Monotone, unimodal, and bimodal distributions abound in the natural world. The distribution of many measurements—heights or weights of members of a population, concentrations of various chemicals in cells, parameters of many atmospheric phenomena–often belong to this class of distributions. Because of their ubiquity, much work in the natural sciences rests on the analysis of such distributions (for example, on November 1, 2011 a Google Scholar search for the exact phrase “bimodal distribution” in the bodies of papers returned more than 90,000 hits). Though perhaps not as pervasive, kk-modal distributions for larger values of kk commonly arise as mixtures of unimodal distributions and are natural objects of study. On the theoretical side, motivated by the many applications, monotone, unimodal, and kk-modal distributions have been intensively studied in the probability and statistics literatures for decades, see e.g. [Gre56, Rao69, BBBB72, CKC83, Gro85, Bir87a, Bir87b, Kem91, Fou97, CT04, JW09].

Our main results are algorithms, and nearly matching lower bounds, that give a complete picture of the sample complexities of identity testing and estimating L1L_{1} distance for monotone and kk-modal distributions. We obtain such results in both the setting where the two distributions are given via samples, and the setting where one of the distributions is given via samples and the other is described explicitly.

We view the equivalence between the sample complexity of each of the above tasks on a monotone or unimodal distribution of domain [n][n] and the sample complexity of the same task on an unrestricted distribution of domain [logn][\log n] as a surprising result, because such an equivalence fails to hold for related estimation tasks. For example, consider the task of distinguishing whether a distribution on [n][n] is uniform versus far from uniform. For general distributions this takes Θ(n)\Theta(\sqrt{n}) samples, so one might expect the corresponding problem for monotone distributions to need logn\sqrt{\log n} samples; in fact, however, one can test this with a constant number of samples, by simply comparing the empirically observed probability masses of the left and right halves of the domain. An example in the other direction is the problem of finding a constant additive estimate for the entropy of a distribution. On domains of size [n][n] this can be done in nlogn\frac{n}{\log n} samples, and thus one might expect to be able to estimate entropy for monotone distributions on [n][n] using lognloglogn\frac{\log n}{\log\log n} samples. Nevertheless, it is not hard to see that Ω(log2n)\Omega(\log^{2}n) samples are required.

The reduction-like techniques which we use to establish both our algorithmic results and our lower bounds (discussed in more detail in Section 1.2 below) reveal an unexpected relationship between the class of kk-modal distributions of support [n][n] and the class of general distributions of support [klogn][k\log n]. We hope that this reduction-based approach may provide a framework for the discovery of other relationships that will be useful in future work in the extreme sublinear regime of statistical property estimation and property testing.

Comparison with prior work. Our results significantly extend and improve upon the previous algorithmic results of Batu et al [BKR04] for identity testing of monotone or unimodal (k=1k=1) distributions, which required O(log3n)O(\log^{3}n) samples. More recently, [DDS11] established the sample complexity of learning kk-modal distributions to be essentially Θ(klog(n)ϵ3)\Theta(k\log(n)\epsilon^{-3}). Such a learning algorithm easily yields a testing algorithm with the same sample complexity for all four variants of the testing problem (one can simply run the learner twice to obtain hypotheses p^{\widehat{p}} and q^{\widehat{q}} that are sufficiently close to pp and qq respectively, and output accordingly).

While the [DDS11] result can be applied to our testing problems (though giving suboptimal results), we stress that the ideas underlying [DDS11] and this paper are quite different. The [DDS11] paper learns a kk-modal distribution by using a known algorithm for learning monotone distributions [Bir87b] kk times in a black-box manner; the notion of reducing the domain size—which we view as central to the results and contributions of this paper—is nowhere present in [DDS11]. By contrast, the focus in this paper is on introducing the use of reductions as a powerful (but surprisingly, seemingly previously unused) tool in the development of algorithms for basic statistical tasks on distributions, which, at least in this case, is capable of giving essentially optimal upper and lower bounds for natural restricted classes of distributions.

2 Techniques.

Our main conceptual contribution is a new reduction-based approach that lets us obtain all our upper and lower bounds in a clean and unified way. The approach works by reducing the monotone and kk-modal distribution testing problems to general distribution testing and estimation problems over a much smaller domain, and vice versa. For the monotone case this smaller domain is essentially of size log(n)/ϵ\log(n)/\epsilon, and for the kk-modal case the smaller domain is essentially of size klog(n)/ϵ2.k\log(n)/\epsilon^{2}. By solving the general distribution problems over the smaller domain using known results we get a valid answer for the original (monotone or kk-modal) problems over domain [n][n]. More details on our algorithmic reduction are given in Section A.

The inspiration for our results is an observation of Birgé [Bir87b] that given a monotone-decreasing probability distribution over [n][n], if one subdivides [n][n] into an exponentially increasing series of consecutive sub-intervals, the iith having size (1+ϵ)i(1+\epsilon)^{i}, then if one replaces the probability mass on each interval with a uniform distribution on that interval, the distribution changes by only O(ϵ)O(\epsilon) in total variation distance. Further, given such a subdivision of the support into log1+ϵ(n)\log_{1+\epsilon}(n) intervals, one may essentially treat the original monotone distribution as essentially a distribution over these intervals, namely a distribution of support log1+ϵ(n)\log_{1+\epsilon}(n). In this way, one may hope to reduce monotone distribution testing or estimation on [n][n] to general distribution testing or estimation on a domain of size log1+ϵ(n)\log_{1+\epsilon}(n), and vice versa. See Section B for details.

Notation and Preliminaries

We write [n][n] to denote the set {1,,n}\{1,\ldots,n\}, and for integers iji\leq j we write [i,j][i,j] to denote the set {i,i+1,,j}\{i,i+1,\ldots,j\}. We consider discrete probability distributions over [n][n], which are functions p:[n]p:[n]\to such that i=1np(i)=1\mathop{{\textstyle\sum}}_{i=1}^{n}p(i)=1. For S[n]S\subseteq[n] we write p(S)p(S) to denote iSp(i)\mathop{{\textstyle\sum}}_{i\in S}p(i). We use the notation PP for the cumulative distribution function (cdf) corresponding to pp, i.e. P:[n]P:[n]\to is defined by P(j)=i=1jp(i)P(j)=\mathop{{\textstyle\sum}}_{i=1}^{j}p(i).

A distribution pp over [n][n] is non-increasing (resp. non-decreasing) if p(i+1)p(i)p(i+1)\leq p(i) (resp. p(i+1)p(i)p(i+1)\geq p(i)), for all i[n1]i\in[n-1]; pp is monotone if it is either non-increasing or non-decreasing. Thus the “orientation” of a monotone distribution is either non-decreasing (denoted \uparrow) or non-increasing (denoted \downarrow).

We call a nonempty interval I=[a,b][2,n1]I=[a,b]\subseteq[2,n-1] a max-interval of pp if p(i)=cp(i)=c for all iIi\in I and max{p(a1),p(b+1)}<c\max\{p(a-1),p(b+1)\}<c. Analogously, a min-interval of pp is an interval I=[a,b][2,n1]I=[a,b]\subseteq[2,n-1] with p(i)=cp(i)=c for all iIi\in I and min{p(a1),p(b+1)}>c\min\{p(a-1),p(b+1)\}>c. We say that pp is kk-modal if it has at most kk max-intervals and min-intervals. We note that according to our definition, what is usually referred to as a bimodal distribution is a 33-modal distribution.

Finally, a sub-distribution is a function q:[n]q:[n]\to which satisfies i=1nq(i)1.\mathop{{\textstyle\sum}}_{i=1}^{n}q(i)\leq 1. For pp a distribution over [n][n] and I[n]I\subseteq[n], the restriction of pp to II is the sub-distribution pIp^{I} defined by pI(i)=p(i)p^{I}(i)=p(i) if iIi\in I and pI(i)=0p^{I}(i)=0 otherwise. Likewise, we denote by pIp_{I} the conditional distribution of pp on II, i.e. pI(i)=p(i)/p(I)p_{I}(i)=p(i)/p(I) if iIi\in I and pI(i)=0p_{I}(i)=0 otherwise.

2 Basic tools from probability.

We will require the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ([DKW56]) from probability theory. This basic fact says that O(1/ϵ2)O(1/\epsilon^{2}) samples suffice to learn any distribution within error ϵ\epsilon with respect to the Kolmogorov distance. More precisely, let pp be any distribution over [n].[n]. Given mm independent samples s1,,sms_{1},\dots,s_{m} drawn from p:[n],p:[n]\to, the empirical distribution p^m:[n]{\widehat{p}}_{m}:[n]\to is defined as follows: for all i[n]i\in[n], p^m(i)={j[m]sj=i}/m{\widehat{p}}_{m}(i)=|\{j\in[m]\mid s_{j}=i\}|/m. The DKW inequality states that for m=Ω((1/ϵ2)ln(1/δ))m=\Omega((1/\epsilon^{2})\cdot\ln(1/\delta)), with probability 1δ1-\delta the empirical distribution p^m{\widehat{p}}_{m} will be ϵ\epsilon-close to pp in Kolmogorov distance. This sample bound is asymptotically optimal and independent of the support size.

Another simple result that we will need is the following, which is easily verified from first principles:

Let I=[a,b]I=[a,b] be an interval and let uIu_{I} denote the uniform distribution over I.I. Let pIp_{I} denote a non-increasing distribution over II. Then for every initial interval I=[a,b]I^{\prime}=[a,b^{\prime}] of II, we have uI(I)pI(I).u_{I}(I^{\prime})\leq p_{I}(I^{\prime}).

3 Testing and estimation for arbitrary distribution

If pqp\equiv q then with probability at least 1δ1-\delta the algorithm outputs “accept;” and

If pqp\equiv q then with probability at least 1δ1-\delta the algorithm outputs “accept;” and

Testing and Estimating Monotone Distributions

Our main tool for testing monotone distributions is an oblivious decomposition of monotone distributions that is a variant of a construction of Birgé [Bir87b]. As we will see it enables us to reduce the problem of testing a monotone distribution to the problem of testing an arbitrary distribution over a much smaller domain.

The following simple lemma, proved in Section A, shows why reduced distributions are useful for us:

We now state our oblivious decomposition result for monotone distributions:

There is an analogous version of Theorem 5, asserting the existence of an “oblivious” partition for non-decreasing distributions (which is of course different from the “oblivious” partition I{\cal I} for non-increasing distributions of Theorem 5); this will be useful later.

While our construction is essentially that of Birgé, we note that the version given in [Bir87b] is for non-increasing distributions over the continuous domain [0,n][0,n], and it is phrased rather differently. Adapting the arguments of [Bir87b] to our discrete setting of distributions over [n][n] is not conceptually difficult but requires some care. For the sake of being self-contained we provide a self-contained proof of the discrete version, stated above, that we require in Appendix E.

2 Efficiently testing monotone distributions

From Monotone to k𝑘k-modal

In this section we establish our main positive testing results for kk-modal distributions, the upper bounds stated in the final four rows of Table 1. In the previous section, we were able to use the oblivious decomposition to yield a partition of [n][n] into relatively few intervals, with the guarantee that the corresponding flattened distribution is close to the true distribution. The main challenge in extending these results to unimodal or kk-modal distributions, is that in order to make the analogous decomposition, one must first determine–by taking samples from the distribution–which regions are monotonically increasing vs decreasing. Our algorithm Construct-Flat-Decomposition(p,ϵ,δ)(p,\epsilon,\delta) performs this task with the following guarantee:

The following terminology will be useful: Let I={Ii}i=1r{\cal I}=\{I_{i}\}_{i=1}^{r} and I={Ii}i=1s{\cal I}^{\prime}=\{I^{\prime}_{i}\}_{i=1}^{s} be two partitions of [n][n] into rr and ss intervals respectively. The common refinement of I{\cal I} and I{\cal I}^{\prime} is the partition J{\cal J} of [n][n] into intervals obtained from I{\cal I} and I{\cal I}^{\prime} in the obvious way, by taking all possible nonempty intervals of the form IiIj.I_{i}\cap I^{\prime}_{j}. It is clear that J{\cal J} is both a refinement of I{\cal I} and of I{\cal I}^{\prime} and that the number of intervals J|{\cal J}| in J{\cal J} is at most r+s.r+s. We prove the following lemma in Section A:

Let pp be any distribution over [n][n], let I={Ii}i=1a{\cal I}=\{I_{i}\}_{i=1}^{a} be a (p,ϵ,a)(p,\epsilon,a)-flat decomposition of [n][n], and let J={Ji}i=1b{\cal J}=\{J_{i}\}_{i=1}^{b} be a refinement of I.{\cal I}. Then J{\cal J} is a (p,2ϵ,b)(p,2\epsilon,b)-flat decomposition of [n][n].

We describe the Test-Identity-Known-kmodal algorithm below.

The modifications required to obtain algorithms Test-Identity-Unknown-kmodal, L1L_{1}-Estimate-Known-kmodal and L1L_{1}-Estimate-Unknown-kmodal, and the analysis of these algorithms, are completely analogous to the modifications and analyses of Section 3.2 and are omitted.

We present Construct-Flat-Decomposition(p,ϵ,δ)(p,\epsilon,\delta) followed by an intuitive explanation. Note that it employs a procedure Orientation(p^,I)({\widehat{p}},I), which uses no samples and is presented and analyzed in Section 4.2.

Roughly speaking, when Construct-Flat-Decomposition constructs a partition I{\cal I}, it initially breaks [n][n] up into two types of intervals. The first type are intervals that are “okay” to include in a flat decomposition, either because they have very little mass, or because they consist of a single point, or because they are close to uniform. The second type are intervals that are “not okay” to include in a flat decomposition – they have significant mass and are far from uniform – but the algorithm is able to ensure that almost all of these are monotone distributions with a known orientation. It then uses the oblivious decomposition of Theorem 5 to construct a flat decomposition of each such interval. (Note that it is crucial that the orientation is known in order to be able to use Theorem 5.)

In more detail, Construct-Flat-Decomposition(p,ϵ,δ)(p,\epsilon,\delta) works as follows. The algorithm first draws a batch of samples from pp and uses them to construct an estimate p^{\widehat{p}} of the CDF of pp (this is straightforward using the DKW inequality). Using p^{\widehat{p}} the algorithm partitions [n][n] into a collection of O(k/ϵ)O(k/\epsilon) disjoint intervals in the following way:

A small collection of the intervals are “negligible”; they collectively have total mass less than ϵ\epsilon under pp. Each negligible interval II will be an element of the partition I.{\cal I}.

Some of the intervals are “heavy points”; these are intervals consisting of a single point that has mass Ω(ϵ/k)\Omega(\epsilon/k) under pp. Each heavy point II will also be an element of the partition I.{\cal I}.

The remaining intervals are “moderate” intervals, each of which has mass Θ(ϵ/k)\Theta(\epsilon/k) under pp.

2 The Orientation algorithm.

The Orientation algorithm takes as input an explicit distribution of a distribution p^{\widehat{p}} over [n][n] and an interval I[n].I\subseteq[n]. Intuitively, it assumes that p^I{\widehat{p}}_{I} is close (in Kolmogorov distance) to a monotone distribution pIp_{I}, and its goal is to determine the orientation of pIp_{I}: it outputs either \uparrow, \downarrow or \bot (the last of which means “close to uniform”). The algorithm is quite simple; it checks whether there exists an initial interval II^{\prime} of II on which p^I{\widehat{p}}_{I}’s weight is significantly different from uI(I)u_{I}(I^{\prime}) (the weight that the uniform distribution over II assigns to II^{\prime}) and bases its output on this in the obvious way. A precise description of the algorithm (which uses no samples) is given below.

Orientation Inputs: explicit description of distribution p^{\widehat{p}} over [n][n]; interval I=[a,b][n]I=[a,b]\subseteq[n] 1. If I=1|I|=1 (i.e. I={a}I=\{a\} for some a[n]a\in[n]) then return “\bot”, otherwise continue. 2. If there is an initial interval I=[a,j]I^{\prime}=[a,j] of II that satisfies uI(I)(p^)I(I)>ϵ7u_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})>{\frac{\epsilon}{7}} then halt and output “\uparrow”. Otherwise, 3. If there is an initial interval I=[a,j]I^{\prime}=[a,j] of II that satisfies uI(I)(p^)I(I)<ϵ7u_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})<-{\frac{\epsilon}{7}} then halt and output “\downarrow”. Otherwise, 4. Output “\bot”.

We proceed to analyze Orientation. We show that if pIp_{I} is far from uniform then Orientation outputs the correct orientation for it. We also show that whenever Orientation does not output “\bot”, whatever it outputs is the correct orientation of pIp_{I}. The proof is given in Section C.3.

Let pp be a distribution over [n][n] and let interval I=[a,b][n]I=[a,b]\subseteq[n] be such that pIp_{I} is monotone. Suppose p(I)99ϵ/(10000k)p(I)\geq 99\epsilon/(10000k), and suppose that for every interval III^{\prime}\subseteq I we have that p^(I)p(I)ϵ210000k.|{\widehat{p}}(I^{\prime})-p(I^{\prime})|\leq{\frac{\epsilon^{2}}{10000k}}. Then

If pIp_{I} is non-decreasing and pIp_{I} is ϵ/6\epsilon/6-far from the uniform distribution uIu_{I} over II, then Orientation(p^,I)({\widehat{p}},I) outputs “\uparrow”;

if Orientation(p^,I)({\widehat{p}},I) outputs “\uparrow” then pIp_{I} is non-decreasing;

if pIp_{I} is non-increasing and pIp_{I} is ϵ/6\epsilon/6-far from the uniform distribution uIu_{I} over II, then Orientation(p^,I)({\widehat{p}},I) outputs “\downarrow”;

if Orientation(p^,I)({\widehat{p}},I) outputs “\downarrow” then pIp_{I} is non-increasing.

Lower Bounds

Our algorithmic results follow from a reduction which shows how one can reduce the problem of testing properties of monotone or kk-modal distributions to the task of testing properties of general distributions over a much smaller support. Our approach to proving lower bounds is complementary; we give a canonical scheme for transforming “lower bound instances” of general distributions to related lower bound instances of monotone distributions with much larger supports.

Definitions 2 and 3, given below, define a two-stage transformation of a generic distribution into a related kk-modal distribution over a much larger support. This transformation preserves total variation distance: for any pair of distributions, the variation distance between their transformations is identical to the variation distance between the original distributions. Additionally, we ensure that given access to ss independent samples from an original input distribution, one can simulate drawing ss samples from the related kk-modal distribution yielded by the transformation. Given any lower–bound construction DD for general distributions, the above transformation will yield a lower–bound instance DkD_{k} for (k1)(k-1)-modal distributions (so monotone distributions correspond to k=1k=1) defined by selecting a pair of distributions (p,p)D,(p,p^{\prime})\leftarrow D, then outputting the pair of transformed distributions. This transformed ensemble of distributions is a lower–bound instance, for if some algorithm could successfully test pairs of (k1)(k-1)-modal distributions from Dk,D_{k}, then that algorithm could be used to test pairs from DD, by simulating samples drawn from the transformed versions of the distributions. The following proposition, proved in Section D, summarizes the above discussion:

Before proving this proposition, we state various corollaries which result from applying the Proposition to known lower-bound constructions for general distributions. The first is for the “testing identity, qq is unknown” problem:

This Corollary gives the lower bounds stated in lines 2 and 6 of Table 1. It follows from applying Proposition 6 to a (trivially modified) version of the lower bound construction given in [BFR+00, Val08b], summarized by the following theorem:

Our second corollary is for L1L_{1} estimation, in the case that one of the distributions is explicitly given. This trivially also yields an equivalent lower bound for the setting in which both distributions are given via samples.

This Corollary gives the lower bounds claimed in lines 3, 4, 7 and 8 of Table 1. It follows from applying Proposition 6 to the lower bound construction given in [VV11a], summarized by the following theorem:

Note that the above theorem can be expressed in the language of Proposition 6 by defining the distribution DD^{\prime} over pairs of distributions which chooses a distribution according to DD for the first distribution of each pair, and always selects unu_{n} for the second distribution of each pair.

Our third corollary, which gives the lower bounds claimed in lines 1 and 5 of Table 1, is for the “testing identity, qq is known” problem:

The above corollary follows from applying Proposition 6 to the following trivially verified lower bound construction:

As noted previously (after Theorem 7), this fact can also be expressed in the language of Proposition 6.

Conclusions

We have introduced a simple new approach for tackling distribution testing problems for restricted classes of distributions, by reducing them to general-distribution testing problems over a smaller domain. We applied this approach to get new testing results for a range of distribution testing problems involving monotone and kk-modal distributions, and established lower bounds showing that all our new algorithms are essentially optimal.

A general direction for future work is to apply our reduction method to obtain near-optimal testing algorithms for other interesting classes of distributions. This will involve constructing flat decompositions of various types of distributions using few samples, which seems to be a natural and interesting algorithmic problem. A specific goal is to develop a more efficient version of our Construct-Flat-Decomposition algorithm for kk-modal distributions; is it possible to obtain an improved version of this algorithm that uses o(k)o(k) samples?

References

Appendix A Shrinking the domain size: Reductions for distribution-testing problems

In this section we present the general framework of our reduction-based approach and sketch how we instantiate this approach for monotone and kk-modal distributions.

The following useful lemma relates closeness of pp and qq to closeness of the reduced distributions:

We close this section with a result about partitions and flat decompositions which will be useful later. Let I={Ii}i=1a,{\cal I}=\{I_{i}\}_{i=1}^{a}, I={Ij}j=1b{\cal I}^{\prime}=\{I^{\prime}_{j}\}_{j=1}^{b} be two partitions of [n][n]. We say that I{\cal I}^{\prime} is a refinement of I{\cal I} if for every i[a]i\in[a] there is a subset SiS_{i} of [b][b] such that jSiIj=Ii\cup_{j\in S_{i}}I^{\prime}_{j}=I_{i} (note that for this to hold we must have aba\leq b). Note that {Si}i=1a\{S_{i}\}_{i=1}^{a} forms a partition of [b][b]. We prove the following useful lemma:

Lemma 4. Let pp be any distribution over [n][n], let I={Ii}i=1a{\cal I}=\{I_{i}\}_{i=1}^{a} be a (p,ϵ,a)(p,\epsilon,a)-flat decomposition of [n][n], and let J={Ji}i=1b{\cal J}=\{J_{i}\}_{i=1}^{b} be a refinement of I.{\cal I}. Then J{\cal J} is a (p,2ϵ,b)(p,2\epsilon,b)-flat decomposition of [n][n].

We now proceed to establish (1). Let T[n]T\subseteq[n] and consider a partition of TT into kk nonempty sets TiT_{i}, i[k]i\in[k]. Denote μ=defp(T)/T\mu\stackrel{{\scriptstyle\textrm{def}}}{{=}}p(T)/|T| and μi=defp(Ti)/Ti\mu_{i}\stackrel{{\scriptstyle\textrm{def}}}{{=}}p(T_{i})/|T_{i}|. Then, (1) can be re-expressed as follows

We shall prove the above statement for all sequences of numbers p(1),,p(n)p(1),\dots,p(n). Since adding or subtracting the same quantity from each number p(t)p(t) does not change the validity of (2), for the sake of convenience we may assume all the numbers average to , that is, μ=0\mu=0. Consider the ii-th term on the right hand side, tTip(t)μi\mathop{{\textstyle\sum}}_{t\in T_{i}}|p(t)-\mu_{i}|. We can bound this quantity from above as follows:

where the inequality follows from the triangle inequality (applied term by term), the first equality is by the definition of μi\mu_{i}, the second equality is trivial, and the final equality uses the assumption that μ=0\mu=0. The lemma follows by summing over i[k]i\in[k], using the fact that the TiT_{i}’s form a partition of TT. ∎

Appendix B Efficiently Testing Monotone Distributions

Our main tool for testing monotone distributions is an oblivious decomposition of monotone distributions that is a variant of a construction of Birgé [Bir87b]. As we will see it enables us to reduce the problem of testing a monotone distribution to the problem of testing an arbitrary distribution over a much smaller domain. The decomposition result is given below:

There is a dual version of Theorem 5, asserting the existence of an “oblivious” partition for non-decreasing distributions (which is of course different from the “oblivious” partition I{\cal I} for non-increasing distributions of Theorem 5); this will be useful later.

While our construction is essentially that of Birgé, we note that the version given in [Bir87b] is for non-increasing distributions over the continuous domain [0,n][0,n], and it is phrased rather differently. Adapting the arguments of [Bir87b] to our discrete setting of distributions over [n][n] is not conceptually difficult but requires some care. For the sake of being self-contained we provide a self-contained proof of the discrete version, stated above, that we require in Appendix E.

B.2 Efficiently testing monotone distributions

Appendix C Efficiently Testing k𝑘k-modal Distributions

In this section we establish our main positive testing results for kk-modal distributions, the upper bounds stated in the final four rows of Table 1. The key to all these results is an algorithm Construct-Flat-Decomposition(p,ϵ,δ)(p,\epsilon,\delta). We prove the following performance guarantee about this algorithm:

The following terminology will be useful: Let I={Ii}i=1r{\cal I}=\{I_{i}\}_{i=1}^{r} and I={Ii}i=1s{\cal I}^{\prime}=\{I^{\prime}_{i}\}_{i=1}^{s} be two partitions of [n][n] into rr and ss intervals respectively. The common refinement of I{\cal I} and I{\cal I}^{\prime} is the partition J{\cal J} of [n][n] into intervals obtained from I{\cal I} and I{\cal I}^{\prime} in the obvious way, by taking all possible nonempty intervals of the form IiIj.I_{i}\cap I^{\prime}_{j}. It is clear that J{\cal J} is both a refinement of I{\cal I} and of I{\cal I}^{\prime} and that the number of intervals J|{\cal J}| in J{\cal J} is at most r+s.r+s.

The modifications required to obtain algorithms Test-Identity-Unknown-kmodal, L1L_{1}-Estimate-Known-kmodal and L1L_{1}-Estimate-Unknown-kmodal, and the analysis of these algorithms, are completely analogous to the modifications and analyses of Appendix B.2 and are omitted.

We present Construct-Flat-Decomposition(p,ϵ,δ)(p,\epsilon,\delta) followed by an intuitive explanation. Note that it employs a procedure Orientation(p^,I)({\widehat{p}},I), which uses no samples and is presented and analyzed in Section 4.2.

Roughly speaking, when Construct-Flat-Decomposition constructs a partition I{\cal I}, it initially breaks [n][n] up into two types of intervals. The first type are intervals that are “okay” to include in a flat decomposition, either because they have very little mass, or because they consist of a single point, or because they are close to uniform. The second type are intervals that are “not okay” to include in a flat decomposition – they have significant mass and are far from uniform – but the algorithm is able to ensure that almost all of these are monotone distributions with a known orientation. It then uses the oblivious decomposition of Theorem 5 to construct a flat decomposition of each such interval. (Note that it is crucial that the orientation is known in order to be able to use Theorem 5.)

In more detail, Construct-Flat-Decomposition(p,ϵ,δ)(p,\epsilon,\delta) works as follows. The algorithm first draws a batch of samples from pp and uses them to construct an estimate p^{\widehat{p}} of the CDF of pp (this is straightforward using the DKW inequality). Using p^{\widehat{p}} the algorithm partitions [n][n] into a collection of O(k/ϵ)O(k/\epsilon) disjoint intervals in the following way:

A small collection of the intervals are “negligible”; they collectively have total mass less than ϵ\epsilon under pp. Each negligible interval II will be an element of the partition I.{\cal I}.

Some of the intervals are “heavy points”; these are intervals consisting of a single point that has mass Ω(ϵ/k)\Omega(\epsilon/k) under pp. Each heavy point II will also be an element of the partition I.{\cal I}.

The remaining intervals are “moderate” intervals, each of which has mass Θ(ϵ/k)\Theta(\epsilon/k) under pp.

C.2 Performance of Construct-Flat-Decomposition: Proof of Lemma 3.

The claimed sample bound is obvious from inspection of the algorithm, as the only step that draws any samples is Step 2. The bound on the number of intervals in the flat decomposition follows directly from the upper bounds on the number of heavy points, negligible intervals and moderate intervals shown below, using also Theorem 5. It remains to show that the output of the algorithm is a valid flat decomposition of pp. First, by the DKW inequality (Theorem 1) we have that with probability at least 1δ1-\delta it is the case that

We make some preliminary observations about the weight that pp has on the intervals constructed in Steps 4 and 5. Since every atomic interval IiI_{i} constructed in Step 4 has p^(I)ϵ/(100k){\widehat{p}}(I)\geq\epsilon/(100k) (except potentially the rightmost one), it follows that the number α\alpha of atomic intervals constructed in Step 3 satisfies

We now establish bounds on the probability mass that pp assigns to the moderate intervals, heavy points, and negligible intervals that are constructed in Step 4. Using (3), each interval IiI_{i} that is declared to be a moderate interval in Step 4(a) must satisfy

By virtue of the greedy process that is used to construct atomic intervals in Step 3, each point bb that is declared to be a heavy point in Step 4(b) must satisfy p^(b)2ϵ/(100k){\widehat{p}}(b)\geq 2\epsilon/(100k) and thus using (3) again

Moreover, each interval [a,b1][a,b-1] that is declared to be a negligible interval must satisfy p^([a,b1])<ϵ/(100k){\widehat{p}}([a,b-1])<\epsilon/(100k) and thus using (3) again

It is clear that nmn_{m} (the number of moderate intervals) and nhn_{h} (the number of heavy points) are each at most α.\alpha. Next we observe that the number of negligible intervals nnn_{n} satisfies

This is because at the end of each negligible interval [a,b1][a,b-1] we have (observing that each negligible interval must be nonempty) that p(b1)p([a,b1])101ϵ/(10000k)p(b-1)\leq p([a,b-1])\leq 101\epsilon/(10000k) while p(b)199ϵ/(10000k)p(b)\geq 199\epsilon/(10000k). Since pp is kk-modal, there can be at most (k+1)/2k\lceil(k+1)/2\rceil\leq k points b[n]b\in[n] satisfying this condition. Since each negligible interval II satisfies p(I)101ϵ/(10000k)p(I)\leq 101\epsilon/(10000k) we have that the total probability mass under pp of all the negligible intervals is at most 101ϵ/10000.101\epsilon/10000.

Thus far we have built a partition of [n][n] into a collection of nm100k/ϵn_{m}\leq\lceil 100k/\epsilon\rceil moderate intervals (which we denote M1,,MnmM_{1},\dots,M_{n_{m}}), a set of nh100k/ϵn_{h}\leq\lceil 100k/\epsilon\rceil heavy points (which we denote h1,,hnhh_{1},\dots,h_{n_{h}}) and a set of nnkn_{n}\leq k negligible intervals (which we denote N1,,NnnN_{1},\dots,N_{n_{n}}). Let A{1,,nm}A\subseteq\{1,\dots,n_{m}\} denote the set of those indices ii such that Orientation(p^,Mi)({\widehat{p}},M_{i}) outputs \bot in Step 6. The partition I{\cal I} that Construct-Flat-Decomposition constructs consists of {h1},,{hnh}\{h_{1}\},\dots,\{h_{n_{h}}\}, N1,,NnnN_{1},\dots,N_{n_{n}}, {Mi}iA\{M_{i}\}_{i\in A}, and i([nm]A)JMi.\bigcup_{i\in([n_{m}]\setminus A)}{\cal J}_{M_{i}}. We can thus write pp as

Using Lemma 15 (proved in Appendix F) we can bound the total variation distance between pp and (pf)I(p_{f})^{\cal I} by

Since p(I)=(pf)I(I)p(I)=(p_{f})^{\cal I}(I) for every III\in{\cal I}, this simplifies to

Recalling from (6) that p(Nj)101ϵ/(10000k)p(N_{j})\leq 101\epsilon/(10000k) for each negligible interval NjN_{j}, and recalling that nnkn_{n}\leq k, the first summand in (9) is at most 101ϵ/10000.101\epsilon/10000.

Let B[nm]AB\subset[n_{m}]\setminus A be such that, for all jBj\in B, pMjp_{M_{j}} is monotone. Summing the above over all jBj\in B gives:

Given that pp is kk-modal, the cardinality of the set [nm](AB)[n_{m}]\setminus(A\cup B) is at most kk. So we have the bound:

C.3 The Orientation algorithm.

The Orientation algorithm takes as input an explicit distribution of a distribution p^{\widehat{p}} over [n][n] and an interval I[n].I\subseteq[n]. Intuitively, it assumes that p^I{\widehat{p}}_{I} is close (in Kolmogorov distance) to a monotone distribution pIp_{I}, and its goal is to determine the orientation of pIp_{I}: it outputs either \uparrow, \downarrow or \bot (the last of which means “close to uniform”). The algorithm is quite simple; it checks whether there exists an initial interval II^{\prime} of II on which p^I{\widehat{p}}_{I}’s weight is significantly different from uI(I)u_{I}(I^{\prime}) (the weight that the uniform distribution over II assigns to II^{\prime}) and bases its output on this in the obvious way. A precise description of the algorithm (which uses no samples) is given below.

Orientation Inputs: explicit description of distribution p^{\widehat{p}} over [n][n]; interval I=[a,b][n]I=[a,b]\subseteq[n] 1. If I=1|I|=1 (i.e. I={a}I=\{a\} for some a[n]a\in[n]) then return “\bot”, otherwise continue. 2. If there is an initial interval I=[a,j]I^{\prime}=[a,j] of II that satisfies uI(I)(p^)I(I)>ϵ7u_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})>{\frac{\epsilon}{7}} then halt and output “\uparrow”. Otherwise, 3. If there is an initial interval I=[a,j]I^{\prime}=[a,j] of II that satisfies uI(I)(p^)I(I)<ϵ7u_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})<-{\frac{\epsilon}{7}} then halt and output “\downarrow”. Otherwise, 4. Output “\bot”.

We proceed to analyze Orientation. We show that if pIp_{I} is far from uniform then Orientation outputs the correct orientation for it. We also show that whenever Orientation does not output “\bot”, whatever it outputs is the correct orientation of pIp_{I}. For ease of readability, for the rest of this subsection we use the following notation:

Lemma 5. Let pp be a distribution over [n][n] and let interval I=[a,b][n]I=[a,b]\subseteq[n] be such that pIp_{I} is monotone. Suppose p(I)99ϵ/(10000k)p(I)\geq 99\epsilon/(10000k), and suppose that for every interval III^{\prime}\subseteq I we have that

If pIp_{I} is non-decreasing and pIp_{I} is ϵ/6\epsilon/6-far from the uniform distribution uIu_{I} over II, then Orientation(p^,I)({\widehat{p}},I) outputs “\uparrow”;

if Orientation(p^,I)({\widehat{p}},I) outputs “\uparrow” then pIp_{I} is non-decreasing;

if pIp_{I} is non-increasing and pIp_{I} is ϵ/6\epsilon/6-far from the uniform distribution uIu_{I} over II, then Orientation(p^,I)({\widehat{p}},I) outputs “\downarrow”;

if Orientation(p^,I)({\widehat{p}},I) outputs “\downarrow” then pIp_{I} is non-increasing.

Let I=[a,j]II^{\prime}=[a,j]\subseteq I be any initial interval of I.I. We first establish the upper bound

as this will be useful for the rest of the proof. Using (10) we have

Now using the fact that p(I)p(I)p(I^{\prime})\leq p(I) and p(I)99ϵ/(10000k)p(I)\geq 99\epsilon/(10000k), we get that (12) is at least

So we have established the lower bound pI(I)(p^)I(I)ϵ/49.p_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})\geq-\epsilon/49. For the upper bound, similar reasoning gives

and so we have shown that pI(I)(p^)I(I)ϵ/49|p_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})|\leq\epsilon/49 as desired. Now we proceed to prove the lemma.

and thus Orientation outputs “\uparrow” in Step 3 as claimed.

Now we turn to Part 2 of the lemma. Suppose that Orientation(p^,I)({\widehat{p}},I) outputs “\uparrow”. Then it must be the case that there is an initial interval I=[a,j]I^{\prime}=[a,j] of II that satisfies uI(I)(p^)I(I)>ϵ7.u_{I}(I^{\prime})-({\widehat{p}})_{I}(I^{\prime})>{\frac{\epsilon}{7}}. By (11) we have that uI(I)pI(I)>ϵ7ϵ49=6ϵ49u_{I}(I^{\prime})-p_{I}(I^{\prime})>{\frac{\epsilon}{7}}-{\frac{\epsilon}{49}}={\frac{6\epsilon}{49}}. But Observation 1 tells us that if pIp_{I} were non-increasing then we would have uI(I)pI(I)0u_{I}(I^{\prime})-p_{I}(I^{\prime})\leq 0; so pIp_{I} cannot be non-increasing, and therefore it must be non-decreasing.

Finally, Part 4 of the lemma follows from analogous arguments as Part 2.∎

Appendix D Proof of Proposition 6

We start by defining the transformation, and then prove the necessary lemmas to show that the transformation yields kk-modal distributions with the specified increase in support size, preserves L1L_{1} distance between pairs, and has the property that samples from the transformed distributions can be simulated given access to samples from the original distributions.

The transformation proceeds in two phases. In the first phase, the input distribution pp is transformed into a related distribution ff with larger support; ff has the additional property that the ratio of the probabilities of consecutive domain elements is bounded. Intuitively the distribution ff corresponds to a “reduced distribution” from Section A. In the second phase, the distribution ff is transformed into the final 2(k1)2(k-1)-modal distribution gg. Both stages of the transformation consist of subdividing each element of the domain of the input distribution into a set of elements of the output distribution; in the first stage, the probabilities of each element of the set are chosen according to a geometric sequence, while in the second phase, all elements of each set are given equal probabilities.

We now define this two-phase transformation and prove Proposition 6.

Fix ϵ>0\epsilon>0 and a distribution pp over [n][n] such that pminp(i)pmaxp_{min}\leq p(i)\leq p_{max} for all i[n].i\in[n]. We define the distribution fp,ϵ,pmax,pminf_{p,\epsilon,p_{max},p_{min}}in two steps. Let qq be the distribution on support [c][c] with c=1+log1+ϵpmaxlog1+ϵpminc=1+\lceil\log_{1+\epsilon}p_{max}-\log_{1+\epsilon}p_{min}\rceil that is defined by q(i)=(1+ϵ)i1ϵ(1+ϵ)c1.q(i)=(1+\epsilon)^{i-1}\frac{\epsilon}{(1+\epsilon)^{c}-1}. The distribution fp,ϵ,pmax,pminf_{p,\epsilon,p_{max},p_{min}} has support [cn][cn], and for i[n]i\in[n] and j[c]j\in[c] it assigns probability p(i)q(j)p(i)q(j) to domain element c(i1)+j.c(i-1)+j.

It is convenient for us to view the modr\mod r operator as giving an output in [r][r], so that “rmodrr\mod r” equals r.r.

Given ϵ,pmin,pmax,\epsilon,p_{min},p_{max}, and access to independent samples from distribution pp, one can generate independent samples from fp,ϵ,pmax,pminf_{p,\epsilon,p_{max},p_{min}} and from gk,p,ϵ,pmax,pmin.g_{k,p,\epsilon,p_{max},p_{min}}.

To generate a sample according to fp,ϵ,pmax,pmin,f_{p,\epsilon,p_{max},p_{min}}, one simply takes a sample ipi\leftarrow p and then draws j[c]j\in[c] according to the distribution qq as defined in Definition 2 (note that this draw according to qq only involves ϵ,pmin\epsilon,p_{min} and pmaxp_{max}). We then output the value c(i1)+j.c(i-1)+j. It follows immediately from the above definition that the distribution of the output value is fp,ϵ,pmax,pmin.f_{p,\epsilon,p_{max},p_{min}}.

and thus g1,p,ϵ,pmax,pming_{1,p,\epsilon,p_{max},p_{min}} is indeed 0-modal since it is monotone non-increasing. For k>1k>1 the above arguments apply to each of the kk equally-sized contiguous regions of the support, so there are 2(k1)2(k-1) modes, namely the local maxima occurring at the right endpoint of each region, and the local minima occurring at the left endpoint of each region. ∎

For any distributions p,pp,p^{\prime} with support [n][n], and any ϵ,pmax,pmin,\epsilon,p_{max},p_{min}, we have that

Both equalities follow immediately from the fact that the transformations of Definitions 2 and 3 partition each element of the input distribution in a manner that is oblivious to the probabilities. To illustrate, letting c=1+log1+ϵpmaxlog1+ϵpmin,c=1+\lceil\log_{1+\epsilon}p_{max}-\log_{1+\epsilon}p_{min}\rceil, and letting qq be as in Definition 2, we have the following:

If pp has support [n],[n], then for any ϵ<1/2,\epsilon<1/2, the distribution gk,p,ϵ,pmax,pming_{k,p,\epsilon,p_{max},p_{min}} is supported on [N][N], where NN is at most ke8nk(1+log(pmax/pmin))ϵ2k\frac{e^{\frac{8n}{k}\left(1+\log(p_{max}/p_{min})\right)}}{\epsilon^{2}}.

The support of fp,ϵ,pmax,pminf_{p,\epsilon,p_{max},p_{min}} is n(1+log1+ϵpmaxlog1+ϵpmin)n(2+log(pmax/pmin)log(1+ϵ)).n(1+\lceil\log_{1+\epsilon}p_{max}-\log_{1+\epsilon}p_{min}\rceil)\leq n\left(2+\frac{\log(p_{max}/p_{min})}{\log(1+\epsilon)}\right). Letting a1:=1a_{1}:=1 and b1:=1ϵ,b_{1}:=\lceil\frac{1}{\epsilon}\rceil, and defining ai:=ai1(1+ϵ),a_{i}:=\lceil a_{i-1}(1+\epsilon)\rceil, and bi:=bi1(1+ϵ),b_{i}:=\lceil b_{i-1}(1+\epsilon)\rceil, we have that aibia_{i}\leq b_{i} for all ii. Additionally, bi+1/bi1+2ϵb_{i+1}/b_{i}\leq 1+2\epsilon, since all bi1/ϵ,b_{i}\geq 1/\epsilon, and thus the ceiling operation can increase the value of (1+ϵ)bi(1+\epsilon)b_{i} by at most ϵbi\epsilon b_{i}. Putting these two observations together, we have

For any ϵ1/2,\epsilon\leq 1/2, we have that the support of gk,p,1/2,pmax,pming_{k,p,1/2,p_{max},p_{min}} is at most

The proof is now a simple matter of assembling the above parts. Given a distribution DD over pairs of distributions of support [n],[n], as specified in the proposition statement, the distribution DkD_{k} is defined via the process of taking (p,p)D,(p,p^{\prime})\leftarrow D, then applying the transformation of Definitions 2 and 3 with ϵ=1/2\epsilon=1/2 and to yield a pair (gk,p,1/2,pmax,pmin,gk,p,1/2,pmax,pmin)\left(g_{k,p,1/2,p_{max},p_{min}},g_{k,p^{\prime},1/2,p_{max},p_{min}}\right). We claim that this DkD_{k} satisfies all the properties claimed in the proposition statement. Specifically, Lemmas 12 and 14, respectively, ensure that every distribution in the support of DkD_{k} has at most 2(k1)2(k-1) modes, and has support size at most 4ke8nk(1+log(pmax/pmin)).4ke^{\frac{8n}{k}\left(1+\log(p_{max}/p_{min})\right)}. Additionally, Lemma 13 guarantees that the transformation preserves L1L_{1} distance, namely, for two distributions p,pp,p^{\prime} with support [n],[n], we have L1(p,p)=L1(gk,p,1/2,pmax,pmin,gk,p,1/2,pmax,pmin).L_{1}(p,p^{\prime})=L_{1}(g_{k,p,1/2,p_{max},p_{min}},g_{k,p^{\prime},1/2,p_{max},p_{min}}). Finally, Lemma 11 guarantees that, given ss independent samples from pp, one can simulate drawing ss independent samples according to gk,p,1/2,pmax,pmin.g_{k,p,1/2,p_{max},p_{min}}. Assuming for the sake of contradiction that one had an algorithm that could distinguish whether L1(gk,p,1/2,pmax,pmin,gk,p,1/2,pmax,pmin)L_{1}(g_{k,p,1/2,p_{max},p_{min}},g_{k,p^{\prime},1/2,p_{max},p_{min}}) is less than ϵ1\epsilon_{1} versus greater than ϵ2\epsilon_{2} with the desired probability given ss samples, one could take ss samples from distributions (p,p)D(p,p^{\prime})\leftarrow D, simulate having drawn them from gk,p,1/2,pmax,pming_{k,p,1/2,p_{max},p_{min}} and gk,p,1/2,pmax,pmin,g_{k,p^{\prime},1/2,p_{max},p_{min}}, and then run the hypothesized tester algorithm on those samples, and output the answer, which will be the same for (p,p)(p,p^{\prime}) as for (gk,p,1/2,pmax,pmin,gk,p,1/2,pmax,pmin).(g_{k,p,1/2,p_{max},p_{min}},g_{k,p^{\prime},1/2,p_{max},p_{min}}). This contradicts the assumption that no algorithm with these success parameters exists for (p,p)D.(p,p^{\prime})\leftarrow D.

Appendix E Proof of Theorem 5

We first note that we can assume that ϵ>1/n\epsilon>1/n. Otherwise, the decomposition of [n][n] into singleton intervals Ii={i}I_{i}=\{i\}, i[n]i\in[n], trivially satisfies the statement of the theorem. Indeed, in this case we have that (1/ϵ)logn>n(1/\epsilon)\cdot\log n>n and pfpp_{f}\equiv p.

Let pp be any non-increasing distribution over [n][n]. We will now show that the above described decomposition satisfies

where pIp^{I} denotes the (sub-distribution) restriction of pp over II.

Let Ij=[nj1+1,nj]I_{j}=[n_{j-1}+1,n_{j}] with lj=Ij=njnj1l_{j}=|I_{j}|=n_{j}-n_{j-1}. Then we have that

Recall that pfp_{f} is by definition constant within each IjI_{j} and in particular equal to pˉfj=i=nj1+1njp(i)/lj\bar{p}_{f}^{j}=\mathop{{\textstyle\sum}}_{i=n_{j-1}+1}^{n_{j}}p(i)/l_{j}. Also recall that pp is non-increasing, hence p(nj1)p(nj1+1)pˉfjp(nj)p(n_{j-1})\geq p(n_{j-1}+1)\geq\bar{p}_{f}^{j}\geq p(n_{j}). Therefore, we can bound from above the variation distance within IjI_{j} as follows

To bound the above quantity we analyze summands with lj<1/ϵl_{j}<1/\epsilon and with lj1/ϵl_{j}\geq 1/\epsilon separately.

Consider the short intervals and cluster them into groups according to their length; that is, a group contains all intervals in SS of the same length. We denote by GiG_{i} the iith group, which by definition contains all intervals in SS of length ii; note that these intervals are consecutive. The cardinality of a group (denoted by |\cdot|) is the number of intervals it contains; the length of a group is the number of elements it contains (i.e. the sum of the lengths of the intervals it contains).

Note that G1G_{1} (the group containing all singleton intervals) has G1=Ω(1/ϵ)|G_{1}|=\Omega(1/\epsilon) (this follows from the assumption that 1/ϵ<n1/\epsilon<n). Hence G1G_{1} has length Ω(1/ϵ)\Omega(1/\epsilon). Let j<1/ϵj^{\ast}<1/\epsilon be the maximum length of any short interval in SS. It is easy to verify that each group GjG_{j} for jjj\leq j^{\ast} is nonempty, and that for all jj1j\leq j^{\ast}-1, we have Gj=Ω((1/ϵ)(1/j))|G_{j}|=\Omega\left((1/\epsilon)\cdot(1/j)\right), which implies that the length of GjG_{j} is Ω(1/ϵ)\Omega(1/\epsilon).

To bound the contribution to (13) from the short intervals, we consider the corresponding sum for each group, and use the fact that G1G_{1} makes no contribution to the error. In particular, the contribution of the short intervals is

where plp^{-}_{l} (resp. pl+p^{+}_{l}) is the probability mass of the leftmost (resp. rightmost) point in GlG_{l}. Given that pp is non-increasing, we have that pl+pl+1p^{+}_{l}\geq p^{-}_{l+1}. Therefore, we can upper bound (14) by

Now note that p1+=O(ϵ)p(G1)p^{+}_{1}=O(\epsilon)\cdot p(G_{1}), since G1G_{1} has length (total number of elements) Ω(1/ϵ)\Omega(1/\epsilon) and pp is non-increasing. Similarly, for l<jl<j^{\ast}, we have that pl+=O(ϵ)p(Gl)p^{+}_{l}=O(\epsilon)\cdot p(G_{l}), since GlG_{l} has length Ω(1/ϵ)\Omega(1/\epsilon). Therefore, the above quantity can be upper bounded by

We consider two cases: The first case is that L=L=\emptyset. In this case, we are done because the above expression (15) is O(ϵ)O(\epsilon). The second case is that LL\neq\emptyset (we note in passing that in this case the total number of elements in all short intervals is Ω(1/ϵ2)\Omega(1/\epsilon^{2}), which means that we must have ϵ=Ω(1/n)\epsilon=\Omega(1/\sqrt{n})). In this case we bound the contribution of the long intervals using the same argument as Birgé. In particular, the contribution of the long intervals is

Given that lj+1lj(2ϵ)ljl_{j+1}-l_{j}\leq(2\epsilon)\cdot l_{j} and jljp(nj)p(L)\mathop{{\textstyle\sum}}_{j}l_{j}\cdot p(n_{j})\leq p(L), it follows that the second summand in (16) is at most O(ϵ)p(L)O(\epsilon)\cdot p(L). Therefore, the total variation distance between pp and pfp_{f} is at most (15) + (16), i.e.

Finally, note that p(L)+p(S)=1p(L)+p(S)=1 and pj+=O(ϵ)p^{+}_{j^{\ast}}=O(\epsilon). (The latter holds because pj+p^{+}_{j^{\ast}} is the probability mass of the rightmost point in SS; recall that SS has length at least 1/ϵ1/\epsilon and pp is decreasing.) This implies that (17) is at most O(ϵ)O(\epsilon), and this completes the proof of Theorem 5.

Appendix F Bounding variation distance

As noted above, our tester will work by decomposing the interval [n][n] into sub-intervals. The following lemma will be useful for us; it bounds the variation distance between two distributions pp and qq in terms of how pp and qq behave on the sub-intervals in such a decomposition.

Let [n][n] be partitioned into I1,,IrI_{1},\dots,I_{r}. Let p,qp,q be two distributions over [n][n]. Then

We assume that p(I1)q(I1)p(I_{1})\leq q(I_{1}) and prove (19) under this assumption. This gives the bound in general since if p(I1)>q(I1)p(I_{1})>q(I_{1}) we have

where the first inequality is by (19). The triangle inequality gives us

Summing this over all iI1i\in I_{1} we get

We can rewrite the first term on the RHS as

so to prove the desired bound it suffices to show that

So we indeed have (20) as required, and the lemma holds. ∎