Learning mixtures of structured distributions over discrete domains

Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, Xiaorui Sun

Introduction

Learning an unknown probability distribution given access to independent samples is a classical topic with a long history in statistics and probability theory. Theoretical computer science researchers have also been interested in these problems at least since the 1990s [KMR+94, Das99], with an explicit focus on the computational efficiency of algorithms for learning distributions. Many works in theoretical computer science over the past decade have focused on learning and testing various kinds of probability distributions over high-dimensional spaces, see e.g. [Das99, FM99, DS00, AK01, VW02, FOS05, RS05, BS10, KMV10, MV10, ACS10] and references therein. There has also been significant recent interest in learning and testing various types of probability distributions over the discrete domain [n]={1,,n}[n]=\{1,\dots,n\}, see e.g. [BKR04, VV11b, VV11a, DDS12a, DDS12b].

A natural type of distribution learning problem, which is the focus of this work, is that of learning an unknown mixture of “simple” distributions. Mixtures of distributions have received much attention in statistics [Lin95, RW84, TSM85] and in recent years have been intensively studied in computer science as well (see many of the papers referenced above). Given distributions p1,,pkp_{1},\dots,p_{k} and non-negative values μ1,,μk\mu_{1},\dots,\mu_{k} that sum to 1, we say that p=i=1kμipip=\sum_{i=1}^{k}\mu_{i}p_{i} is a kk-mixture of components p1,,pkp_{1},\dots,p_{k} with mixing weights μ1,,μk\mu_{1},\dots,\mu_{k}. A draw from pp is obtained by choosing i[k]i\in[k] with probability μi\mu_{i} and then making a draw from pip_{i}.

We focus on density estimation rather than, say, clustering or parameter estimation, for several reasons. First, clustering samples according to which component in the mixture each sample came from is often an impossible task unless restrictive separation assumptions are made on the components; we prefer not to make such assumptions. Second, the classes of distributions that we are chiefly interested in (such as log-concave, MHR and unimodal distributions) are all non-parametric classes, so it is unclear what “parameter estimation” would even mean for these classes. Finally, even in highly restricted special cases, parameter estimation provably requires sample complexity exponential in kk, the number of components in the mixture. Moitra and Valiant [MV10] have shown that parameter estimation for a mixture of kk Gaussians inherently requires exp(Ω(k))\exp(\Omega(k)) samples. Their argument can be translated to the discrete setting, with translated Binomial distributions in place of Gaussians, to provide a similar lower bound for parameter estimation of translated Binomial mixtures. Thus, parameter estimation even for a mixture of kk translated Binomial distributions over [n][n] (a highly restricted special case of all the mixture classes we consider, since translated Binomial distributions are log-concave, MHR and unimodal) requires exp(Ω(k))\exp(\Omega(k)) samples. This rather discouraging lower bound motivates the study of other variants of the problem of learning mixture distributions.

Returning to our density estimation framework, it is not hard to show that from an information-theoretic perspective, learning a mixture of distributions from a class C\mathfrak{C} of distributions is never much harder than learning a single distribution from C\mathfrak{C}. In Appendix A we give a simple argument which establishes the following:

While the generic algorithm AA^{\prime} uses relatively few samples, it is computationally highly inefficient, with running time exponentially higher than the runtime of algorithm AA (since AA^{\prime} tries all possible partitions of its input sample into kk separate subsamples). Indeed, naive approaches to learning mixture distributions run into a “credit assignment” problem of determining which component distribution each sample point belongs to.

As the main contributions of this paper, we (i) give a general algorithm which efficiently learns mixture distributions over [n][n] provided that the component distributions satisfy a mild condition; and (ii) show that this algorithm can be used to obtain highly efficient algorithms for natural mixture learning problems.

2 A general algorithm.

The mild condition which we require of the component distributions in our mixtures is essentially that each component distribution must be close to a (variable-width) histogram with few bins. More precisely, let us say that a distribution qq over [n][n] is (ε,t)(\varepsilon,t)-flat (see Section 2) if there is a partition of [n][n] into tt disjoint intervals I1,,ItI_{1},\dots,I_{t} such that pp is ε\varepsilon-close (in total variation distance) to the distribution obtained by “flattening” pp within each interval IjI_{j} (i.e., by replacing p(k)p(k), for kIjk\in I_{j}, with iIjp(i)/Ij\sum_{i\in I_{j}}p(i)/|I_{j}|). Our general result for learning mixture distributions is a highly efficient algorithm that learns any kk-mixture of (ε,t)(\varepsilon,t)-flat distributions:

As we show in Section 1.3 below, Theorem 1.1 yields near-optimal sample complexity for a range of interesting mixture learning problems, with a running time that is nearly linear in the sample size. Another attractive feature of Theorem 1.1 is that it always outputs hypothesis distributions with a very simple structure (enabling a succinct representation), namely histograms with at most kt/εkt/\varepsilon bins.

3 Applications of the general approach.

We apply our general approach to obtain a wide range of learning results for mixtures of various natural and well-studied types of discrete distributions. These include mixtures of log-concave distributions, mixtures of monotone hazard rate (MHR) distributions, and mixtures of unimodal distributions. To do this, in each case we need a structural result stating that any distribution of the relevant type can be well-approximated by a histogram with few bins. In some cases (unimodal distributions) the necessary structural results were previously known, but in others (log-concave and MHR distributions) we establish novel structural results that, combined with our general approach, yield nearly optimal algorithms.

Log-concave distributions. Discrete log-concave distributions are essentially those distributions pp that satisfy p(k)2p(k+1)p(k1)p(k)^{2}\geq p(k+1)p(k-1) (see Section 4 for a precise definition). They are closely analogous to log-concave distributions over continuous domains, and encompass a range of interesting and well-studied types of discrete distributions, including binomial, negative binomial, geometric, hypergeometric, Poisson, Poisson Binomial, hyper-Poisson, Pólya-Eggenberger, and Skellam distributions (see Section 1 of [FBR11]). In the continuous setting, log-concave distributions include uniform, normal, exponential, logistic, extreme value, Laplace, Weibull, Gamma, Chi and Chi-Squared and Beta distributions, see [BB05]. Log-concave distributions over [n][n] have been studied in a range of different contexts including economics, statistics and probability theory, and algebra, combinatorics and geometry, see [An95, FBR11, Sta89] and references therein.

Our main learning result for mixtures of discrete log-concave distributions is:

Our main learning result for mixtures of MHR distributions is:

This theorem is also nearly optimal. We show that for kn1Ω(1)k\leq n^{1-\Omega(1)} and ε1/nΩ(1)\varepsilon\geq 1/n^{\Omega(1)}, any algorithm for learning a mixture of kk MHR distributions to accuracy ε\varepsilon must use Ω(klog(n)/ε3)\Omega(k\log(n)/\varepsilon^{3}) samples.

Unimodal distributions. A distribution over [n][n] is said to be unimodal if its probability mass function is monotone non-decreasing over [1,t][1,t] for some tnt\leq n and then monotone non-increasing on [t,n][t,n]. Every log-concave distribution is unimodal, but the MHR and unimodal conditions are easily seen to be incomparable. Many natural types of distributions are unimodal and there has been extensive work on density estimation for unimodal distributions and related questions [Rao69, Weg70, BKR04, Bir97, Fou97].

Our main learning result for mixtures of unimodal distributions is:

Our approach in fact extends to learning a kk-mixture of tt-modal distributions (see Section 6). The same lower bound argument that we use for mixtures of MHR distributions also gives us that for kn1Ω(1)k\leq n^{1-\Omega(1)} and ε1/nΩ(1)\varepsilon\geq 1/n^{\Omega(1)}, any algorithm for learning a mixture of kk unimodal distributions to accuracy ε\varepsilon must use Ω(klog(n)/ε3)\Omega(k\log(n)/\varepsilon^{3}) samples.

4 Related work.

Achlioptas and McSherry [AM05] and Kannan et al. [KSV08] gave algorithms for clustering points drawn from a mixture of kk high-dimensional log-concave distributions, under various separation assumptions on the distance between the means of the components. We are not aware of prior work on density estimation of mixtures of arbitrary log-concave distributions in either the continuous or the discrete setting.

MHR distributions: As noted above, MHR distributions appear frequently and play an important role in reliability theory and in economics (to the extent that the MHR condition is considered a standard assumption in these settings). Surprisingly, the problem of learning an unknown MHR distribution or mixture of such distributions has not been explicitly considered in the statistics literature. We note that several authors have considered the problem of estimating the hazard rate of an MHR distribution in different contexts, see e.g. [Wan86, HW93, GJ11, Ban08].

Unimodal distributions: The problem of learning a single unimodal distribution is well-understood: Birgé [Bir97] gave an efficient algorithm for learning continuous unimodal distributions (whose density is absolutely bounded); his algorithm, when translated to the discrete domain [n][n], requires O(log(n)/ε3)O(\log(n)/\varepsilon^{3}) samples. This sample size is also known to be optimal (up to constant factors)[Bir87a]. In recent work, Daskalakis et al. [DDS12a] gave an efficient algorithm to learn tt-modal distributions over [n][n]. We remark that their result does not imply ours, as even a mixture of two unimodal distributions over [n][n] may have Ω(n)\Omega(n) modes. We are not aware of prior work on efficiently learning mixtures of unimodal distributions.

Paper Structure. Following some preliminaries in Section 2, Section 3 presents our general framework for learning mixtures. Sections 4, 5 and 6 analyze the cases of log-concave, MHR and unimodal mixtures respectively.

Preliminaries and notation

For pp a probability distribution over [n][n] we write p(i)p(i) to denote the probability of element i[n]i\in[n] under pp, so p(i)0p(i)\geq 0 for all i[n]i\in[n] and i=1np(i)=1.\sum_{i=1}^{n}p(i)=1. For S[n]S\subseteq[n] we write p(S)p(S) to denote iSp(i)\sum_{i\in S}p(i). We write pSp^{S} to denote the sub-distribution over SS induced by pp, i.e., pS(i)=p(i)p^{S}(i)=p(i) if iSi\in S and pS(i)=0p^{S}(i)=0 otherwise.

A distribution pp over [n][n] is non-increasing (resp. non-decreasing) if p(i+1)p(i)p(i+1)\leq p(i) (resp. p(i+1)p(i)p(i+1)\geq p(i)), for all i[n1]i\in[n-1]; pp is monotone if it is either non-increasing or non-decreasing.

Finally, the following notation and terminology will be useful: given mm independent samples s1,,sms_{1},\dots,s_{m}, drawn from distribution p:[n],p:[n]\to, the empirical distribution p^m:[n]{\widehat{p}}_{m}:[n]\to is defined as follows: for all i[n]i\in[n], p^m(i)={j[m]sj=i}/m{\widehat{p}}_{m}(i)=\left|\{j\in[m]\mid s_{j}=i\}\right|/m.

Let I={I1,,Is}{\cal I}=\{I_{1},\dots,I_{s}\} be a partition of [n][n] into ss disjoint intervals, and J={J1,,Jt}{\cal J}=\{J_{1},\dots,J_{t}\} be a partition of [n][n] into tt disjoint intervals. We say that J{\cal J} is a refinement of I{\cal I} if each interval in I{\cal I} is a union of intervals in J{\cal J}, i.e., for every a[s]a\in[s] there is a subset Sa[t]S_{a}\subseteq[t] such that Ia=bSaJbI_{a}=\cup_{b\in S_{a}}J_{b}.

For I={Ii}i=1r{\cal I}=\{I_{i}\}_{i=1}^{r} and I={Ii}i=1s{\cal I}^{\prime}=\{I^{\prime}_{i}\}_{i=1}^{s} two partitions of [n][n] into rr and ss intervals respectively, we say that the common refinement of I{\cal I} and I{\cal I}^{\prime} is the partition J{\cal J} of [n][n] into intervals obtained from I{\cal I} and I{\cal I}^{\prime} in the obvious way, by taking all possible nonempty intervals of the form IiIj.I_{i}\cap I^{\prime}_{j}. It is clear that J{\cal J} is both a refinement of I{\cal I} and of I{\cal I}^{\prime} and that J{\cal J} contains at most r+sr+s intervals.

We recall some basic tools from probability.

The VC inequality. Given a family of subsets A\mathcal{A} over [n][n], define pA=supAAp(A)\left\|p\right\|_{\mathcal{A}}=\sup_{A\in\mathcal{A}}|p(A)|. The VC–dimension of A\mathcal{A} is the maximum size of a subset X[n]X\subseteq[n] that is shattered by A\mathcal{A} (a set XX is shattered by A\mathcal{A} if for every YXY\subseteq X some AAA\in\mathcal{A} satisfies AX=YA\cap X=Y).

Let p^m\widehat{p}_{m} be an empirical distribution of mm samples from pp. Let A\mathcal{A} be a family of subsets of VC–dimension dd. Then

Uniform convergence. We will also use the following uniform convergence bound:

Let A\mathcal{A} be a family of subsets over [n][n], and p^m\widehat{p}_{m} be an empirical distribution of mm samples from pp. Let XX be the random variable pp^A\left\|p-\hat{p}\right\|_{\mathcal{A}}. Then we have

Learning mixtures of (ε,t)𝜀𝑡(\varepsilon,t)-flat distributions

In this section we present and analyze our general algorithm for learning mixtures of (ε,t)(\varepsilon,t)-flat distributions. We proceed in stages by considering three increasingly demanding learning scenarios, each of which builds on the previous one.

We start with the simplest scenario, in which the learning algorithm is given a partition P{\cal P} which is a (p,ε,t)(p,\varepsilon,t)-flat decomposition of [n][n] for the target distribution pp being learned.

Algorithm Learn-Known-Decomposition(p,P,ε,δ)(p,{\cal P},\varepsilon,\delta):

Input: sample access to unknown distribution pp over [n][n]; (p,ε,t)(p,\varepsilon,t)-flat decomposition P{\cal P} of [n][n]; accuracy parameter ε\varepsilon; confidence parameter δ\delta

Draw m=O((t+log1/δ)/ε2)m=O((t+\log 1/\delta)/\varepsilon^{2}) samples to obtain an empirical distribution p^m\widehat{p}_{m}.

An application of the triangle inequality yields

The first term on the right-hand side is at most ϵ\epsilon by the definition of a (p,ε,t)(p,\varepsilon,t)-flat decomposition. The second term is also at most ϵ\epsilon, as follows by Proposition 3.1, stated and proved below. ∎

where A={IPp(I)>p^m(I)}A=\bigcup\{I\in\mathcal{P}\mid p(I)>\widehat{p}_{m}(I)\}. Since P\mathcal{P} contains at most ss intervals, AA is a union of at most ss intervals. Consequently the above right-hand side is at most pp^mAs\left\|p-\widehat{p}_{m}\right\|_{\mathcal{A}_{s}}, where As\mathcal{A}_{s} is the family of all unions of at most ss intervals over [n][n].Formally, define A1={[a,b]1abn}{}\mathcal{A}_{1}=\{[a,b]\mid 1\leq a\leq b\leq n\}\cup\{\emptyset\} as the collection of all intervals over [n][n], including the empty interval. Then As={I1IsI1,,IsA1}\mathcal{A}_{s}=\{I_{1}\cup\dots\cup I_{s}\mid I_{1},\dots,I_{s}\in\mathcal{A}_{1}\}. Since the VC-dimension of As\mathcal{A}_{s} is 2s2s, Theorem 2.1 implies that the considered quantity has expected value at most ε\varepsilon. The claimed result now follows by applying Theorem 2.2 with η=ε.\eta=\varepsilon.

2 Second scenario: unknown flat distribution.

The second algorithm deals with the scenario in which the target distribution pp is (ε/4,t)(\varepsilon/4,t)-flat but no flat decomposition is provided to the learner. We show that in such a setting we can construct a (p,ε,O(t/ε))(p,\varepsilon,O(t/\varepsilon))-flat decomposition P{\cal P} of [n][n], and then we can simply use this P{\cal P} to run Learn-Known-Decomposition.

The basic subroutine Right-Interval will be useful here (and later). It takes as input an explicit description of a distribution qq over [n][n], an interval J=[a,b][n]J=[a,b]\subseteq[n], and a threshold τ>0.\tau>0. It returns the longest interval in [a,b][a,b] that ends at bb and has mass at most τ\tau under qq. If no such interval exists then q(b)q(b) must exceeds τ\tau, and the subroutine simply returns the singleton interval [b,b][b,b].

Input: explicit description of distribution qq; interval J=[a,b]J=[a,b]; threshold τ\tau

If q(b)>τq(b)>\tau then set i=bi^{\prime}=b, otherwise set i=min{aibq([i,b])τ}i^{\prime}=\min\{a\leq i\leq b\mid q([i,b])\leq\tau\}.

The algorithm to construct a decomposition is given below:

Algorithm Construct-Decomposition(p,τ,ε,δ)(p,\tau,\varepsilon,\delta):

Input: sample access to unknown distribution pp over [n][n]; parameter τ\tau;

accuracy parameter ε\varepsilon; confidence parameter δ\delta

Draw m=O((1/τ+log1/δ)/ε2)m=O((1/\tau+\log 1/\delta)/\varepsilon^{2}) samples to obtain an empirical distribution p^m\widehat{p}_{m}.

Let II be the interval returned by Right-Interval(p^m,J,τ)(\widehat{p}_{m},J,\tau).

Add II to P\mathcal{P} and set J=JIJ=J\setminus I.

To prove the above theorem we will need the following elementary fact about refinements:

Let pp be any distribution over [n][n] and let I={Ii}i=1t\mathcal{I}=\{I_{i}\}_{i=1}^{t} be a (p,ϵ,t)(p,\epsilon,t)-flat decomposition of [n][n]. If J={Ji}i=1t\mathcal{J}=\{J_{i}\}_{i=1}^{t^{\prime}} is a refinement of I\mathcal{I}, then J\mathcal{J} is a (p,2ϵ,t)(p,2\epsilon,t^{\prime})-flat decomposition of [n][n].

We will also use the following simple observation about the Right-Interval subroutine:

Suppose Right-Interval(q,J,τ)(q,J,\tau) returns an interval IJI\neq J and Right-Interval(q,JI,τ)(q,J\setminus I,\tau) returns II^{\prime}. Then q(I)+q(I)>τq(I)+q(I^{\prime})>\tau.

of II to Δ\Delta. If IPQI\in\mathcal{P}\cap\mathcal{Q} then the contribution to Δ\Delta is zero; on the other hand, if IPQI\in{\cal P}\setminus{\cal Q} then the contribution to Δ\Delta is at most p(I)/2p(I)/2. Thus the total contribution summed across all IPI\in{\cal P} is at most (1/2)IPQp(I)(1/2)\mathop{\textstyle\sum}_{I\in\mathcal{P}\setminus\mathcal{Q}}p(I). Now we observe that with probability at least 1δ1-\delta we have

Our algorithm to learn an unknown (ε/4,t)(\varepsilon/4,t)-flat distribution is now very simple:

Algorithm Learn-Unknown-Decomposition(p,t,ε,δ)(p,t,\varepsilon,\delta):

Input: sample access to unknown distribution pp over [n][n]; parameter tt;

accuracy parameter ε\varepsilon; confidence parameter δ\delta

Run Construct-Decomposition(p,ε/(4t),ε,δ/2)(p,\varepsilon/(4t),\varepsilon,\delta/2) to obtain a (p,ε,8t/ε)(p,\varepsilon,8t/\varepsilon)-flat decomposition P{\cal P} of [n][n].

Run Learn-Known-Decomposition(p,P,ε,δ/2)(p,{\cal P},\varepsilon,\delta/2) and return the hypothesis hh that it outputs.

3 Main result (third scenario): learning a mixture of flat distributions.

We have arrived at the scenario of real interest to us, namely learning an unknown mixture of kk distributions each of which has an (unknown) flat decomposition. The key to learning such distributions is the following structural result, which says that any such mixture must itself have a flat decomposition:

Let C\mathfrak{C} be a class of (ε,t)(\varepsilon,t)-flat distributions over [n][n], and let pp be any kk-mixture of distributions in C.\mathfrak{C}. Then pp is a (2ε,kt)(2\varepsilon,kt)-flat distribution.

Let p=j=1kμjpjp=\sum_{j=1}^{k}\mu_{j}p_{j} be a kk-mixture of components p1,,pkC.p_{1},\dots,p_{k}\in\mathfrak{C}. Let Pj{\cal P}_{j} denote the (pj,ϵ,t)(p_{j},\epsilon,t)-flat decomposition of [n][n] corresponding to pjp_{j}, and let P\mathcal{P} be the common refinement of P1,P2,,Pk\mathcal{P}_{1},\mathcal{P}_{2},\dots,\mathcal{P}_{k}. It is clear that P\mathcal{P} contains at most ktkt intervals. By Lemma 3.1, P\mathcal{P} is a (pj,2ϵ,kt)(p_{j},2\epsilon,kt)-flat decomposition for every pjp_{j}. Hence we have

where (2) is the triangle inequality and (3) follows from the fact that the expression in (2) is a nonnegative convex combination of terms bounded from above by 2ε2\varepsilon. ∎

Given Lemma 3.2, the desired mixture learning algorithm follows immediately from the results of the previous subsection:

Learning mixtures of log-concave distributions

In this section we apply our general method from Section 3 to learn log-concave distributions over [n][n] and mixtures of such distributions. We start with a formal definition:

A probability distribution pp over [n][n] is said to be log-concave if it satisfies the following conditions: (i) if 1i<j<kn1\leq i<j<k\leq n are such that p(i)p(k)>0p(i)p(k)>0 then p(j)>0p(j)>0; and (ii) p(k)2p(k1)p(k+1)p(k)^{2}\geq p(k-1)p(k+1) for all k[n].k\in[n].

We note that while some of the literature on discrete log-concave distributions states that the definition consists solely of item (ii) above, item (i) is in fact necessary as well since without it log-concave distributions need not even be unimodal (see the discussion following Definition 2.3 of [FBR11]).

We recall the well-known fact that log-concavity implies unimodality (see e.g. [KG71]). Thus, it is useful to analyze log-concave distributions which additionally are monotone (since a general log-concave distribution can be viewed as consisting of two such pieces). With this motivation we give the following lemma:

Let pp be a distribution over [n][n] that is non-decreasing and log-concave on [1,b][n][1,b]\subseteq[n]. Let I=[a,b]I=[a,b] be an interval of mass p(I)=τp(I)=\tau, and suppose that the interval J=[1,a1]J=[1,a-1] has mass p(J)=σ>0p(J)=\sigma>0. Then

for 1ib1\leq i\leq b. Eq. (4) holds for i=bi=b, since p(bs)p(a)p(b-s)\leq p(a) by the non-decreasing property. The general case ibi\leq b follows by induction and using the fact that the ratio p(i1)/p(i)p(i-1)/p(i) is non-decreasing in ii for any log-concave distribution (an immediate consequence of the definition of log-concavity).

for 0jt0\leq j\leq t. Since the intervals have geometrically decreasing mass, this implies that

Rearranging yields the desired inequality. ∎

We will also use the following elementary fact:

We are now ready to present and analyze our algorithm Decompose-Log-Concave that draws samples from an unknown log-concave distribution and outputs a flat decomposition. The algorithm simply runs the general algorithm Construct-Decomposition with an appropriate choice of parameters. However the analysis will not go via the “generic” Theorem 3.2 (which would yield a weaker bound) but instead uses Lemma 4.1, which is specific to log-concave distributions.

Algorithm Decompose-Log-Concave(p,ε,δ)(p,\varepsilon,\delta):

Input: sample access to unknown log-concave distribution pp over [n][n];

accuracy parameter ε\varepsilon; confidence parameter δ\delta

Set τ=Θ(ε/log(1/ε))\tau=\Theta(\varepsilon/\log(1/\varepsilon)).

Run Construct-Decomposition(p,τ,ε,δ)(p,\tau,\varepsilon,\delta) and return the decomposition P{\cal P} that it yields.

Our main theorem in this section is the following:

For any log-concave distribution pp over [n][n], Algorithm Decompose-LogConcave(p,ε,δ)(p,\varepsilon,\delta) draws O(log(1/ε)/ε3+log(1/δ)(log(1/ε))2/ε2)O(\log(1/\varepsilon)/\varepsilon^{3}+\log(1/\delta)(\log(1/\varepsilon))^{2}/\varepsilon^{2}) samples from pp and with probability at least 1δ1-\delta constructs a decomposition P\mathcal{P} that is (p,ε,O(log(1/ε)/ε))(p,\varepsilon,O(\log(1/\varepsilon)/\varepsilon))-flat.

by the τ\tau-closeness of pp and p^m\widehat{p}_{m} in Kolmogorov distance and the definition of Right-Interval. Also, by Observation 3.1, p^m(Jj)(j1)/2τ((j1)/21)τ\widehat{p}_{m}(J_{j})\geq\lfloor(j-1)/2\rfloor\tau\geq((j-1)/2-1)\tau, and hence

again by closeness in Kolmogorov distance.

Since pp is non-decreasing on [1,i01][1,i_{0}-1], we have

The right-hand side above is at most ε/2\varepsilon/2 by our choice of τ\tau (with an appropriate constant in the big-oh).

Similarly, let PR={IPI[i0+1,n]}\mathcal{P}_{R}=\{I\in\mathcal{P}\mid I\subset[i_{0}+1,n]\} be the collection of intervals to the right of i0i_{0}. An identical analysis (using the obvious analogue of Lemma 4.1 for non-increasing log-concave distributions on [i0+1,n][i_{0}+1,n]) shows that the contribution of intervals in PR\mathcal{P}_{R} to Δ\Delta is at most ε/2\varepsilon/2.

Finally, let I0PI_{0}\in\mathcal{P} be the interval containing i0i_{0}. If I0I_{0} is a singleton, it does not contribute to Δ\Delta. Otherwise, p^m(I0)τ\widehat{p}_{m}(I_{0})\leq\tau and p(I0)2τp(I_{0})\leq 2\tau, hence the contribution of I0I_{0} to Δ\Delta is at most 2τ2\tau.

Our claimed upper bounds follow from the above theorem by using our framework of Section 3. Indeed, it is clear that we can learn any unknown log-concave distribution by running Algorithm Decompose-Log-Concave(p,ε,δ/2)(p,\varepsilon,\delta/2) to obtain a decomposition P{\cal P} and then Algorithm Learn-Known-Decomposition(p,P,ε,δ/2)(p,P,\varepsilon,\delta/2) to obtain a hypothesis distribution hh:

Theorem 4.2 of course implies that every log-concave distribution pp is (ε,O(log(1/ε)/ε))(\varepsilon,O(\log(1/\varepsilon)/\varepsilon))-flat. We may thus apply Corollary 3.1 and obtain our main learning result for kk-mixtures of log-concave distributions:

Lower bounds. It is shown in [DL01, Lemma 15.1] that learning a continuous distribution whose density is bounded and convex over $toaccuracyto accuracy\varepsilonrequiresrequires\Omega((1/\varepsilon)^{5/2})samples.Aneasyadaptationofthisargumentimpliesthesameresultforaboundedconcavedensityoversamples. An easy adaptation of this argument implies the same result for a bounded concave density over.Byanappropriatediscretizationprocedure,onecanshowthatlearningadiscreteconcavedensityover. By an appropriate discretization procedure, one can show that learning a discrete concave density over[n]requiresrequires\Omega((1/\varepsilon)^{5/2})samplesforallsamples for all\varepsilon\geq 1/n^{\Omega(1)}.Sinceadiscreteconcavedistributionisalsologconcave,thesamelowerboundholdsforthiscasetoo.ForthecaseofSince a discrete concave distribution is also log-concave, the same lower bound holds for this case too. For the case ofkmixtures,wemayconsiderauniformmixtureof-mixtures, we may consider a uniform mixture ofkcomponentdistributionswherethecomponent distributions where theithdistributioninthemixtureissupportedon-th distribution in the mixture is supported on[1+(i-1)n/k,in/k]andislogconcaveonitssupport.Itisclearthateachcomponentdistributionislogconcaveoverand is log-concave on its support. It is clear that each component distribution is log-concave over[n],anditisnotdifficulttoseethatinordertolearnsuchamixturetoaccuracy, and it is not difficult to see that in order to learn such a mixture to accuracy\varepsilon,atleast, at least9/10ofthecomponentdistributionsmustbelearnedtototalvariationdistanceatmostof the component distributions must be learned to total variation distance at most10\varepsilon.Wethusgetthatfor. We thus get that fork\leq n^{1-\Omega(1)}andand\varepsilon\geq 1/n^{\Omega(1)},anyalgorithmforlearningamixtureof, any algorithm for learning a mixture ofklogconcavedistributionstoaccuracylog-concave distributions to accuracy\varepsilonmustusemust use\Omega(k\varepsilon^{-5/2})$ samples.

Learning mixtures of MHR distributions

In this section we apply our general method from Section 3 to learn monotone hazard rate (MHR) distributions over [n][n] and mixtures of such distributions.

It is known that every log-concave distribution over [n][n] is MHR but the converse is not true, as can easily be seen from the fact that every monotone non-decreasing distribution over [n][n] is MHR.

In Section 5.1 we prove that every MHR distribution over [n][n] has an (ε,O(log(n/ε)/ε))(\varepsilon,O(\log(n/\varepsilon)/\varepsilon))-flat decomposition. We combine this with our general results from Section 3 to get learning results for mixtures of MHR distributions.

Our algorithm to construct a flat decomposition of an MHR distribution pp is Decompose-MHR, given below. Note that this algorithm takes an explicit description of pp as input and does not draw any samples from pp. Roughly speaking, the algorithm works by partitioning [n][n] into intervals such that within each interval the value of pp never deviates from its value at the leftmost point of the interval by a multiplicative factor of more than (1+ε/8).(1+\varepsilon/8).

Algorithm Decompose-MHR(p,ε)(p,\varepsilon):

Input: explicit description of MHR distribution pp over [n][n]; accuracy parameter ε>0\varepsilon>0

Set J=[n]J=[n] and initialize Q\mathcal{Q} to be the empty set.

Let II be the interval returned by Right-Interval(p,J,ε/8)({p},J,\varepsilon/8), and II^{\prime} be the interval returned by Right-Interval(p,JI,ε/8)({p},J\setminus I,\varepsilon/8). Set J=J(II)J=J\setminus(I\cup I^{\prime}).

Set iJi\in J to be the smallest integer such that p(i)ε/(4n)p(i)\geq\varepsilon/(4n). If no such ii exists, let I=JI^{\prime\prime}=J and go to Step 5. Otherwise, let I=[1,i1]I^{\prime\prime}=[1,i-1] and J=JIJ=J\setminus I^{\prime\prime}.

Let jJj\in J be the smallest integer such that either p(j)>(1+ε/8)p(i)p(j)>(1+\varepsilon/8)p(i) or p(j)<11+ε/8p(i)p(j)<\frac{1}{1+\varepsilon/8}p(i) holds. If no such jj exists let I=JI^{\prime\prime\prime}=J, otherwise, let I=[i,j1]I^{\prime\prime\prime}=[i,j-1].

Add II^{\prime\prime\prime} to Q\mathcal{Q}, and set J=JIJ=J\setminus I^{\prime\prime\prime}.

Return P=Q{I,I,I}\mathcal{P}=\mathcal{Q}\cup\{I,I^{\prime},I^{\prime\prime}\}.

Our first lemma for the analysis of Decompose-MHR states that MHR distributions satisfy a condition that is similar to being monotone non-decreasing:

If p(b+1)>p(a)p(b+1)>p(a) then p(b+1)11+ηp(a)p(b+1)\geq{\frac{1}{1+\eta}}p(a) holds directly, so for the rest of the proof we may assume that p(b+1)p(a)p(b+1)\leq p(a).

By the definition of the MHR condition we have p(a)p([a+1,n])p(b+1)p([b+2,n])\frac{p(a)}{p([a+1,n])}\leq\frac{p(b+1)}{p([b+2,n])} and hence

Let Q={I1,I2,,IQ}\mathcal{Q}=\{I_{1},I_{2},\dots,I_{|\mathcal{Q}|}\}, with Ii=[ai,bi]I_{i}=[a_{i},b_{i}], 1iQ1\leq i\leq|\mathcal{Q}|, where ai<ai+1a_{i}<a_{i+1}. Let Q={IiQ:p(ai)>p(ai+1)}\mathcal{Q}^{\prime}=\{I_{i}\in\mathcal{Q}:p(a_{i})>p(a_{i+1})\} and Q={IiQ:p(ai)p(ai+1)}\mathcal{Q}^{\prime\prime}=\{I_{i}\in\mathcal{Q}:p(a_{i})\leq p(a_{i+1})\}. Thus, Q\mathcal{Q}^{\prime} consists of those intervals II in Q\mathcal{Q} which are such that the following interval’s initial value is significantly smaller than the initial value of II, and Q\mathcal{Q}^{\prime\prime} consists of those IQI\in\mathcal{Q} for which the following interval’s initial value is significantly larger than the initial value of II. We also denote Ri=[ai+1,n]R_{i}=[a_{i+1},n]. For convenience, we also let aQ+1=bQ+1a_{|\mathcal{Q}|+1}=b_{|\mathcal{Q}|}+1.

We first bound the “total multiplicative decrease in pp” across all intervals in Q\mathcal{Q}^{\prime}:

We have IiQp(ai)p(ai+1)8ϵ.\mathop{\textstyle\prod}_{I_{i}\in\mathcal{Q}^{\prime}}\frac{p(a_{i})}{p(a_{i+1})}\leq\frac{8}{\epsilon}.

Observation 3.1 implies that the total probability mass p(II)p(I\cup I^{\prime}) on intervals II and II^{\prime} is at least ϵ/8\epsilon/8. We thus have

where the first inequality follows from Lemma 5.1, the second inequality is self-evident, the equality follows from the telescoping product, and the final inequality is because p(II)ε/8.p(I\cup I^{\prime})\geq\varepsilon/8.

At this point we can bound the number of intervals produced by Decompose-MHR:

Step 4 of Algorithm Decompose-MHR adds at most O(log(n/ϵ)/ϵ)O(\log(n/\epsilon)/\epsilon) intervals to Q\mathcal{Q}.

We first bound the number of intervals in Q\mathcal{Q}^{\prime}. Let the intervals in Q\mathcal{Q}^{\prime} be I1,I2,IQI^{\prime}_{1},I^{\prime}_{2},\dots I^{\prime}_{|\mathcal{Q}^{\prime}|}, where Ij=[aj,bj]I^{\prime}_{j}=[a^{\prime}_{j},b^{\prime}_{j}] and a1>a2>>aQa_{1}^{\prime}>a_{2}^{\prime}>\dots>a_{|\mathcal{Q}^{\prime}|}^{\prime}. Observation 3.1 implies that the total probability mass p(II)p(I\cup I^{\prime}) is at least ϵ/8\epsilon/8. Hence, p([b1+1,n])p([b_{1}^{\prime}+1,n]) is at least ϵ/8\epsilon/8 and we have p(R1)ε/8p(R^{\prime}_{1})\geq\varepsilon/8. For j1j\geq 1 it holds

Consequently the number of intervals in Q\mathcal{Q}^{\prime} is bounded by O(log(1/ϵ)/ϵ)O(\log(1/\epsilon)/\epsilon).

Now we bound the number of intervals in Q\mathcal{Q}^{\prime\prime}. We consider the value of IiQp(ai+1)p(ai)\prod_{I_{i}\in\mathcal{Q}}\frac{p(a_{i+1})}{p(a_{i})}:

Since p(aQ+1)1p(a_{|\mathcal{Q}|+1})\leq 1 and p(a1)ε/(4n)p(a_{1})\geq\varepsilon/(4n), the above is at most 4n/ε4n/\varepsilon; using Lemma 5.2, we get that

On the other hand, for every IiQI_{i}\in\mathcal{Q}^{\prime\prime} we have that p(ai+1)p(ai)(1+ε/8)\frac{p(a_{i+1})}{p(a_{i})}\geq(1+\varepsilon/8). Consequently there can be at most O((1/ε)log(n/ϵ))O((1/\varepsilon)\log(n/\epsilon)) intervals in Q\mathcal{Q}^{\prime\prime}, and the proof is complete. ∎

It remains only to show that P\mathcal{P} is actually a flat decomposition of pp:

Algorithm Decompose-MHR outputs a partition P\mathcal{P} of [n][n] that is (p,ε,log(n/ε)/ε))(p,\varepsilon,\log(n/\varepsilon)/\varepsilon))-flat.

Now for each interval II^{\prime\prime\prime} in Q\mathcal{Q}, we have

Applying Corollary 3.1, we get our main learning result for mixtures of MHR distributions:

Lower bounds. By adapting a lower bound of [Bir87a] (for monotone distributions over a continuous interval) it can be shown that for ε1/nΩ(1)\varepsilon\geq 1/n^{\Omega(1)}, any algorithm for learning a monotone distribution over [n][n] to accuracy ε\varepsilon must use Ω(log(n)/ε3)\Omega(\log(n)/\varepsilon^{3}) samples. We may consider a uniform mixture of kk component distributions where the ii-th distribution in the mixture is supported on and monotone non-decreasing over [1+(i1)n/k,in/k][1+(i-1)n/k,in/k]. Each component distribution is MHR (over the entire domain). The same argument as in the log-concave case implies that, for kn1Ω(1)k\leq n^{1-\Omega(1)} and ε1/nΩ(1)\varepsilon\geq 1/n^{\Omega(1)}, any algorithm for learning a mixture of kk MHR distributions to accuracy ε\varepsilon must use Ω(klog(n)/ε3)\Omega(k\log(n)/\varepsilon^{3}) samples.

Learning mixtures of unimodal and t𝑡t-modal distributions

In this section we apply our general method from Section 3 to learn mixtures of unimodal (and, more generally, tt-modal) distributions over [n].[n]. Here our task is quite easy because of a result of L. Birgé [Bir87b] which essentially provides us with the desired flat decompositions. We note that Birgé’s structural result was obtained as part of an efficient learning algorithm for monotone distributions; Birgé subsequently gave an efficient learning algorithm for unimodal distributions [Bir97]. However, we are not aware of work prior to ours on learning mixtures of unimodal or tt-modal distributions.

We begin by defining unimodal and tt-modal distributions over [n][n]:

A distribution pp over [n][n] is unimodal if there exists i[n]i\in[n] such that pp is non-decreasing over [1,i][1,i] and non-increasing over [i,n][i,n]. For t>1t>1, distribution pp over [n][n] is tt-modal if there is a partition of [n][n] into tt intervals I1,,ItI_{1},\dots,I_{t} such that the sub-distributions pI1,,pItp^{I_{1}},\dots,p^{I_{t}} are unimodal.

By adapting a construction of Birgé (proved in [Bir87b] for distributions over the continuous real line) to the discrete domain [n][n], [DDS+13] established the following:

Let pp be any monotone distribution (either non-increasing or non-decreasing) over [n][n]. Then pp is (ε,O(log(n)/ε))(\varepsilon,O(\log(n)/\varepsilon))-flat.

We note that it can be shown (using the same construction that is used in the Ω(log(n)/ε3)\Omega(\log(n)/\varepsilon^{3}) sample complexity lower bound of [Bir87a] for learning monotone distributions) that O(log(n)/ε)O(\log(n)/\varepsilon) is the best possible bound for the number of intervals required in Theoorem 6.1.

An immediate consequence of Theorem 6.1 is that any unimodal distribution over [n][n] is (ε,O(log(n)/ε))(\varepsilon,O(\log(n)/\varepsilon))-flat, and any tt-modal distribution over [n][n] is (ε,O(tlog(n)/ε))(\varepsilon,O(t\cdot\log(n)/\varepsilon))-flat. Using Corollary 3.1 we thus obtain the following results for learning mixtures of unimodal or tt-modal distributions:

Lower bounds. The lower bound arguments we gave for mixtures of MHR distributions (which are based on Birgé’s lower bounds for learning monotone distributions) apply unchanged for mixtures of unimodal distributions, since every distribution which is supported on and monotone non-decreasing over [1+(i1)n/k,in/k][1+(i-1)n/k,in/k] is unimodal over [n].[n].

Conclusions and future work

This work introduces a simple general approach to learning mixtures of “structured” distributions over discrete domains. We illustrate the usefulness of our approach by showing it yields nearly optimal algorithms for learning mixtures of natural and well-studied classes (log-concave, MHR and unimodal) and in the process we establish novel structural properties of these classes.

Are there any other natural distribution classes for which our general framework is applicable? We suspect so. At the technical level, the linear dependence on the parameters kk and tt in the sample complexity of Theorem 1.1 is optimal (up to constant factors). It would be interesting to improve the dependence on 1/ε1/\varepsilon from cubic down to quadratic (which would be best possible) with an efficient algorithm.

References

Appendix A Proof of Proposition 1.1

At a high level, the algorithm AA^{\prime} works by drawing a large set of samples from the target mixture and trying all possible ways of partitioning the sample into kk disjoint subsamples. For each partition of the sample it runs algorithm AA over each subsample and combines the resulting hypothesis distributions (guessing the mixture weights) to obtain a hypothesis mixture distribution. Finally, a “hypothesis testing” procedure is used to identify a high-accuracy hypothesis from the collection of all hypotheses distributions obtained in this way.

More precisely, let pp denote the unknown target kk-mixture of distributions from C\mathfrak{C}. Algorithm AA^{\prime} works as follows:

For each possible way of partitioning SS into kk disjoint subsamples Sˉ=(S1,,Sk)\bar{S}=(S_{1},\dots,S_{k}) such that each Sim(n,ε/20)log(5k/δ)|S_{i}|\geq m(n,\varepsilon/20)\cdot\log(5k/\delta), run algorithm AA a total of kk times, using SiS_{i} as the input sample for the ii-th run, to obtain hypothesis distributions h1Sˉ,,hkSˉh^{\bar{S}}_{1},\dots,h^{\bar{S}}_{k}. For each vector μ=(μ1,,μk)\mu=(\mu_{1},\dots,\mu_{k}) of non-negative mixing weights that sum to 1 and satisfy μi=\mu_{i}= (integer)ε/(20k)\cdot\varepsilon/(20k), let hμSˉh^{\bar{S}}_{\mu} be the mixture distribution i=1kμihiSˉ.\sum_{i=1}^{k}\mu_{i}h^{\bar{S}}_{i}.

Draw M=O(Mlogk+klog(k/ε))log(5/δ)/ε2M^{\prime}=O(M\log k+k\log(k/\varepsilon))\cdot\log(5/\delta)/\varepsilon^{2} samples from pp and use them to run the “hypothesis testing” routine described in Lemma 11 of [DDS12b] over all hypotheses hμSˉh^{\bar{S}}_{\mu} obtained in the previous step. Output the hypothesis distribution that this routine outputs.