How Well Can Generative Adversarial Networks Learn Densities: A Nonparametric View

Tengyuan Liang

Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Li et al., 2015; Arjovsky et al., 2017; Dziugaite et al., 2015) have stood out as an important unsupervised method for learning and efficient sampling from a complicated, multi-modal target data distribution. Despite its celebrated empirical success in image tasks, there are many theoretical questions to be answered (Liu et al., 2017; Arora and Zhang, 2017; Arora et al., 2017a).

One convenient formulation of the GAN framework (Arjovsky et al., 2017; Li et al., 2015; Dziugaite et al., 2015; Liu et al., 2017) solves the following minimax problem, at the population level,

In plain language, given a target distribution ν\nu, one seeks for a distribution μ\mu from a probability distribution generator class μG\mu_{G}, such that it minimizes the loss incurred by the best discriminator function inside the discriminator class FD\mathcal{F}_{D}. In practice, both the generator class and the discriminator class are represented by neural networks. μG\mu_{G} quantifies the transformed distributions realized by a network with random inputs (either Gaussian or uniform distribution), and FD\mathcal{F}_{D} represents the functions that are realizable by a certain neural network architecture. We refer the readers to Liu et al. (2017) for other more general formulations of GANs.

Two fundamental yet basic questions that puzzle machine learning theorists are: (1) how well does GAN learn the densities statistically, overlooking the optimization difficulty? (2) how well does the iterative optimization dynamics of solving the minimax problem approximate the optimal solution?

In the statistics literature, density estimation has been a central topic in nonparametric statistics (Nemirovski, 2000; Tsybakov, 2009; Wassermann, 2006). The minimax optimal rate of convergence has been understood fairly well, for a wide range of density function classes quantified by its smoothness. We would like to point out a simple yet convincing connection between two fields: in nonparametric statistics, the model grows in size to accommodate the complexity of the data, which is reminiscent of the model complexity in GANs, and more generally in the deep neural networks.

Note the GAN framework mentioned above is flexible. Define the following metric induced by the function class F\mathcal{F},

If F\mathcal{F} contains all Lipschitz-11 functions, then dF(μ,ν)d_{\mathcal{F}}(\mu,\nu) is the Wasserstein-1 metric (Wasserstein-GAN (Arjovsky et al., 2017)). When F\mathcal{F} represents all functions bounded by 11, then dF(μ,ν)d_{\mathcal{F}}(\mu,\nu) is the total variation metric (Radon metric). Let H\mathcal{H} be a reproducing kernel Hilbert space (RKHS) and k(,)k(\cdot,\cdot) be its kernel. If F\mathcal{F} consists of functions in the closure of the span of the set {k(,x),xΩ}\{k(\cdot,x),x\in\Omega\}, then dF(μ,ν)=d(μν)/dxHd_{\mathcal{F}}(\mu,\nu)=\|d(\mu-\nu)/dx\|_{\mathcal{H}} (MMD-GAN (Dziugaite et al., 2015; Li et al., 2015)).

Consider the density function that lies in Sobolev space with smoothness parameter α\alpha

and the discriminator function class FD=Wβ,2(L)\mathcal{F}_{D}=W^{\beta,2}(L) with smoothness β\beta. As α,β\alpha,\beta varies, FR\mathcal{F}_{R} describes a wide range of nonparametric densities, and FD\mathcal{F}_{D} provides a rich hierarchy of critic metrics. Let nn be the sample size, dd be the dimension.

In contrast, as long as α>βd/(2β)1\alpha>\frac{\beta}{d/(2\beta)-1}, the GAN estimator μ^n\hat{\mu}_{n} with empirical measure ν^n\hat{\nu}_{n} only achieves a considerably slower rate

Nonparametric estimation with GAN framework. The GAN framework for nonparametric density estimation enjoys the upper bound

One may wonder whether this rate can be significantly improved by other approaches. We show that for any procedure νn\nu_{n} based on nn samples, the minimax lower bound under the metric dFDd_{\mathcal{F}_{D}} reads

In the case when dd is large, the exponent in the upper bound and the minimax lower bounds are very close to each other, in the sense that α+β2α+d=(1+O(1/d))α+β2(α+β)+d\frac{\alpha+\beta}{2\alpha+d}=(1+\mathcal{O}(1/d))\frac{\alpha+\beta}{2(\alpha+\beta)+d}.

In the case when both the generator and discriminator are realized by neural networks. We establish the following results using the insights gained from above.

Networks can learn densities. We make progress on answering how well generative adversarial networks learn nonparametric densities dν(x)/dxWα,(L)d\nu(x)/dx\in W^{\alpha,\infty}(L) with smoothness parameter α\alpha, under the Wasserstein metric.

In addition, for all estimators νn\nu_{n} based on nn samples, the minimax rate cannot be smaller than

Here C,c>0C,c>0 are constants independent of nn. Note that by Yarotsky (2017), one can approximate Wk,(1)W^{k,\infty}(1) within ϵ\epsilon accuracy by ReLU networks with log(1/ϵ)\log(1/\epsilon) depth and poly(1/ϵ)\text{poly}(1/\epsilon) units for integer kk. The formal and more general statment can be found in Thm. 3.1.

2 Preliminaries

For integer kk, define the Sobolev space Wk,p(L)W^{k,p}(L) (p=2,p=2,\infty)

where α\alpha is a multi-index and D(α)D^{(\alpha)} denotes the α\alpha-weak derivative.

Another equivalent definition of Sobolev space for p=2p=2 is through the coefficients of the generalized Fourier series, which is also referred to as the Sobolev ellipsoid.

A nonparametric view of the GAN framework

The following oracle inequality holds for GAN.

Let F\mathcal{F} be any critic function class. Denote μn\mu_{n} as the solution (w.r.t. the empirical estimate νn\nu_{n}) to GAN with generator μG\mu_{G} and discriminator FD\mathcal{F}_{D}

Then the following decompositions hold for any distribution ν\nu,

Let’s remark on the decompositions. Eqn. (2.1) use dFDd_{\mathcal{F}_{D}} as the objective evaluation metric. The first term is the best approximation error within the generator class, even if we have population access to the true measure ν\nu. The second term is the statistical error, also called the generalization error, due to the fact that we have only finite nn-samples. Eqn. (2.2) employs a different dFd_{\mathcal{F}} as the objective metric, while using the mis-matched discriminator class FD\mathcal{F}_{D} in the GAN procedure. The first term describes the approximation error induced by the generator, the second term corresponds to how well the discriminator serves as a surrogate for the objective metric, and the third term is the statistical error.

Using the above theorem, choose FD\mathcal{F}_{D} as the critic metric, we need to study the excess loss

We will start with a crude bound when dνn(x)/dx=1ni=1nδXi(x)d\nu_{n}(x)/dx=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}(x) is chosen as the empirical density. It turns out that one can significantly improve this bound through choosing a better “regularized” or “smoothed” version of empirical measure, in the context of learning nonparametric densities. The empirical measure is not optimal because one can leverage the complexity/smoothness of the measure ν\nu, and the complexity of FD\mathcal{F}_{D} to improve upon the generalization error.

2 Upper bound for arbitrary density

Take dν^n(x)/dx=1ni=1nδXi(x)d\hat{\nu}_{n}(x)/dx=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_{i}}(x), then

Assuming supfFDf1\sup_{f\in\mathcal{F}_{D}}\|f\|_{\infty}\leq 1, one has the standard entropy integral bound,

Plugging in the entropy estimate logN(ϵ,FD,)\log\mathcal{N}(\epsilon,\mathcal{F}_{D},\|\cdot\|_{\infty}) for various functional classes (Nickl and Pötscher (2007) and reference therein), one can easily derive the following corollary.

It is easy to see that overlooking the distributional information — by simply going through symmetrization and empirical processes theory — can lead to suboptimal result when the true density is smooth. Roughly speaking, the symmetrization approach treats the true density as a highly non-smooth one with α=0\alpha=0 (say the empirical density). Next, we will investigate how to improve GAN when the true density lies in Sobolev spaces Wα,2(L),α>0W^{\alpha,2}(L),\alpha>0.

3 Smoothness helps: improved upper bound for Sobolev spaces

Now we show that one can achieve faster rates in the GAN framework for density estimation, simultaneously leveraging the smoothness information in the density dνd\nu and the metric FD\mathcal{F}_{D}.

Consider the density function class that lies in Sobolev space with smoothness parameter α\alpha, and the discriminator function class with smoothness β\beta,

Theorem 2.1 tells us the rate of convergence satisfies the following upper bound

this nonparametric upper bound achieves a faster rate than the symmetrization bound which ignores the smoothness of the distribution. Remark in the real applications of GAN, the dimension dd is rather large (in images usually d>103d>10^{3}), and β\beta is fairly small for a strong metric (β=1\beta=1 in Wasserstein GAN), so this condition is satisfied for any α>0.002\alpha>0.002. In summary, we write out the explicit rate of convergence,

Let’s broaden the discussions slightly by considering different base measures π(x)\pi(x) beyond the Lebesgue measure, and the generalized Fourier basis. One can think of π(x)\pi(x) as the uniform measure on d^{d} or the product Gaussian measure. An equivalent formulation of the GAN problem that translates learning a distribution dμ(x)=g(x)dπ(x)d\mu(x)=g(x)d\pi(x) to learning the importance score/density ratio function g(x)g(x) with respect to a base measure π(x),xX\pi(x),x\in\mathcal{X}, can be viewed as

Assume the generalized Fourier basis are bounded maxξψξ(x)C\max_{\xi}\|\psi_{\xi}(x)\|_{\infty}\leq C. Consider the density function class with smoothness α\alpha, and the discriminator function class FD\mathcal{F}_{D} with smoothness β\beta

4 Minimax lower bound

Can the rate obtained by GAN framework be significantly improved by any other procedure? We consider in this section the minimax lower bound for the intrinsic difficulty of nonparametric estimation under the GAN discriminator metric. Here we consider a nonparametric function estimation problem on the Sobolev ellipsoid Θα(L)\Theta^{\alpha}(L), which is statistically equivalent to the density estimation problem over the Sobolev space Wα,2(L)W^{\alpha,2}(L) asymptotically (Tsybakov, 2009). A separate lower bound for the density estimation within the smaller Hölder class Wα,(L)W^{\alpha,\infty}(L) is proved in Thm. 3.1.

Consider the function class with smoothness α\alpha and the discriminator function class FD\mathcal{F}_{D} with smoothness β\beta

In the Gaussian sequence model (2.4), for any estimator νn\nu_{n},

Let us compare our lower bound with the upper bound in Remark 2.1. When the generator function class is rich enough so that μG\mu_{G} contains the ν\nu, the upper bound becomes

Remark again, when take as a special case the base measure π(x)\pi(x) being uniform on d^{d}, FR=Wα,2(L)\mathcal{F}_{R}=W^{\alpha,2}(L) and FD=Wβ,2(L)\mathcal{F}_{D}=W^{\beta,2}(L) are the usual Sobolev spaces. In the case when dd is large, the exponents in the upper and lower bounds are very close to each other, in the sense that α+β2α+d=(1+od(1))α+β2(α+β)+d\frac{\alpha+\beta}{2\alpha+d}=(1+o_{d}(1))\frac{\alpha+\beta}{2(\alpha+\beta)+d}.

How well networks learn densities

In this section, we answer in a quantitative way that when both the generator class and the discriminator class represented by deep ReLU networks are rich enough, one do learn the distribution. One can view our results as establishing rate of convergence and fundamental difficulty for learning the distribution under the GAN framework, for a wide range of densities with different smoothness properties. It builds a more detailed theory upon the seminal work of Goodfellow et al. (2014), where they proved that when sample sizes, generator sizes, and discriminator sizes are all infinite, one learns the distribution.

Let’s remark on the universal approximation property of deep networks first before introducing the result. It is well known that deep neural networks are universal approximators (Hornik et al., 1989). In particular, Yarotsky (2017) constructed a fixed ReLU network architecture, denoted as Freluα(ϵ)\mathcal{F}_{relu}^{\alpha}(\epsilon), that enjoys the following two properties:

It approximates all functions in Wβ,(1)W^{\beta,\infty}(1) in the following sense

It has depth at most O(log1/ϵ)\mathcal{O}(\log 1/\epsilon) and at most O(ϵd/βlog1/ϵ)\mathcal{O}(\epsilon^{-d/\beta}\log 1/\epsilon) weights and computation units.

With the above in mind, we are ready to state the following theorem.

The discriminator network class can approximate the Sobolev space Wβ,(1)W^{\beta,\infty}(1), in the sense that maxfWβ,(1)minfFDffϵ\max_{f\in W^{\beta,\infty}(1)}\min_{f^{\prime}\in\mathcal{F}_{D}}\|f-f^{\prime}\|_{\infty}\leq\epsilon;

The generator network class can approximate the Sobolev space νWα,(L)\nu\in W^{\alpha,\infty}(L), in the sense that for any true density dν/dxWα,(L)d\nu/dx\in W^{\alpha,\infty}(L), minμμGμνϵ\min_{\mu\in\mu_{G}}\|\mu-\nu\|_{\infty}\leq\epsilon;

The discriminator network class is not overly complex in the sense that FDWβ,2(L)W0,(M)\mathcal{F}_{D}\subset W^{\beta,2}(L)\cap W^{0,\infty}(M). In other words, the weak derivatives are integrable and the function is bounded.

In addition, the minimax lower bound for any estimator νn\nu_{n} based on nn samples satisfies,

Here Ci,i=1,2,3C_{i},i=1,2,3 are constants independent of nn.

Note the lower bound here is harder than that in Theorem 2.3 because: (1) the class is in fact Hölder (subset of the Sobolev), Wα,(L)Wα,2(L)W^{\alpha,\infty}(L)\subset W^{\alpha,2}(L), (2) for density estimation directly.

Let us make few remarks on the objective metrics and rates here. When β=1\beta=1, the objective metric is equivalent to the Wasserstein-11 metric (Lemma 4.2). Therefore with properly chosen discriminator and generator network GAN enjoys the upper bound nα+12α+2+dn^{-\frac{\alpha+1}{2\alpha+2+d}}, while the minimax lower bound being nα+12α+dn^{-\frac{\alpha+1}{2\alpha+d}}. Consider the following two scenarios:

When the dimension dd is very large, for a fixed smoothness parameter α\alpha, these two rates are very close. In this case, to obtain an ϵ\epsilon-error in Wasserstein metric, GAN requires exponential number of samples (in dimension) n=ϵ(2+dα+1)n=\epsilon^{-(2+\frac{d}{\alpha+1})}, so does any other procedure as indicated by the lower bound.

When the the smoothness parameter scales with the dimension, say α+1d/C0\alpha+1\geq d/C_{0} for some constant C0>0C_{0}>0, then GAN only requires a polynomial number of samples n=ϵ(2+C0)n=\epsilon^{-(2+C_{0})}. In the large dimension setting when dd\rightarrow\infty, any other procedure will require at least n=ϵ(2+C0)n=\epsilon^{-(2+C_{0})} as well.

Let us comment on the assumptions used in the theorem. We consider the function class realized by ReLU networks. Define c(α,β,d)=α+β2(α+β)+d(βd12)c(\alpha,\beta,d)=\frac{\alpha+\beta}{2(\alpha+\beta)+d}\vee(\frac{\beta}{d}\wedge\frac{1}{2}) and let’s choose ϵnc(α,β,d)\epsilon\asymp n^{-c(\alpha,\beta,d)}.

The assumption A.1 on the discriminator network class can be realized by subsuming a specific fixed architecture (Yarotsky, 2017) as a subclass FDFreluβ,(ϵ)\mathcal{F}_{D}\supseteq\mathcal{F}_{relu}^{\beta,\infty}(\epsilon), where the network Freluβ,(ϵ)\mathcal{F}_{relu}^{\beta,\infty}(\epsilon) has depth O(c(α,β,d)logn)\mathcal{O}(c(\alpha,\beta,d)\log n), and O(ndβc(α,β,d))\mathcal{O}^{*}(n^{\frac{d}{\beta}c(\alpha,\beta,d)}) weights and computational units. Roughly, this means that one may need a deep discriminator network (scales logarithmically with number of samples in depth, and polynomially in number of units) to recover the density in a strong sense such as under the Wasserstein metric.

The assumption A.2 on the generative network class can also be realized if we reformulate the GAN in an importance sampling way: given uniform samples XdX\in^{d}, the generative network outputs a importance/density score μ(X)\mu(X) realized by function realized by the network. Again, the deep network μGFreluβ,(ϵ)\mu_{G}\supseteq\mathcal{F}_{relu}^{\beta,\infty}(\epsilon) with depth O(c(α,β,d)logn)\mathcal{O}(c(\alpha,\beta,d)\log n) and O(ndβc(α,β,d))\mathcal{O}^{*}(n^{\frac{d}{\beta}c(\alpha,\beta,d)}) weights does the job. Note in practice, the generative network learns a transformation of XX with μ(X)\mu(X) as the density, instead of assigning the importance score. We do acknowledge that learning the transformation mapping (represented by the network) from the input random measure to the target measure has computational advantage over the importance sampling approach. However, overlooking the computation, output an importance score is also plausible to learn a distribution theoretically.

The assumption A.3 can be relaxed. In fact, one can prove a variant of the theorem without assumption 3, in the sense

It formalizes the intuition that if we choose a too strict metric as the discriminator (FD\mathcal{F}_{D} too rich), even though we are evaluating on a weaker metric (induced by a smaller function space Wβ,(1)W^{\beta,\infty}(1)), there is a price to pay in the rates for choosing the overly complex space FD\mathcal{F}_{D} as discriminator.

2 Error rates for deeper ReLU discriminator networks

In the GAN formulation, the discriminator class FD\mathcal{F}_{D} is represented by functions realized by a neural network with a certain architecture. In this section, we will apply our theory to obtain bounds for a wide range of feed-forward multi-layer networks with ReLU activation as the discriminator metric.

Denote σ(x)=max(x,0)\sigma(x)=\max(x,0) as the ReLU activation. Consider the following feed-forward multi-layer network:

There is constant V1V\geq 1 such that for each unit in the network, the vector ww of weights associated with that unit has w1V\|w\|_{1}\leq V.

Mathematically, one can define the function class induced by the network through the recursive definition: F(0)(V)={xxi:x=(x1,,xd)d,i[d]}{0,1}\mathcal{F}^{(0)}(V)=\{x\rightarrow x_{i}:x=(x_{1},\ldots,x_{d})\in^{d},i\in[d]\}\cup\{0,1\}, and for any i1i\geq 1,

Let’s remark on how our bound improves upon what is known, and when one can improve GAN by using a better estimate than the empirical measure ν^n\hat{\nu}_{n}.

Recall the covering number bound (Thm. 14.17) for deeper networks in Anthony and Bartlett (2009), one knows that

More concretely, consider the following two extreme scenarios:

Technical Proofs

For any μμG\mu\in\mu_{G}, we know that due to the optimality of GAN

where we use the following fact dF(μ1,μ2)+dF(μ2,μ3)dF(μ1,μ3)d_{\mathcal{F}}(\mu_{1},\mu_{2})+d_{\mathcal{F}}(\mu_{2},\mu_{3})\geq d_{\mathcal{F}}(\mu_{1},\mu_{3}) and

It is easy to bound the following using similar logic

This matches the best known bound as in Canas and Rosasco (2012) (Section 2.1.1).

Remark in addition that the parametric rate 1n\frac{1}{\sqrt{n}} is inevitable, which can be easily seen from the Sudakov minoration,

where based on i.i.d. samples X(1),X(2),X(n)νX^{(1)},X^{(2)},\ldots X^{(n)}\sim\nu

For any νFR\nu\in\mathcal{F}_{R} (or equivalently gWα,2(Lα)g\in W^{\alpha,2}(L_{\alpha})), we have

For the second term, the following inequality holds

Combining two terms, we have for any νFR\nu\in\mathcal{F}_{R} (or equivalently gWα,2(L)g\in W^{\alpha,2}(L)),

And the optimal choice of Mn12(α+β)+dM\asymp n^{\frac{1}{2(\alpha+\beta)+d}}. Note that this is a more aggressive smoothing scheme than classic nonparametric estimation with smoothness α\alpha, due to the fact that we are utilizing the smoothness in the evaluation metric at the same time.

The proof uses the standard Fano’s inequality.

Assume that H2H\geq 2 and suppose Θ\Theta contains θ0,θ1,,θH\theta_{0},\theta_{1},\ldots,\theta_{H} such that:

d(θj,θk)2s>0d(\theta_{j},\theta_{k})\geq 2s>0, for all j,k[H]j,k\in[H] and jkj\neq k.

1Hj=1HDKL(Pj,P0)αlogH\frac{1}{H}\sum_{j=1}^{H}D_{\rm KL}(P_{j},P_{0})\leq\alpha\log H with 0<α<1/80<\alpha<1/8 and Pj=PθjP_{j}=P_{\theta_{j}} for j[H]j\in[H].

Let’s construct a mixture of hypothesis on the function space Wα,2(L)W^{\alpha,2}(L), and a subset of discriminator functions in Wβ,2(L)W^{\beta,2}(L), such that the multiple testing problem among the mixture of hypothesis is hard, and thus the loss induced by the best discriminator among the subset provides a lower bound on the estimation rate. Let’s construct this mixture in the frequency domain. Choose M>0M>0 to be determined later, denote the hypothesis class of interest to be

It is easy to verify that ΩαWα,2(L)\Omega_{\alpha}\subset W^{\alpha,2}(L) because for any gw(x)g_{w}(x), we have

Similarly, let’s consider ΛβWβ,2(L)\Lambda_{\beta}\subset W^{\beta,2}(L)

where ρ(w,w)\rho(w,w^{\prime}) denotes the Hamming distance between ww and ww^{\prime} on the hypercube {0,1}Md\{0,1\}^{M^{d}}. Now we need to construct a subset over the hypercube such that for any pairs w,ww,w^{\prime}, they are separated in terms of Hamming distance.

The Varshamov-Gilbert bound (Lemma 2.9 in Tsybakov (2009)) does the job. We know that there exist a subset {w(0),,w(H)}{0,1}h\{w^{(0)},\ldots,w^{(H)}\}\subset\{0,1\}^{h} such that w(0)=(0,,0)w^{(0)}=(0,\ldots,0),

Now let’s calculate the probability distance Pj,P0P_{j},P_{0} induced by hypothesis w(j)w^{(j)} and w(0)w^{(0)}, for all j[H]j\in[H] to show that information theoretically, it is hard to distinguish the mixture of hypothesis. The following holds,

If we choose M=(4L2clog21dα)12α+dn12α+dM=\left(\frac{4L^{2}}{c\log 2}\frac{1}{d^{\alpha}}\right)^{\frac{1}{2\alpha+d}}\cdot n^{\frac{1}{2\alpha+d}}, we know

We outline the key steps of the proof. Due to the approximation property of the network class FD\mathcal{F}_{D} to Wβ,(1)W^{\beta,\infty}(1), and μG\mu_{G} to Wα,(1)W^{\alpha,\infty}(1), we know that

because FDW0,(M)F_{D}\subseteq W^{0,\infty}(M). Apply the oracle inequality (2.2), one has

Here the last step we use the following fact

Therefore for any νWα,(1)\nu\in W^{\alpha,\infty}(1),

where the last line follows from Theorem 2.2. Combine with the approximation error, we reach

Under the assumption that FDWβ,2(L)\mathcal{F}_{D}\subseteq W^{\beta,2}(L), then

and (gw(x)g0(x))/gw(x)cw+adhnα1/50\|(g_{w}(x)-g_{0}(x))/g_{w}(x)\|_{\infty}\leq c_{w}+a^{d}h_{n}^{\alpha}\ll 1/50. It is easy to verify (1) ΩαWα,(L)\Omega_{\alpha}\subset W^{\alpha,\infty}(L) for some LL, and each element in the set is a valid density; (2) ΛβWβ,(1)\Lambda_{\beta}\subset W^{\beta,\infty}(1), as for any multi-index γ\gamma,

To apply the Fano’s Lemma, let’s first show that the hypothesis are separated in the evaluation metric. Let’s use the same subset {w(0),,w(H)}{0,1}h\{w^{(0)},\ldots,w^{(H)}\}\subset\{0,1\}^{h} claimed by the Varshamov-Gilbert bound

In our case h=mdh=m^{d}. For the loss function, any w,w{w0,,wH}w,w^{\prime}\in\{w^{0},\ldots,w^{H}\}

Now let’s show that based nn i.i.d. data generated from density gw(x)g_{w}(x), it is hard to distinguish the hypothesis. Note that for x<1/50|x|<1/50, we know that log(1+x)xx2\log(1+x)\geq x-x^{2}. Therefore

Therefore if we choose mn12α+dm\asymp n^{-\frac{1}{2\alpha+d}}, we know 1Hj=1HDKL(Pw(j)n,Pw(0)n)clogH=cmd\frac{1}{H}\sum_{j=1}^{H}D_{\rm KL}(P_{w^{(j)}}^{\otimes n},P_{w^{(0)}}^{\otimes n})\leq c\log H=c^{\prime}m^{d}. Using the Fano’s Lemma, the lower bound for density estimation is of the order hnα+β=nα+β2α+dh_{n}^{\alpha+\beta}=n^{-\frac{\alpha+\beta}{2\alpha+d}}, as

Without loss of generality assume the function is differentiable. Then

Observe that xy1xy2xy\|x-y\|_{1}\geq\|x-y\|_{2}\geq\|x-y\|_{\infty}. Then f(x)f(y)LxyLxy2Lxy1|f(x)-f(y)|\leq L\|x-y\|_{\infty}\leq L\|x-y\|_{2}\leq L\|x-y\|_{1}. For the other side of the inequality, xy1dxy2dxy\|x-y\|_{1}\leq\sqrt{d}\|x-y\|_{2}\leq d\|x-y\|_{\infty}, hence f(x)f(y)Lxy1Ldxy2Ldxy|f(x)-f(y)|\leq L\|x-y\|_{1}\leq L\sqrt{d}\|x-y\|_{2}\leq Ld\|x-y\|_{\infty}. ∎

Acknowledgement

The author thanks Sasha Rakhlin for helpful discussions.

References