Multivariate Stein Factors for a Class of Strongly Log-concave Distributions

Lester Mackey, Jackson Gorham

Introduction

Here, $p$ represents the density of $P$ with respect to Lebesgue measure.

Next, one shows that, for every test function $h$ in a convergence-determining class $\mathcal{H}$, the Stein equation

admits a solution $u_{h}$ in a set $\mathcal{U}$ of functions with uniformly bounded low-order derivatives. These uniform derivative bounds are commonly termed Stein factors.

Finally, one uses whatever tools are necessary to upper bound the Stein discrepancy (not to be confused with the identically named “Stein discrepancy” that denotes an entirely different quantity elsewhere in the literature)

which by construction upper bounds the reference metric $d_{\mathcal{H}}(Q,P)$.

To date, this recipe has been successfully used with the Langevin operator (1) to obtain explicit approximation error bounds for a wide variety of univariate targets $P$ [see, e.g., 7, 6]. (In the univariate setting, the operator (1) is commonly called Stein’s density operator.) The same operator has been used to analyze multivariate Gaussian approximation, but few other multivariate distributions have established Stein factors. To extend the reach of the multivariate literature, we derive uniform Stein factor bounds for a broad class of strongly log-concave target distributions in Theorem 2.1. The result covers common Bayesian target distributions, including Bayesian logistic regression posteriors under Gaussian priors, and explicitly relates the Stein discrepancy (3) and practical Monte Carlo diagnostics based thereupon to standard probability metrics, like the Wasserstein distance.

and we term a function $f$ $k$-strongly log-concave if $\log f$ is $k$-strongly concave. We finally let $\nabla^{0}h\triangleq h$ for all functions $h$ and define the Lipschitz constants

Stein factors for strongly log-concave distributions

solves the Stein equation (2) and satisfies

Theorem 2.1 implies that the Stein discrepancy (3) with set

To establish the second, we fix $h\in\mathcal{W}$ and $t>0$ and define the smoothed function

In the final equality we have used the fact that $\langle v,Z\rangle$ and $\langle w,Z\rangle$ are jointly normal with zero mean and covariance $\Sigma=\begin{bmatrix}\|v\|_{2}^{2}&\langle v,w\rangle\\ \langle v,w\rangle&\|w\|_{2}^{2}\end{bmatrix}$, so that the product $\langle v,Z\rangle\langle w,Z\rangle$ has the distribution of the off-diagonal element of the Wishart distribution with scale $\Sigma$ and $1$ degree of freedom.
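The covariance structure invoked here can be checked numerically. The following is a minimal sketch, assuming $Z$ is a standard Gaussian vector; the example vectors, sample size, and seed are illustrative choices and do not come from the paper.

```python
import random

def inner(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def sigma(v, w):
    """Covariance of (<v, Z>, <w, Z>) for Z ~ N(0, I)."""
    return [[inner(v, v), inner(v, w)],
            [inner(v, w), inner(w, w)]]

def empirical_second_moments(v, w, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the same 2x2 matrix (the pair has zero mean)."""
    rng = random.Random(seed)
    s_vv = s_vw = s_ww = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in v]
        a, b = inner(v, z), inner(w, z)
        s_vv += a * a
        s_vw += a * b
        s_ww += b * b
    return [[s_vv / n_samples, s_vw / n_samples],
            [s_vw / n_samples, s_ww / n_samples]]

v = [1.0, 2.0, 0.0]   # hypothetical example vectors
w = [0.5, -1.0, 3.0]
exact = sigma(v, w)
approx = empirical_second_moments(v, w)
```

The off-diagonal entry of `approx` estimates $\mathbb{E}[\langle v,Z\rangle\langle w,Z\rangle]=\langle v,w\rangle$, the mean of the Wishart off-diagonal element with one degree of freedom.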

We can now develop a bound for $d_{\mathcal{W}}$ using our smoothed functions. Let

While Lemma 2.2 targets Lipschitz test functions, comparable results can be obtained for non-smooth functions, like the indicators of convex sets, by adapting the smoothing technique of [3, Lem. 2.1].

Before turning to the proof of Theorem 2.1, we illustrate a practical application to measuring the quality of Monte Carlo or cubature sample points in Bayesian inference. Consider the Bayesian logistic regression posterior density [see, e.g., 11]

Hence, Theorem 2.1 applies with $k=1/\sigma^{2}$, $L_{3}=\frac{\sum_{l=1}^{L}\|v_{l}\|_{2}^{3}}{6\sqrt{3}}$, and $L_{4}=\frac{\sum_{l=1}^{L}\|v_{l}\|_{2}^{4}}{8}$. We may now plug the associated Stein factors

into the non-uniform graph Stein discrepancy to obtain a computable upper bound on $d_{\mathcal{M}}(Q,P)$ or $d_{\mathcal{W}}(Q,P)$ for any discrete probability measure $Q=\frac{1}{n}\sum_{i=1}^{n}\delta_{\beta_{i}}$.
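The constants $k$, $L_{3}$, and $L_{4}$ quoted for the logistic regression posterior are simple to evaluate. The sketch below assumes that $v_{l}$ denotes the $l$-th covariate vector and $\sigma^{2}$ the Gaussian prior variance; the function name `stein_constants` and the example data are hypothetical. It computes only these three constants, not the Stein factors derived from them.

```python
import math

def stein_constants(vs, sigma2):
    """Return (k, L3, L4) for covariate vectors vs and prior variance sigma2."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vs]
    k = 1.0 / sigma2                                            # strong log-concavity constant
    l3 = sum(n ** 3 for n in norms) / (6.0 * math.sqrt(3.0))    # L_3
    l4 = sum(n ** 4 for n in norms) / 8.0                       # L_4
    return k, l3, l4

vs = [[3.0, 4.0], [1.0, 0.0]]   # hypothetical covariate vectors with norms 5 and 1
k, l3, l4 = stein_constants(vs, sigma2=2.0)
```

Both $L_{3}$ and $L_{4}$ grow with the covariate norms, so rescaling the covariates changes the constants even when the prior variance is held fixed.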

Proof of Theorem 2.1

Before tackling the main proof, we will establish a series of useful lemmas. We will make regular use of the following well-known Lipschitz property:

Our first lemma enumerates several properties of the overdamped Langevin diffusion that will prove useful in the proofs to follow.

Consider the Lyapunov function $V(x)=\|x\|_{2}^{2}+1$. The strong log-concavity of $p$, the Cauchy-Schwarz inequality, and the arithmetic-geometric mean inequality imply that

High-order weighted difference bounds

A second, technical lemma bounds the growth of weighted smooth function differences in terms of the proximity of function arguments. The result will be used to characterize the smoothness of $Z_{t,x}$ as a function of the starting point $x$ (Lemma 3.5) and, ultimately, to establish the smoothness of $u_{h}$ (Theorem 2.1).

To establish the second-order difference bound (5), we first apply Taylor’s theorem with mean-value remainder to $h(x)-h(y)$ and $h(x^{\prime})-h(y^{\prime})$ to obtain

To derive the third-order difference bound (6), we apply Taylor’s theorem with mean-value remainder to $h(w)-h(z)$, $h(y)-h(x)$, $h(w^{\prime})-h(z^{\prime})$, and $h(y^{\prime})-h(x^{\prime})$ to write

To bound the subsequent line, we note that Cauchy-Schwarz, the definition of the operator norm, and the Lipschitz property (4) imply that

Finally, Cauchy-Schwarz and the definition of the operator norm give

Bounding the third-order difference (7) in terms of these four estimates yields (6).

Synchronous coupling lemma

Our proof of Theorem 2.1 additionally rests upon a series of coupling inequalities which serve to characterize the smoothness of $Z_{t,x}$ as a function of $x$. The couplings espoused in the lemma to follow are termed synchronous, because the same Brownian motion is used to drive each process.
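A synchronous coupling is easy to simulate. The sketch below is an illustration under stated assumptions rather than the construction used in the proofs: it takes the Gaussian target $N(0, I/k)$, whose density is $k$-strongly log-concave, assumes the diffusion $dZ_{t}=\frac{1}{2}\nabla\log p(Z_{t})\,dt+dW_{t}$, and uses an Euler-Maruyama discretization. The step size, horizon, and starting points are arbitrary choices.

```python
import math
import random

def coupled_distances(x, x_prime, k, dt, n_steps, seed=0):
    """Euler-Maruyama for two diffusions dZ = -(k/2) Z dt + dW sharing one W.

    Returns the distances ||Z_{t,x} - Z_{t,x'}||_2 at every time step.
    """
    rng = random.Random(seed)
    z, zp = list(x), list(x_prime)
    dists = [math.dist(z, zp)]
    for _ in range(n_steps):
        # One shared Brownian increment drives both processes.
        dw = [rng.gauss(0.0, math.sqrt(dt)) for _ in z]
        z = [zi - 0.5 * k * zi * dt + wi for zi, wi in zip(z, dw)]
        zp = [zi - 0.5 * k * zi * dt + wi for zi, wi in zip(zp, dw)]
        dists.append(math.dist(z, zp))
    return dists

k, dt, n_steps = 2.0, 0.01, 500
dists = coupled_distances([1.0, -2.0], [0.0, 0.5], k, dt, n_steps)
# Exponential envelope e^{-kt/2} ||x - x'||_2 evaluated on the same time grid.
bounds = [math.exp(-0.5 * k * i * dt) * dists[0] for i in range(n_steps + 1)]
```

Because the noise is shared, it cancels in the difference of the two processes, and the simulated distance stays below the envelope $e^{-kt/2}\|x-x^{\prime}\|_{2}$, matching the $e^{kt/2}$ weighting that appears in the proofs.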

For each starting point of the form $z+b^{\prime}v^{\prime}+bv$ with $z\in\{x,x^{\prime}\}$, $b^{\prime}\in\{0,\epsilon^{\prime},\epsilon^{\prime\prime}\}$, and $b\in\{0,\epsilon\}$, consider an overdamped Langevin diffusion $(Z_{t,z+b^{\prime}v^{\prime}+bv})_{t\geq 0}$ solving the stochastic differential equation

These coupled processes almost surely satisfy the synchronous coupling bounds,

the second-order differenced function bound,

and the third-order differenced function bound,

By Lemma 3.1, each process $(Z_{t,z+b^{\prime}v^{\prime}+bv})_{t\geq 0}$ with $z\in\{x,x^{\prime}\}$, $b^{\prime}\in\{0,\epsilon^{\prime},\epsilon^{\prime\prime}\}$, and $b\in\{0,\epsilon\}$ is well-defined for all times $t\in[0,\infty)$. The first-order bound (10) is well known and has a concise proof in the literature.

To establish the second conclusion (11), we consider the Itô process of second-order differences

and apply Itô’s lemma to the mapping $(t,w)\mapsto e^{kt/2}\|w\|_{2}$. This yields

where, to achieve the second inequality, we used the $k$-strong log-concavity of $p$. Now we may derive the second-order synchronous coupling bound (11), since

Applying the synchronous coupling bound (11) to the estimate (16) finally delivers the second-order differenced function bound (13).

Third-order bounds

To establish the third conclusion (12), we consider the Itô process of third-order differences

and invoke Itô’s lemma once more for the mapping $(t,w)\mapsto e^{kt/2}\|w\|_{2}$. This produces

In the final line, we used the $k$-strong log-concavity of $p$. Our efforts now yield (12) via

The third-order differenced function bound (3.5) then follows by applying the third-order synchronous coupling bound (12) to the estimate (18).

Proof of Theorem 2.1

for $(W_{t})_{t\geq 0}$ a $d$-dimensional Wiener process. In what follows, when considering the joint distribution of a finite collection of overdamped Langevin diffusions, we will assume that the diffusions are coupled in the manner of Lemma 3.5, so that each diffusion is driven by a shared $d$-dimensional Wiener process $(W_{t})_{t\geq 0}$.

To see that the integral representation of $u_{h}(x)$ is well-defined, note that

The first relation uses the stationarity of $P$, the second uses the Lipschitz relation (4), the third uses the first-order coupling inequality (10) of Lemma 3.5, and the last uses the fact that strongly log-concave distributions have subexponential tails and therefore finite moments of all orders [8, Lem. 1].

The second relation is an application of the Lipschitz relation (4), and the third applies the first-order coupling inequality (10) of Lemma 3.5.

To demonstrate that $u_{h}$ is differentiable with Lipschitz gradient, we first establish a weighted second-order difference inequality for $u_{h}$.

We apply the Lemma 3.5 second-order function coupling inequality (13) to obtain

The desired bound follows by integrating the final expression.

Indeed, Lemma 3.7 implies that, for any integers $m,m^{\prime}>0$,

Hence, the sequence $\left(\frac{u_{h}(x+v/m)-u_{h}(x)}{1/m}\right)_{m=1}^{\infty}$ is Cauchy, and the directional derivative (21) exists.

where the second inequality follows from Lemma 3.7. Since each directional derivative is Lipschitz continuous, we may conclude that $u_{h}$ is continuously differentiable with Lipschitz continuous gradient $\nabla u_{h}$. Our Lipschitz function deduction (19) and the Lipschitz relation (4) additionally supply the uniform bound $M_{1}(u_{h})\leq\frac{2}{k}M_{1}(h)$.
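A one-dimensional example suggests that the factor $\frac{2}{k}$ cannot be improved in general. The sketch below is a sanity check under stated assumptions, not part of the proof: it takes the standard normal target ($k=1$), the test function $h(x)=x$, the integral representation $u_{h}(x)=\int_{0}^{\infty}\mathbb{E}_{P}[h(Z)]-\mathbb{E}[h(Z_{t,x})]\,dt$, and the diffusion $dZ_{t}=-\frac{1}{2}Z_{t}\,dt+dW_{t}$, for which $\mathbb{E}[Z_{t,x}]=xe^{-t/2}$. The truncation horizon, grid size, and finite-difference step are arbitrary numerical choices.

```python
import math

K = 1.0  # strong log-concavity constant of the standard normal target

def mean_h_at_time(t, x):
    """E[h(Z_{t,x})] for h(x) = x under dZ = -(K/2) Z dt + dW (closed form)."""
    return x * math.exp(-0.5 * K * t)

def u_h(x, t_max=60.0, n=20_000):
    """Trapezoid approximation of int_0^inf (E_P[h] - E[h(Z_{t,x})]) dt."""
    dt = t_max / n
    total = 0.0
    for i in range(n + 1):
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * (0.0 - mean_h_at_time(i * dt, x))  # E_P[h] = 0
    return total * dt

def langevin_operator(u, x, eps=1e-2):
    """(T u)(x) = 0.5 u'(x) (log p)'(x) + 0.5 u''(x), with (log p)'(x) = -x."""
    du = (u(x + eps) - u(x - eps)) / (2 * eps)
    d2u = (u(x + eps) - 2 * u(x) + u(x - eps)) / eps ** 2
    return 0.5 * du * (-x) + 0.5 * d2u

slope = u_h(1.0) - u_h(0.0)                    # slope of the linear function u_h
residual = langevin_operator(u_h, 0.7) - 0.7   # (T u_h)(x) - (h(x) - E_P[h]) at x = 0.7
```

Here $u_{h}(x)=-2x$ up to discretization error, so $M_{1}(u_{h})=2=\frac{2}{k}M_{1}(h)$, and the Stein equation residual vanishes to numerical precision.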

To demonstrate that $\nabla u_{h}$ is differentiable with Lipschitz gradient, we begin by establishing a weighted third-order difference inequality for $u_{h}$.

Introduce the shorthand $c_{1}\triangleq f_{1}(x,x^{\prime},\epsilon,\epsilon^{\prime},\epsilon^{\prime\prime})$ and $c_{2}\triangleq f_{2}(x,x^{\prime},\epsilon,\epsilon^{\prime},\epsilon^{\prime\prime})$. We apply the Lemma 3.5 third-order function coupling inequality (3.5) to the thrice continuously differentiable function $h$ to obtain

Integrating this final expression yields the advertised bound.

Lemma 3.9 guarantees that, for any integers $m,m^{\prime}>0$,

Hence, the sequence $\left(\frac{\nabla_{v}u_{h}(x+v^{\prime}/m)-\nabla_{v}u_{h}(x)}{1/m}\right)_{m=1}^{\infty}$ is Cauchy, and the directional derivative (24) exists.

Solving the Stein equation

References