The Normal Law Under Linear Restrictions: Simulation and Estimation via Minimax Tilting

Z. I. Botev

Introduction

More than a century ago Francis Galton (1889) observed that he scarcely knows “anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the law of frequency of error. The law would have been personified by the Greeks if they had known of it.”

In this article we address some hitherto intractable computational problems related to the dd-dimensional multivariate normal law under linear restrictions:

One of the goals of this article is to propose a new methodology, which not only yields an unbiased estimator orders of magnitude less variable than the Genz estimator, but also works reliably in cases where the Genz estimator and other alternatives fail to deliver meaningful estimates (e.g., relative error close to 100%) Matlab® and R implementations are available from Matlab® Central, http://%****␣arXivJRSS.tex␣Line␣625␣****www.mathworks.com/matlabcentral/fileexchange/53796, and the CRAN repository (under the name TruncatedNormal), as well as from the author’s website: http://web.maths.unsw.edu.au/~zdravkobotev/.

Background on Separation of Variables Estimator

We first briefly describe the separation of variables (SOV) estimator of Genz (1992) (see also Geweke (1991)). Let A=LQA=LQ^{\top} be the LQ decomposition of the matrix AA, where LL is m×dm\times d lower triangular with nonnegative entries down the main diagonal and Q=Q1Q^{\top}=Q^{-1} is d×dd\times d orthonormal. A simple change of variable xQz\mathbf{x}\leftarrow Q^{\top}\mathbf{z} then yields:

where ϕ(x;μ,Σ)\phi(\mathbf{x};\boldsymbol{\mu},\Sigma) denotes the pdf of the N(μ,Σ){\sf N}(\boldsymbol{\mu},\Sigma) distribution. For simplicity of notation, we henceforth assume that m=dm=d so that LL is full rank. The case of m<dm<d is considered later in the experimental section. Genz (1992) decomposes the region C={x:lLxu}\mathscr{C}=\{\mathbf{x}:\mathbf{l}\leqslant L\mathbf{x}\leqslant\mathbf{u}\} sequentially as follows:

where gg is an importance sampling density over the set C\mathscr{C} and in the SOV form

Denoting by Φ()\Phi(\cdot) the cdf of the standard normal distribution, this gives the following.

This is an intractable combinatorial optimization problem whose objective function is not even available. Nevertheless, Genz and Bretz (2009) propose a heuristic for finding an acceptable approximation to π\boldsymbol{\pi}^{*}. We henceforth assume that this variable reordering heuristic is always applied as a preprocessing step to the SOV Algorithm 2.1 so that the matrix AA and the bounds l\mathbf{l} and u\mathbf{u} are already in permuted form. We will revisit variable reordering in the numerical experiments in Section 5.

2 Accept-Reject Simulation

The SOV approach described above suggests that we could simulate from f(z)f(\mathbf{z}) exactly by using g(x)g(\mathbf{x}) as an instrumental density in the following accept-reject scheme (Kroese et al., 2011, Chapter 3).

Minimax Tilting

To simplify the notation in the subsequent analysis, let

The second and stronger type of efficiency is bounded relative error,

Finally, the best one can hope for in an asymptotic regime is the highly desirable vanishing relative error (VRE) property:

Here we take a different tack, one that exploits features unique to the problem at hand and that will yield efficiency gains in both an asymptotic and non-asymptotic regime. A key result in this direction is the following Lemma 3.1, whose proof is given in the appendix.

is a saddle-point problem with a unique solution given by the concave optimization program:

Note that (7) minimizes with respect to μ\boldsymbol{\mu} the worst-case behavior of the likelihood ratio, namely supxCexp(ψ(x;μ))\sup_{\mathbf{x}\in\mathscr{C}}\exp\left(\psi(\mathbf{x};\boldsymbol{\mu})\right). The lemma states we can both easily locate the global worst-case behavior of the likelihood ratio, and simultaneously locate (in finite computing time) the global minimum with respect to μ\boldsymbol{\mu}. Prior to analyzing the theoretical properties of minimax tilting, we first explain how to implement the minimax method in practice.

Then, the gradient equation ψ(x;μ)=0\nabla\psi(\mathbf{x};\boldsymbol{\mu})=\mathbf{0} can be written as

The Karush-Kuhn-Tucker equations give the necessary and sufficient condition for the global solution (x,μ)(\mathbf{x}^{*},\boldsymbol{\mu}^{*}) of (7):

where η1,η2\boldsymbol{\eta}_{1},\boldsymbol{\eta}_{2} are Lagrange multipliers.

Suppose we find the unique solution of the nonlinear system (8) using, for example, a trust-region Dogleg method (Powell, 1970). If we denote the solution to (8) by (x˘,μ˘)(\breve{\mathbf{x}},\breve{\boldsymbol{\mu}}), then the Karush-Kuhn-Tucker equations imply that (x˘,μ˘)=(x,μ)(\breve{\mathbf{x}},\breve{\boldsymbol{\mu}})=(\mathbf{x}^{*},\boldsymbol{\mu}^{*}) if and only if (x˘,μ˘)C(\breve{\mathbf{x}},\breve{\boldsymbol{\mu}})\in\mathscr{C} or equivalently η1=η2=0\boldsymbol{\eta}_{1}=\boldsymbol{\eta}_{2}=\mathbf{0}. If, however, the solution (x˘,μ˘)(\breve{\mathbf{x}},\breve{\boldsymbol{\mu}}) to (8) does not lie in C\mathscr{C}, then (x˘;μ˘)(\breve{\mathbf{x}};\breve{\boldsymbol{\mu}}) will be suboptimal and, in order to compute (x;μ)(\mathbf{x}^{*};\boldsymbol{\mu}^{*}), one has to use a constrained convex optimization solver. This observation then leads to the following procedure.

Numerical experience suggests almost always (x˘,μ˘)(\breve{\mathbf{x}},\breve{\boldsymbol{\mu}}) happens to lie in C\mathscr{C} and there is no need to do any additional computation over and above Powell’s (1970) trust-region method.

Theoretical Properties of Minimax Tilting

There are a number of reasons why the minimax program (7) is an excellent way of selecting the tilting parameter. The first one shows that, unlike its competitors, the proposed estimator,

achieves the best possible efficiency in a tail asymptotic regime.

Similar to the variable reordering in Section 2.1, suppose that PP is a permutation matrix which maps the vector (1,,d)(1,\ldots,d)^{\top} into the permutation π=(π1,,πd)\boldsymbol{\pi}=(\pi_{1},\ldots,\pi_{d})^{\top}, that is, P(1,,d)=πP(1,\ldots,d)^{\top}=\boldsymbol{\pi}. Let LL be the lower triangular factor of PΣP=LLP\Sigma P^{\top}=LL^{\top} and p=Pl\mathbf{p}=P\mathbf{l}. It is clear that

for any permutation π\boldsymbol{\pi}. For the time being, we leave π\boldsymbol{\pi} unspecified, because unlike in Section 2.1, here we do not use π\boldsymbol{\pi} to minimize the variance of the estimator, but to simplify the notation in our efficiency analysis.

Define the convex quadratic programming problem:

The Karush-Kuhn-Tucker equations, which are a necessary and sufficient condition to find the solution of (12), are given by:

Given the partition λ=(λ1,λ2)\boldsymbol{\lambda}=(\boldsymbol{\lambda}_{1}^{\top},\boldsymbol{\lambda}_{2}^{\top})^{\top} with dim(λ1)=d1\dim(\boldsymbol{\lambda}_{1})=d_{1} and dim(λ2)=d2\dim(\boldsymbol{\lambda}_{2})=d_{2}, we now choose π\boldsymbol{\pi} such that all the active constraints in (13) correspond to λ1>0\boldsymbol{\lambda}_{1}>\mathbf{0} and all the inactive ones to λ2=0\boldsymbol{\lambda}_{2}=\mathbf{0}. Similarly, we define a partitioning for x,p\mathbf{x},\mathbf{p}, and the lower triangular

Note that the only reason for introducing the above variable reordering via the permutation matrix PP and insisting that all active constraints of (12) are collected in the upper part of vector λ\boldsymbol{\lambda} is notational convenience and simplicity. At the cost of some generality, this preliminary variable reordering allows us to state and prove the efficiency result in the following Theorem 4.1 in its simplest and neatest form.

Consider the estimation of the probability

where XN(0,Σ),  ZN(0,I)\mathbf{X}\sim{\sf N}(\mathbf{0},\Sigma),\;\mathbf{Z}\sim{\sf N}(\mathbf{0},I); and LL=PΣP,  p=Pl>0LL^{\top}=P\Sigma P^{\top},\;\mathbf{p}=P\mathbf{l}>\mathbf{0} are the permuted versions of Σ,l\Sigma,\mathbf{l} ensuring that the Lagrange multiplier vector λ\boldsymbol{\lambda} in (13) satisfies λ1>0\boldsymbol{\lambda}_{1}>\mathbf{0} and λ2=0\boldsymbol{\lambda}_{2}=\mathbf{0}. Define

and let J\mathscr{J} be the set of indices for which the components of the vector q\mathbf{q} are zero, that is,

If J=\mathscr{J}=\emptyset, then the minimax estimator (11) is a vanishing relative error estimator:

The theorem suggests that, unless the covariance matrix Σ\Sigma has a very special structure, the estimator enjoys VRE. This raises the question: Is there a simple setting that guarantees VRE for any full-rank covariance matrix under any preliminary variable reordering?

The next result shows that when l\mathbf{l} can be represented as a weighted linear combination of the columns of the covariance matrix Σ=AA\Sigma=AA^{\top}, then we always have VRE.

Note that the permutation matrix PP plays no role in the statement of Theorem 4.2 (we can assume P=IP=I), and that we do not assume l>0\mathbf{l}>\mathbf{0}, but only that l=Σl\mathbf{l}=\Sigma\mathbf{l}^{*} for some l>0\mathbf{l}^{*}>\mathbf{0}.

In light of Theorems 4.1 and 4.2, for the obverse problem of simulation from the truncated multivariate normal, we obtain the following result.

Suppose that the instrumental density in the Accept-Reject Algorithm 2.2 for simulation from

where ν\boldsymbol{\nu} and σ=(σ1,,σd)\boldsymbol{\sigma}=(\sigma_{1},\ldots,\sigma_{d})^{\top} are location and scale parameters, respectively. Define

Finally, in addition to the strong efficiency properties of the estimator, another reason that recommends the minimax estimator is that it permits us to tackle intractable simulation and estimation problems as illustrated in the next section.

Numerical Examples and Applications

In all examples we compare the separation-of-variables (SOV) estimator of Genz with the proposed minimax-exponentially-tilted (MET) estimator. We note that initially we considered a comparison with other estimation schemes such as the radially symmetric approach of Nomura (2014a) and the specialized orthant probability algorithm of Miwa et al. (2003); Craig (2008); Nomura (2014b). Unfortunately, unless a special autoregressive covariance structure is present, these methods are hardly competitive in anything but very few dimensions. For example, the orthant algorithm of Miwa et al. (2003) has complexity O(d!×n)\mathcal{O}(d!\times n), which becomes too costly for d>10d>10. For this reason, we give a comparison only with the broadly applicable SOV scheme, which is widely recognized as the current state-of-the-art method.

Since both the SOV and MET estimators are smooth, one can seek further gains in efficiency using randomized quasi Monte Carlo. The idea behind quasi Monte Carlo is to reduce the error of the estimator by using quasirandom or low-discrepancy sequences of numbers, instead of the traditional (pseudo-) random sequences. Typically the error of a sample average estimator decays at the rate of O(n1/2)\mathcal{O}(n^{-1/2}) when using random numbers, and at the rate of O((lnn)d/n)\mathcal{O}((\ln n)^{d}/n) when using pseudorandom numbers; see Gerber and Chopin (2015) for an up-to-date discussion.

For both the SOV and MET estimator we use the nn-point Richtmyer quasirandom sequence with randomization, as recommended by Genz and Bretz (2009). The randomization allows us to estimate the variability of the estimator in the standard Monte Carlo manner. The details are summarized as follows.

At this junction we assume that the matrix AA (or equivalently Σ\Sigma) and the bounds l\mathbf{l} and u\mathbf{u} have already been permuted according to the variable reordering heuristic discussed in Section 2.1. Thus, the ordering of the variables during the integration will be the same for both estimators and will not matter in the comparison.

Consider A=[1/2,1]d\mathscr{A}=[1/2,1]^{d} with a covariance matrix

The second column shows the lower bound discussed in Lemma 4.1 and column five shows the deterministic upper bound. These two bounds can then be used to compute the exact confidence interval (mentioned in the previous section) whenever we allow nn to vary freely. Here, since nn is fixed and the error is allowed to vary, we instead display the upper bound to the relative error (given in column six under the “worst err.” heading)

Finally, column seven (accept pr.) gives the acceptance rate of Algorithm 2.2 when using the instrumental density g(;μ)g(\cdot\,;\boldsymbol{\mu}^{*}) with enveloping constant c=exp(ψ(x;μ))c=\exp\left(\psi(\mathbf{x}^{*};\boldsymbol{\mu}^{*})\right).

What makes the MET approach better than other methods? First, the acceptance rate in column seven remains high even for d=50d=50. In contrast, the acceptance rate from naive acceptance-rejection with instrumental pdf ϕ(0,Σ)\phi(\mathbf{0},\Sigma) is a rare-event probability of approximately 2.13×101532.13\times 10^{-153}. Note again that the existing accept-reject scheme of Chopin (2011) is an excellent algorithm designed for extremely fast simulation in one or two dimensions (in quite general settings) and is not suitable here.

Second, the performance of both the SOV and MET estimators gradually deteriorates with increasing dd. However, the SOV estimator has larger relative error, does not give meaningful results for d>25d>25, and possesses no theoretical quantification of its performance. In contrast, the MET estimator is guaranteed to have better relative error than the one given in column six (worst err.).

Example II (Fernández et al., 2007).

Consider the hypercube A=d\mathscr{A}=^{d} and the isotopic covariance with elements

Observe how rapidly the probabilities become very small. Why should we be interested in estimating small “rare-event” probabilities? The simple answer is that all probabilities become eventually rare-event probabilities as the dimensions get larger and larger, making naive accept-reject simulation infeasible. These small probabilities sometimes present not only theoretical challenges (rare-event estimation), but practical ones like representation in finite precision arithmetic and numerical underflow. For instance, in using the SOV estimator Grün and Hornik (2012) note that: “Numerical problems arise for very small probabilities, e.g. for observations from different components. To avoid these problems observations with a small posterior probability (smaller than or equal to 10610^{-6}) are omitted in the M-step of this component.” The MET estimator is not immune to numerical underflow and loss of precision during computation, but consistent with Theorems 4.1 and 4.2, it is typically much more robust than the SOV estimator in estimating small probabilities.

2 Random Correlation Matrices

One can argue that the covariance matrices we have considered so far are too structured and hence not representative of a “typical” covariance matrix. Thus, for simulation and testing Miwa et al. (2003) and Craig (2008) find it desirable to use random correlation matrices. In the subsequent examples we use the method of Davies and Higham (2000) to simulate random test correlation matrices whose eigenvalues are uniformly distributed over the simplex {x:x1++xd=d}\{\mathbf{x}:x_{1}+\cdots+x_{d}=d\}.

A natural question is whether the MET estimator would still be preferable when integrating over a “non-tail” region such as A=[1/2,]100\mathscr{A}=[-1/2,\infty]^{100}. The table below summarizes the output of running the algorithms on 100 independently simulated random correlation matrices. Both the SOV and MET estimators used n=105n=10^{5} quasi Monte Carlo points. The ‘accept rate’ row displays the five number summary of the estimated acceptance probability of Algorithm 2.2.

In the current example, the numerical experiments suggest that the MET estimator is roughly 20% more costly than the SOV estimator. If one adjusts the results in Figure 3 in order to account for this time difference, then the relative error in the SOV row would be reduced by a factor of at most 1.21.2. This adjustment will thus give a reduction in the typical (median) relative error from 1.01.0 to 1/1.20.831/1.2\approx 0.83 percent, which is hardly significant.

Example IV.

Finally, we wish to know if the strong efficiency described in Theorem 4.1 may benefit the MET estimator as we move further into the tails of the distribution. Choose the “tail-like” A=[1,]100\mathscr{A}=[1,\infty]^{100} and use n=105n=10^{5}. The following table and graph summarize the results of 100 replications.

As seen from the results, in this particular example the variance of the MET estimator is typically more than 10510^{5} times smaller than the variance of the SOV estimator.

3 Computational Limitations In High Dimensions

Figure 5 shows the output of a numerical experiment with n=105n=10^{5} for various values of dd. The graph on the left gives the computational cost in seconds. Both the SOV and the MET estimators have cost of O(d3)\mathcal{O}(d^{3}) — hence the excellent agreement with the least squares cubic polynomials fitted to the empirical CPU data. The table on the right displays the relative error for both methods. In this example, we apply the variable reordering heuristic to the SOV estimator only, illustrating that the heuristic is not always necessary to achieve satisfactory performance with the MET estimator.

This example confirms the result in Theorem 4.2 that the SOV estimator works better in settings with strongly positive correlation structure (but poorly with negative correlation). Further, the results suggest the MET estimator is also aided by the presence of positive correlation.

4 Exact Simulation of Probit Posterior

A popular GLM (Koop et al., 2007) for binary responses y=(y1,,ym)\mathbf{y}=(y_{1},\ldots,y_{m})^{\top} with explanatory variables xi=(1,xi2,,xik),  i=1,,m\mathbf{x}_{i}=(1,x_{i2},\ldots,x_{ik})^{\top},\;i=1,\ldots,m is the probit Bayesian model:

Likelihood: p(\mathbf{y}\,|\,\boldsymbol{\beta})\varpropto\exp\left(\sum_{i=1}^{m}\ln\Phi\Big{(}(2y_{i}-1)\mathbf{x}_{i}^{\top}\boldsymbol{\beta}\Big{)}\right).

equals the desired posterior p(βy)p(\boldsymbol{\beta}\,|\,\mathbf{y}). We can thus apply our accept-reject scheme, because the joint f(β,λ)f(\boldsymbol{\beta},\boldsymbol{\lambda}) is of the desired truncated multivariate form (1) with d=k+md=k+m and

As an numerical example, we apply the probit model to the widely studied extramarital affairs dataset from Koop et al. (2007). The dataset contains m=601m=601 independent observations: the binary response yiy_{i} indicates if the ii-th respondent has had an extramarital affair; the six explanatory variables (k=7k=7) are male indicator (Male), number of years married (Year), ‘has’ or ‘has not’ children (Kids), religious or not (Relig.), years of formal education (Ed.), and a binary variable denoting whether the marriage is happy or not (Happy). Figure 6 shows the boxplots of the marginal distributions of β1,,β7\beta_{1},\ldots,\beta_{7} based on 80008000 iid simulations from the posterior p(βy)p(\boldsymbol{\beta}\,|\,\mathbf{y}) with prior covariance V=5IV=5I.

The conclusion that only years of marriage, religiosity, and conjugal happiness are statistically significant is, of course, well known (Koop et al., 2007) and used to validate our new simulation scheme. The question is what have we gained in using minimax tilting?

On the one hand, for the first time we have conducted the Bayesian inference using exact iid samples from the posterior and we did not have to fret about unquantifiable issues such as ‘burn-in’ and ‘mixing-speed’ as is typical with approximate MCMC simulation (Philippe and Robert, 2003).

On the other hand, the acceptance rate in the simulation was 1/2171/217, that is, we had to simulate (on average) 217 random vectors to accept one as an exact independent realization from the posterior. Admittedly, this acceptance rate could have been better and as shown in the previous experiments it is going to deteriorate with increasing dimensionality. However, there are hardly any alternatives for exact sampling — naive acceptance rejection for the extramarital data would enjoy an acceptance rate of O(10146)\mathcal{O}(10^{-146}) and without minimax tilting (say, with proposal g(x;0)g(\mathbf{x};\mathbf{0})) the Accept-Reject Algorithm 2.2 enjoys an acceptance rate of O(1016)\mathcal{O}(10^{-16}).

Thus, our main point stands: the proposed accept-reject scheme can be used for exact simulation whenever, say d100d\leqslant 100, and when dd is in the thousands it can be used to accelerate Gibbs sampling by grouping or blocking dozens of highly correlated variables together (Chopin, 2011; Philippe and Robert, 2003).

Concluding Remarks

The minimax tilting method can be effective for exact simulation from the truncated multivariate normal distribution. The proposed method permits us to dispense with Gibbs sampling in dimensions less than 100, and for larger dimensions to accelerate existing Gibbs samplers by sampling jointly hundreds of highly correlated variables.

The minimax approach can also be used to estimate normal probability integrals. Theoretically, the method improves on the already excellent SOV estimator and in a tail asymptotic regime it can achieve the best possible efficiency — vanishing relative error. The numerical experiments suggest that the proposed method can be significantly more accurate than the widely used SOV estimator, especially in the tails of the distribution. The experiments also point out to its limitations — as the dimensions get larger and larger it eventually fails.

The minimax tilting approach in this article can be extended to other multivariate densities related to the normal. Upcoming work by the author will argue that significant efficiency gains are also possible in the case of the multivariate student-tt and general elliptic distributions for which a strong log-concavity property holds. Just as in the multivariate normal case, the approach permits us to estimate accurately hitherto intractable student-tt probabilities, for which existing estimation schemes exhibit relative error close to 100%.

Acknowledgments

This work was supported by the Australian Research Council under grant DE140100993.

Appendix A Appendix

(using the obvious choices of CkC_{k} and Zk\mathscr{Z}_{k}) is concave in x\mathbf{x}. Hence, ψ\psi is concave in x\mathbf{x}, because it is a non-negative weighted sum of concave functions.

Second, we show that ψ\psi is convex in μ\boldsymbol{\mu} for each value of x\mathbf{x}. After some simplification, we can write

A.2 Proof of Theorem 4.1

Before proceeding with the proof we note the following.

First, using the necessary and sufficient condition (13), we can write the solution of (12) explicitly as x1=γL111p1,  x2=0\mathbf{x}_{1}=\gamma L_{11}^{-1}\mathbf{p}_{1},\;\mathbf{x}_{2}=\mathbf{0} with minimum γ22L111p12\frac{\gamma^{2}}{2}\|L_{11}^{-1}\mathbf{p}_{1}\|^{2}. In addition, from (13) we can also deduce that λ1=γL11L111p1>0\boldsymbol{\lambda}_{1}=\gamma L_{11}^{-\top}L_{11}^{-1}\mathbf{p}_{1}>\mathbf{0} and q=L21L111p1p20.\mathbf{q}=L_{21}L_{11}^{-1}\mathbf{p}_{1}-\mathbf{p}_{2}\geqslant\mathbf{0}.

if J\mathscr{J}\not=\emptyset, and c=(2π)d1/2L111c=(2\pi)^{-d_{1}/2}|L_{11}|^{-1} if J=\mathscr{J}=\emptyset.

In the setting of Theorem 4.1, the Karusch-Kuhn-Tucker conditions (10) simplify to:

where η\boldsymbol{\eta} is a Lagrange multiplier (corresponding to η2\boldsymbol{\eta}_{2} in (10)) and we replaced l\mathbf{l} with γp\gamma\mathbf{p}.

We now verify by substitution that, if J=\mathscr{J}=\emptyset, the unique solution of (17) is of the asymptotic form

where we recall that λ1=γL11L111p1>0\boldsymbol{\lambda}_{1}=\gamma L_{11}^{-\top}L_{11}^{-1}\mathbf{p}_{1}>\mathbf{0}. Equation one in (17) thus simply verifies that

Equation one and two yield x=L˘Ψ=LD1Ψ\mathbf{x}=\breve{L}^{\top}\boldsymbol{\Psi}=L^{\top}D^{-1}\boldsymbol{\Psi}, which again is easily verified:

It follows from Mill’s ratio, lnΦ(γ)12γ2lnγ12ln(2π)\ln\overline{\Phi}(\gamma)\simeq-\frac{1}{2}\gamma^{2}-\ln\gamma-\frac{1}{2}\ln(2\pi), and lnΦ(γ)0\ln\overline{\Phi}(-\gamma)\uparrow 0 that

It follows that for J=\mathscr{J}=\emptyset the minimax estimator (11) exhibits vanishing relative error — the best possible asymptotic tail behavior.

Case 𝒥≠∅𝒥\mathscr{J}\not=\emptyset.

Recall that (x˘,μ˘)(\breve{\mathbf{x}},\breve{\boldsymbol{\mu}}) is the solution of the nonlinear system (8), as well as the optimization program (7) without its constraint xC\mathbf{x}\in\mathscr{C} (note that a reordering of the variables via the permutation matrix PP does not change the statement of (7) or (8)). We have ψ(x;μ)ψ(x˘;μ˘)\psi(\mathbf{x}^{*};\boldsymbol{\mu}^{*})\leqslant\psi(\breve{\mathbf{x}};\breve{\boldsymbol{\mu}}), because dropping a constraint in the maximization of (7) cannot reduce the maximum. As in the case of J=\mathscr{J}=\emptyset, one can then verify via direct substitution that

A.3 Proof of Theorem 4.2

In the following proof we use the following multidimensional Mill’s ratio (Savage, 1962):

This is a generalization of the well-known one-dimensional result: Φ(γ)ϕ(γ;0,1)1γ,  γ  .\frac{\overline{\Phi}(\gamma)}{\phi(\gamma;0,1)}\simeq\frac{1}{\gamma},\;\gamma\uparrow\infty\;. As in the proof of Theorem 4.1, we proceed to find the asymptotic solution of the nonlinear optimization program (7) by considering the necessary and sufficient Karusch-Kuhn-Tucker conditions (10). In the setup of Theorem 4.2 these conditions simplify to (replacing l\mathbf{l} with γΣl\gamma\Sigma\mathbf{l}^{*}):

We can thus verify via direct substitution that the following

and hence from the one-dimensional Mill’s ratio we have

In other words, ΨγDl\boldsymbol{\Psi}\simeq\gamma D\mathbf{l}^{*} as γ\gamma\uparrow\infty. It follows that for equation one in (21) we obtain

and for equation two (recall that L˘=D1L\breve{L}=D^{-1}L, so that L˘=LD1\breve{L}^{\top}=L^{\top}D^{-1})

As a consequence, using the fact that lndet(L)=klnDkk\ln|\det(L)|=\sum_{k}\ln D_{kk} (recall that LL is triangular with positive diagonal elements), we have

However, by Mill’s ratio (20), we also have

and in considering the asymptotics of ψ(x;0)\psi(\mathbf{x};\mathbf{0}) we are free to select x\mathbf{x} to obtain the best error behavior subject to the constraint L˘xγL˘Ll\breve{L}\mathbf{x}\geqslant\gamma\breve{L}L^{\top}\mathbf{l}^{*}. This gives

A.4 Proof of Corollary 4.1

A.5 Proof of Lemma 4.1

References