Layered Adaptive Importance Sampling

L. Martino, V. Elvira, D. Luengo, J. Corander

Introduction

Monte Carlo methods currently represent a maturing toolkit widely used throughout science and technology (Doucet:MCSignalProcessing2005, ; Robert04, ; Wang:MCWirelessComm2002, ). Importance sampling (IS) and Markov Chain Monte Carlo (MCMC) methods are well-known Monte Carlo (MC) techniques applied to compute integrals involving a high-dimensional target probability density function (pdf) πˉ(x)\bar{\pi}({\bf x}). In both cases, the choice of a suitable proposal density q(x)q({\bf x}) is crucial for the success of the Monte Carlo based approximation. For this reason, the design of adaptive IS or MCMC schemes represents one of the most active research topics in this area, and several methods have been proposed in the literature (Cappe04, ; CORNUET12, ; Craiu09, ; Haario2001, ; Luengo13, ).

Since both IS and MCMC have certain intrinsic advantages and weaknesses, several attempts have been made to successfully marry the two approaches, producing hybrid techniques: IS-within-MCMC (Andrieu10, ; Brockwell10, ; Liesenfeld08, ; Liu00, ; Neal11, ) or MCMC-within-IS (Beaujean13, ; Botev13, ; Chopin02, ; EBEB14, ; Mendes15, ; Neal01, ; Yuan13, ). To set the scene for such developments it is useful to recall briefly some of the main strengths of IS and MCMC, respectively. For instance, one benefit of IS is that it delivers a straightforward estimate of the normalizing constant of πˉ(x)\bar{\pi}({\bf x}) (Liang10, ; Robert04, ) (a.k.a. evidence or marginal likelihood), which is essential for several applications (FRIEL11, ; Skilling06, ). In contrast, the estimation of the normalizing constant is highly challenging using MCMC methods, and several authors have investigated different approaches to overcome the obstacles related to the instability of the resulting estimators (Botev08, ; CALDWELL14, ; Chib01, ; FRIEL11, ; Weinberg10, ). Furthermore, the application and the theoretical analysis of an IS scheme using an adaptive proposal pdf is easier than the theoretical analysis of the corresponding adaptive MCMC scheme, which is much more delicate (Andrieu08, ).

On the other hand, an appealing feature of MCMC algorithms is their explorative behavior. For instance, the proposal function q(xxt1)q({\bf x}|{\bf x}_{t-1}) can depend on the previous state of the chain xt1{\bf x}_{t-1} and foster movements between different regions of the target density. For this reason, MCMC methods are usually preferred when no detailed information about the target πˉ(x)\bar{\pi}({\bf x}) is available, especially in large dimensional spaces (Andrieu:MCMachineLearning2003, ; Fitzgerald01, ). Moreover, in order to amplify their explorative behavior several parallel MCMC chains can be run simultaneously (Robert04, ; Liang10, ). This strategy facilitates the exploration of the state space, although at the expense of an increase in the computational cost. Several schemes have been introduced to share information among the different chains (Craiu09, ; O-MCMC, ; Smelly, ), which further improves the overall convergence.

The main contribution of this work is the description and analysis of a hierarchical proposal procedure for generating samples, which can then be employed within any Monte Carlo algorithm. In this hierarchical scheme, we consider two conditionally independent levels: the upper level is used to generate mean vectors for the proposal pdfs, which are then used in the lower level to draw candidate samples according to some MC scheme. We show that the standard Population Monte Carlo (PMC) method (Cappe04, ) can be interpreted as applying implicitly this hierarchical procedure.

The second major contribution of this work is providing a general framework for multiple importance sampling (MIS) schemes and their iterative adaptive versions. We discuss several alternative applications of the so-called deterministic approach (ElviraMIS15, ; Owen00, ; Veach95, ) for sampling a mixture of pdfs. This general framework includes different MIS schemes used within adaptive importance sampling (AIS) techniques already proposed in literature, such as the standard PMC (Cappe04, ), the adaptive multiple importance sampling (AMIS) (CORNUET12, ; Marin12, ), and the adaptive population importance sampling (APIS) (APIS15, ).

Finally, we combine the general MIS framework with the hierarchical procedure for generating samples, introducing a new class of AIS algorithms. More specifically, one or several MCMC chains are used for driving an underlying MIS scheme. Each algorithm differs from the others in the specific Markov adaptation employed and the particular MIS technique applied for yielding the final Monte Carlo estimators. This novel class of algorithms efficiently combines the main strengths of the IS and the MCMC methods, since it maintains an explorative behavior (as in MCMC) and can still easily estimate the normalizing constant (as in IS).

We describe in detail the simplest possible algorithm of this class, called random walk importance sampling. Moreover, we introduce two additional population-based variants that provide a good trade-off between performance and computational cost. In the first variant, the mean vectors are updated according to independent parallel MCMC chains. In the other one, an interacting adaptive strategy is applied. In both cases, all the adapted proposal pdfs collaborate to yield a single global IS estimator. One of the proposed algorithms, called parallel interacting Markov adaptive importance sampling (PI-MAIS), can be interpreted as parallel MCMC chains cooperating to produce a single global estimator, since the chains exchange statistical information to achieve a common purpose.

The rest of the paper is organized as follows. Section 2 is devoted to the problem statement. The hierarchical proposal procedure is then introduced in Section 3. In Section 4, we describe a general framework for importance sampling schemes using a population of proposal pdfs, whereas Section 5 introduces the adaptation procedure for the mean vectors of these proposal pdfs. Numerical examples are provided in Section 6, including comparisons with several benchmark techniques. Different scenarios have been considered: a multimodal distribution, a nonlinear banana-shaped target, a high-dimensional example, and a localization problem in a wireless sensor network. Finally, Section 7 contains some brief final remarks.

Target distribution and related integrals

Our goal is computing efficiently some integral measure w.r.t. the target pdf,

and ff is any square-integrable function (w.r.t. πˉ(x)\bar{\pi}({\bf x})) of x{\bf x}.Note that, as both πˉ(x)\bar{\pi}({\bf x}) and ZZ depend on the observations y{\bf y}, the use of πˉ(xy)\bar{\pi}({\bf x}|{\bf y}) and Z(y)Z({\bf y}) would be more precise. However, since the observations are fixed, in the sequel we remove the dependence on y{\bf y} to simplify the notation. In this work, we address the problem of approximating II and ZZ via Monte Carlo methods. Since drawing directly from πˉ(x)π(x)\bar{\pi}({\bf x})\propto\pi({\bf x}) is impossible in many applications, Monte Carlo techniques use a simpler proposal density q(x)q({\bf x}) to generate random candidates, testing or weighting them according to some suitable rule. Indeed, throughout the paper we focus on the combined use of several proposal pdfs, denoted as q1,,qJq_{1},\ldots,q_{J}.

Hierarchical procedure for proposal generation

The performance of MC methods depends on the discrepancy between the target, πˉ(x)π(x)\bar{\pi}({\bf x})\propto\pi({\bf x}), and the proposal q(x)q({\bf x}). Namely, the performance improves if q(x)q({\bf x}) is more similar (i.e., closer) to πˉ(x)\bar{\pi}({\bf x}). In general, tuning the parameters of the proposal is a difficult task that requires statistical information of the target distribution. In this section, we deal with this important issue, focusing on the mean vector of the proposal pdf. More specifically, we consider a proposal pdf defined by a mean vector μ{\bm{\mu}} and covariance matrix C{\bf C}, denoted as q(xμ,C)=q(xμC)q({\bf x}|{\bm{\mu}},{\bf C})=q({\bf x}-{\bm{\mu}}|{\bf C}). We propose the following hierarchical procedure for generating a set of samples that will be employed afterwards within some Monte Carlo technique:

Draw a mean vector μjh(μ){\bm{\mu}}_{j}\sim h({\bm{\mu}}).

Draw xj(m)q(xμj,C){\bf x}_{j}^{(m)}\sim q({\bf x}|{\bm{\mu}}_{j},{\bf C}) for m=1,,Mm=1,\ldots,M.

Use all the generated samples, xj(m){\bf x}_{j}^{(m)} for j=1,,Jj=1,\ldots,J and m=1,,Mm=1,\ldots,M, as candidates within some Monte Carlo method.

Note that h(μ)h({\bm{\mu}}) plays the role of a prior pdf over the mean vector of qq in this approach. Hence, the pdf of each sample xj(m){\bf x}_{j}^{(m)} can be expressed as

i.e., the hierarchical procedure is equivalent to drawing directly xj(m)q~(xC){\bf x}_{j}^{(m)}\sim\widetilde{q}({\bf x}|{\bf C}) for all j=1,,Jj=1,\ldots,J and m=1,,Mm=1,\ldots,M. The density q~\widetilde{q} is thus the equivalent proposal density of the whole hierarchical generating procedure. Note also that the samples μ1,,μJ{\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{J} are not directly used by the Monte Carlo estimator, since only the samples xj(m){\bf x}_{j}^{(m)}, for j=1,,Jj=1,\ldots,J, m=1,,Mm=1,\ldots,M, enter the actual estimator. Hence, the computational cost per iteration of this hierarchical procedure is higher than the cost of a standard approach, However, it leads to substantial computational savings in terms of improved convergence towards the target, and thus a reduced number of iterations required, as shown later in the simulations. Furthermore, note that the generation of the μj{\bm{\mu}}_{j}’s in the upper level is independent of the samples xj(m){\bf x}_{j}^{(m)} drawn in the lower level, thus facilitating the theoretical analysis of the resulting algorithms, as discussed in Section 5.1.Note that, in the ideal case described here, each μj{\bm{\mu}}_{j} is also independent of the other μ{\bm{\mu}}’s. However, in the rest of this work, we also consider cases where correlation among the mean vectors (μ1,,μJ{\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{J}) is introduced.

Assuming that the parametric form of q(xμ,C)q({\bf x}|{\bm{\mu}},{\bf C}) and its covariance matrix C{\bf C} are fixed, we consider the problem of finding the optimal prior h(μC)h^{*}({\bm{\mu}}|{\bf C}) over the mean vector μ{\bm{\mu}}. Note that, since q(xμ,C)=q(xμC)q({\bf x}|{\bm{\mu}},{\bf C})=q({\bf x}-{\bm{\mu}}|{\bf C}), we can write

regardless of the choice of the prior over the mean vectors in the upper level. The desirable scenario is to have the equivalent proposal q~(xC)\widetilde{q}({\bf x}|{\bf C}) coinciding exactly with the target πˉ(x)\bar{\pi}({\bf x}),Given a function f(x)f({\bf x}), the optimal proposal qq minimizing the variance of the IS estimator is q~(xC)f(x)πˉ(x)\widetilde{q}({\bf x}|{\bf C})\propto|f({\bf x})|\bar{\pi}({\bf x}). However, in practical applications, we are often interested in computing expectations w.r.t. several ff’s. In this context, a more appropriate strategy is to minimize the variance of the importance weights. In this case, the minimum variance is attained when q~(xC)=πˉ(x)\widetilde{q}({\bf x}|{\bf C})=\bar{\pi}({\bf x}) Doucet08tut . i.e.,

where h(μC)h^{*}({\bm{\mu}}|{\bf C}) represents the optimal prior.

2 Asymptotically optimal choice of the prior h​(𝝁)ℎ𝝁h({\bm{\mu}})

Since Eq. (7) cannot be solved analytically in general, in this section we relax that condition and look for an equivalent proposal q~\widetilde{q} which fulfills (7) asymptotically as JJ\to\infty. For the sake of simplicity, let us set M=1M=1. Thus, we consider the generation of JJ samples {x1,,xJ}\{{\bf x}_{1},\ldots,{\bf x}_{J}\}, drawn using the following hierarchical procedure:

Draw a mean vector μjh(μ){\bm{\mu}}_{j}\sim h({\bm{\mu}}).

Draw xjq(xμj,C){\bf x}_{j}\sim q({\bf x}|{\bm{\mu}}_{j},{\bf C}).

Note that we are using JJ different proposal pdfs,

to draw {x1,,xJ}\{{\bf x}_{1},\ldots,{\bf x}_{J}\}, with each xj{\bf x}_{j} being drawn from the jj-th proposal xjq(xμj,C){\bf x}_{j}\sim q({\bf x}|{\bm{\mu}}_{j},{\bf C}). However, if the samples x1,,xJ{\bf x}_{1},\ldots,{\bf x}_{J} are used altogether regardless of their order, then it can interpreted that they have been drawn from the following mixture using the deterministic mixture sampling scheme (see (mcbookOwen, , Chapter 9), ElviraMIS15 ):

Note that, since μjh(μ){\bm{\mu}}_{j}\sim h({\bm{\mu}}), then ψ(x)\psi({\bf x}) is a Monte Carlo approximation of the integral in Eq. (7), i.e.,

Furthermore, if we choose h(μ)=πˉ(μ)h({\bm{\mu}})=\bar{\pi}({\bm{\mu}}), i.e., μjπˉ(μ){\bm{\mu}}_{j}\sim\bar{\pi}({\bm{\mu}}), then ψ(x)\psi({\bf x}) is also a kernel density estimator of πˉ(x)\bar{\pi}({\bf x}), where the q(xμj,C)q({\bf x}|{\bm{\mu}}_{j},{\bf C}) play the role of the kernel functions Wand94 . In general, this estimator has non-zero bias and variance, depending on the choice of qq, C{\bf C} and the number of samples JJ. However, for a given value of JJ, there exists an optimal choice of C{\bf C}^{*} which provides the minimum Mean Integrated Square Error (MISE) estimator Wand94 . Using the optimal covariance matrix C{\bf C}^{*}, it can be proved

pointwise as JJ\rightarrow\infty Wand94 . Hence, the equivalent proposal density of the hierarchical approach converges to the target when JJ\rightarrow\infty. It is possible to show C0||C^{*}||\rightarrow 0 as JJ\rightarrow\infty, so that there is no contradiction between (9) and (10) since q(xμC)q({\bf x}-{\bm{\mu}}|{\bf C}^{*}) becomes increasingly similar to δ(xμ)\delta({\bf x}-{\bm{\mu}}), and thus q~(xC)πˉ(x)\widetilde{q}({\bf x}|{\bf C}^{*})\rightarrow\bar{\pi}({\bf x}) as JJ\rightarrow\infty.

3 Practical implementation

As explained in Section 3.2, h(μ)=πˉ(μ)h({\bm{\mu}})=\bar{\pi}({\bm{\mu}}) is a suitable choice from a kernel density estimation point of view. However, sampling directly from πˉ(μ)\bar{\pi}({\bm{\mu}}) is unfeasible from a practical point of view (otherwise, we would not require any MC algorithm). Therefore, we propose applying another sampling method, such as an MCMC algorithm, to obtain the samples {μ1,,μJ}πˉ(μ)\{{\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{J}\}\sim\bar{\pi}({\bm{\mu}}). More specifically, starting from an initial μ0{\bm{\mu}}_{0}, we generate a sequence

where KK is the kernel of the MCMC technique used. With the choice h(μ)=πˉ(μ)h({\bm{\mu}})=\bar{\pi}({\bm{\mu}}), the two levels of the sampler play different roles:

The upper level attends the need for exploration of the state space, providing {μ1,,μJ}\{{\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{J}\}.

The lower level is devoted to the approximation of local features of the the target, using {x1,,xJ}\{{\bf x}_{1},\ldots,{\bf x}_{J}\}.

In general, the two levels require their own tuning of the parameters of the corresponding proposals.

4 Relationship with other adaptive MC schemes

In contrast to the hierarchical approach described previously, in standard adaptive MC approaches Bugallo15 ; Haario2001 ; Luengo13 the parameter μn{\bm{\mu}}_{n} is determined by a deterministic function,

of the previously generated samples (assuming to generate MM samples from each proposal),

Although γ\gamma is a deterministic function, the sequence {μj}j=1J\{{\bm{\mu}}_{j}\}_{j=1}^{J} is generated according to a conditional pdf, K(μjμ1,,μj1)K({\bm{\mu}}_{j}|{\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{j-1}), since Xj1{\bf X}_{j-1} is random. Unlike in the hierarchical scheme, in standard adaptive MC approaches, the sequence {μj}j=1J\{{\bm{\mu}}_{j}\}_{j=1}^{J} typically converges to a fixed vector.

In the standard PMC method (Cappe04, ) the sequence of mean vectors μj{\bm{\mu}}_{j}’s is also generated depending on the previous x{\bf x}’s but, in this case, the final distribution is unknown and it is not a fixed vector, in general (for further details see Appendix C). Similar considerations also apply for Sequential Monte Carlo (SMC) schemes Moral06 ; Fearnhead13 ; Schafer13 where the adaptation is performed using a combination of resampling and MCMC steps. Other interesting and related techniques are the Particle MCMC (P-MCMC) (Andrieu10, ) and the Sequentially Interacting MCMC (SI-MCMC) (Brockwell10, ) methods. In this case, IS approximations of the target are used to build better proposal pdfs, employed within MCMC steps. Both methods are also able to provide efficient estimators of ZZ. However, unlike in PMC, SMC, P-MCMC and SI-MCMC, in the proposed hierarchical approach each μj{\bm{\mu}}_{j} is always chosen independently of Xj1{\bf X}_{j-1} and it is distributed according to h(μ)h({\bm{\mu}}), decided in advance by the user. Moreover, the means μ1,,μj{\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{j} are not involved in the resulting estimators. Related observations are provided in Section 5.1 and Table 5.

Generalized Multiple Importance Sampling

So far, we have introduced a hierarchical procedure to generate candidates for an MC technique, adapting the mean vectors of a set of proposal densities. In this section, we provide a general framework for multiple importance sampling (MIS) techniques using a population of proposal densities, which embeds various alternative schemes proposed in the literature ElviraMIS15 . First, we consider several alternatives of static MIS, and then we focus on the corresponding adaptive MIS samplers.

As we have already highlighted, finding a good proposal pdf, q(x)q({\bf x}), is critical and is in general very challenging (Owen00, ). An alternative strategy consists in using a population of proposal pdfs. This approach is often known in the literature as multiple importance sampling (MIS) (mcbookOwen, ; Owen00, ; Veach95, ; ElviraMIS15, ). Consider a set of JJ proposal pdfs,

with heavier tails than the target π\pi, and let us assume that MM samples are drawn from each of them, i.e.,

In this scenario, the weights associated to the samples can be obtained following at least one of these two strategies:

Deterministic mixture MIS (DM-MIS) (Owen00, ; Veach95, ):

for j=1,...,Jj=1,...,J and m=1,,Mm=1,\ldots,M, and where ψ(x)=1Jj=1Jqj(x)\psi({\bf x})=\frac{1}{J}\sum_{j=1}^{J}q_{j}({\bf x}) is the mixture pdf, composed of all the proposal pdfs. This approach is based on the considerations provided in Appendix B.

In both cases, the consistency of the estimators is ensured ElviraMIS15 . The main advantage of the DM-MIS weights is that they yield more efficient estimators than using the standard importance weights (CORNUET12, ; Owen00, ; LetterVictor, ; APIS15, ). However, the DM-MIS estimator is computationally more expensive, as it requires JMJM total evaluations for each proposal instead of just MM, for computing all the weights. The number of evaluations of the target π(x)\pi({\bf x}) is the same regardless of whether the weights are calculated according to Eq. (12) or (13), so this increase in computational cost may not be relevant in many applications. However, in some other cases this additional computational load may be excessive (especially for large values of JJ) and alternative efficient solutions are desirable. For instance, the use of partial mixtures has been proposed in (LetterVictor, ):

All the previous cases can be captured by a generic mixture-proposal Φj(x)\Phi_{j}({\bf x}), under which the MIS weights can be defined as

with m=1,,Mm=1,\ldots,M, where Φj(xj(m))=qj(xj(m))\Phi_{j}({\bf x}_{j}^{(m)})=q_{j}({\bf x}_{j}^{(m)}) in Eq. (12), Φj(xj(m))=1Jk=1Jqk(xj(m))\Phi_{j}({\bf x}_{j}^{(m)})=\frac{1}{J}\sum_{k=1}^{J}q_{k}({\bf x}_{j}^{(m)}) in Eq. (13), and

in Eq. (15). In any case, the weights are always normalized as

Table 1 shows these three choices of Φj(xj(m))\Phi_{j}({\bf x}_{j}^{(m)}), whereas Table 2 summarizes a generalized static MIS procedure.

Note that the IS estimator I^\hat{I} of a specific moment of πˉ\bar{\pi}, i.e., the integral II given in Eq. (3), and the approximation Z^\hat{Z} of the normalizing constant in Eq. (4), can now be approximated as

Then, the particle approximation of the measure of πˉ\bar{\pi} is given by

In Section 4.2, we describe a framework where a partial grouping of the proposal pdfs arises naturally from the sampler’s definition.

2 Generalized Adaptive Multiple Importance Sampling

for the nn-th proposal (see Figure 1). At the tt-th iteration, the adaptation procedure takes into account statistical information about the target distribution gathered in the previous iterations, 1,,t11,\ldots,t-1, using one of the many procedures that have been proposed in the literature (pmc-cappe08, ; Cappe04, ; CORNUET12, ; APIS15, ). Furthermore, at the tt-th iteration, MM samples are drawn from each proposal qn,tq_{n,t},

n=1,,Nn=1,\ldots,N and t=1,,Tt=1,\ldots,T. An importance weight wn,t(m)w_{n,t}^{(m)} is then assigned to each sample xn,t(m){\bf x}_{n,t}^{(m)}. Several strategies can be applied to build wn,t(m)w_{n,t}^{(m)} considering the different MIS approaches, as discussed in the previous section. Figure 1 provides a graphical representation of this scenario, by showing both the spatial and temporal evolution of the J=NTJ=NT proposal pdfs.

is associated to each sample xn,t(m){\bf x}_{n,t}^{(m)}. In a standard MIS approach, the function employed in the denominator is

In the complete DM-MIS case, the function Φn,t\Phi_{n,t} is

This case corresponds to the external blue rectangle in Fig. 1. Two natural alternatives of partial DM-MIS schemes appear in this scenario. The first one uses the following partial mixture

with n=1,,Nn=1,\ldots,N, in the denominator of the IS weight. Namely, we consider the temporal evolution of the nn-th single proposal qn,tq_{n,t}. Hence, we have L=NL=N mixtures, each one formed by P=TP=T components (horizontal red rectangle in Fig. 1). The other possibility is considering the mixture of all the qn,tq_{n,t}’s at the tt-th iteration, i.e.,

for t=1,,Tt=1,\ldots,T, so that we have L=TL=T mixtures, each one formed by P=NP=N components (vertical green rectangle in Fig. 1). The function Φn,t\Phi_{n,t} in Eq. (23) is used in the standard PMC scheme (Cappe04, ); Eq. (25) with N=1N=1 has been considered in adaptive multiple importance sampling (AMIS) (CORNUET12, ). Eq. (26) has been applied in the adaptive population importance sampling (APIS) algorithm (APIS15, ), whereas in other techniques, such as Mixture PMC (pmc-cappe08, ; Douc07a, ; Douc07b, ), a similar strategy is employed but using a standard sampling of the mixture ϕt(x)\phi_{t}({\bf x}).

Note that, using ψ(x)\psi({\bf x}) and ξn(x)\xi_{n}({\bf x}), the computational cost per iteration increases as the total number of iterations TT grows. Indeed, at the tt-th iteration all the previous proposals qn,1,,qn,t1q_{n,1},\ldots,q_{n,t-1} (for all nn) must be evaluated at all the new samples xn,t(m){\bf x}_{n,t}^{(m)}. Hence, algorithms based on these proposals quickly become unfeasible as the number of iterations grows. On the other hand, using ϕt(x)\phi_{t}({\bf x}) the computational cost per iteration is controlled by NN, remaining invariant regardless of the number of adaptive steps performed.

Observe also that a suitable AIS scheme builds iteratively a global IS estimator which uses the normalized weights

for n=1,,Nn=1,\ldots,N, m=1,,Mm=1,\ldots,M, and t=1,,Tt=1,\ldots,T.

Table 4 shows an iterative version of GAMIS. We remark that, at the tt-th iteration, the weights of the samples previously generated need to be recalculated, as shown in step 2(c-3) of Table 4. The choices Φn,t(x)=qn,t(x)\Phi_{n,t}({\bf x})=q_{n,t}({\bf x}) or Φn,t(x)=ϕt(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x}) allow avoiding completely this re-computation step of the weights. For simplicity, in Table 4 we have provided the output of the algorithms as weighted samples, i.e., all the pairs {xn,t(m),ρˉn,t(m)}\{{\bf x}_{n,t}^{(m)},\bar{\rho}_{n,t}^{(m)}\}. However, the output can be equivalently expressed as an estimator of a specific moment of the target. In this case, the final IS estimators I^T\hat{I}_{T} and Z^T\hat{Z}_{T} are

where ρˉn,τ(m)=wn,τ(m)NMTZ^T\bar{\rho}_{n,\tau}^{(m)}=\frac{w_{n,\tau}^{(m)}}{{NMT\hat{Z}}_{T}}. Moreover, the final particle approximation is

The estimators in Eq. (29) can be expressed recursively, thus providing an estimate at each iteration tt, as stated before. Starting with H0=0H_{0}=0, I^0=0{\hat{I}}_{0}=0, and setting St=n=1Nm=1Mwn,t(m)S_{t}=\sum_{n=1}^{N}\sum_{m=1}^{M}w_{n,t}^{(m)} and Ht=Ht1+StH_{t}=H_{t-1}+S_{t}, we have

where A^t=n=1Nm=1Mwn,t(m)Stf(xn,t(m)){\hat{A}}_{t}=\sum_{n=1}^{N}\sum_{m=1}^{M}\frac{w_{n,t}^{(m)}}{S_{t}}f({\bf x}_{n,t}^{(m)}) is the partial IS estimator using only the samples drawn at the tt-th iteration. Therefore, I^t{\hat{I}}_{t} can be seen as a convex combination of the two IS estimators I^t1{\hat{I}}_{t-1} and A^t{\hat{A}}_{t} (for further explanations see Eqs. (46)-(47) in Appendix B.3). Finally, note that

A brief discussion about the consistency of I^t{\hat{I}}_{t} and Z^t{\hat{Z}}_{t} is provided in Appendix A.

Markov adaptation for GAMIS

In this section, we design efficient adaptive importance sampling (AIS) techniques by combining the main ideas discussed in the two previous sections. More specifically, we apply the hierarchical MC approach to adapt the proposal pdfs within a GAMIS scheme. Therefore, a Markov GAMIS technique, or simply Markov Adaptive Importance Sampling (MAIS) algorithm, consists of the following two layers:

Upper level (Adaptation): Given the set of mean vectors,

obtain the new set Pt={μ1,t,,μN,t}\mathcal{P}_{t}=\{{\bm{\mu}}_{1,t},\ldots,{\bm{\mu}}_{N,t}\} according to MCMC transitions with πˉ\bar{\pi} as invariant density. More specifically, a kernel K(μ1:N,tμ1:N,t1)K({\bm{\mu}}_{1:N,t}|{\bm{\mu}}_{1:N,t-1}) leaving invariant the distribution n=1Nπˉ(μn)\prod_{n=1}^{N}{\bar{\pi}}({\bm{\mu}}_{n}) is applied.

Lower level (MIS estimator): Given the population of proposals,

choose a function Φn,t(x)\Phi_{n,t}({\bf x}) for the computation of the weights in Eq. (22), and perform a MIS approximation of the target as described in Section 4.2.

The motivation behind the MCMC adaptation has been described in Section 3.2 and 3.3: the functions qn,tq_{n,t}, located at the μn,t{\bm{\mu}}_{n,t}’s, jointly provide a kernel estimate of the target πˉ{\bar{\pi}}.

Furthermore, we recall that the generation of the means, μn,t{\bm{\mu}}_{n,t}, is completely independent from the samples xn,t{\bf x}_{n,t} drawn in the lower level. This is a key point from a theoretical and practical point of view. Indeed, the generic MAIS algorithm can be divided in two steps: (a) first generate all the means {μn,t}t=1T\{{\bm{\mu}}_{n,t}\}_{t=1}^{T} for n=1,,Nn=1,\ldots,N, (b) then perform the MIS estimation considering all the proposals qn,t(xμn,t,Cn)q_{n,t}({\bf x}|{\bm{\mu}}_{n,t},{\bf C}_{n}), n\forall n and t\forall t. Namely, any MAIS technique can be converted into a generalized static MIS scheme (see Section 4.1). As a consequence, the unique conditions required for ensuring the consistency of the corresponding estimators are ElviraMIS15 ; Robert04 :

All the proposal pdfs, qn,tq_{n,t}, must have heavier tails than the target πˉ{\bar{\pi}}.

A suitable function Φn,t(x)\Phi_{n,t}({\bf x}) for the denominator of the importance weights must be chosen. Namely, the use of Φn,t(x)\Phi_{n,t}({\bf x}) provides consistent estimators ElviraMIS15 , like the functions Φn,t(x)\Phi_{n,t}({\bf x}) described in Section 4.2.

Moreover, the independence of the upper level from the lower level of the hierarchical approach, helps the parallelization of the algorithms as we discuss later.

Table 5 compares different AIS schemes. In the standard AIS method Bugallo15 , the sequence of {μn,t}\{{\bm{\mu}}_{n,t}\} converges to a unknown fixed vector as tt\rightarrow\infty. In the standard PMC algorithm Cappe04 , the limiting distribution of {μn,t}\{{\bm{\mu}}_{n,t}\} is unknown. Furthermore, in both cases, standard AIS and PMC, the adaptation depends on the previously generated samples x{\bf x}’s. In MAIS techniques, the use of an ergodic chain (with invariant pdf πˉ{\bar{\pi}}) for generating the nn-th mean vector μn,t{\bm{\mu}}_{n,t} ensures that its asymptotic density is πˉ(μ){\bar{\pi}}({\bm{\mu}}).

2 The new class of algorithms

Markov GAMIS framework can lead to many different algorithms, depending on the MCMC strategy used to update the mean vectors and the specific choice of the function Φn,t\Phi_{n,t}. Table 6 provides several examples of novel techniques determined by the value of NN, the choice of Φn,t\Phi_{n,t}, and the type of MCMC adaptation. Some of them are variants of well-known techniques like PMC (Cappe04, ) and AMIS (CORNUET12, ), where the Markov adaptation procedure is employed. Others, such as the Random Walk Importance Sampling (RWIS), the Parallel Interacting Markov Adaptive Importance Sampling (PI-MAIS) and Doubly Interacting Markov Adaptive Importance Sampling (I2-MAIS), are described below in detail. For these completely novel algorithms we have set Φn,t(x)=ϕt(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x}), so that the computational cost is directly controlled by NN and the re-weighting step 2(c-3) in Table 4 is not required.

RWIS is the simplest possible Markov GAMIS algorithm. Specifically, for the MCMC adaptation we consider a standard MH technique, setting N=1N=1 and choosing Φn,t(x)=ϕt(x)=qn,t(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x})=q_{n,t}({\bf x}) (since N=1N=1, the two cases coincide). Table 7 shows the RWIS algorithm, which is a special case of the more general scheme described in Table 8 when N=1N=1. Note that we have a proposal pdf used for the MH adaptation, φ(μμt1,Λ)\varphi({\bm{\mu}}|{\bm{\mu}}_{t-1},{\bf\Lambda}), which is different from the proposal pdf used for the IS estimation, q(xμt,C)q({\bf x}|{\bm{\mu}}_{t},{\bf C}).

3 Population-based algorithms

The RWIS technique can be easily extended by using a population of NN proposal pdfs. In this case, we choose

so that the computational cost of evaluating the mixture Φn,t(x)=ϕt(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x}) depends only on NN, regardless of the number tt of iterations. Moreover, step 2(c-3) in Table 4 is not required in this case. Table 8 describes the corresponding algorithm without specifying the MCMC approach used for generating the population of means, Pt={μ1,t,...,μN,t}{\bf\mathcal{P}}_{t}=\{{\bm{\mu}}_{1,t},...,{\bm{\mu}}_{N,t}\}, given Pt1{\bf\mathcal{P}}_{t-1}.

Two possible adaptation procedures via MCMC are discussed below. In the first one, we consider NN independent parallel chains for updating the NN mean vectors. We refer to this method as Parallel Interacting Markov Adaptive Importance Sampling (PI-MAIS). Although PI-MAIS is parallelizable, in the iterative version of Table 8 the NN independent processes cooperate together in Eq. (36) to provide unique global IS estimate. In the second adaptation scheme, we introduce the interaction also in the upper level. Hence, we refer to this method as Doubly Interacting Markov Adaptive Importance Sampling (I2-MAIS). In both cases, the corresponding technique provides an IS approximation of the target or, equivalently, the estimators I^T\hat{I}_{T} and Z^T\hat{Z}_{T} in Eq. (29), using NMTNMT samples.

The simplest option is applying one iteration of NN parallel MCMC chains, one for each μn,t1{\bm{\mu}}_{n,t-1} returning μn,t{\bm{\mu}}_{n,t}, for n=1,,Nn=1,\ldots,N. For instance, given NN parallel MH transitions, each one employing (possibly) a different proposal pdf φn\varphi_{n} with covariance matrix Λn{\bf\Lambda}_{n}, we have: For n=1,,Nn=1,\ldots,N:

Draw μφn(μμn,t1,Λn){\bm{\mu}}^{\prime}\sim\varphi_{n}({\bm{\mu}}|{\bm{\mu}}_{n,t-1},{\bf\Lambda}_{n}).

Set μn,t=μ{\bm{\mu}}_{n,t}={\bm{\mu}}^{\prime} with probability

otherwise set μn,t=μn,t1{\bm{\mu}}_{n,t}={\bm{\mu}}_{n,t-1} (with probability 1α1-\alpha).

Figure 2(a) illustrates this scenario. Each mean vector μn,t{\bm{\mu}}_{n,t} is updated independently from the rest. Therefore, in PI-MAIS, the interaction among the different processes occurs only in the underlying IS layer of the hierarchical structure: the importance weights in Eq. (36) are built using the partial DM-MIS strategy with ϕt(x)=1Nn=1Nqn,t(xμn,t,Cn)\phi_{t}({\bf x})=\frac{1}{N}\sum_{n=1}^{N}q_{n,t}({\bf x}|{\bm{\mu}}_{n,t},{\bf C}_{n}). Considerations about the parallelization of PI-MAIS are given in Section 5.5.

3.2 MCMC adaptation for I2-MAIS

Thus, one transition is formed by the following steps:

Draw Pφ(PPt1){\bf P}^{\prime}\sim\varphi({\bf P}|{\bf P}_{t-1}), where P=[μ1,,μN]{\bf P}^{\prime}=[{\bm{\mu}}_{1}^{\prime},\ldots,{\bm{\mu}}_{N}^{\prime}].

Set Pt=P{\bf P}_{t}={\bf P}^{\prime} with probability

otherwise set Pt=Pt1{\bf P}_{t}={\bf P}_{t-1} (with probability 1α1-\alpha).

At each iteration, NN new samples μn{\bm{\mu}}_{n}^{\prime} are drawn (as in PI-MAIS) and therefore NN new evaluations of π\pi are required (i.e., one evaluation of πg\pi_{g}). When a new P{\bf P}^{\prime} is accepted, all the components of Pt{\bf P}_{t} differ from Pt1{\bf P}_{t-1}, unlike in the strategy described later. However, the probability of accepting a new population becomes very small for large values of NN. Sample Metropolis-Hastings (SMH) algorithm SMH is a population-based MCMC technique, suitable for our purposes (Liang10, , Chapter 5). At each iteration tt, given the previous set

a new possible parameter μ0,t1{\bm{\mu}}_{0,t-1}, drawn from an independent proposal φ(μ)\varphi({\bm{\mu}}), is tested to be interchanged with another parameter in Pt1={μ1,t1,...,μN,t1}{\bf\mathcal{P}}_{t-1}=\{{\bm{\mu}}_{1,t-1},...,{\bm{\mu}}_{N,t-1}\}. The underlying idea of SMH is to replace one “bad” sample in the population Pt1{\bf\mathcal{P}}_{t-1} with a potentially “better” one, according to a certain suitable probability α\alpha. The algorithm is designed so that, after a burn-in period, the elements in Pt{\bf\mathcal{P}}_{t} are distributed according to πˉg(μ1,,μN){\bar{\pi}}_{g}({\bm{\mu}}_{1},\ldots,{\bm{\mu}}_{N}). One iteration of SMH consists of the following steps:

Draw a candidate μ0,t1φ(μ){\bm{\mu}}_{0,t-1}\sim\varphi({\bm{\mu}}).

Choose a “bad” sample, μk,t1{\bm{\mu}}_{k,t-1} with k{1,...,N}k\in\{1,...,N\}, from the population according to a probability proportional to φ(μk,t1)π(μk,t1)\frac{\varphi({\bm{\mu}}_{k,t-1})}{\pi({\bm{\mu}}_{k,t-1})}, which corresponds to the inverse of the standard IS weights.

Accept the new population, Pt={μ1,t,,μN,t}\mathcal{P}_{t}=\{{\bm{\mu}}_{1,t},\ldots,{\bm{\mu}}_{N,t}\} with μn,t=μn,t1{\bm{\mu}}_{n,t}={\bm{\mu}}_{n,t-1} for all nkn\neq k and μk,t=μ0,t1{\bm{\mu}}_{k,t}={\bm{\mu}}_{0,t-1}, with probability

Otherwise, set Pt=Pt1{\bf\mathcal{P}}_{t}={\bf\mathcal{P}}_{t-1}.

Draw μ{\bm{\mu}}^{\prime} from a proposal pdf φn(μμn1,t,Λn)\varphi_{n}({\bm{\mu}}|{\bm{\mu}}_{n-1,t},{\bf\Lambda}_{n}).

Set μn,t=μ{\bm{\mu}}_{n,t}={\bm{\mu}}^{\prime} with probability

otherwise set μn,t=μn1,t{\bm{\mu}}_{n,t}={\bm{\mu}}_{n-1,t}.

This scenario is illustrated in Fig. 2(d). In this case, after TTiterations of the I2-MAIS scheme, we generate a unique MH chain with NTNT total states, divided in TT parts of NN states. At each iteration of the I2-MAIS scheme, each block of NN states is employed as mean vector of the NN proposal pdfs used in the lower level.

4 Computational cost: comparison between PI-MAIS and I2-MAIS

In all cases, the total number of samples involved in the final estimation is NMTNMT. The total number of evaluations of the target, EE, is larger due to the MCMC implementation, i.e., E>NMTE>NMT. More precisely, the total number of evaluations of the target is:

E=MNT+NTE=MNT+NT, for I2-MAIS with the MH-within-Gibbs approach.

Note that we have taken into account that several evaluations of the target have been computed in the previous iterations. Moreover, the application of the MCMC techniques requires generation of VV additional uniform r.v.’s for performing the acceptance tests (and additional r.v.’s for choosing a “bad” candidate in SMH). Specifically, we need: V=NTV=NT uniform r.v.’s in PI-MAIS and I2-MAIS with MH-within-Gibbs, V=TV=T uniform r.v.’s for I2-MAIS with MH in the extended space, andV=2TV=2T, TT uniform r.v. and TT multinomial r.v., for I2-MAIS with SMH. However, in practical applications, the main computational effort is usually required for the target evaluation. The computing time required in the multinomial sampling within SMH increases with NN. Finally, we recall that we have used a deterministic mixture weighting scheme with Φn,t(x)=ϕt(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x}), which requires MN2TMN^{2}T evaluations of the proposal pdfs, qn,t(x)q_{n,t}({\bf x}), for n=1,,Nn=1,\ldots,N and t=1,,Tt=1,\ldots,T.

5 Non-iterative and parallel implementations

As remarked in Section 5.1, the choice of the means μn,t{\bm{\mu}}_{n,t}’s is completely independent from the estimation steps. Thus, all the means can be selected in advance (also in parallel if the strategy in Section 5.3.1 is used), and the MIS estimation steps can then be performed as in a completely static framework (i.e., as described in Section 4.1). This consideration is valid for any choice of Φn,t(x)\Phi_{n,t}({\bf x}).

Let us consider now the choice of Φn,t\Phi_{n,t}’s as temporal mixtures, i.e., Φn,t=1Tt=1Tqn,t(x)\Phi_{n,t}=\frac{1}{T}\sum_{t=1}^{T}q_{n,t}({\bf x}) or Φn,t(x)=qn,t(x)\Phi_{n,t}({\bf x})=q_{n,t}({\bf x}). Moreover, let us consider the use of NN parallel MCMC chains for adapting the means, i.e., one independent chain for each parameter μn,t{\bm{\mu}}_{n,t}, with n=1,,Nn=1,\ldots,N. In this case, the corresponding algorithm is completely parallelizable. Indeed, it can be decomposed into NN parallel MAIS techniques, each one producing the partial estimators I^n,T{\hat{I}}_{n,T} and Z^n,T{\hat{Z}}_{n,T}, after TT iterations. The global estimators are then given by

Furthermore, different strategies for sharing information among the parallel chains can also be applied (Craiu09, ; O-MCMC, ; Smelly, ; Geyer91, ; Marinari92, ; Neal01, ), or for reducing the total number of evaluations of the target Jacob11 (the scheme in Jacob11 can be applied if a unique independent proposal is employed, i.e., φn(μ)=φ(μ)\varphi_{n}({\bm{\mu}})=\varphi({\bm{\mu}}) for all nn).

Numerical simulations

In this section, we test the performance of the proposed scheme comparing them with other benchmark techniques. First of all, we tackle two challenging issues for adaptive Monte Carlo methods: multimodality in Section 6.1 and nonlinearity in Section 6.2. Furthermore, in Section 6.4 we consider an application of positioning and tuning model parameters in a wireless sensor network (Ali07, ; Ihler05, ; MartinoStatCo10, ).

In this section, we test the novel proposed algorithms in a multimodal scenario, comparing with several other methods. Specifically, we consider a bivariate multimodal target pdf, which is itself a mixture of 55 Gaussians, i.e.,

with means ν1={\bf\nu}_{1}=^{\top}, ν2={\bf\nu}_{2}=^{\top}, ν3={\bf\nu}_{3}=^{\top}, ν4={\bf\nu}_{4}=^{\top}, ν5={\bf\nu}_{5}=^{\top}, and covariance matrices Σ1=[2, 0.6;0.6, 1]{\bf\Sigma}_{1}=[2,\ 0.6;0.6,\ 1], Σ2=[2, 0.4;0.4, 2]{\bf\Sigma}_{2}=[2,\ -0.4;-0.4,\ 2], Σ3=[2, 0.8;0.8, 2]{\bf\Sigma}_{3}=[2,\ 0.8;0.8,\ 2], Σ4=[3, 0;0, 0.5]{\bf\Sigma}_{4}=[3,\ 0;0,\ 0.5] and Σ5=[2, 0.1;0.1, 2]{\bf\Sigma}_{5}=[2,\ -0.1;-0.1,\ 2]. The main challenge in this example is the ability in discovering the 55 different modes of πˉ(x)π(x)\bar{\pi}({\bf x})\propto\pi({\bf x}). Since we know the moments of π(x)\pi({\mathbf{x}}), we can easily assess the performance of the different techniques.

Given a random variable (r.v.) Xπˉ(x){\bf X}\sim{\bar{\pi}}({\bf x}), we consider the problem of approximating via Monte Carlo the expected value E[X]=[1.6,1.4]E[{\bf X}]=[1.6,1.4]^{\top} and the normalizing constant Z=1Z=1. Note that an adequate approximation of ZZ requires the ability of learning about all the 55 modes. We compare the performances of different sampling algorithms in terms of Mean Square Error (MSE): (a) the AMIS technique (CORNUET12, ), (b) three different PMC schemesThe standard PMC method (Cappe04, ) is described in Section C., two of them proposed in (pmc-cappe08, ; Cappe04, ) and one PMC using a partial DM-MIS scheme with Φn,t(x)=ϕt(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x}), (c) NN parallel independent MCMC chains and (d) the proposed PI-MAIS method. Moreover, we test two static MIS approaches, the standard MIS and a partial DM-MIS schemes with Φn,t(x)=ϕt(x)\Phi_{n,t}({\bf x})=\phi_{t}({\bf x}), computing iteratively the final estimator.

For a fair comparison, all the mentioned algorithms have been implemented in such a way that the number of total evaluations of the target is E=2105E=2\cdot 10^{5}. All the involved proposal densities are Gaussian pdfs. More specifically, in PI-MAIS, we use the following parameters: N=100N=100, M{1,19,99}M\in\{1,19,99\}, T{20,100,1000}T\in\{20,100,1000\} in order to fulfill E=MNT+NT=(M+1)NT=2105E=MNT+NT=(M+1)NT=2\cdot 10^{5} (see Section 5.4). The proposal densities of the upper level of the hierarchical approach, φn(xμn,t,Λn)\varphi_{n}({\mathbf{x}}|{\bm{\mu}}_{n,t},{\bf\Lambda}_{n}), are Gaussian pdfs with covariance matrices Λn=λ2I2{\bf\Lambda}_{n}=\lambda^{2}{\bf I}_{2} and λ{5,10,70}\lambda\in\{5,10,70\}. The proposal densities used in the lower importance sampling level, qn,t(xμn,t,Cn)q_{n,t}({\mathbf{x}}|{\bm{\mu}}_{n,t},{\bf C}_{n}) are Gaussian pdfs with covariance matrices Cn=σ2I2{\bf C}_{n}=\sigma^{2}{\bf I}_{2} and σ{0.5,1,2,5,10,20,70}\sigma\in\{0.5,1,2,5,10,20,70\}. We also try different non-isotropic diagonal covariance matrices in both levels, i.e, Λn=diag(λn,12,λn,22){\bf\Lambda}_{n}=\textrm{diag}(\lambda_{n,1}^{2},\lambda_{n,2}^{2}), where λi,jU()\lambda_{i,j}\sim\mathcal{U}(), and Cn=diag(σn,12,σn,22){\bf C}_{n}=\textrm{diag}(\sigma_{n,1}^{2},\sigma_{n,2}^{2}), where σn,jU()\sigma_{n,j}\sim\mathcal{U}() for j{1,2}j\in\{1,2\} and n=1,Nn=1,\ldots N. We test all these techniques using two different initializations: first, we choose deliberately a “bad” initialization of the initial mean vectors, denoted as In1, in the sense that the initialization region does not contain the modes of π\pi. Thus, we can test the robustness of the algorithms and their ability to improve the corresponding static approaches. Specifically, the initial mean vectors are selected uniformly within the following square

for n=1,,Nn=1,\ldots,N. Different examples of this configuration are shown in Fig. 3 with squares. Secondly, we also consider a better initialization, denoted as In2, where the initialization region contains all the modes. Specifically, the initial mean vectors are selected uniformly within the following square

for n=1,,Nn=1,\ldots,N. All the results are averaged over 21032\cdot 10^{3} independent experiments. Tables 9 and 10 show the Mean Square Error (MSE) in the estimation of the first component of E[X]E[{\bf X}], with the initialization In1 and In2 respectively. Table 11 provides the MSE in the estimation of ZZ with In1. The best results in each column are highlighted in bold-face. In AMIS (CORNUET12, ), the mean vector and the covariance matrix of a single proposal (i.e., N=1N=1) are adapted, using Φ1,t(x)=ξ1(x)\Phi_{1,t}({\bf x})=\xi_{1}({\bf x}) in the computation of the IS weights. Hence, in AMIS, we have tested different values of samples per iterations M{500,103,2103,5103,104}M\in\{500,10^{3},2\cdot 10^{3},5\cdot 10^{3},10^{4}\} and T=EMT=\frac{E}{M}. For the sake of simplicity, we directly show the worst and best results among the several simulations made with different parameters. PI-MAIS outperforms the other algorithms virtually for all the choices of the parameters, with both initializations. In general, a greater value of TT is needed since the proposal pdfs are initially bad localized. Moreover, PI-MAIS always improves the performance of the static approaches. These two consideration show the benefit of the Markov adaptation. Hence, PI-MAIS presents more robustness with respect to the initial values and the choice of the covariance matrices. Figure 6(a) providing a summary of the results in Table 9 showing the log(\mboxMSE)\log(\mbox{MSE}) as function of the log(σ)\log(\sigma), for the main compared methods. Figure 3 depicts the initial (squares) and final (circles) configurations of the mean vectors of the proposal densities for the standard PMC and the PI-MAIS methods, in a specific run and different values of σ,λ{3,5}\sigma,\lambda\in\{3,5\}. In both cases, PI-MAIS guarantees a better covering of the modes of π(x)\pi({\bf x}).

2 Nonlinear banana-shaped target distribution

Here we consider a bi-dimensional “banana-shaped” target distribution (Haario2001, ), which is a benchmark function in the literature due to its nonlinear nature. Mathematically, it is expressed as

where, we have set B=10B=10, η1=4\eta_{1}=4, η2=5\eta_{2}=5, and η3=5\eta_{3}=5. The goal is to estimate the expected value E[X]E[{X}], where X=[X1,X2]πˉ(x1,x2){X}=[X_{1},X_{2}]\sim{\bar{\pi}}(x_{1},x_{2}), by applying different Monte Carlo approximations. We approximately compute the true value E[X][0.4845,0]E[{X}]\approx[-0.4845,0]^{\top} using an exhaustive deterministic numerical method (with an extremely thin grid), in order to obtain the mean square error (MSE) of the following methods: standard PMC (Cappe04, ), the Mixture PMC (pmc-cappe08, ), the AMIS (CORNUET12, ), PI-MAIS and I2-MAIS with SMH adaptation.

We consider Gaussian proposal distributions for all the algorithms. The initialization has been performed by randomly drawing the parameters of the Gaussians, with the mean of the nn-th proposal given by μn,0U(×){\bm{\mu}}_{n,0}\sim\mathcal{U}(\times), and its covariance matrix given by Cn=[σn,120; 0σn,22]{\bf C}_{n}=[\sigma_{n,1}^{2}\quad 0;\ 0\quad\sigma_{n,2}^{2}]^{\top}. We have considered two cases: an isotropic setting where σn,k{1,2,,10}\sigma_{n,k}\in\{1,2,\ldots,10\} with k=1,2k=1,2, and an anisotropic case with random selection of the parameters where σn,kU()\sigma_{n,k}\sim\mathcal{U}(), with k=1,2k=1,2. Recall that in AMIS and Mixture PMC, the covariance matrices are also adapted.

For each algorithm, we test several combinations of parameters, keeping fixed the total number of target evaluations, E=2105E=2\cdot 10^{5}. In the standard PMC method, described in Section C), we consider N{50,100,103,5103}N\in\{50,100,10^{3},5\cdot 10^{3}\} and T=ENT=\frac{E}{N} (here M=1M=1). In Mixture PMC, we consider different number of component in the mixture proposal pdf N{10,50,100}N\in\{10,50,100\}, and different samples per proposal S{100,200,103,2103,5103}S\in\{100,200,10^{3},2\cdot 10^{3},5\cdot 10^{3}\} at each iteration (here T=EST=\frac{E}{S}). In AMIS, we test S{500,103,2103,5103,104}S\in\{500,10^{3},2\cdot 10^{3},5\cdot 10^{3},10^{4}\} and T=EST=\frac{E}{S} (we recall N=1N=1). The range of these values of parameters are chosen, after a preliminary study, in order to obtain the best performance from each technique. In PI-MAIS an I2-MAIS, we set N{50,100}N\in\{50,100\}. For the adaptation in PI-MAIS, we also consider Gaussian pdfs φn(xμn,t,Λn)\varphi_{n}({\mathbf{x}}|{\bm{\mu}}_{n,t},{\bf\Lambda}_{n}), covariance matrices Λn=λ2I2{\bf\Lambda}_{n}=\lambda^{2}{\bf I}_{2} with λ{3,5,10,20}\lambda\in\{3,5,10,20\}. In I2-MAIS, for the SMH method we use a Gaussian pdf with mean ^{\top} and covariance matrix Λ=λ2I2{\bf\Lambda}=\lambda^{2}{\bf I}_{2} and again λ{3,5,10,20}\lambda\in\{3,5,10,20\}. We test M{1,9,19}M\in\{1,9,19\} for both, so that T=EN(M+1)T=\frac{E}{N(M+1)} for PI-MAIS and T=ENM+1T=\lfloor\frac{E}{NM+1}\rfloor for I2-MAIS (see Section 5.4).

The results are averaged 500500 over independent simulations, for each combination of parameters. Table 12 shows the smallest and highest MSE values obtained in the estimation of the expected value of the target, averaged between the two components of E[X]E[{X}], achieved by the different methods. The smallest MSEs in each column (each σ\sigma) are highlighted in bold-face. PI-MAIS and I2-MAIS outperform the other techniques virtually for all the values of σ\sigma. In this example, AMIS also provides good results. Figure 7 show a graphical representation of the results in Table 12, with the exception of the last column.

Fig. 4 displays the initial (squares) and final (circles) configurations of the mean vectors of the proposals for the different algorithms, in one specific run. Since in Mixture PMC and AMIS the covariance matrices are also adapted, we show the shape of some proposals as ellipses representing approximately 85%85\% of probability mass. For, PMC we also depict a last resampling output with triangles, in order to show the loss in diversity. Unlike PMC, PI-MAIS ensures a better covering of the region of high probability.

3 High dimensional target distribution

Let us consider again a mixture of isotropic Gaussians as target pdf, i.e.,

where νk=[νk,1,,νk,Dx]{\bm{\nu}}_{k}=[\nu_{k,1},\ldots,\nu_{k,D_{x}}]^{\top}, and Σk=χk2IDx{\bm{\Sigma}}_{k}=\chi_{k}^{2}{\bf I}_{D_{x}} for k{1,2,3}k\in\{1,2,3\}, with IDx{\bf I}_{D_{x}} being the Dx×DxD_{x}\times D_{x} identity matrix. We set ν1,j=5\nu_{1,j}=-5, ν2,j=6\nu_{2,j}=6, ν3,j=3\nu_{3,j}=3 for all j=1,...,Dxj=1,...,D_{x}, and χk=8\chi_{k}=8 for all k{1,2,3}k\in\{1,2,3\}. The expected value of the target π(x){\pi}({\bf x}) is then E[Xj]=43E[{X_{j}}]=\frac{4}{3} for j=1,,Dxj=1,\ldots,D_{x}. In order to study the performance of the proposed scheme as the dimension of the state space increases, we vary the dimension of the state space in Eq. (40) testing different values of DxD_{x} (with 2Dx502\leq D_{x}\leq 50).

We consider the problem of approximating via Monte Carlo the expected value of the target density, and we compare the performance of different methods: (a) the standard PMC scheme (Cappe04, ), (b) NN parallel independent MH chains (Par-MH), (c) a standard Sequential Monte Carlo (SMC) scheme (Moral06, ) and (d) the proposed PI-MAIS method. We test the algorithms with N{100,500}N\in\{100,500\}. All the proposal pdfs involved in the experiments are Gaussians, with the same covariance matrices for all the techniques. The initial mean vectors in all techniques are selected randomly and independently as μn,0U([6×6]Dx){\bm{\mu}}_{n,0}\sim\mathcal{U}([-6\times 6]^{D_{x}}) for n=1,,Nn=1,\ldots,N.

Again, all the mentioned algorithms have been implemented in such a way that the number of total evaluations of the target is E=2105E=2\cdot 10^{5}. More specifically, in PI-MAIS, we use two sets of parameters: with N=100N=100, M=19M=19, T=100T=100, and with N=500N=500, M=19M=19, T=20T=20 in order to fulfill E=(M+1)NT=2105E=(M+1)NT=2\cdot 10^{5} (see Section 5.4). The proposal pdf of the upper level of the hierarchical approach, φn(xμn,t,Λn)\varphi_{n}({\mathbf{x}}|{\bm{\mu}}_{n,t},{\bf\Lambda}_{n}), are Gaussian pdfs with covariance matrices Λn=λ2IDx{\bf\Lambda}_{n}=\lambda^{2}{\bf I}_{D_{x}} and λ=10\lambda=10. The proposal pdfs used in the lower importance sampling level, qn,t(xμn,t,Cn)q_{n,t}({\mathbf{x}}|{\bm{\mu}}_{n,t},{\bf C}_{n}) are Gaussian pdfs with covariance matrices Cn=σ2IDx{\bf C}_{n}=\sigma^{2}{\bf I}_{D_{x}} again with σ=10\sigma=10 (for a fair comparison with the other techniques). In PMC, Par-MH and SMC we use the same proposals with the same covariances and initial parameters. As described in App. C, in PMC the adaptation is carried out by resampling steps, in SMC an alternation of resampling and MH steps is performed whereas, in Par-MH, NN independent MH chains are carried out.

The results are averaged over 200200 independent simulations. Fig. 8 shows the log-MSE in the estimation of E[X]E[{\bf X}] as a function of the dimension DxD_{x} of the state-space. Fig. 8(a) compares the algorithms with N=100N=100 proposal pdfs, whereas in Fig. 8(b) we have N=500N=500, keeping fixed the number of total evaluations of the target E=2105E=2\cdot 10^{5}. We observe, as expected, the performance of all the methods degenerate as the dimension of the problem, DxD_{x} increases, since we maintain fixed the computational cost E=2105E=2\cdot 10^{5}. PI-MAIS always provides the best results, with the exception for the cases corresponding to N=100N=100 and Dx=35,50D_{x}=35,50 where SMC obtains a lower MSE (for N=100N=100 and Dx=40D_{x}=40, they provide virtually the same MSE).

4 Localization problem in a wireless sensor network

where Θj\Theta_{j} are independent Gaussian variables with identical pdfs, N(ϑj;0,ω2)\mathcal{N}(\vartheta_{j};0,\omega^{2}), j=1,2j=1,2. We also consider a prior density over ω\omega, i.e., Ωp(ω)=N(ω;0,25)I(ω>0),\Omega\sim p(\omega)=\mathcal{N}(\omega;0,25)I(\omega>0), where I(ω>0)I(\omega>0) is 11 if ω>0\omega>0 and otherwise. The parameter A=aA=a is also unknown and we again consider a Gaussian prior Ap(a)=N(a;0,25)A\sim p(a)=\mathcal{N}(a;0,25). Moreover, we also apply Gaussian priors over X{\bf X}, i.e., p(xi)=N(xi;0,25)p(x_{i})=\mathcal{N}(x_{i};0,25) with i=1,2i=1,2. Thus, the posterior pdf is

Our goal is computing the expected value of

via Monte Carlo, in order to provide an estimate of the position of the target, the parameter aa and the standard deviation ω\omega of the noise in the system. We apply PI-MAIS and three different PMC schemes (see example in Section 6.1, for a description), all using NN Gaussian proposals. We initialize the mean vectors so that they are randomly spread within the space of the variables of interest, i.e.,

and the covariance matrices Cn=\mboxdiag(σn,12,,σn,42)I4{\bf C}_{n}=\mbox{diag}(\sigma_{n,1}^{2},\ldots,\sigma_{n,4}^{2}){\bf I}_{4} with n=1,,Nn=1,\ldots,N. The values of the standard deviations σn,j\sigma_{n,j} are chosen randomly for each Gaussian pdf. Specifically, σn,jU([1,Q])\sigma_{n,j}\sim\mathcal{U}([1,Q]), j=1,,4j=1,\ldots,4, where we have considered three possible values for QQ, i.e., Q{5,10,30}Q\in\{5,10,30\}. For the adaptation process of PI-MAIS, we consider also Gaussian proposals with covariance matrices Λn=λ2I2{\bf\Lambda}_{n}=\lambda^{2}{\bf I}_{2} and λ{5,10,70}\lambda\in\{5,10,70\}. We also try different non-isotropic diagonal covariance matrices, i.e, Λn=diag(λn,12,λi,22){\bf\Lambda}_{n}=\textrm{diag}(\lambda_{n,1}^{2},\lambda_{i,2}^{2}), where λn,jU()\lambda_{n,j}\sim\mathcal{U}().

For a fair comparison, all the techniques have been simulated with sets of parameters that yield the same number of target evaluations, fixed to E=2105E=2\cdot 10^{5}. In PI-MAIS, we have chosen parameters N=100N=100, M={1,19,99}M=\{1,19,99\}, T={20,100,1000}T=\{20,100,1000\}. The PMC algorithms has been simulated with N=100N=100 and T=2000T=2000. The MSE of the different estimators (averaged over 30003000 independent runs) are provided in Table 13 and the log(\mboxMSE)\log(\mbox{MSE}) in Figure 6(b). PI-MAIS outperforms always PMC when σn,jU()\sigma_{n,j}\sim\mathcal{U}() and σn,jU()\sigma_{n,j}\sim\mathcal{U}() whereas PMC provides better results for σn,jU()\sigma_{n,j}\sim\mathcal{U}(). Therefore, the results show jointly the robustness and flexibility of the proposed PI-MAIS technique.

Conclusions

In this work, we have introduced a layered (i.e., hierarchical) framework for designing adaptive Monte Carlo methods. In general terms, we have shown that such a hierarchical interpretation lies behind the good performance of two well-known algorithms; a random walk proposal within an MH scheme and the standard PMC method. Furthermore, we have used this approach to introduce a novel class of adaptive importance sampling (AIS) schemes. The novel class of AIS algorithms employs the determinist mixture (DM) idea (Owen00, ; Veach95, ) in order to reduce the variance of the resulting IS estimators. We have extended the use of the DM strategy with respect to other algorithms available in the literature, providing a more general and flexible framework. From an estimation perspective, this framework includes different schemes proposed in literature (CORNUET12, ; APIS15, ) as special cases, although they differ to an extent in terms of the employed adaptation procedure. Our framework also contains several other sampling schemes considering full or partial DM approaches. Finally, we have discussed several aspects of the trade-offs in terms of the computational cost and advantages due to improved accuracy of the resulting estimators. Numerical comparisons with different algorithms on benchmark models have confirmed the benefit of the layered adaptive sampling approaches.

Acknowledgements

This work has been supported by the projects COMONSENS (CSD2008 00010), ALCIT (TEC2012 38800C03 01), DISSECT (TEC2012 38058 C03 01), OTOSiS (TEC 2013 41718 R), and COMPREHENSION (TEC 2012 38883 C02 01), by the BBVA Foundation with ”I Convocatoria de Ayudas Fundaci n BBVA a Investigadores, Innovadores y Creadores Culturales”- MG FIAR project, by the ERC grant 239784 and AoF grant 251170, and by the European Union 7th Framework Programme through the Marie Curie Initial Training Network “Machine Learning for Personalized Medicine” MLPM2012, Grant No. 316861.

References

Appendix A Consistency of GAMIS estimators

First of all, we remark that the complete analysis should take in account the chosen adaptive procedure since, in general, the adaptation uses the information of previous weighted samples. However, in this work we consider an adaption procedure completely independent of the estimation steps, as clarified in Sections 3.4-5.1. This simplifies substantially the analysis as described in Section 5.1.

The consistency of the global estimators in Eq. (29) provided by GAMIS can be considered when number of samples per time step (M×NM\times N) and the number of iterations of the algorithm (TT) grow to infinity. For some exhaustive studies of specific cases, see the analysis in and . Here we provide some brief arguments for explaining why I^T\hat{I}_{T} and Z^T{\hat{Z}}_{T} obtained by a GAMIS scheme are, in general, consistent. Let us assume that qn,tq_{n,t}’s have heavier tails than πˉ(x)π(x)\bar{\pi}({\bf x})\propto\pi({\bf x}). Note that the global estimator I^T\hat{I}_{T} can be seen as a result of a static batch MIS estimator involving LL different mixture-proposals Φn,t(x)\Phi_{n,t}({\bf x}) and J=NMTJ=NMT total number of samples. The weights wn,t(m)w_{n,t}^{(m)} built using Φn,t(x)\Phi_{n,t}({\bf x}) in the denominator of the IS ratio are suitable importance weights yielding consistent estimators, as explained in detail in AppendixB. Hence, for a finite number of iterations T<T<\infty, when MM\rightarrow\infty (or NN\rightarrow\infty), the consistency can be guaranteed by standard IS arguments, since it is well known that Z^TZ\hat{Z}_{T}\to Z and I^TI\hat{I}_{T}\to I as MM\to\infty, or NN\to\infty .

Furthermore, for TT\rightarrow\infty and N,M<N,M<\infty, we have a convex combination, given in Eq. (31), of conditionally independent (consistent but biased) IS estimators . Indeed, although in an adaptive scheme the proposals depend on the previous configurations of the population, the samples drawn at each iteration are conditionally independent of the previous ones, and independent of each other drawn at the same iteration. The bias is due to unknown ZZ (see Eq. (4)), and hat Z^T\hat{Z}_{T} is used to replace ZZ. However, Z^TZ\hat{Z}_{T}\to Z as TT\to\infty, as discussed in [47, Chapter 14]: hence, I^T\hat{I}_{T} is asymptotically unbiased as TT\to\infty.

Appendix B Importance sampling with multiple proposals

There are at least two procedures to build a joint IS estimator: the standard multiple importance sampling (MIS) approach and the full deterministic mixture (DM-MIS) scheme.

The simplest approach [47, Chapter 14] is computing the classical IS weights:

with i=1,,M1i=1,\ldots,M_{1} and k=1,,M2k=1,\ldots,M_{2}. The IS estimator is then built by normalizing them jointly, i.e., computing

where Stot=i=1M1w1(i)+k=1M2w2(k)S_{tot}=\sum_{i=1}^{M_{1}}w_{1}^{(i)}+\sum_{k=1}^{M_{2}}w_{2}^{(k)}. For J>2J>2 proposal pdfs and xj(1),,xj(Mj)qj(x){\bf x}_{j}^{(1)},\ldots,{\bf x}_{j}^{(M_{j})}\sim q_{j}({\bf x}), for j=1,,Jj=1,\ldots,J, we have

In this case, Stot=n=1Jmj=1Mjwj(mj)S_{tot}=\sum_{n=1}^{J}\sum_{m_{j}=1}^{M_{j}}w_{j}^{(m_{j})}.

B.2 Deterministic mixture approach

An alternative approach is based on the deterministic mixture sampling idea . Considering N=2N=2 proposals q1q_{1}, q2q_{2}, and setting

In this case, the complete proposal is considered to be a mixture of q1q_{1} and q2q_{2}, weighted according to the number of samples drawn from each one. Note that, unlike in the standard procedure for sampling from a mixture, a deterministic and fixed number of samples are drawn from each proposal in the DM approach . It can be shown that the set Z\mathcal{Z} of samples drawn in this deterministic way is distributed according to the mixture q(z)=M1M1+M2q1(z)+M2M1+M2q2(z)q({\bf z})=\frac{M_{1}}{M_{1}+M_{2}}q_{1}({\bf z})+\frac{M_{2}}{M_{1}+M_{2}}q_{2}({\bf z}) [45, Chapter 9, Section 11]. The DM estimator is finally given by

where Stot=j=12mj=1Mjwj(mj)S_{tot}=\sum_{j=1}^{2}\sum_{m_{j}=1}^{M_{j}}w_{j}^{(m_{j})} and the wj(mj)w_{j}^{(m_{j})} are given by (44). For J>2J>2 proposal pdfs, the DM estimator can also be easily generalized:

with i=1,,Ji=1,\ldots,J, Mtot=M1+M2++MJM_{tot}=M_{1}+M_{2}+\ldots+M_{J} and Stot=j=1Jmj=1Mjwj(mj)S_{tot}=\sum_{j=1}^{J}\sum_{m_{j}=1}^{M_{j}}w_{j}^{(m_{j})}. On the one hand, the DM approach is more efficient than the IS method, thus providing a better performance in terms of a reduced variance of the corresponding estimator, as shown in the following section. On the other hand, it needs to evaluate every proposal MtotM_{tot} times instead of only MjM_{j} times (in the standard MIS procedure), and therefore is more costly from a computational point of view. However, this increased computational cost is negligible when the proposal is much cheaper to evaluate than the target, as it often happens in practical applications.

B.3 Convex combination of partial IS estimators

Regardless the type of weights employed in the IS scheme (either as in Eq. (42) or as in Eq. (44)), the resulting estimators can be written as convex combination of simpler ones. First of all, let us consider again the use of J=2J=2 proposals, q1q_{1} and q2q_{2}. We draw MjM_{j} samples from each one, xj(1),,xj(Mj)qj(x){\bf x}_{j}^{(1)},\ldots,{\bf x}_{j}^{(M_{j})}\sim q_{j}({\bf x}), with j{1,2}j\in\{1,2\}. The two partial sums of the weights corresponding only to the samples drawn from q1q_{1} and q2q_{2}, are given by S1=i=1M1w1(i)S_{1}=\sum_{i=1}^{M_{1}}w_{1}^{(i)} and S2=k=1M2w2(k)S_{2}=\sum_{k=1}^{M_{2}}w_{2}^{(k)}. The partial IS estimators, obtained by considering only one proposal pdf, are I^1=i=1M1wˉ1(i)f(x1(i))\hat{I}_{1}=\sum_{i=1}^{M_{1}}\bar{w}_{1}^{(i)}f({\bf x}_{1}^{(i)}) and I^2=k=1M2wˉ2(k)f(x2(k))\hat{I}_{2}=\sum_{k=1}^{M_{2}}\bar{w}_{2}^{(k)}f({\bf x}_{2}^{(k)}) where the normalized weights are wˉ1(i)=w1(i)S1\bar{w}_{1}^{(i)}=\frac{w_{1}^{(i)}}{S_{1}} and wˉ2(k)=w2(k)S2\bar{w}_{2}^{(k)}=\frac{w_{2}^{(k)}}{S_{2}}, respectively. The complete IS estimator, taking into account the M1+M2M_{1}+M_{2} samples jointly, is

This procedure can be easily extended for J>2J>2 different proposal pdfs, obtaining the complete estimator as the convex combination of the NN partial estimators:

where xj(1),,xj(Mj)qj(x){\bf x}_{j}^{(1)},\ldots,{\bf x}_{j}^{(M_{j})}\sim q_{j}({\bf x}), I^j=k=1Mjwj(k)f(xj(k))\hat{I}_{j}=\sum_{k=1}^{M_{j}}{w_{j}^{(k)}f({\bf x}_{j}^{(k)})}, Sj=k=1Mjwj(k)S_{j}=\sum_{k=1}^{M_{j}}{w_{j}^{(k)}} and Z^j=1Mjk=1Mjwj(k)\hat{Z}_{j}=\frac{1}{M_{j}}\sum_{k=1}^{M_{j}}w_{j}^{(k)}.

Appendix C Hierarchical interpretation of PMC

The standard Population Monte Carlo (PMC) method can be interpreted as using a hierarchical procedure. Although it is possible to recognize the two different layers, there are some differences w.r.t. the hierarchical procedure in Section 3. The first one is that in PMC the generation of μ{\bm{\mu}}’s is not independent of the previously generated x{\bf x}’s. The second one is that the prior is instead h(μ)=π^t(N)(μ)h({\bm{\mu}})=\hat{\pi}_{t}^{(N)}({\bm{\mu}}), where π^t(N)\hat{\pi}_{t}^{(N)} is an approximation of the measure of πˉ(μ)\bar{\pi}({\bm{\mu}}) obtained using the previously generated samples x{\bf x}’s (in the second level of the hierarchical approach). More specifically, a standard PMC method is an adaptive importance sampler using a population of proposals q1q_{1}, \ldots, qNq_{N}. PMC consists of the following steps, given an initial set, μ1,0{\bm{\mu}}_{1,0}, \ldots, μN,0{\bm{\mu}}_{N,0}, of mean vectors:

Draw xn,tqn,t(xμn,t,Cn){\bf x}_{n,t}\sim q_{n,t}({\bf x}|{\bm{\mu}}_{n,t},{\bf C}_{n}), for n=1,,Nn=1,\ldots,N.

Assign to each sample xn,t{\bf x}_{n,t} the weights,

Resampling: draw NN independent samples μn,t+1{\bm{\mu}}_{n,t+1}, n=1,,Nn=1,\ldots,N, according to the particle approximation

where we have denoted x1:N,t=[x1,t,,xN,t]{\bf x}_{1:N,t}=[{\bf x}_{1,t},\ldots,{\bf x}_{N,t}]^{\top}. Note that each μn,t+1{x1,t,,xN,t}{\bm{\mu}}_{n,t+1}\in\{{\bf x}_{1,t},\ldots,{\bf x}_{N,t}\}, for all nn.

Return all the pairs {xn,t,wn,t}\{{\bf x}_{n,t},w_{n,t}\}, n=1,,Nn=1,\ldots,N and t=0,,T1t=0,\ldots,T-1.

Fixing an iteration tt, the generating procedure used in one iteration of the standard PMC method can be cast in the hierarchical formulation:

Draw NN samples μ1,t,,μN,t{\bm{\mu}}_{1,t},\ldots,{\bm{\mu}}_{N,t} from π^t1(N)(μx1:N,t1)\hat{\pi}_{t-1}^{(N)}({\bm{\mu}}|{\bf x}_{1:N,t-1}).

Draw xn,tqn,t(xμn,t,Cn){\bf x}_{n,t}\sim q_{n,t}({\bf x}|{\bm{\mu}}_{n,t},{\bf C}_{n}), for n=1,,Nn=1,\ldots,N.

Note that π^t1(N)\hat{\pi}_{t-1}^{(N)} plays the role of the prior hh in the hierarchical scheme above. Differently from the novel proposed scheme, the two levels of hierarchical procedure are not independent since the pdf π^t(N)(μx1:N,t)\hat{\pi}_{t}^{(N)}({\bm{\mu}}|{\bf x}_{1:N,t}) depends on the samples drawn in the lower level. Furthermore, π^t(N)\hat{\pi}_{t}^{(N)} also varies with tt and NN, whereas in our procedure we consider a fixed prior hh. However, note that π^t(N)\hat{\pi}_{t}^{(N)} is an empirical measure approximation of πˉ{\bar{\pi}} that improves when NN grows. An equivalent formulation of the hierarchical scheme for PMC is given below, involving a probability of generating a new mean μ{\bm{\mu}} given the previous ones μ1:N,t1=[μ1,t1,,μN,t1],{\bm{\mu}}_{1:N,t-1}=[{\bm{\mu}}_{1,t-1},\ldots,{\bm{\mu}}_{N,t-1}]^{\top}, denoted as Kt(N)(μμ1:N,t1)K_{t}^{(N)}({\bm{\mu}}|{\bm{\mu}}_{1:N,t-1}).

Consider the tt-th iteration of PMC. Let us define as

the vector containing all the generated samples except for the nn-th. Let us also denote as μi,t+1{x1,t,xN,t}{\bm{\mu}}_{i,t+1}\in\{{\bf x}_{1,t}\ldots,{\bf x}_{N,t}\}, a generic mean vector, i.e. i{1,,N}i\in\{1,\ldots,N\} at the iteration t+1t+1, after applying one resampling step (i.e., a multinomial sampling according to the normalized weights). Hence, the distribution of μ{\bm{\mu}} given the previous means μ1:N,t1{\bm{\mu}}_{1:N,t-1} is

where π^t(N)(μx1:N,t)\hat{\pi}_{t}^{(N)}({\bm{\mu}}|{\bf x}_{1:N,t}) is given in Eq. (49). For simplicity, below we denote

Then, after some straightforward rearrangements, Eq. (50) can be rewritten as

where Z^=1Nn=1Nπ(xn)qn(xn)\hat{Z}=\frac{1}{N}\sum_{n=1}^{N}\frac{\pi({\bf x}_{n})}{q_{n}({\bf x}_{n})} is the estimate of the normalizing constant of the target obtained using the classical IS weights. The hierarchical formulation of PMC can be rewritten as:

Draw NN samples μ1,t,,μN,t{\bm{\mu}}_{1,t},\ldots,{\bm{\mu}}_{N,t} from Kt(N)(μμ1:N,t1)K_{t}^{(N)}({\bm{\mu}}|{\bm{\mu}}_{1:N,t-1}) in Eq. (50) or (51).

Draw xn,tqn,t(xμn,t,Cn){\bf x}_{n,t}\sim q_{n,t}({\bf x}|{\bm{\mu}}_{n,t},{\bf C}_{n}), for n=1,,Nn=1,\ldots,N.

When NN\to\infty, then Z^Z\hat{Z}\to Z , and thus Kt(N)(μμ1:N,t1)1Zπ(μ)=πˉ(μ)K_{t}^{(N)}({\bm{\mu}}|{\bm{\mu}}_{1:N,t-1})\to\frac{1}{Z}\pi({\bm{\mu}})=\bar{\pi}({\bm{\mu}}), for all t=1,Tt=1\ldots,T. Namely, when NN grows, the hierarchical scheme above tends to have h(μ)=πˉ(μ)h({\bm{\mu}})={\bar{\pi}}({\bm{\mu}}) as prior in the upper level. Figures 5 show three different examples of the conditional pdf Kt(N)K_{t}^{(N)} (obtained via numerical approximation) for a fixed tt and different N{2,20,1000}N\in\{2,20,1000\}. We can observe that Kt(N)K_{t}^{(N)} becomes closer to the target πˉ\bar{\pi} (depicted in solid line) as NN grows.

In the Markov adaptive importance sampling (MAIS) schemes described in Section 5, since we are using MCMC methods for drawing from h(μ)=πˉ(μ)h({\bm{\mu}})={\bar{\pi}}({\bm{\mu}}), actually we have also a current prior Kt(N)(μ1:N,tμ1:N,t1)K_{t}^{(N)}({\bm{\mu}}_{1:N,t}|{\bm{\mu}}_{1:N,t-1}), determined for the kernels of the considered MCMC algorithms. For instance, in PI-MAIS we have

where An(μn,tμn,t1)A_{n}({\bm{\mu}}_{n,t}|{\bm{\mu}}_{n,t-1}) is the kernel of the nn-th chain. Unlike in PMC, since we are using ergodic chains with invariant pdf πˉ{\bar{\pi}}, we know that Kt(N)(μ1:N,tμ1:N,t1)n=1Nπˉ(μn)K_{t}^{(N)}({\bm{\mu}}_{1:N,t}|{\bm{\mu}}_{1:N,t-1})\rightarrow\prod_{n=1}^{N}{\bar{\pi}}({\bm{\mu}}_{n}) for tt\rightarrow\infty, with a fixed NN. Whereas PMC requires to increase NN for obtaining the same result.