Layered Adaptive Importance Sampling
L. Martino, V. Elvira, D. Luengo, J. Corander
Introduction
Monte Carlo methods currently represent a maturing toolkit widely used throughout science and technology (Doucet:MCSignalProcessing2005, ; Robert04, ; Wang:MCWirelessComm2002, ). Importance sampling (IS) and Markov Chain Monte Carlo (MCMC) methods are well-known Monte Carlo (MC) techniques applied to compute integrals involving a high-dimensional target probability density function (pdf) . In both cases, the choice of a suitable proposal density is crucial for the success of the Monte Carlo based approximation. For this reason, the design of adaptive IS or MCMC schemes represents one of the most active research topics in this area, and several methods have been proposed in the literature (Cappe04, ; CORNUET12, ; Craiu09, ; Haario2001, ; Luengo13, ).
Since both IS and MCMC have certain intrinsic advantages and weaknesses, several attempts have been made to successfully marry the two approaches, producing hybrid techniques: IS-within-MCMC (Andrieu10, ; Brockwell10, ; Liesenfeld08, ; Liu00, ; Neal11, ) or MCMC-within-IS (Beaujean13, ; Botev13, ; Chopin02, ; EBEB14, ; Mendes15, ; Neal01, ; Yuan13, ). To set the scene for such developments it is useful to recall briefly some of the main strengths of IS and MCMC, respectively. For instance, one benefit of IS is that it delivers a straightforward estimate of the normalizing constant of (Liang10, ; Robert04, ) (a.k.a. evidence or marginal likelihood), which is essential for several applications (FRIEL11, ; Skilling06, ). In contrast, the estimation of the normalizing constant is highly challenging using MCMC methods, and several authors have investigated different approaches to overcome the obstacles related to the instability of the resulting estimators (Botev08, ; CALDWELL14, ; Chib01, ; FRIEL11, ; Weinberg10, ). Furthermore, the application and the theoretical analysis of an IS scheme using an adaptive proposal pdf is easier than the theoretical analysis of the corresponding adaptive MCMC scheme, which is much more delicate (Andrieu08, ).
On the other hand, an appealing feature of MCMC algorithms is their explorative behavior. For instance, the proposal function can depend on the previous state of the chain and foster movements between different regions of the target density. For this reason, MCMC methods are usually preferred when no detailed information about the target is available, especially in large dimensional spaces (Andrieu:MCMachineLearning2003, ; Fitzgerald01, ). Moreover, in order to amplify their explorative behavior several parallel MCMC chains can be run simultaneously (Robert04, ; Liang10, ). This strategy facilitates the exploration of the state space, although at the expense of an increase in the computational cost. Several schemes have been introduced to share information among the different chains (Craiu09, ; O-MCMC, ; Smelly, ), which further improves the overall convergence.
The main contribution of this work is the description and analysis of a hierarchical proposal procedure for generating samples, which can then be employed within any Monte Carlo algorithm. In this hierarchical scheme, we consider two conditionally independent levels: the upper level is used to generate mean vectors for the proposal pdfs, which are then used in the lower level to draw candidate samples according to some MC scheme. We show that the standard Population Monte Carlo (PMC) method (Cappe04, ) can be interpreted as applying implicitly this hierarchical procedure.
The second major contribution of this work is providing a general framework for multiple importance sampling (MIS) schemes and their iterative adaptive versions. We discuss several alternative applications of the so-called deterministic approach (ElviraMIS15, ; Owen00, ; Veach95, ) for sampling a mixture of pdfs. This general framework includes different MIS schemes used within adaptive importance sampling (AIS) techniques already proposed in literature, such as the standard PMC (Cappe04, ), the adaptive multiple importance sampling (AMIS) (CORNUET12, ; Marin12, ), and the adaptive population importance sampling (APIS) (APIS15, ).
Finally, we combine the general MIS framework with the hierarchical procedure for generating samples, introducing a new class of AIS algorithms. More specifically, one or several MCMC chains are used for driving an underlying MIS scheme. Each algorithm differs from the others in the specific Markov adaptation employed and the particular MIS technique applied for yielding the final Monte Carlo estimators. This novel class of algorithms efficiently combines the main strengths of the IS and the MCMC methods, since it maintains an explorative behavior (as in MCMC) and can still easily estimate the normalizing constant (as in IS).
We describe in detail the simplest possible algorithm of this class, called random walk importance sampling. Moreover, we introduce two additional population-based variants that provide a good trade-off between performance and computational cost. In the first variant, the mean vectors are updated according to independent parallel MCMC chains. In the other one, an interacting adaptive strategy is applied. In both cases, all the adapted proposal pdfs collaborate to yield a single global IS estimator. One of the proposed algorithms, called parallel interacting Markov adaptive importance sampling (PI-MAIS), can be interpreted as parallel MCMC chains cooperating to produce a single global estimator, since the chains exchange statistical information to achieve a common purpose.
The rest of the paper is organized as follows. Section 2 is devoted to the problem statement. The hierarchical proposal procedure is then introduced in Section 3. In Section 4, we describe a general framework for importance sampling schemes using a population of proposal pdfs, whereas Section 5 introduces the adaptation procedure for the mean vectors of these proposal pdfs. Numerical examples are provided in Section 6, including comparisons with several benchmark techniques. Different scenarios have been considered: a multimodal distribution, a nonlinear banana-shaped target, a high-dimensional example, and a localization problem in a wireless sensor network. Finally, Section 7 contains some brief final remarks.
Target distribution and related integrals
Our goal is computing efficiently some integral measure w.r.t. the target pdf,
and is any square-integrable function (w.r.t. ) of .Note that, as both and depend on the observations , the use of and would be more precise. However, since the observations are fixed, in the sequel we remove the dependence on to simplify the notation. In this work, we address the problem of approximating and via Monte Carlo methods. Since drawing directly from is impossible in many applications, Monte Carlo techniques use a simpler proposal density to generate random candidates, testing or weighting them according to some suitable rule. Indeed, throughout the paper we focus on the combined use of several proposal pdfs, denoted as .
Hierarchical procedure for proposal generation
The performance of MC methods depends on the discrepancy between the target, , and the proposal . Namely, the performance improves if is more similar (i.e., closer) to . In general, tuning the parameters of the proposal is a difficult task that requires statistical information of the target distribution. In this section, we deal with this important issue, focusing on the mean vector of the proposal pdf. More specifically, we consider a proposal pdf defined by a mean vector and covariance matrix , denoted as . We propose the following hierarchical procedure for generating a set of samples that will be employed afterwards within some Monte Carlo technique:
Draw a mean vector .
Draw for .
Use all the generated samples, for and , as candidates within some Monte Carlo method.
Note that plays the role of a prior pdf over the mean vector of in this approach. Hence, the pdf of each sample can be expressed as
i.e., the hierarchical procedure is equivalent to drawing directly for all and . The density is thus the equivalent proposal density of the whole hierarchical generating procedure. Note also that the samples are not directly used by the Monte Carlo estimator, since only the samples , for , , enter the actual estimator. Hence, the computational cost per iteration of this hierarchical procedure is higher than the cost of a standard approach, However, it leads to substantial computational savings in terms of improved convergence towards the target, and thus a reduced number of iterations required, as shown later in the simulations. Furthermore, note that the generation of the ’s in the upper level is independent of the samples drawn in the lower level, thus facilitating the theoretical analysis of the resulting algorithms, as discussed in Section 5.1.Note that, in the ideal case described here, each is also independent of the other ’s. However, in the rest of this work, we also consider cases where correlation among the mean vectors () is introduced.
Assuming that the parametric form of and its covariance matrix are fixed, we consider the problem of finding the optimal prior over the mean vector . Note that, since , we can write
regardless of the choice of the prior over the mean vectors in the upper level. The desirable scenario is to have the equivalent proposal coinciding exactly with the target ,Given a function , the optimal proposal minimizing the variance of the IS estimator is . However, in practical applications, we are often interested in computing expectations w.r.t. several ’s. In this context, a more appropriate strategy is to minimize the variance of the importance weights. In this case, the minimum variance is attained when Doucet08tut . i.e.,
where represents the optimal prior.
2 Asymptotically optimal choice of the prior h(𝝁)ℎ𝝁h({\bm{\mu}})
Since Eq. (7) cannot be solved analytically in general, in this section we relax that condition and look for an equivalent proposal which fulfills (7) asymptotically as . For the sake of simplicity, let us set . Thus, we consider the generation of samples , drawn using the following hierarchical procedure:
Draw a mean vector .
Draw .
Note that we are using different proposal pdfs,
to draw , with each being drawn from the -th proposal . However, if the samples are used altogether regardless of their order, then it can interpreted that they have been drawn from the following mixture using the deterministic mixture sampling scheme (see (mcbookOwen, , Chapter 9), ElviraMIS15 ):
Note that, since , then is a Monte Carlo approximation of the integral in Eq. (7), i.e.,
Furthermore, if we choose , i.e., , then is also a kernel density estimator of , where the play the role of the kernel functions Wand94 . In general, this estimator has non-zero bias and variance, depending on the choice of , and the number of samples . However, for a given value of , there exists an optimal choice of which provides the minimum Mean Integrated Square Error (MISE) estimator Wand94 . Using the optimal covariance matrix , it can be proved
pointwise as Wand94 . Hence, the equivalent proposal density of the hierarchical approach converges to the target when . It is possible to show as , so that there is no contradiction between (9) and (10) since becomes increasingly similar to , and thus as .
3 Practical implementation
As explained in Section 3.2, is a suitable choice from a kernel density estimation point of view. However, sampling directly from is unfeasible from a practical point of view (otherwise, we would not require any MC algorithm). Therefore, we propose applying another sampling method, such as an MCMC algorithm, to obtain the samples . More specifically, starting from an initial , we generate a sequence
where is the kernel of the MCMC technique used. With the choice , the two levels of the sampler play different roles:
The upper level attends the need for exploration of the state space, providing .
The lower level is devoted to the approximation of local features of the the target, using .
In general, the two levels require their own tuning of the parameters of the corresponding proposals.
4 Relationship with other adaptive MC schemes
In contrast to the hierarchical approach described previously, in standard adaptive MC approaches Bugallo15 ; Haario2001 ; Luengo13 the parameter is determined by a deterministic function,
of the previously generated samples (assuming to generate samples from each proposal),
Although is a deterministic function, the sequence is generated according to a conditional pdf, , since is random. Unlike in the hierarchical scheme, in standard adaptive MC approaches, the sequence typically converges to a fixed vector.
In the standard PMC method (Cappe04, ) the sequence of mean vectors ’s is also generated depending on the previous ’s but, in this case, the final distribution is unknown and it is not a fixed vector, in general (for further details see Appendix C). Similar considerations also apply for Sequential Monte Carlo (SMC) schemes Moral06 ; Fearnhead13 ; Schafer13 where the adaptation is performed using a combination of resampling and MCMC steps. Other interesting and related techniques are the Particle MCMC (P-MCMC) (Andrieu10, ) and the Sequentially Interacting MCMC (SI-MCMC) (Brockwell10, ) methods. In this case, IS approximations of the target are used to build better proposal pdfs, employed within MCMC steps. Both methods are also able to provide efficient estimators of . However, unlike in PMC, SMC, P-MCMC and SI-MCMC, in the proposed hierarchical approach each is always chosen independently of and it is distributed according to , decided in advance by the user. Moreover, the means are not involved in the resulting estimators. Related observations are provided in Section 5.1 and Table 5.
Generalized Multiple Importance Sampling
So far, we have introduced a hierarchical procedure to generate candidates for an MC technique, adapting the mean vectors of a set of proposal densities. In this section, we provide a general framework for multiple importance sampling (MIS) techniques using a population of proposal densities, which embeds various alternative schemes proposed in the literature ElviraMIS15 . First, we consider several alternatives of static MIS, and then we focus on the corresponding adaptive MIS samplers.
As we have already highlighted, finding a good proposal pdf, , is critical and is in general very challenging (Owen00, ). An alternative strategy consists in using a population of proposal pdfs. This approach is often known in the literature as multiple importance sampling (MIS) (mcbookOwen, ; Owen00, ; Veach95, ; ElviraMIS15, ). Consider a set of proposal pdfs,
with heavier tails than the target , and let us assume that samples are drawn from each of them, i.e.,
In this scenario, the weights associated to the samples can be obtained following at least one of these two strategies:
Deterministic mixture MIS (DM-MIS) (Owen00, ; Veach95, ):
for and , and where is the mixture pdf, composed of all the proposal pdfs. This approach is based on the considerations provided in Appendix B.
In both cases, the consistency of the estimators is ensured ElviraMIS15 . The main advantage of the DM-MIS weights is that they yield more efficient estimators than using the standard importance weights (CORNUET12, ; Owen00, ; LetterVictor, ; APIS15, ). However, the DM-MIS estimator is computationally more expensive, as it requires total evaluations for each proposal instead of just , for computing all the weights. The number of evaluations of the target is the same regardless of whether the weights are calculated according to Eq. (12) or (13), so this increase in computational cost may not be relevant in many applications. However, in some other cases this additional computational load may be excessive (especially for large values of ) and alternative efficient solutions are desirable. For instance, the use of partial mixtures has been proposed in (LetterVictor, ):
All the previous cases can be captured by a generic mixture-proposal , under which the MIS weights can be defined as
with , where in Eq. (12), in Eq. (13), and
in Eq. (15). In any case, the weights are always normalized as
Table 1 shows these three choices of , whereas Table 2 summarizes a generalized static MIS procedure.
Note that the IS estimator of a specific moment of , i.e., the integral given in Eq. (3), and the approximation of the normalizing constant in Eq. (4), can now be approximated as
Then, the particle approximation of the measure of is given by
In Section 4.2, we describe a framework where a partial grouping of the proposal pdfs arises naturally from the sampler’s definition.
2 Generalized Adaptive Multiple Importance Sampling
for the -th proposal (see Figure 1). At the -th iteration, the adaptation procedure takes into account statistical information about the target distribution gathered in the previous iterations, , using one of the many procedures that have been proposed in the literature (pmc-cappe08, ; Cappe04, ; CORNUET12, ; APIS15, ). Furthermore, at the -th iteration, samples are drawn from each proposal ,
and . An importance weight is then assigned to each sample . Several strategies can be applied to build considering the different MIS approaches, as discussed in the previous section. Figure 1 provides a graphical representation of this scenario, by showing both the spatial and temporal evolution of the proposal pdfs.
is associated to each sample . In a standard MIS approach, the function employed in the denominator is
In the complete DM-MIS case, the function is
This case corresponds to the external blue rectangle in Fig. 1. Two natural alternatives of partial DM-MIS schemes appear in this scenario. The first one uses the following partial mixture
with , in the denominator of the IS weight. Namely, we consider the temporal evolution of the -th single proposal . Hence, we have mixtures, each one formed by components (horizontal red rectangle in Fig. 1). The other possibility is considering the mixture of all the ’s at the -th iteration, i.e.,
for , so that we have mixtures, each one formed by components (vertical green rectangle in Fig. 1). The function in Eq. (23) is used in the standard PMC scheme (Cappe04, ); Eq. (25) with has been considered in adaptive multiple importance sampling (AMIS) (CORNUET12, ). Eq. (26) has been applied in the adaptive population importance sampling (APIS) algorithm (APIS15, ), whereas in other techniques, such as Mixture PMC (pmc-cappe08, ; Douc07a, ; Douc07b, ), a similar strategy is employed but using a standard sampling of the mixture .
Note that, using and , the computational cost per iteration increases as the total number of iterations grows. Indeed, at the -th iteration all the previous proposals (for all ) must be evaluated at all the new samples . Hence, algorithms based on these proposals quickly become unfeasible as the number of iterations grows. On the other hand, using the computational cost per iteration is controlled by , remaining invariant regardless of the number of adaptive steps performed.
Observe also that a suitable AIS scheme builds iteratively a global IS estimator which uses the normalized weights
for , , and .
Table 4 shows an iterative version of GAMIS. We remark that, at the -th iteration, the weights of the samples previously generated need to be recalculated, as shown in step 2(c-3) of Table 4. The choices or allow avoiding completely this re-computation step of the weights. For simplicity, in Table 4 we have provided the output of the algorithms as weighted samples, i.e., all the pairs . However, the output can be equivalently expressed as an estimator of a specific moment of the target. In this case, the final IS estimators and are
where . Moreover, the final particle approximation is
The estimators in Eq. (29) can be expressed recursively, thus providing an estimate at each iteration , as stated before. Starting with , , and setting and , we have
where is the partial IS estimator using only the samples drawn at the -th iteration. Therefore, can be seen as a convex combination of the two IS estimators and (for further explanations see Eqs. (46)-(47) in Appendix B.3). Finally, note that
A brief discussion about the consistency of and is provided in Appendix A.
Markov adaptation for GAMIS
In this section, we design efficient adaptive importance sampling (AIS) techniques by combining the main ideas discussed in the two previous sections. More specifically, we apply the hierarchical MC approach to adapt the proposal pdfs within a GAMIS scheme. Therefore, a Markov GAMIS technique, or simply Markov Adaptive Importance Sampling (MAIS) algorithm, consists of the following two layers:
Upper level (Adaptation): Given the set of mean vectors,
obtain the new set according to MCMC transitions with as invariant density. More specifically, a kernel leaving invariant the distribution is applied.
Lower level (MIS estimator): Given the population of proposals,
choose a function for the computation of the weights in Eq. (22), and perform a MIS approximation of the target as described in Section 4.2.
The motivation behind the MCMC adaptation has been described in Section 3.2 and 3.3: the functions , located at the ’s, jointly provide a kernel estimate of the target .
Furthermore, we recall that the generation of the means, , is completely independent from the samples drawn in the lower level. This is a key point from a theoretical and practical point of view. Indeed, the generic MAIS algorithm can be divided in two steps: (a) first generate all the means for , (b) then perform the MIS estimation considering all the proposals , and . Namely, any MAIS technique can be converted into a generalized static MIS scheme (see Section 4.1). As a consequence, the unique conditions required for ensuring the consistency of the corresponding estimators are ElviraMIS15 ; Robert04 :
All the proposal pdfs, , must have heavier tails than the target .
A suitable function for the denominator of the importance weights must be chosen. Namely, the use of provides consistent estimators ElviraMIS15 , like the functions described in Section 4.2.
Moreover, the independence of the upper level from the lower level of the hierarchical approach, helps the parallelization of the algorithms as we discuss later.
Table 5 compares different AIS schemes. In the standard AIS method Bugallo15 , the sequence of converges to a unknown fixed vector as . In the standard PMC algorithm Cappe04 , the limiting distribution of is unknown. Furthermore, in both cases, standard AIS and PMC, the adaptation depends on the previously generated samples ’s. In MAIS techniques, the use of an ergodic chain (with invariant pdf ) for generating the -th mean vector ensures that its asymptotic density is .
2 The new class of algorithms
Markov GAMIS framework can lead to many different algorithms, depending on the MCMC strategy used to update the mean vectors and the specific choice of the function . Table 6 provides several examples of novel techniques determined by the value of , the choice of , and the type of MCMC adaptation. Some of them are variants of well-known techniques like PMC (Cappe04, ) and AMIS (CORNUET12, ), where the Markov adaptation procedure is employed. Others, such as the Random Walk Importance Sampling (RWIS), the Parallel Interacting Markov Adaptive Importance Sampling (PI-MAIS) and Doubly Interacting Markov Adaptive Importance Sampling (I2-MAIS), are described below in detail. For these completely novel algorithms we have set , so that the computational cost is directly controlled by and the re-weighting step 2(c-3) in Table 4 is not required.
RWIS is the simplest possible Markov GAMIS algorithm. Specifically, for the MCMC adaptation we consider a standard MH technique, setting and choosing (since , the two cases coincide). Table 7 shows the RWIS algorithm, which is a special case of the more general scheme described in Table 8 when . Note that we have a proposal pdf used for the MH adaptation, , which is different from the proposal pdf used for the IS estimation, .
3 Population-based algorithms
The RWIS technique can be easily extended by using a population of proposal pdfs. In this case, we choose
so that the computational cost of evaluating the mixture depends only on , regardless of the number of iterations. Moreover, step 2(c-3) in Table 4 is not required in this case. Table 8 describes the corresponding algorithm without specifying the MCMC approach used for generating the population of means, , given .
Two possible adaptation procedures via MCMC are discussed below. In the first one, we consider independent parallel chains for updating the mean vectors. We refer to this method as Parallel Interacting Markov Adaptive Importance Sampling (PI-MAIS). Although PI-MAIS is parallelizable, in the iterative version of Table 8 the independent processes cooperate together in Eq. (36) to provide unique global IS estimate. In the second adaptation scheme, we introduce the interaction also in the upper level. Hence, we refer to this method as Doubly Interacting Markov Adaptive Importance Sampling (I2-MAIS). In both cases, the corresponding technique provides an IS approximation of the target or, equivalently, the estimators and in Eq. (29), using samples.
The simplest option is applying one iteration of parallel MCMC chains, one for each returning , for . For instance, given parallel MH transitions, each one employing (possibly) a different proposal pdf with covariance matrix , we have: For :
Draw .
Set with probability
otherwise set (with probability ).
Figure 2(a) illustrates this scenario. Each mean vector is updated independently from the rest. Therefore, in PI-MAIS, the interaction among the different processes occurs only in the underlying IS layer of the hierarchical structure: the importance weights in Eq. (36) are built using the partial DM-MIS strategy with . Considerations about the parallelization of PI-MAIS are given in Section 5.5.
3.2 MCMC adaptation for I2-MAIS
Thus, one transition is formed by the following steps:
Draw , where .
Set with probability
otherwise set (with probability ).
At each iteration, new samples are drawn (as in PI-MAIS) and therefore new evaluations of are required (i.e., one evaluation of ). When a new is accepted, all the components of differ from , unlike in the strategy described later. However, the probability of accepting a new population becomes very small for large values of . Sample Metropolis-Hastings (SMH) algorithm SMH is a population-based MCMC technique, suitable for our purposes (Liang10, , Chapter 5). At each iteration , given the previous set
a new possible parameter , drawn from an independent proposal , is tested to be interchanged with another parameter in . The underlying idea of SMH is to replace one “bad” sample in the population with a potentially “better” one, according to a certain suitable probability . The algorithm is designed so that, after a burn-in period, the elements in are distributed according to . One iteration of SMH consists of the following steps:
Draw a candidate .
Choose a “bad” sample, with , from the population according to a probability proportional to , which corresponds to the inverse of the standard IS weights.
Accept the new population, with for all and , with probability
Otherwise, set .
Draw from a proposal pdf .
Set with probability
otherwise set .
This scenario is illustrated in Fig. 2(d). In this case, after iterations of the I2-MAIS scheme, we generate a unique MH chain with total states, divided in parts of states. At each iteration of the I2-MAIS scheme, each block of states is employed as mean vector of the proposal pdfs used in the lower level.
4 Computational cost: comparison between PI-MAIS and I2-MAIS
In all cases, the total number of samples involved in the final estimation is . The total number of evaluations of the target, , is larger due to the MCMC implementation, i.e., . More precisely, the total number of evaluations of the target is:
, for I2-MAIS with the MH-within-Gibbs approach.
Note that we have taken into account that several evaluations of the target have been computed in the previous iterations. Moreover, the application of the MCMC techniques requires generation of additional uniform r.v.’s for performing the acceptance tests (and additional r.v.’s for choosing a “bad” candidate in SMH). Specifically, we need: uniform r.v.’s in PI-MAIS and I2-MAIS with MH-within-Gibbs, uniform r.v.’s for I2-MAIS with MH in the extended space, and, uniform r.v. and multinomial r.v., for I2-MAIS with SMH. However, in practical applications, the main computational effort is usually required for the target evaluation. The computing time required in the multinomial sampling within SMH increases with . Finally, we recall that we have used a deterministic mixture weighting scheme with , which requires evaluations of the proposal pdfs, , for and .
5 Non-iterative and parallel implementations
As remarked in Section 5.1, the choice of the means ’s is completely independent from the estimation steps. Thus, all the means can be selected in advance (also in parallel if the strategy in Section 5.3.1 is used), and the MIS estimation steps can then be performed as in a completely static framework (i.e., as described in Section 4.1). This consideration is valid for any choice of .
Let us consider now the choice of ’s as temporal mixtures, i.e., or . Moreover, let us consider the use of parallel MCMC chains for adapting the means, i.e., one independent chain for each parameter , with . In this case, the corresponding algorithm is completely parallelizable. Indeed, it can be decomposed into parallel MAIS techniques, each one producing the partial estimators and , after iterations. The global estimators are then given by
Furthermore, different strategies for sharing information among the parallel chains can also be applied (Craiu09, ; O-MCMC, ; Smelly, ; Geyer91, ; Marinari92, ; Neal01, ), or for reducing the total number of evaluations of the target Jacob11 (the scheme in Jacob11 can be applied if a unique independent proposal is employed, i.e., for all ).
Numerical simulations
In this section, we test the performance of the proposed scheme comparing them with other benchmark techniques. First of all, we tackle two challenging issues for adaptive Monte Carlo methods: multimodality in Section 6.1 and nonlinearity in Section 6.2. Furthermore, in Section 6.4 we consider an application of positioning and tuning model parameters in a wireless sensor network (Ali07, ; Ihler05, ; MartinoStatCo10, ).
In this section, we test the novel proposed algorithms in a multimodal scenario, comparing with several other methods. Specifically, we consider a bivariate multimodal target pdf, which is itself a mixture of Gaussians, i.e.,
with means , , , , , and covariance matrices , , , and . The main challenge in this example is the ability in discovering the different modes of . Since we know the moments of , we can easily assess the performance of the different techniques.
Given a random variable (r.v.) , we consider the problem of approximating via Monte Carlo the expected value and the normalizing constant . Note that an adequate approximation of requires the ability of learning about all the modes. We compare the performances of different sampling algorithms in terms of Mean Square Error (MSE): (a) the AMIS technique (CORNUET12, ), (b) three different PMC schemesThe standard PMC method (Cappe04, ) is described in Section C., two of them proposed in (pmc-cappe08, ; Cappe04, ) and one PMC using a partial DM-MIS scheme with , (c) parallel independent MCMC chains and (d) the proposed PI-MAIS method. Moreover, we test two static MIS approaches, the standard MIS and a partial DM-MIS schemes with , computing iteratively the final estimator.
For a fair comparison, all the mentioned algorithms have been implemented in such a way that the number of total evaluations of the target is . All the involved proposal densities are Gaussian pdfs. More specifically, in PI-MAIS, we use the following parameters: , , in order to fulfill (see Section 5.4). The proposal densities of the upper level of the hierarchical approach, , are Gaussian pdfs with covariance matrices and . The proposal densities used in the lower importance sampling level, are Gaussian pdfs with covariance matrices and . We also try different non-isotropic diagonal covariance matrices in both levels, i.e, , where , and , where for and . We test all these techniques using two different initializations: first, we choose deliberately a “bad” initialization of the initial mean vectors, denoted as In1, in the sense that the initialization region does not contain the modes of . Thus, we can test the robustness of the algorithms and their ability to improve the corresponding static approaches. Specifically, the initial mean vectors are selected uniformly within the following square
for . Different examples of this configuration are shown in Fig. 3 with squares. Secondly, we also consider a better initialization, denoted as In2, where the initialization region contains all the modes. Specifically, the initial mean vectors are selected uniformly within the following square
for . All the results are averaged over independent experiments. Tables 9 and 10 show the Mean Square Error (MSE) in the estimation of the first component of , with the initialization In1 and In2 respectively. Table 11 provides the MSE in the estimation of with In1. The best results in each column are highlighted in bold-face. In AMIS (CORNUET12, ), the mean vector and the covariance matrix of a single proposal (i.e., ) are adapted, using in the computation of the IS weights. Hence, in AMIS, we have tested different values of samples per iterations and . For the sake of simplicity, we directly show the worst and best results among the several simulations made with different parameters. PI-MAIS outperforms the other algorithms virtually for all the choices of the parameters, with both initializations. In general, a greater value of is needed since the proposal pdfs are initially bad localized. Moreover, PI-MAIS always improves the performance of the static approaches. These two consideration show the benefit of the Markov adaptation. Hence, PI-MAIS presents more robustness with respect to the initial values and the choice of the covariance matrices. Figure 6(a) providing a summary of the results in Table 9 showing the as function of the , for the main compared methods. Figure 3 depicts the initial (squares) and final (circles) configurations of the mean vectors of the proposal densities for the standard PMC and the PI-MAIS methods, in a specific run and different values of . In both cases, PI-MAIS guarantees a better covering of the modes of .
2 Nonlinear banana-shaped target distribution
Here we consider a bi-dimensional “banana-shaped” target distribution (Haario2001, ), which is a benchmark function in the literature due to its nonlinear nature. Mathematically, it is expressed as
where, we have set , , , and . The goal is to estimate the expected value , where , by applying different Monte Carlo approximations. We approximately compute the true value using an exhaustive deterministic numerical method (with an extremely thin grid), in order to obtain the mean square error (MSE) of the following methods: standard PMC (Cappe04, ), the Mixture PMC (pmc-cappe08, ), the AMIS (CORNUET12, ), PI-MAIS and I2-MAIS with SMH adaptation.
We consider Gaussian proposal distributions for all the algorithms. The initialization has been performed by randomly drawing the parameters of the Gaussians, with the mean of the -th proposal given by , and its covariance matrix given by . We have considered two cases: an isotropic setting where with , and an anisotropic case with random selection of the parameters where , with . Recall that in AMIS and Mixture PMC, the covariance matrices are also adapted.
For each algorithm, we test several combinations of parameters, keeping fixed the total number of target evaluations, . In the standard PMC method, described in Section C), we consider and (here ). In Mixture PMC, we consider different number of component in the mixture proposal pdf , and different samples per proposal at each iteration (here ). In AMIS, we test and (we recall ). The range of these values of parameters are chosen, after a preliminary study, in order to obtain the best performance from each technique. In PI-MAIS an I2-MAIS, we set . For the adaptation in PI-MAIS, we also consider Gaussian pdfs , covariance matrices with . In I2-MAIS, for the SMH method we use a Gaussian pdf with mean and covariance matrix and again . We test for both, so that for PI-MAIS and for I2-MAIS (see Section 5.4).
The results are averaged over independent simulations, for each combination of parameters. Table 12 shows the smallest and highest MSE values obtained in the estimation of the expected value of the target, averaged between the two components of , achieved by the different methods. The smallest MSEs in each column (each ) are highlighted in bold-face. PI-MAIS and I2-MAIS outperform the other techniques virtually for all the values of . In this example, AMIS also provides good results. Figure 7 show a graphical representation of the results in Table 12, with the exception of the last column.
Fig. 4 displays the initial (squares) and final (circles) configurations of the mean vectors of the proposals for the different algorithms, in one specific run. Since in Mixture PMC and AMIS the covariance matrices are also adapted, we show the shape of some proposals as ellipses representing approximately of probability mass. For, PMC we also depict a last resampling output with triangles, in order to show the loss in diversity. Unlike PMC, PI-MAIS ensures a better covering of the region of high probability.
3 High dimensional target distribution
Let us consider again a mixture of isotropic Gaussians as target pdf, i.e.,
where , and for , with being the identity matrix. We set , , for all , and for all . The expected value of the target is then for . In order to study the performance of the proposed scheme as the dimension of the state space increases, we vary the dimension of the state space in Eq. (40) testing different values of (with ).
We consider the problem of approximating via Monte Carlo the expected value of the target density, and we compare the performance of different methods: (a) the standard PMC scheme (Cappe04, ), (b) parallel independent MH chains (Par-MH), (c) a standard Sequential Monte Carlo (SMC) scheme (Moral06, ) and (d) the proposed PI-MAIS method. We test the algorithms with . All the proposal pdfs involved in the experiments are Gaussians, with the same covariance matrices for all the techniques. The initial mean vectors in all techniques are selected randomly and independently as for .
Again, all the mentioned algorithms have been implemented in such a way that the number of total evaluations of the target is . More specifically, in PI-MAIS, we use two sets of parameters: with , , , and with , , in order to fulfill (see Section 5.4). The proposal pdf of the upper level of the hierarchical approach, , are Gaussian pdfs with covariance matrices and . The proposal pdfs used in the lower importance sampling level, are Gaussian pdfs with covariance matrices again with (for a fair comparison with the other techniques). In PMC, Par-MH and SMC we use the same proposals with the same covariances and initial parameters. As described in App. C, in PMC the adaptation is carried out by resampling steps, in SMC an alternation of resampling and MH steps is performed whereas, in Par-MH, independent MH chains are carried out.
The results are averaged over independent simulations. Fig. 8 shows the log-MSE in the estimation of as a function of the dimension of the state-space. Fig. 8(a) compares the algorithms with proposal pdfs, whereas in Fig. 8(b) we have , keeping fixed the number of total evaluations of the target . We observe, as expected, the performance of all the methods degenerate as the dimension of the problem, increases, since we maintain fixed the computational cost . PI-MAIS always provides the best results, with the exception for the cases corresponding to and where SMC obtains a lower MSE (for and , they provide virtually the same MSE).
4 Localization problem in a wireless sensor network
where are independent Gaussian variables with identical pdfs, , . We also consider a prior density over , i.e., where is if and otherwise. The parameter is also unknown and we again consider a Gaussian prior . Moreover, we also apply Gaussian priors over , i.e., with . Thus, the posterior pdf is
Our goal is computing the expected value of
via Monte Carlo, in order to provide an estimate of the position of the target, the parameter and the standard deviation of the noise in the system. We apply PI-MAIS and three different PMC schemes (see example in Section 6.1, for a description), all using Gaussian proposals. We initialize the mean vectors so that they are randomly spread within the space of the variables of interest, i.e.,
and the covariance matrices with . The values of the standard deviations are chosen randomly for each Gaussian pdf. Specifically, , , where we have considered three possible values for , i.e., . For the adaptation process of PI-MAIS, we consider also Gaussian proposals with covariance matrices and . We also try different non-isotropic diagonal covariance matrices, i.e, , where .
For a fair comparison, all the techniques have been simulated with sets of parameters that yield the same number of target evaluations, fixed to . In PI-MAIS, we have chosen parameters , , . The PMC algorithms has been simulated with and . The MSE of the different estimators (averaged over independent runs) are provided in Table 13 and the in Figure 6(b). PI-MAIS outperforms always PMC when and whereas PMC provides better results for . Therefore, the results show jointly the robustness and flexibility of the proposed PI-MAIS technique.
Conclusions
In this work, we have introduced a layered (i.e., hierarchical) framework for designing adaptive Monte Carlo methods. In general terms, we have shown that such a hierarchical interpretation lies behind the good performance of two well-known algorithms; a random walk proposal within an MH scheme and the standard PMC method. Furthermore, we have used this approach to introduce a novel class of adaptive importance sampling (AIS) schemes. The novel class of AIS algorithms employs the determinist mixture (DM) idea (Owen00, ; Veach95, ) in order to reduce the variance of the resulting IS estimators. We have extended the use of the DM strategy with respect to other algorithms available in the literature, providing a more general and flexible framework. From an estimation perspective, this framework includes different schemes proposed in literature (CORNUET12, ; APIS15, ) as special cases, although they differ to an extent in terms of the employed adaptation procedure. Our framework also contains several other sampling schemes considering full or partial DM approaches. Finally, we have discussed several aspects of the trade-offs in terms of the computational cost and advantages due to improved accuracy of the resulting estimators. Numerical comparisons with different algorithms on benchmark models have confirmed the benefit of the layered adaptive sampling approaches.
Acknowledgements
This work has been supported by the projects COMONSENS (CSD2008 00010), ALCIT (TEC2012 38800C03 01), DISSECT (TEC2012 38058 C03 01), OTOSiS (TEC 2013 41718 R), and COMPREHENSION (TEC 2012 38883 C02 01), by the BBVA Foundation with ”I Convocatoria de Ayudas Fundaci n BBVA a Investigadores, Innovadores y Creadores Culturales”- MG FIAR project, by the ERC grant 239784 and AoF grant 251170, and by the European Union 7th Framework Programme through the Marie Curie Initial Training Network “Machine Learning for Personalized Medicine” MLPM2012, Grant No. 316861.
References
Appendix A Consistency of GAMIS estimators
First of all, we remark that the complete analysis should take in account the chosen adaptive procedure since, in general, the adaptation uses the information of previous weighted samples. However, in this work we consider an adaption procedure completely independent of the estimation steps, as clarified in Sections 3.4-5.1. This simplifies substantially the analysis as described in Section 5.1.
The consistency of the global estimators in Eq. (29) provided by GAMIS can be considered when number of samples per time step () and the number of iterations of the algorithm () grow to infinity. For some exhaustive studies of specific cases, see the analysis in and . Here we provide some brief arguments for explaining why and obtained by a GAMIS scheme are, in general, consistent. Let us assume that ’s have heavier tails than . Note that the global estimator can be seen as a result of a static batch MIS estimator involving different mixture-proposals and total number of samples. The weights built using in the denominator of the IS ratio are suitable importance weights yielding consistent estimators, as explained in detail in AppendixB. Hence, for a finite number of iterations , when (or ), the consistency can be guaranteed by standard IS arguments, since it is well known that and as , or .
Furthermore, for and , we have a convex combination, given in Eq. (31), of conditionally independent (consistent but biased) IS estimators . Indeed, although in an adaptive scheme the proposals depend on the previous configurations of the population, the samples drawn at each iteration are conditionally independent of the previous ones, and independent of each other drawn at the same iteration. The bias is due to unknown (see Eq. (4)), and hat is used to replace . However, as , as discussed in [47, Chapter 14]: hence, is asymptotically unbiased as .
Appendix B Importance sampling with multiple proposals
There are at least two procedures to build a joint IS estimator: the standard multiple importance sampling (MIS) approach and the full deterministic mixture (DM-MIS) scheme.
The simplest approach [47, Chapter 14] is computing the classical IS weights:
with and . The IS estimator is then built by normalizing them jointly, i.e., computing
where . For proposal pdfs and , for , we have
In this case, .
B.2 Deterministic mixture approach
An alternative approach is based on the deterministic mixture sampling idea . Considering proposals , , and setting
In this case, the complete proposal is considered to be a mixture of and , weighted according to the number of samples drawn from each one. Note that, unlike in the standard procedure for sampling from a mixture, a deterministic and fixed number of samples are drawn from each proposal in the DM approach . It can be shown that the set of samples drawn in this deterministic way is distributed according to the mixture [45, Chapter 9, Section 11]. The DM estimator is finally given by
where and the are given by (44). For proposal pdfs, the DM estimator can also be easily generalized:
with , and . On the one hand, the DM approach is more efficient than the IS method, thus providing a better performance in terms of a reduced variance of the corresponding estimator, as shown in the following section. On the other hand, it needs to evaluate every proposal times instead of only times (in the standard MIS procedure), and therefore is more costly from a computational point of view. However, this increased computational cost is negligible when the proposal is much cheaper to evaluate than the target, as it often happens in practical applications.
B.3 Convex combination of partial IS estimators
Regardless the type of weights employed in the IS scheme (either as in Eq. (42) or as in Eq. (44)), the resulting estimators can be written as convex combination of simpler ones. First of all, let us consider again the use of proposals, and . We draw samples from each one, , with . The two partial sums of the weights corresponding only to the samples drawn from and , are given by and . The partial IS estimators, obtained by considering only one proposal pdf, are and where the normalized weights are and , respectively. The complete IS estimator, taking into account the samples jointly, is
This procedure can be easily extended for different proposal pdfs, obtaining the complete estimator as the convex combination of the partial estimators:
where , , and .
Appendix C Hierarchical interpretation of PMC
The standard Population Monte Carlo (PMC) method can be interpreted as using a hierarchical procedure. Although it is possible to recognize the two different layers, there are some differences w.r.t. the hierarchical procedure in Section 3. The first one is that in PMC the generation of ’s is not independent of the previously generated ’s. The second one is that the prior is instead , where is an approximation of the measure of obtained using the previously generated samples ’s (in the second level of the hierarchical approach). More specifically, a standard PMC method is an adaptive importance sampler using a population of proposals , , . PMC consists of the following steps, given an initial set, , , , of mean vectors:
Draw , for .
Assign to each sample the weights,
Resampling: draw independent samples , , according to the particle approximation
where we have denoted . Note that each , for all .
Return all the pairs , and .
Fixing an iteration , the generating procedure used in one iteration of the standard PMC method can be cast in the hierarchical formulation:
Draw samples from .
Draw , for .
Note that plays the role of the prior in the hierarchical scheme above. Differently from the novel proposed scheme, the two levels of hierarchical procedure are not independent since the pdf depends on the samples drawn in the lower level. Furthermore, also varies with and , whereas in our procedure we consider a fixed prior . However, note that is an empirical measure approximation of that improves when grows. An equivalent formulation of the hierarchical scheme for PMC is given below, involving a probability of generating a new mean given the previous ones denoted as .
Consider the -th iteration of PMC. Let us define as
the vector containing all the generated samples except for the -th. Let us also denote as , a generic mean vector, i.e. at the iteration , after applying one resampling step (i.e., a multinomial sampling according to the normalized weights). Hence, the distribution of given the previous means is
where is given in Eq. (49). For simplicity, below we denote
Then, after some straightforward rearrangements, Eq. (50) can be rewritten as
where is the estimate of the normalizing constant of the target obtained using the classical IS weights. The hierarchical formulation of PMC can be rewritten as:
Draw samples from in Eq. (50) or (51).
Draw , for .
When , then , and thus , for all . Namely, when grows, the hierarchical scheme above tends to have as prior in the upper level. Figures 5 show three different examples of the conditional pdf (obtained via numerical approximation) for a fixed and different . We can observe that becomes closer to the target (depicted in solid line) as grows.
In the Markov adaptive importance sampling (MAIS) schemes described in Section 5, since we are using MCMC methods for drawing from , actually we have also a current prior , determined for the kernels of the considered MCMC algorithms. For instance, in PI-MAIS we have
where is the kernel of the -th chain. Unlike in PMC, since we are using ergodic chains with invariant pdf , we know that for , with a fixed . Whereas PMC requires to increase for obtaining the same result.