Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering

Simon Lacoste-Julien, Fredrik Lindsten, Francis Bach

Introduction

In this paper, we explore a way to combine ideas from optimization with sampling to get better approximations in probabilistic models. We consider state-space models (SSMs, also referred to as general state-space hidden Markov models), as they constitute an important class of models in engineering, econometrics and other areas involving time series and dynamical systems. A discrete-time, nonlinear SSM can be written as

where xtXx_{t}\in\mathcal{X} denotes the latent state variable and ytYy_{t}\in\mathcal{Y} the observation at time tt. Exact state inference in SSMs is possible, essentially, only when the model is linear and Gaussian or when the state-space X\mathcal{X} is a finite set. For solving the inference problem beyond these restricted model classes, sequential Monte Carlo methods, i.e. particle filters (PFs), have emerged as a key tool; see e.g., Doucet and Johansen (2011); Cappé et al. (2005); Doucet et al. (2000). However, since these methods are based on Monte Carlo integration they are inherently affected by sampling variance, which can degrade the performance of the estimators.

Particular challenges arise in the case when the observation likelihood p(ytxt)p(y_{t}|x_{t}) is computationally expensive to evaluate. For instance, this is common in robotics applications where the observation model relates the sensory input of the robot, which can comprise vision-based systems, laser rangefinders, synthetic aperture radars, etc. For such systems, simply evaluating the observation function for a fixed value of xtx_{t} can therefore involve computationally expensive operations, such as image processing, point-set registration, and related tasks. This poses difficulties for particle-filtering-based solutions for two reasons: (1) the computational bottleneck arising from the likelihood evaluation implies that we cannot simply increase the number of particles to improve the accuracy, and (2) this type of “complicated” observation models will typically not allow for adaptation of the proposal distribution used within the filter, in the spirit of Pitt and Shephard (1999), leaving us with the standard—but inefficient—bootstrap proposal as the only viable option. On the contrary, for these systems, the dynamical model p(xtxt1)p(x_{t}|x_{t-1}) is often comparatively simple, e.g. being a linear and Gaussian “nearly constant acceleration” model (Ristic et al., 2004).

The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios when the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of the particles. By doing so, we can avoid excessive variance arising from simple Monte Carlo sampling from the bootstrap proposal.

We build on the optimization view from Bach et al. (2012) of kernel herding (Chen et al., 2010) to approximate the integrals appearing in the Bayesian filtering recursions. We make use of the Frank-Wolfe (FW) quadrature to approximate, in particular, mixtures of Gaussians which often arise in a particle filtering context as the mixture over past particles in the distribution over the next state. We use this approach within a filtering framework and prove theoretical convergence results for the resulting method, denoted as sequential kernel herding (SKH), giving one of the first explicit better convergence rates than for a particle filter. Our preliminary experiments show that SKH can give better accuracy than a standard particle filter or a quasi-Monte Carlo particle filter.

Adaptive quadrature rules with Frank-Wolfe optimization

and so by bounding MMD(p,p^)\mathop{\rm MMD}(p,\hat{p}), we can bound the error of approximating the expectation for all fHf\in\mathcal{H}, with fH\|f\|_{\mathcal{H}} as a proportionality constant. MMD(p,p^)\mathop{\rm MMD}(p,\hat{p}) is thus a central quantity for developing good quadrature rules given by (2). In the context of RKHSs, MMD(p,q)\mathop{\rm MMD}(p,q) can be called the maximum mean discrepancy (Gretton et al., 2012) between the distributions pp and qq, and acts a pseudo-metric on the space of distributions on X\mathcal{X}. If κ\kappa is a characteristic kernel (such as the standard RBF kernel), then MMD\mathop{\rm MMD} is in fact a metric, i.e. MMD(p,q)=0    p=q\mathop{\rm MMD}(p,q)=0\implies p=q. We refer the reader to Sriperumbudur et al. (2010) for the regularity conditions needed for the existence of these objects and for more details.

2 Frank-Wolfe optimization for adaptive quadrature

Algorithm 1 presents the Frank-Wolfe optimization algorithm to solve mingMJ(g)\min_{g\in\mathcal{M}}J(g) in the context of getting quadrature rules (we also introduce the shorthand notation μp:=μ(p)\mu_{p}:=\mu(p)). We note that to evaluate the quality MMD(p^,p)\mathop{\rm MMD}(\hat{p},p) of this adaptive quadrature rule, we need to be able to evaluate μp(x)=xXp(x)κ(x,x)dx\mu_{p}(x)=\int_{x^{\prime}\in\mathcal{X}}p(x^{\prime})\kappa(x^{\prime},x)dx^{\prime} efficiently. This is true only for specific pairs of kernels and distributions, but fortunately this is the case when pp is a mixture of Gaussians and κ\kappa is a Gaussian kernel. This insight is central to this paper; we explore this case more specifically in Section 2.3. To find the next quadrature point, we also need to (approximately) optimize μp(x)\mu_{p}(x) over X\mathcal{X} (step 3 of Algorithm 1, called the FW vertex search). In general, this will yield a non-convex optimization problem, and thus cannot be solved with guarantees, even with gradient descent. In our current implementation, we approach step 3 by doing an exhaustive search over MM random samples from pp precomputed when FW-Quad is called. We thus follow the idea from the kernel herding paper (Chen et al., 2010) to choose the best NN “super-samples” out of a large set of samples MM. Thanks to the fact that convergence guarantees for Frank-Wolfe optimization can still be given when using an approximate FW vertex search, we show in Appendix B of the supplementary material that this procedure either adds a O(1/M1/4)O(1/M^{1/4}) term or a O(1/M)O(1/\sqrt{M}) term to the worst-case MMD(p^,p)\mathop{\rm MMD}(\hat{p},p) error.

In our description of Algorithm 1, a preset number NN of particles (iterations) was used. Alternatively, we could use a variable number of iterations with the terminating criterion test gkμ(p)Hϵ\|g_{k}-\mu(p)\|_{\mathcal{H}}\leq\epsilon which can be explicitly computed during the algorithm and provides the MMD\mathop{\rm MMD} error bound on the returned quadrature rule. Option (2) on line 5 chooses the step-size γk\gamma_{k} by analytic line-search (hereafter referred as the FW-LS version) while option (1) chooses the kernel herding step-size γk=1/(k+1)\gamma_{k}=1/(k+1) (herafter referred as the FW version) which always yields uniform weights: wk(i)=1/kw_{k}^{(i)}=1/k for all iki\leq k. A third alternative is to re-optimize J(g)J(g) over the convex hull of the previously visited vertices; this is called the fully corrective version (Jaggi, 2013) of the Frank-Wolfe algorithm (hereafter referred as FCFW). In this case: (wk+1(1),,wk+1(k+1))=arg minwΔk+1wKk+1w2ck+1w(w_{k+1}^{(1)},\ldots,w_{k+1}^{(k+1)})=\operatorname*{arg\,min}_{\bm{w}\in\Delta_{k+1}}\bm{w}^{\top}\bm{K}_{k+1}\bm{w}-2\bm{c}_{k+1}^{\top}\bm{w}, where Δk+1\Delta_{k+1} is the (k+1)(k+1)-dimensional probability simplex, Kk+1\bm{K}_{k+1} is the kernel matrix on the (k+1)(k+1) vertices: (Kk+1)ij=κ(x(i),x(j))(\bm{K}_{k+1})_{ij}=\kappa(x^{(i)},x^{(j)}) and (ck+1)i=μp(x(i))(\bm{c}_{k+1})_{i}=\mu_{p}(x^{(i)}) for i=1,,(k+1)i=1,\ldots,{(k+1)}. This is a convex quadratic problem over the simplex. A slightly modified version of the FCFW is called the min-norm point algorithm and can be more efficiently optimized using specific purpose active-set algorithms — see Bach (2013, §9.2) for more details. We refer the reader to Bach et al. (2012) for more details on the rate of convergence of Frank-Wolfe quadrature assuming that the FW vertex is found with guarantees. We summarize them as follows: if H\mathcal{H} is infinite dimensional, then FW-Quad gives the same O(1/N)O(1/\sqrt{N}) rate for the MMD error as standard random sampling, for all FW methods. On the other hand, if a ball of non-zero radius centered at μp\mu_{p} lies within M\mathcal{M}, then faster rates than random sampling are possible: FW gives a O(1/N)O(1/N) rate whereas FW-LS and FCFW gives exponential convergence rates (though in practice, we often see differences not explained by the theory between these methods).

3 Example: mixture of Gaussians

Sequential kernel herding

Consider again the SSM in (1). The joint probability density function for a sequence of latent states x1:T:=(x1,,xT)x_{1:T}:=(x_{1},\ldots,x_{T}) and observations y1:Ty_{1:T} factorizes as p(x1:T,y1:T)=t=1Tp(xtxt1)p(ytxt)p(x_{1:T},y_{1:T})=\prod_{t=1}^{T}p(x_{t}|x_{t-1})p(y_{t}|x_{t}), with p(x1x0):=p(x1)p(x_{1}|x_{0}):=p(x_{1}) denoting the prior density on the initial state. We would like to do approximate inference in this SSM. In particular, we could be interested in computing the joint filtering distribution rt(x1:t):=p(x1:ty1:t)r_{t}(x_{1:t}):=p(x_{1:t}|y_{1:t}) or the joint predictive distribution pt+1(xt+1,x1:t):=p(xt+1,x1:ty1:t)p_{t+1}(x_{t+1},x_{1:t}):=p(x_{t+1},x_{1:t}|y_{1:t}). In particle filtering methods, we approximate these distributions with empirical distributions from weighted particle sets {wt(i),x1:t(i)}i=1N\{w_{t}^{(i)},x_{1:t}^{(i)}\}_{i=1}^{N} as in (2). We note that it is easy to marginalize p^\hat{p} with a simple weight summation, and so we will present the algorithm as getting an approximation for the joint distributions rtr_{t} and ptp_{t} defined above, with the understanding that the marginal ones are easy to obtain afterwards. In the terminology of particle filtering, xt(i)x_{t}^{(i)} is the particle at time tt, whereas x1:t(i)x_{1:t}^{(i)} is the particle trajectory. While principally the PF provides an approximation of the full joint distribution rt(x1:t)r_{t}(x_{1:t}), it is well known that this approximation deteriorates for any marginal of xsx_{s} for sts\ll t (Doucet and Johansen, 2011). Hence, the PF is typically only used to approximate marginals of xsx_{s} for sts\lesssim t (fixed-lag smoothing) or s=ts=t (filtering), or for prediction.

2 Sequential kernel herding

3 Convergence theory

In this section, we give sufficient conditions to guarantee that SKH is consistent as NN goes to infinity. Let ptp_{t} here denote the marginalized predictive instead of the joint. Let FtF_{t} be the forward transformation operator on signed measures that takes the predictive distribution ptp_{t} on xtx_{t} and yields the un-normalized marginalized predictive distribution FtptF_{t}p_{t} on xt+1x_{t+1} in the SSM. Thus for a measure ν\nu, we get (Ftν)():=Xtp(xt)p(ytxt)dν(xt)(F_{t}\nu)(\cdot):=\int_{\mathcal{X}_{t}}p(\cdot|x_{t})p(y_{t}|x_{t})d\nu(x_{t}). We also have that pt+1=1WtFtptp_{t+1}=\frac{1}{W_{t}}F_{t}p_{t}.

For the following theorem, Ft\mathcal{F}_{t} is a function space on Xt+1\mathcal{X}_{t+1} defined (depending on Ht+1\mathcal{H}_{t+1}) as all functions for which the following semi-norm is finite:In general, the integral on Xt+1\mathcal{X}_{t+1} should be with respect to the base measure for which the conditional density p(xt+1xt)p(x_{t+1}|x_{t}) is defined. All proofs are in the supplementary material.

Suppose that the function ft:(xt+1,xt)p(ytxt)p(xt+1xt)f_{t}:(x_{t+1},x_{t})\mapsto p(y_{t}|x_{t})p(x_{t+1}|x_{t}) is in the tensor product function space FtHt\mathcal{F}_{t}\otimes\mathcal{H}_{t} with the following defined nuclear norm: ftFtHt:=infiαiFtβiHt\|f_{t}\|_{\mathcal{F}_{t}\otimes\mathcal{H}_{t}}:=\inf\sum_{i}\|\alpha_{i}\|_{\mathcal{F}_{t}}\|\beta_{i}\|_{\mathcal{H}_{t}}, where the infimum is taken over all the possible expansions such that ft(xt+1,xt)=iαi(xt+1)βi(xt)f_{t}(x_{t+1},x_{t})=\sum_{i}\alpha_{i}(x_{t+1})\beta_{i}(x_{t}) for all xt,xt+1x_{t},x_{t+1}. Then for any finite signed Borel measure ν\nu on Xt\mathcal{X}_{t}, we have:

Suppose that for all 1tT1\leq t\leq T, ftf_{t} is in FtHt\mathcal{F}_{t}\otimes\mathcal{H}_{t} as defined in Theorem 1 and oto_{t} is in Ht\mathcal{H}_{t}. Then we have:We use the convention that the empty sum is and the empty product is 11.

We note that χt1\chi_{t}\approx 1 as we expect the errors on WkW_{k} to go in either direction, and thus to cancel each other over time (though in the worst case it could grow exponentially in tt). If ϵ^tϵ\hat{\epsilon}_{t}\leq\epsilon and ρtρ\rho_{t}\leq\rho, we basically have μ(p^T)μ(pT)=O(ρTϵ)\|\mu(\hat{p}_{T})-\mu(p_{T})\|=O(\rho^{T}\epsilon) if ρ>1\rho>1; O(Tϵ)O(T\epsilon) if ρ=1\rho=1; and O(ϵ)O(\epsilon) if ρ<1\rho<1 (a contraction). The exponential dependence in TT is similar as for a standard particle filter for general distributions; see Douc et al. (2014) though for conditions to get a contraction for the PF.

Experiments

We start by investigating the merits of different sampling schemes for approximating mixtures of Gaussians, since this is an intrinsic step to the SKH algorithm. In Figure 1, we give the MMD\mathop{\rm MMD} error as well as the error on the mean function in term of the number of particles NN for the different sampling schemes on a randomly chosen mixture of Gaussians with K=100K=100 components in d=2d=2 dimensions. Additional results as well as the details of the model are given in Appendix C.1 of the supplementary material. In our experiments, the number of FW search points is M=50, ⁣000M=50,\!000. We note that even though in theory all methods should have the same rate of convergence O(1/N)O(1/\sqrt{N}) for the MMD\mathop{\rm MMD} (as H\mathcal{H} is infinite dimensional), FCFW empirically improves significantly over the other methods. As dd increases, the difference between the methods tapers off for a fixed kernel bandwidth σ2\sigma^{2}, but increasing σ2\sigma^{2} gives better results for FW and FCFW than the other schemes.

In the remaining sections, we evaluate empirically the application of kernel herding in a filtering context using the proposed SKH algorithm.

2 Particle filtering using SKH on synthetic examples

We consider first several synthetic data sets in order to assess the improvements offered by Frank-Wolfe quadrature over standard Monte Carlo and quasi-Monte-Carlo techniques. We generate data from four different systems (further details on the experimental setup can be found in Appendix C.2):

Two linear Gaussian state-space (LGSS) models of dimensions d=3d=3 and d=15d=15, respectively.

A jump Markov linear system (JMLS), consisting of 2 interacting LGSS models of dimension d=2d=2. The switching between the models is governed by a hidden 2-state Markov chain.

A nonlinear benchmark time-series model used by, among others, Doucet et al. (2000); Gordon et al. (1993). The model is of dimension d=1d=1 and is given by:

with vtv_{t} and ete_{t} mutually independent standard Gaussian.

These models are ordered in increasing levels of difficulty for inference. For the LGSS models, the exact filtering distributions can be computed by a Kalman filter. For the JMLS, this is also possible by running a mixture of Kalman filters, albeit at a computational cost of 2T2^{T} (where TT is the total number of time steps). For the nonlinear system, no closed form expressions are available for the filtering densities; instead we run a PF with N=100, ⁣000N=100,\!000 particles as a reference.

We generate 3030 batches of observations for T=100T=100 time steps from all systems, except for the JMLS where we use T=10T=10 (to allow exact filtering). We run the proposed SKH filter, using both FW and FCFW optimization and compare against a bootstrap PF (using stratified resampling (Carpenter et al., 1999)) and a quasi-Monte-Carlo PF based on a Sobol-sequence point-set. All methods are run with NN varying from 2020 to 200200 particles. We deliberately use rather few particles since, as discussed above, we believe that this is the setting when the proposed method can be particularly useful.

To assess the performances of the different methods, we first compute the root-mean-squared errors (RMSE) for the filtered mean-state-estimates over the TT time steps, w.r.t. the reference filters. We report the median RMSEs over the 30 different data batches, along with the 25%25\% and 75%75\% quantiles, and the minimum and maximum values in Figure 2. The SKH algorithms were run for three different values of σ2{0.01,0.1,1}\sigma^{2}\in\{0.01,0.1,1\}. Here, we report the results for σ2=1\sigma^{2}=1 for the LGSS models and the JMLS, and for σ2=0.1\sigma^{2}=0.1 for the nonlinear benchmark model. The results for the other values are given in Appendix C.2. The improvements are somewhat robust to the value of σ2\sigma^{2}, but in some cases significant differences were observed. As can be seen, both SKH methods improve significantly upon both QMC and the bootstrap PF. For the two LGSS models, we also compute the MMD (reported in the rightmost column in Figure 2).

3 Vision-based UAV Localization

In this section, we apply the proposed SKH algorithm to solve a filtering problem in field robotics. We use the data and the experimental setup described by Törnqvist et al. (2009). The problem consists of estimating the full six-dimensional pose of an unmanned aerial vehicle (UAV).

Törnqvist et al. (2009) proposed a vision-based solution, essentially tracking interest points in the camera images over consecutive frames to estimate the ego-motion. This information is then fused with the inertial and barometer sensors to estimate the pose of the UAV. The system is modelled on state-space form, with a state vector comprising the position, velocity, acceleration, as well as the orientation and the angular velocity of the UAV. The state is also augmented with sensor biases, resulting in a state dimension of 22. Furthermore, the state is augmented with the three-dimensional positions of the interest points that are currently tracked by the vision system; this is a varying number but typically around ten.

To deal with the high-dimensional state-vector, Törnqvist et al. (2009) used a Rao-Blackwellized PF (see Appendix A) to solve the filtering problem, marginalizing all but 6 state components (being the pose, i.e., the position and orientation) using a combination of Kalman filters and extended Kalman filters. The remaining 6 state-variables were tracked using a bootstrap particle filter with N=200N=200 particles; the strikingly small number of particles owing to the computational complexity of the likelihood evaluation.

For the current experiment, we obtained the code and the flight-test data from Törnqvist et al. (2009). The modularity of our approach allowed us to simply replace the Monte Carlo simulation step within their setup with FW-Quad. We ran SKH-FW with σ2=10\sigma^{2}=10 and SKH-FCFW with σ2=0.1\sigma^{2}=0.1, as well as the bootstrap PF used in Törnqvist et al. (2009), and a QMC-PF; all methods using N=50N=50, 100, and 200 particles. We ran all methods 10 times on the same data; the variation in SKH coming from the random search points for the FW procedure, and in QMC for starting the Sobol sequence at different points. For comparison, we ran 10 times a reference PF with N=100, ⁣000N=100,\!000 particles and averaged the results. The median position errors for 100 seconds of robot time (there are 20 SSM time steps per second of robot time) are given in Figure 3. The UAV is assumed to start at a known location at time zero, hence, all the errors are zero initially. Note that all methods accumulate errors over time. This is natural, since there is no absolute position reference available (i.e., the filter is unstable) and the objective is basically to keep the error as small as possible for as long time as possible. SKH-FW here gives the overall best results, with significant improvements over the bootstrap PF and the QMC methods for small number of particles. SKH-FW even gives similar errors for the last time step with only N=200N=200 particles as one of the reference PFs (using N=100, ⁣000N=100,\!000 particles). See Appendix C.2.1 for a discussion of the role of σ2\sigma^{2} for FCFW.

In these experiments, we focused on investigating how optimization could improve the error per particle, as the gain in runtime depends on the exact implementation as well as the likelihood evaluation cost. We note that the FW-Quad algorithm scales as O(NM)O(NM) for NN samples and MM search points when using FW, by updating the objective on the MM search points in an online fashion (we also empirically observed this linear scaling in NN). On the other hand, FCFW scales as O(N2M)O(N^{2}M) as the weights on the particles possibly change at each iteration, preventing the same online trick. SKH scales linearly with the number of time steps TT (as a standard PF). For the UAV application, the original Matlab code from Törnqvist et al. (2009) spent an average of 0.2 s per time step for N=50N=50 particles (linear in the number of particles as the likelihood evaluation is the bottleneck) on a XEON E5-2620 2.10 GHz PC. The overhead of using our Matlab implementation of FW-Quad with N=50N=50 is about 0.15 s per time step for FW and 0.3 s for FCFW; and 0.3 s for FW and 1.0 s for FCFW for N=100N=100 (we used M=10, ⁣000M=10,\!000 search points in this experiment). In practice, this means that SKH-FW can be run here with 50 particles in the same time as the standard PF is run with about 90 particles. But as Figure 3 shows, the error for SKH-FW with 50 particles is still much lower than the PF with 200 particles.

Conclusion

We have developed a method for Bayesian filtering problems using a combination of optimization and particle filtering. The method has been demonstrated to provide improved performance over both random sampling and quasi-Monte Carlo methods. The proposed method is modular and it can be used with different types of particle filtering techniques, such as the Rao-Blackwellized particle filter. Further investigating this possibility for other classes of particle filters is a topic for future work. Future work also includes a deeper analysis of the convergence theory for the method in order to develop practical guidelines for the choice of the kernel bandwidth.

We thank Eric Moulines for useful discussions. This work was partially supported by the MSR-Inria Joint Centre, a grant by the European Research Council (SIERRA project 239993) and by the Swedish Research Council (project Learning of complex dynamical systems number 637-2014-466).

References

Appendix A Extension for Rao-Blackwellization

A common strategy for improving the efficiency of the PF is to make use of Rao-Blackwellization—this idea can be used also with SKH. Rao-Blackwellization, here, refers to analytically marginalizing some conditionally tractable component of the state vector and thereby reducing the dimensionality of the space on which the PF operates. Assume that the state of the system is comprised of two components xtx_{t} and ztz_{t}, where the filtering density for ztz_{t} is tractable conditionally on the history of x1:tx_{1:t}. The typical case is that of a conditionally linear Gaussian system, in which case the aforementioned conditional filtering density p(ztx1:t,y1:t)p(z_{t}|x_{1:t},y_{1:t}) is Gaussian and computable using a Kalman filter (conditionally on x1:tx_{1:t}). The Rao-Blackwellized PF (RBPF) exploits this property by factorizing:

Appendix B Rates for SKH when using random search points

In this section, we show that we can get guarantees on the MMD\mathop{\rm MMD} error of the FW-Quad procedure when approximately finding the FW vertex in step 3 of Algorithm 1 using exhaustive search through MM random samples from pp. This means that despite not solving step 3 exactly, the SKH procedure with MM random search points (under assumptions of Theorem 2) is still consistent as long as MM grows to infinity.

The main idea is that the rates of convergence for the Frank-Wolfe optimization procedure still holds when the linear subproblem (step 3) is solved within accuracy of δ\delta. More specifically, if we guarantee that the FW vertex gˉk+1\bar{g}_{k+1} that we use satisfy J(gk),gˉk+1mingMJ(gk),g+δ\langle J^{\prime}(g_{k}),\bar{g}_{k+1}\rangle\leq\min_{g\in\mathcal{M}}\langle J^{\prime}(g_{k}),g\rangle+\delta during the algorithm, then the standard O(1/k)O(1/k) rate of convergence for FW carries through but within δ\delta of the optimal objective (i.e. up to J(g)+δJ(g^{*})+\delta). A simple modification of the argument by Jaggi (2013) (who used a shrinking δ\delta during the FW algorithm) can show this for the step-size of γk=2k+2\gamma_{k}=\frac{2}{k+2}; we give the proofs for the step-size of γk=1k+1\gamma_{k}=\frac{1}{k+1} as well as the potential faster rate O(1/k2)O(1/k^{2}) for the MMD objective in Appendix G.

Let XMX\mathcal{X}_{M}\subseteq\mathcal{X} be the set of MM search points, and pMp_{M} be the empirical distribution for the MM samples from pp. Let δM:=μ(pM)μ(p)H\delta_{M}:=\|\mu(p_{M})-\mu(p)\|_{\mathcal{H}} which can be made small by increasing MM. Consider the iteration kk in FW-Quad where we do exhaustive search on XM\mathcal{X}_{M} in step 3. We thus have:

where RM:=maxxXMΦ(x)R_{M}:=\max_{x\in\mathcal{X}_{M}}\|\Phi(x)\| (RMRR_{M}\leq R). We can thus interpret step 3 as approximately solving (within δMRM\delta_{M}R_{M}) the linear subproblem for the Frank-Wolfe optimization of JM(g):=12gμ(pM)H2J_{M}(g):=\frac{1}{2}\|g-\mu(p_{M})\|_{\mathcal{H}}^{2} over the marginal polytope of XM\mathcal{X}_{M}. We thus get a rate of convergence to within δMRM\delta_{M}R_{M} of mingJM(g)=0\min_{g}J_{M}(g)=0. Finally, we have

Appendix C Additional details on experiments

The parameters for the mixture of Gaussians p(x)=i=1KπiN(xμi,Σi)p(x)=\sum_{i=1}^{K}\pi_{i}\mathcal{N}(x|\mu_{i},\Sigma_{i}) were randomly sampled as follows:

The means μi\mu_{i}’s are uniformly sampled on d^{d}.

Σi=σi2I\Sigma_{i}=\sigma_{i}^{2}I where σi2\sigma_{i}^{2} is uniformly sampled on [0.1,4.1][0.1,4.1].

πi\pi_{i} are obtained by normalizing independent uniform random variables.

Figure 4 present additional results for the mixture of Gaussians experiments. From our experiments, we make the following observations:

FCFW always performs best (this was observed similarly in Bach et al. (2012) but for other pairs of distribution / kernel).

As dd increases, the difference between the methods tapers off for a fixed σ2\sigma^{2}, but increasing σ2\sigma^{2} gives better results for FW and FCFW than the others (see for example the last column of Figure 4).

The FW-LS results are identical to FW, and so we have excluded them from the plots for clarity.

The improvement of QMC over MC decreases as the number of mixture components KK increase. FW and FCFW are not affected by KK as much.

To generate quasi-random samples from the mixture of Gaussians, we generate a (d ⁣+ ⁣1)(d\!+\!1)-dimensional Sobol sequence using the Matlab qrandstream function. The last dimension is (naively) used to sample the mixture component by using the inverse transformation method for a discrete random variable. The first dd components are then used to sample from the corresponding multivariate Gaussian by transforming dd independent standard normals. We note that Gerber and Chopin (2014) argued on the importance of sorting the discrete mixture components according to their location before choosing them with the standard inverse transformation method (in our naive implementation, the order is arbitrary and arising from how the mixture of Gaussians was stored). They propose a method for this that they called Hilbert sort, for which they could prove nice low-discrepancy properties. This approach might reduce the sensitivity to KK of QMC. The worse results of QMC for the UAV experiment in Figure 3 might be explained by our naive implementation.

C.2 Synthetic data examples and additional results

In this section we provide additional details and results for the synthetic data examples. The LGSS models and the modes of the JMLS are generated randomly using the function drss from the Matlab Control Systems Toolbox. The four models that were considered are given by:

with (A,C)(A,C) being an observable pair. The system has poles in 0.2825-0.2825 and 0.3669±0.0379i-0.3669\pm 0.0379i.

with (A,C)(A,C) being an observable pair. The system has poles in 0.2456±0.6594i0.2456\pm 0.6594i, 0.48330.4833, 0.33290.3329, 0.0882±0.2512i0.0882\pm 0.2512i, 0.1485-0.1485, 0.8045-0.8045, 0.4848-0.4848, 0.5252±0.0368i-0.5252\pm 0.0368i, 0.6692±0.0612i-0.6692\pm 0.0612i, 0.6604-0.6604, and 0.6680-0.6680.

with Π=(0.70.30.30.7)\Pi=\begin{pmatrix}0.7&0.3\\ 0.3&0.7\end{pmatrix}, and the two system modes corresponding to observable systems with poles in 0.4429-0.4429, 0.09370.0937, and 0.6576-0.6576, 0.31090.3109, respectively.

Additional results, for the different values of σ2{0.01,0.1,1}\sigma^{2}\in\{0.01,0.1,1\} are reported in Figures 5–8.

SKH-FW did not seem to suffer from this problem. This might partly explain why we were able to use the much bigger σ2=10\sigma^{2}=10 for the UAV experiment with good results (Figure 3) whereas we used σ2=0.1\sigma^{2}=0.1 for SKH-FCFW.

Appendix D Proof sketch for Theorem 1

Proof sketch. We assume that the function ft:(xt+1,xt)p(ytxt)p(xt+1xt)f_{t}:(x_{t+1},x_{t})\mapsto p(y_{t}|x_{t})p(x_{t+1}|x_{t}) is in the tensor product FtHt\mathcal{F}_{t}\otimes\mathcal{H}_{t}, with Ft\mathcal{F}_{t} defined as in the statement of the theorem. We consider the nuclear norm \citepsupjameson1987summing:

over all possible decompositions {αi,βi}i=1\{\alpha_{i},\beta_{i}\}_{i=1}^{\infty} of ftf_{t} such that, for all xt,xt+1x_{t},x_{t+1}

In the following, let {αi,βi}i=1\{\alpha_{i},\beta_{i}\}_{i=1}^{\infty} be such a decomposition for ftf_{t}. We have

Now we have that μ(Ftν)Ht+1=suphHt+1=1h,μ(Ftν)\|\mu(F_{t}\nu)\|_{\mathcal{H}_{t+1}}=\sup_{\|h\|_{\mathcal{H}_{t+1}}=1}|\langle h,\mu(F_{t}\nu)\rangle|, so we consider for some hHt+1h\in\mathcal{H}_{t+1}:

This inequality was valid for any expansion {αi,βi}i=1\{\alpha_{i},\beta_{i}\}_{i=1}^{\infty} for ftf_{t}, and thus we can take the infimum of the upper bound over all possible expansions to get:

Appendix E Special case for the Gaussian kernel

In this section, we explore what form F\|\cdot\|_{\mathcal{F}} takes for the Gaussian kernel. We then show that ftFH\|f_{t}\|_{\mathcal{F}\otimes\mathcal{H}} is finite for a simple one-dimensional linear Gaussian SSM as long as σ\sigma is small enough (and thus H\mathcal{H} is big enough, as the size of H\mathcal{H} increases when σ\sigma decreases for the Gaussian kernel).

For the Gaussian kernel κ(x,y)=exp(xy22/2σ2)=:q(xy)\kappa(x,y)=\exp(-\|x-y\|_{2}^{2}/2\sigma^{2})=:q(x-y), its Fourier transform is

By applying Cauchy-Schwartz on L2L^{2} on the RHS, this leads to

This allows for quite peaky distributions for the dynamics, as Diracs are authorized (with constant Fourier transform).

To compute an upper bound on ftFH\|f_{t}\|_{\mathcal{F}\otimes\mathcal{H}}, we simply need to find a decomposition of p(ytxt)p(xt+1xt)p(y_{t}|x_{t})p(x_{t+1}|x_{t}) as a sum of terms αi(xt+1)βi(xt)\alpha_{i}(x_{t+1})\beta_{i}(x_{t}) and bound the appropriate norms of αi\alpha_{i} and βi\beta_{i}.

We do this for a special case in the following section.

with, using 12τ2+w21w2=w2w1w2=w1+w-\frac{1}{2\tau^{2}}+\frac{w^{2}}{1-w^{2}}=\frac{w^{2}-w}{1-w^{2}}=\frac{-w}{1+w}:

We thus now need to compute the norms of αn\alpha_{n} and βn\beta_{n}, by first computing the Fourier transform. We use the representation:

integrating over a contour around the origin, which leads to:

using Hn(x)Cexp(x2/2)(2nn!π)1/2H_{n}(x)\leqslant C\exp(x^{2}/2)(2^{n}n!\sqrt{\pi})^{1/2}, with C=π1/4C=\pi^{-1/4}, and -\frac{1+w}{4w}+\frac{(1+w)^{2}}{8w}=\frac{1+w}{4w}\big{(}-1+\frac{1+w}{2}\big{)}=-\frac{1-w^{2}}{8w}.

We have (A,w,0)=1w28A2w\square(A,w,0)=\frac{1-w^{2}}{8A^{2}w}. Thus, by continuity if BB if small enough, (A,w,B)>0\square(A,w,B)>0. Note that when we have B=0B=0 and A=1A=1, we recover previous results for αn\alpha_{n}.

Thus, for σ\sigma small enough, we have the norm less than a constant times

Appendix F Proof for Theorem 2

Finally, the initialization error term (III) can be bounded by using Theorem 1 (with ν=q^tqt\nu=\hat{q}_{t}-q_{t}):

where Ct:=ftFHC_{t}:=\|f_{t}\|_{F\otimes\mathcal{H}}.

To control MMD(q^t,qt)\mathop{\rm MMD}(\hat{q}_{t},q_{t}), we now only work on the un-normalized distributions:

by repeating the arguments for smaller tt’s and unrolling the recursion (and recall that u=tt1()=1\prod_{u=t}^{t-1}(\cdot)=1 by convention).

Finally, we transform back the ϵt\epsilon_{t} errors in the algorithmic quantities ϵ^t\hat{\epsilon}_{t} that the FW algorithm measures:

We expect χu=k=1u1W^kWk1\chi_{u}=\prod_{k=1}^{u-1}\frac{\hat{W}_{k}}{W_{k}}\approx 1 as the errors on the normalization constants could hopefully go in both direction and thus cancel each other, though in the worst case it could also grow with uu. Substituting back in (6), we get what we wanted to prove:

For parameter estimation in a HMM, one would also be interested in the quality of approximation for ZtZ_{t}. We note that inequality (B) also gives us a bound on the relative error of our estimate Z^t\hat{Z}_{t} for the normalization constant:

The disadvantage of the bound (7) is the presence of the quantity χu\chi_{u} for which we did not provide an explicit upper bound (though we would expect it to be close to 1). To get an explicit upper bound for the error, we can repeat a similar argument but always working with the normalized quantities:

The term (II) is the normalization error which can bounded similarly as before as:

Similarly as before, we also have for (III) that by Theorem 1 (with ν=p^tpt\nu=\hat{p}_{t}-p_{t}) that MMD(Ftp^t,Ftpt)CtMMD(p^t,pt)\mathop{\rm MMD}(F_{t}\hat{p}_{t},F_{t}p_{t})\leq C_{t}\,MMD(\hat{p}_{t},p_{t}). Combining the three terms, we get:

Appendix G Faster rates for FW with approximate vertex search for the MMD objective

In this section, we provide the proofs for the rate of convergence for FW on the MMD objective J(g):=12gμ(p)H2J(g):=\frac{1}{2}\|g-\mu(p)\|^{2}_{\mathcal{H}} when an approximate vertex search is used (as mentioned in Appendix B) by extending the proofs from Chen et al. (2010). We consider the step-size γk=1k+1\gamma_{k}=\frac{1}{k+1}.We note that the rate extends to the line-search step-size as well as the improvement at each iteration can only be better in this case considering the proof technique that we use. We note that the standard step-size for Frank-Wolfe optimization to get a O(1/k)O(1/k) rate is γk=2k+2\gamma_{k}=\frac{2}{k+2}.We also tried the γk=2k+2\gamma_{k}=\frac{2}{k+2} step-size in the mixture of Gaussians experiment of Section 2.3, but it gave similar results as the γk=1k+1\gamma_{k}=\frac{1}{k+1} step-size. The best rate known for general objectives when using FW with γk=1k+1\gamma_{k}=\frac{1}{k+1} is actually O(log(k)/k)O(\log(k)/k) \citepsup[Bound 3.2]freund2013FW. We make use of the specific form of the MMD objective here to prove the O(1/k)O(1/k) rate, as well as the faster O(1/k2)O(1/k^{2}) rate under additional assumptions. For the rest of this section, we use \|\cdot\| to mean H\|\cdot\|_{\mathcal{H}}.

Consider the FW-Quad Algorithm 1 where an approximate vertex search is used: gkμp,gˉk+1mingMgkμp,g+δ\langle g_{k}-\mu_{p},\bar{g}_{k+1}\rangle\leq\min_{g\in\mathcal{M}}\langle g_{k}-\mu_{p},g\rangle+\delta, where gˉk+1:=Φ(xk+1)\bar{g}_{k+1}:=\Phi(x_{k+1}) and δ0\delta\geq 0. Suppose that μp\mu_{p} lies in the strict interior of M\mathcal{M} with a radius r>0r>0, i.e. a ball of radius rr centered at μp\mu_{p} lies within M\mathcal{M}. Recall that maxgMgR\max_{g\in\mathcal{M}}\|g\|\leq R. Then we have the faster rate O(1/k2)O(1/k^{2}) for the objective JJ:

If r=0r=0 (note that μpM\mu_{p}\in\mathcal{M}), then we can still get a standard O(1/k)O(1/k) FW rate:

Proof If a ball of a radius rr centered at μp\mu_{p} lies within M\mathcal{M}, then we have that:

So the approximate vertex search yields gˉk+1\bar{g}_{k+1} with the property:

By using the FW update gk+1=γkgˉk+1+(1γk)gkg_{k+1}=\gamma_{k}\bar{g}_{k+1}+(1-\gamma_{k})g_{k} with the γk=1k+1\gamma_{k}=\frac{1}{k+1} step-size, we get:

Thus if we let vk:=k(gkμp)v_{k}:=k(g_{k}-\mu_{p}), then we get:

The last inequality used the crucial strict interior assumption that yielded (11). Now let Ck:=1r(2R2+kδ)C_{k}:=\frac{1}{r}(2R^{2}+k\delta). Note that Ck+1CkC_{k+1}\geq C_{k}. We will now proceed to show by induction that vkCk\|v_{k}\|\leq C_{k} for k1k\geq 1. Note that the bracket in (12) is negative if and only if vkCk\|v_{k}\|\geq C_{k} (i.e. vk+1vk\|v_{k+1}\|\leq\|v_{k}\| in this case), giving the inspiration for the CkC_{k} threshold.

by using the fact that Rr1\frac{R}{r}\geq 1 since a ball of radius rr fitting in M\mathcal{M} implies that the maximum norm RR of elements in M\mathcal{M} is at least rr.

Now suppose that vkCk\|v_{k}\|\leq C_{k}, i.e. that vk=αCk\|v_{k}\|=\alpha C_{k} for some α\alpha\in. Then (12) becomes:

The RHS is a convex function of α\alpha, and so it is maximized at the boundary of its domain. For α=0\alpha=0, we get vk+122Ckr=4R2+2kδ\|v_{k+1}\|^{2}\leq 2C_{k}r=4R^{2}+2k\delta. For α=1\alpha=1, we get vk+12Ck2\|v_{k+1}\|^{2}\leq C_{k}^{2}. And thus in general, supposing α\alpha\in, we get that vk+12max{2Ckr,Ck2}=Ck2\|v_{k+1}\|^{2}\leq\max\{2C_{k}r,C_{k}^{2}\}=C_{k}^{2} as:

using Rr1\frac{R}{r}\geq 1. This completes the induction step as this means that vk+1CkCk+1\|v_{k+1}\|\leq C_{k}\leq C_{k+1}.

Thus we conclude that vkCk\|v_{k}\|\leq C_{k} for all k1k\geq 1, i.e.

This shows the faster O(1/k)O(1/k) rate (9). If we do not have μp\mu_{p} in the strict interior of M\mathcal{M}, i.e. r=0r=0, then we can unroll the inequality (12) to get:

This translates to a slower O(1/k)O(1/\sqrt{k}) rate on gkμp\|g_{k}-\mu_{p}\| (with δ\sqrt{\delta} precision), but note that at least it does not have the log(k)\log(k) factor from the rate by \citetsupfreund2013FW.

Going back over the argument from Appendix B, we said that using MM random search points in FW-Quad was similar to approximately solving (within δMRM\delta_{M}R_{M}) the linear subproblem for the Frank-Wolfe optimization of JM(g):=12gμ(pM)H2J_{M}(g):=\frac{1}{2}\|g-\mu(p_{M})\|_{\mathcal{H}}^{2} over the marginal polytope of XM\mathcal{X}_{M}. We note that the marginal polytope of XM\mathcal{X}_{M} is at most MM-dimensional, and thus, by using a similar argument as in Proposition 1 of Bach et al. (2012), we could show that it contains a ball of radius rM>0r_{M}>0 centered at μ(pM)\mu(p_{M}).We note that as XM\mathcal{X}_{M} is finite, we do not need to make the additional assumption that the kernel κ\kappa is continuous unlike in Proposition 1 of Bach et al. (2012). From (9) in Theorem G.1, we can thus conclude that:

This seems to give a faster rate, but the problem is that rMr_{M} might shrink at an exponential rate with MM if H\mathcal{H} is infinite dimensional. Thus even though δM\delta_{M} is O(1/M)O(1/\sqrt{M}), the dependence of δMRMrM\delta_{M}\frac{R_{M}}{r_{M}} might be worse than the previously quoted rate of O(1/M1/4)O(1/M^{1/4}); the latter is thus the general worst-case.

On the other hand, under the additional assumption that H\mathcal{H} is finite dimensional and that there is a ball of radius r>0r>0 centered around μp\mu_{p} in M\mathcal{M} (by using Proposition 1 of Bach et al. (2012) for example), then for a sufficiently large MM, we will have rMr_{M} close to rr. Thus for large MM, we will have gkμ(pM)Rr(2Rk+δM)\|g_{k}-\mu(p_{M})\|\lesssim\frac{R}{r}\left(\frac{2R}{k}+\delta_{M}\right), and thus gNμp=O(R2r(1N+1M))\|g_{N}-\mu_{p}\|=O\left(\frac{R^{2}}{r}\left(\frac{1}{N}+\frac{1}{\sqrt{M}}\right)\right). This gives an asymptotically faster rate than the O(R(1N+1RM1/4))O\left(R\left(\frac{1}{\sqrt{N}}+\frac{1}{\sqrt{R}{M^{1/4}}}\right)\right) rate given in Appendix B arising from (10), but the constant is worse by a factor of Rr\frac{R}{r}.