Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, Masashi Sugiyama

INTRODUCTION

Variational inference methods are one of the most widely-used computational tools to deal with the intractability of Bayesian inference, while stochastic gradient (SG) methods are one of the most widely-used tools for solving optimization problems on huge datasets. The last three years have seen an explosion of work exploring SG methods for variational inference (Hoffman et al., 2013; Salimans et al., 2013; Ranganath et al., 2013; Titsias & Lázaro-Gredilla, 2014; Mnih & Gregor, 2014; Kucukelbir et al., 2014). In many settings, these methods can yield simple updates and scale to huge datasets.

A challenge that has been addressed in many of the recent works on this topic is that the “black-box” SG method ignores the geometry of the variational-parameter space. This has lead to methods like the stochastic variational inference (SVI) method of Hoffman et al. (2013), that uses natural gradients to exploit the geometry. This leads to better performance in practice, but this approach only applies to conditionally-conjugate models. In addition, it is not clear how using natural gradients for variational inference affects the theoretical convergence rate of SG methods.

In this work we consider a general framework that (i) can be stochastic to allow huge datasets, (ii) can exploit the geometry of the variational-parameter space to improve performance, and (iii) can yield a closed-form update even for non-conjugate models. The new framework can be viewed as a stochastic generalization of the proximal-gradient method of Khan et al. (2015), which splits the objective into conjugate and non-conjugate terms. By linearizing the non-conjugate terms, this previous method as well as our new method yield simple closed-form proximal-gradient updates even for non-conjugate models.

While proximal-gradient methods have been well-studied in the optimization community (Beck & Teboulle, 2009), like SVI there is nothing known about the convergence rate of the method of Khan et al. (2015) because it uses “divergence” functions which do not satisfy standard assumptions. Our second contribution is to analyze the convergence rate of the proposed method. In particular, we generalize an existing result on the convergence rate of stochastic mirror descent in non-convex settings (Ghadimi et al., 2014) to allow a general class of divergence functions that includes the cases above (in both deterministic and stochastic settings). While it has been observed empirically that including an appropriate divergence function enables larger steps than basic SG methods, this work gives the first theoretical result justifying the use of these more-general divergence functions. It in particular reveals how different factors affect the convergence rate such as the Lipschitz-continuity of the lower bound, the information geometry of the divergence functions, and the variance of the stochastic approximation. Our results also suggest conditions under which the proximal-gradient steps of Khan et al. (2015) can make more progress than (non-split) gradient steps, and sheds light on the choice of step-size for these methods. Our experimental results indicate that the new method leads to improvements in performance on a variety of problems, and we note that the algorithm and theory might be useful beyond the variational inference scenarios we have considered in this work.

VARIATIONAL INFERENCE

Consider a general latent variable model where we have a data vector y\mathbf{y} of length NN and a latent vector z\mathbf{z} of length DD. In Bayesian inference, we are interested in computing the marginal likelihood p(\mbox{\mbox{y\mathbf{y}}}), which can be written as the integral of the joint distribution p(\mbox{\mbox{y\mathbf{y}}},\mbox{\mbox{z\mathbf{z}}}) over all values of z\mathbf{z}. This integral is often intractable, and in variational inference we typically approximate it with the evidence lower-bound optimization (ELBO) approximation L\underline{\mathcal{L}}. This approximation introduces a distribution q(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{λ\boldsymbol{\lambda}}}) and chooses the variational parameters λ\boldsymbol{\lambda} to maximize the following lower bound on the marginal likelihood:

The inequality follows from concavity of the logarithm function. The set S\mathcal{S} is the set of valid parameters λ\boldsymbol{\lambda}.

To optimize λ\boldsymbol{\lambda}, one of the seemingly-simplest approaches is gradient descent: \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k+1}=\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}+\beta_{k}\nabla\underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}), which can be viewed as optimizing a quadratic approximation of L\underline{\mathcal{L}},

While we can often choose the family qq so that it has convenient computational properties, it might be impractical to apply gradient descent in this context when we have a very large dataset or when some terms in the lower bound are intractable. Recently, SG methods have been proposed to deal with these issues (Ranganath et al., 2013; Titsias & Lázaro-Gredilla, 2014): they allow large datasets by using random subsets (mini-batches) and can approximate intractable integrals using Monte Carlo methods that draw samples from q(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{λ\boldsymbol{\lambda}}}).

A second drawback of applying gradient descent to variational inference is that it uses the Euclidean distance and thus ignores the geometry of the variational-parameter space, which often results in slow convergence. Intuitively, (2) implies that we should move in the direction of the gradient, but not move \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k+1} too far away from \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k} in terms of the Euclidean distance. However, the Euclidean distance is not appropriate for variational inference because λ\boldsymbol{\lambda} is the parameter vector of a distribution; the Euclidean distance is often a poor measure of dissimilarity between distributions. The following example from Hoffman et al. (2013) illustrates this point: the two normal distributions \mbox{{\cal N}}(0,10000) and \mbox{{\cal N}}(10,10000) are almost indistinguishable, yet the Euclidean distance between their parameter vectors is 10, whereas the distributions \mbox{{\cal N}}(0,0.01) and \mbox{{\cal N}}(0.1,0.01) barely overlap, but their Euclidean distance between parameters is only 0.10.1.

Natural-Gradient Methods: The canonical way to address the problem above is by replacing the Euclidean distance in (2) with another divergence function. For example, the natural gradient method defines the iteration by using the symmetric Kullback-Leibler (KL) divergence (Hoffman et al., 2013; Pascanu & Bengio, 2013; Amari, 1998),

where \mbox{\mbox{G\mathbf{G}}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}) is the Fisher information-matrix,

Hoffman et al. (2013) show that the natural-gradient update can be computationally simpler than gradient descent for conditionally-conjugate exponential family models. In this family, we assume that the distribution of z\mathbf{z} factorizes as \prod_{i}p(\mbox{\mbox{z\mathbf{z}}}^{i}|\textrm{pa}^{i}) where \mbox{\mbox{z\mathbf{z}}}^{i} are disjoint subsets of z\mathbf{z} and pai\textrm{pa}^{i} are the parents of the \mbox{\mbox{z\mathbf{z}}}^{i} in a directed acyclic graph. This family also assumes that each conditional distribution is in the exponential family,

where \mbox{\mbox{η\boldsymbol{\eta}}}^{i} are the natural parameters, \mbox{\mbox{T\mathbf{T}}}^{i}(\mbox{\mbox{z\mathbf{z}}}^{i}) are the sufficient statistics, A^{i}(\mbox{\mbox{η\boldsymbol{\eta}}}^{i}) is the partition function, and h^{i}(\mbox{\mbox{z\mathbf{z}}}^{i}) is the base measure. Hoffman et al. (2013) consider a mean-field approximation q(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{λ\boldsymbol{\lambda}}})=\prod_{i}q^{i}(\mbox{\mbox{z\mathbf{z}}}^{i}|\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}) where each qiq^{i} belongs to the same exponential-family distribution as the joint distribution,

The parameters of this distribution are denoted by \mbox{\mbox{λ\boldsymbol{\lambda}}}^{i} to differentiate them from the joint-distribution parameters \mbox{\mbox{η\boldsymbol{\eta}}}^{i}.

As shown by Hoffman et al. (2013), the Fisher matrix for this problem is equal to \nabla^{2}A^{i}(\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}) and the gradient of the lower bound with respect to \mbox{\mbox{λ\boldsymbol{\lambda}}}^{i} is equal to \nabla^{2}A^{i}(\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i})(\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}-\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}_{*}) where \mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}_{*} are the mean-field parameters (see Paquet, 2014). Therefore, when computing the natural-gradient, the \nabla^{2}A^{i}(\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}) terms cancel out and the natural-gradient is simply \mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}-\mbox{\mbox{λ\boldsymbol{\lambda}}}^{i}_{*} which is much easier to compute than the actual gradient. Unfortunately, for non-conjugate models this cancellation does not happen and the simplicity of the update is lost. The Riemannian conjugate-gradient method of Honkela et al. (2011) has similar issues, in that computing \nabla^{2}A(\mbox{\mbox{λ\boldsymbol{\lambda}}}) is typically very costly.

This method yields better convergence properties, but requires numerical optimization to implement the update even for conditionally-conjugate models. Khan et al. (2015) considers a deterministic proximal-gradient variant of this method by splitting the lower bound into L:=f+h-\underline{\mathcal{L}}:=f+h, where ff contains all the “easy” terms and hh contains all the “difficult” terms. By linearizing the “difficult” terms, this leads to a closed-form update even for non-conjugate models. The update is given by:

However, this method requires the exact gradients which is usually not feasible for large dataset and/or complex models.

The convergence rate of mirror descent algorithm has been analyzed in convex (Duchi et al., 2010) and more recently in non-convex (Ghadimi et al., 2014) settings. However, mirror descent does not cover the cases described above in (5) and (6) when a KL divergence between two exponential-family distributions is used with λ\boldsymbol{\lambda} as the natural-parameter. For such cases, the Bregman divergence corresponds to a KL divergence with swapped parameters (see Nielsen & Garcia, 2009, Equation 29),

where A(\mbox{\mbox{λ\boldsymbol{\lambda}}}) is the partition function of qq. Because (5) and (6) both use a KL divergence where the second argument is fixed to \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}, instead of the first argument, they are not covered under the mirror-descent framework. In addition, even though mirror-descent has been used for variational inference (Ravikumar et al., 2010), Bregman divergences do not yield an efficient update in many scenarios.

PROXIMAL-GRADIENT SVI

This unifies a variety of existing approaches since it allows:

Splitting of L\underline{\mathcal{L}} into a difficult term ff and a simple term hh, similar to the method of Khan et al. (2015).

A stochastic approximation ^f\widehat{\nabla}f of the gradient of the difficult term, similar to SG methods.

Below, we describe each feature in detail, along with the precise assumptions used in our analysis.

We make the following assumptions about ff and hh:

The function ff is differentiable and its gradient is LL-Lipschitz-continuous, i.e. \forall\mbox{\mbox{λ\boldsymbol{\lambda}}} and \mbox{\mbox{λ\boldsymbol{\lambda}}}^{\prime}\in\mathcal{S} we have

The function hh can be a general convex function.

These assumptions are very weak. The function ff can be non-convex and the Lipschitz-continuity assumption is typically satisfied in practice (and indeed the analysis can be generalized to only require this assumption on a smaller set containing the iterations). The assumption that hh is convex seems strong, but note that we can always take h=0h=0 in the split if the function has no “nice” convex part. Below, we give several illustrative examples of such splits for variational-Gaussian inference with q(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{λ\boldsymbol{\lambda}}}):=\mbox{{\cal N}}(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{m\mathbf{m}}},\mbox{\mbox{V\mathbf{V}}}), so that \mbox{\mbox{λ\boldsymbol{\lambda}}}=\{\mbox{\mbox{m\mathbf{m}}},\mbox{\mbox{V\mathbf{V}}}\} with m\mathbf{m} being the mean and V\mathbf{V} being the covariance matrix.

The detailed derivation is in the appendix. By substituting in (1), we obtain the lower bound \underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}) shown below along with its split:

A1 is satisfied for common likelihoods, while it is easy to establish that hh is convex. We show in Section 6 that this split leads to a closed-form update for iteration (9).

We give further details about the bound for this case in the appendix.

Correlated Topic Model (CTM): Given a text document with a vocabulary of NN words, denote its word-count vector by y\mathbf{y}. Let KK be the number of topics and z\mathbf{z} be the vector of topic-proportions. We can then use the following split:

where \mbox{\mbox{μ\boldsymbol{\mu}}},\mbox{\mbox{Σ\boldsymbol{\Sigma}}} are parameters of the Gaussian prior and βn,k\beta_{n,k} are parameters of KK multinomials. We give further details about the bound in the appendix.

2 STOCHASTIC-APPROXIMATION

The approach of Khan et al. (2015) considers (9) in the special case of (6) where we use the exact gradient \nabla f(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}) in the first term. But in practice this gradient is often difficult to compute. In our framework, we allow a stochastic approximation of \nabla f(\mbox{\mbox{λ\boldsymbol{\lambda}}}) which we denote by \widehat{\nabla}f(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}).

In our analysis we make the following two assumptions regarding the stochastic approximation of the gradient:

Its variance is upper bounded: \textrm{Var}[\widehat{\mbox{\mbox{g\mathbf{g}}}}(\mbox{\mbox{λ\boldsymbol{\lambda}}},\mbox{\mbox{ξ\boldsymbol{\xi}}}_{n})]\leq\sigma^{2}.

In both the assumptions, the expectation is taken with respect to the noise \mbox{\mbox{ξ\boldsymbol{\xi}}}_{n}. The first assumption is true for the stochastic approximations of (12). The second assumption is stronger, but only needs to hold for all \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k} so is almost always satisfied in practice.

3 DIVERGENCE FUNCTIONS

There exist an α>0\alpha>0 such that for all \mbox{\mbox{λ\boldsymbol{\lambda}}},\mbox{\mbox{λ\boldsymbol{\lambda}}}^{\prime} generated by (9) we have:

The first assumption is reasonable and is satisfied by typical divergence functions like the squared Euclidean distance and variants of the KL divergence. In the next section we show that, whenever the iteration (9) is defined and all \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k} stay within a compact set, the second assumption is satisfied for all divergence functions considered in Section 2.

SPECIAL CASES

which is equivalent to strong-convexity of the function A(\mbox{\mbox{λ\boldsymbol{\lambda}}}) (Nesterov, 2004, Theorem 2.1.9).

Taking the derivative with respect to λ\boldsymbol{\lambda} and substituting in (13) with \mbox{\mbox{λ\boldsymbol{\lambda}}}^{\prime}=\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}, we get the condition

which is satisfied when \bigtriangledown^{2}A(\mbox{\mbox{λ\boldsymbol{\lambda}}}) is positive-definite over a compact set for α\alpha equal to its lowest eigenvalue on the set.

Methods based on natural-gradient using iteration (3) (like SVI) correspond to using h=0h=0, f=Lf=-\underline{\mathcal{L}}, and the symmetric KL divergence. Assumption A1 to A5 are usually assumed for these methods and, as we show next, A6 is also satisfied. In particular, when qq is an exponential family distribution the symmetric KL divergence can be written as the sum of the Bregman divergence shown in (8) and the KL divergence shown in (14),

where the first equality follows from the definition of the symmetric KL divergence and the second one follows from (8). Since the two divergences in the sum satisfy A6, the symmetric KL divergence also satisfies the assumption.

CONVERGENCE OF PG-SVI

We first analyze the convergence rate of deterministic methods where the gradient is exact, \widehat{\nabla}f(\mbox{\mbox{λ\boldsymbol{\lambda}}})=\nabla f(\mbox{\mbox{λ\boldsymbol{\lambda}}}). This yields a simplified result that applies to a wide variety of existing variational methods. Subsequently, we consider the more general case where a stochastic approximation of the gradient is used.

We first establish the convergence under a fixed step-size when using the exact gradient. We use C_{0}=\underline{\mathcal{L}}^{*}-\underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{0}) as the initial (constant) sub-optimality, and express our result in terms of the quantity

where \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k+1} is computed using (9).

Let A1, A2, A5, and A6 be satisfied. If we run tt iterations of (9) with a fixed step-size βk=β=α/L\beta_{k}=\beta=\alpha/L for all kk and an exact gradient \nabla f(\mbox{\mbox{λ\boldsymbol{\lambda}}}), then we have

In more general settings, the quantity GkG_{k} provides a generalized notion of first-order optimality for problems that may be non-smooth or use a non-Euclidean geometry. Further, if the objective is bounded below (C0C_{0} is finite), this result implies that the algorithm converges to such a stationary point and also gives a rate of convergence of O(1/t)O(1/t).

If we use a divergence with α>1\alpha>1 then we can use a step-size larger than 1/L1/L and the error will decrease faster than gradient-descent. To our knowledge, this is the first result that formally shows that natural-gradient methods can achieve faster convergence rates. The splitting of the objective into ff and hh functions is also likely to improve the step-size. Since LL only depends on ff, sometimes it might be possible to reduce the Lipschitz constant by choosing an appropriate split.

We next give a more general result that allows a per-iteration step size.

If we choose the step-sizes βk\beta_{k} to be such that 0<βk2α/L0<\beta_{k}\leq 2\alpha/L with βk<2α/L\beta_{k}<2\alpha/L for at least one kk, then,

We give a proof in the appendix. For gradient-descent, the above result implies that we can use any step-size less than 2/L2/L, which agrees with the classical step-size choices for gradient and proximal-gradient methods.

2 STOCHASTIC METHODS

We now give a bound for the more general case where we use a stochastic approximation of the gradient.

Let A1-A6 be satisfied. If we run tt iterations of (9) for a fixed step-size βk=α/L\beta_{k}=\alpha_{*}/L (where 0<γ<20<\gamma<2 is a scalar) and fixed batch-size Mk=MM_{k}=M for all kk with a stochastic gradient \widehat{\nabla}f(\mbox{\mbox{λ\boldsymbol{\lambda}}}), then we have

where cc is a constant such that c>1/(2α)c>1/(2\alpha) and α:=α1/(2c)\alpha_{*}:=\alpha-1/(2c). The expectation is taken with respect to the noise \mbox{\mbox{ξ\boldsymbol{\xi}}}:=\{\mbox{\mbox{ξ\boldsymbol{\xi}}}_{0},\mbox{\mbox{ξ\boldsymbol{\xi}}}_{1},\ldots,\mbox{\mbox{ξ\boldsymbol{\xi}}}_{t-1}\}, and a random variable RR which follows the uniform distribution Prob(R=k)=1/t,k{0,1,2,,t1}Prob(R=k)=1/t,\forall k\in\{0,1,2,\ldots,t-1\}.

Unlike the bound of Proposition 1, this bound depends on the noise variance σ2\sigma^{2} as well the mini-batch size MM. In particular, as we would expect, the bound gets tighter as the variance gets smaller and as the size of our mini-batch grows. Notice that the dependence on the variance σ2\sigma^{2} is also improved if we have a favourable geometry that increases α\alpha^{*}. Thus, we can achieve a higher accuracy by either increasing the mini-batch size or improving the geometry. In the appendix we give a more general result that allows non-constant sequences of step sizes, although we found that constant step-sizes work better empirically. Note that while stating the result in terms of a randomized iteration might seem strange, in practice we typically just take the last iteration as the approximate minimizer.

CLOSED-FORM UPDATES FOR NON-CONJUGATE MODELS

We now give an example where iteration (9) attains a closed-form solution. We expect such closed-form solution to exist for a large class of problems, including models where qq is an exponential-family distribution, but here we focus on the GP model discussed in Section 3.1.

For the GP model, we rewrite the lower bound (11) as

where mnk,km_{n_{k},k} and vnk,kv_{n_{k},k} denote the value of mnm_{n} and vnv_{n} in the kk’th iteration for n=nkn=n_{k}. By using the KL divergence as our divergence function in iteration (9), and by denoting \mbox{{\cal N}}(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{m\mathbf{m}}}_{k},\mbox{\mbox{V\mathbf{V}}}_{k}) by qkq_{k}, we can express the two last two terms in (9) as a single KL divergence function as shown below:

where rk:=1/(1+βk)r_{k}:=1/(1+\beta_{k}). Comparing this to (17), we see that this objective is similar to that of a GP model with a Gaussian priorSince pp and qq are Gaussian, the product is a Gaussian. p1rkqkrkp^{1-r_{k}}q_{k}^{r_{k}} and a linear Gaussian-like log-likelihood. Therefore, we can obtain closed-form updates for its minimization.

The updates are shown below and a detailed derivation is given in the appendix.

where \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{0} is initialized to a small positive constant to avoid numerical issues, \mbox{\mbox{1\boldsymbol{1}}}_{n_{k}} is a vector with all zero entries except nkn_{k}’th entry which is equal to 1, κk\boldsymbol{\kappa}_{k} is nkn_{k}’th column of K\mathbf{K}, and \mbox{\mbox{A\mathbf{A}}}_{k}:=\mbox{\mbox{K\mathbf{K}}}+[\mbox{\mbox{diag}}(\widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k})]^{-1}. For iteration k+1k+1, we use mnk+1,k+1m_{n_{k+1},k+1} and vnk+1,k+1v_{n_{k+1},k+1} to compute the gradients αnk+1,k+1\alpha_{n_{k+1},k+1} and γnk+1,k+1\gamma_{n_{k+1},k+1}, and run the above updates again. We continue until a convergence criteria is reached.

There are numerous advantages of these updates. First, We do not need to store the full covariance matrix V\mathbf{V}. The updates avoid forming the matrix and only update m\mathbf{m}. This works because we only need one diagonal element in each iteration to compute the stochastic gradient γnk,k\gamma_{n_{k},k}. For large NN this is a clear advantage since the memory cost is O(N)O(N) rather than O(N2)O(N^{2}). Second, computation of the mean vector m\mathbf{m} and a diagonal entry of V\mathbf{V} only require solving two linear equations, as shown in the second and third line of (19). In general, for a mini-batch of size MM, we need a total of 2M2M linear equations, which is a lot cheaper than an explicit inversion. Finally, the linear equations at iteration k+1k+1 are very similar to those at iteration kk, since \mbox{\mbox{A\mathbf{A}}}_{k} differ only at one entry from \mbox{\mbox{A\mathbf{A}}}_{k+1}. Therefore, we can reuse computations from the previous iteration to improve the computational efficiency of the updates.

EXPERIMENTAL RESULTS

In this section, we compare our method to many existing approaches such as SGD and four adaptive gradient-methods (ADAGRAD, ADADELTA, RMSprop, ADAM), as well as two variational inference methods for non-conjugate models (the delta method and Laplace method). We show results on Gaussian process classification (Kuss & Rasmussen, 2005) and correlated topic models (Blei & Lafferty, 2007). The code to reproduce these experiments can be found on GitHub.https://github.com/emtiyaz/prox-grad-svi

We first consider binary classification by using a GP model with a Bernoulli-logit likelihood on three datasets: Sonar, Ionosphere, and USPS-3vs5. These datasets can be found at the UCI data repositoryhttps://archive.ics.uci.edu/ml/datasets.html and their details are discussed in Kuss & Rasmussen (2005). For the GP prior, we use the zero mean-function and a squared-exponential covariance function with hyperparameters σ\sigma and ll as defined in Kuss & Rasmussen (2005) (see Eq. 33). We set the values of the hyperparameters using cross-validation. For the three datasets, the hyperparameters (logl,logσ)(\log l,\log\sigma) are set to (1,6)(-1,6), (1,2.5)(1,2.5), and (2.5,5)(2.5,5), respectively.

In our first experiment, we compare the performance under a fixed step-size.The results also demonstrate the faster convergence of our method compared to gradient-descent methods. We compare the following four algorithms on the Ionosphere dataset: (1) batch gradient-descent (referred to as ‘GD’), (2) batch proximal-gradient algorithm (referred to as ‘PG’), (3) batch proximal-gradient algorithm with gradients approximated by using Monte Carlo (referred to as ‘PG-MC’ and using S=500S=500 samples), and (4) the proposed proximal-gradient stochastic variational-inference (referred to as ‘PG-SVI’) method where stochastic gradients are obtained using (12) with M=5M=5.

Figure 1 shows the number of examples required for convergence versus the step-size. A lower number implies faster convergence. Convergence is assessed by monitoring the lower bound, and when the change in consecutive iterations do not exceed a certain threshold, we stop the algorithm.

We clearly see that GD requires many more passes through the data, and proximal-gradient methods converge faster than GD. In addition, the upper bound on the step-size for PG is much larger than GD. This implies that PG can potentially take larger steps than the GD method. PG-SVI is surprisingly as fast as PG which shows the advantage of our approach over the approach of Khan et al. (2015).

1.2 Comparison with Adaptive Gradient Methods

We also compare PG-SVI to SGD and four adaptive methods, namely ADADELTA (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012), ADAGRAD (Duchi et al., 2011), and ADAM (Kingma & Ba, 2014). The implementation details of these algorithms are given in the appendix. We compare the value of the lower bound versus number of passes through the data. We also compare the average log-loss on the test data,nlogp^n/N-\sum_{n}\log\hat{p}_{n}/N_{*}, where \hat{p}_{n}=p(y_{n}|\sigma,l,\mbox{{\cal D}}_{t}) is the predictive probabilities of the test point yny_{n} given training data \mbox{{\cal D}}_{t} and NN_{*} is the total number of test-pairs. A lower value is better for the log-loss, and a value of 1 is equal to the performance of random coin-flipping.

Figure 2 summarizes the results. In these plots, lower is better for both objectives and one “pass” means the number of randomly selected examples is equal to the total number of examples. Our method is much faster to converge than other methods. It always converged within 10 passes through the data while other methods required more than 100 passes.

2 CORRELATED TOPIC MODEL

We next show results for correlated topic model on two collections of documents, namely the NIPS and Associated Press (AP) datasets. The NIPShttps://archive.ics.uci.edu/ dataset contains 1500 documents from the NIPS conferences held between 1987 and 1999 (a vocabulary-size of 12,419 words and a total of around 1.9M words). The APhttp://www.cs.columbia.edu/~blei/lda-c/index.html collection contains 2,246 documents from the Associated Press (a vocabulary-size of 10,473 words and a total of 436K observed words). We use 50% of the documents for training and 50% for testing.

We compare to the delta method and the Laplace method discussed in Wang & Blei (2013), and also to the original mean-field (MF) method of Blei & Lafferty (2007). For these methods, we use an available implementation.https://www.cs.princeton.edu/~chongw/resource.html All of these methods approximate the lower bound by using approximations to the expectation of log-sum-exp functions (see the appendix for details). We compare these methods to the two versions of our algorithm which do not use such approximations, but instead use a stochastic gradient as explain in Section 3.2. Specifically, we use the following two versions: one with full covariance (referred to as PG-SVI), and the other with a diagonal covariance (referred to as PG-SVI-MF). For both of these algorithms, we use a fixed step-size of 0.001, and a mini-batch size of 2 documents.

Following Wang & Blei (2013) we compare the held-out log-likelihood which is computed as follows: a new test document y\mathbf{y} is split into two halves (\mbox{\mbox{y\mathbf{y}}}^{1},\mbox{\mbox{y\mathbf{y}}}^{2}), then we compute the approximate posterior q(\mbox{\mbox{z\mathbf{z}}}) to the posterior p(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{y\mathbf{y}}}^{1}) and use this to compute the held-out log-likelihood for each y_{n}\in\mbox{\mbox{y\mathbf{y}}}^{2} using

We use a Monte Carlo to this quantity by using a large number of samples from qq (unlike Wang & Blei (2013) who approximate it by using the Delta method). We report the average of this quantity over all words in \mbox{\mbox{y\mathbf{y}}}^{2}.

Figure 3 shows the negative of the average held-out log-likelihood versus time for 10 topics (lower values are better). We see that methods based on proximal-gradient algorithm converge a little bit faster than the existing methods. More importantly, they achieves better performance. This could be due to the fact that we do not approximate the expectation of the log-sum-exp function, unlike the delta and Laplace method. We obtained similar results when we used a different number of topics.

DISCUSSION

This work has made two contributions. First, we proposed a new variational inference method that combines variable splitting, stochastic gradients, and general divergence functions. This method is well-suited for a huge variety of the variational inference problems that arise in practice, and we anticipate that it may improve over state of the art methods in a variety of settings. Our second contribution is a theoretical analysis of the convergence rate of this general method. Our analysis generalizes existing results for the mirror descent algorithm in optimization, and establishes convergences rates of a variety of existing variational inference methods. Due to its generality we expect that this analysis could be useful to establish convergence rates of other algorithms that we have not thought of, perhaps beyond the variational inference settings we consider in this work. However, an open problem that is also discussed by Ghadimi et al. (2014) it to esbatlish convergence to an arbitrary accuracy with a fixed batch size.

One issue that we have not satisfactorily resolved is giving a theoretically-justified way to set the step-size in practice; our analysis only indicates that it must be sufficiently small. However, this problem is common in many methods in the literature and our analysis at least suggests the factors that should be taken into account. Another open issue is the applicability our method to many other latent variable models; in this paper we have shown applications to variational-Gaussian inference, but we expect that our method should result in simple updates for a larger class of latent variable models such as non-conjugate exponential family distribution models. Additional work on these issues will improve usability of our method.

References

Appendix A Examples of Splitting for Variational-Gaussian Inference

We give detailed derivations for the splitting-examples shown in Section 3.1 in the main paper. As in the main paper, we denote the Gaussian posterior distribution by q(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{λ\boldsymbol{\lambda}}}):=\mbox{{\cal N}}(\mbox{\mbox{z\mathbf{z}}}|\mbox{\mbox{m\mathbf{m}}},\mbox{\mbox{V\mathbf{V}}}), so that \mbox{\mbox{λ\boldsymbol{\lambda}}}=\{\mbox{\mbox{m\mathbf{m}}},\mbox{\mbox{V\mathbf{V}}}\} with m\mathbf{m} being the mean and V\mathbf{V} being the covariance matrix.

Consider GP models for NN input-output pairs \{y_{n},\mbox{\mbox{x\mathbf{x}}}_{n}\} indexed by nn. Let z_{n}:=f(\mbox{\mbox{x\mathbf{x}}}_{n}) be the latent function drawn from a GP with a zero-mean function and a covariance function \kappa(\mbox{\mbox{x\mathbf{x}}},\mbox{\mbox{x\mathbf{x}}}^{\prime}). We denote the Kernel matrix obtained on the data \mbox{\mbox{x\mathbf{x}}}_{n} for all nn by K\mathbf{K}.

We use a non-Gaussian likelihood p(ynzn)p(y_{n}|z_{n}) to model the output, and assume that each yny_{n} is independently sampled from this likelihood given z\mathbf{z}. The joint-distribution over y\mathbf{y} and z\mathbf{z} is shown below:

By substituting in Eq. 1 of the main paper, we can obtain the lower bound L\underline{\mathcal{L}} after a few simplifications, as shown below:

The assumption A2 is satisfied since the KL divergence is convex in both m\mathbf{m} and V\mathbf{V}. This is clear from the expression of the KL divergence:

where DD is the dimensionality of z\mathbf{z}. Convexity w.r.t. m\mathbf{m} follows from the fact that the above is quadratic in m\mathbf{m} and K\mathbf{K} is positive semi-definite. Convexity w.r.t. V\mathbf{V} follows due to concavity of \log|\mbox{\mbox{V\mathbf{V}}}| (trace is linear, so does not matter).

Assumption A1 depends on the choice of the likelihood p(ynzn)p(y_{n}|z_{n}), but is usually satisfied. The simplest example is a Gaussian likelihood for which the function ff takes the following form:

where mnm_{n} is the nn’th element of m\mathbf{m} and vnv_{n} is the nn’th diagonal entry of V\mathbf{V}. This clearly satisfies A1, since the objective is quadratic in m\mathbf{m} and linear in V\mathbf{V}.

Here is an example where A1 is not satisfied: for Poisson likelihood logp(ynzn)=exp[ynznezn]/yn!\log p(y_{n}|z_{n})=\exp[y_{n}z_{n}-e^{z_{n}}]/y_{n}! with rate parameter equal to ezne^{z_{n}}, the function ff takes the following form:

whose derivative is not Lipschitz continuous since exponential functions are not Lipschitz.

A.2 Generalized Linear Models (GLMs)

We now describe a split for generalized linear models. We model the output yny_{n} by using an exponential family distribution whose natural-parameter is equal to \eta_{n}:=\mbox{\mbox{x\mathbf{x}}}_{n}^{T}\mbox{\mbox{z\mathbf{z}}}. Assuming a standard Gaussian prior over z\mathbf{z}, the joint distribution can be written as follows:

The lower bound can be shown to be the following:

which is very similar to the GP case. Therefore, Assumptions A1 and A2 will follow with similar arguments.

A.3 Correlated Topic Model (CTM)

We consider text documents with a vocabulary size NN. Let z\mathbf{z} be a length KK real-valued vector which follows a Gaussian distribution shown in (32). Given z\mathbf{z}, a topic tnt_{n} is sampled for the nn’th word using a multinomial distribution shown in (33). Probability of observing a word in the vocabulary is then given by (34).

Here β\boldsymbol{\beta} is a N×KN\times K real-valued matrix with non-negative entries and columns that sum to 1. The parameter set for this model is given by \mbox{\mbox{θ\boldsymbol{\theta}}}=\{\mbox{\mbox{μ\boldsymbol{\mu}}},\mbox{\mbox{Σ\boldsymbol{\Sigma}}},\mbox{\mbox{β\boldsymbol{\beta}}}\}. We can marginalize out tnt_{n} and obtain the data-likelihood given z\mathbf{z},

Given that we observe nn’th word yny_{n} times, we can write the following joint distribution:

where \mbox{\mbox{μ\boldsymbol{\mu}}},\mbox{\mbox{Σ\boldsymbol{\Sigma}}} are parameters of the Gaussian prior and βn,k\beta_{n,k} are parameters of KK multinomials.

where W=nynW=\sum_{n}y_{n} is the total number of words. The top line is the function [-f(\mbox{\mbox{λ\boldsymbol{\lambda}}})] while the bottom line is [-h(\mbox{\mbox{λ\boldsymbol{\lambda}}})].

There are two intractable expectations in ff, each involving expectation of a log-sum-exp function. Wang and Blei (2013) use the Delta method and Laplace method to approximate these expectations. In contrast, in PG-SVI algorithm, we use Monte Carlo to approximate the gradient of these functions.

Appendix B Proof of Proposition 1 and 2

We first prove the Proposition 2. Proposition 1 is obtained as a special case of it. Our proof technique is borrowed from Ghadimi et. al. (2014). We extend their results to general divergence functions.

We denote the proximal projection at \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k} with gradient g\mathbf{g} and step-size β\beta by,

The following lemma gives a bound on the norm of \mathcal{P}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k},\mbox{\mbox{g\mathbf{g}}},\beta).

The following holds for any \mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}\in\mathcal{S}, any real-valued vector g\mathbf{g} and β>0\beta>0.

We use this to derive the optimality condition of (40). For any λ\boldsymbol{\lambda}, the following holds from the optimality condition:

Letting \mbox{\mbox{λ\boldsymbol{\lambda}}}=\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k},

The first line follows from Assumption A2 (convexity of hh), and the second line follows from Assumption A6. ∎

Now, we are ready to prove Proposition 2:

The second line follows from the definition of P\mathcal{P} and the last line is due to Lemma 1. Rearranging the terms and using L=f+h-\underline{\mathcal{L}}=f+h we get:

Summing these term for all k=0,1,t1k=0,1,\dots t-1, we get the following:

By noting that the global maximum of the lower bound always upper bounds any other value, we get \underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{*})-\underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{0})\geq\underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{t-1})-\underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{0}). Using this,

Since we assume at least one of βk<2α/L\beta_{k}<2\alpha/L, we can divide by the summation term, to get the following:

Proposition 1 can be obtained by simply plugging in βk=α/L\beta_{k}=\alpha/L,

Appendix C Proof of Proposition 3

We will first prove the following theorem, which gives a similar result to Proposition 2 but for a stochastic gradient ^f\widehat{\nabla}f.

If we choose the step-size βk\beta_{k} such that 0<βk2α/L0<\beta_{k}\leq 2\alpha_{*}/L with βk<2α/L\beta_{k}<2\alpha_{*}/L for at least one kk, then,

where the expectation is taken over R{0,1,2,,t1}R\in\{0,1,2,\ldots,t-1\} which is a discrete random variable drawn from the probability mass function

and over \mbox{\mbox{ξ\boldsymbol{\xi}}}:=\{\mbox{\mbox{ξ\boldsymbol{\xi}}}_{1},\mbox{\mbox{ξ\boldsymbol{\xi}}}_{2},\ldots,\mbox{\mbox{ξ\boldsymbol{\xi}}}_{t-1}\} with \mbox{\mbox{ξ\boldsymbol{\xi}}}_{k} is the noise in the stochastic approximation ^f\widehat{\nabla}f.

Now considering c>1/(2α)c>1/(2\alpha), α=α1/(2c)\alpha_{*}=\alpha-1/(2c) and βk2αL\beta_{k}\leq\frac{2\alpha_{*}}{L}, and summing up both sides for iteration k=0,1,t1k=0,1\dots,t-1, we obtain

Writing the expectation with respect to RR and x\mathbf{x} we get

whose numerator is the left side of (55). Dividing (55) by k=0t(αβkL2βk2)\sum_{k=0}^{t}\left(\alpha_{*}\beta_{k}-\frac{L}{2}\beta_{k}^{2}\right) and using this in (56) we get the result. ∎

By substituting βk=α/L\beta_{k}=\alpha_{*}/L and Mk=MM_{k}=M in (47),

The probability distribution for RR reduces to a uniform distribution in this case, with the probability of each iteration being 1/t1/t. This proves Proposition 3.

Appendix D Derivation of Closed-Form Updates for the GP Model

where nkn_{k} is the example selected in kk’th iteration. We will now show that its solution can be obtained in closed-form.

𝑘1\mbox{\mbox{\mathbf{V}}}_{k+1} We first derive the full update of \mbox{\mbox{V\mathbf{V}}}_{k+1}. The KL divergence between two Gaussian distributions is given as follows:

Using this, we expand the last two terms of (59) to get the following,

Taking derivative of (59) with respect to V\mathbf{V} at \mbox{\mbox{V\mathbf{V}}}=\mbox{\mbox{V\mathbf{V}}}_{k+1} and setting it to zero, we get the following (here \mbox{\mbox{I\mathbf{I}}}_{n} is a matrix with all zeros, except the nn’th diagonal element which is set to 1):

which gives us the update of \mbox{\mbox{V\mathbf{V}}}_{k+1} for rk:=1/(1+βk)r_{k}:=1/(1+\beta_{k}).

𝑘1\mbox{\mbox{\mathbf{V}}}_{k+1} A full update will require storing the matrix \mbox{\mbox{V\mathbf{V}}}_{k+1}. Fortunately, we can avoid storing the full matrix and still do an exact update. The key point here is to notice that to compute the stochastic gradient in the next iteration we only need one diagonal element of \mbox{\mbox{V\mathbf{V}}}_{k+1} rather than the whole matrix. Specifically, if we sample nk+1n_{k+1}’th example at the iteration k+1k+1, then we need to compute vnk+1,k+1v_{n_{k+1},k+1} which is the nk+1n_{k+1}’th diagonal element of \mbox{\mbox{V\mathbf{V}}}_{k+1}. This can be done by solving one linear equation, as we show in this section. Specifically, we show that the following updates can be used to compute vnk+1,k+1v_{n_{k+1},k+1}:

where \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k}=r_{k}\widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k-1}+(1-r_{k})\gamma_{n_{k},k}\mbox{\mbox{1\boldsymbol{1}}}_{n_{k}} (\mbox{\mbox{1\boldsymbol{1}}}_{n} is a vector of all zeros except its nn’th entry which is equal to 1). We start the recursion with \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{0}=\epsilon where ϵ\epsilon is a small positive number.

We will now show that \mbox{\mbox{V\mathbf{V}}}_{k} can be reparameterized in terms of a vector \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k} which contains accumulated weighted sum of the gradient γnj,j\gamma_{n_{j},j}, for all jkj\leq k. To show this, we recursively substitute the update of \mbox{\mbox{V\mathbf{V}}}_{j} for j<k+1j<k+1, as shown below (recall that nkn_{k} is the example selected at the kk’th iteration). The second line is obtained by substituting the full update of \mbox{\mbox{V\mathbf{V}}}_{k} by using (64). The third line is obtained after a few simplifications. The fourth line is obtained by substituting the update of \mbox{\mbox{V\mathbf{V}}}_{k-1} and a few simplifications.

This update expresses \mbox{\mbox{V\mathbf{V}}}_{k+1} in terms of \mbox{\mbox{V\mathbf{V}}}_{k-2}, K\mathbf{K}, and gradients of the data example selected at k,k1,k,k-1, and k2k-2. Continuing in this fashion until k=0k=0, we can write the update as follows:

where tkt_{k} is the product of rk,rk1,,r0r_{k},r_{k-1},\ldots,r_{0}. We can write the updates more compactly by defining the accumulation of the gradients γnj,j\gamma_{n_{j},j} for all jkj\leq k by a vector \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k},

The vector \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k} can be obtained by using a recursion. We illustrate this below, where we have grouped the terms in (69) to show the recursion for \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k} (here \mbox{\mbox{1\boldsymbol{1}}}_{n} is a vector with all zero entries except nn’th entry which is set to 1):

Therefore, \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k} can be recursively updated as follows:

with an initialization \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{0}=\epsilon where ϵ\epsilon is a small constant to avoid numerical issues.

If we set \mbox{\mbox{V\mathbf{V}}}_{0}=\mbox{\mbox{K\mathbf{K}}}, then the formula simplifies to the following:

which is completely specified by \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k}, eliminating the need to compute and store \mbox{\mbox{V\mathbf{V}}}_{k+1}.

The nk+1n_{k+1}’th diagonal element can be obtained by using Matrix Inversion Lemma, which gives us the update (65).

D.3 Update of 𝐦𝐦\mathbf{m}

Taking derivative of (59) with respect to m\mathbf{m} at \mbox{\mbox{m\mathbf{m}}}=\mbox{\mbox{m\mathbf{m}}}_{k+1} and setting it to zero, we get the following (here \mbox{\mbox{1\boldsymbol{1}}}_{n} is a vector with all zero entries except nn’th entry which is set to 1):

where the last step is obtained using the fact that 1/βk=rk/(1rk)1/\beta_{k}=r_{k}/(1-r_{k}).

We simplify as shown below. The second line is obtained by adding and subtracting (1-r_{k})\mbox{\mbox{K\mathbf{K}}}^{-1}\mbox{\mbox{m\mathbf{m}}}_{k} in the square bracket at the right. In the the third line, we take \mbox{\mbox{m\mathbf{m}}}_{k} out. The fourth line is obtained by plugging in the updates of \mbox{\mbox{V\mathbf{V}}}_{k}^{-1}=\mbox{\mbox{K\mathbf{K}}}^{-1}+\mbox{\mbox{diag}}(\widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k}). The fifth line is obtained by using Matrix-Inversion lemma, and the sixth line is obtained by taking \mbox{\mbox{K\mathbf{K}}}^{-1} out of the right-most term.

where \mbox{\mbox{B\mathbf{B}}}_{k}:=\mbox{\mbox{K\mathbf{K}}}+[\mbox{\mbox{diag}}(r_{k}\widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k-1})]^{-1}.

Since r_{k}\widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k-1} and \widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k} differ only slightly (by the new example gradient γnk\gamma_{n_{k}}, we can instead use the following approximate update:

where \mbox{\mbox{A\mathbf{A}}}_{k}:=\mbox{\mbox{K\mathbf{K}}}+[\mbox{\mbox{diag}}(\widetilde{\mbox{\mbox{γ\boldsymbol{\gamma}}}}_{k})]^{-1}.

Appendix E Closed-Form Updates for GLMs

The PG-SVI iteration can be written as follows:

Using a similar derivation to the GP model, we can show that the following updates will give us the solution:

where \mbox{\mbox{K\mathbf{K}}}=\mbox{\mbox{X\mathbf{X}}}\mbox{\mbox{X\mathbf{X}}}^{T} and \widetilde{\mbox{\mbox{m\mathbf{m}}}}_{k}:=\mbox{\mbox{X\mathbf{X}}}^{T}\mbox{\mbox{m\mathbf{m}}}.

Appendix F Description of the Dataset for Binary GP Classification

Appendix G Description of Algorithms for Binary GP Classification

We give implementation details of all the algorithms used for binary GP- classification experiment. For all methods, we compute a stochastic estimate of the gradient by using a mini-batch size of 5, 5, and 20 for the three datasets: Sonar, Ionosphere, and USPS-3vs5 respectively. Similarly, the number of MC samples used are 2000, 500, and 2000.

For GD, SGD, and all the adaptive methods, \mbox{\mbox{λ\boldsymbol{\lambda}}}:=\{\mbox{\mbox{m\mathbf{m}}},\mbox{\mbox{L\mathbf{L}}}\} where L\mathbf{L} is the Cholesky factor of V\mathbf{V}. The algorithmic parameters of these methods is given in Table 1. Below, we give details of their updates.

For the GD method, we use the following update:

For the SGD method, we use a stochastic gradient, instead of the exact gradient:

where αk=(k+1)κ\alpha_{k}=(k+1)^{-\kappa} is the step-size and \mbox{\mbox{g\mathbf{g}}}_{k}:=-\widehat{\nabla}\underline{\mathcal{L}}(\mbox{\mbox{λ\boldsymbol{\lambda}}}_{k}).

We use the following updates for ADAGRAD:

where α0\alpha_{0} is a fixed step-size and ϵ\epsilon is a small constant used to avoid numerical errors.

where α0\alpha_{0} is a fixed step-size and ρ\rho is the decay factor.

We use the following updates for ADADELTA:

where again α0\alpha_{0} is a fixed step-size, and ρ\rho is the decay factor.

Finally, the updates for ADAM are shown below:

where α0\alpha_{0} is a fixed step-size and ρμ,ρs\rho_{\mu},\rho_{s} are decay factors.