Learning deep kernels for exponential family densities
Li Wenliang, Danica J. Sutherland, Heiko Strathmann, Arthur Gretton
Introduction
Density estimation is a foundational problem in statistics and machine learning (Devroye & Györfi, 1985; Wasserman, 2006), lying at the core of both supervised and unsupervised machine learning problems. Classical techniques such as kernel density estimation, however, struggle to exploit the structure inherent to complex problems, and thus can require unreasonably large sample sizes for adequate fits. For instance, assuming only twice-differentiable densities, the risk of density estimation with samples in dimensions scales as (Wasserman, 2006, Section 6.5).
One promising approach for incorporating this necessary structure is the kernel exponential family (Canu & Smola, 2006; Fukumizu, 2009; Sriperumbudur et al., 2017). This model allows for any log-density which is suitably smooth under a given kernel, i.e. any function in the corresponding reproducing kernel Hilbert space. Choosing a finite-dimensional kernel recovers any classical exponential family, but when the kernel is sufficiently powerful the class becomes very rich: dense in the family of continuous probability densities on compact domains in KL, TV, Hellinger, and distances (Sriperumbudur et al., 2017, Corollary 2). The normalization constant is not available in closed form, making fitting by maximum likelihood difficult, but the alternative technique of score matching (Hyvärinen, 2005) allows for practical usage with theoretical convergence guarantees (Sriperumbudur et al., 2017).
The choice of kernel directly corresponds to a smoothness assumption on the model, allowing one to design a kernel corresponding to prior knowledge about the target density. Yet explicitly deciding upon a kernel to incorporate that knowledge can be complicated. Indeed, previous applications of the kernel exponential family model have exclusively employed simple kernels, such as the Gaussian, with a small number of parameters (e.g. the length scale) chosen by heuristics or cross-validation (e.g. Sasaki et al., 2014; Strathmann et al., 2015). These kernels are typically spatially invariant, corresponding to a uniform smoothness assumption across the domain. Although such kernels are sufficient for consistency in the infinite-sample limit, the induced models can fail in practice on finite datasets, especially if the data takes differently-scaled shapes in different parts of the space. Figure 1 (left) illustrates this problem when fitting a simple mixture of Gaussians. Here there are two “correct” bandwidths, one for the broad mode and one for the narrow mode. A translation-invariant kernel must pick a single one, e.g. an average between the two, and any choice will yield a poor fit on at least part of the density.
In this work, we propose to learn the kernel of an exponential family directly from data. We can then achieve far more than simply tuning a length scale, instead learning location-dependent kernels that adapt to the underlying shape and smoothness of the target density. We use kernels of the form
where the deep network extracts features of the input and is a simple kernel (e.g. a Gaussian) on those features. These types of kernels have seen success in supervised learning (Wilson et al., 2016; Jean et al., 2018) and critic functions for implicit generative models (Li et al., 2017; Bellemare et al., 2017; Bińkowski et al., 2018; Arbel et al., 2018), among other settings. We call the resulting model a deep kernel exponential family (DKEF).
We can train both kernel parameters (including all the weights of the deep network) and, unusually, even regularization parameters directly on the data, in a form of meta-learning. Normally, directly optimizing regularization parameters would always yield 0, since their beneficial effect in preventing overfitting is by definition not seen on the training set. Here, though, we can exploit the closed-form fit of the kernel exponential family to optimize a “held-out” score (Section 3). Figure 1 (right) demonstrates the success of this model on the same mixture of Gaussians; here the learned, location-dependent kernel gives a much better fit.
We compare the results of our new model to recent general-purpose deep density estimators, primarily autoregressive models (Uria et al., 2013; Germain et al., 2015; van den Oord et al., 2016) and normalizing flows (Jimenez Rezende & Mohamed, 2015; Dinh et al., 2017; Papamakarios et al., 2017). These models learn deep networks with structures designed to compute normalized densities, and are fit via maximum likelihood. We explore the strengths and limitations of both deep likelihood models and deep kernel exponential families on a variety of datasets, including artificial data designed to illustrate scenarios where certain surprising problems arise, as well as benchmark datasets used previously in the literature. The models fit by maximum likelihood typically give somewhat higher likelihoods, whereas the deep kernel exponential family generally better fits the shape of the distribution.
Background
Under mild regularity conditions, this is equal to
up to an additive constant depending only on , which can be ignored during training. Here denotes . We can estimate (3) with :
Kernel exponential families
Using a simple finite-dimensional , we can recover any standard exponential family, e.g. normal, gamma, or Poisson; if is sufficiently rich, the family can approximate any continuous distribution with tails like arbitrarily well (Sriperumbudur et al., 2017, Example 1 and Corollary 2).
Sutherland et al. (2018) used a small for numerical stability but primarily regularized with . As we change , however, changes meaning, and we found empirically that this regularizer tends to harm the fit. The term was recommended by Kingma & LeCun (2010), encouraging the learned log-density to be smooth without much extra computation; it provides some empirical benefit in our context. Given , , and , 3 (Appendix B) shows we can find the optimal by solving an linear system in time: the which minimizes is
Fitting Deep Kernels
All previous applications of score matching in the kernel exponential family of which we are aware (e.g. Sriperumbudur et al., 2017; Strathmann et al., 2015; Sun et al., 2015; Sutherland et al., 2018) have used kernels of the form , with kernel parameters and regularization weights either fixed a priori or selected via cross-validation. This simple form allows the various kernel derivatives required in (6) to be easily computed by hand, and the small number of parameters makes grid search adequate for model selection. But, as discussed in Section 1, these simple kernels are insufficient for complex datasets. Thus we wish to use a richer class of kernels , with a large number of parameters – in particular, kernels defined by a neural network. This prohibits model selection via simple grid search.
One could attempt to directly minimize jointly in the kernel parameters , the model parameters , and perhaps the inducing points . Consider, however, the case where we simply use a Gaussian kernel and . Then we can achieve arbitrarily good values of (3) by taking , drastically overfitting to the training set .
Solving for and computing the loss (4) require matrices of kernel second derivatives, but current deep learning-oriented automatic differentiation systems are not optimized for evaluating tensor-valued higher-order derivatives at once. We therefore implement backpropagation to compute , , and of (6) as TensorFlow operations (Abadi et al., 2015) to obtain the scalar loss , and used TensorFlow’s automatic differentiation only to optimize , , , and parameters.
Backpropagation to find these second derivatives requires explicitly computing the Hessians of intermediate layers of the network, which becomes quite expensive as the model grows; this limits the size of kernels that our model can use. A more efficient implementation based on Hessian-vector products might be possible in an automatic differentiation system with better support for matrix derivatives.
Kernel architecture
We will choose our kernel as a mixture of Gaussian kernels with length scales , taking in features of the data extracted by a network :
Combining components makes it easier to account for both short-range and long-range dependencies. We constrain to ensure a valid kernel, and for simplicity. The networks are made of fully connected layers of width . For , we found that adding a skip connection from data directly to the top layer speeds up learning. A softplus nonlinearity, , ensures that the model is twice-differentiable so (3) is well-defined.
1 Behavior on Mixtures
If all modes are connected by regions of positive density, then the log density gradient in between components will determine their relative weight, and indeed score matching is then consistent. But when is nearly zero between two dense components, so that there are no or few samples in between, score matching will generally have insufficient evidence to weight nearly-separate components.
4 (Appendix C) studies the the kernel exponential family in this case. For two components that are completely separated according to , (6) fits each as it would if given only that component, except that the effective is scaled: smaller components are regularized more.
Section C.1 studies a simplified case where the kernel length scale is far wider than the component; then the density ratio between components, which should be , is approximately . Depending on the value of , this ratio will often either be quite extreme, or nearly . It is unclear, however, how well this result generalizes to other settings.
A heuristic workaround when disjoint components are suspected is as follows: run a clustering algorithm to identify disjoint components, separately fit a model to each cluster, then weight each model according to its sample count. When the components are well-separated, this clustering is straightforward, but it may be difficult in high-dimensional cases when samples are sparse but not fully separated.
2 Model Evaluation
As all our models are twice-differentiable, we also compare the score-matching loss (4) on held-out test data. A lower score-matching loss implies a smaller Fisher divergence between model and data distributions.
Finally, we compare test log-likelihoods, using importance sampling estimates of the normalizing constant :
(Proof in Appendix D.) We can find , because we propose from , and thus we can effectively estimate the bound (Section D.1). This estimate of the upper bound is itself biased upwards (6), so it is likely, though not guaranteed, that the estimate overstates the amount of bias.
3 Previous Attempts at Deep Score Matching
The scalar in brackets is fit analytically, so is determined almost entirely by the network plus .
Saremi et al. (2018) recently also attempted parameterizing the unnormalizing log-density as a deep network, using an approximation called Parzen score matching. This approximation requires a global constant bandwidth to define the Parzen window size for the loss function, fit to the dataset before learning the model. This is likely only sensible on datasets for which simple fixed-bandwidth kernel density estimation is appropriate; on more complex datasets, the loss may be poorly motivated. It also leads to substantial oversmoothing visible in their results. As they did not provide code for their method, we do not compare to it empirically.
Experimental Results
In our experiments, we compare to several alternative methods. The first group are all fit by maximum likelihood, and broadly fall into (at least) one of two categories: autoregressive models decompose and learn a parametric density model for each of these conditionals. Normalizing flows instead apply a series of invertible transformations to some simple initial density, say standard normal, and then compute the density of the overall model via the Jacobian of the transformation. We use implementationsgithub.com/gpapamak/maf of the following several models in these categories by Papamakarios et al. (2017):
MADE (Germain et al., 2015) masks the connections of an autoencoder so it is autoregressive. We use two hidden layers, and each conditional a Gaussian. MADE-MOG is the same but with each conditional a mixture of 10 Gaussians.
Real NVP (Dinh et al., 2017) is a normalizing flow; we use a general-purpose form for non-image datasets.
MAF (Papamakarios et al., 2017). A combination of a normalizing flow and MADE, where the base density is modeled by MADE, with 5 autoregressive layers. MAF-MOG instead models the base density by MADE-MOG.
For the models above, we use layers of width 30 for experiments on synthetic data, and 100 for benchmark datasets. Larger values did not improve performance.
KCEF (Arbel & Gretton, 2018). Inspired by autoregressive models, the density is modeled by a cascade of kernel conditional exponential family distributions, fit by score matching with Gaussian kernels.github.com/MichaelArbel/KCEF
DKEF. On synthetic datasets, we consider four variants of our model with one kernel component, . KEF-G refers to the model using a Gaussian kernel with a learned bandwidth. DKEF-G-15 has the kernel (7), with layers of width . DKEF-G-50 is the same with . To investigate whether the top Gaussian kernel helps performance, we also train DKEF-L-50, whose kernel is , where has . To compare with the architecture of Kingma & LeCun (2010), DKEF-L-50-1 has the same architecture as DKEF-L-50 except that we add an extra layer with a single neuron, and use . In all experiments, , with . On benchmark datasets, we use DKEF-G-50 and KEF-G with three kernel components, . Code for DKEF is at github.com/kevin-w-li/deep-kexpfam.
We first demonstrate the behavior of the models on several two-dimensional synthetic datasets Funnel, Banana, Ring, Square, Cosine, Mixture of Gaussians (MoG) and Mixture of Rings (MoR). Together, they cover a range of geometric complexities and multimodality.
We visualize the fit of various methods by showing the log density function in Figure 2. For each model fit on each distribution, we report the normalized log likelihood (left) and Fisher divergence (right). In general, the kernel score matching methods find cleaner boundaries of the distributions, and our main model KDEF-G-15 produces the lowest Fisher divergence on many of the synthetic datasets while maintaining high likelihoods.
Among the kernel exponential families, DKEF-G outperformed previous versions where ordinary Gaussian kernels are used for either joint (KEF-G) or autoregressive (KCEF) modeling. DKEF-G-50 does not substantially improve over DKEF-G-15; we omit it for space. We can gain additional insights into the model by looking at the shape of the learned kernel, shown by the colored lines; the kernels do indeed adapt to the local geometry.
DKEF-L-50 and DKEF-L-50-1 show good performance when the target density has simple geometries, but had trouble in more complex cases, even with much larger networks than used by DKEF-G-15. It seems that a Gaussian kernel with inducing points provides much stronger representational features than a using linear kernel and/or a single inducing point. A large enough network would likely be able to perform the task well, but, using currently available software, the second derivatives in the score matching loss limit our ability to use very large networks. A similar phenomenon was observed by Bińkowski et al. (2018) in the context of GAN critics, where combining some analytical RKHS optimization with deep networks allowed much smaller networks to work well.
As expected, models fit by DKEFs generally have smaller Fisher divergences than likelihood-based methods. For Funnel and Banana, the true densities are simple transformations of Gaussians, and the normalizing flow models perform relatively well. But on Ring, Square, and Cosine, the shape of the learned distribution by likelihood-based methods exhibits noticeable artifacts, especially at the boundaries. These artifacts, particularly the “breaks” in Ring, may be caused by a normalizing flow’s need to be invertible and smooth. The shape learned by DKEF-G-15 is much cleaner.
On multimodal distributions with disjoint components, likelihood-based and score matching-based methods show interesting failure modes. The likelihood-based models often connect separated modes with a “bridge”, even for MADE-MOG and MAF-MOG which use explicit mixtures. On the other hand, DKEF is able to find the shapes of the components, but the weighting between modes is unstable. As suggested in Section 3.1, we also fit mixtures of all models (except KCEF) on a partition of MoR found by spectral clustering (Shi & Malik, 2000); DKEF-G-15 produced an excellent fit.
Another challenge we observed in our experiments is that the estimator of the objective function, , tends to be more noisy as the model fit improves. This happens particularly on datasets where there are “sharp” features, such as Square (see Figure 4 in Section E.1), where the model’s curvature becomes extreme at some points. This can cause higher variance in the gradients of the parameters, and more difficulty in optimization.
2 Results on Benchmark Datasets
Following recent work on density estimation (Uria et al., 2013; Arbel & Gretton, 2018; Papamakarios et al., 2017; Germain et al., 2015), we trained DKEF and the likelihood-based models on five UCI datasets (Dheeru & Karra Taniskidou, 2017); in particular, we used RedWine, WhiteWine, Parkinson, HepMass, and MiniBoone. All performances were measured on held-out test sets. We did not run KCEF due to its computational expense. Section E.2 gives further details.
Results for models with a fixed Gaussian were similar (Figure 6, in Section E.2).
Discussion
Learning deep kernels helps make the kernel exponential family practical for large, complex datasets of moderate dimension. We can exploit the closed-form fit of the vector to optimize kernel and even regularization parameters using a “held-out” loss, in a particularly convenient instance of meta-learning. We are thus able to find smoothness assumptions that fit our particular data, rather than arbitrarily choosing them a priori.
Computational expense makes score matching with deep kernels difficult to scale to models with large kernel networks, limiting the dimensionality of possible targets. Combining with the kernel conditional exponential family might help alleviate that problem by splitting the model up into several separate but marginally complex components. The kernel exponential family, and score matching in general, also struggles to correctly balance datasets with nearly-disjoint components, but it seems to generally learn density shapes better than maximum likelihood-based deep approaches.
References
Appendix A DKEFs can be normalized
Consider the kernel , where is a kernel such that and a function such that . Let , where is any scalar and is a product of independent generalized Gaussian densities, with each :
(For example, for strictly positive definite could be achieved with and the Cholesky factorization of .) Then, for any function in the RKHS corresponding to ,
First, we have that and
defining , .
Let , and let be the normalizing constant of , . Then
We can now show that each of these expectations is finite: letting ,
for any . The first integral is clearly finite. Picking , so that for , gives that
The condition on holds for any given by a deep network with Lipschitz activation functions, such as the softplus function we use in this work. The condition on also holds for a linear kernel (where , ), any translation-invariant kernel (, ), or mixtures thereof. If is bounded, the integral is finite for any function .
The given proof would not hold for a quadratic , which has been used previously in the literature; indeed, it is clearly possible for such an to be unnormalizable.
Appendix B Finding the optimal 𝜶𝜶{\bm{\alpha}}
We will show a slightly more general result than we need, also allowing for an penalty. This result is related to Lemma 4 of Sutherland et al. (2018), but is more elementary and specialized to our particular needs while also allowing for more types of regularizers.
For fixed , , and , as long as then the optimal is
We will show that the loss is quadratic in . Note that
The term is of the same form, but with second derivatives:
Because and , , are all positive semidefinite, the matrix in parentheses is strictly positive definite, and the claimed result follows directly from standard vector calculus. ∎
Appendix C Behavior on mixtures
Let , where , , . Also suppose that the inducing points are partitioned as , with . Further let the kernel be such that when , for , with its first and second derivatives also zero. Then the kernel exponential family solution of 3 is
Let , be the , of 3 when using only and . Then, because the kernel values and derivatives are zero across components, if and are from separate components then
as at least one of the kernel derivatives will be zero for each term of the sum. When and are from the same component, the total will be the same except that is bigger, giving
is of the same form and factorizes in the same way. does not scale:
Each term in the sum for which is in a different component than will be zero, giving . Thus becomes
Thus the fits for the components are essentially added together, except that each component uses a different and ; smaller components are regularized more. , interestingly, is unscaled.
It is difficult in general to tell how two components will be weighted relative to one another; the problem is essentially equivalent to computing the overall normalizing constant of a fit. However, we can gain some insight by analyzing a greatly simplified case, in Section C.1.
Consider, for the sake of our study of mixture fits, one of the simplest possible situations for a kernel exponential family: , with a kernel for , so that for all sampled from . Let be approximately uniform, for , so that . Also assume that , but is fixed. Assume that , and refer to as simply . Then we have
Because for any near the data in this setup, it’s sufficient to just consider a single . In that case,
Thus, if we attempt to fit the mixture with , we are approximately in the regime of 4 and so the ratio between the two components in the fit is
If , the density ratio is correctly 1. If we further assume that , so that the denominator is dominated by the last term, then the ratio becomes approximately
Thus, depending on the size of , the ratio will usually either remain too close to or become too extreme as changes; only in a very narrow parameter range is it approximately correct.
Appendix D Upper bound on normalizer bias
Recall the importance sampling setup of Section 3.2:
First note that the following form of a Taylor expansion holds identically:
We can thus write the bias as the following, where , is the distribution of , and :
is convex in and thus its supremum is , with the term at being necessarily larger as long as .
Picking gives the desired bound on via Lemma 5.
Lemma 1 of Liao & Berg (2018) shows that since is concave, is decreasing in . Thus , giving the claim. ∎
Let and be such that and , with . Then
Let denote the number of samples of which are smaller than , so that samples are at least . Then we have
is distributed binomially with probability of success at most , so applying Theorem 1 of Arratia & Gordon (1989) yields
Let , so corresponds to , and corresponds to . Then
We can evaluate . For , if and only if , where
Clearly and . But notice that
so that is strictly decreasing on , and strictly increasing on . Thus for all , and for all . The claim follows by Jensen’s inequality. ∎
For a kernel such as (7) bounded in $a:=\exp\left(\sum_{m=1}^{M}\min(\alpha_{m},0)\right){\textnormal{r}}_{u}$.
For large , essentially any will make the second term practically zero, so we select as slightly less than the 40th percentile of an initial sample of , and confirm a high-probability () Hoeffding upper bound on with another sample. (We use as times the estimate of the 40th percentile, to avoid ties.) We use samples for each of these.
To finally estimate the bound, we estimate on yet another independent sample, again usually of size but for MiniBoone.
Crucially, the function is convex (6); because the variance is unbiased, our estimate of the bias bound is itself biased upwards. As 1’s bound is also not tight, our estimate thus likely overstates the actual amount of bias.
Appendix E Additional experimental details
For each synthetic distribution, we sample random points from the distribution, of which are used for testing; of the rest, 90% () are used for training, and 10% (900) are used for validation. Training was early stopped when validation cost does not improve for 200 minibatches. The current implementation of KCEF does not include a Nyström approximation, and trains via full-batch L-BFGS-B, so we down-sampled the training data to 1000 points. We used the Adam optimizer (Kingma & Ba, 2015) for all other models. For MADE, RealNVP, and MAF, we used minibatches of size 200 and the learning rate was For KEF-G and DKEF, we used 200 inducing points, used , and learning rate . The same parameters are used for each component for mixture models trained on MoR.
To show that learning is stable, we ran the experiments on 5 random draws of training, validation and test sets from the synthetic distributions, trained KDEF initialized using 5 random seeds and calculated validation score at each iteration until convergence in the first phase of training (before optimizing for ’s). The traces are shown in Figure 4.
The same data for benchmark datasets are shown in Figure 5. There is no overfitting except for the small Redwine dataset. Runs on Hepmass and Miniboone do not seem to fully converge, despite having met the early stopping criterion.
E.2 Benchmark datasets
RedWine and WhiteWine are quantized, and thus problematic for modeling with continuous densities; we added to each dimension uniform noise with support equal to the median distances between two adjacent values. For HepMass and MiniBoone, we removed ill-conditioned dimensions as did Papamakarios et al. (2017). For all datasets except HepMass, 10% of the entire data was used as testing, and 10% of the remaining was used for validation with an upper limit of due to time cost of validation at each iteration. For HepMass, we used the same splitting as done in Papamakarios et al. (2017) and with the same upper limit on validation set. The data is then whitened before fitting and the whitening matrix was computed on at most data points.
Likelihood-based models
We set MADE, MADE-MOG, each autoregressive layer of MAF and each scaling and shifting layers of real NVP to have two hidden layers of 100 neurons. For real NVP, MAF and MAF-MOG, five autoregressive layers were used; MAF-MOG and MADE-MOG has a mixture of 10 Gaussians for each conditional distribution. Learning rate was The size of a minibatch is 200.
Deep kernel exponential family
We set the DKEF model to have three kernels (), each a Gaussian on features of a 3-layer network with 30 neurons in each layer. There was also a skip-layer connection from data directly to the last layer which accelerated learning. Length scales were initialized to 1.0, 3.3 and 10.0. Each was initialized to 0.001. The weights of the network were initialized from a Gaussian distribution with standard deviation equal to . We also optimized the inducing points which were initialized with random draws from training data. The number of inducing points , and . The learning rate was . We found that our initialization on the weight std and ’s are importance for fast and stable learning; other parameters did not significantly change the results under similar computational budget (time and memory).
FSSD tests were conducted using 100 points selected at random from the test set, with added normal noise of standard deviation , using code provided by the authors.
We estimated with samples proposed from , as in Section 3.2, and estimated the bias as in Section D.1.
We added independent noise to the data in training. This is similar to the regularization applied by (Kingma & LeCun, 2010; Saremi et al., 2018), except that the noise is added directly to the data instead of the model.
For all models, we stopped training when the objective ((4) or log likelihood) did not improve for 200 minibatches. We also set a time budget of 3 hours on each model; this was fully spent by MAF, MOG-MAF and Real NVP on HepMass. We found that MOG-MADE had unstable runs on some datasets; out of 15 runs on each dataset, 7 on WhiteWine, 4 on Parkinsons and 9 on MiniBoone produced invalid log likelihoods. These results were discarded in Figure 3 log likelihood panels.
The DKEF in our main results (Figure 3) has an adaptive which is a generalized normal distribution. We also trained DKEF with being an isotropic multivariate normal of standard deviation 2.0. These results Figure 6 are similar to Figure 3 but exhibit much smaller bias estimates in the log normalizer for RedWine and Parkinsons.