Discriminative Embeddings of Latent Variable Models for Structured Data

Hanjun Dai, Bo Dai, Le Song

Introduction

Structured data, such as sequences, trees and graphs, are prevalent in a number of interdisciplinary areas such as protein design, genomic sequence analysis, and drug design (Schölkopf et al., 2004). To learn from such complex data, we have to first transform such data explicitly or implicitly into some vectorial representations, and then apply machine learning algorithms in the resulting vector space. So far kernel methods have emerged as one of the most effective tools for dealing with structured data, and have achieved the state-of-the-art classification and regression results in many sequence (Leslie et al., 2002a; Vishwanathan & Smola, 2003) and graph datasets (Gärtner et al., 2003; Borgwardt, 2007).

The success of kernel methods on structured data relies crucially on the design of kernel functions — positive semidefinite similarity measures between pairs of data points (Schölkopf & Smola, 2002). By designing a kernel function, we have implicitly chosen a corresponding feature representation for each data point which can potentially has infinite dimensions. Later learning algorithms for various tasks and with potentially very different nature can then work exclusively on these pairwise kernel values without the need to access the original data points. Such modular structure of kernel methods has been very powerful, making them the most elegant and convenient methods to deal with structured data. Thus designing kernel for different structured objects, such as strings, trees and graphs, has always been an important subject in the kernel community. However, in the big data era, this modular framework has also limited kernel methods in terms of their ability to scale up to millions of data points, and exploit discriminative information to learn feature representations.

For instance, a class of kernels are designed based on the idea of “bag of structures” (BOS), where each structured data point is represented as a vector of counts for elementary structures. The spectrum kernel and variants for strings (Leslie et al., 2002a), subtree kernel (Ramon & Gärtner, 2003), graphlet kernel (Shervashidze et al., 2009) and Weisfeiler-lehman graph kernel (Shervashidze et al., 2011) all follow this design principle. In other words, the feature representations of these kernels are fixed before learning, with each dimension corresponding to a substructure, independent of the supervised learning tasks at hand. Since there are many unique substructures which may or may not be useful for the learning tasks, the explicit feature space of such kernels typically has very high dimensions. Subsequently algorithms dealing with the pairwise kernel values have to work with a big kernel matrix squared in the number of data points. The square dependency on the number of data points largely limits these BOS kernels to datasets of size just thousands.

A second class of kernels are based on the ingenious idea of exploiting the ability of probabilistic graphical models (GM) in describing noisy and structured data to design kernels. For instance, one can use hidden Markov models for sequence data, and use pairwise Markov random fields for graph data. The Fisher kernel (Jaakkola & Haussler, 1999) and probability product kernel (Jebara et al., 2004) are two representative instances within the family. The former method first fits a common generative model to the entire dataset, and then uses the empirical Fisher information matrix and the Fisher score of each data point to define the kernel; The latter method instead fits a different generative model for each data point, and then uses inner products between distributions to define the kernel. Typically the parameterization of these GM kernels are chosen before hand. Although the process of fitting generative models allow the kernels to adapt to the geometry of the input data, the resulting feature representations are still independent of the discriminative task at hand. Furthermore, the extra step of fitting generative models to data can be a challenging computation and estimation task by itself, especially in the presence of latent variables. Very often in practice, one finds that BOS kernels are easier to deploy than GM kernels, although the latter is supposed to capture the additional geometry and uncertainty information of data.

In this paper, we wish to revisit the idea of using graphical models for kernel or feature space design, with the goal of scaling up kernel methods for structured data to millions of data points, and allowing the kernel to learn the feature representation from label information. Our idea is to model each structured data point as a latent variable model, then embed the graphical model into feature spaces (Smola et al., 2007; Song et al., 2009), and use inner product in the embedding space to define kernels. Instead of fixing a feature or embedding space beforehand, we will also learn the feature space by directly minimizing the empirical loss defined by the label information.

The resulting embedding algorithm, structure2vec, runs in a scheme similar to graphical model inference procedures, such as mean field and belief propagation. Instead of performing probabilistic operations (such as sum, product and renormalization), the algorithm performs nonlinear function mappings in each step, inspired by kernel message passing algorithm in Song et al. (2010, 2011). Furthermore, structure2vec is also different from the kernel message passing algorithm in several aspects. First, structure2vec deals with a different scenario, i.e., learning similarity measure for structured data. Second, structure2vec learns the nonlinear mappings using the discriminative information. And third, a variant of structure2vec can run in a mean field update fashion, different from message passing algorithms.

Besides the above novel aspects, structure2vec is also very scalable in terms of both memory and computation requirements. First, it uses a small and explicit feature map for the nonlinear feature space, and avoids the need for keeping the kernel matrix. This makes the subsequent classifiers or regressors order of magnitude smaller compared to other methods. Second, the nonlinear function mapping in structure2vec can be learned using stochastic gradient descent, allowing it to handle extremely large scale datasets.

Finally in experiments, we show that structure2vec compares favorably to other kernel methods in terms of classification accuracy in medium scale sequence and graph benchmark datasets including SCOP and NCI. Furthermore, structure2vec can handle extremely large data set, such as the 2.3 million molecule dataset from Harvard Clean Energy Project, run 2 times faster, produce model $10,000$ times smaller and achieve state-of-the-art accuracy. These strong empirical results suggest that the graphical models, theoretically well-grounded methods for capturing structure in data, combined with embedding techniques and discriminative training can significantly improve the performance in many large scale real-world structured data classification and regression problems.

Backgrounds

Kernels for Structured Data. Each kernel function will correspond to some feature map $\phi(\chi)$ , where the kernel function can be expressed as the inner product between feature maps, i.e., $k(\chi,\chi^{\prime})=\left\langle\phi(\chi),\phi(\chi^{\prime})\right\rangle$ . For structured input domain, one can design kernels using counts on substructures. For instance, the spectrum kernel for two sequences $\chi$ and $\chi^{\prime}$ is defined as (Leslie et al., 2002a)

where ${\mathcal{S}}$ is the set of possible subsequences, $\#(s\in x)$ counts the number occurrence of subsequence $s$ in $x$ . In this case, the feature map $\phi(\chi)=(\#(s_{1}\in\chi),\#(s_{2}\in\chi),...)^{\top}$ corresponds to a vector of dimension $\left|{\mathcal{S}}\right|$ . Similarly, the graphlet kernel (Shervashidze et al., 2009) for two graphs $\chi$ and $\chi^{\prime}$ can also be defined as (1), but ${\mathcal{S}}$ is now the set of possible subgraphs, and $\#(s\in\chi)$ counts the number occurrence of subgraphs. We refer to this class of kernels as “bag of structures” (BOS) kernel.

Hilbert Space Embedding of Distributions. Hilbert space embeddings of distributions are mappings of distributions into potentially infinite dimensional feature spaces (Smola et al., 2007),

Model for a Structured Data Point

Without loss of generality, we assume each structured data point $\chi$ is a graph, with a set of nodes $\mathcal{V}=\{1,\ldots,V\}$ and a set of edges $\mathcal{E}$ . We will use $x_{i}$ to denote the value of the attribute for node $i$ . We note the node attributes are different from the label of the entire data point. For instance, each atom in a molecule will correspond to a node in the graph, and the node attribute will be the atomic number, while the label for the entire molecule can be whether the molecule is a good drug or not. Other structures, such as sequences and trees, can be viewed as special cases of general graphs.

We will model the structured data point $\chi$ as an instance drawn from a graphical model. More specifically, we will model the label of each node in the graph with a variable $X_{i}$ , and furthermore, associate an additional hidden variable $H_{i}$ with it. Then we will define a pairwise Markov random field on these collection of random variables

where $\Psi$ and $\Phi$ are nonnegative node and edge potentials respectively. In this model, the variables are connected according to the graph structure of the input data point. That is to say, we use the graph structure of the input data directly as the conditional independence structure of an undirected graphical model. Figure 1 illustrates two concrete examples in constructing the graphical models for strings and graphs. One can design more complicated graphical models which go beyond pairwise Markov random fields, and consider longer range interactions with potentials involving more variables. We will focus on pairwise Markov random fields for simplicity of representation.

We note that such a graphical model is built for each individual data point, and the conditional independence structures of two graphical models can be different if the two data points $\chi$ and $\chi^{\prime}$ are different. Furthermore, we do not observe the value for the hidden variables $\left\{H_{i}\right\}$ , which makes the learning of the graphical model potentials $\Phi$ and $\Psi$ even more difficult. Thus, we will not pursue the standard route of maximum likelihood estimation, and rather we will consider the sequence of computations needed when we try to embed the posterior of $\left\{H_{i}\right\}$ into a feature space.

Embedding Latent Variable Models

We will embed the posterior marginal $p(H_{i}|\left\{x_{i}\right\})$ of a hidden variable using a feature map $\phi(H_{i})$ , i.e.,

Only when the graph structure is a tree, exact computation can be carried out efficiently via message passing (Pearl, 1988). Thus in the general case, approximate inference algorithms, e.g., mean field inference and loopy belief propagation (BP), are developed. In many applications, however, these variational inference algorithms exhibit excellent empirical performance (Murphy et al., 1999). Several theoretical studies have also provided insight into the approximations made by loopy BP, partially justifying its application to graphs with cycles (Wainwright & Jordan, 2008; Yedidia et al., 2001a).

In the following subsection, we will explain the embedding of mean field and loopy BP. The embedding of other variational inference methods, e.g., double-loop BP, damped BP, tree-reweighted BP, and generalized BP will be explained in Appendix D. We show that the iterative update steps in these algorithms, which are essentially minimizing approximations to the exact free energy, can be simply viewed as function mappings of the embedded marginals using the alternative view in (3) and (4).

The vanilla mean-field inference tries to approximate $p(\left\{H_{i}\right\}|\left\{x_{i}\right\})$ with a product of independent density components $p(\left\{H_{i}\right\}|\left\{x_{i}\right\})\approx\prod_{i\in\mathcal{V}}q_{i}(h_{i})$ where each $q_{i}(h_{i})\geqslant 0$ is a valid density, such that $\int_{\mathcal{H}}q_{i}(h_{i})dh_{i}=1$ . Furthermore, these density components are found by minimizing the following variational free energy (Wainwright & Jordan, 2008),

One can show that the solution to the above optimization problem needs to satisfy the following fixed point equations for all $i\in\mathcal{V}$

where $c_{i}^{\prime}=c_{i}+\sum_{j\in\mathcal{N}(i)}\int q_{j}(h_{j})\log\Phi(h_{j},x_{j})dh_{j}$ . Here $\mathcal{N}(i)$ are the set of neighbors of variable $H_{i}$ in the graphical model, and $c_{i}$ is a constant. The fixed point equations in (4.1) imply that $q_{i}(h_{i})$ is a functional of a set of neighboring marginals $\left\{q_{j}\right\}_{j\in\mathcal{N}(i)}$ , i.e.,

If for each marginal $q_{i}$ , we have an injective embedding

where $\sigma(\cdot):=\max\{0,\cdot\}$ is a rectified linear unit applied elementwisely to its argument, and $\mathbf{W}=\{W_{1},W_{2}\}$ . The number of the rows in $\mathbf{W}$ equals to $d$ . With such parameterization, the mean field iterative update in the embedding space can be carried out as Algorithm 1. We could also multiply $\widetilde{\mu}_{i}$ with $V$ to rescale the range of message embeddings if needed. In fact, with or without $V$ , the functions will be the same in terms of the representation power. Specifically, for any $(\mathbf{W},V)$ , we can always find another ‘equivalent’ parameters $(\mathbf{W}^{\prime},I)$ where $\mathbf{W}^{\prime}=\{W_{1},W_{2}V\}$ .

2 Embedding Loopy Belief Propagation

Loopy belief propagation is another variational inference method, which essentially optimizes the Bethe free energy taking pairwise interactions into account (Yedidia et al., 2001b),

subject to pairwise marginal consistency constraints: $\int_{\mathcal{H}}q_{ij}(h_{i},h_{j})dh_{j}=q_{i}(h_{i})$ , $\int_{\mathcal{H}}q_{ij}(h_{i},h_{j})dh_{j}=q_{i}(h_{i})$ , and $\int_{\mathcal{H}}q_{i}(h_{i})dh_{i}=1$ . One can obtain the fixed point condition for the above optimization for all $(i,j)\in\mathcal{E}$ ,

where $m_{ij}(h_{j})$ is the intermediate result called the message from node $i$ to $j$ . Furthermore, $m_{ij}(h_{j})$ is a nonnegative function which can be normalized to a density, and hence can also be embedded.

Similar to the reasoning in the mean field case, the (4.2) implies the messages $m_{ij}(h_{j})$ and marginals $q_{i}(h_{i})$ are functionals of messages from neighbors, i.e.,

With the assumption that there is an injective embedding for each message $\widetilde{\nu}_{ij}=\int\phi(h_{j})m_{ij}(h_{j})dh_{j}$ and for each marginal $\widetilde{\mu}_{i}=\int\phi(h_{i})q_{i}(h_{i})dh_{i}$ , we can apply the reasoning from (3) and (4), and express the messages and marginals from the embedding view,

where $\mathbf{W}=\{W_{1},W_{2},W_{3},W_{4}\}$ are matrices with appropriate sizes. Note that one can use other nonlinear function mappings to parameterize $\widetilde{\mathcal{T}}_{1}$ and $\widetilde{\mathcal{T}}_{2}$ as well. Overall, the loopy BP embedding updates is summarized in Algorithm 2.

With similar strategy as in mean field case, we will learn the parameters in $\widetilde{\mathcal{T}}_{1}$ and $\widetilde{\mathcal{T}}_{2}$ later with supervision signals from the discriminative task.

3 Embedding Other Variational Inference

In fact, there are many other variational inference methods, with different forms of free energies or different optimization algorithms, resulting different message update forms, e.g., double-loop BP (Yuille, 2002), damped BP (Minka, 2001), tree-reweightd BP (Wainwright et al., 2003), and generalized BP (Yedidia et al., 2001b). The proposed embedding method is a general technique which can be tailored to these algorithms. The major difference is the dependences in the messages. For the details of embedding of these algorithms, please refer to Appendix D.

Discriminative Training

Similar to kernel BP (Song et al., 2010, 2011) and kernel EP (Jitkrittum et al., 2015), our current work exploits feature space embedding to reformulate graphical model inference procedures. However, different from the kernel BP and kernel EP, in which the feature spaces are chosen beforehand and the conditional embedding operators are learned locally, our approach will learn both the feature spaces, the transformation $\widetilde{\mathcal{T}}$ , as well as the regressor or classifier for the target values end-to-end using label information.

Note that each data point will have its own graphical model and embedded features due to its individual structure, but the parameters $u$ and $\mathbf{W}$ , are shared across these graphical models.

In the case of $K$ -class classification problem, we denote $z$ is the $1$ -of- $K$ representation of $y$ , i.e., $z\in\{0,1\}^{K}$ , $z^{k}=1$ if $y=k$ , and $z^{i}=0$ , $\forall i\neq k$ . By adopt the softmax loss, we obtain the optimization for embedding parameters and discriminative classifier estimation as,

The same idea can also be generalized to other discriminative tasks with different loss functions. As we can see from the optimization problems (16) and (17), the objective functions are directly related to the corresponding discriminative tasks, and so as to $\mathbf{W}$ and $\mathbf{u}$ . Conceptually, the procedure starts with representing each datum by a graphical model constructed corresponding to its individual structure with sharing potential functions, and then, we embed these graphical models with the same feature mappings. Finally the embedded marginals are aggregated with a prediction function for a discriminative task. The shared potential functions, feature mappings and final prediction functions are all learned together for the ultimate task with supervision signals.

We optimize the objective (16) or (17) with stochastic gradient descent for scalability consideration. However, other optimization algorithms are also applicable, and our method does not depend on this particular choice. The gradients of the parameters $\mathbf{W}$ are calculated recursively similar to recurrent neural network for sequence models. In our case, the recursive structure will correspond the message passing structure. The overall framework is illustrated in Algorithm 3. For details of the gradient calculation, please refer to Appendix F.

Experiments

Below we first compare our method with algorithms using prefixed kernel on string and graph benchmark datasets. Then we focus on Harvard Clean Energy Project dataset which contains 2.3 million samples. We demonstrate that while getting comparable performance on medium sized datasets, we are able to handle millions of samples, and getting much better when more training data are given. The two variants of structure2vec are denoted as DE-MF and DE-LBP, which stands for discriminative embedding using mean field or loopy belief propagation, respectively.

Our algorithms are implemented with C++ and CUDA, and experiments are carried out on clusters equipped with NVIDIA Tesla K20. The code is available on https://github.com/Hanjun-Dai/graphnn.

We compare our algorithm on string benchmark datasets with the kernel method with existing sequence kernels, i.e., the spectrum string kernel (Leslie et al., 2002a), mismatch string kernel (Leslie et al., 2002b) and fisher kernel with HMM generative models (Jaakkola & Haussler, 1999). On graph benchmark datasets, we compare with subtree kernel (Ramon & Gärtner, 2003) (R&G, for short), random walk kernel(Gärtner et al., 2003; Vishwanathan et al., 2010), shortest path kernel (Borgwardt & Kriegel, 2005), graphlet kernel(Shervashidze et al., 2009) and the family of Weisfeiler-Lehman kernels (WL kernel) (Shervashidze et al., 2011). After getting the kernel matrix, we train SVM classifier or regressor on top.

We tune all the methods via cross validation, and report the average performance. Specifically, for structured kernel methods, we tune the degree in $\{1,2,\ldots,10\}$ (for mismatch kernel, we also tune the maximum mismatch length in $\{1,2,3\}$ ) and train SVM classifier (Chang & Lin, 2001) on top, where the trade-off parameter $C$ is also chosen in $\{0.01,0.1,1,10\}$ by cross validation. For fisher kernel that using HMM as generative model, we also tune the number of hidden states assigned to HMM in $\{2,\ldots,20\}$ .

For our methods, we simply use one-hot vector (the vector representation of discrete node attribute) as the embedding for observed nodes, and use a two-layer neural network for the embedding (prediction) of target value. The hidden layer size $b\in\{16,32,64\}$ of neural network, the embedding dimension $d\in\{16,32,64\}$ of hidden variables and the number of iterations $t\in\{1,2,3,4\}$ are tuned via cross validation. We keep the number of parameters small, and use early stopping (Giles, 2001) to avoid overfitting in these small datasets.

Here we do experiments on two string binary classification benchmark datasets. The first one (denoted as SCOP) contains 7329 sequences obtained from SCOP (Structural Classification of Proteins) 1.59 database (Andreeva et al., 2004). Methods are evaluated on the ability to detect members of a target SCOP family (positive test set) belonging to the same SCOP superfamily as the positive training sequences, and no members of the target family are available during training. We use the same 54 target families and the same training/test splits as in remote homology detection (Kuang et al., 2005). The second one is FC and RES dataset (denoted as FC_RES) provided by CRISPR/Cas9 system, on which the task it to tell whether the guide RNA will direct Cas9 to target DNA. There are 5310 guides included in the dataset. Details of this dataset can be found in Doench et al. (2014); Fusi et al. (2015). We use two variants for spectrum string kernel: 1) kmer-single, where the constructed kernel matrix $K_{k}^{(s)}$ only consider patterns of length $k$ ; 2) kmer-concat, where kernel matrix $K^{(c)}=\sum_{i=1}^{k}K_{k}^{(s)}$ . We also find the normalized kernel matrix $K_{k}^{Norm}(x,y)=\frac{K_{k}(x,y)}{\sqrt{K_{k}(x,x)K_{k}(y,y)}}$ helps.

Table 1 reports the mean AUC of different algorithms. We found two variants of structure2vec are consistently better than the string kernels. Also, the improvement in SCOP is more significant than in FC_RES. This is because SCOP is a protein dataset and its alphabet size $|\Sigma|$ is much larger than that of FC_RES, an RNA dataset. Furthermore, the dimension of the explicit features for a k-mer kernel is $O(|\Sigma|^{k})$ , which can make the off-diagonal entries of kernel matrix very small (or even zero) with large alphabet size and $k$ . That’s also the reason why kmer-concat performs better than kmer-single. structure2vec learns a discriminative feature space, rather than prefix it beforehand, and hence does not have this problem.

1.2 Graph Dataset

We test the algorithms on five benchmark datasets for graph kernel: MUTAG, NCI1, NCI109, ENZYMES and D&D. MUTAG (Debnath et al., 1991). NCI1 and NCI109 (Wale et al., 2008) are chemical compounds dataset, while ENZYMES (Borgwardt & Kriegel, 2005) and D&D (Dobson & Doig, 2003) are of proteins. The task is to do multi-class or binary classification. We show the detailed statistics of these datasets in Table 2.

The results of baseline algorithms are taken from Shervashidze et al. (2011) since we use exactly the same setting here. From the accuracy comparison shown in Figure 2, we can see the proposed embedding methods are comparable to the alternative graph kernels, on different graphs with different number of labels, nodes and edges. Also, in dataset D&D which consists of 82 different types of labels, our algorithm performs much better. As reported in Shervashidze et al. (2011), the time required for constructing dictionary for the graph kernel can take up to more than a year of CPU time in this dataset, while our algorithm can learn the discriminative embedding efficiently from structured data directly without the construction of the handcraft dictionary.

2 Harvard Clean Energy Project(CEP) dataset

The Harvard Clean Energy Project (Hachmann et al., 2011) is a theory-driven search for the next generation of organic solar cell materials. One of the most important properties of molecule for this task is the overall efficiency of the energy conversion process in a solar cell, which is determined by the power conversion efficiency (PCE). The Clean Energy Project (CEP) performed expensive simulations for the 2.3 million candidate molecules on IBM’s World Community Grid, in order to get this property value. So using machine learning approach to accurately predict the PCE values is a promising direction for the high throughput screening and discovering new materials.

In this experiment, we randomly select 90% of the data for training, and the rest 10% for testing. This setting is similar to Pyzer-Knapp et al. (2015), except that we use the entire 2.3m dataset here. Since the data is distributed unevenly (see Figure 3), we resampled the training data (but not the test data) to make the algorithm put more emphasis on molecules with higher PCE values, in order to make accurate prediction for promising candidate molecules. Since the traditional kernel methods are not scalable, we make the explicit feature maps for WL subtree kernel by collecting all the molecules and creating dictionary for the feature space. The other graph kernels, like edge kernel and shortest path kernel, are having too large feature dictionary to work with. We use RDKit (Landrum, 2012) to extract features for atoms (nodes) and bonds (edges).

The mean absolute error (MAE) and root mean square error (RMSE) are reported in Table 3. We found utilizing graph information can accurately predict PCE values. Also, our proposed two methods are working equally well. Although WL tree kernel with degree 6 is also working well, it requires $10,000$ times more parameters than structure2vec and runs 2 times slower. The preprocessing needed for WL tree kernel also makes it difficult to use in large datasets.

To understand the effect of the inference embedding in the proposed algorithm framework, we further compare our methods with different number of fixed point iterations in Figure 4. It can see that, higher number of fixed point iterations will lead to faster convergence, though the number of parameters of the model in different settings are the same. The mean field embedding will get much worse result if only one iteration is executed. Compare to the loopy BP case with same setting, the latter one will always have one more round message passing since we need to aggregate the messages from edge to node in the last step. And also, from the quality of prediction we find that, though making slightly higher prediction error for molecules with high PCE values due to insufficient data, the variants of our algorithm are not overfitting the ‘easy’ (i.e., the most popular) range of PCE value.

Conclusion

We propose, structure2vec, an effective and scalable approach for structured data representation based on the idea of embedding latent variable models into feature spaces, and learning such feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. In applications involving millions of data points, we showed that structure2vec runs 2 times faster, produces models $10,000$ times smaller, while at the same time achieving the state-of-the-art predictive performance. structure2vec provides a nice example for the general strategy of combining the strength of graphical models, Hilbert space embedding of distribution and deep learning approach, which we believe will become common in many other learning tasks.

Acknowledgements. This project was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

References

Appendix A More Related Work

Neural network is also a powerful tool on graph structured data. Scarselli et al. (2009) proposed a neural network which generates features by solving a heuristic nonlinear system iteratively, and is learned using Almeida-Pineda algorithm. To guarantee the existence of the solution to the nonlinear system, there are extra requirements for the features generating function. From this perspective, the model in (Li et al., 2015) can be considered as an extension of (Scarselli et al., 2009) where the gated recurrent unit is used for feature generation. Rather than these heuristic models, our model is based on the principled graphical model embedding framework, which results in flexible embedding functions for generating features. Meanwhile, the model can be learned efficiently by traditional stochastic gradient descent.

There are several work transferring locality concept of convolutional neural networks (CNN) from Euclidean domain to graph case, using hierarchical clustering, graph Laplacian (Bruna et al., 2013), or graph Fourier transform (Henaff et al., 2015). These models are still restricted to problems with the same graph structure, which is not suitable for learning with molecules. Mou et al. (2016) proposed a convolution operation on trees, while the locality are defined based on parent-child relations. Duvenaud et al. (2015) used CNN to learn the circulant fingerprints for graphs from end to end. The dictionary of fingerprints are maintained using softmax of subtree feature representations, in order to obtain a differentiable model. If we unroll the steps in Algorithm 3, it can also be viewed as an end to end learning system. However, the structures of the proposed model are deeply rooted in graphical model embedding, from mean field and loopy BP, respectively. Also, since the parameters will be shared across different unrolling steps, we would have more compact model. As will be shown in the experiment section, our model is easy to train, while yielding good generalization ability.

A.2 Comparison with Learning Message Estimator

By recognizing inference as computational expressions, inference machines (Ross et al., 2011) incorporate learning into the messages passing inference for CRFs. More recently, Hershey et al. (2014); Zheng et al. (2015); Lin et al. (2015) designed specific recurrent neural networks and convolutional neural networks for imitating the messages in CRFs. Although these methods share the similarity, i.e., bypassing learning potential function, to the proposed framework, there are significant differences comparing to the proposed framework.

The most important difference lies in the learning setting. In these existing messages learning work (Hershey et al., 2014; Zheng et al., 2015; Lin et al., 2015; Chen et al., 2014), the learning task is still estimating the messages represented graphical models with designed function forms, e.g., RNN or CNN, by maximizing loglikelihood. While in our work, we represented each structured data as a distribution, and the learning task is regression or classification over these distributions. Therefore, we treat the embedded models as samples, and learn the nonlinear mapping for embedding, and regressor or classifier, $f:\mathcal{P}\rightarrow\mathcal{Y}$ , over these distributions jointly, with task-dependent user-specified loss functions.

Another difference is the way in constructing the messages forms, and thus, the neural networks architecture. In the existing work, the neural networks forms are constructed strictly follows the message updates forms (4.1) or (4.2). Due to such restriction, these works only focus on discrete variables with finite values, and is difficult to extend to continuous variables because of the integration. However, by exploiting the embedding point of view, we are able to build the messages with more flexible forms without losing the dependencies. Meanwhile, the difficulty in calculating integration for continuous variables is no longer a problem with the reasoning (3) and (4).

Appendix B Derivation of the Fixed-Point Condition for Mean-Field Inference

In this section, we derive the fixed-point equation for mean-field inference in Section 4. As we introduced, the mean-field inference is indeed minimizing the variational free energy,

where $c_{i}=-\int\prod_{k\notin\mathcal{N}(i)}q_{k}(h_{k})\left(\sum_{k\notin\mathcal{N}(i)}\log\left(\prod_{(k,l)\in\mathcal{E}}\Psi(h_{k},h_{l})\Phi(h_{k},x_{k})\right)\right)\prod_{k\notin\mathcal{N}(i)}dh_{k}$ . Take functional derivatives of $L(\{q_{i}\}_{i=1}^{d})$ w.r.t. $q_{i}(h_{i})$ , and set them to zeros, we obtain the fixed-point condition in Section 4,

This fixed-point condition could be further reduced due to the independence between $h_{i}$ and $x_{j}$ given $h_{j}$ , i.e.,

where $c_{i}^{\prime}=c_{i}+\sum_{j\in\mathcal{N}(i)}\int q_{j}(h_{j})\log\Phi(h_{j},x_{j})dh_{j}$ .

Appendix C Derivation of the Fixed-Point Condition for Loopy BP

The derivation of the fixed-point condition for loopy BP can be found in Yedidia et al. (2001b). However, to keep the paper self-contained, we provide the details here. The objective of loopy BP is

Denote $\lambda_{ij}(h_{j})$ is the multiplier to marginalization constraints $\int_{\mathcal{H}}q_{ij}(h_{i},h_{j})dh_{i}-q_{j}(h_{j})=0$ , the Lagrangian is formed as

with normalization constraints $\int_{\mathcal{H}}q_{i}(h_{i})dh_{i}=1$ . Take functional derivatives of $L(\{q_{ij}\},\{q_{i}\},\{\lambda_{ij}\},\{\lambda_{ji}\})$ with respect to $q_{ij}(h_{i},h_{j})$ and $q_{i}(h_{i})$ , and set them to zero, we have

We set $m_{ij}(h_{j})=\frac{q_{j}(h_{j})}{\Phi(h_{i},x_{i})\exp(\lambda_{ij}(h_{j}))}$ , therefore,

Plug it into $q_{ij}(h_{i},h_{j})$ and $q_{i}(h_{i})$ , we recover the loopy BP update for marginal belief and

The update rule for message $m_{ij}(h_{j})$ can be recovered using the marginal consistency constraints,

Moreover, we also obtain the other important relationship between $m_{ij}(h_{j})$ and $\lambda_{ji}(h_{i})$ by marginal consistency constraint and the definition of $m_{ij}(h_{j})$ ,

Appendix D Embedding Other Variational Inference

The proposed embedding is a general algorithm and can be tailored to other variational inference methods with message passing paradigm. In this section, we discuss the embedding for several alternatives, which optimize the primal and dual Bethe free energy, its convexified version and Kikuchi free energy, respectively. We will parametrize the messages with the same function class, i.e., neural networks with ReLU. More generally, we can also treat the weights as parameters and learn them together.

Noticed the Bethe free energy can be decomposed into the summation of a convex function and a concave function, Yuille (2002) utilizes CCCP to minimize the Bethe free energy, resulting the double-loop algorithm. Take the gradient of Lagrangian of the objective function, and set to zero, the primal variable can be represented in dual form,

The algorithm updates $\gamma$ and $\lambda$ alternatively,

Consider the $\lambda_{ij}$ as messages, with the injective embedding assumptions for corresponding messages, follow the same notation, we can express the messages in embedding view

Therefore, we have the parametrization form as

D.2 Damped BP

Instead of the primal form of Bethe free energy, Minka (2001) investigates the duality of the optimization,

subject to $\Big{(}|\mathcal{N}(i)|-1\Big{)}\gamma_{i}(h_{i})=\sum_{k\in\mathcal{N}(i)}\lambda_{ki}(h_{i})$ . Define message as

which results the embedding follows the same injective assumption and notations,

It is interesting that after parametrization, the embeddings of double-loop BP and damped BP are essentially the same, which reveal the connection between double-loop BP and damped BP.

D.3 Tree-reweighted BP

Different from loopy BP and its variants which optimizing the Bethe free energy, the tree-reweighted BP (Wainwright et al., 2003) is optimizing a convexified Bethe energy,

subject to pairwise marginal consistency constraints: $\int_{\mathcal{H}}q_{ij}(h_{i},h_{j})dh_{j}=q_{i}(h_{i})$ , $\int_{\mathcal{H}}q_{ij}(h_{i},h_{j})dh_{j}=q_{i}(h_{i})$ , and $\int_{\mathcal{H}}q_{i}(h_{i})dh_{i}=1$ . The $\{v_{ij}\}_{(i,j)\in\mathcal{E}}$ represents the probabilities that each edge appears in a spanning tree randomly chose from all spanning tree from $G=\{\mathcal{V},\mathcal{E}\}$ under some measure. Follow the same strategy as loopy BP update derivations, i.e., take derivatives of the corresponding Lagrangian with respect to $q_{i}$ and $q_{ij}$ and set to zero, meanwhile, incorporate with the marginal consistency, we can arrive the messages updates,

Similarly, the embedded messages and the marginals on nodes can be obtained as

Parametrize these message in the same way, we obtain,

Notice the tree-weighted BP contains extra parameters $\{v_{ij}\}_{(i,j)\in\mathcal{E}}$ which is in the spanning tree polytope as (Wainwright et al., 2003).

D.4 Generalized Belief Propagation

The Kikuchi free energy is the generalization of the Bethe free energy by involving high-order interactions. More specifically, given the MRFs, we denote $R$ to be a set of regions, i.e., some basic clusters of nodes, their intersections, the intersections of the intersections, and so on. We denote the $sub(r)$ or $sup(r)$ , i.e., subregions or superregions of $r$ , as the set of regions completely contained in $r$ or containing $r$ , respectively. Let $h_{r}$ be the state of the nodes in region $r$ , then, the Kikuchi free energy is

where $c_{r}$ is over-counting number of region $r$ , defined by $c_{r}:=1-\sum_{s\in sup(r)}c_{s}$ with $c_{r}=1$ if $r$ is the largest region in $R$ . It is straightforward to verify that the Bethe free energy is a special case of the Kikuchi free energy by setting the basic cluster as pair of nodes. The generalized loopy BP (Yedidia et al., 2001b) is trying to seek the stationary points of the Kikuchi free energy under regional marginal consistency constraints and density validation constraints by following messages updates,

The $M(r)$ denotes the indices of messages $m_{r^{\prime},s^{\prime}}$ that $s^{\prime}\in sub(r)\cup\{r\}$ , and $r^{\prime}\setminus s^{\prime}$ is outside $r$ . $M(r,s)$ is the set of indices of messages $m_{r^{\prime},s^{\prime}}$ where $r^{\prime}\in sub(r)\setminus s$ and $s^{\prime}\in sub(s)\cup\{s\}$ .

With the injective embedding assumption for each message $\widetilde{\nu}_{r,s}=\int\phi(h_{s})m_{r,s}(h_{s})dh_{s}$ and $\widetilde{\mu}_{r}=\int\phi(h_{r})q_{r}(h_{r})dh_{r}$ , following the reasoning (3) and (4), we can express the embeddings as

Following the same parameterization in loopy BP, we represent the embeddings by neural network with rectified linear units,

where $\mathbf{W}=\{\{W_{1}^{i}\},W_{2},W_{3},\{W_{4}^{i}\},W_{5}\}$ are matrices with appropriate sizes. The generalized BP embedding updates will be almost the same as Algorithm 2 except the order of the iterations. We start from the messages into the smallest region first (Yedidia et al., 2001b).

Remark: The choice of basis clusters and the form of messages determine the dependency in the embedding. Please refer to Yedidia et al. (2005) for details about the principles to partition the graph structure, and several other generalized BP variants with different messages forms. The algorithms proposed for minimizing the Bethe free energy (Minka, 2001; Heskes, 2002; Yuille, 2002) can also be extended for Kikuchi free energy, resulting in different embedding forms.

Appendix E More experiments

In this section, we present more detailed study of our proposed structure2vec.

In addition to the bar plots shown in Figure 2, we present the corresponding raw numerical scores in Table 4.

Appendix F Derivatives Computation in Algorithm 3

The term $\frac{\partial l}{\partial f}$ depends on the supervised information and the loss function we used, and $\frac{\partial l}{\partial\sigma(m_{y}^{n})}=\mathbf{u}^{T}\frac{\partial l}{\partial f}$ . The last term $\frac{\partial\sigma(m_{y}^{n})}{\partial m_{y}^{n}}$ depends on the nonlinear function $\sigma$ we used here.

The derivatives with respect to $\mathbf{u}$ for the current encountered sample $\{\chi_{n},y_{n}\}$ SGD iteration are

Then, we can update the parameters $W_{1},W_{2}$ using following gradients.