Hierarchical Density Order Embeddings

Ben Athiwaratkun, Andrew Gordon Wilson

Introduction

Learning feature representations of natural data such as text and images has become increasingly important for understanding real-world concepts. These representations are useful for many tasks, ranging from semantic understanding of words and sentences (Mikolov et al., 2013; Kiros et al., 2015), image caption generation (Vinyals et al., 2015), textual entailment prediction (Rocktäschel et al., 2015), to language communication with robots (Bisk et al., 2016).

Meaningful representations of text and images capture visual-semantic information, such as hierarchical structure where certain entities are abstractions of others. For instance, an image caption “A dog and a frisbee” is an abstraction of many images with possible lower-level details such as a dog jumping to catch a frisbee or a dog sitting with a frisbee (Figure 1). A general word such as “object” is also an abstraction of more specific words such as “house” or “pool”. Recent work by Vendrov et al. (2016) proposes learning such asymmetric relationships with order embeddings – vector representations of non-negative coordinates with partial order structure. These embeddings are shown to be effective for word hypernym classification, image-caption ranking and textual entailment (Vendrov et al., 2016).

Another recent line of work uses probability distributions as rich feature representations that can capture the semantics and uncertainties of concepts, such as Gaussian word embeddings (Vilnis & McCallum, 2015), or extract multiple meanings via multimodal densities (Athiwaratkun & Wilson, 2017). Probability distributions are also natural at capturing orders and are suitable for tasks that involve hierarchical structures. An abstract entity such as “animal” that can represent specific entities such as “insect”, “dog”, “bird” corresponds to a broad distribution, encapsulating the distributions for these specific entities. For example, in Figure 1, the distribution for “insect” is more concentrated than for “animal”, with a high density occupying a small volume in space.

Such entailment patterns can be observed from density word embeddings through unsupervised training based on word contexts (Vilnis & McCallum, 2015; Athiwaratkun & Wilson, 2017). In the unsupervised settings, density embeddings are learned via maximizing the similarity scores between nearby words. In these cases, the density encapsulation behavior arises due to the word occurrence pattern that a general word can often substitute more specific words; for instance, the word “tea” in a sentence “I like iced tea” can be substituted by “beverages”, yielding another natural sentence “I like iced beverages”. Therefore, the probability density of a general concept such as “beverages” tends to have a larger variance than specific ones such as “tea”, reflecting higher uncertainty in meanings since a general word can be used in many contexts. However, the information from word occurrences alone is not sufficient to train meaningful embeddings of some concepts. For instance, it is fairly common to observe sentences “Look at the cat”, or “Look at the dog”, but not “Look at the mammal”. Therefore, due to the way we typically express natural language, it is unlikely that the word “mammal” would be learned as a distribution that encompasses both “cat” and “dog”, since “mammal” rarely occurs in similar contexts.

Rather than relying on the information from word occurrences, one can do supervised training of density embeddings on hierarchical data. In this paper, we propose new training methodology to enable effective supervised probabilistic density embeddings. Despite providing rich and intuitive word representations, with a natural ability to represent order relationships, probabilistic embeddings have only been considered in a small number of pioneering works such as Vilnis & McCallum (2015), and these works are almost exclusively focused on unsupervised embeddings. Probabilistic Gaussian embeddings trained directly on labeled data have been briefly considered but perform surprisingly poorly compared to other competing models (Vendrov et al., 2016; Vulić et al., 2016).

Our work reaches a very different conclusion: probabilistic Gaussian embeddings can be highly effective at capturing ordering and are suitable for modeling hierarchical structures, and can even achieve state-of-the-art results on hypernym prediction and graded lexical entailment tasks, so long as one uses the right training procedures.

In particular, we make the following contributions.

We adopt a new form of loss function for training hierarchical probabilistic order embeddings.

We introduce the notion of soft probabilistic encapsulation orders and a thresholded divergence-based penalty function, which do not over-penalize words with a sufficient encapsulation.

We introduce a new graph-based scheme to select negative samples to contrast the true relationship pairs during training. This approach incorporates hierarchy information to the negative samples that help facilitate training and has added benefits over the hierarchy-agnostic sampling schemes previously used in literature.

We also demonstrate that initializing the right variance scale is highly important for modeling hierarchical data via distributions, allowing the model to exhibit meaningful encapsulation orders.

The outline of our paper is as follows. In Section 2, we introduce the background for Gaussian embeddings (Vilnis & McCallum, 2015) and vector order embeddings (Vendrov et al., 2016). We describe our training methodology in Section 3, where we introduce the notion of soft encapsulation orders (Section 3.2) and explore different divergence measures such as the expected likelihood kernel, KL divergence, and a family of Rényi alpha divergences (Section 3.3). We describe the experiment details in Section 4 and offer a qualitative evaluation of the model in Section 4.3, where we show the visualization of the density encapsulation behavior. We show quantitative results on the WordNet Hypernym prediction task in Section 4.2 and a graded entailment dataset HyperLex in Section 4.4.

In addition, we conduct experiments to show that our proposed changes to learn Gaussian embeddings contribute to the increased performance. We demonstrate (a) the effects of our loss function in Section A.2.3, (b) soft encapsulation in Section A.2.1, (c) negative sample selection in Section 4.4], and (d) initial variance scale in Section A.2.2.

We make our code publicly available.https://github.com/benathi/density-order-emb

Background and Related Work

Vilnis & McCallum (2015) was the first to propose using probability densities as word embeddings. In particular, each word is modeled as a Gaussian distribution, where the mean vector represents the semantics and the covariance describes the uncertainty or nuances in the meanings. These embeddings are trained on a natural text corpus by maximizing the similarity between words that are in the same local context of sentences. Given a word ww with a true context word cpc_{p} and a randomly sampled word cnc_{n} (negative context), Gaussian embeddings are learned by minimizing the rank objective in Equation 1, which pushes the similarity of the true context pair E(w,cp)E(w,c_{p}) above that of the negative context pair E(w,cn)E(w,c_{n}) by a margin mm.

The similarity score E(u,v)E(u,v) for words u,vu,v can be either E(u,v)=KL(fu,fv)E(u,v)=-\text{KL}(f_{u},f_{v}) or E(u,v)=logfu,fuL2E(u,v)=\log\langle f_{u},f_{u}\rangle_{L_{2}} where fu,fvf_{u},f_{v} are the distributions of words uu and vv, respectively. The Gaussian word embeddings contain rich semantic information and performs competitively in many word similarity benchmarks.

The true context word pairs (w,cp)(w,c_{p}) are obtained from natural sentences in a text corpus such as Wikipedia. In some cases, specific words can be replaced by a general word in a similar context. For instance, “I love cats” or “I love dogs” can be replaced with “I love animals”. Therefore, the trained word embeddings exhibit lexical entailment patterns where specific words such as “dog” and “cat” are concentrated distributions that are encompassed by a more dispersed distribution of “animal”, a word that “cat” and “dog” entail. The broad distribution of a general word agrees with the distributional informativeness hypothesis proposed by Santus et al. (2014), which says that a generic word can occur in more general contexts in place of the specific ones that entail it.

However, some word entailment pairs have weak density encapsulation patterns due to the nature of word diction. For instance, even though “dog” and “cat” both entail “mammal”, it is rarely the case that we observe a sentence “I have a mammal” as opposed to “I have a cat” in a natural corpus; therefore, after training density word embeddings on word occurrences, encapsulation of some true entailment instances do not occur.

2 Partial Orders and Vector Order Embeddings

We describe the concepts of partial orders and vector order embeddings proposed by Vendrov et al. (2016), which we will later consider in the context of our hierarchical density order embeddings.

A partial order over a set of points XX is a binary relation \preceq such that for a,b,cXa,b,c\in X, the following properties hold: (1) aaa\preceq a (reflexivity); (2) if aba\preceq b and bab\preceq a then a=ba=b (antisymmetry); and (3) if aba\preceq b and bcb\preceq c then aca\preceq c (transitivity). An example of a partially ordered set is a set of nodes in a tree where aba\preceq b means aa is a child node of bb. This concept has applications in natural data such as lexical entailment. For words aa and bb, aba\preceq b means that every instance of aa is an instance of bb, or we can say that aa entails bb. We also say that (a,b)(a,b) has a hypernym relationship where aa is a hyponym of bb and bb is a hypernym of aa. This relationship is asymmetric since aba\preceq b does not necessarily imply (ba)(b\preceq a). For instance, aircraft \preceq vehicle but it is not true that vehicle \preceq aircraft.

3 Other Related Work

Li et al. (2017) extends Vendrov et al. (2016) for knowledge representation on data such as ConceptNet (Speer et al., 2016). Another related work by Hockenmaier & Lai (2017) embeds words and phrases in a vector space and uses denotational probabilities for textual entailment tasks. Our models offer an improvement on order embeddings and can be applicable to such tasks, which view as a promising direction for future work.

Methodology

In Section 3.1, we describe the partial orders that can be induced by density encapsulation. Section 3.2 describes our training approach that softens the notion of strict encapsulation with a viable penalty function.

A partial order on probability densities can be obtained by the notion of encapsulation. That is, a density ff is more specific than a density gg if ff is encompassed in gg. The degree of encapsulation can vary, which gives rise to multiple order relations. We define an order relation η\preceq_{\eta} for η0\eta\geq 0 where η\eta indicates the degree of encapsulation required for one distribution to entail another. More precisely, for distributions ff and gg,

Note that {x:f(x)>η}\{x:f(x)>\eta\} is a set where the density ff is greater than the threshold η\eta. The relation in Equation 2 says that ff entails gg if and only if the set of gg contains that of ff. In Figure 2, we depict two Gaussian distributions with different mean vectors and covariance matrices. Figure 2 (left) shows the density values of distributions ff (narrow, blue) and gg (broad, orange) and different threshold levels. Figure 2 (right) shows that different η\eta’s give rise to different partial orders. For instance, we observe that neither fη1gf\preceq_{\eta_{1}}g nor gη1fg\preceq_{\eta_{1}}f but fη3gf\preceq_{\eta_{3}}g.

2 Soft Encapsulation Orders

A plausible penalty function for the order relation fηgf\preceq_{\eta}g is a set measure on {x:f(x)>η}{x:g(x)>η}\{x:f(x)>\eta\}-\{x:g(x)>\eta\}. However, this penalty is difficult to evaluate for most distributions, including Gaussians. Instead, we use simple penalty functions based on asymmetric divergence measures between probability densities. Divergence measures D()D(\cdot||\cdot) have a property that D(fg)=0D(f||g)=0 if and only if f=gf=g. Using D()D(\cdot||\cdot) to represent order violation is undesirable since the penalty should be if fgf\neq g but fgf\preceq g. Therefore, we propose using a thresholded divergence

which can be zero if ff is properly encapsulated in gg. We discuss the effectiveness of using divergence thresholds in Section A.2.1.

We note that by using dγ(,)d_{\gamma}(\cdot,\cdot) as a violation penalty, we no longer have the strict partial order. In particular, the notion of transitivity in a partial order is not guaranteed. For instance, if fgf\preceq g and ghg\preceq h, our density order embeddings would yield dγ(f,g)=0d_{\gamma}(f,g)=0 and dγ(g,h)=0d_{\gamma}(g,h)=0. However, it is not necessarily the case that dγ(f,h)=0d_{\gamma}(f,h)=0 since D(fh)D(f||h) can be greater than γ\gamma. This is not a drawback since a high value of D(fh)D(f||h) reflects that the hypernym relationship is not direct, requiring many edges from ff to hh in the hierarchy. The extent of encapsulation contains useful entailment information, as demonstrated in Section 4.4 where our model scores highly correlate with the annotated scores of a challenging lexical entailment dataset and achieves state-of-the-art results.

Another property, antisymmetry, does not strictly hold since if dγ(f,g)=0d_{\gamma}(f,g)=0 and dγ(g,f)=0d_{\gamma}(g,f)=0 does not imply f=gf=g. However, in this situation, it is necessary that ff and gg overlap significantly if γ\gamma is small. Due to the fact that the dγ(,)d_{\gamma}(\cdot,\cdot) does not strictly induce a partial order, we refer to this model as soft density order embeddings or simply density order embeddings.

3 Divergence Measures

Kullback-Leibler (KL) Divergence The KL divergence is an asymmetric measure of the difference between probability distributions. For distributions ff and gg, KL(gf)g(x)logg(x)f(x) dx\text{KL}(g||f)\equiv\int g(x)\log\frac{g(x)}{f(x)}\ dx imposes a high penalty when there is a region of points xx such that the density f(x)f(x) is low but g(x)g(x) is high. An example of such a region is the area on the left of ff in Figure 2. This measure penalizes the situation where ff is a concentrated distribution relative to gg; that is, if the distribution ff is encompassed by gg, then the KL yields high penalty. For dd-dimensional Gaussians f=Nd(μf,Σf)f=\mathcal{N}_{d}(\mu_{f},\Sigma_{f}) and g=Nd(μg,Σg)g=\mathcal{N}_{d}(\mu_{g},\Sigma_{g}),

Rényi α\alpha-Divergence is a general family of divergence with varying scale of zero-forcing penalty (Renyi, 1961). Equation 4 describes the general form of the α\alpha-divergence for α0,1\alpha\neq 0,1 (Liese & Vajda, 1987). We note that for α0\alpha\to 0 or 11, we recover the KL divergence and the reverse KL divergence; that is, limα1Dα(fg)=KL(fg)\lim_{\alpha\to 1}D_{\alpha}(f||g)=\text{KL}(f||g) and limα0Dα(fg)=KL(gf)\lim_{\alpha\to 0}D_{\alpha}(f||g)=\text{KL}(g||f) (Pardo, 2006). The α\alpha-divergences are asymmetric for all α\alpha’s, except for α=1/2\alpha=1/2.

For two multivariate Gaussians ff and gg, we can write the Rényi divergence as (Pardo, 2006):

The parameter α\alpha controls the degree of zero forcing where minimizing Dα(fg)D_{\alpha}(f||g) for high α\alpha results in ff being more concentrated to the region of gg with high density. For low α\alpha, ff tends to be mass-covering, encompassing regions of gg including the low density regions. Recent work by Li & Turner (2016) demonstrates that different applications can require different degrees of zero-forcing penalty.

3.2 Symmetric Divergence

Expected Likelihood Kernel The expected likelihood kernel (ELK) (Jebara et al., 2004) is a symmetric measure of affinity, define as K(f,g)=f,gHK(f,g)=\langle f,g\rangle_{\mathcal{H}}. For two Gaussians ff and gg,

Since this kernel is a similarity score, we use its negative as our penalty. That is, DELK(fg)=2logf,gHD_{\text{ELK}}(f||g)=-2\log\langle f,g\rangle_{\mathcal{H}}. Intuitively, the asymmetric measures should be more successful at training density order embeddings. However, a symmetric measure can result in a correct encapsulation order as well, since a general entity often has to minimize the penalty with many specific elements and consequently ends up having a broad distribution to lower the average loss. The expected likelihood kernel is used to train Gaussian and Gaussian Mixture word embeddings on a large text corpus (Vilnis & McCallum, 2015; Athiwaratkun & Wilson, 2017) where the model performs well on the word entailment dataset (Baroni et al., 2012).

4 Loss Function

To learn our density embeddings, we use a loss function similar to that of Vendrov et al. (2016). Minimizing this function (Equation 7) is equivalent to minimizing the penalty between a true relationship pair (u,v)(u,v) where uvu\preceq v, but pushing the penalty to be above a margin mm for the negative example (u,v)(u^{\prime},v^{\prime}) where u⪯̸vu^{\prime}\not\preceq v^{\prime}:

We note that this loss function is different than the rank-margin loss introduced in the original Gaussian embeddings (Equation 1). Equation 7 aims to reduce the dissimilarity of a true relationship pair d(u,v)d(u,v) with no constraint, unlike in Equation 1, which becomes zero if d(u,v)d(u,v) is above d(u,v)d(u^{\prime},v^{\prime}) by margin mm.

5 Selecting Negative Samples

In many embedding models such as word2vec (Mikolov et al., 2013) or Gaussian embeddings (Vilnis & McCallum, 2015), negative samples are often used in the training procedure to contrast with true samples from the dataset. For flat data such as words in a text corpus, negative samples are selected randomly from a unigram distribution. We propose new graph-based methods to select negative samples that are suitable for hierarchical data, as demonstrated by the improved performance of our density embeddings. In our experiments, we use various combinations of the following methods.

Method S1: A simple negative sampling procedure used by Vendrov et al. (2016) is to replace a true hypernym pair (u,v)(u,v) with either (u,v)(u,v^{\prime}) or (u,v)(u^{\prime},v) where u,vu^{\prime},v^{\prime} are randomly sampled from a uniform distribution of vertices. Method S2: We use a negative sample (v,u)(v,u) if (u,v)(u,v) is a true relationship pair, to make D(vu)D(v||u) higher than D(uv)D(u||v) in order to distinguish the directionality of density encapsulation. Method S3: It is important to increase the divergence between neighbor entities that do not entail each other. Let A(w)A(w) denote all descendants of ww in the training set D\mathcal{D}, including ww itself. We first randomly sample an entity wDw\in\mathcal{D} that has at least 22 descendants and randomly select a descendant uA(w){w}u\in A(w)-\{w\}. Then, we randomly select an entity vA(w)A(u)v\in A(w)-A(u) and use the random neighbor pair (v,u)(v,u) as a negative sample. Note that we can have uvu\preceq v, in which case the pair (v,u)(v,u) is a reverse relationship. Method S4: Same as S3 except that we sample vA(w)A(u){w}v\in A(w)-A(u)-\{w\}, which excludes the possibility of drawing (w,u)(w,u).

Experiments

We have introduced density order embeddings (DOE) to model hierarchical data via encapsulation of probability densities. We propose using a new loss function, graph-based negative sample selections, and a penalty relaxation to induce soft partial orders. In this section, we show the effectiveness of our model on WordNet hypernym prediction and a challenging graded lexical entailment task, where we achieve state-of-the-art performance.

First, we provide the training details in Section 4.1 and describe the hypernym prediction experiment in 4.2. We offers insights into our model with the qualitative analysis and visualization in Section 4.3. We evaluate our model on HyperLex, a lexical entailment dataset in Section 4.4.

We have a similar data setup to the experiment by Vendrov et al. (2016) where we use the transitive closure of WordNet noun hypernym relationships which contains 82,11582,115 synsets and 837,888837,888 hypernym pairs from 84,42784,427 direct hypernym edges. We obtain the data using the WordNet API of NLTK version 3.2.1 (Loper & Bird, 2002).

The validation set contains 40004000 true hypernym relationships as well as 40004000 false hypernym relationships where the false hypernym relationships are constructed from the S1 negative sampling described in Section 3.5. The same process applies for the test set with another set of 40004000 true hypernym relationships and 40004000 false hypernym relationships.

We use dd-dimensional Gaussian distributions with diagonal covariance matrices. We use d=50d=50 as the default dimension and analyze the results using different dd’s in Section A.2.4. We initialize the mean vectors to have a unit norm and normalize the mean vectors in the training graph. We initialize the diagonal variance components to be all equal to β\beta and optimize on the unconstrained space of log(Σ)\log(\Sigma). We discuss the important effects of the initial variance scale in Section A.2.2.

We use a minibatch size of 500500 true hypernym pairs and use varying number of negative hypernym pairs, depending on the negative sample combination proposed in Section 3.5. We discuss the results for many selection strategies in Section 4.4. We also experiment with multiple divergence measures D()D(\cdot||\cdot) described in Section 3.3. We use D()=DKL()D(\cdot||\cdot)=D_{KL}(\cdot||\cdot) unless stated otherwise. Section A.2.5 considers the results using the α\alpha-divergence family with varying degrees of zero-forcing parameter α\alpha’s. We use the Adam optimizer (Kingma & Ba, 2014) and train our model for at most 2020 epochs. For each energy function, we tune the hyperparameters on grids. The hyperparameters are the loss margin mm, the initial variance scale β\beta, and the energy threshold γ\gamma. We evaluate the results by computing the penalty on the validation set to find the best threshold for binary classification, and use this threshold to perform prediction on the test set. Section A.1 describes the hyperparameters for all our models.

2 Hypernym Prediction

We show the prediction accuracy results on the test set of WordNet hypernyms in Table 1. We compare our results with vector order-embeddings (VOE) by Vendrov et al. (2016) (VOE model details are in Section 2.2). Another important baseline is the transitive closure, which requires no learning and classifies if a held-out edge is a hypernym relationship by determining if it is in the union of the training edges. word2gauss and word2gauss\dagger are the Gaussian embeddings trained using the loss function in Vilnis & McCallum (2015) (Equation 1) where word2gauss is the result reported by Vendrov et al. (2016) and word2gauss\dagger is the best performance of our replication (see Section A.2.3 for more details). Our density order embedding (DOE) outperforms the implementation by Vilnis & McCallum (2015) significantly; this result highlights the fact that our different approach for training Gaussian embeddings can be crucial to learning hierarchical representations.

We observe that the symmetric model (ELK) performs quite well for this task despite the fact that the symmetric metric cannot capture directionality. In particular, ELK can accurately detect pairs of concepts with no relationships when they’re far away in the density space. In addition, for pairs that are related, ELK can detect pairs that overlap significantly in density space. The lack of directionality has more pronounced effects in the graded lexical entailment task (Section 4.4) where we observe a high degradation in performance if ELK is used instead of KL.

We find that our method outperforms vector order embeddings (VOE) (Vendrov et al., 2016). We also find that DOE is very strong in a 22-dimensional Gaussian embedding example, trained for the purpose of visualization in Section 4.3, despite only having only 44 parameters: 22 from 22-dimensional μ\mu and another 22 from the diagonal Σ\Sigma. The results of DOE using a symmetric measure also outperforms the baselines on this experiment, but has a slightly lower accuracy than the asymmetric model.

Figure 3 offers an explanation as to why our density order embeddings might be easier to learn, compared to the vector counterpart. In certain cases such as fitting a general concept entity to the embedding space, we simply need to adjust the distribution of entity to be broad enough to encompass all other concepts. In the vector counterpart, it might be required to shift many points further from the origin to accommodate entity to reduce cascading order violations.

3 Qualitative Analysis

For qualitative analysis, we additionally train a 22-dimensional Gaussian model for visualization. Our qualitative analysis shows that the encapsulation behavior can be observed in the trained model. Figure 4 demonstrates the ordering of synsets in the density space. Each ellipse represents a Gaussian distribution where the center is given by the mean vector μ\mu and the major and minor axes are given by the diagonal standard deviations Σ\sqrt{\Sigma}, scaled by 300300 for the xx axis and 3030 for the yy axis, for visibility.

Most hypernym relationships exhibit encapsulation behavior where the hypernym encompasses the synset that entails it. For instance, the distribution of whole.n.02 is subsumed in the distribution of physical_entity.n.01. Note that location.n.01 is not entirely encapsulated by physical_entity.n.01 under this visualization. However, we can still predict which entity should be the hypernym among the two since the KL divergence of one given another would be drastically different. This is because a large part of physical_entity.n.01 has considerable density at the locations where location.n.01 has very low density. This causes KL(physical\textunderscoreentity.n.01  location.n.01)\text{KL}({\tt physical{\textunderscore}entity.n.01}\ ||\ {\tt location.n.01}) to be very high (5103)(5103) relative to KL(location.n.01  physical\textunderscoreentity.n.01)\text{KL}({\tt location.n.01}\ ||\ {\tt physical{\textunderscore}entity.n.01}) (206)(206). Table 2 shows the KL values for all pairs where we note that the numbers are from the full model (d=50d=50).

Another interesting pair is city.n.01 \preceq location.n.01 where we see the two distributions have very similar contours and the encapsulation is not as distinct. In our full model d=50d=50, the distribution of location.n.01 encompasses city.n.01’s, indicated by low KL(city.n.01location.n.01)\text{KL}({\tt city.n.01}||{\tt location.n.01}) but high KL(location.n.01city.n.01)\text{KL}({\tt location.n.01}||{\tt city.n.01}).

Figure 4 (Right) demonstrates the idea that synsets on the top of the hypernym hierarchy usually have higher “volume”. A convenient metric that reflects this quantity is logdet(Σ)\log\text{det}(\Sigma) for a Gaussian distribution with covariance Σ\Sigma. We can see that the synset, physical_entity.n.01, being the hypernym of all the synsets shown, has the highest logdet(Σ)\log\text{det}(\Sigma) whereas entities that are more specific such as object.n.01, whole.n.02 and living_thing have decreasingly lower volume.

4 Graded Lexical Entailment

HyperLex is a lexical entailment dataset which has fine-grained human annotated scores between concept pairs, capturing varying degrees of entailment (Vulić et al., 2016). Concept pairs in HyperLex reflect many variants of hypernym relationships, such as no-rel (no lexical relationship), ant (antonyms), syn (synonyms), cohyp (sharing a hypernym but not a hypernym of each other), hyp (hypernym), rhyp (reverse hypernym). We use the noun dataset of HyperLex for evaluation, which contains 2,163 pairs.

We evaluate our model by comparing our model scores against the annotated scores. Obtaining a high correlation on a fine-grained annotated dataset is a much harder task compared to a binary prediction, since performing well requires meaningful model scores in order to reflect nuances in hypernymy. We use negative divergence as our score for hypernymy scale where large values indicate high degrees of entailment.

We note that the concepts in our trained models are WordNet synsets, where each synset corresponds to a specific meaning of a word. For instance, pop.n.03 has a definition “a sharp explosive sound as from a gunshot or drawing a cork” whereas pop.n.04 corresponds to “music of general appeal to teenagers; …”. For a given pair of words (u,v)(u,v), we use the score of the synset pair (su,sv)(s_{u}^{\prime},s_{v}^{\prime}) that has the lowest KL divergence among all the pairs Sn×SvS_{n}\times S_{v} where Su,SvS_{u},S_{v} are sets of synsets for words uu and vv, respectively. More precisely, s(u,v)=minsuSu,svSvD(su,sv)s(u,v)=-\min_{s_{u}\in S_{u},s_{v}\in S_{v}}D(s_{u},s_{v}). This pair selection corresponds to choosing the synset pair that has the highest degree of entailment. This approach has been used in word embeddings literature to select most related word pairs (Athiwaratkun & Wilson, 2017). For word pairs that are not in the model, we assign the score equal to the median of all scores. We evaluate our model scores against the human annotated scores using Spearman’s rank correlation.

Table 3 shows HyperLex results of our models DOE-A (asymmetric) and DOE-S (symmetric) as well as other competing models. The model DOE-A which uses KL divergence and negative sampling approach S1, S2 and S4 outperforms all other existing models, achieving state-of-the-art performance for the HyperLex noun dataset. (See Section A.1 for hyperparameter details) The model DOE-S which uses expected likelihood kernel attains a lower score of 0.4550.455 compared to the asymmetric counterpart (DOE-A). This result underscores the importance of asymmetric measures which can capture relationship directionality.

We provide a brief summary of competing models: FR scores are based on concept word frequency ratio (Weeds et al., 2004). SLQS uses entropy-based measure to quantify entailment (Santus et al., 2014). Vis-ID calculates scores based on visual generality measures (Kiela et al., 2015). WN-B calculates the scores based on the shortest path between concepts in WN taxonomy (Miller, 1995). w2g Guassian embeddings trained using the methodology in Vilnis & McCallum (2015). VOE Vector order embeddings (Vendrov et al., 2016). Euc and Poin calculate scores based on the Euclidean distance and Poincaré distance of the trained Poincaré embeddings (Nickel & Kiela, 2017). The models FR and SLQS are based on word occurrences in text corpus, where FR is trained on the British National Corpus and SLQS is trained on ukWac, WackyPedia (Bailey & Thompson, 2006; Baroni et al., 2009) and annotated BLESS dataset (Baroni & Lenci, 2011). Other models Vis-ID, w2g, VOE, Euc, Poin and ours are trained on WordNet, with the exception that Vis-ID also uses Google image search results for visual data. The reported results of FR, SLQS, Vis-ID, WN-B, w2g and VOE are from Vulić et al. (2016).

We note that an implementation of Gaussian embeddings model (w2g) reported by Vulić et al. (2016) does not perform well compared to previous benchmarks such as Vis-ID, FR, SLQS. Our training approach yields the opposite results and outperforms other highly competitive methods such as Poincaré embeddings and Hypervec. This result highlights the importance of the training approach, even if the concept representation of our work and Vilnis & McCallum (2015) both use Gaussian distributions. In addition, we observe that vector order embeddings (VOE) do not perform well compared to our model, which we hypothesize is due to the “soft” orders induced by the divergence penalty that allows our model scores to more closely reflect hypernymy degrees.

We note another interesting observation that a model trained on a symmetric divergence (ELK) from Section 4.2 can also achieve a high HyperLex correlation of 0.5320.532 if KL is used to calculate the model scores. This is because the encapsulation behavior can arise even though the training penalty is symmetric (more explanation in Section 4.2). However, using the symmetric divergence based on ELK results in poor performance on HyperLex (0.455), which is expected since it cannot capture the directionality of hypernymy.

We note that another model LEAR obtains an impressive score of 0.6860.686 (Vulić & Mrkšić, 2014). However, LEAR use pre-trained word embeddings such as word2vec or GloVe as a pre-processing step, leveraging a large vocabulary with rich semantic information. To the best of our knowledge, our model achieves the highest HyperLex Spearman’s correlation among models without using large-scale pre-trained embeddings.

Table 4 shows the effects of negative sample selection described in Section 3.5. We note again that S1 is the technique used in literature Socher et al. (2013); Vendrov et al. (2016) and S2, S3, S4 are the new techniques we proposed. The notation, for instance, k×S1+S2k\times\textbf{S1}+\textbf{S2} corresponds to using kk samples from S1 and 11 sample from S2 per each positive sample. We observe that our new selection methods offer strong improvement from the range of 0.510.520.51-0.52 (using S1 alone) to 0.550.55 or above for most combinations with our new selection schemes.

Future Work

Analogous to recent work by Vulić & Mrkšić (2014) which post-processed word embeddings such as GloVe or word2vec, our future work includes using the WordNet hierarchy to impose encapsulation orders when training probabilistic embeddings.

In the future, the distribution approach could also be developed for encoder-decoder based models for tasks such as caption generation where the encoder represents the data as a distribution, containing semantic and visual features with uncertainty, and passes this distribution to the decoder which maps to text or images. Such approaches would be reminiscent of variational autoencoders (Kingma & Welling, 2013), which take samples from the encoder’s distribution.

References

Appendix A Supplementary Materials

In Section 4.3, the 22-dimensional Gaussian model is trained with S-1 method where the number of negative samples is equal to the number of positive samples. The best hyperparameters for d=2d=2 model is (m,β,γ)=(100.0,2×104,3.0)(m,\beta,\gamma)=(100.0,2\times 10^{-4},3.0).

In Section 4.2, the best hyperparameters (m,β,γ)(m,\beta,\gamma) for each of our model are as follows: For Gaussian with KL penalty: (2000.0,5×105,500.0)(2000.0,5\times 10^{-5},500.0), , Gaussian with reversed KL penalty: (1000.0,1×104,1000.0)(1000.0,1\times 10^{-4},1000.0), Gaussian with ELK penalty (1000,1×105,10)(1000,1\times 10^{-5},10).

In Section 4.4, we use the same hyperparameters as in 4.2 with KL penalty, but a different negative sample combination in order to increase the distinguishability of divergence scores. For each positive sample in the training set, we use one sample from each of the methods S1, S2, S4. We note that the model from Section 4.2, using S1 with the KL penalty obtains a Spearman’s correlation of 0.5270.527.

A.2 Analysis of Training Methodology

We emphasize that Gaussian embeddings have been used in the literature, both in the unsupervised settings where word embeddings are trained with local contexts from text corpus, and in supervised settings where concept embeddings are trained to model annotated data such as WordNet . The results in supervised settings such as modeling WordNet have been reported to compare with competing models but often have inferior performance (Vendrov et al., 2016; Vulić et al., 2016). Our paper reaches the opposite conclusion, showing that a different training approach using Gaussian representations can achieve state-of-the-art results.

Consider a relationship fgf\preceq g where ff is a hyponym of gg or gg is a hypernym of ff. Even though the divergence D(fg)D(f||g) can capture the extent of encapsulation, a density ff will have the lowest divergence with respect with gg only if f=gf=g. In addition, if ff is a more concentrated distribution that is encompassed by gg, D(fg)D(f||g) is minimized when ff is at the center of gg. However, if there any many hyponyms f1,f2f_{1},f_{2} of gg, the hyponyms can compete to be close to the center, resulting in too much overlapping between f1f_{1} and f2f_{2} if the random sampling to penalize negative pairs is not sufficiently strong. The divergence threshold γ\gamma is used such that there is no longer a penalty once the divergence is below a certain level.

We demonstrate empirically that the threshold γ\gamma is important for learning meaningful Gaussian distributions. We fix the hyperparameters m=2000m=2000 and β=5×105\beta=5\times 10^{-5}, with S1 negative sampling. Figure 5 shows that there is an optimal non-zero threshold and yields the best performance for both WordNet Hypernym prediction and HyperLex Spearman’s correlation. We observe that using γ=0\gamma=0 is detrimental to the performance, especially on HyperLex results.

A.2.2 Initial Variance Scale

As opposed to the mean vectors that are randomly initialized, we initialize all diagonal covariance elements to be the same. Even though the variance can adapt during training, we find that different initial scales of variance result in drastically different performance. To demonstrate, in Figure 6, we show the best test accuracy and HyperLex Spearman’s correlation for each initial variance scale, with other hyperparameters (margin mm and threshold γ\gamma) tuned for each variance. We use S1 + S2 + S4 as a negative sampling method. In general, a low variance scale β\beta increases the scale of the loss and requires higher margin mm and threshold γ\gamma. We observe that the best prediction accuracy is obtained when log(β)10\log(\beta)\approx-10 or β=5×105\beta=5\times 10^{-5}. The best HyperLex results are obtained when the scales of β\beta are sufficiently low. The intuition is that low β\beta increases the scale of divergence D()D(\cdot||\cdot), which increases the ability to capture relationship nuances.

A.2.3 Loss Function

We verify that for this task, our loss function in Equation 7 in superior to Equation 1 originally proposed by Vilnis & McCallum (2015). We use the exact same setup with new negative sample selections and KL divergence thresholding and compare the two loss functions. Table 5 verifies our claim.

A.2.4 Dimensionality

Table 6 shows the results for many dimensionalities for two negative sample strategies: S1 and S1 + S2 + S4 .

A.2.5 α𝛼\alpha-divergences

Table 7 show the results using models trained and evaluated with D()=Dα()D(\cdot||\cdot)=D_{\alpha}(\cdot||\cdot) with negative sampling approach S1. Interestingly, we found that α1\alpha\to 1 (KL) offers the best result for both prediction accuracy and HyperLex . It is possible that α=1\alpha=1 is sufficiently asymmetric enough to distinguish hypernym directionality, but does not have as sharp penalty as in α>1\alpha>1, which can help learning.