CogALex-V Shared Task: LexNET - Integrated Path-based and Distributional Method for the Identification of Semantic Relations

Vered Shwartz, Ido Dagan

Introduction

Discovering whether words are semantically related and identifying the specific semantic relation that holds between them is a key component in many NLP applications, such as question answering and recognizing textual entailment [Dagan et al., 2013]. Automated methods for semantic relation identification are commonly corpus-based, and mainly rely on the distributional representation of each word.

The CogALex shared task on the corpus-based identification of semantic relations consists of two subtasks. In the first task, the system needs to identify for a word pair whether the words are semantically related or not (e.g. True:(dog, cat), False:(dog, fruit)). In the second task, the goal is to determine the specific semantic relation that holds for a given pair, if any (PART_OF:(tail, cat), HYPER:(cat, animal)).

In this paper we describe our approach and system setup for the shared task. We use LexNET [Shwartz and Dagan, 2016], an integrated path-based and distributional method for semantic relation classification. LexNET was the system with the overall best performance on subtask 2, and was ranked third on subtask 1, demonstrating the utility of integrating the complementary path-based and distributional information sources in recognizing semantic relatedness. LexNET’s code is available at https://github.com/vered1986/LexNET, and the shared task results are available at https://sites.google.com/site/cogalex2016/home/shared-task/results.

To aid in recognizing whether a pair of words are related at all (subtask 1), we combine LexNET with a common similarity measure (cosine similarity), achieving fairly good performance, and a slight improvement upon using cosine similarity alone. Subtask 2, however, has proven to be extremely difficult, with LexNET and all other systems achieving relatively low $F_{1}$ scores. The conflict between this mediocre performance and the recent success of distributional methods on several other common datasets for semantic relation classification [Baroni et al., 2012, Weeds et al., 2014, Roller et al., 2014] could be explained by the stricter evaluation setup in this subtask, which is meant to reflect real-world application settings more closely. The difficulty of the semantic relation classification task emphasizes the need to develop better methods for it.

Background

Recognizing word relatedness is typically addressed by distributional methods. To determine to what extent two terms $x$ and $y$ are related, a vector similarity or distance measure is applied to their distributional representations: $sim(\vec{v}_{w_{x}},\vec{v}_{w_{y}})$. This is a straightforward application of the distributional hypothesis [Harris, 1954], according to which related words occur in similar contexts, hence have similar vector representations.

Most commonly, vector cosine is adopted as a similarity measure [Turney et al., 2010]. Many other measures exist, including but not limited to Euclidean distance, KL divergence [Cover and Thomas, 2012], Jaccard’s coefficient [Salton and McGill, 1986], and more recently neighbor rank [Hare et al., 2009, Lapesa and Evert, 2013] and APSyn [Santus et al., 2016a]. See ?) for an extensive list of such measures. To turn this task into a binary classification task, where $x$ and $y$ are classified as either related or not, one can set a threshold to separate similarity scores of related and unrelated word pairs.
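The thresholding idea can be made concrete with a minimal sketch. The three-dimensional vectors and the threshold value below are invented for illustration; they are not taken from the shared task data or from any pre-trained embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def is_related(u, v, threshold):
    """Binary relatedness decision: similarity at or above the threshold."""
    return cosine(u, v) >= threshold

# Toy vectors, invented for illustration only.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "fruit": [0.1, 0.0, 0.9],
}
THRESHOLD = 0.5  # hypothetical value; in practice it is tuned on held-out data

dog_cat = is_related(toy_vectors["dog"], toy_vectors["cat"], THRESHOLD)
dog_fruit = is_related(toy_vectors["dog"], toy_vectors["fruit"], THRESHOLD)
```

With these toy values, (dog, cat) falls above the threshold and (dog, fruit) below it, mirroring the examples given for subtask 1.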

Semantic Relation Classification

Recognizing lexical semantic relations between words is valuable for many NLP applications, such as ontology learning, question answering, and recognizing textual entailment. Most corpus-based methods classify the relation between a pair of words $x$ and $y$ based on the distributional representation of each word [Baroni et al., 2012, Roller et al., 2014, Fu et al., 2014, Weeds et al., 2014]. Earlier methods utilized the dependency paths that connect the joint occurrences of $x$ and $y$ in the corpus as a cue to the relation between the words [Hearst, 1992, Snow et al., 2004, Nakashole et al., 2012]. Recently, Shwartz and Dagan [2016] presented LexNET, an extension of HypeNET [Shwartz et al., 2016]. The method integrates both path-based and distributional information for semantic relation classification, and outperformed approaches that rely on a single information source on several common datasets [Baroni and Lenci, 2011, Necsulescu et al., 2015, Santus et al., 2015, Santus et al., 2016b].

System Description

In LexNET, a word-pair $(x,y)$ is represented as a feature vector, consisting of a concatenation of distributional and path-based features: $\vec{v}_{xy}=[\vec{v}_{w_{x}},\vec{v}_{paths(x,y)},\vec{v}_{w_{y}}]$, where $\vec{v}_{w_{x}}$ and $\vec{v}_{w_{y}}$ are $x$ and $y$’s word embeddings, providing their distributional representation, and $\vec{v}_{paths(x,y)}$ is the average embedding vector of all the dependency paths that connect $x$ and $y$ in the corpus. Dependency paths are embedded using an LSTM [Hochreiter and Schmidhuber, 1997], as described in ?). This vector is then fed into a neural network that outputs the class distribution $\vec{c}$, and then the pair is classified to the relation with the highest score $r$:

MLP stands for Multi-Layer Perceptron, and can be computed with or without a hidden layer (equations 2 and 3, respectively):

where $W_{i}$ and $b_{i}$ are the network parameters and $\vec{h}$ is the hidden layer.
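The displayed equations referred to in this passage can be sketched as follows. This is one reading consistent with the surrounding definitions (the class distribution $\vec{c}$, the predicted relation $r$, the parameters $W_{i}$ and $b_{i}$, and the hidden layer $\vec{h}$). The softmax output and the $\tanh$ nonlinearity are assumptions, since neither is named in the text here.

```latex
\begin{align}
\vec{c} &= \operatorname{softmax}\big(\operatorname{MLP}(\vec{v}_{xy})\big),
  \qquad r = \operatorname{argmax}_{i}\, \vec{c}[i] \tag{1}\\
% with one hidden layer (tanh is an assumed nonlinearity)
\vec{h} &= \tanh\big(W_{1}\cdot\vec{v}_{xy} + b_{1}\big),
  \qquad \operatorname{MLP}(\vec{v}_{xy}) = W_{2}\cdot\vec{h} + b_{2} \tag{2}\\
% without a hidden layer
\operatorname{MLP}(\vec{v}_{xy}) &= W_{1}\cdot\vec{v}_{xy} + b_{1} \tag{3}
\end{align}
```

Under this reading, the variant without a hidden layer is a linear classifier over the concatenated vector $\vec{v}_{xy}$, while the variant with a hidden layer can model interactions between the word embeddings and the path representation.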

While path-based approaches have been commonly used for semantic relation classification [Hearst, 1992, Snow et al., 2004, Nakashole et al., 2012, Necsulescu et al., 2015], they have never been used for word relatedness, which is considered a “classical” task for distributional methods. We argue that path-based information can improve performance of word relatedness tasks as well (see Section 4.1). We train LexNET to distinguish between two classes, related and unrelated, and combine it with the common cosine similarity measure to tackle subtask 1.

Experimental Settings

The shared task organizers provided a dataset extracted from EVALution 1.0 [Santus et al., 2015], which was split into training and test sets. As instructed, we trained and tuned our method on the training set, and evaluated it once on the test set. To tune the hyper-parameters, we split the training set into 90% train and 10% validation sets. Since the dataset contains only 318 different words in the $x$ slot, we performed the split such that the train and the validation sets contain distinct $x$ words. A random split yielded perfect results on the validation set, which were due to lexical memorization [Levy et al., 2015].
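A lexical split of this kind can be sketched as below. The function name, the data layout (triples of $x$, $y$, label) and the toy pairs are hypothetical. Note that this sketch holds out roughly 10% of the distinct $x$ words, which only approximates a 90%/10% split over pairs.

```python
import random

def lexical_split(pairs, validation_fraction=0.1, seed=0):
    """Split (x, y, label) triples so that no x word occurs on both sides."""
    x_words = sorted({x for x, _, _ in pairs})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(x_words)
    n_val = max(1, round(len(x_words) * validation_fraction))
    val_words = set(x_words[:n_val])
    train = [p for p in pairs if p[0] not in val_words]
    validation = [p for p in pairs if p[0] in val_words]
    return train, validation

# Toy pairs, with labels invented for illustration only.
toy_pairs = [
    ("dog", "cat", "related"),
    ("dog", "fruit", "unrelated"),
    ("cat", "animal", "related"),
    ("tail", "cat", "related"),
    ("fire", "shoot", "related"),
    ("compact", "car", "related"),
    ("death", "man", "unrelated"),
]

train_pairs, validation_pairs = lexical_split(toy_pairs)
```

Because whole $x$ words are assigned to one side, a classifier cannot score well on the validation set merely by remembering which $x$ words tended to appear with which label.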

LexNET has several tunable hyper-parameters. Similarly to ?), we used the English Wikipedia dump from May 2015 as an underlying corpus (3B tokens), and initialized the network’s word embeddings with the 50-dimensional pre-trained GloVe word embeddings [Pennington et al., 2014], trained on Wikipedia and Gigaword 5 (6B tokens). We fixed this hyper-parameter due to computational limitations with higher-dimensional embeddings. For each subtask, we tuned LexNET’s hyper-parameters on the validation set: the number of hidden layers (0 or 1), the number of training epochs, and the word dropout rate (see ?) for technical details). Table 1 displays the best performing hyper-parameters in each subtask, along with the performance on the validation set, which is detailed below.

We tuned LexNET’s hyper-parameters on the validation set, disregarding the similarity measure at this point, and then chose the model that performed best on the validation set and combined it with the similarity measure.

We computed cosine similarity for each $(x,y)$ pair in the dataset: $\operatorname{cos}(\vec{v}_{w_{x}},\vec{v}_{w_{y}})=\frac{\vec{v}_{w_{x}}\cdot\vec{v}_{w_{y}}}{\|\vec{v}_{w_{x}}\|\cdot\|\vec{v}_{w_{y}}\|}$, and normalized it to the range $[0,1]$. We scored each $(x,y)$ pair by a combination of LexNET’s score for the related class and the cosine similarity score:

$\operatorname{Rel}(x,y)=w_{C}\cdot\operatorname{cos}(\vec{v}_{w_{x}},\vec{v}_{w_{y}})+w_{L}\cdot\vec{c}[\text{related}]$

where $w_{C},w_{L}$ are the weights assigned to cosine similarity and LexNET’s scores respectively, such that $w_{C}+w_{L}=1$. We tuned the weights and a threshold $t$ using the validation set, and classified $(x,y)$ as related if $\operatorname{Rel}(x,y)\geq t$. The word vectors used to compute the cosine similarity scores were chosen among several available pre-trained embeddings: word2vec (300 dimensions, SGNS, trained on GoogleNews, 100B tokens) [Mikolov et al., 2013], GloVe (50-300 dimensions, trained on Wikipedia and Gigaword 5, 6B tokens) [Pennington et al., 2014], and dependency-based embeddings (300 dimensions, trained on Wikipedia, 3B tokens) [Levy and Goldberg, 2014]. For completeness we also report the performance of two baselines: cosine similarity ($w_{C}=1$) and LexNET ($w_{L}=1$, fixed $t=0.5$).
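The tuning of the weights and the threshold can be sketched as a small grid search. This is an illustration under stated assumptions: the grid resolution, the choice of $F_{1}$ on the related class as the selection criterion, the function names and the toy validation scores are all hypothetical, and both input scores are assumed to be already on a comparable scale.

```python
def rel_score(cos_norm, lexnet_related, w_c):
    """Weighted combination of the two scores, with w_l = 1 - w_c."""
    return w_c * cos_norm + (1.0 - w_c) * lexnet_related

def f1_related(gold, predicted):
    """F1 of the positive (related) class for boolean label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune(examples, steps=10):
    """Grid-search w_c and t over {0, 1/steps, ..., 1}; return (f1, w_c, t)."""
    gold = [g for _, _, g in examples]
    best = (-1.0, None, None)
    for i in range(steps + 1):
        w_c = i / steps
        for j in range(steps + 1):
            t = j / steps
            predicted = [rel_score(c, l, w_c) >= t for c, l, _ in examples]
            score = f1_related(gold, predicted)
            if score > best[0]:  # keep the first best setting found
                best = (score, w_c, t)
    return best

# Toy validation examples: (normalized cosine, LexNET related score, gold label).
toy_validation = [
    (0.9, 0.8, True),
    (0.4, 0.9, True),
    (0.7, 0.2, False),
    (0.3, 0.1, False),
]

best_f1, best_w_c, best_t = tune(toy_validation)
```

Setting `w_c` to 1 or 0 in `rel_score` recovers the two baselines mentioned above (cosine only and LexNET only).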

Subtask 2: Semantic Relation Classification

The subtask’s train set is highly imbalanced towards random instances (roughly 10 times more than any other relation), and training any supervised method leads to overfitting to the random class. We therefore trained the model only on the related classes (excluding random pairs), for which the classes are more balanced. At inference time, we used the model from subtask 1 to assign a relatedness score to each pair, $\operatorname{Rel}(x,y)$, and computed the class distribution using the model from subtask 2 only for pairs that were related according to this score.

Finally, we applied a heuristic: if, for a word pair $(x,y)$, the difference in scores between the top scoring classes is low ($<0.2$) and the top class is syn, then the pair is only classified as syn if the number of paths between $x$ and $y$ is smaller than 3. This is due to the fact that synonyms are hard to recognize with both distributional and path-based approaches [Shwartz and Dagan, 2016], but it is known that they do not tend to co-occur.
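The two-stage inference and the synonym heuristic can be sketched as follows. The function and parameter names are hypothetical, as are the label strings other than syn and random. The text does not say which label is assigned when a syn prediction is rejected; falling back to the second-best class is an assumption of this sketch.

```python
def classify_pair(relatedness, threshold, class_scores, num_paths,
                  margin=0.2, max_syn_paths=3):
    """Pipeline: relatedness filter first, then relation classification.

    relatedness   -- score from the subtask 1 model
    class_scores  -- dict mapping relation label to score (subtask 2 model)
    num_paths     -- number of dependency paths connecting x and y
    """
    if relatedness < threshold:
        return "random"
    ranked = sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_score), (second_label, second_score) = ranked[0], ranked[1]
    low_margin = (top_score - second_score) < margin
    # Synonyms rarely co-occur, so many connecting paths argue against syn.
    if top_label == "syn" and low_margin and num_paths >= max_syn_paths:
        return second_label  # assumed fallback, not specified in the text
    return top_label

# Toy scores, invented for illustration only.
close_scores = {"syn": 0.45, "hyper": 0.40, "ant": 0.15}
clear_scores = {"syn": 0.70, "hyper": 0.20, "ant": 0.10}

filtered = classify_pair(0.2, 0.5, close_scores, num_paths=0)
rejected_syn = classify_pair(0.9, 0.5, close_scores, num_paths=5)
kept_syn = classify_pair(0.9, 0.5, close_scores, num_paths=1)
confident_syn = classify_pair(0.9, 0.5, clear_scores, num_paths=5)
```

The heuristic only intervenes when the classifier is undecided: a confident syn prediction is kept regardless of the number of paths.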

To compare LexNET’s performance on the validation set with other methods’ performances, we adapted the distributional baseline employed by ?) and ?), where a classifier is trained on the combination of $x$ and $y$’s word embeddings. We experimented with several combination methods (concatenation [Baroni et al., 2012], difference [Fu et al., 2014, Weeds et al., 2014], and ASYM [Roller et al., 2014]), regularization factors, and pre-trained word embeddings [Mikolov et al., 2013, Pennington et al., 2014, Levy and Goldberg, 2014]. This time, we used cosine similarity (subtask 1) to separate related from unrelated pairs, and trained the classifier only to distinguish between the related classes. Similarly to subtask 1, we tuned LexNET and the baseline’s hyper-parameters on the validation set. The best performance is reported in Table 1.
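The three combination methods can be sketched as feature constructors over a pair of word vectors. This is a simplified illustration: the function names are hypothetical, the direction of the difference ($y$ minus $x$) is a sign convention chosen here, and ASYM is rendered as the difference concatenated with its element-wise square, which is our reading of Roller et al. [2014] without any vector normalization step.

```python
def concatenation_features(vx, vy):
    """Concatenate the two word vectors."""
    return list(vx) + list(vy)

def difference_features(vx, vy):
    """Element-wise difference (y minus x; the direction is a convention)."""
    return [b - a for a, b in zip(vx, vy)]

def asym_features(vx, vy):
    """Difference followed by squared difference (simplified ASYM reading)."""
    diff = difference_features(vx, vy)
    return diff + [d * d for d in diff]

# Toy 2-dimensional vectors, invented for illustration only.
vx = [1.0, 2.0]
vy = [3.0, 5.0]

concat_vec = concatenation_features(vx, vy)
diff_vec = difference_features(vx, vy)
asym_vec = asym_features(vx, vy)
```

Each feature vector would then be passed to a standard supervised classifier; the choice of classifier and regularization is left open here.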

Results and Analysis

Table 2 displays the performance of our methods and the baselines on the test set. In addition to the two baselines provided by the shared task organizers (majority and random), we also report the results of our baselines detailed in Section 4. The majority baseline classifies all the instances as unrelated (subtask 1) or random (subtask 2). Since these labels are excluded from the averaged $F_{1}$ computation, this baseline’s performance metrics are all zero.
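The zero score of the majority baseline follows directly from how the average is formed. A minimal sketch, assuming a support-weighted average over the remaining classes (the weighting scheme is an assumption; the text only states that the excluded label does not enter the average):

```python
def averaged_f1(gold, predicted, excluded):
    """Support-weighted F1 over all gold labels except the excluded one."""
    labels = sorted(set(gold) - {excluded})
    total_support = sum(1 for g in gold if g != excluded)
    if total_support == 0:
        return 0.0
    weighted_sum = 0.0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weighted_sum += f1 * (tp + fn)  # weight by the label's gold support
    return weighted_sum / total_support

# Toy gold labels, invented for illustration only.
toy_gold = ["random", "random", "random", "syn", "hyper"]
majority_predictions = ["random"] * len(toy_gold)

majority_f1 = averaged_f1(toy_gold, majority_predictions, excluded="random")
perfect_f1 = averaged_f1(toy_gold, toy_gold, excluded="random")
```

A system that always predicts the excluded label never produces a true positive for any scored class, so every per-class $F_{1}$ is zero and so is the average.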

Cos achieves fairly good performance ($F_{1}=0.747$), and LexNET+Cos slightly improves upon it. To better understand LexNET’s contribution, we examined pairs that were correctly classified by LexNET+Cos while being incorrectly classified by Cos. Out of the 57 pairs that were true negative in LexNET and false positive in Cos, we judged only one as related in some way ((death, man)).

We sampled 25 of the 184 true positive pairs in LexNET+Cos that were false negatives in Cos, and found that they were all connected via paths in the corpus, suggesting that LexNET’s contribution comes also from the path-based component, rather than only from adding distributional information. Twelve of the pairs contained a polysemous term, for which the relation holds in a specific sense (e.g. (fire, shoot)). Five other pairs had a weak relation, e.g. (compact, car). While a car can be compact, neither of these words is among the words most related to the other: car is mostly related to driver, cars, and race, and compact to compactness and locally. As noted by ?), these are cases in which distributional methods may fail to identify the relation between the words, while even a single meaningful path connecting $x$ and $y$ can capture the relation between them.

Subtask 2: Semantic Relation Classification

We note that the overall results on this task are low, in contrast to the success of several methods on common datasets [Baroni et al., 2012, Weeds et al., 2014, Roller et al., 2014, Shwartz and Dagan, 2016]. One possible explanation is the stricter and more informative evaluation, which considers the random class as noise, discarding it from the $F_{1}$ average. (When the random class is included in the averaged $F_{1}$ score, the results are: P = 0.780, R = 0.786, $F_{1}$ = 0.781.) Additionally, the dataset is lexically split, preventing lexical memorization [Levy et al., 2015]. However, the strict evaluation sheds light on the difficulty of this task, which was somewhat obfuscated by the strong results published so far; these might have been obtained thanks to dataset and evaluation peculiarities [Levy et al., 2015, Santus et al., 2016b, Shwartz and Dagan, 2016].

Figure 1 displays LexNET’s per-relation $F_{1}$ scores on the test set, with the corresponding confusion matrix. While the $F_{1}$ scores of individual classes are relatively low, the confusion matrix shows that pairs were always classified to the correct relation more than to any other class. A common error comes from subtask 1’s model: while most unrelated pairs were classified as unrelated, many related pairs were also classified as unrelated. This may be solved in the future by learning the two subtasks jointly rather than applying a pipeline.

Among the other relations, the performance on synonyms was the worst. The path-based component is weak in recognizing synonyms, which do not tend to co-occur. The distributional information causes confusion between synonyms and antonyms, since both tend to occur in the same contexts. Moreover, synonyms were also sometimes mistaken for hypernyms, as the difference between the two relations is often subtle [Shwartz et al., 2016].

Conclusion

We have presented our submission to the CogALex 2016 shared task on corpus-based identification of semantic relations. The submission is based on LexNET [Shwartz and Dagan, 2016], an integrated path-based and distributional method for semantic relation classification. LexNET was the best-performing system on subtask 2, demonstrating the utility of integrating the complementary path-based and distributional information sources in recognizing semantic relatedness.

We have shown that subtask 1 (word relatedness) reaches reasonable performance with cosine similarity, and is slightly improved when combined with LexNET, especially when the relation between the words is non-prototypical. The performance on subtask 2, however, was relatively low for all systems that participated in the shared task, including LexNET. This demonstrates the difficulty of the semantic relation classification task, and emphasizes the need to develop improved methods for this task, possibly using additional sources of information.

Acknowledgments

This work was partially supported by an Intel ICRI-CI grant, the Israel Science Foundation grant 880/12, and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).

References