FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, Maosong Sun

Introduction

Relation classification (RC) is an important task in NLP, aiming to determine the correct relation between two entities in a given sentence. Many approaches have been proposed for this task, including kernel methods Zelenko et al. (2002); Mooney and Bunescu (2006), embedding methods Gormley et al. (2015), and neural methods Zeng et al. (2014). The performance of these conventional models depends heavily on annotated data that is time-consuming and labor-intensive to produce, which makes them hard to generalize. Distant supervision is a primary approach to alleviating this problem for RC Mintz et al.; Riedel et al. (2010); Hoffmann et al. (2011); Surdeanu et al. (2012); Zeng et al. (2015); Lin et al. (2016). It heuristically aligns knowledge bases (KBs) and text to automatically annotate adequate amounts of training instances. We evaluate the model proposed by Lin et al. (2016), which is followed by the recent state-of-the-art methods Zeng et al. (2017); Ji et al. (2017); Huang and Wang (2017); Wu et al. (2017); Liu et al. (2017); Feng et al. (2018); Zeng et al. (2018), on the benchmark dataset NYT-10 Riedel et al. (2010). Though it achieves promising results on common relations, its performance on a relation drops dramatically when the number of training instances for that relation decreases. About $58\%$ of the relations in NYT-10 are long-tail, with fewer than $100$ instances. Furthermore, distant supervision suffers from the wrong labeling problem, which makes it harder to classify long-tail relations. Hence, it is necessary to study how to train RC models with insufficient training instances.

We formulate RC as a few-shot learning task in this paper, which requires models capable of handling a classification task with only a handful of training instances, as shown in Table 1. Many efforts have been devoted to few-shot learning. The early works Caruana (1995); Bengio (2012); Donahue et al. (2014) apply transfer learning methods to finetune models pre-trained on common classes containing adequate instances to uncommon classes with only a few instances. Then metric learning methods Koch et al. (2015); Vinyals et al. (2016); Snell et al. (2017) were proposed to learn the distance distributions among classes, so that similar classes are adjacent in the distance space. The metric methods also take advantage of non-parametric estimation to make models efficient and general. Recently, the idea of meta-learning has been proposed, which encourages models to learn fast-learning abilities from previous experience and rapidly generalize to new concepts. Many meta-learning models Ravi and Larochelle (2017); Santoro et al. (2016); Finn et al. (2017); Munkhdalai and Yu (2017) achieve state-of-the-art results on several few-shot benchmarks.

Though meta-learning methods are developing fast, most of these works evaluate on two popular datasets, Omniglot Lake et al. (2015) and mini-ImageNet Vinyals et al. (2016). Both datasets concentrate on image classification. Many works in NLP mainly focus on the zero-shot/semi-supervised scenario Xie et al. (2016); Ma et al. (2016); Carlson et al. (2009), which incorporates extra information to classify objects never appearing in the training sets. However, the few-shot scenario requires models to classify objects with few instances and without any extra information. Recently, Yu et al. (2018) propose a multi-metric method for few-shot text classification. However, systematic research on adopting few-shot learning for NLP tasks is still lacking. We propose FewRel: a new large-scale supervised Few-shot Relation Classification dataset. To address the wrong labeling problem in most distantly supervised RC datasets, we apply crowd-sourcing to manually remove the noise. Many previous works, such as (Roth et al., 2013; Luo et al., 2017; Xin et al., 2018), have worked on automatically removing noise from distant supervision. Instead, we use crowd-sourcing methods to achieve high accuracy.

Besides constructing the dataset, we systematically implement the most recent state-of-the-art few-shot learning methods and adapt them for RC. We conduct a detailed evaluation of all these models on our dataset. Though the state-of-the-art few-shot learning methods perform much worse than humans on our challenging dataset, they significantly outperform the vanilla RC models, indicating that incorporating few-shot learning is promising and needs further research. In summary, our contribution is three-fold:

(1) We formulate RC as a few-shot learning task, and propose a new large supervised few-shot RC dataset.

(2) We systematically adapt the most recent state-of-the-art few-shot learning methods for RC, which may further benefit other NLP tasks.

(3) We conduct a comprehensive evaluation of few-shot learning methods on our dataset, which indicates some promising research directions for RC.

FewRel Dataset

In this section, we describe the process of creating FewRel in detail. The whole procedure can be divided into two steps: (1) We create a large candidate set of sentences aligned to relations via distant supervision. (2) We ask human annotators to filter out the wrongly labeled sentences for each relation to obtain a clean RC dataset.

For the first step, we use Wikipedia as the corpus (whole Wikipedia articles, not just the first sentence) and Wikidata as the KB. Wikidata is a large-scale KB where many entities are already linked to Wikipedia articles. The articles in Wikipedia also contain anchors linking to each other. Thus it is convenient to align sentences in Wikipedia articles to KB facts in Wikidata. We also employ an entity linking technique to extract more unanchored entities in articles. We first adopt named entity recognition via spaCy (https://spacy.io/) to find possible entity mentions, then match each mention with the name of an entity in the KB, and link the mention to the entity if the match succeeds.

For each sentence $s$ in Wikipedia articles containing head and tail entities $e_{1}$ and $e_{2}$ , if there exists a Wikidata statement $(e_{1},e_{2},r)$ meaning $e_{1}$ and $e_{2}$ have the relation $r$ , we denote the $(s,e_{1},e_{2},r)$ tuple as an instance and add it to the candidate set. Empirically, many instances of a given relation contain the same entity pair. For such relations, classifiers may prefer memorizing the entity pairs in the training instances rather than grasping the sentence semantics. Therefore, in the candidate set of each relation, we only keep $1$ instance for each unique entity pair. Finally, we remove relations with fewer than $1000$ instances, and randomly keep $1000$ instances for each of the remaining relations. As a result, we get a candidate set of $122$ relations and $122,000$ instances.
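The filtering steps above can be sketched as follows. This is a minimal illustration and not the construction code used for FewRel: the function name `build_candidate_set`, its parameters, and the choice to keep the first instance seen for each entity pair are assumptions (the text only states that one instance per pair is kept).

```python
import random
from collections import defaultdict


def build_candidate_set(instances, min_per_relation=1000, keep_per_relation=1000, seed=0):
    """Build a candidate set from (sentence, head, tail, relation) tuples.

    Hypothetical helper: keeps one instance per unique entity pair, drops
    relations with too few instances, and randomly samples the rest.
    """
    rng = random.Random(seed)
    by_relation = defaultdict(dict)
    for sentence, head, tail, relation in instances:
        # Assumption: keep the first instance seen for each (head, tail) pair.
        by_relation[relation].setdefault((head, tail), (sentence, head, tail, relation))

    candidates = {}
    for relation, pairs in by_relation.items():
        unique = list(pairs.values())
        if len(unique) < min_per_relation:
            continue  # remove relations with too few instances
        size = min(keep_per_relation, len(unique))
        candidates[relation] = rng.sample(unique, size)
    return candidates
```

With the thresholds stated in the text, both `min_per_relation` and `keep_per_relation` would be $1000$.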

Human Annotation

Next, we invite well-educated annotators to filter the raw data on a platform that we developed ourselves, similar to Amazon MTurk. The platform presents each annotator with one instance at a time, showing the sentence, the two entities in the sentence, and the corresponding relation labeled by distant supervision. The platform also provides the names of the entities and the relation in Wikidata, accompanied by a detailed description of that relation. The annotator is then asked to judge whether the relation can be deduced from the sentence semantics alone. We also ask the annotator to mark an instance as negative if the sentence is incomplete or the mention is falsely linked to the entity.

Relations are randomly assigned to annotators from the candidate set, and each annotator consecutively annotates $20$ instances of the same relation before switching to the next relation. To ensure labeling quality, each instance is labeled by at least two annotators. If the two annotators disagree on an instance, it is assigned to a third annotator. As a result, each instance has at least two identical annotations, which determine the final decision. After the annotation, we remove relations with fewer than $700$ positive instances. For the remaining $105$ relations, we calculate the inter-annotator agreement for each relation using the free-marginal multirater kappa Randolph (2005), and keep the top $100$ relations.
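The aggregation rule and the agreement statistic can be sketched as follows. The kappa uses the free-marginal form $\kappa_{free}=(P_{o}-1/k)/(1-1/k)$ with $k$ categories, where $P_{o}$ is the observed agreement. The function names, the boolean labels, and the way instances with two and three votes are averaged together are assumptions made for illustration; the text does not specify these details.

```python
from collections import Counter


def final_label(votes):
    """Return the label that received at least two votes (2 or 3 votes expected)."""
    label, count = Counter(votes).most_common(1)[0]
    if count < 2:
        raise ValueError("annotators disagree: a third annotation is needed")
    return label


def free_marginal_kappa(ratings, num_categories=2):
    """Free-marginal kappa over a list of per-instance vote lists.

    Assumption: per-instance agreement is the fraction of agreeing annotator
    pairs, averaged over instances even when the number of votes differs.
    """
    agreements = []
    for votes in ratings:
        n = len(votes)
        counts = Counter(votes).values()
        agreements.append(sum(c * (c - 1) for c in counts) / (n * (n - 1)))
    observed = sum(agreements) / len(agreements)
    chance = 1.0 / num_categories
    return (observed - chance) / (1.0 - chance)
```

For example, `final_label([True, False, False])` resolves a disagreement with the third vote, and a relation whose instances are all labeled unanimously gets a kappa of $1$.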

Dataset Statistics

The final FewRel dataset consists of $100$ relations, each with $700$ instances. A full list of relations, including their names and descriptions, is provided in Appendix A.2. The average number of tokens in each sentence is $24.99$ , and there are $124,577$ unique tokens in total. Following recent meta-learning tasks Vinyals et al. (2016), which use separate sets of classes for training and testing, we use $64$ , $16$ , and $20$ relations for training, validation, and testing respectively. Table 2 provides a comparison of our FewRel dataset to two other popular few-shot classification datasets, Omniglot and mini-ImageNet. Table 3 provides a comparison of FewRel to previous RC datasets, including the SemEval-2010 Task 8 dataset (Hendrickx et al., 2009), the ACE 2003-2004 dataset (Strassel et al., 2008), the TACRED dataset (Zhang et al., 2017), and the NYT-10 dataset (Riedel et al., 2010). While some RC datasets contain instances with no relations (negative), we ignore such instances for comparison.
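A split into disjoint relation sets can be sketched as follows. The function name `split_relations` and the random assignment with a fixed seed are assumptions for illustration; the text states only the sizes of the three sets, not how relations were assigned to them.

```python
import random


def split_relations(relations, sizes=(64, 16, 20), seed=0):
    """Split relation names into disjoint train/validation/test sets."""
    relations = list(relations)
    if len(relations) != sum(sizes):
        raise ValueError("number of relations must equal the sum of split sizes")
    rng = random.Random(seed)
    rng.shuffle(relations)  # assumption: relations are assigned at random
    n_train, n_val, _ = sizes
    train = relations[:n_train]
    val = relations[n_train:n_train + n_val]
    test = relations[n_train + n_val:]
    return train, val, test
```

Because the three sets share no relation, every test episode contains only relations the model has never been trained on.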

Experiments

We conduct comprehensive evaluations of vanilla RC models with simple strategies such as finetune or kNN on our new dataset. We also evaluate the recent state-of-the-art few-shot learning methods.

In few-shot relation classification, we intend to obtain a function $F:(\mathcal{R},\mathcal{S},x)\mapsto y$ . Here $\mathcal{R}=\{r_{1},\ldots,r_{m}\}$ defines the relations that the instances are classified into. $\mathcal{S}$ is a support set

$$\mathcal{S}=\{(x_{1}^{1},r_{1}),\ldots,(x_{1}^{n_{1}},r_{1}),\ldots,(x_{m}^{1},r_{m}),\ldots,(x_{m}^{n_{m}},r_{m})\}$$

including $n_{i}$ instances for each relation $r_{i}\in\mathcal{R}$ . For relation classification, a data instance $x_{i}^{j}$ is a sentence accompanied by a pair of entities. The query data $x$ is an unlabeled instance to classify, and $y\in\mathcal{R}$ is the prediction for $x$ given by $F$ .

In recent research on few-shot learning, the $N$ way $K$ shot setting is widely adopted. We follow this setting for the few-shot relation classification problem. To be exact, for $N$ way $K$ shot learning, $N=m=|\mathcal{R}|$ and $K=n_{1}=\cdots=n_{m}$ .

Experiment Settings

We consider four types of few-shot tasks in our experiments: 5 way 1 shot, 5 way 5 shot, 10 way 1 shot, 10 way 5 shot. Under this setting, we evaluate different few-shot training strategies and state-of-the-art few-shot learning methods built upon two widely used instance encoders, CNN Zeng et al. (2014) and PCNN Zeng et al. (2015).
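One way to draw a single $N$ way $K$ shot task is sketched below. The function name `sample_episode`, the dictionary layout of the data, and the choice of one query instance per episode are assumptions for illustration, not the evaluation code used in our experiments.

```python
import random


def sample_episode(data, n_way, k_shot, rng=random):
    """Sample one episode from a dict mapping relation -> list of instances.

    Returns (support, query, query_relation), where support maps each of the
    n_way sampled relations to k_shot instances and the query is held out.
    """
    relations = rng.sample(sorted(data), n_way)
    query_relation = rng.choice(relations)
    support = {}
    query = None
    for relation in relations:
        extra = 1 if relation == query_relation else 0
        picked = rng.sample(data[relation], k_shot + extra)
        support[relation] = picked[:k_shot]
        if extra:
            query = picked[k_shot]  # never part of the support set
    return support, query, query_relation
```

Calling `sample_episode(data, 5, 1)` or `sample_episode(data, 10, 5)` yields tasks for the 5 way 1 shot and 10 way 5 shot settings respectively.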

For both CNN and PCNN, the sentence is first converted into input vectors by transforming each word into the concatenation of its word embedding and position embeddings. In CNN, the input vectors pass through a convolution layer, a max-pooling layer, and a non-linear activation layer to produce the final sentence embedding. PCNN is a variant of CNN that replaces the max-pooling operation with a piecewise max-pooling operation.
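The difference between the two pooling operations can be sketched in plain Python. Following Zeng et al. (2015), piecewise max-pooling splits the convolution output into three segments around the two entity positions and pools each segment separately. The function names, the exact segment boundaries, and the zero padding for empty segments are assumptions made for this illustration.

```python
def max_pool(features):
    """Max over tokens for each feature dimension.

    features: list of per-token feature vectors (lists of floats).
    """
    return [max(column) for column in zip(*features)]


def piecewise_max_pool(features, head_pos, tail_pos):
    """Pool three segments split by the entity positions, then concatenate."""
    first, second = sorted((head_pos, tail_pos))
    segments = [
        features[:first + 1],            # up to and including the first entity
        features[first + 1:second + 1],  # up to and including the second entity
        features[second + 1:],           # after the second entity
    ]
    pooled = []
    for segment in segments:
        if segment:
            pooled.extend(max_pool(segment))
        else:
            # Assumption: an empty segment contributes a zero vector.
            pooled.extend([0.0] * len(features[0]))
    return pooled
```

The piecewise output is three times as wide as the plain max-pooled output, since each segment keeps its own maximum per feature dimension.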

To evaluate these two vanilla models on the few-shot RC task, we first consider two training strategies, namely Finetune and kNN. The Finetune baseline learns to classify all relations on the training set with CNN/PCNN, and then tunes parameters on the support set. We only tune the parameters of the output layer, and keep the other parameters unchanged. The kNN baseline also jointly classifies all relations during training, while at test time it uses the neural network to embed all the instances and then adopts k-nearest-neighbor (kNN) to classify the test instances.
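The test-time step of the kNN baseline can be sketched as follows, taking the instance embeddings as given. The function name `knn_classify`, the Euclidean distance, and the majority vote are assumptions for illustration; the text does not state which distance or value of $k$ is used.

```python
import math
from collections import Counter


def knn_classify(query_embedding, support_embeddings, support_labels, k=1):
    """Label a query by majority vote among its k nearest support embeddings."""
    # Assumption: Euclidean distance between encoder outputs.
    distances = [math.dist(query_embedding, e) for e in support_embeddings]
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    votes = Counter(support_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In the 1 shot settings each relation has a single support instance, so $k=1$ is the natural choice there.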

We also evaluate four recently proposed few-shot learning methods by adapting them to relation classification: Meta Network Munkhdalai and Yu (2017), GNN Satorras and Estrach (2018), SNAIL Mishra et al. (2018), and Prototypical Network Snell et al. (2017). We briefly describe these baselines in the subsection on baselines of few-shot learning models below; readers familiar with these methods can safely skip it. The hyperparameters of each model are selected via grid search on the validation set.

Human performance is also evaluated under the 5 way 1 shot and 10 way 1 shot settings. A human labeler is given $5$ or $10$ instances from different relations and one extra test instance, and is asked to decide which relation the test instance belongs to. Note that these labelers are provided with neither the names of the relations nor any extra information. Since the 5 way 5 shot and 10 way 5 shot settings are easier, we only evaluate human performance on 5 way 1 shot and 10 way 1 shot.

Baselines of Few-shot Learning Models

Meta Network

Meta Network Munkhdalai and Yu (2017) is a meta-learning algorithm utilizing a high-level meta learner on top of the traditional classification model, or base learner, to supervise the training process. The weights of the base learner are divided into two groups, fast weights and slow weights. Fast weights are generated by the meta learner, whereas slow weights are simply updated by minimizing the classification loss. The fast weights are expected to help the model generalize to new tasks with very few training instances.

GNN

GNN Satorras and Estrach (2018) tackles the few-shot learning problem by considering each supporting instance or query instance as a node in a graph. For the instances in the support set, label information is also embedded into the corresponding node representations. Graph neural networks are then employed to propagate information between nodes. A query instance is expected to receive information from the support set in order to make the classification. In our adaptation, the instances are encoded by CNNs, and the labels are represented by one-hot encoding.
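The initial node representations can be sketched as follows, with each support node formed by concatenating the instance embedding and the one-hot label. The function name `build_nodes` and the uniform label vector used for the unlabeled query node are assumptions for illustration; the text above does not state how the query node's label slot is filled.

```python
def build_nodes(support, query_embedding, relations):
    """Build initial graph node features.

    support: list of (embedding, relation) pairs.
    relations: ordered list of the relations in the episode.
    """
    index = {relation: i for i, relation in enumerate(relations)}
    nodes = []
    for embedding, relation in support:
        one_hot = [0.0] * len(relations)
        one_hot[index[relation]] = 1.0
        nodes.append(list(embedding) + one_hot)
    # Assumption: the query has no label, so it gets a uniform label vector.
    uniform = [1.0 / len(relations)] * len(relations)
    nodes.append(list(query_embedding) + uniform)
    return nodes
```

The graph neural network layers then operate on these node vectors; the propagation itself is not shown here.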

SNAIL

SNAIL Mishra et al. (2018) is a meta-learning model that utilizes temporal convolutional neural networks and attention modules for fast learning from past experience. SNAIL arranges all the supporting instance-label pairs into a sequence and appends the query instance behind them. This order agrees with the temporal order of the learning process, in which we learn information by reading supporting instances before making predictions for unlabeled instances. Temporal convolution (a 1-D convolution) is then performed along the sequence to aggregate information across different time steps, and a causally masked attention model is used over the sequence to aggregate useful information from earlier instances to later ones.
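The sequence layout and the causal mask can be sketched as follows. The function names and the `None` placeholder for the missing query label are assumptions for illustration; the temporal convolution and attention layers themselves are not shown.

```python
def build_sequence(support_pairs, query_instance):
    """Place the (instance, label) support pairs first and the query last."""
    # Assumption: the unlabeled query carries a None placeholder as its label.
    return list(support_pairs) + [(query_instance, None)]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]
```

Under this mask the query, being the last element, can attend to every supporting pair, while no supporting pair can attend to the query.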

Prototypical Networks

Prototypical Network Snell et al. (2017) is a few-shot classification model based on the assumption that there exists a prototype for each class. The model tries to find the prototypes of the classes from the supporting instances, and compares the distance between the query instance and each prototype under a certain distance metric. Prototypical network learns an embedding function $u$ to embed each class’s instances, and computes each prototype by averaging over all the output embeddings of the instances in the support set $\mathcal{S}$ that are labeled with the corresponding class.
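Given the embeddings produced by $u$, prototype computation and classification can be sketched as follows. The function names and the Euclidean distance are assumptions for illustration; the learned embedding function is not shown.

```python
import math


def compute_prototypes(support):
    """Average the embeddings of each class.

    support: dict mapping relation -> list of embedding vectors.
    """
    prototypes = {}
    for relation, embeddings in support.items():
        prototypes[relation] = [sum(column) / len(embeddings) for column in zip(*embeddings)]
    return prototypes


def classify(query_embedding, prototypes):
    """Return the relation whose prototype is closest to the query."""
    # Assumption: Euclidean distance as the metric.
    return min(prototypes, key=lambda r: math.dist(query_embedding, prototypes[r]))
```

In the 1 shot settings each prototype is simply the embedding of the single supporting instance of that relation.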

Result Analysis and Future Work

We report evaluation results in Table 4. In our preliminary experiments, PCNN with few-shot learning methods performs 3-10 percentage points worse than CNN, so only CNN results are shown for these methods. From the results, we observe that integrating few-shot learning methods into CNN significantly outperforms CNN/PCNN with finetune or kNN, which means adapting few-shot learning methods for RC is promising. However, there are still huge gaps between their performance and that of humans, which means our dataset is a challenging testbed for both relation classification and few-shot learning.

In this paper, we propose a new large and high-quality dataset, FewRel, for the few-shot relation classification task. This dataset provides a new point of view for RC, and also a new benchmark for few-shot learning. Through the evaluation of different few-shot learning methods, we find that even the best model performs much worse than humans, which suggests there is still considerable room for few-shot learning methods to improve.

The most challenging characteristic of our dataset is the diversity in expressing the same relation. We provide some examples from FewRel in Table 5, showing the different reasoning modes needed for classifying some instances. Future research may consider incorporating commonsense knowledge or improved causal modules.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (NSFC No. 61572273, 61532010). This work is also funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008 / DFG TRR-169. Hao Zhu is supported by Tsinghua University Initiative Scientific Research Program. We thank all annotators for their hard work. We also thank all members of Tsinghua NLP Lab for their strong support in annotator recruitment.