Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment

Masatoshi Tsuchiya

Introduction

The quality of training data is one of the crucial problems when a learning-centered approach, including a neural network (NN), is employed. [Reidsma and Carletta, 2008] demonstrated that annotation errors in a dialog act corpus, when they follow a certain systematic pattern, mislead the learning result of a Bayesian network. [Zhang et al., 2017] showed that the capacity of a NN model is large enough to memorize an entire data set by brute force, even if its labels are random. Thus, the influence of a systematic pattern on a NN model may be more serious than its influence on other learning-centered methods.

Both a method to improve the quality of training data and a metric to evaluate its reliability are important. As the former, a majority vote of human annotators has been widely used [Sabou et al., 2014, Bowman et al., 2015, Al Khatib et al., 2016]. As the latter, many kinds of inter-annotator agreement metrics based on multiple human annotation results have been employed, because direct assessment of data quality is difficult [Craggs and Wood, 2005, Artstein and Poesio, 2008, Mathet et al., 2012].

This paper proposes a new empirical method to investigate the quality of a large corpus designed for the recognizing textual entailment (RTE) task [Condoravdi et al., 2003, Bos and Markert, 2005, MacCartney and Manning, 2009, Marelli et al., 2014a]. The proposed method, which is inspired by a statistical hypothesis test, assesses the quality of a target corpus directly and, unlike the existing metrics, does not depend on multiple human annotation results. The proposed method consists of two phases: the first phase is to introduce the predictability of textual entailment (TE) labels as a null hypothesis which is extremely unacceptable when a target corpus has no hidden bias, and the second phase is to test the null hypothesis for the target corpus using a Naive Bayes (NB) model.

In order to demonstrate the effectiveness of the proposed method, we investigate two RTE corpora: the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015] and the Sentences Involving Compositional Knowledge (SICK) corpus [Marelli et al., 2014b]. Although the experimental result for the SICK corpus rejects the null hypothesis, the result for the SNLI corpus does not reject it. Thus, the SNLI corpus has a hidden bias that allows prediction of TE labels from hypothesis sentences even if no context information is given by a premise sentence. A further experiment shows that this hidden bias creates the risk that a NN model for RTE works as an entirely different model from the one its constructor expects.

The major contributions of this paper are the following three points:

This paper proposes a new empirical method to reveal a hidden bias of a large RTE corpus (Section 2).

This paper applies the proposed method to the SNLI corpus and the SICK corpus, and reveals the hidden bias of the SNLI corpus (Section 3).

This paper also shows that this hidden bias creates the risk that a NN model proposed for RTE works as an entirely different model from the one its constructor expects (Section 4).

Proposed Method

This section describes the details of the proposed method, which consists of two phases.

The first phase of the proposed method is to derive a null hypothesis which is extremely unacceptable when a target corpus has no hidden bias. For this phase, we focus on the task definition of the target corpus.

[Marelli et al., 2014a] defined RTE as a task to partition relationships between a premise sentence and a hypothesis sentence into three categories: entailment, neutral and contradiction. Consider the example sentences shown in Figure 1. When $s_{1}$ is given as a premise sentence and $s_{h}$ is given as a hypothesis sentence, the relationship between $s_{1}$ and $s_{h}$ is labeled entailment. The relationship between $s_{2}$ and $s_{h}$ is labeled neutral, and the relationship between $s_{3}$ and $s_{h}$ is labeled contradiction. These examples indicate that the TE label is determinable if and only if context information is given by a premise sentence. Based on this observation, the null hypothesis of the proposed method is defined as follows:

The TE label of the hypothesis sentence is determinable without the premise sentence.

Because this null hypothesis looks extremely unacceptable under the task definition, it is well suited to revealing a hidden bias of the target RTE corpus.

2. TE Label Prediction Model

The second phase of the proposed method is to test statistical significance of the null hypothesis for a target corpus. This phase requires two models: the first model is the statistical model of the null hypothesis, henceforth referred to as the TE label prediction model, and the second model is the statistical model of the alternative hypothesis, henceforth referred to as the baseline model.

The TE label prediction model is a model which predicts TE labels for hypothesis sentences without context information from premise sentences. This paper employs a multinomial NB model [Wang and Manning, 2012] as the TE label prediction model, which is defined by the following equation:

$$\hat{y} = \operatorname*{arg\,max}_{y} P(y) \prod_{i} P(x_{i} \mid y)$$

where $y$ is a TE label and $x_{i}$ is a feature. This paper simply employs all word unigrams of hypothesis sentences as features.
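As an illustration only, a multinomial NB classifier over hypothesis unigrams can be sketched in a few lines of plain Python. The whitespace tokenizer, the lowercasing, and the add-`alpha` smoothing constant are assumptions made for this sketch and are not taken from the paper; the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(hypotheses, labels, alpha=1.0):
    """Fit a multinomial NB model on word unigrams of hypothesis sentences.

    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in zip(hypotheses, labels):
        tokens = sentence.lower().split()  # naive tokenizer (assumed)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(label_counts.values())
    log_prior = {y: math.log(c / total) for y, c in label_counts.items()}
    log_likelihood = {}
    for y in label_counts:
        denom = sum(word_counts[y].values()) + alpha * len(vocab)
        log_likelihood[y] = {
            w: math.log((word_counts[y][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, hypothesis):
    """Return the label maximizing log P(y) + sum_i log P(x_i | y).

    Words never seen in training are skipped. No premise is consulted.
    """
    log_prior, log_likelihood = model
    tokens = hypothesis.lower().split()
    scores = {}
    for y in log_prior:
        known = (log_likelihood[y][w] for w in tokens if w in log_likelihood[y])
        scores[y] = log_prior[y] + sum(known)
    return max(scores, key=scores.get)
```

Note that `predict_nb` receives only the hypothesis sentence, which is exactly the condition the null hypothesis describes.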

The baseline model assigns TE labels to hypothesis sentences when no information is given by either premise sentences or hypothesis sentences and only the TE label distribution $P(y)$ of the target corpus is available. In such a case, it is reasonable to assign the TE label which occurs most frequently in the target corpus to all hypothesis sentences. This baseline assignment is defined as follows:

$$\hat{y} = \operatorname*{arg\,max}_{y} P(y)$$

If there is a statistically significant difference between the TE label prediction model and the baseline model, the null hypothesis is not rejected for the target corpus, which indicates that the target corpus contains a hidden bias. Otherwise, the null hypothesis is rejected for the target corpus.

Experiment

This section presents the detailed experimental conditions and the experimental results. The highlight of these results is that the TE label prediction model achieves 63.3% accuracy for the SNLI corpus.

Table 1 shows the performance of the TE label prediction models trained and tested on the two RTE corpora. Scikit-learn [Pedregosa et al., 2011] is employed to implement the TE label prediction model. The TE label prediction model, which is trained on the SNLI training hypothesis sentences and their TE labels, achieves 63.3% accuracy on the SNLI test hypothesis sentences without premise sentences. The baseline model based on the SNLI TE label distribution achieves 34.3% accuracy on the same hypothesis sentences. The sign test indicates that there is a statistically significant difference between these models ( $p=5.7\times 10^{-202}$ ). On the other hand, the performance of the TE label prediction model trained and tested on the SICK corpus is close to the performance of the baseline model (56.7%). The sign test indicates that there is no statistically significant difference between these models ( $p=0.65$ ).
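The comparison between the two models can be sketched as follows. This is one common form of the paired sign test (an exact two-sided binomial test on the examples where exactly one model is correct); the paper does not state which variant was used, so the tie handling and the function names `majority_label` and `sign_test` are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def majority_label(train_labels):
    """Baseline: the single most frequent TE label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def sign_test(gold, pred_a, pred_b):
    """Two-sided exact sign test on paired per-example correctness.

    Examples where both models are right or both are wrong are discarded.
    """
    triples = list(zip(gold, pred_a, pred_b))
    a_only = sum(1 for g, a, b in triples if a == g and b != g)
    b_only = sum(1 for g, a, b in triples if b == g and a != g)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Here `pred_a` would hold the TE label prediction model's outputs and `pred_b` would be the majority label repeated for every test sentence.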

Figure 2 clearly shows the difference between the behavior of the TE label prediction model trained on the SNLI corpus and that of the model trained on the SICK corpus. The left matrix is obtained by the model trained and tested on the SNLI corpus, and the right matrix is obtained by the model trained and tested on the SICK corpus. The SICK model simply assigns the majority TE label ‘neutral’ to almost all hypothesis sentences without making a real prediction, whereas the SNLI model tries to predict an appropriate TE label for each individual hypothesis sentence.

These results indicate that the null hypothesis is rejected for the SICK corpus, but it is not rejected for the SNLI corpus. Therefore, hypothesis sentences of the SNLI corpus have a hidden bias to allow prediction of their TE labels without premise sentences.

Discussion

As described in Section 3, hypothesis sentences of the SNLI corpus have a hidden bias that allows prediction of their TE labels without premise sentences. The magnitude of the performance impact caused by the hidden bias is important, because the SNLI corpus is widely used as training data by many NN models for RTE [Bowman et al., 2015, Rocktäschel et al., 2015, Yin et al., 2016, Mou et al., 2016a, Wang and Jiang, 2016, Liu et al., 2016a, Liu et al., 2016b, Cheng et al., 2016, Parikh et al., 2016, Sha et al., 2016]. This section discusses the performance impact of the hidden bias on the NN models.

The test pairs of the SNLI corpus are divided into two subsets using the TE label prediction model trained on the SNLI corpus. The first subset is the empirical easy test set $E_{e}$ , which consists of all test pairs whose TE labels are predicted correctly by the TE label prediction model. The second subset is the empirical hard test set $H_{e}$ , which consists of the remaining pairs. Table 2 shows the classification result: 63.3% of the SNLI test pairs were classified as $E_{e}$ , and the remaining pairs were classified as $H_{e}$ .
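The partition into $E_{e}$ and $H_{e}$ amounts to a single pass over the test pairs. In the minimal sketch below, `predict` stands for any hypothesis-only classifier; the `(premise, hypothesis, label)` tuple layout and the name `split_easy_hard` are hypothetical.

```python
def split_easy_hard(test_pairs, predict):
    """Partition (premise, hypothesis, label) triples into E_e and H_e.

    A pair is 'easy' when the hypothesis-only model predicts its label
    correctly; the premise is never shown to the model.
    """
    easy, hard = [], []
    for premise, hypothesis, label in test_pairs:
        target = easy if predict(hypothesis) == label else hard
        target.append((premise, hypothesis, label))
    return easy, hard
```

By construction, the fraction of pairs landing in `easy` equals the accuracy of the hypothesis-only model on the test set.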

2. Definitions of NN Models for RTE

Two NN models are prepared to evaluate the performance impact caused by the hidden bias. The first model (henceforth denoted as the parallel LSTM model) was proposed by [Bowman et al., 2015] for RTE, and was used by [Mou et al., 2016b] to evaluate the performance difference between RTE corpora. This model is defined by the following equations.

The first step is to convert a premise sentence $\bm{x}_{p}$ and a hypothesis sentence $\bm{x}_{h}$ into embedding vectors using the word embedding matrix $W_{e}$ , which is initialized with the 300-dimensional reference GloVe vectors [Pennington et al., 2014]. The second step is to convert the embedding vectors into two 100-dimensional sentence vectors with LSTMs and to concatenate them into a 200-dimensional vector. The remaining steps are to predict a TE label with three tanh fully connected layers and then to apply the softmax function.
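The steps described above can be written compactly as follows. This is only a sketch consistent with the description: the symbols $\bm{v}_{p}$, $\bm{v}_{h}$, $\bm{z}_{k}$, $W_{k}$, $\bm{b}_{k}$, and the output projection $W_{o}$, $\bm{b}_{o}$ are assumed notation, and whether a separate projection precedes the softmax is also an assumption.

```latex
\begin{aligned}
\bm{v}_{p} &= \mathrm{LSTM}_{p}(W_{e}\,\bm{x}_{p}) \in \mathbb{R}^{100}, \qquad
\bm{v}_{h} = \mathrm{LSTM}_{h}(W_{e}\,\bm{x}_{h}) \in \mathbb{R}^{100} \\
\bm{z}_{0} &= [\bm{v}_{p}; \bm{v}_{h}] \in \mathbb{R}^{200} \\
\bm{z}_{k} &= \tanh(W_{k}\,\bm{z}_{k-1} + \bm{b}_{k}), \qquad k = 1, 2, 3 \\
\hat{\bm{y}} &= \mathrm{softmax}(W_{o}\,\bm{z}_{3} + \bm{b}_{o})
\end{aligned}
```

The premise and the hypothesis interact only after the concatenation, so the hypothesis branch alone can in principle carry a label signal to the classifier.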

The second model (henceforth denoted as the sequential LSTM model), which was proposed by [Rocktäschel et al., 2015] for RTE, is defined as follows.

In the second model, two LSTMs are sequentially connected. Thus, unlike in the parallel LSTM model, the memory cells of these LSTMs can be regarded as directly modeling a recognition process. All vectors of the sequential LSTM model are 100-dimensional. Although [Rocktäschel et al., 2015] also proposed variants with attention between a premise sentence and a hypothesis sentence, the attention-less model is employed in this experiment because of its simplicity.

3. Performance Impact of NN Models for RTE

This subsection presents the large performance drop of the NN models caused by the hidden bias.

Table 3 shows the experimental results of these NN models trained and tested on the SNLI corpus. Although both NN models achieve high accuracy on the whole test set and on the empirical easy test set $E_{e}$ , they achieve drastically lower accuracy on the empirical hard test set $H_{e}$ . These performance drops mean that a large portion of the high accuracy achieved by both NN models comes from the empirical easy test set $E_{e}$ .

Table 4 shows the performance achieved by the same NN models when all words of premise sentences are replaced by unknown word symbols. Because this replacement removes all context information from premise sentences, the performance of the NN models must drop close to the chance ratio if the NN models decide TE labels based on context information from premise sentences. Despite this expectation, both NN models achieve clearly higher performance than the chance ratio for $E_{e}$ (36.8%, shown in Table 2). This result indicates that both NN models do not work as RTE models for $E_{e}$ , but work as TE label prediction models for $E_{e}$ . This behavior of the NN models for $E_{e}$ must be entirely different from what their constructors expected.
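The premise ablation can be sketched as a simple preprocessing step applied to the test pairs before they are fed to a trained model. The symbol `<unk>`, the whitespace tokenization, and the function names are assumptions of this sketch; the actual unknown word symbol depends on the vocabulary of the model being tested.

```python
UNK = "<unk>"  # hypothetical unknown-word symbol


def mask_premise(premise, unk=UNK):
    """Replace every premise token with the unknown-word symbol.

    Sentence length is preserved; all lexical content is removed.
    """
    return " ".join(unk for _ in premise.split())


def mask_pairs(test_pairs, unk=UNK):
    """Apply the masking to (premise, hypothesis, label) triples.

    Hypotheses and labels are left untouched.
    """
    return [(mask_premise(p, unk), h, y) for p, h, y in test_pairs]
```

Any accuracy above the chance ratio on the masked pairs can then only come from the hypothesis side.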

4. Comparison of SNLI and SICK corpora

The SNLI and SICK corpora are similar in two respects. The first is their sentence domain, English scene descriptions. Both of them use the Flickr30k corpus [Young et al., 2014] as a source of their sentences. This similarity is also reflected in the small differences between their mean sentence token counts, as shown in Table 6. The second is their vocabulary. The out-of-vocabulary (OOV) ratio of SICK test pairs is 0.15% when the words of SNLI training pairs are regarded as known. This small OOV ratio indicates that SICK test pairs and SNLI training pairs are quite close from the viewpoint of their vocabulary.
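An OOV ratio of this kind can be computed as below. The sketch counts tokens rather than word types and uses lowercased whitespace tokens; both choices are assumptions, since the paper does not specify them, and the name `oov_ratio` is hypothetical.

```python
def oov_ratio(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training sentences."""
    known = {tok for s in train_sentences for tok in s.lower().split()}
    test_tokens = [tok for s in test_sentences for tok in s.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in known)
    return unseen / len(test_tokens)
```

For the comparison in the text, `train_sentences` would be the sentences of the SNLI training pairs and `test_sentences` those of the SICK test pairs.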

The SNLI and SICK corpora are different in the method of composing sentences. Hypothesis sentences of the SNLI corpus are composed by human workers, but all sentences of the SICK corpus are derived from original sentences using hand-crafted rules. We think that this difference may be a cause of the hidden bias revealed by this paper.

Conclusion

This paper proposes a new empirical method to investigate the quality of a large RTE corpus. The proposed method consists of two phases: the first phase is to introduce the predictability of TE labels as a null hypothesis, and the second phase is to test the null hypothesis using an NB model. The proposed method reveals a hidden bias of the SNLI corpus, which allows prediction of TE labels from hypothesis sentences without context information given by premise sentences.

This paper also shows that this hidden bias has a large performance impact on the NN models for RTE. One experimental result shows that a large portion of the high accuracy achieved by the NN models comes from the hidden bias. The other experimental result shows that a NN model trained on the SNLI corpus does not work as an RTE model, but works as a TE label prediction model, when biased test pairs are given. These results raise the risk that a complex NN model works as an entirely different model from the one its constructor expects.

Acknowledgments

A part of this research was supported by JSPS KAKENHI Grant No. 15K12097. I would like to express my sincere appreciation to Dr. Mitsuo Yoshida and Dr. Adam Meyers for their valuable comments.