Unifying Human and Statistical Evaluation for Natural Language Generation

Tatsunori B. Hashimoto, Hugh Zhang, Percy Liang

Introduction

Generating text is a core part of many NLP tasks such as image captioning (Lin et al., 2014), open-domain dialogue (Sordoni et al., 2015), story generation (Roemmele, 2016), and summarization (Nallapati et al., 2016). However, proper evaluation of natural language generation has proven difficult (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018). A good evaluation metric should capture not only the quality of generation but also its diversity, which is especially crucial for creative, open-ended tasks like dialogue or story generation.

Human evaluation, which is often viewed as the gold standard evaluation, captures quality but fails to capture diversity. As an example, for language modeling, a model that directly plagiarizes sentences from the training set would pass the human quality bar but would have zero generalization ability and thus inadequate diversity. On the other hand, statistical evaluation (i.e., perplexity on a reference test set) captures diversity, as it ensures a model must assign reasonable probability to novel sentences, but perplexity provides an inadequate measure of quality (Theis et al., 2015). For example, modifying a perfect model by removing its ability to generate even a single test sentence results in infinite perplexity even though the model is still near-perfect. Automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin and Rey, 2004) capture quality better than perplexity but still correlate poorly with human evaluation and fail to capture diversity (Novikova et al., 2017; Chaganty et al., 2018).

Existing approaches to combining statistical and human evaluation have been ad-hoc, leading to misleading performance measures. A common approach is to measure diversity through the perplexity of a probabilistic model and quality through human evaluation on beam-searched outputs. This gives the illusion that a single model is high-quality and diverse, while the reality is that it shows we can have either a diverse model (when sampling from the distribution used to compute perplexity) or a high-quality model (when beam-searching).

In this paper, we define the idealized evaluation metric as twice the error of the optimal discriminator for classifying sentences as coming from the reference distribution or the model (Section 2). If a model generates gibberish (low quality), the optimal discriminator can classify these accurately as coming from the model. If the reference distribution contains sentences the model cannot generate (low diversity), the optimal discriminator can classify these accurately as coming from the reference.

Unfortunately, the optimal discriminator is unavailable. Human discriminators cannot capture diversity effectively, and learned discriminators—e.g., from a Generative Adversarial Network (Goodfellow et al., 2014) or one trained on human judgments (Lowe et al., 2017)—are too unreliable to use for rigorous evaluation.

Our key result (Section 3) is based on the observation that the optimal classifier depends only on two numbers: the probability of a sentence under the model and the probability under the reference distribution. The former can be computed directly from the model, and we show that the latter can be well-approximated by human judgment scores. The resulting two-dimensional space is illustrated in Figure 1. We apply a simple $k$-nearest neighbor classifier in this space and define Human Unified with Statistical Evaluation (HUSE) as twice the leave-one-out error of this classifier.

We apply HUSE to four natural language generation tasks (Section 5): language modeling, chitchat dialogue, story generation, and summarization. First, we show that human evaluation alone is insufficient to discriminate model generations from the references, leading to inflated estimates of model performance. In contrast, HUSE is able to reveal deficiencies of current models. We also show that common techniques for improving sample quality such as annealing actually increase distinguishability between the model and reference due to losses in diversity.

Optimal Discriminator

Consider a natural language generation task where the model is given a context $x$ (e.g., a dialogue history) drawn from some prior $p(x)$ and must output a distribution over possible sentences $p_{\text{model}}(y\mid x)$. We define an idealized evaluation metric based on whether $p_{\text{model}}$ is close to a reference distribution $p_{\text{ref}}$, which is generally human-generated. While some tasks only care about quality and thus only require $p_{\text{model}}$ to place mass on some high-quality $y$, we demand that $p_{\text{model}}$ place mass on all high-quality $y$ as given by $p_{\text{ref}}$. This diversity is important for open-ended tasks such as dialogue or story generation. Also note that $p_{\text{ref}}$ need not be the human distribution, or match the training distribution. It can be defined as the distribution given by experts. Specifically, consider a random variable $y$ drawn from either the reference or the model based on an indicator $z\sim\text{Bernoulli}\left(\frac{1}{2}\right)$:

$$y\mid x,z\sim\begin{cases}p_{\text{ref}}(y\mid x)&\text{if }z=1\\ p_{\text{model}}(y\mid x)&\text{if }z=0.\end{cases}\tag{1}$$

Define $L^{*}$ to be twice the lowest possible error over any discriminator $f$ that attempts to determine $z$ based on $x$ and $y$:

$$L^{*}\stackrel{\text{def}}{=}2\inf_{f}\mathbb{P}\left[f(x,y)\neq z\right].\tag{2}$$

$L^{*}$ measures similarity between $p_{\text{model}}$ and $p_{\text{ref}}$; it is 0 if $p_{\text{model}}$ and $p_{\text{ref}}$ are disjoint and 1 if they are identical. Note that $L^{*}$ is a linear function of the total variation distance: $\|p_{\text{model}}-p_{\text{ref}}\|_{\text{TV}}\stackrel{\text{def}}{=}\frac{1}{2}\sum_{x,y}p(x)\left|p_{\text{model}}(y\mid x)-p_{\text{ref}}(y\mid x)\right|=1-L^{*}$. See Appendix A.1 for details.
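As a small illustration of this relationship, the sketch below computes $L^{*}$ and the total variation distance for toy discrete distributions at a single fixed context. The function names and the toy distributions are illustrative assumptions, not part of the paper, and the total variation distance is taken as half the $L_1$ distance.

```python
def optimal_discriminator_score(p_ref, p_model):
    """Return L* = 2 * (optimal error) for two distributions over outcomes y.

    p_ref and p_model are dicts mapping each outcome to its probability;
    a single fixed context x is assumed for simplicity.
    """
    outcomes = set(p_ref) | set(p_model)
    # The optimal discriminator errs with probability 1/2 * min(p_ref, p_model)
    # on each outcome, so twice the error is the total overlap.
    return sum(min(p_ref.get(y, 0.0), p_model.get(y, 0.0)) for y in outcomes)


def total_variation(p_ref, p_model):
    """Total variation distance, using the 1/2 * L1 convention."""
    outcomes = set(p_ref) | set(p_model)
    return 0.5 * sum(abs(p_ref.get(y, 0.0) - p_model.get(y, 0.0)) for y in outcomes)


# Toy distributions (hypothetical, for illustration only).
p_ref = {"a": 0.5, "b": 0.5}
candidates = {
    "identical": {"a": 0.5, "b": 0.5},
    "disjoint": {"c": 1.0},
    "collapsed": {"a": 1.0},  # every sample is high quality, but diversity is lost
}

for name, p_model in candidates.items():
    l_star = optimal_discriminator_score(p_ref, p_model)
    tv = total_variation(p_ref, p_model)
    print(f"{name}: L* = {l_star:.2f}, TV = {tv:.2f}, 1 - TV = {1 - tv:.2f}")
```

The "collapsed" case only ever produces a sentence the reference also produces, yet its score drops to 0.5 because half of the reference mass is unreachable.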

Unfortunately, $L^{*}$ is unattainable because it requires computing the optimal discriminator. In the spirit of the Turing Test, we could consider using the error rate of a human discriminator $f_{\text{hum}}$ instead, often considered the gold standard for evaluation. However, while humans might have knowledge of $p_{\text{ref}}$, they do not have full knowledge of $p_{\text{model}}$ and thus would have difficulties determining which sentences a model cannot generate.

As a concrete example, suppose $p_{\text{ref}}$ placed a uniform distribution over some set $S$. Without knowledge of $p_{\text{model}}$, the most sensible discriminator is to predict $z=1$ (reference) when $y\in S$. This discriminator achieves the same classification error of $0.5$ for both the perfect model $p_{\text{model}}=p_{\text{ref}}$ and one which can only return a single $y\in S$. We could try to reveal $p_{\text{model}}$ to humans by showing multiple samples simultaneously, but this is expensive and, as we will later see, unnecessary.

Another option is to learn $f$ over an expressive class of functions such as neural networks on data sampled from $p_{\text{model}}$ and $p_{\text{ref}}$. This is analogous to learning the discriminator in a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) or learning an evaluation metric from human judgments (Lowe et al., 2017). However, as $(x,y)$ are high-dimensional objects, training a good classifier is extremely difficult (and perhaps not significantly easier than solving the original generation problem). Indeed, learned evaluation metrics do not generalize very well (Lowe et al., 2017; Chaganty et al., 2018). Unlike these approaches, which seek to replace human evaluation, our focus will instead be on combining human and automatic statistical evaluation to estimate the optimal classifier error.

Human Unified with Statistical Evaluation (HUSE)

Our key result is that the optimal discriminator depends on $(x,y)$ only through a two-dimensional sufficient statistic (Section 3.1), motivating an approximation which we call HUSE (Section 3.2).

Note that the evaluation score $L(\phi)$ given by a feature map $\phi$ optimizes over all functions that depend on $\phi$ (3). Thus, the more information $\phi$ contains, the lower $L(\phi)$ is. This has two implications: First, any feature map $\phi$ yields an (optimistic) upper bound on $L^{*}$ (2), meaning that $L(\phi)$ might be able to detect when a model is poor but cannot certify that it is good. Second, adding features to $\phi$ can only improve this bound.

Let us consider the following two-dimensional feature map:

$$\phi_{\text{opt}}(x,y)\stackrel{\text{def}}{=}\left[p_{\text{ref}}(y\mid x),\;p_{\text{model}}(y\mid x)\right].$$

From the arguments above, it is clear that $L(\phi_{\text{opt}})\geq L^{*}$, but perhaps more surprisingly, we actually have equality:

Proposition 1. The two-dimensional feature map $\phi_{\text{opt}}$ achieves the optimal discriminator score: $L(\phi_{\text{opt}})=L^{*}$.

Proof. We compute the true posterior over $z$ given $x,y$. Since $p(z=1)=p(z=0)=\frac{1}{2}$, $p(y\mid x,z=1)=p_{\text{ref}}(y\mid x)$, and $p(y\mid x,z=0)=p_{\text{model}}(y\mid x)$, by Bayes' rule:

$$p(z=1\mid x,y)=\frac{p_{\text{ref}}(y\mid x)}{p_{\text{ref}}(y\mid x)+p_{\text{model}}(y\mid x)}.$$

The optimal discriminator simply predicts $z=1$ if $p_{\text{ref}}(y\mid x)>p_{\text{model}}(y\mid x)$ and $z=0$ otherwise. In other words, the decision boundary is given by $\phi_{\text{opt}}(x,y)_{1}>\phi_{\text{opt}}(x,y)_{2}$. ∎

More generally, we can obtain this equality with a wider class of $\phi$. It will hold exactly for any invertible transformation of $\phi_{\text{opt}}$ (Appendix Corollary 1), and approximately for any $\phi$ which has high mutual information with $\phi_{\text{opt}}$ (Appendix Theorem 1). This means that we can substitute $p_{\text{ref}}$ with noisy, possibly un-normalized estimates and still obtain accurate estimates of $L^{*}$.

3.2 HUSE features

While we can directly compute $p_{\text{model}}(y\mid x)$ for many probabilistic models, $p_{\text{ref}}(y\mid x)$ is unattainable, so $L(\phi_{\text{opt}})$ is not computable. However, the wisdom of the crowds (Surowiecki, 2004; Ungar et al., 2012) suggests that pooling together the judgments of many humans can often produce surprisingly reliable estimates of real-world probabilities such as $p_{\text{ref}}(y\mid x)$, even if no individual human is particularly reliable. With this motivation, we ask Amazon Mechanical Turk workers to rate a sentence from 1–5 based on how “typical” it is as a way to estimate $p_{\text{ref}}(y\mid x)$ (see Appendix A.3 for more details). We define $\text{HJ}(x,y)$ to be the average response over 20 crowdworkers. Figure 2 shows that for a language modeling task on the Reddit corpus, $\text{HJ}(x,y)$ strongly correlates with the actual log-frequency of $y$ in the corpus. (We used the Reddit corpus due to crowdworker familiarity, corpus size, and short average sentence length, which results in a wide range of sentence frequencies.) The high correlation suggests that human judgments $\text{HJ}(x,y)$ are a good surrogate for $\log p_{\text{ref}}$.

In addition, we found that rather than using the model probability $p_{\text{model}}(y\mid x)$ directly as a feature, normalizing by sentence length $\text{len}(y)$ yielded lower (tighter) scores. We therefore define the HUSE features as follows:

$$\phi_{\text{huse}}(x,y)\stackrel{\text{def}}{=}\left[\frac{\log p_{\text{model}}(y\mid x)}{\text{len}(y)},\;\text{HJ}(x,y)\right],$$

and define the (population) HUSE score as $L(\phi_{\text{huse}})$.
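A minimal sketch of how the two features might be computed for one example is shown below. It assumes the first feature is the length-normalized log-probability of the sentence under the model and that $\text{len}(y)$ is the number of tokens; the function name and the input format (per-token log-probabilities and a list of crowdworker ratings) are illustrative assumptions.

```python
def huse_features(token_logprobs, ratings):
    """Return (length-normalized log-probability, mean human judgment).

    token_logprobs: per-token log-probabilities of y under the model, so
        their sum is log p_model(y | x). (Assumed input format.)
    ratings: typicality ratings from individual crowdworkers.
    """
    if not token_logprobs or not ratings:
        raise ValueError("need at least one token and one rating")
    log_p_model = sum(token_logprobs)                # log p_model(y | x)
    normalized = log_p_model / len(token_logprobs)   # divide by len(y)
    hj = sum(ratings) / len(ratings)                 # HJ(x, y): average rating
    return normalized, hj


# Hypothetical three-token sentence rated by four workers.
features = huse_features([-2.0, -1.0, -3.0], [4, 5, 3, 4])
print(features)  # (-2.0, 4.0)
```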

3.3 Guarantees derived from HUSE

We now show that the HUSE score satisfies two nice properties: (i) HUSE does at least as well as human evaluation and (ii) a low HUSE score is sufficient to show that a model is far from the reference distribution.

To show (i), consider a feature map that only includes human evaluation: $\phi_{\text{hj}}(x,y)\stackrel{\text{def}}{=}[\text{HJ}(x,y)]$. Because $\phi_{\text{huse}}$ also incorporates human evaluation, $L(\phi_{\text{huse}})$ is always tighter (lower) than the human discriminator error $L(\phi_{\text{hj}})$:

$$L^{*}\leq L(\phi_{\text{huse}})\leq L(\phi_{\text{hj}}).$$

Furthermore, the main difference between $L(\phi_{\text{huse}})$ and $L^{*}$ is that the former uses $\text{HJ}(x,y)$ and the latter uses $p_{\text{ref}}$. But as we argued using Figure 2, $\text{HJ}(x,y)$ is strongly correlated with $p_{\text{ref}}$, and good approximations to $p_{\text{ref}}$ provide approximation guarantees for $L(\phi_{\text{huse}})$ (Appendix Theorem 1).

Evaluating models with HUSE

In this section, we show how we can estimate the error rate $L(\phi)$ from finite data (Section 4.1). We then show how the HUSE estimate $\hat{L}(\phi_{\text{huse}})$ can be decomposed into a score that measures quality (HUSE-Q) and a score that measures diversity (HUSE-D), which allows us to study quality-diversity tradeoffs (Section 4.2).

For any feature map $\phi$, we show how to produce an estimate of $L(\phi)$. Fix $n$ contexts $x_{1},\dots,x_{n}$. First, we draw $n$ examples $y_{1},\dots,y_{n}$ from the reference distribution $p_{\text{ref}}(y\mid x)$, which are usually human-generated sentences from a test set. We also draw $n$ examples $y_{1}^{\prime},\dots,y_{n}^{\prime}$ from the model $p_{\text{model}}(y\mid x)$ we wish to evaluate. Next, for each of the $2n$ examples $(x,y)$, we compute the feature map $\phi(x,y)$, which might involve evaluating the model probability $p_{\text{model}}(y\mid x)$ as well as collecting human judgments $\text{HJ}(x,y)$ from crowdworkers.

Finally, we compute the leave-one-out error of a classifier that tries to predict whether a given example $(x,y)$ comes from the reference distribution ($z=1$) or the model ($z=0$).

The classification problems for HUSE are two-dimensional, which allows us to accurately estimate error rates using a $k$-nearest neighbors classifier. We opt to use nearest neighbors classifiers as they are simple, require no training, and can asymptotically capture arbitrary continuous decision boundaries. Specifically, we set $k=16$ and define neighbors using $L_{2}$ distances over the feature vectors $\phi(x,y)$ scaled componentwise to have unit variance. The overall procedure for computing the estimate $\hat{L}(\phi)$ is formally defined in Algorithm 1.
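The estimation procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative reading of the text, not the authors' released implementation: the function name, the tie-breaking rule, and the handling of constant features are assumptions.

```python
import math
from statistics import pstdev


def huse_estimate(ref_features, model_features, k=16):
    """Twice the leave-one-out k-NN error for telling reference from model points.

    ref_features / model_features: lists of equal-length feature tuples
    phi(x, y) for reference (z = 1) and model (z = 0) examples.
    """
    points = [tuple(f) for f in ref_features] + [tuple(f) for f in model_features]
    labels = [1] * len(ref_features) + [0] * len(model_features)

    # Scale each feature dimension to unit variance; constant dimensions are
    # left unscaled to avoid dividing by zero (an assumption of this sketch).
    dims = len(points[0])
    scales = [pstdev([p[d] for p in points]) or 1.0 for d in range(dims)]
    scaled = [tuple(p[d] / scales[d] for d in range(dims)) for p in points]

    errors = 0
    for i, query in enumerate(scaled):
        # Leave-one-out: rank every other point by L2 distance to the query.
        others = sorted(
            (math.dist(query, other), labels[j])
            for j, other in enumerate(scaled)
            if j != i
        )
        neighbors = others[:k]
        votes = sum(label for _, label in neighbors)
        # Majority vote; ties go to the model class (an arbitrary choice).
        predicted = 1 if votes > len(neighbors) / 2 else 0
        errors += predicted != labels[i]
    return 2.0 * errors / len(points)


# Synthetic, well-separated clusters: the classifier never errs, so the score is 0.
ref = [(0.01 * i, 5.0) for i in range(20)]
model = [(10.0 + 0.01 * i, 1.0) for i in range(20)]
print(huse_estimate(ref, model, k=5))  # 0.0
```

Passing one-dimensional tuples containing only $\text{HJ}(x,y)$ gives the corresponding estimate for the human-judgment-only feature map.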

4.2 Quality-diversity decomposition

We now define the (empirical) HUSE score using the feature map $\phi_{\text{huse}}$:

$$\text{HUSE}\stackrel{\text{def}}{=}\hat{L}(\phi_{\text{huse}}).$$

We define the quality component of HUSE (HUSE-Q) similarly using human judgments alone:

$$\text{HUSE-Q}\stackrel{\text{def}}{=}\hat{L}(\phi_{\text{hj}}).$$

Since humans can detect quality defects in a model, any increase in error from removing $p_{\text{model}}$ must come from a model’s lack of diversity. Therefore, we define the diversity component (HUSE-D) as follows:

$$\text{HUSE-D}\stackrel{\text{def}}{=}1+\text{HUSE}-\text{HUSE-Q},$$

which implies the decomposition $(1-\text{HUSE-D})+(1-\text{HUSE-Q})=1-\text{HUSE}$. As long as the discriminators are non-degenerate (obtaining better performance than chance and $\text{HUSE}\leq\text{HUSE-Q}$), all scores are contained in $[0,1]$. Here, $\text{HUSE-D}=1$ implies that the model suffers no diversity defects, while $\text{HUSE-D}=0$ indicates that the examples could be discriminated perfectly due to a lack of diversity.

Experiments

We use HUSE to evaluate three different types of single-sentence natural language generation tasks: (i) unconditional and high entropy (language modeling); (ii) conditional and high entropy (story generation, chit-chat dialogue); and (iii) conditional and low entropy (summarization). We show that HUSE provides a direct and interpretable measure of diversity on high-entropy tasks, while also serving as a useful model diagnostic on low-entropy ones.

The four tasks along with the datasets and models are as follows:

Summarization: Giganews story to headline dataset and the pre-trained model from Gehrmann et al. (2018). The dataset consists of 3.8 million news story-headline pairs. Examples from this dataset are shown in Table 2.

Story generation: Last sentence generation for ROC stories Mostafazadeh et al. (2016) consisting of 96,198 examples of partially written four-sentence stories as input, and a single sentence which completes the story as the target. We use a standard OpenNMT model with global attention Klein et al. (2017).

Language modeling: One billion word benchmark pre-trained language model from Jozefowicz et al. (2016). The task consists of generating a single sentence from the one billion word newswire text distribution.

Chit-chat dialogue: Two-turn chit-chat dialogue dataset consisting of 37.3 million comment-response pairs from Reddit (Appendix A.4). Comments are generally short (5–15 tokens) and cover a single topic (e.g. given “wow how did i not notice that”, the response is “you were focusing on other things its understandable”). We train a convolutional model using fairseq Gehring et al. (2017).

For all the tasks, we train neural models and evaluate their diversity-quality tradeoffs as we change the decoding scheme for generation. Our primary evaluation concerns diversity trade-offs involving temperature annealing, which is a generation technique applicable to any probabilistic model that generates words sequentially. In temperature-annealed models, we sample a word $w$ proportional to $p^{1/t}(w)$, where $p$ is the model probability of $w$ given previous words and $t$ is the temperature parameter. We excluded beam search since it qualitatively behaves similarly to temperature annealing with low temperatures, with $\text{HUSE}\approx 0$ due to beam search being extremely under-diverse.
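The annealing step can be sketched as follows for a single next-word distribution. The function names and the toy vocabulary are illustrative assumptions; a real model would supply the probabilities conditioned on the previous words.

```python
import random


def anneal(probs, t):
    """Return the distribution proportional to p(w) ** (1 / t)."""
    weights = [p ** (1.0 / t) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_word(vocab, probs, t, rng=random):
    """Sample one word from the temperature-annealed distribution."""
    return rng.choices(vocab, weights=anneal(probs, t), k=1)[0]


# Toy next-word distribution (hypothetical values).
vocab = ["the", "a", "zebra"]
probs = [0.5, 0.3, 0.2]

for t in (1.0, 0.7, 0.5):
    print(t, [round(q, 3) for q in anneal(probs, t)])
print(sample_word(vocab, probs, t=0.7))
```

At $t=1$ the distribution is unchanged; lowering $t$ shifts mass toward the most probable word, which is the mechanism that trades diversity for sample quality.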

As a non-neural baseline, we also consider retrieval-based models built on Apache Solr for a few tasks. For this approach, we retrieve the single most relevant response from the training set using the BM25 similarity metric on inputs. Such models are known to perform well in tasks with complex outputs such as program generation (Hayati et al., 2018; Hashimoto et al., 2018) and style transfer (Li et al., 2018).

For cost reasons, we did not measure certain combinations of task and generation mechanisms. We did not measure retrieval for chit-chat dialogue, as we observed its outputs were lower quality than a low-temperature neural model. We also did not anneal language models, as the generation quality from the language model was already high, and our goal was to show that they achieved high HUSE. Our set of measurements, while not comprehensive, generally covers the available quality-diversity tradeoffs for conditional tasks.

Finally, we collect human judgments $\text{HJ}(x,y)$ as per Section 4.1, where we query 20 Amazon Mechanical Turk crowdworkers for typicality ratings on 100 reference and 100 model sentences. Since our models generate UNK (unknown and out-of-vocabulary) tokens, we instructed crowdworkers to treat UNK tokens as rare but appropriate words for the context.

5.2 Overall results

The HUSE scores across the four tasks vary widely. Table 1 shows that single-sentence language models are nearly indistinguishable, with $\text{HUSE}=0.86$ and implied discriminator error of $43\%$.

In contrast, both summarization and dialogue are highly distinguishable ($\text{HUSE}\approx 0.5$) with relatively low quality when sampled from $t=1.0$. Human evaluation alone (HUSE-Q) would suggest that using temperature annealing ($t=0.7$) to emphasize high-probability outputs substantially improves the model (HUSE-Q goes from $0.58$ to $0.92$ for summarization and $0.56$ to $0.92$ for dialogue). However, we find that this increase in sample quality comes at the cost of diversity (HUSE-D goes from $0.95$ to $0.34$ for summarization and $1.0$ to $0.57$ for dialogue). Examining the achievable HUSE and diversity tradeoffs in Figure 3 shows that mechanisms such as annealing which improve sample quality actually degrade HUSE due to severe losses in diversity.
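To make the tradeoff concrete, the sketch below applies the decomposition from Section 4.2, $(1-\text{HUSE-D})+(1-\text{HUSE-Q})=1-\text{HUSE}$, to the HUSE-Q and HUSE-D values quoted above. The resulting HUSE values are implied by that identity rather than quoted from Table 1, and the function name is an illustrative assumption.

```python
def huse_from_components(huse_q, huse_d):
    """Overall HUSE implied by (1 - HUSE-D) + (1 - HUSE-Q) = 1 - HUSE."""
    return huse_q + huse_d - 1.0


# (HUSE-Q, HUSE-D) pairs quoted in the text for t = 1.0 and t = 0.7.
reported = {
    ("summarization", 1.0): (0.58, 0.95),
    ("summarization", 0.7): (0.92, 0.34),
    ("dialogue", 1.0): (0.56, 1.0),
    ("dialogue", 0.7): (0.92, 0.57),
}

for (task, t), (q, d) in reported.items():
    print(f"{task} t={t}: implied HUSE = {huse_from_components(q, d):.2f}")
```

For summarization the implied HUSE falls from about 0.53 to about 0.26 when annealing to $t=0.7$, even though HUSE-Q rises from 0.58 to 0.92.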

We find that all generation schemes and models are inadequate for story generation on ROC stories. The original model ($t=1.0$) is very easily distinguishable by a human ($\text{HUSE-Q}=0.15$), corresponding to a discriminator error of $7\%$. The retrieval models can improve this to $\text{HUSE-Q}=0.47$, but this comes at the expense of diversity.

Finally, we observe that directly sampling from the model ($t=1.0$) is always diverse. This suggests that human evaluation is an appropriate evaluation for generation systems that are directly sampled (rather than beam-searched).

5.3 Model error analysis with HUSE

Since HUSE is estimated from a two-dimensional classification problem, we can directly visualize the classification problem to understand defects in both model quality and diversity.

Figure 4 shows both reference points $\phi_{\text{huse}}(x_{i},y_{i})$ (blue squares) and model points $\phi_{\text{huse}}(x_{i},y_{i}^{\prime})$ (red circles) for the summarization task. The shaded areas indicate the decision boundary of the $16$-nearest neighbor classifier.

At temperature $t=1.0$, we find that the classification boundary is mostly horizontal, implying that human judgment alone can distinguish model outputs from references. There is a cluster of sentences with high HJ and high $p_{\text{model}}$ which are essentially indistinguishable. Examining the samples in this top-right region reveals that these are news stories with short headlines such as “Nadal pulls out of Sydney International” which can be reliably generated even at $t=1.0$. However, the model frequently generates low-quality samples that can easily be distinguished, such as “two new vaccines in the poor countries were effective against go-it-alone study says” (Table 2).

At lower temperatures of $t=0.9$ and $t=0.7$, the boundary shifts towards becoming diagonal. Although the distribution is no longer directly separable on human judgment, the two distributions are clearly separable with the inclusion of $p_{\text{model}}$.

Using Figure 4, we can identify individual examples which were correctly and incorrectly classified based on $p_{\text{model}}$ and HJ. Table 2 shows examples of both quality failures and diversity failures identified by HUSE. For example, the “diversity failure” table shows that the summarization model ($t=0.7$) has an extremely low probability of generating some reference sentences (“NFL’s bills shake up front office”) and is thus under-diverse. Closer examination of the model shows that the probability of generating “front office” is low, since it is an unusual way to refer to the president and general manager. Improving these models on the diversity failures will require that the model understand more subtle paraphrases. We can also identify model successes, where the model outputs are indistinguishable from the reference in terms of quality (“Agassi bows out of Australian Open after injury”), and the model assigns high probability to the reference (“Agassi withdraws from Australian Open”).

5.4 HUSE stability

Since HUSE depends on human crowdworker annotations, one might ask if it is possible to reduce either the number of annotated examples, or number of distinct crowdworkers for each example. We show that for low-quality models, substantially fewer annotations are needed.

Figure 5 shows the result of subsampling our original data of 200 sentences and 20 crowdworkers and estimating HUSE. First, we find that using 50 test set examples (Figure 5, left) is often sufficient to give accurate estimates of HUSE. Next, we find that the necessary number of crowdworkers per example depends heavily on the task. Easily distinguishable tasks (story generation) require only 10 crowdworkers, while less distinguishable tasks (summarization) require more than 20 crowdworkers to obtain accurate estimates.

Related work

Existing approaches to NLG evaluation use a hodgepodge mix of quality and diversity measures. Out of the 26 NLG papers at ACL 2018, six perform only human evaluation, fourteen measure human evaluation and a diversity metric such as perplexity or n-gram diversity, and six do not evaluate using human judgments.

While perplexity and $n$-gram counts can in principle evaluate diversity, their practical implementations suffer from serious drawbacks. When human evaluation and perplexity are both evaluated, they are almost always done on separate models: human evaluations are done on beam-searched output, while perplexity is computed on the softmax outputs. This makes it appear as if the models can simultaneously generate high-quality outputs while also being diverse, when in fact they can only be one at a time based on whether they sample or run beam search.

On the other hand, $n$-gram diversity was proposed by Li et al. (2016) to identify models with the generic utterance problem where models repeat phrases such as ‘I don’t know’. Unfortunately, $n$-gram diversity is computed across contexts by counting the number of unique $n$-grams generated, and so does not measure a model’s ability to generate multiple valid utterances at any single context. In particular, a model which only outputs a single memorized utterance per context (e.g., via memorization or retrieval) can still have high $n$-gram diversity as long as the memorized sentences differ across contexts.
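The following sketch illustrates this blind spot with a simple across-context measure. Normalizing the count of unique $n$-grams by the total number of $n$-grams is one common variant and is an assumption here, as are the function name and the toy outputs.

```python
def distinct_ngrams(outputs, n=2):
    """Fraction of unique n-grams among all n-grams generated across contexts."""
    ngrams = []
    for sentence in outputs:
        tokens = sentence.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# One output per context (hypothetical examples).
generic_model = ["i do not know", "i do not know", "i do not know"]
memorizing_model = ["the cat sat down", "we went home early", "rain fell all night"]

print(distinct_ngrams(generic_model))     # 3 unique bigrams out of 9
print(distinct_ngrams(memorizing_model))  # 9 unique bigrams out of 9
```

The memorizing model scores the maximum value even though it can produce exactly one utterance for each context, so the measure says nothing about per-context diversity.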

Finally, all existing diversity measures are computed separately from human evaluation. This results in two incomparable evaluation metrics, which prevent us from reasoning about tradeoffs between diversity and quality. In contrast, HUSE allows us to make precise statements about the tradeoffs between model quality and diversity because it is a single metric which decomposes into diversity and quality terms.

Related evaluations of diversity.

The importance of diverse responses has previously been acknowledged for summarization Nenkova et al. (2007) and information retrieval Clarke et al. (2008). Our work differs in considering a single evaluation measure that captures quality and diversity applicable to any generation task.

Automated metrics based on $n$-gram overlap such as BLEU, METEOR, and ROUGE (Papineni et al., 2002; Lavie and Denkowski, 2009; Lin and Rey, 2004) work well for machine translation but do not generalize well to domains with a diverse spectrum of correct responses. While variants (Sun and Zhou, 2012; Galley et al., 2015; Shima and Mitamura, 2011) have adapted such metrics to high-entropy generative environments, they are still significantly inferior to the human judgments they attempt to mimic.

Caccia et al. (2018) recently examined the diversity and quality tradeoffs for different language model architectures on synthetic datasets. However, as their approach relies on measuring log-likelihoods under both the model and reference distributions, it cannot be applied to real data where $p_{\text{ref}}$ is unavailable. Our main conceptual contribution overcomes this by showing that HJ is an acceptable proxy for $p_{\text{ref}}$.

Sajjadi et al. (2018) also examine diversity and quality (which they call precision and recall) in the context of generative image models. However, they rely on assuming that $p_{\text{ref}}$ and $p_{\text{model}}$ can be estimated accurately using the Fréchet Inception Distance (FID) (Heusel et al., 2017). HUSE avoids such assumptions and instead directly leverages human judgments, resulting in a simple and reliable metric more suitable for use as a gold standard.

Estimating optimal classification error.

Evaluating a model by estimating its optimal classification error has been considered by several earlier works Olsson et al. (2018); Kannan and Vinyals (2016); Li et al. (2017); Bruni and Fernandez (2017); Bowman et al. (2016). However, these methods have focused on classifying sentences directly, which is quite challenging to do reliably. Existing adversarial evaluation methods do not yet reliably outperform human classification Kannan and Vinyals (2016); Bruni and Fernandez (2017). We propose the use of both human evaluation and model probabilities as part of the adversarial evaluation framework, and demonstrate that the resulting classifier reliably outperforms humans and captures both the sample quality and diversity of a model.

Distributional divergence estimation.

Our proposed evaluation metric is closely related to the total variation distance, which has been studied extensively in the distribution testing literature. It is known that total variation distance estimates have pessimistic minimax estimation rates in high dimensions (Balakrishnan and Wasserman, 2017). Our work overcomes this by utilizing $p_{\text{model}}$ and an estimate of $p_{\text{ref}}$. Other approaches to distributional testing include the maximum mean discrepancy (MMD) and Wasserstein distances, but these approaches require knowledge of a ground truth metric or kernel space (Tolstikhin et al., 2016; Singh et al., 2018). Although such divergences are easier to estimate than the total variation distance from samples, the implied convergence rates are still too slow to be practically useful.

Discussion

In this paper, we demonstrate that the current gold standard of human evaluation does not penalize under-diverse models. To remedy this, we propose HUSE, a general purpose evaluation strategy which can be applied to any model for which we can calculate a model’s sampling probabilities. HUSE is an upper bound on the optimal classification error of distinguishing reference and model-generated text, and never does worse than human classification. HUSE leverages both model probabilities and human judgments, ensuring that models which do well on the metric are both high-quality and diverse.

Our work can be viewed as a “superhuman version” of the classic Turing Test Turing (1950). Instead of relying on just a human classifier, we approximate the optimal classifier, which can utilize information about the model in addition to the reference. We also modify the classification problem and seek to identify whether a sample comes from a (potentially superhuman) reference distribution, rather than the human distribution. These two changes lead to tractable, rigorous estimators which can quantify tradeoffs between model quality and diversity on a wide range of generation tasks.

Acknowledgements. We would like to thank Arun Chaganty, Robin Jia, and Peng Qi for extensive comments and feedback on the paper. This work was funded by DARPA CwC program under ARO prime contract no. W911NF-15-1-0462.

Reproducibility. All code, data, and experiments are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x88644b5ee189402eb19d39d721d1005c.

References

Appendix A

This is a standard result, replicated here for completeness:

The total variation distance is related to the optimal discriminator error as follows: $\|p_{\text{model}}-p_{\text{ref}}\|_{\text{TV}}=1-L^{*}$.

Proof. Fix any $x$. Define $a_{y}\stackrel{\text{def}}{=}p_{\text{ref}}(y\mid x)$ and $b_{y}\stackrel{\text{def}}{=}p_{\text{model}}(y\mid x)$. Let $S\stackrel{\text{def}}{=}\{y:a_{y}<b_{y}\}$ be the set of $y$ where $p_{\text{model}}$ assigns higher probability than $p_{\text{ref}}$, and define $A\stackrel{\text{def}}{=}\sum_{y\in S}a_{y}$ and $B\stackrel{\text{def}}{=}\sum_{y\in S}b_{y}$ to be the aggregated probabilities. On $S$, the optimal discriminator should return $z=0$ (model). This is an error when $z=1$, which occurs with probability $\frac{1}{2}A$. Analogously, on the complement of $S$, the error probability (when $z=0$) is $\frac{1}{2}(1-B)$. The total contribution to $L^{*}$ is thus $A+(1-B)$. The rest follows from algebra:

$$L^{*}=A+(1-B)=1-(B-A)=1-\frac{1}{2}\sum_{y}\left|a_{y}-b_{y}\right|,$$

where the last step uses $\sum_{y}|a_{y}-b_{y}|=(B-A)+\left((1-A)-(1-B)\right)=2(B-A)$. Averaging over $x\sim p(x)$ gives the claim. ∎

A.2 Approximation error from $\phi$ features

Theorem 1. Let $L^{*}$ and $L(\phi)$ be the optimal classification error and optimal error under feature map $\phi$ respectively. Then,

$$L^{*}\leq L(\phi)\leq L^{*}+2\left(1-2^{-I}\right),$$

where $I\stackrel{\text{def}}{=}I(Z_{\text{opt}};\phi_{\text{opt}}(X,Y)\mid\phi(X,Y))$ is the conditional mutual information in bits and $Z_{\text{opt}}$ is the prediction of the optimal classifier.

Proof. The lower bound falls out of the definition of $L^{*}$. To prove the upper bound, a variant of the entropy lower bound by Feder and Merhav (1994) shows that the error rate for predicting $Z_{\text{opt}}$ via the optimal $f(\phi(X,Y))$ follows

$$P\left(f(\phi(X,Y))\neq Z_{\text{opt}}\right)\leq 1-2^{-H(Z_{\text{opt}}\mid\phi(X,Y))}.$$

Now expand the mutual information using the chain rule

$$I(Z_{\text{opt}};\phi_{\text{opt}}(X,Y)\mid\phi(X,Y))=H(Z_{\text{opt}}\mid\phi(X,Y))-H(Z_{\text{opt}}\mid\phi_{\text{opt}}(X,Y),\phi(X,Y))=H(Z_{\text{opt}}\mid\phi(X,Y)).$$

The last line follows from the fact that $Z_{\text{opt}}$ is a deterministic function of $\phi_{\text{opt}}$ (Proposition 1). Substituting this into the inequality gives the bound,

$$P\left(f(\phi(X,Y))\neq Z_{\text{opt}}\right)\leq 1-2^{-I},$$

with $I=I(Z_{\text{opt}};\phi_{\text{opt}}(X,Y)\mid\phi(X,Y))$.

Finally, note that $Z_{\text{opt}}$ incurs $L^{*}/2$ error, and we disagree with $Z_{\text{opt}}$ at most a $P(f(\phi(X,Y))\neq Z_{\text{opt}})$ fraction of the time. Assuming that we get every one of these disagreements wrong gives an upper bound of $L^{*}/2+P(f(\phi(X,Y))\neq Z_{\text{opt}})$ on $L(\phi)/2$. ∎

A straightforward corollary is that whenever $\phi$ is an invertible function of $\phi_{\text{opt}}$, the conditional mutual information is zero, and therefore the above inequalities become an equality.

Corollary 1. Whenever $\phi$ is an invertible function of $\phi_{\text{opt}}$, $L(\phi)=L^{*}$.

A.3 Amazon Mechanical Turk for human judgments

In order to show that HUSE can be reliably estimated even with simple crowdsourcing techniques, we used a single uniform task design where we asked Amazon Mechanical Turk workers to rate the typicality of a sentence from 0–5. We defined 0 as invalid (grammatically or factually incorrect) and 5 as ‘very typical’. $\text{HJ}(x,y)$ is defined as the average score that crowdworkers assign to a response $y$ given the context $x$. We did not perform substantial filtering or qualification checks beyond HIT acceptance rate (HIT approval rate greater than 95 percent, number of HITs approved greater than 50, and location in the USA). We constructed each HIT to be 25 examples, and paid one dollar per HIT.

We observe that measuring many replicates is sufficient to get low-variance estimates of HJ. For classification tasks where the model is straightforward to identify from references (such as story generation) we require five to ten replicates, while for hard tasks such as summarization at least twenty replicates are needed (Section 5.4). Manual inspection suggests that up to 20% of the collected data are low-quality but that this noise is uncorrelated with the sentence being rated and outweighed by a larger majority of honest and reasonably accurate data. Even if the data quality is low, HUSE is still a valid upper bound (i.e. models with low HUSE are guaranteed to be distinguishable from humans). Thus the models which we identify as having low-HUSE are reliably distinguishable regardless of the crowdworker quality.

A.4 Reddit Dataset

We use a subset of Reddit comments from 2006-2018 scraped from https://pushshift.io/. We construct a dictionary containing the 10,000 most popular words and preprocess the dataset by removing deleted posts, out-of-vocabulary tokens, profanity, comments with less than 10 upvotes, and comments with over 400 tokens.