The price of debiasing automatic metrics in natural language evaluation

Arun Tejasvi Chaganty, Stephen Mussman, Percy Liang

Introduction

In recent years, there has been an increasing interest in tasks that require generating natural language, including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočisky et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, the evaluation of these systems remains a thorny issue because of the diversity of possible correct responses. As the gold standard of performing human evaluation is often too expensive, there has been a large effort to develop automatic metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004), METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). However, these have been shown to be biased, correlating poorly with human metrics across different datasets and systems (Liu et al., 2016b; Novikova et al., 2017).

Can we combine automatic metrics and human evaluation to obtain an unbiased estimate at lower cost than human evaluation alone? In this paper, we propose a simple estimator based on control variates (Ripley, 2009), where we average differences between human judgments and automatic metrics rather than averaging the human judgments alone. Provided the two are correlated, our estimator will have lower variance and thus reduce cost.

We prove that our estimator is optimal in the sense that no unbiased estimator using the same automatic metric can have lower variance. We also analyze its data efficiency (equivalently, cost savings)—the factor reduction in number of human judgments needed to obtain the same accuracy versus naive human evaluation—and show that it depends solely on two factors: (a) the annotator variance (which is a function of the human evaluation prompt) and (b) the correlation between human judgments and the automatic metric. This factorization allows us to calculate typical and best-case data efficiencies and accordingly refine the evaluation prompt or automatic metric.

Finally, we evaluate our estimator on state-of-the-art systems from two tasks, summarization on the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) and open-response question answering on the MS MARCOv1.0 dataset (Nguyen et al., 2016). To study our estimators offline, we preemptively collected 10,000 human judgments which cover several tasks and systems. An anonymized version of this data and the annotation interfaces used can be found at https://bit.ly/price-of-debiasing. As predicted by the theory, we find that the data efficiency depends not only on the correlation between the human and automatic metrics, but also on the evaluation prompt. If the automatic metric had perfect correlation, our data efficiency would be around 3, while if we had noiseless human judgments, our data efficiency would be about 1.5. In reality, the reduction in cost we obtained was only about 10%, suggesting that improvements in both automatic metric and evaluation prompt are needed. As one case study in improving the latter, we show that, when compared to a Likert survey, measuring the amount of post-editing needed to fix a generated sentence reduced the annotator variance three-fold.

Bias in automatic evaluation

It is well understood that current automatic metrics tend to correlate poorly with human judgment at the instance-level. For example, Novikova et al. (2017) report correlations less than $0.3$ for a large suite of word-based and grammar-based evaluation methods on a generation task. Similarly, Liu et al. (2016b) find correlations less than $0.35$ for automatic metrics on a dialog generation task in one domain, but find correlations with the same metric dropped significantly to less than $0.16$ when used in another domain. Still, somewhat surprisingly, several automatic metrics have been found to have high system-level correlations (Novikova et al., 2017). What, then, are the implications of having a low instance-level correlation?

As a case study, consider the task of open-response question answering: here, a system receives a human-generated question and must generate an answer from some given context, e.g. a document or several webpages. We collected the responses of several systems on the MS MARCOv1 dataset (Nguyen et al., 2016) and crowdsourced human evaluations of the system output (see Section 4 for details).

The instance-level correlation (Figure 1(b)) is only $\rho=0.31$ . A closer look at the instance-level correlation reveals that while ROUGE is able to correctly assign low scores to bad examples (lower left), it is bad at judging good examples and often assigns them low ROUGE scores (lower right)—see Table 1 for examples. This observation agrees with a finding reported in Novikova et al. (2017) that automatic metrics correlate better with human judgments on bad examples than average or good examples.

Thus, as Figure 1(a) shows, we can improve low-scoring ROUGE examples without improving their human judgment ( $\vartriangle$ ) and vice versa ( $\triangleright$ ). Indeed, Conroy and Dang (2008) report that summarization systems were optimized for ROUGE during the DUC challenge (Dang, 2006) until they were indistinguishable from the ROUGE scores of human-generated summaries, but the systems had hardly improved on human evaluation. Hill-climbing on ROUGE can also lead to a system that does worse on human scores, e.g. in machine translation (Wu et al., 2016). Conversely, genuine quality improvements might not be reflected in improvements in ROUGE. This bias also appears in pool-based evaluation for knowledge base population (Chaganty et al., 2017). Thus the problems with automatic metrics clearly motivate the need for human evaluation, but can we still use the automatic metrics somehow to save costs?

Statistical estimation for unbiased evaluation

2 Control variates estimator

Now let us see how an automatic metric $g$ can reduce variance. If there is no annotator variance ( $\sigma^{2}_{a}=0$ ) so that $Y(z)=f(z)$ , we should expect the variance of $f(z)-g(z)$ to be lower than the variance of $f(z)$ , assuming $g$ is correlated with $f$ —see Figure 2 for an illustration.

The actual control variates estimator needs to handle noisy $Y(z)$ (i.e. $\sigma^{2}_{a}>0$ ) and guard against a $g(z)$ with low correlation. Let us standardize $g$ to have zero mean and unit variance, because we have assumed it is free to evaluate. As before, let $z^{(1)},\dots,z^{(n)}$ be independent samples from $\mathcal{Z}$ and draw $y^{(i)}=Y(z^{(i)})$ independently as well. We define the control variates estimator as

$$\hat{\mu}_{\text{cv}} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \alpha\, g(z^{(i)})\right), \qquad \alpha \stackrel{\text{def}}{=} \operatorname{Cov}(f(z),g(z)).$$

Intuitively, we have averaged over $y^{(i)}$ to handle the noise introduced by $Y(z)$ , and scaled $g(z)$ to prevent an uncorrelated automatic metric from introducing too much noise.
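To make the role of the scaling explicit, here is a short derivation sketch. The generic coefficient $c$ is a symbol introduced only for this illustration; it is not part of the estimator's definition. Using $\operatorname{Var}(g(z))=1$ and writing $\sigma^{2}_{f}=\operatorname{Var}(f(z))$ ,

$$\operatorname{Var}\left(f(z) - c\,g(z)\right) = \sigma^{2}_{f} - 2c\operatorname{Cov}(f(z),g(z)) + c^{2}.$$

This is a quadratic in $c$ that is minimized at $c=\operatorname{Cov}(f(z),g(z))$ , where it equals $\sigma^{2}_{f}-\operatorname{Cov}(f(z),g(z))^{2}$ . In particular, an uncorrelated metric gets coefficient zero and leaves the variance at $\sigma^{2}_{f}$ , so under this choice of scaling it cannot make the estimate noisier than the sample mean.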

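An important quantity governing the quality of an automatic metric $g$ is the correlation between $f(z)$ and $g(z)$ (recall that $g$ has unit variance):

$$\rho \stackrel{\text{def}}{=} \frac{\operatorname{Cov}(f(z),g(z))}{\sigma_{f}\sigma_{g}} = \frac{\alpha}{\sigma_{f}}, \qquad \alpha=\operatorname{Cov}(f(z),g(z)).$$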

We can show that among all distributions with fixed $\sigma^{2}_{f}$ , $\sigma^{2}_{a}$ , and $\alpha$ (equivalently $\rho$ ), this estimator is minimax optimal, i.e. it has the least variance among all unbiased estimators:

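Theorem 3.1. Among all unbiased estimators that are functions of $y^{(i)}$ and $g(z^{(i)})$ , and for all distributions with a given $\sigma^{2}_{f}$ , $\sigma^{2}_{a}$ , and $\alpha$ ,

$$\max \operatorname{Var}(\hat{\mu}_{\text{cv}}) = \frac{1}{n}\left(\sigma^{2}_{f}(1-\rho^{2}) + \sigma^{2}_{a}\right),$$

and no other estimator has a lower worst-case variance.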

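Comparing the variances of the two estimators ((2) and (6)), we define the data efficiency as the ratio of the variances:

$$\text{DE} \stackrel{\text{def}}{=} \frac{\operatorname{Var}(\hat{\mu}_{\text{mean}})}{\operatorname{Var}(\hat{\mu}_{\text{cv}})} = \frac{\sigma^{2}_{f}+\sigma^{2}_{a}}{\sigma^{2}_{f}(1-\rho^{2})+\sigma^{2}_{a}} = \frac{1+\gamma}{1-\rho^{2}+\gamma},$$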

where $\gamma\stackrel{\text{def}}{=}\sigma^{2}_{a}/\sigma^{2}_{f}$ is the normalized annotator variance. Data efficiency is the key quantity in this paper: it is the multiplicative reduction in the number of samples required when using the control variates estimator $\hat{\mu}_{\text{cv}}$ versus the sample mean $\hat{\mu}_{\text{mean}}$ . Figure 3 shows the inverse data efficiency contours as a function of the correlation $\rho$ and $\gamma$ .

When there is no correlation between human and automatic metrics ( $\rho=0$ ), the data efficiency is naturally $1$ (no gain). In order to achieve a data efficiency of $2$ (half the labeling cost), we need $|\rho|\geq\sqrt{2}/2\approx 0.707$ . Interestingly, even for an automatic metric with perfect correlation ( $\rho=1$ ), the data efficiency is still capped by $\frac{1+\gamma}{\gamma}$ : unless $\gamma\to 0$ the data efficiency cannot increase unboundedly. Intuitively, even if we knew that $\rho=1$ , $f(z)$ would be undetermined up to a constant additive shift and just estimating the shift would incur a variance of $\frac{1}{n}\sigma_{a}^{2}$ .
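The special cases above can be checked with a few lines of code. The sketch below assumes the data efficiency takes the closed form $(1+\gamma)/(1-\rho^{2}+\gamma)$ , which follows from dividing the sample-mean variance $\frac{1}{n}(\sigma^{2}_{f}+\sigma^{2}_{a})$ by the control variates variance $\frac{1}{n}(\sigma^{2}_{f}(1-\rho^{2})+\sigma^{2}_{a})$ . The function name `data_efficiency` is ours and purely illustrative.

```python
import math


def data_efficiency(rho: float, gamma: float) -> float:
    """Ratio Var(sample mean) / Var(control variates).

    rho   -- correlation between the automatic metric and mean human judgment
    gamma -- normalized annotator variance, sigma_a^2 / sigma_f^2
    """
    denominator = 1.0 - rho ** 2 + gamma
    if denominator <= 0.0:
        # Perfect metric and noiseless annotators: no residual variance left.
        return math.inf
    return (1.0 + gamma) / denominator


if __name__ == "__main__":
    # (rho, gamma) pairs chosen only to illustrate the cases discussed in the text.
    for rho, gamma in [(0.0, 1.0), (math.sqrt(0.5), 0.0), (1.0, 0.5), (0.3, 1.0)]:
        print(f"rho={rho:.3f} gamma={gamma:.2f} DE={data_efficiency(rho, gamma):.3f}")
```

Note that the threshold $|\rho|\geq\sqrt{2}/2$ for a data efficiency of 2 is attained only when $\gamma=0$ ; with any annotator noise a higher correlation is needed.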

3 Using the control variates estimator

The control variates estimator can be easily integrated into an existing evaluation: we run human evaluation on a random sample of system outputs, automatic evaluation on all the system outputs, and plug these results into Algorithm 1.

It is vital that we are able to evaluate the automatic metric on a significantly larger set of examples than those with human evaluations to reliably normalize $g(z)$ : without these additional examples, it can be shown that the optimal minimax estimator for $\mu$ is simply the naive estimate $\hat{\mu}_{\text{mean}}$ . Intuitively, this is because estimating the mean of $g(z)$ incurs an equally large variance as estimating $\mu$ . In other words, $g(z)$ is only useful if we have additional information about $g$ beyond the samples $\{z^{(i)}\}$ .

Algorithm 1 shows the estimator. In practice, we do not know $\alpha=\operatorname{Cov}(f(z),g(z))$ , so we use a plug-in estimate $\hat{\alpha}$ in line 3 to compute the estimate $\widetilde{\mu}$ in line 4. We note that estimating $\alpha$ from data does introduce an $O(1/n)$ bias, but when compared to the standard deviation which decays as $\Theta(1/\sqrt{n})$ , this bias quickly goes to zero.
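As a rough illustration of the procedure described above, the following sketch normalizes the automatic metric on the full pool of system outputs, forms a plug-in covariance estimate on the human-labeled sample, and subtracts the scaled metric. It is a minimal sketch under our own assumptions and may differ in detail from Algorithm 1; the function and argument names are hypothetical.

```python
from statistics import fmean, pstdev


def control_variates_estimate(human_scores, metric_on_labeled, metric_on_all):
    """Plug-in control variates estimate of the mean human judgment.

    human_scores[i] and metric_on_labeled[i] refer to the same system output.
    metric_on_all holds the automatic metric on the full (larger) set of outputs
    and is used only to normalize the metric.
    """
    g_mean = fmean(metric_on_all)
    g_std = pstdev(metric_on_all)
    y_bar = fmean(human_scores)
    if g_std == 0.0:
        # A constant metric carries no information; fall back to the sample mean.
        return y_bar

    # Normalize g to zero mean and unit variance using the large pool.
    g = [(value - g_mean) / g_std for value in metric_on_labeled]

    # Plug-in estimate of alpha = Cov(f(z), g(z)) from the labeled sample.
    alpha_hat = fmean([(y - y_bar) * g_i for y, g_i in zip(human_scores, g)])

    # Average of the human judgments minus the scaled, normalized metric.
    return fmean([y - alpha_hat * g_i for y, g_i in zip(human_scores, g)])
```

When the labeled sample happens to have a below-average metric value and the metric is positively correlated with the human scores, the estimate is shifted upward relative to the plain sample mean, which is the correction the control variate is meant to provide.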

The estimator $\widetilde{\mu}$ in Algorithm 1 has $O(1/n)$ bias.

An additional question that arises when applying Algorithm 1 is figuring out how many samples $n$ to use. Given a target variance, the number of samples can be estimated using (6) with conservative estimates of $\sigma^{2}_{f}$ , $\sigma^{2}_{a}$ and $\rho$ . Alternatively, our estimator can be combined with a dynamic stopping rule (Mnih et al., 2008) to stop data collection once we reach a target confidence interval.
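The first option can be sketched as a one-line calculation. The snippet below assumes the control variates variance has the form $\frac{1}{n}(\sigma^{2}_{f}(1-\rho^{2})+\sigma^{2}_{a})$ , as stated in Appendix B, and solves for the smallest $n$ that meets a target variance. The function name and the example values are illustrative, not taken from our experiments.

```python
import math


def required_samples(target_variance, sigma_f2, sigma_a2, rho):
    """Smallest n with (sigma_f2 * (1 - rho^2) + sigma_a2) / n <= target_variance."""
    if target_variance <= 0:
        raise ValueError("target_variance must be positive")
    per_sample_variance = sigma_f2 * (1.0 - rho ** 2) + sigma_a2
    return math.ceil(per_sample_variance / target_variance)


if __name__ == "__main__":
    # Hypothetical, conservative guesses for the three quantities.
    print(required_samples(target_variance=0.25, sigma_f2=1.0, sigma_a2=1.0, rho=0.0))
    print(required_samples(target_variance=0.25, sigma_f2=1.0, sigma_a2=1.0, rho=0.5))
```

Setting `rho=0.0` recovers the number of samples needed by the sample mean, so the ratio of the two outputs is an estimate of the data efficiency.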

4 Discussion of assumptions

We will soon see that empirical instantiations of $\gamma$ and $\rho$ lead to rather underwhelming data efficiencies in practice. In light of our optimality result, does this mean there is no hope for gains? Let us probe our assumptions. We assumed that the human judgments are uncorrelated across different system outputs; it is possible that a more accurate model of human annotators (e.g. Passonneau and Carpenter (2014)) could offer improvements. Perhaps with additional information about $g(z)$ such as calibrated confidence estimates, we would be able to sample more adaptively. Of course the most direct routes to improvement involve increasing the correlation of $g$ with human judgments and reducing annotator variance, which we will discuss more later.

Tasks and datasets

In order to compare different approaches to evaluating systems, we first collected human judgments for the output of several automatic summarization and open-response question answering systems using Amazon Mechanical Turk. Details of the instructions provided and the quality assurance steps taken are given in Appendix A of the supplementary material. In this section, we’ll briefly describe how we collected this data.

In automatic summarization, systems must generate a short (on average two or three sentence) summary of an article: for our study, we chose articles from the CNN/Daily Mail (CDM) dataset (Hermann et al., 2015; Nallapati et al., 2016) which come paired with reference summaries in the form of story highlights. We focus on the language quality of summaries and leave evaluating content selection to future work.

For each summary, we collected human judgments on a scale from 1–3 (Figure 4(a)) for fluency, (lack of) redundancy, and overall quality of the summary using guidelines from the DUC summarization challenge (Dang, 2006). As an alternate human metric, we also asked workers to post-edit the system’s summary to improve its quality, similar to the post-editing step in MT evaluations (Snover et al., 2006). Obtaining judgments costs about $0.15 per summary and this cost rises to about $0.40 per summary for post-editing.

We collected judgments on the summaries generated by the seq2seq and pointer models of See et al. (2017), the ml and ml+rl models of Paulus et al. (2018), and the reference summaries. All system output was obtained from the original authors through private communication. Before presenting the summaries to human annotators, we performed some minimal post-processing: we true-cased and de-tokenized the output of seq2seq and pointer using Stanford CoreNLP (Manning et al., 2014) and replaced “unknown” tokens in each system with a special symbol ( $\blacksquare$ ).

Evaluating answer correctness.

Next, we look at evaluating the correctness of system outputs in question answering using the MS MARCO question answering dataset (Nguyen et al., 2016). Here, each system is provided with a question and up to 10 paragraphs of context. The system generates open-response answers that do not need to be tied to a span in any paragraph.

We first ask annotators to judge if the output is even plausible for the question, and if yes, ask them to identify if it is correct according to each context paragraph. We found that requiring annotators to highlight regions in the text that support their decision substantially improved the quality of the output without increasing costs. Annotations cost $0.40 per system response. This cost could be significantly reduced if systems also specify which passage they used to generate the answer.

While our goal is to evaluate the correctness of the provided answer, we found that there are often answers which may be correct or incorrect depending on the context. For example, the question “what is a pothole” is typically understood to refer to a hole in a roadway, but also refers to a geological feature (Figure 4(b)). This is reflected when annotators mark one context paragraph to support the given answer but mark another to contradict it. We evaluated systems based on both the average correctness (AvgCorrect) of their answers across all paragraphs as well as whether their answer is correct according to any paragraph (AnyCorrect).
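The two scores can be illustrated with a small sketch. It assumes, as a simplification of our annotation scheme, that each answer comes with one binary judgment per context paragraph (`True` if the paragraph supports the answer) and that an implausible answer is recorded as all `False`. The function names are ours.

```python
def avg_correct(paragraph_judgments):
    """Fraction of context paragraphs judged to support the answer."""
    if not paragraph_judgments:
        return 0.0
    return sum(1 for judged in paragraph_judgments if judged) / len(paragraph_judgments)


def any_correct(paragraph_judgments):
    """1.0 if at least one context paragraph supports the answer, else 0.0."""
    return 1.0 if any(paragraph_judgments) else 0.0


if __name__ == "__main__":
    # Hypothetical example: one of four paragraphs supports the answer.
    judgments = [True, False, False, False]
    print(avg_correct(judgments), any_correct(judgments))
```

An answer like the road-related sense of “pothole” would score highly on AnyCorrect as long as one paragraph uses that sense, while its AvgCorrect score reflects how many paragraphs agree.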

We collected annotations on the answers generated by the fastqa and fastqa_ext systems from Weissenborn et al. (2017) and the snet and snet.ens(emble) models from Tan et al. (2018), along with reference answers. The answers generated by the systems were used without any post-processing. Surprisingly, we found that the correctness of the reference answers (according to the AnyCorrect metric) was only 73.5%, only 2% above that of the leading system (snet.ens). We manually inspected 30 reference answers which were annotated as incorrect and found that of those, about 95% were indeed incorrect. However, 62% are actually answerable from some paragraph, indicating that the real ceiling performance on this dataset is around 90% and that there is still room for improvement on this task.

Experimental results

We are now ready to evaluate the performance of our control variates estimator proposed in Section 3 using the datasets presented in Section 4. Recall that our primary quantity of interest is data efficiency, the ratio of the number of human judgments required to estimate the overall human evaluation score for the control variates estimator versus the sample mean. We’ll briefly review the automatic metrics used in our evaluation before analyzing the results.

We consider the following frequently used automatic word-overlap based metrics in our work: BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004) and METEOR (Lavie and Denkowski, 2009). Following Novikova et al. (2017) and Liu et al. (2016b), we also include a vector-based sentence-similarity metric that uses sent2vec (Pagliardini et al., 2017) to compare sentences (VecSim). Figure 5 shows how each of these metrics is correlated with human judgment for the systems being evaluated. Unsurprisingly, the correlation varies considerably across systems, with token-based metrics correlating more strongly for systems that are more extractive in nature (fastqa and fastqa_ext).

Results. Extended results for other systems, metrics and prompts can be found at https://bit.ly/price-of-debiasing/.

In Section 3 we proved that the control variates estimator is not only unbiased but also has the least worst-case variance among unbiased estimators. Figure 6 plots the width of the 80% confidence interval, estimated using bootstrap, measured as a function of the number of samples collected for different tasks and prompts. As expected, the control variates estimator reduces the width of the confidence interval. We measure data efficiency by averaging the ratio of squared confidence interval widths between the human baseline and control variates estimates. We observe that the data efficiency depends on the task, prompt and system, ranging from about 1.08 (a 7% cost reduction) to 1.15 (a 13% cost reduction) using current automatic metrics.
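The measurement procedure can be sketched as follows. This is a minimal illustration, not our experimental code: it assumes a percentile bootstrap over (human judgment, normalized metric) pairs, and all function names and defaults (`n_boot`, `seed`) are our own choices for the example.

```python
import random
from statistics import fmean


def bootstrap_ci_width(samples, estimator, level=0.80, n_boot=1000, seed=0):
    """Width of a percentile bootstrap interval for estimator(samples)."""
    rng = random.Random(seed)
    n = len(samples)
    stats = sorted(estimator(rng.choices(samples, k=n)) for _ in range(n_boot))
    lower = stats[int(n_boot * (1 - level) / 2)]
    upper = stats[min(n_boot - 1, int(n_boot * (1 + level) / 2))]
    return upper - lower


def mean_estimator(pairs):
    """Sample mean of the human judgments; the metric is ignored."""
    return fmean([y for y, _ in pairs])


def cv_estimator(pairs):
    """Plug-in control variates estimate; g is assumed already normalized."""
    y_bar = fmean([y for y, _ in pairs])
    alpha_hat = fmean([(y - y_bar) * g for y, g in pairs])
    return fmean([y - alpha_hat * g for y, g in pairs])


def empirical_data_efficiency(widths_mean, widths_cv):
    """Average of squared width ratios, one ratio per sample size."""
    return fmean([(wm / wc) ** 2 for wm, wc in zip(widths_mean, widths_cv)])
```

In use, one would compute `bootstrap_ci_width` for both estimators at each sample size and pass the two lists of widths to `empirical_data_efficiency`.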

As we showed in Section 3, further gains are fundamentally limited by the quality of the evaluation prompts and automatic metrics. Figures 6(a) and 6(b) show how improving the quality of the evaluation prompt from a Likert-scale prompt for quality (Overall) to using post-editing (Edit) noticeably decreases variance and hence allows better automatic metrics to increase data efficiency. Likewise, Figure 6(c) shows how using a better automatic metric (ROUGE-L instead of VecSim) also reduces variance.

Figure 6 also shows the conjectured confidence intervals if we were able to eliminate noise in human judgments (noiseless humans) or have an automatic metric that correlated perfectly with average human judgment (perfect metric). In particular, we use the mean of all (2–3) humans on each $z$ for the perfect $g(z)$ and use the mean of all humans on each $z$ for the “noiseless” $Y(z)$ .

In both cases, we are able to significantly increase data efficiency (i.e. decrease estimator variance). With zero annotator variance and using existing automatic metrics, the data efficiency ranges from 1.42 to 1.69. With automatic metrics with perfect correlation and current variance of human judgments, it ranges from 2.38 to 7.25. Thus, we conclude that it is important not only to improve our automatic metrics but also the evaluation prompts we use during human evaluation.

Related work

In this work, we focus on using existing automatic metrics to decrease the cost of human evaluations. There has been much work on improving the quality of automatic metrics. In particular, there is interest in learning models (Lowe et al., 2017a; Dusek et al., 2017) that are able to optimize for improved correlations with human judgment. However, in our experience, we have found that these learned automatic metrics have trouble generalizing to different systems. The framework we provide allows us to safely incorporate such models into evaluation, exploiting them when their correlation is high but also not introducing bias when it is low.

Our key technical tool is control variates, a standard statistical technique used to reduce the variance of Monte Carlo estimates (Ripley, 2009). The technique has also been used in machine learning and reinforcement learning to lower variance estimates of gradients (Greensmith et al., 2004; Paisley et al., 2012; Ranganath et al., 2014). To the best of our knowledge, we are the first to apply this technique in the context of language evaluation.

Our work also highlights the importance of human evaluation. Chaganty et al. (2017) identified a similar problem of systematic bias in evaluation metrics in the setting of knowledge base population and also proposed statistical estimators that rely on human evaluation to correct bias. Unfortunately, their technique relies on having structured outputs (relation triples) that are shared between systems and does not apply to evaluating natural language generation. In a similar vein, Chang et al. (2017) dynamically collect human feedback to learn better dialog policies.

Discussion

Prior work has shown that existing automatic metrics have poor instance-level correlation with mean human judgment and that they score many good quality responses poorly. As a result, the evaluation is systematically biased against genuine system improvements that would lead to higher human evaluation scores but not improve automatic metrics. In this paper, we have explored using an automatic metric to decrease the cost of human evaluation without introducing bias. In practice, we find that with current automatic metrics and evaluation prompts, data efficiencies are only 1.08–1.15 (7–13% cost reduction). Our theory shows that further improvements are only possible by improving the correlation of the automatic metric and reducing the annotator variance of the evaluation prompt. As an example of how evaluation prompts could be improved, we found that using post-edits of summaries decreased normalized annotator variance by a factor of three relative to using a Likert scale survey. It should be noted that changing the evaluation prompt also changes the underlying ground truth $f(z)$ : it is up to us to find a prompt that still captures the essence of what we want to measure.

Without making stronger assumptions, the control variates estimator we proposed outlines the limitations of unbiased estimation. Where do we go from here? Certainly, we can try to improve the automatic metric (which is potentially as difficult as solving the task) and brainstorm alternative ways of soliciting evaluation (which has been less explored). Alternatively, we could give up on measuring absolute scores, and seek instead to find techniques to stably rank methods and thus improve them. As the NLP community tackles increasingly difficult tasks, human evaluation will only become more important. We hope our work provides some clarity on how to make it more cost effective.

Reproducibility

All code, data, and experiments for this paper are available on the CodaLab platform at https://bit.ly/price-of-debiasing.

Acknowledgments

We are extremely grateful to the authors of the systems we evaluated for sharing their systems’ output with us. We also would like to thank Urvashi Khandelwal and Peng Qi for feedback on an earlier draft of the paper, the crowdworkers on Amazon Mechanical Turk and TurkNation for their work and feedback during the data collection process, and the anonymous reviewers for their constructive feedback.

References

Appendix A Crowdsourcing data collection

In this section, we provide details regarding the design of our annotation interfaces and the quality control measures we took.

Each human annotator was shown a short summary that was generated by a system from an article in the CNN/Daily Mail dataset or provided as a reference for that article. The annotators were then asked to (a) provide Likert scale ratings of the summary on multiple facets (fluency, redundancy and overall quality) and (b) perform post-edits to correct any errors (Figure 7(a)).

We found that using a five-level Likert scale increased annotator variance relative to a three-level Likert scale. Annotators were provided specific cues to calibrate their Likert ratings through a tutorial and were reminded of these cues through tooltips on the rating buttons (see Figure 7(b) for an example). If the annotators rated a summary as lacking along any facet, they were then forced to perform post-edits to “improve [its] quality as much as possible”. We found that forcing annotators to provide post-edits on examples significantly decreased the annotator variance even on the Likert ratings.

Following the recommendations of Liu et al. (2016a), we forced annotators to complete an interactive tutorial containing 10 questions each before beginning the task (Figure 7(b)). The tutorial provided guidelines and examples on how to rate each facet (fluency, redundancy and overall quality) and tested whether they were able to identify and correct language errors using the post-editing interface. The tutorial took about 5–6 minutes to complete and annotators were paid a one-time bonus of $0.75 on completion.

We initially included additional questions to assess focus, coherency and referential clarity adapted from the DUC evaluation guidelines (Dang, 2006), but found that annotators were unable to reliably identify these errors in the short summaries. We also experimented with asking annotators to highlight language errors in the text to justify their ratings, but again found that annotators were unable to localize these errors reliably.

Quality control measures.

We initially attempted to use attention-check examples for the Likert rating questions, but found that the ratings on these examples were themselves quite subjective and hence were not a reliable signal to reject work. Instead, we found that requiring post-edits to summaries significantly reduced spam. Additionally, we rejected annotators who took too little time to complete the task, had very low agreement rates on the Likert questions or had edits that were consistently shorter than 5 characters to prevent spam.

A.2 Answer correctness evaluation.

Each annotator was shown a question from the MS MARCO dataset and an answer that was generated by a system or provided as a reference answer from the dataset. The annotators were then asked to (a) rate if the question made sense and the answer was plausibly correct and (b) identify which paragraphs provided in the dataset justified the answer (Figure 8(a)).

We found that some of the questions in the MS MARCO dataset were extremely ambiguous (e.g. “metatarsal what causes”) and some system responses were implausible (e.g. “monogenic bone diseases”, for the question “what genes cause osteoporosis”). In these cases, annotators expressed confusion if they were forced to judge if the response was correct or incorrect. We resolved this confusion by first asking annotators if the question made sense and if the system response was even plausible.

In early pilots, we found that annotators often rated a paragraph that correctly answered the question but was unrelated to the system response to be “correct”. We were able to resolve this problem by asking annotators to double-check their work (see the last question in Figure 8(a) for an example).

Once again, we forced annotators to complete an interactive tutorial containing eight questions each before beginning the task (Figure 8(b)). The tutorial also took about 5–6 minutes to complete and annotators were paid a one-time bonus of $0.75 on completion.

Quality control measures.

We found that requiring annotators to provide justification spans significantly reduced spam. Additionally, we rejected annotators who took too little time to complete the task or had very low agreement rates on the answer correctness.

Appendix B Proofs

In this section, we provide proofs for the theorems stated in the main paper.

In this section, we prove the main theorem (Theorem 3.1) in the paper about the minimax optimal variance for an unbiased estimator. Theorem 3.1 will follow from the two following lemmas (Lemmas B.1 and B.2). First, we show in Lemma B.1 that for all distributions with fixed $\sigma^{2}_{f}$ , $\sigma^{2}_{a}$ and $\rho$ , the variance of $\hat{\mu}_{\text{cv}}$ is constant and equal to: $\frac{1}{n}(\sigma_{f}^{2}(1-\rho^{2})+\sigma_{a}^{2})$ . Then we give an explicit distribution, a Gaussian distribution, where any estimator yields at least this variance using the theory of sufficient statistics. Together, these show that the max variance of any estimator is at least the max variance of $\hat{\mu}_{\text{cv}}$ .

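Recall that the control variates estimator is

$$\hat{\mu}_{\text{cv}} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \alpha\, g(z^{(i)})\right),$$

where $\alpha=\operatorname{Cov}(f(z),g(z))$ .

Lemma B.1. The variance of $\hat{\mu}_{\text{cv}}$ is always

$$\operatorname{Var}(\hat{\mu}_{\text{cv}}) = \frac{1}{n}\left(\sigma^{2}_{f}(1-\rho^{2}) + \sigma^{2}_{a}\right).$$

Proof. By the law of total variance, with respect to the draws of $z^{(i)}$ ,

$$\operatorname{Var}(\hat{\mu}_{\text{cv}}) = \mathbb{E}_{z}\left[\operatorname{Var}\left(\hat{\mu}_{\text{cv}} \mid z^{(1)},\dots,z^{(n)}\right)\right] + \operatorname{Var}_{z}\left(\mathbb{E}\left[\hat{\mu}_{\text{cv}} \mid z^{(1)},\dots,z^{(n)}\right]\right).$$

We will evaluate each of the two terms on the right hand side.

Because the human responses $Y(z^{(i)})$ are uncorrelated,

$$\mathbb{E}_{z}\left[\operatorname{Var}\left(\hat{\mu}_{\text{cv}} \mid z^{(1)},\dots,z^{(n)}\right)\right] = \mathbb{E}_{z}\left[\frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}\left(Y(z^{(i)})\right)\right] = \frac{1}{n}\sigma^{2}_{a}.$$

Because the $z^{(i)}$ are sampled independently,

$$\operatorname{Var}_{z}\left(\mathbb{E}\left[\hat{\mu}_{\text{cv}} \mid z^{(1)},\dots,z^{(n)}\right]\right) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\left(f(z^{(i)}) - \alpha\, g(z^{(i)})\right)\right) = \frac{1}{n}\operatorname{Var}\left(f(z) - \alpha\, g(z)\right).$$

Note that $\operatorname{Var}(f(z))=\sigma^{2}_{f}$ , $\operatorname{Cov}(f(z),g(z))=\alpha$ , and $\operatorname{Var}(g(z))=1$ (since it is normalized). Thus,

$$\frac{1}{n}\operatorname{Var}\left(f(z) - \alpha\, g(z)\right) = \frac{1}{n}\left(\sigma^{2}_{f} - 2\alpha^{2} + \alpha^{2}\right) = \frac{1}{n}\left(\sigma^{2}_{f} - \alpha^{2}\right).$$

Since the correlation $\rho=\frac{\alpha}{\sigma_{f}\sigma_{g}}=\frac{\alpha}{\sigma_{f}}$ ,

$$\frac{1}{n}\left(\sigma^{2}_{f} - \alpha^{2}\right) = \frac{1}{n}\sigma^{2}_{f}(1-\rho^{2}).$$

Putting these two terms together, we find that

$$\operatorname{Var}(\hat{\mu}_{\text{cv}}) = \frac{1}{n}\left(\sigma^{2}_{f}(1-\rho^{2}) + \sigma^{2}_{a}\right). \qquad ∎$$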

For the next lemma, we show that the worst-case variance for any estimator is at least that of $\hat{\mu}_{\text{cv}}$ . For this, we will define a simple Gaussian distribution and use the theory of sufficient statistics. We explicitly define a distribution over $f(z)$ , $g(z)$ , and $Y(z)-f(z)$ . In particular, we assume these are all Gaussian distributions with respective means, $\mu,0,0$ , and variances, $\sigma^{2}_{f},1,\sigma^{2}_{a}$ . Additionally, we assume that $f(z)$ and $g(z)$ have covariance $\alpha$ but $Y(z)-f(z)$ is independent of both.
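The Gaussian construction above also gives a convenient way to check numerically the variance $\frac{1}{n}(\sigma_{f}^{2}(1-\rho^{2})+\sigma_{a}^{2})$ stated at the start of this appendix. The following Monte Carlo sketch draws from that distribution with a known $\alpha$ and compares the empirical variance of the estimator with the formula. The parameter values, function names and seed are arbitrary choices for illustration.

```python
import random
from statistics import pvariance


def simulate_cv_variance(n=25, trials=2000, mu=0.5, sigma_f2=1.0,
                         sigma_a2=0.5, alpha=0.6, seed=0):
    """Empirical variance of mean(y - alpha * g) over repeated Gaussian draws."""
    rng = random.Random(seed)
    residual_sd = (sigma_f2 - alpha ** 2) ** 0.5  # part of f not explained by g
    noise_sd = sigma_a2 ** 0.5                    # annotator noise Y(z) - f(z)
    estimates = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n):
            g = rng.gauss(0.0, 1.0)
            f = mu + alpha * g + residual_sd * rng.gauss(0.0, 1.0)
            y = f + noise_sd * rng.gauss(0.0, 1.0)
            total += y - alpha * g
        estimates.append(total / n)
    return pvariance(estimates)


def predicted_cv_variance(n, sigma_f2, sigma_a2, alpha):
    """(sigma_f^2 (1 - rho^2) + sigma_a^2) / n with rho = alpha / sigma_f."""
    rho_squared = alpha ** 2 / sigma_f2
    return (sigma_f2 * (1.0 - rho_squared) + sigma_a2) / n


if __name__ == "__main__":
    print("simulated:", simulate_cv_variance())
    print("predicted:", predicted_cv_variance(25, 1.0, 0.5, 0.6))
```

By construction $f(z)$ has variance $\alpha^{2}+(\sigma^{2}_{f}-\alpha^{2})=\sigma^{2}_{f}$ and covariance $\alpha$ with $g(z)$ , so the simulated distribution satisfies the assumptions of the lemma.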

Lemma B.2. $\hat{\mu}_{\text{cv}}$ is the minimal variance unbiased estimate (MVUE) for the Gaussian distribution above.

Proof. The proof is straightforward: we first show that $\hat{\mu}_{\text{cv}}$ is a sufficient statistic using the Fisher-Neyman factorization theorem, and then we apply the Lehmann-Scheffé theorem.

For ease of notation, define $g_{i}=g(z^{(i)})$ and $y_{i}=y^{(i)}$ . For the purposes of statistics, only $\mu$ is a parameter; the other “parameters” are known constants. Note that the pdf of the observed variables $g_{i}$ and $y_{i}$ is,

Thus, with the Fisher-Neyman factorization theorem, it suffices to show that the exponentiated term $T$ decomposes as a sum of a function that only depends on the data and a function that only depends on $\hat{\mu}_{\text{cv}}$ and $\mu$ .

Letting $c_{3}$ be the inverse determinant (which is constant),

Thus, we see the decomposition into the function of only the data on the right and only $\mu$ and $\hat{\mu}_{\text{cv}}$ on the left. Thus, $\hat{\mu}_{\text{cv}}$ is a sufficient statistic.

Further, since $\hat{\mu}_{\text{cv}}$ is normally distributed with mean dependent on $\mu$ , it is complete.

Thus, by the Lehmann-Scheffé theorem, $\hat{\mu}_{\text{cv}}$ is the minimal variance unbiased estimate (MVUE). ∎

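Theorem 3.1. Among all unbiased estimators that are functions of $y^{(i)}$ and $g(z^{(i)})$ , and for all distributions with a given $\sigma^{2}_{f}$ , $\sigma^{2}_{a}$ , and $\alpha$ ,

$$\max \operatorname{Var}(\hat{\mu}_{\text{cv}}) = \frac{1}{n}\left(\sigma^{2}_{f}(1-\rho^{2}) + \sigma^{2}_{a}\right),$$

and no other estimator has a lower worst-case variance.

Proof. From Lemma B.1 we have that the max variance of $\hat{\mu}_{\text{cv}}$ over all distributions with fixed variances is exactly

$$\frac{1}{n}\left(\sigma^{2}_{f}(1-\rho^{2}) + \sigma^{2}_{a}\right).$$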

Further, from Lemma B.2, we know that $\hat{\mu}_{\text{cv}}$ is the MVUE for a particular class of distributions; thus, the max variance of any unbiased estimator over all distributions is at least that of $\hat{\mu}_{\text{cv}}$ .

Combining these two facts, we get that the minimax variance is the variance of $\hat{\mu}_{\text{cv}}$ . ∎

B.2 Added Bias

The estimator in Algorithm 1 has $O(1/n)$ bias.

Because $Y(z)$ is independent and has mean $f(z)$ ,

Because $g(z)$ is mean zero and the $z^{(i)}$ are drawn independently,