With Little Power Comes Great Responsibility

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky

Introduction

Despite its importance to empirical evaluation, relatively little attention has been paid to statistical power in NLP. In particular, if typical experiments in NLP are underpowered, not only would we expect many meaningful improvements to go undetected, we would also expect many apparently significant differences to be exaggerated (Gelman and Carlin, 2014). In this paper, we build on past work calling for greater rigor in evaluation (McCoy et al., 2019; Azer et al., 2020), including the need for careful hypothesis testing (Koehn, 2004; Berg-Kirkpatrick et al., 2012; Søgaard et al., 2014; Dror et al., 2018), and show why and how power matters to NLP, addressing challenges unique to this domain.

Roughly speaking, power is the probability that a statistical test will successfully detect a true effect. As an illustrative example, imagine comparing two dialog systems (see Figure 1). We want to know if people tend to prefer one system over the other. To test this, we will need multiple people to evaluate the systems. But how many? Once we have collected data, a statistical test will tell us if we can reject the null hypothesis that the systems are equally good. Assuming the systems are not identical, statistical power is the probability that the experiment will return a significant result (or equivalently, it is one minus the probability of failing to detect the difference as significant). Although we don’t know the magnitude of this difference, power analysis helps to estimate how much power an experiment will have under various assumptions.

Power depends on multiple factors, including the statistical test used, the significance threshold, true effect size, variance, and sample size. All else being equal, experiments with larger samples will have greater power than those with smaller samples, as shown in Figure 1. Similarly, larger effects and those with less variance are easier to detect, and therefore require fewer samples for equivalent power. Importantly, note that if we do find a significant difference, this does not imply that the experiment had high power. Using the observed outcome from a single experiment to compute power falls into the trap of post-hoc power analysis and is not recommended. For additional background on statistical power, power analysis, null-hypothesis significance testing, and post-hoc analysis, please refer to Appendix A.

Proceeding with a test that is underpowered (i.e., too few subjects or items; often taken to mean less than 80% power; Cohen, 1962) means that one is less likely to be able to draw any useful statistical conclusion from the experiment, and has contributed, in part, to the replication crisis in other fields (Button et al., 2013; Szucs and Ioannidis, 2017; Ioannidis et al., 2017). Routinely running experiments with low statistical power undermines the scientific enterprise. Not only will true effects go undetected; when significant effects are found, they are likely to be noisier and have lower positive predictive value (Button et al., 2013).

Moreover, significant findings from underpowered experiments are more likely to exaggerate or reverse the true effect – so-called Type-M (magnitude) and Type-S (sign) errors, respectively (Gelman and Carlin, 2014). This problem can lead to systematic distortions in the literature if only significant findings are published, especially if these results are based on underpowered experiments (Scargle, 1999). The effect of Type-M error can be seen in Figure 1; significant differences are less likely to be found in smaller samples (right), but among those tests that are significant, the observed difference will tend to exaggerate the true difference (left) by more than a larger sample (middle). For further discussion of Type-M and Type-S errors, please refer to Appendix B.
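To make Type-M and Type-S errors concrete, the following is a minimal simulation sketch in the spirit of Gelman and Carlin (2014). It is an illustration under stated assumptions, not the procedure used to produce Figure 1: we assume the estimated effect is normally distributed around the true effect with a known standard error, and the function name `retrodesign_sim` and its parameters are hypothetical.

```python
import random


def retrodesign_sim(true_effect, std_error, n_sims=20000, z_crit=1.96, seed=0):
    """Estimate power, Type-S rate, and Type-M exaggeration ratio by simulation.

    Assumption (for illustration only): each experiment yields an estimate
    drawn from Normal(true_effect, std_error), and a result is declared
    significant when |estimate| > z_crit * std_error.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:
            significant.append(estimate)

    power = len(significant) / n_sims
    if not significant:
        return power, float("nan"), float("nan")

    # Type-S: share of significant estimates whose sign is wrong.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0)
    type_s = wrong_sign / len(significant)

    # Type-M: average factor by which significant estimates overstate the effect.
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    type_m = mean_abs / abs(true_effect)
    return power, type_s, type_m
```

Calling `retrodesign_sim(0.02, 0.05)` (a noisy, small-sample setting) and `retrodesign_sim(0.02, 0.005)` (a precise, large-sample setting) shows the pattern described above: in the noisy setting few results are significant, and those that are overstate the true effect severalfold and sometimes have the wrong sign.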

Here, we investigate how these issues affect NLP. Although retrospective analysis of power involves challenges, we present evidence that underpowered experiments are widespread in NLP research. Among human evaluations, we find most experimental designs involve too few items and/or raters to detect small effects (§5). For comparing models in terms of accuracy, we find that some widely used benchmark datasets, including MRPC and SST-2, are now too small to be able to properly measure future progress against top performing models (§3). We also introduce a novel approach to power analysis for machine translation and characterize power in experiments testing for differences in BLEU (§4). Finally, a survey of recent papers reveals a general lack of statistical evaluation and a dearth of detailed reporting (§5.1).

To improve future practice, we suggest broader adoption of power analyses prior to evaluation, provide guidance on running power analyses in NLP, and release a series of notebooks for this purpose.

Power Analysis for NLP

Because most NLP tasks do not take the form of standard experiments in other sciences (Kraemer and Blasey, 2015; Westfall et al., 2014), it is non-trivial to run power analyses for many tasks of interest. While we cannot cover every scenario, we present here a generalizable, simulation-based approach to power analysis, along with three sample applications, which can be extended as necessary. Such an approach is modular, reusable, and transparent, and encourages planning of analyses in advance of data collection.

Every power analysis requires assumptions, and there is not likely to be a single correct approach. Rather, the point is to make one’s assumptions explicit, and include enough detail so as to account for whatever is likely to be observed. By using reasonable assumptions, one can help to ensure that one’s experiment is sufficiently well-powered. In the case of NLP, this means that one recruits enough subjects, collects enough ratings, or uses a large enough test set.

The general procedure we suggest for power analysis is described in detail in Figure 2. At a high level, the idea is to estimate power by running simulations. Recall that power is the probability of detecting a true effect, conditional on the experimental setting (effect size, variance, etc.) and significance threshold. Thus, if one can translate these assumptions into a process for generating simulated data, we can estimate power by generating many simulated datasets using assumed or estimated parameter values, running each sample through a significance test, and reporting the proportion that are found to be significant.

The key to generalizing this approach is to begin with the end in mind. In particular, if one plans to test for a difference between models, one needs to choose the statistical test that will be used. That test will determine the level of detail required in the generative process for simulating data.

To return to the opening example of evaluating dialog systems, we want to test if people prefer one system over the other (Ai et al., 2007). If we ignore the nuances of human preference for now (but see §5 for a more nuanced approach), and simply assume that each person either prefers system A or system B, the only assumption we need to make for a power analysis in this setting is the proportion of people in the population who prefer system B. We can then simulate samples of $n$ people (each of whom independently has the same probability of preferring system B) as a draw from a binomial distribution, and repeat this thousands of times. (We don’t need to address variance in this scenario, as the variance of a binomial distribution is a function of its mean.) For each sample, we then test whether the proportion of people who prefer system B is significantly different from 0.5. The estimated power of this experiment would thus be the proportion of simulated differences that are found to be significant. More direct solutions are available for some settings, including this one (see Appendix E.5), but we describe it using the generic approach from Figure 2 for the purpose of illustration. For all cases examined in this paper, simulations take only minutes on a laptop.
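The dialog-system example can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the released notebooks: the function names are hypothetical, the exact two-sided binomial test is one reasonable choice of test, and the direct probability computation limits it to moderate $n$ (up to roughly 1,000).

```python
import math
import random


def binomial_p_values(n, p_null=0.5):
    """Exact two-sided binomial p-value for every possible count k = 0..n.

    The p-value for k sums the null probabilities of all outcomes that are
    no more likely than k (a small tolerance guards against float ties).
    """
    probs = [math.comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
             for k in range(n + 1)]
    return [min(1.0, sum(q for q in probs if q <= probs[k] * (1 + 1e-9)))
            for k in range(n + 1)]


def estimate_power(n, p_prefer_b, n_sims=2000, alpha=0.05, seed=0):
    """Estimate power by simulation for a study with n independent raters.

    p_prefer_b is the assumed population proportion preferring system B.
    """
    rng = random.Random(seed)
    p_values = binomial_p_values(n)  # the test depends only on the count
    n_significant = 0
    for _ in range(n_sims):
        # Simulate one experiment: how many of n people prefer system B.
        k = sum(rng.random() < p_prefer_b for _ in range(n))
        if p_values[k] <= alpha:
            n_significant += 1
    return n_significant / n_sims
```

Evaluating `estimate_power(n, p)` over a grid of plausible values of `p` and candidate sample sizes `n` gives the kind of power curve one would consult before recruiting raters.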

The most difficult part of power analyses is estimating the relevant quantities, such as the true proportion of people that prefer system B. Note, however, that one can always compute what power would be for a range of possible values, and indeed, this is the recommended procedure. For estimating the relevant parameters within an NLP context, we will primarily rely on data from the literature, measurements on validation data, and estimates from external datasets (see §3.2). However, where appropriate, pilot studies may also be informative.

In the remainder of this paper, we consider three scenarios of interest in depth, and assess the state of power in the NLP literature for each.

Comparing Models on Accuracy

It is common in NLP research to look for models which improve over state of the art (SOTA) on various benchmarks. However, an important but rarely asked question is, can these benchmarks support the kinds of comparisons we want to make? Many have emphasized the need for proper significance testing to avoid spurious findings, but if an experiment’s test set is small, the minimum detectable effect (MDE) size may be large: only large improvements will yield sufficiently powered comparisons (i.e., $\geq 80\%$ power). If an experiment is badly underpowered, it cannot provide useful evidence that one model achieves slightly better performance than another for the underlying data distribution. Reliance on such evidence risks leading to over-confidence about the relative ranking of various models. As we show in §3.3, there is legitimate reason to be concerned about this in the case of certain widely used benchmarks.

The standard statistical test for comparing classifiers on paired data is McNemar’s test (Dietterich, 1998; Dror et al., 2018), which uses the numbers of items where the models disagree (i.e., the off-diagonal elements in Table 1). (Unpaired data, i.e., two models evaluated on different data drawn from the same distribution, requires a different approach, such as a binomial test; see Appendix E.5 for extended discussion.) McNemar’s test assesses whether $\chi^{2}=\frac{\left(p_{10}-p_{01}\right)^{2}}{p_{10}+p_{01}}$ is significant, and if so, rejects the null hypothesis that the distributions are the same.

Thus, for McNemar’s test, the relevant data generating process for simulations can be specified in terms of the expected difference in accuracy between the models, $\Delta_{acc}$ , and $P_{a}$ , the expected proportion of examples for which the models will have the same outcome (i.e., both correct or both incorrect). From these we can compute the expected proportions of examples on which only one model is correct (i.e., the off-diagonals in Table 1), and estimate power via the algorithm in Figure 2. Figure 3 illustrates how power increases with increased sample size, effect size, and agreement rate. Corresponding plots showing Type-M and Type-S error (Gelman and Carlin, 2014) are in Appendix B. For a worked numerical example, see Appendix C. For an interactive example, see the accompanying online notebooks.
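As a rough sketch of this procedure (not the accompanying notebooks), note that the two discordant proportions must sum to $1-P_{a}$ and differ by $\Delta_{acc}$, which gives $(1-P_{a}\pm\Delta_{acc})/2$. The code below makes several assumptions that should be read as such: the function names are hypothetical, the test statistic is computed on discordant counts rather than proportions, no continuity correction is applied, and the p-value uses the chi-squared distribution with one degree of freedom.

```python
import math
import random


def discordant_rates(delta_acc, p_agree):
    """Expected share of items where only the new / only the baseline model is right."""
    p_disagree = 1.0 - p_agree
    p_new_only = (p_disagree + delta_acc) / 2
    p_base_only = (p_disagree - delta_acc) / 2
    if p_new_only < 0 or p_base_only < 0:
        raise ValueError("|delta_acc| cannot exceed the disagreement rate")
    return p_new_only, p_base_only


def mcnemar_p_value(n_new_only, n_base_only):
    """McNemar chi-squared test on discordant counts (1 d.o.f., uncorrected)."""
    total = n_new_only + n_base_only
    if total == 0:
        return 1.0
    stat = (n_new_only - n_base_only) ** 2 / total
    # Survival function of chi-squared with 1 d.o.f.: P(|Z| > sqrt(stat)).
    return math.erfc(math.sqrt(stat / 2))


def mcnemar_power(n, delta_acc, p_agree, n_sims=2000, alpha=0.05, seed=0):
    """Share of simulated test sets of size n with a significant McNemar test."""
    p_new_only, p_base_only = discordant_rates(delta_acc, p_agree)
    rng = random.Random(seed)
    n_significant = 0
    for _ in range(n_sims):
        new_only = base_only = 0
        for _ in range(n):  # assign each test item to a cell of the 2x2 table
            u = rng.random()
            if u < p_new_only:
                new_only += 1
            elif u < p_new_only + p_base_only:
                base_only += 1
        if mcnemar_p_value(new_only, base_only) <= alpha:
            n_significant += 1
    return n_significant / n_sims
```

A minimum detectable effect can then be approximated by increasing `delta_acc` for a fixed `n` and `p_agree` until the estimated power first reaches 80%.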

3.2 Estimating parameters

In order to estimate the required parameters ( $P_{a}$ and $\Delta_{acc}$ ), we consider three options: (1) use results on validation (dev) data; (2) fit a regression based on historical data; (3) use middle-of-the-road assumptions when lacking other information. Using these methods, we can then estimate power or calculate the smallest effect that can be detected with 80% power at $\alpha=0.05$ (or other thresholds). Both to illustrate this process, and to provide guidance for future work, we demonstrate these approaches below using data from two widely-used datasets for evaluating NLP models: SQuAD 2.0 (Rajpurkar et al., 2016, 2018) and the GLUE benchmark (Wang et al., 2018).

To the extent that we expect performance on test data to match performance on validation data (i.e., in the absence of domain shift), paired performance on validation data (i.e., difference in accuracy and agreement rate) provides one method for estimating power when comparing against a baseline model.

To illustrate this, from the authors of SQuAD 2.0, we obtain the pairwise agreement rates between all models submitted to the leaderboard on both validation and test data. We find a very strong correlation between validation and test for both pairwise accuracy differences ( $\Delta_{acc}$ ) and agreement rates ( $P_{a}$ ) ( $r=0.99$ for both, as shown in Figure 9 in Appendix D, with results on validation data included in the accompanying online materials), suggesting we can use paired predictions on validation data for power calculations when we have access to the predictions from both models. Note that this approach assumes that the dev and test data have been drawn from the same distribution, and that dev performance has not been artificially inflated (such as by training on validation data directly).

When one does not have access to the baseline model or an informative prior, one can make use of historical trends. That is, we can try to estimate what a typical improvement will look like, given the current state of the art (SOTA). To illustrate this approach, we collect reported results for both SQuAD 2.0 and GLUE, and fit regressions to estimate $\Delta_{acc}$ and $P_{a}$ . Given these parameters, we can assess the likely power and MDE for a typical model improvement against a given baseline accuracy level.

To fit a regression to predict typical improvements to SOTA, we gather data from GLUE papers and manually label 119 accuracy comparisons and 57 claims of improvement (as denoted by bolding of a result and a claim of SOTA in text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying paper). In regressing $\Delta_{acc}$ on baseline accuracy and task, we achieve an $R^{2}=0.69$ , which is not a perfect fit, but still provides a prior on likely effect size. Similarly, we achieve an $R^{2}=0.67$ when fitting a regression to SOTA improvements on the SQuAD 2.0 leaderboard (selected as being a significant improvement in time-ordered submissions). See Appendix E.2.1 for more details.

To assess power for McNemar’s test, we must also fit a regression predicting the expected overlap between the models ( $P_{a}$ ). To fit such a regression, from GLUE authors we obtain the model test set predictions on all tasks from a set of 10 high-performing models, which allows us to measure the extent to which their predictions overlap with each other. Using the GLUE tasks which measure accuracy, we regress $P_{a}$ on baseline accuracy and $\Delta_{acc}$ , and achieve an $R^{2}$ of $0.97$ . These tasks are WNLI (Levesque et al., 2012), MRPC (Dolan and Brockett, 2005), SST-2 (Socher et al., 2013), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), QNLI (Rajpurkar et al., 2016), MNLI (Williams et al., 2018), and QQP (Iyer et al., 2017); for consideration of other metrics, see Appendix F. Repeating this for SQuAD 2.0, we get an $R^{2}$ of $0.94$ . See Appendix E.2 for regression coefficients and additional details.

Typical improvements on popular tasks tend to be small (see mean improvements in Table 2). Except for rare transformative work, such as BERT (Devlin et al., 2019), it is generally difficult to do much better than a previous SOTA and thus improvements are likely to follow a trend, which is why we are able to use historical data as a guide. In cases where such data is not available or cannot be trusted, other methods are necessary.

If no informative prior is available and the baseline model can’t be used for comparison on a validation set, then we must fall back on middle-of-the-road assumptions. Lachenbruch (1992) provides a suggested default prior, and we find that MDEs using this method are very similar to those found by using the regression-based approach. Appendix E.3 provides more details, and Table 9 in the appendix presents the comparison.

3.3 Assessing power in the literature

Using the regression-based approach of estimating $\Delta_{acc}$ and $P_{a}$ described above, we estimate the MDE for each individual accuracy-based GLUE task in comparison to current SOTA, and report the average effect size of results which claimed improvements. Table 2 summarizes these results, showing for each dataset the size of the test set, the accuracy of the best performing model on each task at the time of writing, the estimated MDE to have 80% power using our regression to predict overlap ( $P_{a}$ ), and the average reported difference from their respective baselines.

As can be seen in Table 2, the mean reported effect size ( $|\Delta_{acc}|$ ) is well below the estimated MDE for the three smallest test sets – WNLI, MRPC, and SST-2. Because this mean is based on models comparing to even weaker baselines, we would expect most future improvements to be even smaller. Thus, most future experiments involving these three datasets will not have adequate power to test for improvements over the current SOTA in the way that they are routinely used. Moreover, alternative analyses give even more pessimistic estimates of likely improvements relative to MDE, as described in Appendix E.4. If an experiment does show significant improvement on a dataset such as MRPC, the potential for Type-M error should make us skeptical that this improvement will generalize to new data from the same domain.

While the above results are informative about future experiments, we would also ideally like to know about the power of past experiments. Most of the papers from which we collected results did not report a significance test on the test set. Here we estimate the expected power and predicted result of such a test using leave-one-out regressions, where we make a prediction for each reported improvement using all other reported model comparisons. This procedure reveals that only 46% would have predicted adequate power (using estimates for expected improvement and agreement), and approximately 51% would have been significant (based on estimated agreement and reported improvement). Approximately 80% of experiments with at least 80% power would also have been found to be significant (37% of all comparisons).

In part because performance on many of these tasks is now so good, a large expected improvement is required in order for a new experiment to have 80% power, suggesting that larger test set sizes may be necessary to continue making well-powered claims of SOTA improvement on individual tasks. For any comparisons which are likely to be underpowered, we should refrain from placing much emphasis on obtaining small improvements over the previously reported best model. In extreme cases, such as MRPC and SST-2, it is worth considering whether it is time to retire these datasets as the basis for model comparison. It is also worth exploring power with respect to claims of improvement on multiple tasks with a single model (Demšar, 2006), rather than each task individually. We leave consideration of this as an interesting direction for future work.

Machine Translation

To show how our approach to power analysis can be applied to a more difficult setting, we consider automated evaluation of machine translation using BLEU scores (Papineni et al., 2002). As with accuracy, we would like to know what scale of improvements can be detected with reasonable power on typical test sets. This setting is more complicated because (1) BLEU is a corpus-level metric, rather than being averaged across instances, and (2) typical models are trained on vast amounts of parallel data, with little data available that has not been used in training, making it difficult to estimate variation in performance.

To test for a significant difference between two MT models we use the randomization test, as recommended in Dror et al. (2018): given the paired output translations from both models, swap the outputs for a random subset of test examples and compute the resulting difference in BLEU. Repeating this thousands of times gives us a null distribution, which can be used to test the observed difference between models.

If large amounts of untouched evaluation data were available, we could approach power analysis by simply evaluating BLEU score on many random subsets of $n$ sentences, and computing the mean and variance of each system. Unfortunately, because MT depends on parallel text (most of which is used in training), evaluation data tends to be scarce. Instead, we introduce a generative process that can produce the necessary inputs for power analysis.

For intuition, note that if we swap the $i^{\textrm{th}}$ pair of model outputs (as is done in the randomization test), leaving the rest as they are, we change the difference in BLEU between models by a specific amount, $\delta_{i}$ , which we call the effect of making that swap. While these individual effects are not independent of each other due to the corpus-level nature of the metric, in practice, the sum of individual effects closely approximates the net effect of swapping entire subsets (see Figure 15 in Appendix G).

Given this generative process, we can then estimate power using the algorithm in Figure 2. On each iteration, draw a simulated dataset from the generative process, compute the observed difference between models as $\hat{\Delta}_{B}=-\frac{1}{2}\sum_{i=1}^{n}\delta_{i}$ , and test if this is significantly different from zero using a modified randomization test, in which we assume that the net effect of swapping a subset of instances is simply the sum of the $\delta_{i}$ ’s in the subset. (Please see the online materials for an interactive example.)
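The modified randomization test itself can be sketched as follows. This sketch takes the per-pair swap effects $\delta_{i}$ as given and does not implement the generative process that produces them. The function names are hypothetical, each pair is swapped independently with probability 0.5, and the add-one smoothing of the p-value is a common convention that we assume here rather than a detail taken from the text.

```python
import random


def observed_difference(deltas):
    """Observed BLEU difference implied by the swap effects.

    Swapping every pair flips the sign of the difference, so the effects
    sum to -2 times the observed difference.
    """
    return -0.5 * sum(deltas)


def additive_randomization_p_value(deltas, n_permutations=2000, seed=0):
    """Two-sided randomization test under the additive approximation.

    The difference after swapping a random subset is approximated as the
    observed difference plus the sum of the swap effects in that subset.
    """
    rng = random.Random(seed)
    observed = observed_difference(deltas)
    n_extreme = 0
    for _ in range(n_permutations):
        # Swap each output pair independently with probability 0.5.
        swapped_effect = sum(d for d in deltas if rng.random() < 0.5)
        if abs(observed + swapped_effect) >= abs(observed):
            n_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (n_extreme + 1) / (n_permutations + 1)
```

Wrapping this test in a loop that repeatedly draws simulated sets of swap effects, and recording how often the p-value falls below the significance threshold, gives the power estimate described above.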

In order to estimate reasonable values for the required parameters, we use several pretrained models from the fairseq library (Ott et al., 2019) for the WMT English-German translation task. We evaluate these models on the shared task test sets from 2016-2019 and compute BLEU scores using sacrebleu (Post, 2018). Fitting a Delta-Laplace mixture to the effects of swapping individual output pairs, we estimate values for $\hat{P}_{0}$ and $\hat{b}_{0}$ , reported in Table 3. (See also Figure 16 in Appendix G; code for computing estimates is provided in the online materials).

While far from identical, the four comparisons, each representing different stages of model evolution, all produce similar estimates. Although these estimates are only based on a single language pair, the models and test sets are relatively diverse, and we expect that these estimates will generalize, though better estimates could be obtained by fitting this distribution to a new domain of interest.

Using these estimates, we can now characterize how much power test sets of different sizes ( $n$ ) would have for a range of possible differences in BLEU ( $\Delta_{B}$ ). Figure 4 shows this for $P_{0}$ and $b_{0}$ set to the average of the observed values. For a sensitivity analysis of how power varies under different assumptions for $P_{0}$ and $b_{0}$ , please see Figure 17 in Appendix G. Based on this estimate, we conclude that for typical MT test sets of around 2,000 examples, an improvement of 1 BLEU point can likely be detected with approximately 75% power. As shown in Figure 4, this power level increases dramatically with sample size and effect size.

This analysis has served, in part, to show how a simulation-based approach to power analysis can be adapted to virtually any task. Additional work is required to test how well these specific parameter estimates will generalize, but the same process can easily be adapted to new language pairs. More generally, there would be great value in the MT community curating larger held-out test sets, both to validate this analysis, and for better powered future comparison.

Likert-Scale Human Evaluations

Tasks such as natural language generation are difficult to evaluate using automated methods; as such, human evaluations are central to NLP. Past work has reported great variation in how human evaluations are done (van der Lee et al., 2019). Therefore, we begin with a meta-analysis of a subset of human evaluation experiments from EMNLP 2019, which we then use as the basis for claims about the power of human evaluations in NLP more generally.

To characterize the state of human evaluation in NLP, we identified papers from the main session of EMNLP 2019 that made use of human evaluations (details in Appendix H.2). To generalize across studies, we restrict our analysis to Likert-scale comparisons, which were the most commonly reported type of evaluation. We extracted all cases where a new model was being compared to the best-performing baseline on one or more metrics (117 comparisons from 41 papers) and normalized all ratings to be on a 0-1 scale.

One takeaway from this meta-analysis is that the reported effect sizes (that is, the difference between the novel model and the best-performing baseline) vary widely ( $\textrm{s.d.}=.12$ on a 0-1 scale). The number of items tested is more consistent: 69% used 100 or fewer, and only 18% used over 200. But, as similarly found by van der Lee et al. (2019), many key details were not reported in this sample of experiments. Most commonly missing was the number of ratings per item (34% of all experiments), followed by the total number of workers (28%). For 7% of experiments, we could not determine the number of items tested. 57% of experiments collected 3 annotations per item, which was also the modal number of unique annotators. Thus, it is often difficult to ascertain, for any particular experiment, the details of the experimental setting that are necessary to evaluate the validity of the results.

Because the number of items rated was the most commonly reported, we use that as our proxy for sample size. Figure 5 shows scaled mean difference between models as a function of number of items. As expected, we see greater variance in effects with smaller samples since, with smaller samples, we expect greater noise. We also observe a slight negative correlation between effect size and sample size. That is, as sample size gets larger (and, thus, as estimates get more precise), the estimated effect size gets smaller. This trend is sometimes used as an indication of publication bias (censoring of null and opposite-direction effects) since, in a sample with no publication bias, the effect size should be independent of the sample size (Begg and Mazumdar, 1994). However, in our case, this correlation is not significant (Kendall’s $\tau=-.07$ , $p=.32$ ) and so it is difficult to draw strong conclusions. (We exclude from this analysis two large negative effects with $N=500$ which would exaggerate this correlation.)

5.2 Power analysis for human Likert ratings

What kind of effect sizes can typical human evaluation experimental designs detect? As in previous sections, we can use simulations to explore how many annotators and/or instances should be used to have sufficient power.

Simulating human experiments is conceptually simple (e.g., $m$ raters each rate $n$ generated sentences on overall quality), but for realistic simulations, we need to consider variation in items (some generated sentences are better than others), and variation by rater (some raters use higher ratings and/or respond to different aspects of quality), as well as the overall difference in quality between models. A simulation which treated all workers as identical would fail to capture this variation, and hence might overestimate power (Barr et al., 2013).

Unfortunately, details such as worker variance are rarely reported in published papers. To better characterize the typical variation in human evaluations, we rely on a convenience sample of several large datasets to estimate these parameters and use them in our simulations as a proxy for what we might observe in practice. Although focused on different tasks, all use a similar methodology, namely, getting many Likert-scale annotations per instance from many annotators and models (in some cases as many as 20 ratings per item). We use publicly available or author-provided data from Hashimoto et al. (2019), Dathathri et al. (2020), Holtzman et al. (2020), and WMT19 (links in Appendix H.2).

In order to extract estimates of these parameters for our simulations, we use hierarchical mixed-effects models, as used in psychology and other behavioral fields (Barr et al., 2013; Gelman and Hill, 2006). Such models incorporate variation in the quality of generated instances, annotator responses, and annotator sensitivity, and are recommended by van der Lee et al. (2019) for analyzing human evaluations. (We provide details in Appendix H.3 and include code for fitting such models as part of the online materials). Using this approach, we obtain an estimate of the relevant parameters from each of the large datasets. From these, we choose sets of parameters to be representative of experiments with high or low variance, with full results in Appendix H.3 (see Table 16 for parameter estimates).

As before, we then use these estimates to simulate data, assess significance on the simulated data (here using mixed-effects regression), and compute power as a function of mean difference and sample size. These simulations require estimates for 7 parameters: the baseline, the effect size, variance by worker, variance by worker as a function of model, variance by item, variance by item as a function of model, and residual variance. The resulting power estimates are shown in Figure 6, plotted in terms of effect size, sample size, and numbers of workers and items, for both the high and low variance scenarios. From this analysis, we highlight a few key takeaways:

Many human evaluation studies are likely underpowered: Using the “high variance” parameters (which are typical of most of the datasets we used), the most common design at EMNLP 2019 (3 workers, 100 items) is underpowered unless the effect size is quite large (0.2 or higher on the 0-1 scale).

Even with low variance, typical designs are underpowered to detect small effects: Using our estimated parameters for the low variance setting, experiments will be underpowered to detect small effects (0.05 on the 0-1 scale), unless an unusually large number of ratings per item are collected (10+ for 100 items).

Need for improved reporting: Most human evaluations do not report enough detail to interpret the results. This could be drastically improved through basic power analyses, significance testing using mixed-effects models, and sharing of raw data.

Given our model estimates and simulations, we conclude that, in aggregate, many human evaluations are underpowered and would benefit from larger sample sizes, particularly by using more workers per item. Increased adoption of even approximate power calculations within the NLP community will promote thoughtful consideration of appropriate sample sizes and improve the reliability and replicability of results.
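The data-generating step behind such simulations can be sketched as below. This is a simplified illustration rather than the model specified in Appendix H.3: it assumes a fully crossed design (every worker rates every item under both models), independent normally distributed random intercepts and model-specific offsets, and unclipped ratings. The function and parameter names are hypothetical, and the mixed-effects significance test that would be applied to each simulated dataset is not shown.

```python
import random


def simulate_ratings(n_workers, n_items, baseline, effect,
                     sd_worker, sd_worker_model,
                     sd_item, sd_item_model, sd_resid, seed=0):
    """Generate (worker, item, model, rating) rows for one simulated study.

    model is 0 for the baseline system and 1 for the new system. The seven
    statistical parameters mirror those listed in the text, expressed here
    as standard deviations rather than variances.
    """
    rng = random.Random(seed)
    # Per-worker intercept and per-worker shift in the model effect.
    workers = [(rng.gauss(0, sd_worker), rng.gauss(0, sd_worker_model))
               for _ in range(n_workers)]
    # Per-item intercept and per-item shift in the model effect.
    items = [(rng.gauss(0, sd_item), rng.gauss(0, sd_item_model))
             for _ in range(n_items)]

    rows = []
    for w, (w_intercept, w_slope) in enumerate(workers):
        for i, (i_intercept, i_slope) in enumerate(items):
            for model in (0, 1):
                mean = (baseline + w_intercept + i_intercept
                        + model * (effect + w_slope + i_slope))
                rows.append((w, i, model, mean + rng.gauss(0, sd_resid)))
    return rows
```

Power would then be estimated by generating many such datasets for a given design, fitting a mixed-effects regression to each, and recording the proportion in which the model effect is significant.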

Overall Recommendations

Power analyses should be done prior to evaluation when comparing against a baseline. If a comparison is likely to be underpowered, the pros and cons of running that evaluation should be carefully considered. Underpowered experiments do not provide convincing evidence of progress.

For new datasets and shared tasks, the number of instances in the test will determine the minimum detectable effect size, and should be chosen accordingly.

For tasks which no longer have adequate power to detect typical improvements (e.g., MRPC and SST-2), authors should consider expanding the test set or retiring the task.

To facilitate future power calculation and significance tests, model owners should release final fine-tuned model checkpoints. Alternatively, leaderboard owners may wish to make validation set predictions from all submitted models publicly available.

For human evaluations, (anonymized) raw data should be shared, along with parameters and code to replicate the analysis, including proper significance testing. Prior to collecting human evaluation data, researchers should create an analysis plan and run power analyses to determine an appropriate sample size (likely requiring more workers and items than is currently typical in NLP).

Conclusion

Recent progress in NLP has been extraordinarily rapid, sometimes at the cost of experimental rigor. In this paper, we have presented evidence that underpowered experiments are widespread in NLP. For comparisons based on small samples, there is little reason to think that such an evaluation could reliably provide evidence of a significant improvement, and good reason to believe that improvements found to be significant will exaggerate or reverse the true effect. Going forward, a combination of larger test sets, simple power analyses, and wider sharing of code, data, and experimental details will help to build the foundation for a higher standard of experimental methodology in NLP.

Acknowledgments

Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. Thanks to Sam Bowman, Amanpreet Singh, Kevin Clark, Naman Goyal, and Colin Raffel for providing data from submissions to the GLUE leaderboard, as well as Taylor Berg-Kirkpatrick, Sumanth Dathathri, Ari Holtzman, Hannah Rashkin, and Nikita Srivatsan for providing raw human evaluation data, not all of which made it into the paper.

References

Appendix A Further Discussion of Significance Testing, Power Analysis, and Post-Hoc Analysis

In this paper, we work within the framework of null hypothesis significance testing (NHST). NHST is not free from problems, in that certain systematic processes within the practice of scientific research and publishing can undermine its advantages, many of which have been explored in the literature (Gelman and Loken, 2013; Ioannidis, 2019; McShane et al., 2019). Nevertheless, it would be premature to discard the entire paradigm, and we believe there is still some value in considering power within NHST for several reasons.

First, despite its flaws, NHST remains a commonly used experimental framework in NLP research. Whether implicit or explicit, most experimental comparisons in the NLP literature have the structure of an experiment in the NHST framework, where having equivalent performance to an existing baseline is treated as a null hypothesis and the new model is argued to be significantly better (the typical case) or significantly worse (far rarer). But, whereas many fields that run experiments have standardized procedures for assessing statistical significance, NLP papers vary as to how formally they use a hypothesis testing framework to evaluate their results (Berg-Kirkpatrick et al., 2012; van der Lee et al., 2019; Azer et al., 2020).

Second, when done properly, NHST does provide a convenient way of summarizing results. Improvements in overall methodology, such as sharing code and data, sensitivity analyses, greater interest in null findings, and even pre-registration can vastly improve the validity of this paradigm, and we are seeing adoption of some of these practices within NLP.

Finally, there is also a great need for additional clarity with respect to precisely what claims are being made by NLP papers. In this work, we are primarily focused on claims made about trained models (i.e. in testing whether one particular instantiation of a model is significantly better than a particular instantiation of another model). It is, of course, also important to consider broader claims that might be made, such as about expected performance or computational budget (Dodge et al., 2019; Schwartz et al., 2019), and everything we have to say can be extended to incorporate such considerations. For the purpose of clarity, however, we restrict ourselves to the simplest sort of statistical claim.

The probability that a statistical test will reject the null hypothesis in an experiment is a function of several parameters, some of which are typically known or controllable, such as the sample size and significance threshold, and some of which are unknown, such as the details about exactly how models differ. Power tells us what this probability would be, if we knew the true values for these unknown parameters. Conditional on a particular difference existing (e.g. an expected difference in accuracy between two models for a particular data distribution), along with a statistical test and a significance threshold, power is the probability that the test will reject the null hypothesis and find the observed difference to be significant. In common statistical terminology, power is one minus the probability of a false negative (a type II error), that is, of failing to reject the null hypothesis when it is false.

While we will not, in general, know what the true power of an experiment is, by making reasonable assumptions, we can try to choose appropriate values for those parameters that we can control. By making assumptions about what we expect to observe, we can obtain estimates of how much power a test is likely to have, which may lead us to modify our experimental design, such as by increasing the sample size.

Importantly, proper experiment design requires specifying these parameters in advance of data collection, or otherwise using a valid stopping rule. One can always obtain a significant result by progressively collecting data until a significant result is found (“sampling to a foregone conclusion”), but this is not a valid procedure (Anscombe, 1954; Wagenmakers, 2007). Similarly, post-hoc power analysis, using estimates derived from the experiment itself, provides no additional information beyond a transformation of the observed $p$ -value, and is thus not recommended (though see below).

Expanding on the algorithm in Figure 2, a simulation-based power analysis involves the following:

First, determine the statistical test, $T$ , which will be used. For the example of comparing models depicted in Figure 1, we will use the binomial test to compare the systems (Dror et al., 2018).

Come up with a generative process which could be used to generate data like that which we will collect. In this step, we need to make assumptions about the comparison of interest. Since the binomial test requires only the counts of how many people prefer each system, we need to specify a prior on generating those counts. For example, we might assume that 60% of people will prefer system B, so the generative process will be $c_{B}\sim\textrm{Binomial}(p=0.6,n)$ , where $n$ is the total number of people to be sampled.

Choose a value of $n$ for which we want to calculate power. Repeatedly (e.g., 10,000 times) draw many samples from our assumed generative process for that size of $n$ .

For each simulated dataset of size $n$, run the chosen statistical test to check if the difference between the observed counts is significant, and compute the proportion that are found to be significant. This is our estimate of power.

Note that more direct solutions for power analysis do exist for some settings, such as this one (see Appendix E.5 below).
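The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for the paper: the function names (`binomial_two_sided_p`, `simulate_power`), the default of 10,000 simulations, and the choice of an exact two-sided binomial test against $p_{0}=0.5$ are assumptions made for the example.

```python
import math
import random


def binomial_two_sided_p(c, n, p0=0.5):
    """Exact two-sided binomial test p-value for c successes out of n trials."""
    pmf = [math.comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    threshold = pmf[c] * (1 + 1e-9)
    return min(1.0, sum(q for q in pmf if q <= threshold))


def simulate_power(n, p_prefer_b=0.6, alpha=0.05, n_sims=10_000, seed=0):
    """Estimate power as the share of simulated experiments with p <= alpha."""
    rng = random.Random(seed)
    # The p-value depends only on the count, so compute it once per possible count.
    p_values = [binomial_two_sided_p(c, n) for c in range(n + 1)]
    significant = 0
    for _ in range(n_sims):
        # Assumed generative process: each of the n raters independently
        # prefers system B with probability p_prefer_b.
        c_b = sum(rng.random() < p_prefer_b for _ in range(n))
        significant += p_values[c_b] <= alpha
    return significant / n_sims
```

Calling `simulate_power(n)` for several candidate values of `n` traces out a power curve for the assumed 60% preference rate; rerunning with other values of `p_prefer_b` shows how sensitive the conclusion is to that assumption.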

Post-hoc power analysis is an issue when the true population effect has variance to it (O’Keefe, 2007; Hoenig and Heisey, 2001; Gelman, 2019). In the case of NLP models, there are several perspectives on the comparisons which can lead to differences regarding how we perceive post-hoc power analysis: (1) we are comparing one model vs. another on a particular test set, the effect we see is the true population effect, post-hoc power analysis is okay because it is deterministic; (2) we are comparing one model vs. another on a data distribution from which the test and dev set are drawn, post-hoc power is not okay; (3) we are comparing one training algorithm vs. another (including variance from both training procedures and test/dev set draws), post-hoc power analysis is still not okay. We specifically look at the case of (2). While (3) is interesting on its own, this is not the typical comparison done (yet) in NLP research and thus we do not have enough information on reported training variance to investigate this thoroughly here. The case of (1) is also atypical as the authors of a study typically wish to draw inferences about how well a model does on the true data distribution (hence, why a dev and test set are used).

Appendix B Type-M and Type-S errors

Although the most obvious risk of using underpowered experiments is that there is a greater chance of failing to detect a true effect, there is an additional harm of using an underpowered design, which has emerged in light of the replication crisis in science. This can be most easily understood through the idea of Type-M and Type-S error (Gelman and Carlin, 2014).

Type-M error is the extent to which an observed difference exaggerates the true effect, conditional on a finding being significant. Type-S error is the probability that an observed difference has the opposite sign of the true difference, again conditional on a finding being significant. Even in a low-powered experiment, there is some probability of finding an effect to be significant; the lower the power, however, the more likely it is that the observed significant difference has the opposite sign of the true effect, and the larger the degree to which the magnitude of the observed effect will tend to exaggerate the true effect.

Intuitively, if power is low, this means that the sample size is small relative to the effect size. As such, the difference will only be significant if an atypically large effect is observed. Assuming the use of a two-sided test, many of these significant findings will also have the wrong sign, as they will be nearly as likely to fall on either side of zero for a symmetric distribution.

Type-M and Type-S error rates can be estimated using the exact same process for power analysis as described in Figure 2. To do so, we need only augment the algorithm with these two additional steps:

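$\textrm{Type-M error}\approx\sum_{i:p_{i}\leq\alpha}\frac{\textrm{abs}(e_{i})/\textrm{abs}(e^{*})}{|\{j:p_{j}\leq\alpha\}|}$

$\textrm{Type-S error}\approx\sum_{i:p_{i}\leq\alpha}\frac{\mathbb{I}[\textrm{sign}(e_{i})\neq\textrm{sign}(e^{*})]}{|\{j:p_{j}\leq\alpha\}|}$

Here $e_{i}$ and $p_{i}$ are the observed effect and $p$-value in simulation $i$, and $e^{*}$ is the assumed true effect.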

Figures 7 and 8 show scenarios for comparing classifiers on accuracy, corresponding to Figure 3 in the main text, but showing expected Type-M and Type-S error instead of power. As can be seen, Type-M and Type-S error increase with smaller sample sizes, smaller differences between models, and lower agreement rates, all corresponding to lower power.
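To make the two quantities concrete, the following sketch estimates power, Type-M error, and Type-S error by simulation. It deliberately uses a simpler setting than the classifier comparisons in the figures: the observed effect is assumed to be normally distributed around the true effect with a known standard error, and significance is decided with a two-sided $z$-test. The function name `type_m_type_s` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def type_m_type_s(true_effect, std_error, alpha=0.05, n_sims=100_000, seed=0):
    """Return (power, Type-M, Type-S) for a nonzero true effect.

    Assumes the observed effect is Normal(true_effect, std_error) and that
    significance is decided by a two-sided z-test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    n_significant, exaggeration, wrong_sign = 0, 0.0, 0
    for _ in range(n_sims):
        e_i = rng.gauss(true_effect, std_error)
        if abs(e_i) / std_error >= z_crit:  # equivalent to p_i <= alpha
            n_significant += 1
            exaggeration += abs(e_i) / abs(true_effect)
            wrong_sign += (e_i > 0) != (true_effect > 0)
    if n_significant == 0:
        return 0.0, float("nan"), float("nan")
    return (
        n_significant / n_sims,          # power
        exaggeration / n_significant,    # Type-M: average exaggeration ratio
        wrong_sign / n_significant,      # Type-S: share with the wrong sign
    )
```

Comparing, say, `type_m_type_s(0.02, 0.02)` with `type_m_type_s(0.02, 0.005)` shows the pattern described above: the noisier design has lower power, a larger exaggeration ratio, and a nonzero chance of a significant result with the wrong sign.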

Appendix C Numerical Example of a McNemar’s Test Simulation

To provide a concrete example of comparing classifiers on accuracy, imagine that a test set for a benchmark task has 500 instances. Based on prior knowledge (see main paper), we might assume that our proposed model will achieve, at most, an absolute improvement of 2 percentage points over the state of the art ($\Delta_{acc}=0.02$), and that the models are likely to agree on 90% of examples ($P_{a}=0.9$). We can convert these assumptions into a distribution over outcomes which will define our generative process. In particular, for a random unseen instance, these assumptions imply that there is a 10% chance of a disagreement; the probability that our model is correct and the old model is incorrect is therefore 6%, and the opposite outcome has a probability of 4% (giving us the assumed net difference of 2%). Note that, because McNemar’s test does not consider the on-diagonal elements, it is not necessary that we explicitly define the baseline accuracy. Thus, a valid probability distribution for use in this simulation could be that shown in Table 4.

By drawing many samples from this distribution of size $n=500$ and computing a $p$ -value using McNemar’s test for each, we obtain an estimate that the power of this test is approximately $0.25$ for a significance threshold of $\alpha=0.05$ , which is severely underpowered. This would also imply a Type-M error factor of 1.9; we would expect that a typical experiment that found the observed difference between models to be significant would exaggerate the true difference of 0.02 by a factor of 1.9, producing observed significant differences between models on the order of 0.04, on average. (See supplementary notebooks for calculations and interactive demonstration). As such, we conclude that this test set is too small to be able to reliably evaluate whether or not our model is significantly different from the state of the art, and should distrust any observed differences that are significant, unless we have poorly estimated the relevant parameters.

By contrast, if the test set contained 2000 examples, we would estimate the test to have nearly 80% power, with a Type-M factor of only 1.1, and would feel comfortable proceeding with and reporting on this evaluation. Similarly, if we had reason to think that our model represented a game-changing advance, and would achieve an improvement of 4 percentage points, or if we had reason to believe that the models would agree on 97.5% of examples, then we would have the power to evaluate this, even with only 500 examples.
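The numerical example can be reproduced approximately with a short simulation. The sketch below is not the supplementary notebook; it assumes the exact conditional (binomial) form of McNemar’s test, so its estimates may differ slightly from figures obtained with another variant of the test. The function names and the default of 2,000 simulations are illustrative choices.

```python
import math
import random


def mcnemar_exact_p(n10, n01):
    """Two-sided exact (binomial) McNemar p-value from the two discordant counts."""
    m = n10 + n01
    if m == 0:
        return 1.0
    k = min(n10, n01)
    tail = sum(math.comb(m, j) for j in range(k + 1)) / 2**m
    return min(1.0, 2 * tail)


def simulate_mcnemar(n=500, delta_acc=0.02, p_agree=0.9, alpha=0.05,
                     n_sims=2000, seed=0):
    """Return (estimated power, estimated Type-M factor)."""
    rng = random.Random(seed)
    p_disagree = 1 - p_agree
    p10 = (p_disagree + delta_acc) / 2  # new model correct, baseline incorrect
    p01 = (p_disagree - delta_acc) / 2  # baseline correct, new model incorrect
    outcomes, weights = ("10", "01", "agree"), (p10, p01, p_agree)
    n_significant, ratio_sum = 0, 0.0
    for _ in range(n_sims):
        draws = rng.choices(outcomes, weights=weights, k=n)
        n10, n01 = draws.count("10"), draws.count("01")
        if mcnemar_exact_p(n10, n01) <= alpha:
            n_significant += 1
            # Observed accuracy difference relative to the assumed true one.
            ratio_sum += abs(n10 - n01) / n / delta_acc
    power = n_significant / n_sims
    type_m = ratio_sum / n_significant if n_significant else float("nan")
    return power, type_m
```

Running `simulate_mcnemar()` and `simulate_mcnemar(n=2000)` contrasts the two test set sizes discussed above; varying `delta_acc` or `p_agree` covers the other scenarios in the same way.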

Appendix D SQuAD 2.0 Analysis and Results

From the authors of SQuAD 2.0, we obtained pairwise agreement statistics on the SQuAD 2.0 development and test sets for all models that were submitted to the SQuAD 2.0 leaderboard and had publicly visible development set predictions on the CodaLab platform. We removed six submissions whose exact match (EM) scores on test data were less than $50\%$ ; EM scores below $50\%$ suggest a bug or misconfiguration of the model for predicting on the test set, as the majority baseline gets roughly $50\%$ accuracy (by always predicting no-answer). We also removed one submission whose development set EM score was more than $20$ points higher than its test EM score, as it seemed likely that the model had been trained on the development set. After this filtering, we were left with 144 models.

Figure 9 shows the correlation between validation and test data for both pairwise accuracy differences ($\Delta_{acc}$) and agreement rates ($P_{a}$) on the SQuAD 2.0 leaderboard. As can be seen, these correlate well, suggesting that measuring these quantities on validation data can serve as a reasonable guide when doing a power analysis for a new model, though lower agreement rates on dev data tend to slightly underestimate agreement on test. If the validation results are available for both models, these can be used to compute estimates of $P_{a}$ and $\Delta_{acc}$, and these can be used to compute the approximate power of the test set.
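As a small illustration of the last step, the two estimates can be read directly off per-instance correctness indicators on validation data. The helper below is hypothetical; it assumes that agreement is measured at the level of correctness (both models right or both wrong on an instance), which is the quantity that determines the off-diagonal cells used by McNemar’s test.

```python
def agreement_and_accuracy_gap(correct_baseline, correct_new):
    """Estimate (P_a, delta_acc) from per-instance correctness indicators.

    Both arguments are equal-length sequences of booleans, one entry per
    validation instance, marking whether each model got that instance right.
    """
    if len(correct_baseline) != len(correct_new) or not correct_baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_baseline)
    agree = sum(a == b for a, b in zip(correct_baseline, correct_new))
    acc_baseline = sum(correct_baseline) / n
    acc_new = sum(correct_new) / n
    return agree / n, acc_new - acc_baseline
```

The returned pair can then be passed to a simulation or closed-form power calculation, ideally after being tempered with a prior, since validation estimates can be optimistic.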

To verify that using these estimates provides a reliable guide to power, we make use of the predictions made by SQuAD 2.0 submissions on both validation and test data. In particular, if we assume that each submission is being compared to the previous model to demonstrate a significant and well-powered improvement over the previous baseline, we find that 19 out of 143 submissions showed sufficient improvement on the validation set to have at least 80% power (see Figure 10). Of these, 14 (74%) attain a significant improvement over the baseline on the test data (consistent with the expected value of 80%). Of the remaining 124 submissions, 3 (2.5%) would show a significant improvement over the baseline, but did not have sufficient power based on validation performance. Interestingly, while all other significant improvements were generally well-spaced over time, these three underpowered submissions were all beaten by a new submission within 5 days. As an aside, we also note that the vast majority of submissions are significantly worse than the current SOTA, reinforcing the notion that real improvements are rare, and most improvements will be small.

Correlation between the effect size on the validation and test sets may not always be so high. Overconfidence in the power of your experiment may thus occur if the validation performance is greater than the test performance (as would be the case if no regularization was used and extensive hyperparameter tuning caused a model to overfit to the validation set). Alternatively, if comparing to a baseline with inflated performance on validation data (for the same reasons as above), running power analyses based purely on estimates from validation data would underestimate power. As such, combining validation estimates with reasonable priors is recommended.

Appendix E Accuracy

From the authors of the GLUE benchmark – as well as authors of individual models – we obtain the model test-set predictions on all tasks from a set of 10 high-performing models, which allows us to measure the extent to which their predictions overlap with each other. We select GLUE tasks which use accuracy as an evaluation metric. The relevant tasks are MNLI (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), SST-2 (Socher et al., 2013), QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016), and WNLI (Levesque et al., 2012). For consideration of other metrics, see Appendix F.

We use model predictions for: ELECTRA (small, base, large, large with tricks) (Clark et al., 2019b), XLNet (large) (Yang et al., 2019), T5 (Raffel et al., 2019), ALBERT (large) (Lan et al., 2020), BAM (large) (Clark et al., 2019a), RoBERTa (large) (Liu et al., 2019), and BERT (Devlin et al., 2019). We only had the model predictions available and extrapolated overlap from them; we did not have access to the models themselves, ground truth test set labels, or dev set predictions for the models.

E.1.2 Comparisons and Claims

We gather data from GLUE papers regarding the accuracy tasks and manually label 119 comparisons and 57 claims of improvement (as denoted within a work by bolding of a new model’s number and a claim of SOTA in the main text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying publication). For each paper we examine if a specific comparison is made against a baseline that isn’t claiming state of the art performance. For example, the STILTs approach (Phang et al., 2018) makes comparisons against non-SOTA baselines, which we add to our labeling scheme but filter out when fitting regressions to likely SOTA improvements. We mark this as SOTA Comparison = N. We count something as a claim of SOTA improvement when there is some textual basis for the claim (e.g., “we drive state of the art performance on GLUE”) coupled with bolding of values in a table reporting baselines against the model under test. We mark datapoints as Claim of Improvement = Y if they are an improvement claim. We mark effect size as the improvement from the best previous baseline (the current SOTA) on the test set on a per-dataset basis. We note that in several cases, worse results on the new model were bolded. We treated this as no claim of improvement. If results were not bolded but still higher for the new model, we also treated this as no claim of improvement.

E.2 Regression-based approach to modeling power and MDEs

There are several versions of McNemar’s test, each with their own unique method for calculating power, sample size, or minimum effect size. See, for example, discussions in Schlesselman (1982), Duffy (1984), Suissa and Shuster (1991), Connett et al. (1987), Fagerland et al. (2013), and Lachenbruch (1992).

The methods for calculating sample size or power by Connett et al. (1987); Schlesselman (1982); Suissa and Shuster (1991) require making an assumption about the odds ratio $\Phi=p_{10}/p_{01}$ as well as an estimate of the fraction of discordant pairs (disagreements between two models).

Fagerland et al. (2013) suggest that the exact unconditional version of the test by Suissa and Shuster (1991) has desirable properties. Thus, we use the implementation of the power calculations for this test from the MESS package (https://github.com/ekstroem/MESS).

How do we make an assumption about the odds ratio and fraction of discordant pairs? We first fit an OLS regression to the existing models on the GLUE leaderboard for all binary choice accuracy tasks using the aforementioned predictions provided by the leaderboard creators and individual authors of models,

$\text{overlap}_{i}=\beta_{0}+\beta_{1}\,\text{min\_acc}_{i}+\beta_{2}\,\text{acc\_diff}_{i}+\epsilon_{i}$

for all $i$ that are a pairwise comparison between any two models, where $\text{min\_acc}_{i}$ is the minimum accuracy between the two models under comparison, $\text{acc\_diff}_{i}$ is the gap between the two models, and $\text{overlap}_{i}$ is the fraction of overlapping predictions. We end up with the model shown in Table 5.

We note that outcomes are biased toward a higher range of accuracy values and may not be a perfect prior. However, this does give us a fairly good linear fit for top-of-the-leaderboard results. We then can predict the expected overlap for a given model as:

Note now we can make an assumption on the expected fraction of discordant values and the odds ratio, the latter being:

This is all that is necessary for McNemar’s test and thus we can then simply solve for the minimum expected treatment effect for the given sample size of the dataset and a power of $80\%$. Note that for QQP we use the normal approximation rather than exact unconditional test as the large sample size makes the exact test intractable. See Duffy (1984).
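One way to see how a predicted overlap and an assumed accuracy gap determine the inputs to these calculations is the following sketch. It assumes a binary task, so that any disagreement between the two models means exactly one of them is correct; the helper name is hypothetical, and the overlap value would in practice come from the fitted regression.

```python
def discordant_probabilities(overlap, acc_diff):
    """Return (p10, p01, odds_ratio) implied by an overlap and an accuracy gap.

    p10: probability the new model is correct and the baseline is not.
    p01: probability the baseline is correct and the new model is not.
    Assumes p10 + p01 = 1 - overlap and p10 - p01 = acc_diff.
    """
    discordant = 1.0 - overlap
    if not 0.0 <= abs(acc_diff) < discordant:
        raise ValueError("accuracy gap must be smaller than the discordant fraction")
    p10 = (discordant + acc_diff) / 2
    p01 = (discordant - acc_diff) / 2
    return p10, p01, p10 / p01
```

For instance, an overlap of 0.9 with a gap of 0.02 yields the 6% and 4% off-diagonal probabilities used in the numerical example of Appendix C, and hence an odds ratio of 1.5.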

We fit such a regression to GLUE tasks and achieve an $R^{2}$ of $0.97$ . Repeating this for SQuAD 2.0, we get an $R^{2}$ of $0.94$ , with fit shown in Table 6. See Figure 11 for a plot indicating the level of agreement plotted against baseline accuracy. See also additional model comparisons for overlap in Appendix I.

E.2.2 Predicting Effect Size

A similar regression can be run to predict the expected effect size given the baseline accuracy: how much do models typically improve given the current SOTA. To fit an OLS regression predicting this value, we gather data from GLUE papers regarding the accuracy tasks and manually label 119 comparisons and 57 claims of improvement (as denoted within a work by bolding of a new model’s number and a claim of SOTA in the main text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying publication). We fit the regression:

to see how predictable the expected effect size is, where $\widehat{\Delta}_{i}$ is the predicted effect size, $\text{baseline}_{i}$ is the baseline model’s accuracy, and $\text{task}_{i}$ is a categorical variable (in the regression this ends up being a set of dummy variables for each category so we denote $\hat{\beta}$ to emphasize this). Note that for SQuAD 2.0, we use a separate regression without the task variable since it is a single-task leaderboard.

We achieve an $R^{2}=0.69$ which is not a perfect fit, but still provides a prior on likely effect size. Similarly, we achieve an $R^{2}=0.67$ when fitting a regression to SOTA improvements on the SQuAD 2.0 leaderboard (selected as being a significant improvement in time-ordered submissions).

See Table 7 and Table 8 for regression coefficients and model fits. Figure 13 shows the per-task distribution of effect sizes against baseline accuracies in GLUE papers for SOTA improvements. Figure 12 shows the effect size distribution as a histogram.

E.2.3 Caveats for Regression-based Approach

Fitting a regression to predict overlap between a baseline and a new model has a good linear fit. However, this may not be the case for every dataset. Additionally, predicting effect sizes via a linear fit is not a perfect prior. The measurements of power in this case are meant to simulate estimating power before running evaluation on a test set, as running power analysis using only the observed effect may lead to the issues of post-hoc power estimation.

E.3 No Prior Approach (Lachenbruch, 1992)

What do you do if there is no prior data available (as in a new task) and so you cannot make assumptions about discordant pairs or odds ratio? Lachenbruch (1992) discusses this exact problem in the context of clinical trials, and proposes an alternative method based on the work of Connett et al. (1987) which allows you to make assumptions about potential marginal probabilities, providing a midpoint value, as well as an upper and lower bound. We use an implementation of this from: https://rdrr.io/rforge/biostatUZH/man/sampleSizeMcNemar.html and solve for the expected accuracy minimum given a fixed dataset sample size and baseline accuracy for each of the lower bound, midpoint, and upper bound. In practice, we find the Lachenbruch (1992) prior to be very close to the values we obtain from the above regression (see Table 9). Importantly this method requires no assumptions and is meant to give an idea for whether it is worth pursuing a study for the given size of the test set.

E.4 Extended Results

Table 9 contains additional MDE estimates using a two-sample proportion test as in Appendix E.5 and the Lachenbruch (1992) methodology. We also provide the standard errors and $n$ for each average effect size, the OLS regression predicting the next effect size for a new SOTA $\widehat{\Delta}$, and the current difference from SOTA and next on the leaderboard. We note that MDE calculations are roughly similar except for the upper and lower bounds provided in the Lachenbruch (1992) calculation. We also note that predicted SOTA results are far lower than past averages since the average includes early large results like those of Devlin et al. (2019). We can see that in some cases the predicted effect size is even smaller than the lowest bound MDE and we may wish to consider the usefulness of further comparisons on individual datasets in such cases.

E.5 Calculating Power or Sample Size with Binomial Test

If we assume that samples are unpaired – the new model and baseline evaluation samples are drawn from the same data distribution but aren’t necessarily the same samples – we can use a binomial test for significance.

In this case, we assume that we have two models and each draw brings a 1 if the model is correct or 0 if incorrect. We would like to use the two-sample proportion test, and have two binomial distributions with $p_{1}$ and $p_{2}$ as the mean probabilities. Our null hypothesis is $H_{0}:p_{1}=p_{2}$. Our alternative hypothesis (two-sided) is $H_{1}:p_{1}\neq p_{2}$. Note, in R we can use the function power.prop.test() to calculate power, the MDE, or the sample size of the tests. See also a tutorial here: https://imai.fas.harvard.edu/teaching/files/Handout9.pdf.
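For readers working outside R, a comparable calculation can be sketched in Python with the standard normal approximation for a two-sided two-sample proportion test. This is an illustrative approximation under the assumption of equal group sizes, not a port of power.prop.test(), and it ignores the negligible chance of rejecting in the wrong direction. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-sample proportion test.

    n is the number of samples per group (assumed equal for both models).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    sd_null = sqrt(2 * p_bar * (1 - p_bar))        # pooled, under H0
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # unpooled, under H1
    return norm.cdf((sqrt(n) * abs(p1 - p2) - z_crit * sd_null) / sd_alt)


def samples_needed(p1, p2, power=0.8, alpha=0.05, max_n=10_000_000):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while two_proportion_power(p1, p2, n, alpha) < power:
        n += 1
        if n > max_n:
            raise ValueError("target power not reached within max_n samples")
    return n
```

A call such as `two_proportion_power(0.90, 0.92, 500)` gives the approximate power for detecting a two-point accuracy gap with 500 instances per model, and `samples_needed(0.90, 0.92)` inverts the calculation to find the test set size required for 80% power.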

Appendix F Additional Metrics

In this appendix, we provide guidance on how we might apply power analysis to metrics beyond what is covered in the main paper.

While accuracy is the most commonly used metric in the GLUE benchmark, other tasks make use of other metrics such as F1 and Matthews correlation. F1 is particularly relevant in cases of binary classification where there is strong class imbalance, such that even the baseline of predicting the most common class will achieve high accuracy.

If we have good prior information, we can use an approach akin to that recommended for accuracy, but replacing McNemar’s test with a randomization test (as used for machine translation, see §4 in main paper). In particular, given an evaluation on paired data (as is the case for all benchmark datasets), one can test for a significant difference between models in terms of F1 (or any other metric) using a randomization test. That is, on each iteration, we randomize the assignment of which model each prediction came from for every instance with probability 0.5, and compute the resulting overall difference in F1. Repeating this thousands of times gives us the null distribution, and we can then check to see whether the observed difference in F1 is in the tails of this distribution, which can thereby be converted into a $p$-value (see Dror et al. (2018) for more details).
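A paired randomization test of this kind can be sketched as follows. The example assumes binary labels with a designated positive class and uses the common convention of adding one to both numerator and denominator when converting counts to a $p$-value; the function names and defaults are illustrative.

```python
import random


def f1_score(labels, preds, positive=1):
    """F1 for the positive class; returns 0.0 if there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if p == positive and y == positive)
    fp = sum(1 for y, p in zip(labels, preds) if p == positive and y != positive)
    fn = sum(1 for y, p in zip(labels, preds) if p != positive and y == positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def randomization_test_f1(labels, preds_a, preds_b, n_trials=2000, seed=0):
    """Two-sided paired randomization test for a difference in F1."""
    rng = random.Random(seed)
    observed = abs(f1_score(labels, preds_a) - f1_score(labels, preds_b))
    at_least_as_extreme = 0
    for _ in range(n_trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(preds_a, preds_b):
            # Swap which model each prediction is attributed to with prob. 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(f1_score(labels, shuffled_a) - f1_score(labels, shuffled_b))
        at_least_as_extreme += diff >= observed - 1e-12
    return (at_least_as_extreme + 1) / (n_trials + 1)
```

Because the metric is recomputed on the full shuffled prediction sets, the same loop works for any corpus-level metric by swapping out `f1_score`.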

Because F1 (and related metrics) cannot be represented as a simple sum over individual instances, in order to completely specify a hypothetical data generating process, we need to assume values for all cells in the confusion matrix, per class. That is, for each class we would need to assume values for the cells as shown in Table 11, where the relevant distribution of predictions is for the instances with the corresponding label, and the values for each class sum to one.

In addition, we need to assume the true distribution of labels in the data distribution of interest, $p(c)$ for $c$ in $\{1,\ldots,C\}$ . Given these assumptions, we could then simulate an arbitrary number of datasets from this process. For each instance, we would first sample a true label $(c)$ , and then sample the model predictions from the corresponding contingency table. For each simulated dataset, we could then apply the randomization test (using thousands of randomizations). By repeating this process many times, we can directly estimate power for the corresponding assumptions and sample size $n$ .
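The sampling step described here might look like the sketch below for a binary task. The layout of the per-class tables is an assumption made for illustration (a mapping from a pair of correctness indicators for the two models to a probability), as are the function and argument names; the actual layout in Table 11 may differ.

```python
import random


def simulate_dataset(n, label_probs, tables, rng):
    """Sample (labels, preds_a, preds_b) for a binary task with labels 0 and 1.

    label_probs: dict mapping each label c to p(c).
    tables: dict mapping each label c to a dict
        {(a_correct, b_correct): probability} whose values sum to one.
    """
    classes = list(label_probs)
    class_weights = [label_probs[c] for c in classes]
    labels, preds_a, preds_b = [], [], []
    for _ in range(n):
        # First sample the true label, then the joint outcome for the two models.
        c = rng.choices(classes, weights=class_weights)[0]
        cells = list(tables[c])
        cell_weights = [tables[c][cell] for cell in cells]
        a_ok, b_ok = rng.choices(cells, weights=cell_weights)[0]
        labels.append(c)
        preds_a.append(c if a_ok else 1 - c)
        preds_b.append(c if b_ok else 1 - c)
    return labels, preds_a, preds_b
```

Each simulated dataset can then be passed to a randomization test, and the share of datasets with a significant result gives the power estimate for the assumed tables, label distribution, and sample size.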

This process is not particularly efficient, but can still be run relatively quickly on a laptop. The more difficult part is choosing good values for the necessary probabilities. However, such an approach can still be used to test for how sensitive power is to variations in assumptions. It is also possible to make simplifying assumptions, such as that the rate of false positives and false negatives will be the same across classes, or to estimate some parameters from training data, such as the underlying distribution of labels. The same technique can easily be extended to other metrics that depend on the contingency table, such as Matthews correlation.

Appendix G Additional Details for the BLEU Scores Power Analysis

In this section, we provide further details for the machine translation (MT) data generation procedure as well as an analysis of how power varies for a range of values of $P_{0}$ and $b_{0}$ , the parameters estimated from the empirical observations.

Recall that using the randomization test to determine whether two MT systems are statistically different gives rise to the null distribution of differences in BLEU. (The bootstrap is another valid approach to testing for differences between models (Koehn, 2004; Graham et al., 2014; Dror et al., 2018), though note the concerns highlighted by Riezler and Maxwell (2005).) If we had access to large amounts of parallel text, we could instead sample many subsets of real sentences and evaluate the difference between models on those subsets, which would allow us to characterize the mean and variance of the difference in model performance. Such estimates could then be used to estimate power directly. Because we do not have access to such data, however, we instead rely on the randomization approach, in which we run several thousand trials where the paired output translations for a subset of the test set samples are swapped. In order to estimate power, we would like to be able to generate many datasets from a data generating procedure, which we can parameterize by various parameters, such as the difference between models. Rather than generating raw text, however, and computing BLEU scores on that, we instead attempt to generate only the data necessary for the randomization test. How can we do this?

In our case, the answer to this question lies in establishing a relationship between individual samples and the permuted set within each trial of the randomization test. This relationship is as follows: the sum of individual changes to the difference in BLEU, from swapping single samples at a time, closely approximates the net change to the difference in BLEU, from swapping those samples all at once. (Note that this does not directly solve the problem of computing BLEU at the sentence level (Chen and Cherry, 2014), as it is still mimicking the process of evaluating BLEU on a corpus.) Let $S$ be the set of test set samples swapped during a single trial of the randomization test and $R_{B}(S)$ be the difference in BLEU between the paired outputs after swapping the examples in $S$. $\Delta_{B}$ is the original difference in BLEU and $\delta_{i}$ is the change to the difference in BLEU from swapping test sample $i$ and leaving all other samples unswapped. Then, we find that,

$R_{B}(S)\approx\Delta_{B}+\sum_{i\in S}\delta_{i}$

This relationship is illustrated in Figure 15: Figure 15(a) shows the difference between two models evaluated on the 2019 test set, and Figure 15(b) shows the difference between a different pair of models evaluated on the 2018 test set. We found the same relationship is true for the 2017 and 2016 test sets, as well.

Now that we have established a relationship to closely approximate the outcome of each randomization trial, all that remains is to define a distribution from which the individual changes to the difference in BLEU can be sampled. This distribution is a mixture of a delta distribution at zero and a Laplace distribution. The delta distribution accounts for the proportion of samples ( $P_{0}$ ) such that swapping any of them individually results in no change to the difference in BLEU, i.e. the effect is zero. For the remaining samples, we fit a Laplace distribution, as shown in Figure 16. This Laplace is parametrized by two parameters: location ( $\mu$ ) and scale ( $b$ ). By fitting this mixture to the individual effects computed from evaluating BLEU differences on many pairs of models, we discover that the scale parameter is inversely proportional to the size of the dataset. Thus, we report an overall $b_{0}$ value for each dataset, such that $b_{0}=b_{k}\cdot n_{k}$ , where $b_{k}$ is the Laplace scale parameter obtained from dataset $k$ containing $n_{k}$ samples.

For generating synthetic data, we need to specify $\mu$ and $b$ , as well as $P_{0}$ . However, because we want the effect of swapping half the non-zero samples from this distribution to equal the difference in BLEU between models, we only use the above fits to estimate $b_{0}$ . We thus complete the generative process by assuming values for $\Delta_{B}$ , $n$ , $P_{0}$ , $b_{0}$ , and setting $\mu=-2\cdot\Delta_{B}/(n\cdot(1-P_{0}))$ such that the average effect of a random subset of $n/2$ instances is equal to $-\Delta_{B}$ . Table 3 in the main paper shows a range of observed values for $P_{0}$ and $b_{0}$ .
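As a minimal sketch of this generative process (not the code released with the paper), the following Python function draws the individual changes $\delta_{i}$ from the mixture described above. The function name, the use of NumPy, and the conversion $b=b_{0}/n$ applied inside the function are illustrative assumptions that follow from the definitions in this section.

```python
import numpy as np

def sample_individual_effects(delta_b, n, p0, b0, rng):
    """Draw n per-sentence changes (delta_i) to the difference in BLEU."""
    # The Laplace scale shrinks with dataset size: b = b0 / n.
    b = b0 / n
    # Location chosen so that swapping a random half of the non-zero
    # sentences shifts the difference in BLEU by -delta_b on average.
    mu = -2.0 * delta_b / (n * (1.0 - p0))
    effects = rng.laplace(loc=mu, scale=b, size=n)
    # With probability p0 a sentence has exactly zero effect (delta component).
    effects[rng.random(n) < p0] = 0.0
    return effects

rng = np.random.default_rng(0)
deltas = sample_individual_effects(delta_b=1.0, n=2000, p0=0.2, b0=26.0, rng=rng)
# Swapping a random half of the sentences should shift the difference by
# roughly -delta_b, up to sampling noise.
print(0.5 * deltas.sum())
```

The example values ( $\Delta_{B}=1$ , $n=2000$ , $P_{0}=0.2$ , $b_{0}=26$ ) are chosen only for illustration.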

G.2 Variation in Power Estimates for a Range of Parameter Values

Now that we have defined the data generation procedure, and have estimates for the two parameters, $P_{0}$ and $b_{0}$ , that are needed to simulate datasets, we can estimate power for a range of values for sample size $n$ and difference in BLEU $\Delta_{B}$ , and see how these estimates vary as $P_{0}$ and $b_{0}$ change. To provide a concrete example, suppose that we have two machine translation models that we expect will differ by $\Delta_{B}=$ 1 BLEU point. For a dataset of $n=$ 2,000 sentences, we assume that the models will perform equally for $P_{0}=0.2$ , i.e. 20% of sentences, and will assume a base scale parameter of $b_{0}=26$ . To compute power, we would follow the process in Algorithm 1, with the following modifications. On each iteration, we would draw individual changes to the difference in BLEU from the distribution specified above, with $P_{0}=0.2$ , $\Delta_{B}=1$ , $b_{0}=26$ , and $n=2000$ . For each such draw, we would apply the randomization test to compute a null distribution, using the sum of individual amounts as the total effect of flipping a random subset of pairs. Based on the null distribution, we compute if the difference is significant for this trial. Repeating this many times and observing the proportion of trials that are found to be significant gives us the approximate power.
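The procedure just described can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation used for our experiments: the function names are hypothetical, each pair is assumed to be swapped independently with probability 0.5, and the p-value is assumed to be two-sided with add-one smoothing.

```python
import numpy as np

def sample_effects(delta_b, n, p0, b0, rng):
    # Mixture of a point mass at zero (probability p0) and a Laplace(mu, b0 / n).
    mu = -2.0 * delta_b / (n * (1.0 - p0))
    effects = rng.laplace(mu, b0 / n, size=n)
    effects[rng.random(n) < p0] = 0.0
    return effects

def randomization_p_value(delta_b, effects, n_perm, rng):
    # Each trial swaps every pair independently with probability 0.5 (assumption).
    swapped = rng.random((n_perm, effects.size)) < 0.5
    # Additive approximation: R_B(S) ~= delta_b + sum of delta_i over swapped pairs.
    null_diffs = delta_b + swapped @ effects
    # Two-sided p-value with add-one smoothing (assumed convention).
    extreme = np.sum(np.abs(null_diffs) >= abs(delta_b))
    return (extreme + 1) / (n_perm + 1)

def estimate_power(delta_b, n, p0, b0, n_sims=200, n_perm=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        effects = sample_effects(delta_b, n, p0, b0, rng)
        if randomization_p_value(delta_b, effects, n_perm, rng) < alpha:
            hits += 1
    # Proportion of simulated experiments found to be significant.
    return hits / n_sims

print(estimate_power(delta_b=1.0, n=2000, p0=0.2, b0=26.0))
```

Increasing `n_sims` and `n_perm` reduces Monte Carlo noise in the estimate at the cost of run time.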

Figure 17 shows power for a range of values for $\Delta_{B}$ , $n$ , $P_{0}$ and $b_{0}$ . When $P_{0}$ is low, as is true for the observed data in Table 3, effect sizes and sample sizes need to be larger in order for an experiment to be well-powered. But as $P_{0}$ gets higher, a given effect size can be detected by a smaller sample size. On the other hand, as $b_{0}$ increases and consequently the scale parameter $b$ for the Laplace grows, even large effect sizes cannot be detected by test sets containing 5,000 samples.

Appendix H Details of Human Evaluation Section

To assess the state of statistical power in a typical NLP study using human evaluation, we sampled papers from the main EMNLP 2019 conference that contained the phrase “human eval”. This first pass returned 117 papers, of which 86 had relevant human evaluations (in which models were compared), with the remainder either referencing human evaluation, or containing some other type of evaluation, such as comparing the agreement between automated metrics and human performance. Because some papers had more than one such evaluation, we had 97 experiments for analysis. Of these, 51 were Likert experiments (as discussed in the main text), 38 were some form of direct model comparison, and 8 were other.

Significance testing was rare and was reported, in some form, in only 24% of experiments. Bolding or starring the best results in a table was more common, occurring in 63% of human rating experiments in our set. Whether bolding a result implies that the authors are claiming a meaningful difference is not always clear. We did find a single case of authors performing a power analysis to estimate sample size among the papers we surveyed (Garbacea et al., 2019). However, because that paper did not involve a comparison of models to a baseline, it was not included in our analysis. In addition, we note that few details were provided, such that we were unable to ascertain precisely how the power analysis was done.

Because we chose to focus on ordinal ratings, we further annotated those in order to record the mean ratings and experimental characteristics (number of annotators, number of items, number of annotators per item), as well as all differences for all metrics between the model being proposed and the best performing baseline evaluated in the paper, as discussed in the main text.

H.2 Human evaluation datasets

For our analyses, we make use of the following datasets:

From Hashimoto et al. (2019) we use the evaluation data for Reddit, language modeling, and summarization. The data is available at https://worksheets.codalab.org/worksheets/0x88644b5ee189402eb19d39d721d1005c

From Dathathri et al. (2020) we use the available ratings. The data is available at https://github.com/uber-research/PPLM

For WMT19 (http://statmt.org/wmt19/translation-task.html), the data is available at https://www.computing.dcu.ie/~ygraham/newstest2019-humaneval.tar.gz

For Holtzman et al. (2020), we obtain the human evaluation data directly from the authors.

H.3 Linear Mixed Effect Models

To assess power in the human ratings framework, we used linear mixed effect models with random intercepts and slopes for worker and item, as in Barr et al. (2013). Following best practices, we use the following structure, where $w$ is a particular worker and $i$ is a particular item. There are seven parameters, corresponding to the parameters needed for running a power analysis: fixed effects $\beta_{0}$ (the intercept) and $\beta_{1}$ (the model effect), and variance parameters for the worker intercept ( $\sigma_{0w}$ ), the item intercept ( $\sigma_{0i}$ ) and their respective slope variance parameters ( $\sigma_{1w}$ and $\sigma_{1i}$ ). There is also a variance parameter for the overall error ( $\sigma_{wi}$ ). We transform the Likert ratings to be on a scale and treat them as normally distributed (which we note is an imperfect assumption). We give fit parameters for these values, on a few datasets, in Tables 13, 14, and 15.
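For concreteness, one way to write a model with this structure is sketched below. The notation is an illustrative assumption rather than a reproduction of the original equation: $y_{wi}$ is the rating given by worker $w$ to item $i$ , $x\in\{0,1\}$ indicates which model produced the output being rated, $u$ and $v$ denote worker and item random effects, and each $\sigma$ is treated as a standard deviation. Consistent with the omission of correlation parameters, the random intercepts and slopes are drawn independently.

```latex
\begin{aligned}
y_{wi} &= \beta_{0} + u_{0w} + v_{0i} + (\beta_{1} + u_{1w} + v_{1i})\,x + \epsilon_{wi}, \\
u_{0w} &\sim \mathcal{N}(0, \sigma_{0w}^{2}), \quad u_{1w} \sim \mathcal{N}(0, \sigma_{1w}^{2}), \\
v_{0i} &\sim \mathcal{N}(0, \sigma_{0i}^{2}), \quad v_{1i} \sim \mathcal{N}(0, \sigma_{1i}^{2}), \\
\epsilon_{wi} &\sim \mathcal{N}(0, \sigma_{wi}^{2}).
\end{aligned}
```

Under this reading, $\beta_{1}$ is the average difference in rating between the two models, which is the effect whose detectability the power analysis targets.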

For simplicity, and to avoid convergence issues, we do not include a correlation parameter in the random effect structure.

To assess power, we use two possible variance settings derived from the model fits (“high variance” and “low variance” settings, in the main text) and show these in Table 16. We systematically vary the number of annotators (always assuming each annotator annotates each item, which is not always true in typical experiments), the number of items, and the effect size. We note that simulations can be customized to the planned analysis, including aspects such as how many items will be annotated by each annotator.
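The data-generating step of such a simulation might look like the following Python sketch, which covers only the fully crossed case in which every annotator rates every item under both models. The function name, argument names, and parameter values are hypothetical and are not taken from Table 16; fitting the mixed effects model to each simulated dataset is not shown.

```python
import numpy as np

def simulate_ratings(n_workers, n_items, beta0, beta1,
                     sd_w0, sd_i0, sd_w1, sd_i1, sd_resid, rng):
    """Simulate ratings for a fully crossed design with two models."""
    w0 = rng.normal(0.0, sd_w0, n_workers)  # worker random intercepts
    w1 = rng.normal(0.0, sd_w1, n_workers)  # worker random slopes
    i0 = rng.normal(0.0, sd_i0, n_items)    # item random intercepts
    i1 = rng.normal(0.0, sd_i1, n_items)    # item random slopes
    ratings = []
    for x in (0, 1):  # x = 0: baseline model, x = 1: proposed model
        mean = (beta0 + w0[:, None] + i0[None, :]
                + x * (beta1 + w1[:, None] + i1[None, :]))
        noise = rng.normal(0.0, sd_resid, (n_workers, n_items))
        ratings.append(mean + noise)
    return np.stack(ratings)  # shape: (2, n_workers, n_items)

rng = np.random.default_rng(0)
# Hypothetical parameter values, for illustration only.
ratings = simulate_ratings(n_workers=50, n_items=100, beta0=0.5, beta1=0.1,
                           sd_w0=0.05, sd_i0=0.05, sd_w1=0.02, sd_i1=0.02,
                           sd_resid=0.1, rng=rng)
print(ratings[1].mean() - ratings[0].mean())  # raw estimate of the model effect
```

A design in which each annotator rates only a subset of items could be simulated by masking entries of the returned array before fitting.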

To compute power, we use each setting of the parameters to simulate 200 experiments and compute the proportion that detect a significant positive effect ( $t>1.96$ ). Significant effects in the opposite direction ( $t<-1.96$ ) do not count as detections. Code for these model fits and simulations is included with the online materials. However, we note that these should be used as a starting point, rather than being blindly copied, as details may differ in each experimental setting.

H.4 Head to head human evaluations

Another commonly used form of human evaluation is head to head comparison, where raters are shown a pair of outputs (one from each model), and asked to choose which they prefer, sometimes with “neither” as a third option. Head to head comparisons offer some advantages over ratings-based approaches (Yannakakis and Martínez, 2015; van der Lee et al., 2019), but do not scale as well when comparing many models.

As with ordinal judgements, there are multiple ways of analyzing such data. If we treat annotator judgements as independent and identically distributed (such as if we only collect one judgement from each annotator), we could model this simply in terms of the underlying probabilities that a random annotator will prefer each model (as in the opening example in the main paper). In that case, running a power analysis would be as simple as assuming values for the underlying probabilities of each category (win, lose, draw), as usual based on pilot data or prior assumptions, and simulating many draws from that distribution, checking in each sample to see if there is a statistically significant difference between win and lose.
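A minimal sketch of this simple case is given below. The choice of test is an assumption made for illustration: we use an exact two-sided binomial (sign) test on wins versus losses with draws excluded, and the function names and probabilities are hypothetical.

```python
import math
import numpy as np

def sign_test_p_value(wins, losses):
    """Exact two-sided binomial test of wins vs. losses under p = 0.5."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def head_to_head_power(p_win, p_lose, n_judgements,
                       n_sims=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    p_draw = max(0.0, 1.0 - p_win - p_lose)
    # Each row is one simulated experiment: counts of (win, lose, draw).
    counts = rng.multinomial(n_judgements, [p_win, p_lose, p_draw], size=n_sims)
    hits = sum(sign_test_p_value(int(w), int(l)) < alpha for w, l, _ in counts)
    return hits / n_sims

# Hypothetical preferences: 45% prefer A, 35% prefer B, 20% have no preference.
print(head_to_head_power(p_win=0.45, p_lose=0.35, n_judgements=200))
```

Here a significant result in either direction counts as a detection; a one-sided criterion could be substituted if only improvements are of interest.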

On the other hand, if multiple judgements will be collected from each annotator and/or for each pair of outputs, then it makes sense to use a richer model to account for all sources of variation, as described above (see §H.3). In particular, the mixed effects framework can be adopted, potentially by modeling the outcome as a logistic model (in the case of win or lose), with ties either excluded or split.