Improved Natural Language Generation via Loss Truncation

Daniel Kang, Tatsunori Hashimoto

Introduction

Learning to generate text is a core part of many NLP tasks, including summarization Nallapati et al. (2016), image captioning Lin et al. (2014), and story generation Roemmele (2016). A common challenge to all these tasks is that references from the training distribution are not unique and contain substantial variations in phrasing and content Wiseman et al. (2017); Dhingra et al. (2019). Learning to generate under a set of diverse and noisy references is challenging as some variations ought to be learned (e.g., paraphrasing) while others should not (e.g., hallucinated facts, ignoring prompts).

Existing training procedures for models seek to match the underlying distribution, leading to models that replicate and sometimes even amplify unwanted behaviors such as hallucination during generation. For example, neural language models often produce fluent text that is unfaithful to the source Tian et al. (2019); Wiseman et al. (2017); Lee et al. (2018). Existing work Fan et al. (2018); Holtzman et al. (2019) has primarily addressed these issues by constructing decoders that implicitly remove unwanted variation when generating (see §6 for a detailed discussion of task-specific losses).

In this work, we argue that this phenomenon is not model specific, but is due to the widely used log loss: we demonstrate that log loss is not robust to noisy and invalid references (§2). In particular, log loss requires that models assign probabilities to all potential test reference sequences. As a result, log loss is sensitive to outliers: invalid or noisy references with small probability mass can cause large changes in model behavior. We show that the brittleness of log loss, together with the noise in existing generation datasets, leads to low-quality and unfaithful generated text.

Instead of optimizing log loss, which has little correlation with model output quality Theis et al. (2016); Hashimoto et al. (2019); Gamon et al. (2005), recent work on diverse generation models has proposed optimizing for the distinguishability of samples from the model and the reference. Distinguishability provides a natural and appealing guarantee: samples that are indistinguishable from human-generated text will be as high quality as human-generated text. Furthermore, we show that optimizing for distinguishability is robust in the face of noisy and even invalid data. Despite its appeal, distinguishability has not been widely used due to statistical and computational challenges. For example, existing methods that directly optimize for distinguishability have yet to match even naive log loss based baselines Caccia et al. (2018).

We propose a modification to the log loss, loss truncation, that has the benefits of distinguishability while being efficient to train. Loss truncation is as efficient to train as log loss, nearly as robust as distinguishability, and provides distinguishability guarantees via an upper bound. It achieves these properties by modifying the standard log loss to adaptively remove examples with high log loss. We additionally extend loss truncation with a sequence-level rejection sampling scheme that generates higher quality sequences by restricting the outputs to be high probability sequences.

We show that loss truncation with direct and rejection sampling outperforms standard log loss based generation methods (beam search, full sampling, top- $k$ , and top- $p$ sampling) on distinguishability, as measured by the HUSE score Hashimoto et al. (2019). We additionally study the factual accuracy of a summarization system trained on loss truncation and show that our proposed approach produces summaries which improve upon all baselines (including beam searched models) and match references on factual accuracy.

Motivation and Problem Statement

To match the reference distribution, many existing models are trained to minimize the Kullback-Leibler (KL) divergence between the reference distribution and the model.
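
Written out, with $p_{\text{ref}}$ denoting the reference distribution and $p_{\theta}$ the model (notation assumed here for illustration), the standard decomposition of this divergence into two terms is

$$
\mathrm{KL}\left(p_{\text{ref}} \,\|\, p_{\theta}\right) = \underbrace{-\mathbb{E}_{p_{\text{ref}}}\left[\log p_{\theta}\right]}_{\text{log loss}} + \underbrace{\mathbb{E}_{p_{\text{ref}}}\left[\log p_{\text{ref}}\right]}_{\text{negentropy}} .
$$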

We refer to the first term of this divergence as the log loss of a model. The second term is commonly ignored as it is a constant with respect to the model. Minimizing the log loss has several practical benefits: 1) it is written as an expected loss (and is thus straightforward to optimize via stochastic gradient descent), 2) it factorizes across tokens in autoregressive modeling, and 3) it provides a guarantee on a model’s goodness of fit (Eq (1)).

Unfortunately, log loss also suffers from several drawbacks. It is known to have little correlation with a model’s sample quality and it can be brittle to invalid references in the training data.

Log loss is not robust to noise. The KL divergence has intuitively correct behavior when each input $x$ has a single correct reference $y$: it will maximize the probability of the single correct reference. However, log loss can be problematic when there are multiple references, some of which are invalid or difficult to model.

In particular, log loss is sensitive to invalid or noisy data because it requires that the model assign high probabilities to all potential references. Log loss is unbounded above: a model assigning zero probability to even a single reference makes the model incur an infinite overall loss.
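
As a small numerical illustration of this sensitivity (the probabilities below are made up for the sketch and do not come from any model), a single near-impossible reference among a thousand is enough to dominate the average log loss, and a reference with exactly zero probability makes it infinite.

```python
import math

def mean_log_loss(probs):
    """Average negative log-probability the model assigns to the references."""
    return sum(-math.log(p) if p > 0 else math.inf for p in probs) / len(probs)

# 999 references the model fits well (illustrative probability of 0.9 each).
well_fit = [0.9] * 999
# The same references plus one the model considers nearly impossible.
almost_zero = well_fit + [1e-300]
# The same references plus one the model assigns zero probability.
exactly_zero = well_fit + [0.0]

clean_loss = mean_log_loss(well_fit)
noisy_loss = mean_log_loss(almost_zero)
infinite_loss = mean_log_loss(exactly_zero)

print(f"clean: {clean_loss:.3f}, one outlier: {noisy_loss:.3f}, zero-mass: {infinite_loss}")
```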

We show a well-known example of this behavior with synthetic data. We consider fitting a single Gaussian to a mixture of two Gaussians in Figure 1. The reference distribution (blue) has a valid set of references at zero as well as variation that the model does not expect (e.g., invalid or noisy references) on the right. Minimizing the log loss results in a suboptimal model that is forced to span both groups. Furthermore, post-hoc processing of the model does not help, as even the most likely output under the log loss trained model (~3) has low probability under the reference distribution.
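
The effect can be reproduced in a few lines. The sketch below is not the setup behind Figure 1: the mixture weights, the component means, and the truncation fraction $c=0.1$ are assumptions chosen for illustration. It relies on the fact that, for a Gaussian with fixed variance, the log-loss minimizer of the mean is the sample mean.

```python
import random
import statistics

random.seed(0)

# Toy data (all values are illustrative assumptions): 90% valid references
# near 0 and 10% unexpected references near 10.
valid = [random.gauss(0.0, 1.0) for _ in range(900)]
invalid = [random.gauss(10.0, 1.0) for _ in range(100)]
data = valid + invalid

def fit_mean(points):
    """With fixed variance, the log-loss minimizer of a Gaussian mean is the sample mean."""
    return statistics.fmean(points)

def truncated_fit(points, c, rounds=5):
    """Refit repeatedly after dropping the c-fraction of points with the highest loss."""
    mean = fit_mean(points)  # start from the fit on all data
    keep = len(points) - int(c * len(points))
    for _ in range(rounds):
        # Gaussian log loss with fixed variance increases with squared distance,
        # so sorting by squared distance sorts by loss.
        by_loss = sorted(points, key=lambda y: (y - mean) ** 2)
        mean = fit_mean(by_loss[:keep])
    return mean

log_loss_mean = fit_mean(data)
truncated_mean = truncated_fit(data, c=0.1)
print(f"log loss fit: {log_loss_mean:.2f}, truncated fit: {truncated_mean:.2f}")
```

Under these assumptions the log-loss fit lands between the two groups, close to the overall mixture mean, while the truncated fit stays near the valid group at zero.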

In natural language generation, training sets can contain invalid or poor quality references. As such, these types of problems manifest themselves in tasks such as summarization (hallucinating facts), story generation (ignoring prompts and constraints), and captioning (ignoring parts of the image).

Much of the existing literature on faithful generation has focused on designing better models for valid references (via copying or attention constraints), but the example in Figure 1 shows that this alone may not be sufficient. The Gaussian ‘model’ in this case perfectly fits the mixture component at zero but is still brittle because it cannot simultaneously fit the other group of (invalid) samples. Resolving this will require either a model which is designed explicitly to capture invalid references or a loss function that can ignore them.

We show that low-probability reference sequences (e.g., Figure 1) are pervasive by examining the Gigaword summarization dataset Rush et al. (2017). We manually classified 300 titles into two categories: 1) requires hallucinating new facts and 2) directly entailed from the context. We show an example of a reference that requires hallucination in Figure 2. In this example, a model that assigns high probability to the new fact (Thursday) must also frequently hallucinate dates on other examples.

We show the fraction of examples in each category in Table 1. As shown, 35% of titles require hallucinating new facts. Others have found this phenomenon to be pervasive in other datasets Kryściński et al. (2019), including the CNN/DM dataset See et al. (2017).

Studying the log loss of these examples (computed from a standard language model; see §5 for details), we note that the average log loss of titles that require new facts is over 1.7$\times$ the average loss of the titles that are directly entailed (Table 1), and the high-loss examples are clearly dominated by examples which require hallucination (Figure 3). In fact, we find that over 80% of examples with greater than 40 log loss require some form of hallucination.

These statistics are similar to the toy example we presented earlier in Figure 1. A small but nontrivial fraction of invalid and unexpected data forces the model to incur high losses. Much like in the earlier example, we can see that a model which aims to have low log loss on this dataset must spend a substantial amount of effort learning to hallucinate.

Distinguishability. Given that large-scale data will inevitably contain annotation errors and noise, we might ask whether there are effective alternatives to the KL divergence for training models. The distinguishability of samples from a model compared to the reference is one such objective. Distinguishability has recently gained attention as a way to learn and evaluate models based on both sample quality and diversity Hashimoto et al. (2019); Zhou et al. (2019); Zellers et al. (2019); Gehrmann et al. (2019). We show that this objective also serves as a naturally robust alternative to the KL divergence for learning language models. Unfortunately, directly optimizing for distinguishability (e.g., via generative adversarial networks) is challenging Caccia et al. (2018) and we show this works poorly in practice (§5).

Distinguishability is defined as the error rate of an optimal classifier which seeks to distinguish samples from the model and the reference. We formally define this via a mixture in which a label $z\sim\textrm{Bernoulli}\left(\frac{1}{2}\right)$ determines whether a sample is drawn from the reference or from the model.

We can now define $L^{*}$ to be twice the optimal error in identifying samples from the model.

Our measure of distinguishability, the total variation (TV) distance, is a linear function of this error.
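
These definitions can be written out as follows, with $p_{\text{ref}}$ for the reference distribution, $p_{\theta}$ for the model, and $f$ ranging over classifiers mapping a pair $(x, y)$ to a guess for $z$ (the notation is an assumption made for illustration):

$$
y \mid x, z \sim \begin{cases} p_{\text{ref}}(y \mid x) & \text{if } z = 1 \\ p_{\theta}(y \mid x) & \text{if } z = 0 \end{cases}
$$

$$
L^{*} := 2 \inf_{f} \, \mathbb{P}\left[f(x, y) \neq z\right], \qquad \left| p_{\theta} - p_{\text{ref}} \right|_{TV} = 1 - L^{*} .
$$

Under this convention, a model identical to the reference gives $L^{*}=1$, since no classifier can do better than chance, and hence a TV distance of zero.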

Log loss as a surrogate for distinguishability. Distinguishability is both robust and provides sample quality guarantees, but is challenging to optimize Caccia et al. (2018). One approach to optimize for distinguishability is to find an appropriate surrogate loss which serves as an upper bound. This is analogous to the use of logistic or hinge losses as a way to optimize for classification accuracy. For log loss, Pinsker’s inequality Csiszar and Körner (2011) relates the KL divergence and distinguishability: the squared TV distance is at most half the KL divergence.
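
In the usual statement of the inequality, writing $p_{\text{ref}}$ for the reference distribution and $p_{\theta}$ for the model (assumed notation), the bound reads

$$
\left| p_{\theta} - p_{\text{ref}} \right|_{TV}^{2} \leq \frac{1}{2} \mathrm{KL}\left(p_{\text{ref}} \,\|\, p_{\theta}\right) .
$$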

This explains the empirical success of log loss in low-uncertainty situations, where KL is sufficiently small and this bound becomes tight.
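
Both relationships can be checked numerically on a small discrete example. The sketch below uses two made-up distributions over three outcomes and hypothetical helper names; it computes the TV distance directly, recovers it from the optimal classifier error, and compares it with the Pinsker bound.

```python
import math

def total_variation(p, q):
    """TV distance between two discrete distributions given as equal-length lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q); infinite if q assigns zero mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total

def twice_optimal_error(p, q):
    """L*: twice the error of the best classifier on an equal mixture of p and q."""
    # The best classifier guesses the likelier source for each outcome, so it
    # errs with probability 0.5 * min(p_i, q_i) on outcome i.
    return 2 * sum(0.5 * min(a, b) for a, b in zip(p, q))

p_ref = [0.5, 0.3, 0.2]     # illustrative reference distribution
p_model = [0.4, 0.4, 0.2]   # illustrative model distribution

tv = total_variation(p_ref, p_model)
tv_from_classifier = 1 - twice_optimal_error(p_ref, p_model)
pinsker_bound = math.sqrt(0.5 * kl_divergence(p_ref, p_model))

print(f"TV: {tv:.4f}, 1 - L*: {tv_from_classifier:.4f}, Pinsker bound: {pinsker_bound:.4f}")
```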

Our approach will be to modify the log loss slightly by truncating the distribution. This truncated loss will be as easy to optimize as log loss, while being more robust and providing a tighter variant of Pinsker’s inequality.

Loss Truncation

Intuition. We would like the model to ignore data that would force it to unnecessarily hallucinate at test time. Concretely, recall the toy example (Figure 1); there is a set of invalid references that force the model to be degenerate. If we could remove these invalid references by truncating the distribution, the resulting model would be high quality. We can show that this intuition is theoretically justified, and that truncating (i.e., removing) an appropriate $c$-fraction of the data provides tighter bounds on the distinguishability of the model.

Improved log losses for distinguishability. We will demonstrate that log loss with an appropriate $c$-fraction of the data removed provides guarantees on distinguishability. We will define the set of truncated distributions as the set of distributions with any $c$-fraction of data removed, $\mathcal{P}_{c,p} := \{q_{0} : p = (1-c)q_{0} + cq_{1} \text{ for some } q_{1}\}$.

A simple lemma shows that all elements in $\mathcal{P}_{c,p}$ are $c$-close to $p$ in TV (Appendix B).

Combining this lemma with the triangle inequality and Pinsker’s inequality yields a bound on distinguishability; see Appendix B for the proof. Namely, distinguishability is bounded by the log loss with respect to the truncated distribution and a small constant. Furthermore, this upper bound is valid for any $c$, although different $c$ will change the tightness of the bound and produce different models.
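
Stated in symbols, with $p_{\text{ref}}$ for the reference distribution, $p_{\theta}$ for the model, and $p_{t}$ for a truncated distribution (assumed notation; the constants are those obtained from the triangle-inequality argument in Appendix B), the lemma and the resulting bound take the form

$$
\sup_{p_{0} \in \mathcal{P}_{c,p}} \left| p_{0} - p \right|_{TV} \leq c,
\qquad
\left| p_{\text{ref}} - p_{\theta} \right|_{TV}^{2} \leq \frac{1}{2} \mathrm{KL}\left(p_{t} \,\|\, p_{\theta}\right) + 2c + c^{2}
\quad \text{for any } c \in [0,1],\; p_{t} \in \mathcal{P}_{c,p_{\text{ref}}} .
$$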

This truncated bound can be substantially tighter than Pinsker’s inequality. Consider for example a model that can perfectly capture a $(1-c)$ fraction of the data, but a $c$-fraction of the reference outputs cannot be generated by the model and receive probability zero. In this case, the distinguishability (TV) is $c$, the KL divergence is infinite, while our truncated bound is $\sqrt{c^{2}+2c}$. This suggests that appropriately truncating high-loss examples makes log loss robust and allows us to use log loss as a surrogate for distinguishability, even in the presence of invalid and noisy references.
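
This example can be worked through numerically. The sketch below assumes a two-outcome reference with $c=0.1$ (an arbitrary illustrative value), a model that places all its mass on the first outcome, and a truncated distribution that drops the second outcome; the helper names are hypothetical.

```python
import math

def total_variation(p, q):
    """TV distance between two discrete distributions given as equal-length lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q); infinite if q assigns zero mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total

c = 0.1                     # illustrative fraction of unmodelable references
p_ref = [1 - c, c]          # reference: (1 - c) valid mass, c invalid mass
p_model = [1.0, 0.0]        # model: cannot generate the invalid outcome
p_trunc = [1.0, 0.0]        # reference with the c-fraction removed

tv = total_variation(p_ref, p_model)  # equals c up to rounding
pinsker_bound = math.sqrt(0.5 * kl_divergence(p_ref, p_model))
truncated_bound = math.sqrt(0.5 * kl_divergence(p_trunc, p_model) + 2 * c + c ** 2)

print(f"TV: {tv:.3f}, Pinsker: {pinsker_bound}, truncated: {truncated_bound:.3f}")
```

The Pinsker bound is vacuous here because the KL divergence is infinite, whereas the truncated bound remains finite and sits above the true TV distance.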

The heuristic of dropping the examples with the highest log loss is straightforward to compute, provides an upper bound on distinguishability, and matches our earlier observation that high-loss examples are correlated with invalid examples we would like the model to ignore (see Table 1).

As an example of how our heuristic can improve estimation and tightness in bounds, consider the earlier toy example in Figure 1. In this example, we find the optimal mean for a single Gaussian with fixed variance which fits a mixture of two Gaussians. Figure 4 shows the objective function value implied by the TV loss, log loss (Pinsker’s bound), and our $c$-truncated bound as a function of the Gaussian mean. We find that log loss provides an upper bound on distinguishability (via Pinsker’s inequality) but is loose and results in a low-quality estimate. In contrast, $c$-truncation results in a nearly identical minimizer as directly minimizing TV.

Implementing Truncation

Our algorithm has three components at training time. First, it trains a model on all the data using standard hyperparameters, which we refer to as “hotstarting” the model. Second, it tracks a running estimate of the $1-c$ quantile of the losses during training. Third, it performs gradient updates on examples that are below the current $1-c$ quantile estimate. We present the pseudocode in Algorithm 1 and describe each step in detail below. Our code is available at https://github.com/ddkang/loss_dropper.

Hotstarting. First, our algorithm hotstarts the model (hotstart($M$) in Alg. 1) by training with the standard log loss. Hotstarting addresses two challenges in optimizing the truncated loss. First, losses are uninformative at the start of training so truncating examples based on these losses will result in dropping valid examples. We have empirically found that truncating after hotstarting primarily drops invalid references, which avoids this problem. Second, hotstarting allows the model to transfer information from the entire dataset to the clean $1-c$ fraction of the data. Examples that cause a model to hallucinate may still contain valid information about the fluency of a sentence, which hotstarting can capture. This is effectively pretraining our model on the entire data before learning to generate on the clean subset. We have found this procedure to be effective in practice.

Quantile estimation. Second, our algorithm keeps track of the $1-c$ quantile over the distribution of losses. For each new minibatch $B$, we update an online estimate of the $1-c$ quantile (estimateQuantile($M,B$) in Alg. 1). To estimate this quantile, our algorithm constructs a histogram over the last 10,000 examples seen during training and estimates the empirical $1-c$ quantile every 10,000 examples. (For datasets with fewer than 10,000 examples, we can perform this procedure over the entire dataset.)

Loss dropping. Third, our algorithm performs minibatch stochastic gradient descent while excluding examples that have losses above the current $1-c$ quantile estimate $q$ (truncatedUpdate($M,B,q$) in Alg. 1). Dropping can be accomplished in automatic differentiation packages (e.g., TensorFlow and PyTorch) by setting the loss on the given example to zero.
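
The quantile tracking and loss dropping steps might be sketched as follows. This is a simplified illustration and not the released implementation: `LossTruncator`, `mask`, and `truncated_batch_loss` are hypothetical names, the quantile is found by sorting a sliding window rather than via a histogram, and the model is assumed to have been hotstarted already.

```python
from collections import deque

class LossTruncator:
    """Hypothetical helper tracking a running (1 - c) quantile of per-example losses."""

    def __init__(self, c, window=10_000, recompute_every=10_000):
        self.c = c
        self.recent = deque(maxlen=window)    # most recent losses seen
        self.recompute_every = recompute_every
        self.seen_since_update = 0
        self.quantile = float("inf")          # keep everything until the first estimate

    def _estimate_quantile(self):
        ordered = sorted(self.recent)
        num_dropped = int(self.c * len(ordered))
        if num_dropped == 0:
            return float("inf")
        # Smallest loss among the c-fraction that should be dropped.
        return ordered[len(ordered) - num_dropped]

    def mask(self, batch_losses):
        """Return 1.0 for examples to train on and 0.0 for examples to drop."""
        self.recent.extend(batch_losses)
        self.seen_since_update += len(batch_losses)
        if self.seen_since_update >= self.recompute_every:
            self.quantile = self._estimate_quantile()
            self.seen_since_update = 0
        return [1.0 if loss < self.quantile else 0.0 for loss in batch_losses]

def truncated_batch_loss(batch_losses, mask):
    """Zero out dropped examples so they contribute nothing to the update."""
    return sum(loss * keep for loss, keep in zip(batch_losses, mask)) / len(batch_losses)

# Illustrative usage with made-up per-example losses and a tiny window.
truncator = LossTruncator(c=0.25, window=8, recompute_every=8)
losses = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 5.0, 9.0]
keep_mask = truncator.mask(losses)
batch_loss = truncated_batch_loss(losses, keep_mask)
```

In a framework with automatic differentiation, the same mask would multiply the per-example loss tensor before reduction, so that dropped examples contribute zero gradient.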

Generating High-Probability Samples

Thus far, our goal has been to robustly learn the underlying distribution. However, in some cases, a user may wish to only generate high confidence sequences, which will ideally correspond to high quality sequences.

To generate such samples, we propose sequence-level rejection sampling.

Recall that our truncation heuristic selects for the $1-c$ quantile of the distribution. For a user-defined level $\alpha$ , our rejection sampling scheme will aim to generate samples from the $1-c\cdot\alpha$ quantile.
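
One way to realize such a scheme, assumed here for illustration rather than taken from the description above, is to draw a pool of candidate sequences and return one uniformly at random from the $\alpha$-fraction with the lowest log loss. In the sketch below, `rejection_sample`, `sample_fn`, and `toy_sample` are hypothetical names, and the toy sampler stands in for a trained model.

```python
import random

def rejection_sample(sample_fn, alpha, num_candidates=100, rng=random):
    """Draw candidates and return one of the alpha-fraction with the lowest log loss.

    sample_fn is a hypothetical callable returning (sequence, log_loss) for a
    single sample from the model.
    """
    candidates = [sample_fn() for _ in range(num_candidates)]
    candidates.sort(key=lambda pair: pair[1])  # lowest log loss first
    num_kept = max(1, int(alpha * num_candidates))
    return rng.choice(candidates[:num_kept])

# Stand-in for a trained model: fake titles with random log losses.
toy_rng = random.Random(0)

def toy_sample():
    log_loss = toy_rng.uniform(0.0, 50.0)
    return (f"title with log loss {log_loss:.1f}", log_loss)

title, loss = rejection_sample(toy_sample, alpha=0.1, num_candidates=100, rng=toy_rng)
print(title)
```

Smaller values of $\alpha$ restrict the output to higher-probability sequences at the cost of drawing more candidates per accepted sample.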

We show that rejection sampling can outperform baselines in generating factual summaries (§5). We further show examples of selected and rejected samples in Appendix A.

Evaluation

Dataset and Task. We primarily evaluate loss truncation on abstractive summarization in the form of generating news headlines from an article. We selected this task to highlight that loss truncation can improve sample quality and factual accuracy, while also achieving the secondary goal of diversity for abstractive systems See et al. (2017); Kryściński et al. (2019).

We evaluated on the Gigaword summarization task Rush et al. (2017) as in Gehrmann et al. (2018). While there are other summarization datasets, we chose Gigaword for the following reasons. First, it is large enough that sample quality defects are not caused by a lack of data. Second, the dataset is structured so that neither model nor computation is the bottleneck in performance: the standard sequence-to-sequence models are competitive on the Gigaword dataset. Third, while the Gigaword dataset is known to have noise, this matches the behavior of existing annotation errors Beigman and Klebanov (2009); Klebanov and Beigman (2010) and uncertainty Kryściński et al. (2019).

To show that loss truncation is applicable beyond summarization, we also performed a preliminary evaluation of our approach on the E2E NLG task. In E2E, the goal is to generate restaurant reviews from meaning representations Dušek et al. (2019).

Model and Baselines. We used a standard LSTM architecture with global attention for summarization that has been used for the Gigaword summarization task in the past Gehrmann et al. (2018). The learning rate and hyperparameters are given in Appendix C. For the E2E task, we use a standard model with the exact settings as in Puzikov and Gurevych (2018).

For loss truncation on Gigaword, we used $c=0.6$ . We matched the total number of training steps when training via loss truncation (including the hotstart) and standard log loss. We sampled from the full model distribution for loss truncated models except when rejection sampling.

As baselines on Gigaword, we generate from the log loss trained language model using several decoders that have been reported to mitigate low-quality outputs such as beam search, top- $k$ sampling Fan et al. (2018), and top- $p$ sampling Holtzman et al. (2019). We also evaluate directly sampling from the probabilistic model in order to estimate overall distinguishability and understand the diversity-quality trade-offs of each model.

Finally, on Gigaword, we also compared against a recent generative adversarial network (GAN) model with a publicly available implementation Wang and Lee (2018).

Human-evaluation metrics. We evaluate whether loss truncation improves model distinguishability on summarization by measuring the HUSE estimator for TV Hashimoto et al. (2019). HUSE measures distinguishability by learning a classifier over the log-probabilities and human evaluation scores over both samples from the model and references. We also use HUSE to evaluate the quality-diversity tradeoffs of the models by estimating both HUSE-Q (which measures quality via human judgement) and HUSE-D (which measures diversity via statistical evaluation).

In order to assess whether this leads to improvements in the faithfulness of samples, we measure whether loss truncation reduces the number of factually inaccurate outputs from the model via a crowdsourced survey. We designed our prompt based on earlier factual accuracy human evaluation Novikova et al. (2017) and measured whether the original article contained all of the information given in the generated title.

We describe the crowd worker setup in Appendix D.

Automated metrics. While human evaluation is our primary metric of evaluation as it is considered gold-standard, we additionally evaluate on automated metrics to contextualize our human evaluation results. We measure ROUGE-L Lin and Hovy (2003) for summarization and BLEU score Papineni et al. (2002) for E2E.

Loss Truncation Outperforms Baselines on HUSE

Using the HUSE score to measure the TV distance, we assessed whether loss truncation successfully improved our model in terms of distinguishability compared to log loss. As shown in Table 2, loss truncation outperforms all baselines on HUSE score (including the original log loss model Full samp), suggesting the truncated model is a better language model than the log loss model as measured by distinguishability.

We find that loss truncation improves over the log loss by increasing the generation quality (HUSE-Q) by 12% without substantially lowering diversity (e.g., by memorizing examples from the training set). These results affirmatively answer an open question posed by Hashimoto et al. (2019) on whether it is possible to obtain models that improve the quality while maintaining overall distinguishability compared to log loss trained models. Post-hoc modification of the log loss model’s distribution by removing unlikely words using either top-$k$ or top-$p$ sampling results in substantial losses in HUSE due to losses in diversity.

We further considered matching the entropy of the loss truncation model with top-$k=100$ and top-$p=0.9$ (Appendix C). At a fixed entropy, loss truncation can outperform these decoders on HUSE by up to 26%.

Comparing models with high sample quality, loss truncation with rejection sampling improves upon all baselines (including beam search) in terms of raw human quality evaluation (HUSE-Q), and we see that the Pareto frontier of truncation and rejection sampling (which can be achieved via ensembling) dominates the baselines on both quality and diversity (Figure 5). Rejection sampling decreases overall HUSE score because it is designed to only return high quality samples (i.e., high HUSE-Q): this comes at the cost of reduced diversity, so overall HUSE score suffers.

The results amongst our baselines recapitulate known results for the quality-diversity tradeoffs of existing methods. Beam search has high sample quality, but low diversity; top- $k$ and top- $p$ samplers provide diversity gains over beam search; and GANs generally underperform well-tuned log loss based models on both diversity and quality.

Loss Truncation with Rejection Sampling Produces High Quality Outputs

We now ask whether improvements in distinguishability (as measured by HUSE) for the loss truncation model translate to practical improvements in sample quality, such as the factual accuracy of generated outputs in summarization. We evaluate this through a crowdsourced study on factual accuracy.

Since we are interested in studying whether our model can produce high quality samples, we used rejection sampling with $\alpha=0.1$ to obtain high-quality samples from the model. We compare this to the log loss model with baseline decoders. For the top- $p$ and top- $k$ sampling decoders that have quality-diversity tradeoffs, we select $k$ and $p$ such that the entropy of the sampling distribution matches our rejection sampling approach (see Appendix C for details).

To measure factual accuracy, we asked crowd workers how much information in the generated titles was contained in the article in a similar fashion to Novikova et al. (2017). Table 3 shows the average factual accuracy rating for each model. We find that rejection sampling outperforms all baselines, including the current gold standard of beam search, and matches the human reference level of factual accuracy.

Although it may seem surprising that loss truncation and rejection sampling together can achieve the same factual accuracy score as humans, recall that over 34% of the dataset consists of titles which have facts that are not contained in the article. The loss truncation approach biases the model towards learning only the easily predicted (and likely factually accurate) titles.

Loss Truncation Produces Diverse Outputs

Finally, one of the benefits of optimizing for distinguishability is that it naturally optimizes for both diversity and quality. Manually examining outputs from the models, we find that directly sampling from the loss truncated model often produces high quality and diverse outputs. We show examples of generated outputs for baselines and loss truncation in Table 4. Loss truncation uses different phrasings (‘at least # killed’, and ‘floods sweep’) while top- $k$ follows a nearly templated pattern with a few changes to the words which appear. Top- $p$ and direct sampling both have diverse phrasings, but also hallucinate facts (‘earthquake’ in sampling and ‘torrential rains’ in top- $p$ sampling).

Loss Truncation can Outperform on Automated Metrics

While our primary evaluation metrics are human evaluations (HUSE and factuality), we additionally investigate automated metrics to further contextualize our results. For summarization we use ROUGE-L, and for E2E we use BLEU score.

For summarization, the ROUGE-L scores for loss truncation and entropy-matched top- $k$ and top- $p$ decoding were 23.2, 22.8, and 22.8 respectively. While loss truncation does not substantially improve ROUGE-L, we see that it still outperforms baselines. We do not expect reference-based evaluations to fully capture the benefits of loss truncation, as these metrics encourage the models to fully imitate the data distribution – including invalid and hallucinated examples.

For E2E, the BLEU scores for loss truncation and the baseline were 0.72 and 0.64 respectively. We confirmed that the baseline model for the E2E task achieves a similar score as reported by Balakrishnan et al. (2019). Perhaps surprisingly, improving BLEU score to 0.72 almost closes the gap to using complex tree-structured semantic representations, which achieves a BLEU score of 0.74 Balakrishnan et al. (2019).

We further show that loss truncation is not sensitive to the hyperparameter $c$ on automated metrics in Appendix E.1 and provide a preliminary investigation of combining loss truncation and alternative decoders in Appendix E.2.

Related Work

Decoder-based diversity. Researchers have proposed a variety of models for text generation Radford et al. (2019); Keskar et al. (2019); Sutskever et al. (2014). These models generate text using decoding methods such as beam search. While beam search is generally thought of as the gold standard Tillmann and Ney (2003), it can produce generic and repetitive outputs Holtzman et al. (2019). To achieve diversity, top-$k$ Fan et al. (2018) and top-$p$ Holtzman et al. (2019) sampling stochastically decode the outputs after restricting the output space to avoid low-quality outputs.

While these techniques can improve generation quality, they rely on models trained via log loss, which we show can result in undesired behavior that cannot be fixed post-hoc. Our work is complementary to existing work on decoders by proposing a loss that can improve the probabilistic models which these decoders operate on.

Loss modifications. Prior work has identified specific issues in generative models, such as repetitiveness, and proposed loss modifications to address these specific issues in the context of long text generation Welleck et al. (2019); Holtzman et al. (2018). In contrast, we identify an issue with the widely used log loss, and propose loss truncation, which does not require a task- and issue-specific modification. Many of the penalties and decoding techniques proposed in these earlier works can be combined with truncated log loss to obtain models that are more robust to noisy references.

Contemporaneous with our work, Tian et al. (2019) propose an attention weight approach to improving generation faithfulness via decoder and loss modifications. Our work complements this by providing a conceptual basis for improving faithfulness by ignoring examples (i.e., optimizing distinguishability), and providing a simple and general loss. We consider complex, model dependent loss truncation methods for optimizing distinguishability to be exciting future work.

Other generation methods optimize for task-specific losses Och (2003); Shen et al. (2015). Task-specific losses are not known in many cases and thus we require an effective task-agnostic loss, e.g., log loss or TV. We show that TV acts as a useful task-agnostic goodness of fit measure, and we provide an improved alternative to log loss.

GANs. GANs have been proposed to learn models that minimize distinguishability Li et al. (2017); Rajeswar et al. (2017); Dai et al. (2017). While GANs have been successful in generating images Goodfellow et al. (2014); Brock et al. (2018), GANs remain challenging to optimize for text due to the discrete nature of text. Our findings match earlier reports that GANs underperform log loss trained sequence-to-sequence models Caccia et al. (2018). In this work, we show that better training methods for distinguishability can arise from modifying the standard log loss via truncation.

Robust learning. Robust learning is the study of learning in the face of outliers Tukey (1960); Donoho (1982); Huber (1992). Our work is related to the $\epsilon$ -contamination model, in which an $\epsilon$ fraction of the data has been modified, potentially by an adversary Diakonikolas et al. (2018). Our work shows that robust learning under log loss can result in improved empirical performance and bounds on distinguishability.

While there are a number of effective approaches to robust learning Diakonikolas et al. (2018); Fischler and Bolles (1981), we focus on a simple truncation procedure as it is one of the only procedures scalable enough to apply on large-scale generation datasets. Our work shows that more effective, scalable robust learning procedures can help improve natural language generation methods.

Conclusion

In this work, we show that log loss is not robust to noise, which can in turn cause undesired behavior, such as hallucinating facts in summarization. In response, we propose loss truncation, a robust training method that optimizes for distinguishability of generated samples. We additionally propose a sequence-level rejection sampling scheme to generate high quality sequences. We show that loss truncation outperforms a range of baselines (including beam search, top- $p$ , top- $k$ , and full sampling) on distinguishability. We additionally show that rejection sampling outperforms all baselines, including beam search, on generating factual summaries. These results suggest that robust learning in the form of truncating the log loss can complement model-based approaches to faithful generation by ignoring invalid and undesired references.

References

Appendix A Examples of Titles and Generations

Examples of ground truth titles. We present examples of titles in Figure 6 that require factual hallucination and titles that can be directly entailed from context.

Examples of generated titles. We present examples of titles that were selected and titles that were rejected during rejection sampling in Figure 7. As shown, rejected titles tend to be of lower quality.

Appendix B Proof of Lemma and Proposition

Lemma. We prove the lemma that all elements in $\mathcal{P}_{c,p}$ are close to $p$ in total variation.

By definition of $\mathcal{P}_{c,p}$ , for any $q_{0}$ there exists a $q_{1}$ such that $p=cq_{1}+(1-c)q_{0}$ so,
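
Substituting this decomposition of $p$ gives the following chain (a reconstruction from the definition above, using only the fact that the TV distance between two distributions is at most one):

$$
\left| q_{0} - p \right|_{TV} = \left| q_{0} - cq_{1} - (1-c)q_{0} \right|_{TV} = c \left| q_{0} - q_{1} \right|_{TV} \leq c .
$$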

Proposition. We prove that the truncated log loss bounds total variation.
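
With $p_{\text{ref}}$, $p_{\theta}$, and $p_{t} \in \mathcal{P}_{c,p_{\text{ref}}}$ as assumed notation for the reference, model, and truncated distributions, one way to write the bound is

$$
\left| p_{\text{ref}} - p_{\theta} \right|_{TV}^{2}
\leq \left( \left| p_{\text{ref}} - p_{t} \right|_{TV} + \left| p_{t} - p_{\theta} \right|_{TV} \right)^{2}
\leq c^{2} + 2c + \frac{1}{2} \mathrm{KL}\left(p_{t} \,\|\, p_{\theta}\right),
$$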

which follows from the triangle inequality, Pinsker’s inequality, and using Lemma 1 to bound the remaining terms by $c$ . ∎

Appendix C Hyperparameters

Summarization model hyperparameters. We used a standard OpenNMT-py model with global attention for all sequence-to-sequence experiments Klein et al. (2017). It has a single LSTM layer in the encoder and two in the decoder.

For the baseline model, we train for 200,000 steps with SGD and an initial learning rate of $1$. For the loss truncated model, we hotstart with 100,000 minibatch updates and subsequently train for 100,000 minibatch updates with the truncated loss and an initial learning rate of $0.1$.

$k$ and $p$ selection. The key parameters in top-$k$ and top-$p$ sampling are $k$ and $p$, respectively. These parameters trade off between diversity and quality. To select these values, we chose values of $k$ and $p$ that had similar entropies to our model trained with loss truncation.

Specifically, $k=100$ and $p=0.9$ matched loss truncation at $c=0.6$ for summarization (entropies of $18.08$ , $20.01$ , and $17.93$ respectively). $k=2$ and $p=0.4$ matched rejection sampling for summarization at $c=0.6,\alpha=0.1$ (entropies of $3.71$ , $4.02$ , and $3.84$ respectively).
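
The idea of entropy matching can be illustrated on a single categorical distribution. The sketch below is a simplification under stated assumptions: the entropies reported above are for full sampling distributions over sequences, whereas this example truncates one made-up next-token distribution, and `top_k`, `top_p`, and `match_k` are hypothetical helper names.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_k(probs, k):
    """Keep the k most likely outcomes and renormalize."""
    kept = sorted(probs, reverse=True)[:k]
    total = sum(kept)
    return [p / total for p in kept]

def top_p(probs, p_mass):
    """Keep the smallest set of most likely outcomes whose mass reaches p_mass."""
    kept, running = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        running += p
        if running >= p_mass:
            break
    total = sum(kept)
    return [p / total for p in kept]

def match_k(probs, target_entropy):
    """Pick the k whose truncated distribution has entropy closest to the target."""
    return min(range(1, len(probs) + 1),
               key=lambda k: abs(entropy(top_k(probs, k)) - target_entropy))

token_probs = [0.4, 0.3, 0.2, 0.1]   # illustrative next-token distribution
target = 0.7                          # illustrative target entropy in nats
best_k = match_k(token_probs, target)
nucleus = top_p(token_probs, 0.6)
print(f"k = {best_k}, top-p support size = {len(nucleus)}")
```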

Appendix D Crowd Worker Setup and Prompts

Crowdsourcing setup. For all human evaluations, we used Amazon Mechanical Turk (all prompts shown below). We sampled 312 context/title pairs to measure HUSE. For each generated title, we asked 9 crowd workers to measure the typicality of the generated title, as in Hashimoto et al. (2019). Each crowd worker responded to 24 generated titles.

For measuring factuality, we sampled 312 examples and for each example, we asked two crowd workers how much information in the generated title was present in the article.

Prompts. We show crowd worker prompts for measuring HUSE and factuality in Figure 8. The HUSE prompt was directly taken from Hashimoto et al. (2019) with an extra control.

Appendix E Further experiments

We investigate the sensitivity of loss truncation to the hyperparameter $c$ . To do so, we vary $c$ and measure ROUGE-L and BLEU scores, for summarization and E2E respectively.

We show results for summarization in Table 5 and E2E in Table 6 along with baselines. As shown, truncation outperforms the baselines on automated metrics across a variety of hyperparameter settings. We leave a full investigation of sensitivity to $c$ as future work.

E.2 Combining Loss Truncation and Decoders

As loss truncation is a training method, it can be combined with alternative methods of decoding at inference time. As such, we perform a preliminary investigation of using top- $k$ and top- $p$ decoding with loss truncation.

We show ROUGE-L of loss truncation combined with various decoders and baselines for summarization in Table 7. As shown, top- $k$ and top- $p$ decoding work with loss truncation and can improve sample quality.