A Simple, Fast Diverse Decoding Algorithm for Neural Generation

Jiwei Li, Will Monroe, Dan Jurafsky

Introduction

Neural generation models [Sutskever et al., 2014, 1, Cho et al., 2014, Kalchbrenner and Blunsom, 2013] are of growing interest for various applications such as machine translation [Sennrich et al., 2015b, Gulcehre et al., 2015], conversational response generation [Vinyals and Le, 2015, Sordoni et al., 2015, Luan et al., 2016], abstractive summarization [Nallapati et al., 2016, Rush et al., 2015, Chopra et al., 2016], and image caption generation [Chen et al., 2015]. Such models are trained by learning to predict an output sequence, and then at test time, the model chooses the best sequence given the input, usually using beam search.

One long-recognized issue with beam search is lack of diversity in the beam: candidates often differ only by punctuation or minor morphological variations, with most of the words overlapping [Macherey et al., 2008, Tromble et al., 2008, Kumar and Byrne, 2004]. Lack of diversity can hinder sequence generation quality. For tasks like conversational response generation or image caption generation, there is no one correct answer; the decoder thus needs to explore different paths to various sequences to avoid local minima [Vijayakumar et al., 2016].

Lack of diversity causes particular problems in two-stage re-ranking approaches, in which an N-best list or lattice of candidates is generated using beam search and then re-ranked using features too global or expensive to include in the first beam decoding pass E.g., position bias or bilingual attention symmetry in MT [Cohn et al., 2016] or global discourse features in summarization.. In neural response generation, a re-ranking step helps avoid generating dull or generic responses [Li et al., 2015a, Sordoni et al., 2015, Shao et al., 2016]. Lack of diversity in the N-best list significantly decreases the impact of reranking.?), for example, find that re-ranking heuristics in conversational response generation work for shorter responses but not for long responses, since the beam N-best list for long responses are mostly identical even with a large beam.

In this paper, we propose a simple, fast, diversity-fostering beam search model for neural decoding; the model can be obtained by changing just one line of beam search code in MATLAB. The algorithm uses standard beam search as its backbone but adds an additional term penalizing siblings—expansions of the same parent node in the search— thus favoring choosing hypotheses from diverse parents (as demonstrated in Figure 1).

The proposed model supports batched decoding using GPUs, significantly speeding up the decoding process compared to other diversity fostering models for phrase-based MT systems [Macherey et al., 2008, Tromble et al., 2008, Kumar and Byrne, 2004, Devlin and Matsoukas, 2012]. To show the generality of the model, we evaluate it on three neural generation tasks—conversational response generation, abstractive summarization and machine translation– demonstrating that the algorithm generates better outputs due to considering more diverse sequences. We discuss which properties of these various tasks make them more or less likely to be helped by diverse decoding. We also propose a more sophisticated variant that uses reinforcement learning to automatically adjust the diversity rate for different inputs, yielding an additional performance boost.

Related Work

Diverse decoding has been sufficiently explored in phrase-based MT [Huang, 2008, Finkel et al., 2006], including use of compact representations like lattices and hypergraphs [Macherey et al., 2008, Tromble et al., 2008, Kumar and Byrne, 2004], “traits” like translation length [Devlin and Matsoukas, 2012], bagging/boosting [Xiao et al., 2013], blending multiple systems [Cer et al., 2013], and sampling translations proportional to their probability [Chatterjee and Cancedda, 2010]. The most relevant is work from ?) and ?) that produces diverse N-best lists by adding a dissimilarity function based on N-gram overlaps, distancing the current translation from already-generated ones by choosing translations that have high scores but are distinct from previous ones. While we draw on these intuitions, these existing diversity-promoting algorithms are tailored to phrase-based translation frameworks and not easily transplanted to neural MT decoding, which requires batched computation.

Some recent work has looked at decoding for neural generation. ?) proposed a meta-algorithm that runs in parallel many chains of the noisy version of an inner decoding algorithm. ?) proposed a diversity-augmented objective for image caption generation akin to a neural version of ?). ?) used a stochastic search algorithm that reranks the hypothesis segment by segment, which injects diversity earlier in the decoding process.

The proposed RL based algorithm is inspired by a variety of recent reinforcement learning approaches in NLP for tasks such as dialogue [Dhingra et al., 2016], word compositions [Yogatama et al., 2016], machine translation [Ranzato et al., 2015], neural model visualization [Lei et al., 2016], and coreference [Clark and Manning, 2016].

Diverse Beam Decoding

In this section, we introduce the proposed algorithm. We first go over the vanilla beam search method and then detail the proposed algorithm which fosters diversity during decoding.

Let XX denote the source input, which is the input dialogue history for conversational response generation or a source sentence for machine translation. The input XX is mapped to a vector representation, which is used as the initial input to the decoder. Each XX is paired with a target sequence YY, which corresponds to a dialogue utterance in response generation or a target sentence in machine translation. Y={y1,y2,...,yny}Y=\{y_{1},y_{2},...,y_{n_{y}}\} consists a sequence of nyn_{y} words. A neural generation model defines a distribution over outputs and sequentially predicts tokens using a softmax function:

At test time, the goal is to find the sequence YY^{*} that maximizes the probability given input XX:

2 Standard Beam Search for N-best lists

N-best lists are standardly generated from a model of p(YX)p(Y|X) using a beam search decoder. As illustrated in Figure 1, at time step t1t-1 in decoding, the decoder keeps track of KK hypotheses, where KK denotes the beam size, and their scores S(Yt1X)=logp(y1,y2,...,yt1X)S(Y_{t-1}|X)=\log p(y_{1},y_{2},...,y_{t-1}|X). As it moves on to time step tt, it expands each of the KK hypotheses (denoted as Yt1k={y1k,y2k,...,yt1k}Y_{t-1}^{k}=\{y_{1}^{k},y_{2}^{k},...,y_{t-1}^{k}\}, k[1,K]k\in[1,K]) by selecting the top KK candidate expansions, each expansion denoted as ytk,ky_{t}^{k,k^{\prime}}, k[1,K]k^{\prime}\in[1,K], leading to the construction of K×KK\times K new hypotheses:

The score for each of the K×KK\times K hypotheses is computed as follows:

In a standard beam search model, the top KK hypotheses are selected (from the K×KK\times K hypotheses computed in the last step) based on the score S(Yt1k,ytk,kx)S(Y_{t-1}^{k},y_{t}^{k,k^{\prime}}|x). The remaining hypotheses are ignored when the algorithm proceeds to the next time step.

3 Generating a Diverse N-best List

Unfortunately, the N-best lists outputted from standard beam search are a poor surrogate for the entire search space [Finkel et al., 2006, Huang, 2008]. The beam search algorithm can only keep a small proportion of candidates in the search space, and most of the generated translations in N-best list are similar. Our proposal is to increase diversity by changing the way S(Yt1k,ytk,kx)S(Y_{t-1}^{k},y_{t}^{k,k^{\prime}}|x) is computed, as shown in Figure 1. For each of the hypotheses Yt1kY_{t-1}^{k} (he and it), we generate the top KK translations ytk,ky_{t}^{k,k^{\prime}}, k[1,K]k^{\prime}\in[1,K] as in the standard beam search model. Next, we rank the KK translated tokens generated from the same parental hypothesis based on p(ytk,kx,Yt1k)p(y_{t}^{k,k^{\prime}}|x,Y_{t-1}^{k}) in descending order: he is ranks first among he is and he has, and he has ranks second; similarly for it is and it has.

We then rewrite the score for [Yt1k,ytk,k][Y_{t-1}^{k},y_{t}^{k,k^{\prime}}] by adding an additional term γk\gamma k^{\prime}, where kk^{\prime} denotes the ranking of the current hypothesis among its siblings (1 for he is and it is, 2 for he has and it has).

We call γ\gamma the diversity rate; it indicates the degree of diversity one wants to integrate into the beam search model.

The top KK hypotheses are selected based on S^(Yt1k,ytk,kx)\hat{S}(Y_{t-1}^{k},y_{t}^{k,k^{\prime}}|x) as we move on to the next time step. By adding the additional term γk\gamma k^{\prime}, the model punishes lower-ranked hypotheses among siblings (hypotheses descended from the same parent). When we compare newly generated hypotheses descended from different ancestors, the model gives more credit to top hypotheses from each of the different ancestors. For instance, even though the original score for it is is lower than he has, the model favors the former as the latter is more severely punished by the intra-sibling ranking part γk\gamma k^{\prime}. The model thus generally favors choosing hypotheses from diverse parents, leading to a more diverse N-best list. The proposed model is straightforwardly implemented with a minor adjustment to the standard beam search.

Automatically Learning Diversity Rate

One disadvantage of the algorithm described above is that a fixed diversity rate γ\gamma is applied to all examples. Yet the optimal diversity rate could vary from instance to instance, and too high a diversity could even be detrimental if it pushes the decoding model too far from the beam search scores. Indeed, ?) find that in response generation, standard beam search works well for short responses but deteriorates as the sequence gets longer, while ?) argue in image caption generation that diverse decoding is beneficial for images with many objects, but not images with few objects.

A good diverse decoding algorithm should have the ability to automatically adjust its diversity rates for different inputs—for example, using small diversity rates for images with fewer objects but larger rates for those with more objects. We propose a reinforcement learning-based algorithm called diverseRL that is capable of learning different γ\gamma values for different inputs.

We first define a list Γ\Gamma that contains the values that γ\gamma can take. For example, Γ\Gamma might consist of the 21 values in the range at regularly spaced intervals 0.05 apart.This is just for illustration purpose. One can define any set of diversity rate values. Our main idea is to use reinforcement learning (policy gradient methods) to discover the best diversity rate γ(X)\gamma(X) for a given input XX with respect to the final evaluation metric. For each input XX, we parameterize the action of choosing an associated diversity rate γ(X)\gamma(X) by a policy network π(γ(X)=γX)\pi(\gamma(X)=\gamma^{\prime}|X) which is a distribution over the Γ|\Gamma| classes. This means the probability of γ(X)\gamma(X) taking on two similar values (e.g., 0.05 and 0.1) are independent. An alternative is to make γ\gamma continuous. However, we find that using discrete values is good enough because of the large amount of training data, and discrete values are easier to implement. We first map the input XX to a vector representation hXh_{X} using a recurrent netThis recurrent net shares parameters with the standard generation model., and then map hXh_{X} to a policy distribution over different values of γ\gamma using a softmax function:

Given an action, namely a choice of γ\gamma^{\prime} for γ(X)\gamma(X), we start decoding using the proposed diverse decoding algorithm and obtain an N-best list. Then we pick the best output—the output with the largest reranking score, or the output with the largest probability if no reranking is needed. Using the selected output, we compute the evaluation score (e.g., BLEU) denoted R(γ(X)=γ)R(\gamma(X)=\gamma^{\prime}), and this score is used as the rewardThis idea is inspired by recent work [Ranzato et al., 2015] that uses BLEU score as reward in reinforcement learning for machine translation. Our focus is different since we are only interested in learning the policy to obtain diversity rates γ(X)\gamma(X). for the action of choosing diversity rate γ(X)=γ\gamma(X)=\gamma^{\prime}.

We use the REINFORCE algorithm [Williams, 1992], a kind of policy gradient method, to find the optimal diversity rate policy by maximizing the expectation of the final reward, denoted as follows:

The expectation is approximated by sampling from π\pi and the gradient is computed based on the likelihood ratio [Glynn, 1987, VM et al., 1968]:

where bb denotes the baseline value.The baseline value is estimated using another neural model that takes as input XX and outputs a scalar bb denoting the estimation of the reward. The baseline model is trained by minimizing the mean squared loss between the estimated reward bb and actual cumulative reward rr, rb2||r-b||^{2}. We refer the readers to [Ranzato et al., 2015, Zaremba and Sutskever, 2015] for more details. The baseline estimator model is independent from the policy models and the error is not backpropagated back to them. The model is trained to take actions (choosing a diversity rate γ\gamma) that will lead to the highest value of final rewards.

2 Training and Testing

To learn the policy π(γ(X)=γ))\pi(\gamma(X)=\gamma^{\prime})), we take a pretrained encoder-decoder model and run additional epochs over the training set in which we keep the encoder-decoder parameters fixed and do the diverse beam search using γ(X)\gamma(X) sampled from the policy distribution. Because decoding is needed for every training sample, training is extremely time-consuming. Luckily, since the sentence composition model involved in π(γ(X)=γX)\pi(\gamma(X)=\gamma|X) shares parameters with the pre-trained encoder-decoder model, the only parameters needed to be learned are those in the softmax function in Eq. 4, the number of which is relatively small. We therefore only take a small fraction of training examples (around 100,000 instances).

Special attention is needed for tasks in which feature-based reranking is used for picking the final output: feature weights will change as γ\gamma changes because those weights are tuned based on the decoded N-best list for dev set, usually using MERT [Och, 2003]. Different γ\gamma will lead to different dev set N-best lists and consequently different feature weights. We thus adjust feature weights using the dev set after every 10,000 instances. Training takes roughly 1 day.

Experiments

We conduct experiments on three different sequence generation tasks: conversational response generation, abstractive summarization and machine translation, the details of which are described below.

We used the OpenSubtitles (OSDb) dataset [Tiedemann, 2009], an open-domain movie script dataset containing roughly 60M-70M scripted lines spoken by movie characters. Our models are trained to predict the current turn given the preceding ones. We trained a two-layer encoder-decoder model with attention [1, Luong et al., 2015], with 512 units in each layer. We treat the two preceding dialogue utterances as dialogue history, simply concatenating them to form the source input.

To better illustrate in which scenarios the proposed algorithm offers the most help, we construct three different dev-test splits based on reference length, specifically:

Natural: Instances randomly sampled from the dataset.

Short: Instances with target reference length no greater than 6.

Long: Instances with target reference length greater than 16.

Each set contains roughly 2,000 instances.

1.2 Decoding and Reranking

No reranking is needed. We simply pick the output with highest probability using standard and diverse beam search.

Following ?), we first generate an N-best list using vanilla or diverse beam search and rerank the generated responses by combining likelihood logp(YX)\log p(Y|X), backward likelihood logp(XY)\log p(X|Y),logp(YX)\log p(Y|X) is trained in a similar way as standard Seq2Seq models with only sources and targets being swapped. sequence length L(Y)L(Y), and language model likelihood p(Y)p(Y). The linear combination of log(YX)\log(Y|X) and logp(XY)\log p(X|Y) is a generalization of the mutual information between the source and the target, which dramatically decreases the rate of dull and generic responses. Feature weights are optimized using MERT [Och, 2003] on N-best lists of response candidates.We set the minimum length and maximum length of a decoded target to 0.75 and 1.5 times the length of sources. Beam size K is set to 10. We then rerank the N-best list and pick the output target with largest final ranking score.

1.3 Evaluation

For automatic evaluations, we report (1) BLEU [Papineni et al., 2002], which has widely been used in response generation evaluation; and (2) diversity, which is the number of distinct unigrams and bigrams in generated responses scaled by the total number of generated tokens. This scaling is to avoid favoring long sentences, as described in Li et al. (2016a).

We also do human evaluation, as suggested by ?). We employ crowdsourced judges to provide evaluations for a random sample of 200 items. Each output pair was ranked by 3 judges, who were asked to decide which of the two outputs was better. They were instructed to prefer outputs that were more specific (relevant) to the context. Ties were permitted. Identical strings were automatically assigned the same score.

1.4 Results

Results for BLEU scores and diversity scores are presented in Tables 1 and 2. Diverse decoding boosts performance across all settings, but more so in the reranking settings than the non-reranking ones. For reranking, the improvement is much larger for longer responses than shorter ones. We suspect that this is because short responses have a smaller search space, and so the vanilla beam search has a sufficiently diverse beam. For longer responses, the generator may get trapped in a local decoding minimum, and as [Shao et al., 2016] pointed out can generate incoherent or even contradictory responses like “I like fish but I don’t like fish” . Table 4 shows that for these longer responses, these hypotheses generated by standard beam search are nearly identical, dramatically decreasing the impact of the reranking process. Sampling could be another way of generating diverse responses. However, responses from sampling are usually incoherent, especially for long sequences. This is because the error accumulates as decoding goes on. A similar phenomenon has also been observed by ?).

On the Natural set, which has the same distribution of length as the full data, we see a performance boost from the diverseRL model’s ability to dynamically adjust diversity. For the short and long sets, the improvement from the diverseRL model is less significant, presumably because the dataset has been pre-processed in such a way that examples are more similar to each other in terms of length.

In terms of human evaluation, we find that the diverseRL model produces responses with better general quality, winning 62 percent of the time when compared to standard beam search.

2 Abstractive Summarization

We consider two settings for abstractive summarization. For the first setting (denoted single), we follow the protocols described in ?), in which the source input is the first sentence of the document to summarize. We train a word-level attention model for this setting. Our training dataset consists of 800K pairs.

A good summarization system should have the ability to summarize a large chunk of text, separating wheat from chaff. We thus consider another setting in which each input consists of multiple sentences (denoted multi). We consider 10 sentences at most for each input. Sentence position treated as a word feature which is associated with an embedding to be learned as suggested in ?) We train a hierarchical model with attention at sentence level [Li et al., 2015b] for generation, which has been shown to yield better results than word-level encoder-decoder models for multi-sentence summarization [Nallapati et al., 2016].

For reranking, we employ various global features taken from or inspired by prior non-neural work in summarization [Daumé III and Marcu, 2006, Nenkova and Vanderwende, 2005, McKeown et al., 1999]. Features we consider include (1) average tf-idf score of constituent words in the generated output; (2) KLSum [Haghighi and Vanderwende, 2009], which reflects the topic distribution overlap between the entire input document and the generated output; Specifically, the KL divergence between the topic distributions assigned by a variant of the LDA model [Blei et al., 2003] that identifies general, document-specific and topic-specific word clusters. and (3) backward probability p(XY)p(X|Y), i.e., the probability of generating the entire document given the summary. For evaluation, we report ROUGE-2 scores [Lin, 2004] in Table 5.

The proposed diverse reranking algorithm helps in both cases, but more significantly in the reranking setting than the non-reranking one. This is because the mechanisms by which diverse decoding helps in the two settings are different: for the non-reranking setting, if the conditional entropy in the target distribution is low enough, the decoded string from standard beam search will be close to the global optimum, and thus there won’t be much space for improvement. For the reranking setting, however, the story is different: it is not just about finding the global optimal output based on the target distribution, but also about incorporating different criteria to make up for facets that are missed by the encoder-decoder model. This requires the N-best list to be diverse for the reranking models to make a significant difference.

We also find an improvement from the reranking model using document-level features over the non-reranking model. We did not observe a big performance boost from the diverseRL model over the standard diverse model (around 0.3 ROUGE score boost) on this task. This is probably because the diverseRL model helps most when inputs are different and thus need different diverse decoding rates. In this task, the input documents are all news articles and share similar properties, so a unified diversity rate tuned on the dev set might already be good enough. Results for diverseRL are thus omitted for brevity.

Interestingly, we find that the result for the multi setting is significantly worse than the single setting, which is also observed by ?): adding more sentences leads to worse results. This illustrates the incapability of neural generation models to date to summarize long documents. The proposed diverse decoding model produces a huge performance boost in the multi setting.

3 Machine Translation

The models are trained on the WMT’14 training dataset containing 4.5 million pairs for English-German. We limit our vocabulary to the top 50K most frequent words for both languages. Words not in the vocabulary are replaced by a universal unknown token. We use newstest2013 (3000 sentence pairs) as the development set and report translation performances in BLEU [Papineni et al., 2002] on newstest2014 (2737 sentences). We trained neural Seq2Seq models [Sutskever et al., 2014] with attention [Luong et al., 2015, Cho et al., 2014]. Unknown words are replaced using methods similar to those of ?).

3.2 Decoding and Reranking

Again we consider both reranking settings and non-reranking settings, where for non-reranking we select the best output using standard beam search. For reranking, we first generate a large N-best list using the beam search algorithm.We use beam size K=50K=50 both for standard beam search and diverse beam search. At each time step of decoding, we are presented with K×KK\times K word candidates. We first add all hypotheses with an EOS token generated at the current time step to the N-best list. Next, we preserve the top K unfinished hypotheses and move to the next time step. We therefore maintain constant batch size as hypotheses are completed and removed, by adding in more unfinished hypotheses. This allows the size of final N-best list for each input to be much larger than the beam size. We then rerank the hypotheses using features that have been shown to be useful in neural machine translation [Sennrich et al., 2015a, Gulcehre et al., 2015, Cohn et al., 2016, Cheng et al., 2015], including (1) backward probability p(XY)p(X|Y); (2) language model probability p(Y)p(Y) trained from monolingual data [Gulcehre et al., 2015, Gulcehre et al., 2015]; p(t)p(t) is trained using a single-layer LSTM recurrent models using monolingual data. We use News Crawl corpora from WMT13 (http://www.statmt.org/wmt13/translation-task.html) as additional training data to train monolingual language models. We used a subset of the original dataset which roughly contains 60 millions sentences. (3) bilingual symmetry: the agreement between attentions in German-English and English-German; and (4) target length. Again feature weights are optimized using MERT [Och, 2003] on N-best lists of response candidates in the dev set.

3.3 Results

Experimental results are shown in Table 6. First, no significant improvement is observed for the proposed diverse beam search model in the non-reranking setting. The explanation is similar to the abstractive summarization task: the vanilla beam search algorithm with large beam size is already good enough at finding the global optimum, and the small benefit from diverse decoding for some examples might even be canceled out by others where diversity rate value is too large. The diverseRL model addresses the second issue, leading to a performance boost of +0.25. For the reranking setting, the performance boost is more significant, +0.6 for diverse beam search and +0.9 for DiverseRL.

Overall, diverse decoding doesn’t seem to improve machine translation as much as it does summarization and response generation. We suspect this is due to the very low entropy of the target distribution (perplexity less than 6), so standard beam search is already fairly strong.The difference between beam size 2 and beam size 12 in decoding for French-English translation is only around 0.3 BLEU score [Sutskever et al., 2014], which also confirms that the standard decoding algorithm is sufficiently powerful for MT.

Discussion

In this paper, we introduce a general diversity-promoting decoding algorithm for neural generation. The model adds an intra-sibling ranking term to the standard beam search algorithm, favoring choosing hypotheses from diverse parents. The proposed model is a general, simple and fast algorithm that will bring a performance boost to all neural generation tasks for which a diverse N-best list is needed.

On top of this approach, we build a more sophisticated algorithm that is capable of automatically adjusting diversity rates for different inputs using reinforcement learning. We find that, at the expense of model complexity and training time (compared with the basic RL-free diverse decoding algorithm), the model is able to adjust its diversity rate to better values, yielding generation quality better than either standard beam search or the basic diverse decoding approach.

Diverse decoding doesn’t help all tasks equally, contributing the most in two kinds of tasks: (1) Those with a very diverse space of ground truth outputs (e.g., tasks like response generation), rather than those in which the conditional entropy of the target distribution is already low enough (e.g., machine translation) (2) Tasks for which reranking is needed to incorporate features not included in the first decoder pass, such as document-level abstractive summarization.

We would especially like to thank Thang Luong Minh for insightful discussions and releasing relevant code, as well as Sida Wang, Ziang Xie, Chris Manning and other members of the Stanford NLP group for helpful comments and suggestions. We also thank Michel Galley, Bill Dolan, Chris Brockett, Jianfeng Gao and other members of the NLP group at Microsoft Research for helpful discussions. Jiwei Li is supported by a Facebook Fellowship, which we gratefully acknowledge. This work is partially supported by the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF- 15-1-0462 and the NSF via IIS-1514268. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the NSF, or Facebook.

References