Controlling Output Length in Neural Encoder-Decoders

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, Manabu Okumura

Introduction

Since its first use for machine translation [Kalchbrenner and Blunsom, 2013, Cho et al., 2014, Sutskever et al., 2014], the encoder-decoder approach has demonstrated great success in many other sequence generation tasks including image caption generation [Vinyals et al., 2015b, Xu et al., 2015], parsing [Vinyals et al., 2015a], dialogue response generation [Li et al., 2016a, Serban et al., 2016] and sentence summarization [Rush et al., 2015, Chopra et al., 2016]. In particular, in this paper we focus on sentence summarization, which as its name suggests, consists of generating shorter versions of sentences for applications such as document summarization [Nenkova and McKeown, 2011] or headline generation [Dorr et al., 2003]. Recently, Rush et al. [Rush et al., 2015] automatically constructed large training data for sentence summarization, and this has led to the rapid development of neural sentence summarization (NSS) or neural headline generation (NHG) models. There are already many studies that address this task [Nallapati et al., 2016, Ayana et al., 2016, Ranzato et al., 2015, Lopyrev, 2015, Gulcehre et al., 2016, Gu et al., 2016, Chopra et al., 2016].

One of the essential properties that text summarization systems should have is the ability to generate a summary with the desired length. Desired lengths of summaries strongly depends on the scene of use, such as the granularity of information the user wants to understand, or the monitor size of the device the user has. The length also depends on the amount of information contained in the given source document. Hence, in the traditional setting of text summarization, both the source document and the desired length of the summary will be given as input to a summarization system. However, methods for controlling the output sequence length of encoder-decoder models have not been investigated yet, despite their importance in these settings.

In this paper, we propose and investigate four methods for controlling the output sequence length for neural encoder-decoder models. The former two methods are decoding-based; they receive the desired length during the decoding process, and the training process is the same as standard encoder-decoder models. The latter two methods are learning-based; we modify the network architecture to receive the desired length as input.

In experiments, we show that the learning-based methods outperform the decoding-based methods for long (such as 50 or 75 byte) summaries. We also find that despite this additional length-control capability, the proposed methods remain competitive to existing methods on standard settings of the DUC2004 shared task-1.

Background

Text summarization is one of the oldest fields of study in natural language processing, and many summarization methods have focused specifically on sentence compression or headline generation. Traditional approaches to this task focus on word deletion using rule-based [Dorr et al., 2003, Zajic et al., 2004] or statistical [Woodsend et al., 2010, Galanis and Androutsopoulos, 2010, Filippova and Strube, 2008, Filippova and Altun, 2013, Filippova et al., 2015] methods. There are also several studies of abstractive sentence summarization using syntactic transduction [Cohn and Lapata, 2008, Napoles et al., 2011] or taking a phrase-based statistical machine translation approach [Banko et al., 2000, Wubben et al., 2012, Cohn and Lapata, 2013].

Recent work has adopted techniques such as encoder-decoder [Kalchbrenner and Blunsom, 2013, Sutskever et al., 2014, Cho et al., 2014] and attentional [Bahdanau et al., 2015, Luong et al., 2015] neural network models from the field of machine translation, and tailored them to the sentence summarization task. Rush et al. [Rush et al., 2015] were the first to pose sentence summarization as a new target task for neural sequence-to-sequence learning. Several studies have used this task as one of the benchmarks of their neural sequence transduction methods [Ranzato et al., 2015, Lopyrev, 2015, Ayana et al., 2016]. Some studies address the other important phenomena frequently occurred in human-written summaries, such as copying from the source document [Gu et al., 2016, Gulcehre et al., 2016]. Nallapati et al. [Nallapati et al., 2016] investigate a way to solve many important problems capturing keywords, or inputting multiple sentences.

Neural encoder-decoders can also be viewed as statistical language models conditioned on the target sentence context. Rosenfeld et al. [Rosenfeld et al., 2001] have proposed whole-sentence language models that can consider features such as sentence length. However, as described in the introduction, to our knowledge, explicitly controlling length of output sequences in neural language models or encoder-decoders has not been investigated.

Finally, there are some studies to modify the output sequence according some meta information such as the dialogue act [Wen et al., 2015], user personality [Li et al., 2016b], or politeness [Sennrich et al., 2016]. However, these studies have not focused on length, the topic of this paper.

2 Importance of Controlling Output Length

As we already mentioned in Section 1, the most standard setting in text summarization is to input both the source document and the desired length of the summary to a summarization system. Summarization systems thus must be able to generate summaries of various lengths. Obviously, this property is also essential for summarization methods based on neural encoder-decoder models.

Since an encoder-decoder model is a completely data-driven approach, the output sequence length depends on the training data that the model is trained on. For example, we use sentence-summary pairs extracted from the Annotated English Gigaword corpus as training data [Rush et al., 2015], and the average length of human-written summary is 51.38 bytes. Figure 1 shows the statistics of the corpus. When we train a standard encoder-decoder model and perform the standard beam search decoding on the corpus, the average length of its output sequence is 38.02 byte.

However, there are other situations where we want summaries with other lengths. For example, DUC2004 is a shared task where the maximum length of summaries is set to 75 bytes, and summarization systems would benefit from generating sentences up to this length limit.

While recent NSS models themselves cannot control their output length, Rush et al. [Rush et al., 2015] and others following use an ad-hoc method, in which the system is inhibited from generating the end-of-sentence (EOS) tag by assigning a score of $-\infty$ to the tag and generating a fixed number of wordsAccording to the published code (https://github.com/facebook/NAMAS), the default number of words is set to 15, which is too long for the DUC2004 setting. The average number of words of human summaries in the evaluation set is 10.43., and finally the output summaries are truncated to 75 bytes. Ideally, the models should be able to change the output sequence depending on the given output length, and to output the EOS tag at the appropriate time point in a natural manner.

Network Architecture: Encoder-Decoder with Attention

In this section, we describe the model architecture used for our experiments: an encoder-decoder consisting of bi-directional RNNs and an attention mechanism. Figure 2 shows the architecture of the model.

Suppose that the source sentence is represented as a sequence of words $\bm{x}=(x_{1},x_{2},x_{3},...,x_{N})$ . For a given source sentence, the summarizer generates a shortened version of the input (i.e. $N>M$ ), as summary sentence $\bm{y}=(y_{1},y_{2},y_{3},...,y_{M})$ . The model estimates conditional probability $p(\bm{y}|\bm{x})$ using parameters trained on large training data consisting of sentence-summary pairs. Typically, this conditional probability is factorized as the product of conditional probabilities of the next word in the sequence:

where $\bm{y}_{<t}=(y_{1},y_{2},y_{3},...,y_{t-1})$ . In the following, we describe how to compute $p(y_{t}|\bm{y}_{<t},\bm{x})$ .

We use the bi-directional RNN (BiRNN) as encoder which has been shown effective in neural machine translation [Bahdanau et al., 2015] and speech recognition [Schuster and Paliwal, 1997, Graves et al., 2013].

A BiRNN processes the source sentence for both forward and backward directions with two separate RNNs. During the encoding process, the BiRNN computes both forward hidden states $(\overrightarrow{\bm{h}}_{1},\overrightarrow{\bm{h}}_{2},...,\overrightarrow{\bm{h}}_{N})$ and backward hidden states $(\overleftarrow{\bm{h}}_{1},\overleftarrow{\bm{h}}_{2},...,\overleftarrow{\bm{h}}_{N})$ as follows:

While $g$ can be any kind of recurrent unit, we use long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] networks that have memory cells for both directions ( $\overrightarrow{\bm{c}}_{t}$ and $\overleftarrow{\bm{c}}_{t}$ ).

After encoding, we set the initial hidden states $\bm{s}_{0}$ and memory-cell $\bm{m}_{0}$ of the decoder as follows:

2 Decoder and Attender

Our decoder is based on an RNN with LSTM $g$ :

We also use the attention mechanism developed by Luong et al. [Luong et al., 2015], which uses $\bm{s}_{t}$ to compute contextual information $\bm{d}_{t}$ of time step $t$ . We first summarize the forward and backward encoder states by taking their sum $\bar{\bm{h}}_{i}=\overrightarrow{\bm{h}}_{i}+\overleftarrow{\bm{h}}_{i}$ , and then calculate the context vector $\bm{d}_{t}$ as the weighted sum of these summarized vectors:

where $\bm{a}_{t}$ is the weight at the $t$ -th step for $\bar{\bm{h}}_{i}$ computed by a softmax operation:

After context vector $\bm{d}_{t}$ is calculated, the model updates the distribution over the next word as follows:

3 Training and Decoding

The training objective of our models is to maximize log likelihood of the sentence-summary pairs in a given training set $D$ :

Once models are trained, we use beam search to find the output that maximizes the conditional probability.

Controlling Length in Encoder-decoders

In this section, we propose our four methods that can control the length of the output in the encoder-decoder framework. In the first two methods, the decoding process is used to control the output length without changing the model itself. In the other two methods, the model itself has been changed and is trained to obtain the capability of controlling the length. Following the evaluation dataset used in our experiments, we use bytes as the unit of length, although our models can use either words or bytes as necessary.

The first method we examine is a decoding approach similar to the one taken in many recent NSS methods that is slightly less ad-hoc. In this method, we inhibit the decoder from generating the EOS tag by assigning it a score of $-\infty$ . Since the model cannot stop the decoding process by itself, we simply stop the decoding process when the length of output sequence reaches the desired length. More specifically, during beam search, when the length of the sequence generated so far exceeds the desired length, the last word is replaced with the EOS tag and also the score of the last word is replaced with the score of the EOS tag (EOS replacement).

2 fixRng𝑓𝑖𝑥𝑅𝑛𝑔fixRng: Discarding Out-of-range Sequences

Our second decoding method is based on discarding out-of-range sequences, and is not inhibited from generating the EOS tag, allowing it to decide when to stop generation. Instead, we define the legitimate range of the sequence by setting minimum and maximum lengths. Specifically, in addition to the normal beam search procedure, we set two rules:

If the model generates the EOS tag when the output sequence is shorter than the minimum length, we discard the sequence from the beam.

If the generated sequence exceeds the maximum length, we also discard the sequence from the beam. We then replace its last word with the EOS tag and add this sequence to the beam (EOS replacement in Section 4.1).This is a workaround to prevent the situation in which all sequences are discarded from a beam.

In other words, we keep only the sequences that contain the EOS tag and are in the defined length range. This method is a compromise that allows the model some flexibility to plan the generated sequences, but only within a certain acceptable length range.

It should be noted that this method needs a larger beam size if the desired length is very different from the average summary length in the training data, as it will need to preserve hypotheses that have the desired length.

3 LenEmb𝐿𝑒𝑛𝐸𝑚𝑏LenEmb: Length Embedding as Additional Input for the LSTM

where $byte(y_{t})$ is the length of output word $y_{t}$ and $length$ is the desired length. We learn the values of the length embedding matrix $W_{le}$ during training. This method provides additional information about the amount of length remaining in the output sequence, allowing the decoder to “plan” its output based on the remaining number of words it can generate.

4 LenInit𝐿𝑒𝑛𝐼𝑛𝑖𝑡LenInit: Length-based Memory Cell Initialization

While $LenEmb$ inputs the remaining length $l_{t}$ to the decoder at each step of the decoding process, the $LenInit$ method inputs the desired length once at the initial state of the decoder. Figure 4 shows the architecture of $LenInit$ . Specifically, the model uses the memory cell $\bm{m}_{t}$ to control the output length by initializing the states of decoder (hidden state $\bm{s}_{0}$ and memory cell $\bm{m}_{0}$ ) as follows:

While the model of $LenEmb$ is guided towards the appropriate output length by inputting the remaining length at each step, this $LenInit$ attempts to provide the model with the ability to manage the output length on its own using its inner state. Specifically, the memory cell of LSTM networks is suitable for this endeavour, as it is possible for LSTMs to learn functions that, for example, subtract a fixed amount from a particular memory cell every time they output a word. Although other ways for managing the length are also possible,For example, we can also add another memory cell for managing the length. we found this approach to be both simple and effective.

Experiment

We trained our models on a part of the Annotated English Gigaword corpus [Napoles et al., 2012], which Rush et al. [Rush et al., 2015] constructed for sentence summarization. We perform preprocessing using the standard script for the datasethttps://github.com/facebook/NAMAS. The dataset consists of approximately 3.6 million pairs of the first sentence from each source document and its headline. Figure 1 shows the length histograms of the summaries in the training set. The vocabulary size is 116,875 for the source documents and 67,564 for the target summaries including the beginning-of-sentence, end-of-sentence, and unknown word tags. For $LenEmb$ and $LenInit$ , we input the length of each headline during training. Note that we do not train multiple summarization models for each headline length, but a single model that is capable of controlling the length of its output.

We evaluate the methods on the evaluation set of DUC2004 task-1 (generating very short single-document summaries). In this task, summarization systems are required to create a very short summary for each given document. Summaries over the length limit (75 bytes) will be truncated and there is no bonus for creating a shorter summary. The evaluation set consists of 500 source documents and 4 human-written (reference) summaries for each source document. Figure 5 shows the length histograms of the summaries in the evaluation set. Note that the human-written summaries are not always as long as 75 bytes. We used three variants of ROUGE [Lin, 2004] as evaluation metrics: ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence). The two-sided permutation test [Chinchor, 1992] was used for statistical significance testing ( $p\leq 0.05$ ).

2 Implementation

We use Adam [Kingma and Ba, 2015] ( $\alpha$ =0.001, $\beta_{1}$ =0.9, $\beta_{2}$ =0.999, $eps$ = $10^{-8}$ ) to optimize parameters with a mini-batch of size 80. Before every 10,000 updates, we first sampled 800,000 training examples and made groups of 80 examples with the same source sentence length, and shuffled the 10,000 groups.

We set the dimension of word embeddings to 100 and that of the hidden state to 200. For LSTMs, we initialize the bias of the forget gate to 1.0 and use 0.0 for the other gate biases [Józefowicz et al., 2015]. We use Chainer [Tokui et al., 2015] to implement our models. For $LenEmb$ , we set $L$ to 300, which is larger than the longest summary lengths in our dataset (see Figure 1-(b) and Figure 5-(b)).

For all methods except $fixRng$ , we found a beam size of 10 to be sufficient, but for $fixRng$ we used a beam size of 30 because it more aggressively discards candidate sequences from its beams during decoding.

Result

Table 1 shows the ROUGE scores of each method with various length limits (30, 50 and 75 byte). Regardless of the length limit set for the summarization methods, we use the same reference summaries. Note that, $fixLen$ and $fixRng$ generate the summaries with a hard constraint due to their decoding process, which allows them to follow the hard constraint on length. Hence, when we calculate the scores of $LenEmb$ and $LenInit$ , we impose a hard constraint on length to make the comparison fair (i.e. $LenEmb$ (0,L) and $LenInit$ (0,L) in the table). Specifically, we use the same beam search as that for $fixRng$ with minimum length of 0.

For the purpose of showing the length control capability of $LenEmb$ and $LenInit$ , we show at the bottom two lines the results of the standard beam search without the hard constraints on the length $fixRng$ is equivalence to the standard beam search when we set the range as $(0,\infty)$ .. We will use the results of $LenEmb$ (0,∞) and $LenInit$ (0,∞) in the discussions in Sections 6.2 and 6.3.

The results show that the learning-based methods ( $LenEmb$ and $LenInit$ ) tend to outperform decoding-based methods ( $fixLen$ and $fixRng$ ) for the longer summaries of 50 and 75 bytes. However, in the 30-byte setting, there is no significant difference between these two types of methods. We hypothesize that this is because average compression rate in the training data is 30% (Figure 1-(c)) while the 30-byte setting forces the model to generate summaries with 15.38% in average compression rate, and thus the learning-based models did not have enough training data to learn compression at such a steep rate.

2 Examples of Generated Summaries

Tables 2 and 3 show examples from the validation set of the Annotated Gigaword Corpus. The tables show that all models, including both learning-based methods and decoding-based methods, can often generate well-formed sentences.

We can see various paraphrases of “#### us figure championships”Note that “#” is a normalized number and “us” is “US” (United States). and “withdrew”. Some examples are generated as a single noun phrase ( $LenEmb$ (30) and $LenInit$ (30)) which may be suitable for the short length setting.

3 Length Control Capability of Learning-based Models

Figure 6 shows histograms of output length from the standard encoder-decoder, $LenEmb$ , and $LenInit$ . While the output lengths from the standard model disperse widely, the lengths from our learning-based models are concentrated to the desired length. These histograms clearly show the length controlling capability of our learning-based models.

Table 4-(a) shows the final state of the beam when $LenInit$ generates the sentence with a length of 30 bytes for the example with standard beam search in Table 3. We can see all the sentences in the beam are generated with length close to the desired length. This shows that our method has obtained the ability to control the output length as expected. For comparison, Table 4-(b) shows the final state of the beam if we perform standard beam search in the standard encoder-decoder model (used in $fixLen$ and $fixRng$ ). Although each sentence is well-formed, the lengths of them are much more varied.

4 Comparison with Existing Methods

Finally, we compare our methods to existing methods on standard settings of the DUC2004 shared task-1. Although the objective of this paper is not to obtain state-of-the-art scores on this evaluation set, it is of interest whether our length-controllable models are competitive on this task. Table 5 shows that the scores of our methods, which are copied from Table 1, in addition to the scores of some existing methods. ABS [Rush et al., 2015] is the most standard model of neural sentence summarization and is the most similar method to our baseline setting ( $fixLen$ ). This table shows that the score of $fixLen$ is comparable to those of the existing methods. The table also shows the $LenEmb$ and the $LenInit$ have the capability of controlling the length without decreasing the ROUGE score.

Conclusion

In this paper, we presented the first examination of the problem of controlling length in neural encoder-decoder models, from the point of view of summarization. We examined methods for controlling length of output sequences: two decoding-based methods ( $fixLen$ and $fixRng$ ) and two learning-based methods ( $LenEmb$ and $LenInit$ ). The results showed that learning-based methods generally outperform the decoding-based methods, and the learning-based methods obtained the capability of controlling the output length without losing ROUGE score compared to existing summarization methods.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP26280080. We are grateful to have the opportunity to use the Kurisu server of Dwango Co., Ltd. for our experiments.