Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence

Xiaoyu Shen, Ernie Chang, Hui Su, Jie Zhou, Dietrich Klakow

Introduction

Data-to-text generation aims at automatically producing natural language descriptions of structured database Reiter and Dale (1997). Traditional statistical methods usually tackle this problem by breaking the generation process into a set of local decisions that are learned separately Belz (2008); Angeli et al. (2010); Kim and Mooney (2010); Oya et al. (2014). Recently, neural attention models conflate all steps into a single end-to-end system and largely simplify the training process Mei et al. (2016); Lebret et al. (2016); Shen et al. (2017); Su et al. (2018, 2019); Chang et al. (2020). However, the black-box conflation also renders the generation uninterpretable and hard to control Wiseman et al. (2018); Shen et al. (2019a). Verifying the generation correctness in a principled way is non-trivial. In practice, it often suffers from the problem of information missing, repetition and “hallucination” Dušek et al. (2018, 2020).

In this work, we propose to explicitly exploit the segmental structure of text. Specifically, we assume the target text is formed from a sequence of segments. Every segment is the result of a two-stage decision: (1) Select a proper data record to be described and (2) Generate corresponding text by paying attention only to the selected data record. This decision is repeated until all desired records have been realized. Figure 1 illustrates this process.

Compared with neural attention, the proposed model has the following advantages: (1) We can monitor the corresponding data record for every segment to be generated. This allows us to easily control the output structure and verify its correctnessFor example, we can perform a similar constrained decoding as in Balakrishnan et al. (2019) to rule out outputs with undesired patterns.. (2) Explicitly building the correspondence between segments and data records can potentially reduce the hallucination, as noted in Wu et al. (2018); Deng et al. (2018) that hard alignment usually outperforms soft attention. (3) When decoding each segment, the model pays attention only to the selected data record instead of averaging over the entire input data. This largely reduces the memory and computational costs Coarse-to-fine attention Ling and Rush (2017); Deng et al. (2017) was proposed for the same motivation, but they resort to reinforcement learning which is hard to train, and the performance is sacrificed for efficiency..

To train the model, we do not rely on any human annotations for the segmentation and correspondence, but rather marginalize over all possibilities to maximize the likelihood of target text, which can be efficiently done within polynomial time by dynamic programming. This is essentially similar to traditional methods of inducing segmentation and alignment with semi-markov models Daumé III and Marcu (2005); Liang et al. (2009). However, they make strong independence assumptions thus perform poorly as a generative model Angeli et al. (2010). In contrast, the transition and generation in our model condition on all previously generated text. By integrating an autoregressive neural network structure, our model is able to capture unbounded dependencies while still permitting tractable inference. The training process is stable as it does not require any sampling-based approximations. We further add a soft statistical constraint to control the segmentation granularity via posterior regularization Ganchev et al. (2010). On both the E2E and WebNLG benchmarks, our model is able to produce significantly higher-quality outputs while being several times computationally cheaper. Due to its fully interpretable segmental structure, it can be easily reconciled with heuristic rules or hand-engineered constraints to control the outputs Code to be released soon.

Related Work

Data-to-text generation is traditionally dealt with using a pipeline structure containing content planning, sentence planning and linguistic realization Reiter and Dale (1997). Each target text is split into meaningful fragments and aligned with corresponding data records, either by hand-engineered rules Kukich (1983); McKeown (1992) or statistical induction Liang et al. (2009); Koncel-Kedziorski et al. (2014); Qin et al. (2018). The segmentation and alignment are used as supervision signals to train the content and sentence planner Barzilay and Lapata (2005); Angeli et al. (2010). The linguistic realization is usually implemented by template mining from the training corpus Kondadadi et al. (2013); Oya et al. (2014). Our model adopts a similar pipeline generative process, but integrates all the sub-steps into a single end-to-end trainable neural architecture. It can be considered as a neural extension of the PCFG system in Konstas and Lapata (2013), with a more powerful transition probability considering inter-segment dependence and a state-of-the-art attention-based language model as the linguistic realizer. Wiseman et al. (2018) tried a similar neural generative model to induce templates. However, their model only captures loose data-text correspondence and adopts a weak markov assumption for the segment transition probability. Therefore, it underperforms the neural attention baseline as for generation. Our model is also in spirit related to recent attempts at separating content planning and surface realization in neural data-to-text models Zhao et al. (2018); Puduppully et al. (2019); Moryossef et al. (2019); Ferreira et al. (2019). Nonetheless, all of them resort to manual annotations or hand-engineered rules applicable only for a narrow domain. Our model, instead, automatically learn the optimal content planning via exploring over exponentially many segmentation/correspondence possibilities.

There have been quite a few neural alignment models applied to tasks like machine translation Wang et al. (2018); Deng et al. (2018), character transduction Wu et al. (2018); Shankar and Sarawagi (2019) and summarization Yu et al. (2016); Shen et al. (2019b). Unlike word-to-word alignment, we focus on learning the alignment between data records and text segments. Some works also integrate neural language models to jointly learn the segmentation and correspondence, e.g., phrase-based machine translation Huang et al. (2018), speech recognition Wang et al. (2017) and vision-grounded word segmentation Kawakami et al. (2019). Data-to-text naturally fits into this scenario since each data record is normally verbalized in one continuous text segment.

Background: Data-to-Text

Let $X,Y$ denote a source-target pair. $X$ is structured data containing a set of records and $Y$ corresponds to $y_{1},y_{2},\ldots,y_{m}$ which is a text description of $X$ . The goal of data-to-text generation is to learn a distribution $p(Y|X)$ to automatically generate proper text describing the content of the data.

The neural attention architecture handles this task with an encode-attend-decode process Bahdanau et al. (2015). The input $X$ is processed into a sequence of $x_{1},x_{2},\ldots,x_{n}$ , normally by flattening the data records Wiseman et al. (2017). The encoder encodes each $x_{i}$ into a vector $h_{i}$ . At each time step, the decoder attends to encoded vectors and outputs the probability of the next token by $p(y_{t}|y_{1\mathrel{\mathop{\mathchar 58\relax}}t-1},A_{t})$ . $A_{t}$ is a weighted average of source vectors:

$d_{t}$ is the hidden state of the decoder at time step $t$ . $f$ is a score function to compute the similarity between $h_{i}$ and $d_{t}$ Luong et al. (2015).

Approach

Suppose the input data $X$ contains a set of records $r_{1},r_{2},...,r_{K}$ . Our assumption is that the target text $y_{1\mathrel{\mathop{\mathchar 58\relax}}m}$ can be segmented into a sequence of fragments. Each fragment corresponds to one data record. As the ground-truth segmentation and correspondence are not available, we need to enumerate over all possibilities to compute the likelihood of $y_{1\mathrel{\mathop{\mathchar 58\relax}}m}$ . Denote by $\mathcal{S}_{y}$ the set containing all valid segmentation of $y_{1\mathrel{\mathop{\mathchar 58\relax}}m}$ . For any valid segmentation $s_{1\mathrel{\mathop{\mathchar 58\relax}}\tau_{s}}\in\mathcal{S}_{y}$ , $\pi(s_{1\mathrel{\mathop{\mathchar 58\relax}}\tau_{s}})=y_{1\mathrel{\mathop{\mathchar 58\relax}}m}$ , where $\pi$ means concatenation and $\tau_{s}$ is the number of segments. For example, let $m=5$ and $\tau_{s}=3$ . One possible segmentation would be $s_{1\mathrel{\mathop{\mathchar 58\relax}}\tau_{s}}=\{\{y_{1},y_{2},\$ \},\{y_{3},\ $\},\{y_{4},y_{5},\$ \}\} $.$ \$ $is the end-of-segment symbol and is removed when applying the$ \pi $operator. We further define$ c(*) $to be the corresponding data record(s) of$ * $. The likelihood of each text is then computed by enumerating over all possibilities of$ s_{1\mathrel{\mathop{\mathchar 58\relax}}\tau_{s}} $and$ c(s_{1\mathrel{\mathop{\mathchar 58\relax}}\tau_{s}})$:

Every segment is generated by first selecting the data record based on the transition probability $p(c(s_{o})|\pi(s_{<o}),c(s_{<o}))$ , then generating tokens based on the word generation probability $p(s_{o}|\pi(s_{<o}),c(s_{o}))$ . Figure 2 illustrates the generation process of our model.

We base the generation probability on the same decoder as in neural attention models. The only difference is that the model can only pay attention to its corresponding data record. The attention scores of other records are masked out when decoding $s_{o}$ :

Following the common practice, we define the output probability with the pointer generator See et al. (2017); Wiseman et al. (2017):

$d_{t}$ is the decoder’s hidden state at time step $t$ . $\circ$ denotes vector concatenation. $A_{t}$ is the context vector. MLP indicates multi-layer perceptron and $\sigma$ normalizes the score between $(0,1)$ . $W_{1}$ and $W_{2}$ are trainable matrices. $p_{gen}$ is the probability that the word is generated from a fixed vocabulary distribution $p_{vocab}$ instead of being copied. The final decoding probability $p_{\theta}(y_{t})$ is marginalized over $p_{vocab}$ and the copy distribution. The generation probability of $s_{o}$ factorizes over the words within it and the end-of-segment token:

Transition Probability

We make a mild assumption that $c(s_{o})$ is dependent only on $c(s_{o-1})$ and $\pi(s_{1\mathrel{\mathop{\mathchar 58\relax}}o-1})$ but irrelevant of $c(s_{<o-1})$ , which is a common practice when modelling alignment Och et al. (1999); Yu et al. (2016); Shankar and Sarawagi (2019). The transition probability is defined as:

A softmax layer is finally applied to the above equation to normalize it as a proper probability distribution. $f(r_{i})$ is a representation of $r_{i}$ , which is defined as a max pooling over all the word embeddings contained in $r_{i}$ . $A_{s_{o-1}}$ is the attention context vector when decoding the last token in $s_{o-1}$ , defined as in Equation 1. It carries important information from $c(s_{o-1})$ to help predict $c(s_{o})$ . $d_{s_{o-1}}$ is the hidden state of the neural decoder which goes through all history tokens $\pi(s_{1\mathrel{\mathop{\mathchar 58\relax}}o-1})$ . $M,N$ are trainable matrices to project $A_{s_{o-1}}$ and $d_{s_{o-1}}$ into the same dimension as $f(r_{i})$ .

We further add one constraint to prohibit self-transition, which can be easily done by zeroing out the transition probability in Equation 3 when $c(s_{o})=c(s_{o-1})$ . This forces the model to group together text describing the same data record.

Since Equation 3 conditions on all previously generated text, it is able to capture more complex dependencies as in semi-markov models Liang et al. (2009); Wiseman et al. (2018).

Null Record

In our task, we find some frequent phrases, e.g., “it is”, “and”, tend to be wrongly aligned with some random records, similar to the garbage collection issue in statistical alignment Brown et al. (1993). This hurt the model interpretability. Therefore, we introduce an additional null record $r_{0}$ to attract these non-content phrases. The context vector when aligned to $r_{0}$ is a zero vector so that the decoder will decode words based solely on the language model without relying on the input data.

Training

Equation 2 contains exponentially many combinations to enumerate over. Here we show how to efficiently compute the likelihood with the forward algorithm in dynamic programming Rabiner (1989). We define the forward variable $\alpha(i,j)=p(y_{1\mathrel{\mathop{\mathchar 58\relax}}i},c(y_{i})=j|X)$ . With the base $\alpha(1,j)=p(y_{1}|c(y_{1})=j)$ . The recursion goes as follows for $i=1,2,\ldots,m-1$ :

The final likelihood of the target text can be computed as $p(y_{1\mathrel{\mathop{\mathchar 58\relax}}m}|X)=\sum_{j=r_{0}}^{r_{K}}\alpha(m,j)$ . As the forward algorithm is fully differentiable, we maximize the log-likelihood of the target text by backpropagating through the dynamic programming. The process is essentially equivalent to the generalized EM algorithm Eisner (2016). By means of the modern automatic differentiation tools, we avoid the necessity to calculate the posterior distribution manually Kim et al. (2018).

To speed up training, we set a threshold $L$ to the maximum length of a segment as in Liang et al. (2009); Wiseman et al. (2018). This changes the complexity in Equation 4 to a constant $O(LK)$ instead of scaling linearly with the length of the target text. Moreover, as pointed out in Wang et al. (2017), the computation for the longest segment can be reused for shorter segments. We therefore first compute the generation and transition probability for the whole sequence in one pass. The intermediate results are then cached to efficiently proceed the forward algorithm without any re-computation.

One last issue is the numerical precision, it is important to use the log-space binary operations to avoid underflow Kim et al. (2017).

Segmentation Granularity

There are several valid segmentations for a given text. As shown in Table 1, when the segmentation (example 1) is too fine-grained, controlling the output information becomes difficult because the content of one data record is realized in separate pieces The finer-grained segmentation might be useful if the focus is on modeling the detailed discourse structure instead of the information accuracy Reed et al. (2018); Balakrishnan et al. (2019), which we leave for future work.. When it is too coarse, the alignment might become less accurate (as in Example 4, “pub” is wrongly merged with previous words and aligned together to the “Food” record). In practice, we expect the segmentation to stay with accurate alignment yet avoid being too brokenly separated. To control the granularity as we want, we utilize posterior regularization Ganchev et al. (2010) to constrain the expected number of segments for each text We can also utilize some heuristic rules to help segmentation. For example, we can prevent breaking syntactic elements obtained from an external parser Yang et al. (2019) or match entity names with handcrafted rules Chen et al. (2018). The interpretability of the segmental structure allows easy combination with these rules. We focus on a general domain-agnostic method in this paper, though heuristic rules might bring further improvement under certain cases., which can be calculated by going through a similar forward pass as in Equation 4 Eisner (2002); Backes et al. (2018). Most computation is shared without significant extra burden. The final loss function is:

Decoding

The segment-by-segment generation process allows us to easily constrain the output structure. Undesirable patterns can be rejected before the whole text is generated. We adopt three simple constraints for the decoder:

The same data record cannot be realized more than once (except for the null record).

The generation will not finish until all data records have been realized.

Constraint 2 and 3 directly address the information repetition and missing problem. When segments are incrementally generated, the constraints will be checked against for validity. Note that adding the constraints hardly incur any cost, the decoding process is still finished in one pass. No post-processing or reranking is needed.

Computational Complexity

Suppose the input data has $M$ records and each record contains $N$ tokens. The computational complexity for neural attention models is $O(MN)$ at each decoding step where the whole input is retrieved. Our model, similar to chunkwise attention Chiu and Raffel (2018) or coarse-to-fine attention Ling and Rush (2017), reduces the cost to $O(M+N)$ , where we select the record in $O(M)$ at the beginning of each segment and attend only to the selected record in $O(N)$ when decoding every word. For larger input data, our model can be significantly cheaper than neural attention models.

Experiment Setup

We conduct experiments on the E2E Novikova et al. (2017b) and WebNLG Colin et al. (2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain. The inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities belonging to fifteen distinct DBpedia categories. The inputs are up to seven RDF triples of the form (subject, relation, object).

Implementation Details

We use a bi-directional LSTM encoder and uni-directional LSTM decoder for all experiments. Input data records are concatenated into a sequence and fed into the encoder. We choose the hidden size of encoder/decoder as 512 for E2E and 256 for WebNLG. The word embedding is with size 100 for both datasets and initialized with the pre-trained Glove embedding nlp.stanford.edu/data/glove.6B.zip Pennington et al. (2014). We use a drop out rate of $0.3$ for both the encoder and decoder. Models are trained using the Adam optimizer Kingma and Ba (2014) with batch size 64. The learning rate is initialized to 0.01 and decays an order of magnitude once the validation loss increases. All hyperparameters are chosen with grid search according to the validation loss. Models are implemented based on the open-source library PyTorch Paszke et al. (2019). We set the hyperparameters in Eq. 5 as $\eta=K,\gamma=1$ (recall that $K$ is the number of records in the input data). The intuition is that every text is expected to realize the content of all $K$ input records. It is natural to assume every text can be roughly segmented into $K$ fragments, each corresponding to one data record. A deviation of $K\pm 1$ is allowed for noisy data or text with complex structures.

Metrics

We measure the quality of system outputs from three perspectives: (1) word-level overlap with human references, which is a commonly used metric for text generation. We report the scores of BLEU-4 Papineni et al. (2002), ROUGE-L Lin (2004), Meteor Banerjee and Lavie (2005) and CIDEr Vedantam et al. (2015) . (2) human evaluation. Word-level overlapping scores usually correlate rather poorly with human judgements on fluency and information accuracy Reiter and Belz (2009); Novikova et al. (2017a). Therefore, we passed the input data and generated text to human annotators to judge if the text is fluent by grammar (scale 1-5 as in Belz and Reiter (2006)), contains wrong fact inconsistent with input data, repeats or misses information. We report the averaged score for fluency and definite numbers for others. The human is conducted on a sampled subset from the test data. To ensure the subset covers inputs with all possible number of records ( $K\in$ for E2E and $K\in$ for WebNLG), we sample 20 instances for every possible $K$ . Finally,we obtain 120 test cases for E2E and 140 for WebNLG The original human evaluation subset of WebNLG is randomly sampled, most of the inputs contain less than 3 records, so we opt for a new sample for a thorough evaluation.. (3) Diversity of outputs. Diversity is an important concern for many real-life applications. We measure it by the number of unique unigrams and trigrams over system outputs, as done in Dušek et al. (2020).

Results

In this section, we first show the effects of the granularity regularization we proposed, then compare model performance on two datasets and analyze the performance difference. Our model is compared against the neural attention-based pointer generator (PG) which does not explicit learn the segmentation and correspondence. To show the effects of the constrained decoding mentioned in section 4, Decoding. we run our model with only the first constraint to prevent empty segments (denoted by ours in experiments), with the first two constraints to prevent repetition (denoted by ours (+R)), and with all constraints to further reduce information missing (denoted by ours (+RM)).

We show the effects of the granularity regularization (section 4, Segmentation Granularity) in Fig 3. When varying the model size, the segmentation granularity changes much if no regularization is imposed. Intuitively if the generation module is strong enough (larger hidden size), it can accurately estimate the sentence likelihood itself without paying extra cost of switching between segments, then it tends to reduce the number of transitions. Vice versa, the number of transitions will grow if the transition module is stronger (larger embedding size). With the regularization we proposed, the granularity remains what we want regardless of the hyperparameters. We can thereby freely decide the model capacity without worrying about the difference of segmentation behavior.

Results on E2E

On the E2E dataset, apart from our implementations, we also compare agianst outputs from the SLUG Juraska et al. (2018), the overall winner of the E2E challenge (seq2seq-based), DANGNT Nguyen and Tran (2018), the best grammar rule based model, TUDA Puzikov and Gurevych (2018), the best template based model, and the autoregressive neural template model (N_TEMP) from Wiseman et al. (2018). SLUG uses a heuristic slot aligner based on a set of handcrafted rules and combine a complex pipeline of data augmentation, selection, model ensemble and reranker, while our model has a simple end-to-end learning paradigm with no special delexicalizing, training or decoding tricks. Table 2 reports the evaluated results. Seq2seq-based models are more diverse than rule-based models at the cost of higher chances of making errors. As rule-based systems are by design always faithful to the input information, they made zero wrong facts in their outputs. Most models do not have the fact repetition issue because of the relatively simple patterns in the E2E dataset. therefore, adding the (+R) constraint only improves the performance minorly. The (+RM) constraint reduces the number of information missing to 3 without hurting the fluency. All the 3 missing cases are because of the wrong alignment between the period and one data record, which can be easily fixed by defining a simple rule. We put the error analysis in appendix A. N_Temp performs worst among all seq2seq-based systems because of the restrictions we mentioned in section 2. As also noted by the author, it trades-off the generation quality for interpretability and controllability. In contrast, our model, despite relying on no heuristics or complex pipelines, made zero wrong facts with the lowest information missing rate, even surpassing rule-based models. It also maintains interpretable and controllable without sacrificing the generation quality.

Results on WebNLG

Table 3 reports the results evaluated on the WebNLG dataset. We also include results from MELBOURNE, a seq2seq-based system achieving highest scores on automatic metrics in the WebNLG challenge and UPF-FORGE, a classic grammar-based system that wins in the human evaluation WebNLG contains significantly more distinct types of attributes than E2E, so the chance of making errors or repetitions increases greatly. Nevertheless, our model still performs on-par on automatic metrics with superior information adequacy and output diversity. The (+R) decoding constraint becomes important since the outputs in WebNLG are much longer than those in E2E, neural network models have problems tracking the history generation beyond certain range. Models might repeat facts that have been already generated long back before. The (+R) constraint effectively reduces the repetition cases from 19 to 2. These 2 cases are intra-segment repetitions and failed to be detected since our model can only track inter-segment constraints (examples are in appendix A). The (+RM) constraint brings down the information missing cases to 5 with slightly more wrong and repeated facts compared with (+R). Forcing models to keep generating until coveraging all records will inevitably increase the risk of making errors.

Discussions

In summary, our models generates most diverse outputs, achieves similar or better performances in word-overlap automatic metrics while significantly reduces the information hallucination, repetition and missing problems. An example of hallucination is shown in Table 4. The standard PG model “hallucinated” the contents of “low-priced”, “in the city center” and “delivers take-away”. The visualized attention maps reveal that it failed to attend properly when decoding the word “low”. The decoding is driven mostly by language models instead of the contents of input data. In contrast, as we explicitly align each segment to one slot, the attention distribution of our model is concentrated on one single slot rather than averaged over the whole input, the chance of hallucinating is therefore largely reduced.

Figure 4 shows some example generations from WebNLG. Without adding the decoding constraints, PG and our model both suffer from the problem of information repetition and missing. However, the interpretability of our model enables us to easily avoid these issues by constraining the segment transition behavior. For the attention-based PG model, there exists no simple way of applying these constraints. We can also explicitly control the output structure similar to Wiseman et al. (2018), examples are shown in appendix B.

Conclusion

In this work, we exploit the segmental structure in data-to-text generation. The proposed model significantly alleviates the information hallucination, repetition and missing problems without sacrificing the fluency and diversity. It is end-to-end trainable, domain-independent and allows explicit control over the structure of generated text. As our model is interpretable in the correspondence between segments and input records, it can be easily combined with hand-engineered heuristics or user-specific requirements to further improve the performance.

Acknowledgements

This research was funded in part by the DFG collaborative research center SFB 1102. Ernie Chang is supported by SFB 248 “Foundations of Perspicuous Software Systems” (E2); Xiaoyu Shen is supported by IMPRS-CS fellowship. We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.

References

Appendix A Error Analysis

Missing: Even with the coverage decoding constraint, the model can still occasionally miss information. We show one example in Table 5. The segments cover all input records, but the segment aligned to “familyfriendly” only generates a period symbol. This happens 3 times on E2E and twice on WebNLG. On the other 3 cases of missing on WebNLG, some segments only generate one end-of-sentence symbol. Both conditions can be easily fixed by some simple filtering rules.

Repeating: There are still some repeating cases on the WebNLG dataset. Table 6 shows one example. “amsterdam-centrum is part of amsterdam” is repeated twice within a segment. As our constraint decoding can only prevent inter-segment repetition, it cannot fully avoid the repetition problem resulting from the intra-segment errors of RNNs.

Appendix B Controlling output structure

As our model learns interpretable correspondence of each segment, it can control the output structures same as in Wiseman et al. (2018). Table 7 shows example generations by sampling diverse segment structures.