Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation

Ran Tian, Shashi Narayan, Thibault Sellam, Ankur P. Parikh

Introduction

The task of generating natural language text ${\bm{y}}$ from a source content ${\bm{x}}$ is the essence of many NLP applications, such as summarization (Mani, 1999), machine translation (Koehn, 2009), and data-to-text generation (Kukich, 1983; McKeown, 1992). While traditionally done with template-based approaches (Becker, 2002; Foster and White, 2004; Gatt and Reiter, 2009; Reiter et al., 2005), recent neural encoder-decoder models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014) have demonstrated remarkable ability to generate fluent text without cumbersome handcrafted rules and templates (Rush et al., 2015; Radford et al., 2019).

However, encoder-decoder models have been shown to be prone to hallucination, i.e., generating text that is fluent but unfaithful to the source (Vinyals and Le, 2015; Koehn and Knowles, 2017; Wiseman et al., 2017; Maynez et al., 2020). This severe shortcoming can often limit the use of neural approaches in many real world systems, where it is not acceptable to produce output that is even occasionally unfaithful.

In this work, we address the issue of hallucination in data-to-text generation, where the source content ${\bm{x}}$ is a structured table and the text ${\bm{y}}$ is a description of the table – a relatively easy setting to objectively evaluate the faithfulness of generation. We show an example from the WikiBio dataset (Lebret et al., 2016) in Figure 1; the task is to generate a sentence summarizing a tabular biography of a person. The output of a strong generation baseline, the Pointer-Generator (See et al., 2017), contains a phrase criminal defense attorney that is incorrect and cannot be supported by the infobox table (but loosely related to FBI in the table). Note that the reference also contains information such as bonanno crime family and informant that are true, but cannot be inferred from the infobox; this source-reference divergence exists in many large-scale generation datasets (Wiseman et al., 2017), and might encourage a generation model to output phrases that are unsupported by the source.

However, the issue is not limited to divergence between source and reference, since hallucination can appear even with cleaned references (Parikh et al., 2020). Rather, the underlying problem is that the model learns wrong correlations between different parts of the training data. As data and models grow larger and more complicated, learning wrong correlations is likely to remain an issue, because many inter-related factors are in play. Thus, we need a methodology to control neural networks and impose our prior knowledge of “correct correlations” on the models. This is an important yet under-addressed challenge in deep learning.

In this work, we impose a “confidence prior” on encoder-decoder models, by carefully reconsidering the two components in a decoder: attention to the source and language modeling. Our prior knowledge is that a model should attend to the source when generating a word, as long as the word conveys source information. Wrongly associating a content phrase (e.g. defense attorney) with the language model, simply because it seems more fluent (e.g. criminal defense attorney is fluent), might be a major cause of hallucination (§ 4.1).

Therefore, we design a confidence score to detect hallucination, by using an attention score to measure how much the model is attending to the source, and a language model to judge if a word conveys source information (§ 3). Then, we propose a variational Bayes training framework that ensures that a model generates with high confidence, while learning the confidence score parameters at the same time (§ 3.1). Experiments on the WikiBio dataset demonstrate that our approach is considerably more faithful to the source than existing state-of-the-art solutions, according to both PARENT score (Dhingra et al., 2019) and human evaluation (§ 4.1). We also report strong results on the WebNLG (Gardent et al., 2017) dataset (§ 4.2).

Preliminaries

where the context vector ${\bm{v}}_{t}$ is given by:

Here, $\alpha_{s,t}$ is an attention weight of the prefix ${\bm{y}}_{<t}$ attending to source position $s$, for which we use bilinear attention (Eq. 3; Luong et al., 2015); and ${\bm{h}}_{t}$ is given by an RNN (Eq. 4, where $[\cdot]$ denotes concatenation). While it is possible that our approach could extend to other types of decoders, our current formulation of the confidence score specifically uses an RNN with attention.

In case the encoder-decoder is equipped with a copy mechanism, the generation probability is mixed with a probability of copying from the source (Gu et al., 2016; See et al., 2017):

where $p^{\text{gen}}_{t}$ is the probability of doing generation instead of copying at step $t$ , and $\beta_{s,t}$ is an attention weight that the copy mechanism is paying to position $s$ in the source. The sum is taken over all positions $s$ where the word $x_{s}$ is the same as $y_{t}$ .
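To make the mixture concrete, the following Python sketch computes the probability of one token under the standard pointer-generator form, $p^{\text{gen}}_{t}$ times the generation probability plus $(1-p^{\text{gen}}_{t})$ times the summed copy attention over matching source positions. The function name, argument names and toy numbers are ours and purely illustrative.

```python
from typing import Dict, List


def mixed_token_probability(
    token: str,
    p_gen: float,
    gen_probs: Dict[str, float],
    source_tokens: List[str],
    copy_attention: List[float],
) -> float:
    """Mix the generation distribution with the copy distribution for one token."""
    # Copy mass: sum of copy-attention weights beta_{s,t} over all source
    # positions s whose word x_s equals the candidate token.
    copy_mass = sum(b for x, b in zip(source_tokens, copy_attention) if x == token)
    return p_gen * gen_probs.get(token, 0.0) + (1.0 - p_gen) * copy_mass


# Toy example (made-up numbers): "lino" appears twice in the source.
source = ["frank", "lino", "1938", "lino"]
beta = [0.1, 0.4, 0.2, 0.3]  # copy attention over source positions
vocab_probs = {"lino": 0.05, "is": 0.6, "born": 0.35}
p_lino = mixed_token_probability("lino", 0.2, vocab_probs, source, beta)
```

With a low $p^{\text{gen}}_{t}$, a word that occurs in the source receives most of its probability from the copy term, while a word absent from the source can only be produced by the generation term.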

Modeling Confident Decoding

In this section, we mathematically describe our approach. For each decoder position $t$ , we define the following confidence score $C_{t}(y_{t})$ to detect hallucination:

Here, $A_{t}\in[0,1]$ is the attention score (see below “Attention Score”) which indicates how much the model is generating based on the source; e.g., it should be close to $1$ for content words copied from the source. $P_{B}(y_{t}|{\bm{y}}_{<t})$ is the probability of a tailored language model (see below “Base Language Model”), that should be high for templatic words but low for words that convey source information.

Figure 2 shows an example. Templatic words (e.g. born, is) do not need support from the source, and they have high probability by the base language model and high confidence score, regardless of the attention score. On the other hand, tokens that convey source information (e.g. Michael, author) have low probability by the base language model, and their confidence scores depend on the attention scores. A low confidence score indicates that a word conveying source information is generated by the model without paying attention to the source; which, we conjecture, is a signal of hallucination. Indeed, in Figure 2, the tokens with lower confidence scores (e.g. author, radio host) are not supported by the table. This example is taken from the WikiBio validation set (cf. Appendix for more).
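The interplay described above can be illustrated with a small Python sketch. The interpolation used here, $C = A + (1-A)\cdot P_{B}$, is an assumption chosen to reproduce the stated behaviour (high base language model probability gives high confidence regardless of attention; otherwise confidence follows the attention score). The numbers are invented and are not taken from Figure 2.

```python
def confidence_score(attention_score: float, base_lm_prob: float) -> float:
    """ASSUMED form: C = A + (1 - A) * P_B, with A and P_B both in [0, 1]."""
    if not (0.0 <= attention_score <= 1.0 and 0.0 <= base_lm_prob <= 1.0):
        raise ValueError("both inputs must lie in [0, 1]")
    return attention_score + (1.0 - attention_score) * base_lm_prob


# Illustrative (made-up) values for three kinds of tokens.
templatic = confidence_score(attention_score=0.3, base_lm_prob=0.9)      # e.g. "is"
supported = confidence_score(attention_score=0.8, base_lm_prob=0.05)     # e.g. a copied name
hallucinated = confidence_score(attention_score=0.3, base_lm_prob=0.05)  # unsupported content word
```

Under this assumed form, only the third case, a content word generated with little attention to the source, ends up with a low score.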

The attention score should measure how much a token is generated based on the source. We modify the conventional attention mechanism (Eqs. 3 and 4) in two ways to make such measurement easier. First, we make the attention weights sum to less than $1$, so that the model can choose “not to attend”; this is achieved by adding a constant $1$ to the denominator of Eq. 7. Second, instead of concatenating the previous attention vector to the input of RNN, we only use ${\bm{a}}_{t-1}$ in calculation of the current attention weights, so that the hidden states of the RNN no longer contain any source information (Eqs. 7 and 8):
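The first modification can be sketched in a few lines of Python: a softmax-like normalization whose denominator carries an extra constant $1$, so that the weights sum to strictly less than $1$. The function name is ours, and the raw scores are assumed to be already computed by whatever scoring function the attention uses.

```python
import math
from typing import List


def attention_with_null_option(scores: List[float]) -> List[float]:
    """Return exp(s_i) / (1 + sum_j exp(s_j)) for each raw attention score s_i."""
    # Shift by max(scores, 0) for numerical stability; the constant 1 in the
    # denominator equals exp(0), so it becomes exp(-m) after the shift.
    m = max(list(scores) + [0.0])
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


weights = attention_with_null_option([2.0, 1.0, -1.0])
total_attention = sum(weights)  # strictly between 0 and 1
```

When all raw scores are very negative the weights approach zero, which is how the model can effectively choose “not to attend”.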

Then, because the next token is generated by the context vector ${\bm{v}}_{t}={\bm{a}}_{t}+{\bm{h}}_{t}$ in Eq. 2, and all the source information in ${\bm{v}}_{t}$ comes from ${\bm{a}}_{t}$, we define the attention score $A_{t}$ as below to measure how much ${\bm{a}}_{t}$ affects ${\bm{v}}_{t}$ (where $\lVert\cdot\rVert$ denotes Euclidean norm):

We have $A_{t}\in[0,1]$ by the triangle inequality. In an extreme case, when ${\bm{h}}_{t}$ is completely cancelled out by ${\bm{a}}_{t}$ in the sum ${\bm{v}}_{t}$, the attention score equals $1$. When the model has a copy mechanism, we can refine $A_{t}$ with the copying probability:

In practice, we have confirmed that our modification of the attention mechanism (Eqs. 7 and 8) does not impact the quality of data-to-text generation (§ 4.2); the observed attention score has a reasonable range of $0.6\sim 0.9$ for tokens supported by the source, and $0.2\sim 0.5$ for templatic words or hallucinated tokens (Figure 2).

Base Language Model

In order to be effective, $\text{RNN}_{B}$ should be able to learn “soft templates” rather than simply fluent text. Unfortunately using an ordinary unconditioned language model for $P_{B}(y_{t}|{\bm{y}}_{<t})$ can be problematic, since the model can learn source-specific knowledge through ${\bm{y}}_{<t}$ . For instance if there is only one person named Walter in the training data, and he is a pilot, then the language model might learn Walter is a pilot as a fixed generation pattern. We tailor $\text{RNN}_{B}$ to reduce such artifacts by down-weighting input embeddings that are associated with high attention scores (and thus are source-specific):

3.1 Training with Confident Sub-sequence Sampling

How do we train a model to generate confidently, using the confidence score we just proposed? Note that the confidence score itself has trainable parameters (i.e., attention score and the parameters of $\text{RNN}_{B}$ ). Our idea is to assume a latent “confident sub-sequence” of the target for each training example, and learn the latent sub-sequence by sampling according to the confidence score. As training progresses, the confidence score improves and the sampled sub-sequences contain only the parts of the target that are faithful to the source. An example is given in Figure 3 where the model learns to assign low scores to the tokens stage, film so the sampled subsequence only contains information faithful to the source.

Formally, for each target ${\bm{y}}=y_{1}y_{2}\dots y_{T}$ , we define ${\bm{z}}=z_{1}z_{2}\dots z_{R}=y_{\iota(1)}y_{\iota(2)}\dots y_{\iota(R)}$ as a latent sub-sequence of ${\bm{y}}$ , which consists of confident tokens of length $R$ . Here, $\iota:|R|\rightarrow|T|$ is an inclusion of indices. We regard ${\bm{z}}$ as a sequential “keep or skip” labeling over ${\bm{y}}$ , and sample ${\bm{z}}$ from the probability distribution $Q({\bm{z}}|{\bm{y}},{\bm{x}})=\prod_{t=1}^{T}Q_{t}$ :

Here, $\rho$ and $\gamma$ are trainable parameters initialized to $0$ and $1$, respectively. As $\rho$ gets larger, it more strictly enforces our prior knowledge of faithful generation: Every (kept) token should have a high confidence score, which means either the encoder-decoder is paying attention to the source, or the token is a template element that does not convey source information. Empirically, the trained $\rho$ in our model indeed converges to a positive value (e.g. about $3.4$ on the WikiBio dataset).
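The “keep or skip” sampling itself is simple once the per-token keep probabilities are available. The Python sketch below takes those probabilities as an input rather than computing them from $\rho$, $\gamma$ and the confidence score, so it makes no assumption about the exact form of $Q_{t}$; the function name and the 0-based index convention are ours.

```python
import random
from typing import List, Sequence, Tuple


def sample_confident_subsequence(
    target: Sequence[str],
    keep_probs: Sequence[float],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Sample one keep-or-skip label per target token.

    Returns the kept tokens (the sub-sequence z) and their 0-based positions
    in the target, which play the role of the index inclusion iota.
    """
    if len(target) != len(keep_probs):
        raise ValueError("target and keep_probs must have the same length")
    # rng.random() lies in [0, 1), so probability 1.0 always keeps and 0.0 never keeps.
    kept_indices = [t for t, q in enumerate(keep_probs) if rng.random() < q]
    return [target[t] for t in kept_indices], kept_indices


rng = random.Random(0)
y = ["frank", "lino", "is", "an", "american", "actor"]
z, iota = sample_confident_subsequence(y, [1.0, 1.0, 1.0, 1.0, 0.9, 0.0], rng)
```

In this toy call the last token has keep probability zero, so it never appears in the sampled sub-sequence.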

For each training example $({\bm{x}},{\bm{y}})$ , our objective is to minimize the following generation cost:

Above we have applied the Bayes rule, so we model $P({\bm{y}}|{\bm{z}},{\bm{x}})$ and $P({\bm{z}}|{\bm{x}})$ instead of $P({\bm{y}}|{\bm{x}})$. We set $P({\bm{z}}|{\bm{x}})$ to be the encoder-decoder model as described before. Since $P({\bm{y}}|{\bm{z}},{\bm{x}})$ is not used at test time, we simply assume it remembers all the training examples and, when given an ${\bm{x}}$ that appears in training data, gives a probability $1$ to the gold reference ${\bm{y}}$ and $0$ to all others. Hence, we can set $P({\bm{y}}|{\bm{z}},{\bm{x}})=1$ here, and our modeling efforts focus on $P({\bm{z}}|{\bm{x}})$.

Unfortunately, the posterior $P({\bm{z}}|{\bm{y}},{\bm{x}})$ in the above objective cannot be arbitrarily modeled. (For one thing, the right hand side of Eq. 14 should give the same value for any ${\bm{z}}$, because the left hand side does not depend on ${\bm{z}}$; this is a non-trivial constraint for $P({\bm{z}}|{\bm{y}},{\bm{x}})$.) We thus employ a variational Bayes scheme (Koller and Friedman, 2009) and use our sampling probability $Q=Q({\bm{z}}|{\bm{y}},{\bm{x}})$ to approximate $P({\bm{z}}|{\bm{y}},{\bm{x}})$. By adding $\log Q$, we get

The variational Bayes objective is to minimize the upper bound on the right hand side of Eq. 17.
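For readers who want the intermediate steps, the upper bound can be obtained by the standard variational argument. The sketch below uses the notation of this section; the expectation notation and the explicit KL term are ours, and the presentation may differ from the numbered equations.

```latex
\begin{align*}
-\log P(\bm{y}|\bm{x})
 &= \mathbb{E}_{\bm{z}\sim Q}\bigl[-\log P(\bm{y}|\bm{z},\bm{x}) - \log P(\bm{z}|\bm{x})
      + \log P(\bm{z}|\bm{y},\bm{x})\bigr] \\
 &= \mathbb{E}_{\bm{z}\sim Q}\bigl[-\log P(\bm{y}|\bm{z},\bm{x}) - \log P(\bm{z}|\bm{x})
      + \log Q\bigr]
    - \mathrm{KL}\bigl(Q \,\|\, P(\bm{z}|\bm{y},\bm{x})\bigr) \\
 &\le \mathbb{E}_{\bm{z}\sim Q}\bigl[-\log P(\bm{y}|\bm{z},\bm{x}) - \log P(\bm{z}|\bm{x})
      + \log Q\bigr].
\end{align*}
```

The first line is the Bayes rule, which holds for every ${\bm{z}}$ and hence in expectation under $Q$; the second adds and subtracts $\log Q$; the third uses the non-negativity of the KL divergence. With $P({\bm{y}}|{\bm{z}},{\bm{x}})=1$ the first term vanishes, leaving $\mathbb{E}_{{\bm{z}}\sim Q}[-\log P({\bm{z}}|{\bm{x}})+\log Q]$ as the quantity to minimize.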

Importantly, the base language model $\text{RNN}_{B}$ is trained in two ways: through $Q$ in the variational Bayes objective and also by minimizing an additional $-\log P_{B}({\bm{z}})$ term. Jointly training $\text{RNN}_{B}$ on the confident sub-sequence ${\bm{z}}$ implicitly biases it toward more confident generation patterns.

3.2 Calibration and <null> Token

Finally, we discuss two additional techniques to utilize the confidence score at inference time.

With a model trained to generate confidently, one might still want to explicitly re-rank the generation probability at inference time toward more confident tokens. The calibration technique (Braverman et al., 2019) provides a way to learn such explicit re-ranking. It parameterizes a family of probability distributions that augment $P(y_{t}|{\bm{y}}_{<t},{\bm{x}})$ with some quantity that one cares about, which in our case is the confidence score $C_{t}(y_{t})$:
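As a rough illustration of such re-ranking, the Python sketch below assumes a one-parameter family in which each candidate's probability is multiplied by its confidence score raised to a power $\kappa$ and then renormalized. This specific family, the name `kappa`, and the toy numbers are assumptions made for the example; only the idea of augmenting the model probability with the confidence score comes from the text.

```python
from typing import Dict


def rerank_with_confidence(
    probs: Dict[str, float],
    confidences: Dict[str, float],
    kappa: float,
) -> Dict[str, float]:
    """ASSUMED family: P_kappa(y) proportional to P(y) * C(y) ** kappa."""
    weights = {y: p * confidences[y] ** kappa for y, p in probs.items()}
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("re-ranked weights sum to zero")
    return {y: w / total for y, w in weights.items()}


# Two equally likely candidates with different (made-up) confidence scores.
probs = {"attorney": 0.5, "american": 0.5}
conf = {"attorney": 0.2, "american": 0.8}
unchanged = rerank_with_confidence(probs, conf, kappa=0.0)
reranked = rerank_with_confidence(probs, conf, kappa=1.0)
```

Setting `kappa` to zero recovers the original distribution, and larger values shift probability mass toward tokens with higher confidence.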

<null> token

If a token is generated with a confidence score lower than a certain threshold, we replace it with a special <null> token; the <null> token is fed to the next step, and consecutive <null> tokens are shut out from the beam search. After beam search, all <null>s are deleted from the output sequence. We slightly modify the sub-sequence sampling during training to be compatible with this strategy: Once a target token is labeled skip, it is replaced by a <null> instead of being skipped; only consecutive tokens labeled skip are actually skipped (i.e. not being counted by the sampled sub-sequence ${\bm{z}}$). Intuitively, the <null> token mimics a “pause and rethink” strategy, making the generation process more robust against unconfident tokens. Empirically, we found that the <null> token combined with length penalty (Wu et al., 2016) can drastically increase recall while maintaining precision in data-to-text generation (§ 4.1).
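The two bookkeeping steps described above can be sketched in Python as follows: building the training target from keep-or-skip labels, and deleting the special tokens after decoding. The function names are ours, and the sketch leaves out the beam search itself.

```python
from typing import List, Sequence

NULL = "<null>"


def target_with_null(tokens: Sequence[str], keep: Sequence[bool]) -> List[str]:
    """Training side: the first token of each skipped run becomes <null>;
    further tokens in the same run are dropped."""
    out: List[str] = []
    for tok, k in zip(tokens, keep):
        if k:
            out.append(tok)
        elif not out or out[-1] != NULL:
            out.append(NULL)
    return out


def strip_null(tokens: Sequence[str]) -> List[str]:
    """Inference side: delete every <null> from the decoded sequence."""
    return [t for t in tokens if t != NULL]


reference = "frank lino is an american criminal defense attorney".split()
labels = [True, True, True, True, True, False, False, False]
train_target = target_with_null(reference, labels)
final_output = strip_null(train_target)
```

Here the three skipped tokens collapse into a single <null> in the training target, and the post-processed output ends at the last kept token.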

Experiments

We evaluate on the WikiBio (Lebret et al., 2016) and WebNLG (Gardent et al., 2017) datasets. These datasets exhibit different levels of source-reference divergence and thus test our model in different regimes. Specifically, WikiBio is heuristically collected and 62% of examples exhibit divergence (Dhingra et al., 2019), whereas WebNLG has human generated responses with less divergence.

WikiBio contains 728,321 infoboxes paired with biographies, taken from the Sep.-2015 dump of English Wikipedia, and split into train/valid/test sets in an 8:1:1 ratio. The biography text is the first sentence of the Wikipedia page ($26.1$ words on average). Infoboxes have $12.1$ non-empty fields on average. The WebNLG release v2.1 with constrained split (Shimorina and Gardent, 2018) contains 16,095 data inputs in the format of RDF triples, and 42,873 data-text pairs (i.e. multiple references for each data input), split in an 8:1:1 ratio. The constrained split ensures that no RDF triple in the test set is in the train or dev set.

As a typical setting, we treat the data-to-text tasks as seq-to-seq prediction; infoboxes and RDF triples are linearized, with “key/value”s or “subject/relation/object”s separated by special tokens.
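A minimal Python sketch of such linearization is given below. The separator strings (`<key>`, `<value>`, `<subject>`, `<relation>`, `<object>`) and the whitespace tokenization are hypothetical choices made for illustration; the text only states that special tokens separate the parts.

```python
from typing import List, Sequence, Tuple

# Hypothetical separator tokens; the actual special tokens are not specified.
KEY, VALUE = "<key>", "<value>"
SUBJ, REL, OBJ = "<subject>", "<relation>", "<object>"


def linearize_infobox(fields: Sequence[Tuple[str, str]]) -> List[str]:
    """Flatten (field name, field value) pairs into one token sequence."""
    tokens: List[str] = []
    for key, value in fields:
        tokens += [KEY, key, VALUE] + value.split()
    return tokens


def linearize_triples(triples: Sequence[Tuple[str, str, str]]) -> List[str]:
    """Flatten (subject, relation, object) RDF triples into one token sequence."""
    tokens: List[str] = []
    for subj, rel, obj in triples:
        tokens += [SUBJ] + subj.split() + [REL] + rel.split() + [OBJ] + obj.split()
    return tokens


infobox_tokens = linearize_infobox(
    [("name", "Frank Lino"), ("birth_date", "October 30, 1938")]
)
```

The resulting flat sequence can then be fed to any seq-to-seq encoder in the same way as ordinary text.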

We compare our method (the bottom two) against several strong baselines (the top four):

BERT-to-BERT (Rothe et al., 2019): A Transformer encoder-decoder model (Vaswani et al., 2017) where the encoder and decoder are both initialized with BERT (Devlin et al., 2019).

Structure-aware Seq2Seq (Liu et al., 2018): A state-of-the-art method on WikiBio in terms of BLEU, which explicitly handles field names and table contents in an LSTM-based model.

Pointer-Generator (See et al., 2017): Seq2Seq model with attention and copy mechanism.

BERT-to-LSTM: A Transformer encoder (initialized with BERT) to LSTM decoder model.

Conf-PtGen (Ours): A Pointer-Generator model with our proposed confident decoding.

Conf-T2LSTM (Ours): A Transformer encoder to LSTM decoder model, with confident decoding.

Here, the Pointer-Generator and BERT-to-LSTM are by our implementation, and serve as the base to our Conf-PtGen and Conf-T2LSTM models, respectively. More detailed experiment settings are found in the Appendix.

For automatic evaluation, we report BLEU (Papineni et al., 2002), as well as PARENT (Dhingra et al., 2019), a metric that takes into account the data information, by aligning n-grams from the reference and prediction to the semi-structured input data, before computing their precision and recall. It is designed to mitigate the shortcomings of BLEU on data-to-text generation.

For human evaluation, we obtain annotations on 500 examples randomly chosen from predictions on the WikiBio test set, the same 500 for each model. Examples from different models are mixed and randomly shuffled, with model names hidden from the annotators. We instruct the annotators to grade on each of 3 criteria: faithfulness (precision), coverage (recall), and fluency. Faithfulness assesses if all the information in the proposed sentence is supported by the table or the reference. A single hallucinated piece of information makes the sentence non-faithful. Coverage measures the number of table cells that contain information present in the sentence. Finally, fluency assesses if the sentence is clear, natural, and grammatically correct; raters choose among three options: Fluent (clear, natural and grammatically correct; reads like a sentence found in a book), Mostly Fluent (with a few errors, but mostly understandable), and Not Fluent (with many errors and hardly understandable).

An ideal system would always produce fluent and faithful text with high coverage. The output of our models and baselines, as well as the human evaluation data, are publicly released at https://drive.google.com/open?id=1Kg4hJkaK9gWCv7mxwBfHEQwAgF_TrwcE. We will open-source our code as well.

Results

Table 1 shows the results. Despite achieving high BLEU scores, BERT-to-BERT and Structure-Aware Seq2Seq are less faithful according to human evaluation. Pointer-Generator is the most faithful among baselines, probably because its copy mechanism promotes verbatim copy from the source. By applying our confident decoding method to the Pointer-Generator and BERT-to-LSTM respectively, we achieve clear improvement in faithfulness over the respective baselines.

Among the automatic metrics, PARENT precision and recall seem correlated to faithfulness and coverage respectively, and our approach achieves the highest precision and F1. BLEU, perhaps because of its length penalty that rewards longer generations, seems more correlated to coverage rather than faithfulness. Generally, it is easier for longer predictions to achieve higher coverage/recall, but harder to achieve faithfulness/precision.

In order to control recall while maintaining precision, we combine two techniques at inference time: the length penalty (Wu et al., 2016), which encourages longer generation, and the <null> token threshold (§ 3.2), which shuts out unconfident tokens. In Table 1, Conf-PtGen does not use length penalty and is trained without <null> tokens (denoted “w/o lp” and “w/o null”, respectively), so it tends to stop generation early when it is unconfident, which leads to shorter predictions and less coverage. In contrast, when Conf-T2LSTM incorporates the <null> token with a moderate threshold (i.e. null 0.5), it improves both precision and recall from the BERT-to-LSTM baseline, without sacrificing fluency. One can boost the precision and recall even more, by using length penalty to promote recall (e.g. lp 2.0) and an aggressive threshold (e.g. null 0.8) to keep precision. This seems to cost some fluency, but most generations are still fluent (cf. Appendix for generation examples).

Ablation Test

In this experiment, we assess the effects of three novel components in our confident decoding method: (1) The design of a confidence score; (2) The variational Bayes objective with confident sub-sequence sampling; and (3) The calibration technique to re-rank output probabilities. We start from the Conf-PtGen, and in each test replace one component by a trivial alternative: (1) We compare with using the probability $P(y_{t}|{\bm{y}}_{<t},{\bm{x}})$ directly as confidence, and train models using the same hyper-parameters as Conf-PtGen. The results on the WikiBio test set are shown in Table 2, as “– Confidence”. (2) We compare with models trained by maximizing the ordinary log-likelihood, without sub-sequence sampling; the calibration technique is still applied (“– Variational”). (3) We disable the calibration technique (“– Calibration”).

As we can see from Table 2, all three components improve PARENT precision. While the improvement by calibration is the smallest, the technique also improves PARENT recall and BLEU score at the same time, making it an easy choice. The other techniques trade recall for precision, making them useful for tasks that require a high degree of faithfulness. When all three components are disabled, the model is exactly the same as our implementation of the Pointer-Generator. Every component improves PARENT precision upon it as well. In particular, comparing Pointer-Generator with “– Variational” shows again that calibration improves all metrics.

Sensitivity to Source

We have conjectured that making an encoder-decoder attend to the source whenever it is necessary can reduce hallucination; and we have clearly improved faithfulness by using a confident decoder that implements this conjecture. In this experiment, we show that Conf-T2LSTM is indeed more sensitive to the source than BERT-to-LSTM. The idea is to set all the source encoded vectors ${\bm{s}}_{1},...,{\bm{s}}_{S}=\textbf{enc}(x_{1},...,x_{S})$ to $\mathbf{0}$ at some random steps during decoding, and see how many predictions change. In Figure 4, we show the results on the WikiBio validation set. As we increase the probability of source vectors being set to $\mathbf{0}$, the predictions by Conf-T2LSTM change more than those by BERT-to-LSTM. At each noise level, we decode $5$ times and plot the mean difference, as well as the standard deviation as error bar (almost indistinguishable from the lines in the chart). The exact predictions by both models are noise-sensitive: At a level of $0.1$, over $65\%$ of predictions have already changed. However, most changes are subtle to human eyes; it is hard to glimpse any drop in generation quality.
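The perturbation and the reported statistic can be sketched in Python as below. The helper names are ours, source vectors are represented as plain lists of floats, and the decoder that would call `maybe_zero_source` once per decoding step is left out.

```python
import random
from typing import List, Sequence


def maybe_zero_source(
    source_vectors: Sequence[Sequence[float]],
    noise_level: float,
    rng: random.Random,
) -> List[List[float]]:
    """For one decoding step: with probability `noise_level`, replace all
    source encoded vectors by zero vectors; otherwise return them unchanged."""
    if rng.random() < noise_level:
        return [[0.0] * len(v) for v in source_vectors]
    return [list(v) for v in source_vectors]


def fraction_changed(clean: Sequence[str], noisy: Sequence[str]) -> float:
    """Share of predictions that differ between a clean and a noisy decoding run."""
    if len(clean) != len(noisy):
        raise ValueError("both runs must cover the same examples")
    if not clean:
        return 0.0
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


enc = [[0.5, -1.0], [2.0, 0.25]]
always_zeroed = maybe_zero_source(enc, noise_level=1.0, rng=random.Random(0))
never_zeroed = maybe_zero_source(enc, noise_level=0.0, rng=random.Random(0))
changed = fraction_changed(["a b", "c d", "e f", "g h"], ["a b", "c x", "e f", "g y"])
```

Repeating the noisy decoding several times per noise level and averaging `fraction_changed` gives the kind of curve plotted in Figure 4.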

4.2 Results on WebNLG

The WebNLG dataset has more controlled data format and generation patterns than WikiBio, making it a suitable benchmark for data-to-text models. Although the issue of hallucination is not severe on this dataset, we use it to compare modifications we made on the encoder-decoder architecture vs. the conventional designs. In particular, we compare: (i) T2LSTM-att, a 12-layer Transformer encoder to LSTM decoder architecture, with conventional attention mechanism; (ii) T2LSTM, with our modified attention as defined in Eqs. 7 and 8; (iii) Conf-T2LSTM, with our confident decoding. All three models use the sentence-piece tokenizer (Kudo and Richardson, 2018) with a vocabulary size of $4,000$, and length penalty $1.0$ at inference. For Conf-T2LSTM, the <null> threshold is set to $0.5$. We evaluate using BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), and $\text{ROUGE}_{L}$ (Lin, 2004) computed using the evaluation scripts for the E2E Challenge (Dušek et al., 2018).

According to results shown in Table 3, our modeling enhancements do not degrade performance in the presence of clean references. Compared to an OpenNMT (Klein et al., 2017) baseline with the best delexicalisation and copying setting reported by Shimorina and Gardent (2018), our models demonstrate strong performance. We do not include other baselines such as Ferreira et al. (2019) and Kale (2020) because they report numbers on an older version of the WebNLG corpus.

Discussion and Related Work

Many ideas have been explored in text generation to achieve more accurate predictions, such as learning neural templates (Wiseman et al., 2018), separating content selection from generation (Zhou et al., 2017; Gehrmann et al., 2018; Puduppully et al., 2019a), MaskGAN (Fedus et al., 2018), entity modeling (Puduppully et al., 2019b), data augmentation (Ma et al., 2019; Kedzie and McKeown, 2019), etc. Among them, improving the faithfulness is an emerging research topic that has been tackled by a variety of works recently. Concurrent to our work, Matsumaru et al. (2020) empirically found that removing unfaithful instances from the training data can reduce hallucination in headline generation, while Kang and Hashimoto (2020) proposed a loss truncation training framework that can remove such noise in a principled manner. However, removing entire examples from the loss is not practical for noisy datasets such as WikiBio, where 62% of examples exhibit divergence. Wang et al. (2020) and Shen et al. (2020) tackle this problem by adding additional terms to the loss that enforce alignment between source and target, while the “goodness” of such alignment relies on heuristics specific to the task and data. In contrast, the prior knowledge we exploit in this work is more general, as it does not depend on any source-specific data structure. Our method has the potential to be adapted to other generation tasks such as document summarization and machine translation, or combined with other approaches. Li and Rush (2020) also propose to control neural text generation by posterior regularization, but they still rely on heuristics such as surface matching between source and target. Harkous et al. (2020) address the problem by decoder re-ranking, while the ranker is trained on heuristically extracted faithful data-text pairs.
Complementary to all these works, our approach develops a deeper understanding of the encoder-decoder architecture itself, by carefully reconsidering its attention and language modeling components.

References

Appendix A Generation Examples and Analysis of Hallucination

In the following, we show some typical cases where the Pointer-Generator baseline hallucinates but the Conf-PtGen does not. In most of the cases, the Conf-T2LSTM models do not hallucinate either. Hallucinated parts are colored red. The examples are taken from the WikiBio validation set.

In the first six examples (i.e. “Frank Lino”, “Rohan Robertson”, “Walter Smallwood”, “Nellie Wong”, “Hal Bedsole” and “Constant Vanden Stock”), some information is missing in the table while the Pointer-Generator baseline made it up. Our confident decoder models learn to omit the missing fields, although by doing this, some of the generated sentences lose fluency.

In the next example (i.e. “Richard Lloyd”), the baseline seems to have learned weird language modeling from some similar training points, and tries to generate more than the table contents; our confident decoder models generate correctly.

In the last two examples (i.e. “Robert I. Marshall” and “Thomas Edwards”), there are corresponding fields in the table but the Pointer-Generator baseline did not learn to generate correctly, possibly because these fields should not be simply copied. Our confident models generate more faithfully to the source.

Frank “Curly” Lino (born October 30, 1938 Brooklyn) is a Sicilian-American Caporegime in the Bonanno crime family who later became an informant.

Frank Lino (born October 30, 1938 in Brooklyn, New York, United States) is an American criminal defense attorney.

Frank Lino (born October 30, 1938 in Brooklyn, New York, United States) is an American.

Frank Lino (born October 30, 1938) is an American actor.

Frank Lino (born October 30, 1938 Gravesend , Brooklyn , New York) is a former American

Frank Lino (born October 30, 1938 Gravesend, Brooklyn, New York) is a former American

Rohan Robertson (born 21 August 1961) is a former Australian rules footballer who played for North Melbourne in the Victorian Football League (VFL) between 1985 and 1988.

Rohan Robertson (born 21 August 1961) is a former Australian rules footballer who played with Carlton in the Victorian Football League.

Rohan Robertson (born 21 August 1961) is a former Australian rules footballer who played in the Victorian Football League.

Rohan Robertson (born 21 August 1961) is a former Australian rules footballer who played with Carlton in the Victorian Football League (VFL).

Walter Clayton Smallwood (April 24, 1893 – April 29, 1967) was a professional baseball pitcher from 1913 to 1931.

Walter Herbert Smallwood (April 24, 1893 – April 29, 1967) was a pitcher in major League Baseball.

Walter Smallwood (April 24, 1893 – April 29, 1967) was a pitcher in major League Baseball.

Walter Henry Smallwood (April 24, 1893 – April 29, 1967) was a major league baseball pitcher.

Walter Smallwood (April 24, 1893 – April 29, 1967) was a major league baseball pitcher.

The residence address is not stated in the table.

Nellie Wong (born 12 September 1934) is a poet and activist for feminist and socialist causes.

Nellie Wong (born September 12, 1934 in Oakland, California) is an American poet, activist, feminist, and feminist activist who lives and works in Los Angeles, California, United States, where she

Nellie Wong (born September 12, 1934) is an American poet and activist.

Nellie Wong (born September 12, 1934) is an American poet

Harold Jay “Hal” Bedsole (born December 21, 1941) is a retired American football player.

Gene (born December 21, 1941) is a former American football tight end in the National Football League.

(Born December 21, 1941) is a former American Football tight end in the National Football League for the Minnesota Vikings.

Hal Bedsole (born December 21, 1941) is a former American football tight end in the national football league.

Hal Bedsole (born December 21 , 1941) is a former american football tight end in the national football league.

Hal Bedsole (born December 21, 1941 Chicago, Illinois) is a former American football tight end in the national football league for the Minnesota Vikings and Minnesota Vikings.

The nationality and occupation are missing in the table.

Constant Vanden Stock (; 13 june 1914 – 19 April 2008) was the honorary president and former president and player of Belgian football club R.S.C. Anderlecht.

Constant Vanden Stock (June 13, 1914 – April 19, 2008) was an American Figure Skater.

Constant Vanden Stock (June 13, 1914 – April 19, 2008) was a

Constant Vanden Stock (June 13, 1914 – April 19, 2008) was an American politician.

Constant Vanden Stock (June 13, 1914 – April 19, 2008) was an American

Available fields in the table are name, nationality, years and teams.

Richard Lloyd (18 February 1945 – 30 March 2008) was a British racing car driver and founder of multiple sports car and touring car teams.

Richard Lloyd Lloyd is a British Racing driver who won the GTI Engineering Championship in 1982, driving with GTI Engineering, Richard Lloyd Racing of GTI Engineering, and Lloyd Lloyd Racing at the age of 14.

Richard Lloyd was a British racing driver.

Robert I. Marshall (born October 16, 1946 in Wilmington, Delaware) is an American politician and a democratic member of the Delaware Senate since January 9, 1979 representing district 3.

Robert I. Marshall (born October 16, 1946 in Wilmington, Delaware) is an American politician and a Democratic member of the Delaware Senate since January 9, 1979 representing district 41.

Robert I. Marshall (born October 16, 1946 in Wilmington, Delaware) is an American politician and a Democratic member of the Delaware Senate since January 9, 1979.

Robert I. Marshall (born October 16 , 1946 in Wilmington, Delaware) is an American politician and a Democratic member of the Delaware Senate since January 9, 1979 representing district 3 .

Robert I. Marshall (born October 16, 1946 in Wilmington, Delaware) is an American politician and a Democratic member of the Delaware Senate since January 9, 1979 representing district 3.

The table has a field “known for”: English and Welsh dictionary.

Thomas Edwards (Caerfallwch), (1779 – 1858), was a Welsh author.

Thomas Edwards (1779 – 4 June 1858) was an English author, UNK, and UNK, who spent most of his life in the English and English literature of the English and English dictionary literature.

Thomas Edwards (1779 – 4 June 1858) was a Welsh author.

Thomas Edwards (1779 – 4 June 1858) was a Welsh and Welsh dictionary

Thomas Edwards (1779 – 4 June 1858) was a Welsh and Welsh dictionary.

Appendix B Interplay between Attention Score, Base Language Model, and Confidence Score

Figure 2 in our main paper showed an example of the learned attention score, base language model probability and confidence score of our Conf-T2LSTM model. The example is taken from the WikiBio validation set, using the reference sentence.

In order to further illustrate the mechanism of our approach, in Figure 5 we show more examples of the scores, learned by our Conf-PtGen model. Compared to Conf-T2LSTM, Conf-PtGen has a copy mechanism, and the attention scores seem more sensitive to missing fields in the table. In Figure 5, Occupation is missing in the Frank Lino table, and the Cornelia Molnar table only has a name. Our model successfully detected the tokens not supported by the table.

Appendix C Human Evaluation Instructions

The detailed instructions for our human evaluation are shown below. We also discussed many additional examples with the lead annotators, and made sure that: (a) valid inferences (e.g., inferring nationality from birth place) are considered faithful; (b) if a piece of information exists in or can be inferred from the table, the corresponding cell should be highlighted, even if the information is also in the background knowledge; (c) only one cell should be highlighted for each piece of information.

A writer has the following background knowledge and is given a table:

The writer read the table and produced the following sentence: We wish to evaluate the quality of the sentence.

Fluent: It is clear, natural, and the grammar is correct. It reads like if it was found in a book.

Mostly Fluent: It has a few errors or it does not sound natural, but you can understand it.

Not Fluent: It has many errors and/or you can hardly understand it.

2. Please compare carefully the content of the sentence to the content of the table. How many cells from the table did the writer use to produce the sentence? (Click on the cells in the table above to update the counter)

3. A sentence is faithful if it contains only information supported by the table or the writer’s background knowledge. It should not add any additional information, even if the information is true or interesting. Please compare once again the content of the sentence to the content of the table and background knowledge. How faithful is the sentence?

Faithful: every part of the sentence is supported by the table and/or background knowledge.

Mostly Faithful: every part of the sentence can be linked to some evidence in the table or the background knowledge, but it is not fully supported. This should only be used for rare edge cases.

Not Faithful: The sentence contains information that is not supported by the table or background knowledge.

The examples are based on the following background knowledge and table:

alfred angas scott ( 1875 - 1923 ) was a british motorcycle designer .

Appendix D Detailed Experiment Settings

In our experiments, we use Transformers with 12 layers, a hidden size of 768, a filter size of 3072, and 12 attention heads. The LSTM has a hidden size of 256 and a memory size of 1024. Both the BERT-to-BERT and BERT-to-LSTM models use the bert-base-multilingual-uncased checkpoint, with a vocabulary size of 105k. For BERT-to-BERT, we share parameters between the encoder and decoder, as this performs slightly better. For Pointer-Generator, we use GloVe (Pennington et al., 2014) as the input word embedding and truncate the vocabulary size to 5,000. We use TensorFlow (Abadi et al., 2016) to build our systems.
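For quick reference, the architecture settings above can be collected in a small configuration sketch. This is only an illustration: the class and field names (`TransformerSettings`, `LSTMSettings`, `PointerGeneratorSettings`, `head_size`) are hypothetical and do not come from our code base, and the even split of the hidden size across attention heads is an assumption about the standard Transformer layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerSettings:
    """Transformer sizes quoted in this appendix (field names are hypothetical)."""
    num_layers: int = 12
    hidden_size: int = 768
    filter_size: int = 3072  # width of the feed-forward (filter) layer
    num_heads: int = 12

    @property
    def head_size(self) -> int:
        # Assumption: the hidden size is split evenly across attention heads.
        return self.hidden_size // self.num_heads


@dataclass(frozen=True)
class LSTMSettings:
    """LSTM sizes quoted in this appendix."""
    hidden_size: int = 256
    memory_size: int = 1024


@dataclass(frozen=True)
class PointerGeneratorSettings:
    """Pointer-Generator input settings quoted in this appendix."""
    embedding: str = "glove"
    vocab_size: int = 5000  # truncated vocabulary


# Checkpoint used by both BERT-to-BERT and BERT-to-LSTM.
BERT_CHECKPOINT = "bert-base-multilingual-uncased"

# Encoder/decoder parameter sharing is stated only for BERT-to-BERT.
SHARE_ENCODER_DECODER = {"bert_to_bert": True}
```

Under the even-split assumption, `TransformerSettings().head_size` evaluates to 64.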

For training, we use the Adam optimizer (Kingma and Ba, 2015), and the learning rate is set to $0.00005$ for BERT-to-LSTM and $0.0005$ for Pointer-Generator. We use early stopping based on validation loss to determine the number of training epochs. The dropout rates in the LSTM model are especially important for appropriate training; we use a dropout rate of $0.5$ for the input layer of the LSTM and an RNN-dropout of $0.2$ for the memory; the dropout rate applied to the attention layer is $0.2$ for WikiBio and $0.1$ for WebNLG. The number $K$ of samples we use for the Monte Carlo method of our variational Bayes loss is set to $8$. For decoding, we use a beam size of $8$.
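The training and decoding settings above depend on both the model and the dataset, so a small lookup helper may make the combinations easier to read. The following is a minimal sketch only: the function `training_settings` and all dictionary keys are hypothetical names introduced for illustration, and the helper deliberately raises an error for any model or dataset whose value is not stated in this appendix rather than guessing one.

```python
# Learning rates stated in this appendix (keys are hypothetical identifiers).
LEARNING_RATE = {"bert_to_lstm": 5e-5, "pointer_generator": 5e-4}

# Attention-layer dropout per dataset, as stated in this appendix.
ATTENTION_DROPOUT = {"wikibio": 0.2, "webnlg": 0.1}


def training_settings(model: str, dataset: str) -> dict:
    """Return the stated training/decoding settings for a model and dataset."""
    if model not in LEARNING_RATE:
        raise ValueError(f"no learning rate stated for model {model!r}")
    if dataset not in ATTENTION_DROPOUT:
        raise ValueError(f"no attention dropout stated for dataset {dataset!r}")
    return {
        "optimizer": "adam",
        "learning_rate": LEARNING_RATE[model],
        "early_stopping_metric": "validation_loss",
        "lstm_input_dropout": 0.5,   # dropout on the LSTM input layer
        "rnn_memory_dropout": 0.2,   # RNN-dropout on the LSTM memory
        "attention_dropout": ATTENTION_DROPOUT[dataset],
        "num_mc_samples": 8,         # K in the Monte Carlo estimate of the loss
        "beam_size": 8,              # beam width used at decoding time
    }
```

For example, `training_settings("bert_to_lstm", "webnlg")` combines the BERT-to-LSTM learning rate with the WebNLG attention dropout.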