Data Recombination for Neural Semantic Parsing

Robin Jia, Percy Liang

Introduction

Semantic parsing, the precise translation of natural language utterances into logical forms, has many applications, including question answering [Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Liang et al., 2011; Berant et al., 2013], instruction following [Artzi and Zettlemoyer, 2013b], and regular expression generation [Kushman and Barzilay, 2013]. Modern semantic parsers [Artzi and Zettlemoyer, 2013a; Berant et al., 2013] are complex pieces of software, requiring hand-crafted features, lexicons, and grammars.

Meanwhile, recurrent neural networks (RNNs) have made swift inroads into many structured prediction tasks in NLP, including machine translation [Sutskever et al., 2014; Bahdanau et al., 2014] and syntactic parsing [Vinyals et al., 2015b; Dyer et al., 2015]. Because RNNs make very few domain-specific assumptions, they have the potential to succeed at a wide variety of tasks with minimal feature engineering. However, this flexibility also puts RNNs at a disadvantage compared to standard semantic parsers, which can generalize naturally by leveraging their built-in awareness of logical compositionality.

In this paper, we introduce data recombination, a generic framework for declaratively injecting prior knowledge into a domain-general structured prediction model. In data recombination, prior knowledge about a task is used to build a high-precision generative model that expands the empirical distribution by allowing fragments of different examples to be combined in particular ways. Samples from this generative model are then used to train a domain-general model. In the case of semantic parsing, we construct a generative model by inducing a synchronous context-free grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general model is a sequence-to-sequence RNN with a novel attention-based copying mechanism. Data recombination boosts the accuracy of our RNN model on three semantic parsing datasets. On the Geo dataset, data recombination improves test accuracy by $4.3$ percentage points over our baseline RNN, leading to new state-of-the-art results for models that do not use a seed lexicon for predicates.

Problem statement

We cast semantic parsing as a sequence-to-sequence task. The input utterance $x$ is a sequence of words $x_{1},\dotsc,x_{m}\in\mathcal{V}^{\text{(in)}}$ , the input vocabulary; similarly, the output logical form $y$ is a sequence of tokens $y_{1},\dotsc,y_{n}\in\mathcal{V}^{\text{(out)}}$ , the output vocabulary. A linear sequence of tokens might appear to lose the hierarchical structure of a logical form, but there is precedent for this choice: ?) showed that an RNN can reliably predict tree-structured outputs in a linear fashion.

We evaluate our system on three existing semantic parsing datasets. Figure 2 shows sample input-output pairs from each of these datasets.

GeoQuery (Geo) contains natural language questions about US geography paired with corresponding Prolog database queries. We use the standard split of 600 training examples and 280 test examples introduced by ?). We convert the logical forms to De Bruijn index notation to standardize variable naming.

ATIS (ATIS) contains natural language queries for a flights database paired with corresponding database queries written in lambda calculus. We train on $4473$ examples and evaluate on the $448$ test examples used by ?).

Overnight (Overnight) contains logical forms paired with natural language paraphrases across eight varied subdomains. ?) constructed the dataset by generating all possible logical forms up to some depth threshold, then getting multiple natural language paraphrases for each logical form from workers on Amazon Mechanical Turk. We evaluate on the same train/test splits as ?).

In this paper, we only explore learning from logical forms. In the last few years, there has been an emergence of semantic parsers learned from denotations [Clarke et al., 2010; Liang et al., 2011; Berant et al., 2013; Artzi and Zettlemoyer, 2013b]. While our system cannot directly learn from denotations, it could be used to rerank candidate derivations generated by one of these other systems.

Sequence-to-sequence RNN Model

Our sequence-to-sequence RNN model is based on existing attention-based neural machine translation models [Bahdanau et al., 2014; Luong et al., 2015a], but also includes a novel attention-based copying mechanism. Similar copying mechanisms have been explored in parallel by ?) and ?).

The encoder converts the input sequence $x_{1},\dotsc,x_{m}$ into a sequence of context-sensitive embeddings $b_{1},\dotsc,b_{m}$ using a bidirectional RNN [Bahdanau et al., 2014]. First, a word embedding function $\phi^{\text{(in)}}$ maps each word $x_{i}$ to a fixed-dimensional vector. These vectors are fed as input to two RNNs: a forward RNN and a backward RNN. The forward RNN starts with an initial hidden state $h_{0}^{\text{F}}$, and generates a sequence of hidden states $h_{1}^{\text{F}},\dotsc,h_{m}^{\text{F}}$ by repeatedly applying the recurrence $h_{i}^{\text{F}}=\text{LSTM}(\phi^{\text{(in)}}(x_{i}),h_{i-1}^{\text{F}})$ for $i=1,\dotsc,m$.

The recurrence takes the form of an LSTM [Hochreiter and Schmidhuber, 1997]. The backward RNN similarly generates hidden states $h_{m}^{\text{B}},\dotsc,h_{1}^{\text{B}}$ by processing the input sequence in reverse order. Finally, for each input position $i$, we define the context-sensitive embedding $b_{i}$ to be the concatenation of $h_{i}^{\text{F}}$ and $h_{i}^{\text{B}}$.

The decoder is an attention-based model [Bahdanau et al., 2014; Luong et al., 2015a] that generates the output sequence $y_{1},\dotsc,y_{n}$ one token at a time. At each time step $j$, it writes $y_{j}$ based on the current hidden state $s_{j}$, then updates the hidden state to $s_{j+1}$ based on $s_{j}$ and $y_{j}$. Formally, the decoder is defined by the following equations:

When not specified, $i$ ranges over $\{1,\dotsc,m\}$ and $j$ ranges over $\{1,\dotsc,n\}$ . Intuitively, the $\alpha_{ji}$ ’s define a probability distribution over the input words, describing what words in the input the decoder is focusing on at time $j$ . They are computed from the unnormalized attention scores $e_{ji}$ . The matrices $W^{(s)}$ , $W^{(a)}$ , and $U$ , as well as the embedding function $\phi^{\text{(out)}}$ , are parameters of the model.
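A formulation consistent with the symbols named above, and with the cited attention-based translation models, can be sketched as follows. The initialization of $s_{1}$ from the final encoder states, the bilinear form of the score $e_{ji}$, the context vector $c_{j}$, and the exact inputs to the LSTM update are assumptions rather than details stated in the surrounding text.

```latex
\begin{align}
s_{1} &= \tanh\big(W^{(s)}[h_{m}^{\text{F}}, h_{1}^{\text{B}}]\big), \\
e_{ji} &= s_{j}^{\top} W^{(a)} b_{i}, \\
\alpha_{ji} &= \frac{\exp(e_{ji})}{\sum_{i'=1}^{m}\exp(e_{ji'})}, \\
c_{j} &= \sum_{i=1}^{m} \alpha_{ji}\, b_{i}, \\
P(y_{j}=w \mid x, y_{1:j-1}) &\propto \exp\big(U_{w}[s_{j}, c_{j}]\big), \\
s_{j+1} &= \text{LSTM}\big([\phi^{\text{(out)}}(y_{j}), c_{j}],\, s_{j}\big).
\end{align}
```

Under this reading, $\alpha_{j1},\dotsc,\alpha_{jm}$ sum to one for each $j$, which matches the description of the $\alpha_{ji}$'s as a probability distribution over input words.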

3.2 Attention-based Copying

In the basic model of the previous section, the next output word $y_{j}$ is chosen via a simple softmax over all words in the output vocabulary. However, this model has difficulty generalizing to the long tail of entity names commonly found in semantic parsing datasets. Conveniently, entity names in the input often correspond directly to tokens in the output (e.g., “iowa” becomes iowa in Figure 2). On Geo and ATIS, we make a point not to rely on orthography for non-entities such as “state” to _state, since this leverages information not available to previous models [Zettlemoyer and Collins, 2005] and is much less language-independent.

To capture this intuition, we introduce a new attention-based copying mechanism. At each time step $j$ , the decoder generates one of two types of actions. As before, it can write any word in the output vocabulary. In addition, it can copy any input word $x_{i}$ directly to the output, where the probability with which we copy $x_{i}$ is determined by the attention score on $x_{i}$ . Formally, we define a latent action $a_{j}$ that is either $\texttt{Write}[w]$ for some $w\in\mathcal{V}^{\text{(out)}}$ or $\texttt{Copy}[i]$ for some $i\in\{1,\dotsc,m\}$ . We then have

The decoder chooses $a_{j}$ with a softmax over all these possible actions; $y_{j}$ is then a deterministic function of $a_{j}$ and $x$ . During training, we maximize the log-likelihood of $y$ , marginalizing out $a$ .
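The joint softmax over write and copy actions, and the marginalization over latent actions, can be illustrated with a minimal Python sketch. The function names, the dictionary representation of write scores, and the numeric scores are hypothetical; the sketch only assumes, as described above, that copy actions are scored by the unnormalized attention scores and compete with write actions in a single softmax.

```python
import math


def action_distribution(write_logits, attention_scores):
    """Joint softmax over Write[w] and Copy[i] actions.

    write_logits: dict mapping each output-vocabulary token to a score
        (hypothetical values standing in for the model's write scores).
    attention_scores: list of unnormalized attention scores, one per input word.
    """
    scores = list(write_logits.values()) + list(attention_scores)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    n_write = len(write_logits)
    p_write = {w: e / total for w, e in zip(write_logits, exps[:n_write])}
    p_copy = [e / total for e in exps[n_write:]]
    return p_write, p_copy


def token_probability(token, input_words, p_write, p_copy):
    """Marginalize over every action that would emit `token`."""
    prob = p_write.get(token, 0.0)
    prob += sum(p for word, p in zip(input_words, p_copy) if word == token)
    return prob
```

For an input such as "what states border iowa", the probability of emitting iowa is the probability of writing it plus the probability of copying the fourth input word, so a rare entity can be produced even when its write score is low.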

Attention-based copying can be seen as a combination of a standard softmax output layer of an attention-based model [Bahdanau et al., 2014] and a Pointer Network [Vinyals et al., 2015a]; in a Pointer Network, the only way to generate output is to copy a symbol from the input.

Data Recombination

The main contribution of this paper is a novel data recombination framework that injects important prior knowledge into our oblivious sequence-to-sequence RNN. In this framework, we induce a high-precision generative model from the training data, then sample from it to generate new training examples. The process of inducing this generative model can leverage any available prior knowledge, which is transmitted through the generated examples to the RNN model. A key advantage of our two-stage approach is that it allows us to declare desired properties of the task which might be hard to capture in the model architecture.

Our approach generalizes data augmentation, which is commonly employed to inject prior knowledge into a model. Data augmentation techniques focus on modeling invariances: transformations like translating an image or adding noise that alter the inputs $x$, but do not change the output $y$. These techniques have proven effective in areas like computer vision [Krizhevsky et al., 2012] and speech recognition [Jaitly and Hinton, 2013].

In semantic parsing, however, we would like to capture more than just invariance properties. Consider an example with the utterance “what states border texas ?”. Given this example, it should be easy to generalize to questions where “texas” is replaced by the name of any other state: simply replace the mention of Texas in the logical form with the name of the new state. Underlying this phenomenon is a strong conditional independence principle: the meaning of the rest of the sentence is independent of the name of the state in question. Standard data augmentation is not sufficient to model such phenomena: instead of holding $y$ fixed, we would like to apply simultaneous transformations to $x$ and $y$ such that the new $x$ still maps to the new $y$ . Data recombination addresses this need.

4.2 General Setting

4.3 SCFGs for Semantic Parsing

It is instructive to compare our SCFG-based data recombination with Wasp [Wong and Mooney, 2006; Wong and Mooney, 2007], which uses an SCFG as the actual semantic parsing model. The grammar induced by Wasp must have good coverage in order to generalize to new inputs at test time. Wasp also requires the implementation of an efficient algorithm for computing the conditional probability $p(y\mid x)$. In contrast, our SCFG is only used to convey prior knowledge about conditional independence structure, so it only needs to have high precision; our RNN model is responsible for boosting recall over the entire input space. We also only need to forward sample from the SCFG, which is considerably easier to implement than conditional inference.

Below, we examine various strategies for inducing a grammar $G$ from a dataset $\mathcal{D}$ . We first encode $\mathcal{D}$ as an initial grammar with rules Root $\to\left<x,y\right>$ for each $(x,y)\in\mathcal{D}$ . Next, we will define each grammar induction strategy as a mapping from an input grammar $G_{\text{in}}$ to a new grammar $G_{\text{out}}$ . This formulation allows us to compose grammar induction strategies (Section 4.3.4).

Our first grammar induction strategy, AbsEntities, simply abstracts entities with their types. We assume that each entity $e$ (e.g., texas) has a corresponding type $e.t$ (e.g., state), which we infer based on the presence of certain predicates in the logical form (e.g. stateid). For each grammar rule $X\to\left<\alpha,\beta\right>$ in $G_{\text{in}}$ , where $\alpha$ contains a token (e.g., “texas”) that string matches an entity (e.g., texas) in $\beta$ , we add two rules to $G_{\text{out}}$ : (i) a rule where both occurrences are replaced with the type of the entity (e.g., state), and (ii) a new rule that maps the type to the entity (e.g., $\textsc{StateId}\rightarrow\left<\text{``{texas}''},\texttt{texas}\right>$ ; we reserve the category name State for the next section). Thus, $G_{\text{out}}$ generates recombinant examples that fuse most of one example with an entity found in a second example. A concrete example from the Geo domain is given in Figure 3.
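The AbsEntities transformation and forward sampling from the resulting grammar can be sketched in Python as below. The rule representation (tuples of category, utterance tokens, logical form tokens), the simplified logical form tokens, and the function names are hypothetical. The sketch also assumes that rules without a matching entity pass through unchanged, and that the i-th nonterminal in the utterance is linked to the i-th nonterminal in the logical form.

```python
import random


def abs_entities(rules, entity_types):
    """Abstract entities with their types.

    rules: list of (category, utterance_tokens, logical_form_tokens) tuples.
    entity_types: dict mapping an entity token to its type category,
        e.g. {"texas": "StateId"} (assumed to be given).
    """
    out = []
    for cat, alpha, beta in rules:
        matched = [t for t in alpha if t in entity_types and t in beta]
        if not matched:
            out.append((cat, alpha, beta))  # assumption: keep unmatched rules
            continue
        for ent in matched:
            typ = entity_types[ent]
            # (i) replace both occurrences of the entity with its type
            abstract = (
                cat,
                tuple(typ if t == ent else t for t in alpha),
                tuple(typ if t == ent else t for t in beta),
            )
            # (ii) map the type back to the entity
            lexical = (typ, (ent,), (ent,))
            for rule in (abstract, lexical):
                if rule not in out:
                    out.append(rule)
    return out


def sample(rules, cat="Root", rng=random):
    """Forward-sample an (utterance, logical form) pair from the grammar."""
    categories = {r[0] for r in rules}
    _, alpha, beta = rng.choice([r for r in rules if r[0] == cat])
    # Expand each nonterminal once; reuse the same expansion on both sides.
    subs = [sample(rules, t, rng) for t in alpha if t in categories]
    x, y = [], []
    pending = iter(subs)
    for t in alpha:
        if t in categories:
            x.extend(next(pending)[0])
        else:
            x.append(t)
    pending = iter(subs)
    for t in beta:
        if t in categories:
            y.extend(next(pending)[1])
        else:
            y.append(t)
    return x, y
```

Given one example mentioning texas and another mentioning ohio, both typed as StateId, sampling from the output grammar can produce the first utterance with ohio substituted consistently in both the utterance and the logical form.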

4.3.2 Abstracting Whole Phrases

Our second grammar induction strategy, AbsWholePhrases, abstracts both entities and whole phrases with their types. For each grammar rule $X\to\left<\alpha,\beta\right>$ in $G_{\text{in}}$ , we add up to two rules to $G_{\text{out}}$ . First, if $\alpha$ contains tokens that string match to an entity in $\beta$ , we replace both occurrences with the type of the entity, similarly to rule (i) from AbsEntities. Second, if we can infer that the entire expression $\beta$ evaluates to a set of a particular type (e.g. state) we create a rule that maps the type to $\left<\alpha,\beta\right>$ . In practice, we also use some simple rules to strip question identifiers from $\alpha$ , so that the resulting examples are more natural. Again, refer to Figure 3 for a concrete example.

This strategy works because of a more general conditional independence property: the meaning of any semantically coherent phrase is conditionally independent of the rest of the sentence, the cornerstone of compositional semantics. This assumption does not always hold: phenomena like anaphora, which involve long-range context dependence, violate it. However, the property holds in most existing semantic parsing datasets.

4.3.3 Concatenation

The final grammar induction strategy is surprisingly simple, yet it turns out to work. For any $k\geq 2$, we define the Concat-$k$ strategy, which creates two types of rules. First, we create a single rule that has Root going to a sequence of $k$ Sent’s. Then, for each root-level rule $\textsc{Root}\to\left<\alpha,\beta\right>$ in $G_{\text{in}}$, we add the rule $\textsc{Sent}\to\left<\alpha,\beta\right>$ to $G_{\text{out}}$. See Figure 3 for an example.
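The two kinds of Concat-$k$ rules can be sketched in Python as follows. The tuple-based rule representation and the function names are hypothetical. The sketch assumes that non-root rules are carried over unchanged, that the concatenated sentences are joined without a separator token, and that the i-th nonterminal on the utterance side is linked to the i-th nonterminal on the logical form side.

```python
import random


def concat_k(rules, k):
    """Build the Concat-k grammar from rules of the form
    (category, utterance_tokens, logical_form_tokens)."""
    if k < 2:
        raise ValueError("Concat-k is defined for k >= 2")
    # A single rule sending Root to a sequence of k Sent's.
    out = [("Root", ("Sent",) * k, ("Sent",) * k)]
    for cat, alpha, beta in rules:
        if cat == "Root":
            out.append(("Sent", alpha, beta))  # root-level rule becomes Sent
        else:
            out.append((cat, alpha, beta))  # assumption: keep other rules
    return out


def sample(rules, cat="Root", rng=random):
    """Forward-sample a pair; each nonterminal occurrence is expanded
    independently, and the i-th expansion is used on both sides."""
    categories = {r[0] for r in rules}
    _, alpha, beta = rng.choice([r for r in rules if r[0] == cat])
    subs = [sample(rules, t, rng) for t in alpha if t in categories]
    x, y = [], []
    pending = iter(subs)
    for t in alpha:
        if t in categories:
            x.extend(next(pending)[0])
        else:
            x.append(t)
    pending = iter(subs)
    for t in beta:
        if t in categories:
            y.extend(next(pending)[1])
        else:
            y.append(t)
    return x, y
```

Each sample from `concat_k(rules, 2)` is the concatenation of two training utterances paired with the concatenation of their logical forms in the same order, which yields longer inputs for the attention mechanism to handle.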

Unlike AbsEntities and AbsWholePhrases, concatenation is very general, and can be applied to any sequence transduction problem. Of course, it also does not introduce additional information about compositionality or independence properties present in semantic parsing. However, it does generate harder examples for the attention-based RNN, since the model must learn to attend to the correct parts of the now-longer input sequence. Related work has shown that training a model on more difficult examples can improve generalization, the most canonical case being dropout [Hinton et al., 2012; Wager et al., 2013].

4.3.4 Composition

We note that grammar induction strategies can be composed, yielding more complex grammars. Given any two grammar induction strategies $f_{1}$ and $f_{2}$ , the composition $f_{1}\circ f_{2}$ is the grammar induction strategy that takes in $G_{\text{in}}$ and returns $f_{1}(f_{2}(G_{\text{in}}))$ . For the strategies we have defined, we can perform this operation symbolically on the grammar rules, without having to sample from the intermediate grammar $f_{2}(G_{\text{in}})$ .
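Because each strategy is a mapping from grammars to grammars, composition reduces to ordinary function composition. A minimal Python sketch follows; the helper name `compose` is hypothetical, and each strategy is assumed to be a callable that takes a grammar and returns a grammar.

```python
def compose(*strategies):
    """Return the strategy f1 ∘ f2 ∘ ... ∘ fn.

    compose(f1, f2)(G) equals f1(f2(G)): the rightmost strategy is
    applied to the input grammar first.
    """
    def composed(grammar):
        for strategy in reversed(strategies):
            grammar = strategy(grammar)
        return grammar
    return composed
```

For instance, if `abs_entities` and `concat_2` were grammar-to-grammar callables, `compose(concat_2, abs_entities)` would first abstract entities and then concatenate pairs of the resulting root-level rules.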

Experiments

We evaluate our system on three domains: Geo, ATIS, and Overnight. For ATIS, we report logical form exact match accuracy. For Geo and Overnight, we determine correctness based on denotation match, as in ?) and ?), respectively.

We note that not all grammar induction strategies make sense for all domains. In particular, we only apply AbsWholePhrases to Geo and Overnight. We do not apply AbsWholePhrases to ATIS, as the dataset has little nesting structure.

5.2 Implementation Details

We tokenize logical forms in a domain-specific manner, based on the syntax of the formal language being used. On Geo and ATIS, we disallow copying of predicate names to ensure a fair comparison to previous work, as string matching between input words and predicate names is not commonly used. We prevent copying by prepending underscores to predicate tokens; see Figure 2 for examples.

On ATIS alone, when doing attention-based copying and data recombination, we leverage an external lexicon that maps natural language phrases (e.g., “kennedy airport”) to entities (e.g., jfk:ap). When we copy a word that is part of a phrase in the lexicon, we write the entity associated with that lexicon entry. When performing data recombination, we identify entity alignments based on matching phrases and entities from the lexicon.

We run all experiments with $200$ hidden units and $100$-dimensional word vectors. We initialize all parameters uniformly at random within the interval $[-0.1,0.1]$. We maximize the log-likelihood of the correct logical form using stochastic gradient descent. We train the model for a total of $30$ epochs with an initial learning rate of $0.1$, and halve the learning rate every $5$ epochs, starting after epoch $15$. We replace word vectors for words that occur only once in the training set with a universal word vector. Our model is implemented in Theano [Bergstra et al., 2010].
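The learning rate schedule can be written as a small Python function. This is one reading of the description, assuming 1-indexed epochs and that the first halving takes effect at epoch 16; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, initial=0.1, halve_every=5, start_after=15):
    """Learning rate for a 1-indexed epoch.

    Assumed reading: the rate stays at `initial` through epoch
    `start_after`, is halved for the next `halve_every` epochs,
    halved again for the following `halve_every` epochs, and so on.
    """
    if epoch <= start_after:
        return initial
    num_halvings = (epoch - start_after - 1) // halve_every + 1
    return initial * 0.5 ** num_halvings
```

Under this reading, epochs 1 to 15 use 0.1, epochs 16 to 20 use 0.05, epochs 21 to 25 use 0.025, and epochs 26 to 30 use 0.0125.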

When performing data recombination, we sample a new round of recombinant examples from our grammar at each epoch. We add these examples to the original training dataset, randomly shuffle all examples, and train the model for the epoch. Figure 4 gives pseudocode for this training procedure. One important hyperparameter is how many examples to sample at each epoch: we found that a good rule of thumb is to sample as many recombinant examples as there are examples in the training dataset, so that half of the examples the model sees at each epoch are recombinant.
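The per-epoch procedure described above can be sketched in Python as follows. The callables `sample_recombinant` and `train_one_epoch` are hypothetical placeholders for the grammar sampler and the model update, and the default of sampling as many recombinant examples as there are training examples follows the rule of thumb stated above.

```python
import random


def train_with_recombination(dataset, sample_recombinant, train_one_epoch,
                             num_epochs=30, num_samples=None, rng=random):
    """Train with freshly sampled recombinant examples at every epoch.

    dataset: list of original (x, y) training examples.
    sample_recombinant: zero-argument callable returning one new (x, y)
        pair drawn from the induced grammar (hypothetical interface).
    train_one_epoch: callable that runs one pass over a list of examples
        (hypothetical interface).
    """
    if num_samples is None:
        num_samples = len(dataset)  # half of each epoch is recombinant
    for _ in range(num_epochs):
        recombinant = [sample_recombinant() for _ in range(num_samples)]
        examples = list(dataset) + recombinant
        rng.shuffle(examples)
        train_one_epoch(examples)
```

Resampling inside the loop means the model sees a different set of recombinant examples at each epoch, while the original examples are always included.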

At test time, we use beam search with beam size $5$ . We automatically balance missing right parentheses by adding them at the end. On Geo and Overnight, we then pick the highest-scoring logical form that does not yield an executor error when the corresponding denotation is computed. On ATIS, we just pick the top prediction on the beam.
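The test-time post-processing can be sketched in Python as below. The `execute` callable is a hypothetical stand-in for the domain's logical form executor and is assumed to raise an exception on an executor error; returning `None` when every candidate fails is also an assumption, since that case is not specified above.

```python
def balance_parentheses(tokens):
    """Append any missing right parentheses at the end of the sequence."""
    depth = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
    return list(tokens) + [")"] * max(depth, 0)


def pick_prediction(beam, execute):
    """Pick the highest-scoring candidate that executes without error.

    beam: list of (score, tokens) pairs produced by beam search.
    execute: hypothetical executor; assumed to raise on an executor error.
    """
    for _, tokens in sorted(beam, key=lambda cand: cand[0], reverse=True):
        candidate = balance_parentheses(tokens)
        try:
            execute(candidate)
        except Exception:
            continue  # skip candidates that fail to execute
        return candidate
    return None  # assumption: no fallback when all candidates fail
```

On ATIS, where the top prediction is used directly, only `balance_parentheses` would apply to the first item on the beam.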

5.3 Impact of the Copying Mechanism

First, we measure the contribution of the attention-based copying mechanism to the model’s overall performance. On each task, we train and evaluate two models: one with the copying mechanism, and one without. Training is done without data recombination. The results are shown in Table 1.

On Geo and ATIS, the copying mechanism helps significantly: it improves test accuracy by $10.4$ percentage points on Geo and $6.4$ points on ATIS. However, on Overnight, adding the copying mechanism actually makes our model perform slightly worse. This result is somewhat expected, as the Overnight dataset contains a very small number of distinct entities. It is also notable that both systems surpass the previous best system on Overnight by a wide margin.

We choose to use the copying mechanism in all subsequent experiments, as it has a large advantage in realistic settings where there are many distinct entities in the world. The concurrent work of ?) and ?), both of whom propose similar copying mechanisms, provides additional evidence for the utility of copying on a wide range of NLP tasks.

5.4 Main Results

For our main results, we train our model with a variety of data recombination strategies on all three datasets. These results are summarized in Tables 2 and 3. We compare our system to the baseline of not using any data recombination, as well as to state-of-the-art systems on all three datasets.

We find that data recombination consistently improves accuracy across the three domains we evaluated on, and that the strongest results come from composing multiple strategies. Combining AbsWholePhrases, AbsEntities, and Concat-2 yields a $4.3$ percentage point improvement over the baseline without data recombination on Geo, and an average of $1.7$ percentage points on Overnight. In fact, on Geo, we achieve test accuracy of $89.3\%$, which surpasses the previous state-of-the-art, excluding ?), which used a seed lexicon for predicates. On ATIS, we experiment with concatenating more than $2$ examples, to make up for the fact that we cannot apply AbsWholePhrases, which generates longer examples. We obtain a test accuracy of $83.3\%$ with AbsEntities composed with Concat-3, which beats the baseline by $7$ percentage points and is competitive with the state-of-the-art.

For completeness, we also investigated the effects of data recombination on the model without attention-based copying. We found that recombination helped significantly on Geo and ATIS, but hurt the model slightly on Overnight. On Geo, the best data recombination strategy yielded test accuracy of $82.9\%$ , for a gain of $8.3$ percentage points over the baseline with no copying and no recombination; on ATIS, data recombination gives test accuracies as high as $74.6\%$ , a $4.7$ point gain over the same baseline. However, no data recombination strategy improved average test accuracy on Overnight; the best one resulted in a $0.3$ percentage point decrease in test accuracy. We hypothesize that data recombination helps less on Overnight in general because the space of possible logical forms is very limited, making it more like a large multiclass classification task. Therefore, it is less important for the model to learn good compositional representations that generalize to new logical forms at test time.

5.5 Effect of Longer Examples

Interestingly, strategies like AbsWholePhrases and Concat-2 help the model even though the resulting recombinant examples are generally not in the support of the test distribution. In particular, these recombinant examples are on average longer than those in the actual dataset, which makes them harder for the attention-based model. Indeed, for every domain, our best accuracy numbers involved some form of concatenation, and often involved AbsWholePhrases as well. In comparison, applying AbsEntities alone, which generates examples of the same length as those in the original dataset, was generally less effective.

We conducted additional experiments on artificial data to investigate the importance of adding longer, harder examples. We experimented with adding new examples via data recombination, as well as adding new independent examples (e.g. to simulate the acquisition of more training data). We constructed a simple world containing a set of entities and a set of binary relations. For any $n$ , we can generate a set of depth- $n$ examples, which involve the composition of $n$ relations applied to a single entity. Example data points are shown in Figure 5. We train our model on various datasets, then test it on a set of $500$ randomly chosen depth- $2$ examples. The model always has access to a small seed training set of $100$ depth- $2$ examples. We then add one of four types of examples to the training set:

Same length, independent: New randomly chosen depth-$2$ examples. (Technically, these are not completely independent, as we sample these new examples without replacement. The same applies to the longer “independent” examples.)

Longer, independent: Randomly chosen depth- $4$ examples.

Same length, recombinant: Depth- $2$ examples sampled from the grammar induced by applying AbsEntities to the seed dataset.

Longer, recombinant: Depth- $4$ examples sampled from the grammar induced by applying AbsWholePhrases followed by AbsEntities to the seed dataset.

To maintain consistency between the independent and recombinant experiments, we fix the recombinant examples across all epochs, instead of resampling at every epoch. In Figure 6, we plot accuracy on the test set versus the number of additional examples added of each of these four types. As expected, independent examples are more helpful than the recombinant ones, but both help the model improve considerably. In addition, we see that even though the test dataset only has short examples, adding longer examples helps the model more than adding shorter ones, in both the independent and recombinant cases. These results underscore the importance of training on longer, harder examples.

Discussion

In this paper, we have presented a novel framework we term data recombination, in which we generate new training examples from a high-precision generative model induced from the original training dataset. We have demonstrated its effectiveness in improving the accuracy of a sequence-to-sequence RNN model on three semantic parsing datasets, using a synchronous context-free grammar as our generative model.

There has been growing interest in applying neural networks to semantic parsing and related tasks. ?) concurrently developed an attention-based RNN model for semantic parsing, although they did not use data recombination. ?) proposed a non-recurrent neural model for semantic parsing, though they did not run experiments. ?) use an RNN model to perform a related task of instruction following.

Our proposed attention-based copying mechanism bears a strong resemblance to two models that were developed independently by other groups. ?) apply a very similar copying mechanism to text summarization and single-turn dialogue generation. ?) propose a model that decides at each step whether to write from a “shortlist” vocabulary or copy from the input, and report improvements on machine translation and text summarization. Another piece of related work is ?), who train a neural machine translation system to copy rare words, relying on an external system to generate alignments.

Prior work has explored using paraphrasing for data augmentation on NLP tasks. ?) augment their data by swapping out words for synonyms from WordNet. ?) use a similar strategy, but identify similar words and phrases based on cosine distance between vector space embeddings. Unlike our data recombination strategies, these techniques only change inputs $x$ , while keeping the labels $y$ fixed. Additionally, these paraphrasing-based transformations can be described in terms of grammar induction, so they can be incorporated into our framework.

In data recombination, data generated by a high-precision generative model is used to train a second, domain-general model. Generative oversampling [Liu et al., 2007] learns a generative model in a multiclass classification setting, then uses it to generate additional examples from rare classes in order to combat label imbalance. Uptraining [Petrov et al., 2010] uses data labeled by an accurate but slow model to train a computationally cheaper second model. ?) generate a large dataset of constituency parse trees by taking sentences that multiple existing systems parse in the same way, and train a neural model on this dataset.

Some of our induced grammars generate examples that are not in the test distribution, but nonetheless aid in generalization. Related work has also explored the idea of training on altered or out-of-domain data, often interpreting it as a form of regularization. Dropout training has been shown to be a form of adaptive regularization [Hinton et al., 2012; Wager et al., 2013]. ?) showed that encouraging a knowledge base completion model to handle longer path queries acts as a form of structural regularization.

Language is a blend of crisp regularities and soft relationships. Our work takes RNNs, which excel at modeling soft phenomena, and uses a highly structured tool, synchronous context-free grammars, to infuse them with an understanding of crisp structure. We believe this paradigm for simultaneously modeling the soft and hard aspects of language should have broader applicability beyond semantic parsing.

This work was supported by the NSF Graduate Research Fellowship under Grant No. DGE-114747, and the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462.

All code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x50757a37779b485f89012e4ba03b6f4f/.