Learning to Recombine and Resample Data for Compositional Generalization

Ekin Akyürek, Afra Feyza Akyürek, Jacob Andreas

Introduction

How can we build machine learning models with the ability to learn new concepts in context from little data? Human language learners acquire new word meanings from a single exposure (Carey & Bartlett, 1978), and immediately incorporate words and their meanings productively and compositionally into larger linguistic and conceptual systems (Berko, 1958; Piantadosi & Aslin, 2016). Despite the remarkable success of neural network models on many learning problems in recent years—including one-shot learning of classifiers and policies (Santoro et al., 2016; Wang et al., 2016)—this kind of few-shot learning of composable concepts remains beyond the reach of standard neural models in both diagnostic and naturalistic settings (Lake & Baroni, 2018; Bahdanau et al., 2019a).

Consider the few-shot morphology learning problem shown in Fig. 1, in which a learner must predict various linguistic features (e.g. 3rd person, SinGular, PRESent tense) from word forms, with only a small number of examples of the PAST tense in the training set. Neural sequence-to-sequence models (e.g. Bahdanau et al., 2015) trained on this kind of imbalanced data fail to predict past-tense tags on held-out inputs of any kind (Section 5). Previous attempts to address this and related shortcomings in neural models have focused on explicitly encouraging rule-like behavior by e.g. modeling data with symbolic grammars (Jia & Liang, 2016; Xiao et al., 2016; Cai et al., 2017) or applying rule-based data augmentation (Andreas, 2020). These procedures involve highly task-specific models or generative assumptions, preventing them from generalizing effectively to less structured problems that combine rule-like and exceptional behavior. More fundamentally, they fail to answer the question of whether explicit rules are necessary for compositional inductive bias, and whether it is possible to obtain “rule-like” inductive bias without appeal to an underlying symbolic generative process.

This paper describes a procedure for improving few-shot compositional generalization in neural sequence models without symbolic scaffolding. Our key insight is that even fixed, imbalanced training datasets provide a rich source of supervision for few-shot learning of concepts and composition rules. In particular, we propose a new class of prototype-based neural sequence models (cf. Gu et al., 2018) that can be directly trained to perform the kinds of generalization exhibited in Fig. 1 by explicitly recombining fragments of training examples to reconstruct other examples. Even when these prototype-based models are not effective as general-purpose predictors, we can resample their outputs to select high-quality synthetic examples of rare phenomena. Ordinary neural sequence models may then be trained on datasets augmented with these synthetic examples, distilling the learned regularities into more flexible predictors. This procedure, which we abbreviate R&R, promotes efficient generalization in both challenging synthetic sequence modeling tasks (Lake & Baroni, 2018) and morphological analysis in multiple natural languages (Cotterell et al., 2018).

By directly optimizing for the kinds of generalization that symbolic representations are supposed to support, we can bypass the need for symbolic representations themselves: R&R gives performance comparable to or better than state-of-the-art neuro-symbolic approaches on tests of compositional generalization.

Our results suggest that some failures of systematicity in neural models can be explained by simpler structural constraints on data distributions and corrected with weaker inductive bias than previously described. Code for all experiments in this paper is available at https://github.com/ekinakyurek/compgen. We implemented our experiments in Knet (Yuret, 2016) using Julia (Bezanson et al., 2017).

Background and related work

Systematic compositionality—the capacity to identify rule-like regularities from limited data and generalize these rules to novel situations—is an essential feature of human reasoning (Fodor et al., 1988). While details vary, a common feature of existing attempts to formalize systematicity in sequence modeling problems (e.g. Gordon et al., 2020) is the intuition that learners should make accurate predictions in situations featuring novel combinations of previously observed input or output subsequences. For example, learners should generalize from actions seen in isolation to more complex commands involving those actions (Lake et al., 2019), and from relations of the form r(a,b) to r(b,a) (Keysers et al., 2020; Bahdanau et al., 2019b). In machine learning, previous studies have found that standard neural architectures fail to generalize systematically even when they achieve high in-distribution accuracy in a variety of settings (Lake & Baroni, 2018; Bastings et al., 2018; Johnson et al., 2017).

In addition to task-specific model architectures (Andreas et al., 2016; Russin et al., 2019), recent years have seen a renewed interest in data augmentation as a flexible and model-agnostic tool for encouraging controlled generalization (Ratner et al., 2017). Existing proposals for sequence models are mainly rule-based—in sequence modeling problems, specifying a synchronous context-free grammar (Jia & Liang, 2016) or string rewriting system (Andreas, 2020) to generate new examples. Rule-based data augmentation schemes that recombine multiple training examples have been proposed for image classification (Inoue, 2018) and machine translation (Fadaee et al., 2017). While rule-based data augmentation is highly effective in structured problems featuring crisp correspondences between inputs and outputs, the effectiveness of such approaches involving more complicated, context-dependent relationships between inputs and outputs has not been well-studied.

Prototype-based sequence models generate a datum $d$ by drawing a prototype $d^{\prime}$ from a dataset $\mathcal{D}$ and applying a learned sequence rewriting model $p_{\textrm{rewrite}}(d\mid d^{\prime};\theta)$ (Eq. 2). (To avoid confusion, we will use the symbol $d$ to denote a datum. Because a data augmentation procedure must produce complete input–output examples, each $d$ is an $(x,y)$ pair for the conditional tasks evaluated in this paper.) While recent variants implement $p_{\textrm{rewrite}}$ with neural networks, these models are closely related to classical kernel density estimators (Rosenblatt, 1956). But additionally, building on the motivation in Section 1, they may be viewed as one-shot learners trained to generate new data $d$ from a single example.

Existing work uses prototype-based models as replacements for standard sequence models. We will show here that they are even better suited to use as data augmentation procedures: they can produce high-precision examples in the neighborhood of existing training data, then be used to bootstrap simpler predictors that extrapolate more effectively. But our experiments will also show that existing prototype-based models give mixed results on challenging generalizations of the kind depicted in Fig. 1 when used for either direct prediction or data augmentation—performing well in some settings but barely above baseline in others.

Accordingly, R&R is built on two model components that transform prototype-based language models into an effective learned data augmentation scheme. Section 3 describes an implementation of $p_{\textrm{rewrite}}$ that encourages greater sample diversity and well-formedness via a multi-prototype copying mechanism (a two-shot learner). Section 4 describes heuristics for sampling prototypes $d^{\prime}$ and model outputs $d$ to focus data augmentation on the most informative examples. Section 5 investigates the empirical performance of both components of the approach, finding that together they provide a simple but surprisingly effective tool for enabling compositional generalization.

Prototype-based sequence models for data recombination

We begin with a brief review of existing prototype-based sequence models. Our presentation mostly follows the retrieve-and-edit approach of Guu et al. (2018), but versions of the approach in this paper could also be built on retrieval-based models implemented with memory networks (Miller et al., 2016; Gu et al., 2018) or transformers (Khandelwal et al., 2020; Guu et al., 2020). The generative process described in Eq. 2 implies a marginal sequence probability:
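Assuming, as a sketch, that the prototype $d^{\prime}$ is drawn uniformly from $\mathcal{D}$ (an assumption here, chosen to match the uniform prior used for prototype tuples later in this section), the marginal can be written as:

$$p(d\mid\mathcal{D};\theta)=\frac{1}{|\mathcal{D}|}\sum_{d^{\prime}\in\mathcal{D}}p_{\textrm{rewrite}}(d\mid d^{\prime};\theta)$$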

Maximizing this quantity over the training set with respect to $\theta$ will encourage $p_{\textrm{rewrite}}$ to act as a model of valid data transformations: To be assigned high probability, every training example must be explained by at least one other example and a parametric rewriting operation. (The trivial solution where $p_{\theta}$ is the identity function, with $p_{\theta}(d\mid d^{\prime}=d)=1$ , can be ruled out manually in the design of $p_{\theta}$ .) When $\mathcal{D}$ is large, the sum in Eq. 3 is too large to enumerate exhaustively when computing the marginal likelihood. Instead, we can optimize a lower bound by restricting the sum to a neighborhood $\mathcal{N}(d)\subset\mathcal{D}$ of training examples around each $d$ :

The choice of $\mathcal{N}$ is discussed in more detail in Section 4. Now observe that:

where the second step uses Jensen’s inequality. If all $|\mathcal{N}(d)|$ are the same size, maximizing this lower bound on log-likelihood is equivalent to simply maximizing
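The argument can be written out step by step as follows, assuming a uniform prior over prototypes (an assumption of this sketch). Dropping the nonnegative terms outside $\mathcal{N}(d)$ gives the first inequality, and the concavity of the logarithm (Jensen's inequality) gives the second:

$$
\begin{aligned}
\log p(d)&\geq\log\Big(\frac{1}{|\mathcal{D}|}\sum_{d^{\prime}\in\mathcal{N}(d)}p_{\textrm{rewrite}}(d\mid d^{\prime};\theta)\Big)\\
&=\log\frac{|\mathcal{N}(d)|}{|\mathcal{D}|}+\log\Big(\frac{1}{|\mathcal{N}(d)|}\sum_{d^{\prime}\in\mathcal{N}(d)}p_{\textrm{rewrite}}(d\mid d^{\prime};\theta)\Big)\\
&\geq\log\frac{|\mathcal{N}(d)|}{|\mathcal{D}|}+\frac{1}{|\mathcal{N}(d)|}\sum_{d^{\prime}\in\mathcal{N}(d)}\log p_{\textrm{rewrite}}(d\mid d^{\prime};\theta)
\end{aligned}
$$

When every neighborhood has the same size, the first term and the factor $1/|\mathcal{N}(d)|$ do not depend on $\theta$, so the bound summed over the training set is maximized by maximizing

$$\sum_{d\in\mathcal{D}}\sum_{d^{\prime}\in\mathcal{N}(d)}\log p_{\textrm{rewrite}}(d\mid d^{\prime};\theta)$$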

We have motivated prototype-based models by arguing that $p_{\textrm{rewrite}}$ learns a model of transformations licensed by the training data. However, when generalization involves complex compositions, we will show that neither a basic RNN implementation of $p_{\textrm{rewrite}}$ nor a single prototype is enough; we must provide the learned rewriting model with a larger inventory of parts and encourage reuse of those parts as faithfully as possible. This motivates the two improvements on the prototype-based modeling framework described in the remainder of this section: generalization to multiple prototypes (Section 3.1) and a new rewriting model (Section 3.2).

To improve compositionality in prototype-based models, we equip them with the ability to condition on multiple examples simultaneously. We extend the basic prototype-based language model to $n$ prototypes, which we now refer to as a recombination model $p_{\textrm{recomb}}$ :

A multi-prototype model may be viewed as a meta-learner (Thrun & Pratt, 1998; Santoro et al., 2016): it maps from a small number of examples (the prototypes) to a distribution over new datapoints consistent with those examples. By choosing the neighborhood and implementation of $p_{\textrm{recomb}}$ appropriately, we can train this meta-learner to specialize in one-shot concept learning (by reusing a fragment exhibited in a single prototype) or compositional generalization (by assembling fragments of prototypes into a novel configuration). To enable this behavior, we define a set of compatible prototypes $\Omega\subset\mathcal{D}^{n}$ (Section 4) and let $p_{\Omega}\stackrel{\text{def}}{=}\textrm{Unif}(\Omega)$. We update Eq. 6 to feature a corresponding multi-prototype neighborhood $\mathcal{N}:\mathcal{D}\to\Omega$. The only terms that have changed are the conditioning variable and the constant term, and it is again sufficient to choose $\theta$ to optimize $\sum_{d_{1:n}^{\prime}\in\mathcal{N}(d)}\log p_{\textrm{recomb}}(d\mid d_{1:n}^{\prime})$ over $\mathcal{D}$, implementing $p_{\textrm{recomb}}$ as described next.

3.2 Recombination networks

Past work has found that latent-variable neural sequence models often ignore the latent variable and attempt to directly model sequence marginals (Bowman et al., 2016). When an ordinary sequence-to-sequence model with attention is used to implement $p_{\textrm{recomb}}$ , even in the one-prototype case, generated sentences often have little overlap with their prototypes (Weston et al., 2018). We describe a specific model architecture for $p_{\textrm{recomb}}$ that does not function as a generic noise model, and in which outputs are primarily generated via explicit reuse of fragments of multiple prototypes, by facilitating copying from independent streams containing prototypes and previously generated input tokens.

We take $p_{\textrm{recomb}}(d\mid d_{1:n}^{\prime};\theta)$ to be a neural (multi-)sequence-to-sequence model (cf. Sutskever et al., 2014) which decomposes probability autoregressively: $p_{\textrm{recomb}}(d\mid d_{1:n}^{\prime};\theta)=\prod_{t}p(d^{t}\mid d^{<t},d_{1:n}^{\prime};\theta)$. As shown in Fig. 2, three LSTM encoders (two for the prototypes and one for the input prefix) compute sequences of token representations $h_{\textrm{proto}}$ and $h_{\textrm{out}}$ respectively. Given the current decoder hidden state $h_{out}^{t}$, the model first attends to both prototype and output tokens:

To enable copying from each sequence, we project attention weights $\alpha_{\textrm{out}}$ and $\alpha_{\textrm{proto}}^{k}$ onto the output vocabulary to produce a sparse vector of probabilities:

Unlike rule-based data recombination procedures, however, $p_{\textrm{recomb}}$ is not required to copy from the prototypes, and can predict output tokens directly using values retrieved by the attention mechanism:

To produce a final distribution over output tokens at time $t$ , we combine predictions from each stream:

This copy mechanism is similar to the one proposed by Merity et al. (2017) and See et al. (2017). We compare 1- and 2-prototype models to an ordinary sequence model and baselines in Section 5.
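As a rough illustration of the copy-and-combine step, the sketch below scatters attention weights onto vocabulary entries and then mixes the resulting copy distributions with a direct-prediction distribution. The function names, the toy vocabulary, and the fixed mixing weights `gates` are assumptions made for illustration; in the model these quantities are produced by the network, the output-prefix stream is handled the same way as the prototype streams, and the exact combination rule may differ.

```python
def project_attention(tokens, weights, vocab):
    """Scatter-add attention weights onto vocabulary ids (a sparse copy distribution)."""
    probs = [0.0] * len(vocab)
    for tok, w in zip(tokens, weights):
        probs[vocab[tok]] += w  # repeated tokens accumulate their weights
    return probs


def mix_streams(streams, gates):
    """Convex combination of per-stream token distributions.

    Assumption: `gates` are nonnegative and sum to 1; here they are fixed by hand.
    """
    assert abs(sum(gates) - 1.0) < 1e-9
    size = len(streams[0])
    return [sum(g * s[i] for g, s in zip(gates, streams)) for i in range(size)]


# Toy vocabulary and two prototypes (hypothetical values).
vocab = {"walk": 0, "jump": 1, "twice": 2, "left": 3}
copy1 = project_attention(["walk", "twice"], [0.9, 0.1], vocab)
copy2 = project_attention(["jump", "left", "jump"], [0.5, 0.2, 0.3], vocab)
write = [0.25, 0.25, 0.25, 0.25]  # direct prediction over the vocabulary

final = mix_streams([write, copy1, copy2], gates=[0.2, 0.3, 0.5])
```

Because each stream is itself a distribution and the gates form a convex combination, `final` is a valid distribution; the token `jump`, which appears twice in the second prototype, receives the summed attention mass of both positions.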

Sampling schemes

The models above provide generic procedures for generating well-formed combinations of training data, but do nothing to ensure that the generated samples are of a kind useful for compositional generalization. While the training objective in Eq. 7 encourages the learned $p(d)$ to lie close to the training data, an effective data augmentation procedure should intuitively provide novel examples of rare phenomena. To generate augmented training data, we combine the generative models of Section 3 with a simple sampling procedure that upweights useful examples.

In classification problems with imbalanced classes, a common strategy for improving accuracy on the rare class is to resample so that the rare class is better represented in training data (Japkowicz et al., 2000). When constructing an augmented dataset using the models described above, we apply a simple rejection sampling scheme. In Eq. 1, we set:

Here $p(d^{t})$ is the marginal probability that the token $d^{t}$ appears in any example and $\epsilon$ is a hyperparameter. The final model is then trained using Eq. 1, retaining those augmented samples for which $u(d)=1$ . For extremely imbalanced problems, like the ones considered in Section 5, this weighting scheme effectively functions as a rare tag constraint: only examples containing rare words or tags are used to augment the original training data.
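A minimal sketch of this filter is given below, assuming that $u(d)=1$ exactly when the rarest token in $d$ has marginal probability below $\epsilon$. That form is an assumption consistent with the description above; the helper names and the toy data are hypothetical.

```python
from collections import Counter


def token_marginals(dataset):
    """Fraction of training examples in which each token appears at least once."""
    counts = Counter(tok for example in dataset for tok in set(example))
    return {tok: n / len(dataset) for tok, n in counts.items()}


def keep_sample(sample, marginals, eps):
    """u(d): True if the rarest token in the sample has marginal probability below eps."""
    # Assumption: tokens never seen in training are treated as probability 0.
    rarest = min((marginals.get(tok, 0.0) for tok in sample), default=1.0)
    return rarest < eps


# Toy imbalanced training set: nine present-tense examples, one past-tense example.
train = [["walk", "V", "PRES"]] * 9 + [["walk", "V", "PAST"]]
marginals = token_marginals(train)

samples = [["walk", "V", "PAST"], ["walk", "V", "PRES"]]
kept = [s for s in samples if keep_sample(s, marginals, eps=0.2)]
```

With these toy counts only the sample containing the rare `PAST` tag survives, which mirrors the rare-tag constraint described above.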

4.2 Neighborhoods and prototype priors

How can we ensure that the data augmentation procedure generates any samples with positive weight in Eq. 18? The prototype-based models described in Section 3 offer an additional means of control over the generated data. Aside from the implementation of $p_{\textrm{recomb}}$, the main factors governing the behavior of the model are the choice of neighborhood function $\mathcal{N}(d)$ and, for $n\geq 2$, the set of prior compatible prototypes $\Omega$. Defining these so that rare tags also preferentially appear in prototypes helps ensure that the generated samples contribute to generalization. As a notational convenience, given two prototype sequences $d_{1}$ and $d_{2}$, let $d_{1}\backslash d_{2}$ denote the set of tokens in $d_{1}$ but not $d_{2}$, and let $d_{1}\Delta d_{2}$ denote the set of tokens not common to $d_{1}$ and $d_{2}$.

Guu et al. (2018) define a one-prototype $\mathcal{N}$ based on a Jaccard distance threshold (Jaccard, 1901). For experiments with one-prototype models we employ a similar strategy, choosing an initial neighborhood of candidates such that

The $n\geq 2$ prototype case requires a more complex neighborhood function—intuitively, for an input $d$ , we want each $(d_{1},d_{2},\ldots)$ in the neighborhood to collectively contain enough information to reconstruct $d$ . Future work might treat the neighborhood function itself as latent, allowing the model to identify groups of prototypes that make $d$ probable; here, as in existing one-prototype models, we provide heuristic implementations for the $n=2$ case.

Long–short recombination: For each $(d_{1},d_{2})\in\mathcal{N}(d)$ , $d_{1}$ is chosen to be similar to $d$ , and $d_{2}$ is chosen to be similar to the difference between $d$ and $d_{1}$ . (The neighborhood is so named because one of the prototypes will generally have fewer tokens than the other one.)

Here [$d\backslash d_{1}$] is the sequence obtained by removing all tokens in $d_{1}$ from $d$. Recall that we have defined $p_{\Omega}(d_{1:n})\stackrel{\text{def}}{=}\textrm{Unif}(\Omega)$ for a set $\Omega$ of “compatible” prototypes. For experiments using long–short recombination, all prototypes are treated as compatible; that is, $\Omega=\mathcal{D}\times\mathcal{D}$.
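The long–short construction can be sketched as follows. The choice of Jaccard distance as the similarity measure, the threshold value, and all function names are assumptions of this sketch rather than the exact criteria used in the experiments.

```python
def jaccard_distance(a, b):
    """|a Δ b| / |a ∪ b| over token sets (0 = identical sets, 1 = disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def residual(d, d1):
    """[d \\ d1]: d with every token that occurs in d1 removed, order preserved."""
    seen = set(d1)
    return [tok for tok in d if tok not in seen]


def long_short_neighborhood(d, dataset, delta):
    """Pairs (d1, d2) with d1 close to d and d2 close to what d1 fails to cover."""
    pairs = []
    for d1 in dataset:
        if d1 == d or jaccard_distance(d, d1) >= delta:
            continue
        rest = residual(d, d1)
        if not rest:  # d1 already covers d; no second prototype is needed
            continue
        for d2 in dataset:
            if d2 != d and jaccard_distance(rest, d2) < delta:
                pairs.append((d1, d2))
    return pairs


# Toy data; delta is a hypothetical threshold chosen for this example.
dataset = [["walk", "twice"], ["look", "twice"], ["walk"], ["jump"]]
pairs = long_short_neighborhood(["walk", "twice"], dataset, delta=0.7)
```

In this toy run the target `walk twice` is paired with the long prototype `look twice` and the short prototype `walk`, which together contain every token needed to reconstruct it.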

Long–long recombination: $\mathcal{N}(d)$ contains pairs of prototypes that are individually similar to $d$ and collectively contain all the tokens needed to reconstruct $d$ :

For experiments using long–long recombination, we take $\Omega=\{(d_{1},d_{2})\in\mathcal{D}\times\mathcal{D}:|d_{1}\Delta d_{2}|=1\}$ .
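A sketch of the long–long conditions is shown below: the compatibility set $\Omega$ requires $|d_{1}\Delta d_{2}|=1$, each prototype must be close to $d$, and the pair must jointly cover the tokens of $d$. Jaccard distance, the threshold, and the function names are assumptions of this sketch.

```python
def jaccard_distance(a, b):
    """|a Δ b| / |a ∪ b| over token sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def compatible(d1, d2):
    """Omega for long-long recombination: the token sets differ in exactly one token."""
    return len(set(d1) ^ set(d2)) == 1


def covers(d, d1, d2):
    """True when every token of d appears in at least one of the two prototypes."""
    return set(d) <= set(d1) | set(d2)


def long_long_neighborhood(d, dataset, delta):
    """Compatible prototype pairs that are each close to d and jointly cover it."""
    return [
        (d1, d2)
        for d1 in dataset
        for d2 in dataset
        if d1 != d and d2 != d
        and compatible(d1, d2)
        and jaccard_distance(d, d1) < delta
        and jaccard_distance(d, d2) < delta
        and covers(d, d1, d2)
    ]


# Toy data; delta is a hypothetical threshold chosen for this example.
d = ["walk", "around", "right"]
long_a = ["walk", "around", "right", "twice"]
long_b = ["walk", "around", "twice"]
pairs = long_long_neighborhood(d, [long_a, long_b, ["jump"]], delta=0.6)
```

Note that with set-based comparisons the condition $|d_{1}\Delta d_{2}|=1$ implies that one prototype's token set contains the other's, so coverage of $d$ is determined by the larger of the two.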

Datasets & Experiments

We evaluate R&R on two tests of compositional generalization: the scan instruction following task (Lake & Baroni, 2018) and a few-shot morphology learning task derived from the sigmorphon 2018 dataset (Kirov et al., 2018; Cotterell et al., 2018). Our experiments are designed to explore the effectiveness of learned data recombination procedures in controlled and natural settings. Both tasks involve conditional sequence prediction: while preceding sections have discussed augmentation procedures that produce data points $d=(x,y)$ , learners are evaluated on their ability to predict an output $y$ from an input $x$ : actions $y$ given instructions $x$ , or morphological analyses $y$ given words $x$ .

For each task, we compare a baseline with no data augmentation, the rule-based geca data augmentation procedure (Andreas, 2020), and a sequence of ablated versions of R&R that measure the importance of resampling and recombination. The basic Learned Aug model trains an RNN to generate $(x,y)$ pairs, then trains a conditional model on the original data and samples from the generative model. Resampling filters these samples as described in Section 4. Recomb-n models replace the RNN with a prototype-based model as described in Section 3. Additional experiments (Table 1bb) compare data augmentation to prediction of $y$ via direct inference (Appendix E) in the prototype-based model and several other model variants.

scan (Lake & Baroni, 2018) is a synthetic dataset featuring simple English commands paired with sequences of actions. Our experiments aim to show that R&R performs well at one-shot concept learning and zero-shot generalization on controlled tasks where rule-based models succeed. We experiment with two splits of the dataset, jump and around right. In the jump split, which tests one-shot learning, the word jump appears in a single command in the training set but in more complex commands in the test set (e.g. look and jump twice). The around right split (Loula et al., 2018) tests zero-shot generalization by presenting learners with constructions like walk around left and walk right in the training set, but walk around right only in the test set.

Despite the apparent simplicity of the task, ordinary neural sequence-to-sequence models completely fail to make correct predictions on the scan test set (Table 1b). As such it has been a major focus of research on compositional generalization in sequence-to-sequence models, and a number of heuristic procedures and specialized model architectures and training procedures have been developed to solve it (Russin et al., 2019; Gordon et al., 2020; Lake, 2019; Andreas, 2020). Here we show that the generic prototype recombination procedure described above does so as well. We use long–short recombination for the jump split and long–long recombination for the around right split. We use a recombination network to generate 400 samples $d=(x,y)$ and then train an ordinary LSTM with attention (Bahdanau et al., 2019b) on the original and augmented data to predict $y$ from $x$. Training hyperparameters are provided in Appendix D.

Table 1b shows the results of training these models on the scan dataset. We provide results from geca for comparison. Our final RNN predictor is more accurate than the one used by Andreas (2020), and training it on the same augmented dataset gives higher accuracies than reported in the original paper. 2-prototype recombination is essential for successful generalization on both splits. Additional ablations (Table 1bb) show that the continuous latent variable used by Guu et al. (2018) does not affect performance, but that the copy mechanism described in Section 3.2 and the use of the recomb-2 model for data augmentation rather than direct inference are necessary for accurate prediction.

5.2 SIGMORPHON 2018

The sigmorphon 2018 dataset consists of words paired with morphological analyses (lemmas, or base forms, and tags for linguistic features like tense and case, as depicted in Fig. 1). We use the data to construct a morphological analysis task (Akyürek et al., 2019) (predicting analyses from surface forms) to test models’ few-shot learning of new morphological paradigms. In three languages of varying morphological complexity (Spanish, Swahili, and Turkish) we construct splits of the data featuring a training set of 1000 examples and three test sets of 100 examples. One test set consists exclusively of words in the past tense, one in the future tense and one with other word forms (present tense verbs, nouns and adjectives). The training set contains exactly eight past-tense and eight future-tense examples; all the rest are other word forms. Experiments evaluate R&R’s ability to efficiently learn noisy morphological rules, long viewed as a key challenge for connectionist approaches to language learning (Rumelhart & McClelland, 1986). As approaches may be sensitive to the choice of the eight examples from which the model must generalize, we construct five different splits per language and use the Spanish past-tense data as a development set. As above, we use long–long recombination with similarity criteria applied to $y$ only. We augment the training data with 180 samples from $p_{\textrm{recomb}}$ and again train an ordinary LSTM with attention for final predictions. Details are provided in Appendix B.

Table 2 shows aggregate results across languages. We report the model’s $F_{1}$ score for predicting morphological analyses of words in the few-shot training condition (past and future) and the standard training condition (other word forms). Here, learned data augmentation with both one- and two-prototype models consistently matches or outperforms geca. The improvement is sometimes dramatic: for few-shot prediction in Swahili, recomb-1 augmentation reduces the error rate by 40% relative to the baseline and 21% relative to geca. An additional baseline + resampling experiment upweights the existing rare samples rather than synthesizing new ones; results demonstrate that recombination, and not simply reweighting, is important for generalization. Table 2 also includes a finer-grained analysis of novel word forms: words in the evaluation set whose exact morphological analysis never appeared in the training set. R&R again significantly outperforms both the baseline and geca-based data augmentation in the few-shot fut+past condition and the ordinary other condition, underscoring the effectiveness of this approach for “in-distribution” compositional generalization. Finally, the gains provided by learned augmentation and geca appear to be at least partially orthogonal: combining the geca + resampling and recomb-1 + resampling models gives further improvements in Spanish and Turkish.

5.3 Analysis

Samples from the best learned data augmentation models for scan and sigmorphon may be found in Section G.3. We programmatically analyzed 400 samples from recomb-2 models in scan and found that 40% of novel samples are exactly correct in the around right split and 74% in the jump split. A manual analysis of 50 Turkish samples indicated that only 14% of the novel samples were exactly correct. The augmentation procedure has a high error rate! However, our analysis found that malformed samples either (1) feature malformed $x$s that will never appear in a test set (a phenomenon also observed by Andreas (2020) for outputs of geca), or (2) are mostly correct at the token level (inducing predictions with a high $F_{1}$ score). Data augmentation thus contributes a mixture of irrelevant examples, label noise (which may exert a positive regularizing effect; Bishop, 1995), and well-formed examples, a small number of which are sufficient to induce generalization (Bastings et al., 2018). Without resampling, sigmorphon models generate almost no examples of rare tags.

A partial explanation is provided by the preceding analysis, which notes that the accuracy of the data augmentation procedure as a generative model is comparatively low. Additionally, the data augmentation procedure selects only the highest-confidence samples from the model, so the quality of predicted $y$ s conditioned on random $x$ s will in general be even lower. A conditional model trained on augmented data is able to compensate for errors in augmentation or direct inference (Table 12 in the Appendix).

One surprising feature of Table 2 is performance of the learned aug (basic) + resampling model. While less effective than the recombination-based models, augmentation with samples from an ordinary RNN trained on $(x,y)$ pairs improves performance for some test splits. One possible explanation is that resampling effectively acts as a posterior constraint on the final model’s predictive distribution, guiding it toward solutions in which rare tags are more probable than observed in the original training data. Future work might model this constraint explicitly, e.g. via posterior regularization (as in Li & Rush, 2020).

Conclusions

We have described a method for improving compositional generalization in sequence-to-sequence models via data augmentation with learned prototype recombination models. These are the first results we are aware of demonstrating that generative models of data are effective as data augmentation schemes in sequence-to-sequence learning problems, even when the generative models are themselves unreliable as base predictors. Our experiments demonstrate that it is possible to achieve compositional generalization on par with complex symbolic models in clean, highly structured domains, and outperform them in natural ones, with basic neural modeling tools and without symbolic representations.

We thank Eric Chu for feedback on early drafts of this paper. This work was supported by a hardware donation from NVIDIA under the NVAIL grant program. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center (Reuther et al., 2018) for providing HPC resources that have contributed to the research results reported within this paper.

Appendix A Model Architecture

We use a single layer BiLSTM network to encode $h_{\textrm{proto}}^{kj}$ as follows:

The hidden and embedding sizes are 1024. No dropout is applied. We project bi-directional embeddings to the hidden size with a linear projection. We concatenate the backward and the forward hidden states.

We choose the hidden size as 512, and embedding size as 64. We apply 0.5 dropout to the input. We project hidden vectors in the attention mechanism.

A.2 Decoder

The decoder is implemented by a single layer. In addition to the hidden state and memory cell, we also carry a feed vector through time:

The input to the LSTM decoder at time step $t$ is the concatenation of the previous token’s representation, previous feed vector, and a latent $z$ vector (in the VAE model).

We use a single-layer LSTM network with a hidden size of 1024 and an embedding size of 1024. We initialize the decoder hidden states with the final hidden states of the BiLSTM encoder. feed is the same size as the hidden state. No dropout is applied in the decoder. Output calculations are provided in Eq. 16 of the paper. The query vector for the attention is identically the hidden state:

Further details of the attention are provided in Section A.3.

The decoder is implemented by a single layer LSTM network with hidden size of 512, and embedding size of 64. The embedding parameters are shared with the encoder. Here the size of the feed vector is equal to embedding size, 64.

We have no self-attention for this decoder in the feed vector. There is an attention projection with dimension 128. The details of the attention mechanism are given in Section A.3. Finally, we use the transpose of the embedding matrix to project feed to the output space.

$output^{t}$ contains unnormalized scores before the final softmax layer. We apply 0.7 dropout to $h_{out}^{t}$ during both training and test. The copy mechanism will be further described in Section A.3.

The input to the LSTM decoder is the same as Eq. 25 except the decoder embedding matrix, $W_{d}$ , shares parameters with encoder embedding matrix $W_{e}$ . We applied 0.5 dropout to the embeddings $d_{t-1}$ .

The query vector for the attention is calculated by:

A.3 Attention and Copying

We use the attention mechanism described in Vaswani et al. (2017) with slight modifications.

We use a linear transformation for key while retaining an embedding size of 1024, and leave query and value transformations as the identity. We do not normalize by the square root of the attention dimension. The query vector is described in the decoder Section A.2. The copy mechanism for the morphology task is explained in the paper in detail.

We use the nonlinear tanh transformation for key, query and value. Note that the attention scores are calculated separately for each prototype using different parameters, and that the normalization (i.e. obtaining the $\alpha$'s) is also performed separately for each prototype.

The copy mechanism for this task is slightly different and follows Gu et al. (2016). We normalize prototype attention scores and output scores jointly. Let $\bar{\alpha}_{i}$ represent attention weights for each prototype sequence before normalization. Then, we concatenate them to the output vector in Eq. 27.

We obtain a probability vector via a final softmax layer:

The size of this probability vector is the vocabulary size plus the total length of all prototypes. We then project this into the output space by:

where indices returns all positions in $prob^{t}$ that correspond to token $w$. There may be more than one such position for a given $w$, because one score can come from the $output^{t}$ region and others from the prototype regions of $prob^{t}$. During training we applied 0.5 dropout to the indices from $output^{t}$. Thus, the model is encouraged to copy more.
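The joint normalization described above can be sketched as follows: vocabulary scores and per-position prototype scores are concatenated, passed through a single softmax, and then summed per token. The function names and the toy inputs are hypothetical, and the dropout on the $output^{t}$ indices is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_copy_distribution(output_scores, proto_scores, proto_token_ids):
    """Normalize vocabulary and prototype scores jointly, then sum mass per token.

    output_scores:   one unnormalized score per vocabulary entry
    proto_scores:    one list of unnormalized attention scores per prototype
    proto_token_ids: vocabulary id of each prototype position (same shapes)
    """
    flat_scores = list(output_scores)
    flat_ids = list(range(len(output_scores)))
    for scores, ids in zip(proto_scores, proto_token_ids):
        flat_scores.extend(scores)
        flat_ids.extend(ids)
    prob = softmax(flat_scores)  # length = vocabulary size + total prototype length
    final = [0.0] * len(output_scores)
    for p, token_id in zip(prob, flat_ids):
        final[token_id] += p  # several positions may map to the same token
    return final


# Toy input: 3-token vocabulary, two prototypes, all scores equal.
final = joint_copy_distribution(
    output_scores=[0.0, 0.0, 0.0],
    proto_scores=[[0.0, 0.0], [0.0]],
    proto_token_ids=[[0, 0], [2]],
)
```

With all six scores equal, each position receives mass 1/6, so token 0 (one vocabulary slot plus two prototype positions) ends up with probability 1/2.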

Appendix B Neighborhoods and Sampling

In Eq. 20 and Eq. 21 we expressed the generic form of the neighborhood sets. Here we provide the implementation details.

In the jump split, we use long-short recombination with $\delta=0.5$. In around right we use long-long recombination with $\delta=0.5$, and construct $\Omega$ so that the first and second prototypes differ by a single token. We randomly pick $k<10\times 3$ (10 different first prototypes, and 3 different second prototypes for each of them) prototype pairs that satisfy these conditions. For the recomb-1 experiment, we use the same neighborhood setup but consider only the $k<10$ first prototypes.

In the jump split, we use beam search with beam size 4 in the decoder. We calculate the mean and standard deviation of the lengths of the first ( $d^{\prime}_{1}$ ) and second ( $d^{\prime}_{2}$ ) prototypes in the training set. Then, during sampling, we restrict ourselves to first and second prototypes whose lengths are shorter than their respective mean plus standard deviation. This decision is based on the fact that the part of $\Omega$ that the model is exposed to is determined by the empirical distribution $\hat{\Omega}$ that arises from the training neighborhoods. When sampling, we try to pick prototypes whose properties are close to those of that empirical distribution. In around right, we use temperature sampling with $T=0.4$ . If a model cannot produce the expected number of novel and unique samples within a reasonable time, we increase the temperature $T$ .
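The length criterion above can be illustrated with a short sketch. The function names are hypothetical, prototypes are assumed to be token lists, and the use of the population standard deviation (rather than the sample standard deviation) is our assumption, since the text does not specify which is used.

```python
from statistics import mean, pstdev


def length_threshold(train_prototypes):
    """Mean plus standard deviation of prototype lengths in the training set.

    Assumption: population standard deviation (pstdev).
    """
    lengths = [len(p) for p in train_prototypes]
    return mean(lengths) + pstdev(lengths)


def filter_by_length(candidates, train_prototypes):
    """Keep only candidates strictly shorter than the threshold."""
    limit = length_threshold(train_prototypes)
    return [c for c in candidates if len(c) < limit]
```

In this sketch the filter would be applied once to the pool of first prototypes and once to the pool of second prototypes, each with its own threshold.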

We use long-long recombination, as explained in the paper, with slight modifications which leverage the structure of the task. We set $\Omega$ as:

For the recomb-1 model $\mathcal{N}(d)$ utilizes tag similarity, lemma similarity and is constructed using a score function:

Given $d$ , we sort training examples by using $score_{1}$ as the comparison key and pick the four smallest neighbors (using a lexicographic sort) to form $\mathcal{N}(d)$ .

For the recomb-2 model, $\mathcal{N}(d)$ uses the same score function for the first prototype as in the recomb-1 case. The second prototype is selected using:

Given $x$ and a scored first prototype, we perform one more sort over the training examples using $\textrm{score}_{2}$ as the comparison key. Then we pick the first four neighbors for $\mathcal{N}(d)$ .
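The selection step, sorting by a tuple-valued key and keeping the four smallest, can be sketched as below. The score used here is a hypothetical stand-in: it orders candidates first by a tag dissimilarity and then by lemma edit distance, which is only one plausible reading of "tag similarity, lemma similarity". The dictionary layout of a datum is likewise an assumption.

```python
def tag_distance(tags_a, tags_b):
    """Hypothetical tag dissimilarity: number of tags not shared by both."""
    return len(set(tags_a) ^ set(tags_b))


def lemma_distance(a, b):
    """Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def nearest_neighbors(d, train, k=4):
    """Return the k training examples with the smallest score tuple.

    Python compares tuples lexicographically, so the tag term dominates
    and the lemma term breaks ties (an assumed ordering).
    """
    def score(other):
        return (tag_distance(d["tags"], other["tags"]),
                lemma_distance(d["lemma"], other["lemma"]))

    candidates = [o for o in train if o is not d]
    return sorted(candidates, key=score)[:k]
```

A second-prototype selection would follow the same pattern with a different key function in place of `score`.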

We use a mixed strategy of temperature sampling with $T=0.5$ and greedy decoding: we use the former for $d_{\textrm{input}}$ and the latter for $d_{\textrm{output}}$ . We sample 180 unique and novel examples.
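A minimal sketch of this mixed decoding rule for a single step is given below. The function names and the `in_output` flag are hypothetical; the logits are assumed to be the decoder's unnormalized scores over the vocabulary at one time step.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def greedy(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


def pick_token(logits, in_output, temperature, rng):
    """Temperature sampling for the input part, greedy for the output part."""
    if in_output:
        return greedy(logits)
    return sample_with_temperature(logits, temperature, rng)
```

Lower temperatures concentrate mass on the highest-scoring tokens, so raising $T$ is a natural way to obtain more diverse samples when too few novel ones are found.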

Appendix C Generative Model Training

All of the hyperparameters mentioned here are optimized by a grid search on the Spanish validation set. We train our models for 25 epochs. When training 2-proto and 1-proto models, we increment the epoch counter once the entire neighborhood for every $d$ has been processed; for 0-proto, one epoch is defined canonically, i.e. as one pass over the entire training set. We use the Adam optimizer with learning rate 0.0001. The generative model is trained on the morphological reinflection order ( $d_{\textrm{lemma}}d_{\textrm{tags}}\triangleright d_{\textrm{inflection}}$ ) from left to right; the samples from the model are then reordered for the morphological analysis task ( $d_{\textrm{inflection}}\triangleright d_{\textrm{lemma}}d_{\textrm{tags}}$ ).

We use different numbers of epochs for the jump and around right splits: all models are trained for 8 epochs in the former and 3 epochs in the latter. We use the Adam optimizer with learning rate 0.002 and gradient norm clipping at 1.0.

Appendix D Seq2Seq Baseline Model

After generating novel samples, we either concatenate them to the training data (in morphology), or sample training batches from a mixture of the original training data and the augmented data. Our conditional model is the same as the generative model used in morphology experiments, described in detail in the paper body, replacing $d_{\textrm{proto}}$ with $x$ , and $d$ with $y$ .

Every conditional model’s size is the same as that of the corresponding generative model used for augmentation. This ensures that the conditional model and the generative model have the same capacity. We train conditional models for 150 epochs for SCAN, and we use augmentation ratios of $p_{\textrm{aug}}=0.01$ and $p_{\textrm{aug}}=0.2$ in jump and around right, respectively. For morphology, we train the conditional models for 100 epochs, and we use all generated examples for augmentation.
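One way to draw training batches from a mixture with ratio $p_{\textrm{aug}}$ is sketched below. Mixing per example (rather than per batch) is our assumption, and the function name and arguments are hypothetical.

```python
import random


def sample_batch(original, augmented, p_aug, batch_size, rng):
    """Draw a batch where each example is augmented with probability p_aug.

    Assumption: the mixture is applied independently per example.
    """
    batch = []
    for _ in range(batch_size):
        source = augmented if rng.random() < p_aug else original
        batch.append(rng.choice(source))
    return batch
```

Under this reading, $p_{\textrm{aug}}=0.01$ means that roughly one example in a hundred comes from the generated data.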

Appendix E Direct Inference

To adapt the prototype-based model for conditional prediction, we condition the neighborhood function on the input $x$ rather than the full datum $d$ , as in Hashimoto et al. (2018). Candidate $y$ s are then sampled from the generative model given the observed $x$ while marginalizing over retrieved prototypes. Finally, we re-rank these candidates via Eq. 7 and output the highest-scoring candidate.

Appendix F VAE Model

We use the same prior as Guu et al. (2018) given in Eq. 31. In this prior, $z$ is defined by a norm and direction vector. The norm is sampled from the uniform distribution between zero and a maximum possible norm $\mu_{\textrm{max}}=10.0$ , and the direction is sampled uniformly from the unit hypersphere. This sampling procedure corresponds to a von Mises–Fisher distribution with concentration parameter zero.

For SCAN, the size of $z$ is 32, and for morphology the size of $z$ is 2.
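Sampling from this prior can be sketched as follows. The function name is hypothetical, and drawing the direction by normalizing a standard Gaussian vector is a standard way to sample uniformly from the unit hypersphere that we assume here; the text does not state how the direction is drawn in practice.

```python
import math
import random


def sample_prior(dim, max_norm=10.0, rng=None):
    """Sample z = norm * direction, with norm ~ U(0, max_norm) and the
    direction uniform on the unit hypersphere."""
    rng = rng or random.Random()
    norm = rng.uniform(0.0, max_norm)
    # A normalized isotropic Gaussian vector is uniform on the sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(x * x for x in g))
    return [norm * x / length for x in g]
```

For example, `sample_prior(32)` would match the SCAN latent size and `sample_prior(2)` the morphology latent size.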

Similarly to the prior, the posterior network decomposes $z$ into its norm and direction. The norm is sampled from a uniform distribution over the interval $(|\mu|,\textrm{min}(|\mu|+\epsilon,\mu_{\textrm{max}}))$ , and the direction is sampled from the von Mises–Fisher distribution $\textrm{vMF}(\mu,\kappa)$ where $\kappa=25,\epsilon=1.0$ .

Appendix G Additional Results

In the paper, Table 2 shows morphology results for (non-VAE) models with 8 hints (past- and future-tense examples in the training set). Here, we provide additional results for different hint set sizes and model variants.

G.1.2 hints=8

The main 8-hint $F_{1}$ results are provided in the body of the paper. Here we provide exact match results and an extra set of comparisons to the VAE model.

G.1.3 hints=16

G.2 Significance Tests

Tables 9, 10 and 11 show the $p$ -values for pairwise differences between the baseline and prototype-based models.

G.3 Generated Samples

All samples are randomly selected unless otherwise indicated.

In Table 12, we present three test samples from the SCAN task along with the predictions by direct inference and the conditional model trained on the augmented data with recomb-2. Note that the augmentation procedure was able to create novel samples whose input ( $x$ ) happens to be in the test set (Examples 1 and 3) while $y$ may or may not be correct (Example 1).

Below are a set of samples from the learned aug (basic) model for SCAN dataset’s jump and around right splits, in order:

IN: run opposite and walk opposite right twice OUT: RUN TURN RIGHT TURN RIGHT RUN TURN RIGHT TURN RIGHT WALK
IN: look around right thrice after run around thrice thrice OUT: TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT LOOK
IN: look opposite right twice and walk around twice OUT: TURN RIGHT TURN RIGHT LOOK TURN RIGHT TURN RIGHT LOOK TURN RIGHT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK
IN: run opposite and thrice OUT: RUN TURN LEFT RUN RUN
IN: walk opposite right thrice turn turn right thrice OUT: TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT TURN RIGHT WALK
IN: jump opposite right twice jump look around left OUT: TURN RIGHT TURN RIGHT JUMP TURN LEFT TURN LEFT JUMP TURN LEFT TURN LEFT LOOK TURN LEFT TURN LEFT LOOK
IN: walk around left thrice after jump left left OUT: TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK
IN: run opposite right twice walk run left thrice OUT: TURN RIGHT TURN RIGHT RUN TURN RIGHT TURN RIGHT WALK TURN LEFT TURN LEFT RUN TURN LEFT TURN LEFT RUN

Below are a set of samples from the recomb-1 model for SCAN dataset’s around right split. Note that there were no samples with rare tags generated by the model for the jump split:

IN: run around right after walk around left OUT: TURN LEFT WALK TURN LEFT WALK TURN LEFT RUN TURN LEFT RUN TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK TURN LEFT WALK
IN: look around right after jump around left OUT: TURN LEFT LOOK TURN LEFT JUMP TURN LEFT LOOK TURN LEFT LOOK TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP
IN: look around right and jump around left OUT: TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT LOOK TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP
IN: walk around right and turn right twice OUT: TURN RIGHT WALK TURN RIGHT WALK TURN RIGHT WALK TURN RIGHT WALK TURN RIGHT TURN RIGHT

Below are 4 samples from the recomb-2 model for each of SCAN dataset’s jump and around right splits, respectively:

IN: jump opposite left thrice after jump opposite left thrice OUT: TURN LEFT TURN LEFT JUMP TURN LEFT TURN LEFT JUMP TURN LEFT TURN LEFT JUMP TURN LEFT TURN LEFT WALK TURN LEFT TURN LEFT WALK TURN LEFT TURN LEFT WALK
IN: jump left thrice and jump left thrice OUT: TURN LEFT LOOK TURN LEFT LOOK TURN LEFT LOOK TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP
IN: jump opposite right and turn around left OUT: TURN RIGHT TURN RIGHT JUMP TURN LEFT TURN LEFT TURN LEFT TURN LEFT
IN: turn around left and jump around left OUT: TURN LEFT TURN LEFT TURN LEFT TURN LEFT TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP
IN: look right twice after run around right OUT: TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT LOOK TURN RIGHT LOOK
IN: turn right twice after look around right OUT: TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT LOOK TURN RIGHT TURN RIGHT
IN: look twice and run around right OUT: LOOK LOOK TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN TURN RIGHT RUN
IN: walk opposite right twice and jump around right OUT: TURN RIGHT TURN RIGHT WALK TURN RIGHT TURN RIGHT WALK TURN RIGHT JUMP TURN RIGHT JUMP TURN RIGHT JUMP TURN RIGHT JUMP

G.3.2 Morphology

Below are a set of samples from the learned aug (basic) model in sigmorphon format.

şahmiçe şahmiçende N;LOC;SG;PSS2S
karadan havaya füze karadan havaya füzel N;DAT;PL;PSS3P
ernek erneklerine N;DAT;PL;PSS3P
kiler kilerime N;DAT;SG;PSS1S
mahlep mahlebimizi N;ACC;SG;PSS1P
süzmek süzerler V;IND;3;PL;PRS;POS;DECL
âlap âlaps N;LGSPEC1;3S;SG;PRS
jöle jöleleri N;ACC;PL
envejecerse envejeciéndose V.CVB;PRS
colaxar colaxa V;IND;PRS;3;SG
pergedrer no pergedremos V;NEG;IMP;1;PL
mantear no mantees V;NEG;IMP;2;SG
flaguear no flagueen V;NEG;IMP;3;PL
malacostar malacostaría V;COND;3;SG
desinstar desinse V;POS;IMP;3;SG
concretizar no concretices V;NEG;IMP;2;SG

Below are a set of samples from the learned aug (basic) + resampling model.

şaşırmak şaşırmıyor musun? V;IND;2;PL;PST;PROG;POS;INTR
ayılmak ayılmaya V;IND;1;SG;PST;DECL
pleşmek pleşmiyor muyuz? V;IND;1;PL;PST;PROG;NEG;INTR
imciyetmek imciyetmezdeğiz V;IND;1;PL;FUT;NEG;DECL
kuvaşmak kuvaşmayacağız V;IND;1;PL;FUT;NEG;DECL
yermek yermeyeceğiz V;IND;1;PL;FUT;NEG;DECL
yarıtmak yarıtmayacağız V;IND;1;PL;FUT;NEG;DECL
kelimek kelimeyeceğiz V;IND;1;PL;FUT;NEG;DECL
trasescar trasescáis V;IND;PST;2;PL;IPFV
tronar tronar V;IND;FUT;1;SG
terzcalminar terzcalminan V;IND;PST;3;PL;IPFV
esubronizar esubronizamos V;IND;PST;1;PL;IPFV
urdir urdiremos V;IND;FUT;1;PL
conder conderemos V;IND;FUT;1;PL
florear florearían V;IND;PST;3;PL;LGSPEC1;SG
sabrordar sabrordamos V;IND;PST;1;PL;IPFV

Below are a set of samples from the recomb-1 + resampling model (the best performing model in Table 2). Here we additionally annotate samples with error categories.

kovulmak kovulmaz mısınız V;IND;2;PL;FUT;NEG;INTR (Inflection and tags don’t match.)
düşünmek düşündüler V;IND;3;PL;PST;POS;DECL (Correct and novel.)
sütmek sütmez miyiz? V;IND;2;SG;FUT;NEG;INTR (Inflection and tags don’t match.)
bakmak bakmayacak mıyım? V;IND;1;PL;FUT;NEG;INTR (Inflection and tags don’t match.)
döndürmek döndürecek misiniz? V;IND;2;PL;FUT;POS;INTR (Correct and novel.)
türkçeleştirtmek türkçeleştirtiyor m V;IND;2;PL;PST;PROG;NEG;INTR (Wrong inflection, novel tag.)
çalmak çalmayız V;IND;2;PL;FUT;POS;DECL (Inflection and tags don’t match.)
üsürmek üsürmezsin V;IND;2;SG;PST;NEG;DECL (Inflection and tags don’t match.)
duplicar duplicaráis V;IND;FUT;2;PL (Correct and novel.)
efundar efundan V;SBJV;FUT;3;PL (Inflection and tags don’t match.)
deshumanizar deshumanicas V;SBJV;PST;2;SG (Implausible inflection.)
emular emulares V;SBJV;FUT;2;SG (Correct and also in train set.)
languidecer languidecíamos V;IND;PST;1;SG;IPFV (Inflection and tags don’t match.)
nominar nominamos V;SBJV;FUT;1;PL (Novel tags, incorrect inflection.)
finciar finciare V;SBJV;FUT;1;SG (Correct and novel.)
abastar abasto V;IND;PST;1;SG (Inflection and tags don’t match.)

G.4 Attention Heatmap

Here we provide a visualization of the copy and attention mechanisms in the recomb-2 model for the SCAN experiments.

Appendix H Compute

We use a single 32GB NVIDIA V100 Volta GPU for each experiment. For every experiment, the whole pipeline, which consists of training the generative model, sampling, and training the conditional model, takes less than an hour.