Learning to Parse and Translate Improves Neural Machine Translation

Akiko Eriguchi, Yoshimasa Tsuruoka, Kyunghyun Cho

Introduction

Neural Machine Translation (NMT) has enjoyed impressive success without relying on much, if any, prior linguistic knowledge. Some of the most recent studies have for instance demonstrated that NMT systems work comparably to other systems even when the source and target sentences are given simply as flat sequences of characters (Lee et al., 2016; Chung et al., 2016) or statistically, not linguistically, motivated subword units (Sennrich et al., 2016; Wu et al., 2016). Shi et al. (2016) recently made an observation that the encoder of NMT captures syntactic properties of a source sentence automatically, indirectly suggesting that explicit linguistic prior may not be necessary.

On the other hand, there have only been a couple of recent studies showing the potential benefit of explicitly encoding the linguistic prior into NMT. Sennrich and Haddow (2016) for instance proposed to augment each source word with its corresponding part-of-speech tag, lemmatized form and dependency label. Eriguchi et al. (2016) instead replaced the sequential encoder with a tree-based encoder which computes the representation of the source sentence following its parse tree. Stahlberg et al. (2016) let the lattice from a hierarchical phrase-based system guide the decoding process of neural machine translation, which results in two separate models rather than a single end-to-end one. Despite the promising improvements, these explicit approaches are limited in that the trained translation model strictly requires the availability of external tools during inference time. More recently, researchers have proposed methods to incorporate target-side syntax into NMT models. Alvarez-Melis and Jaakkola (2017) have proposed a doubly-recurrent neural network that can generate a tree-structured sentence, but its effectiveness in a full scale NMT task is yet to be shown. Aharoni and Goldberg (2017) introduced a method to serialize a parsed tree and to train the serialized parsed sentences.

We propose to implicitly incorporate linguistic prior based on the idea of multi-task learning (Caruana, 1998; Collobert et al., 2011). More specifically, we design a hybrid decoder for NMT, called NMT+RNNGOur code is available at https://github.com/tempra28/nmtrnng., that combines a usual conditional language model and a recently proposed recurrent neural network grammars (RNNGs, Dyer et al., 2016). This is done by plugging in the conventional language model decoder in the place of the buffer in RNNG, while sharing a subset of parameters, such as word vectors, between the language model and RNNG. We train this hybrid model to maximize both the log-probability of a target sentence and the log-probability of a parse action sequence. We use an external parser (Andor et al., 2016) to generate target parse actions, but unlike the previous explicit approaches, we do not need it during test time.

We evaluate the proposed NMT+RNNG on four language pairs ({JP, Cs, De, Ru}-En). We observe significant improvements in terms of BLEU scores on three out of four language pairs and RIBES scores on all the language pairs.

Neural Machine Translation

Neural machine translation is a recently proposed framework for building a machine translation system based purely on neural networks. It is often built as an attention-based encoder-decoder network (Cho et al., 2015) with two recurrent networks—encoder and decoder—and an attention model. The encoder, which is often implemented as a bidirectional recurrent network with long short-term memory units (LSTM, Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRU, Cho et al., 2014), first reads a source sentence represented as a sequence of words ${\bm{x}}=(x_{1},x_{2},\ldots,x_{N})$ . The encoder returns a sequence of hidden states ${\bm{h}}=(h_{1},h_{2},\ldots,h_{N})$ . Each hidden state $h_{i}$ is a concatenation of those from the forward and backward recurrent network: $h_{i}=\left[\overrightarrow{h}_{i};\overleftarrow{h}_{i}\right]$ , where

$V_{x}(x_{i})$ refers to the word vector of the $i$ -th source word.

The decoder is implemented as a conditional recurrent language model which models the target sentence, or translation, as

where ${\bm{y}}=(y_{1},\ldots,y_{M})$ . Each of the conditional probabilities in the r.h.s is computed by

where $f_{\text{dec}}$ is a recurrent activation function, such as LSTM or GRU, and $W_{y}$ is the output word vector of the word $y$ .

$c_{j}$ is a time-dependent context vector that is computed by the attention model using the sequence ${\bm{h}}$ of hidden states from the encoder. The attention model first compares the current hidden state $s_{j}$ against each of the hidden states and assigns a scalar score: $\beta_{i,j}=\exp(h_{i}^{\top}{\bm{W}}_{d}s_{j})$ (Luong et al., 2015). These scores are then normalized across the hidden states to sum to 1, that is $\alpha_{i,j}=\frac{\beta_{i,j}}{\sum_{i}\beta_{i,j}}$ . The time-dependent context vector is then a weighted-sum of the hidden states with these attention weights: $c_{j}=\sum_{i}\alpha_{i,j}h_{i}$ .

Recurrent Neural Network Grammars

A recurrent neural network grammar (RNNG, Dyer et al., 2016) is a probabilistic syntax-based language model. Unlike a usual recurrent language model (see, e.g., Mikolov et al., 2010), an RNNG simultaneously models both tokens and their tree-based composition. This is done by having a (output) buffer, stack and action history, each of which is implemented as a stack LSTM (sLSTM, Dyer et al., 2015). At each time step, the action sLSTM predicts the next action based on the (current) hidden states of the buffer, stack and action sLSTM. That is,

where $W_{a}$ is the vector of the action $a$ . If the selected action is shift, the word at the beginning of the buffer is moved to the stack. When the reduce action is selected, the top-two words in the stack are reduced to build a partial tree. Additionally, the action may be one of many possible non-terminal symbols, in which case the predicted non-terminal symbol is pushed to the stack.

The hidden states of the buffer, stack and action sLSTM are correspondingly updated by

where $V_{y}$ and $V_{a}$ are functions returning the target word and action vectors. The input vector $r_{t}$ of the stack sLSTM is computed recursively by

where $r^{d}$ and $r^{p}$ are the corresponding vectors of the parent and dependent phrases, respectively (Dyer et al., 2015). This process is iterated until a complete parse tree is built. Note that the original paper of RNNG (Dyer et al., 2016) uses constituency trees, but we employ dependency trees in this paper. Both types of trees are represented as a sequence of the three types of actions in a transition-based parsing model.

When the complete sentence is provided, the buffer simply summarizes the shifted words. When the RNNG is used as a generator, the buffer further generates the next word when the selected action is shift. The latter can be done by replacing the buffer with a recurrent language model, which is the idea on which our proposal is based.

Learning to Parse and Translate

Our main proposal in this paper is to hybridize the decoder of the neural machine translation and the RNNG. We continue from the earlier observation that we can replace the buffer of RNNG to a recurrent language model that simultaneously summarizes the shifted words as well as generates future words. We replace the RNNG’s buffer with the neural translation model’s decoder in two steps.

Learning and Inference

2 Knowledge Distillation for Parsing

A major challenge in training the proposed hybrid model is that there is not a parallel corpus augmented with gold-standard target-side parse, and vice versa. In other words, we must either parse the target-side sentences of an existing parallel corpus or translate sentences with existing gold-standard parses. As the target task of the proposed model is translation, we start with a parallel corpus and annotate the target-side sentences. It is however costly to manually annotate any corpus of reasonable size (Table 6 in Alonso et al., 2016).

We instead resort to noisy, but automated annotation using an existing parser. This approach of automated annotation can be considered along the line of recently proposed techniques of knowledge distillation (Hinton et al., 2015) and distant supervision (Mintz et al., 2009). In knowledge distillation, a teacher network is trained purely on a training set with ground-truth annotations, and the annotations predicted by this teacher are used to train a student network, which is similar to our approach where the external parser could be thought of as a teacher and the proposed hybrid network’s RNNG as a student. On the other hand, what we propose here is a special case of distant supervision in that the external parser provides noisy annotations to otherwise an unlabeled training set.

Specifically, we use SyntaxNet, released by Andor et al. (2016), on a target sentence.When the target sentence is parsed as data preprocessing, we use all the vocabularies in a corpus and do not cut off any words. We use the plain SyntaxNet and do not train it furthermore. We convert a parse tree into a sequence of one of three transition actions (SHIFT, REDUCE-L, REDUCE-R). We label each REDUCE action with a corresponding dependency label and treat it as a more fine-grained action.

Experiments

We compare the proposed NMT+RNNG against the baseline model on four different language pairs–Jp-En, Cs-En, De-En and Ru-En. The basic statistics of the training data are presented in Table 1. We mapped all the low-frequency words to the unique symbol “UNK” and inserted a special symbol “EOS” at the end of both source and target sentences.

We use the ASPEC corpus (“train1.txt”) from the WAT’16 Jp-En translation task. We tokenize each Japanese sentence with KyTea (Neubig et al., 2011) and preprocess according to the recommendations from WAT’16 (WAT, 2016). We use the first 100K sentence pairs of length shorter than 50 for training. The vocabulary is constructed with all the unique tokens that appear at least twice in the training corpus. We use “dev.txt” and “test.txt” provided by WAT’16 respectively as development and test sets.

Cs, De and Ru

We use News Commentary v8. We removed noisy metacharacters and used the tokenizer from Moses (Koehn et al., 2007) to build a vocabulary of each language using unique tokens that appear at least 6, 6 and 5 times respectively for Cs, Ru and De. The target-side (English) vocabulary was constructed with all the unique tokens appearing more than three times in each corpus. We also excluded the sentence pairs which include empty lines in either a source sentence or a target sentence. We only use sentence pairs of length 50 or less for training. We use “newstest2015” and “newstest2016” as development and test sets respectively.

2 Models, Learning and Inference

In all our experiments, each recurrent network has a single layer of LSTM units of 256 dimensions, and the word vectors and the action vectors are of 256 and 128 dimensions, respectively. To reduce computational overhead, we use BlackOut (Ji et al., 2015) with 2000 negative samples and $\alpha=0.4$ . When employing BlackOut, we shared the negative samples of each target word in a sentence in training time Hashimoto and Tsuruoka (2017), which is similar to the previous work (Zoph et al., 2016). For the proposed NMT+RNNG, we share the target word vectors between the decoder (buffer) and the stack sLSTM.

Each weight is initialized from the uniform distribution $\left[-0.1,0.1\right]$ . The bias vectors and the weights of the softmax and BlackOut are initialized to be zero. The forget gate biases of LSTMs and Stack-LSTMs are initialized to 1 as recommended in Józefowicz et al. (2015). We use stochastic gradient descent with minibatches of 128 examples. The learning rate starts from $1.0$ , and is halved each time the perplexity on the development set increases. We clip the norm of the gradient (Pascanu et al., 2012) with the threshold set to 3.0 (2.0 for the baseline models on Ru-En and Cs-En to avoid NaN and Inf). When the perplexity of development data increased in training time, we halved the learning rate of stochastic gradient descent and reloaded the previous model. The RNNG’s stack computes the vector of a dependency parse tree which consists of the generated target words by the buffer. Since the complete parse tree has a “ROOT” node, the special token of the end of a sentence (“EOS”) is considered as the ROOT. We use beam search in the inference time, with the beam width selected based on the development set performance.

It took about 15 minutes per epoch and about 20 minutes respectively for the baseline and the proposed model to train a full JP-EN parallel corpus in our implementation.We run all the experiments on multi-core CPUs (10 threads on Intel(R) Xeon(R) CPU E5-2680 v2 @2.80GHz)

3 Results and Analysis

In Table 2, we report the translation qualities of the tested models on all the four language pairs. We report both BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010). Except for De-En, measured in BLEU, we observe the statistically significant improvement by the proposed NMT+RNNG over the baseline model. It is worthwhile to note that these significant improvements have been achieved without any additional parameters nor computational overhead in the inference time.

Since each component in RNNG may be omitted, we ablate each component in the proposed NMT+RNNG to verify their necessity. Since the buffer is the decoder, it is not possible to completely remove it. Instead we simply remove the dependency of the action distribution on it. As shown in Table 3, we see that the best performance could only be achieved when all the three components were present. Removing the stack had the most adverse effect, which was found to be the case for parsing as well by Kuncoro et al. (2017).

Generated Sentences with Parsed Actions

The decoder part of our proposed model consists of two components: the NMT decoder to generate a translated sentence and the RNNG decoder to predict its parsing actions. The proposed model can therefore output a dependency structure along with a translated sentence. Figure 1 shows an example of JP-EN translation in the development dataset and its dependency parse tree obtained by the proposed model. The special symbol (“EOS”) is treated as the root node (“ROOT”) of the parsed tree. The translated sentence was generated by using beam search, which is the same setting of NMT+RNNG shown in Table 3. The parsing actions were obtained by greedy search. The resulting dependency structure is mostly correct but contains a few errors; for example, dependency relation between “The” and “ transition” should not be “pobj”.

Conclusion

We propose a hybrid model, to which we refer as NMT+RNNG, that combines the decoder of an attention-based neural translation model with the RNNG. This model learns to parse and translate simultaneously, and training it encourages both the encoder and decoder to better incorporate linguistic priors. Our experiments confirmed its effectiveness on four language pairs ({JP, Cs, De, Ru}-En). The RNNG can in principle be trained without ground-truth parses, and this would eliminate the need of external parsers completely. We leave the investigation into this possibility for future research.

Acknowledgments

We thank Yuchen Qiao and Kenjiro Taura for their help to speed up the implementations of training and also Kazuma Hashimoto for his valuable comments and discussions. This work was supported by JST CREST Grant Number JPMJCR1513 and JSPS KAKENHI Grant Number 15J12597 and 16H01715. KC thanks support by eBay, Facebook, Google and NVIDIA.