Adversarial Feature Matching for Text Generation

Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin

Introduction

Generating meaningful and coherent sentences is central to many natural language processing applications. The general idea is to estimate a distribution over sentences from a corpus, then use it to sample realistic-looking sentences. This task is important because it enables generation of novel sentences that preserve the semantic and syntactic properties of real-world sentences, while being potentially different from any of the examples used to estimate the model. For instance, in the context of dialog generation, it is desirable to generate answers that are more diverse and less generic (Li et al., 2016).

One simple approach consists of first learning a latent space to represent (fixed-length) sentences using an encoder-decoder (autoencoder) framework based on Recurrent Neural Networks (RNNs) (Cho et al., 2014; Sutskever et al., 2014), then generate synthetic sentences by decoding random samples from this latent space. However, this approach often fails to generate realistic sentences from arbitrary latent representations. The reason for this is that, when mapping sentences to their latent representations using an autoencoder, the mappings usually cover a small but structured region of the latent space, which corresponds to a manifold embedding (Bowman et al., 2016). In practice, most regions of the latent space do not necessarily map (decode) to realistic sentences. Consequently, randomly sampling latent representations often yields nonsensical sentences. Recent work by Bowman et al. (2016) has attempted to generate more diverse sentences via RNN-based variational autoencoders. However, they did not address the fundamental problem that the posterior distribution over latent variables does not appropriately cover the latent space.

Another underlying challenge of generating realistic text relates to the nature of the RNN. During inference, the RNN generates words in sequence from previously generated words, contrary to learning, where ground-truth words are used every time. As a result, error accumulates proportional to the length of the sequence, i.e., the first few words look reasonable, however, quality deteriorates quickly as the sentence progresses. Bengio et al. (2015) coined this phenomenon exposure bias. Toward addressing this problem, Bengio et al. (2015) proposed the scheduled sampling approach. However, Huszár (2015) showed that scheduled sampling is a fundamentally inconsistent training strategy, in that it produces largely unstable results in practice.

The Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is an appealing and natural answer to the above issues. GAN matches the distributions of synthetic and real data by introducing an adversarial game between a generator and a discriminator. The GAN objective seeks to constitute a generator, that functionally maps samples from a given (simple) prior distribution, to synthetic data that appear to be realistic. The GAN setup explicitly seeks that the latent representations from real data (via encoding) be distributed in a manner consistent with the specified prior (e.g., Gaussian or uniform). Due to the nature of adversarial training, the discriminator compares real and synthetic sentences, rather than their individual words, which in principle should alleviate the exposure-bias issue. Recent work (Lamb et al., 2016) has incorporated an additional discriminator to train a sequence-to-sequence language model that better preserves long-term dependencies.

Effort has also been made to generate realistic-looking sentences via adversarial training. For instance, by borrowing ideas from reinforcement learning, Yu et al. (2017); Li et al. (2017) treat the sentence generation as a sequential decision making process. Despite the success of these methods, two fundamental problems of the GAN framework limit their use in practice: (i) the generator tends to produce a single observation for multiple latent representations, i.e., mode collapsing (Metz et al., 2017), and (ii) the generator’s contribution to the learning signal is insubstantial when the discriminator is close to its local optimum, i.e., vanishing gradient behavior (Arjovsky & Bottou, 2017).

In this paper we propose a new framework, TextGAN, to alleviate the problems associated with generating realistic-looking sentences via GAN. Specifically, the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) RNN is used as generator, and the Convolutional Neural Network (CNN) (Kim, 2014) is used as discriminator. We consider a kernel-based moment-matching scheme over a Reproducing Kernel Hilbert Space (RKHS), to force the empirical distributions of real and synthetic sentences to have matched moments in latent-feature space. As a consequence, our approach ameliorates the mode-collapsing issue associated with standard GAN training. This strategy encourages the model to learn representations that are both informative of the original sentences (via the autoencoder) and discriminative w.r.t. synthetic sentences (via the discriminator). We also propose several complementary techniques, including initialization strategies and discretization approximations to ease GAN training, and to achieve superior performance compared to related approaches.

Model

GAN (Goodfellow et al., 2014) aims to obtain the equilibrium of the following optimization objective

2 TextGAN

Given a sentence corpus $\mathcal{S}$ , instead of directly optimizing the objective from standard GAN in (1), we adopt an approach that is similar to the feature matching scheme of Salimans et al. (2016). Specifically, we consider the objective

In the situation for which simple features are enough for the discrimination/reconstruction task, this additional loss seeks to estimate complex features that are difficult for the current generator, thus improving in terms of generation ability. In our experience, we find the reconstruction and MMD loss in D serve as regularizer to the binary classification loss, in that by adding these losses, discriminator features tend to be more spread-out in the feature space.

In summary, the adversarial game associated with (2) and (3) is the following: $D(\cdot)$ attempts to select informative sentence features, while $G(\cdot)$ aims to match these features. Parameters $\lambda_{r}$ and $\lambda_{m}$ act as trade-off between discrimination ability, and reconstruction and moment matching precision, respectively. We argue that this framework has several advantages over the standard GAN objective in (1).

Additionally, seeking to match the sentence features provides a more achievable and informative objective than directly trying to mislead the discriminator as in standard GAN. Specifically, the loss in (3) implies a clearer aim for the generator, as it requires matching the latent features (distribution-wise) as opposed to uniquely trying to fake a binary classifier.

Note that if the latent features from real and synthetic data have similar distributions it is unlikely that the discriminator, that uses these features as inputs, will be able to tell them apart. Implementation-wise, the updating signal from the generator does not need to propagate all the way back from the discriminator, but rather directly from the features layer, thus less prone to fading. We believe there may be other possible approaches for text generation using GAN, however, we hope to provide a first attempt toward overcoming some of the difficulties associated with it.

3 Alternative (data efficient) objectives

Gaussian covariance matching

We could also avoid using the kernel trick, as was used in (4). Instead, we can replace $\mathcal{L}_{MMD^{2}}$ by $\mathcal{L}_{G}^{(c)}$ (below), where we accumulate (Gaussian) sufficient statistics from multiple minibatches, thus alleviating the inadequate-minibatch-size issue. Specifically,

4 Model specification

The above process describes how one feature is extracted from one filter. In practice, the model uses multiple filters with varying window sizes. Each filter can be considered as a linguistic feature detector that learns to recognize a specific class of $h$ -grams. Assume we have $m$ window sizes, and for each window size, we use $p$ filters, then we obtain a $mp$ -dimensional vector ${\bm{f}}$ to represent a sentence. On top of this $mp$ -dimensional feature vector, we specify a softmax layer to map the input sentence to an output $D({{\bf X}})\in$ , representing the probability of ${{\bf X}}$ being from the data distribution (real), rather than from the adversarial generator (synthesized).

There are other CNN architectures in the literature (Kalchbrenner et al., 2014; Hu et al., 2014; Johnson & Zhang, 2015). We adopt the CNN model of Kim (2014); Collobert et al. (2011) due to its simplicity and excellent performance on sentence classification tasks.

LSTM generator

5 Training Techniques

To train the generator $G(\cdot)$ , which contains discrete variables, direct application of the gradient estimation may be difficult (Yu et al., 2017). Score-function-based approaches, such as the REINFORCE algorithm (Williams, 1992), achieve unbiased gradient estimation for discrete variables using Monte Carlo estimation. However, in our experiments, we found that the variance of the gradient estimation is very large, which is consistent with Maddison et al. (2017). Here we consider a soft-argmax operator (Zhang et al., 2016), i.e., when performing learning, as an approximation to (7):

where $\odot$ represents the element-wise product. Note that when $L\rightarrow\infty$ , this approximation approaches (7).

Pre-training

Previous literature (Goodfellow et al., 2014; Salimans et al., 2016) has discussed the fundamental difficulty of training GANs using gradient-based methods. In general, gradient descent optimization schemes may fail to converge to the equilibrium by moving along the orbit trajectory among saddle points (Salimans et al., 2016). Intuitively, good initialization can facilitate convergence. Toward this end, we initialize the LSTM parameters of the generator by pre-training a standard CNN-LSTM autoencoder (Gan et al., 2016). For the discriminator/encoder initialization, we use a permutation training strategy. For each sentence in the corpus, we randomly swap two words to construct a slightly tweaked sentence counterpart. The discriminator is pre-trained to distinguish the tweaked sentences from the true sentences. The swapping operation is preferred here because it constitutes a much more challenging task for the discriminator to learn, compared to adding or deleting words, where the structure of real sentences is more strongly disrupted, thus making it easier for the discriminator. The permutation pre-training is important because it requires the discriminator to learn features characteristic of sentences’ long dependencies. We empirically found this provides a better initialization (compared to no pre-training) for the discriminator to learn good features.

We also utilized other training techniques to stabilize training, such as soft-labeling (Salimans et al., 2016). Details of these are provided in the Supplementary Material.

Related Work

Generative Moment Matching Networks (GMMNs) (Dziugaite et al., 2015; Li et al., 2015) are closely related to our approach. However, these methods either directly match the empirical distribution in the data domain, or extract features using a pre-trained autoencoder (Li et al., 2015). If the goal is to perform matching in the data domain when generating sentences, the dimensionality of input data would be $T\times k$ (higher than 10,000 in our case). Note that the minibatch size required to obtain reasonable statistical power grows linearly with the number of dimension (Ramdas et al., 2014), and the computational cost of MMD grows quadratically with the size of data points. Therefore, directly applying GMMNs is often computationally prohibitive. Furthermore, directly matching in the data domain via GMMNs implies word-by-word discrepancy, which yields less smooth gradients. This happens because a word-by-word discrepancy ignores sentence structure. For example, two sentences “a boy is swimming” and “boy is swimming” will be far apart in a word-by-word metric, when they are indeed close in a sentence-by-sentence feature space.

A two-step method, where a feature encoder is generated first as in Li et al. (2015) helps alleviate the problems above. However, in Li et al. (2015) the feature encoder is fixed once pre-trained, limiting the potential to adjust features during the training phase. Alternatively, our approach matches the real and synthetic data on a sentence feature space, where features are dynamically and adversarially adapted to focus on the most challenging features for the generator to mimic. In addition, features are designed to maintain both discrimination and reconstruction ability, instead of merely focusing on reconstruction as in Li et al. (2015).

Recent work considered combining autoencoders or variational autoencoders (Kingma & Welling, 2014) with GAN (Zhao et al., 2017; Larsen et al., 2016; Makhzani et al., 2015; Mescheder et al., 2017; Wang & Liu, 2016). They demonstrated superior performance on image generation. Our approach is similar to these approaches; however, we attempt to learn the reconstruction of the latent code, instead of the input data (sentences). Donahue et al. (2017); Dumoulin et al. (2016) learned a reverse mapping from data space to latent space. In our approach we enforce the discriminator and encoder to share a latent structure, with the aim of learning a representation for both discrimination and latent code reconstruction. Chen et al. (2016) maximized the mutual information between the generated data and the latent codes by leveraging a network-adapted variational proposal distribution. In our case, we minimize the distance between the original and reconstructed latent codes.

Our approach attempts to minimize a NN-based embedded MMD distance of two empirical distributions. Aside from MMD, kernel-based discrepancy metrics such as kernelized Stein discrepancy (Liu et al., 2016; Wang & Liu, 2016) have been shown to be computationally tractable, while maintaining statistical power. We leave the investigation of using Stein for moment matching as a promising future direction. Wasserstein GAN (Arjovsky et al., 2017) considers an Earth-Mover (EM) distance of the real data and synthetic data distribution, instead of the JSD as in standard GAN (Goodfellow et al., 2014) or TVD as in Zhao et al. (2017). The EM metric yields stable gradients, thus avoiding the collapsing mode and vanishing gradient problem of the latter two. We note that our approach is equivalent to minimizing a MMD loss over the data domain, however, with a NN-based embedded Gaussian kernel. As shown in Arjovsky et al. (2017), MMD is a proper metric when the kernel is universal. Because of the similarity of the conditions, our approach enjoys the advantages of Wasserstein GAN, namely, ameliorating the gradient vanishing problems.

Experiments

Our model is trained using a combination of two datasets: (i) the BookCorpus dataset (Zhu et al., 2015), which consists of 70 million sentences from over 7000 books; and (ii) the ArXiv dataset, which consists of 5 million sentences from abstracts of papers from various subjects, obtained from the arXiv website. The motivation for merging two different corpora is to investigate whether the model can generate sentences that integrate both scientific and informal writing styles. We randomly choose 0.5 million sentences from BookCorpus and 0.5 million sentences from arXiv to construct training and validation sets, i.e., 1 million sentences for each. For testing, we randomly select 25,000 sentences from both corpus, for a total of 50,000 sentences.

We train the generator and discriminator/encoder iteratively. Provided that the LSTM generator typically involves more parameters and is more difficult to train than the CNN discriminator, we perform one optimization step for the discriminator for every $K=5$ steps of the generator. We use a mixture of 5 isotropic Gaussian (RBF) kernels with different bandwidths $\sigma$ as in Li et al. (2015). Bandwidth parameters are selected to be close to the median distance (in our case around 20) of feature vectors encoded from real sentences. $\lambda_{r}$ and $\lambda_{m}$ are selected based on the performance on the validation set. The validation performance is evaluated by loss of generator and corpus-level BLEU score (Papineni et al., 2002), described below.

For the CNN discriminator/encoder, we use filter windows ( $h$ ) of sizes {3,4,5} with 300 feature maps each, hence each sentence is represented as a 900-dimensional vector. The dimensionality of $\bm{z}$ and $\hat{\bm{z}}$ is also 900. The feature vector is then fed into a 900-200-2 fully connected network for the discriminator and 900-900-900 for encoder, with sigmoid activation units connecting the intermediate layers and softmax/tanh units for the top layer of discriminator/encoder. We did not observe performance changes by adding dropout. For the LSTM sentence generator, we use one hidden layer of 500 units.

Gradients are clipped if the norm of the parameter vector exceeds 5 (Sutskever et al., 2014). Adam (Kingma & Ba, 2015) with learning rate $5\times 10^{-5}$ for both discriminator and generator is utilized for optimization. The size of the minibatch is set to 256.

Matching feature distributions

Quantitative comparison

We evaluate the generated-sentence quality using the BLEU score (Papineni et al., 2002) and Kernel Density Estimation (KDE), as in Goodfellow et al. (2014); Nowozin et al. (2016). For comparison, we consider textGAN with 4 different loss objectives: Mean Matching (MM) as in Salimans et al. (2016), Covariance Matching (CM) as in (5), MMD and MMD with compressed network (MMD-L), by mapping the original 900-dimensional features to 200-dimensional, as described in Section 2.3. We also compare to a baseline autoencoder (AE) model. The AE uses a CNN as encoder and an LSTM as decoder, where the CNN and LSTM network structures are set to be identical as the CNN and LSTM used in textGAN. We finally consider a Variational Autoencoder (VAE) implemented as in Bowman et al. (2016). To train the VAE model, we use annealing to gradually increase the KL divergence between the prior and approximated posterior. The details are provided in the the Supplementary Material. We also compare with seqGAN (Yu et al., 2017). For seqGAN we follow the authors’ guidelines of running 350 pre-training epochs followed by 50 discriminator training epochs, to generate 320 sentences. For AE, VAE and textGAN, we first uniformly sample 320 latent codes from the latent code space, and use the corresponding generator (or decoder, in the AE/VAE case) to generate sentences.

For BLEU score evaluation, we follow the strategy in Yu et al. (2017) of using the entire test set as the reference. For KDE evaluation, the lengths of the generated sentences are different, thus we first embed all the sentences to a 900-dimensional vector. Since no standard sentence encoder is available, we use the encoder learned from AE. The covariance matrix for the Parzen kernel in KDE is set to be the covariance of feature vectors for real tested sentences. Despite the fact that the KDE approach, as a log-likelihood estimator tends to have high variance (Theis et al., 2016), the KDE score tracks well with our BLEU score evaluation.

The results are shown in Table 1. MMD and MMD-L generally score higher in sentences quality. MMD-L seems better at capturing 2-grams (BLEU-2), while MMD outperforms MMD-L in 4-grams (BLEU-4). We also observed that when using CM, the generated sentences tend to be shorter than MMD (not shown).

Generated sentences

Table 2 shows six sentences generated by textGAN. Note that the generated sentences seem to be able to produce novel phrases by imagining concept combinations, e.g., in Table 2(b,c,f), or to borrow words from a different corpus to compose novel sentences, e.g., in Table 2(d,e). In many cases, it learns to automatically match the parentheses and quotation marks, e.g., in Table 2(a), and can synthesize relatively long sentences, e.g., in 2(a,f). In general, the synthetic sentences seem syntactically reasonable. However, the semantic meaning is less well preserved especially in sentence of more than 20 words, e.g., in Table 2(e,f).

We observe that the discriminator can still sufficiently distinguish the synthetic sentences from the real ones (the probability to predict synthetic data as real is around 0.05), even when the synthetic sentences seems to perserve reasonable grammatical structure and use proper wording. It is likely that the CNN is able to accurately characterize the semantic meaning and differentiate sentences, while the generator may get trapped into a local optimum, where any slight modification would result in a higher loss (3) for the generator. Presumably, long-range distance features are not difficult to abstract by the discriminator/encoder, however, is less likely to be imitated by the generator. One promising direction is to leverage reinforcement learning strategies as in Yu et al. (2017), where the updating for LSTM can be more effectively steered. Nevertheless, investigation on how to improve the the long-range behavior is left as interesting future work.

Latent feature space trajectories

Following Bowman et al. (2016), we further empirically evaluate whether the latent variable space can “densely” encode sentences. We visualize the transition from one sentence to another by constructing a linear path between two randomly selected points in latent feature space, to then generate the intermediate sentences along the linear trajectory. For comparison, a baseline autoencoder (AE) is trained for 20 epochs. The results for textGAN and AE are presented in Table 3. Compared to AE, the sentences produced by textGAN are generally more syntactically and semantically reasonable. The transition suggest “smoothness” and interpretability, however, the wording choices and sentence structure showed dramatic changes in some regions in the latent feature space. This seems to indicate that local “transition smoothness” varies from region to region.

Conclusion

We have introduced a novel approach for text generation using adversarial training, termed TextGAN, and have discussed several techniques to specify and train such a model. We demonstrated that the proposed model delivers superior performance compared to related approaches, can produce realistic sentences, and that the learned latent representation space can “smoothly” encode plausible sentences. We quantitatively evaluate the proposed methods with baseline models and existing methods. The results indicate superior performance of TextGAN.

In future work, we will attempt to apply conditional GAN models (Mirza & Osindero, 2014) to disentangle the latent representations for different writing styles. This would enable a smooth lexical and grammatical transition between different writing styles. It would be also interesting to generate text by conditioning on observed images (Pu et al., 2016). In addition, we plan to leverage an additional refining stage where a reverse-order LSTM (Graves & Schmidhuber, 2005) is applied after the sentence is first generated, to produce sentences with better long-term semantical interpretation.

Acknowledgments

This research was supported by ARO, DARPA, DOE, NGA, ONR and NSF.