Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan

Introduction

Neural conversational models are effective in generating coherent and relevant responses [1, 2, 3, 4, etc.]. However, the maximum-likelihood objective commonly used in these neural models fosters generation of responses that average out the responses in the training data, resulting in safe but bland responses.

We argue that this problem is in fact twofold. The responses of a system may be diverse but uninformative (e.g., “I don’t know”, “I haven’t a clue”, “I haven’t the foggiest”, “I couldn’t tell you”), and conversely informative but not diverse (e.g., always giving the same generic responses such as “I like music”, but never “I like jazz”). A major challenge, then, is to strike the right balance between informativeness and diversity. On the one hand, we seek informative responses that are relevant and fully address the input query. Mathematically, this can be measured via Mutual Information (MI), by computing the reduction in uncertainty about the query given the response. On the other hand, diversity can help produce responses that are more varied and unpredictable, which contributes to making conversations seem more natural and human-like.

The MI approach of conflated the problems of producing responses that are informative and diverse, and subsequent work has not attempted to address the distinction explicitly. Researchers have applied Generative Adversarial Networks (GANs) to neural response generation . The equilibrium for the GAN objective is achieved when the synthetic data distribution matches the real data distribution. Consequently, the adversarial objective discourages generating responses that demonstrate less variation than human responses. However, while GANs help reduce the level of blandness, the technique was not developed for the purpose of explicitly improving either informativeness or diversity.

We propose a new adversarial learning method, Adversarial Information Maximization (AIM), for training end-to-end neural response generation models that produce informative and diverse conversational responses. Our approach exploits adversarial training to encourage diversity, and explicitly maximizes a Variational Information Maximization Objective (VIMO) to produce informative responses. To leverage VIMO, we train a backward model that generates source from target. The backward model guides the forward model (from source to target) to generate relevant responses during training, thus providing a principled approach to mutual information maximization. This work is the first application of a variational mutual information objective in text generation.

To alleviate the instability in training GAN models, we propose an embedding-based discriminator, rather than the binary classifier used in traditional GANs. To reduce the variance of gradient estimation, we leverage a deterministic policy gradient algorithm and employ the discrete approximation strategy in . We also employ a dual adversarial objective inspired by , which composes both source-to-target (forward) and target-to-source (backward) objectives. We demonstrate that this forward-backward model can work synergistically with the variational information maximization loss. The effectiveness of our approach is validated empirically on two social media datasets.

Method

Let $\mathcal{D}=\{(S_{i},T_{i})\}_{i=1}^{N}$ denote a set of $N$ single-turn conversations, where $S_{i}$ represents a query (i.e., source) and $T_{i}$ is the response to $S_{i}$ (i.e., target). We aim to learn a generative model $p_{\theta}(T|S)$ that produces both informative and diverse responses for arbitrary input queries.

To achieve this, we propose Adversarial Information Maximization (AIM), illustrated in Figure 1, where ( $i$ ) adversarial training is employed to learn the conditional distribution $p_{\theta}(T|S)$ , so as to improve the diversity of generated responses over standard maximum likelihood training, and ( $ii$ ) variational information maximization is adopted to regularize the adversarial learning process and explicitly maximize mutual information to boost the informativeness of generated responses. The overall objective combines the two components:

$$\mathcal{L}_{\textrm{AIM}}(\theta,\phi,\psi)=\mathcal{L}_{\textrm{GAN}}(\theta,\psi)+\lambda\cdot\mathcal{L}_{\textrm{MI}}(\theta,\phi),$$

where $\mathcal{L}_{\textrm{GAN}}(\theta,\psi)$ represents the objective that accounts for adversarial learning, while $\mathcal{L}_{\textrm{MI}}(\theta,\phi)$ denotes the regularization term corresponding to the mutual information, and $\lambda$ is a hyperparameter that balances these two parts.

Diversity-encouraging objective

The conditional generator $p_{\theta}(T|S)$ that produces neural response $T=(y_{1},\ldots,y_{n})$ given the source sentence $S=(x_{1},\ldots,x_{m})$ and an isotropic Gaussian noise vector $Z$ is shown in Figure 2. The noise vector $Z$ is used to inject noise into the generator to prompt diversity of generated text.

Specifically, a 3-layer convolutional neural network (CNN) is employed to encode the source sentence $S$ into a fixed-length hidden vector $H_{0}$ . A random noise vector $Z$ with the same dimension as $H_{0}$ is then added to $H_{0}$ element-wise. This is followed by a series of long short-term memory (LSTM) units as the decoder. In our model, the $t$ -th LSTM unit takes the previously generated word $y_{t-1}$ , hidden state $H_{t-1}$ , $H_{0}$ and $Z$ as input, and generates the next word $y_{t}$ that maximizes the probability over the vocabulary set. Note that the argmax operation is used here, instead of sampling from a multinomial distribution as in the standard LSTM. Thus, all the randomness during generation is clamped into the noise vector $Z$ , and the reparameterization trick can be used (see Eqn. (4)). However, the argmax operation is not differentiable, so no gradient can be backpropagated through $y_{t}$ . Instead, we adopt the soft-argmax approximation below:

where $V$ is a weight matrix used for computing a distribution over words. When the temperature $\tau\rightarrow 0$ , the argmax operation is exactly recovered; however, the gradient will vanish. In practice, $\tau$ should be selected to balance the approximation bias and the magnitude of gradient variance, which scales up nearly quadratically with $1/\tau$ . Note that when $\tau=1$ this recovers the setting in . However, we empirically found that using a small $\tau$ would result in accumulated ambiguity when generating words in our experiment.
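The effect of the temperature can be illustrated with a small, self-contained Python sketch. It is not the implementation used in this work: the function names, the toy three-word vocabulary, the embedding values and the logits (a stand-in for the scores produced from $V$ and the hidden state) are all hypothetical. It only shows that a temperature-scaled softmax followed by a probability-weighted average of word embeddings approaches the embedding of the argmax word as $\tau$ shrinks.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; a smaller tau gives a sharper distribution."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax_embedding(logits, embeddings, tau):
    """Return the probability-weighted average of the word embeddings.

    logits:     one score per vocabulary word (hypothetical stand-in)
    embeddings: one embedding vector per vocabulary word
    """
    probs = softmax(logits, tau)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]

# Toy vocabulary of three words with 2-d embeddings (illustrative values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
logits = [2.0, 1.0, 0.1]

soft = soft_argmax_embedding(logits, embeddings, tau=1.0)    # smooth mixture
sharp = soft_argmax_embedding(logits, embeddings, tau=0.01)  # near embeddings[0]
```

With `tau=0.01` the result is numerically indistinguishable from the embedding of the highest-scoring word, while `tau=1.0` yields a blend of all three embeddings.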

Discriminator

We empirically found that separate embedding for each sentence yields better performance than concatenating $(S,T)$ pairs. Presumably, mapping $(S,T)$ pairs to the embedding space requires the embedding network to capture the cross-sentence interaction features of how relevant the response is to the source. Mapping them separately to the embedding space would divide the tasks into a sentence feature extraction sub-task and a sentence feature matching sub-task, rather than entangle them together. Thus the separate-embedding design might be slightly easier to train.

Objective

where $f(x)\triangleq 2\tanh^{-1}(x)$ scales the difference to deliver smoother gradients.

Note that Eqn. (3) is conceptually related to in which the discriminator loss is introduced to provide sequence-level training signals. Specifically, the discriminator is responsible for assessing both the genuineness of a response and its relevance to the corresponding source. The discriminator employed in evaluates a source-target pair by operations like concatenation. In contrast, our approach explicitly structures the discriminator to compare the embeddings using a cosine similarity metric, thus avoiding the need to learn a neural network that matches correspondence, which could be difficult. Presumably our discriminator delivers a more direct updating signal by explicitly defining how the response is related to the source.
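As a rough illustration of an embedding-based comparison, the Python sketch below scores a source against a real and a synthetic response using cosine similarity, then applies the scaling function $f(x)=2\tanh^{-1}(x)$ to the score difference. The embedding values and function names are hypothetical, and applying $f$ to the difference between the real and synthetic scores is an assumption based on the description of $f$ as scaling "the difference"; the sketch is not the exact objective of Eqn. (3).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def f(x):
    """Scaling function f(x) = 2 * atanh(x); only defined for -1 < x < 1."""
    return 2.0 * math.atanh(x)

# Hypothetical sentence embeddings (in practice produced by an embedding network).
source_emb = [0.9, 0.1, 0.3]
real_emb = [0.8, 0.2, 0.4]   # human response, close to the source
fake_emb = [0.1, 0.9, 0.2]   # synthetic response, far from the source

score_real = cosine_similarity(source_emb, real_emb)
score_fake = cosine_similarity(source_emb, fake_emb)

# Assumption: the scaled quantity is the gap between the two scores.
gap = score_real - score_fake
scaled_gap = f(gap)
```

Because `math.atanh` grows faster than linearly away from zero, the scaled gap is larger in magnitude than the raw gap, which is one way to read the remark about smoother gradients.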

The objective in Eqn. (3) also resembles Wasserstein GAN (WGAN) in that without the monotonic scaling function $f$ , the discriminator $D_{\psi}$ can be perceived as the critic in WGAN with embedding-structured regularization. See details in the Supplementary Material.

To backpropagate the learning signal from the discriminator $D_{\psi}$ to the generator $p_{\theta}(T|S)$ , instead of using the standard policy gradient as in , we consider a novel approach related to deterministic policy gradient (DPG), which estimates the gradient as below:

Information-promoting objective

Denoting the unknown oracle joint distribution as $p(S,T)$ , we aim to find an encoder joint distribution $p^{e}(S,T)=p_{\theta}(T|S)p(S)$ by learning a forward model $p_{\theta}(T|S)$ , such that $p^{e}(S,T)$ approximates $p(S,T)$ , while the mutual information under $p^{e}(S,T)$ remains high. See Figure 1 for illustration.

Empirical success has been achieved in for mutual information maximization. However, their approach is limited by the fact that the MI-promoting objective is used only at test time, while the training procedure remains the same as standard maximum likelihood training. Consequently, during training the model is not explicitly specified for maximizing pertinent information. The MI objective merely provides a criterion for reweighting response candidates, rather than asking the generator to produce more informative responses in the first place. Further, the hyperparameter that balances the likelihood and anti-likelihood/reverse-likelihood terms is manually selected from $(0,1)$ , which deviates from the actual MI objective, thus making the setup ad hoc.

The gradient of $\mathcal{L}_{\textrm{MI}}(\theta,\phi)$ w.r.t. $\theta$ can be approximated by Monte Carlo samples using the REINFORCE policy gradient method:

where $b$ denotes a baseline. Here we choose a simple empirical average for $b$ . Note that more sophisticated baselines based on neural adaptation or self-critic can also be employed. We complement the policy gradient objective with a small proportion of likelihood-maximization loss, which was shown to stabilize the training as in .
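The role of the empirical-average baseline can be illustrated with a generic REINFORCE estimator in plain Python. This is a minimal sketch under stated assumptions, not the training code of this work: `reinforce_gradient`, the scalar rewards and the per-sample gradient vectors of the log-probabilities are hypothetical inputs that a real system would obtain from the backward and forward models.

```python
def reinforce_gradient(rewards, grad_log_probs):
    """Monte Carlo REINFORCE estimate with an empirical-average baseline.

    rewards:        one scalar reward per sampled response
    grad_log_probs: one gradient vector per sample, i.e. the gradient of the
                    sample's log-probability w.r.t. the generator parameters
    Returns (gradient estimate, baseline).
    """
    n = len(rewards)
    baseline = sum(rewards) / n  # simple empirical average, used as b
    dim = len(grad_log_probs[0])
    grad = [0.0] * dim
    for r, g in zip(rewards, grad_log_probs):
        advantage = r - baseline  # reward centred by the baseline
        for d in range(dim):
            grad[d] += advantage * g[d] / n
    return grad, baseline

# Two sampled responses with toy rewards and toy log-probability gradients.
rewards = [1.0, 3.0]
grad_log_probs = [[1.0, 0.0], [0.0, 1.0]]
grad, b = reinforce_gradient(rewards, grad_log_probs)
```

In this toy case the baseline is 2.0, so the below-average sample is pushed down and the above-average sample is pushed up by the same amount. If all rewards are equal, the estimate is exactly zero.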

As an alternative to the REINFORCE approach used in (2.3), we also considered using the same DPG-like approach as in (4) for approximate gradient calculation. Compared to the REINFORCE approach, the DPG-like method yields lower variance, but is less memory efficient in this case. This is because the $\mathcal{L}_{\textrm{MI}}(\theta,\phi)$ objective requires the gradient to be first back-propagated to the synthetic text through all backward LSTM nodes, and then back-propagated from the synthetic text to all forward LSTM nodes, where both steps are densely connected. Hence, the REINFORCE approach is used in this part.

Dual Adversarial Learning

One issue with the above approach is that learning an appropriate $q_{\phi}(S|T)$ is difficult. Similar to the forward model, this backward model $q_{\phi}(S|T)$ may also tend to be “bland” in generating the source from the target. As illustrated in Figure 4, supposing that we define a decoder joint distribution $p^{d}(S,T)=q_{\phi}(S|T)p(T)$ , this distribution tends to be flat along the $T$ axis (i.e., tending to generate the same source given different target inputs). Similarly, $p^{e}(S,T)$ tends to be flat along the $S$ axis as well.

where $\lambda$ is a hyperparameter to balance the GAN loss and the MI loss. An illustration is shown in Figure 5.

We believe that this approach would also improve the generation diversity. To understand this, notice that we are maximizing a surrogate objective of $I_{p^{d}}(S,T)$ , which can be written as

$$I_{p^{d}}(S,T)=H(T)-H(T|S).$$

When optimizing $\theta$ , the backward model $q^{d}_{\phi}(S|T)$ is fixed and $H(T|S)$ remains constant. Therefore, optimizing $I_{p^{d}}(S,T)$ with respect to $\theta$ can be understood as equivalently maximizing $H(T)$ , which promotes the diversity of generated text.
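The link between mutual information and the entropy of the targets can be checked numerically with the standard identity $I(S,T)=H(T)-H(T|S)$. The Python sketch below uses two hypothetical 2-by-2 joint distributions (one where every source receives the same target, and one where each source has its own target); the function names and the toy tables are illustrative only, and natural logarithms are used.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """joint[s][t] = p(S=s, T=t). Returns (I(S,T), H(T), H(T|S))."""
    p_s = [sum(row) for row in joint]
    p_t = [sum(row[t] for row in joint) for t in range(len(joint[0]))]
    h_t = entropy(p_t)
    # H(T|S) is the p(s)-weighted entropy of each conditional p(T|S=s).
    h_t_given_s = sum(ps * entropy([p / ps for p in row])
                      for ps, row in zip(p_s, joint) if ps > 0)
    return h_t - h_t_given_s, h_t, h_t_given_s

# "Bland" system: both sources map to the same target.
bland = [[0.5, 0.0],
         [0.5, 0.0]]
# "Informative" system: each source maps to its own target.
informative = [[0.5, 0.0],
               [0.0, 0.5]]

i_bland, h_bland, _ = mutual_information(bland)
i_inf, h_inf, _ = mutual_information(informative)
```

In both toy tables $H(T|S)$ is zero, so the mutual information equals $H(T)$: it is zero for the bland table and $\log 2$ for the informative one, matching the observation that raising $H(T)$ at fixed $H(T|S)$ raises the mutual information.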

Related Work

Our work is closely related to , where an information-promoting objective was proposed to directly optimize an MI-based objective between source and target pairs. Despite the great success of this approach, the use of the additional hyperparameter for the anti-likelihood term renders the objective only an approximation to the actual MI. Additionally, the MI objective is employed only during testing (decoding) time, while the training procedure does not involve such an MI objective and is identical to standard maximum-likelihood training. Compared with , our approach considers optimizing a principled MI variational lower bound during training.

Adversarial learning has been shown to be successful in dialog generation, translation, image captioning and a series of natural language generation tasks . leverages adversarial training and reinforcement learning to generate high quality responses. Our adversarial training differs from in both the discriminator and generator design: we adopt an embedding-based structured discriminator that is inspired by the ideas from Deep Structured Similarity Models (DSSM) . For the generator, instead of performing multinomial sampling at each generating step and leveraging a REINFORCE-like method as in , we clamp all the randomness in the generation process to an initial input noise vector, and employ a discrete approximation strategy as used in . As a result, the variance of gradient estimation is largely reduced.

Unlike previous work, we seek to make a conceptual distinction between informativeness and diversity, and combine the previously proposed MI and GAN approaches in a principled manner to explicitly render responses both informative (via MI) and diverse (via GAN).

Our AIM objective is further extended to a dual-learning framework. This is conceptually related to several previous GAN models in the image domain that were designed for joint distribution matching . Among these, our work is most closely related to the Triangle GAN . However, we employ an additional VIMO as objective, which has a similar effect to that of “cycle-consistent” regularization, enabling better communication between the forward and backward models. also leverages a dual objective for supervised translation training and demonstrates superior performance. Our work differs from in that we formulate the problem in an adversarial learning setup. It can thus be perceived as conditional distribution matching rather than seeking a regularized maximum likelihood solution.

Experiments

We evaluated our methods on two datasets: Reddit and Twitter. The Reddit dataset contains 2 million source-target pairs of single-turn conversations extracted from Reddit discussion threads. The maximum sentence length is 53. We randomly partition the data as (80%, 10%, 10%) to construct the training, validation and test sets. The Twitter dataset contains 7 million single-turn conversations from Twitter threads. We mainly compare our results with MMI. (We did not compare with since the code is not available, and the original training data used in contains a large portion of test data, owing to data leakage.)
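The random (80%, 10%, 10%) partition can be sketched in a few lines of Python. The helper name `split_dataset`, the fixed seed and the toy pairs are assumptions made for illustration; the actual preprocessing scripts are not described here.

```python
import random

def split_dataset(pairs, seed=0):
    """Shuffle (source, target) pairs and split them 80% / 10% / 10%."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    n = len(shuffled)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding
    n_valid = n // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

# Toy stand-in for the conversation pairs.
pairs = [("source %d" % i, "target %d" % i) for i in range(100)]
train, valid, test = split_dataset(pairs)
```

Every pair lands in exactly one of the three subsets, so no test pair can also appear in the training data through the split itself.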

We evaluated our method based on relevance and diversity metrics. For relevance evaluation, we adopt BLEU, ROUGE and three embedding-based metrics following . The Greedy metric yields the maximum cosine similarity over embeddings of two utterances. Similarly, the Average metric considers the average embedding cosine similarity. The Extreme metric obtains a sentence representation by taking the most extreme values among the embedding vectors of all the words it contains, then calculates the cosine similarity of the sentence representations.
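To make the Average and Extreme metrics concrete, the following Python sketch builds the two sentence representations from toy word vectors and compares them with cosine similarity. The word vectors are invented for illustration, and interpreting "extreme" as the per-dimension value with the largest magnitude is an assumption; the Greedy metric is left out because its aggregation is not spelled out in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def average_embedding(word_vecs):
    """Sentence vector as the mean of its word vectors."""
    dim = len(word_vecs[0])
    return [sum(v[d] for v in word_vecs) / len(word_vecs) for d in range(dim)]

def extreme_embedding(word_vecs):
    """Sentence vector keeping, per dimension, the largest-magnitude value."""
    dim = len(word_vecs[0])
    return [max((v[d] for v in word_vecs), key=abs) for d in range(dim)]

# Toy 2-d word vectors for a generated response and a reference response.
hypothesis = [[1.0, 0.2], [-3.0, 0.1]]
reference = [[1.0, 0.0], [3.0, 2.0]]

average_score = cosine(average_embedding(hypothesis), average_embedding(reference))
extreme_score = cosine(extreme_embedding(hypothesis), extreme_embedding(reference))
```

Both scores lie in the range from -1 to 1, with higher values indicating that the generated response is closer to the reference in embedding space.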

To evaluate diversity, we follow to use Dist-1 and Dist-2, which are characterized by the ratio of the number of unique n-grams to the total number of n-grams in the tested sentences. However, this metric neglects the frequency difference of n-grams. For example, a sample in which token A and token B each occur 50 times has the same Dist-1 score (0.02) as one in which token A occurs once and token B occurs 99 times, whereas the former is commonly considered more diverse than the latter. To accommodate this, we propose to use the Entropy (Ent-n) metric, which reflects how even the empirical n-gram distribution is for a given sentence:

$$\textrm{Ent}(n)=-\frac{1}{\sum_{w}F(w)}\sum_{w\in V}F(w)\log\frac{F(w)}{\sum_{w}F(w)},$$

where $V$ is the set of all n-grams and $F(w)$ denotes the frequency of n-gram $w$ .
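The contrast between Dist-n and an entropy-based score can be reproduced with a short Python sketch on the 50/50 versus 1/99 example. The function names are hypothetical, the entropy is computed as the Shannon entropy of the empirical n-gram distribution with natural logarithms (the log base is an assumption), and inputs are assumed to contain at least $n$ tokens.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(tokens, n):
    """Ratio of unique n-grams to total n-grams."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams)

def ent_n(tokens, n):
    """Entropy (in nats) of the empirical n-gram frequency distribution."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

balanced = ["A"] * 50 + ["B"] * 50  # A and B each occur 50 times
skewed = ["A"] * 1 + ["B"] * 99     # A occurs once, B occurs 99 times

dist_balanced, dist_skewed = dist_n(balanced, 1), dist_n(skewed, 1)
ent_balanced, ent_skewed = ent_n(balanced, 1), ent_n(skewed, 1)
```

Both samples receive a Dist-1 of 0.02, whereas the entropy of the balanced sample equals $\log 2$ and is much higher than that of the skewed sample.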

We evaluated conditional GAN (cGAN), adversarial information maximization (AIM) and dual adversarial information maximization (DAIM), together with a maximum likelihood CNN-LSTM sequence-to-sequence baseline, on multiple datasets. For comparison with previous state-of-the-art methods, we also include MMI . To eliminate the impact of network architecture differences, we implemented MMI-bidi using our CNN-LSTM framework. The settings, other than model architectures, are identical to . We performed a beam search with a width of 200 and chose the hyperparameter based on performance on the validation set.

The forward and backward models were pretrained via seq2seq training. During cGAN training, we added a small portion of supervised signals to stabilize the training . For embedding-based evaluation, we used a word2vec embedding trained on the GoogleNews Corpus (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM), recommended by . For all the experiments, we employed a 3-layer convolutional encoder and an LSTM decoder as in . The filter size, stride and the word embedding dimension were set to $5$ , $2$ and $300$ , respectively, following . The hidden unit size of $H_{0}$ was set to 100. We set $\lambda$ to be 0.1 and the supervised-loss balancing parameter to be $0.001$ . All other hyperparameters were shared among different experiments. All experiments were conducted using NVIDIA K80 GPUs.

Evaluation on Reddit data

We first evaluated our methods on the Reddit dataset using the relevance and diversity metrics. We truncated the vocabulary to contain only the most frequent 20,000 words. For testing we used 2,000 randomly selected samples from the test set. (We did not use the full test set because MMI decoding is relatively slow.) The results are summarized in Table 1. We observe that by incorporating the adversarial loss the diversity of generated responses is improved (cGAN vs. seq2seq). The relevance under most metrics (except for BLEU) increases by a small amount.

Comparing MMI with cGAN, AIM and DAIM, we observe substantial improvements in diversity and relevance due to the use of the additional mutual-information-promoting objective in cGAN, AIM and DAIM. Table 2 presents several examples. It can be seen that AIM and DAIM produce more informative responses, because the MI objective explicitly rewards responses that are predictive of the source, and down-weights those that are generic and dull. Under the same hyperparameter setup, we also observe that DAIM benefits from the additional backward model and outperforms AIM in diversity, which better approximates human responses. We show the histogram of the length of generated responses in the Supplementary Material. Our models are trained until convergence. cGAN, AIM and DAIM respectively consume around 1.7, 2.5 and 3.5 times the computation time of our seq2seq baseline.

The distributional discrepancy between generated responses and ground-truth responses is arguably a more reasonable metric than the single response judgment. We leave it to future work.

Human evaluation

Informativeness is not easily measurable using automatic metrics, so we performed a human evaluation on 600 randomly sampled sources using crowd-sourcing. Systems were paired and each pair of system outputs was randomly presented to 7 judges, who ranked them for informativeness and relevance. (Relevance relates to the degree to which judges perceived the output to be semantically tied to the previous turn, and can be regarded as a constraint on informativeness. An affirmative response like “Sure” and “Yes” is relevant but not very informative.) The human preferences are shown in Table 3. A statistically significant (p < 0.00001) preference for DAIM over MMI is observed with respect to informativeness, while relevance judgments are on par with MMI. MMI has proved a strong baseline: the other two GAN systems are (with one exception) statistically indistinguishable from MMI, which in turn performs significantly better than seq2seq. Box charts illustrating these results can be found in the Supplementary Material.

Evaluation on Twitter data

We further compared our methods on the Twitter dataset. The results are shown in Table 4. We treated all dialog history before the last response in a multi-turn conversation session as a source sentence, and used the last response as the target to form our dataset. We employed a CNN as our encoder because a CNN-based encoder is presumably advantageous in tracking long dialog history compared to an LSTM encoder. We truncated the vocabulary to contain only the 20k most frequent words due to limited flash memory capacity. We evaluated each method on 2k test samples.

Adversarial training encourages generating more diverse sentences, at the cost of slightly decreasing the relevance score. We hypothesize that such a decrease is partially attributable to the evaluation metrics we used. All the relevance metrics are based on utterance-pair discrepancy, i.e., the score assesses how close the system output is to the ground-truth response. Thus, the MLE system output tends to obtain a high score despite being bland, because an MLE response by design is most “relevant” to any random response. On the other hand, adding diversity without improving semantic relevance may occasionally hurt these relevance scores.

However, the additional MI term seems to compensate for the relevance decrease and improves the response diversity, especially in Dist- $n$ and Ent- $n$ with a larger value of $n$ . Sampled responses are provided in the Supplementary Material.

Conclusion

In this paper we propose a novel adversarial learning method, Adversarial Information Maximization (AIM), for training response generation models to promote informative and diverse conversations between humans and dialogue agents. AIM can be viewed as a more principled version of the classical MMI method in that AIM is able to directly optimize (a lower bound of) the MMI objective in model training, while the MMI method only uses it to rerank response candidates during decoding. We then extend AIM to DAIM by incorporating a dual objective so as to simultaneously learn forward and backward models. We evaluated our methods on two real-world datasets. The results demonstrate that our methods do lead to more informative and diverse responses in comparison to existing methods.

Acknowledgements

We thank Adji Bousso Dieng, Asli Celikyilmaz, Sungjin Lee, Chris Quirk and Chengtao Li for helpful discussions. We thank the anonymous reviewers for their constructive feedback.