A Multiplicative Model for Learning Distributed Text-Based Attribute Representations

Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov

Introduction

Distributed word representations have enjoyed success in several NLP tasks. More recently, the use of distributed representations has been extended to model concepts beyond the word level, such as sentences, phrases and paragraphs, entities and relationships, and embeddings of semantic categories.

In this paper we propose a general framework for learning distributed representations of attributes: characteristics of text whose representations can be jointly learned with word embeddings. The use of the word attribute in this context is general. Table 1 illustrates several of the experiments we perform along with the corresponding notion of attribute. For example, an attribute can represent an indicator of the current sentence or language being processed. This allows us to learn sentence and language vectors, similar to the proposed model of . Attributes can also correspond to side information, or metadata associated with text. For instance, a collection of blogs may come with information about the age, gender or industry of the author. This allows us to learn vectors that can capture similarities across metadata based on the associated body of text. The goal of this work is to show that our notion of attribute vectors can achieve strong performance on a wide variety of NLP-related tasks.

To capture these kinds of interactions between attributes and text, we propose the use of a third-order model where attribute vectors act as gating units to a word embedding tensor. That is, words are represented as a tensor consisting of several prototype vectors. Given an attribute vector, a word embedding matrix can be computed as a linear combination of word prototypes weighted by the attribute representation. During training, attribute vectors reside in a separate lookup table which can be jointly learned along with word features and the model parameters. This type of three-way interaction can be embedded into a neural language model, where the three-way interaction consists of the previous context, the attribute and the score (or distribution) of the next word after the context.

Using a word embedding tensor gives rise to the notion of conditional word similarity. More specifically, the neighbours of word embeddings can change depending on which attribute is being conditioned on. For example, the word ‘joy’ when conditioned on an author with the industry attribute ‘religion’ appears near ‘rapture’ and ‘god’ but near ‘delight’ and ‘comfort’ when conditioned on an author with the industry attribute ‘science’. Another way of thinking of our model would be the language analogue of . They used a factored conditional restricted Boltzmann machine for modelling motion style defined by real or continuous valued style variables. When our factorization is embedded into a neural language model, it allows us to generate text conditioned on different attributes in the same manner as could generate motions from different styles. As we show in our experiments, if attributes are represented by different books, samples generated from the model learn to capture associated writing styles from the author.

Multiplicative interactions have also been previously incorporated into neural language models. introduced a multiplicative model where images are used for gating word representations. Our framework can be seen as a generalization of and in the context of their work an attribute would correspond to a fixed representation of an image. introduced a multiplicative recurrent neural network for generating text at the character level. In their model, the character at the current timestep is used to gate the network’s recurrent matrix. This led to a substantial improvement in the ability to generate text at the character level as opposed to a non-multiplicative recurrent network.

Methods

In this section we describe the proposed models. We first review the log-bilinear neural language model of as it forms the basis for much of our work. Next, we describe a word embedding tensor and show how it can be factored and introduced into a multiplicative neural language model. This is concluded by detailing how our attribute vectors are learned.

where ${\bf C}^{(i)},i=1,\ldots,n-1$ are $K\times K$ context parameter matrices. Thus, ${\bf\hat{r}}$ is the predicted representation of ${\bf r}_{w_{n}}$ . The conditional probability $P(w_{n}=i|w_{1:n-1})$ of $w_{n}$ given $w_{1},\ldots,w_{n-1}$ is
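For reference, a standard way of writing the log-bilinear model that is consistent with the symbols above is sketched below. Here ${\bf r}_{w}$ is the $K$-dimensional representation of word $w$ and $V$ is the vocabulary size; the per-word bias $b_{i}$ is an assumption of this sketch rather than notation taken from the text.

```latex
\hat{\bf r} = \sum_{i=1}^{n-1} {\bf C}^{(i)} {\bf r}_{w_{i}},
\qquad
P(w_{n}=i \mid w_{1:n-1}) =
  \frac{\exp\left(\hat{\bf r}^{\top} {\bf r}_{i} + b_{i}\right)}
       {\sum_{j=1}^{V} \exp\left(\hat{\bf r}^{\top} {\bf r}_{j} + b_{j}\right)}
```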

2 A word embedding tensor

where $\textrm{diag}(\cdot)$ denotes the matrix with its argument on the diagonal. These matrices are parametrized by a pre-chosen number of factors $F$ .
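A minimal NumPy sketch of the factored tensor may help make this concrete. It assumes the factorization takes the form $({\bf W}^{fk})^{\top}\,\textrm{diag}({\bf W}^{fd}{\bf x})\,{\bf W}^{fv}$, where the attribute projection ${\bf W}^{fd}$, the toy dimensions and all Python names are assumptions introduced for illustration. The second half checks that gating by ${\bf x}$ gives the same matrix as a linear combination of per-dimension prototype slices weighted by ${\bf x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, F = 6, 4, 3, 5   # vocabulary, embedding, attribute and factor sizes (toy values)

# Factor matrices (randomly initialised here; learned in practice).
W_fk = rng.normal(size=(F, K))
W_fd = rng.normal(size=(F, D))   # assumed name for the attribute-to-factor projection
W_fv = rng.normal(size=(F, V))

def conditioned_embeddings(x):
    """Return the K x V matrix (W_fk)^T diag(W_fd x) W_fv."""
    gates = W_fd @ x                          # one gate per factor, shape (F,)
    return W_fk.T @ (gates[:, None] * W_fv)   # row scaling avoids building diag(.)

x = np.array([0.5, 0.0, 1.5])                 # an example attribute vector
E_x = conditioned_embeddings(x)

# Equivalent view: a linear combination of D prototype slices weighted by x.
prototypes = [conditioned_embeddings(np.eye(D)[i]) for i in range(D)]
E_mix = sum(x[i] * prototypes[i] for i in range(D))
```

Scaling the rows of `W_fv` by the gates is equivalent to multiplying by the diagonal matrix, but avoids materialising an $F\times F$ matrix.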

3 Multiplicative neural language models

We now show how to embed our word representation tensor $\bm{\mathcal{T}}$ into the log-bilinear neural language model. Let ${\bf E}=({\bf W}^{fk})^{\top}{\bf W}^{fv}$ denote a ‘folded’ $K\times V$ matrix of word embeddings. Given the context $w_{1},\ldots,w_{n-1}$ , the predicted next word representation ${\bf\hat{r}}$ is given by:

where ${\bf E}(:,w_{i})$ denotes the column of ${\bf E}$ for the word representation of $w_{i}$ and ${\bf C}^{(i)},i=1,\ldots,n-1$ are $K\times K$ context matrices. Given a predicted next word representation ${\bf\hat{r}}$ , the factor outputs are

where $\bullet$ is a component-wise product. The conditional probability $P(w_{n}=i|w_{1:n-1},{\bf x})$ of $w_{n}$ given $w_{1},\ldots,w_{n-1}$ and ${\bf x}$ can be written as

where ${\bf W}^{fv}(:,i)$ denotes the column of ${\bf W}^{fv}$ corresponding to word $i$ . In contrast to the log-bilinear model, the matrix of word representations ${\bf R}$ from before is replaced with the factored tensor $\bm{\mathcal{T}}$ , as shown in Fig. 1.
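The forward pass described in this subsection can be sketched as follows. The sketch assumes $\hat{\bf r}=\sum_{i}{\bf C}^{(i)}{\bf E}(:,w_{i})$, factor outputs ${\bf f}=({\bf W}^{fk}\hat{\bf r})\bullet({\bf W}^{fd}{\bf x})$, and a softmax over $({\bf W}^{fv}(:,i))^{\top}{\bf f}+b_{i}$. The bias ${\bf b}$, the name ${\bf W}^{fd}$, the toy sizes and the function name are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, F, context = 7, 4, 3, 5, 2   # toy sizes

W_fk = rng.normal(scale=0.1, size=(F, K))
W_fd = rng.normal(scale=0.1, size=(F, D))        # assumed attribute projection
W_fv = rng.normal(scale=0.1, size=(F, V))
C = rng.normal(scale=0.1, size=(context, K, K))  # one K x K matrix per context position
b = np.zeros(V)                                  # assumed per-word bias

def next_word_distribution(context_ids, x):
    """P(w_n = i | context, x) for every word i in the vocabulary."""
    E = W_fk.T @ W_fv                                   # 'folded' K x V embeddings
    r_hat = sum(C[i] @ E[:, w] for i, w in enumerate(context_ids))
    f = (W_fk @ r_hat) * (W_fd @ x)                     # component-wise gating, shape (F,)
    scores = W_fv.T @ f + b                             # one score per word, shape (V,)
    scores = scores - scores.max()                      # numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = next_word_distribution([1, 4], np.array([1.0, 0.0, 0.5]))
```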

4 Unshared vocabularies across attributes

5 Learning attribute representations

We now discuss how to learn representation vectors ${\bf x}$ . Recall that when training neural language models, the word representations of $w_{1},\ldots,w_{n-1}$ are updated by backpropagating through the word embedding matrix. We can think of this as being a linear layer, where the input to this layer is a one-hot vector with the $i$ -th position active for word $w_{i}$ . Then multiplying this vector by the embedding matrix results in the word vector for $w_{i}$ . Thus the columns of the word representations matrix consisting of words from $w_{1},\ldots,w_{n-1}$ will have non-zero gradients with respect to the loss. This allows us to consistently modify the word representations throughout training.

We construct attribute representations in a similar way. Suppose that ${\bf L}$ is an attribute lookup table, where ${\bf x}=f({\bf L}(:,x))$ and $f$ is an optional non-linearity. We often use a rectifier non-linearity in order to keep ${\bf x}$ sparse and positive, which we found made training much more stable. Initially, the entries of ${\bf L}$ are generated randomly. During training, we treat ${\bf L}$ in the same way as the word embedding matrix. This way of learning language representations allows us to measure how ‘similar’ attributes are as opposed to using a one-hot encoding of attributes for which no such similarity could be computed.
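As a small illustration of the lookup described above, the sketch below stores one column per attribute in a randomly initialised table and applies a rectifier. The sizes and function names are hypothetical. It also shows that indexing a column is the same as multiplying the table by a one-hot vector, which is why the table can be trained like any other linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_attributes = 4, 3                               # toy sizes
L = rng.normal(scale=0.1, size=(D, num_attributes))    # random initialisation

def relu(z):
    """Rectifier non-linearity: keeps the attribute vector sparse and non-negative."""
    return np.maximum(z, 0.0)

def attribute_vector(attr_id, f=relu):
    """x = f(L(:, attr_id)); pass f=lambda z: z to skip the non-linearity."""
    return f(L[:, attr_id])

x = attribute_vector(1)

# Column lookup equals multiplication by a one-hot indicator.
one_hot = np.eye(num_attributes)[1]
same_column = np.allclose(L @ one_hot, L[:, 1])
```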

In some cases, attributes that are available during training may not be available at test time. An example of this is when attributes are used as sentence indicators for learning representations of sentences. To accommodate this, we use an inference step similar to that proposed by . That is, at test time all the network parameters are fixed and stochastic gradient descent is used for inferring the representation of an unseen attribute vector.
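The inference step can be sketched as gradient descent on the attribute vector alone. The code below is an illustration under several assumptions: the multiplicative scoring form used earlier in this section, no non-linearity on ${\bf x}$, a zero initialisation, and a hypothetical step size and number of passes. With all weights frozen the scores are linear in ${\bf x}$, so the gradient of the negative log-likelihood has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, F = 8, 4, 3, 5                       # toy sizes
W_fk = rng.normal(scale=0.5, size=(F, K))     # frozen network parameters
W_fd = rng.normal(scale=0.5, size=(F, D))
W_fv = rng.normal(scale=0.5, size=(F, V))
C = rng.normal(scale=0.5, size=(2, K, K))
b = np.zeros(V)
E = W_fk.T @ W_fv

def nll_and_grad(x, context_ids, target):
    """Negative log-likelihood of the target word and its gradient w.r.t. x."""
    r_hat = sum(C[i] @ E[:, w] for i, w in enumerate(context_ids))
    A = W_fv.T @ ((W_fk @ r_hat)[:, None] * W_fd)   # scores = A x + b, A is V x D
    scores = A @ x + b
    scores = scores - scores.max()
    log_p = scores - np.log(np.exp(scores).sum())
    p = np.exp(log_p)
    p[target] -= 1.0                                # softmax gradient: p - one_hot
    return -log_p[target], A.T @ p

def total_nll(x, examples):
    return sum(nll_and_grad(x, c, t)[0] for c, t in examples)

def infer_attribute(examples, passes=100, lr=0.05):
    """Stochastic gradient descent on x only; every other parameter stays fixed."""
    x = np.zeros(D)
    for _ in range(passes):
        for context_ids, target in examples:
            _, grad = nll_and_grad(x, context_ids, target)
            x -= lr * grad
    return x

# (context, next word) pairs from a hypothetical unseen sentence.
examples = [([1, 4], 2), ([4, 2], 6), ([2, 6], 3)]
x_hat = infer_attribute(examples)
```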

Experiments

In this section we describe our experimental evaluation and results. Throughout this section we refer to our model as Attribute Tensor Decomposition (ATD). All models are trained using stochastic gradient descent with an exponential learning rate decay and linear (per epoch) increase in momentum.
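One way to realise such a schedule is sketched below. The initial learning rate, decay factor, momentum range and ramp length are hypothetical values chosen for illustration; the text does not report them.

```python
def learning_rate(epoch, initial_lr=0.1, decay=0.95):
    """Exponential decay: the rate shrinks by a constant factor every epoch."""
    return initial_lr * decay ** epoch

def momentum(epoch, start=0.5, end=0.9, ramp_epochs=10):
    """Linear per-epoch increase from start to end, then held constant."""
    t = min(epoch, ramp_epochs) / ramp_epochs
    return start + t * (end - start)

# (learning rate, momentum) for the first three epochs.
schedule = [(learning_rate(e), momentum(e)) for e in range(3)]
```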

We first demonstrate initial qualitative results to get a sense of the tasks our model can perform. For these, we use the small Project Gutenberg corpus, which consists of 18 books, some of which have the same author. We first trained a multiplicative neural language model with a context size of 5 where each attribute is represented as a book. This results in 18 learned attribute vectors, one for each book. After training, we can condition on a book vector and generate samples from the model. Table 2 illustrates some of the generated samples. Our model learns to capture the ‘style’ associated with different books. Furthermore, by conditioning on the average of book representations, the model can generate reasonable samples that represent a hybrid of both attributes, even though such attribute combinations were not observed during training.

Next, we computed POS sequences from sentences that occur in the training corpus. We trained a multiplicative neural language model with a context size of 5 to predict the next word from its context, given knowledge of the POS tag for the next word. That is, we model $P(w_{n}=i|w_{1:n-1},{\bf x})$ where ${\bf x}$ denotes the POS tag for word $w_{n}$ . After training, we gave the model an initial input and a POS sequence and proceeded to generate samples. Table 3 shows some results for this task. Interestingly, the model can generate rather funny and poetic completions to the initial context.

Our first quantitative experiments are performed on the sentiment treebank of . A common challenge for sentiment classification tasks is that the global sentiment of a sentence need not correspond to local sentiments exhibited in sub-phrases of the sentence. To address this issue, collected annotations from the movie reviews corpus of of all subphrases extracted from a sentence parser. By incorporating local sentiment into their recursive architectures, was able to obtain significant performance gains with recursive networks over bag of words baselines.

We follow the same experimental procedure proposed by for which evaluation is reported on two tasks: fine-grained classification of categories {very negative, negative, neutral, positive, very positive} and binary classification {positive, negative}. We extracted all subphrases of sentences that occur in the training set and used these to train a multiplicative neural language model. Here, each attribute is represented as a sentence vector, as in . In order to compute subphrases for unseen sentences, we apply an inference procedure similar to , where the weights of the network are frozen and gradient descent is used to infer representations for each unseen vector. We trained a logistic regression classifier using all subphrases in the training set. At test time, we infer a representation for a new sentence which is used for making a review prediction. We used a context size of 8, 100-dimensional word vectors initialized from and 100-dimensional sentence vectors initialized by averaging vectors of words from the corresponding sentence.

Table 4, left panel, illustrates our results on this task in comparison to all other proposed approaches. Our results are on par with the highest performing recursive network on the fine-grained task and outperform all bag-of-words baselines and recursive networks with the exception of the RTNN on the binary task. Our method is outperformed by the two recently proposed approaches of (a convolutional network trained on sentences) and Paragraph Vector . We suspect that a much more extensive hyperparameter search over context sizes, word and sentence embedding sizes as well as inference initialization schemes would close the gap between our approach and .

2 Cross-lingual document classification

Let $v(S)$ denote the sentence representation of $S$ , defined as the sum of language conditioned word representations for each $w\in S$ . Analogously, we define a sentence representation for the translation $S^{\prime}$ of $S$ , denoted as $v(S^{\prime})$ . We then optimize the following ranking objective:

subject to the constraint that each sentence vector has unit norm. Each $C_{k}$ is a contrastive (non-translation) sentence of $S$ and $\theta$ denotes all model parameters. This type of cross-language ranking loss was first used by but without the norm constraint, which we found significantly improved the stability of training. The Europarl corpus contains roughly 2 million parallel sentence pairs between English and German as well as English and French, for which we induce 40-dimensional word representations. Evaluation is then performed on English and German sections of the Reuters RCV1/RCV2 corpora. Note that these documents are not parallel. The Reuters dataset contains multiple labels for each document. Following , we only consider documents which have been assigned to one of the top 4 categories in the label hierarchy. These are CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets). There are a total of 34,000 English documents and 42,753 German documents with vocabulary sizes of 43,614 English words and 50,110 German words. We consider both training on English and evaluating on German and vice versa. To represent a document, we sum over the word representations of words in that document followed by a unit-ball projection. Following we use an averaged perceptron classifier. Classification accuracy is then evaluated on a held-out test set in the other language. We used a monolingual validation set for tuning the margin $\alpha$ , which was set to $\alpha=1$ . Five contrastive terms were used per example, which were randomly assigned per epoch.
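A margin-based ranking loss of this kind can be sketched as follows. The sketch assumes a dot-product score between unit-norm sentence vectors, giving a hinge term $\max\{0,\ \alpha + v(S)\cdot v(C_{k}) - v(S)\cdot v(S^{\prime})\}$ summed over the contrastive sentences. The score function and all Python names are assumptions for illustration, and sentences whose summed vector is zero are not handled.

```python
import numpy as np

def unit_norm(v):
    """Rescale a non-zero vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def sentence_vector(E_lang, word_ids):
    """Sum the language-conditioned word columns of a sentence, then normalise."""
    return unit_norm(E_lang[:, word_ids].sum(axis=1))

def ranking_loss(v_s, v_t, contrastive, alpha=1.0):
    """Hinge loss: the translation v_t should outscore each contrastive vector by alpha."""
    positive = v_s @ v_t
    return sum(max(0.0, alpha + v_s @ v_c - positive) for v_c in contrastive)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
loss_good = ranking_loss(e1, e1, [e2, -e1])   # translation matches, negatives are far away
loss_bad = ranking_loss(e1, e2, [e1])         # a contrastive sentence outscores the translation
```

In this toy case the first loss is zero because every contrastive sentence is already separated by the margin, while the second is positive and would produce a gradient.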

Table 4, right panel, shows our results compared to all proposed methods thus far. We are competitive with the current state-of-the-art approaches, being outperformed only by BiCVM+ and BAE-corr on EN $\rightarrow$ DE. The BAE-corr method combines both a reconstruction term and a correlation regularizer to match sentences, while our method does not consider reconstruction. We also performed experimentation on a low resource task, where we assume the same conditions as above with the exception that we only use 10,000 parallel sentence pairs between English and German while still incorporating all English and French parallel sentences. For this task, we compare against a separation baseline, which is the same as our model but with no parameter sharing across languages (and thus resembles ). Here we achieve 74.7% and 69.7% accuracies (EN $\rightarrow$ DE and DE $\rightarrow$ EN) while the separation baseline obtains 63.8% and 67.1%. This indicates that parameter sharing across languages can be useful when only a small amount of parallel data is available. $t$-SNE embeddings of English-German word pairs are illustrated in Fig. 2.

Another interesting consideration is whether or not the learned language vectors can capture any interesting properties of various languages. To look into this, we trained a multiplicative neural language model simultaneously on 5 languages: English, French, German, Czech and Slovak. To our knowledge, this is the largest number of languages on which word representations have been jointly learned. We computed a correlation matrix from the language vectors, illustrated in Fig. 3a. Interestingly, we observe high correlation between Czech and Slovak representations, indicating that the model may have learned some notion of lexical similarity. That being said, additional experimentation in future work is necessary to better understand the similarities exhibited through language vectors.
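Computing such a matrix is straightforward once the language vectors are available. The sketch below assumes Pearson correlation between vectors and uses random placeholders in place of learned language vectors; the language codes, dimensionality and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
languages = ["en", "fr", "de", "cs", "sk"]
D = 8                                          # toy attribute dimensionality

# Placeholder for the learned lookup table: one column per language.
L = rng.normal(size=(D, len(languages)))

# np.corrcoef treats each row as one variable, so pass the vectors as rows.
corr = np.corrcoef(L.T)                        # 5 x 5 Pearson correlation matrix
```

Entry `corr[i, j]` is the correlation between the vectors of `languages[i]` and `languages[j]`.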

3 Blog authorship attribution

For our final task, we use the Blog corpus of which contains 681,288 blog posts from 19,320 authors. For our experiments, we break the corpus into two separate datasets: one containing the 1000 most prolific authors (most blog posts) and the other containing all the rest. Each author comes with an attribute tag corresponding to a tuple (age, gender, industry) indicating the age range of the author (10s, 20s or 30s), whether the author is male or female and what industry the author works in. Note that industry does not necessarily correspond to the topic of blog posts. We use the dataset of non-prolific authors to train a multiplicative language model conditioned on an attribute tuple, of which there are 234 unique tuples in total. We used 100-dimensional word vectors initialized from , 100-dimensional attribute vectors with random initialization and a context size of 5. A 1000-way classification task is then performed on the prolific author subset and evaluation is done using 10-fold cross-validation. Our initial experimentation with baselines found that tf-idf performs well on this dataset (45.9% accuracy). Thus, we consider how much we can improve on the tf-idf baseline by augmenting word and attribute features.

For the first experiment, we determine the effect conditional word embeddings have on classification performance, assuming attributes are available at test time. (For blog metadata, this is a reasonable assumption since such side information can be easily accessed.) For this, we compute two embedding matrices, one without and one with attribute knowledge:
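In the notation of the Methods section, and assuming the two matrices are the folded embedding matrix ${\bf E}$ and its attribute-gated counterpart, they can be written as below. The symbol ${\bf E}^{x}$ and the attribute projection ${\bf W}^{fd}$ are labels assumed here for illustration.

```latex
\text{unconditioned:}\quad {\bf E} = ({\bf W}^{fk})^{\top} {\bf W}^{fv},
\qquad
\text{conditioned:}\quad {\bf E}^{x} = ({\bf W}^{fk})^{\top}\, \textrm{diag}({\bf W}^{fd}{\bf x})\, {\bf W}^{fv}
```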

We represent a blog post as the sum of word vectors projected to unit norm and augment these with tf-idf features. As an additional baseline we include a log-bilinear language model . Figure 3b illustrates the results, from which we observe that conditioned word embeddings are significantly more discriminative than word embeddings computed without knowledge of attribute vectors.

For the second experiment, we determine the effect of inferring attribute vectors at test time if they are not assumed to be available. To do this, we train a logistic regression classifier within each fold for predicting attributes. We compute an inferred vector by averaging each of the attribute vectors weighted by the log-probabilities of the classifier. In Fig. 3c we plot the difference in performance when an inferred vector is augmented vs. when it is not. These results show consistent, albeit small, gains when attribute vectors are inferred at test time.

To get a better sense of the attribute features learned from the model, the supplementary material contains a t-SNE embedding of the learned attribute vectors. Interestingly, the model learns features which largely isolate the vectors of all teenage bloggers independent of gender and topic.

4 Conditional word similarity

One of the key properties of our tensor formulation is the notion of conditional word similarity, namely how neighbours of word representations change depending on the attributes that are conditioned on. In order to explore the effects of this, we performed two qualitative comparisons: one using blog attribute vectors and the other with language vectors. These results are illustrated in Table 5. For the first comparison on the left, we chose two attributes from the blog corpus and a query word. We identify each of these attribute pairs as A and B. Next, we computed a ranked list of the nearest neighbours (by cosine similarity) of words conditioned on each attribute and identified the top 15 words in each. Out of these 15 words, we display the top 3 words which are common to both ranked lists, as well as 3 words that are unique to a specific attribute. Our results illustrate that the model can capture distinctive notions of word similarities depending on which attributes are being conditioned. On the right of Table 5, we chose a query word in English (italicized) and computed the nearest neighbours when conditioned on each language vector. This results in neighbours that are either direct translations of the query word or words that are semantically similar. The supplementary material includes additional examples with nearest neighbours of collocations.
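The comparison procedure above can be sketched in a few lines. The code assumes the attribute-gated embedding form $({\bf W}^{fk})^{\top}\textrm{diag}({\bf W}^{fd}{\bf x}){\bf W}^{fv}$ and uses random, untrained parameters, so the resulting neighbours are meaningless; all function names and sizes are hypothetical, and words with an all-zero embedding are not handled.

```python
import numpy as np

def conditioned_embeddings(W_fk, W_fd, W_fv, x):
    """K x V word embeddings gated by the attribute vector x."""
    return W_fk.T @ ((W_fd @ x)[:, None] * W_fv)

def nearest_neighbours(E, query_id, top_k=15):
    """Indices of the top_k columns of E closest to the query column (cosine similarity)."""
    unit = E / np.linalg.norm(E, axis=0, keepdims=True)
    sims = unit.T @ unit[:, query_id]
    sims[query_id] = -np.inf                     # exclude the query word itself
    return [int(i) for i in np.argsort(-sims)[:top_k]]

def compare_lists(neigh_a, neigh_b):
    """Split two ranked lists into shared words and words unique to each attribute."""
    set_a, set_b = set(neigh_a), set(neigh_b)
    common = [w for w in neigh_a if w in set_b]
    only_a = [w for w in neigh_a if w not in set_b]
    only_b = [w for w in neigh_b if w not in set_a]
    return common, only_a, only_b

# Toy usage: the same query word under two attribute vectors A and B.
rng = np.random.default_rng(0)
V, K, D, F = 50, 8, 4, 10
W_fk, W_fd, W_fv = (rng.normal(size=s) for s in [(F, K), (F, D), (F, V)])
x_a, x_b = rng.normal(size=D), rng.normal(size=D)
neigh_a = nearest_neighbours(conditioned_embeddings(W_fk, W_fd, W_fv, x_a), query_id=0)
neigh_b = nearest_neighbours(conditioned_embeddings(W_fk, W_fd, W_fv, x_b), query_id=0)
common, only_a, only_b = compare_lists(neigh_a, neigh_b)
```

Taking the first three entries of `common`, `only_a` and `only_b` mirrors the layout described for Table 5.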

Conclusion

There are several future directions in which this work can be extended. One application area of interest is in learning representations of authors from papers they choose to review as a way of improving automated reviewer-paper matching . Since authors contribute to different research topics, it might be more useful to instead consider a mixture of attribute vectors that can allow for distinctive representations of the same author across research areas. Another interesting application is learning representations of graphs. Recently, proposed an approach for learning embeddings of nodes in social networks. Introducing network indicator vectors could allow us to potentially learn representations of full graphs. Such an approach would allow for a new way of comparing structural similarity of different types of social networks. Finally, it would be interesting to train a multiplicative neural language model simultaneously across dozens of languages to better determine what kinds of properties and similarities language vectors can learn to represent.