Molding CNNs for text: non-linear, non-consecutive convolutions

Tao Lei, Regina Barzilay, Tommi Jaakkola

Introduction

Deep learning methods and convolutional neural networks (CNNs) among them have become de facto top performing techniques across a range of NLP tasks such as sentiment classification, question-answering, and semantic parsing. As methods, they require only limited domain knowledge to reach respectable performance with increasing data and computation, yet permit easy architectural and operational variations so as to fine tune them to specific applications to reach top performance. Indeed, their success is often contingent on specific architectural and operational choices.

CNNs for text applications make use of temporal convolution operators or filters. Similar to image processing, they are applied at multiple resolutions, interspersed with non-linearities and pooling. The convolution operation itself is a linear mapping over “n-gram vectors” obtained by concatenating consecutive word (or character) representations. We argue that this basic building block can be improved in two important respects. First, the power of n-grams derives precisely from multi-way interactions and these are clearly missed (initially) with linear operations on stacked n-gram vectors. Non-linear interactions within a local context have been shown to improve empirical performance in various tasks [Mitchell and Lapata, 2008, Kartsaklis et al., 2012, Socher et al., 2013]. Second, many useful patterns are expressed as non-consecutive phrases, such as semantically close multi-word expressions (e.g.,“not that good”, “not nearly as good”). In typical CNNs, such expressions would have to come together and emerge as useful patterns after several layers of processing.

We propose to use a feature mapping operation based on tensor products instead of linear operations on stacked vectors. This enables us to directly tap into non-linear interactions between adjacent word feature vectors [Socher et al., 2013, Lei et al., 2014]. To offset the accompanying parametric explosion we maintain a low-rank representation of the tensor parameters. Moreover, we show that this feature mapping can be applied to all possible non-consecutive n-grams in the sequence with an exponentially decaying weight depending on the length of the span. Owing to the low rank representation of the tensor, this operation can be performed efficiently in linear time with respect to the sequence length via dynamic programming. Similar to traditional convolution operations, our non-linear feature mapping can be applied successively at multiple levels.

We evaluate the proposed architecture in the context of sentence sentiment classification and news categorization. On the Stanford Sentiment Treebank dataset, our model obtains state-of-the-art performance among a variety of neural networks in terms of both accuracy and training cost. Our model achieves 51.2% accuracy on fine-grained classification and 88.6% on binary classification, outperforming the best published numbers obtained by a deep recursive model [Tai et al., 2015] and a convolutional model [Kim, 2014]. On the Chinese news categorization task, our model achieves 80.0% accuracy, while the closest baseline achieves 79.2%.

Related Work

Deep neural networks have recently brought about significant advancements in various natural language processing tasks, such as language modeling [Bengio et al., 2003, Mikolov et al., 2010], sentiment analysis [Socher et al., 2013, Iyyer et al., 2015, Le and Zuidema, 2015], syntactic parsing [Collobert and Weston, 2008, Socher et al., 2011a, Chen and Manning, 2014] and machine translation [Bahdanau et al., 2014, Devlin et al., 2014, Sutskever et al., 2014]. Models applied in these tasks exhibit significant architectural differences, ranging from recurrent neural networks [Mikolov et al., 2010, Kalchbrenner and Blunsom, 2013] to recursive models [Pollack, 1990, Küchler and Goller, 1996], and including convolutional neural nets [Collobert and Weston, 2008, Collobert et al., 2011, Yih et al., 2014, Shen et al., 2014, Kalchbrenner et al., 2014, Zhang and LeCun, 2015].

Our model most closely relates to the latter. Since these models have originally been developed for computer vision [LeCun et al., 1998], their application to NLP tasks introduced a number of modifications. For instance, ?) use the max-over-time pooling operation to aggregate the features over the input sequence. This variant has been successfully applied to semantic parsing [Yih et al., 2014] and information retrieval [Shen et al., 2014, Gao et al., 2014]. ?) instead propose (dynamic) k-max pooling operation for modeling sentences. In addition, ?) combines CNNs of different filter widths and either static or fine-tuned word vectors. In contrast to the traditional CNN models, our method considers non-consecutive n-grams thereby expanding the representation capacity of the model. Moreover, our model captures non-linear interactions within n-gram snippets through the use of tensors, moving beyond direct linear projection operator used in standard CNNs. As our experiments demonstrate these advancements result in improved performance.

Background

Out-of-index words are simply set to all zeros.

The resulting feature values are often passed through non-linearities such as the hyper-tangent (element-wise) as well as aggregated or reduced by “sum-over” or “max-pooling” operations for later (similar stages) of processing.

The overall architecture can be easily modified by replacing the basic n-gram vectors and the convolution operation with other feature mappings. Indeed, we appeal to tensor algebra to introduce a non-linear feature mapping that operates on non-consecutive n-grams.

Model

Typical nn-gram feature mappings where concatenated word vectors are mapped linearly to feature coordinates may be insufficient to directly capture relevant information in the nn-gram. As a remedy, we replace concatenation with a tensor product. Consider a 3-gram (x1,x2,x3)(\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3}) and the corresponding tensor product x1x2x3\mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\mathbf{x}_{3}. The tensor product is a 3-way array of coordinate interactions such that each ijkijk entry of the tensor is given by the product of the corresponding coordinates of the word vectors

Tensor-based feature mapping

Since each n-gram in the sequence is now expanded into a high-dimensional tensor using tensor products, the set of filters are analogously maintained as high-order tensors. In other words, our filters are linear mappings over the higher dimensional interaction terms rather than the original word coordinates.

The formula is equivalent to summing over all the third-order polynomial interaction terms where tensor T\mathbf{T} stores the coefficients.

Directly maintaining the filters as full tensors leads to parametric explosion. Indeed, the size of the tensor T\mathbf{T} (i.e. h×dnh\times d^{n}) would be too large even for typical low-dimensional word vectors where, e.g., d=300d=300. To this end, we assume a low-rank factorization of the tensor T\mathbf{T}, represented in the Kruskal form. Specifically, T\mathbf{T} is decomposed into a sum of hh rank-1 tensors

Non-consecutive n-gram features

Traditional convolution uses consecutive n-grams in the feature map. Non-consecutive n-grams may nevertheless be helpful since phrases such as “not good”, “not so good” and “not nearly as good” express similar sentiments but involve variable spacings between the key words. Variable spacings are not effectively captured by fixed n-grams.

We will aggregate these vectors into an hh-dimensional feature representation at each position in the sequence. The idea is similar to neural bag-of-words models where the feature representation for a document or sentence is obtained by averaging (or summing) of all the word vectors. In our case, we define the aggregate representation z3[k]z_{3}[k] in position kk as the weighted sum of all 3-gram feature representations ending at position kk, i.e.,

where λ[0,1)\lambda\in[0,1) is a decay factor that down-weights 3-grams with longer spans (i.e., 3-grams that skip more in-between words). As λ0\lambda\rightarrow 0 all non-consecutive 3-grams are omitted, z3[k]=z[k2,k1,k]z_{3}[k]=z[k-2,k-1,k], and the model acts like a traditional model with only consecutive n-grams. When λ>0\lambda>0, however, z3[k]z_{3}[k] is a weighted average of many 3-grams with variable spans.

Aggregating features via dynamic programming

Directly calculating z3[]z_{3}[\cdot] according to Eq.(3) by enumerating all 3-grams would require O(L3)O(L^{3}) feature-mapping operations. We can, however, evaluate the features more efficiently by relying on the associative and distributive properties of the feature operation in Eq.(2).

Let f3[k]f_{3}[k] be a dynamic programming table representing the sum of 3-gram feature representations before multiplying with matrix O\mathbf{O}. That is, z3[k]=Of3[k]z_{3}[k]=\mathbf{O}^{\top}f_{3}[k] or, equivalently,

We can analogously define f1[i]f_{1}[i] and f2[j]f_{2}[j] for 1-grams and 2-grams,

These dynamic programming tables can be calculated recursively according to the following formulas:

where s1[]s_{1}[\cdot] and s2[]s_{2}[\cdot] are two auxiliary tables. The resulting z[]z[\cdot] is the sum of 1, 2, and 3-gram features. We found that aggregating the 1,2 and 3-gram features in this manner works better than using 3-gram features alone. Overall, the n-gram feature aggregation can be performed in O(Ln)O(Ln) matrix multiplication/addition operations, and remains linear in the sequence length.

The overall architecture

The simplest strategy (adopted in neural bag-of-words models) would be to average the feature representations and pass the resulting averaged vector directly to a softmax output unit

Our architecture, as illustrated in Figure 1, includes two additional refinements. First, we add a non-linear activation function after each feature representation, i.e. z=ReLU(z+b)\mathbf{z^{\prime}}=\text{ReLU}\left(\mathbf{z}+b\right), where bb is a bias vector and ReLU is the rectified linear unit function. Second, we stack multiple tensor-based feature mapping layers. That is, the input sequence x\mathbf{x} is first processed into a feature sequence and passed through the non-linear transformation to obtain z(1)\mathbf{z}^{(1)}. The resulting feature sequence z(1)\mathbf{z}^{(1)} is then analogously processed by another layer, parameterized by a different set of feature-mapping matrices P,,O\mathbf{P},\cdots,\mathbf{O}, to obtain a higher-level feature sequence z(2)\mathbf{z}^{(2)}, and so on. The output feature representations of all these layers are averaged within each layer and concatenated as shown in Figure 1. The final prediction is therefore obtained on the basis of features across the levels.

Learning the Model

Following standard practices, we train our model by minimizing the cross-entropy error on a given training set. For a single training sequence x\mathbf{x} and the corresponding gold label ymy\in^{m}, the error is defined as,

where mm is the number of possible output label.

The set of model parameters (e.g. P,,O\mathbf{P},\cdots,\mathbf{O} in each layer) are updated via stochastic gradient descent using AdaGrad algorithm [Duchi et al., 2011].

We initialize matrices P,Q,R\mathbf{P},\mathbf{Q},\mathbf{R} from uniform distribution [3/d,3/d]\left[-\sqrt{3/d},\sqrt{3/d}\right] and similarly OU[3/h,3/h]\mathbf{O}\sim\text{U}\left[-\sqrt{3/h},\sqrt{3/h}\right]. In this way, each row of the matrices is an unit vector in expectation, and each rank-1 filter slice has unit variance as well,

In addition, the parameter matrix W\mathbf{W} in the softmax output layer is initialized as zeros, and the bias vectors bb for ReLU activation units are initialized to a small positive constant 0.010.01.

Regularization

We apply two common techniques to avoid overfitting during training. First, we add L2 regularization to all parameter values with the same regularization weight. In addition, we randomly dropout [Hinton et al., 2012] units on the output feature representations z(i)\mathbf{z}^{(i)} at each level.

Experimental Setup

We evaluate our model on sentence sentiment classification task and news categorization task. For sentiment classification, we use the Stanford Sentiment Treebank benchmark [Socher et al., 2013]. The dataset consists of 11855 parsed English sentences annotated at both the root (i.e. sentence) level and the phrase level using 5-class fine-grained labels. We use the standard 8544/1101/2210 split for training, development and testing respectively. Following previous work, we also evaluate our model on the binary classification variant of this benchmark, ignoring all neutral sentences. The binary version has 6920/872/1821 sentences for training, development and testing.

For the news categorization task, we evaluate on Sogou Chinese news corpora.http://www.sogou.com/labs/dl/c.html The dataset contains 10 different news categories in total, including Finance, Sports, Technology and Automobile etc. We use 79520 documents for training, 9940 for development and 9940 for testing. To obtain Chinese word boundaries, we use LTP-Cloudhttp://www.ltp-cloud.com/intro/en/ https://github.com/HIT-SCIR/ltp, an open-source Chinese NLP platform.

Baselines

We implement the standard SVM method and the neural bag-of-words model NBoW as baseline methods in both tasks. To assess the proposed tensor-based feature map, we also implement a convolutional neural network model CNN by replacing our filter with traditional linear filter. The rest of the framework (such as feature averaging and concatenation) remains the same.

In addition, we compare our model with a wide range of top-performing models on the sentence sentiment classification task. Most of these models fall into either the category of recursive neural networks (RNNs) or the category of convolutional neural networks (CNNs). The recursive neural network baselines include standard RNN [Socher et al., 2011b], RNTN with a small core tensor in the composition function [Socher et al., 2013], the deep recursive model DRNN [Irsoy and Cardie, 2014] and the most recent recursive model using long-short-term-memory units RLSTM [Tai et al., 2015]. These recursive models assume the input sentences are represented as parse trees. As a benefit, they can readily utilize annotations at the phrase level. In contrast, convolutional neural networks are trained on sequence-level, taking the original sequence and its label as training input. Such convolutional baselines include the dynamic CNN with k-max pooling DCNN [Kalchbrenner et al., 2014] and the convolutional model with multi-channel CNN-MC by ?). To leverage the phrase-level annotations in the Stanford Sentiment Treebank, all phrases and the corresponding labels are added as separate instances when training the sequence models. We follow this strategy and report results with and without phrase annotations.

Word vectors

The word vectors are pre-trained on much larger unannotated corpora to achieve better generalization given limited amount of training data [Turian et al., 2010]. In particular, for the English sentiment classification task, we use the publicly available 300-dimensional GloVe word vectors trained on the Common Crawl with 840B tokens [Pennington et al., 2014]. This choice of word vectors follows most recent work, such as DAN [Iyyer et al., 2015] and RLSTM [Tai et al., 2015]. For Chinese news categorization, there is no widely-used publicly available word vectors. Therefore, we run word2vec [Mikolov et al., 2013] to train 200-dimensional word vectors on the 1.6 million Chinese news articles. Both word vectors are normalized to unit norm (i.e. w22=1\|w\|_{2}^{2}=1) and are fixed in the experiments without fine-tuning.

Hyperparameter setting

We perform an extensive search on the hyperparameters of our full model, our implementation of the CNN model (with linear filters), and the SVM baseline. For our model and the CNN model, the initial learning rate of AdaGrad is fixed to 0.01 for sentiment classification and 0.1 for news categorization, and the L2 regularization weight is fixed to 1e51e-5 and 1e61e-6 respectively based on preliminary runs. The rest of the hyperparameters are randomly chosen as follows: number of feature-mapping layers {1,2,3}\in\{1,2,3\}, n-gram order n{2,3}n\in\{2,3\}, hidden feature dimension h{50,100,200}h\in\{50,100,200\}, dropout probability {0.0,0.1,0.3,0.5}\in\{0.0,0.1,0.3,0.5\}, and length decay λ{0.0,0.3,0.5}\lambda\in\{0.0,0.3,0.5\}. We run each configuration 3 times to explore different random initializations. For the SVM baseline, we tune L2 regularization weight C{0.01,0.1,1.0,10.0}C\in\{0.01,0.1,1.0,10.0\}, word cut-off frequency {1,2,3,5}\in\{1,2,3,5\} (i.e. pruning words appearing less than this times) and n-gram feature order n{1,2,3}n\in\{1,2,3\}.

Implementation details

The source code is implemented in Python using the Theano library [Bergstra et al., 2010], a flexible linear algebra compiler that can optimize user-specified computations (models) with efficient automatic low-level implementations, including (back-propagated) gradient calculation.

Results

Table 1 presents the performance of our model and other baseline methods on Stanford Sentiment Treebank benchmark. Our full model obtains the highest accuracy on both the development and test sets. Specifically, it achieves 51.2% and 88.6% test accuracies on fine-grained and binary tasks respectivelyBest hyperparameter configuration based on dev accuracy: 3 layers, 3-gram tensors (n=3), feature dimension d=200d=200 and length decay λ=0.5\lambda=0.5. As shown in Table 2, our model performance is relatively stable – it remains high accuracies with around 0.5% standard deviation under different initializations and dropout rates.

Our full model is also several times faster than other top-performing models. For example, the convolutional model with multi-channel (CNN-MC) runs over 2400 seconds per training epoch. In contrast, our full model (with 3 feature layers) runs on average 28 seconds with only root labels and on average 445 seconds with all labels.

Our results also show that the CNN model, where our feature map is replaced with traditional linear map, performs worse than our full model. This observation confirms the importance of the proposed non-linear, tensor-based feature mapping. The CNN model also lags behind the DCNN and CNN-MC baselines, since the latter two propose several advancements over standard CNN.

Table 3 reports the results of SVM, NBoW and our model on the news categorization task. Since the dataset is much larger compared to the sentiment dataset (80K documents vs. 8.5K sentences), the SVM method is a competitive baseline. It achieves 78.5% accuracy compared to 74.4% and 79.2% obtained by the neural bag-of-words model and CNN model. In contrast, our model obtains 80.0% accuracy on both the development and test sets, outperforming the three baselines by a 0.8% absolute margin. The best hyperparameter configuration in this task uses less feature layers and lower n-gram order (specifically, 2 layers and n=2n=2) compared to the sentiment classification task. We hypothesize that the difference is due to the nature of the two tasks: the document classification task requires to handle less compositions or context interactions than sentiment analysis.

2 Hyperparameter Analysis

We next investigate the impact of hyperparameters in our model performance. We use the models trained on fine-grained sentiment classification task with only root labels.

We plot the fine-grained sentiment classification accuracies obtained during hyperparameter grid search. Figure 2 illustrates how the number of feature layers impacts the model performance. As shown in the figure, adding higher-level features clearly improves the classification accuracy across various hyperparameter settings and initializations.

Non-consecutive n-gram features

We also analyze the effect of modeling non-consecutive n-grams. Figure 3 splits the model accuracies according to the choice of span decaying factor λ\lambda. Note when λ=0\lambda=0, the model applies feature extractions to consecutive n-grams only. As shown in Figure 3, this setting leads to consistent performance drop. This result confirms the importance of handling non-consecutive n-gram patterns.

Non-linear activation

Finally, we verify the effectiveness of rectified linear unit activation function (ReLU) by comparing it with no activation (or identity activation f(x)=xf(x)=x). As shown in Figure 4, our model with ReLU activation generally outperforms its variant without ReLU. The observation is consistent with previous work on convolutional neural networks and other neural network models.

3 Example Predictions

Figure 5 gives examples of input sentences and the corresponding predictions of our model in fine-grained sentiment classification. To see how our model captures the sentiment at different local context, we apply the learned softmax activation to the extracted features at each position without taking the average. That is, for each index ii, we obtain the local sentiment p=softmax(W(z(1)[i]z(2)[i]z(3)[i]))p=\text{softmax}\left(\mathbf{W}^{\top}\left(z^{(1)}[i]\oplus z^{(2)}[i]\oplus z^{(3)}[i]\right)\right). We plot the expected sentiment scores s=22sp(s)\sum_{s=-2}^{2}s\cdot p(s), where a score of 2 means “very positive”, 0 means “neutral” and -2 means “very negative”. As shown in the figure, our model successfully learns negation and double negation. The model also identifies positive and negative segments appearing in the sentence.

Conclusion

We proposed a feature mapping operator for convolutional neural networks by modeling n-gram interactions based on tensor product and evaluating all non-consecutive n-gram vectors. The associated parameters are maintained as a low-rank tensor, which leads to efficient feature extraction via dynamic programming. The model achieves top performance on standard sentiment classification and document categorization tasks.

Acknowledgments

We thank Kai Sheng Tai, Mohit Iyyer and Jordan Boyd-Graber for answering questions about their paper. We also thank Yoon Kim, the MIT NLP group and the reviewers for their comments. We acknowledge the support of the U.S. Army Research Office under grant number W911NF-10-1-0533. The work is developed in collaboration with the Arabic Language Technologies (ALT) group at Qatar Computing Research Institute (QCRI). Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.

References