Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks

Ji Young Lee, Franck Dernoncourt

Introduction

Short-text classification is an important task in many areas of natural language processing, including sentiment analysis, question answering, or dialog management. Many different approaches have been developed for short-text classification, such as using Support Vector Machines (SVMs) with rule-based features [Silva et al., 2011], combining SVMs with naive Bayes [Wang and Manning, 2012], and building dependency trees with Conditional Random Fields [Nakagawa et al., 2010]. Several recent studies using ANNs have shown promising results, including convolutional neural networks (CNNs) [Kim, 2014, Blunsom et al., 2014, Kalchbrenner et al., 2014] and recursive neural networks [Socher et al., 2012].

Most ANN systems classify short texts in isolation, i.e., without considering preceding short texts. However, short texts usually appear in sequence (e.g., sentences in a document or utterances in a dialog), therefore using information from preceding short texts may improve the classification accuracy. Previous works on sequential short-text classification are mostly based on non-ANN approaches, such as Hidden Markov Models (HMMs) [Reithinger and Klesen, 1997], [Stolcke et al., 2000], maximum entropy [Ang et al., 2005], and naive Bayes [Lendvai and Geertzen, 2007].

Inspired by the performance of ANN-based systems for non-sequential short-text classification, we introduce a model based on recurrent neural networks (RNNs) and CNNs for sequential short-text classification, and evaluate it on the dialog act classification task. A dialog act characterizes an utterance in a dialog based on a combination of pragmatic, semantic, and syntactic criteria. Its accurate detection is useful for a range of applications, from speech recognition to automatic summarization [Stolcke et al., 2000]. Our model achieves state-of-the-art results on three different datasets.

Model

Our model comprises two parts. The first part generates a vector representation for each short text using either the RNN or CNN architecture, as discussed in Section 2.1 and Figure 1. The second part classifies the current short text based on the vector representations of the current as well as a few preceding short texts, as presented in Section 2.2 and Figure 2.

We denote scalars with italic lowercases (e.g., k,bfk,\,b_{f}), vectors with bold lowercases (e.g., s,xi\mathbf{s},\,\mathbf{x}_{i}), and matrices with italic uppercases (e.g., WfW_{f}). We use the colon notation vi:j\mathbf{v}_{i:j} to denote the sequence of vectors (vi,vi+1,,vj)(\mathbf{v}_{i},\mathbf{v}_{i+1},\dotsc,\mathbf{v}_{j}).

A given short text of length \ell is represented as the sequence of mm-dimensional word vectors x1:,\mathbf{x}_{1:\ell}, which is used by the RNN or CNN model to produce the nn-dimensional short-text representation s.\mathbf{s}.

We use a variant of RNN called Long Short Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]. For the ttht^{th} word in the short-text, an LSTM takes as input xt,ht1,ct1\mathbf{x}_{t},\mathbf{h}_{t-1},\mathbf{c}_{t-1} and produces ht,ct\mathbf{h}_{t},\mathbf{c}_{t} based on the following formulas:

subscript𝑊𝑖subscript𝐱𝑡subscript𝑈𝑖subscript𝐡𝑡1subscript𝐛𝑖\displaystyle=\sigma(W_{i}\mathbf{x}_{t}+U_{i}\mathbf{h}_{t-1}+\mathbf{b}_{i}) ft\displaystyle\mathbf{f}_{t} =σ(Wfxt+Ufht1+bf)\displaystyle=\sigma(W_{f}\mathbf{x}_{t}+U_{f}\mathbf{h}_{t-1}+\mathbf{b}_{f}) c~t\displaystyle\tilde{\mathbf{c}}_{t} =tanh(Wcxt+Ucht1+bc)\displaystyle=\text{tanh}(W_{c}\mathbf{x}_{t}+U_{c}\mathbf{h}_{t-1}+\mathbf{b}_{c}) ct\displaystyle\mathbf{c}_{t} =ftct1+itc~t\displaystyle=\mathbf{f}_{t}\odot\mathbf{c}_{t-1}+\mathbf{i}_{t}\odot\tilde{\mathbf{c}}_{t} ot\displaystyle\mathbf{o}_{t} =σ(Woxt+Uoht1+bo)\displaystyle=\sigma(W_{o}\mathbf{x}_{t}+U_{o}\mathbf{h}_{t-1}+\mathbf{b}_{o}) ht\displaystyle\mathbf{h}_{t} =ottanh(ct)\displaystyle=\mathbf{o}_{t}\odot\text{tanh}(\mathbf{c}_{t}) where WjRn×m,  UjRn×nW_{j}\in\mathbb{R}^{n\times m},\;U_{j}\in\mathbb{R}^{n\times n} are weight matrices and bjRn\mathbf{b}_{j}\in\mathbb{R}^{n} are bias vectors, for j{i,f,c,o}j\in\{i,f,c,o\}. The symbols σ()\sigma(\cdot) and tanh()(\cdot) refer to the element-wise sigmoid and hyperbolic tangent functions, and \odot is the element-wise multiplication. h0=c0=0\mathbf{h}_{0}=\mathbf{c}_{0}=\mathbf{0}.

In the pooling layer, the sequence of vectors h1:\mathbf{h}_{1:\ell} output from the RNN layer are combined into a single vector sRn\mathbf{s}\in\mathbb{R}^{n} that represents the short-text, using one of the following mechanisms: last, mean, and max pooling. Last pooling takes the last vector, i.e., s=h,\mathbf{s}=\mathbf{h}_{\ell}, mean pooling averages all vectors, i.e., s=1t=1ht,\mathbf{s}=\tfrac{1}{\ell}\sum_{t=1}^{\ell}\mathbf{h}_{t}, and max pooling takes the element-wise maximum of h1:.\mathbf{h}_{1:\ell}.

1.2 CNN-based short-text representation

Using a filter WfRh×mW_{f}\in\mathbb{R}^{h\times m} of height h,h, a convolution operation on hh consecutive word vectors starting from ttht^{th} word outputs the scalar feature

∙subscript𝑊𝑓subscript𝑋:𝑡𝑡ℎ1subscript𝑏𝑓\displaystyle c_{t}=\text{ReLU}(W_{f}\tiny{\,\bullet\,}X_{t:t+h-1}+b_{f}) where Xt:t+h1Rh×mX_{t:t+h-1}\in\mathbb{R}^{h\times m} is the matrix whose ithi^{th} row is xiRm,\mathbf{x}_{i}\in\mathbb{R}^{m}, and bfRb_{f}\in\mathbb{R} is a bias. The symbol \bullet refers to the dot product and ReLU()(\cdot) is the element-wise rectified linear unit function.

We perform convolution operations with nn different filters, and denote the resulting features as ctRn,\mathbf{c}_{t}\in\mathbb{R}^{n}, each of whose dimensions comes from a distinct filter. Repeating the convolution operations for each window of hh consecutive words in the short-text, we obtain c1:h+1.\mathbf{c}_{1:\ell-h+1}. The short-text representation sRn\mathbf{s}\in\mathbb{R}^{n} is computed in the max pooling layer, as the element-wise maximum of c1:h+1.\mathbf{c}_{1:\ell-h+1}.

2 Sequential short-text classification

Let si\mathbf{s}_{i} be the nn-dimensional short-text representation given by the RNN or CNN architecture for the ithi^{th} short text in the sequence. The sequence sid1d2:i\mathbf{s}_{i-d_{1}-d_{2}\,:\,i} is fed into a two-layer feedforward ANN that predicts the class for the ithi^{th} short text. The hyperparameters d1,d2d_{1},d_{2} are the history sizes used in the first and second layers, respectively.

The first layer takes as input sid1d2:i\mathbf{s}_{i-d_{1}-d_{2}\,:\,i} and outputs the sequence yid2:i\mathbf{y}_{i-d_{2}\,:\,i} defined as

superscriptsubscript𝑑0subscript𝑑1subscript𝑊𝑑subscript𝐬𝑗𝑑subscript𝐛1for-all𝑗𝑖subscript𝑑2𝑖\displaystyle\mathbf{y}_{j}=\text{tanh}\left(\sum_{d=0}^{d_{1}}W_{-d}\,\mathbf{s}_{j-d}+\mathbf{b}_{1}\right),\;\forall j\in[i-d_{2},i] where W0,W1,W2Rk×nW_{0},W_{-1},W_{-2}\in\mathbb{R}^{k\times n} are the weight matrices, b1Rk\mathbf{b}_{1}\in\mathbb{R}^{k} is the bias vector, yjRk\mathbf{y}_{j}\in\mathbb{R}^{k} is the class representation, and kk is the number of classes for the classification task.

Similarly, the second layer takes as input the sequence of class representations yid2:i\mathbf{y}_{i-d_{2}:i} and outputs ziRk\mathbf{z}_{i}\in\mathbb{R}^{k}:

superscriptsubscript𝑗0subscript𝑑2subscript𝑊𝑗subscript𝐲𝑖𝑗subscript𝐛2\displaystyle\mathbf{z}_{i}=\text{softmax}\left(\sum_{j=0}^{d_{2}}W_{-j}\,\mathbf{y}_{i-j}+\mathbf{b}_{2}\right) where U0,U1,U2Rk×kU_{0},U_{-1},U_{-2}\in\mathbb{R}^{k\times k} and b2Rk\mathbf{b}_{2}\in\mathbb{R}^{k} are the weight matrices and bias vector.

The final output zi\mathbf{z}_{i} represents the probability distribution over the set of kk classes for the ithi^{th} short-text: the jthj^{th} element of zi\mathbf{z}_{i} corresponds to the probability that the ithi^{th} short-text belongs to the jthj^{th} class.

Datasets and Experimental Setup

We evaluate our model on the dialog act classification task using the following datasets:

DSTC 4: Dialog State Tracking Challenge 4 [Kim et al., 2015, Kim et al., 2016].

MRDA: ICSI Meeting Recorder Dialog Act Corpus [Janin et al., 2003, Shriberg et al., 2004]. The 5 classes are introduced in [Ang et al., 2005].

SwDA: Switchboard Dialog Act Corpus [Jurafsky et al., 1997].

For MRDA, we use the train/validation/test splits provided with the datasets. For DSTC 4 and SwDA, only the train/test splits are provided.111All train/validation/test splits can be found at https://github.com/Franck-Dernoncourt/naacl2016 Table 1 presents statistics on the datasets.

2 Training

The model is trained to minimize the negative log-likelihood of predicting the correct dialog acts of the utterances in the train set, using stochastic gradient descent with the Adadelta update rule [Zeiler, 2012]. At each gradient descent step, weight matrices, bias vectors, and word vectors are updated. For regularization, dropout is applied after the pooling layer, and early stopping is used on the validation set with a patience of 10 epochs.

Results and Discussion

To find effective hyperparameters, we varied one hyperparameter at a time while keeping the other ones fixed. Table 2 presents our hyperparameter choices.

d1d_{1} d2d_{2} LSTM CNN 0 1 2 0 1 2 DSTC4 0 63.1 (62.4, 63.6) 65.7 (65.6, 65.7) 64.7 (63.9, 65.3) 64.1 (63.5, 65.2) 65.4 (64.7, 66.6) 65.1 (63.2, 65.9) 1 65.8 (65.5, 66.1) 65.7 (65.3, 66.1) 64.8 (64.6, 65.1) 65.3 (64.1, 65.9) 65.1 (62.1, 66.2) 64.9 (64.4, 65.6) 2 65.7 (65.0, 66.2) 65.5 (64.4, 66.1) 64.9 (64.6, 65.2) 65.7 (64.9, 66.3) 65.8 (65.2, 66.1) 65.4 (64.5, 66.0) MRDA 0 82.8 (82.4, 83.1) 83.2 (82.9, 83.4) 82.9 (82.4, 83.4) 83.2 (83.0, 83.4) 83.5 (82.9, 84.0) 83.8 (83.4, 84.2) 1 83.2 (82.6, 83.7) 83.8 (83.5, 84.4) 83.6 (83.2, 83.8) 84.6 (84.5, 84.9) 84.6 (84.4, 84.8) 84.1 (83.8, 84.4) 2 84.1 (83.5, 84.4) 83.9 (83.4, 84.7) 83.3 (82.6, 84.2) 84.4 (84.1, 84.8) 84.6 (84.5, 84.7) 84.4 (84.2, 84.7) SwDA 0 66.3 (65.1, 68.0) 67.9 (66.3, 68.6) 67.8 (66.7, 69.0) 67.0 (65.3, 68.7) 69.1 (68.5, 70.0) 69.7 (69.2, 70.9) 1 68.4 (67.8, 68.8) 67.8 (65.5, 68.9) 67.3 (65.5, 69.5) 69.9 (69.1, 70.9) 69.8 (69.3, 70.6) 69.9 (68.8, 70.6) 2 69.5 (68.9, 70.2) 67.9 (66.5, 69.4) 67.7 (66.9, 68.9) 71.4 (70.4, 73.1) 71.1 (70.2, 72.1) 70.9 (69.7, 71.7) Table 3: Accuracy (%) on different architectures and history sizes d1,d2d_{1},d_{2}. For each setting, we report average (minimum, maximum) computed on 5 runs. Sequential classification (d1+d2>0d_{1}+d_{2}>0) outperforms non-sequential classification (d1=d2=0d_{1}=d_{2}=0). Overall, the CNN model outperformed the LSTM model for all datasets, albeit by a small margin except for SwDA. We also tried a variant of the LSTM model, gated recurrent units [Cho et al., 2014], but the results were generally lower than LSTM. We initialized the word vectors with the 300-dimensional word vectors pretrained with word2vec on Google News [Mikolov et al., 2013a, Mikolov et al., 2013b] for DSTC 4, and the 200-dimensional word vectors pretrained with GloVe on Twitter [Pennington et al., 2014] for MRDA and SwDA, as these choices yielded the best results among all publicly available word2vec, GloVe, SENNA [Collobert, 2011, Collobert et al., 2011] and RNNLM [Mikolov et al., 2011] word vectors.

The effects of the history sizes d1d_{1} and d2d_{2} for the short-text and the class representations, respectively, are presented in Table 3 for both the LSTM and CNN models. In both models, increasing d1d_{1} while keeping d2=0d_{2}=0 improved the performances by 1.3-4.2 percentage points. Conversely, increasing d2d_{2} while keeping d1=0d_{1}=0 yielded better results, but the performance increase was less pronounced: incorporating sequential information at the short-text representation level was more effective than at the class representation level.

Using sequential information at both the short-text representation level and the class representation level does not help in most cases and may even lower the performances. We hypothesize that short-text representations contain richer and more general information than class representations due to their larger dimension. Class representations may not convey any additional information over short-text representations, and are more likely to propagate errors from previous misclassifications.

Table 4 compares our results with the state-of-the-art. Overall, our model shows competitive results, while requiring no human-engineered features. Rigorous comparisons are challenging to draw, as many important details such as text preprocessing and train/valid/test split may vary, and many studies fail to perform several runs despite the randomness in some parts of the training process, such as weight initialization.

Conclusion

In this article we have presented an ANN-based approach to sequential short-text classification. We demonstrate that adding sequential information improves the quality of the predictions, and the performance depends on what sequential information is used in the model. Our model achieves state-of-the-art results on three different datasets for dialog act prediction.

References