Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function

Devendra Singh Sachan, Manzil Zaheer, Ruslan Salakhutdinov

Introduction

Text classification is an important problem in natural language processing (NLP). The task is to assign a document to one or more predefined categories. It has a wide range of applications such as sentiment analysis (?), topic categorization (?), and email filtering (?). Early machine learning approaches for text classification were based on the extraction of bag-of-words features followed by a supervised classifier such as naïve Bayes (?) or a linear SVM (?). Later, better word representations were introduced, such as latent semantic analysis (?), skipgram (?), and fastText (?), which improved classification accuracy. Recently, recurrent and convolutional neural network (?) models were introduced to utilize the word order and grammatical structure. Many complex variations of these models have been proposed to improve text classification accuracy, e.g. training a one-hot CNN (JZ15a) or a one-hot bidirectional LSTM (BiLSTM) network with dynamic max-pooling (JZ16).

Current state-of-the-art approaches for text classification involve using pretrained LSTMs (DL15) or complex, computationally intensive models (JZ17). DL15 argued that randomly initialized LSTMs are difficult to optimize and can lead to worse performance than linear models. Therefore, to improve the performance, they proposed pretraining the LSTM with either a language model or a sequence auto-encoder. However, pretraining or using complicated models can be very time consuming, which is a major disadvantage and may not always be feasible. In this paper, we consider a BiLSTM classifier model similar to the one proposed by DL15 for text classification. For this simple BiLSTM model with pretrained embeddings, we propose a training strategy that can achieve accuracy competitive with the previous purely supervised models, but without the extra pretraining step. We also perform ablation studies to understand which aspects of the proposed training strategy result in an improvement.

Pretraining approaches often use extra unlabeled data in addition to the labeled data. We explore the applicability of such semi-supervised learning (SSL) in our training framework, where there is no prior pretraining step. In this regard, we propose a mixed objective function for SSL that can utilize both labeled and unlabeled data to obtain further improvement in classification. To summarize, our contributions are as follows:

We show that with proper model training, using a maximum likelihood objective with a simple one-layer BiLSTM model (§2) can produce competitive accuracies,

We propose a mixed objective function that can be applied to text classification tasks (§2),

On seven benchmark text classification tasks, we achieve new state-of-the-art results despite having a much simpler model, minimal model tuning, and fewer parameters (§4),

We extend our proposed mixed objective function to the relation extraction task, where we achieve a better F1 score on the SemEval-2010 and TACRED datasets, again with a simple model and minimal model tuning (§6).

Methods

In this section, we will describe the model architecture, our training strategy, and our proposed mixed objective function. For mathematical notation, we will use bold lowercase to denote vectors, bold uppercase to denote matrices, and lowercase to denote scalars and individual words in a document.

Next, these hidden state outputs from the forward LSTM ( $\overrightarrow{\mathbf{h}_{t}}$ ) and backward LSTM ( $\overleftarrow{\mathbf{h}_{t}}$ ) are concatenated at every time-step to enable encoding of information from past and future contexts, respectively.

This max-pooling mechanism constrains the model to capture the most useful features produced by the BiLSTM encoder. Next, the linear layer applies an affine transformation to the feature vector to produce logits ( $\mathbf{d}$ ).

Throughout, $(\mathbf{x},y)$ denotes a training example and $\boldsymbol{\theta}$ denotes the model parameters. For model training, we use supervised and unsupervised loss functions, which are discussed next.
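The encoder and classifier described above can be written compactly as in the following sketch. The weight matrix $\mathbf{W}$, bias $\mathbf{b}$, feature index $\ell$, number of classes $K$, and sequence length $T$ are notational assumptions of this sketch rather than symbols taken from the text, and the softmax in the last line is the standard choice for turning logits into class probabilities.

```latex
\begin{align*}
\mathbf{h}_{t} &= [\overrightarrow{\mathbf{h}_{t}};\ \overleftarrow{\mathbf{h}_{t}}], \qquad t = 1,\ldots,T \\
h_{\ell} &= \max_{1 \le t \le T} h_{t,\ell} \qquad \text{(max-pooling over time, one value per feature } \ell\text{)} \\
\mathbf{d} &= \mathbf{W}\mathbf{h} + \mathbf{b} \\
p(y=k \mid \mathbf{x};\boldsymbol{\theta}) &= \frac{\exp(d_{k})}{\sum_{j=1}^{K}\exp(d_{j})}
\end{align*}
```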

Supervised Training

This is the most widely used method to learn the parameters of neural network models from observed data for the text classification task. Here, we minimize the average cross-entropy loss between the estimated class probability and the ground truth class label over all training examples.
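Under the usual definition of cross-entropy, the maximum likelihood objective can be sketched as follows. The number of labeled examples $m_{l}$ and the number of classes $K$ are symbols assumed here by analogy with $\textit{m}_{\textit{u}}$; they are not defined in the text.

```latex
\[
L_{\textit{ML}}(\boldsymbol{\theta}) = -\frac{1}{m_{l}} \sum_{i=1}^{m_{l}} \sum_{k=1}^{K}
\mathbb{1}\big(y^{(i)} = k\big)\, \log p\big(y^{(i)} = k \mid \mathbf{x}^{(i)};\boldsymbol{\theta}\big)
\]
```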

Adversarial Training (AT)

Adversarial examples are created by applying small perturbations to inputs in order to mislead the machine learning algorithm. The objective of adversarial training is to construct adversarial examples and feed them as input during training, which makes the model more robust to adversarial noise and thereby improves its generalization ability (?).

In this work, we make adversarial perturbations to the input word embeddings ( $\mathbf{v}=[\mathbf{v}_{1},\ldots,\mathbf{v}_{T}]$ ) (MDG16). These perturbations ( $\mathbf{r}_{\textit{at}}$ ) are estimated by linearizing the supervised cross-entropy loss around the input word embeddings. Specifically, to get the adversarial embedding $(\mathbf{v}^{*})$ corresponding to $\mathbf{v}$ , we rescale the training loss gradient ( $\mathbf{g}$ ), computed by backpropagation using the current model parameters ( $\boldsymbol{\hat{\theta}}$ ), by its $L_{2}$ norm and add the result to $\mathbf{v}$ .

Here, $k$ is the correct class label and $\epsilon$ is a hyperparameter that controls the magnitude of the perturbation. We apply the adversarial loss only to the labeled data.
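The perturbation step described above can be illustrated with a minimal sketch. It assumes the gradient of the loss with respect to the embeddings is already available (in practice it would come from backpropagation) and that the $L_{2}$ norm is taken over the whole sequence of embeddings, as is common for this kind of perturbation; the function names are hypothetical.

```python
import math


def l2_norm(grad):
    """L2 norm taken over every dimension of every time-step."""
    return math.sqrt(sum(x * x for vec in grad for x in vec))


def adversarial_embeddings(embeddings, grad, epsilon):
    """Return v* = v + epsilon * g / ||g||_2 for a sequence of word embeddings.

    `embeddings` and `grad` are lists of equal-length float lists (one per word).
    `grad` is assumed to be the loss gradient with respect to the embeddings.
    """
    norm = l2_norm(grad)
    if norm == 0.0:
        # No gradient signal: return an unperturbed copy.
        return [list(v) for v in embeddings]
    scale = epsilon / norm
    return [[x + scale * g for x, g in zip(v, gv)]
            for v, gv in zip(embeddings, grad)]


# Toy values (hypothetical): two words with 2-dimensional embeddings.
v = [[0.5, -1.0], [2.0, 0.0]]
g = [[3.0, 0.0], [0.0, 4.0]]  # ||g||_2 = 5
v_adv = adversarial_embeddings(v, g, epsilon=5.0)
```

The adversarial loss is then the cross-entropy of the correct label evaluated at `v_adv` instead of `v`; the perturbation itself always has $L_{2}$ norm $\epsilon$.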

Unsupervised Training

In this paper, in addition to supervised training, we also experiment with two unsupervised methodologies: entropy minimization and virtual adversarial training. When incorporated into the objective function, these loss functions act as effective regularizers during model training. To describe them, we assume that the dataset contains an additional $\textit{m}_{\textit{u}}$ unlabeled examples $\{\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(\textit{m}_{\textit{u}})}\}$ .

For entropy minimization, in addition to the supervised cross-entropy loss, we also minimize the conditional entropy of the estimated class probabilities (?; ?). This can also be interpreted as a special case of the missing label problem where the probability $p(y^{(i)}=k|\mathbf{x}^{(i)};\boldsymbol{\theta})$ signifies a soft assignment of the $i^{th}$ example to label $k$ (i.e. soft clustering). The entropy minimization loss is applied in an unsupervised manner to both the labeled and unlabeled data.
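As a minimal sketch of this loss, the conditional entropy can be computed directly from the predicted class probabilities, without using any labels. The function names below are hypothetical and the inputs are assumed to be already-normalized probability vectors.

```python
import math


def conditional_entropy(probs):
    """Entropy -sum_k p_k log p_k of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_minimization_loss(batch_probs):
    """Average conditional entropy over a batch of predicted distributions."""
    return sum(conditional_entropy(p) for p in batch_probs) / len(batch_probs)


# A confident prediction has lower entropy than an uncertain one, so
# minimizing this loss pushes predictions towards confident assignments.
uncertain = entropy_minimization_loss([[0.5, 0.5]])
confident = entropy_minimization_loss([[0.99, 0.01]])
```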

Virtual Adversarial Training (VAT)

Next, the gradient is estimated from the KL divergence.

The virtual adversarial perturbation ( $\mathbf{r}_{\textit{vadv}}$ ) is generated by rescaling this gradient by its $L_{2}$ norm and is added to the word embeddings.

Lastly, the virtual adversarial loss is computed from both the labeled and unlabeled data.
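The steps above follow the usual virtual adversarial training recipe, which can be sketched as below. This is a reconstruction under stated assumptions, not necessarily the exact formulation used here: the random noise $\mathbf{z}$, the use of $\xi$ as its scale, the perturbed embedding $\mathbf{v}'$, and the labeled-set size $m_{l}$ are all notation assumed for this sketch.

```latex
\begin{align*}
\mathbf{v}' &= \mathbf{v} + \xi\,\mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\mathbf{g} &= \nabla_{\mathbf{v}'}\, D_{\mathrm{KL}}\big(p(\cdot \mid \mathbf{v};\boldsymbol{\hat{\theta}}) \,\big\|\, p(\cdot \mid \mathbf{v}';\boldsymbol{\hat{\theta}})\big) \\
\mathbf{r}_{\textit{vadv}} &= \epsilon\, \frac{\mathbf{g}}{\lVert \mathbf{g} \rVert_{2}}, \qquad \mathbf{v}^{*} = \mathbf{v} + \mathbf{r}_{\textit{vadv}} \\
L_{\textit{VAT}}(\boldsymbol{\theta}) &= \frac{1}{m_{l}+\textit{m}_{\textit{u}}} \sum_{i=1}^{m_{l}+\textit{m}_{\textit{u}}}
D_{\mathrm{KL}}\big(p(\cdot \mid \mathbf{v}^{(i)};\boldsymbol{\hat{\theta}}) \,\big\|\, p(\cdot \mid \mathbf{v}^{*(i)};\boldsymbol{\theta})\big)
\end{align*}
```

Because the KL divergence compares the model's own predictions before and after the perturbation, no class label is needed, which is why the loss can be evaluated on unlabeled examples.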

Mixed Objective Function

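Our proposed mixed objective function combines the supervised and unsupervised loss functions described above, using $\lambda_{\textit{ML}}$ , $\lambda_{\textit{AT}}$ , $\lambda_{\textit{EM}}$ , and $\lambda_{\textit{VAT}}$ as hyperparameters that weight the individual terms: $L_{\textit{MIXED}}=\lambda_{\textit{ML}}L_{\textit{ML}}+\lambda_{\textit{AT}}L_{\textit{AT}}+\lambda_{\textit{EM}}L_{\textit{EM}}+\lambda_{\textit{VAT}}L_{\textit{VAT}}$ .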

Training Strategy

We note that there are three main considerations to be taken into account when training the text classifier. First, knowledge of the entire sequence is essential for good classification performance. Thus, the commonly used practice of truncated backpropagation through time (?) is a key limiting factor, and one should perform the gradient update over the entire text sequence. To prevent the out-of-memory issues that can result from longer sequences, we propose to use a dynamic batch size in which each mini-batch contains a fixed total number of words. Second, we need not only to use pretrained word embeddings but also to finetune them for the specific task. Lastly, we should use a larger vocabulary and not limit it to only high-frequency words, because rare or tail words are often strong indicators of the class.
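One simple way to realize a dynamic batch size is sketched below. It is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, documents are sorted by length before grouping (a common choice that is not stated in the text), and a document longer than the budget is placed in a batch of its own rather than truncated.

```python
def dynamic_batches(documents, max_words_per_batch):
    """Group tokenized documents into mini-batches with a bounded word count.

    Returns a list of batches, each a list of indices into `documents`.
    """
    # Visit documents from shortest to longest so that similar lengths
    # end up in the same batch.
    order = sorted(range(len(documents)), key=lambda i: len(documents[i]))
    batches, current, words = [], [], 0
    for i in order:
        n = len(documents[i])
        # Close the current batch if adding this document would exceed the budget.
        if current and words + n > max_words_per_batch:
            batches.append(current)
            current, words = [], 0
        current.append(i)
        words += n
    if current:
        batches.append(current)
    return batches


# Toy corpus (hypothetical): four documents with 2, 10, 3, and 9 tokens.
docs = [["w"] * n for n in (2, 10, 3, 9)]
batches = dynamic_batches(docs, max_words_per_batch=12)
```

With this scheme, short documents are grouped into large batches and long documents into small ones, so memory use per mini-batch stays roughly constant while each sequence is still processed in full.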

Experiments

In this work, we experiment with seven datasets that are summarized in Table 1. The ACL-IMDB (?) and Elec (JZ15a) datasets are widely used for binary sentiment classification of movie reviews and Amazon product reviews respectively, while AG-News, DBpedia (?), and RCV1 (?) are used for topic classification of news articles, Wikipedia, and the Reuters corpus respectively. For the RCV1 dataset, we perform multiclass topic classification based on the second-level topics and construct its training, dev, and test splits in accordance with JZ15a. To show that our proposed method also scales to larger datasets and more categories, we also experiment with large-scale datasets of IMDB reviews and Arxiv abstracts that are used for fine-grained sentiment and topic classification respectively (?). We preprocess all the datasets by converting the text to lowercase and treating all punctuation marks as separate tokens.

Implementation Details

All of our models were implemented in the PyTorch framework (?) and were trained on a single GPU. Our experimental setup is common to all the datasets unless specified otherwise. We use 300D pretrained vectors to initialize the embedding layer. We learn embeddings for the ACL-IMDB, RCV1, IMDB, and Arxiv datasets using word2vec (?) and use fastText pretrained embeddings (?) for all the other datasets; the fastText embeddings (crawl-300d-2M.vec) were trained on data containing 600B tokens from Common Crawl. We use a one-layer BiLSTM of size 512D. For regularization, we apply dropout ( $p_{drop}=0.5$ ) to the word embeddings and to the LSTM’s hidden states. We also use the word dropout strategy in which we randomly set a word to “UNK” with probability $p_{w}=0.1$ . For training, we use the Adam optimizer (?) ( $\textit{learning rate}=10^{-3}$ , $\beta_{1}=0$ , $\beta_{2}=0.98$ , $\epsilon_{\textit{adam}}=10^{-8}$ ) with an exponential learning rate decay scheme. We perform gradient clipping with a maximum $L_{2}$ norm of 1. We backpropagate through time over the entire sequence, i.e. we do not truncate sequences. This differs from DL15, who perform truncated backpropagation through time for 400 time-steps from the end of a sequence.
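The word dropout strategy mentioned above can be sketched in a few lines. The function name and the `rng` parameter are hypothetical, and the sketch assumes that each word is replaced independently of the others and only at training time.

```python
import random


def word_dropout(tokens, p_w=0.1, unk="UNK", rng=random):
    """Replace each token by `unk` independently with probability `p_w`."""
    return [unk if rng.random() < p_w else tok for tok in tokens]


# A seeded generator makes the example reproducible.
rng = random.Random(0)
tokens = ["the", "movie", "was", "surprisingly", "good"]
noisy = word_dropout(tokens, p_w=0.1, rng=rng)
```

The sequence length is unchanged, so the perturbed document can be fed to the model exactly like the original one.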

For semi-supervised training, we experiment with all the objective functions described in §2. For $L_{\textit{MIXED}}$ , we include all the constituent terms with $\lambda_{\textit{ML}}$ , $\lambda_{\textit{AT}}$ , $\lambda_{\textit{EM}}$ , $\lambda_{\textit{VAT}}$ set to 1 and $\xi=0.1$ . We want to emphasize that, in contrast with MDG16, we do not perform embedding layer normalization for the AT or VAT objectives, because including it led to a drop in accuracy in our initial experiments. We select hyperparameters such as the dynamic batch size, vocabulary size, and adversarial perturbation ( $\epsilon$ ) by cross-validation on the development set. These dataset-specific hyperparameters are listed in Table 2. For supervised experiments ( $L_{\textit{ML}}$ ), we train for 20 epochs, and for semi-supervised experiments ( $L_{\textit{MIXED}}$ ), we train for 50 epochs. For the ACL-IMDB, Elec, RCV1, IMDB, and Arxiv datasets, we use both the training and test sets as unlabeled data, while for the AG-News and DBpedia datasets, whose test sets are small, we use only the training set as unlabeled data. During training, we keep the batch size of the unlabeled data the same as that of the labeled data.

Results

In this section, we report the classification accuracy on the test set and perform ablation studies for both supervised and semi-supervised training.

In Table 3, we present the error rates of our method and the previous best-published models when training is done using only the maximum likelihood objective ( $L_{\textit{ML}}$ ). We observe that our model, which consists of a one-layer BiLSTM and pretrained embedding weights, achieves very competitive performance on all the datasets compared with more complex approaches such as one-hot LSTM (JZ16) or pyramidal CNN (JZ17). Specifically, for the ACL-IMDB, AG-News, IMDB, and Arxiv datasets, we report much better results than earlier methods. Thus, our proposed model and training strategy enjoy the following advantages: (a) it is very easy to implement using current deep learning frameworks; (b) it requires much less training time and GPU memory compared with other complicated models; (c) it entirely avoids complex initialization strategies such as pretraining the LSTM weights using a language model; (d) its results can serve as strong baselines when developing more advanced task-specific models.

To understand the importance of the various components of the model and training regimen, we perform ablation studies using the ACL-IMDB dataset (see Table 5). We verify that the good performance of our model mostly results from finetuning the pretrained embeddings, using a larger vocabulary size, and using a carefully preprocessed dataset. We also see that excluding word dropout, using a smaller LSTM, and lowering the batch size each cause a slight drop in performance, while using static pretrained embeddings, randomly initialized embeddings, or a smaller vocabulary size can cause a large drop.

Semi-Supervised Training

For our next set of experiments, we perform training using the $L_{\textit{MIXED}}$ objective, whose results are shown in Table 4. We observe that the mixed objective improves over the maximum likelihood objective and achieves state-of-the-art results on all seven datasets. Specifically, on the widely used ACL-IMDB dataset, there is a substantial reduction of 26.9% in relative error compared with the previous best-published model of JZ16, which is substantially more complex because it uses one-hot encodings of words along with many additional features such as multi-view region embeddings from CNNs and LSTMs. We also want to highlight that, although MDG16 also experiment with adversarial and virtual adversarial training, our approach performs much better due to our improved training strategy and the use of the $L_{\textit{EM}}$ objective. Similarly, for the benchmark AG-News dataset, we observe a relative error reduction of 26.6% compared with the previous state-of-the-art model of JZ17, who use a very deep pyramidal CNN along with region embeddings. Even on the Elec, DBpedia, and RCV1 datasets, our results present significant improvements over the previous best semi-supervised results. The $L_{\textit{MIXED}}$ objective also scales well with dataset size: on the large IMDB and Arxiv datasets, it outperforms the above-mentioned previous approaches by a substantial margin. We note here that the approach of ? (?) is not directly comparable with our results as they use a three-layer LSTM model. We discuss the effect of model size in §5.

Next, we perform ablation studies when the model is trained using $L_{\textit{MIXED}}$ on the ACL-IMDB dataset and analyze the contributions of the different component terms present in the objective (see Table 6). First, we observe that when the model is trained using the $L_{\textit{MIXED}}$ objective on both the labeled and unlabeled data, the accuracy on ACL-IMDB drastically improves by 33% compared with using only the $L_{\textit{ML}}$ objective. Second, we observe that when training only on labeled data, the inclusion of $L_{\textit{AT}}$ and $L_{\textit{VAT}}$ can also significantly improve the performance. However, $L_{\textit{EM}}$ alone does not lead to any significant gains. Furthermore, when $L_{\textit{MIXED}}$ is trained with only labeled data, we see a 12% relative increase in accuracy. Finally, when we add unlabeled data to both $L_{\textit{VAT}}$ and $L_{\textit{EM}}$ , we see consistent improvements, suggesting that these objective functions complement each other and together improve the overall performance.

Analysis

To understand the effect on word embeddings due to training using $L_{\textit{ML}}$ and $L_{\textit{MIXED}}$ objectives, we show the top-10 closest words for the query word “good” based on their cosine similarity in Table 7, where the word embeddings were extracted from the models trained on ACL-IMDB dataset. We see that for static embeddings, the closest words have a mix of both positive (‘great’, ‘decent’, ‘nice’) and negative sentiments (‘bad’, ‘but’). This can be understood as they are syntactically similar adjectives. When these embeddings are finetuned using the $L_{\textit{ML}}$ objective, the network learns more meaningful representations and accommodates more positive sentiment words close to the query word ‘good’. Moreover, when trained using the $L_{\textit{MIXED}}$ objective, we see that those words that have a very high correlation with the class label (positive sentiment class in this case) are clustered together in the embedding space. Our hypothesis is that this factor also contributes to an increase in the overall classification accuracy.
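The nearest-neighbor query used for this analysis can be sketched as follows. The function names and the toy 2-dimensional vectors are hypothetical and serve only to illustrate ranking by cosine similarity; they are not embeddings from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_words(query, embeddings, k=10):
    """Return the k words whose embeddings are most similar to the query word's."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word) for word, vec in embeddings.items() if word != query]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Toy embedding table (hypothetical values).
toy = {
    "good": [1.0, 0.2],
    "great": [0.9, 0.3],
    "nice": [0.8, 0.1],
    "bad": [-1.0, 0.1],
}
neighbors = nearest_words("good", toy, k=2)
```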

Model Regularization Effect

Figure 2a and Figure 2b show the moving average training loss and test error respectively versus the number of epochs on the ACL-IMDB dataset with the $L_{\textit{ML}}$ , $L_{\textit{AT}}$ , $L_{\textit{VAT}}$ , $L_{\textit{EM}}$ , and $L_{\textit{MIXED}}$ objectives. We can see that $L_{\textit{ML}}$ begins to overfit after 5 epochs and $L_{\textit{AT}}$ overfits after 10 epochs, while $L_{\textit{MIXED}}$ , $L_{\textit{EM}}$ , and $L_{\textit{VAT}}$ do not overfit much and thus achieve better generalization than the $L_{\textit{ML}}$ and $L_{\textit{AT}}$ objectives (see Figure 2b). Moreover, as the $L_{\textit{MIXED}}$ and $L_{\textit{VAT}}$ objectives can use unlabeled data, their training loss decays gradually. Thus, the $L_{\textit{MIXED}}$ objective, while being very effective in performance, is also a very robust model regularizer. On the other hand, from Figure 2b, we can see that $L_{\textit{VAT}}$ , $L_{\textit{EM}}$ , and $L_{\textit{MIXED}}$ take a long time to converge compared with $L_{\textit{ML}}$ and $L_{\textit{AT}}$ and are thus quite slow to train. In our experiments, one epoch of $L_{\textit{MIXED}}$ takes around 20 minutes on a GeForce GTX 1080 GPU and it requires roughly 45 epochs to converge, i.e. roughly 15 hours in total. This is considerably slower than the $L_{\textit{ML}}$ objective, where each epoch takes around 3 minutes and the overall convergence time is thus 15 minutes for 5 epochs.

Varying Data Size

In this setup, we first analyze the test error on the ACL-IMDB dataset when the model is trained with the different objective functions (§2) on an increasing number of training examples (learning curve; see Figure 3a). We observe that all the objective functions converge to lower error rates as the amount of training data increases. We also see that the mixed objective model achieves the lowest test error rate for every setting of the number of training examples.

Next, we analyze the test error on the ACL-IMDB dataset by varying the amount of unlabeled data. For this experiment, we use the additional 50,000 reviews provided with the ACL-IMDB dataset and the 25,000 reviews from its test set as unlabeled data. We evaluate the performance of each objective function by linearly increasing the amount of unlabeled data (see Figure 3b). Initially, increasing the amount of unlabeled data tends to improve the performance of $L_{\textit{VAT}}$ , $L_{\textit{EM}}$ , and $L_{\textit{MIXED}}$ . However, we observe that their performance saturates once 25,000 unlabeled examples are available. As the amount of unlabeled data increases beyond this point, the performance tends to degrade sharply. Since the ACL-IMDB training set also consists of 25,000 examples, this observation suggests that, to obtain the best performance using $L_{\textit{MIXED}}$ , the unlabeled and labeled datasets should be roughly the same size. We also note that as $L_{\textit{ML}}$ and $L_{\textit{AT}}$ are supervised approaches, their performance remains unaffected.

Varying Model Size

We note that prior work on the supervised text classification task has used smaller-sized LSTMs, i.e. models with hidden state sizes of at most 512 units, because larger models did not give accuracy gains (DL15, JZ16). This is consistent with our observation, as we find that supervised approaches ( $L_{\textit{ML}}$ ) do not benefit much from increasing the model size. However, when using additional loss functions such as the mixed objective, accuracy scales much better with model size (see Figure 4a). Further, we also observe accuracy gains for all methods upon increasing the number of layers in the model (see Figure 4b). Specifically, the error rate of the $L_{\textit{MIXED}}$ objective improves to 4.15% when using a three-layer deep model. This suggests that larger-sized semi-supervised methods can lead to the development of more accurate models for the text classification task. However, a four-layer model hurts the $L_{\textit{MIXED}}$ objective’s performance due to the training instability of the $L_{\textit{EM}}$ method.

Effect on Prediction Probabilities

To study the behavior of the different methods, we plot a histogram of the prediction probabilities for both the correct (Figure 5a) and incorrect (Figure 5b) predictions. We observe that for correct predictions, all the methods, especially $L_{\textit{EM}}$ and $L_{\textit{MIXED}}$ , have very sharp and confident distributions of class probabilities. However, for incorrect predictions, only $L_{\textit{EM}}$ has sharp peaks, while $L_{\textit{VAT}}$ , $L_{\textit{AT}}$ , and $L_{\textit{ML}}$ encourage the model to learn a smoother distribution.

Ensemble Approach vs Mixed Objective

To understand if the above objectives have complementary strengths, we combine their predicted probabilities with a linear interpolation strategy. Given the output probability for a class $k$ under the $i^{th}$ objective as $p_{\textit{i}}(y=k|\mathbf{x})$ , the interpolated probability is calculated as $p_{\textit{I}}(y=k|\mathbf{x})=\sum_{\textit{i}}\alpha_{\textit{i}}\,p_{\textit{i}}(y=k|\mathbf{x})$ ,

where each interpolation weight $\alpha_{\textit{i}}$ is chosen based on grid search. This simple interpolation technique results in an improved error rate of 5.2%. However, the error rate of our proposed mixed objective function is substantially lower (4.3%), which highlights the importance of jointly training the model on the different objective functions.
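The interpolation and the grid search over its weights can be sketched as follows for the two-model case. This is an illustration under assumptions: the function names are hypothetical, the weights are constrained to sum to one, and the probabilities are toy values rather than model outputs.

```python
def interpolate(prob_lists, alphas):
    """Weighted sum of per-model class probabilities for one example."""
    if len(prob_lists) != len(alphas):
        raise ValueError("need exactly one weight per model")
    num_classes = len(prob_lists[0])
    return [sum(a * p[k] for a, p in zip(alphas, prob_lists))
            for k in range(num_classes)]


def error_rate(batch_probs, labels):
    """Fraction of examples whose most probable class differs from the label."""
    wrong = sum(1 for p, y in zip(batch_probs, labels)
                if max(range(len(p)), key=p.__getitem__) != y)
    return wrong / len(labels)


def grid_search_alpha(probs_a, probs_b, labels, steps=10):
    """Pick alpha in {0, 1/steps, ..., 1} minimizing the error of
    alpha * model_a + (1 - alpha) * model_b on a held-out set."""
    best_alpha, best_err = 0.0, float("inf")
    for s in range(steps + 1):
        alpha = s / steps
        mixed = [interpolate([pa, pb], [alpha, 1.0 - alpha])
                 for pa, pb in zip(probs_a, probs_b)]
        err = error_rate(mixed, labels)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


# Toy predictions (hypothetical): each model gets one of two examples wrong.
probs_a = [[0.9, 0.1], [0.6, 0.4]]
probs_b = [[0.4, 0.6], [0.1, 0.9]]
labels = [0, 1]
alpha, err = grid_search_alpha(probs_a, probs_b, labels)
```

In this toy case each model alone misclassifies one example, while a suitable mixture classifies both correctly, which is the kind of complementarity the interpolation is meant to probe.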

Example Predictions

In Table 8, we show some example movie reviews from the test set of the ACL-IMDB dataset that are correctly classified by the highlighted method and incorrectly classified by all the remaining methods. We observe that methods such as $L_{\textit{MIXED}}$ , $L_{\textit{EM}}$ , and $L_{\textit{VAT}}$ that are based on unsupervised training are able to correctly classify difficult instances in which the overall sentiment is determined by the entire sentence structure. This illustrates the ability of these methods to learn complex long-range dependencies.

Relation Extraction

To evaluate if the mixed objective function can generalize to other tasks, we also perform experiments on the relation extraction (RE) task. In this task, the objective is to identify if a predefined semantic “relation category” exists (or doesn’t exist) between a pair of subject and object entities present in text. This task also presents specific challenges: the linguistic coverage problem, which arises because the training data cannot contain every possible way of expressing a relation class, and the longer text span between the entities in a sentence.

For the RE task, we use a position-aware attention model (?) that consists of a word embedding, position embedding, LSTM, attention layer, linear layer, and softmax layer. We augment the word embeddings by concatenating them with POS tag and NER category embeddings, and the result is fed to the LSTM to get hidden states for each word. The position embeddings for a word are derived from the relative distance of the current word from the subject and object entities. Next, the attention layer computes the final sentence representation by focusing on both the hidden states and position embeddings. Finally, the sentence representation is fed to the linear layer followed by a softmax layer for relation classification.
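The relative distances that index the position embeddings can be computed as in the sketch below. The indexing convention (negative before the entity, zero inside it, positive after it) is an assumption commonly used for position-aware models, and the function name and example sentence are hypothetical.

```python
def relative_positions(seq_len, span_start, span_end):
    """Signed distance of each token from an entity span.

    The span covers token indices span_start..span_end (inclusive).
    """
    positions = []
    for i in range(seq_len):
        if i < span_start:
            positions.append(i - span_start)  # token precedes the entity
        elif i > span_end:
            positions.append(i - span_end)    # token follows the entity
        else:
            positions.append(0)               # token is part of the entity
    return positions


# Hypothetical example: subject = tokens 0-1, object = token 3.
tokens = ["Bill", "Gates", "founded", "Microsoft", "in", "1975"]
subj_pos = relative_positions(len(tokens), 0, 1)
obj_pos = relative_positions(len(tokens), 3, 3)
```

Each token thus receives two integers, one relative to the subject and one relative to the object, and each integer selects a row of a learned position embedding table.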

We perform experiments using our proposed mixed objective function on two RE datasets: TACRED and SemEval-2010 Task 8, whose statistics are shown in Table 10. The POS and NER tags are computed using the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/). Following standard convention, we report the micro-averaged F1 score on TACRED and the official macro-averaged F1 score on the SemEval dataset. We performed only a small number of experiments to search for the hyper-parameter values of dropout, embedding size, hidden layer size, and learning rate on the development set; all other parameters remained the same as in the positional attention model of ? (?) (https://github.com/yuhaozhang/tacred-relation). Our results in Table 10 show that when trained with the mixed objective function, our model performs quite well, producing better results than all previously reported models despite the lack of complex task-specific hyper-parameter tuning.
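The difference between the two averaging schemes can be illustrated with a short sketch. It is a generic illustration, not a replacement for the official scorers, which apply additional dataset-specific conventions; the function names and the toy labels are hypothetical, and `classes` is assumed to list only the relation classes that are scored.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(gold, pred, classes):
    """Micro- and macro-averaged F1 over the scored `classes`."""
    counts = {c: [0, 0, 0] for c in classes}  # per class: [tp, fp, fn]
    for g, p in zip(gold, pred):
        if g == p:
            if g in counts:
                counts[g][0] += 1
        else:
            if p in counts:
                counts[p][1] += 1
            if g in counts:
                counts[g][2] += 1
    # Micro: pool the counts of all classes, then compute one F1.
    micro = f1(*[sum(c[i] for c in counts.values()) for i in range(3)])
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(classes)
    return micro, macro


# Toy labels (hypothetical); "none" is not a scored class.
gold = ["A", "A", "B", "none"]
pred = ["A", "B", "B", "A"]
micro, macro = micro_macro_f1(gold, pred, classes=["A", "B"])
```

Micro-averaging weights every prediction equally and is therefore dominated by frequent classes, while macro-averaging weights every class equally.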

Related Work

Neural network models for NLP have yielded impressive results on several benchmark tasks (?; ?). To learn document features for the text classification task, several methods have been proposed: ? (?) uses 1D CNNs, ? (?) uses a simple bidirectional recurrent CNN with max-pooling, ? (?) applies 2D max-pooling on top of BiLSTMs, ? (?) investigates a joint CNN-LSTM model, and JZ15a, JZ16, and JZ17 apply CNNs, LSTMs, and pyramidal CNNs respectively to one-hot encodings of word sequences. An alternative approach is to first learn sentence representations and then combine them to learn document features. To do this, ? (?) first apply a CNN or LSTM followed by a gated RNN, while ? (?) learn the sentence and document features in a hierarchical manner using a self-attention mechanism.

Semi-Supervised Learning.

SSL approaches can be broadly categorized into three types: multi-view, data augmentation, and transfer learning. First, under multi-view learning, the objective is to use multiple views of both the labeled and unlabeled data to train the model. These multiple views can be obtained either from raw text (?) or from the features (JZ15b). Second, data augmentation, as the name implies, involves pseudo-augmenting either the features or the labels. For text classification, ? (?) performed semi-supervised training using naïve Bayes and expectation-maximization algorithms and demonstrated substantial improvements in performance. MDG16 compute embedding perturbations using adversarial and virtual adversarial approaches to improve model training. Third, under transfer learning, initializing the task-specific model weights with pretrained weights from an auxiliary task is a widely used strategy that has been shown to improve performance in tasks such as text classification (?; ?), question-answering (?), and machine translation (?; ?).

Conclusion

We show that a simple BiLSTM model using maximum likelihood training can achieve competitive performance on text classification tasks without the need for an additional pretraining step. In addition, by combining maximum likelihood with entropy minimization, adversarial, and virtual adversarial training, we report state-of-the-art results on several text classification datasets. This mixed objective function also generalizes well to other tasks such as relation extraction, where it outperforms the current best models.