MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Jiaao Chen, Zichao Yang, Diyi Yang

Introduction

In the era of deep learning, research has achieved extremely good performance in most supervised learning settings LeCun et al. (2015); Yang et al. (2016). However, when only limited labeled data is available, supervised deep learning models often suffer from over-fitting Xie et al. (2019). This strong dependence on labeled data largely prevents neural network models from being applied to new settings or real-world situations, because obtaining enough labeled data requires large amounts of time, money, and expertise. As a result, semi-supervised learning, which utilizes both labeled and unlabeled data for different learning tasks, has received much attention, as unlabeled data is usually much easier and cheaper to collect Chawla and Karakoulas (2011).

This work takes a closer look at semi-supervised text classification, one of the most fundamental tasks in language technology communities. Prior research on semi-supervised text classification can be categorized into several classes: (1) utilizing variational auto encoders (VAEs) to reconstruct the sentences and predicting sentence labels with latent variables learned from reconstruction, such as Chen et al. (2018); Yang et al. (2017); Gururangan et al. (2019); (2) encouraging models to output confident predictions on unlabeled data for self-training, like Lee (2013); Grandvalet and Bengio (2004); Meng et al. (2018); (3) performing consistency training after adding adversarial noise Miyato et al. (2019, 2017) or data augmentations Xie et al. (2019); (4) large-scale pretraining with unlabeled data, then finetuning with labeled data Devlin et al. (2019). Despite the huge success of those models, most prior work utilized labeled and unlabeled data separately, in a way that no supervision can pass from labeled to unlabeled data or from unlabeled to labeled data. As a result, most semi-supervised models can still easily overfit on the very limited labeled data, even though unlabeled data is abundant.

To overcome these limitations, in this work we introduce a new data augmentation method, called TMix (Section 3), inspired by the recent success of Mixup Gururangan et al. (2019); Berthelot et al. (2019) on image classification. TMix, as shown in Figure 1, takes in two text instances and interpolates them in their corresponding hidden space. Since the combination is continuous, TMix has the potential to create an infinite amount of new augmented data samples, and can thus substantially reduce overfitting. Based on TMix, we then introduce a new semi-supervised learning method for text classification called MixText (Section 4) to explicitly model the relationships between labeled and unlabeled samples, thus overcoming the limitations of previous semi-supervised models stated above. In a nutshell, MixText first guesses low-entropy labels for unlabeled data, then uses TMix to interpolate the labeled and unlabeled data. MixText can facilitate mining implicit relations between sentences by encouraging models to behave linearly in-between training examples, and utilizes information from unlabeled sentences while learning on labeled sentences. Meanwhile, MixText exploits several semi-supervised learning techniques to further utilize unlabeled data, including self-target-prediction Laine and Aila (2016), entropy minimization Grandvalet and Bengio (2004), and consistency regularization Berthelot et al. (2019); Xie et al. (2019) after back translations.

To demonstrate the effectiveness of our method, we conducted experiments (Section 5) on four benchmark text classification datasets and compared our method with previous state-of-the-art semi-supervised methods, including those built upon models pre-trained with large amounts of unlabeled data, in terms of accuracy on test sets. We further performed ablation studies to demonstrate each component’s influence on the models’ final performance. Results show that our MixText method significantly outperforms baselines, especially when the given labeled training data is extremely limited.

Related Work

The pre-training and fine-tuning framework has achieved huge success on NLP applications in recent years, and has been applied to a variety of NLP tasks Radford et al. (2018); Chen et al. (2019); Akbik et al. (2019). Howard and Ruder (2018) proposed to pre-train a language model on a large general-domain corpus and fine-tune it on the target task using some novel techniques like discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. In this manner, such pre-trained models show excellent performance even with small amounts of labeled data. Pre-training methods are often designed with different objectives such as language modeling Peters et al. (2018); Howard and Ruder (2018); Yang et al. (2019b) and masked language modeling Devlin et al. (2019); Lample and Conneau (2019). Their performances are also improved with training larger models on more data Yang et al. (2019b); Liu et al. (2019).

2 Semi-Supervised Learning on Text Data

Semi-supervised learning has received much attention in the NLP community Gururangan et al. (2019); Clark et al. (2018); Yang et al. (2015), as unlabeled data is often plentiful compared to labeled data. For instance, Gururangan et al. (2019); Chen et al. (2018); Yang et al. (2017) leveraged variational auto encoders (VAEs) in a form of sequence-to-sequence modeling on text classification and sequential labeling. Miyato et al. (2017) applied adversarial and virtual adversarial training to the text domain by adding perturbations to the word embeddings. Yang et al. (2019a) took advantage of hierarchy structures to utilize supervision from higher level labels to lower level labels. Xie et al. (2019) exploited consistency regularization on unlabeled data after back translations and tf-idf word replacements. Clark et al. (2018) proposed cross-view training for unlabeled data, where they used auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) and match the predictions of the full model seeing the whole input.

3 Interpolation-based Regularizers

Interpolation-based regularizers (e.g., Mixup) have recently been proposed for supervised learning Zhang et al. (2017); Verma et al. (2019a) and semi-supervised learning Berthelot et al. (2019); Verma et al. (2019b) on image data. They overlay two input images and combine the image labels to form virtual training data, and have achieved state-of-the-art performance across a variety of tasks, such as image classification, and network architectures. Different variants of mixing methods have also been designed, such as performing interpolations in the input space Zhang et al. (2017), combining interpolations and cutoff Yun et al. (2019), and performing interpolations in the hidden space representations Verma et al. (2019a, c). However, such interpolation techniques have not been explored in the NLP field because the input space of text is mostly discrete, i.e., one-hot vectors instead of continuous RGB values as in images, and text is generally more complex in structure.

4 Data Augmentations for Text

When labeled data is limited, data augmentation has been a useful technique to increase the amount of training data. For instance, in computer vision, images are shifted, zoomed in/out, rotated, flipped, distorted, or shaded with a hue Perez and Wang (2017) for training data augmentation. However, it is relatively challenging to augment text data because of its complex syntactic and semantic structures. Recently, Wei and Zou (2019) utilized synonym replacement, random insertion, random swap and random deletion for text data augmentation. Similarly, Kumar et al. (2019) proposed a new paraphrasing formulation in terms of monotone sub-modular function maximization to obtain highly diverse paraphrases, and Xie et al. (2019) and Chen et al. (2020) applied back translations Sennrich et al. (2015) and word replacement to generate paraphrases on unlabeled data for consistency training. Other work has also investigated noise and its incorporation into semi-supervised named entity classification Lakshmi Narayan et al. (2019); Nagesh and Surdeanu (2018).

TMix

In this section, we extend Mixup, a data augmentation method originally proposed by Zhang et al. (2017) for images, to text modeling. The main idea of Mixup is very simple: given two labeled data points $(\mathbf{x}_{i},\mathbf{y}_{i})$ and $(\mathbf{x}_{j},\mathbf{y}_{j})$, where $\mathbf{x}$ can be an image and $\mathbf{y}$ is the one-hot representation of the label, the algorithm creates virtual training samples by linear interpolation:

$$\tilde{\mathbf{x}}=\text{mix}(\mathbf{x}_{i},\mathbf{x}_{j})=\lambda\mathbf{x}_{i}+(1-\lambda)\mathbf{x}_{j}, \tag{1}$$

$$\tilde{\mathbf{y}}=\text{mix}(\mathbf{y}_{i},\mathbf{y}_{j})=\lambda\mathbf{y}_{i}+(1-\lambda)\mathbf{y}_{j}, \tag{2}$$

where $\lambda\in[0,1]$. The new virtual training samples are used to train a neural network model. Mixup can be interpreted in different ways. On one hand, Mixup can be viewed as a data augmentation approach which creates new data samples based on the original training set. On the other hand, it enforces a regularization on the model to behave linearly among the training data. Mixup was demonstrated to work well on continuous image data Zhang et al. (2017). However, extending it to text seems challenging since it is infeasible to compute the interpolation of discrete tokens.
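As a small illustration of Equations 1 and 2, the sketch below interpolates two toy inputs and their one-hot labels with the same $\lambda$. The helper name `mix`, the toy vectors and the value of `lam` are assumptions chosen for readability, not part of the original method description.

```python
def mix(a, b, lam):
    """Linear interpolation lam * a + (1 - lam) * b, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

# Two toy continuous inputs (e.g. flattened pixel values) and one-hot labels.
x_i, y_i = [0.0, 1.0, 2.0], [1.0, 0.0]
x_j, y_j = [4.0, 1.0, 0.0], [0.0, 1.0]

lam = 0.75
x_tilde = mix(x_i, x_j, lam)  # virtual input, Equation 1
y_tilde = mix(y_i, y_j, lam)  # soft label, Equation 2; still sums to 1
print(x_tilde, y_tilde)
```

The same weight is applied to inputs and labels, so the soft label records how much of each original sample went into the virtual one.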

To this end, we propose a novel method to overcome this challenge: interpolation in textual hidden space. Given a sentence, we often use a multi-layer model like BERT Devlin et al. (2019) to encode the sentence into semantic representations, based on which final predictions are made. Some prior work Bowman et al. (2016) has shown that decoding from an interpolation of two hidden vectors generates a new sentence with the mixed meaning of the two original sentences. Motivated by this, we propose to apply interpolations within hidden space as a data augmentation method for text. For an encoder with $L$ layers, we choose to mix up the hidden representation at the $m$-th layer, $m\in[0,L]$.

As demonstrated in Figure 1, we first compute the hidden representations of two text samples separately in the bottom layers. Then we mix up the hidden representations at layer $m$, and feed the interpolated hidden representations to the upper layers. Mathematically, denote the $l$-th layer in the encoder network as $g_{l}(.;\bm{\theta})$, so that the hidden representation of the $l$-th layer can be computed as $\mathbf{h}_{l}=g_{l}(\mathbf{h}_{l-1};\bm{\theta})$. For two text samples $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$, define the $0$-th layer as the embedding layer, i.e., $\mathbf{h}_{0}^{i}=\mathbf{W}_{\text{E}}\mathbf{x}_{i}$, $\mathbf{h}_{0}^{j}=\mathbf{W}_{\text{E}}\mathbf{x}_{j}$; then the hidden representations of the two samples from the lower layers are:

$$\mathbf{h}_{l}^{i}=g_{l}(\mathbf{h}_{l-1}^{i};\bm{\theta}),\quad l\in[1,m],$$

$$\mathbf{h}_{l}^{j}=g_{l}(\mathbf{h}_{l-1}^{j};\bm{\theta}),\quad l\in[1,m].$$

The mixup at the $m$-th layer and the continued forward pass through the upper layers are defined as:

$$\tilde{\mathbf{h}}_{m}=\lambda\mathbf{h}_{m}^{i}+(1-\lambda)\mathbf{h}_{m}^{j},$$

$$\tilde{\mathbf{h}}_{l}=g_{l}(\tilde{\mathbf{h}}_{l-1};\bm{\theta}),\quad l\in[m+1,L].$$

We call the above method TMix and define the new mixup operation as the whole process to get $\tilde{\mathbf{h}}_{L}$:

$$\text{TMix}(\mathbf{x}_{i},\mathbf{x}_{j};g(.;\bm{\theta}),\lambda,m)=\tilde{\mathbf{h}}_{L}.$$

By using an encoder model $g(.;\bm{\theta})$, TMix interpolates textual semantic hidden representations as a type of data augmentation. In contrast with Mixup defined in the data space in Equation 1, TMix depends on an encoder function, and hence defines a much broader scope for computing interpolations. For ease of notation, we drop the explicit dependence on $g(.;\bm{\theta})$, $\lambda$ and $m$ and denote it simply as $\text{TMix}(\mathbf{x}_{i},\mathbf{x}_{j})$ in the following sections.
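The layer-wise procedure can be sketched with a toy encoder. In the block below, `make_layer` is a hypothetical stand-in for an encoder layer $g_{l}$ (a simple element-wise tanh map rather than a Transformer block), and the vectors play the role of the embedding-layer outputs $\mathbf{h}_{0}^{i}$ and $\mathbf{h}_{0}^{j}$; only the control flow is meant to mirror TMix.

```python
import math

def make_layer(scale, shift):
    """Hypothetical stand-in for an encoder layer g_l: an element-wise tanh map."""
    return lambda h: [math.tanh(scale * v + shift) for v in h]

def run_layers(h, layers):
    """Apply the given layers to h in order."""
    for g in layers:
        h = g(h)
    return h

def tmix(h0_i, h0_j, layers, lam, m):
    """Interpolate two samples at layer m, then continue through layers m+1..L.

    h0_i and h0_j are layer-0 (embedding) representations; m = 0 mixes the
    embeddings themselves and m = len(layers) mixes the final outputs.
    """
    if not 0 <= m <= len(layers):
        raise ValueError("m must lie in [0, L]")
    h_i = run_layers(h0_i, layers[:m])    # lower layers, sample i
    h_j = run_layers(h0_j, layers[:m])    # lower layers, sample j
    h_mix = [lam * a + (1.0 - lam) * b for a, b in zip(h_i, h_j)]
    return run_layers(h_mix, layers[m:])  # upper layers on the mixture

layers = [make_layer(0.5 + 0.1 * l, 0.0) for l in range(4)]  # toy encoder, L = 4
h0_i, h0_j = [1.0, -2.0, 0.5], [0.0, 3.0, -1.0]
h_tilde_L = tmix(h0_i, h0_j, layers, lam=0.7, m=2)
print(h_tilde_L)
```

With `lam=1.0` the function reduces to an ordinary forward pass on the first sample, which is a convenient sanity check when wiring the same idea into a real encoder.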

In our experiments, we sample the mix parameter $\lambda$ from a Beta distribution for every batch to perform the interpolation:

$$\lambda\sim\text{Beta}(\alpha,\alpha),$$

$$\lambda=\max(\lambda,1-\lambda),$$

in which $\alpha$ is the hyper-parameter to control the distribution of $\lambda$. In TMix, we mix the labels in the same way as Equation 2 and then use the pairs $(\tilde{\mathbf{h}}_{L},\tilde{\mathbf{y}})$ as inputs for downstream applications.

Verma et al. (2019a) perform mixup at randomly chosen layers; which layer of the hidden representations to mix up is, however, an interesting question to investigate. In our experiments, we use 12-layer BERT-base Devlin et al. (2019) as our encoder model. Recent work Jawahar et al. (2019) has studied what BERT learns at different layers. Specifically, the authors found that layers {3,4,5,6,7,9,12} have the most representation power in BERT and that each layer captures different types of information, ranging from surface to syntactic to semantic level representations of text. For instance, the 9th layer has predictive power in semantic tasks like checking random swapping of coordinated clausal conjuncts, while the 3rd layer performs best in surface tasks like predicting sentence length.

Building on those findings, we choose the layers that contain both syntactic and semantic information as our mixing layers, namely $\mathbf{M}=\{7,9,12\}$. For every batch, we randomly sample $m$, the layer at which to mix up representations, from the set $\mathbf{M}$ for computing the interpolation. We also performed an ablation study in Section 5.5 to show how TMix’s performance changes with different choices of mix layer sets.
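The per-batch sampling of the mixing weight and the mixing layer can be sketched as follows. The function name `sample_mix_config` and the fixed seed are illustrative assumptions; the sketch also folds $\lambda$ into $[0.5, 1]$ so that the first sample of each pair always dominates the mixture.

```python
import random

MIX_LAYERS = (7, 9, 12)  # the layer set M chosen in the text

def sample_mix_config(alpha, rng):
    """Draw (lam, m) for one batch: lam from Beta(alpha, alpha), m from M."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # keep the first sample of each pair dominant
    m = rng.choice(MIX_LAYERS)
    return lam, m

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
configs = [sample_mix_config(alpha=0.4, rng=rng) for _ in range(4)]
for lam, m in configs:
    print(f"lambda={lam:.3f}, mix layer={m}")
```

Drawing one pair per batch, rather than per example, keeps every sample in a batch on the same computation path through the encoder.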

Note that TMix provides a general approach to augment text data, and hence can be applied to any downstream task. In this paper, we focus on text classification and leave other applications as potential future work. In text classification, we minimize the KL-divergence between the mixed labels and the probability from the classifier as the supervision loss:

$$L_{\text{TMix}}=\text{KL}\left[\text{mix}(\mathbf{y}_{i},\mathbf{y}_{j})\,\|\,p(\text{TMix}(\mathbf{x}_{i},\mathbf{x}_{j});\bm{\phi})\right],$$

where $p(.;\bm{\phi})$ is a classifier on top of the encoder model. In our experiments, we implement the classifier as a two-layer MLP, which takes the mixed representation $\text{TMix}(\mathbf{x}_{i},\mathbf{x}_{j})$ as input and returns a probability vector. We jointly optimize over the encoder parameters $\bm{\theta}$ and the classifier parameters $\bm{\phi}$ to train the whole model.
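A minimal sketch of this supervision loss is given below. The logits stand in for the output of the classifier on a mixed representation and are hypothetical values; the helper names `softmax` and `kl_divergence` are likewise assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two probability vectors; 0 * log 0 is taken as 0."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, pred) if t > 0.0)

# Mixed label (Equation 2 with lam = 0.75) and hypothetical classifier logits.
y_mix = [0.75, 0.25, 0.0]
logits = [2.0, 0.5, -1.0]
loss = kl_divergence(y_mix, softmax(logits))
print(round(loss, 4))
```

Because the target is a soft label, the loss is zero only when the classifier reproduces the mixing proportions, not when it predicts either original class with full confidence.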

Semi-supervised MixText

In this section, we demonstrate how to utilize TMix to help semi-supervised learning. We are given a limited labeled text set $\mathbf{X}_{l}=\{\mathbf{x}^{l}_{1},...,\mathbf{x}^{l}_{n}\}$ with labels $\mathbf{Y}_{l}=\{\mathbf{y}^{l}_{1},...,\mathbf{y}^{l}_{n}\}$ and a large unlabeled set $\mathbf{X}_{u}=\{\mathbf{x}^{u}_{1},...,\mathbf{x}^{u}_{m}\}$, where $n$ and $m$ are the numbers of data points in each set, $\mathbf{y}_{i}^{l}\in\{0,1\}^{C}$ is a one-hot vector, and $C$ is the number of classes. Our goal is to learn a classifier that efficiently utilizes both labeled data and unlabeled data.

We propose a new semi-supervised learning framework for text called MixText (note that MixText is a semi-supervised learning framework, while TMix is a data augmentation approach). The core idea behind our framework is to leverage TMix on both labeled and unlabeled data for semi-supervised learning. To fulfill this goal, we come up with a label guessing method to generate labels for the unlabeled data in the training process. With the guessed labels, we can treat the unlabeled data as additional labeled data and perform TMix for training. Moreover, we combine TMix with additional data augmentation techniques to generate large amounts of augmented data, which is a key component that makes our algorithm work well in settings with extremely limited supervision. Finally, we introduce an entropy minimization loss that encourages the model to assign sharp probabilities on unlabeled data samples, which further helps to boost performance when the number of classes $C$ is large. The overall architecture is shown in Figure 2. We explain each component in detail below.

Back translation Edunov et al. (2018) is a common data augmentation technique and can generate diverse paraphrases while preserving the semantics of the original sentences. We utilize back translations to paraphrase the unlabeled data. For each $\mathbf{x}_{i}^{u}$ in the unlabeled text set $\mathbf{X}_{u}$, we generate $K$ augmentations $\mathbf{x}_{i,k}^{a}=\text{augment}_{k}(\mathbf{x}_{i}^{u}),k\in[1,K]$ by back translations with different intermediate languages. For example, we can translate original sentences from English to German and then translate them back to get the paraphrases. In the augmented text generation, we employ random sampling with a tunable temperature instead of beam search to ensure diversity. The augmentations are then used for generating labels for the unlabeled data, which we describe below.

2 Label Guessing

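For an unlabeled data sample $\mathbf{x}_{i}^{u}$ and its $K$ augmentations $\mathbf{x}_{i,k}^{a}$, we generate the label for them using the weighted average of the predicted results from the current model:

$$\mathbf{y}_{i}^{u}=\frac{1}{w_{ori}+\sum_{k}w_{k}}\left(w_{ori}\,p(\mathbf{x}_{i}^{u})+\sum_{k=1}^{K}w_{k}\,p(\mathbf{x}_{i,k}^{a})\right).$$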

Note that $\mathbf{y}^{u}_{i}$ is a probability vector. We expect the model to predict consistent labels for different augmentations. Hence, to enforce the constraint, we use the weighted average of all predictions, rather than the prediction of any single data sample, as the generated label. Moreover, by explicitly introducing the weights $w_{ori}$ and $w_{k}$, we can control the contributions of augmentations of different quality to the generated labels. Our label guessing method improves over Tarvainen and Valpola (2017), which utilizes teacher and student models to predict labels for unlabeled data, and UDA Xie et al. (2019), which just uses $p(\mathbf{x}^{u}_{i})$ as generated labels.

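To avoid the weighted average being too uniform, we utilize a sharpening function over predicted labels. Given a temperature hyper-parameter $T$:

$$\hat{\mathbf{y}}_{i}^{u}=\frac{(\mathbf{y}_{i}^{u})^{\frac{1}{T}}}{\|(\mathbf{y}_{i}^{u})^{\frac{1}{T}}\|_{1}},$$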

where $\|.\|_{1}$ is the $l_{1}$-norm of the vector. When $T\to 0$, the generated label becomes a one-hot vector.
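Label guessing and sharpening can be sketched together as follows. The prediction vectors and weights are made-up values, and the function names `guess_label` and `sharpen` are assumptions for the example; in practice the predictions would come from the current model applied to the original sentence and its back-translated versions.

```python
def guess_label(p_orig, p_augs, w_orig=1.0, w_augs=None):
    """Weighted average of the prediction on x and on its K augmentations."""
    if w_augs is None:
        w_augs = [1.0] * len(p_augs)
    total_w = w_orig + sum(w_augs)
    avg = [w_orig * p for p in p_orig]
    for w, p_aug in zip(w_augs, p_augs):
        avg = [a + w * p for a, p in zip(avg, p_aug)]
    return [a / total_w for a in avg]

def sharpen(y, T):
    """Raise each probability to 1/T and renormalize with the l1-norm."""
    powered = [p ** (1.0 / T) for p in y]
    norm = sum(powered)
    return [p / norm for p in powered]

p_orig = [0.6, 0.3, 0.1]                     # prediction on the original text
p_augs = [[0.5, 0.4, 0.1], [0.7, 0.2, 0.1]]  # predictions on K = 2 augmentations
y_u = guess_label(p_orig, p_augs, w_orig=1.0, w_augs=[1.0, 0.5])
y_sharp = sharpen(y_u, T=0.5)
print(y_u, y_sharp)
```

With $T=1$ the sharpening step leaves the label unchanged, and smaller temperatures push more mass onto the most likely class.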

3 TMix on Labeled and Unlabeled Data

After getting the labels for unlabeled data, we merge the labeled text $\mathbf{X}_{l}$, unlabeled text $\mathbf{X}_{u}$ and unlabeled augmentation text $\mathbf{X}_{a}=\{\mathbf{x}^{a}_{i,k}\}$ together to form a super set $\mathbf{X}=\mathbf{X}_{l}\cup\mathbf{X}_{u}\cup\mathbf{X}_{a}$. The corresponding labels are $\mathbf{Y}=\mathbf{Y}_{l}\cup\mathbf{Y}_{u}\cup\mathbf{Y}_{a}$, where $\mathbf{Y}_{a}=\{\mathbf{y}^{a}_{i,k}\}$ and we define $\mathbf{y}^{a}_{i,k}=\mathbf{y}^{u}_{i}$, i.e., all augmented samples share the same generated label as the original unlabeled sample.

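In training, we randomly sample two data points $\mathbf{x},\mathbf{x}^{\prime}\in\mathbf{X}$, then we compute $\text{TMix}(\mathbf{x},\mathbf{x}^{\prime})$ and $\text{mix}(\mathbf{y},\mathbf{y}^{\prime})$, and use the KL-divergence as the loss:

$$L_{\text{TMix}}=\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\in\mathbf{X}}\,\text{KL}\left[\text{mix}(\mathbf{y},\mathbf{y}^{\prime})\,\|\,p(\text{TMix}(\mathbf{x},\mathbf{x}^{\prime});\bm{\phi})\right].$$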

Since $\mathbf{x},\mathbf{x}^{\prime}$ are randomly sampled from $\mathbf{X}$, we interpolate text from many different categories: mixup among labeled data, mixup of labeled and unlabeled data, and mixup among unlabeled data. Based on the categories of the samples, the loss can be divided into two types:

When $\mathbf{x}\in\mathbf{X}_{l}$, most of the information we are actually using comes from the labeled data, hence the model is trained with a supervised loss.

When the samples are from the unlabeled or augmentation set, i.e., $\mathbf{x}\in\mathbf{X}_{u}\cup\mathbf{X}_{a}$, most of the information comes from unlabeled data, and the KL-divergence acts as a consistency loss, constraining augmented samples to have the same labels as the original data sample.
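The two cases above can be illustrated with a toy pool. Everything in the block is hypothetical: the example sentences, the guessed labels, and the helper names `build_pool` and `loss_type`. It only shows how merging the three sets and sampling a pair determines which role the same KL term plays.

```python
import random

def build_pool(labeled, unlabeled, augmented):
    """Merge X_l, X_u and X_a into one pool of (text, label, source) triples."""
    pool = [(x, y, "labeled") for x, y in labeled]
    pool += [(x, y, "unlabeled") for x, y in unlabeled]
    pool += [(x, y, "augmented") for x, y in augmented]
    return pool

def loss_type(source_of_first):
    """Name the role of the KL term from the set the first sample came from."""
    return "supervised" if source_of_first == "labeled" else "consistency"

# Toy two-class data; unlabeled and augmented items carry guessed labels.
labeled = [("good movie", [1.0, 0.0]), ("bad movie", [0.0, 1.0])]
unlabeled = [("not bad at all", [0.7, 0.3])]
augmented = [("really not bad", [0.7, 0.3])]  # shares the label of its source

pool = build_pool(labeled, unlabeled, augmented)
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
(x, y, src), (x2, y2, src2) = rng.sample(pool, 2)
print(src, src2, "->", loss_type(src))
```

The same loss function is used in both cases; only the origin of the dominant sample changes how it should be interpreted.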

4 Entropy Minimization

To encourage the model to produce confident labels on unlabeled data, we propose to minimize the entropy of prediction probability on unlabeled data as a self-training loss:

where $\gamma$ is the margin hyper-parameter. We minimize the entropy of the probability vector if it is larger than $\gamma$.
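One literal reading of the sentence above is a hinge on the Shannon entropy, sketched below. This is an assumption made for illustration and may differ from the exact functional form of the loss used in the experiments; the function names and the value of `gamma` are likewise hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy of a probability vector, with 0 * log 0 taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def margin_entropy_loss(preds, gamma):
    """Mean of max(0, H(p) - gamma) over a batch of probability vectors.

    Assumed form: predictions whose entropy is already below the margin
    gamma contribute nothing; the others are pushed toward lower entropy.
    """
    return sum(max(0.0, entropy(p) - gamma) for p in preds) / len(preds)

preds = [[0.98, 0.01, 0.01],  # confident prediction, below the margin
         [0.4, 0.3, 0.3]]     # uncertain prediction, above the margin
loss = margin_entropy_loss(preds, gamma=0.7)
print(round(loss, 4))
```

Only the second prediction contributes here, which is the behavior described in the text: confident predictions are left alone and uncertain ones are penalized.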

Combining the two losses, we get the overall objective function of MixText:

Experiments

We performed experiments with four English text classification benchmark datasets: AG News Zhang et al. (2015), DBpedia Mendes et al. (2012), Yahoo! Answers Chang et al. (2008) and IMDB Maas et al. (2011). We used the original test set as our test set and randomly sampled from the training set to form the unlabeled training set and the development set. The dataset statistics and split information are presented in Table 1.

For unlabeled data, we selected German and Russian as intermediate languages for back translations using FairSeq (https://github.com/pytorch/fairseq), and the random sampling temperature was 0.9. As an example, for a news item from the AG News dataset, “Oil prices rallied to a record high above $55 a barrel on Friday on rising fears of a winter fuel supply crunch and robust economic growth in China, the world’s number two user”, the augmented texts through German and Russian are “Oil prices surged to a record high above $55 a barrel on Friday on growing fears of a winter slump and robust economic growth in world No.2 China” and “Oil prices soared to record highs above $55 per barrel on Friday amid growing fears over a winter reduction in U.S. oil inventories and robust economic growth in China, the world’s second-biggest oil consumer”.

2 Baselines

To test the effectiveness of our method, we compared it with several recent models:

VAMPIRE Gururangan et al. (2019): VAriational Methods for Pretraining In Resource-limited Environments (VAMPIRE) pretrained a unigram document model as a variational autoencoder on in-domain, unlabeled data and used its internal states as features in a downstream classifier.

BERT Devlin et al. (2019): We used the pre-trained BERT-base-uncased model (https://pypi.org/project/pytorch-transformers/) and fine-tuned it for classification. In detail, we used average pooling over the output of the BERT encoder and the same two-layer MLP as used in MixText to predict the labels.

UDA Xie et al. (2019): Since we do not have access to TPUs and need to use a smaller amount of unlabeled data, we implemented Unsupervised Data Augmentation (UDA) in PyTorch ourselves. Specifically, we used the same BERT-base-uncased model, unlabeled augmented data and batch size as in MixText, used the original unlabeled data to predict the labels with the same softmax sharpening temperature as in MixText, and computed the consistency loss on the augmented unlabeled data.

3 Model Settings

We used the BERT-base-uncased tokenizer to tokenize the text and the BERT-base-uncased model as our text encoder, and applied average pooling over the output of the encoder followed by a two-layer MLP with a hidden size of 128 and $\tanh$ as its activation function to predict the labels. The maximum sentence length is set to 256; for sentences that exceed the limit, we keep the first 256 tokens. The learning rate is 1e-5 for the BERT encoder and 1e-3 for the MLP. For $\alpha$ in the Beta distribution, generally, when there are fewer than 100 labeled examples per class, $\alpha$ is set to 2 or 16, as a larger $\alpha$ is more likely to generate $\lambda$ around 0.5, thus creating “newer” data as data augmentations; when there are more than 200 labeled examples per class, $\alpha$ is set to 0.2 or 0.4, as a smaller $\alpha$ is more likely to generate $\lambda$ around 0.1, thus creating “similar” data, which acts as noise regularization.
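The effect of $\alpha$ can be quantified with the closed-form standard deviation of a symmetric Beta distribution, $\sqrt{1/(4(2\alpha+1))}$, which follows from the general Beta variance formula. The sketch below evaluates it for the values of $\alpha$ mentioned above; the function name is an assumption, and the numbers describe the raw Beta draw before any further processing of $\lambda$.

```python
def beta_symmetric_std(alpha):
    """Standard deviation of Beta(alpha, alpha); its mean is always 0.5."""
    return (1.0 / (4.0 * (2.0 * alpha + 1.0))) ** 0.5

for alpha in (0.2, 0.4, 2.0, 16.0):
    print(f"alpha={alpha}: std of lambda = {beta_symmetric_std(alpha):.3f}")
```

A large $\alpha$ gives a small spread, so draws cluster near 0.5 and the two samples are blended almost evenly; a small $\alpha$ gives a large spread, so draws fall near the ends of the interval and one sample dominates the mixture.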

For TMix, we only utilize the labeled dataset, with the same settings as the BERT baseline, and set the batch size to 8. In MixText, we utilize both labeled data and unlabeled data for training, using the same settings as in UDA. We set $K=2$, i.e., for each unlabeled example we perform two augmentations, specifically through German and Russian. The batch size is 4 for labeled data and 8 for unlabeled data. We use 0.5 as a starting point to tune the temperature $T$. In our experiments, we set it to 0.3 for AG News, 0.5 for DBpedia and Yahoo! Answers, and 1 for IMDB.

4 Results

We evaluated our baselines and proposed methods using accuracy, with 5000 unlabeled data and with different amounts of labeled data per class, ranging from 10 to 10000 (5000 for IMDB).

The results on different text classification datasets are shown in Table 2 and Figure 3. All transformer-based models (BERT, TMix, UDA and MixText) showed better performance compared to VAMPIRE, since larger models were adopted. TMix outperformed BERT, especially when labeled data was limited, e.g., 10 per class. For instance, model accuracy improved from 69.5% to 74.1% on AG News with 10 labeled data per class, demonstrating the effectiveness of TMix. UDA, which introduces unlabeled data, outperformed TMix (e.g., 63.2% versus 58.6% on Yahoo! Answers with 10 labeled data per class), because more data was used and a consistency regularization loss was added. Our proposed MixText consistently demonstrated the best performance when compared to the baseline models across the four datasets, as MixText not only incorporated unlabeled data and utilized implicit relations between labeled and unlabeled data via TMix, but also had better label guessing on unlabeled data through the weighted average among augmented and original sentences.

4.2 Varying the Number of Unlabeled Data

We also conducted experiments to test our model’s performance with 10 labeled data and different amounts of unlabeled data (from 0 to 10000) on AG News and Yahoo! Answers, shown in Figure 4. With more unlabeled data, the accuracy became much higher on both AG News and Yahoo! Answers, which further validated the usefulness of unlabeled data.

4.3 Loss on Development Set

To explore whether our methods can avoid overfitting when given limited labeled data, we plotted the losses on the development set during training on IMDB and Yahoo! Answers with 200 labeled data per class in Figure 5. We found that the loss on the development sets tended to increase substantially within around 10 epochs for BERT, indicating that the model overfitted on the training set. Although UDA can alleviate the overfitting problem with consistency regularization, TMix and MixText showed more stable trends and consistently lower loss. The loss curve for TMix also indicated that it can help with overfitting even without extra data.

5 Ablation Studies

We performed ablation studies to show the effectiveness of each component in MixText.

We explored different mixup layer sets $\mathbf{M}$ for TMix, and the results are shown in Table 3. Based on Jawahar et al. (2019), layers {3,4,5,6,7,9,12} are the most informative layers in the BERT-based model and each of them captures different types of information (e.g., surface, syntactic, or semantic). We chose to mix up using different subsets of those layers to see which subsets gave the optimal performance. When no mixup was performed, our model accuracy was 69.5%. If we only mixed up at the input and lower layers ({0, 1, 2}), there seemed to be no performance increase. When mixing up using different layer sets (e.g., {3,4} or {6,7,9}), we found large differences in model performance: {3,4}, which mainly contains surface information like sentence length, does not help text classification much, thus showing weaker performance. The 6th layer captures the depth of the syntactic tree, which also does not help much in classification. Our model achieved the best performance at {7, 9, 12}; this layer subset contains most of the syntactic and semantic information, such as the sequence of top level constituents in the syntax tree, the object number in the main clause, sensitivity to word order, and sensitivity to random replacement of a noun/verb.

5.2 Remove Different Parts from MixText

We also measured the performance of MixText when removing one component at a time and report the results in Table 4. We observed that performance drops after removing each part, suggesting that all components in MixText contribute to the final performance. The model performance decreased most significantly after removing unlabeled data, which is as expected. Compared to removing the weighted average prediction for unlabeled data, the decrease from removing TMix was larger, indicating that TMix has the largest impact other than unlabeled data, which further supports the effectiveness of our proposed Text Mixup, an interpolation-based regularization and augmentation technique.

Conclusion

To alleviate the dependence of supervised models on labeled data, this work presented a simple but effective semi-supervised learning method, MixText, for text classification, in which we also introduced TMix, an interpolation-based augmentation and regularization technique. Through experiments on four benchmark text classification datasets, we demonstrated the effectiveness of our proposed TMix technique and the MixText model, which achieve better test accuracy and a more stable loss trend compared with current pre-training and fine-tuning models and other state-of-the-art semi-supervised learning methods. As future work, we plan to explore the effectiveness of MixText in other NLP tasks, such as sequential labeling, and in other real-world scenarios with limited labeled data.

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments, and Chao Zhang for his early feedback. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research. DY is supported in part by a grant from Google.

References