KATE: K-Competitive Autoencoder for Text

Yu Chen, Mohammed J. Zaki

Introduction

An autoencoder is a neural network which can automatically learn data representations by trying to reconstruct its input at the output layer. Many variants of autoencoders have been proposed recently. While autoencoders have been successfully applied to learn meaningful representations on image datasets (e.g., MNIST, CIFAR-10), their performance on text datasets has not been widely studied. Traditional autoencoders are susceptible to learning trivial representations for text documents. As noted by Zhai and Zhang, the reasons include the fact that textual data is extremely high dimensional and sparse. The vocabulary size can be in the hundreds of thousands, while the average fraction of zero entries in the document vectors can be very high (e.g., 98%). Further, textual data typically follows power-law word distributions. That is, low-frequency words account for most of the word occurrences. Traditional autoencoders always try to reconstruct each dimension of the input vector on an equal footing, which is not quite appropriate for textual data.

Document representation is an interesting and challenging task which is concerned with representing textual documents in a vector space, and it has various applications in text processing, retrieval and mining. There are two major approaches to represent documents: 1) Distributional Representation is based on the hypothesis that linguistic terms with similar distributions have similar meanings. These methods usually take advantage of the co-occurrence and context information of words and documents, and each dimension of the document vector usually represents a specific semantic meaning (e.g., a topic). Typical models in this category include Latent Semantic Analysis (LSA) , probabilistic LSA (pLSA) and Latent Dirichlet Allocation (LDA) . 2) Distributed Representations encode a document as a compact, dense and lower dimensional vector with the semantic meaning of the document distributed along the dimensions of the vector. Many neural network-based distributed representation models have been proposed and shown to be able to learn better representations of documents than distributional representation models.

In this paper, we try to overcome the weaknesses of traditional autoencoders when applied to textual data. We propose a novel autoencoder called KATE (for K-competitive Autoencoder for TExt), which relies on competitive learning among the autoencoding neurons. In the feedforward phase, only the most competitive $k$ neurons in the layer fire and those $k$ “winners” further incorporate the aggregate activation potential of the remaining inactive neurons. As a result, each hidden neuron becomes better at recognizing specific data patterns and the overall model can learn meaningful representations of the input data. After training the model, each hidden neuron is distinct from the others and no competition is needed in the testing/encoding phase. We conduct comprehensive experiments qualitatively and quantitatively to evaluate KATE and to demonstrate the effectiveness of our model. We compare KATE with traditional autoencoders including basic autoencoder, denoising autoencoder , contractive autoencoder , variational autoencoder , and k-sparse autoencoder . We also compare with deep generative models , neural autoregressive and variational inference models, probabilistic topic models such as LDA , and word representation models such as Word2Vec and Doc2Vec . KATE achieves state-of-the-art performance across various datasets on several downstream tasks like document classification, regression and retrieval.

Related Work

Autoencoders. The basic autoencoder is a shallow neural network which tries to reconstruct its input at the output layer. An autoencoder consists of an encoder which maps the input $\boldsymbol{x}$ to the hidden layer: $\boldsymbol{z}=g(\boldsymbol{W}\boldsymbol{x}+\boldsymbol{b})$ and a decoder which reconstructs the input as: $\hat{\boldsymbol{x}}=o(\boldsymbol{W}^{\prime}\boldsymbol{z}+\boldsymbol{c})$; here $\boldsymbol{b}$ and $\boldsymbol{c}$ are bias terms, $\boldsymbol{W}$ and $\boldsymbol{W}^{\prime}$ are input-to-hidden and hidden-to-output layer weight matrices, and $g$ and $o$ are activation functions. Weight tying (i.e., setting $\boldsymbol{W}^{\prime}=\boldsymbol{W}^{T}$) is often used as a regularization method to avoid overfitting. While plain autoencoders, even with perfect reconstructions, usually only extract trivial representations of the data, more meaningful representations can be obtained by adding appropriate regularization to the models. Following this line of reasoning, many variants of autoencoders have been proposed recently. The denoising autoencoder (DAE) inputs a corrupted version of the data while the output is still compared with the original uncorrupted data, allowing the model to learn patterns useful for denoising. The contractive autoencoder (CAE) introduces the Frobenius norm of the Jacobian matrix of the encoder activations into the regularization term. When the Frobenius norm is 0, the model is extremely invariant to perturbations of input data, which is considered desirable. The variational autoencoder (VAE) is a generative model inspired by variational inference whose encoder $q_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ approximates the intractable true posterior $p_{\theta}(\boldsymbol{z}|\boldsymbol{x})$, and the decoder $p_{\theta}(\boldsymbol{x}|\boldsymbol{z})$ is a data generator. The k-sparse autoencoder (KSAE) explicitly enforces sparsity by only keeping the $k$ highest activities in the feedforward phase.
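The encoder and decoder equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the choice of tanh for $g$ and sigmoid for $o$, the toy dimensions, and the helper names `matvec`, `sigmoid` and `autoencoder_forward` are all assumptions made for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def autoencoder_forward(x, W, b, c):
    """Tied-weight autoencoder: z = g(Wx + b), x_hat = o(W^T z + c)."""
    # Encoder, with g assumed to be tanh.
    z = [math.tanh(a + b_j) for a, b_j in zip(matvec(W, x), b)]
    # Weight tying: the decoder reuses the transpose of W.
    W_T = [list(col) for col in zip(*W)]
    # Decoder, with o assumed to be the sigmoid.
    x_hat = [sigmoid(a + c_i) for a, c_i in zip(matvec(W_T, z), c)]
    return z, x_hat

# Toy sizes chosen for illustration: 4 input dimensions, 2 hidden neurons.
W = [[0.5, -0.2, 0.1, 0.0],
     [0.3, 0.4, -0.1, 0.2]]
b = [0.0, 0.0]
c = [0.0, 0.0, 0.0, 0.0]
x = [1.0, 0.0, 0.5, 0.0]
z, x_hat = autoencoder_forward(x, W, b, c)
```

With weight tying, only $\boldsymbol{W}$, $\boldsymbol{b}$ and $\boldsymbol{c}$ are free parameters, which is why it acts as a regularizer.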

We notice that most of the successful applications of autoencoders are on image data, while only a few have attempted to apply autoencoders on textual data. Zhai and Zhang have argued that traditional autoencoders, which perform well on image data, are less appropriate for modeling textual data due to the problems of high-dimensionality, sparsity and power-law word distributions. They proposed a semi-supervised autoencoder which applies a weighted loss function where the weights are learned by a linear classifier to overcome some of these problems. Kumar and D’Haro found that all the topics extracted from the autoencoder were dominated by the most frequent words due to the sparsity of the input document vectors. Further, they found that adding sparsity and selectivity penalty terms helped alleviate this issue to some extent.

Deep generative models. Deep Belief Networks (DBNs) are probabilistic graphical models which learn to extract a deep hierarchical representation of the data. The top 2 layers of DBNs form a Restricted Boltzmann Machine (RBM) and other layers form a sigmoid belief network. A relatively fast greedy layer-wise pre-training algorithm is applied to train the model. Maaloe et al. showed that DBNs can be competitive as a topic model. DocNADE is a neural autoregressive topic model that estimates the probability of observing a new word in a given document given the previously observed words. It can be used for extracting meaningful representations of documents. It has been shown to outperform the Replicated Softmax model which is a variant of RBMs for document modeling. Srivastava et al. introduced a type of Deep Boltzmann Machine (DBM) that is suitable for extracting distributed semantic representations from a corpus of documents; an Over-Replicated Softmax model was proposed to overcome the apparent difficulty of training a DBM. NVDM is a neural variational inference model for document modeling inspired by the variational autoencoder.

Probabilistic topic models. Probabilistic topic models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) have been extensively studied . Especially for LDA, many variants have been proposed for non-parametric learning , sparsity and efficient inference . Those models typically build a generative probabilistic model using the bag-of-words representation of the documents.

Word representation models. Distributed representations of words in a vector space can capture semantic meanings of words and help achieve better results in various downstream text analysis tasks. Word2Vec and Glove are state-of-the-art word representation models. Pre-training word embeddings on a large corpus of documents and applying learned word embeddings in downstream tasks has been shown to work well in practice . Doc2Vec was inspired by Word2Vec and can directly learn vector representations of paragraphs and documents. NTM , which also uses pre-trained word embeddings, is a neural topic model where the representations of words and documents are combined into a uniform framework.

With this brief overview of existing work, we now turn to our competitive autoencoder approach for text documents.

K-Competitive Autoencoder

Although the objective of an autoencoder is to minimize the reconstruction error, our goal is to extract meaningful features from data. Compared with image data, textual data is more challenging for autoencoders since it is typically high-dimensional, sparse and has power-law word distributions. When examining the features extracted by an autoencoder, we observed that they were not distinct from one another. That is, many neurons in the hidden layer shared similar groups of input neurons (which typically correspond to the most frequent words) with whom they had the strongest connections. We hypothesized that the autoencoder greedily learned relatively trivial features in order to reconstruct the input.

To overcome this drawback, our approach guides the autoencoder to focus on important patterns in the data by adding constraints in the training phase via mutual competition. In competitive learning, neurons compete for the right to respond to a subset of the input data and as a result, the specialization of each neuron in the network is increased. Note that the specialization of neurons is exactly what we want for an autoencoder, especially when applied on textual data. By introducing competition into an autoencoder, we expect each neuron in the hidden layer to take responsibility for recognizing different patterns within the input data. Following this line of reasoning, we propose the k-competitive autoencoder, KATE, as described below.

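Each input document is represented as a log-normalized word count vector $\boldsymbol{x}$, where each dimension is given as

$$x_{i}=\frac{\log(1+n_{i})}{\max_{j\in V}\log(1+n_{j})},\quad i\in V,$$

where $V$ is the vocabulary and $n_{i}$ is the count of word $i$ in that document. Let $\hat{\boldsymbol{x}}$ be the output of KATE on a given input $\boldsymbol{x}$. We use the binary cross-entropy as the loss function, which is defined as

$$l(\boldsymbol{x},\hat{\boldsymbol{x}})=-\sum_{i\in V}\left[x_{i}\log(\hat{x}_{i})+(1-x_{i})\log(1-\hat{x}_{i})\right],$$

where $\hat{x}_{i}$ is the reconstructed value for $x_{i}$.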
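As a rough illustration of the input representation and loss, the sketch below assumes the log-normalization $x_{i}=\log(1+n_{i})/\max_{j}\log(1+n_{j})$ and the standard binary cross-entropy summed over the vocabulary. The function names and the clipping constant `eps` are hypothetical choices for the example.

```python
import math

def log_normalize(counts):
    """Map raw word counts n_i to log(1 + n_i) / max_j log(1 + n_j)."""
    logs = [math.log(1 + n) for n in counts]
    peak = max(logs)
    # An all-zero document has no peak to normalize by; return zeros as-is.
    return [v / peak for v in logs] if peak > 0 else logs

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Sum over the vocabulary of -[x log(x_hat) + (1 - x) log(1 - x_hat)]."""
    total = 0.0
    for x_i, p in zip(x, x_hat):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= x_i * math.log(p) + (1 - x_i) * math.log(1 - p)
    return total

counts = [3, 0, 1, 0, 7]          # toy word counts for one document
x = log_normalize(counts)          # every entry lies in [0, 1]
good = [0.66, 0.01, 0.33, 0.01, 0.99]
bad = [0.5] * 5
loss_good = binary_cross_entropy(x, good)
loss_bad = binary_cross_entropy(x, bad)
```

Because the normalized inputs lie in $[0,1]$, they are compatible with a sigmoid output layer and a cross-entropy loss.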

Let $H$ be some subset of hidden neurons; define the energy of $H$ as the total activation potential for $H$ , given as: $E(H)=\sum_{h_{i}\in H}|z_{i}|$ , i.e., sum of the absolute values of the activations for neurons in $H$ . In KATE, in the feedforward phase, after computing the activations $\boldsymbol{z}$ for a given input $\boldsymbol{x}$ , we select the most competitive $k$ neurons as the “winners” while the remaining “losers” are suppressed (i.e., made inactive). However, in order to compensate for the loss of energy from the loser neurons, and to make the competition among neurons more pronounced, we amplify and reallocate that energy among the winner neurons.

KATE uses tanh activation function for the k-competitive hidden layer. We divide these neurons into positive and negative neurons based on their activations. The most competitive $k$ neurons are those that have the largest absolute activation values. However, we select the $\lceil k/2\rceil$ largest positive activations as the positive winners, and reallocate the energy of the remaining positive loser neurons among the winners using an $\alpha$ amplification connection, where $\alpha$ is a hyperparameter. Finally, we set the activations of all losers to zero. Similarly, the $\lfloor k/2\rfloor$ lowest negative activations are the negative winners, and they incorporate the amplified energy from the negative loser neurons, as detailed in Algorithm 2. We argue that the $\alpha$ amplification connections are a critical component in the k-competitive layer. When $\alpha=0$ , no gradients will flow through loser neurons, resulting in a regular k-sparse autoencoder (regardless of the activation functions and k-selection scheme). When $\alpha>2/k$ , we actually boost the gradient signal flowing through the loser neurons. We empirically show that amplification helps improve the autoencoder model (see Sec. 4.4.1 and 4.5). As an example, consider Figure 1, which shows an example feedforward step for $k=2$ . Here, $h_{1}$ and $h_{6}$ are the positive and negative winners, respectively, since the absolute activation potential for $h_{1}$ is $|z_{1}|=0.8$ , and for $h_{6}$ it is $|z_{6}|=0.6$ . The positive winner $h_{1}$ takes away the energy from the positive losers $h_{2}$ and $h_{3}$ , which is $E(\{h_{2},h_{3}\})=0.2+0.1=0.3$ . Likewise, the negative winner $h_{6}$ takes away the energy from the negative losers $h_{4}$ and $h_{5}$ , which is $-E(\{h_{4},h_{5}\})=-(|-0.1|+|-0.3|)=-0.4$ . The hyperparameter $\alpha$ governs how the energy from the loser neurons is incorporated into the winner neurons, for both positive and negative cases. 
That is, $h_{1}$’s net activation becomes $z_{1}=0.8+0.3\alpha$, and $h_{6}$’s net activation is $z_{6}=-0.6-0.4\alpha$. The rest of the neurons are set to zero activation.
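The feedforward step of the k-competitive layer can be sketched as follows, using the activations from the Figure 1 example. This is an illustrative reading of the description above, not the authors' code: it assumes that every winner receives the full amplified energy of the losers of the same sign, and the function name `k_competitive` is hypothetical.

```python
import math

def k_competitive(z, k, alpha):
    """Forward pass of a k-competitive layer on tanh activations z."""
    out = [0.0] * len(z)  # losers end up with zero activation

    def compete(indices, n_winners):
        # Rank neurons of one sign by absolute activation, strongest first.
        ranked = sorted(indices, key=lambda i: abs(z[i]), reverse=True)
        winners, losers = ranked[:n_winners], ranked[n_winners:]
        # Signed energy of the losers (negative for the negative group).
        energy = sum(z[i] for i in losers)
        for i in winners:
            # Assumption: each winner absorbs the full amplified energy.
            out[i] = z[i] + alpha * energy

    positives = [i for i, v in enumerate(z) if v > 0]
    negatives = [i for i, v in enumerate(z) if v < 0]
    compete(positives, math.ceil(k / 2))  # ceil(k/2) positive winners
    compete(negatives, k // 2)            # floor(k/2) negative winners
    return out

# Activations of h1..h6 from the worked example, with k = 2.
z = [0.8, 0.2, 0.1, -0.1, -0.3, -0.6]
activations = k_competitive(z, k=2, alpha=1.0)
```

With `alpha=1.0`, the first and last entries come out as approximately $0.8+0.3$ and $-0.6-0.4$, matching the worked example, and all other entries are zero.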

Finally, as noted in Algorithm 1 we use weight tying for the hidden to output layer weights, i.e., we use $\boldsymbol{W}^{T}$ as the weight matrix, with different biases $\boldsymbol{c}$ . Also, since the inputs $\boldsymbol{x}$ are non-negative for document representations (e.g., word counts), we use the sigmoid activation function at the output layer to maintain the non-negativity. Note that in the back-propagation procedure, the gradients will first flow through the winner neurons in the hidden layer and then the loser neurons via the $\alpha$ amplification connections. No gradients will flow directly from the output neurons to the loser neurons since they are made inactive in the feedforward step.
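The gradient path through the amplification connections can be made explicit with a short derivation. It is a sketch under stated assumptions: $\mathcal{W}^{+}$ and $\mathcal{L}^{+}$ denote the sets of positive winners and positive losers, $\tilde{z}_{w}$ is a winner's activation after reallocation, $\ell$ is the loss, and every winner is assumed to receive the full amplified loser energy (the negative case is symmetric).

$$\tilde{z}_{w}=z_{w}+\alpha\sum_{l\in\mathcal{L}^{+}}z_{l},\quad w\in\mathcal{W}^{+}\qquad\Longrightarrow\qquad\frac{\partial\ell}{\partial z_{l}}=\alpha\sum_{w\in\mathcal{W}^{+}}\frac{\partial\ell}{\partial\tilde{z}_{w}},\quad l\in\mathcal{L}^{+}.$$

Under this reading, a loser receives $\alpha\,|\mathcal{W}^{+}|$ times the average gradient of the winners. Since $|\mathcal{W}^{+}|=\lceil k/2\rceil$, the signal is boosted relative to that average roughly when $\alpha>2/k$, and it vanishes when $\alpha=0$.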

Once the k-competitive network has been trained, we simply encode each test input as shown in Algorithm 1. That is, given a test input $\boldsymbol{x}$, we map it to the feature space to obtain $\boldsymbol{z}=\tanh(\boldsymbol{W}\boldsymbol{x}+\boldsymbol{b})$. No competition is required for the encoding step since the hidden neurons are well trained to be distinct from one another. We argue that this is one of the superior features of KATE.

Relationship to Other Models

The k-sparse autoencoder is closely related to our model, but there are several differences. The k-sparse autoencoder explicitly enforces sparsity by only keeping the $k$ highest activities at training time. Then, at testing time, in order to enforce sparsity, only the $\alpha k$ highest activities are kept where $\alpha$ is a hyperparameter. Since its hidden layer uses a linear activation function, the only non-linearity in the encoder comes from the selection of the $k$ highest activities. Instead of focusing on sparsity, our model focuses on competition to drive each hidden neuron to be distinct from the others. Thus, at testing time, no competition is needed. The non-linearity in KATE’s encoding comes from the tanh activation function and the winner-take-all operation (i.e., top k selection and amplifying energy reallocation).
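For contrast, the k-sparse selection described above can be sketched as a plain top-$k$ filter. This is an illustrative sketch that assumes selection by raw activation value, with no energy reallocation; the name `k_sparse` is hypothetical.

```python
def k_sparse(z, k):
    """Keep the k highest activations of z and set the rest to zero."""
    ranked = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]

z = [0.8, 0.2, 0.1, -0.1, -0.3, -0.6]
sparse = k_sparse(z, k=2)
```

On this toy input the two largest values survive unchanged and nothing flows from the discarded neurons, whereas a k-competitive layer would also keep the strongly negative neuron and fold the losers' energy into the winners.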

It is important to note that for the k-sparse autoencoder, too much sparsity (i.e., low $k$) can cause the so-called “dead” hidden neurons problem, which can prevent gradient back-propagation from adjusting the weights of these “dead” hidden neurons. As mentioned in the original paper, the model is prone to behaving in a manner similar to k-means clustering. That is, in the first few epochs, it will greedily assign individual hidden neurons to groups of training cases, and these hidden neurons will be reinforced while the other hidden neurons will not be adjusted in subsequent epochs. In order to address this problem, scheduling the sparsity level over epochs was suggested. However, by design our approach does not suffer from this problem since the gradients will still flow through the loser neurons via the $\alpha$ amplification connections in the k-competitive layer.

KATE vs. K-Max Pooling

Our proposed k-competitive operation is also reminiscent of the k-max pooling operation [blunsom2014convolutional] applied in convolutional neural networks. We can intuitively regard k-max pooling as a global feature sampler which selects a subset of k maximum neurons in the previous convolutional layer and uses only the selected subset of neurons in the following layer. Unlike our k-competitive approach, the objective of k-max pooling is to reduce dimensionality and introduce feature invariance via this downsampling operation.

KATE as a Regularized Autoencoder

We can also regard our model as a special case of a fully competitive autoencoder where all the neurons in the hidden layer are fully connected with each other and the weights on the connections between them are fully trainable. The difference is that we restrict the architecture of this competitive layer by using a positive adder and a negative adder to constrain the energy, which serves as a regularization method.

Experiments

In this section, we evaluate our k-competitive autoencoder model on various datasets and downstream text analytics tasks to gauge its effectiveness in learning meaningful representations in different situations. All experiments were performed on a machine with a 1.7GHz AMD Opteron 6272 Processor, with 264G RAM. Our model, KATE, was implemented in Keras (github.com/fchollet/keras) which is a high-level neural networks library, written in Python. The source code for KATE is available at github.com/hugochan/KATE.

For evaluation, we use datasets that have been widely used in previous studies . Table 1 provides statistics of the different datasets used in our experiments. It lists the training, testing and validation (a subset of training) set sizes, the size of the vocabulary, average document length, the number of classes (or values for regression), and the various downstream tasks we perform on the datasets.

The 20 Newsgroups (www.qwone.com/~jason/20Newsgroups) data consists of 18846 documents, which are partitioned (nearly) evenly across 20 different newsgroups. Each document belongs to exactly one newsgroup. The corpus is divided by date into training (60%) and testing (40%) sets. We follow the preprocessing steps utilized in previous work . That is, after removing stopwords and stemming, we keep the most frequent 2,000 words in the training set as the vocabulary. We use this dataset to show that our model can learn meaningful representations for classification and document retrieval tasks.

The Reuters RCV1-v2 dataset (www.jmlr.org/papers/volume5/lewis04a) contains 804,414 newswire articles, where each document typically has multiple (hierarchical) topic labels. The total number of topic labels is 103. The dataset already comes preprocessed with stopword removal and stemming. We randomly split the corpus into 554,414 training and 250,000 test cases and keep the most frequent 5,000 words in the training dataset as the vocabulary. We perform multi-label classification on this dataset.

The Wiki10+ dataset (www.zubiaga.org/datasets/wiki10+/) comprises English Wikipedia articles with at least 10 annotations on delicious.com. Following the steps of Cao et al. , we only keep the 25 most frequent social tags and those documents containing any of these tags. After removing stopwords and stemming, we randomly split the corpus into 13,972 training and 6,000 test cases and keep the most frequent 2,000 words in the training set as the vocabulary for use in multi-label classification.

The Movie review data (MRD) (www.cs.cornell.edu/people/pabo/movie-review-data/) contains a collection of movie-review documents, with a numerical rating score in the interval $[0,1]$. After removing stopwords and stemming, we randomly split the corpus into 3,337 training and 1,669 test cases and keep the most frequent 2,000 words in the training set as the vocabulary. We use this dataset for regression, i.e., predicting the movie ratings.

Note that among the above datasets, only the 20 Newsgroups dataset is balanced, whereas both the Reuters and Wiki10+ datasets are highly imbalanced in terms of class labels.

Comparison with Baseline Methods

We compare our k-competitive autoencoder KATE with a wide range of other models including various types of autoencoders, topic models, belief networks and word representation models, as listed below.

LDA : a directed graphical model which models a document as a mixture of topics and a topic as a mixture of words. Once trained, each document can be represented as a topic proportion vector on the topic simplex. We used the gensim LDA implementation in our experiments.

DocNADE : a neural autoregressive topic model that can be used for extracting meaningful representations of documents. The implementation is available at www.dmi.usherb.ca/~larocheh/code/DocNADE.zip.

DBN: a directed acyclic graph whose top two layers form a restricted Boltzmann machine. We use the implementation available at github.com/larsmaaloee/deep-belief-nets-for-topic-modeling.

NVDM : a neural variational inference model for document modeling. The authors have not released the source code, but we used an open-source implementation at github.com/carpedm20/variational-text-tensor-flow.

Word2Vec: a model in which each document is represented as the average of the word embedding vectors for that document. We use Word2Vecpre to denote the version where we use Google News pre-trained word embeddings, which contain 300-dimensional vectors for 3 million words and phrases. Those embeddings were trained by the state-of-the-art word2vec skip-gram model. On the other hand, we use Word2Vec to denote the version where we train word embeddings separately on each of our datasets, using the gensim implementation.

Doc2Vec : a distributed representation model inspired by Word2Vec which can directly learn vector representations of documents. There are two versions named Doc2Vec-DBOW and Doc2Vec-DM. We use Doc2Vec-DM in our experiments as it was reported to consistently outperform Doc2Vec-DBOW in the original paper. We used the gensim implementation in our experiments.

AE: a plain shallow (i.e., one hidden layer) autoencoder, without any competition, which can automatically learn data representations by trying to reconstruct its input at the output layer.

DAE: a denoising autoencoder that accepts a corrupted version of the input data while the output is still the original uncorrupted data. In our experiments, we found that masking noise consistently outperforms the other two types of noise, namely Gaussian noise and salt-and-pepper noise. Thus, we only report the results of using masking noise. Basically, masking noise perturbs the input by setting a fraction $v$ of the elements $i$ in each input vector to 0. To be fair and consistent, we use a shallow denoising autoencoder in our experiments.

CAE : a contractive autoencoder which introduces the Frobenius norm of the Jacobian matrix of the encoder activations into the regularization term.

VAE : a generative autoencoder inspired by variational inference.

KSAE : a competitive autoencoder which explicitly enforces sparsity by only keeping the $k$ highest activities in the feedforward phase.

We implemented the AE, DAE, CAE, VAE and KSAE autoencoders on our own, since their implementations are not publicly available.
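The masking noise used for the DAE baseline above can be sketched as follows. This is a minimal illustration, assuming the masked positions are drawn uniformly at random without replacement; the function name `masking_noise` and the fixed seed are choices made for the example.

```python
import random

def masking_noise(x, v, rng):
    """Return a copy of x with a fraction v of its entries set to 0."""
    n_masked = round(v * len(x))
    # Assumption: positions are sampled uniformly without replacement.
    masked = set(rng.sample(range(len(x)), n_masked))
    return [0.0 if i in masked else x_i for i, x_i in enumerate(x)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [0.3, 0.0, 1.0, 0.5, 0.0, 0.7, 0.2, 0.9, 0.4, 0.6]
x_noisy = masking_noise(x, v=0.3, rng=rng)
```

The denoising autoencoder would then be trained to reconstruct `x` from `x_noisy`.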

Training Details: For all the autoencoder models (including AE, DAE, CAE, VAE, KSAE, and KATE), we represent each input document as a log-normalized word count vector, using binary cross-entropy as the loss function and Adadelta as the optimizer. Weight tying is also applied. For CAE and VAE, additional regularization terms are added to the loss function as described in the original papers. For VAE, we use tanh as the nonlinear activation function, while for AE, DAE and CAE, sigmoid is applied. For KSAE, we found that omitting sparsity in the testing phase gave us better results in all experiments.

When training models, we randomly extract a subset of documents from the training set as a validation set, as noted in Table 1, which is used for tuning hyperparameters and early stopping. Early stopping is a type of regularization used to avoid overfitting when training an iterative algorithm. We stop training after 5 successive epochs with no improvement on the validation set. All baseline models were optimized as recommended in original sources. For KATE, we set $\alpha$ as 6.26, learning rate as 2, batch size as 100 (for the Reuters dataset) or 50 (for other datasets) and $k$ as 6 (for the 20 topics case), 32 (for the 128 topics case) or 102 (for the 512 topics case), as determined from the validation set.
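The early stopping rule described above (stop after 5 successive epochs with no improvement on the validation set) can be sketched as a small helper that replays a sequence of validation losses. The function name `early_stopping_epoch` and the toy loss values are hypothetical, and a strict decrease is assumed to count as an improvement.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    stale = 0  # successive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience budget was exhausted.
    return len(val_losses) - 1, best_epoch

# Toy validation curve: the best value occurs at epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.75, 0.73, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
```

In this toy curve, training halts before the late improvement in the final epoch is ever seen, which is the usual trade-off of a fixed patience budget.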

Qualitative Analysis

In this set of qualitative experiments, we compare the topics generated by KATE to other representative models including AE, KSAE, and LDA. Even though KATE is not explicitly designed for the purpose of word embeddings, we compared word representations learned by KATE with the Word2Vec model to demonstrate that our model can learn semantically meaningful representations from text. We evaluate the above models on the 20 Newsgroups data. Matching the number of classes, the number of topics is set to exactly 20 for all models. For both KSAE and KATE, $k$ (the sparsity level/number of winning neurons) is set as 6.

Table 2 shows some topics learned by various models. As for autoencoders, each topic is represented by the 10 words (i.e., input neurons) with the strongest connection to that topic (i.e., hidden neuron). As for LDA, each topic is represented by the 10 most probable words in that topic. The basic AE is not very good at learning distinctive topics from textual data. In our experiment, all the topics learned by AE are dominated by frequent common words like line, subject and organ, which were always the top 3 words in all the 20 topics. KSAE learns some meaningful words but only alleviates this problem to some extent; for example, line, subject, organ and white still appear as the top 4 words in 6 topics. For this reason, the output of AE and KSAE is shown for only one of the newsgroups (soc.religion.christian). On the other hand, we find that KATE generates 20 topics that are distinct from each other, and which capture the underlying semantics very well. For example, it associates words such as god, christian, jesu, moral, bibl, exist, religion, christ under the topic soc.religion.christian. It is worth emphasizing that KATE belongs to the class of distributed representation models, where each topic is “distributed” among a group of hidden neurons (the topics are therefore better interpreted as “virtual” topics). However, we find that KATE can generate competitive topics compared with LDA, which explicitly infers topics as mixtures of words.

Word Embeddings Learned by Different Models

Visualization of Document Representations

A good document representation method is expected to group related documents, and to separate the different groups. Figure 2 shows the PCA projections of the document representations taken from the six main groups in the 20 Newsgroups data. As we can observe, neither the AE nor the KSAE method learns good document representations. On the other hand, KATE successfully extracts meaningful representations from the documents; it automatically clusters related documents in the same group, and it can easily distinguish the six different groups. In fact, KATE is very competitive with LDA (arguably even better on this dataset, since LDA confuses some categories), even though the latter explicitly learns document representations as mixtures of topics, which in turn are mixtures of words. Figure 3 shows the t-SNE based visualization of the above document representations, from which we can draw a similar conclusion.

Quantitative Experiments

We now turn to quantitative experiments to measure the effectiveness of KATE compared to other models on tasks such as classification, multi-label classification (MLC), regression, and document retrieval (DR). For classification, MLC and regression tasks, we train a simple neural network that uses the encoded test inputs as feature vectors, and directly maps them to the output classes or values. A simple softmax classifier with cross-entropy loss was applied for the classification task, and multi-label logistic regression classifier with cross-entropy loss was applied for the MLC task. For the regression task we used a two-layer neural regression model (where the output layer is a sigmoid neuron) with squared error loss. The same architecture is used for all methods to ensure fairness. Note that when comparing various methods, the same number of features were learned for all of them except for Word2Vecpre which uses 300-dimensional pre-trained word embeddings and thus its number of features was fixed as 300 in all experiments.

We first quantify how distinct the topics learned via different methods are. Let $\boldsymbol{v}_{i}$ denote the vector representation of topic $i$, and let there be $m$ topics. The cosine of the angle between $\boldsymbol{v}_{i}$ and $\boldsymbol{v}_{j}$, given as $\cos(\boldsymbol{v}_{i},\boldsymbol{v}_{j})=\tfrac{\boldsymbol{v}_{i}^{T}\boldsymbol{v}_{j}}{\|\boldsymbol{v}_{i}\|\cdot\|\boldsymbol{v}_{j}\|}$, is a measure of how similar/correlated the two topic vectors are; it takes values in the range $[-1,1]$. The topics are most dissimilar when the vectors are orthogonal to each other, i.e., when the angle between them is $\pi/2$ and the cosine of the angle is zero. Define the pair-wise mean squared cosine deviation among $m$ topics as follows

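$$\text{MSCD}=\sqrt{\frac{2}{m(m-1)}\sum_{i=1}^{m}\sum_{j>i}\cos^{2}(\boldsymbol{v}_{i},\boldsymbol{v}_{j})}$$

Thus, MSCD $\in[0,1]$, and smaller values of MSCD (closer to zero) imply more distinctive, i.e., orthogonal topics.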
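A small sketch of the MSCD computation follows. It assumes that MSCD is the square root of the mean squared cosine over all $m(m-1)/2$ unordered topic pairs, which keeps the value in $[0,1]$; the function names and toy topic vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mscd(topics):
    """Root of the mean squared pairwise cosine among topic vectors."""
    pairs = list(combinations(topics, 2))  # all unordered pairs (i, j), j > i
    return math.sqrt(sum(cosine(u, v) ** 2 for u, v in pairs) / len(pairs))

orthogonal_topics = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identical_topics = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mscd_orthogonal = mscd(orthogonal_topics)
mscd_identical = mscd(identical_topics)
```

Mutually orthogonal topics sit at the lower end of the range and identical topics at the upper end, which is why lower values indicate more distinctive topics.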

We evaluate MSCD for topics generated by AE, KSAE, LDA, and KATE. We also evaluate KATE without amplification. In LDA, a topic is represented as its probabilistic distribution over the vocabulary set, whereas for autoencoders, it is defined as the weights on the connections between the corresponding hidden neuron and all the input neurons. We conduct experiments on the 20 Newsgroups dataset and vary the number of topics from 20 to 128 and 512. Table 4 shows these results. We find that KATE has the lowest MSCD values, which means that it can learn more distinctive (i.e., orthogonal) topics than other methods. Our results are much better than LDA, since the latter does not prevent topics from being similar. On the other hand, the competition in KATE drives topics (i.e., the hidden neurons) to become distinct from each other. Interestingly, KATE with amplification (i.e., here we have $\alpha=6.26$ ) consistently achieves lower MSCD values than KATE without amplification, which verifies the effectiveness of the $\alpha$ amplification connections in terms of learning distinctive topics.

Document Classification Task

In this set of experiments, we evaluate the quality of learned document representations from various models for the purpose of document classification. Table 5 shows the classification accuracy results on the 20 Newsgroups dataset (using 128 topics). Traditional autoencoders (including AE, DAE, CAE) do not perform well on this task. We observed that the validation set error was oscillating when training these classifiers (also observed in the regression task below), which indicates that the extracted features are not representative and consistent. KSAE consistently achieves higher accuracies than other autoencoders and does not exhibit the oscillating phenomenon, which means that adding sparsity does help learn better representations. VAE even performs better than KSAE on this dataset, which shows the advantages of VAE over other traditional autoencoders. However, as we will see later, VAE fails to consistently perform well across different datasets and tasks. Word2Vecpre performs on par with DBN and LDA even though it just averages all the word embeddings in a document, which suggests the effectiveness of pre-training word embeddings on a large external corpus to learn general knowledge. Not surprisingly, DocNADE works very well on this task as also reported in previous work . Our KATE model significantly outperforms all other models. For example, KATE obtains 74.4% accuracy which is significantly higher than the 72.4% accuracy achieved by VAE.

Table 6 shows multi-label classification results on the Reuters and Wiki10+ datasets. Here we show both the Macro-F1 and Micro-F1 scores (reflecting a balance of precision and recall) for different numbers of features. Micro-F1 score biases the metric towards the most populated labels, while Macro-F1 biases the metric towards the least populated labels. Both Reuters and Wiki10+ are highly imbalanced. For example, in Wiki10+, the documents belonging to ‘wikipedia’ or ‘wiki’ account for 90% of the corpus while only around 6% of the documents are relevant to ‘religion’. Similarly, in Reuters, the documents belonging to ‘CCAT’ account for 47% of the corpus while there are only 5 documents relevant to ‘GMIL’. DocNADE works very well on this task, but the sparse and competitive autoencoders also perform well. KATE outperforms KSAE on Reuters and remains competitive on Wiki10+. We do not report the results of DBN on Reuters since the training did not end even after a long time.
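The difference between the two averages is easiest to see on a small imbalanced example. The sketch below uses the standard definitions (Micro-F1 pools true positives, false positives and false negatives over all labels; Macro-F1 averages the per-label F1 scores). The ten toy documents and the classifier that never predicts the rare label are invented for illustration; only the label names echo the Wiki10+ example above.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per document."""
    counts = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(*c) for c in counts) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro

# Ten toy documents: all carry 'wiki', only one also carries 'religion'.
y_true = [{"wiki"}] * 9 + [{"wiki", "religion"}]
# A hypothetical classifier that always predicts 'wiki' and never 'religion'.
y_pred = [{"wiki"}] * 10

micro, macro = micro_macro_f1(y_true, y_pred, ["wiki", "religion"])
print(micro, macro)  # micro is 20/21 (about 0.95), macro is 0.5
```

Missing the rare label barely moves Micro-F1 but halves Macro-F1, which is why both scores are reported for imbalanced corpora.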

4.3. Regression Task

In this set of experiments, we evaluate the quality of learned document representations from various models for predicting the movie ratings in the MRD dataset, as shown in Table 7 (using 128 features). The coefficient of determination, denoted $r^{2}$, from the regression model was used to evaluate the methods. The best possible $r^{2}$ value is 1.0; negative values are also possible, indicating a poor fit of the model to the data. In general, other autoencoder models perform poorly on this task; for example, AE even gets a negative $r^{2}$ score. Interestingly, Word2Vecpre performs on par with DocNADE, indicating that word embeddings learned from a large external corpus can capture some semantics of emotive words (e.g., good, bad, wonderful). We observe that KATE significantly outperforms all other models, including Word2Vecpre, which means it can learn meaningful representations which are helpful for sentiment analysis.
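As a reminder of why $r^{2}$ can be negative, the following sketch computes it with the usual formula $r^{2} = 1 - SS_{res}/SS_{tot}$. The ratings and predictions are made-up numbers, not values from the MRD dataset, and the function name is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

ratings = [1.0, 2.0, 3.0, 4.0]         # toy targets
perfect = [1.0, 2.0, 3.0, 4.0]         # exact predictions
mean_only = [2.5, 2.5, 2.5, 2.5]       # always predict the mean rating
reversed_order = [4.0, 3.0, 2.0, 1.0]  # systematically wrong predictions

print(r_squared(ratings, perfect))         # 1.0
print(r_squared(ratings, mean_only))       # 0.0
print(r_squared(ratings, reversed_order))  # -3.0
```

A score below zero therefore means the regressor does worse than a constant predictor that always outputs the mean rating.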

4.4. Document Retrieval Task

We also evaluate the various models for document retrieval. Each document in the test set is used as an individual query and we fetch the relevant documents from the training set based on the cosine similarity between the document representations. The average fraction of retrieved documents which share the same label as the query document, i.e., precision, was used as the evaluation metric. As shown in Figure 4, VAE performs the best on this task followed by DocNADE and KATE. Among the other models, DBN and LDA also have decent performance, but the other autoencoders are not that effective.
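The retrieval protocol described above can be sketched as follows. This is an illustrative outline only: the cut-off parameter `top_n`, the function names and the two-dimensional toy representations are assumptions introduced here, and the actual evaluation in Figure 4 may vary the number of retrieved documents differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def precision_for_query(query_vec, query_label, train_vecs, train_labels, top_n):
    """Fraction of the top_n most similar training documents sharing the query label."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    hits = sum(1 for i in ranked[:top_n] if train_labels[i] == query_label)
    return hits / top_n

def mean_precision(test_vecs, test_labels, train_vecs, train_labels, top_n):
    """Average the per-query precision over every test document."""
    scores = [precision_for_query(q, y, train_vecs, train_labels, top_n)
              for q, y in zip(test_vecs, test_labels)]
    return sum(scores) / len(scores)

# Toy document representations (illustrative only).
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["a", "a", "b", "b"]
test_vecs = [[1.0, 0.2], [0.2, 1.0]]
test_labels = ["a", "b"]

print(mean_precision(test_vecs, test_labels, train_vecs, train_labels, top_n=2))
print(mean_precision(test_vecs, test_labels, train_vecs, train_labels, top_n=4))
```

With good representations the nearest neighbours share the query label, so precision is high for small cut-offs and falls towards the label's base rate as more documents are retrieved.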

4.5. Timing

Finally, we compare the training time of various models. Results are shown in Table 8 for the 20 Newsgroups dataset, with 20 topics. Our model is much faster than deep generative models like DBN and DocNADE. It is typically slower than other autoencoders since it usually takes more epochs to converge. Nevertheless, as demonstrated above, it significantly outperforms other models in various text analytics tasks.

5. KATE: Effects of Parameter Tuning

Having demonstrated the effectiveness of KATE compared to other methods, we study the effects of various hyperparameter choices in KATE, such as the number of topics (i.e., hidden neurons), the number of winners $k$ and the energy amplification parameter $\alpha$. The default number of topics is 128, with $k=32$ and $\alpha=6.26$. Note that when exploring the effect of the number of topics, we also vary $k$ to find its best match to the given number of topics. Figure 5 shows the classification accuracy on the 20 Newsgroups dataset, as we vary these parameters. We observe that as we increase the number of topics or hidden neurons (in Figure 5a), the accuracy continues to rise, but eventually drops off. We use 128 as the default value since it offers the best trade-off in complexity and performance; only relatively minor gains are achieved in increasing the number of topics beyond 128. Considering the number of winning neurons (see Figure 5b), the main trend is that the performance degrades when we make $k$ larger, which is expected since larger $k$ implies less competition. In practice, when tuning $k$, we find that starting with a value around a quarter of the number of topics is a good strategy. Finally, as we mentioned, the $\alpha$ amplification connection is crucial, as verified in Figure 5c. When $\alpha=2/k=0.0625$, which means there is no amplification for the energy, the classification accuracy is 71.1%. However, we are able to significantly boost the model performance up to 74.6% accuracy by increasing the value of $\alpha$. We use a default value of $\alpha=6.26$, which once again reflects a good trade-off across different datasets. It is also important to note that across all the experiments, we found that using the tanh activation function (instead of the sigmoid function) in the k-competitive layer of KATE gave the best performance. For example, on the 20 Newsgroups data, using 128 topics, KATE with tanh yields 74.4% accuracy, while with sigmoid it was only 56.8%.
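To illustrate the roles of $k$ and $\alpha$, here is a minimal forward-pass sketch of a k-competitive step on tanh activations. It is a simplified illustration, not the reference implementation: it assumes that positive and negative activations compete separately, with $\lceil k/2 \rceil$ positive and $\lfloor k/2 \rfloor$ negative winners, and that each winner receives $\alpha$ times the summed activation of the losers of the same sign. The function name and toy activations are hypothetical; the layer itself is defined earlier in the paper.

```python
def k_competitive(activations, k, alpha):
    """Zero out losers and add alpha * (loser energy) to each winner.

    Assumption: positive and negative neurons compete separately, with
    ceil(k/2) positive winners and floor(k/2) negative winners.
    """
    n_pos = (k + 1) // 2
    n_neg = k // 2
    # Indices sorted by strength: largest positive first, most negative first.
    pos = sorted((i for i, z in enumerate(activations) if z > 0),
                 key=lambda i: activations[i], reverse=True)
    neg = sorted((i for i, z in enumerate(activations) if z < 0),
                 key=lambda i: activations[i])
    out = [0.0] * len(activations)
    for group, n_win in ((pos, n_pos), (neg, n_neg)):
        winners, losers = group[:n_win], group[n_win:]
        energy = sum(activations[i] for i in losers)  # signed loser energy
        for i in winners:
            out[i] = activations[i] + alpha * energy
    return out

# Toy tanh outputs of seven hidden neurons (illustrative only).
acts = [0.9, 0.8, 0.3, 0.1, -0.7, -0.6, -0.2]
k = 4

no_amp = k_competitive(acts, k, alpha=2 / k)   # energy only redistributed
amplified = k_competitive(acts, k, alpha=6.26)  # default alpha from the text

print(no_amp)
print(amplified)
```

Under these assumptions and for even $k$, setting $\alpha=2/k$ splits the loser energy evenly among the $k/2$ winners of each sign, so the total activation is unchanged, which is consistent with the statement above that this setting corresponds to no amplification. Larger $\alpha$ pushes the winners further from zero.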

Conclusions

We described a novel k-competitive autoencoder, KATE, that explicitly enforces competition among the neurons in the hidden layer by selecting the $k$ highest activation neurons as winners, and reallocates the amplified energy (aggregate activation potential) from the losers. Interestingly, even though we use a shallow model, i.e., with one hidden layer, it outperforms a variety of methods on many different text analytics tasks. More specifically, we perform a comprehensive evaluation of KATE against techniques spanning graphical models (e.g., LDA), belief networks (e.g., DBN), word embedding models (e.g., Word2Vec), and several other autoencoders including the k-sparse autoencoder (KSAE). We find that across tasks such as document classification, multi-label classification, regression and document retrieval, KATE clearly outperforms competing methods or obtains close to the best results. It is very encouraging to note that KATE is also able to learn semantically meaningful representations of words, documents and topics, which we evaluated via both quantitative and qualitative studies. As part of future work, we plan to evaluate KATE on more domain-specific datasets, such as bibliographic networks, for tasks such as topic induction and scientific publication retrieval. We also plan to improve the scalability and effectiveness of our approach on much larger text collections by developing parallel and distributed implementations.