Charagram: Embedding Words and Sentences via Character n-grams

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu

Introduction

Representing textual sequences such as words and sentences is a fundamental component of natural language understanding systems. Many functional architectures have been proposed to model compositionality in word sequences, ranging from simple averaging [Mitchell and Lapata, 2010, Iyyer et al., 2015] to functions with rich recursive structure [Socher et al., 2011, Tai et al., 2015, Bowman et al., 2016]. Most work uses words as the smallest units in the compositional architecture, often using pretrained word embeddings or learning them specifically for the task of interest [Tai et al., 2015, He et al., 2015].

Some prior work has found benefit from using character-based compositional models that encode arbitrary character sequences into vectors. Examples include recurrent neural networks (RNNs) and convolutional neural networks (CNNs) on character sequences, showing improvements for several NLP tasks [Ling et al., 2015a, Kim et al., 2015, Ballesteros et al., 2015, dos Santos and Guimarães, 2015]. By sharing subword information across words, character models have the potential to better represent rare words and morphological variants.

Our approach, charagram, uses a much simpler functional architecture. We represent a character sequence by a vector containing counts of character $n$ -grams, inspired by ?). This vector is embedded into a low-dimensional space using a single nonlinear transformation. This can be interpreted as learning embeddings of character $n$ -grams, which are learned so as to produce effective sequence embeddings when a summation is performed over the character $n$ -grams in the sequence.

We consider three evaluations: word similarity, sentence similarity, and part-of-speech tagging. On multiple word similarity datasets, charagram outperforms RNNs and CNNs, achieving state-of-the-art performance on SimLex-999 [Hill et al., 2015]. When evaluated on a large suite of sentence-level semantic textual similarity tasks, charagram embeddings again outperform the RNN and CNN architectures as well as the paragram-phrase embeddings of ?). We also consider English part-of-speech (POS) tagging using the bidirectional long short-term memory tagger of ?). The three architectures reach similar performance, though charagram converges fastest to high accuracy.

We perform extensive analysis of our charagram embeddings. We find large gains in performance on rare words, showing the empirical benefit of subword modeling. We also compare performance across different character $n$ -gram vocabulary sizes, finding that the semantic tasks benefit far more from large vocabularies than the syntactic task. However, even for challenging semantic similarity tasks, we still see strong performance with only a few thousand character $n$ -grams.

Nearest neighbors show that charagram embeddings simultaneously address differences due to spelling variation, morphology, and word choice. Inspection of embeddings of particular character $n$ -grams reveals etymological links; e.g., die is close to mort. We release our resources to the community in the hope that charagram can provide a strong baseline for subword-aware text representation.

Related Work

We first review work on using subword information in word embedding models. The simplest approaches append subword features to word embeddings, letting the model learn how to use the subword information for particular tasks. Some added knowledge-based morphological features to word representations [Alexandrescu and Kirchhoff, 2006, El-Desoky Mousa et al., 2013]. Others learned embeddings jointly for subword units and words, defining simple compositional architectures (often based on addition) to create word embeddings from subword embeddings [Lazaridou et al., 2013, Botha and Blunsom, 2014, Qiu et al., 2014, Chen et al., 2015].

A recent trend is to use richer functional architectures to convert character sequences into word embeddings. ?) used recursive models to compose morphs into word embeddings, using unsupervised morphological analysis. ?) used a bidirectional long short-term memory (LSTM) RNN on characters to embed arbitrary word types, showing strong performance for language modeling and POS tagging. ?) used this model to represent words for dependency parsing. Several have used character-level RNN architectures for machine translation, whether for representing source or target words [Ling et al., 2015b, Luong and Manning, 2016], or for generating entire translations character-by-character [Chung et al., 2016].

?) and ?) used character-level RNNs for language modeling. Others trained character-level RNN language models to provide features for NLP tasks, including tokenization and segmentation [Chrupała, 2013, Evang et al., 2013], and text normalization [Chrupała, 2014].

CNNs with character $n$ -gram filters have been used to embed arbitrary word types for several tasks, including language modeling [Kim et al., 2015], part-of-speech tagging [dos Santos and Zadrozny, 2014], named entity recognition [dos Santos and Guimarães, 2015], text classification [Zhang et al., 2015], and machine translation [Costa-Jussà and Fonollosa, 2016]. Combinations of CNNs and RNNs on characters have also been explored [Józefowicz et al., 2016].

Most closely related to our approach is the DSSM (instantiated variously as “deep semantic similarity model” or “deep structured semantic model”) developed by ?). For an information retrieval task, they represented words using feature vectors containing counts of character $n$-grams. ?) used a very similar technique to represent words in neural language models for machine translation. Our charagram embeddings are based on this same idea. We show this strategy to be extremely effective when applied to both words and sentences, outperforming character LSTMs like those used by ?) and character CNNs like those from ?).

Models

We now describe models that embed textual sequences using their characters, including our charagram model and the baselines that we compare to. We denote a character-based textual sequence by $x=\langle x_1,x_2,\ldots,x_m\rangle$, which includes space characters between words as well as special start-of-sequence and end-of-sequence characters. We use $x_i^j$ to denote the subsequence of characters from position $i$ to position $j$ inclusive, i.e., $x_i^j=\langle x_i,x_{i+1},\ldots,x_j\rangle$, and we define $x_i^i=x_i$.

Our charagram model embeds a character sequence $x$ by adding the vectors of its character $n$ -grams followed by an elementwise nonlinearity:
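The display for Eq. (1) does not survive in this text. One form consistent with the surrounding description is sketched here. The symbols $h$, $\mathbf{b}$, $W$, $V$, and $d$ are those used elsewhere in this paper; the indicator $\mathbb{I}[\cdot]$, the maximum $n$-gram length $k$, the per-$n$-gram vector notation $W^{x_i^j}$, and the exact summation bounds are assumptions made for illustration.

$$
g_{\text{char}}(x) = h\Big(\mathbf{b} + \sum_{i=1}^{m}\ \sum_{j=i}^{\min(i+k-1,\,m)} \mathbb{I}\big[x_i^j \in V\big]\, W^{x_i^j}\Big) \tag{1}
$$

Under this reading, $\mathbf{b}\in\mathbb{R}^d$ and each $W^{x_i^j}\in\mathbb{R}^d$, which agrees with the parameter count $d+d|V|$ stated in the text.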

The set $V$ is used to restrict the model to a predetermined set (vocabulary) of character $n$ -grams. Below, we compare several choices for defining this set. The number of parameters in the model is $d+d|V|$ . This model is based on the letter $n$ -gram hashing technique developed by ?) for their DSSM approach. One can also view Eq. (1) (as they did) as first populating a vector of length $|V|$ with counts of character $n$ -grams followed by a nonlinear transformation.
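The summation view of the model can be illustrated with a short NumPy sketch. The function name `charagram_embed`, the dictionary layout of `vocab`, and the toy values are hypothetical and chosen only for illustration; this is not the released implementation.

```python
import numpy as np

def charagram_embed(text, vocab, W, b, orders=(2, 3, 4), h=np.tanh):
    """Embed `text` as h(b + sum of the vectors of its in-vocabulary character n-grams).

    vocab: dict mapping n-gram string -> row index in W (hypothetical layout).
    W:     array of shape (|V|, d), one vector per n-gram.
    b:     bias vector of shape (d,).
    """
    total = b.astype(float).copy()
    for n in orders:
        for i in range(len(text) - n + 1):
            idx = vocab.get(text[i:i + n])
            if idx is not None:      # n-grams outside V contribute nothing
                total += W[idx]      # a repeated n-gram is added once per occurrence
    return h(total)

# Toy usage with d = 3 and a three-entry vocabulary (values are illustrative only).
vocab = {" c": 0, "ca": 1, "at": 2}
W = np.array([[0.1, 0.0, 0.0],
              [0.0, 0.2, 0.0],
              [0.0, 0.0, 0.3]])
b = np.zeros(3)
vec = charagram_embed(" cat ", vocab, W, b, h=lambda z: z)  # linear activation
```

With this layout the parameter count is `W.size + b.size`, i.e. $d|V| + d$.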

We compare the charagram model to two other models. First we consider LSTM architectures [Hochreiter and Schmidhuber, 1997] over the character sequence $x$ , using the version from ?). We use a forward LSTM over the characters in $x$ , then take the final LSTM hidden vector as the representation of $x$ . Below we refer to this model as “charLSTM.”

We also compare to convolutional neural network (CNN) architectures, which we refer to below as “charCNN.” We use the architecture from ?) with a single convolutional layer followed by an optional fully-connected layer. We use filters of varying lengths of character $n$ -grams, using two primary configurations of filter sets, one of which is identical to that used by ?). Each filter operates over the entire sequence of character $n$ -grams in $x$ and we use max pooling for each filter. We tune over the choice of nonlinearity for both the convolutional filters and for the optional fully-connected layer. We give more details below about filter sets, $n$ -gram lengths, and nonlinearities.

We note that using character $n$-gram convolutional filters is similar to our use of character $n$-grams in the charagram model. The difference is that, in the charagram model, the $n$-gram must match exactly for its vector to affect the representation, while in the CNN each filter will affect the representation of all sequences (depending on the nonlinearity being used). So the charagram model is able to learn precise vectors for particular character $n$-grams with specific meanings, while there is pressure for the CNN filters to capture multiple similar patterns that recur in the data. Our qualitative analysis shows the specificity of the character $n$-gram vectors learned by the charagram model.

Experiments

We perform three sets of experiments. The goal of the first two (Section 4.1) is to produce embeddings for textual sequences such that the embeddings for paraphrases have high cosine similarity. Our third evaluation (Section 4.2) is a classification task, and follows the setup of the English part-of-speech tagging experiment from ?).

We compare the ability of our models to capture semantic similarity for both words and sentences. We train on noisy paraphrase pairs from the Paraphrase Database (PPDB; Ganitkevitch et al., 2013) with an $L_2$-regularized contrastive loss objective function, following the training procedure of ?) and ?). Key details are provided here, but see Appendix A for a fuller description.

For word similarity, we focus on two of the most commonly used datasets for evaluating semantic similarity of word embeddings: WordSim-353 (WS353) [Finkelstein et al., 2001] and SimLex-999 (SL999) [Hill et al., 2015]. We also evaluate our best model on the Stanford Rare Word Similarity Dataset [Luong et al., 2013].

For sentence similarity, we evaluate on a diverse set of 22 textual similarity datasets, including all datasets from every SemEval semantic textual similarity (STS) task from 2012 to 2015. We also evaluate on the SemEval 2015 Twitter task [Xu et al., 2015] and the SemEval 2014 SICK Semantic Relatedness task [Marelli et al., 2014]. Given two sentences, the aim of the STS tasks is to predict their similarity on a 0-5 scale, where 0 indicates the sentences are on different topics and 5 indicates that they are completely equivalent.

Each STS task consists of 4-6 datasets covering a wide variety of domains, including newswire, tweets, glosses, machine translation outputs, web forums, news headlines, image and video captions, among others. Most submissions for these tasks use supervised models that are trained and tuned on provided training data or similar datasets from older tasks. Further details are provided in the official task descriptions [Agirre et al., 2012, Agirre et al., 2013, Agirre et al., 2014, Agirre et al., 2015].

4.1.2 Preliminaries

For training data, we use pairs from PPDB. For word similarity experiments, we train on word pairs and for sentence similarity, we train on phrase pairs. PPDB comes in different sizes (S, M, L, XL, XXL, and XXXL), where each larger size subsumes all smaller ones. The pairs in PPDB are sorted by a confidence measure and so the smaller sets contain higher precision paraphrases.

Before training the charagram model, we need to populate $V$ , the vocabulary of character $n$ -grams included in the model. We obtain these from the training data used for the final models in each setting, which is either the lexical or phrasal section of PPDB XXL. We tune over whether to include the full sets of character $n$ -grams in these datasets or only those that appear more than once.

When extracting $n$ -grams, we include spaces and add an extra space before and after each word or phrase in the training and evaluation data to ensure that the beginning and end of each word is represented. We note that strong performance can be obtained using far fewer character $n$ -grams; we explore the effects of varying the number of $n$ -grams and the $n$ -gram orders in Section 4.4.
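The extraction and vocabulary-building steps described above can be sketched as follows. The helper names `char_ngrams` and `build_vocab` and the `min_count` parameter are hypothetical; `min_count=2` corresponds to keeping only the $n$-grams that appear more than once.

```python
from collections import Counter

def char_ngrams(text, orders=(2, 3, 4)):
    """Return all character n-grams of `text` after padding it with one space on each side."""
    padded = " " + text + " "
    return [padded[i:i + n]
            for n in orders
            for i in range(len(padded) - n + 1)]

def build_vocab(sequences, min_count=1, orders=(2, 3, 4)):
    """Keep the n-grams seen at least `min_count` times across `sequences`."""
    counts = Counter(g for s in sequences for g in char_ngrams(s, orders))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {g: i for i, g in enumerate(kept)}

grams = char_ngrams("cat")                                   # 4 bigrams, 3 trigrams, 2 4-grams
vocab_all = build_vocab(["cat", "cats"], min_count=1)        # every n-gram
vocab_repeated = build_vocab(["cat", "cats"], min_count=2)   # only those seen more than once
```

Because phrases are padded as a whole, $n$-grams that span a word boundary (for example `"t c"` in a phrase such as "fat cat") enter the vocabulary naturally.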

We used Adam [Kingma and Ba, 2014] with a learning rate of 0.001 to learn the parameters in the following experiments.

4.1.3 Word Embedding Experiments

For hyperparameter tuning, we used one epoch on the lexical section of PPDB XXL, which consists of 770,007 word pairs. We used either WS353 or SL999 for model selection (reported below). We then took the selected hyperparameters and trained for 50 epochs to ensure that all models had a chance to converge.

Full details of our tuning procedure are provided in Appendix B. In short, we tuned all models thoroughly, tuning the activation functions for charagram and charCNN, as well as the regularization strength, mini-batch size, and sampling type for all models. For charCNN, we experimented with two filter sets: one uses 175 filters for each $n$-gram size $\in\{2,3,4\}$, and the other uses the set of filters from ?), consisting of 25 filters of size 1, 50 of size 2, 75 of size 3, 100 of size 4, 125 of size 5, and 150 of size 6. We also experimented with using dropout [Srivastava et al., 2014] on the inputs of the last layer of the charCNN model in place of $L_2$ regularization, as well as removing the last feedforward layer. Neither of these variations significantly improved performance on our suite of tasks for word or sentence similarity. However, using more filters does improve performance, seemingly linearly with the square of the number of filters.

The results are shown in Table 1. The charagram model outperforms both the charLSTM and charCNN models, and also outperforms recent strong results on SL999.

We also found that the charCNN and charLSTM models take far more epochs to converge than the charagram model. We noted this trend across experiments and explore it further in Section 4.3.

We found that performance of charagram on word similarity tasks can be improved by using more character $n$ -grams. This is explored in Section 4.4. Our best result from these experiments was obtained with the largest model we considered, which contains 173,881 $n$ -gram embeddings. When using WS353 for model selection and training for 25 epochs, this model achieves 70.6 on SL999. To our knowledge, this is the best result reported on SL999 in this setting; Table 2 shows comparable recent results. Note that a higher SL999 number is reported in [Mrkšić et al., 2016], but the setting is not comparable to ours as they started with embeddings tuned on SL999.

Lastly, we evaluated our model on the Stanford Rare Word Similarity Dataset [Luong et al., 2013], using SL999 for model selection. We obtained a Spearman’s $\rho$ of 47.1, which outperforms the 41.8 result from ?) and is competitive with the 47.8 reported in ?), despite only using PPDB for training.

4.1.4 Sentence Embedding Experiments

We did initial training of our models using one pass through PPDB XL, which consists of 3,033,753 unique phrase pairs. Following ?), we use the annotated phrase pairs developed by ?) as our validation set, using Spearman’s $\rho$ to rank the models. We then take the highest performing models and train on the 9,123,575 unique phrase pairs in the phrasal section of PPDB XXL for 10 epochs.

For all experiments, we fix the mini-batch size to 100, the margin $\delta$ to 0.4, and use MAX sampling (see Appendix A). For the charagram model, $V$ contains all 122,610 character $n$ -grams ( $n\in\{2,3,4\}$ ) in the PPDB XXL phrasal section. The other tuning settings are the same as in Section 4.1.3.

For another baseline, we train the paragram-phrase model of ?), tuning its regularization strength over $\{10^{-5},10^{-6},10^{-7},10^{-8}\}$ . The paragram-phrase model simply uses word averaging as its composition function, but outperforms many more complex models.

In this section, we refer to our model as charagram-phrase because the input is a character sequence containing multiple words rather than only a single word as in Section 4.1.3. Since the vocabulary $V$ is defined by the training data sequences, the charagram-phrase model includes character $n$ -grams that span multiple words, permitting it to capture some aspects of word order and word co-occurrence, which the paragram-phrase model is unable to do.

We encountered difficulties training the charLSTM and charCNN models for this task. We tried several strategies to improve their chance at convergence, including clipping gradients, increasing training data, and experimenting with different optimizers and learning rates. We found success by using the original (confidence-based) ordering of the PPDB phrase pairs for the initial epoch of learning, then shuffling them for subsequent epochs. This is similar to curriculum learning [Bengio et al., 2009]. The higher-confidence phrase pairs tend to be shorter and have many overlapping words, possibly making them easier to learn from.
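The ordering strategy described above (original order for the first epoch, shuffled afterwards) can be sketched as a small generator. The name `epoch_orderings` and the `seed` parameter are hypothetical, and the toy pairs stand in for confidence-sorted PPDB phrase pairs.

```python
import random

def epoch_orderings(pairs, num_epochs, seed=0):
    """Yield the training order for each epoch.

    Epoch 1 keeps the given (confidence-sorted) order; later epochs are shuffled.
    """
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        order = list(pairs)
        if epoch > 0:
            rng.shuffle(order)
        yield order

pairs = [("a", "a'"), ("b", "b'"), ("c", "c'"), ("d", "d'")]
orders = list(epoch_orderings(pairs, num_epochs=3))
```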

An abbreviated version of the sentence similarity results is shown in Table 3; Appendix C contains the full results. For comparison, we report performance for the median (50%), third quartile (75%), and top-performing (Max) systems from the shared tasks. We observe strong performance for the charagram-phrase model. It always does better than the charCNN and charLSTM models, and outperforms the paragram-phrase model on 15 of the 22 tasks. Furthermore, charagram-phrase matches or exceeds the top-performing task-tuned systems on 5 tasks, and is within 0.003 on 2 more. The charLSTM and charCNN models are significantly worse, with the charCNN being the better of the two and beating paragram-phrase on 4 of the tasks.

We emphasize that there are many other models that could be compared to, such as an LSTM over word embeddings. This and many other models were explored by ?). Their paragram-phrase model, which simply learns word embeddings within an averaging composition function, was among their best-performing models. We used this model in our experiments as a strongly-performing representative of their results.

Lastly, we note other recent work that considers a similar transfer learning setting. The FastSent model [Hill et al., 2016] uses the 2014 STS task as part of its evaluation and reports an average Pearson’s $r$ of 61.3, much lower than the 74.7 achieved by charagram-phrase on the same datasets.

4.2 POS Tagging Experiments

We now consider part-of-speech (POS) tagging, since it has been used as a testbed for evaluating architectures for character-level word representations. It also differs from semantic similarity, allowing us to evaluate our architectures on a syntactic task. We replicate the POS tagging experimental setup of ?). Their model uses a bidirectional LSTM over character embeddings to represent words. They then use the resulting word representations in another bidirectional LSTM that predicts the tag for each word. We replace their character bidirectional LSTM with our three architectures: charCNN, charLSTM, and charagram.

We use the Wall Street Journal portion of the Penn Treebank, using Sections 1-18 for training, 19-21 for tuning, and 22-24 for testing. We set the dimensionality of the character embeddings to 50 and that of the (induced) word representations to 150. For optimization, we use stochastic gradient descent with a mini-batch size of 100 sentences. The learning rate and momentum are set to 0.2 and 0.95 respectively. We train the models for 50 epochs, again to ensure that all models have an opportunity to converge.

The other settings for our models are mostly the same as for the word and sentence experiments (Section 4.1). We again use character $n$-grams with $n\in\{2,3,4\}$, tuning over whether to include all 54,893 in the training data or only those that occur more than once. However, there are two minor differences from the previous sections. First, we add a single binary feature to indicate if the token contains a capital letter. Second, our tuning considers rectified linear units as the activation function for the charagram and charCNN architectures. (We did not consider ReLU for the similarity experiments because the final embeddings are used directly to compute cosine similarities, which led to poor performance when restricting the embeddings to be non-negative.)

The results are shown in Table 4. Performance is similar across models. We found that adding a second fully-connected 150-dimensional layer to the charagram model improved results slightly. (We also tried adding a second, 300-dimensional layer for the word and sentence embedding models and found that it hurt performance.)

4.3 Convergence

One observation we made during our experiments was that different models converged at significantly different rates. Figure 1 plots the performance of the word similarity and tagging tasks as a function of the number of examples processed during training. For word similarity, we plot the oracle Spearman’s $\rho$ on SL999, while for tagging we plot tagging accuracy on the validation set. We evaluate performance every quarter epoch (approximately every 194,252 word pairs) for word similarity and every epoch for tagging. We only show the first 10 epochs of training in the tagging plot.

The plots show that the charagram model converges quickly to high performance. The charCNN and charLSTM models take many more epochs to converge. Even with tagging, which uses a very high learning rate, charagram converges significantly faster than the others. For word similarity, it appears that charCNN and charLSTM are still slowly improving at the end of 50 epochs. This suggests that if training were done for a much longer period, and possibly on more data, the charLSTM or charCNN models could match and surpass the charagram model. However, due to the large training sets available from PPDB and the computational requirements of these architectures, we were unable to explore the regime of training for many epochs. We conjecture that slow convergence could be the reason for the inferior performance of LSTMs for similarity tasks as reported by ?).

4.4 Model Size Experiments

The default setting for our charagram and charagram-phrase models is to use all character bigrams, trigrams, and 4-grams that occur in the training data at least $C$ times, tuning $C$ over the set $\{1,2\}$. This results in a large number of parameters, which could be seen as an unfair advantage over the comparatively smaller charCNN and charLSTM models, which have up to 881,025 and 763,200 parameters respectively in the similarity experiments. (These counts include 134 character embeddings.)

On the other hand, for a given training example, very few parameters in the charagram model are actually used. For the charCNN and charLSTM models, by contrast, all parameters are used except the character embeddings for those characters that are not present in the example. For a sentence with 100 characters, and when using the 300-dimensional charagram model with bigrams, trigrams, and 4-grams, there are approximately 90,000 parameters in use for this sentence, far fewer than those used by the charCNN and charLSTM for the same sentence.
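The estimate of roughly 90,000 active parameters can be checked with a few lines of arithmetic. The function below is a back-of-the-envelope sketch: it assumes every $n$-gram position holds a distinct, in-vocabulary $n$-gram, ignores padding characters, and counts the bias vector once; the name `active_parameters` is hypothetical.

```python
def active_parameters(num_chars, dim=300, orders=(2, 3, 4)):
    """Upper bound on parameters touched by one sequence in a charagram-style model."""
    # A sequence of length L has L - n + 1 positions for n-grams of length n.
    ngram_positions = sum(max(num_chars - n + 1, 0) for n in orders)
    return ngram_positions * dim + dim   # n-gram vectors plus the bias vector

count = active_parameters(100)   # (99 + 98 + 97) * 300 + 300 = 88,500
```

This is in line with the figure of approximately 90,000 given above; repeated or out-of-vocabulary $n$-grams would only lower it.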

We performed a series of experiments to investigate how the charagram and charagram-phrase models perform with different numbers and lengths of character $n$ -grams. For a given $k$ , we took the top $k$ most frequent character $n$ -grams for each value of $n$ in use. We experimented with $k$ values in $\{100,1000,50000\}$ . If there were fewer than $k$ unique character $n$ -grams for a given $n$ , we used all of them. For these experiments, we did very little tuning, setting the regularization strength to 0 and only tuning over the activation function. We repeated this experiment for all three of our tasks. For word similarity, we report performance on SL999 after training for 5 epochs on the lexical section of PPDB XXL. For sentence similarity, we report the average Pearson’s $r$ over all 22 datasets after training for 5 epochs on the phrasal section of PPDB XL. For tagging, we report accuracy on the validation set after training for 50 epochs. The results are shown in Table 5.
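The per-order truncation described above (the top $k$ most frequent $n$-grams for each $n$, or all of them when fewer than $k$ exist) can be sketched as follows. The name `top_k_per_order` is hypothetical, and the inputs are assumed to be already padded with spaces.

```python
from collections import Counter

def top_k_per_order(sequences, k, orders=(2, 3, 4)):
    """For each n, keep the k most frequent character n-grams (all of them if fewer than k)."""
    vocab = []
    for n in orders:
        counts = Counter(s[i:i + n] for s in sequences for i in range(len(s) - n + 1))
        vocab.extend(gram for gram, _ in counts.most_common(k))
    return vocab

toy = [" the ", " then ", " that "]
small = top_k_per_order(toy, k=2)       # two n-grams per order
full = top_k_per_order(toy, k=1000)     # k exceeds the number available, so all are kept
```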

When using extremely small models with only 100 $n$ -grams of each order, we still see relatively strong performance on POS tagging. However, the semantic similarity tasks require far more $n$ -grams to yield strong performance. Using 1000 $n$ -grams clearly outperforms 100, and 50,000 $n$ -grams performs best.

Analysis

One of our primary motivations for character-based models is to address the issue of out-of-vocabulary (OOV) words, which were found to be one of the main sources of error for the paragram-phrase model from ?). They reported a negative correlation (Pearson’s $r$ of -0.45) between OOV rate and performance. We took the 12,108 sentence pairs in all 20 SemEval STS tasks and binned them by the total number of unknown words in the pairs. (Unknown words were defined as those not present in the 1.7 million unique (case-insensitive) tokens that comprise the vocabulary for the GloVe embeddings available at http://nlp.stanford.edu/projects/glove/. The paragram-sl999 embeddings, used to initialize the paragram-phrase model, use this same vocabulary.) We computed Pearson’s $r$ over each bin. The results are shown in Table 6.
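The binning procedure can be sketched with NumPy as follows. The function name `pearson_by_bin`, the bin edges, and the toy scores are all hypothetical and serve only to show the mechanics; they are not the data or bins used in the experiments.

```python
import numpy as np

def pearson_by_bin(pred, gold, unknown_counts, bins):
    """Pearson's r between predictions and gold scores within each unknown-word bin.

    bins: list of (low, high) inclusive ranges over the unknown-word count
          (hypothetical bin edges).
    """
    pred, gold, unk = map(np.asarray, (pred, gold, unknown_counts))
    results = {}
    for low, high in bins:
        mask = (unk >= low) & (unk <= high)
        if mask.sum() >= 2:                      # r is undefined for fewer than 2 points
            results[(low, high)] = float(np.corrcoef(pred[mask], gold[mask])[0, 1])
    return results

r = pearson_by_bin(
    pred=[0.1, 0.5, 0.9, 0.8, 0.2, 0.4],         # toy model similarities
    gold=[1.0, 3.0, 5.0, 1.0, 2.0, 3.0],         # toy gold scores
    unknown_counts=[0, 0, 0, 1, 2, 2],
    bins=[(0, 0), (1, 2)],
)
```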

The charagram-phrase model has better performance for each number of unknown words. The paragram-phrase model degrades when more unknown words are present, presumably because it is forced to use the same unknown word embedding for all unknown words. The charagram-phrase model has no notion of unknown words, as it can embed any character sequence.

We next investigated the sensitivity of the two models to length, as measured by the maximum of the lengths of the two sentences in a pair. We binned all of the 12,108 sentence pairs in the 20 SemEval STS tasks by length and then again found the Pearson’s $r$ for both the paragram-phrase and charagram-phrase models. The results are shown in Table 7.

We find that both models are robust to sentence length, achieving the highest correlations on the longest sentences. We also find that charagram-phrase outperforms paragram-phrase at all sentence lengths.

5.2 Qualitative Analysis

Aside from OOVs, the paragram-phrase model lacks the ability to model word order or co-occurrence, since it simply averages the words in the sequence. We were interested to see whether charagram-phrase could handle negation, since it does model limited information about word order (via character $n$-grams that span multiple words in the sequence). We made a list of “not” bigrams that could be represented by a single word, then embedded each bigram using both models and did a nearest-neighbor search over a working vocabulary. (This vocabulary contained all words in PPDB XXL, our evaluations, and two other datasets: the Stanford Sentiment task [Socher et al., 2013] and the SNLI dataset [Bowman et al., 2015], resulting in 93,217 unique (up-to-casing) tokens.) The results, in Table 8, show how the charagram-phrase embeddings model negation. In all cases but one, the nearest neighbor is a paraphrase for the bigram and the next neighbors are mostly paraphrases as well. The paragram-phrase model, unsurprisingly, is incapable of modeling negation. In all cases, the nearest neighbor is the word “not”, as this word carries much more weight than the word it modifies. The remaining nearest neighbors are either the modified word or stalled.
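A nearest-neighbor search of this kind can be sketched with cosine similarity over a matrix of vocabulary embeddings. The function name `nearest_neighbors` and the two-dimensional toy vectors are hypothetical; they are not taken from any trained model.

```python
import numpy as np

def nearest_neighbors(query_vec, vocab_words, vocab_vecs, k=5):
    """Return the k vocabulary words whose embeddings have the highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = m @ q                                  # cosine similarity to every word
    top = np.argsort(-sims)[:k]                   # indices of the k largest similarities
    return [(vocab_words[i], float(sims[i])) for i in top]

# Toy vocabulary with made-up vectors, only to show the call pattern.
words = ["bad", "good", "not"]
vecs = np.array([[1.0, 0.1], [-1.0, 0.2], [0.2, 1.0]])
neighbors = nearest_neighbors(np.array([0.9, 0.2]), words, vecs, k=2)
```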

We did two additional nearest neighbor explorations with our charagram-phrase model. In the first, we collected the nearest neighbors for words that were not in the training data (i.e., PPDB XXL), but were in our working vocabulary. This consisted of 59,660 words. In the second, we collected nearest neighbors of words that were in our training data, which consisted of 37,765 tokens.

A sample of the nearest neighbors is shown in Table 9. Several kinds of similarity are being captured simultaneously by the model. One kind is similarity in terms of spelling variation, including misspellings (vehicals, vehicels, and vehicles) and repetition for emphasis (baby and babyyyyyyy). Another kind is similarity in terms of morphological variants of a shared root (e.g., journeying and journey). We also see that the model has learned many strong synonym relationships without significant amounts of overlapping $n$ -grams (e.g., vehicles, cars, and automobiles). We find these characteristics for words both in and out of the training data. Words in the training data, which tend to be more commonly used, do tend to have higher precision in their nearest neighbors (e.g., see neighbors for huge). We noted occasional mistakes for words that share a large number of $n$ -grams but are not paraphrases (see nearest neighbors for litered which is likely a misspelling of littered).

Lastly, since our model learns embeddings for character $n$ -grams, we include an analysis of character $n$ -gram nearest neighbors in Table 10. These $n$ -grams appear to be grouped into themes, such as death (first row), food (second row), and speed (third row), but have different granularities. The $n$ -grams in the last row appear in paraphrases of 2, whereas the second-to-last row shows $n$ -grams in words like french and vocabulary, which can broadly be classified as having to do with language.

Conclusion

We performed a careful empirical comparison of character-based compositional architectures on three NLP tasks. While most prior work has considered machine translation, language modeling, and syntactic analysis, we showed how character-level modeling can improve semantic similarity tasks, both quantitatively and with extensive qualitative analysis. We found a consistent trend: the simplest architecture converges fastest to high performance. These results, coupled with those from ?), suggest that practitioners should begin with simple architectures rather than moving immediately to RNNs and CNNs. We release our code and trained models so they can be used by the NLP community for general-purpose, character-based text representation.

Acknowledgments

We would like to thank the developers of Theano [Theano Development Team, 2016] and NVIDIA Corporation for donating GPUs used in this research.

Appendix A Training

For word and sentence similarity, we follow the training procedure of ?) and ?), described below. For part-of-speech tagging, we follow the English Penn Treebank training procedure of ?).

For the similarity tasks, the training data consists of a set $X$ of phrase pairs $\langle x_1,x_2\rangle$ from the Paraphrase Database (PPDB; Ganitkevitch et al., 2013), where $x_1$ and $x_2$ are assumed to be paraphrases. We optimize a margin-based loss:
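The display for this objective (Eq. 2) does not survive in this text. A form consistent with the description that follows is sketched here; the $1/|X|$ normalization and the squared-norm form of the regularizer are assumptions, while $g$, $\delta$, $\theta$, $\lambda$, $t_1$, and $t_2$ are the symbols defined in the surrounding text.

$$
\min_{\theta}\ \frac{1}{|X|}\sum_{\langle x_1,x_2\rangle\in X}\Big[\max\big(0,\ \delta-\cos(g(x_1),g(x_2))+\cos(g(x_1),g(t_1))\big)+\max\big(0,\ \delta-\cos(g(x_1),g(x_2))+\cos(g(x_2),g(t_2))\big)\Big]+\lambda\lVert\theta\rVert^{2} \tag{2}
$$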

where $g$ is the embedding function in use, $\delta$ is the margin, the full set of parameters is contained in $\theta$ (e.g., for the charagram model, $\theta=\langle W,\mathbf{b}\rangle$), $\lambda$ is the $L_2$ regularization coefficient, and $t_1$ and $t_2$ are carefully selected negative examples taken from a mini-batch during optimization (discussed below). Intuitively, we want the two phrases to be more similar to each other ($\cos(g(x_1),g(x_2))$) than either is to their respective negative examples $t_1$ and $t_2$, by a margin of at least $\delta$.

To select $t_1$ and $t_2$ in Eq. 2, we tune the choice between two approaches. The first, MAX, simply chooses the most similar phrase in some set of phrases (other than those in the given phrase pair). For simplicity and to reduce the number of tunable parameters, we use the mini-batch for this set, but it could be a separate set. Formally, MAX corresponds to choosing $t_1$ for a given $\langle x_1,x_2\rangle$ as follows:
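The display for this selection rule does not survive in this text. A form consistent with the description is sketched here; the set notation under the $\operatorname{argmax}$ is an assumption, and $t_2$ would be chosen analogously with $x_2$ in place of $x_1$.

$$
t_1=\operatorname*{argmax}_{t\,:\,\langle t,\cdot\rangle\in X_b\setminus\{\langle x_1,x_2\rangle\}}\cos\big(g(x_1),g(t)\big)
$$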

where $X_b\subseteq X$ is the current mini-batch. That is, we want to choose a negative example $t_i$ that is similar to $x_i$ according to the current model parameters. The downside of this approach is that we may occasionally choose a phrase $t_i$ that is actually a true paraphrase of $x_i$.

The second strategy selects negative examples using MAX with probability 0.5 and selects them randomly from the mini-batch otherwise. We call this sampling strategy MIX. We tune over the choice of strategy in our experiments.
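The two sampling strategies can be sketched over a mini-batch of precomputed embeddings. This is an illustrative reading rather than the released implementation: the function name `select_negatives`, the array layout, and the choice to draw candidates from both sides of every other pair in the batch are assumptions.

```python
import numpy as np

def select_negatives(emb1, emb2, strategy="MAX", rng=None):
    """Pick a negative example t1 for each x1 in a mini-batch.

    emb1, emb2: arrays of shape (batch, d) holding g(x1) and g(x2) for each pair.
    Returns indices into the stacked candidates [emb1; emb2] (length 2 * batch).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    batch = emb1.shape[0]
    cands = np.vstack([emb1, emb2])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    sims = cands[:batch] @ cands.T            # (batch, 2*batch) cosine similarities
    rows = np.arange(batch)
    sims[rows, rows] = -np.inf                # exclude x1 itself
    sims[rows, rows + batch] = -np.inf        # exclude its paraphrase x2
    neg = sims.argmax(axis=1)                 # MAX: most similar remaining phrase
    if strategy == "MIX":
        for i in range(batch):
            if rng.random() >= 0.5:           # otherwise keep the MAX choice
                allowed = [j for j in range(2 * batch) if j not in (i, i + batch)]
                neg[i] = rng.choice(allowed)  # uniform random negative from the batch
    return neg

emb1 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
emb2 = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]])
neg_max = select_negatives(emb1, emb2, "MAX")
neg_mix = select_negatives(emb1, emb2, "MIX")
```

Negatives for $x_2$ would be selected the same way with the roles of `emb1` and `emb2` swapped.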

Appendix B Tuning Word Similarity Models

For all architectures, we tuned over the mini-batch size (25 or 50) and the type of sampling used (MIX or MAX). $\delta$ was set to 0.4 and the dimensionality $d$ of each model was set to 300.

For the charagram model, we tuned the activation function $h$ ( $\tanh$ or linear) and regularization coefficient $\lambda$ (over $\{10^{-4},10^{-5},10^{-6}\}$ ). The $n$ -gram vocabulary $V$ contained all 100,283 character $n$ -grams ( $n\in\{2,3,4\}$ ) in the lexical section of PPDB XXL.

For charCNN and charLSTM, we randomly initialized 300 dimensional character embeddings for all unique characters in the training data. For charLSTM, we tuned over whether to include an output gate. For charCNN, we tuned the filter activation function (rectified linear or $\tanh$ ) and tuned the activation for the fully-connected layer ( $\tanh$ or linear). For both the charLSTM and charCNN models, we tuned $\lambda$ over $\{10^{-4},10^{-5},10^{-6}\}$ .

Appendix C Full Sentence Similarity Results

Table 11 shows the full results of our sentence similarity experiments.