Augmenting Neural Networks with First-order Logic

Tao Li, Vivek Srikumar

Introduction

Neural models demonstrate remarkable predictive performance across a broad spectrum of NLP tasks: e.g., natural language inference Parikh et al. (2016), machine comprehension Seo et al. (2017), machine translation Bahdanau et al. (2015), and summarization Rush et al. (2015). These successes can be attributed to their ability to learn robust representations from data. However, such end-to-end training demands a large number of training examples; for example, training a typical network for machine translation may require millions of sentence pairs (e.g. Luong et al., 2015). The difficulties and expense of curating large amounts of annotated data are well understood and, consequently, massive datasets may not be available for new tasks, domains or languages.

In this paper, we argue that we can combat the data hungriness of neural networks by taking advantage of domain knowledge expressed as first-order logic. As an example, consider the task of reading comprehension, where the goal is to answer a question based on a paragraph of text (Fig. 1). Attention-driven models such as BiDAF Seo et al. (2017) learn to align words in the question with words in the text as an intermediate step towards identifying the answer. While alignments (e.g. author to writing) can be learned from data, we argue that models can reduce their data dependence if they are guided by easily stated rules such as: Prefer aligning phrases that are marked as similar according to an external resource, e.g., ConceptNet Liu and Singh (2004). If such declaratively stated rules can be incorporated into training neural networks, then they can provide the inductive bias that reduces data dependence for training.

That general neural networks can represent such Boolean functions is known and has been studied both from the theoretical and empirical perspectives (e.g. Maass et al., 1994; Anthony, 2003; Pan and Srikumar, 2016). Recently, Hu et al. (2016) exploit this property to train a neural network to mimic a teacher network that uses structured rules. In this paper, we seek to directly incorporate such structured knowledge into a neural network architecture without substantial changes to the training methods. We focus on three questions:

Can we integrate declarative rules with end-to-end neural network training?

Can such rules help ease the need for data?

How does incorporating domain expertise compare against large training resources powered by pre-trained representations?

The first question poses the key technical challenge we address in this paper. On one hand, we wish to guide training and prediction with neural networks using logic, which is non-differentiable. On the other hand, we seek to retain the advantages of gradient-based learning without having to redesign the training scheme. To this end, we propose a framework that allows us to systematically augment an existing network architecture using constraints about its nodes by deterministically converting rules into differentiable computation graphs. To allow for the possibility of such rules being incorrect, our framework is designed to admit soft constraints from the ground up. Our framework is compatible with off-the-shelf neural networks without extensive redesign or any additional trainable parameters.

To address the second and the third questions, we empirically evaluate our framework on three tasks: machine comprehension, natural language inference, and text chunking. In each case, we use a general off-the-shelf model for the task, and study the impact of simple logical constraints on observed neurons (e.g., attention) for different data sizes. We show that our framework can successfully improve an existing neural design, especially when the number of training examples is limited.

We introduce a new framework for incorporating first-order logic rules into neural network design in order to guide both training and prediction.

We evaluate our approach on three different NLP tasks: machine comprehension, textual entailment, and text chunking. We show that augmented models lead to large performance gains in the low training data regimes. The code used for our experiments is archived here: https://github.com/utahnlp/layer_augmentation

Problem Setup

In this section, we will introduce the notation and assumptions that form the basis of our formalism for constraining neural networks.

Neural networks are directed acyclic computation graphs $G=(V,E)$, consisting of nodes (i.e. neurons) $V$ and weighted directed edges $E$ that represent information flow. Although not all neurons have explicitly grounded meanings, some nodes indeed can be endowed with semantics tied to the task. Node semantics may be assigned during model design (e.g. attention), or incidentally discovered in post hoc analysis (e.g., Le et al., 2012; Radford et al., 2017, and others). In either case, our goal is to augment a neural network with such named neurons using declarative rules.

The use of logic to represent domain knowledge has a rich history in AI (e.g. Russell and Norvig, 2016). In this work, to capture such knowledge, we will primarily focus on conditional statements of the form $L\rightarrow R$, where the expression $L$ is the antecedent (or the left-hand side) that can be conjunctions or disjunctions of literals, and $R$ is the consequent (or the right-hand side) that consists of a single literal. Note that such rules include Horn clauses and their generalizations, which are well studied in the knowledge representation and logic programming communities (e.g. Chandra and Harel, 1985).

Integrating rules with neural networks presents three difficulties. First, we need a mapping between the predicates in the rules and nodes in the computation graph. Second, logic is not differentiable; we need an encoding of logic that admits training using gradient based methods. Finally, computation graphs are acyclic, but user-defined rules may introduce cyclic dependencies between the nodes. Let us look at these issues in order.

As mentioned before, we will assume named neurons are given. By associating predicates with such nodes that are endowed with symbolic meaning, we can introduce domain knowledge about a problem in terms of these predicates. In the rest of the paper, we will use lowercase letters (e.g., $a_{i},b_{j}$) to denote nodes in a computation graph, and uppercase letters (e.g., $A_{i},B_{j}$) for predicates associated with them.

To deal with the non-differentiability of logic, we will treat the post-activation value of a named neuron as the degree to which the associated predicate is true. In §3, we will look at methods for compiling conditional statements into differentiable statements that augment a given network.

Since we will augment computation graphs with compiled conditional forms, we should be careful to avoid creating cycles. To formalize this, let us define cyclicity of conditional statements with respect to a neural network.

Given two nodes $a$ and $b$ in a computation graph, we say that the node $a$ is upstream of node $b$ if there is a directed path from $a$ to $b$ in the graph.

Let $G$ be a computation graph. An implicative statement $L\rightarrow R$ is cyclic with respect to $G$ if, for any literal $R_{i}\in R$, the node $r_{i}$ associated with it is upstream of the node $l_{j}$ associated with some literal $L_{j}\in L$. An implicative statement is acyclic if it is not cyclic.

Fig. 2 and its caption give examples of cyclic and acyclic implications. A cyclic statement can sometimes be converted to an equivalent acyclic statement by constructing its contrapositive. For example, the constraint $B_{1}\rightarrow A_{1}$ is equivalent to $\neg A_{1}\rightarrow\neg B_{1}$. While the former is cyclic, the latter is acyclic. Generally, we can assume that we have acyclic implications. As we will see in §3.3, the contrapositive does not always help because we may end up with a complex right-hand side that we cannot yet compile into the computation graph.
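The upstream relation and the cyclicity test can be checked mechanically. The following is a minimal sketch, assuming the computation graph is available as a dictionary mapping each node to its direct successors. The helper names `is_upstream` and `is_cyclic` and the toy graph are illustrative assumptions, not part of the released code.

```python
from collections import deque

def is_upstream(graph, a, b):
    """Return True if there is a directed path from node a to node b."""
    queue, seen = deque([a]), {a}
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ == b:
                return True
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

def is_cyclic(graph, antecedent_nodes, consequent_nodes):
    """An implication is treated as cyclic if a consequent node is
    upstream of some antecedent node in the graph."""
    return any(is_upstream(graph, r, l)
               for r in consequent_nodes
               for l in antecedent_nodes)

# Hypothetical toy graph: a1 feeds b1, which feeds the output node.
toy_graph = {"a1": ["b1"], "b1": ["out"], "out": []}

# B1 -> A1: the consequent node a1 is upstream of the antecedent node b1.
print(is_cyclic(toy_graph, antecedent_nodes=["b1"], consequent_nodes=["a1"]))
# Contrapositive (not A1) -> (not B1): antecedent a1, consequent b1.
print(is_cyclic(toy_graph, antecedent_nodes=["a1"], consequent_nodes=["b1"]))
```

In this toy graph the first statement is reported as cyclic and its contrapositive as acyclic, mirroring the example in the text.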

A Framework for Augmenting Neural Networks with Constraints

To create constraint-aware neural networks, we will extend the computation graph of an existing network with additional edges defined by constraints. In §3.1, we will focus on the case where the antecedent is conjunctive/disjunctive and the consequent is a single literal. In §3.2, we will cover more general antecedents.

Given a computation graph, suppose we have an acyclic conditional statement: $Z\rightarrow Y$, where $Z$ is a conjunction or a disjunction of literals and $Y$ is a single literal. We define the neuron associated with $Y$ to be $y=g\left(\mathbf{W}\mathbf{x}\right)$, where $g$ denotes an activation function, $\mathbf{W}$ are network parameters, and $\mathbf{x}$ is the immediate input to $y$. Further, let the vector $\mathbf{z}$ represent the neurons associated with the predicates in $Z$. While the nodes $\mathbf{z}$ need to be named neurons, the immediate input $\mathbf{x}$ need not necessarily have symbolic meaning.

Our goal is to augment the computation of $y$ so that whenever $Z$ is true, the pre-activated value of $y$ increases if the literal $Y$ is not negated (and decreases if it is). To do so, we define a constrained neural layer as

$$y=g\left(\mathbf{W}\mathbf{x}+\rho\,d\left(\mathbf{z}\right)\right) \qquad (1)$$

Here, we will refer to the function $d$ as the distance function that captures, in a differentiable way, whether the antecedent of the implication holds. The importance of the entire constraint is decided by a real-valued hyper-parameter $\rho\geq 0$.

The definition of the constrained neural layer says that, by compiling an implicative statement into a distance function, we can regulate the pre-activation scores of the downstream neurons based on the states of upstream ones.
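To make the effect of the constrained layer concrete, here is a minimal sketch for a single neuron. It assumes the layer has the form $y=g(\mathbf{W}\mathbf{x}+\rho\,d(\mathbf{z}))$ with a sigmoid activation, and that the distance value has already been computed. The function name `constrained_neuron` and the numbers are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def constrained_neuron(w, x, d_z, rho):
    """Compute y = sigmoid(w . x + rho * d_z).

    d_z is the distance value of the antecedent, assumed to lie in [0, 1];
    rho >= 0 controls how strongly the constraint shifts the pre-activation.
    """
    pre_activation = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(pre_activation + rho * d_z)

w, x = [0.5, -1.0], [1.0, 1.0]   # hypothetical weights and input
for rho in (0.0, 1.0, 4.0, 16.0):
    # With d_z = 1 (antecedent fully true), a larger rho pushes y towards 1.
    print(rho, round(constrained_neuron(w, x, d_z=1.0, rho=rho), 4))
```

With $\rho=0$ the neuron reduces to the original $g(\mathbf{W}\mathbf{x})$, and when the distance is zero the value of $\rho$ has no effect, so no trainable parameter is introduced.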

Designing the distance function

The key consideration in the compilation step is the choice of an appropriate distance function for logical statements. The ideal distance function we seek is the indicator for the statement $Z$:

$$d_{ideal}(\mathbf{z})=\begin{cases}1, & \text{if } Z \text{ holds},\\ 0, & \text{otherwise}.\end{cases}$$

However, since the function $d_{ideal}$ is not differentiable, we need smooth surrogates.

In the rest of this paper, we will define and use distance functions that are inspired by probabilistic soft logic (cf. Klement et al., 2013) and its use of the Łukasiewicz T-norm and T-conorm to define a soft version of conjunctions and disjunctions. The definition of the distance functions here as surrogates for the non-differentiable $d_{ideal}$ is reminiscent of the use of hinge loss as a surrogate for the zero-one loss. In both cases, other surrogates are possible.

Table 1 summarizes distance functions corresponding to conjunctions and disjunctions. In all cases, recall that the $z_{i}$'s are the states of neurons and are assumed to be in the range $[0,1]$. Examining the table, we see that with a conjunctive antecedent (first row), the distance becomes zero if even one of the conjuncts is false. For a disjunctive antecedent (second row), the distance becomes zero only when all the disjuncts are false; otherwise, it increases as the disjuncts become more likely to be true.
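As an illustration, the standard n-ary Łukasiewicz T-norm and T-conorm can be written in a few lines. This is a sketch under the assumption that the distance functions take these standard forms, which match the behaviour described above. The function names are hypothetical.

```python
def d_conjunction(z):
    """Lukasiewicz T-norm over n inputs: max(0, sum(z) - (n - 1))."""
    return max(0.0, sum(z) - (len(z) - 1))

def d_disjunction(z):
    """Lukasiewicz T-conorm over n inputs: min(1, sum(z))."""
    return min(1.0, sum(z))

# One false conjunct drives the conjunctive distance to zero.
print(d_conjunction([1.0, 1.0, 0.0]))
# Partially true conjuncts give a partial distance.
print(d_conjunction([0.9, 0.8]))
# The disjunctive distance is zero only when every disjunct is false.
print(d_disjunction([0.0, 0.0]))
print(d_disjunction([0.3, 0.4]))
```

Both functions are piecewise linear in the neuron states, so gradients can flow through them wherever the clipping is inactive.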

Negating Predicates

Both the antecedent (the $Z$'s) and the consequent ($Y$) could contain negated predicates. We will consider these separately.

For any negated antecedent predicate, we modify the distance function by substituting the corresponding $z_{i}$ with $1-z_{i}$ in Table 1. The last two rows of the table list out two special cases, where the entire antecedents are negated, and can be derived from the first two rows.

To negate consequent $Y$, we need to reduce the pre-activation score of neuron $y$. To achieve this, we can simply negate the entire distance function.
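The derivation of the fully negated cases can be written out explicitly. The following worked step assumes that the conjunctive and disjunctive distances are the standard n-ary Łukasiewicz forms $\max(0,\sum_i z_i-(n-1))$ and $\min(1,\sum_i z_i)$. It applies De Morgan's laws and then substitutes $1-z_{i}$ for each $z_{i}$.

```latex
\begin{align*}
\neg\bigvee_{i} Z_i \equiv \bigwedge_{i}\neg Z_i
  &:\quad d(\mathbf{z}) = \max\Big(0,\ \sum_{i=1}^{n}(1-z_i)-(n-1)\Big)
     = \max\Big(0,\ 1-\sum_{i=1}^{n} z_i\Big) \\
\neg\bigwedge_{i} Z_i \equiv \bigvee_{i}\neg Z_i
  &:\quad d(\mathbf{z}) = \min\Big(1,\ \sum_{i=1}^{n}(1-z_i)\Big)
     = \min\Big(1,\ n-\sum_{i=1}^{n} z_i\Big)
\end{align*}
```

For a negated consequent, the same distance enters the pre-activation with a minus sign, i.e. $\mathbf{W}\mathbf{x}-\rho\,d(\mathbf{z})$.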

Scaling factor $\rho$

In Eq. 1, the distance function serves to promote or inhibit the value of the downstream neuron. The extent is controlled by the scaling factor $\rho$. For instance, with $\rho=+\infty$, the pre-activation score of the downstream neuron is dominated by the distance function. In this case, we have a hard constraint. In contrast, with a small $\rho$, the output state depends on both $\mathbf{W}\mathbf{x}$ and the distance function. In this case, the soft constraint serves more as a suggestion. Ultimately, the network parameters might overrule the constraint. We will see an example in §4 where a noisy constraint favors a small $\rho$.

3.2 General Boolean Antecedents

So far, we exclusively focused on conditional statements with either conjunctive or disjunctive antecedents. In this section, we will consider general antecedents.

As an illustrative example, suppose we have an antecedent $(\neg A\vee B)\wedge(C\vee D)$. By introducing auxiliary variables, we can convert it into the conjunctive form $P\wedge Q$, where $(\neg A\vee B)\leftrightarrow P$ and $(C\vee D)\leftrightarrow Q$. To perform such an operation, we need to: (1) introduce auxiliary neurons associated with the auxiliary predicates $P$ and $Q$, and, (2) define these neurons to be exclusively determined by the biconditional constraint.

To be consistent in terminology, when considering the biconditional statement $(\neg A\vee B)\leftrightarrow P$, we will call the auxiliary literal $P$ the consequent, and the original literals $A$ and $B$ the antecedents.

Because the implication is bidirectional in a biconditional statement, it violates our acyclicity requirement in §3.1. However, since the auxiliary neuron state does not depend on any other nodes, we can still create an acyclic sub-graph by defining the new node to be the distance function itself.

With a biconditional statement $Z\leftrightarrow Y$, where $Y$ is an auxiliary literal, we define a constrained auxiliary layer as

$$y=d\left(\mathbf{z}\right)$$

where $d$ is the distance function for the statement, $\mathbf{z}$ are upstream neurons associated with $Z$, and $y$ is the downstream neuron associated with $Y$. Note that, compared to Eq. 1, we do not need an activation function since the distance, which is in $[0,1]$, can be interpreted as producing normalized scores.

Note that this construction only applies to auxiliary predicates in biconditional statements. The advantage of this layer definition is that we can use the same distance functions as before (i.e., Table 1). Furthermore, the same design considerations in §3.1 still apply here, including how to negate the left and right hand sides.
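The auxiliary-neuron construction for the example antecedent $(\neg A\vee B)\wedge(C\vee D)$ can be sketched as follows. The sketch assumes the standard Łukasiewicz forms for the distance functions and sets each auxiliary neuron equal to its distance value. All function names are illustrative.

```python
def d_conjunction(z):
    """Lukasiewicz T-norm over n inputs: max(0, sum(z) - (n - 1))."""
    return max(0.0, sum(z) - (len(z) - 1))

def d_disjunction(z):
    """Lukasiewicz T-conorm over n inputs: min(1, sum(z))."""
    return min(1.0, sum(z))

def antecedent_distance(a, b, c, d):
    """Soft truth of (not A or B) and (C or D) via two auxiliary neurons."""
    p = d_disjunction([1.0 - a, b])  # auxiliary neuron for (not A or B) <-> P
    q = d_disjunction([c, d])        # auxiliary neuron for (C or D) <-> Q
    return d_conjunction([p, q])     # distance for the conjunctive form P and Q

# A true, B false: the first clause fails, so the antecedent is false.
print(antecedent_distance(a=1.0, b=0.0, c=1.0, d=0.0))
# A false: the first clause holds, and C makes the second clause hold.
print(antecedent_distance(a=0.0, b=0.0, c=1.0, d=0.0))
```

The returned value can then be used as $d(\mathbf{z})$ in a constrained layer for the original consequent.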

Constructing augmented networks

To complete the modeling framework, we summarize the workflow needed to construct an augmented neural network given a conditional statement and a computation graph: (1) Convert the antecedent into a conjunctive or a disjunctive normal form if necessary. (2) Convert the conjunctive/disjunctive antecedent into distance functions using Table 1 (with appropriate corrections for negations). (3) Use the distance functions to construct constrained layers and/or auxiliary layers to augment the computation graph by replacing the original layer with the constrained one. (4) Finally, use the augmented network for end-to-end training and inference. We will see complete examples in §4.

3.3 Discussion

Not only does our design not add any more trainable parameters to the existing network, it also admits efficient implementation with modern neural network libraries.

When posing multiple constraints on the same downstream neuron, there could be combinatorial conflicts. In this case, our framework relies on the base network to handle the consistency issue. In practice, we found that summing the constrained pre-activation scores for a neuron is a good heuristic (as we will see in §4.3).

For a conjunctive consequent, we can decompose it into multiple individual constraints. That is equivalent to constraining downstream nodes independently. Handling more complex consequents is a direction of future research.

Experiments

In this section, we will answer the research questions raised in §1 by focusing on the effectiveness of our augmentation framework. Specifically, we will explore three types of constraints by augmenting: 1) intermediate decisions (i.e. attentions); 2) output decisions constrained by intermediate states; 3) output decisions constrained using label dependencies.

To this end, we instantiate our framework on three tasks: machine comprehension, natural language inference, and text chunking. Across all experiments, our goal is to study the modeling flexibility of our framework and its ability to improve performance, especially with decreasing amounts of training data.

To study low data regimes, our augmented networks are trained using varying amounts of training data to see how performances vary from baselines. For detailed model setup, please refer to the appendices.

4.1 Machine Comprehension

Attention is a widely used intermediate state in several recent neural models. To explore the augmentation over such neurons, we focus on attention-based machine comprehension models on the SQuAD (v1.1) dataset Rajpurkar et al. (2016). We seek to use word relatedness from external resources (i.e., ConceptNet) to guide alignments, and thus to improve model performance.

We base our framework on two models: BiDAF Seo et al. (2017) and its ELMo-augmented variant Peters et al. (2018). Here, we provide an abstraction of the two models which our framework will operate on:

where $p$ and $q$ are the paragraph and query respectively, $\sigma$ refers to the softmax activation, $\overleftarrow{\mathbf{a}}$ and $\overrightarrow{\mathbf{a}}$ are the bidirectional attentions from $q$ to $p$ and vice versa, $\mathbf{y}$ and $\mathbf{z}$ are the probabilities of answer boundaries. All other aspects are abstracted as $encoder$ and $layers$.

Augmentation

By construction of the attention neurons, we expect that related words should be aligned. In a knowledge-driven approach, we can use ConceptNet to guide the attention values in the model in Eq. 4.

We consider two rules to illustrate the flexibility of our framework. Both are first-order logic statements that are dynamically grounded to the computation graph for a particular paragraph and query. First, we define the following predicates:

$K_{i,j}$: word $p_{i}$ is related to word $q_{j}$ in ConceptNet via edges {Synonym, DistinctFrom, IsA, Related}.

$\overleftarrow{A}_{i,j}$: unconstrained model decision that word $q_{j}$ best matches to word $p_{i}$.

$\overleftarrow{A}^{\prime}_{i,j}$: constrained model decision for the above alignment.

Using these predicates, we will study the impact of the following two rules, defined over a set $C$ of content words in $p$ and $q$:

$R_{1}$: $\forall i,j\in C,~K_{i,j}\rightarrow\overleftarrow{A}^{\prime}_{i,j}$.

$R_{2}$: $\forall i,j\in C,~K_{i,j}\wedge\overleftarrow{A}_{i,j}\rightarrow\overleftarrow{A}^{\prime}_{i,j}$.

The rule $R_{1}$ says that two words should be aligned if they are related. Interestingly, compiling this statement using the distance functions in Table 1 is essentially the same as adding word relatedness as a static feature. The rule $R_{2}$ is more conservative as it also depends on the unconstrained model decision. In both cases, since $K_{i,j}$ does not map to a node in the network, we have to create a new node $k_{i,j}$ whose value is determined using ConceptNet, as illustrated in Fig. 3.
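The two rules can be illustrated on a single row of attention scores. The sketch below is an assumption-laden toy, not the released implementation: it assumes attention is a softmax over pre-activation scores, that relatedness values $k_{i,j}$ are 0/1 lookups, and that the conjunction in $R_{2}$ uses the Łukasiewicz T-norm. The function `augment_attention` and the example numbers are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def augment_attention(scores, related, rho, rule="R1"):
    """Constrained attention for one row of alignment scores.

    scores:  pre-softmax alignment scores
    related: k values in {0, 1} looked up from an external resource
    rho:     scaling factor of the constraint
    """
    if rule == "R1":
        # K -> A': the distance is the relatedness indicator itself.
        dist = list(related)
    else:
        # K and A -> A': conjunction of relatedness and unconstrained attention.
        unconstrained = softmax(scores)
        dist = [max(0.0, k + a - 1.0) for k, a in zip(related, unconstrained)]
    return softmax([s + rho * d for s, d in zip(scores, dist)])

scores = [0.0, 0.0, 0.0]     # hypothetical pre-softmax alignment scores
related = [0.0, 1.0, 0.0]    # only the second word is marked as related
r1 = augment_attention(scores, related, rho=2.0, rule="R1")
r2 = augment_attention(scores, related, rho=2.0, rule="R2")
print([round(v, 3) for v in r1])
print([round(v, 3) for v in r2])
```

In this toy case both rules raise the attention on the related word, and the boost from $R_{2}$ is smaller because it is scaled by the model's own unconstrained attention.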

Can our framework use rules over named neurons to improve model performance?

The answer is yes. We experiment with rules R1R_{1} and R2R_{2} on incrementally larger training data. Performances are reported in Table 2 with comparison with baselines. We see that our framework can indeed use logic to inform model learning and prediction without any extra trainable parameters needed. The improvement is particularly strong with small training sets. With more data, neural models are less reliant on external information. As a result, the improvement with larger datasets is smaller.

How does it compare to pretrained encoders?

Pretrained encoders (e.g. ELMo and BERT Devlin et al. (2018)) improve neural models with improved representations, while our framework augments the graph using first-order logic. It is important to study the interplay of these two orthogonal directions. As Table 2 shows, our augmented model consistently outperforms the baseline even in the presence of ELMo embeddings.

We explored two options to incorporate word relatedness; one is a straightforward constraint (i.e. $R_{1}$), the other is its conservative variant (i.e. $R_{2}$). It is a design choice as to which to use. Clearly in Table 2, constraint $R_{1}$ consistently outperforms its conservative alternative $R_{2}$, even though $R_{2}$ is better than the baseline. In the next task, we will see an example where a conservative constraint performs better with large training data.

4.2 Natural Language Inference

Unlike in the machine comprehension task, here we explore logic rules that bridge attention neurons and output neurons. We use the SNLI dataset Bowman et al. (2015), and base our framework on a variant of the decomposable attention (DAtt, Parikh et al., 2016) model where we replace its projection encoder with a bidirectional LSTM (namely L-DAtt).

Again, we abstract the pipeline of the L-DAtt model, only focusing on layers which our framework works on. Given a premise $p$ and a hypothesis $h$, we summarize the model as:

Here, $\sigma$ is the softmax activation, $\overleftarrow{\mathbf{a}}$ and $\overrightarrow{\mathbf{a}}$ are bidirectional attentions, $\mathbf{y}$ are probabilities for labels Entailment, Contradiction, and Neutral.

Augmentation

We will borrow the predicate notation defined in the machine comprehension task (§4.1), and ground them on premise and hypothesis words, e.g. $K_{i,j}$ now denotes the relatedness between premise word $p_{i}$ and hypothesis word $h_{j}$. In addition, we define the predicate $Y_{l}$ to indicate that the label is $l$. As in §4.1, we define two rules governing attention:

where $C$ is the set of content words. Note that the two constraints apply to both attention directions.

Intuitively, if a hypothesis content word is not aligned, then the prediction should not be Entailment. To use this knowledge, we define the following rule:

where $Z_{1}$ and $Z_{2}$ are auxiliary predicates tied to the $Y^{\prime}_{\text{Entail}}$ predicate. The details of $N_{3}$ are illustrated in Fig. 4.

How does our framework perform with large training data?

The SNLI dataset is a large dataset with over half a million examples. We train our models using incrementally larger percentages of data and report the average performance in Table 3. Similar to §4.1, we observe strong improvements from augmented models trained on small percentages ($\leq$10%) of data. The straightforward constraint $N_{1}$ performs strongly with $\leq$2% data while its conservative alternative $N_{2}$ works better with a larger set. However, with the full dataset, our augmented models perform only on par with the baseline even with a lowered scaling factor $\rho$. These observations suggest that if a large dataset is available, it may be better to believe the data, but with smaller datasets, constraints can provide useful inductive bias for the models.

Are noisy constraints helpful?

It is not always easy to state a constraint that all examples satisfy. Comparing $N_{2}$ and $N_{3}$, we see that $N_{3}$ performed even worse than the baseline, which suggests it contains noise. In fact, we found a significant number of counterexamples to $N_{3}$ during preliminary analysis. Yet, even a noisy rule can improve model performance with $\leq$10% data. The same observation holds for $N_{1}$, which suggests conservative constraints could be a way to deal with noise. Finally, by comparing $N_{2}$ and $N_{2,3}$, we find that the good constraint $N_{2}$ can not only augment the network, but also amplify the noise in $N_{3}$ when they are combined. This results in degrading performance in the $N_{2,3}$ column starting from 5% of the data, much earlier than using $N_{3}$ alone.

4.3 Text Chunking

Attention layers are a modeling choice and do not exist in all networks. To illustrate that our framework is not necessarily grounded to attention, we turn to an application where we use knowledge about the output space to constrain predictions. We focus on the sequence labeling task of text chunking using the CoNLL2000 dataset Tjong Kim Sang and Buchholz (2000). In such sequence tagging tasks, global inference is widely used, e.g., BiLSTM-CRF Huang et al. (2015). Our framework, on the other hand, aims to promote local decisions. To explore the interplay of the global model and local decision augmentation, we will combine a CRF with our framework.

where $x$ is the input sentence, $\sigma$ is softmax, $\mathbf{y}$ are the output probabilities of BIO tags.

Augmentation

We define the following predicates for input and output neurons:

Then we can write rules for pairwise label dependency. For instance, if word $t$ has a B/I- tag for a certain label, word $t+1$ cannot have an I- tag with a different label.

Our second set of rules is also intuitive: a noun should not have a non-NP label.

$C_{5}$: $\forall t,~N_{t}\rightarrow\bigwedge_{l\in\{\text{B-VP},\text{I-VP},\text{B-PP},\text{I-PP}\}}\neg Y^{\prime}_{t,l}$

While all the above rules can be applied as hard constraints in the output space, our framework provides a differentiable way to inform the model during training and prediction.
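The noun rule can be illustrated on the tag scores of a single token. The sketch below assumes that the conjunctive consequent is decomposed into one constraint per non-NP label and that a negated consequent subtracts $\rho$ times the distance from the corresponding pre-softmax score. The reduced label set, the function `constrain_tag_scores` and the numbers are hypothetical.

```python
import math

# Reduced, illustrative label set (the real tag inventory is larger).
LABELS = ["B-NP", "I-NP", "B-VP", "I-VP", "B-PP", "I-PP", "O"]
NON_NP = {"B-VP", "I-VP", "B-PP", "I-PP"}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def constrain_tag_scores(scores, is_noun, rho):
    """Apply N_t -> not Y'_{t,l} independently for each label l in NON_NP."""
    n_t = 1.0 if is_noun else 0.0   # state of the input neuron for N_t
    adjusted = [s - rho * n_t if label in NON_NP else s
                for s, label in zip(scores, LABELS)]
    return softmax(adjusted)

uniform = [0.0] * len(LABELS)       # hypothetical pre-softmax tag scores
noun_probs = constrain_tag_scores(uniform, is_noun=True, rho=4.0)
other_probs = constrain_tag_scores(uniform, is_noun=False, rho=4.0)
print([round(p, 3) for p in noun_probs])
print([round(p, 3) for p in other_probs])
```

For a noun, probability mass moves away from the VP and PP tags, while tokens that are not nouns are left unchanged.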

How does local augmentation compare with global inference?

We report performances in Table 4. While a first-order Markov model (e.g., the BiLSTM-CRF) can learn pairwise constraints such as $C_{1:4}$, we see that our framework can better inform the model. Interestingly, the CRF model performed even worse than the baseline with $\leq$40% data. This suggests that global inference relies on more training examples to learn its scoring function. In contrast, our constrained models performed strongly even with small training sets. And by combining these two orthogonal methods, our locally augmented CRF performed the best with full data.

Related Work and Discussion

Our work is related to neural-symbolic learning (e.g. Besold et al., 2017) which seeks to integrate neural networks with symbolic knowledge. For example, Cingillioglu and Russo (2019) proposed neural models that perform multi-hop logical reasoning.

KBANN Towell et al. (1990) constructs artificial neural networks using connections expressed in propositional logic. Along these lines, França et al. (2014, CILP++) build neural networks from a rule set for relation extraction. Our distinction is that we use first-order logic to augment a given architecture instead of designing a new one. Also, our framework is related to Kimmig et al. (2012, PSL) which uses a smooth extension of standard Boolean logic.

Hu et al. (2016) introduced an imitation learning framework where a specialized teacher-student network is used to distill rules into network parameters. This work could be seen as an instance of knowledge distillation Hinton et al. (2015). Instead of such extensive changes to the learning procedure, our framework retains the original network design and augments existing interpretable layers.

Regularization with Logic

Several recent lines of research seek to guide training neural networks by integrating logical rules in the form of additional terms in the loss functions (e.g., Rocktäschel et al., 2015) that essentially promote constraints among output labels (e.g., Du et al., 2019; Mehta et al., 2018), promote agreement Hsu et al. (2018) or reduce inconsistencies across predictions Minervini and Riedel (2018).

Furthermore, Xu et al. (2018) proposed a general design of loss functions using symbolic knowledge about the outputs. Fischer et al. (2019) described a method for deriving losses that are friendly to gradient-based learning algorithms. Wang and Poon (2018) proposed a framework for integrating indirect supervision expressed via probabilistic logic into neural networks.

Learning with Structures

Traditional structured prediction models (e.g. Smith, 2011) naturally admit constraints of the kind described in this paper. Indeed, our approach for using logic as a template-language is similar to Markov Logic Networks Richardson and Domingos (2006), where logical forms are compiled into Markov networks. Our formulation, which augments model scores with constraint penalties, is reminiscent of the Constrained Conditional Model of Chang et al. (2012).

Recently, we have seen some work that allows backpropagating through structures (e.g. Huang et al., 2015; Kim et al., 2017; Yogatama et al., 2017; Niculae et al., 2018; Peng et al., 2018, and the references within). Our framework differs from them in that structured inference is not mandatory here. We believe that there is room to study the interplay of these two approaches.

Also related to our attention augmentation is the use of word relatedness as an extra input feature to attention neurons (e.g. Chen et al., 2018).

Conclusions

In this paper, we presented a framework for introducing constraints in the form of logical statements to neural networks. We demonstrated the process of converting first-order logic into differentiable components of networks without extra learnable parameters or extensive redesign. Our experiments were designed to explore the flexibility of our framework with different constraints in diverse tasks. As our experiments showed, our framework allows neural models to benefit from external knowledge during learning and prediction, especially when training data is limited.

Acknowledgements

We thank members of the NLP group at the University of Utah for their valuable insights and suggestions; and reviewers for pointers to related works, corrections, and helpful comments. We also acknowledge the support of NSF SaTC-1801446, and gifts from Google and NVIDIA.

References

Appendix A Appendices

Here, we explain our experiment setup for the three tasks: machine comprehension, natural language inference, and text chunking. For each task, we describe the model setup, hyperparameters, and data splits.

For all three tasks, we used Adam Paszke et al. (2017) for training and 300-dimensional GloVe Pennington et al. (2014) vectors (trained on 840B tokens) as word embeddings.

A.1 Machine Comprehension

The SQuAD (v1.1) dataset consists of 87,599 training instances and 10,570 development examples. First, for a specific percentage of training data, we sample from the original training set. Then we split the sampled set into 9/1 folds for training and development. The original development set is reserved for testing only. This is because the official test set is hidden, and the number of models we need to evaluate makes accessing the official test set impractical.

In our implementation of the BiDAF model, we use a learning rate of 0.001 to train the model for 20 epochs. The dropout Srivastava et al. (2014) rate is 0.2. The hidden size of each direction of the BiLSTM encoder is 100. For ELMo models, we train for 25 epochs with a learning rate of 0.0002. The remaining hyperparameters are the same as in Peters et al. (2018). Note that we neither pre-tuned nor post-tuned the ELMo embeddings. The best model on the development split is selected for evaluation. No exponential moving average method is used. The scaling factors $\rho$ are manually grid-searched in $\{1,2,4,8,16\}$ without extensive tuning.
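The sampling procedure can be sketched as follows. This is only an illustration: it reads "9/1 folds" as a 90%/10% train/development split, and the function name, the fixed seed and the use of integer indices as stand-in examples are assumptions.

```python
import random

def sample_and_split(examples, fraction, seed=0):
    """Sample a fraction of the training set, then split it 9/1."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    n_sampled = max(1, int(len(examples) * fraction))
    sampled = rng.sample(examples, n_sampled)
    n_dev = max(1, n_sampled // 10)           # one tenth for development
    return sampled[n_dev:], sampled[:n_dev]   # (train, dev)

# Stand-in for the training instances: integer ids only.
all_ids = list(range(87599))
train, dev = sample_and_split(all_ids, fraction=0.10)
print(len(train), len(dev))
```

The original development set stays untouched by this procedure and is used only for the final evaluation.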

A.2 Natural Language Inference

We use the Stanford Natural Language Inference (SNLI) dataset, which has 549,367 training, 9,842 development, and 9,824 test examples. For each of the percentages of training data, we sample the same proportion from the original development set for validation. To have reliable model selection, we limit the minimal number of sampled development examples to 1000. The original test set is only for reporting.

In our implementation of the BiLSTM variant of the Decomposable Attention (DAtt) model, we adopt a learning rate of 0.0001 for 100 epochs of training. The dropout rate is 0.2. The best model on the development split is selected for evaluation. The scaling factors $\rho$ are manually grid-searched in $\{0.5,1,2,4,8,16\}$ without extensive tuning.

A.3 Text Chunking

The CoNLL2000 dataset consists of 8,936 examples for training and 2,012 for testing. From the original training set, both our training and development examples are sampled and split (by 9/1 folds). Performance is then reported on the original full test set.

In our implementation, we set the hidden size to 100 for each direction of the BiLSTM encoder. Before the final linear layer, we add a dropout layer with probability 0.5 for regularization. Each model was trained for 100 epochs with a learning rate of 0.0001. The best model on the development split is selected for evaluation. The scaling factors $\rho$ are manually grid-searched in $\{1,2,4,8,16,32,64\}$ without extensive tuning.