Equation Parsing: Mapping Sentences to Grounded Equations
Subhro Roy, Shyam Upadhyay, Dan Roth
Introduction
Understanding text often involves reasoning with respect to quantities mentioned in it. Understanding the news article statement in Example 1 requires identifying relevant entities and the mathematical relations expressed among them in text, and determining how to compose them. Similarly, solving a math word problem with a sentence like Example 2, requires realizing that it deals with a single number, knowing the meaning of “difference” and composing the right equation – “25” needs to be subtracted from a number only after it is multiplied by .
As a first step towards understanding such relations, we introduce the Equation Parsing task - given a sentence expressing a mathematical relation, the goal is to generate an equation representing the relation, and to map the variables in the equation to their corresponding noun phrases. To keep the problem tractable, in this paper we restrict the final output equation form to have at most two (possibly coreferent) variables, and assume that each quantity mentioned in the sentence can be used at most once in the final equation.We empirically found that around 97% of sentences describing a relation have this property. In example 1, the gold output of an equation parse should be , with and .
The task can be seen as a form of semantic parsing [Goldwasser and Roth, 2011, Kwiatkowski et al., 2013] where instead of mapping a sentence to a logical form, we want to map it to an equation. However, there are some key differences that make this problem very challenging in ways that differ from the “standard” semantic parsing. In Equation Parsing, not all the components of the sentence are mapped to the final equation. There is a need to identify noun phrases that correspond to variables in the relations and determine that some are irrelevant and can be dropped. Moreover, in difference from semantic parsing into logical forms, in Equation Parsing multiple phrases in the text could correspond to the same variable, and identical phrases in the text could correspond to multiple variables.
We call the problem of mapping noun phrases to variables the problem of grounding variables. Grounding is challenging for various reasons, key among them are that: (i) The text often does not mention “variables” explicitly, e.g., the sentence in example 3 describes a mathematical relation between the speed of bird and the speed of wind, without mentioning “speed” explicitly. (ii) Sometimes, multiple noun phrases could refer to the same variable. For instance, in example 2, both “a number” and “the same number” refer to the same variable. On the other hand, the same noun phrase might refer to multiple variables, as in example 4, where the noun phrase “two numbers” refer to two variables.
In addition, the task involves deciding which of the quantities identified in the sentence are relevant to the final equation generation. In example 5, both “5” and “10” are not relevant for the final equation “”. Finally, the equation needs to be constructed from a list of relevant quantities and grounded variables. Overall, the output space becomes exponential in the number of quantities mentioned in the sentence.
Determining the final equation that corresponds to the text is an inference step over a very large space. To address this, we define the concept of “projectivity” - a condition where the final equation can be generated by combining adjacent numbers or variables, and show that most sentences expressing mathematical relations exhibit the projectivity property. Finally, we restrict our inference procedure to only search over equations which have this property.
Our approach builds on a pipeline of structured predictors that identify irrelevant quantities, recognize coreferent variables, and, finally, generate equations. We also leverage a high precision lexicon of mathematical expressions and develop a greedy lexicon matching strategy to guide inference. We discuss and exemplify the advantages of this approach and, in particular, explain where the “standard” NLP pipeline fails to support equation parsing, and necessitates the new approach proposed here. Another contribution of this work is the development of a new annotated data set for the task of equation parsing. We evaluate our method on this dataset and show that our method predicts the correct equation in of the cases and that in of the time we also ground all variables correctly.
The next section presents a discussion of related work. Next we formally describe the task of equation parsing. The following sections describe our equation representation and the concept of projectivity, followed by the description of our algorithm to generate the equations and variable groundings from text. We conclude with experimental results.
Related Work
The work most related to this paper is [Madaan et al., 2016], which focuses on extracting relation triples where one of the arguments is a number. In contrast, our work deals with multiple variables and complex equations involving them. There has been a lot of recent work in automatic math word problem solving [Kushman et al., 2014, Roy et al., 2015, Hosseini et al., 2014, Roy and Roth, 2015]. These solvers cannot handle sentences individually. They require the input to be a complete math word problem, and even then, they only focus on retrieving a set of answer values without mentioning what each answer value corresponds to. Our work is also conceptually related to work on semantic parsing – mapping natural language text to a formal meaning representation [Wong and Mooney, 2007, Clarke et al., 2010, Cai and Yates, 2013, Kwiatkowski et al., 2013, Goldwasser and Roth, 2011]. However, as mentioned earlier, there are some significant differences in the task definition that necessitate the development of a new approach.
The Equation Parsing Task
Equation parsing takes as input a sentence describing a single mathematical equation, comprising one or two variables and other quantities mentioned in . Let be the set of noun phrases in the sentence . The output of the task is the mathematical equation described in , along with a mapping of each variable in the equation to its corresponding noun phrase in . We refer to this mapping as the “grounding” of the variable; the noun phrase represents what the variable stands for in the equation. Table 1 gives an example of an input and output for the equation parsing of the text in example 2. Since an equation can be written in various forms, we use the form which most agrees with text, as our target output. So, for example 1, we will choose and not . In cases where several equation forms seem to be equally likely to be the target equation, we randomly choose one of them, and keep this choice consistent across the dataset.
In this section, we introduce an equation parse for a sentence. An equation parse of a sentence is a pair , where represents a set of triggers extracted from , and represents an equation tree formed with the set as leaves. We now describe these terms in detail.
Trigger Given a sentence mentioning a mathematical relation, a trigger can either be a quantity trigger expressed in , or variable trigger which is a noun phrase in corresponding to a variable. A quantity trigger is a tuple , where is the numeric value of the quantity mentioned in text, and is the span of text from the sentence which refers to the quantity. A variable trigger is a tuple , where represents the label of the variable, and represents the noun phrase representing the variable. For example, for the sentence in Fig 1, the spans “Twice”, “25”, and “triple” generate quantity triggers, whereas “a number” and “the same number” generate variable triggers, with label .
Trigger List The trigger list for a sentence contains one trigger for each variable mention and each numeric value used in the final equation expressed by the sentence . The trigger list might consist of multiple triggers having the same label, or extracted from the same span of text. In the example sentence in Fig 1, the trigger list comprises two triggers having the same label . The final trigger list for the example in Fig 1 is {(, “2”), (, “a number”), (, “25”), (, “triple”), (, “the same number”)}. Note that there can be multiple valid trigger lists. In our example, we could have chosen both variable triggers to point to the same mention “a number”. Quantity triggers in the trigger list form the quantity trigger list, and the variable triggers in trigger list form the variable trigger list.
Equation Tree An equation tree of a sentence is a binary tree whose leaves constitute the trigger list of , and internal nodes (except the root) are labeled with one of the following operations – addition, subtraction, multiplication, division. In addition, for nodes which are labeled with subtraction or division, we maintain a separate variable to determine order of its children. The root of the tree is always labeled with the operation equal.
An equation tree is a natural representation for an equation. Each node in an equation tree represents an expression , and the label of the parent node determines how the expressions of its children are to be composed to construct its own expression. Let us denote the label for a non-leaf node to be , where and the order of a node ’s children by (defined only for subtraction and division nodes), which takes values (Left-Right) or (Right-Left). For a leaf node , the expression represents the variable label, if is a variable trigger, and the numeric value of the quantity, if it is a quantity trigger. Finally, we use and to represent the left and right child of node , respectively. The equation represented by the tree can be generated as follows. For all non-leaf nodes , we have
Given an equation tree of a sentence, the equation represented by it is the expression generated by the root of (following Equation 1). Referring to the equation tree in Fig 1, the node marked “” represents , and the root represents the full equation .
Projectivity
The key observation, as our corpus analysis indicates, is that for most sentences, there exists a trigger list, such that the equation tree representing the relation in the sentence is projective. However this might involve mapping two mentions of the same variable to different noun phrases. Figure 1 shows an example of a projective equation tree, which requires different mentions of to be mapped to different noun phrases. If we had mapped both mentions of to same noun phrase “a number”, the resulting equation tree would not have been projective. We collected sentences which represent an equation with one or two mentions of variables, and each number in the sentence used at most once in the equation. We found that only one sentence among these could not generate a projective equation tree. (See Section 6.1 for details on dataset creation). Therefore, we develop an algorithmic approach for predicting projective equation trees, and show empirically that it compares favourably with ones which do not make the projective assumption.
Predicting Equation Parse
Equation parsing of a sentence involves predicting three components – Quantity Trigger List, Variable Trigger List and Equation Tree. We develop three structured prediction modules to predict each of the above components.
All our prediction modules take a similar form: given input and output , we learn a scoring function , which scores how likely is the output given input . The scoring function is linear, , where is a feature vector extracted from and . The inference problem, that is, the prediction for an input is then: , where is the set of all allowed values of .
Given input text and the quantities mentioned in it, the role of this step is to identify , for each quantity in the text, whether it should be part of the final equation. For instance, in example 5 in Section 1, both “5” and “10” are not relevant for the final equation “”. Similarly, in example 4, the number “two” is irrelevant for the equation “”.
2 Predicting Variable Trigger List
The goal of this step is to predict the variable trigger list for the equation. Our structured classifier takes as input the sentence , and the output is either one or two noun-phrases, representing variables in the final equation. As we pointed out earlier, multiple groundings might be valid for any given variable, hence there can be multiple valid variable trigger lists. For every sentence , we construct a set of valid outputs. Each element in corresponds to a valid variable trigger list. Finally, we aim to output only one of the elements of .
We modified the standard structured prediction algorithm to consider “superset supervision” and take into account multiple gold structures for an input . We assume access to training examples of the form : , where each is a set of valid outputs for the sentence . Since we want to output only one variable trigger list, we want to score at least one from higher than all other possible outputs, for each . We use a modified latent structured SVM to learn the weight vector . The algorithm treats the best choice among all of as a latent variable. At each iteration, for all , the algorithm chooses the best choice from the set , according to the weight vector . Then, is updated by learning on all by a standard structured SVM algorithm. The details of the algorithm are in Algorithm 1.
The distinction from standard latent structural SVM is in line of Algorithm 1. In order to get the best choice for input , we search only inside , instead of all of . A similar formulation can be found in ?). The features used for variable trigger prediction include variable features (properties of noun phrase indicating variable) and neighborhood features (lexical features from neighborhood of variable mention). Details added to the appendix.
If the output of the classifier is a pair of noun phrases, we use a rule based variable coreference detector, to determine whether both noun phrases should have the same variable label or not. The rules for variable coreference are as follows :
If both noun phrases are the same, and they do not have the token “two” or “2”, they have the same label.
If the noun phrases are different, and the noun phrase appearing later in the sentence contains tokens “itself”, “the same number”, they have the same label.
In all other cases, they have different labels.
Finally, each noun phrase contributes one variable trigger to the variable trigger list.
3 Predicting Equation Tree
It is natural to assume that the syntactic parse of the sentence could be very useful in addressing all the predictions we are making in the equation parsing tasks. However, it turns out that this is not the case – large portions of the syntactic parse will not be part of the equation parse, hence we need the aforementioned modules to address this. Nevertheless, in the next task of predicting the equation tree, we attempted to constraint the output space using guidance from the syntactic tree; we found, though, that even enforcing this weak level of output expectation is not productive. This was due to the poor performance of current syntactic parsers on the equation data (eg., in of sentences, the Stanford parser made a mistake which does not allow recovering the correct equation).
The tree prediction module receives the trigger list predicted by the previous two modules, and the goal is to create an equation tree using the trigger list as the leaves of that tree. The input is the sentence and the trigger list, and the output is the equation tree representing the relation described in the sentence. We assume that the output will be a projective equation tree. For features , we extract for each non-leaf node of the equation tree , neighborhood features (from neighborhood of node spans of ’s children), connecting text features (from text between the spans of ’s children) and number features (properties of number in case of leaf nodes). Details are included in the appendix.
Lexicon To bootstrap the equation parsing process, we developed a high precision lexicon to translate mathematical expressions to operations and orders, like “sum of A and B” translates to “A+B”, “A minus B” translates to “A-B”, etc. (where A and B denote placeholder numbers or expressions). At each step of CKY, while constructing a node of the equation tree, we check for a lexicon text expression corresponding to node . If found, we allow only the corresponding operation (and order) for node , and do not explore other operations or orders. We show empirically that reducing the space using this greedy lexicon matching help improve performance. We found that using the lexicon rules as features instead of hard constraints do not help as much. Note that our lexicon comprises only generic math concepts, and around of the sentences in our dataset do not contain any pattern from the lexicon.
Finally, given input sentence, we first predict the quantity trigger and the variable trigger lists. Given the complete trigger list, we predict the equation tree relating the components of the trigger list.
4 Alternatives
A natural approach could be to jointly learn to predict all three components, to capture the dependencies among them. To investigate this, we developed a structured SVM which predicts all components jointly, using the union of the features of each component. We use approximate inference, first enumerating possible trigger lists, and then equation trees, and find the best scoring structure. However, this method did not outperform the pipeline method. The worse performance of joint learning is due to: (1) search space being too large for the joint model to do well given our dataset size of 385, and (2) our independent classifiers being good enough, thus supporting better joint inference. This tradeoff is strongly supported in the literature [Punyakanok et al., 2005, Sutton and McCallum, 2007].
Another option is to enforce constraints between trigger list predictions, such as, variable triggers should not overlap with the quantity triggers. However, we noticed that often noun phrases returned by the Stanford parser were noisy, and would include neighboring numbers within the extracted noun phrases. This prevented us from enforcing such constraints.
Experimental Results
We now describe the data set, and the annotation procedure used. We then evaluate the system’s performance on predicting trigger list, equation tree, and the complete equation parse.
We created a new dataset consisting of sentences extracted from algebra word problems and financial news headlines. For algebra word problems, we used the MIT dataset [Kushman et al., 2014], and two high school mathematics textbooks, Elementary Algebra (College of Redwoods) and Beginning and Intermediate Algebra (Tyler Wallace). Financial news headlines were extracted from The Latest News feed of MarketWatch, over the month of February, 2015. All sentences with information describing a mathematical relation among at most two (possibly coreferent) variables, were chosen. Next, we pruned sentences which require multiple uses of a number to create the equation. This only removed a few time related sentences like “In 10 years, John will be twice as old as his son.”. We empirically found that around 97% of sentences describing a relation fall under the scope of our dataset.
The annotators were shown each sentence paired with the normalized equation representing the relation in the sentence. For each variable in the equation, the annotators were asked to mark spans of text which best describe what the variable represents. The annotation guidelines are provided in the appendix. We wanted to consider only noun phrase constituents for variable grounding. Therefore, for each annotated span, we extracted the noun phrase with maximum overlap with the span, and used it to represent the variables. Finally, a tuple with each variable being mapped to one of the noun phrases representing it, forms a valid output grounding (variable trigger list). We computed inter-annotator agreement on the final annotations where only noun phrases represent variables. The agreement (kappa) was 0.668, indicating good agreement. The average number of mention annotations per sentence was 1.74.
2 Equation Parsing Modules
In this section, we evaluate the performance of the individual modules of the equation parsing process. We report Accuracy - the fraction of correct predictions. Table 3 shows the -fold cross validation accuracy of the various modules. In each case, we also report accuracy by removing each feature group, one at a time. In addition, for equation tree prediction, we also show the effect of lexicon, projectivity, conforming to syntactic parse constraints, and using lexicon as features instead of hard constraints. For all our experiments, we use the Stanford Parser [Socher et al., 2013], the Illinois POS tagger [Roth and Zelenko, 1998] and the Illinois-SL structured prediction package [Chang et al., 2015].
3 Equation Parsing Results
In this section, we evaluate the performance of our system on the overall equation parsing task. We report Equation Accuracy - the fraction of sentences for which the system got the equation correct, and Equation+Grounding Accuracy - the fraction of sentences for which the system got both the equation and the grounding of variables correct. Table 4 shows the overall performance of our system, on a -fold cross validation. We compare against Joint Learning - a system which jointly learns to predict all relevant components of an equation parse (Section 5.4). We also compare with SPF [Artzi and Zettlemoyer, 2013], a publicly available semantic parser, which can learn from sentence-logical form pairs. We train SPF with sentence-equation pairs and a seed lexicon for mathematical terms (similar to ours), and report equation accuracy. Our structured predictors pipeline approach is shown to be superior to both Joint Learning and SPF.
SPF gets only a few sentences correct. We attribute this to the inability of SPF to handle overlapping mentions (like in Example 4), as well as its approach of parsing the whole sentence to the final output form. The developers of SPF also confirmed Private communication that it is not suitable for equation parsing and that these results are expected. Since equation parsing is a more involved process, a slight adaptation of SPF does not seem possible, necessitating a more involved process , of the type we propose. Our approach, in contrast to SPF, can handle overlapping mentions, selects triggers from text, and parses the trigger list to form equations.
4 Error Analysis
For variable trigger list prediction, around of the errors were due to the predictor choosing a span which is contained within the correct span, e.g., when the target noun phrase is “The cost of a child’s ticket”, our predictor chose only “child’s ticket”. Although this choice might be sufficient for downstream tasks, we consider it to be incorrect in our current evaluation. Another of the errors were due to selection of entities which do not participate in the relation. For example, in “A rancher raises 5 times as many cows as horses.”, our predictor chose “A rancher” and “cows” as variables, whereas the relation exists between “cows” and “horses”. For the prediction of the equation tree, we found that of the errors were due to rare math concepts expressed in text. For example, “7 dollars short of the price” represents dollars should be subtracted from the price. These errors can be handled by carefully augmenting the lexicon. Another of the errors were due to lack of world knowledge, requiring understanding of time, speed, and distance.
Conclusion
This paper investigates methods that identify and understand mathematical relations expressed in text. We introduce the equation parsing task, which involves generating an equation from a sentence and identifying what the variables represent. We define the notion of projectivity, and construct a high precision lexicon, and use these to reduce the equation search space. Our experimental results are quite satisfying and raise a few interesting issues. In particular, it suggests that predicting equation parses using a pipeline of structured predictors performs better than jointly trained alternatives. As discussed, it also points out the limitation of the current NLP tools in supporting these tasks. Our current formulation has one key limitation; we only deal with expressions that are described within a sentence. Our future work will focus on lifting this restriction, in order to allow relations expressed across multiple sentences and multiple relations expressed in the same sentence. Code and dataset are available at http://cogcomp.cs.illinois.edu/page/publication_view/800.
Acknowledgements
This work is funded by DARPA under agreement number FA8750-13-2-0008, and a grant from the Allen Institute for Artificial Intelligence (allenai.org).
Appendix
Appendix A Features
The feature function used for the classification generates the following features :
A.2 Variable Trigger List Prediction
The features used for variable trigger prediction are as follows:
Variable features : Unigrams and bigrams generated from the noun phrase representing variables, part of speech tags of tokens in noun phrase representing variables.
Neighborhood Features : Unigrams and POS tags from neighborhood of variables.
All the above features are conjoined with two labels, one denoting whether has two variables or one, and the second denoting whether has two variables represented by the same noun phrase.
A.3 Equation Tree Prediction
For features , we extract for each non-leaf node of the equation tree , the following:
Number Features : In case we are combining two leaf nodes representing quantity triggers, we add a feature signifying whether one number is larger than the other.
Appendix B Annotation Guidelines
The annotators were shown each sentence paired with the normalized equation representing the relation in the sentence. For each variable in the equation, the annotators were asked to mark spans of text which best describe what the variable represents. They were asked to annotate associated entities if exact variable description was not present. For instance, in example 3 (Section 1), the relation holds between the speed of bird and the speed of wind. However, “speed” is not explicitly mentioned in the sentence. In such cases, the annotators were asked to annotate the associated entities “the wind” and “a bird” as representing variables.
The guidelines also directed annotators to choose the longest possible mention, in case they feel the mention boundary is ambiguous. As a result, in the sentence, “City Rentals rent an intermediate-size car for 18.95 dollars plus 0.21 per mile.”, the phrase “City Rentals rent an intermediate-size car” was annotated as representing variable. We allow multiple mentions to be annotated for the same variable. In example 2 (Section 1), both “a number” and “the same number” were annotated as representing the same variable.
Appendix C Lexicon
We construct a high precision list of rules, to parse sentences describing mathematical concepts, for example, “difference of”, “greater than”, etc. For each non-leaf node of a projective equation tree, we define the following terms :
The rules in our lexicon are described using the above terms. They are as follows, ordered from low precedence to high precedence.