NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models

Yongchao Chen, Rujul Gandhi, Yang Zhang, Chuchu Fan

Introduction

Temporal Logic (TL) has been widely used as a mathematically precise language to specify requirements in many engineering domains such as robotics Tellex et al. (2020), electronics design Browne et al. (1986), autonomous driving Maierhofer et al. (2020). TL can capture the complex spatial, temporal, and logical requirements of both human languages and environmental constraints, and can be transformed into executable actions or control inputs for robots Gundana and Kress-Gazit (2022); Raman et al. (2013); Boteanu et al. (2016); Patel et al. (2020); Gopalan et al. (2018).

Unlike many robotics works that try to use end-to-end black-box models to infer robotic behaviors directly from natural language (NL) Ahn et al. (2022), using structured TL as the intermediate has a twofold benefit – the TL can be used for direct planning, and the TL representation can be used to identify specific sources of failure and provide automatic feedback to a non-expert user Raman et al. (2013). However, TL has a steep learning curve. Communicating one’s goals and constraints through NL is much more intuitive to a non-expert. Therefore, a model able to transform NL instructions into TL is a missing but crucial component for interactive robots and engineering designs.

Currently, there is no general tool to perform automated translations between TL and NL that takes the following requirements into consideration:

Cross-domain generalization. Although TL is used in many engineering domains, current NL-to-TL approaches largely constrain their training data to a single domain. These datasets mostly lack plentiful corpus richness of NL-TL and have their own specified formats of Atomic Propositions (AP). Then the models fail to generalize to other domains Gopalan et al. (2018), even though the structure of TL itself is not dependent on the domain and should be generic.

Variability of NL instructions. Past work often constructs synthetic data algorithmically, requiring limited forms of the NL input. Real-world NL utterances cannot be encoded into a small set of rules. Models trained on such homogeneous data fail to generalize to new sentence structures Brunello et al. (2019).

One big bottleneck in the NL-to-TL problem is the lack of data. Although modern statistical methods can outperform rule-based methods Buzhinsky (2019), they typically require a huge dataset. This data is expensive and difficult to collect since strong expertise of annotators is needed Brunello et al. (2019). As outlined above, constraining the domain or form of the NL instructions relieves the pressure of dataset, but also unavoidably undermines the generalizability Brunello et al. (2019); Patel et al. (2019).

To supplement the data creation process and simultaneously overcome the need for a huge dataset, we propose to use pre-trained LLMs. We utilize GPT-3 Brown et al. (2020) to assist dataset creation and finetune T5 models Raffel et al. (2020) to be specialized in NL-to-TL transformation.

Another aspect of our approach is to use ‘lifted’ versions of NL and TL for finetuning our model, which greatly enhances generalizability. In previous work, models trained on full NL-to-TL transformation often include converting specific individual actions into APs. For example, the AP "a response is created in Slack" might be formalized as "create_Slack". As a result, each work has to regularize its own content and style of APs, affecting generalization. Instead of taking this approach, we hide all the APs in our data during finetuning, acquiring a lifted model on lifted NL-to-TL transformation. For the final ground application from full NL into full TL, two methods are proposed, either by combining the lifted model with AP recognition or further transfer learning in one specific domain. For further transfer learning into specific domains, we compare the models with/without pre-training on lifted NL-TL and show its significance.

In this paper, we present two key contributions:

Constructing a cross-domian NL-TL dataset. We generate a dataset of 15K lifted NL-TL pairs using a novel GPT-3-assisted framework. Ablation studies are conducted to show the significance of each part of the framework for dataset construction. In addition, we collect and clean previous datasets (13K) from two varied domains, adapting original full NL-TL pairs into lifted versions. In this way, we publish a dataset of 28K lifted NL-TL pairs. Ablation studies show that the newly created data are indispensable since purely training on the collected data fails to work across domains.

Finetuning a lifted NL-to-TL model on T5 using our data, and demonstrating the improvement in performance compared to former state-of-the-art methods. For application in full NL-to-STL transformation, two methods are proposed. We compare our model to Seq2Seq models and direct few-shot learning by GPT-3, across five domains. The experimental results show that our methods achieve better accuracy (>95% across all domains) and are more data-efficient (<10% domain specific data). We also do the ablation study by training a Seq2Seq model with lifted NL-to-TL dataset, revealing that T5’s superior model capacity is essential.

GPT-4 Bubeck et al. (2023) comes out when approaching the end of this work. To compare the accuracy of direct GPT-4 few-shot end-to-end translation with our finetuned model, we did an ad-hoc test on ChatGPT Plus version with 100 samples in each domain. Here we can not test on more samples since we do not have access to GPT-4 API and ChatGPT Plus version has access limitation per hour. The results show that GPT-4 achieves an accuracy of 77.7% over 300 samples, much lower than our model but higher than GPT-3.

Temporal Logic Specifications

There are many different versions of TL Emerson (1990); Maler and Nickovic (2004); Koymans (1990). They are more or less similar in terms of syntax. We will use Signal Temporal Logic (STL) as a representative formal language that supports constraints over the continuous real-time, which is more suitable to capture time-critical missions. In some previous work, Linear Temporal Logic (LTL) is also widely used, which is contained by STL when the time is discrete. We will construct our framework based on STL and show that the trained model also performs well on datasets and tasks using LTL. An STL formula is defined recursively according to the following syntax:

where $\phi$ and $\varphi$ are STL formulas, and $\pi^{\mu}$ is an atomic predicate. $\neg$ (negation), $\land$ (and), $\lor$ (or), $\Rightarrow$ (imply), and $\Leftrightarrow$ (equal)) are logical operators. $\textbf{F}_{[a,b]}$ (eventually/finally), $\textbf{G}_{[a,b]}$ (always/globally), and $\textbf{U}_{[a,b]}$ (until) are temporal operators with real-time constraints $t\in[a,b]$ . Temporal operators with time constraints are illustrated by Table 4, and other operators can be presented using the basic syntax.

2 Lifted STL and Lifted NL

We represent our data as ‘lifted’ NL and STL, in which the specific APs corresponding to individual actions are hidden (following nomenclature from Hsiung et al. (2021)). In our lifted NL and STL, each AP is replaced with a placeholder prop_i. In this way, we train our model on the general context of the instruction regardless of the specific APs. The correspondences between full and lifted NL/STL are shown in Figure 1.

3 STL Expression Formats

Consider an STL expression as a binary tree, as in Figure 2. When finetuning a text-to-text model, there are different ways of representing the target STL in the form of linear text. Specifically, the targeted tokens can be linearized in an in-order (left subtree, root, right subtree) or pre-order (root, left subtree, right subtree) manner. Meanwhile, the operators can also be represented as the words with their corresponding meanings (rather than as symbols). The training results show that the in-order expression with all the operators replaced by words achieves much better accuracy than other three forms (will discuss in the following Section 5).

Related Work

Over decades, researchers have methods to translate English sentences into various TL formulae Brunello et al. (2019); Finucane et al. (2010); Tellex et al. (2020); Raman et al. (2013). However, to simplify the tasks, some previous work typically make strong assumptions to restrict the input text or the output formula, thus limiting the flexibility and generalizability.

The first representative attempt is by Finucane et al. (2010); Tellex et al. (2011); Howard et al. (2014), where the traditional methods typically follow three steps: 1) pre-process given English input by extracting syntactical information, 2) identify patterns or rules for TL through classification, and 3) run an attribute grammar-based parser to derive a target logical format. These methods only work for restricted input NL Tellex et al. (2020).

Another category of approaches are learning-based. Representative state-of-the-art works are Gopalan et al. (2018); Wang et al. (2021); He et al. (2022). In Gopalan et al. (2018) the authors gather a dataset focusing on Geometric LTL (GLTL), in which the NL and GLTL examples are all for the navigation of a car in the room. Then Seq2Seq models with attention mechanism are trained. Though the accuracy (93.45%) is satisfying, the used GLTLs are relatively simple with each GLTL normally including one to three APs and the dataset also focuses on one confined task. In He et al. (2022) the authors choose to first translate a manually generated set of STL formulae into English sentences and train a semantic parser on the synthetic data. Such synthetic data cannot represent general NL and therefore the trained parser only works well on the original STL formulae.

In Wang et al. (2021) a semantic parser is built to learn the latent structure of NL commands for ground robots. The parser will provide (potentially incorrect) intermediate LTL representations to a motion planner, and the planner will give an executed trajectory as the feedback to see whether the robot’s execution satisfies the English input. Such approach has no guarantee on the correctness of the translated TL. In recent months, the work by Fuggitti and Chakraborti (2023) directly applies LLMs like GPT-3 to convert NL to LTL via few-shot learning. The prompts should be well designed and the model will fail once the NL and LTL structures are too complex (we will discuss it in Section 7.1).

Hence, in the domain of NL-to-TL translation, data augmentation/synthesis has been done algorithmically in previous work, not using generative models. This constrains how natural the resulting ‘NL’ actually is. In recent years, starting from the attention mechanism Vaswani et al. (2017), the rapid progression of pre-trained LLMs in NLP tends to unify many previous seemingly independent tasks into one large pre-trained model, especially the GPT series from OpenAI Brown et al. (2020), and T5 Raffel et al. (2020) and PaLM Chowdhery et al. (2022) from Google. These models are pre-trained with large amounts of natural sentences and codes, intrinsically encoding much world knowledge Creswell et al. (2022). The auto-regressive LLMs can naturally generate rich and meaningful corpus based on the given prompt. Then many recent work propose to do the data augmentation with LLMs, such as generating medical dialogues Chintagunta et al. (2021) and python codes Chen et al. (2021) via GPT-3. This inspires us the new opportunity in NL-to-TL task.

Approach

There are 3 steps in our approach. First, generating lifted NL-STL dataset with LLMs. Second, finetuning LLMs to get high accuracy on lifted NL-STL transformation. Third, lifting the data and applying the lifted model. Finally, we also consider the case where lifting is not possible and we must translate end to end by further finetuning the model.

We apply the LLM GPT-3 (Davinci-003) to help generate multiple lifted NL and STL pairs. The first intuitive method is to use various NL-STL pairs as prompts and ask GPT-3 to automatically generate more NL-STL pairs. However, it turns out that the model always generates STL and NL with similar syntactic structures as the given prompts, thus limiting the sentence richness. To stimulate GPT-3 to generate sentences with more variations, we ask it to generate corresponding NLs from different STLs. The whole framework (referred as Framework1) is shown in Figure 3. Various pre-order STLs are randomly synthesized by binary tree generation algorithm (See Algorithm 1 and specific discussion in Appendix B). The pre-order STLs are then transformed into in-order expressions via rules. To make the input STL more understandable to GPT-3, the operators are represented with the words of their meanings ( $\Rightarrow(imply),\Leftrightarrow(equal),\lor(or),etc$ ). Then the GPT-3 will try to generate the raw NL whose semantic meaning is close to the STL. Human annotators then modify the raw NL to make its meaning consistent with the STL. During this process, the NL-STL pairs in prompts will be randomly chosen to make the vocabulary and sentence structure more diversified. We gather 200 NL instructions from 10 volunteers who are familiar with robot tasks and randomly choose 100 NL to serve as the prompt pool, while the other 100 NL serve as the Manual testing data. In each iteration of Framework1, 20 pairs are randomly chosen from the prompt pool to form the prompt of GPT-3 (a prompt example is shown in Appendix C.1 and the discussion on how many examples should be included in GPT-3 prompt is shown in Appendix D).

While Framework1 enhances the sentence richness, one issue is that the pure rule-based synthesis of STL sometimes generates unreasonable semantic meanings, or that the STL is too complex to describe it with NL. To solve this problem, an optimized framework (referred as Framework2) is shown in Figure 4. Compared to Framework1, an extra loop between STL and NL is added using GPT-3. In this way, the initial rule-based STL with unreasonable or complex meanings will be automatically filtered by GPT-3 itself. In other words, during the mapping from STL-1 to NL-1, GPT-3 more or less modifies the meanings of the STLs that it cannot fully translate. Then the translated NL-1, though not fully consistent with STL-1, is more reasonable in the view of humans. It turns out that the semantic meanings synthesized by Framework2 are closer to the common human languages, and NL-STL pairs contain many fewer errors to annotate. The average number of annotated pairs is about 80 per person per hour with Framework1, and about 120 per person per hour with Framework2.

We in total create 15108 lifted NL-STL pairs combining both Framework1 and Framework2, with the whole cost of around 150 person-hours. Appendix C.2 shows a prompt example to transform from NL-1 back into STL-2 via GPT-3, and Appendix E shows some example annotations of lifted NL-STL pairs. Appendix F explains the whole process and the license of human annotation to correct NL-STL pairs.

Apart from synthesizing and annotating lifted NL-STL pairs with GPT-3, we also collect and annotate the data gathered from Wang et al. (2021) and He et al. (2022). Wang et al. (2021) focuses on robot navigation task with LTL, and He et al. (2022) focuses on circuit controlling with STL. To clean and process the data into lifted NL-STL pairs, the APs in both two datasets are detected and hidden by combining hard-coded algorithms with entity recognition package SpaCy Honnibal and Montani (2017). We gather 5K lifted NL-STL pairs from Navigation dataset Wang et al. (2021) and 8K lifted NL-STL pairs from the Circuit dataset He et al. (2022). Note that the original Navigation dataset uses LTL, while we correct some expression formats to form into STL. The original Circuit dataset contains 120K NL-STL pairs, while we find including 8K examples into our dataset is informative enough to cover the whole corpus richness of Circuit dataset.

Hence, in this work a dataset with in total about 28K lifted NL-STL pairs are created. Appendix J.1 shows the statistics of this lifted NL-STL dataset.

2 Model Finetuning

We mainly apply the T5 model Raffel et al. (2020) to serve as the base LLM for finetuning. To study whether model sizes will influence the performance, T5-base (220M) and T5-large (770M) are both finetuned on the same data. The setting is illustrated in Appendix L.

Experimental Results

Though the corrected data from Navigation and Circuit studies already provide multiple examples, these datasets only cover limited conditions and lack generalization. To show the necessity of the newly created data, the T5 models are finetuned with varied number of created NL-STL pairs, as shown in Figure 5. During training, all the data collected from Navigation and Circuit studies are used and the number of created data are varied among different models. The trained models are then tested with either the created data (referred as GPT-3-assisted data test) or the NL instructions collected from volunteers (referred as Manual data test). Since minor difference in STLs can cause severe difference in the real meanings, we apply the binary accuracy as the metric, i.e., 100% right or not. We find that the Top-1 testing accuracy increases greatly increasing the number of created pairs, with the highest accuracy 97.52% and 90.12% of GPT-3 assisted data and Manual data testing, respectively.

Table 1 presents the experimental results with the targeted STL of different formats as discussed in Section 2.3. We find that using the in-order format plus replacing operators with words will largely improve the performance. In-order format is more consistent with natural sentence expressions and lowers the difficulty for finetuning an LLM. This result is different from former conclusions when training Seq2Seq model for NL to STL/LTL tasks, where the pre-order format is better because it naturally avoids the issue of parentheses matching Wang et al. (2021).

Ablation Studies

Human Annotation To reveal the significance of human annotation, we train the model with the same amount of raw pairs created by GPT-3 and test them on corrected data. The results are shown in Appendix H.1. We find that annotated dataset can improve the testing accuracy by around 10%. Framework2 To reveal the significance of the data generated by Framework2, we train the model with either the same amount of data from pure Framework1 or the data combining two frameworks. Utilizing both frameworks improves the accuracy by around 2% (Appendix H.2). Model Capacity of T5 To reveal the significance of T5’s superior model capacity, we train a Seq2Seq model on the same lifted NL-STL dataset for comparison, as shown in Appendix I. Finetuning on T5 model improves the accuracy by around 14.5% compared to the Seq2Seq model.

Application

Right now we have finetuned T5 model to convert lifted NL into lifted STL. For the real applications, we need one model to convert from full NL to full STL, in which the format of APs should be regularized. To reach this destination, we will display two methods in the following discussion, and compare them with other state-of-the-art models. We test on five datasets across domains Circuit He et al. (2022), Navigation Wang et al. (2021), Office email Fuggitti and Chakraborti (2023), GLTL Gopalan et al. (2018); Tellex et al. (2020), and CW Squire et al. (2015). The examples of full NL-STL pairs in each domain are shown in Appendix G, and the statistics of our synthesized dataset and each collected dataset are shown in Appendix J.2. Note that some lifted NL-STL pairs in Circuit and Navigation datasets have been used during training the lifted model, while all the full NL-STL pairs have not been seen. All the data in other three domains are independent of the finetuning in lifted models. Our model achieves higher accuracy on full NL-STL transformation with much less training data across all these domains.

In the real applications, we have to formulate how the APs are presented in STL (like ’verb_noun’) so that the specified APs can directly connect with controllers. As shown in Appendix M, we directly utilize GPT-3 to recognize APs in the sentence and hide them as "prop_i". Then the lifted model will predict the targeted lifted STL and the hidden APs will be swapped into formatted form to generate the full STL.

Table 2 displays the performance accuracy of this method. We test on three distinct domains and compare with the GPT-3 end-to-end method, i.e., using GPT-3 to directly transform NL into STL. The GPT-3 end-to-end method is proposed by Fuggitti and Chakraborti (2023) recently, aiming to generalize into all various domains. However, in the NL to STL/LTL task, finetuning on a much smaller LLM like T5 is still greatly better than direct few-shot learning on state-of-the-art LLM like GPT-3 and GPT-4. Due to the limitation of GPT-4 access, we did an ad-hoc test on ChatGPT Plus with 100 samples in each domain. The experimental results show that combining finetuned lifted model with AP recognition using GPT-3 can lead to a full task accuracy over 95% across all three tested domains. Table 3 displays the performance of detecting APs with GPT-3. Compared to the direct NL to STL task, AP detection task is much easier to GPT-3. Hence, dividing the whole task into AP recognition and semantic parsing are more data-efficient and flexible than pure end-to-end method.

To further test model performance under varied sentence complexity, we plot the testing accuracy vs. the number of APs in Appendix K. As the number of APs in each lifted STL increases, the accuracy of GPT-3 few-shot learning decreases, while the finetuned T5-large model still performs well.

2 Transfer Learning

On the condition that we know how the users define the representation of APs, the aforementioned method is suitable to predict the full STL. On the other hand, there is also the condition that we cannot acquire the specific hard-coded rules to formulate AP representation, but only full NL-STL pairs. In these cases, the direct further finetuning may help. In other words, the lifted model has learnt to parse the semantic logical relations, and the further transfer learning is to learn how the APs are regulated in this specific dataset. This direct end-to-end transfer learning serves as the second way for ground applications.

To show that our method is generalizable and data-efficient, we compare our methods to the original Seq2Seq methods implemented in each dataset. Specifically, in the Circuit dataset the authors train the model from the ground using Transformer architecture Vaswani et al. (2017), and in GLTL and CW datasets the authors implement recurrent neural network (RNN) encoder-decoder framework with Gated Recurrent Unit (GRU) as the core RNN cell. As the work on Navigation dataset uses the final task completion rate as the criterion not the direct LTL accuracy, the LTL prediction accuracy is inherently low. For a fair comparison in Navigation dataset, we implements the same Seq2Seq framework as that in GLTL and CW datasets.

The experimental results are shown in Figure 6. Compared to the original Seq2Seq model proposed in each dataset, transfer learning with LLM is much more efficient, and the pre-training on lifted NL-STL pairs also displays a great saving on training data requirements. We also find that T5-large model performs better than T5-base model. In all the three domains, the T5-large model with lifted NL-STL pre-training can achieve an accuracy near 95% with only 200 to 500 full NL-STL examples. This amount of example requirement is one magnitude less than the Seq2Seq baselines.

The CW dataset is somewhat unique since it only has 36 different LTLs, meaning there are on average 50 different NLs corresponding to the same LTL. The study in Gopalan et al. (2018) applies this dataset to test the generalizability of the models within the domain. They use some types of LTLs as the training examples for transfer learning, and the left types of LTLs as the testing set. This is to test whether the model can predict the LTLs that it has not seen during the training. We also carry out this experiment and compare with the method in the original paper. As shown in Figure 7, the LLM with finetuning is apparently better than the original baseline.

Limitation

In spoken language, coreference is quite common, such as "pick up the apple and then bring it to me". Here "apple" and "it" refer to the same object. In the five datasets we collected and tested, the coreference problem is not severe since most NL do not have pronouns. For further work, the NER models specialized in resolving coreferences and converting them into normal APs are needed for more unbounded input sentences.

During the synthesis of varied STLs, here we use direct algorithm-based method to generate STL binary trees. To make the semantic meanings of STLs closer to human spoken language, we add the extra NL-STL loop via GPT-3. However, another intuitive way is to fit the probable distribution of operators to be close to human spoken language. For instance, the probability of two continued ’negation’ operators is nearly zero. In this work we only set some hard rules to specify the STL synthesis. Further work can focus on the fitting of operator distributions and apply it into STL generation.

The evaluation metric here is pure binary accuracy (fully correct or not). Actually, it is quite difficult to judge the similarity or distance of two TLs. Simply calculating token matching or computing truth values both own drawbacks. A more effective metric is needed.

The output of LLMs may sometimes generate incorrect TLs. We build up rule-based methods to check syntactic correctness and correct errors like parentheses matching. Further work can be added to improve output correctness by modifying training procedures and loss functions.

Conclusion

We propose a framework to achieve NL-to-TL transformation with the assistance of LLM, from aspects of both data generation and model training. One dataset with about 28K lifted NL-TL pairs is then constructed by which the T5 model is finetuned. Two approaches are implemented to utilize the trained model into full NL-to-TL translation. Experimental results on five varied domains display much better accuracy and generalizablity compared to original methods. The created dataset can be used to train future NL-to-TL models and serve as the benchmark. The proposed framework to finetune LLMs with lifted NL-TL pairs makes it possible for generalizable NL-to-TL translation without the constraints of domains and input instruction structures.

Although our framework is built on GPT-3, the rapid progression of LLMs can promote our framework. The strong semantic parsing ability of newly released GPT-4 will mitigate the burden of human annotations in our method. We find that GPT-4 generated STLs/NLs are closer to the correct answers, compared to GPT-3 generated STLs/NLs. As future work, we believe the model can be improved with larger dataset containing more diversified corpus with GPT-4 as the base model.

Acknowledgements

We thank the help from the volunteers for contributing the natural language instructions.

This work was supported by ONR under Award N00014-22-1-2478 and MIT-IBM Watson AI Lab. However, this article solely reflects the opinions and conclusions of its authors. The authors would also like to thank Lifu Huang, Zhiyang Xu, and Yue Meng for the early-stage exploration and discussion of the work.

References

Appendix A STL illustration

Appendix B Full algorithm to synthesize multiple STLs

This is the full algorithm to synthesize multiple varied STLs. The blue-colored words are the example output in each step. All the operators are classified into the operator with only one subtree, or the operator with two subtrees. A random ordered $prop$ list is generated with the length less than the upper limit. Then this full list is split into some sub_lists. For each sub_list, operators are randomly appended in the left side until each $prop$ occupy one position in the binary tree. Then these modified sub_lists are assembled back into the full STL by appending operators with two subtrees. The STL generated in this way are syntactically correct, but may own some flaws in semantic meanings. Some rules are pre-set to avoid the unreasonable conditions, e.g., two negation operation should not appear continually.

Appendix C Examples of prompt input to GPT-3

These are the example prompts for GPT-3 to convert between NL and STL, or detect the spans of Atomic Proportions.

Figure 8 is a prompt example for GPT-3 to convert from STL to its corresponding NL. The input STL follows in-order expression with all the operators replaced by words with the same meanings. The prompt contains 20 NL-STL pairs, which are randomly picked up from 100 examples and are changed constantly during data creation.

C.2 Prompt example from NL to pre-order STL via GPT-3

Figure 9 is a prompt example for GPT-3 to convert from NL to its corresponding STL. The output STL follows pre-order expression. We have tested that GPT-3 acts with close performance when STL follows either pre-order or in-order formats.

C.3 Prompt example for AP recognition via GPT-3

Figure 10 is a prompt example for applying GPT-3 to detect APs in natural sentences. In this example, the specific domain is Navigation.

Appendix D Number of NL-STL pairs in GPT-3 prompts

As shown in Figure 11, here we ask GPT-3 to transform from NL to STL and tune the number of NL-STL pairs in the prompt to detect the accuracy evolution. We test on the NL whose targeted STL have the number of APs to be two or three. We find that the prediction accuracy given by GPT-3 will arise with the number of example pairs and turn into a plateau when the number of example pairs increases to larger than 20. Here we choose the number of pairs to be 20 in the prompt.

Appendix E Example annotations of lifted NL-STL pairs

Appendix F Process of human annotation

All the human annotators are researchers in the area of formal methods and robot planning with extensive knowledge in temporal logics. The annotators are all volunteers from the authors’ institute and collaborative institutes. Before the data annotation and collection, we have notified them that the data will be used to train the language model to transform from natural language instructions to temporal logics, and that both the annotated data and model will be open to the public. All the voluntary annotators have agreed to the statement to use their annotated data. As for the task instruction, the annotators participate in the guidance meeting and are provided with a guidance list on temporal logics and some example annotation pairs (the guidance list and examples of annotation are available on github page). Each annotator annotated 50 pairs initially, and sent their results to other randomly assigned annotators for cross-checking. Finally, the authors also checked all pairs to ensure the accuracy of the annotated data.

Appendix G Example full NL-STL pairs of each specialized dataset

Appendix H Ablation Studies

This part is to demonstrate the significance of human annotation for the GPT-3 synthesized data. Figure 12 shows the model accuracy under varied number of training raw pairs. The great thing is that the T5-large model can still achieve a highest testing accuracy of 87.3% and 79.4% on the GPT-3-assisted data and Manual data test, even only using the raw data synthesized from GPT-3. However, compared to the results in Figure 5, models trained on annotated data achieves accuracy about 10% higher than models trained on raw data.

H.2 Significance of Framework2

Appendix I Model Capacity

As shown in Figure 13, T5-large performs much better than Seq2Seq model when training on the same lifted dataset. This reveals the significance to use LLM in this NL-to-TL task.

Appendix J Datasets statistics

J.2 Statistics of corpus richness

Appendix K Accuracy evolution with AP number

This section is to illustrate that directly applying GPT-3 to predict STL from NL via few-shot learning largely decreases the accuracy when the sentence structure is complex. Here we hypothesize that sentence complexity is positively related to the number of APs. As shown in Figure 14, the prediction accuracy decreases rapidly with increasing AP number using GPT-3 end-to-end method. On the other hand, the method to finetune the T5-large using synthesized NL-STL pairs remains high accuracy across different AP numbers.

Appendix L Details of implementation

For all the finetuning experiments on both T5-base and T5-large models, we choose the learning rate as 2e-5, a batch size of 16, a weight decaying ratio as 0.01, and run 20 epochs for each setting. Experiments on average finish in 3 hours for T5-base, and 10 hours for T5-large, on a single Nvidia RTX 8000 GPU. Average results and standard deviations are typically acquired from 3 runs with seeds , apart from the transfer learning in CW dataset where 10 runs are carried with seeds . For the finetuning on lifted models, the input dataset is split into training set (0.9) and testing set (0.1).