Is Supervised Syntactic Parsing Beneficial for Language Understanding? An Empirical Investigation

Goran Glavaš, Ivan Vulić

Introduction

Structural analysis of sentences, based on a variety of syntactic formalisms (Charniak, 1996; Taylor et al., 2003; De Marneffe et al., 2006; Hockenmaier and Steedman, 2007; Nivre et al., 2016, 2020, inter alia), has been the beating heart of NLP pipelines for decades Klein and Manning (2003); Chen and Manning (2014); Dozat and Manning (2017); Kondratyuk and Straka (2019), establishing a rather strong common belief that high-level semantic language understanding (LU) crucially depends on explicit syntax. The unprecedented success of neural language learning models based on transformer networks Vaswani et al. (2017), trained on unlabeled corpora via language modeling (LM) objectives (Devlin et al., 2019; Liu et al., 2019b; Clark et al., 2020, inter alia) on a wide variety of LU tasks Wang et al. (2018); Hu et al. (2020), however, questions this widely accepted assumption.

The question of the necessity of supervised parsing for LU and NLP in general has been raised before. More than a decade ago, Bod (2007) questioned the superiority of supervised parsing over unsupervised induction of syntactic structures in the context of statistical machine translation. Nonetheless, the NLP community has since still managed to find sufficient evidence for the usefulness of explicit syntax in higher-level LU tasks (Levy and Goldberg, 2014; Cheng and Kartsaklis, 2015; Bastings et al., 2017; Kasai et al., 2019; Zhang et al., 2019a, inter alia). However, we believe that the massive improvements brought about by the LM-pretrained transformers – unexposed to any explicit syntactic signal – warrant a renewed scrutiny of the utility of supervised parsing for high-level language understanding.

Disclaimer 1: In this work, we make a clear distinction between Computational Linguistics (CL), i.e., the area of linguistics leveraging computational methods for analyses of human languages, and NLP, the area of artificial intelligence tackling human language in order to perform intelligent tasks. This work scrutinizes the usefulness of supervised parsing and explicit syntax only for the latter. We find the usefulness of explicit syntax in CL to be self-evident.

Disclaimer 2: The purpose of this work is definitely not to invalidate the admirable efforts on syntactic annotation and modeling, but rather to make an empirically driven step towards a deeper understanding of the relationship between LU and formalised syntactic knowledge, and the extent of its impact on modern semantic LU and applications.

The research question we address in this work can be summarized as follows:

(RQ) Is explicit structural language information, provided in the form of a widely adopted syntactic formalism (Universal Dependencies, UD) Nivre et al. (2016) and injected in a supervised manner into LM-pretrained transformers, beneficial for transformers’ downstream LU performance?

While the existing body of work Lin et al. (2019); Tenney et al. (2019); Liu et al. (2019a); Kulmizev et al. (2020); Chi et al. (2020) probes transformers for structural phenomena, our work is more pragmatically motivated. We directly evaluate the effect of infusing structural language information from UD treebanks, via intermediate dependency parsing (DP) training, on transformers’ performance in downstream LU. To this end, we couple a pretrained transformer with a biaffine parser similar to Dozat and Manning (2017), and train the model (i.e., fine-tune the transformer) for DP. Our parser on top of RoBERTa Liu et al. (2019b) and XLM-R Conneau et al. (2020) produces DP results which are comparable to state of the art. We then fine-tune the syntactically-informed transformers for three downstream LU tasks: natural language inference (NLI) Williams et al. (2018); Conneau et al. (2018), paraphrase identification Zhang et al. (2019b); Yang et al. (2019), and causal commonsense reasoning Sap et al. (2019); Ponti et al. (2020). We quantify the contribution of explicit syntax by comparing LU performance of the transformer exposed to intermediate parsing training (IPT) and its counterpart directly fine-tuned for the downstream task. We investigate the effects of IPT (1) monolingually, by fine-tuning English transformers, BERT and RoBERTa, on an English UD treebank, and (2) for downstream zero-shot language transfer, by fine-tuning massively multilingual transformers (MMTs) – mBERT and XLM-R Conneau et al. (2020) – on treebanks of downstream target languages, before the downstream fine-tuning on source language (English) data.

While intermediate parsing training is obviously not the only way of bringing syntactic knowledge to downstream tasks Kuncoro et al. (2019); Swayamdipta et al. (2019); Kuncoro et al. (2020), it is arguably the most straightforward way of injecting syntactic signal in the context of the predominant pretraining-fine-tuning paradigm that has, nonetheless, not been investigated up to this point. Other methods of bringing syntactic signal to downstream tasks such as knowledge distillation Kuncoro et al. (2020) and pre-training on shallow trees instead of sequences Swayamdipta et al. (2019) have failed to demonstrate significant gains on higher-level LU tasks.

Our results also render supervised UD parsing largely inconsequential to LU. We observe limited and inconsistent gains only in zero-shot downstream language transfer: further analyses reveal that (1) intermediate LM training yields comparable gains and (2) IPT only marginally changes representation spaces of transformers exposed to a sufficient amount of language data in LM-pretraining. We hope that these empirical findings will shed new light on the relationship between supervised parsing (and manually labeled treebanks) and LU with transformer networks, and guide further similar investigations in future work, in order to fully understand the impact of formal syntactic knowledge on LU performance with modern neural architectures.

Related Work

Bringing Explicit Syntax to LMs. Previous work has attempted to enrich language models with explicit syntactic knowledge in ways other than intermediate parsing training. Swayamdipta et al. (2019) modify the pretraining objective of ELMo Peters et al. (2018) to learn from shallowly parsed (i.e., chunked) corpora. They, however, report no notable improvements on downstream tasks. Kuncoro et al. (2019) propose to distil the knowledge from a Recurrent NN Grammar (RNNG) teacher trained on a small syntactically annotated corpus (by modeling the joint probability of surface sequence and phrase structure tree) into an LSTM-based student pretrained on a much larger corpus. They show that distillation helps the student in structured prediction tasks, but their downstream evaluation does not involve LU tasks. Their subsequent work Kuncoro et al. (2020) replaces the RNN student with BERT Devlin et al. (2019): syntactic distillation again helps structured prediction, but hurts (slightly) the performance on LU tasks from the GLUE benchmark Wang et al. (2018).

Transformer-Based Dependency Parsing. Building on the success of preceding neural parsers Chen and Manning (2014); Kiperwasser and Goldberg (2016), Dozat and Manning (2017) proposed a biaffine parsing head on top of a Bi-LSTM encoder: contextualized word vectors are fed to two feed-forward networks, producing dependent- and head-specific token representations, respectively. Arc and relation scores are produced via biaffine products between these dependent- and head-specific representation matrices. Finally, the Edmonds algorithm induces the optimal tree from pairwise arc predictions. Most recent DP work Kondratyuk and Straka (2019); Üstün et al. (2020) replaces the Bi-LSTM encoder with multilingual BERT’s transformer, reporting state-of-the-art parsing performance. Kondratyuk and Straka (2019) fine-tune mBERT’s parameters on the concatenation of all UD treebanks, whereas Üstün et al. (2020) freeze the original transformer’s parameters and inject adapters Houlsby et al. (2019) for parsing.

We propose and work with a simpler transformer-based biaffine parser: we apply biaffine attention directly on representations from the transformer’s output layer, eliminating the head- and dependent-specific feed-forward mapping. Despite this simplification, our biaffine parser produces DP results comparable to current state-of-the-art parsers.

Syntactic BERTology. The substantial body of syntactic probing work shows that BERT Devlin et al. (2019) (a) encodes text in a hierarchical manner (i.e., it encodes some implicit underlying syntax) Lin et al. (2019); and (b) captures specific shallow syntactic information (parts-of-speech and syntactic chunks) Tenney et al. (2019); Liu et al. (2019a). Hewitt and Manning (2019) find that linear transformations, when applied on BERT’s contextualized word vectors, reflect distances in dependency trees. This suggests that BERT encodes sufficient structural information to reconstruct dependency trees (though without arc directionality and relations). Chi et al. (2020) extend the analysis to multilingual BERT, finding that its representation subspaces may recover trees also for other languages. They also provide evidence that clusters of head–dependency pairs roughly correspond to UD relations. Similarly, Kulmizev et al. (2020) show that BERT’s latent syntax corresponds more to UD trees than to shallower SUD Gerdes et al. (2018) structures. Despite the evident similarity between BERT’s latent syntax and formalisms such as UD, there is ample evidence that BERT insufficiently leverages syntax in downstream tasks: it often produces similar predictions for syntactically valid as well as for structurally corrupt sentences (e.g., with random word order) Wallace et al. (2019); Ettinger (2020); Zhao et al. (2020).

Intermediate Training. Sometimes called Supplementary Training on Intermediate Labeled-data Tasks (STILT) Phang et al. (2018), intermediate training is a transfer learning setup in which one trains an LM-pretrained transformer on one or more supervised tasks (ideally with large training sets) before final fine-tuning for the target task. Phang et al. (2018) show that intermediate NLI training of BERT on the Multi-NLI dataset Williams et al. (2018) benefits several language understanding tasks. Subsequent work Wang et al. (2019); Pruksachatkun et al. (2020) investigated many combinations of intermediate and target LU tasks, failing to identify any universally beneficial intermediate task. In this work we use DP as an intermediate training task (IPT) for LM-pretrained transformers.

Methodology

We then directly compute the arc and relation scores as biaffine products of $\mathbf{X}$ and $\mathbf{X}^{\prime}$:

Note that, in comparison with the original biaffine parser Dozat and Manning (2017) and its other transformer-based variants Kondratyuk and Straka (2019); Üstün et al. (2020), we feed word-level representations derived from the transformer’s output directly to biaffine products, omitting the dependent- and head-specific MLP transformations. Deep task-specific architectures go against the fine-tuning idea: deep transformers have plenty of their own parameters that can be tuned for DP. We want to propagate as much of the explicit syntactic knowledge as possible into the transformer: a deep(er) DP-specific architecture on top of the transformer would impede the propagation of this knowledge to the transformer’s parameters.
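The direct biaffine scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes (these details are not given in the text here) that the dependents are an N×H matrix of word-level vectors, that the head candidates are the same vectors with one extra root vector prepended, and that a single H×H weight matrix plus a bias yields the arc scores. The names `biaffine_scores`, `greedy_heads`, `W_arc`, and `b_arc` are hypothetical.

```python
import numpy as np

def biaffine_scores(X, x_root, W_arc, b_arc=0.0):
    """Score every (dependent, head) pair with one biaffine product.

    X      : (N, H) word-level vectors, used directly as dependents
    x_root : (H,)   vector standing in for the artificial root (assumption)
    W_arc  : (H, H) biaffine weight matrix
    b_arc  : scalar or broadcastable bias (assumption)
    Returns an (N, N+1) score matrix; column 0 is the root candidate.
    """
    X_prime = np.vstack([x_root[None, :], X])   # (N+1, H), root prepended
    return X @ W_arc @ X_prime.T + b_arc        # no head/dependent MLPs

def greedy_heads(arc_scores):
    """Pick the best-scoring head per word; no tree constraint is enforced."""
    return arc_scores.argmax(axis=1)

rng = np.random.default_rng(0)
N, H = 5, 8
X = rng.normal(size=(N, H))
x_root = rng.normal(size=H)
W_arc = rng.normal(size=(H, H))

scores = biaffine_scores(X, x_root, W_arc)
heads = greedy_heads(scores)   # head index 0 means "attached to root"
```

Greedy selection can produce cycles or self-attachments, which is why parsers typically decode with a maximum spanning tree algorithm (such as the Edmonds algorithm mentioned in Related Work) instead of a per-word argmax.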

Experimental Setup

We now detail the experimental setup, in which LU fine-tuning follows Intermediate Parsing Training (IPT).

Our primary goal is to identify if injection of explicit syntax into transformers via supervised parsing training improves their downstream LU performance – this translates into sequential fine-tuning: (1) we first attach a biaffine parser from §3 on the transformer and train the whole model on a UD treebank; (2) we then couple the syntactically-informed transformer with the corresponding downstream classification head and perform final fine-tuning. We then compare the downstream performance of transformers with and without the IPT step.

Mono- vs. Cross-Lingual IPT Experiments. In the monolingual setup, we work with English (en) transformers, BERT and RoBERTa, pretrained on en corpora. In the zero-shot language transfer setup, where we work with multilingual models, mBERT and XLM-R Conneau et al. (2020), we first train the transformers via IPT on the UD treebank of the target language (i.e., a language with no downstream training data) before fine-tuning them on the en training set of the LU task. We experiment with four target languages: German (de), French (fr), Turkish (tr), and Chinese (zh). (The selected languages vary in typological and etymological proximity to en as the source language: de is in the same (Germanic) branch of Indo-European languages, fr is from a different branch of the same family, whereas tr (Turkic) and zh (Sino-Tibetan) belong to different language families.)

Standard vs. Adapter-Based Fine-Tuning. Standard fine-tuning updates all of the transformer’s parameters, which, for tasks with large training sets, may have some drawbacks: (i) fine-tuning may take a long time and (ii) task-specific information may overwrite the useful distributional knowledge obtained during LM-pretraining. Adapter-based fine-tuning Houlsby et al. (2019); Pfeiffer et al. (2020) remedies these potential issues by keeping the original transformer’s parameters frozen and inserting new adapter parameters in transformer layers. In fine-tuning, both sets of parameters are used to make predictions, but we only update adapters based on loss gradients. As the number of adapter parameters is only a fraction of the number of original parameters (3-8%), fine-tuning is also much faster.

Therefore, to account for the possibility of forgetting distributional knowledge in standard IPT fine-tuning, we also carry out adapter-based IPT. We follow Houlsby et al. (2019) and inject two bottleneck adapters into each transformer layer: one after the multi-head attention sublayer and another after the feed-forward sublayer. In downstream LU tasks, however, we unfreeze the original transformer parameters and fine-tune them together with adapters (now containing syntactic knowledge).
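For readers unfamiliar with bottleneck adapters, the following is a minimal forward-pass sketch in the spirit of Houlsby et al. (2019): a down-projection, a non-linearity, an up-projection, and a residual connection. The class name `BottleneckAdapter`, the small-weight initialization, and the tanh approximation of GELU are assumptions made for illustration. The hidden size (768) and adapter size (64) follow the values reported in the training details.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the exact variant used is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class BottleneckAdapter:
    """Down-project, apply GELU, up-project, add the residual."""

    def __init__(self, hidden_size=768, adapter_size=64, seed=0):
        rng = np.random.default_rng(seed)
        # Small initial weights keep the adapter close to the identity map.
        self.W_down = rng.normal(scale=1e-3, size=(hidden_size, adapter_size))
        self.b_down = np.zeros(adapter_size)
        self.W_up = rng.normal(scale=1e-3, size=(adapter_size, hidden_size))
        self.b_up = np.zeros(hidden_size)

    def __call__(self, h):
        z = gelu(h @ self.W_down + self.b_down)   # (..., adapter_size)
        return h + z @ self.W_up + self.b_up      # residual connection

    def num_parameters(self):
        return sum(p.size for p in (self.W_down, self.b_down, self.W_up, self.b_up))

adapter = BottleneckAdapter()
h = np.random.default_rng(1).normal(size=(4, 768))  # 4 token vectors
out = adapter(h)
```

In adapter-based training, only the four adapter arrays would receive gradient updates while the surrounding transformer weights stay frozen.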

4.2 Language Understanding Tasks

We now outline the downstream LU tasks. For brevity, we report all the technical training and optimization details in the Supplementary Material.

NLI is a ternary sentence-pair classification task. We predict if the hypothesis is entailed by the premise, contradicts it, or neither. For monolingual en experiments, we use Multi-NLI Williams et al. (2018). In zero-shot transfer experiments, we train on en Multi-NLI and evaluate on target language (de, fr, tr, zh) test portions of the multilingual XNLI dataset Conneau et al. (2018). Models trained on the Multi-NLI dataset have been shown, however, to capture certain heuristics (e.g., lexical overlap) useful for many training instances rather than more complex and generalizable language inference McCoy et al. (2020). Because of this, we additionally evaluate on the HANS dataset McCoy et al. (2020), consisting of adversarial examples on which models that capture such heuristics fail.

Paraphrase Identification is a binary classification task where we predict if two sentences are mutual paraphrases. For en, we train, validate, and test on respective portions of the PAWS dataset Zhang et al. (2019b). In zero-shot language transfer, we evaluate on the de, fr, and zh test portions of the PAWS-X dataset Yang et al. (2019).

Commonsense Reasoning. We evaluate on two multiple-choice classification (MCC) datasets. In monolingual evaluation, we use the SocialIQA (SIQA) dataset Sap et al. (2019), testing models’ ability to reason about social interactions. Each SIQA instance consists of a premise, a question, and three possible answers. For zero-shot language transfer experiments, we resort to the recently published XCOPA dataset Ponti et al. (2020), obtained by translating test portions of the en COPA (Choice of Plausible Alternatives) dataset Roemmele et al. (2011) to 11 languages. As mentioned, (X)COPA is an MCC task, with each instance containing a premise, a question, and two possible answers. (While SIQA has unconstrained questions, (X)COPA has only two question types: a) What is the CAUSE of this (premise)? and b) What is the RESULT of this (premise)?) Due to the very limited size of the en COPA training set (a mere 400 instances), we follow Ponti et al. (2020) and evaluate the models fine-tuned on SIQA (en) on the XCOPA test portions (in tr and zh).

4.3 Training and Optimization Details

All the transformer models with which we experiment – en BERT, mBERT, en RoBERTa, and XLM-R – have $L=12$ layers and hidden representations of size $H=768$. We apply a dropout ($p=0.1$) on the transformer outputs before forwarding them to the task-specific classification heads (i.e., biaffine parsing head in intermediate parsing training, and MCC or SEQC heads in downstream fine-tuning). We optimize the parameters using the Adam algorithm Kingma and Ba (2015): we found the initial learning rate of $10^{-5}$ to offer stable convergence in both intermediate parsing training and downstream fine-tuning for all LU tasks. We train for at most 30 epochs over the respective training set, with early stopping based on the development loss. (We measure the development loss every $U$ update steps and stop the training if the loss does not decrease over 10 consecutive measurements. We set $U=500$ in NLI training and $U=250$ in all other training procedures.) On UD treebanks and SIQA we train in batches of size 8, whereas on Multi-NLI and PAWS we train in batches of size 32. In Adapter-based IPT, we set the adapter size to 64 and use GELU Hendrycks and Gimpel (2016) as the activation function in adapter layers.
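The early-stopping rule described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that "does not decrease" is judged against the best development loss seen so far (rather than against the immediately preceding measurement), and the function name `early_stopping_point` is hypothetical. Each element of `dev_losses` stands for one measurement taken every U update steps.

```python
def early_stopping_point(dev_losses, patience=10):
    """Return (stop_index, best_loss) for a stream of development losses.

    dev_losses : iterable of floats, one per measurement (every U update steps)
    patience   : number of consecutive non-improving measurements tolerated
    stop_index is None if the patience budget is never exhausted.
    """
    best = float("inf")
    stale = 0                      # consecutive measurements without improvement
    for i, loss in enumerate(dev_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i, best     # stop training at this measurement
    return None, best

# Loss improves three times, then plateaus above the best value.
losses = [1.0, 0.9, 0.8] + [0.85] * 12
stop_index, best_loss = early_stopping_point(losses, patience=10)
```

With U=250 and a batch size of 8, for example, each measurement in this sketch would correspond to 2,000 training instances seen since the previous one.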

Evaluation

We first discuss parsing performance of our novel biaffine parser (see §3). We then show transformers’ downstream LU performance after IPT, both in monolingual en setting and in zero-shot transfer.

Parsing Performance. In order to judge the benefits of IPT in downstream LU, we must first verify parsing performance of our biaffine parser, i.e., that we successfully fine-tune transformers for DP. Table 1 shows that our biaffine parser gives state-of-the-art performance for all five languages in our study. Our (m)BERT-based parser outperforms UDify Kondratyuk and Straka (2019), also based on mBERT, for en, fr, and tr, and performs comparably for zh. (Our mBERT-based parser performs poorly for de: the cause is unclear and requires further investigation.) Our parser based on XLM-R additionally yields an improvement over UDify for de. It is worth noting that UDify (1) trains the mBERT-based parser on the concatenation of all UD treebanks and (2) additionally exploits gold UPOS and lemma annotations. We train our parsers only on the training portion of the respective treebank without using any additional morpho-syntactic information. (Also, since absolute parsing performance is not the primary objective of this work, we did not perform extensive language-specific hyperparameter tuning. One could likely obtain better parsing scores than what we report in Table 1 with careful language-specific model selection.) Our mBERT-based parser outperforms our XLM-R-based parser only for zh: this is likely due to a tokenization mismatch between XLM-R’s subword tokenization for zh and gold tokenization in the zh-GSD treebank. (We explain this mismatch in the Appendix.)

Monolingual en Results. Table 2 quantifies the effects of applying IPT with the en-EWT UD treebank to BERT and RoBERTa. We report downstream LU performance on NLI, PAWS, and SIQA. The reported results do not favor supervised parsing (i.e., explicit syntax): compared to original transformers that have not been exposed to any explicit syntactic supervision, variants exposed to UD syntax via IPT (Standard, Adapter) fail to produce any significant gains for any of the downstream LU tasks. One cannot argue that the cause of this might be forgetting (i.e., overwriting) of the distributional knowledge obtained in LM pretraining during IPT: Adapter IPT variants, in which all distributional knowledge is preserved by design, also fail to yield any significant LU gains. IPT yields the largest gain (+3.4%) for BERT on HANS – the NLI dataset consisting of adversarial examples for which syntax deliberately affects the sentence meaning more directly. The same effect, however, is not there for RoBERTa, suggesting that the additional syntactic knowledge that BERT gets through IPT, RoBERTa seems to obtain through larger-scale pretraining.

Zero-Shot Language Transfer. We show the results obtained for the zero-shot downstream language transfer setup, for both mBERT and XLM-R, in Table 3. Again, these results do not particularly favor the intermediate injection of explicit syntactic information in general. However, in a few cases we do observe gains from the intermediate target-language parsing training: e.g., a 3% gain on PAWS-X for zh as well as 4% and 5% gains on XCOPA for zh and tr, respectively. Interestingly, all substantial improvements are obtained for mBERT; for XLM-R, the improvements are less consistent and less pronounced. This might be due to XLM-R’s larger capacity, which makes it less susceptible to the “curse of multilinguality” Conneau et al. (2020): with the subword vocabulary twice as large as mBERT’s, XLM-R is able to store more language-specific information. Also, XLM-R has seen substantially more target language data in LM-pretraining than mBERT for each language. This might mean that the larger IPT gains for mBERT come from mere exposure to additional target language text rather than from injection of explicit syntactic UD signal (see further analyses in §5.2).

5.2 Further Analysis and Discussion

We first compare the impact of IPT with the effect of additional LM training on the same raw data. We then quantify the topological modification that IPT makes in transformers’ representation spaces.

Explicit Syntax or Just More Language Data? We scrutinize the IPT gains that we observe in some zero-shot language transfer experiments. We hypothesize that these gains may, at least in part, be credited to the transformer simply seeing more target language data. To investigate this, we replace IPT with intermediate (masked) language modeling training (ILMT) on the same data (i.e., sentences from the respective treebank used in IPT) before final downstream LU fine-tuning. Because MLM is a self-supervised objective, we can credit all differences in downstream LU performance between ILMT and IPT variants of the same pretrained transformer to supervised parsing, i.e., to the injection of explicit UD knowledge.

ILMT Details. We mask 15% of subword tokens in each sentence and predict them with a linear classifier applied on transformed representations of [MASK] tokens. We compute the cross-entropy loss and use the same hyperparameter configuration as described in §4.3. The development set, used for early stopping, is subjected to fixed masking, whereas we mask the training sentences dynamically, before feeding them to the transformer.
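The difference between fixed masking (development set) and dynamic masking (training set) can be sketched as below. This is a simplified illustration under stated assumptions: exactly round(0.15·n) positions per sentence are masked (at least one), every selected token is replaced by the literal string `[MASK]`, and fixed masking is obtained by seeding a random generator with a per-sentence identifier. The function names and the seeding scheme are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Mask a fixed share of positions; return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so the loss is computed only over the masked positions.
    """
    k = max(1, round(mask_prob * len(tokens)))          # assumption: exact share
    positions = set(rng.sample(range(len(tokens)), k))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels

def fixed_mask(tokens, sentence_id, mask_prob=0.15):
    """Development-set masking: same positions on every call for a sentence."""
    return mask_tokens(tokens, random.Random(sentence_id), mask_prob)

def dynamic_mask(tokens, rng, mask_prob=0.15):
    """Training-set masking: positions are re-sampled on every call."""
    return mask_tokens(tokens, rng, mask_prob)

tokens = [f"tok{i}" for i in range(20)]
dev_a = fixed_mask(tokens, sentence_id=7)
dev_b = fixed_mask(tokens, sentence_id=7)          # identical to dev_a
train_rng = random.Random(0)
train_a, _ = dynamic_mask(tokens, train_rng)       # may differ between epochs
```

Keeping the development masks fixed makes development losses comparable across measurements, which matters when they drive early stopping.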

Results. We run this analysis for setups in which we observe substantial gains from IPT: PAWS-X for mBERT (Adapter fine-tuning, for fr and zh) and XCOPA for mBERT (Standard fine-tuning, tr and zh). The comparison between IPT and ILMT for these setups is provided in Figure 2. Like IPT, ILMT on mBERT generates downstream gains over direct downstream fine-tuning (i.e., no intermediate training) in all four setups. The gains from ILMT (with the exception of XCOPA for zh) are almost as large as gains from IPT. This suggests that most of the gain with IPT comes from seeing more target language text, and prevents us from concluding that the explicit syntactic annotation is responsible for the LU improvements in zero-shot downstream transfer. This interpretation is corroborated by the fact that IPT gains roughly correlate with the amount of language-specific data seen in LM-pretraining: the gains are more prominent for mBERT than for XLM-R and for tr and zh than for fr and de (see Table 3).

Although not invariant to all linear transformations, l-CKA is invariant to orthogonal projection and isotropic scaling, which suffices for our purposes. We base our analysis on the following assumption: the extent of change in transformers’ representation space topology (reflected by l-CKA) is proportional to the novelty of knowledge injected in fine-tuning. Put differently, injection of new (i.e., missing) knowledge should substantially change the topology of the space (low l-CKA score).
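As an illustration of the invariance properties mentioned above, here is a minimal sketch of a linear CKA computation. It assumes that l-CKA denotes the standard linear centered kernel alignment over column-centered representation matrices of the same inputs, i.e., the squared Frobenius norm of the cross-covariance divided by the product of the Frobenius norms of the two self-covariances; the exact definition used in this work is not reproduced in the text here. The function name `linear_cka` is hypothetical.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations of the same n inputs.

    X : (n, d1) representations from one model or layer
    Y : (n, d2) representations from another model or layer
    Returns a similarity in [0, 1]; 1 means identical up to rotation and scale.
    """
    X = X - X.mean(axis=0, keepdims=True)     # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))          # random orthogonal matrix

rotated_and_scaled = linear_cka(X, 3.0 * X @ Q)       # close to 1
unrelated = linear_cka(X, rng.normal(size=(50, 8)))   # clearly below 1
```

In the analysis, X and Y would be the same layer's token representations taken from two variants of a transformer (e.g., before and after IPT), so a low score indicates that fine-tuning reshaped that layer's representation space.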

Figure 3 shows the heatmap of l-CKA scores for pairs of BERT and RoBERTa variants, for layers L8-L12. (Most l-CKA scores in layers L1-L7 are very high ($\geq 0.9$) and provide little insight; see the Supplementary Material.) Comparing B-P and B-N reveals that IPT changes the topology of BERT’s higher layers roughly as much as NLI fine-tuning does, implying that both the English UD treebank (en-EWT) and Multi-NLI data contain a non-negligible amount of novel knowledge for BERT. However, the direct N-P comparison shows that IPT and NLI enrich BERT (also RoBERTa) with different types of knowledge, i.e., they change the representation spaces of its layers in different ways. This suggests that the transformers cannot acquire the missing knowledge needed for NLI from IPT (i.e., from en-EWT), and explains why IPT is not effective for NLI.

IPT (comparison B-P) injects more new information than ILMT (comparison B-M), and this is more pronounced for BERT than for RoBERTa. IPT and ILMT change RoBERTa’s parameters much less than BERT’s (see B-M and B-P l-CKA scores for L11/L12), which we interpret as additional evidence, besides RoBERTa consistently outscoring BERT, that RoBERTa encodes richer language representations, due to its larger-scale and longer training. It also agrees with suggestions that BERT is “undertrained” for its capacity Liu et al. (2019b).

Very high B-P (and B-AP) l-CKA scores in lower layers suggest that the explicit syntactic knowledge from human-curated treebanks is redundant w.r.t. the structural language knowledge transformers obtain through LM pretraining. This is consistent with concurrent observations Chi et al. (2020); Kulmizev et al. (2020) showing (some) correspondence between structural knowledge of (m)BERT and UD syntax. Finally, we observe the highest l-CKA scores in the P-AP column, suggesting that Standard and Adapter IPT inject roughly the same syntactic information, despite different fine-tuning mechanisms.

Figure 4 illustrates the results of the same analysis for language transfer experiments, for de and tr (scores for fr and zh are in the Appendix). The effects of ILMT and IPT (B-M, B-P/B-AP) for de and tr with mBERT and XLM-R resemble those for en with BERT and RoBERTa: ILMT changes transformers less than IPT. The amount of new syntactic knowledge IPT injects is larger (l-CKA scores are lower) than for en, especially for XLM-R (vs. RoBERTa for en). We believe that it reflects the relative under-representation of the target language in the model’s multilingual pretraining corpus (e.g., for tr): this leads to poorer representations of target language structure by mBERT and XLM-R compared to BERT’s and RoBERTa’s representation of en structure. This gives us two seemingly conflicting empirical findings: (a) IPT appears to inject a fair amount of target-language UD syntax, but (b) this translates to (mostly) insignificant and inconsistent gains in language transfer in LU tasks (especially so for XLM-R, cf. Table 3). A plausible reconciling hypothesis is that there is a substantial mismatch between the type of structural information we obtain through supervised (UD) parsing and the type of structural knowledge beneficial for LU tasks. If true, this hypothesis would render supervised parsing rather unavailing for high-level language understanding, at least in the context of LM-pretrained transformers, the current state of the art in NLP. This warrants further investigation, and we hope that our work will inspire further discussion and additional studies.

Conclusion

We thoroughly examined the effects of leveraging formalized syntactic structures (UD) in state-of-the-art neural language models (e.g., RoBERTa, XLM-R) for downstream language understanding (LU) tasks, both in monolingual and language transfer settings. The key results, obtained through intermediate parsing training (IPT) based on a state-of-the-art-level dependency parser, indicate that explicit syntax, at least in our extensive experiments, has negligible impact on LU tasks.

Besides offering extensive empirical evidence of the mismatch between explicit syntax and improved LU performance with state-of-the-art transformers, this study sheds new light on some fundamental questions such as the one in the title. Similar to word embeddings Mikolov et al. (2013) removing sparse lexical features from the NLP horizon, will transformers make supervised parsing obsolete for LU applications or not? More dramatically, in the words of Rens Bod (2007): “Is the end of supervised parsing in sight” for semantic LU tasks? The answer is ‘Probably no’: formalized syntactic structures will still be an important source of inductive bias, especially in setups without sufficient text data for large-scale pretraining; our experiments, however, validate that state-of-the-art transformer models can implicitly capture that inductive bias in high-resource setups and for major languages.

Acknowledgments

We thank the anonymous reviewers (especially R2!) for the exceptionally meaningful and helpful comments. Goran Glavaš is supported by the Baden Württemberg Stiftung (Eliteprogramm, AGREE grant). The work of Ivan Vulić is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909).

References

Appendix A Reproducibility

We first provide details on where to obtain datasets and code used in this work.

Table 5 lists the sizes (in number of sentences) of Universal Dependencies treebanks that we use for our intermediate parsing training and evaluation of our biaffine dependency parsers. The UD treebanks v.2.5, which we used in this work, are available at: http://hdl.handle.net/11234/1-3105. In Table 6 we provide links to language understanding datasets used in our study.

A.2 Code and Dependencies

We make our code available at: https://github.com/codogogo/parse_stilt. Our code is built on top of the HuggingFace Transformers framework: https://github.com/huggingface/transformers (v. 2.7). Table 4 details the LM-pretrained transformer models from this framework which we exploited in this work. Besides the Transformers library, our code only relies on Python’s standard scientific computing libraries (e.g., numpy).

Appendix B zh Tokenization: XLM-R vs. GSD

A word-level token from the parse tree normally corresponds to one or more of the transformer’s subword tokens: we thus average subword vectors to obtain word vectors for biaffine parsing. For XLM-R and the zh GSD treebank, however, a single XLM-R subword token often corresponds to two treebank tokens. E.g., the sequence “只是二選一做決擇” with treebank tokenization [‘只’, ‘是’, ‘二’, ‘選’, ‘一’, ‘做’, ‘決擇’] is tokenized as [‘只是’, ‘二’, ‘選’, ‘一’, ‘做’, ‘決’, ‘擇’] by XLM-R. Two treebank tokens, ‘只’ and ‘是’, are captured with a single XLM-R “subword” token, ‘只是’. To ensure that each XLM-R subword token corresponds to exactly one treebank token, we inject spaces between treebank tokens before XLM-R tokenization: we then obtain the subword tokenization [‘只’, ‘是’, ‘二’, ‘選’, ‘一’, ‘做’, ‘決’, ‘擇’]. However, this is suboptimal for XLM-R: its representations of tokens ‘只’ and ‘是’ are probably less reliable than that of the ‘只是’ token. We believe this is why mBERT (without tokenization mismatches for zh) outperforms XLM-R in zh parsing.
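The subword-to-word averaging mentioned at the start of this appendix can be sketched as follows. This is an illustrative helper, not the released code: the function name `average_subwords` and the `word_ids` alignment list (one treebank-token index per subword) are assumptions, and every treebank token is assumed to own at least one subword, which is what the space injection described above guarantees.

```python
import numpy as np

def average_subwords(subword_vectors, word_ids):
    """Average subword vectors into word-level vectors.

    subword_vectors : (S, H) array, one row per subword token
    word_ids        : length-S sequence; word_ids[s] is the index of the
                      treebank token that subword s belongs to
    Returns an (N, H) array with N = max(word_ids) + 1.
    """
    word_ids = np.asarray(word_ids)
    n_words = int(word_ids.max()) + 1
    sums = np.zeros((n_words, subword_vectors.shape[1]))
    np.add.at(sums, word_ids, subword_vectors)             # scatter-add rows
    counts = np.bincount(word_ids, minlength=n_words)[:, None]
    return sums / counts

# Example from the text after space injection: eight subwords, seven
# treebank tokens; the last two subwords both belong to the word '決擇'.
subwords = ["只", "是", "二", "選", "一", "做", "決", "擇"]
word_ids = [0, 1, 2, 3, 4, 5, 6, 6]
vectors = np.arange(16, dtype=float).reshape(8, 2)         # toy 2-dim vectors
word_vectors = average_subwords(vectors, word_ids)
```

Without the space injection, the XLM-R token ‘只是’ would span two treebank tokens, and no such many-to-one alignment from subwords to words would exist.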

Appendix C Complete Topology Analysis Results

Finally, we show the complete results (for all layers, all transformers, and all languages covered in our experiments) of our topological analysis of transformers’ representations before and after different fine-tuning steps. Figure 5 shows the analysis results for monolingual en transformers, BERT and RoBERTa. Figure 6 and Figure 7 show the results for multilingual transformers, mBERT and XLM-R, respectively, for all four target languages included in our experiments: de, fr, tr, and zh.