Which *BERT? A Survey Organizing Contextualized Encoders

Patrick Xia, Shijie Wu, Benjamin Van Durme

Introduction

A couple years ago, Peters et al. (2018, ELMo) won the NAACL Best Paper Award for creating strong performing, task-agnostic sentence representations due to large scale unsupervised pretraining. Days later, its high level of performance was surpassed by Radford et al. (2018) which boasted representations beyond a single sentence and finetuning flexibility. This instability and competition between models has been a recurring theme for researchers and practitioners who have watched the rapidly narrowing gap between text representations and language understanding benchmarks. However, it has not discouraged research. Given the recent flurry of models, we often ask: “What, besides state-of-the-art, does this newest paper contribute? Which encoder should we use?”

The goals of this survey are to outline the areas of progress, relate contributions in text encoders to ideas from other fields, describe how each area is evaluated, and present considerations for practitioners and researchers when choosing an encoder. This survey does not intend to compare specific model metrics, as tables from other works provide comprehensive insight. For example, Table 16 in Raffel et al. (2019) compares the scores on a large suite of tasks of different model architectures, training objectives, and hyperparameters, and Table 1 in Rogers et al. (2020) details early efforts in model compression and distillation. We also recommend other closely related surveys on contextualized word representations Smith (2019); Rogers et al. (2020); Liu et al. (2020a), transfer learning in NLP Ruder et al. (2019), and integrating encoders into NLP applications Wolf et al. (2019). Complementing these existing bodies of work, we look at the ideas and progress in the scientific discourse for text representations from the perspective of discerning their differences.

We organize this paper as follows. §2 provides brief background on encoding, training, and evaluating text representations. §3 identifies and analyzes two classes of pretraining objectives. In §4, we explore faster and smaller models and architectures in both training and inference. §5 notes the impact of both quality and quantity of pretraining data. §6 briefly discusses efforts on probing encoders and representations with respect to linguistic knowledge. §7 describes the efforts into training and evaluating multilingual representations. Within each area, we conclude with high-level observations and discuss the evaluations that are used and their shortcomings.

We conclude in §8 by making recommendations to researchers: publicizing negative results in this area is especially important owing to the sheer cost of experimentation and to ensure evaluation reproducibility. In addition, probing studies need to focus not only on the models and tasks, but also on the pretraining data. We pose questions for users of contextualized encoders, like whether the compute requirement of a model is worth the benefits. We hope our survey serves as a guide for both NLP researchers and practitioners, orienting them to the current state of the field of contextualized encoders and differences between models.

Background

Pretrained text encoders take as input a sequence of tokenized text, which is encoded by a multi-layered neural model. (Unlike traditional word-level tokenization, most works decompose text into subtokens from a fixed vocabulary using some variation of byte pair encoding Gage (1994); Schuster and Nakajima (2012); Sennrich et al. (2016).) The representation of each (sub)token, $x_{t}$, is either the set of hidden states, $\{h^{(l)}_{t}\}$ for each layer $l$, or its hidden state on just the top layer, $h^{(-1)}_{t}$. Unlike fixed-sized word, sentence, or paragraph representations, the size of the produced contextualized representations depends on the length of the input text. Most encoders use the Transformer architecture Vaswani et al. (2017).
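To make the subtoken decomposition concrete, the following is a minimal sketch of byte pair encoding in the spirit of Sennrich et al. (2016). It is a toy under stated assumptions, not the tokenizer of any particular encoder: the function names (`learn_bpe`, `merge_pair`, `segment`), the word-count input format, and the lexicographic tie-breaking rule are all illustrative choices.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Every word starts as a sequence of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption, for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subtokens by replaying the learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because the merges are replayed in order, a word never seen during learning is still decomposed into known subtokens, which is what allows a fixed vocabulary to cover open-ended text.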

Transfer: The Pretrain-Finetune Framework

While text representations can be learned in any manner, ultimately, they are evaluated using specific target tasks. Historically, the learned representations (e.g. word vectors) were used as initialization for task-specific models. Dai and Le (2015) are credited with using pretrained language model outputs as initialization, McCann et al. (2017) use pretrained outputs from translation as frozen word embeddings, and Howard and Ruder (2018) and Radford et al. (2018) demonstrate the effectiveness of finetuning to different target tasks by updating the full (pretrained) model for each task. We refer to the embeddings produced by the pretrained models (or encoders) as contextualized text representations. As our goal is to discuss the encoders and their representations, we do not cover the innovations in finetuning (Liu et al., 2015; Ruder et al., 2019; Phang et al., 2018; Liu et al., 2019c; Zhu et al., 2020, inter alia).

Evaluation

Widely adopted evaluations of text representations relate them to downstream natural language understanding (NLU) benchmarks. This full-stack process necessarily conflates representation power with finetuning strategies. Common language understanding benchmarks include (1) a diverse suite of sentence-level tasks covering paraphrasing, natural language inference, sentiment, and linguistic acceptability (GLUE) and its more challenging counterpart with additional commonsense and linguistic reasoning tasks (SuperGLUE) Wang et al. (2019c, b); Clark et al. (2019a); De Marneffe et al. (2019); Roemmele et al. (2011); Khashabi et al. (2018); Zhang et al. (2018); Dagan et al. (2006); Bar Haim et al. (2006); Giampiccolo et al. (2007); Bentivogli et al. (2009); Pilehvar and Camacho-Collados (2019); Rudinger et al. (2018); Poliak et al. (2018); Levesque et al. (2011); (2) crowdsourced questions derived from Wikipedia articles (Rajpurkar et al., 2016, 2018, SQuAD); and (3) multiple-choice reading comprehension (Lai et al., 2017, RACE).

Area I: Pretraining Tasks

To utilize data at scale, pretraining tasks are typically self-supervised. We categorize the contributions into two types: token prediction (over a large vocabulary space) and nontoken prediction (over a handful of labels). In this section, we discuss several empirical observations. While token prediction is clearly important, less clear is which variation of the token prediction task is the best (or whether it even matters). Nontoken prediction tasks appear to offer orthogonal contributions that marginally improve the language representations. We emphasize that in this section, we seek to outline the primary efforts in pretraining objectives and not to provide a comparison on a set of benchmarks (see Raffel et al. (2019) for comprehensive experiments).

Predicting (or generating) the next word has historically been equivalent to the task of language modeling. Large language models perform impressively on a variety of language understanding tasks while maintaining their generative capabilities Radford et al. (2018, 2019); Keskar et al. (2019); Brown et al. (2020), often outperforming contemporaneous models that use additional training objectives.

ELMo Peters et al. (2018) is a BiLSTM model with a language modeling objective for the next (or previous) token given the forward (or backward) history. This idea of looking at the full context was further refined as a cloze task (a fill-in-the-blank task) Baevski et al. (2019), or as a denoising Masked Language Modeling (MLM) objective (Devlin et al., 2019, BERT). MLM replaces some tokens with a [mask] symbol and provides both right and left contexts (bidirectional context) for predicting the masked tokens. The bidirectionality is key to outperforming a unidirectional language model on a large suite of natural language understanding benchmarks Devlin et al. (2019); Raffel et al. (2019).

The MLM objective is far from perfect, as the use of [mask] introduces a pretrain/finetune vocabulary discrepancy. Devlin et al. (2019) look to mitigate this issue by occasionally replacing [mask] with the original token or sampling from the vocabulary. Yang et al. (2019) convert the discriminative objective into an autoregressive one, which allows the [mask] token to be discarded entirely. Naively, this would result in unidirectional context. By sampling permutations of the factorization order of the joint probability of the sequence, they preserve bidirectional context. Similar ideas for permutation language modeling (PLM) have also been studied for sequence generation Stern et al. (2019); Chan et al. (2019); Gu et al. (2019). The MLM and PLM objectives have since been unified architecturally Song et al. (2020); Bao et al. (2020) and mathematically Kong et al. (2020).
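The MLM corruption step and the mitigation of Devlin et al. (2019) can be sketched as follows. The 15% selection rate and the 80/10/10 split between [mask], a random token, and the unchanged token are the values we recall from Devlin et al. (2019); they are not stated in this survey and should be treated as assumptions, as should the function name `mlm_corrupt` and its signature.

```python
import random

MASK = "[mask]"


def mlm_corrupt(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, targets); targets[i] is None where no prediction is required."""
    rng = rng or random.Random(0)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # usual case: hide the token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # sample a token from the vocabulary
            else:
                corrupted.append(tok)                # keep the original token
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets
```

Since the selected positions are not always replaced by [mask], the encoder cannot rely on the symbol alone to know where a prediction is needed, which is the intent behind the mitigation.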

ELECTRA Clark et al. (2020) replaces [mask] through the use of a small generator (trained with MLM) to sample a real token from the vocabulary. The main encoder, a discriminator, then determines whether each token was replaced.
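A minimal sketch of how training examples for such a discriminator could be built is given below. The real generator is a small MLM; here it is replaced by uniform sampling from the vocabulary purely for illustration, and the function name and the `replace_prob` parameter are hypothetical.

```python
import random


def replaced_token_detection(tokens, vocab, replace_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is 1 if position i no longer holds the original token."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            sampled = rng.choice(vocab)  # stand-in for sampling from a trained generator
        else:
            sampled = tok
        corrupted.append(sampled)
        # If the sampled token happens to equal the original, it counts as not replaced.
        labels.append(int(sampled != tok))
    return corrupted, labels
```

Unlike MLM, every position carries a label, so the discriminator receives a training signal from the whole sequence rather than only from the masked subset.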

A natural extension would mask units that are more linguistically meaningful, such as rarer words (Clark et al. (2020) report negative results for rarer words), whole words, or named entities Devlin et al. (2019); Sun et al. (2019b). This idea can be simplified to random spans of texts Yang et al. (2019); Song et al. (2019). Specifically, Joshi et al. (2020) add a reconstruction objective which predicts the masked tokens using only the span boundaries. They find that masking random spans is more effective than masking linguistic units.
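Random span masking can be sketched as below. This is a simplification under stated assumptions: span lengths are drawn uniformly up to `max_span_len` rather than from the length distribution used in any specific paper, and `mask_ratio`, `max_span_len`, and the function name are illustrative.

```python
import random

MASK = "[mask]"


def mask_random_spans(tokens, mask_ratio=0.15, max_span_len=3, rng=None):
    """Mask contiguous spans until about `mask_ratio` of the tokens are hidden."""
    rng = rng or random.Random(0)
    n = len(tokens)
    if n == 0:
        return [], []
    budget = max(1, round(n * mask_ratio))  # number of positions to mask
    masked = set()
    while len(masked) < budget:
        # Never draw a span longer than the remaining budget.
        length = min(rng.randint(1, max_span_len), budget - len(masked))
        start = rng.randint(0, n - length)
        masked.update(range(start, start + length))
    corrupted = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    return corrupted, sorted(masked)
```

The spans are chosen without regard to word or entity boundaries, which is exactly the linguistically unmotivated setting that was found to be effective.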

An alternative architecture uses an encoder-decoder framework (or denoising autoencoder) where the input is a corrupted (masked) sequence and the output is the full original sequence Wang et al. (2019d); Lewis et al. (2020); Raffel et al. (2019).

Nontoken Prediction

Bender and Koller (2020) argue that for the goal of natural language understanding, we cannot rely purely on a language modeling objective; there must be some grounding or external information that relates texts to each other or to the world. One solution is to introduce a secondary objective to directly learn these biases.

Self-supervised discourse structure objectives, such as text order, have garnered significant attention. To capture relationships between two sentences, Devlin et al. (2019) introduce the next sentence prediction (NSP) objective. (Here, “sentence” unfortunately refers to a text segment containing no more than a fixed number of subtokens; it may contain any (fractional) number of real sentences.) In this task, either sentence B follows sentence A or B is a random negative sample. Subsequent works showed that this was not effective, suggesting the model simply learned topic Yang et al. (2019); Liu et al. (2019d). Jernite et al. (2017) propose a sentence order task of predicting whether A is before, after, or unrelated to B, and Wang et al. (2020b) and Lan et al. (2020) use it for pretraining encoders. They report that (1) understanding text order does contribute to improved language understanding; and (2) harder-to-learn pretraining objectives are more powerful, as both modified tasks have lower intrinsic performance than NSP. It is still unclear, however, if this is the best way to incorporate discourse structure, especially since these works do not use real sentences.
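The difference between the two objectives is easiest to see in how their training pairs are constructed. The sketch below is illustrative only: the function names, the equal sampling of labels, and the use of a second document as the source of negatives are assumptions, and each document is assumed to be a list of at least two segments.

```python
import random


def nsp_example(doc, other_doc, rng):
    """Return (a, b, is_next): b either truly follows a or is a random negative."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], True
    return doc[i], rng.choice(other_doc), False


def sentence_order_example(doc, other_doc, rng):
    """Return (a, b, label) with label in {'before', 'after', 'unrelated'} describing a relative to b."""
    i = rng.randrange(len(doc) - 1)
    label = rng.choice(["before", "after", "unrelated"])
    if label == "before":
        return doc[i], doc[i + 1], label
    if label == "after":
        return doc[i + 1], doc[i], label  # same two segments, swapped
    return doc[i], rng.choice(other_doc), label
```

In the NSP case, a negative comes from unrelated text and can often be rejected from topic alone, whereas the swapped pair in the sentence order task shares its topic with the positive pair and therefore requires attention to ordering.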

Additional work has focused on effectively incorporating multiple pretraining objectives. Sun et al. (2020a) use multi-task learning with continual pretraining Hashimoto et al. (2017), which incrementally introduces newer tasks into the set of pretraining tasks from word to sentence to document level tasks. Encoders using visual features (and evaluated only on visual tasks) jointly optimize multiple different masking objectives over both token sequences and regions of interest in the image Tan and Bansal (2019). Table 5 in Su et al. (2020) provides a recent summary of efforts in visual-linguistic representations.

Prior to token prediction, discourse information has been used in training sentence representations. Conneau et al. (2017, 2018a) use natural language inference sentence pairs, Jernite et al. (2017) use discourse-based objectives of sentence order, conjunction classifier, and next sentence selection, and Nie et al. (2019) use discourse markers. While there is weak evidence suggesting that these types of objectives are less effective than language modeling Wang et al. (2019a), we lack fair studies comparing the relative influence between the two categories of objectives.

Comments on Evaluation

We reviewed the progress on pretraining tasks, finding that token prediction is powerful but can be improved further by other objectives. Currently, successful techniques like span masking or arbitrarily sized “sentences” are linguistically unmotivated. We anticipate future work to further incorporate more meaningful linguistic biases in pretraining.

Our observations are informed by evaluations that are compared across different works. These benchmarks on downstream tasks do not account for ensembling or finetuning and can only serve as an approximation for the differences between the models. For example, Jiang et al. (2020) develop a finetuning method over a supposedly weaker model which leads to gains in GLUE score over reportedly stronger models. Furthermore, these evaluations aggregate vastly different tasks. Those interested in the best performance should first carefully investigate metrics on their specific task. Even if models are finetuned on an older encoder (this is the case with retrieval-based QA Guu et al. (2020); Herzig et al. (2020), which builds on BERT), it may be more cost-efficient and enable fairer future comparisons to reuse those models rather than restarting the finetuning or reintegrating new encoders into existing models, since doing so does not necessarily guarantee improved performance.

Area II: Efficiency

As models perform better but cost more to train, some have called for research into efficient models to improve deployability, accessibility, and reproducibility Amodei and Hernandez (2018); Strubell et al. (2019); Schwartz et al. (2019). Encoders tend to scale effectively Lan et al. (2020); Raffel et al. (2019); Brown et al. (2020), so efficient models will also result in improvements over inefficient ones of the same size. In this section, we give an overview of several efforts aimed at decreasing the computation budget (time and memory usage) during training and inference of text encoders. While these two axes are correlated, reductions in one axis do not always lead to reductions in the other.

One area of research decreases wall-clock training time through more compute and larger batches. You et al. (2020) reduce the time of training BERT by introducing the LAMB optimizer, a large batch stochastic optimization method adjusted for attention models. Rajbhandari et al. (2020) analyze memory usage in the optimizer to enable parallelization of models resulting in higher throughput in training. By reducing the training time, models can be practically trained for longer, which has also been shown to lead to benefits in task performance (Liu et al., 2019d; Lan et al., 2020, inter alia).

Another line of research reduces the compute through attention sparsification (discussed in §4.2) or increasing the convergence rate Clark et al. (2020). These works report hardware and estimate the reduction in floating point operations (FPOs; we borrow this terminology from Schwartz et al. (2019)). These kinds of speedup are orthogonal to hardware parallelization and are most encouraging as they pave the path for future work in efficient training.

Note that these approaches do not necessarily affect the latency to process a single example nor the compute required during inference, which is a function of the size of the computation graph.

Inference

Reducing model size without impacting performance is motivated by lower inference latency, hardware memory constraints, and the promise that naively scaling up dimensions of the model will improve performance. Size reduction techniques produce smaller and faster models, while occasionally improving performance. Rogers et al. (2020) survey BERT-like models and present in Table 1 the differences in sizes and performance across several models focused on inference efficiency.

Architectural changes have been explored as one avenue for reducing either the model size or inference time. In Transformers, the self-attention pattern scales quadratically in sequence length. To reduce the asymptotic complexity, the self-attention can be sparsified: each token only attending to a small “local” set Vaswani et al. (2017); Child et al. (2019); Sukhbaatar et al. (2019). This has further been applied to pretraining on longer sequences, resulting in sparse contextualized encoders (Qiu et al., 2019; Ye et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020, inter alia). Efficient Transformers is an emerging subfield with applications beyond NLP; Tay et al. (2020) survey 17 Transformers that have implications on efficiency.
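The saving from restricting each token to a local set can be illustrated by counting attended pairs. The sketch below assumes a simple symmetric sliding window of `window` tokens on each side, which is only one of the sparsity patterns explored in the cited works; the function names are illustrative.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j, i.e. |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

For a sequence of length n and a fixed window w, the number of attended pairs is n(2w + 1) - w(w + 1), which grows linearly in n, compared with n^2 for full self-attention.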

Another class of approaches carefully selects weights to reduce model size. Lan et al. (2020) use low-rank factorization to reduce the size of the embedding matrices, while Wang et al. (2019f) factorize other weight matrices. Additionally, parameters can be shared between layers Dehghani et al. (2019); Lan et al. (2020) or between an encoder and decoder Raffel et al. (2019). However, models that employ these methods do not always have smaller computation graphs. This greatly reduces the usefulness of parameter sharing compared to other methods that additionally offer greater speedups relative to the reduction in model size.
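As a worked example of the embedding factorization, the sketch below counts parameters when a V x H embedding matrix is replaced by a V x E matrix followed by an E x H projection. The function name and the sizes used in the usage note (V = 30000, H = 768, E = 128) are illustrative assumptions rather than figures taken from this survey.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Parameter count of the embedding layer, optionally with a low-rank factorization."""
    if embed_size is None:
        return vocab_size * hidden_size  # single V x H matrix
    # V x E lookup table followed by an E x H projection
    return vocab_size * embed_size + embed_size * hidden_size
```

With the illustrative sizes above, the unfactorized layer has 30000 x 768 = 23,040,000 parameters, while the factorized one has 30000 x 128 + 128 x 768 = 3,938,304. The saving applies to storage only: every token still passes through a hidden size of H in the layers above, so the computation graph of the encoder itself is unchanged.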

Closely related, model pruning Denil et al. (2013); Han et al. (2015); Frankle and Carbin (2018) during training or inference has exploited the overparameterization of neural networks by removing up to 90%-95% of parameters. This approach has been successful in not only reducing the number of parameters, but also improving performance on downstream tasks. Related to efforts for pruning deep networks in computer vision Huang et al. (2016), layer selection and dropout during both training and inference have been studied in both LSTM Liu et al. (2018a) and Transformer Fan et al. (2020) based encoders. These also have a regularization effect resulting in more stable training and improved performance. There are additional novel pruning methods that can be performed during training Guo et al. (2019); Qiu et al. (2019). These successful results are corroborated by other efforts Gordon et al. (2020) showing that low levels of pruning do not substantially affect pretrained representations. Additional successful efforts in model pruning directly target a downstream task Sun et al. (2019a); Michel et al. (2019); McCarley (2019); Cao et al. (2020a). Note that pruning does not always lead to speedups in practice as sparse operations may be hard to parallelize.
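One common pruning criterion, magnitude pruning, is sketched below on a flat list of weights. It is chosen here only as a simple illustration and is not necessarily the criterion used by each of the cited works; the function name and the `sparsity` parameter are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

The pruned list has the same length as the original, with zeros scattered throughout. This illustrates why a reduction in the number of nonzero parameters does not by itself yield a speedup: dense hardware kernels still process every entry unless sparse operations are used.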

Knowledge distillation (KD) uses an overparameterized teacher model to rapidly train a smaller student model with minimal loss in performance Hinton et al. (2015) and has been used for translation Kim and Rush (2016), computer vision Howard et al. (2017), and adversarial examples Carlini and Wagner (2016). This has been applied to ELMo Li et al. (2019) and BERT (Tang et al., 2019; Sanh et al., 2019; Sun et al., 2020b, inter alia). KD can also be combined with adaptive inference, which dynamically adjusts model size Liu et al. (2020b), or performed on submodules which are later substituted back into the full model Xu et al. (2020).
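The core of KD in the formulation of Hinton et al. (2015) is a loss that pulls the student's softened output distribution toward the teacher's. The sketch below shows only this soft-target term for a single example; the temperature value is an arbitrary assumption, and the usual combination with a hard-label loss and the temperature-squared scaling are omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
```

A higher temperature flattens both distributions, so the student is also trained on the relative probabilities that the teacher assigns to incorrect classes, not only on its top prediction.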

Quantization with custom low-precision hardware is also a promising method for both reducing the size of models and compute time, although it does not reduce the number of parameters or FPOs Shen et al. (2020); Zafrir et al. (2019). This line of work is mostly orthogonal to other efforts specific to NLP.
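As a minimal illustration, the sketch below applies symmetric 8-bit quantization with a single scale per list of weights. This is one simple scheme chosen for clarity and is not claimed to match the methods of the cited works; the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]
```

The number of stored values is unchanged; each one simply occupies 8 bits instead of 32, and the rounding error per weight is at most half the scale.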

Standardizing Comparison

There has yet to be a comprehensive and fair evaluation across all models. The closest, Table 1 in Rogers et al. (2020), compares 12 works in model compression. However, almost no two papers are evaluated against the same BERT with the same set of tasks. Many papers on attention sparsification do not evaluate on NLU benchmarks. We claim this is because finetuning is itself an expensive task, so it is not prioritized by authors: works on improving model efficiency have focused only on comparing to a BERT on a few tasks.

While it is easy for future research on pretraining to report model sizes and runtimes, it is harder for researchers in efficiency to report NLU benchmarks. We suggest extending versions of the leaderboards under different resource constraints so that researchers with access to less hardware could still contribute under the resource-constrained conditions. Some work has begun in this direction: the SustaiNLP 2020 Shared Task is focused on the energy footprint of inference for GLUE (https://sites.google.com/view/sustainlp2020/shared-task).

Area III: (Pretraining) Data

Unsurprisingly for our field, increasing the size of training data for an encoder contributes to increases in language understanding capabilities Yang et al. (2019); Raffel et al. (2019); Kaplan et al. (2020). At current data scales, some models converge before consuming the entire corpus. In this section, we identify a weakness when given less data, advocate for better data cleaning, and raise technical and ethical issues with using web-scraped data.

A ceiling has not yet been observed on the amount of data that can still be effectively used in training Baevski et al. (2019); Liu et al. (2019d); Yang et al. (2019); Brown et al. (2020). Raffel et al. (2019) curate a 745GB subset of Common Crawl (CC; https://commoncrawl.org/ scrapes publicly accessible webpages each month), which starkly contrasts with the 13GB used in BERT. For multilingual text encoding, Wenzek et al. (2020) curate 2.5TB of language-tagged CC. As CC continues to grow, there will be even larger datasets Brown et al. (2020).

Sun et al. (2017) explore a similar question for computer vision, as years of progress iterated over 1M labeled images. By using 300M images, they improved performance on several tasks with a basic model. We echo their remarks that we should be cognizant of data sizes when drawing conclusions.

Is there a floor to the amount of data needed to achieve current levels of success on language understanding benchmarks? As we decrease the data size, LSTM-based models start to dominate in perplexity Yang et al. (2019); Melis et al. (2020), suggesting there are challenges with either scaling up LSTMs or scaling down Transformers. While probing contextualized models and representations is an important area of study (see §6), prior work focuses on pretrained models or models further pretrained on domain-specific data Gururangan et al. (2020). We are not aware of any work which probes identical models trained with progressively less data. How much (and which) data is necessary for high performance on probing tasks? (Conneau et al. (2020a) claim we need a few hundred MiB of text data for BERT.)

Data Quality

While text encoders should be trained on language, large-scale datasets may contain web-scraped and uncurated content (like code). Raffel et al. (2019) ablate different types of data for text representations and find that naively increasing dataset size does not always improve performance, partially due to data quality. This realization is not new. Parallel data and alignment in machine translation (Moore and Lewis, 2010; Duh et al., 2013; Xu and Koehn, 2017; Koehn et al., 2018, inter alia) and speech Peddinti et al. (2016) often use language models to filter out misaligned or poor data. Sun et al. (2017) use automatic data filtering in vision. These successes on other tasks suggest that improved automated methods of data cleaning would let future models consume more high-quality data.

In addition to high quality, data uniqueness appears to be advantageous. Raffel et al. (2019) show that increasing the repetitions (number of epochs) of the pretraining corpus hurts performance. This is corroborated by Liu et al. (2019d), who find that random, unique masks for MLM improve over repeated masks across epochs. These findings together suggest a preference for seeing more new text. We suspect that representations of text spans appearing multiple times across the corpus are better shaped by observing them in unique contexts.

Raffel et al. (2019) find that differences in the domain of the pretraining data (web crawled vs. news or encyclopedic) result in strikingly different performance on certain challenge sets, and Gururangan et al. (2020) find that continuing pretraining on both domain-specific and task-specific data leads to gains in performance.

Datasets and Evaluations

With these larger and cleaner datasets, future research can better explore tradeoffs between size and quality, as well as strategies for scheduling data during training.

As we continue to scrape data off the web and publish challenge sets relying on other web data, we need to cautiously construct our training and evaluation sets. For example, the domains of many benchmarks (Wang et al. (2019c, GLUE), Rajpurkar et al. (2016, 2018, SQuAD), Wang et al. (2019b, SuperGLUE), Paperno et al. (2016, LAMBADA), Nallapati et al. (2016, CNN/DM)) now overlap with the data used to train language representations. Section 4 in Brown et al. (2020) more thoroughly discusses the effects of overlapping test data with pretraining data. Gehman et al. (2020) highlight the prevalence of toxic language in the common pretraining corpora and stress the importance of pretraining data selection, especially for deployed models. We are not aware of a comprehensive study that explores the effect of leaving out targeted subsets of the pretraining data. We hope that future models note the domains of pretraining and evaluation benchmarks, and that future language understanding benchmarks focus on more diverse genres in addition to diverse tasks.
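A simple way to check for the kind of overlap described above is to measure how many n-grams of an evaluation example also occur in the pretraining text. The sketch below is a toy under stated assumptions: the choice of n, whitespace-level tokens, and the function names are illustrative, and it does not reproduce the procedure of any cited work.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_tokens, train_tokens, n=3):
    """Fraction of the test example's n-grams that also appear in the training text."""
    test = ngrams(test_tokens, n)
    if not test:
        return 0.0  # example shorter than n tokens
    return len(test & ngrams(train_tokens, n)) / len(test)
```

An evaluation example with a high overlap fraction may be answerable from memorized pretraining text, so reporting such statistics alongside benchmark scores would make results easier to interpret.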

As we improve models by training on increasing sizes of crawled data, these models are also being picked up by NLP practitioners who deploy them in real-world software. These models learn biases found in their pretraining data (Gonen and Goldberg, 2019; May et al., 2019, inter alia). It is critical to clearly state the source of the pretraining data (how was the data generated, curated, and processed?) and clarify appropriate uses of the released models. For example, crawled data can contain incorrect facts about living people; while webpages can be edited or retracted, publicly released “language” models are frozen, which can raise privacy concerns Feyisetan et al. (2020).

Area IV: Interpretability

While it is clear that the performance of text encoders surpasses human baselines, it is less clear what knowledge is stored in these models; how do they make decisions? In their survey, Rogers et al. (2020) find answers to the first question and also raise the second. Inspired by prior work Lipton (2018); Belinkov and Glass (2019); Alishahi et al. (2019), we organize here the major probing methods that are applicable to all encoders in hopes that future work will use comparable techniques.

One technique uses the learned model as initialization for a model trained on a probing task consisting of a set of targeted natural language examples. The probing task’s format is flexible as additional (simple) diagnostic classifiers are trained on top of a typically frozen model Ettinger et al. (2016); Hupkes et al. (2018); Poliak et al. (2018); Tenney et al. (2019b). Task probing can also be applied to the embeddings at various layers to explore the knowledge captured at each layer Tenney et al. (2019a); Lin et al. (2019); Liu et al. (2019a). Hewitt and Liang (2019) warn that expressive (nonlinear) diagnostic classifiers can learn more arbitrary information than constrained (linear) ones. This revelation, combined with the differences in probing task format and the need to train, leads us to be cautious in drawing conclusions from these methods.

Model Inspection

Model inspection directly opens the metaphorical black box and studies the model weights without additional training. For example, the embeddings themselves can be analyzed as points in a vector space Ethayarajh (2019). Through visualization, attention heads have been matched to linguistic functions Vig (2019); Clark et al. (2019b). These works suggest inspection is a viable path to debugging specific examples. In the future, methods for analyzing and manipulating attention in machine translation Lee et al. (2017); Liu et al. (2018b); Bau et al. (2019); Voita et al. (2019) can also be applied to text encoders.

Recently, interpreting attention as explanation has been questioned Serrano and Smith (2019); Jain and Wallace (2019); Wiegreffe and Pinter (2019); Clark et al. (2019b). The ongoing discussion suggests that this method may still be insufficient for uncovering the rationale for predictions, which is critical for real-world applications.

Input Manipulation

Input manipulation draws conclusions by recasting the probing task format into the form of the pretraining task and observing the model’s predictions (this is analogous to the “few-shot” and “zero-shot” analysis in Brown et al. (2020)). As discussed in §3, word prediction (cloze task) is a popular objective. This method has been used to investigate syntactic and semantic knowledge Goldberg (2019); Ettinger (2020); Kassner and Schütze (2019). For a specific probing task, Warstadt et al. (2019) show that cloze and diagnostic classifiers draw similar conclusions. As input manipulation is not affected by variables introduced by probing tasks and is as interpretable as inspection, we suggest more focus on this method: either by creating new datasets Warstadt et al. (2020) or recasting existing ones Brown et al. (2020) into this format. A disadvantage of this method (especially for smaller models) is its dependence on both the pattern used to elicit an answer from the model and, in the few-shot case where a couple of examples are provided first, the examples themselves Schick and Schütze (2020).

Future Directions in Model Analysis

Most probing efforts have relied on diagnostic classifiers, yet these results are being questioned. Inspection of model weights has discovered what the models learn, but cannot explain their causal structure. We suggest researchers shift to the paradigm of input manipulation. By creating cloze tasks that assess linguistic knowledge, we can observe decisions made by the model, which would imply (lack of) knowledge of a phenomenon. Furthermore, it will enable us to directly interact with these models (by changing the input) without additional training, which currently introduces additional sources of uncertainty.

Bender and Koller (2020) also recommend a top-down view for model analysis that focuses on the end-goals for our field over hill-climbing individual datasets. While language models continue to outperform each other on these tasks, they argue these models do not learn meaning (a definition is given in §3 of Bender and Koller (2020)). If not meaning, what are these models learning?

We are overinvesting in BERT. While it is fruitful to understand the boundaries of its knowledge, we should look more across (simpler) models to see how and why specific knowledge is picked up as our models both become increasingly complex and perform better on a wide set of tasks. For example, how many parameters does a Transformer-based model need to outperform ELMo or even rule-based baselines?

Area V: Multilinguality

The majority of research on text encoders has been in English. (Of the monolingual encoders in other languages, core research in modeling has only been performed so far for a few non-English languages Sun et al. (2019b, 2020a).) Cross-lingual shared representations have been proposed as an efficient way to target multiple languages by using multilingual text for pretraining (Mulcaire et al., 2019; Devlin et al., 2019; Lample and Conneau, 2019; Liu et al., 2020c, inter alia). For evaluation, researchers have devised multilingual benchmarks mirroring those for NLU in English Conneau et al. (2018b); Liang et al. (2020); Hu et al. (2020). Surprisingly, without any explicit cross-lingual signal, these models achieve strong zero-shot cross-lingual performance, outperforming prior cross-lingual word embedding-based methods Wu and Dredze (2019); Pires et al. (2019).

A natural follow-up question to ask is why these models learn cross-lingual representations. Some answers include the shared subword vocabulary Pires et al. (2019); Wu and Dredze (2019), shared Transformer layers Conneau et al. (2020b); Artetxe et al. (2020) across languages, and depth of the network K et al. (2020). Studies have also found the geometry of representations of different languages in the multilingual encoders can be aligned with linear transformations Schuster et al. (2019); Wang et al. (2019e, 2020c); Liu et al. (2019b), which has also been observed in independent monolingual encoders Conneau et al. (2020b). These alignments can be further improved Cao et al. (2020b).
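One common way to formalize such a linear alignment is the orthogonal Procrustes problem. It is given here as an illustrative assumption, not necessarily the exact formulation used in the studies cited above. The symbols are introduced only for this sketch: $X, Y \in \mathbb{R}^{n \times d}$ hold $d$-dimensional representations of $n$ paired items (for example, translated words in context) from two languages, and $W$ is the map from one space to the other.

```latex
W^{*} = \operatorname*{arg\,min}_{W \in \mathbb{R}^{d \times d},\; W^{\top} W = I}
        \lVert X W - Y \rVert_{F}^{2},
\qquad
W^{*} = U V^{\top}
\quad \text{where} \quad
X^{\top} Y = U \Sigma V^{\top} \text{ is a singular value decomposition.}
```

The closed form follows because minimizing the Frobenius norm under the orthogonality constraint is equivalent to maximizing $\operatorname{tr}(W^{\top} X^{\top} Y)$.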

All of the areas discussed in this paper are applicable to multilingual encoders. However, progress in training, architecture, datasets, and evaluations is occurring concurrently, making it difficult to draw conclusions. We need more comparisons between competitive multilingual and monolingual systems or datasets. To this end, Wu and Dredze (2020) find that monolingual BERTs in low-resource languages are outperformed by multilingual BERT. Additionally, as zero-shot (or few-shot) cross-lingual transfer has inherently high variance Keung et al. (2020), the variance of models should also be reported.
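Reporting variance can be as simple as summarizing repeated runs. The sketch below assumes one score per random seed; the function name `summarize_runs` is hypothetical, and the numbers are placeholders for illustration only, not results from any paper.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_runs(scores: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation over repeated runs (e.g. seeds)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {"mean": mean(scores), "std": stdev(scores), "n": len(scores)}


# Placeholder numbers for illustration only; not results from any paper.
zero_shot_accuracy_by_seed = [60.0, 70.0, 65.0, 55.0, 75.0]
summary = summarize_runs(zero_shot_accuracy_by_seed)
print(f"{summary['mean']:.1f} ± {summary['std']:.1f} (n={summary['n']})")
```

Reporting the number of runs alongside the spread lets readers judge whether a gap between two encoders exceeds seed-to-seed noise.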

We anticipate cross-lingual performance being a new dimension to consider when evaluating text representations. For example, it will be exciting to discover how a small, highly-performant monolingual encoder contrasts against a multilingual variant; e.g., what is the minimum number of parameters needed to support a new language? Or, how does model size relate to the phylogenetic diversity of languages supported?

Discussion

This survey, like others, is limited to only what has been shared publicly so far. The papers of many models described here highlight their best parts, while potential flaws are perhaps obscured within tables of numbers. Leaderboard submissions that do not achieve first place may never be published. Meanwhile, encoders are expensive to work with, yet they are a ubiquitous component in most modern NLP models. We strongly encourage more publication and publicizing of negative results and limitations. In addition to their scientific benefits (an EMNLP 2020 workshop is motivated by better science: https://insights-workshop.github.io/), publishing negative results in contextualized encoders can avoid significant externalities of rediscovering what doesn’t work: time, money, and electricity. Furthermore, we ask leaderboard owners to periodically publish surveys of their received submissions.

The flourishing research in improving encoders is rivaled by research in interpreting them, mainly focused on discovering the boundary of what knowledge is captured by the models. For investigations that aim to sharpen the boundary, it is logical to build off of these prior results. However, we raise a concern that these encoders are all trained on similar data and have similar sizes. Future work in probing should also look across different sizes and domains of training data, as well as study the effect of model size. This can be further facilitated by model creators who release (data) ablated versions of their models.

We also raise a concern about reproducibility and accessibility of evaluation. Already, several papers focused on model compression do not report full GLUE results, possibly due to the expensive finetuning process for each of the nine datasets. Finetuning currently requires additional compute and infrastructure (Pruksachatkun et al. (2020) is a library that reduces some infrastructural overhead of finetuning), and the specific methods used impact task performance. As long as finetuning is still an essential component of evaluating encoders, devising cheap, accessible, and reproducible metrics for encoders is an open problem.

Ribeiro et al. (2020) suggest a practical solution to both probing model errors and reproducible evaluations by creating tools that quickly generate test cases for linguistic capabilities and find bugs in models. This task-agnostic methodology may be extensible to both challenging tasks and probing specific linguistic phenomena.
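The idea of generating test cases from templates can be sketched in a few lines. This is an illustration under stated assumptions, not the API of the tool by Ribeiro et al. (2020): the helpers `expand_template` and `failing_cases` are hypothetical, and `toy_predict` is a deliberately flawed stand-in for a real sentiment model.

```python
from itertools import product
from typing import Callable, Dict, List


def expand_template(template: str, slots: Dict[str, List[str]]) -> List[str]:
    """Fill every combination of slot values into the template."""
    names = list(slots)
    return [
        template.format(**dict(zip(names, values)))
        for values in product(*(slots[name] for name in names))
    ]


def failing_cases(
    predict: Callable[[str], str], cases: List[str], expected: str
) -> List[str]:
    """Return the generated inputs whose predicted label is not `expected`."""
    return [text for text in cases if predict(text) != expected]


def toy_predict(text: str) -> str:
    """Stand-in classifier (NOT a real model): ignores negation on purpose."""
    return "positive" if "good" in text or "great" in text else "negative"


# A negation capability test: every generated sentence should be negative.
cases = expand_template(
    "The {noun} was not {adjective}.",
    {"noun": ["food", "service"], "adjective": ["good", "great"]},
)
bugs = failing_cases(toy_predict, cases, expected="negative")
```

Here every generated sentence is flagged, since the stand-in classifier keys on the adjective and never accounts for "not". The same templates can be rerun against any encoder without finetuning data.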

Which *BERT should we use?

Here, we discuss tradeoffs between metrics and synthesize the previous sections. We provide a series of questions to consider when working with encoders for research or application development.

An increasingly popular line of recent work has investigated knowledge distillation, model compression, and sparsification of encoders (§4.2). These efforts have led to significantly smaller encoders that boast competitive performance, and under certain settings, non-contextual embeddings alone may be sufficient Arora et al. (2020); Wang et al. (2020a). For downstream applications, ask: Is the extra iota of performance worth the significant costs of compute?

Leaderboards vs. real data

As a community, we are hill-climbing on curated benchmarks that aggregate dozens of tasks. Performance on these benchmarks does not necessarily reflect that of specific real-world tasks, like understanding social media posts about a pandemic Müller et al. (2020). Before picking the best encoder determined by average scores, ask: Is this encoder the best for our specific task? Should we instead curate a large dataset and pretrain again? Gururangan et al. (2020) suggest continued pretraining on in-domain data as a viable alternative to pretraining from scratch.

For real-world systems, practitioners should be especially conscious of the datasets on which these encoders are pretrained. There is a tradeoff between task performance and possible harms contained within the pretraining data.

Monolingual vs. Multilingual

For some higher resource languages, there exist monolingual pretrained encoders. For tasks in those languages, those encoders are a good starting point. However, as we discussed in §7, multilingual encoders can, surprisingly, perform competitively, yet these metrics are averaged over multiple languages and tasks. Again, we encourage looking at the relative performance for a specific task and language, and whether monolingual encoders (or embeddings) may be more suitable.

Ease-of-use vs. novelty

With a constant stream of new papers and models (without peer review) for innovating in each direction, we suggest using and building off encoders that are well-documented with reproduced or reproducible results. Given the pace of the field and large selection of models, unless aiming to reproduce prior work or improve underlying encoder technology, we recommend proceeding with caution when reimplementing ideas from scratch.

Conclusions

In this survey we categorize research in contextualized encoders and discuss some issues regarding its conclusions. We cover background on contextualized encoders, pretraining objectives, efficiency, data, approaches in model interpretability, and research in multilingual systems. As there is now a large selection of models to choose from, we discuss tradeoffs that emerge between models. We hope this work provides some assistance to both those entering the NLP community and those already using contextualized encoders in looking beyond SOTA (and Twitter) to make more educated choices.

Acknowledgments

We especially thank the (meta-)reviewers for their insightful feedback and criticisms. In addition, we thank Sabrina Mielke, Nathaniel Weir, Huda Khayrallah, Mitchell Gordon, and Shuoyang Ding for discussing several drafts of this work. This work was supported in part by DARPA AIDA (FA8750-18-2-0015) and IARPA BETTER (#2019-19051600005). The views and conclusions contained in this work are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, or endorsements of DARPA, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.