CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

Introduction

End-to-end neural models have generally replaced the traditional NLP pipeline, and with it, the error cascades and feature engineering common to such systems, preferring instead to let the model automatically induce its own sophisticated representations. Tokenization, however, is one of few holdovers from that era, with nearly all commonly-used models today requiring an explicit preprocessing stage to segment a raw text string into a sequence of discrete model inputs. Broadly speaking, tokenizers are generally either carefully constructed systems of language-specific rules, which are costly, requiring both manual feature engineering and linguistic expertise, or data-driven algorithms such as Byte Pair Encoding Sennrich et al. (2016), WordPiece Wu et al. (2016), or SentencePiece Kudo and Richardson (2018) that split strings based on frequencies in a corpus, which are less brittle and easier to scale, but are ultimately too simplistic to properly handle the wide range of linguistic phenomena that can’t be captured by mere string-splitting (§2.1).

The degree of sophistication required to accurately capture the full breadth of linguistic phenomena, along with the infeasibility of writing such rules by hand across all languages and domains, suggests that explicit tokenization itself is problematic. In contrast, an end-to-end model that operates directly on raw text strings would avoid these issues, instead learning to compose individual characters into its own arbitrarily complex features, with potential benefits for both accuracy and ease of use. While this change is conceptually very simple, since one could replace the subword vocabulary in a model like Bert (Devlin et al., 2019) with a vocabulary made solely of individual characters, doing so leads to two immediate problems. First, the computational complexity of a transformer (Vaswani et al., 2017), the main component in Bert as well as other models such as GPT (Radford et al., 2019; Brown et al., 2020) and T5 (Raffel et al., 2020), grows quadratically with the length of the input. Since standard subword models have roughly four characters per subword on average, the 4x increase in input sequence length would result in a significantly slower model. Second, simply switching to a character vocabulary yields empirically poor results (§4.2).

In order to enable tokenization-free modeling that overcomes these obstacles, we present Canine. Canine is a large language encoder with a deep transformer stack at its core. Inputs to the model are sequences of Unicode characters. (We consider splitting on Unicode characters to be tokenization-free because it depends only on the deterministic process defined by the Unicode standard, and not on any models, hand-crafted rules, or other linguistic knowledge.) To represent the full space of Unicode characters without a vocabulary, we employ a hashing strategy. (Unicode defines 1,114,112 total codepoints, of which only 143,698 are assigned to characters as of Unicode 13.0. This covers 154 scripts and over 900 languages.) To avoid the slowdown from increasing the sequence length, Canine uses strided convolutions to downsample input sequences to a shorter length before the deep transformer stack.

Like Bert, we pre-train Canine on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. For the MLM task, Canine offers two options:

A fully character-level loss that autoregressively predicts characters in masked spans.

A vocabulary-based loss that predicts the identities of masked subword tokens. Critically, this tokenization is used only for the pre-training loss; tokens are never input to the encoder, and the tokenizer and subword vocabulary can be safely discarded after pre-training. This effectively converts the hard constraint of token boundaries found in other models into a soft inductive bias in Canine.

In summary, our contributions are:

the first pre-trained tokenization-free deep encoder;

an efficient model architecture that directly encodes long sequences of characters with speed comparable to vanilla Bert; and

a model that performs no tokenization on the input, avoiding the lossy information bottleneck associated with most pre-processing.

Motivation

Subword tokenizers are the de-facto standard in modern NLP (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020). These algorithms are limited to only simple word-splitting operations. While this is perhaps a reasonable approach for a language with impoverished morphology such as English, it is much less appropriate in the face of phenomena like agglutinative morphology, non-concatenative morphology, consonant mutation, vowel harmony, etc.

Even in high-resource languages, subword models still tend to struggle on challenging domains, such as informal text, which often includes typos, spelling variation (e.g. Spanish speakers may drop accents when typing), transliteration, or emoji (O’Connor et al., 2010). Bert, which uses WordPiece tokenization, is sensitive to corruptions of the input, both natural typos (Sun et al., 2020) and adversarial manipulations (Pruthi et al., 2019), with some of the loss attributable to corrupted strings no longer being covered by the vocabulary.

Seemingly safe heuristics used by these algorithms, such as splitting on whitespace and punctuation, are problematic when applied to languages that do not use spaces between words (Thai, Chinese) or use punctuation as letters (Hawaiian, which uses an apostrophe to indicate a glottal stop, and Twi, whose informal orthography uses a right paren ")" to represent the letter ɔ). While SentencePiece does offer the option to skip whitespace splitting, it is not typically used due to poor empirical performance.

Fixed vocabulary methods can also force modelers to choose between difficult preprocessing tradeoffs: should one strip accents, casing, etc. and accept destructive preprocessing? Or keep such orthographic information and risk important words dropping out of the frequency-based vocabulary altogether due to the presence of multiple variants of otherwise-similar words? For instance, mBert initially removed all diacritics, thus dropping tense information in Spanish (where the past tense uses an accented final vowel) and conflating many unrelated words in Vietnamese (which uses diacritics to indicate tones, often the only difference among several unrelated content words).

Finally, using a fixed vocabulary during pre-training also creates complications for downstream tasks, which are subsequently tied to the same tokenizer and vocabulary used for pre-training, even if it is not well-suited for the target domain and/or end-task. Boukkouri et al. (2020) showed that Bert’s Wikipedia+BooksCorpus WordPiece vocabulary results in excessive segmentation when fine-tuning on medical data, diminishing the benefit of pre-training as a strategy.

2 Enabling better generalization

Much as Tenney et al. (2019) showed that large encoders learn elements of the classic NLP pipeline, it seems natural to let the model discover tokenization as well. With this in mind, we seek an approach that can better generalize beyond the orthographic forms encountered during pre-training.

In terms of scientific inquiry, we would like to know whether we can build models that learn how to compose words where appropriate, and memorize them where memorization is needed. Large frequency-derived vocabularies partially mitigate this problem by simply memorizing more, but language inherently requires aspects of both memorization and composition. By building a model that directly engages with these issues within the small scale of word composition, we hope to enable future work studying these problems at larger scales such as phrasal constructions.

Practically, generalization is hindered for vocabulary elements that are slight orthographic variations, where one is very infrequent. Hypothetically, a model may estimate a very good embedding for a common vocabulary element kitten, but a poor embedding for the less frequent element kittens since the model has no a priori knowledge that they are related. Embeddings that are rarely touched during pre-training will not be updated much beyond their random initializations.

3 Reducing engineering effort

Mature tokenizers often include years of hand-engineered rules around special cases such as email addresses, URLs, and handling unknown words (for example, should a subword containing an unknown character be a separate token, or should the unknown character be separated as its own token?); even fairly minimal modern tokenizers include initial word-splitting heuristics followed by a specific algorithm and vocabulary for further breaking these tokens into subwords.

Modern pre-trained models also have many requirements throughout their lifecycle: between the time a model is pre-trained, fine-tuned, and served (potentially months or years apart), its weights and model implementation may be converted to be compatible with another toolkit, its fine-tuning data may be tokenized in a different way, and the natural distribution of words may be quite different. All of these things introduce ample opportunities for mismatches to arise between tokenization and the vocabulary from pre-training. Yet this same pre-training paradigm presents an advantage for character models: access to far more (unsupervised) data to learn word composition from characters; without transfer learning, this has historically been impractical for many tasks having little supervised data.

Canine

Canine consists of three primary components: (1) a vocabulary-free technique for embedding text; (2) a character-level model that is efficient by means of downsampling and upsampling; and (3) an effective means of performing masked language modeling on a character-level model.

Canine is designed to be a minimally modified variant of the deep transformer stack found in modern encoders such as GPT, (m)Bert, XLM, and XLM-R such that its architecture is easily adoptable by other models in this family. The simplest implementation of such a character model would be to feed characters at each position in place of subwords. However, this approach would result in far more sequence positions given the same input text, leading to linearly more compute in feed forward layers and quadratically more compute in self-attention layers.
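As a rough illustration of this scaling argument, the sketch below compares relative per-layer costs when a subword sequence is replaced by a character sequence. The 4x expansion follows the estimate of roughly four characters per subword; the function name and the unit-free cost model (which counts only the terms that depend on sequence length) are assumptions made for illustration, not measurements.

```python
def relative_cost(expansion: float) -> dict:
    """Relative per-layer cost when sequence length grows by `expansion`.

    Simplified model: feed-forward cost is linear in sequence length,
    and self-attention cost is quadratic in sequence length.
    """
    return {
        "feed_forward": expansion,          # O(n) in sequence length
        "self_attention": expansion ** 2,   # O(n^2) in sequence length
    }


# Roughly four characters per subword, e.g. 512 subwords vs. 2048 characters.
costs = relative_cost(2048 / 512)
print(costs)
```

Under this simplified model, a 4x longer input costs about 4x more in the feed-forward layers and about 16x more in self-attention, which is the gap that downsampling is meant to close.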

Like existing models, the input to Canine must ultimately be represented as a sequence of integers, but because the nature of characters is well-defined and standardized by Unicode, preprocessing code that would typically be hundreds or thousands of lines can be replaced by a very simple procedure: just iterate over the characters in the input string, and return their codepoint integer values (e.g., a single line of code in Python: [ord(c) for c in text]). Furthermore, because codepoint values are part of the Unicode Standard, they are documented publicly, already supported by programming languages, and will not change over time, unlike arbitrary vocabulary-based IDs.

Canine uses hashing Svenstrup et al. (2017) to support embedding the full space of Unicode codepoints with a relatively small number of parameters, but to reduce the chance that different codepoints will share exactly the same representation, we define a generalization of the standard hashing approach in which we apply multiple hash functions to each codepoint and concatenate the representations associated with the various hash values.

While each individual hash function is subject to hash collisions, the overall effect is minimal since each function only accounts for a small portion of the codepoint’s overall embedding, and it is highly improbable that the other hash functions will produce the same collisions. (This is not a probing/chaining hash table, but rather an approximate map, where we expect and tolerate collisions, similar to a Bloom Map (Talbot and Talbot, 2008).)
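The multi-hash embedding idea described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hash family (a multiply-add modulo a large prime), the number of hash functions, the bucket count, and the slice width are all hypothetical values chosen to keep the example small.

```python
import random

NUM_HASHES = 4          # K: number of hash functions (illustrative value)
NUM_BUCKETS = 64        # B: buckets per hash function (tiny, for illustration)
SLICE_DIM = 2           # d' = d / K: width of each embedding slice
PRIME = 2_147_483_647   # modulus for the hypothetical hash family below

rng = random.Random(0)
# One (multiplier, offset) pair per hash function: an assumed
# universal-hash family, not the hash functions used by the authors.
HASH_PARAMS = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
               for _ in range(NUM_HASHES)]
# One small randomly initialized lookup table per hash function.
TABLES = [[[rng.gauss(0.0, 0.02) for _ in range(SLICE_DIM)]
           for _ in range(NUM_BUCKETS)]
          for _ in range(NUM_HASHES)]


def bucket_ids(codepoint):
    """Bucket index of `codepoint` under each hash function."""
    return [((a * codepoint + b) % PRIME) % NUM_BUCKETS
            for a, b in HASH_PARAMS]


def embed(text):
    """Embed each character by concatenating one slice per hash function."""
    vectors = []
    for codepoint in (ord(c) for c in text):
        vec = []
        for table, bucket in zip(TABLES, bucket_ids(codepoint)):
            vec.extend(table[bucket])
        vectors.append(vec)
    return vectors


vectors = embed("h\u00e9llo")  # five codepoints, including a non-ASCII one
```

Because every codepoint maps to some bucket in every table, any Unicode character receives an embedding of width `NUM_HASHES * SLICE_DIM`, and two codepoints share a full representation only if they collide under all hash functions at once.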

Because the model always supports all codepoints, it is possible to learn representations during fine-tuning for characters (and, by extension, words, scripts, etc.) that were never seen during pre-training, while still making use of what pre-training learned about word composition and sentence structure.

We can also redefine the embeddings $e_{i}$ above to include character n-grams, again without a fixed vocabulary, such that each n-gram order contributes equally to a summed embedding (we use $B=15$k and $N=4$ for our n-grams):

This formulation still admits tokenization-free modeling, but provides the model with an inductive bias that favors slightly more memorization via a compute-cheap means of adding parameters. Notably, it also allows the model’s input signature to remain a simple sequence of codepoints.

While the above architecture is sufficient for classification tasks, sequence prediction tasks require that the model expose an output layer with the same sequence length as the input (i.e., characters are the model’s input and output “API” for tasks like tagging and span prediction).

We reconstruct a character-wise output representation by first concatenating the output of the original character transformer (above) with the downsampled representation produced by the deep transformer stack. (Note that since each downsampled position is associated with exactly $r$ characters for a downsampling rate of $r$, each position of the downsampled representation is replicated $r$ times before concatenation.) More formally,

While the initial character encoder (before downsampling) and final character encoder (after upsampling) both represent character positions, they conceptually have very different purposes in the network. Intuitively, we think of the initial character encoder as composing characters to create a more word-like representation, while the final character encoder is extracting the in-context representation that’s relevant for predicting the “meaning” of the content at each position; Canine must be able to deal with additional ambiguity during upsampling since a single downsampled position may span more than one conceptual word. Because of the different roles of these induced features, we do not use residual connections from $\bm{\mathbf{h}}_{\text{init}}$ to $\bm{\mathbf{h}}_{\text{up}}$ .
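The replicate-then-concatenate step described above can be sketched as follows. This is a simplified illustration that uses plain Python lists as vectors; the function name is hypothetical, and the layers that are applied to the concatenated result afterwards are deliberately omitted.

```python
def replicate_and_concat(h_init, h_down, rate):
    """Pair every character vector with its (repeated) downsampled vector.

    h_init: list of per-character vectors (length n)
    h_down: list of downsampled vectors (length n // rate)
    """
    if len(h_init) != len(h_down) * rate:
        raise ValueError("expected len(h_init) == len(h_down) * rate")
    # Repeat each downsampled position `rate` times to match character length.
    repeated = [vec for vec in h_down for _ in range(rate)]
    # Concatenate along the feature dimension, position by position.
    return [c + d for c, d in zip(h_init, repeated)]


h_init = [[float(i)] for i in range(8)]   # 8 characters, 1 feature each
h_down = [[10.0, 11.0], [20.0, 21.0]]     # 2 positions, rate r = 4
combined = replicate_and_concat(h_init, h_down, rate=4)
```

Characters 0 to 3 are paired with the first downsampled position and characters 4 to 7 with the second, so each output vector carries both local character information and the contextual information from the deep stack.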

2 Pre-training

Recent pre-trained models ranging from Bert to T5 have largely used variations on a masked language model (MLM) task (also known as span corruption) as an unsupervised pre-training loss function—a means of generating synthetic examples that are not from any realistic task, yet prepare a model to learn realistic tasks in future phases of training (i.e. fine-tuning). The Canine pre-training procedure retains the MLM task, and offers two distinct strategies for computing the MLM loss—autoregressive character prediction vs. subword prediction—both of which yield a fully tokenization-free model following pre-training. In our experiments, we use only one of these losses at a time.

Canine-C is an autoregressive character loss that masks character spans within each sequence. These spans are chosen based on whitespace boundaries. No punctuation splitting nor other heuristics are used. All characters within the masked span are replaced by a special mask codepoint in the input. (We use codepoints in Unicode’s Private Use Area block such that the input remains a valid Unicode string.) No random subword replacement is performed as there is no subword vocabulary, though we expect that future work on vocabulary-free random replacement may improve quality.
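A minimal sketch of this whitespace-based span masking is shown below. The specific mask codepoint (U+E000, the first codepoint of the Private Use Area), the function name, and the per-span sampling scheme are assumptions for illustration; for simplicity, only the ASCII space is treated as a span boundary.

```python
import random

MASK_CODEPOINT = 0xE000  # hypothetical choice within the Private Use Area


def mask_whitespace_spans(text, mask_rate=0.15, seed=0):
    """Replace whole space-delimited spans with the mask codepoint.

    Returns the masked codepoint sequence and the (position, gold codepoint)
    pairs that a character-level loss would need to predict.
    """
    rng = random.Random(seed)
    codepoints = [ord(c) for c in text]
    masked = list(codepoints)
    targets = []
    start = 0
    for span in text.split(" "):
        end = start + len(span)
        if span and rng.random() < mask_rate:
            for pos in range(start, end):
                targets.append((pos, codepoints[pos]))
                masked[pos] = MASK_CODEPOINT
        start = end + 1  # skip the single space separator
    return masked, targets


masked, targets = mask_whitespace_spans("the quick brown fox", mask_rate=0.5)
```

The masked sequence has the same length as the input and remains a valid sequence of Unicode codepoints, so the encoder's input signature is unchanged.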

Canine-C autoregressively predicts the masked characters. The order of the masked positions is shuffled such that masked context is not necessarily revealed left-to-right, but rather a single character at a time. The pre-training data preparation is shown in Figure 2. Masked inputs are fed to the model as $\mathbf{x}$. The output of the Canine model $\mathbf{y}_{\text{seq}}$ and the embeddings $\mathbf{e}_{\text{g}}$ of the gold characters $\mathbf{g}$ (i.e. the character positions selected for MLM prediction) are concatenated and then fed through a small feed-forward neural network to project back to the original dimensionality $d$; these are finally shuffled and used by a single-layer autoregressive transformer with a left-to-right self-attention mask (the left-to-right masking is with regard to the shuffled sequence):

This representation $\hat{\mathbf{y}}$ is then used to predict each character. To avoid wasting time on a large output weight matrix and softmax, the gold target classes $\mathbf{t}$ are bucketed codepoint IDs such that $t_{i} = g_{i} \bmod B$. This is similar to the strategy used in the character hash embedder (§3.1). The occasional collisions among characters are less problematic due to (a) the fact that this is an encoder-only model and (b) the fact that the embeddings must still retain contextual information in order to correctly predict characters. Because we are only predicting a relatively small subsequence of the input (15% in our experiments), the cost of this layer is small.
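The bucketing of target classes and the shuffled prediction order can be illustrated with a short sketch. The bucket count used here and both function names are assumptions for illustration only.

```python
import random

NUM_BUCKETS = 16_000  # B; illustrative value


def bucketed_targets(gold_codepoints, num_buckets=NUM_BUCKETS):
    """Map each gold codepoint g_i to its target class t_i = g_i % B."""
    return [g % num_buckets for g in gold_codepoints]


def shuffled_order(num_positions, seed=0):
    """Random prediction order over the masked positions."""
    order = list(range(num_positions))
    random.Random(seed).shuffle(order)
    return order


gold = [ord(c) for c in "fox"] + [0x1F600]  # three letters and an emoji
targets = bucketed_targets(gold)            # [102, 111, 120, 512]
order = shuffled_order(len(gold))
```

Codepoints below `B` keep their own class, while larger codepoints such as the emoji (128512) wrap around to a shared class (512), which keeps the output softmax at `B` classes instead of the full Unicode range.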

2.2 Subword Loss

We also experiment with Canine-S, a subword-based loss function, to demonstrate how a token-aware pre-training loss can still be paired with a tokenization-free model such that the tokenizer and vocabulary are discarded after pre-training.

Like mBert’s MLM setup, each span in Canine-S corresponds to a single subword. As with the autoregressive loss, all characters within the masked span are replaced with a special “mask” codepoint. Random replacements of subwords are chosen from the vocabulary of same-length subwords such that the length of the character sequence remains unchanged; more formally, given a subword selected for random replacement $x$ and a vocabulary of subwords $V$ , $x$ ’s replacement will be drawn from the subset of $v\in V$ where $\textsc{Len}(v)=\textsc{Len}(x)$ .
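The same-length replacement rule can be sketched as follows. The toy vocabulary, the function names, and the fallback of leaving a subword unchanged when no same-length candidate exists are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_length_index(vocab):
    """Group vocabulary entries by their length in characters."""
    by_length = defaultdict(list)
    for subword in vocab:
        by_length[len(subword)].append(subword)
    return by_length


def random_same_length(subword, by_length, rng):
    """Draw a replacement v from V with Len(v) == Len(subword)."""
    candidates = by_length.get(len(subword))
    if not candidates:
        return subword  # assumed fallback: no same-length entry available
    return rng.choice(candidates)


toy_vocab = ["cat", "dog", "sun", "tree", "blue", "a"]  # hypothetical vocabulary
index = build_length_index(toy_vocab)
replacement = random_same_length("fox", index, random.Random(0))
```

Because the replacement always has the same number of characters as the original subword, character positions after the replaced span do not shift.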

Within each masked character span, Canine-S randomly selects a character position where the model will make a prediction; the model predicts the identity of the masked subword via softmax. The associated subword embeddings are discarded after pre-training.

2.3 Targeted Upsampling

By design, each final character representation (after upsampling) is a function of the output of the initial character encoder (before downsampling) and the output of the deep transformer stack; there are no inter-position dependencies across the upsampled sequence. This depends on the upsampler using position-wise feed-forward projections and a single transformer layer. During pre-training, we leverage this design to improve speed by only performing upsampling on the sequence positions $\mathbf{p}$ that will be used by the MLM task. (This highly effective targeted upsampling optimization is the primary reason that Canine uses a full Transformer layer for the final full-length character sequence rather than a local transformer. Because a block-wise local transformer assumes uniform position-wise locality over attention blocks, it is not trivial to combine these two optimizations; the local self-attention mask would no longer be a simple block diagonal. However, this final upsampling layer is discarded for classification tasks and so does not contribute any cost. Hence, while it is possible to combine local attention and targeted upsampling, this is left as future work.) More formally, we use the following equivalent form of the Up function during pre-training:

2.4 Modularity

Unlike previous models, Canine removes both the vocabulary and tokenization algorithm as fossilized parts of the final model that must be replicated during fine-tuning and prediction. Regardless of which pre-training loss is chosen (characters or subwords), the use of these components in Canine is limited to a detail of the pre-training procedure—an inductive bias of the loss function—that is then discarded. The fine-tuning and prediction phases of the model lifecycle never have any knowledge of what vocabulary or tokenization algorithm (if any) were used in pre-training. This allows the model to natively process untokenized data, or even process data that has been pre-processed by different tokenizers, a situation that would otherwise introduce a significant skew between training phases.

Experiments

TyDi QA is a dataset of information-seeking questions in 11 typologically diverse languages (Clark et al., 2020). Questions are written before answers, leading to less lexical and morphological overlap between questions and answers, which are drawn from Wikipedia. We evaluate on the primary tasks (as opposed to the simplified TyDiQA-GoldP task, which is part of the Xtreme meta-benchmark).

Passage Selection Task (SelectP): Given a list of the passages in a Wikipedia article, return either the index of the passage that answers the question, or return NULL if the article contains no acceptable answer.

Minimal Answer Span Task (MinSpan): Given a full Wikipedia article, return the start and end byte indices of the minimal span that completely answers the question. Alternatively, a system may indicate that the article does not contain an answer, or return YES or NO for yes/no type questions.

1.2 Named Entity Recognition Data

We also consider the task of named entity recognition (NER), which requires the model to identify which spans of a sentence correspond to entities and label the entity type. In all of our experiments, we framed the task as sequence labeling, predicting BIO-encoded span labels.
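For readers unfamiliar with BIO encoding, the sketch below converts entity spans into per-position labels. The example sentence, the span format, and the function name are hypothetical, and positions are words here purely for readability; the same scheme applies to whatever unit the model labels.

```python
def spans_to_bio(num_positions, spans):
    """Convert (start, end_exclusive, entity_type) spans to BIO tags."""
    tags = ["O"] * num_positions
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first position of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining positions of the entity
    return tags


words = ["State", "Street", "Bank", "is", "in", "Boston"]
tags = spans_to_bio(len(words), [(0, 3, "ORG"), (5, 6, "LOC")])
# tags == ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]
```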

We use Spanish and Dutch data from the CoNLL 2002 NER task Tjong Kim Sang (2002) and English and German from the CoNLL 2003 NER task Tjong Kim Sang and De Meulder (2003), all from the newswire domain.

To widen the scope of our experiments beyond European languages, we also include MasakhaNER (Adelani et al., 2021), which includes ten African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) with human annotations on local news text.

1.3 Model Configuration

In order to determine which pre-training architecture produces better quality downstream predictions, we compare Canine to mBert, which we re-implemented and re-trained in order to hold as many variables as possible constant. Note that we intentionally do not compare against public pre-trained checkpoints that use different pre-training corpora since (a) this would be a major confounding variable and (b) most publicly available pre-trained models are simply instantiations of Bert, including XLM-R (which instantiates Bert with a larger pre-training corpus, larger model size, and larger vocabulary size) and X-STILTS (which performs English fine-tuning on an existing XLM-R checkpoint; Phang et al., 2020).

We pre-train on the multilingual Wikipedia data of mBert, which includes 104 languages. Similarly, we reuse mBert’s exponential smoothing technique to weight the languages within the pre-training samples. We train for 124k steps with batch size 4096 (2.5 passes over the data) using the LAMB optimizer (You et al., 2020) with a linearly decayed learning rate of 0.018 where 2.5% of the steps are used for warm-up. We use a sequence length of 512 for mBert, and 2048 for Canine, which results in 512 downsampled positions in its core deep transformer stack. We pre-train on 64 Cloud TPUs v3 (v3 TPUs have 16 GiB memory / core, 128 GiB total) for approximately one day (see results for precise timings). For both mBert and Canine-S (Canine with the subword loss), we select 15% of subwords for the MLM loss and predict up to 80 output positions; 80% of these are masked in the input, 10% are randomly replaced, and 10% are unmodified. For Canine-C (Canine with the autoregressive character loss), we select 15% of contiguous spans for the MLM loss and predict up to 320 output characters, and no random replacement is performed. For TyDi QA, we use a maximum answer length of 100 characters, which is approximately the 99th percentile answer length. Sequences shorter than the maximum sequence length are zero-padded, following Bert.

Compute budget: each pre-training uses approximately 24 hours on 64 TPUs (1.5k TPU-hours), so the 18 pre-trainings in Tables 2/3/4 required about 28k TPU-hours. The 18 TyDi QA experiments in these tables each take about 1 hour on 16 TPUs, each with 3 replicas (48 TPU-hours), about 1k TPU-hours total. The 3 NER experiments in Table 5 each took 3 hours on 4 TPUs with 3 replicas each (36 TPU-hours), 108 TPU-hours total. Thus replicating the experiments in this paper would take approximately 29k TPU-hours.
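The compute estimate above can be checked with a few lines of arithmetic. The variable names are ours; all quantities are the approximate figures quoted in the text.

```python
# Pre-training: about 24 hours on 64 TPUs per run, 18 runs.
pretrain_each = 24 * 64               # 1536 TPU-hours (about 1.5k)
pretrain_total = 18 * pretrain_each   # 27648 TPU-hours (about 28k)

# TyDi QA fine-tuning: about 1 hour on 16 TPUs, 3 replicas, 18 experiments.
tydi_each = 1 * 16 * 3                # 48 TPU-hours
tydi_total = 18 * tydi_each           # 864 TPU-hours (about 1k)

# NER fine-tuning: 3 hours on 4 TPUs, 3 replicas, 3 experiments.
ner_each = 3 * 4 * 3                  # 36 TPU-hours
ner_total = 3 * ner_each              # 108 TPU-hours

grand_total = pretrain_total + tydi_total + ner_total  # 28620 (about 29k)
print(grand_total)
```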

2 TyDi QA Results

Our main result is shown in Table 2. Canine-S (Canine with the subword loss) improves over mBert in the TyDi QA SelectP task by 2.8 F1, while using about 30% fewer parameters. Similarly, Canine-C (Canine with the autoregressive character loss), improves over mBert by 2.5 F1. Adding vocab-free character n-grams leads to even further gains over mBert (+3.8 F1) and even more on the MinSpan task (+6.9 F1). A language-wise breakdown is provided in Table 7 in the appendix.

We also present results from some ablation models as additional baselines in rows 3-4 of Table 2. First, for row 3, we simply replace Bert’s subword vocabulary with a pure character vocabulary, which makes characters both the input granularity and the unit of masking and prediction for the MLM task, and observe that not only is the model 10X slower than subword-based Bert, but the quality also suffers greatly. Then, for row 4, we modify that model to use subwords for masking and MLM predictions, while keeping characters as the input granularity, and we see a substantial quality improvement, though pre-training remains extremely slow. Finally, by comparing to the full Canine model in row 5, we can see that adding the downsampling strategy improves speed by 700%, and also leads to an additional small bump in quality. We speculate that this additional quality gain comes from giving the model a better inductive bias toward more word-like units within the deep transformer stack.

Canine fares particularly well on morphologically rich languages such as Kiswahili. Table 3 shows examples where Canine outperforms mBert on the TyDi QA SelectP task. In particular, we observe examples where Kiswahili’s rich morphology does not hinder the matching process for Canine.

3 Ablations

In Table 6, we consider minor modifications to the final Canine architecture, and evaluate the effect of each on the downstream quality of the model. (These ablations were carried out during initial model development, hence comparisons to a non-final model.)

Instead of attending to the character-wise sequence $\bm{\mathbf{h}}_{\text{up}}$ , we attend to the downsampled sequence:

While this change reduces the overall FLOPS of the model due to the reduced attention computation, it does not have a major effect on pre-training throughput. However, it does substantially degrade quality.

We reduce the number of hash buckets ( $B$ ) from 16k to 8k, meaning more (partial) collisions in embedding lookups. This significantly hinders the MinSpan task.

We switch from our hash-based no-vocabulary strategy to using a normal character vocabulary (which we derive from the pre-training corpus). We observe that this underperforms the hashing approach. We speculate that this might be due to skew between the pre-training corpus and the final downstream task since not all codepoints can be included in the vocabulary.

We reduce the embedding size of the initial character encoder (i.e. the embedding size of $\bm{\mathbf{h}}_{\text{init}}$ and $\bm{\mathbf{e}}$, but not of $\bm{\mathbf{h}}_{\text{up}}$ or $\bm{\mathbf{y}}_{\text{seq}}$) and observe that quality falls off rapidly.

We remove the local transformer from $\bm{\mathbf{h}}_{\text{init}}$ and similarly observe a marked reduction in quality.

While more aggressive downsampling (a factor of 5X or 6X, rather than 4X) brings substantial speed gains, the passage-level quality degrades substantially and the minimal span predictions suffer even more.

When we do not use the trick of applying the final character transformer ( $\bm{\mathbf{y}}_{\text{seq}}$ ) only to the positions that will be computed by the MLM task, we observe a large reduction in speed. Since this model is theoretically equivalent in terms of operations, we show only the speed for exposition.

We also performed ablations aimed at exploring the effect of feature concatenation and residuals; results are in Table 4. Not concatenating the downsampled representation with the initial character representation when computing $\mathbf{h}_{\text{up}}$ causes the model to become unstable (row 2); adding a residual from $\mathbf{h}_{\text{up}}$ back to $\mathbf{h}_{\text{init}}$ does not help (row 3). However, additionally inserting a residual from $\mathbf{h}_{\text{up}}$ back to $\mathbf{h}^{\prime}_{\text{down}}$ does stabilize the model (row 4) though it does not recover the original quality.

4 NER Results

Named entity recognition is a task in which memorization is often a very effective strategy. For example, if a model has London in its vocabulary and sees it with the label location during training, then it simply has to retrieve this memorized association when it sees the token London at test time. Therefore, evaluating on NER is helpful for understanding the ways in which different models emphasize memorization vs. generalization.

As shown in Table 5, Canine-C performs significantly worse than mBert on NER, likely due to mBert’s memorization-friendly vocabulary. However, when (tokenization-free) n-gram features are added to Canine-C, performance rebounds, showing that it is possible to cheaply boost a model’s memorization ability while remaining fully tokenization-free.

A full language-wise breakdown is provided in the appendix (Table 8). It is worth noting that part of the performance difference on MasakhaNER is due to mBert producing no usable outputs for Amharic. The mBert pre-training data does not contain Amharic (or any Amharic-script text), so it has no vocabulary entries for Amharic’s script (meaning that mBert sees only sequences of [UNK] on Amharic inputs). However, since Canine always supports the full Unicode space, it is able to achieve 50 F1 even though it, too, had never seen Amharic text during pre-training. We take this as validation of Canine’s vocabulary-free approach. It may also be evidence that Canine exhibits cross-script transfer abilities analogous to those in mBert (Pires et al., 2019).

Canine-C tends not to label rarer lexical items that mBert appears to have memorized. For example, with Canine-C, JCPenney (a relatively rare lexical item) is not recognized as an entity. Canine-C also tends to separate long entities; for example, “State Street Bank and Trust Company” is labeled as two separate spans: “State Street Bank” and “Trust Company”; and the location TAMPA BAY is recognized only as TAMPA. However, adding n-gram features appears to mostly resolve this issue.

Related Work

Further improvements to standard subword tokenization like Byte Pair Encoding (BPE) Sennrich et al. (2016), WordPiece Wu et al. (2016), and SentencePiece Kudo and Richardson (2018) have been proposed. Subword regularization (Kudo, 2018) and BPE-dropout (Provilkov et al., 2020) recognize that deterministic segmentation during training limits the ability to leverage morphology and word composition; instead, they sample at random one of the multiple tokenizations of the training input, made possible by the inherent ambiguity of subword vocabularies. Wang et al. (2021) recently expanded on this paradigm to enforce consistency of predictions over different segmentations. Unigram LM (Kudo, 2018), which builds its vocabulary top-down, was shown to align with morphology better than BPE on pre-trained encoders (Bostrom and Durrett, 2020).

Others have built hybrid models that use multiple granularities, combining characters with tokens (Luong and Manning, 2016) or different subword vocabularies (Zhang and Li, 2021).

2 Character-level models

Following the larger NLP trend, character-level n-gram models Huang et al. (2013); Wieting et al. (2016); Bojanowski et al. (2017) have mostly been replaced by neural networks. While generally lagging behind their word-level counterparts, character-level features are important for morphologically rich languages, particularly in low-resource settings (Garrette and Baldridge, 2013).

Character language models (CLMs) have used vanilla RNN architectures to produce distributions over sequences of characters in a purely tokenization-free manner (Sutskever et al., 2011; Graves, 2013; Hwang and Sung, 2017; Radford et al., ). Hierarchical RNNs modeled the assumption that language operates on increasing layers of abstraction: Chung et al. (2017) jointly trained a sub-module to segment the character-level input into larger spans at each layer of a stacked LSTM.

Due to the consistent lag in performance behind their word-level counterparts, attention shifted from pure CLMs towards merely character-aware models, still reliant on traditional tokenization. Some hybrid models processed the input at character level, but predicted words from a closed vocabulary (Kim et al., 2016; Gerz et al., 2018). Others reintroduced explicit tokenization on the input side, and either generated bursts of character sequences that formed an open vocabulary (Kawakami et al., 2017) or used a character-only generator as a fallback when the main closed-vocabulary word generator produced a rare or unknown token (Matthews et al., 2019; Mielke and Eisner, 2019). Especially after the popularization of the inherently ambiguous subword vocabularies like BPE, several studies moved beyond a single input segmentation and marginalized over all possible segmentations (van Merriënboer et al., 2017; Buckman and Neubig, 2018; Grave et al., 2019).

Coming full circle, Kawakami et al. (2019) induced a lexicon without any explicit supervision, returning to pure CLMs. In a revitalized effort to bring them on par with coarser granularities, researchers leveraged external resources such as grounding in vision (Kawakami et al., 2019) or multi-task learning together with supervised morphology tasks (Blevins and Zettlemoyer, 2019).

After the transformer (Vaswani et al., 2017) replaced RNNs as the dominant architecture in NLP, character-level models followed. Al-Rfou et al. (2019) showed that byte-level vanilla Transformers significantly underperform their word-level counterparts. A similar finding was reported by Radford et al. (2019). Although the gap has been reduced (Choe et al., 2019), subword transformers remain the status quo for pure language modeling.

In parallel with LM efforts, the neural machine translation (NMT) community sought to solve its open-vocabulary problem via character-level modeling. Luong and Manning (2016) proposed a hybrid model that operated mainly at the word level, but consulted a character-level LSTM for unknown words; this was a practical compromise, as their character-only model took 3 months to train. Lee et al. (2017) enabled pure character NMT by shortening the input length via convolutional, pooling, and highway layers. Notably, their many-to-English model outperformed its subword counterpart and most bilingual baselines, with a 35% increase in training time (on a single GPU) compared to a baseline BPE-to-char model. Canine has a similar motivation, but operates in the context of pre-trained transformers; its training is 7x faster than that of a char-to-char baseline (on TPU v3) and takes 28% longer than that of mBert (Table 2).
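To illustrate why shortening the input pays off, the following sketch applies non-overlapping max pooling to a sequence of character vectors and compares the number of pairwise interactions that quadratic self-attention would compute before and after. It is a generic illustration, not the architecture of Lee et al. (2017) or of Canine; the function names, the stride of 4, and the sequence length of 2048 are assumptions chosen for the example.

```python
def strided_max_pool(char_vectors, stride):
    """Max-pool non-overlapping windows of `stride` positions, per dimension."""
    pooled = []
    for start in range(0, len(char_vectors), stride):
        window = char_vectors[start:start + stride]
        # zip(*window) groups the values of each dimension across the window.
        pooled.append([max(dim) for dim in zip(*window)])
    return pooled


def attention_pairs(seq_len):
    """Number of query-key pairs in full self-attention over `seq_len` positions."""
    return seq_len * seq_len


# Eight 2-dimensional character vectors (arbitrary toy values).
chars = [[1, 5], [2, 4], [3, 3], [4, 2], [0, 0], [9, 1], [1, 9], [2, 2]]
pooled = strided_max_pool(chars, stride=4)
print(len(chars), "->", len(pooled))  # 8 -> 2

# Shortening a sequence by 4x cuts the pairwise attention cost by 16x.
print(attention_pairs(2048) // attention_pairs(2048 // 4))  # 16
```

Because the attention cost is quadratic in sequence length, any fixed downsampling rate r reduces the number of pairwise interactions by a factor of r squared, which is the main lever such architectures use to make character-level input affordable.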

Character information has been leveraged for many other end tasks as well, including text classification (Zhang et al., 2015; Zhang and LeCun, 2017), part-of-speech tagging and NER (Gillick et al., 2016; Akbik et al., 2018; Pinter et al., 2019), named entity detection (Yu et al., 2018), dependency parsing (Vania et al., 2018), and machine reading comprehension (Hewlett et al., 2018). Character information proved particularly useful for low-resource languages (Xie et al., 2018), phenomena such as code-switching and transliteration (Ball and Garrette, 2018), and rich morphology (Vania and Lopez, 2017), which had previously received special modeling such as adaptor grammars (Botha and Blunsom, 2013).

Token-based models have also been augmented with character-level information in the context of transfer learning, where encoders trained with unsupervised objectives are repurposed to solve downstream tasks. Pinter et al. (2017) addressed the out-of-vocabulary problem of static pre-trained word embeddings by training a model to map the surface form of a word to its pre-trained representation, and then applied it to unknown words. ELMo (Peters et al., 2018), a bidirectional LSTM model, applied character convolutions to its whitespace-separated input tokens. CharacterBert (Boukkouri et al., 2020) ported this technique to Bert, augmenting its existing WordPiece-tokenized input. Consistent with previous observations that feeding characters into a transformer stack comes with a huge computational cost while not improving over tokenization-based approaches (Al-Rfou et al., 2019), a Bert model fine-tuned for semantic parsing achieved gains only when characters complemented subwords (van Noord et al., 2020).

3 Multilingual models

Multilingual NLP has been dominated by deep pre-trained multilingual models whose subword vocabularies are shared across languages. Such models borrow their architectures from monolingual predecessors and apply joint training in 100+ languages, either with unsupervised LM losses, as in mBert and mT5 (Xue et al., 2021), or with additional translation losses, as in XLM (Lample and Conneau, 2019) and XLM-R (Conneau et al., 2020). Chung et al. (2020) extended this by forming language clusters with per-cluster vocabularies. To accommodate languages unseen during pre-training, Wang et al. (2020) extended the vocabulary and continued pre-training.

Conclusion

In this article, we described Canine, which is, to our knowledge, the first pre-trained deep encoder for language understanding that uses a tokenization-free, vocabulary-free model, while surpassing the quality of models built on top of heuristic tokenizers. Canine eliminates many engineering pitfalls for practitioners and opens up new research directions for the community.

Acknowledgements

The authors wish to thank Noah Constant, Rami Al-Rfou, Kristina Toutanova, Kenton Lee, Ming-Wei Chang, and Tim Dozat for their feedback on this work. We would also like to thank Martin Njoroge and Nanjala Misiko for their consultations on the Kiswahili examples, Diana Akrong for consulting on Twi orthography, and Waleed Ammar for consulting on Arabic morphology.