A New Generation of Perspective API: Efficient Multilingual Character-level Transformers

Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, Lucy Vasserman

Introduction

Developing robust and effective content moderation systems is a crucial component for keeping the web safe from abusive users. Offensive, toxic, and harassing content has the potential to significantly harm users along with wide-ranging negative effects on broader society. To this end, building machine learned systems that are able to detect toxic content is a well-established and highly important research area, and an essential component of modern platforms’ content moderation workflows, alongside robust human moderator teams.

Most research in this area has been focused on building specialized models for specific locales, languages, domains, or label distributions (Alakrot et al., 2018; Ranasinghe et al., 2019; Mandl et al., 2019; Wulczyn et al., 2017). Thus, it is common practice to build monolingual models for specific languages and/or domains. Given that the web is highly multilingual, multi-cultural, and typographically diverse, monolingual systems are likely to under-perform in real applications, as they are unable to handle code-switching, cross-cultural phenomena, or cross-lingual generalization.

Due to the rigidity of feature-based machine learning models, subword tokenization, and byte-pair encoding (Kudo and Richardson, 2018) based deep learning models (e.g., BERT (Devlin et al., 2019)), many of these models are not universally applicable across different languages and/or tasks and are considered more or less static once trained. This makes it difficult to apply a single model across a diverse range of languages, domains, and tasks. It also makes it challenging to incrementally train a model on new downstream applications. Furthermore, rigid vocabularies are also vulnerable to common adversaries of the social web - misspellings, emojis and obfuscation, all of which are techniques that are commonly used for microaggressions and covert attacks (Lees et al., 2021).

With these challenges in mind, this paper presents a new generation of toxic content classifiers for Jigsaw’s Perspective API. We refer to this generation as UTC (Unified Toxic Content Classification), centering around a new modeling framework for highly performant and robust toxic content detection. In summary, UTC is a single compact pretrained Charformer-based Transformer (Tay et al., 2021) that is pretrained on multilingual documents along with comment text from a wide variety of online discussion forums and other sources of user generated content using a sequence to sequence denoising loss (Raffel et al., 2020).

UTC leverages recent novel advances of Learnable Tokenizers as part of the model architecture (Tay et al., 2021) and is therefore vocabulary- and token-free. While character-level features or modeling approaches have been explored in the context of toxicity detection (Kurita et al., 2019), our paper proposes the first byte-level pretrained model that remains competitive with (or outperforms) subword models tailored to specific domains or languages. Notably, this vocabulary free property of UTC enables it to be both language agnostic and more robust to domain transfer.

Furthermore, in the design of UTC we address major practical challenges of productionizing character-level Transformers in a latency sensitive, public API setting. We describe the approach in detail in Section 3 and show the impact of our changes in Table 8. The result is a model compact enough to be served in real-life production settings while demonstrating competitive performance versus the winning entries of the 2020 Jigsaw Multilingual Toxic Comment Classification Kaggle contest, even though it is >10x more memory (parameter) efficient. Extensive evaluations for model bias (Borkan et al., 2019) also show that UTC is reasonably unbiased across multiple languages. The primary contributions of this work can be summarized as follows:

We present Unified Toxic Content Classification (UTC), a modeling framework suitable for efficient character-level multilingual moderation workflows. The proposed UTC framework is comprised of Learnable Tokenizers (Tay et al., 2021), Reconfigurable Seq2Seq Transformer architectures, and a new comment-based pre-training scheme.

We conduct extensive experiments on multiple tasks and benchmark datasets, drawn both from academic settings and from samples of our production traffic. We show that UTC outperforms strong baselines such as a multilingual BERT model pretrained on comments and state-of-the-art mT5 models.

For evaluating the robustness and flexibility of the proposed approach, we include benchmarks that specifically test the model’s ability to handle code-switching, covert toxicity, emojis, obfuscated text, distribution shifts, and model bias. We show that under all conditions, UTC outperforms or matches strong baselines.

We present the results of our experience deploying UTC into production Perspective API.

Related Work

Although there has been a long history of using machine learning to detect abusive content (e.g., email spam (Dada et al., 2019)), research using direct text classification started in earnest with the introduction of the modest-sized, hand-labeled datasets in (Davidson et al., 2017), (Nobata et al., 2016) and (Waseem and Hovy, 2016). These works also coincide with the launch of the Workshop on Online Abuse and Harms (http://www.workshopononlineabuse.com/), which completed its fifth annual meeting.

There has also been a significant amount of criticism regarding the application of machine learning to conversation moderation, see (Yin and Zubiaga, 2021) for a recent survey of the issues and challenges. The nature of online identity and social relationships, and the problems of governance are complex and involve many interacting entities with overlapping jurisdictions. And despite the popularity of some shared, labeled test sets, there is little consensus within the community regarding sampling, annotation standards, annotator recruitment and training, classifier design, or scoring metrics.

One concern regarding the use of machine learning models for moderation is that flaws in the training data, whether due to the process used to collect the data, biases held by the annotators, or underlying societal, historical biases, whether intentional or unconscious, can manifest in models as unintended discriminatory biases. This concern was raised in (Davidson et al., 2019) and (Sap et al., 2019). (Jacobs et al., 2020) provides a good overview of these concerns. To address these concerns we employ the techniques of data augmentation suggested in (Dixon et al., 2018) and (Borkan et al., 2019), and include a cross-language bias analysis to measure the unintended bias for similar terms across all languages. It is also worth noting the progress towards a more comprehensive taxonomy of abusive content has drawn interdisciplinary attention, and systemic annotation efforts (Kennedy et al., 2018) which also inform our work. There has also been criticism regarding current commercial models’ performance on tagging abusive, toxic, or hateful content. This includes many examples of adversarial perturbations designed to “fool” moderation models, such as proposed in (Gröndahl et al., 2018). While the present work demonstrates improved performance against these types of attacks, the task of improving models in these areas is ongoing.

The English language has dominated research in classifying offensive content, although there have been numerous publications focusing on other specific languages. There is still no agreement regarding the efficacy of training multilingual models versus monolingual models, but for applications with user generated content, not needing to ask or guess what language is being used has clear advantages. Two recent examples of similar work are (Wang and Banko, 2021), which found that monolingual models do not universally perform better on sentiment and hate speech classification, and (Song et al., 2021) which uses model fusion in an attempt to correct the imbalance in available training resources.

UTC: Unified Toxic Content Classification

This section introduces UTC, the proposed modeling framework in this paper.

In practice we set $M=4$ to enumerate blocks sized 1 to 4. Following previous work, since we enumerate blocks here with a stride of $b$ we apply a 1D convolution of size $b+1$ before this enumeration step.

Next we use the block scoring network (a linear transformation) to score every block in each of $X_{1},\ldots,X_{M}$ . We then upsample every sequence $X_{b}$ and their scores back to original sequence length $L_{bytes}$ via repetition. At this point, we have a set of block embeddings $X_{b,i}$ and their scores $p_{b,i}$ for every position $i$ and block size $b$ .

We construct the locally composed sequence representation $\hat{X}$ by reducing over the block size dimension. In particular, we take the sum of every $X_{b,i}$ at position $i$, weighted by its block score: $\hat{X}_{i}=\sum_{b=1}^{M}p_{b,i}X_{b,i}$ .

Finally $\hat{X}$ is down-sampled by mean pooling.

We refer interested readers to (Tay et al., 2021) for fine-grained details.
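The block enumeration, scoring, upsampling, weighted sum, and downsampling steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the production tokenizer: block embeddings are assumed to be formed by mean pooling, the scores are assumed to be normalized with a softmax over block sizes, the 1D convolution is omitted, and the names `gbst_sketch`, `mean_vec`, and `score_w` are hypothetical.

```python
import math

def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def gbst_sketch(x, score_w, max_block=4, downsample=2):
    """x: list of L byte embeddings (each a list of d floats).
    score_w: d floats standing in for the linear block scoring network."""
    length, dim = len(x), len(x[0])
    embeds, scores = [], []  # one upsampled sequence per block size
    for b in range(1, max_block + 1):
        up_e, up_s = [], []
        for start in range(0, length, b):  # stride equals block size
            chunk = x[start:start + b]
            block = mean_vec(chunk)  # assumed: mean-pooled block embedding
            score = sum(w * v for w, v in zip(score_w, block))
            up_e.extend([block] * len(chunk))  # upsample by repetition
            up_s.extend([score] * len(chunk))
        embeds.append(up_e)
        scores.append(up_s)
    x_hat = []
    for i in range(length):
        col = [scores[b][i] for b in range(max_block)]
        top = max(col)
        exps = [math.exp(s - top) for s in col]  # assumed: softmax over block sizes
        probs = [e / sum(exps) for e in exps]
        x_hat.append([
            sum(p * embeds[b][i][k] for b, p in enumerate(probs))
            for k in range(dim)
        ])
    # Final down-sampling by mean pooling over consecutive positions.
    return [mean_vec(x_hat[i:i + downsample]) for i in range(0, length, downsample)]
```

With a downsampling rate of 2, an input of 8 byte embeddings yields 4 latent subword embeddings, which is what the Transformer stack then consumes.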

2. Transformer Stack

The Transformer stack in our approach accepts latent subwords from the Learnable Tokenizer as an input and the remainder of the Transformer stack remains identical to a standard Transformer model. Transformer architectures are characterized by stacks of self-attention blocks followed by simple feed-forward layers (Vaswani et al., 2017).

3. Reconfigurable Seq2Seq Architecture

Our pretraining utilizes a Seq2Seq (Encoder-Decoder) architecture that is optimized by teacher forcing. In practice, we find this denoising loss to be more effective than encoder-only (BERT-based) pretraining. Intuitively, Seq2Seq based masked language modeling also enables sequential and long-term dependencies to be taken into account in the autoregressive generation process. Our Seq2Seq architecture is reconfigurable, i.e., during certain tasks, we may remove the decoder for specialized regression or classification heads while retaining a universal encoder for all tasks. With this formulation, we can retain a unified encoder across all tasks from shared representation learning. While the T5 model (Raffel et al., 2020) enables regression problems to be framed in Seq2Seq architectures, we find that adding regression heads is more natural and effective in practice. Moreover, this supports the case where we have multiple labels per input example. Note that the entire UTC Seq2Seq architecture can also be finetuned on downstream classification tasks.

During pretraining, our model optimizes the following cross entropy loss: $\mathcal{L}=-\sum_{t=1}^{L}\sum_{i=1}^{n}\left[y^{t}_{i}\log(\pi^{t}_{i})+(1-y^{t}_{i})\log(1-\pi^{t}_{i})\right]$ where $L$ is the sequence length, $n$ is the number of classes, $\pi^{t}_{i}$ is the prediction of class $i$ at time step $t$, and $y^{t}_{i}$ is the ground truth label of the class $i$ at time step $t$ .
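As a worked illustration, the loss can be computed directly from per-step predictions and labels. The sketch below assumes the standard cross-entropy form, in which the label $y^{t}_{i}$ multiplies the first log term; the function name `seq_cross_entropy` and the clipping constant `eps` are illustrative choices rather than details from our implementation.

```python
import math

def seq_cross_entropy(pi, y, eps=1e-12):
    """pi[t][i]: predicted probability of class i at step t.
    y[t][i]: ground truth label (0 or 1) of class i at step t."""
    total = 0.0
    for pi_t, y_t in zip(pi, y):
        for p, label in zip(pi_t, y_t):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total
```

For a single time step with two classes, predictions of 0.5 each, and labels (1, 0), the loss equals $2\log 2$.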

3.2. Multi-Regression Heads and Loss Function

4. Pretraining

Our model is pre-trained on an equal mixture of two data sources: the Perspective Pretraining Corpus (PPC) and the mC4 corpus from mT5 (Xue et al., 2021b). PPC is a proprietary corpus of $\sim$ 4.6B message and comment texts from a variety of sources including data historically processed by the Perspective API or shared by partners. Note that API clients can enable/disable this data storage via an API flag (doNotStore; see https://developers.perspectiveapi.com/s/about-the-api-methods). Text in this corpus typically comes from a variety of online forums. We mix the two corpora equally, sampling equally between target languages within the mC4 mixture, while using the natural language distribution in the PPC split. We pretrain our model using the span-based denoising objective in a Seq2Seq fashion, with a mean span corruption length of $20$ bytes and a corruption rate of $15\%$ . We pretrain for 1M steps with a batch size of $128$ sequences, with the maximum length for each sequence set to $512$ bytes.
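To make these settings concrete: a 512-byte sequence at a 15% corruption rate has about 77 corrupted bytes, or roughly 4 spans of mean length 20. The following is a simplified sketch of span-based denoising over bytes, not the exact T5 procedure: span lengths are made near-equal rather than randomized, and the sentinel format `<Xk>` and the function name `span_corrupt` are hypothetical.

```python
import random

def span_corrupt(data: bytes, rate=0.15, mean_span=20, seed=0):
    """Return (inputs, targets) for a simplified span-denoising example."""
    rng = random.Random(seed)
    n = len(data)
    n_noise = max(1, round(n * rate))              # bytes to corrupt
    n_spans = max(1, round(n_noise / mean_span))   # number of corrupted spans
    # Split the noise budget into spans of near-equal length (simplification).
    span_lens = [n_noise // n_spans + (1 if k < n_noise % n_spans else 0)
                 for k in range(n_spans)]
    # Randomly distribute the kept bytes into the n_spans + 1 gaps around spans.
    n_keep = n - n_noise
    cuts = sorted(rng.randint(0, n_keep) for _ in range(n_spans))
    gaps = [b - a for a, b in zip([0] + cuts, cuts + [n_keep])]
    inputs, targets, pos = [], [], 0
    for k, span_len in enumerate(span_lens):
        inputs.append(data[pos:pos + gaps[k]])     # kept bytes before the span
        pos += gaps[k]
        sentinel = f"<X{k}>".encode()
        inputs.append(sentinel)                    # span replaced by a sentinel
        targets.append(sentinel + data[pos:pos + span_len])
        pos += span_len
    inputs.append(data[pos:])                      # kept bytes after the last span
    return b"".join(inputs), b"".join(targets)
```

The encoder sees the input with spans replaced by sentinels, and the decoder is trained with teacher forcing to emit each sentinel followed by the bytes it replaced.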

Experimental Settings

This section provides an overview of our experimental setup.

We conduct three categories of experiments: core multilingual toxic comment classification, robustness evaluation, and evaluation of adaptation to new types of toxicity. For core multilingual toxic comment classification we evaluate on both existing public benchmarks (Multilingual Toxic Comments Challenge) as well as a labeled real world dataset derived from live API traffic (Production-Multilingual). For robustness evaluation, we evaluate model performance when faced with code-switching (a subset of the multilingual datasets), obfuscation (obfuscated CivilComments), and distribution shift (zero-shot TweetEval (Barbieri et al., 2020) and CivilComments-WILDS (Koh et al., 2021)). We also evaluate the model on an identity term bias task based on (Borkan et al., 2019). For adapting to new types of toxicity, we evaluate finetuning performance on Covert Toxicity (Lees et al., 2021), and Hatemoji (Kirk et al., 2021). We refer the reader to Sections 5, 6, and 7 for detailed descriptions of these datasets.

2. Models

This section discusses the details about the major models we use in our experiments.

Perspective API Jigsaw’s public API for scoring comments for toxicity (Jigsaw, 2017), prior to this work. It should be noted that many of the languages evaluated in the paper are not currently supported by the Perspective API, and as such Perspective results are omitted in such experiments.

Custom mBERT We compare with a strong multilingual BERT (Devlin et al., 2019) baseline that has been pretrained on PPC. The model uses a custom SentencePiece vocabulary of size 200K, created explicitly from the PPC corpus. We refer to this strong production baseline as Custom mBERT and consider it representative of a model highly tailored for the domain. The baseline model consisted of 768 dimensions, 12 layers, 12 heads, consistent with BERT-base (Devlin et al., 2019). The pre-training consists of MLM Loss and translation pairs with uniform masking at $15\%$ . Pretraining was conducted for 125K steps with batch size of 32K.

Multilingual T5 (mT5) - the state-of-the-art for multilingual natural language processing. mT5 is a pretrained T5 (Raffel et al., 2020) model pre-trained on 100+ languages on the Multilingual C4 corpus.

UTC and UTC† - our proposed models described in Section 3. For the vanilla UTC model, we use $d_{model}=512$ , $d_{ff}=2048,d_{kv}=64,N_{heads}=8$ . The number of encoder layers is set to $24$ and the number of decoder layers is set to $6$ . For the learned tokenizer, we set the sequence length downsampling rate of $2$ , and set the convolution filter size to $5$ . The standard UTC is approximately 102M parameters when deployed in downstream applications. This model size was considered to ensure a fast serving latency. We also consider a larger (but still servable) UTC† model that is approximately $268M$ parameters where $d_{model}=768$ , $d_{ff}=3072,d_{kv}=64,N_{heads}=12$ and number of encoder layers is set to $28$ . We denote this model as UTC†.
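As a rough sanity check on these sizes, the encoder parameter count can be estimated from the hyperparameters above. This back-of-the-envelope sketch rests on several assumptions that are not stated in the text: only the encoder is counted (the decoder being removed at serving time), the feed-forward layers are gated (three weight matrices instead of two), and layer norms, position biases, and the learned tokenizer are ignored. The helper name `encoder_params` is hypothetical.

```python
def encoder_params(d_model, d_ff, d_kv, n_heads, n_layers, gated_ff=True):
    """Approximate weight count of a Transformer encoder stack."""
    attn = 4 * d_model * (n_heads * d_kv)          # Q, K, V, and output projections
    ff = (3 if gated_ff else 2) * d_model * d_ff   # assumed gated feed-forward
    return n_layers * (attn + ff)

utc = encoder_params(512, 2048, 64, 8, 24)         # 100,663,296
utc_large = encoder_params(768, 3072, 64, 12, 28)  # 264,241,152
```

Under these assumptions the estimates come to about 101M and 264M parameters, close to the reported 102M and 268M. With an ungated feed-forward layer the smaller model would instead come to about 75M, so the gated variant appears to be the better match.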

Fine-grained details on each specific baseline can be found in each individual experiment section. Please see Appendix A for additional reproduction details.

Experiments: Multilingual

In this section we report the core multilingual toxic comment classification results of the work. With the exception of Perspective API, we finetune and evaluate all models outlined in Section 4.2 on each dataset, using a batch size of 512 until convergence.

Production-Multilingual is a proprietary internal multilingual toxic comment classification training and evaluation set. This dataset is derived from live traffic that is sent to the production Perspective API (with the doNotStore flag set to false), translated to multiple languages, and is exclusively labeled offline by human annotators for toxicity. Given the nature of this dataset, this task represents the performance of our models in production, and is the most important metric by which we compare models. The training sets cover the languages AR, CS, DE, EN, ES, FR, HI, HI-Latn, ID, IT, JA, KO, NL, PL, PT, RU, SV, ZH along with some low prevalence examples in a few other languages. The training data comprises 38 million records and is not balanced across languages, with a heavy skew towards EN. The evaluation sets are limited to the languages covered in the experimental results: AR, CS, EN, HI-Latn, ID, JA, KO, NL, PL, PT, RU, ZH. This narrower set focuses on languages where Perspective did not already have a production quality model (at the time of experimentation), plus English, where significant Perspective usage comes from. The evaluation sets are roughly balanced in volume across languages and comprise 1.3 million records.

1.2. Results

Table 1 reports results on the Production-Multilingual dataset. Overall, UTC outperformed all baselines, with a small UTC model outperforming mT5base, a model with more than twice as many parameters. While UTC does not perform as strongly on English as Custom mBERT, the main advantage of UTC is observed in many non-English languages.

2. Multilingual Toxic Comments Challenge

In this section, we report experimental results on the public dataset featured in the Jigsaw Multilingual Toxic Comments Challenge (JMTCC) hosted by Kaggle. The competition was held in 2020 and comprises 6 languages besides English: Spanish, French, Italian, Portuguese, Russian, and Turkish. While the evaluation data was multilingual, only English data was provided in the training set. Hence, it was common for participants to make use of translation data to augment the training set. We train our models on the translated data that was shared in the Kaggle discussion forums.

2.2. Compared Baselines

Aside from the baselines in Section 4.2, we also report results from the winners of the Jigsaw Multilingual Toxic Comment Classification Kaggle competition, although it is worth noting that our goal is to develop single standalone models that can feasibly be deployed in production. By contrast, the top Kaggle submissions often involved aggressive ensembling, score scaling techniques, and similar methods that are infeasible in practice. Nevertheless, we believe it is beneficial to evaluate how well our standalone single model fares compared to a strong, highly engineered upper bound. Based on our interpretation of the Kaggle champion’s entry, we estimate the number of model parameters to be $>5B$ given that they ensemble multiple XLM large models (at least 300M parameters each) along with monolingual models.

2.3. Results

Table 2 reports results on the JMTCC dataset. Our results show that our best UTC† achieves $0.9367$ AUC-ROC, outperforming all considered single model baselines, including a strong state-of-the-art mT5 baseline. Notably, this result is only slightly worse than the top performing Kaggle #1 result, which comprises XLM-RoBERTa ensembles, pseudo labelling, and other commonly used techniques. We consider the result achieved by UTC† to be compelling, given that this is a single model that can actually be used in production applications.

Experiments: Robustness

In this section we do no additional training and evaluate the fine-tuned models from Section 5.1 to evaluate the robustness of our proposed methods.

The Code-Switching eval sets aim to identify theoretically more difficult multilingual user comments. Both bespoke evaluation datasets below are constructed by filtering the parent superset with the same criteria for multilingual comment identification: test examples are restricted to those where $2$ or more languages are present. Samples are included if and only if $\geq 25\%$ of the example content is identified to be in each of 2 or more languages using a language detection model (https://github.com/google/cld3). We use two subsets for code-switching based on the Production-Multilingual and JMTCC datasets. Details on the breakdown of these code-switching datasets can be found in the supplemental material.
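The inclusion rule can be expressed as a small predicate. The sketch below assumes that per-language shares of the text have already been produced by a language detector; it does not call any detector itself, and the function name `is_code_switched` and the input format are hypothetical.

```python
def is_code_switched(lang_proportions, min_share=0.25, min_langs=2):
    """lang_proportions: mapping of language code -> share of the text (0..1),
    as estimated by a language detection model."""
    qualifying = [lang for lang, share in lang_proportions.items()
                  if share >= min_share]
    return len(qualifying) >= min_langs
```

For example, a comment detected as 60% English and 40% Spanish is kept, while one detected as 90% English and 10% Spanish is filtered out.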

The same baseline models evaluated on the Production-Multilingual dataset were employed, including Custom mBERT, the comment domain multilingual BERT model, $mT5_{base}$ and $mT5_{small}$ . Similarly, we also included an evaluation using the public Perspective API (Jigsaw, 2017). It should be noted that not all of the languages included in the code-switching datasets are listed as supported by Perspective and as such Perspective may be disadvantaged in this experiment. To specify a language for Perspective API, we use a language detection model to identify the primary language for each code-switch example. If the language is not supported by Perspective API, then we default to English.
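The language selection used for the Perspective API baseline amounts to a simple fallback rule. A minimal sketch, assuming detector output in the form of per-language shares; the function name `perspective_language` and the example set of supported languages are hypothetical and do not reflect the API's actual supported list.

```python
def perspective_language(lang_proportions, supported, default="en"):
    """Pick the primary detected language, falling back to English
    when that language is not supported."""
    primary = max(lang_proportions, key=lang_proportions.get)
    return primary if primary in supported else default

# Hypothetical supported set, for illustration only.
example_supported = {"en", "es", "fr"}
choice = perspective_language({"tr": 0.7, "en": 0.3}, example_supported)
```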

1.2. Results

Table 3 reports our experimental results on code-switching evaluation sets. On both JMTCC-CS and Production-CS, UTC† performs best, outperforming the mT5base baseline. In general we find that an off-the-shelf mT5 model also substantially outperforms the Custom mBERT model. Finally, we note that the Perspective API performs poorly: prior to this work, models were trained separately on individual languages and as such are not well equipped to handle multilingual code-switching. One limitation of this experiment is the dominance of English and Latin-script languages in the code-switching evaluation sets. We suspect that the byte-level vocabulary in our Charformer model is advantageous for character-based languages, especially in code-switching tasks. The preliminary results support this hypothesis.

2. Human-Readable Obfuscation

One common technique used to bypass toxicity classification models is to intentionally misspell words in a fashion that is understood by human readers, yet obfuscated to machine learning models (Gröndahl et al., 2018). Even though our proposed models are not explicitly trained on these types of adversarial examples, in this section we run zero-shot experiments on synthetically obfuscated data to evaluate model robustness in this area.

We construct a synthetically obfuscated variant of the Civil Comments (Borkan et al., 2019) test set. As the dataset is in English only, we manually construct a dictionary of valid substitutions for every letter in the English alphabet. Then for every alphabetical character in each example, we replace the character by a substitute with some probability, which we call the character obfuscation rate. If a character is chosen, then a substitute for the character is chosen uniformly at random from the list of valid substitutes for the character. Substitutes may be any other character or string that may still be readable as the original character, e.g. “a” may be substituted with “4”, “@”, or “/\”. Valid substitutions for vowels also include “*” or an empty string, which effectively removes information from the sequence. See Figure 3 for the comprehensive dictionary of valid substitutions, and Figure 4 for examples of text at various character obfuscation rates. Finally, please note that the construction of this dataset is not meant to comprehensively capture a realistic distribution of adversarial examples against toxicity classification models. Instead, we aim to create a controlled and sufficiently challenging dataset to serve as a point of evaluation between models.
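The obfuscation procedure can be sketched as follows. The substitution table here is a small illustrative subset: only the entries for “a” and the vowel options “*” and the empty string come from the description above, while the remaining entries are assumptions made for the example. The names `SUBSTITUTES` and `obfuscate` are hypothetical.

```python
import random

# Illustrative subset only; entries other than those for "a" are assumptions.
SUBSTITUTES = {
    "a": ["4", "@", "/\\", "*", ""],
    "e": ["3", "*", ""],
    "i": ["1", "!", "*", ""],
    "o": ["0", "*", ""],
    "s": ["5", "$"],
}

def obfuscate(text, rate, seed=0, substitutes=SUBSTITUTES):
    """Replace each alphabetical character with probability `rate` by a
    substitute drawn uniformly at random from its list of valid substitutes."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = substitutes.get(ch.lower())
        if ch.isalpha() and options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)  # unchanged: not chosen, or no substitutes listed
    return "".join(out)
```

Sweeping `rate` over 0.0, 0.1, and so on up to 0.5 produces progressively less readable variants of the same text, mirroring the character obfuscation rates used in the evaluation.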

2.2. Compared Baselines

We evaluate the zero-shot performance of Perspective API (prior to this work), Custom mBERT, mT5-small, and UTC (102M param.) on obfuscated Civil Comments, sweeping the character obfuscation rate from 0 to 50% in increments of 10%. Custom mBERT, mT5, and UTC are the same fine-tuned models from Table 1. No additional training was done for these experiments.

2.3. Results

Figure 2 plots the performance of all models across character obfuscation rates. In this experiment we observe that while all models have similar zero-shot performance when no obfuscation is applied to the English-only dataset, UTC outperforms every baseline at every obfuscation rate greater than 0. All models do decay in performance as the obfuscation rate increases; however, this is expected, as the models rarely see this type of obfuscation during training and are not fine-tuned on any additional obfuscated data. This result echoes a similar finding for previous byte-level models (Xue et al., 2021a). One natural question to ask might be: if fine-tuned, are the models able to learn to adapt to this type of obfuscation? As an additional result, we found that when fine-tuned on a 30% obfuscated version of Civil Comments, UTC was able to fully recover performance to 86.0 AUC-ROC, while mT5-small recovers to only 84.5 AUC-ROC (-2.1pt from the unobfuscated zero-shot result). This result illustrates the value of the UTC inductive bias on this particular task.

3. Distribution Shifts

Toxic content appears on many different surfaces in different forms, targeting many different types of people. In this section we evaluate the performance of our model on two different setups. In the first setup with TweetEval, we evaluate performance on a task with a different labeling process and domain focus. In the second setup, we evaluate the performance of the model on subpopulation shift using CivilComments-WILDS.

We evaluate our models on the TweetEval hate content classification test split (Barbieri et al., 2020), which is taken from the SemEval2019 Hateval challenge (Basile et al., 2019). The task is to predict whether a given tweet contains hateful language targeted against either of two communities: women and immigrants. This task differs from our UTC pre-training and fine-tuning as it is purely Tweet focused and has been labeled to a different standard (i.e. hateful language). As this is a zero-shot evaluation, we do not do any additional fine-tuning for this experiment.

We evaluate our same baselines and models as Section 6.2: Perspective API, Custom mBERT, mT5, and UTC. For mT5 and UTC we evaluate both small and base sized versions. Additionally, we compare our zero-shot results against the current state-of-the-art for the task: RoBERTa Retrained and BERTweet. Both are English-only RoBERTa-based models which have been extensively pre-trained on a large corpus of English tweets, as well as finetuned on the corresponding TweetEval hate classification training set.

3.2. Results

Table 4 reports performance on TweetEval hate classification. All zero-shot baselines which were finetuned on Production-Multilingual data showed strong performance in this experiment, with multilingual mT5 and UTC in particular performing on par with an English-only RoBERTa model that had seen additional pretraining on a large Twitter corpus and had been finetuned on TweetEval hate classification training data. Higher results previously reported by (Barbieri et al., 2020) and (Nguyen et al., 2020) are only observed in “best run” performance where the model saw favorable variance. This experiment demonstrates the effectiveness of our methods in training domain shift robust toxicity classification models.

3.3. CivilComments-WILDS

Introduced in (Koh et al., 2021), this dataset augments the Civil Comments dataset with various demographic identities referenced in each example. The goal of this dataset is to evaluate for subpopulation shift: a setting where the model sees the same domains (i.e. demographic identities) during evaluation as it does during training, but in different proportions. In particular, for CivilComments-WILDS a model is trained on all demographic identities available in CivilComments, but is evaluated on a single identity at a time, which is an extreme subpopulation shift. This is repeated individually for each subpopulation, with the aim of maximizing performance on the worst performing subpopulation. Following the original work, we perform our analysis on 8 demographic identities: male, female, LGBTQ, Christian, Muslim, other religions, Black, and White. We report accuracy on the complete test split, the worst accuracy among the 8 subpopulations, and the gap between the two, to show that our model performs better on these subpopulation shifts. (Koh et al., 2021) showed the existence of a significant gap between the average in-distribution accuracy and the worst subpopulation accuracy. We additionally report this gap for our own evaluated models.
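The three reported quantities (overall accuracy, worst subpopulation accuracy, and the gap between them) can be computed as in the sketch below. It follows the description in this section, with one accuracy per identity; the function name `group_accuracy_report` and the input layout are hypothetical.

```python
def group_accuracy_report(preds, labels, groups):
    """groups: mapping of identity name -> list of indices of the examples
    that reference that identity. Returns (overall accuracy, worst group
    name, worst group accuracy, gap)."""
    def acc(indices):
        return sum(preds[i] == labels[i] for i in indices) / len(indices)

    overall = acc(range(len(labels)))
    per_group = {name: acc(idx) for name, idx in groups.items() if idx}
    worst = min(per_group, key=per_group.get)
    return overall, worst, per_group[worst], overall - per_group[worst]
```

An example may reference several identities, so the index lists are allowed to overlap.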

We evaluate the small-sized mT5 and UTC models from Section 5.1 in this setting. We compare to the highest performing DistilBERT results from (Koh et al., 2021) with respect to both average overall accuracy and worst-group accuracy (DistilBERT using empirical risk minimization, ERM, and group distributionally robust optimization, Group DRO, respectively). As our models were multilingually fine-tuned, while DistilBERT was fine-tuned on only English Civil Comments, for a fair comparison in this subpopulation shift setting our mT5 and UTC models are additionally fine-tuned on Civil Comments before evaluation. We do not use any robust optimization techniques when fine-tuning our models.

Both mT5-small and UTC significantly outperform the baseline results from prior work on all metrics. Our models perform better overall, and their gap between average and worst-group performance is roughly a third to a half smaller than that of DistilBERT with robust optimization techniques. Although the exact source of this gain is unclear, we posit that the extended amount of multilingual pre-training, pre-finetuning, and greater model size may play a significant role here.

4. Identity Term Bias

(Borkan et al., 2019) outlined nuanced bias metrics to be used in addition to overall model metrics such as AUC-ROC, and (Dixon et al., 2018) introduced synthetic, templated datasets for identifying unintended bias in toxicity models. Here we use these tools to evaluate our models for identity term bias.

We use a new multilingual version of the synthetic template bias evaluation data set, publicly released in 2021 (see https://medium.com/jigsaw/identifying-machine-learning-bias-with-updated-data-sets-7c36d6063a2c and https://github.com/conversationai/unintended-ml-bias-analysis/tree/2021-refresh). The examples in this dataset are generated from predefined templates with slots where different words (e.g. identity terms, adjectives, verbs, etc.) can be substituted for related terms in order to test for performance with regards to various subgroups (identities). To obtain a multilingual dataset for each of the 12 target languages we rely on a team of expert native speakers to construct these templates. The final generated dataset consists of $\sim 2M$ examples and has a balanced class distribution. Following (Borkan et al., 2019) we report Subgroup AUC, Background Positive Subgroup Negative (BPSN) AUC, and Background Negative Subgroup Positive (BNSP) AUC for all subgroup-language combinations. Please see (Borkan et al., 2019) for precise definitions of these metrics.
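For readers unfamiliar with these metrics, the sketch below computes the three AUCs for one subgroup from model scores, binary toxicity labels, and subgroup membership. It is an illustrative pure-Python version using a pairwise AUC, not the evaluation code used for our results; the function names `auc` and `bias_aucs` are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as 0.5). Returns NaN if either class is empty."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bias_aucs(scores, labels, in_subgroup):
    """in_subgroup[i] is True when example i mentions the identity term."""
    rows = list(zip(scores, labels, in_subgroup))

    def pick(keep):
        kept = [(s, y) for s, y, g in rows if keep(y, g)]
        return auc([s for s, _ in kept], [y for _, y in kept])

    return {
        # Only examples within the subgroup.
        "subgroup": pick(lambda y, g: g),
        # Background positives together with subgroup negatives.
        "bpsn": pick(lambda y, g: (g and y == 0) or (not g and y == 1)),
        # Background negatives together with subgroup positives.
        "bnsp": pick(lambda y, g: (g and y == 1) or (not g and y == 0)),
    }
```

A low BPSN AUC indicates that non-toxic examples mentioning the identity term tend to score higher than toxic examples that do not mention it, i.e. the term itself drives false positives.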

4.2. Results.

We evaluate model bias for UTC with mT5-small serving as a baseline comparison. Both models are the same as those in Table 1. We visualize the results in Figures 5 and 6 for all language splits. Note that to generate these visualizations we aggregate results with their corresponding English identity term (results with no corresponding English term are not shown here). Overall, the metrics remain strong at $>$ .7 across languages, with UTC performing stronger on more subgroup-language combinations than mT5. As both models are finetuned on the same dataset, the UTC inductive bias may play a role here. However, some subgroup-language combinations still require additional work. For example, there are some terms in Korean and Japanese that demonstrate unwanted bias, such as the BPSN AUC metric for the term homosexual, where 동성애자 and 同性愛者 have values $\leq$ .5. This suggests that the non-toxic templates containing the identity term are yielding false positives, i.e. the term itself is correlated with toxicity. As such, further explicit debiasing efforts are still needed.

Experiments: Adapting

In this section we demonstrate that the fine-tuned checkpoints from Section 5.1 are highly adaptable to new types of toxicity by further fine-tuning our models on new challenging tasks.

We conduct experiments on Covert Toxicity (Lees et al., 2021), a task of distinguishing if a piece of text contains nuanced toxicity such as microaggressions. We compare with Toxic-BERT and Covert-BERT baselines reported in (Lees et al., 2021). We also finetune a monolingual (English only) T5 and mT5 base model as strong baselines.

Table 6 reports results on the CovertToxicity task. We show that UTC and UTC† achieve very competitive results, outperforming both the Toxic-BERT and Covert-BERT baselines. On this task, the performance of UTC† is competitive with mT5-base. Interestingly, the multilingual models (mT5 and UTC) outperform the specialized monolingual English T5 model.

2. Emoji-based Hate

The English-only Hatemoji dataset comprises two splits: HatemojiCheck and HatemojiTrain. HatemojiCheck is a manually constructed test suite of 3,930 short-form statements, each labeled for whether it uses emoji-based hateful language.

2.2. Compared Baselines

Kirk et al. (2021) showed that fine-tuning on HatemojiTrain greatly improves performance on HatemojiCheck. Following this, we evaluate our two highest-performing models from Section 5.1, mT5 and UTC, by further fine-tuning them on HatemojiTrain and then evaluating on HatemojiCheck. Note that these checkpoints have already been finetuned on Production-Multilingual. We use the validation split of HatemojiTrain to pick the best checkpoint for this additional fine-tuning. We additionally compare these results to the best result reported in (Kirk et al., 2021), which comes from an English-only DeBERTa model optimized for the task.

2.3. Results

Table 7 reports results on HatemojiCheck. Even though UTC is multilingual, the finetuned UTC outperforms the best-performing model from (Kirk et al., 2021) by a significant margin. We posit that this gain may be attributed to the learned tokenizer, which may effectively update during finetuning to adapt to parsing emojis. On the other hand, mT5-small underperforms the Kirk et al. baseline.

Deployment Results

On December 9th, 2021, Jigsaw launched support for 10 new languages in the Perspective API (Acosta et al., 2021): Arabic, Chinese (Simplified), Czech, Dutch, Indonesian, Japanese, Korean, Polish, Hindi, and Hinglish (a mix of English and Hindi transliterated using Latin characters). These languages are powered by UTC† (Table 1), making the model publicly available (https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages). Note that the languages previously available within the Perspective API, which do not yet use UTC (English, French, German, Italian, Portuguese, Russian, and Spanish), were not impacted by this launch. UTC dramatically increased Perspective’s capabilities, as it is the first model to reach our production standards for these 10 languages. All previous production candidates (CNNs and BERT-based architectures) had low overall performance, low performance on bias evaluations, or were too slow to serve for real-time usage.

The model was deployed smoothly with no operational issues, and as of writing this paper, the model averages $\sim$ 15 QPS and $\sim$ 200ms median latency (for the 10 newly launched languages only). In our load testing, we have observed that the smaller UTC (102M) model can achieve 45ms median latency at 1K QPS on a single TPUv2 chip with batching. We anticipate that our production latency will improve further as load increases, since our serving infrastructure will then spend less time waiting to accumulate requests for batching. In the future, we hope to migrate to the smaller and faster UTC (102M) model, as well as explore further performance improvements. We also plan to transition the rest of the languages Perspective serves to UTC over time, as well as expand to additional new languages.

Overall, we consider our results impressive for a byte-level Transformer model of this size. We attribute the majority of this performance to the sequence length downsampling done by Charformer, as well as our removal of the decoder during finetuning, which effectively halve our sequence length and depth dimensions, respectively. We include an ablation study for the speed of our model with and without these modifications in Table 8. Performance may also be attributed to forgoing tokenization, a process that is hard to parallelize, in favor of Charformer GBST, which can run on specialized hardware (TPU). In addition to quality and performance, there are engineering advantages to our approach. We have found that having one model to support multiple languages significantly simplifies the maintenance of our service, as there are now fewer models to maintain. Additionally, our experience forgoing tokenization echoes that of previous literature: we find that preparing models for production is simplified, as we no longer need to coordinate model checkpoints with matching vocabularies. Given these results, we are looking forward to expanding the usage of this approach in the future.

Conclusion

This paper presents Jigsaw’s new generation of toxic comment classification models, which is currently deployed in production for 10 new languages in the Perspective API. We outline our approach in applying the state-of-the-art token-free Charformer to the problem of toxic comment classification and the efficiency techniques we take to enable serving such a byte-level model in production. Through rigorous experiments on real-world and academic benchmarks, we demonstrate the effectiveness of our approach.

References

Appendix A Reproduction Details

Our UTC model is implemented in Mesh TensorFlow (https://github.com/tensorflow/mesh) (Shazeer et al., 2018), a wrapper over the TensorFlow API that enables distributed model parallelism, along with the T5 library (https://github.com/google-research/text-to-text-transfer-transformer). For Charformer (Tay et al., 2021), we use the official implementation (https://github.com/google-research/google-research/tree/master/charformer) released by the authors. The overarching model architecture follows the T5.1.1 setup, using T5-style relative attention biases instead of position embeddings.

A.0.2. Optimization and Training Details

This section describes the general setup for our pretraining and finetuning experiments. Dataset specific details are deferred to respective sections. For both pretraining and finetuning, we use the Adafactor optimizer (Shazeer and Stern, 2018). During pretraining, we use a learning rate equal to the inverse square root of the current training step following (Raffel et al., 2020). Finetuning is performed using a fixed constant learning rate of $10^{-3}$ . We apply a dropout of $0.1$ during finetuning. Pretraining is conducted with $64$ TPU-v3 chips and finetuning is typically conducted with $16$ TPU-v3 chips. Pretraining generally takes about 3-4 days to complete.
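The two learning rate settings above can be sketched as follows. This is an illustration rather than the training code: the warmup floor of $10^4$ steps is the value used by Raffel et al. (2020) and is an assumption here, since this appendix does not state it, and the names `inverse_sqrt_lr` and `FINETUNE_LR` are hypothetical.

```python
import math

# Constant learning rate used for finetuning, as stated in the text.
FINETUNE_LR = 1e-3


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square root schedule: 1 / sqrt(max(step, warmup_steps)).

    The rate is held constant for the first `warmup_steps` steps and then
    decays with the inverse square root of the step count. The default of
    10,000 warmup steps is assumed from Raffel et al. (2020).
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


# The rate stays at 0.01 through step 10,000, then halves by step 40,000.
pretrain_rates = [inverse_sqrt_lr(s) for s in (1, 10_000, 40_000)]
```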

A.0.3. Reproducibility

Our model is currently available via the production Perspective API (https://developers.perspectiveapi.com/s/docs) for Arabic, Chinese (Simplified), Czech, Dutch, Hindi, Hinglish (a mix of Hindi and English), Indonesian, Japanese, Korean, Polish, and Russian for the "TOXICITY" attribute. Access to the model for English is also released under the "TOXICITY_EXPERIMENTAL" attribute. Even though the interface for the API requires specification of a language, all requests are routed to a single UTC† model.
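For readers who want to query the deployed model, the following sketch builds a request body for the Perspective API `comments:analyze` method and reads the summary score from a response. It makes no network call. The helper names are hypothetical, the sample response is hand-written to show the expected shape only, and the endpoint and field names reflect the public API documentation at the time of writing, so they should be checked against the current documentation before use.

```python
import json
from typing import Any, Dict

# Public endpoint; an API key is passed as the `key` query parameter of a POST.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text: str, language: str, attribute: str = "TOXICITY") -> Dict[str, Any]:
    """Build the JSON body for a single-attribute analysis request."""
    return {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {attribute: {}},
    }


def summary_score(response: Dict[str, Any], attribute: str = "TOXICITY") -> float:
    """Extract the overall score for one attribute from a parsed response."""
    return response["attributeScores"][attribute]["summaryScore"]["value"]


# Request for a Korean comment; the body would be sent as JSON to ANALYZE_URL.
body = json.dumps(build_request("안녕하세요", "ko"), ensure_ascii=False)

# Hand-written response, illustrating the shape only (not a real score).
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.12, "type": "PROBABILITY"}}
    }
}
score = summary_score(sample_response)
```

For English, the same request would name "TOXICITY_EXPERIMENTAL" as the attribute, per the text above.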

A.0.4. Dataset Details

Here we include figures to further provide some details about selected datasets used.