PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo

Introduction

Pretrained language models have been employed successfully in various NLP tasks. Recent works demonstrate that monolingual pretrained models perform better on tasks in that same language than models pretrained on multilingual corpora.

For Portuguese tasks, BERT models have already shown improved performance when pretrained on a Portuguese corpus. One of the motivations to perform a similar pretraining but using the T5 model is its capability to generate text. Thus, it can perform tasks that a BERT model cannot, such as summarization, abstractive question answering, and translation.

In this work, we improve the original T5 model on Portuguese language tasks by pretraining it on BrWac, a large corpus of web pages in Brazilian Portuguese. We call this model PTT5. We validate our pretraining on Portuguese tasks of sentence entailment prediction and named entity recognition, and show that monolingual pretraining significantly improves the model’s performance.

Data

We use three Brazilian Portuguese datasets: BrWac for pretraining, and ASSIN 2 and HAREM for fine-tuning and evaluating our pretrained models.

The BrWac corpus was built after crawling more than 60 million web pages, filtered down to 3.5 million pages after applying quality control filters, resulting in a dataset with 2.7 billion tokens.

We construct input examples for the pretraining task by concatenating sentences until we reach 512 words. If the last sentence does not fit in the 512 words, we move it to the next example, i.e., we do not split sentences. Any sentence longer than 512 words, which is rare, is truncated to 512 words and added as a separate input example. We use the ftfy library to fix encoding problems. The resulting dataset has 15.6 GB of text and a mean of 360 words per document, as shown in Table 1. We use Python’s split function to count words.
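The packing procedure described above can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function name `pack_sentences` is hypothetical, and we assume that an oversized sentence first closes the example currently being built, a detail the description leaves open.

```python
def pack_sentences(sentences, max_words=512):
    """Greedily concatenate sentences into examples of at most max_words words."""
    examples = []
    current = []  # words of the example being built
    for sentence in sentences:
        words = sentence.split()  # whitespace split, as used for word counting
        if len(words) > max_words:
            # Oversized sentence: close the open example (assumption), then
            # emit the sentence on its own, truncated to max_words words.
            if current:
                examples.append(" ".join(current))
                current = []
            examples.append(" ".join(words[:max_words]))
            continue
        if len(current) + len(words) > max_words:
            # The sentence does not fit, so it starts the next example
            # instead of being split.
            examples.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        examples.append(" ".join(current))
    return examples
```

With `max_words=5`, the sentences "a b c", "d e" and "f g h i" yield the two examples "a b c d e" and "f g h i".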

These input examples go through a tokenizer that uses either the original T5 vocabulary or our custom Portuguese vocabulary (Section 3.1).

2 ASSIN 2

ASSIN 2 consists of two Portuguese tasks: semantic similarity and entailment prediction. In the semantic similarity task, given a pair of sentences, a model has to predict a number between 1 and 5, representing how semantically close the two sentences are. The entailment prediction task consists of classifying if one sentence implies the other. It is thus a binary classification task whose classes are “entail” or “none”. The dataset consists of short sentence pairs, with 6500 pairs for training, 500 for validation, and 2448 for testing.

3 HAREM

HAREM is a collection of two Portuguese datasets for Named Entity Recognition (NER). The first dataset, called First HAREM, consists of 129 documents with a total of 4151 entities in the selective scenario and 5017 in the total scenario. These two scenarios differ only in the number of annotated classes. The total scenario contains 10 classes (Location, Person, Organization, Value, Date, Title, Thing, Event, Abstraction and Other), while the selective scenario contains 5 classes (Person, Organization, Location, Value and Date). The second dataset, called MiniHAREM, is composed of 128 documents, with 3642 and 3018 entities for the total and selective scenarios, respectively. In this work, we focus only on the selective scenario. We separate 7 percent of the First HAREM documents for validation and use the remainder for training. All of MiniHAREM is used as a test set. We refer the reader to the original paper for more information.

Methodology

We now describe our methodology, including creating the custom Portuguese vocabulary, unsupervised pretraining, and fine-tuning and evaluation on ASSIN 2 and HAREM.

The original T5 vocabulary was built with the SentencePiece library using English, German, French, and Romanian web pages from Common Crawl (http://commoncrawl.org/).

We use a similar procedure to create our Portuguese vocabulary: we train a SentencePiece model on a corpus of 2 million sentences randomly chosen from the Portuguese Wikipedia. We use the Unigram language model as in Kudo and a predetermined vocabulary size of 32,000 wordpieces.

We use the same control tokens (padding, end-of-sequence, and unknown) and vocabulary size of the original T5 to start pretraining from the original T5 checkpoints without significant changes in the model architecture and overall pretraining process.

2 Unsupervised Pretraining

The unsupervised pretraining was performed with a denoising objective, which can be implemented in a few different ways. The main idea is to train the model in an unsupervised way, feeding the model with corrupted versions of the original token sequence, and training it to reconstruct the original sequence.

In all pretraining experiments, we use one of the strategies explored in the original T5 paper: each token in the input sequence has a predefined probability of being replaced by a mask token. The model is fed with this masked token sequence, and trained to produce the original sequence. For example, given the input sequence “Que <M> para <M> sobre o <M> PTT5!”, the model is trained to produce the sequence “Que belo dia para aprendermos sobre o maravilhoso PTT5!”, where “<M>” is a mask token. Note that this is an illustrative example. In practice, the tokens in the sentence are subword units.
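The corruption step can be sketched as below. This is an illustration under stated assumptions rather than the T5 implementation: the function `corrupt` and the `merge_adjacent` flag are hypothetical, and merging consecutive masks into one is inferred only from the example above, where “belo dia” maps to a single mask token. The default probability matches the 15% corruption rate used in our experiments.

```python
import random

MASK = "<M>"  # placeholder mask token, for illustration only

def corrupt(tokens, mask_prob=0.15, merge_adjacent=True, rng=None):
    """Return (corrupted_input, target) for the denoising objective.

    Each token is replaced by MASK with probability mask_prob; the target
    is the original, uncorrupted sequence.
    """
    if rng is None:
        rng = random.Random()
    corrupted = []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Assumption: a run of masked tokens collapses into one MASK.
            if merge_adjacent and corrupted and corrupted[-1] == MASK:
                continue
            corrupted.append(MASK)
        else:
            corrupted.append(tok)
    return corrupted, list(tokens)
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible across runs.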

In the pretraining experiments, we use the cross-entropy loss as a cost function and the Adafactor optimizer. Training always starts from the corresponding original T5 checkpoints released by Raffel et al. We use Google Cloud TPU v3-8s and T5’s official implementation in TensorFlow (https://github.com/google-research/text-to-text-transfer-transformer).

In addition to updating all model weights during pretraining, we also experiment with updating only the vocabulary embeddings and freezing the remaining weights. The model then has fewer weights to be learned during pretraining, which leads to faster convergence times. Hence, this could be a more economical pretraining strategy than the widely adopted strategy of pretraining the whole model.

3 ASSIN 2 Training and Validation

For the ASSIN 2 tasks, the input to T5 is formatted as:

where [S1] and [S2] correspond to the strings of sentences 1 and 2, respectively, and [eos] is the end-of-sequence token. We experimented with two ways of producing the scores in the sentence similarity task. In the first approach, we follow Raffel et al.’s strategy for regression tasks and train the model to output the literal string representing the score. We limit this generation to 5 tokens, and the generated string is then converted to a floating-point number. In the second approach, we feed the mean over the sequence length of the last hidden state of T5’s encoder to a linear layer with a sigmoid activation. Since scores are between 1 and 5, we rescale the scalar output $y$ of the sigmoid layer by doing $4y+1$. The loss function is the Mean Squared Error (MSE), which is also one of the similarity task’s main metrics. For the entailment task, we feed the last hidden state of the T5 encoder to a linear layer with two output neurons followed by a softmax. We fine-tune the models using the cross-entropy loss with the RAdam optimizer.
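The linear-layer approach for the similarity score can be illustrated with plain Python lists. This is a sketch only: the names `mean_pool` and `similarity_score` are hypothetical, the encoder states are passed in as nested lists rather than tensors, and the weights and bias stand in for the learned linear layer.

```python
import math

def mean_pool(hidden_states):
    """Average a [seq_len][hidden_dim] list of vectors over the sequence axis."""
    seq_len = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / seq_len for j in range(dim)]

def similarity_score(hidden_states, weights, bias):
    """Mean-pool encoder states, apply linear + sigmoid, rescale to (1, 5)."""
    pooled = mean_pool(hidden_states)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    y = 1.0 / (1.0 + math.exp(-logit))  # sigmoid output in (0, 1)
    return 4.0 * y + 1.0                # the 4y + 1 rescaling
```

A zero logit gives $y = 0.5$ and therefore a score of 3, the midpoint of the 1 to 5 range.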

4 HAREM Training and Validation

For the Named Entity Recognition task we feed the model with an input using the following format:

where [S] is the sentence string. As output, we expect each entity to be followed by its class. For instance, given the input “Recognize Entities: John lives in New York”, the target output would be “John [Person] lives in [Other] New York [Local]”, where “[Other]” is used to label all out-of-context tokens. The example is given in English for demonstration purposes only, as HAREM is a dataset in Portuguese. This approach allows us to recognize entities in a sentence using a sequence-to-sequence model, without any modification to its architecture. Training is done by minimizing the cross-entropy loss for each token using the AdamW optimizer.
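To make the target format concrete, the sketch below builds an input and target pair from tokens annotated with BIO tags. The helper names and the `names` mapping are hypothetical, the prefix is copied from the English example above, and starting from BIO tags is an assumption about how the annotations are stored.

```python
PREFIX = "Recognize Entities: "  # taken from the illustrative example

def build_input(sentence):
    """Prepend the task prefix to the raw sentence."""
    return PREFIX + sentence

def build_target(tokens, bio_tags, names):
    """Group tokens into spans and append each span's class in brackets."""
    spans = []  # list of [words, label]
    for tok, tag in zip(tokens, bio_tags):
        label = "Other" if tag == "O" else names[tag[2:]]
        starts_new = (
            not spans
            or tag.startswith("B-")      # a new entity begins here
            or spans[-1][1] != label     # the class changed
        )
        if starts_new:
            spans.append([[tok], label])
        else:
            spans[-1][0].append(tok)
    return " ".join(f"{' '.join(words)} [{label}]" for words, label in spans)
```

For the tokens of “John lives in New York” tagged B-PER, O, O, B-LOC, I-LOC, with `names = {"PER": "Person", "LOC": "Local"}`, `build_target` reproduces the target string shown above.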

Experiments and Discussion

Our experiments comprise two main phases: unsupervised pretraining on the BrWac corpus, and fine-tuning and evaluating on ASSIN 2 and HAREM tasks.

The experiments with unsupervised pretraining (Figure 1) were conducted following the methodology described in Section 3.2. The corruption rate for the input tokens was 15%, which is the same used by the original T5 model. Input and output sequence token lengths follow T5’s maximum of 512. Longer sequences are truncated, and shorter sequences are padded with a special padding token. The learning rate was constant and equal to 0.003. All models were pretrained for four epochs.

Table 2 shows a summary of all pretraining experiments performed. Each row represents a specific combination of model size, batch size, and pretraining strategy (whole model vs. vocabulary embeddings only) and comprises two experiments: one for multilingual vocabulary and another for the Portuguese vocabulary (see Section 3.1 for more details).

As expected, larger models show lower loss levels. Using the Portuguese vocabulary resulted in higher loss values for the same model size. A hypothesis for the cause of this behavior is that, since the weights were initialized using checkpoints from models trained with the original T5 vocabulary, the model has to adapt to the new Portuguese vocabulary. Evidence for this is the initial high loss for models starting with the Portuguese vocabulary.

For base and large models, we observed a decrease of approximately 20% in the time spent per epoch when training only the vocabulary embeddings versus training the whole model.

2 ASSIN 2 Experiments

Here we describe our experiments in our target task: ASSIN 2. Unless noted otherwise, the learning rate for fine-tuning is 0.0001; the batch sizes are 32, 2, and 1 for the small, base, and large models, respectively. The small batch sizes for the base and large models are due to memory limitations in our GPU. We found that 128 sequence length is enough to accommodate ASSIN 2’s sentence pairs by looking at the tokenized training and validation data. The maximum number of epochs in the reported experiment plots is 50. We used a patience of 5 epochs for the similarity task and 10 epochs for the entailment task.
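The patience-based stopping rule can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and we assume that the monitored quantity is a validation loss where lower is better and that training stops once it has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops.

    val_losses holds one validation loss per epoch; training stops after
    `patience` consecutive epochs without a new best loss, or at the last
    epoch if that never happens.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)
```

In our setting, `patience` would be 5 for the similarity task and 10 for the entailment task, with at most 50 entries in `val_losses`.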

Table 3 shows results on the test set using the official ASSIN 2 evaluation script. We compare fine-tuning from the original T5 weights with fine-tuning from PTT5. We also compare our results to BERTimbau, a BERT model also pretrained on BrWac, mBERT, a multilingual BERT model, and the top models from the official ASSIN 2 leaderboard (https://sites.google.com/view/ASSIN2).

In general, PTT5 Base achieves competitive performance with BERTimbau, with the Portuguese vocabulary largely contributing to the results. Although the largest BERT model (BERTimbau Large) achieves the best performance, our PTT5 Base was better than PTT5 Large. PTT5 Base achieves the top MSE, which may be due to optimization with the MSE loss. It is noticeable how ASSIN 2’s test dataset is different from its validation set. PTT5 consistently outperforms the original T5 model on both tasks. This result aligns with our initial hypothesis that Portuguese denoising pretraining might improve performance when fine-tuning on Portuguese tasks. The use of the custom Portuguese vocabulary also consistently improved results.

Figure 2 compares the validation loss in the semantic similarity task and the validation accuracy in the entailment task between the two output strategies: string generation and linear layer over the last hidden states. For the string generation approach, accuracy is used in Figures 2(b) and 2(d). These experiments used the original T5 vocabulary and weights. We use the linear layer approach in all other experiments due to its faster convergence and higher stability.

2.2 Ablation study: Portuguese Vocabulary vs. Original T5 Vocabulary

Figure 3 shows validation losses when varying the size of the initial PTT5 weights, with and without our custom Portuguese vocabulary, on the semantic similarity task (Figures 3(a), 3(c) and 3(e)) and the entailment task (Figures 3(b), 3(d), and 3(f)). We notice that the Portuguese vocabulary helps PTT5 achieve better results and convergence on ASSIN 2. Additionally, we notice that larger models converge faster in the fine-tuning step.

2.3 Ablation: Additional Hyperparameter Tuning

Table 4 shows the best validation MSE loss for the similarity task and best validation cross-entropy loss for the entailment task, for each model. All PTT5 models were pretrained for 4 epochs. PTT5 Base with the Portuguese vocabulary achieved the lowest MSE (0.0387). PTT5 Large with the Portuguese vocabulary achieved the lowest cross-entropy (0.1420). However, as shown in Table 3, the large model achieved a lower Pearson correlation than the base model.

In Table 5, we show some additional experiments to explore variations in the batch size, learning rate, and the number of pretraining epochs. We also evaluated the performance of the PTT5 pretrained with vocabulary embeddings only.

Different pretraining methods showed similar performance on the validation set of ASSIN 2. However, on the test set, pretraining all weights achieved far better performance than pretraining the vocabulary embeddings only. These results suggest that the validation set of ASSIN 2 was not the best choice for model selection, probably due to its small size (500 examples) and because metrics were already too high for all models (e.g., Pearson correlations were close to 1). Hence, experiments on other tasks are needed to confirm whether pretraining vocabulary embeddings only is a viable option.

3 HAREM Experiments

Here we describe the experiments on the HAREM dataset for Named Entity Recognition. We compare three versions of the T5 models: the original T5 model pretrained on English texts, PTT5 with the original English vocabulary, and PTT5 with the Portuguese vocabulary. We use base models due to the computational costs of larger models. This constraint also permitted only a small batch size of 2; however, we accumulate the gradients over 4 steps, giving an effective batch size of 8. We utilize AdamW with a learning rate of $0.0002$ without any warmup or scheduling techniques.
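The effect of gradient accumulation can be checked on a toy problem. The sketch below is not our training code: it uses a hypothetical one-parameter model $y = wx$ with an MSE loss, and shows that averaging the gradients of 4 micro-batches of size 2 matches the gradient of a single batch of 8.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size=2, accumulation_steps=4):
    """Accumulate micro-batch gradients, scaled so they average over steps."""
    total = 0.0
    for step in range(accumulation_steps):
        start = step * micro_batch_size
        micro = data[start:start + micro_batch_size]
        total += grad_mse(w, micro) / accumulation_steps
    return total
```

Because all micro-batches have the same size, `accumulated_grad(w, data)` equals `grad_mse(w, data)` for any 8 examples, up to floating-point error. The optimizer step is then taken once per 4 micro-batches.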

We preprocess the First HAREM and MiniHAREM datasets using the code made available by Souza et al. For the models using the English vocabulary, we replace accented characters (e.g., ã, ó) with their closest non-accented representation (e.g., a, o). The entity tags are exactly their natural language labels. For instance, “Organização” is used to identify organizations. Due to the T5 model’s limit of 512 tokens, all examples that are longer than 512 tokens are divided into smaller segments using a sliding window technique with a stride of 256 tokens.
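Both preprocessing steps can be sketched with the standard library. The function names are hypothetical, accent removal via Unicode NFD decomposition is one possible way to obtain the closest non-accented character, and the sketch does not cover how predictions from overlapping windows are merged.

```python
import unicodedata

def strip_accents(text):
    """Replace accented characters with their base character (ã -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a long token sequence into overlapping windows of max_len tokens."""
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows
```

With the default values, consecutive windows share 256 tokens, so every token away from the sequence boundaries appears in two windows.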

During the validation phase, a beam search decoding method of width 5 is used to generate the output sequence, followed by a token labeling postprocessing step that uses the BIO format. For example, if the model generated “John [Person] lives in [Other] New York [Local]”, we would mark the words in the sequence as B-PER, O, O, B-LOC, I-LOC. This format stands for Begin, Intermediate and Out of context tokens and allows us to easily compare with other works. However, as a drawback, any non-entity token inserted or omitted in the output relative to the input sentence can lead to misalignment of the predicted and true label sequences, thus impacting performance negatively.
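The postprocessing step can be sketched as a small parser over the generated string. The regular expression, the function name `to_bio`, and the `label_to_tag` mapping are hypothetical, and mapping unknown class names to O is an assumption of this sketch.

```python
import re

# One span: some words followed by a bracketed class name.
SPAN_RE = re.compile(r"\s*(.*?)\s*\[([^\[\]]+)\]")

def to_bio(generated, label_to_tag, outside="Other"):
    """Convert 'words [Class] ...' output into (word, BIO tag) pairs."""
    tagged = []
    for match in SPAN_RE.finditer(generated):
        words = match.group(1).split()
        label = match.group(2)
        if label == outside or label not in label_to_tag:
            # Out-of-context span (or unknown class, by assumption): all O.
            tagged.extend((w, "O") for w in words)
            continue
        short = label_to_tag[label]
        for i, w in enumerate(words):
            tagged.append((w, ("B-" if i == 0 else "I-") + short))
    return tagged
```

Applied to the example above with `label_to_tag = {"Person": "PER", "Local": "LOC"}`, the tags are B-PER, O, O, B-LOC, I-LOC. Text generated after the last bracketed class is dropped by this sketch, which is one way the misalignment described above can arise.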

The results are shown in Table 6. Our language-specific pretraining (PTT5) improves upon the original T5 pretraining, with a slight gain when a Portuguese vocabulary is used. Nevertheless, Portuguese BERT still performs better than PTT5 by a small margin.

Table 7 shows that the largest performance degradation comes from detecting the Value entity.

Conclusion

We pretrained T5 models on a large Brazilian Portuguese corpus. The resulting models achieved better performance than the original T5 models on the Portuguese sentence entailment and NER tasks. Moreover, using a Portuguese vocabulary proved to be better than using the original T5 vocabulary. Finally, for ASSIN 2 tasks, we found that pretraining all weights leads to better performance than pretraining the vocabulary embeddings only. Despite having achieved better results than the top submissions to the ASSIN 2 leaderboard, our Portuguese T5 model is still a few points below the state-of-the-art model, a Portuguese BERT Large model (BERTimbau Large).

Acknowledgements

We thank Google for the free TPUs and Google Cloud credits. This work was initially developed as the final project for the IA376E course taught by professors Rodrigo Nogueira and Roberto Lotufo at the University of Campinas (UNICAMP).