WangchanBERTa: Pretraining transformer-based Thai Language Models

Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, Sarana Nutanong

Introduction

Transformer-based language models, more specifically BERT-based architectures [Devlin et al., 2018b], [Liu et al., 2019], [Lan et al., 2019], [Clark et al., 2020], and [He et al., 2020], have achieved state-of-the-art performance in downstream tasks such as sequence classification, token classification, question answering, natural language inference and word sense disambiguation [Wang et al., 2018, Wang et al., 2019]. However, for a relatively low-resource language such as Thai, the choice of models is limited to either training a BERT-based model on a much smaller dataset, such as BERT-th [ThAIKeras, 2018] trained on Thai Wikipedia Dump, or finetuning multi-lingual models such as XLMR [Conneau et al., 2019] (100 languages) and mBERT [Devlin et al., 2018b] (104 languages). Training on a small dataset has a detrimental effect on downstream performance: BERT-th underperforms the RNN-based ULMFit [Polpanumas and Phatthiyaphaibun, 2021], also trained on Thai Wikipedia Dump, on the sequence classification task Wongnai Reviews [Wongnai.com, 2018]. For multi-lingual training, comparisons between multi-lingual and mono-lingual models such as [Martin et al., 2019] show that multi-lingual models underperform mono-lingual models. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features of Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which mark important chunk and sentence boundaries in Thai, before subword tokenization.

In this report, we describe a language model based on RoBERTa-base architecture and SentencePiece [Kudo and Richardson, 2018] subword tokenizer on 78GB cleaned and deduplicated data from publicly available social media posts, news articles, and other open datasets. We also pretrain four other language models using different tokenizers, namely SentencePiece [Kudo and Richardson, 2018], dictionary-based word-level and syllable-level tokenizer (PyThaiNLP’s newmm [Phatthiyaphaibun et al., 2020]), and SEFR tokenizer [Limkonchotiwat et al., 2020], on Thai Wikipedia Dump to explore how tokens affect downstream performance.

To assess the effectiveness of our language model, we conducted an extensive set of experimental studies on the following downstream tasks: sequence classification (multi-class and multi-label) and token classification. Our model wangchanberta-base-att-spm-uncased outperforms strong baseline models (NBSVM [Wang and Manning, 2012] and CRF [Okazaki, 2007]), ULMFit [Howard and Ruder, 2018] (thai2fit [Polpanumas and Phatthiyaphaibun, 2021]) and multi-lingual transformer-based models (XLMR [Conneau et al., 2019] and mBERT [Devlin et al., 2018a]) on both sequence and token classification tasks.

The remaining sections of this report are organized as follows. In Section 2, we describe the methodology in pretraining the language models including raw data, preprocessing, train-validation-test split preparation and training the models. In Section 3, we introduce the downstream tasks we use to test the performance of our language models. In Section 4, we demonstrate the results of our language modeling and finetuning for downstream tasks. In Section 5, we discuss the results and next steps for this work.

The pretrained language models and finetuned models are publicly available at Huggingface’s Model Hub (https://huggingface.co/airesearch). The source code used for the experiments can be found at our GitHub repository (https://github.com/vistec-AI/thai2transformers).

Methodology

We train one language model on the Assorted Thai Texts dataset including all available raw datasets and four language models on the Wikipedia-only dataset, each with a different tokenizer.

The raw data are obtained from (statistics after preprocessing):

2.2 Preprocessing

We apply preprocessing rules to the raw datasets before using them as our training sets. As a consequence, the same preprocessing rules must also be applied before finetuning, both for domain-specific language modeling and for other downstream tasks.

A large portion of our training data (wisesight-large and pantip-large) comes from social media, which usually contains many unusual spellings and repetitions. For such noisy data, [Raffel et al., 2020] reports that pretraining on the cleaned corpus C4 yields better performance in downstream tasks. Therefore, we opted to apply the following processing rules, in order:

Replace HTML forms of characters with the actual characters, such as &nbsp; with a space and <br /> with a line break [Howard and Ruder, 2018].

Remove empty brackets ((), {}, and []) that sometimes come up as a result of text extraction, such as from Wikipedia.

Replace multiple consecutive spaces with a single space.

Remove more than 3 repetitive characters such as ดีมากกก to ดีมาก [Howard and Ruder, 2018].

Word-level tokenization using [Phatthiyaphaibun et al., 2020]’s newmm dictionary-based maximal matching tokenizer.

Replace repetitive words; this is done post-tokenization unlike [Howard and Ruder, 2018] since there is no delimitation by space in Thai as in English.

Replace spaces with <_>. The SentencePiece tokenizer combines spaces with other tokens. Since spaces serve as punctuation in Thai, for instance marking sentence boundaries in the way periods do in English, combining them with other tokens would lose an important feature for tasks such as word tokenization and sentence breaking. Therefore, we opt to explicitly mark spaces with <_>.

For the Wikipedia-only dataset, we only replace non-breaking spaces with spaces, remove empty parentheses that occur right after the title in the first paragraph, and replace spaces with <_>.
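The character-level rules above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the experiments: the function names are hypothetical, the repetition threshold (four or more identical characters collapsed to one) is an assumption, the space marker is written `<_>`, and the word-level steps that depend on the newmm tokenizer are left out.

```python
import html
import re

SPACE_TOKEN = "<_>"  # explicit space marker described in the rules above


def clean_text(text: str) -> str:
    """Apply the character-level cleaning rules in order (illustrative only)."""
    # HTML entities and <br /> tags become the actual characters.
    text = html.unescape(text).replace("\u00a0", " ")
    text = re.sub(r"<br\s*/?>", "\n", text)
    # Drop empty brackets left behind by text extraction.
    text = re.sub(r"\(\)|\{\}|\[\]", "", text)
    # Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # Assumed threshold: a character repeated 4 or more times becomes one.
    text = re.sub(r"(\S)\1{3,}", r"\1", text)
    return text.strip()


def mark_spaces(text: str) -> str:
    """Make spaces explicit so that subword tokenization cannot absorb them."""
    return text.replace(" ", SPACE_TOKEN)
```

Calling `mark_spaces(clean_text(raw))` on each raw line gives text in which every remaining space is an explicit token before the subword tokenizer sees it.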

Rows in all datasets are originally delimited by line breaks. Due to memory constraints, we limit the maximum sequence length for language model training to 416 subword tokens (tokenized by the SentencePiece [Kudo and Richardson, 2018] unigram model), or roughly 300 word tokens (tokenized by dictionary-based maximal matching [Phatthiyaphaibun et al., 2020]). To do so, we split the text with the sentence breaking model CRFCut [Lowphansirikul et al., 2020]. CRFCut is a conditional random fields (CRF) model trained on English-to-Thai translated texts of [Sornlertlamvanich et al., 1997] (23,125 sentences), TED transcripts (136,463 sentences; [Lowphansirikul et al., 2020]) and generated product reviews (217,482 sentences; [Lowphansirikul et al., 2020]). It uses English sentence boundaries as sentence boundary labels for the translated Thai texts. CRFCut has a sentence-boundary F1 score of 0.69 on [Sornlertlamvanich et al., 1997], 0.71 on TED transcripts, and 0.96 on generated product reviews. We keep only sentences that are 5 to 300 words long, so that they neither exceed the 416-subword maximum sequence length nor are too short for language modeling.
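The length filter at the end of this step is simple to express. The sketch below assumes a hypothetical `word_tokenize` callable standing in for the dictionary-based word tokenizer. Any function that maps a sentence to a list of words will do for illustration.

```python
from typing import Callable, Iterable, List

MIN_WORDS, MAX_WORDS = 5, 300  # bounds stated in the text


def filter_by_length(
    sentences: Iterable[str],
    word_tokenize: Callable[[str], List[str]],
) -> List[str]:
    """Keep sentences whose word count lies in [MIN_WORDS, MAX_WORDS]."""
    return [
        s for s in sentences
        if MIN_WORDS <= len(word_tokenize(s)) <= MAX_WORDS
    ]
```

For a quick check, `str.split` can serve as the tokenizer on space-delimited toy data, although it is not a valid word tokenizer for Thai.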

For the model trained on Assorted Thai Texts dataset, in the same manner as [Martin et al., 2019], we use SentencePiece [Kudo and Richardson, 2018] unigram language model [Kudo, 2018] to tokenize sentences of training data into subwords. The tokenizer has a vocabulary size of 25,000 subwords, trained on 15M sentences. To construct the training set for the tokenizer, we first take 2.5M randomly sampled sentences from pantip-large, 3.5M randomly sampled sentences from wisesight-large and all sentences of the remaining datasets, resulting in 20,961,306 total sentences. Out of those, we randomly sampled 15M sentences to train the tokenizer.
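The two-stage sampling described above (cap the two largest sources, pool everything, then subsample) can be sketched as follows. The function name, the `caps` dictionary and the seed are assumptions made for illustration. In the setting described here the caps would be 2.5M for pantip-large and 3.5M for wisesight-large, with a final size of 15M.

```python
import random
from typing import Dict, List


def build_tokenizer_corpus(
    datasets: Dict[str, List[str]],
    caps: Dict[str, int],
    final_size: int,
    seed: int = 0,
) -> List[str]:
    """Cap selected datasets by random sampling, pool, then subsample."""
    rng = random.Random(seed)
    pool: List[str] = []
    for name, sentences in datasets.items():
        cap = caps.get(name)
        if cap is not None and len(sentences) > cap:
            pool.extend(rng.sample(sentences, cap))  # capped source
        else:
            pool.extend(sentences)  # small sources are used in full
    return rng.sample(pool, min(final_size, len(pool)))
```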

For the models trained on the Wikipedia-only dataset, we use four different tokenizers to examine their effects on language modeling and downstream tasks. All four use the same training set of 944,782 sentences sampled from Thai Wikipedia Dump:

SentencePiece tokenizer; we train the SentencePiece [Kudo and Richardson, 2018] unigram language model [Kudo, 2018] using 944,782 sentences from Thai Wikipedia Dump, resulting in a tokenizer with vocab size of 24,000 subwords.

Word-level tokenizer; the word-level, dictionary-based tokenizer newmm [Phatthiyaphaibun et al., 2020] is used to create a tokenizer with vocab size of 97,982 words.

Syllable-level tokenizer; the syllable-level dictionary-based tokenizer syllable [Phatthiyaphaibun et al., 2020] is used to create a tokenizer with vocab size of 59,235 syllables.

SEFR tokenizer; Stacked Ensemble Filter and Refine tokenizer (engine=best) [Limkonchotiwat et al., 2020] based on probabilities from CNN-based deepcut [Kittinaradorn et al., 2019] with a vocab size of 92,177 words.

2.3 Train-Validation-Test Splits

After preprocessing and deduplication, we have a training set of 381,034,638 unique, mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [Phatthiyaphaibun et al., 2020], 8,680,485,067 subwords as tokenized by SentencePiece [Kudo and Richardson, 2018] tokenizer, and 53,035,823,287 characters.

We also randomly sampled 99,181 sentences (19.28MB) as validation set and 42,238,656 sentences (8GB) as test set. Both are preprocessed in the same manner as the training set.

From Thai Wikipedia Dump, we extract in a uniformly random manner 944,782 sentences for training set, 24,863 sentences for validation set and 24,862 sentences for test set.
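A uniformly random split of this kind amounts to shuffling once and slicing. The sketch below is a minimal illustration with hypothetical names and an arbitrary seed. For the Wikipedia-only dataset, the validation and test sizes would be 24,863 and 24,862, with the remaining 944,782 sentences used for training.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    sentences: Sequence[str], n_valid: int, n_test: int, seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle once, then slice into disjoint train/validation/test sets."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test
```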

2.4 Language Modeling

We use the transformer [Vaswani et al., 2017] architecture of BERT (Base) (12 layers, 768 hidden dimensions, 12 attention heads) [Devlin et al., 2018b]. Our setup is very similar to [Martin et al., 2019] replacing BERT’s WordPiece tokenizer with a SentencePiece tokenizer, with the exception of preprocessing rules applied before subword tokenization.

We train the model with masked language modeling. To circumvent the word boundary issues in Thai, we opted to perform this at the subword level instead of the whole-word level, even though the latter is reported to have better performance in English [Joshi et al., 2020]. In the same manner as BERT [Devlin et al., 2018b] and RoBERTa [Liu et al., 2019], for each sequence, we sample 15% of the tokens as prediction targets. Out of this 15%, 80% are replaced with a <mask> token, 10% are left unchanged and 10% are replaced with a random token. The objective is to predict the original tokens at the sampled positions using cross-entropy loss.
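The masking scheme can be illustrated with a short, framework-free sketch. It is not the training code: the function name is hypothetical, the mask token is written `<mask>`, each token is selected independently with probability 0.15 (an assumption; selecting an exact 15% of positions is an equally valid reading), and tokens are plain strings rather than vocabulary ids.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "<mask>"


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Positions whose label is `None` do not contribute to the cross-entropy loss.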

We pretrain $\text{RoBERTa}_{\textrm{BASE}}$ on both the Assorted Thai Texts dataset and the Wikipedia-only dataset. The size of the Wikipedia-only dataset is about 0.57 GB, which is small compared to the Assorted Thai Texts dataset. Therefore, we manually tune the hyperparameters used for $\text{RoBERTa}_{\textrm{BASE}}$ pretraining for each training set in order to keep the loss stable. The hyperparameters of the $\text{RoBERTa}_{\textrm{BASE}}$ architecture and model pretraining are listed in Table 2.

We name our pretrained language models according to their architectures, tokenizers and the datasets on which they are trained. The models can be found on HuggingFace (https://huggingface.co/models).

Downstream Tasks

We evaluate the downstream performance of our pretrained Thai $\text{RoBERTa}_{\textrm{BASE}}$ models on existing Thai sequence-classification and token-classification benchmark datasets.

We use the train-validation-test splits provided by each dataset as hosted on Huggingface Datasets (https://huggingface.co/datasets). When not all splits are available, namely for Wongnai Reviews and ThaiNER, we sample the respective splits in a uniformly random manner. The descriptive statistics of each dataset are as follows:

Wisesight Sentiment [Suriyawongkul et al., 2019] is a multi-class text classification dataset (sentiment analysis). The data are social media messages in Thailand collected from 2016 to early 2019. Each message is annotated as positive, neutral, negative, or question.

Wongnai Reviews [Wongnai.com, 2018] is a multi-class text classification dataset (rating classification). The data are restaurant reviews and their respective ratings from 1 (worst) to 5 (best) stars.

Generated Reviews (EN-TH) [Lowphansirikul et al., 2020] is a dataset that originally consists of product reviews generated in English by CTRL [Keskar et al., 2019]. It was translated to Thai as part of the scb-mt-en-th-2020 machine translation dataset. Translation was performed both by human annotators and by models. We use only the translated Thai texts as features to predict review stars from 1 (worst) to 5 (best).

Prachathai-67k is a multi-label text classification dataset (topic classification) based on news articles of Prachathai.com from August 24, 2004 to November 15, 2018, packaged by [Phatthiyaphaibun et al., 2020]. We perform topic classification on the headline of each article, which can carry anywhere from none to all of the following labels: politics, human rights, quality of life, international, social, environment, economics, culture, labor, national security, ict, and education.

3.1.2 Token Classification

ThaiNER [Phatthiyaphaibun, 2019] is a 6,456-sentence named entity recognition (NER) dataset created by expanding an unnamed, 2,258-sentence dataset by [Tirasaroj and Aroonmanakun, 2012]. The NER tags are annotated by humans in IOB format.

LST20 [Boonkwan et al., 2020] is a dataset with 5 layers of linguistic annotations: word boundaries, POS tagging, NER, clause boundaries, and sentence boundaries. NER tags are in IOBE format. We use the dataset for POS tagging and NER tasks.

3.2 Benchmarking Models

We provide benchmarks using traditional models (NBSVM for sequence classification and CRF for token classification), RNN-based models (ULMFit; only for sequence classification) and transformer-based models.

NBSVM [Wang and Manning, 2012]. We adopt the NBSVM implementation by Jeremy Howard (https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline) as our strong baseline for sequence classification, both multi-class and multi-label. The notable difference is that we substitute binarized n-gram features with tf-idf features (uni- and bi-grams; minimum document frequency of 3, maximum document frequency of 90%). We also apply the same cleaning rules as for the language model, except that we add repeated-character tokens <rep> and repeated-word tokens <wrep> instead of space tokens <_>.

We perform hyperparameter tuning over penalty types (L1 and L2) and the inverse of regularization strength (C=[1.0, 2.0, 3.0, 4.0]) and choose the models with the highest F1 scores (micro-averaged for multi-class and macro-averaged for multi-label classification). See Table 8. For multi-label classification, we search for the best set of thresholds (ranging from 0.01 to 0.99 with a step size of 0.01) that maximizes the macro-averaged F1 score on the validation set.
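The threshold search can be sketched as a per-label grid search. Because the macro-averaged F1 score is the mean of the per-label F1 scores, choosing the best threshold for each label independently also maximizes the macro average. The function names and the nested-list data layout below are assumptions made for illustration.

```python
from typing import List, Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for one binary label; 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_thresholds(
    y_true: Sequence[Sequence[int]], y_prob: Sequence[Sequence[float]]
) -> List[float]:
    """For each label (column), pick the threshold in 0.01..0.99 with the best F1."""
    grid = [k / 100 for k in range(1, 100)]
    thresholds = []
    for j in range(len(y_true[0])):
        truth = [row[j] for row in y_true]
        probs = [row[j] for row in y_prob]
        # max() keeps the first (lowest) threshold among ties.
        best = max(
            grid,
            key=lambda t: binary_f1(truth, [int(p >= t) for p in probs]),
        )
        thresholds.append(best)
    return thresholds
```

The thresholds are fitted on validation-set probabilities and then applied unchanged to the test set.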

thai2fit [Polpanumas and Phatthiyaphaibun, 2021] is an implementation of ULMFit language model finetuning for text classification [Howard and Ruder, 2018]. [Polpanumas and Phatthiyaphaibun, 2021] pretrained a language model with a vocab size of 60,005 words (tokenized by PyThaiNLP’s newmm) on Thai Wikipedia Dump. We finetune the language model on the training set of each dataset for 5 epochs. After that, we finetune for the sequence classification tasks using gradual unfreezing of the last one, two and three parameter groups with discriminative learning rates, for one epoch each. We then finetune all the weights of the model for 5 epochs. The checkpoints with the highest accuracy scores (lowest validation losses for multi-label classification) are chosen for evaluation on the test sets. See the thai2fit hyperparameter table. Lastly, for multi-label classification, we search for the best set of thresholds (ranging from 0.01 to 0.99 with a step size of 0.01) that maximizes the macro-averaged F1 score on the validation set.

CRF [Lafferty et al., 2001]. We use the CRFSuite implementation [Okazaki, 2007] of conditional random fields as a strong baseline for POS and NER tagging tasks. We generate the features by extracting unigram, bigram and trigram features within a sliding window of three timesteps before and after the current token (the beginnings and ends of sentences are padded with xxpad tokens). We tune L1 and L2 penalty combinations using 10,000 randomly sampled sentences for LST20 and the entire training set for ThaiNER. Using the hyperparameters with the best micro-averaged F1 score on the validation set, we train on the entire training sets and report performance on the test sets. See Table 10. We run each CRF model for 500 iterations.
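One way to read the feature description is sketched below: for each position, emit every unigram, bigram and trigram that lies entirely within three tokens on either side of the current token, after padding the sentence with xxpad. The exact feature templates used in the experiments are not given in the text, so the feature naming and the set of n-grams here are assumptions.

```python
from typing import Dict, List

PAD = "xxpad"  # padding token named in the text
WINDOW = 3     # timesteps before and after the current token


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """N-gram features (n = 1, 2, 3) inside the window around position i."""
    padded = [PAD] * WINDOW + tokens + [PAD] * WINDOW
    c = i + WINDOW  # index of the current token in the padded list
    feats: Dict[str, str] = {}
    for n in (1, 2, 3):
        # Start positions such that the n-gram stays inside the window.
        for start in range(c - WINDOW, c + WINDOW - n + 2):
            offset = start - c  # position relative to the current token
            feats[f"{n}gram[{offset}]"] = "|".join(padded[start:start + n])
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """One feature dictionary per token, the usual input shape for CRF toolkits."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Each token ends up with 7 unigram, 6 bigram and 5 trigram features under this reading.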

We use the same finetuning scheme for all transformer-based models, namely XLM-RoBERTa-base [Conneau et al., 2019], BERT-base-multilingual-cased [Devlin et al., 2018a], wangchanberta-base-wiki-tokenizer (spm, newmm, syllable, sefr), and wangchanberta-base-att-spm-uncased. For the sequence classification tasks, we preprocess each dataset with the rules described in Section 2.2. We then finetune each pretrained language model on the downstream task for 3 epochs. The criterion for selecting the best epoch is the validation micro-averaged F1 score for multi-class classification and the macro-averaged F1 score for multi-label classification. The batch size is set to 16. The learning rate is warmed up over the first 10% of steps to a value of 3e-5 and then linearly decayed to zero. We finetune models with FP16 mixed precision training. All models are optimized with Adam [Kingma and Ba, 2014] ($\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon$ = 1e-8, $L_{2}$ weight decay = 0.01) with bias correction. For the multi-label classification head, we search for the best set of thresholds (ranging from 0.01 to 0.99 with a step size of 0.01) that maximizes the macro-averaged F1 score on the validation set.

For the token classification tasks, we finetune each pretrained language model for 6 epochs. The criterion for selecting the best epoch is the validation loss. The batch size is set to 32. The learning rate is warmed up over the first 10% of steps to a value of 3e-5 and then linearly decayed to zero. We finetune models with FP16 mixed precision training. All models are optimized with Adam with the same parameters as in the sequence classification tasks.
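The learning rate schedule used in both finetuning setups (linear warmup over the first 10% of steps to 3e-5, then linear decay to zero) can be written as a small function of the step index. The function name and the rounding of the warmup length are assumptions made for illustration.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-5,
    warmup_frac: float = 0.1,
) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

With 1,000 total steps, the rate rises from 0 to 3e-5 over the first 100 steps and falls back to 0 at step 1,000.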

Results

The following table shows the performance of $\text{RoBERTa}_{\textrm{BASE}}$ trained on the Wikipedia-only dataset. There are four variations of tokenization: subword-level with SentencePiece [Kudo and Richardson, 2018], word-level and syllable-level with the PyThaiNLP [Phatthiyaphaibun et al., 2020] tokenizers (denoted as newmm and syllable respectively), and the stacked-ensemble, word-level tokenizer sefr [Limkonchotiwat et al., 2020].

For the $\text{RoBERTa}_{\textrm{BASE}}$ trained on the Assorted Thai Texts dataset, we only trained with subword tokens built with SentencePiece [Kudo and Richardson, 2018] due to limited computational resources.

4.2 Downstream Tasks

We choose the models to evaluate on the test sets based on their performance on the validation sets. For multi-class sequence classification and token classification, we optimize our models for the highest micro-averaged F1 score. For multi-label sequence classification, we optimize for the highest macro-averaged F1 score, as it is less affected by class imbalance. Moreover, for multi-label sequence classification, we also find the best probability threshold for each label based on the validation set. We report the performance of these optimized models on the test sets.
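The difference between the two averages is easiest to see on a toy example with one dominant class. The sketch below is a plain-Python illustration with hypothetical function names: micro-averaging pools the counts over all classes, while macro-averaging gives every class the same weight.

```python
from typing import Sequence, Tuple


def class_counts(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[int, int, int]:
    """True positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Pool counts over all classes, then compute a single F1."""
    totals = [0, 0, 0]
    for label in labels:
        for k, value in enumerate(class_counts(y_true, y_pred, label)):
            totals[k] += value
    return f1_from_counts(*totals)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Compute F1 per class, then take the unweighted mean."""
    scores = [f1_from_counts(*class_counts(y_true, y_pred, l)) for l in labels]
    return sum(scores) / len(scores)
```

For nine "neutral" examples and one "negative" example, a model that always predicts "neutral" scores 0.9 micro-averaged F1 but only about 0.47 macro-averaged F1, because the missed minority class counts as much as the majority class.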

For sequence classification tasks, our model trained on the Assorted Thai Texts dataset outperforms both the strong baselines and the other transformer-based architectures on all downstream tasks except Generated Reviews (EN-TH). This may be attributed to the fact that the dataset is translated from texts generated in English, so the multi-lingual pretraining of XLMR gives it an advantage. See Table 6.

For token classification tasks, our model trained on the Assorted Thai Texts dataset achieves the highest micro-averaged F1 score in all tasks except POS tagging on the ThaiNER dataset. This could be attributed to the fact that the POS tags in ThaiNER are machine-generated and thus better suited to the baseline CRF model. See Table 7.

Discussions and Future Works

Consistent with previous works on language modeling, we found that training on large datasets such as our Assorted Thai Texts dataset yields better downstream performance. The only case in which a multi-lingual model (XLMR) outperforms our largest mono-lingual model is when the training data include multi-lingual elements, namely the English-to-Thai translated texts of Generated Reviews (EN-TH). In our experiments on the Wikipedia-only dataset, we did not find any notable difference among the four tokenizers in downstream performance for sequence classification or token classification tasks.

Another area we will explore in the future is the inherent biases in our relatively large language models. Previous works including [Sheng et al., 2019], [Nadeem et al., 2020] and [Nangia et al., 2020] have detected social biases within large language models trained on English. Our next step in this direction is to create similar bias-measuring datasets in Thai contexts to detect the biases in our language models.

We pretrain our language models on publicly available datasets. Two main concerns that have been raised about similar models are copyright and privacy. All datasets used to train our models are based on publicly available data. Publicly available social media data are packaged and provided to us by Wisesight (https://wisesight.com; wisesight-large) and Chaos Theory (https://www.facebook.com/ChaosTheoryCompany/; pantip-large). Unless specified otherwise in the distribution of the datasets, all rights belong to the content creators. We provide the weights of our pretrained language models under CC-BY-SA 4.0. Our models are trained as feature extractors for downstream tasks, not for generative tasks. Reproduction of training data can happen [Carlini et al., 2020], albeit with a much lower chance than for language models trained specifically for generative tasks.

Acknowledgements

We thank Wisesight, Chaos Theory and Pantip.com for providing what has become, to the best of our knowledge, the largest and most diverse high-quality training data in Thai for language modeling.

References

Appendix