scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, Sarana Nutanong

Introduction

Machine translation (MT) techniques have advanced rapidly in the last decade with many practical applications, especially for high-resource language pairs such as English-German, English-French [Ott et al., 2018] and Chinese-English [Hassan et al., 2018]. While the translation quality of these systems is close to that of average bilingual human translators [Wu et al., 2016], they require a relatively large number of parallel segments to train and benchmark on. Examples of such parallel datasets include the News Commentary Parallel Corpus (http://www.casmacat.eu/corpus/news-commentary.html), the UN Parallel Corpus [Ziemski et al., 2016], Europarl [Koehn, 2005] and the ParaCrawl Corpus [Esplà et al., 2019]. However, English-Thai is a low-resource language pair. An insufficient number of training examples has been found to directly degrade translation quality [Koehn and Knowles, 2017], as current state-of-the-art models [Bahdanau et al., 2014, Gehring et al., 2017, Vaswani et al., 2017] require a substantial amount of training data to perform well. Therefore, we curate this dataset of approximately 1M English-Thai sentence pairs to address the challenge of both quantity and diversity of English-Thai machine translation data.

The difficulties in constructing an English-Thai machine translation dataset include the cost of acquiring high-quality translated segment pairs, the complexity of segment alignment due to the ambiguity of Thai sentence boundaries, and the limited number of web pages and documents with English-Thai bilingual content. Currently, the largest source of English-Thai segment pairs is the Open Parallel Corpus (OPUS) [Tiedemann, 2012]. It comprises parallel segments for many language pairs including English-Thai. However, the contexts of those segment pairs are limited to subtitles (OpenSubtitles [Lison and Tiedemann, 2016], QED [Abdelali et al., 2014]), religious texts (Bible [Christodouloupoulos and Steedman, 2015], JW300 [Agić and Vulić, 2019], Tanzil (http://opus.nlpl.eu/Tanzil.php)), and open-source software documentation (Ubuntu (http://opus.nlpl.eu/Ubuntu.php), KDE4 (http://opus.nlpl.eu/KDE4.php), GNOME (http://opus.nlpl.eu/GNOME.php)).

In order to build an English-Thai machine translation dataset with a sufficient number of training examples from a variety of domains, we curate a total of 1,001,752 segment pairs from web-crawled data, government documents, model-generated texts and publicly available datasets for NLP tasks in English. For each data source, the approaches used to obtain and filter English-Thai segment pairs are described in detail. Using OPUS and our dataset, we train machine translation models based on the Transformer [Vaswani et al., 2017] and compare the model performance with the Google and AI-for-Thai translation services. We use Thai-English IWSLT 2015 [Cettolo et al., 2015] as a benchmark dataset and BLEU [Papineni et al., 2002] as the evaluation metric. BLEU is widely used to evaluate translation quality by comparing translated segments with ground-truth segments. A higher BLEU score indicates better correspondence between the results and the ground-truth translation. Our models are comparable to Google Translation API (as of May 2020) for Thai $\rightarrow$ English and outperform it in both directions when OPUS is included in the training data.

The rest of the paper is organized as follows. In Section 2, we first describe the sources from which segment pairs are retrieved for our dataset. After that, we detail the methods to obtain segment pairs, verify translation quality, and filter out noisy segment pairs. In Section 3, we present the statistics of our resulting dataset, namely the number of segments, the number of tokens, and the distribution of segment pair similarity scores. Section 4 presents the results of our experiments training machine translation models on OPUS and our dataset, and evaluating the performance on IWSLT 2015, OPUS and our dataset. In Section 5, we discuss the challenges in building the English-Thai machine translation dataset and explore opportunities to further improve the methodology to obtain a larger dataset of higher quality. Our work is then concluded in Section 6.

Finally, our English-Thai machine translation dataset (https://github.com/vistec-AI/dataset-releases/releases/tag/scb-mt-en-th-2020_v1.0) and pre-trained machine translation models (https://github.com/vistec-AI/model-releases/releases/tag/SCB_1M+TBASE_v1.0) are publicly available on our GitHub repositories. We also present additional datasets for other Thai NLP tasks such as review classification and sentence segmentation, which are created as a result of building the machine translation dataset, in Appendix 1.

Methodology

We collect and generate over one million English-Thai segment pairs from five data sources and preprocess them for English-Thai and Thai-English machine translation tasks. Since there is no formal definition of sentence boundaries in Thai [Aroonmanakun et al., 2007], we use English sentence boundaries as segment boundaries for parallel Thai segments. In some cases where the sentence boundaries are not clear even in English (for instance, product descriptions), we do not perform sentence segmentation and treat the entire texts as segments.

We use English segments from the following public datasets for natural language processing (NLP) and natural language understanding (NLU) tasks as source segments. These datasets are translated to Thai by professional and crowdsourced translators.

Taskmaster-1 [Byrne et al., 2019] is a dataset of 13,215 task-based dialogs in 6 domains: ordering pizza, making auto repair appointments, scheduling rides, ordering movie tickets, ordering coffee drinks and making restaurant reservations. The dialogs were created in both written and spoken English.

The National University of Singapore (NUS) SMS Corpus [Chen and Kan, 2011] is a collection of 67,093 SMS messages written by Singaporeans, mostly NUS students. The style of writing is informal and contains the so-called Singlish dialect of English.

Mozilla Common Voice (https://voice.mozilla.org/en) is a crowdsourced collection of 61,584 voice recordings in various languages. We use the English transcriptions as the source segments. The dataset contains segments in both written and spoken English.

Microsoft Research Paraphrase Identification Corpus [Dolan and Brockett, 2005] contains 5,801 English segment pairs from news sources. Each segment pair has a binary label indicating whether the two segments are paraphrases of each other (that is, semantically equivalent) or not.

1.2 Generated Product Reviews

We generate 372,534 product reviews in English using Conditional Transformer Language Model (CTRL) [Keskar et al., 2019] and use them as the source segments. The conditional transformer language model was trained on multiple domains such as Amazon reviews, Wikipedia, Project Gutenberg and Reddit. CTRL can generate texts with content and style specified by the control codes. For our dataset, we specified the following conditions:

The content generated must be in the product review domain.

The generated reviews must represent sentiments ranging from mostly dissatisfied to mostly satisfied (1-5 scale).

The length of each generated review is limited to less than 150 tokens. Incomplete segments as a result of the generation process are filtered out.

1.3 Wikipedia

Wikipedia consists of articles about various topics such as biographies, events, organizations and places. Articles are written and edited by crowdsourced contributors. At the time of writing, we obtain 6,047,512 articles in English Wikipedia and 136,452 articles in Thai Wikipedia. We hypothesize that there are a number of articles among them that can be treated as parallel documents.

1.4 Web Crawling

Large machine translation datasets such as Paracrawl [Esplà et al., 2019] are created by scraping websites with parallel texts. We gather domains of possible parallel websites from three sources:

Paracrawl: Out of 208,349 domains from 23 language pairs of Paracrawl, we found that 1,047 domains have both English and Thai content.

Top 500 Thai Websites according to Alexa.com [ale,]: We hypothesize that websites with high traffic volume are more likely to have pages both in Thai and English.

Other specific bilingual websites such as Asia Pacific Defense Forum, Ministry of Foreign Affairs, and websites of various embassies in Thailand that provide a sizeable amount of English-Thai content.

1.5 Thai Government Documents

Official government documents in Thai and English in PDF format are obtained from their respective organizations. The documents include but are not limited to:

The Constitution of the Kingdom of Thailand 2017 (B.E. 2560)

Thailand’s Labour Relations Act 1975 (B.E. 2518)

Thailand’s First - Twelfth National Economic and Social Development Plan

Thailand 20-Year Energy Efficiency Development Plan 2011-2030 (B.E. 2554 - 2573)

Alternative Energy Development Plan 2015-2036 (B.E. 2558 - 2579)

Thailand Power Development Plan 2015-2036 (B.E. 2558 - 2579)

Sustainable Future City Initiative Guideline for SFCI Cities

2 Translation of English Segments

One way to create segment pairs is to translate existing English segments. We use three approaches to obtain translations: professional translation, crowdsourced translation and Google Translation API.

For professional translation, we employ 25 professional translators to translate 13,215 conversations of the Taskmaster-1 dataset and 43,374 generated product reviews from English to Thai. For crowdsourced translation, we use a crowdsourcing platform to disseminate English-to-Thai translation tasks for NUS SMS, Mozilla Common Voice, Microsoft Research Paraphrase Identification, and 21,590 generated product reviews.

The aforementioned approaches are relatively expensive and time-consuming. Therefore, we opt for Google Translation API to translate 307,570 generated English product reviews to Thai. After that, we employ annotators to assess the quality of each translated product review. We ask the annotators to classify whether each product review translation should be accepted or rejected. The criteria are fluency and adequacy of the translation. One product review may have several segments, but we only include segments from product reviews that are labeled as acceptable.

3 Alignment of Existing English-Thai Segments

Apart from translation from English to Thai, we also perform segment alignment on existing English-Thai parallel documents.

We use NLTK [Loper and Bird, 2002] for English sentence segmentation. For Thai texts, we train a conditional random field model to predict sentence boundary tokens based on the following datasets:

Generated Product Reviews: 67,387 reviews with a total of 259,867 segments, translated by Google Translate API and annotated by humans, are used to train the model since the sentence boundaries are known from the English texts

TED Transcripts: We obtain transcripts in Thai of TED talks containing 136,463 utterances. We treat each utterance as a segment.

ORCHID Corpus: The corpus was originally meant for POS tagging but it contains 23,125 marked segment boundaries and is used as a benchmark for Thai sentence segmentation

We tokenize them into Thai words using newmm tokenizer of pyThaiNLP [Phatthiyaphaibun et al., 2020], then create unigram, bigram and trigram features with a sliding window of 2 steps before and after the token to predict if it is a sentence boundary or not. We also mark words that are often found to be sentence starters or sentence enders and apply the same feature extraction.
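The n-gram window features described above can be sketched as follows. This is a minimal illustration, not the CRFCut implementation: the feature names, the `<pad>` sentinel and the function names are assumptions, and the sentence starter and ender markers are omitted. The resulting dictionaries are in the per-token format that common CRF toolkits accept.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Unigram, bigram and trigram features around tokens[i].

    Every n-gram that fits inside a window of `window` tokens before and
    after the current token is emitted, keyed by its length and offset.
    """
    # Pad both ends so that positions outside the text map to a sentinel.
    pad = ["<pad>"] * window
    padded = pad + tokens + pad
    center = i + window
    feats: Dict[str, str] = {}
    for n in (1, 2, 3):
        # `start` is the offset of the first token of the n-gram.
        for start in range(-window, window - n + 2):
            gram = padded[center + start : center + start + n]
            feats[f"{n}gram[{start}]"] = "|".join(gram)
    return feats


def sequence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token of one tokenized text."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

With a window of 2, each token receives 5 unigram, 4 bigram and 3 trigram features; the label for each token would be whether it ends a sentence.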

Our baseline model, CRFCut, achieves the following performance (training code is available at https://github.com/vistec-AI/crfcut).

3.2 Segment Extraction

Once we have a means to segment all texts, we proceed to extract all segments from each data source.

Paracrawl Corpus Release v5.0 (September 2019)

First, we aggregate the TMX files from 23 language pairs. They list 208,349 domains and approximately 12.8M URLs in total. For each URL containing a non-English ISO 639-1, 639-2/T or 639-2/B language code (e.g. /de/, /ger/, /es/, /spa/), we substitute a Thai language code (e.g. /th/, /tha/) and send an HTTP request to verify whether the modified URL responds with HTTP status 200.
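A minimal sketch of this language-code substitution is given below. The list of non-English codes is a small illustrative subset, and the function names are hypothetical; the actual crawler may handle more URL patterns.

```python
import re
import urllib.request
from typing import List

# Illustrative subset only; the full list of language codes is an assumption.
NON_ENGLISH_CODES = ["de", "ger", "deu", "es", "spa", "fr", "fre", "fra"]
THAI_CODES = ["th", "tha"]


def thai_url_candidates(url: str) -> List[str]:
    """Return candidate Thai URLs obtained by swapping a language-code token."""
    candidates = []
    for code in NON_ENGLISH_CODES:
        # Match the code only when it is a whole path token.
        pattern = re.compile(rf"/{code}(?=/|$)")
        if pattern.search(url):
            for thai in THAI_CODES:
                candidates.append(pattern.sub(f"/{thai}", url, count=1))
    return candidates


def responds_ok(url: str, timeout: float = 10.0) -> bool:
    """True if an HTTP request to `url` returns status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Covers HTTP errors, network failures, timeouts and malformed URLs.
        return False
```

A domain would then be kept when at least one candidate URL passes `responds_ok`.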

With this approach, we obtain a total of 1,047 domains that contain content in both English and Thai. We use the web crawling module from bitextor [Espl and Transducens, 2009] to crawl the websites and perform language detection to filter out pages whose content is in neither English nor Thai. We then perform document alignment on the crawled data of each domain name based on the edit distance between the tokens of URLs. A token in this case is a group of characters separated by /, excluding the protocol (http:, https: and so on). URL pairs with an edit distance of one token are paired up, for instance, two URLs that differ only in the language code token. We successfully aligned 23,528 document pairs.
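The URL pairing rule can be illustrated with a token-level Levenshtein distance. The sketch below is an assumption about how such a comparison could be written (the function names are hypothetical), not the code used to build the dataset.

```python
from typing import List
from urllib.parse import urlsplit


def url_tokens(url: str) -> List[str]:
    """Split a URL into tokens separated by '/', ignoring the protocol."""
    parts = urlsplit(url)
    return [t for t in (parts.netloc + parts.path).split("/") if t]


def token_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance where the unit of editing is a whole token."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        cur = [i]
        for j, tok_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                      # delete tok_a
                cur[j - 1] + 1,                   # insert tok_b
                prev[j - 1] + (tok_a != tok_b),   # substitute or match
            ))
        prev = cur
    return prev[-1]


def is_candidate_pair(url_a: str, url_b: str) -> bool:
    """Two URLs are paired when they differ by exactly one token."""
    return token_edit_distance(url_tokens(url_a), url_tokens(url_b)) == 1
```

For example, https://example.com/en/news/1 and https://example.com/th/news/1 differ only in the language code token and are therefore paired.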

We obtain the list of the top 500 websites in Thailand from the ranking website Alexa.com [ale,]. We retrieve the sitemaps in XML format from those websites and read all the URLs listed. We then crawl bilingual web pages based on these URLs with a web crawling script. Similar to what we do with Paracrawl, if a URL contains an English or Thai language code, we substitute the language code with /en/ or /th/ and verify whether the document pair contains content in both English and Thai. In total, we crawled 246,868 aligned page pairs with content in both English and Thai.

To create parallel documents from Wikipedia pages, we align English and Thai articles based on their titles by transforming the titles into dense vectors using multilingual universal sentence encoder [Yang et al., 2019] and computing their cosine similarity. Out of all English and Thai articles, we find 13,853 articles that we consider parallel documents.

We extract segments from aligned government documents in PDF format with Apache Tika (https://tika.apache.org/). Character errors in extracted Thai texts are fixed with handcrafted rules (see https://github.com/vistec-AI/pdf2parallel).

Thai Translation of Generated Product Reviews

We obtained Thai translations of 43,374 generated product reviews by professional translation. Since the translation is at document level, we need to extract segments from the source reviews and translated reviews in order to obtain the alignment at segment level.

3.3 Segment Alignment

For each pair of aligned documents, we have two approaches to aligning segments. The first approach is applicable to documents crawled from the web. We segment the content in the documents by HTML tags (e.g. <p>, <li>, <h>). All content within a tag is treated as one segment. We then choose only document pairs that have the same number of equivalent tags and align the segments in order. The downside of this approach is that we might end up with multiple segments per tag.
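The tag-based approach can be sketched with Python's standard HTML parser. The set of tags and the requirement that both documents produce the same tag sequence are assumptions made for this illustration (a stricter reading of "the same number of equivalent tags"); the names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class TagSegmenter(HTMLParser):
    """Collect the text inside selected tags as (tag, text) segments."""

    TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.segments: List[Tuple[str, str]] = []
        self._tag: Optional[str] = None
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._buf).split())
            if text:
                self.segments.append((tag, text))
            self._tag = None


def extract_segments(html_text: str) -> List[Tuple[str, str]]:
    parser = TagSegmenter()
    parser.feed(html_text)
    parser.close()
    return parser.segments


def align_by_tags(html_en: str, html_th: str) -> Optional[List[Tuple[str, str]]]:
    """Align segments in order, or return None if the tag sequences differ."""
    en, th = extract_segments(html_en), extract_segments(html_th)
    if [tag for tag, _ in en] != [tag for tag, _ in th]:
        return None
    return [(e, t) for (_, e), (_, t) in zip(en, th)]
```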

The second approach is to use the sentence segmenter from the previous section to segment Thai texts and the NLTK sentence segmenter [Loper and Bird, 2002] to segment English texts, then align them based on semantic similarity. We found that after sentence segmentation there are more Thai segments than their English counterparts. In order to correctly align the segments, multiple segments in Thai have to be aligned with one segment in English in a many-to-one manner. We align each English segment with a concatenation of one to three consecutive Thai segments. To extract the semantic features, we use multilingual universal sentence encoder [Yang et al., 2019], trained on 13 languages including English and Thai, to transform each segment into a 512-dimension dense vector. After that, for each segment pair, we compute the cosine similarity of their respective vectors. One English segment can therefore have up to three candidate alignments, with one, two or three concatenated consecutive Thai segments. For each English segment, we select the candidate that has the highest cosine similarity score.
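The selection among the one-, two- and three-segment candidates can be sketched as below. The `encode` argument stands in for the sentence encoder and is passed in as a plain function; the starting position `start`, the use of a space when concatenating Thai segments, and the function names are assumptions of this sketch.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_thai_span(
    en_segment: str,
    th_segments: List[str],
    start: int,
    encode: Callable[[str], Sequence[float]],
    max_span: int = 3,
) -> Tuple[int, float]:
    """Return (number of Thai segments, score) of the best candidate.

    Candidates are concatenations of 1..max_span consecutive Thai
    segments beginning at index `start`.
    """
    en_vec = encode(en_segment)
    best = (0, -1.0)
    for n in range(1, max_span + 1):
        if start + n > len(th_segments):
            break
        candidate = " ".join(th_segments[start : start + n])
        score = cosine(en_vec, encode(candidate))
        if score > best[1]:
            best = (n, score)
    return best
```

In practice `encode` would wrap the multilingual universal sentence encoder and return a 512-dimension vector.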

4 Preprocessing for Machine Translation

We apply rule-based text cleaning to all texts obtained. After that, we filter out segments that are incorrectly aligned using handcrafted rules and multilingual universal sentence encoder [Yang et al., 2019].

We perform text cleaning on each sub-dataset with rules including NFKC Unicode text normalization, replacing HTML entities and number codes (e.g. &quot; and &#34;) with the corresponding ASCII characters, removing redundant spaces, and standardizing quote characters. Note that emojis and emoticons are not filtered out from the texts.
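A minimal sketch of these cleaning rules using only the Python standard library follows. The exact set of quote characters that are standardized and the order of the steps are assumptions; the released preprocessing code may differ.

```python
import html
import re
import unicodedata

# Curly quotes mapped to their ASCII counterparts (illustrative subset).
_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def clean_text(text: str) -> str:
    """Apply NFKC normalization, entity decoding, quote and space cleanup."""
    text = unicodedata.normalize("NFKC", text)
    # Decodes both named entities (&quot;) and numeric codes (&#34;).
    text = html.unescape(text)
    text = text.translate(_QUOTES)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

Emojis pass through unchanged because none of the steps targets them.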

4.2 Segment Pair Filtering

Since we obtain our segment pairs from different sources and approaches with varying degrees of quality, we have to filter out segment pairs that are not parallel to each other, using handcrafted rules and text similarity based on multilingual universal sentence encoder. The source code and thresholds used for the preprocessing can be found at: https://github.com/vistec-AI/thai2nmt_preprocess

For each dataset, we define a set of thresholds for the following handcrafted rules to filter out low-quality segment pairs:

Percentage of English or Thai characters in each English or Thai segment; for instance, Thai segments with lower percentage of Thai characters are most likely not actually Thai segments but segments from other languages that have been mistakenly crawled

Minimum and maximum number of word tokens for Thai and English segments. We use the newmm tokenizer from pyThaiNLP [Phatthiyaphaibun et al., 2020] to tokenize Thai words, and NLTK [Loper and Bird, 2002] to tokenize English words. Spaces are excluded from the token counts.

Ratio of word tokens between English and Thai segments; for example, a pair of segments with 100 tokens for English and 5 tokens for Thai will be filtered out from the resulting dataset.
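The three handcrafted rules above can be sketched as a single predicate. All threshold values below are hypothetical defaults chosen for illustration (the actual per-dataset thresholds are in the preprocessing repository), and the function takes pre-tokenized input so that any tokenizer can be used.

```python
import re
from typing import List

THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")  # Thai Unicode block
ENGLISH_CHAR = re.compile(r"[A-Za-z]")


def char_ratio(text: str, pattern: "re.Pattern") -> float:
    """Fraction of non-space characters that match `pattern`."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if pattern.match(c)) / len(chars)


def passes_rules(
    en_text: str,
    th_text: str,
    en_tokens: List[str],
    th_tokens: List[str],
    min_char_ratio: float = 0.5,    # hypothetical threshold
    min_tokens: int = 1,            # hypothetical threshold
    max_tokens: int = 500,          # hypothetical threshold
    max_token_ratio: float = 5.0,   # hypothetical threshold
) -> bool:
    # Rule 1: each segment must mostly consist of its own script.
    if char_ratio(en_text, ENGLISH_CHAR) < min_char_ratio:
        return False
    if char_ratio(th_text, THAI_CHAR) < min_char_ratio:
        return False
    # Rule 2: token counts must fall within the allowed range.
    for tokens in (en_tokens, th_tokens):
        if not (min_tokens <= len(tokens) <= max_tokens):
            return False
    # Rule 3: the two sides must have comparable lengths.
    longer = max(len(en_tokens), len(th_tokens))
    shorter = min(len(en_tokens), len(th_tokens))
    if shorter == 0:
        return False
    return longer / shorter <= max_token_ratio
```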

We also remove all duplicated segment pairs both by exact match and by text similarity based on multilingual universal sentence encoder.

Text Similarity based on Multilingual Universal Sentence Encoder

We transform all segments into 512-dimension dense vectors using multilingual universal sentence encoder, trained on 13 languages including English and Thai [Yang et al., 2019]. We then calculate the cosine similarity between the English and Thai segments of each segment pair. The rationale is that segments that are translations of each other should be semantically similar and thus have a high cosine similarity score.

As noted in the segment alignment step, there are more Thai segments than English segments after sentence segmentation, so multiple Thai segments have to be aligned with one English segment (many-to-one). Thus, we compute cosine similarity between each English segment and the concatenated Thai segments.

We use a different cosine similarity threshold for segments from each domain. For example, texts retrieved from web crawling have a relatively higher threshold of 0.7 as we see higher rate of misalignment, whereas the segment pairs from Thai government documents have the threshold of 0.5 as they follow set patterns and are easier to align.
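The domain-specific thresholds can be expressed as a simple lookup. Only the two values quoted above come from the text; the domain keys and the use of a greater-than-or-equal comparison are assumptions of this sketch.

```python
from typing import Dict

# Thresholds quoted in the text; the key names are hypothetical.
SIMILARITY_THRESHOLDS: Dict[str, float] = {
    "web_crawl": 0.7,
    "government_documents": 0.5,
}


def keep_pair(similarity: float, domain: str,
              thresholds: Dict[str, float] = SIMILARITY_THRESHOLDS) -> bool:
    """Keep a segment pair if its cosine similarity meets the domain threshold."""
    return similarity >= thresholds[domain]
```

A pair with similarity 0.6 would thus be kept if it came from government documents but dropped if it came from web crawling.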

Resulting Datasets

We collected segment pairs from 12 sources and performed the text processing procedures described in Methodology.

Tables 2 and 3 present the statistics of the resulting datasets after text processing. The total number of segment pairs is 1,001,752. We tokenize Thai segments with pyThaiNLP’s newmm dictionary-based tokenizer, excluding space tokens, and English segments with the Moses tokenizer.

Table 4 presents the distribution of segment similarity score for each sub-dataset. Examples of segment pairs and their similarity score are shown in Appendix 3.

Experiments

We use the preprocessed and filtered segment pairs, 1,001,752 pairs in total, for the experiments. The ratio for training/validation/test sets is 80/10/10. The validation set and test set are sampled in a stratified manner with respect to their sources. We also ensure that there are no duplicate segments within the same language shared between validation and test sets.
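A stratified 80/10/10 split by source can be sketched as follows. The record layout (a dictionary with a `source` key), the fixed seed and the function name are assumptions, and the sketch does not perform the additional check for duplicates shared between validation and test sets.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Dict[str, str]


def stratified_split(
    pairs: List[Pair],
    seed: int = 0,
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split pairs into train/validation/test, keeping source proportions."""
    by_source: Dict[str, List[Pair]] = defaultdict(list)
    for pair in pairs:
        by_source[pair["source"]].append(pair)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for source in sorted(by_source):
        items = by_source[source][:]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_valid = int(len(items) * ratios[1])
        train += items[:n_train]
        valid += items[n_train : n_train + n_valid]
        # The remainder goes to the test set.
        test += items[n_train + n_valid :]
    return train, valid, test
```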

Additionally, we use approximately 5M parallel English-Thai segments from OPUS [Tiedemann, 2012], an open source parallel corpus. Out of 9 English-Thai parallel datasets currently listed in OPUS, we use the following 6 datasets: OpenSubtitles [Lison and Tiedemann, 2016], Tatoeba (tatoeba.org), Tanzil (tanzil.net), QED [Abdelali et al., 2014], Ubuntu and GNOME. The total number of segment pairs is 3,715,179. Then, we perform hand-crafted text cleaning as defined in Section 2.4.1 and apply segment filtering rules, including setting the Thai/English character ratio limit to 0.1, limiting the number of tokens to 500 for each segment, removing segments meant for English translation that contain Thai characters, and removing duplicated segment pairs. The resulting datasets contain 3,318,153 segment pairs in total. The ratio for training/validation/test sets is 80/10/10.

2 Models & Architectures

We use the Transformer [Vaswani et al., 2017], a supervised neural machine translation model, implemented in the Fairseq toolkit [Ott et al., 2019] as our NMT models in both the English $\rightarrow$ Thai and Thai $\rightarrow$ English directions. We train Transformer models with 6 encoder and 6 decoder blocks, 512 embedding dimensions, and 2,048 feed forward hidden units. The dropout rate is set to 0.1 only for the encoder and decoder input layers. The embeddings of the decoder input and output are shared. The maximum number of tokens per mini-batch is 9,750. The optimizer is Adam with an initial learning rate of 1e-7 and a weight decay rate of 0.0. The learning rate follows an inverse square root schedule with warmup for the first 4,000 updates. Label smoothing of 0.1 is applied during training. The criterion for selecting the best model checkpoint is the label-smoothed cross entropy loss.
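The learning rate schedule can be sketched as a small function, assuming the schedule meant here is the inverse square root schedule with linear warmup, as provided by Fairseq's `inverse_sqrt` scheduler. The warmup length and initial rate come from the text; `peak_lr` is a hypothetical parameter, since the peak learning rate is not stated here.

```python
def inverse_sqrt_lr(
    step: int,
    peak_lr: float,                # hypothetical: not given in the text
    warmup_updates: int = 4000,
    warmup_init_lr: float = 1e-7,
) -> float:
    """Learning rate at a given update under inverse-sqrt decay with warmup."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_updates ** 0.5) * (step ** -0.5)
```

Under this schedule the rate reaches `peak_lr` at update 4,000 and halves every time the update count quadruples.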

There are 3 types of tokens used in the experiment: word-level tokens produced by pyThaiNLP’s dictionary-based tokenizer (newmm) for Thai, word-level tokens produced by the Moses tokenizer for English (moses), and subword-level tokens produced by SentencePiece [Kudo and Richardson, 2018] trained on the training set for both English and Thai (spm). The translation directions for the MT models are both th $\rightarrow$ en and en $\rightarrow$ th. The token types for each direction are word $\rightarrow$ word, word $\rightarrow$ subword, subword $\rightarrow$ word, and subword $\rightarrow$ subword (joined dictionary).

In addition, for the word-level tokens where Thai is the target language, space tokens are included during the word tokenization process with pyThaiNLP. When training transformer base and large, the maximum number of tokens for each batch is set to 9,750 and 6,750 respectively. The number of epochs for transformer base and large is set to 150 and 75 respectively. All the models in this experiment are trained on NVIDIA V100 GPU with mixed-precision training (fp16) and gradient accumulation for 16 steps. The source code used for the experiments can be found at: https://github.com/vistec-AI/thai2nmt

3 Evaluation Methods

SacreBLEU [Post, 2018] is used to evaluate translation quality in both directions. For th $\rightarrow$ en translation, word-level outputs are detokenized with the Moses detokenizer, and subword outputs for both Thai and English are detokenized with SentencePiece [Kudo and Richardson, 2018]. The version strings used for computing the case-sensitive and case-insensitive BLEU scores are BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.2.10 and BLEU+case.lc+numrefs.1+smooth.exp+tok.13a+version.1.2.12 respectively.

For the en $\rightarrow$ th translation, the word-level outputs are detokenized by joining all the output tokens including space tokens as specified when preparing word-level tokens. The detokenized texts are tokenized again by the pyThaiNLP word tokenizer. We then compute BLEU score with the tokenized texts.

For model decoding, the model checkpoint selected is the epoch with minimum label-smoothed cross entropy loss. The beam width used is 4.

4 Experiment Results

We report the evaluation results on the test set of our dataset, denoted as SCB_1M, and parallel English-Thai segments from OPUS, denoted as MT_OPUS. The total numbers of segment pairs in the SCB_1M and MT_OPUS test sets are 100,177 and 297,874 respectively.

We train models on each training set and evaluate them on the test sets from both sources.

4.2 Thai-English IWSLT 2015

Thai-English IWSLT 2015 evaluation dataset [Cettolo et al., 2015] contains parallel transcription of TED talks where the source language is Thai and target language is English. The number of segment pairs is 4,242 from 46 parallel TED talks transcriptions. We used IWSLT 2015 test sets from 4 years (2010-2013).

In this evaluation campaign, the segments in Thai were manually tokenized according to the BEST 2010 guideline. However, in order to mimic actual written Thai segments, we map the pre-tokenized segments to the untokenized segments from the Thai-English TED talks transcriptions that we have crawled. Note that we pre-processed the original segments by removing parenthetical content in English, as this evaluation campaign also applied this rule before segmenting Thai words.

In Table 6, we compare the performance of our baseline models trained on SCB_1M, MT_OPUS, and both. We report detokenized SacreBLEU (case-sensitive) for the th $\rightarrow$ en direction and BLEU4 (case-sensitive) for the en $\rightarrow$ th direction.

In Table 7, we compare the performance of our models with Google Translation API. On May 12 2020, we submitted the pre-processed Thai segments to Google Translation API (Neural Translation Model Predictions in Translation V3) to obtain translated segments in English, and the English segments from IWSLT 2015 to obtain translated segments in Thai. On May 16 2020, we submitted English segments to the Translation API provided by AI-for-Thai (https://www.aiforthai.in.th) to obtain translated segments in Thai. We evaluated AI-for-Thai only in the English $\rightarrow$ Thai direction, as at that moment it provided only English $\rightarrow$ Thai translation. We report detokenized SacreBLEU (case-sensitive) for the th $\rightarrow$ en direction, and BLEU4 (case-sensitive) for the en $\rightarrow$ th direction.

Discussion

Segment Alignment between Languages With and Without Boundaries

Unlike English, there is no segment boundary marking in Thai. One segment in Thai may or may not cover all the content of an English segment. Currently, we mitigate this problem by grouping Thai segments together before computing the text similarity scores. We then choose the combination with the highest text similarity score. It can be said that adequacy is the main issue in building this dataset.

Quality of Translation from Crawled Websites

Some websites use machine translation models such as Google Translate to localize their content. As a result, Thai segments retrieved from web crawling might face issues of fluency since we do not use human annotators to perform quality control.

Quality Control of Crowdsourced Translators

When we use a crowdsourcing platform to translate the content, we cannot fully control the quality of the translation. To combat this, we filter out low-quality segments by using a text similarity threshold, based on cosine similarity of universal sentence encoder vectors. Moreover, some crowdsourced translators might copy and paste source segments into a translation engine and submit the results as answers to the platform. As a further improvement, we can apply techniques such as those described in [Zaidan, 2012] to control the quality and avoid fraud on the platform.

Domain Dependence of Machine Translation Models

We test domain dependence of machine translation models by comparing models trained and tested on the same dataset, using 80/10/10 train-validation-test split, and models trained on one dataset and tested on the other.

For SCB_1M test set, models trained on SCB_1M training set have consistently 4-8 times higher BLEU score than those trained on MT_OPUS. In similar manner, for MT_OPUS test set, models trained on MT_OPUS have 2-4 times higher BLEU score than those trained on SCB_1M. This suggests that diversity of domains in the training set greatly impacts the performance of the models.

Performance Uplifts from Models Trained on Existing Datasets

For the IWSLT 2015 test set, the model trained on both OPUS [Tiedemann, 2012] and our dataset achieves a 0.24 uplift in SacreBLEU for Thai to English translation and a 0.53 uplift in SacreBLEU for English to Thai translation. The uplifts might be small because IWSLT 2015 is a collection of TED Talk transcripts, which are in the same domain as OpenSubtitles [Lison and Tiedemann, 2016], the majority of the OPUS dataset.

In this section, we discussed the challenges in building a large-scale English-Thai machine translation dataset and the corresponding machine translation models.

Conclusions

We release an English-Thai parallel corpus comprising over 1 million segment pairs, including both written and spoken language. The segment pairs in the corpus comprise text from various domains such as product reviews, laws, reports, news, spoken dialogues, and SMS messages. We also release 4 additional datasets for Thai text classification tasks and the Thai sentence segmentation task.

We present an approach to filtering segment pairs with universal sentence encoder to remove misaligned segments. This approach can only filter out unrelated segments and is still prone to target segment adequacy errors. As future work, we plan to develop a more sophisticated method in order to obtain a less noisy parallel corpus.

We conduct experiments on English $\rightarrow$ Thai and Thai $\rightarrow$ English machine translation systems trained on our dataset and the Open Parallel Corpus (OPUS) with different types of source and target tokens (i.e. word-level and subword-level). The evaluation results on the Thai-English IWSLT 2015 test sets show that the performance of our baseline models is on par with Google Translation API for Thai→English and that they outperform it in both directions when OPUS is included in the training data.

Acknowledgement

This investigation is partially supported by the Digital Economy Promotion Agency Thailand under the infrastructure project code MP-62-003 and Siam Commercial Bank. We thank our data annotation partners Hope Data Annotations and Wang: Data Market; Office of the National Economic and Social Development Council (NESDC) through Phannisa Nirattiwongsakorn for providing government documents; Chonlapat Patanajirasit for training CRFCut sentence segmentation models on new datasets; Witchapong Daroontham for product review classification baselines; Pined Laohapiengsak for helping with sentence alignment using universal sentence encoder.

References

Appendix 1: Datasets for Other Tasks

In addition to the machine translation tasks, we can also use some datasets for other natural language processing tasks in Thai.

For the paraphrase identification task, we take the crowdsourced translations from English to Thai based on Microsoft Research Paraphrase Identification corpus [Dolan and Brockett, 2005]. The current version of msr_paraphrase has 10,122 translated sentences. As a result, the dataset includes 3,513 and 1,485 sentence pairs for training and test set respectively (reduced from the original dataset by 563 pairs for training set and 240 pairs for test set).

2 Sentence Segmentation

We can build sentence segmentation models with the generated product review dataset as described in Section 2.3.1.

3 Translation Quality Estimation

Because generated_reviews_yn uses human annotators to label the Google-translated reviews, it gives us another dataset for translation quality estimation. The total number of reviews in this dataset is 302,066.

4 Product Review Classification

We combine generated_reviews_translator and generated_reviews_yn to create a product review classification dataset with 64,760 reviews. The distribution of labels is shown below. Note that we might want to exclude the reviews in generated_reviews_yn that are labelled as not human-readable from the validation set when evaluating a text classification model.
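The exclusion suggested above can be sketched as a simple filter over the validation set. The record layout here is hypothetical: the field names `source`, `human_readable`, `text` and `label`, and the toy records, are assumptions for illustration and do not reflect the released file format.

```python
def build_validation_set(reviews):
    """Drop generated_reviews_yn reviews marked as not human-readable.

    Reviews from generated_reviews_translator are always kept. The field
    names used here are hypothetical.
    """
    kept = []
    for review in reviews:
        from_yn = review["source"] == "generated_reviews_yn"
        if from_yn and not review["human_readable"]:
            continue  # annotators judged this translation unreadable
        kept.append(review)
    return kept


# Toy records (hypothetical schema and values).
toy_validation = [
    {"source": "generated_reviews_translator", "human_readable": True,
     "text": "review A", "label": 5},
    {"source": "generated_reviews_yn", "human_readable": True,
     "text": "review B", "label": 1},
    {"source": "generated_reviews_yn", "human_readable": False,
     "text": "review C", "label": 3},
]
clean_validation = build_validation_set(toy_validation)
```

Applying the filter only to the validation set keeps the evaluation free of unreadable inputs while leaving the choice of training data open.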

Appendix 2: Example Sentence Pairs

Example sentence pairs from our English-Thai machine translation dataset are listed below:

1) Dialogues in spoken language from Taskmaster-1

Source (en): Hakkasan and uptown restaurant Philippe Chow are top rated
Target (th): ฮักกาชาง กับร้านในอัพทาวน์ชื่อ ฟิลิปเป้ เชา ได้ เรตดีอยู่นะ

Source (en): What showtimes do they have at night?
Target (th): ตอนกลางคืนมีรอบกี่โมงบ้างคะ?

Source (en): Who doesn’t deliver these days? Alright, so a White Wonder with chicken & onions?
Target (th): เดี๋ยวนี้ใครเขาไม่มีเดลิเวอรี่แล้วบ้าง? เอาเถอะ เอาไวท์วันเดอร์ใส่ไก่กับหัวหอมนะ?

6) Microsoft Research Paraphrase Identification corpus

2 Translated segment pairs via Google Translation API verified by translators

3 Aligned segment pairs from web-crawled data and PDF documents

2) English-Thai parallel Wikipedia corpus

3) News sites (Asia Pacific Defense Forum)

5) Crawled pages from websites listed in ParaCrawl v5

Appendix 3: Sentence Pairs Similarity with USE

2 Example of correctly aligned sentence pairs with low similarity score

3 Example of incorrectly aligned sentence pairs with low similarity score

4 Example of sentence pairs with high similarity score but lacking adequacy in the source or target sentence

Appendix 4: Sample of Translation Results

The sampled translation results below are from the Transformer Base model trained on the train set (80%) of our 1 million segment pair dataset, where the source and target tokens of the MT model are subwords (joined dictionary).

Direction: English $\rightarrow$ Thai

Source: The centre was based at the Munich Fairgrounds, in what was formally Munich Airport. The building is now known as the Munich Exhibition Centre.
Reference: ศูนย์ดังกล่าวตั้งอยู่ที่ ”มิวนิกแฟร์” (Munich Fair) ซึ่งก่อสร้างขึ้นในบริเวณของท่าอากาศยานมิวนิก ปัจจุบันอาคารแห่งนี้เป็นที่รู้จักในชื่อ ”ศูนย์แสดงสินค้ามิวนิก” (Munich Exhibition Centre)
Hypothesis: ศูนย์จัดแสดงสินค้ามิวนิกตั้งอยู่ที่ ”มิวนิกแฟร์กราวด์” ในเมืองมิวนิก ปัจจุบันอาคารแห่งนี้เป็นที่รู้จักในชื่อ ”ศูนย์แสดงนิทรรศการมิวนิก”

Source: I want the Almond Milk, and if they are out of that I would like the Coconut Milk.
Reference: เอานมอัลมอนด์ค่ะ ถ้าไม่มีเอานมมะพร้าว
Hypothesis: เอานมอัลมอนด์ค่ะ แล้วก็ถ้ากะทิหมดก็ขอเป็นกะทินะคะ

Source: Traveling intercity by bus is generally cheaper than traveling by train. Buses vary widely in terms of comfort and onboard options depending on your budget. One big advantage of traveling by bus is that you can journey overnight, meaning that you save the money of a night’s accommodation. Expect to take around eight or nine hours from Tokyo to the western city of Osaka. The biggest transport hub for buses is the Shinjuku Expressway Bus Terminal , where you can board a bus headed for every corner of the country.
Reference: โดยทั่วไปการเดินทางจากเมืองหนึ่งไปสู่อีกเมืองหนึ่งโดยรถบัสจะเป็นวิธีที่ถูกกว่ารถไฟ ความสะดวกสบายและตัวเลือกภายในรถโดยสารจะแตกต่างตามงบประมาณ ข้อดีใหญ่ข้อหนึ่งคือรถบัสมีเที่ยวที่ออกเดินทางช่วงกลางคืนจึงช่วยให้สามารถประหยัดค่าพักค้างแรมไปได้ 1 วัน จากโตเกียวไปเมืองฝั่งตะวันตก ”โอซาก้า” จะใช้เวลาประมาณ 8-9 ชั่วโมง ศูนย์กลางการคมนาคมรถบัสที่ใหญ่สุดคือ ” สถานีรถบัสชินจูกุ (สถานีรถบัสด่วนพิเศษชินจูกุ) ” และสามารถนั่งรถบัสไปได้ทุกหนแห่งภายในญี่ปุ่น
Hypothesis: การเดินทางโดยรถโดยสารประจําทางโดยทั่วไปจะถูกกว่าการเดินทางโดยรถไฟ รถบัสมีราคาแตกต่างกันไปมากในแง่ของความสะดวกสบายและทางเลือกบนเรือขึ้นอยู่กับงบประมาณของคุณ ความได้เปรียบอย่างใหญ่หลวงหนึ่งของการเดินทางโดยรถบัสคือคุณสามารถเดินทางข้ามคืนได้ หมายความว่า คุณประหยัดค่าที่พักของยามค่ําคืนได้ โดยจะใช้เวลาประมาณ 8 หรือ 9 ชั่วโมงจากโตเกียวไปเมืองโอซาก้าตะวันตกประมาณ 8-9 ชั่วโมง และเป็นศูนย์กลางการขนส่งที่ใหญ่ที่สุดของรถบัสคือสถานีรถบัสด่วนชินจูกุ สถานีรถบัสชินจูกุ สามารถขึ้นรถบัสทุกมุมของประเทศได้

Source: Additionally, B cells present antigens (they are also classified as professional antigen-presenting cells (APCs)) and secrete cytokines. In mammals, B cells mature in the bone marrow, which is at the core of most bones. In birds, B cells mature in the bursa of Fabricius, a lymphoid organ where they were first discovered by Chang and Glick, (B for bursa) and not from bone marrow as commonly believed.
Reference: บีเซลล์ () เป็นเซลล์เม็ดเลือดขาวประเภทลิมโฟไซต์ ซึ่งเมื่อถูกกระตุ้นด้วยสารแปลกปลอมหรือแอนติเจนจะพัฒนาเป็นพลาสมาเซลล์ที่มีหน้าที่หลั่งแอนติบอดีมาจับกับแอนติเจน บีเซลล์มีแหล่งกําเนิดในร่างกายจากสเต็มเซลล์ ที่ชื่อว่า ”Haematopoietic Stem cell” ที่ไขกระดูก พบครั้งแรกที่ไขกระดูกบริเวณก้นกบของไก่ ที่ชื่อว่า Bursa of Fabricius จึงใช้ชื่อว่า ”บีเซลล์” (บางแห่งอ้างว่า B ย่อมาจาก Bone Marrow หรือไขกระดูกซึ่งเป็นที่กําเนิดของบีเซลล์ แต่นี่เป็นเพียงความบังเอิญเท่านั้น)
Hypothesis: นอกจากนี้ เซลล์ B ยังนําเสนอแอนติเจน (ซึ่งเป็นเซลล์ที่เป็นตัวแทนของแอนติเจนระดับมืออาชีพ) และ เลสเตอรี cytokins ในสัตว์เลี้ยงลูกด้วยนม เซลล์ B เจริญในไขกระดูกเป็นแกนกลางของกระดูกส่วนใหญ่ ในนก เซลล์ B เจริญใน Bursa of Fricius, lymphoid organ ที่ที่พวกเขาค้นพบครั้งแรกโดย Chang and Glick (B for Bursa) และไม่ใช่จากไขกระดูกที่พบทั่วไป

The following sampled translation results show the differences in the translated sentences for each combination of source and target token type (word-level, subword-level) of the MT model.

Direction: Thai $\rightarrow$ English

Source: อีกที่ที่อยู่ใกล้ใจกลางเมืองยิ่งขึ้นไปอีกคือ ”เดจิคิว บาร์บีคิว คาเฟ่” ที่จะสามารถย่างบาร์บีคิวบนระเบียงไม้บรรยากาศสบายๆ ไปพร้อมๆ กับการชมวิว ”สะพานสายรุ้งเรนโบว์บริดจ์”
Reference: Closer to central Tokyo is Dejikyu BBQ Café in Odaiba, where you can barbecue on a comfortable wooden deck overlooking Rainbow Bridge.
Hypotheses:
bpe $\rightarrow$ bpe : Another closer to downtown is Dejikyu’s BBQ Cafe, where you can grill BBQ on a woody balcony with a view of Rainbow Bridge.
word $\rightarrow$ word : Another closer to downtown is <unk>’s BBQ Cafe, where you can barbecue on a cozy wooden porch with a view of Rainbow Bridge.
word $\rightarrow$ bpe : Another closer to the city center is DejiQ BBQ Cafe, where you can barbecue on a wooden balcony with a casual atmosphere while watching Rainbow Bridge.
bpe $\rightarrow$ word : Another closer location to downtown is <unk> BBQ Cafe, where you can barbecue on a casual wooden balcony with a view of Rainbow Bridge.

Source: หุ้นของ Mattel ลดลง 13 เซนต์เหลือ 19.72 ดอลลาร์ในตลาดหลักทรัพย์นิวยอร์ก
Reference: Shares of Mattel were down 13 cents to $19.72 on the New York Stock Exchange.
Hypotheses:
bpe $\rightarrow$ bpe : Mattel’s shares fell 13 cents to $19.22 on the New York Stock Exchange.
word $\rightarrow$ word : Shares of the <unk> have been down 13 cents to $25 in the New York Stock Exchange.
word $\rightarrow$ bpe : Shares of Mattel fashion fell 13 cents to dollar on the New York Stock Exchange.
bpe $\rightarrow$ word : Matte’s shares were down 13 cents to $72 on the New York Stock Exchange.
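The unknown-token placeholders in the hypotheses with word-level targets illustrate why the token type matters: a word-level vocabulary has no entry for a rare name, whereas a subword vocabulary can assemble it from smaller pieces. The toy sketch below shows this contrast. It uses greedy longest-match segmentation over a hand-picked piece inventory as a simplified stand-in for BPE, not the BPE merge procedure itself, and the vocabularies are invented for the example.

```python
UNK = "<unk>"


def word_tokenize(sentence, vocab):
    """Word-level: any word outside the vocabulary becomes <unk>."""
    return [w if w in vocab else UNK for w in sentence.split()]


def subword_tokenize(sentence, pieces):
    """Subword-level via greedy longest-match over a fixed piece inventory.

    A simplified stand-in for BPE. A character that matches no piece is
    emitted as <unk> and skipped.
    """
    tokens = []
    for word in sentence.split():
        start = 0
        while start < len(word):
            # Try the longest candidate first and shrink until one matches.
            for end in range(len(word), start, -1):
                if word[start:end] in pieces:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                tokens.append(UNK)
                start += 1
    return tokens


# Invented toy vocabularies (assumption): "Mattel" is not a known word,
# but its pieces "Mat" and "tel" are known subwords.
word_vocab = {"Shares", "of", "fell"}
subword_pieces = {"Shares", "of", "fell", "Mat", "tel"}

sentence = "Shares of Mattel fell"
word_tokens = word_tokenize(sentence, word_vocab)
subword_tokens = subword_tokenize(sentence, subword_pieces)
```

Here the word-level tokenizer replaces the name with the unknown token, while the subword tokenizer splits it into known pieces.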