ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Huu Nguyen
Introduction
Since the introduction of word embeddings Bengio et al. (2000) and deep learning architectures Collobert et al. (2011), Natural Language Processing (NLP) has witnessed significant breakthroughs that fundamentally transform research and applications in various areas. Starting with the creation of word2vec Mikolov et al. (2013b), which initiated the shift of NLP from feature-engineering methods to representation learning with deep learning, the major milestones in NLP involve the presentation of the seq2seq or encoder-decoder framework Cho et al. (2014); Sutskever et al. (2014) for text generation in 2014, the proposal of the attention mechanism Bahdanau et al. (2015) for text encoding in 2015, the development of the Transformer architecture Vaswani et al. (2017) as the basic building block for modern NLP models, the notion of contextualized word embeddings from language models in ELMo Peters et al. (2018), and the pre-trained transformer-based language models, e.g., BERT Devlin et al. (2019), GPT Radford et al. (2018, 2019), T5 Raffel et al. (2020), and BART Lewis et al. (2020).
The recent advances in NLP feature large language models (LLMs) that have parameter sizes over a hundred billion and are pre-trained on massive data, e.g., GPT-3 Rae et al. (2021), Megatron Shoeybi et al. (2019), GPT-Jurassic Lieber et al. (2021), and OPT-175B Zhang et al. (2022b). Although still relying on the Transformer architecture, the unprecedented scales of model size and training data have allowed new emergent abilities to change the landscape and practices in NLP Wei et al. (2022). An important emergent skill involves prompt-based learning that facilitates the probing of information from LLMs with prompts by sampling the learned language distributions Brown et al. (2020). In this way, the models demonstrate strong generalization in few-shot and zero-shot learning while avoiding parameter updates for the underlying architectures. However, due to the auto-regressive training objective, the sampling from LLMs might generate unexpected outputs for users (misalignment with human interests), e.g., containing untruthful facts/toxic sentiments and being very sensitive to minor changes in the input prompts Ouyang et al. (2022).
To this end, ChatGPT (https://openai.com/blog/chatgpt/) is one of the latest developments in NLP that has mitigated limitations of previous LLMs and gained widespread attention from the public. Within the first two months after its launch, ChatGPT attracted 100 million users Milmo (2023). As the next iteration of InstructGPT Ouyang et al. (2022), ChatGPT is optimized on top of a GPT-3.5 series model using reinforcement learning from human feedback (RLHF) Christiano et al. (2017). In contrast to previous LLMs, ChatGPT and InstructGPT leverage human demonstrations of desired outputs for input prompts to train supervised models, while human rankings of generated outputs are obtained to train a reward model to further optimize the LLMs with reinforcement learning. Compared to InstructGPT, ChatGPT is trained with conversational data to allow follow-up questions. In this way, ChatGPT is able to interact with humans in multi-turn conversations and generate outputs that are better aligned with human interests, thus being more natural and accessible to users. In addition, due to the deployment of public APIs to facilitate general users, there have been multiple reports on the successes of ChatGPT in solving challenging tasks in various areas, e.g., passing the United States Medical Licensing Examination Kung et al. (2022) and real exams in a law school Choi et al. (2023), performing competitively with commercial translation services for some high-resource languages Jiao et al. (2023), generating fluent and comprehensive responses to answer complex questions, self-correcting previous errors in the conversations Guo et al. (2023), and even producing code from natural language instructions.
Across different communities, there is an excitement about a future with ChatGPT and similar technologies as assistants for professional areas Jeblick et al. (2022); King (2022) or evaluators of natural language understanding Wang et al. (2023b). At the same time, the impressive abilities of LLMs have triggered active discussions among researchers and industry members on the next steps for NLP research Zhang et al. (2022a); Kuzman et al. (2023) and how the technologies will shape the future job market. On the other extreme, the communities also express concerns about long-term implications of ChatGPT and LLMs for society, citing issues on plagiarism, privacy, misinformation, and security. As ChatGPT and current LLMs are trained on large-scale data collected extensively from different corners, they might copy information from one source into the output without proper citation. The generated texts might also be utilized in different situations without acknowledgment. The training data, user prompts, and model responses might involve private information that is not fully concealed. In addition, ChatGPT and LLMs can still hallucinate important information. The fluency and eloquence of generated texts might easily deceive humans about information correctness (especially for non-native speakers), thus amplifying the social risks of misinformation Bang et al. (2023).
Similar to other LLMs, ChatGPT is trained on a mix of training data from multiple languages. Although English is the majority, the combination of multilingual data contributes to ChatGPT’s abilities to accept inputs and generate responses in different languages, making it accessible and widely adopted by people around the world. However, given the recency of the technology, ChatGPT has been mainly evaluated over English data. The community is lacking a comprehensive, public, and independent evaluation of ChatGPT over various non-English languages for diverse NLP tasks to provide proper perspectives for future research and applications. Given ChatGPT’s transformative potentials, associated long-term risks, huge cost for training, and limited transparency, a fundamental question is whether multilingual LLMs such as ChatGPT can also be reliably adopted for different languages or it is necessary to develop language-specific LLMs/other technologies to solve NLP problems for non-English languages.
To address the multilingual concerns for ChatGPT, a few recent studies have investigated ChatGPT’s performance and responses for non-English languages. However, the considered tasks/languages/settings and the scale of evaluation data in existing multilingual evaluations are still limited, and thus cannot show a comprehensive picture of the potentials/performance of the technology on a diversity of other languages. For instance, Bang et al. (2023) evaluate the multilingual performance of ChatGPT on three tasks of language identification, sentiment analysis, and machine translation; however, only a few languages are selected for each task and the number of evaluation samples for each language does not exceed 50. Beyond English, the analysis of ChatGPT’s responses for input questions in Guo et al. (2023) is only done for Chinese, while the results of the medical licensing examinations for ChatGPT are only shown for Japanese in Kasai et al. (2023). In addition, Fang et al. (2023) and Wang et al. (2023a) explore ChatGPT in three languages, i.e., English, Chinese, and German; however, the studies only focus on grammatical error correction or cross-lingual summarization.
To this end, our paper aims to perform a more thorough evaluation of ChatGPT for its performance on multiple languages over different NLP tasks. Our experiments consider 37 diverse languages, characterizing high-, medium-, low-, and extremely low-resource languages, to better highlight ChatGPT’s potentials and limitations. To our knowledge, this is one of the largest sets of languages evaluated for ChatGPT in a public study to date. In addition to Natural Language Inference (NLI), Question Answering, and Common Sense Reasoning, our current work will examine the tasks of Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Relation Extraction, and Summarization, which are not covered in previous multilingual evaluations for ChatGPT. To improve the reproducibility of the evaluations and better reflect the approach of general users, our current work will focus on the zero-shot learning setting for ChatGPT where no human-provided examples are presented to the model. Importantly, due to the scale of available languages/tasks/datasets/models and the growing nature of multilingual learning research in NLP, we will use this work as an ongoing and public effort to evaluate ChatGPT and other LLMs for multiple languages, emphasizing understudied languages to measure robustness and democratize the impacts of the technologies.
Despite some exceptions and potential updates with future experiments, our current experiments suggest the following major tendencies:
ChatGPT’s zero-shot learning performance is generally worse than the state-of-the-art performance of the supervised learning models for a majority of the considered tasks across different languages, including high-, medium-, low-, and extremely low-resource languages. The performance gaps are usually very large, demonstrating that ChatGPT is not yet fit to serve as a general solver for different NLP problems. It thus highlights the importance of task-specific models for the development of NLP applications.
ChatGPT’s performance is generally better for English than for other languages, especially for higher-level tasks that require more complex reasoning abilities (e.g., named entity recognition, question answering, common sense reasoning, and summarization). The performance differences can be substantial for some tasks and lower-resource languages, which corroborates the bias of ChatGPT toward English and suggests the potential of developing language-specific models/LLMs for different languages and groups.
ChatGPT can perform better with English prompts even though the task and input texts are intended for other languages, further confirming the biases toward English of ChatGPT.
Related Work
Evaluation of ChatGPT: Since the release of ChatGPT in November 2022 with impressive language abilities, there has been a growing interest in evaluating ChatGPT for different aspects of natural language understanding. The first line of work concerns the performance comparison of ChatGPT and state-of-the-art systems for important tasks in NLP such as text summarization Wang et al. (2023a); Yang et al. (2023), machine translation Hendy et al. (2023); Jiao et al. (2023); Kocmi and Federmann (2023), question answering Tan et al. (2023); Omar et al. (2023), information extraction Wei et al. (2023); Gao et al. (2023), text classification Kuzman et al. (2023); Amin et al. (2023), grammatical error correction Fang et al. (2023), and stance detection Zhang et al. (2022a). Along this line, several recent studies have attempted to examine the performance of ChatGPT more comprehensively on multiple datasets Bang et al. (2023); Qin et al. (2023); Kocoń et al. (2023); Zhong et al. (2023). The second direction for ChatGPT evaluation focuses on the robustness/reliability of the model against possible variants of input texts. For example, Wang et al. (2023c) explore the robustness of ChatGPT under the adversarial and out-of-domain learning settings while Jang and Lukasiewicz (2023) examine the logical prediction consistency of ChatGPT for inputs with semantic equivalence, logical negation, or symmetricity. Finally, the third dimension for ChatGPT evaluation discusses the potential impacts and risks of the technology for the broader society, e.g., in education Susnjak (2022); Khalil and Er (2023), law Choi et al. (2023), medicine Kung et al. (2022), ethics Shen et al. (2023), human-computer collaboration Lanzi and Loiacono (2023), and cognition Mahowald et al. (2023). However, to our knowledge, no existing work has conducted large-scale evaluations of ChatGPT for multiple languages and tasks as we do.
Multilingual NLP: Using the encoder-decoder framework in the original Transformer architecture Vaswani et al. (2017), several variants have been explored to train language models, characterizing the encoder-only models, e.g., BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), decoder-only models, e.g., GPT Radford et al. (2019); Brown et al. (2020), and encoder-decoder models, e.g., BART Lewis et al. (2020) and T5 Raffel et al. (2020). While initial efforts have focused on English, recent work has extended these language models to other languages following two major directions. The first direction introduces language-specific language models that are trained exclusively over data collected for a target language, e.g., for Spanish MMG (2021), Polish Resources and Technology Infrastructure (2021), French Martin et al. (2020); Kamal Eddine et al. (2021), Japanese Wongso (2021), Hindi Parmar (2021), and Swedish Moell (2021). The second direction, on the other hand, seeks to train the language models over combined data from multiple languages, leading to multilingual counterparts for text encoding, e.g., mBERT Devlin et al. (2019), XLM-RoBERTa Conneau et al. (2020), mBART Liu et al. (2020), and mT5 Xue et al. (2021). Notably, BLOOM Scao et al. (2022) is a decoder-only LLM with 176 billion parameters trained over data from 46 natural languages and 13 programming languages. In contrast to ChatGPT, BLOOM and its parameters are publicly available to the community.
Picking up from previous multilingual NLP research for non-English languages Kim et al. (2010); Täckström et al. (2012); Zhang et al. (2016); Conneau et al. (2017); Mayhew et al. (2017); Lample et al. (2018); Joulin et al. (2018); M’hamdi et al. (2019); Ni and Florian (2019) with cross-lingual word embeddings Mikolov et al. (2013a); Upadhyay et al. (2016); Ruder et al. (2019), multilingual language models have enabled a new generation of multilingual models that significantly boost the performance for NLP tasks in different languages Wu and Dredze (2020). For example, multilingual language models have achieved state-of-the-art multilingual performance for Sentence Splitting Nguyen et al. (2021a), Dependency Parsing Kondratyuk and Straka (2019), Question Answering Huang et al. (2019); Wu et al. (2022), Named Entity Recognition Pires et al. (2019); Wu and Dredze (2019); Karthikeyan et al. (2020), and Event Extraction Nguyen et al. (2021b); Guzman-Nateras et al. (2022).
Finally, extensive efforts have been devoted to creating multilingual datasets (those with texts and annotations for multiple languages) for different NLP tasks, including Universal Dependencies Nivre et al. (2016) (for POS tagging and Dependency Parsing in more than 76 languages), CoNLL 2002 and 2003 Sang and Meulder (2002, 2003) (for NER in 4 languages), XNLI Conneau et al. (2018) (for Natural Language Inference in 15 languages), DARPA LORELEI Strassel and Tracey (2016) (for NER and Entity Linking in 34 languages), TyDi Zhang et al. (2021) (for Multilingual Information Retrieval), XWikis Perez-Beltrachini and Lapata (2021) for Cross-Lingual Summarization, XQuAD Artetxe et al. (2020) for Cross-Lingual Question Answering, MEE Pouran Ben Veyseh et al. (2022) for Event Extraction in 9 languages, MECI Lai et al. (2022) for event causality identification in 5 languages, and XGLUE Liang et al. (2020) and XTREME Hu et al. (2020) for multiple NLP tasks, among others. Such multilingual datasets have served as the key elements to evaluate multilingual learning models and measure research progress in this field. To this end, our work will leverage available multilingual datasets for different NLP tasks to evaluate ChatGPT and LLMs to appropriately contextualize their potentials for multilingual learning.
Methodology
The goal of our research is to evaluate the performance of ChatGPT and LLMs for NLP tasks in different languages. Given the large numbers of NLP datasets/tasks/languages and the growing developments of LLMs, our work will be an ongoing effort to include additional experiments to be more comprehensive along the way. In the current version of the paper, we will evaluate ChatGPT on seven diverse NLP tasks, i.e., Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Relation Extraction (RE), Natural Language Inference (NLI), Question Answering (QA), Common Sense Reasoning (CSR), and Summarization. Over different tasks, our experiments will cover 37 diverse languages, characterizing high-, medium-, low-, and extremely low-resource languages to provide broader perspectives. Following Bang et al. (2023), we employ the ratio of the data for each language in the CommonCrawl corpus (http://commoncrawl.org), i.e., the main data to pre-train GPT-3, to classify the resource levels. In particular, a language will be considered as high-, medium-, low-, and extremely low-resource if its data ratio is greater than (), between and (), between and (), and smaller than () respectively. Table 1 presents information and categories for the languages considered in our work.
As the scale of ChatGPT precludes the ability to fine-tune the model on downstream task data for most general users, we focus on the zero-shot learning setting for ChatGPT. We also report the state-of-the-art performance of the supervised models for a task in each language as a reference for research progress. In zero-shot learning, an NLP task $T$ is specified by a natural-language task description $D$. Given a new data sample with input text $X$ for the task $T$, the concatenation of $D$ and $X$ will then be sent into the ChatGPT model as the input prompt to generate a natural-language response $R$. Afterward, the response $R$ will be parsed using some pre-defined task-specific rules to obtain an output $Y$ in the required format for $T$ (e.g., a pre-defined label for classification problems). Finally, the outputs $Y$ for examples in an evaluation dataset will be scored to return ChatGPT’s performance for task $T$.
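The zero-shot procedure described above (concatenate the task description with the input, query the model, parse the response, score the outputs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `model` is a hypothetical callable standing in for a ChatGPT request, and the earliest-mentioned-label parsing rule is only one plausible example of a task-specific rule.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def build_prompt(task_description: str, input_text: str) -> str:
    """Concatenate the task description and the input text into one prompt."""
    return f"{task_description}\n\n{input_text}"


def parse_label(response: str, labels: Iterable[str]) -> Optional[str]:
    """Toy parsing rule (an assumption): pick the label mentioned earliest."""
    lowered = response.lower()
    hits = [(lowered.find(label.lower()), label) for label in labels]
    hits = [(pos, label) for pos, label in hits if pos >= 0]
    return min(hits)[1] if hits else None


def evaluate_zero_shot(
    model: Callable[[str], str],       # hypothetical stand-in for ChatGPT
    task_description: str,
    examples: List[Tuple[str, str]],   # (input text, gold label) pairs
    labels: List[str],
) -> float:
    """Return accuracy of parsed model outputs against the gold labels."""
    correct = 0
    for input_text, gold in examples:
        response = model(build_prompt(task_description, input_text))
        if parse_label(response, labels) == gold:
            correct += 1
    return correct / len(examples) if examples else 0.0
```

Responses that mention no known label are parsed to `None` and therefore count as errors, which is one simple way to handle unparseable outputs.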
Different from some previous work that exploits two-stage prompting to adopt a zero-shot chain of thoughts Kojima et al. (2022); Qin et al. (2023), we directly utilize single-stage prompting that only adds the task description into each input to simulate the common approach of general users for ChatGPT. Other prompting strategies can be explored in future versions of our work. As such, in the current version, we aim to design simple task descriptions while ensuring necessary information to indicate the task and facilitate the parsing of responses to produce accurate outputs. In addition, for tasks in a non-English target language, we will evaluate task descriptions in both English and the target language to shed light on the best approach to prompt ChatGPT in multilingual settings. To facilitate the experiments, all non-English task descriptions are obtained using automatic translation tools, e.g., Google Translate (https://translate.google.com), to translate the designed English descriptions for each task. Finally, all of the responses from ChatGPT in this work are obtained between March 1 and April 5. This is right after ChatGPT was made available in OpenAI APIs to enable large-scale requests from the public for comprehensive evaluations. To improve reproducibility, we clear the conversations in ChatGPT for each query to remove any previous context.
Part-of-Speech Tagging
Part-of-Speech (POS) Tagging is a coarse-grained word classification task whose goal is to label the syntactic information of the words in a sentence. POS tagging can help alleviate the sparseness of word-level features and serve as an important pre-processing step in NLP systems. We evaluate ChatGPT for its multilingual POS tagging abilities over the XGLUE-POS dataset Liang et al. (2020), which covers 18 languages and includes labels derived from the Universal Dependencies (UD) Treebanks (v2.5) Zeman et al. (2020). In the experiments, we utilize the XGLUE-POS dataset from Huggingface Datasets (https://huggingface.co/datasets/xglue), which only includes 17 languages (i.e., excluding Portuguese). As such, we use the test sets of XGLUE-POS with more than 15K samples for the selected languages in the evaluation.
Our prompt for POS tagging for ChatGPT consists of a task description, a note for output format, and an input sentence, concatenated in that order, i.e., PromptPOS = [task description; output format note; input sentence]. Notably, instead of directly using the text of the input sentence, we feed ChatGPT with the list of words in the sentence to facilitate the word-label alignment and parsing of ChatGPT responses for POS tagging. Our task description and output format note then emphasize the expected format for ChatGPT’s responses, which should follow a tuple structure with pairs of words and their corresponding POS tags. In the experiments, this approach has led to better performance for ChatGPT than feeding the input sentence directly. We illustrate an example for the English POS prompts for ChatGPT in Figure 1.
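To make the word-list input and tuple-structured output concrete, the sketch below builds a prompt in the [task description; output format note; input sentence] order and extracts (word, tag) pairs from a response. The prompt wording, the `Input:` prefix, and the regular expression are illustrative assumptions; the exact prompt text used in the paper is the one shown in Figure 1.

```python
import re
from typing import List, Sequence, Tuple

# Matches ("word", "TAG") or ('word', 'TAG'); the backreferences require the
# closing quote to match the opening one, so "don't" is handled correctly.
PAIR_PATTERN = re.compile(r"""\(\s*(["'])(.+?)\1\s*,\s*(["'])(.+?)\3\s*\)""")


def build_pos_prompt(words: Sequence[str], task_description: str,
                     format_note: str) -> str:
    """Concatenate description, format note, and the list of words."""
    return "\n".join([task_description, format_note, f"Input: {list(words)!r}"])


def parse_pos_response(response: str) -> List[Tuple[str, str]]:
    """Extract (word, tag) tuples from a free-text model response."""
    return [(m.group(2), m.group(4)) for m in PAIR_PATTERN.finditer(response)]


def pos_accuracy(predicted: Sequence[Tuple[str, str]],
                 gold_tags: Sequence[str]) -> float:
    """Position-wise accuracy; missing predictions count as errors."""
    correct = sum(1 for (_, tag), gold in zip(predicted, gold_tags) if tag == gold)
    return correct / len(gold_tags) if gold_tags else 0.0
```

Passing the sentence as a list of words fixes the tokenization on the input side, so the pairs returned by the model can be aligned with gold tags by position.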
Results: Table 2 presents the performance of ChatGPT (zero-shot learning with both English and language-specific task descriptions) and the fully supervised XLM-R model (based on XLM-RoBERTa base) Liang et al. (2020). Here, performance is measured via the accuracy of the predicted POS tags. As can be seen, ChatGPT outperforms XLM-R on 13 out of 17 languages for multilingual POS tagging. Unlike XLM-R, for which English has the best POS tagging performance, ChatGPT achieves higher accuracy for some other languages (e.g., French, Spanish) than for English. Finally, we observe that English prompts tend to perform better than, or at least competitively with, language-specific prompts for ChatGPT across different languages for POS tagging.
Named Entity Recognition
Named Entity Recognition (NER) is an important task in NLP Sang and Meulder (2002), providing basic technologies for many downstream applications, such as search engines, question answering, and recommendation systems. Aiming to identify spans and semantic types of names (e.g., person, organization) in text, NER is usually formulated as a sequence tagging problem where a label is assigned to each word in a sentence to indicate names. The BIO annotation schema is often leveraged to form the labels to capture both span and type information Ratinov and Roth (2009). For multilingual NER evaluation of ChatGPT, we employ the datasets from the recent shared task MultiCoNER Malmasi et al. (2022) that seeks to build NER systems for 11 languages following the WNUT 2017 taxonomy for entity types Derczynski et al. (2017). There are 6 entity types in MultiCoNER, i.e., PER (person), LOC (location), CORP (corporation), CW (creative work), GRP (group of people), and PROD (product). The sentences in MultiCoNER belong to three main domains: Wikipedia, web questions, and user queries. In MultiCoNER, the sentences tend to be short with low context. Also, the entities in the sentences are often semantically ambiguous and complex, which makes the problem more challenging. We utilize the test sets of the languages in MultiCoNER for evaluation.
Our prompt structure for ChatGPT with NER follows the prompts for POS Tagging, i.e., PromptNER = [task description; output format note; input sentence], which involves a task description to explain the task and list the entity types/labels of interest. We also have a note to specify the expected output format with tuples of words and predicted tags for names. However, a key difference for NER is that we explicitly ask ChatGPT to produce tags for each word in the BIO format. Although this approach seems to make the task more challenging for ChatGPT, we find that it actually improves the performance for ChatGPT. Our hypothesis is that the BIO tag requirement encourages ChatGPT to solve NER as a sequence labeling problem, thus forcing it to comprehensively annotate names in input sentences. In contrast, the simpler approach of prompting ChatGPT for names without the BIO specification might suggest a reading comprehension formulation that does not tag all names with exact spans for NER. The responses from ChatGPT are also harder (i.e., more ambiguous and unpredictable) to parse for NER outputs without the BIO requirement. We provide an English prompt example for NER for ChatGPT in Figure 2.
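Once per-word BIO tags have been parsed from a response, they need to be converted into typed entity spans before scoring. The function below is a minimal sketch of that decoding step; the lenient handling of a stray `I-` tag (treated as starting a new entity) is an assumption for illustration, not a rule stated in the paper.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: Sequence[str]) -> List[Span]:
    """Convert a BIO tag sequence into a list of typed entity spans."""
    spans: List[Span] = []
    start, etype = None, None
    # A sentinel "O" closes any entity still open at the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Assumption: an I- tag with no matching open entity starts a new one.
            start, etype = i, tag[2:]
    return spans
```

For example, the tags `["B-PER", "I-PER", "O", "B-CW"]` decode to a PER span over words 0 to 1 and a CW span over word 3.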
Results: Table 3 evaluates the performance of ChatGPT (zero-shot learning with both English and language-specific task descriptions) and DAMO Wang et al. (2022a), the model with the current best-reported performance on MultiCoNER. The latter retrieves relevant context from Wikipedia for each input sentence, which is then fed into the XLM-RoBERTa model (large version) for NER. DAMO also employs a conditional random fields (CRF) layer for the modeling. Our results for NER are evaluated using macro-averaged F1 scores Malmasi et al. (2022). The most important observation from the table is that ChatGPT significantly underperforms DAMO on MultiCoNER across all 11 languages. In fact, the performance of ChatGPT is less than 40% for all languages, which suggests that ChatGPT is less suitable for solving NER in this domain.
In order to better understand the performance of ChatGPT for MultiCoNER, we use the scoring script nervaluate (https://github.com/MantisAI/nervaluate) to compute detailed scores for each entity type for ChatGPT. Table 4 shows label-wise precision, recall, and F1 scores of ChatGPT (with English prompts). We also include spurious percentages (over total numbers of predictions), which are the percentages of ChatGPT’s predictions that do not exist in the annotated data for each type. As can be seen, ChatGPT’s extraction performance is very poor for GRP (group of people) and CW (creative work), which have F1 scores of less than 15%. Also, the spurious percentages of ChatGPT are generally high for all entity types, which suggests ChatGPT’s verbosity and confusion for NER.
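The label-wise scores can be illustrated with a small self-contained computation. This sketch does not call nervaluate and simplifies its matching schemes: precision, recall, and F1 use exact span-and-type matches, and a prediction is counted as spurious when it overlaps no annotated entity at all. Both choices, and the `(sentence id, start, end, type)` span layout, are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Span = Tuple[int, int, int, str]  # (sentence id, start, end exclusive, type)


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans share at least one token in the same sentence."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def per_type_scores(gold: Iterable[Span],
                    pred: Iterable[Span]) -> Dict[str, Dict[str, float]]:
    """Exact-match precision/recall/F1 and spurious percentage per type."""
    gold_set, pred_set = set(gold), set(pred)
    scores: Dict[str, Dict[str, float]] = {}
    for etype in sorted({s[3] for s in gold_set | pred_set}):
        g = {s for s in gold_set if s[3] == etype}
        p = {s for s in pred_set if s[3] == etype}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Simplified notion of "spurious": overlaps no gold entity of any type.
        spurious = sum(1 for s in p if not any(overlaps(s, x) for x in gold_set))
        scores[etype] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "spurious_pct": 100.0 * spurious / len(p) if p else 0.0,
        }
    return scores
```

Averaging the per-type `f1` values with equal weight gives a macro-averaged F1 under this simplified matching.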
Relation Extraction
Relation Extraction (RE) is a crucial task in information extraction (IE), aiming to identify and classify semantic relations between two entity mentions in an input text. To facilitate multilingual experiments for RE, we conduct our evaluation over the SMiLER dataset Seganti et al. (2021). SMiLER provides relation annotation for texts in 14 languages with 36 relation types (including “no-relation”). The test sets of the languages (with more than 12K samples) are employed for evaluation.
An input example for RE involves an input text and two entity mentions in the text for classification. To probe ChatGPT for RE for an example, we design the prompt via the concatenation of a task description, an output format note, the input text, and the two entity mentions, i.e., PromptRE = [task description; output format note; input text; entity 1; entity 2]. In the task description for RE, we explicitly include all the relation types to inform ChatGPT. The output format note specifies the expected format for the responses from ChatGPT for RE, thus facilitating response parsing for relation labels. To illustrate the RE prompts for ChatGPT, we present an example with the English prompt and corresponding response in Figure 3.
Results: Table 5 shows the performance of ChatGPT (zero-shot learning with both English and language-specific task descriptions) and mT5-IL Chen et al. (2022), a state-of-the-art supervised in-language prompting model for SMiLER. mT5-IL is based on the base version of mT5. Micro F1 scores are used as the performance metric for RE. The results in Table 5 suggest that mT5-IL significantly outperforms ChatGPT over different languages, regardless of whether ChatGPT is prompted in English or in the target language (except for Swedish and Ukrainian). The performance gap is up to 15% in F1 score on average over the languages. Language-specific prompts seem to yield better or comparable performance relative to English prompts for ChatGPT with RE. Ukrainian is an exception, where English prompts return a better F1 score for ChatGPT. Interestingly, ChatGPT performs the worst in English for RE with SMiLER, potentially due to the much larger size of the English test data with greater diversity and challenges (5,461 samples for English vs. 1,243 samples for the second-largest test set, for French).
Natural Language Inference
Natural Language Inference (NLI) aims to predict the entailment/contradiction relations between two input sentences, i.e., a premise and a hypothesis. To evaluate ChatGPT for multilingual NLI, we utilize the XNLI dataset Conneau et al. (2018) that provides annotated data for English and 14 other languages with three categories, i.e., Entailment, Contradiction, and Neutral. The data in non-English languages is obtained by translating the English data for XNLI. XNLI provides development and test data to facilitate development and evaluation. However, as the labels for the test data are not publicly available, we utilize the development data of XNLI in this experiment.
To construct the prompt for ChatGPT for each example in XNLI, we directly concatenate the task description, the premise, the hypothesis, and a multiple-choice question (of entailment, contradiction, and neutral) in this order, i.e., PromptNLI = [task description; premise; hypothesis; question]. An example of English input prompts and responses from ChatGPT is shown in Figure 4.
Results: Table 6 reports the performance (accuracy) of ChatGPT and the multilingual model mT5-XXL Xue et al. (2021). Here, for each non-English target language, we present ChatGPT’s performance in two zero-shot learning settings depending on whether the task descriptions are in English or in the target language. For mT5-XXL, the model is fine-tuned on English training data and translations in the target language to achieve the best reported performance on XNLI. It is clear from the table that ChatGPT performs significantly worse than mT5-XXL across different languages by large margins. The performance gaps between ChatGPT and mT5-XXL also seem smaller for high-resource languages. Finally, ChatGPT with target-language task descriptions produces significantly lower accuracy than with English task descriptions across all considered languages, suggesting the benefits of English descriptions for multilingual NLI with ChatGPT.
Question Answering
Given a context passage and a question, a Question Answering (QA) model needs to return the answer for the question, which should be a span of text in the input passage. To this end, we utilize the XQuAD dataset Artetxe et al. (2020) to evaluate ChatGPT in multiple languages for QA. XQuAD involves 240 paragraphs and 1190 question-answer pairs in English and their translations into ten other languages for evaluation.
We collect the English task description for QA from the NaturalInstructions repository Wang et al. (2022b) for ChatGPT. In addition, as ChatGPT tends to generate long responses, we introduce a note to remind the model that the answers for our dataset should be short and directly extracted from the input passage. This approach has helped ChatGPT to provide more direct answers in our experiments. To this end, for an example with an input passage and question, our prompt for ChatGPT is formed via: PromptQA = [task description; passage; question; note]. We demonstrate an example of the QA prompts in Figure 5.
Given the responses from ChatGPT for our QA prompts for the examples, we remove the period characters at the end and directly evaluate the remaining responses using SQuAD’s scorer (https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py), which is suggested by the original paper of XQuAD Artetxe et al. (2020).
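For readers unfamiliar with this scoring, the sketch below re-implements the familiar SQuAD v1.1-style exact match and token-level F1 after stripping trailing periods. It is written from memory of that scheme rather than copied from the linked script, so details may differ; note in particular that the article removal and whitespace tokenization are English-oriented assumptions.

```python
import re
import string
from collections import Counter
from typing import Sequence


def strip_trailing_period(response: str) -> str:
    """Remove whitespace and period characters at the end of a response."""
    return response.strip().rstrip(".").strip()


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, fix spaces."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction: str, golds: Sequence[str]) -> float:
    """Score against every reference answer and keep the best value."""
    return max(metric(prediction, g) for g in golds)
```

Token-level F1 gives partial credit when a response contains the gold span plus extra words, which matters for a model that tends to produce long answers.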
Results: Table 7 shows the performance of ChatGPT (zero-shot learning) and mT5-XXL Xue et al. (2021), a state-of-the-art supervised learning model for XQuAD. For each language, mT5-XXL is trained over the combination of English training data and the translations to the target language to achieve optimal performance. We report the performance using both the exact match (EM) and F1 scores. Table 7 illustrates that ChatGPT’s zero-shot performance is significantly worse than that of the supervised model mT5-XXL for all the languages. Across different models and prompts, the QA performance for English is significantly better than that for other languages, demonstrating the clear bias toward English of current multilingual language models. Finally, we find that prompting ChatGPT with English tends to produce better performance for multilingual QA than using target languages.
Common Sense Reasoning
Common Sense Reasoning (CSR) evaluates the reasoning of the models via multiple-choice questions. The inputs for the models involve a question and a few choices for the answer, and the models need to select one of the choices. To evaluate ChatGPT’s multilingual abilities for CSR, we leverage two datasets: (i) X-CSQA Talmor et al. (2019); Lin et al. (2021), which involves English data and its translations to 15 other languages, and (ii) Wikipedia Cloze QA from IndicNLPSuite Kakwani et al. (2020), which covers 11 low- and extremely-low-resource Indian languages. We evaluate the models on the dev set of X-CSQA with 1,000 samples for each language, while the Wiki Cloze QA dataset from IndicNLPSuite contains 62,314 samples for all languages.
In the CSR prompts for ChatGPT, we combine the task description, the question, and the multiple choices for each sample, i.e., PromptCSR = [task description; question; multiple choices]. Here, for the task description, we also indicate the language of the input question and multiple choices. Two examples of prompts for CSR inputs are presented in Figure 6 for the X-CSQA dataset and in Figure 7 for the Wikipedia Cloze QA dataset from IndicNLPSuite.
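A minimal sketch of PromptCSR = [task description; question; multiple choices] and of the accuracy measure is given below. The sentence that states the input language, the letter labels for the choices, and the function names are illustrative assumptions. The prompts actually used are those shown in Figures 6 and 7.

```python
from string import ascii_uppercase


def build_csr_prompt(task_description: str, language: str, question: str, choices: list[str]) -> str:
    """Combine task description (with the input language), question, and labeled choices.

    The wording of the language statement and the "A.", "B.", ... labels are
    assumptions made for illustration.
    """
    header = f"{task_description} The question and the choices are in {language}."
    choice_lines = [f"{label}. {choice}" for label, choice in zip(ascii_uppercase, choices)]
    return "\n".join([header, f"Question: {question}", *choice_lines])


def accuracy(predicted_labels: list[str], gold_labels: list[str]) -> float:
    """Fraction of samples whose predicted choice label equals the gold label."""
    if len(predicted_labels) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)
```

Stating the language explicitly in the task description lets the same English instruction be reused across languages while still telling the model what language to expect in the question and choices.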
Results: Table 8 reports the accuracy of ChatGPT (zero-shot learning for both English and language-specific prompts) and the state-of-the-art supervised model TRT Fang et al. (2022) on the X-CSQA dataset. TRT is based on the XLM-RoBERTa large model Conneau et al. (2020) where commonsense knowledge in different sources is retrieved to enrich input questions and answers. Except for English, the table shows that ChatGPT performs worse than TRT across all other languages for CSR on X-CSQA when the English task description is used. Interestingly, in contrast to other tasks, we find that language-specific prompts tend to perform better than English prompts for ChatGPT in CSR for high-resource languages (except for Chinese), leading to some improvement over supervised learning (e.g., for French, Spanish, and Dutch).
For IndicNLPSuite, Table 9 reports the accuracy of ChatGPT and IndicBERT Kakwani et al. (2020), a pre-trained encoder-only model using the ALBERT architecture over a corpus of Indian languages. IndicBERT is fine-tuned on training data to deliver state-of-the-art performance for IndicNLPSuite in the original paper Kakwani et al. (2020). Our experiment results for IndicNLPSuite confirm the general tendency that supervised learning models still perform better than ChatGPT over different languages. However, there are two exceptions, Hindi and Kannada, where ChatGPT achieves better accuracy than the supervised model on IndicNLPSuite. Finally, Table 9 suggests that English prompts are a better way to prompt ChatGPT for Indian languages than these languages themselves (except for Marathi and Gujarati).
Summarization
In summarization, systems need to provide key and concise information for a longer input text, which can be helpful for different downstream applications such as news analysis, marketing, question answering, and scientific document processing. To study the performance of ChatGPT for summarization in multiple languages, we choose the XL-Sum dataset Hasan et al. (2021) that provides summaries of news articles in 44 languages. In contrast to extractive summarization, which selects important sentences from the input text to form a summary, XL-Sum addresses abstractive summarization to allow text generation with more creative writing in the summary (the sentences in the summary might not necessarily appear in the input text). Despite greater challenges, abstractive summarization can produce more natural texts to better serve downstream applications.
To facilitate the experiments, we select 12 languages in XL-Sum, covering high-, medium-, low-, and extremely low-resource languages, and evaluate ChatGPT’s performance on the test datasets of the languages. Table 10 shows the sizes of test data (i.e., the numbers of samples) in XL-Sum for the selected languages. In the experiments, we utilize the ROUGE-1, ROUGE-2, and ROUGE-L scores as performance measures for summarization. Note that for the non-English languages, the scorer script in the original paper of XL-Sum Hasan et al. (2021) is used for performance computation.
As a summary in XL-Sum is expected to be written in the same language as the input text, given an input text, our summarization prompt for ChatGPT is constructed via the concatenation: PromptSUM = [task description; output language specification: input text]. Accordingly, the task description is simply: “Summarize this
Results: Tables 10 and 11 present the summarization performance of ChatGPT (zero-shot learning) for the selected languages in XL-Sum using English and language-specific prompts, respectively. In the tables, we also include the performance of the mT5-XXL model that is trained over training data of specific languages in XL-Sum. mT5-XXL has achieved state-of-the-art performance for XL-Sum as reported in Aharoni et al. (2022). It is obvious from the tables that ChatGPT’s performance is consistently inferior to mT5-XXL’s, with large performance gaps in different languages. To better understand the poor performance of ChatGPT, Tables 10 and 11 also report the average lengths of the human-provided summaries and the summaries generated by ChatGPT (in terms of the numbers of characters). It is clear from the tables that ChatGPT tends to generate lengthy summaries, potentially leading to its poorer performance. In addition, the tables show the success rate of ChatGPT for each language, which is defined as the ratio of requests sent to the ChatGPT server that received non-empty responses/summaries. As can be seen, the success rates of ChatGPT for lower-resource languages are also lower, which can further explain ChatGPT’s performance and reliability for such languages.
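The two diagnostic statistics above can be computed as in the following sketch. It assumes that a failed request is recorded as `None` or as a blank string, and that average length is taken over non-empty summaries only. Both conventions, along with the function names, are assumptions made for illustration.

```python
from typing import Optional, Sequence


def success_rate(responses: Sequence[Optional[str]]) -> float:
    """Share of requests that received a non-empty response.

    Assumption: a failed request is stored as None or as a blank string.
    """
    if not responses:
        return 0.0
    succeeded = sum(1 for r in responses if r is not None and r.strip())
    return succeeded / len(responses)


def average_length(summaries: Sequence[Optional[str]]) -> float:
    """Mean summary length in characters, computed over non-empty summaries only."""
    non_empty = [s for s in summaries if s is not None and s.strip()]
    if not non_empty:
        return 0.0
    return sum(len(s) for s in non_empty) / len(non_empty)
```

Counting characters rather than tokens avoids choosing a tokenizer, which would otherwise differ across the scripts covered by XL-Sum.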
Discussion
The most important finding from our experiment results is that ChatGPT exhibits significantly worse performance than state-of-the-art supervised models for most of the considered NLP tasks in different languages. Given the huge costs to train ChatGPT and similar LLMs as well as the necessity of paid APIs to run large amounts of requests with OpenAI, it seems more reasonable to build smaller task-specific models for NLP problems (or at least for the considered tasks) in different languages that can be hosted locally to serve at lower costs.
In addition, we notice an exception for the POS tagging task where ChatGPT can achieve competitive or even better performance than the supervised learning models (especially with English prompts) over different languages. For instance, ChatGPT has significantly better POS tagging accuracy for Thai, Vietnamese, Bulgarian, Hindi, and Urdu, which are medium- and low-resource languages. In contrast to the other considered tasks, which require some level of semantic reasoning, POS tagging focuses on low-level syntactic analysis. We thus hypothesize that ChatGPT possesses high-level skills in grammar and low-level abilities of semantic reasoning to generate seemingly fluent texts for multiple languages. However, for more complicated semantic analysis, ChatGPT might find it more challenging to perform accurate predictions and generations.
Regarding the classification of high-, medium-, low-, and extremely low-resource languages, our work currently relies on data ratios for the languages in the CommonCrawl corpus. According to our experiments, it is interesting that the performance of ChatGPT for low- and extremely low-resource languages in some tasks is better than or comparable to that for high- or medium-resource languages. For instance, for POS tagging in Table 2, ChatGPT’s performance for Urdu (a low-resource language) is better than the performance for Vietnamese and Thai (high- and medium-resource languages). In NER, ChatGPT achieves better performance for the low-resource language Bengali than for Chinese (using English prompts in Table 3). For the common sense reasoning task in Table 8, ChatGPT’s performance for the extremely low-resource language Swahili is comparable to that for Polish (with English prompts). Similarly, for summarization in Table 11, ChatGPT has better ROUGE scores for the extremely low-resource language Kyrgyz than for Hindi with language-specific prompts. To this end, it seems evident that data size might not be the only factor that dictates the resource level and performance for a task of a language with ChatGPT and LLMs. Among others, the overall picture might also need to consider the target task and the similarities/relations of a language with respect to the dominant languages in the training data for LLMs.
Compared to language-specific prompts, the superior performance of ChatGPT with English task descriptions over a majority of problems and languages suggests that ChatGPT might better understand/analyze the tasks with English prompts, leading to improved abilities to generate responses with accurate outputs. In addition, the inclusion of English task descriptions for non-English inputs can be seen as an approach to shift the representations of language-specific inputs toward the English space that can be better processed by ChatGPT due to the domination of English in its training data. Finally, the better performance with English prompts also raises an interesting question on whether English is the optimal language to prompt ChatGPT or whether it is better to employ other languages for this purpose for different target languages.
Limitations: As ongoing work to evaluate ChatGPT and LLMs on multilingual learning tasks, our current work has several limitations that can be addressed in future studies. First, although our experiments have covered 37 languages, including low- and extremely low-resource languages, there are still many other languages that are not explored in the current work. Some tasks/datasets in our work have not covered lower-resource languages. Future work can expand the language set with a greater focus on lower-resource languages to better understand LLMs’ performance in this important direction. Second, many other tasks, including those with available multilingual datasets, have not been considered in the current work. Examining more tasks and datasets will enable a more comprehensive understanding of ChatGPT and LLMs in multilingual settings. Third, our current work only evaluates ChatGPT in the zero-shot learning setting and is thus unable to show comparisons with other recent multilingual LLMs, e.g., BLOOM Scao et al. (2022), GPT-4, and BARD, in various learning scenarios. While these models are currently less accessible for large-scale evaluations, our plan is to further include more models and learning settings along the way to strengthen our evaluations and comparisons when possible. Finally, the current work only evaluates the models in terms of performance over NLP tasks. To better characterize ChatGPT and LLMs, other evaluation metrics should also be investigated to report more complete perspectives, including but not limited to adversarial robustness, biases, toxic/harmful content, accessibility, development costs, and interpretability.
Conclusion
Toward a more comprehensive understanding of ChatGPT and LLMs on their multilingual learning abilities for NLP, our current work conducts an evaluation for ChatGPT on 7 different tasks, i.e., Part-of-Speech Tagging, Named Entity Recognition, Relation Extraction, Natural Language Inference, Question Answering, Common Sense Reasoning, and Summarization. Using 37 diverse languages with high-, medium-, low-, and extremely low resources for the experiments, our results reveal the less optimal performance of ChatGPT in the zero-shot learning setting for NLP tasks in different languages, advocating for task-specific models to secure the best performance. As this is ongoing research, we plan to extend the experiments to include more languages, tasks, models, criteria, and settings in future work to obtain broader and deeper insights.