Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao

Introduction

Large language models (LLMs), such as GPT-3 Brown et al. (2020) and InstructGPT Ouyang et al. (2022), have swept the natural language processing (NLP) community. Due to their emergent abilities Wei et al. (2022a), these LLMs can achieve impressive few-shot and zero-shot performance in a variety of NLP tasks. More recently, ChatGPT (https://chat.openai.com), developed by OpenAI upon InstructGPT Ouyang et al. (2022), has attracted great attention. Encouragingly, unlike prior public chatbots, ChatGPT is able to generate fluent and comprehensive responses to various human inquiries, and even correct inappropriate human questions.

In light of the conventional wisdom that “GPT-style models work well in generation tasks, but perform poorly for understanding tasks, even worse than the base-sized BERT Devlin et al. (2019)”, we wonder whether there is a similar phenomenon in the ChatGPT scenario. For the generation ability of ChatGPT, several prior studies Jiao et al. (2023); Bang et al. (2023); Wang et al. (2023) have shown that ChatGPT can achieve comparable or even better performance than existing LLMs on several generation tasks. However, it is still unclear whether ChatGPT works well on natural language understanding (NLU) tasks too.

In this report, we provide a systematic study to explore the question: “can ChatGPT understand too”. This question is answered by evaluating ChatGPT on the authoritative and popular GLUE Wang et al. (2019) benchmark, spanning 8 representative understanding tasks, i.e., sentiment analysis, linguistic acceptability, paraphrase, textual similarity, natural language inference, and question answering. For reference, we also compare it with 4 representative BERT-style models. Through a series of experiments and analyses, we find that:

ChatGPT falls short in handling paraphrase and similarity tasks. Specifically, ChatGPT performs poorly in negative paraphrase and neutral similarity samples, respectively.

ChatGPT outperforms all BERT-style models on inference tasks by a large margin, indicating its impressive reasoning ability.

ChatGPT performs comparably to BERT-base on sentiment analysis and question-answering tasks.

Despite its good performance on inference tasks, ChatGPT may generate some contradictory or unreasonable responses, which point to its potential limitations.

Furthermore, in addition to analyzing ChatGPT itself, we also explore the complementarity of ChatGPT and some advanced prompting strategies, i.e., the standard few-shot prompting (also known as in-context learning) Brown et al. (2020), manual few-shot chain-of-thought (CoT) prompting Wei et al. (2022b) and zero-shot CoT prompting Kojima et al. (2022). Empirically, we find that ❶ all these prompting strategies can consistently improve ChatGPT, among which manual-CoT brings the largest performance benefits. Interestingly, we also observe that ❷ the performance of in-context learning is relatively sensitive to the provided examples, especially in the 1-shot scenario, which is similar to the findings of Agrawal et al. (2022). One possible reason is that the performance of in-context learning is (highly) related to the correlation (e.g., similarity) between the provided examples and the test data.

To summarize, the zero-shot performance of ChatGPT is comparable to the baseline fine-tuned BERT-base model. With the help of advanced prompting strategies, ChatGPT shows better understanding ability, and even outperforms the powerful RoBERTa-large model on some NLU tasks. However, there is still a performance gap between ChatGPT and fine-tuned RoBERTa-large in terms of average performance. That said, while ChatGPT could solve many NLP problems quite well, it still fails to beat the current SOTA models He et al. (2021); Wang et al. (2020); Zhong et al. (2022d); Patra et al. (2022); Zhong et al. (2023), especially on some NLU tasks.

The remainder of this report is organized as follows. We present the evaluation settings and comparative results in Section 2. In Section 3, we explore whether ChatGPT can be improved with advanced prompting strategies. In Section 4, we briefly review the related works. Conclusions are described in Section 5.

ChatGPT vs. BERT

In this section, we first introduce the evaluation setting (§2.1), and present the major results (§2.2). Then, some analyses of why ChatGPT performs well or poorly are also provided (§2.3). Lastly, we show some failure examples of ChatGPT to explore its potential limitations (§2.4).

Here, we briefly introduce the evaluation setting, including downstream tasks and datasets, baselines, and prompts for ChatGPT.

Following many prior works Zhong et al. (2022a, 2023), we use the widely-used GLUE benchmark Wang et al. (2019) for model evaluation purposes. As one of the most popular NLU benchmarks, GLUE consists of several challenging NLU tasks, including linguistic acceptability (CoLA, Warstadt et al. (2019)), sentiment analysis (SST-2, Socher et al. (2013)), paraphrase (MRPC, Dolan and Brockett (2005)), textual similarity (STS-B, Cer et al. (2017)), question paraphrase (QQP), textual entailment (MNLI, Williams et al. (2018), RTE, Giampiccolo et al. (2007)) and question-answer entailment (QNLI, Rajpurkar et al. (2016)). Considering the limits of testing ChatGPT, we follow Jiao et al. (2023) and randomly sample a subset of the dev set as the evaluation data for each task. Specifically, since most GLUE tasks are classification tasks (except STS-B, which is a regression task), we randomly sample 25 instances for each class from the dev set. For STS-B, we randomly sample 50 instances from a uniform distribution. Table 1 shows the task descriptions and statistics (more detailed descriptions are given in Appendix A.1).
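The class-balanced sampling described above can be sketched as follows. This is a minimal illustration, not the authors' script: the `(text, label)` layout, the function name `sample_per_class`, the fixed seed, and the toy dev set are all assumptions.

```python
import random
from collections import defaultdict


def sample_per_class(examples, per_class=25, seed=0):
    """Randomly pick up to `per_class` examples for each label.

    `examples` is a list of (text, label) pairs; this layout is an
    assumption made for illustration only.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    subset = []
    for label in sorted(by_label):  # sorted for a reproducible order
        pool = by_label[label]
        subset.extend(rng.sample(pool, min(per_class, len(pool))))
    return subset


# Toy dev set (hypothetical): 40 "acceptable" and 60 "unacceptable" items.
dev = [
    (f"sentence {i}", "acceptable" if i < 40 else "unacceptable")
    for i in range(100)
]
subset = sample_per_class(dev)
```

With two classes this yields 50 evaluation instances, which matches the per-class budget stated in the text for binary tasks.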

For evaluation, we report Accuracy (“Acc.”) for most tasks. The exceptions are the Pearson and Spearman correlations (“Pear./Spea.”) for STS-B and the Matthews correlation (“Mcc.”) for CoLA. We additionally report the F1 score for MRPC and QQP.
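For readers unfamiliar with the classification metrics named above, the sketch below computes accuracy, F1 and the Matthews correlation from binary predictions using their standard definitions. The function names and the toy labels are illustrative assumptions; the report does not say which implementation was used.

```python
import math


def confusion_counts(gold, pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary task."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    tn = sum(1 for g, p in zip(gold, pred) if g != positive and p != positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_score(gold, pred, positive=1):
    tp, _, fp, fn = confusion_counts(gold, pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(gold, pred, positive=1):
    tp, tn, fp, fn = confusion_counts(gold, pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical): 2 true positives, 2 true negatives,
# 1 false positive and 1 false negative.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
```

Unlike accuracy, the Matthews correlation uses all four confusion-matrix cells, so it stays informative when the classes are imbalanced.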

Baselines.

We compare ChatGPT (Jan 31 Version) with 4 representative BERT-style models, as BERT models are commonly used as baselines for evaluating understanding ability Zhong et al. (2022b). Specifically, base-sized and large-sized BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019) are used. All models are fine-tuned on the full training set for each task, where the fine-tuning hyper-parameters are the same as in Zhong et al. (2022c). To estimate the lower bound of ChatGPT’s understanding ability, we mainly focus on the comparison between ChatGPT and the basic base-sized BERT.

Prompts for ChatGPT.

For each task, we design task-specific prompts for triggering the understanding ability of ChatGPT. Specifically, inspired by Jiao et al. (2023), we also ask ChatGPT to generate the prompts for each task, by inputting the following human inquiries:

> provide five concise prompts or templates that can make you deal with the [x] task

where [x] is the task slot. Taking the sentiment analysis task as an example, we show this process in Figure 1. We evaluated ChatGPT on the sentiment analysis task with these five candidate prompts in the preliminary experiments and found only slight performance differences. Thus, for simplicity, we choose one typical prompt for each task and show them in Table 1.
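Filling the task slot is plain string substitution. The sketch below uses the inquiry quoted above verbatim; the constant `META_PROMPT` and the helper `fill_task_slot` are names introduced here for illustration.

```python
# The inquiry text is quoted from the report; the names are assumptions.
META_PROMPT = (
    "provide five concise prompts or templates that can make you "
    "deal with the [x] task"
)


def fill_task_slot(task_name, template=META_PROMPT):
    """Replace the [x] slot with a concrete task name."""
    return template.replace("[x]", task_name)


inquiry = fill_task_slot("sentiment analysis")
```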

2.2 Main Results

The full results on the GLUE benchmark are shown in Table 2. Overall, ChatGPT achieves an average performance comparable to that of BERT-base (78.7% vs. 79.2%), but still underperforms the other powerful BERT-style models (e.g., RoBERTa-large, 87.8%) by a clear margin. These results show that ChatGPT attains a basic understanding ability, but there is still considerable room for improvement.

Specifically, comparing ChatGPT with BERT-base on individual tasks, we find that: 1) ChatGPT performs poorly on the paraphrase and similarity tasks, i.e., MRPC and STS-B, where its score drops by up to 24%. 2) ChatGPT surpasses all BERT-style models on natural language inference tasks, i.e., MNLI and RTE, indicating its superiority in inference/reasoning. 3) ChatGPT is comparable to BERT-base on the single-sentence classification tasks, i.e., sentiment analysis (SST-2) and linguistic acceptability (CoLA), and on QA-related tasks, i.e., QNLI.

2.3 Analysis

As seen in Table 2, ChatGPT works well on inference tasks, but falls short in handling paraphrase and similarity tasks. Here, we investigate how ChatGPT works on these special tasks in detail.

To take a closer look at why ChatGPT achieves impressive performance on inference tasks, we report the per-class accuracy of ChatGPT and the compared models on the MNLI and RTE tasks. The results are shown in Table 3. ChatGPT outperforms BERT-base by a large margin in all settings. In particular, in the “entailment” class, i.e., where the premise entails the hypothesis, ChatGPT even surpasses all the powerful BERT models by a clear margin. These results further demonstrate the effective inference ability of ChatGPT, especially when reasoning over factual input.
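Per-class accuracy, as used in this analysis, is the fraction of correct predictions among the instances of each gold class. A minimal sketch follows; the helper name and the toy labels are assumptions and do not reproduce Table 3.

```python
from collections import defaultdict


def per_class_accuracy(gold, pred):
    """Accuracy computed separately over the instances of each gold label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    return {label: correct[label] / total[label] for label in total}


# Toy predictions (hypothetical), using the three MNLI label names.
gold = ["entailment", "entailment", "neutral", "contradiction", "contradiction"]
pred = ["entailment", "entailment", "entailment", "contradiction", "neutral"]
scores = per_class_accuracy(gold, pred)
```

Splitting accuracy by class exposes asymmetries that a single aggregate number hides, which is what the paraphrase analysis relies on.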

Paraphrase Task.

Similar to the above analysis, we also report the per-class accuracy of ChatGPT and other models on the paraphrasing task, i.e., MRPC, in Table 4. Surprisingly, ChatGPT performs comparably to BERT-base when evaluated on “entailment” samples, but there is a dramatic performance drop (up to 47% score) in the class of “not_entailment”, where the sentences in the pair are not semantically equivalent. This indicates that ChatGPT is not sensitive to the semantic difference between a pair of sentences, which might be related to a lack of human feedback on this aspect during model training.

Similarity Task.

Since STS-B is a regression task, we choose some samples from the uniform similarity distribution, ranging from 0 for no meaning overlap to 5 for meaning equivalence, and show the absolute difference between predictions and ground truths for ChatGPT and BERT-base, respectively. As seen in Figure 2, ChatGPT underperforms BERT-base in most cases, as its predictions are generally far from the ground truths. More specifically, ChatGPT performs worse when the sentences in the pair have a lower similarity (scores below 2.5), which is similar to the observation from Table 4. ChatGPT also has difficulty accurately predicting the similarity score for a pair of sentences around the decision boundary (scores around 2.5). One reason is that ChatGPT is not fine-tuned on the STS-B task and cannot determine a correct decision boundary. As we show in Section 3, ChatGPT can be considerably improved with advanced prompting strategies.
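The comparison above rests on the absolute difference between predicted and gold similarity scores, split at the 2.5 boundary. The sketch below shows one way to summarize it, assuming scores on the 0 to 5 scale; the bucket names, the helper name and the toy numbers are illustrative and are not taken from Figure 2.

```python
def mean_abs_error_by_bucket(gold, pred, boundary=2.5):
    """Mean absolute error for gold scores below vs. at or above `boundary`."""
    low = [abs(g - p) for g, p in zip(gold, pred) if g < boundary]
    high = [abs(g - p) for g, p in zip(gold, pred) if g >= boundary]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {"low": mean(low), "high": mean(high)}


# Toy scores (hypothetical): the predictions overestimate low-similarity pairs.
gold = [0.5, 1.0, 2.0, 3.0, 4.5]
pred = [2.0, 2.5, 3.0, 3.5, 4.5]
errors = mean_abs_error_by_bucket(gold, pred)
```

A larger error in the low bucket than in the high bucket would correspond to the pattern described in the text.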

2.4 Case Study

Here, we show some bad cases of ChatGPT to explore its potential limitations, and attempt to explain why ChatGPT falls short in handling the negative samples of the paraphrasing task.

First, while ChatGPT works well for the inference task, it still fails to make the correct predictions in some cases. As seen in Figure 3, ChatGPT can generate fluent responses to both inquiries due to its powerful generation ability. However, we observe that these responses are somewhat contradictory and even unreasonable. For example, in the upper case, ChatGPT says “...Jane was hungry and that this was the reason for giving candy to Joan,...”, which is very confusing. If Jane were indeed hungry, she would not give the candy to Joan but would eat it herself. There is a similar phenomenon in the lower case, where ChatGPT answers with confused logic. In general, ChatGPT is able to generate fluent responses following a certain pattern, but appears to have limitations in actually reasoning about the sentences. One piece of evidence is that ChatGPT even fails to answer questions, such as the cases in Figure 3, that are easily answered by humans.

On the other hand, some example failures of ChatGPT in the paraphrase task are shown in Figure 4. Both cases are in the class of “not_entailment”. ChatGPT thinks the two sentences have the same semantics, as both sentences describe a decrease (increase) in the value, which can be viewed as a coarse-grained semantic similarity. However, we can easily find that the major difference between the two sentences is the value difference, determining the “not_entailment” polarity of these cases. We refer to this value difference as the fine-grained semantic difference. These cases show that such a discrepancy between coarse-grained and fine-grained semantic information might be one of the reasons why ChatGPT struggles with handling negative samples in the paraphrase task. This also indicates that strengthening the ability of ChatGPT to extract fine-grained semantic information would effectively improve its performance on the paraphrase tasks.

Improving ChatGPT with Advanced Prompting Strategies

As mentioned in Section 2, we mainly focus on the zero-shot learning performance of ChatGPT, and the evaluation results show that there is still a clear margin between ChatGPT and fine-tuned BERT models on some NLU tasks. Inspired by some advanced prompting methods Brown et al. (2020); Wei et al. (2022b); Kojima et al. (2022) that can effectively exploit the capabilities of LLMs, here, we attempt to investigate whether these methods can also improve the understanding ability of ChatGPT and narrow its performance gap with powerful BERT models.
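The three prompting strategies cited above differ only in what is placed before the test input. The sketch below builds each prompt as a string. The `Q:`/`A:` layout and the function names are assumptions. The default trigger sentence is the one proposed by Kojima et al. (2022), and the exact wording used in this report's experiments is not given in this section.

```python
def few_shot_prompt(question, demos):
    """Standard few-shot prompting: `demos` is a list of (question, answer)."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def manual_cot_prompt(question, demos):
    """Manual few-shot CoT: `demos` is a list of
    (question, hand-written rationale, answer)."""
    parts = [f"Q: {q}\nA: {r} So the answer is {a}." for q, r, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def zero_shot_cot_prompt(question, trigger="Let's think step by step."):
    """Zero-shot CoT: no demonstrations, only a reasoning trigger."""
    return f"Q: {question}\nA: {trigger}"


# Hypothetical RTE-style demonstration, for illustration only.
demo_q = "Premise: A man is playing a guitar. Hypothesis: A man plays music. Entailment?"
test_q = "Premise: A dog sleeps. Hypothesis: A dog is running. Entailment?"
standard = few_shot_prompt(test_q, [(demo_q, "yes")])
manual = manual_cot_prompt(
    test_q, [(demo_q, "Playing a guitar is a way of playing music.", "yes")]
)
zero_cot = zero_shot_cot_prompt(test_q)
```

Manual CoT needs a written rationale per demonstration, whereas zero-shot CoT needs no demonstrations at all, which is the practical trade-off between the two.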

❷ In the 1-shot scenario, the performance of ChatGPT is relatively sensitive to the given in-context example.

Despite the overall performance gains in few-shot settings, we find that ChatGPT does not consistently perform better on these NLU tasks, especially in the 1-shot scenario. More specifically, when equipped with the standard 1-shot prompting, ChatGPT even performs worse on some tasks, e.g., CoLA, MRPC, MNLI and RTE. We attribute this to the lower correlation between the randomly sampled in-context example and the test data, as the prior work Agrawal et al. (2022) shows that a single noisy, unrelated example could have a catastrophic impact on output quality. (This might also be the reason why 5-shot prompting generally works better, as concatenating multiple random examples could reduce the effect of noise.) To further verify this conjecture, we use different 1-shot examples to perform the standard 1-shot prompting. Taking the CoLA task as an example, the comparative results are shown in Figure 6. As seen, the 1-shot performance is unstable, and when given a more related 1-shot example, ChatGPT achieves larger performance gains, confirming our statement.
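One simple way to obtain a "more related" 1-shot example is to pick the candidate most similar to the test input. The sketch below uses token-level Jaccard overlap purely as a stand-in. The report does not specify how relatedness was measured, so the similarity function, the helper names and the toy CoLA-style sentences are all assumptions.

```python
def jaccard(a, b):
    """Token-level Jaccard overlap between two sentences (an assumed proxy)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def pick_most_related(test_sentence, candidates):
    """`candidates` is a list of (sentence, label) pairs."""
    return max(candidates, key=lambda ex: jaccard(test_sentence, ex[0]))


def build_one_shot_prompt(test_sentence, example):
    sentence, label = example
    return (
        f"Sentence: {sentence}\nAnswer: {label}\n\n"
        f"Sentence: {test_sentence}\nAnswer:"
    )


# Hypothetical candidate pool and test sentence.
pool = [
    ("The weather was nice yesterday.", "acceptable"),
    ("The book was read by she.", "unacceptable"),
]
test_sentence = "The letter was written by he."
example = pick_most_related(test_sentence, pool)
prompt = build_one_shot_prompt(test_sentence, example)
```

Here the second candidate shares more tokens with the test sentence, so it is chosen as the demonstration. A random draw could just as easily have returned the unrelated first candidate.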

❸ There is still a performance gap between ChatGPT and fine-tuned RoBERTa-large.

With the help of manual-CoT, ChatGPT achieves impressive performance improvements and shows state-of-the-art (SOTA) performance among all comparison models on some tasks, e.g., CoLA, SST-2 and RTE. However, compared with the fine-tuned RoBERTa-large, ChatGPT still underperforms on some tasks by a clear margin, especially on the paraphrase task (MRPC). These results again indicate that, although ChatGPT could solve many NLP problems quite well, it still fails to beat the current SOTA models, especially on some NLU tasks.

☞ Note

Some readers may be concerned that our work could be a kind of “lottery ticket”, as we only evaluate ChatGPT on a part of the validation set for each task. To dispel such doubt, we investigate whether there are similar findings in the full-data setting. Specifically, taking the RTE task as an example, we report the corresponding results of ChatGPT under the few-data and full-data settings, respectively, as shown in Table 3.2. ChatGPT shows similar characteristics (e.g., significantly benefiting from manual-CoT) in both scenarios, indicating the credibility of our work.

Related Works

In recent years, we have witnessed numerous Transformer-based pretrained language models (PLMs) Devlin et al. (2019); Liu et al. (2019); Brown et al. (2020); Raffel et al. (2020); Lewis et al. (2020); Zhong et al. (2022a, 2023) that have achieved tremendous success in various natural language processing (NLP) tasks. Based on the model architectures, these PLMs can be classified into three groups: 1) encoder-only PLMs (e.g., BERT Devlin et al. (2019)), 2) decoder-only PLMs (e.g., GPT-3 Brown et al. (2020)) and 3) encoder-decoder PLMs (e.g., T5 Raffel et al. (2020)). We refer to the encoder-only models as BERT-style models, and the decoder-only models as GPT-style models. Due to their different pretraining objectives, these PLMs exhibit different abilities when performing NLP tasks. Specifically, the BERT-style models are based on a bidirectional masked language modeling (MLM) objective, which forces the models to encode the context information. Through fine-tuning on the specific task, these BERT-style models can work well on a variety of natural language understanding (NLU) tasks. In contrast, the GPT-style models aim to predict future words given a sequence of preceding words. Such auto-regressive models are well suited to language generation, but they are unidirectional and usually fall short in the representation learning needed for understanding the sentence Liu et al. (2021); Zhong et al. (2022a).
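The contrast between the two pretraining objectives can be written compactly. The notation below is introduced here for illustration and does not come from the report: $x = (x_1, \dots, x_T)$ is a token sequence, $M$ is the set of masked positions, $x_{\setminus M}$ is the sequence with those positions masked out, and $\theta$ denotes the model parameters.

```latex
% Masked language modeling (BERT-style): predict each masked token
% from the bidirectional context of unmasked tokens.
\mathcal{L}_{\mathrm{MLM}}(\theta) = - \sum_{i \in M} \log p_\theta\left(x_i \mid x_{\setminus M}\right)

% Auto-regressive language modeling (GPT-style): predict each token
% from its left context only.
\mathcal{L}_{\mathrm{AR}}(\theta) = - \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

The conditioning sets make the difference explicit: the MLM term sees tokens on both sides of position $i$, while the auto-regressive term sees only $x_{<t}$, which is the unidirectionality mentioned above.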

More recently, much work has focused on scaling up PLMs and developing large language models (LLMs) Ouyang et al. (2022); Chowdhery et al. (2022); Smith et al. (2022); Zhang et al. (2022). Wei et al. (2022a) show that LLMs exhibit emergent abilities, e.g., few-shot and zero-shot learning, when the model sizes are large enough. As a typical LLM, the recently released ChatGPT has attracted great attention, due to its impressive ability to generate fluent and high-quality responses. There is growing interest in exploring the capabilities, applications, ethics, and failures of ChatGPT Jiao et al. (2023); Bang et al. (2023); Qin et al. (2023); Zhuo et al. (2023); Wang et al. (2023). Along this line of research, we mainly focus on analyzing the understanding ability of ChatGPT in this report, which is important but has received little attention.

Conclusion

In this study, we empirically investigate the language understanding ability of ChatGPT on a diverse set of natural language understanding tasks. Through a series of quantitative studies, we find that ChatGPT works well on inference tasks, but falls short in handling paraphrase and similarity tasks, especially for the negative instances. Furthermore, we attempt to improve the understanding ability of ChatGPT with some advanced prompting strategies. The results show that with the help of these prompting strategies, ChatGPT can achieve significant performance improvements, and even outperforms the powerful RoBERTa-large on some tasks. Overall, ChatGPT attains an understanding ability comparable to that of some fine-tuned BERT-style models, but still fails to beat the current best models on some NLU tasks. We hope our study could facilitate more research on how to address the limitations and improve the understanding performance of ChatGPT.

Our work has several potential limitations. First, due to the limits of testing ChatGPT, we mainly evaluate ChatGPT on a part of the validation set for each task. It would be more convincing if we could test on more samples. Second, this report only uses the GLUE benchmark for experiments, in which the task types are somewhat limited. In future work, we would like to evaluate ChatGPT on more NLU tasks and conduct more in-depth analyses and discussions.

Appendix A

In this work, we conduct extensive experiments on the GLUE Wang et al. (2019) benchmark. Here, we introduce the detailed descriptions of all downstream tasks and datasets as follows:

CoLA Corpus of Linguistic Acceptability Warstadt et al. (2019) is a binary single-sentence classification task to determine whether a given sentence is linguistically “acceptable”.

SST-2 The Stanford Sentiment Treebank Socher et al. (2013) is a binary classification task to predict the sentiment of a given sentence.

MRPC Microsoft Research Paraphrase Corpus Dolan and Brockett (2005) is a task to predict whether two sentences are semantically equivalent.

STS-B Semantic Textual Similarity Cer et al. (2017) is a task to predict how similar two sentences are on a 0-5 scale in terms of semantic meaning.

QQP The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

MNLI The Multi-Genre Natural Language Inference Corpus Williams et al. (2018) is a task to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither, given a premise sentence and a hypothesis sentence.

QNLI Question Natural Language Inference is a binary classification task constructed from SQuAD Rajpurkar et al. (2016), which aims to predict whether a context sentence contains the answer to a question sentence.

RTE Recognizing Textual Entailment Giampiccolo et al. (2007) is a task to predict, given a premise and a hypothesis, whether the premise entails the hypothesis.

A.2 Input Examples

Here, we present input examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompting used with ChatGPT. Tables 7 to 14 show the detailed examples for each task of the GLUE benchmark.