ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang

cs.CL

Introduction

Pre-trained language models such as ELMo, GPT, BERT, and ERNIE have proved effective for improving the performance of various natural language processing tasks, including sentiment classification, natural language inference, text summarization, named entity recognition and so on. In general, pre-trained language models are learned on a large amount of text data in a self-supervised manner, and then fine-tuned on downstream tasks or directly deployed through zero/few-shot learning without task-specific fine-tuning. Such pre-trained language models have become the new paradigm for natural language processing tasks.

Over the past year or two, an important trend in pre-trained language models has been their increasing model size, which leads to lower perplexity in pre-training and better performance on downstream tasks. Megatron-LM, with one billion parameters, was proposed for language understanding using a simple but efficient intra-layer model parallel approach, and achieved state-of-the-art results on several datasets. T5 explored the limits of pre-trained models with 10 billion parameters, but the record was soon broken by the GPT-3 model with 175 billion parameters, which performs well under few-shot and even zero-shot settings. Soon afterwards, Switch-Transformer was proposed as the world’s first trillion-parameter pre-trained language model.

However, these large-scale pre-trained language models with hundreds of billions of parameters are trained on plain texts. For example, the 175-billion-parameter GPT-3 is trained on a corpus of 570GB of filtered text from Common Crawl. Such raw texts lack explicit representation of knowledge such as linguistic knowledge and world knowledge. In addition, most large-scale models are trained in an auto-regressive way, but prior work shows that such models perform worse with traditional fine-tuning when adapted to downstream language understanding tasks.

In this work, to solve the problem caused by a single auto-regressive framework and to explore the performance of knowledge enhanced pre-trained models with large-scale parameters, we propose a unified framework called ERNIE 3.0 to train large-scale knowledge enhanced models on a 4TB corpus consisting of plain texts and a large-scale knowledge graph by fusing the auto-regressive network and the auto-encoding network. The proposed ERNIE 3.0 can handle both natural language understanding tasks and natural language generation tasks through zero-shot learning, few-shot learning or fine-tuning. Furthermore, the proposed framework supports the introduction of various customized tasks at any time. These tasks share the same encoding networks and are trained through multi-task learning. This method makes the encoding of lexical, syntactic and semantic information across different tasks possible. Moreover, when given a new task, our framework could incrementally train the distributed representations based on the previous training parameters, with no need to train them from scratch.

In summary, our contributions are as follows:

We propose a unified framework ERNIE 3.0, which combines auto-regressive network and auto-encoding network so that the trained model can handle both natural language understanding and generation tasks through zero-shot learning, few-shot learning or fine-tuning.

We pre-train large-scale knowledge enhanced models with 10 billion parameters and evaluate them with a series of experiments on both natural language understanding and natural language generation tasks. Experimental results show that ERNIE 3.0 consistently outperforms the state-of-the-art models on 54 benchmarks by a large margin and achieves first place on the SuperGLUE benchmark.

Related Work

Since BERT was proposed as a powerful language model for natural language understanding, pre-trained language models have attracted more and more attention and become the new paradigm for natural language processing. One of the research trends is increasing model size, which leads to lower perplexity and better performance. As a result, many large-scale pre-trained models have been proposed in the past two years. The T5 model was proposed to push the performance of both natural language understanding and natural language generation tasks with 11 billion parameters. T5 converts all text-based language tasks into a text-to-text format within a unified framework and fully explores the effectiveness of pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors. After T5, GPT-3, which includes 175 billion parameters, was proposed and achieves impressive performance on a wide range of tasks under few-shot and zero-shot settings. Specifically, GPT-3 is an auto-regressive language model, 10x more than its predecessor, GPT-2. However, GPT-3 shows a lack of common sense and exhibits biases and privacy issues in tests. Switch Transformer, a 1-trillion-parameter model, was proposed with a simplified MoE routing algorithm to improve the model with lower communication and computational costs, together with a large-scale distributed training solution to tackle training complexity, communication costs, and training instability.

Besides the models mentioned above, more non-English large models have been proposed recently. The Chinese Pre-trained Language Model (CPM), with 2.6 billion parameters, was released with generative pre-training on large-scale Chinese training data, and its model structure was inspired by prior work. CPM-2, an 11-billion-parameter model, has also been released. To accelerate pre-training based on existing PLMs instead of training models from scratch, knowledge inheritance techniques were introduced, and during the fine-tuning stage, prompt tuning is used to better exploit the knowledge within the pre-trained model. M6 (Multi-Modality to Multi-Modality Multitask Mega-Transformer), a cross-modal pre-training method with 100 billion parameters, was proposed for unified pre-training on data of multiple modalities. PanGu-$\alpha$, a 200-billion-parameter auto-regressive language model, was trained on a cluster of 2048 Ascend 910 AI processors with distributed training techniques including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and re-materialization. In addition to these Chinese large-scale models, a Korean 204-billion-parameter language model named HyperCLOVA has been proposed, and its volume of machine-learned data in Korean was 6,500 times larger than GPT-3’s. Taken together, these developments suggest that large-scale pre-trained models have attracted more and more attention from industry and academia.

Knowledge Enhanced Models

Pre-trained language models capture syntactic and semantic knowledge from large-scale corpora, but lack world knowledge. Recently, several works have attempted to incorporate world knowledge into pre-trained language models. The typical form of world knowledge is a knowledge graph. Many works integrate entity and relation embeddings from knowledge graphs into pre-trained language models. WKLM replaced entity mentions in the original documents with names of other entities of the same type and trained the models to distinguish the correct entity mention from randomly chosen ones. KEPLER optimized the models with knowledge embedding and masked language model objectives to align world knowledge and language representation in the same semantic space. CoLAKE integrated the language context and the knowledge context in a word-knowledge graph and jointly learned contextualized representations for language and knowledge with an extended masked language model objective. Another existing form of world knowledge is the extra annotation of large-scale data. ERNIE 1.0 introduced phrase masking and named entity masking, predicting the whole masked phrases and named entities to help the model learn dependency information in both local and global contexts. CALM taught models to detect and revise a corrupted sentence with incorrectly ordered concepts and to distinguish truthful sentences from less plausible ones via two kinds of self-supervised pre-training tasks. K-Adapter used adapters trained on different knowledge sources with extra annotations to distinguish where the knowledge comes from.

ERNIE 3.0

Significant improvements have been achieved on various natural language processing tasks by knowledge enhanced pre-trained models of base or large size, such as ERNIE, ERNIE 2.0 and SpanBERT, where the base and large model sizes correspond to 12-layer and 24-layer Transformers respectively. To explore the effectiveness of knowledge enhanced large-scale pre-trained models, we propose the ERNIE 3.0 framework to pre-train models on a massive unsupervised corpus including plain texts and a knowledge graph. Furthermore, we employ various types of pre-training tasks to enable the model to learn different levels of knowledge, consisting of valuable lexical, syntactic and semantic information, more effectively. These pre-training tasks span three task paradigms: natural language understanding, natural language generation and knowledge extraction. Therefore, ERNIE 3.0 introduces a Continual Multi-Paradigms Unified Pre-training Framework to enable collaborative pre-training among multiple task paradigms. ERNIE 3.0 is described in detail in the following sections.

The framework of ERNIE 3.0 is shown in Figure 1; it can be widely used for pre-training, fine-tuning and zero/few-shot learning. Unlike the prevalent unified pre-training strategy of employing a shared Transformer network for different well-designed cloze tasks and using specific self-attention masks to control what context the prediction conditions on, ERNIE 3.0 designs a new Continual Multi-Paradigms Unified Pre-training Framework. We believe that the different task paradigms of natural language processing consistently depend on identical underlying abstract features, such as lexical and syntactic information, but that their requirements for top-level concrete features are incompatible: natural language understanding tasks tend to learn semantic coherence, while natural language generation tasks expect further contextual information. Therefore, inspired by the classical model architecture of multi-task learning, in which the lower layers are shared across all tasks while the top layers are task-specific, we propose ERNIE 3.0 to enable the different task paradigms to share the underlying abstract features learned in a shared network and to use the task-specific top-level concrete features learned in their own task-specific networks. Furthermore, to help the model efficiently learn lexical, syntactic and semantic representations, ERNIE 3.0 exploits the continual multi-task learning framework introduced in ERNIE 2.0. To apply the model to different kinds of downstream tasks, we first initialize ERNIE 3.0 with the combination of the parameters of the pre-trained shared network and the corresponding task-specific networks for the relevant task paradigms, and then execute the corresponding follow-up procedure using data from the specific tasks.

We refer to the backbone shared network and the task-specific networks as the Universal Representation Module and the Task-specific Representation Modules in ERNIE 3.0. Specifically, the universal representation network acts as a universal semantic feature extractor (for example, it can be a multi-layer Transformer) whose parameters are shared across all task paradigms, including natural language understanding, natural language generation and so on. The task-specific representation networks extract the task-specific semantic features, and their parameters are learned by task-specific objectives. ERNIE 3.0 not only enables the model to distinguish the task-specific semantic information among different task paradigms, but also mitigates the difficulty of deploying large-scale pre-trained models with limited time and hardware resources, because it permits updating only the parameters of a task-specific representation network during the fine-tuning phase. Specifically, ERNIE 3.0 employs a collaborative architecture of one Universal Representation Module and two Task-specific Representation Modules, namely the natural language understanding (NLU) specific representation module and the natural language generation (NLG) specific representation module.

ERNIE 3.0 uses a multi-layer Transformer-XL as the backbone network, like other pre-trained models such as XLNet, Segatron and ERNIE-Doc. Transformer-XL is similar to the Transformer but introduces an auxiliary recurrence memory module to help model longer texts. We refer to the backbone as the Universal Representation Module, and it is shared across all task paradigms. As is well known, the Transformer can capture the contextual information for each token in the sequence via self-attention and generate a sequence of contextual embeddings. The larger the scale of the Transformer model, the stronger its capacity to capture and store semantic information at different levels. Therefore, ERNIE 3.0 sets the universal representation module to a larger size to enable the model to effectively capture universal lexical and syntactic information from the training data by learning various pre-training tasks of different paradigms. Note that the memory module is only valid for natural language generation tasks while controlling the attention mask matrices.

Task-specific Representation Module

Similar to the basic shared representation module, the task-specific representation module is also a multi-layer Transformer-XL, which is used to capture the top-level semantic representations for different task paradigms. ERNIE 3.0 sets the task-specific representation module to a manageable size, namely the base model size, instead of the multi-layer perceptron or shallow Transformer commonly used in multi-task learning. This brings three benefits. First, the base network has a stronger ability to capture semantic information than a multi-layer perceptron or a shallow Transformer. Second, task-specific networks of base model size enable ERNIE 3.0 to distinguish the top-level semantic information among different task paradigms without significantly increasing the parameters of a large-scale model. Finally, because a task-specific network is smaller than the shared network, fine-tuning only the task-specific representation module makes practical applications of the large-scale pre-trained model feasible. ERNIE 3.0 constructs two task-specific representation modules, the NLU-specific representation module and the NLG-specific representation module, in which the former is a bi-directional modeling network while the latter is a uni-directional modeling network.
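The difference between bi-directional and uni-directional modeling can be illustrated with attention masks. The following is only a minimal sketch: the function name `attention_mask` and the 0/1 list-of-lists representation are assumptions made for illustration, not the paper's implementation, and the Transformer-XL memory is not modeled.

```python
def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len 0/1 matrix.

    mask[i][j] == 1 means position i may attend to position j.
    (Hypothetical helper; the representation is an assumption.)
    """
    return [
        [1 if (bidirectional or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# NLU-style mask: every token sees every token in the sequence.
nlu_mask = attention_mask(4, bidirectional=True)

# NLG-style mask: token i sees only tokens 0..i (left-to-right).
nlg_mask = attention_mask(4, bidirectional=False)

for row in nlg_mask:
    print(row)
```

In this sketch the two task-specific modules would differ only in which mask they receive; everything below them is shared.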

Pre-training Tasks

We construct several tasks for various task paradigms to capture different aspects of information in the training corpora and to give the pre-trained model the capacity for understanding, generation and reasoning.

Knowledge Masked Language Modeling. ERNIE 1.0 proposed an effective strategy to enhance representation through knowledge integration, namely the Knowledge Integrated Masked Language Modeling task. It introduced phrase masking and named entity masking, which predict the whole masked phrases and named entities to help the model learn the dependency information in both local contexts and global contexts.
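As an illustration of span-level masking, the sketch below masks whole phrase or entity spans instead of individual tokens. It assumes the spans are already given by some upstream phrase or entity detector; the function name `knowledge_mask`, the `[MASK]` symbol, and the default masking probability of 0.15 are hypothetical choices, not values taken from the paper.

```python
import random

def knowledge_mask(tokens, spans, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask whole spans (phrases or named entities) rather than single tokens.

    spans: list of (start, end) index pairs, end exclusive, assumed non-overlapping.
    Returns (masked_tokens, targets), where targets maps position -> original token.
    mask_prob is an assumed hyperparameter, not a value from the paper.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    targets = {}
    for start, end in spans:
        # Decide once per span, so a phrase or entity is masked as a unit.
        if rng.random() < mask_prob:
            for i in range(start, end):
                targets[i] = tokens[i]
                masked[i] = mask_token
    return masked, targets

# Hypothetical example: two annotated spans in an 8-token sentence.
tokens = ["Harry", "Potter", "is", "a", "series", "of", "fantasy", "novels"]
spans = [(0, 2), (6, 8)]
masked, targets = knowledge_mask(tokens, spans, mask_prob=1.0)
print(masked)
```

Because the decision is made per span, the model must predict all tokens of a masked phrase or entity together, which is the property the task relies on.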

Document Language Modeling. Generative pre-training models usually use a traditional language model (such as GPT and GPT-2) or a sequence-to-sequence language model (such as BART, T5 and ERNIE-GEN) as the pre-training task; the latter trains on a network with an auxiliary decoder structure. ERNIE 3.0 opts for the traditional language model as the pre-training task to reduce network complexity and improve the effectiveness of unified pre-training. In addition, to enable the NLG network of ERNIE 3.0 to model longer text, we introduce the Enhanced Recurrence Memory Mechanism proposed in ERNIE-Doc, which can model a larger effective context length than the traditional recurrence Transformer by changing the shifting-one-layer-downwards recurrence to same-layer recurrence.

Structure-aware Pre-training Tasks

Sentence Reordering. The sentence reordering task, introduced in ERNIE 2.0, aims to train the model to learn the relationship between sentences by reorganizing permuted segments. In detail, a given paragraph is randomly split into 1 to $m$ segments during pre-training, and the segments are shuffled into a random permutation. Then, the pre-trained model is asked to reorganize these permuted segments, modeled as a $k$-class classification problem where $k=\sum_{n=1}^{m}n!$.
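To make the class count concrete, the short sketch below evaluates $k=\sum_{n=1}^{m}n!$ for small $m$. The function name `num_reordering_classes` is hypothetical; the formula is the one given above.

```python
from math import factorial

def num_reordering_classes(m):
    """Number of classes k = sum_{n=1}^{m} n! when a paragraph
    is split into between 1 and m segments."""
    return sum(factorial(n) for n in range(1, m + 1))

# With up to 3 segments: 1! + 2! + 3! = 1 + 2 + 6 = 9 classes.
print(num_reordering_classes(3))
# With up to 4 segments: 9 + 4! = 33 classes.
print(num_reordering_classes(4))
```

The count grows factorially, which is why $m$ has to stay small for the classification head to remain manageable.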

Sentence Distance. The sentence distance task, an extension of the traditional next sentence prediction (NSP) task, is widely used in various pre-trained models to enhance their ability to learn sentence-level information. It can be modeled as a 3-class classification problem, in which the three categories indicate that the two sentences are adjacent, nonadjacent but in the same document, or from two different documents, respectively.
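A minimal sketch of how the three labels could be assigned is shown below. It assumes each sentence is identified by a (document id, sentence index) pair; the function name and the particular label numbering (0, 1, 2) are assumptions for illustration only.

```python
def sentence_distance_label(a, b):
    """Assign a 3-class sentence distance label.

    a, b: (doc_id, sentence_index) tuples for two distinct sentences.
    Label numbering is an assumption:
      0 = adjacent in the same document
      1 = same document, not adjacent
      2 = different documents
    """
    if a[0] != b[0]:
        return 2
    if abs(a[1] - b[1]) == 1:
        return 0
    return 1

print(sentence_distance_label(("doc1", 3), ("doc1", 4)))  # adjacent
print(sentence_distance_label(("doc1", 3), ("doc1", 9)))  # same document
print(sentence_distance_label(("doc1", 3), ("doc2", 4)))  # different documents
```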

Knowledge-aware Pre-training Tasks

Universal Knowledge-Text Prediction. To incorporate knowledge into one pre-trained language model, we introduce the universal knowledge-text prediction (UKTP) task, which is an extension of knowledge masked language modeling. While knowledge masked language modeling only requires unstructured texts, the universal knowledge-text prediction task requires both unstructured texts and knowledge graphs. The task is illustrated in Figure 2. Given a pair consisting of a triple from the knowledge graph and the corresponding sentence from the encyclopedia, we randomly mask the relation in the triple or words in the sentence. To predict the relation in the triple, the model needs to detect the mentions of the head entity and the tail entity and determine the semantic relationship that holds between them in the corresponding sentence. The essence of this process is similar to the distant supervision algorithm in relation extraction tasks, which assumes that if two entities participate in a relation, any sentence that contains those two entities might express that relation. Meanwhile, to predict words in the corresponding sentence, the model considers not only the dependency information in the sentence but also the logical relationship in the triple. Specifically, the procedure for obtaining pairs of a triple and its corresponding sentence is as follows: given a document from the encyclopedia, we first find the candidate triples in the knowledge graph whose head entity or tail entity mention is the title of the document, and then select from the candidates those triples whose head entity and tail entity are both mentioned in the same sentence of the document.
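The pair-construction procedure described above can be sketched as follows. This is only an illustration under simplifying assumptions: entity mentions are matched by plain substring search, triples are `(head, relation, tail)` tuples, and the function names and example data are hypothetical rather than taken from the paper's pipeline.

```python
def select_triple_sentence_pairs(title, sentences, triples):
    """Pair knowledge-graph triples with sentences of one encyclopedia document.

    Step 1: keep triples whose head or tail entity equals the document title.
    Step 2: keep (triple, sentence) pairs where the sentence mentions both
            the head and the tail entity (substring match is an assumption).
    """
    candidates = [t for t in triples if title in (t[0], t[2])]
    pairs = []
    for head, relation, tail in candidates:
        for sentence in sentences:
            if head in sentence and tail in sentence:
                pairs.append(((head, relation, tail), sentence))
    return pairs

def mask_relation(triple, mask_token="[MASK]"):
    """Hide the relation so the model has to predict it from the sentence."""
    head, _, tail = triple
    return (head, mask_token, tail)

# Hypothetical document and knowledge-graph fragment.
title = "Nightingale"
sentences = [
    "The Nightingale is a fairy tale written by Andersen.",
    "It is one of his best-known stories.",
]
triples = [
    ("Andersen", "Write", "Nightingale"),
    ("Andersen", "BornIn", "Odense"),
]

pairs = select_triple_sentence_pairs(title, sentences, triples)
print(pairs)
print(mask_relation(pairs[0][0]))
```

During pre-training, either the relation would be masked as in `mask_relation`, or words in the paired sentence would be masked instead; the sketch shows only the first option.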

ERNIE 3.0 trains the NLU network through knowledge masked language modeling to improve the capacity of capturing the lexical information, trains the sentence reordering task and the sentence distance discerning task to strengthen the ability of capturing the syntactic information, and finally optimizes the model with the universal knowledge-text prediction task to improve knowledge memorization and reasoning. Meanwhile, ERNIE 3.0 trains the NLG network with the document language modeling task to enable various generation styles.

Pre-training Process

Progressive training was originally proposed to improve stability; it starts from an efficient, small model and gradually increases the capacity. Recent studies leverage this paradigm to accelerate model training. As large-scale pre-training keeps advancing the state of the art, its overwhelming computational consumption becomes the major burden on further developing more powerful models. Preliminary applications of progressive training have been made to Transformer pre-training. BERT designs a two-stage training with a reduced sequence length for the first 90% of updates. Other work gradually increases the batch size linearly from a small value to the full value. It has also been observed that changing the regularization factors stage-wise with respect to the input size can speed up network training. To further improve the convergence speed of the training process, we propose to adjust the training regularization factors in a more comprehensive and smooth way by progressively and simultaneously increasing the training factors, including the input sequence length, the batch size, the learning rate and the dropout rate. In fact, Transformer models commonly adopt a learning rate warm-up strategy to increase training stability, and our improved progressive learning strategy is compatible with the existing strategy.
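One way to picture increasing all four factors together is a shared linear ramp, sketched below. The final values for sequence length (512), batch size (6144), learning rate (1e-4) and the 10,000-step window are those reported in the pre-training settings; the starting values, the final dropout rate of 0.1, and the linear shape of the ramp are assumptions made purely for illustration.

```python
def progressive_value(step, warmup_steps, start, end):
    """Linearly interpolate from start to end over warmup_steps, then hold."""
    frac = min(max(step, 0), warmup_steps) / warmup_steps
    return start + frac * (end - start)

def progressive_factors(step, warmup_steps=10_000):
    """Training factors at a given step.

    Start values (128, 512, 0.0, 0.0) and the final dropout (0.1)
    are assumed; only the other end values come from the paper.
    """
    return {
        "seq_len": int(progressive_value(step, warmup_steps, 128, 512)),
        "batch_size": int(progressive_value(step, warmup_steps, 512, 6144)),
        "learning_rate": progressive_value(step, warmup_steps, 0.0, 1e-4),
        "dropout": progressive_value(step, warmup_steps, 0.0, 0.1),
    }

for step in (0, 5_000, 10_000):
    print(step, progressive_factors(step))
```

After the ramp ends, all factors stay at their full values, so the schedule composes naturally with a conventional learning rate warm-up followed by decay.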

Pre-training Data

To ensure the success of the pre-training of ERNIE 3.0, we construct a large-scale, wide-variety and high-quality Chinese text corpus amounting to 4TB of storage in 11 different categories. To the best of our knowledge, this is currently the largest Chinese pre-training corpus, compared with CLUECorpus2020 (100GB), Chinese multi-modal pre-training data (300GB), WuDaoCorpus2.0 used by CPM-2 (2.3TB of Chinese data and 300GB of English data) and PanGu Corpus (1.1TB).

In detail, we build the corpus for ERNIE 3.0 based on that of ERNIE 2.0 (including baike, wikipedia, feed, etc.), Baidu Search (including Baijiahao, Zhidao, Tieba and Experience), Web text, QA-long, QA-short, Poetry (https://www.luge.ai/text-generation/chinese-poetry.html#_1-chinese-poetry) & Couplet (https://github.com/v-zich/couplet-clean-dataset), domain-specific data from the medical, law and financial areas, and the Baidu knowledge graph with more than 50 million facts. To improve the data quality, we adopt the following pre-processing strategies:

Deduplication is conducted at different granularities, including the character level, paragraph level and document level. At the character level, we replace consecutive identical characters (i.e., spaces, tabs, exclamation marks, question marks, etc.) with one single character. At the paragraph level, we replace two identical consecutive paragraphs consisting of $N$ sentences with one single paragraph, where $0<N<100$. These two deduplication strategies are critical for ERNIE 3.0 to generate non-repeating content. Finally, we adopt the Message Digest Algorithm 5 (MD5) to filter duplicate documents by comparing the sum of the MD5 values of the top-3 longest sentences from each document.

Sentences with fewer than 10 words are filtered, since they may be problematic or incomplete and contain limited semantic information for model pre-training.

We further conduct sentence segmentation using regular expressions and word segmentation based on Baidu’s word segmentation tool. This helps ERNIE 3.0 to learn better sentence boundary and named entity knowledge during pre-training.

Then, each dataset is multiplied by a user-defined multiplier number to increase the data diversity after truncating the data for NLU-network pre-training.
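The three deduplication levels described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function names are hypothetical, the character set is limited to the four characters named in the text, the sentence-count condition $0<N<100$ on paragraphs is omitted, and interpreting "the sum of the MD5" as the sum of the digests read as integers is an assumption.

```python
import hashlib
import re

def collapse_repeated_chars(text):
    """Character level: replace runs of the same space, tab, '!' or '?'
    with a single occurrence."""
    return re.sub(r"([ \t!?])\1+", r"\1", text)

def dedup_consecutive_paragraphs(paragraphs):
    """Paragraph level: drop a paragraph identical to the one before it."""
    kept = []
    for paragraph in paragraphs:
        if not kept or kept[-1] != paragraph:
            kept.append(paragraph)
    return kept

def doc_fingerprint(sentences):
    """Document level: sum of the MD5 digests (as integers, an assumption)
    of the three longest sentences."""
    top3 = sorted(sentences, key=len, reverse=True)[:3]
    return sum(int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) for s in top3)

def dedup_documents(docs):
    """Keep the first document for each fingerprint."""
    seen, kept = set(), []
    for sentences in docs:
        fp = doc_fingerprint(sentences)
        if fp not in seen:
            seen.add(fp)
            kept.append(sentences)
    return kept

print(collapse_repeated_chars("Wow!!!  really??"))
print(dedup_consecutive_paragraphs(["p1", "p1", "p2", "p1"]))
docs = [["aaaa", "bbb", "cc", "d"], ["cc", "bbb", "aaaa", "e"], ["xxxx", "yyy", "zz"]]
print(len(dedup_documents(docs)))
```

Because the fingerprint is a sum, it does not depend on the order in which the three longest sentences appear, so reordered near-duplicates collapse to the same key.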

Pre-training Settings

Both the universal representation module and the task-specific representation modules of ERNIE 3.0 use the Transformer-XL structure as the backbone. For the universal representation module, we adopt a structure with 48 layers, 4096 hidden units and 64 heads. For the task-specific representation modules, we adopt a structure with 12 layers, 768 hidden units and 12 heads. The total number of parameters of the universal representation module and the task-specific representation modules is 10 billion. The activation function used is GeLU. The maximum sequence length of the context and the memory length of language generation are set to 512 and 128, respectively. The total batch size of all pre-training tasks is set to 6144. We use Adam with a learning rate of 1e-4, $\beta_{1}=0.9$, $\beta_{2}=0.999$, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps and linear decay of the learning rate. In the first 10,000 steps, we also use progressive learning to speed up convergence in the initial stage of pre-training. The model is trained for a total of 375 billion tokens with 384 NVIDIA V100 GPU cards and is implemented on the PaddlePaddle framework. By virtue of parameter sharding, we manage to reduce the memory usage of our model and address the problem of the total parameters of the model exceeding the memory of a single GPU card.
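As a rough sanity check on the 10-billion figure, the sketch below estimates the weight count from the layer and hidden sizes. It assumes a standard Transformer layer with four attention projections and a feed-forward block of width 4 times the hidden size, and it ignores biases, layer normalization, embeddings and the Transformer-XL relative-position parameters; the function name and the feed-forward multiplier are assumptions, not details from the paper.

```python
def transformer_params(layers, hidden, ffn_mult=4):
    """Approximate weight count of a Transformer stack.

    attn: Q, K, V and output projections, each hidden x hidden.
    ffn:  two matrices between hidden and ffn_mult * hidden (assumed width).
    """
    attn = 4 * hidden * hidden
    ffn = 2 * ffn_mult * hidden * hidden
    return layers * (attn + ffn)

universal = transformer_params(48, 4096)      # shared backbone
task_specific = transformer_params(12, 768)   # one NLU or NLG module
total = universal + 2 * task_specific

print(f"universal:     {universal / 1e9:.2f}B")
print(f"task-specific: {task_specific / 1e6:.1f}M each")
print(f"total:         {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands a little under 10 billion, with the two base-size task-specific modules contributing less than 2% of the total, which is consistent with the claim that they add little to the overall model size.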

Experiments

We compare the performance of ERNIE 3.0 with the state-of-the-art pre-training models through fine-tuning on both natural language understanding tasks (in Sec. 4.2.1) and natural language generation tasks (in Sec. 4.2.2), and zero-shot learning (in Sec. 4.3). The previous state-of-the-art results are all from the public single models that we can find. The previous SoTA results of ERNIE 2.0 and RoBERTa-wwm-ext on corresponding datasets are reproduced by ourselves, except for the datasets that already have released pre-trained results.

We executed extensive experiments on 54 NLP tasks to evaluate the fine-tuning and zero-shot learning performances of the models.

45 datasets belonging to 14 kinds of natural language understanding tasks are used in our experiments, as follows:

Sentiment Analysis: NLPCC2014-SC (http://tcci.ccf.org.cn/conference/2014/pages/page04_dg.html), SE-ABSA16_PHNS (http://alt.qcri.org/semeval2016/task5/), SE-ABSA16_CAME, BDCI2019 (https://www.datafountain.cn/competitions/350).

Opinion Extraction: COTE-BD, COTE-DP, COTE-MFW.

Natural Language Inference: XNLI, OCNLI, CMNLI.

Event Extraction: CCKS2020 (http://sigkg.cn/ccks2020/?page_id=69).

Semantic Similarity: AFQMC, LCQMC, CSL, PAWS-X, BQ Corpus.

Chinese News Classification: TNEWS (https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset), IFLYTEK, THUCNEWS (http://thuctc.thunlp.org/), CNSE, CNSS.

Closed-Book Question Answering: NLPCC-DBQA (http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf), CHIP2019, cMedQA, cMedQA2, CKBQA (https://github.com/pkumod/CKBQA), WebQA.

Named Entity Recognition: CLUENER, Weibo, OntoNotes, CCKS2019 (https://www.biendata.xyz/competition/ccks_2019_1/).

Machine Reading Comprehension: CMRC 2018, CMRC2019, DRCD, DuReader, DuReader${}_{\text{robust}}$, DuReader${}_{\text{checklist}}$, DuReader${}_{\text{yesno}}$ (https://aistudio.baidu.com/aistudio/competition/detail/49/?isFromLUGE=TRUE), C3, CHID.

Legal Documents Analysis: CAIL2018-Task1, CAIL2018-Task2.

Cant Understanding: DogWhistle Insider, DogWhistle Outsider.

Natural Language Generation Tasks

9 datasets belonging to 7 kinds of natural language generation tasks are used in our experiments, as follows:

Question Generation: KBQG (http://tcci.ccf.org.cn/conference/2017/dldoc/taskgline05.pdf), DuReader-QG, DuReader${}_{\text{robust}}$-QG.

Closed-Book Question Answering: MATINF-QA.

Experiments on Fine-tuning Tasks

The results of natural language understanding tasks are reported in Table 2.

Sentiment Analysis. Sentiment analysis is a classification task aiming to determine whether a sentence is positive, negative, or neutral. We consider 4 datasets from different domains, including shopping (NLPCC2014-SC), electronics (SE-ABSA16_PHNS, SE-ABSA16_CAME), and finance (BDCI2019). ERNIE 3.0 achieves a substantial improvement on all four datasets.

Opinion Extraction. Similar to the sentiment analysis task, opinion extraction requires the model to mine the opinion of a sentence. We use 3 sub-datasets from Chinese Customer Review (COTE). Experiment results show that ERNIE 3.0 also outperforms the current SoTA system by a great margin.

Natural Language Inference. Natural language inference is the task of determining whether a given premise semantically entails a hypothesis. We use the OCNLI and XNLI datasets. The results indicate that ERNIE 3.0 achieves accuracy improvements of 3.9 and 0.7 on the two datasets, respectively. The improvement on the XNLI dataset is quite limited, which may be due to the poor quality of the dataset, since XNLI is translated from English.

Winograd Schemas Challenge. WSC2020 is an anaphora resolution task where the model is asked to decide whether a pronoun and a noun in a sentence co-refer. ERNIE 3.0 achieves a significant improvement of 25.7 points.

Relation Extraction. The task of relation extraction is to identify the relationship between different entities like persons and organizations. We consider FinRE and SanWen – two relation extraction datasets for financial news and Chinese literature respectively. ERNIE 3.0 outperforms the previous SoTA model by 2.46 points on average.

Event Extraction. Similar to relation extraction, the event extraction task aims to identify the event entities and classify them into different categories. We choose CCKS2020, a text-level event subject extraction dataset in the financial field. ERNIE 3.0 achieves an improvement of 3 points on the test set.

Semantic Similarity. Semantic similarity is a classic NLP task that determines the similarity between terms such as words, sentences and documents. In this work, we focus on sentence-level similarity tasks. We test ERNIE 3.0 on several datasets in varied fields, including AFQMC, LCQMC, CSL, PAWS-X, and BQ. Experimental results show that ERNIE 3.0 outperforms the baseline models by a remarkable margin. In particular, with a comparable number of parameters, ERNIE 3.0 surpasses CPM-2 by 1.2 points on the LCQMC dataset.

Chinese News Classification. We also evaluate ERNIE 3.0 on Chinese news classification. We consider 6 datasets including news title (TNEWS), app descriptions (IFLYTEK), and news stories (THUCNEWS, CNSE, CNSS). Under different types of classification tasks, ERNIE 3.0 can consistently achieve better accuracy with 2.8 points improvement on average.

Closed-Book Question Answering. Closed-book question answering aims to directly answer questions without any additional references or knowledge. We select a general QA dataset, NLPCC-DBQA, and three medical-field datasets, CHIP2019, cMedQA, and cMedQA2, to test the ability of ERNIE 3.0. Experimental results show that ERNIE 3.0 performs better on all QA tasks. We believe knowledge enhanced pre-training methods do bring benefits to the closed-book QA task.

Cant Understanding. Cant, also known as doublespeak, is an advanced language usage for humans. However, it is rather difficult for machines to understand this type of language. We test the cant understanding ability of ERNIE 3.0 on DogWhistle – a dataset based on Decrypto game. The model is required to select the right answer with the guidance of the corresponding cant. ERNIE 3.0 gets the best result and shows its potential for understanding much more difficult languages.

Named Entity Recognition. Named Entity Recognition is a classical NLP task of extracting and classifying entities in text. We select widely used OntoNotes, CLUENER, Weibo, and a domain-specific dataset CCKS2019. From the results, ERNIE 3.0 performs better than the baseline models across all datasets.

Machine Reading Comprehension. We comprehensively evaluate the ability of ERNIE 3.0 on machine reading comprehension in different aspects, including span-prediction reading comprehension (CMRC2018, DuReader, DRCD, DuReader${}_{\text{checklist}}$), multiple-choice reading comprehension (C3, DuReader${}_{\text{yesno}}$), cloze and completion (CHID, CMRC2019), and a robustness test (DuReader${}_{\text{robust}}$). With the help of knowledge enhanced pre-training, ERNIE 3.0 surpasses the baseline models with significant improvements on all types of tasks. More specifically, ERNIE 3.0 achieves at least 1.0 point of EM improvement on 5 span-prediction tasks and an average accuracy improvement of 0.89 on multiple-choice tasks. Also, with a comparable number of parameters, ERNIE 3.0 outperforms CPM-2 by 0.6 points on the C3 dataset. For the robustness test, ERNIE 3.0 also performs best on the test set with over-sensitivity and over-stability samples.

Legal Documents Analysis. Next, we test the ability of ERNIE 3.0 on document analysis, choosing two domain-specific legal tasks. Both datasets come from CAIL2018 and are multi-label document classification tasks. ERNIE 3.0 outperforms ERNIE 2.0 by a remarkable margin.

Document Retrieval. Document retrieval aims to match documents to given queries. We evaluate the retrieval ability of ERNIE 3.0 on Sogou-Log. Following previous work, we report NDCG@1 on the test-same test set and MRR on the test-raw test set. ERNIE 3.0 outperforms CPM-2.

2.2 Fine-tuning on Natural Language Generation Tasks

The results of natural language generation tasks are reported in Table 3.

Text Summarization. We consider the Large Scale Chinese Short Text Summarization (LCSTS) dataset, which requires a model to understand the text and refine the key information to generate coherent, informative summaries. LCSTS is a classic Chinese text summarization dataset which consists of 2 million real Chinese short texts with short summaries from Sina Weibo. ERNIE 3.0 achieves a 48.46% Rouge-L score, outperforming both CPM-2 with a comparable number of parameters (11B) and the current SoTA ProphetNet-zh.

Question Generation. Question generation is the reverse task of Machine Reading Comprehension (MRC): it requires the model to understand a document and generate a reasonable question based on a given short answer. We use a suite of three datasets: knowledge base question generation (KBQG) and two MRC datasets, DuReader and DuReader$_{\text{robust}}$. ERNIE 3.0 performs best on these three datasets compared to the baselines.

Math. To test ERNIE 3.0’s ability to perform simple arithmetic operations, we consider the Math23K dataset, which contains 23,161 real math word problems for elementary school students with problem descriptions, structured equations and answers. ERNIE 3.0 is fine-tuned to generate the postfix expression of the structured equation given the problem description; the final answer can then be calculated using the Python eval() function. To avoid failed evaluations with eval(), ‘[’ and ‘]’ should be replaced with ‘(’ and ‘)’ respectively, and ‘%’ should be replaced with ‘*0.01’. ERNIE 3.0 proves to be a strong math solver, achieving 75% accuracy compared to 69.37% for CPM-2.
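The answer-recovery step can be sketched as follows. This is a minimal illustration, assuming the string handed to eval() is already an ordinary infix arithmetic expression; the helper name `evaluate_expression` and the sample expressions are hypothetical, and the restricted eval() namespace is an added precaution that the text does not describe.

```python
def evaluate_expression(expr: str) -> float:
    """Normalize a generated equation string and evaluate it with eval()."""
    # Square brackets are not grouping operators in Python, so map them to parentheses.
    normalized = expr.replace("[", "(").replace("]", ")")
    # Rewrite percentages such as "25%" as "25*0.01" so that "%" is not read as modulo.
    normalized = normalized.replace("%", "*0.01")
    # Evaluate with no builtins available (an assumption of this sketch, not the paper).
    return eval(normalized, {"__builtins__": {}}, {})


print(evaluate_expression("[3+5]*2"))
print(evaluate_expression("200*25%"))
```

Because the substitution is purely textual, it adds no parentheses: an expression such as `100/25%` would be evaluated as `(100/25)*0.01`, so a stricter pipeline might wrap each percentage in parentheses first.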

Advertisement Generation. We consider AdGen, which consists of 119K pairs of advertising text and clothing specification tables from a Chinese e-commerce platform. It requires the model to generate a long advertising text that covers all given attribute-value pairs for a piece of clothing. Each attribute-value pair is joined with a colon, and the pairs are concatenated sequentially using a ‘—’ according to their segment number. We then take the resulting structured attribute-value string as input for ERNIE 3.0. The results show that ERNIE 3.0 is capable of generating a coherent and intriguing long advertising text by extracting information from a structured input, with a 19.56 percentage point improvement in BLEU-4 compared to CPM-2.

Translation. ERNIE 3.0 is pre-trained mainly on a Chinese corpus. To test its multilingual ability, we expand our vocabulary to include an extra 10K English subwords. On a classic multilingual dataset, WMT20-enzh, we fine-tuned ERNIE 3.0 to translate English to Chinese. Compared to mT5-xxLarge and CPM-2, ERNIE 3.0 is the best and presents superior multilingual ability. (Due to the large size of the WMT20-enzh training set, ERNIE 3.0 is not fully trained to convergence; we report the BLEU score at the 1.5-epoch checkpoint using the SacreBLEU project.)

Dialogue Generation. Next, we evaluate ERNIE 3.0 on the dialogue generation task. We consider a Chinese multi-domain knowledge-driven conversation dataset that contains 4.5K conversations from three domains (film, music, and travel). We train and test ERNIE 3.0 on the fused data from these three domains, giving only the dialogue history to generate the current utterance. Knowledge triplets are excluded from the inputs, so the task is suitable for testing a model’s ability to model multi-turn conversations by leveraging knowledge acquired during pre-training. Compared to the baselines, ERNIE 3.0 improves performance by 8.1 percentage points, and we believe the knowledge graph enhanced pre-training contributes a lot to this gain.

2.3 LUGE benchmark

In order to further evaluate the capabilities of different models comprehensively and conveniently, we conduct experiments on the Language Understanding and Generation Evaluation Benchmarks (LUGE, https://www.luge.ai/). We use six representative tasks (see Tab. 4) from LUGE. ERNIE 3.0 delivers an average 5.36 percent improvement over leading pre-trained models such as ERNIE 2.0 and RoBERTa.

3 Experiments on Zero-shot Learning

We have demonstrated that ERNIE 3.0 is superior to previous SoTA methods on both NLU and NLG tasks following the pretraining-then-finetuning paradigm. In this section, we evaluate various types of tasks in the zero-shot setting, where a model is applied without any gradient updates or fine-tuning. ERNIE 3.0 achieves strong performance compared to recently proposed large-scale language models such as CPM-1 (2.6B), PanGu-$\alpha$-2.6B and PanGu-$\alpha$-13B on most downstream tasks. Finally, we show that ERNIE 3.0 can generate more coherent, natural and accurate responses, as rated on our manually collected 450 cases across 13 different tasks.

The evaluation methods can be classified into two categories, namely perplexity-based method and generation-based method.

Perplexity-based Method. On tasks that choose one single correct answer from multiple candidates, such as CHID and CMRC2017, we compare the per-token perplexity score (the perplexity score of a sample normalized by the number of tokens) obtained when filling each answer into the blank of the context. The candidate with the lowest per-token perplexity score is predicted as the correct answer. On tasks that require binary or multi-class classification, we assign each label a more semantically meaningful name and use a prompt to formalize the context and the label as human-readable text. This kind of task can then be treated as a multi-choice task. The prompts we use are similar to those in CPM-1 and PanGu-$\alpha$.
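The selection rule of the perplexity-based method can be sketched as below. This is a toy illustration, not the evaluation code used in the experiments: `score_fn` is a hypothetical stand-in for a language model that returns one log-probability per token, and the `[BLANK]` marker, the toy vocabulary and its scores are all assumptions.

```python
import math
from typing import Callable, List, Sequence


def per_token_perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity normalized by the number of tokens: exp(-mean(log p))."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def choose_candidate(context: str, candidates: List[str],
                     score_fn: Callable[[str], Sequence[float]],
                     blank: str = "[BLANK]") -> str:
    """Fill each candidate into the blank and keep the lowest-perplexity one."""
    def ppl(candidate: str) -> float:
        return per_token_perplexity(score_fn(context.replace(blank, candidate)))
    return min(candidates, key=ppl)


# Hypothetical unigram "model": a real setup would query the pre-trained LM instead.
TOY_LOG_PROBS = {"the": -1.0, "cat": -2.0, "sat": -2.5, "xylophone": -9.0}


def toy_score_fn(text: str) -> List[float]:
    return [TOY_LOG_PROBS.get(token, -6.0) for token in text.split()]


best = choose_candidate("the [BLANK] sat", ["cat", "xylophone"], toy_score_fn)
print(best)
```

Normalizing by the token count keeps candidates of different lengths comparable, which is why the per-token score is used rather than the raw sequence probability.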

Generation-based Method. On tasks with free-form completion such as closed-book QA, we use beam search with a beam width of 8 and no length penalty. The maximum generated length of a completion is capped at a pre-defined number based on the 95th percentile of the answer lengths in the dataset. Metrics such as exact match (EM), F1 and Rouge-1 are then used. On tasks with restrained completion such as extractive MRC, we use restrained beam search with the same parameters as before. A trie is constructed for each sample to efficiently and effectively restrain the generation space, so that only completions that occur in the given text are generated.
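One way to realize the trie-based restriction is sketched below. It is an assumption-laden toy rather than the authors' implementation: the trie stores every token span of the text up to `max_len`, `step_score` is a hypothetical stand-in for the model's next-token log-probability, and the `<eos>` symbol and all sample values are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-answer symbol


def build_span_trie(tokens: Sequence[str], max_len: int) -> Dict[str, dict]:
    """Insert every span of up to max_len tokens from the text into a nested-dict trie."""
    root: Dict[str, dict] = {}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:start + max_len]:
            node = node.setdefault(tok, {})
    return root


def restrained_beam_search(trie: Dict[str, dict],
                           step_score: Callable[[Tuple[str, ...], str], float],
                           beam_width: int = 8, max_len: int = 4) -> List[str]:
    """Beam search whose hypotheses may only follow edges of the trie (non-empty text assumed)."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative score, current trie node)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score, node in beams:
            if prefix:  # any non-empty span is allowed to stop here
                finished.append((prefix, score + step_score(prefix, EOS)))
            for tok, child in node.items():  # only continuations seen in the text
                candidates.append((prefix + (tok,), score + step_score(prefix, tok), child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    # Hypotheses still alive at the length limit are forced to stop.
    finished.extend((p, s + step_score(p, EOS)) for p, s, _ in beams)
    return list(max(finished, key=lambda h: h[1])[0])


def toy_step_score(prefix: Tuple[str, ...], tok: str) -> float:
    """Toy scorer that prefers the span 'old town'; a real scorer would call the LM."""
    target = ("old", "town")
    wanted = target[len(prefix)] if len(prefix) < len(target) else EOS
    return -0.1 if tok == wanted else -3.0


doc_tokens = "the river runs through the old town".split()
span_trie = build_span_trie(doc_tokens, max_len=3)
answer = restrained_beam_search(span_trie, toy_step_score, beam_width=2, max_len=3)
print(answer)
```

Because each expansion step only follows trie edges, every hypothesis is by construction a contiguous span of the source text, which is the property the restrained setting relies on.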

3.2 Results

Chinese News Classification. The TNEWS and IFLYTEK datasets have 15 and 119 categories respectively. We randomly sample three candidates as negative labels for each sample and compare the per-token perplexity score among these four choices. This sampling strategy is aligned with that of CPM-1 and PanGu-$\alpha$ and reduces the total computational cost, since the per-token perplexity score must be calculated for each candidate separately. ERNIE 3.0 performs well on TNEWS, even becoming competitive with prior state-of-the-art fine-tuning approaches, and performs slightly better than the baselines on IFLYTEK.
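The negative-label sampling described above might look like the following sketch. The function name, the label strings and the fixed seed are hypothetical; only the "one gold label plus three random negatives" scheme comes from the text.

```python
import random
from typing import List, Optional


def sample_label_candidates(gold: str, all_labels: List[str],
                            num_negatives: int = 3,
                            rng: Optional[random.Random] = None) -> List[str]:
    """Return the gold label mixed with randomly sampled negative labels."""
    rng = rng or random.Random()
    # Draw negatives without replacement from every label except the gold one.
    negatives = rng.sample([label for label in all_labels if label != gold], num_negatives)
    candidates = negatives + [gold]
    rng.shuffle(candidates)  # avoid leaking the gold label through its position
    return candidates


# Hypothetical label set; the real datasets have 15 and 119 categories.
labels = ["sports", "finance", "travel", "games", "education", "culture"]
choices = sample_label_candidates("travel", labels, rng=random.Random(0))
print(choices)
```

With four choices per sample, a uniform random guesser would be correct 25% of the time in expectation, which is the reference point for accuracies reported under this setup.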

Semantic Similarity. We consider the AFQMC and CSL datasets. ERNIE 3.0 outperforms the baselines by a large margin. However, its accuracy is only slightly above that of a random-guess model. This may be partly attributed to a sub-optimal choice of prompt (such as "The following two sentences have the same/different semantics: $SENT_A. $SENT_B.").

Natural Language Inference. ERNIE 3.0 is evaluated on two NLI datasets, OCNLI and CMNLI, where CMNLI consists of XNLI and MNLI translated from English to Chinese. We use the prompt "$SENT_A? No/Yes/Maybe, $SENT_B." The performance of ERNIE 3.0 is comparable to the baselines, which shows that there is still considerable room for improvement for pre-trained models on the zero-shot NLI task.

Winograd Schema Challenge. We formalize the WSC2020 dataset as a multi-choice completion task in which the pronoun is replaced with each candidate in turn to calculate the per-token perplexity of the sample. ERNIE 3.0 improves the performance by 3.38 percentage points compared to PanGu-$\alpha$-13B.

Cloze and completion. On the CHID dataset, we split out each sentence that contains only one blank word as a sample, and formalize the task as a multi-choice task. ERNIE 3.0 achieves the best score among the baselines. For Chinese Word Prediction with Long Context (Chinese WPLC), a sample consists of a masked text and a correct word. Following PanGu-$\alpha$, we replace the mask token with the correct word and calculate the perplexity score of the whole sentence. Compared to PanGu-$\alpha$, ERNIE 3.0 achieves a much lower perplexity score. On the CMRC2019 dataset, we randomly sample three negative candidates for each blank from the original candidates, then apply beam search to calculate the optimal path for a sample. We also formalize PD, CFT and CMRC2017 as multi-choice tasks where the text before the blank is taken as the input, and the multiple choices are the words that appear in the whole text. ERNIE 3.0 surpasses the baselines by a large margin.

Machine Reading Comprehension. We consider four MRC datasets. On C3, a multi-choice machine reading comprehension task, we use the prompt "Question: $Question? Answer: $Choice. The answer is in the following document: $Document." For CMRC2018, DRCD and DuReader, we evaluate ERNIE 3.0 using the generation-based method with the prompt "Document: $Document. Question: $Question? Answer:". ERNIE 3.0 outperforms the baselines by a large margin on the CMRC2018, DRCD and DuReader datasets.

Closed-book Question Answering. We evaluate ERNIE 3.0 on two closed-book question answering datasets, which require the model to generate answers using the knowledge it learned during pre-training. WebQA is a large-scale real-world QA dataset from Baidu Zhidao. We provide ERNIE 3.0 with only the question, without additional evidence. The prompt is similar to the MRC prompt but without the document input ("Question: $Question? Answer:"). ERNIE 3.0 achieves better performance than the baselines. We present a detailed analysis of the CKBQA dataset in Section 5.

3.3 Case Study

We manually collected 450 cases to evaluate the zero-shot generation ability of current large-scale pre-trained models on 13 tasks from 5 different types including Question Answering, Interpretation, Dialogue, Text Generation and Summarization. In human evaluation, the annotators are asked to score the generation quality on a fixed scale. We report the average scores of coherence, fluency, and accuracy in Tab. 6, and show some zero-shot generations of ERNIE 3.0 in Tab. 7. ERNIE 3.0 generates the most coherent, fluent and accurate texts on average compared to CPM-1, PLUG, and PanGu-$\alpha$. (We use the implementation of CPM-1 in https://github.com/jm12138/CPM-Generate-Paddle, PLUG in https://nlp.aliyun.com/portal?/BigText_chinese#/BigText_chinese and PanGu-$\alpha$ in https://git.openi.org.cn/PCL-Platform.Intelligence/PanGu-Alpha.) The three scoring metrics are introduced as follows, and the scoring details are provided in Tab. 8.

Coherence measures whether the generation is relevant and consistent with the context.

Fluency evaluates whether the generated text is natural or readable. A fluent text should have no semantic contradiction among the generated text.

Accuracy is a metric to evaluate whether the generated text is the same as the ground truth.

4 Experiments on SuperGLUE

As a multi-task benchmark for natural language understanding, SuperGLUE is usually used to evaluate the performance of pre-training models. We also test the performance of ERNIE 3.0 on SuperGLUE, which covers a diverse range of NLP datasets as follows.

BoolQ (Boolean Questions) is a QA task where each example consists of a short passage and a yes/no question about the passage. The task is to answer the questions with YES or NO, and the metric of this task is accuracy.

CB (Commitment Bank) is an imbalanced corpus for the natural language inference task. The task is evaluated using accuracy and macro-F1.

COPA (Choice of Plausible Alternatives) is a causal reasoning task based on common sense knowledge. The data are curated from blogs and a photography-related encyclopedia. Following the original work, we evaluate this task using accuracy.

MultiRC (Multi-Sentence Reading Comprehension) is a QA task where each example consists of a context paragraph, a question about that paragraph, and a list of possible answers. The system must predict which answers are true and which are false. The evaluation metrics are F1 over all answer-options ($F1_{a}$) and exact match of each question’s set of answers (EM).

ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) is a multiple-choice QA task. It requires the model to pick an entity to complete the answer, given a news article as context and a Cloze-style question. This task is evaluated with max (over all mentions) token-level F1 and exact match.

RTE (Recognizing Textual Entailment) comes from a series of annual competitions on textual entailment. It is a natural language inference corpus and is evaluated with accuracy.

WiC (Word-in-Context) is a word sense disambiguation task cast as binary classification of sentence pairs, using accuracy as the evaluation metric.

WSC (Winograd Schema Challenge) is a coreference resolution task in which examples consist of a sentence with a pronoun and a list of noun phrases from the sentence as choices. The system must select the correct referent of the pronoun from the provided choices. This task is evaluated with accuracy.
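To make the two MultiRC metrics concrete, the sketch below computes F1 over all answer-options and the per-question exact match on a toy example. It reflects one straightforward reading of the definitions given above, assuming binary labels with 1 meaning "true"; the official SuperGLUE scorer may differ in details, and the question ids and labels are invented.

```python
from typing import Dict, List, Tuple


def multirc_metrics(gold: Dict[str, List[int]],
                    pred: Dict[str, List[int]]) -> Tuple[float, float]:
    """Return (F1 over all answer-options, exact match over questions)."""
    tp = fp = fn = exact = 0
    for qid, gold_labels in gold.items():
        pred_labels = pred[qid]
        for g, p in zip(gold_labels, pred_labels):
            tp += int(g == 1 and p == 1)
            fp += int(g == 0 and p == 1)
            fn += int(g == 1 and p == 0)
        # A question counts as an exact match only if every option is labeled correctly.
        exact += int(gold_labels == pred_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, exact / len(gold)


gold = {"q1": [1, 0, 1], "q2": [0, 1]}
pred = {"q1": [1, 0, 1], "q2": [1, 1]}
f1_a, em = multirc_metrics(gold, pred)
print(round(f1_a, 4), em)
```

In this toy case a single wrong option on `q2` lowers the option-level F1 only slightly but halves the exact-match score, which shows why the two numbers are reported together.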

Similar to the pre-training corpus used in RoBERTa and DeBERTa, we compiled the English pre-training corpus for ERNIE 3.0 from English Wikipedia, BookCorpus, CC-News, OpenWebText, and Stories. As shown in Table 9, ERNIE 3.0 surpasses T5 and DeBERTa and obtains a score of 90.6, taking first place on the SuperGLUE benchmark.

5 Analysis

To verify the effectiveness of the task-specific networks, we compare our proposed structure with one that shares parameters across the various pre-training tasks. For the ablation test, we choose understanding and generation as two different training paradigms and utilize the corresponding tasks mentioned in Section 3.2. The unified network follows the base model settings (12 layers, 768 dims, 12 attention heads), and the task-specific network for each task paradigm is set to 3 layers, 256 dims, and 4 attention heads. For the contrast model, the task-specific network is shared across the different task paradigms. Figure 3 illustrates the perplexity variation of the NLG task during the pre-training process.

As shown in Figure 3, the model with its own task-specific network for different task paradigms reaches a higher convergence speed. Furthermore, as training progresses, the performance gap becomes bigger compared to the model with a shared task-specific network. The experimental result shows the effectiveness of the proposed task-specific networks and demonstrates the necessity of distinguishing different tasks.

A group of ablation experiments is conducted to evaluate the performance of the universal knowledge-text prediction task. The relation extraction task is a typical knowledge-driven task, aiming to predict the relationship between two entities mentioned in a given sentence. Specifically, we add four special tokens, [HD], [/HD], [TL] and [/TL], to mark the mentions of the head entity and the tail entity respectively; relation classification is then performed on the sum of the final representations of these four special tokens. We conduct the experiments on the SanWen and FinRE datasets and, as shown in Table 10, the knowledge enhancement strategy achieves impressive empirical performance on the relation extraction task.
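The pooling step used for relation classification can be illustrated with a small sketch. Only the four markers and the "sum of their final representations" come from the text; the linear classification layer, the function name `relation_logits`, and all vectors and weights below are assumptions chosen to keep the arithmetic easy to follow.

```python
from typing import List, Sequence

MARKERS = ("[HD]", "[/HD]", "[TL]", "[/TL]")


def relation_logits(tokens: Sequence[str], hidden: Sequence[Sequence[float]],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Sum the final-layer vectors at the four marker positions, then apply a linear layer."""
    dim = len(hidden[0])
    pooled = [0.0] * dim
    for marker in MARKERS:
        vector = hidden[tokens.index(marker)]  # representation of this special token
        pooled = [p + v for p, v in zip(pooled, vector)]
    # Hypothetical linear classifier: one weight row and one bias per relation type.
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


# Toy input: entity mentions "A" and "B" wrapped by the special tokens.
tokens = ["[HD]", "A", "[/HD]", "likes", "[TL]", "B", "[/TL]"]
# Stand-in hidden states of dimension 2 (position index, constant 1.0).
hidden = [[float(i), 1.0] for i in range(len(tokens))]
weights = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical relation types
bias = [0.0, 0.0]
logits = relation_logits(tokens, hidden, weights, bias)
print(logits)
```

Here the markers sit at positions 0, 2, 4 and 6, so the pooled vector is `[12.0, 4.0]`, and the identity-like weights pass it through unchanged as the logits.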

In addition, the zero-shot generation experiment on CKBQA also confirms the effectiveness of the universal knowledge-text prediction task. Specifically, the knowledge-based question answering (KBQA) task requires a model to search and reason for correct answers based on a knowledge graph, which makes it suitable for measuring the knowledge learning capability of pre-trained language models. We use "QUESTION: $QUESTION? ANSWER:" as the prompt for zero-shot learning and then compare the performance of our proposed model with several state-of-the-art pre-trained language models on the CKBQA dataset. As shown in Table 5, ERNIE 3.0 significantly outperforms PanGu-$\alpha$ and CPM-1 on the CKBQA dataset, which indicates that ERNIE 3.0 has the ability to memorize and learn more knowledge.

6 Conclusion

We proposed the ERNIE 3.0 framework to pre-train a knowledge enhanced 10-billion-parameter model on a 4TB corpus including plain texts and a knowledge graph. In order to handle both language understanding and generation tasks with zero-shot learning, few-shot learning and fine-tuning, ERNIE 3.0 adopts a unified pre-training framework that integrates both auto-encoder networks and auto-regressive networks. We conduct extensive experiments on various datasets from different task paradigms and fields, and the results demonstrate the effectiveness of ERNIE 3.0 compared to previous state-of-the-art pre-trained models.