ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi Chen, Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan Bai, Qiuliang Chen, Li Zhao, Shiyong Li, Peng Sun, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian Wu, Wei Zeng, Ge Li, Wen Gao, Haifeng Wang

cs.CL

Introduction

Pre-trained language models such as ELMo, GPT, BERT, and ERNIE have proven effective for improving the performance of various natural language understanding and generation tasks. Pre-trained language models are generally learned on a large amount of text data in a self-supervised manner and then fine-tuned on downstream tasks or directly deployed through zero-shot learning without task-specific fine-tuning. Such pre-trained language models have started to serve as foundation models and bring a new paradigm for natural language processing tasks. This new paradigm changes the focus of NLP research from designing specialized models for different tasks to studying pre-trained language models and using them in various tasks. Recent advances such as GPT-3 have demonstrated the promising trend towards scaling up pre-trained language models with billions of parameters. These studies show the surprising potential of scaling up pre-trained models.

Most existing large-scale models were pre-trained on plain texts without integrating knowledge. ERNIE 3.0 tried to incorporate knowledge such as linguistic knowledge and world knowledge into large-scale pre-trained language models. ERNIE 3.0 pre-trained Transformers on massive unstructured texts and knowledge graphs to learn different levels of knowledge, such as lexical, syntactic, and semantic information. ERNIE 3.0 can handle both natural language understanding tasks and natural language generation tasks through zero-shot learning, few-shot learning, or fine-tuning. Furthermore, it supports introducing various customized tasks at any time. These tasks share the same encoding networks that are pre-trained through multi-task learning. This method makes it possible to encode lexical, syntactic, and semantic information across different tasks.

This work explores the performance of knowledge-enhanced pre-trained models with larger-scale parameters based on the ERNIE 3.0 framework. We train a Chinese dense pre-trained language model with 260 billion parameters, named ERNIE 3.0 Titan, on the PaddlePaddle platform. Although large-scale language models like GPT-3 have shown promising text generation capabilities, it is still challenging for users to control the generation results and obtain generated texts that are factually consistent with the real world. To fill this gap, we propose a highly credible and controllable generative pre-training technique (see Figure 2), in which a self-supervised adversarial loss and a controllable language modeling loss are optimized during the pre-training phase. In detail, the self-supervised adversarial loss allows the model to learn to distinguish whether a text is the original one or generated by ERNIE 3.0. Besides accelerating the convergence of the model, this loss enables ERNIE 3.0 Titan to re-rank the generated results by credibility. Meanwhile, the controllable language modeling loss is applied to enable the model to generate texts with specific attributes. We prompt the original text with a diverse attribute set, including the genre, topic, keywords, sentiment, and length, which can be easily expanded with more user-defined attributes. Users can freely combine these attributes for controllable generation in different types of downstream application scenarios. We conduct experiments on 68 datasets. The results show that ERNIE 3.0 Titan outperforms previous models on various tasks by a large margin and achieves new state-of-the-art results.

Furthermore, we propose a distillation framework to distill ERNIE 3.0 Titan for easy deployment. Intuitively, we can apply current knowledge distillation methods to ERNIE 3.0 Titan. However, current distillation methods require an additional inference stage on a fully trained teacher to transfer knowledge to the student models, which is not environment-friendly with respect to carbon emissions. Another problem of current methods is that only one student model can be produced after the distillation phase is completed, requiring the teacher to infer multiple times to distill multiple students. In addition to the computation resource problems, previous studies reveal that distillation from oversized teachers can lead to unexpected performance degradation. Prior work indicates that the difficulty comes from the large gap between the teacher’s and student’s parameter numbers, causing significant differences between their representation spaces. To this end, we propose an online distillation framework to efficiently distill ERNIE 3.0 Titan into multiple small models during the pre-training stage, which incurs little additional computation cost compared to current distillation methods. Our distillation framework contains four key features: i) teaching multiple students at the same time, ii) proposing On-the-Fly Distillation (OFD), where the teacher instructs the students during the training stage for a more environmentally friendly distillation, iii) introducing teacher assistants for better distilling large-scale models, iv) introducing Auxiliary Layer Distillation (ALD), a technique to improve distillation performance by stacking an additional student layer in the distillation stage and discarding it at the fine-tuning stage. We compare our distilled ERNIE 3.0 Titan with previous compact models on five typical types of downstream tasks. The results demonstrate the effectiveness of our proposed distillation framework, and show that the distilled ERNIE 3.0 Titan achieves SOTA results on all tasks.

Related Work

Due to the rapid development of deep learning algorithms and the iterations of high-performance chips, pre-trained language models such as BERT, GPT, and ERNIE have made significant breakthroughs in many fields of natural language processing, such as natural language understanding, language generation, machine translation, human-machine dialogue, and question answering. These methods use unified Transformer models for learning universal representations on large unsupervised corpora. This technique taps into the advantages that the scale of unsupervised data brings to natural language processing and significantly reduces the reliance on costly annotated data.

Some recent works have demonstrated that increasing the size of pre-trained models can further exploit the potential value of unsupervised data. For example, the T5 model was proposed to push the performance for both natural language understanding and natural language generation tasks with 11 billion parameters. The T5 model converts all text-based language tasks into a text-to-text format with a unified framework and fully explores the effectiveness of pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors. GPT-3, with 175 billion parameters, achieved strong performance on a wide range of tasks under the few-shot and zero-shot settings with prompts. Several works have investigated the effects of larger pre-trained models, such as Jurassic-1, Gopher, Megatron-Turing NLG, PanGu-$\alpha$, and Yuan 1.0. An alternative route to increasing the number of model parameters is to make the model sparse. Such models use mixture-of-experts to select different parameters for each incoming example. As far as we know, sparse models have not achieved better performance than dense models up to now, although they can scale more efficiently to a trillion parameters or more. Therefore, this paper mainly focuses on large-scale distributed training and inference techniques for dense models. Most of the previous models, which learned only from plain texts, suffer from a lack of common sense. In addition, most large-scale models are trained in an auto-regressive way, but prior work shows that such models have poorer performance with traditional fine-tuning when adapting to downstream language understanding tasks. In order to solve these problems, a unified framework called ERNIE 3.0 was proposed to train large-scale knowledge-enhanced models on large-scale plain texts and a large-scale knowledge graph by fusing the auto-regressive network and the auto-encoding network.

The exponential growth of pre-trained language model size has posed a great challenge for efficient training due to memory constraints and unaffordable training time. The size of pre-trained language models exceeds the memory limit of a single modern processor. In addition, widely used optimization algorithms such as Adam inevitably need to store momentum and other optimizer states. Therefore, many works focus on achieving efficient training of large-scale models. The first category is Pipeline Model Parallelism, which splits different Transformer layers onto separate cards. GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when partitioning a model across multiple accelerators. TeraPipe proposed a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of uni-directional language models. PTD-P schedules interleaved stages in a fine-grained way to further reduce the size of the pipeline bubble. Another category is Tensor Model Parallelism, in which individual layers of the model are partitioned over multiple devices. This approach takes advantage of the structure of Transformer networks to create a simple model-parallel implementation by adding a few synchronization primitives. PTD-P also combines pipeline, tensor, and data parallelism across multi-GPU servers to exploit their respective advantages.

2 Credible Text Generation

The credibility of a text covers multiple aspects such as coherency, clarity, and veracity. Among non-pretraining methods, previous works explored many approaches to promote several aspects of credible generation: Plan-and-write uses a hierarchical method that first plans a storyline and then generates a story based on it to improve text coherency; CGRG uses grounding knowledge to generate veritable answers. With the development of pre-training technology, large-scale language models provide a simple yet powerful solution for credible text generation. Prior work shows that GPT-2 synthetic text samples can achieve a promising credibility score compared with authentic articles from the New York Times (72% vs. 83% of one cohort judged the articles to be credible). Furthermore, GROVER utilizes an adversarial framework to train a classifier with the generated text for detecting non-credible news. Inspired by this, ERNIE 3.0 Titan introduces an auxiliary adversarial loss to train a credibility ranker for self-ranking and selects the most credible text for output.

3 Controllable Text Generation

As large-scale pre-training models have shown growing effectiveness in generating high-quality sentences, controllable text generation is getting more attention. GPT-3 used in-context few-shot learning to prompt text generation for various tasks, but it is still challenging for users to control the generation results. CTRL provided an effective way for controllable text generation. It trained a language model conditioned on control codes that govern style, content, and task-specific behavior. However, these control codes, derived from the structure that naturally co-occurs with raw texts, cover only a limited set of controllable attributes. Follow-up works like Arg-CTRL and Tailor extended the control codes of CTRL either by crowdsourcing aspect annotations or by deriving them from the PropBank formalism to control semantic-specific generation. Another line of work aims to generate texts with desired attributes through relatively small ‘pluggable’ attribute models, focusing on light-weight controllable fine-tuning of pre-trained models. In this paper, we focus on directly providing a highly controllable model pre-trained on the ERNIE 3.0 controllable dataset with a conditional LM loss.

4 Model Distillation of Language Models

Although large-scale language models show outstanding ability on various tasks, their enormous number of parameters requires a significant amount of computational resources and makes them difficult to deploy in real-life applications. Language model distillation has recently drawn significant attention as one prevailing method to solve this problem. For example, DistilBERT was proposed to successfully halve the depth of the BERT model by matching the outputs of the teacher and student models at the pre-training and fine-tuning stages. TinyBERT adds two additional points of distillation, the attention distribution and hidden representations, to improve distillation quality. To further boost the distillation, in addition to the pre-training and fine-tuning stages, ERNIE-Tiny introduces two additional stages to ensure that the student captures the domain knowledge from the fine-tuned teacher. Unlike previous works that require a fine-tuned teacher for distillation, other work proposes task-agnostic distillation that only performs distillation on the pre-trained teacher by mimicking self-attention and value relation.

Recent works have observed that distillation from an oversized teacher to a significantly smaller student can cause an unexpected performance drop. One study suggests that the performance drop is due to the capacity gap between the teacher and the student, and introduces additional student models named teacher assistants, with sizes between those of the teacher and student, to alleviate this gap. Another proposes to early-stop the teacher to reduce the gap. A third proposes to utilize unconverged teacher checkpoints to guide the student’s learning process in a curriculum learning manner. A fourth proposes joint training of teacher and student, allowing the teacher to be aware of the existence of the student and reducing the gap, although it deteriorates the performance of the teacher model. Yet another suggests that the unexpected distillation performance drop is caused by the fact that the oversized model tends to be overconfident, and forcing the student to learn such overconfident output harms the distillation performance. It proposes to normalize the point of distillation to alleviate the overconfidence problem.

ERNIE 3.0 Titan Model

Significant improvements have been achieved on various natural language processing tasks by knowledge-enhanced pre-trained models of base or large size, such as ERNIE, ERNIE 2.0, and SpanBERT, where the base and large model sizes correspond to 12-layer and 24-layer Transformers, respectively. In order to explore the effectiveness of knowledge-enhanced large-scale pre-trained models, a Continual Multi-Paradigms Unified Pre-training Framework named the ERNIE 3.0 Framework was proposed to pre-train models on a massive unsupervised corpus including plain texts and knowledge graphs. Specifically, the ERNIE 3.0 Framework allows collaborative pre-training among multi-task paradigms, in which various types of pre-training tasks are incrementally deployed in the corresponding task paradigm to enable the model to learn different levels of knowledge, i.e., valuable lexical, syntactic, and semantic information, more effectively. Benefiting from the ERNIE 3.0 Framework, ERNIE 3.0 has made substantial improvements on many downstream tasks across natural language understanding and natural language generation. Accordingly, ERNIE 3.0 Titan is built on the ERNIE 3.0 Framework in this paper. The details of the ERNIE 3.0 Framework are explained in the following sections.

Until recently, the prevalent unified pre-training models have tended to employ a shared Transformer network for different well-designed cloze tasks and utilize specific self-attention masks to control the context on which predictions are conditioned. We believe that the different task paradigms of natural language processing consistently depend on identical underlying abstract features, such as lexical and syntactic information. However, the requirements for top-level concrete features are incompatible, in that natural language understanding tasks tend to learn semantic coherence while natural language generation tasks expect further contextual information. Inspired by the classical model architecture of multi-task learning, in which the lower layers are shared across all tasks while the top layers are task-specific, we construct the ERNIE 3.0 Framework shown in Figure 1, which enables the different task paradigms to share the underlying abstract features learned in a shared network and to utilize the task-specific top-level concrete features learned in their own task-specific networks. The backbone shared network and task-specific networks are referred to as the Universal Representation Module and Task-specific Representation Modules in the ERNIE 3.0 Framework. Specifically, the universal representation network plays the role of a universal semantic feature extractor (for example, a multi-layer Transformer). Its parameters are shared across all kinds of task paradigms, including natural language understanding, natural language generation, and so on. The task-specific representation networks extract the task-specific semantic features, and their parameters are learned by task-specific objectives. Furthermore, the continual multi-task learning framework introduced in ERNIE 2.0 is integrated into the ERNIE 3.0 Framework to help the model efficiently learn the lexical, syntactic, and semantic representations.

Driven by the success of ERNIE 3.0, ERNIE 3.0 Titan also employs the collaborative architecture of a Universal Representation Module and two Task-specific Representation Modules, namely the natural language understanding (NLU) specific representation module and the natural language generation (NLG) specific representation module. Details are as follows:

Universal Representation Module. Likewise, a multi-layer Transformer-XL is adopted as the backbone network, as in other pre-trained models such as ERNIE 3.0, XLNet, Segatron, and ERNIE-Doc. Transformer-XL is similar to Transformer but introduces an auxiliary recurrence memory module to help model longer texts. As is well known, the larger the scale of the Transformer model, the stronger its capacity to capture and store semantic information at different levels, enabled by the self-attention mechanism. Therefore, ERNIE 3.0 Titan sets the universal representation module to a larger size (see Section 3.4) to enable the model to effectively capture universal lexical and syntactic information from training data by learning various pre-training tasks of different paradigms. Note that the memory module is only valid for natural language generation tasks while controlling the attention mask matrices.

Task-specific Representation Modules. Instead of the multi-layer perceptron or shallow Transformer commonly used as task-specific representation networks in multi-task learning, ERNIE 3.0 Titan employs the Transformer-XL network with base model size as the task-specific representation modules to capture the top-level semantic representations for different task paradigms. Under this design, ERNIE 3.0 Titan achieves a triple-win scenario: first, the base Transformer has a stronger ability to capture semantic information than a multi-layer perceptron or shallow Transformer; second, it does so without significantly increasing the parameters of the large-scale model; third, it opens a practical route to applying large-scale pre-trained models: fine-tuning only the task-specific representation modules. ERNIE 3.0 Titan constructs two task-specific representation modules: the bi-directional modeling NLU-specific representation network and the uni-directional modeling NLG-specific representation network.

2 Pre-training Tasks

In order to equip ERNIE 3.0 Titan with the capacity for understanding, generation, and reasoning, we construct several tasks for various task paradigms to capture different aspects of information in the training corpora, including the word-aware pre-training tasks, structure-aware pre-training tasks, and knowledge-aware pre-training task introduced in ERNIE 3.0. Additionally, a new knowledge-aware pre-training task, namely Credible and Controllable Generations, is built to control the generation result and obtain results that are factually consistent with the real world.

Knowledge Masked Language Modeling. ERNIE 1.0 proposed an effective strategy to enhance representation through knowledge integration, namely the Knowledge Integrated Masked Language Modeling task. It introduced phrase masking and named entity masking, which predict the whole masked phrases and named entities to help the model learn the dependency information in both local and global contexts.

Document Language Modeling. The document language modeling task is a special version of the traditional language modeling task, which trains models on long text instead of the prevailing shorter segments of manageable size (at most 512 tokens). The Enhanced Recurrence Memory Mechanism proposed in ERNIE-Doc is introduced into ERNIE 3.0 Titan to model a larger effective context length than the traditional recurrence Transformer.

2.2 Structure-aware Pre-training Tasks

Sentence Reordering. The sentence reordering task, introduced in ERNIE 2.0, aims to train the model to learn the relationship between sentences by reorganizing permuted segments. Specifically, a given paragraph is randomly split into 1 to $m$ segments during pre-training, and the segments are shuffled into a random permutation. Then, the pre-trained model is asked to reorganize these permuted segments, which is modeled as a $k$-class classification problem where $k=\sum_{n=1}^{m}n!$.
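The class count and the construction of one training example can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the paragraph is assumed to be already split into sentences, and segments are assumed to be contiguous runs of sentences.

```python
import math
import random

def num_reordering_classes(m: int) -> int:
    """k = sum_{n=1}^{m} n!, the number of classes for at most m segments."""
    return sum(math.factorial(n) for n in range(1, m + 1))

def make_reordering_example(sentences, m, rng):
    """Split `sentences` into 1..m contiguous segments and shuffle them.

    Returns (shuffled_segments, order), where order[i] is the original
    index of the segment that now sits at position i.
    """
    n = rng.randint(1, min(m, len(sentences)))
    # Choose n-1 distinct cut points to form n non-empty segments.
    cuts = sorted(rng.sample(range(1, len(sentences)), n - 1))
    bounds = [0] + cuts + [len(sentences)]
    segments = [sentences[a:b] for a, b in zip(bounds, bounds[1:])]
    order = list(range(n))
    rng.shuffle(order)
    return [segments[i] for i in order], order
```

For $m=3$ the formula gives $1!+2!+3!=9$ classes, one for every possible ordering of one, two, or three segments.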

Sentence Distance. The sentence distance task, an extension of the traditional next sentence prediction (NSP) task, is widely used in various pre-trained models to enhance their ability to learn sentence-level information, and can be modeled as a 3-class classification problem. The three categories indicate that the two sentences are adjacent, nonadjacent but in the same document, or from two different documents, respectively.
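The three-way labeling can be illustrated with a short sketch. The integer label ids and the representation of a sentence as a (document id, sentence index) pair are assumptions made for this example only.

```python
def sentence_distance_label(doc_a, idx_a, doc_b, idx_b):
    """Return an assumed label id for a sentence pair:
    0 = adjacent, 1 = same document but not adjacent, 2 = different documents.
    """
    if doc_a != doc_b:
        return 2
    return 0 if abs(idx_a - idx_b) == 1 else 1
```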

2.3 Knowledge-aware Pre-training Task

Universal Knowledge-Text Prediction. The universal knowledge-text prediction (UKTP) task, a particular masked language modeling task constructed on both unstructured texts and structured knowledge graphs, plays a pivotal role in incorporating world knowledge and commonsense knowledge into the pre-trained model. Given a pair consisting of a triple from the knowledge graph and the corresponding sentence from the encyclopedia, the UKTP task randomly masks either the relation in the triple or words in the corresponding sentence. To predict the relation in the triple, the model needs to detect mentions of the head entity and the tail entity and determine their semantic relationship in the corresponding sentence. Conversely, to predict the words in the corresponding sentence, the model needs to consider both the dependency information in the sentence and the logical relationship in the triple.
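The masking choice described above can be sketched as follows. The mask token string, the probability of masking the relation, and the word mask ratio are all assumptions for illustration; the text does not specify them.

```python
import random

MASK = "[MASK]"  # assumed mask token

def uktp_mask(triple, sentence_tokens, rng,
              mask_relation_prob=0.5, word_mask_ratio=0.15):
    """Mask either the relation of the triple or some sentence words.

    Returns (masked_triple, masked_tokens, targets). Both probabilities
    are illustrative placeholders, not values from the paper.
    """
    head, relation, tail = triple
    tokens = list(sentence_tokens)
    if rng.random() < mask_relation_prob:
        # Predict the relation from the sentence and the two entities.
        return (head, MASK, tail), tokens, {"relation": relation}
    # Otherwise predict sentence words from the context and the full triple.
    k = max(1, round(len(tokens) * word_mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), k))
    targets = {"words": {i: tokens[i] for i in positions}}
    for i in positions:
        tokens[i] = MASK
    return (head, relation, tail), tokens, targets
```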

Controlling the generated texts based on desired attributes and improving their credibility is a key and practical feature we introduced in ERNIE 3.0 Titan. To achieve this, we design a self-supervised adversarial loss and a controllable language modeling loss for generating credible and controllable texts, respectively.

The self-supervised adversarial loss allows the model to distinguish whether a text is generated or original. As a result, it is easy for ERNIE 3.0 Titan to discard low-credibility generated texts that contain repeated words or disfluent and conflicting sentences. In detail, we formalize this as a binary classification problem on our ERNIE 3.0 adversarial dataset $D_{a}=\{D_{\text{original}},D_{\text{generated}}\}$, which consists of a subset $D_{\text{original}}$ of the original ERNIE 3.0 Corpus together with its adversarial samples $D_{\text{generated}}$ generated by ERNIE 3.0.
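To make the binary classification view concrete, the sketch below shows a per-sample cross-entropy and how a credibility score could be used to re-rank candidates. It is an illustration under assumptions: the function names are hypothetical, and `credibility_fn` stands in for whatever the trained classifier outputs as the probability that a text is original.

```python
import math

def adversarial_bce(p_original: float, is_original: bool) -> float:
    """Binary cross-entropy for one sample; `p_original` is the predicted
    probability that the text is an original (non-generated) paragraph."""
    eps = 1e-12  # avoid log(0)
    p = min(max(p_original, eps), 1.0 - eps)
    return -math.log(p) if is_original else -math.log(1.0 - p)

def rerank_by_credibility(candidates, credibility_fn):
    """Sort generated candidates from most to least credible."""
    return sorted(candidates, key=credibility_fn, reverse=True)
```

At generation time, the top-ranked candidate would be the one returned to the user.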

The controllable language modeling loss is a modified language modeling loss by conditioning on extra prompts for controlling the generated texts as follows:

where ERNIE 3.0 Titan is trained to minimize the negative log-likelihood loss on the ERNIE 3.0 controllable dataset $D_{c}=\{x^{1},x^{2},\dots,x^{|D_{c}|}\}$, and $t$ denotes the $t_{\text{th}}$ token of $x$. Each $x^{n}$ is associated with $\text{prompts}^{n}$ specifying the genre, topic, keywords, sentiment, and length. The loss is switched to the normal language modeling loss with a pre-defined probability of 0.5 to prevent the model from depending heavily on the prompts. Unlike CTRL, which covers a constrained set of controllable attributes derived from semi-structured raw texts, we enrich the controllable attribute set using task-specific supervised models on the ERNIE 3.0 Corpus. As the ERNIE 3.0 Corpus is by nature constructed from various sources, including Web, QA, Novel, Knowledge graph, etc. (see 3.3), we assign soft prompts (learnable prompt embeddings) to different datasets to better align the model with the genre of the target dataset.
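One way to write out a loss that matches these definitions is given below. This is a sketch reconstructed from the description rather than the authors' exact equation; the symbols $\mathcal{L}_{\text{ctrl}}$, $\theta$ (model parameters), and $c^{n}$ (the conditioning actually used for sample $n$) are assumed notation.

```latex
\mathcal{L}_{\text{ctrl}}(D_{c}) = -\sum_{n=1}^{|D_{c}|} \sum_{t} \log P_{\theta}\left(x^{n}_{t} \mid x^{n}_{<t},\, c^{n}\right),
\qquad
c^{n} =
\begin{cases}
\text{prompts}^{n} & \text{with probability } 0.5,\\
\varnothing & \text{otherwise (normal language modeling loss).}
\end{cases}
```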

3 Pre-training Data

To ensure the success of the pre-training of ERNIE 3.0 Titan, we utilize the ERNIE 3.0 Corpus, a large-scale, wide-variety, and high-quality Chinese text corpus amounting to 4TB of storage in 11 different categories. Two additional datasets, namely the ERNIE 3.0 adversarial dataset and the ERNIE 3.0 controllable dataset, are constructed.

ERNIE 3.0 adversarial dataset: The adversarial dataset is constructed based on the ERNIE 3.0 Corpus. The positive examples consist of 2M natural paragraphs sampled from the ERNIE 3.0 Corpus. To generate negative examples, we randomly take the first 1~3 sentences of the original positive paragraph as the prefix input and utilize ERNIE 3.0 to generate the rest of the paragraph. The maximum length of a generated paragraph is set to 512, and we discard the last incomplete sentence if the generation process is terminated by exceeding the maximum length.
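The negative-example procedure can be sketched as below. Several details are assumptions made for illustration: `generate_fn` is a hypothetical stand-in for the ERNIE 3.0 generator, length is counted in characters rather than model tokens, and sentences are split on a fixed set of terminator characters.

```python
import random

TERMINATORS = "。！？.!?"  # assumed sentence-ending characters

def split_sentences(text, terminators=TERMINATORS):
    """Split text into sentences, keeping each terminator with its sentence."""
    sents, start = [], 0
    for i, ch in enumerate(text):
        if ch in terminators:
            sents.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        sents.append(text[start:])  # trailing fragment without a terminator
    return sents

def make_negative(paragraph, generate_fn, rng, max_len=512):
    """Build one generated (negative) paragraph from an original one."""
    sents = split_sentences(paragraph)
    n_prefix = rng.randint(1, min(3, len(sents)))  # first 1~3 sentences
    prefix = "".join(sents[:n_prefix])
    text = prefix + generate_fn(prefix)
    if len(text) > max_len:
        text = text[:max_len]
        # Generation was cut off: drop the last incomplete sentence.
        last = max(text.rfind(ch) for ch in TERMINATORS)
        if last >= 0:
            text = text[:last + 1]
    return text
```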

ERNIE 3.0 controllable dataset: The controllable dataset is highly scalable to include more diverse user-defined attributes. Here, we introduce 5 different controllable attributes including genre, topic, keywords, sentiment and length as follows:

Genre is assigned to samples based on the source from which the data is collected, including general (ERNIE 2.0 Corpus), Web, QA, Knowledge, Finance, Law, Medical, Novel, Couplet, Poet, and Machine Translation. Each genre type is associated with a pre-defined maximum number of soft prompt embeddings (64 in our experiment). In the pre-training phase, the number of soft prompt embeddings is sampled randomly between 0 and the maximum number.

Topic is labeled using a topic model (https://ai.baidu.com/tech/nlp_apply/topictagger) which can classify a document into 26 different topics such as international, sports, society, news, technology, digital, emotion, cars, education, fashion, games, travel, food, culture, healthy life, child, and music.

Keywords are extracted using a keyword extraction model (https://ai.baidu.com/tech/nlp_apply/doctagger) which performs in-depth analysis of article titles and content, and outputs multi-dimensional tags that reflect the key information of the article, such as subject, entity, etc.

Sentiment is derived using a sentiment classification model (https://ai.baidu.com/tech/nlp_apply/sentiment_classify). A positive, negative, or neutral label is assigned to each sample.

Length is counted on the tokenized text. The length attribute can prompt the model to generate texts of the desired length, avoiding harsh truncation.

In the pre-training phase, we use the following input format for each sample: “[Genre-0], [Genre-1], $\cdots$ [Genre-N] [t] Topic texts [/t] [k] Keyword text0, Keyword text1, $\cdots$ [/k] [senti] Sentiment label text [/senti] [w] About $L$ words [/w] Original text”, where [Genre-$n$] is the $n_{\text{th}}$ soft prompt embedding for one of the genre types mentioned above, $N\in[0,64)$, $L$ is the token number of the original text, and [t], [/t], [k], [/k], [senti], [/senti], [w], [/w] are special tokens to separate each attribute. For example, the input for the case in Figure 2 can be “[News-0], [News-1], $\cdots$ [News-N] [t] Sports [/t] [k] Lampard, Chelsea, UCL [/k] [senti] Positive [/senti] [w] About 85 words [/w] Lampard said on the 4th that Chelsea...”. Note that for each attribute, we randomly discard it with a pre-defined probability (0.5 in our experiment) to prevent the model from depending heavily on it.
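The input format above can be assembled with a small helper such as the one below. This is a sketch under assumptions: the function name is hypothetical, soft prompts are rendered as literal placeholder strings (in the model they are learnable embeddings), whitespace splitting stands in for the real tokenizer when counting length, and the genre prompts are dropped with the same probability as the other attributes.

```python
import random

def build_controllable_input(text, genre, topic, keywords, sentiment, rng,
                             max_soft_prompts=64, drop_prob=0.5):
    """Prefix `text` with attribute prompts, dropping each attribute at random."""
    def keep():
        return rng.random() >= drop_prob

    parts = []
    if keep():
        n = rng.randint(0, max_soft_prompts)  # number of soft prompt slots
        if n > 0:
            parts.append(", ".join(f"[{genre}-{i}]" for i in range(n)))
    if keep():
        parts.append(f"[t] {topic} [/t]")
    if keep():
        parts.append(f"[k] {', '.join(keywords)} [/k]")
    if keep():
        parts.append(f"[senti] {sentiment} [/senti]")
    if keep():
        length = len(text.split())  # whitespace tokens, not the model tokenizer
        parts.append(f"[w] About {length} words [/w]")
    parts.append(text)
    return " ".join(parts)
```

Setting `drop_prob=0.0` keeps every attribute, while `drop_prob=1.0` yields the plain text, which corresponds to the ordinary language modeling case.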

4 Pre-training Settings

Following the pre-training setting of ERNIE 3.0, ERNIE 3.0 Titan includes the universal representation module and the task-specific representation modules, which both use the Transformer-XL structure. We adopt a structure with 12 layers, 768 hidden units, and 12 heads for the task-specific representation modules. We adopt a structure with 48 layers, 12288 hidden units, and 192 heads for the universal representation module. We found that continually increasing the hidden size would make it difficult to load the parameters of the output embedding on a single machine with eight 32GB cards. In order to further increase the model capacity, we choose to scale the parameters of the point-wise feed-forward network instead. The inner layer of the universal representation module has a dimension of 196608, which is 16 times the hidden size of the model. The total number of parameters of ERNIE 3.0 Titan is over 260 billion.
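A back-of-the-envelope count shows how these dimensions reach roughly 260 billion parameters. The estimate below is a simplification, not the authors' accounting: it counts only the attention projections and the feed-forward weights of the universal representation module, ignoring biases, layer normalization, embeddings, Transformer-XL positional terms, and the task-specific modules.

```python
def universal_module_params(layers=48, hidden=12288, ffn=196608):
    """Approximate weight count of the universal representation module."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn   # two point-wise linear maps
    return layers * (attention + feed_forward)

total = universal_module_params()
print(f"{total / 1e9:.1f}B parameters")
```

Under these assumptions the feed-forward network accounts for about 89% of each layer's weights, which is consistent with the choice to scale the inner dimension rather than the hidden size.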

We use Adam with a learning rate of 1e-4, $\beta_{1}=0.9$, $\beta_{2}=0.95$, and L2 weight decay of 0.1. We also clip the global norm of the gradient at 1.0 to improve the stability of pre-training. The maximum sequence length of the context and the memory length of language generation are 512 and 128, respectively. We use progressive learning to speed up convergence in the first 4000 steps and linear decay of the learning rate. ERNIE 3.0 Titan is implemented on the PaddlePaddle framework and uses parallelism techniques that facilitate the efficient training of large models that do not fit in the memory of a single GPU. We give the details in the next section.
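Clipping by global norm means that all gradients are rescaled by one shared factor whenever their joint L2 norm exceeds the threshold. The following framework-free sketch illustrates the idea on plain Python lists; it is not the PaddlePaddle implementation, and the function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients jointly so their combined L2 norm is at most max_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor.
    Returns (clipped_grads, global_norm_before_clipping).
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(t) for t in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in t] for t in grads], global_norm
```

Because a single scale factor is used, the direction of the overall update is preserved and only its magnitude is bounded.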

Efficient Training and Inference of ERNIE 3.0 Titan

Training a huge model like GPT-3 faces severe challenges in memory consumption and computational efficiency. This model requires 2.1TB for parameter and optimizer state storage and $3.14\times10^{11}$ TeraFLOPs of computation for training on 300 billion tokens. Given that a modern AI processor like the Nvidia V100 GPU provides only 32GB of memory and 125 TeraFLOPS (https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf), it would take 28 days to train with 2048 V100 GPU cards even at 50% of the theoretical peak FLOPS. There have been many related works to solve these problems, for example, Megatron-LM, GPipe, ZeRO, and so on. PaddlePaddle has also proposed 4D hybrid parallelism with more sophisticated combination techniques (https://ai.baidu.com/forum/topic/show/987996).
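The figures in this paragraph can be reproduced with simple arithmetic. The sketch below assumes 175 billion parameters for GPT-3 and 12 bytes of state per parameter (an FP32 weight plus two FP32 Adam moments); the byte count is an assumption chosen because it matches the stated 2.1TB, not a breakdown given in the text.

```python
def gpt3_training_estimate(params=175e9, bytes_per_param=12,
                           total_tflop=3.14e11, n_gpus=2048,
                           peak_tflops=125.0, utilization=0.5):
    """Return (state storage in TB, training time in days)."""
    storage_tb = params * bytes_per_param / 1e12
    # Sustained cluster throughput in TeraFLOP per second.
    cluster_tflops = n_gpus * peak_tflops * utilization
    days = total_tflop / cluster_tflops / 86400.0
    return storage_tb, days

storage_tb, days = gpt3_training_estimate()
print(f"{storage_tb:.1f} TB of state, about {days:.1f} days")
```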

Training the ERNIE 3.0 Titan model on heterogeneous clusters faces even more challenges than training GPT-3. On the one hand, the ERNIE 3.0 model has a more sophisticated architecture than GPT-3, with components such as task-specific representation layers and a memory mechanism. Such structures are prone to unbalanced and inefficient pipeline training on top of intensive memory consumption. On the other hand, different clusters are coupled with distinct software stacks, including operator implementations, communication libraries, schedulers, etc. For example, NPU performs worse than GPU on tensor computation with dynamic shapes but shows strength in FP16 calculation, posing new challenges such as customized layer optimization, unbalanced pipelines, and training stability.

To achieve efficient and stable training with a convergence guarantee, PaddlePaddle developed an end-to-end adaptive distributed training technology, including fine-grained parallelism, heterogeneous hardware-aware training, and a fault tolerance mechanism, to train the 260B model on both Nvidia V100 GPU and Ascend 910 NPU clusters.

The 4D hybrid parallelism includes data parallelism (DP), intra-layer tensor model parallelism (MP), inter-layer pipeline model parallelism (PP), and the improved sharded data parallelism (Sharded) based on ZeRO. DP partitions and distributes the data across devices and synchronizes their gradients. PP splits model layers into multiple stages distributed across devices, where the throughput is directly related to load balancing and the bubble fraction. MP slices parameters and activations across devices. Sharded reduces the redundancy of optimizer states. Our improved version of Sharded, namely Group Sharded, is designed to flexibly decouple data parallelism from Sharded.
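To make the four dimensions concrete, the sketch below maps global device ranks onto a (DP, Sharded, PP, MP) grid and lists the communication groups along one axis. The axis ordering (MP varying fastest, so that tensor-parallel peers are neighboring ranks) is a common convention that we assume here for illustration; it is not confirmed to be the layout PaddlePaddle uses, and the helper names are hypothetical.

```python
from itertools import product

AXES = ("dp", "sharding", "pp", "mp")

def build_rank_grid(dp, sharding, pp, mp):
    """Map each global rank to its coordinates; the mp index varies fastest."""
    grid = {}
    for rank, coord in enumerate(product(range(dp), range(sharding), range(pp), range(mp))):
        grid[rank] = dict(zip(AXES, coord))
    return grid

def groups_along(grid, axis):
    """Ranks that differ only along `axis` form one communication group."""
    groups = {}
    for rank, coord in grid.items():
        key = tuple(v for k, v in coord.items() if k != axis)
        groups.setdefault(key, []).append(rank)
    return sorted(groups.values())

# Toy configuration: 2 x 1 x 2 x 2 = 8 devices.
grid = build_rank_grid(dp=2, sharding=1, pp=2, mp=2)
mp_groups = groups_along(grid, "mp")   # peers that share sliced parameters
dp_groups = groups_along(grid, "dp")   # peers that synchronize gradients
```

The product of the four degrees must equal the number of devices, which is why adjusting one degree (for example, for load balancing) forces a compensating change in another.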

Besides, we implemented distributed operators for large Embedding, FC, and FC_Softmax layers, supporting fine-grained partition. These operators can utilize intra-node communication to further reduce memory occupation and redundant calculations. This strategy increases the overall throughput significantly on the ERNIE 3.0 Titan model’s unique architecture.

1.2 Heterogeneous hardware-aware training

We train the ERNIE 3.0 Titan model on both NPU clusters at PengCheng Lab and GPU clusters at Baidu. Specifically, training large-scale models on NPUs relies heavily on carefully handling the following problems: 1) interface design between the deep learning framework and the fast-evolving NPU software stack; 2) software-hardware collaborative NPU performance optimization; 3) adjustment of load balance with respect to NPU cluster throughput.

To take full advantage of PaddlePaddle’s imperative programming and the 4D parallelism technology, we use Ascend Computing Language (ACL) https://support.huaweicloud.com/asdevg-python-cann/atlaspython_01_0008.html in our implementation instead of Ascend Graph (GE) https://support.huawei.com/enterprise/en/doc/EDOC1100164817/753b4f6/introduction. We found that the NPU performs strongly with pure FP16 but is weaker when dealing with the smaller shapes and dynamic shapes commonly used in NLP models.

In the PaddlePaddle framework, the parallel strategy can be adjusted in a resource-aware manner to fully exploit the computing power of the hardware. Figure 3 shows that the Task-specific Representation layers of ERNIE 3.0 require more load balance adjustments on NPUs than on GPUs because NPUs spend more time on kernel launches for small layers. As shown in Table 1, the ERNIE 3.0 Titan model reaches 91.7% weak scalability with thousands of NPU cards and increases the throughput by up to 2.1 times while the number of cards increases by only 22%. Furthermore, Figure 4 shows how we convert dynamic shapes to static shapes to fully utilize ACL performance.
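The exact conversion shown in Figure 4 is not reproduced here. As a rough illustration of the general idea, one common way to obtain static shapes is to pad (or truncate) every sequence to a fixed length and carry a mask that marks the real tokens. The function name, the padding id, and the toy batch below are assumptions made for this sketch.

```python
def pad_to_static(batch, max_len=512, pad_id=0):
    """Pad or truncate every sequence to max_len and return a validity mask."""
    padded, mask = [], []
    for seq in batch:
        seq = list(seq[:max_len])            # truncate overly long inputs
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

# Two sequences of different lengths become one fixed-shape batch.
padded, mask = pad_to_static([[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]], max_len=6)
```

With a fixed shape, the same compiled kernels can be reused for every batch, at the cost of some wasted computation on padding positions.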

1.3 Fault tolerance

During our experiments, we also encountered hardware problems, for example, a sudden drop in communication speed caused by PCIe errors and GPU card errors such as Double Bit ECC Error (DBE) https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html. To resolve these problems, PaddlePaddle developed fault-tolerant features that automatically replace faulty machines, reducing hardware waste and lost time. PaddlePaddle also allows users to customize how states are stored in and restored from checkpoints.

These problems encountered in practice prompt our work on end-to-end adaptive training. For details please refer to .

2 Distributed Inference

It becomes infeasible to perform inference with the ERNIE 3.0 Titan model on a single common GPU device such as the Nvidia A100-SXM4 (40GB). Therefore, the model has to be split into multiple subgraphs deployed on multiple GPU cards. We implemented tensor model parallelism and pipeline model parallelism in Paddle Serving, PaddlePaddle’s online serving framework. In addition, we adopted methods such as unified memory optimization, op fusion, model IO optimization, and quantization-aware acceleration on a single card to improve the inference speed.
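A back-of-the-envelope calculation shows why a single card is insufficient. The sketch below assumes that the 260B parameters are stored in FP16 (2 bytes each) and ignores activations, caches, and framework overhead; both simplifications are ours, so the result is only a lower bound on the number of cards.

```python
import math

num_params = 260e9        # model size stated in the paper (260B parameters)
bytes_per_param = 2       # assumption: FP16 weights
card_memory_gb = 40       # Nvidia A100-SXM4 (40GB)

weights_gb = num_params * bytes_per_param / 1e9
min_cards = math.ceil(weights_gb / card_memory_gb)
```

Under these assumptions the weights alone occupy about 520 GB, so at least 13 such cards are needed before any activation memory is counted.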

Online Distillation Framework for ERNIE 3.0 Titan

We devise an online distillation framework motivated by the computation overhead and carbon emissions of the training stage, in which the teacher model teaches the students and trains itself simultaneously to utilize the computational resource more effectively. To describe the framework, we first introduce our main procedure and then discuss the two key techniques, On the Fly Distillation (OFD) and Auxiliary Layer Distillation (ALD), in detail.

Unlike existing distillation methods, our proposed framework trains multiple students at once rather than one. Figure 6 shows the general structure of our proposed distillation framework. The blue rectangle on the left represents the teacher model, ERNIE 3.0 Titan, and the green rectangle represents a 24-layer model, which we call the teacher assistant (TA). Besides the TA model, we also introduce multiple smaller student models (shown in red) into this framework. Considering the large gap between the teacher model and those smaller students, we do not directly transfer the knowledge from teacher to students. Instead, we use the teacher assistant as a distillation bridge to better transfer knowledge. Since the attention probability of the teacher model’s last few layers has been shown to be crucial to task-agnostic knowledge distillation, we follow this paradigm in our framework. Through this framework, three student models can be trained simultaneously, avoiding the complex process of training them one by one as in traditional distillation methods. However, there are still two issues with this procedure. The first is that the distillation process does not use training-time computational resources effectively and requires additional forward propagation from the teacher for distillation, causing additional computation overhead. The second is that matching the attention module in the Transformer between teacher and students leaves the weights of the feed forward network in the students’ last layer untrained, as one Transformer layer is composed of an attention module and a feed forward network module. To solve these two problems, we introduce two techniques called OFD and ALD, which are discussed in the following sections.

2 OFD: On the Fly Distillation

To better utilize the computational resource and for a more environmentally friendly distillation, we propose our online learning method: OFD. In the learning process, every time the teacher updates one step, the students update one step toward the teacher. Students’ learning targets (i.e., the teacher) change over time during the training process. From this “moving target” perspective, this framework is similar to prior work in which different teacher checkpoints during pre-training are selected as distillation targets. However, the teacher in our framework changes smoothly, whereas the teacher in that work changes discretely. OFD allows teacher training and distillation to be performed simultaneously. The benefit is that we can reuse the forward propagation from the teacher during the teacher’s pre-training for distillation, unlike existing knowledge distillation methods, which require additional forward propagation from the teacher to extract the knowledge for distillation. Note that OFD does not influence the teacher’s training, as the gradients from the distillation loss do not flow from the TA or students back to the teacher.
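The update pattern of OFD can be illustrated with a deliberately tiny example. In the sketch below, teacher and student are single-weight linear models (a stand-in for the real Transformers), the teacher is trained on its own loss, and the student is trained to match the teacher output from the same forward pass. The teacher output is treated as a constant in the student loss, mirroring the statement that distillation gradients do not flow back to the teacher. The data, learning rate, and function name are hypothetical.

```python
def ofd_train(steps=200, lr=0.05):
    """Toy on-the-fly distillation with one-parameter linear models."""
    data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]   # samples of y = 3x
    w_teacher, w_student = 0.0, 0.0
    for step in range(steps):
        x, y = data[step % len(data)]
        t_out = w_teacher * x      # single teacher forward pass, reused below
        s_out = w_student * x
        # Teacher gradient comes only from its own training loss (t_out - y)^2.
        grad_teacher = 2.0 * (t_out - y) * x
        # Student gradient of (s_out - t_out)^2, with t_out held constant
        # (stop-gradient), so nothing flows back into the teacher.
        grad_student = 2.0 * (s_out - t_out) * x
        w_teacher -= lr * grad_teacher
        w_student -= lr * grad_student
    return w_teacher, w_student

w_teacher, w_student = ofd_train()
```

The student chases a target that moves smoothly with every teacher step, and no extra teacher forward pass is spent on distillation.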

3 ALD: Auxiliary Layer Distillation

As the knowledge being transferred during distillation is the attention probability distribution, the other module in a Transformer block, the feed forward network, will not be trained during distillation. To show this problem more clearly, we will briefly describe the structure of the Transformer.

In our distillation framework, we use the Kullback–Leibler divergence of $\mathbf{A}_{l,a}$ between teacher and students as the distillation objective. However, matching the attention in the last layer of the students will leave the FFN in the last layer untrained as the gradient only flows backward. To this end, we propose stacking an extra layer on the students during distillation to ensure that the gradient can flow through the entire network and that all the parameters are trained during distillation. This extra layer will be discarded when the students are fine-tuned on downstream tasks.
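As a rough illustration of the distillation objective, the sketch below computes the Kullback–Leibler divergence between teacher and student attention distributions for one head, averaged over query positions. The direction KL(teacher || student), the small `eps` for numerical safety, and the toy attention scores are assumptions of this sketch rather than details confirmed by the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_kl(teacher_scores, student_scores):
    """Mean KL(teacher || student) over query positions for one attention head."""
    rows = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_scores, student_scores)
    ]
    return sum(rows) / len(rows)

# Toy attention scores: 2 query positions attending over 3 keys.
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss = attention_kl(teacher, student)
```

Because this loss depends only on the attention probabilities of the matched layer, the feed forward network that follows that attention module receives no gradient, which is the gap the extra ALD layer is meant to close.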

Experiments

Three groups of experiments, including fine-tuning on natural language understanding tasks (in Sec. 6.2), few-shot learning (in Sec. 6.3), and zero-shot learning (in Sec. 6.4), are conducted on a variety of prevailing NLP tasks to evaluate the performance of ERNIE 3.0 Titan. All the previous state-of-the-art results for comparison come from the best public single model reported that we could find. (The previous SoTA results of ERNIE 2.0 and RoBERTa-wwm-ext on corresponding datasets are reproduced by ourselves, except for the datasets that already have released pre-trained results.) It is essential to mention that all experimental results of ERNIE 3.0 Titan are based on the insufficiently pre-trained model so far. ERNIE 3.0 Titan is still in training, and we believe that the model will become stronger as the pre-training progresses.

68 datasets belonging to 12 kinds of natural language processing tasks are used in our experiments, in which datasets marked with FC are from the FewCLUE Benchmark. Notably, several datasets are used in both the fine-tuning / few-shot learning experiments and the zero-shot learning experiments, in different ways. In this paper, we treat duplicate datasets used in different evaluation methods independently. The details are as follows:

Sentiment Analysis (SA): NLPCC2014-SC http://tcci.ccf.org.cn/conference/2014/pages/page04_dg.html, SE-ABSA16_PHNS http://alt.qcri.org/semeval2016/task5/, SE-ABSA16_CAME, BDCI2019 https://www.datafountain.cn/competitions/350, EPRSTMT .

Opinion Extraction (OE): COTE-BD , COTE-MFW .

Natural Language Inference (NLI): OCNLI , CMNLI , OCNLI-FC .

Winograd Schema Challenge (WSC): CLUEWSC , CLUEWSC-FC .

Relation Extraction (RE): FinRE , SanWen .

Semantic Similarity (SS): AFQMC , LCQMC , PAWS-X , BQ Corpus , CSL , CSL-FC , BUSTM .

Text Classification (TC): TNEWS https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset, TNEWS-FC , IFLYTEK , IFLYTEK-FC , THUCNEWS http://thuctc.thunlp.org/, CNSE , CNSS , CSLDCP .

Closed-Book Question Answering (CB-QA): NLPCC-DBQA http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf, CHIP2019, cMedQA , cMedQA2 , CKBQA https://github.com/pkumod/CKBQA, WebQA .

Cloze and Completion (Clz.&Compl.): PD&CFT , CMRC2017 , CMRC2019 , CHID , CHID-FC , WPLC .

Machine Reading Comprehension (MRC): DRCD , DuReader , DuReader ${}_{\text{robust}}$ , DuReader ${}_{\text{checklist}}$ , DuReader ${}_{\text{yesno}}$ https://aistudio.baidu.com/aistudio/competition/detail/49/?isFromLUGE=TRUE, C3 , CMRC 2018 .

Legal Documents Analysis (LDA): CAIL2018-Task1 , CAIL2018-Task2 .

Cant Understanding (CU): DogWhistle Insider, DogWhistle Outsider .

2 Experiments on Fine-tuning Tasks

The results of natural language understanding tasks are reported in Table 2.

Sentiment Analysis. Sentiment Analysis is a classification task aiming to determine whether a sentence is positive, negative, or neutral. We consider four datasets from different domains, including shopping (NLPCC2014-SC), electronics (SE-ABSA16_PHNS, SE-ABSA16_CAME), and finance (BDCI2019). ERNIE 3.0 Titan achieves state-of-the-art results on all four datasets.

Opinion Extraction. Similar to the sentiment analysis task, opinion extraction requires the model to mine the opinion of a sentence. We use two sub-datasets from Chinese Customer Review (COTE). Experiment results show that ERNIE 3.0 Titan also outperforms the current SoTA system.

Relation Extraction. The relation extraction task is to identify the relationship between different entities like persons and organizations. We consider FinRE and SanWen – two relation extraction datasets for financial news and Chinese literature, respectively. ERNIE 3.0 Titan outperforms the previous SoTA model by a remarkable margin.

Semantic Similarity. Semantic Similarity is a classic NLP task that determines the similarity between various units such as words, sentences, and documents. In this work, we focus on sentence-level similarity tasks. We test ERNIE 3.0 Titan on several datasets in varied fields, including AFQMC, LCQMC, and BQ. Experiment results show that ERNIE 3.0 Titan outperforms the baseline models by a visible margin.

Text Classification. We also evaluate ERNIE 3.0 Titan on text classification. We consider four datasets covering app descriptions (IFLYTEK) and news stories (THUCNEWS, CNSE, CNSS). Across these different types of classification tasks, ERNIE 3.0 Titan consistently achieves better accuracy.

Closed-Book Question Answering. Closed-Book Question Answering aims to directly answer the questions without any additional references or knowledge. We select a general QA dataset NLPCC-DBQA and two medical field datasets – cMedQA and cMedQA2 to test the ability of ERNIE 3.0 Titan. Experiment results show that ERNIE 3.0 Titan performs better on all QA tasks. We believe knowledge-enhanced pre-training methods do bring benefits to the closed-book QA task.

Cant Understanding. Cant, also known as doublespeak, is an advanced form of language usage for humans. However, it is rather difficult for machines to understand this type of language. We test the cant understanding ability of ERNIE 3.0 Titan on DogWhistle, a dataset based on the Decrypto game. The model is required to select the right answer with the guidance of the corresponding cant. ERNIE 3.0 Titan gets the best result and shows its potential for understanding more difficult language.

Cloze and completion. Cloze tests require the ability to understand context and vocabulary in order to identify the correct language that belongs in the deleted passages. Benefiting from the knowledge-enhanced pre-training, ERNIE 3.0 Titan achieves the best score among baselines.

Machine Reading Comprehension. We comprehensively evaluate the ability of ERNIE 3.0 Titan on machine reading comprehension in different aspects, including span-prediction reading comprehension (DuReader, DRCD, DuReader ${}_{\text{checklist}}$ ), multiple-choice reading comprehension (C3, DuReader ${}_{\text{yesno}}$ ), and a robustness test (DuReader ${}_{\text{robust}}$ ). With the help of knowledge-enhanced pre-training, ERNIE 3.0 Titan surpasses the baseline models with significant enhancements on all types of tasks.

Legal Documents Analysis. Next, two domain-specific tasks of law are chosen to evaluate the ability of ERNIE 3.0 Titan on document analysis. These two datasets from CAIL2018 are both multi-label document classification tasks. ERNIE 3.0 Titan breaks the previous SoTA performance.

3 Experiments on Few-shot Learning

In this section, we evaluate the few-shot performance of ERNIE 3.0 Titan on the FewCLUE benchmark. FewCLUE is a comprehensive few-shot evaluation benchmark in Chinese that includes a variety of tasks. Following Section 6.1, we categorize these tasks into six different types, including text classification, natural language understanding, etc. Each task provides five training/evaluation splits and a corresponding union set (called train_all and dev_all). The number of few-shot training samples is related to the number of label categories. In other words, tasks with more classes have more training samples. We choose mT5-XXL, Yuan 13B-PLM, and ERNIE 3.0 as our baselines. For mT5-XXL, the results are produced based on the open-source code with its default tuning method, and for Yuan 13B-PLM, we directly use the published results.

We tested three approaches to train the few-shot learners based on ERNIE 3.0 Titan. For different types of tasks, we use the corresponding task-specific training method:

For text classification, sentiment analysis, and semantic similarity tasks, we use traditional fine-tuning methods.

For the reading comprehension (CHID-FC) and Winograd schema challenge tasks, we utilize pattern-exploiting training to reformulate such tasks as cloze-style questions and perform gradient-based tuning.

For the natural language inference task, we exploit NSP-based prompt training . Different labels are regarded as prompts to concatenate the two sentences, and the model is trained to select the label that makes the concatenated sentence the most coherent. The classification network of the sentence distance task (Section 3.2.2) in the pre-training phase can be seen as a good initialization to train the coherent ranker.

We use the union set to train the few-shot learner and report the results on the evaluation sets (dev_all) and public test sets.

3.2 Results

The main results of the few-shot learning tasks are shown in Table 3. ERNIE 3.0 Titan consistently outperforms baseline models, including ERNIE 3.0. Under vanilla fine-tuning, ERNIE 3.0 Titan still surpasses Yuan 1.0 by 4.58% on average on the text classification task, 1.56% on sentence similarity, and 0.62% on semantic similarity tasks. The reading comprehension and Winograd schema challenge tasks can be naturally reformulated as cloze-style tasks. Under prompt-based pattern exploiting, ERNIE 3.0 Titan outperforms Yuan 1.0 by a remarkable margin: 8.18% absolute improvement on CLUEWSC-FC and 7.92% on the CHID-FC dataset. For the natural language inference task, ERNIE 3.0 Titan also achieves a significant improvement of 6.87 points.

4 Experiments on Zero-shot Learning

This section evaluates various types of tasks in the zero-shot setting, where a model is applied without parameter updates. ERNIE 3.0 Titan achieves strong performance compared to recently proposed large-scale Chinese language models such as CPM-1 (2.6B), PanGu- $\alpha$ , and Yuan 1.0 on all downstream tasks. On the CKBQA-sub dataset, which requires strong knowledge reasoning ability, ERNIE 3.0 Titan surpasses GPT-3 by over 8 percentage points in accuracy. In our case study (Sec. 6.4.3), we demonstrate the ability of ERNIE 3.0 Titan to generate controllable and credible results. Quantitatively, we evaluated ERNIE 3.0 Titan against baselines on our manually collected 467 cases across 13 different tasks and showed that it could generate more coherent, natural, and accurate responses.

We unify five probability forms of the scoring function for tasks with a limited label set, such as text classification, sentiment analysis, and cloze-style tasks. Based on the unified ERNIE 3.0 framework, the implementation of these five scoring functions can be task-specific. The ablation study in Sec. 6.6.1 shows the effect of different scoring functions, where some obtain a stable performance gain over others on a certain task type.

In Table 5, five forms of the scoring function are shown, where $f_{\text{prompt}}(\cdot)$ is the function that prompts the input $x$ into $x^{\prime}$ , $f_{\text{fill}}(\cdot)$ is the function that fills the $i$-th label into the prompted text $x^{\prime}$ , $y_{i}\in\mathcal{Y}$ , $\mathcal{Y}$ is the label set, and $|y_{i}|$ and $l_{i}$ denote the token lengths of label $y_{i}$ and the filled prompt, respectively. We introduce these scoring functions as follows:

$P(x,y)$ . The implementation of the joint probability $P(x,y)$ is equivalent to the perplexity of the prompted text, meaning that the prompted text with the lowest perplexity score will be predicted as the correct answer.

$P(y|x)$ . Given the input text $x$ , we will choose the highest probability answer among all possible labels. To ease the effect of the label length bias, length normalization is commonly used. This method is also called the average log-likelihood used in GPT-3 .

$P(x|y)$ is the reverse version of $P(y|x)$ . The above two forms explicitly include the label’s probability and thus ignore the effect of label bias. Since labels are unevenly distributed in the pre-training corpus, the model tends to assign a higher probability to labels that are common in the corpus. By conditioning on $y$ , we assume the impact of the label bias will be flattened over the input tokens.

$\frac{P(y|x)}{P(y)}$ is proportional to the pointwise mutual information (PMI) of $x,y$ . PMI has been used for finding collocations and associations between words. Compared to $P(y|x)$ , we can think of $\frac{P(y|x)}{P(y)}$ as a way to eliminate the effect of label bias, since common labels with high probability will decrease the final score through division. In practice, it is better to restrict $P(y)$ to the target domain by prompting the text with a domain-specific prompt while taking an empty input, as done in prior work.

$P(\text{True}|x,y)$ formalizes each possible answer with the input text as a binary classification task. In this way, the next sentence prediction (NSP) task can be utilized, which has been pre-trained on the hidden state of [CLS]. (In the ERNIE framework, we use the sentence distance prediction task, which is an extension of the traditional NSP task, introduced in Sec. 3.2.2.) Intuitively, the NSP task estimates the affinity score of two sentences. Thus, we can prompt the text into two sentences where one is filled with different labels.
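Three of the scoring functions above can be sketched directly on per-token log-probabilities. The code below is a minimal illustration, assuming a language model has already produced those log-probabilities; the numbers and label names are hypothetical, and the exact normalization used in Table 5 may differ from this sketch.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a prompted text; for P(x, y) the lowest value wins."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def avg_log_likelihood(label_logprobs):
    """Length-normalized log P(y|x) over the label tokens."""
    return sum(label_logprobs) / len(label_logprobs)

def pmi_score(cond_logprobs, prior_logprobs):
    """log P(y|x) - log P(y), the logarithm of P(y|x) / P(y)."""
    return sum(cond_logprobs) - sum(prior_logprobs)

# Hypothetical label-token log-probabilities with and without the input x.
candidates = {
    "sports": {"cond": [-0.9], "prior": [-0.5]},    # frequent label
    "finance": {"cond": [-1.2], "prior": [-2.5]},   # rare label
}
by_cond = max(candidates, key=lambda y: avg_log_likelihood(candidates[y]["cond"]))
by_pmi = max(
    candidates,
    key=lambda y: pmi_score(candidates[y]["cond"], candidates[y]["prior"]),
)
```

In this toy case $P(y|x)$ picks the frequent label, while dividing by $P(y)$ picks the rarer label whose probability rose the most once the input was seen, which is the label-bias correction described above.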

For generative tasks such as machine reading comprehension, ERNIE 3.0 used a restrained beam search with a beam width of 8 for extractive MRC to ensure the generation is a span that occurred in the context. Though effective, the beam search is time-consuming for ERNIE 3.0 Titan. Since ERNIE 3.0 Titan is assumed to be more powerful, we used the top-1 sampling strategy for all generation tasks.

4.2 Results

In Table 6, we summarize the prompting functions, label verbalizers, and evaluation methods we used for each task. The detailed results are introduced as follows:

Text Classification. For the TNEWS and IFLYTEK datasets, we randomly sample three candidates as negative labels for each sample. This sampling strategy is aligned with that of CPM-1, PanGu- $\alpha$ , and ERNIE 3.0 to make fair comparisons. For TNEWS-FC, IFLYTEK-FC, and CSLDCP, the negative sampling strategy is discarded in order to compare with Yuan 1.0. ERNIE 3.0 Titan outperforms competitive baselines on these tasks.

Sentiment analysis. On the EPRSTMT dataset, ERNIE 3.0 Titan achieves 88.75% accuracy in the zero-shot setting, suggesting that sentiment analysis is a relatively simple task for large-scale pre-trained models.

Semantic Similarity. We consider the AFQMC, CSL, CSL-FC, and BUSTM datasets. ERNIE 3.0 Titan outperforms baselines by a large margin. However, compared to ERNIE 3.0, the performance gain on AFQMC and CSL is marginal, and the minor improvement on CSL comes from the soft prompt tokens [webN]. This suggests that the soft prompt tokens prepended to the original text help the model utilize the knowledge in a specific domain. On the other hand, it is hard for models to learn semantic similarity based only on a language modeling loss. On BUSTM, ERNIE 3.0 Titan using the NSP task as the scoring function surpasses Yuan 1.0 by 5 percentage points. We assume the inductive bias of the NSP task helps the model learn semantic similarity.

Natural Language Inference. ERNIE 3.0 Titan is evaluated on three NLI datasets, namely OCNLI, OCNLI-FC, and CMNLI, and achieves the best performance. We used the $P(\text{True}|x,y)$ scoring function for OCNLI-FC since we found that the NSP task shows some capability in modeling semantic similarity on the BUSTM dataset. We tested different soft prompt tokens for CMNLI, including [webN], [qaN], and [novelN]. Soft prompt tokens [novelN] achieve the highest score (51.70), compared to 49.87 with [webN] and 49.4 with [qaN]. There is still considerable room for improvement for pre-trained models on zero-shot NLI tasks.

Winograd Schema Challenge. We formalize the CLUEWSC dataset as a multi-choice completion task where a pronoun is replaced with each candidate to calculate the scores. Since there is no candidate set for the CLUEWSC-FC dataset, we design a better-suited prompt by appending a complement after the target pronoun, and the model is required to judge the correctness of the complement. ERNIE 3.0 Titan surpasses Yuan 1.0 by a large margin on the CLUEWSC-FC dataset and achieves superior performance on CLUEWSC, owing to the power of scale.

Cloze and completion. On the CHID dataset, we split a sample containing multiple blanks into multiple sentences that are predicted independently. The Hungarian algorithm is used to ensure that two blanks in one sample have unique predictions. ERNIE 3.0 Titan achieves the best score among baselines, and the Hungarian algorithm contributes substantially, improving the score from 77.32 to 86.21. For Chinese Word Prediction with Long Context (Chinese WPLC), a sample consists of a masked text and a correct word. Following PanGu- $\alpha$ , we replace the mask token with the correct word and calculate the perplexity score of the whole sentence. ERNIE 3.0 Titan achieves a much lower perplexity score (16.50) with the help of soft prompt tokens [novelN]. On the CMRC2019 dataset, we randomly sample three negative candidates for each blank from the original candidates, and then beam search is applied to calculate the optimal path for a sample. We also formalize PD, CFT, and CMRC2017 as multi-choice tasks where the choices are the words appearing in the given text. For efficiency, restricted generation is used for these three datasets. ERNIE 3.0 Titan surpasses the baselines by a large margin.
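The role of the unique-assignment step on CHID can be illustrated with a small example. For brevity, the sketch below finds the best one-to-one assignment by exhaustive search over permutations rather than by the Hungarian algorithm itself; for a handful of blanks the two give the same optimum, but exhaustive search does not scale. The score matrix and function name are hypothetical.

```python
from itertools import permutations

def assign_unique(scores):
    """Best one-to-one assignment of candidates to blanks (exhaustive search).

    scores[i][j] is the model score of candidate j for blank i.
    """
    n_blanks, n_cands = len(scores), len(scores[0])
    best_total, best = float("-inf"), None
    for perm in permutations(range(n_cands), n_blanks):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best = total, perm
    return list(best)

# Two blanks, three candidate idioms (hypothetical scores).
scores = [
    [0.90, 0.80, 0.10],
    [0.95, 0.20, 0.10],
]
# Independent prediction picks candidate 0 for both blanks.
greedy = [max(range(len(row)), key=row.__getitem__) for row in scores]
unique = assign_unique(scores)
```

Here independent prediction assigns the same candidate twice, while the joint assignment moves the first blank to its second-best candidate so that the total score over distinct candidates is maximal.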

Machine Reading Comprehension. We consider four MRC datasets. Due to the power of ERNIE 3.0 Titan, we simply utilized the top-1 sampling strategy to generate answers. The maximum generated length of a completion is limited to a pre-defined number based on the 95th percentile of answer lengths in the dataset. The performance of ERNIE 3.0 Titan is superior, outperforming Yuan 1.0, which has a comparable number of parameters, by 3.03 percentage points on average on the CMRC2018 dataset.
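The length cap described above can be derived from the answer lengths of a dataset. The sketch below uses the nearest-rank definition of a percentile, which is our assumption since the text does not specify one; the function name and the sample lengths are hypothetical.

```python
def max_generation_length(answer_lengths, pct=95):
    """Nearest-rank percentile of answer lengths, used as a generation cap."""
    ordered = sorted(answer_lengths)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = -(-pct * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical answer lengths (in tokens) for a small dataset.
lengths = list(range(1, 101))
cap = max_generation_length(lengths)
```

Capping at a high percentile rather than at the maximum avoids letting a few unusually long answers inflate the decoding budget for every question.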

Closed-book Question Answering. We evaluated ERNIE 3.0 Titan on two Closed-book Question Answering datasets, which require the model to generate answers using its inherent knowledge learned during pre-training. WebQA is a large-scale real-world QA dataset from Baidu Zhidao. CKBQA is a knowledge-based question answering task. We only provide ERNIE 3.0 Titan with the question, without additional evidence. In addition, we evaluated GPT-3 on a subset of the CKBQA dataset where questions requiring background knowledge of China are filtered out and the remaining questions are manually translated into English. The engine for GPT-3 is Davinci, and the objective is text completion. ERNIE 3.0 Titan significantly outperforms baselines and exceeds GPT-3 by over 8 percentage points, indicating that ERNIE 3.0 Titan is superior in learning and reasoning.

4.3 Case Study

We manually collected 467 cases https://ernie-github.cdn.bcebos.com/cases.xlsx to evaluate the zero-shot generation ability of current large-scale pre-trained models on 13 tasks from 5 different types, including Question Answering, Interpretation, Dialogue, Text Generation, and Summarization. In human evaluation, the annotators are asked to score the generation quality on a scale of $ $. We report the average scores of coherence, fluency, and accuracy in Tab. 7 and 8, and show some controllable generations of ERNIE 3.0 Titan in Tab. 9. In addition, we construct a subset of the above manually collected cases from which cases requiring background knowledge about Chinese history, geography, and culture have been removed. Overall, ERNIE 3.0 Titan generates the most coherent, fluent, and accurate texts on average compared to CPM-1, PLUG, PanGu- $\alpha$ , ERNIE 3.0, and GPT-3, and users can combine different attributes to obtain highly credible and controllable generations. (We use the implementation of CPM-1 in https://github.com/jm12138/CPM-Generate-Paddle, PLUG in https://nlp.aliyun.com/portal?/BigText_chinese#/BigText_chinese, PanGu- $\alpha$ in https://git.openi.org.cn/PCL-Platform.Intelligence/PanGu-Alpha, ERNIE 3.0 in https://wenxin.baidu.com/wenxin/ernie, and GPT-3 in https://beta.openai.com/docs/guides/completion.) The three scoring metrics are introduced as follows, and the scoring details are provided in Tab. 10.

Coherence measures whether the generation is relevant and consistent with the context.

Fluency evaluates whether the generated text is natural and readable. A fluent text should have no semantic contradiction within the generated text.

Accuracy is a metric to evaluate whether the generated text is the same as the ground truth.

5 Experiments on Model Distillation

This section discusses the experiments for our task-agnostic model distillation framework for ERNIE 3.0 Titan. We evaluate our distilled ERNIE 3.0 Titan on five typical types of downstream tasks, including natural language inference (XNLI ), sentiment analysis (ChnSentiCorp ), document question answering (NLPCC-DBQA http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf), semantic similarity (LCQMC ), and machine reading comprehension (CMRC2018 ).

To reduce the gap between the giant teacher model and the student models, we introduce a 24-layer TA model with a hidden size of 1024. To accelerate the distillation procedure, we also pre-train the TA model for 500K steps before starting distillation, with the same pre-training setting as ERNIE 3.0 Titan. For ALD, we add one extra layer to each student. Then we jointly train the teacher and students with OFD. We introduce five students with different model sizes, with parameters ranging from 14M to 110M. We measure the inference latency of these student models on a V100 GPU with PaddlePaddle and show the relative speedup with respect to BERT-Base in Table 11.

We compare our downstream fine-tuning results with other compact PLMs. We choose the Chinese versions of BERT, TinyBERT, ERNIE 2.0, and RoBERTa-wwm-ext as baseline models. BERT, RoBERTa-wwm-ext, and ERNIE 2.0 are PLMs pre-trained from scratch without any distillation. TinyBERT is pre-trained via a multi-step distillation procedure using the task-specific distillation paradigm. Results are shown in Table 11; the reported numbers are means over five random initializations.

As shown in Table 11, the 12L768H student model achieves SoTA results on all tasks. Looking more closely at the EM and F1 for CMRC2018, this student model surpasses the strong baseline ERNIE 2.0 by 4.78 and 2.82, respectively. Notably, among all 6-layer PLMs, the 6L768H version of distilled ERNIE 3.0 Titan performs the best and even outperforms the 12-layer BERT-Base on XNLI, LCQMC, and NLPCC-DBQA. The comparison between these 6-layer students and TinyBERT also demonstrates the effectiveness of our proposed framework.

6 Analysis

In Table 12, we compare the performance of different scoring functions on five zero-shot tasks. Overall, $P(y|x)/P(y)$ -bi is more amenable to text classification tasks, while semantic similarity and natural language inference tasks prefer $P(\text{True}|x,y)$ -bi. We put forward the hypothesis that the effectiveness of $P(y|x)/P(y)$ -bi is positively correlated with the size of the label set. On IFLYTEK-ZC with 119 classes, ERNIE 3.0 Titan obtains a gain of 9.68 percentage points ( $30.74\rightarrow 40.42$ ), compared to 4.28 percentage points ( $53.55\rightarrow 57.83$ ) on TNEWS-FC with 15 classes, when eliminating the label bias effect through division by $P(y)$ . $P(y|x)/P(y)$ -bi even fails on BUSTM with 2 classes and OCNLI-FC with 3 classes compared to the second-best scoring function $P(x|y)$ -uni. Intuitively, a dataset with a larger label set is more likely to be affected by label bias. The model prefers answers that are frequent in the pre-training dataset, which conflicts with the balanced label distribution in the downstream dataset. $P(\text{True}|x,y)$ -bi utilizes the inherently pre-trained NSP task, which is suitable for tasks that need to distinguish the semantic similarity between two sentences. The best performance is always achieved using a scoring function with bidirectional attention. When comparing $P(y|x)$ -uni with $P(y|x)$ -bi, $P(y|x)$ -bi can additionally utilize the information from the text after the label, resulting in better performance across all datasets.

6.2 Adversarial Credibility Classification

We notice that the adversarial credibility classification method introduced in Sec. 3.2.3 not only filters out low-credibility texts during generation but also speeds up pre-training convergence. We conduct a comparative experiment with the base model settings (12 layers, 768 hidden dimensions, 12 attention heads) and report the results in Figure 7. The pre-training tasks of our baseline strictly follow the settings of the original ERNIE 3.0, while the contrast model has an additional self-supervised adversarial loss for credibility classification.

Figure 7 illustrates how the perplexity of the masked language model task changes during pre-training. The model with the auxiliary adversarial loss converges faster. We think this might be because the adversarial loss, which requires distinguishing whether a text is generated or original, helps the model learn the distribution of true natural sentences.
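To make the compared quantities concrete, the following sketch shows one possible way to combine a masked language model loss with an auxiliary credibility classification loss, and how perplexity follows from the per-token negative log-likelihoods. It is an illustration under stated assumptions, not the training code used in the paper: the weighted-sum form, the `adv_weight` parameter, and all function names are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def binary_cross_entropy(p_original, is_original):
    """Loss for classifying a text as original (True) or generated (False)."""
    eps = 1e-12  # keeps log() finite at the boundaries
    p = min(max(p_original, eps), 1.0 - eps)
    return -math.log(p) if is_original else -math.log(1.0 - p)


def total_loss(mlm_token_nlls, cred_probs, cred_labels, adv_weight=1.0):
    """MLM loss plus a weighted credibility loss.

    Assumption: the two losses are combined by a weighted sum; the paper
    does not state the combination in this section.
    """
    mlm_loss = sum(mlm_token_nlls) / len(mlm_token_nlls)
    adv_losses = [
        binary_cross_entropy(p, y) for p, y in zip(cred_probs, cred_labels)
    ]
    adv_loss = sum(adv_losses) / len(adv_losses)
    return mlm_loss + adv_weight * adv_loss


# Hypothetical toy batch: three masked tokens, two texts to classify.
mlm_nlls = [math.log(4.0), math.log(4.0), math.log(4.0)]
loss = total_loss(mlm_nlls, cred_probs=[0.5, 0.5], cred_labels=[True, False])
```

Setting `adv_weight=0.0` recovers the baseline objective in this sketch, so the comparison in Figure 7 corresponds to tracking `perplexity` of the masked tokens with and without the second term.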

6.3 The Effect of Controllable Attributes

In the ERNIE 3.0 Titan framework, we introduced five controllable attributes in Sec. 3.3: genre, topic, keywords, sentiment, and length. By combining different attributes, users can steer ERNIE 3.0 Titan toward diverse and controllable generations, as shown in Table 9. Meanwhile, the genre soft prompts improve the performance on some zero-shot tasks such as CHID, CMNLI, and WPLC. We assume the improvement comes from domain calibration. For example, the [novelN] soft prompts are prefixed to novel texts when optimizing the language modeling loss. When conditioning on the [novelN] soft prompts, the output distribution of the model shifts toward the novel domain, which results in a lower perplexity score on the WPLC dataset. In addition, we test the effect of the length attribute and find that the number of actually generated tokens is positively correlated with the expected number of generated tokens, as shown in Figure 8. We also observe that the length attribute affects the genre of generations. ERNIE 3.0 Titan tends to generate texts constructed from the knowledge graph (like The capital of China is Beijing.) when the length attribute is small, and prefers novel and web texts when the length attribute is large.

Conclusion

We pre-train a knowledge-enhanced language model with 260 billion parameters named ERNIE 3.0 Titan based on the ERNIE 3.0 framework. It is the largest Chinese dense pre-trained model as far as we know. We have validated it on 68 datasets, and the results show that ERNIE 3.0 Titan achieves new state-of-the-art results. In addition, we propose a novel method that lets users control the generation result and obtain results that are factually consistent with the real world. To address the computation overhead of large-scale pre-trained models, we also devise an online distillation framework and produce several distilled models of different sizes. In the next stage, we will continue to update ERNIE 3.0 Titan with more data to further explore the performance limits of large-scale pre-trained language models. We will also endeavor to explore the potential of knowledge-enhanced large-scale multi-modal models for a wider variety of tasks.