PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, Yonghong Tian

cs.CL

Introduction

Pre-trained Language Models (PLMs) [1, 2, 3, 4, 5, 6, 7, 8, 9, etc.] have achieved great success in Natural Language Processing (NLP). By learning contextual representations of text from large-scale corpora in a self-supervised manner, PLMs can achieve state-of-the-art performance on a wide range of Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks.

Radford et al. demonstrate significant gains on a variety of NLP tasks via the Generative Pre-trained Transformer (GPT), an autoregressive language model that is first pretrained on unsupervised text data and then finetuned for each supervised task. Devlin et al. propose BERT, a bidirectional Transformer with the masked language model (MLM) pretraining objective, which obtains new state-of-the-art performance on the GLUE benchmark of NLU tasks. Since then, a growing body of research has developed pretraining techniques and continuously improved performance on downstream NLP tasks. Among all the techniques, researchers find that the performance of PLMs can be steadily improved simply by enlarging the amount of training data as well as the capacity of the model. For instance, RoBERTa shows that BERT can be substantially improved by training the model longer with more data. GPT-2, the successor of GPT, shares the same architecture but contains 1.5 billion parameters and is trained on 40GB of text; it performs reasonably well on multiple tasks in the zero-shot setting. The T5 model, with 11 billion parameters trained on the 745GB C4 dataset, keeps pushing the performance of both NLU and NLG tasks.

Recently, the OpenAI team announced the latest version of its GPT-series models: GPT-3. The largest GPT-3 model contains 175 billion parameters and is trained using 570GB of text data. Besides its strong capability in generating high-quality text, GPT-3 is especially effective in solving a wide range of tasks without task-specific finetuning in the few-shot, or even zero-shot, settings. Moreover, on many of the tasks the performance improves steadily as the size of the GPT model grows, and sometimes even reaches the level of the prior state-of-the-art finetuning approaches. From an application perspective, GPT-3 is revolutionary, as it relieves the need to label many examples and retrain the model for every new task, a requirement that hinders the applicability of NLP models in real-world applications.

However, GPT-3 is currently available only through limited access via the OpenAI API, and it is primarily trained on English data. To promote public research on Chinese PLMs, we propose training a very large-scale Chinese PLM named PanGu-$\alpha$ with up to 200 billion parameters. To the best of our knowledge, this is the largest Chinese PLM as of the publication of this technical report.

The difficulty of training a PLM rises as the scale of the model grows beyond 10 billion parameters. The main challenges lie in three aspects:

Model Design. Several PLM architectures exist besides GPT and BERT. However, not all PLMs can be smoothly scaled to hundreds of billions of parameters. For example, some models may suffer from slow convergence or even divergence during training as the model size increases. Inspired by GPT-3 and our preliminary experiments, we choose the Transformer-based autoregressive language model as the base architecture. In addition, we develop an additional query layer on top of the Transformer layers to induce the expected output of the model during pretraining. Our experiments demonstrate that the structure of PanGu-$\alpha$ can scale up to 200 billion parameters.

Training Corpora. Training data is essential in building a strong and generalisable pretrained model. On the one hand, the amount of data should be sufficient to feed a large PLM. On the other hand, the data should be of high quality and diversity to ensure the generality of the PLM. To build a Chinese corpus with comprehensive coverage, we collect a large amount of data from a wide range of sources, including Common Crawl, e-books, encyclopedias, news, and so on. We then apply multiple rounds of data filtering and cleaning to make sure the processed data are of high quality and reliability.

Distributed Training. The memory requirement of training PanGu-$\alpha$ with 200 billion parameters is far beyond the memory capacity of modern AI processors. It is difficult to achieve high end-to-end throughput while keeping high resource utilization on a cluster of processors. The problem becomes more challenging when the topology of the hardware is taken into account. We combine five-dimensional parallel functionalities with a carefully designed parallelization strategy and apply them to the largest PanGu-$\alpha$, which is efficiently trained on a cluster of 2048 Ascend 910 AI processors powered by CANN (https://www.hiascend.com/en/software/cann).

We train three PanGu-$\alpha$ models of increasing size, namely PanGu-$\alpha$ 2.6B, PanGu-$\alpha$ 13B, and PanGu-$\alpha$ 200B, on a high-quality 1.1TB Chinese text corpus. We first evaluate the models on language modeling tasks, showing that perplexity decreases as the model capacity and the amount of data and computation increase. Then we investigate the text generation ability of PanGu-$\alpha$ in various scenarios such as dialogue generation, summarization, and question answering. We present a few generated samples for different applications in the experiment section. Furthermore, we evaluate the task-agnostic few-shot performance of PanGu-$\alpha$ 2.6B and 13B on a wide range of NLP tasks, including cloze tasks, reading comprehension, closed-book QA, Winograd-style tasks, commonsense reasoning, natural language inference, and text classification. The experimental results demonstrate that performance on various tasks generally improves as the model capacity grows.

We are currently seeking a proper way to give both non-profit research institutes and commercial companies access to our pretrained PanGu-$\alpha$ models, either by releasing the code and models or via APIs. We are also assessing the possibility of releasing all or part of our pretraining data, within the constraints of the law.

To help the community pretrain large-scale language models on their own, the parallel computing functionalities are open-sourced in the Auto-parallel module of MindSpore (https://gitee.com/mindspore/mindspore), a deep learning training/inference framework that can be used in mobile, edge and cloud scenarios. Beyond the basic parallel functionalities, Auto-parallel is easy to use: it frees developers from the details of parallel model training, requiring minimal (or zero) code modifications from the standalone version, as if the model were trained on a single device.

The remainder of this technical report is organized as follows. Section 2 describes the architecture of our PanGu-$\alpha$ models. In Section 3, we detail our methods for constructing a 1.1TB high-quality training corpus from 80TB of raw data collected from various sources. Section 4 addresses the parallelization paradigm of model training and the scheduling strategy on a cluster of Ascend processors. Section 5 presents the experimental results of the PanGu-$\alpha$ models on various tasks.

Model

PanGu-$\alpha$ is a large-scale autoregressive language model (ALM) pretrained on a large corpus of text, mostly in Chinese. It models the generative process of all the tokens in the corpus, where the generation of a token depends on the preceding tokens in the sequence. Assuming that a sequence $X=\{x_{1},x_{2},...,x_{N}\}$ is composed of $N$ tokens, the training objective can be formulated as maximization of the log-likelihood:

$$\mathcal{L}=\sum_{n=1}^{N}\log p(x_{n}|x_{1},...,x_{n-1};\theta),$$

where $p(x_{n}|x_{1},...,x_{n-1};\theta)$ is the probability of observing the $n$-th token $x_{n}$ given the previous context $x_{1:n-1}$, and $\theta$ denotes the model parameters.
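As an illustration of the log-likelihood objective, the following minimal sketch sums the log-probabilities that a model assigns to the observed tokens. The per-step probabilities are made-up values, and the function name is hypothetical; no real model is involved.

```python
import math

def sequence_log_likelihood(step_probs):
    """Sum of log p(x_n | x_1..x_{n-1}) over one sequence.

    step_probs[n] is the probability the model assigned to the token that
    actually appeared at position n (hypothetical values for illustration).
    """
    return sum(math.log(p) for p in step_probs)

# Toy 4-token sequence with made-up per-step probabilities.
toy_probs = [0.5, 0.25, 0.8, 0.1]
log_lik = sequence_log_likelihood(toy_probs)

# By the chain rule, this equals the log of the joint sequence probability.
joint_prob = math.prod(toy_probs)
```

Maximizing this quantity over the corpus is equivalent to minimizing the average negative log-likelihood per token, which is the training loss usually reported for such models.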

The architecture of PanGu-$\alpha$ is based on the Transformer, which has been extensively used as the backbone of a variety of pretrained language models such as BERT and GPT. Unlike these models, we develop an additional query layer on top of the Transformer layers to predict the next token. The diagram of the model is shown in Figure 1. We elaborate on each part as follows.

2.2 Model Structure

A standard transformer layer includes two sub-layers: multi-head attention (MHA) and fully connected feed-forward network (FFN).

With multiple attention heads, the output becomes:

For both MHA and FFN, we adopt the pre-layer normalization scheme, which makes the training of Transformer models easier and faster.
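For reference, a commonly used formulation of a pre-layer-normalization Transformer layer is sketched below. This is the standard textbook form, not necessarily the exact notation of this report: the symbols $W^{q}_{h}$, $W^{k}_{h}$, $W^{v}_{h}$, $W^{m}$, $W^{f_1}$, $W^{f_2}$, the causal mask $M$, the head dimension $d_k$, the number of heads $N_h$, and the choice of GeLU activation are all assumptions made for illustration. In the first four lines, $H$ denotes the input of the sub-layer.

```latex
\begin{aligned}
Q_h &= H W^{q}_{h}, \quad K_h = H W^{k}_{h}, \quad V_h = H W^{v}_{h}, \\
\mathrm{head}_h &= \mathrm{Softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + M\right) V_h, \\
\mathrm{MHA}(H) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_{N_h})\, W^{m}, \\
\mathrm{FFN}(H) &= \mathrm{GeLU}(H W^{f_1} + b^{f_1})\, W^{f_2} + b^{f_2}, \\
H' &= H_{l-1} + \mathrm{MHA}(\mathrm{LayerNorm}(H_{l-1})), \\
H_{l} &= H' + \mathrm{FFN}(\mathrm{LayerNorm}(H')).
\end{aligned}
```

The last two lines show what "pre-layer normalization" means in practice: normalization is applied to the input of each sub-layer, and the residual connection bypasses it.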

2.2.2 Query Layer

The subsequent computation of MHA and FFN remains the same as in the original Transformer. We denote the final output as $o_{n}$. The negative log-likelihood of the next token becomes:

$$-\log\left[\mathrm{Softmax}(o_{n}W^{o}+b^{o})\right]_{x_{n}},$$

where $x_{n}$ denotes the true token and $W^{o},b^{o}$ are the additional task-dependent parameters.
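The next-token loss can be illustrated with a small numeric sketch. It assumes that the next-token distribution is a softmax over $o_{n}W^{o}+b^{o}$ (an assumption about the exact form); the toy dimensions and values are made up.

```python
import math

def next_token_nll(o_n, W_o, b_o, true_token):
    """Negative log-likelihood of `true_token` under Softmax(o_n W_o + b_o)."""
    vocab_size = len(b_o)
    logits = [
        sum(o_n[i] * W_o[i][v] for i in range(len(o_n))) + b_o[v]
        for v in range(vocab_size)
    ]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[true_token]

# Toy sizes (assumed): hidden dimension 2, vocabulary of 3 tokens.
o_n = [1.0, -1.0]
W_o = [[0.5, 0.0, -0.5],
       [0.0, 1.0, 0.5]]
b_o = [0.0, 0.0, 0.0]
losses = [next_token_nll(o_n, W_o, b_o, t) for t in range(3)]
```

The token with the largest logit receives the smallest loss, and exponentiating the negated losses recovers a probability distribution over the vocabulary.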

2.2.3 Model Configurations

To evaluate the scaling ability of the PanGu- $\alpha$ model, we train three models with increasing magnitude of parameter sizes, that is, PanGu- $\alpha$ 2.6B, PanGu- $\alpha$ 13B, and PanGu- $\alpha$ 200B. Table 1 shows the detailed configurations of the three models, including the number of total parameters, the hidden dimension for the tokens, the inner dimension of the feed-forward layer, and the number of attention heads.

Dataset

A large-scale, high-quality Chinese text corpus is crucial for the pretraining of our PanGu-$\alpha$ models, especially the one with 200B parameters. Existing large-scale text corpora for pretraining very large language models are mainly in English. For example, GPT-3 is trained on a dataset that contains 570GB of filtered text from Common Crawl, in which $92.6\%$ of the words are English. The Colossal Clean Crawled Corpus (C4) for training T5 consists of about 750GB of clean English text scraped from the web. To the best of our knowledge, there are three Chinese text corpora that are above 100GB: (a) CLUECorpus2020 (100GB), which is retrieved from the Common Crawl dataset; (b) the Chinese multi-modal pretraining data, which contains 300GB of text; and (c) WuDaoCorpus (https://data.baai.ac.cn/data-set-details/0c8dc71dd06ae75a10ca422fb49b0751), which has so far opened about 300GB of text data to specific partners only. However, all of the above datasets are still not large enough to train very large-scale models of up to 200B parameters, compared with the data sizes used for existing English pretrained models.

Even though raw web datasets such as SogouT (https://www.sogou.com/labs/resource/t.php) and Common Crawl (https://commoncrawl.org/the-data/) contain a massive amount of Chinese text, the construction of our desired dataset is still challenging due to the highly varying quality of the raw web data, the huge amount of storage and computation needed to preprocess the data, and the lack of well-defined metrics to evaluate the quality of the data.

To tackle the aforementioned issues, we construct a 1.1TB high-quality Chinese text corpus by cleaning and filtering enormous raw data from multiple sources. A big data management platform is built to accelerate the massive data analysis and processing. Both manual and model-based evaluation measures are used to guide the data preprocessing and training data selection, as detailed in the following sections.

To construct a large-scale high-quality Chinese corpus, we collect nearly 80TB of raw data from public datasets (e.g., BaiDuQA, CAIL2018, Sogou-CA, etc.), web pages from Common Crawl, encyclopedias, news and e-books. As shown in Figure 2, our data construction process includes three steps: rule-based data cleaning, model-based data filtering and text deduplication. To improve the quality of the training dataset, the first two steps (i.e., cleaning and filtering) are iteratively enhanced via manual and model-based data quality evaluations. The data construction process is carried out on a big data management platform built on the open-source Spark/Hadoop framework using 8 high-performance computing nodes (4 computing nodes with 28TB storage, 2 CPUs (24 cores) and 1.5TB memory, and 4 computing nodes with 7.3TB storage, 2 CPUs (64 cores) and 1TB memory). With the distributed processing capability and the tools of our platform, the efficiency of the data analysis and processing is significantly improved (see Table 2 for the processing time). Next, we introduce the details of each step in the dataset construction process.

Among the five data sources shown in Figure 2, the Common Crawl data contributes the largest share of our corpus but unfortunately contains a significant amount of low-quality web pages. To improve the data quality, we first adopt the following rule-based text cleaning strategies over the raw web pages from Common Crawl:

Remove documents that contain less than 60% Chinese characters, fewer than 150 characters, or only the title of a webpage;

Remove the special symbols and duplicated paragraphs in each document;

Identify advertisements based on keywords and remove documents which contain advertisements;

Convert all traditional Chinese text to simplified Chinese;

Identify the navigation bar of the web page and remove it.
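Some of the rules above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this work: the CJK character range is an approximation of "Chinese characters", the advertisement keywords are hypothetical, and traditional-to-simplified conversion and navigation-bar removal are omitted because they require external tools.

```python
import re

# CJK Unified Ideographs (basic block); an approximation of "Chinese characters".
CJK = re.compile(r"[\u4e00-\u9fff]")
AD_KEYWORDS = ("点击购买", "限时优惠")  # hypothetical keyword list

def clean_document(text, min_chars=150, min_cjk_ratio=0.6):
    """Return the cleaned document, or None if one of the rules rejects it."""
    # Drop paragraphs duplicated within the document, keeping first occurrences.
    seen, paragraphs = set(), []
    for para in (p.strip() for p in text.split("\n")):
        if para and para not in seen:
            seen.add(para)
            paragraphs.append(para)
    body = "\n".join(paragraphs)

    chars = [c for c in body if not c.isspace()]
    if len(chars) < max(min_chars, 1):
        return None  # too short
    if len(CJK.findall(body)) / len(chars) < min_cjk_ratio:
        return None  # not enough Chinese characters
    if any(keyword in body for keyword in AD_KEYWORDS):
        return None  # looks like an advertisement
    return body
```

In this sketch a document is rejected as a whole when a rule fails, while within-document duplicates are removed in place, mirroring the distinction between the document-level and paragraph-level rules in the list.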

Then, three filters are applied to the preprocessed documents to further remove the harmful, advertising and low-quality documents.

Sensitive word filtering: The original documents of Common Crawl include a lot of harmful or sensitive website contents which would mislead our generative model. Thus, we manually collect 724 sensitive words and remove documents containing more than three of the sensitive words.

Model-based spam filtering: To further remove advertisements and spam, we train a spam classification model using fastText (https://fasttext.cc/) on a manually labeled dataset. The negative training examples are 10K junk documents manually selected from the Common Crawl dataset, and the positive examples are sampled from the high-quality Chinese text corpus. We remove the documents that are classified as spam.

Low-quality document filtering: Following the practice in GPT-3, we train a classifier to score the quality of each document and eliminate the documents with scores below a threshold (see Appendix A of the GPT-3 paper for details).
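The sensitive word filter is the simplest of the three and can be sketched directly. The word list below consists of placeholders, and counting distinct words (rather than total occurrences) is an assumption, since the text does not specify which is meant.

```python
def passes_sensitive_filter(text, sensitive_words, max_hits=3):
    """Keep a document unless it contains more than `max_hits` distinct sensitive words."""
    hits = sum(1 for word in sensitive_words if word in text)
    return hits <= max_hits

# Placeholder vocabulary standing in for the manually collected word list.
SENSITIVE = ["BAD_A", "BAD_B", "BAD_C", "BAD_D", "BAD_E"]

kept = passes_sensitive_filter("text with BAD_A, BAD_B and BAD_C", SENSITIVE)
dropped = passes_sensitive_filter("BAD_A BAD_B BAD_C BAD_D together", SENSITIVE)
```

For a list of several hundred words and a very large corpus, a multi-pattern matcher such as an Aho-Corasick automaton would likely be preferable to repeated substring search.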

3.1.2 Text Deduplication

Although we have removed duplicated paragraphs in each document in the previous step, there are still documents with highly overlapped content across different data sources. Therefore, we carry out fuzzy data deduplication over the documents across all our data sources.

Due to the very large scale of the whole dataset, the conventional MinHashLSH algorithm in Spark takes more than 8 hours to deduplicate less than 200MB of data, which is too slow to meet our efficiency requirement. To accelerate the deduplication process, we design a distributed large-scale text duplication detection and deduplication algorithm that exploits the computing framework of our big data management platform. The proposed algorithm takes only 3.5 hours to complete the deduplication process for 500GB of documents.
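For readers unfamiliar with MinHash-based fuzzy deduplication, the single-machine sketch below shows the basic idea. It is not the distributed algorithm described above: the shingle size, number of hash functions, similarity threshold and greedy pairwise comparison are all illustrative assumptions, and a practical system would use LSH bucketing instead of comparing every pair.

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of the text (the whole text if it is shorter than k)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_perm=64, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            hashlib.blake2b(g, digest_size=8, salt=salt).digest() for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots; estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Greedy removal: keep a document only if it is not too similar to a kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

doc_a = "今天天气很好，我们一起去公园散步。"
doc_b = "深度学习模型需要大量的训练数据。"
unique_docs = deduplicate([doc_a, doc_a, doc_b])
```

Character shingles are used here because Chinese text has no whitespace word boundaries; a word-level variant would need a segmenter.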

3.1.3 Data Quality Evaluation

Given the above preprocessing steps, one key question is how the cleaning rules and the filtering thresholds are decided. In this work, we evaluate the data quality after each round of preprocessing and update the cleaning rules and the filtering models according to the evaluation results. Both manual and model-based evaluations are considered. The manual evaluation is conducted over randomly sampled texts from the perspectives of sentence fluency and the amount of low-quality content (e.g., advertisements, repeated short sentences, spam, etc.). However, the manual evaluation can only cover a very small proportion of the whole dataset. To improve the accuracy of the data evaluation, we train the PanGu-$\alpha$ 350M model using 30GB of data sampled from the preprocessed dataset and evaluate the data quality using the perplexity (PPL) on a high-quality development dataset. The preprocessed dataset that achieves a lower PPL is considered to have higher quality, and its corresponding cleaning rules and filtering models are considered to be better.
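The comparison by perplexity can be made concrete with a short sketch. It assumes the usual definition of PPL as the exponential of the average negative log-probability per token; the candidate names and log-probabilities are made-up stand-ins for the outputs of small models trained on differently preprocessed data.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token log-probabilities on the same development set, from
# two small models trained on differently preprocessed candidate datasets.
dev_log_probs = {
    "rules_v1": [math.log(0.20), math.log(0.10), math.log(0.25)],
    "rules_v2": [math.log(0.30), math.log(0.20), math.log(0.25)],
}
ppl_by_candidate = {name: perplexity(lps) for name, lps in dev_log_probs.items()}
best_candidate = min(ppl_by_candidate, key=ppl_by_candidate.get)
```

The comparison is only meaningful when the development set, tokenizer and model size are held fixed, so that the training data is the only variable.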

3.2 Training Data Selection

Using the construction process in Figure 2, a Chinese text corpus with 1.1TB of data is built from the five types of data sources. The composition of our corpus and the processing steps applied to each data source are shown in Table 3. Based on the new corpus, we construct two training datasets with 100GB and 1TB of text data for our medium (2.6B and 13B) and large (200B) models, respectively. As shown in Table 4, each data source is sampled during training with a different proportion according to the quality of the processed dataset, evaluated using the method in Section 3.1.3. The distribution of the number of tokens in each training dataset is shown in Figure 3. The average document lengths of the 100GB and 1TB datasets are 239 and 405 tokens, respectively. The 1TB dataset has a larger average document length due to its large proportion of Common Crawl data. Note that the length of the text affects the generation performance of the model. When the average number of tokens per training sample is small, the model will be biased towards generating short texts and will be good at downstream tasks requiring short texts, and vice versa.

System

Training PanGu-$\alpha$ 200B and using it for inference are difficult. The memory requirement for just storing PanGu-$\alpha$ 200B is around 750 GB. Training such a huge model consumes several times more memory than just storing the parameters, since the gradients and optimizer states are also essential for updating the parameters. In contrast, the memory of modern AI processors (e.g., GPU, Ascend 910 AI processor) is still around 30-40 GB. Thus, it is inevitable to partition the model across a collection of devices (processors). The problem is challenging from two perspectives. First, multiple basic parallel functionalities should be combined to achieve high end-to-end performance. Finding the best combination strategy is challenging due to the huge strategy space. Second, parallel training should be easy to use, and the underlying parallel-related code should be removed from the model definition code. We use Auto-parallel in MindSpore to address the problem by maximizing the ratio of computation to communication. Auto-parallel supports five-dimensional parallel functionalities, and employs topology-aware scheduling to map partitioned model slices to the cluster for high end-to-end performance. Furthermore, Auto-parallel requires minimal code modifications from the standalone version for parallel training.
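A back-of-the-envelope calculation shows why partitioning is unavoidable. The figures below rest on assumptions not stated in the text: 32-bit parameters, an Adam-style optimizer that keeps gradients and two moment buffers, and a 32 GB device; activation memory is ignored.

```python
import math

PARAMS = 200e9           # 200 billion parameters
BYTES_PER_PARAM = 4      # assumption: 32-bit floats
GIB = 1024 ** 3

# Memory needed just to store the parameters.
param_gib = PARAMS * BYTES_PER_PARAM / GIB

# Assumption: parameters + gradients + two optimizer moment buffers,
# i.e. roughly 4x the parameter memory, before counting activations.
train_gib = 4 * param_gib

DEVICE_GIB = 32          # assumption: a device in the 30-40 GB range
min_devices = math.ceil(train_gib / DEVICE_GIB)
```

Under these assumptions the parameters alone occupy about 745 GiB, consistent with the "around 750 GB" figure, and the training state would not fit on fewer than 94 such devices even with perfect partitioning and no activations.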

The most widely used form of parallelism is data parallelism, which partitions the training batches across devices and synchronizes the gradients from different devices before taking an optimizer step, as shown in Figure 4(a). There are three regimes of model parallelism. The first is op-level model parallelism, which partitions the tensors involved in each operator (layer), as shown in Figure 4(b). Op-level model parallelism reduces memory consumption by slicing the parameters and the activation memory; however, it introduces communication to keep the distributed tensor layouts consistent between successive operators. The second regime is pipeline model parallelism, which partitions the layers into stages and then places the stages on different devices, as shown in Figure 4(c). The memory benefit comes from each device holding only a subset of the layers of the model, and communication happens only at the boundaries of stages. The third regime is optimizer model parallelism (Figure 4(d)), which aims to reduce the redundant optimizer memory and computation that result from data parallelism. Some outputs of operators in the forward phase reside in memory for a fairly long time, because they are used in the backward phase for gradient calculations. Rematerialization (Figure 4(e)) discards these outputs to reduce the peak memory consumption over the whole training time, and recomputes the corresponding forward operators when needed.
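The gradient synchronization step of data parallelism can be illustrated on a toy one-parameter regression. The sketch simulates two devices in a single process and models the synchronization (an AllReduce in practice) as a plain average; the data and function names are made up.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

num_devices = 2
shard = len(xs) // num_devices

# Each simulated device computes the gradient on its own slice of the batch.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_devices)
]

# Gradient synchronization, modelled here as a mean over devices.
synced_grad = sum(local_grads) / num_devices

# Reference: gradient computed on the full batch by a single device.
full_grad = grad_mse(w, xs, ys)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why every device can apply the same optimizer step and keep its replica of the parameters identical.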

Each parallelism dimension trades computation (or communication) overheads for memory (or throughput) benefits. To acquire maximum end-to-end throughput, a balanced composition point should be found along these dimensions. The problem becomes more challenging when considering the heterogeneous bandwidths in a cluster of devices.

Figure 5(b) shows a typical organization of a cluster. Each server includes multiple devices, and the servers in a rack are connected by a ToR (top of rack) switch. Racks are then connected by the Spine switch. The bandwidth between devices in a server is greater than that across servers in a rack, which in turn is greater than that across racks. Therefore, the model is partitioned across servers in a rack using pipeline parallelism, so that each server holds one stage of the model layers. Then, each stage is split using op-level parallelism across the devices in each server, in order to utilize the high bandwidth. Each rack holds the whole model, and different racks are data parallel. Data parallelism and optimizer parallelism are deployed across racks because the induced communication operators are not on the critical path of the training iteration, and can therefore be fused and overlapped with backward propagation to improve performance.

Figure 6 shows how a combined parallelization is applied to the PanGu-$\alpha$ 200B model. First, the 64 layers of the model are partitioned into 16 stages, each stage containing 4 layers. For each layer, the parameters and tensors involved are partitioned for each operator. Specifically, the parameters involved in the query ($Q$), key ($K$) and value ($V$) operators are partitioned into 8 slices. The input tensor of these three operators is partitioned into 16 slices, and the degree of optimizer model parallelism is determined accordingly. The ‘8’ is called the model parallel number and the ‘16’ is called the data (and optimizer) parallel number in our system. In the example of Figure 6, the model parallel number and data parallel number are both 2. Parallelization strategies for other operators in the layer are configured likewise. Rematerialization is configured to be performed within each layer, which limits the extra computation overhead. In total, 2048 Ascend 910 AI processors are used to train the full PanGu-$\alpha$ 200B model.
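The numbers in this configuration can be cross-checked with simple arithmetic. Treating the total device count as the product of the pipeline, model-parallel and data-parallel degrees is an interpretation, although it is consistent with the figure of 2048 processors.

```python
num_layers = 64
pipeline_stages = 16
model_parallel = 8     # slices of the Q/K/V parameters
data_parallel = 16     # slices of the input tensor (also the optimizer parallel number)

layers_per_stage = num_layers // pipeline_stages
devices_per_stage = model_parallel * data_parallel
total_devices = pipeline_stages * devices_per_stage
```

Each of the 16 stages thus holds 4 layers and is spread over 128 devices, which gives 2048 devices in total.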

4.2 Implementation

The parallel-related functionalities are implemented in the Auto-parallel module of MindSpore. Auto-parallel decouples machine learning models from the complicated underlying parallel implementations, and lets researchers focus on the development of new models. Auto-parallel enables parallel training by just adding annotations to the standalone model script. Here, we briefly go through two model parallelism regimes.

Figure 7 shows how to specify the combined parallelization strategy for PanGu-$\alpha$. Figure 7(a) and Figure 7(b) show the pseudocode for configuring Attention and FeedForward to conduct op-level parallelism, respectively. The sharding strategy of qkv_mm is ((2, 1), (1, 2)), indicating that x is partitioned along the row (batch or data) dimension into 2 slices, while q_w, k_w and v_w are partitioned along the column dimension. Since the device number is 4 here, each device holds a distinct pair consisting of a slice of x and a slice of q_w (and likewise of k_w and v_w). The sharding strategy of matmul is ((2, 2), (2, 1)), where the contracting dimension is partitioned; thus an AllReduce is needed to complete the operation. Likewise, another AllReduce is needed for matmul2 in Figure 7(b). Auto-parallel can find such required operators automatically. Furthermore, tensor redistribution is designed to automatically find the transformation (a list of operators) between any two inconsistent distributed tensor layouts with minimum communication cost, and the operators are then inserted into the data flow graph. The sharding strategy of batch_mm in Figure 7(a) corresponds to splitting the batch and head dimensions.
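The effect of the two sharding strategies can be reproduced with a small single-process simulation. This sketch does not use the MindSpore API; the helper functions and toy matrices are made up, and each loop iteration stands in for the work of one device.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(m, r0, r1, c0, c1):
    """Sub-matrix m[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in m[r0:r1]]

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # shape (4, 4)
w = [[1, 0], [0, 1], [1, 1], [2, -1]]                                 # shape (4, 2)

# Strategy ((2, 2), (2, 1)): x is split 2x2 and w is split in 2 along its rows,
# so the contracting dimension is partitioned.
reduced = []
for i in range(2):                       # row (batch) slices of x
    partials = []
    for j in range(2):                   # slices of the contracting dimension
        x_ij = block(x, 2 * i, 2 * i + 2, 2 * j, 2 * j + 2)
        w_j = block(w, 2 * j, 2 * j + 2, 0, 2)
        partials.append(matmul(x_ij, w_j))        # one device's partial result
    # AllReduce (sum) across the devices that share row slice i.
    reduced.extend(add(partials[0], partials[1]))

# Strategy ((2, 1), (1, 2)): rows of x and columns of w are split, so each
# device owns a complete output block and no reduction is needed.
col_split = []
for i in range(2):
    x_i = block(x, 2 * i, 2 * i + 2, 0, 4)
    left = matmul(x_i, block(w, 0, 4, 0, 1))
    right = matmul(x_i, block(w, 0, 4, 1, 2))
    col_split.extend([l + r for l, r in zip(left, right)])  # join column slices

reference = matmul(x, w)
```

Both strategies reproduce the unsharded product; they differ in whether a summation across devices is required, which is the communication cost that a sharding strategy has to weigh.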

Figure 7(d) shows the pseudocode for conducting pipeline parallelism in MindSpore. The number of stages is configured as 2, and the number of devices is 8. Thus, 4 devices together perform each stage. layer1 is configured to be stage 0, and is thus replicated on 4 devices. Likewise, layer2 is replicated on the other 4 devices. When combined with Figure 7(a) and Figure 7(b), this yields the desired parallelization strategy for PanGu-$\alpha$. The strategy of optimizer parallelism is implied by how the batch dimension is split in the configuration. We omit the configuration for rematerialization here. Send and Receive operators are inferred to communicate the activation output from stage 0 to stage 1, and are then automatically inserted into the data flow graphs of the two stages, respectively.
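The stage placement in this example can be mimicked in plain Python. The sketch below is not MindSpore code: contiguous device numbering is an assumption, and the two lambda functions are arbitrary stand-ins for layer1 and layer2.

```python
def assign_stages(num_devices, num_stages):
    """Map each pipeline stage to an equal-sized, contiguous group of device ids."""
    per_stage = num_devices // num_stages
    return {s: list(range(s * per_stage, (s + 1) * per_stage)) for s in range(num_stages)}

placement = assign_stages(num_devices=8, num_stages=2)

# Stand-ins for the two layers; the real layers would be Transformer blocks.
stage_fns = {
    0: lambda h: [2 * v for v in h],   # plays the role of layer1 on stage 0
    1: lambda h: [v + 1 for v in h],   # plays the role of layer2 on stage 1
}

def run_pipeline(x):
    """Run the stages in order; each hand-over models a Send/Receive pair."""
    h = x
    for stage in sorted(stage_fns):
        h = stage_fns[stage](h)
    return h

output = run_pipeline([1, 2])
```

In a real pipeline the batch would additionally be split into micro-batches so that both stages can work concurrently; this sketch only shows the placement and the hand-over of activations.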

In the future, we will: a) develop a cost model and a parallelization strategy search algorithm for all parallelism dimensions, in order to completely free developers from the underlying parallel-related work; b) support heterogeneous parallelism to offload a part of the tensors and the corresponding computations to the host CPU to accelerate training; c) use Sparse Attention to speed up the computation.

All training and inference jobs are run on the ModelArts (https://www.huaweicloud.com/product/modelarts.html) platform, which manages the end-to-end workflows and provides cluster scheduling so that a job can acquire a hierarchical cluster.

Experiments

Our PanGu-$\alpha$ models are developed under the MindSpore framework and are trained on a cluster of 2048 Ascend 910 AI processors. The detailed settings are shown in Table 5. For the training of the 200B model, we use 2048 Ascend processors in the first phase and then switch to 1024 Ascend processors midway, in order to conduct other experiments with the remaining resources. Byte Pair Encoding (BPE) is used as the tokenizer, and the vocabulary size is 40,000. The sequence length for the training data is set to 1024 for all the models.

The training loss curves of the PanGu-$\alpha$ models are shown in Figure 8. We adopt the number of training tokens as the x-axis since the batch size for the 200B model is not comparable to that of the 13B and 2.6B models. The loss of the 200B model converges to around 2.49, while the losses of the 13B and 2.6B models converge to 2.58 and 2.64, respectively. From the training curves, we can observe that the losses are still decreasing at the end of training, which indicates that our PanGu-$\alpha$ models are still under-trained and may have great potential to improve. We also evaluate the perplexity of our PanGu-$\alpha$ models on the validation set, which is randomly sampled from the Common Crawl dataset. The results in Table 6 show that PanGu-$\alpha$ models with more parameters achieve smaller perplexity values, indicating that larger PanGu-$\alpha$ models are better language models.

5.2 Task Description

In this section, we evaluate our models on a broad spectrum of natural language processing tasks. Similar to GPT-3, the experiments are conducted under three learning settings, i.e., zero-shot, one-shot, and few-shot, without any finetuning. For each task, we evaluate the models on the test set when it is publicly available. Otherwise, we use the development set instead. For some tasks with a very large test set or development set, we randomly sample a subset of the dataset to reduce the computational cost. The evaluation datasets are classified into 7 categories according to task similarity, and we describe each category as follows.

Cloze and completion tasks, including WPLC, CHID, PD&CFT, CMRC2017, and CMRC2019. Chinese WPLC (Word Prediction with Long Context) is a dataset created to test the ability to model long-range dependencies, similar to the LAMBADA dataset for English. CHID (Chinese IDiom dataset) requires the model to identify the ground-truth idiom from 10 candidate idioms. The PD&CFT task requires the model to predict the masked words in sentences derived from the People’s Daily (PD) news dataset and the Children’s Fairy Tale (CFT) dataset. The CMRC2017 (Chinese Machine Reading Comprehension) task contains two different sub-tasks: a cloze-style task and a user query reading comprehension task, of which we only evaluate our models on the cloze-style task. While the aforementioned tasks are word-level tasks, CMRC2019 is a sentence cloze-style dataset that involves filling the right sentence from several candidate sentences into the passage. For CMRC2019 and CHID, a list of candidate choices is provided, making them classification tasks, while for WPLC, CMRC2017 and PD&CFT, the models need to generate the answer as no candidate choices are given. Accuracy is employed as the metric for evaluating the cloze-style tasks.

Reading comprehension tasks, including CMRC2018, DRCD, and DuReader. These are all originally span-extraction tasks. That is, given a passage as context and a question, the models need to extract a text span from the passage which contains the correct answer to the question. The evaluation metrics, including F1 and exact match (EM), measure the similarity between the predicted span and the ground-truth text span. Instead of span extraction, we formulate these tasks as generation tasks where the models generate the text directly. The similarity between the generated text span and the ground-truth text span is evaluated. Note that for the DuReader task, we select the Zhidao subset for evaluation in our experiment.
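To make the two metrics concrete, the sketch below computes exact match and a character-level F1 between a generated answer and a reference answer. It is a simplified illustration rather than the official scoring script of any of these benchmarks; treating each character as a token is an assumption that is convenient for unsegmented Chinese text.

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(pred.strip() == gold.strip())

def char_f1(pred, gold):
    """Harmonic mean of character-level precision and recall."""
    pred_chars, gold_chars = list(pred.strip()), list(gold.strip())
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

f1_partial = char_f1("北京市", "北京")      # prediction contains one extra character
em_partial = exact_match("北京市", "北京")
```

F1 gives partial credit when the generated answer overlaps with the reference, whereas EM does not, which matters when a generative model is scored against spans from an extraction task.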

Closed-book question answering (QA) tasks, including WebQA. We follow the same closed-book setting as in GPT-3, where the models are not allowed to access any external knowledge when answering open-domain factoid questions about broad factual knowledge.

Winograd-style tasks, including CLUEWSC2020. CLUEWSC2020 is a Chinese Winograd Schema Challenge dataset, which is an anaphora/coreference resolution task. In practice, we convert the task into a multiple-choice classification problem.

Common sense reasoning tasks, including C3. C3 is a free-form multiple-choice reading comprehension dataset which can benefit from common sense reasoning. Unlike in the extraction-based reading comprehension tasks, the answers to C3 questions cannot be directly found in the given context. Therefore, we use it to evaluate the common sense reasoning ability of the models.

Natural language inference (NLI) tasks, including Chinese Multi-Genre NLI (CMNLI) and Original Chinese Natural Language Inference (OCNLI). The NLI tasks require the model to identify the relation between two sentences, which is either entailment, neutral or contradiction. We formulate these tasks as three-class classification problems.

Text classification tasks, including TouTiao Text Classification for News Titles (TNEWS), IFLYTEK app description classification (IFLYTEK), Ant Financial Question Matching Corpus (AFQMC), and Chinese Scientific Literature (CSL). These text classification tasks cover broad domains of text, including news, applications, financial text, and scientific text. The TNEWS and IFLYTEK tasks originally have 15 and 119 categories, respectively. However, we randomly sample three candidates as negative labels for each instance and perform 4-class classification. The reason is that the computational cost of our perplexity-based classification method, which is described in the next section, increases linearly with the total number of candidate categories.

3 Evaluation Details

The tasks can generally be classified into two categories: classification tasks and generation tasks. For the classification tasks, we cast each task as a perplexity comparison. For some tasks, the samples need to be filled into a tailor-designed template to form the input to the models. The templates for each task are described in Table 7, where "/" means the task does not involve a template. The decoding strategies for the text generation tasks are described in Table 8.

The generation tasks include word-level generation tasks and sentence-level generation tasks. Since our PanGu- $\alpha$ models are autoregressive language models capable of text generation, the generation tasks can be solved naturally by simply generating the answers. For the cloze tasks such as WPLC, PD&CFT, and CMRC2017, the prompts are the context before the positions to be predicted. For the reading comprehension tasks and closed-book QA tasks, templates are designed where necessary. For example, in the reading comprehension tasks, the sample is filled into the template "Reading document: \$Document Question: \$Question Answer:", which serves as the prompt for the model to generate the answer.
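As an illustration, the reading comprehension template maps directly onto Python's `string.Template`, whose `$name` placeholders match the two slots of the template. The sketch assumes the slots are `$Document` and `$Question`; the helper name `build_prompt` and the sample text are hypothetical, and the templates actually used are those listed in Table 7.

```python
from string import Template

# Assumed reading of the template in the text: two slots, $Document and $Question.
READING_TEMPLATE = Template(
    "Reading document: $Document Question: $Question Answer:"
)

def build_prompt(document: str, question: str) -> str:
    """Fill a sample into the template; the model continues after 'Answer:'."""
    return READING_TEMPLATE.substitute(Document=document, Question=question)

# Made-up sample for illustration only.
prompt = build_prompt(
    "PanGu-alpha was trained on Ascend processors.",
    "What hardware was used?",
)
```

The prompt ends with the answer cue, so whatever the model generates next is taken as the predicted answer.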

As in GPT-3, the few-shot task is designed as in-context learning, where $K$ prompts are concatenated one by one. The first $K-1$ prompts contain the ground-truth answer, while the last prompt is the sample we want to predict. An example for the CMRC2018 task is shown in Figure 9.
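The concatenation of $K$ prompts can be sketched as below. The function name `build_few_shot_prompt`, the newline separator, and the single space between a prompt and its answer are assumptions made for illustration; the format actually used is the one shown in Figure 9.

```python
def build_few_shot_prompt(demonstrations, query, separator="\n"):
    """Concatenate K-1 solved demonstrations followed by one unsolved query.

    demonstrations: list of (prompt, answer) pairs with ground-truth answers.
    query: prompt for the sample to predict, ending where the answer begins.
    """
    solved = [f"{prompt} {answer}" for prompt, answer in demonstrations]
    return separator.join(solved + [query])

# Made-up example with K = 3: two solved prompts, then the query.
demos = [
    ("Question: 1+1? Answer:", "2"),
    ("Question: 2+2? Answer:", "4"),
]
few_shot = build_few_shot_prompt(demos, "Question: 3+3? Answer:")
```

Only the last prompt is left open, so the model's continuation is read as the answer to the query sample.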

3.2 Perplexity-based method

The perplexity-based method solves the classification tasks. For each ⟨text, label⟩ pair, an input is generated automatically according to a pre-designed template, as shown in Table 7. The sequence generated by the template is fed into the model and a perplexity value is computed. The label associated with the smallest perplexity value is taken as the predicted label for the text.
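A minimal sketch of this procedure follows, assuming the standard definition of perplexity as the exponential of the negative mean token log-probability (the text does not spell out the formula). The callables `fill_template` and `score_tokens` are hypothetical stand-ins for a Table 7 template and a language model call, and `toy_score_tokens` is a fixed toy scorer used only to make the example run.

```python
import math

def perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def classify(text, labels, fill_template, score_tokens):
    """Return the label whose filled-in sequence has the lowest perplexity.

    fill_template(text, label) -> str        (hypothetical template function)
    score_tokens(sequence) -> list of floats (hypothetical model call that
                                              returns natural-log token probs)
    """
    scores = {
        label: perplexity(score_tokens(fill_template(text, label)))
        for label in labels
    }
    return min(scores, key=scores.get), scores

def fill(text, label):
    # Hypothetical template, not one of the templates in Table 7.
    return f"{text} Relation: {label}"

def toy_score_tokens(sequence):
    # Toy stand-in for a model: every token gets probability 0.5, except the
    # final label token, whose probability comes from a fixed table.
    label_prob = {"entailment": 0.8, "neutral": 0.15, "contradiction": 0.05}
    tokens = sequence.split()
    return [math.log(0.5)] * (len(tokens) - 1) + [math.log(label_prob[tokens[-1]])]

predicted, scores = classify(
    "It is raining. The ground is wet.",
    ["entailment", "neutral", "contradiction"],
    fill,
    toy_score_tokens,
)
```

Each candidate label requires one scoring pass, which is why the cost grows linearly with the number of candidate categories.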

We also employ the in-context learning strategy for solving few-shot tasks. An example for the few-shot OCNLI task is shown in Figure 10.

4 Results

Table 9 compares PanGu- $\alpha$ 2.6B with CPM (https://github.com/TsinghuaAI/CPM-Generate), a recently released generative Chinese PLM with 2.6B parameters, on 16 downstream tasks in Chinese. PanGu- $\alpha$ 2.6B achieves higher performance than CPM 2.6B on more than 11 tasks in the zero-shot setting, 12 tasks in the one-shot setting, and 14 tasks in the few-shot setting. In general, the experimental results indicate that PanGu- $\alpha$ 2.6B has stronger in-context learning ability than CPM 2.6B, especially for few-shot learning and generation tasks. On generation tasks, PanGu- $\alpha$ 2.6B outperforms CPM 2.6B by 6 points on average. More specifically, PanGu- $\alpha$ 2.6B surpasses CPM 2.6B by 5 points on both reading comprehension and closed-book QA tasks, and by 7 points on cloze (without choices) tasks. On perplexity tasks, PanGu- $\alpha$ is comparable to CPM 2.6B on natural language inference with the CMNLI and OCNLI datasets, while it is slightly worse than CPM on classification tasks with the TNEWS and IFLYTEK datasets. We suppose that the main factor behind the performance difference between CPM 2.6B and PanGu- $\alpha$ 2.6B is the training data. We collect massive and diverse data from a wide range of sources, which allows our PanGu- $\alpha$ model to handle more diverse tasks.

Table 10 compares PanGu- $\alpha$ 13B with PanGu- $\alpha$ 2.6B. PanGu- $\alpha$ 13B outperforms PanGu- $\alpha$ 2.6B on all generation tasks and most of the perplexity tasks. On the CMRC2018, DRCD, and WebQA tasks, the few-shot performance of PanGu- $\alpha$ 13B surpasses its zero-shot performance by more than 10 points, demonstrating that PanGu- $\alpha$ 13B has superior in-context learning ability. PanGu- $\alpha$ 13B outperforms PanGu- $\alpha$ 2.6B by 3 points on average. More specifically, PanGu- $\alpha$ 13B surpasses PanGu- $\alpha$ 2.6B by 4 points on both reading comprehension and closed-book QA tasks, and by 2 points on cloze (without choices) tasks. On the NLI tasks, the 13B model performs worse than the 2.6B model, which is consistent with the observations in GPT-3. Overall, the comparison between PanGu- $\alpha$ 13B and PanGu- $\alpha$ 2.6B demonstrates that a larger pretrained model generally improves performance on few-shot learning tasks.

5 Natural Language Generation Examples

We evaluate the generation capabilities of PanGu- $\alpha$ 200B in various text generation scenarios and show some examples in this section. We do not post-edit the generated text, except that we truncate it when the model does not stop generating at a reasonable point. Among the scenarios we have tested, we find that our PanGu- $\alpha$ model is particularly good at poetry and duilian generation, text summarization, dialog generation, and fiction generation, where roughly 90% of the generated examples are acceptable to humans. We believe there are certainly more applications for PanGu- $\alpha$ models to explore in the future.

Conclusion

We have pretrained large-scale Chinese autoregressive language models named PanGu- $\alpha$, with up to 200 billion parameters. PanGu- $\alpha$ has been developed under the MindSpore framework and trained on a cluster of 2048 Ascend AI processors. We believe there are many open problems in the field of large-scale PLMs:

Large-scale language models have demonstrated promising few-shot capabilities in NLP tasks. However, the behaviors of such models have not yet been systematically studied. How to make proper use of large PLMs and how to develop efficient few-shot algorithms remain open questions.

Though such models are effective, inference with very large language models remains computationally expensive. It is therefore worthwhile to study how to reduce the inference cost of large PLMs without sacrificing much of their performance. Model compression and acceleration of large PLMs could be an interesting topic.

Training an even larger PLM with trillions of parameters will certainly bring more challenges to both the software and hardware sides. In addition, more efficient model structures such as MoE or Switch Transformers are also expected to relieve the computational cost of model training and inference.

Pretrained multi-modal models integrating language, vision, and speech data have attracted much attention recently. Similar to the scaling law of language models, the performance of pretrained multi-modal models may also improve as model sizes increase and more training data are collected. This is definitely a promising direction to explore.

Acknowledgements

We thank Hanyang Wan, Qian Zhao, Yong Li, Zhou Cao, Yongqiang Lai, Zhijian Guo, Yue Wang, Zherui Chang, Junqiu Wei, Pingyi Zhou, Yulong Ao, and Wenzhi Liu for their great support of this work. We also thank the School of Electronics Engineering and Computer Science at Peking University, the Central Software Institute and Noah's Ark Lab at Huawei Technologies, and Peng Cheng Laboratory for their support.