GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach

cs.CL

Introduction

Over the past several years, there has been an explosion in research surrounding large language models (LLMs) for natural language processing, catalyzed largely by the impressive performance of Transformer-based language models such as BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and T5 (Raffel et al., 2020). One of the most impactful outcomes of this research has been the discovery that the performance of LLMs scales predictably as a power law with the number of parameters, with architectural details such as width/depth ratio having a minimal impact on performance within a wide range (Kaplan et al., 2020). A consequence of this has been an abundance of research focusing on scaling Transformer models up to ever-larger scales, resulting in dense models that surpass 500B parameters (Smith et al., 2022; Chowdhery et al., 2022), a milestone that would have been almost unthinkable just a few years prior.

Today, there are dozens of publicly acknowledged LLMs in existence, the largest having more than two orders of magnitude more parameters than GPT-2, and even at that scale there are nearly a dozen different models. However, these models are almost universally the protected intellectual property of large organizations, and are gated behind a commercial API, available only upon request, or not available for outsider use at all. To our knowledge, the only freely and publicly available dense autoregressive language models larger than GPT-2 are GPT-Neo (2.7B parameters) (Black et al., 2021), GPT-J-6B (Wang and Komatsuzaki, 2021), Megatron-11B (which does not work using the provided codebase, and which we have been told under-performs GPT-J), Pangu-$\alpha$-13B (Zeng et al., 2021), and the recently released FairSeq models (2.7B, 6.7B, and 13B parameters) (Artetxe et al., 2021).

In this paper we introduce GPT-NeoX-20B, a 20 billion parameter open-source autoregressive language model. We make the model's weights freely and openly available to the public through a permissive license, motivated by the belief that open access to LLMs is critical to advancing research in a wide range of areas, particularly in AI safety, mechanistic interpretability, and the study of how LLM capabilities scale. Many of the most interesting capabilities of LLMs only emerge above a certain number of parameters, and they have many properties that simply cannot be studied in smaller models. Although safety is often cited as a justification for keeping model weights private, we believe this is insufficient to prevent misuse, and is largely a limitation on the ability to probe and study LLMs for researchers not based at the small number of organizations that have access to state-of-the-art language models. In addition, we make partially trained checkpoints available at evenly spaced 1000-step intervals throughout the whole of training. We hope that by making a wide range of checkpoints throughout training freely available, we will facilitate research on the training dynamics of LLMs, as well as the aforementioned areas of AI safety and interpretability.

In studying GPT-NeoX-20B, we find several noteworthy phenomena at odds with the established literature. We train on a dataset that contains duplicated data for more than one epoch but see no evidence of performance loss. While Hendrycks et al. (2021a) claim that few-shot prompting doesn’t improve performance on their task, we find that this is actually a phenomenon unique to GPT-3 and doesn’t apply to either GPT-NeoX-20B or FairSeq models. Finally, we find that GPT-NeoX-20B is a powerful few-shot learner, receiving a much larger performance boost from few-shot examples than comparably sized GPT-3 and FairSeq models. As we see the same with GPT-J-6B (Wang and Komatsuzaki, 2021), we hypothesize that this may be due to the shared choice of training data.

In the following sections, we give a broad overview of GPT-NeoX-20B’s architecture and training hyperparameters, detail the hardware and software setup used for training and evaluation, and elaborate on the choices made when designing the training dataset and tokenization. We also address some of the difficulties and unknowns we encountered in training such a large model. We place significant importance on the broader impacts of the release of GPT-NeoX-20B, and provide a lengthy discussion of why we believe its release is a net benefit. We also document issues of training cost and carbon emissions in as much detail as possible.

Model Design and Implementation

GPT-NeoX-20B is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3 (Brown et al., 2020), with a few notable deviations described below. Our model has 20 billion parameters, of which 19.9 billion are “non-embedding” parameters that Kaplan et al. (2020) identify as the proper number to use for scaling laws analysis. Our model has 44 layers, a hidden dimension size of 6144, and 64 heads.
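As a rough check on the figures above, the non-embedding parameter count can be approximated as $12 \cdot n_{\text{layer}} \cdot d_{\text{model}}^{2}$, the approximation used by Kaplan et al. (2020). The sketch below assumes a feed-forward width of $4d$ and ignores biases and layer-norm parameters; the function name is illustrative and not taken from the GPT-NeoX codebase.

```python
# Layer count, hidden size and head count as stated in the text.
N_LAYERS = 44
D_MODEL = 6144
N_HEADS = 64


def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameters, assuming d_ff = 4 * d_model."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    feed_forward = 2 * d_model * (4 * d_model)  # two FF weight matrices
    return n_layers * (attention + feed_forward)


params = non_embedding_params(N_LAYERS, D_MODEL)
head_dim = D_MODEL // N_HEADS

print(f"{params / 1e9:.2f}B non-embedding parameters")  # 19.93B
print(f"per-head dimension: {head_dim}")                # 96
```

Under these assumptions the estimate lands at roughly 19.9 billion, in line with the non-embedding count quoted above.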

Although our architecture is largely similar to GPT-3, there are some notable differences. In this section we give a high-level overview of those differences, but ask the reader to refer to Brown et al. (2020) for full details of the model architecture. Our model architecture is almost identical to that of GPT-J (Wang and Komatsuzaki, 2021); the sole difference is due to an oversight discussed in Section 2.1.2. However, we choose to use GPT-3 as the point of reference because there is no canonical published reference on the design of GPT-J.

We use rotary embeddings (Su et al., 2021) instead of the learned positional embeddings used in GPT models (Radford et al., 2018), based on our positive prior experiences using them in training LLMs. Rotary embeddings are a form of static relative positional embeddings. In brief, they twist the embedding space such that the attention of a token at position $m$ to a token at position $n$ is linearly dependent on $m-n$. More formally, they modify the standard multiheaded attention equations from

$$\operatorname{softmax}\left(\frac{1}{\sqrt{d}}\sum_{n,m}\mathbf{x}_{m}^{T}\mathbf{W}_{q}^{T}\mathbf{W}_{k}\mathbf{x}_{n}\right),$$

where $\mathbf{x}_{m}$, $\mathbf{x}_{n}$ are (batched) embeddings of tokens at position $m$ and $n$ respectively and $\mathbf{W}^{T}_{q}$, $\mathbf{W}_{k}$ are the query and key weights respectively, to

$$\operatorname{softmax}\left(\frac{1}{\sqrt{d}}\sum_{n,m}\mathbf{x}_{m}^{T}\mathbf{W}_{q}^{T}R^{d}_{\Theta,(n-m)}\mathbf{W}_{k}\mathbf{x}_{n}\right),$$

where $R^{d}_{\Theta,x}$ is a $d\times d$ block diagonal matrix with the block of index $i$ being a 2D rotation by $x\theta_{i}$ for hyperparameters $\Theta=\{\theta_{i}=10000^{-2i/d}\;|\;i\in\{0,1,2,\ldots,d/2-1\}\}$.

While Su et al. (2021) apply rotary embeddings to every dimension of the embedding vector, we follow Wang and Komatsuzaki (2021) and instead apply them only to the first $25\%$ of embedding vector dimensions. Our initial experiments indicate that this strikes the best balance of performance and computational efficiency (see the Weights & Biases reports for further details).
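The rotation described above can be illustrated with a minimal pure-Python sketch. It assumes that consecutive pairs of dimensions are rotated together and that $d$ in $\theta_{i}$ refers to the number of rotated dimensions; actual implementations may pair dimensions differently, and the function names here are hypothetical. The final check shows that the query-key score depends only on the relative offset between positions.

```python
import math
from typing import List


def rotary_angles(position: int, rot_dim: int, base: float = 10000.0) -> List[float]:
    """Angles position * theta_i, with theta_i = base ** (-2i / rot_dim)."""
    return [position * base ** (-2 * i / rot_dim) for i in range(rot_dim // 2)]


def apply_partial_rotary(vec: List[float], position: int,
                         rotary_pct: float = 0.25) -> List[float]:
    """Rotate pairs in the first rotary_pct of dimensions; leave the rest as-is."""
    rot_dim = int(len(vec) * rotary_pct)
    rot_dim -= rot_dim % 2  # rotations act on pairs, so use an even count
    out = list(vec)
    for i, angle in enumerate(rotary_angles(position, rot_dim)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(angle) - y * math.sin(angle)
        out[2 * i + 1] = x * math.sin(angle) + y * math.cos(angle)
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy 16-dimensional query and key (arbitrary values, for illustration only).
q = [0.3, -1.2, 0.7, 0.5, 1.0, -0.4, 0.2, 0.9] * 2
k = [1.1, 0.4, -0.6, 0.8, -0.3, 0.5, 0.7, -1.0] * 2

# Same relative offset (3) at two different absolute positions.
score_a = dot(apply_partial_rotary(q, 5), apply_partial_rotary(k, 2))
score_b = dot(apply_partial_rotary(q, 105), apply_partial_rotary(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

With 16 dimensions and a rotary fraction of 25%, only the first four dimensions are rotated; the remaining twelve pass through unchanged.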

1.2 Parallel Attention + FF Layers

We compute the Attention and Feed-Forward (FF) layers in parallel and sum the results, rather than running them in series (see GitHub for implementation details). This is primarily for efficiency purposes, as each residual addition with op-sharding requires one all-reduce in the forward pass and one in the backward pass (Shoeybi et al., 2020). By computing the Attention and FF layers in parallel, the results can be reduced locally before performing a single all-reduce. In Mesh Transformer JAX (Wang, 2021), this led to a 15% throughput increase, with loss curves comparable to running them in series during early training.

Due to an oversight in our code, we unintentionally apply two independent Layer Norms instead of using a tied layer norm the way Wang and Komatsuzaki (2021) do. Instead of computing

$$x+\operatorname{Attn}(\operatorname{LN}_{1}(x))+\operatorname{FF}(\operatorname{LN}_{1}(x))$$

as intended, our codebase unties the layer norms:

$$x+\operatorname{Attn}(\operatorname{LN}_{1}(x))+\operatorname{FF}(\operatorname{LN}_{2}(x)).$$

Unfortunately, this was only noticed after we were much too far into training to restart. Subsequent experiments at small scales indicated that the untied layer norm makes no difference in performance, but we nevertheless wish to highlight this in the interest of transparency.

1.3 Initialization

For the Feed-Forward output layers before the residuals, we used the initialization scheme introduced in Wang (2021), $\frac{2}{L\sqrt{d}}$. This prevents activations from growing with increasing depth and width, with the factor of 2 compensating for the fact that the attention and feed-forward layers are organized in parallel. For all other layers, we use the small init scheme from Nguyen and Salazar (2019), $\sqrt{\frac{2}{d+4d}}$.
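To give a sense of scale, the two expressions above can be evaluated for the stated model shape ($L=44$, $d=6144$). The sketch below only evaluates the formulas; interpreting the results as standard deviations of a zero-mean normal distribution is an assumption, and the function names are illustrative.

```python
import math


def ff_output_init_scale(n_layers: int, d_model: int) -> float:
    """2 / (L * sqrt(d)), for FF output layers before the residual."""
    return 2.0 / (n_layers * math.sqrt(d_model))


def small_init_scale(d_model: int) -> float:
    """sqrt(2 / (d + 4d)), the small init scheme used for all other layers."""
    return math.sqrt(2.0 / (d_model + 4 * d_model))


ff_scale = ff_output_init_scale(44, 6144)
other_scale = small_init_scale(6144)
print(f"FF output layers: {ff_scale:.2e}")
print(f"all other layers: {other_scale:.2e}")
```

For this shape the FF output scale is roughly $5.8\times10^{-4}$ and the small-init scale roughly $8.1\times10^{-3}$, so the output layers feeding the residual stream start more than an order of magnitude smaller.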

1.4 All Dense Layers

While GPT-3 uses alternating dense and sparse layers using the technique introduced in Child et al. (2019), we instead opt to exclusively use dense layers to reduce implementation complexity.

2 Software Libraries

Our model is trained using a codebase that builds on Megatron (Shoeybi et al., 2020) and DeepSpeed (Rasley et al., 2020) to facilitate efficient and straightforward training of large language models with tens of billions of parameters. We use the official PyTorch v1.10.0 release binary package compiled with CUDA 11.1. This package is bundled with NCCL 2.10.3 for distributed communications.

3 Hardware

We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches—connected by 16 links—compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch. Figure 2 shows a simplified overview of a node as configured for training.

Training

Due to the intractability of performing a hyperparameter sweep for a 20 billion parameter model, we opted to use the values from Brown et al. (2020) to guide our choice of hyperparameters. As Brown et al. (2020) did not train a model at our exact scale, we interpolate between the learning rates of their 13B and 175B models to arrive at a learning rate of $0.97\mathrm{\textsc{e}}{-5}$ . Based on the results of smaller scale experiments, we select a weight decay of 0.01. To achieve a higher training throughput, we opt to use the same batch size as OpenAI’s 175B model–approximately 3.15M tokens, or 1538 contexts of 2048 tokens each, and train for a total of $150,000$ steps, decaying the learning rate with a cosine schedule to $10\%$ of its original value at the end of training.
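The schedule and batch arithmetic above can be sketched as follows. The peak learning rate is copied from the text as written; the sketch assumes a plain cosine decay with no warmup phase (the text does not describe one), and the names are illustrative.

```python
import math

PEAK_LR = 0.97e-5          # learning rate as stated in the text
TOTAL_STEPS = 150_000
FINAL_FRACTION = 0.10      # decay to 10% of the peak value
CONTEXTS_PER_BATCH = 1538
TOKENS_PER_CONTEXT = 2048


def cosine_lr(step: int) -> float:
    """Cosine decay from PEAK_LR to FINAL_FRACTION * PEAK_LR (no warmup modeled)."""
    progress = min(max(step / TOTAL_STEPS, 0.0), 1.0)
    min_lr = FINAL_FRACTION * PEAK_LR
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


tokens_per_batch = CONTEXTS_PER_BATCH * TOKENS_PER_CONTEXT
total_tokens = tokens_per_batch * TOTAL_STEPS

print(f"tokens per batch: {tokens_per_batch:,}")       # 3,149,824
print(f"tokens over training: {total_tokens / 1e9:.1f}B")
print(f"final / peak learning rate: {cosine_lr(TOTAL_STEPS) / PEAK_LR:.2f}")
```

The batch works out to about 3.15M tokens, matching the figure in the text, and 150,000 such steps correspond to roughly 472 billion tokens seen during training.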

We use the AdamW (Loshchilov and Hutter, 2019) optimizer, with beta values of $0.9$ and $0.95$ respectively, and an epsilon of $1.0\mathrm{\textsc{e}}{-8}$ . We extend AdamW with the ZeRO optimizer (Rajbhandari et al., 2020) to reduce memory consumption by distributing optimizer states across ranks. Since the weights and optimizer states of a model at this scale do not fit on a single GPU, we use the tensor parallelism scheme introduced in Shoeybi et al. (2020) in combination with pipeline parallelism (Harlap et al., 2018) to distribute the model across GPUs. To train GPT-NeoX-20B, we found that the most efficient way to distribute the model given our hardware setup was to set a tensor parallel size of 2, and a pipeline parallel size of 4. This allows for the most communication intensive processes, tensor and pipeline parallelism, to occur within a node, and data parallel communication to occur across node boundaries. In this fashion, we were able to achieve and maintain an efficiency of 117 teraFLOPS per GPU.
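The layout implied by these settings follows from simple arithmetic over the cluster described in the Hardware section (twelve nodes of eight GPUs). The sketch below is a back-of-the-envelope calculation, not the launcher configuration; the data-parallel degree is derived here rather than quoted from the text.

```python
NODES = 12
GPUS_PER_NODE = 8
TENSOR_PARALLEL = 2
PIPELINE_PARALLEL = 4
TFLOPS_PER_GPU = 117

world_size = NODES * GPUS_PER_NODE                       # total GPUs
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL   # GPUs holding one model copy
data_parallel = world_size // gpus_per_replica           # number of model replicas
replica_fits_in_node = gpus_per_replica <= GPUS_PER_NODE
aggregate_pflops = TFLOPS_PER_GPU * world_size / 1000

print(world_size, gpus_per_replica, data_parallel)  # 96 8 12
print(replica_fits_in_node)                         # True
print(f"{aggregate_pflops:.1f} PFLOPS sustained across the cluster")
```

Because one model replica occupies exactly the eight GPUs of a node, tensor and pipeline traffic stays inside the node while only data-parallel communication crosses node boundaries, as the text describes.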

GPT-NeoX-20B was trained on the Pile (Gao et al., 2020), a massive curated dataset designed specifically for training large language models. It consists of data from 22 data sources, coarsely broken down into 5 categories:

Academic Writing: PubMed Abstracts and PubMed Central, arXiv, FreeLaw (https://www.courtlistener.com/), USPTO Backgrounds (https://bulkdata.uspto.gov/), PhilPapers (https://philpapers.org/), NIH Exporter (https://exporter.nih.gov/)

Web-scrapes and Internet Resources: CommonCrawl, OpenWebText2, StackExchange (https://archive.org/details/stackexchange), Wikipedia (English)

Prose: BookCorpus2, Bibliotik, Project Gutenberg (PG-19; Rae et al., 2019)

Dialogue: YouTube subtitles, Ubuntu IRC (https://irclogs.ubuntu.com/), OpenSubtitles (Lison and Tiedemann, 2016), Hacker News (https://news.ycombinator.com/), EuroParl (Koehn, 2005)

Miscellaneous: GitHub, the DeepMind Mathematics dataset (Saxton et al., 2019), Enron Emails (Klimt and Yang, 2004)

In aggregate, the Pile consists of over 825 GiB of raw text data. The diversity of data sources reflects our desire for a general-purpose language model. Certain components are up-sampled to obtain a more balanced data distribution. In contrast, GPT-3’s training data consists of web-scrapes, books datasets, and Wikipedia. When comparing results in this work to GPT-3, the training data is almost certainly the biggest known unknown factor. Full details of the Pile can be found in the technical report (Gao et al., 2020) and the associated datasheet (Biderman et al., 2022).

It is particularly notable that the Pile contains a scrape of StackExchange preprocessed into a Q/A form. There is a significant and growing body of work on the influence of the syntactic structure of finetuning data on downstream performance (Zhong et al., 2021; Tan et al., 2021; Sanh et al., 2021; Wei et al., 2021). While so far there has been no systematic work that focuses on prompted pretraining, recent work (Biderman and Raff, 2022) observed that the formulation of the StackExchange component of the Pile appears to heavily influence code generation.

2 Tokenization

For GPT-NeoX-20B, we use a BPE-based tokenizer similar to that used in GPT-2, with the same total vocabulary size of 50257, but with three major changes. First, we train a new BPE tokenizer based on the Pile, taking advantage of its diverse text sources to construct a more general-purpose tokenizer. Second, in contrast to the GPT-2 tokenizer, which treats tokenization at the start of a string as a non-space-delimited token, the GPT-NeoX-20B tokenizer applies consistent space delimitation regardless. This resolves an inconsistency regarding the presence of prefix spaces to a tokenization input (see https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2). An example can be seen in Figure 3. Third, our tokenizer contains tokens for repeated spaces (all positive integer amounts of repeated spaces up to and including 24). This allows the GPT-NeoX-20B tokenizer to tokenize text with large amounts of whitespace using fewer tokens; for instance, program source code or arXiv LaTeX source files. See Appendix E for an analysis of the tokenizer.

Figure 3: Two tokenizations of the same Python function, a recursive Fibonacci implementation (fibRec), with line breaks marked by $\hookleftarrow$.
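The effect of the repeated-space tokens on indented source code can be illustrated with a deliberately simplified model. The sketch below counts only the tokens spent on leading indentation, and assumes that a run of spaces is split greedily into chunks of at most the longest available space token; real BPE merges also interact with the following word, so the numbers are indicative only. The function names are hypothetical.

```python
MAX_SPACE_RUN = 24  # longest run of spaces with a dedicated token, per the text


def space_run_cost(n_spaces: int, max_run: int = MAX_SPACE_RUN) -> int:
    """Tokens needed for n_spaces, assuming greedy chunks of at most max_run."""
    if n_spaces <= 0:
        return 0
    return -(-n_spaces // max_run)  # ceiling division


def indentation_cost(source: str, max_run: int = MAX_SPACE_RUN) -> int:
    """Total tokens spent on leading spaces across all lines of source."""
    total = 0
    for line in source.splitlines():
        indent = len(line) - len(line.lstrip(" "))
        total += space_run_cost(indent, max_run)
    return total


snippet = (
    "def f(n):\n"
    "    if n < 2:\n"
    "        return n\n"
    "    return f(n - 1) + f(n - 2)\n"
)

print(indentation_cost(snippet, max_run=1))   # one token per space: 16
print(indentation_cost(snippet, max_run=24))  # one token per indent run: 3
```

Under this toy model the indentation of a four-line function shrinks from 16 tokens to 3, which is the kind of saving the dedicated space tokens are meant to provide for source code and LaTeX.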

3 Data Duplication

In the past two years, the standard practice when training autoregressive language models has become to train for only one epoch (Komatsuzaki, 2019; Kaplan et al., 2020; Henighan et al., 2020). Recent research has claimed to see significant benefits from going even further and deduplicating training data (Lee et al., 2021; Kandpal et al., 2022; Roberts et al., 2022). In particular, every publicly known larger language model other than GPT-3 (Brown et al., 2020) and Jurassic-1 either uses some form of deduplication (Rae et al., 2022; Askell et al., 2021; Zeng et al., 2021; Sun et al., 2021; Smith et al., 2022; Hoffmann et al., 2022; Chowdhery et al., 2022) or does not discuss the training data in sufficient detail to determine what was done (Kim et al., 2021). In private communication, the authors of Jurassic-1 confirmed that it was trained on the Pile (Gao et al., 2020).

When the Pile was originally made, the only language model larger than GPT-NeoX-20B that existed was GPT-3, which upsampled high-quality subsets of its training data. The Pile followed suit, and due to a combination of a lack of resources for large-scale ablations and a lack of noticeable impact at smaller scales, we opted to use the Pile as-is. As shown in Figure 4, even at the 20B parameter scale we see no degradation in validation loss after crossing the one-epoch boundary.

Unfortunately, none of the papers that have claimed to see an improvement from deduplication have released trained models that demonstrate this, making replication and confirmation of their results difficult. Lee et al. (2021) release the deduplication code that they used, which we intend to use to explore this question in more detail in the future.

It is important to note that even if there is not an improvement in loss or on task evaluations there are nevertheless compelling reasons to deduplicate training data for any model put into production. In particular, systematic analysis has shown significant benefits in terms of reducing the leakage of training data (Lee et al., 2021; Zhang et al., 2021; Carlini et al., 2022; Kandpal et al., 2022).

Performance Evaluations

To evaluate our model we use the EleutherAI Language Model Evaluation Harness (Gao et al., 2021b), an open source codebase for language model evaluation that supports a number of model APIs. As our goal is to make a powerful model publicly accessible, we compare with English language models with at least 10B parameters that are publicly accessible. We compare with the GPT-3 models on the OpenAI API (Brown et al., 2020), the open source FairSeq dense models (Artetxe et al., 2021), and GPT-J-6B (Wang and Komatsuzaki, 2021). We do not compare against T5 (Raffel et al., 2020) or its derivatives as our evaluation methodology assumes that the models are autoregressive. While there is a Megatron-11B checkpoint that has been publicly released, the released code is non-functional and we have not been able to get the model to work. We do not compare against any mixture-of-experts models as no public MoE model achieves performance comparable to a 10B parameter dense model.

While the sizes of the GPT-3 API models are not officially confirmed, we follow Gao (2021b) and assess them as being 350M (Ada), 1.3B (Babbage), 6.7B (Curie), and 175B (Da Vinci). We categorize both GPT-J-6B and GPT-NeoX-20B under the umbrella of GPT-NeoX models, as both models were trained with the same architecture and on the same dataset. However, we connect them using a dashed line to reflect the fact that these two models are not the same model trained at two different scales the way the FairSeq and GPT-3 models are, having been trained using different codebases, different tokenizers, and for different numbers of tokens.

Where we were able to obtain the relevant information, we report two baselines: human-level performance and random performance. All plots contain error bars representing two standard errors, indicating the $95\%$ confidence interval around each point. For some plots, the standard error is so small that the interval is not visible.
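An error bar of two standard errors can be computed for an accuracy metric as sketched below. This assumes the binomial standard error of a proportion; the evaluation harness may estimate standard errors differently for some metrics, and the counts used are made up for illustration.

```python
import math
from typing import Tuple


def accuracy_interval(correct: int, total: int) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using +/- two binomial standard errors."""
    acc = correct / total
    stderr = math.sqrt(acc * (1.0 - acc) / total)
    return acc, acc - 2.0 * stderr, acc + 2.0 * stderr


# Hypothetical result: 720 correct answers out of 1000 questions.
acc, lower, upper = accuracy_interval(720, 1000)
print(f"accuracy {acc:.3f}, interval [{lower:.3f}, {upper:.3f}]")
```

Since the standard error shrinks with the square root of the number of examples, large evaluation sets produce intervals narrow enough to disappear behind the plotted marker.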

We evaluate our model on a diverse collection of standard language model evaluation datasets that we divide into three main categories: natural language tasks, advanced knowledge-based tasks, and mathematical tasks. We evaluate GPT-J-6B, GPT-NeoX-20B, and FairSeq models both zero- and five-shot, but due to financial constraints only evaluate GPT-3 models zero-shot. Due to space constraints a representative subset of the results is shown here, with the rest in Appendix D.

For natural language tasks, we evaluate on ANLI (Nie et al., 2020), ARC (Clark et al., 2018), HeadQA (English) (Vilares and Gómez-Rodríguez, 2019), HellaSwag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), LogiQA (Liu et al., 2020), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), PROST (Aroca-Ouellette et al., 2021), QA4MRE (Peñas et al., 2013), SciQ (Welbl et al., 2017), TriviaQA (Joshi et al., 2017), Winogrande (Sakaguchi et al., 2021), and the SuperGLUE version of the Winograd Schemas Challenge (WSC) (Wang et al., 2019).

Solving mathematical problems is an area with a long history of study in AI research, despite the fact that large language models tend to perform quite poorly on both arithmetic tasks and mathematical problems phrased in natural language. We evaluate on the MATH test dataset (Hendrycks et al., 2021b) as well as on the numerical arithmetic problems introduced by Brown et al. (2020). Note that models are generally finetuned on the MATH dataset before being evaluated on it, but due to computational limitations we only evaluate models zero- and five-shot here.

We are also interested in the ability of our models to answer factual questions that (for humans) require advanced knowledge. To do this, we use a dataset of multiple choice questions in a variety of diverse domains developed by Hendrycks et al. (2021a). Following common practice on this dataset, we focus on results aggregated by subject area: Humanities, Social Sciences, STEM, and Miscellaneous as presented in Figure 7. We report five-shot performance to be comparable to previous work, taking our five-shot GPT-3 values from Hendrycks et al. (2021a).

Discussion

While GPT-NeoX-20B outperforms FairSeq 13B on some tasks (e.g. ARC, LAMBADA, PIQA, PROST), it underperforms on others (e.g. HellaSwag, LogiQA zero-shot). In total, across the 32 evaluations we ran, we outperform on 22 tasks, underperform on four tasks, and fall within the margin of error on six tasks. By far our weakest performance is on HellaSwag, where we score four standard deviations below FairSeq 13B in both zero- and five-shot evaluations. Similarly, GPT-J underperforms FairSeq 6.7B by three standard deviations zero-shot and six standard deviations five-shot on HellaSwag. We find this massive performance loss largely inexplicable; while we originally assumed that the substantial non-prose components of the Pile were to blame, we note that GPT-J and GPT-NeoX outperform FairSeq models on the very similar LAMBADA task by roughly the same amount.

While GPT-3 and FairSeq models are generally quite close on arithmetic tasks, they are consistently out-performed by GPT-J and GPT-NeoX. We conjecture that this is traceable to the prevalence of mathematics equations in the training data, but warn that people should not assume that this means that training on the Pile produces better out-of-distribution arithmetic reasoning. Razeghi et al. (2022) show that there is a strong correlation between the frequency of a numerical equation in the Pile and GPT-J’s performance on that equation, and we see no reason this would not hold in GPT-NeoX 20B, FairSeq, and GPT-3. We are unfortunately unable to investigate this effect in FairSeq and GPT-3 models because the authors do not release their training data.

While GPT-NeoX and FairSeq models both exhibit dominant performance on MMLU compared to GPT-3 in the five-shot setting (Figure 7), their performance is much closer in the zero-shot setting (Tables 10, 11, 12 and 13). Hendrycks et al. (2021a) claim to find that few-shot evaluation does not improve performance relative to zero-shot, but they only study GPT-3. By contrast, we find that GPT-NeoX and FairSeq models do improve substantially with as few as five examples. We view this as a warning against drawing strong conclusions about evaluation metrics based only on one model, and encourage researchers developing new evaluation benchmarks to leverage multiple different classes of models to avoid overfitting their conclusions to a specific model.

2 Powerful Few-Shot Learning

Our experiments indicate that GPT-J-6B and GPT-NeoX-20B benefit substantially more from few-shot evaluations than the FairSeq models do. When going from 0-shot to 5-shot evaluations, GPT-J-6B improves by $0.0526$ and GPT-NeoX-20B improves by $0.0598$ while the FairSeq 6.7B and 13B models improve by $0.0051$ and $0.0183$ respectively. This result is statistically significant and robust to perturbations of prompting. While we do not have a particular explanation for this currently, we view this as a strong recommendation for our models. While we do not have systematic five-shot evaluations of GPT-3 due to financial limitations, the change in performance demonstrated in Tables 10, 11, 12 and 13 and Figure 7 further supports the suggestion that GPT-J-6B and GPT-NeoX-20B are able to gain significantly more utility from five-shot examples.

3 Limitations

Hyperparameter tuning is an expensive process, and is often infeasible to do at full scale for multi-billion parameter models. Due to the aforementioned limitations, we opted to choose hyperparameters based on a mixture of experiments at smaller scales and by interpolating parameters appropriate for our model size based on previously published work (Brown et al., 2020). However, several aspects of both our model architecture [Section 2.1] and training setup, including the data [Section 3.1] and the tokenizer [Section 3.2], diverge significantly from Brown et al. (2020). As such, it is almost certainly the case that the hyperparameters used for our model are no longer optimal, and potentially never were.

Many of the design choices we made during the development of this model were oriented towards improving performance on coding tasks. However, we underestimated the difficulty and cost of existing coding benchmarks (Chen et al., 2021), and so were unable to evaluate our model in that domain. We hope to do so in the future.

Finally, the lack of dataset deduplication could also have had an impact on downstream performance. Recent work has shown that deduplicating training data can have a large effect on perplexity (Lee et al., 2021). While our experiments show no sign of this, it is hard to dismiss it due to the number of researchers who have found the opposite result.

4 Releasing a 20B Parameter LLM

The current status quo in research is that large language models are things people train and publish about, but do not actually release. To the best of our knowledge, GPT-NeoX-20B is the largest and most performant dense language model to ever be publicly released. A variety of reasons for the non-release of large language models are given by various groups, but the primary one is the harms that public access to LLMs would purportedly cause.

We take these concerns quite seriously. However, having taken them quite seriously, we feel that they are flawed in several respects. While a thorough analysis of these issues is beyond the scope of this paper, the public release of our model is the most important contribution of this paper and so an explanation of why we disagree with the prevailing wisdom is important.

The open-source release of this model is motivated by the hope that it will allow researchers who would not otherwise have access to LLMs to use them. While there are negative risks due to the potential acceleration of capabilities research, we believe the benefits of this release outweigh the risks. We also note that these benefits are not hypothetical, as a number of papers about the limits and ethics of LLMs have been explicitly enabled by the public release of previous models (Zhang et al., 2021; Kandpal et al., 2022; Carlini et al., 2022; Birhane et al., 2021; nostalgebraist, 2020; Meng et al., 2022; Lin et al., 2021).

Perhaps the most curious aspect of the argument that LLMs should not be released is that the people making such arguments are not arguing that they should not use LLMs. Rather, they are claiming that other people should not use them. We do not believe that this is a position that should be taken seriously. The companies and governments that have the financial resources to train LLMs are overwhelmingly more likely to do large scale harm using an LLM than a random individual.

Releasing this model is the beginning, not the end, of our work to make GPT-NeoX-20B widely accessible to researchers. Due to the size of the model, inference is most economical on a pair of RTX 3090 Tis or a single A6000 GPU and finetuning requires significantly more compute. Truly promoting widespread access to LLMs means promoting widespread access to computing infrastructure in addition to the models themselves. We plan to make progress on this issue going forward by continuing to work on reducing the inference costs of our model, and by working with researchers to provide access to the computing infrastructure they need to carry out experiments on our models. We strongly encourage researchers who are interested in studying GPT-NeoX-20B but lack the necessary infrastructure to reach out to discuss how we can help empower you.
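The hardware named above is consistent with a simple estimate of the memory needed to hold the weights. The sketch below assumes 16-bit weights (two bytes per parameter) and uses 24 GB and 48 GB as the memory of an RTX 3090 Ti and an A6000 respectively; those capacities come from vendor specifications rather than from this paper, and activations and attention caches are not modeled.

```python
PARAMS = 20e9              # approximate parameter count
BYTES_PER_PARAM = 2        # assumption: 16-bit weights
GIB = 1024 ** 3

# Assumed GPU memory capacities in GB (vendor figures, not from the paper).
RTX_3090_TI_GB = 24
A6000_GB = 48

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
weights_gib = PARAMS * BYTES_PER_PARAM / GIB

print(f"weights: {weights_gb:.0f} GB ({weights_gib:.2f} GiB)")
print(f"pair of RTX 3090 Ti: {2 * RTX_3090_TI_GB} GB")
print(f"single A6000: {A6000_GB} GB")
print(weights_gb < 2 * RTX_3090_TI_GB and weights_gb < A6000_GB)
```

Under these assumptions the weights alone take about 40 GB, which leaves only a few gigabytes of headroom on either 48 GB configuration and is one reason finetuning needs considerably more hardware.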

Summary

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive Transformer language model trained on the Pile (Gao et al., 2020) dataset, and detail the main architectural differences between GPT-NeoX-20B and GPT-3—most notably the change in tokenizer, the addition of Rotary Positional Embeddings, the parallel computation of attention and feed-forward layers, and a different initialization scheme and hyperparameters. We run extensive evaluations of GPT-NeoX-20B on natural language and factual knowledge tasks, and compare it with other publicly available models, finding it performs particularly well on knowledge-based and mathematical tasks. Finally, we are open sourcing the training and evaluation code at https://github.com/EleutherAI/gpt-neox, where readers can find a link to download the model weights across the whole training run.

Acknowledgments

We thank staff at CoreWeave—in particular Max Hjelm, Brannin McBee, Peter Salanki, and Brian Venturo—for providing the GPUs and computing infrastructure that made this project possible. We would also like to acknowledge Eren Doğan and Wesley Brown for feedback and technical support throughout the project, and John Schulman, Evan Hubinger, Victor Sanh, Jacob Hilton, and Siddharth Karamcheti for providing feedback on drafts of the paper.

Finally, we thank Anthony DiPofi, Charles Foster, Jeffrey Hsu, Eric Tang, Anish Thite, Kevin Wang, and Andy Zou for their contributions to the EleutherAI Language Modeling Evaluation Harness we used to evaluate GPT-NeoX-20B.

References

Appendix A Individual Contributions

Sid Black was the lead developer and overall point person for the project. Stella Biderman was the lead scientist and project manager.

Implementation of training infrastructure: Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Samuel Weinbach

Scaling experiments and optimization: Sid Black, Stella Biderman, Quentin Anthony, Samuel Weinbach

Positional Embeddings: Sid Black, Eric Hallahan, Michael Pieler

Miscellaneous: USVSN Sai Prashanth, Ben Wang

Scientific Experimentation

Evaluations: Stella Biderman, Leo Gao, Jonathan Tow, Sid Black, Shivanshu Purohit, Horace He, Laurence Golding

Positional Embeddings: Stella Biderman, Laurence Golding, Michael Pieler

Tokenizer: Stella Biderman, Jason Phang, Leo Gao

Broader Impacts

Alignment Implications: Leo Gao, Connor Leahy, Laria Reynolds, Kyle McDonell

Environmental Impact: Stella Biderman, Eric Hallahan

Appendix B Full Configuration Details

In Table 1 we attach the full configuration details used to train GPT-NeoX-20B. The file is available in .yaml format usable in gpt-neox at https://github.com/EleutherAI/gpt-neox, where we also provide documentation describing the role of each parameter.

Appendix C Broader Impacts

The current status quo in research is that large language models are things people train and publish about, but do not actually release. To the best of our knowledge, GPT-NeoX-20B is the largest dense language model ever to be publicly released, with a several-way tie for second place at 13 billion parameters (Artetxe et al., 2021; Xue et al., 2020, 2022) and many more models at the 10-11B parameter scale. Various groups give a variety of reasons for not releasing large language models, but the primary one is the harm that public access to LLMs would purportedly cause.

The open-source release of this model is motivated by the hope that it will allow ethics and alignment researchers who would not otherwise have access to LLMs to use them. While there are risks due to the potential acceleration of capabilities research, we believe the benefits of this release outweigh them.

C.1 Impact on Capabilities Research and Products

When discussing the impact of access to technology, it is important to distinguish between capacities research which seeks to push the current state-of-the-art and research on

We feel the risk of releasing GPT-NeoX-20B is acceptable, as the contribution of the model to capabilities research is likely to be limited, for two reasons.

We ultimately believe that the benefits of releasing this model outweigh the risks, but this argument hinges crucially on the particular circumstances of this release. All actors considering releasing powerful AI models or advancing the frontier of capabilities should think carefully about what they release, in what way, and when.

C.2 Impact on Ethics and Alignment Research

To oversimplify a complex debate, there are broadly speaking two schools of thought regarding the mitigation of harm done by AI algorithms: AI Ethics and AI Alignment. AI Ethics researchers are primarily concerned with the impact of current technologies or technologies very similar to current technologies, while AI Alignment is primarily concerned with future “generally intelligent” systems whose capacities greatly outclass currently existing systems and which possess human or superhuman levels of intelligence. While the tools, methods, and ideas of these camps are very different, we believe that increasing access to these technologies will empower and advance the goals of researchers in both schools.

Analyzing and documenting the limitations of models is an essential aspect of AI ethics research (Matias, 2020). Work examining and criticizing datasets (Kreutzer et al., 2022; Dodge et al., 2021; Birhane et al., 2021), functionality (Smart, 2021; Zhang et al., 2021; Carlini et al., 2022; Biderman and Raff, 2022), evaluation and deployment procedures (Biderman and Scheirer, 2020; Talat et al., 2022), and more is essential to a well-rounded and informed debate on the value and application of technology.

However, the current centralization of LLM training also creates a centralization of control of technology (Sadowski et al., 2021; Whittaker, 2021) that makes meaningful independent evaluation impossible. In practice, this kind of work is often not possible because of the severe access restrictions that the companies owning large language models place on them. While GPT-NeoX-20B is the 13th largest dense language model at the time of writing, the only larger model that is publicly accessible is GPT-3. There are significant limitations on people’s ability to do research on GPT-3, though, as it is not free to use and its training data is private.

C.2.2 The Usefulness of Large Language Models in Alignment

LLMs represent a different paradigm than the AI systems generally studied by alignment researchers because they are not well-described as coherent agents or expected utility maximizers. Though an LLM is trained to optimize a log-likelihood loss function, at a high level the goals it pursues are varied and contradictory, depending on the way it is prompted. This introduces additional challenges, but may also enable new approaches to alignment.

GPT-NeoX-20B itself is not the system we need to align, but we hope it can serve as a publicly available platform for experiments whose results might generalize to crucial future work.

The following is a non-exhaustive list of potential approaches we consider promising for further investigation.

Mechanistic interpretability research (Cammarata et al., 2020) hopes to gain an understanding of how models accomplish the tasks they do, in part in the hope of detecting problematic or deceptive algorithms implemented by models before these failures manifest in the real world. Being able to interpret and inspect the detailed inner workings of trained models would be a powerful tool to ensure models are optimizing for the goals we intended (Hubinger et al., 2021; Koch et al., 2021). Reverse engineering transformer language models has already yielded insights about the inner functioning of LMs (Elhage et al., 2021; nostalgebraist, 2020; Meng et al., 2022; Dai et al., 2021).

Because they are trained to predict human writing, LLMs also appear to develop a useful representation of human values at the semantic level. Finding a way to utilise these representations could be a possible path toward solving the problem of reward robustness in RL and other algorithms which require a proxy of human judgment (Stiennon et al., 2022; Wentworth, 2020). Despite fundamental theoretical limitations on learning human values (Armstrong and Mindermann, 2018; Kosoy, 2016), value learning may still be robust enough to align weaker superhuman AIs. Future experiments could explore the extent to which LLM pretraining improves downstream reward model robustness and generalization.

Since LLM prompts are in a human-readable form, they can provide insight into the LLM’s expected behavior. Prompt programming or finetuning can be used to leverage this fact and force an LLM to execute more transparent algorithms, such as splitting problems into steps or explicitly writing an “internal monologue” (Soares, 2021; Gao et al., 2021a; Nye et al., 2021). Reliability and trustworthiness can present significant challenges for these approaches.

However, this form of transparency also has its limits. In particular, models can often respond unpredictably to prompts, and internal monologues may become completely detached from the model’s decision making process if translating between the model’s ontology and the human ontology is more complex than simply modeling human monologues (Christiano et al., 2021).

Although LLMs are not well-described as coherent agents, they can still be used to generate goal-directed processes. Given an appropriate prompt (such as a story of a character working to achieve a goal), LLMs can predict and thus simulate an agent (Huang et al., 2022). Simulated agents take representative actions according to the patterns present in the training data, similar to behavior cloning. One potential future research direction is testing whether they are less susceptible to failure modes that follow from expected utility maximization, such as Goodhart failures and power-seeking behavior. However, other failure modes can be introduced by the LM training procedure, such as “delusions” or “hallucinations” (Ortega et al., 2021; Gao, 2021a; Maynez et al., 2020). Additionally, simulated agents may be uncompetitive with optimal agents like those produced by Reinforcement Learning. An important research direction is to explore how the beneficial properties of simulated agents can be maintained while making them competitive with RL-based approaches.

LMs can be used as relatively unagentic tools, such as OpenAI’s Codex model (Chen et al., 2021) acting as a coding assistant. Because pretrained LLMs are not directly optimized for the factual accuracy of their predictions, it is possible they avoid some of the traditional problems with tool or oracle AI (Armstrong et al., 2012), such as the incentive to produce manipulative answers (Demski, 2019). Tool AI is not a long-term solution to the problem of alignment, but it could be used to assist alignment research or even automate large parts of it. For example, language models could be used to help brainstorm alignment ideas more quickly, act as a writing assistant, or directly generate alignment research papers for humans to review. This line of research also risks accelerating capabilities research, a concern we discuss more below.

C.3 Differential Impact on Access

Because training large models requires a significant engineering and capital investment, such models are often out of reach for small labs and independent researchers. As it stands, only large organizations have access to the latest generation of powerful language models (Brown et al., 2020; Rae et al., 2022; Fedus et al., 2021; Lieber et al., 2021; Tang, 2021). The number of researchers at these labs focused primarily on ethics and alignment is much lower than the number working on developing new capabilities.

We feel the risk of releasing GPT-NeoX-20B is acceptable, as the contribution of the model to capabilities research is likely to be limited, for two reasons. Firstly, the organizations pursuing capabilities research most aggressively are unlikely to benefit from our open-source release of this model as they have already developed more powerful models of their own. Secondly, we believe the single most important piece of knowledge that drives advancing capabilities research is the knowledge that scaling LLMs was possible in the first place (Leahy, 2021; Leahy and Biderman, 2021), whereas the actual implementation is very fungible (as evidenced by the large number of parties who have succeeded in creating their own LLMs in the past two years). This differential impact, wherein our release is expected to benefit primarily people who have less funding and infrastructure, is a key factor in our decision to release this model publicly.

C.4 Environmental Impact

A significant point of concern in some recent work is the energy usage and carbon emissions associated with training large language models (Strubell et al., 2019; Schwartz et al., 2020; Lacoste et al., 2019; Bender et al., 2021). In particular, Strubell et al. (2019) estimate that a then-recent paper by the authors released $626,155$ lbs or $284.01$ metric tons of $\mathrm{CO_{2}}$ ( $\mathrm{t_{CO_{2}}}$ ). (We choose to present environmental impact figures in metric tons to align with standard reporting.) As Strubell et al. (2019) has been widely cited and quoted in the media as representative of large-scale language models, we decided to explicitly and carefully track our energy usage and carbon emissions to see if this is truly a representative account of NLP emissions.

Throughout the development and training of our model, we tracked our energy usage and carbon emissions. We found that the process of developing and training GPT-NeoX-20B emitted roughly $11\%$ of Strubell et al. (2019)’s estimate, coming in at a total of $69957$ lbs or $31.73$ metric tons of $\mathrm{CO_{2}}$ . This is roughly the equivalent of the yearly emissions of the average American or 35 round-trip flights between New York City and San Francisco. Our systems were based in Illinois, USA, and consumed energy sourced from the following mix:

$30.40\%$ Coal ( $0.95\,\mathrm{t_{CO_{2}}}$ /MWh)

$31.30\%$ Gas ( $0.6078\,\mathrm{t_{CO_{2}}}$ /MWh)

$\phantom{0}1.30\%$ Hydroelectric ( $0\,\mathrm{t_{CO_{2}}}$ /MWh)

$17.40\%$ Nuclear ( $0\,\mathrm{t_{CO_{2}}}$ /MWh)

$\phantom{0}0.30\%$ Solar ( $0\,\mathrm{t_{CO_{2}}}$ /MWh)

$18.10\%$ Wind ( $0\,\mathrm{t_{CO_{2}}}$ /MWh)

$\phantom{0}1.30\%$ Other Renewables ( $0\,\mathrm{t_{CO_{2}}}$ /MWh)

This mixture produces an average of $0.47905$ $\mathrm{t_{CO_{2}}}$ /MWh, and we consumed a total of $43.92$ MWh of electricity over the course of $1830$ hours of training. Scaling, testing, and evaluation were responsible for the equivalent of another $920$ hours on our systems, for a total energy consumption of $66.24$ MWh and thus the production of approximately $31.73$ metric tons of $\mathrm{CO_{2}}$ .
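The arithmetic behind these figures can be reproduced with a short script. The following is a minimal sketch rather than the tooling used for the paper: the dictionary layout and function names are illustrative assumptions, and only the shares, emission factors, and the $66.24$ MWh total are taken from the text above.

```python
# Energy mix reported above: source -> (share of total, tCO2 per MWh).
# The dictionary layout and function names are illustrative assumptions.
ENERGY_MIX = {
    "coal": (0.3040, 0.95),
    "gas": (0.3130, 0.6078),
    "hydroelectric": (0.0130, 0.0),
    "nuclear": (0.1740, 0.0),
    "solar": (0.0030, 0.0),
    "wind": (0.1810, 0.0),
    "other_renewables": (0.0130, 0.0),
}


def carbon_intensity(mix):
    """Share-weighted average emissions of the mix, in tCO2 per MWh."""
    return sum(share * factor for share, factor in mix.values())


def emissions_tonnes(energy_mwh, mix):
    """Total emissions in metric tons of CO2 for a given energy use."""
    return energy_mwh * carbon_intensity(mix)


intensity = carbon_intensity(ENERGY_MIX)
total = emissions_tonnes(66.24, ENERGY_MIX)
print(f"carbon intensity: {intensity:.5f} tCO2/MWh")
print(f"total emissions:  {total:.2f} tCO2")
```

Only coal and gas contribute to the weighted average, since every other source is assigned an emission factor of zero. The result agrees with the reported intensity up to rounding in the last digit.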

It is noteworthy that Strubell et al. (2019) estimate emissions from a neural architecture search paper, so their figure is not directly comparable to ours. The primary motivation for our comparison is that their number has attracted a lot of attention and is often taken to be representative of NLP research. In general, we advocate for more systematic and comprehensive reporting to improve transparency surrounding this important topic.

Appendix D Full Evaluation Results

Results for natural language understanding tasks are shown in Tables 2 and 3, while results for Hendrycks tasks are found in the accompanying Hendrycks task tables for the GPT and FairSeq model families.

All evaluations had version 0 in the Evaluation Harness. This information is reported in the output of the Evaluation Harness and should be used for ensuring reproducibility of these results, even as the task implementations themselves may change to fix bugs.

Appendix E Tokenizer Analysis

In this section, we compare the GPT-NeoX-20B tokenizer with the GPT-2 tokenizer using the validation set of the Pile. Both tokenizers share 36938 out of 50257 tokens, a $\sim$ 73.5% overlap in tokens.
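As an illustration of how such an overlap figure can be computed, the sketch below intersects two vocabularies represented as sets of token strings. The toy vocabularies and the function name `vocab_overlap` are hypothetical; loading the real vocabularies is left out, and only the counts 36938 and 50257 come from the text.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Return (number of shared tokens, shared tokens as a fraction of vocab_a)."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    shared = set_a & set_b
    return len(shared), len(shared) / len(set_a)


# Toy vocabularies (hypothetical, for illustration only).
toy_gpt2 = {"the", " the", "ing", "\n", "  "}
toy_neox = {"the", " the", "ing", "\n\n", "    ", "tion"}

shared_count, shared_fraction = vocab_overlap(toy_gpt2, toy_neox)
print(shared_count, shared_fraction)

# The overlap reported in the text, recomputed from its two counts.
reported_fraction = 36938 / 50257
print(f"{reported_fraction:.1%}")
```

The fraction here is taken relative to the first vocabulary; since both tokenizers are described as having vocabularies of comparable size, the choice of denominator should matter little in practice.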

In Table 15, we show the resulting number of tokens from tokenizing each component of the Pile’s validation set with both tokenizers, and the ratio of GPT-NeoX-20B tokens to GPT-2 tokens.

We observe that the GPT-NeoX-20B tokenizer represents all Pile components using fewer or very closely comparable numbers of tokens. The largest percentage improvements in token counts are in the EuroParl, GitHub, and PubMed Central components, each with savings of more than 20% in the number of tokens needed to represent that component. We highlight that arXiv, GitHub, and StackExchange (subsets with large code components) can be represented with meaningfully fewer tokens with the GPT-NeoX-20B tokenizer compared to the GPT-2 tokenizer. Overall, the GPT-NeoX-20B tokenizer represents the Pile validation set with approximately 10% fewer tokens compared to the GPT-2 tokenizer.

Given that the GPT-NeoX-20B tokenizer is tweaked to better tokenize whitespace, we also perform a comparison between the two tokenizers excluding whitespace. We perform the same analysis as the above, but exclude all whitespace tokens from our computations, only counting the non-whitespace tokens. A token is considered a whitespace token if it consists only of whitespace characters. The results are shown in Table 16 in the Appendix. We observe that the GPT-NeoX-20B tokenizer still uses 5% fewer tokens to represent the Pile validation set compared to the GPT-2 tokenizer. As expected, the token ratios for certain components such as GitHub and StackExchange become closer to even once the whitespace characters are excluded.
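The whitespace-excluding comparison can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are given as decoded strings, the two token lists are hand-written toy outputs rather than real tokenizer results, and all function names are hypothetical.

```python
def is_whitespace_token(token):
    """A token is a whitespace token if it consists only of whitespace characters."""
    return token.isspace()


def count_tokens(tokens, exclude_whitespace=False):
    """Count tokens, optionally skipping whitespace-only tokens."""
    if exclude_whitespace:
        return sum(1 for t in tokens if not is_whitespace_token(t))
    return len(tokens)


def token_ratio(tokens_new, tokens_ref, exclude_whitespace=False):
    """Ratio of token counts between a new and a reference tokenization."""
    return (count_tokens(tokens_new, exclude_whitespace)
            / count_tokens(tokens_ref, exclude_whitespace))


# Hand-written toy tokenizations of the line "    return x\n" (hypothetical).
ref_tokens = [" ", " ", " ", " ", "return", " x", "\n"]  # one token per space
new_tokens = ["    ", "return", " x", "\n"]              # indentation merged

print(token_ratio(new_tokens, ref_tokens))
print(token_ratio(new_tokens, ref_tokens, exclude_whitespace=True))
```

In this toy case the entire saving comes from merged indentation, so the ratio moves to exactly even once whitespace tokens are dropped. This mirrors the direction of the effect described for GitHub and StackExchange.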

While we evaluated our tokenizer using the validation set for the Pile, the Pile components would still be considered in-domain for the tokenizer and may not provide the most informative comparison point. To perform an out-of-domain comparison, we perform the same analysis using the AllenAI replication of C4 (https://github.com/allenai/allennlp/discussions/5056), another popular pretraining corpus for large language models. As above, we use the validation set for our analysis. Our results are shown in Table 14. We find that the GPT-NeoX-20B tokenizer tokenizes the C4 validation set to approximately the same number of tokens as the GPT-2 tokenizer. When excluding all whitespace tokens, the GPT-NeoX-20B tokenizer requires approximately 1% more tokens to represent the corpus compared to the GPT-2 tokenizer.

We show in Table 17 the 10 longest tokens in each tokenizer vocabulary. We exclude consideration of tokens that comprise only symbols or whitespace characters. We observe that for the GPT-2 tokenizer, many of the longest tokens appear to reflect artifacts in the tokenizer training data, likely with certain websites or web-scrapes being overrepresented in the training data. For the GPT-NeoX-20B tokenizer, we observe that most of the longest tokens are scientific terms, likely arising from the PubMed components of the Pile.
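A selection of this kind can be sketched in a few lines. The example below makes two assumptions that the text does not spell out: token length is measured in characters, and a token is excluded when it contains no alphanumeric character at all. The vocabulary and the function name are hypothetical.

```python
def longest_tokens(vocab, k=10):
    """Return the k longest tokens that contain at least one alphanumeric character.

    Assumption: a token made up only of symbols or whitespace is one with no
    alphanumeric character; length is measured in characters.
    """
    candidates = [t for t in vocab if any(ch.isalnum() for ch in t)]
    # Longest first; ties broken alphabetically for a deterministic order.
    return sorted(candidates, key=lambda t: (-len(t), t))[:k]


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = [
    "----------------",
    "        ",
    "immunohistochemistry",
    " the",
    "cat",
    "==========",
]

print(longest_tokens(toy_vocab, k=2))
```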

E.1.2 Worst Case Word Tokenization Comparison

We consider the words for which there is the greatest discrepancy in the resulting token length between the two tokenizers, where one tokenizer needs many tokens to represent a word while the other tokenizer uses relatively few. We define a word as a contiguous string delimited by whitespace or punctuation (as defined by string.punctuation in Python). We perform this analysis at the component level. We only consider words that occur at least 10 times within the given component. We show in Table 18 a representative example from the Pile-CC corpus.
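The procedure described here can be sketched as below. This is an illustration under assumptions, not the analysis code: the discrepancy is taken to be the signed difference in token counts, the two tokenizers are stand-in callables (a character-level splitter and a whole-word one), and all names are hypothetical.

```python
import re
import string
from collections import Counter

# Words are contiguous strings delimited by whitespace or string.punctuation.
_DELIMITERS = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")


def split_words(text):
    """Split text into words, dropping empty strings at the edges."""
    return [w for w in _DELIMITERS.split(text) if w]


def worst_case_words(text, tokenize_a, tokenize_b, min_count=10, k=5):
    """Words where tokenizer A needs the most extra tokens compared to B.

    Only words occurring at least min_count times are considered.
    Assumption: discrepancy = len(tokenize_a(word)) - len(tokenize_b(word)).
    """
    counts = Counter(split_words(text))
    scored = [
        (len(tokenize_a(word)) - len(tokenize_b(word)), word)
        for word, n in counts.items()
        if n >= min_count
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Stand-in tokenizers (hypothetical): character-level versus whole-word.
char_level = list
whole_word = lambda word: [word]

sample = "antidisestablishment, cat! " * 10 + "rare"
result = worst_case_words(sample, char_level, whole_word)
print(result)
```

Swapping the two tokenizer arguments yields the opposite direction of the comparison, that is, the words on which the second tokenizer fares worst.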

Appendix F Tokenization Examples

In Figures 8 through 13, we show examples of tokenized documents from the Pile, comparing the GPT-2 tokenizer to ours.

253 tokens --- $\hookleftarrow$ ab stract : ’ The maximal minors of a $p \ times ( m + p )$ - mat rix of univariate po lyn om ials of degree $n$ with ind etermin ate coefficients are themselves polynom ials of degree $np$ . The subal gebra generated by their coefficients is the coordinate ring of the quantum Grass mann ian, a singular compact ification of the space of rational curves of degree $np$ in the Grassmannian of $p$ - planes in ( $m + p$ )-space. These sub al gebra generators are shown to form a sag bi basis . The resulting flat deformation from the quantum Grass mann ian to a tor ic variety gives a new “Grö b ner basis style ” proof of the R avi - Ros enthal -Wang formulas in quantum Sch u bert calculus. The coordinate ring of the quantum Grass mannian is an algebra with straight ening law , which is normal , Cohen - Mac aul ay, Goren stein and Kos z ul , and the ideal of quantum Pl ü cker relations has a quad r atic Gr ö b ner basis. This holds more generally for skew quantum Schubert varieties . These results are well-known for the classical Sch u bert varietie

229 tokens --- $\hookleftarrow$ abstract : ’ The maximal minors of a $p \ times ( m + p)$ - matrix of univariate polynomials of degree $n$ with indeterm inate coefficients are themselves polynomials of degree $np$ . The sub algebra generated by their coefficients is the coordinate ring of the quantum Grass mann ian , a singular compactification of the space of rational curves of degree $np$ in the Grass mann ian of $p$ - planes in ( $m + p$ )- space . These sub algebra generators are shown to form a sag bi basis. The resulting flat deformation from the quantum Grassmannian to a tor ic variety gives a new “Gr ö b ner basis style ” proof of the R avi - Ros enthal -Wang formulas in quantum Sch ubert calculus . The coordinate ring of the quantum Grass mann ian is an algebra with straight ening law , which is normal, Cohen - Mac aul ay , Gorenstein and Kos z ul , and the ideal of quantum Pl ü cker relations has a quadratic Grö b ner basis . This holds more generally for skew quantum Sch ubert varieties . These results are well - known for the classical Schubert vari et ie

224 tokens $\hookleftarrow$ $\hookleftarrow$ ** THE TR AP ** $\hookleftarrow$ $\hookleftarrow$ Bever ley Kendall $\hookleftarrow$ $\hookleftarrow$ Copyright © Beverley Kendall 2014 $\hookleftarrow$ $\hookleftarrow$ Published by Season Publishing LLC $\hookleftarrow$ $\hookleftarrow$ This is a work of fiction. Names , characters , places and incidents are products of the author ’s imagination or are used fictit iously and are not to be construed as real . Any resemblance to actual events, locales , organizations , or persons , living or dead , is completely coinc idental . $\hookleftarrow$ $\hookleftarrow$ www. b ever ley k end all.com $\hookleftarrow$ $\hookleftarrow$ Cover Design © Okay Cre ations, Sarah Hansen $\hookleftarrow$ $\hookleftarrow$ All rights reserved . Except as permitted under the U . S . Copyright Act of 1976 , no part of this publication may be reproduced , distributed or transmitted in any form or by any means , or stored in a database or retrieval system , without the prior written permission of the author . $\hookleftarrow$ $\hookleftarrow$ ** License Statement ** $\hookleftarrow$ $\hookleftarrow$ This ebook is licensed for your personal enjoyment only . This ebook may not be re - sold or given away to other people . If you would like to share this book with another person , please purchase an additional copy for each reader . If

228 tokens $\hookleftarrow$ $\hookleftarrow$ ** THE TR AP ** $\hookleftarrow$ $\hookleftarrow$ Bever ley Kend all $\hookleftarrow$ $\hookleftarrow$ Copyright © Beverley Kend all 2014 $\hookleftarrow$ $\hookleftarrow$ Published by Season Publishing LLC $\hookleftarrow$ $\hookleftarrow$ This is a work of fiction . Names , characters , places and incidents are products of the author ’s imagination or are used fict it iously and are not to be construed as real . Any resemblance to actual events, local es , organizations , or persons, living or dead , is completely coincidental. $\hookleftarrow$ $\hookleftarrow$ www . b ever ley kendall. com $\hookleftarrow$ $\hookleftarrow$ Cover Design © Okay Creations, Sarah Hansen $\hookleftarrow$ $\hookleftarrow$ All rights reserved. Except as permitted under the U . S. Copyright Act of 1976 , no part of this publication may be reproduced , distributed or transmitted in any form or by any means , or stored in a database or retrieval system , without the prior written permission of the author . $\hookleftarrow$ $\hookleftarrow$ ** License Statement ** $\hookleftarrow$ $\hookleftarrow$ This ebook is licensed for your personal enjoyment only. This e book may not be re -sold or given away to other people . If you would like to share this book with another person, please purchase an additional copy for each reader. If

477 tokens o? $\hookleftarrow$ True $\hookleftarrow$ Supp ose -3*t = 1 + 8 . Let s(d ) = d ** 3 + 6*d ** 2 + 2 * d + 1. Let u be s ( t). Suppose 10 = 5 * z , 5*a + 0 * z = - z + u. Is 4 a factor of a? $\hookleftarrow$ True $\hookleftarrow$ Supp ose 5 * l = r - 35 , - 2 * r + 5* l - 15 = - 70. Is r a multiple of 4 ? $\hookleftarrow$ True $\hookleftarrow$ Supp ose 2 * l + 11 - 1 = 0 . Does 15 divide (-2)/l - 118 /( - 5 )? $\hookleftarrow$ False $\hookleftarrow$ Supp ose 3 * k - 3*f + 0 * f - 72 = 0, - 25 = - 5 *f. Is 9 a factor of 2 /(-4) + k / 2 ? $\hookleftarrow$ False $\hookleftarrow$ Supp ose 6 * w + 25 = w. Let t ( c ) = c + 9 . Let u be t (w). Suppose - u * z = -3*z - 10 . Is z a multiple of 5 ? $\hookleftarrow$ True $\hookleftarrow$ Let j = 81 + - 139 . Let i = j + 101 . Is 11 a factor of i? $\hookleftarrow$ False $\hookleftarrow$ Let q ( s) = s ** 3 + 4 * s**2 - s + 2 . Let u be q(- 4 ). Let o ( w) = w ** 2 + w - 6. Let t be o ( u ). Suppose -3* l - 39 = - 3*d - 2 * l , 0 = 3*d - 2 * l - t. Does 9 divide d ? $\hookleftarrow$ False $\hookleftarrow$ Suppose - 2 * b + 39 + 13 = 0 . Is b a multiple of 14? $\hookleftarrow$ False $\hookleftarrow$ Let q = -7 + 12 . Suppose 8 * l = q*l + 81 . Suppose 129 = 4*f - l . Is 13 a factor of f ? $\hookleftarrow$ True $\hookleftarrow$ Supp ose 0 = - 4 * n + j + 33, 4 * n - n + 4*j = 20 . Let c = 5 - n. Is 35 * 1 - (-6)/c a multiple of 11 ? $\hookleftarrow$ True $\hookleftarrow$ Let g ( m ) = m**2 - 2 * m - 3 . Let k be g ( 3 ). Let j be

468 tokens o? $\hookleftarrow$ True $\hookleftarrow$ Suppose - 3*t = 1 + 8 . Let s(d) = d ** 3 + 6*d** 2 + 2 * d + 1. Let u be s ( t ). Suppose 10 = 5 * z , 5 *a + 0 * z = - z + u. Is 4 a factor of a ? $\hookleftarrow$ True $\hookleftarrow$ Suppose 5 * l = r - 35, - 2 * r + 5*l - 15 = - 70 . Is r a multiple of 4 ? $\hookleftarrow$ True $\hookleftarrow$ Suppose 2* l + 11 - 1 = 0. Does 15 divide (- 2 )/ l - 118/(- 5 )? $\hookleftarrow$ False $\hookleftarrow$ Suppose 3*k - 3 * f + 0*f - 72 = 0 , - 25 = -5 * f . Is 9 a factor of 2 /(- 4 ) + k /2? $\hookleftarrow$ False $\hookleftarrow$ Suppose 6 * w + 25 = w . Let t ( c) = c + 9 . Let u be t(w ). Suppose - u * z = -3 * z - 10 . Is z a multiple of 5 ? $\hookleftarrow$ True $\hookleftarrow$ Let j = 81 + - 139 . Let i = j + 101 . Is 11 a factor of i ? $\hookleftarrow$ False $\hookleftarrow$ Let q(s) = s ** 3 + 4*s** 2 - s + 2 . Let u be q (- 4 ). Let o(w) = w ** 2 + w - 6. Let t be o ( u). Suppose - 3 * l - 39 = -3* d - 2 * l , 0 = 3 * d - 2 * l - t. Does 9 divide d ? $\hookleftarrow$ False $\hookleftarrow$ Suppose - 2 * b + 39 + 13 = 0 . Is b a multiple of 14? $\hookleftarrow$ False $\hookleftarrow$ Let q = -7 + 12 . Suppose 8 * l = q* l + 81 . Suppose 129 = 4* f - l . Is 13 a factor of f ? $\hookleftarrow$ True $\hookleftarrow$ Suppose 0 = - 4 * n + j + 33, 4 * n - n + 4*j = 20 . Let c = 5 - n. Is 35 * 1 - (-6)/c a multiple of 11 ? $\hookleftarrow$ True $\hookleftarrow$ Let g ( m ) = m**2 - 2 * m - 3 . Let k be g ( 3 ). Let j be

430 tokens $\hookleftarrow$ < at-form state =" vm . form " autocomplete=" off " id =" external _test_form "> $\hookleftarrow$ < at - input - group col="12 " tab =" 20 " state="vm. form . input s " form-id=" external _ test "> $\hookleftarrow$ < at - action - group col="12 " pos =" right "> $\hookleftarrow$ < at - action - button $\hookleftarrow$ variant =" tertiary" $\hookleftarrow$ ng- click =" vm . on Close()" $\hookleftarrow$ > $\hookleftarrow$ :: vm . strings.get(’ CLOSE ’) $\hookleftarrow$ $\hookleftarrow$ $\hookleftarrow$ $\hookleftarrow$

257 tokens $\hookleftarrow$ < at - form state="vm. form " aut ocomplete =" off" id=" external _ test _ form "> $\hookleftarrow$ < at - input - group col="12" tab =" 20 " state ="vm.form . inputs " form - id="external_ test "> $\hookleftarrow$ < at - action -group col=" 12 " pos =" right "> $\hookleftarrow$ < at - action - button $\hookleftarrow$ variant =" ter ti ary " $\hookleftarrow$ ng - click =" vm .onClose()" $\hookleftarrow$ > $\hookleftarrow$ :: vm . strings . get (’CLOSE’) $\hookleftarrow$ $\hookleftarrow$ $\hookleftarrow$

178 tokens Theresa May is expected to appoint an EU ambassador who “ bel ieves in Brexit” in the wake of the current Brussels representative’s decision to quit after being cut adrift by Downing Street . $\hookleftarrow$ $\hookleftarrow$ Sir Ivan Rogers on Tuesday announced his resignation as Britain ’s ambassador in Brussels after it was made clear Mrs May and her senior team had “lost confidence ” in him over his “p essim istic ” view of Brexit. $\hookleftarrow$ $\hookleftarrow$ Government sources made clear that Sir Ivan had “ j umped before he was pushed” and that Number 10 believed his negative view of Brexit meant that he could not lead the negotiations after the Prime Minister triggers Article 50. $\hookleftarrow$ $\hookleftarrow$ In a 1 ,400-word resignation letter to his staff leaked on Tuesday night , Sir Ivan launched a thinly-veiled attack on the " m uddled thinking" in Mrs May ’s Government .

170 tokens Theresa May is expected to appoint an EU ambassador who “ belie ves in Brexit ” in the wake of the current Brussels representative ’s decision to quit after being cut ad rift by Downing Street. $\hookleftarrow$ $\hookleftarrow$ Sir Ivan Rogers on Tuesday announced his resignation as Britain ’ s ambassador in Brussels after it was made clear Mrs May and her senior team had “ lost confidence ” in him over his “ p ess im istic” view of Brexit . $\hookleftarrow$ $\hookleftarrow$ Government sources made clear that Sir Ivan had “ j umped before he was pushed ” and that Number 10 believed his negative view of Brexit meant that he could not lead the negotiations after the Prime Minister triggers Article 50 . $\hookleftarrow$ $\hookleftarrow$ In a 1,400- word resignation letter to his staff leaked on Tuesday night , Sir Ivan launched a thinly-ve iled attack on the " muddled thinking" in Mrs May ’s Government .

268 tokens Carot id end art ere ct omy: operative risks , recurrent sten osis , and long-term stroke rates in a modern series. $\hookleftarrow$ To determine whether car ot id endarterect omy ( CE A ) safely and effectively maintained a durable reduction in stroke complications over an extended period , we reviewed our data on 478 consecutive patients who underwent 5 44 CEA’s since 1976 . Follow - up was complete in 83 % of patients ( mean 44 months). There were 7 early deaths (1.3 %), only 1 stroke related (0.2 %). Per i operative stroke rates (overall 2 . 9 %) varied according to operative indications : as ym pt omatic , 1.4 %; transient is che mic attacks (TIA )/ am au rosis fug ax (AF), 1 . 3 %; non hemispheric symptoms ( NH ), 4 . 9%; and prior stroke ( C VA ), 7.1%. Five and 10 - year stroke-free rates were 96 % and 92 % in the as ym pt omatic group , 93% and 87 % in the T IA /AF group, 92 % and 92 % in the NH group , and 80 % and 73% in the C VA group . Late ipsilateral strokes occurred inf requently ( 8 patients, 1. 7 %). Late deaths were primarily cardiac related ( 51 . 3 %). Stro

250 tokens Carot id end arte rectomy : operative risks, recurrent stenosis , and long - term stroke rates in a modern series . $\hookleftarrow$ To determine whether carotid end arte rectomy ( CE A) safely and effectively maintained a durable reduction in stroke complications over an extended period , we reviewed our data on 478 consecutive patients who underwent 544 CEA’s since 1976 . Follow - up was complete in 83 % of patients ( mean 44 months). There were 7 early deaths (1.3 %), only 1 stroke related (0.2 %). Per i operative stroke rates (overall 2 . 9 %) varied according to operative indications : asymptomatic , 1 . 4%; transient ischemic attacks ( T IA )/ amaurosis fug ax ( AF ), 1 .3%; non hem isp heric symptoms ( NH), 4. 9 %; and prior stroke (CVA), 7 . 1 %. Five and 10-year stroke - free rates were 96% and 92 % in the asymptomatic group , 93% and 87 % in the T IA/AF group , 92 % and 92 % in the NH group , and 80 % and 73% in the CV A group . Late ipsilateral strokes occurred inf requently ( 8 patients, 1. 7 %). Late deaths were primarily cardiac related ( 51 . 3 %). Stro