Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic

Introduction

The original promise of computing was to solve information overload in science. In his 1945 essay "As We May Think", Vannevar Bush observed how "publication has been extended far beyond our present ability to make real use of the record" (Bush, 1945). He proposed computers as a solution to manage the growing mountain of information. Licklider expanded on this with the vision of a symbiotic relationship between humans and machines. Computers would take care of routine tasks such as storage and retrieval, "preparing the way for insights and decisions in scientific thinking" (Licklider, 1960).

Computing has indeed revolutionized how research is conducted, but information overload remains an overwhelming problem (Bornmann and Mutz, 2014). In May 2022, an average of 516 papers per day were submitted to arXiv (arXiv, 2022). Beyond papers, scientific data is also growing much more quickly than our ability to process it (Marx, 2013). As of August 2022, the NCBI GenBank contained $1.49\times 10^{12}$ nucleotide bases (GenBank, 2022). Given the volume of information, it is impossible for a single person to read all the papers in a given field; and it is likewise challenging to organize data on the underlying scientific phenomena.

Search engines are the current interface for accessing scientific knowledge following the Licklider paradigm. But they do not organize knowledge directly, and instead point to secondary layers such as Wikipedia, UniProt and PubChem Compound which organize literature and data. These resources require costly human contributions, for example writing a review of literature, an encyclopedia article or annotating a protein. Given this bottleneck, researchers continue to feel overwhelmed even with powerful search tools to hand.

In this paper, we argue for a better way through large language models. Unlike search engines, language models can potentially store, combine and reason about scientific knowledge. For example, a model trained on the literature could potentially find hidden connections between different research, find hidden gems, and bring these insights to the surface. It could synthesize knowledge by generating secondary content automatically, such as literature reviews, encyclopedia articles, lecture notes and more. And lastly, it could organize different modalities: linking papers with code, protein sequences with compounds, theories with LaTeX, and more. Our ultimate vision is a single neural network for powering scientific tasks. We believe this will be the next interface for how humans access scientific knowledge, and we get started in this paper.

We introduce a new large language model called Galactica (GAL) for automatically organizing science. Galactica is trained on a large and curated corpus of humanity’s scientific knowledge. This includes over 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias and more. Unlike existing language models, which rely on an uncurated crawl-based paradigm, our corpus is high-quality and highly curated. We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens.

Dataset design is critical to our approach, which includes curating a high-quality dataset and engineering an interface to interact with the body of knowledge. All data is processed in a common markdown format to blend knowledge between sources. We also include task-specific datasets in pre-training to facilitate composition of this knowledge into new task contexts. For the interface, we use task-specific tokens to support different types of knowledge. We process citations with a special token that allows a researcher to predict a citation given any input context. We wrap step-by-step reasoning in a special token that mimics an internal working memory. And lastly, we wrap modalities such as SMILES and protein sequences in special tokens, which allows a researcher to interface with them using natural language. With this interface and the body of scientific knowledge in the model, we achieve state-of-the-art results across many scientific tasks.

On reasoning tasks, Galactica beats existing language models on benchmarks such as MMLU and MATH (Hendrycks et al., 2020, 2021). With our reasoning token approach, we outperform Chinchilla on mathematical MMLU with an average score of 41.3% versus 35.7% (Hoffmann et al., 2022). Our 120B model achieves a score of 20.4% versus PaLM 540B’s 8.8% on MATH (Chowdhery et al., 2022; Lewkowycz et al., 2022). The 30B model also beats PaLM 540B on this task with 18 times fewer parameters. We believe this adds another reasoning method to the deep learning toolkit, alongside the existing chain-of-thought approach that has been well explored recently (Wei et al., 2022; Suzgun et al., 2022).

We also find Galactica performs strongly in knowledge-intensive scientific tasks. We conduct detailed knowledge probes of Galactica’s knowledge of equations, chemical reactions and other scientific knowledge. Galactica significantly exceeds the performance of general language models such as the latest GPT-3 in these tasks; on LaTeX equations, it achieves a score of 68.2% versus the latest GPT-3’s 49.0% (Brown et al., 2020). Galactica also performs well in downstream scientific tasks, and we set a new state-of-the-art on several downstream tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%) (Jin et al., 2019; Pal et al., 2022).

We also demonstrate new capabilities with Galactica’s interface. First, the capability of predicting citations improves smoothly with scale, and we also find the model becomes better at modelling the underlying distribution of citations: the empirical distribution function approaches the reference distribution with scale. Importantly, we find this approach outperforms tuned sparse and dense retrieval approaches for citation prediction. This, along with other results, demonstrates the potential for language models to replace the Licklider paradigm of document storage and retrieval with their context-associative power in weight memory.

In addition, Galactica can perform multi-modal tasks involving SMILES chemical formulas and protein sequences. We formulate drug discovery tasks as text prompts and show performance scales in a weakly supervised setup. We also demonstrate Galactica learns tasks such as IUPAC name prediction in a self-supervised way, and does so by attending to interpretable properties such as functional groups. Lastly, Galactica can annotate protein sequences with natural language, including predicting functional keywords.

Galactica was used to help write this paper, including recommending missing citations, topics to discuss in the introduction and related work, recommending further work, and helping write the abstract and conclusion.

Related Work

LLMs have achieved breakthrough performance on NLP tasks in recent years. Models are trained with self-supervision on large, general corpuses and they perform well on hundreds of tasks (Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022; Zhang et al., 2022; Chowdhery et al., 2022). This includes scientific knowledge tasks such as MMLU (Hendrycks et al., 2020). They have the capability to learn in-context through few-shot learning (Brown et al., 2020). The capability set increases with scale, and recent work has highlighted reasoning capabilities at larger scales with a suitable prompting strategy (Wei et al., 2022; Chowdhery et al., 2022; Kojima et al., 2022; Lewkowycz et al., 2022).

One downside of self-supervision has been the move towards uncurated data. Models may mirror misinformation, stereotypes and bias in the corpus (Sheng et al., 2019; Kurita et al., 2019; Dev et al., 2019; Blodgett et al., 2020; Sheng et al., 2021). This is undesirable for scientific tasks which value truth. Uncurated data also means more tokens with limited transfer value for the target use-case, wasting compute budget. For example, the PaLM corpus is 50% social media conversations, which may have limited transfer towards scientific tasks (Chowdhery et al., 2022). The properties of scientific text also differ from general text (e.g. scientific terms and mathematics), meaning a general corpus and tokenizer may be inefficient. We explore whether a normative approach to dataset selection can work with the large model paradigm in this work.

Works such as SciBERT, BioLM and others have shown the benefit of a curated, scientific corpus (Beltagy et al., 2019; Lewis et al., 2020a; Gu et al., 2020; Lo et al., 2019b; Gu et al., 2020; Shin et al., 2020; Hong et al., 2022). The datasets and models were typically small in scale and scope, much less than corpora for general models (one of the larger corpora, S2ORC, has $<20$bn tokens, whereas corpora for GPT-3 and PaLM have $\geq 300$bn tokens; ScholarBERT has a very large corpus at >200bn tokens, but the model is small at 770M capacity). Beyond scientific text, Transformers for protein sequences and SMILES have shown potential for learning natural representations (Rives et al., 2021; Honda et al., 2019; Irwin et al., 2021; Nijkamp et al., 2022; Lin et al., 2022b). However, sequences like SMILES have descriptive limitations for representing chemical structure. We explore in this work whether a large, multi-modal scientific corpus can aid representation learning, where sequences occur alongside footprints and text in a signal-dense context.

The idea of "scaling laws" was put forward by Kaplan et al. (2020), who demonstrated evidence that loss scales as a power-law with model size, dataset size, and the amount of training compute. The focus was on upstream perplexity, and work by Tay et al. (2022a) showed that this does not always correlate with downstream performance. Hoffmann et al. (2022) presented new analysis taking into account the optimal amount of data, and suggested that existing language models were undertrained: "Chinchilla scaling laws". This work did not take into account the distinction between fresh and repeated tokens. In this work, we show that we can improve upstream and downstream performance by training on repeated tokens.
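For readers who want the functional forms behind this paragraph, the two cited analyses are commonly summarized as below. The symbols ($N$ for parameters, $D$ for training tokens, $C_{\min}$ for compute, and the fitted constants) follow the notation of those papers and are not defined in this one, so this is a reminder of the cited results rather than anything fitted here.

```latex
% Kaplan et al. (2020): loss as a power law in each factor when the others are not limiting
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}

% Hoffmann et al. (2022): parametric fit in parameters N and training tokens D
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Neither form distinguishes a token seen for the first time from a repeated one, which is the gap the last two sentences of the paragraph point to.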

Storing information in weights is more unreliable in the sense that models may blend information together (hallucination), but it is more "pliable" in the sense that it can associate information through the representation space (association). Despite hallucination risks, there is evidence large language models can act as implicit knowledge bases with sufficient capacity (Petroni et al., 2019). They perform well on knowledge-intensive tasks such as general knowledge (TriviaQA) and specialist knowledge (MMLU) without an external retrieval mechanism (Brown et al., 2020; Hendrycks et al., 2020).

The question of how to update network knowledge remains an active research question (Scialom et al., 2022; Mitchell et al., 2022). Likewise, the question of how to improve the reliability of generation is an active question (Gao et al., 2022). Despite these limitations, today’s large models will become cheaper with experience (Hirschmann, 1964), and so a growing proportion of scientific knowledge will enter weight memory as training and re-training costs fall. In this work we perform probes to investigate Galactica’s depth of knowledge, and show that the ability to absorb scientific knowledge improves smoothly with scale.

Retrieval-augmented models aim to alleviate the shortcomings of weight memory. Examples of such models include RAG, RETRO and Atlas (Lewis et al., 2020b; Borgeaud et al., 2021; Izacard et al., 2022). These models have the advantage of requiring less capacity but the disadvantage of needing supporting retrieval infrastructure. Since knowledge is often fine-grained, e.g. the sequence of a particular protein, or the characteristics of a particular exoplanet, retrieval will likely be needed in future even for larger models. In this work we focus on how far we can go with model weights alone, but we note the strong case for using retrieval augmentation for future research on this topic.

Dataset

"Nature is written in that great book which ever is before our eyes – I mean the universe – but we cannot understand it if we do not first learn the language and grasp the symbols in which it is written." Galileo Galilei, The Assayer

The idea that Nature can be understood in terms of an underlying language has a long history (Galilei, 1623; Wigner, 1959; Wheeler, 1990). In recent years, deep learning has been used to represent Nature, such as proteins and molecules (Jumper et al., 2021; Ross et al., 2021). Amino acids are an alphabet in which the language of protein structure is written, while atoms and bonds are the language of molecules. At a higher level, we organize knowledge through natural language, and many works have trained on scientific text (Beltagy et al., 2019; Lewis et al., 2020a; Gu et al., 2020; Lo et al., 2019b). With Galactica, we train a single neural network on a large scientific corpus to learn the different languages of science.

Our corpus consists of 106 billion tokens from papers, reference material, encyclopedias and other scientific sources. We combine natural language sources, such as papers and textbooks, and natural sequences, such as protein sequences and chemical formulae. We process LaTeX where we can capture it, and also include academic code to capture computational science. We highlight the corpus details in Table 1 and 2. Full details, including dataset components and filtering logic, are contained in the Appendix.

Notably the dataset is small and curated compared to other LLM corpuses, which are larger and uncurated. This is a key question of this work: can we make a working LLM based on a curated, normative paradigm? If true, we could make more purposefully-designed LLMs by having a clear understanding of what enters the corpus, similar to expert systems which had normative standards (Jackson, 1990).

[START_AMINO]MIRLGAPQTLVLLTLLVAAVLRCQGQDVQEAGSCVQDGQRYNDKDVWKPEPCRICVCDTG...[END_AMINO]

Summary
Protein: Collagen alpha-1(II) chain
Gene: COL2A1
Organism: Homo sapiens (Human)
Status: evidence at protein level

Function
Type II collagen is specific for cartilaginous tissues. It is essential for the normal embryonic development of the skeleton, for linear growth and for the ability of cartilage to resist compressive forces. [START_REF]Nucleotide sequence of the full length cDNA encoding for human type II procollage, Lee[END_REF]…

Features
- Domain, 32-90, Cleavage; by procollagen N-endopeptidase
- Site Cleavage, 181-182, Cleavage; by procollagen N-endopeptidase
- Binding site, 1301, Ca2+
…

Figure 1: Multi-Modal Data. A protein sequence occurs in a document context along with annotations, text and citations from UniProt. Full contents of the document are cut for clarity of exposition.

Tokenization is an important part of dataset design given the different modalities present. For example, protein sequences are written in terms of amino acid residues, where character-based tokenization is appropriate. To achieve the goal of specialized tokenization, we utilize specialized tokens for different modalities:

Citations: we wrap citations with special reference tokens [START_REF] and [END_REF].

Step-by-Step Reasoning: we wrap step-by-step reasoning with a working memory token <work>, mimicking an internal working memory context.

Mathematics: for mathematical content, with or without LaTeX, we split ASCII operations into individual characters. Parentheses are treated like digits. The rest of the operations allow for unsplit repetitions. Operation characters are !"#$%&'*+,-./:;<=>?\^_`| and parentheses are ()[]{}.

Numbers: we split digits into individual tokens. For example 737612.62 -> 7,3,7,6,1,2,.,6,2.

SMILES formula: we wrap sequences with [START_SMILES] and [END_SMILES] and apply character-based tokenization. Similarly we use [START_I_SMILES] and [END_I_SMILES] where isomeric SMILES is denoted. For example, C(C(=O)O)N -> C,(,C,(,=,O,),O,),N.

Amino acid sequences: we wrap sequences with [START_AMINO] and [END_AMINO] and apply character-based tokenization, treating each amino acid character as a single token. For example, MIRLGAPQTL -> M,I,R,L,G,A,P,Q,T,L.

DNA sequences: we also apply a character-based tokenization, treating each nucleotide base as a token, where the start and end tokens are [START_DNA] and [END_DNA]. For example, CGGTACCCTC -> C, G, G, T, A, C, C, C, T, C.
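The character-level rules in the list above can be sketched in a few lines. This is a minimal illustration, not the tokenizer used for the model: the function names and the `MODALITY_TOKENS` table are hypothetical, only the wrapper tokens and the worked examples come from the text, and the mathematics rule is left out because its exact splitting behaviour is not fully specified here.

```python
# Wrapper tokens are taken from the list above; all names here are hypothetical.
MODALITY_TOKENS = {
    "smiles": ("[START_SMILES]", "[END_SMILES]"),
    "isomeric_smiles": ("[START_I_SMILES]", "[END_I_SMILES]"),
    "amino": ("[START_AMINO]", "[END_AMINO]"),
    "dna": ("[START_DNA]", "[END_DNA]"),
}


def split_number(text: str) -> list[str]:
    """Split a numeric string into one token per character (digits and '.')."""
    return list(text)


def tokenize_sequence(sequence: str, modality: str) -> list[str]:
    """Wrap a sequence in its modality tokens and tokenize it per character."""
    start, end = MODALITY_TOKENS[modality]
    return [start, *sequence, end]


number_tokens = split_number("737612.62")
smiles_tokens = tokenize_sequence("C(C(=O)O)N", "smiles")
amino_tokens = tokenize_sequence("MIRLGAPQTL", "amino")
```

In this sketch the start and end markers are single tokens, while everything between them is split per character.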

We cover a few of the specialized token approaches below that do not have clear parallels in the literature, in particular the working memory and citation tokens.

Transformer-based architectures lack an explicit working memory capability, which means a single forward pass has limited efficacy. This is problematic for tasks that require multiple steps of computation. A current workaround is using a Transformer’s output context as an external working memory to read from and write to. This is seen in recent work on chain-of-thought prompting (Wei et al., 2022; Suzgun et al., 2022). In one sense this is intuitive, as humans also augment their limited working memory with scratchpads. In another sense, we would like models to refine their representations internally like humans; e.g. mental arithmetic.

There are two limitations with chain-of-thought. First, it relies on prompt discovery to find a prompt that elicits robust step-by-step reasoning; i.e. minimizes mistakes from doing too much in a single forward pass. Not only does this require finding a robust prompt that works in all cases, but it also often relies on few-shot examples which take up context space. What is worse, much of the step-by-step reasoning on the internet misses intermediate steps that a human has performed using internal memory. Humans do not write down every step they perform because it would lead to long and tedious answers. They write down the principal steps of reasoning, and do lower-level steps via internal working memory. This means there is "missing data" in written text, i.e. between written steps there are internal memory steps that are not explicitly stated.

Secondly, chain-of-thought prompting uses the neural network to perform tasks that it is arguably not best suited to doing; for example, arithmetic. Prior work has shown that accuracy on tasks like multiplication is proportional to term frequency (Razeghi et al., 2022). Given that classical computers are specialized for tasks like arithmetic, one strategy is to offload these tasks from the neural network to external modules. For example, prior work has looked at the possibilities of external tool augmentation, such as calculators (Thoppilan et al., 2022). However, this requires a strategy to identify where the neural network should offload; and it may not be straightforward when combined with a discovered zero-shot prompt, especially where lower-level computation steps are not explicitly stated in writing.

Our solution is a working memory token we call <work>. We construct a few prompt datasets (see Table 3) that wrap step-by-step reasoning within <work> </work>. Some of these datasets were generated programmatically (OneSmallStep), by creating a problem template and sampling the variables; others were sourced online (Workout, Khan Problems); and others used existing datasets and transformed them into a <work>-based context (GSM8k train). Where a computation is performed that a human could not do internally, we offload by writing and executing a Python script. An example is shown in Figure 3. Importantly, we do not have to turn this on, and the model can also predict the output from running a program. For our experiments, we did not find the need to turn Python offloading on, and leave this aspect to future work.

Question: A needle $35\mathrm{~mm}$ long rests on a water surface at $20^{\circ}\mathrm{C}$. What force over and above the needle’s weight is required to lift the needle from contact with the water surface? $\sigma=0.0728\mathrm{m}$.

<work>

$\sigma = 0.0728\mathrm{~N}/\mathrm{m}$
$\sigma = F/L$
$0.0728 = F/(2\times 0.035)$
$F = 0.0728(2\times 0.035)$

calculate.py
```
f = 0.0728*(2*0.035)
with open("output.txt", "w") as file:
    file.write(str(round(f, 5)))
```

«run: "calculate.py"»
«read: "output.txt"»

0.0051

</work>

Answer: $F=0.0051\mathrm{~N}$

Figure 3: Model-Machine Symbiosis. We show an example answer with the working memory token. It performs exact steps for rearranging the equation, and when it reaches a calculation that it cannot solve reliably in a forward-pass, it writes a program, which can then be offloaded to a classical computer.

| Data source | Split | Prompts | Tokens |
|---|---|---|---|
| GSM8k (Cobbe et al., 2021) | train | 7,473 | 3,518,467 |
| OneSmallStep | n/a | 9,314 | 3,392,252 |
| Khan Problems (Hendrycks et al., 2021) | n/a | 3,835 | 1,502,644 |
| Workout | n/a | 921 | 470,921 |
| Total | | 21,543 | 9 million |

Table 3: Reasoning Datasets. To train the model to use <work> we include several datasets in pre-training that incorporate this token. Full details are contained in the Appendix.

Longer term, an architecture change may be needed to support adaptive computation, so machines can have internal working memory on the lines of work such as adaptive computation time and PonderNet (Graves, 2016; Banino et al., 2021). In this paper, we explore the external working memory approach as a bridge to the next step. Notably our prompt datasets are not very large or diverse, so there are likely large further gains to be made with this approach.
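The programmatic route described above (a problem template whose variables are sampled) might look roughly like the following. This is only a sketch under stated assumptions: the template wording, the sampling ranges, the function name, and the `<work>` ... `</work>` spelling of the working memory token are illustrative choices, not the actual OneSmallStep generator.

```python
import random


def make_surface_tension_prompt(rng: random.Random) -> str:
    """Sample variables for one templated problem and wrap the steps in <work>."""
    length_mm = rng.randint(10, 80)             # needle length in mm (hypothetical range)
    sigma = round(rng.uniform(0.02, 0.08), 4)   # surface tension in N/m (hypothetical range)
    length_m = length_mm / 1000
    force = round(sigma * 2 * length_m, 5)      # F = sigma * 2L, as in the needle example
    return (
        f"Question: A needle {length_mm} mm long rests on a liquid surface with "
        f"surface tension {sigma} N/m. What force over and above the needle's "
        f"weight is required to lift it from the surface?\n\n"
        f"<work>\n"
        f"sigma = F / (2 * L)\n"
        f"{sigma} = F / (2 * {length_m})\n"
        f"F = {sigma} * (2 * {length_m})\n"
        f"F = {force}\n"
        f"</work>\n\n"
        f"Answer: F = {force} N"
    )


prompt = make_surface_tension_prompt(random.Random(0))
```

Seeding the generator keeps the sampled dataset reproducible; each call with a fresh seed yields a new problem with the same reasoning skeleton.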

1.2 Citation Token

A distinctive property of academic text is citations. In order to represent the implicit citation graph within the text, we process citations with global identifiers and special tokens [START_REF] and [END_REF] signifying when a citation is made. Figure 4 shows an example of citation processed text from a paper.

Recurrent neural networks, long short-term memory [START_REF]Long Short-Term Memory, Hochreiter[END_REF] and gated recurrent [START_REF]Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung[END_REF] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [START_REF]Sequence to Sequence Learning with Neural Networks, Sutskever[END_REF][START_REF]Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau[END_REF][START_REF]Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, Cho[END_REF].

Figure 4: Citation Processed Text. Example of citation processed text from Attention Is All You Need (Vaswani et al., 2017). For title-processed citations, the title can be associated with the previous context.

We considered two types of citation identifier: (a) paper titles and (b) alphanumeric IDs. Based on ablations, we found that title-based identifiers have greater citation prediction accuracy than IDs. However, we also found that paper titles are more prone to hallucination error at lower scales given the text-based nature of the identifier. We consider title processing for this paper, but we note the trade-offs between both approaches. Experiments for these ablations are contained in the Appendix.
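A minimal sketch of title-based citation processing, following the pattern visible in Figure 4 (title, then first author surname, between the reference tokens). The `{{key}}` inline marker, the bibliography dictionary and all function names are hypothetical conveniences for the example; the actual preprocessing pipeline is not described at this level of detail.

```python
def format_citation(title: str, first_author: str) -> str:
    """Render one title-based citation identifier between the reference tokens."""
    return f"[START_REF]{title}, {first_author}[END_REF]"


def process_citations(text: str, bibliography: dict[str, tuple[str, str]]) -> str:
    """Replace hypothetical inline keys such as {{lstm}} with wrapped titles."""
    for key, (title, first_author) in bibliography.items():
        text = text.replace("{{" + key + "}}", format_citation(title, first_author))
    return text


def citation_prompt(context: str) -> str:
    """End a context with the opening token so a model can complete the citation."""
    return context + " [START_REF]"


bibliography = {"lstm": ("Long Short-Term Memory", "Hochreiter")}
processed = process_citations("long short-term memory {{lstm}} networks", bibliography)
query = citation_prompt("Recurrent neural networks, long short-term memory")
```

Ending a context with the opening token, as in `citation_prompt`, is one way to pose citation prediction: the model is asked to continue with a title and close it with [END_REF].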

2 Prompt Pre-Training

We deviate from existing language model research in one important direction, which is our decision to include prompts in pre-training alongside the general corpora. This is motivated by a number of observations.

First, existing work has shown the importance of training token count on performance. The Chinchilla paper derived scaling "laws" taking into account number of tokens, training a 70bn model for 1.4 trillion tokens (Hoffmann et al., 2022). They obtained state-of-the-art performance on MMLU, beating much larger models such as Gopher (Rae et al., 2021).

Separately, research such as FLAN and T0 showed prompt tuning can boost downstream performance (Wei et al., 2021; Sanh et al., 2021; Chung et al., 2022). Their strategy involved converting tasks to text prompts, using prompt diversity in how the tasks are posed, and then fine-tuning on these prompt datasets. For FLAN and T0, this approach boosts performance, beating larger models such as GPT-3 on many tasks.

Additionally, there is the UnifiedQA approach (Khashabi et al., 2020). In this approach, a T5 model is fine-tuned on question answering datasets, and is shown to boost performance on out-of-domain question answering datasets (Raffel et al., 2020). The model outperforms GPT-3, a model 16 times larger, on MMLU.

The first stream of research above focuses on total training tokens as a way to boost performance; i.e. it is token agnostic. The second stream of research focuses on task-context tokens as a way to boost performance; i.e. it is token selective. Since fine-tuned smaller models beat larger few-shot models on tasks like MMLU, this suggests world knowledge may be present in smaller models, but task-context knowledge may be poor given the relative number of task-context tokens seen in the general corpus.

For this paper, we opt to augment pre-training data with more task prompts to boost performance at lower scales. This is advantageous if it obviates the need for more data scale, e.g. a >1 trillion token corpus, or more model scale. The largest 120B model we train runs on a single NVIDIA A100 node. Additionally, given that fine-tuning requires expertise, making the model work out of the box for popular tasks like question answering and summarization is more useful for users of the model. Lastly, by including prompts alongside general data, we maximize the generality of the model while boosting performance on some tasks of interest.

The closest analog to this approach for large language models is ExT5 (Aribandi et al., 2021). We take a similar approach by taking many machine learning training datasets, converting them to a text format, with prompt diversity, and then including them alongside general corpora in our pre-training set. A summary of prompt types is given in Table 4; the full details of datasets and prompts used are covered in the Appendix.
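As a rough illustration of the recipe in the paragraph above (convert a labelled dataset to text, vary how the task is posed, and place the result alongside general documents), consider the sketch below. The three templates and both function names are hypothetical; the prompts actually used are those listed in the Appendix.

```python
import random

# Hypothetical templates for prompt diversity; not the prompts used in the paper.
QA_TEMPLATES = [
    "Question: {question}\n\nAnswer: {answer}",
    "Q: {question}\nA: {answer}",
    "{question}\n\nThe answer is: {answer}",
]


def to_prompt(example: dict[str, str], rng: random.Random) -> str:
    """Render one labelled example as text using a randomly chosen template."""
    template = rng.choice(QA_TEMPLATES)
    return template.format(**example)


def mix_into_corpus(documents: list[str], prompts: list[str], rng: random.Random) -> list[str]:
    """Shuffle prompt documents in alongside the general corpus documents."""
    mixed = documents + prompts
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)
example = {"question": "What is the chemical symbol for sodium?", "answer": "Na"}
prompt_doc = to_prompt(example, rng)
corpus = mix_into_corpus(["paper text ...", "textbook text ..."], [prompt_doc], rng)
```

The point of the sketch is that prompts become ordinary pre-training documents, so no separate fine-tuning stage is needed to expose the model to the task format.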

Because of prompt inclusion, it is important to distinguish between in-domain performance, where the training dataset is included in pre-training, and out-of-domain performance, where the training dataset is not included in pre-training. We mark these results clearly in the Results section of this paper. Importantly, we do not advocate for prompt pre-training as an alternative to instruction tuning. In fact, instruction tuning on Galactica is likely useful follow-up work given its potential to boost performance on several tasks of interest.

Method

Galactica uses a Transformer architecture in a decoder-only setup (Vaswani et al., 2017), with the following modifications:

GeLU Activation - we use GeLU activations for all model sizes (Hendrycks and Gimpel, 2016).

Context Window - we use a 2048 length context window for all model sizes.

No Biases - following PaLM, we do not use biases in any of the dense kernels or layer norms (Chowdhery et al., 2022).

Learned Positional Embeddings - we use learned positional embeddings for the model. We experimented with ALiBi at smaller scales but did not observe large gains, so we did not use it (Press et al., 2021).

Vocabulary - we construct a vocabulary of 50k tokens using BPE (Sennrich et al., 2015). The vocabulary was generated from a randomly selected 2% subset of the training data.
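The modifications listed above can be collected into a single configuration object. The sketch below is purely illustrative: the class and field names are hypothetical and do not correspond to the training library's configuration schema; only the values are taken from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Architecture choices from the list above; field names are illustrative."""

    activation: str = "gelu"               # GeLU for all model sizes
    context_length: int = 2048             # context window in tokens
    use_bias: bool = False                 # no biases in dense kernels or layer norms
    positional_embedding: str = "learned"  # learned positions (ALiBi was tried, not used)
    vocab_size: int = 50_000               # BPE vocabulary
    vocab_sample_fraction: float = 0.02    # share of training data used to build the vocabulary


config = DecoderConfig()
```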

2 Models

The different model sizes we trained, along with training hyperparameters, are outlined in Table 5.

We train using AdamW with $\beta_{1}=0.9$, $\beta_{2}=0.95$ and weight decay of $0.1$ (Loshchilov and Hutter, 2017). We clip the global norm of the gradient at 1.0, and we use linear decay for learning rate down to 10% of its value. We use dropout and attention dropout of $p=0.1$. We do not use embedding dropout. We found longer warmup was important for the largest model in the early stages of training to protect against the effects of bad initialization, which can have long-memory effects on the optimizer variance state and slow down learning. This may be specific to our model and training setup, and it is not clear whether this advice generalizes.
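The schedule described above can be written down as a small function. This is a sketch under assumptions: the text specifies linear decay to 10% of the peak value and mentions warmup, but the linear shape of the warmup, the step counts and the peak learning rate used in the example are placeholders chosen for illustration (the per-model values are in Table 5).

```python
# Optimizer settings stated in the text; the dictionary keys are illustrative.
OPTIMIZER_SETTINGS = {
    "betas": (0.9, 0.95),
    "weight_decay": 0.1,
    "grad_clip_norm": 1.0,
    "dropout": 0.1,
    "attention_dropout": 0.1,
}


def learning_rate(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr (assumed shape), then linear decay to 10% of peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold the floor value after total_steps
    return peak_lr * (1.0 - 0.9 * progress)


# Placeholder values for illustration only.
schedule = [learning_rate(s, peak_lr=1e-4, warmup_steps=100, total_steps=1000) for s in range(1001)]
```

A longer warmup, as the paragraph recommends for the largest model, corresponds to a larger `warmup_steps` in this sketch.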

3 Libraries and Infrastructure

We use the metaseq library (https://github.com/facebookresearch/metaseq/) for training the models, built by the NextSys team at Meta AI.

For training the largest 120B model, we use 128 NVIDIA A100 80GB nodes. For inference Galactica 120B requires a single A100 node. We choose the maximum model size to obey this constraint for downstream accessibility, and we will work to improve its accessibility for the research community in coming months.
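A back-of-the-envelope check makes the single-node constraint concrete. The numbers below rest on assumptions that the text does not state: 16-bit weights (2 bytes per parameter) and a node of 8 GPUs with 80 GB each. The estimate covers weights only and ignores activations and any inference-time caches.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


params_120b = 120e9
node_memory_gb = 8 * 80  # ASSUMPTION: 8 GPUs per node, 80 GB each

weights_gb = weight_memory_gb(params_120b)  # ASSUMPTION: 2 bytes per parameter
fits_on_one_node = weights_gb < node_memory_gb
```

Under these assumptions the weights occupy 240 GB of a 640 GB node, which is consistent with the choice of 120B as the maximum model size.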

Results

We train the models for 450 billion tokens, or approximately 4.25 epochs. We find that performance continues to improve on the validation set, in-domain and out-of-domain benchmarks with multiple repeats of the corpus.
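The epoch count follows directly from the corpus size of 106 billion tokens given in the Dataset section:

```latex
\frac{450 \times 10^{9}\ \text{training tokens}}{106 \times 10^{9}\ \text{tokens per epoch}} \approx 4.25\ \text{epochs}
```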

First, from Figure 6, validation loss continues to fall with four epochs of training. The largest 120B model only begins to overfit at the start of the fifth epoch. This is unexpected as existing research suggests repeated tokens can be harmful to performance (Hernandez et al., 2022). We also find the 30B and 120B models exhibit an epoch-wise double descent effect of plateauing (or rising) validation loss followed by a decline. This effect becomes stronger with each epoch, and is most visible with the 120B model towards the end of training.

To investigate further, we examine the per-source breakdown of validation loss to see if there is heterogeneity in loss behaviour. We plot example curves in Figure 23 overleaf for the 30B model. We see no signs of loss heterogeneity: loss falls for all sources. The 120B model exhibits the same relative trend of declining validation loss for all sources until the beginning of the fifth epoch, where all sources spike (see Appendix).

The next question to answer is whether this trend extends to downstream performance and out-of-domain generalization. For this we use a 57-task subset of BIG-bench, a general benchmark with principally non-scientific tasks and prompt types not included in pre-training (Srivastava et al., 2022). We plot results in Figure 8. We see no signs of overfitting, suggesting that use of repeated tokens is improving downstream performance as well as upstream performance.

We suspect that two factors could be at play: a quality factor, where the curated nature of the corpus enables more value per token to be extracted, or a modality factor, where the nature of scientific data enables more value per token to be extracted. The missing step of causation is what leads specifically from either factor towards less overfitting, and we leave this question to further work. We note the implication that the "$\text{tokens}\rightarrow\infty$" focus of current LLM projects may be overemphasised versus the importance of filtering the corpus for quality.

In the following sections, we turn to evaluating Galactica’s scientific capabilities. Specifically, we focus on the high-level design goals of building an LLM that can store, combine and reason about scientific knowledge - as these are needed for building a new interface for science.

2 Knowledge Probes

First, we examine how well Galactica absorbs scientific knowledge. We set up several knowledge probe benchmarks, building off the LAMA approach of Petroni et al. (2019). These were critical metrics during model development for identifying knowledge gaps within the corpus, and informing how to iterate the corpus. They also provide insight into the relative knowledge strengths of Galactica versus general language models, and we cover these results in this section before turning to the downstream tasks.

We construct a dataset of popular LaTeX equations from the fields of chemistry, physics, mathematics, statistics and economics. Memorisation of equations is useful to measure as it is necessary for many downstream tasks; for example, recalling an equation to use as part of an answer to a problem. Unless stated explicitly, Galactica results are reported as zero-shot. In total there are 434 equations we test for the knowledge probe.

We prompt with an equation name and generate LaTeX. An example is shown in Figure 9.

Prompt: The formula for Bessel’s differential equation is:

Generated Answer: $x^{2}{\frac{d^{2}y}{dx^{2}}}+x{\frac{dy}{dx}}+\left(x^{2}-\alpha^{2}\right)y=0$

Figure 9: LaTeX Equations Probe. We prompt for the name of an equation and evaluate whether the generated LaTeX is correct.

We manually evaluate given the possibility of multiple correct answers. We summarize the results in Table 6. Equation knowledge increases smoothly with scale. Galactica outperforms larger language models trained on general corpuses, indicating the value of a curated dataset.

2.2 Domain Probes

We also set up domain probes to track specialized knowledge for certain fields. We detail these below:

AminoProbe: a dataset of names, structures and properties of the 20 common amino acids.

BioLAMA: a dataset of biomedical factual knowledge triples.

Chemical Reactions: a dataset of chemical reactions.

Galaxy Clusters: a dataset of galaxy clusters with their constellation classifications.

Mineral Groups: a dataset of minerals and their mineral group classifications.

In each case, we construct a prompt to test the knowledge. For example, for Chemical Reactions, we ask Galactica to predict the products of the reaction in the chemical equation LaTeX. We mask out products in the description so the model is inferring based on the reactants only. An example is shown in Figure 10.

Prompt: Sulfuric acid reacts with sodium chloride, and gives _____ and _____: \[ \ce{ NaCl + H2SO4 ->

Generated Answer: $\ce{NaCl+H2SO4->NaHSO4+HCl}$

Figure 10: Chemical Reactions. We prompt based on a description and reactants, and evaluate whether the generated products are correct.

We report results for these knowledge probes in Table 7.

We also observe steady scaling behaviour in these knowledge probes, with the exception of BioLAMA which we suspect reflects zero-shot prompt difficulty for all LLMs. Notably fine-grained factual knowledge, such as "ConstellationOf(GalaxyCluster)" type-queries seems to scale smoothly with the size of the model.

2.3 Reasoning

We now turn to reasoning capabilities with the token. We start by evaluating on the MMLU mathematics benchmarks, which we report in Table 8 (Hendrycks et al., 2020). Galactica performs strongly compared to larger base models, and use of the token appears to boost performance over Chinchilla, even for the smaller 30B Galactica model.

We also evaluate on the MATH dataset to further probe the reasoning capabilities of Galactica (Hendrycks et al., 2021). We compare the token prompt directly with the Minerva 5-shot chain-of-thought prompt mCoT for comparability. We report results in Table 9.

We see that Galactica outperforms the base PaLM model by a significant margin, with both chain-of-thought and prompts. Galactica 30B outperforms PaLM 540B on both prompts: an 18 times smaller model. This suggests Galactica may be a better base model for fine-tuning towards mathematical tasks.

For completeness, we report results for Minerva, a 540B PaLM model fine-tuned towards LaTeX specifically. Minerva outperforms base Galactica, but the performance differences are non-uniform, which points towards different mathematical data biases. For a direct comparison to Minerva, the model is freely available for those who want to fine-tune Galactica towards LaTeX specifically as follow-up work.

3 Downstream Scientific NLP

We now evaluate on downstream scientific tasks to see how well Galactica can compose its knowledge in different task contexts. We focus on knowledge-intensive scientific tasks and report full results in Table 10. For this we use the MMLU benchmark as well as some other popular scientific QA benchmarks. We include the MMLU results earlier without to test for knowledge association specifically. Full MMLU results, including social sciences and other fields, are reported in the Appendix. We also perform data leakage analysis on these benchmarks for more confidence; results are in the Appendix.

From Table 10, Galactica can compose its knowledge into the question-answering task, and performance is strong; significantly outperforming the other open language models, and outperforming a larger model (Gopher 280B) in the majority of tasks. Performance against Chinchilla is more variable, and Chinchilla appears to be stronger in a subset of tasks: in particular, high-school subjects and less-mathematical, more memorization intensive tasks. In contrast, Galactica tends to perform better in mathematical and graduate-level tasks.

Our working hypothesis is that the Galactica corpus is biased towards graduate scientific knowledge, given it consists mostly of papers, which explains lagging performance in high-school subjects. While we do pick up some high-school level content through encyclopedias, textbooks and the filtered CommonCrawl, this amounts to a small quantity of tokens (a few billion). We leave the question of how to capture more of this base scientific knowledge in a curated way to future work.

On the remaining tasks, we achieve state-of-the-art results over fine-tuned models at the time of writing. On PubMedQA, we achieve a score of 77.6%, which outperforms the state-of-the-art of 72.2% (Yasunaga et al., 2022). On MedMCQA dev we achieve a score of 52.9% versus the state-of-the-art of 41.0% (Gu et al., 2020). For BioASQ and MedQA-USMLE, performance is close to the state-of-the-art performance of fine-tuned models (94.8% and 44.6%) (Yasunaga et al., 2022).

4 Citation Prediction

In this section we evaluate Galactica’s capability to predict citations given an input context, which is an important test of Galactica’s capability to organize the scientific literature. We find that both accuracy and the quality of distributional approximation improves with scale.

We construct three datasets to evaluate the model’s capability to cite:

PWC Citations: a dataset with 644 pairs of machine learning concepts and papers that introduced them. Concepts consist of methods (e.g. ResNet) and datasets (e.g. ImageNet) from Papers with Code (https://paperswithcode.com).

Extended Citations: a dataset with 110 pairs of non-machine learning concepts and papers that introduced them. Examples of concepts include Kozak sequence and Breit-Wigner distribution.

Contextual Citations: a dataset with 1,869 pairs of references and contexts from our arXiv validation set. The dataset is constructed by sampling 1,000 random references and collecting their contexts.

For the PWC Citations and Extended Citations datasets, the citation prediction task is framed as a text generation task. The model is given a prompt like "In this paper we use ResNet method [START_REF]" in order to generate a prediction for the ResNet concept. For Contextual Citations, we prompt after the input context for the citation, where the context ends with [START_REF].

We compare Galactica to sparse and dense retrieval-based approaches on this task.

For the sparse baseline, we use ElasticSearch to create an index of all the references, including their titles, abstracts, and short snippets of text with the contexts they appear in. Then, given a text query, we retrieve the top references ordered by the sum of matching scores across all selected fields.

For dense retriever baselines, we evaluate two different Contriever models (Izacard et al., 2021). The first is the pre-trained model released by Izacard et al. (2021). The second model we use is fine-tuned on a random subset of 10 million context/paper pairs from our corpus, trained to retrieve the right paper given a context before a citation. The setup for dense retrieval is: (1) each reference is encoded by the model using its title and abstract, (2) a text query is encoded by the same model, (3) the references that match the query are returned. Retrieval is performed using a FAISS index (Johnson et al., 2019).
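The three-step dense retrieval setup can be illustrated with a minimal sketch. This is not the pipeline used in the paper: `toy_embed` is a hypothetical bag-of-words stand-in for the Contriever encoder, a brute-force inner-product ranking stands in for the FAISS index, and the two references are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a dense encoder: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def similarity(vec_a, vec_b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())


def retrieve(query, references, k=1):
    """Return the ids of the k references that best match the query."""
    # (1) Encode each reference from its title and abstract.
    index = {ref_id: toy_embed(f"{title} {abstract}")
             for ref_id, (title, abstract) in references.items()}
    # (2) Encode the query with the same encoder.
    query_vec = toy_embed(query)
    # (3) Return the references ranked highest against the query.
    ranked = sorted(index, key=lambda ref_id: similarity(query_vec, index[ref_id]),
                    reverse=True)
    return ranked[:k]


# Toy references (made up for illustration): id -> (title, abstract).
references = {
    "ref_a": ("Residual networks", "residual learning for deep image recognition networks"),
    "ref_b": ("Image database", "a large hierarchical image database"),
}
top = retrieve("residual learning for image recognition", references)
```

In a realistic setting the encoder would be a trained model and the exhaustive ranking would be replaced by an approximate nearest-neighbour index; the structure of the three steps stays the same.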

The performance on all evaluation sets increases smoothly with scale. At larger scales, Galactica outperforms the retrieval-based approaches as its context-associative power improves. This is an important result as current approaches for navigating the literature use these existing retrieval approaches. As the power of language models improves, we suspect they will become a valuable new tool for exploring the literature.

4.2 Citation Distributional Analysis

We now turn to look at how well Galactica can model the empirical citation distribution. For this analysis we use the Contextual Citations dataset, where prompts are extracted from a paper by taking the context before a citation as the prompt. An example prompt with a model prediction is shown overleaf in Figure 12.

We use the in-context citation data to analyse the distributional difference between predicted and ground truth paper counts. This allows us to assess the model bias towards predicting more popular papers. Specifically, for each context there is a ground truth and predicted reference. We count the number of times each reference appears in our corpus. We then compare the distribution of reference counts between the ground truth references and the predicted references using the Kolmogorov-Smirnov distance (Massey, 1951).
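As a rough illustration of the comparison described above, the two-sample Kolmogorov-Smirnov distance is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below assumes the corpus counts for ground truth and predicted references are already available as lists; the function name and the toy counts are illustrative only.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed values, so checking those suffices.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


# Toy corpus counts (illustrative only) for ground truth and predicted references.
ground_truth_counts = [1, 2, 3, 4]
predicted_counts = [3, 4, 5, 6]
distance = ks_distance(ground_truth_counts, predicted_counts)
```

A distance of 0 means the two count distributions coincide, while a distance of 1 means they do not overlap at all. A model biased towards popular papers would shift the predicted counts upwards and increase the distance.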

The comparison between the citation count distributions for different model sizes can be seen in Figure 11. Figure 11(a) shows the decrease in the Kolmogorov-Smirnov distance between the distribution of ground truth paper citations and the distribution of predicted paper citations. Figure 11(b) shows how the distribution of paper counts for the predicted papers gets closer to the ground truth as the model size grows. At smaller scales the model is more prone to predicting more popular papers. As the model grows in size this bias towards predicting popular papers diminishes.

5 General Capabilities

We have studied Galactica’s scientific capabilities. It is perhaps not surprising that a specialist scientific model outperforms general models on scientific tasks, but it would be more surprising if it outperformed general models on general NLP tasks. In this section, we show surprising evidence that it does just that.

We evaluate on 57 BIG-bench tasks in Table 12 (Srivastava et al., 2022). The tasks are primarily non-scientific and test general language capability, for example anachronisms, figure of speech and metaphor boolean. We always evaluate with 5-shots, and we use the default prompt style from BIG-Bench. Importantly, we do not include this prompt style in pre-training; so the evaluation between Galactica and the other models is comparable 5-shot. Full details and results are in the Appendix. We summarize average scores in Table 12:

Both the 30B and 120B Galactica models outperform the larger OPT and BLOOM general models. This is a surprising result given we designed Galactica to trade off generality for performance in scientific tasks.

We suspect this result reflects the higher quality of the Galactica corpus, stemming from the fact it is curated and also primarily academic text. Previous open LLM efforts likely overfocused on scale goals and underfocused on data filtering. Another implication is that the focus on $\text{tokens}\rightarrow\infty$ from Chinchilla needs to be complemented with strong data quality procedures (Hoffmann et al., 2022). With this paper, we took an opposite approach by focusing on high-quality tokens and repeated epochs of training. However, the Chinchilla insight stands, and there is much more scientific text that we have not exploited in this work.

6 Chemical Understanding

We now turn to Galactica’s capability to interface with different scientific modalities. We start by looking at Galactica’s chemical capabilities. Chemical properties exhibit complex correlations which means the chemical space is very large. Better organization of chemical information through language models could aid chemical design and discovery. We explore how Galactica can provide a new interface for these tasks in this section.

For this work, we only include a small subset of available compounds from PubChem Compound in pre-training. Specifically, we take a random subset (2 million) of total compounds (110 million). This is to ensure the model is not overly biased towards learning natural sequences over natural language. This is a constraint we can relax in future work, enabling a much larger corpus. Here we focus on the first step of investigating whether a single model can learn effectively in the multi-modal setting.

We find that a language model can learn chemical tasks such as IUPAC naming in a self-supervised way, and in addition, we can pose drug discovery tasks as natural language prompts and achieve reasonable results.

SMILES is a line notation which represents chemical structure as a sequence of characters (Weininger, 1988). In the Galactica corpus, the SMILES formula occurs alongside information in the document, such as IUPAC names, molecular weight and XLogP. In the context of self-supervised learning, this means a language model is performing implicit multi-task learning: the model is predicting the next SMILES token, but can also use SMILES to predict other entities in the document.

As an initial test, we set up an IUPAC Name Prediction task, where the task is to name a compound according to the IUPAC nomenclature given a SMILES formula input. The IUPAC nomenclature is a method of naming organic compounds that has a ruleset based on naming the longest chain of carbons connected by single bonds (Favre and Powell, ). There is a large set of rules and the procedure is algorithmically complex, meaning it is hard to automate. As a result, it is missing from standard cheminformatics toolkits.

Previous works such as STOUT and Struct2IUPAC have explored the possibility of using RNNs and Transformers for this task (Rajan et al., 2021; Krasnov et al., 2021). We explore in this section whether Galactica can translate a SMILES specification to its IUPAC name in the self-supervised setting. We design a prompt based on the PubChem structure, with the SMILES as the only input, and the output to predict the IUPAC name.

To evaluate, we use our compound validation set of 17,052 compounds, and prompt with the SMILES formula and predict the IUPAC name. To calculate accuracy, we use OPSIN to convert the generated IUPAC name to SMILES, canonicalize it and compare with the canonicalized SMILES target (Lowe et al., 2011).
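The accuracy calculation above can be sketched as a round trip from name back to structure. In this minimal sketch, `name_to_smiles` and `canonicalize` are hypothetical callables standing in for OPSIN and a SMILES canonicaliser; the toy lookup tables below are illustrative only and are not real tool output.

```python
def iupac_accuracy(predicted_names, target_smiles, name_to_smiles, canonicalize):
    """Fraction of predicted names whose structure matches the target structure."""
    correct = 0
    for name, target in zip(predicted_names, target_smiles):
        smiles = name_to_smiles(name)  # stand-in for the OPSIN conversion
        # A name that cannot be parsed counts as an incorrect prediction.
        if smiles is not None and canonicalize(smiles) == canonicalize(target):
            correct += 1
    return correct / len(target_smiles)


# Toy stand-ins (assumptions for illustration, not real OPSIN behaviour).
toy_parser = {"ethanol": "OCC"}.get                   # returns None for unknown names
toy_canonical = {"OCC": "CCO", "CCO": "CCO", "C": "C"}.get

accuracy = iupac_accuracy(["ethanol", "not a name"], ["CCO", "C"],
                          toy_parser, toy_canonical)
```

Comparing canonical forms rather than raw strings matters because the same molecule can be written as many different SMILES strings.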

Accuracy increases smoothly with scale. Given we restricted the corpus to 2 million molecules, it is likely much better performance is achievable through training or fine-tuning on more molecules. The model is freely available for those who want to perform this follow-up work.

The more immediate question is what is actually being learnt: is Galactica inferring names from the fundamental molecular structure? To answer this, we visualize the average atomic attention at each stage of a prediction in Figure 13 overleaf. Encouragingly, the results are interpretable in terms of the underlying chemistry, and Galactica attends to the correct group when predicting a name, e.g. for "amino" it attends primarily to the $-\ce{NH_{2}}$ substituent.

Task: Convert the SMILES to IUPAC Name Example: CC(C)(C)C(=O)N(CC1=NC(=CS1)C(=O)OC)C2CCCCC2

6.2 MoleculeNet

We now explore whether we can pose traditional drug discovery tasks in a natural language format, combining the different modalities involved. Humans organize knowledge via natural language, and so learning an interface between natural language and scientific modalities like SMILES could be a new tool for navigating the chemical space. We use MoleculeNet classification benchmarks to answer this question, which are summarized in Table 14 (Wu et al., 2017).

To evaluate, we include the training sets in pre-training by converting to a text format. We use prompt randomization (varying how the question is posed). For example, for BBBP the training prompt has forms like in Figure 14 below. These examples occur alongside the other corpuses in training, and each example is seen just over 4 times. This is not comparable to direct fine-tuning or supervision due to the presence of other data in pre-training, so it might be considered a form of weak supervision instead.

Here is a SMILES formula:

[START_I_SMILES]O=C(O)CCCC1=CC=C(N(CCCl)CCCl)C=C1[END_I_SMILES]

Question: Will the chemical compound penetrate the blood-brain barrier?

Answer: No

Figure 14: BBBP Prompt. We include the SMILES and pose the classification problem in natural language.

For some MoleculeNet datasets, other modalities are implicitly present. For example, in the Tox21 dataset, bioassays concern particular receptors such as the androgen receptor (AR). As an experiment, we decided to frame the task in a text format with the protein sequence and the SMILES as part of the prompt. We show an example for Tox21 in Figure 15.

Here is a sequence for a protein:

[START_AMINO]MEEPQSDPSVEPPLSQETFSDLWKLLPE...[END_AMINO]

And here is an isomeric SMILES for a compound:

[START_I_SMILES]CC(O)(P(=O)(O)O)P(=O)(O)O[END_I_SMILES]

Question: Will the the chemical compound be active against this protein?

Answer: No

Figure 15: Tox21 Prompt. We include the protein sequence and the SMILES formula and pose the classification problem in natural language.

We make sure to Kekulize the SMILES to be consistent with PubChem representations. For evaluation, we use the recommended splits from the DeepChem library (Ramsundar et al., 2019).
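The BBBP prompt format of Figure 14, combined with prompt randomization, might be generated along the following lines. This is a sketch under assumptions: only the first question wording comes from Figure 14, the other two variants are hypothetical, and the line layout of the prompt is a guess.

```python
import random

# Only the first question appears in Figure 14; the others are hypothetical variants.
QUESTIONS = [
    "Will the chemical compound penetrate the blood-brain barrier?",
    "Does this compound cross the blood-brain barrier?",
    "Is the compound able to pass the blood-brain barrier?",
]


def bbbp_prompt(smiles, penetrates, rng):
    """Format one BBBP training example with a randomly chosen question."""
    question = rng.choice(QUESTIONS)
    answer = "Yes" if penetrates else "No"
    return "\n".join([
        "Here is a SMILES formula:",
        f"[START_I_SMILES]{smiles}[END_I_SMILES]",
        f"Question: {question}",
        f"Answer: {answer}",
    ])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prompt = bbbp_prompt("O=C(O)CCCC1=CC=C(N(CCCl)CCCl)C=C1", False, rng)
```

Varying the question wording is intended to stop the model from tying the task to a single surface form.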

We present results in Table 15. Performance scales with model size. The scaling is slower than tasks like QA, and the base model lags a specialist model with explicit 3D information and 10 times more molecules (Zhou et al., 2022). We suspect the weak supervision setup is harder for this task, and fine-tuning and/or more molecule data is required to get sufficient task signal. The model is available for work on this.

For our purposes, the implication for future work is that we can learn drug discovery tasks via natural language prompts. If we can learn these relationships automatically in a signal-dense document context (e.g. online chemical databases), this might reduce the reliance on supervised datasets to perform these tasks.

As a final check, we can average Galactica’s attention heads across layers, and visualize whereabouts the model looks in the SMILES sequence to make a prediction (atomic attention). We show an example in Figure 16 for some Tox21 predictions.

7 Biological Understanding

In this section we examine Galactica’s capability to interface with biological modalities. Language models could potentially play a role in automatic organisation of this data, for example annotating newly sequenced proteins with functional information. We explore the potential of this interface in this section.

For protein sequences from UniProt, we include a small subset of available sequences in pre-training. Specifically, we take reviewed Swiss-Prot proteins; a high-quality subset (0.5 million) of total (227 million). This is to ensure the model is not overly biased towards learning natural sequences over natural language. As with molecule data, this is a constraint we can relax in future work, enabling a much larger corpus. Here we focus on the first step of investigating whether a single model can learn effectively in the multi-modal setting.

We find that a language model can learn an implicit measure of sequence similarity that it can use for tasks such as functional annotation and descriptions.

While Galactica does not explicitly model the 3D structure of a protein, the information needed for a specific conformation is contained in the linear amino acid sequence, which in turn determines function. As a first step, we test upstream performance through evaluating protein sequence perplexity. Constructing a good validation set is important and data leakage is a problem for works in this field. We construct four holdout sets to obtain more confidence about what is being learnt and what generalizes.

First, we conduct BLAST on the sequences in the training set and remove all sequences with a sequence identity $\geq 50\%$ with 51 CASP14 target sequences. These are the same test sequences used in ESMFold (Lin et al., 2022b). In total we remove 167 sequences from the training set using this approach. We call this holdout set CASPSimilarSeq. We call the 51 CASP14 target sequences CASPSeq.

Secondly, we conduct organism-level holdout, and remove all sequences from the Paenungulata clade of organisms, including elephants, elephant shrews, manatees and aardvarks. This allows us to test whether Galactica can annotate sequences for organisms it has never seen before. In total we remove 109 sequences from the training set using this approach. We call this holdout set PaenSeq. Note that this does not enforce any sequence similarity constraints, and there may be very similar sequences in the training set.

Lastly, we conduct a randomized test split, consisting of 5456 sequences. There is no sequence identity constraint applied, so memorization may be more at play, but it still provides a signal about the breadth of sequence knowledge absorbed by the model. We call this holdout set UniProtSeq.
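The sequence-identity holdout for the CASP targets can be sketched as a simple filter over alignment hits. The sketch assumes BLAST has already been run and its output reduced to `(train_id, target_id, percent_identity)` tuples; the function name, tuple layout and identifiers are hypothetical.

```python
def split_by_identity(train_ids, hits, threshold=50.0):
    """Split training ids into (kept, removed) using BLAST-style identity hits.

    hits: iterable of (train_id, target_id, percent_identity) tuples.
    """
    # Any training sequence at or above the threshold against a target is held out.
    too_similar = {train_id for train_id, _, identity in hits if identity >= threshold}
    kept = [seq_id for seq_id in train_ids if seq_id not in too_similar]
    removed = [seq_id for seq_id in train_ids if seq_id in too_similar]
    return kept, removed


# Toy identifiers and identities, for illustration only.
toy_hits = [("P1", "T1", 72.5), ("P2", "T1", 31.0), ("P3", "T2", 50.0)]
kept, removed = split_by_identity(["P1", "P2", "P3"], toy_hits)
```

The removed sequences play the role of a similarity-constrained holdout, while the kept sequences remain in training.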

We evaluate perplexity for all holdout sets in Table 16 and plot in Figure 17. For three of the validation sets we observe smooth scaling, reflecting the potential for high sequence similarity with sequences in the training set; for example, orthologs in the case of the Paen validation set. Interestingly, the CASP set with sequence similarity constraints levels off, suggesting the gains from the 550k proteins in training quickly saturate.

To investigate further, we examine validation perplexity on the CASPSeq set during training of the 120B model, and we plot results in Figure 18 below.
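For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities have already been obtained from a model; the function name and the toy inputs are illustrative.

```python
import math


def perplexity(sequences_logprobs):
    """Perplexity over a set of sequences from per-token natural-log probabilities."""
    total_logprob = sum(sum(logprobs) for logprobs in sequences_logprobs)
    n_tokens = sum(len(logprobs) for logprobs in sequences_logprobs)
    return math.exp(-total_logprob / n_tokens)


# Toy input: a model that is uniform over 20 amino acids, one token per residue.
uniform = [[math.log(1 / 20)] * 5, [math.log(1 / 20)] * 3]
ppl = perplexity(uniform)
```

Under the assumption of one token per residue, a model with no knowledge of protein sequences would score a perplexity of about 20, so lower values indicate that some sequence structure has been learnt.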

We observe falling validation perplexity up until the start of the fourth epoch, at which point the model overfits for this particular dataset. This may suggest Galactica is getting worse at more "out-of-domain" proteins that differ significantly from the test set. For future work, less repetition is probably desirable; and more generally, increasing the diversity of proteins in the training dataset is likely to be beneficial.

7.2 Functional Keyword Prediction

We now look at specific translation capabilities from protein sequence toward natural language, which may be useful for tasks such as protein annotation. As a first test, we look at UniProt keywords that Galactica can infer from the sequence. An example of these is shown in Figure 20 overleaf.

Sequence

Here is the sequence: [START_AMINO]MQKSPLERASVISKLFFSWPGPILRKGYRQHLKLSDIYQIPSVDSADNLSEKLERE...[END_AMINO]

### Ground-Truth Keywords

ATP-binding, Cell membrane, Chloride, Chloride channel, Endoplasmic reticulum, Endosome, Glycoprotein, Ion channel, Ion transport, Isomerase, Isopeptide bond, Lipoprotein, Membrane, Nucleotide-binding, Nucleus, Palmitate, Phosphoprotein, Reference proteome, Repeat, Transmembrane, Transmembrane helix, Transport, Ubl conjugation

### Galactica 30B Predicted Keywords

ATP-binding, Cell membrane, Chloride, Chloride channel, Endoplasmic reticulum, Endosome, Glycoprotein, Ion channel, Ion transport, Isomerase, Isopeptide bond, Lipoprotein, Membrane, Nucleotide-binding, Nucleus, Palmitate, Phosphoprotein, Reference proteome, Repeat, Transmembrane, Transmembrane helix, Transport, Ubl conjugation

Figure 20: Protein Keyword Prediction. Example shown is Q108U0 from the PaenSeq holdout, a cystic fibrosis transmembrane conductance regulator from the African elephant. The closest protein by sequence similarity in the training set is the Q2QLA3 protein, a cystic fibrosis transmembrane conductance regulator from a horse, with 91.8% sequence similarity.

We report results in Table 17. $F_{1}$ score increases across the holdout sets with scale, suggesting that Galactica can learn keywords by inferring from the sequence. However, we see saturation for CASPSimilarSeq, suggesting this capability depends on how similar the sequences are to those in the training set. This is reflected in the example in Figure 20, where Galactica uses its knowledge of similar proteins from different organisms, with a maximum sequence similarity of 91.8% in the training set, to help annotate.

We attempted to visualize attention in the protein sequence, but we did not observe anything with biological interpretation (e.g. attention to domains). Our working hypothesis is that Galactica has learnt an implicit measure of sequence similarity that it uses to associate predicted keywords, but that this is not directly interpretable from where it attends to. This differs from our chemistry analysis where results were interpretable in terms of attention to the underlying atomic structure.
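One plausible way to score a keyword prediction is a set-based $F_{1}$ between predicted and ground-truth keywords for each protein. The text does not state how scores are aggregated across proteins, so the per-protein formulation below is an assumption, and the function name is illustrative.

```python
def keyword_f1(predicted, truth):
    """Set-based F1 between predicted and ground-truth keywords for one protein."""
    predicted, truth = set(predicted), set(truth)
    overlap = len(predicted & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)  # share of predicted keywords that are correct
    recall = overlap / len(truth)         # share of true keywords that were recovered
    return 2 * precision * recall / (precision + recall)


# Two of three predictions are correct and two of three true keywords are recovered.
score = keyword_f1(["ATP-binding", "Membrane", "Nucleus"],
                   ["ATP-binding", "Membrane", "Transport"])
```

A prediction that reproduces the ground-truth keyword set exactly, as in the Figure 20 example, scores 1.0 under this formulation.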

7.3 Protein Function Description

As the next test, we look at generating free-form descriptions of protein function from the sequence. We look at the UniProt function descriptions and compare to Galactica generated descriptions.

We report results in Table 18. ROUGE-L score increases smoothly across all the holdout sets. We show an example overleaf in Figure 21 from PaenSeq. The protein is a Cytochrome b protein from a rock hyrax (Q7Y8J5). The closest sequence by similarity in the training set is a Cytochrome b protein from a pygmy hippopotamus (O03363) with 83% sequence similarity. In this case we get a perfect prediction of the description.

This is the sequence: [START_AMINO]MTNIRKNHPLLKTINDAFIDLPTPSNISTWWNFGSLLGACLIIQVLTGLFLAMHYTSDT...[END_AMINO]

### Ground-Truth Description

Component of the ubiquinol-cytochrome c reductase complex (complex III or cytochrome b-c1 complex) that is part of the mitochondrial respiratory chain. The b-c1 complex mediates electron transfer from ubiquinol to cytochrome c. Contributes to the generation of a proton gradient across the mitochondrial membrane that is then used for ATP synthesis.

### Galactica 120B Predicted Description

Component of the ubiquinol-cytochrome c reductase complex (complex III or cytochrome b-c1 complex) that is part of the mitochondrial respiratory chain. The b-c1 complex mediates electron transfer from ubiquinol to cytochrome c. Contributes to the generation of a proton gradient across the mitochondrial membrane that is then used for ATP synthesis.

Figure 21: Protein Description Prediction. Example shown is Q7Y8J5 from the PaenSeq holdout, a Cytochrome b protein from a rock hyrax. The closest protein by sequence similarity in the training set is the O03363 protein, a Cytochrome b protein from a pygmy hippopotamus, with 83% sequence similarity.

As with the keyword prediction task, Galactica appears to be learning based on matching sequences with similar ones it has seen in training, and using this to form a description. This suggests language models for protein sequences could serve as useful alternatives to existing search methods such as BLAST and MMseqs2 (Altschul et al., 1990; Steinegger and Söding, 2017).
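ROUGE-L is based on the longest common subsequence (LCS) between a generated text and a reference. The sketch below is a minimal version that assumes whitespace tokenization and an equally weighted F-measure; the exact ROUGE-L configuration used for Table 18 is not stated, so both choices are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure with whitespace tokenization (assumed configuration)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l("the cat sat", "the dog sat")
```

A generated description that is identical to the reference, as in Figure 21, gives a score of 1.0.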

Toxicity and Bias

In this section we study the toxicity and bias of the Galactica model. We evaluate on benchmarks related to stereotypes, toxicity, and misinformation. We compare results to other language models. We find Galactica is significantly less biased and toxic than existing language models.

For the following evaluations, we investigate Galactica’s ability to detect (and generate) harmful stereotypes and hate speech, using four widely used benchmarks.

CrowS-Pairs is a collection of 1,508 crowd-sourced pairs of sentences, one which is "more" stereotyping and one which is "less" stereotyping, and covers nine characteristics (Nangia et al., 2020). These characteristics are race, religion, socioeconomic status, age, disability, nationality, sexual orientation, physical appearance, and gender. A language model’s preference for stereotypical content is measured by computing the proportion of examples in which the "more" stereotypical sentence is preferred (as determined by log likelihood). Higher scores indicate a more harmfully biased model, whereas an ideal model with no bias would score 50%.
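The scoring rule described above reduces to a simple proportion once sentence log likelihoods are available. In this sketch the log likelihoods are assumed to have been computed by a language model beforehand; the function name and the toy values are illustrative.

```python
def crows_pairs_score(pairs):
    """Percentage of pairs where the more stereotypical sentence is preferred.

    pairs: iterable of (loglik_more_stereotypical, loglik_less_stereotypical).
    """
    pairs = list(pairs)
    preferred = sum(1 for more, less in pairs if more > less)
    return 100.0 * preferred / len(pairs)


# Toy log likelihoods (illustrative only): three of four pairs prefer the stereotype.
toy_pairs = [(-10.0, -12.0), (-8.0, -7.0), (-5.0, -6.0), (-3.0, -3.5)]
score = crows_pairs_score(toy_pairs)
```

A score of 50 corresponds to no systematic preference in either direction.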

We report results for Galactica and other language models in Table 19. Galactica exhibits significantly lower stereotypical biases in most categories, with the exception of sexual orientation and age, when compared to the latest GPT-3 (text-davinci-002) and OPT 175B. Galactica attains a better overall score of 60.5% compared to the other models. Language models such as OPT use the Pushshift.io Reddit corpus as a primary data source, which likely leads the model to learn more discriminatory associations (Zhang et al., 2022). Galactica is trained on a scientific corpus where the incidence rate for stereotypes and discriminatory text is likely to be lower.

1.2 StereoSet

StereoSet aims to measure stereotypical biases across profession, religion, gender, and race (Nadeem et al., 2021). The benchmark contains two tasks: an intrasentence task and an intersentence task, with around 2,100 examples each in the development set.

Intrasentence Task: the stereotype and associated context are in the same sentence.

Intersentence Task: the context and stereotype are in different (consecutive) sentences.

Alongside stereo- and anti-stereotypical variants of sentences, each example in StereoSet contains an unrelated sentence. This sentence is included for measuring a Language Modelling Score (LMS) and a Stereotype Score (SS). These two metrics are combined to form the Idealized Context Association Test score (ICAT), which is a balanced measure of bias detection and language modeling. An ideal, unbiased language model would score an LMS of 100, an SS of 50, and an ICAT of 100.
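For reference, the ICAT score as defined by Nadeem et al. (2021) combines the two metrics as $\text{ICAT} = \text{LMS} \cdot \min(\text{SS}, 100 - \text{SS}) / 50$. The short sketch below implements that formula; the function name is illustrative.

```python
def icat(lms, ss):
    """Idealized CAT score from a Language Modelling Score and a Stereotype Score."""
    # SS is penalised symmetrically for deviating from 50 in either direction.
    return lms * min(ss, 100.0 - ss) / 50.0


ideal = icat(100.0, 50.0)           # perfect language modelling, no stereotype bias
fully_biased = icat(100.0, 100.0)   # always prefers the stereotypical option
mixed = icat(80.0, 60.0)
```

The formula makes explicit why a model cannot reach a high ICAT through strong language modelling alone: any deviation of SS from 50 scales the score down.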

We report results in Table 20. Galactica outperforms other models on all categories for the overall ICAT score.

1.3 Toxicity

To measure toxicity we use the RealToxicityPrompts (RTP) benchmark introduced in Gehman et al. (2020). We follow the same setup as Zhang et al. (2022) and sample 25 generations of 20 tokens using nucleus sampling (p=0.9) for each of 5000 randomly sampled prompts from RTP. We use the prompts to produce sequences (i.e., continuations) which are then scored by a toxicity classifier provided by Perspective API (https://github.com/conversationai/perspectiveapi).

Figure 22 plots the results. The chart shows the mean toxicity probability of continuations (y-axis), stratified across bucketed toxicities of the original prompts (x-axis). Galactica exhibits substantially lower toxicity rates than the other models.
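The stratification behind this kind of chart can be sketched as follows. The sketch assumes that toxicity scores in [0, 1] for prompts and continuations have already been obtained from a classifier, that prompts are grouped into ten equal-width buckets, and that the plotted value is the mean continuation score per bucket; these choices and the function name are assumptions rather than the exact procedure used for Figure 22.

```python
from collections import defaultdict


def mean_toxicity_by_bucket(records, n_buckets=10):
    """Mean continuation toxicity, grouped by bucketed prompt toxicity.

    records: iterable of (prompt_toxicity, [continuation_toxicity, ...]).
    """
    scores_by_bucket = defaultdict(list)
    for prompt_toxicity, continuation_scores in records:
        # Map a score in [0, 1] to a bucket index; 1.0 falls into the last bucket.
        bucket = min(int(prompt_toxicity * n_buckets), n_buckets - 1)
        scores_by_bucket[bucket].extend(continuation_scores)
    return {bucket: sum(scores) / len(scores)
            for bucket, scores in sorted(scores_by_bucket.items())}


# Toy scores (illustrative only, not classifier output).
toy_records = [(0.05, [0.1, 0.3]), (0.08, [0.2]), (0.95, [0.8, 0.6])]
by_bucket = mean_toxicity_by_bucket(toy_records)
```

Plotting the bucket index on the x-axis against the mean on the y-axis gives a curve of the same shape as the one described above.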

2 TruthfulQA

TruthfulQA is a benchmark that measures answer truthfulness of language model generations (Lin et al., 2022a). It comprises 817 questions that span health, law, finance and other categories. We compare to other published language models. We report results in Table 21. Galactica exceeds the performance of other language models on this benchmark. However, absolute performance is still low. Given the curated nature of our corpus, this suggests that data alone does not cause language models to struggle at this task.

Limitations and Future Work

We cover some of the limitations of this work in this section.

Our corpus has several limitations, both external and internally imposed. The main external constraint is our restriction to use open-access resources, and much of scientific knowledge, like papers and textbooks, is not open access. With access to these closed sources of knowledge, performance is likely to be considerably higher. We also use self-imposed constraints, like restricting the number of molecules and proteins for this work; without these constraints, we are likely to see considerable performance gains due to much larger corpuses for these modalities.

In several benchmarks, we show performance gains over existing language models, but we do not specifically disentangle the effects of the prompts we included in pre-training versus the core scientific corpus. In future work, we likely need to disentangle these effects in order to see whether general language capabilities are possible with a scientific corpus alone without prompt boosting.

While we demonstrate that the model approaches the true citation distribution with scale, some bias towards popular papers still remains with the 120B scale model, so the model likely requires augmentation before being used in a production environment.

We opted for prompt pre-training rather than instruction tuning in this paper, but ideally we would need to explore what instruction tuning could achieve, along the lines of the recent work of Chung et al. (2022). A limitation of this work is that we do not perform this direct comparison through ablations, which would make clear the trade-offs between approaches.

While Galactica absorbs broad societal knowledge through sources such as Wikipedia - e.g. 120B knows Kota Kinabalu is the capital of Malaysia’s Sabah state - we would not advise using it for tasks that require this type of knowledge as this is not the intended use-case.

While we have shown text-based Transformers are surprisingly powerful with text representations of scientific phenomena, we caution against the interpretation that text is all you need. For example, in chemistry, geometry is a fundamental language that determines meaning, yet Galactica has no notion of geometry; e.g. 3D co-ordinates of atoms.

2 Future Work

For development of the base model, we highlight several directions that may be worth pursuing.

It is likely further gains can be obtained with mixture-of-denoising training as U-PaLM has recently shown (Tay et al., 2022b; Chung et al., 2022). We suspect this might be beneficial for the scientific modalities such as protein sequences, where the left-to-right LM objective is quite limiting.

We use a maximum context window length of 2048 tokens in this work. Extending this is likely to be beneficial for understanding in long-form scientific documents, such as textbooks and also documents with longer modality sequences (e.g. long protein sequences).

We cannot capture scientific knowledge adequately without capturing images. This is a natural follow-up project, although it likely requires some architectural modification to make it work well. Existing work such as Alayrac et al. (2022) has shown how to extend LLMs with this modality.

We feel <work> could be a general-purpose reasoning token and we would like to invest more in this direction, including increasing prompt diversity and exploring performance on more benchmarks.

Even as language models become more accurate with scale, we need assurances that their generations are correct and factual. Developing this layer is critical for production applications of language models in general beyond scientific applications.

Should we re-train from scratch to incorporate new scientific knowledge or train from older checkpoints? This is an open question, and further research is needed to find the best procedure for incorporating new knowledge into the model.

While we have shown how large language models can absorb large bodies of scientific knowledge, retrieval has a place for fine-grained types of knowledge, and we believe this is a strong direction to pursue to complement the flexible weight memory of the Transformer.

Discussion and Conclusion

For over half a century, the dominant way of accessing scientific knowledge has been through a store-and-retrieve paradigm. The limitation of this approach is the reasoning, combining and organization of information still relies on human effort. This has led to a significant knowledge throughput bottleneck. In this work we explored how language models might disrupt this paradigm and bring about a new interface for humanity to interface with knowledge.

We showed that language models are surprisingly strong absorbers of technical knowledge, such as LaTeX equations and chemical reactions, and these capabilities tend to scale smoothly with model size. The context-associative power of language models likely confers significant advantages over search engines in the long-run. We demonstrated this for citation prediction, where a language model outperforms tuned sparse and dense retrieval pipelines for this task. Language models will likely provide a valuable new tool for exploring the literature and the body of scientific knowledge in coming years.

We also demonstrated that language models can compose a curated knowledge base to perform well in knowledge-intensive question answering tasks. This includes composing knowledge in a step-by-step reasoning manner. We showed that with a working memory token approach, we can achieve strong performance over existing methods on mathematical MMLU and MATH benchmarks. We suspect tasks like MATH are in principle solvable with language model approaches. The current bottleneck is the availability of high quality step-by-step datasets. However, language models will not perform these tasks like humans until they have an architectural change that supports adaptive computation.

We also performed initial investigations on the potential of LLMs to act as a bridge between scientific modalities and natural language. We showed Galactica could learn tasks like IUPAC naming through self-supervision. We also showed that it is possible to formulate drug discovery tasks like MoleculeNet in a natural language prompt and achieve strong results without direct fine-tuning. Lastly, we showed the potential for tasks such as automatic protein annotation. In all, increasing the number (and size) of datasets that bridge between natural language and natural sequences is likely to boost performance further.

Taken together, we feel there is a strong potential for language models to take on knowledge tasks that are currently human specialisms. We open source the models so others can build on our work, and we look forward to seeing how the open machine learning community will extend it.

Acknowledgments

Thanks to Susan Zhang, Stephen Roller, Naman Goyal and others for their support in using metaseq. We build on the open LLM training foundation they made possible with the OPT project (Zhang et al., 2022).

Thanks to Iliyan Zarov, Lukas Blecher, Jian Xiang Kuan and Mikhail Pershin for their contributions to the project.

Thanks to Faisal Azhar and Joe Spisak for their valuable support in delivering this project.

Thanks to Antoine Bordes, Laurens van der Maaten and Joelle Pineau for leadership support, and belief in this project. Additional thanks to Laurens for his valuable feedback on the paper.

Thanks to Geeta Chauhan, Hamid Shojanazeri and Eric Han for help with faster inference.

Thanks to numerous others for comments and advice over the past year: Patrick Lewis, Pontus Stenetorp, Timo Schick, Sebastian Riedel, Soumith Chintala.

Thanks to the open source creators whose libraries, datasets and other tools we utilized. Your efforts accelerated our efforts; and we open source our model to accelerate yours.

Thanks to the GPU nodes that didn’t die on us when training the 120B model.

References

Appendix A

We cover the various components of the corpus in this section.

We source scientific papers from preprint servers such as arXiv, PMC and other sources; see Table 22.

We also use the Semantic Scholar full text dataset (S2) to capture the long tail of science (Lo et al., 2019a). We apply several quality filters, including excluding papers from journals with certain keywords, and also excluding papers with a low journal impact factor. Details of the filters we used are contained in the Appendix.

We source abstracts where full texts are not open access. In total the full dataset contains 48 million papers, abstract and full-text, up to July 2022.

We use a modified version of the GROBID library for converting PDFs to text, as well as obtaining titles, authors and citations (GROBID, 2008–2022). Where mathematical LaTeX is available, for example in arXiv, we make sure to combine the GROBID results with LaTeX source to recover mathematical content.

The final paper documents are stored in a markdown format, as opposed to full LaTeX. We use markdown as the standard format for all documents in the corpus to support knowledge blending between sources. Papers are citation processed, following the title-based approach of Section 2.2.

A.1.2 Reference Material

We source encyclopedias, textbooks and educational material to create a base of reference material that the model can learn from. The details are covered in Table 23.

We apply source specific processing for several of the datasets, specifically:

For StackExchange, we take questions from scientific sites; see the Appendix for the subset used.

For Papers with Code and IUPAC Goldbook we apply data augmentation in the form of prompt randomization. Sometimes we pose sections as questions/answers; for example a section explaining a machine learning method is sometimes posed as "Question: What is [method]?".

For KhanAcademy articles, we add tokens for step-by-step reasoning examples, which we explain shortly in Section 2.4.

We make an effort to preserve mathematical LaTeX and capture citations, including hyperlinks to papers.

A.1.3 Knowledge Bases

We source fine-grained knowledge from scientific knowledge bases. The details are covered in Table 24.

For the chemistry and biology datasets, we wrap modalities like SMILES and protein sequences with their specialized tokens (see Section 2.1). For UniProt we apply data augmentation to the document format:

Order Randomization - with probability 0.5 the protein sequence starts at the beginning of the document, else at the end of the document. This ensures we can learn from $\text{seq}\rightarrow\text{property}$ and $\text{property}\rightarrow\text{seq}$.

Format Randomization - with probability $\frac{1}{3}$ we replace a description, e.g. "The function of protein is…", with a Q&A, e.g. "Question: What is the function of the protein? Answer: The function is…".
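The two UniProt augmentations can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline used for the corpus: the function name `format_uniprot_document`, the blank-line separator and the exact document layout are hypothetical; only the two probabilities, the example phrasings and the [START_AMINO] wrapper come from the text.

```python
import random


def format_uniprot_document(sequence, function_text, rng):
    """Build one augmented protein document (illustrative layout only)."""
    seq_block = f"[START_AMINO]{sequence}[END_AMINO]"

    # Format randomization: with probability 1/3 pose the description as Q&A.
    if rng.random() < 1 / 3:
        text_block = (
            "Question: What is the function of the protein? "
            f"Answer: The function is {function_text}"
        )
    else:
        text_block = f"The function of protein is {function_text}"

    # Order randomization: with probability 0.5 the sequence comes first,
    # otherwise it is placed at the end of the document.
    if rng.random() < 0.5:
        return f"{seq_block}\n\n{text_block}"
    return f"{text_block}\n\n{seq_block}"


# Hypothetical toy protein and description.
rng = random.Random(0)
docs = [format_uniprot_document("MLEICLK", "to bind DNA.", rng) for _ in range(200)]
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible for a given seed.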

For NASA Exoplanet we apply order randomization to the exoplanet characteristics.

For chemical and biological sequences, we take a small subset of available entities. This is to ensure the model is not overly biased towards learning natural sequences over natural language. Specifically:

For PubChem Compound, we take a small, random subset (2 million) of total compounds (110 million).

For UniProt, we take reviewed Swiss-Prot proteins; a small subset (0.5 million) of total (227 million).

For RefSeq Genome, we take reference sequences, which is a small subset of available nucleotide sequences. For the human genome, we only include the protein-coding genes.

This is a constraint we can relax in future work, enabling a much larger corpus. In this work, we focus on the first step of investigating whether a single model can learn effectively in this multi-modal setting.

A.1.4 Common Crawl

We source academic and scientific content via a highly-filtered subset of CommonCrawl. The details are covered in Table 25.

For Scientific Common Crawl, we train a fasttext classifier to identify Common Crawl webpages with scientific content (Joulin et al., 2016) using a noisy set of 600 domains. We then manually annotated the domains predicted by fasttext as scientific to assemble a list of 200 high-quality scientific and reference domains.

For Academic Common Crawl, we assemble a list of academic domains, such as university websites. We take PDFs from these domains, based on the Common Crawl index, and process these using GROBID.

We do not LaTeX-process pages from these sources.

We found the quality of extracted text in CommonCrawl generally quite poor, which is why we applied stringent filters. We suspect this could be an important area for future work in order to capture more base scientific knowledge.

A.1.5 Code

We source academic GitHub repositories from the Papers with Code index for machine learning, physics, mathematics, statistics and astronomy. The index does not explicitly cover sciences such as biology and chemistry, but many of these repositories are captured as part of the general machine learning index. We exclude repositories that do not have a license or copyright file.

A.1.6 <work> Datasets

For KhanProblems, we used the problems from AMPS and converted them to a <work> format (Hendrycks et al., 2021). Where possible we tried to include more tedious steps to reduce errors from a single pass, but this annotation was fairly incomplete and we suspect bigger gains are possible with more cleaning.

For GSM8k we use the provided training dataset and convert it so the calculator steps are performed by writing a Python program, following the <work> format (Cobbe et al., 2021). In general, we found when the model went into this prompt style, it was more error-prone. We think this is because the prompt style made the model write too many programs within <work>, rather than getting things ready to run in a single program. In general we found longer answers led to a higher chance of a mistake on the reasoning path.

For OneSmallStep, we made 50 problem set question templates, and randomized the variables in the problem to get more prompt examples. We summarize the fields for which we made prompts below.

As we can see the diversity was not very large, and so further gains are likely with more annotation.

Lastly we wrote 921 examples, based on internet examples, in a <work> format for Workout. This was our highest quality dataset, and had reasonable diversity across fields: mathematics, chemistry, biology, astronomy, physics, geology, history. This is the type of dataset we would look to scale in future work.

A.2 Dataset Deduplication

We use the following procedure for deduplicating the corpus:

We identify identical spans of 100 bytes or more (of utf-8 text) across the whole corpus, except for some explicitly excluded data sources. We do this using the repository from Lee et al. (2022).

We process corpus files in a predetermined order to prioritize some sources. From a set of spans representing the exact same content across files, we remove the span in the first file. If the same content repeats across a single file and it was not found in the files before, all its occurrences are kept.

We merge duplicated spans separated by at most 4 bytes.

We narrow down the resulting spans to paragraph boundaries (i.e. "\n\n").

We remove the content from files corresponding to the spans.
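The last three steps of the procedure above (merging nearby spans, narrowing to paragraph boundaries, removing content) can be sketched as follows. This is a minimal single-file illustration, not the code from the repository of Lee et al. (2022): the function names are hypothetical, and reading "narrow down" as shrinking a span inward until it covers only whole paragraphs is an assumption.

```python
def merge_spans(spans, max_gap=4):
    """Merge (start, end) byte spans whose gap is at most max_gap bytes."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def narrow_to_paragraphs(data, span, sep=b"\n\n"):
    """Shrink a span so that it starts and ends on paragraph boundaries.

    Returns None if the span does not cover at least one whole paragraph.
    """
    start, end = span
    # Move the start forward to the first paragraph start at or after it.
    if start != 0 and data[max(0, start - len(sep)):start] != sep:
        idx = data.find(sep, start)
        if idx == -1:
            return None
        start = idx + len(sep)
    # Move the end back to the last paragraph end at or before it.
    if end != len(data) and data[end:end + len(sep)] != sep:
        idx = data.rfind(sep, start, end)
        if idx == -1:
            return None
        end = idx
    return (start, end) if start < end else None


def remove_spans(data, spans):
    """Return data with the given non-overlapping byte spans removed."""
    out, prev = [], 0
    for start, end in sorted(spans):
        out.append(data[prev:start])
        prev = end
    out.append(data[prev:])
    return b"".join(out)


# Toy document with three paragraphs (hypothetical content).
data = b"para one\n\npara two\n\npara three"
narrowed = narrow_to_paragraphs(data, (3, 22))
cleaned = remove_spans(data, [narrowed])
```

Here the duplicated span (3, 22) starts and ends mid-paragraph, so only the fully covered middle paragraph is removed.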

A.3 Citation Identifier Ablations

We report the citation identifier ablations below, where we test title-based identifiers versus alphanumeric identifiers.

Specifically, we set up an evaluation set of dataset and method names from Papers with Code. The task is to predict the citation given the method or dataset name, e.g. ResNet [START_REF], where the target is Deep Residual Learning for Image Recognition, He. We train a 6.7bn model on both types of processing for the ablation. Method and dataset results are shown below.

A.4 120B Validation Loss Per Source

A.5 Chain-of-Thought vs <work>

We used the recent results by Chung et al. (2022) of PaLM 540B on the MMLU validation set (Hendrycks et al., 2020) for comparison. While use of reasoning degrades performance versus direct prompting for both approaches, the <work> token appears more robust.

A.6 Prompt Pre-training Datasets

We report the prompt datasets we included in pre-training below.

We set up a prediction task for chemical and physical properties with our validation set of 17,052 compounds. We use the PubChem document structure to design a prompt. We show an example for XLogP in Figure 24.

Canonical SMILES

[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES]

Computed Properties

| Property Name | Property Value | XLogP3-AA Log P |

Figure 24: Chemical Property Prompt. We design a prompt based on the PubChem document format. Using this prompt style, we test the model’s ability to learn chemical and physical properties from the SMILES sequence.

We report results in Table 35. The error decreases fairly smoothly with scale, suggesting self-supervised learning is occurring within-document from SMILES towards the chemical and physical properties. But it tails off for 120B which suggests more molecule data might be needed.

A.6.2 Docking Regression

We looked briefly at the docking score regression task (García-Ortegón et al., 2022). Here the task is to predict a docking score based on a target and a ligand. In the case of Galactica, we use a text format to represent this information. An example is shown in Figure 25. We report results in Table 36.

[START_AMINO]MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKE...[END_AMINO]

[START_I_SMILES]O1[C@@H]([C@@H](O)[C@@H](O)[C@@H]1N2C(=O)NC(=O)C=C2)...[END_I_SMILES]

Question: What will be the docking score of this compound against the protein?

Answer: -8.8

Figure 25: DockSTRING Format. To construct the training set, we take the protein target and ligand sequences, pose a natural language question, and have the docking score as the answer.

Docking Regression

| Model | Param (bn) | ESR2 | F2 | KIT | PARP1 | PGR |
|---|---|---|---|---|---|---|
| GAL 125M | 0.1 | -12.4 | -6.09 | -6.73 | -1.69 | -12.4 |
| GAL 1.3B | 1.3 | -0.293 | 0.591 | 0.063 | 0.728 | -1.72 |
| GAL 6.7B | 6.7 | -0.216 | 0.694 | 0.290 | 0.681 | -0.894 |
| GAL 30B | 30 | -0.186 | 0.679 | 0.313 | 0.732 | -0.468 |
| GAL 120B | 120 | -0.564 | 0.626 | 0.249 | 0.732 | -0.960 |

Table 36: DockSTRING Results. Metric shown is $R^{2}$. For three of the targets, Galactica is able to infer from looking at the sequences alone, and performance scales from 1.3B parameters onwards. However, Galactica does not solve the two harder targets ESR2 and PGR. This hints at a limitation with the text representation, and may point to more geometrical information being needed to solve the task with reasonable data-efficiency.
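Several entries in the DockSTRING results are negative, which may look odd for an $R^{2}$ score. The sketch below assumes the metric is the standard coefficient of determination, $R^{2} = 1 - SS_{\text{res}}/SS_{\text{tot}}$ (the text does not spell out the formula), and shows that under this definition a model predicting worse than the mean of the targets scores below zero. The docking scores used are toy values, not data from the benchmark.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy docking scores (hypothetical).
y_true = [-8.0, -9.0, -10.0]

perfect = r_squared(y_true, [-8.0, -9.0, -10.0])    # exact predictions
mean_only = r_squared(y_true, [-9.0, -9.0, -9.0])   # always predict the mean
reversed_ = r_squared(y_true, [-10.0, -9.0, -8.0])  # anti-correlated predictions
```

Under this definition a score of 0 corresponds to always predicting the mean docking score, so the negative entries for ESR2 and PGR indicate predictions worse than that baseline.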

A.6.3 Rest of MMLU

We report results for social sciences and other fields below:

A.7 Further Training Dataset Details

We compile a list of scientific entities, retrieve fragments for each one, and write a description of the entity based on the retrieved fragments. This can be considered a summarization task. We also write ground-truth descriptions without any retrieved fragments.

A.7.2 MethodNet

We compile machine learning abstracts and predict the new method that was introduced in the paper.

A.7.3 PWC Desc

For a list of datasets and methods in machine learning, we retrieve fragments for each one from the introducing paper, and write a summary description based on the retrieved fragments.

A.7.4 Ribosome

We use Expasy (https://web.expasy.org/translate/) to create a paired translation set between nucleotide sequences from the protein coding part of the human genome and protein sequences.
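To make the pairing concrete, the sketch below translates a nucleotide sequence into a protein sequence using the standard genetic code. It is an illustration only and not the Expasy tool: it reads a single forward frame from the first base, stops at the first stop codon, and the function name `translate` and the toy sequence are assumptions.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence (hypothetical): ATG GCC TGG TAA
pair = ("ATGGCCTGGTAA", translate("ATGGCCTGGTAA"))
```

Each `(nucleotide, protein)` pair of this kind is one example in a paired translation set.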

A.7.5 S2

Papers from certain fields are ignored due to quality concerns: psychology, business, art, economics, geography, history, political science, philosophy and sociology. Papers from journals with words like "law", "history", "politics", "business", "religion" were also ignored. For S2, we also exclude papers from low impact journals. The approximate impact factor of each journal in the S2 dataset was computed by counting the number of papers in that journal and the number of citations that these papers received. If the approximate impact factor is less than 1, the papers from that journal are ignored. Non-English papers are ignored. Some of these constraints can likely be relaxed in future work.
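The journal filter can be sketched as below. The text states which counts go into the approximate impact factor but not how they are combined; taking the ratio of citations to papers is an assumption, as are the function name `low_impact_journals` and the toy records.

```python
from collections import defaultdict


def low_impact_journals(papers, threshold=1.0):
    """Return journals whose approximate impact factor is below threshold.

    papers: iterable of (journal, citation_count) pairs, one per paper.
    The impact factor is approximated as citations per paper (an assumption).
    """
    n_papers = defaultdict(int)
    n_citations = defaultdict(int)
    for journal, citations in papers:
        n_papers[journal] += 1
        n_citations[journal] += citations
    return {j for j in n_papers if n_citations[j] / n_papers[j] < threshold}


# Toy records (hypothetical journals and citation counts).
papers = [("A", 5), ("A", 1), ("B", 0), ("B", 1), ("C", 1)]
ignored = low_impact_journals(papers)
kept = [(j, c) for j, c in papers if j not in ignored]
```

Journal "C" sits exactly at the threshold and is kept, since only journals strictly below 1 are ignored.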

A.7.6 ScientificEntities

For a random sample of academic paper abstracts, we predict the scientific entities that were mentioned in the abstract.

A.7.7 StackExchange

We include questions and answers from the following sources: academic, ai, arduino, astronomy, aviation, bioinformatics, biology, chemistry, chess, cogsci, computergraphics, cs, cseducators, cstheory, datascience, dsp, earthscience, economics, electronics, engineering, hardwarerecs, health, hsm, math, matheducators, mathematica, mathoverflow, mechanics, networkengineering, or, physics, puzzling, quant, quantumcomputing, retrocomputing, reverseengineering, robotics, scicomp, softwareengineering, softwarerecs, sound, space, stats.

A.7.8 TrueOrFalse

We include 107 True or False questions to improve zero-shot performance for this type of question.

A.7.9 UChallenge

We include 346 free-form questions and answers on university-level science; this is a form of closed-book QA (and not multiple-choice).

A.8 Evaluation Dataset Examples

A.8.2 Galaxy Clusters

A.8.3 Mineral Groups

A.8.4 Deduplication Results

One of our concerns from reading the literature was the lack of data leakage analysis for results on MMLU, given the massive corpuses being used. Following from previous work of Brown et al. (2020), we search for n-gram matches between the training and test set. We chose to remove any 13-gram matches from the test set that appear in the training set and we report the scores before and after removal of these clashing examples. Results are shown overleaf.
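The 13-gram check can be sketched as follows. This is a minimal illustration, not the analysis code behind the reported results: whitespace tokenization without any normalization and the function names are assumptions, and the toy call uses a smaller n so that the example stays short.

```python
def ngrams(text, n=13):
    """Set of word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def remove_contaminated(test_examples, training_docs, n=13):
    """Drop test examples that share any n-gram with the training documents."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [ex for ex in test_examples if not (ngrams(ex, n) & train_ngrams)]


# Toy data (hypothetical), with n=3 instead of 13 for brevity.
train = ["the quick brown fox jumps over the lazy dog"]
test = ["a quick brown fox appears", "nothing shared here at all"]
clean = remove_contaminated(test, train, n=3)
```

Scores computed on `test` and on `clean` correspond to the "before" and "after" numbers in such an analysis.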

A.8.5 Example Wikipedia Article Written by Galactica

A.8.6 Example Literature Survey Written by Galactica

A.8.7 Example Lecture Notes Written by Galactica

A.8.8 I’m sorry Frank, I think you missed it

If AI is going to help us explore the universe, we need it to have basic chess abilities to alleviate boredom - given the impossibility of faster-than-light travel.

The BIG-bench task suite of Srivastava et al. (2022) has a benchmark for checkmate-in-one detection. For fun, we made a dataset of 20,000 public chess games and converted them to ASCII chess using the python-chess library (https://python-chess.readthedocs.io/en/latest/). We included 19,426 games in our pre-training corpus (rest for validation). We also recorded the ELO ratings of players. An example document looks like below:

# A Chess Game

## Player Information

White ELO: 2286
Black ELO: 2586

## The Game Begins

r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
P P P P P P P P
R N B Q K B N R

White (ELO: 2286) plays e4

r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . P . . .
. . . . . . . .
P P P P . P P P
R N B Q K B N R

(cont)

For evaluation, we converted the checkmate-in-one boards to ASCII and prompted for a move. Results are shown below.

While this represents the state-of-the-art over other large language models (https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/checkmate_in_one), it is clear that more work is needed on this problem.