BloombergGPT: A Large Language Model for Finance
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann
Introduction
The release of GPT-3 in 2020 (Brown et al., 2020) demonstrated the powerful benefits of training very large auto-regressive language models (LLMs). GPT-3 had 175 billion parameters, a hundredfold increase over the previous GPT-2 model, and did remarkably well across a wide range of now-popular LLM tasks, including reading comprehension, open-ended question answering, and code generation. This performance has been replicated across several other models (Chowdhery et al., 2022; Scao et al., 2022; Zhang et al., 2022a). Furthermore, evidence suggests that large models exhibit emergent behaviors; growth allows them to acquire abilities not present in smaller models (Wei et al., 2022a). A notable example of emergent behavior is the ability to perform tasks via few-shot prompting, where a model can learn a task from just a few examples. Performance under few-shot prompting rises well above random as language models grow in size. Broadly speaking, few-shot prompting dramatically expands the range of tasks supported by models and lowers the barrier to entry for users seeking automation for new language tasks.
After GPT-3, models grew in size to 280 billion (Gopher, Rae et al., 2021), 540 billion (PaLM, Chowdhery et al., 2022), and 1 trillion parameters (Megatron, Korthikanti et al., 2022). Work also explored other important aspects of achieving a high-performing LLM, such as different training objectives (Tay et al., 2022b), multilingual models (Scao et al., 2022), more efficient and smaller models (Black et al., 2022), and finding data and parameter-efficient training sizes (Hoffmann et al., 2022).
These efforts have almost exclusively focused on general LLMs, trained on datasets that cover a broad range of topics and domains. While these have included some datasets for specialized domains (e.g., code (Chen et al., 2021a) or biomedical articles Gao et al. (2021)), the focus has been on building LLMs with broad capabilities. Recent efforts training models using only domain-specific data have yielded models that, while much smaller, beat general purpose LLMs on tasks within those domains, such as science Taylor et al. (2022) and medicine Bolton et al. (2023); Luo et al. (2022); Lehman et al. (2023). These findings motivate further development of models focused on specific domains.
Financial Technology (FinTech) is a large and growing area with NLP technologies having an increasingly important role Xing et al. (2018); Fisher et al. (2016); Dredze et al. (2016). Financial NLP tasks Shah et al. (2022) include sentiment analysis Araci (2019), named entity recognition Salinas Alvarado et al. (2015), news classification Sinha and Khandait (2020), and question answering Chen et al. (2021b, 2022). While the range of tasks is similar to those found in general NLP benchmarks, the complexity and terminology of the financial domain warrant a domain-specific system. For all of the reasons generative LLMs are attractive in general – few-shot learning, text generation, conversational systems, etc. – it would be valuable to have an LLM focused on the financial domain. While there are masked language models tuned for the financial domain Araci (2019), no LLM has been tuned for or evaluated on tasks for this domain.
We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across a wide variety of tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.
We achieve this goal by constructing the largest domain-specific dataset yet, drawing on existing data creation, collection, and curation resources at Bloomberg. As Bloomberg is primarily a financial data company, our data analysts have collected and curated financial language documents over the span of forty years. We have extensive archives of financial data that cover a range of topics, with careful tracking of data sources and usage rights. We add this data to public datasets to create a large training corpus with over 700 billion tokens. Using a portion of this training corpus, we train a BLOOM-style, 50 billion parameter model designed based on guidelines from Hoffmann et al. (2022) and Le Scao et al. (2022). We validate the model on standard LLM benchmarks, open financial benchmarks, and a suite of Bloomberg-internal benchmarks that most accurately reflect our intended use cases. Our results demonstrate that our mixed training approach leads to a model that vastly outperforms existing models on in-domain financial tasks while being on par or better on general NLP benchmarks.
2 Broader Contributions
Beyond the construction of an LLM for financial data, our goal is to contribute to the broader research community. Specifically, our experience documented in this paper provides evidence that further develops the community’s understanding of several open questions in the literature.
The few existing domain-specific LLMs are trained exclusively on domain-specific data sources (Luo et al., 2022; Bolton et al., 2023; Taylor et al., 2022), or adapt a very large general purpose model to domain-specific tasks (Singhal et al., 2022; Lewkowycz et al., 2022). Our alternative approach – training an LLM on both domain-specific and general data sources – has not been studied so far. The resulting model does very well on domain-specific tasks, but also maintains strong performance on general-purpose benchmarks.
Nearly all language models rely in large part on web-scraped data, such as C4 (Raffel et al., 2020) and The Pile (Gao et al., 2021) (which includes OpenWebText2). This data may be cleaned or subsetted in various ways before use Touvron et al. (2023); Rae et al. (2020); Scao et al. (2022); Jernite et al. (2022), but issues of data duplication Carlini et al. (2020) and toxic language remain Welbl et al. (2021). Our training data is unusual for LLM training in that it includes a significant amount of curated and prepared data from reliable sources.
LLM evaluation remains a challenging and evolving problem Gehrmann et al. (2022); Goyal et al. (2022), with new benchmarks trying to standardize evaluation across models (Liang et al., 2022; Srivastava et al., 2022). However, for domain-specific tasks, there remains a mismatch between evaluation and actual use cases. Evaluations are built on available datasets and not necessarily on how the model will be used in practice. We provide results on both public financial NLP benchmarks (Shah et al., 2022; Chen et al., 2021b) as well as a selection of internal Bloomberg tasks, which are better aligned with our intended use cases and directly evaluate our model’s ability to perform tasks of interest.
Early LLMs made a single training pass over a corpus of 200-400 billion tokens (Brown et al., 2020). Hoffmann et al. (2022) posited that these models were undertrained and instead focused on training smaller models with more data, a strategy most recently employed by Touvron et al. (2023). We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.
After assembling training data, the critical step of tokenization transforms the text into a format suitable for the language model. The importance of this step is often overlooked Mielke et al. (2021), and many older LLMs use the same tokenizer and vocabulary, meaning that we have little evidence to support other tokenizers. We take a different approach and use a Unigram model instead of greedy merge-based sub-word tokenizers since it saves probabilities allowing for smarter tokenization at inference time (Kudo, 2018).
GPT-3 and subsequent models were the work of large teams and required an enormous amount of computation. Initial work to reproduce these results, such as OPT Zhang et al. (2022a), did not match the performance of the original model. With the release of each subsequent model, the community’s understanding, experience, and software tools increase. In developing BloombergGPT, we benefited from existing code developed as part of the BLOOM effort Scao et al. (2022), showing that a moderately sized team can produce a competitive model on domain-specific data. We describe our experiences training BloombergGPT in detail to support future training efforts and address each of the above topics.
Dataset
To train BloombergGPT, we construct “FinPile”, a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purpose text. For a breakdown of the full training set, see Table 1. To improve data quality, we de-duplicate each dataset (The Pile, C4, Wikipedia, FinPile) according to Lee et al. (2022a); as a side-effect, the statistics reported in Table 1 might be different from those reported in other papers.
The Bloomberg Terminal has provided access to a comprehensive set of diverse structured and unstructured financial data and analytics for the past four decades. In serving this mission, Bloomberg analysts have curated a set of financial documents that were either created internally or acquired from external sources. We utilize this extensive collection of curated and maintained documents to create FinPile, which consists of company filings, financial news, and other data relevant to the financial markets.
Some documents included in the FinPile, such as company filings, are available to the general public, although collecting these documents and pre-processing them for LLM training is a non-trivial task. Other documents, such as (a subset of) Bloomberg news, must be purchased. The rest of the documents are private and available, among other sources, through the Bloomberg Terminal. Finally, we clean this data to strip off markup, special formatting, and templates.
Note that each document in FinPile is time-stamped, with dates ranging from 2007-03-01 to 2022-07-31; the quality and quantity of documents increase over this time range. While we do not utilize date information in this work, we plan to use it in the future, such as for evaluation of what the model learns about different time periods. While we cannot release FinPile, our experience training on a large, carefully curated, and clean domain-specific dataset may provide helpful insights to the community on the advantages and challenges of building a financial LLM in particular, and a domain-specific model in general. We provide a breakdown and analysis of FinPile in Table 2 and a brief description of the types of data included below.
Bloomberg collects web content by identifying sites that contain financially relevant information. While this category makes up the majority of FinPile, its classifications are rough, with content classified mainly by the location of the web domain. Within these location-specific sources, e.g. “US” (15.95% of total), “Asia-Pac” (4.72% of total), and “UK” (1.98% of total), document types are highly varied as would be expected from a web crawl. While web sources are common in existing public LLM training datasets, Bloomberg’s web crawl is focused on high-quality websites that have financially relevant information, as opposed to a general-purpose crawl of the web.
1.2 News (38B tokens – 5.31% of training)
The News category includes all news sources excluding news articles written by Bloomberg journalists. Overall, there are hundreds of English news sources in FinPile including “Bloomberg Transcripts” (0.41% of total), which are transcripts of Bloomberg TV news. Generally, the content in this dataset comes from reputable sources of news that are relevant to the financial community so as to maintain factuality and reduce bias.
1.3 Filings (14B tokens – 2.04% of training)
Company Filings are financial statements prepared by (public) companies and made available to the general public. In some countries, like the US, public companies are mandated to prepare and submit their financial statements on a regular cadence; e.g., 10-K annual reports and 10-Q quarterly reports. In our dataset, a majority of the filings come from EDGAR, which is the SEC’s online database (1.90% of total). Filings are typically long PDF documents with tables and charts that are dense in financial information, which are processed and normalized in Bloomberg. Filings are substantially different from the types of documents typically used to train LLMs, but contain critically important information for financial decision-making.
1.4 Press (9B tokens – 1.21% of training)
The Press category contains press releases typically issued by companies that are financially relevant. Taken together with filings, press releases represent most of the public communications of a company. However, unlike filings, press releases are similar to news stories in terms of content and style.
1.5 Bloomberg (5B tokens – 0.70% of training)
This category comprises Bloomberg authored news and other documents such as opinions and analyses. The largest sources are “Bloomberg News” (0.44% of total) and “Bloomberg First Word” (0.13% of total), the Bloomberg-authored wire of real-time news. While Bloomberg News covers a wide range of topics, it typically focuses on content relevant to the financial community. This dataset contains documents of varying lengths.
2 Public Datasets (345B tokens – 48.73% of training)
We use three widely known and available public datasets in our training corpus.
The Pile (Gao et al., 2021) is the dataset used in GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021), and GPT-NeoX (20B) (Black et al., 2022). We include The Pile in our training data for the following reasons. First, it has been used to successfully train an LLM. Second, it has undergone significant data cleaning and pre-processing. Third, it includes multiple domains and we believe such diverse data will aid generalization to new domains and may even support training on financial data. For example, domains such as FreeLaw and GitHub are useful to teams at Bloomberg that work on legal documents and software development, respectively. Creators of The Pile have deliberately chosen to include duplicate content, with the duplication factor being proportional to the perceived quality of the content. However, as we deduplicate each of our datasets, the size of The Pile is significantly reduced. Additionally, note that our tokenizer (§2.3) is trained on The Pile.
2.2 C4 (138B tokens – 19.48% of training)
The Colossal Clean Crawled Corpus (C4) is a common dataset used to train LLMs, and was introduced to support training T5 (Raffel et al., 2020). Although it overlaps with Pile-CC, C4 is cleaned and processed differently; hence, we feel that including C4 in addition to The Pile can add value more than duplicated documents would. We find that C4 contains high-quality natural language documents due to the layers of cleaning, though others have noted that the distribution across web domains is unusual, with a high fraction of data stemming from patents Dodge et al. (2021).
2.3 Wikipedia (24B tokens – 3.35% of training)
Both The Pile and C4 include out-of-date copies of Wikipedia, so it could be beneficial for the factuality of the model to have up-to-date Wikipedia pages included. Therefore, we include a dump of English Wikipedia from July 1, 2022. This dataset is tokenized quite inefficiently (3.06 characters per token), indicating an above-average amount of markup, which suggests that further cleaning might benefit future model training.
3 Tokenization
We choose the Unigram tokenizer (Kudo, 2018) instead of a greedy merge-based sub-word tokenizer, such as Byte Pair Encoding (BPE) (Sennrich et al., 2016) or Wordpiece (Schuster and Nakajima, 2012; Wu et al., 2016), based on promising results in Kudo and Richardson (2018) and Bostrom and Durrett (2020). Following GPT-2 Radford et al. (2019), we treat our data as a sequence of bytes rather than Unicode characters, and we include each of the 256 bytes as tokens. In a pretokenization step, the input byte sequence is broken into chunks by greedily matching the following regular expression: [ A-Za-z]+|[0-9]|[^A-Za-z0-9]+. This follows GPT-2 in preventing multiple character classes from appearing in a single token. However, we include spaces in the alphabetic chunks, which allows multi-word tokens to be learned, increasing information density and reducing context lengths. The pretokenization follows the approach of PaLM Chowdhery et al. (2022) in placing each digit in its own chunk, with the hope that this will lead to better handling of numbers. We train our tokenizer on The Pile Gao et al. (2021) as it draws from diverse domains, including code and academic papers, in proportions that suit our use case.
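The pretokenization step can be illustrated with a minimal sketch. It assumes the pattern has three alternatives (letter-and-space runs, single digits, and runs of everything else), with the single-digit alternative inferred from the statement that each digit is placed in its own chunk. It also operates on a Python string rather than on raw bytes, which is a simplification for readability.

```python
import re

# Assumed pattern: letter/space runs | single digits | runs of other characters.
PRETOKENIZE = re.compile(r"[ A-Za-z]+|[0-9]|[^A-Za-z0-9]+")


def pretokenize(text: str) -> list[str]:
    """Split text into chunks; sub-word tokens never cross a chunk boundary."""
    return PRETOKENIZE.findall(text)


chunks = pretokenize("Revenue rose 12% in Q3")
print(chunks)  # ['Revenue rose ', '1', '2', '% ', 'in Q', '3']
```

Note how "Revenue rose " stays in one chunk, so a multi-word token can be learned from it, while "12" is split into two single-digit chunks.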
The Unigram tokenizer implementation is too inefficient to process the entire Pile dataset at once, so we use a split and merge approach. We split each of the 22 domains in the Pile into 256 chunks of roughly equal size. We then train a Unigram tokenizer with a vocabulary size of 65,536 ($2^{16}$) on each of the $22 \times 256$ (total 5,632) chunks. We hierarchically merge the individual tokenizers by first merging the 256 tokenizers from each domain, and then combining the 22 resulting tokenizers to get the final tokenizer.
Unigram tokenizers amount to probability distributions over tokens (i.e. unigram language models), and we merge tokenizers by taking a weighted average of the probabilities of corresponding tokens, with the weights determined by the relative sizes (in bytes) of the data used to train the tokenizers. The result is a tokenizer with 7 million tokens. To reduce the size of the vocabulary to $2^{17}$ (131,072) tokens, we drop the tokens with the smallest probabilities and renormalize. To ensure we do not need an out-of-vocabulary token, we also add as tokens the 36 (of 256 possible) bytes that do not occur in The Pile, along with an <|endoftext|> token.
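The merge-and-prune procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for BloombergGPT: each tokenizer is represented as a plain dictionary from token to probability, a token missing from a tokenizer is treated as having probability zero there, and the function names are hypothetical.

```python
def merge_unigram(models: list[dict[str, float]], sizes: list[int]) -> dict[str, float]:
    """Weighted average of token probabilities, weighted by training-data size in bytes."""
    total = float(sum(sizes))
    merged: dict[str, float] = {}
    for probs, size in zip(models, sizes):
        weight = size / total
        for token, p in probs.items():
            # Assumption: a token absent from a model contributes probability 0.
            merged[token] = merged.get(token, 0.0) + weight * p
    return merged


def truncate_vocab(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(p for _, p in kept)
    return {token: p / mass for token, p in kept}


# Toy example: two tiny "tokenizers" trained on 1 and 3 bytes of data.
merged = merge_unigram([{"a": 0.5, "b": 0.5}, {"a": 0.25, "c": 0.75}], [1, 3])
pruned = truncate_vocab(merged, 2)
```

Applying `merge_unigram` first within each domain and then across domains gives the hierarchical merge; because a weighted average of weighted averages is again a weighted average, the order of merging does not change the result as long as the weights track the byte counts.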
There are various considerations in choosing the vocabulary size. One advantage of a large vocabulary for LLMs is that more information can fit into the context window. On the other hand, there is overhead with a larger vocabulary: a larger proportion of model parameters are required for token embedding. We select our vocabulary size of $2^{17}$ tokens based on experiments with vocabulary ranging from 25,000 to 550,000. For each vocabulary size, we tokenize the C4 dataset and compute the total size (in bytes) for the dataset, where each token is represented using $\log_2(\text{vocabulary size})$ bits. Our heuristic is to choose the vocabulary size that leads to the smallest encoded representation of C4. This gives us a vocabulary size of 125,000, which we then round up to the nearest power of 2 ($2^{17}$, or 131,072 tokens). Our tokenizer is large, relative to the standard vocabulary size of approximately 50,000 tokens. For an analysis of tokenization efficiency, see Table 3.
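The vocabulary-size heuristic can be sketched as a small search. The sketch assumes each token costs $\log_2(\text{vocabulary size})$ bits, and the token counts in `toy_counts` are invented purely for illustration; they are not measurements from C4.

```python
import math


def encoded_bytes(num_tokens: int, vocab_size: int) -> float:
    """Size of a tokenized corpus if every token costs log2(vocab_size) bits."""
    return num_tokens * math.log2(vocab_size) / 8.0


def best_vocab_size(token_counts: dict[int, int]) -> int:
    """Pick the vocabulary size whose encoding of the corpus is smallest."""
    return min(token_counts, key=lambda v: encoded_bytes(token_counts[v], v))


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    return 1 << (n - 1).bit_length()


# Hypothetical token counts for the same corpus under three vocabulary sizes.
toy_counts = {25_000: 1_000_000, 125_000: 820_000, 550_000: 760_000}
chosen = best_vocab_size(toy_counts)
rounded = next_power_of_two(chosen)
```

The trade-off is visible in the formula: a larger vocabulary lowers the token count but raises the per-token cost, so the encoded size has a minimum somewhere in between.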
Model
Our model is a decoder-only causal language model based on BLOOM (Scao et al., 2022). We present an overview of the architecture, with full details in Appendix A.
The model contains 70 layers of transformer decoder blocks defined as follows:
$$\bar{h}_\ell = h_{\ell-1} + \mathrm{SA}(\mathrm{LN}(h_{\ell-1}))$$
$$h_\ell = \bar{h}_\ell + \mathrm{FFN}(\mathrm{LN}(\bar{h}_\ell))$$
where $\mathrm{SA}$ is multi-head self-attention, $\mathrm{LN}$ is layer-normalization, and $\mathrm{FFN}$ is a feed-forward network with 1-hidden layer. Inside FFN, the non-linear function is GELU (Hendrycks and Gimpel, 2016). ALiBi positional encoding is applied through additive biases at the self-attention component of the transformer network (Le Scao et al., 2022). The input token embeddings are tied to the linear mapping before the final softmax. Following Le Scao et al. (2022) and first used in Dettmers et al. (2022), the model has an additional layer normalization after token embeddings, formally:
$$\bar{h}_1 = \mathrm{LN}^{em}(h_0) + \mathrm{SA}(\mathrm{LN}(\mathrm{LN}^{em}(h_0))),$$
where $h_0$ is the initial token embedding and $\mathrm{LN}^{em}$ is the new component of embedding layer-normalization. Notice that the second term includes two consecutive layer-normalizations.
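The residual structure of a decoder block can be made concrete with a structural sketch. It is deliberately a toy: vectors are plain Python lists, layer normalization has no learned gain or bias, and the self-attention and feed-forward sub-layers are passed in as arbitrary callables (all of these are simplifying assumptions). Only the order of normalization, sub-layer, and residual addition is meant to match the description.

```python
import math
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned parameters here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def decoder_block(h_prev: Vector, self_attention: SubLayer, feed_forward: SubLayer) -> Vector:
    """Pre-LN block: normalize, apply the sub-layer, then add the residual."""
    h_bar = add(h_prev, self_attention(layer_norm(h_prev)))
    return add(h_bar, feed_forward(layer_norm(h_bar)))


def first_block(h0: Vector, self_attention: SubLayer, feed_forward: SubLayer) -> Vector:
    """The first block sees the embedding after the extra embedding layer norm."""
    return decoder_block(layer_norm(h0), self_attention, feed_forward)


# Toy sub-layers: identity "attention" and an FFN that outputs zeros.
identity = lambda v: list(v)
zeros = lambda v: [0.0] * len(v)
out = decoder_block([1.0, 2.0, 3.0], identity, zeros)
```

Writing `first_block` as `decoder_block` applied to the normalized embedding makes the two consecutive layer normalizations explicit: one from the embedding normalization and one from the block's own pre-normalization.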
2 Model Scaling
The size of our model is based on Chinchilla scaling laws (Hoffmann et al., 2022), in particular their Approach 1 and Approach 2. We start with a total compute budget of 1.3M GPU hours on 40GB A100 GPUs. Since we adopt activation checkpointing to reduce our memory footprint, this costs us an additional 0.33x TFLOPs per iteration due to repeated forward passes. To account for this additional cost, we plug in 0.75 × 1.3M into Chinchilla equations instead of the full amount.
From Hoffmann et al. (2022), we use the data reported in Table 3 for Approach 1 and Table A3 for Approach 2, and fit regression lines to their log-scaled versions. This gives us:
$$\text{Tokens} = \exp_{10}\left(\log_{10}(\text{FLOPs}) \cdot 0.502 + 0.229\right) = 1111.112\text{B}$$
These calculations imply that our dataset of ~700B tokens is too small for a “Chinchilla optimal” configuration given our compute budget (assuming just one pass through the data). While we can increase the amount of general-purpose training data, we are limited in the amount of domain-specific training data at our disposal. FinPile is already among the largest domain-specific training sets, and we do not want it to represent less than half of our total training.
Footnote 1: The scaling law derived by Chinchilla is tokenizer-specific. Our tokenizer can encode the same document more compactly due to the support of multi-word expressions and the larger vocabulary size. It’s still an open question how well these scaling laws transfer across tokenizers, and how vocabulary size impacts token and parameter trade-offs assuming fixed compute. We leave this exploration to future work.
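The token estimate can be reproduced approximately with a short calculation. The regression coefficients (0.502 and 0.229) come from the text; the FLOPs value plugged into the fit is not stated, so this sketch assumes the 102 TFLOPs per GPU reported later for the training run. Under that assumption the result lands near, but not exactly on, the 1111.112B figure.

```python
import math


def chinchilla_tokens(flops: float, slope: float = 0.502, intercept: float = 0.229) -> float:
    """Token count from the fitted log-log regression line."""
    return 10 ** (slope * math.log10(flops) + intercept)


GPU_HOURS = 1.3e6            # total compute budget
USABLE_FRACTION = 0.75       # discount for activation-checkpointing recomputation
ASSUMED_FLOPS_PER_GPU = 102e12  # assumption: sustained throughput per GPU

flops = USABLE_FRACTION * GPU_HOURS * 3600 * ASSUMED_FLOPS_PER_GPU
tokens = chinchilla_tokens(flops)
print(f"{tokens / 1e9:.0f}B tokens")
```

Either way, the estimate exceeds one trillion tokens, well above the roughly 700B tokens available, which is the point the paragraph makes.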
Since we are data limited, we choose the largest model that we can, while ensuring that we can train on all our tokens and still leave ~30% of the total compute budget as a buffer for unforeseen failures, retries, and restarts. This leads us to a 50B parameter model, which is also roughly the Chinchilla optimal size for our compute budget. Figure 1 provides a summary of the scaling laws and how BloombergGPT compares to other models.
To determine how to allocate the 50B parameters to different model components (i.e., the “shape” of our model), we follow Levine et al. (2020), who propose that for a total number of self-attention layers $L$, the optimal hidden dimension $D$ is obtained by:
$$D = \exp(5.039)\,\exp(0.0555 \cdot L)$$
We sweep $L$ over a range of integer values and pick the $(L, D)$ combination that yields a total of ~50B parameters. This leads to the choice of $L = 70$ and $D = 7510$ as our target shape parameters. However, we also want to follow the tradition that the hidden dimension is evenly divisible by the number of attention heads, with the quotient giving the attention head dimension. Furthermore, we want the dimensions to be multiples of 8 to achieve higher performance in Tensor Core operations NVIDIA (2023). We settle on 40 heads, each having a dimension of 192, resulting in a total hidden dimension of $D = 7680$ and a total of 50.6B parameters. Table 4 provides a summary of the hyper-parameters used in BloombergGPT.
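The final shape can be checked with a rough parameter count. The sketch uses the common approximation of $12 D^2$ weights per transformer layer, which assumes a feed-forward hidden size of $4D$ (an assumption here, not stated in this section), plus one tied embedding matrix. Biases and layer-norm parameters are ignored, so the total comes out slightly below the reported 50.6B.

```python
def approx_params(num_layers: int, hidden: int, vocab: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of a decoder-only transformer (no biases or norms)."""
    attention = 4 * hidden * hidden       # query, key, value and output projections
    ffn = 2 * ffn_mult * hidden * hidden  # up and down projections (assumed 4x width)
    embeddings = vocab * hidden           # input embeddings, tied to the output layer
    return num_layers * (attention + ffn) + embeddings


HEADS, HEAD_DIM = 40, 192
hidden = HEADS * HEAD_DIM                 # 7680
total = approx_params(num_layers=70, hidden=hidden, vocab=131_072)
print(f"hidden={hidden}, params={total / 1e9:.2f}B")
```

Both the head dimension (192) and the hidden dimension (7680) are multiples of 8, satisfying the Tensor Core constraint mentioned above.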
3 Training Configuration
BloombergGPT is a PyTorch model trained with a standard left-to-right causal language modeling objective. Following Brown et al. (2020), we want all our training sequences to be exactly the same length, in our case 2,048 tokens, to maximize GPU utilization. To achieve this, we concatenate all our tokenized training documents with an <|endoftext|> token as a document separator. We then break this token sequence into chunks of 2,048 tokens. Note that with this approach, each training sequence may contain multiple documents from different domains. Also note that, because we’re using ALiBi positional encoding, BloombergGPT can be applied to sequences longer than 2,048 at inference time. For optimization efficiency, training sequences are grouped together into batches, as described in more detail below.
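The packing scheme can be sketched in a few lines. The function name, the toy token ids, and the choice to drop a trailing partial chunk are assumptions for illustration; the text does not say how the final incomplete chunk is handled.

```python
from typing import Iterable, List


def pack_sequences(docs: Iterable[List[int]], eot_id: int, seq_len: int = 2048) -> List[List[int]]:
    """Concatenate tokenized documents with a separator, then cut fixed-length chunks."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot_id)  # document separator, e.g. the <|endoftext|> id
    # Assumption: a trailing chunk shorter than seq_len is discarded.
    usable = len(stream) - len(stream) % seq_len
    return [stream[i:i + seq_len] for i in range(0, usable, seq_len)]


# Toy example with seq_len=4 and separator id 0.
packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eot_id=0, seq_len=4)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

The second chunk in the toy output contains the end of one document and the start of the next, which is exactly the mixing of documents within a training sequence noted above.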
We use the AdamW optimizer (Loshchilov and Hutter, 2019). We set $\beta_1$ to 0.9, $\beta_2$ to 0.95, and weight decay to 0.1. Following Brown et al. (2020), we set the maximum learning rate to 6e-5 and use the cosine decay learning rate scheduler with linear warmup. We warm up the learning rate in the first 1800 steps. Following Hoffmann et al. (2022), the final learning rate is 0.1x the max learning rate, i.e. 6e-6. We also employ batch size warmup (Brown et al., 2020): in the first 7,200 steps, we use a batch size of 1,024 (2.1M tokens), then switch to a batch size of 2,048 (4.2M tokens) for the remainder of training.
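The learning-rate and batch-size schedules can be sketched as two small functions. The warmup length, peak and final learning rates, and batch sizes come from the text; the step at which the cosine decay reaches its floor is not stated here, so `total_steps` is left as a caller-supplied, hypothetical parameter.

```python
import math

MAX_LR = 6e-5
MIN_LR = 6e-6             # 0.1x the maximum learning rate
LR_WARMUP_STEPS = 1_800
BATCH_WARMUP_STEPS = 7_200
SEQ_LEN = 2_048


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR at total_steps."""
    if step < LR_WARMUP_STEPS:
        return MAX_LR * step / LR_WARMUP_STEPS
    progress = min(1.0, (step - LR_WARMUP_STEPS) / (total_steps - LR_WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


def batch_size(step: int) -> int:
    """Sequences per batch: 1,024 during batch-size warmup, 2,048 afterwards."""
    return 1_024 if step < BATCH_WARMUP_STEPS else 2_048


def tokens_per_batch(step: int) -> int:
    return batch_size(step) * SEQ_LEN
```

With 2,048-token sequences, the two batch sizes correspond to 2,097,152 and 4,194,304 tokens per step, matching the rounded 2.1M and 4.2M figures.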
We set dropout to 0.0 in all layers in our initial run, although we add dropout later as explained in §4. The model parameters are randomly initialized to samples from a normal distribution with zero mean and standard deviation $\sqrt{1/(3D)} = 0.006588$, where $D$ is the hidden dimension (Smith et al., 2022). Following Megatron-LM (Shoeybi et al., 2019), we rescale the standard deviation of the second layer in the MLP and the output layer of the attention by $1/\sqrt{2L}$, where $L$ is the number of layers. We use the technique of query_key_layer_scaling (Shoeybi et al., 2019), which was proposed to improve numerical stability for FP16 mixed-precision training but may also help in BF16.
LLM optimization requires running convex optimization algorithms over incredibly complex non-convex loss surfaces. Previous work has reported various instabilities while training LLMs. For example, Chowdhery et al. (2022) found that the loss spiked roughly 20 times while training PaLM, despite the fact that gradient clipping was enabled. They mitigated these issues by re-starting training from a checkpoint roughly 100 steps before the spike started, and then skipping 200–500 data batches. They hypothesized that spikes occur due to the combination of specific data batches with a particular model parameter state. Similarly, during OPT training, Zhang et al. (2022a) noticed spikes in the gradient and activation norms, or divergences in the training perplexity. After these behaviors, they lowered their learning rate, which stabilized these norms and allowed training to continue. Interestingly, Scao et al. (2022) report only a single loss spike, from which the model recovered on its own.
We use the Amazon SageMaker service provided by AWS to train and evaluate BloombergGPT. We use the latest version available at the time of training and train on a total of 64 p4d.24xlarge instances. Each p4d.24xlarge instance has 8 NVIDIA 40GB A100 GPUs with NVIDIA NVSwitch intra-node connections (600 GB/s) and NVIDIA GPUDirect using AWS Elastic Fabric Adapter (EFA) inter-node connections (400 Gb/s). This yields a total of 512 40GB A100 GPUs. For quick data access, we use Amazon FSx for Lustre, which supports up to 1000 MB/s read and write throughput per TiB storage unit.
4 Large-scale Optimization
To train BloombergGPT, which has a larger memory footprint than available GPU memory on cloud instances, we rely on stage 3 of ZeRO optimization (Rajbhandari et al., 2020). We utilize the proprietary SageMaker Model Parallelism (SMP) library from AWS, which enables the automatic distribution of large models across multiple GPU devices and instances (Karakus et al., 2021). After experimenting with various techniques, we achieve 102 TFLOPs on average and each training step takes 32.5 seconds. We find the following setup to be the best performing in our training.
ZeRO shards the training state (model parameters, gradients, and optimizer state) across a group of GPUs. We shard a model across 128 GPUs, and we have 4 copies of the model during training.
MiCS (Zhang et al., 2022b) decreases training communication overhead and memory requirements for cloud training clusters. MiCS includes such features as hierarchical communication, 2-hop gradient update, and scale-aware model partitioning.
Activation checkpointing (Chen et al., 2016) minimizes training memory consumption by removing activations at the expense of additional computation during backward passes. When a layer has activation checkpointing enabled, only the layer input and outputs are kept in memory following a forward pass, while any intermediate tensors are discarded from memory. During the backward pass, these intermediate tensors may be recomputed. We apply activation checkpointing to each transformer layer.
To reduce the memory requirements, forward and backward passes are done in BF16, while parameters are stored and updated in full precision (FP32). The ALiBi matrices are computed in full precision and stored in BF16. We also use FP32 to calculate fused softmax in the Attention block and store its results in BF16. Finally, the softmax calculations in the loss function are computed in FP32.
Another possibility for optimization is combining several operations into a single GPU operation. This can both reduce peak memory usage, by avoiding storage of intermediate results in the computation graph, and improve speed. Similar to Megatron-LM Shoeybi et al. (2019), we use a masked-causal-softmax fused kernel in SMP in the self-attention module. In practice, we observe a 4-5 TFLOPs improvement in speed, and avoid out-of-memory errors given the rest of the configuration.
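The reported throughput and sharding layout can be cross-checked with back-of-the-envelope arithmetic. The sketch assumes the common rule of thumb of about $6NT$ FLOPs for one forward and backward pass over $T$ tokens with $N$ parameters, plus an extra $2NT$ for the forward pass repeated by activation checkpointing. That rule is an approximation introduced here, not a formula from the text.

```python
PARAMS = 50.6e9          # model parameters
SEQ_LEN = 2_048          # tokens per sequence
BATCH = 2_048            # sequences per step after batch-size warmup
STEP_SECONDS = 32.5      # reported time per training step
NUM_GPUS = 64 * 8        # 64 instances with 8 GPUs each
SHARD_GROUP = 128        # GPUs that jointly hold one copy of the training state


def flops_per_step(params: float, tokens: int, recompute: bool = True) -> float:
    """Rule-of-thumb cost: 2NT forward + 4NT backward, plus 2NT if the forward is recomputed."""
    per_param_token = 8 if recompute else 6
    return per_param_token * params * tokens


tokens_per_step = SEQ_LEN * BATCH
tflops_per_gpu = flops_per_step(PARAMS, tokens_per_step) / STEP_SECONDS / NUM_GPUS / 1e12
replicas = NUM_GPUS // SHARD_GROUP
print(f"{tflops_per_gpu:.1f} TFLOPs per GPU, {replicas} model replicas")
```

Under these assumptions the estimate comes out at roughly 102 TFLOPs per GPU, in line with the reported average, and 512 GPUs in groups of 128 give the 4 model copies mentioned above. The ratio 8/6 also matches the additional 0.33x cost attributed to activation checkpointing.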
Training Run
The process of training BloombergGPT involved decisions along the way based on the progress of model training. We share some highlights of this process. A detailed presentation appears in the Training Chronicles (Appendix C). Figure 2 shows the learning curves for both training and validation sets. The solid lines show (smoothed) training loss and the dotted lines show loss on the held-out validation set. Changes in the color of the lines indicate changes to the optimization hyperparameter configurations, either as scheduled, or in response to increasing or stagnating validation loss. This plot shows the path taken by the successful model training run. To present a clear plot, the Figure does not show other attempts with different model configurations, overwritten partial runs after a rollback, or other training strategies not utilized in the final model.
We measured training loss every five steps on the current batch. The raw values vary wildly, causing large jitter when plotted. The plot smooths the training loss by showing a running average. Smoothing is not needed for the validation loss since it is measured on the entire validation set every 300 steps.
We trained for a total of 139,200 steps (~53 days) and ended model training after completing ~80% of one epoch through our training data (569B tokens out of the 709B tokens available). We ended training early because the loss on our held-out development set was no longer improving, although it’s possible that substantially longer training may have yielded further improvements.
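The token count can be recovered from the step count and the batch-size schedule. This worked calculation assumes every step processes a full batch of 2,048-token sequences, with 1,024 sequences per batch for the first 7,200 steps and 2,048 afterwards.

```python
SEQ_LEN = 2_048
WARMUP_STEPS, WARMUP_BATCH = 7_200, 1_024
TOTAL_STEPS, FULL_BATCH = 139_200, 2_048
CORPUS_TOKENS = 709e9

# Sequences processed during and after batch-size warmup.
sequences = WARMUP_STEPS * WARMUP_BATCH + (TOTAL_STEPS - WARMUP_STEPS) * FULL_BATCH
tokens_seen = sequences * SEQ_LEN
fraction_of_epoch = tokens_seen / CORPUS_TOKENS

print(f"{tokens_seen / 1e9:.1f}B tokens, {fraction_of_epoch:.1%} of one epoch")
```

The result, about 568.7B tokens, agrees with the 569B figure and corresponds to roughly four fifths of the 709B-token corpus.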
We began the run with a warm-up batch size of 1,024 for 7,200 steps, after which we switched to the regular batch size of 2,048 (color changes from black to blue). Change in batch size manifests as a visible curvature change in the validation loss at step 7,200. Most of the remainder of the training performed stably with decreasing training and validation losses. Intervention was required at later stages, after step 115,500, when we observed flat or increasing validation loss. We then applied the following corrective modifications in sequence:
Step 115,500 (blue to orange): Shrink learning rate to two-thirds
Step 129,900 (orange to green): Halve learning rate, and add dropout (with 0.1 probability)
Step 137,100 (green to red): Halve learning rate again
We ended the run at step 146,000 based on the lack of observable progress on the validation loss. We selected the checkpoint at step 139,200 as the final model based on validation loss and downstream evaluations.
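The cumulative effect of the three learning-rate interventions can be tallied with a short sketch. It assumes that each change multiplies whatever rate the base schedule prescribes and takes effect at the listed step; the names `INTERVENTIONS` and `lr_multiplier` are hypothetical, and the dropout change at step 129,900 is not modeled.

```python
from fractions import Fraction

# (step at which the change takes effect, multiplicative factor on the learning rate)
INTERVENTIONS = [
    (115_500, Fraction(2, 3)),  # shrink to two-thirds
    (129_900, Fraction(1, 2)),  # halve
    (137_100, Fraction(1, 2)),  # halve again
]


def lr_multiplier(step):
    """Return the cumulative factor applied to the scheduled learning rate at `step`."""
    multiplier = Fraction(1)
    for start, factor in INTERVENTIONS:
        if step >= start:
            multiplier *= factor
    return multiplier
```

Under these assumptions, the selected checkpoint at step 139,200 was trained with one-sixth of the learning rate the original schedule would have used at that step.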
Evaluation
We evaluated the performance of BloombergGPT on two broad categories of tasks: finance-specific and general purpose. The finance-specific tasks help us test our hypothesis that training on high-quality finance-specific data will yield better results on financial tasks. The general purpose tasks investigate whether the performance of our model is directly comparable to previously published results. For financial tasks, we assembled publicly available financial datasets that include a range of NLP tasks. Then, to directly test BloombergGPT’s ability on Bloomberg tasks of interest, we also included tasks drawn from Bloomberg-internal high-quality evaluation sets for sentiment analysis and named entity recognition. For general-purpose tasks, we draw from multiple existing benchmarks and group results into the following categories: BIG-bench Hard, Knowledge Assessments, Reading Comprehension, and Linguistic Tasks. The number of tasks per type and the definitions of the groups are presented in Table 5.
We compare BloombergGPT to the three closest models described in § 7 based on model size, type of training data, overall performance, and most importantly, access. An overview of the model sizes and compute is provided in Table 6.
GPT-NeoX (Black et al., 2022): According to Liang et al. (2022), this model is the best performing available model under 50B parameters.
OPT (Zhang et al., 2022a): We chose to compare to OPT since our model size and structure roughly match, though our model is smaller.
BLOOM (Scao et al., 2022): While this model is substantially larger than BloombergGPT, we use the same model architecture and software stack. We note that BLOOM is multilingual, so while it is much larger, it also is trained on data from more languages.
All three models use some of the same general-purpose datasets we use in our training corpus. We additionally report results from the original GPT-3 (Brown et al., 2020) whenever externally available. (Footnote 2: Another related general-purpose model at a comparable size, LLaMA (Touvron et al., 2023), was released during the preparation of this manuscript, but third-party evaluation results were not available and we have not received access to the model weights.)
We prefer running models ourselves to ensure identical evaluation setups, and we place any results that have been reported elsewhere and were not run by us into a separate group. To fairly compare the models, we avoid any tuning of prompts and other techniques that could lead to improved results for some, but not all, models. For that reason, every task is tested via “standard” prompting (shown in Table 7), i.e., without any parameter changes to the underlying model, without task descriptions, and without Chain-of-Thought prompting (Wei et al., 2022b). The number of few-shot examples presented to the model depends on the task, and we include these details in the respective sections. For each group of results, we further present a win rate similar to Liang et al. (2022) that represents the fraction of “wins” in side-by-side comparisons over individual tasks between all model pairs for which we have run the evaluation ourselves.
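One way to compute such a win rate is sketched below. The function name, the input layout, and the choice to count a tie as half a win for each side are assumptions made for illustration; higher scores are assumed to be better.

```python
from itertools import combinations


def win_rates(scores):
    """Pairwise win rate per model.

    `scores` maps model name -> {task name -> score}. Each pair of models is
    compared on every task both were run on; a tie counts as half a win for
    each side (an assumption).
    """
    wins = {model: 0.0 for model in scores}
    comparisons = {model: 0 for model in scores}
    for a, b in combinations(scores, 2):
        for task in scores[a].keys() & scores[b].keys():
            comparisons[a] += 1
            comparisons[b] += 1
            if scores[a][task] > scores[b][task]:
                wins[a] += 1.0
            elif scores[b][task] > scores[a][task]:
                wins[b] += 1.0
            else:
                wins[a] += 0.5
                wins[b] += 0.5
    return {m: (wins[m] / comparisons[m] if comparisons[m] else 0.0) for m in scores}
```

With three hypothetical models scored on two tasks, each model takes part in four comparisons, so a model that wins three of them has a win rate of 0.75.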
For tasks where a set of candidates is given, we perform likelihood-based classification, following Brown et al. (2020). We consider three methods for classification: regular, calibration, and normalization. Formally,
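Regular: $\arg\max_{\alpha} p(\alpha \mid s)$
Calibration: $\arg\max_{\alpha} p(\alpha \mid s) / p(\alpha \mid \text{``Answer:''})$
Normalization: $\arg\max_{\alpha} p(\alpha \mid s) / \mathrm{len}(\alpha)$
where $\alpha$ is a candidate, $s$ is the context, and len measures the number of sub-word tokens. We report the performance of the best method for each model and task. For other tasks, we perform generation via greedy decoding.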
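The three scoring rules can be sketched in log space as follows. The sketch assumes that calibration divides a candidate's probability by its probability under a neutral context, as in Brown et al. (2020), and that normalization divides the probability by the candidate's token count; the function and argument names are hypothetical, and the log-probabilities are taken as given rather than computed from a model.

```python
import math


def classify(cand_logprobs, method="regular", neutral_logprobs=None, lengths=None):
    """Return the index of the selected candidate.

    cand_logprobs[i]    : log p(candidate_i | context)
    neutral_logprobs[i] : log p(candidate_i | neutral context), for "calibration"
    lengths[i]          : number of sub-word tokens in candidate_i, for "normalization"
    """
    def score(i):
        lp = cand_logprobs[i]
        if method == "regular":
            return lp
        if method == "calibration":
            # log of p(candidate | context) / p(candidate | neutral context)
            return lp - neutral_logprobs[i]
        if method == "normalization":
            # log of p(candidate | context) / len(candidate)
            return lp - math.log(lengths[i])
        raise ValueError(f"unknown method: {method}")

    return max(range(len(cand_logprobs)), key=score)
```

Because the logarithm is monotone, taking the argmax over these differences selects the same candidate as the argmax over the corresponding probability ratios, while avoiding underflow for long candidates.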
We use the official split and report performance on the test set whenever possible. If the test labels are not publicly available, we report performance on the dev set instead. If an official split for a dataset does not exist, we create train and test splits by selecting 20% of examples as the test set and the rest as the training set. All few-shot context examples are sampled from the training set. To reduce the variance of few-shot evaluation, we sample different shots for each test example, unless otherwise specified. For the sake of consistency, for each test example, all models have identical surface form as input in our evaluation.
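The splitting and shot-sampling procedure can be sketched as follows. The seeding scheme is an assumption chosen so that the shots drawn for a given test example are reproducible and therefore identical across models; the function names are hypothetical.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle once and hold out `test_fraction` of the examples as the test set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def shots_for(train, test_index, k, seed=0):
    """Sample k distinct training examples for one test example.

    Seeding on the test index gives each test example its own shots while
    keeping the prompt identical for every model being compared.
    """
    rng = random.Random(seed * 1_000_003 + test_index)
    return rng.sample(train, k)
```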
2 Heldout Loss
We begin by testing how well BloombergGPT models the language distribution of the in-distribution finance data. We evaluate the bits per byte of the different models on a heldout dataset that contains examples from all sections of FinPile (described in §2). To limit data leakage and better simulate real-world usage of LLMs, we select a temporally heldout dataset that is strictly further in the future than the training set, and perform deduplication between the training and heldout set. During evaluation, for documents that are longer than 2,048 tokens, we use a sliding window approach with half window size as context. That means that any token beyond the first 2,048 has at least 1,024 tokens as context during prediction. We report the loss breakdown by the type of document in FinPile.
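The sliding-window scheme and the bits-per-byte metric can be sketched as follows. The sketch assumes that the first window scores all of its tokens, that each later window re-reads the last 1,024 already-scored tokens as context, and that bytes are counted in UTF-8; the function names are hypothetical.

```python
import math


def sliding_windows(n_tokens, window=2048):
    """Return (start, end, score_from) triples covering a document.

    tokens[start:end] is fed to the model and only tokens[score_from:end]
    contribute to the loss, so every token is scored exactly once.
    """
    context = window // 2
    spans = []
    start, scored_up_to = 0, 0
    while scored_up_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_up_to))
        scored_up_to = end
        start = end - context  # reuse the last half window as context
    return spans


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    return total_nll_nats / (math.log(2) * len(text.encode("utf-8")))
```

For a 5,000-token document this yields four windows, and every token after the first 2,048 is predicted with exactly 1,024 tokens of preceding context. Normalizing by bytes rather than tokens keeps the metric comparable across models with different tokenizers.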
Figure 3 shows that BloombergGPT consistently outperforms other models. While this is expected and mainly serves as a sanity check, it also provides valuable insight into the generalization capabilities of the other models. For example, the gap to BloombergGPT is most significant in the Filings category, likely because these documents, while public, are typically in PDF format and thus not included in any existing datasets.
3 Financial Tasks
The NLP tasks most often considered in finance are also common in the broader NLP literature, but these tasks take on different characteristics and challenges when performed on financial data. Take the example of sentiment analysis, where a headline such as “COMPANY to cut 10,000 jobs” portrays negative sentiment in the general sense but can at times be considered positive for financial sentiment towards COMPANY, as it might result in the stock price or investor confidence increasing. We use a combination of public and internal benchmarks to assess the performance of BloombergGPT, BLOOM, GPT-NeoX, and OPT. All task types considered and their corresponding prompt templates are shown in Table 7.
Our public financial benchmarks include four tasks from the FLUE benchmark (Shah et al., 2022) and the ConvFinQA dataset (Chen et al., 2022). As LLM performance on most of these financial tasks has not been broadly reported, there is no standard testing framework. Thus, we adapt them to a few-shot setting (see § 5.1). Our guiding principle in designing the experiments was to select the number of shots such that the average performance across all the models was best. While non-LLM numbers of custom models for these tasks are available, we omit reporting them here due to differences in the evaluation setup. As a result, our claims are restricted to comparisons of LLMs. We evaluate on the following tasks (more details provided in Appendix B):
FPB (Malo et al., 2014): The Financial Phrasebank Dataset includes a sentiment classification task on sentences from financial news. Any news that could benefit/hurt an investor is considered positive/negative and neutral otherwise. We create our own splits and report F1 score weighted by support in a 5-shot setup.
FiQA SA (Maia et al., 2018): The second sentiment analysis task is to predict the aspect-specific sentiment in English financial news and microblog headlines, which were published as a part of the 2018 challenge on financial question answering and opinion mining. While the original dataset is annotated on a continuous scale, we discretize the data into a classification setup with negative, neutral, and positive classes. Like with FPB, we create our own splits including microblogs and news, and use a 5-shot setup, reporting weighted F1.
Headline (Sinha and Khandait, 2020): This is a binary classification task of whether a news headline in the gold commodity domain includes certain information. This human-annotated dataset consists of English news headlines about “gold”. Each news article carries a subset of the following tags: “price or not”, “price up”, “price down”, “price stable”, “past price”, “future price”, “past general”, “future general”, “asset comparison”. We verbalize each tag into a question using the official documentation, use 5 shots, and report the average weighted F1 score across all categories.
NER (Salinas Alvarado et al., 2015): This is a named entity recognition task on financial data gathered for credit risk assessment from financial agreements filed with the SEC. The annotated entity types follow the standard CoNLL format (Tjong Kim Sang and De Meulder, 2003) and are annotated with PER, LOC, ORG, and MISC. As it is nontrivial to learn to predict empty outputs in few-shot setups, we drop sentences that do not contain any entity. We further drop MISC tags due to their ambiguous definition. All the models required more shots to perform well and we thus selected 20 shots and report the entity-level F1 score.
ConvFinQA (Chen et al., 2022): Given input from S&P 500 earnings reports that includes text and at least one table with financial data, the task is to answer conversational questions that require numerical reasoning over the input. This task requires numerical reasoning and an understanding of structured data and financial concepts, and the model must relate follow-up questions to earlier dialog turns.
For ConvFinQA, an entire gold conversation and its context are used as input to the models. As each “turn” of the conversation concludes, the “turn” along with the answer for that turn is appended as context for future turns. We report the exact match accuracy on the public development set.
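The turn-by-turn construction of the ConvFinQA input can be sketched as follows. The `Question:`/`Answer:` layout and the function name are placeholders for illustration only and are not the prompt template we used.

```python
def build_turn_inputs(context, questions, gold_answers):
    """Build one model input per conversation turn.

    Each input contains the report context, all earlier turns with their
    gold answers, and the current question. The line layout is hypothetical.
    """
    inputs = []
    history = []
    for question, answer in zip(questions, gold_answers):
        lines = [context, *history, f"Question: {question}", "Answer:"]
        inputs.append("\n".join(lines))
        # Once the turn concludes, the turn and its gold answer join the context.
        history.extend([f"Question: {question}", f"Answer: {answer}"])
    return inputs
```

Appending the gold answer rather than the model's own prediction means an error in one turn does not propagate to the inputs of later turns.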
BloombergGPT performs best of all models for four of the five tasks (ConvFinQA, FiQA SA, FPB, and Headline) and comes in second in NER (Table 8). Consequently, BloombergGPT also has the highest win rate among all the models that we tested. The gap to equally-sized models is especially pronounced for ConvFinQA which is challenging due to the requirement to use conversational input to reason over tables and generate an answer.
3.2 Internal Task: Sentiment Analysis
For the Bloomberg-internal tasks, we consider aspect-specific sentiment analysis, which is prevalent in financial literature. All of the datasets we use are in English.
Our annotation process consists of a discovery phase during which we establish the annotation and sampling procedures, understand how many annotators are typically required per example, and determine the level of training that is needed for the annotators (Tseng et al., 2020). Depending on the complexity of the task, our annotators are a dedicated team of financial experts at Bloomberg, consultant workers, or a combination of both. In each case, ties are resolved by adjudication from additional annotators and ambiguous examples are excluded. All the datasets in this section were annotated by 2 annotators with a third annotator breaking any ties.
We measure the performance of LLMs for the internal datasets using a five-shot evaluation, similar to the external datasets. As the datasets are large, we randomly sample at most 1k test examples. We report F1 weighted by the support of each label. Note that, similar to the external datasets, it is likely that the unlabeled versions of the data used in our internal datasets occur in FinPile and are therefore seen by BloombergGPT during training. However, since some of FinPile is also available on the web, other LLMs we compare against may have also been trained on unlabeled versions of this data. Dataset statistics are provided in Table 9.
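For reference, F1 weighted by label support can be computed as in the following sketch, which averages per-label F1 scores with weights proportional to each label's frequency in the gold annotations. The function name is hypothetical and a non-empty evaluation set is assumed.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights equal to each label's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because n >= 1.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(gold)
```

Weighting by support keeps a rare label from dominating the score, which matters for sentiment data in which one class is much more frequent than the others.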
Equity News Sentiment: This task is to predict the aspect-specific sentiment expressed in the news story toward a company. The dataset consists of English news stories from Bloomberg, premium, and web content. Annotations of “positive”, “negative”, or “neutral” indicate that the news story is likely to increase, decrease, or not change the long-term investor confidence in the company.
Equity Social Media Sentiment: The task is similar to “Equity News Sentiment” but instead of news, we use financially-relevant English social media content.
Equity Transcript Sentiment: This task is also similar to “Equity News Sentiment” but instead of news, we use transcripts from company press conferences. The transcripts are made available through the use of speech recognition and at times, human edits. Long transcripts are processed in chunks, and each chunk in our dataset typically contains between 70 and 80 tokens.
ES News Sentiment: While this task is to predict the aspect-specific sentiment expressed in the news story towards a company (aspect), the goal is not to indicate effect on investor confidence. The stories are annotated “positive”, “negative”, or “neutral” if the news story contains content that reflects good, bad, or neutral news about the company’s environmental and social policies.
Country News Sentiment: This task is different from the other sentiment tasks in that the goal is to predict the sentiment expressed in the news story towards a country. The dataset consists of English news stories from Bloomberg, premium, and web content. The stories are annotated “positive”, “negative”, or “neutral” if the news story alludes to the growth, shrinkage, or status quo of that country’s economy.
Table 10 shows that across the four internal aspect-specific sentiment tasks BloombergGPT performs better than all the other tested models, by a wide margin. The only task in which the models perform similarly is the social media sentiment task, while BloombergGPT outperforms the other models by at least 25 and up to over 60 points in the other three.
3.3 Exploratory Task: NER
Even though NER is a well-established NLP task with state-of-the-art results using BERT (Wu and Dredze, 2019; Luoma and Pyysalo, 2020) and T5 (Liu et al., 2022) style models, NER is largely an unexplored task for generative LLMs. NER is not in HELM (Liang et al., 2022), there is a single (Polish) task in BIG-bench (Srivastava et al., 2022), and none of the LLM papers we study report NER performance. Hence, we consider NER as an exploratory task and report preliminary NER results given its importance in the financial sector.
There are a few reasons why NER may be a difficult task for generative LLMs. NER is an information extraction task, and a better fit for encoder-decoder or encoder-only architectures. The generative nature of LLMs does not confer an advantage for NER. We find that extensive prompt engineering and a greater number of shots are required to obtain reasonable results for NER than for other tasks. Finance-specific NER has subtleties that make it especially difficult for zero or few-shot learning.
For example, consider the (fabricated) headline “Bloomberg: Mr. Musk adds new features to Twitter and comments on China”. Depending on our annotation guidelines and downstream task needs: (a) the reporting news organization “Bloomberg” can be tagged or not, depending on whether we want only salient entities, (b) “Mr. Musk” or just “Musk” is the PER to be tagged, (c) “Twitter” can be tagged as an ORG or a PRD (product) as features are added to the Twitter product and not the organization, and (d) “China” can be tagged ORG or LOC, though the right tag is likely ORG. Without adding extensive annotation guidelines in the prompt, the LLM does not know the intended tagging behavior.
Based on preliminary testing, we determined the following setting to obtain the best performance on the internal NER tasks from all models. First, we restrict the entity types to be predicted to ORG, PER, and LOC; in all, this filtered out less than 1% of entities. Second, we remove all documents that contain no entities (i.e., all “O”’s). Both of these modifications are intended to increase the usefulness of the examples seen in few-shot prompting. We expect that further work on prompt engineering for NER could produce better results.
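These two filtering steps can be sketched as follows. The sketch assumes BIO-style tags such as `B-ORG` and assumes that entities of other types are relabeled as “O” rather than their documents being dropped; both choices, along with the function name, are illustrative.

```python
def filter_documents(docs, kept_types=frozenset({"ORG", "PER", "LOC"})):
    """Restrict tags to `kept_types` and drop documents left with no entities.

    `docs` is a list of (tokens, tags) pairs with BIO-style tags (assumption).
    """
    kept = []
    for tokens, tags in docs:
        new_tags = [
            tag if tag != "O" and tag.split("-", 1)[-1] in kept_types else "O"
            for tag in tags
        ]
        # A document whose tags are all "O" offers no useful few-shot signal.
        if any(tag != "O" for tag in new_tags):
            kept.append((tokens, new_tags))
    return kept
```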
We consider seven Bloomberg internal NER datasets from different domains.
BN NER: This is a named entity recognition task on entities occurring in English long-form Bloomberg news content (the “BN wire”) between 2017 and 2020.
BFW NER: Similar to “BN NER” but instead of using the long-form BN wire, we use short-form stories from the “Bloomberg First Word” wire between 2018 and 2020.
Filings NER: The goal of this task is to identify entities that occur in mandatory financial disclosures filed by companies. The dataset contains filings sampled between 2016 and 2019.
Headlines NER: The goal of this task is to identify entities that occur in headlines of English Bloomberg news content. The dataset contains headlines sampled between 2016 and 2020.
Premium NER: The goal of this task is to identify entities that occur in a subset of the third-party English news content ingested by Bloomberg. The dataset contains stories sampled between 2019 and 2021.
Transcripts NER: The goal of this task is to identify entities that occur in transcripts of company press conferences. The dataset contains transcripts from 2019.
Social Media NER: The goal of this task is to identify entities that occur in English financially-relevant social media content. The dataset contains social media content sampled between 2009 and 2020.
As our datasets are substantial, we randomly sample 4,000 training and 500 testing examples from each filtered internal dataset. We utilize 20-shot prompts and evaluate using F1. The results from the internal NER tasks are mixed (Table 12). The much larger BLOOM wins most of the NER tasks. Of the like-sized models, BloombergGPT performs the best, placing first once (Headlines), second four times (BN, Premium, Transcripts, Social Media), third once (BFW), and last once (Filings).
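An entity-level F1 of the kind used for NER can be sketched as follows. The sketch assumes exact matching on (text, type) pairs and micro-averaging over documents; the function name and the gold labels in the usage note are hypothetical.

```python
def entity_f1(gold_docs, pred_docs):
    """Micro-averaged entity-level F1.

    Each document is an iterable of (entity text, entity type) pairs; a
    prediction counts only if both the span text and the type match exactly.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

For the fabricated headline discussed earlier, if the hypothetical gold labels were (“Musk”, PER), (“Twitter”, ORG), and (“China”, ORG), a model that returned (“Mr. Musk”, PER), (“Twitter”, ORG), and (“China”, LOC) would score only one true positive, which illustrates how sensitive the metric is to annotation conventions.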
Named entity disambiguation (NED) links entity mentions to known entities in knowledge bases or other structured information sources. Within the financial world, we seek to link text mentions of companies to their ticker symbols, an abbreviation that uniquely identifies publicly traded shares of a particular stock on a particular stock market.
We directly test the ability of an LLM to complete this task by evaluating a joint NER+NED task: identify the stock tickers of companies mentioned in a document. This requires the model to first identify company mentions and then generate the corresponding stock ticker. For example, given “AAPL announced that they will stop using Intel chips in future products.” the correct NER output would be “AAPL, Intel” while the correct NER+NED output would be “AAPL, INTC”.
One of the advantages of this task is that it is robust to variations in extracting the exact text span. While NER evaluation requires exact matches, tickers may be successfully produced without first identifying spans. Furthermore, it evaluates a model’s knowledge of companies, their various surface forms, and company to ticker mappings.
We create evaluation data with linked tickers for this task by running a state-of-the-art entity linking system for companies in financial data over the Bloomberg internal NER annotated documents from each domain. We remove documents with no linked tickers. Following our NER evaluations, we randomly sample 4,000 training and 500 testing examples from each filtered internal dataset. We utilize 20-shot prompts and evaluate using F1.
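Scoring the joint NER+NED output for a single document can be sketched as follows. The sketch assumes the model emits a comma-separated list of tickers and that comparison is case-insensitive; the function names are hypothetical, and how per-document scores are aggregated is left open.

```python
def parse_tickers(output):
    """Split a comma-separated model output into a set of upper-cased tickers."""
    return {item.strip().upper() for item in output.split(",") if item.strip()}


def ticker_f1(gold_tickers, model_output):
    """Set-based F1 between gold tickers and the tickers in the model output."""
    gold = {ticker.upper() for ticker in gold_tickers}
    pred = parse_tickers(model_output)
    tp = len(gold & pred)
    denominator = len(gold) + len(pred)  # equals 2TP + FP + FN
    return 2 * tp / denominator if denominator else 0.0
```

On the example sentence, the output “AAPL, INTC” scores 1.0, whereas the NER-style output “AAPL, Intel” scores 0.5 because the company name was not linked to its ticker. No text span is involved in the comparison, which is why the task is robust to span-boundary variation.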
Table 12 shows that BloombergGPT outperforms all other models by a large margin, except on social media data where it comes in second behind BLOOM. In our social media data, companies are often referenced by their tickers, removing the requirement of the model to link the mention and reverting the task to NER. These results further underscore the advantage of BloombergGPT for financial tasks.
4 BIG-bench Hard
We now turn to evaluate BloombergGPT on standard, general-purpose NLP tasks. While the focus of our model is on financial tasks, our inclusion of general-purpose training data may help improve not only the financial tasks, but also allow our model to perform well on more standard NLP datasets. We start with BIG-bench Hard (Suzgun et al., 2022), a subset of the most challenging tasks in BIG-bench (Srivastava et al., 2022). It only includes tasks in which the best available model at construction was unable to achieve a performance higher than the average human rater via standard prompting techniques.
Results for each task are shown in Table 13. Overall, while BloombergGPT falls behind the much larger PaLM (10x parameters) and BLOOM (3.5x parameters), it is the best-performing among similarly sized models. In fact, its performance is closer to BLOOM than it is to either GPT-NeoX or OPT. It further achieves the best performance of all models in date understanding, hyperbaton (ordering of adjectives), and tracking shuffled objects. In sum, according to this benchmark, we find that developing finance-specific BloombergGPT did not come at the expense of its general-purpose abilities.
5 Knowledge Assessments
We next assess knowledge, which we define as the ability to recall information seen during model training, via scenarios that have the model answer questions without providing additional context or resources (closed-book question answering). This includes multiple-choice questions, and we report accuracy. We follow the template of Brown et al. (2020). The list of scenarios is as follows:
ARC (Clark et al., 2018): Multiple-choice questions collected from 3rd to 9th grade science exams, includes easy and challenging splits.
CommonsenseQA (Talmor et al., 2019): Multiple-choice QA dataset that requires different types of commonsense knowledge.
MMLU (Hendrycks et al., 2021): Manually collected multiple-choice knowledge questions in 57 subjects.
PhysicalQA (PiQA, Bisk et al., 2020): Questions about how the physical world works.
BloombergGPT achieves the highest performance among BLOOM, GPT-NeoX, and OPT in one task, and comes second in the other three (Table 14). Similar to the previous section, it outperforms models of similar size while almost being on par with the much larger models. The Massive Multitask Language Understanding (MMLU, Hendrycks et al., 2021) covers 57 different subjects and thus has a much wider coverage than the tasks described above. The aggregated results in Table 15 paint a more consistent picture and follow the insights seen in BIG-bench Hard. BloombergGPT consistently outperforms OPT, which in turn outperforms GPT-NeoX, while GPT-3 performs best. In contrast to the previous sections, BloombergGPT also outperforms BLOOM in this category, although by a slim margin. It falls behind the reported performance of GPT-3, especially in the social science category. The gap to GPT-3 is smallest in the STEM and “Other” domains, which include finance and accounting-related questions.
6 Reading Comprehension
We define reading comprehension benchmarks as tasks in which the model can generate the correct response based on information contained in the presented input text. Our grouping includes open-book QA tasks, as opposed to Brown et al. (2020), who separate them into a different category. We follow the template of Brown et al. (2020), and report accuracy. We include the following tasks:
BoolQ (Clark et al., 2019): Yes/No questions about a passage from Wikipedia.
OpenBookQA (Mihaylov et al., 2018): Multiple-choice elementary-level science questions, given a book of science facts, applied to new situations.
RACE (Lai et al., 2017): A multiple choice dataset of middle and high school English examinations.
Multi-Sentence Reading Comprehension (MultiRC, Khashabi et al., 2018): Short paragraphs and multi-sentence questions.
Reading Comprehension with Commonsense Reasoning (ReCoRD, Zhang et al., 2018): Automatically generated questions about CNN and Daily Mail news articles.
Table 16 reflects a similar ranking as in the above evaluations: while GPT-3 has the highest performance, BloombergGPT is a close second. Except for OpenBookQA, the performance of BloombergGPT is the highest among BLOOM, GPT-NeoX, and OPT. Surprisingly, BLOOM falls behind significantly in this category.
7 Linguistic Tasks
We define as linguistic tasks those scenarios that are not directly connected to user-facing applications. These include tasks that evaluate disambiguation, grammar, or entailment. These tasks are designed to directly assess a model’s ability to understand language. We follow the template of Brown et al. (2020), and report accuracy. The list of tasks is as follows:
Recognizing Textual Entailment (RTE, Dagan et al., 2007; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009): Given two text fragments, identify whether the meaning of one text is entailed by the other.
Adversarial NLI (ANLI, Nie et al., 2020): Adversarially constructed entailment detection.
CommitmentBank (CB, De Marneffe et al., 2019): Naturally occurring discourses whose final sentence contains a clause-embedding predicate.
Choice of Plausible Alternatives (COPA, Gordon et al., 2011): Premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
Words in Context (WIC, Pilehvar and Camacho-Collados, 2019): Determine if a word is being used with the same meaning in two sentences.
Winograd (Levesque et al., 2011): Determine which word a pronoun refers to when it is semantically unambiguous.
Winogrande (Sakaguchi et al., 2019): Adversarially mined challenging Winograd examples.
HellaSWAG (Zellers et al., 2019): Pick the best ending to a story or set of instructions.
StoryCloze (Mostafazadeh et al., 2016): Select the correct ending sentence for five-sentence long stories.
The results (Table 17) for linguistic tasks follow a similar trend to the knowledge category. BloombergGPT falls slightly behind GPT-3 and outperforms the other models. Similar to the reading comprehension category, BLOOM falls behind BloombergGPT.
8 Summary
Across dozens of tasks in many benchmarks, a clear picture emerges. Among the models with tens of billions of parameters that we compare to, BloombergGPT performs the best. Furthermore, in some cases, it is competitive with or exceeds the performance of much larger models (hundreds of billions of parameters). While our goal for BloombergGPT was to be a best-in-class model for financial tasks, and we included general-purpose training data to support domain-specific training, the model has still attained abilities on general-purpose data that exceed similarly sized models, and in some cases match or outperform much larger models.
Qualitative Samples
We now share qualitative examples from our model that highlight the benefits of our domain specialization.
One use case for BloombergGPT is to make interactions with financial data more natural. An existing way to retrieve data is via the Bloomberg Query Language (BQL). BQL can be used to interact with different classes of securities, each with its own fields, functions, and parameters. BQL is an incredibly powerful but complex tool. As we show in Figure 4, BloombergGPT can be utilized to make BQL more accessible by transforming natural language queries into valid BQL.
Other use cases that are well supported are in the news space. Since it is trained on many news articles, it can be used for many news applications and assist journalists in their day-to-day work. For example, when constructing newsletters, journalists may have to write short headlines for each new section. While a dedicated model to help with this task may be too expensive to maintain, BloombergGPT performs well out of the box (Figure 5).
Due to the financial domain training data, we are able to query BloombergGPT for knowledge relevant to the financial world. For example, it performs well at identifying the CEO of a company. Figure 6 shows several examples including output from other models. While BloombergGPT correctly identifies the CEOs, GPT-NeoX does not, and FLAN-T5-XXL completely fails, consistently ignoring the company and instead predicting the CEO at Cirrus Logic who was included in the prompt. While BloombergGPT does not perfectly solve this task and makes mistakes, we were not able to find any example where the other models solved the task while BloombergGPT did not.
Related Work
Language modeling has a long history in the NLP community. The idea of training a probabilistic language model for scoring word sequences was likely first introduced by Jelinek (1976). N-gram models were popular for decades (Brown et al., 1992), and were trained on corpora up to 2 trillion tokens (Brants et al., 2007). Research on training language models accelerated over the last decade due to innovations in machine learning, data availability, and compute. Early work in autoregressive language modeling (e.g., Mikolov et al., 2010; Sutskever et al., 2011) used recurrent neural networks, but these were small models trained on small datasets. The introduction of the transformer architecture (Vaswani et al., 2017) facilitated the scaling of these models in terms of data, compute, and the number of parameters.
The process of developing models that could better approximate the distribution of language over large corpora led to the discovery that the representations these models produce are useful starting points for many downstream tasks. This was demonstrated by Radford et al. (2018) and Howard and Ruder (2018) who showed that generative pretraining with an autoregressive language modeling objective achieves strong performance in transfer learning. Radford et al. (2019) further showed scaling the model size and training data led to autoregressive language models that perform well in different downstream tasks without any additional supervised fine-tuning.
Brown et al. (2020) showed that further scaling the models led to the emergence of new model capabilities and increased model robustness. Since the release of GPT-3 by Brown et al. (2020), many other researchers built large language models to study data quantity, data quality, network architecture, parameter scaling, data scaling, tokenization, and open-sourcing strategies (Raffel et al., 2020; Zhang et al., 2022a; Black et al., 2022; Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Lieber et al., 2021; Zeng et al., 2022; Tafjord and Clark, 2021; Smith et al., 2022; Scao et al., 2022; Taylor et al., 2022; Lin et al., 2022; Soltan et al., 2022; Thoppilan et al., 2022; Bao et al., 2022; Sanh et al., 2022; Roller et al., 2021; Glaese et al., 2022; Wang et al., 2021; Peng et al., 2022, among many others).
The value of domain-specific training for masked (encoder only) language models is well established. Commonly accepted approaches are to train BERT models (Devlin et al., 2019) from scratch on domain-specific data or to continue pretraining an existing model on new domain-specific data (Gururangan et al., 2020). Following these strategies, BioBERT (Lee et al., 2020) adapts BERT to the biomedical domain and SciBERT is trained on scientific publications (Beltagy et al., 2019). The results of these papers showed that in-domain training allows models to outperform previous state-of-the-art models in a variety of biomedical text mining tasks. Further examples of this paradigm are ClinicalBERT for the clinical domain (Huang et al., 2019), BioMed-RoBERTa for scientific biomedical papers (Gururangan et al., 2020), and BERTweet and Bernice for Twitter data (Nguyen et al., 2020; DeLucia et al., 2022).
Since the training of auto-regressive (decoder-only) language models of more than 10B parameters is significantly more costly than training masked LMs under 1B parameters, there have been far fewer examples of domain-specific autoregressive models. However, existing approaches follow the same two strategies. Among approaches that adapt an existing model, medPaLM (Singhal et al., 2022) adapted PaLM to the biomedical domain and Minerva (Lewkowycz et al., 2022) adapted it to mathematical reasoning tasks.
Recently, several examples of from-scratch trained decoder-only models for domain-specific data have emerged. One popular domain is protein sequences since they can be represented using language-like sequences but are not covered by natural language models (e.g., Lin et al., 2022; Xiao et al., 2021; Nijkamp et al., 2022). However, there can be benefits even for models in natural language domains. Galactica is trained exclusively on a large collection of scientific datasets, and includes special processing to handle scientific notations (Taylor et al., 2022). While performing very well on scientific tasks, Galactica, surprisingly, also performs well on more standard NLP tasks. BioGPT (Luo et al., 2022) and BioMedLM (Bolton et al., 2023) are both smaller GPT-style models trained on biomedical data. Lehman et al. (2023) compare encoder/decoder models trained exclusively on domain-specific data with those adapted from general-purpose training. Researchers working on large generative language dialog models have reached similar conclusions about the benefits of using domain-specific training data (Zhang et al., 2020; Roller et al., 2021; Thoppilan et al., 2022).
These findings highlight the advantages of in-domain pretraining, especially if sufficient data is available, as it is in our case. Inspired by the general capabilities of Galactica, we augment our private data with public data with the goal of investigating whether a model can gain in-domain capabilities without sacrificing general-domain performance.
Large corpora of raw text data are critical for training LLMs. As a result, there are now several corpora available that cover a wide range of sources.
The Colossal Clean Crawled Corpus (C4, Raffel et al., 2020) draws from Common Crawl to create a processed training corpus. The Pile is a carefully curated corpus that contains a wide range of data sources (Gao et al., 2021). These datasets are built on or include web crawls (OpenWebText2) augmented with an array of data from high-quality sources (Pubmed, Arxiv). Various efforts aim to clean datasets, especially web data, by removing unwanted or harmful text (Touvron et al., 2023; Rae et al., 2020). BLOOM (Scao et al., 2022) carefully selected data sources and included various filtering mechanisms (Jernite et al., 2022).
While crawling the web is an effective strategy for obtaining large amounts of diverse data, even robust cleaning efforts still leave data artifacts, duplicates (Carlini et al., 2020), and various types of toxic language (Welbl et al., 2021), and can lead to unintended marginalization of minority voices (Xu et al., 2021). Dodge et al. (2021) studied C4 to better understand the metadata, and the included and excluded data. Their findings suggest that C4 contains machine-generated text, is biased due to exclusion filters, and might contain examples drawn from evaluation datasets for NLP tasks. A similar effort was undertaken by Zeng et al. (2022) to document the pre-processing they undertook to train their Chinese large language model.
Lee et al. (2022a) investigated the impact of deduplication on model performance for several datasets and found that deduplication reduces the emission of memorized training data, allows better estimation of the generalization error, and improves training time and cost without impacting performance. These insights highlight the importance and challenges of constructing high-quality training corpora. As discussed in §2, Bloomberg’s core business curates and provides access to datasets, which we use to construct a high-quality dataset FinPile to train BloombergGPT, resulting in best-in-class financial performance.
The tasks addressed by language models have vastly increased and require a very different evaluation process from traditional task-specific systems. There have been two paradigms for LLM evaluation: The first is to evaluate a model in many different scenarios via automatic evaluation (Liang et al., 2022; Srivastava et al., 2022) and the second is to perform extrinsic and task-specific evaluations by integrating them into user workflows (e.g., Lee et al., 2022b; Goyal et al., 2022).
While the second strategy is necessary for assessing deployments of models in products, it is infeasible to run these human evaluations at a scale of the first strategy and it is thus standard to follow the first strategy when introducing new models. In our case, we combine multiple general-purpose evaluations from multiple existing benchmarks that have different goals. Srivastava et al. (2022) aim for maximum coverage by soliciting tasks from the entire research community, while HELM (Liang et al., 2022) suggests evaluation in various “scenarios” that are represented through specific datasets. Earlier language model papers developed their own evaluation schemata (Brown et al., 2020). While these benchmarks allow for a side-by-side comparison between models, it is challenging to ensure that all experimental parameters (prompts, decoding strategies, few-shot examples, etc.) are the same. For that reason, we differentiate between reported and verified numbers in our evaluation (§5).
Beyond the general-purpose evaluation, we also require a targeted domain evaluation. Prior domain-specific models like Galactica (Taylor et al., 2022) chose a set of tasks that the model is likely to perform well on. In their case, these were various scientific tasks. However, there exists no standard benchmark for the financial NLP domain. While the recent work on FLUE (Shah et al., 2022) aims to provide such a benchmark, it has limited coverage of relevant tasks, no suggested evaluation strategy for few-shot learning, and the quality of some annotations is low. To provide externally comparable results, we developed a few-shot strategy for FLUE, but also decided to augment the publicly available evaluation tasks with company-internal benchmarks.
Large language model training remains expensive in terms of the computational cost and human effort to assemble data and train the model. Determining the optimal amount of training data and model shape and size for the best utilization of resources becomes important.
Kaplan et al. (2020) first studied the dependence of language model performance on architecture, parameter size, compute power, and dataset size. They reported that performance on the autoregressive language modeling objective improves smoothly, following a power law, with the number of model parameters, the dataset size, and the amount of compute. A similar investigation by Hernandez et al. (2021) into data transfer for differing distributions found that this also follows a power law. Moving beyond studying the effect on loss, Rae et al. (2021) analyzed the effect of scale on undesirable properties such as bias and toxicity by training a wide range of model sizes.
Comparing model architectures, Levine et al. (2020) studied the scaling of models that use self-attention and derived guidelines for depth-to-width allocation. Tay et al. (2021) reported that model shape (depth-width ratio) impacted performance on downstream tasks even if it had minimal impact on the pretraining objective. Tay et al. (2022a) further studied the effect of scaling for different model architectures and showed that architecture choice is pertinent when scaling and that the vanilla transformer architecture scales best.
Of particular importance to this work is the study of Hoffmann et al. (2022), who investigated the effect of model size and the number of training tokens on the performance of a model given a fixed compute budget. They posited that existing large language models were undertrained and that model size and the number of training tokens should be scaled equally. They demonstrated this hypothesis through Chinchilla, a model significantly smaller, yet higher performing, than most of the largest LLMs. These findings opened the door for “Chinchilla optimal” training of smaller models that achieve strong performance, and for which inference can be run much more efficiently than for their larger counterparts. These findings led us to consider a nearly Chinchilla-optimal model using a standard architecture.
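To make the equal-scaling recommendation concrete, the following is a rough sketch rather than the procedure used in this work. It assumes two widely quoted rules of thumb that are not taken from this paper: training compute of roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and a fixed ratio of about 20 training tokens per parameter. The function name and the example budgets are hypothetical.

```python
import math

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget between parameters and training tokens.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N (both heuristics).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal_allocation(1e23)
n2, d2 = compute_optimal_allocation(4e23)  # four times the compute

# Under equal scaling, 4x compute doubles both the parameters and the tokens.
print(f"{n1:.3e} parameters, {d1:.3e} tokens")
print(f"parameters x{n2 / n1:.2f}, tokens x{d2 / d1:.2f}")
```

Under these assumptions both quantities grow with the square root of the compute budget, which is the sense in which model size and token count are "scaled equally".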
Tokenization and vocabulary choice play a critical role in model performance as they can help the model learn meaningful representations and generalize to unseen words. Byte-Pair Encoding (BPE) (Sennrich et al., 2016) learns a greedy bottom-up vocabulary by repeatedly merging the most frequent sequence pairs in the training set until a predetermined vocabulary size is reached. Radford et al. (2018) adapted BPE by limiting the base vocabulary to be all possible bytes as opposed to all Unicode characters. Wordpiece tokenization (Schuster and Nakajima, 2012) also learns a greedy bottom-up vocabulary by repeatedly merging the sequence pair that maximizes the likelihood of the training data, which is a slight deviation from the method in Sennrich et al. (2016).
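The greedy merge loop of BPE can be sketched in a few lines. This is a minimal character-level illustration on a toy corpus, not the byte-level variant and not the tokenizer trained in this work; the function name `train_bpe`, the corpus, and the absence of end-of-word markers are simplifying assumptions.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy corpus)."""
    # Each distinct word is a tuple of symbols, stored with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Greedy step: take the most frequent pair (ties go to the first seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into a single symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=3)
print(merges)
```

In practice the loop runs until the target vocabulary size is reached, and the learned merge list is replayed in order to tokenize new text.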
In contrast to BPE and Wordpiece, the Unigram tokenizer (Kudo, 2018) learns a top-down vocabulary by first initializing a large vocabulary and repeatedly discarding those vocabulary items that increase loss (e.g., the negative log-likelihood of the training data) the least. By construction, the Unigram model can tokenize an input text in several different ways. Because the model stores a probability for each vocabulary item, it can score these alternative segmentations, allowing for smarter tokenization at inference time.
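To illustrate how stored token probabilities select among several possible segmentations, the sketch below finds the most probable segmentation with dynamic programming (Viterbi). The vocabulary and its probabilities are made-up toy values, and the code covers inference only, not the vocabulary-pruning training procedure.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation of `text` under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the text to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

# Hypothetical token probabilities, chosen only for illustration.
toy_probs = {"un": 0.1, "i": 0.05, "gram": 0.1, "unigram": 0.02,
             "u": 0.01, "n": 0.01, "g": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}
toy_log_probs = {tok: math.log(p) for tok, p in toy_probs.items()}

tokens, score = viterbi_segment("unigram", toy_log_probs)
print(tokens, round(score, 3))
```

Here the single token "unigram" (probability 0.02) beats the split "un" + "i" + "gram" (probability 0.0005), so the whole word is kept as one token.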
Finally, SentencePiece (Kudo and Richardson, 2018) adapts the schemes mentioned above to handle languages that are not space separated. Beltagy et al. (2019) constructed a vocabulary specific to scientific text and observed that their domain-specific trained vocabulary only had a 42% overlap with the non-domain-specific BERT vocabulary trained on general domain text. Similarly, Lewis et al. (2020) showed that a dedicated biomedical vocabulary improved performance on sequence labeling tasks consistently. Lieber et al. (2021) constructed a larger vocabulary to ensure token efficiency, which the authors claim resulted in reduced training time and better semantic representation. These findings demonstrate the importance of selecting a tokenizer and accompanying vocabulary that best reflects the training domain. For those reasons, we decided to train our own Unigram tokenizer instead of relying on existing public ones.
Transformer-based models rely on positional embeddings to encode position and location information of words in a text. Encoding the sequence position and the effect of this choice on model performance have been studied extensively. These include sinusoidal embeddings (Vaswani et al., 2017), rotary position embeddings (Su et al., 2021a), adding relative position bias (Raffel et al., 2020), and adding linear biases to attention heads (Press et al., 2022). A side-effect of the strategy in Press et al. (2022) is that one can train on shorter sequences without loss in performance on longer sequences. This has two benefits: first, models learn to generalize (extrapolate) to longer sequences and second, models can be trained on shorter sequences reducing training time.
Ethics, Limitations, and Implications
The rapid development and adoption of large language models have been accompanied by a rigorous conversation about the ethics, uses, and limitations of these models. For a more complete treatment of these topics, we direct the reader to Bommasani et al. (2021); Bender et al. (2021); Birhane et al. (2022); Weidinger et al. (2021, 2022). We discuss issues that are directly relevant to the development of BloombergGPT.
Finance is a sensitive area for technology, and ensuring accurate, factual information is crucial for our products, our clients, and the firm’s reputation in the marketplace. On the other hand, our clients are also eager to adopt state-of-the-art technology to support their workflows. To provide natural language applications to the financial community, we have developed a rigorous risk and testing assessment process. This process includes careful annotation guidelines (Tseng et al., 2020), pre-launch review at multiple levels by the central risk and compliance organizations, and by the product leaders (e.g., the newsroom) as applicable, and post-launch monitoring. Moreover, we conduct our research, development, and deployment of NLP and AI systems in accordance with all applicable regulations.
Similarly, toxicity and bias are areas where, as a company, we take extraordinary care with any content we produce, whether from humans or machines. Since the measurement of toxicity and bias in our model depends on its application areas, quantifying the potential for the generation of harmful language remains an open question. We are particularly interested in studying whether FinPile, which is cleaner and contains fewer examples of overtly biased or toxic language (e.g., Press Releases), reduces the proclivity of the model to generate inappropriate content. As we move to develop products built on this technology, we will apply existing testing procedures, as well as risk and compliance controls, to ensure safe use.
Openness
An ongoing debate in the community concerns how LLMs should be released, if at all. While models that are not publicly available cannot be fully evaluated by the community, distributing models can enable their use for nefarious purposes. Especially for a model like BloombergGPT, which is trained on a significant amount of press releases, news articles, and filings, a release carries a high risk for abuse through imitation.
We have witnessed many different strategies to mitigate risks associated with the release of LLMs. One strategy is to freely and openly share trained models (Scao et al., 2022), and rely on a license that dictates how the model should and should not be used. Another requires individuals to apply for access to the trained model parameters (Zhang et al., 2022a; Touvron et al., 2023). A more restrictive approach is to provide API access to models, but no access to the underlying model parameters or detailed information on the data the model was trained on (Brown et al., 2020). Finally, some have provided no access to the model (Chowdhery et al., 2022; Hoffmann et al., 2022). Each decision reflects a combination of factors, including model use, potential harms, and business decisions.
One of Bloomberg’s core business propositions is around providing access to data that has been collected over the course of decades. As is well known, LLMs are susceptible to data leakage attacks and it is possible to extract significant segments of text given model weights (Carlini et al., 2020, 2022). Moreover, even giving selective access to researchers isn’t a guarantee that the model cannot be leaked. Without strong privacy guarantees, we must be concerned that providing access to model weights entails giving access to FinPile. For this reason, we err on the side of caution and follow the practice of other LLM developers in not releasing our model.
Nevertheless, our insights and experiences in training and evaluating BloombergGPT contribute to the developing understanding of these models. In particular, our experience may be useful to those building domain-specific models. During the process of developing BloombergGPT, we found the OPT chronicles, experiences of the BLOOM team, as well as work of non-open models like GPT-3, PaLM, Chinchilla, and Gopher, to be crucial enablers of our work. In support of this tradition, we include our Training Chronicles (Appendix C).
Conclusion
We have presented BloombergGPT, a best-in-class LLM for financial NLP.
Our model contributes to the ongoing dialog on effective ways to train domain-specific models. Our training strategy of mixing domain-specific and general-purpose data results in a model that balances performance in both domains. Additionally, our work offers another data point on selecting Chinchilla optimal-sized models. Finally, we hope that our model training logs will provide a guide for those training their own LLMs.
We have several interesting directions to pursue. First, task fine-tuning has yielded significant improvements in LLMs, and we plan to consider what unique opportunities exist for model alignment in the financial domain (Wei et al., 2021; Ouyang et al., 2022). Second, by training on data in FinPile, we are selecting data that may exhibit less toxic and biased language. The effects of this on the final model are unknown as yet, which we plan to test. Third, we seek to understand how our tokenization strategy changes the resulting model. These are a few of the new research directions we hope to pursue with BloombergGPT.
We achieve strong results on general LLM benchmarks and outperform comparable models on financial tasks. We attribute this, in decreasing order of impact, to 1. a well-curated internal dataset, 2. our unique choice in tokenizer, and 3. an up-to-date architecture. We will continue to develop financial applications with BloombergGPT to further explore the benefits of these modeling choices.
Acknowledgments and Disclosure of Funding
We would like to acknowledge the people who helped us, including Emmanuel Scoullos (NVIDIA) and Can Karakus (Amazon Web Services).
References
Appendix A Architecture
Unstyled variables denote scalars, bold lower-case variables represent [column] vectors, and bold capitalized variables represent matrices. For instance, $h_{i,j}$ could be an element in the vector $\boldsymbol{h_j}$, which could in turn be the $j$-th column of matrix $\boldsymbol{H}$.
Named functions are typed in non-italicized regular typeface, such as $\mathrm{softmax}(\cdot)$ and $\mathrm{FFN}(\cdot)$.
Red color is used to denote trainable parameters, or functions that are parametrized by trainable parameters, such as ${\color{red}\boldsymbol{W}}$ or ${\color{red}\mathrm{FFN}}(\cdot)$.
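A sequence of $L$ elements is denoted by $\{x_{\ell}\}_{\ell=1}^{L}$. We treat a sequence of (column) vectors as a matrix, i.e. $\boldsymbol{X} = \{\boldsymbol{x_t}\}_{t=1}^{T} \in \mathbb{R}^{m \times T}$, where each $\boldsymbol{x_t} \in \mathbb{R}^{m}$.
$f: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$: A function on vectors, that is, $\boldsymbol{y} = f(\boldsymbol{x})$ where $\boldsymbol{x} \in \mathbb{R}^{m}$ and $\boldsymbol{y} \in \mathbb{R}^{n}$ are $m$- and $n$-dimensional real valued vectors. Whenever such a function is applied to a matrix, it is applied column-wise: $f(\boldsymbol{X}) = \{f(\boldsymbol{x_t})\}_{t=1}^{T}$, $\boldsymbol{X} \in \mathbb{R}^{m \times T}$.
$\boldsymbol{A} \odot \boldsymbol{B}$: Element-wise (or Hadamard) product of matrices or vectors $\boldsymbol{A}$ and $\boldsymbol{B}$ (of the same shape).
$\mathbb{1}(P)$: Indicator function that returns 1 if the predicate $P$ is true and 0 otherwise.
$[n]$: For integer $n$, the set of all positive integers up to (including) $n$, i.e. $\{1, \ldots, n\}$.
$\boldsymbol{A} + \boldsymbol{b}$: Adding a vector to a matrix is defined as repeated addition to each column. That is, $\boldsymbol{A} + \boldsymbol{b} = \{\boldsymbol{a_t} + \boldsymbol{b}\}_{t=1}^{T}$.
Softmax: $\mathrm{softmax}(\boldsymbol{x}) = \frac{\exp(\boldsymbol{x})}{\sum_{i=1}^{n} \exp(x_i)}$ where $\exp(\cdot)$ is applied element-wise to a vector.
Dropout: $\mathrm{drop}^{p}(\boldsymbol{x}) = \frac{1}{1-p} \cdot \boldsymbol{m} \odot \boldsymbol{x}$ where $\boldsymbol{m} = [m_i]_{i=1}^{n}$, and $m_i \sim \mathrm{Bernoulli}(1-p)$. Random variables $m_i$ are drawn independently for each presentation of an example.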
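The softmax and dropout operators in this notation list can be sketched as follows. This is an illustrative pure-Python version, not the training implementation: it assumes the usual "inverted" dropout convention (surviving entries are rescaled by 1/(1-p)), and it subtracts the maximum before exponentiating for numerical stability, which leaves the softmax value unchanged.

```python
import math
import random

def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)  # shifting by the max does not change the result
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(x, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

probs = softmax([1.0, 2.0, 3.0])
rng = random.Random(0)  # fixed seed so the example is reproducible
dropped = dropout([1.0] * 10000, p=0.1, rng=rng)
print(sum(probs), sum(dropped) / len(dropped))
```

The rescaling keeps the expected value of each entry unchanged, so no extra correction is needed when dropout is switched off at inference time.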
A.1 Full Architecture
Let $(x_1, \ldots, x_T) \in \mathcal{V}^{T}$ denote an input sequence of length $T$, where each element $x_t$ denotes an integer identifier of a token from the vocabulary $\mathcal{V} = [|\mathcal{V}|]$.
Initial input representations $\boldsymbol{H^0} = \{\boldsymbol{h_t^0}\}_{t=1}^{T}$ are obtained by
$$\boldsymbol{\bar{h}_t^0} = {\color{red}\boldsymbol{W^{em}}} \boldsymbol{e_{x_t}} \quad \forall t \tag{1}$$
$$\boldsymbol{h_t^0} = {\color{red}\mathrm{LN}^{em}}(\boldsymbol{\bar{h}_t^0}) \quad \forall t \tag{2}$$
where ${\color{red}\boldsymbol{W^{em}}} \in \mathbb{R}^{D \times |\mathcal{V}|}$ is the token embedding matrix, $\boldsymbol{e_{x_t}} \in \mathbb{R}^{|\mathcal{V}|}$ is the $x_t$-th standard basis vector, and ${\color{red}\mathrm{LN}^{em}}$ is the embedding LayerNorm function, to be defined in the following subsections.
Observe that no positional embedding is applied here due to how ALiBi works.
Layer representations $\boldsymbol{H^{\ell}} \in \mathbb{R}^{D \times T}$ for each layer $\ell = 1, \ldots, L$ can be sequentially defined as follows (this computation is sometimes referred to as a “block”):
$$\boldsymbol{\bar{H}^{\ell}} = \boldsymbol{H^{\ell-1}} + {\color{red}\mathrm{SA}_{\ell}}({\color{red}\mathrm{LN}^{in}_{\ell}}(\boldsymbol{H^{\ell-1}})) \tag{3}$$
$$\boldsymbol{H^{\ell}} = \boldsymbol{\bar{H}^{\ell}} + {\color{red}\mathrm{FFN}_{\ell}}({\color{red}\mathrm{LN}^{at}_{\ell}}(\boldsymbol{\bar{H}^{\ell}})) \tag{4}$$
where ${\color{red}\mathrm{SA}_{\ell}}$, ${\color{red}\mathrm{FFN}_{\ell}}$, and ${\color{red}\mathrm{LN}_{\ell}^{\bullet}}$ denote SelfAttention, FeedForwardNetwork, and LayerNorm functions at layer $\ell$, respectively, as defined in the following subsections. The red color indicates that the functions depend on trainable parameters.
${\color{red}\mathrm{LN}_{\ell}^{\bullet}}$ is further parametrized by an indication of what the function is applied to, such as ${\color{red}\mathrm{LN}_{\ell}^{in}}$ when applied to the block input and ${\color{red}\mathrm{LN}_{\ell}^{at}}$ when applied to the attention output. We designate these separately since they use different (i.e. untied) trainable parameters.
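The data flow of one block, two residual updates with a LayerNorm applied before each sub-layer as in equations (3) and (4), can be sketched as below. The sub-layers are passed in as plain functions; the stand-ins used in the example (`half`, `double`, `identity`) are hypothetical placeholders that only scale their input, chosen so the result is easy to verify by hand.

```python
def add(a, b):
    """Element-wise sum of two equally shaped matrices (lists of rows)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def transformer_block(h_prev, self_attention, ffn, ln_in, ln_at):
    """Apply one block: attention and feed-forward sub-layers, each with a residual."""
    h_bar = add(h_prev, self_attention(ln_in(h_prev)))  # cf. equation (3)
    return add(h_bar, ffn(ln_at(h_bar)))                # cf. equation (4)

# Stand-in sub-layers (hypothetical) that only scale their input.
identity = lambda m: m
half = lambda m: [[0.5 * v for v in row] for row in m]
double = lambda m: [[2.0 * v for v in row] for row in m]

h0 = [[1.0, 2.0], [3.0, 4.0]]
h1 = transformer_block(h0, self_attention=half, ffn=double,
                       ln_in=identity, ln_at=identity)
print(h1)
```

With these stand-ins the attention step yields 1.5 times the input and the feed-forward step triples that, so every entry of `h1` is 4.5 times the corresponding entry of `h0`.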
Given the final layer representation $\boldsymbol{H^{L}}$, logits $\boldsymbol{Y} \in \mathbb{R}^{|\mathcal{V}| \times T}$ are obtained as:
$$\boldsymbol{Y} = {\color{red}\boldsymbol{W^{em}}}^{\top} {\color{red}\mathrm{LN}^{f}}(\boldsymbol{H^{L}}) \tag{5}$$
where ${\color{red}\boldsymbol{W^{em}}} \in \mathbb{R}^{D \times |\mathcal{V}|}$ is the same embedding matrix we used in the embedding part and ${\color{red}\mathrm{LN}^{f}}$ is the final LayerNorm application. We follow the PaLM approach in omitting a bias term.
The token distribution for position $j+1$, conditioned on the prefix $\{x_t\}_{t=1}^{j}$, is given by
$$\mathbb{P}(x_{j+1} = w \mid \{x_t\}_{t=1}^{j}) = \mathrm{softmax}(\boldsymbol{y_j})_{w} \tag{6}$$
where $\boldsymbol{y_j}$ is the $j$'th column of $\boldsymbol{Y}$.
A.2 SelfAttention with ALiBi (SA)
SelfAttention with ALiBi at layer $\ell$, ${\color{red}\mathrm{SA}_{\ell}}: \mathbb{R}^{D \times T} \rightarrow \mathbb{R}^{D \times T}$ is defined as follows.
Let $n \in \{1, \ldots, N\}$ denote an attention head where $N$ is the total number of heads. Let $D^{n}$ denote the dimensionality of each head. Let $\boldsymbol{A^{n}}, \boldsymbol{M} \in \mathbb{R}^{T \times T}$ denote the ALiBi matrix and the attention mask, respectively, which will be defined later.
Then, $\boldsymbol{Y} = {\color{red}\mathrm{SA}_{\ell}}(\boldsymbol{X})$ such that:
$$\boldsymbol{Q^{n}} = {\color{red}\boldsymbol{W_{\ell}^{n,q}}} \boldsymbol{X} + {\color{red}\boldsymbol{b_{\ell}^{n,q}}} \quad \forall n \tag{7}$$
$$\boldsymbol{K^{n}} = {\color{red}\boldsymbol{W_{\ell}^{n,k}}} \boldsymbol{X} + {\color{red}\boldsymbol{b_{\ell}^{n,k}}} \quad \forall n \tag{8}$$
$$\boldsymbol{V^{n}} = {\color{red}\boldsymbol{W_{\ell}^{n,v}}} \boldsymbol{X} + {\color{red}\boldsymbol{b_{\ell}^{n,v}}} \quad \forall n \tag{9}$$
$$\boldsymbol{\bar{S}^{n}} = \boldsymbol{A^{n}} + \frac{\boldsymbol{K^{n}}^{\top} \boldsymbol{Q^{n}}}{\sqrt{D^{n}}} \in \mathbb{R}^{T \times T} \quad \forall n \tag{10}$$
$$\boldsymbol{S^{n}} = \mathrm{drop}^{p_{a}}(\mathrm{softmax}(\boldsymbol{\bar{S}^{n}} \odot \boldsymbol{M})) \in \mathbb{R}^{T \times T} \quad \forall n \tag{11}$$
$$\boldsymbol{\bar{Y}^{n}} = \boldsymbol{V^{n}} \boldsymbol{S^{n}} \in \mathbb{R}^{D^{n} \times T} \quad \forall n \tag{12}$$
$$\boldsymbol{Y} = \mathrm{drop}^{p_{h}}\Big(\Big(\sum_{n=1}^{N} {\color{red}\boldsymbol{U_{\ell}^{n}}} \boldsymbol{\bar{Y}^{n}}\Big) + {\color{red}\boldsymbol{c_{\ell}}}\Big) \in \mathbb{R}^{D \times T} \tag{13}$$
where ${\color{red}\boldsymbol{W_{\ell}^{n,q}}}, {\color{red}\boldsymbol{W_{\ell}^{n,k}}}, {\color{red}\boldsymbol{W_{\ell}^{n,v}}} \in \mathbb{R}^{D^{n} \times D}$, ${\color{red}\boldsymbol{U_{\ell}^{n}}} \in \mathbb{R}^{D \times D^{n}}$, $\forall n$, are the trainable weight parameters, ${\color{red}\boldsymbol{b_{\ell}^{n,q}}}, {\color{red}\boldsymbol{b_{\ell}^{n,k}}}, {\color{red}\boldsymbol{b_{\ell}^{n,v}}} \in \mathbb{R}^{D^{n}}$, $\forall n$, ${\color{red}\boldsymbol{c_{\ell}}} \in \mathbb{R}^{D}$, are the trainable bias parameters, and $p_{a}, p_{h} \in [0, 1)$ are the attention and hidden unit dropout probabilities.
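The ALiBi matrix $\boldsymbol{A^{n}} = [a^{n}_{i,j}] \in \mathbb{R}^{T \times T}$ is constructed as:
$$\tilde{N} = 2^{\lfloor \log_2 N \rfloor} \tag{14}$$
$$\tilde{n} = 1 + ((n-1) \bmod \tilde{N}) - 0.5 \left\lfloor \frac{n-1}{\tilde{N}} \right\rfloor \tag{15}$$
$$a^{n}_{i,j} = 2^{-\frac{8}{\tilde{N}} \tilde{n}} \cdot (i - j) \cdot \mathbb{1}(i < j) \quad \forall i, j \in [T],\ n \in [N] \tag{16}$$
and the attention mask $\boldsymbol{M} = [m_{i,j}] \in \mathbb{R}^{T \times T}$ is constructed as:
$$m_{i,j} = \mathbb{1}(i \le j) - \infty \cdot \mathbb{1}(i > j) \quad \forall i, j \in [T] \tag{17}$$
where we follow the convention that $\infty \cdot 0 = 0$.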
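The following sketch shows one way to build the per-head ALiBi slopes, the resulting bias matrix, and a causal mask. It is hedged in several respects: the head-index remapping follows equation (15), but the slope formula `2 ** (-8 * n_tilde / n_pow2)` is an assumption based on the ALiBi reference scheme of Press et al. (2022); the mask is expressed as a boolean "allowed" matrix rather than with infinities; and rows are taken to index keys and columns queries, in line with the column-vector convention of this appendix. All function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes using the head-index remapping of equation (15).

    Assumption: slope = 2 ** (-8 * n_tilde / n_pow2), where n_pow2 is the
    largest power of two not exceeding num_heads.
    """
    n_pow2 = 1 << (num_heads.bit_length() - 1)
    slopes = []
    for n in range(1, num_heads + 1):
        n_tilde = 1 + ((n - 1) % n_pow2) - 0.5 * ((n - 1) // n_pow2)
        slopes.append(2.0 ** (-8.0 * n_tilde / n_pow2))
    return slopes

def alibi_bias(slope, seq_len):
    """bias[i][j] for key i and query j: slope * (i - j) if i < j, else 0."""
    return [[slope * (i - j) if i < j else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]

def causal_allowed(seq_len):
    """allowed[i][j] is True when query j may attend to key i (i <= j)."""
    return [[i <= j for j in range(seq_len)] for i in range(seq_len)]

slopes8 = alibi_slopes(8)    # power-of-two head count
slopes40 = alibi_slopes(40)  # 32 "regular" heads plus 8 interleaved ones
bias = alibi_bias(0.5, 3)
allowed = causal_allowed(3)
print(slopes8[:3], slopes40[32], bias[0][2])
```

For a power-of-two head count the remapping is the identity, so the slopes form the geometric sequence 1/2, 1/4, ..., 1/256 for eight heads. For other head counts, the extra heads receive slopes that fall halfway (in the exponent) between those of the regular heads. The bias grows more negative as the key lies further in the past, which is what allows training on short sequences while extrapolating to longer ones.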
A.3 LayerNorm (LN)
LayerNorm, ${\color{red}\mathrm{LN}^{\theta}}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{D}$, is defined as follows:
$${\color{red}\mathrm{LN}^{\theta}}(\boldsymbol{x}) = \frac{\boldsymbol{x} - \mu(\boldsymbol{x})}{\sqrt{\sigma^{2}(\boldsymbol{x}) + \epsilon}} \odot {\color{red}\boldsymbol{\gamma^{\theta}}} + {\color{red}\boldsymbol{\beta^{\theta}}} \tag{18}$$
where
$$\mu(\boldsymbol{x}) = \frac{1}{D} \sum_{i} x_i \in \mathbb{R} \tag{19}$$
$$\sigma^{2}(\boldsymbol{x}) = \frac{1}{D} \sum_{i} (x_i - \mu(\boldsymbol{x}))^{2} \in \mathbb{R} \tag{20}$$
and ${\color{red}\boldsymbol{\gamma^{\theta}}}, {\color{red}\boldsymbol{\beta^{\theta}}} \in \mathbb{R}^{D}$ are the trainable gain and bias parameters, and $\epsilon \in \mathbb{R}$ is a small constant.
$\theta$ is used as the parametrization variable to emphasize that ${\color{red}\mathrm{LN}^{em}}$, ${\color{red}\mathrm{LN}^{f}}$, and ${\color{red}\mathrm{LN}^{in}_{\ell}}$, ${\color{red}\mathrm{LN}^{at}_{\ell}}$, $\forall \ell$, have different (untied) $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ parameters.
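A direct transcription of LayerNorm for a single vector is sketched below: the input is centered by its mean, scaled by the square root of its (population) variance plus a small constant, then multiplied element-wise by the gain and shifted by the bias. The value `eps=1e-5` is an assumed default for illustration; the constant used in training is not stated here.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to zero mean and unit variance, then apply gain and bias."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)  # population variance (divide by D)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 4) for v in y])
```

With unit gain and zero bias the output has mean zero and variance very close to one; separate (untied) gain and bias vectors would be kept for each place LayerNorm is applied.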
A.4 FeedForwardNetwork (FFN)
Feedforward network component ${\color{red}\mathrm{FFN}_{\ell}}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{D}$ is defined as a simple multilayer perceptron. $\boldsymbol{y} = {\color{red}\mathrm{FFN}_{\ell}}(\boldsymbol{x})$ such that:
$$\boldsymbol{h} = \mathrm{gelu}({\color{red}\boldsymbol{W_{\ell}^{f}}} \boldsymbol{x} + {\color{red}\boldsymbol{b_{\ell}^{f}}}) \tag{21}$$
$$\boldsymbol{y} = \mathrm{drop}^{p_{f}}({\color{red}\boldsymbol{U_{\ell}^{f}}} \boldsymbol{h} + {\color{red}\boldsymbol{c_{\ell}^{f}}}) \tag{22}$$
where $\mathrm{gelu}(\cdot)$ is applied element-wise, ${\color{red}\boldsymbol{W_{\ell}^{f}}} \in \mathbb{R}^{D^{\prime} \times D}$, ${\color{red}\boldsymbol{U_{\ell}^{f}}} \in \mathbb{R}^{D \times D^{\prime}}$ are the trainable weight parameters, ${\color{red}\boldsymbol{b_{\ell}^{f}}} \in \mathbb{R}^{D^{\prime}}$, ${\color{red}\boldsymbol{c_{\ell}^{f}}} \in \mathbb{R}^{D}$ are the trainable bias parameters, and $p_{f} \in [0, 1)$ denotes the dropout probability at this component.
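The two-layer feed-forward computation of equations (21) and (22) can be sketched for a single input vector as follows. Two assumptions are made for illustration: GELU is computed in its exact form via the Gaussian CDF (an implementation may instead use a tanh approximation), and dropout is omitted, which corresponds to inference mode. The tiny weight matrices are arbitrary toy values.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(x, w, b, u, c):
    """y = U gelu(W x + b) + c, with dropout omitted (inference mode)."""
    h = [gelu(a + bi) for a, bi in zip(matvec(w, x), b)]  # hidden layer, size D'
    return [a + ci for a, ci in zip(matvec(u, h), c)]     # projection back to size D

# Toy shapes: D = 2, D' = 3.
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
u = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
c = [0.0, 1.0]
y = ffn([1.0, -1.0], w, b, u, c)
print(y)
```

The first output is gelu(1) + gelu(-1) + gelu(0), and the second is just the output bias, since the second row of `u` is zero.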
A.5 List of All Trainable Parameters
List of shape hyperparameters and their values are as follows:
$D^{n}$ (hidden dimension of each head)
$D^{\prime}$ (hidden dimension of FFN)
Initialization hyperparameters are as follows:
is the default range (standard deviation).
is the rescaled range for the second layer in and the final linear map in .
List of all parameters with their sizes and (element-wise) initialization:
Appendix B Details on external financial tasks
FPB (Malo et al., 2014): The Financial Phrasebank Dataset includes a sentiment classification task on English sentences taken from financial news about companies listed on OMX Helsinki. Sentiment annotations of positive, negative, and neutral are adjudicated from the perspective of an investor: any news that could benefit or hurt an investor is considered positive or negative, respectively, and neutral otherwise. Each sentence is annotated by 5 to 8 annotators who have sufficient knowledge of finance, whereas the source sentences were written by financial journalists. For example, news about shrinking revenue would be labeled negative and company growth positive. The dataset comes in different configurations, each denoting the percentage agreement between annotators ($\geq 50\%$, $\geq 66\%$, $\geq 75\%$, $100\%$); we use the configuration with $\geq 50\%$ agreement. Since an official train-test split is not available, we create our own random split. Our training split contains 3,876 sentences with 1,086 positive, 488 negative, and 2,302 neutral sentences, and our test set contains 970 sentences with 277 positive, 116 negative, and 577 neutral sentences. We choose 5 shots and report F1 score weighted by support.
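For readers unfamiliar with the metric, the following is a minimal sketch of an F1 score weighted by support: F1 is computed per class and then averaged with weights proportional to the number of gold examples in each class. The function name `weighted_f1` and the string labels are illustrative assumptions; this is not the evaluation code used for the reported numbers.

```python
from collections import Counter


def weighted_f1(gold, pred):
    # Per-class F1, averaged with weights equal to each class's support in gold.
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(gold)
```

Because the weights come from the gold labels, a majority class such as neutral dominates the score, which is worth keeping in mind given the class imbalance in the split above.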
FiQA SA (Maia et al., 2018): The second sentiment analysis task is to predict the aspect-specific sentiment in English financial news and microblog headlines, which were published as part of the 2018 challenge on financial question answering and opinion mining. In the original task, sentiment is annotated on a continuous scale of $[-1,+1]$; the details of the annotation task are not readily available. To make this regression dataset amenable to a few-shot LLM setup, we convert it into a classification task: negative ($-1\leq x<-0.1$), neutral ($-0.1\leq x<+0.1$), and positive ($+0.1\leq x\leq+1$), where $x$ is the original sentiment score. We selected this discretization based on a manual examination of the dataset. As with FPB, we create our own random split combining both microblogs and news. After discretization, our training set contains 938 sentences with 576 positive, 287 negative, and 75 neutral sentences, and our test set contains 235 sentences with 141 positive, 76 negative, and 18 neutral sentences. We select 5 shots and report weighted F1.
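The conversion from a continuous score to three classes can be sketched as a simple thresholding function. The cut-off is passed as a parameter; the default of 0.1, the closed score range $[-1, 1]$, and the function name `discretize` are assumptions of this sketch.

```python
def discretize(score: float, threshold: float = 0.1) -> str:
    # Map a continuous sentiment score to one of three classes.
    # The symmetric cut-off `threshold` is an illustrative assumption.
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score < -threshold:
        return "negative"
    if score < threshold:
        return "neutral"
    return "positive"
```

A narrow neutral band like this one yields few neutral examples, which is consistent with the small neutral counts in the split described above.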
Headline (Sinha and Khandait, 2020): This is a binary classification task of whether a news headline in the gold commodity domain includes certain information. This human-annotated dataset consists of 11,412 English news headlines from 2000 to 2019 about “gold” scraped from providers such as Reuters, The Hindu, The Economic Times, Bloomberg, and from aggregator sites such as Kitco and MetalsDaily. Each news article carries a subset of the following tags: “price or not”, “price up”, “price down”, “price stable”, “past price”, “future price”, “past general”, “future general”, “asset comparison”. The dataset is created using annotator consensus, and Cohen’s Kappa for each of the categories is $\geq 0.85$, indicating a high-quality dataset. As with FPB, we create our own random split. Our training set contains 9,129 sentences with 7,780, 3,785, 3,392, 414, 7,482, 299, 1,285, 67, 1,696 examples of “price or not”, “price up”, “price down”, “price stable”, “past price”, “future price”, “past general”, “future general”, “asset comparison” classes, respectively. Similarly, the test set contains 2,283 sentences with 1,955, 962, 838, 109, 1,873, 82, 313, 15, 454 examples of the same classes. We verbalize each tag into a question using the official documentation on each tag as shown in Table 18. We use 5 shots and report the average weighted F1 score across all categories.
NER (Salinas Alvarado et al., 2015): This is a named entity recognition task on financial data gathered for credit risk assessment. The dataset consists of 8 documents with ~55,000 words of financial agreements filed with the SEC. The annotated entity types follow the standard CoNLL format (Tjong Kim Sang and De Meulder, 2003) and are annotated with PER, LOC, ORG, and MISC. We use Fin-5 as the training data for context sampling and test on the Fin-3 split. As MISC is not defined on its own but only as “names (that) are not already in the other categories” (Tjong Kim Sang and De Meulder, 2003), we drop all entities with type MISC. Additionally, as it is nontrivial to learn to predict empty output in the few-shot setup, we drop sentences that do not contain any entity. After preprocessing, our training set contains 504 sentences with 168 PER, 745 LOC, and 241 ORG, and our test set consists of 98 sentences with 39 PER, 216 LOC, and 56 ORG. We found that all the models required more shots to perform well. Hence, we selected 20 shots and report the entity-level F1 score.
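The preprocessing and the metric described above can be sketched as follows. The representation of each sentence as a list of `(start, end, type)` spans, the micro-averaging over all entities, and the function names `preprocess` and `entity_f1` are assumptions made for illustration only.

```python
def preprocess(sentences):
    # Each sentence is a list of (start, end, type) entity spans (hypothetical format).
    kept = []
    for spans in sentences:
        spans = [s for s in spans if s[2] != "MISC"]  # drop MISC entities
        if spans:  # drop sentences left without any entity
            kept.append(spans)
    return kept


def entity_f1(gold, pred):
    # An entity counts as correct only if its span and its type both match exactly.
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold, pred))
    n_gold = sum(len(set(g)) for g in gold)
    n_pred = sum(len(set(p)) for p in pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)
```

Under this definition a prediction with the right boundaries but the wrong type earns no credit, which is what distinguishes entity-level from token-level scoring.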
ConvFinQA (Chen et al., 2022): Given an input that includes text and at least one table with financial data, the task is to answer conversational questions that require numerical reasoning over the input. The source data is earnings reports of S&P 500 companies and consists of 3,892 conversations comprising 14,115 questions. This task requires numerical reasoning, an understanding of structured data and financial concepts, and the ability to relate follow-up questions to earlier dialog turns. To solve this task, we use “1 shot”, where an entire gold conversation and its context is input to the models. In addition, as each “turn” of the conversation concludes, the “turn” along with the “gold” answer for that turn is appended as context for future turns. Tables are linearized in the context (as suggested by the authors) as Markdown tables, and we replace an empty entry with “-”. The reported score is the exact match accuracy of the direct answer produced by a model. As test set labels are not publicly available, we report results on the dev set instead. Our training set contains 11,104 conversations and 45,888 questions and our test set contains 1,490 conversations and 5,932 questions.
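A minimal sketch of the table linearization described above: the table is rendered as a Markdown table and empty entries are replaced with "-". The function name `linearize_table` and the header/rows input format are hypothetical; the exact formatting used in the experiments may differ.

```python
def linearize_table(header, rows):
    # Render a table as a Markdown string; empty cells become "-".
    def cell(value):
        text = "" if value is None else str(value).strip()
        return text if text else "-"

    lines = ["| " + " | ".join(cell(h) for h in header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)
```

The resulting string can be concatenated with the surrounding text and the earlier turns (each followed by its gold answer) to form the prompt context for the next turn.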
Appendix C Training Chronicles
Our first training run was called v0. In this run, we experimented with curriculum learning. Data that the model would see in the future would likely be similar to the newer data in our training corpus, so we wanted the model to do better on those future documents. Additionally, since there are facts that change over time, newer information should ideally override the old. Therefore, we temporally ordered the training data by month in FinPile.
Figure 7 shows the learning curve for run v0. We observed a large gap between training and validation losses, which was expected: early stages of training would observe the oldest data (starting from 2007) whereas our validation set was strictly from the future (i.e., 2022). However, one week into training we found the model stuck on both training and validation loss, as seen by the very limited validation progress between steps 15k-20k and almost no progress after step 20k. It was possible that the training loss plateau and the divergence of training and validation loss would both resolve themselves as the training data became more similar to the validation data as the curriculum progressed. However, we deemed waiting too risky: it would mean training for many steps without any diagnostic signal, making it hard to catch other potential problems that might require early intervention. We thus decided to abandon curriculum learning altogether.
We removed curriculum learning by shuffling all of our training data uniformly on the shard level. [Footnote 3: Instead of loading one shard of data at a time, we load multiple random shards (without replacement) at the same time and shuffle them on the fly.] We then started a new run (v1.0), which led to much faster improvements in the validation loss. We were unable to ascertain if curriculum learning had a negative impact on training or if the loss plateaued due to other factors, for example, the other discovered issue in v1.x.
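The shard-level shuffling can be sketched as a generator that draws several shards at random without replacement, pools their documents, and shuffles the pool before yielding. Shards held as in-memory lists, the `shards_in_flight` parameter, and the function name `shuffled_stream` are assumptions of this sketch rather than details of the actual data loader.

```python
import random


def shuffled_stream(shards, shards_in_flight=4, seed=0):
    # shards: list of shards, each a list of documents (held in memory for simplicity).
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)  # random shard order; each shard is visited exactly once
    for start in range(0, len(order), shards_in_flight):
        buffer = []
        for idx in order[start:start + shards_in_flight]:
            buffer.extend(shards[idx])
        rng.shuffle(buffer)  # mix documents across the shards loaded together
        yield from buffer
```

Loading more shards at once brings the stream closer to a uniform shuffle of the whole corpus, at the cost of more memory; changing `seed` yields a different data order.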
C.1 Elbow
During our new run without curriculum learning (v1.0), we observed that the gradient norm showed a steady increase after about 12k steps (~4.5 days of training), with occasional spikes (see Figure 8). This was accompanied by sudden jumps in the validation loss, possibly indicating that the model was becoming sensitive to small changes in its weights. Training loss also appeared to be plateauing again.
We believed that the gradient norm increases were the cause of the validation loss problems (notice the alignment between sudden validation loss jumps with some of the sudden gradient norm jumps for v1.0, in Figure 8). We made several attempts across several model runs to fix the gradient norm increases:
All of these attempted fixes were made after we observed a trend of increasing gradient norms similar to the original run (v1.0), or some early signs of a similar path that we hypothesized would eventually grow more. Since we didn’t want to waste training time, we did our best to make decisions early instead of allowing the model to continue down a bad training path.
We investigated the norms of the weights themselves to see if any peculiar trends were aligning with the gradient growth. In particular, we were curious to see if there were particular layers or components that were responsible for the large gradient norms.
Figure 9 plots L2 norms for each component, averaged by the square root of the number of elements (layer norm multipliers start from 1 and all the others start close to zero due to initialization). We observed that all components follow a similar benign trend except one: Input LayerNorm at layer 1, which suddenly elbows and starts increasing roughly linearly after step ~12k. This also aligns with the initial growth of the gradient norms.
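The plotted quantity, an L2 norm divided by the square root of the number of elements, is the root mean square of the weights. A minimal sketch follows; the flat-list representation of each component and the names `scaled_l2_norm` and `component_norms` are assumptions made for illustration.

```python
import math


def scaled_l2_norm(weights):
    # L2 norm divided by sqrt(number of elements), i.e. the root mean square.
    return math.sqrt(sum(w * w for w in weights)) / math.sqrt(len(weights))


def component_norms(named_weights):
    # Map each component name to its scaled norm, ready to be logged per step.
    return {name: scaled_l2_norm(w) for name, w in named_weights.items()}
```

This scaling makes components of different sizes comparable: a LayerNorm multiplier initialized to all ones has a scaled norm of exactly 1 regardless of its width, which is why those curves start from 1.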
To take a closer look, we inspected individual values of the multiplier weights (there are 60 such values in a single model shard out of 128) in Figure 10. We observed all values contributing to the same trend of shrinking until steps 11-12k and then shifting to move upward instead.
During this investigation, we discovered another bug: weight decay was applied to all the non-bias parameters, including the LayerNorm multiplier weights, which should have been skipped since they are initialized at 1. To the best of our knowledge, this practice has been inherited from the BERT implementation. [Footnote 4: https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/optimization.py#L59-L65]
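One common way to avoid this kind of bug is to split parameters into two optimizer groups by name, so that weight decay applies only to the weight matrices. The sketch below works on parameter names alone; the naming convention (biases ending in `.bias`, LayerNorm parameters containing `layernorm`) and the function name `split_weight_decay` are hypothetical, and this is not the training code used for these runs.

```python
def split_weight_decay(param_names, weight_decay=0.1):
    # Hypothetical naming convention: biases end in ".bias" and LayerNorm
    # parameters contain "layernorm" somewhere in their name.
    decay, no_decay = [], []
    for name in param_names:
        lowered = name.lower()
        if lowered.endswith(".bias") or "layernorm" in lowered:
            no_decay.append(name)  # biases and LayerNorm multipliers: no decay
        else:
            decay.append(name)  # remaining weights: decay toward zero
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

Decay pulls parameters toward 0, which is a sensible prior for weight matrices but not for multipliers whose neutral value is 1.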
This makes the elbow artifact shown in Figure 10 even more confusing: An additional push of weights towards 0 would trivially explain a downward trend but not a sudden shift to growth.
After four failed attempts to fix the run, we considered the possibility that this run was unsalvageable and contemplated starting from scratch with a more conservative hyperparameter setting from the beginning. This would include measures we had already tried in our attempts, such as shrinking the learning rate or gradient clipping. Additionally, because the pathological trend change is isolated to the input LayerNorm at layer 1, which is topologically very close to the removed LayerNorm at the embedding layer, we decided to add the embedding LayerNorm back as an additional precaution, despite most other LLMs not having this component.
C.2 Slide
After numerous attempts to fix the elbow issue, we wanted to be as conservative as possible for the hyperparameter choices when starting from scratch to keep the learning dynamics as stable as possible. We started the next run (v2.0) with the following hyperparameters and changes:
Use FP32 precision in LM-head (see Equation 6)
Use max learning rate of 6e-5 instead of 1e-4
Use a gradient clipping value of 0.3 instead of 1.0
Use a different seed to ensure different initialization and data order
Reintroduce LayerNorm at embedding layer
Use a longer learning rate warm-up period of 1800 steps
Remove incorrect use of weight decay on LayerNorm multipliers ($\gamma^{\bullet}_{\bullet}$)
Use Megatron initialization rescaling (see the rescaled range in § A.5)
Apply query_key_layer_scaling (Shoeybi et al., 2019)
Apply a batch size warm-up: Use a batch size of 1024 for 7200 iterations, then increase to 2048
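Three of the numeric settings in this list can be sketched as small functions: the batch size warm-up, the learning rate warm-up, and gradient clipping. The linear shape of the learning rate warm-up, clipping by global L2 norm, and all function names are assumptions of this sketch; the schedule after warm-up is deliberately not modeled.

```python
import math


def batch_size_at(step, small=1024, large=2048, warmup_iters=7200):
    # Batch size warm-up: small batches for the first warmup_iters iterations.
    return small if step < warmup_iters else large


def warmup_lr(step, max_lr=6e-5, warmup_steps=1800):
    # Linear warm-up to max_lr (assumed shape); later decay is not modeled here.
    return max_lr * min(1.0, (step + 1) / warmup_steps)


def clip_by_global_norm(grads, max_norm=0.3):
    # Rescale all gradients so that their joint L2 norm is at most max_norm.
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [g * scale for g in grads]
```

A lower clipping value bounds the size of any single update more tightly, which is in line with the conservative intent of the other changes in the list.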
In addition to hyperparameter changes, we also performed additional monitoring to catch issues earlier. Because we observed the pathological behavior at the first LayerNorm component, we started monitoring the norms of the weights and (scaled by ).
With the aforementioned conservative choice of hyperparameters during v2.0, we observed very smooth and (thankfully!) uneventful training for approximately 42 days (~115,500 iterations). We saw few surprises both in terms of training and validation performance curves (see Figure 11), as well as the norms of the gradients. The only intervention needed during this period was to restart the job after 28 days, due to the underlying platform having a hard limit on the duration of the job.
During this period, we observed a smoothly decreasing validation loss (except a few jumps earlier on) until it started to flatten around 2.116 (y-axis) at the end. Running training loss has a similar trend of overall decrease with the typical jitter and random occasional increments.
We also observed that weight norms for LayerNorm components of the initial layer were smooth and stable without any immediate or long-term trend changes (see Figure 11). This presents some evidence that we were not suffering from the pathological behavior we observed in v1.x with regard to the LayerNorm parameters.
Overall, while this set of changes led to a smooth training run in v2.0, we cannot conclude which of these changes was decisive in leading to a successful training run. We defer such investigation to future work.
C.3 Suspense
About 48 days into training v2.0, we noticed that the validation loss had not improved in a week (from iteration 115,500 to 133,200, see Figure 12, v2.0 curves). During the same period, we also noticed training loss flattening around 2.10 (with the usual jitter). We suspected that the model was no longer learning properly, and decided to intervene.
We considered two options: 1) changing the max learning rate, 2) rolling back to an earlier checkpoint and re-shuffling the remainder of the data to pick up a different optimization path.
There were arguments for changing the learning rate in either direction. The argument for increasing it was the possibility that we were stuck in a local optimum: allowing the optimizer to make bigger jumps might let the model escape the optimum and continue learning. The argument for decreasing it was based on Zhang et al. (2022a), who observed improvements after shrinking the learning rate when training got stuck. Furthermore, we had spent more steps in the high-learning-rate region of the overall learning rate schedule since, by following the Chinchilla scaling law, we had more total steps compared to models like BLOOM or GPT-3.
The other option was to roll back to an earlier checkpoint, re-shuffle the remainder of the data, and continue training. Chowdhery et al. (2022) found that when they saw spikes in the validation loss, they “re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200–500 data batches.” This suggests that data ordering mattered, and backing out of a bad data/gradient path may help. That may have been our issue with curriculum learning (v0.x), although the problems may instead have stemmed not from curriculum learning but from other issues we fixed in v1.0.
In the end, we decided to shrink the learning rate, roll back to the start of the increasing validation loss trend 7 days prior, and also re-shuffle the remaining data.
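The rollback-and-reshuffle step can be sketched as follows: batches consumed up to the checkpoint keep their order, and the remaining batches are re-shuffled with a new seed so that training follows a different data path. One batch per step, the explicit list of batch ids, and the function name `reshuffle_remaining` are assumptions made for illustration.

```python
import random


def reshuffle_remaining(batch_order, rollback_step, new_seed):
    # Batches before rollback_step were already consumed at the checkpoint; keep them.
    seen = list(batch_order[:rollback_step])
    # Everything after the checkpoint gets a fresh random order.
    remaining = list(batch_order[rollback_step:])
    random.Random(new_seed).shuffle(remaining)
    return seen + remaining
```

Because only the unseen portion is permuted, no batch is repeated or dropped relative to the original plan; only the order in which the remaining data arrives changes.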
We also became concerned that our choices were based on a single, albeit large, development set. Our validation set only contained data from July 2022 ($\mathrm{val}_{\mathrm{future}}$; roughly 105M tokens), whereas the training set ranged from 2007 to June 2022, meaning that the validation set was slightly out of distribution. We had done this to ensure a future-forward evaluation, and to ensure that the training set didn’t have leaked validation data. While this matched our goals, it was possible that a single month of data was not properly reflective of the model’s abilities, and thus we were making decisions that overfit the validation data. We created a second validation set from the last 105M tokens of the training set for offline evaluation ($\mathrm{val}_{\mathrm{past}}$). These tokens were from training but would be unobserved until the model finished training. However, since the model training data was fully shuffled, this validation set was not from a held-out time period.
To assess whether a lack of progress on validation loss translates into a lack of progress on downstream evaluation performance, we used two popular benchmarks: the multiple-choice subset of BBH (bbh) and all of MMLU. These provided additional assurance that changes in validation loss were tracking actual model improvements. Note that running a checkpoint on these benchmarks is much more time-consuming than computing the validation loss.
Using our two dev sets and downstream evaluations for guidance, we made several attempts to improve run v2.0 and direct the model to continue learning. A summary of our attempts follows:
Run: Changes from v2.0 run
Shared Change: Re-shuffle future data starting from step 115500
v2.1: Start from v2.0 step 115500; reduce max learning rate from 6e-5 to 4e-5
v2.2: Start from v2.0 step 115500; increase dropout from 0.0 to 0.1
v2.3: Start from v2.1 step 129900; reduce max learning rate from 6e-5 to 2e-5; increase dropout from 0.0 to 0.1
v2.4: Start from v2.1 step 129900; reduce max learning rate from 6e-5 to 2e-5
v2.5: Start from v2.3 step 137100; reduce max learning rate from 6e-5 to 1e-5; increase dropout from 0.0 to 0.1
v2.6: Start from v2.1 step 129900; reduce max learning rate from 6e-5 to 2e-5; reduce weight decay from 0.1 to 0.01
After we lowered the learning rate and rolled back the model (v2.1), we observed an initial sudden (and dramatic) improvement; however, validation loss quickly flattened out. Coupled with the mixed results on downstream evaluation, we decided to enable dropout for the first time with a probability of 0.1.
With dropout, as expected, we observed a larger training loss since dropout is applied during the computation of the loss (v2.2 in Figure 12). However, we observed an initially decreasing validation loss. Still, as the run progressed further, validation loss started creeping back up and met the value of the original run (v2.0, blue).
Based on these observations, we decided that further decreasing the learning rate would give us the best chance to continue learning successfully. We subsequently tried various combinations of smaller learning rates and added dropout. Figure 12 shows v2.3 (red), with a max learning rate of 2e-5, and its continuation v2.5 (brown), with a max learning rate of 1e-5, both with a dropout rate of 0.1. In Table 20, we observed that v2.3 led to much better perplexity and slightly better downstream performance, and that v2.5 initially continued to improve downstream performance compared to v2.3 while decreasing perplexity slightly. v2.4 (purple) also used a max learning rate of 2e-5, but without dropout. The only odd run during this time is v2.6 (pink), in which we experimented with a smaller weight decay of 0.01 (compared to the original 0.1) with a max learning rate of 2e-5 to investigate the possibility of getting stuck in local minima due to too strong a pull from the weight decay. However, this yields almost exactly the same curve as the original 0.1 weight decay (the only difference between v2.4 and v2.6 is the weight decay, and since they yield the same curve, v2.6 completely hides v2.4, rendering it invisible in the plot).
In conclusion, all of the runs (summarized in Figure 12) ended the same way: the validation loss eventually flattened and sometimes even increased. We did not observe that any particular change significantly improved the downstream evaluations and validation loss (Table 20).
At this point, we had used 77% of our training data and were nearing the end of the budget we had set aside for training. Combined with all of these observations and initial promising results on the downstream benchmarks, we decided to end training despite not having gone through all of our training data. Another motivating factor was the possibility of using remaining unseen training data for subsequent runs of different styles of training and finetuning.
Based on this experience, we plan to explore different options in future experiments that have shown the potential to lead to more stable training for longer durations, including SwiGLU activations (Shazeer, 2020), RoPE embeddings (Su et al., 2021b), and normalization for queries and keys in the attention layers (Henry et al., 2020).