The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

cs.CL cs.AI cs.LG

Introduction

The latest wave of language models, both public (Chung et al., 2022; Taori et al., 2023; Geng et al., 2023) and proprietary (Anil et al., 2023; OpenAI, 2023; Anthropic, 2023; Yoo et al., 2022), attribute their powerful abilities in large part to the diversity and richness of ever larger training datasets, including pre-training corpora, and finetuning datasets compiled by academics (Wei et al., 2021; Sanh et al., 2021; Muennighoff et al., 2022), synthetically generated by models (Taori et al., 2023; Wang et al., 2022a), or aggregated by platforms like Hugging Face (Lhoest et al., 2021). Recent trends see practitioners combining and re-packaging thousands of datasets and web sources (Gao et al., 2020; Penedo et al., 2023; Wang et al., 2022b; Longpre et al., 2023a), but despite some notable documentation efforts (Spacerini, 2021; Biderman et al., 2022), there are diminishing efforts to attribute, document, or understand the raw ingredients that go into new models (Dodge et al., 2021; Bandy and Vincent, 2021; Bommasani et al., 2023a).

A Crisis in Data Transparency & its Consequences. Increasingly, widely used dataset collections are treated as monolithic, instead of a lineage of data sources, scraped (or model generated), curated, and annotated, often with multiple rounds of re-packaging (and re-licensing) by successive practitioners. The disincentives to acknowledge this lineage stem both from the scale of modern data collection (the effort to properly attribute it), and the increased copyright scrutiny (Saveri et al., 2023). Together, these factors have seen fewer Datasheets (Gebru et al., 2021), non-disclosure of training sources (OpenAI, 2023; Anil et al., 2023; Touvron et al., 2023), and ultimately a decline in understanding training data (Sambasivan et al., 2021b; Longpre et al., 2023b).

This lack of understanding can lead to data leakages between training and test data (Elangovan et al., 2021; Carlini et al., 2022), expose personally identifiable information (PII) (Bubeck et al., 2023), present unintended biases or behaviours (Welbl et al., 2021; Xu et al., 2021; Pozzobon et al., 2023), and generally result in lower quality models than anticipated. Beyond these practical challenges, information gaps and documentation debt incur substantial ethical and legal risks. For instance, model releases appear to contradict data terms of use (e.g., WizardCoder (Luo et al., 2023) licensed for commercial use, while training on commercially-prohibited OpenAI data), licenses have been revised after public release (as with MPT-StoryTeller (Frankle, 2023)), and copyright lawsuits have been filed (e.g., against Stability AI (Arstechnica, 2023) and OpenAI (Saveri et al., 2023)). As training models on data is both expensive and largely irreversible, these risks and challenges are not easily remedied. In this work, we term the combination of these indicators, including a dataset's sourcing, creation and licensing heritage, as well as its characteristics, Data Provenance.

Unreliable Data Provenance & Licensing. Our work motivates the urgency of tooling that facilitates informed and responsible use of data in both pretraining and finetuning. To empower practitioners to attribute data provenance, we develop a set of tools and standards to trace the data lineage of 44 of the most widely used and adopted text data collections, spanning 1800+ finetuning datasets. We compile and expand relevant metadata with a much richer taxonomy than Hugging Face, Papers with Code, or other aggregators (see Section 2.1). With legal experts, we design a pipeline for tracing dataset provenance, including the original source of the dataset, the associated licenses, creators, and subsequent use.

As a byproduct of our work establishing the Data Provenance of widely used datasets, we are able to characterize the AI data ecosystem/supply chain (Cen et al., 2023; Bommasani et al., 2023c), as well as the state of the field for policymakers, researchers and legal experts. Our work points to a crisis in license laundering and informed usage of popular datasets, with systemic problems in sparse, ambiguous, or incorrect license documentation. Notably, we find that 70%+ of licenses for popular datasets on GitHub and Hugging Face are “Unspecified”, leaving a substantial information gap that is difficult to navigate in terms of legal responsibility. Second, the licenses that are attached to datasets uploaded to dataset sharing platforms are often inconsistent with the license ascribed by the original author of the dataset: our rigorous re-annotation of licenses finds that 66% of analyzed Hugging Face licenses were in a different use category, often labeled as more permissive than the author’s intended license. As a result, much of this data is risky to use (or harmfully misleading) for practitioners who want to respect the data provenance of a work. Our initiative reduces “Unspecified” licenses from 72%+ to 30% and attaches license URLs for under-resourced model developers to more confidently select appropriate data for their needs. To this end, the Data Provenance Initiative supports attribution and responsible AI with the following contributions:

The most extensive known public audit of AI Data Provenance, tracing the lineage of 1800+ text datasets (the “DPCollection”), their licenses, conditions, and sources. We demonstrate a growing adoption and reliance on software licenses in the AI community and synthesize observations into legal guidance for developers (Section 4).

The Data Provenance Explorer (DPExplorer; www.dataprovenance.org), an open-source repository for downloading, filtering, and exploring data provenance and characteristics. Our tools auto-generate Data Provenance Cards for scalable symbolic attribution and future documentation best practices.

We find a sharp and widening divide between commercially open and closed data, with the latter monopolizing more diverse and creative sources. We suggest a data collection focus to narrow this gap.

The Initiative to Audit Data Provenance

The Data Provenance Initiative’s goal is to audit popular and widely used datasets with large-scale Legal and AI expert-guided annotation. We propose a base set of indicators necessary for tracing dataset lineage and understanding dataset risks (described in Section 2.1). As a first contribution of the initiative, we audit 44 instruction or “alignment” finetuning data collections composed of 1858 individual datasets, selected by experts for their widespread adoption and use in the community. The selected collections and their variants see 100s to 10M+ monthly downloads on Hugging Face, with the datasets within these collections tallying to many more (Table 1).

The initiative’s initial focus on alignment finetuning datasets was decided based on their growing emphasis in the community for improving helpfulness, reducing harmfulness, and orienting models to human values (Ouyang et al., 2022). Some collections have overlapping datasets and examples, but we choose not to deduplicate them, in order to preserve the original design choices, which may include different templates, formatting, and filtering. We remove datasets related to common benchmarks like MMLU (Hendrycks et al., 2020) and BigBench (Srivastava et al., 2023).

Our information audit spans (I) identifier information, bridging metadata from several aggregators, including Hugging Face, GitHub, Papers with Code, Semantic Scholar, and ArXiv, (II) detailed dataset characteristics for a richer understanding of training set composition, and (III) dataset provenance for licensing and attribution. We expand our provenance metadata beyond just licenses, because conversations with practitioners revealed they rely not only on data licenses, but on a specific legal & ethical risk tolerance, parameterized by (a) the lineage of licenses, (b) the data source, (c) the creator’s identity, and (d) the precedence of adoption by other developers.

We release our extensive audit as two tools: (1) a data explorer interface, the Data Provenance Explorer (DPExplorer), for widespread use, and (2) an accompanying repository for practitioners to download the data filtered for license conditions. Practitioners are also able to generate a human-readable markdown summary, or Data Provenance Card, of the used datasets and their compositional properties for languages, tasks, and licenses (Section 2.3). Modern researchers training on hundreds of datasets often find it onerous to manually curate extensive data cards for these compilations (Mitchell et al., 2019; Gebru et al., 2021). We hope this tool will aid in writing the data attribution and composition sections of these documentation efforts, by providing auto-generated, copy-and-pastable dataframe summaries.

Collecting comprehensive metadata for each dataset required leveraging several sources, including collection by linking to resources already on the web, human annotation by legal experts, or using GPT-4 to assist in human annotation.

Identifier Information discloses links and connects aggregator identifiers.

Dataset Identifiers : The dataset’s name, associated paper title, and description of the dataset.

Dataset Aggregator Links : A link to each major aggregator, including GitHub, Hugging Face, Papers with Code, Semantic Scholar, and ArXiv allows us to incorporate and compare their crowdsourced metadata.

Collection : The name and URL to the data collection of which this dataset is a part.

Dataset Characteristics detail information relevant to understanding data representation/composition, and curating a training set.

Languages : Each of the languages represented in the dataset, so developers can easily follow the “Bender Rule” (Bender, 2011).

Task Categories : The 20+ task categories represented in the instructions, such as Question Answering, Translation, Program Synthesis, Toxicity Identification, Creative Writing, and Roleplaying.

Text Topics : An automated annotation of the topics discussed in the datasets, with GPT-4 labeling a sample of 100 examples for up to 10 covered topics.

Text Length Metrics : The minimum, maximum, and mean number of dialog turns per conversation, of characters (agnostic to tokenization/non-whitespace languages, as this introduces biases (Petrov et al., 2023)) per user instruction and assistant responses.

Format : The format and intended use of the data. The options are zero-shot prompts, few-shot prompts, chain-of-thought prompts, multi-turn dialog, and response ranking.

Time of Collection : The time at which the work was published, which acts as an upper bound estimate of the age of the text.

Licenses : The license name and URLs associated with the data, using the process described in Section 2.2. We also enable filtering by license use classes, categorized by legal professionals.

Text Source : The original sources of the text, often Wikipedia, Reddit, or other scraped online/offline sources.

Creators : The institutions of the dataset authors, including universities, corporations, and other organizations.

Attribution : The attribution information for the authors of the paper associated with the dataset.

Citation & Download Counts : The citation and Hugging Face download count for the paper and dataset, dated September 2023. This acts as an estimate of community use, and is commonly used as precedent when deciding on the risk level of using these datasets.
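The fields listed above can be pictured as one record per dataset. The following is a minimal sketch, not the schema used by the DPExplorer: the class name `DatasetRecord`, its field names, and the helper `text_length_metrics` are hypothetical, and a "turn" is assumed here to mean a single user or assistant message. Lengths are counted in characters, following the Text Length Metrics description.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DatasetRecord:
    """Hypothetical record mirroring some of the metadata fields listed above."""
    name: str
    collection: str
    languages: list[str] = field(default_factory=list)
    task_categories: list[str] = field(default_factory=list)
    licenses: list[str] = field(default_factory=list)
    text_sources: list[str] = field(default_factory=list)
    creators: list[str] = field(default_factory=list)


def text_length_metrics(conversations):
    """Summarize dialog turns and per-message character counts.

    Each conversation is a list of (role, text) pairs, where role is
    "user" or "assistant". Returns (min, max, mean) tuples.
    """
    turns = [len(conv) for conv in conversations]
    stats = {"turns": (min(turns), max(turns), mean(turns))}
    for role in ("user", "assistant"):
        # Character counts avoid any dependence on a tokenizer.
        lengths = [len(text) for conv in conversations
                   for r, text in conv if r == role]
        stats[role + "_chars"] = (min(lengths), max(lengths), mean(lengths))
    return stats


example = [
    [("user", "Hi"), ("assistant", "Hello there")],
    [("user", "Translate 'cat'"), ("assistant", "gato"),
     ("user", "Thanks"), ("assistant", "De nada")],
]
metrics = text_length_metrics(example)
record = DatasetRecord(name="toy-dialogs", collection="toy-collection",
                       languages=["en", "es"])
```

Keeping list-valued fields (languages, licenses, sources) as lists rather than joined strings makes later filtering by any single value straightforward.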

2.2 License Annotation Process

One of our central contributions is to validate the licenses associated with widely used and adopted datasets. This followed a time-intensive human annotation protocol, to collect dataset authors’ self-reported licenses, and categorize them according to stated conditions. Note that this protocol reflects best efforts to verify self-reported licenses, and does not constitute legal advice (see Section 4). Additionally, it is important to note that the enforceability of these licenses depends on several factors discussed in Section 4. One especially important assumption in cases where datasets are based on data obtained from other sources is that dataset creators actually have a copyright interest in their dataset. This depends on the data source and how creators modify or augment this data, and requires a case-by-case analysis. However, it appears that most developers operate under the general assumption that they alone own their datasets. Our license annotation workflow follows these steps:

Compile all Self-Reported License Information. We aggregate all licensing information reported on GitHub, ArXiv, Hugging Face, Papers with Code, and the collection itself (e.g. Super-Natural Instructions, Wang et al. (2022c)).

Search for explicit Data Licenses. The annotator searches for a license specifically given to the dataset (not the accompanying code) by the authors. A license is found if (a) the GitHub repository mentions or links a license in reference to the data, (b) the Hugging Face license label was uploaded by the dataset creator themselves, or (c) the paper, Hugging Face, or Papers with Code provide a dataset-specific license link, attributable to the data authors.

Identify a License Type. A license may fall into a set of common types (e.g. MIT, Apache 2, CC BY SA, etc.), be a “Custom” license, a permission Request Form, or if none was found for the data, Unspecified. If a dataset has multiple licenses, the annotator will list each of them, according to their types.

Categorize Licenses. From the perspective of a machine learning practitioner, licensing typically is viewed through the lens of how it impacts the model lifecycle: does it impede or allow for training on the data, downstream use conditions, attributing, modifying or re-distributing it? Based on discussions with industry experts, we categorize licenses based on three important features that impact the model lifecycle: is data usage limited to academic or non-commercial purposes (Permitted Use), does the data source need to be attributed (Attribution), and do derivatives of the data need to be licensed under the same terms as the original (Share-Alike). If there are multiple licenses for a dataset, its categorization for each feature is chosen as the strictest across licenses.

Additional Provenance. In practice, legal teams may wish to balance their risk tolerance with more nuanced criteria. For instance, they may be satisfied with using (more permissive) GitHub licenses, even when it is ambiguous whether these apply to the code or the data. They may also wish to include or exclude datasets based on whether these are already widely used in practice, where the original data was sourced from, and if the creator is a competitor. To supplement the above license categories, we also collect all this metadata for fine-grained selection and filtering.
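The categorization step, where a dataset with several licenses takes the strictest value per feature, can be sketched as follows. This is an illustration under stated assumptions rather than the annotation tooling itself: the names `LicenseTerms`, `KNOWN_LICENSES`, `USE_ORDER`, and `strictest` are hypothetical, the ordering of permitted-use classes is assumed, and the handful of license entries are illustrative and not a legal classification.

```python
from dataclasses import dataclass

# Permitted-use classes ordered from least to most restrictive (assumed order).
USE_ORDER = ("commercial", "non-commercial", "academic-only")


@dataclass(frozen=True)
class LicenseTerms:
    permitted_use: str   # one of USE_ORDER
    attribution: bool    # must the source be credited?
    share_alike: bool    # must derivatives keep the same terms?


# Illustrative entries only; a real table would be curated by legal experts.
KNOWN_LICENSES = {
    "MIT": LicenseTerms("commercial", True, False),
    "CC BY-SA 4.0": LicenseTerms("commercial", True, True),
    "CC BY-NC 4.0": LicenseTerms("non-commercial", True, False),
    "CC0 1.0": LicenseTerms("commercial", False, False),
}


def strictest(license_names):
    """Combine several licenses by taking the strictest value per feature."""
    terms = [KNOWN_LICENSES[name] for name in license_names]
    if not terms:
        return None  # no license found: the dataset stays "Unspecified"
    return LicenseTerms(
        # The most restrictive use class is the one latest in USE_ORDER.
        permitted_use=max((t.permitted_use for t in terms),
                          key=USE_ORDER.index),
        # A requirement imposed by any one license applies to the dataset.
        attribution=any(t.attribution for t in terms),
        share_alike=any(t.share_alike for t in terms),
    )


combined = strictest(["MIT", "CC BY-NC 4.0", "CC BY-SA 4.0"])
```

Treating each feature independently means the combined result may not correspond to any single named license, which is the intended behavior when only the conditions matter for filtering.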

2.3 Data Provenance Card: A Data Bibliography

Prior work has stressed the importance of data documentation and attribution (Bender and Friedman, 2018; Bommasani et al., 2023a). In particular, Gebru et al. (2021)’s Datasheets breaks down documentation into motivation, composition, collection process, processing, uses, maintenance, and distribution. Similarly, Bender and Friedman (2018) ask for curation rationale, language variety, speaker demographic, annotator demographic, speech situation, and text characteristics, among others. However, when models train on many sources of data, even if they are each rigorously documented for each of these fields (rarely the case), it is challenging to cleanly synthesize comprehensive and navigable documentation for the resulting bundle.

To make this process tractable with scale, we propose leveraging Symbolic Attribution, where our tools auto-generate a structured store of the provenance and attribution metadata, similar to a bibliography for data (auto-generated at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection). Our collected schema allows this store to succinctly capture the attribution (links to repositories, aggregator copies, papers, creators), provenance (text/machine sources, licenses), and compositional properties of the data (languages, tasks, text metrics, format, and time). This file of references and metadata, known as a Data Provenance Card, enables the comprehensive documentation proposed by prior work, while providing some advantages from its structure. First, the Data Provenance Card can be easily searched, sorted, filtered and analyzed, whereas Datasheets or Statements, designed for individual datasets, are meant to be manually read. Second, developers can efficiently assemble relevant information without losing any detail, by symbolically linking to the original datasets and their documentation. Third, as datasets are continually re-packaged and absorbed into newer and bigger collections, Data Provenance Cards are easily adaptable by simply appending or concatenating them together. Altogether, we hope this tooling enables and promotes the thorough documentation proposed in prior work (Bender and Friedman, 2018; Gebru et al., 2021; Mitchell et al., 2019; Pushkarna et al., 2022).
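To illustrate why a structured card is easy to filter and concatenate, here is a small sketch that renders records as a markdown table and merges cards from two collections. It is not the repository's generator: `make_card`, `merge_cards`, the column names, and the example.org URLs are all hypothetical, and keeping the first entry per dataset name when merging is a design choice made only for this example.

```python
def make_card(records):
    """Render records as a markdown table, one row per dataset."""
    columns = ["name", "license", "languages", "source_url"]
    lines = ["| " + " | ".join(columns) + " |",
             "| " + " | ".join("---" for _ in columns) + " |"]
    for rec in sorted(records, key=lambda r: r["name"]):
        cells = []
        for col in columns:
            value = rec.get(col, "")
            # List-valued fields are flattened for display only.
            cells.append(", ".join(value) if isinstance(value, list)
                         else str(value))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def merge_cards(*record_lists):
    """Concatenate record lists, keeping the first entry per dataset name."""
    merged = {}
    for records in record_lists:
        for rec in records:
            merged.setdefault(rec["name"], rec)
    return list(merged.values())


collection_a = [{"name": "dataset-b", "license": "CC BY 4.0",
                 "languages": ["en"], "source_url": "https://example.org/b"}]
collection_b = [{"name": "dataset-a", "license": "Unspecified",
                 "languages": ["en", "fr"],
                 "source_url": "https://example.org/a"},
                {"name": "dataset-b", "license": "CC BY 4.0",
                 "languages": ["en"], "source_url": "https://example.org/b"}]

merged = merge_cards(collection_a, collection_b)
card = make_card(merged)
# Because the store is structured, filtering is a one-line comprehension.
licensed_only = [r for r in merged if r["license"] != "Unspecified"]
```

Because each row links back to the original source rather than copying its documentation, merging two cards loses no detail from either.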

Empirical Analysis of Data Provenance

This work constitutes the first extensive study of empirical license use for Natural Language Processing datasets. In this section, we share the insights we have gathered from our large-scale annotation and categorization. There is an important assumption in this section: the OpenAI Terms of Use is a contract, not a license, which prohibits the development of competing models using its outputs. For simplicity, we treat this as a Non-Commercial license in our analysis, though this is disputed for third parties who did not generate the OpenAI data themselves and therefore may not be bound by their terms (see Section 4 for discussion). Given the intention of OpenAI not to facilitate competitive commercial uses, we follow their categorization for this analysis.

Frequency of license types. Figure 2 shows the distribution of licenses. The most common licenses are CC-BY-SA 4.0 (15.7%), the OpenAI Terms of Use (12.3%), and CC-BY 4.0 (11.6%). While most licenses are common and recognizable, there is a long tail of variants with unique settings, as well as a large set of Custom licenses accounting for 9.6% of all recorded licenses on their own. This wide license diversity illustrates the challenge to startups and less resourced organizations attempting to navigate responsible training data collection, its legality and ethics.

Distribution of Restrictive Licenses. In total, 85% of dataset licenses request attribution, and 30% include a share-alike clause (“share alike” is a copyright term meaning adaptations or copies of a work are required to be released under the same license as the original). Datasets which request attribution pose challenges for practitioners who commonly train on hundreds of datasets and either don’t cite them at all (OpenAI, 2023; Anil et al., 2023; Touvron et al., 2023) or simply cite an aggregation of data, which often falls short of the license’s conditions of attributing the specific repository or paper. Furthermore, share-alike clauses pose challenges for practitioners re-packaging data collections, which usually carry multiple conflicting share-alike licenses without a clear way to resolve them (like Longpre et al. (2023a); Wang et al. (2022c) and others in the DPCollection). Frequently, practitioners will over-write share-alike licenses with more restrictive or even less restrictive conditions.

Missing or Unspecified Licenses. Next, we compare our manually reviewed licensing terms to the licenses for the same datasets, as documented in the aggregators GitHub, Hugging Face, and Papers with Code. Table 2 shows that these crowdsourced aggregators have an extremely high proportion of missing (“Unspecified”) licenses, ranging from 69-72%, as compared to our protocol, which yields only 30% “Unspecified”. The problem with “Unspecified” licenses is that it is unclear whether the gap is due to a shortcoming of the aggregator or because creators intentionally released the datasets without a license. Consequently, risk-averse developers are forced to avoid many valuable datasets, which they would otherwise use if they were given assurance that there is indeed no license. As part of DPCollection, we manually reassign 46-65% of dataset licenses (depending on the platform), resulting in much higher coverage, thus giving risk-averse developers more confidence and breadth in their dataset utilization.

Incorrectly Specified Licenses. Table 2 also shows that the licenses we assign are frequently stricter than those listed by aggregators. GitHub, Hugging Face, and Papers with Code each label license use cases too permissively in 29%, 27%, and 16% of cases, respectively. Our inspection suggests this is due to contributors on these platforms often mistaking licenses attached to code in GitHub repositories for licenses attached to data.
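The two comparisons above (share of "Unspecified" labels and share of labels that are too permissive) reduce to simple counting once both label sets use the same use categories. The sketch below is illustrative only: `audit_aggregator`, the `RANK` ordering, and the toy labels are assumptions, and a dataset missing from the aggregator is counted as unspecified.

```python
# Restrictiveness rank: higher means more restrictive (assumed ordering).
RANK = {"commercial": 0, "non-commercial": 1, "academic-only": 2}


def audit_aggregator(aggregator_labels, audited_labels):
    """Compare an aggregator's use-category labels with audited ones.

    Both arguments map dataset name -> category, where a category is a
    key of RANK or the string "unspecified".
    """
    names = list(audited_labels)
    total = len(names)
    unspecified = 0
    too_permissive = 0
    for name in names:
        # A dataset absent from the aggregator is treated as unspecified.
        agg = aggregator_labels.get(name, "unspecified")
        ours = audited_labels[name]
        if agg == "unspecified":
            unspecified += 1
        elif ours in RANK and RANK[agg] < RANK[ours]:
            too_permissive += 1
    return {
        "unspecified_pct": 100 * unspecified / total,
        "too_permissive_pct": 100 * too_permissive / total,
    }


audited = {"a": "non-commercial", "b": "commercial",
           "c": "academic-only", "d": "commercial"}
aggregator = {"a": "commercial", "b": "commercial", "c": "unspecified"}
report = audit_aggregator(aggregator, audited)
```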

3.2 How does Data Availability Differ by License Use Category?

While non-commercial and academic-only licenses play important roles in protecting data use, their presence can also exclude communities from participating (or competing) in the development of these technologies. In this section, we break down datasets according to their license restrictions and see how they differ. Specifically, we ask: Does complying with licenses dictate systematic differences in resources for commercially-permissive (“open”) and non-commercial (“closed”) development? And what particular features of data are particularly constrained by non-commercial prohibitions?

We compare datasets by categories of permitted use, according to their licenses: (1) Commercially viable, (2) Non-Commercial/Academic-Only (NC/A-O), or (3) Unspecified license. We group together Non-Commercial and Academic-Only conditions as the distinction will rarely matter for developers. We argue in Section 4 that datasets without any license (Unspecified) have not imposed any conditions, so can often be treated as commercially viable, but this may depend on a developer’s risk tolerance and jurisdiction.

Non-Commercial & Academic-Only Licensed Datasets have statistically greater diversity in their representation of tasks, topics, sources, and target text lengths. For each of these features, Table 3 illustrates the mean number per dataset, broken down by license category and entropy to measure the randomness, and thus diversity, of each feature. NC/A-O datasets see greater diversity of tasks, topics, and sources represented in the text than commercial datasets. Figure 4 shows where this diversity comes from. The most NC/A-O task categories include Brainstorming, Explanation, Logic & Math, as well as Creativity and Creative Writing. In comparison, the most commercially viable task categories are Short Text Generation, Translation, and Classification. Similarly, among Source Domains, Governments and Search Queries are largely viable for commercial (and unspecified) purposes, whereas General Web, Exams, and Model-generated sources are among the most restrictive.
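As a rough illustration of entropy as a diversity measure, the sketch below computes Shannon entropy over a pool of category labels. The text does not state which entropy variant or pooling is used, so Shannon entropy in bits over pooled labels is an assumption, and `shannon_entropy` and the toy label lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four datasets that all cover the same task: no diversity.
concentrated = ["Classification"] * 4
# Four datasets spread evenly over four tasks: maximal diversity for 4 labels.
spread = ["Classification", "Translation", "Brainstorming", "Creative Writing"]

low = shannon_entropy(concentrated)
high = shannon_entropy(spread)
```

Entropy rises both with the number of distinct categories and with how evenly the datasets are spread across them, which is why it complements the mean count per dataset.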

Target Text Lengths are significantly higher for NC/A-O datasets than commercial datasets. Not only do NC/A-O datasets appear more textually and functionally diverse, their length characteristics differ substantially. While Table 3 shows the input text lengths across license categories are similar on average, the target text lengths are significantly higher for NC/A-O datasets than for commercial ones (677 vs 103). This breakdown is further illustrated in Figure 5, where we see greater representation of both NC/A-O and synthetic datasets above the 100 target token threshold (y-axis).

The rise of synthetic datasets generated using APIs with non-commercial terms of use may explain the differences in text diversity and length. Table 3 also shows a full 45% of NC/A-O datasets are synthetic, as compared to $<14\%$ in more permissive license categories. Taori et al. (2023); Wang et al. (2022a); Xu et al. (2023a) and their variants, all generated in part using commercial APIs, exhibit stronger task and topic diversity than traditional academic datasets, as they cater to longer form generations, by design. This is evident from the concentration of creative, brainstorming, and reasoning tasks baked into them, as compared to the focus of more topic-focused question answering, classification, and short text generation in non-synthetic datasets. These datasets are usually created using larger proprietary models, mostly from OpenAI APIs. The OpenAI Terms of Use (https://openai.com/policies/terms-of-use) state “you may not…use output from the Services to develop models that compete with OpenAI”, which we discuss in Section 4.

2023 has a large spike in license usage, and in NC/A-O licensed data, representing 61%, as compared to 20% on average in prior years. Among the large collection of datasets we trace, we record the date at which they are released, by cross-referencing their associated GitHub, ArXiv, and Hugging Face dates. We find a striking change in the pattern of licensing restrictions. As shown in Figure 3, prior to 2023, no year saw greater than 1/3 of the datasets released as NC/A-O. However, in 2023, which includes many of the most popular and diverse datasets, the NC/A-O rate is 61%. Furthermore, most datasets were unaccompanied by a license prior to 2022 (~50-80%), as compared to only 12% in 2023. The shift to more license use, and more restrictively conditioned data releases, may foretell future challenges to open data, if the trend continues.

Commercial datasets have greater language variety, but low-resource language datasets see the least commercial coverage. Table 3 shows that commercial datasets actually have greater diversity of languages than NC/A-O. However, when broken down by language family, as in Figure 3, we see stark differences in permitted use by group. Code language datasets are nearly all commercially viable (78%), because dataset creators can easily filter GitHub for permissively licensed repositories. Interestingly, English, Atlantic-Congo, and Afroasiatic languages also see large permissive representation. However, Turkic, Sino-Tibetan, Japonic, and Indo-European languages see in excess of 35% as non-commercial. Note that while the Indo-European language family contains many high-resource European language families, there is a long tail of lower-resource ones. These NC/A-O language families provide directions for open data practitioners to focus their future efforts.

3.3 Broader Characteristics of the Data

In addition to understanding systematic differences in the data by license, there are research questions regarding the overall composition and characteristics of these widely used and adopted datasets. Our compilation of metadata through the DPCollection allows us to map the landscape of data characteristics, and inspect particular features. Note that all these details are also available with interactive visualizations at www.comingsoon.com, for further research and examination.

Language representation is heavily skewed to English and Western European Languages. Following Talat et al. (2022)’s recommendations in data transparency and documentation in demographic analysis, and corroborating Kreutzer et al. (2022)’s similar analysis for pretraining corpora, we find a stark Western-centric skew in representation. Figure 6 illustrates the coverage per country according to the spoken languages and their representation in DPCollection. We compute a Language Representation score $S_{k}$ for each country $k$, parametrized by $p_{kl}$, the percentage of people in country $k$ that speak language $l$, and $w_{li}$, which is a binary indicator that is 1 if dataset $i\in D$ contains language $l$ and 0 otherwise.
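The closed form of $S_{k}$ is not reproduced in this text. One form consistent with the definitions of $p_{kl}$ and $w_{li}$, assumed here purely for illustration, weights each language's dataset count by the share of the country's population that speaks it: $S_{k} = \sum_{l} p_{kl} \sum_{i\in D} w_{li}$. The sketch below implements that assumed form; `language_representation`, the toy countries, and the population shares (given as fractions rather than percentages) are hypothetical.

```python
def language_representation(population_share, dataset_languages):
    """Score each country by how well its spoken languages are covered.

    population_share: {country: {language: fraction of people speaking it}}
    dataset_languages: one collection of language codes per dataset in D
    """
    # Number of datasets containing each language, i.e. the sum over i of w_li.
    coverage = {}
    for languages in dataset_languages:
        for lang in set(languages):
            coverage[lang] = coverage.get(lang, 0) + 1
    # Weight each language's dataset count by the share of speakers, p_kl.
    return {
        country: sum(share * coverage.get(lang, 0)
                     for lang, share in shares.items())
        for country, shares in population_share.items()
    }


datasets = [{"en"}, {"en", "fr"}, {"sw"}]
shares = {
    "Country A": {"en": 0.9, "fr": 0.2},
    "Country B": {"sw": 0.5, "yo": 0.4},
}
scores = language_representation(shares, datasets)
```

Under this form a country scores zero only if none of its spoken languages appears in any dataset, and scores are not normalized, so they are comparable across countries but not bounded.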

The distribution visualized in Figure 6 shows that Asian, African, and South American nations are sparsely covered if at all. Even when nations from the Global South appear to have linguistic representation, according to Section 3.3, the text source and dialect of the language contained in these datasets almost always originates from North American or European creators and web sources (though this is difficult to measure precisely). These observations corroborate similar findings in the geo-diversity of image data in the vision domain (Shankar et al., 2017; De Vries et al., 2019; Mahadev and Chakravarti, 2021). The resulting models trained on these datasets are likely to have inherent bias, underperforming in critical ways for users of models outside of the west (Ahia et al., 2021).

The primary drivers of dataset curation are Academic organizations, supplying 69%, followed by 21% industry labs, and 17% research institutions. These metrics describe the scale of dataset curation contributions, but not the influence each dataset has had on the community. The table of dataset creators shows that the single largest dataset contributors are AI2 (12.3%), University of Washington (8.9%), and Facebook AI Research (8.4%). It is important to note that these contributors often only download and compile text from the Internet that was originally written by other people.

Text datasets focus on topics of Language & Linguistics, General Knowledge, Logic, & Lifestyle. Prior data collection work focuses predominantly on describing datasets by their task compositions (Sanh et al., 2021; Wang et al., 2022a; Longpre et al., 2023a), but rarely by their actual topics (except Gao et al. (2020) in their Appendix). The table of text topics shows the most popular topics, clustered by category, with their representation across datasets. Like most NLP tasks, much of this text data focuses on communication and language understanding topics, followed closely by general knowledge, routine, sports, and education.

Text datasets are sourced primarily from Online Encyclopedias (22%), Social Media (16%), the General Web (11%), News (11%), and Entertainment web resources (9%). While practitioners document their individual dataset sources in their published papers, this information is unstructured and can be hard to find. As a result, massive collections of widely used datasets rarely compile the distribution of their original sources, instead just citing the papers. After a series of dataset compilations and re-packaging, the original sources are often lost or not well known. By manually scanning approximately 500 academic papers, our volunteers annotated the original text sources and compiled them into domain clusters, to permit attribution and analysis, as summarized in the table of source domains. Among individual sources, the most widely used are wikipedia.org (14.9%), undisclosed webpage scrapes (7.0%), Reddit (6.2%), and Twitter (4.0%). The least represented domains are Commerce, Reviews, Legal, Academic Papers, and Search Queries, among others.

Legal Discussion

Our empirical analysis highlights that we are in the midst of a crisis in dataset provenance and practitioners are forced to make decisions based on limited information and opaque legal frameworks. While we believe our tooling will enable better transparency about where licenses are in tension, major legal ambiguities remain in data licensing.

Copyright laws aim to encourage written and artistic expression by giving authors exclusive rights to copy, distribute, and adapt their work (Patterson, 2003; Burger, 1988). Open-source licenses first emerged as legal tools to encourage collaboration around software development (Von Krogh and Von Hippel, 2003). A range of licenses with different terms and purposes exists including the MIT License, Creative Commons Licenses, and the Apache License, as well as the newer Responsible AI License (RAIL) and AI2 ImpACT Licenses (see https://www.licenses.ai/blog/2023/3/3/ai-pubs-rail-licenses and https://allenai.org/impact-license#licenses; these license templates propose terms aimed at encouraging more responsible or risk-based machine learning practices, see also Contractor et al. (2022)). The interplay between copyright and licenses can be understood in the following way: copyright automatically gives creators exclusive rights in their works and creators assign these rights to others through license agreements. As we will explore, the open-source licenses that emerged in the last three decades are not always well-equipped to handle the unique characteristics of data, and especially supervised AI training data. Meanwhile, it remains unclear how relevant laws, including those related to copyright and fair use, should be applied to the unique challenges raised by Generative AI and supervised datasets (Lee et al., 2023). In this section, we highlight some of the key legal challenges and ambiguities related to supervised datasets.

We focus on supervised datasets, which we define as datasets that are created for machine learning (mainly for finetuning and alignment) and where dataset creators made copyrightable contributions in the form of annotations or compilations. A typical supervised dataset is the result of a process that involves several stages of scraping (or machine generation) and annotation by different entities. Generally, raw data is created by people interacting with internet platforms, such as individuals writing articles, sharing artworks, or engaging in online discussion forums. The copyrights to this raw data are normally held by individual users (e.g. Reddit) or by the platform (e.g. Amazon Reviews). Much of this data has been scraped to construct unsupervised datasets for machine learning and this use is commonly justified on the basis of fair use or data mining exceptions to copyright (Henderson et al., 2023; Sobel, 2017; Lee et al., 2023; Samuelson, 2023; Lemley and Casey, 2020). However, we find that many common supervised datasets are generated by annotating small samples of scraped raw data using human annotators or large language models. The annotated data is then published with a license agreement. In stark contrast to the copyrighted content that is scraped from the web, supervised datasets were created for the sole purpose of furthering machine learning. The focus of the legal discussion in this section is on how supervised dataset creators can constrain the usage of the copyrightable content they create through licenses and other legal mechanisms. Though we do not address them here, there are several important related questions on the use of copyrighted works to create supervised datasets and on the copyrightability of training datasets.

The legal analysis surrounding supervised datasets is complicated by the lack of a uniform global legal framework to address copyright concerns. Different jurisdictions have different and evolving laws. Therefore, the location of model developers and training data creators as well as where and when data was collected may influence the legal analysis. For example, the United States has a fair-use exception to copyright that allows the limited use of copyrighted material under certain circumstances without requiring permission from the rights holders (17 U.S.C. §107). The E.U. has no fair-use provision but does have an explicit copyright exception to allow data mining under certain conditions, like obtaining lawful access to the data (Margoni and Kretschmer, 2022). Meanwhile, datasets themselves generally enjoy copyright protection in the U.S. (Lee et al., 2023) while the E.U. recently created a unique set of rights for dataset creators with the purpose of incentivizing research and development related to databases (Derclaye and Husovec, 2022). In addition to differences across jurisdictions, there are also several international agreements related to copyright (Ricketson and Ginsburg, 2022). Ultimately, it can be challenging to determine which laws should apply to a given machine learning project when the relevant rules vary between the locations where the data was scraped and annotated, where it was downloaded, where the model was trained, and where the model was deployed.

While geographical disparities in regulatory frameworks present one set of challenges, the subjectivity inherent in determining whether copyright infringement has occurred makes it even more challenging to design technical safeguards. For example, in the U.S., part of the copyright infringement analysis depends on whether two works are subjectively similar from the perspective of an ordinary person (Mohler, 1999; Cohen, 1986; Balganesh et al., 2014). This is a subjective standard, and existing case law may be challenging to extend to generative AI outputs. As a result, while there are technical strategies that can reduce the risk of infringement (Henderson et al., 2023; Sag, 2023; Vyas et al., 2023), it will be difficult for developers to create technical safeguards that eliminate this risk entirely.

Apart from these jurisdictional and interpretive ambiguities, the process of training a model raises specific copyright questions (Epstein et al., 2023), and infringement may occur in several ways even before any outputs are generated.

First, the act of creating a training dataset by scraping existing works involves making a digital copy of the underlying data. As the name implies, copyright gives the author of a protected work the exclusive right to make copies of that work. If the scraped data is protected by copyright, then creating training data corpora may raise copyright issues (Quang, 2021). Second, copyright holders generally have an exclusive right to create derivative works (e.g., translations of a work) but it is not clear whether a trained machine learning model should be considered a derivative of the training data (Lee et al., 2023). If models are considered to be derivative works, then training a model would be more likely to violate the rights of the training data’s copyright holders (Gervais, 2021).

In the U.S., the fair use exception may allow models to be trained on protected works (Henderson et al., 2023; Lemley and Casey, 2020; Sobel, 2017; Samuelson, 2023). As these authors explain, the training of machine learning models on copyrighted content may be permissible if the underlying works are significantly “transformed” into model weights, only a small amount of each work in the training data is included in the trained model, model training is designed to only glean generalizable insights from the training data, and the trained model does not have a strong effect on the economic success of the works in the training data. It is important to underscore that, while training a machine learning model itself may be protected by fair use, this does not mean that model outputs will not infringe on the copyright of prior works. As the authors above highlight, the application of fair use in this context is still evolving and several of these issues are currently being litigated (see e.g., Andersen v. Stability, Doe v. GitHub, and Tremblay v. OpenAI).

The prior literature on fair use and machine learning tends to focus on copyrighted art or text that was scraped to train a model. These scraped works were not created for the purpose of training machine learning models. By contrast, in this paper, we focus on supervised datasets that were created for the sole purpose of training machine learning models. As underscored by Henderson et al. (2023) and Sobel (2017), the fair use analysis depends in part on whether a trained model copies the “expressive purpose” of the original work. While the expressive purpose of a piece of text or art is not to train machine learning models, the purpose of a training dataset is to do just that. As a result, we expect that it is less likely that fair use would apply to the use of curated data. Instead, the creators of these datasets hold a copyright in the dataset, and the terms of the dataset license agreement govern the subsequent use of this data. (Data ownership and data copyright are complex topics (Ginsburg, 1992). We assume that the creators of supervised datasets have some form of copyright in their dataset, though there is often content in these datasets that is owned by third parties. If they satisfy the requirements for copyrightability, dataset creators would have a copyright interest in any new content they create (e.g. annotations). In the U.S., datasets themselves may also be copyrightable as compilations (Lee et al., 2023) while the E.U. provides so-called sui generis rights for databases (Derclaye and Husovec, 2022).) However, it is rare in practice for an LLM to use a single supervised dataset and often multiple datasets are compiled into collections. This further complicates the legal analysis because we find that the license terms of many popular dataset collections are conflicting.

Beyond the intricate interplay between training data and fair use, the frequently misapplied licensing frameworks for datasets present another set of complications. Most open-source licenses were designed for software, not data, yet we find them being attached to datasets, which creates challenges (Meeker, 2022). One of the challenges is that licenses like the Apache and Creative Commons licenses outline restrictions related to “derivative” or “adapted works” but it remains unclear if a trained model should be classified as a derivative work. This issue is further exacerbated when multiple datasets, each potentially governed by a different open-source license, are amalgamated into collections. If the requirements of the underlying license agreements are irreconcilable, such as different copyleft requirements, this makes it extremely hard for developers to use certain collections while respecting all license terms. To remedy these issues, new licenses are being proposed to address the needs of machine learning datasets, such as the BigScience Responsible AI License or an adaptation of the MIT License, proposed by Ioannidis et al. (2023), that requires additional permissions for model training. Despite these new proposals, we find that the majority of datasets are licensed under conventional open-source licenses.

We find that approximately 12% of the datasets we audit were annotated using OpenAI. The OpenAI Terms of Use (https://openai.com/policies/terms-of-use) state that outputs from the OpenAI service may not be used “to develop models that compete with OpenAI”. These terms seem to preclude a developer from using OpenAI to generate training data to train a competing LLM. However, it is not clear whether they would also limit the ability of a developer to use OpenAI to create and publish an annotated dataset. On the one hand, publishing such a dataset does not directly compete with OpenAI. On the other hand, it seems foreseeable that such a dataset could enable third parties (who did not themselves use OpenAI) to create competing LLMs. In the U.S., there are several doctrines of secondary or indirect copyright liability that aim to enforce copyright in cases where there is no direct infringement (Grossman, 2005; Lee et al., 2023). The application of these doctrines depends on many factors, most importantly on whether OpenAI has a copyright interest in its outputs. If these copyright doctrines do not apply, then it is still possible that publishing the dataset constitutes a breach of contract by the dataset developers. While it would be more challenging for OpenAI to pursue a case against third parties, there are myriad other business torts, from unfair competition to misappropriation, that may be relevant to this situation, and which go beyond the scope of this paper (Marks and Moll, 2023). Time will tell the extent to which OpenAI and other LLM service providers can enforce their terms of use against third parties. However, a prominent researcher at Google has already resigned citing concerns that OpenAI outputs were used to train Bard (Victor and Efrati, 2023). In light of these legal ambiguities, our tool gives developers the ability to exclude OpenAI-generated datasets.
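As an illustration of the kind of exclusion described above, the following is a minimal sketch and not the Data Provenance Explorer's actual implementation. The metadata schema is an assumption: each dataset record is taken to be a dictionary with a hypothetical `model_generators` field listing the model providers used to produce or annotate its text.

```python
def exclude_generated_by(datasets, providers=("OpenAI",)):
    """Drop dataset records whose text was generated or annotated by a listed provider.

    `datasets` is a list of dicts. The `model_generators` key is a hypothetical
    field name used for illustration; records without it are kept.
    """
    blocked = {p.lower() for p in providers}
    kept = []
    for record in datasets:
        generators = {g.lower() for g in record.get("model_generators", [])}
        # Keep the record only if none of its generators is in the blocked set.
        if not (generators & blocked):
            kept.append(record)
    return kept
```

Matching is case-insensitive so that variants such as "openai" and "OpenAI" are treated alike. Whether a record with no generator information should be kept or dropped is a policy choice that depends on a practitioner's risk tolerance; this sketch keeps it.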

In the face of these pervasive legal uncertainties, practitioners’ decisions regarding data usage are ultimately guided by a blend of factors including the specific licensing terms, the origin of datasets, and the degree of usage of a given dataset by others. Navigating this landscape requires striking a delicate balance between risk mitigation and the need for sufficient resources. This equation, however, varies across regions, applications, and corporate environments, influenced by factors such as competition, risk, and regional legislation. A strategy for partially mitigating these uncertainties is for model providers to indemnify users, as done by Google Cloud (Suggs and Venables, 2023). However, this may not be feasible for resource-constrained developers and, while it protects end-users, it does not solve the issues faced by model developers or dataset curators.

The fundamental purpose of copyright is to encourage creativity and innovation. As we highlighted in the sections above, the current legal landscape remains ambiguous and this lack of clarity can stifle innovation as developers fear legal repercussions. Through our audit and tooling, and in light of ongoing litigation and a lack of legal certainty, we seek to provide important information for practitioners to make informed decisions in an otherwise ambiguous landscape, guided by their own legal interpretation and risk tolerance. This information includes data license lineages, a categorization of license terms, details on data creators, and the underlying data sources (e.g. web or LLM). In creating a repository of data licensing information, we are also taking a step towards encouraging dataset creators to be more thoughtful about the licenses that they select. Dataset creators are well-positioned to understand the appropriate uses of the datasets they publish and licenses can be a tool to communicate these restrictions and to encourage responsible AI development. We further aim to highlight that machine learning practitioners should take dataset license terms seriously, as they may have real impacts on how their models may be used in practice. Ultimately, thoughtful data licensing could be leveraged to promote more responsible, inclusive, and transparent machine learning practices.

It is important to note we collect self-reported licenses, and categorize them according to our best efforts, as a volunteer research and transparency initiative. The information provided by any of our works and any outputs of the Data Provenance Initiative do not, and are not intended to, constitute legal advice; instead, all information, content, and materials are for general informational purposes only. Readers and users should seek their own legal advice from counsel in their relevant jurisdiction.

Related Work

A long line of work has highlighted the importance of data and its documentation in natural language processing (Paullada et al., 2021; Rogers, 2021; Meyer et al., 2023; Gururangan et al., 2018; Muennighoff et al., 2023b). In particular, these works stress the challenges posed by poor documentation to reproducibility, good science, and generally well-understood model behavior (Sambasivan et al., 2021a; Bandy and Vincent, 2021; Longpre et al., 2023b). Recent work has also explored the importance of documenting AI ecosystems (Bommasani et al., 2023b) and the supply chain from data to models (Cen et al., 2023).

Several notable works have conducted large-scale analyses into data, particularly pretraining text corpora (Gao et al., 2020; Dodge et al., 2021; Kreutzer et al., 2022; Laurençon et al., 2022; Scao et al., 2022a, b; McMillan-Major et al., 2022). Other works have investigated the geo-diversity of vision-based datasets (Shankar et al., 2017; De Vries et al., 2019; Mahadev and Chakravarti, 2021). Different forms of data governance have been proposed to centralize responsibility and documentation over datasets, including for the BigScience project (Jernite et al., 2022) and a Public Data Trust (Chan et al., 2023). In terms of finding and visualizing datasets, a few recent tools have been proposed (Färber and Leisinger, 2021; Viswanathan et al., 2023).

Adjacent to the realm of legality, prior works have strongly advocated and provided frameworks for documentation and audits to increase transparency and accountability in AI systems (Miceli et al., 2022; Kapoor et al., 2023; Raji and Buolamwini, 2022). In a manner akin to DPI, which draws upon the collective knowledge of legal and machine learning experts, earlier research has also underscored the significance of interdisciplinary collaborations (Hutchinson et al., 2021). Datasheets for Datasets (Gebru et al., 2021) and Data Statements (Bender and Friedman, 2018) both provide structured frameworks for revealing essential metadata such as the motivation behind intended use. Pushkarna et al. (2022) expanded on datasheets with “Data Cards” for sources, collection, ethics, and adoption.

Similarly, Mitchell et al. (2019) introduced model cards to benchmark model performance across demographic groups and disclose evaluation procedures. Crisan et al. (2022) proposed interactive model cards as an alternative mode of documentation and metadata sharing. Complementary to transparency regarding the dataset’s creation process, Corry et al. (2021) provide a framework that guides users on how to navigate datasets as they approach the end of their life-cycle. DPI builds upon the foundational frameworks laid out in these earlier studies, with a specific focus on addressing the licensing aspects of dataset curation. Our goal is to equip users with a comprehensive understanding of the legal risks associated with dataset usage.

The legality of the datasets used to train large base models has recently received significant attention (Sag, 2020; Henderson et al., 2023). The challenge of determining the legality of employing different datasets becomes particularly complex due to the intricate nature of dataset creation processes. Lee et al. (2023) break up the stages of dataset creation and model generation and assess the relevant copyright questions in the US legal system. These processes often involve multiple licenses and restrictions that can interact in ways that obscure the final legal risk. Soh (2021) proposes a high-level framework for pinpointing the areas within dataset creation and usage where legal analysis is necessary, but does not apply this framework to any existing datasets. Min et al. (2023) demonstrate that refraining from training on copyrighted or highly restricted datasets has a detrimental impact on downstream performance. Their proposed solution involves using a language model trained on “low-risk” text and augmenting it with a data-store containing “high-risk” text which can be modified appropriately as the legal landscape clarifies over time (Lee et al., 2023). DPI enhances these investigations by involving legal experts in the development of a framework for assessing a dataset’s “risk” and annotating the “risk” associated with numerous existing high-profile datasets.

Acknowledgements

We would like to thank Katherine Lee, A. Feder Cooper, Peter Henderson, Aviya Skowron and Stella Biderman for valuable comments and feedback.

References

Appendix A Contributions

Here we enumerate the author contributions. We would like to emphasize that all authors contributed crucial elements to this project, and Core Contributors in particular are recognized for hands-on service to the design and construction of Data Provenance’s first implementation.

Shayne Longpre Core Contributor. Primary designer and coder of the repository and explorer interface. Led audit implementation, and analysis, as well as the manual annotation process.

Robert Mahari Core Contributor. Led the legal analysis, and licensing annotation design.

Anthony Chen Core Contributor. Led automatic inference of dataset text metrics, topics, and task category annotations. Supported writing, analysis, and code testing.

Naana Obeng-Marnu Core Contributor. Led visualization design, particularly interactive visualizations in the Data Provenance Explorer.

Damien Sileo Core Contributor. Led data aggregator linking, and metadata scraping. Supported writing, analysis, source annotation and adding datasets.

William Brannon Core Contributor. Added 8 data collections, supported writing and data analysis.

Niklas Muennighoff Core Contributor. Added several large data collections, supported writing, analysis, visualization, and source annotations.

Nathan Khazam Core Contributor. Led licensing annotation effort and supported adding datasets along with testing.

Jad Kabbara Core Contributor and Advisor. Led text source annotation effort and supported with framing, writing and analysis.

Kartik Perisetla Core Contributor. Added several datasets, supported writing, analysis, and dataset preparation for Hugging Face.

Xinyi (Alexis) Wu Core Contributor. Added several datasets, testing, and supported automatic metadata collection.

Enrico Shippole Core Contributor. Led final dataset preparation for Hugging Face upload and testing.

Kurt Bollacker Advisor on project design and framing.

Tongshuang Wu Advisor, particularly on data analysis and visualizations. Supported writing and Data Provenance Explorer design.

Luis Villa Advisor on data copyright and licensing, and supporting writing in the legal discussion section.

Sandy Pentland Advisor on general project design and framing.

Sara Hooker Advisor on general project design and framing, as well as supporting writing, analysis, and directing experiments.

Appendix B Exact Licenses and Citations

See Table 5 for a summary of the Data Provenance Collection licenses and citations. More comprehensive details are available at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection.

Appendix C Details on Collecting Data Provenance

This data was collected with a mix of manual and automated techniques, leveraging dataset aggregators like GitHub, Hugging Face and Semantic Scholar. Annotating and verifying license information, in particular, required a carefully guided manual workflow, designed with legal practitioners (see Section 2.2). Once these information aggregators were connected, it was possible to synthesize or scrape additional metadata, such as dataset languages, task categories, and time of collection. For richer details on each dataset, like text topics and source, we used carefully tuned prompts on language models inspecting each dataset.

Based on the manually retrieved pages, we automatically extract licenses from Hugging Face configurations and GitHub pages. We leverage the Semantic Scholar public API (Kinney et al., 2023) to retrieve the release date and current citation counts associated with academic publications. Additionally, we compute a series of other helpful, but often overlooked, data properties such as text metrics (the min/mean/max for input and target lengths) and dialog turns. We elected to measure sequence length in characters rather than word tokens, for fairer treatment across languages and scripts, given well-known differences in tokenizer performance across different languages (Petrov et al., 2023).
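The character-level text metrics described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's released code: each example is assumed to be a dictionary with hypothetical `input` and `target` string fields, and length is counted with Python's `len`, which counts Unicode code points rather than bytes or tokens.

```python
from statistics import mean


def length_stats(texts):
    """Return min/mean/max character lengths for a non-empty list of strings."""
    lengths = [len(t) for t in texts]  # len() counts Unicode code points
    return {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)}


def dataset_text_metrics(examples):
    """Compute character-length statistics for inputs and targets.

    `examples` is a list of dicts with hypothetical 'input' and 'target' keys.
    """
    return {
        "input": length_stats([ex["input"] for ex in examples]),
        "target": length_stats([ex["target"] for ex in examples]),
    }
```

Counting characters avoids committing to any one tokenizer, which is the motivation given in the text; a practitioner who needs token counts for a specific model would substitute that model's tokenizer for `len`.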

While Task Categories have become the established measurement of data diversity in recent instruction tuning work (Sanh et al., 2021; Wang et al., 2022a), there are many other rich features describing data diversity and representation. To augment this, we use OpenAI’s GPT-4 API to help annotate for text topics. We randomly sample 100 examples per dataset and carefully prompt GPT-4 to suggest up to 10 topics discussed in the text.
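The sampling step above can be sketched as follows. The prompt wording is hypothetical (the exact prompt is not given in this section), and the sketch deliberately stops before any API call; the fixed seed is an added assumption to make the sample reproducible.

```python
import random


def sample_examples(examples, k=100, seed=0):
    """Randomly sample up to k examples; return all of them if fewer than k exist."""
    if len(examples) <= k:
        return list(examples)
    rng = random.Random(seed)  # local generator so global random state is untouched
    return rng.sample(examples, k)


def build_topic_prompt(samples, max_topics=10):
    """Assemble an illustrative topic-annotation prompt (wording is hypothetical)."""
    joined = "\n---\n".join(samples)
    return (
        f"Suggest up to {max_topics} topics discussed in the following texts.\n\n"
        f"{joined}"
    )
```

The resulting string would then be sent to the language model; how the response is parsed into a topic list depends on the output format requested in the prompt.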

To annotate for the original data sources, AI experts (PhD students and postdocs) reviewed the papers and filled out the original text sources, whether machines or template-generation were used for synthetic generation, and whether human annotators were used. GPT-4 was used as an in-context retriever on the dataset’s arXiv paper to extract snippets that the experts may have missed. We split the arXiv paper into 4000-character chunks and prompt the API to return a JSON list of any mentions of the dataset source, e.g. from scraping, synthetic or manual generation.
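The chunking and response-parsing steps can be sketched as follows. This is a minimal sketch under stated assumptions: chunks are taken as non-overlapping, fixed-size character windows (the text does not say whether overlap was used), and the model response is assumed to be a JSON list of strings; the function names are illustrative.

```python
import json


def chunk_text(text, chunk_size=4000):
    """Split text into consecutive, non-overlapping chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def parse_mentions(response_text):
    """Parse a model response expected to be a JSON list of source mentions.

    Returns an empty list if the response is not valid JSON or not a list,
    so that one malformed response does not abort the whole paper.
    """
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [m for m in parsed if isinstance(m, str)]
```

Fixed-size windows can split a sentence across two chunks, so a mention that straddles a boundary may be missed; overlapping windows are one common remedy, at the cost of more API calls.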