The Pushshift Reddit Dataset
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, Jeremy Blackburn
Introduction
Understanding complex socio-technical phenomena requires data-driven research based on large-scale, reliable, relevant data sets. Web data, particularly data from application programming interfaces (APIs), has been an enormous boon for researchers using online social platforms’ databases of user-generated activity and content. The ability to “crawl” and “scrape” large-scale and high-resolution samples of publicly-accessible user data stimulated emerging fields like social computing and computational social science, and developed new fields like crisis informatics. But following major scandals around data privacy and ethics, social media platforms like Facebook and Twitter changed previously permissive data access provisions of their public APIs. As a consequence, the ability for researchers to collect timely data, share tools, instruct students, and reproduce findings has been curtailed.
This “post-API age” is characterized by the deprecation of data resources used for research and teaching, increased stratification of data access based on social, technical, and financial capital, and greater fear of prosecution around violating terms of service in the course of research. These changes have had a profoundly chilling effect on researchers’ use of API-derived data to investigate behavior like discrimination, harassment, radicalization, hate speech, and disinformation. Furthermore, researchers have struggled to systematically study the role that platforms’ changing features, design affordances, and governance strategies play in sustaining these forms of “turpitude-as-a-service”. Faced with conflicting incentives between protecting their users’ data from abuse and maintaining their commitments to values of openness, online social platforms are exploring alternative data sharing models like “trusted third party” models that still carry significant technical and reputational risks.
Even if the “golden age” of API-driven computational social science and social computing research had not closed in the shadow of privacy scandals, it was nevertheless characterized by enormous inefficiencies in data collection and inequalities in access, ethically-suspect methods and implications, a lack of concern for data sharing or reproducibility, and failures to validate constructs or generalize to off-platform behavior. Facebook’s and Twitter’s changes in data access were significant; however, the enclosure of previously open big social data sources is not ubiquitous among platform providers. Social platforms and online communities like Wikipedia, Stack Exchange, GitHub, and Reddit continue to offer open APIs and data dumps that are valuable for researchers.
In this paper, we contribute to the goal of providing open APIs and data dumps to researchers by releasing the Pushshift Reddit dataset. In addition to monthly dumps of 651M submissions and 5.6B comments posted on Reddit between 2005 and 2019 (available at https://files.pushshift.io/reddit/), the Pushshift Reddit dataset also includes an API for researcher access and a Slackbot that allows researchers to easily interact with the collected data. The Pushshift Reddit API enables researchers to easily execute queries on the whole dataset without the need for downloading the monthly dumps. This reduces the requirement for substantial storage capacity, thus making the data more available to a wider range of users. Finally, we provide access to a Slackbot that allows researchers to easily produce visualizations of data from the Pushshift Reddit dataset in real-time and discuss them with colleagues on Slack. These resources allow research teams to quickly begin interacting with data with very little time spent on the tedious aspects of data collection, cleaning, and storage.
Pushshift
Pushshift is not a new or isolated data platform, but a five-year-old platform with a track record in peer-reviewed publications and an active community of several hundred users. Pushshift not only collects Reddit data, but exposes it to researchers via an API. Why do people use Pushshift’s API instead of the official Reddit API? In short, Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, extends functionality with full-text search against comments and submissions, and has larger single query limits. Specifically, because, at the time of this writing, Pushshift has a size limit five times greater than Reddit’s 100 object limit, Pushshift enables the end user to quickly ingest large amounts of data. Additionally, the Pushshift API offers aggregation endpoints to provide summary analysis of Reddit activity, a feature that the Reddit API lacks entirely.
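To make the effect of the larger query limit concrete, the following minimal sketch counts how many queries are needed to retrieve a given number of objects. It assumes that “five times greater” means a limit of 500 objects per query; the function name `requests_needed` is hypothetical and is not part of any API.

```python
import math

REDDIT_LIMIT = 100                    # per-query object limit stated in the text
PUSHSHIFT_LIMIT = 5 * REDDIT_LIMIT    # assumption: "five times greater" read as 500


def requests_needed(total_objects: int, per_request_limit: int) -> int:
    """Return the minimum number of queries needed to fetch total_objects."""
    if total_objects < 0:
        raise ValueError("total_objects must be non-negative")
    if per_request_limit <= 0:
        raise ValueError("per_request_limit must be positive")
    return math.ceil(total_objects / per_request_limit)
```

Under these assumptions, retrieving 10,000 objects takes `requests_needed(10_000, REDDIT_LIMIT)` queries at the smaller limit and `requests_needed(10_000, PUSHSHIFT_LIMIT)` at the larger one, a fivefold reduction in round trips.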
The Pushshift Reddit dataset provides not just a technical infrastructure of software and hardware for collecting “big social data” but also a social infrastructure of organizational processes for responsibly collecting, governing, and discussing these research data.
Pushshift uses multiple backend software components to collect, store, catalog, index, and disseminate data to end-users. As seen in Fig. 1, these subsystems are:
The ingest engine, which is responsible for collecting and storing raw data.
A PostgreSQL database, which allows for advanced querying of data and meta-data storage.
An Elasticsearch document store cluster, which performs indexing and aggregation of ingested data.
An API to allow researchers dynamic access to collected data and aggregation functionality.
The first stage in the Pushshift pipeline is the ingest engine, which is responsible for actually collecting data. The ingest engine can be thought of as a framework for large scale collection of heterogeneous social media data sources. The ingest engine orchestrates the execution of multiple data collection programs, each designed to handle a particular data source. Specifically, the ingest engine provides and manages a job scheduling queue, and provides a set of common APIs to handle the data storage. Currently, Pushshift’s ingest engine works as follows:
First, the program runner starts each ingest program (i.e., the programs that actually collect the data). The ingest engine is agnostic to the particulars of the individual ingest programs: no particular programming language is required, and there is no particular expectation of how an ingest program works, modulo its interactions with the remainder of the ingest engine. Typically, an ingest program will directly interact with Web APIs, scrape content from HTML pages, use data streams where available, etc. Next, the ingest program inserts the raw data retrieved into a database as well as into a document store. Behind the scenes, each piece of collected data is added to an intermediate queue (currently implemented via Redis), which serves as a staging area until the data is processed by any custom processing scripts the ingest program’s creator might require. Finally, the raw data is periodically flushed to disk. The data storage format can be specified by the ingest program creator via the custom processing scripts previously mentioned, or a standard, Pushshift-implemented format can be used (e.g., ndjson).
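The staging, processing, and flushing steps described above can be illustrated with a small sketch. This is not Pushshift's implementation: the class `IngestSketch` and its method names are hypothetical, an in-memory `deque` stands in for the Redis-backed queue, and the database and document store inserts are omitted.

```python
import json
from collections import deque
from pathlib import Path
from typing import Callable, Iterable, Optional

Record = dict
Processor = Callable[[Record], Optional[Record]]


class IngestSketch:
    """Toy version of the flow: stage records, process them, flush as ndjson."""

    def __init__(self, out_path: str, processor: Optional[Processor] = None,
                 flush_every: int = 1000) -> None:
        self.out_path = Path(out_path)
        self.processor = processor      # optional custom processing script
        self.flush_every = flush_every  # flush once this many records are ready
        self.staging: deque = deque()   # stands in for the intermediate queue
        self.processed: list = []       # records waiting for the next flush

    def collect(self, records: Iterable[Record]) -> None:
        """Add raw records from an ingest program to the staging queue."""
        self.staging.extend(records)

    def process(self) -> None:
        """Drain the staging queue, applying the custom processor if present."""
        while self.staging:
            record = self.staging.popleft()
            if self.processor is not None:
                record = self.processor(record)
            if record is not None:      # a processor may drop a record
                self.processed.append(record)
            if len(self.processed) >= self.flush_every:
                self.flush()

    def flush(self) -> int:
        """Append processed records to disk, one JSON object per line."""
        if not self.processed:
            return 0
        with self.out_path.open("a", encoding="utf-8") as fh:
            for record in self.processed:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written = len(self.processed)
        self.processed.clear()
        return written
```

Keeping the processor as an optional callable mirrors the point made in the text that the engine is agnostic to how individual ingest programs work; only the hand-off through the queue is shared.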
PostgreSQL & Elasticsearch.
Pushshift currently uses Elasticsearch (ES) as a scalable document store for each data source that is part of the ingest pipeline. ES offers a number of important features for storing and analyzing large amounts of data. For example, ES achieves ease-of-scaling by utilizing a cluster approach for horizontal expansion. It ensures redundancy by creating multiple replicas for each index so that a node outage does not affect the overall health of the cluster. The robust dynamic mapping tools in ES allow easy modification and expansion of indexes to accommodate changes in data structure from the source. This is useful because Reddit’s API does not implement any type of versioning, yet there are constant additions and modifications made to the API when new features and data types are added to the response objects. By using dynamic mapping types, Pushshift can easily add new fields to existing indices. This enables us to quickly modify the corresponding mappings to allow search and aggregation on those new fields. Pushshift also makes use of the ICU Analysis plug-in for ES, which provides support for international locales, full Unicode support up through Unicode 12, and complete emoji search support.
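As an illustration of the features named above, the following sketch shows what an index definition using replicas, dynamic mapping, and the ICU analyzer could look like. It is not Pushshift's actual configuration: the shard and replica counts and the field names (`body`, `subreddit`, `created_utc`) are assumptions, and `icu_analyzer` is only available when the ICU Analysis plug-in is installed.

```python
import json

# Hypothetical index definition; every value here is an illustrative assumption.
index_body = {
    "settings": {
        "number_of_shards": 3,      # horizontal scaling across cluster nodes
        "number_of_replicas": 1,    # redundancy if a node goes down
    },
    "mappings": {
        "dynamic": True,            # unseen fields are added to the mapping
        "properties": {
            # Full-text field analyzed with the ICU plug-in's analyzer.
            "body": {"type": "text", "analyzer": "icu_analyzer"},
            # Exact-match field, suitable for aggregations.
            "subreddit": {"type": "keyword"},
            # Unix timestamp in seconds.
            "created_utc": {"type": "date", "format": "epoch_second"},
        },
    },
}

# Serialized form, as it would be sent in the body of an index-creation request.
index_json = json.dumps(index_body, indent=2)
```

With `dynamic` enabled, a field that first appears in a newly ingested object does not require the index to be rebuilt, which matches the motivation given in the text for an unversioned source API.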
API. Pushshift currently allows users to search Reddit data via an API. Right now, this API exposes much of the search and aggregation functionality provided by Elasticsearch. This functionality supports dozens of community applications and numerous research projects. The API is the major workload handled by Pushshift’s computational resources, serving 500M requests per month. Although in this paper we focus on a description of the data (Section 3) due to space limitations, we provide online API documentation at https://pushshift.io/api-parameters/.
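A query against such an API can be sketched as follows. The base URL, endpoint path, and parameter names (`q`, `subreddit`, `size`, `after`, `before`) are assumptions made for illustration and should be checked against the online documentation; the helper `build_search_url` is hypothetical and only constructs the URL, without sending a request.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL; consult the API documentation for the authoritative value.
API_BASE = "https://api.pushshift.io/reddit/search"


def build_search_url(kind: str, *, q: Optional[str] = None,
                     subreddit: Optional[str] = None, size: int = 100,
                     after: Optional[int] = None,
                     before: Optional[int] = None) -> str:
    """Build a search URL for comments or submissions (no request is sent)."""
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    params = {"q": q, "subreddit": subreddit, "size": size,
              "after": after, "before": before}
    # Drop unset parameters so the query string stays minimal.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{API_BASE}/{kind}/?{query}"
```

For example, `build_search_url("comment", q="open data", subreddit="pushshift")` yields a URL that a client could fetch with any HTTP library.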
Community. In addition to Pushshift’s website, which features an interactive dashboard of current activity trends, Pushshift also has two active user communities on Reddit and Slack. The /r/pushshift subreddit was created in April 2015 and is used for sharing announcements, answering questions, reporting bugs, and soliciting feedback for new features. There are more than 2,100 subscribers to this subreddit, an active team of 10 moderators, and more than 700 posts (with more than 4,000 comments) from over 350 unique users (see Fig. 2).
The Pushshift Slack team has nearly 300 registered users and more than 260,000 messages across 53 channels discussing data science and visualization. Custom tools have also been developed to integrate the Pushshift archive into these Slack communities. For example, users can interact with a Slack chatbot in real time. The bot can analyze and visualize Pushshift data based on queries made in the Slack channel, and return those visualizations to the channel for discussion and observation. In Fig. 3, a user queried the number of comments posted per day to the /r/the_donald subreddit over the past four years and received a time series plot and summary statistics from the chatbot within a few seconds. The chatbot can also be shared to other non-Pushshift workspaces, allowing researchers in other Slack workspaces to use the data. This extends the reach of Pushshift data even further.
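The kind of daily time series described above can be approximated offline with a short sketch. It does not reproduce the chatbot; it only shows one way to bucket comments by UTC day, assuming each comment carries a Unix timestamp in seconds (called `created_utc` here, which is an assumption). The function name `comments_per_day` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable


def comments_per_day(created_utc_values: Iterable[int]) -> Dict[str, int]:
    """Count comments per UTC calendar day, keyed by ISO date string."""
    counts: Counter = Counter()
    for ts in created_utc_values:
        # Convert the Unix timestamp to a UTC date such as "2019-04-01".
        day = datetime.fromtimestamp(int(ts), tz=timezone.utc).date().isoformat()
        counts[day] += 1
    # Sort by date so the result can be plotted directly as a time series.
    return dict(sorted(counts.items()))
```

The sorted dictionary can be passed to any plotting library to obtain a series comparable to the one returned by the bot.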
Description of the Pushshift Reddit Dataset
Pushshift makes available all the submissions and comments posted on Reddit between June 2005 and April 2019. The dataset consists of 651,778,198 submissions and 5,601,331,385 comments posted on 2,888,885 subreddits. Fig. 4 shows the number of submissions and comments per day. We observe that the numbers of submissions and comments increase over the course of our dataset. After August 2013, we have consistently over 1M comments per day, while by the end of our dataset (April 2019) we have 5M comments per day. Also, while submissions are substantially fewer than comments, submissions have reached a consistent level of over 500K per day in this dataset.
The Pushshift Reddit dataset is made up of two sets of files: one set of files for the submissions and one for the comments. Below, we describe the structure of each of the files in these two sets.
Submissions. The submissions dataset consists of a set of newline-delimited JSON (ndjson, http://ndjson.org/) files: we maintain a separate file for each month of our data collection. Each line in these files corresponds to a submission and is a JSON object. Table 1 describes the most important key/values included in each submission’s JSON object.
Comments. Similarly to the submissions, the comments dataset is a collection of ndjson files, with each file corresponding to a month’s worth of data. Each line in these files corresponds to a comment and is a JSON object. Table 2 describes the most important keys/values in each comment’s JSON object.
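Because each line is an independent JSON object, the files can be processed as a stream without loading a whole month into memory. The following minimal sketch assumes an uncompressed ndjson file and a `subreddit` key in each object; both the key and the helper names `iter_ndjson` and `count_by_key` are assumptions made for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. at the end of a file
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON") from exc


def count_by_key(records: Iterable[dict], key: str) -> Counter:
    """Count how often each value of `key` occurs, skipping records without it."""
    return Counter(record[key] for record in records if key in record)
```

An open file handle is itself an iterable of lines, so `count_by_key(iter_ndjson(fh), "subreddit")` would tally one monthly file in a single pass.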
FAIR principles. The Pushshift Reddit dataset aligns with the FAIR principles (https://www.go-fair.org/fair-principles/). Our dataset is Findable as the monthly dumps are publicly available via Pushshift’s website (https://files.pushshift.io/reddit/). We also uploaded a small sample of the dataset to the Zenodo service, so that we obtain a persistent digital object identifier (DOI): 10.5281/zenodo.3608135. Note that we were unable to upload the entire dataset to Zenodo, since the service has a limit of 100GB and our dataset is on the order of several terabytes. The Pushshift Reddit dataset is Accessible as it can be accessed by anyone visiting Pushshift’s website. Furthermore, we offer an API and a Slackbot that allow researchers to easily execute queries and obtain data from our infrastructure without the need to download the large monthly dumps. Also, our dataset is Interoperable because it is in JSON format, which is a widely known and used format for data. Because the provenance for the collected data is very clear, and users are simply asked to cite Pushshift in order to use the data, our dataset is also Reusable.
Dataset Use Cases
The Pushshift Reddit dataset has attracted a substantial research community. As of late 2019, Google Scholar indexes over 100 peer-reviewed publications that used Pushshift data (see Fig. 5). This research covers a diverse cross-section of research topics including measuring toxicity, personality, virality, and governance. Pushshift’s influence as a primary source of Reddit data among researchers has attracted empirical scrutiny, which in turn has led to improved data validation efforts. We note that there is some difficulty in ascertaining our dataset’s full contribution to the scientific community due to a previous lack of deliberate efforts to conform to FAIR principles, which we address in this paper.
Reddit’s ecosystem of subreddits is primarily governed by volunteer moderators with substantial discretion over creating and enforcing rules about user behavior and content moderation. This distributed and volunteer-led model stands in contrast to the centralized strategies of other prominent social platforms like Facebook, Twitter, and YouTube. These differences between centralized versus delegated moderation make ideal case studies for comparing the effectiveness of responses to difficult issues like social movements, fringe identities, hate speech, and harassment campaigns. Pushshift data has already been instrumental for researchers exploring the spillover effects of banning offensive sub-communities, identifying common features of abusive behavior across communities, similarity in norms and rules across communities, perceptions of fairness in moderation decisions, and improving automated moderation tools.
Online extremism.
The political extremism research community currently faces significant challenges in understanding how mainstream and fringe online spaces are used by bad actors. Despite widespread agreement that recent increases in online radicalization are due to “a globalised, toxic, anonymous online culture” operating largely outside mainstream social media platforms, much of the research on extremist use of social media still focuses on mainstream sites like Facebook or Twitter. Access to these rapidly-changing online spaces is difficult, and many research teams end up using out-of-date data, or relying on the data they have, rather than the data they need. Many social media platforms face pressure to monetize their data or remove access to it entirely, making research access to these spaces expensive and difficult. Yet, extremism researchers agree that data access is a key limitation to understanding online radicalization as a phenomenon. Online extremism researchers’ top recommendation is to “invest more in data-driven analysis of far-right violent extremist and terrorist use of the Internet.” Pushshift data has already been used to understand the phenomenon of hate speech and political extremism, as well as trolling and negative behaviors in fringe online spaces.
Online disinformation.
The online disinformation research community has focused its attention on how social media facilitates the spread of deliberately inaccurate information. The use of social media platforms to spread this “fake news” and biased political propaganda was particularly concerning given the events surrounding Russian interference in the 2016 US presidential election. Researchers studying disinformation acknowledge that mainstream platforms, particularly Facebook, are still the main place where disinformation campaigns take place and that a lack of data access is significantly limiting their efforts. While mainstream sites are the largest amplifiers of disinformation content, the content itself is often created on fringe sites that serve as proving grounds. As with extremism and terrorism research, data access and data sharing in the disinformation research community is an ongoing struggle. Pushshift data has already been used in a number of papers on disinformation and social media trustworthiness.
Web science.
Datasets like Pushshift are critically important for researchers who answer questions at the intersection of Internet and society. How does technology spread? What is the impact of each interface or design choice on the efficacy of social media platforms? How should we measure the success or failure of an online community? Pushshift data has already been used in studies of user engagement on social media, social media moderation schemes, measuring success and growth of online communities, conflict in online groups, the spread of technological innovations, modeling collaboration, and measuring engagement and collective attention.
Big data science.
As one of a few easily-accessible, very large collections of social media data, Pushshift enables data-intensive research in foundational areas like network science, and new algorithms for cloud computing and very large databases.
Health informatics.
Because of the relative anonymity allowed by certain social media platforms, large social media datasets are useful for researchers studying topics in health informatics including sensitive medical issues, atypical behaviors, and interpersonal conflict. Pushshift data has been used by researchers studying eating disorders and weight loss, addiction and substance abuse, sexually transmitted infections, difficult child-rearing problems, and various mental health challenges.
Robust intelligence.
Intelligent systems that can augment and enhance human understanding often require large amounts of human-generated text data produced in a social context. Social media data collected by Pushshift has already been used by researchers in computational linguistics and natural language processing, recommender systems, intelligent conversational systems, automatic summarization, entity recognition, and other fields associated with the development of systems that can sense, reason, learn, and predict.
Related Work
Promising alternatives to the aforementioned model of “storage buckets of open data hosted by cloud providers” exist that are better-tailored towards the needs of researchers.
Pushshift is not the first large-scale real-time social media data collection service aimed towards researchers. Table 3 summarizes the social and organizational features of other similar services. While not an exhaustive list, the following have heavily influenced the research community as well as motivated Pushshift’s own goals and design.
Media Cloud is an “open source platform for studying media ecosystems” that tracks hundreds of millions of news stories and makes aggregated count and topical data available via a free and semi-public API. The Media Cloud platform has been used to study digital health communication, agenda-setting, and online social movements. Researchers can use the API to get counts of stories, topics, words, and tags in response to queries by keyword, media source, and time window using a Solr search platform.
is a free open platform monitoring global news media tracking events, related topics, and imagery. The platform offers a database and knowledge graph accessible both through dumps and an “analysis service” for filtering and visualizing subsets of the complete dataset .
Stack Exchange is a platform of social question answering communities, including Stack Overflow. While data dumps of the platform are hosted by the Internet Archive, Stack Exchange offers both an API of activity as well as a “Data Explorer” allowing users to write SQL queries via a web interface against a regularly-updated database.
Wikimedia is the parent organization of projects like Wikipedia. It hosts data dumps of revision histories, content, and pageviews; makes data available through robust APIs; and offers a variety of interactive services. Wikimedia’s deployment of Jupyter Notebooks can access replication databases of revisions and content. This enables researchers to focus on analyzing data rather than system and database administration.
Other dataset papers. Considering the challenges in the post-API age, the collection, curation, and dissemination of datasets is crucial for the advancement of science. To that end, it is worth exploring other works whose primary contribution has been the dataset they provide. For example, one released dataset includes 37M posts and 24M comments covering August 2016 through December 2018 from Gab, a Twitter-like social media platform that, after being de-platformed by major service providers, ported its codebase to use the federated social network protocol from the Mastodon project. Another released dataset focuses on Mastodon itself. It contains 5M posts, along with a crowdsourced (by Mastodon users) label that indicates whether or not the post contains inappropriate content. Research into other types of computer-mediated communication platforms has also been enabled by dataset contributions, including a released dataset from 178 WhatsApp groups that includes 454K messages from 45K different users.
Discussion & Conclusion
In this paper, we presented the Pushshift Reddit Dataset, which includes hundreds of millions of submissions and billions of comments from 2005 until the present. In addition to offering Pushshift’s data as monthly dumps, we also make this dataset available via a searchable API, as well as additional tools and community resources. This paper also serves as a more formal and archival description of what Pushshift’s Reddit dataset provides. Having already been used in over 100 papers from numerous disciplines over the past four years, the Pushshift Reddit dataset will continue to be a valuable resource for the research community in the future.