LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki

Introduction

Multi-modal language-vision models have recently demonstrated strong transfer capability to novel datasets in the absence of per-sample labels. This capability requires sufficiently large model and data scale during pre-training. Increasing data scale alone can often improve model performance. When model size and compute budget are scaled up as well, scaling laws suggest further gains in generalization and transfer performance, provided training is not bottlenecked by data scale. Many recent works have built massive datasets in order to scale up various models optimally. However, these massive datasets have rarely been released, for various reasons. Gao et al. recently released The Pile, an openly available 800GB text dataset, in an attempt to loosely mimic the dataset used for GPT-3. The largest publicly known image-text paired datasets range from 400 million to around a billion pairs, but none of them has been released.

To address this issue, we build and release LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices. We describe the procedure used to create the dataset and demonstrate successful training of a DALL-E architecture. Being sufficiently large in scale, the dataset opens avenues for research on multi-modal language-vision models to the broad community.

Dataset and Methods

Overview of LAION-400M. We officially release the following packages under the LAION-400M project:

400 million pairs of image URL and the corresponding metadata

400 million pairs of CLIP image embedding and the corresponding text

Several sets of kNN indices that enable quick search in the dataset

img2dataset library that enables efficient crawling and processing of hundreds of millions of images and their metadata from a list of URLs with minimal resources

Web demo of image-text search on LAION-400M (Fig. 1): https://rom1504.github.io/clip-retrieval/

As for the pairs of image URL and metadata, we provide parquet files that contain the following attributes for each pair: sample ID, URL, type of Creative Commons license (if applicable), NSFW tag (detected with CLIP), cosine similarity score between the text and image embeddings, and height and width of the image. We found that less than 1% of images were detected as NSFW; a user can filter these out using the NSFW tag.
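As a rough illustration of how a user might drop NSFW-tagged samples, the sketch below filters metadata rows held as plain Python dictionaries. The field names (SAMPLE_ID, URL, LICENSE, NSFW, similarity, HEIGHT, WIDTH) and the tag values are assumptions made for this example, not a specification of the released parquet schema; in practice the rows would be read from the parquet files with a dataframe library.

```python
from typing import Dict, List, Sequence

# Hypothetical metadata rows. Field names and tag values are assumptions
# chosen for illustration, not the released parquet schema.
ROWS: List[Dict[str, object]] = [
    {"SAMPLE_ID": 1, "URL": "https://example.com/a.jpg", "LICENSE": "?",
     "NSFW": "UNLIKELY", "similarity": 0.34, "HEIGHT": 600, "WIDTH": 800},
    {"SAMPLE_ID": 2, "URL": "https://example.com/b.jpg", "LICENSE": "?",
     "NSFW": "NSFW", "similarity": 0.31, "HEIGHT": 512, "WIDTH": 512},
    {"SAMPLE_ID": 3, "URL": "https://example.com/c.jpg", "LICENSE": "?",
     "NSFW": "UNLIKELY", "similarity": 0.42, "HEIGHT": 1024, "WIDTH": 768},
]


def drop_nsfw(rows: Sequence[Dict[str, object]],
              blocked_tags: Sequence[str] = ("NSFW",)) -> List[Dict[str, object]]:
    """Return only the rows whose NSFW tag is not in blocked_tags."""
    blocked = set(blocked_tags)
    return [row for row in rows if row["NSFW"] not in blocked]


safe_rows = drop_nsfw(ROWS)
print([row["SAMPLE_ID"] for row in safe_rows])
```

Passing a longer blocked_tags tuple would make the filter stricter, if the tag column distinguishes more than two levels.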

Acquisition. The acquisition follows the flowchart of Fig. 2 and can be split into two major components:

Distributed processing of the petabyte-scale Common Crawl dataset, which produces a collection of matching URLs and captions.

Single-node post-processing of the data, which is much lighter and can be run in a few days, producing the final dataset.

To create image-text pairs, we parse WAT files from Common Crawl and extract all HTML IMG tags containing an alt-text attribute. We download the raw images from the parsed URLs with asynchronous requests using the Trio and Asks libraries.
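The extraction step can be sketched as follows, under the assumption that each WAT record is a JSON object whose HTML link metadata sits under Envelope / Payload-Metadata / HTTP-Response-Metadata / HTML-Metadata / Links, with IMG entries marked by a path beginning with "IMG@". That layout, the helper name iter_image_alt_pairs and the sample record are illustrative assumptions, not the code used to build the dataset; the asynchronous download step is omitted.

```python
from typing import Dict, Iterator, Tuple
from urllib.parse import urljoin


def iter_image_alt_pairs(wat_record: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (absolute image URL, alt-text) pairs from one parsed WAT record."""
    envelope = wat_record.get("Envelope", {})
    # Page URL, used to resolve relative image URLs (assumed key names).
    base_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    for link in links:
        # Assumed convention: IMG tags carry a path such as "IMG@/src".
        if not link.get("path", "").startswith("IMG@"):
            continue
        url, alt = link.get("url"), link.get("alt")
        # Keep only images that have a non-empty alt-text attribute.
        if url and alt and alt.strip():
            yield urljoin(base_url, url), alt.strip()


# Hypothetical record, hand-written for this example.
SAMPLE_RECORD = {
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "https://example.com/page/index.html"
        },
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
            {"path": "IMG@/src", "url": "/img/cat.jpg", "alt": "a cat on a sofa"},
            {"path": "IMG@/src", "url": "/img/spacer.gif"},
            {"path": "A@/href", "url": "/about", "text": "About"},
        ]}}},
    }
}

for image_url, alt_text in iter_image_alt_pairs(SAMPLE_RECORD):
    print(image_url, "|", alt_text)
```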

After downloading the WAT files from Common Crawl, we apply the following filtering conditions:

All samples with an alt-text shorter than 5 characters or an image size below 5 KB are dropped.

Duplicate removal is performed with a Bloom filter based on URL and alt-text.

We use CLIP to compute embeddings of the image and alt-text. Then we compute the cosine similarity of both embeddings and drop all samples with cosine similarity below 0.3. This threshold was selected based on human inspection.

We use the CLIP embeddings of images and texts to filter out illegal content.
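The first three filtering conditions can be sketched in plain Python as below. This is a minimal illustration, not the pipeline's code: the Bloom filter parameters, the use of SHA-256 for hashing, the reading of 5 KB as 5 * 1024 bytes and the helper names are all assumptions, and the toy two-dimensional vectors stand in for real CLIP embeddings. The illegal-content filter is not sketched because the source does not describe how it is applied.

```python
import hashlib
import math
from typing import Iterator, Sequence


class BloomFilter:
    """Small Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_new(self, key: str) -> bool:
        """Insert key; return True if it was definitely not present before."""
        is_new = False
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


MIN_ALT_CHARS = 5
MIN_IMAGE_BYTES = 5 * 1024  # assumption: 1 KB = 1024 bytes
MIN_SIMILARITY = 0.3


def keep_sample(url: str, alt_text: str, image_bytes: int,
                image_emb: Sequence[float], text_emb: Sequence[float],
                seen: BloomFilter) -> bool:
    """Apply the length/size, duplicate and similarity filters in order."""
    if len(alt_text) < MIN_ALT_CHARS or image_bytes < MIN_IMAGE_BYTES:
        return False
    # Deduplicate on the (URL, alt-text) pair.
    if not seen.add_if_new(url + "\x00" + alt_text):
        return False
    return cosine_similarity(image_emb, text_emb) >= MIN_SIMILARITY


seen = BloomFilter()
print(keep_sample("https://example.com/a.jpg", "a red apple", 8000,
                  [1.0, 0.0], [0.8, 0.6], seen))
```

Because a Bloom filter can report false positives, a small fraction of unique pairs may be discarded as duplicates; in exchange, memory use stays fixed regardless of how many pairs are processed.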

img2dataset

We developed the img2dataset library to conveniently download images from a given set of URLs, resize them, and store the images and captions in the webdataset format (https://github.com/rom1504/img2dataset). It can download 100 million images from our list of URLs in 20 hours on a single node (1Gbps connection speed, 32GB of RAM, an i7 CPU with 16 cores), which allows anyone to obtain the whole dataset or a smaller subset.
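To put these figures in perspective, the short calculation below derives the implied download rate and the average image size that a 1Gbps link could sustain at that rate. It assumes decimal units, a fully saturated link and no protocol overhead; the extrapolation to all 400 million images further assumes the rate stays constant, which is an assumption of this sketch rather than a reported measurement.

```python
IMAGES = 100_000_000       # images downloaded (from the text)
HOURS = 20                 # wall-clock time (from the text)
LINK_GBPS = 1.0            # connection speed (from the text)
FULL_DATASET = 400_000_000

seconds = HOURS * 3600
images_per_second = IMAGES / seconds                  # about 1389 images/s
link_bytes_per_second = LINK_GBPS * 1e9 / 8           # 125 MB/s
# Largest average image size the link could carry at this rate (about 90 KB).
max_avg_bytes_per_image = link_bytes_per_second / images_per_second
# Time for the full dataset at the same rate (about 80 hours).
hours_for_full_dataset = FULL_DATASET / images_per_second / 3600

print(f"{images_per_second:.0f} images/s")
print(f"{max_avg_bytes_per_image / 1000:.0f} KB per image at most, on average")
print(f"{hours_for_full_dataset:.0f} hours for the full dataset")
```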

Analysis & Results

Web demo and similarity search. A web demo was created to allow a user to search images and texts based on a query image or text, using the CLIP embeddings of the input and our precomputed kNN indices. It demonstrates the diversity of images and captions that can be found in LAION-400M as well as their high semantic relevance to the query (Fig. 1).
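The idea behind such a search can be illustrated with an exhaustive nearest-neighbour lookup by cosine similarity. The sketch is a brute-force stand-in for a precomputed kNN index, which is what makes search over 400 million embeddings quick; the three-dimensional vectors and captions are invented for the example, whereas real queries and entries would be CLIP embeddings.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def knn_search(query: Sequence[float], embeddings: Sequence[Sequence[float]],
               k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) of the k entries closest to query."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(e))), i)
        for i, e in enumerate(embeddings)
    )
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


# Toy data, invented for illustration.
CAPTIONS = ["a photo of a cat", "a kitten", "a red car", "a mountain lake"]
EMBEDDINGS = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

for index, score in knn_search([1.0, 0.05, 0.0], EMBEDDINGS, k=2):
    print(f"{score:.3f}  {CAPTIONS[index]}")
```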

Tab. 1 shows the distribution of image sizes in LAION-400M. Given the abundance of high-resolution images, one can produce subsets of images for training various customized models, and also choose an image resolution suited to the purpose of a particular training run.

Training a DALL-E model. We ran DALLE-pytorch, an open-source replication of DALL-E, to assess the dataset's capability to train a text-to-image model. A VQGAN pretrained on ImageNet is used to encode image tokens. For generation, we use CLIP ViT-B/16 to rank the samples and keep the top 8 of 128 total samples per caption. Although the model sees only a subset of approximately 7.2 million images for a single epoch, we observe fast convergence across a variety of categories. Samples generated from the model show sufficient quality and provide evidence for successful training progress (Fig. 3).
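The reranking step amounts to scoring every generated sample against its caption and keeping the best few. In the sketch below, random numbers stand in for the CLIP ViT-B/16 image-text similarity scores, and the function name rerank_top_k and the sample identifiers are hypothetical; only the selection of 8 out of 128 candidates mirrors the text.

```python
import heapq
import random
from typing import List, Sequence


def rerank_top_k(candidates: Sequence[str], scores: Sequence[float],
                 k: int = 8) -> List[str]:
    """Return the k candidates with the highest scores, best first."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = heapq.nlargest(k, zip(scores, range(len(candidates))))
    return [candidates[i] for _, i in ranked]


rng = random.Random(0)
samples = [f"sample_{i:03d}" for i in range(128)]  # 128 generations per caption
scores = [rng.random() for _ in samples]           # placeholder for CLIP scores
best = rerank_top_k(samples, scores, k=8)
print(best)
```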

Conclusion

By releasing an openly available dataset that contains 400 million image-text pairs, we have closed the gap to the proprietary large-scale datasets that were necessary to train state-of-the-art language-vision models such as DALL-E and CLIP. As a proof of concept, we demonstrated that a subset of our dataset can be used to train a DALL-E model that produces samples of sufficient quality. The dataset opens up large-scale training of and research on language-vision models, previously restricted to those with access to large proprietary datasets, to the broad community.

References