The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, Davide Testuggine

Introduction

In the past few years there has been a surge of interest in multimodal problems, from image captioning Chen2015coco ; Krishna2017visualgenome ; Young2014flickr30k ; gurari2020captioning ; sidorov2020textcaps to visual question answering (VQA) Antol2015vqa ; Goyal2017vqa2 ; gurari2018vizwiz ; Hudson2019gqa ; singh2019towards and beyond. Multimodal tasks are interesting because many real-world problems are multimodal in nature—from how humans perceive and understand the world around them, to handling data that occurs “in the wild” on the Internet. Many practitioners believe that multimodality holds the key to problems as varied as natural language understanding Baroni2016grounding ; Kiela2016virtual , computer vision evaluation Geman2015visual , and embodied AI Savva2019habitat .

Their success notwithstanding, it is not always clear to what extent truly multimodal reasoning and understanding is required for solving many current tasks and datasets. For example, it has been pointed out that language can inadvertently impose strong priors that result in seemingly impressive performance, without any understanding of the visual content in the underlying model Devlin2015exploring . Similar problems were found in VQA Antol2015vqa , where simple baselines without sophisticated multimodal understanding performed remarkably well Zhou2015simple ; Jabri2016revisiting ; Agrawal2016analyzing ; Goyal2017vqa2 and in multimodal machine translation Elliott2016multi30k ; Specia2016mmt , where the image was found to matter relatively little Delbrouck2017empirical ; Elliott2018adversarial ; Caglayan2019probing . In this work, we present a challenge set designed to measure truly multimodal understanding and reasoning, with straightforward evaluation metrics and a direct real world use case.

Specifically, we focus on hate speech detection in multimodal memes. Memes pose an interesting multimodal fusion problem: Consider a sentence like “love the way you smell today” or “look how many people love you”. Unimodally, these sentences are harmless, but combine them with an equally harmless image of a skunk or a tumbleweed, and suddenly they become mean. See Figure 1 for an illustration. That is to say, memes are often subtle and while their true underlying meaning may be easy for humans to detect, they can be very challenging for AI systems.

A crucial characteristic of the challenge is that we include so-called “benign confounders” to counter the possibility of models exploiting unimodal priors: for every hateful meme, we find alternative images or captions that make the label flip to not-hateful. Using the example above, for example, if we replaced the skunk and tumbleweed images with pictures of roses or people, the memes become harmless again. Similarly, we can flip the label by keeping the original images but changing the text to “look how many people hate you” or “skunks have a very particular smell”. This strategy is compatible with recent work arguing that tasks should include counterfactual or contrastive examples Kaushik2019learning ; Gardner2020evaluating ; Goyal2017vqa2 . At the same time, the existence of such confounders is a good indicator of the need for multimodality—if both kinds (image and text confounders) exist, classifying the original meme and its confounders correctly will require multimodal reasoning. Thus, the challenge is designed such that it should only be solvable by models that are successful at sophisticated multimodal reasoning and understanding.

At the scale of the internet, malicious content cannot be tackled by having humans inspect every data point. Consequently, machine learning and artificial intelligence play an ever more important role in mitigating important societal problems, such as the prevalence of hate speech. Our challenge set can thus be said to serve the dual purpose of measuring progress on multimodal understanding and reasoning, while at the same time facilitating progress in a real-world application of hate speech detection. This further sets apart our challenge from existing tasks, many of which have limited or indirect real-world use cases.

In what follows, we describe in detail how the challenge set was created, analyze its properties and provide baseline performance numbers for various unimodal and state-of-the-art multimodal methods. We find that the performance of our baselines reflects a concrete hierarchy in their multimodal sophistication, highlighting the task’s usefulness for evaluating the capabilities of advanced multimodal models. The gap between the best-performing model and humans remains large, suggesting that the hateful memes challenge is well-suited for measuring progress in multimodal research.

The Hateful Memes Challenge Set

The Hateful Memes dataset is a so-called challenge set, by which we mean that its purpose is not to train models from scratch, but rather to finetune and test large scale multimodal models that were pre-trained, for instance, via self-supervised learning. In order to reduce visual bias and to make sure we could obtain appropriate licenses for the underlying content, we reconstruct source memes from scratch using a custom-built tool. We used third-party annotators, who were trained on employing the hate speech definition used in this paper. Annotators spent an average time of 27 minutes per final meme in the dataset. In this section, we describe how we define hate speech, how we obtain and annotate memes, and give further details on how the challenge set was constructed.

Previous work has employed a wide variety of definitions and terminology around hate speech. Hate speech, in the context of this paper and the challenge set, is strictly defined as follows:

A direct or indirect attack on people based on characteristics, including ethnicity, race, nationality, immigration status, religion, caste, sex, gender identity, sexual orientation, and disability or disease. We define attack as violent or dehumanizing (comparing people to non-human things, e.g. animals) speech, statements of inferiority, and calls for exclusion or segregation. Mocking hate crime is also considered hate speech.

This definition resembles community standards on hate speech employed by e.g. Facebook111https://www.facebook.com/communitystandards/hate_speech. There has been previous work arguing for a three-way distinction between e.g. hateful, offensive and normal Davidson2017hatespeech , but this arguably makes the boundary between the first two classes murky, and the classification decision less actionable.

The definition employed here has some notable exceptions, i.e., attacking individuals/famous people is allowed if the attack is not based on any of the protected characteristics listed in the definition. Attacking groups perpetrating hate (e.g. terrorist groups) is also not considered hate. This means that hate speech detection also involves possibly subtle world knowledge.

2 Reconstructing Memes using Getty Images

One important issue with respect to dataset creation is having clarity around licensing of the underlying content. We’ve constructed our dataset specifically with this in mind. Instead of trying to release original memes with unknown creators, we use “in the wild” memes to manually reconstruct new memes by placing, without loss of meaning, meme text over a new underlying stock image. These new underlying images were obtained in partnership with Getty Images under a license negotiated to allow redistribution for research purposes. This approach has several additional benefits: 1) it allows us to avoid any potential noise from optical character recognition (OCR), since our reconstruction tool records the annotated text; and 2) it reduces any bias that might exist in the visual modality as a result of e.g. using different meme creation tools or underlying image sources.

3 Annotation Process

We used a third-party annotation company rather than a crowdsourcing platform. The annotators were trained for 4 hours in recognizing hate speech according to the above definition. Three pilot runs were subsequently performed where annotators were tasked with classifying memes and were then given feedback to improve their performance.

The annotation process comprised several steps, which are shown in Figure 2. To summarize: for a set of memes occurring in the wild, which we use as our source set, we 1) find replacement images that we can license from Getty; 2) use these replacement images to construct novel memes using the original text; and 3) subsequently have these annotated for hatefulness (by a different set of annotators). For memes found to be hateful, we also 4) construct benign confounders, if possible, by replacing the image and/or the text of the newly created hateful meme with an alternative that flips the label from hateful back to not-hateful. We describe each step in detail in the following sections.

The annotation process started from a source dataset of more than 1 million images occurring in the wild, posted on public social media groups from within the United States. The content was filtered to discard non-meme images and deduplicated, yielding a set of 162k memes, which served as the source set from which we construct our set of newly created memes.

In Phase 1, we asked annotators to check that the meme text was in English, and that the meme was non-violating. Memes were considered violating if they contained any of the following: self injury or suicidal content, child exploitation or nudity, calls to violence, adult sexual content or nudity, invitation to acts of terrorism and human trafficking. Lastly, we removed content that contained slurs since these are unimodal by definition. This resulted in a set of roughly 46k candidate memes. For the memes that passed this test, we asked annotators to use the Getty Images search API to find similar images to the meme image, according the following definition:

an image (or set of images) is a suitable replacement if and only if, when overlaid with the meme’s text, it preserves the meaning and intent of the original meme.

In other words, images only constitute suitable replacements if they preserve the semantic content of the original meme that was used for inspiration. If no meaning-preserving replacement image was found, the meme was discarded.

3.2 Phase 2: Meme construction

The process from Phase 1 yielded a set of image replacement candidates for the original set of non-discarded inspiration memes. In the next phase, we asked annotators to use a custom-designed meme creation tool to create a novel meme using the new underlying image (see Figure 3 for an example of the interface). Text from the original meme was extracted using an off-the-shelf OCR system and used to provide the annotator with a starting point for construction.

The annotator was shown the original meme and asked to reproduce it as faithfully as possible using the new underlying image, focusing on the preservation of semantic content. Some memes consist of multiple background images, in which case these were first vertically stacked and provided as a single image to the user. The annotators stored the resultant meme in PNG and SVG formats (the latter gives flexibility for potentially moving or augmenting the text, as well as extracting it or removing it altogether).

3.3 Phase 3: Hatefulness rating

Next, the annotators were presented with the newly created memes, and asked to provide a hatefulness rating according to the above definition. Each meme was rated by three different annotators on a scale of 1-3, with a 1 indicating definitely hateful; a 2 indicating not sure; and a 3 indicating definitely not-hateful. For every meme we collected five annotations. Ternary rather than binary labelling was useful for filtering memes with severe disagreement (e.g. two annotators voting 1 and three voting 3), which we had annotated by an expert. This yielded a single binary classification label for each of the newly constructed memes.

3.4 Phase 4: Benign confounders

Finally, for the subset of memes that were annotated as hateful, we collected confounders (a.k.a., “contrastive” Gardner2020evaluating or “counterfactual” Kaushik2019learning examples). This allows us to address some of the biases that machine learning systems would easily pick up on. For instance, a system might pick up on accidental biases where appearances of words like “black” or “white” might be strongly correlated with hate speech. To mitigate this problem, and to make the dataset extra challenging, we collect benign confounders, defined as follows:

A benign confounder is a minimum replacement image or replacement text that flips the label for a given multimodal meme from hateful to non-hateful.

For example, in the “love the way you smell today” example, we could replace the image of the skunk with an image of roses, or conversely we could replace the text with “skunks have a very particular smell”. In both cases, the meme would no longer be mean, but just a statement—as such, the label would be flipped. These benign confounders make the dataset more challenging, and make it so that any system, if it wants to solve this task, has to be multimodal: using just the text, or just the image, is going to lead to poor performance.

The benign confounder construction phase also allows us to spot memes that are only unimodally hateful: if it is not possible to find a replacement text that flips the label, the image itself is hateful. The same is true in the other direction (if no replacement image can flip the label), but somewhat more complicated, since it might be that it was simply not possible to find a suitable replacement image. In our experiments, we found that almost all cases of unimodal hate were text-only, so this is an important category of memes.

4 Splits

Using the various phases outlined above and after further filtering to remove low-quality examples, we end up with a dataset totalling exactly 10k memes. The dataset comprises five different types of memes: multimodal hate, where benign confounders were found for both modalities, unimodal hate where one or both modalities were already hateful on their own, benign image and benign text confounders and finally random not-hateful examples.

We construct a dev and test set from 5% and 10% of the data respectively, and set aside the rest to serve as fine-tuning training data. The dev and test set are fully balanced, and are comprised of memes using the following percentages: 40% multimodal hate, 10% unimodal hate, 20% benign text confounder, 20% benign image confounder, 10% random non-hateful.

The objective of the task is, given an image and the pre-extracted—i.e., we do not require OCR—text, to classify memes according to their hatefulness. We recommend that people report the area under the receiver operating characteristic curve (ROC AUC) Bradley1997auc as the main metric, but also report accuracy since the test set is balanced and that metric is easier to interpret.

5 Competition and “unseen” test set

A competition based on the dataset described in this paper was held at NeurIPS. The winners were determined according to performance on a different “unseen” test set that will be described in a separate competition report. For more information about the competition, please visit the dataset website at https://hatefulmemeschallenge.com.

Dataset Analysis

In this section, we analyze the properties of the dataset in more detail, and examine inter annotator agreement and the presence of different types of hate category and type of attack found in our (not necessarily representative) data sample.

As explained in Section 2.3.3, memes were reviewed by three annotators. We use Cohen’s kappa score to measure the inter annotators reliability in hatefulness rating. We found the score to be 68.4 which indicates “moderate” agreement Mchugh2012interrater . The score demonstrates the difficulty of determining whether content constitutes hate speech or not. Furthermore, the hate speech definition can be rather complex and contains exceptions. For example, attacking a terrorist organization is not hateful but equating a terrorist organization with people of some race or ethnicity is considered hateful.

In cases with good agreement, we followed the majority vote for deciding the binary label. For cases where there was significant crowdworker disagreement, we had the meme checked and the correct label assigned by expert annotators. As noted, if even those disagreed, the meme was discarded. The dev set and test set were double checked by an expert in their entirety. We had a different set of annotators label the final test set, and obtained a human accuracy of 84.7% with low variance (ten batches were annotated by separate annotators, per-batch accuracy was always in the 83-87% range).

2 Hate Categories and Types of Attack

Appendix C shows an analysis of the dataset in terms of the categories of hate and the types of attack, shedding more light on how the dataset characteristics relate to the hate speech definition.

3 Lexical Statistics

Appendix B shows an analysis of the lexical statistics of the dataset, including of the visual modality.

Benchmarking Multimodal Classification

In this section, we establish baseline scores for various unimodal and several state-of-the-art multimodal models on this task. We find that performance relative to humans is still poor, indicating that there is a lot of room for improvement. Starter kit code is available at https://github.com/facebookresearch/mmf/tree/master/projects/hateful_memes.

We evaluate a variety of models, belonging to one of the three following classes: unimodal models, multimodal models that were unimodally pretrained (where for example a pretrained BERT model and a pretrained ResNet model are combined in some way), and multimodal models that were multimodally pretrained (where a model was pretrained using a multimodal objective).

We evaluate two image encoders: 1) standard ResNet-152 He2016deep convolutional features from res-5c with average pooling (Image-Grid) 2) features from fc6 layer of Faster-RCNN ren2015faster with ResNeXt-152 as its backbone Xie2016resnext . The Faster-RCNN is trained on Visual Genome Krishna2017visualgenome with attribute loss following singh2018pythia and features from fc6 layer are fine-tuned using weights of the fc7 layer (Image-Region). For the textual modality, the unimodal model is BERT Devlin2018bert (Text BERT).

Transfer learning from pre-trained text and image models on multimodal tasks is well-established, and we evaluate a variety of models that fall in this category. We include results for simple fusion methods where we take the mean of the unimodal ResNet-152 and BERT output scores (Late Fusion) or concatenate ResNet-152 features with BERT and training an MLP on top (Concat BERT). These can be compared to more sophisticated multimodal methods, such as supervised multimodal bitransformers Kiela2019supervised using either Image-Grid or Image-Region features (MMBT-Grid and MMBT-Region), and versions of ViLBERT Lu2019vilbert and Visual BERT Li2019visualbert that were only unimodally pretrained and not pretrained on multimodal data (ViLBERT and Visual BERT).

Finally, we compare these methods to models that were trained on a multimodal objective as an intermediate step before we finetune on the current task. Specifically, we include the official multimodally pretrained versions in the ViLBERT (trained on Conceptual Captions Sharma2018conceptual , ViLBERT CC) and Visual BERT (trained on COCO, Visual BERT COCO).

We performed grid search hyperparameter tuning over the learning rate, batch size, warm up and number of iterations. We report results averaged over three random seeds, together with their standard deviation. See Appendix A for more details.

2 Results

The results are shown in Table 1. We observe that the text-only classifier performs slightly better than the vision-only classifier. Since the test set is balanced, the random and majority-class baselines lie at 50 AUROC exactly, which gets only marginally outperformed by the visual models.

The multimodal models do better. We observe that the more advanced the fusion, the better the model performs, with early fusion models (MMBT, ViLBERT and Visual BERT) broadly outperforming middle (Concat) and late fusion approaches.

Interestingly, we can see that the difference between unimodally pretrained models and multimodally pretrained models pretraining is relatively small, corroborating findings recently reported in Singh2020pretraining and indicating that multimodal pretraining can probably be improved further.

Accuracy for trained (but non-expert) annotators is 84.7%. The fact that even the best multimodal models are still very far away from human performance shows that there is room for improvement.

Related Work

There has been a lot of work in recent years on detecting hate speech in network science Ribeiro2018hatefulusers and natural language processing Waseem2017understanding ; Schmidt2017survey ; Fortuna2018survey . Several text-only hate speech datasets have been released, mostly based on Twitter Waseem2016racist ; Waseem2016hate ; Davidson2017hatespeech ; Golbeck2017corpus ; Founta2018abusive , and various architectures have been proposed for classifiers Kumar2018benchmarking ; Malmasi2018challenges ; Malmasi2017detectinghate . Hate speech detection has proven to be difficult, and for instance subject to unwanted bias Dixon2017aaai ; Sap2019acl ; Davidson2019acl . One issue is that not all of these works have agreed on what defines hate speech, and different terminology has been used, ranging from offensive or abusive language, to online harassment or aggression, to cyberbullying, to harmful speech, to hate speech Waseem2017understanding . Here, we focus exclusively on hate speech in a narrowly defined context.

There has been surprisingly little work related to multimodal hate speech, with only a few papers including both images and text. Yang et al. Yang2019exploring report that augmenting text with image embedding information immediately boosts performance in hate speech detection. Hosseinmardi et al. Hosseinmardi2015cyberbull collect a dataset of Instagram images and their associated comments, which they then label with the help of Crowdflower workers. They asked workers two questions: 1) does the example constitute cyberaggression; and 2) does it constititute cyberbullying. Where the former is defined as “using digital media to intentionally harm another person” and the latter is a subset of cyber-aggression, defined as “intentionally aggressive behavior that is repeatedly carried out in an online context against a person who cannot easily defend him or herself” Hosseinmardi2015cyberbull . They show that including the image features improves classification performance. The dataset consisted of 998 examples, of which 90% was found to have high-confidence ratings, of which 52% was classified as bullying. Singh et al. Singh2017mmcyberbull conduct a detailed study, using the same dataset, of the types of features that matter for cyber-bullying detection in this task. Similarly, Zhong et al. Zhong2016cyberbull collected a dataset of Instagram posts and comments, consisting of 3000 examples. They asked Mechanical Turk workers two questions: 1) do the comments include any bullying; and 2) if so, is the bullying due to the content of the image. 560 examples were found to be bullying. They experiment with different kinds of features and simple classifiers for automatically detecting whether something constitutes bullying.

Our work differs from these works in various ways: our dataset is larger and explicitly designed to be difficult for unimodal architectures; we only include examples with high-confidence ratings from trained annotators and carefully balance the dataset to include different kinds of multimodal fusion problems; we focus on hate speech, rather than the more loosely defined cyberbullying; and finally we test more sophisticated models on this problem. Vijayaraghavan et al. Vijayaraghavan2019multimodalhate propose methods for interpreting multimodal hatespeech detection models, where the modalities consist of text and socio-cultural information rather than images. Concurrently, Gomez et al. Gomez2020exploring introduced a larger (and arguably noisier) dataset for multimodal hate speech detection based on Twitter data, which also contains memes and which would probably be useful as pretraining data for our task.

Multimodal hate speech detection is a vision and language task. Vision and language problems have gained a lot of traction is recent years (see Mogadala et al. Mogadala2019survey for a survey), with great progress on important problems such as visual question answering Antol2015vqa ; Goyal2017vqa2 and image caption generation and retrieval Chen2015coco ; Young2014flickr30k ; Krishna2017visualgenome ; sidorov2020textcaps ; gurari2020captioning , with offshoot tasks focusing specifically on visual reasoning Johnson2017clevr , referring expressions Kazemzadeh2014referit , visual storytelling Park2015neurips ; Huang2016visualstory , visual dialogue Das2017visdial ; deVries2017guesswhat , multimodal machine translation Elliott2016multi30k ; Specia2016mmt , visual reasoning Suhr2018corpus ; Hudson2019gqa ; singh2019towards ; xie2019visual ; gurari2018vizwiz , visual common sense reasoning Zellers2019commonsense and many others.

A large subset of these tasks focus on (autoregressive) text generation or retrieval objectives. One of the two modalities is usually dominant. They often rely on bounding boxes or similar features for maximum performance, and are not always easy to evaluate Vedantam2015cider . While these tasks are of great interest to the community, they are different from the kinds of real-world multimodal classification problems one might see in industry—a company like Facebook or Twitter, for example, needs to classify a lot of multimodal posts, ads, comments, etc for a wide variety of class labels. These use cases often involve large-scale, text-dominant multimodal classification similar to what is proposed in this task.

Related multimodal classification tasks exist; for instance, there has been extensive research in multimodal sentiment Soleymani2017mmsentiment , but there is no agreed-upon standard dataset or benchmark task. Other datasets using internet data include Food101 Wang2015food101 , where the goal is to predict the dish of recipes and images; various versions of Yelp reviews Ma2018yelp ; Walmart and Ferramenta product classification Zahavy2016walmart ; Gallo2017ferramenta ; social media name tagging (Twitter and Snapchat) Lu2018multimodalsocialmedia ; social media target-oriented sentiment Yu2019targetsentiment ; social media crisis handling Alam2018crisismmd ; various multimodal news classification datasets Ramisa2017breakingnews ; Shu2017fakenews ; multimodal document intent in Instagram posts Kruk2019mmid ; and predicting tags for Flickr images Thomee2015yfcc100m ; Joulin2016bag . Other datasets include grounded entailment, which exploits the fact that one of the large-scale natural language inference datasets was constructed using captions as premises, yielding a image, premise, hypothesis triplet with associated entailment label Vu2018vsnli ; as well as MM-IMDB, where the aim is to predict genres from posters and plots Arevalo2017gated ; and obtaining a deeper understanding of multimodal advertisements, which requires similarly subtle reasoning Hussain2017ads ; Zhang2018ads . Sabat et al. Sabat2019hate recently found in a preliminary study that the visual modality can be more informative for detecting hate speech in memes than the text. The quality of these datasets varies substantially, and their data is not always readily available to different organizations. Consequently, there has been a practice where authors opt to simply “roll their own” dataset, leading to a fragmented status quo. We believe that our dataset fills up an important gap in the space of multimodal classification datasets.

Conclusion

In this work, we introduced a new challenge dataset and benchmark centered around detecting hate speech in multimodal memes. Hate speech is an important societal problem, and addressing it requires improvements in the capabilities of modern machine learning systems. Detecting hate speech in memes requires reasoning about subtle cues and the task was constructed such that unimodal models find it difficult, by including “benign confounders” that flip the label of a multimodal hateful meme. Our analysis provided insight into the distribution of protected categories and types of attack, as well as words and object frequencies. We found that results on the task reflected a concrete hierarchy in multimodal sophistication, with more advanced fusion models performing better. Still, current state-of-the-art multimodal models perform relatively poorly on this dataset, with a large gap to human performance, highlighting the challenge’s promise as a benchmark to the community. We hope that this work can drive progress in multimodal reasoning and understanding, as well as help solve an important real-world problem.

Broader Impact

This work offers several positive societal benefits. Hate speech is a well-known problem, and countering it via automatic methods can have a big impact on people’s lives. This challenge is meant to spur innovation and encourage new developments in multimodal reasoning and understanding, which can have positive effects for an extremely wide variety of tasks and applications. With these advantages also come potential downsides: better multimodal systems may lead to the automation of jobs in the coming decades, and could be used for censorship or nefarious purposes. These risks can in part be mitigated by developing AI systems to counter them.

References

Appendix A Implementation Details & Hyperparameters

In Table 2, we list out the different hyperparameter settings for our experiments. The exact reproduction details and scripts for the experiments are available with code. Grid search was performed on batch size, learning rate and maximum number of updates to find the best hyperparameter configuration. While batch size and learning rate differ based on the model as shown in Table A, we found 22000 updates as best performing. The models were evaluated at an interval of 500 updates on the dev set and the model with best AUROC was taken as the final model to be evaluated on test set. We use weighted Adam with cosine learning rate schedule and fixed 2000 warmup steps for optimization without gradient clipping. The $\epsilon$ , $\beta_{1}$ and $\beta_{2}$ for Adam is set to 1e-8, 0.9 and 0.98 respectively. The models were implemented in PyTorch using MMF . We trained all of the models on 2 nodes each containing 8 V100 GPUs in a distributed setup.

For ViLBERT, we use the default setting of 6 and 12 TRM blocks for the visual and linguistic streams respectively and use the original weights pretrained on Conceptual Captions provided with . For VisualBERT, we follow and use 12 TRM blocks and use COCO pretrained weights provided with . The hidden size for textual stream TRM blocks is set to 768 and for visual stream is set to 1024. The dropout value for linear and attention layers is set to 0.1 in TRM blocks. The dropout for classifier MLP layers is set to 0.5. For BERT text encoder, we use pretrained BERT base uncased weights provided in the Transformers library . For region features from Faster RCNN, we use features from fc6 layer and fine-tune the weights for fc7 layer. We fine-tune all models fully end-to-end without freezing anything regardless of whether model was pretrained or not.

Appendix B Textual and Visual Lexical Statistics

In this section, we analyze the lexical (word-level) statistics of the dataset. Figure 4 shows the token count distribution, broken down by binary hatefulness label, as well as by category (multimodal hate, unimodal hate, etc). We also analyze the frequency of words for different classes in the test set. Table 3 shows the top 10 most-frequent words by class and their normalized frequency. We observe that certain words used in dehumanizing based on gender (e.g. equating women with “dishwashers” or “sandwich makers”) and colors (“black” and “white”) are frequent. Note that these are also frequent in the benign image confounder category, meaning that these words are not necessarily directly predictive of the label. In unimodal hate (which is almost always text-only), we observe that the language is stronger and often targets religious groups.

Shedding light on the properties of the visual modality, rather than the text, is more difficult. We do the same analysis but this time for bounding box labels from Mask R-CNN . The results are in Table 4. Aside from the fact that these systems appear quite biased, we find that the dataset seems relatively evenly balanced across different properties and objects.

Appendix C Hate Categories and Types of Attack

According to the definition in Section 2.1, hate speech is defined using the characteristics of certain protected categories. That is, creators and distributors of hateful memes aim to incite hatred, enmity or humiliation of groups of people based on the grounds of gender, nationality, etc. and if that group falls in one of the protected categories, it constitutes hate speech under our definition. We analyzed the dev set based on these protected categories. In our analysis, we identified nine such protected categories as listed in Table 5 (right). One meme can attack multiple protected categories. We observe that race and religion-based hate are the most prevalent. Note that this set of protected categories is by no means complete or exhaustive, but it is useful to get a better sense of the dataset properties.

C.2 Types of Attack

Hateful memes employ different types of attacks for different protected categories. In our analysis, we classified these attack types into different classes, together with a catch-all “other” class. One of the major types of attack is dehumanization, in which groups of people are compared with non-human things, such as animals or objects. We break this down into different types of dehumanization found in the data. Attack types are exclusive (only one per meme) and their frequencies are listed in Table 5 (left). Our findings show that the types of attack are broadly distributed, with comparison with criminals (e.g. terrorists), negative stereotypes (e.g. saying that certain demographics are sexually attracted to children/animals) and mocking hate crimes (e.g. the holocaust) being the most frequent categories. Again, note that this set of attack types is by no means complete or exhaustive, but it helps us understand what types of attack a classifier needs to detect if it is to be successful on the task.