SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

Introduction

Recent years have seen the proliferation of offensive language on social media platforms such as Facebook and Twitter. As manual filtering is very time-consuming, and as it can cause post-traumatic stress disorder-like symptoms in human annotators, there have been many research efforts aiming at automating the process. The task is usually modeled as a supervised classification problem, where systems are trained on posts annotated with respect to the presence of some form of abusive or offensive content. Examples of offensive content studied in previous work include hate speech Davidson et al. (2017); Malmasi and Zampieri (2017, 2018), cyberbullying Dinakar et al. (2011), and aggression Kumar et al. (2018). Moreover, given the multitude of terms and definitions used in the literature, some recent studies have investigated the common aspects of different abusive language detection sub-tasks Waseem et al. (2017); Wiegand et al. (2018).

Interestingly, none of this previous work has studied both the type and the target of the offensive language, which is our approach here. Our task, OffensEval (http://competitions.codalab.org/competitions/20011), uses the Offensive Language Identification Dataset (OLID; http://scholar.harvard.edu/malmasi/olid) Zampieri et al. (2019), which we created specifically for this task. OLID is annotated following a hierarchical three-level annotation schema that takes both the target and the type of offensive content into account. Thus, it can relate to phenomena captured by previous datasets such as the one by Davidson et al. (2017). Hate speech, for example, is commonly understood as an insult targeted at a group, whereas cyberbullying is typically targeted at an individual.

We defined three sub-tasks, corresponding to the three levels in our annotation schema (a total of 800 teams signed up to participate in the task, but only 115 teams ended up submitting results):

Sub-task A: Offensive language identification (104 participating teams)

Sub-task B: Automatic categorization of offense types (71 participating teams)

Sub-task C: Offense target identification (66 participating teams)

The remainder of this paper is organized as follows: Section 2 discusses prior work, including shared tasks related to OffensEval. Section 3 presents the shared task description and the sub-tasks included in OffensEval. Section 4 includes a brief description of OLID based on Zampieri et al. (2019). Section 5 discusses the participating systems and their results in the shared task. Finally, Section 6 concludes and suggests directions for future work.

Related Work

Different abusive and offensive language identification problems have been explored in the literature, ranging from aggression to cyberbullying, hate speech, toxic comments, and offensive language. Below we discuss each of them briefly.

Aggression identification: The TRAC shared task on Aggression Identification Kumar et al. (2018) provided participants with a dataset containing 15,000 annotated Facebook posts and comments in English and Hindi for training and validation. For testing, two different sets, one from Facebook and one from Twitter, were used. The goal was to discriminate between three classes: non-aggressive, covertly aggressive, and overtly aggressive. The best-performing systems in this competition used deep learning approaches based on convolutional neural networks (CNN), recurrent neural networks, and LSTM Aroyehun and Gelbukh (2018); Majumder et al. (2018).

Bullying detection: There have been several studies on cyberbullying detection. For example, Xu et al. (2012) used sentiment analysis and topic models to identify relevant topics, and Dadvar et al. (2013) used user-related features such as the frequency of profanity in previous messages.

Hate speech identification: This is the most studied abusive language detection task Kwok and Wang (2013); Burnap and Williams (2015); Djuric et al. (2015). More recently, Davidson et al. (2017) presented the hate speech detection dataset with over 24,000 English tweets labeled as non offensive, hate speech, and profanity.

Offensive language: The GermEval (http://projects.fzai.h-da.de/iggsa/) Wiegand et al. (2018) shared task focused on offensive language identification in German tweets. A dataset of over 8,500 annotated tweets was provided for a coarse-grained binary classification task in which systems were trained to discriminate between offensive and non-offensive tweets. There was also a second task where the offensive class was subdivided into profanity, insult, and abuse. This is similar to our work, but there are three key differences: (i) we have a third level in our hierarchy, (ii) we use different labels in the second level, and (iii) we focus on English.

Toxic comments: The Toxic Comment Classification Challenge (http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge) was an open competition at Kaggle, which provided participants with comments from Wikipedia organized in six classes: toxic, severe toxic, obscene, threat, insult, identity hate. The dataset was also used outside of the competition Georgakopoulos et al. (2018), including as additional training material for the aforementioned TRAC shared task Fortuna et al. (2018).

While each of the above tasks tackles a particular type of abuse or offense, there are many commonalities. For example, an insult targeted at an individual is commonly known as cyberbullying, and insults targeted at a group are known as hate speech. The hierarchical annotation model proposed in OLID Zampieri et al. (2019) and used in OffensEval aims to capture this. We hope that the OLID dataset will become a useful resource for various offensive language identification tasks.

Task Description and Evaluation

The training and testing material for OffensEval is the aforementioned Offensive Language Identification Dataset (OLID), which was built specifically for this task. OLID was annotated using a hierarchical three-level annotation model introduced in Zampieri et al. (2019). Four examples of annotated instances from the dataset are presented in Table 1. We use the annotation of each of the three layers in OLID for a sub-task in OffensEval as described below.

1 Sub-task A: Offensive language identification

In sub-task A, the goal is to discriminate between offensive and non-offensive posts. Offensive posts include insults, threats, and posts containing any form of untargeted profanity. Each instance is assigned one of the following two labels.

Not Offensive (NOT): Posts that do not contain offense or profanity;

Offensive (OFF): We label a post as offensive if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. This category includes insults, threats, and posts containing profane language or swear words.

2 Sub-task B: Automatic categorization of offense types

In sub-task B, the goal is to predict the type of offense. Only posts labeled as Offensive (OFF) in sub-task A are included in sub-task B. The two categories in sub-task B are the following:

Targeted Insult (TIN): Posts containing an insult/threat to an individual, group, or others (see sub-task C below);

Untargeted (UNT): Posts containing non-targeted profanity and swearing. Posts with general profanity are not targeted, but they contain non-acceptable language.

3 Sub-task C: Offense target identification

Sub-task C focuses on the target of offenses. Only posts that are either insults or threats (TIN) are considered in this third layer of annotation. The three labels in sub-task C are the following:

Individual (IND): Posts targeting an individual. The target can be a famous person, a named individual, or an unnamed participant in the conversation. Insults/threats targeted at individuals are often defined as cyberbullying.

Group (GRP): The target of these offensive posts is a group of people considered as a unit due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or other common characteristic. Many of the insults and threats targeted at a group correspond to what is commonly understood as hate speech.

Other (OTH): The target of these offensive posts does not belong to any of the previous two categories, e.g., an organization, a situation, an event, or an issue.
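The dependency between the three annotation levels can be made concrete with a small consistency check. The following is a minimal sketch, not the organizers' code: the function name is_valid_annotation and the use of None for "no label at this level" are assumptions made for illustration, while the label inventories are the ones listed above.

```python
from typing import Optional

# Label inventories taken from the three sub-task descriptions above.
LEVEL_A = {"NOT", "OFF"}
LEVEL_B = {"TIN", "UNT"}
LEVEL_C = {"IND", "GRP", "OTH"}


def is_valid_annotation(a: str, b: Optional[str], c: Optional[str]) -> bool:
    """Check that an (A, B, C) label triple respects the hierarchy.

    None stands for "no label at this level" (an assumption of this sketch).
    """
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive posts are not annotated at levels B and C.
        return b is None and c is None
    if b not in LEVEL_B:
        # Offensive posts must carry a level-B label.
        return False
    if b == "UNT":
        # Untargeted profanity has no target, hence no level-C label.
        return c is None
    # Targeted insults/threats must name a target type.
    return c in LEVEL_C
```

For instance, ("OFF", "TIN", "GRP") is accepted, whereas ("NOT", "TIN", None) and ("OFF", "UNT", "IND") are rejected because they assign a lower-level label that the hierarchy does not allow.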

4 Task Evaluation

Given the strong imbalance between the number of instances in the different classes across the three tasks, we used the macro-averaged F1-score as the official evaluation measure for all three sub-tasks.
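To illustrate why macro-averaging matters under class imbalance, the measure can be sketched in a few lines. This is a plain-Python illustration of the standard definition (per-class F1, then an unweighted mean over classes), not the official scoring script; the function name macro_f1 and the toy labels are assumptions.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): one offensive post among four, all predicted NOT.
gold = ["OFF", "NOT", "NOT", "NOT"]
pred = ["NOT", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)  # accuracy is 0.75, macro-F1 is 3/7
```

Because each class contributes equally to the average, a system that ignores the minority class is penalized even when its accuracy looks high.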

At the end of the competition, we provided the participants with packages containing the results for each of their submissions, including tables and confusion matrices, and tables with the ranks listing all teams who competed in each sub-task. For example, the confusion matrix for the best team in sub-task A is shown in Figure 1.

5 Participation

The task attracted nearly 800 teams, and 115 of them submitted their results. The teams that submitted papers for the SemEval-2019 proceedings are listed in Table 2 (ASE-CSE stands for Amrita School of Engineering - CSE).

Data

Below, we briefly describe OLID, the dataset used for our SemEval-2019 task 6. A detailed description of the data collection process and annotation is presented in Zampieri et al. (2019).

OLID is a large collection of English tweets annotated using a hierarchical three-layer annotation model. It contains 14,100 annotated tweets divided into a training partition of 13,240 tweets and a testing partition of 860 tweets. Additionally, a small trial dataset of 320 tweets was made available before the start of the competition.

The distribution of the labels in OLID is shown in Table 3. We annotated the dataset using the crowdsourcing platform Figure Eight (https://www.figure-eight.com/). We ensured the quality of the annotation by only hiring experienced annotators on the platform and by using test questions to discard annotators who did not achieve a certain threshold. All the tweets were annotated by two people. In case of disagreement, a third annotation was requested, and ultimately we used a majority vote. Examples of tweets from the dataset with their annotation labels are shown in Table 1.
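The aggregation procedure described above (two annotations, a third one only on disagreement, then a majority vote) can be sketched as follows. This is an illustrative reconstruction, not the code used to build OLID: the function name aggregate_label and the request_third callback are hypothetical, and raising an error when all three annotations differ is an assumption, since the text does not say how such cases were handled.

```python
from collections import Counter
from typing import Callable


def aggregate_label(first: str, second: str,
                    request_third: Callable[[], str]) -> str:
    """Return the agreed label, asking for a third vote only if needed."""
    if first == second:
        # Two annotators agree: no further annotation is requested.
        return first
    third = request_third()
    label, count = Counter([first, second, third]).most_common(1)[0]
    if count < 2:
        # Possible with three or more classes (e.g., IND/GRP/OTH).
        # How such ties were resolved is not stated; this is an assumption.
        raise ValueError("no majority among the three annotations")
    return label
```

For example, aggregate_label("OFF", "NOT", lambda: "NOT") returns "NOT", and the callback is never invoked when the first two annotators agree.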

Results

The models used in the task submissions ranged from traditional machine learning, e.g., SVM and logistic regression, to deep learning, e.g., CNN, RNN, and BiLSTM, including attention mechanisms, to state-of-the-art deep learning models such as BERT Devlin et al. and ELMo Peters et al. (2018). Figure 2 shows a pie chart indicating the breakdown by model type for all participating systems in sub-task A. Deep learning was clearly the most popular approach, and ensemble models were also common. Similar trends were observed for sub-tasks B and C.

Some teams used additional training data, exploring external datasets such as Hate Speech Tweets Davidson et al. (2017), toxicity labels Thain et al. (2017), and TRAC Kumar et al. (2018). Moreover, seven teams indicated that they used sentiment lexicons or a sentiment analysis model for prediction, and two teams reported the use of offensive word lists. Furthermore, several teams used pre-trained word embeddings from FastText Bojanowski et al. (2016), from GloVe, including Twitter embeddings from GloVe Pennington et al. (2014) and from word2vec Mikolov et al. (2013); Godin et al. (2015).

In addition, several teams used techniques for pre-processing the tweets such as normalizing the tokens, hashtags, URLs, retweets (RT), dates, elongated words (e.g., “Hiiiii” to “Hi”), and partially hidden words (e.g., “c00l” to “cool”). Other techniques include converting emojis to text, removing uncommon words, and using Twitter-specific tokenizers, such as the Ark Tokenizer (http://www.cs.cmu.edu/~ark/TweetNLP) Gimpel et al. (2011) and the NLTK TweetTokenizer (http://www.nltk.org/api/nltk.tokenize.html), as well as standard tokenizers such as Stanford Core NLP Manning et al. (2014) and the one from Keras (http://keras.io/preprocessing/text/). Approximately a third of the teams indicated that they used one or more of these techniques.
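A few of the normalization steps mentioned above can be sketched with regular expressions. This is only an illustration of the general idea and does not reproduce any particular team's pipeline: the function name normalize_tweet, the "URL" placeholder, the digit-to-letter map, and the exact patterns are all assumptions.

```python
import re

RT_RE = re.compile(r"^RT\s+")                  # leading retweet marker
URL_RE = re.compile(r"https?://\S+")           # web links
HASHTAG_RE = re.compile(r"#(\w+)")             # keep the tag text, drop '#'
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")  # 3+ repeats of one letter
# Digits 0/1 surrounded by letters, as in "c00l" (hypothetical heuristic).
LEET_RE = re.compile(r"(?<=[A-Za-z])[01]+(?=[A-Za-z])")
LEET_MAP = str.maketrans({"0": "o", "1": "i"})


def normalize_tweet(text: str) -> str:
    """Apply a handful of simple tweet normalization steps."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub("URL", text)
    text = HASHTAG_RE.sub(r"\1", text)
    text = LEET_RE.sub(lambda m: m.group(0).translate(LEET_MAP), text)
    # Collapse elongations to one letter: "Hiiiii" -> "Hi".
    text = ELONGATED_RE.sub(r"\1", text)
    return " ".join(text.split())


example = normalize_tweet("RT Hiiiii that is c00l #OffensEval https://t.co/abc")
# example == "Hi that is cool OffensEval URL"
```

Collapsing elongations to a single letter is a deliberately crude choice: it leaves ordinary double letters such as the "oo" in "cool" untouched, but it cannot recover a word whose correct spelling itself contains a doubled letter once that letter has been repeated three or more times.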

The results for each of the sub-tasks are shown in Table 4. Due to the large number of submissions, we only show the F1-score for the top-10 teams, followed by result ranges for the rest of the teams. We further include the models and the baselines from Zampieri et al. (2019): CNN, BiLSTM, and SVM. The trivial baselines assign the same class to every instance, e.g., all offensive or all not offensive for sub-task A. Table 5 shows all the teams that participated in the tasks along with their ranks in each task. These two tables can be used together to find the score/range for a particular team.
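The macro-averaged F1 of such a single-class baseline has a simple closed form: the predicted class has recall 1 and precision n/N (n instances of that class out of N in total), giving F1 = 2n / (n + N), while every other class scores 0. The sketch below works this out; the function name is an assumption, and the class counts in the example are hypothetical, not the OLID distribution from Table 3.

```python
from typing import Dict


def constant_baseline_macro_f1(class_counts: Dict[str, int],
                               predicted: str) -> float:
    """Macro-F1 of a system that always outputs `predicted`."""
    total = sum(class_counts.values())
    n_pred = class_counts[predicted]
    if n_pred == 0:
        return 0.0
    # Precision = n_pred / total, recall = 1  =>  F1 = 2n / (n + N).
    f1_predicted_class = 2 * n_pred / (n_pred + total)
    # All remaining classes are never predicted, so their F1 is 0.
    return f1_predicted_class / len(class_counts)


# Hypothetical 70/30 split, for illustration only.
counts = {"NOT": 700, "OFF": 300}
all_not = constant_baseline_macro_f1(counts, "NOT")
all_off = constant_baseline_macro_f1(counts, "OFF")
```

Under this hypothetical split, both baselines stay below 0.5 macro-F1, and predicting the majority class yields the higher of the two scores.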

Below, we describe the overall results for each sub-task, and we describe the top-3 systems.

1 Sub-task A

Sub-task A was the most popular sub-task with 104 participating teams. Among the top-10 teams, seven used BERT Devlin et al. with variations in the parameters and in the pre-processing steps. The top-performing team, NULI, used BERT-base-uncased with default parameters, but with a maximum sentence length of 64, and trained for 2 epochs. The 82.9% F1 score of NULI is 1.4 points better than the next system, but the difference between the next 5 systems, ranked 2-6, is less than one point: 81.5%-80.6%. The top non-BERT model, MIDAS, is ranked sixth. They used an ensemble of CNN and BLSTM+BGRU, together with Twitter word2vec embeddings Godin et al. (2015) and token/hashtag normalization.

2 Sub-task B

A total of 76 teams participated in sub-task B, and 71 of them had also participated in sub-task A. In contrast to sub-task A, where BERT clearly dominated, here five of the top-10 teams used an ensemble model. Interestingly, the best team, jhan014, which was ranked 76th in sub-task A, used a rule-based approach with a keyword filter based on a Twitter language behavior list, which included strings such as hashtags, signs, etc., achieving an F1-score of 75.5%. The second and the third teams, Amobee and HHU, used ensembles of deep learning (including BERT) and non-neural machine learning models. The best team from sub-task A also performed well here, ranked 4th (71.6%), thus indicating that overall BERT works well for sub-task B as well.

3 Sub-task C

A total of 66 teams participated in sub-task C, and most of them also participated in sub-tasks A and B. As in sub-task B, ensembles were quite successful and were used by five of the top-10 teams. However, as in sub-task A, the best team, vradivchev_anikolov, used BERT after trying many other deep learning methods. They also used pre-processing and pre-trained word embeddings based on GloVe. The second best team, NLPR@SRPOL, used an ensemble of deep learning models such as OpenAI Finetune, LSTM, Transformer, and non-neural machine learning models such as SVM and Random Forest.

4 Description of the Top Teams

The top-3 teams by average rank for all three sub-tasks were NLPR@SRPOL, NULI, and vradivchev_anikolov. Below, we provide a brief description of their approaches:

NLPR@SRPOL was ranked 8th, 9th, and 2nd on sub-tasks A, B, and C, respectively. They used ensembles of OpenAI GPT, Random Forest, the Transformer, Universal encoder, ELMo, and combined embeddings from fastText and custom ones. They trained their models on multiple publicly available offensive datasets, as well as on their own custom dataset annotated by linguists.

NULI was ranked 1st, 4th, and 18th on sub-tasks A, B, and C, respectively. They experimented with different models including linear models, LSTM, and pre-trained BERT with fine-tuning on the OLID dataset. Their final submissions for all three sub-tasks only used BERT, which performed best during development. They also used a number of pre-processing techniques such as hashtag segmentation and emoji substitution.

vradivchev_anikolov was ranked 2nd, 16th, and 1st on sub-tasks A, B, and C, respectively. They trained a variety of models and combined them in ensembles, but their best submissions for sub-tasks A and C used BERT only, as the other models overfitted. For sub-task B, BERT did not perform as well, and they used soft voting classifiers. In all cases, they used pre-trained GloVe vectors and they also applied techniques to address the class imbalance in the training data.
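The average ranks behind this top-3 selection can be recomputed from the per-sub-task ranks quoted in the three descriptions above. The short sketch below does so under the assumption that the descriptions correspond, in order, to NLPR@SRPOL, NULI, and vradivchev_anikolov; it is a worked check of the arithmetic rather than the organizers' ranking script.

```python
# Ranks on sub-tasks A, B, and C, as quoted in the team descriptions above.
ranks = {
    "NLPR@SRPOL": (8, 9, 2),
    "NULI": (1, 4, 18),
    "vradivchev_anikolov": (2, 16, 1),
}

# Average rank = arithmetic mean over the three sub-tasks.
average_rank = {team: sum(r) / len(r) for team, r in ranks.items()}

for team, avg in sorted(average_rank.items(), key=lambda kv: kv[1]):
    print(f"{team}: {avg:.2f}")
```

With these numbers, NLPR@SRPOL and vradivchev_anikolov tie at 19/3 (about 6.33), and NULI follows at 23/3 (about 7.67).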

Conclusion

We have described SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task used OLID Zampieri et al. (2019), a dataset of English tweets annotated for offensive language use, following a three-level hierarchical schema that considers (i) whether a message is offensive or not (for sub-task A), (ii) what is the type of the offensive message (for sub-task B), and (iii) who is the target of the offensive message (for sub-task C).

Overall, about 800 teams signed up for OffensEval, and 115 of them actually participated in at least one sub-task. The evaluation results have shown that the best systems used ensembles and state-of-the-art deep learning models such as BERT. Overall, both deep learning and traditional machine learning classifiers were widely used. More details about the individual systems can be found in their respective system description papers, which are published in the SemEval-2019 proceedings. A list with references to these publications can be found in Table 2; note, however, that only 50 of the 115 participating teams submitted a system description paper.

As is traditional for SemEval, we have made OLID publicly available to the research community beyond the SemEval competition, hoping to facilitate future research on this important topic.

In fact, the OLID dataset and the SemEval-2019 Task 6 competition setup have already been used in teaching curricula at universities in the UK and the USA. For example, student competitions based on OffensEval using OLID have been organized as part of Natural Language Processing and Text Analytics courses at two universities in the UK: Imperial College London and the University of Leeds. System papers describing some of the students’ work are publicly accessible (http://scholar.harvard.edu/malmasi/offenseval-student-systems) and have also been made available on arXiv.org Cambray and Podsadowski (2019); Frisiani et al. (2019); Ong (2019); Sapora et al. (2019); Puiu and Brabete (2019); Uglow et al. (2019). Similarly, a number of students in Linguistics and Computer Science at the University of Arizona in the USA have been using OLID in their coursework.

In future work, we plan to increase the size of the OLID dataset, while addressing issues such as class imbalance and the small size for the test partition, particularly for sub-tasks B and C. We would also like to expand the dataset and the task to other languages.

Acknowledgments

We would like to thank the SemEval-2019 organizers for hosting the OffensEval task and for replying promptly to all our inquiries. We further thank the SemEval-2019 anonymous reviewers for the helpful suggestions and for the constructive feedback, which have helped us improve the text of this report.

We especially thank the SemEval-2019 Task 6 participants for their interest in the shared task, for their participation, and for their timely feedback, which have helped us make the shared task a success.

Finally, we would like to thank Lucia Specia from Imperial College London and Eric Atwell from the University of Leeds for hosting the OffensEval competition in their courses. We further thank the students who participated in these student competitions and especially those who wrote papers describing their systems.

The research presented in this paper was partially supported by an ERAS fellowship, which was awarded to Marcos Zampieri by the University of Wolverhampton, UK.