Tackling Online Abuse: A Survey of Automated Abuse Detection Methods

Pushkar Mishra, Helen Yannakoudakis, Ekaterina Shutova

Introduction

With the advent of social media, anti-social and abusive behavior has become a prominent occurrence online. Undesirable psychological effects of abuse on individuals make it an important societal problem of our time. Munro [Munro, 2011] studied the ill-effects of online abuse on children, concluding that children may develop depression, anxiety, and other mental health problems as a result of their encounters online. Pew Research Center, in its latest report on online harassment [Duggan, 2017], revealed that $40\%$ of adults in the United States have experienced abusive behavior online, of which $18\%$ have faced severe forms of harassment, e.g., harassment of a sexual nature. The report goes on to say that harassment need not be experienced first-hand to have an impact: $13\%$ of American Internet users admitted that they stopped using an online service after witnessing abusive and unruly behavior of their fellow users. These statistics stress the need for automated abuse detection and moderation systems. Therefore, in recent years, a new research effort on abuse detection has sprung up in NLP.

That said, the notion of abuse has proven elusive and difficult to formalize. Different norms across (online) communities can affect what is considered abusive [Chandrasekharan et al., 2018]. In the context of natural language, abuse is a term that encompasses many different types of fine-grained negative expressions. For example, Nobata et al. [Nobata et al., 2016] use it to collectively refer to hate speech, derogatory language and profanity, while Mishra et al. [Mishra et al., 2018a] use it to discuss racism and sexism. The definitions for different types of abuse tend to be overlapping and ambiguous. However, regardless of the specific type, we define abuse as any expression that is meant to denigrate or offend a particular person or group. Taking a coarse-grained view, Waseem et al. [Waseem et al., 2017] classify abuse into broad categories based on explicitness and directness. Explicit abuse comes in the form of expletives, derogatory words or threats, while implicit abuse has a more subtle appearance characterized by the presence of ambiguous terms and figures of speech such as metaphor or sarcasm. Directed abuse targets a particular individual as opposed to generalized abuse, which is aimed at a larger group such as a particular gender or ethnicity.

This categorization exposes some of the intricacies that lie within the task of automated abuse detection. While directed and explicit abuse is relatively straightforward to detect for humans and machines alike, the same is not true for implicit or generalized abuse. This is illustrated in the works of Dadvar et al. [Dadvar et al., 2013] and Waseem and Hovy [Waseem and Hovy, 2016]: Dadvar et al. observed an inter-annotator agreement of $93\%$ on their cyber-bullying dataset. Cyber-bullying is a classic example of directed and explicit abuse since there is typically a single target who is harassed with personal attacks. On the other hand, Waseem and Hovy noted that $85\%$ of all the disagreements in annotation of their dataset occurred on the sexism class. Sexism is typically both generalized and implicit.

This paper aims to provide a comprehensive view of the field of automated abuse detection. We make various contributions that differ from traditional surveys of the field [Schmidt and Wiegand, 2017, Fortuna and Nunes, 2018, Salminen et al., 2018, Castelle, 2018]. Firstly, we present a review of the commonly-used datasets. We then discuss the various methods for abuse detection that have been investigated by the NLP community, including ones based on neural networks which previous surveys have omitted. Next, we summarize the main trends that emerge, highlight the challenges that remain, and outline possible solutions. Lastly, taking note of the direction of developments, we propose guidelines for ethics and explainability that align with the aforementioned categorization of abuse per explicitness and directness.

Annotated datasets

Supervised learning approaches to abuse detection require annotated datasets for training and evaluation purposes. To date, several manually annotated datasets have been made available by researchers. Some datasets are described directly in the later sections, where the methods applied to them are discussed; additionally, in the appendix, we provide summaries of the publicly available datasets along with links to them. Before describing the commonly-used ones, we highlight the two respects in which these datasets differ:

Source: the platform from which the data samples were collected. For example, data samples can be posts from Reddit or tweets from Twitter. Source governs many properties of the dataset such as linguistic style and structure, level of grammatical correctness, extent of (deliberate) obfuscation of words, etc. Essentially, source affects both explicitness and directness of the abusive samples in it.

Composition: the composition of a dataset is governed by the nature of the data samples it contains. Most datasets are annotated for or compiled to cover only a certain subset of abuse types, e.g., racism and sexism, or personal attack and racism, or hate speech and profanity.

Early datasets. The earliest dataset published in this field was from Spertus [Spertus, 1997]. It consisted of $1,222$ private messages written in English taken from web-masters of controversial web resources such as NewtWatch. These messages were marked as flame (containing insults or abuse; $7.5\%$ ), maybe flame ( $13\%$ ), or okay ( $79.5\%$ ). We refer to this dataset as data-smokey. Yin et al. [Yin et al., 2009] constructed three English datasets and annotated them for harassment, which they defined as “systematic efforts by a user to belittle the contributions of other users”. The samples were taken from three social media platforms: Kongregate ( $4,802$ posts; $0.87\%$ harassment), Slashdot ( $4,303$ posts; $1.4\%$ harassment), and MySpace ( $1,946$ posts; $3.3\%$ harassment). We refer to the three datasets jointly as data-harass.

Yahoo! as source. Many datasets have been compiled using samples taken from portals of Yahoo!, specifically the News and Finance ones. Djuric et al. [Djuric et al., 2015] created a dataset of $951,736$ user comments in English from the Yahoo! Finance website that were editorially labeled as hate speech ( $5.9\%$ ) or clean (data-yahoo-fin-dj). Nobata et al. [Nobata et al., 2016] produced four more datasets with comments from Yahoo! News and Yahoo! Finance, each labeled abusive or clean: 1) data-yahoo-fin-a: $759,402$ comments, 7.0% abusive; 2) data-yahoo-news-a: $1,390,774$ comments, 16.4% abusive; 3) data-yahoo-fin-b: $448,436$ comments, 3.4% abusive; and 4) data-yahoo-news-b: $726,073$ comments, 9.7% abusive.

Twitter as source. Several groups have investigated abusive language in Twitter. Waseem and Hovy [Waseem and Hovy, 2016] created a corpus of $16,907$ tweets, each annotated as one of racism ( $11.7\%$ ), sexism ( $20.0\%$ ), or neither (data-twitter-wh). We note that although certain tweets in the dataset lack explicit abusive traits (e.g., @Mich_McConnell Just “her body” right?), they have nevertheless been marked as racist or sexist as the annotators took the wider discourse into account; however, such discourse information is not preserved in the dataset. Inter-annotator agreement was reported at $\kappa=84\%$ , with a further insight that $85\%$ of all the disagreements occurred on the sexism class alone. Waseem [Waseem, 2016] later released a dataset of $6,909$ tweets annotated as racism ( $1.41\%$ ), sexism ( $13.08\%$ ), both ( $0.70\%$ ), or neither (data-twitter-w). data-twitter-w and data-twitter-wh have $2,876$ tweets in common. It should, however, be noted that the inter-annotator agreement between the two datasets is low (mean pairwise $\kappa=14\%$ ) [Waseem, 2016]. Davidson et al. [Davidson et al., 2017] created a dataset of approximately $25k$ tweets, manually annotated as one of racist ( $5\%$ ), offensive but not racist ( $76\%$ ), or clean ( $19\%$ ). We note, however, that their data sampling procedure relied on the presence of certain abusive words and, as a result, the distribution of classes does not follow a real-life distribution. Recently, Founta et al. [Founta et al., 2018] crowd-sourced a dataset (data-twitter-f) of $80k$ tweets, of which $59\%$ were rated normal, $22.5\%$ spam, $7.5\%$ hateful, and $11\%$ abusive. The OffensEval 2019 shared task used a recently released dataset of $14,100$ tweets [Zampieri et al., 2019], each hierarchically labeled as: offensive ( $33\%$ ) or not, offense is targeted ( $29\%$ ) or not, and whether the target is an individual ( $17.8\%$ ), a group ( $8.2\%$ ) or otherwise ( $3\%$ ).

Other English sources. Wulczyn et al. [Wulczyn et al., 2017a] crowd-sourced annotations for English comments from Wikipedia’s talk pages and released three datasets: one focusing on personal attacks ( $115,864$ comments; $11.7\%$ abusive), one on aggression ( $115,864$ comments), and one on toxicity ( $159,686$ comments; $9.6\%$ abusive) (data-wiki-att, data-wiki-agg, and data-wiki-tox respectively). data-wiki-agg contains the exact same comments as data-wiki-att but annotated for aggression – the two datasets show a high correlation in the nature of abuse (Pearson’s $r=0.972$ ). Gao and Huang [Gao and Huang, 2017] released a dataset of $1,528$ Fox News user comments (data-fox-news) annotated as hateful ( $28.5\%$ ) or non-hateful. The dataset preserves context information for each comment, including user’s screen-name, all comments in the same thread, and the news article for which the comment is written. Salminen et al. [Salminen et al., 2018] sourced $137k$ comments from under videos posted on YouTube and Facebook (data-ytube-fb). They self-annotated the corpus per a fine-grained hierarchical taxonomy consisting of 13 main categories and 16 sub-categories that cover both nature of abuse (e.g., humiliation) as well as targets (e.g., religion).

Non-English datasets. Some researchers investigated abuse in languages other than English. Van Hee et al. [Van Hee et al., 2015] gathered $85,485$ Dutch posts from ask.fm to form a dataset on cyber-bullying (data-bully; $6.7\%$ cyber-bullying cases). Pavlopoulos et al. [Pavlopoulos et al., 2017b] released a dataset of ca. $1.6M$ comments in Greek provided by the news portal Gazzetta (data-gazzetta). The comments were marked as accept or reject, and are divided into 6 splits with similar distributions (the training split is the largest one: $66\%$ accepted and $34\%$ rejected comments). As part of the GermEval shared task on identification of offensive language in German tweets [Wiegand et al., 2018c], a dataset of $8,541$ tweets was released, of which $21\%$ were labeled as abuse, $11.4\%$ as insult, $1.39\%$ as profanity, and $66.16\%$ as other. Around the same time, $15k$ Facebook posts and comments, each in Hindi (in both Roman and Devanagari script) and English, were released (data-facebook) as part of the COLING 2018 shared task on aggression identification [Kumar et al., 2018]. $35.3\%$ of the comments were covertly aggressive, $22.8\%$ overtly aggressive and $41.9\%$ non-aggressive. We note, however, that some issues were raised by the participants regarding the quality of the annotations. The HatEval 2019 shared task focused on detecting hate speech against immigrants and women using a dataset of $5k$ tweets in Spanish and $10k$ in English annotated hierarchically as hateful or not; and, in turn, as aggressive or not, and whether the target is an individual or a group.

Remarks. In their study, Ross et al. [Ross et al., 2016] stressed the difficulty of reliably annotating abuse, which stems from multiple factors such as the lack of standard definitions for the myriad types of abuse, differences in annotators’ cultural background and experiences, and ambiguity in the annotation guidelines. That said, Waseem et al. [Waseem et al., 2017] and Nobata et al. [Nobata et al., 2016] observed that annotators with prior expertise provide good-quality annotations with high levels of agreement amongst themselves. That aside, most datasets contain discrete labels only; abuse detection systems trained on such datasets would be deprived of the notion of severity, which is vital in real-world settings. In fact, all existing datasets cover only a subset of abuse types. Moreover, recent studies have also found substantial and systematic bias in existing datasets, which they primarily attributed to the sampling strategy, e.g., topic or word-based [Wiegand et al., 2019].
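The $\kappa$ values quoted for the Twitter datasets above correct raw agreement for the agreement expected by chance. As a minimal sketch, assuming the two-annotator Cohen's $\kappa$ variant (the cited works may use a different variant), the statistic can be computed as below. The function name and the toy labels are hypothetical and are not drawn from any dataset in this survey.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's marginal label counts.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up annotations for six items (illustrative only).
ann_1 = ["sexism", "neither", "neither", "racism", "neither", "sexism"]
ann_2 = ["neither", "neither", "neither", "racism", "neither", "sexism"]
kappa = cohen_kappa(ann_1, ann_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on five of six items, yet $\kappa$ is noticeably lower than the raw agreement because the frequent neither label makes chance agreement likely.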

Feature engineering based abuse detection

The first documented method for abuse detection was that of Spertus [Spertus, 1997] who hand-crafted rules over texts to generate feature vectors for learning. Since then, several methods have been proposed that rely on manual feature engineering. Such feature engineering happens on two fronts, either on the text of the sample (textual) or on the user(s) who created or interacted with the sample (social).

Textual feature engineering. This kind of feature engineering models directed and explicit traits of abuse within samples. Researchers have adopted two approaches to textual feature engineering: a hand-crafted rules cum lexicon-based approach and a computational approach. The former includes features extracted from text based on linguistic rules (e.g., text contains the pronoun you followed by profanity) or some curated lexicon of abusive words and expressions (e.g., hatebase.org). The latter, on the other hand, includes bag-of-words (BOW) counts, tf-idf weighted features, features based on similarity clustering, etc. Both approaches are summarized in Table 1.
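To make the two families of textual features concrete, the following is a minimal sketch that extracts a lexicon count, one hand-crafted rule, and BOW counts from a single sample. The three-word lexicon, the tokenizer, and the two-token window used for the "you followed by profanity" rule are all assumptions made for illustration; none of them is taken from a specific system surveyed here.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; real systems rely on curated resources.
ABUSIVE_LEXICON = {"idiot", "moron", "scum"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def textual_features(text, window=2):
    tokens = tokenize(text)
    # Computational feature: bag-of-words counts.
    bow = Counter(tokens)
    # Lexicon feature: number of tokens found in the abusive lexicon.
    lexicon_hits = sum(bow[word] for word in ABUSIVE_LEXICON)
    # Rule feature: "you" followed by a lexicon word within `window` tokens.
    you_then_abuse = any(
        tok == "you"
        and any(t in ABUSIVE_LEXICON for t in tokens[i + 1 : i + 1 + window])
        for i, tok in enumerate(tokens)
    )
    return {
        "bow": bow,
        "lexicon_hits": lexicon_hits,
        "you_then_abuse": you_then_abuse,
    }


feats = textual_features("You are an idiot, you absolute moron!")
print(feats["lexicon_hits"], feats["you_then_abuse"])
```

In practice such features would be concatenated into a vector and passed to a classifier; that step is omitted here.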

Social feature engineering. Several researchers have directly incorporated features and identity traits of users in order to model the likeliness of abusive behavior from users with certain traits, a process known as user profiling. Dadvar et al. [Dadvar et al., 2013] included the age of users alongside other traditional lexicon-based features to detect cyber-bullying, while Galán-García et al. [Galán-García et al., 2016] utilized the time of publication, geo-position and language in the profile of Twitter users. Waseem and Hovy [Waseem and Hovy, 2016] exploited gender of Twitter users alongside character n-gram counts to improve detection of sexism and racism in tweets from data-twitter-wh (F1 increased from $73.89\%$ to $73.93\%$ ). Using the same setup, Unsvåg and Gambäck [Unsvåg and Gambäck, 2018] showed that the inclusion of social network-based (i.e., number of followers and friends) and activity-based (i.e., number of status updates and favorites) information of users alongside their gender further enhances performance ( $3$ points gain in F1).

Neural networks based abuse detection

Advancements in computational capabilities have led researchers to explore methods for abuse detection that rely on neural architectures. Such methods can be broadly divided into three categories: 1) those that simply consume distributed representations generated by neural networks, 2) those that perform deep learning on texts, and 3) those that use neural networks for modeling social aspects.

Distributed representations. Djuric et al. [Djuric et al., 2015] were the first to adopt neural networks for abuse detection. They utilized paragraph2vec [Le and Mikolov, 2014] to obtain low-dimensional representations for comments in data-yahoo-fin-dj and trained a logistic regression (LR) classifier. Their model outperformed other classifiers trained on BOW-based representations (AUROC $80.07\%$ vs. $78.89\%$ ). The authors noted that words and phrases in hate speech tend to be obfuscated, leading to high dimensionality and sparsity of BOW-based representations, which in turn causes classifiers to over-fit in training.

Building on the work of Djuric et al., Nobata et al. [Nobata et al., 2016] examined the performance of a variety of features on the Yahoo! datasets (data-yahoo-*) using a regression model: 1) word and character n-grams, 2) linguistic features like number of polite/hate words and punctuation count, 3) syntactic features like parent and grandparent of node in a dependency tree, and 4) distributional-semantic features like paragraph2vec comment representations. Although the best results were achieved with all the features combined ( $79.5\%$ F1 on data-yahoo-fin-a, $81.7\%$ F1 on data-yahoo-news-a), character n-grams on their own contributed significantly more than the other features due to their robustness to noise such as obfuscations, misspellings, and unseen words. The paragraph2vec representations, in comparison, performed on par with character n-grams only on data-yahoo-news-a, which was noted to be less noisy than data-yahoo-fin-a. Working with the data-yahoo-fin-dj dataset, Mehdad and Tetreault [Mehdad and Tetreault, 2016] investigated whether character-level features are more indicative of abuse than word-level ones. Their experiments demonstrated the superiority of character-level features, revealing that SVM classifiers trained on Bayesian log-ratio vectors of average counts of character n-grams improve not only upon the more intricate approach of Nobata et al. (AUROC from $91\%$ to $92\%$ ), but also upon recurrent neural network (RNN) based character and word-level models.
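The robustness of character n-grams to obfuscation can be illustrated with a small sketch. It compares a word and an obfuscated variant of it, first as whole words and then as sets of character trigrams, using Jaccard overlap. The example word pair, the choice of trigrams, and the padding with spaces are assumptions for illustration only.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the text, padded with one space per side."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


clean, obfuscated = "stupid", "stup1d"

# At the word level the two forms share nothing.
word_overlap = jaccard({clean}, {obfuscated})
# At the character level the unobfuscated part of the word still matches.
char_overlap = jaccard(char_ngrams(clean), char_ngrams(obfuscated))

print(word_overlap, round(char_overlap, 3))
```

A word-level model sees the obfuscated form as an unseen token, whereas a character n-gram representation retains part of the signal.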

Samghabadi et al. [Samghabadi et al., 2017] started with a similar set of features as Nobata et al. [Nobata et al., 2016] and augmented it with hand-engineered ones such as polarity scores derived from SentiWordNet, scores based on the LIWC program, and features based on emoticons. They applied their method to three different datasets: data-wiki-att, a Kaggle dataset annotated for insult, and a dataset of questions and answers (each labeled as invective or neutral) that they created by crawling ask.fm. Distributional-semantic features combined with the aforementioned features constituted an effective feature space for the task ( $65\%$ , $68\%$ , $56\%$ F1 on data-wiki-att, Kaggle, ask.fm respectively). In line with the results of Nobata et al. and Mehdad and Tetreault, the authors found character n-grams to be performing well on these datasets too.

Deep learning on texts. With the advent of deep learning, many researchers have explored its efficacy in abuse detection. Badjatiya et al. [Badjatiya et al., 2017] evaluated several neural architectures on the data-twitter-wh dataset. Their best setup involved a two-step approach wherein they used a word-level long short-term memory (LSTM) model to tune GloVe or randomly-initialized word embeddings, and then trained a gradient-boosted decision tree (GBDT) classifier on the average of the tuned embeddings in each tweet. They achieved the best results using randomly-initialized embeddings (weighted F1 of $93\%$ ). However, working with a similar setup, Mishra et al. [Mishra et al., 2018a] reported that GloVe initialization provided superior performance; the mismatch was attributed to the fact that Badjatiya et al. tuned the embeddings on the entire dataset, including the test set, to which the randomly-initialized ones overfit better.

Park and Fung [Park and Fung, 2017] utilized character and word-level CNNs to classify comments in the dataset that they formed by combining data-twitter-w and data-twitter-wh. Their experiments demonstrated that combining the two levels of granularity using two input channels achieves the best results, improving upon a character n-grams based LR baseline (weighted F1 from $81.4\%$ to $82.7\%$ ). Several other works have also demonstrated the efficacy of CNNs in detecting abusive social media posts [Singh et al., 2018]. Some researchers [Wang, 2018, Zhang et al., 2018] have shown that sequentially combining CNNs with gated recurrent unit (GRU) RNNs can enhance performance by taking advantage of properties of both architectures, e.g., 1-2% increase in F1 compared to only using CNNs.

Pavlopoulos et al. [Pavlopoulos et al., 2017a, Pavlopoulos et al., 2017b] applied deep learning to the data-wiki-att, data-wiki-tox, and data-gazzetta datasets. Their most effective setups were: 1) a word-level GRU followed by an LR layer, and 2) setup 1 extended with an attention mechanism on words. Both setups outperformed a simple word-list baseline as well as the character n-grams based LR classifier (detox) from Wulczyn et al. [Wulczyn et al., 2017a]. Setup 1 achieved the best performance on data-wiki-att (AUROC $97.71\%$ ) and data-wiki-tox (AUROC $98.42\%$ ), while setup 2 performed the best on data-gazzetta (AUROC $84.69\%$ ). Additionally, the attention mechanism was shown to be able to highlight abusive words and phrases within the comments, exhibiting a high level of agreement with annotators on the task. Lee et al. [Lee et al., 2018] worked with a subset of the data-twitter-f dataset and showed that a word-level bi-GRU model along with latent topic clustering, where topic information is extracted from the hidden states of the GRU [Yoon et al., 2018], yielded the best weighted F1 of $80.5\%$ .

The GermEval 2018 shared task on identification of offensive language in German tweets [Wiegand et al., 2018c] saw submission of both deep learning and feature engineering based methods. The winning system [Montani, 2018], with macro F1 of $76.77\%$ , employed multiple character and token n-gram classifiers alongside distributional semantic features obtained by averaging word embeddings. The second best approach [von Grünigen et al., 2018], with macro F1 $75.52\%$ , on the other hand, employed an ensemble of CNNs whose outputs were fed to a meta classifier for final prediction. Most of the remaining submissions [Risch et al., 2018, Wiegand et al., 2018a] used deep learning with CNNs and RNNs alongside techniques such as transfer learning (e.g., via machine translation or joint representation learning for words across languages) from abuse-annotated datasets in other languages (mainly English). Wiegand et al. [Wiegand et al., 2018c] noted that simple deep learning approaches themselves were quite effective, and the addition of other techniques did not necessarily provide substantial gains.
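The results in this section are reported with differently averaged F1 scores (weighted F1 for several Twitter experiments, macro F1 for the shared tasks), and the two are not directly comparable on imbalanced data. The following minimal sketch, using made-up gold labels and predictions rather than any reported result, shows how the two averages diverge when the abusive class is rare. The function names are hypothetical.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """F1 score for each class that occurs in the gold labels."""
    scores = {}
    for c in set(gold):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Mean of per-class F1 weighted by class support in the gold labels."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[c] * support[c] for c in scores) / len(gold)


# Toy imbalanced sample: 8 clean, 2 abusive; one abusive item is missed.
gold = ["clean"] * 8 + ["abusive"] * 2
pred = ["clean"] * 8 + ["clean", "abusive"]

macro = macro_f1(gold, pred)
weighted = weighted_f1(gold, pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Because the majority class dominates the weighted average, weighted F1 can look considerably higher than macro F1 for the same predictions.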

Kumar et al. [Kumar et al., 2018] noted similar trends in the shared task for aggression identification on the data-facebook dataset. The top approach on the task’s English dataset [Aroyehun and Gelbukh, 2018], with macro F1 of $64.25\%$ , comprised RNNs and CNNs along with transfer learning via machine translation. The top approach for Hindi [Samghabadi et al., 2018], with F1 of $62.92\%$ , utilized lexical features based on word and character n-grams. In order to further understand the pros and cons of both, van Aken et al. [van Aken et al., 2018] performed a systematic comparison of neural and non-neural approaches to toxic comment classification, concluding that ensembles of the two were most effective.

The GermEval 2019 shared task on identification of offensive language in German tweets [Struß et al., 2019] consisted of three sub-tasks: 1) coarse-grained classification of samples as offense or other, 2) fine-grained classification of samples as offense, profanity, insult or other, and 3) classification of offensive samples as implicit or explicit. All three sub-tasks saw submission of a range of methods, including those based on deep contextualized language models like BERT [Devlin et al., 2019]. The organizers noted that in fact the winning submissions across all three sub-tasks utilized some form of BERT. Paraschiv and Cercel [Paraschiv and Cercel, 2019] made the winning submission on sub-tasks 1 (76.95% F1) and 2 (53.59% average F1) that comprised a BERT model pre-trained on German Wikipedia and German Twitter corpora prior to being fine-tuned on the sub-task datasets. Risch et al. [Risch et al., 2019] had the winning submission on sub-task 3 (53.93% F1) that again comprised a BERT model pre-trained on German texts. While the organizers noted that methods based on traditional CNN and RNN models didn’t feature in the top 3 on any sub-task, they found that ensemble models trained on character and token n-grams [Montani and Schüller, 2019] and lexicon-based features [Schmid et al., 2019] fared well.

Researchers have recently started exploring multi-task learning with neural networks for the purpose of abuse detection. Rajamanickam et al. [Rajamanickam et al., 2020] demonstrated that jointly learning over emotion classification and abuse detection tasks leads to better performance on the latter. Detecting the affective nature of comments (e.g., disgust, anger, joy, fear, optimism) helps to detect abuse more accurately on Twitter posts, achieving an F1 of 79.55 and 76.03 on data-twitter-wh and OffensEval respectively. Samghabadi et al. [Samghabadi et al., 2019] utilize an emotion-aware attention mechanism, achieving a macro-F1 of 83.56 on Kaggle and 88.27 on data-wiki-att.

Modeling social aspects with neural networks. More recently, researchers have employed neural networks to extract representations or profiles for users instead of manually leveraging traits like gender, location, etc. as discussed before. Working with the data-gazzetta dataset, Pavlopoulos et al. [Pavlopoulos et al., 2017c] incorporated user embeddings into Pavlopoulos’ setup 1 [Pavlopoulos et al., 2017a, Pavlopoulos et al., 2017b] described above. They divided all the users whose comments are included in data-gazzetta into 4 types based on the proportion of abusive comments: red (users with $>10$ comments and $\geq 66\%$ abusive comments), yellow (users with $>10$ comments and $33\%-66\%$ abusive comments), green (users with $>10$ comments and $\leq 33\%$ abusive comments), and unknown (users with $\leq 10$ comments). They then assigned unique randomly-initialized embeddings to users and added them as additional input to the LR layer alongside representations of comments obtained from the GRU. This increased the AUROC from $79.24\%$ to $80.71\%$ . Qian et al. [Qian et al., 2018] used LSTMs for modeling inter and intra-user relationships on data-twitter-wh, with sexist and racist tweets combined into one category. The authors applied a bi-LSTM to users’ recent tweets in order to generate intra-user representations that capture their historic behavior. To improve robustness against noise present in tweets, they also used locality sensitive hashing to form sets of semantically similar user tweets. The authors then trained a policy network to select tweets from sets that a bi-LSTM could use to generate inter-user representations. When these inter and intra-user representations were utilized alongside representations of tweets from an LSTM baseline, F1 increased from $70.3\%$ to $77.4\%$ .
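The four user types used by Pavlopoulos et al. follow directly from the thresholds quoted above and can be sketched as a small function. The function and argument names are hypothetical, and integer arithmetic is an implementation choice made here to avoid floating-point edge cases at the $33\%$ and $66\%$ boundaries.

```python
def user_type(num_comments, num_abusive):
    """Assign a user to red, yellow, green, or unknown per the quoted thresholds."""
    if num_comments <= 10:
        return "unknown"  # too few comments to profile the user
    # Compare proportions with integers: abusive / total >= 66 / 100, etc.
    if 100 * num_abusive >= 66 * num_comments:
        return "red"
    if 100 * num_abusive > 33 * num_comments:
        return "yellow"
    return "green"


# Hypothetical users given as (total comments, abusive comments).
examples = [(5, 5), (20, 15), (20, 10), (20, 2)]
types = [user_type(total, abusive) for total, abusive in examples]
print(types)
```

Each type (or each user) can then be mapped to an embedding that is fed to the classifier alongside the comment representation.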

Mishra et al. [Mishra et al., 2018a] constructed a community graph of all the users whose tweets are in the data-twitter-wh dataset. Nodes were the users and edges denoted the follower-following relationship among them on Twitter. The authors applied node2vec [Grover and Leskovec, 2016] to this graph to generate user embeddings, i.e., profiles. Inclusion of these embeddings into the character n-gram based baselines yielded significant gains on data-twitter-wh whereby F1 scores on the racism and sexism classes increased from $72.28\%$ and $72.09\%$ to $75.09\%$ and $82.75\%$ respectively. The gains were attributed to the fact that user embeddings captured not only information about online communities, but also elements of the wider conversation amongst connected users. Ribeiro et al. [Ribeiro et al., 2018] and Mishra et al. [Mishra et al., 2019] applied graph neural networks [Kipf and Welling, 2017, Hamilton et al., 2017] to community graphs to generate embeddings for users that capture not only their surrounding community but also their linguistic behavior. Mishra et al. [Mishra et al., 2019] recorded $79.49\%$ F1 on the racism and $84.44\%$ F1 on the sexism classes of data-twitter-wh.

Discussion

English has been the dominant language so far in terms of focus, followed by German, Hindi and Dutch. However, recent efforts have focused on compilation of datasets in other languages such as Slovene and Croatian [Ljubešić et al., 2018], Chinese [Su et al., 2017], Arabic [Mubarak et al., 2017], and even some unconventional ones such as Hinglish [Mathur et al., 2018]. Most of the research to date has been on racism, sexism, personal attacks, toxicity, and harassment. Other types of abuse such as obscenity, threats, insults, and grooming remain relatively unexplored. That said, we note that the majority of methods investigated to date and described herein are (in principle) applicable to a range of abuse types.

The recent approaches that rely on word-level CNNs and RNNs remain vulnerable to obfuscation of words [Mishra et al., 2018b]. On the other hand, the use of sub-word units, both in feature engineering (e.g., character n-grams) and as tokenized inputs to models like BERT, remains one of the most effective techniques for addressing obfuscation since sub-word units are robust to spelling variations. Many researchers to date have exclusively relied on text-based features for abuse detection. But recent works have shown that personal and community-based profiling features of users significantly enhance the state of the art. Since posts on social media often include data of multiple modalities (e.g., a combination of images and text), abuse detection systems would also need to incorporate a multi-modal component. Facebook recently released a dataset consisting of multi-modal hateful memes [Kiela et al., 2020] under the Hateful Memes Challenge to foster research in this area.

Despite the fast-paced progress in this field, an important challenge that remains mostly unsolved is that of recognizing implicit abuse [van Aken et al., 2018]. Implicit abuse comes in the form of figurative language, such as sarcasm, irony or metaphor, rhetorical questions, analogies and comparisons. Metaphor and sarcasm are particularly common, and tend to express stronger emotions and sentiments than the literally-used words and phrases [Mohammad et al., 2016]. Nobata et al. [Nobata et al., 2016] (among others) noted that sarcastic comments are hard for abuse detection methods to deal with since surface features are not sufficient; typically the knowledge of the context or background of the user is also required. Mishra [Mishra, 2018] found that metaphors are more frequent in abusive samples as opposed to non-abusive ones. However, to fully understand the impact of figurative devices on abuse detection, datasets with more pronounced presence of these are required.

The key to modeling implicit abuse, and detecting abuse more accurately in general, might lie in shifting focus from modeling individual comments to modeling online conversations and how they evolve and escalate towards abuse. Abuse is inherently contextual; it can only be interpreted as part of a wider conversation between users on the Internet. This means that, in practice, individual comments can be difficult to classify without modeling their respective contexts. Mishra et al. [Mishra et al., 2018a] have pointed out that some tweets in data-twitter-wh do not contain sufficient lexical or semantic information to detect abuse even in principle, e.g., @user: Logic in the world of Islam http://t.co/xxxxxxx, and techniques for modeling discourse and elements of pragmatics are needed. To address this issue, Gao and Huang [Gao and Huang, 2017], working with data-fox-news, incorporate features from two sources of context: the title of the news article for which the comment was posted, and the screen name of the user who posted it. Yet this is only a first step towards modeling the wider context in abuse detection; more sophisticated techniques are needed to capture the history of the conversation and the behavior of the users as it develops over time. NLP techniques for modeling discourse and dialogue can be a good starting point in this line of research.

Another challenge in modeling abuse is presented by its ever-changing nature, as societies and technologies evolve. New abusive words and phrases continue to enter the language [Wiegand et al., 2018b]. Working with the data-yahoo-*-b datasets, Nobata et al. [Nobata et al., 2016] found that a classifier trained on more recent data outperforms one trained on older data. They noted that a prominent factor in this is the continuous evolution of Internet jargon. We would like to add that, given the situational and topical nature of abuse [Chandrasekharan et al., 2018], contextual features learned by detection methods may become irrelevant over time. A similar trend also holds for abuse detection across domains. Wiegand et al. [Wiegand et al., 2018b] showed that the performance of the (then) state-of-the-art classifiers [Nobata et al., 2016, Pavlopoulos et al., 2017b] decreases substantially when tested on data drawn from domains different from that of the training set. They attributed this trend to a lack of domain-specific learning. Chandrasekharan et al. [Chandrasekharan et al., 2017] propose an approach that utilizes similarity scores between posts to improve in-domain performance based on out-of-domain data. Possible solutions for improving cross-domain abuse detection can be found in the literature on (adversarial) multi-task learning and domain adaptation [Daumé III, 2009, Ganin et al., 2016, Wu and Huang, 2015], and also in works such as that of Sharifirad et al. [Sharifirad et al., 2018], who utilize knowledge graphs to augment the training of a sexist tweet classifier. Recently, Waseem et al. [Waseem et al., 2018] and Karan and Šnajder [Karan and Šnajder, 2018] exploited multi-task learning frameworks to train models that are robust across data from different distributions or data annotated under different guidelines.

2 Ethical questions around automated abuse detection

Identifying experiences as abusive provides validation to victims of abuse and enables observers to grasp the scope of the problem. It also creates new descriptive norms, suggesting what types of behavior constitute abuse, and outlines existing expectations around appropriate behavior. On the other hand, automated systems can invalidate abusive experiences, particularly for victims whose experiences may not lie in the realm of typical ones [Blackwell et al., 2017]. This points to a critical issue: automated systems embody the morals and values of their creators and annotators [Bowker and Star, 2000, Blackwell et al., 2017]. It is therefore imperative that we design systems that are robust to such issues; for example, some recent works have investigated ways to mitigate gender bias in models [Binns et al., 2017, Park et al., 2018].

That said, unfortunately, whilst the research community has started incorporating signals from user profiling, there has not yet been a discussion of ethical guidelines for doing so. To encourage such a discussion, we lay out four ethical considerations in the design of such approaches:

The profiling approach should not compromise the privacy of the user. So a researcher might ask themselves such questions as: is the profiling based on identity traits of users (e.g., gender, race etc.) or solely on their online behavior? And is an appropriate generalization from (identifiable) user traits to population-level behavioral trends performed?

One needs to reflect on the possible bias in the training procedure; is the approach likely to induce a bias against users with certain traits?

The visibility aspect needs to be accounted for; is the profiling visible to the users, i.e., can users directly or indirectly observe how they (or others) have been profiled?

One needs to carefully consider the purpose of such profiling; is it intended to take actions against users, or is it more benign (e.g., to better understand the content produced by them and make task-specific generalizations)?

While we do not intend to provide answers to these questions within this paper, we hope that the above considerations can help to start a debate on these important issues.

3 Explainable abuse detection

Explainability has become an important aspect within NLP, and within AI generally. Yet there has been no discussion of this issue in the context of abuse detection systems. We hereby propose three properties that an explainable abuse detection system should aim to exhibit.

It needs to establish intent of abuse (or the lack of it) and provide evidence for it, hence convincingly segregating abuse from other phenomena such as sarcasm and humor.

It needs to capture abusive language, i.e., highlight instances of abuse if present, be they explicit (e.g., use of expletives) or implicit (e.g., dehumanizing comparisons).

It needs to identify the target(s) of abuse (or the absence thereof), be it an individual or a group.

These properties align well with the categorizations of abuse we discussed in the introduction. They also aptly motivate the advances needed in the field: (1) developments in areas such as sarcasm detection and user profiling for precise segregation of abusive intent from humor, satire, etc.; (2) better identification of implicit abuse, which requires improvements in modeling of figurative language; (3) effective detection of generalized abuse and inference of target(s), which require advances in areas such as domain adaptation and conversation modeling.

Conclusions

Online abuse stands as a significant challenge before society. Its nature and characteristics constantly evolve, making it a complex phenomenon to study and model. Methods for automated abuse detection have seen a lot of development in recent years: from simple rule-based ones aimed at identifying directed and explicit abuse to sophisticated ones that can capture rich semantic information and even aspects of user behavior. By providing a comprehensive review of the field to date, our paper aims to lay a platform for future research, facilitating progress in this important effort. While we see an array of challenges that lie ahead, e.g., modeling extra-propositional aspects of language, user behavior and wider conversation, we believe that recent progress in the areas of semantics, dialogue modeling and social media analysis puts the research community in a strong position to address them. The notion of abuse has been rather hard to define due to differing opinions on sarcasm, self-deprecating humor, and terms seen as offensive, to name a few issues. In fact, attempts to impose a definition may also curb the diversity that exists across various abuse datasets, since what constitutes abuse varies from culture to culture. But we do believe that the NLP community can and should work towards standardizing the understanding of different characteristics of abuse, examples of which are presented in the paper: directed, generalized, implicit and explicit. This will allow for more comparable and systematic modeling of different types of abuse (including those that might emerge in the future) and also facilitate transfer learning across them.

References

Appendix A Summaries of public datasets

In Table 2, we summarize the datasets described in this paper that are publicly available and provide links to them.

Appendix B A discussion of metrics

The performance results we have reported highlight that, throughout work on abuse detection, different researchers have utilized different evaluation metrics for their experiments – from area under the receiver operating characteristic curve (AUROC) [Wulczyn et al., 2017a, Djuric et al., 2015] to micro and macro F1 [Mishra et al., 2018b] – regardless of the properties of their datasets. This makes the presented techniques more difficult to compare. In addition, as abuse is a relatively infrequent phenomenon, the datasets are typically skewed towards non-abusive samples [Waseem, 2016]. Metrics such as AUROC may, therefore, be unsuitable since they may mask poor performance on the abusive samples as a side-effect of the large number of non-abusive samples [Jeni et al., 2013]. Macro-averaged precision, recall, and F1, as well as precision, recall, and F1 computed specifically on the abusive classes, may provide a more informative evaluation strategy; the primary advantage is that macro-averaged metrics provide a sense of effectiveness on the minority classes [Van Asch, 2013]. Additionally, area under the precision-recall curve (AUPRC) might be a better alternative to AUROC in imbalanced scenarios [Davis and Goadrich, 2006].
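The effect of class skew on these metrics can be illustrated with a small synthetic example. The sketch below is ours and uses made-up data (5 abusive and 95 non-abusive samples); the function names are illustrative, and AUPRC is estimated with average precision, which is one common step-wise approximation rather than the only possible definition.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one class; undefined ratios are set to 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, l)[2] for l in labels) / len(labels)


def micro_f1(y_true, y_pred):
    """For single-label classification, micro F1 equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Synthetic data: 1 = abusive, 0 = non-abusive. The scorer ranks
# 10 non-abusive samples above all 5 abusive ones.
y_true = [0] * 10 + [1] * 5 + [0] * 85
scores = [1.0 - i / 100 for i in range(100)]
majority = [0] * 100  # baseline that never predicts abuse

print(f"micro F1 (majority baseline): {micro_f1(y_true, majority):.3f}")
print(f"macro F1 (majority baseline): {macro_f1(y_true, majority):.3f}")
print(f"AUROC: {auroc(y_true, scores):.3f}")
print(f"average precision: {average_precision(y_true, scores):.3f}")
```

On this data, the baseline that never predicts abuse reaches a micro F1 of 0.95 but a macro F1 of about 0.49, since its F1 on the abusive class is 0. Likewise, the scorer obtains an AUROC of about 0.89 while its average precision is only about 0.22, because the ten top-ranked samples are all non-abusive.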

Appendix C Embeddings and OOV words

Djuric et al. [Djuric et al., 2015] and Nobata et al. [Nobata et al., 2016] observed that abusive language tends to contain obfuscations (e.g., w0m3n). This poses a particular problem for abuse detection methods [Blackwell et al., 2017], and specifically for those that rely on word-level neural networks, as they operate with a finite vocabulary of words and map all unknown words in the test set to a single out-of-vocabulary (OOV) embedding. This has the undesired effect that deliberately obfuscated words and benign misspellings get conflated, leading to a loss in performance [Mishra et al., 2018a, Qian et al., 2018]. Spelling correction and edit-distance techniques for resolving obfuscations can provide some level of mitigation; however, they do not help in cases where obfuscation is severe, e.g., a55h0le, or is by concatenation, e.g., idiotb*tch. While one way around the problem is to have character-level models, Mishra et al. [Mishra et al., 2018b] showed that such models perform worse than word-level ones with pre-trained embeddings. Hence, techniques to generate embeddings for OOV words on the fly [Bojanowski et al., 2017, Mishra et al., 2018b] have been exploited.
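The conflation problem, and one way of composing OOV embeddings on the fly, can be sketched as follows. The sketch loosely follows the subword idea of representing a word through its character n-grams; it is not the implementation of any cited system. The toy vocabulary, the dimensionality, the bucket count and the function names are assumptions, the n-gram vectors are random placeholders rather than trained parameters, and we average the n-gram vectors for simplicity.

```python
import random
import zlib

# Toy word-level vocabulary (assumption); index 0 is the shared OOV entry.
VOCAB = {"<oov>": 0, "women": 1, "the": 2, "idiot": 3}
DIM, BUCKETS = 8, 10_000  # illustrative sizes


def word_id(token: str) -> int:
    """Word-level lookup: every unknown token collapses to the OOV index."""
    return VOCAB.get(token, VOCAB["<oov>"])


def char_ngrams(token: str, n_min: int = 3, n_max: int = 5) -> set[str]:
    """Character n-grams of the token padded with boundary markers."""
    padded = f"<{token}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_vector(ngram: str) -> list[float]:
    """Placeholder vector for an n-gram, selected by a deterministic hash.

    In a trained model these vectors would be learned; here they are
    pseudo-random values seeded by the hash bucket.
    """
    bucket = zlib.crc32(ngram.encode("utf-8")) % BUCKETS
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def subword_embedding(token: str) -> list[float]:
    """Compose an embedding for any token by averaging its n-gram vectors."""
    vectors = [ngram_vector(g) for g in sorted(char_ngrams(token))]
    if not vectors:  # e.g., an empty token
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# An obfuscated word and a benign misspelling share the same word-level index.
print(word_id("w0m3n") == word_id("teh"))
# With subword composition, the two tokens receive different vectors.
print(subword_embedding("w0m3n") != subword_embedding("teh"))
# Obfuscation by concatenation still shares n-grams with the known word.
print(len(char_ngrams("idiotb*tch") & char_ngrams("idiot")))
# Character substitution can remove all shared n-grams in this range.
print(len(char_ngrams("w0m3n") & char_ngrams("women")))
```

Under these settings, idiotb*tch shares several n-grams with idiot, whereas w0m3n shares none with women. Subword composition therefore separates OOV tokens from one another, but on its own it does not necessarily relate a heavily substituted form back to its source word.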