RADAR: Robust AI-Text Detection via Adversarial Learning

Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

Introduction

Large language models (LLMs) are high-capacity neural networks that are pretrained on web-scale datasets. They are foundation models that achieve state-of-the-art performance in a wide range of natural language processing tasks (e.g., document completion, question answering, machine translation, and content creation with text prompts), with advanced capabilities such as in-context learning and reasoning (e.g., chain of thought). In particular, LLMs are the backbone of many ChatGPT-like conversational bots that enable text generation with high fluency and accuracy. However, while LLMs and their derived applications are expected to become ubiquitous in our future technology and society, new risks arising from the failure to distinguish the so-called “AI-text” generated by LLMs from human-written text have emerged and gained considerable attention. The problem of reliable AI-text detection is motivated by realistic socio-technological challenges such as fake content generation, AI plagiarism (e.g., using LLMs for writing tests), and false accusations of innocent writers. According to a report released by OpenAI (https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text), their latest AI-text detector is admittedly not fully reliable. In the reported evaluation of some challenging cases for English texts, their classifier correctly identifies only 26% of AI-text (true positives) while incorrectly classifying 9% of human-written text as AI-written (false positives). Moreover, a recent study found that state-of-the-art AI-text detectors demonstrated severely degraded performance when encountering texts written by non-native English speakers.

What can be even more challenging in AI-text detection is that existing AI-text detectors are prone to manipulation. Recent work showed that using an LLM as a paraphraser can easily evade several AI-text detection methods, even when the original AI-text had been watermarked. These findings sparked a heated debate about whether and how we can successfully design a reliable AI-text detector. While one line of work theoretically quantifies the best detector’s performance with respect to the total variation distance between the AI-text and human-text distributions and argues that AI-text is difficult to detect, another work proves that it is possible to obtain a reliable AI-text detector unless the human-text distribution is exactly the same as the AI-text distribution, based on an information-theoretical analysis (i.e., the sample complexity of Chernoff information and likelihood-ratio-based detectors).

To improve AI-text detection, we propose RADAR, a framework for training a robust AI-text detector using adversarial learning. An overview of RADAR is illustrated in Figure 1. Our proposal draws inspiration from adversarial machine learning techniques that train a high-quality generator by introducing a discriminator to form a two-player game, such as generative adversarial networks (GANs). In RADAR, we introduce a paraphraser and a detector as two players with opposite objectives. The paraphraser’s goal is to generate realistic content that can evade AI-text detection, while the detector’s goal is to enhance AI-text detectability. In our framework, both the paraphraser and the detector are parametrized by separate LLMs. During training, the paraphraser learns to rewrite the text from a training corpus (generated by a target LLM from a human-text corpus) with the aim of decreasing the likelihood of AI-text prediction by the detector, whereas the detector aims to enhance its detection performance by learning to distinguish human-text from AI-text using the training data and the paraphraser’s output. The two players iteratively update their model parameters until their respective validation losses become stable. Specifically, the paraphraser treats the prediction of the detector as a reward and uses Proximal Policy Optimization (PPO) for updates. The detector updates its parameters based on a logistic loss function evaluated on the human-text and AI-text corpora (including the texts generated by the paraphraser). In the evaluation phase, the trained detector is deployed to predict the likelihood of AI-written content for any input instance. When compared with 6 existing detectors, our experimental results on 8 different LLMs and 4 datasets show that RADAR attains similar detection performance on the original AI-generated texts (a relatively easy task) and simultaneously improves AI-text detectability when facing an “unseen” paraphraser (i.e., a paraphraser that is not used in RADAR). The result is summarized in Figure 2. When facing an unseen paraphraser (GPT-3.5-Turbo), the area under the receiver operating characteristic (AUROC) score of RADAR is improved by 31.64% compared to the best existing detector, suggesting a significant improvement and reliable AI-text detection power enabled by RADAR.

We summarize our main contributions as follows:

To the best of our knowledge, RADAR is the first study that leverages the idea of adversarial learning between a paraphraser and a detector for training a robust AI-text detector.

The experiments on 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) and 4 datasets show that unlike the six existing supervised and unsupervised AI-text detection methods, RADAR is the only robust detector that attains consistently high detection performance. RADAR’s detector is not weakened by paraphrasing, as shown in Figure 2.

We also find that RADAR’s detection capability transfers strongly across models. The detectors of RADAR obtained from instruction-tuned first-class LLMs (e.g., Vicuna-7B) are also effective on other LLMs, suggesting the possibility of training a universal AI-text detector based on state-of-the-art LLMs.

Related Work

AI-Text Detection. The research in AI-text detection can be divided into three approaches. (i) Statistical methods: some statistics such as entropy, n-gram frequency, and perplexity are used as a threshold to discern AI-text. A typical example is GLTR, which exploits entropy, probability, and probability rank for detection. A more recent work is DetectGPT, which assumes that machine-generated text always lies in the negative curvature region of the log probability of the LLM of interest. Based on this hypothesis, DetectGPT perturbs the input text with a mask-filling language model, such as T5. Then, AI-text detection is performed by comparing the log probability of the text and its infilled variants. (ii) Classification methods: AI-text detection is formulated as a binary classification task, and a classifier is trained for a target language model. For example, OpenAI trains its AI-text classifier with a RoBERTa-based model.

The developers collected samples from the WebText dataset (https://huggingface.co/datasets/openwebtext) and labeled them as human-generated. Then, for each target GPT-2 model, they collected the generated samples and labeled them as machine-generated. Finally, they fine-tuned the pretrained RoBERTa-based model for AI-text classification. More recently, with the appearance of ChatGPT, OpenAI tuned a GPT model called AI-Classifier using data from several sources. The human-written text comes from three sources: a new Wikipedia dataset, the WebText dataset collected in 2019, and a set of human demonstrations collected as part of training InstructGPT. To collect machine-generated text, for the Wikipedia and WebText datasets, they truncated the articles sampled from the original corpus and used 34 models to generate article completions, pairing each generated text with the original article. For the demonstrations, they used a model to generate responses for each prompt and paired them with the corresponding human demonstrations. This detector was accessible only via a web interface after its release in January 2023, and it has been taken down since July 2023. (iii) Watermark methods: post-hoc watermarking techniques, such as rule-based methods and deep-learning-based methods, can be applied to an LLM. At inference time, a soft watermarking scheme has been proposed that embeds a watermark in each word of the generated sentence by dividing the vocabulary into different lists and sampling the next token in a differentiated manner. However, many existing AI-text detectors have been shown to be significantly weakened by paraphrasing.

Adversarial Learning for Natural Language Generation. The success of GANs in the computer vision domain has motivated many studies in natural language generation. However, since text generation is a sequential sampling process that occurs in a discrete vocabulary space, it is difficult to directly train a text generator using back-propagation in an end-to-end manner. There are two common approaches to tackle this problem. The first one is to replace the discrete sampling operation with continuous approximation techniques, such as Gumbel-Softmax. The second one is to view text generation as a decision-making process and cast the generator as a policy. A typical example is SeqGAN. During generation, SeqGAN considers the generated tokens as the state and the next token to be generated as the action, and it adopts Monte Carlo search to collect reward signals from the discriminator. Instead of using a classifier as the discriminator, the Diversity-Promoting GAN uses a unidirectional LSTM as the discriminator and combines both word-level and sentence-level rewards into training. TextGAIL proposed an imitation learning paradigm in which the rewards of the human-written text are regarded as a constant value. Then, both the rewards from human-text and AI-text are used to optimize the generator with PPO. These works all used warm-up training for the generator with maximum likelihood estimation (MLE) on the probability of the generated text sequence. On the other hand, other work trained a language GAN from scratch. Our proposed RADAR differs from these works in that we focus on training a robust AI-text detector with a tunable paraphraser. Another line of work uses paraphrasing techniques to find adversarial examples for natural language processing tasks and to train a robust language model via adversarial training. Their focus is on the correctness of natural language understanding, which is beyond our scope of AI-text detection.

RADAR: Methodology and Algorithms

We start this section by giving an overview and mathematical notations of our proposed RADAR framework in Figure 1. Then, in Sections 3.1 and 3.2, we provide the details on the design and training of the paraphraser and detector, respectively. Finally, we summarize the entire training process into an algorithmic procedure in Section 3.3.

High-Level Methodology. Our RADAR framework consists of three neural-network-based language models (LMs): the target LM $\mathcal{T}_{\theta}$ , the detector $\mathcal{D}_{\phi}$ and the paraphraser $\mathcal{G}_{\sigma}$ , parameterized with $\theta$ , $\phi$ and $\sigma$ , respectively. We note that $\mathcal{T}_{\theta}$ is frozen (no updates on $\theta$ ) in the entire process. We summarize RADAR into four key steps:

Step 1 (Data preparation): Before training, we build $\mathcal{M}$ , the corpus of AI-text, by applying document completion based on the prefix span of text in the human-text corpus $\mathcal{H}$ using $\mathcal{T}_{\theta}$ .

Step 2 (Paraphraser update): We collect AI-text samples $x_{m}$ from $\mathcal{M}$ and use $\mathcal{G}_{\sigma}$ to paraphrase $x_{m}$ , generating paraphrased AI-text $x_{p}$ that forms a corpus $\mathcal{P}$ . Then, we use the reward of $x_{p}$ returned by the detector $\mathcal{D}_{\phi}$ to update the paraphraser $\mathcal{G}_{\sigma}$ using PPO.

Step 3 (Detector update): We use the human-text samples $x_{h}$ from $\mathcal{H}$ , the original AI-text samples $x_{m}$ from $\mathcal{M}$ , and the paraphrased AI-text samples $x_{p}$ from $\mathcal{P}$ in Step 2 to update the detector $\mathcal{D}_{\phi}$ with a logistic loss function.

Step 4 (Performance Validation and Evaluation): During training, we use the test set of WebText as the validation dataset to estimate RADAR’s performance. For evaluation, we use $\mathcal{T}_{\theta}$ to generate AI-text for the evaluation dataset and to calculate RADAR’s detection AUROC.

Steps 2 and 3 are repeated until there is no improvement in the AUROC evaluated on the validation dataset. The rivalry inherent in adversarial learning and the competition it introduces help the detector learn to be robust in detecting both original and paraphrased AI-text.

In RADAR, the goal of the paraphraser $\mathcal{G}_{\sigma}$ is to paraphrase the input machine-generated text $x_{m}$ . We model the generation of paraphrased text as a decision-making process, taking $x_{m}$ as the state and the output text $x_{p}$ as the action. In particular, we optimize $\mathcal{G}_{\sigma}$ using the reward feedback from the detector $\mathcal{D}_{\phi}$ with PPO. The output $\mathcal{D}_{\phi}(x_{p})$ is the predicted likelihood of $x_{p}$ being human-text. The reward of $x_{p}$ and the log probability of the text $x_{p}$ are defined in Eq. 1:

$$R(x_{p},\phi)=\mathcal{D}_{\phi}(x_{p})\in[0,1];\qquad \log P_{\mathcal{G}_{\sigma}}(x_{p}\mid x_{m})=\sum_{i=1}^{N}\log P_{\mathcal{G}_{\sigma}}(x_{p}^{i}\mid x_{m},x_{p}^{1:i-1})\tag{1}$$

where $x_{p}^{i}$ means the $i$ -th token in the sentence $x_{p}$ of length $N$ and $x_{p}^{1:i-1}$ represents the first $i-1$ tokens in $x_{p}$ ( $x_{p}^{1:0}$ means the default starting token).

We propose Clipped PPO with Entropy Penalty (cppo-ep) in RADAR to optimize $\mathcal{G}_{\sigma}$ . Let $\text{clip}(\cdot,a,b)$ denote a value-clipping operation with a lower limit $a$ and an upper limit $b$ , $r(\sigma,x_{m},x_{p})$ be the importance sampling ratio between a new policy $\mathcal{G}_{\sigma}$ and an old policy $\mathcal{G}_{\sigma^{\prime}}$ , and $(x_{m},x_{p})\sim P_{\mathcal{G}_{\sigma^{\prime}}}$ be a state-action pair sampled from $\mathcal{G}_{\sigma^{\prime}}$ . The loss of cppo-ep is defined as:
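As an illustration only, a generic clipped PPO objective with an entropy term can be sketched in plain Python. This is the textbook clipped surrogate, not necessarily the exact cppo-ep form: the clipping width `eps`, the use of per-sample advantages, and the per-sample entropy inputs are assumptions introduced here, while `gamma=0.01` matches the coefficient value reported in the implementation details.

```python
import math

def importance_ratio(logp_new, logp_old):
    """Ratio P_new(x_p | x_m) / P_old(x_p | x_m), computed from sequence log-probabilities."""
    return math.exp(logp_new - logp_old)

def clipped_ppo_entropy_loss(logp_new, logp_old, advantages, entropies,
                             eps=0.2, gamma=0.01):
    """Loss to minimize: mean clipped surrogate (negated) minus gamma * mean entropy.

    eps and the advantage/entropy inputs are illustrative assumptions.
    """
    surrogate = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        r = importance_ratio(ln, lo)
        r_clipped = max(1.0 - eps, min(r, 1.0 + eps))  # clip(r, 1 - eps, 1 + eps)
        surrogate += -min(r * adv, r_clipped * adv)
    surrogate /= len(advantages)
    mean_entropy = sum(entropies) / len(entropies)
    return surrogate - gamma * mean_entropy

# One sample whose probability rose from 0.2 to 0.3 under the new policy.
loss_pos = clipped_ppo_entropy_loss([math.log(0.3)], [math.log(0.2)], [1.0], [0.0])
loss_neg = clipped_ppo_entropy_loss([math.log(0.3)], [math.log(0.2)], [-1.0], [0.0])
```

With a positive advantage the ratio 1.5 is clipped to 1.2, so the update cannot push the policy arbitrarily far on a single rewarded sample; with a negative advantage the unclipped term is kept.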

3.2 Training Detector via Reweighted Logistic Loss

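In a typical GAN training process, the discriminator receives an equal amount of positive and negative samples in each step, assuring an in-batch sample balance. However, in RADAR, by construction, the number of AI-text samples is twice the number of human-text samples, because each $x_{h}$ from the human-text corpus $\mathcal{H}$ is paired with a sample $x_{m}$ from the original AI-text corpus $\mathcal{M}$ as well as a paraphrased sample $x_{p}$ generated by the paraphraser $\mathcal{G}_{\sigma}$ . To handle this in-batch imbalance problem, we use a reweighted logistic loss function to optimize the detector $\mathcal{D}_{\phi}$ , as described in Eq. 3:

$$L_{\mathcal{D}}(\phi)=L_{\mathcal{H}}+\lambda L_{\mathcal{M}}=\underbrace{-\mathbb{E}_{x_{h}\sim\mathcal{H}}\log\mathcal{D}_{\phi}(x_{h})}_{L_{\mathcal{H}}}+\lambda\underbrace{\mathbb{E}_{x_{m}\sim\mathcal{M}}-\log\left(1-\mathcal{D}_{\phi}(x_{m})\right)}_{L_{\mathcal{M}}^{1}}+\lambda\underbrace{\mathbb{E}_{x_{m}\sim\mathcal{M}}-\log\left(1-\mathcal{D}_{\phi}(\mathcal{G}_{\sigma}(x_{m}))\right)}_{L_{\mathcal{M}}^{2}}\tag{3}$$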

Recall that $\mathcal{D}_{\phi}(x)\in[0,1]$ is the predicted probability of an input instance $x$ being human-text. $L_{\mathcal{H}}$ is the loss to improve the correctness of predicting $x_{h}\sim\mathcal{H}$ as human-written. $L_{\mathcal{M}}=L_{\mathcal{M}}^{1}+L_{\mathcal{M}}^{2}$ , where $L_{\mathcal{M}}^{1}$ and $L_{\mathcal{M}}^{2}$ are used to prevent $x_{m}$ and $x_{p}$ , respectively, from being predicted as human-text. $\lambda$ is a coefficient ranging from 0 to 1. We introduce $\lambda$ to adjust the proportion of AI-text components in the overall loss function to alleviate the effects of sample imbalance.
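The reweighting can be sketched in a few lines of Python, assuming each of the three terms is a standard binary cross-entropy averaged over its own samples. The function and argument names are hypothetical; the inputs are the detector's predicted probabilities of human-text for the three sample types.

```python
import math

def reweighted_logistic_loss(p_human, p_ai, p_para, lam=0.5):
    """L_H + lam * (L_M^1 + L_M^2), where each p is D(x) = P(human-text | x).

    p_human: detector outputs on human-text samples x_h
    p_ai:    detector outputs on original AI-text samples x_m
    p_para:  detector outputs on paraphrased AI-text samples x_p
    """
    l_h = -sum(math.log(p) for p in p_human) / len(p_human)
    l_m1 = -sum(math.log(1.0 - p) for p in p_ai) / len(p_ai)
    l_m2 = -sum(math.log(1.0 - p) for p in p_para) / len(p_para)
    return l_h + lam * (l_m1 + l_m2)

# An undecided detector (0.5 everywhere): with lam = 0.5 the two AI-text terms
# together weigh exactly as much as the single human-text term.
loss = reweighted_logistic_loss([0.5], [0.5], [0.5], lam=0.5)
```

With $\lambda=0.5$ , the two AI-text terms contribute the same total weight as the human-text term, which offsets the 2:1 sample ratio.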

3.3 RADAR Algorithm

The entire training procedure of RADAR is summarized in Algorithm 1. For a given target LLM, RADAR returns a trained paraphraser and a trained detector through the designed training steps. In the evaluation phase, the detector is used to predict the likelihood of AI-text for any input instance.
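The alternating structure of the procedure can be sketched as a skeleton loop. This is a minimal sketch rather than Algorithm 1 itself: every callable passed in (`paraphrase`, `detector_prob_human`, `update_paraphraser`, `update_detector`, `validation_auroc`) is a hypothetical stand-in for the corresponding model operation, and the stopping rule (stop when validation AUROC no longer improves) follows the description in the high-level methodology.

```python
def radar_training_loop(batches, paraphrase, detector_prob_human,
                        update_paraphraser, update_detector,
                        validation_auroc, max_rounds=100):
    """Alternate paraphraser and detector updates until validation AUROC stops improving."""
    best_auroc = float("-inf")
    rounds = 0
    for _ in range(max_rounds):
        for x_h, x_m in batches:
            # Step 2: paraphrase AI-text and reward the paraphraser with the
            # detector's predicted probability of human-text.
            x_p = [paraphrase(x) for x in x_m]
            rewards = [detector_prob_human(x) for x in x_p]
            update_paraphraser(x_m, x_p, rewards)
            # Step 3: update the detector on human, AI, and paraphrased text.
            update_detector(x_h, x_m, x_p)
        rounds += 1
        auroc = validation_auroc()
        if auroc <= best_auroc:
            break
        best_auroc = auroc
    return rounds, best_auroc

# Toy stand-ins so the skeleton can be executed end to end.
val_scores = iter([0.70, 0.80, 0.80])
calls = {"paraphraser": 0, "detector": 0}

def _count(key):
    def update(*args):
        calls[key] += 1
    return update

rounds, best = radar_training_loop(
    batches=[(["human a"], ["ai a"])],
    paraphrase=str.upper,
    detector_prob_human=lambda text: 0.5,
    update_paraphraser=_count("paraphraser"),
    update_detector=_count("detector"),
    validation_auroc=lambda: next(val_scores),
)
```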

Experiments

Datasets and Metrics. For training, we sampled 160K documents from WebText to build the human-text corpus $\mathcal{H}$ . Then, we build the original AI-text corpus $\mathcal{M}$ from $\mathcal{H}$ using a target language model $\mathcal{T}_{\theta}$ , which performs text completion using the first 30 tokens as the prompt and limits the sentence length to 200 tokens. For evaluation, we select four human-text datasets covering different domains. Following prior work, we use Xsum, SQuAD, and Reddit WritingPrompts (WP) to test a detector’s ability to detect fake news, avoid academic fraud, and identify machine-generated creative writing, respectively. In addition, we also use the non-native-authored TOEFL dataset (TOFEL) to evaluate a detector’s bias when encountering non-native-authored English text. Please see Appendix A for more details about the evaluation datasets. Following existing works, we report the area under the receiver operating characteristic curve (AUROC) as the performance measure (higher is better). The curve is obtained by varying the detector’s threshold and captures the relationship between the true positive rate and the false positive rate.
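For readers who want to see what the metric computes, AUROC equals the probability that a randomly chosen positive sample receives a higher detector score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition; the scores are made-up numbers, and AI-text is treated as the positive class by assumption.

```python
def auroc(scores_pos, scores_neg):
    """AUROC via pairwise comparison (equivalent to the Mann-Whitney U statistic)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical "probability of AI-text" scores.
ai_scores = [0.9, 0.8, 0.4]
human_scores = [0.7, 0.3, 0.2]
value = auroc(ai_scores, human_scores)  # 8 of the 9 pairs are ordered correctly
```

The quadratic loop is fine for illustration; a sort-based implementation would be preferable for large evaluation sets.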

Comparisons. We compare RADAR with various detection methods. These methods include the OpenAI (RoBERTa) model, which is fine-tuned on WebText and GPT-2 generations, as well as statistical approaches including log probability, rank, log rank, entropy, and DetectGPT. Specifically, we implemented DetectGPT using the trained T5-large model as the mask-filling model and performed 10 perturbations for each sentence to be detected.

Large Language Models. For the target LLM $\mathcal{T}_{\theta}$ , we select 4 pairs of LLMs and summarize them in Table 1. Each pair contains an open-source LLM and its fine-tuned version via instruction-tuning.

Paraphrase Configurations. We consider two settings: without (w/o) paraphrasing and with paraphrasing. To prepare the machine-generated text for evaluation, for the w/o paraphrasing setting, we use the original AI-text corpus $\mathcal{M}$ generated by a target LLM based on an evaluation dataset. For the with paraphrasing setting, we define two types of paraphrasing: seen paraphraser and unseen paraphraser. The seen paraphraser refers to the paraphraser $\mathcal{G}_{\sigma}$ returned by RADAR. The unseen paraphraser means a new paraphraser that has not participated in training the detector of RADAR. We used the OpenAI API service of GPT-3.5-Turbo as the default unseen paraphraser. The prompt we used for paraphrasing is “Enhance word choices to make the sentence sound more like a human”, inspired by prior work.

Implementation Details. We provide the detailed setups when implementing Algorithm 1. We build a PPO buffer $\mathcal{B}$ that can temporarily store 256 pairs of data for subsequent training. We use the pre-trained T5-large and RoBERTa-large models as the initialization of $\mathcal{G}_{\sigma}$ and $\mathcal{D}_{\phi}$ , respectively. During training, we set the batch size to 32 and train the models until the validation loss converges. We use AdamW as the optimizer with the initial learning rate set to 1e-5 and use linear decay for both $\mathcal{G}_{\sigma}$ and $\mathcal{D}_{\phi}$ . We set $\lambda=0.5$ for sample balancing in Eq. 3 and set $\gamma=0.01$ in Eq. 2. We follow the same construction principle of the training dataset to create the 4 evaluation datasets based on Xsum, SQuAD, WP, and TOFEL. Experiments were run on 2 GPUs (NVIDIA Tesla V100 32GB).

4.2 Performance Evaluation and Comparison with Existing Methods

We run three groups of experiments (w/o paraphraser, seen paraphraser, and unseen paraphraser) and report the overall results of RADAR and the compared methods on all 4 datasets in Table 2. The reported AUROC scores are averaged over the 8 considered LLMs. In the relatively easy case of without paraphrasing, most detectors attain good AUROC scores. RADAR attains a comparable performance (0.856) to the best existing detector (log rank, 0.904). The slightly worse performance of RADAR can be explained by the tradeoff in enhancing AI-text detection against paraphrasing.

When facing paraphrasing, all existing methods except entropy show significant performance degradation. The drop in AUROC compared to the w/o paraphrasing case ranges from $10.4\%$ to $81.7\%$ . While entropy is shown to be more robust to paraphrasing, its AUROC score can be quite low. In contrast, RADAR demonstrates robust and superior detection power, attaining the best performance on every dataset. As shown in Figure 2, against the unseen paraphraser the average AUROC score of RADAR (0.857) exceeds that of the best existing method (entropy, 0.651) by a relative margin of 31.64%. On average, RADAR is more robust to the seen paraphraser than the unseen paraphraser, because the seen paraphraser is what is used to train the detector in RADAR. More importantly, the detection performance of RADAR is stable across different paraphrasing schemes, suggesting that RADAR can successfully mitigate the performance drop in AI-text detection.
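The 31.64% figure is a relative gain, which can be checked directly from the two average AUROC scores quoted above. The helper name below is hypothetical.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# RADAR (0.857) vs. the best existing method, entropy (0.651), unseen paraphraser.
gain = relative_gain_percent(0.857, 0.651)
```

In absolute terms the gap is 0.206 AUROC points; dividing by the baseline of 0.651 gives the relative figure.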

4.3 AI-Text Detection Transferability of RADAR

We explore the AI-text detection transferability of RADAR between the 8 LLMs and report the ratio F(A,B)=AUROC(A,B)/AUROC(B,B) for each LLM pair (A,B), where AUROC(A,B) means using the RADAR’s detector trained on model A to evaluate the AI-text generated by model B. A larger ratio means better transferability from A to B. Figure 3 shows the matrix of pairwise detection transferability and the bar chart of the holistic detection transferability to all the 8 LLMs in the without and unseen paraphrasing settings. We highlight two key observations as follows.
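The ratio F(A,B) can be computed from a table of cross-model AUROC scores as sketched below. The model names and AUROC values are placeholders chosen for illustration, not results from the paper.

```python
def transfer_ratios(auroc):
    """auroc[a][b]: AUROC of the detector trained on model a, evaluated on text from model b.

    Returns F with F[a][b] = auroc[a][b] / auroc[b][b].
    """
    return {a: {b: auroc[a][b] / auroc[b][b] for b in auroc[a]} for a in auroc}

# Placeholder scores for two hypothetical models.
auroc_table = {
    "model_A": {"model_A": 0.90, "model_B": 0.72},
    "model_B": {"model_A": 0.45, "model_B": 0.80},
}
F = transfer_ratios(auroc_table)
```

By construction F(B,B) is 1, so each off-diagonal entry reads as the fraction of the in-domain AUROC that a transferred detector retains.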

(I) Instruction-tuned models have better detection transferability. Partitioning the LLMs into two groups, we find that the detector targeting an instruction-tuned LLM (top 4 rows) generally transfers better than the detector targeting the corresponding LLM without instruction-tuning (bottom 4 rows). Taking the pair (Vicuna-7B, LLaMA-7B) as an example, we can see that without paraphrasing, F(Vicuna-7B, LLaMA-7B) reaches $95.0\%$ , whereas F(LLaMA-7B, Vicuna-7B) is only $68.2\%$ . Sorting the detectors according to the holistic detection transferability (which is presented in the bar chart), we can see that the top-3 detectors are all trained with instruction-tuned LLMs. A similar conclusion can be drawn for the with paraphrasing setting. Moreover, there is no obvious trend between the target LLM size and the resulting detection performance. The effect of instruction tuning on transferability is more prominent than that of model size.

(II) RADAR achieves better detection transferability against paraphrasing. Another interesting finding is that RADAR’s transferability is generally improved when paraphrasing is in place. Comparing the two bar charts in Fig. 3(a) and Fig. 3(b), the average holistic detection transferability (over all LLMs) is increased by $11.6\%$ . Except for LLaMA-7B (3.8% drop) and GPT-J-6B (1.4% drop), the holistic transferability scores of all other LLMs improve, with gains ranging from 2.4% (Palmyra-base) to 47.6% (Camel-5B).

Transfer detection on AI-text generated by GPT-4. We also test RADAR detectors on the texts generated by GPT-4. The results show that 5 out of 8 RADAR models outperform OpenAI (RoBERTa), and three of them achieve more than 0.8 detection AUROC. For example, RADAR trained on Camel-5B achieves 0.915 detection AUROC on GPT-4 generations. These results indicate that RADAR can achieve good transfer detection for GPT-4. The details are given in Appendix K.

Ensemble detection. We also explored whether and how ensemble learning benefits detection by combining the outputs of detectors. The results show that the detection performance can be lifted by carefully tuning the ensemble ratio and the model to be combined. Please see Appendix G for the exact experiment results.

To sum up, we believe our findings suggest promising results for training a universal robust AI-text detector by leveraging state-of-the-art LLMs, and RADAR can use a smaller and weaker LLM to achieve good detection performance on texts generated by top-notch LLMs (such as GPT-4).

4.4 Variants of Paraphrasing

In addition to paraphrasing the original LLM-generated texts, we also evaluate the detection performance when paraphrasing human texts (the output is labeled as AI-text). We also allow paraphrasing multiple times in our analysis. We conduct our experiments on the Xsum dataset using the detector trained with Camel-5B. The paraphraser for evaluation is GPT-3.5-Turbo. As shown in Figure 4(a), we find that RADAR is the only detector robust to multi-round paraphrasing. On paraphrased AI-text, all existing methods suffer from a notable performance drop. On paraphrased human-text, RADAR remains effective, along with two existing methods (OpenAI (RoBERTa) and entropy). In general, multi-round paraphrasing does not seem to increase the difficulty of AI-text detection. We also find that RADAR is robust to Dipper, another paraphrase model. Please see Appendix I for details.

4.5 Evaluation on RADAR’s Paraphraser

Although our focus is on training a robust AI-text detector via RADAR, as a by-product, we expect to obtain a better paraphraser through adversarial learning. To verify this hypothesis, we compare the quality of the initial paraphraser (a pretrained LLM) and the final paraphraser returned by RADAR using GPT-3.5-Turbo’s response. We select 100 documents from WebText and use 4 different paraphrasers from RADAR to paraphrase the documents. Then, we ask GPT-3.5-Turbo to rate sentences generated by these paraphrasers versus their initial version (T5-large). Figure 5(a) shows that RADAR also improves the quality of paraphrasing. Figure 5(b) shows that RADAR’s paraphraser can score higher if it is trained with a larger target LLM with instruction tuning. Following prior work, we also evaluate RADAR’s paraphrasers on Quora Question Pairs (QQP, https://www.kaggle.com/c/quora-question-pairs/data) and use iBLEU ( $\alpha=0.8$ ) as the metric (higher is better). Figure 5(c) shows that the paraphrasing performance can be improved via RADAR, as all the RADAR paraphrasers achieve a larger iBLEU score than T5-large.
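For context, iBLEU is commonly defined as a weighted difference between the BLEU of the paraphrase against a reference (rewarding adequacy) and the BLEU of the paraphrase against its own source (penalizing copying). The sketch below assumes that common definition and takes precomputed BLEU scores as inputs; the numbers are illustrative.

```python
def ibleu(bleu_vs_reference, bleu_vs_source, alpha=0.8):
    """iBLEU = alpha * BLEU(candidate, reference) - (1 - alpha) * BLEU(candidate, source)."""
    return alpha * bleu_vs_reference - (1.0 - alpha) * bleu_vs_source

# A paraphrase that matches the reference moderately and copies little from the source.
score = ibleu(bleu_vs_reference=0.5, bleu_vs_source=0.2, alpha=0.8)
```

A paraphraser that simply copies its input gets a high source BLEU and is penalized accordingly, so a larger iBLEU indicates outputs that stay faithful while actually changing the wording.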

4.6 Balancing the Detection Performance in the with and without Paraphrasing Settings

From Figure 2, we can observe that though RADAR can achieve robust detection under paraphrasing, it is (slightly) worse than some of the existing baselines when AI-text data are unperturbed (i.e., w/o paraphrasing). We run a trade-off analysis on the weight coefficient $\lambda$ in Equation (3) to study whether RADAR can be further tuned to achieve competitive performance on unperturbed data while still being robust to paraphrasing. We use Vicuna-7B as the target model to train 10 RADAR detectors by varying $\lambda$ from 0.1 to 1.0 in increments of 0.1, and then evaluate these detectors as well as other detection baselines on the evaluation datasets. The results in Appendix J show that we can improve RADAR’s performance on unperturbed data while still preserving high detection AUROC on paraphrased data. Take $\lambda=0.6$ as an example. When we change $\lambda$ from $0.5$ (the default value of $\lambda$ ) to $0.6$ , the AUROC of w/o paraphrasing increases from $0.906$ to $0.937$ , while the AUROC of unseen-paraphrasing also increases from $0.892$ to $0.920$ . The result suggests that the detection performance of RADAR in the with and without paraphrasing settings can be simultaneously improved or better balanced with careful tuning of the hyperparameter $\lambda$ during training.

Conclusion

In this paper, we presented a robust AI-text detector training framework called RADAR, which adopts adversarial learning to jointly train a detector and a paraphraser. RADAR addresses the shortcoming of existing detectors when facing LLM-paraphrased texts. Our extensive experiments on 8 LLMs and 4 datasets validated the effectiveness of RADAR and demonstrated its strong transferability across LLMs. We believe our results shed new light on improving AI-text detection.

Limitations and Ethical Considerations

While RADAR is more robust to paraphrasing than existing baselines measured on 4 datasets, sometimes it may show degraded detection performance against native LLM-generated texts (without paraphrasing) when compared to the best existing detection method. Moreover, like every existing AI-text detector, we acknowledge that our detector is not perfect and will likely give incorrect predictions in some cases. In terms of ethical considerations, we suggest users use our tool to assist with identifying AI-written content at scale and with discretion. If the detection result is to be used as evidence, further validation steps are necessary as RADAR cannot always make correct predictions.

Acknowledgement

The authors thank James Sanders and Jonathon Hartley for providing examples that can weaken the detection of RADAR models.

References

Appendix

Appendix A Human-text Corpora

We summarize the human-text corpora we used in RADAR’s training, validation, and evaluation phases in Table A1. It shows the usage of these corpora, the source where they come from, and the number of samples we select from them for evaluation.

Appendix B Details of Existing Detectors

Every detector assigns a score to the given text and determines whether the text is generated by AI based on the score. We introduce the scores used in existing detectors in the following.

Unsupervised Methods. In this paper, we leverage log-p, rank, log-rank, and entropy as the baselines. They are all unsupervised methods that depend on statistical metrics of the given text to determine whether it is AI-text. Specifically, they input the given text to the target language model $\mathcal{T}_{\theta}$ and extract statistics from $\mathcal{T}_{\theta}$ ’s output. We use $M_{\text{log-p}}$ , $M_{\text{rank}}$ , $M_{\text{log-rank}}$ , and $M_{\text{entropy}}$ to represent their respective scores. These scores are calculated as below:

where $N$ is the length of the input sentence $x$ , $C$ is the size of the vocabulary, $x_{i}$ means the $i$ -th token in $x$ and $x^{1:i-1}$ represents the first $i-1$ tokens in $x$ , sort( $a$ ) is a sorting operation which inputs a list $a$ and returns a new list in descending order, and index( $a$ , $b$ ) is an indexing operation which inputs a list $a$ with an element $b$ and outputs the position of $b$ in $a$ .
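As a rough illustration of the four statistics, the sketch below computes length-averaged log probability, rank, log rank, and entropy from per-position next-token distributions. It is a sketch under assumptions: the distributions are passed in directly rather than produced by a real language model, ranks are 1-based in descending order of probability, and any sign flips used to turn these statistics into detection scores are left out.

```python
import math

def token_statistics(dists, tokens):
    """dists[i] is the model's distribution over the vocabulary at position i;
    tokens[i] is the index of the token actually observed at position i."""
    n = len(tokens)
    log_p = rank = log_rank = entropy = 0.0
    for dist, tok in zip(dists, tokens):
        p = dist[tok]
        r = 1 + sum(1 for q in dist if q > p)  # position in the descending sort
        log_p += math.log(p)
        rank += r
        log_rank += math.log(r)
        entropy += -sum(q * math.log(q) for q in dist if q > 0)
    return {"log_p": log_p / n, "rank": rank / n,
            "log_rank": log_rank / n, "entropy": entropy / n}

# Toy example: vocabulary of size 3, sentence of length 2.
stats = token_statistics(
    dists=[[0.5, 0.25, 0.25], [0.1, 0.7, 0.2]],
    tokens=[0, 2],
)
```

In the toy example the first token is the model's top choice (rank 1) and the second is its second choice (rank 2), giving an average rank of 1.5.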

DetectGPT is also an unsupervised detection method we compare with in this paper. It introduces another language model (denoted as $\mathcal{G}_{\sigma}$ ), which is used to do perturbations on the given text. DetectGPT uses the perturbation discrepancy as the score assigned to the text (denoted as $M_{\text{DetectGPT}}$ ), which is shown below:
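The perturbation discrepancy is usually described as the log probability of the original text minus the average log probability of its perturbed variants, all scored by the target model. The sketch below assumes that unnormalized form and takes the log probabilities as plain numbers; the values are made up.

```python
def perturbation_discrepancy(logp_original, logp_perturbed):
    """log p(x) minus the mean of log p(x_tilde) over the perturbed variants x_tilde."""
    return logp_original - sum(logp_perturbed) / len(logp_perturbed)

# Hypothetical log-probabilities: the original text and two perturbed variants.
score = perturbation_discrepancy(-10.0, [-12.0, -14.0])
```

A large positive value means the original text sits near a local maximum of the model's log probability, which is the signal DetectGPT associates with machine-generated text.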

Supervised Methods. RADAR, as well as OpenAI (RoBERTa), are both supervised detection methods. Let $D$ denote the OpenAI (RoBERTa) or RADAR’s detector. The score $M_{D}$ that they assigned to the input text $x$ is defined below:

where $f_{D}(x)$ means the logits of $D$ ’s [CLS] token over the whole label set. Applying Softmax to $f_{D}(x)$ yields two components: one is the prediction probability of AI-text, and the other is the prediction probability of human-text. Since there are only two labels (AI-text vs. human-text) in the label set, the detection is equivalent to a logistic regression task with a scalar output.
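The equivalence with logistic regression follows because a two-way softmax depends only on the difference of the two logits. A short numerical check, with illustrative logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 0.5]                      # two-label logits
p_first = softmax(logits)[0]             # probability of the first label
p_scalar = sigmoid(logits[0] - logits[1])  # same value from a single scalar
```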

Appendix C RADAR Loss Visualization

We visualize RADAR’s training process by presenting the training loss and validation performance below. We take the Camel-5B language model as an example.

From Figure 1(a) and Figure 1(b), we can see that the losses of both the detector and the paraphraser converge. From Figure 1(c), we can conclude that RADAR’s detection capacity on the validation dataset reaches a stable state as the training losses of the detector and the paraphraser converge.

Appendix D Paraphrase Settings

When using a RADAR-seen paraphraser, the input is “Paraphrase: [s]”, where [s] is the slot for the input AI-text. The paraphraser adopts top-k sampling and nucleus sampling strategies to decode a new word. Top-k sampling only considers the $k$ highest-probability tokens. Nucleus sampling sorts the sampling distribution in descending order and selects the top-n tokens from the sorted distribution until their cumulative probability exceeds $p$ , and then samples the next token from the top-n candidates. In our experiment, we set $k$ to 50 and $p$ to 0.95.
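A minimal sketch of combined top-k and nucleus filtering over a single next-token distribution follows. It assumes top-k is applied first and the nucleus cutoff is computed on the renormalized top-k mass; the function names are hypothetical, and the defaults mirror the values $k=50$ and $p=0.95$ given above.

```python
import random

def top_k_top_p_filter(probs, k=50, p=0.95):
    """Keep the k most likely tokens, then the smallest prefix whose
    renormalized cumulative probability reaches p. Returns {token_id: prob}."""
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    top_k_mass = sum(q for _, q in ranked)
    kept, cumulative = [], 0.0
    for idx, q in ranked:
        kept.append((idx, q))
        cumulative += q / top_k_mass
        if cumulative >= p:
            break
    kept_mass = sum(q for _, q in kept)
    return {idx: q / kept_mass for idx, q in kept}

def sample_next_token(probs, k=50, p=0.95, rng=random):
    """Sample one token id from the filtered, renormalized distribution."""
    filtered = top_k_top_p_filter(probs, k, p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

# Toy distribution over a 4-token vocabulary with small k and p for readability.
filtered = top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], k=3, p=0.8)
token = sample_next_token([0.5, 0.3, 0.15, 0.05], k=3, p=0.8, rng=random.Random(0))
```

In the toy call, top-k keeps tokens 0, 1, and 2, and the nucleus step then stops after tokens 0 and 1, whose renormalized mass already exceeds 0.8.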

When using GPT-3.5-Turbo’s API service as another paraphrasing tool (RADAR-Unseen paraphraser) to paraphrase the AI-texts, the instruction we input is “Enhance the word choices in the sentence to sound more like that of a human.” When paraphrasing the human-texts, the instruction is “Worsen the word choices in the sentence to sound less like that of a human.”

Multi-round paraphrasing can be easily achieved by using the paraphrased text as input text and using the paraphraser to re-paraphrase it.
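As a minimal sketch of this loop, the function below feeds each output back into the paraphraser. The name `multi_round_paraphrase` and the `paraphrase` callable are hypothetical placeholders; in practice the callable would wrap whichever paraphraser is being used.

```python
def multi_round_paraphrase(text, paraphrase, rounds):
    """Apply `paraphrase` repeatedly and return the output of every round."""
    outputs = []
    current = text
    for _ in range(rounds):
        # The output of the previous round becomes the next input.
        current = paraphrase(current)
        outputs.append(current)
    return outputs


# Toy stand-in for a paraphraser (assumption): appends a marker per round.
toy_outputs = multi_round_paraphrase("a", lambda s: s + "!", rounds=3)
```

Returning every intermediate output makes it possible to evaluate a detector after each round rather than only on the final text.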

Appendix E Complete Experimental Results

We show all the evaluation results in Table A2 (without paraphraser), Table A3 (with RADAR-Unseen paraphraser) and Table A4 (with RADAR-Seen paraphraser).

Appendix F Case Study for RADAR

Sample Selection. For each evaluation dataset (Xsum, SQuAD, WP, and TOFEL), we randomly select one sample from the human-text corpora and use 8 instruction-tuned LLMs (Vicuna-7B, Dolly-V1-6B, Camel-5B, Dolly-V2-3B, LLaMA-7B, GPT-J-6B, Palmyra-base, Pythia-2.8B) to generate a completion for it. We then use the GPT-3.5-Turbo API service to paraphrase all completions. Thus, we get 64 AI-texts in total (32 completions and 32 paraphrases).

Case Selection. Each AI-text has a source model; to detect a given text, we use the RADAR detector trained for its source model. We then collect the text with the largest confidence of being machine-generated (most likely to be correctly detected) and the text with the smallest confidence of being machine-generated (most likely to evade detection) for the following case study.

Analysis. We show two detection cases in Table A5. The one with the higher probability (0.9999) is an easy-to-detect case, and the other, with the lower probability (0.0031), is a difficult-to-detect case. Specifically, the latter can only be detected when our detection threshold drops below 0.0031. From the table, we can see that the misclassification has a plausible explanation: the AI-text is nearly identical to the original human-text, except for the inclusion of several additional words. In fact, the AI-text can be seen as the human-text with a suffix of several words appended to it.

Appendix G Effectiveness of Ensembling Detectors from RADAR

We study the ensembling detection performance by combining the prediction probabilities of two RADAR detectors and report the detection AUROC calculated using the combined prediction probability. The combined prediction probability $E(A,B,\beta,x)=(1-\beta)\mathcal{D}_{\phi}^{A}(x)+\beta\mathcal{D}_{\phi}^{B}(x)$ is a weighted sum of the prediction probabilities of the base model $A$ and the augmented model $B$ . The detection performance is shown in Table A6. We explore various ensembling ratios and ensembling models. Ensembling with the base model itself and setting the ensembling ratio to 0 both mean no ensembling. Setting the ensembling ratio to 1 is the other extreme case, which corresponds to the transfer detection scheme mentioned in Section 4.3. From the results, we can see that the effectiveness of ensembling detection is influenced by both the ensembling model and the ensembling ratio.
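The weighted sum and its effect on AUROC can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the AUROC is computed with a simple pairwise definition rather than any particular library, and the toy probabilities are made up and are not results from Table A6.

```python
def ensemble_score(p_a, p_b, beta):
    """Combined probability (1 - beta) * D_A(x) + beta * D_B(x)."""
    return (1.0 - beta) * p_a + beta * p_b


def auroc(scores, labels):
    """Pairwise AUROC: chance that an AI-text (label 1) outscores a human-text (label 0).

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions (made up): 1 = AI-text, 0 = human-text.
labels = [1, 1, 1, 0, 0, 0]
probs_a = [0.9, 0.4, 0.8, 0.5, 0.2, 0.1]  # base detector A
probs_b = [0.6, 0.9, 0.3, 0.2, 0.4, 0.3]  # augmented detector B


def ensemble_auroc(beta):
    combined = [ensemble_score(a, b, beta) for a, b in zip(probs_a, probs_b)]
    return auroc(combined, labels)


auroc_base = ensemble_auroc(0.0)      # beta = 0: detector A alone
auroc_transfer = ensemble_auroc(1.0)  # beta = 1: detector B alone
auroc_mixed = ensemble_auroc(0.5)     # equal-weight ensemble
```

In this toy setting each detector misranks a different sample, so the equal-weight ensemble separates the two classes better than either detector alone. Whether ensembling helps on real data depends on the chosen model and ratio.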

Appendix H Text Detection Towards Different Lengths

We study RADAR’s effects on AI-generated texts with different lengths. We grouped the evaluation dataset into 5 subsets according to the length of the AI-text. The results are shown in Figure A2. We summarize our observations below:

For the group {log probability, rank, log rank, DetectGPT}, without paraphrasing, these methods are not very sensitive to the length of the text. When facing paraphrasing, however, their performance improves as the text length increases.

For the group {entropy, OpenAI (RoBERTa), RADAR}, without paraphrasing, these methods have a better detection performance for longer texts. On the contrary, their performance degrades when facing longer paraphrased AI-text (even though RADAR seems much better for short-text detection, it still outperforms other methods by a large margin, see Figure 2(b)).

Appendix I Detection on Dipper Paraphrasing

We also explore the use of RADAR to detect text produced by other advanced paraphrasers. We use Dipper (L60-O60 version) to paraphrase the evaluation dataset following the same setup as RADAR-Unseen paraphrasing and RADAR-Seen paraphrasing. The results are shown in Figure A3. We can see from the results that the green bars are much higher than the red bars except for DetectGPT, which means that, in general, Dipper is less destructive than GPT-3.5-Turbo to these detectors. RADAR’s detection AUROC on Dipper reaches $0.9$ . Note that although OpenAI (RoBERTa) performs well under Dipper paraphrasing, it still cannot be regarded as robust because it can be bypassed by other paraphrasers (RADAR-Seen and RADAR-Unseen).

Appendix J Sensitivity Analysis of RADAR on the Hyperparameter $\lambda$

Please refer to our discussion in Section 4.6.

Appendix K Detection on GPT-4 Generated Texts

Figure A5 shows RADAR’s transfer detection for GPT-4. We use RADAR detectors trained on weaker LLMs (Vicuna-7B, Camel-5B, etc.) to detect text generated by GPT-4. We prompted GPT-4 with the instruction: You are a helpful assistant to complete given text:, to generate texts based on the same evaluation datasets (Xsum, SQuAD, WP and TOFEL) used when evaluating other LLMs. The detection performance is discussed in Section 4.3.

Appendix L Use GPT-3.5 to Assess RADAR-paraphrasers

After RADAR training, we obtain not only a detector but also a paraphraser. We use GPT-3.5-Turbo to score the generations of these paraphrasers in order to assess their language generation capability, and we compare them with their initial version (T5-large) to see how adversarial training benefits them. We compare these paraphrasers on a WebText subset with 100 samples. For each sentence in this subset, we first use GPT-2-XL to generate one sentence, and then use the 5 paraphrasers (T5-large and 4 paraphrasers trained on 4 instruction-tuned models’ generation) to paraphrase this sentence respectively. We then input these 5 paraphrased sentences, together with an instruction, to GPT-3.5-Turbo. Our instruction is You are given an array of five sentences. Please rate these sentences and reply with an array of scores assigned to these sentences. Each score is on a scale from 1 to 10, the higher the score, the sentence is written more like a human. Your reply example: . Then we extract the score for each sentence from the returned answer and make the comparison. The results are discussed in Section 4.5.