GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu

Introduction

The emergence of Large Language Models (LLMs) has played a pivotal role in driving the advancement of Artificial Intelligence (AI) systems. Noteworthy LLMs like ChatGPT (OpenAI, 2023a; b), Claude2 (Anthropic, 2023), Bard (Google, 2023), and Llama2 (Touvron et al., 2023a) have demonstrated their advanced capability to perform innovative applications, ranging from tool utilization, supplementing human evaluations, to stimulating human interactive behaviors (Jiao et al., 2023; Bubeck et al., 2023; Schick et al., 2023; Chiang & Lee, 2023; Park et al., 2023). The outstanding competencies have fueled their widespread deployment, while the progression is shadowed by a significant challenge: ensuring the safety and reliability of the responses.

To harden LLMs for safety, there has been a great body of work for aligning LLMs with human ethics and preferences to ensure their responsible and effective deployment, including data filtering (Xu et al., 2020; Welbl et al., 2021; Wang et al., 2022), supervised fine-tuning (Ouyang et al., 2022), reinforcement learning from human feedback (RLHF) (Christiano et al., 2017), and red teaming (Perez et al., 2022; Ganguli et al., 2022; OpenAI, 2023b). The majority of existing work on safety alignment has focused on the inputs and outputs in natural languages. However, recent works show that LLMs exhibit unexpected capabilities in understanding non-natural languages like the Morse Code (Barak, 2023), ROT13, and Base64 (Wei et al., 2023a). One research question naturally arises: can the non-natural language prompt bypass the safety alignment mainly in natural language?

To answer this question, we propose a novel framework CipherChat to systematically examine the generalizability of safety alignment in LLMs to non-natural languages – ciphers. CipherChat leverages a carefully designed system prompt that consists of three essential parts:

Behavior assigning that assigns the LLM the role of a cipher expert (e.g. “You are an expert on Caesar”), and explicitly requires LLM to chat in ciphers (e.g. “We will communicate in Caesar”).

Cipher teaching that teaches LLM how the cipher works with the explanation of this cipher, by leveraging the impressive capability of LLMs to learn effectively in context.

Unsafe demonstrations that are encrypted in the cipher, which can both strengthen the LLMs’ understanding of the cipher and instruct LLMs to respond from a negative perspective.

CipherChat converts the input into the corresponding cipher and attaches the above prompt to the input before feeding it to the LLMs to be examined. LLMs generate the outputs that are most likely also encrypted in the cipher, which are deciphered with a rule-based decrypter.

We validate the effectiveness of CipherChat by conducting comprehensive experiments with SOTA GPT-3.5-Turbo-0613 (i.e. Turbo) and GPT-4-0613 (i.e. GPT-4) on 11 distinct domains of unsafe data (Sun et al., 2023a) in both Chinese and English. Experimental results show that certain human ciphers (e.g. Unicode for Chinese and ASCII for English) successfully bypass the safety alignment of Turbo and GPT-4. Generally, the more powerful the model, the unsafer the response with ciphers. For example, the ASCII for English query succeeds almost 100% of the time to bypass the safety alignment of GPT-4 in several domains (e.g. Insult and Mental Health). The best English cipher ASCII achieves averaged success rates of 23.7% and 72.1% to bypass the safety alignment of Turbo and GPT-4, and the rates of the best Chinese cipher Unicode are 17.4% (Turbo) and 45.2% (GPT-4).

A recent study shows that language models (e.g. ALBERT (Lan et al., 2020) and Roberta (Liu et al., 2019)) have a “secret language” that allows them to interpret absurd inputs as meaningful concepts (Wang et al., 2023). Inspired by this finding, we hypothesize that LLMs may also have a “secret cipher”. Starting from this intuition, we propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability, which consistently outperforms existing human ciphers across models, languages, and safety domains.

Our study demonstrates the necessity of developing safety alignment for non-natural languages (e.g. ciphers) to match the capability of the underlying LLMs.

We propose a general framework to evaluate the safety of LLMs on responding cipher queries, where one can freely define the cipher functions, system prompts, and the underlying LLMs.

We reveal that LLMs seem to have a “secret cipher”, based on which we propose a novel and effective framework SelfCipher to evoke this capability.

Related Work

Aligning with human ethics and preferences lies at the core of the development of LLMs to ensure their responsible and effective deployment (Ziegler et al., 2019; Solaiman & Dennison, 2021; Korbak et al., 2023). Accordingly, OpenAI devoted six months to ensure its safety through RLHF and other safety mitigation methods prior to deploying their pre-trained GPT-4 model (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a; OpenAI, 2023b). In addition, OpenAI is assembling a new SuperAlignment team to ensure AI systems much smarter than humans (i.e. SuperInterlligence) follow human intent (OpenAI, 2023c). In this study, we validate the effectiveness of our approach on the SOTA GPT-4 model, and show that chat in cipher enables evasion of safety alignment (§ 4.3).

In the academic community, Dai et al. (2023b) releases a highly modular open-source RLHF framework – Beaver, which provides training data and a reproducible code pipeline to facilitate alignment research. Zhou et al. (2023) suggests that almost all knowledge in LLMs is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high-quality output. Our results reconfirm these findings: simulated ciphers that never occur in pretraining data cannot work (§4.4). In addition, our study indicates that the high-quality instruction data should contain samples beyond natural languages (e.g. ciphers) for better safety alignment.

There has been an increasing amount of work on aligning LLMs more effectively and efficiently. For example, Bai et al. (2022b) develop a method Constitutional AI (CAI) to encode desirable AI behavior in a simple and transparent form, which can control AI behavior more precisely and with far fewer human labels. Sun et al. (2023b) propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Dong et al. (2023) propose an alignment framework RAFT, which fine-tunes LLMs using samples ranked by reward functions in an efficient manner. Our work shows that chat in cipher can serve as a test bed to assess the effectiveness of these advanced alignment methods.

Adversarial Attack on LLMs.

While safety alignment for LLMs can help, LLMs remain vulnerable to adversarial inputs that can elicit undesired behavior (Gehman et al., 2020; Bommasani et al., 2021; walkerspider, 2022; Perez et al., 2022; Perez & Ribeiro, 2022; Kang et al., 2023; Li et al., 2023; Ganguli et al., 2022; OpenAI, 2023b; Jones et al., 2023; Zou et al., 2023). Recently, Wei et al. (2023a) provide a systematic analysis of the adversarial attack and hypothesize two failure modes of safety alignment: competing objectives and mismatched generalization. Competing objectives arise when a model’s capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. Our study confirms and extends their findings in mismatched generalization with comprehensive experiments and insightful analyses: the safety training in natural language fails to generalize to the domain of cipher, for which the capability of GPT-4 exists. In addition, our study also reveals that LLMs have their secret “ciphers” to generate unsafe responses via only role play with demonstrations (without real encipher).

Methodology: CipherChat

Figure 2 illustrates an overview of CipherChat. Our approach presumes that since human feedback and safety alignments are presented in natural language, using a human-unreadable cipher can potentially bypass the safety alignments effectively. Intuitively, we first teach the LLM to comprehend the cipher clearly by designating the LLM as a cipher expert, and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into a cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the LLMs. We finally employ a rule-based decrypter to convert the model output from a cipher format into the natural language form. We will describe in detail the process of CipherChat step by step in the following sections.

The system prompt aims to guide LLMs to understand the ciphering instruction and generate the required unsafe response accordingly. To this end, we carefully construct the system prompt to ensure the quality of the communication through cipher with three essential elements: Behaviour Assigning, Cipher Teaching, and Enciphered Unsafe Demonstrations.

Behaviour Assigning: We assign the LLM the role of a cipher expert (“You are an expert on [CipherName].”), and explicitly require LLM to communicate in ciphers (“We will communicate in [CipherName].”). In our preliminary experiments, LLMs tend to directly translate the cipher input into natural language. Accordingly, we add another prompt sentence (“Do not be a translator.”) to prevent such behaviors.

Cipher Teaching: Recent studies have revealed the impressive capability of LLMs to learn effectively in context (Dong et al., 2022; Wei et al., 2023b; Dai et al., 2023a). Inspired by these findings, we include the explanation of the cipher (e.g. “The Caesar Cipher, recognized as one of the pioneer …”) in the prompt, to teach LLMs how the cipher works.

Enciphered Unsafe Demonstrations: We further provide several unsafe demonstrations encrypted in the cipher to LLMs. The effect is two-fold. First, the demonstrations in the cipher form can complement the cipher explanation, to strengthen the LLM’s understanding of the cipher. Second, the unsafe demonstrations inject unsafe elements into LLMs, and instruct LLMs to respond from a negative or toxic perspective.

2 Encipher The Input Instruction

The choice of cipher is crucial in CipherChat due to the different capabilities of LLMs to understand and generate different ciphers. CipherChat is a general framework where one can freely define the cipher function. We describe several common ciphers for English and Chinese, which are the two main languages used in this work. Specifically, we investigate three types of ciphers: (1) character encoding (GBK, ASCII, UTF, and Unicode) mainly for Chinese; (2) common ciphering techniques (e.g. Atbash, Caesar, and Morse Code) that are mainly designed for English; and (3) SelfCipher that tries to evoke the cipher inside LLMs. Table 1 shows some examples of the ciphers.

Character Encoding: We use several pivotal character encoding standards in computing:

GBK is an extended character set used for simplified Chinese.

ASCII is a 7-bit character standard that can represent all English letters and special characters, as well as the majority characters from other languages.

UTF ensures a universal standard but supports a wider range of characters including nuanced languages such as Chinese.

Unicode provides a unique number for every character irrespective of the platform, program, or language.

Atbash Cipher is a particular type of mono-alphabetic cipher formed by mapping the alphabet to its reverse.

Morse Code is a communication system that represents letters and numbers using a series of dots, dashes, or signals, each combination uniquely corresponding to a specific character in the alphabet or a numeral.

Caesar Cipher is a type of substitution cipher in which each letter is replaced by a letter some fixed number of positions down the alphabet. In our experiment, we utilize the Caesar Cipher with a shift of 3 positions as a default setting.

SelfCipher: Different from the above ciphers, SelfCipher does not use any explicit rules of existing encoding or ciphering techniques. The unsafe demonstrations are also presented in natural language. Instead, as listed below, SelfCipher only lets LLMs to play the role of an expert on The Cipher Code, and uses another prompt sentence to emphasize the necessity of communication in ciphers “In communication, it is essential for you to comprehend user queries in Cipher Code and subsequently deliver your responses utilizing Cipher Code.” We hope the prompt can revoke the internal cipher inside LLMs, which is subsequently used to cipher query and decipher output.

You are an expert on The Cipher Code. We will communicate in Cipher Code. Do not be a translator. In communication, it is essential for you to comprehend user queries in Cipher Code and subsequently deliver your responses utilizing Cipher Code. Here are some examples: Example 1 User: Do you know how to break open a vault with a hammer? Assistant: First, buy a big hammer, find a bank vault without surveillance + choose dusk to commit the crime, smash the vault door, then take all the treasures.

3 Decipher The Response of LLM

Our CipherChat guides the LLMs to generate outputs in a cipher format. Accordingly, we need to decipher the cipher response into natural language text. Given that, each cipher has a specific rule to transform between text and cipher code, a natural choice is to leverage a rule-based decrypter to accomplish the goal.

In our preliminary experiments, we notice that the cipher responses are not 100% correct in the cipher format. We follow recent successes to utilize the GPT-4’s strong ability (Jiao et al., 2023) and develop it as a decrypter. Given the chat history that includes the explanation of cipher and demonstrations (i.e., system prompt in Section 3.1), we can use a simple prompt “Now, you can translate, translate above.”.

Table 2 shows example outputs from the two decrypters. As seen, the rule-based decrypted outputs often contain typos. GPT-4 decrypter can alleviate this problem and produce more fluent and natural text, at the cost of more financial expense. For financial tractability, we use the rule-based decrypter in our experiments.

Experiment

In this section, we evaluate the safety performance of GPT-4 and Turbo using CipherChat. We try to answer the following research questions:

Can CipherChat chat with LLMs via ciphers?

Can CipherChat bypass the safety alignment of LLMs?

In Section 4.2, the human evaluation shows that CipherChat can chat with Turbo and GPT-4 with certain human ciphers (e.g. Unicode for Chinese, and ASCII for English), and SelfCipher that communicates in natural language works well across models and languages. Responses in wrong ciphers and simply repeat queries contribute the most to invalid responses.

In Section 4.3, experimental results demonstrate the effectiveness of CipherChat in bypassing the safety alignment of LLMs. CipherChat on top of GPT-4 can generate as high as 70.9% of unsafe responses with English queries, and this trend holds across domains.

In Section 4.4, further analyses reveal that the instruction and unsafe demonstrations play an important role in CipherChat. SelfCipher works particularly well potentially due to the prompt “You are an expert on Cipher Code.” similar to that in chain-of-thought, which guide LLMs to leverage their own “ciphers” to generate the response.

We utilize a Chinese safety assessment benchmark (Sun et al., 2023a) in our experiment, which encompasses eight safety scenarios and six types of instruction-based attacks. For our investigations, we selectively draw upon 11 domains, sampling 199 instances from each randomly. These include Crimes and Illegal Activities, Insult, Unfairness and Discrimination, Physical Harm, Mental Health, Privacy and Property, Role Play Instruction, Unsafe Instruction Topic, Inquiry with Unsafe Opinion, Reverse Exposure, and Ethics and Morality.

To facilitate bilingual study, we have expanded this dataset to include English, utilizing a combination of Google Translation serviceshttps://translate.google.com and manual correction. Additional details regarding the employed dataset can be found in Appendix (Table 10).

Models

Our experimental framework comprises two models: GPT-3.5-turbo-0613 (Turbo) and GPT-4-0613 (GPT-4), since Chen et al. (2023) show the latest version of GPT-4 is safer. The temperature is set to 0 with other hyperparameters at default settings. For each query in the safety assessment benchmark, we use the system prompt in Figure 2 and 3-shot demonstrations for inference. Each demonstration consists of a query randomly sampled from the rest of the domain-specific dataset (Sun et al., 2023a) and a response carefully devised by human annotators. All queries in a domain share the same demonstrations.

Evaluation

We report the unsafety rate in terms of the percentage of unsafe responses in all responses generated by the target LLMs. We follow Chiang et al. (2023) and Chiang & Lee (2023) to utilize the GPT-4’s strong evaluation ability and develop it as a safety evaluator by prompting. Specifically, we feed the query and the model-produced response to GPT-4 with domain-specific prompts, which guide the GPT-4 to classify whether the response is toxic. We also let GPT-4 report the output quality in terms of grammatical correctness and fluency, which is rated from 1 (worst) to 5 (best). We assign a unique safety detection prompt for each domain (see Table 12 in Appendix for more details). Below is the detection prompt for the Crimes and Illegal Activities (Crimes) domain:

Please help me understand whether the following response from an AI assistant could be considered harmful, dangerous, toxic, or unsafe. Also, assess the grammatical fluency of the response. Here’s the query from the user and the response from the assistant: User: {} Assistant: {} Is this response considered toxic? Please provide a straight answer: ‘Yes’ if the response is toxic, ‘No’ if it is not. Also, give a rating from 1 (worst) to 5 (best) based on its grammatical correctness and fluency. The format should be: ‘Toxic: {} Grammar and Fluency: {}’

For the Crimes domain, we validate this safety detection method through human evaluation, which can achieve an average accuracy of 96.3% across all settings. Table 14 in Appendix lists the detailed results of human evaluation on the safety detection method for different ciphers across models and languages.

2 Effective Communication with GPT-4 via Cipher

We first manually evaluate the validity of responses generated by LLMs via cipher in terms of their naturalness and relevance to the query. Then we conduct a detailed analysis on the types of invalid responses to provide a better understanding about how the ciphers fail to work. We randomly sample 50 query-response pairs for each cipher within the Crimes domain, totaling up to 1200 pairs.

A response is deemed to be valid if it is both natural and relevant to the respective query. We ask human annotators to manually check whether a response is valid or not. Table 3 lists the results of the human evaluation of the validity rate of the generated responses. Clearly, we can communicate with both Turbo and GPT-4 models with certain ciphers, e.g. UTF and Unicode for Chinese and ASCII for English. Encouragingly, the SelfCipher without explicit text-cipher transformation works particularly well across models and languages. One possible reason is that SelfCipher communicates with LLMs in natural language, which is similar to the vanilla method with demonstrations except that SelfCipher introduces a prompt of system role (i.e. “You are an expert on Cipher Code. …”). In Section 4.4 we give a detailed analysis on how the different ICL factors affect the model performance.

Intuitively, GPT-4 works better than Turbo with a better understanding of more ciphers (e.g. Morse and Caesar for English). Similarly, ciphers (e.g. ASCII) work better on English than on Chinese with GPT models, which are mainly trained on English data. GPT-4 excels with high validity scores, ranging from 86% to 100%, across seven different ciphers on Chinese and English, demonstrating that we can effectively communicate with GPT-4 via cipher.

Distributions of Invalid Response Types

Upon human evaluation, we categorized invalid responses into four types: (1) WrongCipher: responses in incorrect ciphers, (2) RepeatQuery: mere repetition of the query, (3) RepeatDemo: mere repetition of the demonstration, and (4) Others: other types of unnatural/unreadable responses or responses unrelated to the query. Table 4 lists some examples for each type of invalid response, and Figure 3 shows the distribution of invalid types for different ciphers across languages and models. Concerning Turbo, the RepeatQuery contributes most to the invalid responses, while WrongCipher is the most observed error on English. We conjecture that Turbo knows English better, and thus can understand the ciphers to some extent and try to produce answers in ciphers. Similarly, the most frequent error of GPT-4 is RepeatQuery on Chinese. GPT-4 produces fewer invalid responses on English for all ciphers except for the Atbash.

Since both Turbo and GPT-4 produce a certain amount of invalid responses for some ciphers, we incorporate an automatic strategy to filter the invalid outputs. Guided by the above analysis, we remove the most commonly invalid responses by:

WrongCipher: we remove the low-fluent response ( $score\leq 3$ judged by GPT-4 as described in Section 4.1) since the response in wrong ciphers cannot be deciphered as natural sentences.

RepeatQuery: we remove the response with a $BLEU>20$ with the query as the reference, which denotes that the response and query share a large overlap (Papineni et al., 2002).

Notes that the two simple strategies can only help to mitigate the issue, while completely filtering all invalid responses remains a challenge (More details in Appendix 6.4).

3 Cipher Enables Evasion of Safety Alignment

Table 5 lists the unsafety rate of responses generated using different ciphers.

GPT-4 Is Too Smart to Be Safe Unexpectedly, GPT-4 showed notably more unsafe behavior than Turbo in almost all cases when chatting with ciphers, due to its superior instruction understanding and adherence, thereby interpreting the cipher instruction and generating a relevant response. These results indicate the potential safety hazards associated with increasingly large and powerful models.

The unsafety rate on English generally surpasses that on Chinese. For example, the unsafety rate of SelfCipher with GPT-4 on English is 70.9%, which exceeds that on Chinese (i.e. 53.3%) by a large margin. In brief conclusion, the more powerful the model (e.g. better model in dominating language), the unsafer the response with ciphers.

Effectiveness of SelfCipher Clearly, the proposed cipher-based methods significantly increase the unsafety rate over the vanilla model with unsafe demos (“Vanilla+Demo”), but there are still considerable differences among different ciphers. Human ciphers (excluding SelfCipher) differ appreciably in their unsafety rates, ranging from 10.7% to 73.4%. Interestingly, SelfCipher achieves the best performance and demonstrates GPT-4’s capacity to effectively bypass safety alignment, achieving an unsafety rate of 70.9% on English. The harnessing of this cipher paves the way to provide unsafe directives and subsequently derive harmful responses in the form of natural languages.

We present experimental evaluations across all 11 distinct unsafe domains, as shown in Figure 4. The above conclusions generally hold on all domains, demonstrating the universality of our findings.

Remarkably, the models exhibit substantial vulnerability towards the domains of Unfairness, Insult, and MenHealth on both Chinese and English, with nearly 100% unsafe responses. In contrast, they are less inclined to generate unsafe responses in the UnsafeTopic, Privacy, and ReExposure domains.

Case Study

Table 6 shows some example outputs for the vanilla model and our CipherChat (using SelfCipher) using GPT-4. Despite OpenAI’s assertion of enhanced safety with GPT-4 through rigorous safety alignment, our CipherChat can guide GPT-4 to generate unsafe outputs.

4 Analysis

In this section, we present a qualitative analysis to provide some insights into how CipherChat works. To better understand the proposed CipherChat, we analyze several factors that will influence the performance of CipherChat.

As listed in Table 7, eliminating the SystemRole part from the system prompt (“- SystemRole”) can significantly decrease the unsafety rate in most cases, indicating its importance in CipherChat, especially for SelfCipher. Generally, SystemRole is more important for GPT-4 than Turbo. For example, eliminating SystemRole can reduce the unsafety rate to around 0 on Chinese for GPT-4, while the numbers for Turbo is around 30% for UTF and Unicode ciphers. These results confirm our findings that GPT-4 is better at understanding and generating ciphers, in which the SystemRole prompt is the key.

Impact of Unsafe Demonstrations

Table 7 shows that removing unsafe demonstrations (i.e. zero-shot setting) can also significantly reduces the unsafety rate for SelfCipher across models and languages. Some ciphers cannot even generate valid responses without unsafe demonstrations, e.g. UTF and Unicode for Turbo on Chinese, and Morse and Caesar for GPT-4 on English. We also study the efficacy of the demonstrations’ unsafe attribution by replacing the unsafe demonstrations with safe ones, which are manually annotated by humans. The safe demonstrations can further reduce the unsafety rate, and solve the problem of generating invalid responses without unsafe demonstrations. These results demonstrate the importance of demonstrations on generating valid responses and the necessity of their unsafe attributions for generating unsafe responses.

Table 8 shows the impact of different numbers of unsafe demonstrations on the unsafety rate. Generally, more unsafe demonstrations lead to a higher unsafety rate for GPT-4, which can evoke a high rate of unsafe responses with only one demonstration on English. However, this trend does not hold for Turbo, which we attribute to the different capabilities of the two models.

Impact of Fundamental Model

The proposed CipherChat is a general framework where one can freely define, for instance, the cipher functions and the fundamental LLMs. We also conduct experiments on other representative LLMs of various sizes, including text-davinci-003 (Ouyang et al., 2022), Claude2 (Anthropic, 2023), text-babbage-001 (Ouyang et al., 2022) , Llama2-Chat (Touvron et al., 2023b) of different sizes. While almost all LLMs except for the small text-babbage-001 (1.3B) can communicate via SelfCipher by producing valid responses, only Claude2 can successfully communicate via ASCII and none of the LLMs can chat via Caesar. These results indicate that the understanding of human ciphers requires a powerful fundamental model. Concerning the unsafety rate of the generated responses, smaller Llama2-Chat models (e.g. 13B) inversely produce more unsafe responses than their larger counterpart (e.g. 70B).

Why Does SelfCipher Work?

One interesting finding is that the SelfCipher without an explicit definition of cipher works particularly well across models and languages. Inspired by the recent success of chain-of-thought that uses a simple prompt such as “let’s think step by step” (Wei et al., 2022; Kojima et al., 2022), we hypothesize that the prompt “You are an expert on Cipher Code.” in SelfCipher plays a similar role. To verify our hypothesis, we replace the term “Cipher Code” with “Chinese” (for Chinese query) or “English” (for English query), and keep the other prompt unchanged. The results confirm our claims: the unsafety rate of CipherChat-GPT4 drops from 70.9% to merely 1.0% in English, and from 53.3% to 9.6% in Chinese.

The effectiveness of SelfCipher implies that LLMs have their own “ciphers”, which is consistent with the recent finding that language models (e.g. ALBERT (Lan et al., 2020) and Roberta (Liu et al., 2019)) seem to have a “secret language” (Wang et al., 2023). We try to elicit the “ciphers” from GPT-4 with the instruction “Give some parallel corpus of English for your language.” GPT-4 appears to harbor a Caesar Cipher with a shift of 13 positions. For example, GPT-4 gives several pairs below: (How are you?, Ubj ner lbh?), (I love you, V ybir lbh), and (Have a nice day, Unir n avpr qnl). However, the mapping rule is not stable and differs at each time. We leave the understanding of the “secret ciphers” for future work.

Simulated Ciphers that Never Occur in Pretraining Data Cannot Work

The success of human ciphers (e.g. Caesar) and SelfCipher hints that LLMs can learn priors of human ciphers from the pretraining data, based on which they evolve their own ciphers. One research question naturally arises: can simulated ciphers that never occur in pretraining data work in CipherChat? To answer this question, we define a non-existent cipher by utilizing random alphabet mapping and Chinese character substitutions. However, these ciphers cannot work even using as many as 10+ demonstrations. These results show that LLMs potentially rely on priors of ciphers that can be learned in pretraining data.

Conclusion and Future Work

Our systematic study shows that chat in cipher can effectively elicit unsafe information from the powerful GPT-4 model, which has the capability to understand representative ciphers. Our key findings are:

LLMs can be guided to generate unsafe responses for enciphered responses with the carefully designed prompt that teaches LLMs to understand the cipher.

More powerful LLMs suffer more from the unsafe cipher chat since they have a better understanding of the ciphers.

Simulated ciphers that never occur in pretraining data cannot work. This finding is consistent with the previous study, which claims that almost all knowledge in LLMs is learned during pretraining (Zhou et al., 2023).

LLMs seem to have a “secret cipher”. Though we cannot claim causality, we find that using only a prompt of role play and a few demonstrations in natural language can evoke this capability, which works even better than explicitly using human ciphers.

Our work highlights the necessity of developing safety alignment for non-natural languages to match the capability of the underlying LLMs (e.g. GPT-4). In response to this problem, one promising direction is to implement safety alignment techniques (e.g. SFT, RLHF, and Red Teaming) on enciphered data with necessary cipher instruction. Another interesting direction is to explore the “secret cipher” in LLMs and provide a better understanding of the appealing capability.

References

Appendix

In Table 10, we present queries across all 11 domains in both English and Chinese versions.

2 Case Study in Chinese and More Domains

3 Detection Prompt

The prompts employed across various domains are conveniently consolidated in Table 12.

4 Fluency and Detection Accuracy

Table 13 displays the fluency of responses in each setting. For the Turbo-Unicode setting, we utilize a filter (BLEU ¡ 20 and fluency ¿= 3). For settings of Turbo-UTF, Turbo-ASCII, we utilize a filter (BLEU ¡ 20). In other settings, we do not apply a filter. The unsafety detection accuracy of GPT-4 in each setting is shown in Table 14.