Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting

Introduction

The primary reasons for recent breakthroughs in text-conditioned generative diffusion models (DM) are the quality of pre-trained backbones’ representations and their multimodal training data. They have even been shown to learn and reflect the underlying syntax and semantics. In turn, they retain general knowledge implicitly present in the data. Unfortunately, while they learn to encode and reflect general information, systems trained on large-scale unfiltered data may suffer from degenerated and biased behavior. While these profound issues are not completely surprising, since many biases are human-like, many concerns are grounded in the data collection process failing to report its own bias. The resulting models, including DMs, end up reflecting these biases and, in turn, have the potential to replicate undesired behavior. Birhane et al. pinpoint numerous implications and concerns of datasets scraped from the internet, in particular LAION-400M, a predecessor of LAION-5B, and subsequent downstream harms of trained models.

We analyze the open-source latent diffusion model Stable Diffusion (SD), which is trained on subsets of LAION-5B, and find that it generates a significant amount of inappropriate content which, viewed directly, might be offensive, ignominious, insulting, threatening, or might otherwise cause anxiety. To systematically measure the risk of inappropriate degeneration by pre-trained text-to-image models, we provide a test bed for evaluating inappropriate generations by DMs and stress the need for better safety interventions and data selection processes for pre-training. We release I2P (Sec. 5), a set of 4703 dedicated text-to-image prompts extracted from real-world user prompts for text-to-image models, paired with inappropriateness scores from three different detectors (cf. Fig. 1). We show that recently introduced open-source DMs, in this case SD, produce inappropriate content when conditioned on our prompts, even for those that seem to be non-harmful, cf. Sec. 6. Consequently, we introduce a possible mitigation strategy called safe latent diffusion (SLD) (Sec. 3) and quantify its ability to actively suppress the generation of inappropriate content using I2P (Sec. 6). SLD requires no external classifier, i.e., it relies on the model’s already acquired knowledge of inappropriateness and needs no further tuning of the DM.

In general, SLD introduces novel techniques for manipulating a generative diffusion model’s latent space and provides further insights into the arithmetic of latent vectors. Importantly, to the best of our knowledge, our work is the first to consider image editing from an ethical perspective to counteract the inappropriate degeneration of DMs.

Risks and Promises of Unfiltered Data

Let us start by discussing the risks but also the promises of noisy, unfiltered, and large-scale datasets, including background information on SD and its training data.

Risks. Unfortunately, while modern large-scale models, such as GPT-3, learn to encode and reflect general information, systems trained on large-scale unfiltered data also suffer from degenerated and biased behavior. Nonetheless, computational systems were promised to have the potential to counter human biases and structural inequalities. However, data-driven AI systems often end up reflecting these biases and, in turn, have the potential to reinforce them instead. The associated risks have been broadly discussed and demonstrated in the context of large-scale models. These concerns include, for instance, models producing stereotypical and derogatory content and gender and racial biases. Subsequently, approaches have been developed to, e.g., decrease the level of bias in these models.

Promises. Besides the performance gains, large-scale models show surprisingly strong abilities to recall factual knowledge from the training data. For example, Roberts et al. showed that large-scale pre-trained language models’ capabilities to store and retrieve knowledge scale with model size. Grounded on those findings, Schick et al. demonstrated that language models can self-debias the text they produce, specifically regarding toxic output. Furthermore, Jentzsch et al. as well as Schramowski et al. showed that the retained knowledge of such models carries information about moral norms aligning with the human sense of “right” and “wrong” expressed in language. Similarly, other research demonstrated how to utilize this knowledge to guide autoregressive language models’ text generation to prevent their toxic degeneration. Correspondingly, we demonstrate DMs’ capabilities to guide image generation away from inappropriateness, only using representations and concepts learned during pre-training and defined in natural language.

This makes our approach related to other techniques for text-based image editing on diffusion models such as Text2LIVE, Imagic, or UniTune. Contrary to these works, our SLD approach requires no fine-tuning of the text-encoder or DM, nor does it introduce new downstream components. Instead, we utilize the learned representations of the model itself, thus substantially improving computational efficiency. Previously, Prompt-to-Prompt proposed a text-controlled editing technique using changes to the text prompt and control of the model’s cross-attention layers. In contrast, SLD is based on classifier-free guidance and enables more complex changes to the image.

LAION-400M and LAION-5B. Whereas the LAION-400M dataset was released as a proof-of-concept, the creators took the raised concerns to heart and annotated potentially inappropriate content in its successor dataset, LAION-5B. To further facilitate research on safety, fairness, and biased data, these samples were not excluded from the dataset. Users could decide for themselves, depending on their use case, whether to include those images. Thus, the creators of LAION-5B “advise against any applications in deployed systems without carefully investigating behavior and possible biases of models trained on LAION-5B.”

Training Stable Diffusion. Many DMs have reacted to the concerns raised on large-scale training data by either not releasing the model, only deploying it in a controlled environment with dedicated guardrails in place, or rigorously filtering the training data of the published model. In contrast, SD decided not to exclude the annotated content contained in LAION-5B and to release the model publicly. Similar to LAION, Stable Diffusion encourages research on the safe deployment of models which have the potential to generate harmful content.

Specifically, SD is trained on a subset of LAION-5B, namely LAION-2B-en containing over 2.32 billion English image-text pairs. Training SD is executed in different steps: First, the model is trained on the complete LAION-2B-en. Then it is fine-tuned on various subsets, namely “LAION High Resolution” and “LAION-Aesthetics v2 5+”. With all training samples taken from LAION-5B or subsets thereof, it is expected that the trained model reflects not only human-like biases such as gender occupation correlations but also reporting biases. Furthermore, SD is deployed on several platforms including huggingface and recently lexica.art making it easy to use for the general public, including users unaware of present issues.

Ethnic Bias. This leads us to our first experiment. Following up on the studies by Birhane et al. on unfiltered multimodal datasets, we extend these investigations by asking whether the same issues occur in downstream generative models. Specifically, we chose to showcase the biased representation of Asian women, cf. also Sec. 2.1 and Appendix A of Birhane et al. with respect to LAION-400M. This social phenomenon is a well-documented trend in western countries that is also reflected in the English-speaking internet and subsequently the web-crawled LAION-2B-en. Our search for the 100 closest images (in CLIP space) in the dataset to the term ‘japanese body’ yielded almost exclusively inappropriate material, with 90% showing explicit nudity. For comparison, the global average for 50 selected countries is more than 4 times lower at 22% (details can be found in Sec. A.1).
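
The retrieval step described above, finding the images closest to a text query in CLIP space, can be sketched as a top-k similarity search. The snippet below is a minimal illustration that assumes cosine similarity over precomputed embedding vectors. The function name `top_k_cosine` and the toy vectors are hypothetical. A real run would use CLIP text and image embeddings of the dataset instead of hand-written arrays.

```python
import numpy as np

def top_k_cosine(query, embeddings, k):
    """Return indices and cosine similarities of the k embeddings closest to query."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity of every row with the query
    order = np.argsort(-sims)[:k]     # largest similarity first
    return order, sims[order]

# Toy 2-D vectors standing in for image embeddings (hypothetical values).
image_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
text_embedding = np.array([1.0, 0.1])
indices, similarities = top_k_cosine(text_embedding, image_embeddings, k=2)
```

The share of retrieved images flagged by a detector would then be computed over `indices`.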

Now, to investigate the bias transfer to SD, we prompted the model using ‘<country> body’ (e.g. ‘japanese body’ or ‘nigerian body’) and compared the percentages of explicit nudes for the same 50 countries. We automatically assessed the generated images using the NudeNet image classifier to detect exposed genitalia. We chose this ambiguous prompt template to expose the model’s biases. Based on image web search results, one can, for instance, expect images of categories such as average body comparison, displaying lightly dressed people, but also culturally related images such as the country flags.

For each country, we generated 100 images, each using the same seed and hyper-parameters. The results are depicted in Fig. 2 (left). Indeed, one can observe the above-described images such as average body comparison in the case of u.s. american (cf. Sec. A.2). However, as expected, the close association of some ethnic terms with nudity in Stable Diffusion is apparent. Overall it appears that European, Asian, and Oceanic countries are far more likely to be linked with nudity than African or American ones. The most nude images are generated for Japan at over 75%, whereas the global average is at 35%. Specifically, the terms ‘Asian’ and ‘Japanese’ yielded a significantly higher amount of nudity than any other ethnic or geographic term. We attribute the apparent synonym usage of ‘Japanese’ and ‘Asian’ in this context to the aforementioned trends and the overwhelming amount of such content in LAION-5B. Unfortunately, biases in SD generation like these may further reinforce problematic social phenomena.

SD’s post-hoc safety measures. Various methods have been proposed to detect and filter out inappropriate images. Similarly, the SD implementation does contain an “NSFW” safety checker, an image classifier applied after generation to detect and withhold inappropriate images. However, there seems to be an interest in deactivating this safety measure. We checked the recently added image generation feature of lexica.art using examples we knew to generate content that the safety checker withholds. We note that the generation of these inappropriate images is possible on lexica.art at the time of the present study, apparently without any restrictions, cf. Sec. A.3.

Now, we are ready to introduce our two main contributions, first SLD and then the I2P benchmark.

Safe Latent Diffusion (SLD)

We introduce safety guidance for latent diffusion models to reduce the inappropriate degeneration of DMs. Our method extends the generative process by combining text conditioning through classifier-free guidance with inappropriate concepts removed or suppressed in the output image. Consequently, SLD performs image editing at inference without any further fine-tuning required.

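The training objective of a diffusion model $\hat{\mathbf{x}}_{\theta}$ can be written as

$$\mathbb{E}_{\mathbf{x},\mathbf{c}_{p},\mathbf{\epsilon},t}\left[w_{t}\,\|\hat{\mathbf{x}}_{\theta}(\alpha_{t}\mathbf{x}+\omega_{t}\mathbf{\epsilon},\mathbf{c}_{p})-\mathbf{x}\|_{2}^{2}\right] \tag{1}$$

where $(\mathbf{x},\mathbf{c}_{p})$ is conditioned on text prompt $p$, $t$ is drawn from a uniform distribution $t\sim\mathcal{U}([0,1])$, $\mathbf{\epsilon}$ is sampled from a Gaussian $\mathbf{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$, and $w_{t},\omega_{t},\alpha_{t}$ influence image fidelity depending on $t$. Consequently, the DM is trained to denoise $\mathbf{z}_{t}:=\mathbf{x}+\mathbf{\epsilon}$ to yield $\mathbf{x}$ with the squared error as loss. At inference, the DM is sampled using the model’s prediction of $\mathbf{x}=(\mathbf{z}_{t}-\bar{\mathbf{\epsilon}}_{\theta})$, with $\bar{\mathbf{\epsilon}}_{\theta}$ as described below.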

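Classifier-free guidance is a conditioning method using a purely generative diffusion model, eliminating the need for an additional pre-trained classifier. The approach randomly drops the text conditioning $\mathbf{c}_{p}$ with a fixed probability during training, resulting in a joint model for unconditional and conditional objectives. During inference the score estimates for the $\mathbf{x}$-prediction are adjusted so that:

$$\bar{\mathbf{\epsilon}}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p}):=\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})+s_{g}\left(\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p})-\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})\right) \tag{2}$$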

with guidance scale $s_{g}$ which is typically chosen as $s_{g}\in(0,20]$ and $\epsilon_{\theta}$ defining the noise estimate with parameters $\theta$ . Intuitively, the unconditioned $\epsilon$ -prediction $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})$ is pushed in the direction of the conditioned $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p})$ to yield an image faithful to prompt $p$ . Lastly, $s_{g}$ determines the magnitude of the influence of the text $p$ .
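
As a minimal sketch of the guidance step just described, the unconditioned estimate is moved toward the prompt-conditioned one by a factor $s_{g}$. The NumPy arrays below merely stand in for the two noise estimates; the function name and toy values are illustrative assumptions, not part of any released implementation.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_prompt, s_g):
    """Push the unconditioned noise estimate toward the prompt-conditioned one."""
    return eps_uncond + s_g * (eps_prompt - eps_uncond)

# Toy 2x2 arrays standing in for eps(z_t) and eps(z_t, c_p).
eps_uncond = np.zeros((2, 2))
eps_prompt = np.ones((2, 2))
guided = classifier_free_guidance(eps_uncond, eps_prompt, s_g=7.5)
```

With $s_{g}=1$ the result equals the prompt-conditioned estimate, and larger values extrapolate beyond it, which is why $s_{g}$ controls how strongly the text influences the image.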

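To influence the diffusion process, SLD makes use of the same principles as classifier-free guidance, cf. the simplified illustration in Fig. 3. In addition to a text prompt $p$ (blue arrow), we define an inappropriate concept (red arrow) via textual description $S$. Consequently, we use three $\epsilon$-predictions with the goal of moving the unconditioned score estimate $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})$ towards the prompt conditioned estimate $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p})$ and simultaneously away from the concept conditioned estimate $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{S})$. This results in

$$\bar{\mathbf{\epsilon}}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p},\mathbf{c}_{S})=\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})+s_{g}\left(\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p})-\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})-\gamma(\mathbf{z}_{t},\mathbf{c}_{p},\mathbf{c}_{S})\right) \tag{3}$$

with the safety guidance term $\gamma$

$$\gamma(\mathbf{z}_{t},\mathbf{c}_{p},\mathbf{c}_{S})=\mu(\mathbf{c}_{p},\mathbf{c}_{S};s_{S},\lambda)\left(\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{S})-\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})\right) \tag{4}$$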

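where $\mu$ applies a guidance scale $s_{S}$ element-wise. To this end, $\mu$ considers those dimensions of the prompt conditioned estimate that would guide the generation process toward the inappropriate concept. Therefore, $\mu$ scales the element-wise difference between the prompt conditioned estimate and safety conditioned estimate by $s_{S}$ for all elements where this difference is below a threshold $\lambda$ and equals $0$ otherwise:

$$\mu(\mathbf{c}_{p},\mathbf{c}_{S};s_{S},\lambda)=\begin{cases}\min(1,|\phi|), & \text{where }\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p})-\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{S})<\lambda\\ 0, & \text{otherwise}\end{cases}\qquad\text{with }\phi=s_{S}\left(\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{p})-\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{S})\right) \tag{5}$$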

with both larger $\lambda$ and larger $s_{S}$ leading to a more substantial shift away from the prompt text and in the opposite direction of the defined concept. Note that we clip the scaling factor of $\mu$ in order to avoid producing image artifacts. As described in previous research, the values of each $\mathbf{x}$-prediction should adhere to the training bounds of $[-1,1]$ to prevent low fidelity images.
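
The element-wise scaling factor $\mu$ can be sketched in a few lines of NumPy. This is a hedged illustration under two assumptions taken from the description: the scaled absolute difference is clipped at 1, and $\mu$ is zero wherever the difference is not below $\lambda$. The function name `safety_scale` and the toy values are hypothetical.

```python
import numpy as np

def safety_scale(eps_prompt, eps_concept, s_S, lam):
    """Element-wise mu: clipped, scaled difference where diff < lam, else 0."""
    diff = eps_prompt - eps_concept
    phi = np.minimum(1.0, s_S * np.abs(diff))   # clip the scaling factor at 1
    return np.where(diff < lam, phi, 0.0)

# Toy estimates: only the last two dimensions fall below the threshold.
eps_prompt = np.array([0.5, -0.0005, -0.2])
eps_concept = np.zeros(3)
mu = safety_scale(eps_prompt, eps_concept, s_S=1000.0, lam=0.01)
```

Raising `s_S` only pushes more of the selected elements to the clipped value of 1, while raising `lam` selects more elements in the first place.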

SLD is a balancing act between removing all inappropriate content from the generated image and keeping the changes minimal. To meet these requirements, we make two adjustments to the methodology presented above. We add a warm-up parameter $\delta$ so that safety guidance $\gamma$ is only applied after an initial warm-up period in the diffusion process, i.e., $\gamma(\mathbf{z}_{t},\mathbf{c}_{p},\mathbf{c}_{S}):=\mathbf{0}\text{ if }t<\delta$. Naturally, higher values for $\delta$ lead to less significant adjustments of the generated image. As we aim to keep the overall composition of the image unchanged, selecting a sufficiently high $\delta$ ensures that only fine-grained details of the output are altered. Furthermore, we add a momentum term $\nu_{t}$ to the safety guidance $\gamma$ in order to accelerate guidance over time steps for dimensions that are continuously guided in the same direction. Hence, $\gamma_{t}$ is defined as:

$$\gamma_{t}(\mathbf{z}_{t},\mathbf{c}_{p},\mathbf{c}_{S})=\mu(\mathbf{c}_{p},\mathbf{c}_{S};s_{S},\lambda)\left(\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{S})-\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})\right)+s_{m}\nu_{t} \tag{6}$$

with momentum scale $s_{m}\in[0,1]$ and $\nu$ being updated as

$$\nu_{t+1}=\beta_{m}\nu_{t}+(1-\beta_{m})\gamma_{t} \tag{7}$$

where $\nu_{0}=\mathbf{0}$ and $\beta_{m}\in[0,1)$ , with larger $\beta_{m}$ resulting in less volatile changes of the momentum. Momentum is already built up during the warm-up period, even though $\gamma_{t}$ is not applied during these steps.
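
Putting the pieces together, one guided denoising step with warm-up and momentum could look like the following sketch. It assumes that the safety term is subtracted inside the classifier-free guidance difference, that the momentum is an exponential moving average of the safety term, and that `step` counts completed diffusion steps from zero. The function name, argument names, and default values are illustrative placeholders, not the configuration sets of this work.

```python
import numpy as np

def sld_step(eps_uncond, eps_prompt, eps_concept, momentum, step, *,
             s_g=7.5, s_S=1000.0, lam=0.01, delta=10, s_m=0.3, beta_m=0.4):
    """One safety-guided noise estimate; returns (estimate, updated momentum)."""
    diff = eps_prompt - eps_concept
    mu = np.where(diff < lam, np.minimum(1.0, s_S * np.abs(diff)), 0.0)
    # Safety guidance term plus momentum.
    gamma = mu * (eps_concept - eps_uncond) + s_m * momentum
    # Momentum is updated on every step, including warm-up steps.
    new_momentum = beta_m * momentum + (1.0 - beta_m) * gamma
    guidance = eps_prompt - eps_uncond
    if step >= delta:                 # safety guidance only after warm-up
        guidance = guidance - gamma
    return eps_uncond + s_g * guidance, new_momentum

eps_u = np.zeros(3)
eps_p = np.array([0.2, -0.1, 0.3])
eps_s = np.array([0.5, 0.4, -0.2])
warm_est, warm_mom = sld_step(eps_u, eps_p, eps_s, np.zeros(3), step=0)
safe_est, _ = sld_step(eps_u, eps_p, eps_s, np.zeros(3), step=10)
```

During warm-up the estimate equals plain classifier-free guidance while the momentum already accumulates, so the first guided step after warm-up starts with a non-zero push.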

Overall, the resulting SLD progress is visualized by way of example across the various diffusion steps in Fig. 3. As the safety guidance is already applied in early steps, it removes and suppresses the inappropriate parts of the images as soon as these are constructed in the latent space. We attached the corresponding SLD pseudo-code in Appendix H.

Configuring Safe Latent Diffusion

Inappropriateness may be subjective depending on individual opinions and contexts. For instance, the requirements of a professional artist differ from those of a 4-year-old child. Therefore, we suggest four possible configurations of the diffusion process varying in the strength of the safety adaptations. The configurations include the aforementioned hyper-parameters and concepts. We believe the proposed values offer a decent trade-off regarding the degree of alteration but note that these can be easily adapted.

(Un)safe Concept. Inspired by existing content policies of deployed text-to-image DMs such as DALL-E, the following text prompt is used to define a general inappropriate safety concept: “hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality, cruelty”. Through the text-based definition, one could easily add further concepts or remove existing ones (cf. Appendix H).

Threshold. The most vital parameter of SLD is the safety threshold $\lambda$. It determines the location of the hyperplane dividing the latent space into appropriateness and inappropriateness, cf. Eq. 5. Theoretically, $\lambda$ is restricted by the training bounds of $[-1,1]$, and intuitively it should be at least $0$. However, since our approach relies on the model’s understanding of “right” and “wrong”, we recommend choosing conservative, i.e., small positive, values such that $\lambda\in[0.0,0.03]$.

Safety guidance scale. The safety guidance scale $s_{S}$ can theoretically be chosen arbitrarily high as the scaling factor $\mu$ is clipped either way. Larger values for $s_{S}$ would simply increase the number of values in latent representation being set to 1. Therefore, there is no adverse effect of large $s_{S}$ such as image artifacts that are observed for high guidance scales $s_{g}$ . We recommend $s_{S}\in$ .

Warm-up. The warm-up period $\delta$ largely influences at which level of the image composition changes are applied. Large safe-guidance scales applied early in the diffusion process could lead to major initial changes before significant parts of the images were constructed. Hence, we recommend using at least a few warm-up steps, $\delta\in$ , to construct an initial image and, in the worst case, let SLD revise those parts. In any case, $\delta$ should be no larger than half the number of total diffusion steps.

Momentum. The guidance momentum is particularly useful to remove inappropriate concepts that make up significant portions of the image and thus require more substantial editing, especially those created during warm-up. Therefore, momentum builds up over the warm-up phase, and such images will be altered more rigorously than those with close editing distances. Higher momentum parameters usually allow for a longer warm-up period. With most diffusion processes using around 50 generation steps, the window for momentum build-up is limited. Therefore, we recommend choosing $s_{m}\in[0,0.5]$ and $\beta_{m}\in[0.3,0.7]$ .

Configuration sets. These recommendations result in the following four sets of hyper-parameters gradually increasing their aggressiveness of changes on the resulting image (cf. Fig. 4 and Appendix I). Which setting to use highly depends on the use case and individual preferences:

The weak configuration is usually sufficient to remove superficial blood splatters, but stronger parameters are required to suppress more severe injuries. Similarly, the weak set may suppress nude content on clearly pornographic images but may not reduce nudity in artistic imagery such as oil paintings. While an adult artist may find this perfectly acceptable, it is problematic for, e.g., a child using the model. Furthermore, on the example of nudity, we observed the medium hyper-parameter set to yield the generation of, e.g., a bikini. In contrast, the strong and maximum sets would produce progressively more clothing, such as a dress.

Note that we can even drive the generation of inappropriate content to zero by choosing strong enough parameters (Hyp-Max). However, doing so likely diverges from our goal of keeping changes minimal. Nevertheless, this could be a requirement for sensitive applications, e.g., involving children. In these cases, we further recommend the usage of post-hoc interventions such as SD’s safety checker.

Regarding the amount of observed changes, the Hyp-Max configuration often behaves similarly to replacing the unconditioned estimate with a conditioned estimate based on a negative prompt during classifier-free guidance, cf. Neg. in Fig. 4, i.e., replacing $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t})$ with $\mathbf{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c}_{S})$ in Eq. 2. However, as our experimental evaluation (cf. Tab. 1) shows, negative prompting leads to worse mitigation than SLD. Further, the major disadvantage of this approach is the lack of more fine-grained control over the generation process, always leading to images significantly differing from the original, especially for higher guidance scales $s_{S}$. Additionally, negative prompts are a vital tool in text-to-image generation that would no longer be available to users if used for safety guidance.
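
For comparison, negative prompting as described above amounts to swapping the unconditioned estimate for the concept-conditioned one inside classifier-free guidance. The sketch below uses toy arrays in place of real noise estimates; the function name is hypothetical.

```python
import numpy as np

def negative_prompt_guidance(eps_prompt, eps_concept, s_g):
    """Classifier-free guidance with the concept estimate as the starting point."""
    return eps_concept + s_g * (eps_prompt - eps_concept)

eps_prompt = np.array([1.0, 0.0])
eps_concept = np.array([0.0, 1.0])
neg_guided = negative_prompt_guidance(eps_prompt, eps_concept, s_g=2.0)
```

Unlike the element-wise, thresholded safety term of SLD, this shift affects every dimension and offers no warm-up or threshold to limit the change.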

Inappropriate Image Prompts (I2P)

To systematically measure the risk of inappropriate degeneration by pre-trained text-to-image models, we introduce a new benchmarking dataset of over 4.5k real-world text prompts for generative models that are likely to produce inappropriate content: the inappropriate image prompts (I2P) dataset, cf. Fig. 1, covers a wide range of inappropriate content beyond nudity. Our dataset and evaluation pipeline are publicly available for other researchers to use.

Inappropriate content. What is considered inappropriate imagery may differ based on context, setting, cultural and social predisposition, and individual factors and is highly subjective overall. In this work, we base our definition of inappropriate content on the work of Gebru et al.: “[data that] if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety”, which is, for example, also reflected by the OpenAI content policy that applies to the use of DALL-E. Specifically, we consider those images showcasing content that contains one of the following:

hate, harassment, violence, self-harm, sexual content, shocking images, illegal activity.

Note that inappropriateness is not limited to these seven concepts, varies between cultures, and constantly evolves. Here we restricted ourselves to images displaying tangible acts of inappropriate behavior.

Prompt collection. For the seven concepts mentioned above, we used 26 keywords and phrases (cf. Appendix C) describing them in more detail and collected up to 250 real-world text prompts for each. For a given keyword, we crawled the prompts of the top 250 images returned by https://lexica.art. Lexica is a collection of real-world, user-generated prompts for SD sourced from its official Discord server. It stores the prompt, seed, guidance scale, and image dimensions used in the generation to facilitate reproducibility. Image retrieval in Lexica is based on the similarity of an image and search query in CLIP embedding space. Therefore, the collected prompts are not guaranteed to generate inappropriate content, but the probability is high, as demonstrated in our evaluation.

Dataset statistics. The data collection described above yielded duplicate entries, as some retrieved images were found among multiple keywords. After removing those duplicates, the I2P dataset contains 4703 unique prompts assigned to at least one of the seven categories above. We also include an estimate of the percentage of inappropriate images the prompt is predicted to generate, together with the necessary hyper-parameters to reproduce these results. The benchmark also contains a hard annotation for prompts that generate predominantly inappropriate images.
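
The de-duplication step can be illustrated as follows: prompts retrieved under several keywords collapse into a single entry that keeps every category it was found under. This is only a sketch of one plausible procedure; the function name, the keyword-to-category mapping, and the placeholder prompts are hypothetical.

```python
from collections import defaultdict

def merge_prompts(prompts_by_keyword, category_of_keyword):
    """Collapse prompts retrieved under several keywords into unique entries."""
    categories = defaultdict(set)
    for keyword, prompts in prompts_by_keyword.items():
        for prompt in prompts:
            categories[prompt.strip()].add(category_of_keyword[keyword])
    return {prompt: sorted(cats) for prompt, cats in categories.items()}

# Placeholder data: "prompt b" is returned for two different keywords.
prompts_by_keyword = {
    "keyword 1": ["prompt a", "prompt b"],
    "keyword 2": ["prompt b", "prompt c"],
}
category_of_keyword = {"keyword 1": "violence", "keyword 2": "shocking images"}
unique_prompts = merge_prompts(prompts_by_keyword, category_of_keyword)
```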

On average, the prompts are made up of 20 tokens, and we could not observe an apparent correlation between frequent words and the connection to inappropriate images of these prompts. We present a word cloud of frequently used terms in Appendix C. Furthermore, we include the toxicity of each prompt based on the respective toxicity score of the Perspective API. We only find a weak correlation between the toxicity of a prompt and the inappropriateness of images it generates. In fact, prompts with low toxicity scores still have unforeseen high probabilities of generating inappropriate images. Furthermore, out of 4703 prompts, a mere 1.5% are toxic. This highlights that simply suppressing “bad” words in text prompts is not a reliable mitigation strategy against generating problematic content.

Experimental Evaluation

We now evaluate SD’s inappropriate degeneration and SLD based on the suggested configurations using I2P.

Experimental Protocol. To assess the reduction of inappropriate content, we generated ten images each for all prompts of the I2P test bed and compared the probability of generating inappropriate images. We used one general concept $S$ across all categories of I2P as specified in Sec. 4. We automatically evaluated inappropriate image content by combining two classifiers. First, we used the Q16 classifier (also used to annotate the LAION-5B dataset) to detect a wide range of inappropriate content in images. Second, we applied NudeNet (cf. Sec. 2) to identify sexually explicit content. In this paper, we only classify exposed genitalia as inappropriate while allowing otherwise provocative images. If not specified otherwise, an image is classified as inappropriate if one or both of the classifiers output the respective label. Further details can be found in Appendix D.
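
The decision rule of the protocol, flagging an image if at least one of the two classifiers flags it, and the resulting per-prompt probability can be sketched as below. The boolean flags are placeholders for the outputs of the two classifiers; the function names and toy values are hypothetical.

```python
def is_inappropriate(q16_flag, nudenet_flag):
    """An image counts as inappropriate if one or both classifiers flag it."""
    return bool(q16_flag or nudenet_flag)

def inappropriate_rate(flags_per_image):
    """Fraction of images flagged, given (q16_flag, nudenet_flag) pairs."""
    flags = [is_inappropriate(q, n) for q, n in flags_per_image]
    return sum(flags) / len(flags)

# Ten toy images for one prompt: six clean, four flagged by at least one classifier.
images = [(False, False)] * 6 + [(True, False)] * 2 + [(False, True), (True, True)]
rate = inappropriate_rate(images)
```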

Inappropriateness in Stable Diffusion. We start our experimental evaluation by demonstrating the inappropriate degeneration of Stable Diffusion without any safety measures. Tab. 1 shows SD’s probability of generating inappropriate content for each category under investigation. Recall that only $1.5\%$ of the text prompts could be identified as toxic. Nevertheless, one can clearly observe that depending on the category, the probability of generating inappropriate content ranges from $34\%$ to $52\%$ . Furthermore, Tab. 1 reports the expected maximum inappropriateness over 25 prompts. These results show that a user generating images with I2P for 25 prompts is expected to have at least one batch of output images of which 96% are inappropriate. The benchmark clearly shows SD’s inappropriate degeneration and the risks of training on completely unfiltered datasets.
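
One way to estimate an expected maximum over 25 prompts is to resample prompts repeatedly and average the largest per-prompt rate in each draw. The sketch below assumes such a bootstrap procedure; the estimator, the number of resamples, and the toy rates are assumptions made for illustration and are not taken from the evaluation pipeline.

```python
import numpy as np

def expected_max(per_prompt_rates, n_prompts=25, n_resamples=10_000, seed=0):
    """Bootstrap estimate of the mean and std of the maximum rate over n_prompts."""
    rng = np.random.default_rng(seed)
    rates = np.asarray(per_prompt_rates, dtype=float)
    # Draw n_resamples groups of n_prompts rates, with replacement.
    draws = rng.choice(rates, size=(n_resamples, n_prompts), replace=True)
    maxima = draws.max(axis=1)
    return maxima.mean(), maxima.std()

# Toy per-prompt rates (fraction of inappropriate images per prompt).
toy_rates = np.linspace(0.0, 1.0, 11)
mean_max, std_max = expected_max(toy_rates)
```

Because the maximum is dominated by the few worst prompts, this statistic can remain high even when the average rate is moderate.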

SLD in Stable Diffusion. Next, we investigate whether we can account for noisy, i.e. biased and unfiltered training data based on the model’s acquired knowledge in distinguishing between appropriate and inappropriate content.

To this end, we applied SLD. Similarly to the observations made on the examples in Fig. 4, one can observe in Tab. 1 that the number of inappropriate images gradually decreases with stronger hyper-parameters. The strongest hyper-parameter configuration reduces the probability of generating inappropriate content by over 75%. Consequently, a mere 9% of the generated images are still classified as inappropriate. However, it is important to note that the Q16 classifier tends to be rather conservative in some of its decisions, classifying images as inappropriate where the respective content has already been reduced significantly. We assume the majority of images flagged as potentially inappropriate for Hyp-Max to be false positives of the classifier. One can observe a similar reduction in the expected maximum inappropriateness but also note a substantial increase in variance. The latter indicates a substantial number of outliers when using SLD.

Overall the results demonstrate that, indeed, we are able to largely mitigate the inappropriate degeneration of SD based on the underlying model’s learned representations. This could also apply to issues caused by reporting biases in the training set, as we will investigate in the following.

Counteracting Bias in Stable Diffusion. Recall the ‘ethnic bias’ experiments of Sec. 2. We demonstrated that biases reflected in LAION-5B data are, consequently, also reflected in the trained DM. Similarly to its performance on I2P, SLD strongly reduces the number of nude images generated for all countries as shown in Fig. 2 (right). SLD yields 75% less explicit content, and the percentages of nude images are distributed more evenly between countries. The previous outlier Japan now yields 12.0% nude content, close to the global percentage of 9.25%.

Nonetheless, at least when keeping changes minor (Hyp-Strong), SLD alone is not sufficient to mitigate this racial bias entirely. There remains a medium but statistically significant correlation between the percentages of nude images generated for a country by SD with and without SLD. Thus, SLD can make a valuable contribution towards de-biasing DMs trained on datasets that introduce biases. However, these issues still need to be identified beforehand, and an effort towards reducing, or better eliminating, such biases in the dataset itself is still required.

For further evidence, we ran experiments on Stable Diffusion v2.0 which is essentially a different model with a different text encoder and training set. Specifically, rigorous dataset filtering of sexual and nudity related content was applied before training the diffusion model, however, not on the pre-trained text encoder. While this filtering process reduces biased representations, they are still present and more frequent compared to SLD mitigation on SD in version 1.4, cf. Appendix E. Interestingly, the combination of SLD and dataset filtering achieves an even better mitigation. Hence, a combination of filtering and SLD could be beneficial and poses an interesting avenue for future work.

Discussion & Limitations

Before concluding, let us touch upon ethical implications and future work concerning I2P and the introduced SLD.

Ethical implications. We introduced an alternative approach to post-hoc prevention of presenting generated images with potentially inappropriate content. Instead, we identify inappropriate content and suppress it during the diffusion process. This intervention would not be possible if the model did not acquire a certain amount of knowledge on inappropriateness and related concepts during pre-training. Consequently, we do not advise removing potentially inappropriate content entirely from the training data, as we can reasonably assume that efforts towards removing all such samples will hurt the model’s capabilities to target related material at inference individually. Therefore, we also see a promising avenue for future research in measuring the impact of training on balanced datasets. However, this is likely to require large amounts of manual labor.

Nonetheless, we also demonstrated that highly imbalanced training data could reinforce problematic social phenomena. It must be ensured that potential risks can be reliably mitigated, and if in doubt, datasets must be further curated, such as in the presented case study. Whereas LAION already made a valiant curating effort by annotating the related inappropriate content, we again advocate for carefully investigating behavior and possible biases of models and consequently deploying mitigation strategies against these issues in any deployed application.

We realize that SLD potentially has further ethical implications. Most notably, we recognize the possibility of similar techniques being used for actively censoring generative models. Additionally, one could construct a model generating mainly inappropriate content by reversing the guidance direction of our approach. Thus, we strongly urge all models using SLD to transparently state which contents are being suppressed. However, it could also be applied to cases beyond inappropriateness, such as fairness. Furthermore, we reiterate that inappropriateness is based on social norms, and people have diverse sentiments. The introduced test bed is limited to specific concepts and consequently does not necessarily reflect differing opinions people might have on inappropriateness. Additionally, the model’s acquired representation of inappropriateness may reflect the societal dispositions of the social groups represented in the training data and might lack a more diverse sentiment.

Lastly, we discuss the overall impact of SLD on image fidelity and text-alignment. Ideally, the approach should have no adverse effect on either, especially on already appropriate images. In line with previous research on generative text-to-image models, we report the COCO FID-30k scores and CLIP distance of SD and of our four sets of hyper-parameters for SLD in Tab. 2. The scores slightly increase with stronger hyper-parameters. However, they do not necessarily align with actual user preference. Therefore, we conducted an exhaustive user study on the DrawBench benchmark and report the results in Tab. 2 (cf. Appendix G for study details). The results indicate that users even slightly prefer images generated with SLD over those without, suggesting that safety does not sacrifice image quality and text alignment.

Conclusion

We demonstrated that the inappropriate degeneration of text-to-image models transfers from unfiltered and imbalanced training datasets. To measure related issues, we introduced an image generation test bed called I2P containing dedicated image-to-text prompts representing inappropriate concepts such as nudity and violence. Furthermore, we presented an approach to mitigate these issues based on classifier-free guidance. The proposed SLD removes and suppresses the corresponding image parts during the diffusion process with no additional training required and no adverse effect on overall image quality. Strong representation biases learned from the dataset are attenuated by our approach but not completely removed. Thus, we advocate for the careful use of unfiltered, clearly imbalanced datasets.

Acknowledgments

We gratefully acknowledge support by the German Center for Artificial Intelligence (DFKI) project “SAINT” and the Federal Ministry of Education and Research (BMBF) under Grant No. 01IS22091. This work also benefited from the ICT-48 Network of AI Research Excellence Center “TAILOR” (EU Horizon 2020, GA No 952215), the Hessian research priority program LOEWE within the project WhiteBox, the Hessian Ministry of Higher Education, Research and the Arts (HMWK) cluster projects “The Adaptive Mind” and “The Third Wave of AI”, and the HMWK and BMBF ATHENE project “AVSV”. Further, we thank Felix Friedrich, Dominik Hintersdorf and Lukas Struppek for their valuable feedback.

References

Appendix

Appendix A Ethnic Bias Experiment

Here, we provide more details on the “Ethnic Bias Experiment” related findings.

For each of the 50 selected countries introduced in Secs. 2 and 6, we retrieved the 100 closest images for the caption “<country> body” from LAION-2B-en. Similar to the experiments in Secs. 2 and 6, we also computed the percentage of nude images for each country.

The observations regarding “ethnic bias” we made on SD generated images are also apparent in its initial training data set LAION-2B-en. Among the top-5 countries in terms of the number of nude images are four Asian ones with Japan, Indonesia, Thailand and India. Overall Japan tops that ranking at over 90% explicit material. This is more than four times higher than the global average of 22%.

A.2 SD Generations

As we have shown, the corresponding biases contained in the dataset transfer to the diffusion model. In addition to the discussion in the main text, Fig. 5 provides qualitative examples. Again, we blurred all images showing people. Still, one can observe that prompts containing asian and japanese mainly produce explicit nudity. In contrast, u.s. american and nigerian, for instance, produce the expected image style. Especially in the case of u.s. american, we can find images displaying average body comparison and culturally related content.

A.3 Lexica

Whereas the creators of SD issue warnings and advise using the model for research only, deployed applications such as Lexica have the potential to reinforce biases. Fig. 7 shows images that Lexica generates for the prompt “Japanese body”, again highlighting the strong ethnic bias in SD with respect to Asian women and nudity.

Appendix B RealToxicityPrompts

It is noteworthy that we initially tried to reuse the prompts contained in RealToxicityPrompts. However, they led to unnatural images, mainly containing unidentifiable text fragments. Consequently, we only included dedicated text-to-image prompts. Tab. 3 shows the results of SLD on images generated using the subset of toxic prompts from RealToxicityPrompts.

While these prompts produce a similar amount of inappropriate images as the I2P benchmark, we opted not to use the prompts from RealToxicityPrompts, as the resulting images often contain only seemingly random combinations of letters, cf. Fig. 8.

Appendix C I2P Test Bed

I2P contains seven different inappropriate concept categories which we collected by the following keywords:

harassment: “mocking”, “bullying”, “threatening”, “harassment”

violence: “humiliation”, “violence acts”, “suffering”, “violence”

self-harm: “suicide”, “eating disorder”, “self-harm”, “attempts at harming oneself”

sexual: “sexual acts”, “content meant to arouse sexual excitement”, “sexual”, “nudity”

shocking: “profane subjects that may shock or disgust”, “bodily fluids”, “shocking”, “obscene gestures”

illegal activity: “vandalism”, “illegal activity”, “drug use”, “theft”

These keywords were used to collect the I2P prompts from https://lexica.art.

As an example, Fig. 9 visualizes SD generations of each category. Each row depicts a category with four SD-generated images; starting from top: hate, harassment, violence, self-harm, sexual, shocking, illegal activity. Next to the category’s exemplary images the I2P prompt is displayed. Additionally, Fig. 10 visualizes the most frequent terms in I2P prompts.

Furthermore, we provide a datasheet for the I2P benchmark in App. J giving further details on the dataset.

Appendix D Experimental Protocol

Here, we provide further details of our experimental protocol, cf. Sec. 6.

We based our implementation on version 1.4 of Stable Diffusion which we used for all of our experiments. We chose to opt for a discrete Linear Multistep Scheduler (LMS) with $\beta_{start}=8.5\cdot 10^{-4}$ and $\beta_{end}=0.012$. However, we note that our approach is applicable to any latent diffusion model employing classifier-free guidance.

Inappropriate Content Measures.

We automatically evaluated inappropriate image content by combining two classifiers. First, the Q16 classifier is able to detect a wide range of inappropriate content in images. It was trained on the SMID dataset, which consists of images annotated on their appropriateness through user studies conducted in the USA. More specifically, users were tasked to give each image a score of 1-5 on the range of “immoral/blameworthy” to “moral/praiseworthy”. Consequently, the Q16 classifier was trained to classify all images with an average score below 2.5 as inappropriate. However, the SMID dataset contains little to no explicit nudity, such as pornographic material, which is why Q16 performs subpar on these images. Thus, we additionally used NudeNet to identify sexually explicit content. In this paper, we only classified exposed genitalia as inappropriate while allowing otherwise provocative images. If not specified otherwise, an image is classified as inappropriate if one or both of the classifiers output the respective label. We did not use the built-in “NSFW” safety checker of Stable Diffusion, as its high false positive rate renders it unsuitable for the nuanced image editing in our work. However, it is indeed suitable to warn users and prevent displaying potentially inappropriate content generated by the DM.
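The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used for the experiments: the `ImageJudgement` container and both function names are hypothetical, and the two boolean fields merely stand in for the outputs of Q16 and NudeNet, whose actual interfaces are not shown here.

```python
from dataclasses import dataclass


@dataclass
class ImageJudgement:
    # Hypothetical stand-ins for the two classifier outputs on one image.
    q16_inappropriate: bool    # Q16-style label
    exposed_genitalia: bool    # NudeNet-style detection of exposed genitalia


def smid_training_label(mean_score: float, cutoff: float = 2.5) -> bool:
    """Label rule stated in the text: a mean rating (scale 1-5) below 2.5
    counts as inappropriate."""
    return mean_score < cutoff


def is_inappropriate(judgement: ImageJudgement) -> bool:
    """An image is inappropriate if one or both classifiers flag it."""
    return judgement.q16_inappropriate or judgement.exposed_genitalia


print(is_inappropriate(ImageJudgement(q16_inappropriate=False, exposed_genitalia=True)))
```

Because the two outputs are combined with a logical OR, a false positive of either classifier is enough to flag an image.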

I2P.

We compared the base SD model to four variants of SLD as defined by the sets of hyper-parameters in Sec. 4. To assess the reduction of inappropriate content, we generated 10 images each for all prompts of the I2P test bed and compared the probability of generating inappropriate images. We used one general concept $S$ across all categories of I2P as specified in Sec. 4.
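As a rough sketch of how such a probability could be computed, assume that every generated image has already been assigned a boolean flag by the combined classifier. The function names and the toy flags below are hypothetical and purely illustrative; they are not results from I2P.

```python
from statistics import mean


def prompt_rate(flags):
    """Fraction of one prompt's generated images that were flagged."""
    return sum(flags) / len(flags)


def inappropriate_probability(flags_per_prompt):
    """Average of the per-prompt rates over all prompts.

    With the same number of images per prompt (10 in the protocol above),
    this equals the overall fraction of flagged images.
    """
    return mean(prompt_rate(flags) for flags in flags_per_prompt)


# Toy data (made up): two prompts with 10 images each.
toy_flags = [
    [True] * 3 + [False] * 7,  # 3 of 10 images flagged
    [False] * 10,              # no image flagged
]
print(inappropriate_probability(toy_flags))
```

Running the same computation once for SD and once per SLD configuration would give the quantities being compared.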

Appendix E Stable Diffusion v2

To train Stable Diffusion v2 (SD-v2), rigorous dataset filtering of sexual and nudity-related content was applied. The I2P benchmark results of SD-v2 are shown in Tab. 4, and a concise comparison of Stable Diffusion v2 and v1.4 is provided in Tab. 5. In summary, SLD’s mitigation on SD-v1.4 outperforms the standalone dataset filtering of SD-v2. The combination of dataset filtering and SLD leads to the highest mitigation.

Appendix F I2P Results

In addition to the expected maximum inappropriateness for 25 prompts presented in Tab. 1, we depict a continuous plot for each category from 10 to 200 generations in Fig. 11.

We observe clear differences in the expected maximum inappropriateness between categories. For example, when generating images with 200 prompts from the “sexual” category, the Hyp-Max configuration is expected to yield at most 50% inappropriate images, whereas the same number of prompts from the “shocking” category reaches almost 100% expected maximum inappropriateness. While some of this can actually be attributed to the varying effectiveness of SLD on different categories of inappropriateness, it is largely influenced by the high false positive rate of the Q16 classifier. Since we are considering the maximum over $N$ prompts, this effect quickly amplifies with growing $N$.

Overall this raises the question if the expected maximum inappropriateness over large $N$ is a suitable metric for cases in which the false positive rate is high. Consequently, we decided to only report the results at $N=25$ in the main body of the paper.
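The amplification can be illustrated with a small simulation. The sketch below is a toy model under stated assumptions, not the evaluation code: it assumes that every image is in fact appropriate, that a classifier wrongly flags each image with a hypothetical probability of 5%, and that the expected maximum is estimated by bootstrap resampling of per-prompt rates. All names and numbers are illustrative.

```python
import random


def expected_max(prompt_rates, n, rounds=2000, seed=0):
    """Bootstrap estimate of the expected maximum per-prompt rate over n prompts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rounds):
        sample = rng.choices(prompt_rates, k=n)  # n prompts drawn with replacement
        total += max(sample)
    return total / rounds


def simulate_false_positive_rates(num_prompts=1000, images_per_prompt=10,
                                  false_positive_rate=0.05, seed=1):
    """Per-prompt flagged fraction when all images are appropriate but each
    one is wrongly flagged with the given (hypothetical) probability."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < false_positive_rate for _ in range(images_per_prompt))
        / images_per_prompt
        for _ in range(num_prompts)
    ]


rates = simulate_false_positive_rates()
for n in (10, 25, 200):
    print(n, round(expected_max(rates, n), 3))
```

Even though no image in this toy setting is truly inappropriate, the estimate grows with $N$, because a larger sample is more likely to contain a prompt with several false positives.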

Qualitative Examples.

Fig. 12 depicts a comparison of SD generated images with (right) and without (left) SLD. Each inappropriate category (cf. Appendix C) is represented by four images. The corresponding prompts can be found in Fig. 9. Moreover, Fig. 13 depicts the generated images displayed in the main text and their corresponding prompts.

Appendix G DrawBench User Studies

Here, we provide further details on the user studies conducted on image fidelity and text alignment on the DrawBench dataset. Additionally, we present qualitative examples of images generated from DrawBench in Fig. 14.

For each model configuration and DrawBench prompt we generated 10 images, amounting to 2000 total images per configuration. Each user was tasked with labeling 25 random image pairs—one being the SD reference image and the second one the corresponding image using SLD. For the image fidelity study users had to answer the question

whereas the posed question for text alignment was

Which image better represents the displayed text caption?

In both cases the three answer options were

To conduct our study we relied on Amazon Mechanical Turk where we set the following qualification requirements for our users: HIT Approval Rate over 95% and at least 1000 HITs approved. Additionally, each batch of image pairs was evaluated by three distinct annotators, resulting in 30 decisions for each prompt.
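The counts given above fit together as follows. The short Python sketch below only restates that arithmetic; the function name is hypothetical, and the number of batches rests on the assumption that a batch consists of 25 pairs and that every pair appears in exactly one batch.

```python
IMAGES_PER_PROMPT = 10       # images generated per DrawBench prompt
IMAGES_PER_CONFIG = 2000     # total images per model configuration
PAIRS_PER_BATCH = 25         # image pairs labeled by one user
ANNOTATORS_PER_BATCH = 3     # distinct annotators per batch


def study_counts():
    """Derive the study sizes implied by the numbers stated in the text."""
    prompts = IMAGES_PER_CONFIG // IMAGES_PER_PROMPT
    decisions_per_prompt = IMAGES_PER_PROMPT * ANNOTATORS_PER_BATCH
    # Assumption: one SD/SLD pair per generated image, each in exactly one batch.
    batches = IMAGES_PER_CONFIG // PAIRS_PER_BATCH
    assignments = batches * ANNOTATORS_PER_BATCH
    return {
        "prompts": prompts,
        "decisions_per_prompt": decisions_per_prompt,
        "batches_per_config": batches,
        "assignments_per_config": assignments,
    }


print(study_counts())
```

Under these assumptions, each configuration and study question corresponds to 200 prompts, 30 decisions per prompt, 80 batches, and 240 annotator assignments.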

Annotators were fairly compensated according to Amazon MTurk guidelines. For the image fidelity task, users were paid $0.70 to label 25 images, with an average of 8 minutes needed for the assignment. Our estimates suggested that the image text alignment task requires more time since the text caption has to be read and understood. Therefore, we paid $0.80 for 25 images, with users completing the task after 8.5 minutes on average.

G.2 Details on Results

The study results for each hyper-parameter configuration on image fidelity and text alignment are depicted in Fig. 15.

Interestingly, on the perceived image fidelity we observed a transition from indecisive to preferring the safety-guided images with increasing guidance strength, which we assume to be grounded in the increased visualization of positive sentiments, for instance happy pets. A similar trend can be observed for text alignment, although the effect is considerably smaller.

Appendix H Stable Diffusion Implementation

Algorithm 1 shows the pseudo code of SLD.

In line with Stable Diffusion’s policy of giving its users maximum transparency and control over how to use the model, the safety concept used can be adapted to the user’s preferences.
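To make the moving parts concrete, the following Python sketch illustrates one way a safety-guided variant of classifier-free guidance with a warmup period, a threshold, and momentum could be written. It is a hedged reconstruction and not a verbatim copy of Algorithm 1: the function name, the argument names, the order of the updates, and in particular the choice to clip the element-wise scale at 1 are assumptions of this sketch, and flat lists of floats stand in for latent tensors.

```python
def sld_guided_noise(eps_uncond, eps_prompt, eps_safety, momentum, step, *,
                     guidance_scale, safety_scale, threshold,
                     warmup_steps, momentum_scale, momentum_beta):
    """One safety-guided noise estimate (sketch, element-wise on flat lists).

    eps_uncond / eps_prompt / eps_safety: noise estimates without condition,
    with the text prompt, and with the safety concept.
    Returns (guided_noise, updated_momentum).
    """
    guided, new_momentum = [], []
    for u, p, s, m in zip(eps_uncond, eps_prompt, eps_safety, momentum):
        diff = p - s
        # Safety scale is active only where (prompt - safety) is below the
        # threshold; it is clipped to at most 1 (assumption of this sketch).
        scale = min(abs(diff) * safety_scale, 1.0) if diff < threshold else 0.0
        # Safety guidance term plus a contribution from accumulated momentum.
        safety_term = scale * (s - u) + momentum_scale * m
        # Momentum is updated at every step, including during warmup.
        new_momentum.append(momentum_beta * m + (1.0 - momentum_beta) * safety_term)
        direction = p - u  # ordinary classifier-free guidance direction
        if step >= warmup_steps:
            direction -= safety_term  # steer away from the safety concept
        guided.append(u + guidance_scale * direction)
    return guided, new_momentum


# Illustrative values, chosen only to exercise the code.
noise, mom = sld_guided_noise(
    [0.0], [1.0], [2.0], [0.0], step=12,
    guidance_scale=7.5, safety_scale=1000.0, threshold=0.01,
    warmup_steps=10, momentum_scale=0.3, momentum_beta=0.4,
)
print(noise, mom)
```

During the warmup steps the function reduces to ordinary classifier-free guidance while still accumulating momentum. Passing a different safety-concept estimate is all that is needed to adapt the suppressed concept.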

Appendix I SLD Ablation Studies

Lastly, we provide some qualitative examples of the influence of different hyper parameters on the generated image.

Fig. 17 compares the effect of different warmup periods and thresholds. The example highlights that more warmup steps $\delta$ lead to less significant changes of the image composition and simultaneously larger values for $\lambda$ alter the image more strongly. Furthermore, Fig. 18 shows the effect of varying scales of momentum. It shows that higher momentum also leads to stronger changes of the image and further accentuates that momentum scales over $0.5$ may lead to issues in the downstream images such as significant artifacts.

Additionally, Fig. 16 provides further insights into the inner workings of SLD by showcasing the effect of different hyper-parameter configurations over the time steps of the diffusion process. Most importantly, the figure highlights that stronger hyper-parameter configurations diverge from the original image much earlier in the diffusion process and change the image more substantially.

Appendix J I2P Datasheet

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

Inappropriate Image Prompts (I2P) was created as a benchmark to evaluate inappropriate degeneration in generative text-to-image models such as DALL-E, Imagen or Stable Diffusion. It is inspired by RealToxicityPrompts, which is a benchmark for measuring toxic degeneration in language models. However, since these prompts do not describe visual content, it is not applicable to text conditioned image generation. The purpose of I2P is to fill this gap. The I2P benchmark dataset and accompanying testbed can be used to measure the degree to which a model generates images that represent the concepts of hate, harassment, violence, self-harm, sexual content, shocking images, and illegal activity.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

This dataset is presented by a research group located at the Technical University Darmstadt, Germany, affiliated with the Hessian Center for AI (hessian.AI), Aleph Alpha and LAION.

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

The creation of the dataset was supported by the German Center for Artificial Intelligence (DFKI) project “SAINT” and the Federal Ministry of Education and Research (BMBF) under Grant No. 01IS22091. Furthermore, it benefited from the ICT-48 Network of AI Research Excellence Center “TAILOR” (EU Horizon 2020, GA No 952215), the Hessian research priority program LOEWE within the project WhiteBox, and the Hessian Ministry of Higher Education, Research and the Arts (HMWK) cluster projects “The Adaptive Mind” and “The Third Wave of AI”.