An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or

Introduction

In a famous scene from the motion picture “Titanic”, Rose makes a request of Jack: “…draw me like one of your French girls”. Albeit simple, this request contains a wealth of information. It indicates that Jack should produce a drawing; it suggests that its style and composition should match those of a subset of Jack’s prior work; finally, through a single word, “me”, Rose indicates that this drawing should portray a specific, unique subject: Rose herself. In making her request, Rose relies on Jack’s ability to reason over these concepts, both broad and specific, and bring them to life in a new creation.

Recently, large-scale text-to-image models (Rombach et al., 2021; Ramesh et al., 2021, 2022; Nichol et al., 2021; Yu et al., 2022; Saharia et al., 2022) have demonstrated an unprecedented capability to reason over natural language descriptions. They allow users to synthesize novel scenes with unseen compositions and produce vivid pictures in a myriad of styles. These tools have been used for artistic creation, as sources of inspiration, and even to design new, physical products (Yacoubian, 2022). Their use, however, is constrained by the user’s ability to describe the desired target through text. Turning back to Rose, one could then ask: How might she frame her request if she were to approach one of these models? How could we, as users, ask text-to-image models to craft a novel scene containing a cherished childhood toy? Or to pull our child’s drawing from its place on the fridge, and turn it into an artistic showpiece?

Introducing new concepts into large scale models is often difficult. Re-training a model with an expanded dataset for each new concept is prohibitively expensive, and fine-tuning on few examples typically leads to catastrophic forgetting (Ding et al., 2022; Li et al., 2022). More measured approaches freeze the model and train transformation modules to adapt its output when faced with new concepts (Zhou et al., 2021; Gao et al., 2021; Skantze & Willemsen, 2022). However, these approaches are still prone to forgetting prior knowledge, or face difficulties in accessing it concurrently with newly learned concepts (Kumar et al., 2022; Cohen et al., 2022).

We propose to overcome these challenges by finding new words in the textual embedding space of pre-trained text-to-image models. We consider the first stage of the text encoding process (Figure 2). Here, an input string is first converted to a set of tokens. Each token is then replaced with its own embedding vector, and these vectors are fed through the downstream model. Our goal is to find new embedding vectors that represent new, specific concepts.

We represent a new embedding vector with a new pseudo-word (Rathvon, 2004) which we denote by $S_{*}$ . This pseudo-word is then treated like any other word, and can be used to compose novel textual queries for the generative models. One can therefore ask for “a photograph of $S_{*}$ on the beach”, “an oil painting of a $S_{*}$ hanging on the wall”, or even compose two concepts, such as “a drawing of $S^{1}_{*}$ in the style of $S^{2}_{*}$ ”. Importantly, this process leaves the generative model untouched. In doing so, we retain the rich textual understanding and generalization capabilities that are typically lost when fine-tuning vision and language models on new tasks.

To find these pseudo-words, we frame the task as one of inversion. We are given a fixed, pre-trained text-to-image model and a small (3-5) image set depicting the concept. We aim to find a single word embedding, such that sentences of the form “A photo of $S_{*}$ ” will lead to the reconstruction of images from our small set. This embedding is found through an optimization process, which we refer to as “Textual Inversion”.

We further investigate a series of extensions based on tools typically used in Generative Adversarial Network (GAN) inversion. Our analysis reveals that, while some core principles remain, applying the prior art in a naïve way is either unhelpful or actively harmful.

We demonstrate the effectiveness of our approach over a wide range of concepts and prompts, showing that it can inject unique objects into new scenes, transform them across different styles, transfer poses, diminish biases, and even imagine new products.

In summary, our contributions are as follows:

We introduce the task of personalized text-to-image generation, where we synthesize novel scenes of user-provided concepts guided by natural language instruction.

We present the idea of “Textual Inversions” in the context of generative models. Here the goal is to find new pseudo-words in the embedding space of a text encoder that can capture both high-level semantics and fine visual details.

We analyze the embedding space in light of GAN-inspired inversion techniques and demonstrate that it also exhibits a tradeoff between distortion and editability. We show that our approach resides on an appealing point on the tradeoff curve.

We evaluate our method against images generated using user-provided captions of the concepts and demonstrate that our embeddings provide higher visual fidelity, and also enable more robust editing.

Related work

Text-guided image synthesis has been widely studied in the context of GANs (Goodfellow et al., 2014). Typically, a conditional model is trained to reproduce samples from given paired image-caption datasets (Zhu et al., 2019; Tao et al., 2020), leveraging attention mechanisms (Xu et al., 2018) or cross-modal contrastive approaches (Zhang et al., 2021; Ye et al., 2021). More recently, impressive visual results were achieved by leveraging large scale auto-regressive (Ramesh et al., 2021; Yu et al., 2022) or diffusion models (Ramesh et al., 2022; Saharia et al., 2022; Nichol et al., 2021; Rombach et al., 2021).

Rather than training conditional models, several approaches employ test-time optimization to explore the latent spaces of a pre-trained generator (Crowson et al., 2022; Murdock, 2021; Crowson, 2021). These models typically guide the optimization to minimize a text-to-image similarity score derived from an auxiliary model such as CLIP (Radford et al., 2021).

Moving beyond pure image generation, a large body of work explores the use of text-based interfaces for image editing (Patashnik et al., 2021; Abdal et al., 2021; Avrahami et al., 2022b), generator domain adaptation (Gal et al., 2021; Kim et al., 2022), video manipulation (Tzaban et al., 2022; Bar-Tal et al., 2022), motion synthesis (Tevet et al., 2022; Petrovich et al., 2022), style transfer (Kwon & Ye, 2021; Liu et al., 2022) and even texture synthesis for 3D objects (Michel et al., 2021).

Our approach builds on the open-ended, conditional synthesis models. Rather than training a new model from scratch, we show that we can expand a frozen model’s vocabulary and introduce new pseudo-words that describe specific concepts.

GAN inversion.

Manipulating images with generative networks often requires one to find a corresponding latent representation of the given image, a process referred to as inversion (Zhu et al., 2016; Xia et al., 2021). In the GAN literature, this inversion is done through either an optimization-based technique (Abdal et al., 2019, 2020; Zhu et al., 2020b; Gu et al., 2020) or by using an encoder (Richardson et al., 2020; Zhu et al., 2020a; Pidhorskyi et al., 2020; Tov et al., 2021). Optimization methods directly optimize a latent vector, such that feeding it through the GAN will re-create a target image. Encoders leverage a large image set to train a network that maps images to their latent representations.

In our work, we follow the optimization approach, as it can better adapt to unseen concepts. Encoders face harsher generalization requirements, and would likely need to be trained on web-scale data to offer the same freedom. We further analyze our embedding space in light of the GAN-inversion literature, outlining the core principles that remain and those that do not.

Diffusion-based inversion.

In the realm of diffusion models, inversion can be performed naïvely by adding noise to an image and then de-noising it through the network. However, this process tends to change the image content significantly. Choi et al. (2021) improve inversion by conditioning the denoising process on noised low-pass filter data from the target image. Dhariwal & Nichol (2021) demonstrate that the DDIM (Song et al., 2020) sampling process can be inverted in a closed-form manner, extracting a latent noise map that will produce a given real image. DALL-E 2 (Ramesh et al., 2022) builds on this method and demonstrates that it can be used to induce changes in the image, such as cross-image interpolations or semantic editing. The latter relies on the use of CLIP-based codes to condition the model, and may not be applicable to other methods.

Whereas the above works invert a given image into the model’s latent space, we invert a user-provided concept. Moreover, we represent this concept as a new pseudo-word in the model’s vocabulary, allowing for more general and intuitive editing.

Personalization.

Adapting models to a specific individual or object is a long-standing goal in machine learning research. Personalized models are typically found in the realms of recommendation systems (Benhamdi et al., 2017; Amat et al., 2018; Martinez et al., 2009; Cho et al., 2002) or in federated learning (Mansour et al., 2020; Jiang et al., 2019; Fallah et al., 2020; Shamsian et al., 2021).

More recently, personalization efforts can also be found in vision and graphics. There it is typical to apply a delicate tuning of a generative model to better reconstruct specific faces or scenes (Bau et al., 2019; Roich et al., 2021; Alaluf et al., 2021; Dinh et al., 2022; Cao et al., 2022; Nitzan et al., 2022).

Most relevant to our work is PALAVRA (Cohen et al., 2022), which leverages a pre-trained CLIP model for retrieval and segmentation of personalized objects. PALAVRA identifies pseudo-words in the textual embedding space of CLIP that refer to a specific object. These are then used to describe images for retrieval, or in order to segment specific objects in a scene. However, their task and losses are both discriminative, aiming to separate the object from other candidates. As we later show (Figure 5), their approach fails to capture the details required for plausible reconstructions or synthesis in new scenes.

Method

Our goal is to enable language-guided generation of new, user-specified concepts. To do so, we aim to encode these concepts into an intermediate representation of a pre-trained text-to-image model. Ideally, this should be done in a manner that would allow us to leverage the rich semantic and visual prior represented by such a model, and use it to guide intuitive visual transformations of the concepts.

It is natural to search for candidates for such a representation in the word-embedding stage of the text encoders typically employed by text-to-image models. There, the discrete input text is first converted into a continuous vector representation that is amenable to direct optimization.

Prior work has shown that this embedding space is expressive enough to capture basic image semantics (Cohen et al., 2022; Tsimpoukelli et al., 2021). However, these approaches leveraged contrastive or language-completion objectives, neither of which require an in-depth visual understanding of the image. As we demonstrate in Section 4, those methods fail to accurately capture the appearance of the concept, and attempting to employ them for synthesis leads to considerable visual corruption. Our goal is to find pseudo-words that can guide generation, which is a visual task. As such, we propose to find them through a visual reconstruction objective.

Below, we outline the core details of applying our approach to a specific class of generative models — Latent Diffusion Models (Rombach et al., 2021). In Section 5, we then analyze a set of extensions to this approach, motivated by GAN-inversion literature. However, as we later show, these additional complexities fail to improve upon the initial representation, presented here.

We implement our method over Latent Diffusion Models (LDMs) (Rombach et al., 2021), a recently introduced class of Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) that operate in the latent space of an autoencoder.

LDMs consist of two core components. First, an autoencoder is pre-trained on a large collection of images. An encoder $\mathcal{E}$ learns to map images $x\in\mathcal{D}_{x}$ into a spatial latent code $z=\mathcal{E}(x)$ , regularized through either a KL-divergence loss or through vector quantization (Van Den Oord et al., 2017; Agustsson et al., 2017). The decoder $D$ learns to map such latents back to images, such that $D\left(\mathcal{E}(x)\right)\approx x$ .

The second component, a diffusion model, is trained to produce codes within the learned latent space. This diffusion model can be conditioned on class labels, segmentation masks, or even on the output of a jointly trained text-embedding model. Let $c_{\theta}(y)$ be a model that maps a conditioning input $y$ into a conditioning vector. The LDM loss is then given by:

$$L_{LDM}:=\mathbb{E}_{z\sim\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\big\|\epsilon-\epsilon_{\theta}\big(z_{t},t,c_{\theta}(y)\big)\big\|_{2}^{2}\Big], \tag{1}$$

where $t$ is the time step, $z_{t}$ is the latent noised to time $t$ , $\epsilon$ is the unscaled noise sample, and $\epsilon_{\theta}$ is the denoising network. Intuitively, the objective here is to correctly remove the noise added to a latent representation of an image. While training, $c_{\theta}$ and $\epsilon_{\theta}$ are jointly optimized to minimize the LDM loss. At inference time, a random noise tensor is sampled and iteratively denoised to produce a new image latent, $z_{0}$ . Finally, this latent code is transformed into an image through the pre-trained decoder $x^{\prime}=D(z_{0})$ .
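As a rough illustration of the objective described above, the following sketch noises a toy latent and scores two stand-in denoisers by the squared error between the sampled and the predicted noise. The noising rule is the standard DDPM parameterization; the linear `alpha_bar` schedule, the list-based latent and both stand-in predictions are assumptions made for illustration and are not taken from the LDM implementation.

```python
import math
import random

def alpha_bar(t, num_steps=1000):
    """Hypothetical cumulative signal level: 1 at t=0, close to 0 at t=num_steps."""
    return 1.0 - t / (num_steps + 1)

def add_noise(z0, eps, t):
    """Noise the latent to time t: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the sampled noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # toy latent, standing in for E(x)
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # unscaled noise sample
t = 500
z_t = add_noise(z0, eps, t)

# Stand-in predictions: one that always outputs zero noise, and an oracle
# that is handed z0 and can therefore recover eps exactly from z_t.
a = alpha_bar(t)
zero_pred = [0.0] * len(z_t)
oracle_pred = [(zt - math.sqrt(a) * z) / math.sqrt(1.0 - a) for zt, z in zip(z_t, z0)]

loss_zero = noise_prediction_loss(eps, zero_pred)
loss_oracle = noise_prediction_loss(eps, oracle_pred)
print(loss_zero > loss_oracle)
```

A trained denoiser sits between these two extremes: it has no access to the clean latent, so it has to infer the noise from the noised latent, the time step and the conditioning vector.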

We employ the publicly available 1.4 billion parameter text-to-image model of Rombach et al. (2021), which was pre-trained on the LAION-400M dataset (Schuhmann et al., 2021). Here, $c_{\theta}$ is realized through a BERT (Devlin et al., 2018) text encoder, with $y$ being a text prompt.

We next review the early stages of such a text encoder, and our choice of inversion space.

Text embeddings.

Typical text encoder models, such as BERT, begin with a text processing step (Figure 2, left). First, each word or sub-word in an input string is converted to a token, which is an index in some pre-defined dictionary. Each token is then linked to a unique embedding vector that can be retrieved through an index-based lookup. These embedding vectors are typically learned as part of the text encoder $c_{\theta}$ .

In our work, we choose this embedding space as the target for inversion. Specifically, we designate a placeholder string, $S_{*}$ , to represent the new concept we wish to learn. We intervene in the embedding process and replace the vector associated with the tokenized string with a new, learned embedding $v_{*}$ , in essence “injecting” the concept into our vocabulary. In doing so, we can then compose new sentences containing the concept, just as we would with any other word.
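The lookup-and-replace step described above can be sketched as follows. The five-entry vocabulary, the three-dimensional vectors, the whitespace tokenizer and the value of `v_star` are all hypothetical stand-ins chosen for readability; a real text encoder uses a sub-word tokenizer and a much larger embedding table.

```python
# Toy vocabulary and embedding table; sizes and values are illustrative only.
vocab = {"a": 0, "photo": 1, "of": 2, "cat": 3, "S_*": 4}
embedding_table = [
    [0.1, 0.0, 0.2],  # "a"
    [0.3, 0.5, 0.1],  # "photo"
    [0.0, 0.2, 0.4],  # "of"
    [0.7, 0.1, 0.6],  # "cat"
    [0.0, 0.0, 0.0],  # "S_*" (placeholder row, overridden by v_star below)
]

def tokenize(text):
    """Whitespace tokenizer standing in for a real sub-word tokenizer."""
    return [vocab[word] for word in text.split()]

def embed(text, v_star):
    """Look up one vector per token, substituting v_star for the placeholder."""
    placeholder_id = vocab["S_*"]
    return [
        list(v_star) if token == placeholder_id else list(embedding_table[token])
        for token in tokenize(text)
    ]

v_star = [0.9, -0.2, 0.4]  # hypothetical learned embedding
vectors = embed("a photo of S_*", v_star)
print(vectors)
```

Every other token keeps its original vector, which is why the placeholder can be dropped into arbitrary sentences without touching the rest of the encoder.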

Textual inversion.

To find these new embeddings, we use a small set of images (typically $3$ - $5$ ), which depicts our target concept across multiple settings such as varied backgrounds or poses. We find $v_{*}$ through direct optimization, by minimizing the LDM loss of Equation 1 over images sampled from the small set. To condition the generation, we randomly sample neutral context texts, derived from the CLIP ImageNet templates (Radford et al., 2021). These contain prompts of the form “A photo of $S_{*}$ ”, “A rendition of $S_{*}$ ”, etc. The full list of templates is provided in the supplementary materials.
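A minimal sketch of this sampling step is given below. Only the two templates quoted in the text are listed, so the `TEMPLATES` list is deliberately incomplete; the file names and helper functions are hypothetical.

```python
import random

# Only the two templates quoted in the text; the full list is longer.
TEMPLATES = ["A photo of {}", "A rendition of {}"]

def sample_prompt(placeholder, rng):
    """Fill a randomly chosen neutral template with the placeholder string."""
    return rng.choice(TEMPLATES).format(placeholder)

def sample_training_pair(image_paths, placeholder, rng):
    """Pair a random image from the small set with a random neutral prompt."""
    return rng.choice(image_paths), sample_prompt(placeholder, rng)

rng = random.Random(0)
images = ["concept_0.jpg", "concept_1.jpg", "concept_2.jpg"]  # hypothetical file names
image, prompt = sample_training_pair(images, "S_*", rng)
print(image, "|", prompt)
```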

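Our optimization goal can then be defined as:

$$v_{*}=\operatorname*{arg\,min}_{v}\;\mathbb{E}_{z\sim\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\big\|\epsilon-\epsilon_{\theta}\big(z_{t},t,c_{\theta}(y)\big)\big\|_{2}^{2}\Big], \tag{2}$$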

and is realized by re-using the same training scheme as the original LDM model, while keeping both $c_{\theta}$ and $\epsilon_{\theta}$ fixed. Notably, this is a reconstruction task. As such, we expect it to motivate the learned embedding to capture fine visual details unique to the concept.
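The key property of this optimization, that gradients update only the embedding while the networks stay fixed, can be illustrated with a toy example. The sketch below assumes a linear stand-in for the frozen denoiser, omits the time step, and builds its targets from a hidden vector `v_true` so that a perfect embedding exists; none of this reflects the actual LDM architecture or training code.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def frozen_denoiser(z_t, v, W_z, W_c):
    """Toy linear stand-in for the denoising network; W_z and W_c are never updated."""
    return [a + b for a, b in zip(matvec(W_z, z_t), matvec(W_c, v))]

def loss_and_grad(z_t, eps, v, W_z, W_c):
    """Squared error ||eps - eps_pred||^2 and its gradient with respect to v only."""
    residual = [e - p for e, p in zip(eps, frozen_denoiser(z_t, v, W_z, W_c))]
    loss = sum(r * r for r in residual)
    # d/dv ||eps - (W_z z_t + W_c v)||^2 = -2 * W_c^T residual
    grad = [
        -2.0 * sum(W_c[i][j] * residual[i] for i in range(len(residual)))
        for j in range(len(v))
    ]
    return loss, grad

rng = random.Random(0)
dim_z, dim_v = 4, 3
W_z = [[rng.uniform(-1, 1) for _ in range(dim_z)] for _ in range(dim_z)]
W_c = [[rng.uniform(-1, 1) for _ in range(dim_v)] for _ in range(dim_z)]
v_true = [0.5, -1.0, 0.25]  # hidden vector that makes the toy targets consistent
v = [0.0, 0.0, 0.0]         # the only trainable quantity
lr = 0.05

losses = []
for step in range(500):
    z_t = [rng.gauss(0, 1) for _ in range(dim_z)]
    eps = frozen_denoiser(z_t, v_true, W_z, W_c)  # toy target noise
    loss, grad = loss_and_grad(z_t, eps, v, W_z, W_c)
    v = [vi - lr * gi for vi, gi in zip(v, grad)]  # update the embedding only
    losses.append(loss)

print(losses[0], losses[-1])
```

In the real setting the gradient reaches the embedding through the frozen text encoder and denoiser via backpropagation rather than through a closed-form expression.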

Implementation details.

Unless otherwise noted, we retain the original hyper-parameter choices of LDM (Rombach et al., 2021). Word embeddings were initialized with the embeddings of a single-word coarse descriptor of the object (e.g. “sculpture” and “cat” for the two concepts in Figure 1). Our experiments were conducted using $2\times$ V100 GPUs with a batch size of 4. The base learning rate was set to $0.005$ . Following LDM, we further scale the base learning rate by the number of GPUs and the batch size, for an effective rate of $0.04$ . All results were produced using $5,000$ optimization steps. We find that these parameters work well for most cases. However, we note that for some concepts, better results can be achieved with fewer steps or with an increased learning rate.
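The effective learning rate follows directly from the scaling rule stated above; the variable names in this short check are illustrative.

```python
base_lr = 0.005   # base learning rate
num_gpus = 2      # 2x V100
batch_size = 4

# Scale the base rate by the number of GPUs and the batch size.
effective_lr = base_lr * num_gpus * batch_size
print(effective_lr)
```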

Qualitative comparisons and applications

In the following section, we demonstrate a range of applications enabled through Textual Inversions, and provide visual comparisons to the state-of-the-art and human-captioning baselines.

We begin by demonstrating our ability to capture and recreate variations of an object using a single pseudo-word. In Figure 3 we compare our method to two baselines: LDM guided by a human caption and DALLE-2 guided by either a human caption or an image prompt. Captions were collected using Mechanical Turk. Annotators were provided with four images of a concept and asked to describe it in a manner that could allow an artist to recreate it. We asked for both a short ( $\leq 12$ words) and a long ( $\leq 30$ words) caption. In total, we collected $10$ captions per concept — five short and five long. Figure 3 shows multiple results generated with a randomly chosen caption for each setup. Additional large-scale galleries showing our uncurated reconstructions are provided in the supplementary.

As our results demonstrate, our method better captures the unique details of the concept. Human captioning typically captures the most prominent features of an object, but provides insufficient detail to reconstruct finer features like color patterns (e.g. of the teapot). In some cases (e.g. the skull mug) the object itself may be exceedingly difficult to describe through natural language. When provided with an image, DALLE-2 is able to recreate more appealing samples, particularly for well-known objects with limited detail (Aladdin’s lamp). However, it still struggles with unique details of personalized objects that the image encoder (CLIP) is unlikely to have seen (mug, teapot). In contrast, our method can successfully capture these finer details, and it does so using only a single word embedding. However, note that while our creations are more similar to the source objects, they are still variations that may differ from the source.

Text-guided synthesis

In Figures 4 and 1 we show our ability to compose novel scenes by incorporating the learned pseudo-words into new conditioning texts. For each concept, we show exemplars from our training set, along with an array of generated images and their conditioning texts. As our results demonstrate, the frozen text-to-image model is able to jointly reason over both the new concepts and its large body of prior knowledge, bringing them together in a new creation. Importantly, despite the fact that our training goal was generative in nature, our pseudo-words still encapsulate semantic concepts that the model can then leverage. For example, observe the bowl’s ability (row four) to contain other objects like food, or the ability to preserve the Furby’s bird-like head and crown while adapting his palette to better match a prompt (album cover, row three). Additional concepts and texts are provided in the supplementary materials.

To better evaluate our ability to compose objects into new scenes, we compare our method to several personalization baselines (Figure 5). In particular, we consider the recent PALAVRA (Cohen et al., 2022), which is most similar to our own work. PALAVRA encodes object sets into the textual embedding space of CLIP, using a mix of contrastive learning and cyclic consistency goals. We find a new pseudo-word using their approach and use it to synthesize new images by leveraging VQGAN-CLIP (Crowson et al., 2022) and CLIP-Guided Diffusion (Crowson, 2021). As a second baseline, we apply the CLIP-guided models of Crowson et al. while trying to jointly minimize the CLIP-based distances to both the training set images and to the target text (VQGAN-CLIP) or by initializing the optimization with an input image from our set (Guided Diffusion). For the latter, we chose image-based initializations as we observed that they outperform the use of images in the optimization loss. Similar observations were reported in Disco Diffusion (Letts et al., 2021).

The images produced by PALAVRA (rows $2$ , $3$ ) typically contain elements from the target prompt (e.g. a beach, a moon) but they fail to accurately capture the concept and display considerable visual corruption. This is unsurprising, as PALAVRA was trained with a discriminative goal. In their case, the model needs to only encode enough information to distinguish between two typical concepts (e.g. it may be sufficient to remember the mug was black-and-white with text-like symbols). Moreover, their word-discovery process had no need to remain in regions of the embedding space that contain embedding vectors that can be mapped to outputs on the natural image manifold. In the case of the text-and-image guided synthesis methods (rows $4$ , $5$ ), results appear more natural and closer to the source image, but they fail to generalize to new texts. Moreover, as our method builds upon pre-trained, large-scale text-to-image synthesis models, we can optimize a single pseudo-word and re-use it for a multitude of new generations. The baseline models, meanwhile, use CLIP for test-time optimization and thus require expensive optimization for every new creation.

Style transfer

A typical use-case for text-guided synthesis is in artistic circles, where users aim to draw upon the unique style of a specific artist and apply it to new creations. Here, we show that our model can also find pseudo-words representing a specific, unknown style. To find such pseudo-words, we simply provide the model with a small set of images with a shared style, and replace the training texts with prompts of the form: “A painting in the style of $S_{*}$ ”. Results are shown in Figure 6. They serve as further demonstration that our ability to capture concepts extends beyond simple object reconstructions and into more abstract ideas.

Note that this differs from traditional style transfer, as we do not necessarily wish to maintain the content of some input image. Instead, we offer the network the freedom to decide how to depict the subject, and merely ask for an appropriate style.

Concept compositions

In Figure 7 we demonstrate compositional synthesis, where the guiding text contains multiple learned concepts. We observe that the model can reason over multiple novel pseudo-words concurrently. However, it struggles with relations between them (e.g. it fails to place two concepts side-by-side). We hypothesize that this limitation arises because our training considers only single concept scenes, where the concept is at the core of the image. Training on multi-object scenes may alleviate this shortcoming. However, we leave such investigation to future work.

Bias reduction

A common limitation of text-to-image models is that they inherit the biases found in the internet-scale data used to train them. These biases then manifest in the generated samples. For example, the DALLE-2 system card (Mishkin et al., 2022) reports that their baseline model tends to produce images of people that are white-passing and male-passing when provided with the prompt “A CEO”. Similarly, results for “wedding” tend to assume Western wedding traditions, and default to heterosexual couples.

Here, we demonstrate that we can utilize a small, curated dataset in order to learn a new “fairer” word for a biased concept, which can then be used in place of the original to drive a more inclusive generation.

Specifically, in Figure 8 we highlight the bias encoded in the word “Doctor”, and show that this bias can be reduced (i.e. we increase perceived gender and ethnic diversity) by learning a new embedding from a small, more diverse set.

Downstream applications

Finally, we demonstrate that our pseudo-words can be used in downstream models that build on the same initial LDM model. Specifically, we consider the recent Blended Latent Diffusion (Avrahami et al., 2022a) which enables localized text-based editing of images via a mask-based blending process in the latent space of an LDM. In Figure 9 we demonstrate that this localized synthesis process can also be conditioned on our learned pseudo-words, without requiring any additional modifications of the original model.

Image curation

Unless otherwise noted, results in this section are partially curated. For each prompt, we generated $16$ candidates (or six for DALLE-2) and manually selected the best result. We note that similar curation processes with larger batches are typically employed in text-conditioned generation works (Avrahami et al., 2022b; Ramesh et al., 2021; Yu et al., 2022), and that one can automate this selection process by using CLIP to rank images. In the supplementary materials, we provide large-scale, uncurated galleries of generated results, including failure cases.

Quantitative analysis

Inversion into an uncharted latent space provides us with a wide range of possible design choices. Here, we examine these choices in light of the GAN inversion literature and discover that many core premises (such as a distortion-editability tradeoff (Tov et al., 2021; Zhu et al., 2020b)) also exist in the textual embedding space. However, our analysis reveals that many of the solutions typically used in GAN inversion fail to generalize to this space, and are often unhelpful or actively harmful.

To analyze the quality of latent space embeddings, we consider two fronts: reconstruction and editability. First, we wish to gauge our ability to replicate the target concept. As our method produces variations on the concept and not a specific image, we measure similarity by considering semantic CLIP-space distances. Specifically, for each concept, we generate a set of $64$ images using the prompt: “A photo of $S_{*}$ ”. Our reconstruction score is then the average pair-wise CLIP-space cosine-similarity between the generated images and the images of the concept-specific training set.
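The reconstruction score can be sketched as a mean over all (generated, training) pairs. The three-dimensional vectors below are toy stand-ins for CLIP image embeddings, and the function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reconstruction_score(generated, training):
    """Mean cosine similarity over every (generated, training) embedding pair."""
    sims = [cosine_similarity(g, t) for g in generated for t in training]
    return sum(sims) / len(sims)

# Toy vectors standing in for CLIP image embeddings.
training = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
generated = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
score = reconstruction_score(generated, training)
print(round(score, 4))
```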

Second, we want to evaluate our ability to modify the concepts using textual prompts. To this end, we produce a set of images using prompts of varying difficulty and settings. These range from background modifications (“A photo of $S_{*}$ on the moon”), to style changes (“An oil painting of $S_{*}$ ”), and a compositional prompt (“Elmo holding a $S_{*}$ ”).

For each prompt, we synthesize $64$ samples using $50$ DDIM steps, calculate the average CLIP-space embedding of the samples, and compute their cosine similarity with the CLIP-space embedding of the textual prompts, where we omit the placeholder $S_{*}$ (i.e. “A photo of on the moon”). Here, a higher score indicates better editing capability and more faithfulness to the prompt itself. Note that our method does not involve the direct optimization of the CLIP-based objective score and, as such, is not sensitive to the adversarial scoring flaws outlined by Nichol et al. (2021).
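A possible sketch of this editability score is shown below. The two-dimensional vectors stand in for CLIP image and text embeddings, and the whitespace-based `strip_placeholder` helper is an assumption about how the placeholder is omitted.

```python
import math

def strip_placeholder(prompt, placeholder="S_*"):
    """Drop the placeholder word and collapse the leftover whitespace."""
    return " ".join(word for word in prompt.split() if word != placeholder)

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def editability_score(sample_embeddings, text_embedding):
    """Cosine similarity between the mean sample embedding and the text embedding."""
    return cosine_similarity(mean_vector(sample_embeddings), text_embedding)

text = strip_placeholder("A photo of S_* on the moon")
samples = [[1.0, 0.0], [0.0, 1.0]]   # toy stand-ins for image embeddings
text_embedding = [1.0, 1.0]          # toy stand-in for the embedding of `text`
score = editability_score(samples, text_embedding)
print(text, round(score, 4))
```

Note that the sample embeddings are averaged before the similarity is computed, which differs from the pair-wise averaging used for the reconstruction score.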

Evaluation setups

We evaluate the embedding space using a set of experimental setups inspired by GAN inversion:

Extended latent spaces

Following Abdal et al. (2019), we consider an extended, multi-vector latent space. In this space, $S_{*}$ is embedded into multiple learned embeddings, an approach that is equivalent to describing the concept through multiple learned pseudo-words. We consider an extension to two and three pseudo-words (denoted 2-word and 3-word, respectively). This setup aims to alleviate the potential bottleneck of a single embedding vector and thereby enable more accurate reconstructions.

Progressive extensions

We follow Tov et al. (2021) and consider a progressive multi-vector setup. Here, we begin training with a single embedding vector, introduce a second vector following $2,000$ training steps, and a third vector after $4,000$ steps. In this scenario, we expect the network to focus on the core details first, and then leverage the additional pseudo-words to capture finer details.
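The schedule above amounts to a simple step function. In the sketch below, the function name and the exact boundary handling (a zero-based step counter, with a new vector becoming active once the stated number of steps has been completed) are assumptions.

```python
def num_active_vectors(step, thresholds=(2000, 4000)):
    """Number of embedding vectors being trained at a zero-based optimization step."""
    return 1 + sum(step >= t for t in thresholds)

schedule = {step: num_active_vectors(step) for step in (0, 1999, 2000, 3999, 4000, 4999)}
print(schedule)
```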

Regularization

Tov et al. (2021) observed that latent codes in the space of a GAN have increased editability when they lie closer to the code distribution which was observed during training. Here, we investigate a similar scenario by introducing a regularization term that aims to keep the learned embedding close to existing words. In practice, we minimize the L2 distance of the learned embedding to the embedding of a coarse descriptor of the object (e.g. “sculpture” and “cat” for the images in Figure 1).
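One way to write such a regularized objective is sketched below. The text does not state how the term is weighted against the reconstruction loss, so the `weight` parameter and its value are purely hypothetical, as are the toy vectors.

```python
import math

def regularized_loss(reconstruction_loss, v_star, v_coarse, weight):
    """Reconstruction loss plus a weighted L2 distance to the coarse descriptor."""
    return reconstruction_loss + weight * math.dist(v_star, v_coarse)

v_coarse = [0.0, 0.0]  # toy embedding of the coarse descriptor
v_star = [3.0, 4.0]    # toy learned embedding
total = regularized_loss(1.0, v_star, v_coarse, weight=0.1)  # weight is an assumed value
print(total)
```

A larger weight keeps the learned embedding closer to the existing word, which the trade-off discussion in this section associates with easier editing and weaker reconstruction.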

Per-image tokens

Moving beyond GAN-based approaches, we investigate a novel scheme where we introduce unique, per-image tokens into our inversion approach. Let $\{x_{i}\}_{i=1}^{n}$ be the set of input images. Rather than optimizing a single word vector shared across all images, we introduce both a universal placeholder, $S_{*}$ , and an additional placeholder unique to each image, $\{S_{i}\}_{i=1}^{n}$ , associated with a unique embedding $v_{i}$ . We then compose sentences of the form “A photo of $S_{*}$ with $S_{i}$ ”, where every image is matched to sentences containing its own, unique string. We jointly optimize over both $S_{*}$ and $\{S_{i}\}_{i=1}^{n}$ , using Equation 2. The intuition here is that the model should prefer to encode the shared information (i.e. the concept) in the shared code $S_{*}$ while relegating per-image details such as the background to $S_{i}$ .
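The sentence construction for this setup can be sketched as follows. The plain-text placeholder names `S_1`, `S_2`, ... and the helper function are illustrative choices.

```python
def per_image_prompts(num_images, shared="S_*"):
    """One prompt per training image, each pairing the shared and a unique placeholder."""
    return [f"A photo of {shared} with S_{i}" for i in range(1, num_images + 1)]

prompts = per_image_prompts(3)
for image_index, prompt in enumerate(prompts, start=1):
    print(image_index, prompt)
```

Each image is only ever paired with its own prompt, so the per-image vector receives gradients from a single image while the shared vector receives gradients from all of them.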

Human captions

In addition to the learned-embedding setups, we compare to human-level performance using the captions outlined in Section 4.1. Here, we simply replace the placeholder strings $S_{*}$ with the human captions, using both the short and long-caption setups.

Reference setups

To provide intuition for the scale of the results, we add two reference baselines. First, we consider the expected behavior from a model that always produces copies of the training set, regardless of the prompt. For that, we simply use the training set itself as the “generated sample”. Second, we consider a model that always aligns with the text prompt but ignores the personalized concept. We do so by synthesizing images using the evaluation prompts but without the pseudo-word. We denote these setups as “Image Only” and “Prompt Only”, respectively.

Textual-Inversion

Finally, we consider our own setup, as outlined in Section 3. We further evaluate our model with an increased learning rate ( $2\times 10^{-2}$ , “High-LR”) and a decreased learning rate ( $1\times 10^{-4}$ , “Low-LR”).

Additional setups

In the supplementary materials, we consider two additional setups for inversion: a pivotal tuning approach (Roich et al., 2021; Bau et al., 2020), where the model itself is optimized to improve reconstruction, and the bipartite inversion process of DALLE-2 (Ramesh et al., 2022). We further analyze the effect of the image-set size on reconstruction and editability.

Results

Our evaluation results are summarized in Figure 10(a). We highlight four observations of particular interest: First, the semantic reconstruction quality of our method and many of the baselines is comparable to simply sampling random images from the training set. Second, the single-word method achieves comparable reconstruction quality, and considerably improved editability over all multi-word baselines. These points outline the impressive flexibility of the textual embedding space, showing that it can serve to capture new concepts with a high degree of accuracy while using only a single pseudo-word.

Third, we observe that our baselines outline a distortion-editability trade-off curve, where embeddings that lie closer to the true word distribution (e.g. due to regularization, fewer pseudo-words, or a lower learning rate) can be more easily modified, but fail to capture the details of the target. In contrast, deviating far from the word distribution enables improved reconstruction at the cost of severely diminished editing capabilities. Notably, our single-embedding model can be moved along this curve by simply changing the learning rate, offering a user a degree of control over this trade-off.

As a fourth observation, we note that the use of human descriptions for the concepts not only fails to capture their likeness, but also leads to diminished editability. We hypothesize that this is tied to the selective-similarity property outlined in Paiss et al. (2022), where vision-and-language models tend to focus on a subset of the semantically meaningful tokens. By using long captions, we increase the chance of the model ignoring our desired setting, focusing only on the object description itself. Our model, meanwhile, uses only a single token and thus minimizes this risk.

Finally, we note that while our reconstruction scores are on par with those of randomly sampled, real images, these results should be taken with a grain of salt. Our metrics compare semantic similarity using CLIP, which is less sensitive to shape-preservation. On this front, there remains more to be done.

Human evaluations

We further evaluate our models using a user study. Here, we created two questionnaires. In the first, users were provided with four images from a concept’s training set, and asked to rank the results produced by five models according to their similarity to these images. In the second questionnaire, users were provided with a text describing an image context (“A photo on the beach”) and asked to rank the results produced by the same models according to their similarity to the text.

We used the same target concepts and prompts as the CLIP-based evaluation and collected a total of $600$ responses to each questionnaire, for a total of $1,200$ responses. Results are shown in Figure 10(b).

The user-study results align with the CLIP-based metrics and demonstrate a similar reconstruction-editability tradeoff. Moreover, they outline the same limitations of human-based captioning when attempting to reproduce a concept, as well as when editing it.

Limitations

While our method offers increased freedom, it may still struggle with learning precise shapes, instead incorporating the “semantic” essence of a concept. For artistic creations, this is often enough. In the future, we hope to achieve better control over the accuracy of the reconstructed concepts, enabling users to leverage our method for tasks that require greater precision.

Another limitation of our approach is in the lengthy optimization times. Using our setup, learning a single concept requires roughly two hours. These times could likely be shortened by training an encoder to directly map a set of images to their textual embedding. We aim to explore this line of work in the future.

Social impact

Text-to-image models can be used to generate misleading content and promote disinformation. Personalized creation could allow a user to forge more convincing images of non-public individuals. However, our model does not currently preserve identity to an extent where this is a concern.

These models are further susceptible to the biases found in the training data. Examples include gender biases when portraying “doctors” and “nurses”, racial biases when requesting images of scientists, and more subtle biases such as an over-representation of heterosexual couples and western traditions when prompting for a “wedding” (Mishkin et al., 2022). As we build on such models, our own work may similarly exhibit biases. However, as demonstrated in Figure 8, our ability to more precisely describe specific concepts can also serve as a means for reducing these biases.

Finally, the ability to learn artistic styles may be misused for copyright infringement. Rather than paying an artist for their work, a user could train on their images without consent, and produce images in a similar style. While generated artwork is still easy to identify, in the future such infringement could be difficult to detect or legally pursue. However, we hope that such shortcomings are offset by the new opportunities that these tools could offer an artist, such as the ability to license out their unique style, or the ability to quickly create early prototypes for new work.

Conclusions

We introduced the task of personalized, language-guided generation, where a text-to-image model is leveraged to create images of specific concepts in novel settings and scenes. Our approach, “Textual Inversions”, operates by inverting the concepts into new pseudo-words within the textual embedding space of a pre-trained text-to-image model. These pseudo-words can be injected into new scenes using simple natural language descriptions, allowing for simple and intuitive modifications. In a sense, our method allows a user to leverage multi-modal information — using a text-driven interface for ease of editing, but providing visual cues when approaching the limits of natural language.

Our approach was implemented over LDM (Rombach et al., 2021), the largest publicly available text-to-image model. However, it does not rely on any architectural details unique to their approach. As such, we believe Textual Inversions to be easily applicable to additional, larger-scale text-to-image models. There, text-to-image alignment, shape preservation, and image generation fidelity may be further improved.

We hope our approach paves the way for future personalized generation works. These could be core to a multitude of downstream applications, from providing artistic inspiration to product design.

We thank Yael Vinker, Roni Paiss and Haggai Maron for reviewing early drafts and for their helpful suggestions, Tom Bagshaw for discussions regarding artist rights and social impacts, and Omri Avrahami for providing us with early access to Blended Latent Diffusion. This work was partially supported by Len Blavatnik and the Blavatnik family foundation, BSF (grant 2020280) and ISF (grants 2492/20 and 3441/21).

References

Appendix A Additional inversion approaches

In addition to the setups outlined in the core paper, we investigated two recent approaches to inversion: Bipartite DDIM-inversion (Ramesh et al., 2022; Dhariwal & Nichol, 2021) and pivotal tuning (Roich et al., 2021). Below we outline both methods and our experimental results.

Dhariwal & Nichol (2021) demonstrated that the DDIM sampling process (Song et al., 2020) can be inverted through a closed-form iterative approach. Specifically, their approach can find a latent noise vector $x_{T}$ which will be denoised into a specific target image when the denoising process is conditioned on a given code $c_{\theta}(y)$ . Ramesh et al. (2022) further demonstrate that when the conditioning code is an output of CLIP, one can later modify this code using text-derived directions in CLIP’s multi-modal embedding space, while keeping the initial noise, $x_{T}$ , fixed. This induces semantic changes in the image while maintaining the general structure of the original object.

Here, we investigate a similar approach. However, rather than modifying the conditioning code $c_{\theta}(y)$ directly, we change the conditioning text $y$ . Specifically, we first find an appropriate pseudo-word for our target concept. Then, we find $x_{T}$ for a given image of the concept using the text “A photo of $S_{*}$ ” and the closed-form solution of Dhariwal & Nichol (2021). Finally, we modify the conditioning text but keep $x_{T}$ frozen. The results are shown in Figure 11 (left). Here, we observe that when using LDM’s typical guidance scales ( $5$ to $10$ ; Ho & Salimans, 2021), the denoiser network is unable to maintain the original object’s structure through prompt changes. When reducing the guidance scale, the outline of the original image becomes visible, but alignment with the prompt is poor.
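The invert-then-edit procedure above can be sketched with the deterministic DDIM update run in both directions. This is a minimal illustration under stated assumptions: `toy_eps` is a hypothetical stand-in for the denoiser that returns the conditioning vector and ignores both the sample and the timestep, the cumulative noise schedule `alphas` is arbitrary, and no guidance is applied. Because of the toy noise predictor, the round trip here is exact; with a real denoiser the inversion is only approximate.

```python
import numpy as np


def ddim_step(x, eps, a_from, a_to):
    """Move x from cumulative noise level a_from to a_to (deterministic DDIM)."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_invert(x0, eps_model, cond, alphas):
    """Image -> x_T: walk the schedule from clean (index 0) to noisy (last index)."""
    x = x0
    for i in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, i, cond), alphas[i], alphas[i + 1])
    return x


def ddim_sample(x_T, eps_model, cond, alphas):
    """x_T -> image: walk the schedule from noisy back to clean."""
    x = x_T
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, i, cond), alphas[i], alphas[i - 1])
    return x


def toy_eps(x, i, cond):
    """Hypothetical noise predictor: depends only on the conditioning."""
    return cond


rng = np.random.default_rng(0)
alphas = np.linspace(1.0, 0.02, 50)      # assumed schedule, clean -> noisy
x0 = rng.normal(size=4)                  # stand-in for the image of the concept
cond_a = rng.normal(size=4)              # stand-in for "A photo of S_*"
cond_b = rng.normal(size=4)              # stand-in for a modified prompt

x_T = ddim_invert(x0, toy_eps, cond_a, alphas)       # find x_T for the image
recon = ddim_sample(x_T, toy_eps, cond_a, alphas)    # same text: reconstructs x0
edited = ddim_sample(x_T, toy_eps, cond_b, alphas)   # new text, x_T kept frozen
```

Keeping `x_T` fixed while swapping `cond_a` for `cond_b` mirrors the final step of the procedure: only the conditioning changes between the reconstruction and the edit.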

Such guidance-dependent structure drift has also been demonstrated for GLIDE (Nichol et al., 2021). However, this effect is reduced in DALL-E 2 (Ramesh et al., 2022, their Figure 9). Notably, state-of-the-art models (Saharia et al., 2022; Ramesh et al., 2022) typically employ guidance scales ( $\sim 2$ ) which are significantly lower than LDM’s, within the range where we observe structure preservation but no prompt-matching. This gives us hope that a bipartite inversion would allow better shape preservation in more powerful generative models.

Pivotal Tuning

In the field of GAN inversion, it has been shown (Roich et al., 2021; Bau et al., 2019) that one may largely avoid the reconstruction-editability tradeoff using a two-stage optimization process. First, an image is inverted into a “pivot” code in a well-behaved region of the latent space, using standard optimization. This typically results in a highly editable code, but with poor identity preservation. As a second step, the generator is fine-tuned so that the first step’s pivot code will more accurately reproduce the inverted image. It was further demonstrated that such localized tuning can maintain the appealing properties of the latent space and retain similar latent-editing capabilities.

Here, we investigate a similar approach in order to improve reconstruction. We first optimize a pseudo-word using our baseline method. Then, we fine-tune the generator such that sentences of the form “A photo of $S_{*}$ ” will better reconstruct the concept-specific training set images.
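The two-stage structure can be illustrated on a toy problem. The sketch below assumes a linear “generator” `G(w) = W @ w` with a 2-dimensional code and an 8-dimensional target, so that no code reconstructs the target exactly. Stage 1 is solved with least squares rather than iterative optimization, and stage 2 uses plain gradient descent. All names, sizes, and the step size are hypothetical; the toy shows only the order of the two stages, not the editing behavior of a diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=8)        # stand-in for the image to reconstruct
W = rng.normal(size=(8, 2))        # toy linear generator: G(w) = W @ w

# Stage 1: optimize the code with the generator frozen.
# (Least squares replaces iterative optimization in this toy setting.)
pivot, *_ = np.linalg.lstsq(W, target, rcond=None)
err_stage1 = np.linalg.norm(W @ pivot - target)

# Stage 2: freeze the pivot code and fine-tune the generator weights
# by gradient descent on the squared reconstruction error.
W_tuned = W.copy()
lr = 0.25 / (pivot @ pivot)        # step size chosen so the residual halves each step
for _ in range(50):
    residual = W_tuned @ pivot - target
    W_tuned -= lr * 2.0 * np.outer(residual, pivot)
err_stage2 = np.linalg.norm(W_tuned @ pivot - target)
```

In this toy setting the tuned generator reproduces the target almost exactly from the fixed pivot, which corresponds to the improved reconstruction that the second stage is meant to provide.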

Our initial investigation reveals that naïve applications of this approach lead to improved shape preservation, but also to a severe collapse of editing at high guidance scales. See Figure 11 (right) for examples.

However, a more involved application of this same principle (e.g. by combining it with a process similar to the bipartite inversion outlined above, or by tuning around results produced with higher guidance scales) might overcome these issues. We leave such investigation to future work.

Appendix B Effect of training set size

We investigated the effect of the concept’s training set size on the results. Specifically, we consider the headless sculpture object of Figure 1 (top row). We inverted the object using our standard model but swept over dataset sizes ranging from a single image to $25$ samples. For ease of comparison, we further report the image-only, prompt-only, and human-caption-based scores for the same single object. The results are shown in Figure 12.

Appendix C Additional results

We provide additional results of personalized generation using our method. In Figure 13 we show additional text-guided synthesis results.

In Figure 14 we show large-scale galleries of uncurated results generated with the prompt “A photo of $S_{*}$ ”. In Figures 15 and 16 we provide large-scale galleries of uncurated results generated with a wide assortment of prompts. These are intended to provide a sense of the quality of the images produced and of the degree of cherry-picking involved when generating the samples in the core paper. Note that these results also contain demonstrations of typical failure cases, such as difficult relational prompts (Figure 15, rows $2$ , $5$ ).

Appendix D Training prompt templates

Below we provide the list of text templates used when optimizing a pseudo-word: