MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel

Introduction

Text-to-image generative models have emerged as a “disruptive technology”, demonstrating unprecedented capabilities in synthesizing high-quality and diverse images from text prompts, with diffusion models currently established as state-of-the-art (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Croitoru et al., 2022). While this progress holds great promise for changing the way we create digital content, deploying text-to-image models in real-world applications remains challenging because it is difficult to give users intuitive control over the generated content. Currently, controllability over diffusion models is achieved in one of two ways: (i) Training a model from scratch or finetuning a given diffusion model for the task at hand (e.g., inpainting, layout-to-image training, etc. (Wang et al., 2022a; Ramesh et al., 2022; Rombach et al., 2022; Nichol et al., 2021; Avrahami et al., 2022b; Brooks et al., 2022; Wang et al., 2022b)). With the ever-increasing scale of models and training data, this approach often requires extensive compute and a long development period, even in a finetuning setting. (ii) Reusing a pre-trained model and adding some controlled generation capability. Previously, these methods have concentrated on specific tasks and designed a tailored methodology for each (e.g., replacing objects in an image, manipulating style, or controlling layout (Tumanyan et al., 2022; Hertz et al., 2022; Avrahami et al., 2022a)).

The goal of this work is to design MultiDiffusion, a new unified framework that significantly increases the flexibility in adapting a pre-trained (reference) diffusion model to controlled image generation. The basic idea behind MultiDiffusion is to define a new generation process that is composed of several reference diffusion generation processes bound together with a set of shared parameters or constraints. In more detail, the reference diffusion model is applied to different regions in the generated image, predicting a denoising sampling step for each. In turn, MultiDiffusion takes a global denoising sampling step that reconciles all these different steps via a least-squares optimal solution.

For example, consider the task of generating an image at an arbitrary aspect ratio given a reference diffusion model trained on square images (Fig. 2). At each denoising step, MultiDiffusion fuses the denoising directions, provided by the reference model, from all the square crops, and strives to follow them all as closely as possible, constrained by the fact that nearby crops share common pixels. Intuitively, we encourage each crop to be a real sample from the reference model. Note that while each crop might pull in a different denoising direction, our framework yields a unified denoising step, and hence produces high-quality and seamless images.

With MultiDiffusion, we are able to harness a reference pre-trained text-to-image model for different applications, including synthesizing images at a desired resolution or aspect ratio, or synthesizing images using rough region-based text prompts, as seen in Fig. 1. Notably, our framework allows solving these tasks simultaneously, using a common generation process. Compared to relevant baselines, we found that our approach is able to produce state-of-the-art controlled generation quality, even compared to methods that are specifically trained for these tasks. Furthermore, our method works efficiently, without introducing computational overhead.

Related Work

Diffusion models (Sohl-Dickstein et al., 2015; Croitoru et al., 2022; Dhariwal & Nichol, 2021; Ho et al., 2020; Nichol & Dhariwal, 2021) are a class of generative probabilistic models that aim to approximate a data distribution $q$, and are easy to sample from. Specifically, these models take a Gaussian noise input $I_T\sim\mathcal{N}(0,I)$, and through a series of gradual denoising steps, transform it into a sample $I_0$ that should be distributed according to $q$. The number of denoising steps and the parameterization of the transformation vary among different works (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; Lu et al., 2022a, b; Liu et al., 2022). Recently, diffusion models have emerged as state-of-the-art generators due to their success in learning complex distributions and generating diverse, high-quality samples. These models have been successfully used in various domains, including images (Dhariwal & Nichol, 2021; Nichol & Dhariwal, 2021; Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022), video (Ho et al., 2022; Singer et al., 2022), 3D scenes (Müller et al., 2022), and motion sequences (Yuan et al., 2022; Tevet et al., 2022).

Controllable generation with diffusion models

Diffusion models can be trained with guiding input channels (e.g., semantic layout, category label) and successfully perform conditional image generation (Ramesh et al., 2021; Saharia et al., 2022c, a; Wang et al., 2022a; Preechakul et al., 2022; Ho & Salimans, 2022). The most prominent example of conditional diffusion models is recent text-to-image diffusion models, which have demonstrated groundbreaking synthesis capabilities (Nichol et al., 2021; Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Sheynin et al., 2022). However, these models provide only limited control over the generated content, which is mainly achieved through the input text. Recently, a surge of methods has been proposed to gain wider and better user controllability. Existing methods can be roughly divided into two main approaches: (i) Methods that incorporate explicit control by using additional guiding signals to the model (Avrahami et al., 2022b; Rombach et al., 2022; Brooks et al., 2022). However, these works require costly, extensive training on curated datasets. (ii) On the other side of the spectrum, numerous methods implicitly control the generated content by manipulating the generation process of a pre-trained model (Kwon & Ye, 2022; Meng et al., 2021; Tumanyan et al., 2022; Hertz et al., 2022; Avrahami et al., 2022c; Choi et al., 2021; Mokady et al., 2022; Couairon et al., 2022; Kong et al., 2023; Kwon et al., 2022) or by performing lightweight model finetuning (Ruiz et al., 2022; Kawar et al., 2022; Kim et al., 2022; Valevski et al., 2022). Avrahami et al. designed image inpainting methods (Avrahami et al., 2022a, c) that do not require finetuning. Recent works (Tumanyan et al., 2022; Hertz et al., 2022) rely on architectural properties and insights about the internal features of the pretrained model, and tailor image editing techniques accordingly.
Our work also manipulates the generation process of a pretrained diffusion model, and does not require any training or finetuning. However, in contrast to existing works that target a specific application without a well-defined objective, we propose a more general approach that allows us to unify different user control inputs in a more principled manner.

Method

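We consider a pre-trained diffusion model $\Phi$, which serves as a reference model. It operates in an image space $\mathcal{I}$ and a condition space $\mathcal{Y}$ (e.g., $y\in\mathcal{Y}$ is a text prompt) and, starting from a noisy image $I_T$, builds a sequence of images

$$I_T, I_{T-1}, \ldots, I_0 \quad \text{s.t.} \quad I_{t-1}=\Phi(I_t|y), \tag{1}$$

gradually transforming the noisy image $I_T$ into a clean image $I_0$.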

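MultiDiffusion, similarly to a diffusion process, starts with some initial noisy input $J_T\sim P_{\mathcal{J}}$, where $P_{\mathcal{J}}$ is a noise distribution over the target image space $\mathcal{J}$, and, given a condition $z$ from a target condition space $\mathcal{Z}$, produces a series of images

$$J_T, J_{T-1}, \ldots, J_0 \quad \text{s.t.} \quad J_{t-1}=\Psi(J_t|z). \tag{2}$$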

Our key idea is to define $\Psi$ to be as consistent as possible with $\Phi$. More specifically, we define a set of mappings between the target and reference image spaces, $F_i:\mathcal{J}\rightarrow\mathcal{I}$, and a corresponding set of mappings between the condition spaces, $\lambda_i:\mathcal{Z}\rightarrow\mathcal{Y}$, where $i\in[n]=\{1,\ldots,n\}$. These mappings are application dependent, as described in Sec. 4. Our goal is to make every MultiDiffuser step $J_{t-1}=\Psi(J_t|z)$ follow as closely as possible $\Phi(I^i_t|y_i)$, $i\in[n]$, i.e., the denoising steps of $\Phi$ when applied to the images and conditions:

$$I^i_t=F_i(J_t),\qquad y_i=\lambda_i(z).$$

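Formally, our new process is given by solving the following optimization problem:

$$\Psi(J_t|z)=\operatorname*{arg\,min}_{J\in\mathcal{J}}\ \mathcal{L}_{\text{FTD}}(J|J_t,z), \tag{3}$$

where the Follow-the-Diffusion-Paths (FTD) loss is

$$\mathcal{L}_{\text{FTD}}(J|J_t,z)=\sum_{i=1}^{n}\left\|W_i\otimes\left[F_i(J)-\Phi(I^i_t|y_i)\right]\right\|^2, \tag{4}$$

with per-pixel weights $W_i$ and $\otimes$ denoting the element-wise (Hadamard) product.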

Closed-form formula.

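In the applications demonstrated in this paper, the maps $F_i$ consist of direct pixel samples (e.g., taking a crop out of the image $J_t$). In this case, Eq. 4 is a quadratic Least-Squares (LS) problem in which each pixel of the minimizer $J$ is a weighted average of all its diffusion sample updates, i.e.,

$$\Psi(J_t|z)=\sum_{i=1}^{n}\frac{F_i^{-1}(W_i)}{\sum_{j=1}^{n}F_j^{-1}(W_j)}\otimes F_i^{-1}\!\left(\Phi(I^i_t|y_i)\right). \tag{5}$$

Here $F_i^{-1}$ places values from the reference image space back at the pixels of $\mathcal{J}$ that $F_i$ samples, and the division is element-wise.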
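To see why the minimizer reduces to a per-pixel weighted average, it can help to write the objective for a single target pixel. The derivation below is a sketch under the assumption that every $F_i$ only selects pixels. The shorthand $u_i(p)$ (the value that map $i$ proposes for target pixel $p$) and $c_i(p)\geq 0$ (the coefficient with which it enters the sum, zero when map $i$ does not cover $p$) is introduced here for illustration and is not notation from the paper.

```latex
% Per-pixel least squares; u_i(p) and c_i(p) are illustrative shorthand.
\begin{align*}
\ell_p(x) &= \sum_{i=1}^{n} c_i(p)\,\bigl(x - u_i(p)\bigr)^2, \\
\frac{d\ell_p}{dx} &= 2\sum_{i=1}^{n} c_i(p)\,\bigl(x - u_i(p)\bigr) = 0
\;\;\Longrightarrow\;\;
x^\star = \frac{\sum_{i=1}^{n} c_i(p)\,u_i(p)}{\sum_{i=1}^{n} c_i(p)}.
\end{align*}
% Example: a pixel covered by exactly two crops with equal coefficients
% gets x* = (u_1(p) + u_2(p)) / 2.
```

Because the pixels decouple, solving this scalar problem at every $p$ with $\sum_i c_i(p)>0$ gives the full minimizer. With binary weights (all ones, or $\{0,1\}$ masks), squaring a weight leaves it unchanged, so the coefficients coincide with the weights themselves.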

Properties of MultiDiffusion.

The main motivation for the definition of $\Psi$ in Eq. 3 comes from the following observation: if we choose a probability distribution $P_{\mathcal{J}}$ such that

$$I^i_T=F_i(J_T)\sim P_{\mathcal{I}},\qquad \forall i\in[n], \tag{6}$$

where $P_{\mathcal{I}}$ denotes the initial noise distribution of $\Phi$, and compute $J_{t-1}=\Psi(J_t|z)$, as defined in Eq. 3, such that we reach a zero FTD loss, $\mathcal{L}_{\text{FTD}}(J_{t-1}|J_t,z)=0$, then:

$$I^i_{t-1}=F_i(J_{t-1})=\Phi(I^i_t|y_i).$$

That is, $I^i_t$, for all $i\in[n]$, is a diffusion sequence, and thus $I^i_0$ is distributed according to the distribution defined by $\Phi$ over the image space $\mathcal{I}$. We summarize:

Proposition 3.1. If $P_{\mathcal{J}}$ is a distribution over $\mathcal{J}$ satisfying Eq. 6, and the FTD cost (Eq. 4) is minimized to zero in Eq. 3 for all steps $T,T-1,\ldots,0$, then the images $I^i_t=F_i(J_t)$ reproduce a $\Phi$ diffusion path. In particular, $F_i(J_0)$, $i\in[n]$, are distributed identically to samples from the reference diffusion model $\Phi$.

The implications of this proposition are far-reaching: using a single reference diffusion process, we can flexibly adapt to different image generation scenarios without the need to retrain the model, while still being consistent with the reference diffusion model. Next, we instantiate this framework, outlining several applications of the Follow-the-Diffusion-Paths approach.

Applications

4.1 Panorama

As a first instantiation, we use our framework to define a diffusion model in an image space $\mathcal{J}$ of size $H'\times W'$, with $H'\geq H$ and $W'\geq W$, directly from a trained model $\Phi$ working in an image space $\mathcal{I}$ of size $H\times W$. Let $\mathcal{Z}=\mathcal{Y}$ (namely, generating a panoramic image for a given text prompt), let $F_i(J)\in\mathcal{I}$ be an $H\times W$ crop of the image $J$, and let $\lambda_i(z)=z$. We consider $n$ such crops that cover the image $J$. Setting $W_i=\mathbf{1}$, we get

$$\Psi(J_t|z)=\operatorname*{arg\,min}_{J\in\mathcal{J}}\sum_{i=1}^{n}\left\|F_i(J)-\Phi(F_i(J_t)|z)\right\|^2, \tag{7}$$

which is a least-squares problem whose solution is calculated analytically according to Eq. 5. See Appendix B.1 for implementation details.
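The panorama step can be sketched in a few lines of NumPy: every crop proposes an update, the updates are pasted back at their locations, and each pixel takes the mean of the proposals that cover it. This is a minimal illustration under stated assumptions, not the authors' implementation: `fused_step`, `toy_phi`, the crop layout along the width axis only, and the toy denoiser are all hypothetical stand-ins for the reference model and its crops.

```python
import numpy as np

def fused_step(J_t, starts, crop_w, phi):
    """Average the per-crop updates phi(crop) over the pixels each crop covers.

    J_t     : (H, W') array, the current wide image
    starts  : left column of every crop (hypothetical layout, width axis only)
    crop_w  : crop width W; crops span the full height
    phi     : stand-in for the reference denoising step on one crop
    """
    acc = np.zeros(J_t.shape, dtype=float)  # sum of proposed updates per pixel
    cnt = np.zeros(J_t.shape, dtype=float)  # number of crops covering each pixel
    for s in starts:
        crop = J_t[:, s:s + crop_w]          # crop i of the current image
        acc[:, s:s + crop_w] += phi(crop)    # paste its update back in place
        cnt[:, s:s + crop_w] += 1.0          # uniform weights (all ones)
    if np.any(cnt == 0):
        raise ValueError("crops must cover every pixel of J_t")
    return acc / cnt

def toy_phi(crop):
    # Toy denoiser (assumption): pull every pixel halfway to the crop mean.
    return 0.5 * crop + 0.5 * crop.mean()

rng = np.random.default_rng(0)
J_t = rng.normal(size=(4, 12))
independent = fused_step(J_t, [0, 4, 8], 4, toy_phi)   # non-overlapping crops
fused = fused_step(J_t, [0, 2, 4, 6, 8], 4, toy_phi)   # overlapping crops
```

With non-overlapping crops the result is just the independent per-crop outputs placed side by side. With overlapping crops, shared pixels receive the mean of the competing proposals, which is what couples neighbouring crops.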

As discussed in Sec. 3, MultiDiffusion reconciles multiple diffusion paths provided by the reference model $\Phi$. We illustrate this property in Fig. 3, where we consider a panorama of size $H\times 4W$. Fig. 3(a) shows the generation result when independently applying $\Phi$ on four non-overlapping crops. As expected, there is no coherency between the crops, since this amounts to four random samples from the model. Starting from the same initial noise, our generation process (Eq. 7) allows us to fuse these initially unrelated diffusion paths and steer the generation into a high-quality, coherent panorama (Fig. 3(b)).

4.2 Region-based text-to-image generation

Given a set of region masks $\{M_i\}_{i=1}^{n}\subset\{0,1\}^{H\times W}$ and a corresponding set of text prompts $\{y_i\}_{i=1}^{n}\subset\mathcal{Y}^{n}$, our goal is to generate a high-quality image $I\in\mathcal{I}$ that depicts the desired content in each region. That is, the image segment $I\otimes M_i$ should manifest $y_i$. Going back to our formulation (Eq. 2), the MultiDiffusion process is defined over the condition space $\mathcal{Z}=\mathcal{Y}^{n}$, i.e., $z=(y_1,\ldots,y_n)$, and the target image space $\mathcal{J}=\mathcal{I}$ is identical to the reference one.

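Furthermore, the region selection maps are defined as $F_i(I)=I$, the pixel weights are set according to the masks, $W_i=M_i$, and the $\Psi$ step is defined as the solution to the least-squares problem:

$$\Psi(J_t|z)=\operatorname*{arg\,min}_{J\in\mathcal{I}}\sum_{i=1}^{n}\left\|M_i\otimes\left[J-\Phi(J_t|y_i)\right]\right\|^2. \tag{8}$$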

The solution to this LS problem is calculated analytically. At each step, we apply the pretrained diffusion model with respect to each of the given prompts, resulting in multiple diffusion directions $\Phi(J_t|y_i)$. We encourage each pixel in $J_t$ to follow the (averaged) directions associated with the regions $M_i$ containing it (Eq. 5).
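The mask-weighted averaging described above can be sketched as follows. The function name `region_step`, the toy denoiser, the prompt strings, and the choice of an all-ones background mask are assumptions made for illustration; a real run would call the reference model once per prompt instead of `toy_phi`.

```python
import numpy as np

def region_step(J_t, masks, prompts, phi):
    """Each pixel follows the mean direction of the regions that contain it."""
    num = np.zeros(J_t.shape, dtype=float)  # mask-weighted sum of updates
    den = np.zeros(J_t.shape, dtype=float)  # total mask weight per pixel
    for M, y in zip(masks, prompts):
        num += M * phi(J_t, y)               # update for prompt y, kept inside M
        den += M
    if np.any(den == 0):
        raise ValueError("every pixel must belong to at least one region")
    return num / den

# Toy denoiser (assumption): each prompt pulls the image to a constant value.
TOY_TARGET = {"a beach": 1.0, "a dog": 3.0}

def toy_phi(J, prompt):
    return np.full(J.shape, TOY_TARGET[prompt])

background = np.ones((6, 6))                 # hypothetical full-image region
dog = np.zeros((6, 6))
dog[2:5, 1:4] = 1.0                          # hypothetical bounding box
J_next = region_step(np.zeros((6, 6)), [background, dog],
                     ["a beach", "a dog"], toy_phi)
```

In this toy setup, pixels outside the box follow only the background prompt, while pixels inside the box receive the average of both directions.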

We further support high fidelity to tight masks if these are provided by the user (see Fig. 5). We noticed that the layout is determined early in the diffusion process. We therefore encourage $\Phi(J_t|y_i)$ to focus on the region $M_i$ early in the process, in order to match the desired layout, and to consider the full image context afterwards, in order to achieve a harmonized result. To this end, we integrate time dependency in the maps $F_i$, introducing a bootstrapping phase. That is,

$$I^i_t=F_i(J_t|t)=\begin{cases}J_t & \text{if } t\leq T_{init},\\ M_i\otimes J_t+(1-M_i)\otimes S_t & \text{otherwise,}\end{cases} \tag{9}$$

where $T_{init}$ is the bootstrapping stopping step parameter, and $S_t$ is a random image with a constant color, which serves as background (see Appendix B.2 for implementation details).
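The time-dependent map can be sketched as a small function. It assumes that time steps count down from $T$ to $0$, that the map composites the masked region over the background $S_t$ while $t > T_{init}$, and that it returns $J_t$ unchanged afterwards; the name `bootstrap_map` and the toy arrays are hypothetical.

```python
import numpy as np

def bootstrap_map(J_t, M_i, S_t, t, T_init):
    """Input handed to the reference model for region i at step t."""
    if t <= T_init:
        return J_t                            # later steps: full image context
    # Early steps: keep only region i and fill the rest with background S_t.
    return M_i * J_t + (1.0 - M_i) * S_t

J_t = np.ones((4, 4))
S_t = np.full((4, 4), 0.5)                   # stand-in constant-colour background
M_i = np.zeros((4, 4))
M_i[1:3, 1:3] = 1.0

early = bootstrap_map(J_t, M_i, S_t, t=900, T_init=800)   # bootstrapping phase
late = bootstrap_map(J_t, M_i, S_t, t=500, T_init=800)    # full-context phase
```

During the early phase the model sees only the content of region $i$ on a plain background, which pushes the object to form inside the mask. Afterwards every prompt sees the same full image, so the regions can blend.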

We demonstrate the effectiveness of our bootstrapping approach in Sec. 5.2. We set $T_{init}$ so that bootstrapping covers the first $20\%$ of the generation process (i.e., $T_{init}=800$ for a process that starts at $t=1000$).

Results

5.1 Panorama Generation

To evaluate our method on the task of text-to-panorama generation (Sec. 4.1), we generated a diverse set of $512\times 4608$ panoramas, $9\times$ wider than the original training resolution. Since there is no direct method for generating images at an arbitrary aspect ratio from text, we compare to the following two baselines: (i) Blended Latent Diffusion (BLD) (Avrahami et al., 2022a) (combined with Stable Diffusion (Rombach et al., 2022)), and (ii) Stable Inpainting (SI) (Rombach et al., 2022), which has been finetuned on large-scale data for inpainting. For both baselines, the panoramic image is generated gradually, starting from a central image (sampled by $\Phi$ given the input text), and extrapolated progressively to the right and left.

Fig. 4 shows sample generation results by our method compared to the above baselines. As seen, both baselines often exhibit visible seams and discontinuities between overlapping crops, as well as degradation in visual quality when moving away from the central pivotal image; this is expected due to the iterative generation process. BLD often generates repetitive content (e.g., the skiers example), whereas SI results in a noticeable visual difference between the left and right parts of the image. In contrast, our framework simultaneously “samples” the panoramic image by combining the diffusion paths of all crops, resulting in seamless and high-quality images. Additional comparisons are in Appendix A (Fig. 10).

To quantify these observations, we use the Fréchet Inception Distance (FID) (Parmar et al., 2022) to measure the distance between the distribution of $512\times 512$ crops from the panoramic images and the distribution of images generated by the reference model $\Phi$. That is, for a given text prompt, we sample $N$ different $512\times 512$ images from $\Phi$, and consider them as our reference dataset. For the baselines and our method, we generated $N$ panoramic images, randomly sampled a $512\times 512$ crop from each to serve as the generated dataset, and computed the FID accordingly.
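The crop-sampling part of this protocol can be sketched as below. The helper `sample_crops` is hypothetical, the arrays are small stand-ins for the $512\times 4608$ panoramas (same 1:9 aspect ratio), and the FID computation itself is left to an external tool and not shown.

```python
import numpy as np

def sample_crops(panoramas, size, rng):
    """Draw one uniformly placed size x size crop from every panorama."""
    crops = []
    for pano in panoramas:
        h, w = pano.shape[:2]
        if h < size or w < size:
            raise ValueError("panorama is smaller than the requested crop")
        top = int(rng.integers(0, h - size + 1))    # upper bound is exclusive
        left = int(rng.integers(0, w - size + 1))
        crops.append(pano[top:top + size, left:left + size])
    return np.stack(crops)

rng = np.random.default_rng(0)
# Toy stand-ins: 5 "panoramas" of 8 x 72 pixels instead of 512 x 4608.
panoramas = [rng.random((8, 72, 3)) for _ in range(5)]
generated = sample_crops(panoramas, 8, rng)         # shape (5, 8, 8, 3)
```

The resulting stack plays the role of the generated dataset, to be compared against the same number of square images sampled directly from the reference model.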

To further assess the quality of our results, we evaluated two CLIP-based scores: (i) text-image CLIP score (Radford et al., 2021) measured by the cosine similarity between the text prompt and the image embeddings, and (ii) CLIP aesthetic (Schuhmann et al., 2022) measured by a linear estimator on top of CLIP predicting the aesthetic quality of the images.
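The text-image score in (i) reduces to a cosine similarity between two embedding vectors. The sketch below shows only that final step; the vectors are made-up placeholders, and obtaining real embeddings from a CLIP model is assumed to happen elsewhere.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        raise ValueError("embeddings must be non-zero")
    return float(a @ b / denom)

# Placeholder vectors (assumption); real CLIP embeddings are much longer.
text_emb = np.array([0.2, 0.9, 0.1])
image_emb = np.array([0.25, 0.8, 0.0])
score = cosine_similarity(text_emb, image_emb)
```

The score is invariant to the scale of either embedding, so only the direction of the two vectors matters.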

We used $N=2000$ samples and repeated this evaluation for $8$ different text conditionings. Table 1 reports the mean and standard deviation of the FID and CLIP scores for our method and the baselines. We additionally report the scores for an independent set of images sampled from $\Phi$, which serves as a reference. As seen, our method outperforms the existing baselines in all metrics.

5.2 Region-based Text-to-Image Generation

Our region-based formulation (Sec. 4.2) allows novice users greater flexibility in their content creation by lifting the burden of creating accurate tight masks. As can be seen in Fig. 1, Fig. 7 and Fig. 8, our method generates diverse, high-quality samples that comply with the text descriptions, given only bounding-box region guidance. As seen in Fig. 7, by starting our generation from a different input noise, we can generate diverse samples, depicting objects in different scales and appearances, all following the same spatial controls. Notably, since we integrate the controls from all regions into a unified generation process, our method can generate complex scene effects (e.g., background blur, shadows or reflections) that are coherently immersed in the scene. More results are included in the Appendix.

We compare our region-based framework with Make-A-Scene (Gafni et al., 2022) and the concurrent work SpaText (Avrahami et al., 2022b). Both baselines perform large-scale training specifically for this task. Note that these models are not publicly available, thus we qualitatively compare to their provided examples.

Additionally, we consider an adaptation of BLD (Avrahami et al., 2022a) as a baseline. Similarly to Sec. 5.1, this is done by applying their method in an auto-regressive manner by first generating the background, and sequentially generating each of the foreground objects.

As seen in Fig. 5, our framework produces consistent images that adhere to the spatial constraints, and are qualitatively on par with SpaText (Avrahami et al., 2022b). The auto-regressive approach based on BLD (Avrahami et al., 2022a) often results in incoherent images and unnatural scenes (e.g., the misplaced sink in the “bathroom” example). Additional comparisons to the baselines are in the Appendix.

To quantitatively evaluate our performance, we use the COCO dataset (Lin et al., 2014), which contains images with a global text caption and instance masks for each object in the image. We apply our method to a subset of the validation set, obtained by keeping examples that contain 2 to 4 foreground objects, excluding people and excluding masks that occupy less than $5\%$ of the image. This results in $1K$ diverse samples. Following (Avrahami et al., 2022b), we use the ground-truth labels to provide a text prompt for each foreground region, i.e., “a {label}”, and use the full image caption as the prompt describing the background.

We evaluate the results by running an off-the-shelf segmentation model (Cheng et al., 2022) on the generated images, and measure the Intersection over Union (IoU) w.r.t. the ground-truth segmentation. Table 2 reports the performance of our method and the baselines described above. As an upper bound, we also report the IoU w.r.t. the original images in the set. Note that our method outperforms the existing baselines SI (Rombach et al., 2022) and BLD (Avrahami et al., 2022a). Additional qualitative examples are included in the Appendix.
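For reference, the IoU between a predicted and a ground-truth binary mask can be computed as in the sketch below. The function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions; how scores are aggregated over classes and images in the evaluation is not specified here.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # assumed convention: two empty masks agree
    return float(np.logical_and(pred, gt).sum() / union)

pred = np.zeros((8, 8), dtype=bool)
pred[0:4, 0:4] = True                   # 16 predicted pixels
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True                     # 16 ground-truth pixels, 4 shared
score = mask_iou(pred, gt)              # 4 / (16 + 16 - 4)
```

A tighter fit between the generated object and its mask raises the intersection and shrinks the union, so the score increases in both ways.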

Finally, we present an ablation of our bootstrapping stage (Eq. 9): qualitatively in Fig. 6, and quantitatively in Table 2. Note that without bootstrapping, our framework still generates the desired object within the mask region; however, the bootstrapping stage makes the object adhere more tightly to the given mask.

Discussion and Conclusions

Controllable generation is one of the major pending challenges with text-to-image diffusion models. We proposed to tackle this challenge from a fundamentally new direction: defining a new generation process on top of a pre-trained and fixed diffusion model. This approach has several key advantages over previous works: (i) it does not require any further training or finetuning, (ii) it can be applied to various generation tasks, and (iii) our generation process yields an optimization task that can be solved in closed form for many tasks, and hence can be computed efficiently while ensuring convergence to the global optimum of our objective. As for limitations, our method relies heavily on the generative prior of the reference diffusion model, i.e., the quality of our results depends on the diffusion paths provided by the model. Thus, when a “bad” path is chosen by the reference model (e.g., bad seed, or biased text prompt), our results will be affected as well. In some cases, we can mitigate this by introducing more constraints into our framework (bootstrapping in Sec. 4.2), or by prompt engineering (Fig. 9). We thoroughly evaluated our framework, demonstrating state-of-the-art results even compared to methods that are trained specifically for these tasks.

We believe that our work can trigger further research in harnessing the power of a pre-trained diffusion model in a more principled manner. One way forward, for example, is to generalize MultiDiffusion with a more general optimization problem,

where $\mathcal{L}_0$ is a cost function and $\mathcal{C}$ is a set of (hard) constraints that control the MultiDiffusion process by incorporating other priors and/or design constraints. This approach provides a further degree of freedom in designing MultiDiffusion processes.

Acknowledgments

LY is supported by a grant from Israel CHE Program for Data Science Research Centers and the Minerva Stiftung. OB is supported by the Israeli Science Foundation (grant 2303/20). The research was supported also in part by a research grant from the Carolito Stiftung (WAIC). We thank Michal Geyer and Dolev Ofri-Amar for proofreading the paper.


Appendix A Additional Results

In the following section we provide additional results and comparisons for the applications shown in the main paper.

A.1 Panorama Generation

We provide additional results and qualitative comparisons for the task of text-to-panorama generation (Sec. 5.1). Fig. 10 depicts additional comparisons of our method vs. Stable Inpainting (SI) (Rombach et al., 2022) and Blended Latent Diffusion (BLD) (Avrahami et al., 2022a). We also show a vertical panorama result in Fig. 12 (left).

A.2 Region-based Text-to-Image Generation

We provide additional qualitative results and comparisons for the task of region-based generation (Sec. 4.2) in Fig. 12 and Fig. 14.

A.3 Region-based Text-to-Image Generation on COCO

We include sample results and comparisons on the subset of the COCO validation set in Fig. 13. See Sec. 5.2 for more details about this experiment.

Appendix B Additional Implementation Details

B.1 Panorama (Sec. 4.1)

In the case of panorama generation, our maps $F_i$ are defined as fixed-size crops from the full panorama. Specifically, for a panorama with spatial resolution $H'\times W'$, we consider overlapping crops of size $H\times W$, where $H=W=64$, defined in the Stable Diffusion latent space (which translates to size $512\times 512$ in RGB space). Our maps $F_1,\ldots,F_n$ provide crops with a sliding window of stride $\texttt{step}=8$ in the latent space (64 pixels in RGB space). In particular, $n=\frac{H'-64}{\texttt{step}}\cdot\frac{W'-64}{\texttt{step}}$. We summarize,

Note that we can compute the per-crop diffusion updates in parallel (i.e., in a batch), resulting in a total of $\frac{T\cdot n}{b}$ calls to the reference diffusion model $\Phi$, where $b$ denotes the batch size.
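A small sketch of the crop enumeration and the resulting number of model calls follows. Several choices here are assumptions: the helper names `get_views` and `num_model_calls`, the common sliding-window count of `(size - window) // step + 1` positions per axis (which keeps a single row of crops when $H'=64$), rounding the number of batches up, and the example values $T=50$ and $b=8$.

```python
import math

def get_views(h_lat, w_lat, window=64, step=8):
    """Latent-space crop boxes (top, bottom, left, right) of a sliding window."""
    n_rows = (h_lat - window) // step + 1    # window positions along the height
    n_cols = (w_lat - window) // step + 1    # window positions along the width
    return [(r * step, r * step + window, c * step, c * step + window)
            for r in range(n_rows) for c in range(n_cols)]

def num_model_calls(T, n, b):
    """Batched calls to the reference model: one pass over n crops per step."""
    return T * math.ceil(n / b)

# A 512 x 4608 RGB panorama corresponds to a 64 x 576 latent.
views = get_views(64, 576)
n = len(views)                               # 1 row x 65 columns of crops
calls = num_model_calls(T=50, n=n, b=8)      # 50 * ceil(65 / 8)
```

If the panorama size minus the window is not a multiple of the stride, this enumeration leaves the last few rows or columns uncovered, so an extra edge-aligned crop would be needed in that case.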

B.2 Bootstrapping (Sec. 4.2)

In case the user desires to maintain high fidelity to tight masks (see Fig. 5), we introduce a bootstrapping phase to our maps $F_i$ (see Eq. 9). Specifically, we pre-compute each $S_t$ as follows: we randomize an image $I\in\mathbb{R}^{512\times 512\times 3}$ with a random constant RGB value, and encode it to the Stable Diffusion latent space, $S=\mathcal{E}(I)$, where $\mathcal{E}$ is the pre-trained encoder provided by the Stable Diffusion framework. Finally, we obtain $S_t$ by noising $S$ to the noise level of time step $t$. That is, $S_t\sim\mathcal{N}(\mu_t\cdot S,\sigma_t^2)$, where $\mu_t$ and $\sigma_t$ are given by the diffusion noise schedule (Ho et al., 2020).
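The noising step can be sketched as below, using the forward-process form of Ho et al. (2020), where $\mu_t=\sqrt{\bar\alpha_t}$ and $\sigma_t^2=1-\bar\alpha_t$. Everything else is an assumption for illustration: the linear $\beta$ schedule over 1000 steps, the helper name `noise_to_step`, and the constant array standing in for the encoded latent $\mathcal{E}(I)$ (no encoder is called).

```python
import numpy as np

def noise_to_step(S, alpha_bar_t, rng):
    """Sample S_t ~ N(mu_t * S, sigma_t^2) for one time step."""
    mu_t = np.sqrt(alpha_bar_t)              # signal scale at step t
    sigma_t = np.sqrt(1.0 - alpha_bar_t)     # noise standard deviation at step t
    return mu_t * S + sigma_t * rng.standard_normal(S.shape)

# Assumed schedule: linear betas over 1000 steps; alpha_bar is their running product.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
S = np.full((4, 64, 64), 0.3)                # stand-in for the encoded constant image
# Pre-compute backgrounds for a few steps t (1-indexed) of the bootstrapping phase.
S_t = {t: noise_to_step(S, alpha_bar[t - 1], rng) for t in (1000, 900, 801)}
```

Matching the noise level of $S_t$ to that of $J_t$ keeps the composited input at a noise level the reference model expects at step $t$.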