Diffusion models as plug-and-play priors

Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, Dimitris Samaras

Introduction

Deep generative models, such as denoising diffusion probabilistic models [DDPMs; 39, 13] can capture the details of very complex distributions over high-dimensional continuous data $p(\mathbf{x})$ . The immense effective depth of DDPMs, sometimes with thousands of deep network evaluations in the generation process, is an apparent limitation on their use as off-the-shelf modules in hierarchical generative models, where models can be mixed and one model may serve as a prior for another conditional model. In this paper, we show that DDPMs trained on image data can be directly used as priors in systems that involve other differentiable constraints.

In our main problem setting, we assume that we have a prior $p(\mathbf{x})$ over high-dimensional data $\mathbf{x}$ and we wish to perform inference in a model that involves this prior and a constraint $c(\mathbf{x},\mathbf{y})$ on $\mathbf{x}$ given some additional information $\mathbf{y}$ . That is, we want to find an approximation to the posterior distribution $p(\mathbf{x}|\mathbf{y})\propto p(\mathbf{x})c(\mathbf{x},\mathbf{y})$ . In this paper, $p(\mathbf{x}=\mathbf{x}_{0},\mathbf{h}=\{\mathbf{x}_{T},...,\mathbf{x}_{1}\})$ is provided in the form of an independently trained DDPM over $\mathbf{x}_{T},\dots,\mathbf{x}_{0}$ (§2.2), making the DDPM a ‘plug-and-play’ prior.

Although the recent community interest in DDPMs has spurred progress in training algorithms and fast generation schedules , the possibility of their use as plug-and-play modules has not been explored. Furthermore, as opposed to existing work on plug-and-play models (starting from ), the algorithms we propose do not require additional training or finetuning of model components or inference networks.

One obvious application of plug-and-play priors is conditional image generation (§3.1, §3.2). For example, a denoising diffusion model trained on MNIST digit images might define $p(\mathbf{x})$ , while the constraint $c(\mathbf{x},\mathbf{y})$ may be be the probability of digit class $\mathbf{y}$ under an off-the-shelf classifier. However, changing the semantics of $\mathbf{x}$ , we can also use such models for inference tasks where neural networks struggle with domain adaptation, such as image segmentation: $c(\mathbf{x},\mathbf{y})$ constrains the segmentation $\mathbf{x}$ to match an appearance or a weak labeling $\mathbf{y}$ (§4). Finally, we describe a path towards using DDPM priors to solve continuous relaxations of combinatorial search problems by treating $\mathbf{y}$ as a latent variable with combinatorial structure that is deterministically encoded in $\mathbf{x}$ (§5).

DDPMs have previously been used for conditional generation and image segmentation . With few exceptions – such as , which uses a pretrained DDPM as a feature extractor – these algorithms assume access to paired data and conditioning information during training of the DDPM model. In , a classifier $p(y\mid\mathbf{x}_{t})$ that guides the denoising model towards the desired subset of images with the attribute $y$ is trained in parallel with the denoiser. In , generation is conditioned on an auxiliary image by guiding the denoising process through correction steps that match the low-frequency components of the generated and conditioning images. In contrast, we aim to build models that combine an independently trained DDPM with an auxiliary constraint.

Our approach is also related to work on adversarial examples. Adversarial samples are produced by optimizing an image $\mathbf{x}$ to satisfy a desired constraint $c$ – a classifier $p(\mathbf{y}|\mathbf{x})$ – without reference to the prior over data. As supervised learning algorithms can ignore the structure in data $\mathbf{x}$ , focusing only on the conditional distribution, it is possible to optimize for input $\mathbf{x}$ that provides the desired classification in various surprising ways . In , a diffusion model is used to defend from adversarial samples by making images more likely under a DDPM $p(\mathbf{x})$ . We are instead interested in inference, where we seek samples $\mathbf{x}$ that satisfy both the classifier and the prior. (Our work may, however, have consequences for adversarial generation.)

Conditional generation from unconditional models.

Works that preceded the recent popularity of DDPMs show how an unconditional generative model, such as a generative adversarial network [GAN; 11] or variational autoencoder [VAE; 21], can be combined with a constraint model to generate conditional samples. Regarding generative diffusion models, recent literature has focused on utilizing unconditional, pretrained DDPMs as priors to solve linear inverse imaging problems. Both in and , the authors modify the DDPM sampling algorithm, with knowledge of the linear degradation operator, to reconstruct an image consistent with the learned prior and given measurements. A generalization of these methods in shows how any pretrained denoising network can be used as the prior for solving linear inverse problems. We also clarify that although the term ‘plug-and-play’ is widely used in the inverse imaging literature we refer to it in the scope of in-domain generation under differentiable constraints, in the same sense as .

Latent vectors in DDPMs.

Modeling the latent prior distribution in VAE-like models using a DDPM has been studied in . On the other hand, in §5, we perform inference in the low-dimensional latent space under a pretrained DDPM on a high-dimensional data space. Our approach to semantic segmentation (§4) is also related to , where a prior $p(\mathbf{z})$ over latents is used to tune a posterior network $q(\mathbf{z}|\mathbf{x})$ . There, the priors are of relatively simple structure and are sample-specific, rather than global diffusion priors like in this paper.

Method

Recall that we want to find an approximation to the posterior distribution $p(\mathbf{x}|\mathbf{y})\propto p(\mathbf{x})c(\mathbf{x},\mathbf{y})$ , where $p(\mathbf{x})$ is a fixed prior distribution. Fixing $\mathbf{y}$ and introducing an approximate variational posterior $q(\mathbf{x})$ , the free energy

is minimized when $q(\mathbf{x})$ is closest to the true posterior, i.e., when ${\rm KL}(q(\mathbf{x})\|p(\mathbf{x}|\mathbf{y}))$ is minimized. When $q(\mathbf{x})$ , and the learning algorithm used to fit it, are expressive enough to capture the true posterior, this minimization yields the exact posterior $p(\mathbf{x}|\mathbf{y})$ . Otherwise, $q$ will capture a ‘mode-seeking’ approximation to the true posterior ; in particular, if $q(\mathbf{y})$ is a Dirac delta, it is optimal to concentrate $q$ at the mode of $p(\mathbf{x}|\mathbf{y})$ . When the prior involves latent variables $\mathbf{h}$ (i.e., $p(\mathbf{x})=\int_{\mathbf{h}}p(\mathbf{x}|\mathbf{h})p(\mathbf{h})\,d\mathbf{h}$ ), the free energy is

We are, in particular, interested in a general procedure for minimizing $F$ with respect to an approximate posterior $q(\mathbf{x})$ for any differentiable $c$ when $p$ is a DDPM (§2.2).

A free energy of the same structure was also studied in , where a DDPM $p(\mathbf{z})$ over a latent space is hybridized as a parent to a decoder $p(\mathbf{x}|\mathbf{z})$ , with an additional inference model $q(\mathbf{z}|\mathbf{x})$ trained jointly with both of these models. On the other hand, we aim to work with independently trained components that operate directly in the pixel space, e.g., an off-the-shelf diffusion model $p(\mathbf{x})$ trained on images of faces and an off-the-shelf face classifier $p(\mathbf{y}|\mathbf{x})$ , without training or finetuning them jointly (§3.2).

2 Denoising diffusion probabilistic models as priors

Denoising diffusion probabilistic models (DDPMs) generate samples $\mathbf{x}_{0}$ by reversing a (Gaussian) noising process. DDPMs are deep directed stochastic networks:

where $\boldsymbol{\mu}_{\theta}$ and $\mathbf{\Sigma}_{\theta}$ are neural networks with learned parameters (often, as in this paper, $\mathbf{\Sigma}_{\theta}$ is fixed to a scalar diagonal matrix depending on $t$ ). The model starts with a sample from a unit Gaussian $\mathbf{x}_{T}$ and successively transforms it with a nonlinear network $\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t},t)$ adding a small Gaussian innovation signal at each step according to a noise schedule. After $T$ steps, the sample $\mathbf{x}=\mathbf{x}_{0}$ is obtained.

In general, using such a model as a prior over $\mathbf{x}$ would require an intractable integration over latent variables $\mathbf{h}=(\mathbf{x}_{T},...,\mathbf{x}_{1})$ :

However, DDPMs are trained under the assumption that the posterior $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ is a simple diffusion process that successively adds Gaussian noise according to a predefined schedule $\beta_{t}$ :

Therefore, if $p(\mathbf{x})$ is the likelihood (5) of $\mathbf{x}$ under a DDPM, then in the first expectation of (2) we should use $q(\mathbf{h}=\{\mathbf{x}_{T},...,\mathbf{x}_{1}\}|\mathbf{x}_{0}=\mathbf{x})=\prod_{t=1}^{T}q(\mathbf{x}_{t}\mid\mathbf{x}_{t-1})$ . The simplest approximation to the posterior over $\mathbf{x}=\mathbf{x}_{0}$ is a point estimate:

where by $\delta$ we denote the Dirac delta function. Thus, we can sample $\mathbf{x}_{t}$ at any arbitrary time step using the forward noising process as

where $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{t}$ . Analogously to , we can also extract a conditional Gaussian $q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\boldsymbol{\eta})$ and express the first expectation in (2) as

where the stage $t$ noise reconstruction $\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)$ is a linear transformation of the model’s expectation $\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t},t)$ :

The weighting $w_{t}(\beta)$ is generally a function of the noise schedule, but in most pretrained diffusion models it is set to 1. Thus, the free energy in (2) reduces to

The first term is the cost usually used to learn the parameters $\theta$ of the diffusion model. To perform inference under an already trained model $\boldsymbol{\epsilon}_{\theta}$ , we instead minimize $F$ with respect to $\boldsymbol{\eta}$ through sampling $\boldsymbol{\epsilon}$ in the summands over $t$ .

A similar derivation applies if a Gaussian approximation to the posterior $q(\mathbf{x})$ is used (see §A). Such an approximation allows to model not only a mode of the posterior, but the uncertainty in its vicinity.

We summarize the algorithm for a point estimate $q(\mathbf{x})$ as Algorithm 1. Variations on this algorithm are possible. Depending on how close to a good mode we can initialize $\boldsymbol{\eta}$ , this optimization may involve summing only over $t\leq t_{\rm max}<T$ ; different time step schedules can be considered depending on the desired diversity in the estimated $\mathbf{x}$ . Note that optimization is stochastic and each time it is run it can produce different point estimates of $\mathbf{x}$ which are are both likely under the diffusion prior and satisfy the constraint as much as possible.

We observed that optimizing simultaneously for all $t$ makes it difficult to guide the sample towards a mode in image generation applications; therefore, we anneal $t$ from high to low values. Intuitively, the first few iterations of gradient descent should coarsely explore the search space, while later iterations gradually reduce the temperature to steadily reach a nearby local maximum of $p(\mathbf{x}|\mathbf{y})$ . Examples of annealing schedules designed for the tasks demonstrated in §3, 4, 5 are presented in the Appendix (Fig. B.1).

Another interesting case is when $\mathbf{x}$ is parametrized through a latent variable (this can be seen as a case of a hard, non-differentiable constraint: if $\mathbf{x}$ is a deterministic function of $\mathbf{y}$ , $\mathbf{x}=f(\mathbf{y})$ , then $c(\mathbf{x},\mathbf{y})$ is supported on the corresponding manifold). Then the procedure in Algorithm 1 can be performed with gradient descent steps with respect to $\mathbf{y}$ on

instead of steps 4 and 5. (For some semantics of the latent representation, one may wish to make the prior on $\mathbf{x}$ the pushforward by $f$ of a known prior on the latent $\mathbf{y}$ . In this case, (13) must be weighted by the Jacobian of $f$ at $\mathbf{y}$ .)

Experiments: Conditional image generation

We first explore the idea of generating conditional samples from an unconditional diffusion model on MNIST. We train the DDPM model of on MNIST digits and experiment with different sets of constraints $\log c(\mathbf{x},\mathbf{y})$ to generate samples with specific attributes. The examples in Fig. 1 showcase such generated samples. For the digit in (a) we set the constraint $\log c$ to be the unnormalized score of ‘thin’ digits, computed as negative of the average image intensity, whereas in (b) we invert that and generate a ‘thick’ digit with high mean intensity. Similarly, in (c) and (d) we hand-craft a score that penalizes the vertical and horizontal symmetry respectively, by computing the $L^{2}$ distance between the two folds (vertical/horizontal) of the digit $\mathbf{x}$ , which leads to the generation of skewed, non-symmetric samples.

We also showcase how the auxiliary constraint $c(\mathbf{x},\mathbf{y})$ can be modeled by a different, independently trained network. The digit in Fig. 1 (e) is generated by constraining the DDPM with a classifier network that is separately trained to distinguish between the digit class $\mathbf{y}=3$ and all other digits. The auxiliary constraint in this case is the likelihood of the inferred digit, as it is estimated by the classifier. Finally, for (f) we multiply horizontal symmetry and digit classifier constraints, prompting the inference procedure to generate a perfectly centered and symmetric digit. Details of model training and inference can be found in the Appendix (§B.1).

2 Using off-the-shelf components for conditional generation of faces

We consider the generation of natural images with a pretrained DDPM prior and a learned constraint. We utilize the pretrained DDPM network on FFHQ-256 from and a pretrained ResNet-18 face attribute classifier on CelebA . The attribute classifier computes the likelihood of presence of various facial features $y$ in a given image $\mathbf{x}$ , as they are defined by the CelebA dataset. Examples of such features are no beard, smiling, blond hair and male. To generate a conditional sample from the unconditional DDPM network we select a subset of these and enforce their presence or absence using the classifier predicted likelihoods as our constraint $c$ . If $\mathbf{y}$ is a set of attributes we wish to be present, the constraint $\log c(\mathbf{x},\mathbf{y})$ can be expressed as

We only strictly enforce a small subset of facial attributes and therefore $\mathbf{x}$ is allowed to converge towards different modes that correspond to samples that exhibit, in varying levels, the desired features.

In Fig. 2 we demonstrate our ability to infer conditional samples $\mathbf{x}$ with desired attributes $\mathbf{y}$ , using only the unconditional diffusion model and the classifier $p(\mathbf{y}\mid\mathbf{x})$ . In the first row, we show the results of the optimization procedure of Algorithm 1 for various attributes. The classifier objective $c(\mathbf{x},\mathbf{y})$ manipulates the image with the goal of making the classifier network produce the desired attribute predictions, whereas the diffusion objective attempts to pull the sample $x$ towards the learned distribution $p(\mathbf{x})$ . If we ignored the denoising loss, the result would be some adversarial noise that fools the classifier network. The DDPM prior, however, is strong enough to guide the process towards realistic-looking images that simultaneously satisfy the classifier constraint set.

We notice that the generated samples $\mathbf{x}$ , although having converged towards a correct mode of $p(\mathbf{x})$ , still exhibit a noticeable amount of noise related to the optimization of classifier objective. To address that, inspired by , we simply denoise the image using the DDPM model alone, starting from the low noise level $t=200$ so as to retain the overall structure. The results of this denoising are shown in the second row of Fig. 2.

In Fig. 3 we showcase the intermediate steps of the optimization process for inference with the conditions blond hair+smiling+not male, thus solving a problem like that studied in using only independently trained attribute classifiers and an unconditional generative model of faces. The sample $x$ is initialized with Gaussian noise $\mathcal{N}(\mathbf{0},\mathbf{I})$ , and as we perform gradient steps with decreasing values of $t$ , we observe facial features being added in a coarse-to-fine manner.

In the Appendix (§B.2) we provide additional samples and further discuss the sample quality in comparison to unconditional generation. We also present results on inference with conflicting attributes as well as common failure cases.

Experiments: Semantic image segmentation

Inferring semantic segmentations.

Domain transfer with inferred samples.

The above inference procedure is agnostic to colors by design, and we expect it to have a greater ability to perform in new areas than the approach in , which still finetunes networks that take raw images as input. We also investigate domain transfer approaches where patches segmented using the the diffusion prior are used to train neural networks for fast inference. We pretrain a standard U-Net inference network $p(\mathbf{x}\mid I)$ solely on 20k batches of 16 randomly sampled $64\times 64$ image patches in PA. We randomly sample 640 images in each of the other geographies and generate semantic segmentations using our inference procedure, then finetune the inference network on these segmentations. This network is then evaluated on the entire target geography.

The results in Table 1 demonstrate that this approach to domain transfer is comparable with the state-of-the-art work of for weakly-supervised training. The naïve approach of training a U-Net only on the available high-resolution PA data (PA supervised) fails to generalize to the geographically different location of Phoenix, AZ. Similarly, the model of , which is a US-wide high-resolution land cover model trained on imagery and labels, and multi-resolution auxiliary data over the entire contiguous US also suffers. When the weak labels are provided as input (PA supervised + weak) the results can improve significantly.

Experiments: Continuous relaxation of combinatorial problems

So far, we have considered inference under a DDPM prior and a differentiable constraint $c(\mathbf{x},\mathbf{y})$ . We consider the case of a ‘hard’ constraint, where $\mathbf{y}$ is a latent vector deterministically encoded in an image $\mathbf{x}$ ( $\mathbf{x}=f(\mathbf{y})$ ) and we have a DDPM prior over images $p_{\rm DDPM}(\mathbf{x})$ . We will use the variation of Algorithm 1 described at the end of §2.2 to obtain a point estimate of the distribution over $y$ , $p(\mathbf{y})\propto p_{\rm DDPM}(f(\mathbf{y}))$ .

Although the general form of the TSP is NP-hard, a polynomial-time approximation scheme is known to exist in the Euclidean case and can yield proofs of tour optimality for small problems.

Humans have been shown to have a natural propensity for solving the Euclidean TSP (see for a survey). Humans construct a tour by processing an image representation of the points $v_{1},\dots,v_{N}$ through their visual system. However, the optimization algorithms in common use for solving the TSP do not use a vision inductive bias, instead falling into two broad categories:

Discrete combinatorial optimization algorithms and efficient integer programming solvers, studied for decades in the optimization literature ;

More recently, there has been work on neural nets, trained by reinforcement learning or imitation learning, that build tours sequentially or learn heuristics for their (discrete) iterative refinement. Successful recent approaches have used Transformer and graph neural network architectures.

The algorithm we propose using DDPMs is a hybrid of these categories: it reasons over a continuous relaxation of the problem, but exploits the learning of generalizable structure in example solutions by a neural model. In addition, ours is the first TSP algorithm to mimic the convolutional inductive bias of the visual system.

Fix a set of points $v_{1},\dots,v_{N}\in\times$ . We encode an symmetric $N\times N$ matrix with 0 diagonal $A$ as a $64\times 64$ greyscale image $f(A)$ by superimposing: (i) raster images of line segments from $v_{i}$ to $v_{j}$ with intensity value $A_{ij}$ for every pair $(i,j)$ , and (ii) raster images of small black dots placed at $v_{i}$ for each $i$ . For example, if $A$ is the adjacency matrix of a tour, then $f(A)$ is a visualization of this tour as a $64\times 64$ image.

Diffusion model training.

We use a dataset of Euclidean TSPs, with ground truth tours obtained by a state-of-the-art TSP solver , from (we consider two variants of the dataset, each with $\sim$ 1.5m training graphs: with 50 vertices in each graph and with a varying number from 20 to 50 vertices in each graph). Each training tour is represented via its adjacency matrix $A$ and encoded as an image $f(A)$ . We then train a DDPM with the U-Net architecture from on all of such encoded image. Model and training details can be found in the Appendix (§B.4). Some unconditional samples from the trained DDPM are shown in Fig. 6; most samples indeed resemble image representations of tours.

Solving new TSPs.

Suppose we are given a new set of points $v_{1},\dots,v_{N}$ . Solving the TSP requires finding the adjacency matrix $A$ of a tour of minimal length. As a differentiable relaxation, we set $A=S+S^{\top}$ , where $S$ is a stochastic $N\times N$ matrix with zero diagonal (parametrized via softmax of a matrix of parameters over rows). We run the inference procedure using the trained DDPM $p_{\rm DDPM}(f(A))$ as a prior to estimate $A$ . The hyperparameters and noise schedule are described in §B.4. Examples of the optimization are shown in Fig. 7.

Although the inferred $A$ is usually sharp (i.e., all entries close to 0 or 1), rounding $A$ to 0 or 1 does not always give the adjacency matrix of a tour (see, for example, the top row of Fig. 7; other common incorrect outputs include pairs of disjoint tours). To extract a tour from the inferred $A$ , we greedily insert edges to form an initial proposal, then refine it using a standard and lightweight combinatorial procedure, the 2-opt heuristic (amounting to iteratively uncrossing pairs of edges that intersect). The entire procedure is shown in Fig. 7, and full details can be found in the Appendix (§B.4).

Results.

We evaluate the trained models on test sets of 1280 graphs each with $N=50$ and $N=100$ vertices. We report the average length of the inferred tour and the gap (discrepancy from the length of the ground truth tour) in Table 2 (left), from which we make several observations.

The right side of Table 2 shows the number of 2-opt (edge uncrossing) steps performed in the refinement step of the algorithm when the inference algorithm is run for varying numbers of steps. Running the inference with more steps results in extracted tours that are closer to local minima with respect to the 2-opt neighbourhood, indicating that the DDPM encodes meaningful information about the shape of tours.

The DDPM inference is competitive with recent baseline algorithms that do not use beam search in generation of the tour (those shown in the table). These baseline algorithms improve when beam search decoding with very large beam size is used, but encounter diminishing returns as the computation cost grows. Our performance on the 100-vertex problems is similar to with the largest beam size they report (5000), which has 2.18% gap, while having similar computation time.

The model trained on problems with 50 nodes performs almost identically to the model trained on problems with 50 or fewer nodes, and both models generalize better than baseline methods from 50-node problems to the out-of-distribution 100-node problems.

We emphasize a unique feature of our algorithm: all ‘reasoning’ in our inference procedure happens via the image space. This property also leads to sublinear computation cost scaling with increasing size of the graph – as long as it can reasonably be represented in a $64\times 64$ image – since most of the computation cost of inference is borne by running the denoiser on images of a fixed size. In the Appendix (§B.4) we explore the generalization of the model trained on 20- to 50-node TSP instances to problems with 200 nodes and discuss potential extensions.

Conclusion

We have shown how inference in denoising diffusion models can be performed under constraints in a variety of settings. Imposing constraints that arise from pretrained classifiers enables conditional generation, while common-sense conditions, such as mutual information with a clustering or divergence from weak labels, can lead to models that are less sensitive to domain shift in the distribution of conditioning data.

A notable limitation of DDPMs, which is inherited by our algorithms, is the high cost of inference, requiring a large number of passes through the denoising network to generate a sample. We expect that with further research on DDPMs for which inference procedures converge in fewer steps , plug-and-play use of DDPMs will become more appealing in various applications.

Finally, our results on the traveling salesman problem illustrate the ability of DDPMs to reason over uncertain hypotheses in a manner that can mimic human ‘puzzle-solving’ behavior. These results open the door to future research on using DDPMs to efficiently generate candidates in combinatorial search problems.

Acknowledgments

The authors thank the anonymous NeurIPS 2022 reviewers for their comments.

All authors are funded by their primary institutions. Partial support was provided by the NASA Biodiversity program (Award 80NSSC21K1027), NSF grants IIS-2123920 and IIS-2212046, and the Partner University Fund 4D Vision award.

References

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] See the conclusion and discussion throughout the paper.

Did you discuss any potential negative societal impacts of your work? [N/A] Although no immediate negative societal impacts are expected, researchers should bear in mind the risks of flexible conditional generation of images, e.g., for creating ‘deep fakes’.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [N/A]

Did you include complete proofs of all theoretical results? [N/A]

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] For most experiments; see the Appendix.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the Appendix and relevant experiment sections.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Main experiments were run one time.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the Appendix.

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes] See the relevant experiment sections.

Did you mention the license of the assets? [No] But all datasets used are free to use for research purposes; see the relevant citations.

Did you include any new assets either in the supplemental material or as a URL? [N/A]

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Deriving a Gaussian approximation to the posterior

We repeat the derivation in §2.2 for a Gaussian, rather than a point estimate of the posterior.

Recall that if $p(\mathbf{x})$ is the likelihood (5) of $\mathbf{x}$ under a DDPM, then in the first expectation of (2) we should use $q(\mathbf{h}=\{\mathbf{x}_{T},...,\mathbf{x}_{1}\}|\mathbf{x}_{0}=\mathbf{x})=\prod_{t=1}^{T}q(\mathbf{x}_{t}\mid\mathbf{x}_{t-1})$ . A computationally and notationally convenient form for the approximate posterior over $\mathbf{x}=\mathbf{x}_{0}$ is a scalar-covariance Gaussian:

We can sample $\mathbf{x}_{t}$ at any arbitrary time step as

where $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{i=0}^{t}\alpha_{t}$ , with the convention that $\beta_{0}=\psi$ (note the difference with (8), which is the special case of (17) with $\beta_{0}=0$ ). Analogously to , we can also extract a conditional Gaussian $q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\psi,\eta)$ and express the first expectation in (2) as

where the link between the stage $t$ noise reconstruction $\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)$ and the model’s expectation $\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t},t)$ is

Assuming uniform weighting of the noising steps as before, the free energy in (2) reduces to

Appendix B Experiment details and extensions

To train the diffusion model we used the U-Net architecture of with a linear $\beta_{t}$ schedule and $T=1000$ diffusion steps. We trained the network for 10 epochs, with a batch size of 128 samples, using the Adam optimizer and a learning rate of $10^{-4}$ .

Performing inference.

For all inference examples, we performed 1000 optimization steps with the Adam optimizer and a learning rate $10^{-2}$ . We employed a cosine-modulated, linearly decreasing $t$ annealing schedule as shown in Fig. B.1 (a). We empirically designed this annealing process following the observation that the linearly decreasing $t$ values guide the inference procedure in a coarse-to-fine manner that starts by deciding the overall structure of the inferred sample and then adding in details. We also added the oscillating component to allow for revisions of the coarser structures that are to be made after having inferred specific details.

When performing the optimization step of Algorithm 1, we observed that it was important to gradually reduce the effect of the condition $c(\mathbf{x},\mathbf{y})$ in order to obtain good sample quality. In practice, we linearly decreased the weight of the conditional component of the loss, from $\lambda_{T}=10^{-2}$ to $\lambda_{1}=0$ as we performed the optimization steps from $T,\dots,1$ . This can be attributed to the fact that the conditions we used provide guidance for the steps made at larger values of $t$ , where the shape and orientation of a digit are decided. When combining two or more conditions the weighting is applied to all of them.

B.2 FFHQ

For the conditional generation experiments on the FFHQ dataset we utilized the pretrained DDPM model provided by . The face attribute classifier network was a ResNet-18 network trained on the face attributes given in the CelebA dataset . To run our inference algorithm we performed 200 optimization steps with the Adamax optimizer, choosing $(\beta_{1},\beta_{2})=(0.9,0.999)$ and a linearly decreasing learning rate from $1$ to $0.5$ . The $t$ annealing schedule was similar to the one used for the land cover segmentation experiments (Fig. B.1 (b)) but for $t$ values now ranging from 1000 to 200. Additionally, in this experiment we found that balancing between the diffusion and auxiliary losses with a carefully chosen weighting term was difficult. Thus, we opted for a different approach where we clipped the gradient norm of the auxiliary objective to $\frac{1}{2}$ of the gradient norm of the diffusion denoising loss.

Further discussion on samples.

In Fig. B.2 we demonstrate additional conditionally-generated samples from the unconditional DDPM and the attribute classifier. In the first set of examples we show that although we may find modes of $p_{DDPM}(x)$ that satisfy to a level the condition $\mathbf{y}$ set by the classifier, the sample quality is not always on-par with unconditionally generated samples, like those presented in . We can attribute that to the fact that for natural images, in contrast to segmentation labels, the mode may not always be a good-looking sample from the distribution. Our method to mitigate that, along with the classifier noise artifacts left from the optimization process, is to run the diffusion denoising procedure starting from a low temperature $t$ . Although this may improve the visual quality of the result, in some cases our choice of $t$ is not large enough to move the sample far enough from the inferred $\mathbf{x}$ . If we choose a larger $t$ however we risk erasing the attributes we aimed to generate in the first place.

In the second set of samples, we first show how conflicting attributes are resolved. When the constraint is set to satisfy two attributes that contradict each other we observed that the inferred sample $\mathbf{x}$ tends to gravitate towards a single randomly-chosen direction. This is evident in the first two examples where we set the not male attribute along with a male-correlated attribute. In each of them only a single condition, either the not male or the male-related, is satisfied. In the blonde+black hair example we could argue that a mix of the two attributes is present in the inferred sample. However, the classifier predictions for that specific image tell us that the person shown is exclusively blonde.

We also show a set of failure cases where the classifier ‘painted’ the features related to the desired attribute but the diffusion prior did not complete the sample in a correct way. For instance, in the eyeglasses example we see that the classifier has drawn an outline of the eyeglass edges on the generated face but the diffusion model has failed to pick up the cue. Similarly, when asking for wavy hair we see curves that can fool the classifier into thinking that the person has curly hair, or when the attributes set are smiling+mustache we observe a comically drawn mustache on the generated face. Since the conditioning depends both on the diffusion prior and the robustness of the classifier we believe that with better classifier training we could improve the result in such cases.

B.3 Land cover

The land cover DDPM was trained on $\frac{1}{4}$ -resolution, $64\times 64$ patches of land cover labels, randomly sampled from the Pittsburgh, PA tiles of the EnviroAtlas dataset . For the diffusion network, we used the U-Net architecture of , a linear $\beta_{t}$ schedule and $T=1000$ diffusion steps. We trained with $10^{5}$ batches of size 32, using the Adam optimizer and a learning rate of $10^{-4}$ . Additional samples from the unconditional diffusion model are shown in Fig. B.3. We observe that the model has learned both structures that are independent of the geography, such as the continuity of roads and the suburban building planning, and PA-specific ones, such as buildings nested in forested areas, which may not be as common in AZ for instance.

Performing inference.

Since we initialize the inference procedure with the weak labels we require fewer optimization steps and do not have to start the search from $t=1000$ . Thus, to infer the land cover segmentations we only perform 200 optimization steps using the Adam optimizer, with a linearly decreasing learning rate from $5\times 10^{-3}$ to $5\times 10^{-6}$ and $(\beta_{1},\beta_{2})=(0,0.999)$ . The annealing schedule we designed for this task reflects the needs for fewer overall steps and is shown in B.1 (b). We also decrease the weights of both conditional components of the loss, from $\lambda_{T}=1$ to $\lambda_{1}=0$ as we perform the optimization steps $T,\dots,1$ to reduce their influence on the final inferred sample. In addition, we linearly decrease the $\sigma$ parameter that is used to convert the one-hot representations learned from the DDPM model to probabilities, from $\sigma_{T}=0.2$ to $\sigma_{1}=0.02$ to mimic the uncertainty of this conversion process. Further examples of land cover segmentation inference are shown in Fig. B.4. Despite the fact that the DDPM was trained only on PA land cover labels we show how the weak label guidance allows us to perform inference in completely new geographies, such as that of AZ (last two rows), where the most prominent label is now Soil and Barren. We can still observe a few artifacts of the PA-related biases the model has learned, like the tendency to add uninterrupted forested areas but the transferability of the semantic model is still far superior than an of an image-based one.

The local color clustering $\mathbf{z}$ is computed as a local Gaussian mixture with a fixed number of components. To match the structure between the predicted labels and the precomputed clustering we compute the mutual information between the two distributions in overlapping patches of $31\times 31$ pixels. This choice of constraint pushes the inferred land covers segments in a way that they should match locally the color clustering segments. Although this allows us to infer the labels of large structures like roads and buildings it also tends to add noisy labels at areas where the clustering has a high entropy. By gradually reducing the weight of the auxiliary objective however, we allow the inference procedure to ‘fill in’ these details as it is dictated by the diffusion prior.

Domain transfer.

Regarding the domain transfer experiments, we initially pretrained the standard inference U-Net on $2\times 10^{4}$ batches of 16 randomly sampled $64\times 64$ image patches in Pittsburgh, PA, using the Adam optimizer with a learning rate of $10^{-4}$ . We then inferred the land cover segmentations of 640 randomly-sampled patches in each of the other geographic regions, (NC, TX, AZ) using the inference procedure described above. With these generated labels, we first finetuned the original network on a validation set of 5 tiles to determine the optimal finetuning parameters. For Durham, NC and Austin, TX we only finetune the last layer of the network for a single epoch, using a batch size of 16 patches and a learning rate of $5\times 10^{-4}$ . For Phoenix, AZ we require 5 epochs of finetuning the entire network with a learning rate of $5\times 10^{-4}$ since the domain shift is larger. Additionally, for all regions, following the experiments of , we multiply the predicted probabilities with the weak labels and renormalize.

Finally, in Table 1, we also present the results when the inference network is trained from scratch, to show that the resulting performance is not only an artifact of the pretraining. The U-Net was trained for 20 epochs on all 640 generated samples, with a batch size of 16 and a learning rate of $10^{-3}$ .

B.4 TSP

The DDPM was trained on $64\times 64$ images of ground truth TSP solutions encoded as images. The architecture was the same U-Net as used in the other experiments, with the architecture from and $T=1000$ diffusion steps in training. We trained each model for 8 epochs with batch size 16, which took about two days on one Tesla K80 GPU.

Performing inference.

At inference time, we performed varying numbers of inference steps (see Table 2 in the main text), using the Adam optimizer with $(\beta_{1},\beta_{2})=(0,0.9)$ and a learning rate linearly decaying from 1 to 0.1. The noise schedule was the same as that used in the MNIST experiment (Fig. B.1), with the time interval from 0 to 1000 linearly resampled to the number of inference steps used.

To extract a tour from the inferred adjacency matrix $A$ , we used the following greedy edge insertion procedure.

Initialize extracted tour with an empty graph with $N$ vertices.

Sort all the possible edges $(i,j)$ in decreasing order of $A_{ij}/\|v_{i}-v_{j}\|$ (i.e., the inverse edge weight, multiplied by inferred likelihood). Call the resulting edge list $(i_{1},j_{1}),(i_{2},j_{2}),\dots$ .

If inserting $(i,j)$ into the graph results in a complete tour, insert $(i,j)$ and terminate.

If inserting $(i,j)$ results in a graph with cycles (of length $<N$ ), continue.

It is easy to see that this algorithm terminates before the entire edge list has been traversed. The tour is refined by a naïve implementation of 2-opt, in which, on each step, all pairs of edges in the tour $((i,j),(k,l))$ are enumerated and a 2-opt move is performed if the edges cross. For the ‘2-opt’ baseline, the same procedure is performed using a uniform adjacency matrix.

Results on larger problems.

Extending the results in Table 2 of the main text, we evaluate the model trained on TSP instances with 20 to 50 nodes on problems with 200 nodes. We find an optimality gap of 3.77% (average number of uncrossing moves 219), compared to 3.81% for 2-opt (average number of uncrossing moves 115), suggesting that the generalization potential is near-saturated at this problem size. As shown in Fig. B.8, the vertices fill the $64\times 64$ image with such high density that it is difficult to see the (light grey) tour; many edges are invisible (compare to Fig. 7 in the main text).

We suggest three directions to solving to this problem that should be explored in later work:

Encoding: The size of the encoding image can be increased (for example, to $128\times 128$ ) when the number of vertices increases, without changing the model (trained on $64\times 64$ images), which can make denoising predictions on images of any size. We may expect to see better out-of-domain generalization of the denoising model in this setting, as the density of nodes (mean number of black pixels) would match that in the training set. Figs. B.8 and B.6 show the potential of DDPMs to generalize to image sizes larger than those in which they were trained. Inference using $128\times 128$ images gives an optimality gap of 2.59% (average number of uncrossings 81), much lower than that obtained with in-domain image size.

In addition, encoding graphs with smaller dots and thinner lines can be explored, although the generalization difficulties due to image ‘crowding’ would still appear at a larger value of $N$ .

Fractal behaviour and coarse-to-fine: Taking advantage of the fractal structure of Euclidean TSP solutions, a denoising objective could be used to locally refine the tour by minimizing the objective on a crop of the image representation (a form of DDPM-guided local search). This could be done in a coarse-to-fine manner by application of the same model at different scales, with a $128\times 128$ representation of a problem with 200 vertices being first optimized with respect to the denoising objective globally, then on $64\times 64$ crops.

Improved extraction: The 2-opt search can be improved by inexpensive heuristics, such as choosing the 2-opt move that most improves the cost on every step, rather than iterating through the edges of the candidate tour in order.