Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach

Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, Junchi Yan

Introduction

Image outpainting (Lin et al., 2021a; Cheng et al., 2022; Wang et al., 2021), a.k.a. image extrapolation (Wang et al., 2022; Kim et al., 2021; Zhang et al., 2020), aims to generate new content beyond the original boundaries of a given sub-image. It is technically an essential problem for generative models and remains relatively open (compared with other condition-based generation settings, e.g. image inpainting (Bertalmio et al., 2000) and style transfer (Luan et al., 2017)), while also finding wide applications in automatic creative image generation and virtual reality. Usually, an ideal outpainter is expected to achieve the following basic functions (Yao et al., 2022): 1) determining where the missing regions should be located relative to the output’s spatial locations for both nearby and faraway features; 2) guaranteeing that the extrapolated image has a consistent structural layout with the given sub-image; and 3) keeping the borders between extrapolated regions and original input images visually smooth.

These requirements have been mainly addressed by existing outpainting methods, which in general fall into two categories: i) GAN-based methods (Van Hoorick, 2019; Yang et al., 2019), whereby random noise and the initial input sub-images (as conditions) are used to generate the fake surrounding image content, and a discriminator classifies the generated images as fake or real; ii) MAE-based methods (Yao et al., 2022), which use MAE (Masked Autoencoder) (He et al., 2022) as the main architecture. The latter model the extrapolation as an MIM (Masked Image Modeling) problem (Xie et al., 2022) by replacing the extrapolated regions around the input sub-images with masked tokens and predicting the pixels of the masked patches. These MAE-based methods also employ a discriminator to enhance the smoothness of the borders between extrapolated regions and the original sub-images. The above two kinds of methods mainly suffer from two applicability limitations. First, as shown in Fig. 1(a), they must run multiple times to outpaint the image: e.g., the SOTA method of Yao et al. (2022) outpaints 11.7x images by passing through the model three times (x $\rightarrow$ forward $\rightarrow$ 2.25x $\rightarrow$ forward $\rightarrow$ 5x $\rightarrow$ forward $\rightarrow$ 11.7x), which is inefficient, especially when the required expansion multiple is large (e.g. 99x). Second, the discriminator-based architecture can slow down convergence (Goodfellow et al., 2014; Adler & Lunz, 2018) and requires a pre-trained encoder. In other words, the computational cost of training these models in fact includes the pretraining cost, which is usually very high (e.g. 1,000 or more epochs on ImageNet).

Recently, diffusion models (Dhariwal & Nichol, 2021; Pokle et al., 2022; Jin et al., 2023b; c; Augustin et al., 2022) have shown success in multi-modal (Jin et al., 2023a; 2022) and image generation (Saharia et al., 2022) with a progressive denoising procedure. Yet such an iterative step-by-step procedure (over so-called timesteps) in the sampling (i.e. testing) stage can be too tedious for outpainting, especially considering that image outpainting itself still requires its own iterations (e.g. QueryOTR (Yao et al., 2022), IOH (Van Hoorick, 2019)). To achieve one-step diffusion-based generation, we propose to use relative positional queries and input sub-images as conditions. Since the relative positional embedding can represent any positional relationship between the input sub-image and the extrapolated image, we can outpaint the sub-image in controllable and continuous multiples in one step (Fig. 1(b)). We make methodology comparisons in Table 1 to better position our method. The main contributions include:

i) Continuous multiples for image outpainting. We propose PQDiff, which learns the positional relationships and pixel information at the same time. Specifically, in the training stage, PQDiff first randomly crops the given image twice to generate two views. Then, PQDiff learns one cropped view from the other cropped view and the pre-calculated relative positional embeddings (RPE) of the two views. Since the RPE can represent continuous relationships between two views, PQDiff can outpaint images in continuous multiples. To the best of our knowledge, we are the first to outpaint images in continuous multiples (e.g., 1x, 2.25x, 3.6x, 21.8x), whereas the SOTA QueryOTR (Yao et al., 2022) can only outpaint images in discrete multiples.

ii) One-step image outpainting. We propose a position-aware cross-attention mechanism between the relative positional embedding and the input sub-image patches, which helps PQDiff outpaint images in only one step for any multiple setting. As far as we know, PQDiff is the first to achieve this capability, whereas prior methods (Yao et al., 2022; Yang et al., 2019) can only outpaint images step-by-step, which severely limits their sampling (i.e. generation) efficiency. Under the 2.25x, 5x and 11.7x outpainting settings, PQDiff only takes 40.6%, 20.3%, and 10.2% of the time of QueryOTR (Yao et al., 2022), respectively.

iii) New SOTA performance. Experimental results on outpainting benchmarks (Gao et al., 2023; Yang et al., 2019) show that PQDiff significantly surpasses QueryOTR (Yao et al., 2022) and achieves new SOTA 21.512, 25.310 and 36.212 FID scores with the challenging 11.7x multiple setting on the Scenery, Building Facades, and WikiArt datasets, respectively. Moreover, PQDiff achieves new SOTA results in most settings (2.25x, 5x, and 11.7x).

Background and Related Work

Image Outpainting. It aims to generate the surrounding regions from the visual content, which can be considered as an image-conditioned generation task (Odena et al., 2017; Kang et al., 2021; Guo et al., 2020; Arjovsky et al., 2017; Gulrajani et al., 2017). The work (Sabini & Rusak, 2018) brings the image outpainting task to attention with a deep neural network inspired by image inpainting (Bertalmio et al., 2000). It focuses on enhancing the quality of generated images by using GANs and post-processing to perform smooth horizontal outpainting. The work (Van Hoorick, 2019) designs a CNN-based encoder-decoder framework trained with a GAN for image outpainting. In (Wang et al., 2019), a Semantic Regeneration Network is proposed to directly learn the semantic features from the conditional sub-image, while a 3-stage model is developed in (Lin et al., 2021b) with an edge-guided generative network to produce semantically consistent output. Although these methods avoid bias in the general padding and up-sampling pattern, they still suffer from blunt structures and abrupt color changes, and tend to ignore spatial and semantic consistency. To tackle these issues, a Recurrent Content Transfer (RCT) block is devised (Yang et al., 2019) for temporal content prediction with Long Short Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997). To enrich the context, (Lu et al., 2021) additionally switches the outer area of images into its inner area.

The proposed Positional-Query Based Diffusion model

We provide a concrete embodiment based on the diffusion model, which we term PQDiff, whereby continuous multiples and one-step generation are achieved. In fact, our PQ framework with these two advantages can also incorporate other generative models, e.g. GANs, and the empirical performance comparison is given in our ablation studies. We also show the significant improvements of our approach over a vanilla diffusion model applied directly to outpainting.

Approach Overview. Our approach mainly consists of four key modules: relative positional embedding, diffusion process, cross attention in the position-aware transformer model, and sampling pipeline.

where $e=10,000$ is a pre-defined parameter, as also commonly used in (He et al., 2022; Zhang et al., 2023a; b), and $(m,n)$ denotes the top-left patch position (see Fig. 2). Note that we randomly crop two views; as a result, the positional relationship of the two views may be containing, overlapping, or non-overlapping, and our model can jointly learn all three relationships.
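To make the three relationships concrete, the following is a minimal Python sketch that samples two square crops and classifies how they relate. It is an illustration rather than the implementation used in the paper: the helper names `random_crop_box` and `relationship` are hypothetical, boxes are written as (top, left, height, width), and the crop ratio is assumed to be a fraction of the image area, which the text does not specify.

```python
import math
import random

def random_crop_box(img_h, img_w, ratio_range, rng):
    """Sample a square crop (top, left, height, width).

    Assumption: the crop ratio is a fraction of the image *area*,
    so the side length scales with sqrt(ratio).
    """
    ratio = rng.uniform(*ratio_range)
    side = max(1, int(math.sqrt(ratio) * min(img_h, img_w)))
    top = rng.randint(0, img_h - side)    # randint is inclusive on both ends
    left = rng.randint(0, img_w - side)
    return (top, left, side, side)

def relationship(anchor, target):
    """Classify two boxes as 'containing', 'overlapping', or 'non-overlapping'."""
    at, al, ah, aw = anchor
    tt, tl, th, tw = target
    # Height and width of the intersection rectangle (non-positive if disjoint).
    inter_h = min(at + ah, tt + th) - max(at, tt)
    inter_w = min(al + aw, tl + tw) - max(al, tl)
    if inter_h <= 0 or inter_w <= 0:
        return "non-overlapping"
    inter_area = inter_h * inter_w
    # One box lies fully inside the other when the intersection equals its area.
    if inter_area == ah * aw or inter_area == th * tw:
        return "containing"
    return "overlapping"

rng = random.Random(0)
anchor = random_crop_box(256, 256, (0.15, 0.5), rng)   # small anchor view
target = random_crop_box(256, 256, (0.8, 1.0), rng)    # large target view
print(anchor, target, relationship(anchor, target))
```

With a small anchor and a large target, the containing and overlapping cases dominate; the non-overlapping case becomes more likely when both views are small.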

where $\mathbf{z}_{b_{0}}$ is the original target sequence $\{\mathbf{z}_{b}^{1},\mathbf{z}_{b}^{2},\cdots,\mathbf{z}_{b}^{L}\}$ and $\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ . For the backward process, suppose we have a neural network $g_{\theta}$ (described later) that takes the noisy target $\mathbf{z}_{b_{t}}$ , the clean anchor sequence $\mathbf{z}_{a}$ , and the relative positional embeddings $\mathbf{E}$ as input. The network aims to predict the noise $\epsilon_{t}$ added to the target sequence $\mathbf{z}_{b_{t}}$ . Then, the objective of PQDiff can be written as:

We set $p=2$ in line with the previous generative methods (Rombach et al., 2022; Bao et al., 2023).
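For intuition, the noising step and the noise-prediction loss can be sketched in a few lines of Python. The sketch assumes the standard DDPM closed form (Ho et al., 2020), $\mathbf{z}_{b_{t}}=\sqrt{\overline{\alpha}_{t}}\,\mathbf{z}_{b_{0}}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon_{t}$ with $\alpha_{i}=1-\beta_{i}$, and an element-wise mean $L_{p}$ loss; the function names and the linear $\beta$ schedule are illustrative assumptions, and the network $g_{\theta}$ is replaced by a placeholder prediction.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of alpha_i = 1 - beta_i for i = 1..t (t is 1-indexed)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod

def add_noise(z0, eps, a_bar):
    """Closed-form forward process: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
            for z, e in zip(z0, eps)]

def lp_loss(eps_true, eps_pred, p=2):
    """Mean of |eps_true - eps_pred|^p over all elements (p = 2 is the usual MSE)."""
    return sum(abs(a - b) ** p for a, b in zip(eps_true, eps_pred)) / len(eps_true)

# Hypothetical linear schedule with 1,000 training timesteps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

rng = random.Random(0)
z_b0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # flattened clean target tokens
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]       # Gaussian noise
z_bt = add_noise(z_b0, eps, alpha_bar(betas, 500))  # noisy target at t = 500

eps_pred = [0.0] * 8   # placeholder for g_theta(z_bt, z_a, E, t)
print(lp_loss(eps, eps_pred, p=2))
```

In training, `eps_pred` would come from the network conditioned on the anchor sequence and the relative positional embeddings; only the target view is noised, while the anchor stays clean.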

The Position-Aware Transformer Model $g_{\theta}$ . Here, we describe in detail the architecture of the neural network used in the diffusion model. Given the noisy target $\mathbf{z}_{b_{t}}$ , the clean anchor sequence $\mathbf{z}_{a}$ , the relative positional embeddings $\mathbf{E}$ , and the timestep $t$ , we first concatenate the noisy target and the anchor sequence along the channel dimension, followed by a linear layer that maps back to the original dimension to reduce the computational cost. We denote the mapped embedding as $\mathbf{z}_{g}\in\mathbb{R}^{L\times D}$ , where $D$ is the predefined hidden dimension of the transformer network. Then, we feed $\mathbf{z}_{g}$ into the transformer encoder, which is composed of several transformer blocks (Vaswani et al., 2017). After the transformer encoder, the position-aware cross-attention mechanism is applied to learn the positional relationship, which can be formulated as:

where $\mathbf{Q}_{E}=\mathbf{E}\mathbf{W}_{q}$ , $\mathbf{K}_{\mathbf{z}_{g}}=\mathbf{z}_{g}\mathbf{W}_{k}$ , $\mathbf{V}_{\mathbf{z}_{g}}=\mathbf{z}_{g}\mathbf{W}_{v}$ , and $\mathbf{W}_{q}$ , $\mathbf{W}_{k}$ , $\mathbf{W}_{v}$ are learnable parameters. After obtaining $\mathbf{z}_{d}$ , which captures the information at the target positions, we directly feed $\mathbf{z}_{d}$ into the transformer decoder composed of several transformer blocks, followed by a convolutional layer to predict the noise.
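The key point of this mechanism is that the queries come from the positional embeddings while the keys and values come from the image features. A minimal single-head sketch in plain Python is given below. It assumes the usual scaled dot-product form $\mathrm{softmax}(\mathbf{Q}_{E}\mathbf{K}_{\mathbf{z}_{g}}^{\top}/\sqrt{D})\mathbf{V}_{\mathbf{z}_{g}}$ (Vaswani et al., 2017); the scaling factor, the single head, and the helper names are assumptions rather than details confirmed by the text.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def positional_cross_attention(E, z_g, W_q, W_k, W_v):
    """Queries from positional embeddings E; keys and values from features z_g."""
    Q = matmul(E, W_q)        # (number of queries, D)
    K = matmul(z_g, W_k)      # (L, D)
    V = matmul(z_g, W_v)      # (L, D)
    d = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)   # (number of queries, L)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)  # one output row per positional query

identity = [[1.0, 0.0], [0.0, 1.0]]
z_g = [[1.0, 2.0], [3.0, 4.0]]   # two feature tokens, D = 2
E = [[0.0, 0.0], [5.0, 5.0]]     # two positional queries
z_d = positional_cross_attention(E, z_g, identity, identity, identity)
print(z_d)
```

Because the output has one row per positional query, the number and location of generated tokens is set by $\mathbf{E}$ rather than by the input sequence, which is what allows arbitrary target positions to be requested at sampling time.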

Sampling Pipeline. Once the network is trained, we can outpaint the image in any controlled multiple, since the designed relative positional encoding can represent any positional relationship between two images. In the sampling stage, we simply take the input sub-image as the anchor view and specify any target position we want. We then calculate the positional encoding of the given position and feed the RPE to the network, which predicts the noise as described in Eq. 3. Finally, through Eq. 2, we can compute the estimate $\widetilde{\mathbf{z}}_{b_{0}}$ , and predict $\mathbf{z}_{b_{t-1}}$ step-by-step by:

After several iterations, when $t=0$ , we can obtain the extrapolated images. The training and sampling algorithms are given in Alg. 1 and Alg. 2 in Appendix A, respectively.

Discussion. As DDPM (Ho et al., 2020) requires step-by-step sampling, DDIM (Song et al., 2021a) was proposed to enable faster sampling based on an ODE formulation (Song et al., 2021b). Our PQDiff can also use DDIM for faster sampling. Specifically, through the Euler method and the probability flow ODE proposed in (Song et al., 2021b), we can obtain $\mathbf{z}_{b_{t-\Delta t}}$ by:

Then, the one-timestep sampling in each iteration can be replaced with $\Delta t$ timesteps in each iteration.
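As an illustration of skipping $\Delta t$ timesteps at once, the sketch below implements the widely used deterministic DDIM update ($\eta=0$): first estimate the clean target from the predicted noise, then re-noise it to the earlier timestep. This is the standard DDIM form and is assumed here for illustration; it is not necessarily identical to the Euler-based expression used in the paper, and the function names are hypothetical.

```python
import math

def predict_z0(z_t, eps_pred, a_bar_t):
    """Invert the forward process: z0_hat = (z_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)."""
    return [(z - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
            for z, e in zip(z_t, eps_pred)]

def ddim_step(z_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic DDIM update (eta = 0) from timestep t to an earlier timestep."""
    z0_hat = predict_z0(z_t, eps_pred, a_bar_t)
    return [math.sqrt(a_bar_prev) * z0 + math.sqrt(1.0 - a_bar_prev) * e
            for z0, e in zip(z0_hat, eps_pred)]

# Toy check: if the noise prediction is exact, stepping to a_bar_prev = 1
# (no noise left) recovers the clean value.
z0, eps, a_bar_t = [2.0], [1.0], 0.25
z_t = [math.sqrt(a_bar_t) * z0[0] + math.sqrt(1.0 - a_bar_t) * eps[0]]
print(ddim_step(z_t, eps, a_bar_t, 1.0))
```

Here `a_bar_t` and `a_bar_prev` stand for $\overline{\alpha}_{t}$ and $\overline{\alpha}_{t-\Delta t}$; a larger $\Delta t$ means fewer network evaluations per generated image.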

Experiments

Quantitative Results. Table 2 shows that PQDiff with a copy operation outperforms prior methods on all metrics in the 2.25x, 5x, and 11.7x experiments. In particular, with a larger outpainting multiple, PQDiff can obtain better FID and IS scores. For 11.7x outpainting, PQDiff surpasses the previous SOTA QueryOTR (Yao et al., 2022) 16.460, 25.310, and 36.216 FID scores on Scenery, Building and WikiArt datasets, respectively. We also find an interesting phenomenon: on WikiArt, PQDiff can surpass PQDiff + Copy in the 5x and 11.7x experiments. In other words, the generated images without copy can be more realistic than the ones with the center copy operation. This is perhaps because the boundary region of the input sub-image is slightly inconsistent with the generated image.

Qualitative Results. Examples of visual results are shown in Fig. 3 (the “copy” operation is added on all methods for better comparisons). PQDiff effectively outpaints the images by querying the global semantic-similar image patches. As seen from the 2.25x outpainting results, PQDiff could generate more realistic images with vivid details and enrich the contents of the generated regions. In addition, although QueryOTR (Yao et al., 2022) adds the smoothing module to handle the inconsistency of the boundary region, there are still a few noisy spots as highlighted in red boxes. For 5x and 11.7x results, we can clearly see the images generated from the QueryOTR are much vaguer than those generated by PQDiff. Moreover, PQDiff can handle details well. For example, the “clouds’ reflection” in the “water” is consistent with the generated “clouds” in the sky, as highlighted in the green ovals. We also provide more qualitative comparisons and generated images in Appendix E.

Sampling (i.e. Generation) Speed. We also compare the sampling speed of PQDiff with different timesteps. Specifically, we first train PQDiff with 80,000 iterations. Then, for the sampling stage, we evaluate the pre-trained PQDiff with different timesteps. Table 3 reports the wall-clock time spent on generating 64 images on 8 V100 GPUs. Since PQDiff can outpaint images with any multiples in one step, the cases for 2.25x, 5x, and 11.7x spend almost the same time. In contrast, previous methods will take much more time under the 5x and 11.7x settings than in 2.25x. It is worth noting that, under the 2.25x setting, the inception score of PQDiff with 200 timesteps (4.111) is even higher than the ground truth (4.091) (refer to Appendix E).

Ablation Studies

Outpainting in an arbitrary position. Previous outpainting methods mainly extend the image by the same multiple on the top, bottom, left, and right sides, so they can easily find where the input sub-image should be located in the generated image. Thus, the “copy” operation is straightforward. However, for outpainting in random positions, it is difficult to find the corresponding locations, since the two images (generated and original) could have different scales. Moreover, the two images may not intersect at all. Hence, we directly illustrate the images generated by PQDiff without the “copy” operation. Some examples generated from controlled positions are shown in Fig. 4, where the images generated by PQDiff without the “copy” operation are still vivid and realistic. Furthermore, without the “copy” operation, the generated images still retain the input pixel information and place it in the corresponding locations. Meanwhile, since the scales of input sub-images and generated images may be different, PQDiff implicitly learns to scale the input sub-images as well.

Diversity of generated images and PSNR in center regions. We also show five generated images with fixed positions and input in Fig. 11 in Appendix E, showing that PQDiff can generate diverse content in the generated regions. Furthermore, it also retains the input pixels in the center parts of generated samples. Recall that in Sec. B, we choose not to use the PSNR score as an evaluation metric, as we also need to account for diversity. Here we provide further analysis with the PSNR score, whose definition is as follows, where a higher score indicates a smaller mean squared error:
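For reference, the standard definition is $\mathrm{PSNR}=10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$, where $\mathrm{MAX}$ is the largest possible pixel value and $\mathrm{MSE}$ is the mean squared error. The Python sketch below computes it between an input sub-image and the center region of a generated image; it assumes 8-bit pixel values and single-channel images stored as lists of rows, and the helper names are illustrative.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and equally sized")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def center_crop(image, size):
    """Return the central size x size window of a 2D list-of-rows image, flattened."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [px for row in image[top:top + size] for px in row[left:left + size]]

# Toy example: a 4x4 "generated" image whose 2x2 center is compared to the input.
generated = [[r * 4 + c for c in range(4)] for r in range(4)]
sub_image = [5, 6, 9, 11]   # flattened 2x2 input; differs in one pixel
print(psnr(sub_image, center_crop(generated, 2)))
```

A high Center PSNR therefore indicates that the model reproduces the conditioning pixels where they belong, without constraining the diversity of the extrapolated surroundings.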

Impact of random crop ratio. For better consistency of inputs between the training stage and the sampling stage, we crop the view $\mathbf{x}_{b}$ with a larger crop ratio than the view $\mathbf{x}_{a}$ (since the outpainted images are usually larger than the input images). We conduct experiments to analyze the effect of random crop ratios. Specifically, we fix the crop ratio of the target view $\mathbf{x}_{b}$ as $(0.8,1.0)$ , and vary the crop ratio of the anchor view $\mathbf{x}_{a}$ from 0.15 to 0.50. We train the model for 80,000 iterations on the Scenery dataset and show the results in Fig. 5. The results show that the anchor crop ratio influences the 2.25x, 5x, and 11.7x experiments in different ways. For the 2.25x experiments, the input images are 128x128 and the outpainted images are 192x192, so the extrapolated image is only a little larger than the input sub-image. Hence, PQDiff with larger crop ratios outperforms PQDiff with smaller ones. For the 5x and 11.7x experiments, the extrapolated images are much larger than the given input sub-image. As a result, PQDiff with smaller crop ratios outperforms PQDiff with larger ones. In addition, we find that when the anchor crop ratio equals 0.50, the inception score drops substantially. We conjecture that this is because, when the random crop ratio is set to (0.50, 0.50), we always feed images with the same scale to PQDiff, so PQDiff cannot learn to scale images. Then, in the sampling stage, since images in the test set usually have different scales, it is difficult for PQDiff to handle the scaling gap between the training stage and the testing stage.

Integrate PQ scheme into other generating models. Beyond diffusion, we also integrate our PQ learning paradigm into a GAN-based model. In addition, to analyze the effect of the positional query, we also report the performance when directly using diffusion models (Ho et al., 2020). Table 5 shows the results. Specifically, we find that the quality of the images generated by the GAN is lower than that of the diffusion model (Ho et al., 2020), as the inception score of PQGAN is lower. However, the vanilla diffusion model is not well conditioned by the input sub-image (its FID and Center PSNR scores are much worse than those of PQGAN and PQDiff). The better FID and Center PSNR scores of the PQ variants indicate that the proposed PQ scheme can provide a strong condition, helping the generative models (Goodfellow et al., 2014; Ho et al., 2020) learn where to outpaint.

Impact of the Positional Embedding. To analyze the effect of the positional embedding, we conduct a group of experiments with different types of positional embeddings. We mainly consider two types of embeddings (sin-cos and learnable). Learnable embedding means we take the relative position $(m,n)$ as input and use an MLP composed of two linear layers with an activation function to map the $2$ -dimensional input to $D$ dimensions. Table 6 shows the results with different positional embeddings. Note that None means no relative positional embedding is used, so the model cannot learn the positional relationships between the input sub-image and the extrapolated image. Correspondingly, the inception score drops only slightly, but the FID and Center PSNR degrade substantially.

Conclusion

We have proposed PQDiff, which learns the positional relationships and pixel information at the same time. Methodically, PQDiff can outpaint at any multiple in only one step, greatly increasing the applicability of image outpainting. We conduct experiments on three standard outpainting datasets, where PQDiff achieves new SOTA results that surpass previous methods by a large margin under almost all settings. We also conduct comprehensive ablation studies to show the robustness of our approach, including crop ratios, the center PSNR score, and the relative positional embeddings.

Ethics Statement. PQDiff generates images conditioned by positional embeddings, learning pixel information and positional relationships simultaneously. As the datasets used in PQDiff primarily focus on scenery, buildings, and arts, there are currently minimal negative potential impacts on ethics and crime-related aspects. We are aware that any technology could be abused for ill purposes.

Reproducibility Statement. We have clarified training and sampling details including hyper-parameters, pseudo code of the relative positional embeddings, and training pipeline in Sec. B in the Appendix. In addition, all the datasets used in this paper are open-source and can be accessed online.

References

Supplementary Material

Appendix A Pseudo Code of the Relative Positional Embeddings

Appendix B Training Details

Datasets. We use three datasets: Scenery (Yang et al., 2019), Building Facades (Gao et al., 2023), and WikiArt (Tan et al., 2016), in line with (Yang et al., 2019; Yao et al., 2022; Van Hoorick, 2019). Scenery is a natural scenery dataset with diverse natural scenes, consisting of about 5,000 images for training and 1,000 images for testing. Building Facades is a city scenes dataset consisting of about 16,000 and 1,500 images for training and testing, respectively. WikiArt is a fine-art paintings dataset, which can be obtained from wikiart.org. We use the genre-based split (used in (Yao et al., 2022; Gao et al., 2023)), which contains 45,503 training images and 19,492 testing images.

Training Details. We implement our approach with PyTorch (Paszke et al., 2019) on a platform equipped with 8 V100 GPUs. The encoder is composed of 8-10 stacked transformer blocks. Then, a cross-attention block is employed, followed by the decoder made of 8-10 transformer blocks. Finally, a 3x3 convolutional layer is adopted to smooth the generated image. In line with previous methods (Yao et al., 2022), we copy the ground truth (input) to the corresponding location in the generated image. We find that the copy operation has a great impact on previous methods, while PQDiff is much more robust to this operation, which we will discuss later. The number of parameters is approximately equal to that of QueryOTR (Yao et al., 2022), which contains 12 transformer blocks in the encoder and 4 transformer blocks in the decoder. We adopt the AdamW optimizer (Loshchilov & Hutter, 2019), and we set the learning rate to 0.0002, weight decay to 0.03, and betas to 0.99. In the training stage, we set the random crop ratio of the anchor view to (0.15, 0.5) and the ratio of the target view to (0.8, 1.0), aiming to use the small view to predict the larger view. We train PQDiff for 80k, 150k, and 300k iterations with 64 images per GPU on the Scenery, Building Facades, and WikiArt datasets, respectively. Following QueryOTR, we set the resolution of each cropped view to 192x192. The core idea of PQDiff is to utilize the two randomly cropped views (anchor and target) to learn the positional relation between them.

Evaluation and Baselines. We use Inception Score (IS) (Salimans et al., 2016) and Frechet Inception Distance (FID) (Heusel et al., 2017) to measure the generative quality. Note that the upper bounds of IS are 4.091, 5.660, and 8.779 for Scenery, Building Facades, and WikiArt, respectively, which are calculated from real images in the test set. Here we do not use PSNR (peak signal-to-noise ratio), as was once common in outpainting, because PSNR cannot reflect the diversity of generated images, which is important for generative models; more details are given in Sec. 4.2. Instead, we report the PSNR score between the input sub-images and the center region of the generated images, because we think this score is more meaningful for detecting whether the network ignores the input conditions. We make comparisons with five SOTA image outpainting methods: NSIPO (Yang et al., 2019), SRN (Wang et al., 2019), IOH (Van Hoorick, 2019), Uformer (Gao et al., 2023) and QueryOTR (Yao et al., 2022).

Sampling Details. In line with the previous method (Yao et al., 2022; Gao et al., 2023), for the testing stage, all images are resized to 192x192 as the ground truth, and then the input images are obtained by center cropping to the sizes 128x128, 86x86, and 56x56 for 2.25x, 5x, and 11.7x outpainting, respectively. The total output sizes are 2.25, 5, and 11.7 times the input in terms of 2.25x, 5x, and 11.7x outpainting, respectively.
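The nominal multiples follow from the ratio of output area to input area. A short Python check (the function name is illustrative) makes the arithmetic explicit:

```python
def area_multiple(output_side, input_side):
    """Ratio of output area to input area for square images."""
    return (output_side / input_side) ** 2

for side in (128, 86, 56):
    print(side, round(area_multiple(192, side), 2))
# 128 -> 2.25, 86 -> 4.98, 56 -> 11.76
```

The 5x and 11.7x labels are thus rounded values of the exact area ratios obtained with integer crop sizes.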

Appendix C Illustration of the RPE During the Training and Sampling

Relative positional embeddings. Learnable positional embeddings (commonly used in previous transformer-based learning methods (Gao et al., 2023)) cannot represent the relation between the anchor view and the target view, as the positional relation is random for each sample at each iteration due to the randomness of the RandomCrop augmentation. Instead, we adopt a fixed positional encoding to represent the relative positions between the anchor view and each query token (i.e., each relative positional token in the target view), as illustrated in Fig. 6. Given the positions $\textbf{p}_{A}=\{p_{Ai},p_{Aj},p_{Ah},p_{Aw}\}$ (top location, left location, height, and width) and $\textbf{p}_{T}=\{p_{Ti},p_{Tj},p_{Th},p_{Tw}\}$ of the two views $\mathbf{x}_{A}$ and $\mathbf{x}_{T}$ , the objective is to calculate the relative position of each patch $[\mathbf{x}_{T}^{(1)},\mathbf{x}_{T}^{(2)},\cdots,\mathbf{x}_{T}^{(K_{T}^{2})}]$ in the target view based on the anchor view $\mathbf{x}_{A}$ , where $K_{T}^{2}$ is the number of patches in the target view. Specifically, we calculate the patch-level relative position of the target view via the following equation:

where $K_{a}^{2}$ means the number of patches of the anchor view. $p_{t}^{m,n}$ means the relative position of the patch located at $m$ -th row and $n$ -th column in the target view. Then, for each patch, we use the following popular form in transformers (Vaswani et al., 2017) to generate the relative positional embedding:
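The two stages described here (a real-valued relative position per target patch, then a sinusoidal embedding of that position) can be sketched in Python as follows. The sketch rests on several assumptions that are not confirmed by the text: the relative position is measured from the anchor's top-left corner in units of one anchor patch, the row and column embeddings are concatenated with half of the channels each, and the base is $e=10{,}000$. All function names are hypothetical.

```python
import math

def relative_patch_position(anchor_box, target_box, k_anchor, k_target, m, n):
    """Relative position of target patch (m, n), in units of one anchor patch.

    Boxes are (top, left, height, width); k_anchor and k_target are the
    number of patches per side in each view (assumed convention).
    """
    at, al, ah, aw = anchor_box
    tt, tl, th, tw = target_box
    # Pixel coordinates of the top-left corner of target patch (m, n).
    patch_top = tt + m * th / k_target
    patch_left = tl + n * tw / k_target
    # Offset from the anchor's top-left corner, normalised by the anchor patch size.
    return ((patch_top - at) / (ah / k_anchor),
            (patch_left - al) / (aw / k_anchor))

def sincos_1d(pos, dim, base=10000.0):
    """Sinusoidal embedding of a scalar position; dim must be even."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        emb.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return emb

def sincos_2d(row_pos, col_pos, dim, base=10000.0):
    """Concatenate row and column embeddings, each of size dim // 2 (dim % 4 == 0)."""
    half = dim // 2
    return sincos_1d(row_pos, half, base) + sincos_1d(col_pos, half, base)

# 5x-style layout: a 64x64 anchor centred inside a 192x192 target.
anchor, target = (64, 64, 64, 64), (0, 0, 192, 192)
rel = relative_patch_position(anchor, target, k_anchor=4, k_target=12, m=0, n=0)
print(rel, len(sincos_2d(rel[0], rel[1], dim=16)))
```

Because the relative position is a real number rather than an index into a lookup table, the same function covers any crop location or scale, which is consistent with the continuous multiples described in the main text.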

Appendix D More results in continuous multiples

Appendix E Impact of the Continuous and Discrete Positional Embeddings

To further explore how PQDiff learns the relative position, we conduct a set of ablation studies in which we remove the randomness of the positions of the anchor and target views. Specifically, the anchor view is the center region of the target view, and the target view is set to be 2.25x, 5x, or 11.7x larger than the anchor view with equal probability at each training iteration (1/3 for each multiple) to better compare with the main results in our paper. Therefore, the sizes of the anchor view and target view are discrete, as they are fixed to three values. We train the Discrete version on 8 V100 GPUs for 80k iterations (in line with the main results). Table 7 shows the results when evaluating with various outpainting multiples. We find that under the 2.25x, 5x, and 11.7x settings (in-distribution), discrete training achieves results similar to PQDiff. However, when transferred to other outpainting multiples (4x, 9x, 16x), all three scores of discrete training drop, especially Center PSNR. We conjecture that the reason behind this phenomenon is that the “Discrete training” strategy only memorizes the pixel information of the input sub-image, but fails to scale the sub-image with the proper ratio due to the gap in outpainting multiples between the training and inference stages.

Appendix F Sampling Speed

For better comparisons, we provide results of PQDiff with more timesteps in Table 8. Specifically, in the training phase, we set the number of timesteps to 1,000. In the testing phase, we vary the number of timesteps from 50 to 500 (since we observe that with more timesteps, the IS and FID scores no longer improve). We find that when the number of timesteps is set to 300, PQDiff achieves an IS of 4.203, while the ground truth only achieves 4.091, which further demonstrates the vividness of images generated by PQDiff.

PQDiff can also be thought of as a predictive task conditioned on the relative positional information and the anchor views. Hence, we conduct experiments that directly predict $\mathbf{x}_{b}$ , which is similar to QueryOTR (Yao et al., 2022), and we report the inception score in Table 9. We find that when predicting $\mathbf{x}_{b}$ , PQDiff performs much worse than when predicting noise under all settings (especially in the 11.7x setting), and this phenomenon is also consistent with previous generative tasks (Bao et al., 2023; Ho et al., 2020). We conjecture that this is because predicting $\mathbf{x}_{b}$ makes the task more predictive than generative, so the learned network overfits the training set, resulting in poor generalization for the generative task on the test set.

Appendix H Incorporate with Pretrained Models

As shown in Tab. 10 on the Scenery dataset, we add an ablation study to analyze the usage of pretrained models. First, we try to load the Stable Diffusion pretrained model into PQDiff. As we use VQVAE and cross attention in our model (our model is transformer-based, while Stable Diffusion mainly uses ResNet blocks), we can only load the weights of the spatial transformer layers.

Then, we train our model on ImageNet for 80,000 iterations and subsequently re-train it on the Scenery dataset. We find that pretraining on ImageNet improves consistency, which provides insights for the subsequent scale-up of our framework.

Appendix I More Generated Examples on Facade datasets

We put more generated examples on facade datasets in Fig. 9. We have observed two interesting phenomena when our PQDiff extends semantic structures in facade scenes.

Our PQDiff can expand reliable semantic structures that align with human cognition. In contrast, the previous SOTA QueryOTR model tends to generate physically distorted and abnormally deformed semantic structures (see the middle column of Fig. 9: the first row (lane), second row (wooden building), and third row (teaching buildings)).

Our PQDiff can expand more textural and semantic details in facade scenes, thereby generating high-fidelity expanded images. In contrast, QueryOTR loses detailed information, and its generated extended images contain a lot of noise (representative examples are shown in the first column: first row, second row, and third row).

Appendix J Absolute Position Embedding v.s. Relative Position Embedding

Appendix K More Cases Generated by PQDiff

We show more examples generated by PQDiff in Fig. 12, Fig. 13 and Fig. 14.

Appendix L More discussions

Broader impact and potential applications. As PQDiff can generate images conditioned on arbitrary relative positions, we believe our method has great potential in image inpainting (replacing the RPE of the target view with that of the inpainting regions) and super-resolution (interpolating the relative positions of the anchor view), provided that we properly change the relative positional embeddings in the training and sampling stages.