SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction

Zhizhuo Zhou, Shubham Tulsiani

Introduction

Consider the two images of the teddybear shown in Figure 1 and try to imagine the underlying 3D object. Relying on the direct visual evidence in these images, you can easily infer that the teddybear is white, has a large head, and has small arms. Even more remarkably, you can imagine beyond the directly visible to estimate a complete 3D model of this object e.g. forming a mental model of the teddy’s face with (likely black) eyes even though these were not observed. In this work, we build a computational approach that can similarly predict 3D from just a few images – by integrating visual measurements and priors via probabilistic modeling and then seeking likely 3D modes.

A growing number of recent works have studied the related tasks of sparse-view 3D reconstruction and novel view synthesis, i.e. inferring 3D representations and/or synthesizing novel views of an object given just a few (typically 2-3) images with known relative camera poses. By leveraging data-driven priors, these approaches learn to exploit multi-view cues efficiently and infer 3D from sparse views. However, they still yield blurry predictions under large viewpoint changes and cannot hallucinate plausible content in unobserved regions. This is because they do not account for the uncertainty in the outputs: the unobserved nose of a teddybear may be either red or black, but these methods, by reducing inference to independent pixel-wise or point-wise predictions, cannot model such variation.

In this work, we propose to instead model the distribution over the possible images given observations from some context views and an arbitrary query viewpoint. Leveraging a geometrically-informed backbone that computes pixel-aligned features in the query view, our approach learns a (conditional) diffusion model that can then infer detailed plausible novel-view images. While this probabilistic image synthesis approach allows the generation of higher quality image outputs, it does not directly yield a 3D representation of the underlying object. In fact, the (independently) sampled outputs for each query view often do not even correspond to a consistent underlying 3D, e.g. if the nose of the teddybear is unobserved in context views, one sampled query view may paint it red, while another paints it black.

To obtain a consistent 3D representation, we propose a Diffusion Distillation technique that ‘distills’ the predicted distributions into an instance-specific 3D representation. We note that the conditional diffusion model not only gives us the ability to sample novel-view images but also to (approximately) compute the likelihood of a generated one. Using this insight, we optimize an instance-specific (neural) 3D representation by maximizing the diffusion-based likelihood of its renderings. We show that this leads to a mode-seeking optimization that results in more accurate and realistic renderings, while also recovering a 3D-consistent representation of the underlying object. We demonstrate our approach on over 50 real-world categories from the CO3D dataset and show that our method allows recovering accurate 3D and novel views given as few as 2 images as input – please see Figure 1 for sample results.

Related Work

Leveraging Structure-from-Motion to recover camera viewpoints, early Multi-view-Stereo (MVS) methods could recover dense 3D outputs. Recent neural incarnations of these use volumetric rendering to learn a compact neural scene representation. Follow-up works seek to make the training and rendering orders of magnitude faster. However, these methods require many input views, making them impractical for real-world applications. While some works seek to reduce the input views required, they still do not make predictions for unseen regions.

The ability to predict 3D geometry (and appearance) beyond the visible is a key goal for single-view 3D prediction methods. While these approaches have pursued prediction of different 3D representations e.g. volumetric, mesh-based, or neural implicit 3D, the use of a single input image fundamentally limits the details that can be predicted. Moreover, these methods do not prioritize view synthesis as a goal. While our approach similarly learns data-driven inference, we aim for a more detailed reconstruction and high quality novel-view renderings.

Novel view synthesis (NVS), while similar to reconstruction, has slightly different roots. Earlier works frame NVS as a 2D problem, using deep networks to make predictions from global encodings. Recent approaches combine deep networks with various rendering formulations. Strong performing approaches often leverage re-projected features from input views with volumetric rendering or image-based rendering. While feature re-projection methods are 3D consistent, they regress to the mean and fail to produce perceptually sharp outputs. Another line of work revisits NVS as a probabilistic 2D generation task, using newer generative backbones to offer better perceptual quality at the cost of larger distortion and weaker 3D consistency. See Table 1 for a comparison of our method against existing approaches.

Several works extend denoising diffusion models to achieve impressive applications, such as generating images from text and placing foreground objects in different backgrounds. In this work, we leverage this class of models for (probabilistic) novel view synthesis while using geometry-aware features as conditioning. Inspired by the impressive results in DreamFusion, which optimized 3D scenes using text-conditioned diffusion models, we propose a view-conditioned diffusion distillation mechanism to similarly extract 3D modes in the sparse-view reconstruction task.

Several concurrent works also leverage diffusion models for 3D reconstruction and view synthesis. 3DiM proposes a 2D diffusion approach for image-conditioned novel view synthesis, but does not infer a 3D representation like our approach. Closer to our work, Deng et al. use (pre-trained) 2D diffusion models as guidance for single-view 3D, but obtain coarser reconstructions in this more challenging setting. While we leverage a 2D diffusion model for optimizing 3D, RenderDiffusion learns a diffusion model in 3D space. Concurrently to DreamFusion, which inspired our distillation objective, Wang et al. provide a different mathematical intuition for a similar objective.

Background: Denoising Diffusion

Our method adopts and optimizes through denoising diffusion models. Here we give a brief summary of the key formulations used, and refer the reader to the appendix for further details.

One can learn denoising diffusion models by optimizing a variational lower bound on the log-likelihood of the observed data. Conveniently, this reduces to a training framework where one adds (time-dependent) noise to a data point $\bm{x}_{0}$ , and then trains a network $\bm{\epsilon}_{\phi}$ to predict this noise given the noisy data point $\bm{x}_{t}$ .
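In the standard DDPM notation, the noise-prediction objective described above can be sketched as follows. The symbol $\bm{\epsilon}$ for the sampled Gaussian noise and the name $\mathcal{L}_{\text{DM}}$ are assumptions made for this sketch.

```latex
\mathcal{L}_{\text{DM}}
  = \mathbb{E}_{\bm{x}_{0},\;\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}),\;t}
    \Big[\, w_{t}\,\big\|\bm{\epsilon}-\bm{\epsilon}_{\phi}(\bm{x}_{t},t)\big\|^{2} \Big],
\qquad
\bm{x}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}
```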

Here, $\bar{\alpha}_{t}$ is a scheduling hyper-parameter, and the weights $w_{t}$ depend on this learning schedule, but are often set to 1 to simplify the objective.

The above noise prediction objective, which represents a bound on the log likelihood, can also be viewed as a reconstruction error. Concretely, given a noisy $\bm{x}_{t}$ , the network prediction $\bm{\epsilon}_{\phi}(\bm{x}_{t},t)$ can be interpreted as yielding a reconstruction for the original input, where the learning objective can be rewritten as a reconstruction error:
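Assuming the usual forward-noising relation $\bm{x}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}$, the reconstruction and the rewritten objective can be sketched as below. The reweighting $w^{\prime}_{t}$ is derived from that relation rather than quoted.

```latex
\hat{\bm{x}}_{0,t}
  = \frac{\bm{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\;\bm{\epsilon}_{\phi}(\bm{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}},
\qquad
\mathcal{L}
  = \mathbb{E}_{\bm{x}_{0},\,\bm{\epsilon},\,t}
    \Big[\, w^{\prime}_{t}\,\big\|\bm{x}_{0}-\hat{\bm{x}}_{0,t}\big\|^{2} \Big],
\qquad
w^{\prime}_{t}=w_{t}\,\frac{\bar{\alpha}_{t}}{1-\bar{\alpha}_{t}}
```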

While the above summary focused on unconditional diffusion models, they can be easily extended to infer conditional distributions $p(\bm{x}|\bm{y})$ by additionally using $\bm{y}$ as an input for the noise prediction network $\bm{\epsilon}_{\phi}$ .
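The relation between the noise-prediction error and the reconstruction error summarized in this section can be checked numerically. The sketch below assumes a linear noise schedule with illustrative values and uses a perturbed copy of the true noise as a stand-in for a network output; all function names are hypothetical.

```python
import numpy as np

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, eps, alpha_bar_t):
    """Forward noising: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def reconstruct_x0(x_t, eps_pred, alpha_bar_t):
    """Invert the forward noising using a predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar = alpha_bar_schedule()
a = alpha_bar[500]

x0 = rng.normal(size=(4, 4))
eps = rng.normal(size=(4, 4))
eps_pred = eps + 0.1 * rng.normal(size=(4, 4))  # stand-in for an imperfect network

x_t = add_noise(x0, eps, a)
x0_hat = reconstruct_x0(x_t, eps_pred, a)

noise_err = np.sum((eps - eps_pred) ** 2)
recon_err = np.sum((x0 - x0_hat) ** 2)
# The two errors differ only by the timestep-dependent factor alpha_bar / (1 - alpha_bar).
print(noise_err, a / (1.0 - a) * recon_err)
```

The two printed values agree, which is why the same objective can be read either as noise prediction or as reconstruction with a different per-timestep weight.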

Approach

Given sparse-view observations of an object (typically 2-3 images with masked foreground) with known camera viewpoints, our approach aims to infer a (3D) representation capable of synthesizing novel views while also capturing the geometric structure. However, as aspects of the object may be unobserved and its geometry difficult to precisely infer, direct prediction of 3D or novel views leads to implausibly blurry outputs in regions of uncertainty.

To enable plausible and 3D-consistent predictions, we instead take a two step approach as outlined in Figure 2. First, we learn a probabilistic view-synthesis model that, using geometry-guided diffusion, can model the distribution of images from query views given the sparse-view context (Section 4.1). While this allows the generation of detailed and diverse outputs, the obtained renderings lack 3D consistency. To extract a 3D representation, we propose a 3D neural distillation process that ‘distills’ the predicted view distributions into a consistent 3D mode (Section 4.2).

Given a target view pose $\bm{\pi}$ along with a set of reference images and their relative poses $C\equiv\{(\bm{x}_{m},\bm{\pi}_{m})\}$, we want to model the conditional distribution $p(\bm{x}|\bm{\pi},C)$, from which we can synthesize an image $\bm{\hat{x}}$. We illustrate our approach to modeling this distribution in Figure 3. First, we use an epipolar feature transformer (EFT), inspired by GPNR, as a feature extractor to obtain a low-resolution feature grid $\bm{y}$ in the view space of $\bm{\pi}$ given the context $C$. In conjunction, we train a view-conditioned latent diffusion model (VLDM) that models the distribution over novel-view images conditioned on these geometry-aware features.

We build upon GPNR to extract features from context $C$ . GPNR learns a feedforward network, $g_{\psi}(\bm{r},C)$ , that predicts color given a query ray $\bm{r}$ by extracting features along its epipolar lines in all context images and aggregating them with transformers. We make several modifications to GPNR to suit our needs. First, we replace the patch projection layer with a ResNet18 convolutional encoder as we found the lightweight patch encodings, while suitable for small baseline view synthesis, are not robust under the sparse-view setting. Furthermore, we modify the last layer to predict both an RGB value and a feature vector. We denote the RGB branch as $g_{\psi}$ and the feature branch as $h_{\psi}$ . We refer to our modified epipolar patch-based feature transformer as EFT and present its color branch as a strong baseline.

We train the color branch of the EFT to minimize a simple reconstruction loss in Eq. 4, where $\bm{r}$ is a query ray sampled from $\bm{\pi}$ , $C$ is the set of reference images and their relative poses, and $I(\bm{r})$ is the ground truth pixel value.
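With the symbols defined above, such a reconstruction loss can be sketched as follows. The squared $\ell_{2}$ distance and the name $\mathcal{L}_{\text{color}}$ are assumptions; the text only states that the loss is a simple reconstruction loss.

```latex
\mathcal{L}_{\text{color}}
  = \mathbb{E}_{\bm{r}\sim\bm{\pi}}
    \Big[\, \big\| g_{\psi}(\bm{r},C) - I(\bm{r}) \big\|^{2} \Big]
```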

4.1.2 View-conditioned Latent Diffusion Model

While EFT can directly predict novel views, the pixelwise prediction mechanism does not allow it to model the underlying probability distribution, thus resulting in blurry mean-seeking predictions under uncertainty. To model the distribution over plausible images, we train a view-conditioned diffusion model to estimate $p(\bm{x}|\bm{\pi},C)$ while using EFT as a geometric feature extractor. Instead of directly modeling the distribution in pixel space, we find it computationally efficient to do so in a lower-resolution latent space $\bm{z}=\mathcal{E}(\bm{x})$ , which can be decoded back to an image as $\bm{x}=\mathcal{D}(\bm{z})$ . Please see the appendix for details.

Given target view $\bm{\pi}$ and a set of input images $C$ , we extract a 32 by 32 feature grid $\bm{y}=h_{\psi}(\bm{\pi},C)$ using the EFT backbone. We train our VLDM to recover ground truth image latent $\bm{z_{0}}$ conditioned on $\bm{y}$ . Following diffusion model training conventions , we optimize a simplified variational lower bound in Eq. 5.
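In the notation of the background section, the simplified bound conditioned on $\bm{y}$ can be sketched as below. Unit timestep weights and the name $\mathcal{L}_{\text{VLDM}}$ are assumptions of this sketch.

```latex
\mathcal{L}_{\text{VLDM}}
  = \mathbb{E}_{\bm{z}_{0},\;\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I}),\;t}
    \Big[\, \big\| \bm{\epsilon} - \bm{\epsilon}_{\phi}(\bm{z}_{t},t,\bm{y}) \big\|^{2} \Big],
\qquad
\bm{z}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bm{z}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon},
\quad
\bm{y}=h_{\psi}(\bm{\pi},C)
```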

Figure 3 shows a diagram of the training setup. Our VLDM model allows us to approximate $p(\bm{x}|\bm{\pi},C)$, and enables drawing multiple sample predictions. In Figure 5, we see variations in VLDM predictions. Nevertheless, all predictions are plausible explanations for the target view given that the majority of it is unseen.

4.2 Extracting 3D Modes via Diffusion Distillation

While the proposed VLDM gives us the ability to hallucinate unseen regions and make realistic predictions under uncertainty, it does not output a 3D representation. In fact, as it models the distribution over images, the views sampled from the VLDM do not (and should not!) necessarily correspond to a single underlying 3D interpretation. How can we then obtain an output 3D representation while preserving the high-quality of renderings?

Our key insight is that the VLDM model not only allows us to sample plausible novel views, but the modeled distribution also gives us a mechanism to approximate the likelihood of a generated novel view. Building on this insight, we propose to distill the VLDM predictions to obtain an instance-specific 3D neural scene representation $f_{\theta}$ , such as NeRF or Instant NGP (INGP) . Intuitively, we want to arrive at a solution for $f_{\theta}$ such that its renderings $\bm{x}\equiv f_{\theta}(\bm{\pi})$ from arbitrary viewpoints $\bm{\pi}$ are likely under the conditional distribution modeled by the VLDM $p_{\phi}(\bm{x}|\bm{\pi},C)$ :
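Written with the symbols already introduced, this mode-seeking objective can be sketched as follows. The $\arg\min$ form over $\theta$ is an assumption consistent with the surrounding description.

```latex
\theta^{*}
  = \arg\min_{\theta}\;
    \mathbb{E}_{\bm{\pi}\sim\Pi}
    \Big[ -\log p_{\phi}\big(f_{\theta}(\bm{\pi}) \,\big|\, \bm{\pi},C\big) \Big]
```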

where we minimize the negative log-likelihood for images rendered with $f_{\theta}$ over cameras sampled from a prior camera distribution $\Pi$ (constructed by assuming a circular camera trajectory and that all cameras look at a common center). We term this process ‘neural mode seeking’ as it encourages a representation which maximizes likelihood as opposed to minimizing distance to samples (mean seeking).
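A camera prior of this kind can be sketched as below: cameras are placed on a circle and oriented toward a common center. The radius, height, world up-axis, and the row convention (right, down, forward) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def look_at_rotation(position, target=None, world_up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation with rows (right, down, forward); camera looks at target."""
    target = np.zeros(3) if target is None else np.asarray(target, dtype=float)
    forward = target - position
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(world_up))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)  # completes a right-handed frame
    return np.stack([right, down, forward])

def sample_circular_camera(rng, radius=4.0, height=1.0):
    """Sample a camera center on a circle around the origin, looking at the origin."""
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    position = np.array([radius * np.cos(azimuth), radius * np.sin(azimuth), height])
    return position, look_at_rotation(position)

position, rotation = sample_circular_camera(np.random.default_rng(0))
```

Each call yields one query pose; the expectation over $\Pi$ in the objective is then approximated by drawing a fresh pose at every optimization step.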

Given a learned diffusion model, the reconstruction objective (Eq. 3) yields a bound on the log-likelihood of a data point $\bm{x}$. This approximation yields a simple mechanism for computing the likelihood of a (rendered) image $f_{\theta}(\bm{\pi})$ to be used in the mode-seeking optimization (Eq. 6):
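A form consistent with the reconstruction view of the diffusion objective is sketched below, where $w^{\prime}_{t}$ stands for a timestep-dependent weight. The exact weighting and the use of $\approx$ are assumptions.

```latex
-\log p_{\phi}\big(f_{\theta}(\bm{\pi}) \,\big|\, \bm{\pi},C\big)
  \;\approx\;
  \mathbb{E}_{t,\,\bm{\epsilon}}
  \Big[\, w^{\prime}_{t}\,\big\| \bm{z}_{0} - \hat{\bm{z}}_{0,t} \big\|^{2} \Big]
```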

where $\bm{z}_{0}=\mathcal{E}(f_{\theta}(\bm{\pi}))$ is the latent of the rendered image, $t\sim(0,T]$ , and $\hat{\bm{z}}_{0,t}$ is the predicted latent (analogous to $\hat{\bm{x}}_{0,t}$ in Eq. 2). Intuitively, this objective implies that if, after adding noise to obtain $\bm{z}_{t}$ from $\bm{z}_{0}$ , the denoising diffusion model predicts $\hat{\bm{z}}_{0}$ close to the original input, one has reached a mode under $p_{\phi}(\bm{z})$ . We visualize the behavior of mode seeking versus mean seeking in Figure 6.

In practice, we make three modifications to the single-step objective in Eq. 7 for better performance: 1) taking loss in pixel space instead of latent space i.e. using $\bm{x}_{0}$ instead of $\bm{z}_{0}$ , 2) using perceptual distance in addition to the pixelwise distance, and 3) performing multi-step denoising. Instead of directly predicting $\hat{\bm{z}}_{0,t}$ , we adaptively use multiple time-steps (up to 50 steps) $\mathcal{T}=(t_{1},\cdots,t_{k},t)$ , and successively predict $\hat{\bm{z}}_{t_{k-1},t_{k}}$ (via ) i.e. predict a denoised estimate for time $t_{k-1}$ given a sample from time $t_{k}$ . We denote this reconstruction as $\hat{\bm{z}}_{0,\mathcal{T}}$ to highlight the multiple-step reconstruction. We express our final objective for optimizing for neural mode seeking with view-conditioned diffusion models as:
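Combining the three modifications, the final objective can be sketched as below. Here $\mathcal{L}_{\text{perc}}$ is a placeholder name for the perceptual distance, $\bm{x}_{0}=f_{\theta}(\bm{\pi})$ is the rendered image, and any timestep weighting or relative weighting between the two terms is omitted because it is not stated in the text.

```latex
\min_{\theta}\;
\mathbb{E}_{\bm{\pi}\sim\Pi,\;t,\;\bm{\epsilon}}
\Big[\,
  \big\| \bm{x}_{0} - \hat{\bm{x}}_{0,\mathcal{T}} \big\|^{2}
  + \mathcal{L}_{\text{perc}}\big(\bm{x}_{0},\, \hat{\bm{x}}_{0,\mathcal{T}}\big)
\Big]
```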

where $\hat{\bm{x}}_{0,{\mathcal{T}}}=\mathcal{D}(\hat{\bm{z}}_{0,{\mathcal{T}}})$ , and $\hat{\bm{z}}_{0,{\mathcal{T}}}$ is the multi-step reconstruction from $\bm{z}_{t}$ – which is obtained by adding noise to $\bm{z}_{0}=\mathcal{E}(f_{\theta}(\bm{\pi}))$ . While $\hat{z}$ in the above objective does (indirectly) depend on the neural representation $f_{\theta}$ , we follow in ignoring this dependence when computing parameter gradients (see for a justification). We outline the multi-step denoising diffusion distillation in Figure 4.
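The multi-step reconstruction can be illustrated with a small sketch. It assumes a deterministic DDIM-style update between timesteps, which is one common choice and not necessarily the update used here. The schedule values, the function name, and the oracle noise predictor (which stands in for the trained UNet) are hypothetical.

```python
import numpy as np

def multistep_reconstruct(z_t, t, timesteps, alpha_bar, eps_model):
    """Denoise z_t at time t through decreasing `timesteps`, then predict z_0."""
    z, cur = z_t, t
    for nxt in list(timesteps) + [None]:
        eps = eps_model(z, cur)
        z0_hat = (z - np.sqrt(1.0 - alpha_bar[cur]) * eps) / np.sqrt(alpha_bar[cur])
        if nxt is None:
            return z0_hat  # final estimate of the clean latent
        # Deterministic (DDIM-style, assumed) move to the next, less noisy timestep.
        z = np.sqrt(alpha_bar[nxt]) * z0_hat + np.sqrt(1.0 - alpha_bar[nxt]) * eps
        cur = nxt

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))  # assumed schedule
rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 32, 32))  # stands in for the latent of a rendered image
t = 600
z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * rng.normal(size=z0.shape)

# Oracle noise predictor: returns the exact noise relating z to z0 at step s.
oracle = lambda z, s: (z - np.sqrt(alpha_bar[s]) * z0) / np.sqrt(1.0 - alpha_bar[s])
z0_hat = multistep_reconstruct(z_t, t, [400, 200, 50], alpha_bar, oracle)
```

With a perfect noise predictor the reconstruction returns the input latent exactly, which matches the intuition that a rendering at a mode is left unchanged by noising and denoising. In the distillation itself, the reconstruction would be treated as a constant when differentiating with respect to $\theta$.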

Experiments

We demonstrate our approach on CO3Dv2, a challenging real-world multi-view dataset, across 51 diverse categories. First, we compare SparseFusion against prior works, highlighting the benefit of our approach in sparse-view settings. Then, we show the importance of diffusion distillation and its probabilistic mode-seeking formulation.

We perform experiments on CO3Dv2 , a multi-view dataset of real world objects annotated with relative camera poses and foreground masks. We use the specified fewview-train and fewview-dev splits for training and evaluation. Since SparseFusion optimizes an instance-specific Instant NGP, it is computationally prohibitive to evaluate on all evaluation scenes. Instead, we perform most experiments on a core subset of 10 categories proposed by , evaluating 10 scenes per category. Furthermore, we demonstrate that SparseFusion extends to diverse categories by evaluating 5 scenes per category across 51 categories.

We compare SparseFusion against current state-of-the-art methods. We first compare against PixelNeRF, a feature re-projection method. We adapt PixelNeRF to the CO3Dv2 dataset and train category-specific models on the 10 categories of the core subset, each for 300k steps. We also compare against NerFormer, another feature re-projection method. We use category-specific models provided by the authors for all 51 categories. Moreover, we compare against ViewFormer, an autoregressive image generation method, using models provided by the authors. (Only category-agnostic CO3Dv1 weights are compatible with our evaluation. We use the 10-category weights for our core subset experiments and all-category weights for our all-category experiments. Despite this difference, the comparative results of ViewFormer against our baselines are consistent with the comparisons reported in their original paper.) Lastly, we present components of SparseFusion, EFT and VLDM, as strong baselines.

We report standard image metrics PSNR, SSIM, and LPIPS. We recognize that no metric is perfect for ambiguous cases of novel view synthesis; PSNR derives from pixelwise MSE and favors mean color prediction while SSIM and LPIPS favor perceptual agreement.
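The bias of PSNR toward mean predictions follows directly from its definition. The sketch below uses the standard PSNR formula for images in $[0,1]$ and a hypothetical one-pixel example with two equally plausible ground truths (0 or 1).

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB, computed from the pixelwise MSE."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Two equally likely ground truths for an unobserved pixel: 0.0 or 1.0.
truths = [np.zeros((2, 2)), np.ones((2, 2))]
mean_pred = np.full((2, 2), 0.5)  # blurry average of the two hypotheses
sharp_pred = np.ones((2, 2))      # commits to one hypothesis

expected_mse_mean = np.mean([np.mean((mean_pred - g) ** 2) for g in truths])
expected_mse_sharp = np.mean([np.mean((sharp_pred - g) ** 2) for g in truths])
print(expected_mse_mean, expected_mse_sharp)
```

The averaged prediction has the lower expected MSE, and therefore the higher PSNR on average, even though it matches neither plausible outcome.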

For EFT, we use a ResNet18 backbone and three groups of transformer encoders with 4 layers each. We use 256 hidden dimensions for all layers. For VLDM, we freeze the VAE from Stable Diffusion, which encodes 256x256 images to 32x32 latents with a channel dimension of 4. We construct a 400M parameter denoising UNet similar to for probabilistic modeling. We jointly train category-specific EFT and VLDM models, using Eq. 4 and Eq. 5, across all categories in CO3Dv2. We use a batch size of 2 and train for 100K iterations.

For diffusion distillation, we use a PyTorch implementation of Instant NGP. Due to memory constraints, we render images at 128x128 and upsample to 256x256 before performing diffusion distillation. For each instance, we optimize Instant NGP for 3,000 steps. During the first 1,000 steps, we optimize rendering loss on input images and predicted EFT images from a circular camera trajectory to initialize a rough volume. During the next 2,000 steps, we perform diffusion distillation. Reconstructing a single instance takes roughly an hour on an A5000 GPU.

5.2 Reconstruction on Real Images

We show 2-view category-specific reconstruction results for the 10 core subset categories. We evaluate metrics on the first 10 scenes of each category. For each scene, we load 32 linearly spaced views, from which we randomly sample two input views and evaluate on the remaining 30 unseen views. The input and evaluation views are held constant across methods. We report category-specific PSNR and LPIPS in Table 2. We show qualitative comparisons in Figure 7.
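The view selection protocol can be sketched as follows. The sketch assumes that "linearly spaced" refers to frame indices within a sequence and that a fixed seed is what keeps the split constant across methods; the function name and the sequence length of 202 frames are hypothetical.

```python
import numpy as np

def split_views(num_frames, rng, num_loaded=32, num_inputs=2):
    """Pick linearly spaced frames, then split them into input and evaluation views."""
    loaded = np.linspace(0, num_frames - 1, num_loaded).round().astype(int)
    input_pos = rng.choice(num_loaded, size=num_inputs, replace=False)
    is_input = np.zeros(num_loaded, dtype=bool)
    is_input[input_pos] = True
    return loaded[is_input], loaded[~is_input]

# A fixed seed gives the same split for every method being compared.
inputs, targets = split_views(202, np.random.default_rng(0))
```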

SparseFusion outperforms all other methods in LPIPS, only losing out in PSNR for 3 categories. Despite PSNR favoring mean predicting methods, SparseFusion achieves higher PSNR in 7 categories. The strong performance of SparseFusion is reflected in the qualitative comparison. Existing methods either predict a blurry view for unseen regions or a perceptually reasonable view that disregards 3D consistency. SparseFusion predicts views that are both perceptually reasonable and geometrically consistent.

We examine the performance of the different methods as we increase the number of input views. As the number of input views increases, more regions are observed, giving an advantage to methods that explicitly use feature re-projection. We evaluate 2, 3, and 6 view reconstruction on the core subset categories and show PSNR, SSIM, and LPIPS in Table 3.

We see feature re-projection methods improve drastically with more input views as the need for hallucination of unseen regions decreases. EFT outperforms SparseFusion in PSNR for the 3-view and 6-view settings. However, SparseFusion remains competitive in PSNR while being better in LPIPS. SSIM results further underscore the advantage of SparseFusion with sparse (2, 3) input views. Moreover, SparseFusion outperforms all current state-of-the-art methods in all three metrics for 2, 3, and 6 view reconstruction.

We compare against NerFormer and ViewFormer across all 51 categories to demonstrate SparseFusion’s performance on diverse categories. We evaluate with 2 random input views on the first 5 scenes of each category for all 51 categories and report the averaged metrics in Table 4. While EFT edges out SparseFusion in PSNR, SparseFusion achieves better SSIM and LPIPS. The existing methods, NerFormer and ViewFormer, perform significantly worse. We show qualitative results of SparseFusion on diverse categories in Figure 9 where, in addition to 3 synthesized novel views, we also visualize the underlying geometry by extracting an iso-surface via marching cubes.

We show failure modes on the bottom row of Figure 9. On the bottom left, SparseFusion fails to reconstruct a good geometry for the black suitcase. As Instant NGP is trained to output a default black color for the background, the neural representation sometimes fails to disambiguate black foreground from black background. On the bottom right, we see SparseFusion propagating a dataset bias for the category, remote. Since most remote images are TV remotes, SparseFusion attempts to make the video game controller a TV remote.

5.3 Additional Analysis

We investigate the relationship between the magnitude of viewpoint change and reconstruction performance. We analyze SparseFusion, EFT, and PixelNeRF results on the core subset and visualize PSNR and LPIPS binned by angle in degrees to the nearest context view in Figure 8. We show that for small viewpoint changes, SparseFusion performs better in LPIPS and competitively in PSNR against EFT. As viewpoint change increases, feature re-projection methods degrade quickly while SparseFusion remains more robust and performs relatively better.
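One way to compute the quantity used for binning is sketched below. It assumes the angle is measured between directions from the object center (taken as the origin) to the camera centers; the exact definition used for the figure is not stated, and the function name is hypothetical.

```python
import numpy as np

def angle_to_nearest_context(query_center, context_centers):
    """Smallest angle (degrees) between the query camera and any context camera."""
    q = query_center / np.linalg.norm(query_center)
    c = context_centers / np.linalg.norm(context_centers, axis=1, keepdims=True)
    cosines = np.clip(c @ q, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosines.max())))

query = np.array([1.0, 0.0, 0.0])
contexts = np.array([[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
angle = angle_to_nearest_context(query, contexts)
```

Per-view metrics can then be grouped by this angle, for example in fixed-width bins, and averaged within each bin.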

We compare the diffusion distillation formulation against a naive method to obtain a neural representation given a view synthesis method (VLDM or EFT). Concretely, we obtain several rendered samples $(\{\hat{I},\hat{\bm{\pi}}\})$ from the base view synthesis method given the context views $C$ , and simply train an INGP to fit a 3D representation to these.

We present the results in Table 5, and see no significant change when we fit INGP to EFT renderings because EFT predicts consistent mean outputs. However, when we fit INGP to VLDM predictions, we see that perceptual quality decreases. We show a qualitative example in Figure 6 and also illustrate a toy 2D scenario which explains this drop: under mean seeking, averaging over conflicting samples leads to a poor reconstruction. In contrast, when we optimize INGP using the diffusion distillation objective, all metrics improve, underscoring the importance of our proposed mode-seeking optimization.
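The difference between the two behaviors can be reproduced in a one-dimensional toy setting (a simplification of the 2D illustration, with hypothetical values). Samples come from a two-mode Gaussian mixture; fitting by squared distance to the samples returns their mean, while gradient ascent on the log-density settles on one mode.

```python
import numpy as np

MODES = np.array([-1.0, 1.0])  # two conflicting but plausible outcomes
SIGMA = 0.2

def log_density_grad(x):
    """Gradient of log p(x) for an equal-weight mixture of two Gaussians."""
    logits = -0.5 * ((x - MODES) / SIGMA) ** 2
    resp = np.exp(logits - logits.max())
    resp /= resp.sum()  # posterior responsibility of each mode
    return float(np.sum(resp * (MODES - x)) / SIGMA ** 2)

rng = np.random.default_rng(0)
samples = rng.choice(MODES, size=1000) + SIGMA * rng.normal(size=1000)

# Mean seeking: the minimizer of the average squared distance is the sample mean.
mean_seeking = float(samples.mean())

# Mode seeking: gradient ascent on the log-likelihood from a nearby initialization.
x = 0.3
for _ in range(200):
    x += 0.01 * log_density_grad(x)
mode_seeking = x
print(mean_seeking, mode_seeking)
```

The mean-seeking estimate lands between the modes, where the density is nearly zero, whereas the mode-seeking estimate converges to a high-likelihood outcome.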

We examine performance across various distillation design choices in Table 6. We observe that for all methods, PSNR remains relatively similar. However, computing loss in pixel space and additionally using perceptual loss improves both SSIM and LPIPS. Moreover, the multi-step denoising leads to the best perceptual results. While single-step denoising with perceptual loss achieves better PSNR and SSIM by a small margin, qualitative results in Figure 10 show that the predicted texture is smooth and unrealistic.

Discussion

We presented an approach for inferring 3D neural representations from sparse-view observations. Unlike prior methods that struggled to deal with uncertainty, our approach allowed predicting 3D-consistent representations with plausible and realistic outputs even in unobserved regions. While we believe our work represents a significant step forward in recovering detailed 3D from casually captured images, a few challenges still remain. A key limitation of our work (as well as prior methods) is the reliance on known (relative) camera poses across the observations, and while there have been recent promising advances, this remains a challenging task in general. Additionally, our approach requires optimizing instance-specific neural fields and is computationally expensive. Finally, while our work introduced view-conditioned diffusion distillation in the context of sparse-view reconstruction, we believe even single-view 3D prediction approaches can benefit from leveraging similar objectives.

Ethics and Broader Impact

Compared to existing novel view synthesis methods, SparseFusion is more computationally expensive. This poses a hardware limitation for potential downstream tasks and may also increase carbon emissions. Additionally, SparseFusion relies on view-conditioned latent diffusion models (VLDM), which are trained on multi-view datasets. VLDMs are good at representing their training data, potentially learning harmful biases that will propagate to reconstructed 3D scenes. While our current use case for reconstructing static objects from CO3D categories does not present ethical concerns, adapting SparseFusion to humans or animals requires more thorough examination of bias present in the training data.

Acknowledgements

We thank Naveen Venkat, Mayank Agarwal, Jeff Tan, Paritosh Mittal, Yen-Chi Cheng, and Nikolaos Gkanatsios for helpful discussions and feedback. We also thank David Novotny and Jonáš Kulhánek for sharing pretrained models for NerFormer and ViewFormer, respectively. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. (DGE1745016, DGE2140739).

References

Appendix

Appendix A Extended Background: Denoising Diffusion

Denoising diffusion probabilistic models approximate a distribution $p(\bm{x})$ over real data by reversing a Markov chain of diffusion steps, starting from Gaussian noise at $\bm{x}_{T}$ to a realistic image at $\hat{\bm{x}}_{0}$ . See for details.

The forward diffusion process, which incrementally adds noise to a real image $\bm{x}_{0}$ until the image becomes Gaussian noise $\bm{x}_{T}$ , is defined in Eq. 9. Forward variance $\beta$ is usually defined by a fixed schedule.
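In the standard DDPM formulation, a single forward step takes the following form, with $\beta_{t}$ the forward variance at step $t$:

```latex
q(\bm{x}_{t} \mid \bm{x}_{t-1})
  = \mathcal{N}\!\big(\bm{x}_{t};\; \sqrt{1-\beta_{t}}\,\bm{x}_{t-1},\; \beta_{t}\bm{I}\big)
```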

The reverse diffusion process reverses the noise added in the forward process, effectively denoising a noisy image. When we generate a sample from a diffusion model, we apply the reverse process $T$ times from $t=T$ to $t=1$ . The reverse process is defined in Eq. 10, where posterior mean $\mu_{\phi}(\bm{x}_{t},t)$ is predicted from a network and posterior variance $\sigma^{2}$ follows a fixed schedule (though other works such as also learn $\sigma^{2}$ with a network).
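In the standard DDPM formulation, a single reverse step is a Gaussian of the following form. The subscript on $\sigma_{t}^{2}$ is added here to make the dependence on the fixed schedule explicit.

```latex
p_{\phi}(\bm{x}_{t-1} \mid \bm{x}_{t})
  = \mathcal{N}\!\big(\bm{x}_{t-1};\; \mu_{\phi}(\bm{x}_{t},t),\; \sigma_{t}^{2}\bm{I}\big)
```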

Prior works have found that parameterizing the neural network to predict $\bm{\epsilon}$ instead of $\bm{x}_{t-1}$ or $\bm{x}_{0}$ works better in practice. We write the posterior mean in terms of $\bm{\epsilon}$ in Eq. 11, where $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$.
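In the standard DDPM parameterization, the posterior mean expressed through the predicted noise reads:

```latex
\mu_{\phi}(\bm{x}_{t},t)
  = \frac{1}{\sqrt{\alpha_{t}}}
    \left( \bm{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\;
           \bm{\epsilon}_{\phi}(\bm{x}_{t},t) \right)
```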

As mentioned in the main text, this parametrization leads to a training framework where one adds (time-dependent) noise to a data point $\bm{x}_{0}$ , and then trains the network $\bm{\epsilon}_{\phi}$ to predict this noise given the noisy data point $\bm{x}_{t}$ .
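The forward and reverse definitions of this appendix can be combined into a short sampling sketch. It assumes a linear schedule with illustrative values and the common choice $\sigma_{t}^{2}=\beta_{t}$. The noise predictor is an oracle for a dataset containing a single point, so the expected sample is known; all names are hypothetical.

```python
import numpy as np

def ancestral_sample(eps_model, shape, betas, rng):
    """Run the reverse process from Gaussian noise down to a sample."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=shape)  # start from pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):  # array index t stands for step t + 1
        eps = eps_model(x, t, alpha_bar)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=shape) if t > 0 else 0.0  # no noise at the last step
        x = mean + np.sqrt(betas[t]) * noise  # sigma_t^2 = beta_t (assumed)
    return x

x0_true = np.array([0.5, -1.0, 2.0])

def oracle(x, t, alpha_bar):
    """Exact noise for a data distribution concentrated on x0_true."""
    return (x - np.sqrt(alpha_bar[t]) * x0_true) / np.sqrt(1.0 - alpha_bar[t])

betas = np.linspace(1e-4, 2e-2, 1000)
sample = ancestral_sample(oracle, x0_true.shape, betas, np.random.default_rng(0))
```

With the oracle predictor the final step returns the data point itself, which is a quick sanity check that the posterior-mean formula and the schedule indexing agree.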

In this work, we use conditional diffusion models to infer distributions of the form $p(\bm{x}|\bm{y})$ by additionally using $\bm{y}$ as an input for the noise prediction network $\bm{\epsilon}_{\phi}(\bm{x},\bm{y},t)$ .

Appendix B Implementation Details

We provide detailed implementation and training details for all components of SparseFusion.

Epipolar feature transformer is a feed-forward network that first gathers features along the epipolar lines of input images before aggregating them through a series of transformers. EFT is inspired by the GPNR approach by Suhail et al. , but we modify the feature extractor backbone to better suit the sparse-view setup and additionally use epipolar features for conditional diffusion. We describe our implementation below.

Notation: Let $g_{\psi}$ be the RGB branch and $h_{\psi}$ be the feature branch.

Inputs: $C\equiv\{(\bm{x}_{m},\bm{\pi}_{m})\}$, a set of input images with known camera poses, and a query pose $\bm{\pi}$ – note that the poses are w.r.t. an arbitrary world-coordinate system and we only use their relative configuration.

Outputs: an RGB image $\bm{x}$ and a feature grid $\bm{y}$ corresponding to the query viewpoint $\bm{\pi}$ .

We are given input views $C\equiv\{(\bm{x}_{m},\bm{\pi}_{m})\}$, where $\bm{x}_{m}$ is the $m^{th}$ masked (black background) input image of shape (256, 256, 3). We use ResNet18 as our backbone to extract pixel-aligned features by concatenating intermediate features from the first 4 layer groups of ResNet18, using bilinear upsampling to ensure all features are 128 by 128. For each image $\bm{x}_{m}$, we arrive at a feature grid of shape (128, 128, 512).

Given a query camera $\bm{\pi}$, each pixel in its image plane corresponds to some ray. Our Epipolar Transformer seeks to infer per-pixel colors or features, and does so by processing each ray using the multi-view projections of points along it. For each ray $\bm{r}$ (parameterized by its origin and direction), we sample 20 points along the ray direction with depth values linearly spaced between $z_{\text{near}}$ and $z_{\text{far}}$. We set $z_{\text{near}}$ to $s-5$ and $z_{\text{far}}$ to $s+5$, where $s$ is the average distance from scene cameras to origin computed per scene. The 20 points, with shape (20, 3), are then projected into the screen space of each of the $M$ input cameras, giving us epipolar points with shape (M, 20, 2). We use bilinear sampling to sample image features at the epipolar points, giving us combined epipolar features of shape (M, 20, 512) per ray. This becomes the input to our epipolar feature transformer.
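The sampling and projection steps can be sketched as below. The pinhole model with $\bm{x}_{\text{cam}}=R\,\bm{x}_{\text{world}}+\bm{t}$, the intrinsics, and the two example cameras are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def sample_ray_points(origin, direction, z_near, z_far, num_points=20):
    """Points along a ray at linearly spaced depths, shape (num_points, 3)."""
    depths = np.linspace(z_near, z_far, num_points)
    direction = direction / np.linalg.norm(direction)
    return origin[None, :] + depths[:, None] * direction[None, :]

def project_points(points, K, R, t):
    """Pinhole projection of world points to pixel coordinates, shape (num_points, 2)."""
    cam = points @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

s = 8.0  # average camera-to-origin distance for this hypothetical scene
points = sample_ray_points(np.array([0.0, 0.0, -s]), np.array([0.0, 0.0, 1.0]), s - 5, s + 5)

K = np.array([[200.0, 0.0, 128.0], [0.0, 200.0, 128.0], [0.0, 0.0, 1.0]])
cameras = [
    (np.eye(3), np.array([0.0, 0.0, s])),  # same center as the query ray origin
    (np.eye(3), np.array([1.0, 0.0, s])),  # camera shifted sideways
]
epipolar = np.stack([project_points(points, K, R, t) for R, t in cameras])  # (M, 20, 2)
```

In the first camera all points project to a single pixel, while in the shifted camera they spread along a line, which is the epipolar line of the query ray in that view.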

EFT aggregates the epipolar features from a single ray with a series of three transformers to predict an RGB pixel color and a 256-dimension feature. We visualize the EFT in Figure 11. We show details of the transformers in Table 7. All transformer encoders have hidden and output dimensions of 256. Both the depth aggregator and view aggregator transformers are followed by a weighted average operation, where the output features from the transformers are multiplied by a weight, which sums to 1 along the sequence length dimension. The relative weights are predicted by a linear layer before passing through softmax. This effectively performs weighted averaging along the sequence dimension.
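The weighted-average operation described above can be sketched in a few lines. The weight vector stands in for the learned linear layer, and the function name is hypothetical.

```python
import numpy as np

def softmax_pool(tokens, weight, bias=0.0):
    """Pool (seq_len, dim) tokens into one (dim,) feature with softmax weights."""
    logits = tokens @ weight + bias  # one scalar score per token
    logits = logits - logits.max()   # subtract the max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()         # weights sum to 1 along the sequence
    return weights @ tokens, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(20, 256))  # e.g. 20 depth samples with 256-dim features
pooled, weights = softmax_pool(tokens, rng.normal(size=256) * 0.05)
uniform_pooled, _ = softmax_pool(tokens, np.zeros(256))
```

When the linear layer outputs equal scores for every token, the operation reduces to a plain average over the sequence.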

The inputs to the transformer are the sampled features concatenated with additional ray and depth encodings. Given a point along the query ray $\bm{r}_{q}$ at depth $d$, we denote by $\bm{p}_{md}$ its projection in the $m^{th}$ context view. In addition to the pixel-aligned feature $\bm{f}_{md}$ (described in the previous paragraph), we also concatenate encodings of the query ray $\bm{r}_{q}$, the depth $\bm{d}$, and the ray $\bm{r}_{md}$ connecting the $m^{th}$ camera center to the 3D point. We use Plücker coordinates to represent each ray, and compute harmonic embeddings for each of $(\bm{r}_{q},\bm{r}_{md},\bm{d})$ (using 6 harmonic functions) before concatenating them with $\bm{f}_{md}$ to form the input tokens to the transformer.
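A token of this kind can be assembled as in the sketch below. The power-of-two frequencies and the use of both sine and cosine are assumptions about the harmonic embedding, so the resulting token size is illustrative only; the function names are hypothetical.

```python
import numpy as np

def plucker(origin, direction):
    """6D Plücker coordinates (direction, moment) of a ray."""
    d = direction / np.linalg.norm(direction)
    return np.concatenate([d, np.cross(origin, d)])

def harmonic_embedding(x, num_harmonics=6):
    """Sine and cosine of x at power-of-two frequencies (assumed), flattened."""
    freqs = 2.0 ** np.arange(num_harmonics)
    angles = x[..., None] * freqs
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*x.shape[:-1], -1)

query_ray = plucker(np.array([0.0, 0.0, -8.0]), np.array([0.0, 0.0, 1.0]))
context_ray = plucker(np.array([1.0, 0.0, -8.0]), np.array([-1.0, 0.0, 8.0]))
depth = np.array([3.0])
feature = np.zeros(512)  # placeholder for the sampled pixel-aligned feature

token = np.concatenate([
    feature,
    harmonic_embedding(query_ray),
    harmonic_embedding(context_ray),
    harmonic_embedding(depth),
])
```

Plücker coordinates do not change when the ray origin slides along the ray, so the encoding depends only on the ray itself and not on which point was used to describe it.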

We can train the color branch of EFT as a standalone novel view synthesis baseline. In our work, EFT is jointly trained with VLDM. Please see supplementary Section B.2 for details.

B.2 View-conditioned Diffusion Model

The view-conditioned diffusion model (VLDM) is a latent diffusion model conditioned on a pixel-aligned feature grid $\bm{y}$.

Notation: Let $\epsilon_{\phi}$ be the denoising UNet, $\mathcal{E}$ be the VAE encoder, and $\mathcal{D}$ be the VAE decoder.

We use the VAE from Stable Diffusion . We use the provided v1-3 weights and keep the VAE frozen for all experiments. We use (256, 256, 3) RGB images as input, and the VAE encodes them into latents of shape (32, 32, 4). We refer readers to for more details.

Our 400M parameter UNet roughly follows . We construct our UNet using code from with the parameters in Table 8.

The UNet comprises 4 down-sampling blocks, a middle block, and 4 up-sampling blocks. We show the input and output shapes for the modules of the UNet in Table 9. We refer readers to for UNet details. We disable all text conditioning and cross-attention mechanisms; instead, we concatenate the EFT features, $\bm{y}$, with the image latents, $\bm{z}_{t}$. These EFT features are computed for the $32\times 32$ grid of rays corresponding to the patch centers.
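The shape bookkeeping behind this conditioning can be sketched as follows. The sketch assumes that each latent cell corresponds to an $8\times 8$ image patch (256 / 32), that pixel coordinates are continuous with the first pixel spanning $[0,1)$, and that the 256-dimensional EFT feature is concatenated along the channel axis; the resulting 260 channels follow from these assumptions and are not taken from Table 8.

```python
import numpy as np

def patch_center_pixels(image_size=256, latent_size=32):
    """Pixel coordinates (x, y) of the patch centers, one per latent cell."""
    stride = image_size // latent_size                     # 8 pixels per latent cell
    centers = (np.arange(latent_size) + 0.5) * stride      # 4, 12, ..., 252
    xs, ys = np.meshgrid(centers, centers, indexing="xy")
    return np.stack([xs, ys], axis=-1)                     # (32, 32, 2)

rng = np.random.default_rng(0)
centers = patch_center_pixels()

# Placeholders for the noisy latent and the EFT feature grid at the patch-center rays.
z_t = rng.standard_normal((32, 32, 4))
y = rng.standard_normal((32, 32, 256))

# Conditioning by channel-wise concatenation instead of cross attention.
unet_input = np.concatenate([z_t, y], axis=-1)             # (32, 32, 260)
```

Each of the $32\times 32$ center coordinates defines one query ray for EFT, so the feature grid $\bm{y}$ is spatially aligned with $\bm{z}_{t}$ by construction.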

We train with a batch size of 2, a randomly chosen number of input views between 2 and 5, and a learning rate of 5e-5, using the Adam optimizer with default hyperparameters for 100K steps. We optimize both the UNet weights and the EFT weights. We optimize the UNet and the feature branch of EFT with the simplified variational lower bound . We optimize the color branch of EFT with a pixel-wise reconstruction loss.
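For reference, the simplified objective for a conditional latent diffusion model is commonly written as below. This is a sketch in the notation of this section, with $\bm{x}$ denoting the target-view image; the exact form of the noise schedule $\bar{\alpha}_{t}$ is an assumption and is not specified here.

$$\bm{z}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bm{z}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon},\qquad \mathcal{L}_{simple}=\mathbb{E}_{\bm{z}_{0},\,\bm{\epsilon}\sim\mathcal{N}(0,I),\,t}\left[\left\lVert\bm{\epsilon}-\epsilon_{\phi}(\bm{z}_{t},t,\bm{y})\right\rVert_{2}^{2}\right],\qquad \bm{z}_{0}=\mathcal{E}(\bm{x})$$

Because $\bm{y}$ is produced by the feature branch of EFT, gradients of this loss flow into both $\epsilon_{\phi}$ and EFT, which is what allows the two to be trained jointly.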

B.3 Diffusion Distillation

We optimize a 3D neural scene representation, Instant NGP , with our VLDM.

Notation: Let $f_{\theta}$ be the volumetric Instant NGP renderer, $p_{\phi}(\bm{z}_{0:\mathcal{T}}|\bm{\pi},C)$ be the multi-step denoising process that estimates $\hat{\bm{z}}_{0}$ . Let $\Pi$ be an instance-specific camera distribution.

We use the PyTorch Instant NGP implementation from . We set scene bounds to 4 with desired hashgrid resolution of 8,192. We use a small 3 layer MLP with hidden dimension of 64 to predict RGB and density. We do not use view direction as input.

Given a set of input cameras $C_{I}\equiv(\bm{\pi}_{m})$ and a query camera $\bm{\pi}_{q}$, we first find the look-at point $P_{at}$ as the point nearest to all $m+1$ rays originating from the camera centers. Then, we fit a circle $O$ in 3D space whose center is the mean of all camera centers. Let the normal of circle $O$ be $\bm{n}$. To sample a camera, we first sample a point $P_{i}$ on $O$ and jitter the angle between $\overline{P_{at}P_{i}}$ and $\bm{n}$ by $\mathcal{N}(0,0.17)$ radians to get the jittered point $P_{i}^{\prime}$. We then construct a camera $\bm{\pi}$ with center $P_{i}^{\prime}$ looking at $P_{at}$.
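One possible realization of this camera sampling is sketched below. Several details are assumptions made for the example rather than facts about our implementation: the rays are taken to be the cameras' viewing directions, the circle plane is fit with an SVD of the centered camera positions, the radius is the mean distance to the circle center, and 0.17 is treated as a standard deviation. All function names are hypothetical.

```python
import numpy as np

def nearest_point_to_rays(origins, directions):
    """Least-squares point closest to a set of rays (origins, directions: (N, 3))."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # Projectors onto the plane orthogonal to each ray direction.
    proj = np.eye(3)[None] - d[:, :, None] * d[:, None, :]
    A = proj.sum(axis=0)
    b = np.einsum("nij,nj->i", proj, origins)
    return np.linalg.solve(A, b)

def fit_circle(centers):
    """Circle centered at the mean camera position (plane and radius are assumptions)."""
    c = centers.mean(axis=0)
    _, _, vt = np.linalg.svd(centers - c)
    normal = vt[-1]                       # direction of least variance
    radius = np.linalg.norm(centers - c, axis=1).mean()
    return c, normal, radius, vt[0], vt[1]

def jitter_elevation(p, p_at, normal, delta):
    """Rotate (p - p_at) in the plane it spans with the normal, changing their angle by delta."""
    v = p - p_at
    axis = np.cross(normal, v)
    axis = axis / np.linalg.norm(axis)
    # Rodrigues' formula with axis perpendicular to v.
    v_new = v * np.cos(delta) + np.cross(axis, v) * np.sin(delta)
    return p_at + v_new

def sample_camera(centers, view_dirs, rng, sigma=0.17):
    """Return a jittered camera center, its unit viewing direction, and the look-at point."""
    p_at = nearest_point_to_rays(centers, view_dirs)
    c, n, r, u, v = fit_circle(centers)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    p = c + r * (np.cos(theta) * u + np.sin(theta) * v)
    p_jit = jitter_elevation(p, p_at, n, rng.normal(0.0, sigma))
    forward = (p_at - p_jit) / np.linalg.norm(p_at - p_jit)
    return p_jit, forward, p_at

# Toy scene: four cameras on a ring at height 1, all looking at the world origin.
angles = np.deg2rad([0.0, 90.0, 180.0, 270.0])
centers = np.stack([3 * np.cos(angles), 3 * np.sin(angles), np.ones(4)], axis=1)
view_dirs = -centers
rng = np.random.default_rng(0)
cam_center, forward, p_at = sample_camera(centers, view_dirs, rng)
```

The jitter keeps the distance to $P_{at}$ fixed and only changes the angle to $\bm{n}$, so sampled cameras stay near the ring of observed viewpoints while covering a band of elevations around it.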

Given a rendered image $\bm{x}_{0}$, we encode it to obtain $\bm{z}_{0}$. Then, we uniformly sample $t\sim\mathcal{U}(0,T]$ and construct a noisy image latent $\bm{z}_{t}$. We perform multi-step denoising to obtain $\hat{\bm{z}}_{0}$ by iteratively sampling $\hat{\bm{z}}_{t_{k-1}}\sim p_{\phi}(\bm{z}_{t_{k-1}}|\hat{\bm{z}}_{t_{k}},\bm{y})$ on an interval of time steps $\mathcal{T}=(t_{1},\ldots,t_{k},t)$ using a linear multi-step method . We construct $\mathcal{T}$ by linearly spacing $k+1$ time steps between $(0,t]$. We define $k$ with a simple scheduler:

Finally, given $\hat{\bm{z}}_{0}$ , we get the predicted image $\hat{\bm{x}}_{0}=\mathcal{D}(\hat{\bm{z}}_{0})$ . We do not compute gradients through multi-step diffusion and treat $\hat{\bm{x}}_{0}$ as a detached tensor.
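The construction of the time steps and the structure of the denoising loop can be sketched as follows. The number of intermediate steps $k$ is passed in as a plain argument, and `step_fn` is a hypothetical placeholder for a single sampler update; neither the scheduler for $k$ nor the linear multi-step sampler itself is reproduced here. Rounding to integer time steps and the final update to time 0 are also assumptions of the sketch.

```python
import numpy as np

def denoising_timesteps(t, k):
    """k + 1 integer time steps linearly spaced in (0, t], in ascending order.

    Duplicates produced by rounding (possible when t is small) are merged.
    """
    steps = np.linspace(0.0, float(t), k + 2)[1:]          # drop 0, keep t
    steps = np.clip(np.round(steps), 1, t).astype(int)
    return np.unique(steps)

def multistep_denoise(z_t, t, k, step_fn):
    """Run step_fn(z, t_cur, t_next) from t down to 0 and return the estimate of z_0.

    step_fn is a hypothetical single-step update; no gradients are tracked,
    mirroring the fact that the result is treated as a detached target.
    """
    ts = denoising_timesteps(t, k)[::-1]                   # t, t_k, ..., t_1
    next_ts = list(ts[1:]) + [0]
    z = z_t
    for t_cur, t_next in zip(ts, next_ts):
        z = step_fn(z, int(t_cur), int(t_next))
    return z

# Toy usage with a stub update that only records the visited time steps.
visited = []
def record_step(z, t_cur, t_next):
    visited.append((t_cur, t_next))
    return z

z_hat_0 = multistep_denoise(np.zeros((32, 32, 4)), t=1000, k=4, step_fn=record_step)
```

In an automatic differentiation framework, the same loop would be wrapped in a no-gradient context so that $\hat{\bm{x}}_{0}$ acts purely as a regression target for the rendered image.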

We perform 3,000 steps of distillation, optimizing the weights of the MLP $\theta$ with the Adam optimizer and a learning rate of 5e-4. During each step of diffusion distillation, we sample $\bm{\pi}\sim\Pi$ and render an image $\bm{x}_{0}=f_{\theta}(\bm{\pi})$. For the first 1,000 steps, we compute a rendering loss between $f_{\theta}(\bm{\pi})$ and $g_{\psi}(\bm{\pi}|C)$. During the remaining steps, we compute the loss between $f_{\theta}(\bm{\pi})$ and $\hat{\bm{x}}_{0}$, weighted by $w_{t}=1-\bar{\alpha}_{t}$. To avoid out-of-memory errors, we render images at a reduced resolution of (128, 128) and apply bilinear up-sampling before performing multi-step diffusion. In addition, we compute a rendering loss between $f_{\theta}(\bm{\pi}_{m})$ and $\bm{x}_{m}$ on all $m$ input images. Optimizing a single scene takes roughly 1 hour on an A5000 GPU.
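The two-stage target selection and the weighting $w_{t}=1-\bar{\alpha}_{t}$ can be sketched as below. The linear beta schedule, the use of a mean squared error as the rendering loss, and the unit weight during the first stage are assumptions chosen to keep the example small; rendering, up-sampling, and the loss on the input views are not shown.

```python
import numpy as np

def alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def distillation_target(step, eft_image, denoised_image, t, abar, warmup_steps=1000):
    """Pick the regression target and loss weight for one distillation step.

    Steps before warmup_steps regress to the EFT prediction (weight 1.0, assumed);
    later steps regress to the denoised image with weight 1 - alpha_bar_t.
    """
    if step < warmup_steps:
        return eft_image, 1.0
    return denoised_image, 1.0 - abar[t - 1]    # t is 1-indexed in (0, T]

def weighted_mse(render, target, weight):
    """Stand-in rendering loss: weighted mean squared error."""
    return weight * np.mean((render - target) ** 2)

abar = alpha_bar()
eft_image = np.zeros((4, 4, 3))       # placeholder for g_psi(pi | C)
denoised = np.ones((4, 4, 3))         # placeholder for x_hat_0
render = np.full((4, 4, 3), 0.5)      # placeholder for f_theta(pi)

target_early, w_early = distillation_target(0, eft_image, denoised, 500, abar)
target_late, w_late = distillation_target(1500, eft_image, denoised, 500, abar)
loss_late = weighted_mse(render, target_late, w_late)
```

Since $\bar{\alpha}_{t}$ decreases with $t$, the weight grows with the noise level, so targets obtained by denoising from more heavily noised latents contribute more to the loss.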