RealFusion: 360° Reconstruction of Any Object from a Single Image

Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, Andrea Vedaldi

Introduction

We consider the problem of obtaining a 360° photographic reconstruction of any object given a single image of it. The challenge is that a single image does not contain sufficient information for 3D reconstruction. Without access to multiple views, an image only provides weak evidence about the 3D shape of the object, and only for one side of it. Even so, there is proof that this task can be solved: any skilled 3D artist can take a picture of almost any object and, given sufficient time and effort, create a plausible 3D model of it. The artist can do so by tapping into her vast knowledge of the natural world and of the objects it contains, making up for the information missing in the image.

To solve this problem algorithmically, one must then marry visual geometry with a powerful statistical model of the 3D world. The recent explosion of 2D image generators like DALL-E, Imagen, and Stable Diffusion suggests that such models might not be far behind. By using diffusion, these methods can solve highly ambiguous generation tasks, obtaining plausible 2D images from textual descriptions, semantic maps, partially complete images, or simply unconditionally from random noise. Clearly, these models possess high-quality priors, if not of the 3D world, then at least of the way it is represented in 2D images. Hence, in theory, a 3D diffusion model trained on vast quantities of 3D data should be capable of producing 3D reconstructions, either unconditionally or conditioned on a 2D image. However, training such a model is infeasible because, while one can access billions of 2D images, the same cannot be said about 3D data.

The alternative to training a 3D diffusion model is to extract 3D information from an existing 2D model. A 2D image generator can in fact be used to sample or validate multiple views of a given object; these multiple views can then be used to perform 3D reconstruction. With early GAN-based generators, authors showed some success for simple data like faces and synthetic objects. With the availability of large-scale models like CLIP and, more recently, diffusion models, increasingly complex results have been obtained. The most recent example is DreamFusion, which generates high-quality 3D models from textual descriptions alone.

Despite these advances, the problem of single-image 3D reconstruction remains largely unsolved. In fact, these recent methods do not solve this problem. They either sample random objects, or, like in the case of DreamFusion, start from a textual description.

A problem in extending generators to reconstruction is coverage (sometimes known as mode collapse). For example, high-quality face generators based on GANs are usually difficult to invert: they may be able to generate many different high-quality images, and yet are usually unable to generate most images. Conditioning on an image provides a much more detailed and nuanced specification of the object than, say, a textual description. It is not obvious if the generator model would be able to satisfy all such constraints.

In this paper, we study this problem in the context of diffusion models. We express the object’s 3D geometry and appearance by means of a neural radiance field. Then, we train the radiance field to reconstruct the given input image by minimizing the usual rendering loss. At the same time, we sample random other views of the object, and constrain them with the diffusion prior, using a technique similar to DreamFusion.

We find that, out of the box, this idea does not work well. Instead, we need to make a number of improvements and modifications. The most important change is to adequately condition the diffusion model. The idea is to configure the prior to “dream up” or sample images that may plausibly constitute other views of the given object. We do so by engineering the diffusion prompt from random augmentations of the given image. Only in this manner does the diffusion model provide sufficiently strong constraints to allow meaningful 3D reconstruction.

In addition to setting the prompt correctly, we also add some regularizers: shading the underlying geometry and randomly dropping out texture (also similar to DreamFusion), smoothing the normals of the surface, and fitting the model in a coarse-to-fine fashion, capturing first the overall structure of the object and only then the fine-grained details. We also focus on efficiency and base our model on InstantNGP. In this manner, we achieve reconstructions in the span of hours, instead of the days that traditional MLP-based NeRF models would require.

We assess our approach by using random images captured in the wild as well as existing benchmark datasets. Note that we do not train a fully-fledged 2D-to-3D model and we are not limited to specific object categories; rather, we perform reconstruction on an image-by-image basis using a pretrained 2D generator as a prior. Nonetheless, we surpass previous single-image reconstructors both quantitatively and qualitatively, including Shelf-Supervised Mesh Prediction, which uses supervision tailored specifically for 3D reconstruction.

More impressively, and more importantly, we obtain plausible 3D reconstructions that are a good match for the provided input image (Fig. 1). Our reconstructions are not perfect, as the diffusion prior clearly does its best to explain the available image evidence but cannot always match all the details. Even so, we believe that our results convincingly demonstrate the viability of this approach and trace a path for future improvements.

To summarize, we make the following contributions: (1) We propose RealFusion, a method that can extract from a single image of an object a 360° photographic 3D reconstruction without assumptions on the type of object imaged or 3D supervision of any kind; (2) We do so by leveraging an existing 2D diffusion image generator via a new single-image variant of textual inversion; (3) We also introduce new regularizers and provide an efficient implementation using InstantNGP; (4) We demonstrate state-of-the-art reconstruction results on a number of in-the-wild images and images from existing datasets when compared to alternative approaches.

Related work

Much of the early work on 3D reconstruction is based on principles of multi-view geometry. These classic methods use photometry only to match image features and then discard it, estimating only the 3D shape.

The problem of reconstructing photometry and geometry together has been dramatically revitalized by the introduction of neural radiance fields (RFs). NeRF in particular noticed that a coordinate MLP provides a compact and yet expressive representation of 3D fields, and can be used to model RFs with great effectiveness. Many variants of NeRF-like models have since appeared. For instance, some use signed distance functions (SDFs) to recover cleaner geometry. These approaches assume that dozens if not hundreds of views of each scene are available for reconstruction. Here, we use them for single-image reconstruction, using a diffusion model to “dream up” the missing views.

Few-view reconstruction.

Many authors have attempted to improve the statistical efficiency of NeRF-like models by learning or incorporating various kinds of priors. Quite related to our work, NeRF-on-a-Diet reduces the number of images required to learn a NeRF by generating random views and measuring their “semantic compatibility” with the available views via CLIP embeddings, but it still requires several input views.

While CLIP is a general-purpose model learned on 2D data, other authors have learned deep networks specifically for the goal of inferring NeRFs from a small number of views. Examples include IBRNet, NeRF-WCE, PixelNeRF, NeRFormer, and ViewFormer. These models still generally require more than one input view at test time, require multi-view data for training, and are often optimized for specific object categories.

Single-view reconstruction.

Some authors have attempted to recover full radiance fields from single images, but this generally requires multi-view data for training, as well as learning models that are specific to a single object category. 3D-R2N2, Pix2Vox, and LegoFormer learn to reconstruct volumetric representations of simple objects, mainly from synthetic data like ShapeNet. More recently, CodeNeRF predicts a full radiance field, including reconstructing the photometry of the objects. AutoRF learns a similar autoencoder specifically for cars.

Extracting 3D models from 2D generators.

Several authors have proposed to extract 3D models from 2D image generators, originally using GANs.

More related to our work, CLIP-Mesh and Dream Fields do so by using the CLIP embedding and can condition 3D generation on text. Our model is built on the recent DreamFusion approach, which builds on a similar idea using a diffusion model as prior.

However, these models have been used as either pure generators or generators conditioned on vague cues such as class identity or text. Here, we build on similar ideas, but we apply them to the case of single-view reconstruction.

Recently, the authors of have proposed to directly generate multiple 2D views of an object, which can then be reconstructed in 3D using a NeRF-like model. This is also reminiscent of our approach, but their model requires multi-view data for training, is only tested on synthetic data, and requires explicitly sampling multiple views for reconstruction (in our case they remain implicit).

Diffusion Models.

Denoising diffusion probabilistic models are a class of generative models based on iteratively reversing a Markovian noising process. In vision, early works formulated the problem as learning a variational lower bound, or framed it as optimizing a score-based generative model or as the discretization of a continuous stochastic process. Recent improvements include the use of faster and deterministic sampling, class-conditional models, text-conditional models, and modeling in latent space.

Method

We provide an overview and notation for the background material first (Sec. 3.1), and then discuss our RealFusion method (Sec. 3.2).

where $T_{i}=\exp(-\Delta\sum_{j=0}^{i-1}\sigma(\bm{x}_{j}))$ is the probability that a photon is transmitted from point $\bm{x}_{i}$ back to the camera sensor without being absorbed by the material.
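As an illustration of the transmittance defined above, the following is a minimal NumPy sketch of compositing samples along a single ray. It assumes the usual discretized emission-absorption model, in which sample $i$ contributes its color weighted by $T_{i}-T_{i+1}$; the function name `render_ray` and its arguments are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def render_ray(sigma, color, delta):
    """Composite N samples along one ray (illustrative sketch).

    sigma: (N,) non-negative densities at the samples x_0..x_{N-1}
    color: (N, 3) RGB colors at the same samples
    delta: spacing between consecutive samples
    """
    sigma = np.asarray(sigma, dtype=float)
    color = np.asarray(color, dtype=float)
    # T_i = exp(-delta * sum_{j<i} sigma_j) for i = 0..N, so T_0 = 1.
    T = np.exp(-delta * np.concatenate([[0.0], np.cumsum(sigma)]))
    # Assumed weighting: probability that the photon is absorbed in segment i.
    weights = T[:-1] - T[1:]
    rgb = (weights[:, None] * color).sum(axis=0)
    opacity = weights.sum()  # equals 1 - T_N
    return rgb, opacity, T
```

Every operation here is differentiable in `sigma` and `color`, so the same computation can be optimized by gradient descent once it is written in an automatic-differentiation framework.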

Importantly, the rendering function $R(u;\sigma,c)$ is differentiable, which allows training the model by means of a standard optimizer. Specifically, the RF is fitted to a dataset $\mathcal{D}=\{(I,\pi)\}$ of images $I$ with known camera parameters by minimizing the $L^{2}$ image reconstruction error
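Written in the notation above, this objective (referred to below as Eq. 2) can be stated as follows; the normalization by $|\mathcal{D}|$ and the name $\mathcal{L}_{\text{rec}}$ are assumptions, since the text only specifies the $L^{2}$ form:

$$\mathcal{L}_{\text{rec}}(\sigma,c;\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum_{(I,\pi)\in\mathcal{D}}\left\|I-R(\cdot;\sigma,c,\pi)\right\|^{2}$$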

In order to obtain good quality results, one typically requires a dataset of dozens or hundreds of views.

Here, we consider the case in which we are given exactly one input image $I_{0}$ corresponding to some (unknown) camera $\pi_{0}$ . In this case, we can also assume any standard viewpoint $\pi_{0}$ for that single camera. Optimizing Eq. 2 with a single training image leads to severe over-fitting: it is straightforward to find a pair $(\sigma,c)$ that has zero loss and yet does not capture any sensible 3D model of the object. Below we will leverage a pre-trained 2D image prior to (implicitly) dream up novel views of the object and provide the missing information for 3D reconstruction.

Diffusion models.

A diffusion model draws a sample from a probability distribution $p(I)$ by inverting a process that gradually adds noise to the image $I$. The diffusion process is associated with a variance schedule $\{\beta_{t}\in(0,1)\}_{t=1}^{T}$, which defines how much noise is added at each time step. The noisy version of sample $I$ at time $t$ can then be written $I_{t}=\sqrt{\bar{\alpha}_{t}}I+\sqrt{1-\bar{\alpha}_{t}}\epsilon$, where $\epsilon\sim\mathcal{N}(\bm{0},\bm{I})$ is a sample from a Gaussian distribution (with the same dimensionality as $I$), $\alpha_{t}=1-\beta_{t}$, and $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$. One then learns a denoising neural network $\hat{\epsilon}=\Phi(I_{t};t)$ that takes as input the noisy image $I_{t}$ and the noise level $t$ and tries to predict the noise component $\epsilon$.
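The noising step can be made concrete with a short NumPy sketch. The linear schedule for $\beta_{t}$ and the toy image below are assumptions chosen for illustration; the text does not say which schedule is used.

```python
import numpy as np

def alpha_bar(betas):
    """Cumulative products of alpha_t = 1 - beta_t, for t = 1..T."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=float))

def add_noise(image, t, betas, rng):
    """Return (I_t, eps) for a 1-indexed time step t."""
    a_bar = alpha_bar(betas)[t - 1]
    eps = rng.standard_normal(image.shape)
    noisy = np.sqrt(a_bar) * image + np.sqrt(1.0 - a_bar) * eps
    return noisy, eps

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # stand-in for a real image
noisy, eps = add_noise(image, t=500, betas=betas, rng=rng)
```

Because $\bar{\alpha}_{t}$ decreases monotonically with $t$, later time steps retain less of the image and more of the noise.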

In order to draw a sample from the distribution $p(I)$, one starts by drawing a sample $I_{T}\sim\mathcal{N}(\bm{0},\bm{I})$. Then, one progressively denoises the image by iterated application of $\Phi$ according to a specified sampling schedule, which terminates with $I_{0}$ sampled from $p(I)$.

Modern diffusion models are trained on large collections $\mathcal{D^{\prime}}=\{I\}$ of images by minimizing the loss
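Using the noising process defined above, a standard form of this denoising loss (referred to below as Eq. 3) is the following; the averaging over the dataset and the expectation over $t$ and $\epsilon$ are written out here as assumptions:

$$\mathcal{L}_{\text{diff}}(\Phi;\mathcal{D^{\prime}})=\frac{1}{|\mathcal{D^{\prime}}|}\sum_{I\in\mathcal{D^{\prime}}}\mathbb{E}_{t,\epsilon}\left[\left\|\Phi\left(\sqrt{\bar{\alpha}_{t}}I+\sqrt{1-\bar{\alpha}_{t}}\epsilon;t\right)-\epsilon\right\|^{2}\right]$$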

This model can be easily extended to draw samples from a distribution $p(I|\bm{e})$ conditioned on a prompt $\bm{e}$. Conditioning on the prompt is obtained by adding $\bm{e}$ as an additional input of the network $\Phi$, and the strength of conditioning can be controlled via classifier-free guidance.
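Classifier-free guidance combines a conditional and an unconditional noise prediction. The sketch below uses one common convention, in which a weight of 1 recovers the conditional prediction; other papers parameterize the weight differently, so this choice is an assumption rather than the exact form used here.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_weight):
    """Extrapolate from the unconditional towards the conditional prediction."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```

Weights larger than 1 push the prediction further in the direction indicated by the prompt, which strengthens the conditioning at the cost of sample diversity.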

DreamFusion and Score Distillation Sampling (SDS).

Given a 2D diffusion model $p(I|\bm{e})$ and a prompt $\bm{e}$, DreamFusion extracts from it a 3D rendition of the corresponding concept, represented by an RF $(\sigma,c)$. It does so by randomly sampling a camera parameter $\pi$, rendering a corresponding view $I_{\pi}$, assessing the likelihood of the view based on the model $p(I_{\pi}|\bm{e})$, and updating the RF to increase the likelihood of the generated view based on the model.

In practice, DreamFusion uses the denoiser network as a frozen critic and takes a gradient step
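In the notation of this section, the SDS update described by DreamFusion can be written as below (Eq. 4 in what follows). The weighting function $w(t)$ is not defined in this text and is included as an assumption, and $I_{t}=\sqrt{\bar{\alpha}_{t}}I+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ is the noised rendering:

$$\nabla_{(\sigma,c)}\mathcal{L}_{\text{SDS}}=\mathbb{E}_{t,\epsilon}\left[w(t)\,\big(\Phi(I_{t};t,\bm{e})-\epsilon\big)\cdot\nabla_{(\sigma,c)}I\right]$$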

where $I=R(\cdot;\sigma,c,\pi)$ is the image rendered from a given viewpoint $\pi$ and $\bm{e}$ is the prompt. This process is called Score Distillation Sampling (SDS).

Note that Eq. 4 differs from simply optimizing the standard diffusion model objective because it does not include the Jacobian term for $\Phi$ . In practice, removing this term both improves generation quality and reduces computational and memory requirements.
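The role of the omitted Jacobian can be seen in a toy NumPy sketch of one SDS step. Everything below is a hypothetical stand-in: the "renderer" is a fixed linear map, and the "denoiser" encodes a prior that believes every clean image is zero. The point is only that the gradient flows through the renderer while the denoiser output is used as a constant.

```python
import numpy as np

def sds_gradient(theta, render, jacobian, denoiser, a_bar, weight, rng):
    """One stochastic SDS gradient estimate for parameters theta."""
    image = render(theta)
    eps = rng.standard_normal(image.shape)
    noisy = np.sqrt(a_bar) * image + np.sqrt(1.0 - a_bar) * eps
    # The denoiser is a frozen critic: its output is used as a constant,
    # so no Jacobian of the denoiser appears in the update.
    residual = weight * (denoiser(noisy) - eps)
    # Chain rule through the renderer only.
    return jacobian(theta).T @ residual

# Toy stand-ins (hypothetical): a linear renderer and a zero-image prior.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
a_bar = 0.5
render = lambda theta: A @ theta
jacobian = lambda theta: A
denoiser = lambda noisy: noisy / np.sqrt(1.0 - a_bar)

theta = np.array([1.0, -1.0])
rng = np.random.default_rng(0)
for _ in range(200):
    grad = sds_gradient(theta, render, jacobian, denoiser, a_bar, 1.0, rng)
    theta = theta - 0.05 * grad
```

In this toy setting the update pulls the rendered image towards the mode of the prior, which is the behaviour SDS relies on when the prior is a real diffusion model.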

One final aspect of DreamFusion is essential for understanding our contribution in the following section: DreamFusion finds that it is necessary to use classifier-free guidance with a very high guidance weight of 100, much larger than one would use for image sampling, in order to obtain good 3D shapes. As a result, the generations tend to have limited diversity; they produce only the most likely objects for a given prompt, which is incompatible with our goal of reconstructing any given object.

RealFusion

Our goal is to reconstruct a 3D model of the object contained in a single image $I_{0}$, utilizing the prior captured in the diffusion model $\Phi$ to make up for the missing information. We will achieve this by optimizing a radiance field using two simultaneous objectives: (1) a reconstruction objective Eq. 2 from a fixed viewpoint, and (2) an SDS-based prior objective Eq. 4 on novel views randomly sampled at each iteration. Figure 2 provides a diagram of the entire system.

The most important component of our method is the use of single-image textual inversion as a substitute for alternative views. Ideally, we would like to condition our reconstruction process on multi-view images of the object in $I_{0}$ , i.e. on samples from $p(I|I_{0})$ . Since these images are not available, we instead synthesize a text prompt $\bm{e}^{(I_{0})}$ specifically for our image $I_{0}$ as a proxy for this multi-view information.

Our idea, then, is to engineer a prompt $\bm{e}^{(I_{0})}$ to provide a useful approximation of $p(I|I_{0})$ . We do so by generating random augmentations $g(I_{0}),$ $g\in G$ of the input image, which serve as pseudo-alternative-views. We use these augmentations as a mini-dataset $\mathcal{D^{\prime}}=\{g(I_{0})\}_{g\in G}$ and optimize the diffusion loss Eq. 3 $\mathcal{L}_{\text{diff}}(\Phi(\cdot;\bm{e}^{(I_{0})}))$ with respect to the prompt $\bm{e}^{(I_{0})}$ , while freezing all other text embeddings and model parameters.

In practice, our prompt is derived automatically from templates like “an image of a $\langle\textbf{e}\rangle$ ”, where “ $\langle\textbf{e}\rangle$ ” ( $=\bm{e}^{(I_{0})}$ ) is a new token introduced to the vocabulary of the text encoder of our diffusion model (see Appendix A for details). Our optimization procedure mirrors and generalizes the recently-proposed textual-inversion method of . Differently from , we work in the single-image setting and utilize image augmentations for training rather than multiple views.

To help convey the intuition behind $\langle\textbf{e}\rangle$ , consider an attempt at reconstructing an image of a fish using the generic text prompt “An image of a fish” with losses Eqs. 3 and 4. In our experience, this often produces a reconstruction which looks like the input fish from the input viewpoint, but looks like some different, more-generic fish from the backside. By contrast, using the prompt “An image of a $\langle\textbf{e}\rangle$ ”, the reconstruction resembles the input fish from all angles. An example of exactly this case is shown in Figure 7.

Finally, Figure 3 demonstrates the amount of detail captured in the embedding $\langle\textbf{e}\rangle$ .

Coarse-to-fine training.

In order to describe our coarse-to-fine training methodology, it is necessary to first briefly introduce our underlying RF model, an InstantNGP. InstantNGP is a grid-based model which stores features at the vertices of a set of feature grids $\{G_{i}\}_{i=1}^{L}$ at multiple resolutions. The resolution of these grids is chosen to be a geometric progression between the coarsest and finest resolutions, and feature grids are trained simultaneously.

We choose an InstantNGP over a conventional MLP-based NeRF due to its computational efficiency and training speed. However, the optimization procedure occasionally produces small irregularities on the surface of the object. We find that training in a coarse-to-fine manner helps to alleviate these issues: for the first half of training we only optimize the lower-resolution feature grids $\{G_{i}\}_{i=1}^{L/2}$, and then in the second half of training we optimize all feature grids $\{G_{i}\}_{i=1}^{L}$. Using this strategy, we obtain the benefits of both efficient training and high-quality results.
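The schedule can be sketched as a per-level mask applied to the stacked grid features. This is a minimal illustration, assuming that inactive levels are zeroed out before the features reach the MLP; the function names are hypothetical.

```python
import numpy as np

def level_mask(step, total_steps, num_levels):
    """1.0 for active grid levels, 0.0 for masked ones (coarsest level first)."""
    active = num_levels // 2 if step < total_steps // 2 else num_levels
    mask = np.zeros(num_levels)
    mask[:active] = 1.0
    return mask

def masked_features(features, step, total_steps):
    """features: (num_levels, feat_dim) per-level features for one 3D point."""
    mask = level_mask(step, total_steps, features.shape[0])
    # Zeroed levels contribute nothing to the output and receive no gradient.
    return (features * mask[:, None]).reshape(-1)
```

With 16 levels and 5000 steps, for example, `level_mask(0, 5000, 16)` activates the 8 coarsest levels and `level_mask(2500, 5000, 16)` activates all 16.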

Normal vector regularization.

Next, we introduce a new regularization term to encourage our geometry to have smooth normals. The introduction of this term is motivated by the observation that our RF model occasionally generated noisy-looking surfaces with low-level artifacts. To address these artifacts, we encourage our RF to have smoothly varying normal vectors. Notably, we perform this regularization in 2D rather than in 3D.

At each iteration, in addition to computing RGB and opacity values, we also compute normals for each point along the ray and aggregate these via the raymarching equation to obtain normals $N\in\mathbb{R}^{H\times W\times 3}$. Normals may be computed either by taking the gradient of the density field or by using finite differences. We found that using finite differences worked well in practice. Our loss is:
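From the description that follows, the loss (Eq. 5) compares the rendered normals with a blurred copy of themselves that is treated as a constant. A form consistent with that description, with the name $\mathcal{L}_{\text{normals}}$ assumed, is:

$$\mathcal{L}_{\text{normals}}=\left\|N-\text{stopgrad}(\text{blur}(N,k))\right\|^{2}$$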

where stopgrad is a stop-gradient operation and $\text{blur}(\cdot,k)$ is a Gaussian blur with kernel size $k$ (we use $k=9$ ).
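A minimal NumPy sketch of this 2D regularizer is given below. The Gaussian standard deviation and the edge padding are assumptions, since only the kernel size is stated; NumPy has no automatic differentiation, so the stop-gradient is indicated only by a comment.

```python
import numpy as np

def gaussian_kernel(k, sigma):
    """Normalized 1D Gaussian kernel of odd size k."""
    x = np.arange(k) - (k - 1) / 2.0
    w = np.exp(-0.5 * (x / sigma) ** 2)
    return w / w.sum()

def blur(img, k=9, sigma=2.0):
    """Separable Gaussian blur of an (H, W, C) array with edge padding."""
    h, w_px = img.shape[:2]
    w = gaussian_kernel(k, sigma)
    pad = k // 2
    out = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = sum(w[i] * out[i:i + h] for i in range(k))           # filter rows
    out = sum(w[i] * out[:, i:i + w_px] for i in range(k))     # filter columns
    return out

def normal_smoothness_loss(normals, k=9):
    # In an autodiff framework the blurred target would be detached
    # (stop-gradient) so that only the sharp normals are updated.
    target = blur(normals, k)
    return float(np.sum((normals - target) ** 2))
```

A perfectly flat normal map gives a loss of zero, while high-frequency noise in the normals is penalized.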

Although it may be more common to regularize normals in 3D, we found that operating in 2D reduced the variance of the regularization term and led to superior results.

Mask loss.

In addition to the input image, our model also utilizes a mask of the object that one wishes to reconstruct. In practice, we use an off-the-shelf image matting model to obtain this mask for all images.

We incorporate this mask by adding a simple $L^{2}$ loss term on the difference between the opacities $O=\mathcal{R}(\sigma,\pi_{0})\in\mathbb{R}^{H\times W}$ rendered from the fixed reference viewpoint and the object mask $M$: $\mathcal{L}_{\text{rec,mask}}=\|O-M\|^{2}$. Our final objective then consists of four terms:
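One way to write these four terms, consistent with the two-line layout described next, is the following. The weights $\lambda$ and the name $\mathcal{L}_{\text{rec}}$ for the image term of Eq. 2 are notational assumptions:

$$\begin{aligned}\nabla_{(\sigma,c)}\mathcal{L}=\;&\nabla\mathcal{L}_{\text{SDS}}+\lambda_{\text{normals}}\nabla\mathcal{L}_{\text{normals}}\\+\;&\lambda_{\text{rec}}\nabla\mathcal{L}_{\text{rec}}+\lambda_{\text{mask}}\nabla\mathcal{L}_{\text{rec,mask}}\end{aligned}$$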

where the top line in the equation above corresponds to our prior objective and the bottom line corresponds to our reconstruction objective.

Experiments

Regarding hyperparameters, we use essentially the same set of hyperparameters for all experiments; there is no per-scene hyperparameter optimization. (There is one small exception to this rule: for a small number of images in which the camera was clearly at an angle higher than 15°, we used a camera angle of 30° or 40°.) For our diffusion model prior, we employ the open-source Stable Diffusion model trained on the LAION dataset of text-image pairs. For our InstantNGP model, we use a model with 16 resolution levels, a feature dimension of 2, and a maximum resolution of 2048, trained in a coarse-to-fine manner as explained above.

Regarding camera sampling, lighting, and shading, we keep nearly all parameters the same as . This includes the stochastic use of diffuse and textureless shading throughout the course of optimization, after an initial warmup period of albedo-only shading. Complete details regarding this and other aspects of our training setup are provided in the supplementary material.

Quantitative results

There are only a few methods that attempt to reconstruct arbitrary objects in 3D. The most recent and best-performing of these is Shelf-Supervised Mesh Prediction, against which we compare here. They provide 50 pretrained category-level models for 50 different categories in OpenImages. Since we aim to compute metrics using 3D or multi-view ground truth, we evaluate on seven categories in the CO3D dataset with corresponding OpenImages categories. For each of these seven categories, we select three images at random and run both RealFusion and Shelf-Supervised to obtain reconstructions.

We first test the quality of the recovered 3D shape in Fig. 5. Shelf-Supervised directly predicts a mesh. We extract one from our predicted radiance fields using marching cubes. CO3D comes with a sparse point-cloud reconstruction of the objects obtained using multi-view geometry. For evaluation, we sample points from the reconstructed meshes and align them optimally with the ground truth point cloud by first estimating a scaling factor and then using Iterative Closest Point (ICP). Finally, we compute the F-score with threshold $0.05$ to measure the distance between the predicted and ground truth point clouds. Results are shown in Tab. 1.
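For reference, the F-score between two already-aligned point clouds can be sketched as follows. This is a brute-force NumPy illustration that assumes the usual definition (harmonic mean of precision and recall under a distance threshold); it omits the scale estimation and ICP alignment steps, and the function names are hypothetical.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest neighbour in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def f_score(pred, gt, threshold=0.05):
    """Harmonic mean of precision and recall at the given distance threshold."""
    precision = float((nearest_distances(pred, gt) < threshold).mean())
    recall = float((nearest_distances(gt, pred) < threshold).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Precision measures how many predicted points lie close to the ground truth, and recall measures how much of the ground truth is covered by the prediction; the pairwise distance matrix makes this sketch suitable only for small point clouds.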

In order to evaluate the quality of the reproduced appearance, we also compare novel-view renderings from our method and theirs (Tab. 1). Ideally, these renderings should produce views that are visually close to the real views. In order to test this hypothesis, we check whether the generated views are close or not to the other views given in CO3D. We then report the CLIP embedding similarity of the generated images with respect to the closest CO3D view available (i.e. the view with maximum similarity).
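Given precomputed embeddings, the closest-view similarity reduces to a maximum over cosine similarities. The sketch below assumes that the embeddings have already been produced by a CLIP image encoder; the encoder itself is not shown and the function name is hypothetical.

```python
import numpy as np

def max_view_similarity(render_emb, view_embs):
    """Return (best cosine similarity, index of the closest reference view).

    render_emb: (D,) embedding of one generated view
    view_embs:  (V, D) embeddings of the available ground-truth views
    """
    r = render_emb / np.linalg.norm(render_emb)
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    sims = v @ r
    return float(sims.max()), int(sims.argmax())
```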

Qualitative results

Figure 4 shows additional qualitative results from multiple viewpoints. Having a single image of an object means that several 3D reconstructions are possible. Figure 6 explores the ability of RealFusion to sample the space of possible solutions by repeating the reconstruction several times, starting from the same input image. There is little variance in the reconstructions of the front of the object, but quite a large variance for its back, as expected.

Figure 11 shows two typical failure modes of RealFusion: in some cases the model fails to converge, and in others it copies the front view to the back of the object, even if this is not semantically correct.

Analysis and Ablations

One of the key components of RealFusion is our use of single-image textual inversion, which allows the model to correctly imagine novel views of a specific object. Figure 7 shows that this component indeed plays a critical role in the quality of the reconstructions. Without textual inversion, the model often reconstructs the backside of the object in the form of a generic instance from the object category. For example, the backside of the cat statue in the top row of Fig. 7 is essentially a different statue of a more generic-looking cat, whereas the model trained with textual inversion resembles the true object from all angles.

Other components of the model are also significant. Figure 9 shows that the normal smoothness regularizer of Eq. 5 results in smoother, more realistic meshes and reduces the number of artifacts. Figure 8 shows that coarse-to-fine optimization reduces the presence of low-level artifacts and results in smoother, visually pleasing surfaces. Fig. 10 shows that using Stable Diffusion works significantly better than relying on an alternative such as CLIP.

Conclusions

We have introduced RealFusion, a new approach to obtain full 360° photographic reconstructions of any object given a single image of it. Given an off-the-shelf diffusion model trained using only 2D images and no special supervision for 3D reconstruction, as well as a single view of the target object, we have shown how to select the model prompt to imagine other views of the object. We have used this conditional prior to learn an efficient, multi-scale radiance field representation of the reconstructed object, incorporating an additional regularizer to smooth out the reconstructed surface. The resulting method can generate plausible 3D reconstructions of objects captured in the wild which are faithful to the input image. Future work includes specializing the diffusion model for the task of new-view synthesis and incorporating dynamics to reconstruct animated 3D scenes.

We use the CO3D dataset in a manner compatible with their terms. CO3D does not contain personal information. Please see https://www.robots.ox.ac.uk/~vedaldi/research/union/ethics.html for further information on ethics.

Acknowledgments.

L. M. K. is supported by the Rhodes Trust. A. V., I. L. and C. R. are supported by ERC-UNION-CoG-101001212. C. R. is also supported by VisualAI EP/T028572/1.

References

Appendix A Implementation Details

In this section, we provide full implementation details which were omitted from the main text due to space constraints. Most of these details follow , but a few are slightly modified.

We consider three different types of shading: albedo, diffuse, and textureless. For albedo, we simply render the RGB color of each ray as given by our model:

For diffuse, we also compute the surface normal $n$ as the normalized negative gradient of the density with respect to $u$ . Then, given a point light $l$ with color $l_{\rho}$ and an ambient light with color $l_{a}$ , we render

For textureless, we use the same equation with $I_{\rho}(u)$ replaced by white $(1,1,1)$ .

For the reconstruction view, we only use albedo shading. For the random view (i.e. the view used for the prior objectives), we use albedo shading for the first 1000 steps of training by setting $l_{a}=1.0$ and $l_{\rho}=0.0$ . Afterwards we use $l_{a}=0.1$ and $l_{\rho}=0.9$ , and we select stochastically between albedo, diffuse, and textureless with probabilities $0.2$ , $0.4$ , and $0.4$ , respectively.
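The shading schedule can be sketched as follows. The Lambertian form used for the diffuse term (a clamped dot product between the normal and the direction to the light) is an assumption modeled on common practice, and the function names and scalar light intensities are hypothetical simplifications.

```python
import numpy as np

def shade(albedo, normal, point, light_pos, mode, l_rho=0.9, l_a=0.1):
    """Shade one surface point under the albedo, diffuse, or textureless mode."""
    if mode == "albedo":
        return albedo
    to_light = light_pos - point
    to_light = to_light / np.linalg.norm(to_light)
    lambert = max(0.0, float(normal @ to_light))  # assumed Lambertian term
    base = albedo if mode == "diffuse" else np.ones(3)  # textureless uses white
    return base * (l_rho * lambert + l_a)

def sample_mode(step, rng, warmup=1000):
    """Albedo-only during warmup, then a random choice with the stated probabilities."""
    if step < warmup:
        return "albedo"
    return str(rng.choice(["albedo", "diffuse", "textureless"], p=[0.2, 0.4, 0.4]))
```

Textureless shading removes the albedo entirely, so any detail that survives in those renders must come from the geometry rather than from painted-on texture.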

We obtain the surface normal using finite differences:

where ${\epsilon}_{x}=({\epsilon},0,0)$, ${\epsilon}_{y}=(0,{\epsilon},0)$, and ${\epsilon}_{z}=(0,0,{\epsilon})$.
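A minimal sketch of such a finite-difference normal is shown below. It assumes central differences along the three axis offsets and takes the normal to be the normalized negative density gradient; the step size and the function name are hypothetical.

```python
import numpy as np

def finite_difference_normal(density, x, eps=1e-2):
    """Normalized negative gradient of a scalar density at the 3D point x."""
    x = np.asarray(x, dtype=float)
    offsets = eps * np.eye(3)  # rows are eps_x, eps_y, eps_z
    grad = np.array([(density(x + e) - density(x - e)) / (2.0 * eps) for e in offsets])
    n = -grad  # density decreases when moving out of the object
    return n / (np.linalg.norm(n) + 1e-12)
```

For a blob of density centered at the origin, the resulting normal points radially outward, as expected for a surface normal.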

Density bias.

As in , we add a small Gaussian blob of density to the origin of the scene in order to assist with the early stages of optimization. This density takes the form

Camera.

The fixed camera for reconstruction is placed at a distance of $1.8$ from the origin, oriented toward the origin, at an elevation of $15^{\circ}$ above the horizontal plane. For a small number of scenes in which the object of interest is clearly seen from overhead, the reconstruction camera is placed at an elevation of $40^{\circ}$ .

The camera for the prior objectives is sampled randomly at each iteration. Its distance from the origin is sampled uniformly from $[1.0,1.5]$. Its azimuthal angle is sampled uniformly at random from the $360^{\circ}$ around the object. Its elevation is sampled uniformly in degree space from $-10^{\circ}$ to $90^{\circ}$ with probability $0.5$ and uniformly on the upper hemisphere with probability $0.5$. The field of view is uniformly sampled between $40$ and $70$. The camera is oriented toward the origin. Additionally, every tenth iteration, we place the prior camera near the reconstruction camera: its location is sampled from the reconstruction camera’s location perturbed by Gaussian noise with mean and variance $1$.
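The main sampling procedure can be sketched as below. The z-up axis convention, the treatment of the field of view as degrees, and the dictionary layout are assumptions; the every-tenth-iteration placement near the reconstruction camera is omitted.

```python
import numpy as np

def sample_prior_camera(rng):
    """Sample one random camera looking at the origin (illustrative sketch)."""
    distance = rng.uniform(1.0, 1.5)
    azimuth = rng.uniform(0.0, 360.0)
    if rng.random() < 0.5:
        # Uniform in degree space.
        elevation = rng.uniform(-10.0, 90.0)
    else:
        # Uniform over the area of the upper hemisphere: sin(elevation) is uniform.
        elevation = float(np.degrees(np.arcsin(rng.uniform(0.0, 1.0))))
    fov = rng.uniform(40.0, 70.0)
    az, el = np.radians(azimuth), np.radians(elevation)
    position = distance * np.array(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)]
    )
    return {
        "position": position,
        "distance": distance,
        "azimuth_deg": azimuth,
        "elevation_deg": elevation,
        "fov_deg": fov,
    }
```

Mixing the two elevation distributions places more samples near the top of the object than area-uniform sampling alone would.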

Lighting.

We sample the position of the point light by adding a noise vector $\eta\sim\mathcal{N}(0,1)$ to the position of the prior camera.

View-Dependent Prompt.

We add a view-dependent suffix to our text prompt based on the location of the prior camera relative to the reconstruction camera. If the prior camera is placed at an elevation above $60^{\circ}$, the text prompt receives the suffix “overhead view.” If it is at an elevation below $0^{\circ}$, the text prompt receives “bottom view.” Otherwise, for azimuthal angles of $\pm 30^{\circ}$, $\pm 30-90^{\circ}$, or $\pm 90-180^{\circ}$ in either direction of the reconstruction camera, it receives the suffixes “front view,” “side view,” or “back view,” respectively.
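These rules can be sketched as a small function. The label “back view” for the rear azimuth range is an assumption (“bottom view” is already used for negative elevations), as is the handling of angles that fall exactly on a boundary; the function name is hypothetical.

```python
def view_suffix(elevation_deg, azimuth_deg, ref_azimuth_deg=0.0):
    """Pick the prompt suffix from the camera pose relative to the reference view."""
    if elevation_deg > 60.0:
        return "overhead view"
    if elevation_deg < 0.0:
        return "bottom view"
    # Absolute azimuth offset from the reference camera, wrapped to [0, 180].
    rel = abs((azimuth_deg - ref_azimuth_deg + 180.0) % 360.0 - 180.0)
    if rel <= 30.0:
        return "front view"
    if rel <= 90.0:
        return "side view"
    return "back view"  # assumed label for the rear range
```

The resulting suffix would be appended to the prompt template, for example after “an image of a $\langle\textbf{e}\rangle$”.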

InstantNGP.

Our InstantNGP parameterizes the density and albedo inside a bounding box around the origin with side length $0.75$ . It is a multi-resolution feature grid with 16 levels. With coarse-to-fine training, only the first 8 (lowest-resolution) levels are used during the first half of training, while the others are masked with zeros. Each feature grid has dimensionality $2$ . The features from these grids are stacked and fed to a 3-layer MLP with $64$ hidden units.

Rendering and diffusion prior.

We render at resolution $96$ px. Since Stable Diffusion is designed for images with resolution $512$ px, we upsample renders to $512$ px before passing them to the Stable Diffusion latent space encoder (i.e. the VAE). We add noise in latent space, sampling $t\sim\mathcal{U}(0.02,0.98)$. We use classifier-free guidance strength $100$. We found that guidance strengths above $30$ produced good results, whereas strengths below $30$ led to many more geometric deformities. Although we do not backpropagate through the Stable Diffusion UNet for $\mathcal{L}_{\text{SDS}}$, we do backpropagate through the latent space encoder.

Optimization.

We optimize using the Adam optimizer with learning rate $10^{-3}$ for $5000$ iterations. The optimization process takes approximately $45$ minutes on a single V100 GPU.

Background model.

For our background model, we use a two-layer MLP which takes the viewing direction as input. This model is purposefully weak, such that the model cannot trivially optimize its objectives by using the background.

Additional regularizers.

We additionally employ two regularizers on our density field. The first is the orientation loss from Ref-NeRF, also used in DreamFusion, for which we use $\lambda_{\text{orient}}=0.01$. The second is an entropy loss which encourages points to be either fully transparent or fully opaque: $\mathcal{L}_{\text{entropy}}=-w\cdot\log_{2}(w)-(1-w)\cdot\log_{2}(1-w)$, where $w$ is the cumulative sum of density weights computed as part of the NeRF rendering equation (Equation 1).
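The entropy term is the binary entropy of the accumulated weight, which is largest at $w=0.5$ and vanishes at $w=0$ and $w=1$. A minimal NumPy sketch follows; the clipping constant and the averaging over rays are assumptions added for numerical stability and aggregation.

```python
import numpy as np

def entropy_loss(w, eps=1e-6):
    """Mean binary entropy (in bits) of accumulated ray weights in [0, 1]."""
    w = np.clip(np.asarray(w, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-w * np.log2(w) - (1.0 - w) * np.log2(1.0 - w)))
```

Minimizing this quantity pushes each ray towards being either fully transparent or fully opaque.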

Single-image textual inversion.

Our single-image textual inversion step, which is a variant of textual inversion, entails optimizing a token e introduced into the diffusion model text encoder to match an input image. The key to making this optimization successful given only a single image is the use of heavy image augmentations, shown in Fig. 12. We optimize using these augmentations for a total of 3000 steps using the Adam optimizer with image size $512$ px, batch size 16, learning rate $5\cdot 10^{-4}$, and weight decay $1\cdot 10^{-2}$.

The embedding e can be initialized either randomly, manually (by selecting a token from the vocabulary that matches the object), or using an automated method.

One automated method that we found to be successful was to use CLIP (which is also the text encoder of the Stable Diffusion model) to infer a starting token to initialize the inversion procedure. For this automated procedure, we begin by considering the set of all tokens in the CLIP text tokenizer which are nouns, according to the WordNet database. We use only nouns because we aim to reconstruct objects, not reproduce styles or visual properties. We then compute text embeddings for captions of the form “An image of a $\langle\texttt{token}\rangle$” using each of these tokens. Separately, we compute the image embedding for the input image. Finally, we take the token whose caption embedding is most similar to the image embedding as initialization for our textual inversion procedure.
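The selection step of this automated procedure can be sketched as below. The toy three-dimensional vectors stand in for CLIP text and image embeddings, which are not computed here; the use of cosine similarity as the similarity measure and all function and variable names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_initial_token(image_embedding, caption_embeddings):
    """Return the token whose caption embedding best matches the image.

    caption_embeddings maps each noun token to the embedding of the
    caption "An image of a <token>".
    """
    return max(
        caption_embeddings,
        key=lambda tok: cosine_similarity(image_embedding, caption_embeddings[tok]),
    )


# Placeholder embeddings (hypothetical values, not real CLIP outputs).
toy_captions = {
    "panda": [0.9, 0.1, 0.0],
    "teapot": [0.0, 1.0, 0.2],
    "chair": [0.1, 0.0, 1.0],
}
toy_image = [0.8, 0.2, 0.1]
best_token = pick_initial_token(toy_image, toy_captions)
```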

We use the manual initialization method for the examples in the main paper and we use the automated initialization method for the examples in the supplemental material (i.e. those included below).

Appendix B Method diagram

We provide a diagram illustrating our method in Fig. 2.

Appendix C Additional Qualitative Examples

In Fig. 13, we show additional examples of reconstructions from our model. We see that our method is often able to reconstruct plausible geometries and object backsides.

Appendix D Additional Comparisons

We provide additional comparisons to recent single-view reconstruction methods on the lego scene from the synthetic NeRF dataset. We compare on the special test set created by SinNeRF , which consists of 60 views very close to the reference view. We emphasize that our method is not tailored to this setting, whereas the other methods are designed specifically for it. For example, some other methods work by warping the input image, which only performs well for novel views close to the reference view.

Appendix E Text-to-Image-to-3D

In this section, we explore the idea of reconstructing a 3D object from a text prompt alone by first using the text prompt to generate an image, and then reconstructing this image using RealFusion.

We show examples of text-to-image-to-3D generation in Fig. 14.

Compared to the one-step text-to-3D procedure, this two-step procedure (i.e. text-to-image-to-3D) has the advantage that it may be easier for users to control. Under our setup, users can first sample a large number of images from a 2D diffusion model such as Stable Diffusion, select their desired image, and then lift it to 3D using RealFusion. It is possible that this setup could help address the issue of diversity of generation discussed in prior work. Additionally, in this setting, we find that it is usually not necessary to use single-image textual inversion, since the images sampled in the first stage are already extremely well-aligned with their respective prompts.

Appendix F Analysis of Failure Cases

In Fig. 15, we show additional examples of failure cases from our model. Below, we analyze what we find to be our three most common failure cases. The techniques we apply in RealFusion (single-image textual inversion, normals smoothing, and coarse-to-fine training) make these failure cases less frequent and less severe, but they still occur on various images.

One failure case of our method consists of the generation of a semi-transparent neural field which does not have a well-defined geometry. These fields tend to look like the input image when seen from the reference viewpoint, but do not resemble plausible objects when seen from other viewpoints. We note that this behavior is extremely common when using CLIP as a prior model, but it occurs occasionally even when using Stable Diffusion and $\mathcal{L}_{\text{SDS}}$ .

Floaters.

Another failure case involves “floaters,” or disconnected parts of the scene which appear close to the camera. These floaters sometimes appear in front of the reference view so as to make the corresponding render look like the input image. Without image-specific prompts, these floaters are a major issue, appearing in the majority of reconstructions. When using image-specific prompts, the issue of floaters is greatly (but not entirely) alleviated.

The Janus Problem.

Named after the two-faced Roman god Janus, the “Janus problem” refers to reconstructions which have two or more faces. This problem arises because the loss function tries to make the render of every view look like the input image, at least to a certain extent.

Our use of view-specific prompting partially alleviates this issue. For example, when we render an image of a panda from the back, we optimize using the text prompt “An image of a $\langle$ object $\rangle$ , back view”, where “ $\langle$ object $\rangle$ ” is our image-specific token corresponding to the image of a panda. However, even with view-specific prompting, this problem still occurs. This problem is visible with the panda in Fig. 14 (row 2). We note that this problem is not unique to our method; it can also be seen with (see Figure 9, last row).
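A simple way to implement view-specific prompting is sketched below. Only the prompt template and the “back view” suffix come from the example above; the azimuth thresholds, the “front view” and “side view” suffixes, and the function name are assumptions made for illustration.

```python
def view_prompt(token, azimuth_deg):
    """Build a view-specific prompt from the camera azimuth in degrees.

    Azimuth 0 is assumed to face the front of the object (the reference view).
    """
    a = azimuth_deg % 360.0
    if a < 45.0 or a >= 315.0:
        view = "front view"
    elif 135.0 <= a < 225.0:
        view = "back view"
    else:
        view = "side view"
    return f"An image of a {token}, {view}"


prompt_back = view_prompt("<object>", 180.0)
prompt_front = view_prompt("<object>", 0.0)
```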

Appendix G Unsuccessful Experiments and Regularization Losses

In the process of developing our method, we experimented with numerous ideas, losses, and regularization terms which were not included in our final method because they either did not improve reconstruction quality or did not improve it enough to justify their complexity. Here, we describe some of these ideas for the benefit of future researchers working on this problem.

One idea we tried involved using the diffusion model within our reconstruction objective as well as our prior objective. This involved a modified version of $\mathcal{L}_{\text{SDS}}$ in which we compared the noise predicted by the diffusion model for our noisy rendered image to the noise predicted by the diffusion model for a noisy version of our input image. We found that with this loss we were able to reconstruct the input image to a certain degree, but that we did not match the exact input image colors or textures.

Normals smoothing in 3D.

Our normals smoothing term operates in 2D, using normals rendered via the NeRF equation. We also tried different ways of smoothing normals in 3D. However, possibly due to our grid-based radiance field and/or our finite difference-based normals computation, we found that these regularization terms were all very noisy and harmful to reconstruction quality.

Using monocular depth.

We tried incorporating monocular depth predictions into the pipeline, using pre-trained monocular depth networks such as MiDaS . Specifically, we enforced that the depth rendered from the reference view matched the depth predicted by MiDaS for the input image. We found that this additional depth loss in most instances did not noticeably improve reconstruction quality and in some cases was harmful. Nonetheless, these results are not conclusive and future work could pursue other ways of integrating these components.

Using LPIPS and SSIM reconstruction losses.

We tried using LPIPS and SSIM losses in place of our L2 reconstruction loss. We found that LPIPS performed similarly to L2, but incurred additional computation and memory usage. We found that SSIM without either L2 or LPIPS resulted in worse reconstruction quality, but that it yielded fine results when combined with them. We did not include it in our final objective for the sake of simplicity.

Rendering at higher resolutions.

Since Stable Diffusion operates on images of resolution $512$ px, it is conceivable that rendering at higher resolution would be beneficial with regard to the prior loss. However, we found no noticeable difference in quality when rendering at higher resolutions than $96$ px or $128$ px. For computational purposes, we used resolution $96$ px for all experiments in the main paper.

Using DINO-based prior losses.

Similarly to the CLIP prior loss, one could imagine using other networks to encourage renders from novel views to be semantically similar to the input image. Due to the widespread success of the DINO models in unsupervised learning, we tried using DINO feature losses in addition to the Stable Diffusion prior loss. Specifically, for each image rendered from a novel view, we computed a DINO image embedding and maximized its cosine similarity with the DINO image embedding of the reference image. We found that this did not noticeably improve or degrade performance. For purposes of simplicity, we did not include it.

Appendix H Links to Images for Qualitative Results

For our qualitative results, we primarily use images from datasets such as Co3D. We also use a small number of images sourced directly from the web to show that our method works on uncurated web data. We provide links to all of these images on our project website.