Magic3D: High-Resolution Text-to-3D Content Creation
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin
Introduction
3D digital content has been in high demand for a variety of applications, including gaming, entertainment, architecture, and robotics simulation. It is slowly finding its way into virtually every possible domain: retail, online conferencing, virtual social presence, education, etc. However, creating professional 3D content is not for everyone: it requires immense artistic and aesthetic training along with 3D modeling expertise. Developing these skill sets takes a significant amount of time and effort. Augmenting 3D content creation with natural language could considerably help democratize 3D content creation for novices and turbocharge expert artists.
Image content creation from text prompts has seen significant progress with the advances of diffusion models for generative modeling of images. The key enablers are large-scale datasets comprising billions of samples (images with text) scraped from the Internet and massive amounts of compute. In contrast, 3D content generation has progressed at a much slower pace. Existing 3D object generation models are mostly categorical: a trained model can only be used to synthesize objects for a single class, with early signs of scaling to multiple classes shown recently by Zeng et al. Therefore, what a user can do with these models is extremely limited, and the models are not yet ready for artistic creation. This limitation is largely due to the lack of diverse large-scale 3D datasets: compared to image and video content, 3D content is much less accessible on the Internet. This naturally raises the question of whether 3D generation capability can be achieved by leveraging powerful text-to-image generative models.
Recently, DreamFusion demonstrated a remarkable ability for text-conditioned 3D content generation by utilizing a pre-trained text-to-image diffusion model as a strong image prior. The diffusion model acts as a critic to optimize the underlying 3D representation. The optimization process ensures that rendered images from a 3D model, represented by Neural Radiance Fields (NeRF), match the distribution of photorealistic images across different viewpoints, given the input text prompt. Since the supervision signal in DreamFusion operates on very low-resolution images (64 × 64), DreamFusion cannot synthesize high-frequency 3D geometric and texture details. Due to the use of inefficient MLP architectures for the NeRF representation, practical high-resolution synthesis may not even be possible, as the required memory footprint and computation budget grow quickly with the resolution. Even at a resolution of 64 × 64, optimization times are in hours (1.5 hours per prompt on average using TPUv4).
In this paper, we present a method that can synthesize highly detailed 3D models from text prompts within a reduced computation time. Specifically, we propose a coarse-to-fine optimization approach that uses multiple diffusion priors at different resolutions to optimize the 3D representation, enabling the generation of both view-consistent geometry and high-resolution details. In the first stage, we optimize a coarse neural field representation akin to DreamFusion, but with a memory- and compute-efficient scene representation based on a hash grid. In the second stage, we switch to optimizing mesh representations, a critical step that allows us to utilize diffusion priors at resolutions as high as 512 × 512. As 3D meshes are amenable to fast graphics renderers that can render high-resolution images in real time, we leverage an efficient differentiable rasterizer and make use of camera close-ups to recover high-frequency details in geometry and texture. As a result, our approach produces high-fidelity 3D content (see Fig. 1) that can conveniently be imported and visualized in standard graphics software, and does so at 2× the speed of DreamFusion. Furthermore, we showcase various creative controls over the 3D synthesis process by leveraging the advancements developed for text-to-image editing applications. Our approach, dubbed Magic3D, endows users with unprecedented control in crafting their desired 3D objects with text prompts and reference images, bringing this technology one step closer to democratizing 3D content creation.
In summary, our work makes the following contributions:
We propose Magic3D, a framework for high-quality 3D content synthesis using text prompts by improving several major design choices made in DreamFusion. It consists of a coarse-to-fine strategy that leverages both low- and high-resolution diffusion priors for learning the 3D representation of the target content. Magic3D, which synthesizes 3D content with 8× higher-resolution supervision, is also 2× faster than DreamFusion. 3D content synthesized by our approach is significantly preferred by users (61.7%).
We extend various image editing techniques developed for text-to-image models to 3D object editing and show their applications in the proposed framework.
Related Work
Text-to-image generation. We have witnessed significant progress in text-to-image generation with diffusion models in recent years. With improvements in modeling and data curation, diffusion models can compose complex semantic concepts from text descriptions (nouns, adjectives, artistic styles, etc.) to generate high-quality images of objects and scenes . Sampling images from diffusion models is time consuming. To generate high-resolution images, these models either utilize a cascade of super-resolution models or sample from a lower-resolution latent space and decode latents into high-resolution images . Despite the advances in high-resolution image generation, using language to describe and control 3D properties (e.g. camera viewpoints) while maintaining coherency in 3D remains an open, challenging problem.
3D generative models. There is a large body of work on 3D generative modeling, exploring different types of 3D representations such as 3D voxel grids, point clouds, meshes, implicit representations, or octree representations. Most of these approaches rely on training data in the form of 3D assets, which are hard to acquire at scale. Inspired by the success of neural volume rendering, recent works started investing in 3D-aware image synthesis, which has the advantage of learning 3D generative models directly from images, a more widely accessible resource. However, volume rendering networks are typically slow to query, leading to a trade-off between long training time and lack of multi-view consistency. EG3D partially mitigates this problem by utilizing a dual discriminator. While obtaining promising results, these works remain limited to modeling objects within a single object category, such as cars, chairs, or human faces, thus lacking scalability and the creative control desired for 3D content creation. In our paper, we focus on text-to-3D synthesis, aiming to generate a 3D renderable representation of a scene based on a text prompt.
Text-to-3D generation. With the recent success of text-to-image generative modeling, text-to-3D generation has also gained a surge of interest from the learning community. Earlier works such as CLIP-forge synthesize objects by learning a normalizing flow model to sample shape embeddings from textual input. However, this approach requires 3D assets in voxel representations during training, making it challenging to scale with data. DreamField and CLIP-mesh mitigate the training data issue by relying on a pre-trained image-text model to optimize the underlying 3D representations (NeRFs and meshes), such that all 2D renderings reach high text-image alignment scores. While these approaches avoid the requirement of expensive 3D training data and mostly rely on pre-trained large-scale image-text models, they tend to produce less realistic 2D renderings.
Recently, DreamFusion showcased impressive capability in text-to-3D synthesis by utilizing a powerful pre-trained text-to-image diffusion model as a strong image prior. We build upon this work and improve over several design choices to bring significantly higher-fidelity 3D models into the hands of users with a much reduced generation time.
Background: DreamFusion
DreamFusion achieves text-to-3D generation with two key components: a neural scene representation, which we refer to as the scene model, and a pre-trained text-to-image diffusion-based generative model. The scene model is a parametric function $x = g(\theta)$, which can produce an image $x$ at the desired camera pose. Here, $g$ is a volumetric renderer of choice, and $\theta$ is a coordinate-based MLP representing a 3D volume. The diffusion model $\phi$ comes with a learned denoising function $\epsilon_\phi(x_t; y, t)$ that predicts the sampled noise $\epsilon$ given the noisy image $x_t$, noise level $t$, and text embedding $y$. It provides the gradient direction to update $\theta$ such that all rendered images are pushed to the high probability density regions conditioned on the text embedding under the diffusion prior. Specifically, DreamFusion introduces Score Distillation Sampling (SDS), which computes the gradient:
$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, g(\theta)) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right] \qquad (1)$$
where $w(t)$ is a weighting function.
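The SDS update can be illustrated with a minimal, pure-Python sketch. Everything below is an assumption made for illustration and is not taken from DreamFusion or Magic3D: the "scene" is a flat list of pixel values, the renderer is the identity (so the Jacobian of the image with respect to the parameters is 1), `toy_denoiser` is a hypothetical stand-in for the diffusion model, the forward noising uses the common `sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps` parameterization, and the weighting `1 - alpha_bar` is just one possible choice.

```python
import math
import random


def toy_denoiser(x_t, alpha_bar, target=0.5):
    """Hypothetical 'prior' that believes every clean pixel equals `target`.

    It returns the noise that would explain x_t under that belief.
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * target) / s for x in x_t]


def sds_gradient(theta, denoiser, alpha_bar, rng):
    """One-sample SDS gradient estimate for an identity renderer x = theta."""
    eps = [rng.gauss(0.0, 1.0) for _ in theta]           # sampled noise
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + s * e for x, e in zip(theta, eps)]    # noisy rendering
    eps_pred = denoiser(x_t, alpha_bar)                  # predicted noise
    w = 1.0 - alpha_bar                                  # assumed weighting
    # Identity renderer: the Jacobian factor is 1, so only the residual remains.
    return [w * (p - e) for p, e in zip(eps_pred, eps)]


rng = random.Random(0)
theta = [0.0, 1.0]
for _ in range(200):
    grad = sds_gradient(theta, toy_denoiser, alpha_bar=0.5, rng=rng)
    theta = [x - 0.5 * g for x, g in zip(theta, grad)]   # plain gradient descent
```

In this toy setting the residual between predicted and sampled noise points away from the value the prior prefers, so repeated descent steps pull both pixels toward `target`. A real scene model would additionally backpropagate the residual through the renderer.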
DreamFusion adopts a variant of Mip-NeRF 360 with an explicit shading model for the scene model and Imagen as the diffusion model. These choices result in two key limitations. First, high-resolution geometry or textures cannot be obtained since the diffusion model only operates on 64 × 64 images. Second, the use of a large global MLP for volume rendering is both computationally expensive and memory intensive, making this approach scale poorly with the increasing resolution of images.
High-Resolution 3D Generation
Magic3D is a two-stage coarse-to-fine framework that uses efficient scene models that enable high-resolution text-to-3D synthesis (Fig. 2). We describe our method and key differences from DreamFusion in this section.
Magic3D uses two different diffusion priors in a coarse-to-fine fashion to generate high-resolution geometry and textures. In the first stage, we use the base diffusion model described in eDiff-I, which is similar to the base diffusion model of Imagen used in DreamFusion. This diffusion prior is used to compute gradients of the scene model via a loss defined on rendered images at a low resolution of 64 × 64. In the second stage, we use the latent diffusion model (LDM), which allows backpropagating gradients into rendered images at a high resolution of 512 × 512; in practice, we choose to use the publicly available Stable Diffusion model. Despite generating high-resolution images, the computation of LDM is manageable because the diffusion prior acts on the latent $z_t$ with resolution 64 × 64, where $z$ is obtained by encoding the rendered image $x = g(\theta)$:
$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, g(\theta)) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi(z_t; y, t) - \epsilon\big)\,\frac{\partial z}{\partial x}\,\frac{\partial x}{\partial \theta} \right] \qquad (2)$$
The increase in computation time mainly comes from computing $\partial x / \partial \theta$ (the gradient of the high-resolution rendered image) and $\partial z / \partial x$ (the gradient of the encoder in LDM).
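The role of the extra encoder gradient can be sketched with a toy example. The code below is an illustration under stated assumptions, not the LDM encoder: the hypothetical `encode` function is a linear average-pooling map over a 1D signal, so its Jacobian is constant and the vector-Jacobian product can be written by hand. The renderer is again taken to be the identity.

```python
def encode(x, factor=2):
    """Toy linear 'encoder': average-pool consecutive groups of `factor` pixels."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def encoder_vjp(grad_z, factor=2):
    """Pull a latent-space gradient back to pixel space for the toy encoder.

    Each latent is the mean of `factor` pixels, so every pixel in a group
    receives grad_z / factor.
    """
    return [g / factor for g in grad_z for _ in range(factor)]


def latent_sds_gradient(eps_pred, eps, w, factor=2):
    """Latent-space noise residual, scaled by w and mapped back to pixels."""
    grad_z = [w * (p - e) for p, e in zip(eps_pred, eps)]
    return encoder_vjp(grad_z, factor)


# Two latents drive four pixels: the residual on the first latent is shared
# equally by the two pixels that were pooled into it.
pixel_grad = latent_sds_gradient([1.0, 0.0], [0.0, 0.0], w=2.0)
```

The residual itself lives in the small latent space, which is why the prior stays cheap; the cost grows only where the gradient is pulled back through the encoder and the high-resolution renderer.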
Scene Models
We tailor two different 3D scene representations to the two diffusion priors at coarse and fine resolutions, so that the fine stage can accommodate the higher-resolution rendered images required as input by the high-resolution prior. We discuss them as follows.
Neural fields as coarse scene models. The initial coarse stage of the optimization requires finding the geometry and textures from scratch. This can be challenging as we need to accommodate complex topological changes in the 3D geometry and depth ambiguities from the 2D supervision signals. In DreamFusion , the scene model is a neural field (a coordinate-based MLP) based on Mip-NeRF 360 that predicts albedo and density. This is a suitable choice as neural fields can handle topological changes in a smooth, continuous fashion. However, Mip-NeRF 360 is computationally expensive as it is based on a large global coordinate-based MLP. As volume rendering requires dense samples along a ray to accurately render high-frequency geometry and shading, the cost of having to evaluate a large neural network at every sample point quickly stacks up.
For this reason, we opt to use the hash grid encoding from Instant NGP , which allows us to represent high-frequency details at a much lower computational cost. We use the hash grid with two single-layer neural networks, one predicting albedo and density and the other predicting normals. We additionally maintain a spatial data structure that encodes scene occupancy and utilizes empty space skipping . Specifically, we use the density-based voxel pruning approach from Instant NGP with an octree-based ray sampling and rendering algorithm . With these design choices, we drastically accelerate the optimization of coarse scene models while maintaining quality.
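As a rough illustration of why a hash grid lookup is cheap, the sketch below computes the hash-table slots of the eight voxel corners around a query point, using the spatial hash described in the Instant NGP paper (XOR of coordinates multiplied by fixed primes, modulo the table size). The function names, the query point, and the chosen resolution and table size are assumptions for illustration; feature interpolation and the MLP heads are omitted.

```python
# Per-dimension primes from the Instant NGP spatial hash.
PRIMES = (1, 2654435761, 805459861)


def hash_index(ix, iy, iz, table_size):
    """Map integer voxel coordinates to a slot in a table of `table_size` entries.

    Python integers do not overflow; for power-of-two table sizes up to 2**32
    the result matches the 32-bit arithmetic used on the GPU.
    """
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size


def corner_indices(point, resolution, table_size):
    """Hash-table slots of the 8 voxel corners surrounding `point` in [0, 1)^3."""
    base = [int(c * resolution) for c in point]
    return [
        hash_index(base[0] + dx, base[1] + dy, base[2] + dz, table_size)
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    ]


# Illustrative values only: one level with 64 cells per axis and 2**19 slots.
slots = corner_indices((0.3, 0.6, 0.9), resolution=64, table_size=2 ** 19)
```

Each sample therefore costs a handful of table reads per level plus a small network evaluation, instead of a full pass through a large global MLP.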
Textured meshes as fine scene models. In our fine stage of optimization, we need to be able to accommodate very high-resolution rendered images to fine-tune our scene model with high-resolution diffusion priors. Using the same scene representation (the neural field) from the initial coarse stage of optimization could be a natural choice since the weights of the model can directly carry over. Although this strategy can work to some extent (Figs. 4 and 5), it struggles to render very high-resolution (e.g., 512 × 512) images within reasonable memory constraints and computation budgets.
To resolve this issue, we use textured 3D meshes as the scene representation for the fine stage of optimization. In contrast to volume rendering for neural fields, rendering textured meshes with differentiable rasterization can be performed efficiently at very high resolutions, making meshes a suitable choice for our high-resolution optimization stage. Using the neural field from the coarse stage as the initialization for the mesh geometry, we can also sidestep the difficulty of learning large topological changes in meshes.
Coarse-to-fine Optimization
We describe our coarse-to-fine optimization procedure, which first operates on a coarse neural field representation and subsequently a high-resolution textured mesh.
Neural field optimization. Similarly to Instant NGP , we initialize an occupancy grid of resolution with values to 20 to encourage shapes to grow in the early stages of optimization. We update the grid every 10 iterations and generate an octree for empty space skipping. We decay the occupancy grid by 0.6 in every update and follow Instant NGP with the same update and thresholding parameters.
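The decay-and-update rule for the occupancy grid can be sketched as follows. This is a simplified illustration that assumes the Instant NGP style of update (decay the stored value, then keep the maximum of the decayed value and a freshly sampled density); the pruning threshold of 0.01 is a hypothetical value chosen for the example, and the grid is a flat list rather than a 3D array with an octree.

```python
def update_occupancy(grid, sampled_density, decay=0.6):
    """Decay stored values, then keep the max of the decayed value and the new sample."""
    return [max(old * decay, new) for old, new in zip(grid, sampled_density)]


def occupied_mask(grid, threshold):
    """Cells above the threshold are kept; the rest are skipped during ray marching."""
    return [value > threshold for value in grid]


# A cell initialised to 20 whose density stays at zero is only pruned after
# several updates, which gives shapes time to grow early in optimization.
grid = [20.0]
updates_until_pruned = 0
while occupied_mask(grid, threshold=0.01)[0]:
    grid = update_occupancy(grid, [0.0])
    updates_until_pruned += 1
```

Under these assumed values an empty cell survives 15 updates, that is 150 iterations at one update every 10 iterations, before it is skipped.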
Instead of estimating normals from density differences, we use an MLP to predict the normals. Note that this does not violate geometric properties since volume rendering is used instead of surface rendering; as such, the orientation of particles at continuous positions need not be oriented to the level set surface. This helps us significantly reduce the computational cost of optimizing the coarse model by avoiding the use of finite differencing. Accurate normals can be obtained in the fine stage of optimization when we use a true surface rendering model.
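To make the avoided cost concrete, the sketch below estimates a normal by central finite differences of a density function. It is an illustration only: the `blob` density is a hypothetical analytic stand-in for the neural field, and the step size is arbitrary. The point is that each normal needs six additional density queries per sample, which a predicted normal avoids.

```python
import math


def finite_difference_normal(density, p, h=1e-3):
    """Normal as the negated, normalised density gradient (central differences)."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((density(hi) - density(lo)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # guard against a zero gradient
    return [-g / norm for g in grad]


query_count = []


def blob(p):
    """Hypothetical density: a Gaussian blob centred at the origin."""
    query_count.append(1)
    return math.exp(-sum(c * c for c in p))


normal = finite_difference_normal(blob, (0.5, 0.0, 0.0))
```

For a point on the positive x-axis the estimated normal points along +x, away from the dense region, and `len(query_count)` shows the six evaluations spent on a single sample.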
Similar to DreamFusion, we also model the background using an environment map MLP, which predicts RGB colors as a function of ray directions. Since our sparse representation model does not support scene reparametrization as in Mip-NeRF 360 , the optimization has a tendency to “cheat” by learning the essence of the object using the background environment map. As such, we use a tiny MLP for the environment map (hidden dimension size of 16) and weigh down the learning rate by to allow the model to focus more on the neural field geometry.
Mesh optimization. To optimize a mesh from the neural field initialization, we convert the (coarse) density field to an SDF by subtracting a non-zero constant from it, yielding the initial per-vertex SDF values $s_i$. We additionally initialize the volume texture field directly with the color field optimized from the coarse stage.
During optimization, we render the extracted surface mesh into high-resolution images using a differentiable rasterizer. We optimize both the SDF value $s_i$ and the deformation $\Delta v_i$ of each vertex $v_i$ via backpropagation using the high-resolution SDS gradient (Eq. 2). When rendering the mesh to an image, we also track the 3D coordinates of each corresponding pixel projection, which are used to query colors in the corresponding texture field for joint optimization.
When rendering the mesh, we increase the focal length to zoom in on object details, which is a critical step towards recovering high-frequency details. We keep the same pre-trained environment map from the coarse stage of optimization and composite the rendered background with the rendered foreground object using differentiable antialiasing . To encourage the smoothness of the surface, we further regularize the angular differences between adjacent faces on the mesh. This allows us to obtain well-behaved geometry even under supervision signals with high variance, such as the SDS gradient .
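One common way to penalise angular differences between adjacent faces is sketched below. The exact form of the regulariser is not specified in the text, so the `1 - cos(angle)` penalty averaged over face pairs that share an edge is an assumption, and the quadratic pairing loop is only suitable for tiny illustrative meshes.

```python
import math
from itertools import combinations


def face_normal(vertices, face):
    """Unit normal of a triangle given by three vertex indices."""
    a, b, c = (vertices[i] for i in face)
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = math.sqrt(sum(x * x for x in n))
    return [x / length for x in n]


def smoothness_loss(vertices, faces):
    """Mean of (1 - cos angle) over pairs of faces that share an edge."""
    normals = [face_normal(vertices, f) for f in faces]
    total, pairs = 0.0, 0
    for i, j in combinations(range(len(faces)), 2):
        if len(set(faces[i]) & set(faces[j])) == 2:  # two shared vertices = shared edge
            cos = sum(a * b for a, b in zip(normals[i], normals[j]))
            total += 1.0 - cos
            pairs += 1
    return total / pairs if pairs else 0.0


faces = [(0, 1, 2), (0, 2, 3)]
flat = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
folded = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 1)]
flat_loss = smoothness_loss(flat, faces)
folded_loss = smoothness_loss(folded, faces)
```

The flat quad incurs no penalty while the folded one does, so adding such a term to the objective discourages creases that a noisy gradient signal might otherwise introduce.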
Experiments
We focus on comparing our method with DreamFusion on 397 text prompts taken from the DreamFusion website (https://dreamfusion3d.github.io/gallery.html). We train Magic3D on all of the text prompts and compare our results with those provided on the website.
Speed evaluation. Unless otherwise noted, the coarse stage is trained for 5000 iterations with 1024 samples along the ray (subsequently filtered by the sparse octree) with a batch size of 32, with a total runtime of around 15 minutes (upwards of 8 iterations / second, variable due to differences in sparsity). The fine stage is trained for 3000 iterations with a batch size of 32 with a total runtime of 25 minutes (2 iterations / second). Both runtimes combined are 40 minutes. All runtimes were measured on 8 NVIDIA A100 GPUs.
Qualitative comparisons. We provide qualitative examples in Fig. 3. Qualitatively, our models achieve much higher 3D quality in terms of both geometry and texture. Notice that our model can generate candies on ice cream cones, highly detailed sushi-like cars, vivid strawberries, and birds. We also note that our resulting 3D models can be directly imported and visualized in standard graphics software.
User studies. We conduct user studies to evaluate different methods based on user preferences on Amazon MTurk. We show users two videos side by side rendered from a canonical view by two different algorithms using the same text prompt. We ask the users to select the one that is more realistic and detailed. Each prompt is evaluated by 3 different users, resulting in 1191 pairwise comparisons. As shown in Table 1, users favor 3D models generated by Magic3D, with 61.7% of the users judging our results to be of higher quality.
Can single-stage optimization work with LDM prior? We ablate scene models optimized with the high-resolution LDM prior in a single-stage optimization setup. We find that 3D meshes as the scene model fail to generate high-quality results if optimized from scratch. This leaves our memory-efficient sparse 3D representation as the ideal candidate for the scene model. However, rendering 512 × 512 images is still too memory intensive to fit into modern GPUs. Therefore, we render lower-resolution images from the scene model and upsample them to 512 × 512 as input to the LDM. We find that this generates objects with worse shapes. Fig. 4 shows two examples with scene rendering resolutions of 64 × 64 and 256 × 256, respectively (top row). While the single-stage model generates furry details, its shape is worse than that of the coarse model.
Can we use NeRF for the fine model? Yes. While optimizing a NeRF from scratch does not work well, we can follow the coarse-to-fine framework but replace the second-stage scene model with a NeRF. In the bottom right of Fig. 4, we show the result of a fine NeRF model initialized with the coarse model on its left and fine-tuned with 256 × 256 rendered images. The two-stage approach retains good geometry in the initial model and adds more details, showing superior quality to its one-stage counterpart.
Coarse models vs. fine models. Fig. 5 provides more visual results contrasting coarse and fine models. We try both NeRF and mesh for scene models and fine-tune them from the same coarse model above. We see significant quality improvements on both NeRF and mesh models, suggesting our coarse-to-fine approach works for general scene models.
Controllable 3D Generation
As certain styles and concepts are difficult to express in words but easy with images, it is desirable to have a mechanism to influence the text-to-3D model generation with images. We explore different image conditioning techniques as well as a prompt-based editing approach to provide users more control over the 3D generation outputs.
Personalized text-to-3D. DreamBooth described a method to personalize text-to-image diffusion models by fine-tuning a pre-trained model on several images of a subject. The fine-tuned model can learn to tie the subject to a unique identifier string, denoted as [V], and generate images of the subject when [V] is included in the text prompt. In the context of text-to-3D generation, we would like to generate a 3D model of a subject. This can be achieved by first fine-tuning our diffusion prior models with the DreamBooth approach, and then using the fine-tuned diffusion priors with the [V] identifier as part of the conditioning text prompt to provide the learning signal when optimizing the 3D model.
To demonstrate the applicability of DreamBooth in our framework, we collect 11 images of one cat and 4 images of one dog. We fine-tune eDiff-I and LDM , binding the text identifier [V] to the given subject. Then, we optimize the 3D model with [V] in the text prompts. We use a batch size of 1 for all fine-tuning. For eDiff-I, we use the Adam optimizer with learning rate for 1,500 iterations; for LDM, we fine-tune with learning rate for 800 iterations. Fig. 6 shows our personalized text-to-3D results: we are able to successfully modify the 3D models preserving the subjects in the given input images.
Prompt-based editing through fine-tuning. Another way to control the generated 3D content is by fine-tuning a learned coarse model with a new prompt. Our prompt-based editing includes three stages. (a) We train a coarse model with a base prompt. (b) We modify the base prompt and fine-tune the coarse model with the LDM. This stage provides a well-initialized NeRF model for the next step. Directly applying mesh optimization on a new prompt would generate highly detailed textures but could deform geometry only slightly. (c) We optimize the mesh with the modified text prompt. Our prompt-based editing can modify the texture of the shape or transform the geometry and texture according to the text. The resulting scene models preserve the layout and overall structure. Such an editing capability makes 3D content creation with Magic3D more controllable. In Fig. 7, we show two coarse NeRF models trained with the base prompt for the “bunny” and “squirrel”. We modify the base prompt, fine-tune the NeRF model in high resolution, and optimize the mesh. Results show that we can tune the scene model according to the prompt, e.g., changing the “baby bunny” to “stained glass bunny” or “metal bunny” results in similar geometry but with a different texture.
Conclusion
We propose Magic3D, a fast and high-quality text-to-3D generation framework. We benefit from both efficient scene models and high-resolution diffusion priors in a coarse-to-fine approach. In particular, the 3D mesh models scale nicely with image resolution and enjoy the benefits of higher-resolution supervision brought by the latent diffusion model without sacrificing speed. It takes 40 minutes to go from a text prompt to a high-quality 3D mesh model ready to be used in graphics engines. With extensive user studies and qualitative comparisons, we show that Magic3D is preferred by raters (61.7%) over DreamFusion, while enjoying a 2× speed-up. Lastly, we propose a set of tools for better controlling style and content in 3D generation. We hope that with Magic3D, we can democratize 3D synthesis and open up everyone’s creativity in 3D content creation.
Acknowledgements. We would like to thank Frank Shen, Yogesh Balaji, Seungjun Nah, James Lucas, David Luebke, Clement Fuji-Tsang, Charles Loop, Qinsheng Zhang, Zan Gojcic, and Jonathan Tremblay for helpful discussions and paper proofreading. We would also like to thank Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall for providing additional implementation details in DreamFusion.
Appendix
Appendix A Author Contributions
All authors made significant contributions to ideas, explorations, and paper writing. Specifically, CHL and TYL led the research, developed fundamental code for experiments, and organized team efforts. JG led the experiments on generating high-resolution mesh models. LT led the experiments on using the high-resolution diffusion prior. TT led the experiments on sparse scene representations. XZ and KK led the experiments in controllable generation. XH conducted the user study. SF and MYL advised the research direction and designed the scope of the project.
Appendix B Implementation Details
We follow the implementation details described by Poole et al. as closely as possible. We refer readers to the DreamFusion paper for context and list the major differences below.
Architectural details. As mentioned in the main paper, we adopt a multi-resolution hash grid encoding architecture from Instant NGP instead of using a large global coordinate-based MLP architecture. We use 16 levels of hash dictionaries of size and dimension , spanning 3D grid resolutions from to with an exponential growth rate. We use single-layer MLPs with hidden units to predict all of RGB color, volume density, and normal, where the inputs to the MLPs are the concatenated feature vectors from the multi-resolution hash encoding sampled with trilinear interpolation (we refer readers to the Instant NGP paper for more details on this representation). We perform density-based pruning to sparsify the Instant NGP representation with an octree structure every 10 iterations. This allows us to more efficiently render pixels using empty space skipping, even with 3D points as dense as 1024 samples per ray. We do not use the contracting reparametrization of unbounded scenes from Mip-NeRF 360 as it is not supported by our sparse representation.
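The exponential growth of per-level grid resolutions can be sketched with the geometric progression from the Instant NGP paper, where the growth factor is chosen so that the first level has the coarsest resolution and the last level the finest. The coarsest and finest resolutions used below (16 and 4096) are illustrative assumptions, not values taken from the text above; only the number of levels (16) comes from it.

```python
import math


def level_resolutions(n_min, n_max, num_levels):
    """Per-level grid resolutions growing geometrically from n_min towards n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * growth ** level) for level in range(num_levels)]


# Illustrative end points only; floor() may leave the last level one short of
# n_max because of floating-point rounding.
resolutions = level_resolutions(n_min=16, n_max=4096, num_levels=16)
```

Coarse levels capture the overall shape with few cells, while fine levels rely on the hash table to store detail without allocating a dense grid.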
Camera and light augmentations. We follow Poole et al. to add random augmentations to the camera and light sampling for rendering the shaded images. Differently, (a) we sample the point light location such that the angular distance from the random camera center location (w.r.t. the origin) is sampled from with a random point light distance , and (b) we use a “soft” version of the textureless and albedo-only augmentation such that various strengths of shading in the rendered images are seen during optimization. (c) we sample the camera distance from , and the focal length . When training with high resolution diffusion prior, we increase the focal length and sample from .
Optimization. Unless otherwise specified, we optimize the coarse model with batch size using the Adam optimizer with a learning rate of without warmup and decay. Note that the large global coordinate-based MLP architecture in DreamFusion limits its optimization to only an effective batch size of . For the coarse model, we add the opacity regularization as suggested by Poole et al. to encourage sparsity in the volume density field, but we do not add the orientation regularization as we empirically found it to hurt optimization.
Score Distillation Sampling. In the first stage, we sample the timestep and set . In the second stage, we find the range of timestep in SDS affects the quality. We sample in our experiments. In general, setting in the range of works well. We set in this stage.
Appendix C Alternative High-Resolution Prior
In addition to LDM, we also consider using Super Resolution (SR) diffusion prior for increasing the resolution of a coarse model. This diffusion model is trained to generate a high-resolution image conditioning on a low-resolution input image. In SDS, this model predicts noises added in high resolution, i.e., , where denotes a low-resolution image. We render with a frozen coarse model to optimize the second-stage fine model. Fig. 12 shows this approach fails to add high-quality details to the input coarse model.
Appendix D Style-Guided Text-to-3D Synthesis
where and are text and image conditioning respectively, and and are the guidance weights for text and joint text-and-image conditioning respectively. Note that for , the scheme is equivalent to standard classifier-free guidance with respect to text conditioning only.
Fig. 8 shows our style-guided text-to-3D generation results. When optimizing the 3D model, we feed the reference image to the eDiff-I model. We set or and apply the image guidance when only. We do not provide high-resolution results for this experiment because LDM does not support reference image conditioning.
Guidance weight and noise level threshold. We ablate different combinations of guidance weights and noise level thresholds in Figs. 9 and 10, respectively. The guidance weights and balance the guidance strength during optimization (see Eq. D). A similar guidance formulation has also been used by Liu et al. for compositional text-to-image generation . We also find that applying the image conditioning only below a certain noise level threshold can help control style transfer. The intuition is that image-based style guidance is most relevant for optimizing the generated 3D object’s details, which are modeled at lower noise levels. Notice that we do not provide high-resolution results for this experiment because LDM does not support image conditioning inputs.
Content image as reference. We also explore using multiple images as inputs during 3D synthesis to transfer the content in the images to the 3D model, as shown in Fig. 11: Given a text prompt, we first ask the eDiff-I model to generate the front view, side view and back view images. When optimizing the 3D model for the same text prompt from different views, we then feed the corresponding generated view image as input to guide the 3D synthesis. This approach requires some degree of consistency with respect to subject identity across the different view images, which can be achieved by generating a set of different view images first and choosing accordingly. Overall, the experiment shows that we can apply the text-to-image diffusion model to generate images that can be used for guidance during 3D model optimization. As we see, this does not only provide enhanced control by preserving the identity of the subjects in the images, but also improves output quality and 3D consistency. Generally, depending on image type, image conditioning can be used either for object-centric content transfer to 3D (Fig. 11) or for abstract 3D stylization (Figs. 8, 9, and 10).
Appendix E Additional Results
We provide more qualitative comparisons with DreamFusion in Figs. 14, 15, 16, 17, and 18. Magic3D achieves much higher quality in terms of 3D geometry and texture.
We also show more results on prompt-based editing in Fig. 13. Magic3D enables high-quality editing of the 3D content through simple text prompt modification.