NeRF-Art: Text-Driven Neural Radiance Fields Stylization

Can Wang, Ruixiang Jiang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao

Introduction

Artistic works depict the world in various creative and imaginative styles, evolving along with human progress. While primarily driven by professionals, the generation of artistic content is now more accessible to average users than ever before, empowered by the recent research on visual artistic stylization. In the era of deep learning, technical advances are gradually reshaping how people create, consume, and share art, from real-time entertainment to concept design. Since neural style transfer (Gatys et al., 2016; Chen et al., 2017b; Shu et al., 2021; Zhao et al., 2014; Sheng et al., 2018) shows the potential of encoding and changing visual styles via deep neural networks, a significant amount of effort has been devoted to effectively and efficiently migrating the style of an arbitrary image (Gatys et al., 2016; Huang and Belongie, 2017; Li et al., 2017; Liao et al., 2017) or a specific domain (Zhu et al., 2017; Lee et al., 2020) to the content image. Despite impressive results, these methods are limited to stylizing a single view captured by the content image.

Motivated by the increasing demand for 3D asset creation, our goal is to stylize 3D content from multi-view input, in contrast to single-image stylization. In the domain of 3D representation, previous methods typically take explicit models (e.g., meshes (Kato et al., 2018; Höllein et al., 2021; Han et al., 2021a; Ye et al., 2021; Zhang et al., 2020a), voxels (Guo et al., 2021; Klehm et al., 2014), and point clouds (Cao et al., 2020; Lin et al., 2018)) followed by differentiable rendering for multi-view stylization. These methods enable intuitive control over the geometry but suffer from the limited capacity for modeling and rendering complex scenes. The recent implicit representation of neural radiance field (NeRF) (Mildenhall et al., 2020; Deng et al., 2022; Yang et al., 2022; Zhang et al., 2022b; Wang et al., 2022) significantly improves the quality of novel view synthesis, satisfying our needs for a general representation of various scenes and objects. However, while enjoying the superior scene reconstruction quality of NeRF, the curse of its highly implicit volumetric representation of appearance and geometry, parameterized and entangled by dense MLP networks, makes NeRF more challenging to stylize through jointly transforming the encoded color and shape.

Very recently, pioneering NeRF stylization works (Chiang et al., 2022; Fan et al., 2022) have made exhilarating progress on appearance style transfer of 3D scenes. However, their style guidance is limited to image reference, which, although being adopted as one common way to specify the target style, is not always a perfect solution for every scenario—obtaining appropriate style images that both reflect the target style and match the source content might not be easy or even possible in many cases. Therefore, finding another simple, natural, and expressive form of guidance becomes an attractive idea. Thanks to the parallel advances in language-vision models, stylization with natural language is no longer a fantasy. As demonstrated by recent text-guided stylization works (Gal et al., 2021; Wei et al., 2021; Michel et al., 2021; Hong et al., 2022), compared to image-guided approaches, short text prompts provide 1) an extremely intuitive and user-friendly way to specify styles, 2) a flexible control over various styles from abstract ones like a certain concept to very concrete ones like a famous painting or character, and 3) a view-independent representation that is free from content alignment and naturally benefits cross-view consistency.

Yet, with the existing approaches, it is still challenging to stylize the implicit representation of NeRF via a simple text prompt. Learning a latent space helps constrain the geometry and texture modulations (Wang et al., 2021a), but it is often data-dependent and laborious. Some efforts directly enforce style directions (Figure 3) between the rendered views of NeRF and the text in the CLIP (Radford et al., 2021) embedding space. In addition, background augmentation (Jain et al., 2022) and mesh guidance (Hong et al., 2022) have been proposed to improve the geometry and texture modulations. However, they still suffer from insufficient geometry deformations and texture details.

In this work, we propose NeRF-Art, a new text-driven NeRF stylization method. Given a pre-trained NeRF model and a single text prompt, our method enables consistent novel view synthesis with both appearance and geometry transformed, adhering to the specified style. This is achieved by combining the recent large-scale Language-Vision model (i.e., CLIP) with NeRF, which is non-trivial due to several challenges. Directly applying the supervision from CLIP to NeRF by constraining the similarity between the rendered views and the text in the embedding space, as in Gal et al. (2021), is insufficient to ensure the desired style strength. To tackle this problem, we design a CLIP-based contrastive loss to properly strengthen the stylization, by bringing the results closer to the target style and farther away from other styles pre-defined as negative samples. To further ensure the uniformity of the style over the whole scene, we extend our contrastive constraint to a hybrid global-local framework to cover both global structures and local details. In addition, to support geometry stylization jointly with appearance, we relax the constraints on the density of the pre-trained NeRF and adopt a weight regularization to effectively reduce cloudy artifacts and geometric noise when altering the density field. In experiments, we first evaluate text description selection for stylization, and then test our method on various styles to demonstrate the effectiveness and flexibility of text guidance for NeRF stylization. Furthermore, we conduct a user study to show that our method achieves the most visually pleasing results compared to related methods. We also extract the mesh from the stylized NeRF to show the geometry modulation ability of our method, and integrate it with different baselines to demonstrate its generalization ability to various NeRF-like models.

Related Work

Neural Style Transfer on Images and Videos. Artistic image stylization is a long-standing research area. Traditional methods use handcrafted features to simulate styles (Hertzmann, 1998; Hertzmann et al., 2001). With the fast development of deep learning, neural networks have been applied to style transfer from either an arbitrary image (Gatys et al., 2016; Johnson et al., 2016; Huang and Belongie, 2017; Liao et al., 2017; Li et al., 2017; Kolkin et al., 2019) or a specific domain (Zhu et al., 2017; Huang et al., 2018, 2021a; Lee et al., 2020), and achieved impressive results. By enforcing temporal smoothness constraints defined on optical flows, neural style transfer has been successfully extended to videos (Ruder et al., 2016; Chen et al., 2017a, 2020). However, both image and video stylization methods are restricted to the given views. Simply combining the neural style transfer and novel view synthesis methods without considering 3D geometry will lead to blurriness or view inconsistencies.

Neural Stylization on Explicit 3D Representations. With the increasing demand for 3D content, neural style transfer has been extended to explicit 3D representations. The work (Chen et al., 2018) first considers the cross-view disparity consistency and applies style transfer on stereoscopic images or videos. Later, considering the voxel is the most compatible representation for CNNs, SKPN (Guo et al., 2021) encodes volume using convolutional blocks and stylizes it by deep features extracted from a reference image. As for mesh stylization, differentiable rendering allows for backpropagating style transfer objectives from rendered images to 3D meshes. According to whether the geometry or texture is allowed to be optimized, existing mesh style transfer methods achieve three different effects: texture stylization (Mordvintsev et al., 2018; Höllein et al., 2021), geometric stylization (Liu et al., 2018), and joint stylization (Kato et al., 2018; Han et al., 2021b; Yin et al., 2021). Another line of work uses point clouds as the 3D proxy to guarantee 3D consistency in stylizing novel views from either a single image (Mu et al., 2021) or multiple frames (Huang et al., 2021b). In these works, point-wise features extracted from pre-trained PointNet (Qi et al., 2017) or GCN (Li et al., 2021a) are stylized by feature transform algorithms, e.g., adaptive normalization, and then rendered to novel views. Despite the successes, these 3D stylization methods are difficult to generalize to complicated objects or scenes with delicate structures, limited by the expressiveness of explicit 3D representations.

Neural Stylization on NeRF. To address the inherent limitations of explicit representations, implicit methods have recently received much attention. NeRF is a seminal one that is able to represent complex scenes by parameterizing the implicit function as MLP networks. A large number of follow-up works have been presented to improve its efficiency (Deng et al., 2021; Lindell et al., 2021; Garbin et al., 2021; Reiser et al., 2021; Yu et al., 2021a; Müller et al., 2022), quality (Barron et al., 2021; Arandjelović and Zisserman, 2021; Ma et al., 2021; Zhang et al., 2020b), controllability (Zhang et al., 2021; Srinivasan et al., 2021; Liu et al., 2021; Wang et al., 2021a), and generalization (Jain et al., 2021; Yu et al., 2021b; Niemeyer et al., 2021; Park et al., 2021a; Pumarola et al., 2021; Park et al., 2021b; Tretschk et al., 2021; Noguchi et al., 2021; Peng et al., 2021; Li et al., 2021b; Xian et al., 2021; Gao et al., 2021). Inspired by the power of NeRF, three very recent works (Chiang et al., 2022; Huang et al., 2022; Zhang et al., 2022a) adopt it for 3D stylization. They design the stylization network to predict color-related parameters in the NeRF model based on a reference style. The stylization network is trained either by imposing the image style transfer losses (Gatys et al., 2016; Zhang et al., 2022a) on rendered views (Chiang et al., 2022) or by being supervised by a mutually learnt image stylization network (Huang et al., 2022). These works have achieved consistent results in novel-view stylization. However, their stylization is still restricted to appearance only because they do not adjust density parameters in the NeRF model. In contrast, our method supports both appearance and geometric stylization to better mimic the reference style. Moreover, they rely on reference images for stylization, while we seek to stylize the scenes via simple text prompts.

Text-Driven Stylization. Compared to image references, a natural language prompt is a more intuitive and user-friendly way to specify the style. Therefore, a recent line of work has shifted away from image reference towards text guidance, with the help of the pre-trained CLIP (Radford et al., 2021), which bridges texts and images by jointly learning a shared latent space. The pioneering work StyleGAN-NADA (Gal et al., 2021) proposes a directional CLIP loss for transferring the pre-trained StyleGAN2 model (Karras et al., 2020) to the target domain with the desired style described by a textual prompt. However, it is an image-based method and leads to inconsistencies when applied to stylizing multiple views. In the 3D world, Text2Mesh (Michel et al., 2021) uses CLIP to guide the stylization of a given 3D mesh by learning a displacement map for geometry deformation and vertex colors for texture stylization. The contemporary work AvatarCLIP (Hong et al., 2022) further supports driving a stylized human mesh using natural language. Despite their success, these methods are limited to mesh input. In contrast, our method is able to stylize 3D scenes with better visual quality and view consistency without any mesh input.

Overview

As illustrated in Figure 2, our approach is decomposed into reconstruction and stylization stages. In what follows, after briefly reviewing our 3D photography representation with NeRF (§ 3.1), we focus on introducing our text-guided stylization method. Specifically, we first formulate the directional CLIP loss for stylization, which leverages the power of the pre-trained Language-Vision model (§ 4.1). Then, we introduce our global-local contrastive learning framework to cope with the stylization strength issue of the directional CLIP loss (§ 4.2). Next, we introduce a weight regularization term to alleviate the cloudy artifacts and geometry noises caused by the stylization process (§ 4.3). Finally, we conclude this section with the overall training strategy of the entire pipeline (§ 4.4).

We take NeRF as our 3D scene representation, which defines a continuous volumetric field as implicit functions, parameterized by MLP networks $\mathcal{F}$ . Given a single spatial coordinate ${\bm{x}}=(x,y,z)$ and its corresponding view direction $\mathbf{d}=(\phi,\theta)$ , the network predicts the density $\sigma$ and view-dependent radiance ${\bm{c}}=(r,g,b)$ , leading to the final color $C(\bm{r})$ of the camera ray $\bm{r}(t)={\bm{o}}+t\mathbf{d}$ by accumulating $K$ sample points along it, given the target view:

$$C(\bm{r})=\sum_{k=1}^{K}T_{k}(1-\omega_{k})\,{\bm{c}}_{k},\tag{1}$$

where $\omega_{k}=\exp(-\sigma_{k}(d_{k+1}-d_{k}))$ represents the transmittance of the ray segment $(k,k+1)$ and $T_{k}=\prod_{i=1}^{k-1}\omega_{i}$ is the accumulated transmittance from the origin to the sample $k$ .
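To make the accumulation concrete, here is a minimal pure-Python sketch of compositing one ray. It assumes the standard NeRF form in which each sample contributes $T_{k}(1-\omega_{k})\,{\bm{c}}_{k}$ to the ray color; the function name `render_ray` and its list-based inputs are illustrative choices, not part of our implementation.

```python
import math


def render_ray(sigmas, colors, depths):
    """Composite K samples along one ray (illustrative sketch).

    sigmas: K densities sigma_k
    colors: K (r, g, b) radiance tuples c_k
    depths: K + 1 sample depths d_k, so segment k spans d_k..d_{k+1}
    Returns the ray color C(r) and the per-sample weights T_k * (1 - omega_k).
    """
    transmittance = 1.0  # T_k: product of omega_i over all earlier segments
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for k, (sigma, color) in enumerate(zip(sigmas, colors)):
        # omega_k: transmittance of the ray segment (k, k+1)
        omega = math.exp(-sigma * (depths[k + 1] - depths[k]))
        weight = transmittance * (1.0 - omega)
        weights.append(weight)
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= omega
    return pixel, weights
```

For instance, an empty first segment followed by an opaque green sample returns a pure green pixel, with all of the weight placed on the second sample.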

To train NeRF from a set of multi-view photos, a simple supervised reconstruction loss is adopted between the ground-truth pixel colors $\hat{C}(\bm{r})$ from the training view and the NeRF prediction $C(\bm{r})$ :

$$\mathcal{L}_{rec}=\sum_{\bm{r}}\|C(\bm{r})-\hat{C}(\bm{r})\|_{2}^{2}.\tag{2}$$

Text-Guided NeRF Stylization

After optimizing the reconstructed NeRF model $\mathcal{F}_{rec}$ from the multi-view input (§ 3.1), our goal is to train a stylized NeRF model $\mathcal{F}_{sty}$ , which satisfies the style control of the target text prompt ${\bm{t}}_{tgt}$ while preserving the content from $\mathcal{F}_{rec}$ (Figure 2).

The CLIP model aligns the semantics of image and text in a joint embedding space, by utilizing the image encoder $\hat{\mathcal{E}}_{i}(\cdot)$ and the text encoder $\hat{\mathcal{E}}_{t}(\cdot)$ . The semantic power of CLIP bridges the gap between natural language prompts and synthesized image pixels, making it possible to stylize NeRF scenes with text controls.

However, even with the powerful embedding space of CLIP, it remains challenging to achieve text-guided NeRF stylization that 1) preserves the original content from being washed away by the new style, 2) reaches the target style with proper strength that satisfies the semantics of the input text prompt, and 3) maintains cross-view consistency and avoids artifacts in the final NeRF model.

An intuitive strategy for text-guided NeRF stylization would be to enforce the trajectory of the stylization in the CLIP space with an absolute directional CLIP loss that measures the cosine similarity ( $\langle\cdot,\cdot\rangle$ ) between the stylized NeRF rendering $\bm{I}_{tgt}$ and the target text prompt ${\bm{t}}_{tgt}$ (Figure 3(a)):

$$\mathcal{L}_{dir}^{a}=\sum_{\bm{I}_{tgt}}\big[1-\langle\hat{\mathcal{E}}_{i}(\bm{I}_{tgt}),\hat{\mathcal{E}}_{t}({\bm{t}}_{tgt})\rangle\big],\tag{3}$$

which guides NeRF rendering with a global direction of the target text, not depending on any reference starting point. This loss was first designed in StyleCLIP (Patashnik et al., 2021) to guide face image editing and was further extended to generative NeRF editing in CLIP-NeRF (Wang et al., 2021a).

However, as observed in StyleGAN-NADA (Gal et al., 2021), this global loss could easily mode-collapse the generator and hurt the generation diversity of stylization. Therefore, a relative directional loss is proposed, which transfers the source image $\bm{I}_{src}$ to the target domain guided by the CLIP-space trajectory embedded by a pair of text prompts $({\bm{t}}_{src},{\bm{t}}_{tgt})$ instead of a single one (Figure 3(b)). This relative directional CLIP loss for our NeRF stylization is defined as:

$$\mathcal{L}_{dir}^{r}=\sum_{\bm{I}_{tgt}}\big[1-\langle\hat{\mathcal{E}}_{i}(\bm{I}_{tgt})-\hat{\mathcal{E}}_{i}(\bm{I}_{src}),\hat{\mathcal{E}}_{t}({\bm{t}}_{tgt})-\hat{\mathcal{E}}_{t}({\bm{t}}_{src})\rangle\big].\tag{4}$$

Different from the single-image setting of StyleGAN-NADA, here, the training target $\bm{I}_{tgt}$ stands for an arbitrarily sampled view rendered by the stylized NeRF of the same scene, and the source image $\bm{I}_{src}$ is produced by the pre-trained NeRF model and shares the identical view as $\bm{I}_{tgt}$ . We will follow this convention hereinafter.
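The two directional losses can be sketched on plain embedding vectors as follows. This is an illustrative sketch only: the embeddings are assumed to be already produced by the CLIP image and text encoders and are passed in as Python lists, each loss is written for a single view as one minus a cosine similarity, and the function names are hypothetical.

```python
import math


def _unit(vector):
    # L2-normalize; the small floor avoids dividing by zero for a null vector.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, 1e-12) for x in vector]


def cosine(a, b):
    return sum(x * y for x, y in zip(_unit(a), _unit(b)))


def absolute_directional_loss(img_tgt, txt_tgt):
    """1 - <E_i(I_tgt), E_t(t_tgt)> for one rendered view."""
    return 1.0 - cosine(img_tgt, txt_tgt)


def relative_directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cosine between the image-space shift and the text-space shift."""
    image_shift = [t - s for s, t in zip(img_src, img_tgt)]
    text_shift = [t - s for s, t in zip(txt_src, txt_tgt)]
    return 1.0 - cosine(image_shift, text_shift)
```

Because both shifts are normalized before comparison, the relative loss is already zero as soon as the image-space shift points in the right direction, however small that shift is.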

4.2 Strength Control w/ Global-Local Contrastive Learning

As the directional CLIP loss (Equation (4)) works by measuring the similarity between the normalized unit directions of the embedded vectors, it can enforce the relative stylization trajectory. However, it struggles with preserving enough stylization strength in altering the pre-trained NeRF model.

To address this issue, we propose a contrastive learning strategy to control the stylization strength (Figure 3(c)). Specifically, in the framework of contrastive learning, with the rendered view $\bm{I}_{tgt}$ as the query target, we set the positive sample to the target text prompt ${\bm{t}}_{tgt}$ with the desired style and construct negative samples ${\bm{t}}_{neg}\in\mathcal{T}_{neg}$ by sampling a set of text prompts semantically irrelevant to ${\bm{t}}_{tgt}$ . In general, our contrastive loss in the CLIP space is defined as:

$$\mathcal{L}_{con}=-\sum_{\bm{I}_{tgt}}\log\left[\frac{\exp({\bm{v}}\cdot{\bm{v}}^{+}/\tau)}{\exp({\bm{v}}\cdot{\bm{v}}^{+}/\tau)+\sum_{{\bm{v}}^{-}}\exp({\bm{v}}\cdot{\bm{v}}^{-}/\tau)}\right],\tag{5}$$

where $\{{\bm{v}},{\bm{v}}^{+},{\bm{v}}^{-}\}$ are query, positive sample, and negative sample, respectively, and temperature $\tau$ is set to $0.07$ in all our experiments. When defining the loss globally by treating the entire view $\bm{I}_{tgt}$ as the query anchor, we have the global contrastive loss $\mathcal{L}_{con}^{g}$ with $\{{\bm{v}}=\hat{\mathcal{E}}_{i}(\bm{I}_{tgt}),\,{\bm{v}}^{+}=\hat{\mathcal{E}}_{t}({\bm{t}}_{tgt}),\,{\bm{v}}^{-}=\hat{\mathcal{E}}_{t}({\bm{t}}_{neg})\}$ .
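A minimal sketch of this contrastive loss for a single query is given below. It assumes an InfoNCE-style form over L2-normalized embeddings (the normalization is our assumption for the sketch), takes the embeddings as plain Python lists, and uses a log-sum-exp for numerical stability; `contrastive_loss` is a hypothetical helper name.

```python
import math


def contrastive_loss(query, positive, negatives, tau=0.07):
    """InfoNCE-style loss for one query embedding (illustrative sketch).

    query:     embedding v of the rendered view or patch
    positive:  embedding v+ of the target text prompt
    negatives: list of embeddings v- of the negative text prompts
    tau:       temperature (0.07 in all our experiments)
    """
    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / max(norm, 1e-12) for x in vector]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    q = unit(query)
    positive_logit = dot(q, unit(positive)) / tau
    logits = [positive_logit] + [dot(q, unit(n)) / tau for n in negatives]
    # Stable log of the softmax denominator.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return log_denominator - positive_logit
```

The loss approaches zero when the query aligns with the positive prompt and is far from every negative, and it grows when the query is closer to a negative prompt than to the target.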

Ideally, this global contrastive loss cooperates with the directional CLIP loss: the directional loss defines the style trajectory that aligns with the target text, while the contrastive loss ensures the proper stylization magnitude by pushing along that trajectory. However, the global contrastive loss still has trouble achieving sufficient and uniform stylization on the entire NeRF scene, leading to excessive stylization on certain parts and insufficient stylization in other regions. This may be attributed to the fact that CLIP pays more attention to local regions with distinguishable features than to the entire scene. Thus, this global contrastive loss can deliver a small value even when the overall stylization is insufficient or non-uniform. To achieve a more sufficient and balanced stylization through more locally-attended contrastive learning, and inspired by the PatchNCE loss (Park et al., 2020), we propose a complementary local contrastive loss $\mathcal{L}_{con}^{l}$ which sets queries to random local patches $\bm{P}_{tgt}$ cropped from $\bm{I}_{tgt}$ : $\{{\bm{v}}=\hat{\mathcal{E}}_{i}(\bm{P}_{tgt}),\,{\bm{v}}^{+}=\hat{\mathcal{E}}_{t}({\bm{t}}_{tgt}),\,{\bm{v}}^{-}=\hat{\mathcal{E}}_{t}({\bm{t}}_{neg})\}$ .
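The random patch queries can be sketched as below. The sketch assumes that the patch side length is the image side length divided by a fixed factor (10 by default, which is one reading of our patch-size setting) and that patches are drawn uniformly at random; the number of patches per view and both helper names are hypothetical.

```python
import random


def sample_patch_boxes(height, width, num_patches, divisor=10, seed=None):
    """Draw random (top, left, patch_height, patch_width) boxes inside an image."""
    rng = random.Random(seed)
    patch_height = max(1, height // divisor)
    patch_width = max(1, width // divisor)
    boxes = []
    for _ in range(num_patches):
        top = rng.randint(0, height - patch_height)   # inclusive bounds
        left = rng.randint(0, width - patch_width)
        boxes.append((top, left, patch_height, patch_width))
    return boxes


def crop_patch(image, box):
    """Crop a patch from an image stored as a list of pixel rows."""
    top, left, patch_height, patch_width = box
    return [row[left:left + patch_width] for row in image[top:top + patch_height]]
```

Each cropped patch would then be embedded by the CLIP image encoder and used as the query of the contrastive loss, with the same positive and negative text embeddings as in the global term.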

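Overall, we combine the global and local terms as our final global-local contrastive loss:

$$\mathcal{L}_{con}^{g+l}=\lambda_{g}\mathcal{L}_{con}^{g}+\lambda_{l}\mathcal{L}_{con}^{l}.\tag{6}$$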

4.3 Artifact Suppression w/ Weight Regularization

Our pipeline aims to change not only the color but also the density of the pre-trained NeRF to achieve a joint stylization of appearance and geometry. However, allowing the training process to alter the density may lead to cloud-like semi-transparent artifacts near the camera and geometry noises, even if the pre-trained NeRF is perfectly clean. To alleviate that, we adopt a weight regularization loss to suppress geometric noises and encourage a more concentrated density distribution that better resembles real-world scenes.

Based on our NeRF notations (Equation (1)), the weight of each ray sample is defined as its contribution to the final ray color: $w_{k}=T_{k}(1-\omega_{k})$ , where $\sum_{k}w_{k}\leq 1$ . Similar to the distortion loss in mip-NeRF 360 (Barron et al., 2022), the weight regularization loss is defined as:

$$\mathcal{L}_{reg}=\sum_{\bm{I}_{tgt}}\sum_{{\bm{r}}}\sum_{(i,j)}w_{i}w_{j}\|d_{i}-d_{j}\|,\tag{7}$$

where for each ray ${\bm{r}}$ of a randomly sampled view $\bm{I}_{tgt}$ , pairs of samples $(i,j)$ with distances $\|d_{i}-d_{j}\|$ are sampled. However, unlike mip-NeRF 360 (Barron et al., 2022), which optimizes the distances, we penalize those pairs with scattered large weights to suppress noise peaks and aggregate weights onto the correct object surface.
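The effect of the regularizer on a single ray can be illustrated with a short sketch. It assumes the penalty for a pair is the product of the two sample weights times their distance, and, for simplicity, it sums over all ordered pairs rather than a sampled subset; `weight_regularization` is a hypothetical helper name.

```python
def weight_regularization(weights, depths):
    """Sum of w_i * w_j * |d_i - d_j| over all ordered sample pairs of one ray."""
    total = 0.0
    for w_i, d_i in zip(weights, depths):
        for w_j, d_j in zip(weights, depths):
            total += w_i * w_j * abs(d_i - d_j)
    return total


# Weights concentrated on one sample incur no penalty ...
concentrated = weight_regularization([0.0, 1.0, 0.0], [1.0, 2.0, 3.0])
# ... while the same mass split across two distant samples is penalized.
scattered = weight_regularization([0.5, 0.0, 0.5], [1.0, 2.0, 3.0])
```

Minimizing this term therefore pushes each ray toward a single sharp weight peak, which is the behavior expected at an opaque surface.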

4.4 Training

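During training, we finetune the pre-trained NeRF model for stylization. The overall objective consists of three parts: text-guided stylization losses (the directional CLIP loss $\mathcal{L}_{dir}^{r}$ and the global-local contrastive loss $\mathcal{L}_{con}^{g+l}$ , which control style trajectory and strength, respectively), a content-preservation loss (a VGG-based perceptual loss $\mathcal{L}_{per}$ ), and an artifact suppression regularization loss $\mathcal{L}_{reg}$ :

$$\mathcal{L}=(\mathcal{L}_{dir}^{r}+\mathcal{L}_{con}^{g+l})+\lambda_{p}\mathcal{L}_{per}+\lambda_{r}\mathcal{L}_{reg}.\tag{8}$$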

Here we define the perceptual loss $\mathcal{L}_{per}$ between the original and stylized NeRF renderings on certain pre-defined VGG layers $\psi\in\Psi$ :

$$\mathcal{L}_{per}=\sum_{\psi\in\Psi}\|\psi(\bm{I}_{tgt})-\psi(\bm{I}_{src})\|_{2}^{2}.\tag{9}$$

It’s practically infeasible to train stylization on all rays due to backward gradient propagation’s prohibitively huge memory consumption. To address this issue, previous works either sample sparse rays to obtain coarse images or patches (Schwarz et al., 2020; Chiang et al., 2022; Jain et al., 2021; Hong et al., 2022) or render all rays to low resolution and then upsample with CNN networks (Niemeyer and Geiger, 2021). However, coarse renderings or patches lose style details and semantic structures, while upsampling harms the cross-view consistency. Instead, we adopt a much easier solution, which first renders all rays to obtain the whole image of an arbitrary view, calculates the stylization loss gradients in the forward process, and then back-propagates the gradients through NeRF at the patch level. This significantly reduces memory consumption and allows rendering high-resolution images for better stylization training.
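The memory-saving idea can be illustrated with a toy example in which gradients are written out by hand. This is a sketch under strong simplifying assumptions, not our implementation: the "renderer" is a hypothetical one-layer function `tanh(theta . features)` per pixel, the stylization loss is replaced by a sum of squared errors, and the image is a flat list of pixels. The structure is what matters: the full image is rendered once to obtain per-pixel loss gradients, and parameter gradients are then accumulated one patch at a time.

```python
import math


def render(theta, features):
    """Toy renderer: one scalar pixel per feature row."""
    return [math.tanh(sum(t * f for t, f in zip(theta, row))) for row in features]


def pixel_gradients(pixels, target):
    """dL/dpixel for the stand-in loss L = sum((pixel - target)^2)."""
    return [2.0 * (p - t) for p, t in zip(pixels, target)]


def patchwise_param_gradients(theta, features, target, patch_size):
    # Pass 1: render the whole image and cache the gradient w.r.t. each pixel.
    cached = pixel_gradients(render(theta, features), target)
    grad = [0.0] * len(theta)
    # Pass 2: re-render one patch at a time and push the cached gradients
    # through the renderer, so only one patch is "live" at any moment.
    for start in range(0, len(features), patch_size):
        patch = features[start:start + patch_size]
        for offset, (row, pixel) in enumerate(zip(patch, render(theta, patch))):
            upstream = cached[start + offset]
            local = 1.0 - pixel * pixel  # derivative of tanh
            for j, feature in enumerate(row):
                grad[j] += upstream * local * feature
    return grad
```

By the chain rule the accumulated result equals the gradient of the full-image loss regardless of the patch size. In an autograd framework, the second pass would correspond to re-rendering each patch with gradient tracking enabled and back-propagating the cached pixel gradients as the upstream gradient.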

Experiments

We implement our framework using PyTorch. In the reconstruction training stage, we sample $192$ points for each ray and train our model for $6$ epochs with the Adam optimizer and a learning rate of $0.0005$ . In the stylization training stage, we train our model for $4$ epochs with the Adam optimizer and a learning rate of $0.001$ . We set the hyper-parameters $\lambda_{g}$ , $\lambda_{l}$ , $\lambda_{p}$ and $\lambda_{r}$ to $0.2$ , $0.1$ , $2.0$ , and $0.1$ , respectively. To construct the negative samples, we manually collect around 200 text descriptions from the Pinterest website, describing various styles, like “Zombie”, “Tolkien elf”, and “Self-Portrait by Van Gogh”. We set the patch size to $1/10$ of the original input in the local contrastive loss. Without loss of generality, we adopt VolSDF (Yariv et al., 2021) as the basic NeRF model for stylization.
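For reference, the loss weights above combine the individual terms as in the following sketch. It assumes the directional CLIP loss enters with unit weight; the function and constant names are hypothetical, and the individual loss values are assumed to be computed elsewhere.

```python
# Loss weights as reported in the implementation details.
LAMBDA_G = 0.2  # global contrastive loss
LAMBDA_L = 0.1  # local contrastive loss
LAMBDA_P = 2.0  # perceptual (content-preservation) loss
LAMBDA_R = 0.1  # weight regularization loss


def total_loss(l_dir, l_con_global, l_con_local, l_per, l_reg):
    """Weighted sum of the stylization, content, and regularization terms."""
    contrastive = LAMBDA_G * l_con_global + LAMBDA_L * l_con_local
    return l_dir + contrastive + LAMBDA_P * l_per + LAMBDA_R * l_reg
```

With these weights, content preservation carries the largest coefficient, while the two contrastive terms and the regularizer act as comparatively light corrections on top of the directional loss.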

5.2 Data Collection

Three self-portrait datasets are gathered under in-the-wild conditions by asking three users to capture selfie videos of around 10 seconds with the front-facing camera. We finally received six video clips of around 10 seconds each. After collecting these video clips under different views and expressions, we extract 100 frames from each video clip using FFmpeg at 15 fps and resize them to 270 $\times$ 480. We then estimate camera poses for these frames using COLMAP (Schonberger and Frahm, 2016) with rigid relative camera pose constraints, assuming that all frames in a video share the same intrinsics. We also reconstruct a lady from the H3DS dataset (Ramon et al., 2021). We remove noisy frames and obtain 31 sparse views. Moreover, we use an image size of 256 $\times$ 256 for stylization. We also adopt the Local Light Field Fusion (LLFF) dataset (Mildenhall et al., 2019) to stylize non-face scenes. The LLFF dataset is composed of forward-facing scenes, each with around 20 to 60 images.

5.3 Text Evaluation

As CLIP (Radford et al., 2021) is sensitive to text prompts, we conduct a text description evaluation in Figure 4. When a text description refers to a style in general rather than to any one in particular, the stylization can be insufficient. For example, “Fauvism” only induces stylization around the mouth as it describes a general meaning, like artists “Henri Matisse” and “Kees van Dongen” or “Brutalist painting”. The same is observed when comparing “Chinese Painting” and “Chinese Ink Painting”. In contrast, when a text refers to a specific object or style, the language ambiguity disappears. For example, “Lord Voldemort”, “Head of Lord Voldemort”, and “Head of Lord Voldemort in fantasy style” yield similar stylization results. We see similar results for the Pixar style. In the interests of brevity, we use “Fauvism” to represent “painting, oil on canvas, Fauvism style” and “Vincent van Gogh” to represent “painting, oil on canvas, Vincent van Gogh self-portrait style” in other experiments. We also use the same prompt augmentation strategy for other painting styles, including “Edvard Munch” and “Fernando Botero”.
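The prompt augmentation for painting styles can be written as a small template, sketched below. The helper name and its `qualifier` argument are hypothetical; only the two expanded prompts quoted above are taken from our experiments, and the exact wording used for other painters is not implied by this sketch.

```python
def augment_painting_prompt(style, qualifier=""):
    """Expand a short style name into a full painting prompt."""
    parts = [style]
    if qualifier:
        parts.append(qualifier)
    parts.append("style")
    return "painting, oil on canvas, " + " ".join(parts)


fauvism_prompt = augment_painting_prompt("Fauvism")
van_gogh_prompt = augment_painting_prompt("Vincent van Gogh", "self-portrait")
```

Keeping the medium ("painting, oil on canvas") fixed and varying only the style name makes the short labels used in the figures map to prompts in a uniform way.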

5.4 Comparisons

We compare with the most related works in three categories: 1) Text-driven image stylization: StyleGAN-NADA (Gal et al., 2021); 2) Text-driven mesh-based stylization: Text2Mesh (Michel et al., 2021) and AvatarCLIP (Hong et al., 2022); and 3) Text-driven NeRF stylization: CLIP-NeRF (Wang et al., 2021a) and DreamField (Jain et al., 2022). To make fair comparisons with these methods, we adopt the author-released code and adapt the input to each method as required. For StyleGAN-NADA, we follow its steps to first conduct a face alignment under the setting of FFHQ (Karras et al., 2019) and then invert these faces into latent codes using e4e (Tov et al., 2021), before inputting them to StyleGAN-NADA. We also tried pSp (Richardson et al., 2021) for the inversion but finally adopted e4e, which gives better stylization results. Per the authors’ advice, we train for 600 iterations, at which point sampled faces show visually pleasing stylized results. We place the final stylized faces back on the input images by inverting the face alignment process. As for Text2Mesh, the input mesh of one example (‘Lady’) is provided by H3DS (Ramon et al., 2021), while the input mesh of another example (‘Human’) is fetched from AvatarCLIP. Both meshes are normalized to the range from -1 to 1 before being input to Text2Mesh. To stylize ‘Lady’ and ‘Human’, we follow the training setting that Text2Mesh uses for stylizing the person object. We compare to DreamField and AvatarCLIP following the shape sculpting and texture generation process of AvatarCLIP. Similar to AvatarCLIP, we also adopt prompt augmentations when stylizing the ‘Human’. For example, we use text prompts including “Tolkien Elf”, “the back of Tolkien Elf”, and “the face of Tolkien Elf” for the detailed refinement.

The visual comparisons are demonstrated in Figure 5, Figure 6, and Figure 7. For video results, please see the supplementary material.

Comparisons to text-driven image stylization. Compared to StyleGAN-NADA, our method can better ensure the desired style strength in all examples by introducing global-local contrastive learning. StyleGAN-NADA achieves visually pleasing results on sampled faces but degrades on in-the-wild faces, partly due to the latent code inversion. Moreover, as a 3D stylization method, ours can preserve view consistency in the stylized results. In contrast, StyleGAN-NADA stylizes each view independently, thus introducing inconsistent shapes or textures across views. This may lead to flickering artifacts in video applications. StyleGAN-NADA is also less friendly to real faces, as the input image has to be inverted back to the StyleGAN latent space before stylization, which inevitably leads to some detail loss and identity change. By contrast, NeRF-Art is not constrained by the latent space of any pre-trained network and does not need the inversion step.

Comparisons to text-driven NeRF stylization. Compared with CLIP-NeRF, our advantages are two-fold. First, CLIP-NeRF stylizes NeRF using the absolute directional loss, which does not produce sufficient stylization. Moreover, it suffers from uneven stylization. For example, we only see enough stylization on the nose and hair for the style “Fauvism”, but the man’s cheek has not been fully stylized. In contrast, we design a global-local contrastive learning strategy to ensure the desired style strength. Second, as no weight regularization is used in CLIP-NeRF, its results may exhibit severe geometry noises. In contrast, our weight regularization suppresses geometric noises by encouraging a more concentrated density distribution. DreamField also adopts the absolute directional loss to stylize NeRF, which cannot guarantee sufficient and uniform stylization. DreamField adopts a random background augmentation to focus CLIP’s attention on the foreground, which requires view-consistent masks, while ours does not. Moreover, our method consistently outperforms DreamField in detailed cloth wrinkles, facial attributes, and fine-grained geometry deformations, like muscle shapes and antennas. In summary, our NeRF-Art outperforms these methods by proposing a contrastive learning technique to achieve sufficient and uniform stylization and designing a weight regularization to remove cloudy artifacts and geometry noises.

Comparisons to text-driven mesh-based stylization. Text2Mesh also supports geometry deformation and texture stylization of a 3D model like ours. However, it assumes there exists a synergy between the input 3D geometry and the target prompt and is more likely to fail when stylizing a 3D mesh towards a less related prompt, such as “Pixar” for the ‘Lady’ model in Figure 6. With carefully-designed loss constraints, ours is more robust to different prompts, either related to the 3D scenes or not. Moreover, limited by the expressivity of the mesh representation, Text2Mesh fails in most runs and presents unstable stylization results, with irregular deformations and indentations on the edge or surface. The authors of AvatarCLIP also report similar results when comparing to Text2Mesh. Similar to DreamField, AvatarCLIP adopts a random background augmentation to lead CLIP to focus on the foreground and prevent floating artifacts. Nevertheless, this process requires view-consistent masks while ours does not. Moreover, AvatarCLIP adds an additional color network to constrain the general shape of the avatar and introduces random shading and lighting augmentations on the textured renderings to strengthen the stylization. Even with these augmentations, AvatarCLIP still fails to produce satisfying texture and geometry details. In contrast, ours reveals a fine-grained beard, detailed wrinkles of garments, and clearer face attributes. Notably, our NeRF-Art supports stylizing in-the-wild faces, while AvatarCLIP requires a 3D mesh as input to conduct these augmentations. Finally, AvatarCLIP can still generate random bumps in the background and make the extracted surface noisy. This is because AvatarCLIP samples sparse rays ( $112\times 112$ ) to construct coarse renderings for the CLIP constraints, due to out-of-memory problems. We found worse results with more noise when reducing the number of sampled rays. In contrast, our method supports training stylization on all rays by employing a memory-saving technique. In conclusion, NeRF-Art achieves better stylization using the proposed contrastive learning strategies without any mesh guidance.

5.5 User Study

To evaluate stylization quality from human perception, we conducted a user study. For each compared category, we used two subjects. For each subject, we selected $5$ prompts from our text descriptions dataset and finally obtained $10$ test cases for each category and $50$ in total. For every test case, we showed one sample of input frames, the textual prompt, and the results of different methods in two views and in random order. The participants were given unlimited time to select the best stylization results by jointly considering three aspects: preservation of the content, faithfulness to the style, and view consistency. We finally collected $23$ questionnaires completed by $10$ male and $13$ female participants. Statistics of the user study are shown in Figure 12. Our method outperforms StyleGAN-NADA, CLIP-NeRF, Text2Mesh, DreamField, and AvatarCLIP by achieving much higher user preference rates.

6. Ablation Study

Why global-local contrastive learning? A straightforward way to stylize NeRFs is to apply the directional CLIP loss proposed by StyleGAN-NADA (Gal et al., 2021) to the rendered views. Unfortunately, the directional CLIP loss can enforce the right stylization trajectory but struggles to reach a sufficient magnitude, as shown in the 2nd column of Figure 9. This is because the loss only measures the directional similarity between the normalized embedded vectors and ignores their actual distances. In contrast, our global contrastive loss (3rd column of Figure 9) ensures the proper stylization magnitude by pushing the result as close as possible to the target. However, the global contrastive loss still cannot guarantee a sufficient and uniform stylization of the whole scene. The stylization is excessive on certain parts and insufficient on others, e.g., insufficiently stylized faces and excessively stylized eyes in the “Tolkien Elf” example in the 3rd column of Figure 9. This may be attributed to the fact that CLIP pays more attention to regions with distinguishable features than to other regions. Our local contrastive loss helps achieve more balanced results by stylizing every local region of the scene (4th column of Figure 9). However, the local contrastive loss without global information may produce excessive facial attributes, e.g., generating more eyes in the “White Walker” example and two left ears in the “Tolkien Elf” example. This can be attributed to the insufficient semantics contained in a local patch. The problem can be avoided by adding the global contrastive loss at the same time.

By combining both the global and local contrastive losses with the directional CLIP loss, our method achieves uniform stylization with both the correct stylization direction and sufficient magnitude (5th column of Figure 9).
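The difference between a purely directional constraint and a contrastive one can be illustrated with a toy example. The following is a minimal sketch rather than the implementation used in our experiments: the 2D vectors stand in for CLIP embeddings, and the function names, the temperature `tau`, and the choice of the source text as the only negative sample are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cosine similarity between the image-space and text-space edit directions."""
    d_img = normalize([t - s for s, t in zip(img_src, img_tgt)])
    d_txt = normalize([t - s for s, t in zip(txt_src, txt_tgt)])
    return 1.0 - dot(d_img, d_txt)

def contrastive_loss(query, positive, negatives, tau=0.07):
    """InfoNCE-style loss: pull the query towards the positive, away from negatives."""
    q = normalize(query)
    pos = math.exp(dot(q, normalize(positive)) / tau)
    neg = sum(math.exp(dot(q, normalize(n)) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy embeddings (hypothetical values, not real CLIP features).
img_src, txt_src, txt_tgt = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
small_step = [0.9, 0.1]   # stylized view that barely moved towards the target
large_step = [0.0, 1.0]   # stylized view that reached the target

# Both steps point in the right direction, so the directional loss cannot tell them apart.
dir_small = directional_loss(img_src, small_step, txt_src, txt_tgt)
dir_large = directional_loss(img_src, large_step, txt_src, txt_tgt)

# The contrastive loss is much lower for the view that actually reached the target.
con_small = contrastive_loss(small_step, txt_tgt, [txt_src])
con_large = contrastive_loss(large_step, txt_tgt, [txt_src])
print(dir_small, dir_large, con_small, con_large)
```

In this toy setting both candidate views receive a directional loss of approximately zero, whereas the contrastive loss strongly penalizes the view that stayed close to the source, which mirrors the magnitude problem discussed above.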

Why weight regularization? Altering the geometry of NeRF may cause cloudy artifacts. In Figure 11, we demonstrate that the weight regularization loss suppresses cloudy artifacts and geometric noise by encouraging a more concentrated density distribution during stylization.
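To make the notion of a "more concentrated density distribution" concrete, the sketch below evaluates one possible penalty on the rendering weights of a single ray. It is an illustrative assumption, not the exact loss used in our experiments: the pairwise, distortion-style form, the function name `weight_regularization`, and the sample values are all hypothetical.

```python
def weight_regularization(weights, depths):
    """Sum over all sample pairs on a ray of w_i * w_j * |s_i - s_j|.

    The value is zero when all weight sits on one sample and grows as the
    weight spreads along the ray (assumed form, for illustration only).
    """
    total = 0.0
    for w_i, s_i in zip(weights, depths):
        for w_j, s_j in zip(weights, depths):
            total += w_i * w_j * abs(s_i - s_j)
    return total

# Five samples along a ray at normalized depths (hypothetical values).
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
concentrated = [0.0, 0.0, 1.0, 0.0, 0.0]   # a single sharp surface
cloudy = [0.2, 0.2, 0.2, 0.2, 0.2]         # density smeared along the ray

reg_concentrated = weight_regularization(concentrated, depths)
reg_cloudy = weight_regularization(cloudy, depths)
print(reg_concentrated, reg_cloudy)
```

Under this assumed form, minimizing the penalty favors rays whose weights collapse onto a thin surface, which is the behavior the regularization is meant to encourage.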

7. Generalization Evaluation

We conduct a generalization evaluation on VolSDF and NeuS in Figure 8 to assess NeRF-Art’s ability to adapt to different NeRF-like models. Because NeuS reconstructs only a coarse result on our in-the-wild data without a mask, possibly due to inaccurate camera estimation, we segment the foreground using RVM (Lin et al., 2022) for better reconstruction and dilate the mask using OpenCV with a $3\times 3$ kernel and two iterations to allow geometry variations. In Figure 8, our method presents similar stylization results on VolSDF and NeuS, which demonstrates that NeRF-Art can adapt to different NeRF-like models.
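The mask dilation step can be sketched as follows. This is a dependency-free illustration of binary dilation with a $3\times 3$ square kernel and two iterations, not our preprocessing script; the function name `dilate` and the toy mask are hypothetical. With OpenCV, the equivalent operation would be `cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=2)`.

```python
def dilate(mask, iterations=2):
    """Binary dilation of a 2D 0/1 mask with a 3x3 square kernel."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # A pixel is set if any pixel in its 3x3 neighbourhood is set;
                # neighbours outside the image are ignored.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w and mask[yy][xx]:
                            out[y][x] = 1
        mask = out
    return mask

# Toy 7x7 foreground mask with a single pixel set at the centre.
mask = [[0] * 7 for _ in range(7)]
mask[3][3] = 1

dilated = dilate(mask, iterations=2)
for row in dilated:
    print("".join(str(v) for v in row))
```

Each iteration grows the foreground by one pixel in every direction, so two iterations leave a two-pixel margin around the segmented subject in which the stylized geometry can expand.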

8. Geometry Evaluation

To evaluate whether the geometry is correctly modulated during stylization, we show geometry evaluation results in Figure 10. We extract meshes using Marching Cubes (Lorensen and Cline, 1987) before and after stylization for comparison and report results on two widely used NeRF-like models, VolSDF (Yariv et al., 2021) and NeuS (Wang et al., 2021b). Comparison with the source mesh shows clear geometry changes. For example, “Lord Voldemort” flattens the girl’s nose, “Tolkien Elf” sharpens the girl’s ears, and “Pixar” rounds the jaw. Moreover, we make the same observations on both VolSDF and NeuS. In summary, we conclude that our method can correctly modulate the geometry of NeRF to match the desired style.

Conclusion

In this paper, we present NeRF-Art, a text-guided NeRF stylization approach based on CLIP. Unlike existing approaches, which either require mesh guidance in the stylization process or suffer from insufficient geometry deformations and texture details, our method modulates geometry and appearance simultaneously to match the desired style and produces visually pleasing geometry deformations and texture details with only text guidance. To achieve this, we introduce a carefully designed combination of a directional constraint to control the style trajectory and a novel global-local contrastive loss to enforce the proper style strength. Moreover, we propose a weight regularization strategy to alleviate cloudy artifacts and geometric noise when deforming the geometry. Extensive experiments on real faces and general scenes show that our method is effective and robust in both stylization quality and view consistency.

Limitations. Despite its success in most cases, our method still has some limitations. First, some text prompts are linguistically ambiguous, like “Digital painting”, which describes a wide range of styles, including oil paintings, pencil sketches, 3D renderings, cartoon drawings, etc. This ambiguity might confuse CLIP and make the final result unexpected, as shown in Figure 13. Second, semantically meaningless word combinations cause another kind of unexpected result. For example, if we combine the words “Mouth” and “Batman” as a prompt, the result unexpectedly puts a bat shape on the mouth, which may not be what the user desires. These are interesting problems worth exploring in the future.