Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping

Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Chen, Angjoo Kanazawa, Ken Goldberg

Introduction

Many common objects must be grasped appropriately to avoid damage or to facilitate a task: a knife by its handle, a flower by its stem, or sunglasses by their frame. Learning-based grasping systems exhibit impressive robustness when grasping arbitrary objects, but these systems typically measure grasp success by whether the object was lifted. Critically, these methods ignore an object’s semantic properties: even if a robot could locate your favorite sunglasses, it might shatter the lenses rather than safely grasp the frame. The ability to grasp an object part based on a desired task and constraints is called task-oriented grasping. While it is well studied, previous methods collect specific object affordance datasets and struggle to scale to a diverse set of objects. Instead, the flexibility of natural language has the potential for specifying what and where to grasp. In this work, we propose LERF for Task-Oriented Grasping on Objects (LERF-TOGO), a method which enables task-oriented grasping through natural language by using large vision-language models in a zero-shot manner.

LERF-TOGO takes as input an object and a task-oriented object part name in natural language (e.g., “flower; stem”), and outputs a ranking over viable grasps on this object. We build on recent work, Language Embedded Radiance Fields (LERF), which takes in calibrated RGB images and trains a standard NeRF in tandem with a scale-conditioned CLIP feature field. Given a sentence prompt query, it outputs a 3D relevancy heatmap representing similarity to the query. However, these heatmaps may fail to highlight the full object (e.g., highlighting only the bristles of a brush), which may cause issues when directly deployed to a task-oriented grasping task (grasp the “handle” of a brush). LERF-TOGO improves upon LERF’s capabilities by predicting a 3D object mask using 3D DINO features explicitly during inference. We propose a method of conditional LERF querying which restricts an object sub-part query to the object mask, leveraging the multi-scale nature of LERF to isolate specific regions within an object. LERF-TOGO then uses GraspNet to generate grasps, re-ranking them based on the geometric and semantic distributions.

We implemented a system with appropriate regularizations which allows LERF-TOGO to operate on a physical robot and evaluate its semantic grasping capabilities on 39 common household objects. In experiments, 96% of generated grasps are on the correct object, 82% on the correct object part, and 69% result in a successful grasp.

This work contributes LERF-TOGO, an algorithm for producing task-based semantic grasp distributions over an object by first extracting a coarse 3D object mask with DINO features to locally grow a relevant region, then conditioning a LERF query on this mask to isolate sub-parts of an object. We design a robotic system which integrates LERF-TOGO on a physical robot to reconstruct a LERF of a scene, then execute task-oriented grasps through natural language to grasp semantically meaningful object parts. See the project website at: lerftogo.github.io

Related Work

Task-oriented grasping studies how to grasp objects by specific parts based on a use case. It has been studied by probabilistically modeling human grasps, extracting geometric features from labeled object parts, training on part-affordance datasets in simulation, or transferring category-specific part grasps to new instances. Recent works train object-part grasp networks by leveraging object part and manipulation affordance datasets for a range of household objects. In contrast, LERF-TOGO uses off-the-shelf vision-language models trained at scale and does not require training on affordance datasets, so it captures long-tail objects more easily. Given 3D grasp regions on an object, Song et al. demonstrate that biasing an off-the-shelf grasp planner towards these regions is a viable approach to sampling task-oriented semantic grasps. Kokic et al. use videos of humans interacting with objects to guide grasps towards the same part.

Decomposing objects into parts has also long been studied as a co-segmentation task in vision. Recent approaches use pretrained vision features to discover common parts within sets of objects. This technique has been applied at scale to segment parts of objects based on a canonical object or to detect object affordances from example images of human usage. Though effective, it assumes access to a canonical image of each object and pre-existing part labels or demonstrations, which is restrictive in real-world applications. LERF-TOGO instead uses free-form language queries to isolate parts of objects in a 3D reconstruction.

2.2 Neural Radiance Fields (NeRF)

Neural Radiance Fields (NeRF) are an attractive representation for high-quality scene reconstruction from posed RGB images, with an explosion of recent work on visual quality, large-scale scenes, optimization speed, dynamic scenes, and more. Because of its high-quality reconstruction and differentiable properties, NeRF has been widely explored in robotics for navigation and mapping, manipulation, and synthetic data generation. This work is most similar to works such as Evo-NeRF, which use NeRF as a real-time scene reconstruction to grasp objects. However, in contrast to previous works which only use RGB information, in this work we must include additional semantic information in 3D to select grasps falling on relevant target objects.

Several prior works explore using semantic outputs inside NeRF. Semantic-NeRF, Panoptic Lifting, and Panoptic Neural Fields distill semantic categories from semantic segmentation networks into 3D to improve the 3D consistency of labels, particularly noting the denoising effect of averaging multiple views. Other works such as Distilled Feature Fields or Neural Feature Fusion Fields distill feature vectors from DINO and LSeg, and show they can be used for editing and scene segmentation. We build on LERF, which is described in the next section.

2.3 LERF Preliminaries

Language Embedded Radiance Fields (LERF) is a recent representation that distills CLIP features into a NeRF. LERF inputs RGB images with camera poses and outputs a 3D field of DINO embeddings as well as a scale-conditioned CLIP field. This supports querying points in 3D for CLIP embeddings at different physical scales, capturing different semantics given different amounts of context. Given a text query, a relevancy value (from 0 to 1) can be generated at any 3D point by calculating the cosine similarity between LERF-queried embeddings and the CLIP embedding of the query text. During this query process, a grid search on the scale parameter retrieves the scale with the highest activation. LERF is particularly attractive for task-oriented grasping because 1) its multi-scale parameterization allows queries at both object-level and part-level scales, and 2) LERF uses outputs from a pre-trained CLIP model without fine-tuning, which supports a variety of long-tail object queries not included in object or part segmentation datasets. LERF, however, tends to produce nonuniform activations on object queries because it lacks spatial grouping, as shown in Fig. 5. In this work we show how to explicitly use the DINO feature field to obtain object masks that enable downstream object part queries related to task-oriented grasping.
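The scale grid search described above can be illustrated with a minimal sketch. It assumes a hypothetical `embed_fn(points, scale)` that stands in for querying the scale-conditioned CLIP field, and it uses clipped cosine similarity as a simplified relevancy; LERF's actual relevancy score may be computed differently.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between rows of `a` (N, D) and a single vector `b` (D,)."""
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b)
    return a_n @ b_n

def best_scale_relevancy(embed_fn, points, text_embedding, scales):
    """Grid-search the scale and keep the one with the highest activation.

    embed_fn(points, scale) -> (N, D) is a hypothetical stand-in for the
    scale-conditioned CLIP field; it is not part of any real library.
    """
    best_scale, best_map = None, None
    for s in scales:
        sim = cosine_similarity(embed_fn(points, s), text_embedding)
        relevancy = np.clip(sim, 0.0, 1.0)  # simplification: squash into [0, 1]
        if best_map is None or relevancy.max() > best_map.max():
            best_scale, best_map = s, relevancy
    return best_scale, best_map
```

The function returns both the selected scale and the per-point relevancy at that scale, so the same routine can serve object-level and part-level queries.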

2.4 Open-Vocabulary Detectors

Open-vocabulary object detection attempts to output masks or bounding box detections given text prompts as input. OVD proposes a two-stage training pipeline, which first learns a visual-semantic space using image-text pairs and then learns object detection using object annotations from a set of base classes. OpenSeg employs a similar joint learning framework for simultaneously training a mask prediction model and learning visual-semantic alignments for each mask by aligning captions and masks. Some methods distill CLIP embeddings into a visual encoder by using labels from detection or classification datasets, or fine-tune a lightweight detection head on top of a pretrained backbone. However, fine-tuning on detection datasets severely bottlenecks the language understanding capability of detection models, and as noted by the authors of OV-Seg, these networks face challenges in decomposing hierarchical components of the original masks, including object parts. LERF circumvents the usage of region proposals or fine-tuning by incorporating language embeddings densely in 3D, making it an attractive approach for effective part decomposition.

2.5 Natural Language in Robotics

With the advent of large pretrained language and vision-language models, several works have explored building 3D map representations to guide robot navigation. VL-Maps and OpenScene build a 3D language embedding from pretrained open-vocab detectors, which can be used to navigate to target queries. CLIP-Fields, ConceptFusion, and NLMaps-SayCan take more of a region-proposal based approach, querying CLIP on the outputs of region proposal methods and fusing them into 3D point clouds for downstream navigation tasks. Region-based zero-shot methods retain more language understanding than fine-tuned features but run the risk of missing objects by insufficiently masking input images. Semantic Abstraction avoids this by extracting relevancy from vision-language models using Chefer et al. and uses these for composing multiple language queries with spatial relationships.

Language has also been studied in the context of robot manipulation. Mees and Burgard use the MAttNet vision-language model for object rearrangement, and CLIPort uses language understanding from CLIP to train a language-conditioned pick and place module from demonstrations. PerAct uses language-conditioned demonstrations with a 3D scene transformer to learn diverse tasks, MOO uses the outputs from Owl-ViT to condition a manipulation policy for grasping objects, and large-scale demonstration datasets like RT-1 train on massive language-conditioned demonstration trajectories. In contrast to many other language-conditioned approaches, LERF-TOGO uses internet-scale vision models purely zero-shot and does not require fine-tuning on demonstrations or robot exploration.

Problem and Assumptions

Given a planar surface (table or workbench) containing a set of objects, the objective is for a robot to grasp and lift a target object specified using natural language. This query (e.g., “sunglasses; ear hooks”) includes both the object query (“sunglasses”) and the object part query (“ear hooks”), which specifies the part by which to grasp the object. We experiment with lifting the assumption of a user-provided part query in Sec. 5 by leveraging an LLM to provide part queries. We assume access to a robot manipulator with a parallel-jaw gripper and a calibrated wrist-mounted RGB camera, and that the objects in the scene are graspable by the robot. We also assume the object query specifies a single object that is present in the scene.

Method

Given an object and object part query, LERF-TOGO outputs a ranking of viable grasps on the object part. To accomplish this, it first captures the scene and reconstructs a LERF. Given a text query, LERF can generate a 3D relevancy map that highlights the relevant parts of the scene (Sec. 2.3). Second, a 3D object mask is generated using the LERF relevancy for the object query and DINO-based semantic grouping (Sec. 4.1). Third, a 3D part relevancy map is generated with a conditional LERF query over the object part query and the 3D object mask (Sec. 4.2). The part relevancy map is used to produce a semantic grasp distribution.

An important limitation of LERF is its lack of spatial grouping within objects: for example, given “can opener”, LERF tends to highlight the regions that most obviously identify that object (e.g., the metal cogs on the can opener, as shown by the orange star in Fig. 3). This is problematic because the region that visually identifies the object and the desired grasp location (e.g., the handle) can differ significantly. LERF inherently exhibits such local behavior because it trains on local crops of input images, so CLIP embeddings surrounding the handle are unaware that it belongs to a can opener. LERF-TOGO overcomes this by finding a 3D object mask given a language query, which groups the object part together with the LERF activation. To create the object mask, we leverage the 3D DINO (self-DIstillation with NO labels) embeddings present within LERF during inference, because DINO embeddings have been shown to exhibit strong object awareness and foreground-background distinction.

First, we obtain a coarse object localization from LERF by rendering a top-down view of the scene and querying the object. We produce a foreground mask by thresholding the first principal component of the top-down rendered DINO embeddings, and constrain the relevancy query to this mask to find the most relevant 3D point. We then refine this single-point localization into a complete object mask. We render an object-centric point cloud around this 3D point by deprojecting NeRF depth from multiple views, and then iteratively grow the object mask by including neighboring points to the frontier which lie within a threshold DINO similarity (similar to floodfill). The output of this process is a set of 3D points lying on the target object. See the Appendix for more details.
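The mask-growing step can be sketched as a flood fill over a point cloud. This is an illustrative sketch, not the released implementation: the neighbor `radius`, the feature threshold `feat_thresh`, and the brute-force neighbor search are all assumptions made for brevity, and `feats` stands in for per-point DINO features.

```python
import numpy as np
from collections import deque

def grow_object_mask(points, feats, seed_idx, radius, feat_thresh):
    """Flood-fill style region growing over a point cloud.

    points: (N, 3) positions; feats: (N, D) per-point features.
    A point within `radius` of a frontier point joins the mask if its feature
    is within `feat_thresh` (L2 distance) of the seed point's feature.
    Both thresholds are hypothetical parameters for this sketch.
    """
    mask = np.zeros(len(points), dtype=bool)
    mask[seed_idx] = True
    frontier = deque([seed_idx])
    # Feature similarity to the seed does not change, so compute it once.
    similar = np.linalg.norm(feats - feats[seed_idx], axis=1) <= feat_thresh
    while frontier:
        i = frontier.popleft()
        # Brute-force neighbor search; a KD-tree would scale better.
        near = np.linalg.norm(points - points[i], axis=1) <= radius
        for j in np.flatnonzero(near & similar & ~mask):
            mask[j] = True
            frontier.append(j)
    return mask
```

Because growth only proceeds through spatial neighbors, a distant point with a similar feature is excluded unless a chain of accepted points connects it to the seed.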

4.2 Conditional LERF Queries

Another important challenge of using CLIP is its tendency to behave as a bag-of-words: the activation for “mug” behaves very similarly to “mug handle” because CLIP latches onto individual words, not the grammatical structure of sentences. To mitigate this phenomenon, LERF-TOGO introduces a conditional method of querying LERF relevancy by composing two related queries, similarly to how composing prompts has shown promise in generative modeling for guiding specific properties. Because LERF is scale-conditioned, during inference it searches over scales for a given query and returns the relevancy at the scale with the highest activation. To condition a LERF query, LERF-TOGO searches only on the points within the 3D object mask. Intuitively, this results in a distribution over the object’s 3D geometry representing the likelihood that a given point is the desired object part, which can be used for biasing grasps towards this region.
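A minimal sketch of the conditioning step follows, assuming a hypothetical `relevancy_fn(points, scale)` that returns the part-query relevancy at a fixed scale. The only difference from an unconditioned query is that the scale is selected using points inside the object mask.

```python
import numpy as np

def conditional_part_relevancy(relevancy_fn, points, object_mask, scales):
    """Select the scale using only points inside the object mask.

    relevancy_fn(points, scale) -> (M,) is a hypothetical stand-in for a LERF
    relevancy query of the part prompt at one scale.
    Returns the chosen scale and the relevancy of the masked points.
    """
    obj_points = points[object_mask]
    # Rows index scales, columns index masked points.
    per_scale = np.stack([relevancy_fn(obj_points, s) for s in scales])
    best = int(np.argmax(per_scale.max(axis=1)))
    return scales[best], per_scale[best]
```

With this restriction, a strong activation elsewhere in the scene cannot dictate which scale is used for the part query on the target object.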

4.3 Grasping

Grasp Sampling: Ensuring complete coverage of grasps on objects is critical to avoid missing specific object parts. We use GraspNet, which can generate 6-DOF parallel-jaw grasps from a monocular RGBD point cloud, but from a single view it often misses key grasps on target object parts. To mitigate this, and to leverage the full 3D geometry available within NeRF, we create a hemisphere of virtual cameras oriented towards the scene’s center. For every virtual camera, we transform the scene’s point cloud into that camera’s coordinate frame before providing it as input to the pretrained GraspNet model. To obtain the final set of grasps for the scene, we combine the generated grasps from all virtual cameras and apply non-maximum suppression to remove duplicates.
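The virtual-camera construction can be sketched as follows. The number of cameras, the elevation angles, and the camera axis convention (x right, y down, z forward) are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def look_at_pose(eye, target, up=(0.0, 0.0, 1.0)):
    """Camera-to-world 4x4 pose whose +z axis points from `eye` to `target`.

    Assumes the viewing direction is not parallel to `up`.
    """
    eye = np.asarray(eye, dtype=float)
    forward = np.asarray(target, dtype=float) - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, down, forward, eye
    return pose

def hemisphere_cameras(center, radius, n_azimuth, elevations_deg):
    """Poses on a hemisphere around `center`, all looking at `center`."""
    center = np.asarray(center, dtype=float)
    poses = []
    for elev in np.deg2rad(elevations_deg):
        for az in np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False):
            offset = radius * np.array([
                np.cos(elev) * np.cos(az),
                np.cos(elev) * np.sin(az),
                np.sin(elev),
            ])
            poses.append(look_at_pose(center + offset, center))
    return poses

def world_to_camera(points, cam_to_world):
    """Express (N, 3) world-frame points in the camera frame."""
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return (points - t) @ R  # row-vector form of R^T (p - t)
```

Each transformed point cloud would then be passed to the grasp network as if it had been observed from that virtual viewpoint.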

Grasp Ranking: Given the grasps sampled in the previous step (the geometric distribution), we now combine them with the semantic distribution across an object obtained from LERF-TOGO. The semantic score $s_{sem}$ for a given grasp is computed as the median LERF relevancy of points within the grasp volume. The geometric score $s_{geom}$ is the confidence output from GraspNet, indicating grasp quality based on geometric cues. To balance relevance and success likelihood, we compute the combined grasp score $s=0.95s_{sem}+0.05s_{geom}$, which prioritizes the most relevant grasps while slightly biasing towards confident grasps.
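The scoring rule can be written out directly. In this sketch, `grasp_point_ids` is a hypothetical representation of the grasp volume as a list of point indices per grasp, and a grasp with no points in its volume is assigned a semantic score of zero, which is an assumption rather than something stated in the text.

```python
import numpy as np

def rank_grasps(point_relevancy, grasp_point_ids, geom_conf, w_sem=0.95, w_geom=0.05):
    """Rank grasps by s = w_sem * s_sem + w_geom * s_geom.

    point_relevancy: (N,) relevancy per point.
    grasp_point_ids: list of index lists, the points inside each grasp volume
                     (hypothetical representation for this sketch).
    geom_conf: (G,) geometric confidence per grasp.
    Returns grasp indices sorted best-first, and the combined scores.
    """
    point_relevancy = np.asarray(point_relevancy, dtype=float)
    s_sem = np.array([
        np.median(point_relevancy[ids]) if len(ids) > 0 else 0.0  # assumed fallback
        for ids in grasp_point_ids
    ])
    scores = w_sem * s_sem + w_geom * np.asarray(geom_conf, dtype=float)
    return np.argsort(-scores), scores
```

With these weights, a geometrically confident grasp on an irrelevant region still ranks below a less confident grasp on the queried part.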

4.4 Scene Reconstruction

The robot uses a wrist-mounted camera to capture the scene with a hemispherical trajectory centered at the workspace, similar to Evo-NeRF. The capture has a radius of 45 cm, spans $\pm 100^{\circ}$ around the workspace horizontally, and covers an inclination range of $30^{\circ}$ to $75^{\circ}$. We capture images at a rate of 3 Hz while the arm moves at 15 cm/s, resulting in around 60 images per capture. We discard blurry images by analyzing the variance of the image Laplacian, ensuring the images are high quality. While the robot moves, we pre-process each image to extract DINO features, multi-scale CLIP features, and ZoeDepth depth estimates, which are used during LERF training. See Appendix B for additional training details.
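A blur filter based on the variance of the Laplacian can be sketched with NumPy alone. The 4-neighbor Laplacian stencil and the `threshold` value are assumptions for illustration; the text does not state which kernel or cutoff is used.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbor discrete Laplacian over the image interior.

    gray: (H, W) grayscale image. Low values suggest few sharp edges (blur).
    """
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def keep_sharp(images, threshold):
    """Keep images whose Laplacian variance meets a hypothetical threshold."""
    return [img for img in images if laplacian_variance(img) >= threshold]
```

In practice the threshold would need tuning for the camera and scene, since the Laplacian variance also depends on image content and resolution.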

Experiments

Part-Oriented Grasping: We evaluate LERF-TOGO on 31 different objects and 49 total object parts to grasp (Fig. 4). For each object, we select an object query by describing it sufficiently to unambiguously differentiate it from other objects in the scene. We use semantic descriptions when possible, and add visual descriptions only when the semantic descriptions are ambiguous (e.g., using color to differentiate multiple mugs in a scene). We provide a part query for each object by describing a natural place for a robot to grasp and lift (e.g., “handle”, “plant stem”, “ear hook”, “frame”). In addition, several objects include multiple different grasp locations. A grasp is successful if it lifts the correct object by the appropriate subpart at least 10 cm vertically, and the object remains securely within the gripper jaws throughout. For each query, we measure 1) whether the selected grasp was on the correct object, 2) whether the selected grasp was on the correct object part, and 3) whether the grasp successfully lifted the object from the table. Every scene is reconstructed once in the beginning, after which the objects are removed one by one with no updates to the scene representation. For a full list of object queries see the Appendix.

Task-Oriented Grasping: LERF-TOGO accepts a natural language part query as input, allowing it to be used alongside large language models (LLMs) that generate parts based on the task. To investigate whether an LLM can also generate the object part, we use an LLM (ChatGPT) to generate the object and part query automatically via few-shot prompting. Results are shown in Table 3. The prompt and all tasks are included in the Appendix. Given the task and the list of objects in the scene, the LLM is tasked with generating the correct (object, part) pair. We query the LLM with a majority voting scheme: given the task, the LLM provides seven candidates, and we select the (object, part) pair that appears in a majority of the responses.
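The voting step can be sketched in a few lines. Returning `None` when no pair reaches a strict majority is an assumption of this sketch; the text does not say how that case is handled.

```python
from collections import Counter

def majority_vote(candidates):
    """Return the (object, part) pair that appears in a strict majority.

    candidates: list of (object, part) tuples, one per LLM response.
    Returns None when no pair exceeds half of the responses (assumed behavior).
    """
    if not candidates:
        return None
    pair, count = Counter(candidates).most_common(1)[0]
    return pair if count > len(candidates) / 2 else None
```

For example, if four of seven responses are `("knife", "handle")`, that pair is selected regardless of how the other three responses are split.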

Integration with an LLM Planner: LERF-TOGO can integrate as a module with an LLM planner to combine task-oriented grasps for robotic manipulation tasks. We define a set of robotic manipulation primitives (grasp, press, twist, pick&place, pour) and prompt the LLM to output the correct primitive for a given task. We use the same majority voting scheme as in the previous section to select both the correct robotic primitive and the (object, part) pair. Now, given a task (e.g., ‘uncork the wine’), an LLM can specify the action to accomplish the task (‘grasp’) and the pair of object and object part (e.g., ‘wine’ and ‘cork’). The prompt and all tasks are included in the Appendix.

ConceptFusion generates a multimodal point cloud of a scene by fusing RGBD images and their extracted features together. We provide ConceptFusion with depth generated from the NeRF, which results in the same high-quality point cloud we use for grasping. To fairly represent the paper, we use the OpenCLIP ViT-H/14 model, which is several times larger than the ViT-B/16 model used by LERF-TOGO. To query ConceptFusion, we provide it with the concatenated object and part prompt (e.g., “mug handle”) and rank grasps by highest similarity. We report the object and part success without physical evaluation.

Semantic Abstraction takes a single RGBD frame and a text query as input and outputs a relevancy heatmap over the image. Since the method takes a single image, we provide it with an input image observing all object parts for a fair comparison. We provide it with part queries in two ways and take the best performance: 1) the concatenated object and part prompt (e.g., “mug handle”), and 2) the object and part as separate queries. A query is a success if the majority of the heatmap overlaps with the object part. In instances where activations are detected on other objects, the query is considered successful if the highest activation is on the desired object part. Detailed results can be found in Table 2 and the Appendix.

OWL-ViT is an open-vocab detector which takes in an RGB image and text prompts and outputs bounding boxes. We provide OWL-ViT a single input image that encompasses all object parts for a fair comparison. To obtain an object region, we use the object prompt to establish an initial bounding box. We then identify the highest-scoring part box within this region. To deem the part box successful, we visually confirm that it aligns with the object part. Results can be found in Table 2 and the Appendix.

Results

Part-Oriented: LERF-TOGO overall achieves a 69% success rate for physically grasping and lifting objects by the correct part. The selected grasp was located around the correct object part 82% of the time, with the remaining failures being grasp execution failures. For context, the highest confidence geometric grasp on an object mask only lies on the correct part 18% of the time, suggesting LERF-TOGO meaningfully biases the grasp distribution to the object part. Selected task-oriented queries are visualized in Fig. 2: the distribution of grasps drastically shifts based on the given part query, and can focus task-oriented grasps on multiple different regions per object based on the language prompt. LERF-TOGO shows strong language understanding performance for object selection (96%), and can differentiate between very fine-grained language queries like color, appearance (“matte” vs “shiny”), or semantically similar categories (“popsicle” vs “lolipop”). It can also recognize long-tail object queries like “ethernet dongle”, “cork”, or “martini glass”, owing to its zero-shot usage of CLIP.

Task-Oriented: Combining few-shot LLM prompting with LERF-TOGO identifies the correct primitive with 92% success and produces grasps on the correct object with 71% success across 49 tasks on 39 different objects. The LLM was able to correctly identify the object in the scene with the same success rate as the human, giving the correct object and part pairs for common tasks like “scrub the dishes” and “cut the steak”. However, the LLM had a lower success rate (71%) compared to the human (82%) for object part selection. This is because CLIP, and by extension LERF-TOGO, can be sensitive to subtle variations in wording like “body” vs. “base”, resulting in different LERF activations and thus grasps.

Comparisons: LERF-TOGO outperforms ConceptFusion by 43% at task-oriented grasping because it can capture multi-scale semantics, while ConceptFusion is limited to one CLIP embedding per point. This makes hierarchical querying difficult, and is reflected by the fact that ConceptFusion performs similarly to LERF-TOGO at selecting the correct object, but suffers at selecting the right object part. Lacking scale-conditioning, ConceptFusion frequently emphasizes large sections of the table, because its mask proposals include both the objects and the table itself (Fig. 6).

Semantic Abstraction achieves an overall object detection rate of 80% and part detection of 35%. The method tends to produce empty relevancy responses when queried for specific object parts, potentially owing to its averaging across multiple scales, which drowns out smaller part features. When presented with the concatenated object and object part, the method highlights the entirety of the object, owing to CLIP’s bag-of-words behavior, a characteristic addressed by LERF-TOGO’s compositional queries (Fig. 10).

OWL-ViT achieves 85% accuracy for object localization, struggling on very long-tail objects that were not encountered within the detection datasets. This behavior is amplified for object part queries, where queries tend to be especially long-tail such as “measuring tape” and “ethernet dongle” (Fig. 12).

Failures: The primary failure modes of LERF-TOGO are mistaking visually similar object parts for one another (e.g., a teapot spout for a handle), missing subtle geometries like the small teacup handle or spray trigger, or confusing very close categories like steak and bread knives. We also observe prompt sensitivity with part queries: for example, “bottle neck” more strongly localizes grasps than “neck”, and without more prompt tuning “body” sometimes fails to highlight the bases of bottles.

Object Extraction Ablation: Without 3D object masking and conditional querying, LERF-TOGO suffers with oblong objects, as shown in Fig. 5. We compare against querying LERF individually for the object and part, and multiplying their results together. This produces fragmented results which can ignore relevant parts of the object for part queries.

Limitations and Future Work

One limitation of LERF-TOGO is speed: the entire end-to-end process takes a few minutes, which can be impractical for time-sensitive applications. Future work on additional regularizations and optimizations to LERF training may reduce computation time. Another key limitation of LERF-TOGO is with groups of connected foreground objects, for example a bouquet of multiple flowers. The DINO flood-fill will create a foreground group containing all flowers, after which isolating the stem of a specific individual type of flower (e.g., “daisy” vs “rose”) is challenging. Supporting hierarchy within foreground groups is critical to enable such cases. If there are multiple objects that match the prompt, the system will arbitrarily choose only one of them. This is also true if the object query cannot disambiguate multiple instances of a similarly colored object (e.g., “mug” object query in the mugs scene). We also note that LERF-TOGO is not designed for referring or comparative expressions (e.g., “mug next to the plate”, “biggest mug”). In addition, though we present a method for obtaining object part queries from input task descriptions via LLMs, in future work we will evaluate its performance on a diverse set of tasks.

Conclusion

This paper presents LERF-TOGO, a method for using vision-language models zero-shot with Language Embedded Radiance Fields to grasp objects and their parts via language. By improving the spatial grouping of LERF relevancy outputs, LERF-TOGO can support hierarchical part queries conditioned on the full object. Results indicate it performs strongly at language-guided grasping, with grasps landing on the correct object 96% of the time, and furthermore can direct grasps to the correct object parts 81% of the time. All code and datasets will be released after the submission process.

Acknowledgements

This research was performed at the AUTOLAB at UC Berkeley in affiliation with the Berkeley AI Research (BAIR) Lab. The authors were supported in part by donations from Toyota Research Institute, Bosch, Google, Siemens, and Autodesk and by equipment grants from PhotoNeo, NVidia, and Intuitive Surgical. Any opinions, findings, and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Sponsors. We would like to thank our colleague Brent Yi for his work on Viser, the visualization tool of our experimental setup. We thank our colleagues who provided helpful feedback and suggestions, in particular Simeon Adebola and Julia Isaac.

References

Appendix A LERF-TOGO

We implement LERF-TOGO on top of the Nerfacto method from Nerfstudio. For faster convergence and smoother optimization, we modify several parameters from the original LERF paper. We use a smaller hashgrid with 16 levels and a maximum resolution of 256, and find that using larger MLPs for the density, color, and transient output heads of NeRF results in faster convergence and better ability to handle specularities and robot shadows. We introduce weight decay of 1e-7 to the LERF network, which smooths training. In addition, we compress the DINO embeddings into 128 dimensions and supervise on these vectors rather than the original DINO outputs, but we do not normalize the resulting vectors like Tschernezki et al. While constructing the CLIP embedding pyramid for LERF, we use crops ranging from $5\%$ to $35\%$ of image height with 6 pyramid levels, biasing the pyramid to smaller crops as LERF-TOGO is primarily interested in object and part queries.
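As a rough illustration of the crop pyramid, the sketch below computes crop sizes in pixels for a given image height. The linear spacing between the smallest and largest crop fraction is an assumption; the text gives only the range and the number of levels.

```python
import numpy as np

def pyramid_crop_sizes(image_height, min_frac=0.05, max_frac=0.35, levels=6):
    """Crop side lengths (pixels) for each pyramid level.

    Linear spacing between min_frac and max_frac is assumed for this sketch.
    """
    fracs = np.linspace(min_frac, max_frac, levels)
    return np.round(fracs * image_height).astype(int)
```

For an image height of 896 pixels, this yields crops from roughly 45 to 314 pixels across the six levels.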

Appendix B Robot Capture

Robot Capture Region Size: The robot captures the scene along a hemispherical trajectory arcing $\pm 100^{\circ}$ around the workspace horizontally (“1/2 hemisphere” in Fig. 7). When this horizontal sweep angle is reduced to a fraction of the range, the quality of the 3D object mask degrades, sometimes selecting the incorrect object altogether. LERF’s semantic field is supervised on features of the scene images, so its quality is heavily correlated with the distribution of images viewing the object. A narrower sweep therefore lowers the quality of the 3D DINO embeddings used for the mask generation.

LERF Training Steps: In our experiments the LERF scene representation is trained for 2k steps. As shown in Fig. 7, objects and object parts (e.g., “spray nozzle”, “bottle”) can be detected in as few as 1k steps, but more fine-grained or smaller parts (e.g., “handle”) may take longer (2-3k steps). This is consistent with what LERF reports: “fine-grained features take more steps [to emerge]”.

3D Object Extraction: Given the initial 3D point with the highest LERF activation, we create an object-centric point cloud by rendering six different views looking at the 3D coordinate. The views are $\pm 90^{\circ}$ around the upwards vector through the 3D point. For the DINO flood fill, the similarity measure that we threshold is computed by first projecting the current DINO embedding onto the first PCA component of the top-down image, then taking the L2 norm of the difference between the current embedding and the DINO embedding at the initial 3D point.

NeRF Regularization: NeRF encounters difficulties in reconstructing texture-less planar surfaces, especially in the presence of specularities. This limitation is prominent in our table-top scenes, where the glossy surface and metallic objects can result in depth renderings with jagged missing regions. These missing regions can cause LERF renderings to spuriously activate and degrade the performance of grasp networks, so we apply depth regularization to mitigate this issue. We adopt the local depth ranking loss proposed in SparseNeRF and use ZoeDepth as the underlying depth model. We found this performs better than smoothness priors because it retains more fine-grained geometry. Additionally, we use the gradient scaling approach from Philip and Deschaintre, which significantly reduces the number of near-camera floaters and enables more robust grasping directly from point clouds rendered from the NeRF.

Poses obtained from cameras in motion are slightly inaccurate, which we found could result in oversmoothed geometry with depth regularization. To overcome this, we optimize the NeRF for the first 500 steps without any regularization to allow the camera poses to settle, then anneal the depth regularization loss term from $0\%$ to $100\%$ over the next 1500 steps. Interestingly, we find staged training not only preserves thin features better but also speeds LERF optimization. We hypothesize this is because supervising the language field on un-converged density in free space results in a poor network initialization, while beginning LERF optimization after geometry has been largely removed from free space allows a smoother learning signal.
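The staged schedule can be expressed as a weight on the depth regularization term. The linear ramp is an assumption of this sketch; the text states only the step counts and the endpoints of the anneal.

```python
def depth_reg_weight(step, warmup_steps=500, anneal_steps=1500, max_weight=1.0):
    """Multiplier on the depth regularization loss at a given training step.

    Zero during warmup, then a linear ramp (assumed shape) up to max_weight.
    """
    if step < warmup_steps:
        return 0.0
    frac = min((step - warmup_steps) / anneal_steps, 1.0)
    return max_weight * frac
```

Under this schedule the regularizer reaches full strength at step 2000, which matches the 2k training steps used in the experiments.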

Appendix C Grasping

Point Cloud Extraction: To extract a scene-wide point cloud for grasping, we use the method in Nerfstudio, which deprojects randomly sampled rays’ depth from the train camera poses, then filters with outlier rejection. We then crop the point cloud to the workspace of the robot. For object-centric point clouds, we deproject depth from views radially surrounding the object of interest.

Motion Planning: A grasp is considered feasible if the robot can perform a collision-free trajectory through the following poses: the pre-grasp, grasp, and post-grasp configurations. The pre-grasp pose is positioned 5 cm along the z-axis of the robot end effector, which allows the gripper to approach the target grasp pose with minimal additional motion. The post-grasp pose is located 10 cm above the grasp pose, along the z-axis of the world frame. The UR5’s IK solver calculates the set of viable joint configurations at these poses, and we calculate the trajectory as a linear interpolation between them. We additionally allow for a 180 degree rotation at the last wrist joint, as parallel-jaw grasps are rotationally symmetric. This facilitates the motion planning process, as the robot’s camera mount is prone to colliding with the robot arm.
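The waypoint construction can be sketched with homogeneous transforms. This sketch assumes the grasp pose is a 4x4 world-from-gripper matrix whose z-axis is the approach direction, and that the pre-grasp backs off against that direction; the sign convention is an assumption, and IK and collision checking are omitted.

```python
import numpy as np

def grasp_waypoints(grasp_pose, pre_offset=0.05, lift=0.10):
    """Pre-grasp, grasp, and post-grasp poses (4x4, world-from-gripper).

    Pre-grasp: backed off by pre_offset along the gripper z-axis
               (assumed to be the approach axis).
    Post-grasp: raised by lift along the world z-axis.
    """
    pre = grasp_pose.copy()
    pre[:3, 3] -= pre_offset * grasp_pose[:3, 2]
    post = grasp_pose.copy()
    post[2, 3] += lift
    return pre, grasp_pose, post

def flipped_grasp(grasp_pose):
    """Equivalent parallel-jaw grasp rotated 180 degrees about the gripper z-axis."""
    rot_z_180 = np.diag([-1.0, -1.0, 1.0, 1.0])
    return grasp_pose @ rot_z_180
```

A planner could evaluate both the original and the flipped grasp and keep whichever yields a collision-free trajectory.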

Appendix D Experiments

We use a UR5 arm with a Logitech BRIO webcam at 1600x896 resolution, with all camera settings frozen before each capture to prevent color discrepancies among images. The camera mount points orthogonally to the gripper axis to maximize the reachable workspace of the camera while pointing towards the workspace center. During robot capture, pre-computation of DINO, CLIP, and ZoeDepth is parallelized across 3 NVIDIA 4090 GPUs to achieve real-time performance, and all subsequent operations are carried out on a single 4090. Capturing a scene takes 30 seconds, training the LERF to 2k iterations takes 78 seconds, and finally querying LERF-TOGO takes 10 seconds.

D.2 LLM Interface

We provide the full prompt to the LLM below. For any given task and scene the OBJECT_LIST is replaced with a list of objects within the scene and TASK is replaced with the desired task: