[3D editing] Seal-3D: NeRF-based interactive pixel-level editing



Project homepage : https://windingwind.github.io/seal-3d/
Code : https://github.com/windingwind/seal-3d/
Paper : https://arxiv.org/pdf/2307.15131

Summary

With the popularity of implicit neural representations (i.e., NeRF), there is an urgent need for editing methods that interact with implicit 3D models, for example to post-process reconstructed scenes or to create 3D content. Previous work is limited in editing flexibility, quality, and speed, and cannot respond to editing instructions with instant updates. The proposed Seal-3D allows users to edit NeRF models in a pixel-level, free manner with a wide range of NeRF-like backbones, and to preview the editing effect instantly. To achieve this, a proxy function maps the editing instructions to the original space of the NeRF model, and a teacher-student training strategy with local pre-training and global fine-tuning is employed. A NeRF editing system is built to demonstrate various editing types; compelling editing effects are achieved at an interactive speed of about 1 second.


1. Introduction

Benefiting from high reconstruction accuracy and relatively low memory consumption, NeRF and its variants have shown great potential in many 3D applications, such as 3D reconstruction, novel view synthesis, and virtual/augmented reality. There is therefore an urgent need for human-friendly editing tools to interact with these 3D models; objects reconstructed from the real world are also likely to contain artifacts due to noise in the captured data and limitations of the reconstruction algorithms.

Previous works have attempted to edit 3D scenes represented by NeRF, including object segmentation [19, 41 EditNeRF], object removal [18 NeRF-In], appearance editing [13 PaletteNeRF, 25 NeRF-Editing], object blending [7 Template-NeRF], etc. They mainly focus on coarse-grained object-level editing, and their convergence speed cannot meet the requirements of interactive editing. Some recent approaches [45 NeuMesh, 5 NeRF-Editing] convert NeRF editing into mesh editing by introducing a mesh as an editing proxy, which requires the user to operate an additional meshing tool and limits interactivity and user-friendliness.

Explicit 3D representations such as point clouds, textured meshes, and occupancy volumes store the geometry of objects and scenes explicitly; implicit representations instead use neural networks to query scene properties such as geometry and color. Existing 3D editing methods, exemplified by mesh-based ones, can alter an object's geometry and texture by directly manipulating the vertices of the target surface region. Editing implicit 3D models is indirect and challenging because there is no clear, interpretable correspondence between the visual content and the underlying representation. Furthermore, it is difficult to identify which network parameters correspond to a local region of the scene, which means that adapting network parameters may lead to undesired global changes. This makes fine-grained editing even more challenging.

This paper proposes Seal-3D, an interactive pixel-level editing method and system for 3D scenes under implicit neural representations (the name is borrowed from the seal tools of Adobe Photoshop). As shown in Fig. 1, the editing system provides four kinds of editing tools: 1) Bounding shape tool: transforms and scales the content inside a bounding box, like a copy-paste operation. 2) Brushing tool: paints a specified color over the selected area and can raise or lower the surface, like a paint brush or a scraper. 3) Anchor tool: lets the user freely move a control point and spatially affect its neighborhood according to the user's input. 4) Color tool: edits the color of the object's surface.

First, to establish the correspondence between explicit editing instructions and implicit network parameter updates, we propose a proxy function that maps the target 3D space (determined by the editing instructions from the interactive GUI) to the original 3D scene space, together with a teacher-student distillation strategy that updates the parameters using content supervision obtained from the original scene through the proxy function. Second, to achieve local editing, i.e. to confine the editing effect locally despite the non-local implicit representation, we propose a two-stage training process: the pre-training stage only updates the embedding features of the edited region while freezing the subsequent MLP decoder to prevent global degradation; the fine-tuning stage then updates both the embedding grid and the MLP decoder with a global photometric loss. With this design, the pre-training stage updates the locally edited features (it converges very quickly and renders the local edit in about 1 second), while the fine-tuning stage blends the locally edited region with the global structure and colors of the unedited space for view consistency.

2. Method

The framework of Seal-3D for interactive pixel-level editing is shown in Figure 2. It includes a pixel-level proxy mapping function, a teacher-student training framework, and a two-stage training strategy for the student NeRF network within this framework. The editing workflow starts with a proxy function that maps query points and ray directions according to user-specified editing rules. This feeds a NeRF-to-NeRF teacher-student distillation framework, where a teacher model equipped with the editing geometry and color mapping rules supervises the training of a student model (Section 2.2). The key to interactive fine-grained editing is the two-stage training of the student model (Section 2.3): an extra pre-training stage samples, computes, and caches points, ray directions, and the teacher's inferred ground truth within the edit space; only parameters with local effect are updated, while parameters that would cause global changes are frozen. After pre-training, a global training stage fine-tunes the student model.
[Figure 2: overview of the Seal-3D interactive editing framework]

2.1. Overview of NeRF-based editing problems

2.1.1 NeRF preliminaries: see my blog post [3D reconstruction] NeRF principle + code explanation

2.1.2 Challenges of NeRF-based editing

3D scenes are implicitly represented by network parameters, which lack interpretability and are difficult to manipulate. For scene editing, it is hard to find a mapping between explicit editing instructions and the implicit update of network parameters. Previous work has attempted to address this problem in several constrained ways:

NeRF-Editing and NeuMesh introduce a mesh scaffold as a geometry proxy to assist editing, which reduces the NeRF editing task to mesh modification. While this is compatible with existing mesh-based editing tools, the editing process requires extracting an additional mesh, which is cumbersome. Furthermore, the edited geometry is highly dependent on the mesh proxy structure, making it difficult to edit spaces that a mesh cannot easily represent, even though representing such spaces is a key strength of implicit representations. [Editing Conditional Radiance Fields] designs additional color and shape losses to supervise editing, but these losses operate only in 2D photometric space, which limits the editing ability on 3D NeRF models.

2.2. Editing guidance generation

Our design views NeRF editing as a process of knowledge distillation. Given a pre-trained NeRF network fitting a specific scene as the teacher network, we initialize an additional NeRF network as the student network with the pre-trained weights. The teacher network f_θT generates editing guidance from the editing instructions input by the user, while the student network f_θS is optimized by distilling editing knowledge from the guidance output by the teacher network.

First, the user's editing instructions are read from the interactive NeRF editor as pixel-level information. The source space S ⊂ R³ is the 3D space of the original NeRF model, and the target space T ⊂ R³ is the 3D space of the edited NeRF model. The target space T is warped back to the original space S by a mapping F_m : T → S, which transforms the points in the target space and their associated directions according to the editing rules. With this mapping, the "pseudo" desired editing effect c^T, σ^T of each 3D point and viewing direction can be obtained by querying the teacher NeRF model f_θT. The process can be expressed as:
(x^s, d^s) = F_m(x^t, d^t),    (c^T, σ^T) = f_θT(x^s, d^s)
where x^s, d^s denote the position and direction of a source-space point, and x^t, d^t denote the position and direction of a target-space point. For simplicity, this process can be written as the teacher prediction F^t := f_θT ∘ F_m : (x^t, d^t) → (c^T, σ^T).

The inference results c^T, σ^T simulate the edited scene and serve as teacher labels from which the student network extracts information during the optimization stage. The mapping rule F_m can be designed for any editing target (this paper implements four types of editing).
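As a minimal sketch (teacher_nerf and proxy_fn are assumed callables standing in for f_θT and F_m; this is not the authors' code), the teacher inference process F^t can be written as:

import torch

def teacher_inference(x_t, d_t, teacher_nerf, proxy_fn):
    # F^t := f_θT ∘ F_m
    # x_t, d_t: (N, 3) points and view directions in the edited (target) space
    x_s, d_s = proxy_fn(x_t, d_t)              # warp target space back to source space
    with torch.no_grad():                      # the teacher stays frozen
        c_T, sigma_T = teacher_nerf(x_s, d_s)
    return c_T, sigma_T                        # "pseudo" GT color and density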

  1. Bounding shape tool

This covers common functions of 3D editing software, including copy-paste, rotation, and resizing. The user provides a bounding shape to indicate the original space S to be edited, then rotates, flips, and scales the bounding box to indicate the desired effect. The target space T and the mapping function F_m are then resolved by the interface:

x^s = (R S)⁻¹ (x^t − c^t) + c^s,    d^s = R⁻¹ d^t

where R is the rotation, S is the scale, and c^s, c^t are the centers of S and T respectively. The tool even supports cross-scene object transfer, by introducing the NeRF of the transferred object as an additional teacher network responsible for the part of the teacher inference that falls inside the target region. Figure 7 shows the rendering results.
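A sketch of this bounding-shape proxy mapping, assuming the user transform is x^t = R S (x^s − c^s) + c^t (the exact convention in the paper may differ, e.g. in how anisotropic scale affects directions):

import torch

def bbox_proxy_mapping(x_t, d_t, R, S, c_s, c_t):
    # Undo the user's rotate/scale/translate to land back in the source box.
    # R: (3, 3) rotation (possibly with flips), S: (3, 3) scale, c_s/c_t: (3,) box centers
    M_inv = torch.linalg.inv(R @ S)
    x_s = (x_t - c_t) @ M_inv.T + c_s          # points: full inverse affine map
    d_s = d_t @ torch.linalg.inv(R).T          # directions: only the rotation is undone
    return x_s, d_s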

  2. Brushing tool

Similar to a sculpting brush, it raises or lowers the painted surface. The user draws a sketch with the brush, and S is generated by projecting rays through the brushed pixels. The brush normal n and a pressure value p(·) ∈ [0,1] are defined by the user and determine the mapping:
[Equation: brush-tool mapping F_m, defined by the brush normal n and pressure p(·)]
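The paper gives the exact formula; purely as an illustration of the idea (the maximum lift height h below is a hypothetical parameter, not from the paper), a raised target point can be explained by source content lying below it along the brush normal, in proportion to the pressure:

import torch

def brush_proxy_mapping(x_t, n, pressure_fn, h=0.05):
    # Illustrative only: map a raised target point back along the brush normal n.
    p = pressure_fn(x_t).unsqueeze(-1)     # (N, 1) pressure values in [0, 1]
    x_s = x_t - p * h * n                  # n: (3,) unit brush normal
    return x_s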

  3. Anchor tool

The user defines a control point x_c and a translation vector t. The area around x_c is warped by a stretch function stretch(·; x_c, t), and the mapping F_m is its inverse:

F_m(x^t) = stretch⁻¹(x^t; x_c, t)
See the supplementary material for the explicit expression of stretch(·; x_c, t).
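As a stand-in for the paper's stretch function (the Gaussian falloff and the fixed-point inversion below are illustrative choices, not the paper's), the idea can be sketched as:

import torch

def stretch(x, x_c, t, radius=0.2):
    # Move points near the control point x_c by t, with a smooth falloff.
    w = torch.exp(-((x - x_c) ** 2).sum(-1, keepdim=True) / radius ** 2)
    return x + w * t

def anchor_proxy_mapping(x_t, x_c, t, n_iter=20):
    # F_m = stretch^{-1}: solve stretch(x_s) = x_t by fixed-point iteration.
    x_s = x_t.clone()
    for _ in range(n_iter):
        x_s = x_t - (stretch(x_s, x_c, t) - x_s)
    return x_s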

  4. Color tool

Colors are edited via a color-space mapping (a single color or a texture), while the space mapping F_m is the identity. We directly map the colors output by the network in HSL space, which helps improve color consistency and preserves shading details (e.g., shadows) on the modified surface: the luminance offset (in HSL space) of the original surface color is transferred to the target surface color. Implementation details of this shadow-preservation strategy are given in the supplementary material.
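A per-pixel sketch of this shadow-preserving recoloring using Python's standard colorsys module (colorsys works in HLS order; treating the edit as a per-color operation with a known unshaded base color is an assumption made for illustration):

import colorsys

def recolor_preserve_shading(original_rgb, original_base_rgb, target_base_rgb):
    # Transfer the luminance offset of the shaded original color onto the target color,
    # so shadows and highlights survive the recoloring. Colors are (r, g, b) in [0, 1].
    _, l_orig, _ = colorsys.rgb_to_hls(*original_rgb)
    _, l_base, _ = colorsys.rgb_to_hls(*original_base_rgb)
    h_t, l_t, s_t = colorsys.rgb_to_hls(*target_base_rgb)
    l_new = min(max(l_t + (l_orig - l_base), 0.0), 1.0)   # carry the shading offset
    return colorsys.hls_to_rgb(h_t, l_new, s_t)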

2.3. Two-stage student training with instant preview

In the distillation training strategy, the teacher model provides editing guidance to the student model by directly applying a photometric loss between the accumulated pixel colors Ĉ and depths D̂ of the two models. This training process converges slowly (≈30 s or longer), so a two-stage training strategy is adopted: the first stage aims to converge almost instantly (within about 1 second) so that a rough editing result can be presented to the user as a preview; the second stage further refines the coarse preview into the final result.

1. Local pre-training for instant preview. Usually the edit space is relatively small, and training on the global photometric loss converges slowly. To achieve an instant preview of the edit, we run local pre-training before global training starts:

1) Uniformly sample a set of points X ⊂ T in the target space and directions D on the unit sphere, feed them into the teacher inference process F^t, obtain the teacher labels c^T, σ^T, and cache them in advance;
2) Train the student network with the local pre-training loss:

L_local = Σ_{x∈X, d∈D} [ λ1 ‖c^S(x, d) − c^T(x, d)‖ + λ2 ‖σ^S(x) − σ^T(x)‖ ]

where c^S, σ^S are the color and density predicted by the student network at the sampled points x ∈ X, and c^T, σ^T are the cached teacher labels. Pre-training takes only about 1 second, after which the student network already shows reasonable colors and shapes consistent with the editing instructions.

However, because the implicit neural network is non-local, training only on local points in the edited region may degrade other global regions unrelated to the edit. We observe that in hybrid implicit representations such as Instant-NGP, local information is mainly stored in the positional embedding grid, while the subsequent MLP decodes global information. Therefore, at this stage all parameters of the MLP decoder are frozen to prevent global degradation; see Figure 12.

[Figure 12: freezing the MLP decoder during local pre-training prevents degradation of unedited regions]
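A condensed sketch of this pre-training stage, assuming an Instant-NGP-style student exposing an embedding_grid and an mlp_decoder submodule (these attribute names are placeholders, not the repository's API):

import torch

def local_pretrain(student, teacher_inference_fn, bbox_min, bbox_max,
                   n_points=2**18, n_steps=500, lr=0.05):
    # 1) Sample points uniformly in the target region and random unit directions.
    x = torch.rand(n_points, 3) * (bbox_max - bbox_min) + bbox_min
    d = torch.nn.functional.normalize(torch.randn(n_points, 3), dim=-1)

    # 2) Query and cache the teacher labels once (the teacher stays frozen).
    with torch.no_grad():
        c_T, sigma_T = teacher_inference_fn(x, d)

    # 3) Freeze the MLP decoder; only the local embedding grid is optimized.
    for p in student.mlp_decoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(student.embedding_grid.parameters(), lr=lr)

    for _ in range(n_steps):
        c_S, sigma_S = student(x, d)
        loss = (c_S - c_T).abs().mean() + (sigma_S - sigma_T).abs().mean()  # λ1 = λ2 = 1
        opt.zero_grad()
        loss.backward()
        opt.step()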
2. Global fine-tuning.

After pre-training, we fine-tune f_θS to refine the coarse preview into a fully converged result. This stage is similar to standard NeRF training, except that the supervision labels are generated by the teacher inference process rather than taken from image pixels:

L_global = Σ_{r∈R} [ λ3 ‖Ĉ^S(r) − Ĉ^T(r)‖ + λ4 ‖D̂^S(r) − D̂^T(r)‖ ]

where R denotes the set of rays sampled in the mini-batch.
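A sketch of one fine-tuning step at the ray level, assuming student_render and teacher_render are callables that return accumulated color and depth per ray (the teacher side already includes the proxy mapping F_m), with λ3 = λ4 = 1 as in the experiments:

import torch

def global_finetune_step(student_render, teacher_render, rays_o, rays_d,
                         optimizer, lam3=1.0, lam4=1.0):
    with torch.no_grad():
        C_T, D_T = teacher_render(rays_o, rays_d)   # pseudo ground truth per ray
    C_S, D_S = student_render(rays_o, rays_d)

    loss = lam3 * (C_S - C_T).abs().mean() + lam4 * (D_S - D_T).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()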

It is worth noting that the student network can produce better-quality results than the teacher network it learns from. The mapping operation during teacher inference may introduce view-inconsistent artifacts into the pseudo-GT, but during distillation the student network automatically removes them, since multi-view training enforces view consistency, as shown in Figure 6.

[Figure 6: the student removes view-inconsistent artifacts present in the teacher's pseudo-GT]

3. Experiment

  1. Experiment setup

The experiments use Instant-NGP as the NeRF backbone of the editing framework. In the pre-training stage, λ1 = λ2 = 1 and the learning rate is fixed at 0.05; in the fine-tuning stage, λ3 = λ4 = 1 and the initial learning rate is 0.01.

The experiments use the synthetic NeRF Blender dataset as well as the real-world Tanks and Temples [12] and DTU [10] datasets.

  2. Results

Bounding shape editing results (Figures 4 and 6): [figure]

Brushing results: [figure]

Anchor (Figure 5) and color (Figure 1) results: [figure]

Comparison with NeuMesh: [figure]

4. Code (unfinished...)

Rendering code: in nerf/rendering.py, line 256, the run_cuda function obtains the color and depth of each ray:

# march the still-alive rays: returns sample positions, directions, and step sizes
xyzs, dirs, deltas = raymarching.march_rays(n_alive, n_step, rays_alive, rays_t, rays_o, rays_d, self.bound, self.density_bitfield, self.cascade, self.grid_size, nears, fars, 128, perturb if step == 0 else False, dt_gamma, max_steps)

raymarching.march_rays calls the forward method of the _march_rays(Function) class at line 297 of raymarching/raymarching.py.
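The compositing that follows march_rays happens in the CUDA extension; a pure-PyTorch sketch of what is accumulated per ray (illustrative, not the extension's actual implementation) looks like this:

import torch

def composite_ray(sigmas, rgbs, deltas, ts):
    # sigmas: (S,) densities, rgbs: (S, 3) colors,
    # deltas: (S,) step sizes, ts: (S,) sample distances along the ray
    alphas = 1.0 - torch.exp(-sigmas * deltas)                                   # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans                                                     # volume rendering weights
    color = (weights[:, None] * rgbs).sum(dim=0)                                 # accumulated RGB
    depth = (weights * ts).sum(dim=0)                                            # expected depth
    return color, depth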





Origin blog.csdn.net/qq_45752541/article/details/132172308