97、Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields

Introduction

Paper address
A diffusion model is used to infer text-related images as content priors, a monocular depth estimation method provides geometric priors, and a progressive scene inpainting and updating strategy is introduced to ensure texture and geometry consistency between different views.

Implementation process

In simple terms:

The text-to-image diffusion model generates an initial image $I_0$. Warping $I_0$ produces multiple images on the same z plane, which form the support set $S_0$. Note that because $S_0$ is warped from $I_0$, it contains many holes, but an initial NeRF model can still be reconstructed from $S_0$.

The initial NeRF model is used to render images from new viewpoints. These renderings are incomplete, but they can be completed by the diffusion model. Note that, to maintain scene consistency, each new viewpoint is only a small offset away from $I_0$, so that the diffusion model can draw as much information as possible from $I_0$; the completed view is then used to update the NeRF model.

Because of the image warping, scale gaps and distance gaps inevitably arise (reflected as differences in the depth of the same spatial point across viewpoints). A depth alignment strategy is adopted to handle this.

Support Set

The DIBR method (Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV) is used to generate $S_0$.

Specifically:
The initial image $I_0$ is obtained from the diffusion model, and its depth map $D_0$ is obtained from a depth prediction network. Each pixel $q$ of $I_0$, together with its depth $z$, is warped into view $i$ using the intrinsic matrix $K$ and the camera pose $P_i$, which yields the support set $S_0$; a code sketch of this warping is given below.
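The warping step can be sketched as follows — a minimal depth-image-based rendering routine, assuming a pinhole camera, a 3×3 intrinsic matrix $K$, 4×4 camera-to-world poses, and naive forward splatting without occlusion handling. The function name and conventions are illustrative, not the paper's actual implementation.

```python
import numpy as np

def dibr_warp(image, depth, K, pose_src, pose_tgt):
    """Forward-warp `image` (H, W, 3) with per-pixel `depth` (H, W) from the
    source camera into the target camera. `pose_*` are 4x4 camera-to-world
    matrices. Returns the warped image and a validity mask (holes == False)."""
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1).astype(np.float64)

    # Back-project pixels to 3D points in the source camera frame, then to world.
    cam_pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    world_pts = pose_src @ np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])

    # Transform into the target camera frame and project with K.
    tgt_pts = np.linalg.inv(pose_tgt) @ world_pts
    z = tgt_pts[2]
    uv = (K @ tgt_pts[:3]) / np.clip(z, 1e-6, None)
    u, v = np.round(uv[0]).astype(int), np.round(uv[1]).astype(int)

    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    warped[v[valid], u[valid]] = image.reshape(-1, 3)[valid]
    mask[v[valid], u[valid]] = True
    return warped, mask
```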

To generate a 3D scene covering a large field of view, the camera positions are placed inside the radiance field, looking outward. However, unlike methods that place the camera looking inward, this setup cannot generate standalone 3D objects.

Taking the current camera position $P_0$ as the center, a surrounding circle of radius $r$ is drawn at the same z coordinate, and $n$ points are uniformly sampled on it as camera positions; the same camera orientation as the current view is used to generate the warped views of the support set. Typically $r = 0.2$ and $n = 8$, with offsets in the up, down, left, right, upper-left, lower-left, upper-right and lower-right directions, as sketched below.
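A minimal sketch of how the $n$ support-view camera centers around $P_0$ could be sampled. The assumption that the offsets lie in the x-y plane (keeping z fixed) and the function name are mine.

```python
import numpy as np

def sample_support_cameras(p0, r=0.2, n=8):
    """Sample n camera centers uniformly on a circle of radius r around p0,
    keeping the same z coordinate; with n = 8 the offsets correspond to the
    up/down/left/right and diagonal directions. Camera orientation is unchanged."""
    angles = 2.0 * np.pi * np.arange(n) / n
    offsets = np.stack([r * np.cos(angles), r * np.sin(angles), np.zeros(n)], axis=-1)
    return p0[None, :] + offsets  # (n, 3) support-view camera centers

# Example: the eight support-view positions around a camera at the origin.
support_centers = sample_support_cameras(np.zeros(3))
```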

At this point, the initial NeRF model can be reconstructed from the support set.

Text-Driven Inpainting

Renderings from any viewpoint other than the initial view $I_0$ will inevitably contain missing content. A text-driven image inpainting method based on the pre-trained diffusion model is used to fill these regions.

First, the image $I^R_k$ is rendered at a new viewpoint $P_1$. By comparing $I^R_k$ with the image obtained by warping $I_0$ to $P_1$, the mask $M_k$ of missing regions is obtained. The rendered image and the mask are then fed to the diffusion model, which inpaints them and thereby extends the scene content.
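One plausible way to build the missing-content mask $M_k$ — a hypothetical sketch, since the post only says the mask comes from comparing the rendering with $I_0$ warped to the new view: a pixel needs inpainting when neither the warp of $I_0$ nor the NeRF rendering provides content. The inputs and threshold below are assumptions.

```python
import numpy as np

def missing_content_mask(warp_valid, render_acc, acc_thresh=0.5):
    """Hypothetical mask construction.
    warp_valid: (H, W) bool -- pixels covered when warping I_0 into the new view P_1.
    render_acc: (H, W) float -- accumulated opacity of the NeRF rendering I^R_k.
    Returns M_k: True where the diffusion model must inpaint new content."""
    known = warp_valid | (render_acc > acc_thresh)
    return ~known
```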
However, the quality of a single diffusion sample is not guaranteed, so the inpainting is run multiple times, and CLIP's image encoder is used to compare each completed image with the initial image; the best candidate is selected. The paper uses 30 candidates. A sketch of this selection step is given below.
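A sketch of the candidate-selection step, assuming Hugging Face diffusers' Stable Diffusion inpainting pipeline and the CLIP model from transformers as stand-ins (the paper does not prescribe these exact checkpoints or libraries). Each candidate is scored by the cosine similarity between its CLIP image embedding and that of the initial image $I_0$.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting").to(device)   # example checkpoint
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_embed(pil_image):
    """L2-normalized CLIP image embedding, so a dot product is cosine similarity."""
    inputs = clip_proc(images=pil_image, return_tensors="pt").to(device)
    feat = clip.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def best_inpainted_view(prompt, rendered_view, mask, reference_i0, n_candidates=30):
    """Run the inpainting pipeline n_candidates times and keep the candidate
    whose CLIP embedding is closest to that of the initial image I_0."""
    ref = clip_embed(reference_i0)
    best, best_score = None, -float("inf")
    for _ in range(n_candidates):
        cand = inpaint(prompt=prompt, image=rendered_view, mask_image=mask).images[0]
        score = (clip_embed(cand) * ref).sum().item()
        if score > best_score:
            best, best_score = cand, score
    return best
```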

Depth Alignment

The inpainted image and the initial image will have depth conflicts in their overlapping regions, reflected as:
Scale gap: the depth of a given spatial point (e.g., a point on the sofa or the wall in the example) should be unique, but the depths predicted for different views differ in scale, so the spatial points fitted from different views are inconsistent.
Distance gap: besides the scale difference, the predicted depths of the two views also differ by an overall offset.

The paper globally aligns the two depth maps by compensating for average scale and distance differences.

For the $M$ corresponding points between the rendered image and the completed image, written as $\{(x^R_j, x^E_j)\}^M_{j=1}$, a mean scale factor $s$ and a depth offset $\delta$ are computed to approximate the mean scale and distance differences.


The scaled point is $\hat{x}^E_j = s \cdot x^E_j$, and $z(x)$ denotes the predicted depth of point $x$.

The global depth is then defined as $D^{global}_k = s \cdot D^E_k + \delta$, and the rendered depth is optimized to stay close to this global depth.
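A sketch of the depth alignment, under my assumption that $s$ is the mean ratio of rendered to estimated depths over the correspondences and $\delta$ is the mean residual after scaling; the paper's exact estimator may differ, but the final term simply pulls the rendered depth toward $D^{global}_k = s \cdot D^E_k + \delta$.

```python
import torch

def align_depths(z_rendered, z_estimated):
    """Estimate a global scale s and offset delta from corresponding depths
    z(x_j^R) (rendered view) and z(x_j^E) (inpainted view).
    Assumed estimator: s = mean ratio, delta = mean residual after scaling."""
    s = (z_rendered / z_estimated.clamp(min=1e-6)).mean()
    delta = (z_rendered - s * z_estimated).mean()
    return s, delta

def depth_alignment_loss(rendered_depth, estimated_depth, s, delta):
    """Pull the rendered depth map toward the aligned global depth
    D_global = s * D_E + delta (a simple L2 penalty is used here)."""
    d_global = s * estimated_depth + delta
    return ((rendered_depth - d_global) ** 2).mean()
```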

Progressive Inpainting and Updating

To ensure view consistency during scene inpainting and to avoid geometric and appearance ambiguity, a progressive inpainting and updating strategy is adopted, updating the radiance field view by view.

The radiance field is updated after each inpainting step. This means that previously inpainted content is reflected in subsequent renderings, and those regions are treated as known areas that are not inpainted again in other views. A pseudocode-style sketch of this loop is given below.
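The view-by-view loop can be summarized as follows. This is structure only: the callables stand for the steps described above (rendering, diffusion inpainting, depth alignment, NeRF update) and are not actual APIs from the paper or any library.

```python
def progressive_scene_generation(nerf, camera_poses, render_view, inpaint_view,
                                 align_depth, update_nerf):
    """Progressive inpainting-and-updating loop (illustrative structure).
    Each new view is rendered, its holes are inpainted, depths are aligned,
    and the radiance field is updated so that later views treat the newly
    inpainted content as known regions."""
    for pose in camera_poses:
        rendered, mask = render_view(nerf, pose)          # mask marks missing content
        inpainted = inpaint_view(rendered, mask)          # diffusion-based completion
        aligned_depth = align_depth(rendered, inpainted)  # scale/offset compensation
        nerf = update_nerf(nerf, inpainted, aligned_depth, pose)
    return nerf
```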

Inspired by Dream Fields (Zero-shot text-guided object generation with dream fields), a depth-aware transmittance loss $L_T$ is designed to encourage the NeRF network to produce zero density before a camera ray reaches the expected depth.
$m(t)$ is a mask: $m(t) = 1$ when $t < \hat{z}$ and $m(t) = 0$ otherwise, where $\hat{z}$ is the per-pixel depth value taken from the depth map $\hat{D}$, and $T(t)$ is the accumulated transmittance.
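A hedged sketch of a depth-aware transmittance loss in this spirit: it pushes the accumulated transmittance $T(t)$ toward 1 (i.e., pushes density toward zero) at all ray samples in front of the expected depth $\hat{z}$. The exact formula in the paper may differ; this only implements the behavior described above, with shapes and names assumed for illustration.

```python
import torch

def transmittance_loss(sigmas, t_vals, deltas, expected_depth):
    """sigmas: (R, S) predicted densities at S samples along each of R rays.
    t_vals: (R, S) sample distances along each ray.
    deltas: (R, S) spacing between consecutive samples.
    expected_depth: (R,) per-ray depth z_hat taken from the aligned depth map D_hat."""
    thickness = sigmas * deltas
    cum = torch.cumsum(thickness, dim=-1)
    # Exclusive cumulative sum: T(t_i) = exp(-sum_{j<i} sigma_j * delta_j)
    cum_excl = torch.cat([torch.zeros_like(cum[..., :1]), cum[..., :-1]], dim=-1)
    trans = torch.exp(-cum_excl)
    mask = (t_vals < expected_depth.unsqueeze(-1)).float()  # m(t) = 1 before z_hat
    # Penalize any loss of transmittance in front of the expected depth.
    return (mask * (1.0 - trans)).sum() / mask.sum().clamp(min=1.0)
```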

Effect



Origin blog.csdn.net/weixin_50973728/article/details/134574612