Reading notes: First Order Motion Model for Image Animation

The paper addresses the problem of image animation. Given a source image and a driving video that contain objects of the same category, the proposed method makes the object in the source image move according to the motion of the object in the driving video.
The method only requires a collection of videos of same-category objects and needs no additional annotations.

Method

The method is based on a self-supervised strategy: the training video is reconstructed from a single frame of the video together with a learned motion representation, where the motion representation consists of motion-specific keypoints and local affine transformations. Note that because the method is self-supervised, the keypoints here are learned by the algorithm itself; unlike the keypoints in face landmark detection, they are not manually assigned specific semantic meanings.
(Figure: overall framework, consisting of the motion estimation module and the image generation module)
The framework, shown in the figure above, consists of two parts: a motion estimation module and an image generation module. The motion estimation module estimates a dense motion field from a frame $\mathbf D \in \mathbb R^{3\times H \times W}$ of the driving video to the source image $\mathbf S \in \mathbb R^{3\times H \times W}$. The motion field $\mathcal T_{\mathbf S \leftarrow \mathbf D}: \mathbb R^2 \rightarrow \mathbb R^2$ maps each pixel position in $\mathbf D$ to the corresponding position in $\mathbf S$, so $\mathcal T_{\mathbf S \leftarrow \mathbf D}$ is also called the backward optical flow. Backward optical flow is used instead of forward optical flow because backward warping can be implemented efficiently in a differentiable way using bilinear sampling.

Affine transformation

Let's first recall the affine transformation.
In homogeneous coordinates, an affine transformation can be written as:

$$\begin{bmatrix}\vec{y}\\1\end{bmatrix} = \begin{bmatrix}\mathbf B & \vec{b}\\0,\ldots,0 & 1\end{bmatrix}\begin{bmatrix}\vec{x}\\1\end{bmatrix}$$

Because the last row of the matrix only serves to complete the homogeneous form, an affine transformation on a 2D image is fully determined by the matrix $\mathbf A = [\mathbf B, \vec{b}] \in \mathbb R^{2 \times 3}$.
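For concreteness, here is a minimal NumPy sketch of this formula; the matrix values are arbitrary:

```python
import numpy as np

# Linear part B (rotation/scale/shear) and translation part b of A = [B, b].
B = np.array([[0.9, -0.1],
              [0.2,  1.1]])
b = np.array([5.0, -3.0])

A = np.hstack([B, b[:, None]])         # 2x3 affine matrix
A_h = np.vstack([A, [0.0, 0.0, 1.0]])  # 3x3 homogeneous form

x = np.array([2.0, 4.0])
x_h = np.append(x, 1.0)                # homogeneous coordinates [x; 1]

y_h = A_h @ x_h                        # [y; 1] = [[B, b], [0...0, 1]] [x; 1]
y = y_h[:2]
# The last row only completes the form: the result equals B x + b.
assert np.allclose(y, B @ x + b)
```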

Motion estimation module

The motion estimation module is divided into two parts.

Coarse motion estimation

Coarse motion estimation predicts the motion at the keypoints; that is, the backward optical flow $\mathcal T_{\mathbf S \leftarrow \mathbf D}$ is approximated by its first-order Taylor expansion in the neighborhood of the keypoints.

Suppose there is an abstract reference frame $\mathbf R$. We then need to estimate two transformations: from $\mathbf R$ to $\mathbf S$ ($\mathcal T_{\mathbf S \leftarrow \mathbf R}$) and from $\mathbf R$ to $\mathbf D$ ($\mathcal T_{\mathbf D \leftarrow \mathbf R}$). The advantage of the abstract reference frame is that it allows $\mathbf D$ and $\mathbf S$ to be processed independently.

For convenience, let $\mathbf X$ denote either $\mathbf S$ or $\mathbf D$, let $p_1,\cdots,p_K$ denote the keypoint coordinates in the abstract reference frame $\mathbf R$, and let $z$ denote point coordinates in the other frames. We estimate $\mathcal T_{\mathbf X \leftarrow \mathbf R}$ in the neighborhood of the keypoints $p_1,\cdots,p_K$. Specifically, we take the first-order Taylor expansion of $\mathcal T_{\mathbf X \leftarrow \mathbf R}$ at each keypoint $p_k$:

$$\mathcal T_{\mathbf X \leftarrow \mathbf R}(p)=\mathcal T_{\mathbf X \leftarrow \mathbf R}(p_k)+\left(\frac{d \mathcal T_{\mathbf X \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_k}\right)(p-p_k)+o(\|p-p_k\|)$$

This can be seen as an affine transformation $\mathbf A^k_{\mathbf X \leftarrow \mathbf R} \in \mathbb R^{2 \times 3}$, where $\mathcal T_{\mathbf X \leftarrow \mathbf R}(p_k)$ is the translation part and $\frac{d \mathcal T_{\mathbf X \leftarrow \mathbf R}(p)}{dp}\big|_{p=p_k}$ (the Jacobian) is the linear part.

$\mathcal T_{\mathbf X \leftarrow \mathbf R}$ is therefore represented by the set of $K$ keypoint values and Jacobians:

$$\mathcal T_{\mathbf X \leftarrow \mathbf R}(p) \approx \left\{\left\{\mathcal T_{\mathbf X \leftarrow\mathbf R}(p_1),\frac{d\mathcal T_{\mathbf X \leftarrow\mathbf R}(p)}{dp}\bigg|_{p=p_1}\right\}, \cdots,\left\{\mathcal T_{\mathbf X \leftarrow \mathbf R}(p_K),\frac{d \mathcal T_{\mathbf X \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_K}\right\}\right\}$$
We assume that $\mathcal T_{\mathbf X \leftarrow \mathbf R}$ is locally bijective in the neighborhood of each keypoint. Then for $\mathcal T_{\mathbf S \leftarrow \mathbf D}$ we have

$$\mathcal T_{\mathbf S \leftarrow \mathbf D}=\mathcal T_{\mathbf S \leftarrow \mathbf R} \circ \mathcal T^{-1}_{\mathbf D \leftarrow \mathbf R}$$

Taking the first-order Taylor expansion in the neighborhood of each keypoint gives

$$\mathcal T_{\mathbf S \leftarrow \mathbf D}(z) \approx \mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k) + J_k\left(z-\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)\right)$$
$$J_k=\left(\frac{d \mathcal T_{\mathbf S \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_k}\right)\left(\frac{d \mathcal T_{\mathbf D \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_k}\right)^{-1}$$

$\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k)$ and $\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)$ are predicted by a keypoint predictor network based on U-Net. It predicts one heatmap per keypoint, $K$ heatmaps in total. The last layer of the U-Net decoder uses softmax to produce a keypoint confidence map $\mathbf W^k$ for each keypoint, i.e., the confidence that the keypoint is at each pixel position, satisfying $\sum_{z \in \mathcal Z} \mathbf W^k(z)=1$, where $\mathcal Z$ denotes the set of all pixel positions.
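As a minimal sketch (with hypothetical variable names, not the authors' code), evaluating this approximation at a point $z$ could look like:

```python
import torch

# kp_s = T_{S<-R}(p_k), kp_d = T_{D<-R}(p_k): (2,) tensors
# jac_s, jac_d: the corresponding 2x2 Jacobians at p_k
def coarse_flow_at(z, kp_s, kp_d, jac_s, jac_d):
    # J_k = (dT_{S<-R}/dp |_{p_k}) (dT_{D<-R}/dp |_{p_k})^{-1}
    j_k = jac_s @ torch.inverse(jac_d)
    # T_{S<-D}(z) ~= T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k))
    return kp_s + j_k @ (z - kp_d)
```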
$\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k)$ and $\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)$ correspond to the translation part of an affine transformation. Note that they are two-dimensional ($z$ contains both $x$ and $y$). The translation parameters are obtained as a weighted average over the keypoint confidence map:

$$b^k = \sum_{z \in \mathcal Z} \mathbf W^k(z)\,z$$

$\frac{d \mathcal T_{\mathbf S \leftarrow \mathbf R}(p)}{dp}\big|_{p=p_k}$ and $\frac{d \mathcal T_{\mathbf D \leftarrow \mathbf R}(p)}{dp}\big|_{p=p_k}$ correspond to the linear part of an affine transformation. They are estimated as the remaining 4 parameters of the affine transformation using additional output channels of the keypoint predictor network, 4 extra channels per keypoint. Let $P^k_{ij} \in \mathbb R^{H \times W}$ denote one of these channels, where $i,j\in\{1,2\}$ index the entries of the $2\times 2$ linear part. The linear parameters are likewise fused as a weighted average over the keypoint confidence map:

$$\mathbf B^k[i,j] = \sum_{z \in \mathcal Z} \mathbf W^k(z)\,P^k_{ij}(z)$$
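A sketch of how these weighted averages might be computed from the network outputs; the function and tensor names are illustrative, not taken from the official repository:

```python
import torch
import torch.nn.functional as F

# logits : (B, K, H, W)    raw keypoint maps from the U-Net decoder
# p_maps : (B, K, 4, H, W) the 4 extra channels per keypoint (P^k_ij)
def extract_keypoints(logits, p_maps):
    b, k, h, w = logits.shape
    # softmax over all pixel positions -> confidence maps W^k, summing to 1
    w_maps = F.softmax(logits.view(b, k, -1), dim=-1).view(b, k, h, w)

    # pixel coordinate grid z, normalized to [-1, 1], in (y, x) order
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H,W,2)

    # b^k = sum_z W^k(z) z  -- a soft-argmax of the confidence map
    kp = torch.einsum("bkhw,hwc->bkc", w_maps, grid)        # (B, K, 2)

    # B^k[i,j] = sum_z W^k(z) P^k_ij(z)
    jac = torch.einsum("bkhw,bkchw->bkc", w_maps, p_maps)   # (B, K, 4)
    return kp, jac.view(b, k, 2, 2)
```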

Dense motion estimation

Dense motion estimation predicts the motion $\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}$ of every pixel of the whole image.

A convolutional network estimates $\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}$ from the Taylor expansions $\mathcal T_{\mathbf S \leftarrow \mathbf D}(z)$ at the $K$ keypoints and the source frame $\mathbf S$.

Warping the source frame $\mathbf S$ with the transformation at each keypoint yields $K$ transformed images $\mathbf S^1, \cdots, \mathbf S^K$. In addition, an extra image $\mathbf S^0 = \mathbf S$ is considered as the background. For each keypoint, a heatmap $\mathbf H_k(z)$ is computed to indicate where each transformation takes place:

$$\mathbf H_k(z) = \exp\left(\frac{(\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)-z)^2}{\sigma}\right) - \exp\left(\frac{(\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k)-z)^2}{\sigma}\right)$$

The heatmaps $\mathbf H_k$ and the images $\mathbf S^0, \cdots, \mathbf S^K$ are concatenated and fed into a dense motion network. The dense motion network estimates $K+1$ masks $\mathbf M_k, k = 0, \cdots, K$ indicating which local transformation is used at each position, satisfying $\sum_{k=0}^K \mathbf M_k(z)=1$. The dense motion field is then defined as:
$$\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}(z) = \mathbf M_0(z)\,z + \sum_{k=1}^K \mathbf M_k(z)\left(\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k) + J_k(z-\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k))\right)$$

Equivalently, in terms of the affine matrices:

$$\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}(z) = \mathbf M_0(z)\,z + \sum_{k=1}^K \mathbf M_k(z)\,\mathbf A^k_{\mathbf S \leftarrow \mathbf D}\begin{bmatrix}z\\1\end{bmatrix}$$
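A sketch of assembling this combination, under the assumption that the masks are already softmax-normalized; all names are illustrative:

```python
import torch

# masks      : (B, K+1, H, W)  M_0..M_K, summing to 1 over dim=1
# kp_s, kp_d : (B, K, 2)       T_{S<-R}(p_k), T_{D<-R}(p_k)
# jac        : (B, K, 2, 2)    J_k per keypoint
# grid       : (H, W, 2)       pixel coordinates z
def dense_flow(masks, kp_s, kp_d, jac, grid):
    b, k1, h, w = masks.shape
    k = k1 - 1
    z = grid.view(1, 1, h, w, 2)
    # local flows: T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k))
    diff = z - kp_d.view(b, k, 1, 1, 2)
    local = kp_s.view(b, k, 1, 1, 2) + torch.einsum(
        "bkij,bkhwj->bkhwi", jac, diff)                    # (B, K, H, W, 2)
    # M_0 keeps the background on the identity flow z;
    # M_1..M_K select among the K local affine flows.
    flow = masks[:, :1, ..., None] * z + (
        masks[:, 1:, ..., None] * local).sum(dim=1, keepdim=True)
    return flow.squeeze(1)                                 # (B, H, W, 2)
```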

Image generation module

1. Apply a warp operation, according to the predicted $\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}$, to the feature map $\xi \in \mathbb R^{H'\times W'}$ of $\mathbf S$ obtained after two downsampling convolutions.
2. When occlusions are present in $\mathbf S$, the generated frame $\mathbf D'$ cannot be obtained entirely by warping the source image; the occluded parts must be inpainted. Therefore, an occlusion map $\hat{\mathcal O}_{\mathbf S \leftarrow \mathbf D} \in [0,1]^{H'\times W'}$ is predicted to indicate the regions of the source image that need inpainting. The occlusion map is predicted by adding one more output layer after the dense motion network.

The transformed feature map can then be written as:

$$\xi' = \hat{\mathcal O}_{\mathbf S \leftarrow \mathbf D} \odot f_w(\xi, \hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D})$$

where $f_w$ denotes the backward-warping operation. The transformed feature map is fed into the subsequent layers of the image generation module, which finally generate the output image.
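In PyTorch, this deformation step could be sketched as follows, assuming the flow is already expressed in `grid_sample`'s normalized $[-1, 1]$ coordinate convention:

```python
import torch
import torch.nn.functional as F

# feat      : (B, C, H', W')  source features after two downsampling convs
# flow      : (B, H', W', 2)  predicted dense flow \hat{T}_{S<-D}
# occlusion : (B, 1, H', W')  occlusion map \hat{O}_{S<-D} in [0, 1]
def deform_features(feat, flow, occlusion):
    # f_w(xi, T): differentiable backward warp via bilinear sampling
    warped = F.grid_sample(feat, flow, align_corners=True)
    # xi' = O ⊙ f_w(xi, T): occluded regions are zeroed out for inpainting
    return occlusion * warped
```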

Training

The training loss consists of several terms. The first is a reconstruction loss based on a perceptual loss: a pre-trained VGG-19 network is used as a feature extractor, and the loss compares the feature differences between the reconstructed frames and the real frames of the driving video.
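A rough sketch of such a perceptual loss; the chosen VGG-19 layer indices are illustrative and not necessarily the paper's exact configuration:

```python
import torch
import torchvision

# Frozen pre-trained VGG-19 used purely as a feature extractor.
vgg = torchvision.models.vgg19(
    weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated, target, layers=(2, 7, 12, 21, 30)):
    # Compare intermediate activations of generated vs. real driving frames.
    loss, x, y = 0.0, generated, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            loss = loss + torch.mean(torch.abs(x - y))  # L1 on features
    return loss
```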

In addition, since keypoint learning is unsupervised, which can make it unstable, an equivariance constraint is introduced for the unsupervised keypoint learning. Suppose image $\mathbf X$ undergoes a known transformation $\mathcal T_{\mathbf X \leftarrow \mathbf Y}$ to give image $\mathbf Y$. The equivariance constraint requires:

$$\mathcal T_{\mathbf X \leftarrow \mathbf R} \equiv \mathcal T_{\mathbf X \leftarrow \mathbf Y} \circ \mathcal T_{\mathbf Y \leftarrow \mathbf R}$$

Taking the first-order Taylor expansion on both sides, an L1 loss is used to constrain the keypoint values and the Jacobians separately.
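A sketch of the two resulting L1 terms; `t_xy` and `t_xy_jac` stand for the known transformation and its Jacobian, and all names here are hypothetical:

```python
import torch

# kp_x, jac_x : keypoints / 2x2 Jacobians predicted on X
# kp_y, jac_y : keypoints / 2x2 Jacobians predicted on the deformed image Y
# t_xy(p)     : applies the known T_{X<-Y} to points p
# t_xy_jac(p) : its 2x2 Jacobian evaluated at p
def equivariance_losses(kp_x, jac_x, kp_y, jac_y, t_xy, t_xy_jac):
    # values: T_{X<-R}(p_k) should equal T_{X<-Y}(T_{Y<-R}(p_k))
    loss_kp = torch.abs(kp_x - t_xy(kp_y)).mean()
    # Jacobians: chain rule on both sides of the equivariance identity
    loss_jac = torch.abs(jac_x - t_xy_jac(kp_y) @ jac_y).mean()
    return loss_kp, loss_jac
```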

References

"First Order Motion Model for Image Animation"
"Motion Representations for Articulated Animation"
