Reading notes: First Order Motion Model for Image Animation

The paper addresses the problem of image animation. Given a source image and a driving video that contain objects of the same category, the proposed method makes the object in the source image move according to the motion of the object in the driving video.
The method only requires a collection of videos of same-category objects and needs no additional annotations.

Method

The method is based on a self-supervised strategy: the training video is reconstructed from a single frame of the video together with a learned motion representation, where the motion representation consists of motion-specific keypoints and local affine transformations. Note that because the method is self-supervised, the keypoints here are learned by the algorithm itself; unlike the keypoints in face landmark detection, they are not manually assigned specific semantic meanings.
(Figure: overall framework, consisting of the motion estimation module and the image generation module)
The framework, shown in the figure above, consists of two parts: a motion estimation module and an image generation module. The motion estimation module estimates a dense motion field from a frame $\mathbf D \in \mathbb R^{3\times H \times W}$ of the driving video to the source image $\mathbf S \in \mathbb R^{3\times H \times W}$. The motion field $\mathcal T_{\mathbf S \leftarrow \mathbf D}: \mathbb R^2 \rightarrow \mathbb R^2$ maps each pixel position in $\mathbf D$ to the corresponding position in $\mathbf S$, so $\mathcal T_{\mathbf S \leftarrow \mathbf D}$ is also called the backward optical flow. Backward optical flow is used instead of forward optical flow because backward warping can be implemented efficiently in a differentiable way using bilinear sampling.

Affine transformation

Let's first recall the affine transformation.
In homogeneous coordinates, an affine transformation can be written as:

$$\begin{bmatrix}\vec{y}\\1\end{bmatrix} = \begin{bmatrix}\mathbf B & \vec{b}\\0,\ldots,0 & 1\end{bmatrix}\begin{bmatrix}\vec{x}\\1\end{bmatrix}$$

Because the last row of the matrix only serves to complete the homogeneous form, an affine transformation on a 2D image is fully determined by the matrix $\mathbf A = [\mathbf B, \vec{b}] \in \mathbb R^{2 \times 3}$.
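For concreteness, here is a minimal NumPy sketch of this formula; the matrix values are arbitrary:

```python
import numpy as np

# Linear part B (rotation/scale/shear) and translation part b of A = [B, b].
B = np.array([[0.9, -0.1],
              [0.2,  1.1]])
b = np.array([5.0, -3.0])

A = np.hstack([B, b[:, None]])         # 2x3 affine matrix
A_h = np.vstack([A, [0.0, 0.0, 1.0]])  # 3x3 homogeneous form

x = np.array([2.0, 4.0])
x_h = np.append(x, 1.0)                # homogeneous coordinates [x; 1]

y_h = A_h @ x_h                        # [y; 1] = [[B, b], [0...0, 1]] [x; 1]
y = y_h[:2]
# The last row only completes the form: the result equals B x + b.
assert np.allclose(y, B @ x + b)
```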

Motion estimation module

The motion estimation module is divided into two parts.

Coarse motion estimation

Coarse motion estimation predicts the motion at the keypoints; that is, the backward optical flow $\mathcal T_{\mathbf S \leftarrow \mathbf D}$ is approximated by its first-order Taylor expansion in the neighborhood of the keypoints.

Suppose there is an abstract reference frame $\mathbf R$. We then need to estimate two transformations: from $\mathbf R$ to $\mathbf S$ ($\mathcal T_{\mathbf S \leftarrow \mathbf R}$) and from $\mathbf R$ to $\mathbf D$ ($\mathcal T_{\mathbf D \leftarrow \mathbf R}$). The advantage of the abstract reference frame is that it allows $\mathbf D$ and $\mathbf S$ to be processed independently.

For convenience, let $\mathbf X$ denote either $\mathbf S$ or $\mathbf D$, let $p_1,\cdots,p_K$ denote the keypoint coordinates in the abstract reference frame $\mathbf R$, and let $z$ denote point coordinates in the other frames. We estimate $\mathcal T_{\mathbf X \leftarrow \mathbf R}$ in the neighborhood of the keypoints $p_1,\cdots,p_K$. Specifically, we take the first-order Taylor expansion of $\mathcal T_{\mathbf X \leftarrow \mathbf R}$ at each keypoint $p_k$:

$$\mathcal T_{\mathbf X \leftarrow \mathbf R}(p)=\mathcal T_{\mathbf X \leftarrow \mathbf R}(p_k)+\left(\frac{d \mathcal T_{\mathbf X \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_k}\right)(p-p_k)+o(\|p-p_k\|)$$

This can be seen as an affine transformation $\mathbf A^k_{\mathbf X \leftarrow \mathbf R} \in \mathbb R^{2 \times 3}$, where $\mathcal T_{\mathbf X \leftarrow \mathbf R}(p_k)$ is the translation part and $\frac{d \mathcal T_{\mathbf X \leftarrow \mathbf R}(p)}{dp}\big|_{p=p_k}$ (the Jacobian) is the linear part.

$\mathcal T_{\mathbf X \leftarrow \mathbf R}$ is therefore represented by the set of $K$ keypoint values and Jacobians:

$$\mathcal T_{\mathbf X \leftarrow \mathbf R}(p) \approx \left\{\left\{\mathcal T_{\mathbf X \leftarrow\mathbf R}(p_1),\frac{d\mathcal T_{\mathbf X \leftarrow\mathbf R}(p)}{dp}\bigg|_{p=p_1}\right\}, \cdots,\left\{\mathcal T_{\mathbf X \leftarrow \mathbf R}(p_K),\frac{d \mathcal T_{\mathbf X \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_K}\right\}\right\}$$
We assume that $\mathcal T_{\mathbf X \leftarrow \mathbf R}$ is locally bijective in the neighborhood of each keypoint. Then for $\mathcal T_{\mathbf S \leftarrow \mathbf D}$ we have

$$\mathcal T_{\mathbf S \leftarrow \mathbf D}=\mathcal T_{\mathbf S \leftarrow \mathbf R} \circ \mathcal T^{-1}_{\mathbf D \leftarrow \mathbf R}$$

Taking the first-order Taylor expansion in the neighborhood of each keypoint gives

$$\mathcal T_{\mathbf S \leftarrow \mathbf D}(z) \approx \mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k) + J_k\left(z-\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)\right)$$
$$J_k=\left(\frac{d \mathcal T_{\mathbf S \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_k}\right)\left(\frac{d \mathcal T_{\mathbf D \leftarrow \mathbf R}(p)}{dp}\bigg|_{p=p_k}\right)^{-1}$$

$\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k)$ and $\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)$ are predicted by a keypoint predictor network based on U-Net. It predicts one heatmap per keypoint, $K$ heatmaps in total. The last layer of the U-Net decoder uses softmax to produce a keypoint confidence map $\mathbf W^k$ for each keypoint, i.e., the confidence that the keypoint is at each pixel position, satisfying $\sum_{z \in \mathcal Z} \mathbf W^k(z)=1$, where $\mathcal Z$ denotes the set of all pixel positions.
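As a minimal sketch (with hypothetical variable names, not the authors' code), evaluating this approximation at a point $z$ could look like:

```python
import torch

# kp_s = T_{S<-R}(p_k), kp_d = T_{D<-R}(p_k): (2,) tensors
# jac_s, jac_d: the corresponding 2x2 Jacobians at p_k
def coarse_flow_at(z, kp_s, kp_d, jac_s, jac_d):
    # J_k = (dT_{S<-R}/dp |_{p_k}) (dT_{D<-R}/dp |_{p_k})^{-1}
    j_k = jac_s @ torch.inverse(jac_d)
    # T_{S<-D}(z) ~= T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k))
    return kp_s + j_k @ (z - kp_d)
```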
$\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k)$ and $\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)$ correspond to the translation part of an affine transformation. Note that they are two-dimensional ($z$ contains both $x$ and $y$). The translation parameters are obtained as a weighted average over the keypoint confidence map:

$$b^k = \sum_{z \in \mathcal Z} \mathbf W^k(z)\,z$$

$\frac{d \mathcal T_{\mathbf S \leftarrow \mathbf R}(p)}{dp}\big|_{p=p_k}$ and $\frac{d \mathcal T_{\mathbf D \leftarrow \mathbf R}(p)}{dp}\big|_{p=p_k}$ correspond to the linear part of an affine transformation. They are estimated as the remaining 4 parameters of the affine transformation using additional output channels of the keypoint predictor network, 4 extra channels per keypoint. Let $P^k_{ij} \in \mathbb R^{H \times W}$ denote one of these channels, where $i,j\in\{1,2\}$ index the entries of the $2\times 2$ linear part. The linear parameters are likewise fused as a weighted average over the keypoint confidence map:

$$\mathbf B^k[i,j] = \sum_{z \in \mathcal Z} \mathbf W^k(z)\,P^k_{ij}(z)$$
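A sketch of how these weighted averages might be computed from the network outputs; the function and tensor names are illustrative, not taken from the official repository:

```python
import torch
import torch.nn.functional as F

# logits : (B, K, H, W)    raw keypoint maps from the U-Net decoder
# p_maps : (B, K, 4, H, W) the 4 extra channels per keypoint (P^k_ij)
def extract_keypoints(logits, p_maps):
    b, k, h, w = logits.shape
    # softmax over all pixel positions -> confidence maps W^k, summing to 1
    w_maps = F.softmax(logits.view(b, k, -1), dim=-1).view(b, k, h, w)

    # pixel coordinate grid z, normalized to [-1, 1], in (y, x) order
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H,W,2)

    # b^k = sum_z W^k(z) z  -- a soft-argmax of the confidence map
    kp = torch.einsum("bkhw,hwc->bkc", w_maps, grid)        # (B, K, 2)

    # B^k[i,j] = sum_z W^k(z) P^k_ij(z)
    jac = torch.einsum("bkhw,bkchw->bkc", w_maps, p_maps)   # (B, K, 4)
    return kp, jac.view(b, k, 2, 2)
```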

Dense motion estimation

Dense motion estimation predicts the motion $\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}$ of every pixel of the whole image.

A convolutional network estimates $\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}$ from the Taylor expansions $\mathcal T_{\mathbf S \leftarrow \mathbf D}(z)$ at the $K$ keypoints and the source frame $\mathbf S$.

Warping the source frame $\mathbf S$ with the transformation at each keypoint yields $K$ transformed images $\mathbf S^1, \cdots, \mathbf S^K$. In addition, an extra image $\mathbf S^0 = \mathbf S$ is considered as the background. For each keypoint, a heatmap $\mathbf H_k(z)$ is computed to indicate where each transformation takes place:

$$\mathbf H_k(z) = \exp\left(\frac{(\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k)-z)^2}{\sigma}\right) - \exp\left(\frac{(\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k)-z)^2}{\sigma}\right)$$

The heatmaps $\mathbf H_k$ and the images $\mathbf S^0, \cdots, \mathbf S^K$ are concatenated and fed into a dense motion network. The dense motion network estimates $K+1$ masks $\mathbf M_k, k = 0, \cdots, K$ indicating which local transformation is used at each position, satisfying $\sum_{k=0}^K \mathbf M_k(z)=1$. The dense motion field is then defined as:
$$\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}(z) = \mathbf M_0(z)\,z + \sum_{k=1}^K \mathbf M_k(z)\left(\mathcal T_{\mathbf S \leftarrow \mathbf R}(p_k) + J_k(z-\mathcal T_{\mathbf D \leftarrow \mathbf R}(p_k))\right)$$

Equivalently, in terms of the affine matrices:

$$\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}(z) = \mathbf M_0(z)\,z + \sum_{k=1}^K \mathbf M_k(z)\,\mathbf A^k_{\mathbf S \leftarrow \mathbf D}\begin{bmatrix}z\\1\end{bmatrix}$$
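A sketch of assembling this combination, under the assumption that the masks are already softmax-normalized; all names are illustrative:

```python
import torch

# masks      : (B, K+1, H, W)  M_0..M_K, summing to 1 over dim=1
# kp_s, kp_d : (B, K, 2)       T_{S<-R}(p_k), T_{D<-R}(p_k)
# jac        : (B, K, 2, 2)    J_k per keypoint
# grid       : (H, W, 2)       pixel coordinates z
def dense_flow(masks, kp_s, kp_d, jac, grid):
    b, k1, h, w = masks.shape
    k = k1 - 1
    z = grid.view(1, 1, h, w, 2)
    # local flows: T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k))
    diff = z - kp_d.view(b, k, 1, 1, 2)
    local = kp_s.view(b, k, 1, 1, 2) + torch.einsum(
        "bkij,bkhwj->bkhwi", jac, diff)                    # (B, K, H, W, 2)
    # M_0 keeps the background on the identity flow z;
    # M_1..M_K select among the K local affine flows.
    flow = masks[:, :1, ..., None] * z + (
        masks[:, 1:, ..., None] * local).sum(dim=1, keepdim=True)
    return flow.squeeze(1)                                 # (B, H, W, 2)
```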

Image generation module

1. Apply a warp operation, according to the predicted $\hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D}$, to the feature map $\xi \in \mathbb R^{H'\times W'}$ of $\mathbf S$ obtained after two downsampling convolutions.
2. When occlusions are present in $\mathbf S$, the generated frame $\mathbf D'$ cannot be obtained entirely by warping the source image; the occluded parts must be inpainted. Therefore, an occlusion map $\hat{\mathcal O}_{\mathbf S \leftarrow \mathbf D} \in [0,1]^{H'\times W'}$ is predicted to indicate the regions of the source image that need inpainting. The occlusion map is predicted by adding one more output layer after the dense motion network.

The transformed feature map can then be written as:

$$\xi' = \hat{\mathcal O}_{\mathbf S \leftarrow \mathbf D} \odot f_w(\xi, \hat{\mathcal T}_{\mathbf S \leftarrow \mathbf D})$$

where $f_w$ denotes the backward-warping operation. The transformed feature map is fed into the subsequent layers of the image generation module, which finally generate the output image.
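In PyTorch, this deformation step could be sketched as follows, assuming the flow is already expressed in `grid_sample`'s normalized $[-1, 1]$ coordinate convention:

```python
import torch
import torch.nn.functional as F

# feat      : (B, C, H', W')  source features after two downsampling convs
# flow      : (B, H', W', 2)  predicted dense flow \hat{T}_{S<-D}
# occlusion : (B, 1, H', W')  occlusion map \hat{O}_{S<-D} in [0, 1]
def deform_features(feat, flow, occlusion):
    # f_w(xi, T): differentiable backward warp via bilinear sampling
    warped = F.grid_sample(feat, flow, align_corners=True)
    # xi' = O ⊙ f_w(xi, T): occluded regions are zeroed out for inpainting
    return occlusion * warped
```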

Training

The training loss consists of several terms. The first is a reconstruction loss based on a perceptual loss: a pre-trained VGG-19 network is used as a feature extractor, and the loss compares the feature differences between the reconstructed frames and the real frames of the driving video.
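A rough sketch of such a perceptual loss; the chosen VGG-19 layer indices are illustrative and not necessarily the paper's exact configuration:

```python
import torch
import torchvision

# Frozen pre-trained VGG-19 used purely as a feature extractor.
vgg = torchvision.models.vgg19(
    weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated, target, layers=(2, 7, 12, 21, 30)):
    # Compare intermediate activations of generated vs. real driving frames.
    loss, x, y = 0.0, generated, target
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            loss = loss + torch.mean(torch.abs(x - y))  # L1 on features
    return loss
```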

In addition, since keypoint learning is unsupervised, which can make it unstable, an equivariance constraint is introduced for the unsupervised keypoint learning. Suppose image $\mathbf X$ undergoes a known transformation $\mathcal T_{\mathbf X \leftarrow \mathbf Y}$ to give image $\mathbf Y$. The equivariance constraint requires:

$$\mathcal T_{\mathbf X \leftarrow \mathbf R} \equiv \mathcal T_{\mathbf X \leftarrow \mathbf Y} \circ \mathcal T_{\mathbf Y \leftarrow \mathbf R}$$

Taking the first-order Taylor expansion on both sides, an L1 loss is used to constrain the keypoint values and the Jacobians separately.
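A sketch of the two resulting L1 terms; `t_xy` and `t_xy_jac` stand for the known transformation and its Jacobian, and all names here are hypothetical:

```python
import torch

# kp_x, jac_x : keypoints / 2x2 Jacobians predicted on X
# kp_y, jac_y : keypoints / 2x2 Jacobians predicted on the deformed image Y
# t_xy(p)     : applies the known T_{X<-Y} to points p
# t_xy_jac(p) : its 2x2 Jacobian evaluated at p
def equivariance_losses(kp_x, jac_x, kp_y, jac_y, t_xy, t_xy_jac):
    # values: T_{X<-R}(p_k) should equal T_{X<-Y}(T_{Y<-R}(p_k))
    loss_kp = torch.abs(kp_x - t_xy(kp_y)).mean()
    # Jacobians: chain rule on both sides of the equivariance identity
    loss_jac = torch.abs(jac_x - t_xy_jac(kp_y) @ jac_y).mean()
    return loss_kp, loss_jac
```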

References

"First Order Motion Model for Image Animation"
"Motion Representations for Articulated Animation"
