PoseFormer: Video-based 2D-to-3D single-person pose estimation

Paper link: 3D Human Pose Estimation with Spatial and Temporal Transformers
Paper code: https://github.com/zczcwh/PoseFormer
Paper source: ICCV 2021
Paper affiliation: University of Central Florida, USA

Summary

  • The Transformer architecture has become the model of choice in natural language processing and is now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation.
  • However, in the field of human pose estimation, convolutional architectures still dominate.
  • In this work, we present PoseFormer, a purely Transformer-based method for 3D human pose estimation in videos that does not involve convolutional architectures.
  • Inspired by the latest development of visual Transformers, we design a spatio-temporal Transformer structure to comprehensively model the human joint relationships within each frame and the temporal correlation between frames, and then output the accurate 3D human pose of the center frame.
  • We evaluate our method quantitatively and qualitatively on two popular and standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets.

1 Introduction

  • Human pose estimation (HPE) aims to locate joints and build body representations (e.g. bone positions) from input data such as images and videos.
  • HPE provides geometric and motion information of the human body, which can be used in a wide range of applications (such as human-computer interaction, motion analysis, healthcare).
  • Current work can be roughly divided into two categories: (1) direct estimation methods and (2) 2D-to-3D lifting methods.
  • Direct estimation methods infer 3D human poses from 2D images or video frames without the need for intermediate estimation of 2D pose representations.
  • 2D-to-3D lifting methods infer 3D human poses from intermediate estimated 2D poses.
  • Thanks to the excellent performance of state-of-the-art 2D pose detectors, 2D-to-3D lifting methods often outperform direct estimation methods.
  • However, this 2D-to-3D mapping is non-trivial: due to depth ambiguity and occlusion, the same 2D pose can correspond to a variety of potential 3D poses.
  • To alleviate these issues and maintain natural coherence, many recent works integrate temporal information from videos into their methods. However, CNN-based methods often rely on dilation techniques with inherently limited temporal connectivity, while recurrent networks are mainly limited by simple sequential correlations.
  • Recently, the Transformer has become the de facto model for natural language processing (NLP) due to its efficiency, scalability, and powerful modeling capabilities. Thanks to its self-attention mechanism, global correlations across long input sequences can be captured explicitly. This makes the architecture particularly suitable for sequence-data problems, and it therefore extends naturally to 3D HPE.
  • With its comprehensive connectivity and expression, Transformer provides an opportunity to learn stronger temporal representations across frames.
  • However, recent research shows that Transformers require specific design to achieve comparable performance to their CNN counterparts in vision tasks. Specifically, they often require very large training datasets or, if applied to smaller datasets, enhanced data augmentation and regularization.
  • Furthermore, existing vision transformers are mainly limited to image classification, object detection, and segmentation, but how to harness the power of transformers for 3D HPE remains unclear.
  • To answer this question, we first apply the Transformer directly to 2D-to-3D lifting HPE. In this case, we treat the entire 2D pose of each frame in a given sequence as a token (Figure 1(a)). While this baseline approach is effective to a certain extent, it ignores the natural spatial relationships between joints (joint-to-joint).
    [Figure 1: token design comparison. (a) the whole 2D pose of each frame as one token; (b) each joint as a separate token]
  • A natural extension to this baseline is to treat each 2D joint coordinate as a token, providing input consisting of the joints of all frames in the sequence (Figure 1(b)). However, in this case the number of tokens grows rapidly for long frame sequences (in 3D HPE, sequences of up to 243 frames with 17 joints per frame are common, giving 243×17 = 4131 tokens). Since the Transformer computes direct attention between every pair of tokens, the memory requirements of the model become unreasonable (a quick arithmetic sketch follows this list).
  • Therefore, as an effective solution to these challenges, we propose PoseFormer, the first purely Transformer-based network for 2D-to-3D lifting HPE in videos.
  • PoseFormer directly models both the spatial and temporal aspects with separate Transformer modules for the two dimensions.
  • PoseFormer not only produces powerful representations between spatial and temporal elements, but also does not produce large token counts for long input sequences.
  • At a high level, PoseFormer simply takes a sequence of 2D poses detected by an off-the-shelf 2D pose estimator and outputs the 3D pose of the center frame.
  • More specifically, we build a spatial Transformer module to encode the local relationships between 2D joints in each frame. The spatial self-attention layer considers the position information of the two-dimensional joints and returns the latent feature representation of the frame. Next, our temporal Transformer module analyzes the global dependencies between each spatial feature representation and generates accurate 3D pose estimates.
  • Experimental evaluations on two popular 3D HPE benchmarks (Human3.6M and MPI-INF-3DHP) show that PoseFormer achieves state-of-the-art performance on both datasets. We compare our estimated 3D poses with those of state-of-the-art methods and find that PoseFormer produces smoother and more reliable results. Additionally, visualization and analysis of PoseFormer attention maps are provided in the ablation study to understand the inner workings of the model and demonstrate its effectiveness.
  • Our contributions are threefold:
    (1) We propose PoseFormer, the first purely Transformer-based model for 2D-to-3D lifting in 3D HPE.
    (2) We design an effective spatio-temporal Transformer model, where the spatial Transformer module encodes the local relationships between human body joints, while the temporal Transformer module captures the global dependencies across frames in the entire sequence.
    (3) Our PoseFormer model achieves state-of-the-art results on the Human3.6M and MPI-INF-3DHP datasets.
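As a quick back-of-the-envelope check of the token-count discussion around Figure 1(b), the sketch below (plain Python; the 243-frame, 17-joint numbers are taken from the text above) compares how many pairwise attention entries the two token designs of Figure 1 produce per layer:

```python
# Token counts for the two designs in Figure 1; frame/joint counts from the text.
frames, joints = 243, 17

tokens_a = frames                         # (a) one token per frame's whole 2D pose
tokens_b = frames * joints                # (b) one token per joint = 4131 tokens

attn_entries_a = tokens_a ** 2            # 59,049 pairwise attention entries
attn_entries_b = tokens_b ** 2            # 17,065,161 pairwise attention entries

print(attn_entries_b // attn_entries_a)   # 289x more entries per attention layer
```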

2. Related Works

  • Here, we specifically summarize 3D single-person, single-view HPE methods.
  • Direct estimation methods: infer 3D human poses from 2D images without intermediate estimation of a 2D pose representation.
  • 2D-to-3D lifting methods: use an estimated 2D pose as input to generate the corresponding 3D pose; this approach is more popular among recent methods in this field. Any off-the-shelf 2D pose estimator is readily compatible with these methods.

2.1 2D-to-3D Lifting HPE

  • 2D to 3D lifting methods utilize 2D poses estimated from input images or video frames.
  • OpenPose, CPN, AlphaPose and HRNet are widely used as 2D pose detectors.
  • Based on this intermediate representation, a variety of methods can be used to generate 3D poses.
  • However, previous state-of-the-art methods rely on dilated temporal convolutions to capture global dependencies, which inherently limits temporal connectivity.
  • Furthermore, most of these works use simple operations to project joint coordinates into latent space without considering the kinematic correlation of human joints.

2.2 GNNs in 3D HPE

  • Naturally, human posture can be represented as a graph, where joints are nodes and bones are edges.
  • Graph neural networks (GNNs) have also been applied to the 2D-to-3D pose lifting problem and provided good performance.
  • For our PoseFormer, the transformer can be viewed as a type of GNN with unique and often advantageous graph operations.
  • Specifically, a transformer encoder module essentially forms a fully connected graph where edge weights are calculated using input-conditioned, multi-head self-attention.
  • The operation also includes normalization of node features, feed-forward aggregators across attention head outputs, and residual connections that enable efficient scaling of stacked layers.
  • Such operations are advantageous compared to other graph operations. For example, the strength of the connections between nodes is determined by the transformer's self-attention mechanism rather than being predefined through an adjacency matrix, as is typical of the GCN-based recipes used for this task. This gives the model the flexibility to adjust the relative importance of joints based on each input pose.
  • Furthermore, the transformer's integrated scaling and normalization components may be beneficial in mitigating the over-smoothing effects that plague many GNN operational variants when multiple layers are stacked together.
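To make the soft-adjacency view above concrete, here is a minimal sketch (PyTorch; shapes and random weights are purely illustrative and not taken from the PoseFormer code) of how single-head self-attention over one frame's joint tokens yields a dense, input-conditioned J×J edge-weight matrix:

```python
import torch
import torch.nn.functional as F

J, d = 17, 32                      # joints (graph nodes) and feature dimension
x = torch.randn(J, d)              # one frame's joint embeddings (illustrative)

Wq, Wk = torch.randn(d, d), torch.randn(d, d)   # query/key projections
q, k = x @ Wq, x @ Wk

# Dense J x J edge weights, recomputed for every input pose, unlike a fixed,
# predefined adjacency matrix in a typical GCN formulation.
soft_adjacency = F.softmax(q @ k.t() / d ** 0.5, dim=-1)
print(soft_adjacency.shape)        # torch.Size([17, 17]); each row sums to 1
```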

2.3 Vision Transformers

  • Recently, there has been an emerging interest in applying Transformers to vision tasks.
  • DEtection TRansformer (DETR) is used for object detection and panoptic segmentation.
  • Vision Transformer (ViT), a pure Transformer architecture, achieves state-of-the-art performance in image classification.
  • TransPose, based on the Transformer architecture, estimates 2D poses from images.
  • MEsh TRansfOrmer (METRO) combines a CNN with a Transformer network to reconstruct 3D pose and mesh vertices from a single image.
  • The spatiotemporal Transformer architecture of our method exploits the keypoint correlation in each frame and preserves the natural temporal consistency in the video.

3. Method

  • Pipeline: obtain the 2D pose of each frame through an off-the-shelf 2D pose detector, use the 2D pose sequence of consecutive frames as input, and estimate the 3D pose of the central frame.

3.1 Temporal Transformer Baseline

  • As a baseline application of 2D-to-3D lifting, we treat each 2D pose as an input token and use a Transformer to capture the global dependencies between inputs, as shown in Figure 2(a).
    [Figure 2: (a) the temporal Transformer baseline; (b) the PoseFormer spatial-temporal architecture]
  • We will call each input token a patch, similar in terminology to ViT.
  • For the input sequence X ∈ R^(f×(J·2)), f is the number of frames in the input sequence, J is the number of joints in each frame's 2D pose, and 2 corresponds to the joint coordinates in 2D space.
  • Patch embedding is a trainable linear projection layer that embeds each patch into high-dimensional features.
  • Transformer networks utilize positional embeddings to preserve the positional information of sequences.
  • Self-attention is the core function of Transformer, which associates different positions of the input sequence with embedded features.
  • Our Transformer encoder consists of multi-head self-attention blocks and multi-layer perceptron (MLP) blocks. LayerNorm is applied before each block and residual connections are applied after each block.
  • To predict the three-dimensional pose of the center frame, the encoder output Y ∈ R^(f×C) is shrunk to a vector y ∈ R^(1×C) by averaging over the frame dimension. Finally, an MLP block regresses the output to y ∈ R^(1×(J·3)), the 3D pose of the center frame.
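Below is a minimal, self-contained sketch of this temporal Transformer baseline (PyTorch). The class name, the 81-frame receptive field, the embedding size C, and the layer/head counts are illustrative assumptions, not the paper's hyper-parameters:

```python
import torch
import torch.nn as nn

class TemporalBaseline(nn.Module):
    """Each token is an entire 2D pose, so attention is only f x f."""
    def __init__(self, f=81, J=17, C=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(J * 2, C)              # one 2D pose per token
        self.pos_embed = nn.Parameter(torch.zeros(1, f, C)) # positional embedding
        layer = nn.TransformerEncoderLayer(d_model=C, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(C), nn.Linear(C, J * 3))

    def forward(self, x):                       # x: (B, f, J, 2) detected 2D poses
        B, f, J, _ = x.shape
        tokens = self.patch_embed(x.flatten(2)) + self.pos_embed   # (B, f, C)
        y = self.encoder(tokens)                                   # (B, f, C)
        y = y.mean(dim=1)                       # average over the frame dimension
        return self.head(y).view(B, J, 3)       # 3D pose of the center frame

out = TemporalBaseline()(torch.randn(2, 81, 17, 2))
print(out.shape)                                # torch.Size([2, 17, 3])
```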

3.2 PoseFormer: Spatial-Temporal Transformer

  • We observe that the temporal Transformer baseline mainly focuses on global dependencies between frames in the input sequence; it uses a linear patch embedding to project each frame's joint coordinates to the hidden dimension.
  • However, since a simple linear projection layer cannot learn attention information, the kinematic information among local joint coordinates is not strongly represented in the temporal Transformer baseline.
  • A potential solution is to treat each joint coordinate as a separate patch and feed all frames' joints as input to the Transformer (see Figure 1(b)).
  • However, the number of patches grows rapidly (the number of frames f times the number of joints J), resulting in a model computational complexity of O((f·J)²).
  • To effectively learn local joint correlations, we use two separate Transformers for spatial and temporal information, respectively.
  • As shown in Figure 2(b), PoseFormer consists of three modules: a spatial transformer module, a temporal transformer module, and a regression head module.
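As a rough, illustrative comparison of per-layer attention cost (constants and feature dimensions ignored), the spatial/temporal split keeps the attention matrices small: one J×J matrix per frame plus a single f×f matrix, instead of one (f·J)×(f·J) matrix:

```python
# Per-layer attention entries, reusing the 243-frame, 17-joint setting above.
f, J = 243, 17

joint_tokens_everywhere = (f * J) ** 2        # Figure 1(b): ~17.1M entries per layer
spatial_then_temporal = f * J ** 2 + f ** 2   # J x J per frame, then f x f: ~129k

print(joint_tokens_everywhere // spatial_then_temporal)   # roughly 130x fewer
```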

Spatial Transformer Module

  • The Spatial Transformer Module extracts high-dimensional feature embeddings from a single frame. Given a 2D pose with J joints, we treat each joint (i.e., its two coordinates) as a patch and perform feature extraction among all patches following the common vision Transformer pipeline.
  • First, we map the coordinates of each joint into a high-dimensional space using a trainable linear projection, which is called spatial patch embedding.
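A minimal sketch of this spatial module is given below (PyTorch). The class name, the per-joint embedding size c, and the depth/head counts are assumptions for illustration, not the paper's settings:

```python
import torch
import torch.nn as nn

class SpatialTransformer(nn.Module):
    """Per-frame module: each joint (two coordinates) is one token."""
    def __init__(self, J=17, c=32, depth=4, heads=8):
        super().__init__()
        self.joint_embed = nn.Linear(2, c)                   # spatial patch embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, J, c))  # spatial position embedding
        layer = nn.TransformerEncoderLayer(d_model=c, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                  # x: (B*f, J, 2) joints of each frame
        tokens = self.joint_embed(x) + self.pos_embed        # (B*f, J, c)
        tokens = self.encoder(tokens)      # joint-to-joint attention within a frame
        return tokens.flatten(1)           # (B*f, J*c) per-frame feature vector

feat = SpatialTransformer()(torch.randn(4, 17, 2))
print(feat.shape)                          # torch.Size([4, 544])
```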

Temporal Transformer Module

  • Since the Spatial Transformer Module encodes high-dimensional features for each individual frame, the Temporal Transformer Module aims to model dependencies across a sequence of frames.
  • Before the Temporal Transformer Module, we added a learnable temporal position embedding to preserve the position information of the frame.
  • For the Temporal Transformer Module encoder, we adopt the same architecture as the Spatial Transformer Module encoder, which consists of multi-head self-attention blocks and MLP blocks.
  • The output of the Temporal Transformer Module is Y ∈ R^(f×(J·c)).
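A minimal sketch of this temporal module (PyTorch; dimensions and layer counts are illustrative assumptions), which consumes one J·c feature vector per frame from the spatial module:

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Models dependencies across the f per-frame spatial features."""
    def __init__(self, f=81, J=17, c=32, depth=4, heads=8):
        super().__init__()
        dim = J * c
        self.temporal_pos_embed = nn.Parameter(torch.zeros(1, f, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z):                  # z: (B, f, J*c) per-frame spatial features
        z = z + self.temporal_pos_embed    # learnable temporal position embedding
        return self.encoder(z)             # Y: (B, f, J*c)

Y = TemporalTransformer()(torch.randn(2, 81, 17 * 32))
print(Y.shape)                             # torch.Size([2, 81, 544])
```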

Regression Head

  • Since we use a sequence of frames to predict the 3D pose of the center frame, the output of the Temporal Transformer Module, Y ∈ R^(f×(J·c)), needs to be reduced to y ∈ R^(1×(J·c)).
  • We achieve this by applying a weighted averaging operation (using learned weights) over the frame dimension.
  • Finally, a simple MLP block with LayerNorm and a linear layer regresses the output to y ∈ R^(1×(J·3)), the predicted 3D pose of the center frame.
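A minimal sketch of this regression head (PyTorch) is shown below; realizing the learned weighted average as a 1×1 convolution over the frame dimension is one common implementation choice, assumed here for illustration:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Learned weighted average over frames, then LayerNorm + linear to J*3."""
    def __init__(self, f=81, J=17, c=32):
        super().__init__()
        self.J = J
        self.weighted_mean = nn.Conv1d(f, 1, kernel_size=1)   # learned frame weights
        self.mlp = nn.Sequential(nn.LayerNorm(J * c), nn.Linear(J * c, J * 3))

    def forward(self, Y):                  # Y: (B, f, J*c)
        y = self.weighted_mean(Y)          # (B, 1, J*c), weighted average over frames
        y = self.mlp(y.squeeze(1))         # (B, J*3)
        return y.view(-1, self.J, 3)       # predicted 3D pose of the center frame

pose = RegressionHead()(torch.randn(2, 81, 17 * 32))
print(pose.shape)                          # torch.Size([2, 17, 3])
```

Chaining the three sketches, reshaping the input (B, f, J, 2) to (B·f, J, 2) for the spatial module, back to (B, f, J·c) for the temporal module, and finally through this head, mirrors the three-module pipeline of Figure 2(b).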

Loss Function

  • To train our spatial-temporal Transformer model, we use the standard MPJPE (Mean Per Joint Position Error) loss to minimize the error between the predicted and ground-truth poses:
    L = (1/J) Σ_{k=1}^{J} ‖p̃_k − p_k‖₂, where p̃_k and p_k denote the predicted and ground-truth 3D positions of joint k.
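A minimal sketch of this standard MPJPE loss (PyTorch; the function name and the batch × joints × 3 tensor layout are assumptions for illustration):

```python
import torch

def mpjpe_loss(pred, gt):
    # pred, gt: (B, J, 3) predicted and ground-truth joint positions
    return torch.norm(pred - gt, dim=-1).mean()   # mean per-joint Euclidean distance

loss = mpjpe_loss(torch.randn(4, 17, 3), torch.randn(4, 17, 3))
```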

4. Dataset

4.1 Human3.6M

  • Human3.6M is the most widely used indoor dataset for 3D single-person HPE.
  • 11 professional actors perform 17 actions such as sitting, walking, and talking on the phone.
  • Each subject was videotaped from 4 different angles in an indoor environment.
  • This dataset contains 3.6 million video frames with 3D ground truth annotations captured by a precise marker-based motion capture system.
  • Following previous work, we adopt the same experimental setup: all 15 actions are used for training and testing; the model is trained on 5 subjects (S1, S5, S6, S7, S8) and tested on 2 subjects (S9 and S11).

4.2 MPI-INF-3DHP

  • MPI-INF-3DHP is a more challenging 3D pose dataset.
  • It contains both restricted indoor scenes and complex outdoor scenes.
  • There are 8 actors performing 8 actions, from 14 camera views, covering a greater diversity of poses.
  • MPI-INF-3DHP provides test sets for 6 different scenarios.

5. Evaluation Metrics

  • MPJPE: Mean Per Joint Position Error, the average Euclidean distance between the estimated joint positions and the ground truth, in millimeters.
  • P-MPJPE: MPJPE computed after the estimated 3D pose is rigidly aligned to the ground truth in post-processing; it is more robust to individual joint prediction failures.
  • PCK: Percentage of Correct Keypoints, the percentage of estimated joints within 150 mm of the ground truth.
  • AUC: Area Under Curve, the area under the PCK curve computed over a range of thresholds.
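For reference, minimal NumPy sketches of these metrics are given below. The P-MPJPE alignment follows the standard similarity-transform (Procrustes) fit, and AUC is approximated as the mean PCK over thresholds from 0 to 150 mm; these are illustrative implementations, not the official evaluation code:

```python
import numpy as np

def mpjpe(pred, gt):                       # pred, gt: (N, J, 3), in millimeters
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, threshold=150.0):        # % of joints within `threshold` mm
    return (np.linalg.norm(pred - gt, axis=-1) < threshold).mean() * 100

def auc(pred, gt, thresholds=np.linspace(0, 150, 31)):
    return np.mean([pck(pred, gt, t) for t in thresholds])   # area under PCK curve

def p_mpjpe_single(pred, gt):              # one pose (J, 3): align rigidly, then MPJPE
    X, Y = pred - pred.mean(0), gt - gt.mean(0)
    U, s, Vt = np.linalg.svd(X.T @ Y)
    if np.linalg.det(U @ Vt) < 0:          # avoid an improper rotation (reflection)
        U[:, -1] *= -1
        s[-1] *= -1
    R = U @ Vt                             # optimal rotation
    scale = s.sum() / (X ** 2).sum()       # optimal scale
    aligned = scale * X @ R + gt.mean(0)
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```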
