Pose Estimation with OpenPose


OpenPose is one of the most prominent works in pose estimation.

  1. Abstract
    The abstract is the quickest way to see what a paper is about. This paper presents multi-person 2D pose estimation from a single image (3D pose estimation has since followed). It relies mainly on PAFs (Part Affinity Fields) and on K-partite/bipartite graph matching from combinatorics, which gives a more elegant answer to the problem of connecting limbs than CPM does. The proposed network encodes the global context of the whole image, which lets it reach real-time speed while maintaining high accuracy. The network jointly learns two branches: one detects keypoints, and the other connects keypoints into a skeleton. Bipartite matching with the Hungarian algorithm then assembles the result, which makes this a bottom-up approach. The method took first place in the COCO 2016 keypoints challenge and reached state of the art on the MPII dataset. (Follow-up work includes HRNet, and Kaiming He's recent unsupervised method MoCo has also pushed results to state of the art; it is hard to keep up.)

  2. Introduction
    The paper lists the following challenges for pose estimation.
    (1) An image contains an unknown number of people, who can appear at any position and at any scale.
    (2) Interactions between people, such as contact and occlusion, make keypoints hard to detect and associate. Runtime complexity also tends to grow with the number of people, which makes real-time performance a challenge. The common approach is top-down: a person detector followed by single-person pose estimation. This depends heavily on detection accuracy: if the detector misses a person, there is no way to recover that person's keypoints afterwards.
    (3) Even if the detector is excellent, an image with 30 people requires 30 runs of single-person pose estimation, so the top-down approach becomes very slow in crowded scenes.

  3. Method
    The paper takes a bottom-up approach and uses PAFs (Part Affinity Fields) for human pose estimation. As in CPM, the network first learns to detect the positions of body keypoints, for example the right shoulder. Detections are obtained by predicting a heatmap for each keypoint type, so every person in the image produces a Gaussian peak where the network predicts that keypoint, and the same holds for the same keypoint on everyone else. Once the detections are available, the detected keypoints are connected. The connection step mainly uses PAFs (described later, with a fair amount of math).
    So the process is as follows:

    As shown in the figure: (a) an image is fed through the network; (b) the network outputs a stack of heatmaps (confidence maps), where each heatmap covers one keypoint type for all people in the image, for example the left elbow or the left shoulder; (c) it also outputs the set of PAFs; (d) bipartite matching connects the keypoint candidates; (e) the result is the parsed poses. It all fits together neatly.

  4. Simultaneous Detection and Association
    The main network is shown in the following schematic.

    The PAFs, which describe the limbs (the skeleton) at the pixel level, are denoted \(\mathbf{L}(\mathbf{p})\), and the keypoint confidence maps are denoted \(\mathbf{S}(\mathbf{p})\). The backbone is initialized with the first 10 layers of VGG-19 and fine-tuned; it produces the feature maps \(\mathbf{F}\). After this pretrained backbone the network splits into two branches that regress \(\mathbf{L}(\mathbf{p})\) and \(\mathbf{S}(\mathbf{p})\) respectively. A loss is computed after every stage, and the predicted \(\mathbf{L}\) and \(\mathbf{S}\) are concatenated with the original features \(\mathbf{F}\) and fed to the next stage. The training loss is the L2 norm against the ground truth (GT). The GT for \(\mathbf{S}\) and \(\mathbf{L}\) is generated from the annotated keypoints; if a keypoint has no annotation, that point is left out of the loss. The network refines its predictions stage by stage: each of the two branches has \(T\) stages, and the feature maps are fused after every stage, \(\mathbf{S}^{t}=\rho^{t}\left(\mathbf{F}, \mathbf{S}^{t-1}, \mathbf{L}^{t-1}\right)\) and \(\mathbf{L}^{t}=\varphi^{t}\left(\mathbf{F}, \mathbf{S}^{t-1}, \mathbf{L}^{t-1}\right)\), where \(\rho^{t}\) and \(\varphi^{t}\) denote the networks of the two branches at stage \(t\).
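    The stage-by-stage data flow can be sketched as a small loop. This is only an illustration of how the inputs are concatenated, not the real network: `run_stages`, `branch_S`, `branch_L` and the toy stand-ins are hypothetical names, and the stand-ins return zeros instead of running convolutions.

```python
import numpy as np

def run_stages(F, branch_S, branch_L, num_stages):
    """Run the two-branch refinement loop and return the per-stage outputs.

    F: backbone feature map, shape (H, W, C_f).
    branch_S / branch_L: hypothetical callables (stage index, input) -> map,
    standing in for the two branch networks at each stage.
    """
    outputs = []
    x = F  # stage 1 sees only the backbone features
    for t in range(1, num_stages + 1):
        S = branch_S(t, x)  # confidence maps, shape (H, W, J)
        L = branch_L(t, x)  # PAFs, shape (H, W, 2 * C)
        outputs.append((S, L))
        # Later stages see the features plus both previous predictions.
        x = np.concatenate([F, S, L], axis=-1)
    return outputs

# Toy stand-ins (assumption): J = 3 parts, C = 2 limbs, outputs are all zeros.
H, W, J, C = 4, 5, 3, 2
toy_S = lambda t, x: np.zeros(x.shape[:2] + (J,))
toy_L = lambda t, x: np.zeros(x.shape[:2] + (2 * C,))
outs = run_stages(np.ones((H, W, 8)), toy_S, toy_L, num_stages=3)
```

    Each PAF channel pair stores a 2D vector, which is why the L branch has 2 * C channels for C limbs.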

    The figure shows the coarse output of the network.
    The main L2 losses are:
    \[\begin{array}{l}{f_{\mathbf{S}}^{t}=\sum_{j=1}^{J} \sum_{\mathbf{p}} \mathbf{W}(\mathbf{p}) \cdot\left\|\mathbf{S}_{j}^{t}(\mathbf{p})-\mathbf{S}_{j}^{*}(\mathbf{p})\right\|_{2}^{2}} \\ {f_{\mathbf{L}}^{t}=\sum_{c=1}^{C} \sum_{\mathbf{p}} \mathbf{W}(\mathbf{p}) \cdot\left\|\mathbf{L}_{c}^{t}(\mathbf{p})-\mathbf{L}_{c}^{*}(\mathbf{p})\right\|_{2}^{2}}\end{array} \quad f=\sum_{t=1}^{T}\left(f_{\mathbf{S}}^{t}+f_{\mathbf{L}}^{t}\right)\]
    Here \(\mathbf{S}_{j}^{*}\) and \(\mathbf{L}_{c}^{*}\) are the GT confidence maps and PAFs, and \(\mathbf{W}(\mathbf{p})\) is a binary mask that is 0 where the annotation is missing at \(\mathbf{p}\).
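    A minimal NumPy sketch of this masked loss, assuming the maps are stored as arrays of shape (H, W, channels) and the mask as (H, W). The function names `stage_loss` and `total_loss` are made up for illustration; a real training setup would use a deep learning framework instead.

```python
import numpy as np

def stage_loss(pred, target, mask):
    """Masked squared L2 loss: sum over p of W(p) * ||pred(p) - target(p)||^2.

    pred, target: arrays of shape (H, W, K); mask: shape (H, W), 0 where unlabeled.
    """
    sq_err = np.sum((pred - target) ** 2, axis=-1)  # per-pixel error, (H, W)
    return float(np.sum(mask * sq_err))

def total_loss(stage_outputs, S_gt, L_gt, mask):
    """f = sum over stages t of (f_S^t + f_L^t)."""
    return sum(stage_loss(S, S_gt, mask) + stage_loss(L, L_gt, mask)
               for S, L in stage_outputs)

S_gt = np.zeros((2, 2, 1))
L_gt = np.zeros((2, 2, 2))
mask = np.array([[1.0, 1.0], [1.0, 0.0]])  # bottom-right pixel is unlabeled
pred_S = np.ones((2, 2, 1))
pred_L = np.ones((2, 2, 2))
loss = total_loss([(pred_S, pred_L)] * 2, S_gt, L_gt, mask)  # two identical stages
```

    Because the loss is summed over all stages, every stage receives its own supervision signal rather than only the last one.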

  5. Confidence Maps for Part Detection
    Keypoint detection works as follows:

    The GT confidence maps \(\mathbf{S}^{*}\) are computed from the annotated keypoints. Each confidence map is a 2D representation. Ideally, when the image contains one person, the confidence map for a keypoint has a single peak if that keypoint is visible; when the image contains several people, the confidence map for part \(j\) has one peak for every visible part \(j\) of every person \(k\). As shown in the figure, an individual confidence map \(\mathbf{S}_{j,k}^{*}\) is first generated for each person \(k\). Let \(\mathbf{x}_{j,k} \in \mathbb{R}^{2}\) be the GT position of body part \(j\) of person \(k\) in the image. Then \(\mathbf{S}_{j,k}^{*}(\mathbf{p})=\exp \left(-\frac{\left\|\mathbf{p}-\mathbf{x}_{j,k}\right\|_{2}^{2}}{\sigma^{2}}\right)\), where \(\sigma\) controls the spread of the peak. The figure shows the confidence map for several people. To keep the peaks of nearby people distinct, the per-person maps are combined by taking the maximum rather than the average, so the GT value that the network is trained to predict at position \(\mathbf{p}\) is \(\mathbf{S}_{j}^{*}(\mathbf{p})=\max _{k} \mathbf{S}_{j,k}^{*}(\mathbf{p})\). At prediction time, the final keypoint candidates are obtained from the predicted confidence maps by non-maximum suppression (NMS).
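    The GT confidence map for one part type can be sketched in a few lines of NumPy. The function name `confidence_map`, the (x, y) keypoint convention and the example values are assumptions made for illustration.

```python
import numpy as np

def confidence_map(keypoints, height, width, sigma):
    """Ground-truth confidence map for one part type j.

    keypoints: list of (x, y) positions, one per person k.
    Returns an array of shape (height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    result = np.zeros((height, width))
    for (kx, ky) in keypoints:
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2      # squared distance to the keypoint
        person_map = np.exp(-d2 / sigma ** 2)     # Gaussian peak for person k
        result = np.maximum(result, person_map)   # max over people, not the mean
    return result

# Two people whose part j sits at (2, 2) and (7, 2).
S = confidence_map([(2, 2), (7, 2)], height=5, width=10, sigma=1.5)
```

    Taking the maximum keeps both peaks at height 1; averaging would flatten them wherever the Gaussians overlap.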
  6. Part Affinity Fields for Part Association

    Once a set of keypoints has been found, connecting them is a hard problem. This is especially true for multi-person pose estimation, where there are many possible connections between the keypoints in the image. The paper's part affinity fields (PAFs) preserve both the position and the orientation information across the region of a limb. Each limb type has its own affinity field, which joins its two associated body parts.
    In the example figure from the paper:

    \(\mathbf{x}_{j_{1},k}\) and \(\mathbf{x}_{j_{2},k}\) are the GT positions of the two body parts \(j_{1}\) and \(j_{2}\) of limb \(c\) of person \(k\). If a point \(\mathbf{p}\) lies on the limb, the value of \(\mathbf{L}_{c,k}^{*}(\mathbf{p})\) is the unit vector pointing from \(j_{1}\) to \(j_{2}\); for all other points it is 0.
    To evaluate \(f_{\mathbf{L}}\) during training, the GT PAF at a point \(\mathbf{p}\) is defined as \(\mathbf{L}_{c,k}^{*}(\mathbf{p})=\left\{\begin{array}{ll}{\mathbf{v}} & {\text{if } \mathbf{p} \text{ on limb } c, k} \\ {\mathbf{0}} & {\text{otherwise}}\end{array}\right.\), where \(\mathbf{v}=\left(\mathbf{x}_{j_{2},k}-\mathbf{x}_{j_{1},k}\right) /\left\|\mathbf{x}_{j_{2},k}-\mathbf{x}_{j_{1},k}\right\|_{2}\) is the unit vector in the direction of the limb.
    A point \(\mathbf{p}\) is defined to be on limb \(c\) if \(0 \leq \mathbf{v} \cdot\left(\mathbf{p}-\mathbf{x}_{j_{1},k}\right) \leq l_{c,k}\) and \(\left|\mathbf{v}_{\perp} \cdot\left(\mathbf{p}-\mathbf{x}_{j_{1},k}\right)\right| \leq \sigma_{l}\), where \(\mathbf{v}_{\perp}\) is perpendicular to \(\mathbf{v}\), \(\sigma_{l}\) is the width of the limb, and \(l_{c,k}=\left\|\mathbf{x}_{j_{2},k}-\mathbf{x}_{j_{1},k}\right\|_{2}\) is the length of the limb.
    The GT part affinity field at a point \(\mathbf{p}\) is the average of the PAFs of all people at that point, \(\mathbf{L}_{c}^{*}(\mathbf{p})=\frac{1}{n_{c}(\mathbf{p})} \sum_{k} \mathbf{L}_{c,k}^{*}(\mathbf{p})\), where \(n_{c}(\mathbf{p})\) is the number of non-zero vectors at \(\mathbf{p}\) across all people.
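    The GT PAF construction can be sketched in NumPy as follows. The function names `limb_paf` and `average_pafs` are hypothetical, positions are assumed to be (x, y) pixel coordinates, and the two endpoints are assumed to be distinct so that the limb length is non-zero.

```python
import numpy as np

def limb_paf(x1, x2, height, width, sigma_l):
    """Ground-truth PAF for one limb of one person, shape (H, W, 2)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    length = np.linalg.norm(x2 - x1)          # limb length l_{c,k}
    v = (x2 - x1) / length                    # unit vector from j1 to j2
    v_perp = np.array([-v[1], v[0]])          # perpendicular to v
    ys, xs = np.mgrid[0:height, 0:width]
    rel = np.stack([xs - x1[0], ys - x1[1]], axis=-1)  # p - x_{j1,k}
    along = rel @ v                           # projection along the limb
    across = np.abs(rel @ v_perp)             # distance from the limb axis
    on_limb = (along >= 0) & (along <= length) & (across <= sigma_l)
    return on_limb[..., None] * v             # v on the limb, 0 elsewhere

def average_pafs(pafs):
    """Average per-person fields over the non-zero vectors at each pixel."""
    stacked = np.stack(pafs)                  # (K, H, W, 2)
    n_c = np.sum(np.any(stacked != 0, axis=-1), axis=0)  # non-zero count, (H, W)
    return stacked.sum(axis=0) / np.maximum(n_c, 1)[..., None]

paf = limb_paf((1, 2), (6, 2), height=5, width=8, sigma_l=1.0)
avg = average_pafs([paf, paf])  # two people with identical limbs
```

    Dividing by the count of non-zero vectors rather than by the number of people means a pixel covered by only one person keeps a full-length unit vector.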
    At prediction time, for two candidate part positions \(\mathbf{d}_{j_{1}}\) and \(\mathbf{d}_{j_{2}}\), the predicted PAF \(\mathbf{L}_{c}\) is sampled along the line segment between them to measure the confidence of their association: \(E=\int_{u=0}^{u=1} \mathbf{L}_{c}(\mathbf{p}(u)) \cdot \frac{\mathbf{d}_{j_{2}}-\mathbf{d}_{j_{1}}}{\left\|\mathbf{d}_{j_{2}}-\mathbf{d}_{j_{1}}\right\|_{2}} d u\), where \(\mathbf{p}(u)\) interpolates the position between the two body parts: \(\mathbf{p}(u)=(1-u) \mathbf{d}_{j_{1}}+u \mathbf{d}_{j_{2}}\). In practice the integral is approximated by sampling \(u\) at uniformly spaced values and summing.
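    A minimal sketch of the sampled approximation of E, assuming the PAF is stored as an (H, W, 2) array of (x, y) components and is read with a nearest-pixel lookup. The function name `limb_score` and the default of 10 samples are illustrative choices, not values taken from the paper.

```python
import numpy as np

def limb_score(paf, d1, d2, num_samples=10):
    """Approximate E by averaging the PAF projection at evenly spaced points.

    paf: predicted field for one limb type, shape (H, W, 2).
    d1, d2: candidate positions as (x, y), assumed to lie inside the image.
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    norm = np.linalg.norm(d2 - d1)
    if norm == 0:
        return 0.0                       # coincident candidates have no direction
    direction = (d2 - d1) / norm
    total = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        p = (1 - u) * d1 + u * d2        # p(u) on the segment d1 -> d2
        x, y = int(round(p[0])), int(round(p[1]))  # nearest-pixel lookup
        total += float(paf[y, x] @ direction)
    return total / num_samples

field = np.zeros((5, 8, 2))
field[2, 1:7] = [1.0, 0.0]               # a horizontal limb along row 2
aligned = limb_score(field, (1, 2), (6, 2))
crossing = limb_score(field, (3, 0), (3, 4))
```

    A candidate pair aligned with the field scores close to 1, while a pair whose connecting line runs across the field scores close to 0.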
  7. Multi-Person Parsing using PAFs
    Multi-person parsing decodes the PAFs as follows:

    The last step is to decode the PAFs into the final result. Running NMS on the predicted confidence maps gives a discrete set of candidate body parts. Each part type can have several candidates, because the image may contain several people (or false positives). These candidates define a large set of possible limbs, and each candidate limb is scored with the line integral over the PAF defined above.
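    The NMS step can be sketched as a simple local-maximum search. This is one common way to do it, not necessarily what the reference implementation does; the function name `find_peaks`, the 8-neighbour window and the threshold of 0.1 are assumptions, and the map is assumed to have a float dtype.

```python
import numpy as np

def find_peaks(conf_map, threshold=0.1):
    """Return (x, y, score) for pixels that are local maxima above threshold.

    A pixel is kept if it is >= all 8 neighbours, so a flat plateau can
    yield more than one peak.
    """
    h, w = conf_map.shape
    padded = np.pad(conf_map, 1, mode="constant", constant_values=-np.inf)
    is_peak = conf_map > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Neighbour map shifted by (dy, dx), same shape as conf_map.
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= conf_map >= neighbour
    ys, xs = np.nonzero(is_peak)
    return [(int(x), int(y), float(conf_map[y, x])) for y, x in zip(ys, xs)]

heat = np.zeros((5, 10))
heat[2, 2] = 0.9
heat[2, 7] = 0.8
heat[2, 3] = 0.4   # shoulder of the first peak, suppressed by its neighbour
peaks = find_peaks(heat)
```

    Each returned tuple is one candidate for that part type.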
    The paper proposes a greedy relaxation that produces high-quality matches:
    (1) The confidence maps yield a discrete set of body part candidates \(\mathcal{D}_{\mathcal{J}}=\left\{\mathbf{d}_{j}^{m}: j \in\{1 \ldots J\}, m \in\left\{1 \ldots N_{j}\right\}\right\}\), where \(\mathbf{d}_{j}^{m}\) is the position of the \(m\)-th candidate for body part \(j\) and \(N_{j}\) is the number of candidates for part \(j\).
    (2) The goal is to connect each candidate to the other candidates that belong to the same person. Define a variable \(z_{j_{1} j_{2}}^{m n} \in\{0,1\}\) that indicates whether the two candidates \(\mathbf{d}_{j_{1}}^{m}\) and \(\mathbf{d}_{j_{2}}^{n}\) are connected. The set of all possible connections is
    \(\mathcal{Z}=\left\{z_{j_{1} j_{2}}^{m n}: \text{for } j_{1}, j_{2} \in\{1 \ldots J\}, m \in\left\{1 \ldots N_{j_{1}}\right\}, n \in\left\{1 \ldots N_{j_{2}}\right\}\right\}\)
    (3) Consider a single limb \(c\) with its two body parts \(j_{1}\) and \(j_{2}\). The goal is to find the matching with the highest total affinity: \(\max _{\mathcal{Z}_{c}} E_{c}=\max _{\mathcal{Z}_{c}} \sum_{m \in \mathcal{D}_{j_{1}}} \sum_{n \in \mathcal{D}_{j_{2}}} E_{m n} \cdot z_{j_{1} j_{2}}^{m n}\), subject to \(\forall m \in \mathcal{D}_{j_{1}}, \sum_{n \in \mathcal{D}_{j_{2}}} z_{j_{1} j_{2}}^{m n} \leq 1\)
    and \(\forall n \in \mathcal{D}_{j_{2}}, \sum_{m \in \mathcal{D}_{j_{1}}} z_{j_{1} j_{2}}^{m n} \leq 1\), where \(E_{m n}\) is the affinity between \(\mathbf{d}_{j_{1}}^{m}\) and \(\mathbf{d}_{j_{2}}^{n}\). Note: the two constraints ensure that no two limbs of the same type share a body part.
    (4) For the full-body pose of multiple people, finding \(\mathcal{Z}\) is a K-dimensional (K-partite) matching problem. It is relaxed to \(\max _{\mathcal{Z}} E=\sum_{c=1}^{C} \max _{\mathcal{Z}_{c}} E_{c}\): each limb type is optimized independently, and connections that share the same body part candidate are then assembled into the full-body poses of the individual people.
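    The per-limb matching in step (3) can be illustrated with a simple greedy assignment. Note that this is a swapped-in approximation, not the Hungarian algorithm mentioned in the abstract: it sorts the candidate connections by score and accepts each one whose endpoints are still free, which enforces the two "at most one connection" constraints but does not guarantee the maximum total affinity. The function name `greedy_match` and the `min_score` cutoff are hypothetical.

```python
def greedy_match(scores, min_score=0.0):
    """Greedy bipartite matching for one limb type.

    scores[m][n] is the affinity between candidate m of part j1 and
    candidate n of part j2. Returns a list of (m, n, score) in which each m
    and each n is used at most once.
    """
    pairs = [(s, m, n)
             for m, row in enumerate(scores)
             for n, s in enumerate(row)
             if s > min_score]
    pairs.sort(reverse=True)             # best-scoring connections first
    used_m, used_n, matches = set(), set(), []
    for s, m, n in pairs:
        if m not in used_m and n not in used_n:
            matches.append((m, n, s))
            used_m.add(m)
            used_n.add(n)
    return matches

# Two shoulder candidates (rows) and two elbow candidates (columns).
E = [[0.9, 0.2],
     [0.3, 0.8]]
limbs = greedy_match(E)
```

    Running this once per limb type mirrors the relaxation in step (4); connections that share a candidate can then be chained into one person.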

Origin www.cnblogs.com/zonechen/p/11900481.html