Person Re-identification: Pose Detection
Preface
Based on how image features are extracted for classification, person re-identification methods can be divided into those based on global features and those based on local features. Global features are relatively simple: the network extracts a single feature from the entire image, without considering any local information. Ordinary convolutional networks extract global features.
However, as pedestrian datasets become more and more complex, global features alone can no longer meet performance requirements, so extracting more elaborate local features has become a research hotspot.
Local features are obtained by manually or automatically making the network focus on key local regions and then extracting features from those regions. Common ideas for extracting local features include image partitioning (e.g. horizontal stripes), skeleton keypoint localization, and pedestrian foreground segmentation.
Global features
A single feature is extracted from the global information of each pedestrian image; this global feature carries no spatial information.
A simple convolutional neural network produces one feature for the whole image, called the global feature. This approach has some defects: noise regions can strongly interfere with the global feature, and misaligned poses can also cause global features to fail to match.
Methods based on local features
A local feature is extracted from a specific region of the image, and multiple local features are finally merged into the final feature.
Local features: pose detection
Using human pose keypoints to align local features is a common approach. Many current papers use prior knowledge (a pre-trained human pose / skeleton keypoint model) to align pedestrians, and then detect and compare local features.
Usually 14 pose points (keypoints) are defined for a pedestrian, and every two adjacent pose points are connected to form a skeleton edge.
Commonly used pose estimation models include Hourglass, OpenPose, CPM, and AlphaPose.
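The 14-keypoint convention above can be sketched as a small data structure. This is an illustrative sketch, not the exact joint ordering of any particular model; the names and edge list below are assumptions for demonstration.

```python
# The 14 pose points defined for a pedestrian (illustrative ordering --
# real models such as CPM or OpenPose define their own index order).
KEYPOINTS = [
    "head", "neck",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
]

# Skeleton: pairs of adjacent keypoint indices connected into edges.
SKELETON = [
    (0, 1),                        # head - neck
    (1, 2), (2, 3), (3, 4),        # right arm
    (1, 5), (5, 6), (6, 7),        # left arm
    (1, 8), (8, 9), (9, 10),       # right leg
    (1, 11), (11, 12), (12, 13),   # left leg
]

assert len(KEYPOINTS) == 14
```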
Related algorithms
1.PIE
Pose Invariant Embedding for Deep Person Re-identification
The paper above is an early work on pose-based re-identification. Its main contributions are roughly as follows:
CPM is used for keypoint detection. CPM is a sequential convolutional architecture that detects 14 body joints: head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles, as shown in the first and second columns of the figure above.
The image is divided into several parts, and each part is aligned via an affine transformation into a rectangular region. This addresses the problem that the same body part has different sizes and poses in different images, as shown in the third and fourth columns of the figure above.
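The affine alignment step can be illustrated with plain linear algebra: given three keypoint correspondences, the 2×3 affine matrix that maps a tilted body-part region onto an upright rectangle is the solution of a small linear system. This is a minimal numpy sketch (the point coordinates are made-up values, not from the paper):

```python
import numpy as np

def affine_from_points(src, dst):
    """Solve the 2x3 affine matrix M with dst = M @ [x, y, 1]^T from
    three (or more, in the least-squares sense) point correspondences."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    A = np.hstack([src, np.ones((len(src), 1))])   # (N, 3) homogeneous coords
    M_T, *_ = np.linalg.lstsq(A, dst, rcond=None)  # solve A @ M.T = dst
    return M_T.T                                   # (2, 3)

# Map a tilted limb region (a keypoint triangle) onto an upright rectangle.
src = [(30, 40), (50, 45), (42, 80)]   # keypoint positions in the image
dst = [(0, 0), (32, 0), (16, 64)]      # target rectangle coordinates
M = affine_from_points(src, dst)

# With exactly three non-collinear points the mapping is exact.
assert np.allclose(M @ np.array([30, 40, 1.0]), dst[0], atol=1e-6)
```

In practice a library routine (e.g. an image-warping function) would apply `M` to every pixel of the region; the sketch only shows where the matrix comes from.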
Features of the original image and the affine-transformed image are fused, and the network is trained with an ID loss.
As shown in the figure above, the original image and the PoseBox first pass through two convolutional neural networks that do not share weights to obtain their respective features. These, together with a 14-dimensional pose confidence score, enter the PIE network, where the corresponding features are fused. The three losses, from top to bottom, are the global loss, the fusion loss, and the local loss.
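The three-loss setup can be sketched numerically: each branch produces ID-classification logits, the fused branch mixes the two, and the three cross-entropy losses are summed. Everything below (the confidence-weighted mix, the dimensions, the identity count) is a stand-in for illustration, not PIE's exact fusion operation:

```python
import numpy as np

def id_loss(logits, label):
    """Softmax cross-entropy for one sample (the ID classification loss)."""
    z = logits - logits.max()               # numerical stability
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

rng = np.random.default_rng(0)
n_ids = 751                                  # e.g. identities in Market-1501

g = rng.normal(size=n_ids)    # logits from the global (original image) branch
l = rng.normal(size=n_ids)    # logits from the local (PoseBox) branch
conf = rng.random(14)         # 14-dim pose confidence scores (assumed usage)

# Fusion: a simple confidence-weighted mix of the two branches --
# a placeholder for PIE's learned fusion sub-network.
w = conf.mean()
f = w * g + (1 - w) * l

label = 3
total = id_loss(g, label) + id_loss(l, label) + id_loss(f, label)
```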
2.Spindle Net
Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion
This is a classic paper that uses pose keypoints for person re-identification. As shown below, a keypoint extraction backbone network first extracts 14 body keypoints. From these keypoints, 7 ROIs of the human body are obtained, corresponding to the head, upper body, lower body, left arm, right arm, left leg, and right leg.
The 7 ROI regions and the original image then enter the same CNN to extract features. The original image passes through the complete CNN to obtain a global feature; the three large regions pass through the FEN-C2 and FEN-C3 sub-networks to obtain three local features; and the four limb regions pass through the FEN-C3 sub-network to obtain four local features. These 8 features are then fused at different scales according to the diagram, finally yielding a re-identification feature that combines the global feature with local features at multiple scales.
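The tree-structured, multi-scale fusion of the 8 features can be sketched as staged merging: limbs first, then body regions, then the global feature. The element-wise-max merge and the dimensions below are illustrative assumptions standing in for Spindle Net's learned fusion network:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256  # assumed feature dimension of each branch

global_feat = rng.normal(size=D)
macro = [rng.normal(size=D) for _ in range(3)]  # head, upper body, lower body
limbs = [rng.normal(size=D) for _ in range(4)]  # two arms, two legs

# Stage 1: merge the four limb features up to the macro scale
# (element-wise max is a stand-in for the learned fusion sub-network).
limb_merged = np.maximum.reduce(limbs)
# Stage 2: merge the three macro regions with the limb summary.
body_merged = np.maximum.reduce(macro + [limb_merged])
# Stage 3: concatenate with the global feature -> final descriptor.
final = np.concatenate([global_feat, body_merged])

assert final.shape == (2 * D,)
```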
3.PDC
Pose-driven Deep Convolutional Model for Person Re-identification
Unlike the examples above, PDC also extracts 14 keypoints but divides the pedestrian into 6 parts, and adopts an improved PTN network to learn the parameters of the affine transformation, automatically placing the parts at fixed positions in the image. Gaps between different parts are allowed here.
After the part images are obtained, feature extraction is performed separately on the original image and the pose images, with the shallow layers of the network shared and the deep layers not shared. Training the network yields, similarly to the above, a global loss, a local loss, and a fusion loss.
4.GLAD
GLAD: Global-Local-Alignment Descriptor for Pedestrian Retrieval
GLAD divides the human body into three parts (head, upper body, and lower body), computes the loss through a weight-sharing network, and finally concatenates the resulting features.
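The three-part division can be sketched as horizontal strips of the pedestrian image. The fixed split ratios below are an illustrative stand-in; GLAD itself derives the split lines from detected pose keypoints:

```python
import numpy as np

def split_three_parts(img, head_ratio=0.2, upper_ratio=0.5):
    """Split a pedestrian image (H, W, C) into head / upper-body /
    lower-body strips. Fixed ratios are assumptions for illustration."""
    h = img.shape[0]
    h1 = int(h * head_ratio)
    h2 = int(h * upper_ratio)
    return img[:h1], img[h1:h2], img[h2:]

img = np.zeros((256, 128, 3))          # a typical re-ID input size
head, upper, lower = split_three_parts(img)
assert head.shape[0] + upper.shape[0] + lower.shape[0] == 256
```

Each strip (plus the whole image) would then go through the weight-sharing branches, and the four resulting features are concatenated.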
5.PABP
Part-Aligned Bilinear Representations for Person Re-identification
PABP discusses the problem at the pixel level: a ReID network extracts an appearance feature map A, OpenPose extracts a part map P, and at each corresponding pixel position the outer product of the vectors from A and P is computed and vectorized (bilinear pooling).
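The per-pixel outer product followed by vectorization can be written as a single einsum over the two feature maps. The spatial size, channel counts, and the average over pixel positions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W = 8, 4        # spatial size of the feature maps (assumed)
Ca, Cp = 16, 14    # appearance channels (A) and part-map channels (P)

A = rng.normal(size=(Ca, H, W))  # appearance map from the ReID network
P = rng.normal(size=(Cp, H, W))  # part map from the pose network

# Outer product between the two feature vectors at each pixel position,
# pooled (averaged) over all positions, then vectorized.
bilinear = np.einsum('ahw,phw->ap', A, P) / (H * W)
feat = bilinear.reshape(-1)      # final (Ca * Cp)-dimensional descriptor

assert feat.shape == (Ca * Cp,)
```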
Summary
- Use a pose estimation model to obtain the pedestrian's (14) pose keypoints
- Obtain part regions with semantic information from the pose points
- Extract local features for each part region
- Combining local features with global features often gives better results