Paper reading | PointNet, a neural network that directly processes unordered 3D point clouds

Paper related information

1. Paper title: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

2. Publication date: December 2016

3. Paper link: https://arxiv.org/pdf/1612.00593.pdf

4. Paper source code:

Paper contributions:

  • A new deep network architecture is designed that can directly process unordered 3D point cloud data.
  • The network can be used for tasks such as 3D object classification, part segmentation, and scene semantic segmentation.
  • It provides a basic approach for processing 3D point cloud data.

Abstract

Point clouds are an important 3D data format. Because point clouds have an irregular form, most researchers convert them into 3D voxel grids or collections of images, but this inflates the data volume and introduces unnecessary problems. This paper designs a new neural network that consumes point clouds directly while respecting their permutation invariance. We call this network PointNet; it provides a single unified architecture for object classification, part segmentation, and scene semantic segmentation, and it is simple and efficient.

1. Introduction

In this paper, we develop a deep learning architecture that can reason about 3D geometric data such as point clouds and meshes. Classic convolutional architectures require highly regular input formats, such as image grids or 3D voxel grids, in order to perform weight sharing and other kernel optimizations. Point clouds and meshes are not regular formats, so most researchers convert them into regular 3D voxel grids or collections of images before feeding the data to a network. However, this representation conversion makes the resulting data unnecessarily voluminous, and it obscures the natural invariances (such as the unordered nature) of the data.

For this reason, we focus on a different input representation for 3D geometry: the simple point cloud, and we name the resulting deep networks PointNets. A point cloud is a simple and uniform structure that avoids the combinatorial irregularities and complexities of meshes, making it easier to learn from. Since a point cloud is just a set of points, it is invariant to permutations of its members, so the network computation must have certain symmetries; invariance to rigid motions also needs to be considered.

PointNet is a unified architecture that takes the point cloud directly as input and outputs either a category label for the whole input (classification) or a label for each point (semantic segmentation). Each point in the cloud is represented by a vector whose basic attributes are its coordinates (x, y, z); other dimensions, such as color, can be added. Because the point cloud is unordered, a symmetric function (such as max pooling or summation) is used so that the input order can be ignored. Specifically, the paper uses max pooling as the symmetric function: it pools each dimension over all points to obtain per-dimension global information. However, since each raw point has only a few dimensions, such global information would be too weak. Therefore, each point is first lifted to a higher dimension, then a pooling operation produces a high-dimensional global feature, and finally fully connected layers aggregate these learned values for classification. A simplified diagram follows.
[Figure: simplified PointNet pipeline]

The MLP here is a multi-layer perceptron used to lift each point to a higher dimension; it can also be understood as a fully connected layer applied to every point.

Network architecture

[Figure: PointNet network architecture]

In the figure, the input transform and feature transform are used to align the data; later work has argued that these components contribute little and can be removed, so they are not discussed further here.

The network input is a point cloud. Different objects have different numbers of points, but only n points are sampled from each. Each point can have several dimensions; at least 3 represent the x, y, z coordinates. Assume there are only 3 dimensions, i.e. (x, y, z). The n points are first lifted to 64 dimensions through mlp(64, 64) (mlp denotes a multi-layer perceptron; the numbers in parentheses are the output dimensions of each layer), and then to 1024 dimensions through mlp(64, 128, 1024). At this stage the dimension of each point is high enough that, after the max pooling operation, a 1024-dimensional global feature is obtained. This global feature is then passed through an mlp that reduces the dimension to k, and the network finally outputs k classification scores for the point cloud. For segmentation, the feature of each point is combined with the global feature: as shown in the figure above, the 1024-dimensional global feature is copied n times and concatenated to each 64-dimensional point feature, giving n points of 1088 dimensions. Each point then carries both global and local information, and the segmentation task is realized by classifying every point through an mlp.
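The shape bookkeeping above can be checked with a small sketch. This is not the paper's trained model, only random weights in numpy; `shared_mlp` is a hypothetical helper standing in for the per-point MLPs, without batch norm or biases.

```python
import numpy as np

# Hypothetical shape walkthrough of the classification/segmentation pipeline
# described above, with random weights (a sketch, not the paper's model).
rng = np.random.default_rng(0)
n, k = 1024, 10                      # n sampled points, k object classes

def shared_mlp(x, dims):
    """Apply the same per-point MLP (ReLU) to every point: (n, d_in) -> (n, d_out)."""
    for d_out in dims:
        W = rng.standard_normal((x.shape[1], d_out)) * 0.1
        x = np.maximum(x @ W, 0.0)   # shared weights, applied point by point
    return x

pts = rng.standard_normal((n, 3))            # input point cloud (x, y, z)
f64 = shared_mlp(pts, (64, 64))              # mlp(64, 64): lift each point to 64-D
f1024 = shared_mlp(f64, (64, 128, 1024))     # mlp(64, 128, 1024)
global_feat = f1024.max(axis=0)              # max pool over points -> (1024,)
scores = global_feat @ rng.standard_normal((1024, k))  # final mlp down to k scores

# Segmentation branch: tile the global feature onto each 64-D point feature.
seg_in = np.concatenate([f64, np.broadcast_to(global_feat, (n, 1024))], axis=1)
print(f64.shape, global_feat.shape, scores.shape, seg_in.shape)
# -> (1024, 64) (1024,) (10,) (1024, 1088)
```

The 1088 in the segmentation branch is exactly the 64-dimensional local feature concatenated with the 1024-dimensional global feature.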

The above is the idea of the entire network, and it is relatively easy to understand. The key is using a symmetric function to handle the unordered nature of the point cloud. The symmetric function can be formalized as follows:

$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n))$, (1)

where $f: 2^{\mathbb{R}^N} \to \mathbb{R}$, $h: \mathbb{R}^N \to \mathbb{R}^K$, and $g: \underbrace{\mathbb{R}^K \times \cdots \times \mathbb{R}^K}_{n} \to \mathbb{R}$ is symmetric.

Here $h$ corresponds to the per-point MLP that lifts each point's dimension, $g$ is the global max pooling function, and $f$ is the resulting symmetric function on the point set.
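Equation (1) can be sanity-checked numerically: take an illustrative $h$ (here just a random linear map with ReLU) and $g$ the coordinate-wise max, and verify that the composed $f$ is unchanged by shuffling the points.

```python
import numpy as np

# Minimal check of equation (1): with h a shared per-point lift and g a
# coordinate-wise max, f = g(h(x_1), ..., h(x_n)) ignores point order.
rng = np.random.default_rng(42)
W = rng.standard_normal((3, 16))      # illustrative h: R^3 -> R^16

def f(points):                        # points: (n, 3) array
    h = np.maximum(points @ W, 0.0)   # h applied to every point
    return h.max(axis=0)              # g: symmetric max over the n points

cloud = rng.standard_normal((100, 3))
shuffled = cloud[rng.permutation(100)]
assert np.allclose(f(cloud), f(shuffled))   # order does not change the output
```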

Supplement

Properties of point clouds

Unordered. Although the input point cloud is fed in some order, this order obviously should not affect the result.

Interaction among points. Points are not independent; each point carries information together with its neighbors, so the model should be able to capture local structures and the interactions between parts.

Invariance under transformations. For example, rotating or translating the whole point cloud should not change its classification or segmentation results.

MLP implementation

The MLP is a multi-layer perceptron, which in practice is implemented with a one-dimensional convolution (Conv1d) with kernel size 1. A Zhihu article explains this one-dimensional convolution very clearly, and the same author also has a note on the PointNet paper with a good analysis; both are listed in the references.
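Why a Conv1d with kernel size 1 implements the shared MLP: for each point i, the output is `W @ x[:, i] + b`, i.e. the same fully connected layer applied to every point independently. A plain numpy stand-in (not `nn.Conv1d` itself) makes the equivalence explicit:

```python
import numpy as np

# A kernel-size-1 "convolution" over the point axis is just the same linear
# layer applied to each point's feature column independently.
rng = np.random.default_rng(1)
C_in, C_out, n = 3, 64, 500
x = rng.standard_normal((C_in, n))        # channels-first layout, n points
W = rng.standard_normal((C_out, C_in))    # conv weight, kernel size 1
b = rng.standard_normal((C_out, 1))

conv1d_k1 = W @ x + b                     # slide the 1-wide kernel over n points
per_point_fc = np.stack([W @ x[:, i] + b[:, 0] for i in range(n)], axis=1)
assert np.allclose(conv1d_k1, per_point_fc)
```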

PointNet paper reproduction and detailed code explanation

References

https://www.bilibili.com/video/BV1M5411K7Gx?p=3

As for PointNet's improved version, PointNet++: it adds local feature extraction on top of PointNet, extracts point features hierarchically through down-sampling in a manner analogous to CNNs, and addresses the problem of adapting to varying point density, hence the name PointNet++. As the original paper puts it:

we propose density adaptive PointNet layers that learn to combine features from regions of different scales when the input sampling density changes. We call our hierarchical network with density adaptive PointNet layers as PointNet++.
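The down-sampling step in that hierarchy is farthest point sampling, which PointNet++ uses to pick the centroids of its local regions. A plain numpy sketch (illustrative, not the paper's GPU implementation):

```python
import numpy as np

def farthest_point_sample(points, m):
    """Greedily pick m points, each maximizing distance to those already chosen."""
    chosen = [0]                                   # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())                   # farthest from the chosen set
        chosen.append(nxt)
        # keep, for every point, its distance to the nearest chosen point
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(7)
cloud = rng.standard_normal((2048, 3))
idx = farthest_point_sample(cloud, 512)            # indices of 512 centroids
```

Unlike uniform random sampling, this keeps good coverage of the whole object even where points are sparse, which is the density-adaptation idea in the quote above.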

Recommended blogs for PointNet++ are given below. I will not write my own notes, because these blogs cover the general principles; for the details you can read the English paper directly, which is easier to understand.
For the ideas of the paper, see: PointNet paper notes for 3D classification and segmentation.
For the specific implementation details, see: Detailed explanation and code of PointNet++. The implementation summary in the first article is extracted from the second; it is not perfect and some specifics are not fully expressed, but the general idea matches the paper. Overall, combining the two with the original paper makes it easy to understand.


Origin blog.csdn.net/yanghao201607030101/article/details/114437211