Paper reading notes: two-stream neural networks and 3D convolution

I read four papers this week, which together trace the evolution from the original spatial + temporal two-stream neural network to this year's CVPR paper that combines 3D convolution with the two-stream architecture.

The first paper, Two-Stream Convolutional Networks for Action Recognition in Videos, proposes a two-stream CNN that captures spatial and temporal information separately.
For the spatial stream, the article uses a CNN structure similar to another paper, which I will read later. For the temporal stream, it proposes a CNN based on optical flow: starting from a given frame, the optical flow of the following L frames is extracted and used as input to represent the temporal information. The procedure is as follows:
First, OpenCV is used to compute the optical flow. There are several ways to represent optical flow; the classic one is to define a displacement field d_t(u, v), which, for a point (u, v) in the frame at time t, gives the vector that moves it to its corresponding location in the frame at time t+1. This field can be obtained by feeding consecutive video frames directly to OpenCV. The other optical-flow representations are described in the paper.
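For reference, here is a minimal sketch of computing dense optical flow between two consecutive frames with OpenCV. I use Farneback's method here purely because it is readily available in OpenCV; the paper itself relies on a different flow algorithm, and the file name and parameters below are my own placeholders.

```python
# Sketch: dense optical flow between two adjacent frames (Farneback's method).
# "video.avi" and the numeric parameters are illustrative assumptions.
import cv2

cap = cv2.VideoCapture("video.avi")
ok1, prev_frame = cap.read()
ok2, curr_frame = cap.read()
assert ok1 and ok2, "could not read two frames"

prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

# flow has shape (h, w, 2): the x- and y-displacement d_t(u, v) of every pixel
flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)
```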
With the optical flow in hand, for a given frame the spatial input is the (preprocessed) original image, while the temporal input is built from the L consecutive frames that follow it: the optical flow of all points between every pair of adjacent frames is extracted and split into its x- and y-components. So if each frame is w × h pixels, the temporal network's input for that frame has dimension w × h × 2L.
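A small sketch of this stacking step, with assumed shapes (not taken from the paper):

```python
# Stack the flow fields of L consecutive frame pairs into the w*h*2L input
# of the temporal stream. Shapes and L are illustrative assumptions.
import numpy as np

def stack_flows(flows):
    """flows: list of L arrays of shape (h, w, 2) -> array of shape (h, w, 2L)."""
    channels = []
    for flow in flows:
        channels.append(flow[..., 0])  # horizontal (x) displacement
        channels.append(flow[..., 1])  # vertical (y) displacement
    return np.stack(channels, axis=-1)

# e.g. L = 10 flow fields for a 224x224 crop -> input of shape (224, 224, 20)
L = 10
dummy_flows = [np.zeros((224, 224, 2), dtype=np.float32) for _ in range(L)]
print(stack_flows(dummy_flows).shape)  # (224, 224, 20)
```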
Finally, the spatial and temporal networks each produce a classification of the action, and the two results are fused, either by averaging or with a linear SVM (in the experiments, the SVM gives higher accuracy), to obtain the final prediction.
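As a toy illustration of the simpler of the two fusion schemes, averaging the per-class scores of the two streams (the SVM variant instead trains a linear SVM on the stacked scores); all numbers here are made up:

```python
# Late fusion by averaging the softmax scores of the two streams (toy example).
import numpy as np

spatial_scores  = np.array([0.7, 0.2, 0.1])   # softmax output of spatial stream
temporal_scores = np.array([0.4, 0.5, 0.1])   # softmax output of temporal stream

fused = (spatial_scores + temporal_scores) / 2.0
predicted_class = int(np.argmax(fused))
print(fused, predicted_class)
```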
The main takeaway from this paper is the use of optical flow to represent video. In addition, two of its references, [3] and [15], are worth reading: [3] provides the basic structure of the temporal network, and [15] is the basis of the spatial network as well as of the training procedure.

The next paper is an improvement on the network above, but it relies on 3D convolution, which in turn involves two other papers. 3D Convolutional Neural Networks for Human Action Recognition proposes a 3D convolution operation as the counterpart of the traditional 2D convolution in CNNs. The traditional 2D convolution applies a 2-dimensional kernel to a feature map to obtain the feature map of the next layer, in the following form:
$$ v_{ij}^{xy} = \tanh\Big( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)} \Big) $$

3D convolution then adds a time dimension: for a stack of consecutive frames, a 3-dimensional kernel is used to sample them, yielding the feature map of the next layer, in the following form:

$$ v_{ij}^{xyz} = \tanh\Big( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \Big) $$
The extra $R_i$ is the size of the 3D convolution kernel along the time dimension.
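To make the difference concrete, here is a small sketch in PyTorch (which the papers themselves do not use) showing how the 3D kernel adds a temporal axis; the clip and kernel sizes are illustrative assumptions:

```python
# 2D vs. 3D convolution: the 3D kernel also spans R_i frames in time.
import torch
import torch.nn as nn

# input: batch of 1 clip, 3 color channels, 16 frames, 112x112 pixels
clip = torch.randn(1, 3, 16, 112, 112)

conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=(3, 3))
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3))

frame_features = conv2d(clip[:, :, 0])   # one frame  -> (1, 64, 110, 110)
clip_features  = conv3d(clip)            # whole clip -> (1, 64, 14, 110, 110)
print(frame_features.shape, clip_features.shape)
```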
With this foundation, the paper Learning Spatiotemporal Features with 3D Convolutional Networks further explores how practical this 3D spatiotemporal convolution is (honestly it feels like they just ran a large batch of experiments and reported which settings are the most accurate and most efficient, and that was still enough for a top conference...). The most useful result in this article is the experimental finding that a 3×3×3 spatiotemporal kernel is the best-suited kernel size for this type of network.

With these foundations in place, we can read the last paper, Convolutional Two-Stream Network Fusion for Video Action Recognition, which improves on the original two-stream paper. In the original two-stream paper, the spatial and temporal networks are fused only at the very last step, by averaging the results or feeding them to a linear SVM; in this paper, the two networks are instead fused at an intermediate layer. The paper's figure shows two variants:

On the left, the two networks are simply fused at a single layer and only the fused network continues; on the right, the temporal network is kept after the fusion and the results are fused again at the end. The paper's experiments show that the latter is slightly more accurate.
The prerequisite for fusion is that, at the chosen layer, the feature maps of the spatial and temporal networks have the same width and height and the same number of channels. (Channels come up in many papers; my tentative understanding is that the channel count of a convolutional layer is the number of feature maps it produces: each convolution kernel applied to the previous layer's input generates one new feature map, so the channel count records how many feature maps there are in total.) There are many concrete fusion methods; see the paper for details, and the sketch below for two of them.
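A hedged sketch of two of the fusion variants discussed in the paper, written in PyTorch rather than the authors' code: sum fusion simply adds the two feature maps, while conv fusion concatenates them along the channel axis and mixes them with 1×1 filters. The shapes are illustrative.

```python
# Sum fusion vs. conv fusion of spatial and temporal feature maps (sketch).
import torch
import torch.nn as nn

spatial_map  = torch.randn(1, 256, 14, 14)   # feature map of the spatial stream
temporal_map = torch.randn(1, 256, 14, 14)   # same size and channel count required

# sum fusion: element-wise addition
fused_sum = spatial_map + temporal_map       # (1, 256, 14, 14)

# conv fusion: stack channels, then learn how to combine them with 1x1 filters
mix = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1)
fused_conv = mix(torch.cat([spatial_map, temporal_map], dim=1))  # (1, 256, 14, 14)
print(fused_sum.shape, fused_conv.shape)
```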
After fusing the feature maps of the two networks, a further convolution is applied. Suppose the fused feature map at time t is x_t; then, over a span of time t = 1, ..., T, all of the feature maps (x_1, ..., x_T) are collected and a 3D temporal convolution is applied across them, producing the fused output. Note that the output at this point is still a temporal series of feature maps, which is then fed into the higher layers of the network for further training.
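A rough PyTorch sketch of this step, with assumed shapes and kernel sizes: the fused maps x_1..x_T are stacked along a new time axis and convolved with a 3D kernel, which again yields a sequence of feature maps.

```python
# 3D temporal convolution over the fused feature maps x_1 ... x_T (sketch).
import torch
import torch.nn as nn

T = 8
fused_maps = [torch.randn(1, 256, 14, 14) for _ in range(T)]   # x_1 ... x_T

# stack along a new time axis: (batch, channels, T, h, w)
stacked = torch.stack(fused_maps, dim=2)                       # (1, 256, 8, 14, 14)

temporal_conv = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=1)
out = temporal_conv(stacked)                                   # (1, 256, 8, 14, 14)
print(out.shape)   # still a temporal series of feature maps
```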
