Why are CNNs robust? Why do they have rotation and translation invariance?

https://www.quora.com/How-is-a-convolutional-neural-network-able-to-learn-invariant-features


1. One explanation

After some thought, I do not believe that pooling operations are responsible for the translation invariant property in CNNs. I believe that invariance (at least to translation) is due to the convolution filters (not specifically the pooling) and due to the fully-connected layer.

For instance, let's use Fig. 1 as a reference:

The blue volume represents the input image, while the green and yellow volumes represent layer 1 and layer 2 output activation volumes (see CS231n Convolutional Neural Networks for Visual Recognition if you are not familiar with these volumes). At the end, we have a fully-connected layer that is connected to all activation points of the yellow volume.

These volumes are built using a convolution plus a pooling operation. The pooling operation reduces the height and width of these volumes, while the increasing number of filters in each layer increases the volume depth.
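For concreteness, the height/width reduction and depth increase can be sketched as simple shape arithmetic (an assumption for illustration: "same"-padded convolutions and 2x2 max-pooling, since the answer itself fixes neither):

```python
def conv_pool_shape(h, w, c_in, n_filters, pool=2):
    """Shape after a 'same'-padded conv (depth -> n_filters) followed by
    a pool x pool max-pooling step (height and width halve for pool=2)."""
    return h // pool, w // pool, n_filters

# Hypothetical input image 32x32x3 -> green volume -> yellow volume:
green = conv_pool_shape(32, 32, 3, 16)   # height/width halve, depth grows
yellow = conv_pool_shape(*green, 32)     # and again in the next layer
assert green == (16, 16, 16) and yellow == (8, 8, 32)
```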

For the sake of the argument, let's suppose that we have very "ludic" (toy) filters, as shown in Fig. 2:

  • the first layer filters (which will generate the green volume) detect eyes, noses and other basic shapes (in real CNNs, first layer filters will match lines and very basic textures);
  • The second layer filters (which will generate the yellow volume) detect faces, legs and other objects that are aggregations of the first layer filters. Again, this is only an example: real life convolution filters may detect objects that have no meaning to humans.

Now suppose that there is a face at one of the corners of the image (represented by two red points and a magenta point). The two eyes are detected by the first filter, and therefore will produce two activations at the first slice of the green volume. The same happens for the nose, except that it is detected by the second filter and appears at the second slice. Next, the face filter finds that there are two eyes and a nose next to each other, and it generates an activation in the yellow volume (within the same region as the face in the input image). Finally, the fully-connected layer detects that there is a face (and maybe a leg and an arm detected by other filters) and outputs that it has detected a human body.

Now suppose that the face has moved to another corner of the image, as shown in Fig. 3:

The same number of activations occurs in this example; however, they occur in a different region of the green and yellow volumes. Therefore, any activation point at the first slice of the yellow volume means that a face was detected, INDEPENDENTLY of the face location. The fully-connected layer is then responsible for "translating" a face and two arms into a human body. In both examples, an activation was received at one of the fully-connected neurons. However, in each example, the activation path inside the FC layer was different, meaning that correct learning in the FC layer is essential to ensure the invariance property.
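The argument above can be sketched numerically (a toy NumPy illustration; the 3x3 "eye" pattern and the image sizes are made up for the example):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation (the convolution used in CNNs)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# A made-up 3x3 "eye" pattern, doubling as the matched filter.
eye = np.array([[0., 1., 0.],
                [1., 1., 1.],
                [0., 1., 0.]])

def image_with_eye_at(r, c, size=10):
    img = np.zeros((size, size))
    img[r:r + 3, c:c + 3] = eye
    return img

a = conv2d_valid(image_with_eye_at(0, 0), eye)   # eye at one corner
b = conv2d_valid(image_with_eye_at(5, 6), eye)   # eye moved elsewhere

# Equivariance: the activation peak moves with the object ...
assert np.argmax(a) != np.argmax(b)
# ... but its value is unchanged, so any "is there a peak anywhere?"
# readout (e.g. a global max) is translation invariant:
assert a.max() == b.max() == 5.0
```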

It must be noted that the pooling operation only "compresses" the activation volumes; if there were no pooling in this example, an activation at the first slice of the yellow volume would still mean a face.
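The "compression" can be made concrete with a 2x2 max-pool over a toy activation map (a sketch; the map size is chosen arbitrarily):

```python
import numpy as np

def max_pool2x2(x):
    """Non-overlapping 2x2 max-pooling; halves height and width."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

act = np.zeros((8, 8))
act[5, 6] = 1.0                       # "a face was detected here"
pooled = max_pool2x2(act)

assert pooled.shape == (4, 4)         # the volume shrinks ...
assert pooled[5 // 2, 6 // 2] == 1.0  # ... but the detection survives, coarser
```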

In conclusion, what makes a CNN invariant to object translation is the architecture of the neural network: the convolution filters and the fully-connected layer. Additionally, I believe that if a CNN is trained showing faces only at one corner, during the learning process, the fully-connected layer may become insensitive to faces in other corners.


2. Another explanation

In addition to the answers already here, feature learning in convnets is guided by an error signal that is backpropagated through the network, from the output layer all the way back to the input layer.

Each neuron in a particular layer has a small receptive field that scans the whole preceding layer; hence, in a typical convnet layer, each neuron gets a chance to learn a distinct feature in a particular image irrespective of the spatial position of that feature, since the convolution operation will always find that feature even when it undergoes translation. If the receptive fields did not convolve over the whole image, it would not be possible for convnet neurons to learn those translation-equivariant features.
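The scanning argument is easy to verify in one dimension (a pure-Python sketch; the pattern is arbitrary):

```python
def correlate(signal, kernel):
    """Slide the kernel over the signal, as a conv layer's receptive field does."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]

pattern = [1, -1, 1]
# Wherever the pattern sits, the matched filter's peak response (3) locates it:
for shift in range(8):
    sig = [0] * shift + pattern + [0] * (8 - shift)
    resp = correlate(sig, pattern)
    assert max(resp) == 3 and resp.index(3) == shift
```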

In addition, the series of alternating convolutional and pooling layers (mostly max-pooling) helps the convnet build up tolerance to severe distortions of the input stimuli. The effective receptive fields of the neurons also become bigger higher up the hierarchy due to the pooling operation, letting the convnet process context and integrate features over a large spatial extent. This also makes convnets very robust and capable of recognizing novel stimuli.
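The growth of the effective receptive field can be computed directly (standard receptive-field arithmetic; the layer stack below is an assumption for illustration):

```python
def receptive_field(layers):
    """Effective receptive field on the input for a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field ...
        jump *= stride              # ... faster once strided pooling coarsens the grid
    return rf

# Hypothetical stack: conv3x3 -> maxpool2x2 -> conv3x3 -> maxpool2x2
stack = [(3, 1), (2, 2), (3, 1), (2, 2)]
assert receptive_field(stack[:1]) == 3   # one conv sees 3 input pixels
assert receptive_field(stack) == 10      # after pooling, a neuron sees 10
```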

Other forms of invariance are built up artificially by rotating, mirroring, and scaling the training examples; seeing the training set from different points of view helps the network generalize better.
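Such artificial invariance is typically obtained through data augmentation; a minimal sketch of rotation and mirroring (scaling omitted for brevity):

```python
import numpy as np

def augment(img):
    """Yield the 8 rotated/mirrored views of a training example."""
    for k in range(4):            # 0, 90, 180, 270 degree rotations
        rot = np.rot90(img, k)
        yield rot
        yield np.fliplr(rot)      # plus the mirror image of each rotation

views = list(augment(np.arange(9).reshape(3, 3)))
assert len(views) == 8            # the dihedral group of the square
```

In practice, such transforms are applied on the fly during training so each epoch sees a different view of every example.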





Reposted from blog.csdn.net/xiaojiajia007/article/details/78396319