Detailed explanation of VGG16 model

0. Introduction to VGG16

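VGG16 is a deep convolutional neural network developed in 2014 by the Visual Geometry Group (VGG) at the University of Oxford, which gives the network its name. The "16" counts its weight layers: 13 convolutional layers and 3 fully connected layers.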

VGG16 achieved remarkable results in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It took second place in the image classification task (behind GoogLeNet) and first place in the localization task. Its accuracy clearly improved on earlier deep networks such as AlexNet, and it provided important inspiration for later research.

1. Network model structure

This figure comes from the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" (Simonyan and Zisserman), where configurations D and E are VGG16 and VGG19.

Notation in the table:

  • input (224x224 RGB image): the input is a 224x224 RGB color image.
  • conv3-64: the first number is the kernel size (3 means 3x3) and the second is the number of kernels, i.e. the number of output channels. The other conv entries follow the same pattern.
  • maxpool: max pooling with a 2x2 window and stride 2. It keeps the strongest responses, halves the spatial size, and adds some robustness to small shifts.
  • FC-4096-1: the first fully connected layer, with 4096 neurons. The feature maps from the convolutional and pooling layers are flattened, multiplied by a weight matrix, and a bias is added, producing a 4096-dimensional output.
  • FC-4096-2: the second fully connected layer, also with 4096 neurons. It takes the 4096-dimensional output of the first fully connected layer and applies the same matrix multiplication and bias addition to obtain the final feature representation.
  • FC-1000: the last fully connected layer, with 1000 neurons corresponding to the 1000 categories of the ImageNet dataset (the network is trained for ImageNet image classification). Passing its output through the softmax function gives a probability distribution over the categories, which performs the classification.
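As a minimal illustration of the final softmax step, the plain-Python sketch below turns a few made-up scores into probabilities. The function name softmax and the three example scores are hypothetical; a real VGG16 head produces 1000 scores. Note that torchvision's VGG16 ends with a Linear layer and returns raw scores (logits), so softmax is applied separately at inference time, for example with torch.softmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max score before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for 3 classes (VGG16's FC-1000 layer would give 1000)
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(sum(probs))  # probabilities sum to 1 (up to rounding)
```

Subtracting the maximum does not change the result mathematically, but it prevents overflow in math.exp when scores are large.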

2. The convolution kernel characteristics of VGG16

Every convolutional layer in VGG uses 3x3 kernels, which is a defining feature of VGG16. A 3x3 kernel is the smallest size that still captures the notion of left/right, up/down, and center. Stacking several 3x3 layers increases the network depth and lets it fit more complex features.

Two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution, and three stacked 3x3 convolutions have the same receptive field as one 7x7 convolution. Replacing one large-kernel layer with several small-kernel layers therefore has two benefits. First, it reduces the number of parameters: with C input and C output channels, three 3x3 layers need 27C² weights, while one 7x7 layer needs 49C². Second, it inserts more nonlinearities (one ReLU per layer), which makes the network's decision function more discriminative and improves its learning and expressive ability.
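The receptive-field and parameter claims above can be checked with a little arithmetic. The plain-Python sketch below assumes stride-1 convolutions, equal input and output channel counts, and ignores biases; the helper names and the channel count C = 512 are illustrative choices, not values taken from the paper's text.

```python
def stacked_3x3_params(num_layers, channels):
    # Each 3x3 conv mapping C -> C channels has 3*3*C*C weights (biases ignored)
    return num_layers * 3 * 3 * channels * channels

def single_kxk_params(k, channels):
    # One kxk conv mapping C -> C channels has k*k*C*C weights (biases ignored)
    return k * k * channels * channels

def receptive_field(num_layers, kernel=3):
    # With stride 1, each extra kxk layer widens the receptive field by (k - 1)
    return 1 + num_layers * (kernel - 1)

C = 512  # hypothetical channel count chosen for illustration
for n, k in [(2, 5), (3, 7)]:
    assert receptive_field(n) == k
    print(f"{n} x 3x3: {stacked_3x3_params(n, C):,} weights "
          f"vs one {k}x{k}: {single_kxk_params(k, C):,} weights")
```

The ratio is independent of C: 18C² versus 25C² for the 5x5 case and 27C² versus 49C² for the 7x7 case.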

3. Network convolution process

This figure comes from a video by Tongji Zihao.

  • An RGB image of size 224x224x3 enters block1. Every convolution uses a 3x3 kernel with stride 1 and padding=1, so after two convolutional layers, each followed by ReLU, the spatial size is unchanged and the number of channels becomes 64. The dimensions become 224x224x64.
  • Each maxpool uses a 2x2 filter with stride 2, which halves the spatial size. The dimensions become 112x112x64.
  • Block2: two convolutions, each followed by ReLU. The dimensions become 112x112x128.
  • Maxpool: the size becomes 56x56x128.
  • Block3: three convolutions, each followed by ReLU. The dimensions become 56x56x256.
  • Maxpool: the size becomes 28x28x256.
  • Block4: three convolutions, each followed by ReLU. The dimensions become 28x28x512.
  • Maxpool: the size becomes 14x14x512.
  • Block5: three convolutions, each followed by ReLU. The dimensions remain 14x14x512.
  • Maxpool: the size becomes 7x7x512.
  • Flatten() then turns this into a one-dimensional vector of 512x7x7 = 25088 values.
  • Three fully connected layers follow, with 4096, 4096, and 1000 outputs. The first two are followed by ReLU (and dropout during training); the last one is not.
  • Finally, softmax turns the 1000 outputs into class probabilities.
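The shape changes in the list can be checked with the standard output-size formula floor((n + 2p - k) / s) + 1. The plain-Python sketch below applies it block by block; the helper names and the BLOCKS table are hypothetical, and it assumes square inputs. It records the shape after each block's max pool, ending at 7x7x512 and a flattened length of 25088.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    # Output size of a convolution: floor((n + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Max pooling uses the same formula with no padding
    return (size - kernel) // stride + 1

# (number of 3x3 convs, output channels) for blocks 1-5 of VGG16
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_shapes(size=224):
    shapes = []
    for num_convs, channels in BLOCKS:
        for _ in range(num_convs):
            size = conv_out(size)  # 3x3, stride 1, padding 1: size unchanged
        size = pool_out(size)      # 2x2 max pool, stride 2: size halved
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes()
for s in shapes:
    print(s)
h, w, c = shapes[-1]
print("flattened:", h * w * c)
```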

4. Viewing PyTorch's built-in VGG16 model

import torchvision

# PyTorch's built-in VGG16 model (randomly initialized; no pretrained weights are loaded)
model = torchvision.models.vgg16()
print(model)

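Console output: the printed module tree has three parts, features (the convolution and pooling layers), avgpool (an AdaptiveAvgPool2d with output size 7x7), and classifier (the fully connected layers).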

5. The parameters of the built-in model

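import torchvision
import torchsummary  # third-party package, installed with: pip install torchsummary

model = torchvision.models.vgg16()
# Layer-by-layer output shapes and parameter counts for a 3x224x224 input
torchsummary.summary(model, input_size=(3, 224, 224), batch_size=2, device='cpu')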

Console output (partial):

Total params: 138,357,544

The model has about 138 million parameters. With that many parameters VGG16 has a high fitting capacity, but the drawbacks are also obvious: training takes a long time, hyperparameters are hard to tune, and the weights need a lot of storage (over 500 MB as 32-bit floats), which makes deployment harder.
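The total can be reproduced by hand. The plain-Python sketch below assumes the layer configuration described in sections 1 and 3 (3x3 convolutions with biases, three fully connected layers, 1000 classes); the CFG list and helper names are illustrative, not a library API.

```python
# VGG16 configuration: numbers are conv output channels, "M" is a 2x2 max pool
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

def conv_params(in_ch, out_ch, k=3):
    return k * k * in_ch * out_ch + out_ch  # weights + one bias per filter

def linear_params(in_f, out_f):
    return in_f * out_f + out_f             # weight matrix + bias vector

def count_vgg16(num_classes=1000):
    conv_total, in_ch = 0, 3
    for v in CFG:
        if v == "M":
            continue                        # pooling has no parameters
        conv_total += conv_params(in_ch, v)
        in_ch = v
    fc_total = (linear_params(512 * 7 * 7, 4096)
                + linear_params(4096, 4096)
                + linear_params(4096, num_classes))
    return conv_total, fc_total

conv_total, fc_total = count_vgg16()
print(f"conv: {conv_total:,}  fc: {fc_total:,}  total: {conv_total + fc_total:,}")
```

This prints conv: 14,714,688, fc: 123,642,856, and total: 138,357,544, matching the torchsummary total. Roughly 89% of the parameters sit in the fully connected layers, and the first one alone (25088x4096 weights plus 4096 biases) accounts for 102,764,544 of them.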

6. Implementing VGG16 in PyTorch

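import torch
import torch.nn as nn


class VGG16(nn.Module):
    """
    Every convolution uses a 3x3 kernel with stride 1 and padding=1.
    Every convolution is followed by ReLU.
    out_channel is the number of classes; num_hidden is the flattened feature size (512*7*7).
    """
    def __init__(self, in_channel=3, out_channel=1000, num_hidden=512*7*7):
        super(VGG16, self).__init__()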
        self.features=nn.Sequential(
            # block1
            nn.Conv2d(in_channel,64,(3,3),(1,1),1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),

            nn.MaxPool2d(2,2),

            # block2
            nn.Conv2d(64, 128, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),

            nn.MaxPool2d(2, 2),

            # block3
            nn.Conv2d(128, 256, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),

            nn.MaxPool2d(2, 2),

            # block4
            nn.Conv2d(256, 512, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),

            nn.MaxPool2d(2, 2),

            # block5
            nn.Conv2d(512, 512, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, (3, 3), (1, 1), 1),
            nn.ReLU(inplace=True),

            nn.MaxPool2d(2, 2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(7,7))
        self.classifier = nn.Sequential(
            nn.Linear(num_hidden, 4096),
            nn.ReLU(),
            nn.Dropout(),

            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(),

            nn.Linear(4096, out_channel),
        )
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

The layer configuration above follows PyTorch's built-in VGG16 model, so with the default arguments it has the same 138,357,544 parameters.

7. Summary

As an important milestone in the development of deep learning, VGG16 showed that a deep network built from a simple, uniform design (stacked 3x3 convolutions and 2x2 max pooling) can extract strong image features. Its success inspired later, more complex architectures and methods, and it achieved important results in tasks such as image classification. Studying VGG16 and using it in practice helps build an understanding of its principles and of how to apply it to real problems.

Source: blog.csdn.net/m0_62919535/article/details/132189691