Summary of Convolutional Neural Network Compression Methods

This article introduces common methods for compressing convolutional neural networks: low-rank approximation, pruning and sparsity constraints, parameter quantization, binarized networks, knowledge distillation, and shallow/lightweight networks.

We know that, broadly speaking, the deeper the network, the more parameters it has, the more complex the model, and the better the final accuracy tends to be. Neural network compression algorithms aim to convert a large, complex pre-trained model into a streamlined small model. According to how much the compression process damages the network structure, model compression techniques can be divided into two categories: "front-end compression" and "back-end compression".

Front-end compression refers to compression techniques that do not change the original network structure, mainly including knowledge distillation, compact model structure design, and pruning at the filter level.

Back-end compression includes low-rank approximation, unrestricted pruning, parameter quantization, and binary networks. Its goal is to reduce the model size as much as possible, which heavily modifies the original network structure.

Summary: front-end compression barely changes the original network structure (it only reduces the number of layers or filters on top of the original model), while back-end compression makes large, irreversible changes to the network structure, so that the original deep learning library or even the hardware may no longer be compatible with the modified network, and its maintenance cost is high.

1. Low-Rank Approximation

The intuition is simple: the weight matrices of a convolutional neural network are often dense and huge, leading to high computational overhead. One remedy is to reconstruct the dense matrix from several small-scale matrices; this is the family of low-rank approximation algorithms.

In general, the rank of a row-echelon matrix equals its "number of steps", i.e. the number of nonzero rows.

The reason a low-rank approximation algorithm reduces computational overhead is as follows: a dense weight matrix of size m x n can be approximated by the product of two much smaller matrices, for example an m x k matrix times a k x n matrix with k far smaller than m and n, so both storage and multiply-accumulate cost drop from roughly m*n to k*(m+n).

Based on this idea, Sindhwani et al. proposed an algorithm that uses structured matrices for low-rank decomposition; for the specific principles, please refer to their paper. Another relatively simple approach is to use matrix factorization to reduce the number of parameters in the weight matrix. For example, Denton et al. proposed using Singular Value Decomposition (SVD) to reconstruct the weights of fully connected layers.
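As a concrete illustration, the following numpy sketch approximates a random stand-in for a fully connected layer's weight matrix with a rank-k factorization obtained by SVD; the shapes and the rank k are arbitrary choices, not values from any particular paper:

import numpy as np

m, n, k = 1024, 512, 32                  # k << min(m, n) controls the compression rate
W = np.random.randn(m, n)                # stand-in for a dense pre-trained FC weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k] * S[:k]                   # m x k factor, singular values folded in
V_k = Vt[:k, :]                          # k x n factor
W_approx = U_k @ V_k                     # best rank-k reconstruction of W

# Storage drops from m*n to k*(m+n), and the layer y = x @ W can be replaced
# by two smaller layers: y = (x @ U_k) @ V_k.
print(W.size, U_k.size + V_k.size)
print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))  # relative reconstruction error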

1.1 Summary

Low-rank approximation achieves very good results on small and medium-sized network models, but the number of its hyperparameters grows linearly with the number of network layers; as networks get deeper and models become more complex, the search space expands sharply. At present it is mainly studied in academia and has relatively few industrial applications.

2. Pruning and sparse constraints

Given a pre-trained network model, commonly used pruning algorithms generally follow these steps:

  1. Measure the importance of each neuron.

  2. Remove some of the unimportant neurons; this step is simpler and more flexible than the previous one.

  3. Fine-tune the network. Pruning inevitably affects the network's accuracy, so to avoid excessive damage to classification performance the pruned model must be fine-tuned. For large-scale image datasets (such as ImageNet), fine-tuning consumes a lot of computing resources, so the extent of fine-tuning needs to be weighed.

  4. Return to the first step and start the next round of pruning.

Based on this iterative pruning framework, different researchers have proposed different methods. Han et al. proposed first cutting all weight connections below a certain threshold and then fine-tuning the pruned network to update the remaining parameters. The drawback is that the pruned network is unstructured: the remaining connections are distributed without any regular pattern, and this sparse structure causes frequent switching between CPU cache and memory, which limits the actual speedup.
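For illustration, here is a minimal numpy sketch of unstructured magnitude pruning in the spirit of Han et al.'s threshold-based method (the layer shape, sparsity level, and the helper name magnitude_prune are hypothetical):

import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights; 'sparsity' is the fraction removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

W = np.random.randn(256, 256)
W_pruned, mask = magnitude_prune(W, sparsity=0.8)
print(f"remaining weights: {mask.mean():.2%}")
# In practice the mask is kept and re-applied after every fine-tuning update
# so that pruned connections stay at zero.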

Building on this method, some researchers have tried to raise the pruning granularity to the level of entire filters, i.e. discarding whole filters. The question then is how to measure the importance of a filter. One strategy is based on statistics of the filter weights themselves, such as computing the L1 or L2 norm of each filter and using that value as the importance measure.
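A minimal numpy sketch of this filter-level strategy might look as follows (the layer shape and the number of filters removed are arbitrary):

import numpy as np

# Rank the filters of a conv layer by the L1 norm of their weights and drop the
# weakest ones. Weight shape is (num_filters, channels, kh, kw).
conv_w = np.random.randn(64, 32, 3, 3)

l1_per_filter = np.abs(conv_w).reshape(conv_w.shape[0], -1).sum(axis=1)
keep = np.argsort(l1_per_filter)[16:]        # discard the 16 least important filters
pruned_w = conv_w[np.sort(keep)]             # the layer now has 48 filters

print(conv_w.shape, "->", pruned_w.shape)
# The next layer's input channels (and any BatchNorm statistics) must be
# pruned consistently, followed by fine-tuning.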

Pruning the network with sparsity constraints is another research direction. The idea is to add a sparsity-inducing regularization term on the weights to the network's training objective, so that some weights are driven towards zero during training; these near-zero weights then become the objects of pruning.
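As an illustration, a TF1-style sketch of adding an L1 sparsity regularizer to the training objective is shown below; the toy single-layer model, the layer sizes, and the regularization weight are assumptions, not from the original post:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128])
y = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.random_normal([128, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, w) + b

# original task loss (classification)
task_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
# L1 sparsity term: pushes many entries of w towards zero during training
l1_lambda = 1e-4
sparsity_loss = l1_lambda * tf.reduce_sum(tf.abs(w))
total_loss = task_loss + sparsity_loss
train_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)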

2.1 Summary

Overall, pruning is a general compression technique that can effectively reduce model complexity; the key lies in how to measure the importance of individual weights to the overall model. Pruning does very little damage to the network structure, and combining it with other back-end compression techniques can achieve maximal compression of the model. There are already industrial cases of using pruning for model compression.

3. Parameter Quantization

Compared with pruning, parameter quantization is a commonly used back-end compression technique. "Quantization" here means deriving several "representatives" from the weights, with each representative standing for the value of a group of weights. The representatives are stored in a codebook, and the original weight matrix only needs to record the index of each weight's representative, which greatly reduces storage overhead. The idea is analogous to the classic bag-of-words model. Commonly used quantization algorithms include:

  1. Scalar quantization.

  2. Scalar quantization reduces network accuracy to some extent. To avoid this drawback, many algorithms consider structured vector quantization, one of which is Product Quantization (PQ); see the paper for details.

  3. Building on PQ, Wu et al. designed a general network quantization algorithm, Q-CNN (quantized CNN). Their main insight is that minimizing the reconstruction error of each layer's output is more effective than minimizing the quantization error of the weights themselves.

The essential idea of these three clustering-based parameter quantization algorithms is to map multiple weights to the same value, realizing weight sharing and thereby reducing storage overhead.
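To make the weight-sharing idea concrete, here is a minimal numpy sketch of scalar (k-means style) quantization with a codebook of 16 representatives, i.e. 4-bit indices; the matrix size, cluster count, and iteration count are arbitrary:

import numpy as np

def kmeans_quantize(weights, n_clusters=16, n_iters=20):
    """Cluster weights into a small codebook and return (codebook, index map)."""
    flat = weights.reshape(-1)
    # initialize the codebook with evenly spaced values over the weight range
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = flat[idx == k].mean()
    return codebook, idx.reshape(weights.shape)

W = np.random.randn(128, 128)
codebook, indices = kmeans_quantize(W, n_clusters=16)   # store 4-bit indices + 16 floats
W_quant = codebook[indices]                              # shared-weight reconstruction
print(np.abs(W - W_quant).mean())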

3.1 Summary

Parameter quantization is a commonly used back-end compression technique that can shrink the model size considerably with only a small loss in performance. Its disadvantages are that the quantized network is "fixed" and hard to modify further, and that the method has poor portability, requiring a dedicated deep learning library to run the quantized network.

4. Binary Networks
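The snippet below sketches how a binarized residual block can be written; it assumes a hypothetical `bnn` module that provides binarized `BatchNorm` and `Convolution` operators, as found in some open-source binary-network implementations: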

def residual_unit(data, num_filter, stride, dim_match, num_bits=1):
    """Binarized residual block (Residual Block) definition."""
    # BatchNorm with binary activation, followed by a binarized 3x3 convolution
    bnAct1 = bnn.BatchNorm(data=data, num_bits=num_bits)
    conv1 = bnn.Convolution(data=bnAct1, num_filter=num_filter, kernel=(3, 3), stride=stride, pad=(1, 1))
    convBn1 = bnn.BatchNorm(data=conv1, num_bits=num_bits)
    conv2 = bnn.Convolution(data=convBn1, num_filter=num_filter, kernel=(3, 3), stride=(1, 1), pad=(1, 1))
    # identity shortcut when shapes match, otherwise a binarized convolution projection
    if dim_match:
        shortcut = data
    else:
        shortcut = bnn.Convolution(data=bnAct1, num_filter=num_filter, kernel=(3, 3), stride=stride, pad=(1, 1))
    return conv2 + shortcut

4.1 Gradient Descent of Binary Networks

Almost all current neural networks are trained with gradient-descent-based algorithms, but the weights of a binary network are only ±1, so gradients cannot be computed on them directly and the weights cannot be updated. To solve this problem, Courbariaux et al. proposed the BinaryConnect algorithm, which combines single-precision and binary weights to train a binarized neural network. It was the first work to show how to binarize the weights and how to train a binarized network. The process is as follows (a minimal code sketch is given after the list):

  1. Initialize the weights as floating-point values.

  2. Forward pass:

    - Use a deterministic method (the sign(x) function, with 0 as the threshold) to quantize the weights to +1/-1.

    - Use the quantized weights (only +1/-1) for forward propagation: convolve the input with the binary weights (which in effect involves only additions) to obtain the output of the convolutional layer.

  3. Backward pass: apply the gradients to the floating-point weights (compute the gradients through a relaxed sign function and update the single-precision weights with them); at inference time the weights are deterministically converted to +1/-1.
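To illustrate the scheme above, here is a minimal numpy sketch of a BinaryConnect-style update on a toy single-layer regression problem; the shapes, learning rate, and loss are hypothetical, and the real algorithm of course operates on full networks:

import numpy as np

def binarize(w):
    """Deterministic binarization: sign(w), with 0 mapped to +1."""
    return np.where(w >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
w_float = rng.normal(scale=0.1, size=(4, 3))   # latent single-precision weights
x = rng.normal(size=(8, 4))                    # batch of inputs
y = rng.normal(size=(8, 3))                    # toy regression targets

lr = 0.01
for step in range(100):
    w_bin = binarize(w_float)              # 1. quantize weights to +1/-1
    pred = x @ w_bin                       # 2. forward pass with binary weights
    grad_pred = 2 * (pred - y) / len(x)    # gradient of the mean squared error
    grad_w = x.T @ grad_pred               # gradient w.r.t. the binary weights
    # 3. straight-through update: apply the gradient to the float weights
    w_float -= lr * grad_w
    w_float = np.clip(w_float, -1.0, 1.0)  # keep latent weights bounded, as in BinaryConnect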

4.2 Two questions

Network binarization needs to solve two problems: how to binarize the weights and how to calculate the gradient of the binary weights.

4.3 Improvement of Binary Connection Algorithm

It can be seen (from the comparison reported in the XNOR-Net paper) that the weight-binarized neural network (BWN) achieves almost the same accuracy as the full-precision network, while the XNOR network (XNOR-Net) loses more than 10% in both Top-1 and Top-5 accuracy.

Compared with the weight-binarized network, the XNOR network also converts the network inputs into binary values, so its multiply-accumulate (MAC) operations are replaced by bitwise XNOR and bit counting (popcount).
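The following pure-Python sketch shows how a dot product between two ±1 vectors reduces to XNOR plus popcount; the bit encoding, with 1 standing for +1 and 0 for -1, is an illustrative convention:

def binary_dot(a_bits, b_bits, n):
    """a_bits, b_bits: n-bit integers whose bits encode +1 (bit 1) / -1 (bit 0)."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # 1 wherever the signs agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n                       # +1 per match, -1 per mismatch

# Example: a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  ->  dot product = 0
a = 0b1011
b = 0b1101
print(binary_dot(a, b, 4))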

For more content, you can read these two articles:

https://github.com/Ewenwan/MVision/tree/master/CNN/Deep_Compression/quantization/BNN

https://blog.csdn.net/stdcoutzyx/article/details/50926174

4.4 Design Considerations for Binary Networks

  • Avoid convolutions with kernel = (1, 1) (including the bottleneck blocks of ResNet): the weights in a binary network are 1 bit, so a 1x1 kernel greatly reduces expressiveness.

  • Increase the number of channels and the number of activation bits together: blindly increasing channels alone still wastes model capacity because the feature maps' bit width is too low, and the reverse is also true.

  • Keep the activation bit width at 4 bits or below; going higher brings little accuracy gain but significantly increases the inference cost.

5. Knowledge Distillation

This article only briefly introduces the seminal work in this field, Distilling the Knowledge in a Neural Network, which distills "logits"; later papers distill "features" instead. For a deeper understanding, Chinese readers can refer to the article "What is Knowledge Distillation? An introductory essay" (https://zhuanlan.zhihu.com/p/90049906).

Knowledge distillation (https://arxiv.org/abs/1503.02531) is a form of transfer learning. Put simply, it trains a large model (teacher) and a small model (student), and transfers the knowledge learned by the huge, complex teacher to the streamlined student through certain techniques, so that the small model attains performance close to that of the large model.

The final loss function of the student model therefore consists of two parts:

  • The first term is the cross entropy between the student model's predictions and the teacher model's "soft labels";

  • The second term is the cross entropy between the student's predictions and the ordinary (hard) class labels.

The relative importance of these two loss terms is adjusted by a weighting coefficient. In practice, the value of T (the distillation temperature) also affects the final result; it is a training hyperparameter of knowledge distillation. Generally, a larger T yields higher accuracy: the larger T is, the softer (smoother) the probability distribution becomes, as described in the paper. This is equivalent to adding a perturbation during transfer, which makes the student network learn more effectively and generalize better, and is in effect a strategy for suppressing overfitting. The overall knowledge distillation pipeline follows this teacher-student setup.
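To make the two-part objective concrete, here is a minimal numpy sketch of the softened-softmax distillation loss; the logits, the temperature T, and the weighting coefficient alpha are made-up values for illustration:

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([[5.0, 2.0, 0.5]])
student_logits = np.array([[4.0, 1.0, 1.0]])
hard_label = np.array([[1.0, 0.0, 0.0]])      # one-hot ground truth
T, alpha = 3.0, 0.7

soft_targets = softmax(teacher_logits, T)     # teacher's softened probabilities
student_soft = softmax(student_logits, T)
student_hard = softmax(student_logits, 1.0)

soft_ce = -(soft_targets * np.log(student_soft)).sum(axis=-1).mean()  # vs. teacher soft labels
hard_ce = -(hard_label * np.log(student_hard)).sum(axis=-1).mean()    # vs. hard labels
loss = alpha * soft_ce + (1 - alpha) * hard_ce
print(loss)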

The student model's actual network structure is just that of the small model, but its loss function contains the two parts above. An mxnet code example of knowledge distillation for a classification network is as follows:

# -*- coding: utf-8 -*-
"""
This snippet does not give the full model definition; it only shows the
softmax loss computation used for knowledge distillation.
"""
import mxnet as mx

def get_symbol(data, class_labels, resnet_layer_num, Temperature, mimic_weight, num_classes=2):
    backbone = StudentBackbone(data)  # StudentBackbone is the student classification backbone
    flatten = mx.symbol.Flatten(data=backbone, name="flatten")
    fc_class_score_s = mx.symbol.FullyConnected(data=flatten, num_hidden=num_classes, name='fc_class_score')
    # hard-label branch: standard softmax loss against the ground-truth labels
    softmax1 = mx.symbol.SoftmaxOutput(data=fc_class_score_s, label=class_labels, name='softmax_hard')

    import symbol_resnet  # teacher model
    fc_class_score_t = symbol_resnet.get_symbol(net_depth=resnet_layer_num, num_class=num_classes, data=data)

    # soften both the student and teacher logits with the temperature T
    s_input_for_softmax = fc_class_score_s / Temperature
    t_input_for_softmax = fc_class_score_t / Temperature

    # soft-label branch: softmax loss against the teacher's softened outputs, weighted by mimic_weight
    t_soft_labels = mx.symbol.softmax(t_input_for_softmax, name='teacher_soft_labels')
    softmax2 = mx.symbol.SoftmaxOutput(data=s_input_for_softmax, label=t_soft_labels, name='softmax_soft', grad_scale=mimic_weight)
    group = mx.symbol.Group([softmax1, softmax2])
    group.save('group2-symbol.json')

    return group

The tensorflow code example is as follows:

# one-hot encode the class labels
one_hot = tf.one_hot(y, n_classes, 1.0, 0.0)  # n_classes is the number of classes, y the class label
# one_hot = tf.cast(one_hot_int, tf.float32)
teacher_tau = tf.scalar_mul(1.0/args.tau, teacher)  # teacher: the teacher model's output logits; args.tau is the temperature T
student_tau = tf.scalar_mul(1.0/args.tau, student)  # divide the student's output logits by the temperature T
# first term: cross entropy between the (temperature-scaled) student logits and the one-hot hard labels
objective1 = tf.nn.sigmoid_cross_entropy_with_logits(labels=one_hot, logits=student_tau)
# second term: squared difference between the temperature-scaled student and teacher logits
# (logit matching, a common variant of learning from the teacher's soft targets)
objective2 = tf.scalar_mul(0.5, tf.square(student_tau - teacher_tau))
tf_loss = (args.lamda*tf.reduce_sum(objective1) + (1-args.lamda)*tf.reduce_sum(objective2))/batch_size

tf.scalar_mul multiplies a tf tensor by a fixed scalar. T is usually chosen between 1 and 20; the open-source code I referenced uses 3. Among the open-source implementations I have seen, some train the student together with the teacher, while others use an already-trained teacher to directly guide the student's training.

6. Shallow/Lightweight Network

Shallow networks: approximate the effect of a complex model by designing a shallower (fewer-layer) network with a more compact structure. However, the expressive power of a shallow network can hardly match that of a deep one, so this approach is limited to relatively simple problems, such as classification tasks with few categories.

Lightweight networks: using lightweight architectures such as MobileNetV2 or ShuffleNetV2 as the model backbone can greatly reduce the number of model parameters.
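As a rough illustration of where the savings come from, the arithmetic below compares the parameter count of a standard 3x3 convolution with that of the depthwise-separable convolution used by such backbones; the channel sizes are arbitrary:

# purely illustrative parameter-count comparison
c_in, c_out, k = 128, 256, 3
standard = c_in * c_out * k * k                    # 294,912 weights
depthwise_separable = c_in * k * k + c_in * c_out  # 1,152 + 32,768 = 33,920 weights
print(standard, depthwise_separable, standard / depthwise_separable)  # roughly 8.7x fewer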
