Deep Learning Basics Beginner Edition

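Foreword

First of all: I am currently a first-year undergraduate, so the level of what I write is not high. My only goal in writing this blog is to make it a little easier for fellow first- and second-year students to get started in this field.

This post is a summary of deep learning fundamentals based on lab interview notes, my own understanding, and several hundred blog posts I consulted. Experts, please go easy on me!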


Overfitting: The model's learning capacity is so strong that it learns features of the training samples that do not generalize.

Underfitting: The model's learning capacity is too weak, so even the general properties of the training samples are not learned well.

Ways to prevent overfitting:

1. Early stopping

After each epoch (or after every N epochs), evaluate the model on the validation set. If the validation error starts to rise as the epochs go on, stop training.

The weights at the point of stopping (in practice, usually the checkpoint with the best validation result) are used as the final parameters of the network.

This approach is intuitive: once accuracy stops improving, further training is useless and only costs more time. The key question is how to decide that validation accuracy has stopped improving. A single drop is not enough, because accuracy may fall after one epoch and rise again in later ones, so one or two consecutive decreases cannot be taken as proof that it will not improve again. The general practice is to record the best validation accuracy seen so far during training. When that best accuracy has not been beaten for 10 consecutive epochs (or more), accuracy can be considered to have stopped improving.
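
The patience rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `early_stopping` and the idea of feeding it a precomputed list of validation accuracies are assumptions made for the example, and a real training loop would compute each accuracy after training that epoch.

```python
def early_stopping(val_accuracies, patience=10):
    """Scan per-epoch validation accuracies and apply the patience rule.

    Returns (best_epoch, best_accuracy, stop_epoch), with epochs counted from 0.
    """
    best_acc = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = len(val_accuracies) - 1  # default: ran through every epoch

    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # New best: remember it and reset the patience counter.
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, best_acc, stop_epoch
```

With the history `[0.60, 0.70, 0.68, 0.72, 0.71, 0.70, 0.69]` and `patience=3`, the single dip at epoch 2 does not stop training; it stops at epoch 6, and the weights worth keeping are those from epoch 3.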

2. Data Augmentation

Flip, crop, rotate, translate, noise, brighten, blur, sharpen.

3. Regularization:

Regularization adds constraints (penalty terms) to the loss function. These constraints keep the parameters in check, so that in subsequent iterations the magnitude of the parameters, especially those of higher-order terms, cannot grow too large.

L2 regularization penalizes weights with large absolute values heavily and weights with small absolute values only very slightly. As the absolute value of a weight approaches 0, there is essentially no penalty.

L1 regularization drives many of the model parameters to exactly 0. After training, the features whose weights are 0 can be dropped, which makes the model sparse and saves storage space, because features with zero weight do not need to be stored or used in computation.

4. Dropout:

Dropout randomly ignores (masks) a fraction of the neurons during training; the fraction is a configurable parameter.

The dropped neurons temporarily contribute nothing to downstream neurons in the forward pass, and their weights receive no update in the backward pass.
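
As a rough sketch of the masking step, the function below drops each activation with probability `drop_prob`. Rescaling the surviving activations by `1 / (1 - drop_prob)` ("inverted dropout") is a common convention that the text above does not mention, so treat it as an assumption of this example, along with the names `dropout` and `rng`.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Randomly zero activations during training (inverted dropout)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At inference time every neuron is kept unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Surviving activations are scaled up so the expected value is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

Because a dropped activation is exactly 0, it contributes nothing downstream, and the gradient flowing back through it is also 0.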

· Preventing exploding (and vanishing) gradients:

1. Pre-training and fine-tuning

Pre-training: The parameters of an already trained model are saved so that a model can get better results when it is later applied to a similar task.

Fine-tuning: Start from someone else's pre-trained parameters, adapt the network as needed, and train on your own data so that the parameters adapt to it. This process is usually called fine-tuning.

2. Gradient clipping and weight regularization (for exploding gradients)

Gradient clipping:

If the gradient becomes very large, we scale it down to keep it small. Specifically, when the norm of the gradient exceeds a threshold hyperparameter, the gradient is rescaled so that its norm equals the threshold. (The threshold is generally chosen from statistics of the gradient norm, such as its mean.)
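
A minimal sketch of clipping by norm, assuming the gradient is a flat list of floats; the names `clip_by_norm` and `max_norm` are made up for this example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small enough: leave the gradient untouched
    scale = max_norm / norm
    return [g * scale for g in grad]
```

For example, the gradient `[3.0, 4.0]` has norm 5; with `max_norm=1.0` it is rescaled to approximately `[0.6, 0.8]`, which keeps its direction but has norm 1.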

Weight regularization:

If gradients still explode, another approach is to check the size of the network weights and add a penalty to the loss function for large weight values. This is called weight regularization, and it typically uses either an L1 penalty (the absolute values of the weights) or an L2 penalty (the squares of the weights).

The L1 and L2 terms are defined as follows:

The L1 term is the sum of the absolute values of the elements of the weight vector.

The L2 norm is the square root of the sum of the squares of the elements of the weight vector. The penalty actually added to the loss is usually the squared norm (this is why the L2 term of Ridge regression carries a square).

So what is the use of adding L1 and L2 regularization? Their roles, as stated in many articles, are:

L1 regularization produces a sparse weight matrix, that is, a sparse model that can be used for feature selection.

L2 regularization helps prevent overfitting; to a certain extent, L1 does too.
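
To make the two penalties concrete, here is a small sketch that adds them to a data loss. The function names and the coefficients `l1` and `l2` are assumptions for illustration, and the L2 term uses the squared form (sum of squares) that is typically added to the loss.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights (the squared L2 norm)."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus weighted L1 and L2 penalties."""
    return data_loss + l1 * l1_penalty(weights) + l2 * l2_penalty(weights)
```

For the weights `[1.0, -2.0, 0.5]`, the L1 penalty is 3.5 and the L2 penalty is 5.25. Note how the weight -2.0 accounts for 4.0 of the L2 penalty while 0.5 accounts for only 0.25, which matches the remark that L2 punishes large weights much more than small ones.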

3. Activation function:

Without an activation function, the input of each layer is a linear function of the previous layer's output. It is then easy to verify that, no matter how many layers the network has, the output is a linear combination of the input, so the hidden layers have no effect. This is equivalent to the most primitive perceptron, and the approximation ability of the network is quite limited. For this reason, a nonlinear function is introduced as the activation function, which makes the deep network much more expressive (it is no longer a linear combination of the inputs and can approximate almost any function).

Commonly used: Sigmoid, ReLU, GELU
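
For reference, the three activation functions named above can be written for a single scalar input as follows. This is a sketch using only the standard library; the GELU here is the exact form based on the Gaussian CDF, whereas some libraries use a tanh approximation instead.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def gelu(x):
    """Gaussian error linear unit: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

All three are nonlinear, which is the property that stops stacked layers from collapsing into a single linear map.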

4. Batch normalization (BN):

As the network gets deeper, the distribution of feature values in each layer gradually drifts toward the two ends of the activation function's output range (its saturation regions), and if this continues the gradient vanishes. BN pulls the feature value distribution of the layer back to roughly zero mean and unit variance, so that the values fall in the range where the activation function is more sensitive to its input. There, a small change in the input leads to a larger change in the loss, which makes the gradient larger and avoids vanishing gradients.

· Residual network:

The output of a shallow layer is carried into deeper layers through an identity mapping (a skip connection), so that the network does not degrade even as it gets deeper.

LSTM: The forget gate value in an LSTM lies in [0, 1] (it is produced by a sigmoid activation), which lets the LSTM mitigate vanishing gradients. When the forget gate saturates close to 1, the gradient of long-range information does not vanish and can propagate well through the LSTM, which greatly reduces the chance of vanishing gradients. When it is close to 0, the model deliberately blocks the gradient flow and forgets earlier information, meaning that the previous time step has no influence on the current one. As a result, the overall product $\prod_{k=t+1}^{T} \frac{\partial C^{(k)}}{\partial C^{(k-1)}}$ does not necessarily keep shrinking, and long-range gradients do not vanish completely, which alleviates the vanishing gradient problem of RNNs.

· Common optimizers:

SGD

Simple, but prone to getting stuck in local minima.

AdaGrad

By recording the historical gradients, the algorithm automatically reduces the learning rate as training proceeds. Advantage: less manual tuning of the learning rate.

Momentum

Adds inertia to gradient descent. Updates speed up along dimensions whose gradient direction stays the same and slow down along dimensions whose gradient direction keeps changing, which accelerates convergence and reduces oscillation.

Adam
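
The update rules of the first three optimizers above can be sketched for a single scalar parameter as follows. The function names, the default `beta=0.9`, and the small constant `eps` are assumptions chosen for this example, and libraries differ in details such as how the momentum term is scaled. Adam is left out to keep the sketch short.

```python
import math


def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient."""
    return w - lr * grad


def momentum_step(w, velocity, grad, lr, beta=0.9):
    """Momentum: accumulate a running direction, then step along it."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity


def adagrad_step(w, accum, grad, lr, eps=1e-8):
    """AdaGrad: divide the step by the root of the summed squared gradients."""
    accum = accum + grad * grad
    return w - lr * grad / (math.sqrt(accum) + eps), accum
```

Each function returns the new parameter value together with whatever state the optimizer carries forward (nothing for SGD, the velocity for Momentum, the accumulated squared gradient for AdaGrad). Since AdaGrad's accumulator only grows, its effective step size shrinks over time, which is the automatic learning rate reduction mentioned above.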

Commonly used loss functions:

0-1 loss, squared loss, absolute loss, logarithmic loss.

Cross-entropy loss function (classification problems): $H(p, q) = -\sum_{i=1}^{N} p(x_i) \log q(x_i)$

Function description:

Cross-entropy measures the difference between the probability distribution produced by the current model and the true distribution. Reducing the cross-entropy loss brings the predictions closer to the true distribution, which improves the model's prediction accuracy. Here p(x) is the probability under the true distribution, and q(x) is the probability estimated by the model from the data.
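
The cross-entropy formula translates almost directly into code. In this sketch `p` and `q` are lists of probabilities over the same classes, and the small `eps` that guards against `log(0)` is an assumption added for numerical safety rather than part of the definition.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), using the natural logarithm."""
    return -sum(p_i * math.log(max(q_i, eps)) for p_i, q_i in zip(p, q))
```

With a one-hot true label `p = [1, 0, 0]`, the loss reduces to `-log` of the probability the model gives the correct class, so a prediction of `[0.9, 0.05, 0.05]` scores lower (better) than `[0.7, 0.2, 0.1]`.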

· Convolution layer:

Convolution is an efficient method for extracting image features.

The formula for calculating the output size of a convolution:

Input image size: W × W.

Filter (convolution kernel) size: F × F.

Stride: S.

Padding: P pixels on each side. P = 1 is equivalent to padding the image to a size of (W + 2) × (W + 2).

Output size: N × N, where $N = \lfloor (W - F + 2P) / S \rfloor + 1$.
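
As a quick check on these quantities, the usual output-size relation N = floor((W - F + 2P) / S) + 1 can be wrapped in a helper. The function name `conv_output_size` is an assumption for this example, and the sketch assumes a square input, a square kernel, and the same padding on every side.

```python
def conv_output_size(w, f, stride=1, padding=0):
    """Side length of the output feature map for a w x w input and f x f kernel."""
    if f > w + 2 * padding:
        raise ValueError("kernel is larger than the padded input")
    return (w - f + 2 * padding) // stride + 1
```

For example, a 32 × 32 input with a 5 × 5 kernel, stride 1, and no padding gives a 28 × 28 output, while a 3 × 3 kernel with padding 1 keeps the size at 32 × 32.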

· Linear regression and softmax regression:

Linear regression is a type of regression analysis.

It assumes a linear relationship between the target value (dependent variable) and the feature values (independent variables), that is, they satisfy a multivariate linear equation such as f(x) = w1x1 + … + wnxn + b.

Then a loss function is built.

Finally, the parameters are determined by minimizing the loss function.

Softmax regression:

Unlike linear regression, which applies when the output is a continuous value, softmax regression is suited to discrete outputs such as image categories. It mainly solves classification problems, and its output layer has multiple units instead of one.
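
The step that turns the multiple output units into class probabilities is the softmax function. The sketch below is a plain-Python version; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]
```

The predicted category is then the index of the largest probability, for example `max(range(len(probs)), key=probs.__getitem__)`.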

· Multilayer perceptron:

A multilayer perceptron is a neural network made of fully connected layers with at least one hidden layer, where the output of each hidden layer is transformed by an activation function. The number of layers and the number of hidden units in each hidden layer are hyperparameters. For a single hidden layer, with input $X$, hidden-layer parameters $W_h, b_h$, output-layer parameters $W_o, b_o$, and activation function $\phi$, a multilayer perceptron computes the output as follows:

$H = \phi(X W_h + b_h)$, $O = H W_o + b_o$
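
A single-hidden-layer forward pass can be sketched in plain Python for one input sample. The helper names `dense` and `mlp_forward`, the choice of ReLU as the activation, and the convention that each row of a weight matrix holds the incoming weights of one output unit are all assumptions made for this example.

```python
def relu(x):
    return max(0.0, x)


def dense(x, weights, bias):
    """Fully connected layer: weights[j] holds the input weights of output unit j."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp_forward(x, w_h, b_h, w_o, b_o):
    """Hidden layer with ReLU, followed by a linear output layer."""
    hidden = [relu(h) for h in dense(x, w_h, b_h)]
    return dense(hidden, w_o, b_o)
```

With `x = [1.0, 2.0]`, `w_h = [[1.0, -1.0], [0.5, 0.5]]`, `b_h = [0.0, 0.0]`, `w_o = [[2.0, 2.0]]`, and `b_o = [1.0]`, the hidden pre-activations are `[-1.0, 1.5]`, ReLU turns them into `[0.0, 1.5]`, and the output is `[4.0]`.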

· Optimization:

In essence, most machine learning algorithms set up an optimization model and use an optimization algorithm to minimize the loss function (the objective function) in order to train the best model.

· Accuracy, recall, precision, and F1 score:

Accuracy: As the name implies, the proportion of all predictions (positive and negative) that are correct.

Precision: Of all samples predicted positive, the proportion that are actually positive.

Recall: Of all samples that are actually positive, the proportion that are predicted positive.

F1: The harmonic mean of precision and recall, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
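
The four metrics can be computed from binary labels as in the sketch below, where F1 is taken as the harmonic mean of precision and recall. The function name is made up for this example, labels are assumed to be 0 or 1 with 1 as the positive class, and returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary 0/1 labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

For `y_true = [1, 1, 1, 0, 0, 0, 0, 1]` and `y_pred = [1, 0, 1, 0, 1, 0, 0, 1]` there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.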

· Training error and generalization error:

Training error: The misclassification rate obtained when the model, after training on the training set, predicts on the training set itself.

Generalization error: The proportion of misclassified samples on data not seen in the training set.

K-fold cross-validation:

Method:

1. First, all samples are divided into k subsets of equal size.

2. The k subsets are traversed in turn. Each time, the current subset is used as the validation set and all other samples are used as the training set, and the model is trained and evaluated.

3. Finally, the average of the k evaluation scores is taken as the final score. In practice, k is usually 10.
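
The splitting in steps 1 and 2 can be sketched as an index generator. The name `k_fold_indices` is an assumption for this example; the sketch does not shuffle, and when the sample count is not divisible by k the first folds get one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]  # everything else
        splits.append((train, validation))
        start += size
    return splits
```

Step 3 is then just `sum(scores) / len(scores)` over the k validation scores obtained by training on each `train` split and evaluating on the matching `validation` split.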

· Weight decay:

$L_2$ norm regularization adds an $L_2$ norm penalty term to the model's original loss function to obtain the function that training minimizes. The $L_2$ norm penalty term is the product of the sum of squares of the elements of the model's weight parameters and a positive constant. Taking the linear regression loss function from the "Linear regression" section as an example and writing it as $\ell(w, b)$, the new loss with the penalty is $\ell(w, b) + \lambda \|w\|^2$, where $\lambda > 0$ is that constant.

Forward propagation and backpropagation:

· Forward propagation:

Forward propagation computes and stores the intermediate variables of the model in order, from the input layer to the output layer.

· Backpropagation:

Following the chain rule, backpropagation computes and stores the gradients of the intermediate variables and parameters in order, from the output layer to the input layer.

Relationship: Forward propagation and backpropagation depend on each other when training deep learning models. On the one hand, the forward pass depends on the current values of the model parameters, which the optimization algorithm updates using the gradients from backpropagation. On the other hand, the gradient computation in backpropagation depends on the current values of the intermediate variables, which are computed by forward propagation.

· Pooling layer:

The pooling layer reduces the spatial size of the feature maps, so the number of parameters and the amount of computation in later layers also decrease, which controls overfitting to a certain extent. Generally, pooling layers are inserted periodically between the convolutional layers of a CNN.

The pooling layer is also based on the idea of local correlation: it obtains a new element value by sampling or aggregating information from a group of locally related elements. Two kinds of pooling are usually used for downsampling:

(1) Max pooling, which selects the largest value from the set of locally related elements.

(2) Average pooling, which computes the average value of the set of locally related elements.
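
Both kinds of pooling can be sketched with one function over a 2D list. The name `pool2d` and the default 2 × 2 window with stride 2 are assumptions for this example, and no padding is applied.

```python
def pool2d(image, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D list and reduce each window."""
    if mode == "max":
        reduce_window = max
    elif mode == "avg":
        reduce_window = lambda window: sum(window) / len(window)
    else:
        raise ValueError("mode must be 'max' or 'avg'")

    rows, cols = len(image), len(image[0])
    output = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the locally related elements covered by the window.
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            out_row.append(reduce_window(window))
        output.append(out_row)
    return output
```

On the 4 × 4 grid holding 1 to 16 row by row, max pooling gives `[[6, 8], [14, 16]]` and average pooling gives `[[3.5, 5.5], [11.5, 13.5]]`; in both cases the feature map shrinks from 4 × 4 to 2 × 2.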

· CNN

A CNN mainly consists of a data input layer, convolutional layers, ReLU activation layers, pooling layers, fully connected layers, and Batch Normalization layers (not always present). A traditional neural network mainly consists of a data input layer, one or more hidden layers, and a data output layer. The comparison shows that a CNN still follows the layered structure of the traditional neural network.

· Normalization

Data normalization is an important issue when representing feature vectors in data mining. When different features are put side by side, differences in scale mean that features with small absolute values get "swallowed" by features with large values. In that case, the extracted feature vectors need to be normalized so that the classifier treats every feature equally.

1. (0, 1) normalization (min-max).

Record Max and Min and use Max − Min as the base (so that Min maps to 0 and Max maps to 1) to normalize the data: $x' = \frac{x - \text{Min}}{\text{Max} - \text{Min}}$.

2. z-score standardization

This method standardizes the data using the mean μ and standard deviation σ of the original data: $x' = \frac{x - \mu}{\sigma}$. The processed data has mean 0 and standard deviation 1.
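
The two scalings described so far can be sketched for a single feature column as follows. The function names are made up for this example, the standard deviation is the population one (dividing by the number of values), and returning zeros for a constant column is an arbitrary choice made to avoid dividing by zero.

```python
import math


def min_max_normalize(values):
    """Map values linearly so that the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

For example, `min_max_normalize([10, 20, 30])` gives `[0.0, 0.5, 1.0]`. The list `[2, 4, 4, 4, 5, 5, 7, 9]` has mean 5 and standard deviation 2, so its z-scores are `[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]`.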

3. Batch normalization.

A normalization layer is inserted at the input of each layer of the network, so the data is normalized first and then passed into the next layer. Note, however, that the batch normalization layer is itself a learnable network layer with parameters.

In the first step, we take a mini-batch of inputs $\mathcal{B} = \{x_1, \dots, x_m\}$, so the batch size is m.

The second step is to compute the mean $\mu$ and variance $\sigma^2$ of this batch.

The third step is to standardize every $x_i \in \mathcal{B}$ to get $\hat{x}_i$.

The fourth step is to apply a linear transformation to $\hat{x}_i$ to get the output $y_i = \gamma \hat{x}_i + \beta$, where $\gamma$ and $\beta$ are the learnable parameters.
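
The four steps can be sketched for a mini-batch of scalar activations (one neuron, m samples). The names `gamma` and `beta` for the scale and shift of the linear transformation follow common usage, and the small `eps` added to the variance is a standard numerical safeguard; both are assumptions of this example, which also covers only the training-time computation.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's activations over a mini-batch, then scale and shift."""
    m = len(batch)                                   # step 1: batch of size m
    mean = sum(batch) / m                            # step 2: batch mean ...
    var = sum((x - mean) ** 2 for x in batch) / m    # ... and batch variance
    normalized = [(x - mean) / math.sqrt(var + eps)  # step 3: standardize
                  for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]  # step 4: linear map
```

With the defaults `gamma=1.0` and `beta=0.0` the output has mean approximately 0 and variance approximately 1; during training these two values would be learned along with the other parameters.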

4. Layer normalization

The idea of Layer Normalization is very similar to Batch Normalization. The difference is that Batch Normalization normalizes each neuron over the samples of a mini-batch, while Layer Normalization normalizes over all the neurons of a layer for a single sample.
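
The difference in normalization axis can be seen on a small batch stored as a list of samples, where each sample is a list of neuron activations. The function names below are made up for this sketch, and the learnable scale and shift are left out to keep the contrast visible.

```python
import math


def _standardize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def normalize_per_sample(batch):
    """Layer-norm style: statistics over all neurons of one sample (each row)."""
    return [_standardize(sample) for sample in batch]


def normalize_per_neuron(batch):
    """Batch-norm style: statistics over the batch for one neuron (each column)."""
    columns = [_standardize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]
```

`normalize_per_sample` gives the same result for a sample no matter what else is in the batch, which is why it still works with a batch size of 1; `normalize_per_neuron` depends on the other samples in the batch.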

· Language model

Statistical language models:

Find a probability distribution that gives the probability of occurrence of any sentence. Equivalently, given the preceding words, find the conditional probability of the next word.

N-gram

To deal with the problem of too many free parameters, the Markov assumption is introduced: the probability of any word depends only on a limited number of words that appear before it (the previous n − 1 words in an N-gram model). A statistical language model based on this assumption is called an N-gram language model.
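
As an illustration of the Markov assumption with n = 2, the sketch below estimates bigram probabilities by counting. The function name, the whitespace tokenization, and the `<s>` and `</s>` boundary markers are assumptions for this example, and no smoothing is applied, so unseen pairs get probability 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function prob(prev, word) estimating P(word | prev) from counts."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])               # every word that has a successor
        pair_counts.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob
```

Trained on the three sentences "i like deep learning", "i like nlp", and "i enjoy flying", the model gives P(like | i) = 2/3 and P(deep | like) = 1/2, each depending only on the single preceding word.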

Why do RNNs use LN rather than BN?

In a sequential network such as an RNN, the sequence length is not fixed (so the effective depth of the network varies); for example, sentences differ in length. This makes BN hard to apply, which is why Layer Normalization was proposed.


Summary

The fundamentals of deep learning are very important, and everyone needs a solid foundation regardless of their level. That said, you do not have to begin with the fundamentals when first getting started with deep learning.

Origin blog.csdn.net/m0_60920298/article/details/124370413