[Machine Learning] P21 Regularization (L1 Regularization / Lasso, L2 Regularization / Ridge, Elastic Net Regularization, Dropout Regularization, Early Stopping)

Since a model always has some chance of overfitting, how can we reduce or prevent it? One family of techniques is regularization.

My current understanding of regularization: "We hope to build a better network by adjusting the weights, but we also worry about overfitting; regularization is a method that addresses this."

My tentative understanding of neural networks: "Through training we hope to obtain a better network. What is a better network? Simply put, it is the one with the lowest loss value, and not only on the training set but also on the test set. But is it really just a matter of updating parameters through gradient descent to reduce the loss? I think the data follows underlying regularities, and people are merely trying to find them; the intuitive embodiment of those regularities is the multi-dimensional relationship among the parameters across many layers and the many neurons in each layer."



What is regularization and what is its role

Regularization is a common machine learning technique used to reduce model overfitting. In machine learning, the goal of a model is to learn patterns in the training data and apply those patterns to new, unseen data. However, when a model is too complex, it may overfit the training data: it relies too much on noise and unnecessary features in the training data, and therefore performs poorly on new, unseen data.

How to achieve regularization?
Regularization penalizes model complexity by adding extra constraints or penalty terms to the model's loss function (how each method adds its penalty term is explained in detail in the sections below), thereby reducing the model's complexity and its tendency to overfit.
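As a rough sketch of this idea (my own illustration, not code from the article; the names are assumptions), the regularized objective is just the original data-fitting loss plus a weighted complexity penalty:

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty):
    m = X.shape[0]
    mse = np.sum((X @ w - y) ** 2) / (2 * m)  # data-fitting term
    return mse + lam * penalty(w)             # penalty discourages complexity

# Concrete penalties, detailed in the sections below:
l1_penalty = lambda w: np.sum(np.abs(w))     # Lasso
l2_penalty = lambda w: 0.5 * np.sum(w ** 2)  # Ridge
```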


Commonly used regularization methods

L1 regularization (also known as Lasso regularization)

In L1 regularization, we add an L1-norm penalty to the model's loss function. This penalty term is the sum of the absolute values of the model parameters. For example, in a simple linear regression model, the objective function can be written as:

  • Loss function with L1 penalty:

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=0}^{n-1} |w_j|$$

In the formula above, $y^{(i)}$ is the actual value of the $i$-th sample, $f_{w,b}(x^{(i)})$ is the model's prediction for the $i$-th sample, $m$ is the number of samples, $n$ is the number of model parameters, $w_j$ is the value of the $j$-th parameter, and $\lambda$ is the regularization coefficient, which balances the loss function against the regularization term.

  • L1 penalty term:

$$\lambda \sum_{j=0}^{n-1} |w_j|$$

Clearly, the objective function consists of two parts: the original loss function and the L1 penalty term. The L1 penalty adds the sum of the absolute values of the model parameters to the objective, so the optimization algorithm must consider not only how to fit the training data but also how to keep that sum as small as possible. This drives the values of some parameters to exactly zero.

Specifically, as the optimization algorithm iteratively updates the model parameters, the L1 penalty causes the values of some parameters to shrink steadily until they finally reach zero. This is because the optimizer computes the gradient and adjusts the parameters along it; once the L1 penalty is added, every update also subtracts the penalty's contribution to the gradient, which keeps pushing small parameters toward zero.

  • Gradient with L1 penalty:

$$\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\partial \left( \lambda \sum_{k=0}^{n-1} |w_k| \right)}{\partial w_j}$$

  • As for the gradient of the L1 penalty term: because

$$\sum_{k=0}^{n-1} |w_k| = |w_0| + |w_1| + \cdots + |w_{n-1}|$$

  • it can be simplified to:

$$\frac{\partial \left( \lambda \sum_{k=0}^{n-1} |w_k| \right)}{\partial w_j} = \lambda \cdot \frac{\partial \left( |w_0| + |w_1| + \cdots + |w_j| + \cdots + |w_{n-1}| \right)}{\partial w_j} = \lambda \cdot \mathrm{sign}(w_j)$$

  • where:

$$\mathrm{sign}(w_j) = \begin{cases} -1, & w_j < 0 \\ 0, & w_j = 0 \\ 1, & w_j > 0 \end{cases}$$

  • So the formula for parameter update is:

$$w_j = w_j - \alpha \frac{\partial L(w)}{\partial w_j} - \alpha \lambda \, \mathrm{sign}(w_j)$$

  • That is:

$$w_j = w_j - \alpha \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \, \mathrm{sign}(w_j)$$

Once a parameter has been driven to zero, $\mathrm{sign}(w_j) = 0$ and the penalty term no longer updates it; if the data gradient is also negligible, the parameter simply stays at zero, which is how L1 regularization zeroes out parameter values.

For example, for a linear regression model, L1 regularization may set the coefficients of certain features to zero, meaning that these features do not contribute anything to the model's predictions. This can make the model simpler, reduce the risk of overfitting, and also improve the interpretability of the model.
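As a minimal sketch of the update rule just derived (my own illustration; the function and variable names are assumptions, not from the article):

```python
import numpy as np

# Linear regression trained by gradient descent with an L1 (Lasso) penalty,
# following the parameter-update formula above.
def lasso_gradient_descent(X, y, lam=0.1, alpha=0.01, epochs=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        err = X @ w + b - y                             # f_{w,b}(x) - y
        grad_w = (X.T @ err) / m                        # data-fitting gradient
        w -= alpha * grad_w + alpha * lam * np.sign(w)  # extra L1 step
        b -= alpha * err.mean()
    return w, b
```

With a moderately large `lam`, many entries of `w` end up at or near zero, which is the sparsity effect described above. (Production Lasso solvers use soft-thresholding or coordinate descent, since $|w|$ is not differentiable at 0; this subgradient version is only illustrative.)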


L2 regularization (also known as Ridge regularization)

The L2 regularization method penalizes the model's weight parameters with the squared L2 norm, keeping the weights smooth and avoiding excessively large values, thereby alleviating overfitting.

The following formulas walk through the derivation step by step:

  • Loss function with L2 penalty:

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=0}^{n-1} w_j^2$$

  • Gradient with L2 penalty:

$$\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\partial \left( \frac{\lambda}{2} \sum_{k=0}^{n-1} w_k^2 \right)}{\partial w_j}$$

  • As for the gradient of the L2 penalty term: because

$$\sum_{k=0}^{n-1} w_k^2 = w_0^2 + w_1^2 + \cdots + w_{n-1}^2$$

  • it can be simplified to:

$$\frac{\partial \left( \frac{\lambda}{2} \sum_{k=0}^{n-1} w_k^2 \right)}{\partial w_j} = \frac{\lambda}{2} \cdot \frac{\partial \left( w_0^2 + w_1^2 + \cdots + w_j^2 + \cdots + w_{n-1}^2 \right)}{\partial w_j} = \lambda \, w_j$$

  • So the formula for parameter update is:

$$w_j = w_j - \alpha \frac{\partial L(w)}{\partial w_j} - \alpha \lambda \, w_j$$


So, why does the L2 penalty make the weight parameters smoother?

  • First, the L2 penalty makes the training gradient more sensitive to larger weight values (the penalty grows with the square of the weight), so large weights are reduced more aggressively;
  • By contrast, without a regularization term the model may favor very large parameter values that fit the training set extremely well but fail to generalize to the test set, resulting in overfitting.
  • By penalizing the squares of the parameters, L2 regularization effectively smooths all of the model's parameters, preventing the model from becoming too complex and improving its generalization ability. It therefore guarantees a degree of smoothness in the parameters, making the model more stable and better at generalizing.
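Compared with the Lasso sketch above, the Ridge version changes a single line (again my own illustration): the $\mathrm{sign}(w)$ term becomes $w$ itself, so each weight is shrunk in proportion to its size, which is why L2 regularization is also known as "weight decay":

```python
# Inside the same training loop as the Lasso sketch, the L2 update would be:
w -= alpha * grad_w + alpha * lam * w   # shrink each weight proportionally
# equivalently: w = (1 - alpha * lam) * w - alpha * grad_w  ("weight decay")
```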

Elastic Net Regularization

Elastic Net regularization combines L1 and L2 regularization in a single linear model, aiming to overcome the shortcomings of each while retaining the advantages of both.

  • The objective function of elastic net regularization adds both an L1 and an L2 penalty to the loss:

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 + \lambda \rho \sum_{j=0}^{n-1} |w_j| + \frac{\lambda (1 - \rho)}{2} \sum_{j=0}^{n-1} w_j^2$$

where $m$ is the number of samples, $n$ is the number of features, $f_{w,b}(x^{(i)})$ is the model's prediction, $y^{(i)}$ is the true value, $w, b$ are the model parameters, $\lambda$ is the regularization strength, and $\rho$ is the mixing ratio between the L1 and L2 penalties, usually set to 0.5.

  • Combining the above-mentioned L1 and L2 regularization parts, the formula for parameter update can be obtained as:

$$w_j = w_j - \alpha \frac{\partial L(w)}{\partial w_j} - \alpha \rho \lambda \, \mathrm{sign}(w_j) - \alpha \lambda (1 - \rho) \, w_j$$

The first term of the objective is the mean squared error (MSE), measuring the gap between the model's predictions and the true values; the second is the L1 penalty, which penalizes the absolute values of the parameters and produces sparse weights; the third is the L2 penalty, which penalizes the squares of the parameters and keeps the weights smooth.

Compared with L1 or L2 regularization alone, elastic net regularization avoids both the over-sparsification of L1 and the over-smoothing of L2. On high-dimensional datasets with many features, it can select the relevant features while damping the influence of irrelevant ones, thereby improving the model's generalization ability.
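In practice this is rarely hand-rolled; scikit-learn's `ElasticNet` implements it directly. A quick sketch (my own example; note that sklearn's `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\rho$):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data: 20 features, only 5 of which are actually informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # alpha ~ lambda, l1_ratio ~ rho
model.fit(X, y)

# The L1 part zeroes out many coefficients; the L2 part keeps the rest moderate.
print((model.coef_ == 0).sum(), "of", len(model.coef_), "coefficients are zero")
```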


Dropout regularization

Dropout is a widely used regularization technique to prevent neural networks from overfitting. In dropout regularization, some neurons are randomly selected and dropped during training, i.e. their outputs are set to zero.

Dropout works as follows: in each training iteration, every neuron has a certain probability of being dropped, usually between 0.2 and 0.5. Because neurons are dropped at random, the network cannot come to depend on any single neuron. This helps prevent overfitting: a network that relies too heavily on particular neurons may fail to make accurate predictions on new data.

The steps are as follows (a code sketch follows the list):

  1. Random deactivation: in each training iteration, a subset of neurons is randomly selected and deactivated (their output values are set to 0), each with probability $p$, where $p$ is typically between 0.2 and 0.5.

  2. Forward propagation: the deactivated neurons contribute 0 to the next layer during forward propagation; the network produces a prediction, and the loss is computed from the gap between the prediction and the actual value.

  3. Backpropagation: the error is backpropagated and the parameters are updated. The weights of the deactivated neurons do not participate in the computation, so the update only affects the retained neurons, which helps avoid overfitting.

  4. Restoring neurons: the neurons deactivated in this iteration are restored, keeping the weights they already had (see below), and take part in subsequent iterations, where a new random subset is dropped.

  5. Repeated execution: Repeat the above steps several times until the model converges.
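A minimal sketch of one layer's dropout mask (my own illustration; this is the "inverted dropout" form that frameworks such as Keras use internally, which also rescales by $1/(1-p)$ so that the expected activation is unchanged, a detail the steps above omit):

```python
import numpy as np

# Apply inverted dropout to a layer's activations `a` with drop probability p.
def dropout_forward(a, p=0.5, training=True):
    if not training:
        return a                              # inference: use all neurons
    mask = np.random.rand(*a.shape) >= p      # keep each unit with prob 1 - p
    return a * mask / (1.0 - p)               # rescale to preserve expectation
```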

  • How are discarded neurons restored?

When using dropout regularization, the activation values of dropped neurons are set to 0 during training. This means these neurons have no influence on other neurons during forward and backward propagation.

For example, suppose a neuron has a weight value of 0.1. During training, if this neuron is randomly dropped, its activation value is set to 0, but its weight value remains at 0.1. During forward propagation and backpropagation, the influence of this neuron on other neurons is ignored. After this iteration, the weight value of this neuron remains at 0.1.

During the next iteration, this neuron may be kept (i.e. no longer dropped). In that case, it simply uses its original weight value (0.1 in this example) and re-participates in forward and backward propagation. In this way, dropout regularization helps prevent the neural network from relying too much on any single neuron, improving the network's generalization ability.

So, when a neuron is dropped, its weight values remain unchanged. In the next iteration, if the neuron is not dropped, it resumes with its original weight values rather than being assigned random ones.

  • Why might a neural network depend too much on a certain neuron?

There are several reasons why a neural network may rely too much on a certain neuron during training:

  1. High weight: The weight of a neuron determines its importance in the neural network. When the weight of a certain neuron is large, the contribution of the neuron to the output of the network is also relatively large. This can cause the network to rely too heavily on this neuron during training. However, this reliance can lead to overfitting, as the network overfits to specific patterns in the training data, reducing its ability to generalize.

  2. Unreasonable network structure: If the neural network structure is not reasonable, it may cause some neurons to become too important during the training process. For example, in an overly complex network, the network may learn noise in the training data, leading to overfitting. In this case, the network may over-rely on certain neurons to capture these noise patterns in the data.

  3. Imbalanced training data: If the training data has class imbalance or some features are too prominent, the neural network may over-rely on certain neurons to fit these features. This may lead to a decline in the generalization ability of the network when faced with new data.

  • An MNIST training example using dropout regularization:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.datasets import mnist

# Load the MNIST dataset and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define the model: each Dropout layer drops 50% of the previous layer's units
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=40, batch_size=128, validation_data=(x_test, y_test))

from sklearn.metrics import accuracy_score
import numpy as np

# Evaluate on the test set
pred_y = model.predict(x_test)
pred_labels = np.argmax(pred_y, axis=1)
acc = accuracy_score(y_test, pred_labels)
print('Test accuracy:', acc)
```



Early stopping

Early stopping is another regularization method, based on validation error. The basic idea: during model training we usually split the data into a training set and a validation set; the training set is used to train the model, and the validation set is used to monitor the model's generalization ability. After each training epoch we compute the model's error on the validation set (e.g. the cross-entropy loss in a classification task). If the validation error starts to rise, we stop training; otherwise we continue.

Note, though, that training is generally not stopped the instant the validation error starts to rise. If the validation error is rising only very slowly, or is still close to the previous minimum, we usually let the model keep training for a while so it can optimize further, rather than stopping immediately.

So the early stopping method does not halt training the moment the validation error increases; it decides based on the current validation error relative to the best value seen so far. In practice we set a threshold to control the stopping condition, for example stopping when the validation error has not decreased for several consecutive epochs, or has worsened beyond a given margin.
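A minimal sketch of this patience logic (my own illustration; `train_one_epoch`, `validate`, and `snapshot` are hypothetical callbacks, not from the article):

```python
def early_stopping_loop(train_one_epoch, validate, snapshot,
                        max_epochs=100, patience=5):
    best_val, best_weights, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()                # one pass over the training set
        val_loss = validate()            # loss on the held-out validation set
        if val_loss < best_val:
            best_val, wait = val_loss, 0
            best_weights = snapshot()    # remember the best weights so far
        else:
            wait += 1
            if wait >= patience:         # no improvement for `patience` epochs
                break
    return best_weights
```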

  • A TensorFlow example of MNIST handwritten-digit recognition using early stopping:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import mnist

# Load the dataset and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define the model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Define the loss function and optimizer
loss_fn = keras.losses.SparseCategoricalCrossentropy()
optimizer = keras.optimizers.Adam()

# Define early stopping
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

# Train the model
model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test), callbacks=[early_stop])

from sklearn.metrics import accuracy_score
import numpy as np

# Evaluate on the test set
pred_y = model.predict(x_test)
pred_labels = np.argmax(pred_y, axis=1)
acc = accuracy_score(y_test, pred_labels)
print('Test accuracy:', acc)
```

In the simple code above, we use early stopping to prevent overfitting:

```python
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
```

Here `patience=5` means that if the validation loss has not decreased for 5 consecutive epochs, training stops.
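One useful addition (a standard Keras option, though not used in the snippet above) is `restore_best_weights=True`, which rolls the model back to the weights from the epoch with the best validation loss instead of keeping the weights from the final, already-degraded epochs:

```python
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True,  # roll back to the best epoch's weights
)
```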

