LASSO regression, L1 regularization, and the proximal gradient descent (PGD) method

1. Structural risk and empirical risk

In the support vector machine sections we encountered slack variables, regularization terms, and optimization objectives, and in naive Bayes classification and decision trees we ran into similar function-optimization problems. These all reflect the two model-selection strategies of empirical risk and structural risk: the empirical risk is responsible for minimizing the error, making the model fit the data as well as possible, while the structural risk is responsible for regularizing the parameters, keeping them as simple as possible so as to prevent overfitting. For a common model we therefore have the following formula:

        \min_{w} \sum_{i=1}^{m} L(y_i, f(x_i; w)) + \lambda \Omega(w)

The first term, the empirical risk L(y_i, f(x_i; w)), measures the error between the true values and the predicted values; the second term, the structural risk (the regularization term Ω(w)), keeps the model as simple as possible. Ω(w) is generally a monotonic function of model complexity: the more complex the model, the larger the regularization term. A norm is usually introduced as the regularization term, which brings us to the common L0, L1, and L2 norms.


2. L0, L1, and L2 norms; LASSO regression and ridge regression

1) General definitions

                L0 norm: the number of non-zero elements in the vector.

                L1 norm: the sum of the absolute values of the elements of the vector.

                L2 norm: the square root of the sum of the squares of the elements of the vector.
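As a quick illustration of the three definitions above, here is a minimal NumPy sketch (the vector w is an arbitrary made-up example):

import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0, 1.0])   # arbitrary example vector

l0 = np.count_nonzero(w)        # L0 "norm": number of non-zero elements -> 3
l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values -> 8.0
l2 = np.sqrt(np.sum(w ** 2))    # L2 norm: square root of the sum of squares -> sqrt(26)

print(l0, l1, l2)               # np.linalg.norm(w, 1) and np.linalg.norm(w, 2) give the same L1/L2 values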

Both the L0 and L1 norms can induce sparsity; the L1 norm is the one widely used because it has better properties than L0 (it is convex, which makes the problem tractable). The L2 norm corresponds to ridge regression in the regression setting and is also known as weight decay; it is commonly used to combat overfitting. Minimizing the L2 norm drives every element of the parameter vector w towards 0. Unlike the L1 norm, L2 regularization makes the entries of w close to 0 but not exactly 0, whereas L1 regularization can set some entries of w exactly to 0. For this reason the L1 norm is often used for feature selection, while the L2 norm is usually used for parameter regularization. Adding an L1 or L2 regularization term to the regression model yields LASSO regression and ridge regression, respectively:

2) Regression model

Ordinary linear regression:

        \min_{w} \sum_{i=1}^{m} (y_i - w^{T} x_i)^2

LASSO regression:

        \min_{w} \sum_{i=1}^{m} (y_i - w^{T} x_i)^2 + \lambda \|w\|_1

Ridge regression:

        \min_{w} \sum_{i=1}^{m} (y_i - w^{T} x_i)^2 + \lambda \|w\|_2^2
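To see the different behaviour of the two penalties concretely, here is a hedged scikit-learn sketch on synthetic data (the alpha values are arbitrary, and note that scikit-learn's Lasso divides the squared-error term by 2·n_samples, so its alpha does not correspond exactly to the λ above):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only two informative features
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: many coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients are shrunk but stay non-zero

print("lasso:", np.round(lasso.coef_, 3))
print("ridge:", np.round(ridge.coef_, 3))

In a run like this the Lasso coefficients of the eight irrelevant features are typically exactly 0, while the Ridge coefficients are merely small, which is the feature-selection effect described above.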


3. Embedded selection and LASSO regression

This post mainly follows Chapter 11 of the watermelon book and discusses solving the L1-regularized problem with the proximal gradient descent (PGD) method.

1) Optimization objective

Let ∇ denote the differential (gradient) operator, and consider the optimization objective:

        \min_{x} f(x) + \lambda \|x\|_1

If f(x) is differentiable and ∇f satisfies the L-Lipschitz condition (Lipschitz continuity), i.e., there exists a constant L > 0 such that:

        \|\nabla f(x') - \nabla f(x)\|_2 \le L \|x' - x\|_2 \qquad (\forall x, x')
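For a concrete example, if f(w) = ||Xw - y||_2^2 (the squared error used by LASSO), then ∇f(w) = 2Xᵀ(Xw - y) and L = 2·λ_max(XᵀX), twice the largest eigenvalue of XᵀX, satisfies this condition. A small NumPy check on synthetic data (the sizes and random points below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.normal(size=50)

grad = lambda w: 2 * X.T @ (X @ w - y)        # gradient of f(w) = ||Xw - y||_2^2
L = 2 * np.linalg.eigvalsh(X.T @ X).max()     # a valid Lipschitz constant for grad f

for _ in range(5):
    w1, w2 = rng.normal(size=8), rng.normal(size=8)
    lhs = np.linalg.norm(grad(w1) - grad(w2))
    rhs = L * np.linalg.norm(w1 - w2)
    assert lhs <= rhs + 1e-9                  # the L-Lipschitz inequality holds
print("L =", L)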

2) Taylor expansion

Then f can be expanded in a Taylor series around x_k:

        f(x) = f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{2}(x - x_k)^{T} \nabla^2 f(\xi)(x - x_k)

where ξ lies between x and x_k.

The formula above is an exact equality. From the L-Lipschitz condition we have:

        \frac{\|\nabla f(x') - \nabla f(x)\|_2}{\|x' - x\|_2} \le L

The left-hand side, a difference quotient of the gradient, has the same form as a second derivative and is a lower bound for L. Replacing the second-order term of the Taylor expansion by L therefore turns the exact equality into an approximation (in fact an upper bound on f):

        f(x) \approx \hat{f}(x) = f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{L}{2}\|x - x_k\|_2^2

3) Simplify Taylor expansion

Next we simplify the above formula:

        \hat{f}(x) = f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{L}{2}\|x - x_k\|_2^2

        = \frac{L}{2}\left( \|x - x_k\|_2^2 + \frac{2}{L}\langle \nabla f(x_k), x - x_k \rangle + \frac{1}{L^2}\|\nabla f(x_k)\|_2^2 \right) + f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2

        = \frac{L}{2}\left\| x - \left( x_k - \frac{1}{L}\nabla f(x_k) \right) \right\|_2^2 + \varphi(x_k)

where φ(x_k) is a constant that does not depend on x:

        \varphi(x_k) = f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2
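The completion of the square above is a purely algebraic identity, so it can be sanity-checked numerically with made-up values standing in for f(x_k), ∇f(x_k), x_k, x and L:

import numpy as np

rng = np.random.default_rng(1)
xk = rng.normal(size=5)
x = rng.normal(size=5)
g = rng.normal(size=5)        # stands in for grad f(xk)
fk, L = 0.7, 2.5              # made-up values for f(xk) and the Lipschitz constant

lhs = fk + g @ (x - xk) + (L / 2) * np.sum((x - xk) ** 2)      # quadratic upper-bound form
phi = fk - np.sum(g ** 2) / (2 * L)                            # the constant phi(xk)
rhs = (L / 2) * np.sum((x - (xk - g / L)) ** 2) + phi          # completed-square form

assert np.isclose(lhs, rhs)
print(lhs, rhs)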

4) Simplify the optimization problem

If we were minimizing f(x) alone by gradient descent, each descent iteration would in fact be equivalent to minimizing the quadratic function f̂(x) above, since its minimizer is x_{k+1} = x_k - (1/L)∇f(x_k), exactly a gradient step. Extending this to our original optimization objective, which also contains the L1 term, we likewise obtain the per-step iteration:

        x_{k+1} = \arg\min_{x} \frac{L}{2}\left\| x - \left( x_k - \frac{1}{L}\nabla f(x_k) \right) \right\|_2^2 + \lambda \|x\|_1

Let

        z = x_k - \frac{1}{L}\nabla f(x_k)

Then we can calculate z first and then solve the optimization problem:

        x_{k+1} = \arg\min_{x} \frac{L}{2}\|x - z\|_2^2 + \lambda \|x\|_1

5) Solve

Let x^i denote the i-th component of x. Expanding the formula above shows that there is no cross term x^i x^j (i ≠ j), i.e., the components of x do not interact, so the optimization objective has a closed-form solution. Solving it requires the soft thresholding function: for the generic problem arg min_x ||x - z||_2^2 + λ||x||_1 its solution is:

        x^i = \begin{cases} z^i - \frac{\lambda}{2}, & z^i > \frac{\lambda}{2} \\ 0, & |z^i| \le \frac{\lambda}{2} \\ z^i + \frac{\lambda}{2}, & z^i < -\frac{\lambda}{2} \end{cases}

For our problem, substituting in (the correspondence is verified in Section 4 below) gives:

        x_{k+1}^i = \begin{cases} z^i - \frac{\lambda}{L}, & z^i > \frac{\lambda}{L} \\ 0, & |z^i| \le \frac{\lambda}{L} \\ z^i + \frac{\lambda}{L}, & z^i < -\frac{\lambda}{L} \end{cases}

Therefore, PGD allows LASSO and other methods based on minimizing an L1 norm to be solved quickly.
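Putting the pieces together, below is a minimal proximal gradient descent (PGD) sketch for LASSO with the objective ||Xw - y||_2^2 + λ||w||_1; the data, λ and the iteration count are arbitrary choices for illustration, and the piecewise update above is the same function as sign(z)·max(|z| - λ/L, 0), just written differently:

import numpy as np

def soft_threshold(z, tau):
    # componentwise solution of min_x (1/2)||x - z||^2 + tau * ||x||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]                       # sparse ground truth: only 3 non-zero coefficients
y = X @ true_w + 0.05 * rng.normal(size=100)

lam = 5.0                                           # L1 weight (arbitrary for illustration)
L = 2 * np.linalg.eigvalsh(X.T @ X).max()           # Lipschitz constant of grad ||Xw - y||^2

w = np.zeros(20)
for _ in range(500):                                # PGD iterations
    z = w - (2 * X.T @ (X @ w - y)) / L             # gradient step: z = w_k - (1/L) * grad f(w_k)
    w = soft_threshold(z, lam / L)                  # proximal step: soft threshold with lambda/L

print(np.round(w, 3))                               # the irrelevant coefficients are driven exactly to 0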


4. Proof of the soft thresholding (soft threshold) function

1) Soft threshold function

The soft threshold function was used to solve the formula above; below we verify its solution, so as to better understand the preceding derivation.

Let's take a look at the soft threshold function first:

        \mathrm{soft}\!\left(z, \frac{\lambda}{2}\right)^i = \begin{cases} z^i - \frac{\lambda}{2}, & z^i > \frac{\lambda}{2} \\ 0, & |z^i| \le \frac{\lambda}{2} \\ z^i + \frac{\lambda}{2}, & z^i < -\frac{\lambda}{2} \end{cases}
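Before going through the analytic proof, a quick numeric sanity check (an arbitrary λ, a few arbitrary z values, and a brute-force grid over x) that this formula really minimizes (x - z)^2 + λ|x| componentwise:

import numpy as np

lam = 1.0
xs = np.linspace(-5, 5, 200001)                       # dense grid of candidate x values

for z in [2.0, 0.3, -0.1, -1.7]:                      # covers all three branches of the formula
    f = (xs - z) ** 2 + lam * np.abs(xs)              # objective (x - z)^2 + lambda * |x|
    brute = xs[np.argmin(f)]                          # grid minimizer
    analytic = np.sign(z) * max(abs(z) - lam / 2, 0)  # soft threshold with threshold lambda/2
    print(z, round(brute, 3), round(analytic, 3))     # the two agree up to the grid resolution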

2) Proof

Proof:

For the optimization problem:

        \arg\min_{x} \|x - z\|_2^2 + \lambda \|x\|_1

Here x and z are both n-dimensional vectors.

Expand the objective function:

        \|x - z\|_2^2 + \lambda \|x\|_1 = \sum_{i=1}^{n} (x^i - z^i)^2 + \lambda \sum_{i=1}^{n} |x^i| = \sum_{i=1}^{n} \left[ (x^i - z^i)^2 + \lambda |x^i| \right]


The optimization problem thus becomes solving n independent scalar problems; dropping the component index and writing x and z for a single component:

        f(x) = (x - z)^2 + \lambda |x|

This is a familiar quadratic function with an added absolute-value term; taking its derivative (for x ≠ 0):

        f'(x) = 2(x - z) + \lambda \,\mathrm{sign}(x)

Let the derivative be 0:

        x = z - \frac{\lambda}{2}\mathrm{sign}(x)

Since x appears on both sides, we discuss the cases separately:

A. When z > λ/2

Suppose x < 0; then sign(x) = -1, but z - (λ/2)·sign(x) = z + λ/2 > 0, a contradiction.

Suppose x > 0; then sign(x) = 1 and z - (λ/2)·sign(x) = z - λ/2 > 0, so the stationary point lies at x = z - λ/2 > 0.

At this stationary point the value is smaller than f(0):

        f\!\left(z - \frac{\lambda}{2}\right) = \frac{\lambda^2}{4} + \lambda\left(z - \frac{\lambda}{2}\right) = \lambda z - \frac{\lambda^2}{4} = z^2 - \left(z - \frac{\lambda}{2}\right)^2 < z^2 = f(0)

Now consider x < 0:

        f(x) = (x - z)^2 - \lambda x

        f'(x) = 2(x - z) - \lambda < 0 \qquad (\text{since } x < 0 < z \text{ and } \lambda > 0)

So f(x) decreases monotonically on (-∞, 0), and the overall minimum is therefore attained at x = z - λ/2.


B. When z < -λ/2

Suppose x < 0; then sign(x) = -1 and z - (λ/2)·sign(x) = z + λ/2 < 0, so the stationary point lies at x = z + λ/2 < 0.

Suppose x > 0; then sign(x) = 1 and z - (λ/2)·sign(x) = z - λ/2 < 0, a contradiction.

At this stationary point the value is smaller than f(0):

        f\!\left(z + \frac{\lambda}{2}\right) = \frac{\lambda^2}{4} - \lambda\left(z + \frac{\lambda}{2}\right) = -\lambda z - \frac{\lambda^2}{4} = z^2 - \left(z + \frac{\lambda}{2}\right)^2 < z^2 = f(0)

Now consider x > 0:

        f(x) = (x - z)^2 + \lambda x

        f'(x) = 2(x - z) + \lambda > 0 \qquad (\text{since } x > 0 > z \text{ and } \lambda > 0)

So f(x) increases monotonically on (0, +∞), and the overall minimum is therefore attained at x = z + λ/2.


C. When -λ/2 ≤ z ≤ λ/2

Suppose x < 0; then sign(x) = -1 and z - (λ/2)·sign(x) = z + λ/2 ≥ 0, a contradiction (we assumed x < 0).

Suppose x > 0; then sign(x) = 1 and z - (λ/2)·sign(x) = z - λ/2 ≤ 0, a contradiction (we assumed x > 0).

Therefore neither x > 0 nor x < 0 yields a stationary point.

So we compare f(0 + Δx) with f(0) directly:

        f(0) = z^2

        f(0 + \Delta x) = (\Delta x - z)^2 + \lambda |\Delta x|

        f(0 + \Delta x) - f(0) = \Delta x^2 - 2 z \Delta x + \lambda |\Delta x|

When Δx > 0, by the condition z ≤ λ/2:

        f(0 + \Delta x) - f(0) = \Delta x^2 - 2 z \Delta x + \lambda \Delta x

        \ge \Delta x^2 - \lambda \Delta x + \lambda \Delta x

        = \Delta x^2 > 0

When Δx < 0, by the condition z ≥ -λ/2:

        f(0 + \Delta x) - f(0) = \Delta x^2 - 2 z \Delta x - \lambda \Delta x

        \ge \Delta x^2 + \lambda \Delta x - \lambda \Delta x

        = \Delta x^2 > 0

So f attains its minimum at x = 0.


Combining the three cases above:

        x^{*} = \begin{cases} z - \frac{\lambda}{2}, & z > \frac{\lambda}{2} \\ 0, & |z| \le \frac{\lambda}{2} \\ z + \frac{\lambda}{2}, & z < -\frac{\lambda}{2} \end{cases}


3) Correspondence with the L1 regularization / LASSO problem in the watermelon book

The optimization problem solved above is:

        \arg\min_{x} \|x - z\|_2^2 + \lambda \|x\|_1

And our PGD optimization problem is:

        x_{k+1} = \arg\min_{x} \frac{L}{2}\|x - z\|_2^2 + \lambda \|x\|_1

Multiplying the objective by 2/L does not change the location of the minimizer, so our PGD optimization problem becomes:

        x_{k+1} = \arg\min_{x} \|x - z\|_2^2 + \frac{2\lambda}{L} \|x\|_1

Substituting into the solution that combines the three cases, with λ replaced by 2λ/L (so the threshold λ/2 becomes λ/L):

        x_{k+1}^i = \begin{cases} z^i - \frac{\lambda}{L}, & z^i > \frac{\lambda}{L} \\ 0, & |z^i| \le \frac{\lambda}{L} \\ z^i + \frac{\lambda}{L}, & z^i < -\frac{\lambda}{L} \end{cases}

This proves Eq. (11.14) in the watermelon book~
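As a last sanity check of this correspondence (arbitrary λ, L and z values), minimizing (L/2)(x - z)^2 + λ|x| by brute-force grid search should land on the λ/L-threshold formula of Eq. (11.14):

import numpy as np

lam, L = 0.8, 3.0                                     # arbitrary values for the check
xs = np.linspace(-5, 5, 200001)

for z in [1.2, 0.1, -2.0]:
    f = (L / 2) * (xs - z) ** 2 + lam * np.abs(xs)    # the per-iteration PGD objective, componentwise
    brute = xs[np.argmin(f)]
    analytic = np.sign(z) * max(abs(z) - lam / L, 0)  # Eq. (11.14): threshold lambda/L
    print(z, round(brute, 4), round(analytic, 4))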


Summary:

I have finally finished Chapter 11 of the watermelon book, Feature Selection and Sparse Learning, and found that LASSO is used to solve problems throughout the chapter. So, combining the regularization material in Chapter 6 with the earlier model-evaluation material, I revisited regularization norms and LASSO. The book's general approach to solving LASSO is to obtain L from the Lipschitz continuity condition, substitute it into the optimization objective to simplify and transform the function, and then obtain the final solution through the soft threshold function. That is roughly all there is to LASSO; questions and discussion are welcome~
