Regularization (L1 and L2 norms)

Sparsity means that zeros make up a relatively large proportion of the values (here, the weights).

Paraphrasing the Watermelon Book (Zhou Zhihua, "Machine Learning"), p. 252:

 

Adding a penalty term to the loss function can reduce the risk of overfitting. Using the L2 norm as the penalty is called ridge regression; it is equivalent to placing a prior on w, i.e., requiring w to follow a particular distribution. The L2 norm corresponds to a Gaussian prior on w, and the L1 norm corresponds to a Laplacian prior. Comparing the shapes of the Gaussian and Laplacian densities, the Laplacian puts more probability mass at 0, so with the L1 norm some of the weights end up exactly 0.

Also, when a weight is small, the L1 penalty on it is larger than the L2 penalty (for small |w|, |w| > w^2/2), so L1 pushes small weights toward 0 more strongly.
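As a quick illustration of this sparsity effect, here is a minimal sketch (assuming NumPy and scikit-learn are available; the synthetic data and penalty strengths are made up for the example). It fits ridge (L2) and lasso (L1) on the same data: the L1 fit sets the irrelevant weights to exactly 0, while the L2 fit only shrinks them.

```python
# Compare L2 (ridge) and L1 (lasso) penalties on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10

# True weights: only 3 of the 10 features actually matter (sparse ground truth).
true_w = np.zeros(n_features)
true_w[:3] = [3.0, -2.0, 1.5]

X = rng.normal(size=(n_samples, n_features))
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty

print("ridge coefficients:", np.round(ridge.coef_, 3))  # all nonzero, just shrunk
print("lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant ones are exactly 0
print("zero weights under ridge:", int(np.sum(ridge.coef_ == 0)))
print("zero weights under lasso:", int(np.sum(lasso.coef_ == 0)))
```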

 

Our final optimization objective is min D(w) + λ * R(w), where R(w) is the regularization term; this can be transformed into the constrained problem min D(w) s.t. R(w) <= η.
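To make this penalized-vs-constrained transformation concrete, here is a small sketch (assuming NumPy and SciPy; the matrix sizes and λ are arbitrary) for the ridge case D(w) = ||Xw - y||^2 and R(w) = ||w||^2: solving the penalized form in closed form and then solving the constrained form with η set to the resulting ||w||^2 recovers the same weights.

```python
# Penalized form vs. constrained form of ridge regression.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 2.0

# Penalized form: closed-form ridge solution w = (X^T X + λI)^(-1) X^T y.
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
eta = w_pen @ w_pen  # the budget η induced by this λ

# Constrained form: minimize D(w) subject to R(w) <= η.
res = minimize(
    fun=lambda w: np.sum((X @ w - y) ** 2),
    x0=np.zeros(5),
    constraints=[{"type": "ineq", "fun": lambda w: eta - w @ w}],
    method="SLSQP",
)

print("penalized   w:", np.round(w_pen, 4))
print("constrained w:", np.round(res.x, 4))  # matches up to solver tolerance
```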

 

The yellow region (in the figure from the referenced posts) is the constraint set induced by the penalty term; after the transformation, the problem becomes finding the minimum of D(w) within that region. As long as the region intersects the level sets of D(w), we can always find the point inside it where D(w) is smallest, and in the final picture the optimal level curve of D(w) is tangent to the constraint region. The larger λ is, the tighter the restriction and the smaller the feasible range; the smaller λ is, the larger the (orange) constraint region.

 

Looking at the formulas: L1 = |w1| + |w2| + ... + |wn|, whose derivative with respect to a (positive) wi is 1, while L2 = 1/2 * (w1^2 + w2^2 + ... + wn^2), whose derivative with respect to wi is wi. With learning rate λ, the L1 update is wi = wi - λ * 1 and the L2 update is wi = wi - λ * wi = (1 - λ) * wi. So L1 subtracts a fixed amount each step and can drive a weight all the way to 0, whereas L2 only scales the weight by (1 - λ) each step, so it shrinks more and more slowly and never reaches exactly 0.
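The difference between the two update rules can be seen directly in a few lines of plain Python (a toy sketch with a single positive weight and a made-up learning rate):

```python
# L1 vs. L2 update rules for a single positive weight, learning rate lam.
lam = 0.1
w_l1 = 1.0
w_l2 = 1.0

for step in range(1, 21):
    # L1: gradient of |w| is 1 for w > 0; clamp at 0 to avoid overshooting.
    w_l1 = max(w_l1 - lam * 1.0, 0.0)
    # L2: gradient of (1/2) * w^2 is w, so the update is w <- (1 - lam) * w.
    w_l2 = w_l2 - lam * w_l2
    print(f"step {step:2d}: L1 weight = {w_l1:.4f}, L2 weight = {w_l2:.4f}")

# After 10 steps the L1-penalized weight is exactly 0; the L2-penalized weight
# keeps decaying geometrically and stays strictly positive.
```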

Reference links:

https://www.zhihu.com/question/37096933  Wang Xiaoming, Ser Jamie

https://vimsky.com/article/969.html

"Machine Learning" Zhou Zhihua      

 


Source: www.cnblogs.com/lalalatianlalu/p/11464929.html