Understanding Linear Regression

Linear regression is a statistical method that uses regression analysis to determine the dependence between variables. Intuitively, it is about finding the pattern hidden in the data, so that we can predict the outcome for new inputs according to that pattern. Mathematically, we treat this pattern as a function and look for a way to solve for the function's parameters. You can think of it as solving an equation, except that what we are solving for is not the x, y, z in the equation, but the appropriate coefficients.

[Figure: a scatter of two-dimensional data points with a fitted straight line]

The figure above shows many two-dimensional data points. By observation, these points seem to follow some pattern: if we draw the blue line, we can see directly that the data points cluster around it and extend along its direction. This straight line is exactly the pattern we are looking for. But how do we find it, and is the line we find the best one? If the sum of the distances from these points to the line were the smallest, that line should be the one we expect (this is similar in spirit to SVM, which chooses a dividing surface based on the distances of the points from it). Here, however, we take a different angle: if the sum of the differences between each point's y value and the y value of the line at that point's x is the smallest, that line should also be what we want.

Suppose the line has the function f(x) = y = a * x + b, where a and b are the coefficients we want to find, and x and y are the horizontal and vertical coordinates of the data points. Suppose there are n data points, and the y value of the k-th point is y_k. The distance described above for that point can be expressed as |f(x_k) - y_k|, or we can use the squared difference (f(x_k) - y_k)^2 instead (the L2-style error), so that we do not have to deal with the sign inside the absolute value. The task is now to find suitable a and b that minimize the sum of (f(x_k) - y_k)^2 over all points. Note that x and y here are the known data points, while the coefficients are the unknowns. We name the sum of (f(x_k) - y_k)^2 over all points the function J. Its unknown variables are actually a and b, so we denote it J(a, b); in the multi-dimensional case a vector θ can represent all parameters, and we write J(θ).


[Formula: J(a, b) = Σ_k (f(x_k) - y_k)^2 = Σ_k (a * x_k + b - y_k)^2]

[Formula: the same cost written with the parameter vector θ as J(θ)]
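As a concrete illustration (not part of the original article), here is a minimal Python sketch of the cost function J(a, b); the names J, xs and ys, and the sample points are my own assumptions.

```python
def J(a, b, xs, ys):
    """Sum of squared errors of the line f(x) = a*x + b over all data points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# Illustrative points lying on y = 2x - 1
xs = [0, 1, 2, 3, 4]
ys = [-1, 1, 3, 5, 7]
print(J(2, -1, xs, ys))  # 0.0 -- a perfect fit
print(J(1, 0, xs, ys))   # a larger value -- a worse fit
```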

Now we need a way to find the minimum of this function of the coefficients. Imagine J as a surface with a valley bottom; the bottom is where the slope is close to (or exactly) 0. (It cannot be ruled out that there are several valley bottoms and you only find one of them rather than the lowest; this is the difference between a so-called local optimum and the global optimum.)

[Figure: the J surface as a valley; the minimum lies where the slope approaches 0]

The slope can be obtained by differentiating J(a, b). For convenience we take partial derivatives with respect to the two dimensions a and b, that is, we look at the slope along the a direction and along the b direction separately. You can imagine standing somewhere on the rim of the valley: to reach the bottom, you can walk down a stretch in one direction, then down a stretch in the other, and keep descending the mountain by alternating between the two.

 

[Figure: the derivative of J(a) at a point a as the limit of ΔJ / Δa]
Readers unfamiliar with differentiation can refer to the figure above: the derivative is the limit of J(a) at a point a, that is, the small increment ΔJ divided by the small increment Δa. We omit the factor of 2 produced by the differentiation, since it does not affect where the minimum is.
[Formula: ∂J/∂a = Σ_k (a * x_k + b - y_k) * x_k,  ∂J/∂b = Σ_k (a * x_k + b - y_k)]
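To make the derivation concrete, here is a small Python sketch (my own illustration, reusing the assumed names xs and ys from above) that computes these partial derivatives with the factor of 2 dropped, and compares them against the ΔJ / Δa style finite-difference slope:

```python
def grad_J(a, b, xs, ys):
    """Partial derivatives of J with respect to a and b (factor of 2 omitted)."""
    dJ_da = sum((a * x + b - y) * x for x, y in zip(xs, ys))
    dJ_db = sum((a * x + b - y) for x, y in zip(xs, ys))
    return dJ_da, dJ_db

def numeric_slope(f, a, delta=1e-6):
    """Approximate the derivative as the small increment ΔJ / Δa."""
    return (f(a + delta) - f(a)) / delta

xs, ys = [0, 1, 2, 3, 4], [-1, 1, 3, 5, 7]
a, b = 1.0, 0.0
print(grad_J(a, b, xs, ys))
# The numeric slope below is roughly twice dJ/da, because the factor of 2 was dropped above.
print(numeric_slope(lambda a_: sum((a_ * x + b - y) ** 2 for x, y in zip(xs, ys)), a))
```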

In theory you could set the derivative J' to 0, substitute the data points, and solve for a and b directly, but in practice data noise and the scale and dimensionality of the data make this inconvenient. Instead we can use the gradient descent algorithm; below we will use stochastic gradient descent to run the descent training by hand on a simple data set.

Gradient descent approaches the lowest point gradually in small steps. At the start a random a is chosen as the starting point, then an appropriate step size α is chosen, and α * J(a)' is used as the length of the next move in the a direction. Should we move left or right? By observation, if we are on the right side of the lowest point the slope is positive, and to approach the lowest point we should go left; if we are on the left side the slope is negative and we should go right. So the update should be a - α * J(a)', which always moves us toward the lowest point. The value of α must be chosen carefully: if it is too small the approach takes too long, and if it is too large we keep overshooting. This yields the formula below; the b dimension works the same way. Then substitute the x and y values of the data points into the formula and run it in a loop until a and b settle into a stable state, that is, until α * J(a)' and α * J(b)' are smaller than a set threshold.

 


[Formula: a_{t+1} = a_t - α * ∂J/∂a,  b_{t+1} = b_t - α * ∂J/∂b]
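As a sketch of this update rule (again my own illustration, not part of the original article), a plain full-batch gradient-descent loop might look like this, stopping once the moves α * J(a)' and α * J(b)' drop below a threshold:

```python
def gradient_descent(xs, ys, alpha=0.01, threshold=1e-6, max_iters=100_000):
    """Full-batch gradient descent for y = a*x + b."""
    a, b = 0.0, 0.0  # starting point; could also be chosen at random
    for _ in range(max_iters):
        dJ_da = sum((a * x + b - y) * x for x, y in zip(xs, ys))
        dJ_db = sum((a * x + b - y) for x, y in zip(xs, ys))
        step_a, step_b = alpha * dJ_da, alpha * dJ_db
        a, b = a - step_a, b - step_b
        if abs(step_a) < threshold and abs(step_b) < threshold:
            break  # the moves α * J' have fallen below the threshold
    return a, b

xs, ys = [0, 1, 2, 3, 4], [-1, 1, 3, 5, 7]
print(gradient_descent(xs, ys))  # approaches a = 2, b = -1
```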

 

Now let's bring in some data. Suppose we have the following set of x and y values, where some of the y values are unknown and we need to estimate them. From the known values it is easy to see that y = 2x - 1; let's now use stochastic gradient descent to find a and b.

 

[Table: sample (x, y) pairs, with some y values left unknown]
Stochastic gradient descent does not need to use the full data set each time; instead it randomly selects one sample (or a small subset) for each training step, which reduces computation and reaches a result quickly. Because only one data point is used at a time, the function above no longer needs the summation, and the derivatives with respect to a and b simplify to the formulas below.

[Formula: a_{t+1} = a_t - α * (a_t * x + b_t - y) * x]

[Formula: b_{t+1} = b_t - α * (a_t * x + b_t - y)]
If a is updated first in each step, the freshly updated a can be used when updating b, and the update for b becomes:

[Formula: b_{t+1} = b_t - α * (a_{t+1} * x + b_t - y)]
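A minimal sketch of this single-sample update (my own illustration; the name update_step is an assumption), using the freshly updated a when updating b as described above:

```python
def update_step(a, b, x, y, alpha):
    """One stochastic-gradient step on a single data point (x, y)."""
    a_next = a - alpha * (a * x + b - y) * x   # a_{t+1}
    b_next = b - alpha * (a_next * x + b - y)  # b_{t+1}, using the freshly updated a
    return a_next, b_next

# One step from the initial values used in the text (a = 1, b = 0, step size 0.01)
print(update_step(1.0, 0.0, 3, 5, 0.01))
```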
Assume the initial value of a is 1 and the initial value of b is 0 (the initial values can be chosen at random), and set the step size to 0.01. Then pick a pair of x and y values, sequentially or at random, and substitute them into the formulas above to compute a_{t+1} and b_{t+1}. This has to be repeated many times. You will find that the values of a and b sometimes oscillate, but the overall trend is a gradual approach to a = 2, b = -1. The whole process can be carried out directly in Excel; those who can write programs can set a number of iterations, or check the size of α * J', and exit once it falls below a threshold.
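Putting the pieces together, here is a sketch of the full training loop under the same assumptions (initial a = 1, b = 0, step size 0.01); the sample points are illustrative values on y = 2x - 1, not the article's actual Excel data:

```python
import random

def sgd_train(points, a=1.0, b=0.0, alpha=0.01, epochs=5000):
    """Stochastic gradient descent for y = a*x + b, one randomly chosen point per step."""
    for _ in range(epochs):
        x, y = random.choice(points)
        a = a - alpha * (a * x + b - y) * x   # update a first
        b = b - alpha * (a * x + b - y)       # then b, using the new a
    return a, b

# Illustrative points lying on y = 2x - 1
points = [(0, -1), (1, 1), (2, 3), (3, 5), (4, 7)]
a, b = sgd_train(points)
print(a, b)  # drifts toward a = 2, b = -1
```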
The Excel workbook for the stochastic gradient calculation is attached (Stochastic gradient excel).
 
