Support Vector Machine (SVM), a blog for beginners

Goal: find a hyperplane that separates samples from different classes. There may be many such hyperplanes, as shown in the figure below. Among them we want the one with the best generalization ability, which is the one drawn in bold in the figure. What characterizes this hyperplane? It is the one that sits "right in the middle" between the two classes!
[Figure: several candidate separating hyperplanes; the bold one in the middle is the best choice]

Margin and Support Vectors

The separating hyperplane is described by the linear equation
$w^T x + b = 0$,
where $w = (w_1; w_2; \dots; w_d)$ is the normal vector, which determines the direction of the hyperplane, and $b$ is the bias term, which determines the distance between the hyperplane and the origin.
The distance from any point $x$ to the hyperplane is:
$r = \dfrac{|w^T x + b|}{\|w\|}$
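As a quick numerical check, here is a minimal NumPy sketch that evaluates this distance formula; the hyperplane parameters and sample points are made up purely for illustration:

```python
import numpy as np

# Hypothetical hyperplane parameters (chosen only for illustration)
w = np.array([2.0, -1.0])   # normal vector
b = 0.5                     # bias term

# A few sample points
X = np.array([[1.0, 1.0],
              [0.0, 0.0],
              [-1.0, 2.0]])

# r = |w^T x + b| / ||w|| for each point
r = np.abs(X @ w + b) / np.linalg.norm(w)
print(r)  # distance of each point to the hyperplane w^T x + b = 0
```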
To classify the samples correctly, for every $(x_i, y_i) \in D$ we need: if $y_i = +1$ (positive example), then $w^T x_i + b > 0$; if $y_i = -1$ (negative example), then $w^T x_i + b < 0$. To improve generalization we want to widen this gap, so we rescale the inequalities (the rescaling assumes the data are separable with margin $r$, that is, there is a strip of width $r = \frac{2}{\|w\|}$ separating the two classes, which makes the right-hand side of each inequality equal to 1; for special problems a different $r$ could be used), and obtain:
$w^T x_i + b \ge +1$, if $y_i = +1$;
$w^T x_i + b \le -1$, if $y_i = -1$.
The points closest to the hyperplane are the ones for which the equalities above hold; these closest points are called "support vectors". The hyperplane should lie exactly in the middle between the support vectors of the positive and negative examples. The distance between a positive support vector and a negative support vector is then:
$r = \dfrac{2}{\|w\|}$
The visual diagram is as follows.
A question: is a bigger $r$ better, or a smaller one? The answer is obvious: the bigger the better, because a larger margin means the two classes are separated more widely. Conversely, if the margin is small, the two classes are barely separated, and you can imagine what happens to the generalization ability.
[Figure: the maximum-margin hyperplane, its margin $r$, and the support vectors]
Following this idea, we can construct our optimization problem: maximize the margin $r$ subject to the condition that all positive and negative examples are classified correctly:
$\max_{w,b} \dfrac{2}{\|w\|}$
s.t. $y_i \left( w^T x_i + b \right) \ge 1, \quad i = 1, 2, \dots, m$
In machine learning, maximization problems are usually converted into minimization problems, so the above can be rewritten as:
$\min_{w,b} \dfrac{1}{2}\|w\|^2$
s.t. $y_i \left( w^T x_i + b \right) \ge 1, \quad i = 1, 2, \dots, m$
This is the basic form of the support vector machine! How do we solve for $w$ and $b$? By means of the "dual problem"!
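Before working through the math, here is a minimal scikit-learn sketch (on a tiny, made-up dataset) that fits this maximum-margin classifier; using a very large C approximates the hard-margin formulation above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 2], [2, 3], [3, 3],          # class +1
              [-1, -1], [-2, -1], [-3, -2]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the learned hyperplane
b = clf.intercept_[0]     # bias term
margin = 2 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin 2/||w|| =", margin)
print("support vectors:\n", clf.support_vectors_)
```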

The Dual Problem

(If you don't understand this part, you can skip it; it just computes $w$ and $b$ through some mathematical machinery.)
The overall idea is to use the method of Lagrange multipliers: introduce variables $\alpha_i$, express $w$ and $b$ in terms of the $\alpha_i$, and then solve for the $\alpha_i$ to recover $w$ and $b$.

  1. The Lagrange multiplier expression (the Lagrangian):
    $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i \left( w^T x_i + b \right) \right)$
    So the optimization problem becomes:
    $\min_{w,b} \max_{\alpha_i} L(w, b, \alpha)$
    The computation proceeds from the inside out, but the inner $\max_{\alpha_i} L(w, b, \alpha)$ is not easy to differentiate, so we swap the order and convert it to:
    $\max_{\alpha_i} \min_{w,b} L(w, b, \alpha)$
    Now the inner problem can be solved by setting the partial derivatives with respect to $w$ and $b$ to zero:
    $\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$
    $\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$
    Substituting these back into $L(w, b, \alpha)$ gives the dual problem:
    $\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j$
    subject to:
    $\sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \dots, m$
    Look at this problem: all the $x_i$ and $y_i$ are known, so by plugging in every training point together with the constraint obtained from the partial derivative with respect to $b$, we can solve for all the $\alpha_i$. Substituting them into the formula obtained from the partial derivative with respect to $w$ then gives $w$. Note that the $\alpha_i$ obtained from the dual problem must satisfy certain conditions, namely the KKT conditions:
    $\alpha_i \ge 0; \quad y_i f(x_i) - 1 \ge 0; \quad \alpha_i \left( y_i f(x_i) - 1 \right) = 0$
    Therefore, any solutions that violate these conditions must be discarded. Among the resulting $\alpha_i$, some are zero and some are not. From the formula below it is clear that an $\alpha_i$ equal to 0 contributes nothing to $w$, meaning that point is not a support vector; only when $\alpha_i \ne 0$ does the point contribute to $w$, meaning it is a support vector! The hyperplane depends only on this small subset of the data!

$f(x) = w^T x + b = \sum_{i=1}^{m} \alpha_i y_i x_i^T x + b$
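To make this concrete, here is a rough sketch (on made-up toy data) that solves the dual problem above directly with a generic solver from scipy.optimize; real SVM libraries use specialized solvers, but this already shows that most $\alpha_i$ come out zero and only the support vectors remain:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 2], [2, 3], [3, 3],
              [-1, -1], [-2, -1], [-3, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
m = len(y)

# Q[i, j] = y_i y_j x_i^T x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Negative dual objective (scipy minimizes, we want to maximize)
def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bounds = [(0, None)] * m                                 # alpha_i >= 0

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

w = ((alpha * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
support = np.where(alpha > 1e-5)[0]           # indices with alpha_i > 0
print("alpha =", np.round(alpha, 4))
print("w =", w, "support vector indices:", support)
```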
As for how to find $b$? For a support vector $x_s$ we have $y_s f(x_s) = 1$, that is,
$y_s \left( \sum_{i=1}^{m} \alpha_i y_i x_i^T x_s + b \right) = 1$
Look at this equation: $y_s, \alpha_i, y_i, x_i, x_s$ are all known and only $b$ is unknown, so $b$ can easily be solved for.
So far, we have got the expression of the hyperplane!
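As a sanity check, here is a small sketch (again on made-up toy data) showing that scikit-learn's fitted dual coefficients reproduce the same $w$ and $b$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 2], [2, 3], [3, 3],
              [-1, -1], [-2, -1], [-3, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
alpha_times_y = clf.dual_coef_[0]
w_from_dual = alpha_times_y @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i

print("w from dual coefficients:", w_from_dual)
print("w from clf.coef_:        ", clf.coef_[0])
print("b from clf.intercept_:   ", clf.intercept_[0])
```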

Kernel Functions

The support vector machine described so far classifies linearly separable data well, but what about data that are not linearly separable? The core idea is: if the data are not linearly separable in two dimensions, find a way to map the points into three or more dimensions. In that higher-dimensional space the mapped points become linearly separable, and we can find a separating hyperplane there. As shown below.
[Figure: data that are not linearly separable in 2D become separable after mapping to a higher-dimensional space]
It can be proved that if the original space is finite-dimensional, that is, the number of attributes is finite, then there must exist a higher-dimensional feature space in which the samples are separable!
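For instance, here is a tiny sketch using the classic XOR points and a hand-picked mapping, both chosen purely for illustration, showing that data inseparable in 2D can become separable after adding one more dimension:

```python
import numpy as np
from sklearn.svm import SVC

# The classic XOR data set: not linearly separable in 2D
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, -1, -1])

# Hypothetical mapping phi(x) = (x1, x2, x1*x2) into 3D
X3 = np.c_[X, X[:, 0] * X[:, 1]]

print(SVC(kernel="linear", C=1e6).fit(X, y).score(X, y))    # < 1.0: no separating line in 2D
print(SVC(kernel="linear", C=1e6).fit(X3, y).score(X3, y))  # 1.0: separable in 3D
```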
Suppose the mapping is $x \to \phi(x)$.
Then, similar to the previous section, the optimization problem becomes:
$\min_{w,b} \frac{1}{2}\|w\|^2$
s.t. $y_i \left( w^T \phi(x_i) + b \right) \ge 1, \quad i = 1, 2, \dots, m$
and the dual problem is:
$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$
The term on the far right, $\phi(x_i)^T \phi(x_j)$, is the inner product of the mapped samples, and it is hard to compute. Why? Because the dimension after mapping can be very high! How do we avoid this problem? Mathematically, it is possible to find a function that converts the inner product after mapping into an operation on the data before mapping (the data before mapping are finite-dimensional, so this is easy to compute), as follows:
$\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = \phi(x_i)^T \phi(x_j)$

$\kappa(x_i, x_j)$ is called the kernel function: the inner product of $x_i$ and $x_j$ in the high-dimensional space equals the value computed by the kernel function in their original sample space.
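As an illustration of why this works, here is a small NumPy sketch for the degree-2 polynomial kernel $\kappa(x, z) = (x^T z)^2$ in 2D, whose explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ is known, so we can check that the kernel value equals the mapped inner product:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Kernel computed directly in the original 2D space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))      # inner product in the mapped 3D space
print(poly_kernel(x, z))    # same value, computed without ever mapping
```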
OK, after a series of similar calculations, the hyperplane is obtained:
$f(x) = w^T \phi(x) + b = \sum_{i=1}^{m} \alpha_i y_i \kappa(x, x_i) + b$
Ideally, each mapping $\phi(x)$ would have its own matching kernel function. In reality, however, not many functions satisfy the condition above, and the following kernels are the commonly used ones:
Linear kernel: $\kappa(x_i, x_j) = x_i^T x_j$
Polynomial kernel: $\kappa(x_i, x_j) = (x_i^T x_j)^d$, $d \ge 1$
Gaussian (RBF) kernel: $\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, $\sigma > 0$
Laplacian kernel: $\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\sigma}\right)$, $\sigma > 0$
Sigmoid kernel: $\kappa(x_i, x_j) = \tanh(\beta\, x_i^T x_j + \theta)$, $\beta > 0$, $\theta < 0$
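For example, here is a minimal scikit-learn sketch (with arbitrarily chosen parameters) using the Gaussian (RBF) kernel on a dataset that is clearly not linearly separable:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)   # gamma chosen arbitrarily

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor: no separating line exists
print("rbf kernel accuracy:   ", rbf_svm.score(X, y))     # near 1.0: separable after mapping
```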

Soft Margin

In the discussion so far we have assumed that the hyperplane can separate all of the data. In practice, there are often some erroneous data points, and if those points end up as support vectors they have a large influence on the hyperplane. Can we relax the conditions on the hyperplane, so that it does not have to separate all of the data perfectly? Doing so helps improve generalization. For this purpose the "soft margin" is introduced; the schematic is as follows:
[Figure: soft margin, with a few samples allowed to violate the margin constraints]
To this end, slack variables $\xi_i$ are introduced into the previous optimization problem:
$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$
s.t. $y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \dots, m$
The constraints have been relaxed: the left-hand side no longer has to be greater than or equal to 1. The slack variable $\xi_i$ measures the degree to which sample $i$ violates its constraint, and the coefficient $C$ controls how heavily violations are penalized. Applying the same Lagrange multiplier treatment as before to the dual problem yields $w$ and $b$!
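To see the effect of the slack penalty, here is a short scikit-learn sketch (toy data with one deliberately mislabeled point, values chosen only for illustration) comparing a small and a large C:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one deliberately mislabeled point (the last one)
X = np.array([[1, 2], [2, 3], [3, 3],
              [-1, -1], [-2, -1], [-3, -2],
              [2.5, 2.5]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, -1])   # last label is "wrong" on purpose

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin={margin:.3f}, #support vectors={len(clf.support_vectors_)}")

# Typically: a small C tolerates the outlier and keeps a wider margin,
# while a large C penalizes violations heavily and bends toward the outlier.
```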

Summary

Idea and computation: the idea of the support vector machine is to find a hyperplane that separates the samples. This hyperplane should lie in the middle of the two classes, and the larger the margin the better! From "the larger the margin the better" we construct the optimization problem, which is then solved with the Lagrange multiplier method.
Kernel functions: for samples that are not linearly separable, we map them into a higher-dimensional space where they become linearly separable, which introduces a mapping function. Because the inner product after mapping is hard to compute, the kernel function is introduced to convert that inner product into an operation on the samples before mapping. Only a handful of kernel functions are commonly used!
Soft margin: finally, to resist the interference of erroneous data points and improve generalization, the soft margin is introduced to make the constraints on the hyperplane less strict! The slack variables measure the degree to which the constraints are violated.

Original post: https://blog.csdn.net/no1xiaoqianqian/article/details/127580570