Introduction to SVM (8): Slack Variables

So far we have turned an originally linearly inseparable text classification problem into a linearly separable one by mapping it into a high-dimensional space, as in the picture below:

 

[Figure: the mapped training set of circles and squares, now linearly separable]

There are thousands of circle points and thousands of square points (after all, they correspond to the documents in our training set, which is of course large). Now imagine we have another training set containing just one more article than the original. After mapping to the high-dimensional space (using the same kernel function, of course), there is one extra sample point, but it lands in a spot like this:

[Figure: the same mapped training set, with one new yellow square point lying on the wrong side of the separating hyperplane]

 

It is the yellow point in the figure. It is a square, so it belongs to the negative class. This single sample turns the linearly separable problem back into a linearly inseparable one. Problems like this, where only a handful of points spoil linear separability, are called "approximately linearly separable" problems.

Judging by common sense: if 10,000 points follow a certain pattern (and are therefore linearly separable) and a single point does not, does that one point represent some aspect of the classification rule that we have failed to consider (so that the rule should be modified to accommodate it)?

In fact, we would more likely conclude that this sample point is simply an error: it is noise, mislabeled when whoever built the training set dozed off during manual annotation. So we would just ignore this sample point and keep using the original classifier, whose performance would not be affected at all.

But this tolerance of noise comes from human judgment, not from our program. Our original optimization problem has to take every sample point into account (none can be ignored, because how would the program know which one to drop?) and, on that basis, maximize the geometric margin between the positive and negative classes. Since the geometric margin represents a distance and is therefore non-negative, a noisy situation like the one above makes the whole problem infeasible. This formulation is also called "hard margin" classification, because it rigidly requires every sample point's distance to the separating hyperplane to exceed a certain value.

The example above shows that the result of hard-margin classification is easily controlled by a small minority of points, which is risky (there is a saying that the truth is always in the hands of the few, but that is just a phrase the few use to console themselves; here we will stay democratic).

The fix, however, is also obvious: follow the human approach and allow some points to fall short of the original distance requirement to the separating hyperplane. Since the distance scales of points differ from one training set to another, measuring this with the functional margin (rather than the geometric margin) keeps the expressions simple. Our original requirement on the sample points was:

 

$$y_i\bigl[(w \cdot x_i) + b\bigr] \ge 1 \qquad (i = 1, 2, \dots, l)$$

It means that even the sample points closest to the separating hyperplane have a functional margin of at least 1. To introduce some fault tolerance, we add a slack variable to that hard threshold of 1, that is, we allow

$$y_i\bigl[(w \cdot x_i) + b\bigr] \ge 1 - \xi_i \qquad (i = 1, 2, \dots, l)$$

Because the slack variables are non-negative, the net effect is that the required functional margin may be smaller than 1. When a point's margin is indeed smaller than 1 (such points are also called outliers), it means we give up classifying that point accurately, which is a loss for our classifier. Giving up those points, however, also brings a benefit: the separating hyperplane no longer has to bend toward them, so a larger geometric margin can be obtained (and the decision boundary in the original low-dimensional space is smoother). Obviously we have to weigh this loss against the benefit; the benefit is clear, since the larger the margin, the better. Recall our original optimization problem for hard-margin classification:

$$\begin{aligned}
\min_{w,\,b} \quad & \tfrac{1}{2}\|w\|^2 \\
\text{s.t.} \quad & y_i\bigl[(w \cdot x_i) + b\bigr] \ge 1, \quad i = 1, 2, \dots, l
\end{aligned}$$

Here $\|w\|^2$ is our objective function (the coefficient in front of it is a matter of convention), and we want it to be as small as possible, so the loss must be a quantity that makes it larger (something that makes it smaller would not be called a loss, since we already want the objective value to be as small as possible). There are two common ways to measure the loss: some people like to use

$$\sum_{i=1}^{l} \xi_i^2$$

while others like to use

$$\sum_{i=1}^{l} \xi_i$$

where $l$ is the number of samples. There is no great difference between the two choices. If the first is used, the resulting method is called a second-order soft-margin classifier; with the second, it is called a first-order soft-margin classifier. When the loss is added to the objective function, a penalty factor is needed (the cost, which is the parameter C in libSVM), and the original optimization problem becomes:

$$\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \\
\text{s.t.} \quad & y_i\bigl[(w \cdot x_i) + b\bigr] \ge 1 - \xi_i, \quad i = 1, 2, \dots, l \\
& \xi_i \ge 0, \quad i = 1, 2, \dots, l
\end{aligned} \tag{1}$$

There are a few things to note about this formula:

First, not every sample point effectively has a slack variable. In fact only the "outliers" do; put differently, the slack variables of all non-outlier points equal 0. (For the negative class, the outliers are the negative sample points in the earlier figure that stray to the right of H2; for the positive class, they are the positive sample points that stray to the left of H1.)

Second, the value of a slack variable indicates how far the corresponding point strays from its group: the larger the value, the farther the point is.

Third, the penalty factor C determines how much weight you give to the loss caused by outliers. Obviously, for a fixed sum of the outliers' slack variables, the larger you set C, the greater the loss added to the objective function, which signals that you are very reluctant to give up those outliers. In the most extreme case, you set C to infinity: the slightest outlier then makes the objective value infinite, the problem immediately has no solution, and it degenerates into the hard-margin problem.

Fourth, the penalty factor C is not a variable of the optimization. While the optimization problem is being solved, C is a value you must specify in advance. Having fixed it, you solve the problem to obtain a classifier, then evaluate it on test data; if the result is not good enough, you change C, solve the optimization problem again, get another classifier, and look at its performance again. This is a process of parameter tuning, but it is by no means the same thing as the optimization problem itself: throughout the solution of the optimization problem, C remains a fixed value. Keep that in mind. (A small code sketch illustrating this and the earlier notes follows this list.)

Fifth, even with slack variables added, this is still an optimization problem (ahem, that sounds like a tautology), and the process of solving it is no more special than that of the original hard-margin problem.
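To make the notes above concrete, here is a minimal sketch using scikit-learn's SVC (a wrapper around libSVM, whose cost parameter is the C discussed above). The toy dataset, the flipped label playing the role of the yellow outlier, the candidate C values, and the slack computation $\xi_i = \max(0,\, 1 - y_i f(x_i))$ are all illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, almost linearly separable data with one mislabeled "noise" point (an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=0)
y = np.where(y == 0, -1, 1)            # use +1 / -1 labels, as in the formulas above
y[0] = -y[0]                           # flip one label to play the role of the yellow outlier

for C in (0.1, 1.0, 100.0):            # C must be fixed before each solve (note 4)
    clf = SVC(kernel="linear", C=C).fit(X, y)
    f = clf.decision_function(X)       # f(x) = w.x + b
    xi = np.maximum(0.0, 1.0 - y * f)  # slack: zero for non-outliers, positive for outliers (notes 1 and 2)
    print(f"C={C:6.1f}  outliers={int(np.sum(xi > 1e-6)):3d}  total slack={xi.sum():.2f}")
```

As C increases, the total slack generally shrinks and the margin narrows; for very large C the behaviour approaches the hard-margin classifier of note 3.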

Viewed from a distance, solving the optimization problem goes roughly like this: first tentatively fix w, that is, fix the three straight lines in the earlier figure; then see how large the margin is and how many outliers there are, and compute the value of the objective function; then switch to another set of three lines (you can see that when the separating line moves, some points that were outliers stop being outliers, while others that were not become outliers), compute the objective value again, and so on iteratively until the w that minimizes the objective is found.
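The "compute the objective value for a candidate w" step can be written down directly from Equation (1). Below is a minimal sketch under assumed toy values for w, b, C and the data; it only shows the arithmetic, not an actual solver.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Value of Equation (1): 0.5*||w||^2 + C*sum(xi), where xi = max(0, 1 - y*(w.x + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack needed by each point for this candidate
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# Assumed toy data: three positive and three negative points in 2-D.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-2.0, -1.0], [-1.5, -2.5], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Two candidate (w, b) pairs: an iterative procedure would keep the one with the smaller objective.
print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))
print(soft_margin_objective(np.array([0.3, 0.3]), 0.0, X, y, C=1.0))
```

A real solver does not try candidates blindly, but every candidate (w, b) it considers is scored by exactly this quantity.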

Having said all this, the reader can probably sum it up at once: slack variables are simply a way of dealing with linear inseparability. But recall that the kernel function was also introduced to deal with linear inseparability. Why use two methods for one problem?

In fact there is a subtle division of labor between the two. The typical workflow, taking text classification as an example, goes like this: in the original low-dimensional space the samples are severely inseparable, and no matter how you place the separating hyperplane there will always be a large number of outliers. A kernel function is then used to map the data into a high-dimensional space; the result may still not be separable, but it is much closer to linear separability than the original space (that is, it reaches the approximately linearly separable state), and at that point it is far simpler and more effective to handle the few remaining "stubborn" outliers with slack variables.
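A brief sketch of that division of labor, again using scikit-learn's SVC as an assumed stand-in: the RBF kernel handles the gross non-linearity (the concentric-rings layout of make_circles is an illustrative choice, not from the original text), while a finite C leaves slack for the few noisy points.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumed toy problem: two concentric rings with label noise, hopeless for a purely linear separator.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

linear = SVC(kernel="linear", C=1.0)            # no kernel mapping: many outliers remain
rbf_soft = SVC(kernel="rbf", gamma=2.0, C=1.0)  # kernel handles the non-linearity, slack handles the noise

for name, clf in [("linear", linear), ("rbf + soft margin", rbf_soft)]:
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:18s} mean CV accuracy: {score:.3f}")
```

On data like this the linear classifier typically scores near chance, while the kernelized soft-margin classifier separates the rings despite the noisy points.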

Equation (1) in this section is indeed the most commonly used formulation of the SVM. At this point we have a reasonably complete picture of the support vector machine: in short, an SVM is a soft-margin linear classification method equipped with a kernel function.

In the next section I will cover a few loose ends about slack variables and, while I am at it, run a small reader survey to see which other aspects of SVM you would like discussed.
