Machine Learning in Practice, Ultra-Detailed Guide (22): Three Tricks for Imbalanced Positive and Negative Samples (2): Setting Different Loss Weights

The previous article described the first trick: using under-sampling (or over-sampling) to deal with imbalanced positive and negative samples. In this article we introduce the second trick: solving the imbalance by giving the positive and negative samples different loss weights.

Suppose we have a dataset with 100 samples of class 0 and 10 samples of class 1, 110 samples in total, so the ratio of class-0 to class-1 samples is 10:1. All of the methods introduced below use this dataset as the running example.

First: the simple, brute-force method

1. The principle

Consider an analogy: I have to sit both a Chinese exam and a math exam, but I only have one Chinese mock paper and ten math mock papers. Because I get far less practice on the Chinese questions, I need to pay extra attention to that single Chinese paper.

Similarly, we give the loss of the class with few samples a higher weight and the loss of the class with many samples a lower weight. Simply and crudely increasing the loss weight of the minority class makes the model pay more attention to the information carried by those scarce samples.

Let's go through the concrete implementation using the cross-entropy loss function as the example.

The ordinary cross-entropy loss function is calculated as follows:

Loss = −y·lg(p) − (1−y)·lg(1−p)

For a class-1 sample, if the model says the probability that it is class 1 is 0.3, then Loss = −lg(0.3) = 0.52.

For a class-0 sample, if the model says the probability that it is class 1 is 0.7 (i.e. the predicted probability that it is class 0 is also 0.3), then Loss = −lg(0.3) = 0.52 as well.

So for class-0 and class-1 samples whose predictions are off by the same amount, the loss function penalizes the model equally.

For the dataset above, class-1 samples are only 1/10 as numerous as class-0 samples, so we can multiply the loss weight of class-1 samples by 10. The new loss function becomes:

Loss2 = −10·y·lg(p) − 1·(1−y)·lg(1−p)

For the class-1 sample above, the model still predicts a probability of 0.3 that it is positive, but the loss becomes Loss2 = −10·lg(0.3) = 5.2; for the class-0 sample, Loss2 is still 0.52.

This means that every misclassified class-1 sample costs the model 10 times the loss of a class-0 sample, which forces the model to pay enough attention to the misclassified positive samples and to make full use of the few positive samples it has.
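To make the arithmetic concrete, here is a minimal sketch (plain NumPy, added for illustration; the function name is my own) that reproduces the numbers above with a per-class weighted cross-entropy:

import numpy as np

def weighted_cross_entropy(y, p, w_pos=1.0, w_neg=1.0):
    # Binary cross-entropy with per-class weights, using base-10 log as in the text
    return -w_pos * y * np.log10(p) - w_neg * (1 - y) * np.log10(1 - p)

print(weighted_cross_entropy(y=1, p=0.3))              # class-1 sample, ~0.52
print(weighted_cross_entropy(y=0, p=0.7))              # class-0 sample, ~0.52
print(weighted_cross_entropy(y=1, p=0.3, w_pos=10.0))  # class-1 sample with 10x weight, ~5.2
print(weighted_cross_entropy(y=0, p=0.7, w_neg=1.0))   # class-0 sample unchanged, ~0.52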

2. Implementation in scikit-learn

Most of the time we do not have to implement this process ourselves: many of the models and algorithms in scikit-learn provide an interface that lets us adjust the class weights according to the actual sample counts.

Take the SVM algorithm as an example. By setting the class_weight parameter we can manually assign different weights to the classes. For instance, to set the loss-weight ratio of positive to negative samples to 10:1, just add the parameter class_weight={1: 10, 0: 1}.

Of course, we can be even lazier and simply write class_weight='balanced'. The SVM will then automatically balance the weights so that each class's weight is inversely proportional to its sample count; the weight is computed as: total number of samples / (number of classes × number of samples in that class).

For the dataset shown above, this works out to a weight of 110 / (2 × 100) = 0.55 for class-0 samples and 110 / (2 × 10) = 5.5 for class-1 samples. The ratio of sample counts (class 0 to class 1) is 10:1, and the ratio of the resulting loss weights is 0.55 : 5.5 = 1 : 10, exactly the inverse of the sample counts.
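If you want to verify this calculation, scikit-learn exposes the same formula through the compute_class_weight helper; a quick sketch on the 100-vs-10 toy dataset:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 100 + [1] * 10)   # 100 class-0 samples, 10 class-1 samples
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)
print(weights)   # [0.55 5.5], i.e. total / (n_classes * samples per class)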

SVM, decision trees, random forests and other algorithms can all have their class weights specified this way.
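For example (a small sketch, not from the original post), decision trees and random forests accept the same parameter:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier(class_weight={1: 10, 0: 1}, random_state=66)   # manual 10:1 weights
forest = RandomForestClassifier(class_weight='balanced', random_state=66)    # automatic balancing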

1) First, import the required packages and generate an imbalanced binary classification dataset:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
import numpy as np
########################### Generate an imbalanced classification dataset ###########
from sklearn import datasets
X, Y = datasets.make_classification(n_samples=1000,
                                    n_features=4,
                                    n_classes=2,
                                    weights=[0.95, 0.05])
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.1, random_state=0)

The generated ratio of class-0 samples to class-1 samples here is about 95:5.
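A quick way to confirm the class ratio that make_classification actually produced (the exact counts vary slightly from run to run):

print(np.bincount(Y))   # roughly [950, 50], i.e. about 95:5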

2) Train an SVM classifier without any treatment of the imbalance

SVM_model = SVC(random_state = 66)  # fix the random seed so the results do not change between runs
SVM_model.fit(train_x, train_y)

pred1 = SVM_model.predict(train_x)
recall1 = recall_score(train_y, pred1)
print('Recall on the training set:\n', recall1)

pred2 = SVM_model.predict(test_x)
recall2 = recall_score(test_y, pred2)
print('Recall on the test set:\n', recall2)

Output:

Recall on the training set:
0.56
Recall on the test set:
0.2

Because the class-0 and class-1 samples are so imbalanced, the model can reach 95% accuracy even if it classifies every sample as class 0. But in real applications we usually want the model to pick out exactly those rare samples, such as suspicious transactions in anti-money-laundering or bad loans in credit card data.

So here we use recall to judge how well the positive samples are detected. As you can see, without any treatment the recall on the test set is only 0.2.
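To see why accuracy is misleading here, compare against a baseline that simply predicts class 0 for every sample (a small sketch added for illustration):

from sklearn.metrics import accuracy_score

all_zero = np.zeros_like(test_y)                                             # predict class 0 for everything
print('Accuracy of all-zero baseline:', accuracy_score(test_y, all_zero))   # roughly 0.95
print('Recall of all-zero baseline:', recall_score(test_y, all_zero))       # 0.0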

3) Retrain the SVM classifier with different loss weights for positive and negative samples

SVM_model2 = SVC(random_state = 66, class_weight = 'balanced')  # fix the random seed so the results do not change between runs
SVM_model2.fit(train_x, train_y)

pred1 = SVM_model2.predict(train_x)
recall1 = recall_score(train_y, pred1)
print('Recall on the training set:\n', recall1)

pred2 = SVM_model2.predict(test_x)
recall2 = recall_score(test_y, pred2)
print('Recall on the test set:\n', recall2)

Output:

Recall on the training set:
0.8
Recall on the test set:
0.8

As you can see, once the minority class is given enough weight, the recall improves markedly, rising to 0.8.
(If you evaluate with accuracy or precision here you will find that they get worse. Don't worry: this happens because, in order to identify the class-1 samples, the model now misclassifies some class-0 samples, which is perfectly normal.)
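If you want to see this trade-off in numbers, here is a small comparison of the two models on the test set (added for illustration, not part of the original code):

from sklearn.metrics import accuracy_score, precision_score

for name, model in [('unweighted', SVM_model), ('balanced', SVM_model2)]:
    pred = model.predict(test_x)
    print(name,
          'accuracy:', accuracy_score(test_y, pred),
          'precision:', precision_score(test_y, pred),
          'recall:', recall_score(test_y, pred))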

In addition, some algorithms may not provide this interface. In that case we can also pass the sample_weight parameter to the fit function during training; see the scikit-learn documentation for details.

Note: if you accidentally use both parameters at once, the final weight of each sample will be the product of the two: class_weight * sample_weight.
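A minimal sketch of the sample_weight route, using scikit-learn's compute_sample_weight helper to turn the class ratio into one weight per training sample (an alternative shown for illustration; on this dataset it has the same effect as class_weight='balanced'):

from sklearn.utils.class_weight import compute_sample_weight

# One weight per training sample, inversely proportional to its class frequency
weights = compute_sample_weight(class_weight='balanced', y=train_y)

SVM_model3 = SVC(random_state=66)
SVM_model3.fit(train_x, train_y, sample_weight=weights)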

Second: a more refined method, Focal loss

In the first method we determined the loss weights simply and crudely, as the inverse of each class's sample count, and this alone often works remarkably well. But it does nothing extra for the samples that are genuinely hard to classify. Ideally we want not only to balance the classes, but also to shrink the loss contribution of easy samples and enlarge the loss contribution of hard samples. In the exam analogy: we should not only spend more time on the single Chinese mock paper, but also concentrate on the questions in it that we got wrong. Focal loss (dynamically adjusted weights) is an idea that balances these two goals nicely.

Focal loss was originally proposed to address class imbalance in object detection. As an aside, object detection is a field with typically severe positive/negative imbalance: in most images the target occupies only a small part of the picture while the background occupies most of it. If you regularly deal with imbalanced data, it is worth browsing the object-detection literature for inspiration.

We again take the binary cross-entropy loss as the example. Starting from the loss function

L = −y·lg(p) − (1−y)·lg(1−p)

Focal loss changes it to:

FL = −α·(1−p)^γ·y·lg(p) − α·p^γ·(1−y)·lg(1−p)

Look closely at this formula: the weight of the loss is determined by two factors, α and (1−p)^γ. The first weight, α, needs little explanation: it is the class-balance weight set from the numbers of positive and negative samples, just like the custom weighting method above. The heart of Focal loss is the second weight.

For a class-1 sample, if the model predicts a probability of 0.9 that it is class 1, then (taking γ = 2) the second weight is (1 − 0.9)^2 = 0.01; if the model only predicts a probability of 0.6, the second weight becomes (1 − 0.6)^2 = 0.16. The less accurate the prediction for a sample, the larger the share of the loss that sample carries.

The original paper illustrates this vividly with a figure: the x-axis is the predicted probability of the true class and the y-axis is the loss value. A dashed line marks a prediction of 0.5; to its left the model's prediction deviates badly, to its right the deviation is small. The curve with γ = 0 is the ordinary cross-entropy loss, in which the second weight plays no part.
[Figure from the Focal loss paper: loss value versus predicted probability of the true class, for several values of γ]
Compared with the blue γ = 0 curve, as γ increases the curves differ little in the left half (to the left of the dashed line), while in the right half their loss values rapidly approach zero.

More subtly, if class-0 samples greatly outnumber class-1 samples, the model will naturally tend to predict everything as class 0 (classifying all samples as class 0 still yields a high accuracy). The second weight then pushes the model to spend more of its effort on the minority samples, because they are the ones whose predictions deviate the most.

To sum up, on top of class balancing, Focal loss further increases the loss weight of hard-to-classify samples, so the model pays more attention to the samples it gets wrong, converges faster, and ends up performing better.

scikit-learn does not yet implement the Focal loss idea; if you are interested you can implement it yourself.
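As a starting point, here is a minimal NumPy sketch of the binary Focal loss exactly as written above (base-10 log to match the text's examples; the natural log is more common in practice). It is an illustration of the formula, not a production implementation:

import numpy as np

def focal_loss(y, p, alpha=1.0, gamma=2.0, eps=1e-12):
    # y: true label (0 or 1); p: predicted probability of class 1
    # alpha is the class-balance weight, gamma controls how strongly
    # easy (well-predicted) samples are down-weighted
    p = np.clip(p, eps, 1 - eps)
    pos_term = -alpha * (1 - p) ** gamma * y * np.log10(p)
    neg_term = -alpha * p ** gamma * (1 - y) * np.log10(1 - p)
    return pos_term + neg_term

print(focal_loss(y=1, p=0.9))   # well-predicted positive: tiny loss, (1-0.9)^2 * -lg(0.9)
print(focal_loss(y=1, p=0.3))   # poorly predicted positive: much larger, (1-0.3)^2 * -lg(0.3)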


Origin blog.csdn.net/u013044310/article/details/104006187