Handling imbalanced data sets with Python

1. What is data imbalance

Data imbalance refers to an uneven distribution of class counts in a data set. Imbalanced data is very common in real-world tasks, for example:

· Credit card fraud data: 99% are normal data, 1% are fraud data

· Loan overdue data

Imbalance is usually inherent to how the data is generated: samples of the minority class occur at a low frequency and take a long time to collect.

In machine learning tasks (such as classification), imbalanced data biases the trained model's predictions toward the classes with many samples. Besides choosing appropriate evaluation metrics, improving model performance requires some preprocessing of the data and the model.

The main methods to deal with data imbalance:

· Undersampling
· Oversampling
· Combined sampling
· Model ensembling
· Adjusting class weights or sample weights

2. Methods for handling data imbalance

The imbalanced-learn library provides many methods for handling imbalanced data. The examples in this article are all implemented with imbalanced-learn.

https://github.com/scikit-learn-contrib/imbalanced-learn

Let's look at the data first
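The original post's data-loading code is not reproduced here; as a stand-in, the following minimal sketch builds a synthetic imbalanced data set with scikit-learn (the class ratio and parameters are illustrative assumptions, not the original data):

```python
from collections import Counter
from sklearn.datasets import make_classification

# Synthetic two-class data set with a roughly 99:1 class ratio (illustrative)
X, y = make_classification(
    n_samples=10_000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=42,
)
print(Counter(y))  # e.g. Counter({0: 9890, 1: 110})
```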

2.1 Undersampling

Undersampling samples from the classes with many samples (the majority classes) so that their counts roughly match the count of the minority class, thereby balancing the class counts.
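As a minimal sketch (not the original post's code), random undersampling with imbalanced-learn could look like this, reusing the X, y from the synthetic data set above:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)  # majority class downsampled to the minority count
print(Counter(y_res))
```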


Because undersampling discards part of the data, it inevitably changes the distribution of the majority class. A good undersampling strategy should preserve the original data distribution as much as possible.

Undersampling deletes majority samples, so which samples can safely be deleted?

· Overlapping data, i.e. redundant samples

· Samples that interfere with the distribution of the minority class

Based on this, there are two families of undersampling methods:

· Neighbor-based methods, which consider deleting majority samples in the neighborhood of minority samples, such as TomekLinks and NearMiss

(The figure in the original post illustrates 6-NN, i.e. the 6 nearest neighbors.)

TomekLinks deserves emphasis here. In short: for each minority sample, find its 1-NN (nearest neighbor); if that nearest neighbor is a majority sample, the pair forms a Tomek link. The method treats this majority sample as interference and deletes it.
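A minimal sketch (not the original code) of applying TomekLinks with imbalanced-learn, again reusing the X, y from above:

```python
from collections import Counter
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()  # by default removes only the majority member of each Tomek link
X_res, y_res = tl.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```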

In the run shown in the original post, 1174 samples involved in Tomek links were deleted, which does not seem like many; you can test whether it helps the classification results. Note that because nearest neighbors must be computed, the sample features must be numeric, or convertible to numeric.

· Clustering-based methods

These methods partition the original samples into multiple clusters and then use each cluster's centroid to represent the cluster, thus accomplishing the sampling. Note that the resulting samples do not come from the original sample set; they are generated through clustering.
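A minimal sketch of cluster-based undersampling with imbalanced-learn's ClusterCentroids (an illustration under the same X, y as above, not the original code):

```python
from collections import Counter
from imblearn.under_sampling import ClusterCentroids

cc = ClusterCentroids(random_state=42)  # majority samples replaced by cluster centroids
X_res, y_res = cc.fit_resample(X, y)
print(Counter(y_res))
```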

The undersampling methods provided by imbalanced-learn are as follows:

· Random majority under-sampling with replacement
· Extraction of majority-minority Tomek links
· Under-sampling with Cluster Centroids
· NearMiss-(1 & 2 & 3)
· Condensed Nearest Neighbour
· One-Sided Selection
· Neighbourhood Cleaning Rule
· Edited Nearest Neighbours
· Instance Hardness Threshold
· Repeated Edited Nearest Neighbours
· AllKNN

2.2 Oversampling

Oversampling replicates minority-class samples so that their count approaches that of the majority class, achieving a numeric balance. Since minority samples are replicated multiple times, oversampling changes the variance of the minority class.
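A minimal sketch of random oversampling with imbalanced-learn (illustrative, reusing the X, y from above):

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)  # duplicates minority samples at random
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))
```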

A simple way to oversample is to randomly copy minority samples; the other is to generate artificial samples from existing ones. Here we introduce the classic artificial-sample algorithm SMOTE (Synthetic Minority Over-sampling Technique).

SMOTE constructs new artificial samples in the feature space near existing minority samples. It proceeds as follows (a usage sketch follows these steps):

· Choose a minority sample and find its K nearest neighbors

· Randomly select a neighbor among K neighbors

· Offset the sample toward that neighbor: the offset equals the difference between the neighbor and the minority sample multiplied by a random ratio in (0, 1); the result is a new sample
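A minimal sketch of SMOTE with imbalanced-learn (not the original post's code; reuses the X, y from the earlier sketch):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Interpolates between each minority sample and one of its k nearest minority neighbors
sm = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print(Counter(y_res))
```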

Basic SMOTE constructs new samples for every minority sample, but that is not always desirable. Consider three points A, B, and C. From the data distribution, point C is likely an outlier (NOISE), point B lies within the normal distribution (SAFE), and point A lies on the class boundary (DANGER). Intuitively, we should not construct new samples around point C, and constructing new samples around point B does not enrich the minority distribution. Only point A is worth it: constructing new samples there can move A from DANGER toward SAFE and strengthen the minority class's decision boundary. This is Borderline-SMOTE.
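A minimal Borderline-SMOTE sketch with imbalanced-learn (illustrative, same X, y as above):

```python
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

bsm = BorderlineSMOTE(kind="borderline-1", random_state=42)  # only boundary (DANGER) samples are oversampled
X_res, y_res = bsm.fit_resample(X, y)
print(Counter(y_res))
```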

The ADASYN method decides how much data to generate from the perspective of maintaining the sample distribution. The way a sample is generated is the same as in SMOTE; the difference lies in the number of samples generated for each minority sample (a usage sketch follows the steps below).

· First determine the total number of samples to generate, controlled by a balance level beta in [0, 1]

· For each minority sample, determine the proportion of new samples it should generate: find its K nearest neighbors and compute the fraction of those neighbors that belong to the majority class (this is the numerator). Z is a normalization factor that makes all the minority proportions sum to 1; it can be taken as the sum of all the numerators.

· Calculate the number of new samples to generate for each minority sample

· Generate the samples in the same way as SMOTE
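A minimal ADASYN sketch with imbalanced-learn (illustrative, reusing the X, y from the earlier sketch):

```python
from collections import Counter
from imblearn.over_sampling import ADASYN

# More synthetic samples are generated for minority points surrounded by majority neighbors
ada = ADASYN(random_state=42)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y_res))
```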

The oversampling methods provided by imbalanced-learn are as follows (including variants of the SMOTE algorithm):

· Random minority over-sampling with replacement
· SMOTE - Synthetic Minority Over-sampling Technique
· SMOTENC - SMOTE for Nominal and Continuous features
· bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2
· SVM SMOTE - Support Vectors SMOTE
· ADASYN - Adaptive synthetic sampling approach for imbalanced learning
· KMeans-SMOTE
· ROSE - Random OverSampling Examples

2.3 Combined sampling

Oversampling operates on minority samples and undersampling on majority samples, while combined sampling operates on both at the same time. The main methods are SMOTE + Tomek links and SMOTE + Edited Nearest Neighbours.

Combined sampling oversamples first and then undersamples.
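A minimal sketch of combined sampling with imbalanced-learn's SMOTETomek (illustrative, same X, y as above; SMOTEENN can be used the same way):

```python
from collections import Counter
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)  # SMOTE oversampling followed by Tomek-links cleaning
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y_res))
```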

2.4 Model ensembling

The ensembling here is mainly applied to the data: train multiple models on multiple balanced data sets (undersampled majority samples plus the minority samples) and then combine them. imblearn.ensemble provides several common ensemble algorithms, such as BalancedRandomForestClassifier.
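A minimal sketch using BalancedRandomForestClassifier (illustrative, not the original code; reuses the X, y from the earlier sketch):

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Each tree is trained on a balanced bootstrap sample (undersampled majority + minority)
clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```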

The ensemble methods provided by imbalanced-learn are as follows:

· Easy Ensemble classifier
· Balanced Random Forest
· Balanced Bagging
· RUSBoost

2.5 Adjusting class weights or sample weights

For many machine learning methods trained by gradient descent (minimizing some loss), you can compensate for data imbalance to some extent by adjusting class weights or sample weights, for example the class_weight parameter of the GBDT implementation LightGBM.
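A minimal class-weighting sketch using scikit-learn's LogisticRegression (LightGBM's LGBMClassifier accepts a similar class_weight argument); this is an illustration under the same X, y as above, not the original post's code:

```python
from sklearn.linear_model import LogisticRegression

# "balanced" reweights classes inversely proportional to their frequencies
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```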

3. Summary

This article shares several common methods for dealing with imbalanced data sets and provides simple imbalanced-learn examples. To summarize:

· Undersampling: reduce the number of majority samples
· Oversampling: increase the number of minority samples
· Combined sampling: oversample first, then undersample
· Model ensembling: build balanced data sets (undersampled majority samples + minority samples) through several different undersamplings, train a different model on each, then fuse them

· Both undersampling and oversampling change the original data distribution to some extent, which may cause the model to overfit. Which method fits the actual data distribution has to be found by trying; of course it may not be effective, just try it bravely!

Origin blog.csdn.net/Python_xiaobang/article/details/112391683