Python: SMOTE algorithm

2017.11.28 update: I have since integrated this algorithm into my data-preprocessing Python project. If you don't care about the underlying principle and just want to use it directly, there is a simple Python version in the project code template; open that page, Ctrl+F for "smote", and help yourself.


I had not used Python before, but some recent projects are orders of magnitude larger, so I felt the need to get familiar with it. I happened to need SMOTE, couldn't find an implementation online, and so wrote one as a small practice exercise.

Before looking at SMOTE itself, let's review the methods we usually use when positive and negative samples are imbalanced:

  • Sampling
    Conventional approaches include oversampling, subsampling, and combined sampling.
    Oversampling: pad out the class that has fewer samples.
    Subsampling: compress the class that has more samples.
    Combined sampling: agree on a size N, then oversample and subsample at the same time so that both classes end up at the agreed size N (a rough sketch of all three follows below).

This approach either loses data and information (subsampling) or introduces collinearity among the duplicated minority samples (oversampling), which are obvious defects.
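As a rough illustration of plain resampling, here is a minimal sketch on a toy DataFrame (names and sizes are made up for the example):

import numpy as np
import pandas as pd

# toy imbalanced data: 95 rows of class 0, 5 rows of class 1
df = pd.DataFrame({'x': np.random.randn(100),
                   'label': [0] * 95 + [1] * 5})
minority = df[df['label'] == 1]
majority = df[df['label'] == 0]

# oversampling: duplicate minority rows (with replacement) up to the majority count
oversampled = pd.concat([majority,
                         minority.sample(n=len(majority), replace=True, random_state=1)])

# subsampling: keep only as many majority rows as there are minority rows
subsampled = pd.concat([majority.sample(n=len(minority), random_state=1), minority])

# combined sampling: agree on a common size N per class
N = 20
combined = pd.concat([minority.sample(n=N, replace=True, random_state=1),
                      majority.sample(n=N, random_state=1)])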

  • Weight adjustment
    Conventional approaches include weights inside the algorithm and weight matrices.
    The idea is to change the relative class weights: boosting, for example, re-weights samples iteratively, while logistic regression lets you set the class weights up front.

The drawback of this approach is that the appropriate weight ratio is hard to control and usually requires repeated attempts.
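A minimal sketch of class weighting with scikit-learn (the data here is synthetic, just to show the parameter):

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# toy imbalanced data, roughly 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)

# 'balanced' re-weights each class inversely to its frequency;
# an explicit dict such as {0: 1, 1: 10} also works, but that ratio has to be tuned by hand
clf = LogisticRegression(class_weight='balanced').fit(X, y)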

  • Kernel amendment
    Modify the kernel function to offset the problems caused by the sample imbalance.

This has limited applicability: it demands a lot of prior knowledge, tuning the kernel function is expensive, and the optimization is a black box.

  • Model updating
    Use an algorithm to probe the structure of the minority-class data and check whether it satisfies some rule.
    For example, if a linear fit reveals a linear relationship among the minority samples, new points generated from that fit can be added to the model's data.

In practice such a rule is hard to find, which makes this approach hard to apply.

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling algorithm proposed by Chawla in 2002 that avoids the problems above to some extent.

Here is how the algorithm works:

 

Distribution of positive and negative samples

Clearly the number of blue samples far exceeds the red ones. When a general-purpose classification model is trained on such data, the influence of the red samples may be neglected and only the classification accuracy on the blue samples emphasized; to balance the data set we need to add red samples.

The idea of the SMOTE algorithm is very simple: first randomly select n samples of the minority class, as shown below.

The selected samples become the initial expansion centres of the minority class.

Then, for each selected sample, find its m nearest minority-class neighbours, as shown below.

Then pick an arbitrary point among those m nearest minority-class neighbours.

Finally, choose a random point between the two; this is the new synthetic data sample.
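In other words, each synthetic point is just a random interpolation between a minority sample and one of its nearest minority-class neighbours; a minimal numeric sketch (the two points are made up):

import numpy as np

rng = np.random.RandomState(1)
x = np.array([1.0, 2.0])        # a randomly chosen minority sample
x_nn = np.array([1.5, 2.6])     # one of its nearest minority-class neighbours

# the synthetic sample lies somewhere on the segment between the two points
new_point = x + rng.uniform(0, 1) * (x_nn - x)
print(new_point)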


In R this is simple because there is a ready-made package in the repository; a brief explanation:

rm(list=ls())
install.packages("DMwR", dependencies=T)
library(DMwR)  # load the package containing SMOTE
newdata = SMOTE(formula, data, perc.over=, perc.under=)
# formula: declares the dependent and independent variables
# perc.over: amount of oversampling of the minority class
# perc.under: amount of undersampling of the majority class

Comparison before and after:

 

 

At a glance it simply looks like the minority class has been drawn more densely.
Here SMOTE comes pre-packaged, so it is just a direct call; nothing special.
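Python has a similar ready-made implementation in the third-party imbalanced-learn package; a minimal sketch assuming that package is installed (older releases expose fit_sample instead of fit_resample):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)
print(Counter(y))                        # heavily imbalanced, e.g. {0: ~950, 1: ~50}

X_res, y_res = SMOTE(random_state=1).fit_resample(X, y)
print(Counter(y_res))                    # classes balanced after oversampling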


Since I wanted to practise and was just learning Python anyway, I wrote the whole process out in Python myself:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# read the data
data = pd.read_table('C:/Users/17031877/Desktop/supermarket_second_man_clothes_train.txt', low_memory=False)

# simple preprocessing: keep the label and three feature columns, drop rows with missing values
test_date = pd.concat([data['label'], data.iloc[:, 7:10]], axis=1)
test_date = test_date.dropna(how='any')

The data looks like this:

test_date.head()
Out[25]: 
   label  max_date_diff  max_pay  cnt_time
0      0           23.0  43068.0        15
1      0           10.0   1899.0         2
2      0          146.0   3299.0        21
3      0           30.0  31959.0        35
4      0            3.0  24165.0        98
test_date['label'][test_date['label']==0].count()/test_date['label'][test_date['label']==1].count()
Out[37]: 67

label identifies the sample class; the ratio of label 0 to label 1 is 67:1, so the label = 1 data needs to be expanded.


# select the minority (target) class
aimed_date = test_date[test_date['label'] == 1]
# randomly choose the expansion centres of the minority class
index = pd.DataFrame(aimed_date.index).sample(frac=0.1, random_state=1)
index.columns = ['id']
number = len(index)
# pull the selected rows out by their index labels
aimed_date_new = aimed_date.loc[index.values.ravel(), :]

Randomly select 10% of the minority samples to serve as the expansion centre points.



# standardize the features
sc = StandardScaler().fit(aimed_date_new)
aimed_date_new = pd.DataFrame(sc.transform(aimed_date_new))
sc1 = StandardScaler().fit(aimed_date)
aimed_date = pd.DataFrame(sc1.transform(aimed_date))

# define the Euclidean distance between two samples
def dist(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.sqrt(np.sum((a - b) ** 2))

The distance measure is defined above. Wherever an algorithm involves distances, the features need to be standardized first to remove the effect of differing scales, which also speeds up the calculation.
The Euclidean distance is used here; for more ways to compute distances, refer to:
Theory and demonstration of various distance and similarity calculations
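As an aside, the pairwise distance matrix built with two loops in the next block could equally be produced in a single vectorised call; a sketch with scipy on stand-in arrays (not the original data):

import numpy as np
from scipy.spatial.distance import cdist

A = np.random.randn(5, 4)    # stand-in for the expansion centres (aimed_date_new)
B = np.random.randn(50, 4)   # stand-in for all minority samples (aimed_date)

dist_matrix = cdist(A, B, metric='euclidean')   # shape (5, 50)
nearest = dist_matrix.min(axis=1)               # plays the same role as b below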


# count the samples on each side of the distance calculation
row_l1 = aimed_date_new.iloc[:, 0].count()
row_l2 = aimed_date.iloc[:, 0].count()
a = pd.DataFrame(np.zeros((row_l1, row_l2)))
# fill in the distance matrix
for i in range(row_l1):
    for j in range(row_l2):
        a.iloc[i, j] = dist(aimed_date_new.iloc[i, :], aimed_date.iloc[j, :])
# row-wise minimum: distance from each centre to its nearest sample
b = a.T.apply(lambda x: x.min())

This calls the distance function defined above to build the full distance matrix; b holds each row's minimum.


# locate the position of the nearest minority-class sample for each centre
h = []
z = []
for i in range(number):
    for j in range(len(a.iloc[i, :])):
        if a.iloc[i, j] == b[i]:
            h.append(i)
            z.append(j)

# collect those nearest samples column by column
# (the placeholder column of zeros is dropped again at the end)
new_point = pd.DataFrame([0, 0, 0, 0])
for i in range(len(h)):
    index_a = z[i]
    new = aimed_date.iloc[index_a, :]
    new_point = pd.concat([new, new_point], axis=1)

new_point = new_point.iloc[:, range(len(new_point.columns) - 1)]

Having found the positions, we go back to the data set and pull out the corresponding samples by position.
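Incidentally, this nearest-neighbour lookup can be collapsed into a single argmin over the distance matrix; a small sketch that reuses the a and aimed_date variables from the code above (not part of the original flow):

# column index of the nearest sample for every expansion centre (row of a)
nearest_idx = a.values.argmin(axis=1)
# pull those rows straight out of aimed_date
neighbours = aimed_date.iloc[nearest_idx, :]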


import random

# one random interpolation coefficient per new point
r1 = [random.uniform(0, 1) for _ in range(len(new_point.columns))]
new_point_last = pd.DataFrame()
# compute the new points: new_x = old_x + rand() * (append_x - old_x)
for i in range(len(new_point.columns)):
    old_x = aimed_date_new.iloc[number - 1 - i, 1:4]
    new_x = old_x + r1[i] * (new_point.iloc[1:4, i] - old_x)
    new_point_last = pd.concat([new_point_last, new_x], axis=1)
print(new_point_last)

Finally, the new points are calculated according to the SMOTE formula new_x = old_x + rand() * (append_x - old_x), which concludes this Python practice exercise.

In fact, building on this result, we can combine it with Tomek links to form an integrated data-expansion algorithm. The idea is as follows:
suppose the algorithm above has generated two new data points, shown in the boxes below:


We consider a newly generated cyan data point and the non-cyan sample point closest to it to form a Tomek link, like the cyan and blue points in the box below.

 

 


We can define a rule:
take the newly generated point as the centre and the Tomek-link distance as the radius, and frame a region with it. When (number of minority-class points inside the region) / (number of majority-class points inside the region) is below a minimum threshold, the new point is considered a "garbage point" and should be removed or put through another round of SMOTE; when the ratio is at or above the threshold, the point is kept and added to the initial minority-class sample set that SMOTE draws from.
So the new cyan point on the left is removed and only the new data on the right is kept, as follows:
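A rough sketch of that rule (hypothetical helper; X_all and y_all are assumed to hold all samples on the same scale as the synthetic point, with 1 marking the minority class and 0 the majority class):

import numpy as np

def keep_synthetic_point(new_pt, X_all, y_all, min_ratio=0.5):
    """Keep a synthetic point only if, inside the sphere whose radius is the
    distance to its nearest majority-class sample (its Tomek-link partner),
    (minority count) / (majority count) is at least min_ratio."""
    d = np.linalg.norm(X_all - new_pt, axis=1)
    radius = d[y_all == 0].min()              # distance to the Tomek-link partner
    in_ball = d <= radius
    n_minority = np.sum(in_ball & (y_all == 1))
    n_majority = np.sum(in_ball & (y_all == 0))
    return n_minority / n_majority >= min_ratio

# usage on random stand-in data
rng = np.random.RandomState(1)
X_all = rng.randn(60, 3)
y_all = (rng.rand(60) < 0.2).astype(int)
candidate = X_all[y_all == 1][0] + 0.1 * rng.randn(3)
print(keep_synthetic_point(candidate, X_all, y_all))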

 


You are welcome to follow my personal blog, and more code is available on my personal GitHub; if you have any questions about the algorithms or the code, feel free to message me through my public account.



Author: slade_sal
Link: https://www.jianshu.com/p/ecbc924860af
Source: Jianshu
Copyright belongs to the author; for any form of reproduction, please contact the author for authorization and cite the source.
