08 -- Handling Imbalanced Samples

Background

When doing data analysis we sometimes run into situations like this: deciding whether a credit card transaction is fraudulent is a binary classification problem, but the number of fraudulent samples and the number of normal samples differ enormously; fraud may account for only one percent of the data or less. For samples that are imbalanced like this, there are generally two ways to handle the situation: oversampling and downsampling.
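As a quick check of just how imbalanced the data is, the class ratio can be inspected directly. A minimal sketch, assuming the credit card data lives in a CSV file named creditcard.csv (an assumed file name) with a 0/1 label column named Class:

```python
import pandas as pd

# Load the dataset (file name assumed) and count each class
data = pd.read_csv('creditcard.csv')
print(data['Class'].value_counts())                 # absolute counts per class
print(data['Class'].value_counts(normalize=True))   # class proportions
```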

Downsampling

In this approach, the class with the larger number of samples is reduced so that its sample count matches that of the smaller class (equally few samples).

Oversampling

In this approach, new samples are generated for the class with fewer samples, so that its sample count matches that of the larger class (equally many samples).
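The post only walks through downsampling below, but for completeness, here is a minimal sketch of the simplest form of oversampling: randomly duplicating minority rows with replacement until the classes are the same size. The function name and labels are illustrative, not from the original post; SMOTE-style synthetic generation is another common choice.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate minority rows (with replacement) until both classes match in size."""
    rng = np.random.RandomState(seed)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    # Draw minority indices with replacement up to the majority count
    drawn = rng.choice(minority_idx, size=len(majority_idx), replace=True)
    keep = np.concatenate([majority_idx, drawn])
    return X[keep], y[keep]
```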

Sample data preprocessing

Before doing any analysis we first need to preprocess the data. There is a common misconception in machine learning that features with large numeric values in the sample data are more important and features with small values less important; in our setting, however, every feature is defined to be equally important, so the data needs to be preprocessed to put the features on a comparable scale. Two methods are usually used: normalization and standardization.
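To make the two methods concrete, here is a small sketch contrasting them with sklearn (the toy array is made up for illustration): StandardScaler standardizes to zero mean and unit variance, while MinMaxScaler normalizes the values into the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([1.0, 2.0, 3.0, 100.0]).reshape(-1, 1)  # one feature, four samples
print(StandardScaler().fit_transform(x))  # standardization: zero mean, unit variance
print(MinMaxScaler().fit_transform(x))    # normalization: rescaled into [0, 1]
```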

Standardization

Standardization here uses sklearn: the StandardScaler method from sklearn's preprocessing module. Note the use of reshape: if you have a matrix with 2 rows and 3 columns, .reshape(-1, 1) turns it into a matrix with 6 rows and 1 column, where -1 means that dimension is inferred automatically by the system.
Afterwards the Time and Amount columns are deleted; axis=1 indicates that columns, not rows, are being dropped.
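A minimal sketch of this step, assuming the DataFrame is named data (as loaded above) and the standardized amount is stored in a new column normAmount; both names are assumptions, not confirmed by the post:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the Amount column; reshape(-1, 1) turns the 1-D values into
# an (n, 1) column, with -1 telling numpy to infer the row count
data['normAmount'] = StandardScaler().fit_transform(
    data['Amount'].values.reshape(-1, 1))

# axis=1 drops columns (axis=0 would drop rows)
data = data.drop(['Time', 'Amount'], axis=1)
```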

Downsampling

When downsampling, samples have to be removed from the class that has more of them; numpy's random.choice can be used to make the random selection.
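Continuing from the snippet above, a sketch of the selection step (variable names such as fraud_indices and normal_indices are illustrative): count the fraud samples, then randomly choose the same number of normal samples.

```python
import numpy as np

# Indices of the fraud samples (Class == 1) and their count
fraud_indices = np.array(data[data['Class'] == 1].index)
number_records_fraud = len(fraud_indices)

# Randomly choose the same number of normal samples (Class == 0), without replacement
normal_indices = data[data['Class'] == 0].index
random_normal_indices = np.array(
    np.random.choice(normal_indices, number_records_fraud, replace=False))
```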
In this way we obtain the indices of the two kinds of data, Class 0 and Class 1; next the two sets can be merged, which is done here with numpy's concatenate method.
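Continuing the sketch, the two index arrays are joined into one:

```python
# Merge the fraud indices and the sampled normal indices into one array
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
```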
under_sample_indices above stores the index values corresponding to the Class 0 and Class 1 data; next, the rows corresponding to these indices have to be retrieved.
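Retrieving those rows can be sketched with iloc, assuming the DataFrame still has its default 0..n-1 integer index so that labels and positions coincide:

```python
# Select the rows of the original data that belong to the balanced subset
under_sample_data = data.iloc[under_sample_indices, :]
```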
under_sample_data above holds the data values corresponding to Class 0 and Class 1, and it is this balanced subset that is analyzed afterwards; the independent variables are defined as every column except Class, and the dependent variable is the Class column itself.
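The split into features and target can be sketched like this (X_undersample and y_undersample are illustrative names):

```python
# Features: every column except Class; target: the Class column
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
```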
The whole series of steps above is the downsampling process. As you can see, when sampling this way much of the available data is never used, which wastes data, and that naturally has some impact on the final result.

Origin blog.csdn.net/Escid/article/details/90762717