ML knowledge map: data preprocessing

pandas: a powerful data-processing toolkit
Learning objectives: know what pandas can do without having to remember exact API names, and consult the quick references below as actual tasks require.

Quick reference guide:
(1) The pandas documentation, in Chinese translation:
https://www.pypandas.cn/docs/getting_started/10min.html
(2) A Chinese-language collection of commonly used pandas APIs:
Description: a collection of the APIs commonly used in data processing.
https://blog.csdn.net/weixin_44129250/article/details/86653324
Focus: it gives suitable example exercises for each API.
A good write-up summarizing an important, commonly used API:
**GroupBy**
https://www.cnblogs.com/bjwu/p/8970818.html


Data cleaning and preprocessing are the bread-and-butter of a data scientist's work.
As MLers,
we need to know the common data-processing scenarios and the common methods used in each.

First, we should build a shared mental picture of the typical data-preprocessing scenarios, that is, understand why data processing matters and what it involves.
Recommended reading:
https://zhuanlan.zhihu.com/p/51131210
https://zhuanlan.zhihu.com/p/57332604

Pay special attention to the following scenarios:

Data cleaning
(1) Missing values
Principle:
In the real world, all kinds of things can go wrong in the process of collecting information and data, leaving missing and vacant values. How these missing values are treated depends mainly on the importance of the variable (its information content and predictive power) and on its distribution.
Common treatments:
First detect the proportion of missing values in each variable with df.isnull().sum(), then decide whether to delete or fill. If the variable to be filled is continuous, mean filling or random interpolation is generally used; if the variable is discrete, it is usually filled with the mode, or missingness is kept as its own dummy level.
scikit-learn also provides an imputer class in its preprocessing module (Imputer in old releases, sklearn.impute.SimpleImputer today), used in the sketch below.
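
A minimal sketch of the workflow above, assuming a toy DataFrame with one continuous column ("age") and one discrete column ("city"); the columns, values, and fill strategies are illustrative, not prescriptive.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # modern replacement for the old Imputer

df = pd.DataFrame({"age": [23, np.nan, 31, 40, np.nan],
                   "city": ["BJ", "SH", None, "BJ", "SH"]})

# 1. Detect the proportion of missing values per variable.
print(df.isnull().sum() / len(df))

# 2. Continuous variable: fill with the mean (random interpolation is another option).
df["age"] = df["age"].fillna(df["age"].mean())

# 3. Discrete variable: fill with the mode, or keep missingness as its own dummy level.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# The same mean imputation via scikit-learn, convenient inside a Pipeline:
# SimpleImputer(strategy="mean").fit_transform(df[["age"]])
```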
(2) Outliers
In short, at the data-processing stage outliers are treated as a data-quality problem rather than as the targets of anomaly detection in their own right, so the author generally uses simple, intuitive techniques, combining statistical methods such as MAD with box plots to judge whether a variable's values are outliers (both checks are sketched below).
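
Here is a minimal sketch of both checks, assuming a toy Series in which one value is clearly anomalous; the 1.5×IQR and |robust z| > 3 thresholds are the conventional rules of thumb, not requirements.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# Box-plot (IQR) rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])

# MAD rule: flag points whose robust z-score exceeds ~3.
median = s.median()
mad = (s - median).abs().median()
print(s[0.6745 * (s - median).abs() / mad > 3])
```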

Data reduction
Data-reduction techniques produce a reduced representation of the data set that is much smaller yet stays close to the integrity of the original data. Mining on the reduced data set is therefore more efficient while producing the same (or almost the same) analysis results. Common strategies:

**Dimensionality reduction**
Data for analysis may contain hundreds of attributes, most of which are irrelevant to the mining task or redundant. Dimensionality reduction removes irrelevant attributes to shrink the data volume while keeping information loss to a minimum.

**Dimension transformation**

Dimension transformation reduces the existing data to a smaller number of dimensions while preserving the information content of the data as far as possible. The author introduces several common lossy dimension transforms below; in practice they greatly improve modeling efficiency.
(1) Principal component analysis (PCA) and factor analysis (FA): PCA maps the current dimensions into a lower-dimensional space such that each new variable has maximal variance (a sketch follows this list). FA finds common factors for the current feature vectors (fewer than the original dimensions) whose linear combinations describe the current feature vectors well.
(2) Singular value decomposition (SVD): SVD-based reduction has low interpretability and a larger computational cost than PCA; it is generally used for dimensionality reduction of sparse matrices, e.g. in image compression and recommender systems.
(3) Clustering: aggregate a cluster of similar features into a single variable, thereby greatly reducing the dimensionality.
(4) Linear combination: run a linear regression on multiple variables, weight each variable according to its coefficient ("its vote"), and recombine that class of variables into a single synthetic variable according to the weights.
(5) Manifold learning: a family of complex nonlinear methods; see the LLE example in sklearn.
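
The sketch promised in item (1): PCA on a synthetic 10-dimensional matrix, reduced to 2 components. The data, the n_components=2 choice, and standardizing first are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(100, 10)   # 100 samples, 10 features

# Standardize first so no single feature dominates the variance.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```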

Data transformation
Data transformation includes normalization, discretization, and sparsification, so that the data suit the mining task.
(1) Data normalization
In particular, distance-based methods such as clustering, KNN, and SVM must be preceded by normalization.
sklearn provides user-friendly standardization APIs (all four are demonstrated below):
MinMaxScaler (min-max normalization)
scale
StandardScaler
Normalizer
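
A quick tour of the four APIs listed above, applied to a tiny illustrative matrix (the data itself is an assumption made for the demo):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, scale

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled into [0, 1]
print(StandardScaler().fit_transform(X))  # each column to mean 0, variance 1
print(scale(X))                           # functional form of StandardScaler
print(Normalizer().fit_transform(X))      # each *row* scaled to unit norm
```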

In detail:
(1) The difference between normalization (min-max scaling) and standardization (z-score normalization):
https://www.cnblogs.com/bjwu/p/8977141.html
(2) Should I normalize / standardize / rescale the data?
This article gives general definitions of normalize, standardize, and rescale.
http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html

General experience:
1. In classification and clustering algorithms, when distance is used to measure similarity, or when applying dimensionality-reduction techniques such as PCA, StandardScaler performs better.
2. When no distance measure or covariance computation is involved and the data do not follow too skewed a distribution, MinMaxScaler may be used. For example, in image processing, after converting an image to grayscale its values are confined to the range [0, 255].
The reason is that MinMaxScaler rescales the covariance by a multiplicative factor, so it cannot remove the influence of scale on variance and covariance, which heavily distorts PCA; at the same time, because the dimensions keep different scales, distance results will also differ, as demonstrated below.
With StandardScaler, the new data are standardized by the variance, so every dimension becomes comparable, each with mean 0 and variance 1 (roughly a standard normal distribution); when computing distances, every dimension is dimensionless, which avoids the huge distortions that differently scaled dimensions produce in distance calculations.
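
A toy demonstration of that point, under the assumption of two features on wildly different scales: in the raw data the second column dominates Euclidean distance, while after StandardScaler both dimensions contribute comparably.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10000.0],
              [2.0, 10100.0],
              [3.0, 10050.0]])

print(pairwise_distances(X))                                  # dominated by column 2
print(pairwise_distances(StandardScaler().fit_transform(X)))  # dimensions balanced
```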

(2) Discretizing continuous data
Discretization means cutting a continuous range of data into segments so that it becomes a set of discrete intervals. Segmentation can be equal-width, equal-frequency, or based on other optimization principles. The main reasons for discretizing data are:
Model requirements: algorithms such as decision trees and naive Bayes are formulated over discrete data. To use this class of algorithms, the data must be discretized. Effective discretization also reduces the time and space cost of the algorithms and improves the noise robustness of classification and clustering on the samples.
Discretized features are easier to understand than continuous ones.
Discretization can effectively paper over hidden defects in the data and make model results more stable.
Equal-frequency method (percentiles): make the number of samples in each bin equal; for example, with n = 100 samples in total and k = 5 bins, the binning principle is to guarantee that each bin receives 100 / 5 = 20 samples.
Equal-width method: make the width of each bin equal; for example, an age variable on (0, 100) can be divided into [0, 20], [20, 40], [40, 60], [60, 80], [80, 100], five bins of equal width.
Clustering: cluster the data, treat each resulting cluster as one bin, and specify the number of clusters for the model. (All three strategies are sketched below.)
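
A minimal sketch of all three strategies, assuming a synthetic age column and five bins; pandas covers equal-width and equal-frequency binning directly, and KBinsDiscretizer with strategy="kmeans" stands in for cluster-based binning.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.Series(np.random.RandomState(0).randint(0, 100, size=100))

equal_width = pd.cut(ages, bins=5)   # equal-width: 5 bins of width ~20
equal_freq = pd.qcut(ages, q=5)      # equal-frequency: ~20 samples per bin

# Cluster-based binning: each k-means cluster becomes one bin.
kb = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
cluster_bins = kb.fit_transform(ages.to_frame())

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts())
```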

(3) Multi-valued categorical data: nominal data vs. ordinal data

Nominal data: purely conceptual categories with no notion of size or order between the levels; all levels carry equal weight.

Common methods: one-hot coding and dummy coding (a sketch follows the links below).
Which of the two coding schemes to use depends on the specific scenario; neither is inherently better or worse.
1. Zhihu answerer Wang Yun (Maigo) gives a concise answer that rewards careful reading:
https://www.zhihu.com/question/48674426/answer/112633127

2. A blog post describing the difference between the two in detail:
https://www.cnblogs.com/wqbin/p/10234636.html
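
The promised sketch: one-hot vs. dummy coding with pandas, on an assumed toy "city" column.

```python
import pandas as pd

df = pd.DataFrame({"city": ["BJ", "SH", "GZ", "BJ"]})

# One-hot coding: k categories -> k indicator columns.
print(pd.get_dummies(df["city"]))

# Dummy coding: k categories -> k-1 columns; the dropped level is the baseline.
print(pd.get_dummies(df["city"], drop_first=True))
```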

Ordinal data: categories with a meaningful size ordering, for example: size, height, weight, and so on.
You may want to use the LabelEncoder class offered by sklearn.preprocessing, as sketched below.
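
A minimal LabelEncoder sketch on assumed size levels. One caveat: LabelEncoder assigns codes alphabetically, so if the true ordering matters you may prefer OrdinalEncoder with explicit categories, as also shown.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ["small", "large", "medium", "small"]

le = LabelEncoder()
print(le.fit_transform(sizes))  # [2, 0, 1, 2], alphabetical rather than by size

oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(oe.fit_transform([[s] for s in sizes]))  # respects the stated order
```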

1. A blog post describing in some detail the difference between one-hot encoding and LabelEncoder: for algorithms based on distance measures (kNN, SVM, etc.), one-hot encoding keeps the distances between levels equal,
but it enlarges the feature space, so it is commonly combined with PCA (one-hot + PCA).
https://www.cnblogs.com/king-lps/p/7846414.html

Q & A:
1. How to deal with imbalanced data in machine learning?
https://zhuanlan.zhihu.com/p/56960799

The treasure posts below are better suited to beginners; bookmark them now so that when you run into difficulties later, you can find answers quickly.
Good bloggers:
1. CSDN blogger: shelley__huang
https://blog.csdn.net/qq_27009517/article/details/80476507

Source: www.cnblogs.com/durui0558/p/12078071.html