My CSDN blog column: https://blog.csdn.net/yty_7
Github address: https://github.com/yot777/
In the process of data preprocessing in machine learning and deep learning, a very important step is to normalize and standardize the data.
The concept of normalization
What is normalization? Put simply, it rescales all the data into a specified range. Generally, this range is 0 ~ 1.
Normalization formula
How is the data normalized? It is done with the following two formulas:
(1) Transformation coefficient formula
X' = (item - min) / (max - min)
item represents each element in a column of the array
max is the maximum value of all data in that column, min is the minimum value of all data in that column
The original array is as follows:
>>> import numpy as np
>>> data = np.array([[36,46],[45,25],[6,79]])
>>> print(data)
[[36 46]
 [45 25]
 [ 6 79]]
Step 1: Find the minimum and maximum values for each column
>>> n_max=np.max(data,axis=0)
>>> print(n_max)
[45 79]
>>> n_min=np.min(data,axis=0)
>>> print(n_min)
[ 6 25]
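The axis argument decides the direction of the reduction. As a small check on the same array (the names col_max and row_max are only for illustration), axis=0 collapses the rows and returns one value per column, while axis=1 returns one value per row:

```python
import numpy as np

data = np.array([[36, 46], [45, 25], [6, 79]])

# axis=0 reduces over the rows: one maximum per column
col_max = np.max(data, axis=0)
# axis=1 reduces over the columns: one maximum per row
row_max = np.max(data, axis=1)

print(col_max.shape, col_max)   # (2,) [45 79]
print(row_max.shape, row_max)   # (3,) [46 45 79]
```

Normalization works per feature (per column), which is why every step here uses axis=0.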
Step 2: Subtract each column's minimum value n_min from every element of the original matrix to obtain the numerator of the formula
>>> fenzi = np.subtract(data,n_min)
>>> print(fenzi)
[[30 21]
 [39  0]
 [ 0 54]]
Explanation: n_min is a one-dimensional array of shape (2,), while data is a two-dimensional array of shape (3, 2).
In the subtract() call, NumPy automatically broadcasts n_min during the calculation, so it behaves like a two-dimensional array of shape (3, 2):
[[ 6 25]
 [ 6 25]
 [ 6 25]]
Therefore, the calculation performed by np.subtract(data, n_min) is
[[36 46]     [[ 6 25]     [[30 21]
 [45 25]  -   [ 6 25]  =   [39  0]
 [ 6 79]]     [ 6 25]]     [ 0 54]]
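The broadcasting of n_min can be made visible with np.broadcast_to(), which expands an array to a target shape explicitly. The following is a minimal sketch for checking the explanation; the names expanded and same are only for illustration:

```python
import numpy as np

data = np.array([[36, 46], [45, 25], [6, 79]])
n_min = np.min(data, axis=0)

print(n_min.shape)  # (2,)

# Expand n_min explicitly to the shape of data (broadcasting does this implicitly)
expanded = np.broadcast_to(n_min, data.shape)
print(expanded)

# Subtracting the expanded array gives the same result as the broadcast subtraction
same = np.array_equal(data - expanded, np.subtract(data, n_min))
print(same)  # True
```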
Step 3: Subtract each column's minimum value n_min from its maximum value n_max to obtain the denominator of the formula
>>> fenmu = np.subtract(n_max,n_min)
>>> print(fenmu)
[39 54]
Explanation: n_max and n_min both have shape (2,), so np.subtract(n_max, n_min) works element by element:
[45 79] - [6 25] = [39 54]
Step 4: Divide the numerator by the denominator to get the final value of the transform coefficient
>>> x = np.divide(fenzi,fenmu)
>>> print(x)
[[0.76923077 0.38888889]
 [1.         0.        ]
 [0.         1.        ]]
Explanation: fenzi is a two-dimensional array of shape (3, 2), while fenmu is a one-dimensional array of shape (2,).
In the divide() call, NumPy automatically broadcasts fenmu during the calculation, so it behaves like a two-dimensional array of shape (3, 2):
[[39 54]
 [39 54]
 [39 54]]
Therefore, the calculation performed by np.divide(fenzi, fenmu) is
[[30 21]     [[39 54]     [[0.76923077 0.38888889]
 [39  0]  /   [39 54]  =   [1.         0.        ]
 [ 0 54]]     [39 54]]     [0.         1.        ]]
Once you are comfortable with them, the four steps above can be written in a single line:
>>> x = np.divide(np.subtract(data,np.min(data,axis=0)),np.subtract(np.max(data,axis=0),np.min(data,axis=0)))
>>> print(x)
[[0.76923077 0.38888889]
 [1.         0.        ]
 [0.         1.        ]]
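The one-line version can also be wrapped in a small reusable function. The sketch below is only an illustration: the function name min_max_normalize is hypothetical, and mapping a constant column (where max equals min) to 0 is an assumption added here to avoid dividing by zero, not something the four steps cover.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of a 2-D array to the range 0 ~ 1 (illustrative helper)."""
    data = np.asarray(data, dtype=float)
    n_min = data.min(axis=0)
    fenzi = data - n_min                 # numerator: item - min
    fenmu = data.max(axis=0) - n_min     # denominator: max - min, one value per column
    # Assumption: a constant column (max == min) stays 0 instead of dividing by zero
    out = np.zeros_like(fenzi)
    np.divide(fenzi, fenmu, out=out, where=fenmu != 0)
    return out

x = min_max_normalize([[36, 46], [45, 25], [6, 79]])
print(x)
```

For the example array this gives the same values as the one-line version; a column such as [5, 5, 5] would come back as all zeros under the stated assumption.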
(2) Range conversion formula
X'' = X' * (mx - mi) + mi
X' is the result of formula (1), mx is the maximum value of the selected range, and mi is the minimum value of the selected range.
In the special case where the selected range is 0 ~ 1, that is, mx = 1 and mi = 0, there is no need to perform range conversion, because:
X'' = X' * (1 - 0) + 0 = X'
If we need to map the X' matrix (the array x in the code) to the range -1 ~ 1, that is, mx = 1 and mi = -1:
>>> mx = 1
>>> mi = -1
>>> xx = x*(mx-mi)+mi
>>> print(xx)
[[ 0.53846154 -0.22222222]
 [ 1.         -1.        ]
 [-1.          1.        ]]
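Explanation: x is a two-dimensional array of shape (3, 2) and mx - mi = 2, so x * (mx - mi) is
[[1.53846154 0.77777778]
 [2.         0.        ]
 [0.         2.        ]]
The final result x * (mx - mi) + mi is
[[ 0.53846154 -0.22222222]
 [ 1.         -1.        ]
 [-1.          1.        ]]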
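The transformation coefficient and the range conversion can be combined in one helper. This is a minimal sketch: the name scale_to_range and its default arguments are assumptions for illustration, and it assumes no column is constant (max is never equal to min).

```python
import numpy as np

def scale_to_range(data, mi=0.0, mx=1.0):
    """Scale each column of a 2-D array to the range mi ~ mx (illustrative helper)."""
    data = np.asarray(data, dtype=float)
    n_min = data.min(axis=0)
    n_max = data.max(axis=0)
    x = (data - n_min) / (n_max - n_min)   # transformation coefficient, in 0 ~ 1
    return x * (mx - mi) + mi              # range conversion to mi ~ mx

data = [[36, 46], [45, 25], [6, 79]]
xx = scale_to_range(data, mi=-1, mx=1)
print(xx.min(axis=0), xx.max(axis=0))   # [-1. -1.] [1. 1.]
```

With the defaults mi = 0 and mx = 1, the second step leaves the values unchanged, which matches the special case described earlier.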
The concept of standardization
Standardization rescales the data in each column so that its mean is 0 and its standard deviation is 1, the same as the standard Gaussian distribution. It only shifts and scales the data; the shape of the distribution does not change.
Standardization formula
X = (item - mean) / σ
item represents each element in a column of the array, mean is the average of all the data in that column
σ is the standard deviation of all the data in that column
Note: mean can be calculated with numpy's mean() function, and σ with numpy's std() function
Code:
>>> import numpy as np
>>> data = np.array([[36,46],[45,25],[6,79]])
>>> print(data)
[[36 46]
 [45 25]
 [ 6 79]]
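# According to the standardization formula: X = (item - mean) / σ
# item represents each element in a column of the array, mean is the average of all the data in that column, σ is the standard deviation of all the data in that column
# mean can be calculated with numpy's mean() function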
>>> n_mean = np.mean(data,axis=0)
>>> print(n_mean)
[29. 50.]
# σ can be calculated with numpy's std() function
>>> n_std = np.std(data,axis=0)
>>> print(n_std)
[16.673332   22.22611077]
# Substitute mean and σ into the formula to get the final result
>>> xxx = np.divide(np.subtract(data,n_mean),n_std)
>>> print(xxx)
[[ 0.4198321  -0.17996851]
 [ 0.95961623 -1.12480318]
 [-1.37944833  1.30477168]]
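As a quick sanity check, each column of the standardized result should have a mean of 0 and a standard deviation of 1. The sketch below verifies this. It also shows that np.std() divides by N by default (ddof=0, the population standard deviation); passing ddof=1 gives the sample standard deviation instead. The names col_mean, col_std and sample_std are only for illustration.

```python
import numpy as np

data = np.array([[36, 46], [45, 25], [6, 79]])

# np.std uses ddof=0 by default (population standard deviation, divide by N)
xxx = (data - data.mean(axis=0)) / data.std(axis=0)

# After standardization every column has mean 0 and standard deviation 1
col_mean = xxx.mean(axis=0)
col_std = xxx.std(axis=0)
print(np.allclose(col_mean, 0), np.allclose(col_std, 1))  # True True

# With ddof=1 (sample standard deviation, divide by N - 1) σ is larger
sample_std = data.std(axis=0, ddof=1)
print(sample_std)
```

Which of the two σ definitions to use is a choice; the code in this article uses the default ddof=0.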
Summary
1. The normalization formulas are
X' = (item - min) / (max - min)
X'' = X' * (mx - mi) + mi
item represents each element in a column of the array
max is the maximum value of all data in that column, min is the minimum value of all data in that column
mx is the maximum value of the selected range, mi is the minimum value of the selected range
If the normalized range is 0 ~ 1, there is no need to calculate the second formula
2. The standardization formula is
X = (item - mean) / σ
item represents each element in a column of the array, mean is the average of all the data in that column
σ is the standard deviation of all the data in that column
mean can be calculated with numpy's mean() function, and σ with numpy's std() function