Starfruit Advanced Python Lecture 10: Arrays (3) Array normalization and standardization (the most detailed explanation to date on the entire network)

My CSDN blog column: https://blog.csdn.net/yty_7

Github address: https://github.com/yot777/

 

In the process of data preprocessing in machine learning and deep learning, a very important step is to normalize and standardize the data.

 

The concept of normalization

What is normalization? Put simply, it rescales all the data into a specified range. Usually this range is 0 ~ 1.

 

Normalization formulas

How is data normalized? It is done with the following two formulas:

(1) Transformation coefficient formula

x' = (item - min) / (max - min)

item represents each element in a column of the array

max is the maximum value of all data in this column, min is the minimum value of all data in this column

The original array is as follows:

>>> import numpy as np
>>> data = np.array([[36,46],[45,25],[6,79]])
>>> print(data)
[[36 46]
 [45 25]
 [ 6 79]]

Step 1: Find the minimum and maximum values for each column

>>> n_max=np.max(data,axis=0)
>>> print(n_max)
[45 79]
>>> n_min=np.min(data,axis=0)
>>> print(n_min)
[ 6 25]
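Step 1 relies on the axis argument to work column by column. As a small sketch of what that argument does (the names col_max, row_max and all_max are only illustrative), the same array can be reduced in three ways:

```python
import numpy as np

data = np.array([[36, 46], [45, 25], [6, 79]])

# axis=0 collapses the rows, leaving one value per column.
col_max = np.max(data, axis=0)   # [45 79]
# axis=1 collapses the columns, leaving one value per row.
row_max = np.max(data, axis=1)   # [46 45 79]
# With no axis, the whole array is reduced to a single scalar.
all_max = np.max(data)           # 79

print(col_max, row_max, all_max)
```

Because normalization here is done per column, axis=0 is the one used throughout this lecture.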

Step 2: Subtract each column's minimum value n_min from every element of that column in the original matrix to obtain the numerator of the formula

>>> fenzi = np.subtract(data,n_min)
>>> print(fenzi)
[[30 21]
 [39  0]
 [ 0 54]]

Explanation: n_min is a one-dimensional array of shape (2,), while data is a two-dimensional array of shape (3, 2).

During the subtract() operation, NumPy automatically broadcasts n_min to a two-dimensional array of shape (3, 2):

[[ 6 25]
 [ 6 25]
 [ 6 25]]

Therefore, the calculation performed by np.subtract(data, n_min) is

[[36 46]     [[ 6 25]     [[30 21]
 [45 25]  -   [ 6 25]  =   [39  0]
 [ 6 79]]     [ 6 25]]     [ 0 54]]
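The broadcasting step can also be made explicit. The sketch below uses np.broadcast_to to build the (3, 2) version of n_min by hand and checks that subtracting it gives the same result as letting NumPy broadcast automatically. The names n_min_2d, explicit and implicit are only illustrative.

```python
import numpy as np

data = np.array([[36, 46], [45, 25], [6, 79]])
n_min = np.min(data, axis=0)                     # shape (2,)

# Build the (3, 2) array that broadcasting produces conceptually.
n_min_2d = np.broadcast_to(n_min, data.shape)    # shape (3, 2)
print(n_min_2d)

# Subtracting the explicit (3, 2) array gives the same result as
# letting NumPy broadcast the (2,) array on its own.
explicit = np.subtract(data, n_min_2d)
implicit = np.subtract(data, n_min)
print(np.array_equal(explicit, implicit))
```

In practice there is no need to call np.broadcast_to yourself; it is used here only to show what happens behind the scenes.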

Step 3: Subtract each column's minimum value n_min from that column's maximum value n_max to obtain the denominator of the formula

>>> fenmu = np.subtract(n_max,n_min)
>>> print(fenmu)
[39 54]

Explanation: the result of np.subtract(n_max, n_min) is

[45 79] - [6 25] = [39 54]

 

Step 4: Divide the numerator by the denominator to get the final value of the transform coefficient

>>> x = np.divide(fenzi,fenmu)
>>> print(x)
[[0.76923077 0.38888889]
 [1.         0.        ]
 [0.         1.        ]]

Explanation: fenzi is a two-dimensional array of shape (3, 2), while fenmu is a one-dimensional array of shape (2,).

During the divide() operation, NumPy automatically broadcasts fenmu to a two-dimensional array of shape (3, 2):

[[39 54]
 [39 54]
 [39 54]]

Therefore, the calculation performed by np.divide(fenzi, fenmu) is

[[30 21]     [[39 54]     [[0.76923077 0.38888889]
 [39  0]  /   [39 54]  =   [1.         0.        ]
 [ 0 54]]     [39 54]]     [0.         1.        ]]

Once you are familiar with them, the four steps above can be combined into a single line of code:

>>> x = np.divide(np.subtract(data,np.min(data,axis=0)),np.subtract(np.max(data,axis=0),np.min(data,axis=0)))
>>> print(x)
[[0.76923077 0.38888889]
 [1.         0.        ]
 [0.         1.        ]]
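For repeated use, the four steps could be wrapped in a small function. The following is a minimal sketch rather than part of the original lecture: the name min_max_normalize is hypothetical, it uses the - and / operators (which behave like np.subtract and np.divide here), and it assumes that a column whose values are all identical should be mapped to 0 rather than cause a division by zero.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of a 2-D array into the range 0 ~ 1."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Assumption: a constant column (max == min) is mapped to 0
    # instead of dividing by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range

data = np.array([[36, 46], [45, 25], [6, 79]])
x = min_max_normalize(data)
print(x)
```

For the example array this gives the same values as the one-line version above.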

(2) Range conversion formula

x'' = x' * (mx - mi) + mi

mx is the maximum value of the selected range, mi is the minimum value of the selected range.

In the special case where the target range is 0 to 1, that is, mx = 1 and mi = 0, no range conversion is needed because:

x'' = x' * (1 - 0) + 0 = x'

If we need to rescale the x' matrix (the array x computed above) to the range -1 to 1, that is, mx = 1 and mi = -1:

>>> mx = 1
>>> mi = -1
>>> xx = x*(mx-mi)+mi
>>> print(xx)
[[ 0.53846154 -0.22222222]
 [ 1.         -1.        ]
 [-1.          1.        ]]

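Explanation: x is a two-dimensional array of shape (3, 2) and mx - mi = 2, so x * (mx - mi) is

[[1.53846154 0.77777778]
 [2.         0.        ]
 [0.         2.        ]]

The final result x * (mx - mi) + mi is

[[ 0.53846154 -0.22222222]
 [ 1.         -1.        ]
 [-1.          1.        ]]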
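The range conversion can be checked with a short sketch. The helper name rescale is hypothetical. It applies the range conversion formula to an array that is already normalized to 0 ~ 1, then confirms that every column spans the requested range and that mi = 0, mx = 1 leaves the array unchanged.

```python
import numpy as np

def rescale(x, mi, mx):
    """Map values already normalized to 0 ~ 1 into the range mi ~ mx."""
    return np.asarray(x, dtype=float) * (mx - mi) + mi

data = np.array([[36, 46], [45, 25], [6, 79]])
x = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

xx = rescale(x, -1, 1)
print(xx.min(axis=0), xx.max(axis=0))   # each column now spans -1 to 1

# With mi = 0 and mx = 1 the conversion leaves x unchanged.
print(np.allclose(rescale(x, 0, 1), x))
```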

The concept of standardization

Standardization rescales the data in each column so that it has a mean of 0 and a standard deviation of 1, matching the mean and standard deviation of the standard Gaussian distribution.

Standardization formula

X = (item - mean) / σ

item represents each element in a column of the array, mean is the average of all the data in the column

σ is the standard deviation of all data in the column

Note: mean can be calculated with NumPy's mean() function, and σ can be calculated with NumPy's std() function

Code:

>>> import numpy as np
>>> data = np.array([[36,46],[45,25],[6,79]])
>>> print(data)
[[36 46]
 [45 25]
 [ 6 79]]

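# From the standardization formula: X = (item - mean) / σ
# item is each element in a column of the array, mean is the average of that column, σ is the standard deviation of that column
# mean can be calculated with NumPy's mean() function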
>>> n_mean = np.mean(data,axis=0)
>>> print(n_mean)
[29. 50.]

# σ can be calculated with NumPy's std() function
>>> n_std = np.std(data,axis=0)
>>> print(n_std)
[16.673332   22.22611077]

# Substitute mean and σ into the formula to get the final result
>>> xxx = np.divide(np.subtract(data,n_mean),n_std)
>>> print(xxx)
[[ 0.4198321  -0.17996851]
 [ 0.95961623 -1.12480318]
 [-1.37944833  1.30477168]]
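A quick way to check the result is to confirm that each standardized column really has mean 0 and standard deviation 1. The sketch below does this, and also shows one detail worth knowing: np.std uses ddof=0 by default (it divides by N, the population standard deviation), while ddof=1 divides by N - 1. The variable name z is only illustrative.

```python
import numpy as np

data = np.array([[36, 46], [45, 25], [6, 79]])
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Each standardized column should have mean 0 and standard deviation 1.
print(np.allclose(z.mean(axis=0), 0.0))
print(np.allclose(z.std(axis=0), 1.0))

# np.std defaults to ddof=0 (divide by N). ddof=1 divides by N - 1
# and therefore gives a larger value for the same data.
print(data.std(axis=0, ddof=0))
print(data.std(axis=0, ddof=1))
```

The lecture's results correspond to the default ddof=0. Which convention is appropriate depends on the application.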

 

Summary

1. The normalization formulas are

x' = (item - min) / (max - min)

x'' = x' * (mx - mi) + mi

item represents each element in a column of the array

max is the maximum value of all data in this column, min is the minimum value of all data in this column

mx is the maximum value of the selected range, mi is the minimum value of the selected range

If the normalized range is 0 ~ 1, there is no need to calculate the second formula

2. The standardization formula is

X = (item - mean) / σ

item represents each element in a column of the array, mean is the average of all the data in the column

σ is the standard deviation of all data in the column

mean can be calculated with NumPy's mean() function, and σ can be calculated with NumPy's std() function

 


If you think this chapter is helpful to you, welcome to follow, comment and like! Github welcomes your Follow and Star!


Origin blog.csdn.net/yty_7/article/details/104556013