A typical machine learning development process consists of: data collection, data cleaning and transformation, model training, model testing, and model deployment and integration.
Below, we walk through the complete development process with an example.
Libraries required for the project:

    import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from pandas import DataFrame
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

(1) Data collection
The goal is to model the relationship between Global_active_power (active power), Global_reactive_power (reactive power), and Global_intensity (current).

    path = 'household_power_consumption_1000.txt'
    df = pd.read_csv(path, sep=';', low_memory=False)
    print(df.head())
    print(df.info())

(2) Data cleaning
If the data contains null values or outliers, they can be handled at this stage.

    new_df = df.replace('?', np.nan)           # treat the '?' placeholder as missing
    datas = new_df.dropna(axis=0, how='any')   # drop rows with any missing value
    print(datas.describe().T)

As shown above, the '?' placeholder is treated as a missing value, and any sample that contains a missing value is simply dropped.
Only 998 of the 1000 samples remain.
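The cleaning step can be tried on a tiny in-memory sample. This is only a sketch: the four rows below are made up, and only three of the real column names are used.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical four-row sample in the same ';'-separated layout; values are made up.
raw = io.StringIO(
    "Global_active_power;Global_reactive_power;Global_intensity\n"
    "4.216;0.418;18.4\n"
    "?;?;?\n"
    "5.360;0.436;23.0\n"
    "5.374;0.498;23.0\n"
)
sample = pd.read_csv(raw, sep=";")

# Because of the '?' row, the columns are read as text rather than numbers.
print(sample.dtypes)

# Replace the placeholder with NaN, then drop rows with any missing value.
cleaned = sample.replace("?", np.nan).dropna(axis=0, how="any")

# Cast back to float so that later arithmetic works on numbers.
cleaned = cleaned.astype(float)

print(len(sample), "->", len(cleaned))
```

An alternative is to pass na_values='?' to pd.read_csv, so that the placeholder is parsed as missing at load time and the columns stay numeric.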
Then proceed to feature engineering:
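
    # Extract the features and the target; cast to float because columns
    # that contained '?' are read as strings
    X = datas.iloc[:, 2:4].astype(float)
    Y = datas['Global_intensity'].astype(float)

    # Split into training and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    # Standardize: fit the scaler on the training set only, then apply it to the test set
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
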
Tips:
- random_state: the seed of the random number generator used to shuffle the data before splitting. Passing the same integer every time (for example 0 or 1) produces the same split on every run, as long as the other parameters are unchanged, which makes experiments repeatable. If it is left unset (None), the split differs from run to run.
- StandardScaler: standardization computes the mean $\mu$ and the standard deviation $\sigma$ of each feature and transforms every value as $x' = (x - \mu) / \sigma$. For why standardization is needed, see https://zhuanlan.zhihu.com/p/24839177
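Both tips can be checked on a small made-up array. The sketch below assumes nothing beyond NumPy and scikit-learn. It splits the same data twice with the same seed, then compares a hand-written standardization with StandardScaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 10 samples, 2 features.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

# The same integer seed (including 0) reproduces the same split.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
same_split = np.array_equal(a_test, b_test)

# Standardize by hand with the training mean and standard deviation...
mu = a_train.mean(axis=0)
sigma = a_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
manual = (a_test - mu) / sigma

# ...and compare with a StandardScaler fitted on the training set only.
scaler = StandardScaler().fit(a_train)
matches = np.allclose(manual, scaler.transform(a_test))

print(same_split, matches)
```

The test set is scaled with the training statistics, so no information from the test set leaks into preprocessing.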
(3) Model training

    lr = LinearRegression()
    lr.fit(X_train, Y_train)

As shown above, sklearn wraps model training in a couple of simple calls.
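To see what fit actually learns, here is a minimal sketch on made-up, noise-free data whose true weights are known. The names X_toy, y_toy, and toy_lr are illustrative and not part of the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noise-free data following y = 3*x1 + 2*x2 + 1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + 1.0

toy_lr = LinearRegression()
toy_lr.fit(X_toy, y_toy)

# Fitted weights and bias; these should be close to [3, 2] and 1.
print(toy_lr.coef_, toy_lr.intercept_)
```

After fitting, the learned weights are in the coef_ attribute and the bias is in intercept_.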
(4) Model prediction

    y_predict = lr.predict(X_test)
    print("Train:", lr.score(X_train, Y_train))
    print("Test:", lr.score(X_test, Y_test))

    # Root mean squared error on the test set
    mse = np.average((y_predict - Y_test) ** 2)
    rmse = np.sqrt(mse)
    print(rmse)

Here score is the scoring function; for LinearRegression it returns the coefficient of determination, $R^2$.
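For reference, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, and RMSE is the square root of the mean squared error. The sketch below computes both by hand on made-up values and compares them with the helpers in sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([18.4, 23.0, 23.0, 15.8, 15.0])
y_pred = np.array([18.0, 22.5, 23.4, 16.0, 15.3])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is the square root of the mean squared error.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
```

An $R^2$ close to 1 means the model explains most of the variance in the target. RMSE is expressed in the same unit as the target itself.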
Data visualization:

    # Use a font that can render Chinese characters, and keep the minus sign displayable
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    t = np.arange(len(X_test))
    plt.figure()
    plt.plot(t, Y_test, 'r-', label='true value')
    plt.plot(t, y_predict, 'b-', label='predicted value')
    plt.legend(loc='upper right')
    plt.title('Linear regression predicts the relationship between power and current')
    plt.grid(True)
    plt.show()

If Anaconda2 and Anaconda3 are both installed, the following command can be used to run the script with version 3:
(5) Model deployment

    joblib.dump(lr, "data_lr.model")    # save the trained model to disk
    lr = joblib.load("data_lr.model")   # load it back when needed

As shown above, the trained model can be saved to disk and loaded again whenever it is needed.
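A deployed model also needs the scaler that was fitted during training, because new inputs must be standardized in the same way. One possible approach, which is not the one used in this article, is to bundle both steps in a scikit-learn pipeline and persist that single object. The sketch assumes the standalone joblib package is installed; the data and the file name are made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data standing in for the power measurements.
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=5.0, scale=2.0, size=(40, 2))
y_fit = 4.0 * X_fit[:, 0] + 0.5 * X_fit[:, 1]

# Bundle the scaler and the regressor so both are saved in one file.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_fit, y_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "data_lr_pipeline.model")  # hypothetical file name
    joblib.dump(pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline accepts raw, unscaled features.
X_new = np.array([[4.2, 0.4], [5.4, 0.5]])
print(np.allclose(pipe.predict(X_new), restored.predict(X_new)))
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources, and preferably with the same library versions that created them.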