Predicting the relationship between power and current

A machine learning project usually goes through the following stages: data collection, data cleaning and transformation, model training, model testing, and model deployment and integration.

Below we walk through the complete machine learning development process using one example.

Libraries required in the project:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame

(1) First, data collection:

What we want to model is how Global_intensity (current) depends on Global_active_power (active power) and Global_reactive_power (reactive power).

path = 'household_power_consumption_1000.txt'
df = pd.read_csv(path,sep = ';',low_memory = False)

print(df.head())
print(df.info())

(2) Next, the data cleaning stage

If the data contains null values or outliers, they can be handled at this stage.

new_df = df.replace('?',np.nan)
datas = new_df.dropna(axis = 0,how = 'any')
print(datas.describe().T)

As shown above, we simply drop every sample that contains an abnormal value ('?') or a null value.

Of the 1000 samples, 998 remain.
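To make the cleaning step concrete, here is a minimal sketch on a made-up four-row sample in the same ';'-separated layout. The sample values and the names `raw`, `df_demo` and `clean` are assumptions for illustration, not the real file. It also shows one side effect worth knowing: a column that contains '?' is read as text, so it may need converting back to float after the bad rows are dropped.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample; the real file has more columns and 1000 rows.
raw = """Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity
16/12/2006;17:24:00;4.216;0.418;234.840;18.400
16/12/2006;17:25:00;?;?;?;?
16/12/2006;17:26:00;5.374;0.498;233.290;23.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000
"""

df_demo = pd.read_csv(io.StringIO(raw), sep=';', low_memory=False)

# Because of the '?' row, the measurement columns are read as text, not numbers.
print(df_demo.dtypes)

# Same cleaning as in the article: '?' -> NaN, then drop incomplete rows.
clean = df_demo.replace('?', np.nan).dropna(axis=0, how='any')

# Convert the measurement columns to float so later arithmetic works.
numeric_cols = clean.columns[2:]
clean = clean.astype({col: float for col in numeric_cols})

print(len(df_demo), '->', len(clean))
print(clean.dtypes)
```

If the columns are already numeric after loading, the `astype` call is harmless.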

Then proceed to feature engineering:

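# extract the relevant data: columns 2 and 3 are the features, current is the target
# (astype(float) guards against columns that were read as text because of '?')
X = datas.iloc[:,2:4].astype(float)
Y = datas['Global_intensity'].astype(float)
# split into training set and test set
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.2,random_state = 0)
# standardize: fit on the training set only, then apply the same transform to the test set
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)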

Tips:

random_state: the seed of the random number generator. A seed identifies one particular sequence of random numbers, so repeated experiments are guaranteed to get the same split. For example, if you pass 1 every time, the split you get is the same as long as the other parameters are the same. The same holds for 0 or any other fixed integer. Only when random_state is left unset (None) does the split differ from run to run.
StandardScaler: standardization computes the mean $\mu$ and standard deviation $\sigma$ of each feature, and is expressed as $x' = (x - \mu) / \sigma$. As for why to standardize, see https://zhuanlan.zhihu.com/p/24839177
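Both tips can be checked on toy data. The sketch below uses made-up arrays (`X_demo` and `y_demo` are assumptions, not the household data). It shows that a fixed integer seed, 0 included, reproduces the same split, and that StandardScaler matches the hand-computed (x - mean) / std using the population standard deviation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 10 samples, 2 features.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

# The same integer seed always yields the same split.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(a_test, b_test))  # True

# StandardScaler learns the mean and std from the training set only ...
scaler = StandardScaler()
a_train_scaled = scaler.fit_transform(a_train)

# ... which matches (x - mean) / std computed by hand (population std, ddof=0).
manual = (a_train - a_train.mean(axis=0)) / a_train.std(axis=0)
print(np.allclose(a_train_scaled, manual))  # True

# The test set is transformed with the training statistics, not its own.
a_test_scaled = scaler.transform(a_test)
```

Fitting the scaler on the training set alone keeps information about the test set out of the model.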

(3) Model training

lr = LinearRegression()
lr.fit(X_train,Y_train)

As shown above, the simple API provided by sklearn is all we need to train the model.
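To see what fit actually learns, the following sketch trains on synthetic, noise-free data generated from a known linear rule. The data and the names `X_syn`, `y_syn` and `model` are assumptions for illustration. The fitted `coef_` and `intercept_` attributes should recover the rule's weights and offset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 2*x1 + 3*x2 + 1, no noise.
rng = np.random.RandomState(0)
X_syn = rng.rand(50, 2)
y_syn = 2 * X_syn[:, 0] + 3 * X_syn[:, 1] + 1

model = LinearRegression()
model.fit(X_syn, y_syn)

print(model.coef_)       # approximately [2. 3.]
print(model.intercept_)  # approximately 1.0
```

The same two attributes are available on the `lr` object trained above. There they describe the relationship in terms of the standardized features.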

(4) Model prediction

y_predict = lr.predict(X_test)
print("Train:",lr.score(X_train,Y_train))
print("Test:",lr.score(X_test,Y_test))

mse = np.average((y_predict-Y_test)**2)
rmse = np.sqrt(mse)
print(rmse)

As above, score is the scoring function. For LinearRegression it returns the coefficient of determination R².
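The two metrics used here can also be computed by hand, which makes their meaning clearer. The sketch below uses synthetic data (an assumption, not the household data) and checks that 1 - SS_res / SS_tot agrees with what score returns. RMSE is the square root of the mean squared residual.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): a linear rule plus a little Gaussian noise.
rng = np.random.RandomState(0)
X_syn = rng.rand(100, 2)
y_syn = 4 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_syn, y_syn)
pred = model.predict(X_syn)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_syn - pred) ** 2)
ss_tot = np.sum((y_syn - y_syn.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is in the same unit as the target.
rmse = np.sqrt(np.mean((y_syn - pred) ** 2))

print(np.isclose(r2_manual, model.score(X_syn, y_syn)))  # True
print(rmse)
```

An R² close to 1 means the model explains most of the variance in the target. RMSE reports the typical error in the target's own unit.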

Data visualization:

# Set a font that can render Chinese characters, to prevent garbled text
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus']=False
t = np.arange(len(X_test))
plt.figure()
plt.plot(t,Y_test,'r-',label = u'true value')
plt.plot(t,y_predict,'b-',label = u'predicted value')
plt.legend(loc = 'upper right')
plt.title(u'Linear regression predicts the relationship between power and current')
plt.grid(True)
plt.show()

If Anaconda2 and Anaconda3 are installed at the same time, the following command can be used to run the script with Python 3:

(5) Model deployment

joblib.dump(lr,"data_lr.model")

lr = joblib.load("data_lr.model")

As shown above, the trained model can be saved to disk and loaded again later when it is needed.
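One point worth keeping in mind when deploying: the model was trained on standardized features, so new data must go through the same fitted scaler before prediction. A minimal sketch, assuming synthetic data and hypothetical file names (`demo_ss.model`, `demo_lr.model`), saves the scaler alongside the model and checks that the reloaded pair gives the same prediction.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data (hypothetical stand-in for the power features).
rng = np.random.RandomState(0)
X_raw = rng.rand(30, 2) * 10
y_raw = 4 * X_raw[:, 0] + X_raw[:, 1]

scaler = StandardScaler().fit(X_raw)
model = LinearRegression().fit(scaler.transform(X_raw), y_raw)

# Persist both objects; file names are made up for this example.
with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "demo_ss.model")
    model_path = os.path.join(tmp, "demo_lr.model")
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)

    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New data goes through the loaded scaler first, then the loaded model.
new_sample = np.array([[2.0, 0.5]])
restored = loaded_model.predict(loaded_scaler.transform(new_sample))
original = model.predict(scaler.transform(new_sample))
print(np.allclose(restored, original))  # True
```

In the article's code, the corresponding objects would be `ss` and `lr`.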
