Machine Learning: Multiple Linear Regression Model

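This article covers the multiple linear regression model: the hypothesis function, the cost function, an implementation of gradient descent, and prediction. First, let's see what the raw data looks like.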

Raw data

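ex1data2.txt has three columns: column 1 is the independent variable area (x1), column 2 is the independent variable number of bedrooms (x2), and column 3 is the dependent variable price (y). The data set has 47 rows and 3 columns (47*3). (If you need the data, you can ask for it in the 三行科创 WeChat official account discussion group.) We start with some exploratory analysis of the data.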
(Figure: raw data)

Exploratory analysis

Plot a heat map of all variables to examine the correlations between them.

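import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# assumes ex1data2.txt is comma-separated with no header row
data = pd.read_csv('ex1data2.txt', names=['area', 'bedroom', 'price'])
data_normalized = (data - data.mean()) / data.std()  # standardize each column
cor_matrix = data_normalized.corr()  # correlation matrix
fig = plt.figure(figsize=(6, 4))  # new figure
ax = fig.add_subplot(111)
sns.heatmap(cor_matrix, cmap=plt.cm.Blues, linewidths=0.5, vmax=1, vmin=0, annot=True,
            annot_kws={'size': 8, 'weight': 'bold'}, ax=ax)  # heat map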
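One point worth noting: the Pearson correlation coefficient is unchanged when each column is shifted and rescaled by a positive factor, so the heat map would look the same if it were computed from the raw `data`. The sketch below checks this with NumPy on a small made-up array (the numbers are illustrative and are not taken from ex1data2.txt).

```python
import numpy as np

# Made-up sample: the columns play the role of area, bedrooms, price
sample = np.array([
    [1000.0, 2.0, 200000.0],
    [1500.0, 3.0, 290000.0],
    [2000.0, 3.0, 360000.0],
    [2600.0, 4.0, 500000.0],
    [3100.0, 5.0, 560000.0],
])

# Standardize each column (ddof=1 mirrors the pandas default for std)
standardized = (sample - sample.mean(axis=0)) / sample.std(axis=0, ddof=1)

# rowvar=False tells corrcoef that columns, not rows, are the variables
corr_raw = np.corrcoef(sample, rowvar=False)
corr_std = np.corrcoef(standardized, rowvar=False)

# True when both correlation matrices agree up to floating-point error
same = np.allclose(corr_raw, corr_std)
```

Standardization is still useful later, because it puts area and bedroom count on comparable scales for gradient descent.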

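The heat map shows that area is very highly correlated with price, and that the number of bedrooms is also correlated with price to some degree. We can therefore propose the following hypothesis: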

$h_{\theta}(x_1,x_2) = \theta_0+\theta_1x_1+\theta_2x_2$

where $\theta_i\ (0 \leq i \leq 2)$ are the parameters to be determined.
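Introducing a constant feature $x_0 = 1$ lets the same hypothesis be written as an inner product, which is why a column of ones is prepended to the feature matrix. Here $X$ denotes the $m \times 3$ matrix whose rows are the samples $(1, x_1, x_2)$; this notation is introduced only for this note.

$h_{\theta}(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 = \theta^{T}x, \qquad x_0 = 1$

$\left(h_{\theta}(x^{(1)}), \dots, h_{\theta}(x^{(m)})\right)^{T} = X\theta$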

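Data preprocessing

To build the model, we first split the normalized data into independent and dependent variables and convert them to arrays.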

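m = len(data_normalized)  # number of samples
x = np.array(data_normalized[['area', 'bedroom']]).reshape(m, 2)  # independent variables
X = np.insert(x, 0, 1, axis=1)  # prepend a column of ones for the intercept
y = np.array(data_normalized['price']).reshape(m, 1)  # dependent variable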
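A detail that matters when reproducing the numbers: `data.std()` in pandas uses the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to `ddof=0`. The sketch below builds the same kind of design matrix with NumPy only, on made-up rows. `build_design_matrix` is a hypothetical helper name and is not part of the original code.

```python
import numpy as np

def build_design_matrix(raw):
    """Standardize all columns with ddof=1, then split into X (with intercept) and y."""
    raw = np.asarray(raw, dtype=float)
    mean = raw.mean(axis=0)
    std = raw.std(axis=0, ddof=1)  # ddof=1 matches the pandas default
    z = (raw - mean) / std
    X = np.column_stack([np.ones(len(z)), z[:, :-1]])  # intercept column + features
    y = z[:, -1:].copy()  # target kept as a column vector of shape (m, 1)
    return X, y, mean, std

# Made-up rows in the order area, bedrooms, price (illustrative only)
rows = [
    [1000.0, 2.0, 200000.0],
    [1500.0, 3.0, 290000.0],
    [2000.0, 3.0, 360000.0],
    [2600.0, 4.0, 500000.0],
]
X_demo, y_demo, mean_demo, std_demo = build_design_matrix(rows)
```

Returning `mean` and `std` alongside the arrays keeps the statistics needed to normalize new inputs and to convert predictions back to prices.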

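Hypothesis function, cost function, gradient descent, and visualization

With the data prepared, we need a hypothesis function, a cost function, and an implementation of gradient descent. It also helps to visualize how the cost function decreases.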
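For reference, the code in this section corresponds to the usual squared-error cost and its batch gradient descent update, where $m$ is the number of samples, $\alpha$ is the learning rate, and all $\theta_j$ are meant to be updated simultaneously from the previous iteration's values:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$

$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \qquad j = 0, 1, 2$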

alpha = 0.01  # learning rate
max_iteration = 2000  # maximum number of iterations

def h(theta, X):  # hypothesis function
    return np.dot(X, theta)

def costFunction(mytheta, X, y):  # cost function: sum of squared errors / (2 * number of samples)
    residual = h(mytheta, X) - y
    return float(np.sum(residual ** 2) / (2 * len(y)))

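def gradientDescent(X, start_theta=None):  # gradient descent; uses the global y, alpha and max_iteration
    if start_theta is None:
        start_theta = np.zeros((X.shape[1], 1))
    theta = start_theta.copy()  # leave the caller's array unchanged
    thetahistory = []  # theta at each iteration
    costhistory = []  # cost at each iteration
    for _ in range(max_iteration):
        tmptheta = theta.copy()  # copy so that all parameters are updated simultaneously
        costhistory.append(costFunction(theta, X, y))
        thetahistory.append(list(theta[:, 0]))
        for j in range(len(tmptheta)):
            tmptheta[j] = theta[j] - (alpha / len(y)) * np.sum((h(theta, X) - y) * X[:, j].reshape(-1, 1))
        theta = tmptheta
    return theta, thetahistory, costhistory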

initial_theta = np.zeros((X.shape[1], 1))  # initial value of theta

theta, thetahistory, costhistory = gradientDescent(X, initial_theta)
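The per-parameter loop can also be written as a single matrix expression. The following is a minimal vectorized sketch and not the author's code: `gradient_descent_vectorized` is a hypothetical name, the data are synthetic, and `np.linalg.lstsq` is used only as an independent check on the result.

```python
import numpy as np

def gradient_descent_vectorized(X, y, alpha=0.01, max_iteration=2000):
    """Batch gradient descent for linear regression; returns theta and the cost history."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    cost_history = []
    for _ in range(max_iteration):
        residual = X @ theta - y  # shape (m, 1)
        cost_history.append(float(np.sum(residual ** 2) / (2 * m)))
        # X.T @ residual holds the gradient sums for all parameters at once,
        # so every component of theta is updated from the same old theta
        theta = theta - (alpha / m) * (X.T @ residual)
    return theta, cost_history

# Synthetic, noise-free data generated from y = 0.5 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 2))
X_syn = np.column_stack([np.ones(50), features])
y_syn = X_syn @ np.array([[0.5], [2.0], [-1.0]])

theta_gd, costs = gradient_descent_vectorized(X_syn, y_syn, alpha=0.1, max_iteration=2000)

# Least-squares solution as a reference for comparison
theta_ls, *_ = np.linalg.lstsq(X_syn, y_syn, rcond=None)
```

Comparing `theta_gd` with `theta_ls` is a quick way to see whether the chosen learning rate and iteration count are enough for gradient descent to converge.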

def plotConvergence(costhistory):  # plot the cost function curve
    plt.figure(figsize=(6, 4))
    plt.plot(range(len(costhistory)), costhistory)
    plt.title("Convergence of cost function")
    plt.xlabel("Iteration")
    plt.ylabel("Cost function")
    plt.show()

plotConvergence(costhistory)

(Figure: the cost function decreasing over the iterations)

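Prediction

We have now obtained the best-fitting parameters $\theta$ and the hypothesis equation. To try the model out, let's predict the price of a house with an area of 1650 and 3 bedrooms. Because the data were normalized earlier, the inputs must be normalized in the same way and the result converted back to obtain the price.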

print("the best fitting function is: y = %0.4f+%0.4f*x_1+(%0.4f*x_2)" % tuple(theta[:, 0]))

print("if the area is 1650 and there are 3 bedrooms, what's the price?")

area_normalized = (1650 - data['area'].mean()) / data['area'].std()  # normalize the area
bedroom_normalized = (3 - data['bedroom'].mean()) / data['bedroom'].std()  # normalize the number of bedrooms
price_mean = data['price'].mean()  # mean of the price used for normalization
price_std = data['price'].std()  # standard deviation of the price used for normalization
price = (theta[0, 0] + theta[1, 0]*area_normalized + theta[2, 0]*bedroom_normalized) * price_std + price_mean
print("$%0.2f" % price)

A normal run should print a price of about $293083.69.
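The back-transformation can also be folded into the coefficients once, giving a model that takes the raw area and bedroom count directly. This is a sketch under the article's setup (every column standardized with its own mean and standard deviation). `to_original_units` is a hypothetical helper, and the statistics and coefficients are placeholders, not the fitted values.

```python
import numpy as np

def to_original_units(theta, mean, std):
    """Map coefficients fitted on standardized data back to raw units.

    theta: (n+1,) coefficients, intercept first, fitted on standardized x and y.
    mean, std: (n+1,) column statistics, features first and target last.
    """
    theta = np.asarray(theta, dtype=float).ravel()
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    # Slope per raw unit of each feature: sigma_y * theta_j / sigma_j
    slopes = std[-1] * theta[1:] / std[:-1]
    # Intercept absorbs the target mean and the feature means
    intercept = mean[-1] + std[-1] * theta[0] - np.sum(slopes * mean[:-1])
    return np.concatenate([[intercept], slopes])

# Placeholder statistics and coefficients, for illustration only
mean = np.array([2000.0, 3.0, 340000.0])
std = np.array([800.0, 0.75, 125000.0])
theta = np.array([0.0, 0.9, -0.05])

coef = to_original_units(theta, mean, std)
x_raw = np.array([1650.0, 3.0])

# Prediction straight from raw inputs
price_direct = coef[0] + coef[1:] @ x_raw

# Same prediction via normalize -> predict -> de-normalize, as in the article
z = (x_raw - mean[:-1]) / std[:-1]
price_via_z = (theta[0] + theta[1:] @ z) * std[-1] + mean[-1]
```

Both routes give the same price, so the choice is one of convenience: raw-unit coefficients are easier to read, while the normalized form is what gradient descent works with.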
