For this first hands-on exercise we start with a relatively simple model: multivariate linear regression.
The data comes from the web: a dataset of 28 rows and 13 columns. Since this is a first attempt, I deliberately picked a small dataset.
A detailed description of the dataset is attached at the end of the post. In short,
the task is to fit a linear regression model that represents the known data well, and then use it to predict the price for new data.
Without further ado, the code:
//
//  main.cpp
//  linear_regression_house_price
//
//  Created by zsp on 2018/7/14.
//  Copyright © 2018 zsp. All rights reserved.
//
#include <iostream>
#include <fstream>
#include <cmath>
using namespace std;

const int SAMPLE = 27;    // number of training samples (rows 1..27)
const int PARAMETER = 12; // number of model inputs (the index column plus A1..A11)

double hypoVal(double para[], double fea[], int count);
double costVal(double para[], double lab[], int amount, double allX[][PARAMETER + 1]);

int main()
{
    // file I/O: read in the data
    fstream infile;
    infile.open("data.txt");
    if (!infile)
    {
        cout << "can't open file!" << endl;
        return -1;
    }
    // store the data in a 2-D array;
    // each row has 13 columns: row index, A1..A11, and the label B
    double X[SAMPLE + 1][PARAMETER + 1]; // features matrix
    for (int i = 0; i < SAMPLE + 1; i++)
    {
        for (int j = 0; j < PARAMETER + 1; j++)
        {
            infile >> X[i][j];
        }
    }
    infile.close();
    // verify that the data was read in correctly
    for (int i = 0; i < SAMPLE + 1; i++)
    {
        for (int j = 0; j < PARAMETER + 1; j++)
            cout << X[i][j] << " ";
        cout << endl;
    }
    // feature scaling (not implemented here)
    // hold out the last row: training set X[0]~X[26], test sample X[27]
    double y[SAMPLE] = { 0 };        // labels vector
    double theta[PARAMETER] = { 0 }; // parameters vector
    double a = 0.0001;               // learning rate
    int cnt = 0;                     // counts the gradient-descent iterations
    for (int i = 0; i < SAMPLE; i++)
    {
        y[i] = X[i][PARAMETER];
    }
    double cost = costVal(theta, y, SAMPLE, X); // initial cost value
    // gradient descent to solve for theta
    double temp[PARAMETER] = { 0 }; // for simultaneously updating the theta parameters
    double der[PARAMETER] = { 0 };  // the derivative term
    double tempCost = 0;            // break the loop when tempCost - cost is small enough
    do {
        tempCost = cost;
        double sum = 0;
        for (int j = 0; j < PARAMETER; j++)
        {
            for (int i = 0; i < SAMPLE; i++)
            {
                sum += (hypoVal(theta, X[i], PARAMETER) - y[i]) * X[i][j];
            }
            der[j] = (1.0 / double(SAMPLE)) * sum; // average over the m = 27 training samples
            temp[j] = theta[j] - a * der[j];
            sum = 0;
        }
        cout << "now the theta parameters are: ";
        for (int i = 0; i < PARAMETER; i++)
        {
            theta[i] = temp[i];
            cout << theta[i] << " ";
        }
        cout << endl;
        cost = costVal(theta, y, SAMPLE, X); // new cost value
        cnt++;
    } while (tempCost - cost > 0.00001);
    // evaluate on the held-out test row X[27], which was not used in training
    cout << "gradient descent ran " << cnt << " iterations" << endl;
    double h = hypoVal(theta, X[SAMPLE], PARAMETER);
    cout << "predicted value for the test row: " << h
         << ", actual value: " << X[SAMPLE][PARAMETER] << endl;
    return 0;
}

double hypoVal(double para[], double fea[], int count) // value of the hypothesis function
{
    double hy = 0;
    for (int i = 0; i < count; i++)
    {
        hy += para[i] * fea[i];
    }
    return hy;
}

double costVal(double para[], double lab[], int amount, double allX[][PARAMETER + 1]) // value of the cost function
{
    double sum = 0;
    for (int i = 0; i < amount; i++)
    {
        sum += pow(hypoVal(para, allX[i], PARAMETER) - lab[i], 2);
    }
    cout << "costVal now is : " << (1.0 / (2.0 * amount)) * sum << endl;
    return (1.0 / (2.0 * amount)) * sum;
}
Debugging notes:
After reading in the data, note that the last row must not be used for training, and the last column is not a feature but the true label.
Initially the learning rate was 0.01; the step size was too large, so the cost function kept increasing and gradient descent could not converge to the optimum.
After setting the learning rate to 0.0001, the program ran normally and produced the following output:
costVal now is : 824.024
now the theta parameters are: 0.00856167 0.0688921 0.0120104 0.0609822 0.0146573 0.0122017 0.0602742 0.0298192 0.29996 0.0196875 0.0104292 0.0032625
...
...
costVal now is : 5.12645
now the theta parameters are: 1.22471 3.71067 2.97314 0.151653 2.89371 0.524275 -0.456355 0.777262 -0.0155349 0.828632 -0.47861 1.85199
...
...
costVal now is : 4.53654
now the theta parameters are: 2.27722 3.36399 4.38813 0.114215 4.16421 0.756403 -1.04117 1.08051 -0.0204399 0.823736 0.0122513 2.3339
costVal now is : 4.53653
gradient descent ran 95091 iterations
predicted value for the test row: 47.874, actual value: 45.8
Analysis:
After 95091 iterations of gradient descent we obtain a linear regression model that fits the data well: the cost value of 4.53653
is already quite small. The loop condition also tells us that the cost is now decreasing very slowly, so we can assume we are close to the optimum.
Comparing the final prediction with the actual value suggests the model also generalizes reasonably well.
Shortcomings:
With so little data and such a small test set, the result is not very convincing;
the model is quite sensitive to changes in the data, so its performance cannot be stated with confidence.
Problems encountered:
At first, the data could not be read in from the text file. For the detailed solution see my other blog post:
https://blog.csdn.net/ezio23/article/details/81068667
Appendix: dataset description:
# data.txt
#
# Reference:
#
# S C Narula, J F Wellington,
# Linear Regression and the Minimum Sum of Relative Errors,
# Technometrics, Volume 19, 1977, pages 185-190.
#
# Helmut Spaeth,
# Mathematical Algorithms for Linear Regression,
# Academic Press, 1991,
# ISBN 0-12-656460-4.
#
# Discussion:
#
# The selling price of houses is to be represented as a function of
# other variables.
#
# There are 28 rows of data. The data includes:
#
# I, the index;
# A1, the local selling prices, in hundreds of dollars;
# A2, the number of bathrooms;
# A3, the area of the site in thousands of square feet;
# A4, the size of the living space in thousands of square feet;
# A5, the number of garages;
# A6, the number of rooms;
# A7, the number of bedrooms;
# A8, the age in years;
# A9, 1 = brick, 2 = brick/wood, 3 = aluminum/wood, 4 = wood.
# A10, 1 = two story, 2 = split level, 3 = ranch
# A11, number of fire places.
# B, the selling price.
#
# We seek a model of the form:
#
# B = A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4 + A5 * X5 + A6 * X6 + A7 * X7
# + A8 * X8 + A9 * X9 + A10 * X10 + A11 * X11
#
13 columns
28 rows
Index
A1, the local selling prices, in hundreds of dollars;
A2, the number of bathrooms;
A3, the area of the site in thousands of square feet;
A4, the size of the living space in thousands of square feet;
A5, the number of garages;
A6, the number of rooms;
A7, the number of bedrooms;
A8, the age in years;
A9, construction type
A10, architecture type
A11, number of fire places.
B, selling price
1 4.9176 1.0 3.4720 0.998 1.0 7 4 42 3 1 0 25.9
2 5.0208 1.0 3.5310 1.500 2.0 7 4 62 1 1 0 29.5
3 4.5429 1.0 2.2750 1.175 1.0 6 3 40 2 1 0 27.9
4 4.5573 1.0 4.0500 1.232 1.0 6 3 54 4 1 0 25.9
5 5.0597 1.0 4.4550 1.121 1.0 6 3 42 3 1 0 29.9
6 3.8910 1.0 4.4550 0.988 1.0 6 3 56 2 1 0 29.9
7 5.8980 1.0 5.8500 1.240 1.0 7 3 51 2 1 1 30.9
8 5.6039 1.0 9.5200 1.501 0.0 6 3 32 1 1 0 28.9
9 16.4202 2.5 9.8000 3.420 2.0 10 5 42 2 1 1 84.9
10 14.4598 2.5 12.8000 3.000 2.0 9 5 14 4 1 1 82.9
11 5.8282 1.0 6.4350 1.225 2.0 6 3 32 1 1 0 35.9
12 5.3003 1.0 4.9883 1.552 1.0 6 3 30 1 2 0 31.5
13 6.2712 1.0 5.5200 0.975 1.0 5 2 30 1 2 0 31.0
14 5.9592 1.0 6.6660 1.121 2.0 6 3 32 2 1 0 30.9
15 5.0500 1.0 5.0000 1.020 0.0 5 2 46 4 1 1 30.0
16 5.6039 1.0 9.5200 1.501 0.0 6 3 32 1 1 0 28.9
17 8.2464 1.5 5.1500 1.664 2.0 8 4 50 4 1 0 36.9
18 6.6969 1.5 6.9020 1.488 1.5 7 3 22 1 1 1 41.9
19 7.7841 1.5 7.1020 1.376 1.0 6 3 17 2 1 0 40.5
20 9.0384 1.0 7.8000 1.500 1.5 7 3 23 3 3 0 43.9
21 5.9894 1.0 5.5200 1.256 2.0 6 3 40 4 1 1 37.5
22 7.5422 1.5 4.0000 1.690 1.0 6 3 22 1 1 0 37.9
23 8.7951 1.5 9.8900 1.820 2.0 8 4 50 1 1 1 44.5
24 6.0931 1.5 6.7265 1.652 1.0 6 3 44 4 1 0 37.9
25 8.3607 1.5 9.1500 1.777 2.0 8 4 48 1 1 1 38.9
26 8.1400 1.0 8.0000 1.504 2.0 7 3 3 1 3 0 36.9
27 9.1416 1.5 7.3262 1.831 1.5 8 4 31 4 1 0 45.8
28 12.0000 1.5 5.0000 1.200 2.0 6 3 30 3 1 1 41.0