Study notes: [Case study] Analysis of factors influencing fiscal revenue and a forecasting model

Case source: 《Python数据分析与挖掘实战》 (Python Data Analysis and Mining in Action), Chapter 13

Case background and mining goals

Input data:
《某市统计年鉴》 (statistical yearbooks of a certain city), 1995-2014

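Mining goals:

1. Survey the key features that affect local fiscal revenue, and analyze and identify a selection model for those key features
2. Building on the factor analysis from goal 1, forecast the city's total fiscal revenue and each revenue category for 2015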

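Analysis method and process (selection principles)

Earlier analyses of fiscal revenue typically used a multiple linear regression model and estimated its coefficients by least squares. The book argues that results obtained this way depend heavily on the data and that the solution found is often only a local optimum, so subsequent tests may lose their intended meaning.
This case therefore uses the Adaptive-Lasso variable selection method instead.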
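
For reference, the Adaptive-Lasso estimator is usually written as a least squares problem with a weighted L1 penalty. The notation below is assumed here for illustration and is not taken from the book: X is the feature matrix, y the response, p the number of candidate features, \tilde{\beta} an initial coefficient estimate (for example from least squares), and \lambda, \gamma tuning parameters.

```latex
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
  + \lambda \sum_{j=1}^{p} \hat{w}_j \lvert \beta_j \rvert,
\qquad \hat{w}_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}, \quad \gamma > 0
```

A feature whose initial coefficient is small receives a large weight and is pushed to exactly zero more readily, which is what makes the method usable for variable selection.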
Subtask planning

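Collect the city's fiscal revenue and the revenue of each category from the municipal statistics bureau website and the statistical yearbooks
Build the Adaptive-Lasso variable selection model
Feed the selected features into the constructed artificial neural network model to obtain the 2015 forecast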

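Experiment
Learn Adaptive-Lasso variable selection and the neural network prediction model

Analyze the data and identify the key features, screening them with the Adaptive-Lasso variable selection method
Use the GM(1,1) grey prediction method to forecast the 2014 and 2015 values of the selected key factors
Feed these values into the neural network model to obtain the 2014 and 2015 predictions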
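
For orientation, the GM(1,1) model named in the second step can be summarized as follows. This is the common textbook form, written with assumed symbols: x^{(0)} is the original series of length n, x^{(1)} its cumulative sum, z^{(1)} the mean of adjacent cumulative values, and a, b the two parameters estimated by least squares from the second equation.

```latex
\begin{aligned}
x^{(1)}(k) &= \sum_{i=1}^{k} x^{(0)}(i), \qquad
z^{(1)}(k) = \tfrac{1}{2}\left(x^{(1)}(k) + x^{(1)}(k-1)\right), \\
x^{(0)}(k) + a\,z^{(1)}(k) &= b, \qquad k = 2, \dots, n, \\
\hat{x}^{(1)}(k) &= \left(x^{(0)}(1) - \frac{b}{a}\right) e^{-a(k-1)} + \frac{b}{a}, \\
\hat{x}^{(0)}(k) &= \hat{x}^{(1)}(k) - \hat{x}^{(1)}(k-1)
  = \left(x^{(0)}(1) - \frac{b}{a}\right)\left(e^{-a(k-1)} - e^{-a(k-2)}\right).
\end{aligned}
```

The last line is the restored forecast in the original units; taking k = n + 1 and k = n + 2 gives the two out-of-sample years.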

Code archive:

import numpy as np
import pandas as pd

# Summary statistics for each column
dpath = './demo/data/data1.csv'
input_data = pd.read_csv(dpath)
r = [input_data.min(), input_data.max(), input_data.mean(), input_data.std()]
r = pd.DataFrame(r, index=['Min', 'Max', 'Mean', 'Std'])
r = np.round(r, 2)
print(r)

              x1       x2       x3        x4        x5          x6       x7  \
Min   3831732.00   181.54   448.19   7571.00   6212.70  6370241.00   525.71   
Max   7599295.00  2110.78  6882.85  42049.14  33156.83  8323096.00  4454.55   
Mean  5579519.95   765.04  2370.83  19644.69  15870.95  7350513.60  1712.24   
Std   1262194.72   595.70  1919.17  10203.02   8199.77   621341.85  1184.71   

            x8      x9     x10     x11   x12       x13        y  
Min     985.31   60.62   65.66   97.50  1.03   5321.00    64.87  
Max   15420.14  228.46  852.56  120.00  1.91  41972.00  2088.14  
Mean   5705.80  129.49  340.22  103.31  1.42  17273.80   618.08  
Std    4478.40   50.51  251.58    5.51  0.25  11109.19   609.25

# Compute the Pearson correlation coefficients
np.round(input_data.corr(method='pearson'),2)

	x1	x2	x3	x4	x5	x6	x7	x8	x9	x10	x11	x12	x13	y
x1	1.00	0.95	0.95	0.97	0.97	0.99	0.95	0.97	0.98	0.98	-0.29	0.94	0.96	0.94
x2	0.95	1.00	1.00	0.99	0.99	0.92	0.99	0.99	0.98	0.98	-0.13	0.89	1.00	0.98
x3	0.95	1.00	1.00	0.99	0.99	0.92	1.00	0.99	0.98	0.99	-0.15	0.89	1.00	0.99
x4	0.97	0.99	0.99	1.00	1.00	0.95	0.99	1.00	0.99	1.00	-0.19	0.91	1.00	0.99
x5	0.97	0.99	0.99	1.00	1.00	0.95	0.99	1.00	0.99	1.00	-0.18	0.90	0.99	0.99
x6	0.99	0.92	0.92	0.95	0.95	1.00	0.93	0.95	0.97	0.96	-0.34	0.95	0.94	0.91
x7	0.95	0.99	1.00	0.99	0.99	0.93	1.00	0.99	0.98	0.99	-0.15	0.89	1.00	0.99
x8	0.97	0.99	0.99	1.00	1.00	0.95	0.99	1.00	0.99	1.00	-0.15	0.90	1.00	0.99
x9	0.98	0.98	0.98	0.99	0.99	0.97	0.98	0.99	1.00	0.99	-0.23	0.91	0.99	0.98
x10	0.98	0.98	0.99	1.00	1.00	0.96	0.99	1.00	0.99	1.00	-0.17	0.90	0.99	0.99
x11	-0.29	-0.13	-0.15	-0.19	-0.18	-0.34	-0.15	-0.15	-0.23	-0.17	1.00	-0.43	-0.16	-0.12
x12	0.94	0.89	0.89	0.91	0.90	0.95	0.89	0.90	0.91	0.90	-0.43	1.00	0.90	0.87
x13	0.96	1.00	1.00	1.00	0.99	0.94	1.00	1.00	0.99	0.99	-0.16	0.90	1.00	0.99
y	0.94	0.98	0.99	0.99	0.99	0.91	0.99	0.99	0.98	0.99	-0.12	0.87	0.99	1.00

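The results show that only x11 is negatively correlated with y; every other variable is positively correlated with it. Most of the explanatory variables are also strongly correlated with one another (pairwise coefficients of about 0.9 or higher).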

# scikit-learn has no AdaptiveLasso class, so a plain Lasso is fitted here instead
from sklearn import linear_model
model = linear_model.Lasso(alpha=1)
model.fit(input_data.iloc[:,0:13], input_data['y'])
model.coef_

/Users/januswing/Library/Python/3.6/lib/python/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)





array([-1.85085555e-04, -3.15519378e-01,  4.32896206e-01, -3.15753523e-02,
        7.58007814e-02,  4.03145358e-04,  2.41255896e-01, -3.70482514e-02,
       -2.55448330e+00,  4.41363280e-01,  5.69277642e+00, -0.00000000e+00,
       -3.98946837e-02])
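
The fit above is a plain Lasso on unscaled features. One common way to approximate Adaptive-Lasso with scikit-learn is to rescale each standardized feature by an initial coefficient estimate, fit an ordinary Lasso, and map the coefficients back. The sketch below runs on synthetic data; the helper name `adaptive_lasso`, the ridge initial estimate, and the values of `alpha`, `gamma` and `ridge_alpha` are all assumptions made for illustration, not the book's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler


def adaptive_lasso(X, y, alpha=0.1, gamma=1.0, ridge_alpha=1.0):
    """Hypothetical helper: Adaptive-Lasso via feature rescaling.

    Returns coefficients on the standardized feature scale.
    """
    Xs = StandardScaler().fit_transform(X)
    yc = y - y.mean()
    # Step 1: initial estimate (ridge is an assumption; least squares is another option)
    beta_init = Ridge(alpha=ridge_alpha).fit(Xs, yc).coef_
    # Step 2: adaptive weights, large for features with a small initial coefficient
    weights = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)
    # Step 3: a plain Lasso on X_j / w_j is equivalent to penalizing w_j * |beta_j|
    lasso = Lasso(alpha=alpha, max_iter=100_000).fit(Xs / weights, yc)
    # Step 4: map the coefficients back to the standardized scale
    return lasso.coef_ / weights


# Synthetic check: only columns 0 and 2 actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

coef = adaptive_lasso(X, y)
selected = np.flatnonzero(coef != 0)
print("coefficients:", np.round(coef, 3))
print("selected feature indices:", selected)
```

Standardizing first matters for data like this case, where x1 and x6 are in the millions while x12 is close to 1; without it the penalty treats the features very unevenly.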

def GM11(x0): # custom grey prediction function; x0 is a 1-D NumPy array
  import numpy as np
  x1 = x0.cumsum() # 1-AGO (accumulated) sequence
  z1 = (x1[:len(x1)-1] + x1[1:])/2.0 # mean sequence of adjacent accumulated values
  z1 = z1.reshape((len(z1),1))
  B = np.append(-z1, np.ones_like(z1), axis = 1)
  Yn = x0[1:].reshape((len(x0)-1, 1))
  [[a],[b]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Yn) # least squares estimate of the parameters
  f = lambda k: (x0[0]-b/a)*np.exp(-a*(k-1))-(x0[0]-b/a)*np.exp(-a*(k-2)) # restored value for the k-th period (1-based)
  delta = np.abs(x0 - np.array([f(i) for i in range(1,len(x0)+1)])) # absolute residuals
  C = delta.std()/x0.std()
  P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum()/len(x0)
  return f, a, b, x0[0], C, P # returns the prediction function, a, b, the first value, the posterior variance ratio C, and the small-error probability P
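
As a quick sanity check of the GM(1,1) idea, the self-contained sketch below fits the same model to a synthetic series that grows by 10% per period and forecasts one step ahead. The helper name `gm11_fit` and the test series are hypothetical; the sketch uses `np.linalg.lstsq` instead of an explicit matrix inverse, which solves the same least squares problem.

```python
import numpy as np


def gm11_fit(x0):
    """Hypothetical helper: fit GM(1,1) and return (predict, a, b)."""
    x0 = np.asarray(x0, dtype=float)
    x1 = x0.cumsum()                      # accumulated (1-AGO) sequence
    z1 = (x1[:-1] + x1[1:]) / 2.0         # mean of adjacent accumulated values
    B = np.column_stack([-z1, np.ones_like(z1)])
    (a, b), *_ = np.linalg.lstsq(B, x0[1:], rcond=None)

    def predict(k):
        # Restored value for 1-based period k (meaningful for k >= 2)
        c = x0[0] - b / a
        return c * (np.exp(-a * (k - 1)) - np.exp(-a * (k - 2)))

    return predict, a, b


# Synthetic series: 100, 110, 121, ... (10% growth per period)
series = 100.0 * 1.1 ** np.arange(10)
predict, a, b = gm11_fit(series)

next_true = 100.0 * 1.1 ** 10
next_pred = predict(len(series) + 1)
rel_err = abs(next_pred - next_true) / next_true
print(f"a={a:.4f}, b={b:.2f}, forecast={next_pred:.2f}, relative error={rel_err:.4%}")
```

A negative `a` corresponds to a growing series. On real data the fit should still be judged with the C and P values that GM11 returns, since a smooth exponential like this one is the most favourable case for the model.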

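inputfile = './demo/data/data1.csv' # input data file
outputfile = './demo/tmp/data1_GM11.xls' # where the grey prediction results are saved
data = pd.read_csv(inputfile) # read the data
data.index = range(1994, 2014)

data.loc[2014] = np.nan # placeholder rows for the two forecast years
data.loc[2015] = np.nan
l = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']

for i in l:
  # Fit on the 20 observed years; as_matrix() was removed in pandas 1.0, so use to_numpy()
  f = GM11(data[i].iloc[:20].to_numpy(dtype=float))[0]
  data.loc[2014, i] = f(len(data)-1) # 2014 forecast
  data.loc[2015, i] = f(len(data)) # 2015 forecast
  data[i] = data[i].round(2) # keep two decimal places

data[l+['y']].to_excel(outputfile) # write the results

data[l+['y']]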

	x1	x2	x3	x4	x5	x7	y
1994	3831732.00	181.54	448.19	7571.00	6212.70	525.71	64.87
1995	3913824.00	214.63	549.97	9038.16	7601.73	618.25	99.75
1996	3928907.00	239.56	686.44	9905.31	8092.82	638.94	88.11
1997	4282130.00	261.58	802.59	10444.60	8767.98	656.58	106.07
1998	4453911.00	283.14	904.57	11255.70	9422.33	758.83	137.32
1999	4548852.00	308.58	1000.69	12018.52	9751.44	878.26	188.14
2000	4962579.00	348.09	1121.13	13966.53	11349.47	923.67	219.91
2001	5029338.00	387.81	1248.29	14694.00	11467.35	978.21	271.91
2002	5070216.00	453.49	1370.68	13380.47	10671.78	1009.24	269.10
2003	5210706.00	533.55	1494.27	15002.59	11570.58	1175.17	300.55
2004	5407087.00	598.33	1677.77	16884.16	13120.83	1348.93	338.45
2005	5744550.00	665.32	1905.84	18287.24	14468.24	1519.16	408.86
2006	5994973.00	738.97	2199.14	19850.66	15444.93	1696.38	476.72
2007	6236312.00	877.07	2624.24	22469.22	18951.32	1863.34	838.99
2008	6529045.00	1005.37	3187.39	25316.72	20835.95	2105.54	843.14
2009	6791495.00	1118.03	3615.77	27609.59	22820.89	2659.85	1107.67
2010	7110695.00	1304.48	4476.38	30658.49	25011.61	3263.57	1399.16
2011	7431755.00	1700.87	5243.03	34438.08	28209.74	3412.21	1535.14
2012	7512997.00	1969.51	5977.27	38053.52	30490.44	3758.39	1579.68
2013	7599295.00	2110.78	6882.85	42049.14	33156.83	4454.55	2088.14
2014	8142148.24	2239.29	7042.31	43611.84	35046.63	4600.40	NaN
2015	8460489.28	2581.14	8166.92	47792.22	38384.22	5214.78	NaN

import pandas as pd
inputfile = './tmp/data1_GM11.xls' # grey prediction results saved earlier
outputfile = './data/revenue.xls' # where the neural network predictions are saved
modelfile = './tmp/1-net.model' # where the model weights are saved
data = pd.read_excel(inputfile, index_col=0) # read the data, using the year column as the index
feature = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7'] # feature columns

data_train = data.loc[range(1994,2014)].copy() # build the model on the data before 2014
data_mean = data_train.mean()
data_std = data_train.std()
data_train = (data_train - data_mean)/data_std # standardize
x_train = data_train[feature].to_numpy() # features (as_matrix() was removed in pandas 1.0)
y_train = data_train['y'].to_numpy() # labels

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential() # build the model
model.add(Dense(input_dim=6, units=12)) # Keras 2 API: output_dim was renamed to units
model.add(Activation('relu')) # ReLU activation, noted in the original as improving accuracy considerably
model.add(Dense(input_dim=12, units=1))
model.compile(loss='mean_squared_error', optimizer='adam') # compile the model
model.fit(x_train, y_train, epochs=10000, batch_size=16, verbose=0) # train for 10,000 epochs (nb_epoch was renamed to epochs)
model.save_weights(modelfile) # save the model weights

# Predict, then map the results back to the original scale.
x = ((data[feature] - data_mean[feature])/data_std[feature]).to_numpy()
data[u'y_pred'] = model.predict(x).ravel() * data_std['y'] + data_mean['y']
data.to_excel(outputfile)

import matplotlib.pyplot as plt # plot the actual and predicted values
p = data[['y','y_pred']].plot(subplots = True, style=['b-o','r-*'])
plt.show()
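
The prediction step above standardizes every row with the mean and standard deviation of the training years only, then maps the network output back to the original units. The NumPy sketch below isolates that round trip on made-up numbers; the array shapes, the random data, and the use of the true standardized target in place of a model output are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training table: 20 rows, two feature columns plus the target in the last column
train = rng.normal(loc=50.0, scale=10.0, size=(20, 3))
# Hypothetical feature rows for two forecast years (no target available yet)
future_x = rng.normal(loc=70.0, scale=10.0, size=(2, 2))

mean = train.mean(axis=0)
std = train.std(axis=0, ddof=1)  # ddof=1 matches the default of pandas' .std()

# Standardize the training rows, and the forecast rows with the SAME statistics
train_z = (train - mean) / std
future_z = (future_x - mean[:2]) / std[:2]

# Stand-in for a model output on the standardized scale
y_z_pred = train_z[:, -1]
# Map back to the original units, as the prediction step does with data_std['y'] and data_mean['y']
y_restored = y_z_pred * std[-1] + mean[-1]

print(np.round(y_restored[:3], 2), np.round(train[:3, -1], 2))
```

Reusing the training statistics for the forecast rows keeps the 2014 and 2015 inputs on the scale the network was trained on.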

Open questions:

What other methods are there for identifying key features? Which of them can be used in PyTorch?
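
As one possible starting point for the first question, scikit-learn's recursive feature elimination (`RFE`) repeatedly refits a model and drops the weakest feature. The sketch below uses synthetic data and a linear regression as the base estimator; both choices are assumptions for illustration, and it does not address the PyTorch part of the question.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0 and 2 actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the two features that survive repeated elimination
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

print("kept:", selector.support_)      # boolean mask over the columns
print("ranking:", selector.ranking_)   # 1 marks a selected feature
```

With a linear base estimator the ranking depends on coefficient magnitudes, so features on very different scales, as in this case, would need standardizing first.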