Exercise 1: Using sklearn to process the wine and wine_quality datasets
1. Key points
(1) Master the use of sklearn transformers.
(2) Master how to split data into training and test sets.
(3) Master PCA dimensionality reduction with sklearn.
2. Requirements
The wine and wine_quality datasets are two datasets about wine. The wine dataset contains 178 records of wines of the same origin belonging to 3 classes; each feature corresponds to one chemical component of the wine, and all features are continuous. The goal is to infer a wine's origin from its chemical analysis.
The wine_quality dataset has 4898 observations with 11 input features and one label. The number of observations differs between classes, and all features are continuous. The goal is to predict a wine's quality score from its chemical components.
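Before walking through the steps, a quick shape check is an easy way to confirm the record and feature counts just described. A minimal sketch, assuming the same ./data/ file paths used in step (1) below (the _check names exist only in this sketch):
import pandas as pd

wine_check = pd.read_csv('./data/wine.csv')
wine_quality_check = pd.read_csv('./data/winequality.csv', sep=';')
print('wine shape:', wine_check.shape)                  # expected (178, 14): Class column + 13 features
print('wine_quality shape:', wine_quality_check.shape)  # expected (4898, 12): 11 features + quality label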
3. Implementation steps
(1) Read the wine and wine_quality datasets with pandas.
import pandas as pd
wine = pd.read_csv('./data/wine.csv')
wine_quality = pd.read_csv('./data/winequality.csv',sep=';')
(2) Separate the data and the labels of the wine and wine_quality datasets.
wine_data = wine.iloc[:,1:]
wine_target=wine['Class']
wine_quality_data = wine_quality.iloc[:,:-1]
wine_quality_target = wine_quality.iloc[:,-1]
(3) Split the wine and wine_quality datasets into training and test sets.
from sklearn.model_selection import train_test_split
wine_data_train,wine_data_test,\
wine_target_train,wine_target_test = \
train_test_split(wine_data,wine_target,test_size = 0.2,random_state=42)
wine_quality_data_train,wine_quality_data_test,\
wine_quality_target_train,wine_quality_target_test = \
train_test_split(wine_quality_data,wine_quality_target,test_size = 0.2,random_state=42)
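As a quick sanity check on the 80/20 split, the resulting set sizes can be printed; this simply continues from the variables above:
# training sets should hold about 80% of the rows, test sets about 20%
print('wine train/test sizes:', wine_data_train.shape, wine_data_test.shape)
print('wine_quality train/test sizes:',
      wine_quality_data_train.shape, wine_quality_data_test.shape)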
(4) Standardize the wine and wine_quality datasets.
import numpy as np
from sklearn.preprocessing import MinMaxScaler  # min-max (range) scaling
Scaler = MinMaxScaler().fit(wine_data_train)  # fit the scaler on the training set
## apply the fitted scaler to the training set
wine_trainScaler = Scaler.transform(wine_data_train)
## apply the fitted scaler to the test set
wine_testScaler = Scaler.transform(wine_data_test)
Scaler1 = MinMaxScaler().fit(wine_quality_data_train)
wine_quality_trainScaler = Scaler1.transform(wine_quality_data_train)
wine_quality_testScaler = Scaler1.transform(wine_quality_data_test)
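Both scalers are fitted on the training data only and then applied to the test data, so scaled test values are not guaranteed to stay inside [0, 1]. A minimal check, continuing from the variables above:
# the scaler's min/max come from the training set, so test values may fall
# slightly outside [0, 1]
print('wine train min/max:', wine_trainScaler.min(), wine_trainScaler.max())
print('wine test  min/max:', wine_testScaler.min(), wine_testScaler.max())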
(5) Apply PCA dimensionality reduction to the wine and wine_quality datasets.
from sklearn.decomposition import PCA
pca = PCA(n_components=5).fit(wine_trainScaler)  # fit the PCA model on the training set
## apply the fitted PCA to the training set
wine_trainPca = pca.transform(wine_trainScaler)
## apply the fitted PCA to the test set
wine_testPca = pca.transform(wine_testScaler)
pca = PCA(n_components=5).fit(wine_quality_trainScaler)  # fit a separate PCA model for wine_quality
## apply the fitted PCA to the training set
wine_quality_trainPca = pca.transform(wine_quality_trainScaler)
## apply the fitted PCA to the test set
wine_quality_testPca = pca.transform(wine_quality_testScaler)
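The choice of n_components=5 can be sanity-checked with explained_variance_ratio_. Because the cell above reuses the name pca for both datasets, the sketch below fits a separate model per dataset (pca_wine and pca_wq are names introduced only for this sketch):
# fraction of variance kept by the 5 retained components, per dataset
pca_wine = PCA(n_components=5).fit(wine_trainScaler)
pca_wq = PCA(n_components=5).fit(wine_quality_trainScaler)
print('wine: variance kept by 5 components:', pca_wine.explained_variance_ratio_.sum())
print('wine_quality: variance kept by 5 components:', pca_wq.explained_variance_ratio_.sum())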
Exercise 2: Building a k-Means clustering model on the wine dataset
1. Key points
(1) Understand the use of sklearn estimators.
(2) Master how to build a clustering model.
(3) Master how to evaluate a clustering model.
2. Requirements
The wine dataset covers 3 kinds of wine in total. By clustering the wine data into 3 clusters, the wines can be grouped by class.
3. Implementation steps
(1) Using the processed wine data from Exercise 1, build a k-Means model with 3 clusters.
from sklearn.cluster import KMeans
# build and train the model on the standardized training set
kmeans = KMeans(n_clusters = 3,random_state=123).fit(wine_trainScaler)
# alternative: train on the standardized, PCA-reduced training set
# (clustering the reduced data gives worse results here, so it is not used)
#kmeans = KMeans(n_clusters = 3,random_state=123).fit(wine_trainPca)
print('The fitted K-Means model is:\n',kmeans)
The fitted K-Means model is:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=123, tol=0.0001, verbose=0)
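Beyond the printed repr, the fitted estimator exposes the cluster assignments and centers. A minimal inspection sketch, continuing from the kmeans object above:
import numpy as np
# how many training samples fall into each of the 3 clusters, and the shape
# of the cluster centers (3 clusters x 13 scaled features)
labels, counts = np.unique(kmeans.labels_, return_counts=True)
print('cluster labels:', labels)
print('samples per cluster:', counts)
print('cluster centers shape:', kmeans.cluster_centers_.shape)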
(2) Compute the FMI by comparing the true labels with the cluster labels.
from sklearn.metrics import fowlkes_mallows_score
# compute the FMI between the true labels and the cluster labels
score = fowlkes_mallows_score(wine_target_train,kmeans.labels_)
print('FMI for the wine data: %f'%(score))
FMI for the wine data: 0.888732
(3) Determine the optimal number of clusters over the range of 2 to 10 clusters.
for i in range(2,11):
    # build and train a model with i clusters, then score it against the true labels
    kmeans = KMeans(n_clusters = i,random_state = 123).fit(wine_trainScaler)
    score = fowlkes_mallows_score(wine_target_train,kmeans.labels_)
    print('FMI for the wine data with %d clusters: %f'%(i,score))
FMI for the wine data with 2 clusters: 0.637271
FMI for the wine data with 3 clusters: 0.888732
FMI for the wine data with 4 clusters: 0.822386
FMI for the wine data with 5 clusters: 0.718383
FMI for the wine data with 6 clusters: 0.684199
FMI for the wine data with 7 clusters: 0.612088
FMI for the wine data with 8 clusters: 0.577474
FMI for the wine data with 9 clusters: 0.536403
FMI for the wine data with 10 clusters: 0.546499
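A small variant of the loop above records each score and picks the best number of clusters automatically instead of reading it off the printed list (fmi_scores, best_k, and km are names introduced only for this sketch):
import numpy as np
fmi_scores = []
for i in range(2, 11):
    km = KMeans(n_clusters=i, random_state=123).fit(wine_trainScaler)
    fmi_scores.append(fowlkes_mallows_score(wine_target_train, km.labels_))
best_k = range(2, 11)[int(np.argmax(fmi_scores))]
print('best number of clusters by FMI:', best_k)  # 3, given the scores above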
(4) Compute the silhouette coefficient, plot it as a line chart, and determine the optimal number of clusters.
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
silhouetteScore = []
for i in range(2,11):
    # build and train the model
    # (note: this cell clusters the raw wine DataFrame, which still contains the Class column)
    kmeans = KMeans(n_clusters = i,random_state = 123).fit(wine)
    score = silhouette_score(wine,kmeans.labels_)
    silhouetteScore.append(score)
plt.figure(figsize=(10,6))
plt.plot(range(2,11),silhouetteScore,linewidth=1.5,linestyle='-')
plt.show()
(Figure: line plot of the silhouette score for 2 to 10 clusters; output_19_0.png)
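For comparison, the same silhouette sweep can also be run on the standardized training features only, since the cell above clusters the raw wine DataFrame that still contains the Class column. This is just a sketch and its scores will differ from the plot above (silhouette_scaled and km are names introduced here):
silhouette_scaled = []
for i in range(2, 11):
    # cluster only the standardized training features from Exercise 1
    km = KMeans(n_clusters=i, random_state=123).fit(wine_trainScaler)
    silhouette_scaled.append(silhouette_score(wine_trainScaler, km.labels_))
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scaled, linewidth=1.5, linestyle='-')
plt.xlabel('number of clusters')
plt.ylabel('silhouette score')
plt.show()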
(5) Compute the Calinski-Harabasz index and determine the optimal number of clusters.
from sklearn.metrics import calinski_harabasz_score
for i in range(2,11):
    # build and train the model (again on the raw wine DataFrame)
    kmeans = KMeans(n_clusters = i, random_state=123).fit(wine)
    score = calinski_harabasz_score(wine,kmeans.labels_)
    print('Calinski-Harabasz index for the wine data with %d clusters: %f'%(i,score))
Calinski-Harabasz index for the wine data with 2 clusters: 505.425689
Calinski-Harabasz index for the wine data with 3 clusters: 561.805171
Calinski-Harabasz index for the wine data with 4 clusters: 707.349460
Calinski-Harabasz index for the wine data with 5 clusters: 787.011163
Calinski-Harabasz index for the wine data with 6 clusters: 878.393807
Calinski-Harabasz index for the wine data with 7 clusters: 1180.244416
Calinski-Harabasz index for the wine data with 8 clusters: 1297.354659
Calinski-Harabasz index for the wine data with 9 clusters: 1349.991148
Calinski-Harabasz index for the wine data with 10 clusters: 1441.838351
4. Results analysis and discussion
Judging by the FMI scores, the wine data reaches its highest FMI with 3 clusters, so k-Means clustering of the wine dataset works best with 3 clusters.
The silhouette-coefficient line plot likewise indicates that 3 clusters is the best choice for the wine data.
The Calinski-Harabasz index, by contrast, grows roughly with the number of clusters and reaches its maximum at 10 clusters within the tested range. Since the Calinski-Harabasz index does not use the true labels, it is less reliable here than the FMI, so there is good reason to regard this result as misleading for this dataset.
Taking all of this together with the description of the data, k-Means clustering of the wine dataset works best with 3 clusters.
Exercise 3: Building an SVM classification model on the wine dataset
1. Key points
(1) Master the use of sklearn estimators.
(2) Master how to build a classification model.
(3) Master how to evaluate a classification model.
2. Requirements
The wine dataset contains 3 classes of wine. Split the wine dataset into training and test sets, train an SVM classification model on the training set, and use the trained model to predict the class of each wine in the test set.
3. Implementation steps
(1) Read the wine dataset and separate the labels from the data.
import pandas as pd
wine = pd.read_csv('./data/wine.csv')
wine_data=wine.iloc[:,1:]
wine_target=wine['Class']
(2) Split the wine dataset into training and test sets.
from sklearn.model_selection import train_test_split
wine_data_train, wine_data_test, \
wine_target_train, wine_target_test = \
train_test_split(wine_data, wine_target, \
test_size=0.1, random_state=6)
(3) Standardize the wine dataset with min-max scaling.
from sklearn.preprocessing import MinMaxScaler  # min-max scaling
stdScale = MinMaxScaler().fit(wine_data_train)  # fit the scaler on the training set
wine_trainScaler = stdScale.transform(wine_data_train)  # scale the training set
wine_testScaler = stdScale.transform(wine_data_test)  # scale the test set with the scaler fitted on the training set
(4) Build the SVM model and predict on the test set.
from sklearn.svm import SVC
svm = SVC().fit(wine_trainScaler,wine_target_train)
print('The fitted SVM model is:\n',svm)
wine_target_pred = svm.predict(wine_testScaler)
print('First 10 predictions:\n',wine_target_pred[:10])
The fitted SVM model is:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
First 10 predictions:
[1 2 2 2 1 1 2 2 2 1]
(5) Print the classification report and evaluate the model's performance.
from sklearn.metrics import classification_report
print('Classification report for the SVM predictions on the wine data:','\n',
      classification_report(wine_target_test,
                            wine_target_pred))
Classification report for the SVM predictions on the wine data:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00         8
           3       1.00      1.00      1.00         1

    accuracy                           1.00        18
   macro avg       1.00      1.00      1.00        18
weighted avg       1.00      1.00      1.00        18
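A confusion matrix shows the same result in a compact per-class form; a minimal sketch reusing the predictions above:
from sklearn.metrics import confusion_matrix
# rows are true classes, columns are predicted classes
print(confusion_matrix(wine_target_test, wine_target_pred))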
4. Results analysis and discussion
Here the data is split 9:1 into training and test sets and the model trains very well: in the classification report the precision, recall, and F1-score are all 1.00, meaning every test-set prediction is correct. As a small extra experiment, the data was split 1:1 into training and test sets instead, with the result shown below:
(Figure: classification report for the 1:1 split; ./image/s_3.png)
As the figure shows, when the model gets less training data and the test set is large, not every prediction is correct anymore, but the accuracy is still quite high.
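Since the screenshot is not reproduced here, the 1:1 experiment can be sketched as follows; the *2-suffixed names are introduced only for this sketch, the random_state is arbitrary, and the exact numbers depend on the split:
wine_data_train2, wine_data_test2, \
wine_target_train2, wine_target_test2 = \
    train_test_split(wine_data, wine_target, test_size=0.5, random_state=6)
stdScale2 = MinMaxScaler().fit(wine_data_train2)  # fit the scaler on the smaller training set
svm2 = SVC().fit(stdScale2.transform(wine_data_train2), wine_target_train2)
pred2 = svm2.predict(stdScale2.transform(wine_data_test2))
print(classification_report(wine_target_test2, pred2))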
Exercise 4: Building regression models on the wine_quality dataset
1. Key points
(1) Become familiar with the use of sklearn estimators.
(2) Master how to build a regression model.
(3) Master how to evaluate a regression model.
2. Requirements
The wine_quality scores range from 1 to 10. Build a linear regression model and a gradient boosting regression model, train them on the wine_quality training set, and then predict the scores of the test-set wines. Compare the predictions with the true scores to judge which of the two regression models is better.
3. Implementation steps
(1) Using the wine_quality data processed in Exercise 1, build a linear regression model.
from sklearn.linear_model import LinearRegression
clf = LinearRegression().fit(wine_quality_trainPca,wine_quality_target_train)
y_pred = clf.predict(wine_quality_testPca)
print('First 10 predictions of the linear regression model:','\n',y_pred[:10])
First 10 predictions of the linear regression model:
[5.20302279 5.20466945 5.34234945 5.3790242 5.74640832 5.30545288
5.27205578 5.27721131 5.66550711 5.70050188]
(2) Using the wine_quality data processed in Exercise 1, build a gradient boosting regression model.
from sklearn.ensemble import GradientBoostingRegressor
GBR_wine = GradientBoostingRegressor().\
    fit(wine_quality_trainPca,wine_quality_target_train)
wine_target_pred = GBR_wine.predict(wine_quality_testPca)
print('First 10 predictions of the gradient boosting regression model:','\n',wine_target_pred[:10])
print('First 10 true labels:','\n',list(wine_quality_target_test[:10]))
First 10 predictions of the gradient boosting regression model:
[5.28629565 5.14521438 5.4020539 5.10652992 6.01754672 5.15338501
5.13264291 5.37157537 5.78959206 5.89730642]
First 10 true labels:
[6, 5, 6, 5, 6, 5, 5, 5, 5, 6]
(3) Using the true and predicted scores, compute the mean absolute error, mean squared error, median absolute error, explained variance, and R-squared.
from sklearn.metrics import explained_variance_score,\
mean_absolute_error,\
mean_squared_error,\
median_absolute_error,r2_score
print('Evaluation of the linear regression model:')
print('Mean absolute error of the linear regression model on wine_quality:',
      mean_absolute_error(wine_quality_target_test,y_pred))
print('Mean squared error of the linear regression model on wine_quality:',
      mean_squared_error(wine_quality_target_test,y_pred))
print('Median absolute error of the linear regression model on wine_quality:',
      median_absolute_error(wine_quality_target_test,y_pred))
print('Explained variance of the linear regression model on wine_quality:',
      explained_variance_score(wine_quality_target_test,y_pred))
print('R-squared of the linear regression model on wine_quality:',
      r2_score(wine_quality_target_test,y_pred))
print('Evaluation of the gradient boosting regression model:')
print('Mean absolute error of the gradient boosting regression model on wine_quality:',
      mean_absolute_error(wine_quality_target_test,wine_target_pred))
print('Mean squared error of the gradient boosting regression model on wine_quality:',
      mean_squared_error(wine_quality_target_test,wine_target_pred))
print('Median absolute error of the gradient boosting regression model on wine_quality:',
      median_absolute_error(wine_quality_target_test,wine_target_pred))
print('Explained variance of the gradient boosting regression model on wine_quality:',
      explained_variance_score(wine_quality_target_test,wine_target_pred))
print('R-squared of the gradient boosting regression model on wine_quality:',
      r2_score(wine_quality_target_test,wine_target_pred))
Evaluation of the linear regression model:
Mean absolute error of the linear regression model on wine_quality: 0.5398251122317769
Mean squared error of the linear regression model on wine_quality: 0.43298314878877353
Median absolute error of the linear regression model on wine_quality: 0.46814373379480045
Explained variance of the linear regression model on wine_quality: 0.34153599915272226
R-squared of the linear regression model on wine_quality: 0.3374456516688773
Evaluation of the gradient boosting regression model:
Mean absolute error of the gradient boosting regression model on wine_quality: 0.5043859078173963
Mean squared error of the gradient boosting regression model on wine_quality: 0.3866355007418619
Median absolute error of the gradient boosting regression model on wine_quality: 0.41613212262993216
Explained variance of the gradient boosting regression model on wine_quality: 0.41147753430214007
R-squared of the gradient boosting regression model on wine_quality: 0.40836720100469737
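To make the comparison easier to read, both models' metrics can also be printed side by side; a minimal sketch reusing the predictions above (metrics and preds are names introduced only for this sketch):
# pair each metric name with its function, then loop over both models' predictions
metrics = [('MAE', mean_absolute_error), ('MSE', mean_squared_error),
           ('MedAE', median_absolute_error),
           ('explained variance', explained_variance_score), ('R2', r2_score)]
preds = {'LinearRegression': y_pred, 'GradientBoostingRegressor': wine_target_pred}
for name, pred in preds.items():
    print(name)
    for label, fn in metrics:
        print('  %s: %.4f' % (label, fn(wine_quality_target_test, pred)))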