Contents
1. Exploring kernel properties on the breast cancer dataset
1.1 How to choose the kernel
```python
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from time import time
import datetime

data = load_breast_cancer()
x = data.data
y = data.target

x.shape       # (569, 30)
np.unique(y)  # array([0, 1])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

Kernel = ["linear", "rbf", "sigmoid"]
for kernel in Kernel:
    time0 = time()
    clf = SVC(kernel=kernel
              , gamma="auto"
              , cache_size=2500  # kernel cache size in MB
              ).fit(xtrain, ytrain)
    print("The accuracy under kernel %s is %f" % (kernel, clf.score(xtest, ytest)))
    print(datetime.datetime.fromtimestamp(time() - time0).strftime("%M:%S:%f"))
```
The accuracy under kernel linear is 0.929825
00:00:416094
The accuracy under kernel rbf is 0.596491
00:00:083388
The accuracy under kernel sigmoid is 0.596491
00:00:004012
Two observations emerge here. First, the breast cancer dataset appears to be a linear dataset: the linear kernel performs very well, while rbf and sigmoid, the two kernels that specialize in nonlinear data, are effectively unusable judging by these results. Second, the linear kernel runs far slower than the two nonlinear kernels. If the data really is linear, then setting the degree parameter to 1 should let the polynomial kernel achieve decent results as well:
```python
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

Kernel = ["linear", "poly", "rbf", "sigmoid"]
for kernel in Kernel:
    time0 = time()
    clf = SVC(kernel=kernel
              , gamma="auto"
              , degree=1
              , cache_size=2500  # kernel cache size in MB
              ).fit(xtrain, ytrain)
    print("The accuracy under kernel %s is %f" % (kernel, clf.score(xtest, ytest)))
    print(datetime.datetime.fromtimestamp(time() - time0).strftime("%M:%S:%f"))
```
The accuracy under kernel linear is 0.929825
00:00:413092
The accuracy under kernel poly is 0.923977
00:00:175040
The accuracy under kernel rbf is 0.596491
00:00:087620
The accuracy under kernel sigmoid is 0.596491
00:00:004012
With degree=1, the polynomial kernel immediately runs much faster, and its accuracy rises to nearly the linear kernel's level. But our earlier experiments showed that rbf can also perform very well on linear data, so why are its results so poor here?

The real problem is the scale of the features. Deciding which side of the decision boundary a point falls on relies on computing "distances"; although SVM is not a purely distance-based model, it is heavily affected by feature scales. Let's examine the scales of the breast cancer dataset:
```python
import pandas as pd

data = pd.DataFrame(x)
data.describe([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]).T
```
| count | mean | std | min | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 99% | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 569.0 | 14.127292 | 3.524049 | 6.981000 | 8.458360 | 9.529200 | 10.260000 | 11.700000 | 13.370000 | 15.780000 | 19.530000 | 24.371600 | 28.11000 |
1 | 569.0 | 19.289649 | 4.301036 | 9.710000 | 10.930400 | 13.088000 | 14.078000 | 16.170000 | 18.840000 | 21.800000 | 24.992000 | 30.652000 | 39.28000 |
2 | 569.0 | 91.969033 | 24.298981 | 43.790000 | 53.827600 | 60.496000 | 65.830000 | 75.170000 | 86.240000 | 104.100000 | 129.100000 | 165.724000 | 188.50000 |
3 | 569.0 | 654.889104 | 351.914129 | 143.500000 | 215.664000 | 275.780000 | 321.600000 | 420.300000 | 551.100000 | 782.700000 | 1177.400000 | 1786.600000 | 2501.00000 |
4 | 569.0 | 0.096360 | 0.014064 | 0.052630 | 0.068654 | 0.075042 | 0.079654 | 0.086370 | 0.095870 | 0.105300 | 0.114820 | 0.132888 | 0.16340 |
5 | 569.0 | 0.104341 | 0.052813 | 0.019380 | 0.033351 | 0.040660 | 0.049700 | 0.064920 | 0.092630 | 0.130400 | 0.175460 | 0.277192 | 0.34540 |
6 | 569.0 | 0.088799 | 0.079720 | 0.000000 | 0.000000 | 0.004983 | 0.013686 | 0.029560 | 0.061540 | 0.130700 | 0.203040 | 0.351688 | 0.42680 |
7 | 569.0 | 0.048919 | 0.038803 | 0.000000 | 0.000000 | 0.005621 | 0.011158 | 0.020310 | 0.033500 | 0.074000 | 0.100420 | 0.164208 | 0.20120 |
8 | 569.0 | 0.181162 | 0.027414 | 0.106000 | 0.129508 | 0.141500 | 0.149580 | 0.161900 | 0.179200 | 0.195700 | 0.214940 | 0.259564 | 0.30400 |
9 | 569.0 | 0.062798 | 0.007060 | 0.049960 | 0.051504 | 0.053926 | 0.055338 | 0.057700 | 0.061540 | 0.066120 | 0.072266 | 0.085438 | 0.09744 |
10 | 569.0 | 0.405172 | 0.277313 | 0.111500 | 0.119740 | 0.160100 | 0.183080 | 0.232400 | 0.324200 | 0.478900 | 0.748880 | 1.291320 | 2.87300 |
11 | 569.0 | 1.216853 | 0.551648 | 0.360200 | 0.410548 | 0.540140 | 0.640400 | 0.833900 | 1.108000 | 1.474000 | 1.909400 | 2.915440 | 4.88500 |
12 | 569.0 | 2.866059 | 2.021855 | 0.757000 | 0.953248 | 1.132800 | 1.280200 | 1.606000 | 2.287000 | 3.357000 | 5.123200 | 9.690040 | 21.98000 |
13 | 569.0 | 40.337079 | 45.491006 | 6.802000 | 8.514440 | 11.360000 | 13.160000 | 17.850000 | 24.530000 | 45.190000 | 91.314000 | 177.684000 | 542.20000 |
14 | 569.0 | 0.007041 | 0.003003 | 0.001713 | 0.003058 | 0.003690 | 0.004224 | 0.005169 | 0.006380 | 0.008146 | 0.010410 | 0.017258 | 0.03113 |
15 | 569.0 | 0.025478 | 0.017908 | 0.002252 | 0.004705 | 0.007892 | 0.009169 | 0.013080 | 0.020450 | 0.032450 | 0.047602 | 0.089872 | 0.13540 |
16 | 569.0 | 0.031894 | 0.030186 | 0.000000 | 0.000000 | 0.003253 | 0.007726 | 0.015090 | 0.025890 | 0.042050 | 0.058520 | 0.122292 | 0.39600 |
17 | 569.0 | 0.011796 | 0.006170 | 0.000000 | 0.000000 | 0.003831 | 0.005493 | 0.007638 | 0.010930 | 0.014710 | 0.018688 | 0.031194 | 0.05279 |
18 | 569.0 | 0.020542 | 0.008266 | 0.007882 | 0.010547 | 0.011758 | 0.013012 | 0.015160 | 0.018730 | 0.023480 | 0.030120 | 0.052208 | 0.07895 |
19 | 569.0 | 0.003795 | 0.002646 | 0.000895 | 0.001114 | 0.001522 | 0.001710 | 0.002248 | 0.003187 | 0.004558 | 0.006185 | 0.012650 | 0.02984 |
20 | 569.0 | 16.269190 | 4.833242 | 7.930000 | 9.207600 | 10.534000 | 11.234000 | 13.010000 | 14.970000 | 18.790000 | 23.682000 | 30.762800 | 36.04000 |
21 | 569.0 | 25.677223 | 6.146258 | 12.020000 | 15.200800 | 16.574000 | 17.800000 | 21.080000 | 25.410000 | 29.720000 | 33.646000 | 41.802400 | 49.54000 |
22 | 569.0 | 107.261213 | 33.602542 | 50.410000 | 58.270400 | 67.856000 | 72.178000 | 84.110000 | 97.660000 | 125.400000 | 157.740000 | 208.304000 | 251.20000 |
23 | 569.0 | 880.583128 | 569.356993 | 185.200000 | 256.192000 | 331.060000 | 384.720000 | 515.300000 | 686.500000 | 1084.000000 | 1673.000000 | 2918.160000 | 4254.00000 |
24 | 569.0 | 0.132369 | 0.022832 | 0.071170 | 0.087910 | 0.095734 | 0.102960 | 0.116600 | 0.131300 | 0.146000 | 0.161480 | 0.188908 | 0.22260 |
25 | 569.0 | 0.254265 | 0.157336 | 0.027290 | 0.050094 | 0.071196 | 0.093676 | 0.147200 | 0.211900 | 0.339100 | 0.447840 | 0.778644 | 1.05800 |
26 | 569.0 | 0.272188 | 0.208624 | 0.000000 | 0.000000 | 0.018360 | 0.045652 | 0.114500 | 0.226700 | 0.382900 | 0.571320 | 0.902380 | 1.25200 |
27 | 569.0 | 0.114606 | 0.065732 | 0.000000 | 0.000000 | 0.024286 | 0.038460 | 0.064930 | 0.099930 | 0.161400 | 0.208940 | 0.269216 | 0.29100 |
28 | 569.0 | 0.290076 | 0.061867 | 0.156500 | 0.176028 | 0.212700 | 0.226120 | 0.250400 | 0.282200 | 0.317900 | 0.360080 | 0.486908 | 0.66380 |
29 | 569.0 | 0.083946 | 0.018061 | 0.055040 | 0.058580 | 0.062558 | 0.065792 | 0.071460 | 0.080040 | 0.092080 | 0.106320 | 0.140628 | 0.20750 |
The features indeed differ wildly in scale. Let's standardize the data with the standardization class from sklearn's preprocessing module:
```python
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(x)
data = pd.DataFrame(x)
data.describe([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]).T
```
| count | mean | std | min | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 99% | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 569.0 | -3.162867e-15 | 1.00088 | -2.029648 | -1.610057 | -1.305923 | -1.098366 | -0.689385 | -0.215082 | 0.469393 | 1.534446 | 2.909529 | 3.971288 |
1 | 569.0 | -6.530609e-15 | 1.00088 | -2.229249 | -1.945253 | -1.443165 | -1.212786 | -0.725963 | -0.104636 | 0.584176 | 1.326975 | 2.644095 | 4.651889 |
2 | 569.0 | -7.078891e-16 | 1.00088 | -1.984504 | -1.571053 | -1.296381 | -1.076672 | -0.691956 | -0.235980 | 0.499677 | 1.529432 | 3.037982 | 3.976130 |
3 | 569.0 | -8.799835e-16 | 1.00088 | -1.454443 | -1.249201 | -1.078225 | -0.947908 | -0.667195 | -0.295187 | 0.363507 | 1.486075 | 3.218702 | 5.250529 |
4 | 569.0 | 6.132177e-15 | 1.00088 | -3.112085 | -1.971730 | -1.517125 | -1.188910 | -0.710963 | -0.034891 | 0.636199 | 1.313694 | 2.599511 | 4.770911 |
5 | 569.0 | -1.120369e-15 | 1.00088 | -1.610136 | -1.345369 | -1.206849 | -1.035527 | -0.747086 | -0.221940 | 0.493857 | 1.347811 | 3.275782 | 4.568425 |
6 | 569.0 | -4.421380e-16 | 1.00088 | -1.114873 | -1.114873 | -1.052316 | -0.943046 | -0.743748 | -0.342240 | 0.526062 | 1.434288 | 3.300560 | 4.243589 |
7 | 569.0 | 9.732500e-16 | 1.00088 | -1.261820 | -1.261820 | -1.116837 | -0.974010 | -0.737944 | -0.397721 | 0.646935 | 1.328412 | 2.973759 | 3.927930 |
8 | 569.0 | -1.971670e-15 | 1.00088 | -2.744117 | -1.885853 | -1.448032 | -1.153036 | -0.703240 | -0.071627 | 0.530779 | 1.233221 | 2.862418 | 4.484751 |
9 | 569.0 | -1.453631e-15 | 1.00088 | -1.819865 | -1.600987 | -1.257643 | -1.057477 | -0.722639 | -0.178279 | 0.470983 | 1.342243 | 3.209454 | 4.910919 |
10 | 569.0 | -9.076415e-16 | 1.00088 | -1.059924 | -1.030184 | -0.884517 | -0.801577 | -0.623571 | -0.292245 | 0.266100 | 1.240514 | 3.198294 | 8.906909 |
11 | 569.0 | -8.853492e-16 | 1.00088 | -1.554264 | -1.462915 | -1.227791 | -1.045885 | -0.694809 | -0.197498 | 0.466552 | 1.256518 | 3.081820 | 6.655279 |
12 | 569.0 | 1.773674e-15 | 1.00088 | -1.044049 | -0.946900 | -0.858016 | -0.785049 | -0.623768 | -0.286652 | 0.243031 | 1.117354 | 3.378079 | 9.461986 |
13 | 569.0 | -8.291551e-16 | 1.00088 | -0.737829 | -0.700152 | -0.637545 | -0.597942 | -0.494754 | -0.347783 | 0.106773 | 1.121579 | 3.021867 | 11.041842 |
14 | 569.0 | -7.541809e-16 | 1.00088 | -1.776065 | -1.327593 | -1.116972 | -0.939031 | -0.624018 | -0.220335 | 0.368355 | 1.123053 | 3.405812 | 8.029999 |
15 | 569.0 | -3.921877e-16 | 1.00088 | -1.298098 | -1.160988 | -0.982870 | -0.911510 | -0.692926 | -0.281020 | 0.389654 | 1.236492 | 3.598943 | 6.143482 |
16 | 569.0 | 7.917900e-16 | 1.00088 | -1.057501 | -1.057501 | -0.949654 | -0.801336 | -0.557161 | -0.199065 | 0.336752 | 0.882848 | 2.997338 | 12.072680 |
17 | 569.0 | -2.739461e-16 | 1.00088 | -1.913447 | -1.913447 | -1.292055 | -1.022462 | -0.674490 | -0.140496 | 0.472657 | 1.117927 | 3.146456 | 6.649601 |
18 | 569.0 | -3.108234e-16 | 1.00088 | -1.532890 | -1.210240 | -1.063590 | -0.911757 | -0.651681 | -0.219430 | 0.355692 | 1.159654 | 3.834036 | 7.071917 |
19 | 569.0 | -3.366766e-16 | 1.00088 | -1.096968 | -1.014237 | -0.859880 | -0.788466 | -0.585118 | -0.229940 | 0.288642 | 0.904208 | 3.349301 | 9.851593 |
20 | 569.0 | -2.333224e-15 | 1.00088 | -1.726901 | -1.462332 | -1.187658 | -1.042700 | -0.674921 | -0.269040 | 0.522016 | 1.535063 | 3.001373 | 4.094189 |
21 | 569.0 | 1.763674e-15 | 1.00088 | -2.223994 | -1.706020 | -1.482403 | -1.282757 | -0.748629 | -0.043516 | 0.658341 | 1.297666 | 2.625885 | 3.885905 |
22 | 569.0 | -1.198026e-15 | 1.00088 | -1.693361 | -1.459232 | -1.173717 | -1.044983 | -0.689578 | -0.285980 | 0.540279 | 1.503553 | 3.009644 | 4.287337 |
23 | 569.0 | 5.049661e-16 | 1.00088 | -1.222423 | -1.097625 | -0.966014 | -0.871684 | -0.642136 | -0.341181 | 0.357589 | 1.393000 | 3.581882 | 5.930172 |
24 | 569.0 | -5.213170e-15 | 1.00088 | -2.682695 | -1.948882 | -1.605910 | -1.289152 | -0.691230 | -0.046843 | 0.597545 | 1.276124 | 2.478455 | 3.955374 |
25 | 569.0 | -2.174788e-15 | 1.00088 | -1.443878 | -1.298811 | -1.164575 | -1.021571 | -0.681083 | -0.269501 | 0.539669 | 1.231407 | 3.335783 | 5.112877 |
26 | 569.0 | 6.856456e-16 | 1.00088 | -1.305831 | -1.305831 | -1.217748 | -1.086814 | -0.756514 | -0.218232 | 0.531141 | 1.435090 | 3.023359 | 4.700669 |
27 | 569.0 | -1.412656e-16 | 1.00088 | -1.745063 | -1.745063 | -1.375270 | -1.159448 | -0.756400 | -0.223469 | 0.712510 | 1.436382 | 2.354181 | 2.685877 |
28 | 569.0 | -2.289567e-15 | 1.00088 | -2.160960 | -1.845039 | -1.251767 | -1.034661 | -0.641864 | -0.127409 | 0.450138 | 1.132518 | 3.184317 | 6.046041 |
29 | 569.0 | 2.575171e-15 | 1.00088 | -1.601839 | -1.405690 | -1.185223 | -1.006009 | -0.691912 | -0.216444 | 0.450762 | 1.239884 | 3.141089 | 6.846856 |
After standardization, let SVC iterate over the kernels once more, keeping degree set to 1, and observe how each kernel performs on the rescaled data:
```python
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

Kernel = ["linear", "poly", "rbf", "sigmoid"]
for kernel in Kernel:
    time0 = time()
    clf = SVC(kernel=kernel
              , gamma="auto"
              , degree=1
              , cache_size=2500  # kernel cache size in MB
              ).fit(xtrain, ytrain)
    print("The accuracy under kernel %s is %f" % (kernel, clf.score(xtest, ytest)))
    print(datetime.datetime.fromtimestamp(time() - time0).strftime("%M:%S:%f"))
```
The accuracy under kernel linear is 0.976608
00:00:011002
The accuracy under kernel poly is 0.964912
00:00:004001
The accuracy under kernel rbf is 0.970760
00:00:008002
The accuracy under kernel sigmoid is 0.953216
00:00:003000
Once the scales are unified, every kernel's running time drops dramatically, especially the linear kernel's, and the polynomial kernel actually becomes the fastest. Moreover, rbf now produces excellent results. From this exploration we can draw two conclusions:

1. The linear kernel, and especially the polynomial kernel with high-degree terms, is very slow to compute.
2. The rbf and polynomial kernels are both poor at handling datasets whose features are on different scales.

Both weaknesses can be solved by rescaling the data, so before running an SVM it is strongly recommended to scale your data first!
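Since scaling must happen before the SVM sees the data, one convenient pattern (a sketch, not part of the original workflow here) is to bundle both steps in a scikit-learn Pipeline, so the scaler is fit on the training split only and applied automatically at prediction time:

```python
# Sketch: StandardScaler + SVC bundled in one Pipeline, avoiding data leakage
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

# make_pipeline names the steps automatically: "standardscaler", "svc"
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))
```

The scores will differ slightly from standardizing the full dataset up front, because here the scaler never sees the test split.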
1.2 Improving the model by tuning
Are we done modeling at this point? The linear kernel gives the best results, but it has no kernel-specific parameters to adjust, whereas rbf and the polynomial kernel still do.
Input | Meaning | Problem type | gamma | degree | coef0 |
---|---|---|---|---|---|
"linear" | linear kernel | linear | no | no | no |
"poly" | polynomial kernel | mostly linear | yes | yes | yes |
"sigmoid" | hyperbolic tangent kernel | nonlinear | yes | no | yes |
"rbf" | Gaussian radial basis | mostly nonlinear | yes | no | no |
For the linear kernel, "kernel" is the only parameter that affects it, but the other three nonlinear kernels are also influenced by the parameters gamma, degree, and coef0.

For the Gaussian radial basis kernel, tuning gamma is fairly straightforward: plot a learning curve. Let's see how rbf's gamma parameter behaves on the breast cancer dataset:
```python
score = []
gamma_range = np.logspace(-10, 1, 50)  # numbers evenly spaced on a log scale
for i in gamma_range:
    clf = SVC(kernel="rbf", gamma=i, cache_size=5000).fit(xtrain, ytrain)
    score.append(clf.score(xtest, ytest))

print(max(score), gamma_range[score.index(max(score))])
plt.plot(gamma_range, score)
plt.show()
```
0.9766081871345029 0.012067926406393264
The learning curve makes it easy to find rbf's best gamma. Notice, however, that this is exactly the same accuracy the linear kernel achieved earlier. You can adjust gamma_range repeatedly and observe the results; 0.976608 appears to be the ceiling for the rbf kernel here.
For the polynomial kernel, three parameters jointly enter one formula and interact in their effect, so we usually use grid search to tune all of them together. Again on the breast cancer dataset:
```python
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV  # grid search with cross-validation

time0 = time()
gamma_range = np.logspace(-10, 1, 20)
coef0_range = np.logspace(0, 5, 10)
param_grid = dict(gamma=gamma_range, coef0=coef0_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=420)
grid = GridSearchCV(SVC(kernel="poly", degree=1, cache_size=2500)
                    , param_grid=param_grid
                    , cv=cv)
grid.fit(x, y)

print("the best parameters are %s with a score of %0.5f" % (grid.best_params_, grid.best_score_))
print(datetime.datetime.fromtimestamp(time() - time0).strftime("%M:%S:%f"))
```
the best parameters are {'coef0': 1.0, 'gamma': 0.18329807108324375} with a score of 0.96959
00:09:726203
The grid search returns coef0=1.0 and gamma=0.18329807108324375 with an overall score of 0.96959, slightly better than before tuning but still below the linear and rbf kernels. The lesson: if the polynomial kernel initially underperforms rbf and the linear kernel, don't struggle with it; try tuning rbf, or simply use the linear kernel.
2. Soft margin and the important parameter C
2.1 Hard margin vs. soft margin
When two classes of data are perfectly linearly separable, we can find a decision boundary whose classification error on the training set is 0; such data is said to have a "hard margin". When two classes are almost, but not quite, linearly separable, so that the decision boundary incurs a small error on the training set, the data is said to have a "soft margin".
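The soft-margin idea can be written formally: a slack variable ζ_i measures how far each sample is allowed to violate the margin, and the objective (sketched below in standard SVM notation, not taken from the original text) penalizes the total violation with a coefficient C:

```latex
\min_{w,\,b,\,\zeta}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\zeta_{i}
\quad \text{s.t.}\quad y_{i}\left(w\cdot x_{i}+b\right) \ge 1-\zeta_{i},\qquad \zeta_{i}\ge 0
```

With ζ_i = 0 for all i this reduces to the hard-margin problem; larger C punishes margin violations more heavily.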
2.2 The parameter C
For soft-margin data, the decision boundary no longer simply seeks the maximum margin: the wider the margin, the more samples get misclassified, so we need a balance between "maximum margin" and "number of misclassified samples". The parameter C weighs "classifying training samples correctly" against "maximizing the decision function's margin", two goals that cannot both be fully achieved, in the hope of finding the balance point that gives the best model.
Parameter | Meaning |
---|---|
C | Float, default 1, must be >= 0, optional. The penalty coefficient on the slack variables. If C is set large, SVC may choose a smaller-margin decision boundary that classifies all training points more accurately, though the model will also take longer to train. If C is set small, SVC will try to maximize the margin, giving a simpler decision function at the cost of training accuracy. In other words, C influences SVM much as the regularization parameter influences logistic regression. |
In practice, tuning C together with the kernel-specific parameters (gamma, degree, and so on) is usually the heart of SVM tuning. Unlike gamma, C does not appear in the dual function, and its tuning goal is explicit, so we can decide which direction to adjust C based on whether we need high accuracy on the training set. By default C is 1, which is usually a reasonable value. If the data is very noisy, we tend to decrease C. We can also tune C with grid search or a learning curve.

For the breast cancer dataset, we can optimize the model by tuning:
2.3 Tuning the linear kernel
```python
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

score = []
c_range = np.linspace(0.01, 20, 50)
for i in c_range:
    clf = SVC(kernel="linear", C=i, cache_size=2500).fit(xtrain, ytrain)
    score.append(clf.score(xtest, ytest))

print(max(score), c_range[score.index(max(score))])
plt.plot(c_range, score)
plt.show()
```
0.9766081871345029 0.41795918367346935
2.4 Tuning the Gaussian radial basis kernel rbf
```python
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

score = []
c_range = np.linspace(0.01, 30, 50)
for i in c_range:
    clf = SVC(kernel="rbf", C=i, gamma=0.012067926406393264, cache_size=2500).fit(xtrain, ytrain)
    score.append(clf.score(xtest, ytest))

print(max(score), c_range[score.index(max(score))])
plt.plot(c_range, score)
plt.show()
```
0.9824561403508771 6.7424489795918365
Refining further:
```python
score = []
c_range = np.linspace(5, 7, 50)
for i in c_range:
    clf = SVC(kernel="rbf", C=i, gamma=0.012067926406393264, cache_size=2500).fit(xtrain, ytrain)
    score.append(clf.score(xtest, ytest))

print(max(score), c_range[score.index(max(score))])
plt.plot(c_range, score)
plt.show()
```
0.9824561403508771 6.224489795918367
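Instead of sweeping one parameter at a time with learning curves, C and gamma can also be tuned jointly with cross-validated grid search. The sketch below follows this idea; the parameter ranges are illustrative choices, not values from the original text:

```python
# Sketch: joint C/gamma tuning for the rbf kernel via GridSearchCV
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)

# step names from make_pipeline: "standardscaler", "svc"
param_grid = {"svc__C": np.linspace(1, 10, 10),
              "svc__gamma": np.logspace(-4, 0, 10)}
grid = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                    param_grid, cv=5)
grid.fit(x, y)
print(grid.best_params_, grid.best_score_)
```

Cross-validated scores average over several splits, so they won't match a single held-out test set exactly, but they are less sensitive to one lucky or unlucky split.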
At this point we have found our best result on the breast cancer dataset: 98.245% accuracy with the rbf kernel.
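As a self-contained recap, the best configuration found above can be refit directly, following the same preprocessing and split used throughout this section:

```python
# Sketch: refit the final model (rbf, C ≈ 6.22, gamma ≈ 0.0121 from the search above)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)
x = StandardScaler().fit_transform(x)  # same standardization as earlier
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=420)

clf = SVC(kernel="rbf", C=6.224489795918367,
          gamma=0.012067926406393264, cache_size=2500).fit(xtrain, ytrain)
print(clf.score(xtest, ytest))
```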
Done!