A Little Progress Every Day: "ML - A Quick Look at the Sklearn Library"

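1: Introduction to Sklearn
Sklearn (scikit-learn) is a powerful machine learning library for Python. The official documentation is at http://scikit-learn.org/stable/ . Some of its use cases are listed below.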

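The library's algorithms fall into four main groups: classification, regression, clustering, and dimensionality reduction. Among them:
• Common regression: linear, decision tree, SVM, KNN; ensemble regression: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
• Common classification: linear, decision tree, SVM, KNN, naive Bayes; ensemble classification: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
• Common clustering: K-means, hierarchical clustering, DBSCAN
• Common dimensionality reduction: LinearDiscriminantAnalysis, PCA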
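One thing that makes these algorithms easy to try is that sklearn estimators share the same fit / predict / score interface. The sketch below is not from the original post; it assumes the bundled iris dataset and default-ish hyperparameters, and simply loops over several of the classifiers listed above to show that they can be swapped without changing the surrounding code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset choice (an assumption, not from the post): iris ships with sklearn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Several classifiers from the list above; each exposes the same methods.
classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # train on the training split
    scores[name] = clf.score(X_test, y_test)   # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

The regressors, clustering algorithms, and dimensionality reduction classes follow the same pattern, with transform or fit_predict in place of predict where appropriate.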

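The official site also hosts many usage examples and a user guide, so the best documentation for sklearn is its own website.

Newcomers will find detailed introductions to many machine learning algorithms there, along with concrete usage examples, which makes it a good learning platform. The parts to focus on are the user guide, the API reference, and the examples.
The user guide covers the various machine learning algorithms, all of which come up while learning.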

2: Installing Sklearn
First install Sklearn. Under conda, running conda install scikit-learn is enough.
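A quick way to confirm the installation worked is to import the package and print its version. This check is an addition to the original post; the version printed depends on your environment.

```python
import sklearn

# If the import succeeds, the package is installed in the active environment.
version = sklearn.__version__
print("scikit-learn version:", version)
```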

3: Running examples
First, a linear regression example:

from sklearn import datasets
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

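# Linear regression on a built-in dataset.
# Note: this example originally used datasets.load_boston() (Boston house prices),
# which was removed in scikit-learn 1.2. load_diabetes() is used here as a bundled
# substitute so the code runs on current versions.
loaded_data = datasets.load_diabetes()
data_X = loaded_data.data
data_y = loaded_data.target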

model=LinearRegression()
model.fit(data_X,data_y)

print(model.predict(data_X[:4,:]))
print(data_y[:4])

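# Generate a synthetic linear regression dataset and show it as a scatter plot.
# n_samples: number of samples; n_features: number of features;
# n_targets: number of target values per sample; noise: std of the Gaussian noise
X, y = datasets.make_regression(n_samples=100, n_features=1, n_targets=1, noise=10)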
plt.scatter(X,y)
plt.show()

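Result: the script prints the model's predictions for the first four samples, followed by their true target values, and then shows a scatter plot of the generated data.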
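The example above fits and predicts on the same data, which says little about how well the model generalizes. As a small extension (not part of the original post), the sketch below holds out part of the generated data and reports the R^2 score on it. The random_state values and the 25% test split are arbitrary choices made for reproducibility.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same kind of synthetic data as above, with a fixed seed so runs are repeatable.
X, y = make_regression(n_samples=100, n_features=1, n_targets=1,
                       noise=10, random_state=0)

# Keep 25% of the samples aside for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# score() returns the coefficient of determination R^2 for regressors.
r2 = model.score(X_test, y_test)
print("coefficient:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on held-out data:", r2)
```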

Next, an SVM classification example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-3, 3, 500),
                     np.linspace(-3, 3, 500))
np.random.seed(0)
X = np.random.randn(300, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

# fit the model
clf = svm.NuSVC(gamma='auto')
clf.fit(X, Y)

# plot the decision function for each datapoint on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='dashed')
plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,
            edgecolors='k')
plt.xticks(())
plt.yticks(())
plt.axis([-3, 3, -3, 3])
plt.show()

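Result: the plot shows the decision function as a color map, the decision boundary as a dashed contour, and the sample points colored by class.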
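The labels in this example follow an XOR pattern, which no straight line can separate, and that is why a kernel SVM is used. The sketch below is my own addition: it compares a linear-kernel SVC with the RBF-kernel NuSVC from the example on a held-out split. The split size and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Same XOR-style data as in the example above.
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# A linear kernel can only draw a straight decision boundary.
linear_clf = svm.SVC(kernel="linear").fit(X_train, Y_train)
# NuSVC uses the RBF kernel by default, which can bend around the quadrants.
rbf_clf = svm.NuSVC(gamma="auto").fit(X_train, Y_train)

linear_acc = linear_clf.score(X_test, Y_test)
rbf_acc = rbf_clf.score(X_test, Y_test)
print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```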

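Next, a mean shift clustering example. Note that mean shift is not the same algorithm as K-means: K-means needs the number of clusters up front, while mean shift estimates it from the data using a bandwidth parameter.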

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

# #############################################################################
# Compute clustering with MeanShift

# The bandwidth can be estimated automatically with estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

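Result: the script prints the estimated number of clusters and plots each cluster in its own color, with the cluster centers drawn as large circles.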
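For comparison, here is a minimal sketch (my own addition, not from the original post) that runs KMeans and MeanShift on the same kind of blob data. The key difference is visible in the constructor calls: KMeans takes n_clusters, while MeanShift takes a bandwidth and reports however many clusters it finds. The smaller sample count and the seeds are assumptions chosen to keep the run short and repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=2000, centers=centers, cluster_std=0.6,
                  random_state=0)

# K-means: the number of clusters must be supplied by the caller.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means centers:\n", kmeans.cluster_centers_)

# Mean shift: only a bandwidth is supplied; the cluster count is an output.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
n_ms = len(np.unique(ms.labels_))
print("mean shift found %d clusters" % n_ms)
```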

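The main point here is to show how Sklearn is used and where to learn more about it. Going forward, excellent open-source libraries like this one can be used for all kinds of development and data analysis, which greatly improves efficiency. When studying, though, it is still essential to go deeper into how the algorithms are designed and how they work internally.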
