Machine Learning: The k-Nearest Neighbors Algorithm and Its Use in Python


What is the k-nearest neighbors algorithm
The k-nearest neighbors workflow
Implementation with sklearn

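To begin, let me borrow a line from my teacher: the frontier of every natural science today is really the study of mathematics, and a field like machine learning demands even more of it. Open almost any machine learning book and you are met with page after page of formulas, partial derivatives, integrals and the like. It is enough to give anyone a headache, and for someone like me, whose math foundation is not strong, it is close to torture. Still, hard as it is, refusing to even try would leave a regret, and that regret would last a lifetime.
There is a saying I have come to like: we keep going not because we expect great achievements, but so that we have fewer regrets when we are old!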

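What is the k-nearest neighbors algorithm

k-nearest neighbors (k-Nearest Neighbor, kNN for short) is a commonly used supervised learning method, and its mechanism is very simple: given a test sample, find the k training samples closest to it under some distance metric, then make a prediction from the information of these k "neighbors". In classification tasks one usually uses "voting", taking the class label that appears most often among the k samples as the prediction; in regression tasks one can use "averaging", taking the mean of the k samples' real-valued outputs as the prediction. Weighted averaging or weighted voting based on distance is also possible, with closer samples receiving larger weights.
Compared with other machine learning algorithms, k-nearest neighbors has almost no training process, or more precisely no explicit one. It is in fact the best-known representative of "lazy learning": in the training phase such methods merely store the samples, so the training time cost is close to zero, and the real work is done only once a test sample arrives.
kNN requires almost no mathematical background and is easy to understand.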
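
To make the three prediction rules concrete, here is a minimal sketch in plain Python. The function names, the 1/distance weighting and the toy neighbor lists are my own assumptions for illustration; the text above only says that closer samples should get larger weights, not which weighting to use.

```python
from collections import Counter

def majority_vote(labels):
    """Voting: return the most common label among the k neighbors."""
    return Counter(labels).most_common(1)[0][0]

def mean_prediction(values):
    """Averaging: return the plain mean of the neighbors' real-valued outputs."""
    return sum(values) / len(values)

def weighted_vote(labels, distances, eps=1e-12):
    """Weighted voting with weight 1/distance (an assumed weighting scheme)."""
    scores = {}
    for label, dist in zip(labels, distances):
        # eps avoids division by zero when a neighbor coincides with the query
        scores[label] = scores.get(label, 0.0) + 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# Hypothetical neighbors: two distant "blue" samples and one very close "red" one.
labels = ["blue", "blue", "red"]
distances = [2.0, 2.5, 0.5]

vote_result = majority_vote(labels)
weighted_result = weighted_vote(labels, distances)
mean_result = mean_prediction([1.0, 2.0, 3.0])
print(vote_result, weighted_result, mean_result)
```

With these made-up neighbors the plain vote and the weighted vote disagree, which is exactly the situation where the choice of rule matters.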

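The k-nearest neighbors workflow

Next, let's look at the steps of the algorithm:
1. Compute the distance between the sample to be classified and every existing sample;

2. Sort the samples by increasing distance;

3. Select the K points with the smallest distances;

4. Assign the sample to the class that occurs most often among these K points.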
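
The four steps map almost line for line onto a few lines of NumPy. This is only a sketch under my own assumptions: Euclidean distance, a hypothetical helper named knn_predict, and a made-up toy dataset. It is not how sklearn implements kNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by following the four steps literally."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 1: distance from the query to every training sample (Euclidean here)
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Step 2: indices of the training samples sorted by increasing distance
    order = np.argsort(distances)
    # Step 3: keep the k closest samples
    nearest = order[:k]
    # Step 4: the most frequent class among them is the prediction
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: class "a" near the origin, class "b" near (5, 5).
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(toy_X, toy_y, [0.5, 0.5], k=3)
pred_near_b = knn_predict(toy_X, toy_y, [5.5, 5.5], k=3)
print(pred_near_a, pred_near_b)
```

Sorting every distance costs more than strictly necessary for large datasets, but it keeps the code in the same order as the steps.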

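Let's work through an example to understand the process.
[Figure: kNN example with blue squares, red triangles and a green circle]
In the figure, the blue squares and red triangles are training samples, and the green circle is the test sample. We now classify the green circle, and the outcome depends on the chosen k (the number of neighbors). When k = 3, the neighbors are the three training samples enclosed by the solid circle; red triangles outnumber blue squares among them, so for this k the green circle is assigned to the red-triangle class. Likewise, when k = 5 there are three blue squares, more than the red triangles, so the green circle is assigned to the blue-square class.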
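
The same effect can be reproduced with sklearn's KNeighborsClassifier. The coordinates below are hypothetical: I chose them so that the three nearest neighbors contain two red triangles and the five nearest contain three blue squares, as in the example, but they are not read off the original figure.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical coordinates; the green circle sits at the origin.
train_points = [
    [0.5, 0.0],    # red triangle, distance 0.5
    [0.0, 0.6],    # red triangle, distance 0.6
    [-0.7, 0.0],   # blue square, distance 0.7
    [0.0, -1.5],   # blue square, distance 1.5
    [1.6, 0.0],    # blue square, distance 1.6
    [3.0, 3.0],    # red triangle, far away
    [-3.0, -3.0],  # blue square, far away
]
train_labels = ["red", "red", "blue", "blue", "blue", "red", "blue"]
green_circle = [[0.0, 0.0]]

predictions = {}
for k in (3, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(train_points, train_labels)
    predictions[k] = clf.predict(green_circle)[0]

print(predictions)
```

The only thing that changes between the two runs is n_neighbors, which shows how sensitive the prediction can be to the choice of k.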

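By now kNN should make sense, so let's go one step further. The "nearest" in the example above meant nearest in Euclidean distance, but Euclidean distance is not always the right way to measure how far apart samples are in feature space, so here are some other "distances":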

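Hamming distance: the number of positions at which two strings differ. It is named after Richard Wesley Hamming. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ; in other words, it is the number of character substitutions needed to turn one string into the other;

Mahalanobis distance: a distance that accounts for the covariance of the data (differences are scaled by the inverse covariance matrix), used to measure how similar two samples are;

Cosine distance: uses the angle between two vectors as the measure of distance;

Manhattan distance: the sum of the distances between two points projected onto each axis;

Chebyshev distance: the maximum of the distances between two points projected onto each axis;

Standardized Euclidean distance: Euclidean distance with each term divided by the standard deviation of that feature.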
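
The definitions above are short enough to write out directly. The sketch below uses NumPy and function names of my own choosing. Two conventions in it are assumptions rather than something stated above: cosine distance is taken as 1 minus the cosine similarity, and the standard deviations and covariance matrix are passed in by the caller.

```python
import numpy as np

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))

def manhattan(x, y):
    """Sum of the per-axis absolute differences."""
    return float(np.abs(x - y).sum())

def chebyshev(x, y):
    """Largest per-axis absolute difference."""
    return float(np.abs(x - y).max())

def cosine_distance(x, y):
    """1 - cos(angle between x and y); one common convention."""
    return float(1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def standardized_euclidean(x, y, std):
    """Euclidean distance with each term divided by that feature's std."""
    return float(np.sqrt((((x - y) / std) ** 2).sum()))

def mahalanobis(x, y, cov):
    """Distance that scales the difference by the inverse covariance matrix."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

d_hamming = hamming("karolin", "kathrin")
d_manhattan = manhattan(x, y)
d_chebyshev = chebyshev(x, y)
d_cosine = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
d_seuclid = standardized_euclidean(x, y, std=np.array([3.0, 4.0]))
d_mahal = mahalanobis(x, y, cov=np.eye(2))
print(d_hamming, d_manhattan, d_chebyshev, d_cosine, d_seuclid, d_mahal)
```

With an identity covariance matrix the Mahalanobis distance reduces to the ordinary Euclidean distance, which is a handy sanity check.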

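There is also the Minkowski distance, defined as follows:
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$

When q is 1 it is the Manhattan distance; when q is 2 it is the Euclidean distance.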
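
A quick numerical check of the two special cases, assuming the usual definition of the Minkowski distance, (Σ|x_i − y_i|^q)^(1/q). The function name minkowski and the sample points are my own choices.

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float((np.abs(x - y) ** q).sum() ** (1.0 / q))

point_a = [1, 2]
point_b = [4, 6]   # per-axis differences are 3 and 4

d_q1 = minkowski(point_a, point_b, 1)      # Manhattan: 3 + 4
d_q2 = minkowski(point_a, point_b, 2)      # Euclidean: sqrt(9 + 16)
d_q50 = minkowski(point_a, point_b, 50)    # large q approaches the largest difference
print(d_q1, d_q2, d_q50)
```

As q grows, the largest per-axis difference dominates the sum, so the value approaches the Chebyshev distance from the list above.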

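That was a lot of definitions at once, and they may not all have sunk in, but there is no need to worry: distance is a core concept in machine learning and will come up again and again in later study. One reason for introducing these distances here is that, if k-nearest neighbors does not perform well, you can try a different distance definition to improve it.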
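
In sklearn, trying another distance is a matter of changing the metric argument of KNeighborsClassifier. The sketch below compares three metrics on sklearn's built-in wine dataset using 5-fold cross-validation. The choice of metrics, k = 5, the scaling step and the cross-validation setup are all my own assumptions, and which metric comes out best depends on the data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

metric_scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    # Scale the features first so that no single feature dominates the distance
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    # Mean accuracy over 5 cross-validation folds
    metric_scores[metric] = cross_val_score(model, X, y, cv=5).mean()

for metric, accuracy in metric_scores.items():
    print(f"{metric}: {accuracy:.3f}")
```

Treat the printed numbers as a comparison on this one dataset, not as a general ranking of the metrics.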

Implementation with sklearn

About the dataset

sklearn ships with the dataset used below, the wine dataset, which can be loaded with the following two lines:

from sklearn.datasets import load_wine
wine_dataset = load_wine()

This is what the wine dataset looks like when printed:

{
    
    'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 178 (50 in each of three classes)\n    :Number of Attributes: 13 numeric, predictive attributes and the class\n    :Attribute Information:\n \t\t- Alcohol\n \t\t- Malic acid\n \t\t- Ash\n\t\t- Alcalinity of ash  \n \t\t- Magnesium\n\t\t- Total phenols\n \t\t- Flavanoids\n \t\t- Nonflavanoid phenols\n \t\t- Proanthocyanins\n\t\t- Color intensity\n \t\t- Hue\n \t\t- OD280/OD315 of diluted wines\n \t\t- Proline\n\n    - class:\n            - class_0\n            - class_1\n            - class_2\n\t\t\n    :Summary Statistics:\n    \n    ============================= ==== ===== ======= =====\n                                   Min   Max   Mean     SD\n    ============================= ==== ===== ======= =====\n    Alcohol:                      11.0  14.8    13.0   0.8\n    Malic Acid:                   0.74  5.80    2.34  1.12\n    Ash:                          1.36  3.23    2.36  0.27\n    Alcalinity of Ash:            10.6  30.0    19.5   3.3\n    Magnesium:                    70.0 162.0    99.7  14.3\n    Total Phenols:                0.98  3.88    2.29  0.63\n    Flavanoids:                   0.34  5.08    2.03  1.00\n    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12\n    Proanthocyanins:              0.41  3.58    1.59  0.57\n    Colour Intensity:              1.3  13.0     5.1   2.3\n    Hue:                          0.48  1.71    0.96  0.23\n    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71\n    Proline:                       278  1680     746   315\n    ============================= ==== ===== ======= =====\n\n    :Missing Attribute Values: None\n    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n    :Creator: R.A. 
Fisher\n    :Donor: Michael Marshall (MARSHALL%[email protected])\n    :Date: July, 1988\n\nThis is a copy of UCI ML Wine recognition datasets.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n\nThe data is the results of a chemical analysis of wines grown in the same\nregion in Italy by three different cultivators. There are thirteen different\nmeasurements taken for different constituents found in the three types of\nwine.\n\nOriginal Owners: \n\nForina, M. et al, PARVUS - \nAn Extendible Package for Data Exploration, Classification and Correlation. \nInstitute of Pharmaceutical and Food Analysis and Technologies,\nVia Brigata Salerno, 16147 Genoa, Italy.\n\nCitation:\n\nLichman, M. (2013). UCI Machine Learning Repository\n[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,\nSchool of Information and Computer Science. \n\n.. topic:: References\n\n  (1) S. Aeberhard, D. Coomans and O. de Vel, \n  Comparison of Classifiers in High Dimensional Settings, \n  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Technometrics). \n\n  The data was used with many others for comparing various \n  classifiers. The classes are separable, though only RDA \n  has achieved 100% correct classification. \n  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) \n  (All results using the leave-one-out technique) \n\n  (2) S. Aeberhard, D. Coomans and O. de Vel, \n  "THE CLASSIFICATION PERFORMANCE OF RDA" \n  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Journal of Chemometrics).\n',
 'feature_names': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']}

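The wines in the dataset have thirteen features and fall into three classes. Next we use this dataset to train a model and make predictions.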
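
Rather than reading these numbers off the long printout, they can be checked directly. This is a small sketch using only attributes that appear in the printed dictionary (data, target, target_names); the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_wine

wine_dataset = load_wine()

# data has one row per wine and one column per feature
n_samples, n_features = wine_dataset.data.shape
# target holds the class index (0, 1 or 2) of each wine
class_counts = np.bincount(wine_dataset.target)

print("samples:", n_samples, "features:", n_features)
for name, count in zip(wine_dataset.target_names, class_counts):
    print(name, count)
```

The per-class counts match the "Class Distribution" line in the DESCR text shown in the printout.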

Standardization

Let's first look at the per-feature mean and standard deviation of the wine dataset:

from sklearn.datasets import load_wine
wine_dataset = load_wine()
print(wine_dataset.data.mean(0))
print(wine_dataset.data.std(0))

The printed result is:

[1.30006180e+01 2.33634831e+00 2.36651685e+00 1.94949438e+01 9.97415730e+01 2.29511236e+00 2.02926966e+00 3.61853933e-01 1.59089888e+00 5.05808988e+00 9.57449438e-01 2.61168539e+00 7.46893258e+02]
[8.09542915e-01 1.11400363e+00 2.73572294e-01 3.33016976e+00 1.42423077e+01 6.24090564e-01 9.96048950e-01 1.24103260e-01 5.70748849e-01 2.31176466e+00 2.27928607e-01 7.07993265e-01 3.14021657e+02]

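The output shows that some features have a much larger mean and standard deviation than others, the last feature for example. If we ran kNN directly on such data, it would effectively treat the last feature as the most important one: if two samples had values of 1 and 100 for the last feature, the distance between them could be determined almost entirely by that feature. This can easily hurt the accuracy of kNN. To address this we can standardize the data.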
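
One way to put a number on this is to look at each feature's share of the total variance. For two samples drawn independently, the expected squared Euclidean distance is twice the sum of the per-feature variances, so the variance share is a rough estimate of how much each feature contributes to the distance on average. This is my own illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_wine

wine_dataset = load_wine()

# Per-feature variance and its share of the total variance
variances = wine_dataset.data.var(axis=0)
variance_share = variances / variances.sum()

# Feature that contributes most to squared Euclidean distances on average
top_index = int(np.argmax(variance_share))
top_name = wine_dataset.feature_names[top_index]
top_share = float(variance_share[top_index])

print(top_name, round(top_share, 4))
```

On the raw data the last feature, proline, accounts for almost all of the variance, so unscaled Euclidean distances are driven mostly by that one column.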

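There are many ways to standardize, and the most commonly used is StandardScaler. StandardScaler standardizes features by removing the mean and scaling to unit variance, so that each standardized feature has mean 0 and standard deviation 1.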

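Let z be the standardized feature, x the feature before standardization, μ the mean of the feature and s its standard deviation. StandardScaler can then be written as
$z = (x - \mu) / s$
The corresponding sklearn code is: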

from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
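# Instantiate a StandardScaler object
scaler = StandardScaler()
# Standardize data using its own mean and standard deviation, and store the result in after_scaler
after_scaler = scaler.fit_transform(data)
# Reuse the same fitted StandardScaler to standardize new data
after_scaler2 = scaler.transform([[2, 2]])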
print(after_scaler)
print(after_scaler2)

Printed result:

[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
[[3. 3.]]

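The output shows that after the transformation the data has been rescaled to have mean 0 and standard deviation 1. The new point [2, 2] is transformed with the same mean (0.5) and standard deviation (0.5) that were learned from data, which is why it maps to 3.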
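
To check that StandardScaler really computes z = (x − μ) / s, we can do the same calculation by hand with NumPy and compare. The variable names are my own. One detail worth knowing is that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default for std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

# Manual standardization: per-column mean and population standard deviation
mu = data.mean(axis=0)
s = data.std(axis=0)
manual = (data - mu) / s

# A new point is standardized with the mean and std learned from data
manual_new = (np.array([[2.0, 2.0]]) - mu) / s

# The same thing through sklearn
scaler = StandardScaler().fit(data)
sklearn_result = scaler.transform(data)
sklearn_new = scaler.transform([[2, 2]])

print(manual)
print(manual_new)
print(np.allclose(manual, sklearn_result), np.allclose(manual_new, sklearn_new))
```

The fitted mean and standard deviation are also available on the scaler as scaler.mean_ and scaler.scale_.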

Code implementation

Without further ado, here is the code:

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def classification(train_feature, train_label, test_feature):
    '''
    Classify the wines in test_feature
    :param train_feature: training set features, of type ndarray
    :param train_label: training set labels, of type ndarray
    :param test_feature: test set features, of type ndarray
    :return: predicted classes for the test set
    '''
    # Create the model
    clf = KNeighborsClassifier()
    # Fit the model on the training data
    clf.fit(train_feature, train_label)
    # Return the predictions
    return clf.predict(test_feature)

def score(predict_labels, real_labels):
    '''
    Score the predictions; only test-set accuracy is considered
    '''
    num = 0.
    length = len(predict_labels)
    for i in range(length):
        if predict_labels[i] == real_labels[i]:
            num = num + 1
    print("Prediction accuracy:", num / length)


# Load the wine dataset
wine_dataset = load_wine()

# Split the dataset; X_train, X_test, y_train and y_test are the
# training features, test features, training labels and test labels
X_train, X_test, y_train, y_test = train_test_split(
    wine_dataset['data'], wine_dataset['target'], test_size=0.3)

# Result of training and predicting directly on unstandardized data
print("Model trained without standardization")
predict1 = classification(X_train, y_train, X_test)
score(predict1, y_test)

print("\n")

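# Result of predicting after standardizing the data

# Create the scaler
scaler = StandardScaler()

# Standardize: fit on the training set only, then apply the same
# mean and standard deviation to the test set
train_data = scaler.fit_transform(X_train)
test_data = scaler.transform(X_test)
print("Model trained after standardization")
predict2 = classification(train_data, y_train, test_data)
score(predict2, y_test)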

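The code above wraps training and prediction in a single function, which makes it easy to compare the models obtained from standardized and unstandardized data.
The final result:
[Figure: screenshot of the program output]
As you can see, the model without standardization reaches a prediction accuracy of only a little over 0.7, while the standardized model reaches 0.98. Because train_test_split shuffles the data randomly, the exact numbers vary from run to run.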
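
A single random split can be lucky or unlucky, so a steadier way to make the same comparison is cross-validation. The sketch below is my own variation on the experiment, not the code that produced the numbers above. It puts StandardScaler and KNeighborsClassifier into one pipeline, so the scaler is fitted on the training folds only, and compares it with plain kNN over 5 folds.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# kNN on the raw features
raw_model = KNeighborsClassifier()
# kNN on standardized features; the pipeline refits the scaler inside each fold
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_mean = cross_val_score(raw_model, X, y, cv=5).mean()
scaled_mean = cross_val_score(scaled_model, X, y, cv=5).mean()

print(f"without standardization: {raw_mean:.3f}")
print(f"with standardization:    {scaled_mean:.3f}")
```

The pipeline also removes the risk of accidentally fitting the scaler on test data, since fit and transform are handled in the right order automatically.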

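This post draws on Zhou Zhihua's Machine Learning (《机器学习》), the EduCoder platform's machine learning training project (机器学习修炼指南), and other sources.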