KNN (K-Nearest Neighbors)

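As the name suggests, k-nearest neighbors finds the k training points closest to a test point and uses the labels of those k points to decide the test point's label.
As shown in the figure below, there are 10 sample points. To classify green point 1 with k = 3, the algorithm finds the three points nearest to it, namely points 2, 3, and 4. Point 2 is blue while points 3 and 4 are red; since red is in the majority, point 1 is assigned to the red class.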

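The distance between sample points is usually measured with the Euclidean distance. For two features, the formula is: $\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$ For the general multi-dimensional case, the distance between points $p$ and $q$ is: $\sqrt{\sum_{i=1}^{n}(q_i-p_i)^2}$ where n is the number of features of each sample. Note that the differences are squared, not the individual coordinates.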
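
As a quick check of the Euclidean distance formula, the sketch below evaluates it for two made-up 2-D points that form a 3-4-5 right triangle, then applies the same expression to two hypothetical four-feature samples and compares the result with `np.linalg.norm`. All point values are illustrative assumptions, not data from the article.

```python
import numpy as np

# Two hypothetical 2-D points forming a 3-4-5 right triangle
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Take the differences first, square them, sum, then take the root
dist_2d = np.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
print(dist_2d)  # 5.0

# The same expression works for any number of features
p4 = np.array([5.1, 3.5, 1.4, 0.2])
q4 = np.array([4.9, 3.0, 1.4, 0.2])
dist_nd = np.sqrt(np.sum((q4 - p4) ** 2))

# np.linalg.norm of the difference vector gives the same value
print(np.isclose(dist_nd, np.linalg.norm(q4 - p4)))  # True
```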

KNN implementation:

from collections import Counter
import numpy as np


def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))


class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # Call _predict on every test sample to get one label per sample
        predicted_labels = [self._predict(x) for x in X]
        return np.array(predicted_labels)

    def _predict(self, x):
        # Compute the distance from the test point to every training point
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Get the indices and labels of the k nearest training samples
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Return the label that occurs most often among them
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]
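
The per-sample loop in `_predict` can also be written with NumPy broadcasting. The following is a minimal sketch of that idea, not the article's implementation: the function name `knn_predict` and the toy data are assumptions made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Vectorized k-nearest-neighbor prediction (illustrative sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise differences via broadcasting: (n_test, n_train, n_features)
    diff = X_test[:, None, :] - X_train[None, :, :]
    # Euclidean distance from every test point to every training point
    distances = np.sqrt((diff ** 2).sum(axis=2))

    # Indices of the k closest training points for each test point
    k_indices = np.argsort(distances, axis=1, kind="stable")[:, :k]

    # Majority vote among the k nearest labels
    predictions = [
        Counter(y_train[row].tolist()).most_common(1)[0][0] for row in k_indices
    ]
    return np.array(predictions)


# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_predictions = knn_predict(X_toy, y_toy, [[0.5, 0.5], [5.5, 5.5]], k=3)
print(toy_predictions)  # [0 1]
```

The broadcasted array holds `n_test * n_train * n_features` values, so this trades memory for speed and suits small data sets such as the one used later in this article.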

Notes on some of the functions used:

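The argsort function: it sorts the array but does not return the sorted values. Instead it returns the indices into the original array that would put it in sorted order, so the first returned index points at the smallest element, the second at the next smallest, and so on.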

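import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
# kind="stable" keeps equal values in their original order, so the output is reproducible
sorted_indices = np.argsort(arr, kind="stable")

print("Original array:", arr)
print("Indices that sort the array:", sorted_indices)
print("Array reordered by those indices:", arr[sorted_indices])

Output:
Original array: [3 1 4 1 5 9 2 6 5 3 5]
Indices that sort the array: [ 1  3  6  0  9  2  4  8 10  7  5]
Array reordered by those indices: [1 1 2 3 3 4 5 5 5 6 9]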
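
It is easy to confuse the result of `argsort` with the rank of each element. The small sketch below, using a made-up three-element array, prints both so the difference is visible; applying `argsort` twice is one common way to obtain ranks.

```python
import numpy as np

arr = np.array([30, 10, 20])

# argsort: indices into the ORIGINAL array, listed in sorted order
order = np.argsort(arr)
print(order)       # [1 2 0] -> arr[1]=10, arr[2]=20, arr[0]=30
print(arr[order])  # [10 20 30]

# The rank of each element (its position in the sorted array) is different;
# applying argsort to the result of argsort yields it
ranks = np.argsort(order)
print(ranks)       # [2 0 1] -> 30 is last, 10 is first, 20 is second
```

In the KNN code, `np.argsort(distances)[:self.k]` relies on the first meaning: it yields the positions of the k closest training samples.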

The Counter class: it counts how many times each element occurs.

from collections import Counter

words = ['apple', 'banana', 'apple', 'orange', 'banana', 'grape']
word_counts = Counter(words)

print("Word counts:", word_counts)
print("Most common word:", word_counts.most_common(1))  # the word that occurs most often

Output:
Word counts: Counter({'apple': 2, 'banana': 2, 'orange': 1, 'grape': 1})
Most common word: [('apple', 2)]
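
In the output above, 'apple' and 'banana' are tied, and `most_common(1)` returns the one that was encountered first (this ordering holds for Python 3.7 and later). The same rule decides tied votes in KNN. A minimal sketch with hypothetical neighbor labels:

```python
from collections import Counter

# Hypothetical neighbor labels, ordered from nearest to farthest (k = 4)
k_nearest_labels = ["blue", "red", "red", "blue"]

votes = Counter(k_nearest_labels)
winner = votes.most_common(1)[0][0]

print(votes.most_common())  # [('blue', 2), ('red', 2)]
print(winner)               # blue
```

Because `_predict` builds `k_nearest_labels` in order of increasing distance, a tie goes to the tied label whose first occurrence is nearest. With only two classes, choosing an odd k avoids ties altogether.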

KNN test program:

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

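# X_train has shape (120, 4): 120 samples, each with 4 features
# print(X_train.shape)
# print(X_test.shape)
# The first training sample, [5.1 2.5 3.  1.1], has four features
# print(X_train[0])
# y_train holds 120 labels, one scalar per sample
# print(y_train.shape)
# Show all training labels
# print(y_train)

# Scatter-plot the first two features, coloring each point by its label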
plt.figure()
cmap = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k', s=20)
plt.show()

# Assumes the KNN class shown earlier is saved as KNN.py in the same directory
from KNN import KNN

clf = KNN(3)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

acc = np.sum(predictions == y_test) / len(y_test)
print(acc)

Distribution of the samples over the first two features:

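The output is 1.0, meaning every sample in this test split is classified correctly, so KNN works well on this classification problem.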

Reference video for this article: KNN
