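I. Data Preprocessing Optimization
1. Feature standardization and normalization
KNN is sensitive to feature scale because features with large numeric ranges dominate the distance. Standardize or normalize the features to remove differences in units.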
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Z-score standardization (zero mean, unit variance; less distorted by extreme values than Min-Max scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
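# Min-Max normalization (suits features with a bounded range)
minmax = MinMaxScaler()
X_train_normalized = minmax.fit_transform(X_train)
X_test_normalized = minmax.transform(X_test)  # reuse the min/max learned on the training set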
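To make the scale sensitivity concrete, here is a minimal sketch on hypothetical toy data (the income and age values and the `nearest_index` helper are invented for illustration). It shows how the nearest neighbor of a query can change once both columns are standardized by hand with the training mean and standard deviation.

```python
import numpy as np

# Hypothetical toy data: column 0 is annual income, column 1 is age.
X_train = np.array([
    [30_000.0, 25.0],
    [32_000.0, 60.0],
    [90_000.0, 26.0],
])
query = np.array([[31_500.0, 27.0]])


def nearest_index(train, q):
    """Index of the training row closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(train - q, axis=1)))


# On raw values the income column dominates the distance.
raw_nn = nearest_index(X_train, query)

# Z-score both columns using statistics from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
scaled_nn = nearest_index((X_train - mean) / std, (query - mean) / std)

print(raw_nn, scaled_nn)
```

On the raw values the 60-year-old (row 1) is nearest because a 500-unit income gap outweighs a 33-year age gap. After standardization the 25-year-old (row 0) becomes the nearest neighbor.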
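II. Algorithm Structure Optimization
1. KD-Tree accelerated queries
Suited to low-dimensional data (roughly d < 20). The tree partitions the feature space so that a query only has to examine a fraction of the training points. In higher dimensions the pruning loses effectiveness and query cost approaches brute force.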
from sklearn.neighbors import KDTree, KNeighborsClassifier
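# Example: handwritten digit recognition
# Option A: query a standalone KDTree index directly
kdtree = KDTree(X_train)
dist, ind = kdtree.query(X_test, k=5)
# Option B: let the classifier build and use a KD-tree internally
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kd.fit(X_train, y_train)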
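A KD-tree is an exact index, so it should return the same neighbors as an exhaustive scan. The following sketch checks this on synthetic 3-dimensional data (the data and array sizes are assumptions chosen for illustration) by comparing `KDTree.query` against a brute-force NumPy computation.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))  # synthetic low-dimensional training data
X_query = rng.normal(size=(10, 3))   # synthetic query points

# Build the tree once, then ask for the 5 nearest neighbors of each query.
tree = KDTree(X_train, leaf_size=40)
tree_dist, tree_ind = tree.query(X_query, k=5)

# Brute-force reference: all pairwise Euclidean distances, then sort each row.
all_dist = np.sqrt(((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
brute_ind = np.argsort(all_dist, axis=1)[:, :5]
brute_dist = np.take_along_axis(all_dist, brute_ind, axis=1)

print(np.array_equal(tree_ind, brute_ind))
```

The speedup comes only from skipping regions of space that cannot contain a closer point. The results themselves are not approximate.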
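2. Streaming and incremental learning
Update the index dynamically as new data arrives, which suits real-time recommender systems:
# Note: LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this snippet only runs on older releases.
from sklearn.neighbors import LSHForest
# Approximate nearest neighbors (ANN) with incremental updates
model = LSHForest(n_estimators=10)
model.partial_fit(new_data)  # insert the new samples into the existing index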
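For readers on a recent scikit-learn, the idea of an incrementally updated approximate index can be sketched in plain NumPy. The `TinyLSHIndex` class below is a hypothetical illustration, not a library API. It hashes each vector by the signs of a few random projections and only compares a query against vectors in the same bucket, so it can miss true neighbors that fall in a different bucket.

```python
from collections import defaultdict

import numpy as np


class TinyLSHIndex:
    """Hypothetical sketch: sign-random-projection LSH with incremental inserts."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.buckets = defaultdict(list)              # hash key -> stored ids
        self.vectors = []                             # id -> vector

    def _key(self, v):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple((self.planes @ v > 0).tolist())

    def partial_fit(self, X):
        # Append new vectors without rebuilding what is already indexed.
        for v in np.asarray(X, dtype=float):
            self.buckets[self._key(v)].append(len(self.vectors))
            self.vectors.append(v)
        return self

    def query(self, q, k=1):
        # Only candidates in the query's own bucket are compared exactly.
        q = np.asarray(q, dtype=float)
        cand = self.buckets.get(self._key(q), [])
        if not cand:
            return []
        pts = np.stack([self.vectors[i] for i in cand])
        order = np.argsort(np.linalg.norm(pts - q, axis=1))[:k]
        return [cand[i] for i in order]


rng = np.random.default_rng(1)
batch1 = rng.normal(size=(100, 16))
batch2 = rng.normal(size=(50, 16))

index = TinyLSHIndex(dim=16).partial_fit(batch1)
index.partial_fit(batch2)                 # later arrivals are simply appended
hit = index.query(batch2[3], k=1)         # look up a vector from the second batch
```

Production systems usually combine several hash tables or probe neighboring buckets to raise recall. This sketch uses a single table to keep the update path easy to follow.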
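III. Distance Computation Optimization
1. Vectorization and GPU parallelism
Use NumPy to vectorize the Euclidean distance computation. The same array expression can generally be moved to a GPU with a NumPy-compatible array library such as CuPy.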
import numpy as np
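# Vectorized distances from every test sample to every training sample
# (broadcasting builds an (n_test, n_train, d) temporary, so memory use grows quickly)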
distances = np.sqrt(((X_test[:, np.newaxis] - X_train) ** 2).sum(axis=2))
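When the broadcast temporary is too large, a common alternative is to expand the squared distance as ||a||² + ||b||² − 2a·b, which only needs an (n_test, n_train) matrix and one matrix product. The sketch below uses a hypothetical helper name, `pairwise_euclidean`, and synthetic data to compare it with the broadcasting version.

```python
import numpy as np


def pairwise_euclidean(A, B):
    """Euclidean distances between rows of A and rows of B without a 3-D temporary."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    np.maximum(sq, 0.0, out=sq)  # clip tiny negatives caused by rounding
    return np.sqrt(sq)


rng = np.random.default_rng(0)
X_test = rng.normal(size=(20, 5))
X_train = rng.normal(size=(30, 5))

fast = pairwise_euclidean(X_test, X_train)
naive = np.sqrt(((X_test[:, np.newaxis] - X_train) ** 2).sum(axis=2))

print(np.allclose(fast, naive))
```

The expanded form can lose some precision for points that are nearly identical, which is why the result is clipped at zero before taking the square root.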
2. Dynamic distance metric selection
Switch the distance metric automatically based on the data distribution:
from sklearn.metrics.pairwise import pairwise_distances
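# Choose a metric (Euclidean / Manhattan / cosine) from the feature correlation.
# feature_correlation is assumed to be precomputed, e.g. the mean absolute
# off-diagonal entry of np.corrcoef(X_train, rowvar=False); the 0.8 threshold is a heuristic.
if feature_correlation > 0.8:
    metric = 'euclidean'
else:
    metric = 'manhattan'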
distances = pairwise_distances(X_test, X_train, metric=metric)
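A fixed threshold is only a heuristic, so one alternative is to let cross-validation pick the metric. The sketch below is an assumption-laden illustration: it uses the Iris dataset, k = 5, and three candidate metrics, none of which come from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidates = ["euclidean", "manhattan", "cosine"]
cv_scores = {}
for metric in candidates:
    # Brute-force search is requested explicitly because tree indexes do not support cosine.
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric, algorithm="brute")
    cv_scores[metric] = cross_val_score(knn, X, y, cv=5).mean()

# Keep the metric with the highest mean cross-validated accuracy.
best_metric = max(cv_scores, key=cv_scores.get)
print(best_metric, cv_scores)
```

This costs one cross-validation run per candidate, so it is practical for a handful of metrics but not for a large search space.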
IV. Dimensionality and Sample Optimization
1. PCA dimensionality reduction
Reduce the computational cost of high-dimensional data:
from sklearn.decomposition import PCA
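# Keep enough components to retain 95% of the variance
pca = PCA(n_components=0.95)
# Fit on the training split only, then apply the same projection to the test split
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)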
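To show what a 95% variance target means, the following NumPy sketch computes the explained-variance ratios from an SVD and picks the smallest number of components whose cumulative ratio reaches 0.95. The synthetic data (3 latent factors mixed into 10 features plus small noise) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))              # 3 underlying factors
mixing = rng.normal(size=(3, 10))               # spread them over 10 features
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA via SVD of the centered data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Each squared singular value is proportional to the variance along that component.
explained_ratio = S ** 2 / np.sum(S ** 2)
cumulative = np.cumsum(explained_ratio)

# Smallest component count whose cumulative ratio reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
X_reduced = X_centered @ Vt[:n_components].T

print(n_components, X_reduced.shape)
```

Because the data was generated from 3 factors, at most 3 components are needed here, and every later distance computation runs on that smaller matrix.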
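2. Sample pruning and weighting
# Note: NeighborhoodComponentsAnalysis does not remove samples. It learns a linear
# transformation (metric learning) under which KNN tends to classify better.
# Actual removal of redundant samples is done by Condensed Nearest Neighbors, which
# scikit-learn does not ship (imbalanced-learn provides CondensedNearestNeighbour).
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
nca = NeighborhoodComponentsAnalysis()
X_transformed = nca.fit_transform(X, y)
# Distance-weighted voting (closer neighbors get larger weights)
knn_weighted = KNeighborsClassifier(weights='distance')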
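Condensed Nearest Neighbors itself is short enough to sketch directly. The version below follows Hart's procedure as commonly described: start with one sample, then repeatedly add any sample that the current subset misclassifies with 1-NN, until a full pass adds nothing. The function name `condensed_nn` and the two-blob data are assumptions for illustration.

```python
import numpy as np


def condensed_nn(X, y, seed=0):
    """Return indices of a subset that classifies every training point correctly with 1-NN."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    in_store = np.zeros(len(X), dtype=bool)
    in_store[order[0]] = True                 # seed the store with one sample
    changed = True
    while changed:                            # repeat until a pass adds nothing
        changed = False
        for i in order:
            if in_store[i]:
                continue
            store_idx = np.flatnonzero(in_store)
            d = np.linalg.norm(X[store_idx] - X[i], axis=1)
            if y[store_idx[np.argmin(d)]] != y[i]:
                in_store[i] = True            # misclassified, so it must be kept
                changed = True
    return np.flatnonzero(in_store)


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)), rng.normal(5.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

kept = condensed_nn(X, y)

# 1-NN predictions for all points, using only the condensed subset.
dists = np.linalg.norm(X[:, None, :] - X[kept][None, :, :], axis=2)
pred = y[kept][np.argmin(dists, axis=1)]

print(len(kept), "of", len(X), "samples kept")
```

The subset depends on the visiting order, and the method keeps noisy boundary points, so it is usually treated as a speed optimization rather than a way to improve accuracy.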
V. Parameter Tuning in Practice
1. Choosing the best K by cross-validation
from sklearn.model_selection import cross_val_score
k_values = range(1, 30)
best_k, best_score = None, 0.0
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_score = scores.mean()  # record the new best mean accuracy
        best_k = k
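2. Tuning approximate nearest neighbor parameters
# Build an IVF index with Faiss (for large-scale data)
import faiss
import numpy as np
# Faiss expects C-contiguous float32 arrays
X_train = np.ascontiguousarray(X_train, dtype='float32')
X_test = np.ascontiguousarray(X_test, dtype='float32')
dim = X_train.shape[1]
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 100)  # nlist = 100 cluster centroids
index.train(X_train)
index.add(X_train)
index.nprobe = 10  # clusters scanned per query: higher gives better recall but slower search
distances, indices = index.search(X_test, 5)  # returned distances are squared L2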
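To see what the cluster count and `nprobe` control, the sketch below reimplements the inverted-file idea in plain NumPy. It is a simplified illustration, not Faiss's implementation, and the names `build_ivf` and `ivf_search` are hypothetical. A small k-means assigns every vector to one cell, and a query scans only the `nprobe` cells whose centroids are closest.

```python
import numpy as np


def build_ivf(X, nlist, n_iter=10, seed=0):
    """Cluster X into nlist cells with a few k-means steps; return centroids and cell members."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(nlist):
            members = X[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists


def ivf_search(q, X, centroids, lists, k, nprobe):
    """Scan only the nprobe cells nearest to q; return squared distances and ids."""
    nearest_cells = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in nearest_cells])
    d = ((X[cand] - q) ** 2).sum(-1)
    top = np.argsort(d)[:k]
    return d[top], cand[top]


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
q = rng.normal(size=8)

centroids, lists = build_ivf(X, nlist=10)
exact_ids = np.argsort(((X - q) ** 2).sum(-1))[:5]            # exhaustive reference

_, ids_fast = ivf_search(q, X, centroids, lists, k=5, nprobe=2)   # scans 2 of 10 cells
_, ids_full = ivf_search(q, X, centroids, lists, k=5, nprobe=10)  # scans every cell

recall = len(set(ids_fast) & set(exact_ids)) / 5
print("recall with nprobe=2:", recall)
```

Setting `nprobe` equal to the number of cells reproduces the exact result, while smaller values trade recall for speed. That trade-off is the main thing to tune on an IVF index.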
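Combining data-structure optimization (KD-Tree/ANN), computation speedups (vectorization/GPU), dimensionality reduction (PCA), and parameter tuning (K and distance metric) can substantially improve KNN speed and accuracy. Pick three or four complementary strategies that match the characteristics of your data and compare them experimentally.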