Preface

The previous post gave a brief introduction to decision trees; this one covers a closely related method: k-nearest neighbors (k-NN).

机器学习·决策树-CSDN博客
1. Algorithm Principles Compared

| Property | Decision tree | k-nearest neighbors (k-NN) |
|---|---|---|
| Core idea | Recursively partitions the data by splitting on features, building a tree structure | Uses a distance metric and predicts from the votes of the k nearest samples |
| Training | Explicitly builds a model up front (eager learning) | Lazy learning: no explicit training; the work happens at prediction time |
| Key parameters | max_depth, min_samples_leaf | n_neighbors, metric (distance measure) |
| Split criterion / distance measure | Information gain, Gini index | Euclidean distance, Manhattan distance, cosine similarity, etc. |
| Output | Classification tree (class labels), regression tree (continuous values) | Classification (majority vote), regression (mean/median) |
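To make the eager-versus-lazy contrast concrete, here is a minimal sketch; the toy data and parameter values are illustrative only, not taken from the experiments later in this post:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 1], [8, 9], [9, 8]])
y = np.array([0, 0, 1, 1])

# Decision tree: the tree is built eagerly when fit() is called
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1).fit(X, y)

# k-NN: fit() only stores the data; distances are computed at predict() time
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X, y)

print(tree.predict([[2, 2]]), knn.predict([[2, 2]]))  # both should print [0]
```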
2. Concepts

Decision tree

- Entropy: \( S = -\sum_{i=1}^N p_i \log_2 p_i \)
  A split is chosen to maximize the information gain, i.e. the reduction in entropy, and hence in uncertainty.
- Gini index: \( G = 1 - \sum_{k} (p_k)^2 \)
  Measures the impurity of the data; the smaller the value, the better the split. (Both measures are computed in the sketch after this list.)
- Pruning strategies:
  - Pre-pruning: limit the tree depth (max_depth) and the minimum number of samples per leaf (min_samples_leaf).
  - Post-pruning: build the full tree, then merge redundant nodes.
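A minimal sketch of the two impurity measures and of information gain, using plain NumPy; the label array is a made-up example:

```python
import numpy as np

def entropy(labels):
    """S = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """G = 1 - sum(p_k^2); 0 for a pure node, larger for mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])        # a 3-vs-5 node
print(entropy(y))                              # ~0.954 bits
print(gini(y))                                 # ~0.469
left, right = y[:3], y[3:]                     # a perfect split
print(information_gain(y, left, right))        # ~0.954: all uncertainty removed
```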
k-NN

- Distance metrics (see the sketch after this list):
  - Euclidean distance (default): \( d(x,y) = \sqrt{\sum (x_i - y_i)^2} \)
  - Manhattan distance: \( d(x,y) = \sum |x_i - y_i| \)
  - Cosine similarity: measures how closely two vectors point in the same direction.
- Parameter tuning:
  - n_neighbors: the number of neighbors; small values tend to overfit, large values tend to underfit.
  - weights: how neighbors are weighted (uniform for equal weights, distance for inverse-distance weighting).
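A short sketch of the three distance measures, and of how metric and weights plug into KNeighborsClassifier; the vectors and parameter values are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt(1 + 4 + 9) ~ 3.742
manhattan = np.sum(np.abs(x - y))           # 1 + 2 + 3 = 6
cosine_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # 1.0: same direction

# In sklearn, the metric and the weighting scheme are constructor parameters:
knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan', weights='distance')
```

Note that cosine similarity is 1.0 here even though the Euclidean distance is large: it ignores magnitude and compares direction only.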
3. Cross-Validation and Tuning

Cross-validation methods (a sketch of both follows this list):

- k-fold cross-validation: split the data into k subsets; in turn, train on k-1 of them and validate on the remaining one, then average the scores.
- Hold-out: split the data into a training set and a validation set at a fixed ratio (e.g. 70%/30%).
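A minimal sketch of both schemes with scikit-learn; the iris dataset and the max_depth value are stand-ins, not part of the experiments below:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=17)

# k-fold CV: five train/validate rounds, then average the scores
print(np.mean(cross_val_score(clf, X, y, cv=5)))

# Hold-out: a single 70%/30% split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=17)
print(clf.fit(X_tr, y_tr).score(X_val, y_val))
```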
Parameter tuning with GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Decision tree parameter grid
tree_params = {'max_depth': [3, 5, 7], 'max_features': [10, 20, 30]}
tree_grid = GridSearchCV(DecisionTreeClassifier(), tree_params, cv=5)
tree_grid.fit(X_train, y_train)

# k-NN parameter grid (features must be standardized first)
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
knn_params = {'knn__n_neighbors': range(1, 10)}
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=5)
knn_grid.fit(X_train, y_train)
```
4. Practical Applications and Performance Comparison

Customer churn prediction

- Dataset: telecom customer churn data (features include the international plan, voice mail plan, etc.).
- Results:

| Model | Hold-out accuracy | Best CV accuracy |
|---|---|---|
| Decision tree (tuned) | 94.6% | 94.0% |
| k-NN (tuned) | 89.0% | 88.5% |
| Random forest | 95.3% | 93.5% |

MNIST handwritten digit recognition

- Dataset: 8x8-pixel images of handwritten digits.
- Results:

| Model | Hold-out accuracy | Best CV accuracy |
|---|---|---|
| Decision tree (tuned) | 84.4% | 66.6% |
| k-NN (tuned) | 98.7% | 97.6% |
| Random forest | 93.4% | - |
5. Pros and Cons

| Algorithm | Pros | Cons |
|---|---|---|
| Decision tree | 1. Highly interpretable; the rules can be visualized 2. Handles categorical and numerical features 3. Fast to train | 1. Sensitive to noise, prone to overfitting 2. Axis-parallel boundaries limit flexibility 3. Cannot extrapolate |
| k-NN | 1. Simple to implement 2. No explicit training step 3. Adapts to complex boundaries (small k) | 1. Slow prediction on large datasets 2. Degrades in high dimensions (curse of dimensionality) 3. Sensitive to the choice of distance metric |
6. Application Scenarios

Choose a decision tree when:

- You need a highly interpretable model (e.g. credit risk, medical diagnosis).
- The features follow a clear hierarchical or threshold logic (e.g. age brackets, cutoff values).
- You need real-time predictions (fast inference).

Choose k-NN when:

- The data is low-dimensional but has a complex distribution (e.g. small-scale image classification).
- You need a quick prototype or baseline model.
- The features are on a consistent scale (standardize them first; see the sketch after this list).
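A small sketch of why that last point matters. The synthetic data here is my own: one informative feature on a scale of about 1, and one pure-noise feature on a scale of about 1000. Unscaled k-NN is dominated by the noise feature, while a StandardScaler pipeline recovers the signal:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(17)
X = np.c_[rng.normal(0, 1, 200),       # informative feature, scale ~1
          rng.normal(0, 1000, 200)]    # noise feature, scale ~1000
y = (X[:, 0] > 0).astype(int)          # the label depends only on feature 1

raw = KNeighborsClassifier(n_neighbors=5)
scaled = Pipeline([('scaler', StandardScaler()),
                   ('knn', KNeighborsClassifier(n_neighbors=5))])

print(np.mean(cross_val_score(raw, X, y, cv=5)))     # typically near chance
print(np.mean(cross_val_score(scaled, X, y, cv=5)))  # typically far higher
```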
7. Conclusions

Model selection priority:

- Try simple models first (decision tree, k-NN), then move on to more complex ones (random forest, neural networks).
- Decision trees excel on structured data; k-NN suits small-scale, less structured data.

Tuning essentials:

- Decision tree: control the depth (max_depth) and the minimum number of samples per leaf (min_samples_leaf).
- k-NN: choose an appropriate number of neighbors (n_neighbors) and distance metric (metric).

Why cross-validation matters:

- It guards against overfitting and ensures the model generalizes; it is indispensable during parameter tuning.
8. Complete Code

1. Customer churn prediction
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
from sklearn.tree import export_graphviz
import pydotplus
from io import StringIO
import matplotlib.pyplot as plt
from IPython.display import Image

# Customer churn prediction task
# Data preprocessing
df = pd.read_csv('https://labfile.oss.aliyuncs.com/courses/1283/telecom_churn.csv')
df['International plan'] = pd.factorize(df['International plan'])[0]
df['Voice mail plan'] = pd.factorize(df['Voice mail plan'])[0]
df['Churn'] = df['Churn'].astype('int')
states = df['State']
y = df['Churn']
df.drop(['State', 'Churn'], axis=1, inplace=True)

# Split the dataset
X_train, X_holdout, y_train, y_holdout = train_test_split(df.values, y, test_size=0.3, random_state=17)

# Train a decision tree and a k-NN model (arbitrary parameters)
tree = DecisionTreeClassifier(max_depth=5, random_state=17)
knn = KNeighborsClassifier(n_neighbors=10)
tree.fit(X_train, y_train)
knn.fit(X_train, y_train)

# Evaluate the models
tree_pred = tree.predict(X_holdout)
print("Decision tree accuracy (arbitrary parameters):", accuracy_score(y_holdout, tree_pred))
knn_pred = knn.predict(X_holdout)
print("k-NN accuracy (arbitrary parameters):", accuracy_score(y_holdout, knn_pred))

# Tune the decision tree with cross-validation
tree_params = {'max_depth': range(5, 7), 'max_features': range(16, 18)}
tree_grid = GridSearchCV(tree, tree_params, cv=5, n_jobs=-1, verbose=True)
tree_grid.fit(X_train, y_train)
print("Decision tree best parameters:", tree_grid.best_params_)
print("Decision tree best CV score:", tree_grid.best_score_)
print("Decision tree tuned accuracy:", accuracy_score(y_holdout, tree_grid.predict(X_holdout)))

# Draw the decision tree
dot_data = StringIO()
export_graphviz(tree_grid.best_estimator_, feature_names=df.columns, out_file=dot_data, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

# Tune k-NN with cross-validation
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_jobs=-1))])
knn_params = {'knn__n_neighbors': range(6, 8)}
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=5, n_jobs=-1, verbose=True)
knn_grid.fit(X_train, y_train)
print("k-NN best parameters:", knn_grid.best_params_)
print("k-NN best CV score:", knn_grid.best_score_)
print("k-NN tuned accuracy:", accuracy_score(y_holdout, knn_grid.predict(X_holdout)))

# Train a random forest
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=17)
print("Random forest CV score:", np.mean(cross_val_score(forest, X_train, y_train, cv=5)))
forest_params = {'max_depth': range(8, 10), 'max_features': range(5, 7)}
forest_grid = GridSearchCV(forest, forest_params, cv=5, n_jobs=-1, verbose=True)
forest_grid.fit(X_train, y_train)
print("Random forest best parameters:", forest_grid.best_params_)
print("Random forest best CV score:", forest_grid.best_score_)
print("Random forest accuracy:", accuracy_score(y_holdout, forest_grid.predict(X_holdout)))

# A simple classification task
# Generate the data: two classes separated by the line x1 = x2
def form_linearly_separable_data(n=500, x1_min=0, x1_max=30, x2_min=0, x2_max=30):
    data, target = [], []
    for i in range(n):
        x1 = np.random.randint(x1_min, x1_max)
        x2 = np.random.randint(x2_min, x2_max)
        if np.abs(x1 - x2) > 0.5:
            data.append([x1, x2])
            target.append(np.sign(x1 - x2))
    return np.array(data), np.array(target)

X, y = form_linearly_separable_data()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn', edgecolors='black')

# Train a decision tree and plot its decision boundary
tree = DecisionTreeClassifier(random_state=17).fit(X, y)

def get_grid(X):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
    return xx, yy

xx, yy = get_grid(X)
predicted = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, predicted, cmap='autumn')
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap='autumn', edgecolors='black', linewidth=1.5)
plt.title('Easy task. Decision tree complexifies everything')

# Visualize the decision tree
dot_data = StringIO()
export_graphviz(tree, feature_names=['x1', 'x2'], out_file=dot_data, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

# Train a k-NN model and plot its decision boundary
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
xx, yy = get_grid(X)
predicted = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, predicted, cmap='autumn')
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap='autumn', edgecolors='black', linewidth=1.5)
plt.title('Easy task, kNN. Not bad')

# MNIST handwritten digit recognition task
# Load the data
data = load_digits()
X, y = data.data, data.target

# Plot some of the handwritten digits
f, axes = plt.subplots(1, 4, sharey=True, figsize=(16, 6))
for i in range(4):
    axes[i].imshow(X[i, :].reshape([8, 8]), cmap='Greys')

# Split the dataset
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=17)

# Train a decision tree and a k-NN model (arbitrary parameters)
tree = DecisionTreeClassifier(max_depth=5, random_state=17)
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=10))])
tree.fit(X_train, y_train)
knn_pipe.fit(X_train, y_train)

# Predict on the hold-out set and evaluate
tree_pred = tree.predict(X_holdout)
knn_pred = knn_pipe.predict(X_holdout)
print("MNIST decision tree accuracy (arbitrary parameters):", accuracy_score(y_holdout, tree_pred))
print("MNIST k-NN accuracy (arbitrary parameters):", accuracy_score(y_holdout, knn_pred))

# Tune the decision tree with cross-validation
tree_params = {'max_depth': [10, 20, 30], 'max_features': [30, 50, 64]}
tree_grid = GridSearchCV(tree, tree_params, cv=5, n_jobs=-1, verbose=True)
tree_grid.fit(X_train, y_train)
print("MNIST decision tree best parameters:", tree_grid.best_params_)
print("MNIST decision tree best CV score:", tree_grid.best_score_)

# Evaluate k-NN with cross-validation
print("MNIST k-NN CV score:", np.mean(cross_val_score(KNeighborsClassifier(n_neighbors=1), X_train, y_train, cv=5)))

# Train a random forest
print("MNIST random forest CV score:", np.mean(cross_val_score(RandomForestClassifier(random_state=17), X_train, y_train, cv=5)))

# A complex case for the nearest neighbors method
# Generate the data: only the first feature is informative, the rest are noise
def form_noisy_data(n_obj=1000, n_feat=100, random_seed=17):
    np.random.seed(random_seed)
    y = np.random.choice([-1, 1], size=n_obj)
    x1 = 0.3 * y
    x_other = np.random.random(size=[n_obj, n_feat - 1])
    return np.hstack([x1.reshape([n_obj, 1]), x_other]), y

X, y = form_noisy_data()

# Split the dataset
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=17)

# Train k-NN for a range of k and plot the validation curves
cv_scores, holdout_scores = [], []
n_neighb = [1, 2, 3, 5] + list(range(50, 550, 50))
for k in n_neighb:
    knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k))])
    cv_scores.append(np.mean(cross_val_score(knn_pipe, X_train, y_train, cv=5)))
    knn_pipe.fit(X_train, y_train)
    holdout_scores.append(accuracy_score(y_holdout, knn_pipe.predict(X_holdout)))

plt.plot(n_neighb, cv_scores, label='CV')
plt.plot(n_neighb, holdout_scores, label='holdout')
plt.title('Easy task. kNN fails')
plt.legend()

# Train and evaluate a decision tree (a depth-1 stump suffices here)
tree = DecisionTreeClassifier(random_state=17, max_depth=1)
tree_cv_score = np.mean(cross_val_score(tree, X_train, y_train, cv=5))
tree.fit(X_train, y_train)
tree_holdout_score = accuracy_score(y_holdout, tree.predict(X_holdout))
print('Decision tree. CV: {}, holdout: {}'.format(tree_cv_score, tree_holdout_score))
```
2. MNIST handwritten digit recognition
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the MNIST handwritten digit dataset
data = load_digits()
X, y = data.data, data.target

# Inspect the first sample as an 8x8 matrix
print(X[0, :].reshape([8, 8]))

# Plot a few of the handwritten digits
f, axes = plt.subplots(1, 4, sharey=True, figsize=(16, 6))
for i in range(4):
    axes[i].imshow(X[i, :].reshape([8, 8]), cmap='Greys')
plt.show()

# Split the dataset
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=17)

# Train a decision tree and k-NN with arbitrary parameters
tree = DecisionTreeClassifier(max_depth=5, random_state=17)
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=10))])
tree.fit(X_train, y_train)
knn_pipe.fit(X_train, y_train)

# Predict on the hold-out set and evaluate
tree_pred = tree.predict(X_holdout)
knn_pred = knn_pipe.predict(X_holdout)
tree_accuracy = accuracy_score(y_holdout, tree_pred)
knn_accuracy = accuracy_score(y_holdout, knn_pred)
print(f"Decision tree (arbitrary parameters) hold-out accuracy: {tree_accuracy}")
print(f"k-NN (arbitrary parameters) hold-out accuracy: {knn_accuracy}")

# Tune the decision tree with cross-validation
tree_params = {'max_depth': [10, 20, 30], 'max_features': [30, 50, 64]}
tree_grid = GridSearchCV(tree, tree_params, cv=5, n_jobs=-1, verbose=True)
tree_grid.fit(X_train, y_train)

# Inspect the best parameter combination found by cross-validation and its accuracy
best_tree_params = tree_grid.best_params_
best_tree_score = tree_grid.best_score_
print(f"Decision tree best parameters: {best_tree_params}")
print(f"Decision tree best CV accuracy: {best_tree_score}")

# Evaluate k-NN with cross-validation
knn_cv_score = np.mean(cross_val_score(KNeighborsClassifier(n_neighbors=1), X_train, y_train, cv=5))
print(f"Tuned k-NN CV accuracy: {knn_cv_score}")

# Train a random forest
forest_cv_score = np.mean(cross_val_score(RandomForestClassifier(random_state=17), X_train, y_train, cv=5))
print(f"Random forest CV accuracy: {forest_cv_score}")
```
Results: