Getting Started with Data Mining: Kaggle Handwritten Digit Recognition

0. Preface

This post analyzes the MNIST handwritten digit dataset from Kaggle, building a model with principal component analysis (PCA) and a support vector machine (SVM), making predictions, and writing the results to a CSV file. I am recording my approach here for reference. Corrections and suggestions are welcome.

1. Packages Used

Python version: Python 3.6
numpy, pandas: data analysis
time: timing
matplotlib: plotting
sklearn: machine learning and modeling
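
All of these come bundled with the Anaconda distribution; otherwise they can be installed with pip:

pip install numpy pandas matplotlib scikit-learn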

2. Importing the Data

import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Inspect the training set
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB
# Inspect the test set; it lacks the label column
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28000 entries, 0 to 27999
Columns: 784 entries, pixel0 to pixel783
dtypes: int64(784)
memory usage: 167.5 MB
# Check the training set for missing values; there are none
train.isnull().any().describe()
count       785
unique        1
top       False
freq        785
dtype: object
# Check the test set for missing values; there are none
test.isnull().any().describe()
count       784
unique        1
top       False
freq        784
dtype: object
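# An equivalent one-line check for both sets (both print False, per the output above):
print(train.isnull().values.any(), test.isnull().values.any())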
# Check the row and column counts of the training and test sets
print(train.shape)
print(test.shape)
(42000, 785)
(28000, 784)
# Separate the features from the label column in the training set
X = train.iloc[:,1:]
y = train.iloc[:,0]
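# (Optional check, not in the original post) confirm the ten digit classes are roughly balanced
print(y.value_counts().sort_index())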
# Visualize some digits from the training set
plt.figure(figsize = (10,5))

for num in range(0,10):
    plt.subplot(2,5,num+1)
    # Reshape the 784-long vector into a 28x28 matrix
    grid_data = X.iloc[num].values.reshape(28,28)
    # Display the image in grayscale
    plt.imshow(grid_data, interpolation = "none", cmap = "Greys")

plt.show()

[Figure: the first ten digits in the training set]

3. Feature Preprocessing

Feature preprocessing is a necessary step in data mining: without it, the model fits the data less well and classification accuracy drops.

Here we use MinMaxScaler to rescale each feature into the range 0 to 1. sklearn provides this standard preprocessing class, so we can call it directly. Note that the scaler should be fit on the training features only and then reused to transform the test set, so that no information from the test set leaks into preprocessing.

# Rescale the features; fit the scaler on the training features only,
# then apply the same transformation to the test set
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
test = scaler.transform(test)

# Split the data into training and hold-out test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 14)
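
As an aside, for classification it can help to stratify the split so that each digit keeps the same proportion in both subsets. A variant of the call above (not what the original run used):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 14, stratify = y)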

4. Principal Component Analysis

The training set has a large number of features (784), so we apply principal component analysis to reduce the dimensionality to a few dozen. This speeds up the program and can also improve predictive accuracy. We first need to find the setting of n_components that gives the highest model accuracy, then build the final model. Note that when n_components is a float between 0 and 1, sklearn's PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction; that is what the values from 0.70 to 0.89 below mean.

# Use PCA to reduce dimensionality
all_scores = []
# Generate the list of n_components values to try
n_components = np.linspace(0.7,0.9,num=20, endpoint=False)

def get_accuracy_score(n, X_train, X_test, y_train, y_test):
    '''Compute the model's prediction accuracy when n_components is n'''
    t0 = time()
    pca = PCA(n_components = n)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    # Use a support vector machine classifier
    clf = svm.SVC()
    clf.fit(X_train_pca, y_train)
    # Compute the accuracy on the hold-out set
    accuracy = clf.score(X_test_pca, y_test)
    t1 = time()
    print('n_components: {:.2f}, accuracy: {:.4f}, time elapsed: {:.2f}s'.format(n, accuracy, t1-t0))
    return accuracy

for n in n_components:
    score = get_accuracy_score(n,X_train, X_test, y_train, y_test)
    all_scores.append(score)  

n_components: 0.70, accuracy: 0.9750, time elapsed: 31.42s
n_components: 0.71, accuracy: 0.9757, time elapsed: 34.00s
n_components: 0.72, accuracy: 0.9769, time elapsed: 27.17s
n_components: 0.73, accuracy: 0.9760, time elapsed: 29.33s
n_components: 0.74, accuracy: 0.9776, time elapsed: 27.73s
n_components: 0.75, accuracy: 0.9781, time elapsed: 27.35s
n_components: 0.76, accuracy: 0.9781, time elapsed: 29.31s
n_components: 0.77, accuracy: 0.9781, time elapsed: 30.05s
n_components: 0.78, accuracy: 0.9783, time elapsed: 29.16s
n_components: 0.79, accuracy: 0.9776, time elapsed: 32.74s
n_components: 0.80, accuracy: 0.9779, time elapsed: 33.51s
n_components: 0.81, accuracy: 0.9771, time elapsed: 33.29s
n_components: 0.82, accuracy: 0.9774, time elapsed: 34.05s
n_components: 0.83, accuracy: 0.9769, time elapsed: 36.23s
n_components: 0.84, accuracy: 0.9755, time elapsed: 39.80s
n_components: 0.85, accuracy: 0.9748, time elapsed: 41.42s
n_components: 0.86, accuracy: 0.9748, time elapsed: 39.79s
n_components: 0.87, accuracy: 0.9729, time elapsed: 51.27s
n_components: 0.88, accuracy: 0.9721, time elapsed: 50.15s
n_components: 0.89, accuracy: 0.9717, time elapsed: 46.11s
# Plot accuracy against n_components; accuracy peaks at n_components = 0.78
plt.plot(n_components, all_scores, '-o')
plt.xlabel('n_components')
plt.ylabel('accuracy')
plt.show()
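
If you prefer to pick the best value programmatically rather than reading it off the plot, a minimal sketch using the n_components and all_scores arrays built above:

best_n = n_components[np.argmax(all_scores)]
print('Best n_components: {:.2f}'.format(best_n))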

[Figure: accuracy vs. n_components]

# Find the misclassified samples
pca = PCA(n_components = 0.78)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

clf = svm.SVC()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)

errors = (y_pred != y_test)
y_pred_errors = y_pred[errors]
y_test_errors = y_test[errors].values
X_test_errors = X_test[errors]  
# Inspect the errors
print(y_pred_errors[:5])
print(y_test_errors[:5])
print(X_test_errors[:5])
[5 0 8 6 9]
[8 9 6 8 7]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
# Visualize the misclassified digits
n = 0
nrows = 2
ncols = 5

fig, ax = plt.subplots(nrows,ncols,figsize=(10,6))

for row in range(nrows):
    for col in range(ncols):
        ax[row,col].imshow((X_test_errors[n]).reshape((28,28)), cmap = "Greys")
        ax[row,col].set_title("Predict: {}\nTrue: {}".format(y_pred_errors[n],y_test_errors[n]))
        n += 1

plt.show()

[Figure: ten misclassified digits with predicted and true labels]
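
Beyond eyeballing individual mistakes, a confusion matrix summarizes which digits are confused with which. A minimal sketch using sklearn.metrics and the y_test and y_pred computed above:

from sklearn.metrics import confusion_matrix

# Rows are true digits 0-9, columns are predicted digits
cm = confusion_matrix(y_test, y_pred)
print(cm)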

5. Modeling

Parameter tuning is another key step. We use grid search to automatically select the parameter values that give the highest model accuracy, then train and predict with them.

# Accuracy is highest when n_components is 0.78
# Apply PCA to the full training features and the test set; 39 components are kept
pca = PCA(n_components=0.78)
pca.fit(X)
print(pca.n_components_)
X = pca.transform(X)
test = pca.transform(test)
39
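# (Optional check, not in the original post) the variance actually retained by the 39 components:
print(pca.explained_variance_ratio_.sum())  # at least 0.78 by construction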
# Use an SVM for prediction, with grid search for parameter tuning

clf_svc = GridSearchCV(estimator=svm.SVC(), param_grid={'C': [1, 2, 4, 5], 'kernel': ['linear', 'rbf', 'sigmoid']}, cv=5, verbose=2)
# Train the model
clf_svc.fit(X, y)
# Show the parameters that give the highest accuracy
print(clf_svc.best_params_)
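# (Optional, not in the original output) the best cross-validated accuracy found by the search:
print(clf_svc.best_score_)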

# Predict
preds = clf_svc.predict(test)
image_id = pd.Series(range(1,len(preds)+1))
result_2 = pd.DataFrame({'ImageId': image_id,'Label':preds})
# Save to a CSV file; Kaggle expects the columns ImageId and Label
result_2.to_csv('result_svc.csv',index = False)
print('Done')
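
Before uploading, a quick sanity check of the submission file is cheap insurance (the expected shape and column names follow the Kaggle Digit Recognizer format):

check = pd.read_csv('result_svc.csv')
print(check.shape)             # expect (28000, 2)
print(check.columns.tolist())  # expect ['ImageId', 'Label']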

6. Submitting the Results

The full run takes about an hour, and the recognition accuracy is 98.4%. There is still room for improvement, so I will keep at it.
[Figure: Kaggle submission result showing 98.4% accuracy]

Source: blog.csdn.net/danspace1/article/details/80631630