Kaggle getting-started competition: handwritten digit recognition with a CNN, accuracy 0.98260
Digit Recognizer competition page: https://www.kaggle.com/c/digit-recognizer
Dataset:
Link: https://pan.baidu.com/s/13f3rM_lhNGyu2Rsqbc1AUw
Extraction code: otoy
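train.csv and test.csv are the training set and the test set, respectively.
sample_submission is the submission format provided by the organizers.
predict.csv holds this post's predictions for the test set, which scored an accuracy of 0.98260.
1. Data loading and preprocessing
Read the training and test CSV files: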
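import os
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

# Assumed data folder, matching the path used for sample_submission.csv later; adjust as needed
main_path = "DataSet"
train_file = pd.read_csv(os.path.join(main_path, "train.csv"))
test_file = pd.read_csv(os.path.join(main_path, "test.csv"))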
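The pixel values in the MNIST data lie in the range 0-255, so we normalize them here by rescaling 0-255 into the range 0-1.
P.S. Why normalize? This blog post is a useful reference.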
# Normalization
train_file_norm = train_file.iloc[:, 1:] / 255.0
test_file_norm = test_file / 255.0
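To make the rescaling concrete, here is a minimal sketch on a tiny array. The pixel values and the names raw_pixels and scaled_pixels are made up for illustration and are not part of the original code.

```python
import numpy as np

# Hypothetical 2x3 patch of raw grayscale pixels (0 = black, 255 = white).
raw_pixels = np.array([[0, 51, 102], [153, 204, 255]], dtype=np.uint8)

# Dividing by 255.0 promotes the integers to floats and maps [0, 255] onto [0.0, 1.0].
scaled_pixels = raw_pixels / 255.0

print(scaled_pixels.min(), scaled_pixels.max())
```

The same division applies element-wise to a whole DataFrame, which is what the two lines above rely on.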
Check the shape of the data at this point:
train_file_norm.shape
We can also use matplotlib.pyplot to plot part of the data and visualize the samples:
rand_indices = np.random.choice(train_file_norm.shape[0], 64, replace=False)
examples = train_file_norm.iloc[rand_indices, :]
fig, ax_arr = plt.subplots(8, 8, figsize=(6, 5))
fig.subplots_adjust(wspace=.025, hspace=.025)
ax_arr = ax_arr.ravel()
for i, ax in enumerate(ax_arr):
    ax.imshow(examples.iloc[i, :].values.reshape(28, 28), cmap="gray")
    ax.axis("off")
plt.show()
For training, we need to reshape the data to (42000, 32, 32, 3).
Define the sample shape parameters:
num_examples_train = train_file.shape[0]
num_examples_test = test_file.shape[0]
n_h = 32
n_w = 32
n_c = 3
Initialize the sample arrays:
Train_input_images = np.zeros((num_examples_train, n_h, n_w, n_c))
Test_input_images = np.zeros((num_examples_test, n_h, n_w, n_c))
Fill the arrays with the data:
# Copy each 28x28 image into the top-left corner of all 3 channels; the rest stays zero.
# Note: these loops read the raw 0-255 values from train_file/test_file, not the normalized copies.
for example in range(num_examples_train):
    Train_input_images[example,:28,:28,0] = train_file.iloc[example, 1:].values.reshape(28,28)
    Train_input_images[example,:28,:28,1] = train_file.iloc[example, 1:].values.reshape(28,28)
    Train_input_images[example,:28,:28,2] = train_file.iloc[example, 1:].values.reshape(28,28)
for example in range(num_examples_test):
    Test_input_images[example,:28,:28,0] = test_file.iloc[example, :].values.reshape(28,28)
    Test_input_images[example,:28,:28,1] = test_file.iloc[example, :].values.reshape(28,28)
    Test_input_images[example,:28,:28,2] = test_file.iloc[example, :].values.reshape(28,28)
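The per-row loops can be slow on 42000 images. As an alternative sketch (not the author's code), the same zero-padded, three-channel layout can be built in one shot with NumPy. The helper name pad_to_rgb and the demo array are hypothetical.

```python
import numpy as np

def pad_to_rgb(flat_pixels, side=28, target=32, channels=3):
    """Turn (N, side*side) rows into (N, target, target, channels) images."""
    images = np.asarray(flat_pixels, dtype=np.float64).reshape(-1, side, side)
    pad = target - side
    # Zero-pad only on the bottom and right, so the digit stays at [:side, :side].
    padded = np.pad(images, ((0, 0), (0, pad), (0, pad)), mode="constant")
    # Copy the single gray channel into every output channel.
    return np.repeat(padded[..., np.newaxis], channels, axis=-1)

# Hypothetical stand-in for two rows of pixel data.
demo = np.arange(2 * 784).reshape(2, 784)
demo_images = pad_to_rgb(demo)
print(demo_images.shape)
```

Assuming the column layout described earlier, a call such as pad_to_rgb(train_file.iloc[:, 1:].values) should produce the same array as the training loop.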
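Use cv2.resize to bring every image to (n_h, n_w). The arrays are already 32×32 after the zero-padding above, so this step does not change their size: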
for example in range(num_examples_train):
    Train_input_images[example] = cv2.resize(Train_input_images[example], (n_h, n_w))
for example in range(num_examples_test):
    Test_input_images[example] = cv2.resize(Test_input_images[example], (n_h, n_w))
Extract the training labels:
Train_labels = np.array(train_file.iloc[:, 0])
Print the shapes of the preprocessed data:
print("Shape of train input images : ", Train_input_images.shape)
print("Shape of test input images : ", Test_input_images.shape)
print("Shape of train labels : ", Train_labels.shape)
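That completes data processing and preprocessing!
2. Model training and prediction
Before introducing the model, a quick word on one-hot encoding.
One-hot encoding converts categorical labels into binary vectors that a machine learning model can consume. It can be implemented as follows: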
def one_hot(labels):
    onehot_labels = np.zeros(shape=[len(labels), 10])
    for i in range(len(labels)):
        index = labels[i]
        onehot_labels[i][index] = 1
    return onehot_labels
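For comparison, the same encoding can be written without a Python loop by indexing into an identity matrix. This is only a sketch; the name one_hot_vectorized and its num_classes parameter are illustrative and do not appear in the original code.

```python
import numpy as np

def one_hot_vectorized(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes)[labels]

encoded = one_hot_vectorized([3, 0, 9])
print(encoded[0])
```

Taking np.argmax along axis 1 of the result recovers the original integer labels.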
Build a CNN model:
def mnist_cnn(input_shape):
    '''
    Build a CNN model.
    :param input_shape: shape of one input image, e.g. (32, 32, 3)
    :return: an uncompiled keras.Sequential model
    '''
    model = keras.Sequential()
    model.add(keras.layers.Conv2D(filters=32, kernel_size=5, strides=(1, 1),
                                  padding='same', activation=tf.nn.relu, input_shape=input_shape))
    model.add(keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))
    model.add(keras.layers.Conv2D(filters=64, kernel_size=3, strides=(1, 1), padding='same', activation=tf.nn.relu))
    model.add(keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))
    model.add(keras.layers.Dropout(0.25))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(units=128, activation=tf.nn.relu))
    model.add(keras.layers.Dropout(0.5))
    model.add(keras.layers.Dense(units=10, activation=tf.nn.softmax))
    return model
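To see how the tensor shape changes from layer to layer, the arithmetic can be traced by hand. The sketch below uses plain Python rather than Keras, and assumes the (32, 32, 3) input used in this post. The helper names are hypothetical. Dropout and Flatten add no trainable parameters, so they are left out of the count.

```python
def conv_same(h, w, filters):
    # 'same' padding with stride 1 keeps height and width unchanged.
    return h, w, filters

def pool_valid(h, w, c, size=2, stride=2):
    # 'valid' pooling: floor((n - size) / stride) + 1 along each spatial axis.
    return (h - size) // stride + 1, (w - size) // stride + 1, c

def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block plus one bias per filter.
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    # One weight per input-unit pair plus one bias per unit.
    return inputs * units + units

h, w, c = 32, 32, 3
params = conv_params(5, c, 32)
h, w, c = pool_valid(*conv_same(h, w, 32))   # (16, 16, 32)
params += conv_params(3, c, 64)
h, w, c = pool_valid(*conv_same(h, w, 64))   # (8, 8, 64)
flat = h * w * c                             # size after Flatten
params += dense_params(flat, 128) + dense_params(128, 10)
print((h, w, c), flat, params)  # (8, 8, 64) 4096 546634
```

Most of the parameters sit in the first Dense layer, which is why the two pooling steps before Flatten matter for model size.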
Train and save the model:
def train_model(train_images, train_labels):
    # Print the input shape and the raw labels as a quick sanity check
    print("train_images :{}".format(train_images.shape))
    print(train_labels)
    # Convert the integer labels to one-hot vectors
    train_labels = one_hot(train_labels)
    # Build the model
    model = mnist_cnn(input_shape=(32, 32, 3))
    model.compile(optimizer=tf.optimizers.Adam(), loss="categorical_crossentropy", metrics=['accuracy'])
    model.fit(x=train_images, y=train_labels, epochs=5, batch_size=256)
    model.save('MYCNN2MNIST.h5')
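The model is compiled with the categorical cross-entropy loss, which is why the labels are one-hot encoded first. As a rough illustration of the formula only (not Keras's exact implementation), here is a NumPy sketch. The function, the eps clipping value and the sample probabilities are all assumptions made for this example.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Hypothetical 3-class example: both samples belong to class 0.
y_true = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
confident = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])
unsure = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident < loss_unsure)
```

The loss shrinks as the probability assigned to the true class approaches 1, which is the quantity the optimizer drives down during model.fit.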
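Use the trained model to predict labels for the test set; save_path is the path where the model was saved:
def pred(save_path, test_images):  # load the model and predict labels
    model = keras.models.load_model(save_path)
    # Run the prediction
    predictions = model.predict(test_images)
    # print(predictions)
    # print(type(predictions))
    targetlist = []
    # Placeholder entry at index 0, so that the prediction for the i-th test image
    # (counting from 1) sits at row index i
    targetlist.append(0)
    for i in range(len(test_images)):
        # The predicted digit is the class with the highest softmax probability
        target = np.argmax(predictions[i])
        targetlist.append(target)
    print(targetlist)
    predictions = pd.DataFrame(targetlist)
    predictions.to_csv("predict.csv")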
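The step from softmax probabilities to a digit is just an argmax over the class axis. A small sketch with made-up probabilities (four classes instead of ten, to keep it short) shows this for a whole batch at once:

```python
import numpy as np

# Hypothetical softmax outputs for three images and four classes.
probs = np.array([[0.10, 0.70, 0.10, 0.10],
                  [0.60, 0.20, 0.10, 0.10],
                  [0.05, 0.05, 0.10, 0.80]])

# axis=1 picks the most probable class separately for each row (image).
labels = np.argmax(probs, axis=1)
print(labels.tolist())  # [1, 0, 3]
```

Applied to the full prediction array, this one call would replace the per-image loop.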
Write our predictions to a file:
submission = pd.read_csv('DataSet/sample_submission.csv')
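The rest of the author's file-writing code is not shown here. The Digit Recognizer competition expects a CSV with an ImageId column counting from 1 and a Label column, so one possible way to build such a table is sketched below. The helper make_submission and the demo labels are hypothetical, not taken from the original source.

```python
import numpy as np
import pandas as pd

def make_submission(predicted_labels):
    """Build a two-column table: ImageId (1-based) and Label."""
    predicted_labels = np.asarray(predicted_labels, dtype=np.int64)
    return pd.DataFrame({
        "ImageId": np.arange(1, len(predicted_labels) + 1),
        "Label": predicted_labels,
    })

demo_submission = make_submission([2, 0, 9])
print(demo_submission.to_csv(index=False))
```

Calling make_submission(labels).to_csv("predict.csv", index=False) would then write a file without the extra index column.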
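Then submit the file to the server to see your accuracy and ranking.
P.S. Kaggle's servers are hosted overseas, so you will need ** (if you know, you know).
Source code: https://pan.baidu.com/s/1HXt24GiUrRZliUXN-0N06g
Extraction code: pza9
If you have questions, feel free to leave a comment below.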