Machine Learning Project 1: KNN

Header

  • Name: Shusen Wu
  • OS: Windows 10
  • CPU: i7-7700
  • Language: Python 3.6
  • Environment: Jupyter Notebook
  • Libraries: numpy, matplotlib.pyplot, collections, time, operator

Reference

Peter Harrington, Machine Learning in Action, Manning Publications, 2012, ISBN 9781617290183

Datasets

The project explores two datasets: the famous MNIST dataset of very small pictures of handwritten digits, and a dataset on the prevalence of diabetes among the Pima, a Native American tribe. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data

Part 1: Pima

Dataset details

Here, I use 80% of the data as training data, 10% as validation data, and 10% as test data.

Besides, since the scales of the features are quite different, I normalize each of them into the range [0, 1].
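A minimal sketch of this split (assuming the Kaggle file is named diabetes.csv and the Outcome label is the last column):

```python
import numpy as np

# Load the Pima CSV (filename assumed); the Outcome label is the last column.
data = np.loadtxt('diabetes.csv', delimiter=',', skiprows=1)
np.random.seed(0)        # fixed seed so the shuffle is reproducible
np.random.shuffle(data)

n = len(data)
train, val, test = np.split(data, [int(0.8 * n), int(0.9 * n)])  # 80/10/10
X_train, y_train = train[:, :-1], train[:, -1].astype(int)
X_val,   y_val   = val[:, :-1],   val[:, -1].astype(int)
X_test,  y_test  = test[:, :-1],  test[:, -1].astype(int)
```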

Algorithm Description

Following the book Machine Learning in Action, I implemented the KNN and normalization functions, sketched after this list.

  • KNN
  • Normalize
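A minimal sketch of the two functions, in the spirit of the book's classify0 and autoNorm (the variable names are mine, not the exact code from the post):

```python
import numpy as np
from collections import Counter

def normalize(X, X_ref=None):
    """Scale each feature into [0, 1] using the min/max of a reference set."""
    if X_ref is None:
        X_ref = X
    mins, maxs = X_ref.min(axis=0), X_ref.max(axis=0)
    return (X - mins) / (maxs - mins)

def knn_classify(x, X_train, y_train, k):
    """Label one point by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Validation and test features are normalized with the training set's min/max (X_ref=X_train) so that no information leaks from the held-out data.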

Algorithm Results

   

The results are evaluated with a confusion matrix:

|           | Predicted 1                  | Predicted 0                  | Total                     |
|-----------|------------------------------|------------------------------|---------------------------|
| Reality 1 | True Positive (TP)           | False Negative (FN)          | Actual Positive (TP + FN) |
| Reality 0 | False Positive (FP)          | True Negative (TN)           | Actual Negative (FP + TN) |
| Total     | Predicted Positive (TP + FP) | Predicted Negative (FN + TN) | TP + FP + FN + TN         |

First, we need to predict the results on the validation set with different K. Here, I set K from 3 to 9.
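A sketch of this loop, reusing normalize and knn_classify from above:

```python
# Normalize with the training set's min/max, then score each K on validation.
Xtr = normalize(X_train)
Xva = normalize(X_val, X_ref=X_train)
for k in range(3, 10):
    preds = np.array([knn_classify(x, Xtr, y_train, k) for x in Xva])
    acc = np.sum(preds == y_val) / len(y_val)
    print('K=%d: validation accuracy %.3f' % (k, acc))
```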

Here are the results:

[Figure: validation accuracy for each K from 3 to 9]

Comparing the accuracies for K from 3 to 9, we find that K = 7 works best.

So we choose K = 7 to run on the test set.

Runtime

The elapsed time on the test set is 0.006999969482421875 s (about 7 ms).

Other running times are shown in the pictures above.
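For reference, the timing is just a time.time() bracket around the test-set run; a sketch reusing the Part 1 functions:

```python
import time

Xte = normalize(X_test, X_ref=X_train)
start = time.time()
preds = np.array([knn_classify(x, Xtr, y_train, 7) for x in Xte])
print('accuracy : %.3f' % (np.sum(preds == y_test) / len(y_test)))
print('cost time:', time.time() - start)
```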

Part 2: Recognize Digits

Dataset details

Here, again, I use 90% of the training data to train and the remaining 10% as a validation set. Since Kaggle already provides a test set, I do not need to split one off from the training set. The following picture shows the shapes (rows x columns) of these data sets.
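A sketch of loading the Kaggle files and making the 90/10 split (train.csv carries a label column plus 784 pixel columns; test.csv carries pixels only):

```python
import numpy as np

train = np.loadtxt('train.csv', delimiter=',', skiprows=1)
test_X = np.loadtxt('test.csv', delimiter=',', skiprows=1)
labels, images = train[:, 0].astype(int), train[:, 1:]
print(train.shape, test_X.shape)  # (42000, 785) (28000, 784)

# 90/10 split of the labelled data into training and validation sets.
n = int(0.9 * len(images))
X_train, y_train = images[:n], labels[:n]
X_val,   y_val   = images[n:], labels[n:]
```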

The distribution of the digits 0-9:
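This can be computed in one line with np.bincount:

```python
# Count how many training samples there are of each digit 0-9.
print(np.bincount(labels, minlength=10))
```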

Randomly sample some images and show them:

Besides, we'd like to have a quick look at one image:
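A sketch of both steps with matplotlib, using the names from the loading sketch above:

```python
import matplotlib.pyplot as plt

# Pick one random training image and display it as a 28x28 grayscale picture.
i = np.random.randint(len(images))
plt.imshow(images[i].reshape(28, 28), cmap='gray')
plt.title('label: %d' % labels[i])
plt.show()
```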

Algorithm Description

  • KNN: we know that (A - B)^2 equals A^2 + B^2 - 2AB. So, when we compute the squared Euclidean distances, we can expand ||a - b||^2 into ||a||^2 + ||b||^2 - 2a.b, and the whole distance matrix can be computed with matrix operations, which saves a lot of time (see the sketch after this list).
  • Besides, we also need to compute the accuracy for every K. Here, we can compute it quite fast using np.sum(y == y') / len(y).
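A sketch of the vectorized distance trick (my own implementation of the expansion above, not necessarily the post's exact code):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Vectorized KNN via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b."""
    # (n_test, n_train) matrix of squared distances, built from three terms
    # without ever looping over individual sample pairs.
    d2 = ((X_test ** 2).sum(axis=1)[:, None]
          + (X_train ** 2).sum(axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    nearest = np.argsort(d2, axis=1)[:, :k]  # k closest training points each
    votes = y_train[nearest]                 # their labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

The accuracy for each K is then np.sum(preds == y_val) / len(y_val).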

Algorithm Results & Runtime

KNN does cost a lot of time, and it uses a huge amount of memory. When running this algorithm in Jupyter Notebook, there is not enough memory unless the test set is split into small batches.

Thus, I moved the code back to PyCharm and ran it there (a batching sketch follows).
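A sketch of the batching workaround (the batch size is hypothetical; reuses knn_predict from above):

```python
# Process the test points in small batches so the full distance matrix
# never has to fit in memory at once.
batch = 500  # hypothetical batch size
preds = []
for i in range(0, len(test_X), batch):
    preds.append(knn_predict(X_train, y_train, test_X[i:i + batch], k=5))
preds = np.concatenate(preds)
```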

From this picture, we can see that K = 5 does best on the validation set. So we set K = 5 to run on the test set. Here is the result:

Top 100 results:
