Titanic Kaggle competition technical summary (c | end)

A note before we start: the machine learning calls in this section only show how the methods are used in Python; the specific algorithms and a deeper understanding of each model will be expanded in later posts.

Recap of earlier posts

In the previous posts, we shared some techniques for inspecting and cleaning data, along with some visualization techniques for the Titanic competition data. You can view them at these links:
Titanic Kaggle competition technical summary (a)
Titanic Kaggle competition technical summary (b)

Technical Summary

1. Outline of this post

Step 1: Load packages and inspect the data.
Step 2: Clean the data: handle duplicate values, outliers, missing values, and feature processing.
Step 3: Visualize the data, examining the values of variables and variable groupings.
Step 4: Train models, possibly using machine learning or deep learning.
Step 5: Evaluate the different models: cross-validation, important-variable selection, precision, recall, F-score, ROC-AUC, and so on.
Step 6: Save the finished data.

2. Key points of the program

(The complete analysis code is not shown here; only the functions used are summarized. For the complete project, and for more on the machine learning models, see the follow-up posts.)

Step 4: Train models, possibly using machine learning or deep learning. The Titanic task is essentially a binary (0-1) classification problem, and common traditional methods for such problems include logistic regression (LR), decision trees (DT), KNN, the perceptron, and so on. For the specific algorithms, the corresponding chapters of Li Hang's "Statistical Learning Methods" give detailed explanations; these methods may be shared in follow-up machine learning posts. Below is part of the machine learning implementation code.

# Machine learning packages - binary classification problem
# Common models: LR, random forest, perceptron, gradient descent, DT, KNN, support vector machine, naive Bayes classifier
import numpy as np   # used below for np.round
import pandas as pd  # used below for pd.DataFrame
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
train_df = train_df.drop(['Name'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
X_train = train_df.drop('Survived',axis=1)
Y_train = train_df['Survived']
X_test = test_df.copy() # pandas .copy() is a deep copy by default, so changes to X_test will not affect test_df
# Stochastic gradient descent
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train)*100, 2)
# Random forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
acc_rf = round(random_forest.score(X_train, Y_train)*100,2)
# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
acc_logreg= round(logreg.score(X_train, Y_train)*100,2)
#KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_train, Y_train)*100,2)
# Naive Bayes classifier
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
acc_gau = round(gaussian.score(X_train, Y_train)*100,2)
# Perceptron
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)
acc_per = round(perceptron.score(X_train, Y_train)*100,2)
# Support vector machine
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_svc = round(linear_svc.score(X_train, Y_train)*100, 2)
# Decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_dt = round(decision_tree.score(X_train, Y_train)*100,2)

# Model comparison
result = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
             'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic Gradient Descent',
             'Decision Tree'],
    'Score':[acc_svc, acc_knn, acc_logreg, acc_rf, acc_gau, acc_per, acc_sgd, acc_dt]
})
# sort_values sorts the rows by the given column
result_df = result.sort_values(by = 'Score', ascending= False)
result_df = result_df.set_index('Score') # make Score the index (first) column
result_df.head(9)
Score	Model
94.85	Decision Tree
94.84	Random Forest
87.21	KNN
79.57	Logistic Regression
76.66	Naive Bayes
76.43	Perceptron
70.48	Support Vector Machines
38.72	Stochastic Gradient Descent

You can see that the decision tree performs best on the training set, with a training accuracy of 94.85%; the random forest, also tree-based, is almost identical. So in this section we will evaluate and tune the decision tree model.
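Training-set accuracy flatters models that can memorize the data, and a fully grown decision tree can memorize almost all of it. As a minimal sketch of a sanity check, assuming the same X_train and Y_train as above (the 20% split fraction and random_state are arbitrary choices of ours):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out 20% of the labelled data as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=42)

dt_check = DecisionTreeClassifier()
dt_check.fit(X_tr, y_tr)
print('train accuracy:', round(dt_check.score(X_tr, y_tr) * 100, 2))
print('validation accuracy:', round(dt_check.score(X_val, y_val) * 100, 2))

The gap between the two numbers is the overfitting that the cross-validation below measures more carefully.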
Preliminary model evaluation

Here we use 10-fold cross-validation to evaluate the model.

from sklearn.model_selection import cross_val_score
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, X_train, Y_train,cv=10,scoring='accuracy')
print("Scores",scores)
print('Mean',scores.mean())
print("st",scores.std())
Scores [0.67777778 0.75555556 0.70786517 0.83146067 0.78651685 0.78651685
 0.85393258 0.76404494 0.79775281 0.81818182]
Mean 0.77796050391556
st 0.05141546081049484

The average prediction accuracy of the decision tree under cross-validation is 77.79%, and the standard deviation across folds is small, so the result is acceptable overall.
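As a rough reading of the spread (an informal gauge from the ten fold scores, not a formal confidence interval), the accuracy varies by about mean ± 2·std:

# roughly 0.675 to 0.881 given the mean and std printed above
print('approx range:', round(scores.mean() - 2 * scores.std(), 3),
      'to', round(scores.mean() + 2 * scores.std(), 3))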
Next, let's look at the importance of each feature, with an eye to removing the unimportant ones:

importances = pd.DataFrame({'feature':X_train.columns, 'importance':np.round(decision_tree.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)
importances.plot.bar()

[Figure: bar chart of the decision tree's feature importances]
The selection here is based on the tree's feature importances: you can see that Parch and Embarked are the least important, so we consider removing these two variables.

train_df = train_df.drop('Parch', axis=1)
test_df = test_df.drop('Parch', axis=1)
X_train = train_df.drop(['Survived'], axis=1)
Y_train = train_df['Survived']
X_test = test_df
decision_tree.fit(X_train,Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train)*100,2)
print(acc_decision_tree,"%")
94.28 %

After removing the unimportant variable, the prediction accuracy differs little from before, but the risk of overfitting is reduced to some extent. As for tuning the decision tree, here we only show which parameters can be adjusted (a tuning sketch follows the listing).

 DecisionTreeClassifier()
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
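As a sketch of how that tuning could proceed (the grid below is an illustrative choice of ours, not the one used in the original notebook), a cross-validated grid search over a few of these parameters might look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# illustrative grid over a few of the parameters listed above
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, Y_train)
print('best parameters:', grid.best_params_)
print('best CV accuracy:', round(grid.best_score_ * 100, 2))

Limiting max_depth and raising min_samples_split / min_samples_leaf directly constrains how much the tree can memorize, which is where the gap between training and cross-validated accuracy comes from.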

Model prediction evaluation

# Precision and recall
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(decision_tree, X_train, Y_train, cv=3)
print(confusion_matrix(Y_train, predictions))
from sklearn.metrics import precision_score, recall_score
print('Precision', precision_score(Y_train, predictions))
print('Recall', recall_score(Y_train, predictions))
# ROC-AUC score
y_scores = decision_tree.predict_proba(X_train)
y_scores = y_scores[:,1]
from sklearn.metrics import roc_auc_score
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)
# Precision, recall, and ROC-AUC values:
Precision 0.73125
Recall 0.6842105263157895
ROC-AUC-Score: 0.9906821546884821

From the precision and recall you can see that the overall performance is only middling; the feature handling in this series has some flaws, but in these posts we are mainly demonstrating how to use the methods in Python.
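One caveat worth adding: the ROC-AUC above comes from predict_proba on the very data the tree was fitted to, so the near-perfect 0.99 largely reflects memorization rather than generalization. As a sketch of a less optimistic estimate, reusing cross_val_predict from above with probability outputs (and adding the F-score mentioned in the step list):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, f1_score

# out-of-fold probabilities: each row is scored by a model that never saw it
y_proba = cross_val_predict(decision_tree, X_train, Y_train,
                            cv=3, method='predict_proba')[:, 1]
print('cross-validated ROC-AUC:', roc_auc_score(Y_train, y_proba))
print('F1 (from the earlier out-of-fold predictions):', f1_score(Y_train, predictions))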
Conclusion
This concludes our summary of the Python techniques used for the Titanic project. This series only showed the main functions used; for a complete analysis of the project, you can study the corresponding Titanic notebooks on Kaggle, which we won't go into further here.

Thanks for reading!



Source: blog.csdn.net/qq_35149632/article/details/104348394