[Hands-On Practice 01] Binary classification on a heart disease dataset

Table of contents

1. Get the dataset

2. Dataset introduction

3. Data preprocessing

4. Build a random forest classification model

5. Predict the test set data

6. Build a confusion matrix

7. Calculate precision, recall, and the F1-score (harmonic mean)

8. ROC curve and AUC


(Note: each section can be its own .py file; sections 4, 5, 6, and 7 are written in the same file. A Jupyter notebook is the most convenient way to work through them.)

1. Get the dataset

The dataset can be obtained from either of two sources: UCI or Kaggle.

UCI Machine Learning Repository, Heart Disease Data Set: https://archive.ics.uci.edu/ml/datasets/heart+disease

Heart Disease Dataset on Kaggle (public health dataset): https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

 

The download is a single csv file (heart.csv).

2. Dataset introduction

The dataset has 1025 rows and 14 columns. Each row represents one patient; 13 columns are features and 1 column is the label (whether the patient has heart disease).


| Column | Description |
| --- | --- |
| age | age |
| sex | sex, 1 = male, 0 = female |
| cp | chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic) |
| trestbps | resting blood pressure, measured on admission, in mm Hg |
| chol | serum cholesterol, in mg/dl |
| fbs | whether fasting blood sugar is high; 1 if fasting blood sugar > 120 mg/dl, otherwise 0 |
| restecg | resting ECG result (0: normal, 1: ST-T wave abnormality, 2: probable or definite left ventricular hypertrophy by Estes' criteria) |
| thalach | maximum heart rate achieved |
| exang | whether exercise induces angina, 1 = yes, 0 = no |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping) |
| ca | number of major blood vessels around the heart (0-3) |
| thal | thalassemia status (3: normal, 6: fixed defect, 7: reversible defect) |
| target | label column; whether the patient has heart disease, 0 = no, 1 = yes |

(The code values above follow the original UCI documentation; the Kaggle csv used here stores these categorical columns as 0-based integer codes, which the preprocessing code in section 3 maps back to descriptive strings.)
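
A quick look at the file confirms this shape (a minimal sketch; it assumes the csv has been saved as dataset/heart.csv, the path used by the preprocessing code below):

import pandas as pd

# Load the raw csv and confirm it has 1025 rows and 14 columns
df = pd.read_csv('dataset/heart.csv')
print(df.shape)
print(df.head())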

3. Data preprocessing

First of all, it is necessary to distinguish the four levels of measurement: nominal (categorical), ordinal, interval, and ratio data.

① Convert the nominal features from integer codes back to their actual string values, restoring their real meaning.

② Expand the categorical features into one-hot encoded columns.

③ Export the preprocessed data.

import pandas as pd

# Load the raw dataset
df = pd.read_csv('dataset/heart.csv')

# Convert the nominal features from integer codes to their actual string values,
# restoring their real meaning
df['sex'] = df['sex'].map({0: 'female', 1: 'male'})
df['cp'] = df['cp'].map({0: 'typical angina', 1: 'atypical angina',
                         2: 'non-anginal pain', 3: 'asymptomatic'})
df['fbs'] = df['fbs'].map({0: 'lower than 120 mg/dl', 1: 'greater than 120 mg/dl'})
df['restecg'] = df['restecg'].map({0: 'normal', 1: 'ST-T wave abnormality',
                                   2: 'left ventricular hypertrophy'})
df['exang'] = df['exang'].map({0: 'no', 1: 'yes'})
df['slope'] = df['slope'].map({0: 'upsloping', 1: 'flat', 2: 'downsloping'})
df['thal'] = df['thal'].map({0: 'unknown', 1: 'normal',
                             2: 'fixed defect', 3: 'reversable defect'})

# Expand the discrete nominal and ordinal feature columns into One-Hot encoded columns
df = pd.get_dummies(df)

# Export the preprocessed data
df.to_csv('process_heart.csv', index=False)
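
As a quick check (my own addition, not part of the original post), the shape and column names of the exported table show how get_dummies expanded the nominal columns:

# After one-hot encoding, each category value becomes its own 0/1 column,
# while the numeric columns (age, trestbps, chol, thalach, oldpeak, ca) are kept as-is
print(df.shape)
print(df.columns.tolist())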

 

4. Build a random forest classification model

First, load the preprocessed data file.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('process_heart.csv')

Second, split the data into input and output

# Drop the label column; the remaining feature matrix is the input X
X = df.drop('target', axis=1)

# The label vector y
y = df['target']

Third, the data is divided into test set and training set

# Split the data into a training set and a test set (20% for testing), with a fixed random seed
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

 Fourth, build a random forest classification model and train the model on the training set

# Build a random forest classification model and train it on the training set
from sklearn.ensemble import RandomForestClassifier
# Maximum depth 5, 100 decision trees, random seed 5
model = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=5)
# Fit the model
model.fit(X_train, y_train)
# Pick out one of the decision trees (the one at index 7) for inspection
estimator = model.estimators_[7]

 Fifth, visualize the decision tree

# Convert the training labels to strings so the tree plot shows readable class names
feature_names = X_train.columns
y_train_str = y_train.astype('str')
y_train_str[y_train == 0] = 'no disease'
y_train_str[y_train == 1] = 'disease'
y_train_str = y_train_str.values

# Visualize the decision tree with Graphviz
from sklearn.tree import export_graphviz
export_graphviz(estimator, out_file='tree.dot',
                feature_names=feature_names,
                class_names=y_train_str,
                rounded=True, proportion=True,
                label='root',
                precision=2, filled=True)

# Convert the .dot file to a .png image (requires the Graphviz 'dot' executable)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

from IPython.display import Image
Image(filename='tree.png')
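
If Graphviz is not installed, scikit-learn's built-in plot_tree draws the same estimator with matplotlib only; this alternative is a sketch of my own, not part of the original tutorial:

from sklearn.tree import plot_tree

# Draw the tree at index 7 without any Graphviz dependency
plt.figure(figsize=(20, 10))
plot_tree(estimator,
          feature_names=list(feature_names),
          class_names=['no disease', 'disease'],
          filled=True, rounded=True, precision=2)
plt.show()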

5. Predict the test set data

After the random forest model has been trained on the training set, it can predict the test set data as well as new, unseen data. Comparing the predictions with the true labels of the test set allows a quantitative evaluation of the model: building the confusion matrix, computing metrics such as Precision, Recall, and F1-Score, and plotting the ROC curve.

First, model preparation


# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Load the preprocessed dataset and split it into features and labels
df = pd.read_csv('process_heart.csv')
X = df.drop('target', axis=1)
y = df['target']

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# Build and train the random forest model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=5, n_estimators=100)
model.fit(X_train, y_train)

Second, convert one of the test samples into an array


# Use positional indexing (.iloc) to pick one sample out of the test set
test_sample = X_test.iloc[2]
# Reshape it into a 2-D array with a single row, as the model expects
test_sample = np.array(test_sample).reshape(1, -1)

Third, predict a single unknown sample

# Qualitative result: the predicted class label (0 or 1)
model.predict(test_sample)
# Quantitative result: the predicted probability of each class
model.predict_proba(test_sample)

Fourth, predict the entire test set

y_pred = model.predict(X_test)
# Confidence (probability) of each test sample not having / having heart disease
y_pred_proba = model.predict_proba(X_test)
# Slice out only the confidence of having heart disease (the positive class)
model.predict_proba(X_test)[:, 1]
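
As a quick sanity check (my own addition, not in the original post), the predicted labels can be compared with the true test labels to get the overall accuracy:

from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted label matches the true label
print(accuracy_score(y_test, y_pred))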

6. Build a confusion matrix

# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix_model = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
import itertools
def cnf_matrix_plotter(cm, classes):
    '''
    Plot the confusion matrix, given the matrix and the list of class names
    '''
    # plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Greens)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Oranges)
    plt.title('Confusion Matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    threshold = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > threshold else "black", fontsize=25)
    plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

cnf_matrix_plotter(confusion_matrix_model, ['Healthy', 'Disease'])
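
For reference, newer scikit-learn versions also ship a built-in helper that produces a similar figure without the custom function above; this alternative is my own addition:

from sklearn.metrics import ConfusionMatrixDisplay

# Built-in confusion matrix plot (available in scikit-learn >= 0.22)
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix_model,
                              display_labels=['Healthy', 'Disease'])
disp.plot(cmap=plt.cm.Oranges)
plt.show()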
 

7. Calculate precision, recall, and the F1-score (harmonic mean)

# Compute precision, recall, and the F1-score (their harmonic mean)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Healthy', 'Disease']))
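
The same numbers can also be computed one metric at a time, which makes the "harmonic mean" relationship explicit (a small sketch of my own, using scikit-learn's individual metric functions):

from sklearn.metrics import precision_score, recall_score, f1_score

# Metrics for the positive class (Disease = 1)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(precision, recall, f1)
# The F1-score is the harmonic mean of precision and recall
print(2 * precision * recall / (precision + recall))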

8. ROC curve and AUC

# ROC curve
y_pred_quant = model.predict_proba(X_test)[:, 1]
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

# Compute the AUC (area under the ROC curve)
auc(fpr, tpr)
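
Equivalently (my own note, not in the original code), the AUC can be computed directly from the test labels and the predicted probabilities:

from sklearn.metrics import roc_auc_score

# Same value as auc(fpr, tpr) above
roc_auc_score(y_test, y_pred_quant)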


Origin: https://blog.csdn.net/weixin_42322991/article/details/124857777