hw15 (Week 15)

sklearn exercises

In the second ML assignment you have to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest. For this assignment you need to generate a random binary classification problem, and then train and test (using 10-fold cross validation) the three algorithms. For some algorithms an inner cross validation (5-fold) is needed to choose the parameters. Then, show the classification performance (per-fold and averaged) in the report, and briefly discuss the results.
Note
The report also has to contain a short description of the methodology used to obtain the results.
Steps
1. Create a classification dataset (n_samples=1000, n_features=10)
2. Split the dataset using 10-fold cross validation
3. Train the algorithms
    - GaussianNB
    - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
    - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4. Evaluate the cross-validated performance
    - Accuracy
    - F1-score
    - AUC ROC
5. Write a short report summarizing the methodology and the results


Explanation:
Construct a dataset and split it with 10-fold cross validation; then train models with Naive Bayes, SVM, and Random Forest respectively, predict on each test fold, and evaluate the results.

A Naive Bayes classifier rests on a simple assumption: given the target value, the attributes are conditionally independent of one another. It therefore performs best when the correlations between attributes are weak.
Bayes' formula:
P(Category | Document) = P(Document | Category) * P(Category) / P(Document)
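
As a quick illustration of how GaussianNB applies this rule, here is a minimal sketch with made-up toy data: the classifier fits one Gaussian per feature per class and combines them under the independence assumption to obtain the posterior.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (values invented for illustration): two classes, two features
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],   # class 0
              [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)  # estimates P(Category) and a per-feature Gaussian P(feature | Category)
print(clf.predict([[1.1, 2.0]]))        # predicted class for a new sample
print(clf.predict_proba([[1.1, 2.0]]))  # posterior P(Category | sample) via Bayes' rule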

The main ideas behind SVM (Support Vector Machine) can be summarized in two points:
1. It analyzes the linearly separable case. When the data is not linearly separable, a nonlinear mapping transforms the samples from the low-dimensional input space, where they cannot be separated linearly, into a high-dimensional feature space where they can. This makes it possible to apply a linear algorithm in that feature space to analyze the samples' nonlinear characteristics.
2. Building on structural risk minimization, it constructs the optimal separating hyperplane in the feature space, so that the learner is globally optimal and the expected risk over the whole sample space satisfies a certain upper bound with some probability.
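
A minimal sketch of point 1, using sklearn's make_circles as a stand-in for data that is not linearly separable (parameters chosen arbitrarily): the RBF kernel corresponds to an implicit nonlinear mapping into a high-dimensional feature space, so it should clearly outperform the linear kernel here.

from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the 2-D input space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "rbf"):
    score = cross_val_score(SVC(C=1.0, kernel=kernel), X, y, cv=5).mean()
    print(kernel, "accuracy: %.3f" % score)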

A random forest is a classifier made up of many decision trees, and its output class is the mode of the classes output by the individual trees.
Each tree is built by the following algorithm:
Let N be the number of training cases (samples) and M the number of features.
Choose an input feature count m, used to decide the outcome at a node of the tree; m should be much smaller than M.
Sample N times with replacement from the N training cases to form a training set (i.e., bootstrap sampling), and use the cases that were not drawn to make predictions and estimate the error.
At each node, randomly select m features; the decision at each node of the tree is based on these features. Compute the best split according to these m features.
Each tree is grown to its full depth and is not pruned (pruning may be adopted after an ordinary tree classifier has been built).
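
The cases not drawn by the bootstrap give the out-of-bag error estimate mentioned above, and sklearn exposes it directly. A minimal sketch (dataset parameters chosen arbitrarily):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# max_features controls m, the number of features tried at each split (m << M);
# oob_score=True scores each tree on the samples its bootstrap draw missed
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", oob_score=True, random_state=0)
clf.fit(X, y)
print("Out-of-bag accuracy: %.3f" % clf.oob_score_)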


Code:

from sklearn import datasets, metrics
from sklearn.model_selection import KFold  # cross_validation was removed from sklearn; KFold now lives in model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Create the dataset and set up the 10-fold split
kfold = 10
X, y = datasets.make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2)
kf = KFold(n_splits=kfold, shuffle=True)
count = 1
train_methods = [
    ("GaussianNB", GaussianNB()),
    ("SVC 1e-01", SVC(C=1e-01, kernel='rbf', gamma=0.1)),
    ("SVC 1e02", SVC(C=1e02, kernel='rbf', gamma=0.1)),
    ("RFC 10", RandomForestClassifier(n_estimators=10)),
    ("RFC 100", RandomForestClassifier(n_estimators=100))
    ]
# Per-method running totals, averaged over the folds at the end
sum_acc = {name: 0 for name, _ in train_methods}
sum_f1 = {name: 0 for name, _ in train_methods}
sum_auc = {name: 0 for name, _ in train_methods}
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    print("\nCase %d: y_test:\n" % count, y_test)
    for name, clf in train_methods:
        # Train on this fold's training split
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        print(name + " predict:\n", pred)
        # Evaluate on this fold's test split
        acc = metrics.accuracy_score(y_test, pred)
        print(name + " --Accuracy--\n", acc)
        f1 = metrics.f1_score(y_test, pred)
        print(name + " --F1-Score--\n", f1)
        auc = metrics.roc_auc_score(y_test, pred)
        print(name + " --AUC ROC--\n", auc)
        sum_acc[name] += acc
        sum_f1[name] += f1
        sum_auc[name] += auc
    count += 1
print("\nAverage performance:")
for name, _ in train_methods:
    print(name + " acc:%.3f f1:%.3f auc:%.3f" % (sum_acc[name] / kfold, sum_f1[name] / kfold, sum_auc[name] / kfold))
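
Note that the code above trains each parameter setting as a separate fixed model rather than running the inner 5-fold cross validation the assignment asks for. A minimal sketch of how that selection could be done with GridSearchCV (the grids come from the step list above; inside the outer loop this would replace the fixed SVC and RFC entries):

from sklearn.model_selection import GridSearchCV

# Inner 5-fold CV over the candidate parameters from the assignment
svc_grid = GridSearchCV(SVC(kernel='rbf'),
                        param_grid={'C': [1e-02, 1e-01, 1e00, 1e01, 1e02]}, cv=5)
rfc_grid = GridSearchCV(RandomForestClassifier(),
                        param_grid={'n_estimators': [10, 100, 1000]}, cv=5)

# Fit on the outer fold's training split only, then evaluate on its test split,
# exactly like the fixed models above
svc_grid.fit(X_train, y_train)
print("best C:", svc_grid.best_params_)
pred = svc_grid.predict(X_test)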

Results:
(only the final averaged results are shown)

Average performance:
GaussianNB acc:0.890 f1:0.891 auc:0.891
SVC 1e-01 acc:0.897 f1:0.898 auc:0.898
SVC 1e02 acc:0.879 f1:0.880 auc:0.880
RFC 10 acc:0.917 f1:0.918 auc:0.918
RFC 100 acc:0.933 f1:0.935 auc:0.935

Reposted from blog.csdn.net/weixin_38533133/article/details/80728375