Machine Learning in Practice (1): Credit Card Fraud Detection — Python Troubleshooting Roundup

Solutions to problems encountered while implementing the credit card fraud detection example in Python 2.7.

1. Importing train_test_split for cross-validation:

from sklearn.cross_validation import train_test_split

  • The following warning appears:

DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

  • Fix: change the import to:

from sklearn.model_selection import train_test_split
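As a quick sanity check, the new import path behaves exactly like the old one. A minimal sketch with made-up toy data (the arrays below are purely illustrative):

```python
from sklearn.model_selection import train_test_split

# Toy data: four samples with binary labels (illustrative only)
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Hold out 25% of the samples as a test set, fixing the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```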

2. When calling KFold:

fold=KFold(len(y_train_data),5,shuffle=False)

  • The following error appears:

TypeError: __init__() got multiple values for keyword argument 'shuffle'

  • Fix: change the call to:

fold=KFold(5,shuffle=False)
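In sklearn 0.18+, KFold takes the number of splits (`n_splits`) as its first argument and no longer takes the dataset length; the data itself is passed later to `split()`. A minimal sketch on a made-up 10-sample array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 toy samples

fold = KFold(5, shuffle=False)    # 5 folds, no shuffling
splits = list(fold.split(X))      # each item is (train_indices, test_indices)
```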

3. When calling enumerate:

for iteration,indices in enumerate(fold,start=1):

  • The following error appears:

TypeError: 'KFold' object is not iterable

  • Fix: change the loop to:

for iteration,indices in enumerate(fold.split(x_train_data)):
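The key point is that `fold.split(...)` returns a generator of `(train_indices, test_indices)` pairs, which can be enumerated, while the new `KFold` object itself cannot. Sketched on the same kind of toy array (variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)
fold = KFold(5, shuffle=False)

fold_sizes = []
for iteration, indices in enumerate(fold.split(X), start=1):
    # indices[0] is the array of training indices, indices[1] the test indices
    fold_sizes.append((iteration, len(indices[0]), len(indices[1])))
```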

4. When calling idxmax:

best_c = result_table.loc[result_table['Mean Recall score'].idxmax()]['C_parameter']

  • The following error appears:

TypeError: reduction operation 'argmax' not allowed for this dtype

  • Fix: change the line to:

best_c = result_table.loc[result_table['Mean Recall score'].astype('float64').idxmax()]['C_parameter']
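The error occurs because the 'Mean Recall score' column is created empty and is therefore of `object` dtype, which pandas refuses to reduce with `idxmax`. Casting to `float64` first resolves it; a minimal sketch with made-up scores:

```python
import pandas as pd

# An object-dtype score column, as produced by filling an initially empty DataFrame
result_table = pd.DataFrame({'C_parameter': [0.01, 0.1, 1.0]})
result_table['Mean Recall score'] = pd.Series([0.55, 0.91, 0.87], dtype=object)

# Cast to float64 before idxmax, then look up the corresponding C value
best_c = result_table.loc[
    result_table['Mean Recall score'].astype('float64').idxmax()]['C_parameter']
```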

5. The complete code for this section:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def print_kfold_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)  # 5-fold cross-validation, no shuffling
    c_param_range = [0.01,0.1,1,10,100]  # candidate values of the regularization parameter C
    # build a DataFrame holding each C value and its mean recall score
    result_table = pd.DataFrame(index=range(len(c_param_range)),columns=['C_parameter','Mean Recall score'])
    result_table['C_parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        print "=============================="
        print "C parameter:",c_param
        print "------------------------------"
        recall_accs = []
        for iteration,indices in enumerate(fold.split(x_train_data)):
            lr = LogisticRegression(C=c_param,penalty='l1')  # instantiate the logistic regression model
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print "recall score=",recall_acc
        # the mean of the recall scores is the metric we want to keep
        result_table.loc[j,'Mean Recall score'] = np.mean(recall_accs)
        j += 1
        print ''
        print "Mean Recall score:",np.mean(recall_accs)
    best_c = result_table.loc[result_table['Mean Recall score'].astype('float64').idxmax()]['C_parameter']
    # finally, check which of the candidate C parameters is best
    print "**************************"
    print "Best model to choose from cross validation is with parameter=",best_c
    print "**************************"
    return best_c

Those are basically all the problems I ran into. Feel free to discuss why they occur and how the fixes work!


Reposted from blog.csdn.net/qq_35046314/article/details/88859058