DC Game Payment Prediction Competition

1. Competition Overview

Background
A competition on predicting how much game players will pay.

Task
Build a model that uses a player's in-game data from the first 7 days after account registration to predict the player's spending within 45 days.

Data
*Note: data download access is granted after signing up for the competition or joining a team.
The data consists of the following (all files are UTF-8 encoded):

1) Training set (labeled): 2,288,007 samples. tap_fun_train.csv contains all information for the training samples; user_id is the sample id, prediction_pay_price is the training label, and the remaining fields are features.

2) Test set: 828,934 samples. tap_fun_test.csv contains the test-set features; except for the missing prediction_pay_price field, its format is identical to tap_fun_train.csv. The goal is to predict the spending by day 45, prediction_pay_price, as accurately as possible.

3) tap4fun 数据字段解释.xlsx explains the 109 data fields used in the competition. Every attribute is numeric and there are no missing values.

Approach:
This is really a fairly simple little competition. The key is handling the features; since the final evaluation is based on squared error (RMSE), plain linear regression (LR) actually works best.

A baseline using only LR scores roughly 72.
# -*- coding: utf-8 -*-
"""
@Created on 2018/6/22 14:01
@Author: Pengjiaxin
"""
import pandas as pd
import math
import numpy as np
import os
from eval_offline import rmsle  # local offline-evaluation helper

os.chdir('../data')

# Show up to 200 columns when printing
pd.set_option('display.max_columns', 200)

# File names
train_file = 'tap_fun_train.csv'
test_file = 'tap_fun_test.csv'
train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)

train_data = train_data.fillna(0)
test_data = test_data.fillna(0)

print(train_data.shape, test_data.shape)
y = train_data.pop('prediction_pay_price')

drop = ['user_id', 'register_time']

train_idx = train_data[drop]
test_idx = test_data[drop]
train_data = train_data.drop(drop, axis=1)
test_data = test_data.drop(drop, axis=1)

cols = ['pay_price']

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(train_data[cols], y)

y_prob = lr.predict(test_data[cols])
test_idx['prediction_pay_price'] = y_prob
print(test_idx.prediction_pay_price.value_counts())
test_idx[['user_id', 'prediction_pay_price']].to_csv("sub.csv", index=False)

After trying the baseline, work on the features. From the training-set attributes you can see that most of them are useless; the genuinely useful ones are registration time, payments within the first 7 days, and payments within the first 45 days.

Following this idea, all paying users can be split into:

  1. paid within the first 7 days but not in the rest of the 45-day window
  2. paid within the first 7 days and kept paying within the 45-day window
  3. did not pay within the first 7 days but paid within 45 days (this group is 9.9% of all paying users)

The focus is on these paying users: after dropping the non-payers, first classify users into groups 1 and 2, then run a regression on group 2 (the users who keep paying).

The data statistics are shown in the figure below:

Figure: paying-user statistics

For the regression part, use the relationship between registration time and payment amount: split pay_price according to its distribution and predict each part separately. Here the pay_price distribution is split into three levels, high, mid and low, and the data for those days in train_date_high is used to predict the data in test_date_high.

Figure: registration-time distribution

Code:
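
The code block from the original post is not reproduced here. As a rough illustration of the high/mid/low idea described above, the sketch below bins payers by pay_price quantiles and fits one simple regressor per bin; the quantile cut points, the feature_cols parameter and the choice of LinearRegression are illustrative assumptions, not the author's actual code.

# Minimal sketch: split payers into low/mid/high pay_price bins and fit one model per bin.
# Assumptions: the (0.6, 0.9) quantile cut points, feature_cols and LinearRegression are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_by_pay_level(train_df, test_df, feature_cols):
    q1, q2 = train_df['pay_price'].quantile([0.6, 0.9])
    bins = [-1, q1, q2, float('inf')]
    labels = ['low', 'mid', 'high']
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['pay_level'] = pd.cut(train_df['pay_price'], bins=bins, labels=labels)
    test_df['pay_level'] = pd.cut(test_df['pay_price'], bins=bins, labels=labels)

    preds = pd.Series(index=test_df.index, dtype=float)
    for level in labels:
        tr = train_df[train_df['pay_level'] == level]
        te = test_df[test_df['pay_level'] == level]
        if len(tr) == 0 or len(te) == 0:
            continue
        model = LinearRegression()
        model.fit(tr[feature_cols], tr['prediction_pay_price'])
        preds.loc[te.index] = model.predict(te[feature_cols])
    return preds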

This gained about two points, but it may well be overfitting and fall apart on the B leaderboard, in which case it is useless. So keep looking for better ideas; the features deserve proper analysis.

Next, a clearer, more systematic analysis:

I. Getting to know the data

Don't rush into models and parameter tuning; get familiar with the data first:

  1. This is customer data from a mobile game; I registered an account and played around a bit.

  2. What kinds of features are there: user_id, registration time, in-game resources (wood, stone, ivory, meat, magic, etc.), troop types (warriors, beast tamers, shamans), speed-up items (general, building, research, training, healing), buildings, research, PVP (player vs. player), PVE (player vs. environment), online time, payment amount and payment count. All of these cover the first 7 days after registration.

  3. Label: the payment amount within the first 45 days.

A few more questions worth asking:

1. Do any user_ids appear in both train and test?

2. How is the data distributed over time?

3. How many paying players are there, how much do they pay, what does the distribution look like, and how does the paying ratio change over time?

4. How long do players stay online?

5. What single-payment amounts are there?

1. Do any user_ids appear in both train and test?

Read the data

data = pd.read_csv("tap_fun_train.csv", parse_dates=True)
data_test = pd.read_csv("tap_fun_test.csv", parse_dates=True)

Extract the user_id columns and merge them

data_id = pd.DataFrame(data['user_id'],columns=['user_id'])
data_test_id = pd.DataFrame(data_test['user_id'],columns=['user_id'])
pd.merge(data_id, data_test_id, on = 'user_id')

There are no duplicated user_ids; it was just a sanity check, nothing to overthink, haha!

2. Distribution of player registration times

Add two columns

data['register_time_month'] = data.register_time.str[:7]    # e.g. '2018-02'
data['register_time_day'] = data.register_time.str[6:10]    # e.g. '2-19', a month-day label used for per-day stats

Count and save as a dataframe

data_month_df = pd.DataFrame(data['register_time_month'].value_counts()).sort_index()
print(data_month_df)
data_day_df = pd.DataFrame(data['register_time_day'].value_counts()).sort_index()
print(data_day_df)

Plotting: this uses pyecharts, the Python version of echarts; handy and good-looking.

from pyecharts import Line, Grid

line1 = Line("玩家数量统计-月")
line1.add("玩家数量", data_month_df.index, data_month_df['register_time_month'], mark_line=["average"], mark_point=["max", "min"])

line2 = Line("玩家数量统计-日",title_top="50%")
line2.add("玩家数量", data_day_df.index, data_day_df['register_time_day'], mark_line=["average"], mark_point=["max", "min"])

grid = Grid(width = 1000, height = 1000)
grid.add(line1, grid_bottom="60%")
grid.add(line2, grid_top="60%")
grid.render()

grid

Monthly data: incomplete months; January only has data from the 26th onwards, and March only runs to the 6th.

Daily data: on average about 57k players register per day, quite a lot. On Feb 19 there were 117k registrations, twice the usual level, and another 93k on the 20th; was there a promotion? The overall trend in registrations, however, is downward.

Figure: player registration distribution

3. How many paying players are there, how does the paying ratio change over time, how much do they pay, and what is the distribution?

》How many paying players?
data_pay_7 = copy.copy(data[data['pay_price']>0])
print(data_pay_7.shape)   # (41439, 111)
print(data_pay_7.shape[0]/data.shape[0])  # 0.018111395638212645

41,439 players paid within the first 7 days, roughly 1.811% of all players.

Figure: registrations vs. payments

》How the paying-player ratio changes over time

# ----------------------------- count, rename the column (to avoid a name clash), concat, compute the ratio
data_pay_7_day_df = pd.DataFrame(data_pay_7['register_time_day'].value_counts()).sort_index()
data_pay_7_day_df.rename(columns={'register_time_day':'pay_register_time_day'}, inplace = True)
data_day_count = pd.concat([data_pay_7_day_df, data_day_df], axis=1)
data_day_count['pay_percent'] = data_day_count['pay_register_time_day'] / data_day_count['register_time_day']
# ----------------------------- plotting
from pyecharts import Overlap

line3 = Line()
line3.add("注册玩家数量", data_day_count.index, data_day_count['register_time_day'], mark_line=["average"], mark_point=["max", "min"])

line4 = Line()
line4.add("7天内付费玩家数量", data_day_count.index, data_day_count['pay_register_time_day'], mark_line=["average"], 
          mark_point=["max", "min"], yaxis_max=3000)

overlap = Overlap()
# By default no new x/y axis is added; both x- and y-axis indices are 0
overlap.add(line3)
# Add one more y axis: there are now 2 y axes and the new one has index 1 (0-based), so set yaxis_index = 1
# The same x axis is shared, so nothing on the x-axis side needs to change
overlap.add(line4, yaxis_index=1, is_add_yaxis=True)
overlap.render()

overlap


from pyecharts import Bar, Overlap

line3 = Line()
line3.add("注册玩家数量", data_day_count.index, data_day_count['register_time_day'], mark_line=["average"], mark_point=["max", "min"])

bar = Bar()
bar.add("7天内付费玩家比例", data_day_count.index, data_day_count['pay_percent'], yaxis_max=0.1)


overlap = Overlap()
# By default no new x/y axis is added; both x- and y-axis indices are 0
overlap.add(line3)
# Add one more y axis: there are now 2 y axes and the new one has index 1 (0-based), so set yaxis_index = 1
# The same x axis is shared, so nothing on the x-axis side needs to change
overlap.add(bar, yaxis_index=1, is_add_yaxis=True)
overlap.render()

overlap

The average 7-day paying ratio among registered users is 1.811%. Broken down by registration day there is some fluctuation, but not much.

Interestingly, on the days with the most registrations (Feb 1, 7, 8, 15, 19, 20, etc.) the number of successfully converted payers does not rise with the tide; the paying count stays roughly the same. Do users brought in by promotions simply not pay?

》How much do they pay?

data_pay_45 = copy.copy(data[data['prediction_pay_price']!=0])
print(data_pay_45['prediction_pay_price'].describe())
print('Total paid in the first 45 days:',data_pay_45['prediction_pay_price'].sum())

(Output above) about 46k customers paid within the first 45 days, 4.10 million in total; the biggest whale spent 33k. Whales, go figure.

data_pay_7 = copy.copy(data[data['pay_price']!=0])
print(data_pay_7['pay_price'].describe())
print('Total paid in the first 7 days:',data_pay_7['pay_price'].sum())  
print('Total paid in the first 45 days:',data_pay_7['prediction_pay_price'].sum())

(Output above) about 41.4k customers already paid within the first 7 days, 90.1% of the 46k payers (the other 9.9% paid nothing in the first 7 days but paid later).

They spent 1.22 million in the first 7 days and 3.917 million over the first 45 days, i.e. 95.5% of the 4.10 million total paid by all 46k payers within 45 days.

data_nopay_7_pay_45 = copy.copy(data_pay_45[data_pay_45['pay_price']==0])
print(data_nopay_7_pay_45['prediction_pay_price'].describe())
print('Total paid by those who paid nothing in the first 7 days but paid within 45 days:',data_nopay_7_pay_45['prediction_pay_price'].sum())

Figure: payment breakdown

(Output above) 4,549 customers paid nothing in the first 7 days but did pay within 45 days, i.e. 9.9% of payers; they paid 186k, 4.53% of the total.

data_pay_7_nopay_45 = copy.copy(data_pay_7[data_pay_7['pay_price']==data_pay_7['prediction_pay_price']])
print(data_pay_7_nopay_45['pay_price'].describe())
print(data_pay_7_nopay_45['pay_count'].describe())
print('Total paid in the first 7 days:',data_pay_7_nopay_45['pay_price'].sum())  
print('Number who paid in the first 7 days but never again within 45 days:',data_pay_7_nopay_45.shape[0])

(Output above) about 30k players paid in the first 7 days and then never again within 45 days, 65.5% of the 46k payers. They paid 344k in the first 7 days, 28.1% of the 7-day total and 8.4% of the 4.10 million 45-day total. (Did they give up on the game?)

data_pay_7_pay_45 = copy.copy(data_pay_7[data_pay_7['pay_price']<data_pay_7['prediction_pay_price']])
print(data_pay_7_pay_45['pay_price'].describe())
print(data_pay_7_pay_45['pay_count'].describe())
print('Total paid in the first 7 days:',data_pay_7_pay_45['pay_price'].sum())  
print('Total paid in the first 45 days:',data_pay_7_pay_45['prediction_pay_price'].sum())  
print('Number who paid in the first 7 days and kept paying within 45 days:',data_pay_7_pay_45.shape[0])

(Output above) about 11.3k players paid in the first 7 days and kept paying afterwards, 25% of the 46k payers, yet they account for 87.1% of the 4.10 million total, and most of them made their second payment fairly quickly within the first 7 days.

Takeaway: this group (highlighted in red in the original post) is the core population.

4. How long do players stay online?

data['avg_online_minutes'].describe()

(Output above) overall, 75% of players spend less than 5 minutes online; it seems most download the game, take a quick look, and never play again.

data_pay = copy.copy(data[data['pay_price']!=0])
# data_pay.shape
data_pay['avg_online_minutes'].describe()

(Output above) paying users average 140 minutes online, i.e. about 20 minutes per day over the 7 days.

5. What single-payment amounts are there?

data_once = copy.copy(data[data['pay_count']==1])
# data_once.shape  
data_once.groupby("pay_price")["pay_count"].sum()

Looking at customers with exactly one payment, there are a handful of single-payment price points, all ending in .99.

6. Looking at the evaluation metric

The metric is RMSE, root mean squared error.

If one whale actually pays 15,000 but the model predicts 1,000, then (15000-1000)^2/828934 ≈ 236, whose square root is 15.38: one badly predicted whale alone sends the score through the roof.

If a customer pays 0.99 but is predicted at 1.99, then (1.99-0.99)^2 = 1 and its square root is also 1; even if every customer were off by 1, RMSE would only go up by about 1.

So success or failure hinges on the whales???
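
A quick numeric check of the two cases above, using the test-set size of 828,934:

import math

n = 828934  # number of test samples

# One badly predicted whale: true 15000, predicted 1000
print(math.sqrt((15000 - 1000) ** 2 / n))   # ~15.38

# Every player off by exactly 1: the RMSE is simply 1
print(math.sqrt(n * 1.0 ** 2 / n))          # 1.0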

7. The test data set

Was there a promotion around March 10 similar to the one on Feb 19?

The spending distribution is very similar to the training set's.

II. Solution strategy

Staring at the figure above, the idea emerges: use the first-7-day payment data in two steps, first classify who will keep paying, then use regression to estimate how much the players who keep paying will pay. There is also a third part: simply ignore the players who paid nothing in the first 7 days but will pay within 45 days. The picture then becomes the one below.

1. Classification

》Preparing the data

First do a bit of preprocessing, mainly converting the object-typed feature; there is no normalization or anything of that sort.

data = pd.read_csv("tap_fun_train.csv", parse_dates=True)

# Extract the object-typed columns and their data
object_columns_df = data.select_dtypes(include=["object"])
# List the object columns
print(object_columns_df.iloc[1])

# There is exactly one object feature, register_time; deal with it

data['register_time_month'] = data.register_time.str[5:7]
data['register_time_day'] = data.register_time.str[8:10]
data = data.drop(['register_time'],axis=1) 

# Convert object to numeric
data[['register_time_month','register_time_day']] = data[['register_time_month','register_time_day']].apply(pd.to_numeric)
# data=pd.DataFrame(data,dtype=np.float) 

# After splitting registration time into month and day, also combine them into a single number that better reflects temporal order
data['register_time_count'] = data['register_time_month'] * 31 + data['register_time_day'] 

data.shape
# (2288007, 111)

# Save the customers who paid within the first 7 days
data_7_pay = copy.copy(data[data['pay_price']>0])
data_7_pay.shape
# (41439, 111)
data_7_pay.to_csv ("tap_fun_train_7_pay.csv")
Create the label, drop the fields that should not be there, etc.

# ------------------------- read the train-set details of players who paid within the first 7 days
data = pd.read_csv("tap_fun_train_7_pay.csv", index_col=0, parse_dates=True)

print(data.shape)  #(41439, 111)

# ------------------------- create the label
data['7_45_same_pay_label'] = (data['pay_price'] == data['prediction_pay_price'])
data['7_45_same_pay_label']=data['7_45_same_pay_label'].map({True:1,False:0})

data['7_45_same_pay_label'].value_counts()

# 1    30130   paid in the first 7 days, never again afterwards
# 0    11309   paid in the first 7 days and kept paying afterwards

# Drop fields we must not use, e.g. the 45-day payment amount (which the test set does not have) and user_id
data = data.drop(['prediction_pay_price', 'user_id'],axis=1)
data.shape

》Training the classification model

Split the data into training and validation parts.

from sklearn.model_selection  import train_test_split

label = '7_45_same_pay_label'

# Separate X and y
X = data.loc[:, data.columns != label]
y = data.loc[:, data.columns == label]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.22, random_state = 0)
print("100% data")
print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# print(y_train.info())

# After the train/test split, recombine the train part into data_train; the test part is kept aside untouched and is only used for evaluation.
data_train = pd.concat( [X_train, y_train], axis=1 )
print(data.shape)
print('---------------------------------------')
print(data_train.shape)
# show_class(data_train,label)  # a self-written helper, left commented out


# Create a dataframe to record each model's performance, summarized at the end.
thresholds = [0.1,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.6,0.65,0.7,0.75,0.8,0.85,0.9]
thresholds_2 = thresholds[:]  # a real copy; writing thresholds_2 = thresholds would just alias the same list in memory
thresholds_2.append('time')

print(thresholds_2)
result_model_f1 = pd.DataFrame(index=thresholds_2)

print(result_model_f1)

Training: try GradientBoostingClassifier first, then validate on the held-out 22% of the data; the f1 is 87.55%, which is pretty decent.

import time
start = time.time()
print('start time:',time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())))

from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt  # needed for the confusion-matrix plot below

gradient_boosting_classifier = GradientBoostingClassifier()
gradient_boosting_classifier.fit(X_train,y_train.values.ravel())

y_pred = gradient_boosting_classifier.predict(X_test.values)

# Plot non-normalized confusion matrix (plot_confusion_matrix is a self-written helper, not shown)
plt.figure()
plot_confusion_matrix(y_test,y_pred,title='Confusion matrix')

end = time.time()
print(end-start,'s')




Try different threshold values for the split to see whether any does better: 0.45 is slightly better at 87.69%, 0.14% above the default 0.5, a small improvement.

#2、---------------------------------- predicted probabilities ----------------------------------
y_pred_proba = gradient_boosting_classifier.predict_proba(X_test.values)  # array of shape [n_samples, n_classes]

from sklearn.metrics import f1_score  # used below; print_recall_precision_f1 and plot_confusion_matrix are self-written helpers

#3、---------------------------------- record the results under each threshold ----------------------------------
result_model_f1['GradientBoostingClassifier'] = 0  # add the column, initialised to 0
print(result_model_f1)

for i in thresholds:
    y_test_predictions_high_recall = y_pred_proba[:,1] > i
    print('Threshold >= %s'%i)
    print_recall_precision_f1(y_test,y_test_predictions_high_recall)

print("------------------------------------")

for i in thresholds:
    y_test_predictions_high_recall = y_pred_proba[:,1] > i
    plt.figure(figsize=(4,4))
    plot_confusion_matrix(y_test,y_test_predictions_high_recall, title='Threshold >= %s'%i)
    result_model_f1.loc[i,'GradientBoostingClassifier'] = f1_score(y_test.values,y_test_predictions_high_recall) # record the f1

result_model_f1.loc['time','GradientBoostingClassifier'] = end-start # record the elapsed time

print(result_model_f1)

A side note:

xgboost was tried as well; with default parameters it was slightly worse.
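
The xgboost attempt itself is not shown in the post; a minimal sketch of such a default-parameter comparison, assuming the same X_train/X_test/y_train/y_test split as above, might look like this (not the author's code):

from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# Default-parameter XGBoost on the same split, for comparison with GradientBoostingClassifier
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train.values.ravel())
print(f1_score(y_test.values, xgb_clf.predict(X_test.values)))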

Export the model

from sklearn.externals import joblib

print(gradient_boosting_classifier)
joblib.dump(gradient_boosting_classifier, 'gradient_boosting_classifier.model')
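
Loading the model back later uses the matching joblib.load call, for example:

gradient_boosting_classifier = joblib.load('gradient_boosting_classifier.model')  # in recent scikit-learn versions, import joblib directly instead of sklearn.externals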

2. Regression

》Preparing the data

data = pd.read_csv("tap_fun_train_7_pay.csv", index_col=0, parse_dates=True)

data_pay_more = copy.copy(data[data['pay_price']<data['prediction_pay_price']])
data_pay_more.shape

#  (11309, 111)  about 11.3k players paid in the first 7 days and kept paying; these are used for the regression.


# drop user_id
data_pay_more = data_pay_more.drop([ 'user_id'],axis=1)
data = copy.copy(data_pay_more)
data.shape

》Training the model

Split the data into training and validation parts

from sklearn.model_selection  import train_test_split

Quantity = 'prediction_pay_price'

# Separate X and y
X = data.loc[:, data.columns =='pay_price']
# X = data.loc[:, data.columns != Quantity]
y = data.loc[:, data.columns == Quantity]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.22, random_state = 0)
print("100% data")
print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

Now train: still GradientBoosting, but the regressor version, GradientBoostingRegressor.

import time
start = time.time()
print('start time:',time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())))

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

gradient_boosting_regression = GradientBoostingRegressor()
gradient_boosting_regression.fit(X_train,y_train.values.ravel())

y_pred = gradient_boosting_regression.predict(X_test.values)

# The mean squared error
print("Root Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred) ** 0.5)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))

end = time.time()
print(end-start,'s')

The root mean squared error is already at 851, off the charts; ignore that for now and submit to the system to see how it scores.

3. Predict with the models and submit to see the ranking

》Process the test set into the same shape the train set had before training, otherwise the models won't accept it

data = pd.read_csv("tap_fun_test.csv", parse_dates=True)
data.shape #(828934, 108)

# Extract the object-typed columns and their data
object_columns_df = data.select_dtypes(include=["object"])
# List the object columns
print(object_columns_df.iloc[1])
# There is one object feature, register_time; deal with it

data['register_time_month'] = data.register_time.str[5:7]
data['register_time_day'] = data.register_time.str[8:10]
data = data.drop(['register_time'],axis=1) 
# Convert object to numeric
data[['register_time_month','register_time_day']] = data[['register_time_month','register_time_day']].apply(pd.to_numeric)

# After splitting registration time into month and day, also combine them into a single number that better reflects temporal order
data['register_time_count'] = data['register_time_month'] * 31 + data['register_time_day'] 

data.shape
#(828934, 110)

# Save the customers who paid within the first 7 days
data_7_pay = copy.copy(data[data['pay_price']>0])
data_7_pay.shape  # (19549, 110)
data_7_pay.to_csv ("tap_fun_test_7_pay.csv")

# Save the customers who paid nothing within the first 7 days
data_7_NOT_pay = copy.copy(data[data['pay_price']==0])
data_7_NOT_pay.shape  #(809385, 110)
data_7_NOT_pay.to_csv ("tap_fun_test_7_NOT_pay.csv")

》For test-set users who paid nothing in the first 7 days, directly predict that they will not pay later.

data_test = pd.read_csv("tap_fun_test_7_NOT_pay.csv", index_col=0, parse_dates=True)
print(data_test.shape)  # (809385, 110), about 800k customers

# Keep only user_id and the first-7-day pay_price of 0.
data_test_part1 = data_test[['user_id','pay_price']]
data_test_part1.rename(columns={'pay_price':'prediction_pay_price'}, inplace = True)
data_test_part1.to_csv('tap_fun_test_part1_still0.csv')


》Among customers who paid in the first 7 days, classify out those who will not pay again, and save them.

data_test = pd.read_csv("tap_fun_test_7_pay.csv", index_col=0, parse_dates=True)
print(data_test.shape)  # (19549, 110)

data_test_model = data_test.drop(['user_id'], axis=1)  # drop user_id before running the model

# Run the classification model
y_test_pred = gradient_boosting_classifier.predict(data_test_model.values)

# Convert the resulting ndarray into a dataframe
y_test_pred = pd.DataFrame(y_test_pred, columns= {'pred_label'})

# To reset the index from zero, convert to ndarray and back to dataframe
columns_test = data_test.columns
data_test = data_test.values
data_test = pd.DataFrame(data_test, columns = columns_test )

# Put the predictions back next to the original data
y_test_pred = pd.concat([data_test, y_test_pred], axis=1)
y_test_pred.shape # (19549, 111)


y_test_pred['pred_label'].value_counts()
# 1    15587  predicted not to pay again
# 0     3962  predicted to keep paying

# part2: keep the 1s and predict their original (7-day) value.
# part3: send the 0s to the regression to estimate.
y_test_pred_part2 = copy.copy(y_test_pred[y_test_pred['pred_label']==1])
y_test_pred_part3 = copy.copy(y_test_pred[y_test_pred['pred_label']==0])

y_test_pred_part2_user_id = pd.DataFrame(y_test_pred_part2,columns ={'user_id'})
y_test_pred_part2_pay = pd.DataFrame(y_test_pred_part2,columns ={'pay_price'})
y_test_pred_part2 = pd.concat([y_test_pred_part2_user_id, y_test_pred_part2_pay], axis=1)
y_test_pred_part2.rename(columns={'pay_price':'prediction_pay_price'}, inplace = True)

y_test_pred_part2.to_csv('tap_fun_test_part2_nopaymore.csv')
y_test_pred_part3.to_csv('tap_fun_test_part3_paymore.csv')

》For the first-7-day payers that the [classification] step predicts will keep paying, [regression] estimates the payment amount.

y_test_pred_part3 = pd.read_csv("tap_fun_test_part3_paymore.csv", index_col=0, parse_dates=True)
y_test_pred_part3.shape #(3962, 111)

# Keep user_id aside.
user_id_pay_more = y_test_pred_part3['user_id'].values
user_id_pay_more[:10]

# Extract the field we need, i.e. pay_price
y_test_pred_part3_test = pd.DataFrame(y_test_pred_part3,columns=['pay_price'])
y_test_pred_part3_test.shape

# Run the model
y_test_pred_part3_howmuch = gradient_boosting_regression.predict(y_test_pred_part3_test.values)

# Merge the predictions back with user_id.
y_test_pred_part3_user_id = pd.DataFrame(user_id_pay_more,columns = {'user_id'})
y_test_pred_part3_howmuch = pd.DataFrame(y_test_pred_part3_howmuch,columns = {'prediction_pay_price'})

y_test_pred_part3 = pd.concat([y_test_pred_part3_user_id, y_test_pred_part3_howmuch], axis=1)
y_test_pred_part3.shape  # (3962, 2)

y_test_pred_part3.to_csv('tap_fun_test_part3_paymore_result.csv')
》Recombine the results of the three parts

pred_part1 = pd.read_csv("tap_fun_test_part1_still0.csv", index_col=0, parse_dates=True)
print(pred_part1.shape)
pred_part2 = pd.read_csv("tap_fun_test_part2_nopaymore.csv", index_col=0, parse_dates=True)
print(pred_part2.shape)
pred_part3 = pd.read_csv("tap_fun_test_part3_paymore_result.csv", index_col=0, parse_dates=True)
print(pred_part3.shape)

# (809385, 2)
# (15587, 2)
# (3962, 2)

pred = pred_part1.append(pred_part2)
pred = pred.append(pred_part3)
pred.shape # (828934, 2)

pred.to_csv('result.csv')

Fortunately the final row count matches the size of the test set.
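
A small assertion makes that sanity check explicit (a sketch; pred and the test file are as above):

data_test_all = pd.read_csv("tap_fun_test.csv")
assert pred.shape[0] == data_test_all.shape[0]               # 828934 rows
assert pred['user_id'].nunique() == data_test_all.shape[0]   # every test user covered exactly once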

This GradientBoostingClassifier + LR pipeline then scored about 70, without doing much with the data features. Having run out of ideas, I asked for someone else's code, though I never fully understood it:

# -*- coding: utf-8 -*-
from lightgbm import LGBMRegressor
import lightgbm as lgb
import pandas as pd
import xgboost as xgb
import numpy as np
from xgboost import plot_importance
from sklearn.preprocessing import Imputer
from sklearn.cross_validation import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve
from sklearn import metrics
from sklearn.metrics import mean_squared_error
import sys
import catboost as cb
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from math import sqrt
from sklearn.svm import SVR
from sklearn.linear_model import Lasso, LassoLars, BayesianRidge, LinearRegression, Ridge
from sklearn.linear_model import ElasticNet, LassoLars, Lars, OrthogonalMatchingPursuit, ARDRegression, SGDRegressor, RANSACRegressor, HuberRegressor
import matplotlib.pyplot as plt

import os
os.chdir('/Users/lilyhou/Desktop/data/game')


# pd.set_option('display.max_rows', None)
def xgb_model(x_train, x_test, y_train):
    model = xgb.XGBRegressor(max_depth=3, learning_rate=0.07, n_estimators=1000, silent=False, objective='reg:gamma')
    model.fit(x_train, y_train, early_stopping_rounds=100, eval_set=[(x_train, y_train)], verbose=True)

    ans = model.predict(x_test)
    return ans


# 435.90434580437363
def creat_fea(train, type):
    label_time = pd.to_datetime('2018-03-22')
    train['hour'] = train['register_time'].apply(lambda x:x.hour)
    train['diff_day'] = train['register_time'].apply(lambda x:(label_time-x).days)
    train['day'] = train['register_time'].apply(lambda x: x.day)
    train['month'] = train['register_time'].apply(lambda x: x.month)
    train['time'] = train['month']*31+train['day']

    train['one'] = 1
    zh = train[['one', 'time']].groupby(['time']).sum().reset_index()
    del train['one']
    train = pd.merge(train, zh, on='time', how='left').reset_index(drop=True)
    print(train.columns)

    train['weekday'] = train['register_time'].apply(lambda x: x.weekday())

    train['rate_zhu_all_pvp'] = train.apply(lambda x:0 if x.pvp_battle_count == 0 else x.pvp_lanch_count/x.pvp_battle_count, axis=1)
    train['rate_win_all_pvp'] = train.apply(
        lambda x: 0 if x.pvp_battle_count == 0 else x.pvp_win_count / x.pvp_battle_count, axis=1)
    train['rate_zhu_win_pvp'] = train.apply(
        lambda x: x.pvp_lanch_count*2 if x.pvp_win_count == 0 else x.pvp_lanch_count / x.pvp_win_count, axis=1)

    train['rate_win_all_pve'] = train.apply(
        lambda x: 0 if x.pve_battle_count == 0 else x.pve_win_count / x.pve_battle_count, axis=1)

    train['rate_price_time'] = train.apply(
        lambda x: 0 if x.avg_online_minutes == 0 else x.pay_price / x.avg_online_minutes, axis=1)

    # train['win_2p'] = train['pvp_battle_count']+train['pve_battle_count']

    # train['rate_all_time_pve'] = train.apply(
    #     lambda x: 0 if x.avg_online_minutes == 0 else x.pve_battle_count / x.avg_online_minutes, axis=1)

    # train['rate_zhu_time_pve'] = train.apply(
    #     lambda x: 0 if x.avg_online_minutes == 0 else x.pve_lanch_count / x.avg_online_minutes, axis=1)

    # train['rate_win_time_pve'] = train.apply(
    #     lambda x: 0 if x.avg_online_minutes == 0 else x.pve_win_count / x.avg_online_minutes, axis=1)

    # train['rate_zhu_win_pve'] = train.apply(
    #     lambda x: 0 if x.pve_win_count == 0 else x.pve_lanch_count / x.pve_win_count, axis=1)

    # train['rate_win_time_pvp'] = train.apply(
    #     lambda x: 0 if x.avg_online_minutes == 0 else x.pvp_win_count / x.avg_online_minutes, axis=1)
    # train['rate_all_time_pvp'] = train.apply(
    #     lambda x: 0 if x.avg_online_minutes == 0 else x.pvp_battle_count / x.avg_online_minutes, axis=1)

    # train['mean_time_all_zhu'] = train.apply(lambda x: 0 if x.avg_online_minutes == 0 else (x.pve_lanch_count+x.pvp_lanch_count)/x.avg_online_minutes , axis=1)
    # train['mean_time_all_zhu'] = train['avg_online_minutes']/(train['pve_lanch_count']+train['pvp_lanch_count'])
    # train['mean_time_all_zhu'] =  (train['pve_lanch_count'] + train['pvp_lanch_count'])/train['avg_online_minutes']
    # train['rate_win_pap_zhu'] = train['pvp_lanch_count'] / (train['pvp_win_count'])
    # train['mean_price'] = train['pay_price'] / (train['pay_count'])
    # train['rate_time_count'] = train['avg_online_minutes'] / (train['pay_price'])


    # train['mean_time_all_zhu'] = 0
    # train['rate_win_pap_zhu'] = 0
    # train['mean_price'] = 0
    # train['rate_time_count'] = 0



    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['wood_add_value','wood_reduce_value','stone_add_value','stone_reduce_value'
    #                        ,'ivory_add_value','ivory_reduce_value','meat_add_value','meat_reduce_value'
    #                        ,'magic_add_value','magic_reduce_value','infantry_add_value','infantry_reduce_value'
    #                        ,'infantry_reduce_value','cavalry_add_value','cavalry_reduce_value','shaman_add_value'
    #                        ,'shaman_reduce_value','wound_infantry_add_value','wound_infantry_reduce_value'
    #                        ,'wound_cavalry_add_value','wound_cavalry_reduce_value','wound_shaman_add_value'
    #                        ,'wound_shaman_reduce_value']:
    #         mean = train[i].mean()
    #         if train.loc[j, i]>mean:
    #             s = s+1
    #     train.loc[j, 'by_mean_num'] = s
    # print(1)
    #
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['wood_add_value','wood_reduce_value','stone_add_value','stone_reduce_value'
    #                        ,'ivory_add_value','ivory_reduce_value','meat_add_value','meat_reduce_value'
    #                        ,'magic_add_value','magic_reduce_value','infantry_add_value','infantry_reduce_value'
    #                        ,'infantry_reduce_value','cavalry_add_value','cavalry_reduce_value','shaman_add_value'
    #                        ,'shaman_reduce_value','wound_infantry_add_value','wound_infantry_reduce_value'
    #                        ,'wound_cavalry_add_value','wound_cavalry_reduce_value','wound_shaman_add_value'
    #                        ,'wound_shaman_reduce_value']:
    #         mean = train[i].min()
    #         if train.loc[j, i]>mean:
    #             s = s+1
    #     train.loc[j, 'by_min_num'] = s
    # print(2)
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['general_acceleration_add_value','general_acceleration_reduce_value'
    #                    ,'building_acceleration_add_value','building_acceleration_reduce_value','reaserch_acceleration_add_value'
    #                    ,'reaserch_acceleration_reduce_value','training_acceleration_add_value','training_acceleration_reduce_value'
    #                    ,'treatment_acceleraion_add_value','treatment_acceleration_reduce_value']:
    #         mean = train[i].mean()
    #         if train.loc[j, i]>mean:
    #             s = s+1
    #     train.loc[j, 'by_mean_ji_num'] = s
    # print(3)
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['general_acceleration_add_value', 'general_acceleration_reduce_value'
    #         , 'building_acceleration_add_value', 'building_acceleration_reduce_value',
    #               'reaserch_acceleration_add_value'
    #         , 'reaserch_acceleration_reduce_value', 'training_acceleration_add_value',
    #               'training_acceleration_reduce_value'
    #         , 'treatment_acceleraion_add_value', 'treatment_acceleration_reduce_value']:
    #         mean = train[i].min()
    #         if train.loc[j, i] > mean:
    #             s = s + 1
    #     train.loc[j, 'by_min_ji_num'] = s
    # print(4)
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['sr_scout_level','sr_training_speed_level','sr_infantry_tier_2_level','sr_cavalry_tier_2_level','sr_shaman_tier_2_level',
    #                                'sr_infantry_atk_level','sr_cavalry_atk_level','sr_shaman_atk_level','sr_infantry_tier_3_level','sr_cavalry_tier_3_level',
    #                                'sr_shaman_tier_3_level','sr_troop_defense_level','sr_infantry_def_level','sr_cavalry_def_level','sr_shaman_def_level',
    #                                'sr_infantry_hp_level','sr_cavalry_hp_level','sr_shaman_hp_level','sr_infantry_tier_4_level','sr_cavalry_tier_4_level',
    #                                'sr_shaman_tier_4_level','sr_troop_attack_level','sr_construction_speed_level','sr_hide_storage_level',
    #                                'sr_troop_consumption_level','sr_rss_a_prod_levell','sr_rss_b_prod_level','sr_rss_c_prod_level','sr_rss_d_prod_level',
    #                                'sr_rss_a_gather_level','sr_rss_b_gather_level','sr_rss_c_gather_level','sr_rss_d_gather_level','sr_troop_load_level',
    #                                'sr_rss_e_gather_level','sr_rss_e_prod_level','sr_outpost_durability_level','sr_outpost_tier_2_level','sr_healing_space_level',
    #                                'sr_gathering_hunter_buff_level','sr_healing_speed_level','sr_outpost_tier_3_level','sr_alliance_march_speed_level',
    #                                'sr_pvp_march_speed_level','sr_gathering_march_speed_level','sr_outpost_tier_4_level','sr_guest_troop_capacity_level',
    #                                'sr_march_size_level','sr_rss_help_bonus_level']:
    #         mean = train[i].mean()
    #         if train.loc[j, i] > mean:
    #             s = s + 1
    #         train.loc[j, 'by_mean_ji_level'] = s
    # print(5)
    #
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['sr_scout_level', 'sr_training_speed_level', 'sr_infantry_tier_2_level',
    #               'sr_cavalry_tier_2_level', 'sr_shaman_tier_2_level',
    #               'sr_infantry_atk_level', 'sr_cavalry_atk_level', 'sr_shaman_atk_level',
    #               'sr_infantry_tier_3_level', 'sr_cavalry_tier_3_level',
    #               'sr_shaman_tier_3_level', 'sr_troop_defense_level', 'sr_infantry_def_level',
    #               'sr_cavalry_def_level', 'sr_shaman_def_level',
    #               'sr_infantry_hp_level', 'sr_cavalry_hp_level', 'sr_shaman_hp_level',
    #               'sr_infantry_tier_4_level', 'sr_cavalry_tier_4_level',
    #               'sr_shaman_tier_4_level', 'sr_troop_attack_level', 'sr_construction_speed_level',
    #               'sr_hide_storage_level',
    #               'sr_troop_consumption_level', 'sr_rss_a_prod_levell', 'sr_rss_b_prod_level',
    #               'sr_rss_c_prod_level', 'sr_rss_d_prod_level',
    #               'sr_rss_a_gather_level', 'sr_rss_b_gather_level', 'sr_rss_c_gather_level',
    #               'sr_rss_d_gather_level', 'sr_troop_load_level',
    #               'sr_rss_e_gather_level', 'sr_rss_e_prod_level', 'sr_outpost_durability_level',
    #               'sr_outpost_tier_2_level', 'sr_healing_space_level',
    #               'sr_gathering_hunter_buff_level', 'sr_healing_speed_level', 'sr_outpost_tier_3_level',
    #               'sr_alliance_march_speed_level',
    #               'sr_pvp_march_speed_level', 'sr_gathering_march_speed_level', 'sr_outpost_tier_4_level',
    #               'sr_guest_troop_capacity_level',
    #               'sr_march_size_level', 'sr_rss_help_bonus_level']:
    #         mean = train[i].min()
    #         if train.loc[j, i] > mean:
    #             s = s + 1
    #         train.loc[j, 'by_min_ji_level'] = s
    # print(6)
    #
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['bd_training_hut_level', 'bd_healing_lodge_level'
    #                ,'bd_stronghold_level','bd_outpost_portal_level','bd_barrack_level'
    #                ,'bd_healing_spring_level','bd_dolmen_level','bd_guest_cavern_level'
    #                ,'bd_warehouse_level','bd_watchtower_level','bd_magic_coin_tree_level'
    #                ,'bd_hall_of_war_level','bd_market_level','bd_hero_gacha_level'
    #                ,'bd_hero_strengthen_level','bd_hero_pve_level']:
    #         mean = train[i].mean()
    #         if train.loc[j, i] > mean:
    #             s = s + 1
    #         train.loc[j, 'by_mean_buliding_level'] = s
    # print(7)
    #
    #
    # for j in range(len(train['user_id'])):
    #     s = 0
    #     for i in ['bd_training_hut_level', 'bd_healing_lodge_level'
    #         , 'bd_stronghold_level', 'bd_outpost_portal_level', 'bd_barrack_level'
    #         , 'bd_healing_spring_level', 'bd_dolmen_level', 'bd_guest_cavern_level'
    #         , 'bd_warehouse_level', 'bd_watchtower_level', 'bd_magic_coin_tree_level'
    #         , 'bd_hall_of_war_level', 'bd_market_level', 'bd_hero_gacha_level'
    #         , 'bd_hero_strengthen_level', 'bd_hero_pve_level']:
    #         mean = train[i].min()
    #         if train.loc[j, i] > mean:
    #             s = s + 1
    #         train.loc[j, 'by_min_buliding_level'] = s
    # print(8)


        # train['by_%s_mean_num_count' %i] = train.apply(lambda x:x.i.mean())
    train['all_num'] = train['wood_add_value']+train['wood_reduce_value']+train['stone_add_value']+train['stone_reduce_value']\
                       +train['ivory_add_value']+train['ivory_reduce_value']+train['meat_add_value']+train['meat_reduce_value']\
                       +train['magic_add_value']+train['magic_reduce_value']+train['infantry_add_value']+train['infantry_reduce_value']\
                       +train['infantry_reduce_value']+train['cavalry_add_value']+train['cavalry_reduce_value']+train['shaman_add_value']\
                       +train['shaman_reduce_value']+train['wound_infantry_add_value']+train['wound_infantry_reduce_value']\
                       +train['wound_cavalry_add_value']+train['wound_cavalry_reduce_value']+train['wound_shaman_add_value']\
                       +train['wound_shaman_reduce_value']

    train['max_all_num'] = train[['wood_add_value','wood_reduce_value','stone_add_value','stone_reduce_value'
                       ,'ivory_add_value','ivory_reduce_value','meat_add_value','meat_reduce_value'
                       ,'magic_add_value','magic_reduce_value','infantry_add_value','infantry_reduce_value'
                       ,'infantry_reduce_value','cavalry_add_value','cavalry_reduce_value','shaman_add_value'
                       ,'shaman_reduce_value','wound_infantry_add_value','wound_infantry_reduce_value'
                       ,'wound_cavalry_add_value','wound_cavalry_reduce_value','wound_shaman_add_value'
                       ,'wound_shaman_reduce_value']].max(axis=1)



    train['min_all_num'] = train[['wood_add_value', 'wood_reduce_value', 'stone_add_value', 'stone_reduce_value'
        , 'ivory_add_value', 'ivory_reduce_value', 'meat_add_value', 'meat_reduce_value'
        , 'magic_add_value', 'magic_reduce_value', 'infantry_add_value', 'infantry_reduce_value'
        , 'infantry_reduce_value', 'cavalry_add_value', 'cavalry_reduce_value', 'shaman_add_value'
        , 'shaman_reduce_value', 'wound_infantry_add_value', 'wound_infantry_reduce_value'
        , 'wound_cavalry_add_value', 'wound_cavalry_reduce_value', 'wound_shaman_add_value'
        , 'wound_shaman_reduce_value']].min(axis=1)

    train['std_all_num'] = train[['wood_add_value', 'wood_reduce_value', 'stone_add_value', 'stone_reduce_value'
        , 'ivory_add_value', 'ivory_reduce_value', 'meat_add_value', 'meat_reduce_value'
        , 'magic_add_value', 'magic_reduce_value', 'infantry_add_value', 'infantry_reduce_value'
        , 'infantry_reduce_value', 'cavalry_add_value', 'cavalry_reduce_value', 'shaman_add_value'
        , 'shaman_reduce_value', 'wound_infantry_add_value', 'wound_infantry_reduce_value'
        , 'wound_cavalry_add_value', 'wound_cavalry_reduce_value', 'wound_shaman_add_value'
        , 'wound_shaman_reduce_value']].std(axis=1)

    train['ji_num'] = train['general_acceleration_add_value']+train['general_acceleration_reduce_value']\
                       +train['building_acceleration_add_value']+train['building_acceleration_reduce_value']+train['reaserch_acceleration_add_value']\
                       +train['reaserch_acceleration_reduce_value']+train['training_acceleration_add_value']+train['training_acceleration_reduce_value']\
                       +train['treatment_acceleraion_add_value']+train['treatment_acceleration_reduce_value']

    train['max_ji_num'] = train[['general_acceleration_add_value','general_acceleration_reduce_value'
                       ,'building_acceleration_add_value','building_acceleration_reduce_value','reaserch_acceleration_add_value'
                       ,'reaserch_acceleration_reduce_value','training_acceleration_add_value','training_acceleration_reduce_value'
                       ,'treatment_acceleraion_add_value','treatment_acceleration_reduce_value']].max(axis=1)

    train['min_ji_num'] = train[['general_acceleration_add_value', 'general_acceleration_reduce_value'
        , 'building_acceleration_add_value', 'building_acceleration_reduce_value', 'reaserch_acceleration_add_value'
        , 'reaserch_acceleration_reduce_value', 'training_acceleration_add_value', 'training_acceleration_reduce_value'
        , 'treatment_acceleraion_add_value', 'treatment_acceleration_reduce_value']].min(axis=1)


    train['ji_level'] = train['sr_scout_level'] + train['sr_training_speed_level'] \
                              + train['sr_infantry_tier_2_level'] + train['sr_cavalry_tier_2_level'] + train[
                                  'sr_shaman_tier_2_level'] \
                              + train['sr_infantry_atk_level'] + train['sr_cavalry_atk_level'] + train[
                                  'sr_shaman_atk_level'] \
                              + train['sr_infantry_tier_3_level'] + train['sr_cavalry_tier_3_level'] + train[
                                  'sr_shaman_tier_3_level'] \
                              + train['sr_troop_defense_level'] + train['sr_infantry_def_level'] + train['sr_cavalry_def_level'] \
                              + train['sr_shaman_def_level'] + train['sr_infantry_hp_level']+  train['sr_cavalry_hp_level'] \
                        + train['sr_shaman_hp_level'] \
                              + train['sr_infantry_tier_4_level'] + train['sr_cavalry_tier_4_level'] + train[
                                  'sr_shaman_tier_4_level'] \
                              + train['sr_troop_attack_level'] + train['sr_construction_speed_level'] + train[
                                  'sr_hide_storage_level'] \
                              + train['sr_troop_consumption_level'] + train['sr_rss_a_prod_levell'] + train[
                                  'sr_rss_b_prod_level'] \
                              + train['sr_rss_c_prod_level'] + train['sr_rss_d_prod_level'] + train['sr_rss_a_gather_level'] \
                              + train['sr_rss_b_gather_level'] + train['sr_rss_c_gather_level'] + train['sr_rss_d_gather_level'] + train[
                                  'sr_troop_load_level'] \
                              + train['sr_rss_e_gather_level'] + train['sr_rss_e_prod_level'] + train['sr_outpost_durability_level'] \
                              + train['sr_outpost_tier_2_level'] + train['sr_healing_space_level']    + train['sr_gathering_hunter_buff_level'] + train[
                                  'sr_healing_speed_level'] \
                              + train['sr_outpost_tier_3_level'] + train['sr_alliance_march_speed_level'] + train[
                                  'sr_pvp_march_speed_level'] \
                              + train['sr_gathering_march_speed_level'] + train['sr_outpost_tier_4_level'] + train['sr_guest_troop_capacity_level'] \
                              + train['sr_march_size_level'] + train['sr_rss_help_bonus_level']

    train['std_ji_level'] = train[['sr_scout_level','sr_training_speed_level','sr_infantry_tier_2_level','sr_cavalry_tier_2_level','sr_shaman_tier_2_level',
                                   'sr_infantry_atk_level','sr_cavalry_atk_level','sr_shaman_atk_level','sr_infantry_tier_3_level','sr_cavalry_tier_3_level',
                                   'sr_shaman_tier_3_level','sr_troop_defense_level','sr_infantry_def_level','sr_cavalry_def_level','sr_shaman_def_level',
                                   'sr_infantry_hp_level','sr_cavalry_hp_level','sr_shaman_hp_level','sr_infantry_tier_4_level','sr_cavalry_tier_4_level',
                                   'sr_shaman_tier_4_level','sr_troop_attack_level','sr_construction_speed_level','sr_hide_storage_level',
                                   'sr_troop_consumption_level','sr_rss_a_prod_levell','sr_rss_b_prod_level','sr_rss_c_prod_level','sr_rss_d_prod_level',
                                   'sr_rss_a_gather_level','sr_rss_b_gather_level','sr_rss_c_gather_level','sr_rss_d_gather_level','sr_troop_load_level',
                                   'sr_rss_e_gather_level','sr_rss_e_prod_level','sr_outpost_durability_level','sr_outpost_tier_2_level','sr_healing_space_level',
                                   'sr_gathering_hunter_buff_level','sr_healing_speed_level','sr_outpost_tier_3_level','sr_alliance_march_speed_level',
                                   'sr_pvp_march_speed_level','sr_gathering_march_speed_level','sr_outpost_tier_4_level','sr_guest_troop_capacity_level',
                                   'sr_march_size_level','sr_rss_help_bonus_level']].std(axis=1)

    # skewness
    train['skew_ji_level'] = train[
        ['sr_scout_level', 'sr_training_speed_level', 'sr_infantry_tier_2_level', 'sr_cavalry_tier_2_level',
         'sr_shaman_tier_2_level',
         'sr_infantry_atk_level', 'sr_cavalry_atk_level', 'sr_shaman_atk_level', 'sr_infantry_tier_3_level',
         'sr_cavalry_tier_3_level',
         'sr_shaman_tier_3_level', 'sr_troop_defense_level', 'sr_infantry_def_level', 'sr_cavalry_def_level',
         'sr_shaman_def_level',
         'sr_infantry_hp_level', 'sr_cavalry_hp_level', 'sr_shaman_hp_level', 'sr_infantry_tier_4_level',
         'sr_cavalry_tier_4_level',
         'sr_shaman_tier_4_level', 'sr_troop_attack_level', 'sr_construction_speed_level', 'sr_hide_storage_level',
         'sr_troop_consumption_level', 'sr_rss_a_prod_levell', 'sr_rss_b_prod_level', 'sr_rss_c_prod_level',
         'sr_rss_d_prod_level',
         'sr_rss_a_gather_level', 'sr_rss_b_gather_level', 'sr_rss_c_gather_level', 'sr_rss_d_gather_level',
         'sr_troop_load_level',
         'sr_rss_e_gather_level', 'sr_rss_e_prod_level', 'sr_outpost_durability_level', 'sr_outpost_tier_2_level',
         'sr_healing_space_level',
         'sr_gathering_hunter_buff_level', 'sr_healing_speed_level', 'sr_outpost_tier_3_level',
         'sr_alliance_march_speed_level',
         'sr_pvp_march_speed_level', 'sr_gathering_march_speed_level', 'sr_outpost_tier_4_level',
         'sr_guest_troop_capacity_level',
         'sr_march_size_level', 'sr_rss_help_bonus_level']].skew(axis=1)

    train['min_ji_level'] = train[
        ['sr_scout_level', 'sr_training_speed_level', 'sr_infantry_tier_2_level', 'sr_cavalry_tier_2_level',
         'sr_shaman_tier_2_level',
         'sr_infantry_atk_level', 'sr_cavalry_atk_level', 'sr_shaman_atk_level', 'sr_infantry_tier_3_level',
         'sr_cavalry_tier_3_level',
         'sr_shaman_tier_3_level', 'sr_troop_defense_level', 'sr_infantry_def_level', 'sr_cavalry_def_level',
         'sr_shaman_def_level',
         'sr_infantry_hp_level', 'sr_cavalry_hp_level', 'sr_shaman_hp_level', 'sr_infantry_tier_4_level',
         'sr_cavalry_tier_4_level',
         'sr_shaman_tier_4_level', 'sr_troop_attack_level', 'sr_construction_speed_level', 'sr_hide_storage_level',
         'sr_troop_consumption_level', 'sr_rss_a_prod_levell', 'sr_rss_b_prod_level', 'sr_rss_c_prod_level',
         'sr_rss_d_prod_level',
         'sr_rss_a_gather_level', 'sr_rss_b_gather_level', 'sr_rss_c_gather_level', 'sr_rss_d_gather_level',
         'sr_troop_load_level',
         'sr_rss_e_gather_level', 'sr_rss_e_prod_level', 'sr_outpost_durability_level', 'sr_outpost_tier_2_level',
         'sr_healing_space_level',
         'sr_gathering_hunter_buff_level', 'sr_healing_speed_level', 'sr_outpost_tier_3_level',
         'sr_alliance_march_speed_level',
         'sr_pvp_march_speed_level', 'sr_gathering_march_speed_level', 'sr_outpost_tier_4_level',
         'sr_guest_troop_capacity_level',
         'sr_march_size_level', 'sr_rss_help_bonus_level']].min(axis=1)

    train['max_ji_level'] = train[
        ['sr_scout_level', 'sr_training_speed_level', 'sr_infantry_tier_2_level', 'sr_cavalry_tier_2_level',
         'sr_shaman_tier_2_level',
         'sr_infantry_atk_level', 'sr_cavalry_atk_level', 'sr_shaman_atk_level', 'sr_infantry_tier_3_level',
         'sr_cavalry_tier_3_level',
         'sr_shaman_tier_3_level', 'sr_troop_defense_level', 'sr_infantry_def_level', 'sr_cavalry_def_level',
         'sr_shaman_def_level',
         'sr_infantry_hp_level', 'sr_cavalry_hp_level', 'sr_shaman_hp_level', 'sr_infantry_tier_4_level',
         'sr_cavalry_tier_4_level',
         'sr_shaman_tier_4_level', 'sr_troop_attack_level', 'sr_construction_speed_level', 'sr_hide_storage_level',
         'sr_troop_consumption_level', 'sr_rss_a_prod_levell', 'sr_rss_b_prod_level', 'sr_rss_c_prod_level',
         'sr_rss_d_prod_level',
         'sr_rss_a_gather_level', 'sr_rss_b_gather_level', 'sr_rss_c_gather_level', 'sr_rss_d_gather_level',
         'sr_troop_load_level',
         'sr_rss_e_gather_level', 'sr_rss_e_prod_level', 'sr_outpost_durability_level', 'sr_outpost_tier_2_level',
         'sr_healing_space_level',
         'sr_gathering_hunter_buff_level', 'sr_healing_speed_level', 'sr_outpost_tier_3_level',
         'sr_alliance_march_speed_level',
         'sr_pvp_march_speed_level', 'sr_gathering_march_speed_level', 'sr_outpost_tier_4_level',
         'sr_guest_troop_capacity_level',
         'sr_march_size_level', 'sr_rss_help_bonus_level']].max(axis=1)


    train['max_buliding_level'] = train[['bd_training_hut_level', 'bd_healing_lodge_level'
                       ,'bd_stronghold_level','bd_outpost_portal_level','bd_barrack_level'
                       ,'bd_healing_spring_level','bd_dolmen_level','bd_guest_cavern_level'
                       ,'bd_warehouse_level','bd_watchtower_level','bd_magic_coin_tree_level'
                       ,'bd_hall_of_war_level','bd_market_level','bd_hero_gacha_level'
                       ,'bd_hero_strengthen_level','bd_hero_pve_level']].max(axis=1)

    train['std_buliding_level'] = train[['bd_training_hut_level', 'bd_healing_lodge_level'
        , 'bd_stronghold_level', 'bd_outpost_portal_level', 'bd_barrack_level'
        , 'bd_healing_spring_level', 'bd_dolmen_level', 'bd_guest_cavern_level'
        , 'bd_warehouse_level', 'bd_watchtower_level', 'bd_magic_coin_tree_level'
        , 'bd_hall_of_war_level', 'bd_market_level', 'bd_hero_gacha_level'
        , 'bd_hero_strengthen_level', 'bd_hero_pve_level']].std(axis=1)




    train['rate_price_payco'] = train.apply(
        lambda x: 0 if x.pay_count == 0 else x.pay_price / x.pay_count, axis=1)
    train['rate_payco_time'] = train.apply(
        lambda x: 0 if x.avg_online_minutes == 0 else x.pay_count / x.avg_online_minutes, axis=1)

    train['rate_price_all_num'] = train.apply(
        lambda x: 0 if (x.all_num) == 0 else x.pay_price / (x.all_num), axis=1)

    train['rate_mr_ma'] = train['meat_reduce_value']/train['meat_add_value']
    train['rate_car_caa'] = train['cavalry_reduce_value'] / train['cavalry_add_value']
    train['add_inr_caa'] = train['infantry_add_value'] + train['cavalry_add_value']+ train['shaman_add_value']
    train['by_min_buliding_level'] = 0
    for i in ['bd_training_hut_level', 'bd_healing_lodge_level'
               ,'bd_stronghold_level','bd_outpost_portal_level','bd_barrack_level'
               ,'bd_healing_spring_level','bd_dolmen_level','bd_guest_cavern_level'
               ,'bd_warehouse_level','bd_watchtower_level','bd_magic_coin_tree_level'
               ,'bd_hall_of_war_level','bd_market_level','bd_hero_gacha_level'
               ,'bd_hero_strengthen_level','bd_hero_pve_level']:
        mean = train[i].min()
        print(mean)
        train['by_min_buliding_level'] = train.apply(lambda x:x.by_min_buliding_level+1 if x['%s' %i]  >mean else x.by_min_buliding_level, axis=1)

    train['by_min_num'] = 0
    for i in ['wood_add_value', 'wood_reduce_value', 'stone_add_value', 'stone_reduce_value'
        , 'ivory_add_value', 'ivory_reduce_value', 'meat_add_value', 'meat_reduce_value'
        , 'magic_add_value', 'magic_reduce_value', 'infantry_add_value', 'infantry_reduce_value'
        , 'infantry_reduce_value', 'cavalry_add_value', 'cavalry_reduce_value', 'shaman_add_value'
        , 'shaman_reduce_value', 'wound_infantry_add_value', 'wound_infantry_reduce_value'
        , 'wound_cavalry_add_value', 'wound_cavalry_reduce_value', 'wound_shaman_add_value'
        , 'wound_shaman_reduce_value']:
        mean = train[i].min()
        print(mean)
        train['by_min_num'] = train.apply(
            lambda x: x.by_min_num + 1 if x['%s' % i] > mean else x.by_min_num, axis=1)

    if type == 1:
        train = train[['user_id', 'std_buliding_level', 'max_buliding_level', 'max_ji_level', 'min_ji_level', 'std_ji_level'
                   , 'ji_level', 'min_ji_num', 'max_ji_num', 'ji_num', 'std_all_num', 'min_all_num', 'max_all_num'
                   , 'all_num', 'rate_zhu_all_pvp', 'rate_win_all_pvp', 'rate_zhu_win_pvp', 'rate_win_all_pve', 'rate_price_payco'
                       , 'rate_payco_time', 'rate_price_all_num', 'weekday', 'skew_ji_level'
        , 'rate_mr_ma', 'rate_car_caa', 'add_inr_caa', 'by_min_buliding_level', 'by_min_num'
                   , 'month', 'day', 'diff_day', 'prediction_pay_price', 'pay_count', 'pay_price', 'avg_online_minutes', 'pve_win_count'
             , 'pve_lanch_count', 'pve_battle_count', 'pvp_win_count', 'pvp_lanch_count', 'pvp_battle_count', 'hour']]
    elif type == 2:
        train = train[
            ['user_id', 'std_buliding_level', 'max_buliding_level', 'max_ji_level', 'min_ji_level', 'std_ji_level'
                , 'ji_level', 'min_ji_num', 'max_ji_num', 'ji_num', 'std_all_num', 'min_all_num', 'max_all_num'
                , 'all_num', 'rate_zhu_all_pvp', 'rate_win_all_pvp', 'rate_zhu_win_pvp', 'rate_win_all_pve', 'rate_price_payco'
                       , 'rate_payco_time', 'rate_price_all_num', 'weekday', 'skew_ji_level'
        , 'rate_mr_ma', 'rate_car_caa', 'add_inr_caa', 'by_min_buliding_level', 'by_min_num'
                , 'month', 'day', 'diff_day', 'pay_count', 'pay_price', 'avg_online_minutes', 'pve_win_count'
             , 'pve_lanch_count', 'pve_battle_count', 'pvp_win_count', 'pvp_lanch_count', 'pvp_battle_count', 'hour']]
    one_hot_feature = ['hour', 'weekday']
    train = pd.get_dummies(train, columns=one_hot_feature)
    return train


train = pd.read_csv('tap_fun_train.csv', parse_dates=['register_time']).fillna(0)
test = pd.read_csv('tap_fun_test.csv', parse_dates=['register_time']).fillna(0)
print('read_ok')

train = creat_fea(train, 1)
test = creat_fea(test, 2)
print('fea ok')
col = ['day', 'diff_day', 'hour_0', 'hour_1', 'hour_8', 'hour_9', 'hour_12'
       , 'hour_13', 'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18', 'hour_19', 'hour_20', 'hour_21', 'hour_22', 'hour_23'
       , 'weekday_2', 'weekday_3', 'weekday_5']
train = train.drop(col, axis=1)
test = test.drop(col, axis=1)


def rsme(target, pred):
    print(pred)
    print(target)
    error = []
    for i in range(len(target)):
        error.append(float(target[i]) - float(pred[i]))

    squaredError = []
    absError = []
    for val in error:
        squaredError.append(val * val)  # squared (target - prediction) error
        absError.append(abs(val))  # absolute error

    rmse = sqrt(sum(squaredError) / len(squaredError))
    # score = 1/(1+rmse)
    return rmse

def vali_test(train):
    print(train.columns)
    # print(train)
    train = train
    train_xy, val = train_test_split(train, test_size=0.3, random_state=1)
    y = train_xy.prediction_pay_price
    X = train_xy.drop(['prediction_pay_price', 'user_id'], axis=1)
    val_y = val.prediction_pay_price
    val_X = val.drop(['prediction_pay_price', 'user_id'], axis=1)
    print('start train')
    pred = xgb_model(X, val_X, y)
    print('train ok')
    val_y = [i for i in val_y]
    score = rsme(val_y, pred)
    print(score)


def get_result(train, test):
    result = test[['user_id']]
    train_x = train.drop(['prediction_pay_price', 'user_id'], axis=1)
    train_y = train['prediction_pay_price']
    test_x = test.drop(['user_id'], axis=1)
    pred = xgb_model(train_x, test_x, train_y)
    result['prediction_pay_price'] = pred
    result.to_csv(r'sub_1.csv',index=None,encoding='utf-8')



# score = vali_test(train)
# print(score)

get_result(train, test)

part_result = pd.read_csv('../result/sub_1.csv')
print(part_result)
all_id = pd.read_csv('../data/test_id.csv')
print(all_id)

sub = pd.merge(all_id, part_result, on='user_id', how='left').fillna(0)
print(sub)
sub['prediction_pay_price'] = sub['prediction_pay_price'].apply(lambda x:0 if x<0 else x)
sub.to_csv(r'sub_xgb.csv',index=None,encoding='utf-8')






I was originally well outside the top few hundred, but on the B leaderboard many teams ahead of me turned out to have overfit, so I ended up in the top 4%. What a plot twist...

That's it for now... see you around.


Reposted from blog.csdn.net/honry55/article/details/82532689