Kaggle competition dataset: rossmann-store-sales

This is a Kaggle competition from seven years ago: https://www.kaggle.com/competitions/rossmann-store-sales/data

The goal is to forecast 48 days of daily sales (2015-08-01 through 2015-09-17) for the 1,115 stores of Rossmann, Germany's largest drugstore chain (so presumably all of them are drugstores). From the competition background: Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their own unique circumstances, the accuracy of the results can vary widely.

For each store, the training and test sets contain the following fields:

Store - store ID

DayOfWeek - day of the week

Date - date

Sales - turnover, the prediction target; only present in the training set

Customers - number of customers; only present in the training set

Open - whether the store was open; closed days have no sales and are excluded from the final scoring

Promo - whether the store was running a promotion that day

StateHoliday - a = public holiday, b = Easter holiday, c = Christmas, 0 = none

SchoolHoliday - whether the store was affected by public school closures

The competition also provides store-level metadata, including:

Store - store ID

StoreType - one of 4 store types

Assortment - assortment level: a = basic, b = extra, c = extended

CompetitionDistance - distance in meters to the nearest competitor store

CompetitionOpenSinceYear/Month - approximate year and month when the nearest competitor opened

Promo2 - a continuing, consecutive promotion run by some stores: 0 = not participating, 1 = participating

Promo2SinceYear/Week - the year and calendar week in which the store started participating in Promo2

PromoInterval - the interval of consecutive Promo2 rounds; e.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year

The evaluation metric is the Root Mean Square Percentage Error (RMSPE), RMSPE = sqrt(mean(((y - yhat) / y)^2)), computed only over days with nonzero sales.
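For reference, a minimal sketch of this metric in code (it matches the RMSPE/rmspe helpers defined later in the post):

import numpy as np

def rmspe(y_true, y_pred):
    # Root Mean Square Percentage Error; days with zero sales are excluded from
    # scoring by the competition, so y_true is assumed to be nonzero here
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))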

In the training set there are only 54 store-day records where the store is open yet sales are 0, a tiny fraction of the data. They will be filled by interpolation during preprocessing and do not affect the exploration.

Exploring the data

1. The overall trend, treating all stores as one aggregate

# imports used throughout the EDA code below (assumed)
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn
import statsmodels.api as sm

train.Date = pd.to_datetime(train.Date)
train_ttl = train.groupby("Date")["Sales"].sum()
formatter = mpl.dates.DateFormatter("%Y")
smoothed=sm.nonparametric.lowess(endog=train_ttl.values,exog=train_ttl.index)[:,1]
locator = mpl.dates.YearLocator()
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(train_ttl,alpha=0.5,label="raw")
plt.plot(train_ttl.index,smoothed,label="LOWESS smoothed")
plt.legend(loc="upper right")
plt.show()

Apart from somewhat higher sales at the end of 2013, total sales are fairly flat overall, and on many days sales drop towards 0, which reflects that most stores close on the same days. Pulling out just the 2015 portion makes this clearer:

formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
plt.figure(figsize=(12,5))
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(train_ttl[[i for i in train_ttl.index if i.year == 2015]])
plt.show()

The chart shows that total sales have essentially no long-term upward or downward trend, and that the closed days (zero sales) recur with a clear periodic pattern. Plotting the dates on which stores are open (Open=1) shows this clearly.
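That plot is not reproduced here; a simple way to see the same pattern is to count how many stores are open on each date (a sketch, not the original code):

open_per_day = train.groupby("Date")["Open"].sum()   # number of stores open on each date
plt.plot(open_per_day)                               # the weekly dips are the Sunday closures
plt.title("Number of open stores per day")
plt.show()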

A periodogram built with the Fourier transform reveals a 7-day cycle (the reciprocal of the frequency at the power maximum is 7).

from scipy.fftpack import fft                      # assumed import for fft
from sklearn.preprocessing import StandardScaler   # assumed import

train_sum = train.groupby(["Date"])["Sales"].sum()
train_standard = StandardScaler().fit_transform(pd.DataFrame(train_sum))
yf = abs(fft(train_standard[:,0]))
n = len(yf)
power = np.abs(yf[range(0,int(n/2))])**2
freq = np.arange(0,int(n/2))/n
plt.plot(freq,power)
plt.title("Periodogram")

An ADF test indicates that the series is stationary, and the Ljung-Box test shows it is not white noise, i.e. it has significant autocorrelation worth modeling. Whether or not the series is stationary plays an important role in the later feature processing and modeling strategy.

import statsmodels.tsa.stattools                   # assumed imports for the ADF and Ljung-Box tests
from statsmodels import tsa
from statsmodels.stats.diagnostic import acorr_ljungbox

tsa.stattools.adfuller(train_ttl) #stationary series: the p-value is below 0.05
acorr_ljungbox(train_ttl,lags=20,boxpierce=True,return_df=True)

Averaging each store's sales by month, we find that the monthly mean rises sharply at the end of every year. This may be because stores close on more days in December around Christmas, and the stores that rarely close tend to have higher average sales, which pulls the whole month's mean up.

train_opened = train.iloc[[i for i in train.index if train.Open[i]==1],:]
train_opened["Month"] = [train_opened.Date[i].month for i in train_opened.index]   # used in the heat maps below
train_opened["Date_Month"]=[pd.to_datetime(str(train_opened.Date[i]).split("-")[0]+"-"+str(train_opened.Date[i]).split("-")[1]+"-01") for i in train_opened.index]
train_ttl_opened = train_opened.groupby("Date_Month")["Sales"].mean()
formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(train_ttl_opened)
plt.show()

The ADF test still considers this monthly series stationary:

tsa.stattools.adfuller(train_ttl_opened) #stationary series

Summary: taken as a whole, sales show no upward or downward trend and form a stationary series, although there are small differences between months, especially at the end of the year.

2. Calendar effects

A heat map of day-of-week against month shows that Sunday's mean sales are clearly higher than those of the other days, and that November and December have higher monthly means.

Month_and_day_of_week = pd.pivot_table(train_opened,index="DayOfWeek",columns="Month",values="Sales",aggfunc="mean")
Month_and_day_of_week.columns = [int(i) for i in Month_and_day_of_week.columns]
seaborn.heatmap(Month_and_day_of_week,cmap="YlGnBu")
plt.show()

Redoing the heat map with closure counts instead shows that the vast majority of closures fall on Sunday. Conjecture: the stores that stay open on Sunday may simply sell more in general; perhaps the drugstores that do not rest on Sunday are larger, have few nearby competitors, and serve residents who rely heavily on them.

#days when stores are closed
train_close = train.iloc[[i for i in train.index if train.Open[i]==0],:]
train_close["Month"] = pd.Series([train_close.Date[i].month for i in train_close.index],index=train_close.index)
Month_and_day_of_week_close = pd.pivot_table(train_close,index="DayOfWeek",columns="Month",values="Store",aggfunc="count")
seaborn.heatmap(Month_and_day_of_week_close,cmap="YlGnBu")
plt.show()

In terms of total open days, these stores that also trade on Sunday are open on clearly more days than the other stores. (The chart title here is wrong; I labelled it "Closure ratio" when it actually shows the open ratio.)

tmp_idx = [i for i in train.index if train.DayOfWeek[i]==7 and train.Open[i]==1]
sun_open_stores = set(train.loc[tmp_idx,"Store"])
train_open_at_sun = train.iloc[[i for i in train.index if train.Store[i] in sun_open_stores],:]
train_other_store = train.iloc[[i for i in train.index if train.Store[i] not in sun_open_stores],:]
tmp = pd.DataFrame([{"type":"Open at Sunday","ratio":sum(train_open_at_sun.Open)/train_open_at_sun.shape[0]},
                   {"type":"Close at Sunday","ratio":sum(train_other_store.Open)/train_other_store.shape[0]}])
tmp.index = tmp.type
tmp = tmp["ratio"]
cm=plt.bar(tmp.index,tmp.values)
for i,idx in enumerate(tmp.index):
    plt.text(cm[i].get_x()+0.2,tmp[idx],"%.2f"%tmp[idx])
plt.title("Closure ratio")
plt.show()

Separating the stores that open on Sunday from those that close on Sunday and averaging by month, the Sunday-open stores have clearly higher monthly mean sales.

#LOWESS smoothing helper
def gen_smooth_series(df):
    smoothed=sm.nonparametric.lowess(endog=df.values,exog=df.index)[:,1]
    index = df.index
    smoothed = pd.Series(smoothed)
    smoothed.index = index
    return smoothed

store_open_at_sun = stores.iloc[[i for i in stores.index if stores.Store[i] in sun_open_stores],:]
store_other = stores.iloc[[i for i in stores.index if stores.Store[i] not in sun_open_stores],:]
train_open_at_sun["Month"] = [train_open_at_sun.Date[i].month for i in train_open_at_sun.index]
train_other_store["Month"] = [train_other_store.Date[i].month for i in train_other_store.index]
train_open_at_sun["Date_Month"] = [pd.to_datetime(str(train_open_at_sun.Date[i]).split("-")[0]+"-"+str(train_open_at_sun.Date[i]).split("-")[1]+"-01") for i in train_open_at_sun.index]
train_other_store["Date_Month"] = [pd.to_datetime(str(train_other_store.Date[i]).split("-")[0]+"-"+str(train_other_store.Date[i]).split("-")[1]+"-01") for i in train_other_store.index]
sale_open_at_sun = train_open_at_sun.groupby("Date_Month")["Sales"].mean()
sale_close_at_sun = train_other_store.groupby("Date_Month")["Sales"].mean()
formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(sale_open_at_sun,alpha=0.3,color="blue")
plt.plot(gen_smooth_series(sale_open_at_sun),color="blue",label="open_at_sun")
plt.plot(sale_close_at_sun,alpha=0.3,color="red")
plt.plot(gen_smooth_series(sale_close_at_sun),color="red",label="close_at_sun")
plt.title("Open at Sunday")
plt.legend()
plt.show()

Summary: both the year and the day of week affect sales, and the stores that open on Sunday already have higher sales than the rest.

3. Store type and assortment effects

We run stratified analyses by store category and, in the next part, by holidays. The trend plots are built from the monthly mean sales.

By store type, type b stores clearly sell more than the others, and every type separates visibly from the rest. Type b stores also show a clear upward trend. This means that although the aggregate over all stores has no obvious trend and is considered stationary, individual stores need not be stationary, and extracting a per-store trend component will be part of the later feature engineering.

train_with_store = pd.merge(train_opened,stores,on="Store",how="left")
train_store_StoreType = train_with_store.groupby(["Date_Month","StoreType"])["Sales"].mean()
store_A = train_store_StoreType[[i for i in train_store_StoreType.index if i[1]=="a"]]
store_B = train_store_StoreType[[i for i in train_store_StoreType.index if i[1]=="b"]]
store_C = train_store_StoreType[[i for i in train_store_StoreType.index if i[1]=="c"]]
store_D = train_store_StoreType[[i for i in train_store_StoreType.index if i[1]=="d"]]
store_A.index =[i[0] for i in store_A.index]
store_B.index =[i[0] for i in store_B.index]
store_C.index =[i[0] for i in store_C.index]
store_D.index =[i[0] for i in store_D.index]
smoothed_A=sm.nonparametric.lowess(endog=store_A.values,exog=store_A.index)[:,1]
smoothed_B=sm.nonparametric.lowess(endog=store_B.values,exog=store_B.index)[:,1]
smoothed_C=sm.nonparametric.lowess(endog=store_C.values,exog=store_C.index)[:,1]
smoothed_D=sm.nonparametric.lowess(endog=store_D.values,exog=store_D.index)[:,1]
formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(store_A,alpha=0.3,color="blue")
plt.plot(gen_smooth_series(store_A),color="blue",label="a")
plt.plot(store_B,alpha=0.3,color="red")
plt.plot(gen_smooth_series(store_B),color="red",label="b")
plt.plot(store_C,alpha=0.3,color="yellow")
plt.plot(gen_smooth_series(store_C),color="yellow",label="c")
plt.plot(store_D,alpha=0.3,color="green")
plt.plot(gen_smooth_series(store_D),color="green",label="d")
plt.legend(loc="upper right")
plt.title("different Store Type")
plt.show()

Similarly, by assortment level, level b stores' mean sales also trend clearly upward, and the levels differ visibly in mean sales.

train_store_Assortment = train_with_store.groupby(["Date_Month","Assortment"])["Sales"].mean()
store_A = train_store_Assortment[[i for i in train_store_Assortment.index if i[1]=="a"]]
store_B = train_store_Assortment[[i for i in train_store_Assortment.index if i[1]=="b"]]
store_C = train_store_Assortment[[i for i in train_store_Assortment.index if i[1]=="c"]]
store_A.index =[i[0] for i in store_A.index]
store_B.index =[i[0] for i in store_B.index]
store_C.index =[i[0] for i in store_C.index]
formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(store_A,alpha=0.3,color="blue")
plt.plot(gen_smooth_series(store_A),color="blue",label="a")
plt.plot(store_B,alpha=0.3,color="red")
plt.plot(gen_smooth_series(store_B),color="red",label="b")
plt.plot(store_C,alpha=0.3,color="yellow")
plt.plot(gen_smooth_series(store_C),color="yellow",label="c")
plt.legend(loc="upper right")
plt.title("Assortment")
plt.show()

Summary: although the aggregate of all stores has no obvious trend and is considered stationary, this does not mean individual stores are stationary; different store categories also differ clearly in sales.

4. Holidays, promotions, and competition

On state holidays (other than school holidays) most stores are closed, so instead we compare mean sales over the 14 days before and after each holiday; no obvious pattern emerges.

holiday_a_date = set(train_with_store.loc[[i for i in train_with_store.index if train_with_store.StateHoliday[i]=="a"],"Date"])
holiday_b_date = set(train_with_store.loc[[i for i in train_with_store.index if train_with_store.StateHoliday[i]=="b"],"Date"])
holiday_c_date = set(train_with_store.loc[[i for i in train_with_store.index if train_with_store.StateHoliday[i]=="c"],"Date"])
#flag the 14 days before and after each holiday, and record the signed distance in days
#to the nearest holiday (used as dis_to_holiday below)
def get_last_and_next_14days(dates):
    tmp = train.copy()
    dates = list(dates)
    tmp["dis_to_holiday"] = [min(((tmp.loc[i,"Date"]-d).days for d in dates),key=abs) for i in tmp.index]
    tmp["holiday_14"] = [abs(x)<=14 for x in tmp["dis_to_holiday"]]
    return tmp

train_c_14 = get_last_and_next_14days(holiday_c_date)
train_b_14 = get_last_and_next_14days(holiday_b_date)
train_a_14 = get_last_and_next_14days(holiday_a_date)
train_c_14_chosen = train_c_14.iloc[[i for i in train_c_14.index if train_c_14.holiday_14[i]==True],:]
train_a_14_chosen = train_a_14.iloc[[i for i in train_a_14.index if train_a_14.holiday_14[i]==True],:]
train_b_14_chosen = train_b_14.iloc[[i for i in train_b_14.index if train_b_14.holiday_14[i]==True],:]

a_sales_mean = train_a_14_chosen.groupby(["dis_to_holiday"])["Sales"].mean()
b_sales_mean = train_b_14_chosen.groupby(["dis_to_holiday"])["Sales"].mean()
c_sales_mean = train_c_14_chosen.groupby(["dis_to_holiday"])["Sales"].mean()

plt.subplot(2,2,1)
plt.plot(a_sales_mean)
plt.title("public holiday")
plt.subplot(2,2,2)
plt.plot(b_sales_mean)
plt.title("Easter")
plt.subplot(2,2,3)
plt.plot(c_sales_mean)
plt.title("Christmas")
plt.show()

As for promotions, although a promotion lifts sales, the long-term trend and the cyclical pattern are not disturbed by it. The chart shows clearly that a promotion, as a standalone event, has a direct and visible effect on sales.

train_store_Promo = train_with_store.groupby(["Date_Month","Promo"])[["Sales","Customers"]].mean().reset_index()
train_store_Promo_1 = train_store_Promo.loc[[i for i in train_store_Promo.index if train_store_Promo.loc[i,"Promo"]==1],["Date_Month","Sales","Customers"]]
train_store_Promo_0 = train_store_Promo.loc[[i for i in train_store_Promo.index if train_store_Promo.loc[i,"Promo"]==0],["Date_Month","Sales","Customers"]]
train_store_Promo_1.index = train_store_Promo_1.Date_Month
train_store_Promo_0.index = train_store_Promo_0.Date_Month
promo_0_sales = pd.Series(train_store_Promo_0.Sales)
promo_1_sales = pd.Series(train_store_Promo_1.Sales)
ax=plt.gca()
formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(promo_0_sales,label="without promo")
plt.plot(promo_1_sales,label="with promo")
plt.legend()
plt.title("Sales")
plt.show()

Because school holidays are periods rather than single days, there is no natural before/after comparison, so I used a bar chart instead; sales during school holidays are somewhat higher.

*I did not pay much attention to this feature at the time, but during feature engineering I noticed the competition winner saying that this feature is very important, so it deserves more careful preprocessing.

school_holiday = train_with_store.groupby(["SchoolHoliday"])["Sales"].mean().reset_index()
cm = plt.bar(school_holiday.SchoolHoliday.astype("str"),school_holiday.Sales)
plt.text(cm[0].get_x()+0.2,school_holiday.Sales[0],"%.2f"%school_holiday.Sales[0])
plt.text(cm[1].get_x()+0.2,school_holiday.Sales[1],"%.2f"%school_holiday.Sales[1])
plt.title("School Holiday")
plt.show()

As for the distance to the nearest competitor, the correlation heat map suggests little effect, and the later feature selection confirms this.

#before this, drop the data from before each competitor opened
train_with_store["CompetitionSince"] = pd.to_datetime("1900-01-01")  # default for stores with no known competitor
for i in train_with_store.index:
    year = 0
    month = 1
    if not pd.isna(train_with_store.loc[i,"CompetitionOpenSinceYear"]):
        year = int(train_with_store.loc[i,"CompetitionOpenSinceYear"])
        if not pd.isna(train_with_store.loc[i,"CompetitionOpenSinceMonth"]):
            month = int(train_with_store.loc[i,"CompetitionOpenSinceMonth"])
        res = str(year)+"-"+"0"*(2-len(str(month)))+str(month)+"-01"
        res = pd.to_datetime(res)
        train_with_store.loc[i,"CompetitionSince"]=res

train_with_store_competition = train_with_store.iloc[[i for i in train_with_store.index if train_with_store.Date[i]>=train_with_store.CompetitionSince[i]],:]
Competition_Dis = train_with_store_competition.groupby(["Store"])[["CompetitionDistance","Sales"]].mean().reset_index()
Competition_Dis.drop("Store",axis=1,inplace=True)
seaborn.heatmap(Competition_Dis.corr()) #correlation is scale-free, so no normalization/standardization needed
plt.show()

Finally, for Promo2 (the secondary promotion), the stores that are running it actually have lower sales than the stores that are not. Perhaps these stores were already selling poorly in those months?

#promo2
promo2 = train_with_store.iloc[[i for i in train_with_store.index if (train_with_store.Promo2[i]==1 and train_with_store.Promo2SinceYear[i]<2015)],:]
promo2_no = train_with_store.iloc[[i for i in train_with_store.index if (train_with_store.Promo2[i]==0 or train_with_store.Promo2SinceYear[i]==2015)],:]
promo2_sales_mean = promo2.groupby(["Date_Month"])["Sales"].mean()
promo2_no_sales_mean = promo2_no.groupby(["Date_Month"])["Sales"].mean()
formatter = mpl.dates.DateFormatter("%Y")
locator = mpl.dates.YearLocator()
ax = plt.gca()
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_locator(locator)
plt.plot(promo2_sales_mean,label="with promo2")
plt.plot(promo2_no_sales_mean,label="without promo2")
plt.legend()
plt.show()

Summary: holidays and the primary promotion have the biggest impact on sales; competition and Promo2 matter less.

Exploring the data was genuinely interesting: it surfaced many factors with a strong influence on the target and laid the groundwork for the feature engineering that follows.

Parts of this write-up borrow from https://www.kaggle.com/competitions/rossmann-store-sales/discussion/18024, the winner's solution thread for this competition.

The earlier data exploration and the dataset introduction are covered here: (link TBD)

The feature extraction and feature engineering are covered here: (link TBD)

After feature engineering we end up with 300-odd features; a Null Importance pass then keeps 70-odd of them for modeling. Modeling uses a direct-forecast strategy, blending models for sales, for customer counts, and for specific months.


During feature engineering we have already noted that, in aggregate, the sales series is fairly flat and the trend component contributes very little; the test horizon is only 48 days, so this is not a long-range forecast and the trend matters even less. Recursive forecasting (predict 1 day ahead, then feed that prediction back in as a feature to predict day 2, and so on) is therefore no longer a good fit here. Although that strategy can carry the trend forward, over short horizons it accumulates residual error (each prediction's error is fed back into the model, inflating the residual of the next day's prediction), and with such a weak trend it tends to perform worse than direct forecasting. The feature engineering below therefore targets a direct forecast of the 48 future days; a toy sketch contrasting the two strategies follows.
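As a toy illustration of the difference (synthetic data and scikit-learn's LinearRegression, not the competition pipeline):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=400)).rolling(7).mean().dropna()   # a smoothed toy series
H = 48                                                           # forecast horizon, as in the test set

# Direct forecasting: the only feature is the value H days earlier, which is already
# known at prediction time, so no prediction is ever fed back into the model.
X_direct = y.shift(H).dropna()
direct_model = LinearRegression().fit(X_direct.to_numpy().reshape(-1, 1), y.loc[X_direct.index])
direct_preds = direct_model.predict(y.iloc[-H:].to_numpy().reshape(-1, 1))

# Recursive forecasting: a 1-step-ahead model whose own predictions become the next
# input, so prediction errors accumulate over the 48 steps.
X_rec = y.shift(1).dropna()
rec_model = LinearRegression().fit(X_rec.to_numpy().reshape(-1, 1), y.loc[X_rec.index])
last, rec_preds = y.iloc[-1], []
for _ in range(H):
    last = rec_model.predict(np.array([[last]]))[0]
    rec_preds.append(last)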

Feature engineering

0. Interpolation

Although the winning solution shows that most competitors trained only on the open-day rows (Open==1), I felt that simply discarding the closed days would leave too many gaps in the time series and lower the data quality, and Prophet is used later for time-series decomposition. So the closed days are filled by interpolation, and the data is pivoted into one column per store.

train_pivoted = train.pivot(index="Date",columns="Store",values=["Sales","StateHoliday","Customers","DayOfWeek","Promo","SchoolHoliday"])
train_pivoted.index = pd.to_datetime(train_pivoted.index)

# per store: treat closed days (Sales/Customers == 0) as missing and fill them by
# linear interpolation between the surrounding open days
for store in range(1,1116):
    tmp = train_pivoted.Sales[store].replace(0,np.nan)
    opened = tmp[[i for i in tmp.index if not np.isnan(tmp[i])]]
    closed = tmp[[i for i in tmp.index if np.isnan(tmp[i])]]
    
    tmp_customer = train_pivoted.Customers[store].replace(0,np.nan)
    opened_customer = tmp_customer[[i for i in tmp_customer.index if not np.isnan(tmp_customer[i])]]
    closed_customer = tmp_customer[[i for i in tmp_customer.index if np.isnan(tmp_customer[i])]]
    if len(closed)>0:
        res = np.interp(closed.index,opened.index,opened.values)
        for i in range(len(res)):
            closed[i]=res[i]
        interped = pd.concat([opened,closed]).sort_index()
        for i in interped.index:
            train_pivoted.Sales.loc[i,store]=interped[i]
    if len(closed_customer)>0:
        res_customer = np.interp(closed_customer.index,opened_customer.index,opened_customer.values)
        for i in range(len(res_customer)):
            closed_customer[i]=res_customer[i]
        interped_customer = pd.concat([opened_customer,closed_customer]).sort_index()
        for i in interped_customer.index:
            train_pivoted.Customers.loc[i,store]=interped_customer[i]
    print(store,"complete")

1. Trend and other time-series features

As the data exploration showed, even though the aggregate trend is flat, individual stores can have their own trends and seasonal or cyclical behaviour, so each store's series should be decomposed and the components used as features.

The winner extracted per-store trends by fitting scikit-learn ridge regressions store by store. Today we have Prophet (open-sourced by Facebook six years ago, one year after this competition), which extracts the trend, the seasonal components, and external-regressor effects in one go. I fit a separate model for each of the 1,115 stores to obtain all of their time-series components.

import prophet  # Prophet is used to decompose each store's series

for i in range(1,1116):
    train_sales_tmp = train_pivoted.loc[:,[('Sales',i)]].reset_index()
    train_sales_tmp.columns = ["ds","y"]
    model = prophet.Prophet(weekly_seasonality=True)
    model.fit(train_sales_tmp)
    future = model.make_future_dataframe(periods=48)
    predicted = model.predict(future)
    predicted.to_csv("E:/jupyter_notebook/rossmann-store-sales/prophet_dataframes/%s.csv"%i)

#the same for Customers: fit Prophet per store and forecast the 48-day horizon
for i in range(1,1116):
    train_customer_tmp = train_pivoted.loc[:,[('Customers',i)]].reset_index()
    train_customer_tmp.columns = ["ds","y"]
    model = prophet.Prophet(weekly_seasonality=True)
    model.fit(train_customer_tmp)
    future = model.make_future_dataframe(periods=48)
    predicted = model.predict(future)
    predicted.to_csv("E:/jupyter_notebook/rossmann-store-sales/prophet_dataframes_Customers/%s.csv"%i)

Based on the results of the data exploration, I also add the month, the day of month, whether the store opens on Sunday, whether Promo2 is active, and whether a competitor has opened:

train["Month"] = pd.Series([i.month for i in train.Date])
train["day"] = pd.Series([i.day for i in train.Date])
train["week_of_year"] = train.Date.dt.weekofyear

has_competition = []
for i in train.index:
    if pd.isna(train.CompetitionOpenSinceYear[i]):
        has_competition.append(False)
    elif pd.isna(train.CompetitionOpenSinceMonth[i]):
        Cometition_Start = pd.to_datetime(str(int(train.CompetitionOpenSinceYear[i]))+"-1-1")
        has_competition.append(train.Date[i]>=Cometition_Start)
    else:
        Cometition_Start = pd.to_datetime(str(int(train.CompetitionOpenSinceYear[i]))+"-"+str(int(train.CompetitionOpenSinceMonth[i]))+"-1")
        has_competition.append(train.Date[i]>=Cometition_Start)
train["has_competition"]=has_competition

train["PromoInterval"].astype("str")
train["PromoInterval"].fillna("None")
train["PromoInterval"].astype("category")

month_dict = {
    "Jan,Apr,Jul,Oct":{1,4,7,10},
    "Feb,May,Aug,Nov":{2,5,8,11},
    "Mar,Jun,Sept,Dec":{3,6,9,12}
}

has_promo2 = []
for i in train.index:
    if train.Promo2[i]==0 or pd.isna(train.Promo2SinceYear[i]) or pd.isna(train.Promo2SinceWeek[i]):
        has_promo2.append(False)
    else:
        promo_begin_date = str(int(train.Promo2SinceYear[i]))+str(int(train.Promo2SinceWeek[i]))+'0'
        promo_begin_date = pd.to_datetime(promo_begin_date,format='%Y%W%w')
        if train.Date[i]>=promo_begin_date and train.Month[i] in month_dict[train.PromoInterval[i]]:
            has_promo2.append(True)
        else:
            has_promo2.append(False)

train["has_promo2"] = pd.Series(has_promo2)

tmp_idx = [i for i in train.index if train.DayOfWeek[i]==7 and train.Open[i]==1]
sun_open_stores = set(train.loc[tmp_idx,"Store"])
train["open_at_Sunday"] = pd.Series([train.Store[i] in sun_open_stores for i in train.index])

2. Recent data

This is a feature class specific to time-series competitions: re-extract aggregate statistics of earlier periods for each entity; some competitors call this encoding. Concretely, following the winner's strategy, I aggregate at several granularities (per store per day of week / per store split by Promo / additionally split by SchoolHoliday) over several windows (previous year / previous quarter / previous half-year), computing the median, mean, standard deviation, skewness, kurtosis, and harmonic mean. On top of that I add a global aggregation over the whole training period. This does carry some risk of data leakage, but since the series is stationary and the aggregates can be treated as intrinsic properties of each store, I kept the global aggregates.

#build shifted copies of the data (previous year / half-year / quarter) for left joins
import datetime                   # assumed imports used below
from scipy.stats import hmean
data["Year"] = pd.Series([i.year for i in data.Date])
data["HALF_YEAR"] = pd.Series([i.month>=6 for i in data.Date])
data["Quater"]=pd.Series([i.quarter for i in data.Date])
years = [i.year for i in data.Date]
months = [i.month for i in data.Date]
days = [i.day for i in data.Date]

# Y-1 data: shift Date forward by one year so that last year's rows join onto this year's; likewise below
date_tmp = []
data_Y_1 = data.copy()
for i in range(data.shape[0]):
    date_tmp.append(datetime.date(years[i]+1,months[i],days[i]))
data_Y_1.Date = pd.to_datetime(pd.Series(date_tmp))
data_Y_1["Year"] = pd.Series([i.year for i in data_Y_1.Date])

date_tmp = []
data_M_6 = data.copy()
for i in range(data.shape[0]):
    if months[i]>6:
        date_tmp.append(datetime.date(years[i]+1,months[i]-6,1))
    else:
        date_tmp.append(datetime.date(years[i],months[i]+6,1))
data_M_6.Date = pd.to_datetime(pd.Series(date_tmp))
data_M_6["Year"] = pd.Series([i.year for i in data_M_6.Date])
data_M_6["HALF_YEAR"] = pd.Series([i.month>=6 for i in data_M_6.Date])

date_tmp = []
data_Quarter_1 = data.copy()
for i in range(data.shape[0]):
    if months[i]>9:
        date_tmp.append(datetime.date(years[i]+1,months[i]-9,1))
    else:
        date_tmp.append(datetime.date(years[i],months[i]+3,1))
data_Quarter_1.Date = pd.to_datetime(pd.Series(date_tmp))
data_Quarter_1["Year"] = pd.Series([i.year for i in data_Quarter_1.Date])
data_Quarter_1["Quater"] = pd.Series([i.quarter for i in data_Quarter_1.Date])
#aggregate per group and left-join the statistics back onto the main table
def groupby_features(data,groupby):
    tmp_res = data.groupby(groupby).agg({"Sales_interped":["median","mean","std","skew",pd.DataFrame.kurt,hmean],
                                             "Customer_interped":["median","mean","std","skew",pd.DataFrame.kurt,hmean],
                                        "sales_per_customer":["median","mean","std","skew",pd.DataFrame.kurt,hmean]})
    tmp_res.columns = [''.join(groupby)+"_"+i[0]+"_"+i[1] for i in tmp_res.columns]
    return tmp_res

def grouby_combine(data_main,data_join,groupby,prefix=""):
    tmp = groupby_features(data_join,groupby)
    tmp.columns = [prefix+i for i in tmp.columns]
    data_main = data_main.merge(tmp,on=groupby,how="left")
    print(prefix,"ok")
    return data_main

data = grouby_combine(data,data_Y_1,["Store","Promo","Year"],prefix="preyear_")
data = grouby_combine(data,data_Y_1,["Store","DayOfWeek","Year"],prefix="preyear_")
data = grouby_combine(data,data_Y_1,["Store","Promo","DayOfWeek","Year"],prefix="preyear_")
data = grouby_combine(data,data_Y_1,["Store","Promo","DayOfWeek","SchoolHoliday","Year"],prefix="preyear_")

data = grouby_combine(data,data_M_6,["Store","Promo","Year","HALF_YEAR"],prefix="prehalfyear_")
data = grouby_combine(data,data_M_6,["Store","DayOfWeek","Year","HALF_YEAR"],prefix="prehalfyear_")
data = grouby_combine(data,data_M_6,["Store","Promo","DayOfWeek","Year","HALF_YEAR"],prefix="prehalfyear_")
data = grouby_combine(data,data_M_6,["Store","Promo","DayOfWeek","SchoolHoliday","Year","HALF_YEAR"],prefix="prehalfyear_")

data = grouby_combine(data,data_Quarter_1,["Store","Promo","Year","Quater"],prefix="prequarter_")
data = grouby_combine(data,data_Quarter_1,["Store","DayOfWeek","Year","Quater"],prefix="prequarter_")
data = grouby_combine(data,data_Quarter_1,["Store","Promo","DayOfWeek","Year","Quater"],prefix="prequarter_")
data = grouby_combine(data,data_Quarter_1,["Store","Promo","DayOfWeek","SchoolHoliday","Year","Quater"],prefix="prequarter_")

data = grouby_combine(data,data,["Store","Promo"])
data = grouby_combine(data,data,["Store","DayOfWeek"])
data = grouby_combine(data,data,["Store","Promo","DayOfWeek"])
data = grouby_combine(data,data,["Store","Promo","DayOfWeek","SchoolHoliday"])

3. Time information

Following the first-place strategy, I also built a number of "time counters" that count how many StateHoliday / Promo / has_promo2 events fall within 14 days of each date, and added month and week-of-year information.

from IPython.display import clear_output as clear  # assumed import for the progress display below

def cycle_days(col,except_value):
    res = pd.DataFrame(columns=["Store","Date",col+"_cycle_14days"])
    Stores = []
    Dates = []
    ones = []
    tmp = data[data[col]!=except_value].reset_index()
    for i in tmp.index:
        Stores.append(tmp.loc[i,"Store"])
        Dates.append(tmp.loc[i,"Date"])
        ones.append(1)
        for d in range(1,8):
            Stores.append(tmp.loc[i,"Store"])
            Dates.append(tmp.loc[i,"Date"]+datetime.timedelta(d))
            ones.append(1)
            
            Stores.append(tmp.loc[i,"Store"])
            Dates.append(tmp.loc[i,"Date"]+datetime.timedelta(-d))
            ones.append(1)
        print(i/len(tmp.index))
        clear(wait=True)
    res["Store"] = pd.Series(Stores)
    res["Date"] = pd.Series(Dates)
    res[col+"_cycle_14days"] = pd.Series(ones)
    res = res.groupby(["Store","Date"])[col+"_cycle_14days"].sum()
    return res
tmp = cycle_days("StateHoliday","0")
data = data.merge(tmp,on=["Store","Date"],how="left")
tmp = cycle_days("Promo",0)
data = data.merge(tmp,on=["Store","Date"],how="left")
tmp = cycle_days("has_promo2",False)
data = data.merge(tmp,on=["Store","Date"],how="left")

data["StateHoliday_cycle_14days"] = data["StateHoliday_cycle_14days"].fillna(0).astype("uint")#.value_counts()
data["Promo_cycle_14days"] = data["Promo_cycle_14days"].fillna(0).astype("uint")
data["has_promo2_cycle_14days"] = data["has_promo2_cycle_14days"].fillna(0).astype("uint")#.value_counts()

4. Feature selection with Null Importance

After all the feature extraction above I end up with 300+ features, and the dataset grows from the original 36 MB to 2.64 GB. So many features cost extra time and memory during training, features with no real predictive value can make the model worse, and the added complexity brings a risk of overfitting. We therefore need to select features.

With so many candidate features, I borrowed the Null Importance strategy from Kaggle:

  1. Train one "full" model on all of the features and record each feature's importance from it (the tree model's split importance and its gain importance; "gain" is the original notebook's term, i.e. information gain).

  2. Shuffle the target y, retrain the model, and record each feature's importance again. Repeat this n times.

Finally, each feature's score is its importance in the full model divided by one plus the upper quartile of its importances across the shuffled runs, taken on a log scale:

score = log(eps + f_a / (1 + Q3(f_n)))

Here f_a is the feature's importance in the full model, f_n are its importances across the shuffled models, Q3 is the 75th percentile, and the 1 in the denominator and the small constant eps (1e-10 in the code below) are smoothing terms that avoid dividing by zero or taking the log of zero. The intuition: if a feature genuinely helps the model, its importance should drop sharply once the target is shuffled; if it has no real effect, shuffling the target will barely change its importance. The closest reference I found is https://academic.oup.com/bioinformatics/article/26/10/1340/193348?login=false (although that paper uses permutation importance, which does not seem to be quite the same thing).

The code is adapted from a Kaggle notebook, with essentially just a few parameters changed: https://www.kaggle.com/code/ogrellier/feature-selection-with-null-importances/notebook

def RMSPE(real,pred):
    return  np.sqrt(np.mean(((pred-real)/real)**2))

def get_feature_importances(data, shuffle, seed=None):
    # Gather real features
    train_features = [f for f in data if f not in ['Sales_interped', 'Customer_interped']]
    # Go over fold and keep track of CV score (train and valid) and feature importances
    
    # Shuffle target if required
    y = data[['Sales_interped']].copy()
    if shuffle:
        # Here you could as well use a binomial distribution
        y = data[['Sales_interped']].copy().sample(frac=1.0)
    
    # Fit LightGBM in RF mode, yes it's quicker than sklearn RandomForest
    dtrain = lgb.Dataset(data[train_features], y, free_raw_data=False, silent=True)
    lgb_params = {
        #'objective': 'binary',
        'boosting_type': 'rf',
        'subsample': 0.623,
        'colsample_bytree': 0.7,
        'num_leaves': 127,
        'max_depth': 8,
        'seed': seed,
        'bagging_freq': 1,
        'n_jobs': -1
        #'device':'gpu',
    }
    
    # Fit the model
    clf = lgb.train(params=lgb_params, train_set=dtrain, num_boost_round=200, categorical_feature=categorical_feats)

    # Get feature importances
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance_gain"] = clf.feature_importance(importance_type='gain')
    imp_df["importance_split"] = clf.feature_importance(importance_type='split')
    imp_df['trn_score'] = RMSPE(y.iloc[:,0], clf.predict(data[train_features]))
    
    return imp_df
    #return clf.predict(data[train_features])

# Seed the unexpected randomness of this world
np.random.seed(123)
# Get the actual importance, i.e. without shuffling
actual_imp_df = get_feature_importances(data=data, shuffle=False)

null_imp_df = pd.DataFrame()
nb_runs = 150
import time
start = time.time()
dsp = ''
for i in range(nb_runs):
    # Get current run importances
    imp_df = get_feature_importances(data=data, shuffle=True)
    imp_df['run'] = i + 1 
    # Concat the latest importances with the old ones
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
    # Erase previous message
    for l in range(len(dsp)):
        print('\b', end='', flush=True)
    # Display current run and time used
    spent = (time.time() - start) / 150
    dsp = 'Done with %4d of %4d (Spent %5.1f min)' % (i + 1, nb_runs, spent)
    with open("/mnt/log.txt","w+") as f:
            f.write(dsp+"\r\n")
    print(dsp, end='', flush=True)

actual_imp_df.to_pickle("actual_imp_df.pkl")
null_imp_df.to_pickle("null_imp_df.pkl")

def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    # Plot Split importances
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    # Plot Gain importances
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())

feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].mean()
    gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))  # Avoid divide by zero
    f_null_imps_split = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps_split = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].mean()
    split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))  # Avoid divide by zero
    feature_scores.append((_f, split_score, gain_score))

scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])

plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()

Finally, we keep the features whose split score and gain score are both greater than 0 as model inputs. This greatly reduces the feature count, from 300-odd down to 70-odd, which already feels like progress compared with the winner's approach of randomly selecting features and training 500+ models.
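One possible way to apply that selection rule to the scores_df built above (a sketch, not the original code):

# keep the features whose real importance beats the null distribution on both criteria
selected_features = scores_df[(scores_df.split_score > 0) & (scores_df.gain_score > 0)]
selected_features = selected_features["feature"].tolist()   # roughly 70 features survive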

Modeling and model blending

0. Recursive forecasting

Because the test period is short and the trend is weak, recursive forecasting gains little from carrying the trend forward and instead suffers from accumulating residuals, since each step feeds the previous prediction back in. It is therefore not a good fit here. Still, for the sake of learning, the preprocessing and modeling code for recursive forecasting is included below.

Note: the recursive-forecasting variant only adds the Prophet trend and related components as extra features.

# lagged features: Sales shifted by 1-15 days and Customers by 1-14 days
for shift_day in range(1,16):
    tmp_store = pd.DataFrame(columns={"shifted_%s"%shift_day,"Store"})
    tmp_store.index.name="Date"
    for store in range(1,1116):
        tmp = pd.DataFrame(train_pivoted.Sales[store].shift(shift_day)).rename(columns={store:"shifted_%s"%shift_day})
        tmp["Store"] = store
        tmp_store = pd.concat([tmp_store,tmp],axis=0)
        print(store,"ok")
    train = train.merge(tmp_store,on=["Date","Store"],how="left")
        
for shift_day in range(1,15):
    tmp_store = pd.DataFrame(columns={"shifted_customers_%s"%shift_day,"Store"})
    tmp_store.index.name="Date"
    for store in range(1,1116):
        tmp = pd.DataFrame(train_pivoted.Customers[store].shift(shift_day)).rename(columns={store:"shifted_customers_%s"%shift_day})
        tmp["Store"] = store
        tmp_store = pd.concat([tmp_store,tmp],axis=0)
    train = train.merge(tmp_store,on=["Date","Store"],how="left")
def RMSPE(valid_df,pred_df):
    valid_df = valid_df.rename(columns={"Sales_interped":"Sales_interped_real"})
    valid_df = valid_df.merge(pred_df,how="left",on=["Store","Date"])
    #return valid_df
    return  np.sqrt(np.mean(((valid_df["Sales_interped_real"]-valid_df["Sales_interped"])/valid_df["Sales_interped_real"])**2))

def recursive_forecasting(model):
    valid_pred = pd.DataFrame(columns=x_cols+["Date"]+y_cols)
    valid_y = pd.DataFrame(columns=y_cols+["Store","Date"])
    for D in valid_D:
        sub = valid[valid.Date==D]
        valid_y = pd.concat([valid_y,sub.loc[:,y_cols+["Store","Date"]]],axis=0)
        # replace the lagged values with the model's own earlier predictions
        if (D-mx_date).days>1:
            ED = min((D-mx_date).days,14)
            for sd in range(1,ED):
                tmp_sft = valid_pred[valid_pred.Date==D+datetime.timedelta(days=-sd)]
                tmp_sft = tmp_sft.loc[:,["Store","Sales_interped","Customer_interped"]]
                sub.drop(["shifted_"+str(sd),"shifted_customers_"+str(sd)],axis=1,inplace=True)
                tmp_sft = tmp_sft.rename(columns={"Sales_interped":"shifted_"+str(sd),"Customer_interped":"shifted_customers_"+str(sd)})
                sub = sub.merge(tmp_sft,how="left",on="Store")
        #reorder the columns
        sub = sub.loc[:,x_cols]
        #predict
        pred = model.predict(sub)
        pred = pd.DataFrame(pred,columns={"Sales_interped","Customer_interped"})
        sub["Date"]=D
        sub = pd.concat([sub,pred],axis=1)
        sub = sub.loc[:,valid_pred.columns]
        valid_pred = pd.concat([valid_pred,sub],axis=0)
    #return valid_pred,valid_y
    return pd.DataFrame(valid_pred.groupby(["Store","Date"])["Sales_interped"].sum()),pd.DataFrame(valid_y.groupby(["Store","Date"])["Sales_interped"].sum())

Using the last two months of the training set as the validation set, XGBoost scores 0.13841894393724635.

1. Tuning and training individual models

Following the winner's strategy, the following models are built on the open-day data (Open==1):

  1. sales

  2. customer counts (Customers)

  3. sales and customer counts restricted to May through September

First, both Sales and Customers are strongly skewed and need a log transform. Optuna is used for hyperparameter tuning; roughly speaking, it fits a surrogate model of the objective's distribution and searches it for the best parameter combination.

The code below shows the tuning for the customer-count model.

data = pd.read_pickle("data_encoded_date_counted.pkl")
idx = [i for i in data.index if data.Year[i]==2013]
data.drop(idx,axis=0,inplace=True)
for i in ['Store','open_at_Sunday','has_competition','has_promo2','HALF_YEAR','Quater']:
    data[i] = data[i].astype("category")
data = data[data.Open==1]
data.drop(['Sales','Customers','CompetitionDistance','CompetitionOpenSinceMonth','CompetitionOpenSinceYear','Promo2SinceWeek','Promo2SinceYear','sales_per_customer','Year','Open'],axis=1,inplace=True)

train_data = data[data.Date<pd.to_datetime("2015-06-01")]
valid_data = data[data.Date>=pd.to_datetime("2015-06-01")]
del data

feature_cols = {
    'StorePromoDayOfWeekSchoolHoliday_Customer_interped_mean',
    'StorePromoDayOfWeekSchoolHoliday_Customer_interped_median',
    #...70+ more features omitted...
    'Customer_interped'
    }

train_data.drop([i for i in train_data.columns if i not in feature_cols],axis=1,inplace=True)
valid_data.drop([i for i in valid_data.columns if i not in feature_cols],axis=1,inplace=True)

train_data.Customer_interped = np.log(train_data.Customer_interped)
valid_data.Customer_interped = np.log(valid_data.Customer_interped)

def RMSPE(preds, labels): # evaluation metric; inputs are log-transformed
    preds = np.exp(preds)
    labels = np.exp(labels)
    return np.sqrt(np.mean(((preds-labels)/labels)**2))

y_cols = {'Customer_interped'}
x_train = train_data[[i for i in train_data.columns if i not in y_cols]]
y_train = train_data[list(y_cols)]
x_valid = valid_data[[i for i in valid_data.columns if i not in y_cols]]
y_valid = valid_data[list(y_cols)]

categorial_feats = [i for i in train_data.columns if train_data[i].dtype == "category"]

def optuna_objective(trial):
    n_estimators = trial.suggest_int("n_estimators",2500,3500,step=100)
    learning_rate = trial.suggest_float("learning_rate",0.01,0.1,step=0.01)
    max_depth = trial.suggest_int("max_depth",3,10,step=1)
    model = XGB.XGBRegressor(n_estimators=n_estimators,learning_rate=learning_rate,max_depth=max_depth,tree_method="hist",enable_categorical=True,n_jobs=-1)
    model.fit(x_train,y_train)
    pred = model.predict(x_valid)
    score=RMSPE(pred,np.array(y_valid.iloc[:,0]))
        
    return score

def optimizer_TPE(n_trials):
    algo = optuna.samplers.TPESampler()
    study = optuna.create_study(sampler=algo,direction="minimize")
    study.optimize(optuna_objective,n_trials=n_trials,show_progress_bar=True)
    return study,study.best_trial.params,study.best_trial.values

optuna.logging.set_verbosity(optuna.logging.ERROR)
study,best_params,best_score = optimizer_TPE(50)

In the end, XGBoost, CatBoost, and LightGBM models are each trained for blending and prediction; the final XGBoost fit is shown below.

model = XGB.XGBRegressor(n_estimators=3200,learning_rate=0.02,max_depth=10,tree_method="hist",enable_categorical=True,n_jobs=-1)
model.fit(x_train,y_train)
pred = model.predict(x_valid)
print(RMSPE(pred,np.array(y_valid.iloc[:,0])))
model = XGB.XGBRegressor(n_estimators=3200,learning_rate=0.02,max_depth=10,tree_method="hist",enable_categorical=True,n_jobs=-1)
model.fit(train_all_x,train_all_y)
model.save_model("xgb_customer_final.json")

Prediction results:

Note that the CatBoost predictions (cat_res) are multiplied by a factor of 0.995; the theoretical basis is discussed at https://www.kaggle.com/competitions/rossmann-store-sales/discussion/17601. Unfortunately the LaTeX in that thread is broken; I will revisit the derivation when I have time.

This result (private: 0.12468) is not great; it would only place around 1299th on the leaderboard. So next I look at blending.

2. Blending different models on the same features

For the XGBoost, CatBoost, and LightGBM models trained on the same features, the predictions are used as features for a further model.

model_xgb_sales = XGB.XGBRegressor(enable_categorical=True)
model_xgb_sales.load_model("xgb_sales_final.json")

model_xgb_customer = XGB.XGBRegressor(enable_categorical=True)
model_xgb_customer.load_model("xgb_customer_final.json")

with open("cat_sales_final.pkl","rb") as f:
    model_cat_sales=pickle.load(f)
    
with open("cat_customer_final.pkl","rb") as f:
    model_cat_customers=pickle.load(f)
    
with open("lgb_sales_final.pkl","rb") as f:
    model_lgb_sales=pickle.load(f)
    
with open("lgb_customer_final.pkl","rb") as f:
    model_lgb_customers=pickle.load(f)

data["sales_xgb"] = model_xgb_sales.predict(data[model_xgb_sales.feature_names_in_])
data["customer_xgb"] = model_xgb_customer.predict(data[model_xgb_customer.feature_names_in_])

data["sales_cat"] = model_cat_sales.predict(data[model_cat_sales.feature_names_])
data["customer_cat"] = model_cat_customers.predict(data[model_cat_customers.feature_names_])

data["sales_lgb"] = model_lgb_sales.predict(data[model_lgb_sales.feature_name_])
data["customer_lgb"] = model_lgb_customers.predict(data[model_lgb_customers.feature_name_])


feature_cols = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday',
       'StoreType', 'Assortment', 'CompetitionDistance', 'Promo2',
       'PromoInterval', 'Month', 'open_at_Sunday', 'day',
       'has_competition', 'has_promo2', 'customers_terms',
       'customers_weekly', 'customers_trend', 'sales_terms',
       'sales_weekly', 'sales_trend', 'week_of_year', 'HALF_YEAR',
       'Quater', 'StateHoliday_cycle_14days', 'Promo_cycle_14days',
       'has_promo2_cycle_14days', 'sales_xgb', 'customer_xgb',
       'sales_cat', 'customer_cat', 'sales_lgb', 'customer_lgb',"Sales"]
data = data[feature_cols]

train_data,valid_data = train_test_split(data)
del data

train_data.drop([i for i in train_data.columns if i not in feature_cols],axis=1,inplace=True)
valid_data.drop([i for i in valid_data.columns if i not in feature_cols],axis=1,inplace=True)
train_data.Sales = np.log(train_data.Sales)
valid_data.Sales = np.log(valid_data.Sales)

x_train = train_data[[i for i in train_data.columns if i not in y_cols]]
y_train = train_data[list(y_cols)]
x_valid = valid_data[[i for i in valid_data.columns if i not in y_cols]]
y_valid = valid_data[list(y_cols)]
categorial_feats = [i for i in train_data.columns if train_data[i].dtype == "category"]

model = lgb.LGBMRegressor(n_estimators=3000,learning_rate=0.05,max_depth=7,n_jobs=-1,subsample=0.5,colsample_bytree=0.5)
# `data` was deleted above, so refit on the full frame (train + valid); Sales there is already log-transformed
full_data = pd.concat([train_data,valid_data],axis=0)
model.fit(full_data[[i for i in full_data.columns if i not in y_cols]],full_data[list(y_cols)])
with open("lgb_model_merge.pkl","wb") as f:
    pickle.dump(model,f)

The final result:

Compared with the single model, this improves the corresponding leaderboard position by roughly 600 places.

3. Blending the same model on different feature sets

According to the winner's PDF, their approach was: use XGBoost throughout, build models on hand-picked and randomly selected subsets of their features, generate 500+ models, and finally keep 10 of them. I split all of my features into five groups: the original (hold) features, previous-half-year features, previous-quarter features, the global encodings, and the Prophet trend and seasonal components. These groups are combined by hand into several models; in addition, random subsets of 50 features drawn from the full pool are used to model sales and customer counts.

def choose_times(columns):
    # keep only the stretch of history for which the chosen lookback features exist
    if any(c in set(preyear_columns) for c in columns):
        return data[data.Year>2013]
    elif any(c in set(prehalfyear_columns) for c in columns):
        return data[data.Date>=pd.to_datetime("2013-07-01")]
    elif any(c in set(prequarter_columns) for c in columns):
        return data[data.Date>=pd.to_datetime("2013-04-01")]
    else:
        return data

def calc_and_generate_model(columns,y_col,save_name):
    data = choose_times(columns)
    learning_rate = 0.06
    max_depth = 3
    subsample = 0.5
    n_estimators = 3500
    colsample_bytree=0.5
    if len(columns)>50:
            n_estimators=5000
            colsample_bytree=0.3

    model = lgb.LGBMRegressor(n_estimators=n_estimators,
                             learning_rate=learning_rate,
                             max_depth=max_depth,
                             subsample=subsample,
                             colsample_bytree=colsample_bytree)
    model.fit(train[columns],np.log(train[y_col]))
    score = RMSPE(np.log(valid[y_col]),model.predict(valid[columns]))
    #score_train = RMSPE(np.log(train[y_col]),model.predict(train[columns]))
    #score = cross_val_score(model,train[columns],np.log(train[y_col]),scoring=eval_score,cv=3)
    with open("/mnt/cv_logs.txt","a+") as f:
        f.write(",".join([save_name,str(score)])+"\n")
    with open("/to_select_models/"+save_name,"wb") as f:
        pickle.dump(model,f)
# build models on random subsets of 50 features
all_columns = trend_columns+hold_columns+cycle_columns+encoding_columns+prehalfyear_columns+prequarter_columns

for i in range(150):
    selected_cols = random.sample(all_columns,50)
    calc_and_generate_model(selected_cols,"Sales","model_Sales_"+str(i)+".pkl")
for i in range(100):
    selected_cols = random.sample(all_columns,50)
    calc_and_generate_model(selected_cols,"Customers","model_Customers_"+str(i)+".pkl")

#training turned out faster than expected, so another 250 models were added, for 500 in total
for i in range(150):
    selected_cols = random.sample(all_columns,50)
    calc_and_generate_model(selected_cols,"Sales","model_Sales_add_"+str(i)+".pkl")
for i in range(100):
    selected_cols = random.sample(all_columns,50)
    calc_and_generate_model(selected_cols,"Customers","model_Customers_add_"+str(i)+".pkl")

#some hand-picked feature-group models
calc_and_generate_model(hold_columns+cycle_columns,"Sales","model_Sales_cycle.pkl")
calc_and_generate_model(hold_columns+trend_columns,"Sales","model_Sales_trend.pkl")
calc_and_generate_model(hold_columns+prehalfyear_columns,"Sales","model_Sales_halfyear.pkl")
calc_and_generate_model(hold_columns+prequarter_columns,"Sales","model_Sales_quarter.pkl")
calc_and_generate_model(hold_columns+encoding_columns,"Sales","model_Sales_encoding.pkl")

calc_and_generate_model(hold_columns+cycle_columns,"Customers","model_Customers_cycle.pkl")
calc_and_generate_model(hold_columns+trend_columns,"Customers","model_Customers_trend.pkl")
calc_and_generate_model(hold_columns+prehalfyear_columns,"Customers","model_Customers_halfyear.pkl")
calc_and_generate_model(hold_columns+prequarter_columns,"Customers","model_Customers_quarter.pkl")
calc_and_generate_model(hold_columns+encoding_columns,"Customers","model_Customers_encoding.pkl")

From these models I pick the best 10 pairs (10 for sales, 10 for customer counts), add the predictions of the Null Importance model, and assemble a new model dataset.
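The model names below are hard-coded; one possible way to pick the 10 best of each kind automatically is to read back the cv_logs.txt written by calc_and_generate_model (a sketch, assuming the "save_name,score" line format used above):

logs = pd.read_csv("/mnt/cv_logs.txt", header=None, names=["save_name", "score"])
logs["target"] = ["Customers" if "Customers" in n else "Sales" for n in logs.save_name]
best10 = logs.sort_values("score").groupby("target").head(10)   # lowest validation RMSPE first
sales_models_auto = best10[best10.target == "Sales"].save_name.tolist()
customer_models_auto = best10[best10.target == "Customers"].save_name.tolist()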

#1 the 10 best of the randomly generated models
sales_models = ['model_Sales_0.pkl',
'model_Sales_add_115.pkl',
'model_Sales_add_53.pkl',
#...
'model_Sales_add_58.pkl',
'model_Sales_add_128.pkl']

customer_models = ['model_Customers_53.pkl',
'model_Customers_35.pkl',
'model_Customers_add_26.pkl',
'model_Customers_add_4.pkl',
#...
'model_Customers_add_56.pkl']

# use the predictions of the 10 best models as features
for i in range(10):
    with open("to_select_models/"+sales_models[i],"rb") as f:
        model = pickle.load(f)
    res = model.predict(train[model.feature_name_])
    train[sales_models[i]]=res
    res = model.predict(valid[model.feature_name_])
    valid[sales_models[i]]=res

for i in range(10):
    with open("to_select_models/"+customer_models[i],"rb") as f:
        model = pickle.load(f)
    res = model.predict(train[model.feature_name_])
    train[customer_models[i]]=res
    res = model.predict(valid[model.feature_name_])
    valid[customer_models[i]]=res

# add the Null Importance model's predictions as features
feature_cols = [
    'StorePromoDayOfWeekSchoolHoliday_Customer_interped_mean',
    'StorePromoDayOfWeekSchoolHoliday_Customer_interped_median',
    'StorePromoDayOfWeekSchoolHoliday_Customer_interped_hmean',
    'StorePromoDayOfWeek_Customer_interped_mean',
    'StorePromoDayOfWeek_Customer_interped_median',
    'StorePromoDayOfWeek_Customer_interped_hmean',
    'prehalfyear_StorePromoDayOfWeekYearHALF_YEAR_Customer_interped_mean',
    'StateHoliday_cycle_14days',
    #...
    'StorePromo_sales_per_customer_std'
]
learning_rate = 0.06
max_depth = 3
subsample = 0.5
n_estimators=5000
colsample_bytree=0.3
model = lgb.LGBMRegressor(n_estimators=n_estimators,
                         learning_rate=learning_rate,
                         max_depth=max_depth,
                         subsample=subsample,
                         colsample_bytree=colsample_bytree)
model.fit(train[feature_cols],np.log1p(train["Customers"]))

res = model.predict(train[model.feature_name_])
train["model_null_importance_Customers"]=res
res = model.predict(valid[model.feature_name_])
valid["model_null_importance_Customers"]=res

#the Sales model is built the same way and is omitted here

The predictions of these models are then used as features to build one more model:

feas = ['model_Sales_add_115.pkl',
 'model_Customers_84.pkl',
 'customers_trend',
 'customers_terms',
 'model_null_importance_Customers',
 'sales_trend',
 'DayOfWeek',
 'model_Sales_54.pkl',
 'has_promo2_cycle_14days',
 'model_Sales_add_53.pkl',
 'model_Sales_86.pkl',
 'StateHoliday_cycle_14days',
 'Promo_cycle_14days',
 'Promo',
 'model_Customers_trend.pkl',
 'model_Sales_add_112.pkl',
 'sales_terms',
 'sales_weekly',
 'customers_weekly',
 'model_Sales_add_87.pkl',
 'model_Sales_0.pkl',
 'model_null_importance_Sales',
 'model_Sales_trend.pkl',
 'model_Sales_encoding.pkl',
 'day',
 'week_of_year',
 'Store',
"CompetitionOpen",
"PromoOpen"]
model_for_merge = lgb.LGBMRegressor(n_estimators=3000,
                         learning_rate=0.03,
                         max_depth=10,
                         subsample=0.9,
                        colsample_bytree=0.7)

#evaluate on the 6-week validation set
model_for_merge.fit(train[feas],np.log1p(train["Sales"]))
res = model_for_merge.predict(valid[feas])
for i in [1,0.995,0.99,0.985,0.98,0.975,0.97]:
    print(i,RMSPE_factor(np.log1p(valid["Sales"]),res,factor=i))
#RMSPE is smallest with a factor of 0.99

model_for_merge.fit(ttl_data[feas],np.log1p(ttl_data["Sales"]))
#train the final model on the full training set

For the final submission I applied a multiplier of 0.985, landing at an RMSPE of 0.11874. Compared with blending over the same features, the corresponding leaderboard position rises by roughly another 200 places, still slightly short of the top 10% (0.11773).

Summary and reflections

This project followed the winner's PDF for feature extraction and modeling, using Prophet to extract trend and seasonal components, Null Importance for feature selection, and Optuna for Bayesian hyperparameter search. The final result was still not entirely satisfying; I see several reasons:

  1. The Recent Data features are far too numerous and may dilute the influence of the other features during training.

  2. Over-reliance on automated tools such as Null Importance and Optuna meant that some problems in model training were never diagnosed.

  3. Following the winner's PDF too literally, without my own judgement during feature extraction, may have dropped important variables and "manufactured" wrong ones through misunderstanding.

  4. Too little was borrowed from other competitors' experience, so preparation was insufficient.

  5. No baseline model was established first, which made the later strategy chaotic.

Among the public notebooks I found one that reaches bronze-medal level by adding just a few features on top of the raw CSV files and transforming them (whereas the pipeline in this post inflated a 36.3 MB CSV into a 10 GB folder...). That is clearly the kind of approach that deserves more attention. The third-place solution is also preserved on GitHub; these strategies warrant deeper study.

Addendum: a (more effective) single model

After calming down, I decided to start over from a single model and build a baseline to observe how each group of features affects the result.

# define eval metrics
def rmspe(y, yhat):
    return np.sqrt(np.mean((yhat/y-1) ** 2))

def rmspe_xg(yhat, y):
    y = np.expm1(y.get_label())
    yhat = np.expm1(yhat)
    return "rmspe", rmspe(y,yhat)

def get_score_of_feature(x_train,y_train,x_valid,y_valid,fea):
    model = LGB.LGBMRegressor(n_estimators=4000,learning_rate=0.03,
                         subsample=0.9,colsample_bytree=0.7,
                        max_depth=10,random_state=100)

    model.fit(x_train,y_train,verbose=0)
    y_pred = model.predict(x_valid)
    error = rmspe(np.expm1(y_valid), np.expm1(y_pred))
    print(fea+'RMSPE: {:.4f}'.format(error))

feature_bases = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday',
       'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'Year', 'Month', 'day',
       'week_of_year', 'CompetitionOpen', 'PromoOpen', 'has_promo2']

prophet_features = ['customers_terms','customers_weekly','customers_trend','sales_terms','sales_weekly','sales_trend']
pre_halfyear_features = [i for i in train.columns if "prehalfyear" in i and "sales_per_customer" not in i]
pre_year_features = [i for i in train.columns if "preyear" in i and "sales_per_customer" not in i]
pre_quarter_features = [i for i in train.columns if "prequarter" in i and "sales_per_customer" not in i]

def get_score_of_feature_type(cols,name):
    if name == "preyear":
        train_tmp = train[train.Year>2013]
    elif name == "pre_halfyear":
        train_tmp = train[train.Date>=pd.to_datetime("2013-07-01")]
    elif name == "pre_quarter":
        train_tmp = train[train.Date>=pd.to_datetime("2013-05-01")]
    else:
        train_tmp = train
    x_train = train_tmp[feature_bases+cols]
    y_train = np.log1p(train_tmp["Sales"])

    x_valid = valid[feature_bases+cols]
    y_valid = np.log1p(valid["Sales"])
    get_score_of_feature(x_train,y_train,x_valid,y_valid,name)

get_score_of_feature_type(prophet_features,"prophet")#0.1173
get_score_of_feature_type(pre_halfyear_features,"halfyear")#0.1208
get_score_of_feature_type(pre_year_features,"preyear") #0.1225
get_score_of_feature_type(pre_quarter_features,"prequarter") #0.1246

On the validation set, the LGBM model that adds only the Prophet components actually performs best. With it as the baseline, I made further adjustments on the training set.

I borrowed three ideas from the 66th-place solution (the full write-up is at https://www.kaggle.com/competitions/rossmann-store-sales/discussion/18410):

  1. Add log(day).

  2. Drop the stores that do not appear in the test set.

  3. Drop each store's outliers (defined here as points more than 2 interquartile ranges away).

#1. add log(day)
data["log_day"] = np.log(data["day"])

#2. drop the stores that do not appear in the test set
train_store = list(set(data.Store))
delete_store = []
for i in train_store:
    if i not in to_predict_store:
        delete_store.append(i)

for i in delete_store:
    data = data[data.Store != i]

#3. drop each store's outliers
data_delete_ouliers = pd.DataFrame(columns=data.columns)
for i in list(set(data.Store)):
    tmp = data[data.Store==i]
    q1, q3 = tmp['Sales'].quantile([0.25, 0.75])
    iqr = q3 - q1
    tmp = tmp[tmp.Sales<=q3 + iqr * 2]
    tmp = tmp[tmp.Sales>=q1 - iqr * 2]
    data_delete_ouliers = pd.concat([data_delete_ouliers,tmp],axis=0)

data = data_delete_ouliers
del data_delete_ouliers

#build the final model
feature_bases = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday',
       'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'Year', 'Month', 'day',
       'week_of_year', 'CompetitionOpen', 'PromoOpen', 'has_promo2',"log_day"]

prophet_features = ['customers_terms','customers_weekly','customers_trend','sales_terms','sales_weekly','sales_trend']

model = LGB.LGBMRegressor(n_estimators=4000,learning_rate=0.03,
                         subsample=0.9,colsample_bytree=0.7,
                        max_depth=10)

model.fit(x,y,verbose=0)
with open("model_with_prophet.pkl","wb") as f:
    pickle.dump(model,f)

preds = model.predict(test[model.feature_name_])
test["Sales"] = np.expm1(preds)
test[["Id","Sales"]].to_csv("submission_lgb_new.csv",index=None)

The final private score is 0.11273, which corresponds to roughly 75th place on the leaderboard, a much bigger gain than the blending experiments delivered.

Lessons learned from this exercise:

  1. Do not get obsessed with elaborate model blending; establish a baseline model first and decide the subsequent strategy from there.

  2. Do not spend too much time on hyperparameter tuning; it is even fine to reuse parameters others have already tuned. The real effort belongs in feature engineering and data preprocessing.

  3. Do not "manufacture" features in bulk; the Recent Data features above probably caused overfitting on the training set.


Reposted from blog.csdn.net/thorn_r/article/details/129763903