GCForest Study 2: Using gcForest

gcForest Parameter Descriptions

shape_1X:
Shape of a single sample element [n_lines, n_cols]. Required when calling mg_scanning! For sequence data, a single int can be given.

n_mgsRFtree:
Number of trees in the random forests during multi-grain scanning.

window: int (default=None)
List of window sizes to use during multi-grain scanning. If None, no slicing is performed. The choice of window determines the granularity: with 5, only a window of size 5 slides over each sample, while [4, 5] slides windows of sizes 4 and 5 separately, which is what makes the scanning multi-grained.

stride: int (default=1)
Step used when slicing the data, similar to the stride in a CNN.

cascade_test_size: float or int (default=0.2)
Fraction or absolute number of samples used for the cascade training set split.

n_cascadeRF: int (default=2)
Number of random forests in a cascade layer. For each pseudo random forest a complete random forest is also created, so the total number of forests in a layer is 2 * n_cascadeRF: one random forest plus one completely-random trees forest.

n_cascadeRFtree: int (default=101)
Number of trees in a single random forest of a cascade layer.

min_samples_mgs: float or int (default=0.1)
Minimum number of samples in a node to perform a split during multi-grain scanning random forest training. If int, number_of_samples = int. If float, min_samples represents the fraction of the initial n_samples to consider.

min_samples_cascade: float or int (default=0.1)
Minimum number of samples in a node to perform a split during cascade random forest training. If int, number_of_samples = int. If float, min_samples represents the fraction of the initial n_samples to consider.

cascade_layer: int (default=np.inf)
Maximum number of cascade layers allowed; used to limit the depth of the cascade. In general the model can pick a suitable number of layers on its own based on cross-validation results.

tolerance: float (default=0.0)
Accuracy tolerance for growing the cascade. The performance of the whole cascade is estimated on a validation set; if there is no significant performance gain, training terminates.

n_jobs: int (default=1)
Number of jobs to run in parallel for the fit and predict of any random forest. If -1, the number of jobs is set to the number of cores.
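Putting the parameters above together, here is a minimal sketch of a fully spelled-out constructor call, assuming the imports shown in the next section; the values are illustrative, not recommendations:

gcf = gcForest(shape_1X=[8, 8],        # each sample is an 8x8 image
               n_mgsRFtree=30,         # trees per forest during multi-grain scanning
               window=[4, 6],          # slide windows of sizes 4 and 6 over each sample
               stride=1,               # move the window one step at a time
               cascade_test_size=0.2,  # validation split used to grow the cascade
               n_cascadeRF=2,          # 2 * 2 = 4 forests per cascade layer in total
               n_cascadeRFtree=101,    # trees per forest in the cascade
               min_samples_mgs=0.1,
               min_samples_cascade=0.1,
               cascade_layer=np.inf,   # no hard cap; tolerance decides when to stop
               tolerance=0.0,
               n_jobs=-1)              # use all available cores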

gcForest Algorithm Application

The gcForest algorithm was suggested in Zhou and Feng 2017 (https://arxiv.org/abs/1702.08835; refer to this paper for technical details), and I provide here a Python 3 implementation of this algorithm.
I chose to adopt the scikit-learn syntax for ease of use, and hereafter I present how it can be used.


from GCForest import gcForest
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np 

Iris Example

The iris data set is actually not a very good example, as the gcForest algorithm is better suited to time series and images, where information can be found at different scales within one sample.
Nonetheless, it is still an easy way to test the method.

# loading data
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape,y.shape)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33)
# train
gcf = gcForest(shape_1X=4, window=2, tolerance=0.0)
gcf.fit(X_tr, y_tr)
# test
pred_X = gcf.predict(X_te)
print(pred_X)
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

Digits Example

A much better example is the digits data set, containing images of handwritten digits. This scikit-learn data set can be viewed as a mini-MNIST for training purposes.

# loading the data
digits = load_digits()
X = digits.data
y = digits.target
print(X.shape,y.shape)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)
print(X_tr.shape,X_te.shape)
# train
gcf = gcForest(shape_1X=[8,8], window=[4,6], tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
gcf.fit(X_tr, y_tr)
# test
pred_X = gcf.predict(X_te)
print(pred_X)
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

Saving Models to Disk

You probably don't want to re-train your classifier every day, especially if you're using it on large data sets. Fortunately, there is a very easy way to save and load models to disk using sklearn.externals.joblib.

# saving model
from sklearn.externals import joblib
joblib.dump(gcf, 'gcf_model.sav')
# loading model
gcf = joblib.load('gcf_model.sav')
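Note that sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23; with a recent scikit-learn, the equivalent is the standalone joblib package:

# modern equivalent (pip install joblib)
import joblib
joblib.dump(gcf, 'gcf_model.sav')
gcf = joblib.load('gcf_model.sav')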

Using mg_scanning and cascade_forest Separately

As the multi-grain scanning and the cascade forest modules are quite independent, it is possible to use them separately.
If a target y is given, the code automatically uses it for training; otherwise it recalls the last trained random forests to slice the data.

gcf = gcForest(shape_1X=[8,8], window=5, min_samples_mgs=10, min_samples_cascade=7)
X_tr_mgs = gcf.mg_scanning(X_tr, y_tr)
# print(X_tr_mgs.shape)
# print(X_tr.shape)
X_te_mgs = gcf.mg_scanning(X_te)
'''
It is now possible to use the mg_scanning output as input for cascade forests using different parameters. 
Note that the cascade forest module does not directly return predictions but probability predictions from each Random Forest in the last layer of the cascade.
Hence the need to first take the mean of the output and then find the max.
'''
# set the parameters
gcf = gcForest(tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
_ = gcf.cascade_forest(X_tr_mgs,y_tr)
pred_proba = gcf.cascade_forest(X_te_mgs)
# print(X_te_mgs.shape)
# print(pred_proba[1])  # pred_proba is a list of probability arrays, one per forest
tmp = np.mean(pred_proba, axis=0)
#print(tmp.shape)
preds = np.argmax(tmp,axis=1)
# wrap in print() so the score shows outside a notebook
print(accuracy_score(y_true=y_te, y_pred=preds))
# test different parameters
gcf = gcForest(tolerance=0.0, min_samples_mgs=20, min_samples_cascade=10)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
print(accuracy_score(y_true=y_te, y_pred=preds))

'''
Attention:
It is also possible to use the cascade forest directly and skip the multi-grain scanning step.
'''
gcf = gcForest(tolerance=0.0, min_samples_cascade=20)
_ = gcf.cascade_forest(X_tr, y_tr)

pred_proba = gcf.cascade_forest(X_te)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
print(accuracy_score(y_true=y_te, y_pred=preds))
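Since the average-then-argmax step is repeated for every cascade_forest prediction, it can be wrapped in a small helper; this function is hypothetical (not part of the library) and simply reuses the pattern above:

def cascade_predict(gcf, X):
    # cascade_forest returns a list of class-probability arrays,
    # one per random forest in the last layer of the cascade
    pred_proba = gcf.cascade_forest(X)
    # average the probabilities across forests, then pick the most likely class
    return np.argmax(np.mean(pred_proba, axis=0), axis=1)

preds = cascade_predict(gcf, X_te)
print('accuracy : {}'.format(accuracy_score(y_true=y_te, y_pred=preds)))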

Reposted from blog.csdn.net/qq_33876194/article/details/80273437