sklearn preprocessing

Recently, while writing a feature engineering module, I found two excellent packages -- tsfresh and sklearn.

tsfresh is specialized for time series data and mainly consists of two modules, feature extraction and feature selection:

from tsfresh import feature_selection, feature_extraction

To limit the number of irrelevant features, tsfresh deploys the FRESH algorithm. The whole process consists of three steps.

First, the algorithm characterizes the time series with comprehensive and well-established feature mappings. The feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.
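
A minimal sketch of this extraction step (the toy data and the column names id, time and value are assumptions for illustration, not from the original post):

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import feature_calculators

# Toy long-format time series: one series per id, ordered by time.
ts = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "time":  [0, 1, 2, 3, 0, 1, 2, 3],
    "value": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0],
})

# Derive one feature vector per id from the raw series.
features = extract_features(ts, column_id="id", column_sort="time")

# The individual feature calculators can also be called directly on a series.
print(feature_calculators.abs_energy(ts.loc[ts["id"] == 1, "value"]))
print(feature_calculators.mean(ts.loc[ts["id"] == 1, "value"]))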

In a second step, each extracted feature is individually evaluated with respect to its significance for predicting the target under investigation. These tests are contained in the submodule tsfresh.feature_selection.significance_tests. The result of the significance tests is a vector of p-values, quantifying the significance of each feature for predicting the target.
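
As a sketch of this evaluation step, the relevance table below collects the per-feature p-values; the toy feature matrix and binary target are made up for illustration:

import numpy as np
import pandas as pd
from tsfresh.feature_selection.relevance import calculate_relevance_table

# Toy feature matrix (standing in for the output of extract_features) and a binary target.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "informative": np.r_[rng.normal(0, 1, 50), rng.normal(3, 1, 50)],
    "noise":       rng.normal(0, 1, 100),
})
y = pd.Series(np.r_[np.zeros(50), np.ones(50)])

# One significance test per feature; the table holds the resulting p-values.
relevance = calculate_relevance_table(X, y)
print(relevance[["feature", "p_value", "relevant"]])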

Finally, the vector of p-values is evaluated on the basis of the Benjamini-Yekutieli procedure in order to decide which features to keep.
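
tsfresh applies this correction internally (for example when you call tsfresh.select_features); the snippet below only illustrates the Benjamini-Yekutieli procedure itself on a made-up p-value vector, using statsmodels rather than tsfresh's internal implementation:

import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up p-values, one per extracted feature.
p_values = np.array([0.0001, 0.003, 0.04, 0.2, 0.7])

# Benjamini-Yekutieli FDR control: `reject` marks the features to keep.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")
print(reject)      # boolean mask of features that survive the procedure
print(p_adjusted)  # adjusted p-values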

In summary, tsfresh is a scalable and efficient feature engineering tool.
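
For completeness, a minimal end-to-end sketch that chains all three steps through tsfresh's convenience wrapper; the toy data layout (columns id, time, value) and the synthetic labels are assumptions:

import numpy as np
import pandas as pd
from tsfresh import extract_relevant_features

# Toy long-format data: 20 series of length 30 whose mean depends on the label.
rng = np.random.RandomState(0)
frames, labels = [], {}
for i in range(20):
    label = i % 2
    values = rng.normal(loc=label, scale=1.0, size=30)
    frames.append(pd.DataFrame({"id": i, "time": range(30), "value": values}))
    labels[i] = label
ts = pd.concat(frames, ignore_index=True)
y = pd.Series(labels)

# Extraction, per-feature significance testing and Benjamini-Yekutieli selection in one call.
X_selected = extract_relevant_features(ts, y, column_id="id", column_sort="time")
print(X_selected.shape)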

Although tsfresh is powerful, I chose sklearn here.

I downloaded the heart disease data set. Its target is binary and it has 13 feature dimensions. I only used MinMaxScaler to transform the age, trestbps, chol and thalach columns, and compared two models: an AutoSklearnClassifier ensemble and a RandomForestClassifier. Neither model performed well.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from numpy import set_printoptions, inf
set_printoptions(threshold=inf)
import pandas as pd

data = pd.read_csv("../data_set/heart.csv")
X = data[data.columns[:data.shape[1] - 1]].values  # all columns except the target
y = data[data.columns[-1]].values                  # the binary target column

# Scale only age, trestbps, chol and thalach (columns 0, 3, 4, 7) to [0, 1].
X[:, [0, 3, 4, 7]] = MinMaxScaler().fit_transform(X[:, [0, 3, 4, 7]])
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from autosklearn.classification import AutoSklearnClassifier
model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3, include_preprocessors=["no_preprocessing"], seed=3)
model_auto.fit(x_train, y_train)

from sklearn.metrics import accuracy_score
y_pred = model_auto.predict(x_test)
accuracy_score(y_test, y_pred)  # 0.8021978021978022


# RandomForestClassifier was already imported above.
model = RandomForestClassifier(n_estimators=500)
model.fit(x_train, y_train)
y_pred_rf = model.predict(x_test)
accuracy_score(y_test, y_pred_rf)  # 0.8051648351648352

My personal web site provides an AutoML service. I uploaded this data set to it and it got a better score than my code: http://simple-automl.com/preview.html

0.8131868131868132


Reposted from www.cnblogs.com/xu-xiaofeng/p/10934296.html