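3.1 Learning objectives
- Learn how to preprocess features of time series data
- Learn how to use Tsfresh (TimeSeries Fresh), a feature processing tool for time series
3.2 Overview
Data preprocessing
- Reshaping the time series data
- Adding the time-step feature time
Feature engineering
- Constructing time series features
- Feature selection
- Using tsfresh
3.3 Code examples
3.3.1 Importing packages and reading the data
Tsfresh is a feature engineering tool for relational data that contains time series. It can automatically extract more than 100 features from a time series.
The package includes several feature extraction methods and a robust feature selection algorithm, as well as methods for evaluating
how much explanatory power and importance these features have for regression or classification tasks.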
https://zhuanlan.zhihu.com/p/93310900
# Import packages
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")
print(data_train.shape)
print(data_test_A.shape)
(100000, 3)
(20000, 2)
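3.3.2 Data preprocessing
- Reshape the heartbeat signals from one comma-separated string per row into one row per value, and add the time-step feature time to each signal
- Use reset_index() and set_index()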
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",",expand=True).stack()
train_heartbeat_df
0 0 0.9912297987616655
1 0.9435330436439665
2 0.7646772997256593
3 0.6185708990212999
4 0.3796321642826237
...
99999 200 0.0
201 0.0
202 0.0
203 0.0
204 0.0
Length: 20500000, dtype: object
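To see what `str.split(",", expand=True).stack()` does on a small scale, here is a minimal sketch. The frame `toy` and its values are made up for illustration; only the column names follow the source.

```python
import pandas as pd

# Hypothetical stand-in for data_train: two signals with three time steps each.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.4,0.0"],
})

# expand=True spreads the split pieces over columns 0, 1, 2 (one per time step).
wide = toy["heartbeat_signals"].str.split(",", expand=True)

# stack() pivots those columns into an inner index level, giving one entry per
# (sample, time step) pair. The values are still strings at this point.
stacked = wide.stack()

print(wide.shape)
print(stacked)
```

The outer index level is the original row number and the inner level is the position within the signal, which is why the real output above is indexed by pairs such as `0 0`, `0 1`, and so on.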
- Reset the index, which also turns the Series into a DataFrame
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df
| | level_0 | level_1 | 0 |
---|---|---|---|
0 | 0 | 0 | 0.9912297987616655 |
1 | 0 | 1 | 0.9435330436439665 |
2 | 0 | 2 | 0.7646772997256593 |
3 | 0 | 3 | 0.6185708990212999 |
4 | 0 | 4 | 0.3796321642826237 |
... | ... | ... | ... |
20499995 | 99999 | 200 | 0.0 |
20499996 | 99999 | 201 | 0.0 |
20499997 | 99999 | 202 | 0.0 |
20499998 | 99999 | 203 | 0.0 |
20499999 | 99999 | 204 | 0.0 |
20500000 rows × 3 columns
- Set level_0 as the index
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df
level_0 | level_1 | 0 |
---|---|---|
0 | 0 | 0.9912297987616655 |
0 | 1 | 0.9435330436439665 |
0 | 2 | 0.7646772997256593 |
0 | 3 | 0.6185708990212999 |
0 | 4 | 0.3796321642826237 |
... | ... | ... |
99999 | 200 | 0.0 |
99999 | 201 | 0.0 |
99999 | 202 | 0.0 |
99999 | 203 | 0.0 |
99999 | 204 | 0.0 |
20500000 rows × 2 columns
- Clear the index name, so the level_0 label no longer appears above the index
train_heartbeat_df.index.name = None
train_heartbeat_df
| | level_1 | 0 |
---|---|---|
0 | 0 | 0.9912297987616655 |
0 | 1 | 0.9435330436439665 |
0 | 2 | 0.7646772997256593 |
0 | 3 | 0.6185708990212999 |
0 | 4 | 0.3796321642826237 |
... | ... | ... |
99999 | 200 | 0.0 |
99999 | 201 | 0.0 |
99999 | 202 | 0.0 |
99999 | 203 | 0.0 |
99999 | 204 | 0.0 |
20500000 rows × 2 columns
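The three index steps above can be traced on a tiny example. This is a sketch on a hypothetical four-value Series (`stacked`, `flat`, and `by_sample` are illustrative names), built to look like the stacked output from earlier.

```python
import pandas as pd

# Hypothetical stacked signals: outer level = sample, inner level = time step.
stacked = pd.Series(
    ["0.9", "0.5", "1.0", "0.4"],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)]),
)

# The unnamed index levels become columns called level_0 and level_1,
# and the values land in a column named 0.
flat = stacked.reset_index()

# The sample number becomes the index again, so it repeats once per time step.
by_sample = flat.set_index("level_0")

# Remove the "level_0" label; the index values themselves are unchanged.
by_sample.index.name = None

print(by_sample)
```

Keeping the sample number as a repeated index is what later lets the long-format signals line up with the per-sample rows of `data_train`.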
- Rename the columns with rename(); inplace=True modifies the DataFrame directly instead of returning a renamed copy
train_heartbeat_df.rename(
    columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
train_heartbeat_df
| | time | heartbeat_signals |
---|---|---|
0 | 0 | 0.9912297987616655 |
0 | 1 | 0.9435330436439665 |
0 | 2 | 0.7646772997256593 |
0 | 3 | 0.6185708990212999 |
0 | 4 | 0.3796321642826237 |
... | ... | ... |
99999 | 200 | 0.0 |
99999 | 201 | 0.0 |
99999 | 202 | 0.0 |
99999 | 203 | 0.0 |
99999 | 204 | 0.0 |
20500000 rows × 2 columns
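The difference between the two ways of calling `rename()` can be checked on a small frame. A minimal sketch, with `toy`, `renamed`, and `modified` as hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({"level_1": [0, 1], 0: ["0.9", "0.5"]})

# Without inplace, rename returns a new frame and leaves the original untouched.
renamed = toy.rename(columns={"level_1": "time", 0: "heartbeat_signals"})

# With inplace=True, the frame the method is called on is changed directly.
modified = toy.copy()
modified.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)

print(list(toy.columns))
print(list(renamed.columns))
print(list(modified.columns))
```

Note that the key `0` in the mapping is the integer column label produced by `reset_index()`, not the string `"0"`.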
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
train_heartbeat_df
| | time | heartbeat_signals |
---|---|---|
0 | 0 | 0.991230 |
0 | 1 | 0.943533 |
0 | 2 | 0.764677 |
0 | 3 | 0.618571 |
0 | 4 | 0.379632 |
... | ... | ... |
99999 | 200 | 0.000000 |
99999 | 201 | 0.000000 |
99999 | 202 | 0.000000 |
99999 | 203 | 0.000000 |
99999 | 204 | 0.000000 |
20500000 rows × 2 columns
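For comparison, the same long format can be reached more compactly. The sketch below is an alternative, not the method used in this post: it relies on `explode()` and `groupby(...).cumcount()`, and runs on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical stand-in for the raw training data.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.4,0.0"],
})

long_df = (
    toy.assign(heartbeat_signals=toy["heartbeat_signals"].str.split(","))
       .explode("heartbeat_signals")           # one row per signal value
       .astype({"heartbeat_signals": float})   # strings -> floats
)

# Number the values within each sample to get the time-step column.
long_df["time"] = long_df.groupby("id").cumcount().to_numpy()
long_df = long_df[["id", "time", "heartbeat_signals"]]

print(long_df)
```

The step-by-step version in the post is easier to inspect at each stage; this variant mainly saves the later drop and join.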
- Add the processed heartbeat signals back to the training data, and store the label column of the training data separately
data_train_label = data_train["label"]
data_train_label
0 0.0
1 0.0
2 2.0
3 0.0
4 2.0
...
99995 0.0
99996 2.0
99997 3.0
99998 2.0
99999 0.0
Name: label, Length: 100000, dtype: float64
- Drop the label column from data_train
data_train = data_train.drop('label',axis=1)
data_train
| | id | heartbeat_signals |
---|---|---|
0 | 0 | 0.9912297987616655,0.9435330436439665,0.764677... |
1 | 1 | 0.9714822034884503,0.9289687459588268,0.572932... |
2 | 2 | 1.0,0.9591487564065292,0.7013782792997189,0.23... |
3 | 3 | 0.9757952826275774,0.9340884687738161,0.659636... |
4 | 4 | 0.0,0.055816398940721094,0.26129357194994196,0... |
... | ... | ... |
99995 | 99995 | 1.0,0.677705342021188,0.22239242747868546,0.25... |
99996 | 99996 | 0.9268571578157265,0.9063471198026871,0.636993... |
99997 | 99997 | 0.9258351628306013,0.5873839035878395,0.633226... |
99998 | 99998 | 1.0,0.9947621698382489,0.8297017704865509,0.45... |
99999 | 99999 | 0.9259994004527861,0.916476635326053,0.4042900... |
100000 rows × 2 columns
data_train = data_train.drop("heartbeat_signals", axis=1)
data_train
| | id |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
... | ... |
99995 | 99995 |
99996 | 99996 |
99997 | 99997 |
99998 | 99998 |
99999 | 99999 |
100000 rows × 1 columns
data_train = data_train.join(train_heartbeat_df)
data_train
| | id | time | heartbeat_signals |
---|---|---|---|
0 | 0 | 0 | 0.991230 |
0 | 0 | 1 | 0.943533 |
0 | 0 | 2 | 0.764677 |
0 | 0 | 3 | 0.618571 |
0 | 0 | 4 | 0.379632 |
... | ... | ... | ... |
99999 | 99999 | 200 | 0.000000 |
99999 | 99999 | 201 | 0.000000 |
99999 | 99999 | 202 | 0.000000 |
99999 | 99999 | 203 | 0.000000 |
99999 | 99999 | 204 | 0.000000 |
20500000 rows × 3 columns
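`DataFrame.join` aligns on the index by default, so each single `id` row is repeated once for every time-step row that shares its index value. A minimal sketch on hypothetical toy frames (`ids` and `signals` are illustrative names and values):

```python
import pandas as pd

# One row per sample, indexed 0..1, like data_train after the two drops.
ids = pd.DataFrame({"id": [10, 11]})

# Long-format signals whose index repeats the sample's row number.
signals = pd.DataFrame(
    {"time": [0, 1, 0, 1], "heartbeat_signals": [0.9, 0.5, 1.0, 0.4]},
    index=[0, 0, 1, 1],
)

# Left join on the index: 2 rows on the left become 4 rows in the result.
joined = ids.join(signals)

print(joined)
```

This only works because the index of the long-format frame still carries the original row number of each sample, which is why the earlier steps kept it as the index.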
data_train[data_train["id"]==1]
| | id | time | heartbeat_signals |
---|---|---|---|
1 | 1 | 0 | 0.971482 |
1 | 1 | 1 | 0.928969 |
1 | 1 | 2 | 0.572933 |
1 | 1 | 3 | 0.178457 |
1 | 1 | 4 | 0.122962 |
... | ... | ... | ... |
1 | 1 | 200 | 0.000000 |
1 | 1 | 201 | 0.000000 |
1 | 1 | 202 | 0.000000 |
1 | 1 | 203 | 0.000000 |
1 | 1 | 204 | 0.000000 |
205 rows × 3 columns
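As shown, the heartbeat signal of this sample consists of 205 time steps, and the same holds for every sample (20500000 rows / 100000 samples = 205).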
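Looking at one `id` only confirms the length for that sample. A quick way to check all samples at once is a `groupby` on `id`. The sketch below uses a hypothetical toy frame with 3 steps per sample; on the real data the expected count would be 205.

```python
import pandas as pd

# Hypothetical long-format data: two samples with three time steps each.
toy = pd.DataFrame({
    "id":   [0, 0, 0, 1, 1, 1],
    "time": [0, 1, 2, 0, 1, 2],
    "heartbeat_signals": [0.9, 0.5, 0.1, 1.0, 0.4, 0.0],
})

# Number of rows (time steps) per sample.
steps_per_id = toy.groupby("id").size()

# True when every sample has the same number of time steps.
all_same_length = steps_per_id.nunique() == 1

# The largest time index per sample should be length - 1 if steps start at 0.
max_time = toy.groupby("id")["time"].max()

print(steps_per_id.unique(), all_same_length)
```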
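3.3.3 Time series feature processing with tsfresh
1. Feature extraction
**Tsfresh (TimeSeries Fresh)** is a third-party Python package that automatically computes a large number of features from time series data. It also includes methods for evaluating feature importance and for feature selection, so tsfresh is a good choice for feature extraction whether the time series task is classification or regression. Official documentation: "Introduction" (tsfresh 0.17.1.dev24+g860c4e1 documentation)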
# # Feature extraction
# train_features = extract_features(data_train,column_id = 'id',column_sort='time')
# train_features
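- Load the features that were already computed (stored as a pkl file) and use them directly, so the slow extraction does not have to be rerun every time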
import pickle
# Open the file in a with-block so it is closed after loading.
with open("./HeartbeatClassification/train_features_file.pkl", "rb") as feature_file:
    train_features = pickle.load(feature_file)
train_features
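The caching pattern behind this step can be sketched as follows. `load_or_compute` and `slow_feature_extraction` are hypothetical helpers, not part of tsfresh; the slow function is a placeholder for a call such as `extract_features(...)`, and the cached value is made up.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the object cached at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []  # records how often the expensive step really ran

def slow_feature_extraction():
    # Placeholder for an expensive computation.
    calls.append(1)
    return {"sample_0": [38.9, 18.2]}

cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "train_features_file.pkl")

first = load_or_compute(cache_file, slow_feature_extraction)   # computes and saves
second = load_or_compute(cache_file, slow_feature_extraction)  # reads the cache

print(len(calls))
```

Pickle files can execute arbitrary code when loaded, so this pattern is only appropriate for files you created yourself or otherwise trust.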
| | heartbeat_signals__variance_larger_than_standard_deviation | heartbeat_signals__has_duplicate_max | heartbeat_signals__has_duplicate_min | heartbeat_signals__has_duplicate | heartbeat_signals__sum_values | heartbeat_signals__abs_energy | heartbeat_signals__mean_abs_change | heartbeat_signals__mean_change | heartbeat_signals__mean_second_derivative_central | heartbeat_signals__median | ... | heartbeat_signals__fourier_entropy__bins_2 | heartbeat_signals__fourier_entropy__bins_3 | heartbeat_signals__fourier_entropy__bins_5 | heartbeat_signals__fourier_entropy__bins_10 | heartbeat_signals__fourier_entropy__bins_100 | heartbeat_signals__permutation_entropy__dimension_3__tau_1 | heartbeat_signals__permutation_entropy__dimension_4__tau_1 | heartbeat_signals__permutation_entropy__dimension_5__tau_1 | heartbeat_signals__permutation_entropy__dimension_6__tau_1 | heartbeat_signals__permutation_entropy__dimension_7__tau_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 1.0 | 38.927945 | 18.216197 | 0.019894 | -0.004859 | 0.000117 | 0.125531 | ... | 0.095763 | 0.109222 | 0.109222 | 0.356175 | 0.940492 | 1.180828 | 1.734917 | 2.184420 | 2.500658 | 2.722686 |
1 | 0.0 | 0.0 | 1.0 | 1.0 | 19.445634 | 7.705092 | 0.019952 | -0.004762 | 0.000105 | 0.030481 | ... | 0.248333 | 0.409767 | 0.567944 | 0.913016 | 1.791964 | 1.360828 | 2.118249 | 2.710933 | 3.065802 | 3.224835 |
2 | 0.0 | 0.0 | 1.0 | 1.0 | 21.192974 | 9.140423 | 0.009863 | -0.004902 | 0.000101 | 0.000000 | ... | 0.054659 | 0.054659 | 0.150231 | 0.204601 | 0.542013 | 0.712221 | 1.031064 | 1.263370 | 1.406001 | 1.509478 |
3 | 0.0 | 0.0 | 1.0 | 1.0 | 42.113066 | 15.757623 | 0.018743 | -0.004783 | 0.000103 | 0.241397 | ... | 0.054659 | 0.109222 | 0.186062 | 0.258874 | 1.426345 | 1.389686 | 2.206088 | 2.986728 | 3.534354 | 3.854177 |
4 | 0.0 | 0.0 | 1.0 | 1.0 | 69.756786 | 51.229616 | 0.014514 | 0.000000 | -0.000137 | 0.000000 | ... | 0.054659 | 0.109222 | 0.109222 | 0.163690 | 0.517722 | 1.045339 | 1.543338 | 1.914511 | 2.165627 | 2.323993 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 0.0 | 0.0 | 1.0 | 1.0 | 63.323449 | 28.742238 | 0.023588 | -0.004902 | 0.000794 | 0.388402 | ... | 0.054659 | 0.054659 | 0.109222 | 0.109222 | 1.405361 | 1.326208 | 2.137411 | 2.873602 | 3.391830 | 3.679969 |
99996 | 0.0 | 0.0 | 1.0 | 1.0 | 69.657534 | 31.866323 | 0.017373 | -0.004543 | 0.000051 | 0.421138 | ... | 0.095763 | 0.095763 | 0.109222 | 0.163690 | 0.749555 | 1.408284 | 2.244166 | 3.085504 | 3.728881 | 4.095457 |
99997 | 0.0 | 0.0 | 1.0 | 1.0 | 40.897057 | 16.412857 | 0.019470 | -0.004538 | 0.000834 | 0.213306 | ... | 0.164224 | 0.186062 | 0.299588 | 0.353661 | 0.995174 | 1.305626 | 2.005282 | 2.601062 | 2.996962 | 3.293562 |
99998 | 0.0 | 0.0 | 1.0 | 1.0 | 42.333303 | 14.281281 | 0.017032 | -0.004902 | 0.000013 | 0.264974 | ... | 0.095763 | 0.109222 | 0.163690 | 0.218060 | 1.321241 | 1.460980 | 2.387132 | 3.236950 | 3.793512 | 4.018302 |
99999 | 0.0 | 0.0 | 1.0 | 1.0 | 53.290117 | 21.637471 | 0.021870 | -0.004539 | 0.000023 | 0.320124 | ... | 0.095763 | 0.150231 | 0.204601 | 0.463604 | 1.768224 | 1.344607 | 2.186286 | 2.949266 | 3.462549 | 3.688612 |
100000 rows × 779 columns
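To make a few of these column names concrete, the sketch below recomputes four simple features by hand in plain Python. The formulas follow the usual definitions behind the names `sum_values`, `abs_energy`, `mean_abs_change`, and `mean_change`; treat them as an assumption and check the official documentation for the authoritative versions. The series `x` is made up.

```python
def sum_values(x):
    # Sum of all values in the series.
    return sum(x)

def abs_energy(x):
    # Sum of squared values.
    return sum(v * v for v in x)

def mean_abs_change(x):
    # Average absolute difference between consecutive values.
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)

def mean_change(x):
    # Average signed difference; the sum telescopes to (last - first).
    return (x[-1] - x[0]) / (len(x) - 1)

x = [1.0, 0.5, 0.0, 0.5, 0.0]

print(sum_values(x), abs_energy(x), mean_abs_change(x), mean_change(x))
```

The `mean_change` formula is consistent with the table above: sample 2 starts at 1.0, and if it ends at 0.0 after 205 steps, the value is -1/204, about -0.004902, which is what the table shows.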
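2. Feature selection
train_features contains 779 common time series features computed from heartbeat_signals (the official documentation explains each of them). Some of these features may be NaN because the feature cannot be computed on the current data. The impute function replaces such values in place, column by column:
# Replace NaN values in the extracted features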
impute(train_features)
| | heartbeat_signals__variance_larger_than_standard_deviation | heartbeat_signals__has_duplicate_max | heartbeat_signals__has_duplicate_min | heartbeat_signals__has_duplicate | heartbeat_signals__sum_values | heartbeat_signals__abs_energy | heartbeat_signals__mean_abs_change | heartbeat_signals__mean_change | heartbeat_signals__mean_second_derivative_central | heartbeat_signals__median | ... | heartbeat_signals__fourier_entropy__bins_2 | heartbeat_signals__fourier_entropy__bins_3 | heartbeat_signals__fourier_entropy__bins_5 | heartbeat_signals__fourier_entropy__bins_10 | heartbeat_signals__fourier_entropy__bins_100 | heartbeat_signals__permutation_entropy__dimension_3__tau_1 | heartbeat_signals__permutation_entropy__dimension_4__tau_1 | heartbeat_signals__permutation_entropy__dimension_5__tau_1 | heartbeat_signals__permutation_entropy__dimension_6__tau_1 | heartbeat_signals__permutation_entropy__dimension_7__tau_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 1.0 | 38.927945 | 18.216197 | 0.019894 | -0.004859 | 0.000117 | 0.125531 | ... | 0.095763 | 0.109222 | 0.109222 | 0.356175 | 0.940492 | 1.180828 | 1.734917 | 2.184420 | 2.500658 | 2.722686 |
1 | 0.0 | 0.0 | 1.0 | 1.0 | 19.445634 | 7.705092 | 0.019952 | -0.004762 | 0.000105 | 0.030481 | ... | 0.248333 | 0.409767 | 0.567944 | 0.913016 | 1.791964 | 1.360828 | 2.118249 | 2.710933 | 3.065802 | 3.224835 |
2 | 0.0 | 0.0 | 1.0 | 1.0 | 21.192974 | 9.140423 | 0.009863 | -0.004902 | 0.000101 | 0.000000 | ... | 0.054659 | 0.054659 | 0.150231 | 0.204601 | 0.542013 | 0.712221 | 1.031064 | 1.263370 | 1.406001 | 1.509478 |
3 | 0.0 | 0.0 | 1.0 | 1.0 | 42.113066 | 15.757623 | 0.018743 | -0.004783 | 0.000103 | 0.241397 | ... | 0.054659 | 0.109222 | 0.186062 | 0.258874 | 1.426345 | 1.389686 | 2.206088 | 2.986728 | 3.534354 | 3.854177 |
4 | 0.0 | 0.0 | 1.0 | 1.0 | 69.756786 | 51.229616 | 0.014514 | 0.000000 | -0.000137 | 0.000000 | ... | 0.054659 | 0.109222 | 0.109222 | 0.163690 | 0.517722 | 1.045339 | 1.543338 | 1.914511 | 2.165627 | 2.323993 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 0.0 | 0.0 | 1.0 | 1.0 | 63.323449 | 28.742238 | 0.023588 | -0.004902 | 0.000794 | 0.388402 | ... | 0.054659 | 0.054659 | 0.109222 | 0.109222 | 1.405361 | 1.326208 | 2.137411 | 2.873602 | 3.391830 | 3.679969 |
99996 | 0.0 | 0.0 | 1.0 | 1.0 | 69.657534 | 31.866323 | 0.017373 | -0.004543 | 0.000051 | 0.421138 | ... | 0.095763 | 0.095763 | 0.109222 | 0.163690 | 0.749555 | 1.408284 | 2.244166 | 3.085504 | 3.728881 | 4.095457 |
99997 | 0.0 | 0.0 | 1.0 | 1.0 | 40.897057 | 16.412857 | 0.019470 | -0.004538 | 0.000834 | 0.213306 | ... | 0.164224 | 0.186062 | 0.299588 | 0.353661 | 0.995174 | 1.305626 | 2.005282 | 2.601062 | 2.996962 | 3.293562 |
99998 | 0.0 | 0.0 | 1.0 | 1.0 | 42.333303 | 14.281281 | 0.017032 | -0.004902 | 0.000013 | 0.264974 | ... | 0.095763 | 0.109222 | 0.163690 | 0.218060 | 1.321241 | 1.460980 | 2.387132 | 3.236950 | 3.793512 | 4.018302 |
99999 | 0.0 | 0.0 | 1.0 | 1.0 | 53.290117 | 21.637471 | 0.021870 | -0.004539 | 0.000023 | 0.320124 | ... | 0.095763 | 0.150231 | 0.204601 | 0.463604 | 1.768224 | 1.344607 | 2.186286 | 2.949266 | 3.462549 | 3.688612 |
100000 rows × 779 columns
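To my understanding, tsfresh's `impute` works column by column: NaN becomes the column median, +inf the column maximum, and -inf the column minimum, all taken over the finite values, and a column with no finite values is filled with zeros. The sketch below imitates that behaviour with plain pandas and NumPy on a made-up frame. `impute_sketch` is an illustrative stand-in, not a tsfresh function, and unlike `impute` it returns a copy.

```python
import numpy as np
import pandas as pd

def impute_sketch(df):
    """Column-wise stand-in for tsfresh's impute, returning a new frame."""
    out = df.copy()
    for col in out.columns:
        values = out[col]
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0  # nothing to estimate a replacement from
            continue
        out[col] = (
            values.replace(np.inf, finite.max())
                  .replace(-np.inf, finite.min())
                  .fillna(finite.median())
        )
    return out

# Hypothetical feature matrix with one NaN, one +inf and one -inf.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [-np.inf, 2.0, 4.0, 6.0],
})
clean = impute_sketch(toy)

print(clean)
```

Since the printed table after `impute(train_features)` shows the same head and tail rows as before, the rows on display apparently had no missing values in the visible columns.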
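Next, select features according to their relevance to the response variable. This involves two steps:
- First, test the relevance of each feature to the response variable individually, which yields one p-value per feature.
- Then apply the Benjamini-Yekutieli procedure[1] to those p-values to decide which features to keep.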
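The second step can be sketched in plain Python. This is a minimal illustration of the Benjamini-Yekutieli procedure, not tsfresh's implementation; the function name, the `alpha` parameter, and the p-values are all made up for the example.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor c(m) = 1 + 1/2 + ... + 1/m, valid under any dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject the k_max smallest p-values.
    keep = [False] * m
    for idx in order[:k_max]:
        keep[idx] = True
    return keep

# Hypothetical p-values, one per feature.
p = [0.001, 0.8, 0.01, 0.04]
relevant = benjamini_yekutieli(p, alpha=0.05)

print(relevant)
```

In the feature selection setting, a rejected hypothesis means the feature is treated as relevant and kept. With four p-values the thresholds are 0.006, 0.012, 0.018, and 0.024, so only the features with p-values 0.001 and 0.01 pass here.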
Some common feature selection methods
# Select features according to their relevance to the data label
train_features_filtered = select_features(train_features,data_train_label)
train_features_filtered
| | heartbeat_signals__sum_values | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_28 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_27 | ... | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_84 | heartbeat_signals__fft_coefficient__attr_"imag"__coeff_97 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_90 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_94 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_92 | heartbeat_signals__fft_coefficient__attr_"real"__coeff_97 | heartbeat_signals__fft_coefficient__attr_"abs"__coeff_75 | heartbeat_signals__fft_coefficient__attr_"real"__coeff_88 | heartbeat_signals__fft_coefficient__attr_"real"__coeff_92 | heartbeat_signals__fft_coefficient__attr_"real"__coeff_83 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 38.927945 | 1.168685 | 0.982133 | 1.223496 | 1.236300 | 1.104172 | 1.497129 | 1.358095 | 1.704225 | 1.745158 | ... | 0.531883 | -0.047438 | 0.554370 | 0.307586 | 0.564596 | 0.562960 | 0.591859 | 0.504124 | 0.528450 | 0.473568 |
1 | 19.445634 | 1.460752 | 1.924501 | 1.925485 | 1.715938 | 2.079957 | 1.818636 | 2.490450 | 1.673244 | 2.821067 | ... | 0.563590 | -0.109579 | 0.697446 | 0.398073 | 0.640969 | 0.270192 | 0.224925 | 0.645082 | 0.635135 | 0.297325 |
2 | 21.192974 | 1.787166 | 2.146987 | 1.686190 | 1.540137 | 2.291031 | 2.403422 | 1.765422 | 1.993213 | 2.756081 | ... | 0.712487 | -0.074042 | 0.321703 | 0.390386 | 0.716929 | 0.316524 | 0.422077 | 0.722742 | 0.680590 | 0.383754 |
3 | 42.113066 | 2.071539 | 1.000340 | 2.728281 | 1.391727 | 2.017176 | 2.610492 | 0.747448 | 2.900299 | 1.294779 | ... | 0.601499 | -0.184248 | 0.564669 | 0.623353 | 0.466980 | 0.651774 | 0.308915 | 0.550097 | 0.466904 | 0.494024 |
4 | 69.756786 | 0.653924 | 0.231422 | 1.080003 | 0.711244 | 1.357904 | 1.237998 | 1.346404 | 1.645870 | 0.941866 | ... | 0.015292 | 0.070505 | 0.065835 | 0.051780 | 0.092940 | 0.103773 | 0.179405 | -0.089611 | 0.091841 | 0.056867 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 63.323449 | 0.417221 | 2.036034 | 1.659054 | 0.500584 | 1.693545 | 0.859932 | 1.963009 | 1.524831 | 1.344715 | ... | 0.779955 | 0.005525 | 0.486013 | 0.273372 | 0.705386 | 0.602898 | 0.447929 | 0.474844 | 0.564266 | 0.133969 |
99996 | 69.657534 | 1.611333 | 1.793044 | 1.092325 | 0.507138 | 1.763940 | 2.677643 | 2.640827 | 1.128049 | 0.856280 | ... | 0.539489 | 0.114670 | 0.579498 | 0.417226 | 0.270110 | 0.556596 | 0.703258 | 0.462312 | 0.269719 | 0.539236 |
99997 | 40.897057 | 1.190514 | 0.674603 | 1.632769 | 0.229008 | 2.027802 | 0.302457 | 2.016243 | 0.352602 | 1.836034 | ... | 0.282597 | -0.474629 | 0.460647 | 0.478341 | 0.527891 | 0.904111 | 0.728529 | 0.178410 | 0.500813 | 0.773985 |
99998 | 42.333303 | 1.237608 | 1.325212 | 2.785515 | 1.918571 | 0.814167 | 2.613950 | 2.083409 | 1.330934 | 2.801509 | ... | 0.594252 | -0.162106 | 0.694276 | 0.681025 | 0.357196 | 0.498088 | 0.433297 | 0.406154 | 0.324771 | 0.340727 |
99999 | 53.290117 | 0.154759 | 2.921164 | 2.183932 | 1.485150 | 2.685922 | 0.583443 | 3.101826 | 1.264842 | 2.877000 | ... | 0.463697 | 0.289364 | 0.285321 | 0.422103 | 0.692009 | 0.276236 | 0.245780 | 0.269519 | 0.681719 | -0.053993 |
100000 rows × 700 columns
Feature engineering summary: