Preface

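Among time-series feature extraction packages, tsfresh is one of the more popular, but the example in its official tutorial uses a robot-failure dataset, where the id column distinguishes separate experiments. I kept wondering whether it could be applied to a single series, such as electricity forecasting or traffic forecasting for one street. I searched the documentation without finding anything, and eventually found an example file for single-series forecasting in the project's GitHub repository.

When I first had this idea I looked at other tsfresh tutorials on CSDN. Most of them just repost the example from the official documentation, and none covers single-series forecasting. Below I walk through the code to show how to extract features from this kind of time series.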

Time-series feature extraction

Import the required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import roll_time_series, make_forecasting_frame
from tsfresh.utilities.dataframe_functions import impute

import pandas_datareader.data as web
from sklearn.linear_model import LinearRegression

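Prepare the dataset

We use Apple's stock price to show how to work with a single time series (one stock).
We download the data from "stooq" and keep only the High column.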

df = web.DataReader("AAPL", 'stooq')["High"]
print(df.shape)
df.head()

Output:

(1257,)
Date
2023-03-01    147.2285
2023-02-28    149.0800
2023-02-27    149.1700
2023-02-24    147.1900
2023-02-23    150.3400
Name: High, dtype: float64

Plot the series to inspect it

plt.figure(figsize=(15, 6))
df.plot(ax=plt.gca())
plt.show()

Reshape the dataset and add an identifier column:

df_melted = pd.DataFrame({"high": df.copy()})
df_melted["date"] = df_melted.index
df_melted["Symbols"] = "AAPL"

df_melted.head()

Output:

	high	date	Symbols
Date			
2023-03-01	147.2285	2023-03-01	AAPL
2023-02-28	149.0800	2023-02-28	AAPL
2023-02-27	149.1700	2023-02-27	AAPL
2023-02-24	147.1900	2023-02-24	AAPL
2023-02-23	150.3400	2023-02-23	AAPL
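
The constant Symbols column may look redundant for a single stock, but tsfresh's functions take a column_id argument to tell series apart, so even one series needs an id. The sketch below shows the same long format with a second series stacked underneath. The "MSFT" label and its values are made up for illustration; only the three AAPL values come from the output above.

```python
import pandas as pd

dates = pd.to_datetime(["2023-02-27", "2023-02-28", "2023-03-01"])

# AAPL values copied from the output above.
aapl = pd.DataFrame({"high": [149.17, 149.08, 147.2285], "date": dates, "Symbols": "AAPL"})
# Hypothetical second series with placeholder values (not real prices).
other = pd.DataFrame({"high": [1.0, 2.0, 3.0], "date": dates, "Symbols": "MSFT"})

# Long format: one row per (series, timestamp); the id column separates the series.
long_df = pd.concat([aapl, other], ignore_index=True)
rows_per_series = long_df.groupby("Symbols").size().to_dict()
print(rows_per_series)
```

With only one stock, every row simply carries the same id, which is what the code above does.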

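Create training samples

Forecasting usually involves the following steps:
- collect the data up to today
- extract features (for example, with the extract_features function)
- train a forecasting model
For training, however, we need many examples. If we only used the time series up to today, we would have a single training instance. So we use a trick: a sliding history window.
Imagine a window sliding over your dataset. At each time step $t$, you treat the data in the window as if it were the data up to today (including $t$) and extract features from it. The target for the features computed up to time $t$ is the value at time $t + 1$.
The sliding is implemented in the function roll_time_series. We set max_timeshift=20, so each window looks at most 20 days into the past (up to 21 rows including day $t$), and min_timeshift=5, so windows with 5 rows or fewer are dropped.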
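
To make the windowing concrete, here is a minimal plain-Python sketch of the idea. It is not tsfresh's implementation. The helper names roll_windows and next_step_targets are hypothetical, the toy values are made up, and the window semantics are my reading of the max_timeshift and min_timeshift parameters: a window holds the current row plus up to max_timeshift earlier rows, and windows with min_timeshift rows or fewer are discarded.

```python
from datetime import date, timedelta


def roll_windows(rows, max_timeshift, min_timeshift):
    """Return {window_end_date: rows in that window}, oldest row first.

    Assumed semantics (a reading of the parameters, not tsfresh's code):
    the window ending at position i holds row i plus up to `max_timeshift`
    earlier rows; windows with `min_timeshift` rows or fewer are dropped.
    """
    rows = sorted(rows)  # sort by date, like column_sort="date"
    windows = {}
    for i, (day, _) in enumerate(rows):
        start = max(0, i - max_timeshift)
        window = rows[start:i + 1]
        if len(window) > min_timeshift:
            windows[day] = window
    return windows


def next_step_targets(rows):
    """Map each date t to the value of the following row (t + 1)."""
    rows = sorted(rows)
    return {rows[i][0]: rows[i + 1][1] for i in range(len(rows) - 1)}


# Toy data: 30 consecutive days with made-up values (not real prices).
first_day = date(2023, 1, 1)
toy = [(first_day + timedelta(days=k), 100.0 + k) for k in range(30)]

windows = roll_windows(toy, max_timeshift=20, min_timeshift=5)
targets = next_step_targets(toy)

print(len(windows))                            # 25 windows kept
print(min(len(w) for w in windows.values()))   # 6 rows in the shortest
print(max(len(w) for w in windows.values()))   # 21 rows in the longest
```

The last date has no entry in targets because there is no following value to predict, so its window cannot be used for training.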

df_rolled = roll_time_series(df_melted, column_id="Symbols", column_sort="date", max_timeshift=20, min_timeshift=5)
df_rolled.head()

Output:

Rolling: 100%|██████████| 10/10 [00:02<00:00,  3.71it/s]
	high	date	Symbols	id
0	42.4266	2018-03-05	AAPL	(AAPL, 2018-03-12 00:00:00)
1	42.5512	2018-03-06	AAPL	(AAPL, 2018-03-12 00:00:00)
2	41.9750	2018-03-07	AAPL	(AAPL, 2018-03-12 00:00:00)
3	42.2771	2018-03-08	AAPL	(AAPL, 2018-03-12 00:00:00)
4	42.9639	2018-03-09	AAPL	(AAPL, 2018-03-12 00:00:00)

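The DataFrame above consists of these "windows", stamped out of the original DataFrame. For example, all rows with id = (AAPL, 2020-07-14 00:00:00) come from the original AAPL data and cover 2020-07-14 plus the 20 trading days before it.
Select the rows of the window ending on 2020-07-14: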

df_rolled[df_rolled["id"] == ("AAPL", pd.to_datetime("2020-07-14"))]

Output:

	high	date	Symbols	id
12249	85.0954	2020-06-15	AAPL	(AAPL, 2020-07-14 00:00:00)
12250	86.9448	2020-06-16	AAPL	(AAPL, 2020-07-14 00:00:00)
12251	87.4872	2020-06-17	AAPL	(AAPL, 2020-07-14 00:00:00)
12252	87.0066	2020-06-18	AAPL	(AAPL, 2020-07-14 00:00:00)
12253	87.7743	2020-06-19	AAPL	(AAPL, 2020-07-14 00:00:00)
12254	88.4851	2020-06-22	AAPL	(AAPL, 2020-07-14 00:00:00)
12255	91.6684	2020-06-23	AAPL	(AAPL, 2020-07-14 00:00:00)
12256	90.7831	2020-06-24	AAPL	(AAPL, 2020-07-14 00:00:00)
12257	89.8490	2020-06-25	AAPL	(AAPL, 2020-07-14 00:00:00)
12258	89.9277	2020-06-26	AAPL	(AAPL, 2020-07-14 00:00:00)
12259	89.1531	2020-06-29	AAPL	(AAPL, 2020-07-14 00:00:00)
12260	90.0922	2020-06-30	AAPL	(AAPL, 2020-07-14 00:00:00)
12261	90.4312	2020-07-01	AAPL	(AAPL, 2020-07-14 00:00:00)
12262	91.1958	2020-07-02	AAPL	(AAPL, 2020-07-14 00:00:00)
12263	92.5019	2020-07-06	AAPL	(AAPL, 2020-07-14 00:00:00)
12264	93.2027	2020-07-07	AAPL	(AAPL, 2020-07-14 00:00:00)
12265	93.9105	2020-07-08	AAPL	(AAPL, 2020-07-14 00:00:00)
12266	94.8407	2020-07-09	AAPL	(AAPL, 2020-07-14 00:00:00)
12267	94.5077	2020-07-10	AAPL	(AAPL, 2020-07-14 00:00:00)
12268	98.4278	2020-07-13	AAPL	(AAPL, 2020-07-14 00:00:00)
12269	95.7619	2020-07-14	AAPL	(AAPL, 2020-07-14 00:00:00)
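
Note that the window is measured in rows (trading days), not calendar days. As a quick check, the sketch below counts the rows of the window shown above and compares that with the calendar span. The dates are copied from that output; nothing here calls tsfresh.

```python
from datetime import date

# Dates copied from the window with id (AAPL, 2020-07-14) shown above.
window_dates = [
    date(2020, 6, 15), date(2020, 6, 16), date(2020, 6, 17), date(2020, 6, 18),
    date(2020, 6, 19), date(2020, 6, 22), date(2020, 6, 23), date(2020, 6, 24),
    date(2020, 6, 25), date(2020, 6, 26), date(2020, 6, 29), date(2020, 6, 30),
    date(2020, 7, 1), date(2020, 7, 2), date(2020, 7, 6), date(2020, 7, 7),
    date(2020, 7, 8), date(2020, 7, 9), date(2020, 7, 10), date(2020, 7, 13),
    date(2020, 7, 14),
]

n_rows = len(window_dates)
calendar_span = (window_dates[-1] - window_dates[0]).days

print(n_rows)         # 21 rows: the end date plus 20 earlier rows
print(calendar_span)  # 29 calendar days, since weekends and holidays have no rows
```

So max_timeshift=20 here reaches back about four calendar weeks, not 20 calendar days.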

Extract features