I. Background
- Data prediction is a well-worn topic, but it is worth revisiting from time to time: its applications are very broad, and the barrier to entry is low and beginner-friendly.
- There are plenty of LSTM prediction projects and code samples online, but many have incomplete code, missing data, or are several years out of date. This write-up addresses all three problems.
- The code is simple and uses well-established techniques; there is nothing deep here, but both the data and the code are provided in full, so the material is actually usable.
II. Overview
- This small project uses an LSTM to predict stock prices, based on daily data for 000001.SZ (Ping An Bank) from 2014 onward.
- The columns are 股票代码 (ts_code), 交易日期 (trade date), 开盘价 (open), 最高价 (high), 最低价 (low), 收盘价 (close), 昨收价 (previous close), 涨跌额 (price change), 涨跌幅 (percent change), 成交量 (volume), and 成交额 (turnover).
- The demo uses the previous N days of data to predict the next day's closing price. In real work you would not predict the closing price itself, since it is not a meaningful target; it is used here purely for demonstration.
III. Code Walkthrough
1. Data loading and preprocessing
First, some light preprocessing: rename the columns to Chinese and cast the numeric columns to float so they are easier to work with later.
df = df.rename(columns={
    "ts_code": "股票代码",
    "trade_date": "交易日期",
    "open": "开盘价",
    "high": "最高价",
    "low": "最低价",
    "close": "收盘价",
    "pre_close": "昨收价",
    "change": "涨跌额",
    "pct_chg": "涨跌幅",
    "vol": "成交量",
    "amount": "成交额",
})
float_type = ['开盘价', '最高价', '最低价', '收盘价', '昨收价', '涨跌额', '涨跌幅', '成交量', '成交额']
for item in float_type:
    df[item] = df[item].astype('float')
df.head()
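Where `df` comes from is not shown above; the column names suggest the tushare daily-bar table or a CSV exported from it. As a self-contained illustration (the sample values below are made up), the rename-and-cast step behaves like this:

```python
import pandas as pd

# Two rows of tushare-style daily data; the values are made up for illustration.
df = pd.DataFrame({
    "ts_code": ["000001.SZ", "000001.SZ"],
    "trade_date": ["20140102", "20140103"],
    "open": ["11.05", "10.95"],
    "high": ["11.20", "11.00"],
    "low": ["10.90", "10.80"],
    "close": ["11.00", "10.85"],
    "pre_close": ["11.10", "11.00"],
    "change": ["-0.10", "-0.15"],
    "pct_chg": ["-0.90", "-1.36"],
    "vol": ["500000", "480000"],
    "amount": ["550000", "520000"],
})

df = df.rename(columns={"ts_code": "股票代码", "trade_date": "交易日期",
                        "open": "开盘价", "high": "最高价", "low": "最低价",
                        "close": "收盘价", "pre_close": "昨收价", "change": "涨跌额",
                        "pct_chg": "涨跌幅", "vol": "成交量", "amount": "成交额"})
float_cols = ['开盘价', '最高价', '最低价', '收盘价', '昨收价', '涨跌额', '涨跌幅', '成交量', '成交额']
df[float_cols] = df[float_cols].astype('float')
print(df["收盘价"].tolist())  # [11.0, 10.85]
```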
2. Splitting the data
We want to predict the closing price from the nine basic numeric variables (open, high, low, and so on), using the previous N days of data to predict the next day's close. That requires reshaping the data into sliding windows, as shown below (the exact code depends on the shape of the input data):
# Sliding-window split
import numpy as np

def split_data(stock, slide_count):
    data_raw = stock.to_numpy()
    data = []
    # each window covers slide_count consecutive rows
    for index in range(len(data_raw) - slide_count):
        data.append(data_raw[index: index + slide_count])
    data = np.array(data)
    xtrain = data[:, :-1, :]    # every row of the window except the last
    ytrain = data[:, -1, [-1]]  # last column of the window's last row (assumed to be the target)
    return xtrain, ytrain
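A quick sanity check on toy data shows the shapes this produces. Note that with `slide_count=5`, each sample actually uses the previous 4 rows as input and the 5th row's last column as the target:

```python
import numpy as np
import pandas as pd

def split_data(stock, slide_count):
    # same function as above, repeated so this snippet runs on its own
    data_raw = stock.to_numpy()
    data = []
    for index in range(len(data_raw) - slide_count):
        data.append(data_raw[index: index + slide_count])
    data = np.array(data)
    xtrain = data[:, :-1, :]
    ytrain = data[:, -1, [-1]]
    return xtrain, ytrain

# 10 rows x 3 columns of toy data; the last column plays the role of the close price.
toy = pd.DataFrame(np.arange(30).reshape(10, 3))
x, y = split_data(toy, slide_count=5)
print(x.shape, y.shape)  # (5, 4, 3) (5, 1)
print(y[0, 0])           # 14 -> last column of row 4, the first window's target
```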
# Scale the training data to [0, 1]
from sklearn.preprocessing import MinMaxScaler
train_data=df_train.copy()
display(train_data.shape)
train_data=train_data.dropna()
display(train_data.shape)
cd_len=train_data.columns.size
train_data.columns=list(range(cd_len))
scaler_mean = train_data.mean().round(4).tolist()
scaler = MinMaxScaler(feature_range=(0, 1))
train_sc = pd.DataFrame(scaler.fit_transform(train_data))
scaler_max = scaler.data_max_.tolist()
scaler_min = scaler.data_min_.tolist()
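Predictions will come out in this scaled space, which is why `scaler_max` and `scaler_min` are kept: they let us map the target column back to actual prices. A minimal sketch of that inverse mapping, using a made-up stand-in for the close-price column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the close-price column (values made up for illustration).
close = np.array([[10.0], [12.0], [11.0], [15.0]])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(close)

# Undo the scaling by hand, the way scaler_min / scaler_max are used later.
lo, hi = scaler.data_min_[0], scaler.data_max_[0]
restored = scaled * (hi - lo) + lo
print(restored.ravel())  # [10. 12. 11. 15.]
```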
3. Model definition
Next, a simple unidirectional LSTM, built with PyTorch:
import torch
import torch.nn as nn
np.set_printoptions(suppress=True)
torch.manual_seed(0)
np.random.seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class SingleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.output_size = output_size
        self.num_directions = 1  # unidirectional LSTM
        self.lstm = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True)
        self.linear = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, x):
        # random initial states, re-drawn on every forward pass; zeros are the more common choice
        h_0 = torch.randn(self.num_directions * self.num_layers, x.size(0), self.hidden_size).to(device)
        c_0 = torch.randn(self.num_directions * self.num_layers, x.size(0), self.hidden_size).to(device)
        output, _ = self.lstm(x, (h_0, c_0))  # output: (batch_size, seq_len, num_directions * hidden_size)
        pred = self.linear(output)            # (batch_size, seq_len, output_size)
        pred = pred[:, -1, :]                 # keep only the last time step: (batch_size, output_size)
        return pred
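A quick shape check confirms that the forward pass maps `(batch, seq_len, features)` to `(batch, 1)`. The class is repeated below so the snippet runs on its own, with zero-initialized states (the usual convention) instead of the random ones above:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class SingleLSTM(nn.Module):
    # Same architecture as above, repeated so this snippet is self-contained;
    # initial states here are zeros rather than randn.
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        h_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        output, _ = self.lstm(x, (h_0, c_0))  # (batch, seq_len, hidden_size)
        return self.linear(output)[:, -1, :]  # last time step -> (batch, output_size)

net = SingleLSTM(input_size=9, hidden_size=16, num_layers=2, output_size=1).to(device)
dummy = torch.randn(5, 4, 9).to(device)  # batch of 5 windows, 4 days each, 9 features
print(net(dummy).shape)                  # torch.Size([5, 1])
```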
4. Training
Wrap the training data in a TensorDataset and train for 100 epochs. `num_epochs`, `hidden_dim`, and `num_layers` can all be changed, but it is best not to change them too drastically.
## Prepare the training tensors
from torch.utils.data import Dataset,DataLoader,TensorDataset
slide_count=5
xtrain, ytrain = split_data(train_sc, slide_count)
x_train = torch.from_numpy(xtrain).type(torch.Tensor)
y_train = torch.from_numpy(ytrain).type(torch.Tensor)
input_dim = x_train.shape[2]
hidden_dim = 128
num_layers = 6
output_dim = 1
num_epochs = 100
model = SingleLSTM(input_size=input_dim, hidden_size=hidden_dim, output_size=output_dim, num_layers=num_layers).to(device)  # move the model to the same device as the batches
train_dataset = TensorDataset(x_train,y_train)
batch_size = 1024
dataloader = DataLoader(dataset=train_dataset,batch_size=batch_size,shuffle=True,drop_last=False)
import os
import time
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
hist = np.zeros(num_epochs)
min_loss = 10000
os.makedirs('./model_files', exist_ok=True)  # make sure the checkpoint directory exists
for epoch in range(num_epochs):
    model.train()
    train_loss = []
    for (seq, label) in dataloader:
        seq = seq.to(device)
        label = label.to(device)
        epoch_ypred = model(seq)
        loss = criterion(epoch_ypred, label)
        train_loss.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    hist[epoch] = round(np.mean(train_loss), 8)
    if hist[epoch] < min_loss:  # save a checkpoint whenever the epoch loss improves
        min_loss = hist[epoch]
        print("Training LSTM for " + str(num_epochs) + " epochs in total, current epoch: " + str(epoch))
        print("Epoch: " + str(epoch) + ", MSE: " + str(hist[epoch]))
        model_name = 'LSTM_bestmodel_' + time.strftime('%Y%m%d', time.localtime(time.time())) + '.pt'
        torch.save(model.state_dict(), './model_files/' + model_name)
The training log looks like this:
5. Prediction
Once the model is trained, load the checkpoint, put the test data through the same preprocessing as the training data, and predict:
input_dim = x_train.shape[2]
mymodel = SingleLSTM(input_size=input_dim, hidden_size=hidden_dim, output_size=output_dim, num_layers=num_layers)
mymodel.load_state_dict(torch.load('./model_files/LSTM_bestmodel_20240418.pt'))
mymodel = mymodel.to(device)
# scale the test set with the scaler fitted on the training set, then window it the same way
test_sc = pd.DataFrame(scaler.transform(df_test))
xtest, ytest = split_data(test_sc, slide_count)
x_test = torch.from_numpy(xtest).type(torch.Tensor)
y_test = torch.from_numpy(ytest).type(torch.Tensor)
test_dataset = TensorDataset(x_test, y_test)
test_loader = DataLoader(dataset=test_dataset, batch_size=1024*100, shuffle=False, drop_last=False)
from tqdm import tqdm
mymodel.eval()
print('predicting...')
preds = []
for (seq, target) in tqdm(test_loader):
    seq = seq.to(device)
    with torch.no_grad():
        preds.append(mymodel(seq).cpu())
test_pred = torch.cat(preds)  # collect every batch, not just the last one
The predictions look like this:
Evaluating the predictions gives an R² of 0.72 and an MSE of 0.097.
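Metrics like these can be computed with scikit-learn. The arrays below are placeholders; in the notebook they would be `ytest` and the concatenated `test_pred` batches, ideally mapped back to the price scale first:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder arrays standing in for the true and predicted closing prices.
y_true = np.array([10.0, 10.5, 11.0, 10.8])
y_pred = np.array([10.1, 10.4, 11.2, 10.6])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, round(r2, 4))
```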
Summary
- This post walked through a simple LSTM stock-price prediction task: using nine variables over the previous N days to predict the next day's close. The same pattern carries over to other forecasting projects.
- The model here is a unidirectional LSTM implemented in PyTorch; follow-ups will cover bidirectional LSTMs, Transformers, and other models.