Stock Prediction with PyTorch

Posted by 若水上善666 on 2020-12-19

1. Stock Prediction Overview

I split stock prediction into two major parts: model training and model prediction. Model training is further divided into three parts: data loading, feature selection, and model training itself.

  • Model training
    • Data loading
    • Feature selection
    • Model training
  • Model prediction

2. Model Training

2.1 Data Loading

Model training: The idea is to use the previous n days of data to predict whether a stock rises or falls five days later. Buying a stock should not depend on only the next day's price change, which is why I look at the price five days out.
I define the labels as follows:
Up: the lowest price five days later is higher than the highest price five days earlier
Down: the highest price five days later is lower than the lowest price five days earlier
Flat: all other cases
Data collection: stock data is fetched with baostock: http://baostock.com/baostock/index.php/
Data processing: here days_to_train=20 and days_to_pred=5, i.e. 20 days of stock data are used to predict the rise or fall on day 25; unneeded columns are dropped during processing.
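The window and label index arithmetic can be sketched in isolation before looking at the real code (a toy example; `n_rows=30` is a made-up row count, not from the original):

```python
# Toy sketch of the windowing used below: with days_to_train=20 and
# days_to_pred=5, row `day` is labelled from the 20 rows ending 5 rows earlier
days_to_train, days_to_pred = 20, 5
n_rows = 30  # pretend we fetched 30 trading days

windows = []
for day in range(n_rows - 1, -1, -1):          # newest to oldest, as in get_data
    first = day - days_to_train - days_to_pred + 1
    if first >= 0:
        feature_rows = list(range(first, day - days_to_pred + 1))
        windows.append((feature_rows, day))    # (20 feature rows, labelled row)
```

Each labelled row thus sits exactly days_to_pred rows after the end of its 20-row feature window.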

import baostock as bs
import pandas as pd
import numpy as np


def get_data(stock_id='sh.600000', days_to_train=20, days_to_pred=5, start_date='2019-12-15', end_date='2020-12-15'):
    # Use 20 days of data to predict the rise/fall five days later
    # days_to_train = 20
    # days_to_pred = 5

    # Log in to baostock
    lg = bs.login()
    # Show the login response
    print('login respond error_code:' + lg.error_code)
    print('login respond  error_msg:' + lg.error_msg)

    # Fetch historical A-share K-line data
    # Parameter docs: http://baostock.com/baostock/index.php/Python_API%E6%96%87%E6%A1%A3#.E8.8E.B7.E5.8F.96.E5.8E.86.E5.8F.B2A.E8.82.A1K.E7.BA.BF.E6.95.B0.E6.8D.AE.EF.BC.9Aquery_history_k_data_plus.28.29
    rs = bs.query_history_k_data_plus(stock_id,
                                      "date,code,open,high,low,close,preclose,volume,amount,turn,tradestatus,pctChg,peTTM,pbMRQ,psTTM,pcfNcfTTM,isST",
                                      start_date=start_date, end_date=end_date,
                                      frequency="d", adjustflag="1")
    print('query_history_k_data_plus respond error_code:' + rs.error_code)
    print('query_history_k_data_plus respond  error_msg:' + rs.error_msg)

    # Collect the result set
    data_list = []
    while (rs.error_code == '0') & rs.next():
        # Fetch one record at a time and accumulate
        data_list.append(rs.get_row_data())
    result = pd.DataFrame(data_list, columns=rs.fields)

    # Log out
    bs.logout()

    # Post-process the result: drop date/code/isST, and convert the remaining
    # columns to float — baostock returns every field as a string, so without
    # this the high/low comparisons below would be string comparisons
    columns_all = result.columns
    columns_need = columns_all[2:-1]
    data_need = result[columns_need].astype(float)
    column_low = 'low'
    column_high = 'high'

    # labels records whether the stock is up/flat/down five days later
    # up:   2
    # flat: 1
    # down: 0
    labels = []
    # train_data holds the feature windows used for the classification above
    train_data = []
    for day in data_need.sort_index(ascending=False).index:
        day_pred_low = data_need.loc[day][column_low]
        day_pred_high = data_need.loc[day][column_high]
        if not (day - days_to_train - days_to_pred + 1 < 0):
            day_before_low = data_need.loc[day - days_to_pred][column_low]
            day_before_high = data_need.loc[day - days_to_pred][column_high]
            if day_pred_low > day_before_high:
                labels.append(2)
            elif day_pred_high < day_before_low:
                labels.append(0)
            else:
                labels.append(1)
            train_data.append(data_need.loc[day - days_to_pred - days_to_train + 1:day - days_to_pred])
    return train_data, labels
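The up/flat/down rule in the loop above can be checked on its own with made-up prices (a standalone sketch, not part of the original module):

```python
# Standalone version of the labelling rule in get_data (up: 2, flat: 1, down: 0)
def label(day_pred_low, day_pred_high, day_before_low, day_before_high):
    if day_pred_low > day_before_high:
        return 2  # up: the later day's low clears the earlier day's high
    if day_pred_high < day_before_low:
        return 0  # down: the later day's high is under the earlier day's low
    return 1      # flat: the two price ranges overlap
```

Note that "flat" absorbs every overlapping case, so the three classes are typically very imbalanced in favour of class 1.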

2.2 Feature Extraction

There is no real feature extraction here. Even hand-crafted features of the price series such as the maximum, minimum, or slope would not capture price changes accurately, so feature extraction is left to the LSTM/GRU-style sequence networks later on. What I do here is so-called normalization: each feature of the time series, i.e. each column, is standardized.

from sklearn import preprocessing
import numpy as np
from model_train_network.get_stock_data_1 import get_data


def norm_data(data):
    # Standardize each sample (each time window) independently, column by column
    data_norm = []
    scaler = preprocessing.StandardScaler()
    data = np.array(data)
    for i in range(data.shape[0]):
        data_norm.append(scaler.fit_transform(data[i]))
    return data_norm
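Because the scaler is re-fit per sample, every window is standardized against its own statistics. That is equivalent to this plain-numpy sketch (a hypothetical `norm_sample`, shown only to make the behaviour explicit):

```python
import numpy as np

def norm_sample(x):
    # Column-wise z-score of one window, matching what
    # StandardScaler().fit_transform does per sample in norm_data
    # (StandardScaler uses the population std, i.e. ddof=0, like np.std)
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

window = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
z = norm_sample(window)
# each column of z now has mean 0 and unit std
```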

2.3 Model Training

The network used here is a GRU + activation + 2 linear layers.
GRUNetV2: the network: GRU + activation + 2 linear layers
split_data: splits the data into a training set and a test set
train: training function
test: test function

Notes on net.train() and net.eval()
The program also runs without these two calls; they only matter for layers that behave differently during training and testing, such as Batch Normalization and Dropout.

During training, Batch Normalization works per mini-batch, but at test time inputs often come one at a time, so there is no mini-batch. Since the parameters are fixed after training, the per-batch mean and variance no longer change, and the statistics accumulated over all batches are used directly. This is why Batch Normalization behaves differently at training and test time.
For Dropout, during training each hidden neuron is first kept with probability p and then activated; at test time all neurons are activated, and each hidden neuron's output is multiplied by p.
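The difference is easy to see on a bare Dropout layer. Note that PyTorch's nn.Dropout actually implements *inverted* dropout: survivors are scaled by 1/(1-p) during training so that eval mode is a plain identity (a small sketch, not part of the training script):

```python
import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()        # what net.train() sets on every submodule
y_train = drop(x)   # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()         # what net.eval() sets
y_eval = drop(x)    # identity: dropout is disabled at test time
```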

# -*- coding:utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.utils import data
from sklearn.model_selection import train_test_split
from model_train_network.get_stock_data_1 import get_data
from model_train_network.select_feature_stock_data_2 import norm_data


# On the structure of GRU networks: https://blog.csdn.net/qq_27825451/article/details/99691258
class GRUNetV2(torch.nn.Module):
    def __init__(self, input_dim, hidden_size, out_size, drop_out, n_layers=1):
        super(GRUNetV2, self).__init__()

        # self.batch_size = batch_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.out_size = out_size
        self.drop_out = drop_out

        # batch_first=True, so the batch must be the first input dimension
        self.gru = torch.nn.GRU(input_dim, hidden_size, n_layers, dropout=drop_out, batch_first=True)

        self.relu = torch.nn.ReLU(inplace=True)
        # First fully connected layer
        self.fc1 = torch.nn.Linear(hidden_size, 32)
        # Second fully connected layer
        self.fc2 = torch.nn.Linear(32, out_size)

    def forward(self, x):
        # x has shape (batch, seq, feature)
        output, hidden = self.gru(x)
        output = self.relu(output)
        # output holds the last layer's state at every time step;
        # hidden holds every layer's state at the last time step
        output = self.fc1(output)
        # No Softmax here: CrossEntropyLoss expects raw logits and
        # applies log-softmax internally
        output = self.fc2(output)

        # Keep only the last vector along the time-seq dimension
        output = output[:, -1, :]

        return output


feature_dim = 14
hidden_size = 64
output_dim = 3
num_layers = 3
drop_out_gru = 0.3

# hyper parameters
BATCH_SIZE = 8  # batch_size
LEARNING_RATE = 0.001  # learning_rate
EPOCH = 600  # epochs

net = GRUNetV2(feature_dim, hidden_size, output_dim, drop_out_gru, num_layers)
net = net.to('cpu')
net = net.double()
print(net)
optimizer = torch.optim.Adam(net.parameters(), lr=LEARNING_RATE, betas=(0.8, 0.8))
loss_func = torch.nn.CrossEntropyLoss()


def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
    # Stack into single ndarrays first: torch.tensor on a list of arrays is slow and deprecated
    X_train = torch.tensor(np.array(X_train))
    X_test = torch.tensor(np.array(X_test))
    y_train = torch.tensor(y_train)
    y_test = torch.tensor(y_test)
    torch_train_dataset = data.TensorDataset(X_train, y_train)
    torch_test_dataset = data.TensorDataset(X_test, y_test)
    trainloader = data.DataLoader(
        dataset=torch_train_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        # num_workers=2
    )
    testloader = data.DataLoader(
        dataset=torch_test_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        # num_workers=2
    )
    return trainloader, testloader


train_loss = []
train_acc = []


# Training
def train(epoch, trainloader):
    global train_acc, train_loss
    print('\n train Epoch: %d' % epoch)
    net.train()
    train_loss_tmp = 0
    train_loss_avg = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        # print(batch_idx)
        inputs, targets = inputs.to('cpu'), targets.to('cpu')
        optimizer.zero_grad()
        outputs = net(inputs)
        # print(outputs)
        loss = loss_func(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss_tmp += loss.item()
        _, predicted = torch.max(outputs, 1)
        # print(predicted)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

        train_loss_avg = train_loss_tmp / (batch_idx + 1)
        print(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
              % (train_loss_avg, 100. * correct / total, correct, total))

    train_loss.append(train_loss_avg)
    train_acc.append(100. * correct / total)
    print('\n -----train Epoch Over: %d------\n' % epoch)
    print(len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
          % (train_loss_avg, 100. * correct / total, correct, total))


test_acc = []
test_loss = []
best_acc = 0
best_acc_tmp = 0


def test(epoch, testloader):
    print('\n test Epoch: %d' % epoch)
    global test_acc, test_loss, best_acc_tmp
    net.eval()
    test_loss_tmp = 0
    test_loss_avg = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(testloader):
            inputs, targets = inputs.to('cpu'), targets.to('cpu')
            outputs = net(inputs)
            loss = loss_func(outputs, targets)

            test_loss_tmp += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

            test_loss_avg = test_loss_tmp / (batch_idx + 1)
            print(batch_idx, len(testloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
                  % (test_loss_avg, 100. * correct / total, correct, total))

    test_loss.append(test_loss_avg)
    test_acc.append(100. * correct / total)
    best_acc_tmp = max(test_acc)
    print('\n -----test Epoch Over: %d------\n' % epoch)
    print(len(testloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
          % (test_loss_avg, 100. * correct / total, correct, total))


if __name__ == '__main__':
    for j in [12, 13, 14, 15]:
        train_loss = []
        train_acc = []
        test_acc = []
        test_loss = []
        data_colums = ['train_acc', 'train_loss', 'test_acc', 'test_loss']
        data_train, labels = get_data('sz.000651', j, 5)
        data_train = norm_data(data_train)
        train_loader, test_loader = split_data(data_train, labels)
        for i in range(EPOCH):
            train(i, train_loader)
            test(i, test_loader)
            data_result = np.stack((train_acc, train_loss, test_acc, test_loss), axis=1)
            print(data_result)
            data_result = pd.DataFrame(data_result, columns=data_colums)
            data_result.to_csv('../result_acc/result.csv')
            if best_acc_tmp > best_acc:
                best_acc = best_acc_tmp
                data_best = pd.Series((best_acc, j))
                data_best.to_csv('../result_acc/best.csv')
                torch.save(net.state_dict(), '../result_model/params_000651.pkl')

        # Data for plotting
        t = np.arange(EPOCH)
        fig, ax = plt.subplots()
        ax.plot(t, train_acc, t, test_acc)
        ax.set(xlabel='Epoch', ylabel='Accuracy (%)',
               title='Training accuracy')

        fig.savefig("../result_acc/acc_test" + str(j) + ".png")
        # plt.show()
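The shape conventions that GRUNetV2.forward relies on can be verified on a throwaway GRU; the 8/20/14 sizes below are just this article's batch/window/feature values, not anything special:

```python
import torch

gru = torch.nn.GRU(input_size=14, hidden_size=64, num_layers=3, batch_first=True)
x = torch.randn(8, 20, 14)          # (batch, seq, feature)
output, hidden = gru(x)

# output: last layer's state at every time step -> (batch, seq, hidden)
# hidden: every layer's state at the last time step -> (layers, batch, hidden)
# so output[:, -1, :] (what GRUNetV2.forward keeps) equals hidden[-1]
```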

3. Model Prediction

For prediction I take the stock data from today back to n days ago (covering 40 sets of stock data), which also lets me check how accurate the predictions were over the recent past.

import re

import pandas as pd
import numpy as np
import datetime
import calendar
import torch
from model_train_network.train_3 import GRUNetV2
from model_train_network.get_stock_data_1 import get_data
from model_train_network.get_stock_data_1 import get_data_pred
from model_train_network.select_feature_stock_data_2 import norm_data

feature_dim = 14
hidden_size = 64
output_dim = 3
num_layers = 3
drop_out_gru = 0.3

# hyper parameters
BATCH_SIZE = 8  # batch_size
LEARNING_RATE = 0.001  # learning_rate
EPOCH = 600  # epochs

net = GRUNetV2(feature_dim, hidden_size, output_dim, drop_out_gru, num_layers)
net.load_state_dict(torch.load('../result_model/params_000001.pkl'))
net.eval()
net = net.to('cpu')
net = net.double()
print(net)


def model_pred(stock_id, days_to_train):
    # stock_id = 'sh.600519'
    # days_to_train = 17

    # Walk backwards from today and collect the most recent 40 weekdays
    # (weekday() is 0 for Monday, ..., 6 for Sunday). datetime.timedelta
    # replaces the original hand-rolled month/day arithmetic, which mishandled
    # month rollover and produced unpadded date strings
    date_now = datetime.date.today()
    num = 40
    date_list = []
    day = date_now
    while num > 0:
        if day.weekday() in [0, 1, 2, 3, 4]:
            date_list.append(day.isoformat())
            num -= 1
        day -= datetime.timedelta(days=1)

    end_data = date_now.isoformat()
    start_data = day.isoformat()

    data_train, labels = get_data_pred(stock_id, days_to_train, 5, start_data=start_data, end_date=end_data)
    data_train = norm_data(data_train)
    inputs = torch.tensor(data_train)
    outputs = net(inputs)
    _, predicted = torch.max(outputs, 1)
    labels = torch.Tensor(labels)

    print('-------Predictions-------')
    print(predicted)
    total = labels.shape[0]
    correct = predicted.eq(labels).sum().item()
    print(100. * correct / total, '(%d/%d)' % (correct, total))

    predicted = np.append(predicted.numpy(), 100. * correct / total)
    date_list = date_list[0:num - days_to_train + 2]
    date_list.append('correct')
    pd.Series(predicted, index=date_list).to_csv(
        '../result_up_down/result_' + str(re.findall(r'\d+', stock_id)[0]) + '.csv')
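The backwards trading-day walk in model_pred can be isolated and tested on a fixed date. Weekends are skipped but exchange holidays are not, matching the original code's behaviour (a standalone sketch; `last_weekdays` is a hypothetical helper, not from the original module):

```python
import datetime

def last_weekdays(end, n):
    # Walk backwards from `end` and collect the n most recent weekdays
    days, d = [], end
    while len(days) < n:
        if d.weekday() < 5:        # 0=Monday ... 4=Friday
            days.append(d.isoformat())
        d -= datetime.timedelta(days=1)
    return days

dates = last_weekdays(datetime.date(2020, 12, 15), 5)
# → ['2020-12-15', '2020-12-14', '2020-12-11', '2020-12-10', '2020-12-09']
```

Baostock expects zero-padded YYYY-MM-DD strings, which `date.isoformat()` produces for free.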
