Part 2 | Probabilistic Time Series Forecasting with 🤗 Transformers

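In the first part of "Probabilistic Time Series Forecasting with 🤗 Transformers", we introduced classical time series forecasting and Transformer-based methods, and walked step by step through preparing the training dataset and defining the environment, model, transformations, and InstanceSplitter. This part covers the data loaders, the forward pass, training, inference, and an outlook on future work.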

Create PyTorch DataLoaders

With the data prepared, the next step is to create PyTorch DataLoaders, which let us draw batches of (input, output) pairs, that is, (past_values, future_values).

from typing import Iterable, Optional

from gluonts.itertools import Cyclic, IterableSlice, PseudoShuffled
from gluonts.torch.util import IterableDataset
from gluonts.transform import SelectFields
from torch.utils.data import DataLoader
from transformers import PretrainedConfig

def create_train_dataloader(
    config: PretrainedConfig,
    freq,
    data,
    batch_size: int,
    num_batches_per_epoch: int,
    shuffle_buffer_length: Optional[int] = None,
    **kwargs,
) -> Iterable:
    PREDICTION_INPUT_NAMES = [
        "static_categorical_features",
        "static_real_features",
        "past_time_features",
        "past_values",
        "past_observed_mask",
        "future_time_features",
        ]

    TRAINING_INPUT_NAMES = PREDICTION_INPUT_NAMES + [
        "future_values",
        "future_observed_mask",
        ]
    
    transformation = create_transformation(freq, config)
    transformed_data = transformation.apply(data, is_train=True)
    
    # we initialize a Training instance
    instance_splitter = create_instance_splitter(
        config, "train"
    ) + SelectFields(TRAINING_INPUT_NAMES)


    # the instance splitter will sample a window of 
    # context length + lags + prediction length (from the 366 possible transformed time series)
    # randomly from within the target time series and return an iterator.
    training_instances = instance_splitter.apply(
        Cyclic(transformed_data)
        if shuffle_buffer_length is None
        else PseudoShuffled(
            Cyclic(transformed_data), 
            shuffle_buffer_length=shuffle_buffer_length,
        )
    )

    # from the training instances iterator we now return a Dataloader which will 
    # continue to sample random windows for as long as it is called
    # to return batch_size of the appropriate tensors ready for training!
    return IterableSlice(
        iter(
            DataLoader(
                IterableDataset(training_instances),
                batch_size=batch_size,
                **kwargs,
            )
        ),
        num_batches_per_epoch,
    )

def create_test_dataloader(
    config: PretrainedConfig,
    freq,
    data,
    batch_size: int,
    **kwargs,
):
    PREDICTION_INPUT_NAMES = [
        "static_categorical_features",
        "static_real_features",
        "past_time_features",
        "past_values",
        "past_observed_mask",
        "future_time_features",
        ]
    
    transformation = create_transformation(freq, config)
    transformed_data = transformation.apply(data, is_train=False)
    
    # we create a Test Instance splitter which will sample the very last 
    # context window seen during training only for the encoder.
    instance_splitter = create_instance_splitter(
        config, "test"
    ) + SelectFields(PREDICTION_INPUT_NAMES)
    
    # we apply the transformations in test mode
    testing_instances = instance_splitter.apply(transformed_data, is_train=False)
    
    # This returns a Dataloader which will go over the dataset once.
    return DataLoader(IterableDataset(testing_instances), batch_size=batch_size, **kwargs)

train_dataloader = create_train_dataloader(
    config=config, 
    freq=freq, 
    data=train_dataset, 
    batch_size=256, 
    num_batches_per_epoch=100,
)

test_dataloader = create_test_dataloader(
    config=config, 
    freq=freq, 
    data=test_dataset,
    batch_size=64,
)
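
The training loader above combines three ideas: cycling endlessly over the dataset, optionally shuffling within a buffer, and cutting the endless stream into epochs of num_batches_per_epoch batches. The following is a minimal standard-library sketch of the cycling and slicing parts only. It uses itertools.cycle and itertools.islice as stand-ins chosen for illustration, not the GluonTS implementations of Cyclic and IterableSlice, and the names batched and EpochSlice are hypothetical.

```python
from itertools import cycle, islice


def batched(stream, batch_size):
    """Group a (possibly endless) stream into lists of batch_size items."""
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


class EpochSlice:
    """Re-iterable view that yields a fixed number of batches per epoch.

    Each call to iter() continues where the shared stream left off,
    so successive epochs see different parts of the endless stream.
    """

    def __init__(self, batches, num_batches_per_epoch):
        self._batches = batches
        self._n = num_batches_per_epoch

    def __iter__(self):
        return islice(self._batches, self._n)


series = ["ts_a", "ts_b", "ts_c"]   # toy stand-in for the training set
stream = cycle(series)              # endless pass over the data
loader = EpochSlice(batched(stream, batch_size=2), num_batches_per_epoch=2)

epoch_1 = list(loader)
epoch_2 = list(loader)
print(epoch_1)
print(epoch_2)
```

Because the underlying stream never ends, an "epoch" here is only a bookkeeping unit: it is the number of batches consumed before the outer loop moves on, not one full pass over the data.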

Let's check the first batch:

batch = next(iter(train_dataloader))
for k,v in batch.items():
  print(k,v.shape, v.type())

>>> static_categorical_features torch.Size([256, 1]) torch.LongTensor
    static_real_features torch.Size([256, 1]) torch.FloatTensor
    past_time_features torch.Size([256, 181, 2]) torch.FloatTensor
    past_values torch.Size([256, 181]) torch.FloatTensor
    past_observed_mask torch.Size([256, 181]) torch.FloatTensor
    future_time_features torch.Size([256, 24, 2]) torch.FloatTensor
    future_values torch.Size([256, 24]) torch.FloatTensor
    future_observed_mask torch.Size([256, 24]) torch.FloatTensor

As can be seen, we don't feed input_ids and attention_mask to the encoder (as would be the case for NLP models). Instead we feed past_values, together with past_observed_mask, past_time_features, static_categorical_features, and static_real_features.

The decoder inputs consist of future_values, future_observed_mask, and future_time_features. The future_values can be seen as the equivalent of decoder_input_ids in NLP.

The Time Series Transformer documentation explains each of these inputs in detail.
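
The second dimension of past_values is longer than the context length alone because the model also needs lagged values for every position in the context. The sketch below shows that arithmetic, assuming (as the comment in the training loader suggests) that the sampled window covers the context length plus the lags plus the prediction length. The function name past_window_length and the numbers used are illustrative assumptions, not the configuration from Part 1.

```python
def past_window_length(context_length, lags_sequence):
    """Past steps needed so that the largest lag is available at every context position."""
    return context_length + max(lags_sequence)


# Illustrative values only; not necessarily the configuration from Part 1.
context_length = 48
lags_sequence = [1, 2, 3, 12, 24, 37]
prediction_length = 24

past_len = past_window_length(context_length, lags_sequence)
total_window = past_len + prediction_length

print(past_len)      # length of past_values per example
print(total_window)  # total steps sampled per training window
```

With a different context length or lag set, the same formula gives a different past length, which is why the 181 printed above depends on the configuration chosen in Part 1.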

Forward pass

Let's run a single forward pass on the batch we just created:

# perform forward pass
outputs = model(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    past_observed_mask=batch["past_observed_mask"],
    static_categorical_features=batch["static_categorical_features"],
    static_real_features=batch["static_real_features"],
    future_values=batch["future_values"],
    future_time_features=batch["future_time_features"],
    future_observed_mask=batch["future_observed_mask"],
    output_hidden_states=True
)

print("Loss:", outputs.loss.item())

>>> Loss: 9.141253471374512

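Note that the model returns a loss. This is possible because the decoder automatically shifts future_values one position to the right to obtain the labels, which allows a loss to be computed between the predicted values and the labels.

Also note that the decoder uses a causal mask so that it cannot look into the future, since the values it needs to predict are in the future_values tensor.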
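
The two mechanisms just described, shifting the targets to the right and masking future positions, can be illustrated with plain lists. This is a conceptual sketch only: the real model works on tensors and also applies scaling and lagged features, and the helper names shift_right and causal_mask are hypothetical.

```python
def shift_right(values, start_value):
    """Decoder input at step t is the target from step t - 1."""
    return [start_value] + values[:-1]


def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


past_values = [10.0, 11.0, 12.0]
future_values = [13.0, 14.0, 15.0, 16.0]

# The first decoder input is the last observed value; the labels are the targets themselves.
decoder_inputs = shift_right(future_values, start_value=past_values[-1])
labels = future_values
mask = causal_mask(len(future_values))

for step, (inp, lab) in enumerate(zip(decoder_inputs, labels)):
    visible = [decoder_inputs[j] for j in range(len(mask[step])) if mask[step][j]]
    print(f"step {step}: input={inp} label={lab} visible={visible}")
```

At every step the decoder sees only inputs up to and including the current position, so the label for that step is never among the visible values.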

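Train the Model

It's time to train the model! We'll use a standard PyTorch training loop.

Here we use the 🤗 Accelerate library, which automatically places the model, optimizer, and data loader on the appropriate device.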

from accelerate import Accelerator
from torch.optim import Adam

accelerator = Accelerator()
device = accelerator.device

model.to(device)
optimizer = Adam(model.parameters(), lr=1e-3)
 
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, 
)

for epoch in range(40):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(
            static_categorical_features=batch["static_categorical_features"].to(device),
            static_real_features=batch["static_real_features"].to(device),
            past_time_features=batch["past_time_features"].to(device),
            past_values=batch["past_values"].to(device),
            future_time_features=batch["future_time_features"].to(device),
            future_values=batch["future_values"].to(device),
            past_observed_mask=batch["past_observed_mask"].to(device),
            future_observed_mask=batch["future_observed_mask"].to(device),
        )
        loss = outputs.loss

        # Backpropagation
        accelerator.backward(loss)
        optimizer.step()

        print(loss.item())

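Inference

At inference time, it is recommended to use the generate() method for autoregressive generation, similar to NLP models.

Forecasting takes its data from the test instance sampler, which samples the very last window of context_length values from each time series in the dataset and passes it to the model. Note that we pass future_time_features, which are known ahead of time, to the decoder.

The model autoregressively samples a certain number of values from the predicted distribution and passes them back to the decoder to produce the prediction outputs: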

model.eval()

forecasts = []

for batch in test_dataloader:
    outputs = model.generate(
        static_categorical_features=batch["static_categorical_features"].to(device),
        static_real_features=batch["static_real_features"].to(device),
        past_time_features=batch["past_time_features"].to(device),
        past_values=batch["past_values"].to(device),
        future_time_features=batch["future_time_features"].to(device),
        past_observed_mask=batch["past_observed_mask"].to(device),
    )
    forecasts.append(outputs.sequences.cpu().numpy())

The model outputs a tensor of shape (batch_size, number of samples, prediction length).

The output below shows that, for each example in a batch of size 64, we get 100 sampled values for each of the next 24 months:

forecasts[0].shape

>>> (64, 100, 24)
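
To make the (number of samples, prediction length) part of that shape concrete, here is a minimal sketch of autoregressive sampling for a single series. The function toy_predict_distribution is a hypothetical stand-in for the trained decoder and its distribution head (it simply centers a Gaussian on the last value), and sample_paths is likewise an illustrative name. The real generate() method works on batched tensors rather than Python lists.

```python
import random


def toy_predict_distribution(history):
    """Hypothetical stand-in for the decoder: mean is the last value, scale is fixed."""
    return history[-1], 1.0


def sample_paths(past_values, prediction_length, num_samples, seed=0):
    """Draw num_samples independent sample paths of length prediction_length."""
    rng = random.Random(seed)
    paths = []
    for _ in range(num_samples):
        history = list(past_values)
        path = []
        for _ in range(prediction_length):
            mean, scale = toy_predict_distribution(history)
            value = rng.gauss(mean, scale)  # draw one value from the predicted distribution
            path.append(value)
            history.append(value)           # feed the sample back in for the next step
        paths.append(path)
    return paths


paths = sample_paths([1.0, 2.0, 3.0], prediction_length=24, num_samples=100)
print(len(paths), len(paths[0]))
```

Each path is one plausible future, and the spread across paths at a given step is what later gets summarized as a median and an uncertainty band.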

We'll stack them vertically to get forecasts for all time series in the test dataset:

forecasts = np.vstack(forecasts)
print(forecasts.shape)

>>> (366, 100, 24)
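
The stacking works even though 366 is not a multiple of the batch size of 64. The sketch below reproduces the shapes with placeholder arrays, assuming the test loader keeps the final, smaller batch (the default behaviour of a PyTorch DataLoader). The variable names are illustrative.

```python
import numpy as np

num_series, batch_size = 366, 64
num_samples, prediction_length = 100, 24

# One placeholder array per batch; the last batch holds the remainder.
batch_sizes = [
    min(batch_size, num_series - start)
    for start in range(0, num_series, batch_size)
]
toy_forecasts = [
    np.zeros((b, num_samples, prediction_length)) for b in batch_sizes
]

# np.vstack concatenates along the first axis, so only that axis may differ.
stacked = np.vstack(toy_forecasts)
print(batch_sizes)
print(stacked.shape)
```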

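We can evaluate the resulting forecasts against the ground truth, using the out-of-sample values held in the test set. Here we compute the MASE and sMAPE metrics for each time series in the dataset: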

from evaluate import load
from gluonts.time_feature import get_seasonality

mase_metric = load("evaluate-metric/mase")
smape_metric = load("evaluate-metric/smape")

forecast_median = np.median(forecasts, 1)

mase_metrics = []
smape_metrics = []
for item_id, ts in enumerate(test_dataset):
    training_data = ts["target"][:-prediction_length]
    ground_truth = ts["target"][-prediction_length:]
    mase = mase_metric.compute(
        predictions=forecast_median[item_id], 
        references=np.array(ground_truth), 
        training=np.array(training_data), 
        periodicity=get_seasonality(freq))
    mase_metrics.append(mase["mase"])
    
    smape = smape_metric.compute(
        predictions=forecast_median[item_id], 
        references=np.array(ground_truth), 
    )
    smape_metrics.append(smape["smape"])

print(f"MASE: {np.mean(mase_metrics)}")

>>> MASE: 1.361636922541396

print(f"sMAPE: {np.mean(smape_metrics)}")

>>> sMAPE: 0.17457818831512306
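
For readers unfamiliar with the two metrics, here is a minimal hand-written sketch of one common definition of each. It is meant to show what is being measured and is not the implementation behind the evaluate metrics, which may differ in scaling and in how zero values are handled. The function names mase and smape are defined here for illustration.

```python
def mase(predictions, references, training, periodicity=1):
    """Forecast MAE divided by the in-sample MAE of a seasonal-naive forecast."""
    mae = sum(abs(p - r) for p, r in zip(predictions, references)) / len(references)
    naive_errors = [
        abs(training[t] - training[t - periodicity])
        for t in range(periodicity, len(training))
    ]
    return mae / (sum(naive_errors) / len(naive_errors))


def smape(predictions, references):
    """Symmetric MAPE on a 0-to-2 scale; undefined when both values are zero."""
    terms = [
        2 * abs(p - r) / (abs(p) + abs(r))
        for p, r in zip(predictions, references)
    ]
    return sum(terms) / len(terms)


training = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
references = [7.0, 8.0]
predictions = [7.0, 9.0]

print(mase(predictions, references, training))
print(smape(predictions, references))
```

A MASE below 1 means the forecast has a smaller average error than the seasonal-naive baseline had on the training data, which makes the metric comparable across series of different scales.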

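We can also plot the individual metrics of each time series in the dataset, and observe that a handful of time series contribute heavily to the final test metric: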

plt.scatter(mase_metrics, smape_metrics, alpha=0.3)
plt.xlabel("MASE")
plt.ylabel("sMAPE")
plt.show()

To plot the forecast for any time series against the ground-truth test data, we define the following helper:

import matplotlib.dates as mdates

def plot(ts_index):
    fig, ax = plt.subplots()

    index = pd.period_range(
        start=test_dataset[ts_index][FieldName.START],
        periods=len(test_dataset[ts_index][FieldName.TARGET]),
        freq=freq,
    ).to_timestamp()

    # Major ticks every half year, minor ticks every month,
    ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=(1, 7)))
    ax.xaxis.set_minor_locator(mdates.MonthLocator())

    ax.plot(
        index[-2*prediction_length:], 
        test_dataset[ts_index]["target"][-2*prediction_length:],
        label="actual",
    )

    plt.plot(
        index[-prediction_length:], 
        np.median(forecasts[ts_index], axis=0),
        label="median",
    )
    
    plt.fill_between(
        index[-prediction_length:],
        forecasts[ts_index].mean(0) - forecasts[ts_index].std(axis=0), 
        forecasts[ts_index].mean(0) + forecasts[ts_index].std(axis=0), 
        alpha=0.3, 
        interpolate=True,
        label="+/- 1-std",
    )
    plt.legend()
    plt.show()

For example:

plot(334)
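
The helper above summarizes the samples with a median line and a band of one standard deviation around the mean. The sketch below computes those same summaries on toy data and, as an alternative that the helper does not use, an empirical 80% interval taken from the sample quantiles. The random samples are a stand-in for forecasts[ts_index].

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for forecasts[ts_index]: (num_samples, prediction_length)
samples = rng.normal(loc=100.0, scale=5.0, size=(100, 24))

# Summaries used by the plotting helper
median = np.median(samples, axis=0)
mean = samples.mean(axis=0)
std = samples.std(axis=0)
std_band = (mean - std, mean + std)

# Alternative: an empirical 80% interval read directly from the samples
lower, upper = np.quantile(samples, [0.1, 0.9], axis=0)

print(median.shape, std_band[0].shape, lower.shape)
```

A quantile-based interval makes no symmetry assumption, which can matter when the predictive distribution is skewed.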

How do we compare against other models? The Monash Time Series Repository has a comparison table of test-set MASE metrics, to which we can add our own result:

Dataset	SES	Theta	TBATS	ETS	(DHR-)ARIMA	PR	CatBoost	FFNN	DeepAR	N-BEATS	WaveNet	Transformer (Our)
Tourism Monthly	3.306	1.649	1.751	1.526	1.589	1.678	1.699	1.582	1.409	1.574	1.482	1.361

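Note that our model beats all the other models reported in this table (see also Table 2 in the corresponding paper), and we did no hyperparameter tuning. We simply trained the Transformer for 40 epochs.

Of course, we should stay humble. Claiming that neural networks are the right answer for time series forecasting deserves the same caution as the earlier conclusion that "XGBoost is all you need". We are simply curious to see how far neural networks can take us, and whether Transformers will prove useful in this domain. This particular dataset seems to indicate that it is definitely worth exploring.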

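Next Steps

We encourage readers to try out our Jupyter Notebook with other time series datasets from the Hugging Face Hub, replacing the frequency and prediction length parameters as appropriate. Your own datasets need to be converted to the format used by GluonTS, which is explained clearly in their documentation. We have also prepared an example notebook showing how to convert a dataset into the 🤗 Hugging Face datasets format.

As time series researchers will know, there has been a lot of interest in applying Transformer-based models to time series problems. The vanilla Transformer is just one of many attention-based models, so more models need to be added to the library.

At the moment nothing prevents us from modeling multivariate time series, but doing so requires instantiating the model with a multivariate distribution head. Currently, diagonal independent distributions are supported, and other multivariate distributions will be added later. Stay tuned for a future blog post that will include a tutorial.

Another item on the roadmap is time series classification. This entails adding a time series model with a classification head to the library, for tasks such as anomaly detection.

The current model assumes that a date-time is present together with the time series values, which may not hold for every time series in practice. See, for instance, the neuroscience datasets from WOODS. We would therefore need to generalize the current model so that some inputs are optional throughout the pipeline.

Finally, the NLP and computer vision domains have benefited tremendously from large pre-trained models, but as far as we are aware this is not yet the case for time series. Transformer-based models seem like the obvious choice for pursuing this line of research, and we cannot wait to see what researchers and practitioners come up with!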

English original: Probabilistic Time Series Forecasting with 🤗 Transformers
Translation and layout: zhongdongy (阿東)
