Pretraining with the HuggingFace Transformers library on Colab or Kaggle.

This tutorial covers pretraining on the English dataset wikitext-2 and on a code dataset.
Note: you can also upload your own dataset for training.

Goal: get the pretraining pipeline of an autoregressive language model running end to end.

1. Preparation

1.1 Installing dependencies

!pip install -U datasets
!pip install accelerate -U

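Note: when training on Colab, it is best to update datasets to the latest version (and then restart the kernel) to avoid errors caused by an outdated version.

Colab and Kaggle already come with the transformers library preinstalled.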

1.2 Preparing the data

Load the data

from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You can also use any public text dataset on HuggingFace, or data you built yourself, replacing the placeholder paths below with your own:

# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

To access an actual element, first select a split by its key and then give an index.
Take a look at the format of the data:

datasets["train"][10].keys()

Each element of this dataset is simply a dictionary that contains only text:

dict_keys(['text'])

Look at an example

datasets["train"][1]

{'text': ' = Valkyria Chronicles III = \n'}

Sizes of the training and test sets

print(len(datasets["train"]), len(datasets["test"]))

36718 4358

The following function displays a few random samples from the dataset:

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])
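The retry loop in show_random_elements draws distinct indices one at a time. As a minimal alternative sketch, the standard library's random.sample draws without replacement in a single call. The helper name pick_distinct_indices is hypothetical and not part of the original notebook.

```python
import random

def pick_distinct_indices(dataset_size, num_examples=10):
    """Return num_examples distinct row indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no retry loop is needed.
    return random.sample(range(dataset_size), num_examples)

# 36718 is the size of the training split reported above.
picks = pick_distinct_indices(36718, num_examples=10)
print(picks)
```

The resulting list can be used in place of picks inside show_random_elements.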

Some entries in the dataset are empty strings or titles, while others are complete paragraphs.

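2. Causal Language Modeling (CLM)

For causal language modeling, we first take all the texts in the dataset and concatenate their tokenized results.

We then split them into training samples of a fixed sequence length, so the model receives contiguous chunks of text such as: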

part of text 1

or

end of text 1 [BOS_TOKEN] beginning of text 2

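Which of the two it is depends on whether the training sample spans several original texts in the dataset:

An original text longer than the sequence length is split.
An original text shorter than the sequence length is concatenated with other texts.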
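The two cases can be seen in a toy sketch that works on plain Python lists rather than real token ids. The function pack_into_blocks and the "[BOS_TOKEN]" separator string are illustrative assumptions; whether a separator token actually appears between texts depends on the tokenizer in use.

```python
def pack_into_blocks(tokenized_texts, block_size, sep_token="[BOS_TOKEN]"):
    """Concatenate token lists (a separator before each text) and cut fixed-size blocks."""
    stream = []
    for tokens in tokenized_texts:
        stream.append(sep_token)
        stream.extend(tokens)
    usable = (len(stream) // block_size) * block_size  # drop the short tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

texts = [
    ["a1", "a2", "a3", "a4", "a5", "a6"],  # longer than block_size: gets split
    ["b1", "b2"],                          # shorter: shares a block with its neighbours
    ["c1", "c2", "c3"],
]
blocks = pack_into_blocks(texts, block_size=4)
for block in blocks:
    print(block)
```

The first text is spread over two blocks, while the third block mixes the end of the second text with the start of the third.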

The labels are the inputs shifted by one position, so that each position predicts the next token.

This example uses the gpt2 model.

model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"

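You can also pick any causal language model checkpoint listed at https://huggingface.co/models?filter=causal-lm.

To tokenize all texts with the same vocabulary the model was trained with, first download a pretrained tokenizer.

Load it directly with the AutoTokenizer class: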

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

All the texts can now be tokenized.

First define a function that tokenizes the text:

def tokenize_function(examples):
    return tokenizer(examples["text"])

Then apply it to the datasets object, using batched=True and 4 processes to speed up preprocessing, and remove the text column, which is not needed afterwards.

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
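With batched=True, the mapped function receives a dictionary of columns, each holding a list of values, and returns new columns in the same layout. The pure-Python stand-in below illustrates only that calling convention; toy_tokenize and toy_batched_map are hypothetical helpers, not the datasets library's implementation.

```python
def toy_tokenize(examples):
    # Stand-in for tokenizer(examples["text"]): whitespace split, word length as a fake id.
    input_ids = [[len(word) for word in text.split()] for text in examples["text"]]
    return {"input_ids": input_ids, "attention_mask": [[1] * len(ids) for ids in input_ids]}

def toy_batched_map(rows, function, batch_size, remove_columns=()):
    """Apply `function` to column-wise batches and collect the resulting rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        batch.update(function(batch))
        for key in remove_columns:
            batch.pop(key, None)
        columns = list(batch)
        out.extend(dict(zip(columns, values)) for values in zip(*batch.values()))
    return out

rows = [{"text": ""}, {"text": " = Valkyria Chronicles III = "}, {"text": "a bb ccc"}]
tokenized = toy_batched_map(rows, toy_tokenize, batch_size=2, remove_columns=["text"])
print(tokenized)
```

Empty texts simply produce empty input_ids lists, which is why they cause no problem once everything is concatenated later.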

Inspect a sample of the tokenized dataset; the text has been converted into input_ids (the sequence of token ids) and attention_mask:

tokenized_datasets["train"][1]
{'input_ids': [238, 8576, 9441, 2987, 238, 252],
 'attention_mask': [1, 1, 1, 1, 1, 1]}

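Next, the tokenized texts need to be concatenated and split into chunks of a fixed block_size (the operation described at the start of section 2; block_size is effectively the max_length of each sample in a batch).

To do this, we use the map method again with batched=True. Different block_size values yield different numbers of samples.

In this way, one batch of samples goes in and a new batch of samples, possibly of a different size, comes out.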
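As a rough worked example of how block_size changes the sample count, the sketch below divides a token total into full blocks. The total of 10,000 tokens is a hypothetical figure chosen for illustration, not a measurement of wikitext-2.

```python
def num_blocks(total_tokens, block_size):
    """Number of full blocks; the remainder (fewer than block_size tokens) is dropped."""
    return total_tokens // block_size

total_tokens = 10_000  # hypothetical token count for one batch of texts
for block_size in (128, 256, 512, 1024):
    n = num_blocks(total_tokens, block_size)
    dropped = total_tokens - n * block_size
    print(f"block_size={block_size:4d} -> {n:3d} samples, {dropped:3d} tokens dropped")
```

Smaller blocks give more, shorter samples and usually waste fewer tokens at the tail.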

First, set the maximum sequence length used when pretraining the CLM model. It is set to 256 here so that your GPU memory does not blow up 💥.

# block_size = tokenizer.model_max_length
block_size = 256

Then use a preprocessing function to group the training texts:

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # The small remainder of tokens is dropped here. If the model supports it, you could pad instead; adapt this to your needs.
    total_length = (total_length // block_size) * block_size

    # Split into chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

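Note first that we copy the inputs to use as the labels.

This is because models in the 🤗 Transformers library shift the labels by one position internally, so we do not need to do it by hand.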
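The alignment that results from this shift can be sketched with plain lists. This illustrates only which prediction is compared with which token; the library performs the equivalent step on tensors inside the model, and shifted_pairs is a hypothetical helper.

```python
def shifted_pairs(input_ids, labels):
    """Pair each input position with the label it is trained to predict."""
    contexts = input_ids[:-1]   # the last position has no next token to predict
    targets = labels[1:]        # the first token is never a prediction target
    return list(zip(contexts, targets))

input_ids = [238, 8576, 9441, 2987, 238, 252]
labels = input_ids.copy()       # mirrors result["labels"] = result["input_ids"].copy()
pairs = shifted_pairs(input_ids, labels)
for context_token, target_token in pairs:
    print(f"after token {context_token} the target is {target_token}")
```

A block of n tokens therefore contributes n - 1 prediction targets to the loss.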

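Also note that by default the map method sends batches of 1,000 examples to the preprocessing function. The remainder is therefore dropped once per batch, so that the concatenated tokenized text of each batch is a multiple of block_size. You can adjust this by passing a larger batch size (which will also be processed more slowly); the call below uses batch_size=2000. You can also use multiprocessing to speed up preprocessing: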

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=2000,
    num_proc=4,
)
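Because the tail is dropped once per batch, the batch size affects how many tokens are lost. The sketch below counts this for a hypothetical corpus of 6,000 texts with 100 tokens each; the numbers are illustrative and kept_and_dropped is not part of the original notebook.

```python
def kept_and_dropped(token_counts, batch_size, block_size):
    """Return (tokens kept, tokens dropped) when each batch is chunked independently."""
    kept = dropped = 0
    for start in range(0, len(token_counts), batch_size):
        batch_total = sum(token_counts[start:start + batch_size])
        usable = (batch_total // block_size) * block_size
        kept += usable
        dropped += batch_total - usable
    return kept, dropped

# Hypothetical corpus: 6,000 texts of 100 tokens each.
token_counts = [100] * 6000
for batch_size in (1000, 2000):
    kept, dropped = kept_and_dropped(token_counts, batch_size, block_size=256)
    print(f"batch_size={batch_size}: kept {kept} tokens, dropped {dropped}")
```

Fewer, larger batches mean fewer tails to discard, at the cost of more memory per call to the preprocessing function.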

Now check that the dataset has changed: each sample is a block of block_size consecutive tokens, possibly spanning several original texts.

tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Oz'

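With the pretraining corpus processed, model training can begin.

We create a model. Loading the checkpoint this way gives a model that already carries the pretrained gpt2 weights: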

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

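To pretrain from scratch instead, build the model from the checkpoint's configuration only, which gives randomly initialized weights; this second definition replaces the one above. Training then uses the Trainer class from transformers: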

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)

Training arguments

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    f"{model_checkpoint}-wikitext2",
    evaluation_strategy = "epoch",  # newer transformers releases name this argument eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
    # push_to_hub=True
)

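Train the model

trainer = Trainer(model=model, args=training_args, train_dataset=lm_datasets["train"], eval_dataset=lm_datasets["validation"])
trainer.train()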

Training log

[ 220/3375 02:11 < 31:43, 1.66 it/s, Epoch 0.19/3]
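The total of 3375 steps in the log can be related to the dataset size with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the TrainingArguments defaults of batch size 8 and 3 epochs (none of these are set explicitly above); total_training_steps is a hypothetical helper.

```python
import math

def total_training_steps(num_samples, batch_size=8, num_epochs=3):
    """Optimizer steps for a full run, rounding each epoch up to whole batches."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

# 3375 steps over 3 epochs means 1125 steps per epoch, which at batch size 8
# corresponds to between 8993 and 9000 training blocks.
print(total_training_steps(9000))
print(total_training_steps(8993))
print(total_training_steps(8992))
```

Under these assumptions, the first two calls reproduce the 3375 steps seen in the log, while one block fewer would already give a smaller total.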

Evaluation results

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 552.71

The perplexity is still quite high since for this demo we trained on a small dataset for a small number of epochs. For a real LM training, you would need a larger dataset and more epochs.
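Perplexity is the exponential of the mean cross-entropy loss, which is what math.exp(eval_results['eval_loss']) computes. The toy sketch below works from hypothetical probabilities assigned to the correct next tokens; the value 50257 is the vocabulary size that appears in the logits shape later in this tutorial.

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-likelihood of the correct next tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly over a 50257-token vocabulary.
uniform = perplexity([1 / 50257] * 5)
print(round(uniform))

# A model that gives the correct tokens probabilities 1/2, 1/4 and 1/8.
print(round(perplexity([0.5, 0.25, 0.125]), 2))
```

Read the other way round, a perplexity of 552.71 corresponds to an eval_loss of about 6.31, far better than uniform guessing but still far from a well-trained model.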

1.5 A closer look at the tokenizer

tokens = tokenizer.tokenize("六朝何事")
tokens

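The result looks odd: the vocabulary contains almost no Chinese, so each Chinese character falls back to its UTF-8 bytes (3 bytes per character, one token per byte).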

['å', 'ħ', 'Ń', 'æ', 'ľ', 'Ŀ', 'ä', '½', 'ķ', 'ä', 'º', 'ĭ']
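These symbols are not random. A byte-level BPE tokenizer in the GPT-2 style maps every byte to a printable character before applying merges. The sketch below rebuilds such a table under the assumption that the tokenizer follows the published GPT-2 byte-to-character scheme, and applies it to the UTF-8 bytes of the input without any merges.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    table = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in table:          # unprintable bytes are moved to code points >= 256
            table[b] = chr(256 + shift)
            shift += 1
    return table

table = bytes_to_unicode()
raw = "六朝何事".encode("utf-8")
print(len(raw))                     # 4 characters, 3 bytes each
byte_tokens = [table[b] for b in raw]
print(byte_tokens)
```

The twelve characters printed match the twelve tokens above, which shows that no merge rule in the vocabulary covers these byte sequences.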

Convert to token ids

tokenizer.convert_tokens_to_ids(tokens)

Result

[150, 165, 193, 151, 188, 189, 149, 121, 181, 149, 118, 171]

Use encode to convert directly to token ids

tokenizer.encode("六朝何事")

[150, 165, 193, 151, 188, 189, 149, 121, 181, 149, 118, 171]

Calling the tokenizer directly

tokenizer("六朝何事")

gives the same ids

{'input_ids': [150, 165, 193, 151, 188, 189, 149, 121, 181, 149, 118, 171], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Decoding back to text

tokenizer.decode(tokenizer("六朝何事")['input_ids'])

'六朝何事'

1.6 Inference

x = tokenizer("六朝何事", return_tensors="pt")
y = model.forward(x['input_ids'])
y

The result is a large dump (truncated here):

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[-0.5103, -0.3852, -0.0509,  ..., -0.1831,  0.7720, -0.2264],
         [-0.9077,  0.0660, -0.7552,  ...,  0.0428,  0.6765, -0.0024],
         [ 0.4458, -0.4124, -1.2314,  ...,  0.3847,  0.4391,  0.0402],
         ...,
         [ 0.3976,  0.0738, -0.7156,  ...,  0.1152,  0.8602,  0.0270],
         [ 0.6953,  0.7504,  0.0266,  ..., -0.6524,  1.1901,  0.1273],
         [-0.3004,  0.5009, -1.0164,  ..., -0.1076,  1.4422, -0.5940]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-0.0165, -0.5414, -0.1960,  ...,  0.0751, -1.3083, -0.6204],
          [ 0.5249,  0.0685,  0.2652,  ..., -0.1789,  0.0868, -0.5673],
          [ 0.6694, -0.5541, -0.2543,  ...,  0.0981, -0.1687, -0.2084],

The predicted logits, y.logits, have shape torch.Size([1, 12, 50257]), that is, one score per vocabulary entry for each of the 12 input positions:

tensor([[[ 0.6202, -0.1432, -0.0364,  ..., -0.6025,  0.7150, -0.2145],
         [-0.3945, -0.0824, -0.5818,  ...,  0.0286,  0.6341, -0.2636],
         [ 0.2438,  0.5748, -0.9318,  ..., -0.4956,  0.5061, -0.3112],
         ...,
         [ 1.0054,  0.3126, -0.1491,  ..., -0.1764,  0.4643, -0.1376],
         [ 0.5537,  0.7263,  0.0582,  ..., -0.7386,  1.2950, -0.1308],
         [ 0.5036,  1.0895,  0.0722,  ..., -0.8044,  0.4085, -0.8951]]],
       grad_fn=<UnsafeViewBackward0>)
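Only the last position's scores matter when predicting the next token, and a softmax turns them into probabilities. The sketch below uses a tiny hand-written list in place of the real tensor; the toy shape [1, 3, 4] and its values are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits with shape [batch=1, seq_len=3, vocab=4]; the real output is [1, 12, 50257].
toy_logits = [[[0.1, 2.0, -1.0, 0.3],
               [1.5, 0.2, 0.2, -0.7],
               [-0.3, 0.5, 3.0, 0.0]]]
last_position = toy_logits[0][-1]        # scores for the token that would come next
probs = softmax(last_position)
next_token_id = max(range(len(probs)), key=probs.__getitem__)
print(next_token_id, round(probs[next_token_id], 3))
```

Taking the argmax of the logits directly gives the same token id, since softmax preserves the ordering of the scores.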

Because the tokens predicted for Chinese input do not decode into readable text, the following tests use English.

import torch
import numpy as np

inputs_text = "Hello "
x = tokenizer(inputs_text, return_tensors="pt")
y = model.forward(x['input_ids'])

# Greedy sampling: take the token with the highest probability
next_token_id = int(np.argmax(y.logits[0][-1].detach().numpy()))
print(next_token_id)
next_token = tokenizer.convert_ids_to_tokens(next_token_id)
print(inputs_text + next_token)

Result
10391
Hello ĠBright

generate code (max_length sets the number of tokens to predict)

max_length = 20
inputs_text = "hello "

# Keep a flat list of token ids; the batch dimension is added when calling the model
input_ids = tokenizer.encode(inputs_text)
for i in range(max_length):
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]))
    # Greedy decoding: pick the highest-scoring token at the last position
    last_token_id = int(torch.argmax(outputs.logits[0, -1]))
    input_ids.append(last_token_id)

# Decode all ids at once so byte-level tokens such as 'Ġ' become normal text
print(tokenizer.decode(input_ids))
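The structure of this greedy loop can be tried without a trained model. In the sketch below, toy_scores is a hypothetical stand-in for the last-position logits of the model, and the optional eos_id early stop is an addition that the loop above does not have.

```python
def greedy_generate(next_token_scores, prompt_ids, max_new_tokens, eos_id=None):
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)                  # one score per vocabulary entry
        best = max(range(len(scores)), key=scores.__getitem__)
        ids.append(best)
        if eos_id is not None and best == eos_id:        # optional early stop
            break
    return ids

VOCAB = ["<eos>", "hello", "world", "again"]

def toy_scores(ids):
    # Hypothetical stand-in for the model: favour the token after the last one.
    favourite = (ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favourite else 0.0 for i in range(len(VOCAB))]

generated = greedy_generate(toy_scores, prompt_ids=[1], max_new_tokens=20, eos_id=0)
print([VOCAB[i] for i in generated])
```

Greedy decoding is deterministic, so the same prompt always yields the same continuation; sampling strategies trade that for variety.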

1.7 References

Implementation code, Colab source: Train a language model - Colaboratory (google.com)

Chinese GPT2 pretraining and fine-tuning: Hugging Face中GPT2模型應用程式碼 - 知乎 (zhihu.com)

Gpt進階(二): 以古詩集為例,訓練一個自己的古詩詞gpt模型 - 知乎 (zhihu.com)

【預訓練語言模型】使用Transformers庫進行GPT2預訓練