[Pretrained Language Models] Pretraining BERT with the Transformers Library

Published by LeonYi on 2024-03-13

Pretraining with the HuggingFace Transformers library, run on Colab or Kaggle.

Given limited compute, the small English dataset wikitext-2 is used.

Goal: run a masked language model (MLM) pretraining pipeline end to end.

1. Preparation

1.1 Install Dependencies

!pip3 install --upgrade pip
!pip install -U datasets
!pip install accelerate -U

Note: when training on Kaggle, it is best to upgrade datasets to the latest version (and then restart the kernel) to avoid errors caused by an old version.

Colab and Kaggle come with the transformers library preinstalled.

1.2 Data Preparation

Load the dataset:

from datasets import concatenate_datasets, load_dataset

wikitext2 = load_dataset("liuyanchen1015/VALUE_wikitext2_been_done", split="train")

# concatenate_datasets can merge multiple datasets; here only one dataset is used to keep the training data small
dataset = concatenate_datasets([wikitext2])
# Split the dataset into 90% for training and 10% for testing
d = dataset.train_test_split(test_size=0.1)
type(d["train"])

We can see that the type of d["train"] is:

datasets.arrow_dataset.Dataset
def __init__(arrow_table: Table, info: Optional[DatasetInfo]=None, split: Optional[NamedSplit]=None, indices_table: Optional[Table]=None, fingerprint: Optional[str]=None)
A Dataset backed by an Arrow table.

Sizes of the training and test sets:

print(len(d["train"]), len(d["test"]))

5391 600

Inspect a sample:

d["train"][0]

Each sample is a dict of sentence, idx, and score, where sentence holds the text:

{'sentence': "Rowley dismissed this idea given the shattered state of Africaine and instead towed the frigate back to Île Bourbon , shadowed by Astrée and Iphigénie on the return journey . The French frigates did achieve some consolation in pursuing Rowley from a distance , running into and capturing the Honourable East India Company 's armed brig Aurora , sent from India to reinforce Rowley . On 15 September , Boadicea , Africaine and the brigs done arrived at Saint Paul , Africaine sheltering under the fortifications of the harbour while the others put to sea , again seeking to drive away the French blockade but unable to bring them to action . Bouvet returned to Port Napoleon on 18 September , and thus was not present when Rowley attacked and captured the French flagship Vénus and Commodore Hamelin at the Action of 18 September 1810 .",
 'idx': 1867,
 'score': 1}

Next, save the training and test splits to local text files (they will be used later to train the tokenizer).

def dataset_to_text(dataset, output_filename="data.txt", num=1000):
    """Utility function to save dataset text to disk,
    useful for using the texts to train the tokenizer
    (as the tokenizer accepts files)."""
    cnt = 0
    with open(output_filename, "w") as f:
        for t in dataset["sentence"]:
            print(t, file=f)  # redirect print output to the file
            cnt += 1
            if cnt == num:
                break

    print("Wrote {} samples to {}".format(cnt, output_filename))

# save the training set to train.txt
dataset_to_text(d["train"], "train.txt", len(d["train"]))
# save the testing set to test.txt
dataset_to_text(d["test"], "test.txt", len(d["test"]))

2. Training the Tokenizer

BERT uses WordPiece tokenization:

A language model is first built from the training corpus. At each merge iteration, every candidate token pair is scored, and the pair that increases the likelihood of the training data the most is merged; this repeats until the target vocabulary size is reached.

Unlike BPE, which simply merges the most frequent token pair, WordPiece uses the language model to compute a likelihood-based score:

score = P(token pair) / (P(first token) * P(second token))
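
To make the scoring concrete, here is a minimal illustrative sketch (my own, not the internals of the tokenizers library) that scores candidate merges from corpus frequencies, up to a constant factor:

```python
from collections import Counter

def wordpiece_scores(corpus_tokens):
    """Score candidate merges the WordPiece way from a pre-tokenized corpus,
    e.g. corpus_tokens = [["hug", "##s"], ["hug", "##ging"], ...]."""
    unigram = Counter(tok for sent in corpus_tokens for tok in sent)
    pairs = Counter(
        (sent[i], sent[i + 1]) for sent in corpus_tokens for i in range(len(sent) - 1)
    )
    total = sum(unigram.values())
    # score(a, b) = P(ab) / (P(a) * P(b)); constant factors do not change the ranking
    return {
        (a, b): freq * total / (unigram[a] * unigram[b])
        for (a, b), freq in pairs.items()
    }
```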

Here we use the BertWordPieceTokenizer class from the tokenizers library to train the tokenizer.

import os, json
from transformers import BertTokenizerFast
from tokenizers import BertWordPieceTokenizer

special_tokens = [
    "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]

# Load the corpus files saved earlier (the train and test files can be used together)
# files = ["train.txt", "test.txt"]

# Train the tokenizer on the training set only
files = ["train.txt"]

# Vocabulary size; BERT's vocab size is 30522, change it as needed
vocab_size = 30_522

# Maximum sequence length; a smaller length means more batches and faster training
# It also determines the size of BERT's position embeddings, i.e. the model's maximum inference length
max_length = 512

# Whether to truncate longer samples; truncation is enabled here
# truncate_longer_samples = False
truncate_longer_samples = True

# Initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()

# Train the tokenizer
tokenizer.train(files=files,
                vocab_size=vocab_size,
                special_tokens=special_tokens)

# Truncate to the maximum sequence length (512 tokens)
tokenizer.enable_truncation(max_length=max_length)

model_path = "pretrained-bert"
# make the directory if not already there
if not os.path.isdir(model_path):
    os.mkdir(model_path)

# Save the tokenizer (this is essentially just the vocabulary file vocab.txt)
tokenizer.save_model(model_path)

# Save the tokenizer configuration to a config file,
# including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(model_path, "config.json"), "w") as f:
    tokenizer_cfg = {
        "do_lower_case": True,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "model_max_length": max_length,
        "max_len": max_length,
    }
    json.dump(tokenizer_cfg, f)

# when the tokenizer is trained and configured, load it as BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(model_path)
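
Since this reloaded tokenizer is reused throughout the rest of the post, a quick sanity check can be useful. This is an added sketch, not part of the original notebook; the expected ids come from the vocab.txt shown below:

```python
# Quick sanity check (added sketch): the special tokens should map to the ids
# seen at the top of vocab.txt, and ordinary text should split into WordPieces.
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)  # 2 3 0
print(tokenizer.tokenize("Thunderbirds are go!"))
```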


Contents of vocab.txt: the 7 special tokens followed by the learned tokens.

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
<S>
<T>
!
"
#
$
%
&
'
(

The config.json file:

{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "model_max_length": 512, "max_len": 512}

The config.json and vocab.txt files are the same files we routinely see when fine-tuning BERT or GPT; this is where they come from.

3. Preprocessing the Corpus

Before BERT pretraining can start, the pretraining corpus must be processed with the trained tokenizer, converting the raw text into tokenized form.

If a document is longer than 512 tokens, it is simply truncated.

The data-processing code is as follows:

def encode_with_truncation(examples):
    """Mapping function to tokenize the sentences passed, with truncation."""
    return tokenizer(examples["sentence"], truncation=True, padding="max_length",
                     max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
    """Mapping function to tokenize the sentences passed, without truncation."""
    return tokenizer(examples["sentence"], return_special_tokens_mask=True)

# Which encode function is used depends on the truncate_longer_samples flag
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

# Tokenize the training set
train_dataset = d["train"].map(encode, batched=True)
# Tokenize the test set
test_dataset = d["test"].map(encode, batched=True)
if truncate_longer_samples:
    # remove other columns and set input_ids and attention_mask as PyTorch tensors
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    # remove other columns and keep them as Python lists
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])

The truncate_longer_samples flag determines which encode() function is used to tokenize the dataset.

  • If set to True, sentences longer than the maximum sequence length (max_length) are truncated; otherwise they are left as-is.
  • If truncate_longer_samples is set to False, the untruncated samples have to be concatenated and regrouped into fixed-length chunks (see the sketch below).

Truncation is enabled here (the no-truncation path still has an unresolved bug in this code), which makes it straightforward to pack examples into fixed-length chunks for batched processing.
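
For reference, a minimal sketch of that chunking step (an assumption based on the standard Hugging Face MLM recipe, not code from this notebook) could look like this:

```python
# Only needed when truncate_longer_samples is False: concatenate the tokenized
# samples and re-split them into fixed-length blocks of max_length tokens.
def group_texts(examples, block_size=max_length):
    keys = ["input_ids", "attention_mask", "special_tokens_mask"]
    concatenated = {k: sum(examples[k], []) for k in keys if k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the last partial block so every chunk has exactly block_size tokens
    total_length = (total_length // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }

# train_dataset = train_dataset.map(group_texts, batched=True,
#                                   remove_columns=train_dataset.column_names)
```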

Map: 100%
 5391/5391 [00:15<00:00, 352.72 examples/s]
Map: 100%
 600/600 [00:01<00:00, 442.19 examples/s]

Inspect the tokenization results:

x = d["train"][10]['sentence'].split(".")[0]
x
In his foreword to John Marriott 's book , Thunderbirds Are Go ! , Anderson put forward several explanations for the series ' enduring popularity : it " contains elements that appeal to most children – danger , jeopardy and destruction 

Tokenize with truncation at max_length=33:

tokenizer(x, truncation=True, padding="max_length", max_length=33, return_special_tokens_mask=True)

The extra tokens are truncated, and there is no padding.

  • [CLS] (token id 2) is added at the start of the sentence
  • [SEP] (token id 3) is added at the end of the sentence

{'input_ids': [2, 502, 577, 23750, 649, 511, 1136, 29823, 13, 59, 1253, 18, 4497, 659, 762, 7, 18, 3704, 2041, 3048, 1064, 26668, 534, 492, 1005, 13, 12799, 4281, 32, 555, 8, 4265, 3], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}

Tokenize with padding to max_length=70:

tokenizer(x, truncation=True, padding="max_length", max_length=70, return_special_tokens_mask=True)

Positions short of max_length are padded with token id 0 ([PAD]), and the corresponding attention_mask entries are 0 (the attention scores at those positions become effectively negative infinity, i.e. they are masked out; a short sketch of this mechanism follows the output below).

{'input_ids': [2, 11045, 13, 61, 1248, 3346, 558, 8, 495, 4848, 1450, 8, 18, 981, 1146, 578, 17528, 558, 18652, 513, 13102, 18, 749, 1660, 819, 1985, 548, 566, 11279, 579, 826, 3881, 8079, 505, 495, 28263, 360, 18, 16273, 505, 570, 4145, 1644, 18, 505, 652, 757, 662, 2903, 1334, 791, 543, 3008, 810, 495, 817, 8079, 4436, 505, 579, 1558, 3, 0, 0, 0, 0, 0, 0, 0, 0], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 
'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
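
As a side note on the attention_mask above, here is a minimal sketch (an illustration of the mechanism, not BERT's actual implementation) of how padded positions end up with near-zero attention weight:

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])  # 1 = real token, 0 = [PAD]
scores = torch.randn(1, 5)                        # raw attention scores for one query
# Add a very large negative bias to padded positions before the softmax
scores = scores + (1 - attention_mask) * torch.finfo(scores.dtype).min
weights = torch.softmax(scores, dim=-1)
print(weights)  # the last two (padded) positions get ~0 weight
```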

If truncation is disabled:

tokenizer(x, truncation=False, return_special_tokens_mask=True)

The result looks as expected:

{'input_ids': [2, 502, 577, 23750, 649, 511, 1136, 29823, 13, 59, 1253, 18, 4497, 659, 762, 7, 18, 3704, 2041, 3048, 1064, 26668, 534, 492, 1005, 13, 12799, 4281, 32, 555, 8, 4265, 2603, 545, 3740, 511, 843, 1970, 178, 8021, 18, 19251, 362, 510, 6681, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}

De-tokenize (decode back to text):

y = tokenizer(x, truncation=False, return_special_tokens_mask=True)
y = tokenizer.decode(y["input_ids"])
y

We can see directly that [CLS] and [SEP] were added, and in this example decoding is lossless (apart from lowercasing).

[CLS] in his foreword to john marriott's book, thunderbirds are go!, anderson put forward several explanations for the series'enduring popularity : it " contains elements that appeal to most children – danger, jeopardy and destruction [SEP]

4. Model Training

With the preprocessed pretraining corpus in place, model training can begin.

We use the Trainer class from transformers directly; the code is shown below:

from transformers import BertConfig, BertForMaskedLM
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

# Initialize the model from a config
model_config = BertConfig(vocab_size=vocab_size, 
                          max_position_embeddings=max_length)
model = BertForMaskedLM(config=model_config)
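
# Optional sanity check (added sketch, not in the original post): report the size
# of the freshly initialized model.
num_params = sum(p.numel() for p in model.parameters())
print(f"Initialized BERT with {num_params / 1e6:.1f}M parameters")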

# Initialize BERT's data collator: randomly mask 20% of tokens (the default is 15%)
# Masked Language Modeling (MLM): predict the original token at each [MASK] position
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
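
# Optional sanity check (added sketch, not in the original notebook): inspect one
# collated batch -- roughly 20% of positions are masked, and `labels` keeps the
# original ids at masked positions (-100 everywhere else).
sample_batch = data_collator([train_dataset[i] for i in range(2)])
print(sample_batch["input_ids"].shape)   # torch.Size([2, 512])
print(sample_batch["labels"][0][:20])    # mostly -100, original ids where masked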

# Training arguments
training_args = TrainingArguments(
  output_dir=model_path,          # output directory to where save model checkpoint
  evaluation_strategy="steps",    # evaluate each `logging_steps` steps
  overwrite_output_dir=True,
  num_train_epochs=10,            # number of training epochs, feel free to tweak
  per_device_train_batch_size=10, # the training batch size, put it as high as your GPU memory fits
  gradient_accumulation_steps=8,  # accumulating the gradients before updating the weights
  per_device_eval_batch_size=64,  # evaluation batch size
  logging_steps=1000,             # evaluate, log and save model checkpoints every 1000 steps
  save_steps=1000,
  # load_best_model_at_end=True, # whether to load the best model (in terms of loss)
  # at the end of training
  # save_total_limit=3, # whether you don't have much space so you
  # let only 3 model weights saved in the disk
  )

trainer = Trainer(
  model=model,
  args=training_args,
  data_collator=data_collator,
  train_dataset=train_dataset,
  eval_dataset=test_dataset,
  )

# Train the model
trainer.train()

Training log:

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
  ········································
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Tracking run with wandb version 0.16.3
Run data is saved locally in /kaggle/working/wandb/run-20240303_135646-j2ki48iz
Syncing run crimson-microwave-1 to Weights & Biases (docs)
View project at https://wandb.ai/1506097817gm/huggingface
View run at https://wandb.ai/1506097817gm/huggingface/runs/j2ki48iz
 [ 19/670 1:27:00 < 55:32:12, 0.00 it/s, Epoch 0.27/10]

I first tried training this on a Kaggle CPU, but it took far too long.

When training on a GPU, switch to a larger dataset.

Once training starts, you should see output like the following:
 [ 19/670 1:27:00 < 55:32:12, 0.00 it/s, Epoch 0.27/10]
Step    Training Loss    Validation Loss
1000    6.904000         6.558231
2000    6.498800         6.401168
3000    6.362600         6.277831
4000    6.251000         6.172856
5000    6.155800         6.071129
6000    6.052800         5.942584
7000    5.834900         5.546123
8000    5.537200         5.248503
9000    5.272700         4.934949
10000   4.915900         4.549236
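
As a rough sanity check on these numbers, the evaluation loss can be converted into a (pseudo-)perplexity. This is a small added sketch, reusing the trainer object from above:

```python
import math

# Run a final evaluation and report loss and (pseudo-)perplexity for the MLM objective
eval_results = trainer.evaluate()
print(f"eval loss: {eval_results['eval_loss']:.4f}, "
      f"perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```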

5. Using the Model

5.1 Direct Inference

from transformers import pipeline

# Load the model weights from a checkpoint
model = BertForMaskedLM.from_pretrained(os.path.join(model_path, "checkpoint-10000"))

# Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_path)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Model inference
examples = [
    "Today's most trending hashtags on [MASK] is Donald Trump",
    "The [MASK] was cloudy yesterday, but today it's rainy.",
]
for example in examples:
  for prediction in fill_mask(example):
    print(f"{prediction['sequence']}, confidence: {prediction['score']}")
    print("="*50)

Running masked-token prediction (a cloze task 😺) gives output like the following:

today's most trending hashtags on twitter is donald trump, confidence: 0.1027069091796875
today's most trending hashtags on monday is donald trump, confidence: 0.09271949529647827
today's most trending hashtags on tuesday is donald trump, confidence: 0.08099588006734848
today's most trending hashtags on facebook is donald trump, confidence: 0.04266013577580452
today's most trending hashtags on wednesday is donald trump, confidence: 0.04120611026883125
==================================================
the weather was cloudy yesterday, but today it's rainy., confidence: 0.04445931687951088
the day was cloudy yesterday, but today it's rainy., confidence: 0.037249673157930374
the morning was cloudy yesterday, but today it's rainy., confidence: 0.023775646463036537
the weekend was cloudy yesterday, but today it's rainy., confidence: 0.022554103285074234
the storm was cloudy yesterday, but today it's rainy., confidence: 0.019406016916036606
==================================================

References

  • Based on the pretraining code in Section 2.2 of 《大規模語言模型:從理論到實踐》 (Large Language Models: From Theory to Practice), adapted and debugged until it runs.

Implementation code:

  • github: https://github.com/DayDreamChaser/llm-todo/blob/main/BERT預訓練指令碼.ipynb
  • kaggle: https://www.kaggle.com/code/leonyiyi/bert-pretraining-on-wikitext2/edit

ToDo

This BERT pretraining run only used the MLM objective (no NSP), similar to RoBERTa.

  • Fine-tune the pretrained model on various downstream tasks
  • Walk through the internals of the Trainer as used for BERT
  • Explain WordPiece and BPE in detail
  • Next up: GPT-2 pretraining and SFT
