[LLM Training Series] Training an LLM from Scratch: A Walkthrough of the Phi2-mini-Chinese Project

Published by LeonYi on 2024-09-09

1. Preface

After reproducing and experimenting with Phi2-mini-Chinese, this article briefly analyzes the project as a hands-on learning summary.

Originally published on Zhihu: https://zhuanlan.zhihu.com/p/718307193. Please credit the source when reposting.

About Phi2-mini-Chinese

Phi2-Chinese-0.2B: train your own small Chinese Phi2 model from scratch, with support for plugging into langchain to load a local knowledge base for retrieval-augmented generation (RAG). Training your own Phi2 small chat model from scratch.

Project start date: December 22, 2023
Repository: https://github.com/charent/Phi2-mini-Chinese

Pipeline Steps

  • Data processing
  • Tokenizer training
  • Pre-training
  • SFT
  • DPO

The data-processing step is omitted here; open-source datasets are generally used.

2. Tokenizer Training

It simply trains a BPE tokenizer with the tokenizers library; there is not much to say here. A rough sketch follows.
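
For reference, here is a minimal sketch of byte-level BPE training with the tokenizers library; the corpus path, vocab size, and special tokens below are illustrative assumptions, not the project's exact settings:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Byte-level BPE tokenizer (assumed setup; adjust to the project's actual config)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # illustrative value
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["./data/wiki_corpus.txt"], trainer=trainer)  # hypothetical corpus file

# Wrap and save so it can later be loaded with PreTrainedTokenizerFast.from_pretrained(...)
PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    pad_token="[PAD]", unk_token="[UNK]", bos_token="[BOS]", eos_token="[EOS]",
).save_pretrained("./model_save/tokenizer/")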

3. Pre-training Code

import os, platform, time
from typing import Optional
import numpy as np
import pandas as pd
from dataclasses import dataclass,field
from datasets import load_dataset, Dataset
import torch
from transformers.trainer_callback import TrainerControl, TrainerState
from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling, PhiConfig, PhiForCausalLM, Trainer, TrainingArguments, TrainerCallback

# Pre-training data (plain text only)
TRAIN_FILES = ['./data/wiki_chunk_320_2.2M.parquet',]
EVAL_FILE = './data/pretrain_eval_400_1w.parquet'

@dataclass
class PretrainArguments:
    tokenizer_dir: str = './model_save/tokenizer/'
    model_save_dir: str = './model_save/pre/'
    logs_dir: str = './logs/'
    train_files: list[str] = field(default_factory=lambda: TRAIN_FILES)
    eval_file: str = EVAL_FILE
    max_seq_len: int = 512
    # Use eager attention on Windows; elsewhere assume FlashAttention 2 is installed
    attn_implementation: str = 'eager' if platform.system() == 'Windows' else 'flash_attention_2'

pretrain_args = PretrainArguments()
# Load the trained tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(pretrain_args.tokenizer_dir)
# Round the vocabulary size up to a multiple of 64
vocab_size = len(tokenizer)
if vocab_size % 64 != 0:
    vocab_size = (vocab_size // 64 + 1) * 64
# If the vocabulary size is under 65535, store token ids as uint16 to save disk space; otherwise use uint32
map_dtype = np.uint16 if vocab_size < 65535 else np.uint32

def token_to_id(samples: dict[str, list]) -> dict:
    batch_txt = samples['text']
    outputs = tokenizer(batch_txt, truncation=False, padding=False, return_attention_mask=False)
    input_ids = [np.array(item, dtype=map_dtype) for item in outputs["input_ids"]]
    return {"input_ids": input_ids}

# Load the dataset
def get_maped_dataset(files: str|list[str]) -> Dataset:
    dataset = load_dataset(path='parquet', data_files=files, split='train', cache_dir='.cache')
    maped_dataset = dataset.map(token_to_id, batched=True, batch_size=1_0000, remove_columns=dataset.column_names)
    return maped_dataset

train_dataset = get_maped_dataset(pretrain_args.train_files)
eval_dataset = get_maped_dataset(pretrain_args.eval_file)
# Define the data_collator. `mlm=False` trains a CLM (causal) model; `mlm=True` would train an MLM model
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

phi_config = PhiConfig(
    vocab_size=vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    hidden_size=960,
    num_attention_heads=16,
    num_hidden_layers=24,
    max_position_embeddings=512,
    intermediate_size=4096,
    attn_implementation=pretrain_args.attn_implementation,
)
model = PhiForCausalLM(phi_config)
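
# `MyTrainerCallback` is not defined in this excerpt; below is a minimal stand-in matching
# the author's description (periodically empty the CUDA cache, like the EmptyCudaCacheCallback
# defined in the DPO section later in this post)
class MyTrainerCallback(TrainerCallback):
    log_cnt = 0
    def on_log(self, args, state, control, logs=None, **kwargs):
        self.log_cnt += 1
        if self.log_cnt % 5 == 0:
            torch.cuda.empty_cache()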

# Define training arguments
my_trainer_callback = MyTrainerCallback()  # callback that empties the CUDA cache
args = TrainingArguments(
    output_dir=pretrain_args.model_save_dir, per_device_train_batch_size=4,
    gradient_accumulation_steps=32, num_train_epochs=4, weight_decay=0.1, 
    warmup_steps=1000, learning_rate=5e-4, evaluation_strategy='steps',
    eval_steps=2000, save_steps=2000, save_strategy='steps', save_total_limit=3,
    report_to='tensorboard', optim="adafactor", bf16=True, logging_steps=5,
    log_level='info', logging_first_step=True,
)
trainer = Trainer(model=model, tokenizer=tokenizer,args=args,
    data_collator=data_collator, train_dataset=train_dataset,
    eval_dataset=eval_dataset, callbacks=[my_trainer_callback],
)
trainer.train()
trainer.save_model(pretrain_args.model_save_dir)

This code is much the same as any training script that uses the Transformers Trainer; the main differences are the tokenizer and the CausalLM model.

Change PhiConfig and PhiForCausalLM to:

from transformers import LlamaConfig as PhiConfig
from transformers import LlamaForCausalLM as PhiForCausalLM

or:

from transformers import Qwen2Config as PhiConfig
from transformers import Qwen2ForCausalLM as PhiForCausalLM

and it quite casually becomes a simple pre-training run for a different model.

As for how the training batches are constructed: DataCollatorForLanguageModeling (with mlm=False) pads each batch and copies input_ids into labels, setting the padded positions to -100 so they are ignored by the loss; the shift-by-one for next-token prediction happens inside the model's forward pass.
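
A minimal sketch of that behavior (illustrative only; `data_collator` is the one created in the pre-training script above, and this assumes the tokenizer defines a [PAD] token):

# Two made-up samples of different lengths
features = [{"input_ids": [5, 6, 7, 8]}, {"input_ids": [5, 6]}]
batch = data_collator(features)
print(batch["input_ids"])  # ids padded to a common length
print(batch["labels"])     # copy of input_ids with padded positions replaced by -100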

Note: load_dataset in get_maped_dataset does not set num_proc, which makes loading slow; set it to the number of CPU cores.

This part of the code is largely the same as an earlier article of mine on training GPT-2 with the transformers library:
https://zhuanlan.zhihu.com/p/685851459

4. SFT Code

Basically the same as pre-training; the only difference is that the labels are set from the output (response) part.

import time
import pandas as pd
import numpy as np
import torch
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast, PhiForCausalLM, TrainingArguments, Trainer, TrainerCallback
from trl import DataCollatorForCompletionOnlyLM

# 1. Define the training data, tokenizer, and pre-trained model paths, plus the maximum length
sft_file = './data/sft_train_data.parquet'
tokenizer_dir = './model_save/tokenizer/'
sft_from_checkpoint_file = './model_save/pre/'
model_save_dir = './model_save/sft/'
max_seq_len = 512

# 2. Load the training dataset
dataset = load_dataset(path='parquet', data_files=sft_file, split='train', cache_dir='.cache')
tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)
print(f"vicab size: {len(tokenizer)}")

# ## 2.1 Define the instruction strings for the SFT data_collator
# You can also manually add `instruction_template_ids` and `response_template_ids` to input_ids, because a byte-level tokenizer may merge `:` with the following character, making `instruction_template_ids` and `response_template_ids` impossible to locate.
# This can also be solved, as done below, by manually adding `'\n'` before and after `'#'` and `':'`.

# %%
instruction_template = "##提問:"
response_template = "##回答:"

map_dtype = np.uint16 if len(tokenizer) < 65535 else np.uint32

def batched_formatting_prompts_func(example: list[dict]) -> list[str]:
    batch_txt = []
    for i in range(len(example['instruction'])):
        text = f"{instruction_template}\n{example['instruction'][i]}\n{response_template}\n{example['output'][i]}[EOS]"
        batch_txt.append(text)

    outputs = tokenizer(batch_txt, return_attention_mask=False)
    input_ids = [np.array(item, dtype=map_dtype) for item in outputs["input_ids"]]
    return {"input_ids": input_ids}

dataset = dataset.map(batched_formatting_prompts_func, batched=True, 
                        remove_columns=dataset.column_names).shuffle(23333)

# 2.2 Define the data_collator
# 
data_collator = DataCollatorForCompletionOnlyLM(
  instruction_template=instruction_template, 
  response_template=response_template, 
  tokenizer=tokenizer, 
  mlm=False
)
empty_cuda_cahce = EmptyCudaCacheCallback()  # training callback (the class is defined in the DPO section below)
my_datasets =  dataset.train_test_split(test_size=4096)

# 5. Define training arguments
model = PhiForCausalLM.from_pretrained(sft_from_checkpoint_file)
args = TrainingArguments(
    output_dir=model_save_dir, per_device_train_batch_size=8, gradient_accumulation_steps=8,
    num_train_epochs=3, weight_decay=0.1, warmup_steps=1000, learning_rate=5e-5,
    evaluation_strategy='steps', eval_steps=2000, save_steps=2000, save_total_limit=3,
    report_to='tensorboard', optim="adafactor", bf16=True, logging_steps=10,
    log_level='info', logging_first_step=True, group_by_length=True,
)
trainer = Trainer(
    model=model, tokenizer=tokenizer, args=args, 
    data_collator=data_collator,
    train_dataset=my_datasets['train'], 
    eval_dataset=my_datasets['test'],
    callbacks=[empty_cuda_cahce],
)
trainer.train()
trainer.save_model(model_save_dir)

In short, although the scripts all look alike, the real details are hidden inside the implementations of
DataCollatorForLanguageModeling, Trainer, the tokenizer, and the CausalLM model.
The even lower level sits in the PyTorch implementation, but analyzing framework internals is generally out of scope here.

Compared with the SFT example in Hugging Face's trl library, the only difference is that the plain Trainer is still used here:
https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completions-only

Here, DataCollatorForCompletionOnlyLM automatically constructs the samples for completion-only instruction tuning:

You can use the DataCollatorForCompletionOnlyLM to train your model on the generated prompts only. Note that this works only in the case when packing=False.

For instruction-tuning data, instantiate a data collator and pass in a response template and the tokenizer.

Internally, it locates the response portion of the token ids and marks only those tokens as prediction labels, as sketched below.
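
A rough sketch of the effect (illustrative only; it reuses the `tokenizer`, templates, and `data_collator` from the SFT script above, and assumes the tokenizer defines a [PAD] token):

# One made-up SFT sample in the project's "##提問:/##回答:" format
text = f"{instruction_template}\n今天天氣如何?\n{response_template}\n天氣晴朗。[EOS]"
example = tokenizer(text, return_attention_mask=False)
batch = data_collator([{"input_ids": example["input_ids"]}])
# batch["labels"] is -100 for the prompt and the response template itself,
# and keeps the token ids only for the answer span after "##回答:"
print(batch["labels"])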

Below is Hugging Face's official example of instruction tuning with DataCollatorForCompletionOnlyLM + SFTTrainer:

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

instruction_template = "### Human:"
response_template = "### Assistant:"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)

trainer = SFTTrainer(
    model,
    args=SFTConfig(
        output_dir="/tmp",
        dataset_text_field = "text",
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

As for the difference between Trainer and SFTTrainer, it does not feel significant:
https://medium.com/@sujathamudadla1213/difference-between-trainer-class-and-sfttrainer-supervised-fine-tuning-trainer-in-hugging-face-d295344d73f7

5. DPO Code

import time
import pandas as pd
from typing import List, Optional, Dict
from dataclasses import dataclass, field
import torch 
from trl import DPOTrainer
from transformers import PreTrainedTokenizerFast, PhiForCausalLM, TrainingArguments, TrainerCallback
from datasets import load_dataset

# 1. Define the SFT model path and the DPO data
dpo_file = './data/dpo_train_data.json'
tokenizer_dir = './model_save/tokenizer/'
sft_from_checkpoint_file = './model_save/sft/'
model_save_dir = './model_save/dpo/'
max_seq_len = 320

# 2. Load the dataset

# Token-format the dataset
# DPO data format: [prompt (model input), chosen (positive example), rejected (negative example)]
# Append the `eos` token to all three columns of the DPO dataset; the `bos` token is optional
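# An illustrative example of a single record in dpo_train_data.json (field names
# match the code below; the content itself is made up):
# {"prompt": "什麼是機器學習?", "chosen": "機器學習是讓電腦從資料中自動學習規律的方法。", "rejected": "不知道。"}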
def split_prompt_and_responses(samples: dict[str, str]) -> Dict[str, str]:
    prompts, chosens, rejects = [], [], []
    batch_size = len(samples['prompt'])
    for i in range(batch_size):
        # add an eos token for signal that end of sentence, using in generate.
        prompts.append(f"[BOS]{samples['prompt'][i]}[EOS]")
        chosens.append(f"[BOS]{samples['chosen'][i]}[EOS]")
        rejects.append(f"[BOS]{samples['rejected'][i]}[EOS]")
    return {'prompt': prompts, 'chosen': chosens, 'rejected':rejects,}

tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)
dataset = load_dataset(path='json', data_files=dpo_file, split='train', cache_dir='.cache')
dataset = dataset.map(split_prompt_and_responses, batched=True,).shuffle(2333)

# 4. Load the models
# `model` and `model_ref` start out as the same model; only `model`'s parameters are trained, while `model_ref`'s parameters stay frozen
model = PhiForCausalLM.from_pretrained(sft_from_checkpoint_file)
model_ref = PhiForCausalLM.from_pretrained(sft_from_checkpoint_file)

# 5. Define a callback used during training
# Empties the CUDA cache. DPO loads two models and uses a lot of GPU memory; this helps mitigate the slow GPU-memory growth on low-VRAM machines
class EmptyCudaCacheCallback(TrainerCallback):
    log_cnt = 0
    def on_log(self, args, state, control, logs=None, **kwargs):
        self.log_cnt += 1
        if self.log_cnt % 5 == 0:
            torch.cuda.empty_cache()
            
empty_cuda_cahce = EmptyCudaCacheCallback()

# Training arguments
args = TrainingArguments(
    output_dir=model_save_dir, per_device_train_batch_size=2, gradient_accumulation_steps=16,
    num_train_epochs=4, weight_decay=0.1, warmup_steps=1000, learning_rate=2e-5,
    save_steps=2000, save_total_limit=3, report_to='tensorboard', bf16=True,
    logging_steps=10, log_level='info', logging_first_step=True, optim="adafactor",
    remove_unused_columns=False, group_by_length=True,
)
trainer = DPOTrainer(
    model, model_ref, args=args, beta=0.1,
    train_dataset=dataset,tokenizer=tokenizer, callbacks=[empty_cuda_cahce],
    max_length=max_seq_len * 2 + 16, # 16 for eos bos
    max_prompt_length=max_seq_len,
)
trainer.train()
trainer.save_model(model_save_dir)
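
For reference, `beta=0.1` is the temperature of the DPO objective: the larger it is, the more the policy is penalized for drifting away from the reference model. Below is a minimal sketch of the standard sigmoid DPO loss (the default loss type in trl's DPOTrainer; the function is my own illustration, not the project's code), given per-sequence log-probabilities from the policy and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy vs. reference on the chosen and rejected responses
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin); minimizing this pushes the policy to prefer `chosen` over `rejected`
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()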

6. Closing Thoughts

Going Deeper

Training code that uses the transformers Trainer or the trl library all looks pretty much the same, because both transformers and trl are very well encapsulated.

If you want to dig a bit deeper into the details, I recommend reading or debugging the following repositories. Both are implemented directly in PyTorch:

  • https://github.com/DLLXW/baby-llama2-chinese/tree/main
  • https://github.com/jzhang38/TinyLlama/blob/main/pretrain/tinyllama.py

Improvements

The project below is based on Phi2-mini-Chinese; it mainly swaps phi2 for qwen and uses qwen's tokenizer directly:
https://github.com/jiahe7ay/MINI_LLM/

I also tried pulling the qwen2 implementation out of the transformers library for training;
in practice it is no different from using the Qwen2 model classes in transformers directly.

If you are interested, you can swap in any mainstream model and adjust the configuration; it all works much the same way.
This code is mainly for learning purposes. With a little time and a GPU or two, it takes little effort to gather some data and reproduce the whole pipeline.

That said, training a small-scale LLM that actually performs reasonably well is not easy.

If you need to cite this article, please use:

LeonYi. (Aug. 25, 2024). 《【LLM訓練系列】從零開始訓練大模型之Phi2-mini-Chinese專案解讀》.

@online{title={【LLM訓練系列】從零開始訓練大模型之Phi2-mini-Chinese專案解讀},
author={LeonYi},
year={2024},
month={Sep},
url={https://www.cnblogs.com/justLittleStar/p/18405618},
}
