[Python First Aid Station] Building a Student AIGC Training Model with GPT-2 Based on Transformer Models

Posted by Jinylin on 2024-04-29

To learn about AIGC, I put together a student AIGC training project that fine-tunes GPT-2 on top of Transformer models, with the goal of learning AI programming by training a model myself.

Before we start coding, we need to prepare a few files:

First, press Win+R to open the Run dialog and type PowerShell to open a terminal.

Then run:

pip install -U huggingface_hub

Once the install finishes, point it at the mirror by setting our environment variable:

$env:HF_ENDPOINT = "https://hf-mirror.com"
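
If you prefer to set this from inside Python rather than PowerShell, a minimal sketch looks like the following (it assumes the variable is set before huggingface_hub is imported, otherwise the mirror setting may not take effect):

import os

# Point huggingface_hub at the mirror; do this before importing huggingface_hub
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"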

Then download the model:

huggingface-cli download --resume-download gpt2 --local-dir "D:\Pythonxiangmu\PythonandAI\Transformer Models\gpt-2"

Here the path is the project directory I want the download to go into.

Then download the dataset:

huggingface-cli download --repo-type dataset --resume-download wikitext --local-dir "D:\Pythonxiangmu\PythonandAI\Transformer Models\gpt-2"

Again, the path here is the project directory I want the download to go into.

So remember to change both paths to your own project directory (I suggest creating a folder named gpt-2 and putting everything in it).
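
If you would rather skip the CLI, roughly the same downloads can be done from Python with huggingface_hub's snapshot_download. This is only a sketch, and the local_dir values are placeholders you should replace with your own project directory:

from huggingface_hub import snapshot_download

# Download the gpt2 model weights (equivalent to the first CLI command)
snapshot_download(repo_id="gpt2",
                  local_dir="D:/Pythonxiangmu/PythonandAI/Transformer Models/gpt-2")

# Download the wikitext dataset (equivalent to the second CLI command)
snapshot_download(repo_id="wikitext", repo_type="dataset",
                  local_dir="D:/Pythonxiangmu/PythonandAI/Transformer Models/gpt-2")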

Once everything has finished downloading in PowerShell, we can start on our code.

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    get_linear_schedule_with_warmup,
    set_seed,
)
from torch.optim import AdamW  # transformers' AdamW is deprecated; use torch's

# Set a random seed so the results are reproducible
set_seed(42)


class TextDataset(Dataset):
    def __init__(self, tokenizer, texts, block_size=128):
        self.tokenizer = tokenizer
        self.examples = [
            self.tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=block_size) for
            text
            in texts]
        # After the tokenizer is initialized, make sure unk_token is set
        print(f"Tokenizer's unk_token: {self.tokenizer.unk_token}, unk_token_id: {self.tokenizer.unk_token_id}")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        item = self.examples[i]
        # Replace any token id the model's embedding cannot handle with unk_token_id
        # (len(self.tokenizer) counts added special tokens such as the new pad token)
        for key in item.keys():
            item[key] = torch.where(item[key] >= len(self.tokenizer), self.tokenizer.unk_token_id, item[key])
        return item


def train(model, dataloader, optimizer, scheduler, device, tokenizer):
    model.train()
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        # Log a warning if any input id falls outside the model's vocabulary
        if torch.any(input_ids >= model.config.vocab_size):
            print("Warning: Some input IDs are outside the model's vocabulary.")
            print(f"Max input ID: {input_ids.max()}, Vocabulary Size: {model.config.vocab_size}")

        attention_mask = batch['attention_mask'].to(device)
        labels = input_ids.clone()
        labels[labels[:, :] == tokenizer.pad_token_id] = -100

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()


def main():
    local_model_path = "D:/Pythonxiangmu/PythonandAI/Transformer Models/gpt-2"
    tokenizer = AutoTokenizer.from_pretrained(local_model_path)

    # GPT-2 does not ship with a pad_token, so add one if it is missing
    if tokenizer.pad_token is None:
        special_tokens_dict = {'pad_token': '[PAD]'}
        tokenizer.add_special_tokens(special_tokens_dict)
        model = AutoModelForCausalLM.from_pretrained(local_model_path, pad_token_id=tokenizer.pad_token_id)
        # Resize the embeddings so the newly added pad token gets an embedding row
        model.resize_token_embeddings(len(tokenizer))
    else:
        model = AutoModelForCausalLM.from_pretrained(local_model_path)

    model.to(device)

    train_texts = [
        "The quick brown fox jumps over the lazy dog.",
        "In the midst of chaos, there is also opportunity.",
        "To be or not to be, that is the question.",
        "Artificial intelligence will reshape our future.",
        "Every day is a new opportunity to learn something.",
        "Python programming enhances problem-solving skills.",
        "The night sky sparkles with countless stars.",
        "Music is the universal language of mankind.",
        "Exploring the depths of the ocean reveals hidden wonders.",
        "A healthy mind resides in a healthy body.",
        "Sustainability is key for our planet's survival.",
        "Laughter is the shortest distance between two people.",
        "Virtual reality opens doors to immersive experiences.",
        "The early morning sun brings hope and vitality.",
        "Books are portals to different worlds and minds.",
        "Innovation distinguishes between a leader and a follower.",
        "Nature's beauty can be found in the simplest things.",
        "Continuous learning fuels personal growth.",
        "The internet connects the world like never before."
        # more training texts...
    ]

    dataset = TextDataset(tokenizer, train_texts, block_size=128)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

    optimizer = AdamW(model.parameters(), lr=5e-5)
    total_steps = len(dataloader) * 5  # assuming 5 training epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

    for epoch in range(5):  # train for 5 epochs
        train(model, dataloader, optimizer, scheduler, device, tokenizer)

    # Save the fine-tuned model and tokenizer
    model.save_pretrained("path/to/save/fine-tuned_model")
    tokenizer.save_pretrained("path/to/save/fine-tuned_tokenizer")


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    main()
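
Once training finishes, you will probably want to see what the model produces. Below is a minimal generation sketch; it assumes the placeholder save paths from main() above and a short made-up prompt:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the fine-tuned model and tokenizer saved by main() (placeholder paths)
tokenizer = AutoTokenizer.from_pretrained("path/to/save/fine-tuned_tokenizer")
model = AutoModelForCausalLM.from_pretrained("path/to/save/fine-tuned_model")
model.eval()

prompt = "Artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))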

This code only trains for 5 epochs on a handful of example sentences; remember to change the paths to your own, then just run it.

If you run into any problems, feel free to ask in the comments or email me at: linyuanda@linyuanda.com
