Extended Guide: Instruction-tune Llama 2

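This blog post is an extended guide on instruction-tuning Llama 2 from Meta AI. It focuses on creating the instruction dataset, which we can then use to fine-tune the Llama 2 base model to follow our own instructions.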

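The goal is to create a model that can generate instructions from a given input. The idea is that others can then use this model to create instruction data from their own inputs. That is especially helpful if you want to personalize a model for tasks such as tweeting or writing emails: you can generate an instruction dataset from your own emails and then train a model to write emails for you.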

Okay, let's get started. We will:

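Define the use case and create a prompt template for instructions
Create an instruction dataset
Instruction-tune Llama 2 using trl and the SFTTrainer
Test the model and run inference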

1. Define the use case and create a prompt template for instructions

Before we describe our use case, we need to better understand what an instruction actually is.

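An instruction is a piece of text or prompt given to an LLM, such as Llama, GPT-4, or Claude, to guide it in generating a response. Instructions let people steer the conversation and constrain the model's output to be more natural and useful, and they help align that output with the user's goals. Writing clear, well-defined instructions is key to generating high-quality conversations.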

Examples of instructions are listed in the table below.

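Capability	Example instruction
Brainstorming	Provide a diverse set of creative ideas for new flavors of ice cream.
Classification	Categorize these movies as either comedy, drama, or horror based on the plot summary.
Closed QA	Answer the question "What is the capital of France?" with a single word.
Generation	Write a poem in the style of Robert Frost about nature and the changing seasons.
Information extraction	Extract the names of the main characters from this short story.
Open QA	Why do leaves change color in autumn? Explain the scientific reasons.
Summarization	Summarize this article on recent advancements in renewable energy in 2-3 sentences.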

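As described at the beginning, we want to fine-tune a model so that it generates instructions based on inputs (or outputs). We want to use this as a way to create synthetic datasets for personalizing LLMs and agents.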

Converting this idea into a basic prompt template that follows the Alpaca format gives:

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 

### Input:
Dear [boss name],

I'm writing to request next week, August 1st through August 4th,
off as paid time off.

I have some personal matters to attend to that week that require 
me to be out of the office. I wanted to give you as much advance 
notice as possible so you can plan accordingly while I am away.

Please let me know if you need any additional information from me 
or have any concerns with me taking next week off. I appreciate you 
considering this request.

Thank you, [Your name]

### Response:
Write an email to my boss that I need next week 08/01 - 08/04 off.
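
The template above can be captured in a small helper. This is a minimal sketch rather than code from the original post: `build_prompt`, `INSTRUCTION`, and the shortened sample email are hypothetical names and values chosen for illustration. With a response, it renders a complete training example; without one, it stops after the `### Response:` header so that a model can fill in the instruction.

```python
INSTRUCTION = (
    "Use the Input below to create an instruction, "
    "which could have been used to generate the input using an LLM."
)


def build_prompt(input_text, response=None):
    """Render one Alpaca-style prompt (hypothetical helper).

    With `response` set, the result is a complete training example.
    Without it, the prompt stops after the '### Response:' header so the
    model can generate the instruction itself.
    """
    prompt = (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n"
    )
    if response is not None:
        prompt += f"{response}\n"
    return prompt


# Shortened, made-up email used only to show the rendered template.
email = "Dear [boss name],\n\nI'm writing to request next week off.\n\nThank you, [Your name]"
print(build_prompt(email, "Write an email to my boss that I need next week off."))
```

Keeping the training and inference variants in one function makes it harder for the two prompts to drift apart, which matters because the model only sees this exact layout during fine-tuning.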

2. Create an instruction dataset

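After defining our use case and prompt template, we need to create our instruction dataset. Creating a high-quality instruction dataset is key to a well-performing model. The paper "Less Is More for Alignment" (LIMA) suggests that a small, high-quality dataset (around 1,000 samples) can achieve the same performance as a larger, lower-quality one.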

There are several ways to create an instruction dataset, including:

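Using an existing dataset and converting it into an instruction dataset, e.g., FLAN
Using an existing LLM to create a synthetic instruction dataset, e.g., Alpaca
Using humans to create an instruction dataset, e.g., Dolly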

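Each method has its own advantages and disadvantages, and the right choice depends on budget, time, and quality requirements. For example, using an existing dataset is the easiest option but may not fit your specific use case, while using humans may be the most accurate but is time-consuming and expensive. It is also possible to combine several methods to create an instruction dataset, as shown in Orca: Progressive Learning from Complex Explanation Traces of GPT-4.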

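To keep it simple, we are going to use Dolly, an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.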

Let's start coding. First, we install our dependencies.

!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" --upgrade

To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset
from random import randrange

# Load the dataset from the Hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011

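To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, format_instruction, that takes a sample and returns a string following our instruction format.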

def format_instruction(sample):
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""

Let's test our formatting function on a random example.

from random import randrange

print(format_instruction(dataset[randrange(len(dataset))]))

3. Instruction-tune Llama 2 using `trl` and the `SFTTrainer`

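We will use the method recently introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a technique for reducing the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is: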

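Quantize the pre-trained model to 4 bits and freeze it.
Attach small, trainable adapter layers (LoRA).
Fine-tune only the adapter layers, while using the frozen quantized model for context.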
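
To get a feel for why these steps save memory, the arithmetic can be sketched as follows. The numbers are assumptions chosen for illustration: a nominal 7 billion parameters for a "7B" model, and a single 4096 x 4096 projection matrix (the hidden size of Llama 2 7B) with a LoRA rank of 64. The estimate covers the weights only and ignores quantization constants, activations, and optimizer state, so real memory usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(d_in, d_out, r):
    """Trainable parameters of one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


n_params = 7_000_000_000  # nominal size of a "7B" model (assumption)
print(f"fp16 weights : {weight_memory_gb(n_params, 16):.1f} GB")
print(f"4-bit weights: {weight_memory_gb(n_params, 4):.1f} GB")

d = 4096  # hidden size of Llama 2 7B
full = d * d
adapter = lora_params(d, d, r=64)
print(f"full matrix  : {full:,} parameters")
print(f"LoRA adapter : {adapter:,} parameters ({adapter / full:.1%} of the matrix)")
```

The frozen base weights shrink by roughly a factor of four compared with 16-bit storage, and gradients and optimizer state are only needed for the small adapter matrices.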

If you want to learn more about QLoRA and how it works, I recommend reading the blog post Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA.

Flash Attention

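Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to speed it up significantly and to reduce memory usage from quadratic to linear in sequence length. It is based on the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness".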
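
The tiling idea can be illustrated with a toy example. The sketch below is not the FlashAttention kernel, which is a fused GPU implementation that also tiles over queries and recomputes values in the backward pass. It only shows, for a single query in pure Python, how attention can be computed one block of keys and values at a time with a running softmax, so that the full row of scores is never stored. All function names and the toy vectors are made up for illustration.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: softmax(q . K^T / sqrt(d)) @ V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time.

    Only a running maximum, a running normalizer and a running output
    vector are kept, so the full score row is never stored.
    """
    d = len(q)
    running_max = float("-inf")
    running_sum = 0.0
    out = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        out = [o * scale for o in out]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            out = [o + w * vj for o, vj in zip(out, v)]
        running_max = new_max
    return [o / running_sum for o in out]


# Toy data: one query, five key/value pairs.
q = [0.1, -0.4, 0.7]
keys = [[0.2, 0.1, -0.3], [1.0, 0.5, 0.2], [-0.6, 0.9, 0.4], [0.3, -0.2, 0.8], [0.0, 0.7, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(ref)
print(tiled)
```

Both functions return the same output up to floating-point rounding, which is the sense in which this style of attention is exact rather than an approximation.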

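The TL;DR is that it accelerates training by up to 3x. Learn more at FlashAttention. Flash Attention is currently only available for Ampere (A10, A40, A100, …) and Hopper (H100, …) GPUs. You can check whether your GPU is supported and install it using the commands below.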

Note: If your machine has less than 96GB of RAM and many CPU cores, reduce the number of MAX_JOBS. On the g5.2xlarge we used 4.

python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"
pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation

_Installing flash attention can take quite a while (10-45 minutes)._

The example supports Flash Attention for all Llama checkpoints, but it is not enabled by default. To enable it, uncomment the block marked # COMMENT IN TO USE FLASH ATTENTION in the code below.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

use_flash_attention = False

# COMMENT IN TO USE FLASH ATTENTION
# replace attention with flash attention 
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print("Using flash attention")
#     replace_attn_with_flash_attn()
#     use_flash_attention = True


# Hugging Face model id
model_id = "NousResearch/Llama-2-7b-hf" # non-gated
# model_id = "meta-llama/Llama-2-7b-hf" # gated


# BitsAndBytesConfig int-4 config 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1 

# Validate that the model is using flash attention by comparing docstrings
if use_flash_attention:
    from utils.llama_patch import forward    
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"


tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

SFTTrainer integrates natively with peft, which makes it very easy to instruction-tune LLMs efficiently. We only need to create our LoraConfig and provide it to the trainer.

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on the QLoRA paper
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM", 
)


# Prepare the model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

Before we can start training, we need to define the hyperparameters (TrainingArguments) we want to use.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-7-int4-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=6 if use_flash_attention else 4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True # turn off the tqdm progress bar
)
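
The batch-related arguments interact, so it can help to work out how many sequences feed into each optimizer step. The following is a small sketch under stated assumptions: a single GPU (the g5.2xlarge used in this post has one) and the maximum sequence length of 2048 that this post passes to the trainer. `effective_batch_size` is a hypothetical helper, not part of transformers.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


max_seq_length = 2048  # value passed to the trainer in this post
for label, per_device in [("without flash attention", 4), ("with flash attention", 6)]:
    sequences = effective_batch_size(per_device, gradient_accumulation_steps=2)
    print(f"{label}: {sequences} sequences, up to {sequences * max_seq_length:,} tokens per step")
```

Gradient accumulation trades wall-clock time for memory: the per-device batch stays small enough to fit on the GPU, while the optimizer still sees the gradient of a larger batch.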

We now have every building block we need to create our SFTTrainer and start training our model.

from trl import SFTTrainer

max_seq_length = 2048 # maximum sequence length, also used when packing the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction, 
    args=args,
)
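
The packing=True option asks the trainer to pack several short examples into fixed-length sequences instead of padding each one. The sketch below illustrates the general idea with plain lists of token ids. It is a simplification and not trl's implementation: `pack_sequences` is a hypothetical function, the token ids are made up, and 0 stands in for the EOS token.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (each followed by EOS) and cut fixed-size chunks.

    Tokens left over at the end that do not fill a whole chunk are dropped.
    """
    buffer = []
    for token_ids in tokenized_examples:
        buffer.extend(token_ids)
        buffer.append(eos_token_id)
    return [
        buffer[i:i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]


# Toy token ids; 0 plays the role of the EOS token (assumption).
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
packed = pack_sequences(examples, seq_length=4, eos_token_id=0)
for chunk in packed:
    print(chunk)
```

Every packed sequence is completely filled with real tokens, so no compute is spent on padding. The price is that an example can be split across two sequences.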

Start training our model by calling the train() method on our Trainer instance.

# Train
trainer.train() # no progress bar is shown because tqdm is disabled

# Save the model
trainer.save_model()

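Training without Flash Attention took 03:08:00 on a g5.2xlarge. The instance costs $1.212/h, which brings the total cost to about $3.8.

Training with Flash Attention took 02:08:00 on a g5.2xlarge. The instance costs $1.212/h, which brings the total cost to about $2.6.

The results with Flash Attention are satisfying: training was about 1.5x faster and roughly 30% cheaper.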
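
These figures can be checked from the two durations and the hourly price quoted above. The sketch below reads the price as 1.212 USD per hour, which is an assumption about the decimal notation. Because the figures in the text are rounded, the computed values may differ from them slightly.

```python
def training_cost(hours, minutes, price_per_hour):
    """Cost of a run billed at a flat hourly rate."""
    return (hours + minutes / 60) * price_per_hour


PRICE = 1.212  # USD per hour for g5.2xlarge, as quoted in this post

baseline_minutes = 3 * 60 + 8   # 03:08:00 without Flash Attention
flash_minutes = 2 * 60 + 8      # 02:08:00 with Flash Attention

baseline_cost = training_cost(3, 8, PRICE)
flash_cost = training_cost(2, 8, PRICE)

print(f"baseline cost : ${baseline_cost:.2f}")
print(f"flash cost    : ${flash_cost:.2f}")
print(f"speed-up      : {baseline_minutes / flash_minutes:.2f}x")
print(f"cost reduction: {1 - flash_cost / baseline_cost:.0%}")
```

Since the instance is billed by time, the relative cost saving follows directly from the shorter training run.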

4. Test the model and run inference

After training is done, we want to run and test our model. We will use peft and transformers to load our LoRA adapter into the model.

if use_flash_attention:
    # Unpatch flash attention
    from utils.llama_patch import unplace_flash_attn_with_attn
    unplace_flash_attn_with_attn()
    
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer


args.output_dir = "llama-7-int4-dolly"

# Load the base LLM model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
) 
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)

Let's load the dataset again and pick a random sample to try to generate an instruction.

from datasets import load_dataset 
from random import randrange


# Load the dataset from the Hub and get a sample
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sample = dataset[randrange(len(dataset))]

prompt = f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 

### Input:
{sample['response']}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)

print(f"Prompt:\n{sample['response']}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"Ground truth:\n{sample['instruction']}")
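
The code above recovers the generated instruction by slicing off len(prompt) characters, which relies on the decoded text starting with exactly the prompt string. As a hedged alternative, the response could be located by its header instead. `extract_response` and `RESPONSE_MARKER` are hypothetical names, and the decoded string is a made-up example rather than real model output.

```python
RESPONSE_MARKER = "### Response:\n"


def extract_response(decoded_text, marker=RESPONSE_MARKER):
    """Return the text after the last response marker, stripped of whitespace.

    Falls back to the whole text if the marker is missing.
    """
    _, found, tail = decoded_text.rpartition(marker)
    return tail.strip() if found else decoded_text.strip()


# Made-up decoded output, used only to exercise the helper.
decoded = (
    "### Instruction:\nUse the Input below to create an instruction.\n\n"
    "### Input:\nParis is the capital of France.\n\n"
    "### Response:\nWhat is the capital of France?\n"
)
print(extract_response(decoded))
```

Searching for the last occurrence of the marker keeps the helper usable even when the input text itself happens to contain the same header.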

Nice! Our model works! If we want to speed up inference, we can deploy the model with Text Generation Inference. To do that, we first need to merge the adapter weights into the base model.

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
) 

# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# Push the merged model to the Hub
# merged_model.push_to_hub("user/repo")
# tokenizer.push_to_hub("user/repo")
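
Conceptually, merging folds the low-rank update into the base weight so that inference needs only one matrix multiplication per layer. The sketch below checks this on a toy example in pure Python. It is an illustration of the underlying arithmetic, W + (alpha / r) * B @ A, and not peft's implementation: the helper names and all matrix values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Fold the adapter into the base weight: W + (alpha / r) * B @ A."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy sizes: a 3x3 base weight with a rank-1 adapter (illustrative values).
w = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, -1.0, 1.0]]
lora_a = [[0.1, 0.2, 0.3]]        # shape (r, d_in)
lora_b = [[1.0], [0.0], [-2.0]]   # shape (d_out, r)
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Adapter kept separate: base output plus scaled low-rank update.
base_out = matvec(w, x)
update = matvec(lora_b, matvec(lora_a, x))
separate = [b + (alpha / r) * u for b, u in zip(base_out, update)]

# Adapter merged: a single matrix multiplication.
merged = matvec(merge_lora(w, lora_a, lora_b, alpha, r), x)

print(separate)
print(merged)
```

Both paths give the same output up to floating-point rounding, which is why a merged model can be served like any regular checkpoint without the adapter code path.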

Original author: Philschmid
Original link: https://www.philschmid.de/instruction-tune-llama-2
Translator: Xu Haoran