What kind of large model can a mobile RTX 3080 actually train? This article offers practical guidance for developers who want to run GRPO training when GPU resources are limited.
torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.90 GiB. GPU 0 has a total capacity of 15.73 GiB of which 1.28 GiB is free.
Including non-PyTorch memory, this process has 14.43 GiB memory in use. Of the allocated memory 11.82 GiB is allocated by PyTorch, and 2.41 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
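If a large "reserved but unallocated" number points to fragmentation, the allocator hint suggested in the error message can be applied before the GPU is ever touched. A minimal sketch; the variable must be set before any CUDA memory is allocated, so it goes at the very top of the training script (or on the command line):

import os

# Must be set before torch allocates any CUDA memory,
# so set it before importing torch / building the model.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch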
First, you can use an 8-bit version of an optimizer such as AdamW, which stores its tracking state far more compactly while still performing well, much like compressing a photo saves space while keeping most of the image quality. Second, use gradient checkpointing, which is like taking snapshots during training instead of recording everything; it slows training by roughly 20-30%, but it significantly reduces memory usage.
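Both switches can be set directly on the training configuration, since GRPOConfig inherits from transformers' TrainingArguments. The sketch below is illustrative rather than part of the final script: it assumes bitsandbytes is installed for the 8-bit optimizer, and exact option names can vary slightly between transformers versions.

from trl import GRPOConfig

# Sketch: memory-saving switches that can be merged into the full
# GRPOConfig shown later in this article. Assumes bitsandbytes is
# installed for the 8-bit AdamW variant.
training_args = GRPOConfig(
    output_dir="output",
    optim="adamw_8bit",                   # 8-bit optimizer states instead of 32-bit
    gradient_checkpointing=True,          # recompute activations during the backward pass
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=True,
)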
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer
import re
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets .*? span multi-line reasoning and answer blocks.
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def accuracy_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward function that extracts the answer from the xml tags and compares it to the correct answer."""
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def main():
    dataset = get_gsm8k_questions()
    model_name = "meta-llama/Llama-3.2-1B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map=None
    ).to("cuda")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    training_args = GRPOConfig(
        output_dir="output",
        learning_rate=5e-6,
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type='cosine',
        logging_steps=1,
        bf16=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=4,
        max_prompt_length=256,
        max_completion_length=786,
        num_train_epochs=1,
        save_steps=100,
        save_total_limit=1,
        max_grad_norm=0.1,
        log_on_each_node=False,
    )
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            format_reward_func,
            accuracy_reward_func,
        ],
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()
The settings that matter most for VRAM here:

- per_device_train_batch_size=1: because GRPO generates multiple responses per query, the batch size spirals out of control quickly.
- gradient_accumulation_steps=4: the optimizer is another big VRAM consumer. This parameter determines how many steps of gradients we accumulate before the optimizer takes its "hill-climbing" step.
- num_generations=4: the DeepSeekMath paper used 64, which is simply beyond some people's compute budget.
- max_prompt_length=256: if you want to train the model to reason over a larger context, you will have to budget more VRAM. GSM8K prompts are relatively short and fit this test.
- max_completion_length=786: likewise, the reasoning chain is capped here because the memory available for computing attention is limited. The more context or generated tokens, the more memory is needed.
- LoRA target_modules=["q_proj", "k_proj", "o_proj", "up_proj", "down_proj"]: you can try several different combinations here. target_modules="all-linear" is a popular way to squeeze the most performance (in terms of accuracy) out of your LoRA; see the sketch after this list.
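A minimal LoRA sketch, reusing the objects built in main() above and assuming the peft library is installed; the rank and alpha values (r=16, lora_alpha=32) are illustrative, not tuned, and GRPOTrainer applies the adapter when it is handed a peft_config:

from peft import LoraConfig

# Illustrative hyperparameters; r and lora_alpha are not tuned for GSM8K.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "o_proj", "up_proj", "down_proj"],
    # target_modules="all-linear" is the popular alternative mentioned above.
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward_func, accuracy_reward_func],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # only the adapter weights receive gradients
)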
A rough accounting of where the VRAM goes, per model parameter:

- Model parameters: 2 bytes each.
- Reference model parameters: 2 bytes each.
- Gradients: 2 bytes each.
- Optimizer states: 8 bytes each.
- 8-bit optimizer: 4 bytes each.
- PEFT: helps reduce the VRAM taken up by gradients.
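Plugging those per-parameter costs into the roughly 1.2B-parameter model used above gives a back-of-the-envelope estimate (activations, the generation-time KV cache, and CUDA overhead are not counted):

# Rough VRAM estimate from the per-parameter byte counts listed above.
params = 1.2e9  # Llama-3.2-1B-Instruct has roughly 1.2B parameters

bytes_full = params * (2 + 2 + 2 + 8)  # model + reference model + grads + 32-bit optimizer states
bytes_8bit = params * (2 + 2 + 2 + 4)  # same, but with an 8-bit optimizer

print(f"full fine-tune, 32-bit optimizer: {bytes_full / 1e9:.1f} GB")
print(f"full fine-tune, 8-bit optimizer:  {bytes_8bit / 1e9:.1f} GB")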