十天前的 Meta Connect 2024 大會上,開源領域迎來了可在邊緣和移動裝置上的執行的輕量級模型 Llama 3.2 1B 和 3B。兩個版本都是純文字模型,但也具備多語言文字生成和工具呼叫能力。Meta 表示,這些模型可讓開發者構建個性化的、在裝置本地上執行的通用應用 —— 這類應用將具備很強的隱私性,因為資料無需離開裝置。
近日,機器學習研究員 Sebastian Raschka 光速釋出長篇教程《Converting Llama 2 to Llama 3.2 From Scratch》。
博文連結:https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb
本文是《 Converting a From-Scratch GPT Architecture to Llama 2》的後續,更新的內容是如何將 Meta 的 Llama 2 架構模型逐步轉換為 Llama 3、Llama 3.1 和 Llama 3.2。為了避免不必要的冗長,本文特意將解釋部分縮至最短,並將重點放在主程式碼上。
機器之心對文章內容進行了不改變原意的編譯:
1 逐步轉換 Llama 模型實現
如果你是初次實施 LLM 架構,建議從《Build a Large Language Model From Scratch》(https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch04/01_main-chapter-code/ch04.ipynb)的第 4 章開始,那部分內容將逐步指導你實施原始 GPT 架構。
然後可參考《Converting a From-Scratch GPT Architecture to Llama 2》(https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb),將實現 Llama 特有的元件,如 RMSNorm 層、SiLU 和 SwiGLU 啟用、RoPE(旋轉位置嵌入)和 SentencePiece tokenizer。
本筆記本採用 Llama 2 架構,並透過以下方式將其轉換為 Llama 3 架構:
修改旋轉嵌入
實現分組查詢注意力
使用定製版的 GPT-4 tokenizer
隨後,我們將 Meta 共享的原始 Llama 3 權重載入到架構中:
1.1 複用 Llama 2 的元件
Llama 2 實際上與 Llama 3 非常相似,如上文所述和本文開頭的圖片所示。
這意味著我們可以使用以下程式碼從 Llama 2 筆記本中匯入多個構建模組:
import os
import sys
import io
import nbformat
import types
def import_from_notebook():
def import_definitions_from_notebook(fullname, names):
current_dir = os.getcwd()
path = os.path.join(current_dir, fullname + ".ipynb")
path = os.path.normpath(path)
# Load the notebook
if not os.path.exists(path):
raise FileNotFoundError(f"Notebook file not found at: {path}")
with io.open(path, "r", encoding="utf-8") as f:
nb = nbformat.read(f, as_version=4)
# Create a module to store the imported functions and classes
mod = types.ModuleType(fullname)
sys.modules[fullname] = mod
#
Go through the notebook cells and only execute function or class definitions
for cell in nb.cells:
if cell.cell_type == "code":
cell_code = cell.source
for name in names:
# Check for function or class definitions
if f"def {name}" in cell_code or f"class {name}" in cell_code:
exec(cell_code, mod.__dict__)
return mod
fullname = "converting-gpt-to-llama2"
names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]
return import_definitions_from_notebook(fullname, names)
imported_module = import_from_notebook()
# We need to redefine precompute_rope_params
# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)
compute_rope = getattr(imported_module, "compute_rope", None)
SiLU = getattr(imported_module, "SiLU", None)
FeedForward = getattr(imported_module, "FeedForward", None)
RMSNorm = getattr(imported_module, "RMSNorm", None)
# MultiHeadAttention only for comparison purposes
MultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)
1.2 修改後的 RoPE
Llama 3 使用的 RoPE 與 Llama 2 相似,可參閱 RoPE 論文(https://arxiv.org/abs/2104.09864)。
不過,二者 RoPE 設定有一些細微差別。Llama 3 現在支援多達 8192 個 token,是 Llama 2(4096)的兩倍。
RoPE 的基礎值(見下文公式),從 10000(Llama 2)增加到 50000(Llama 3),公式如下(改編自 RoPE 論文):
這些值是一組預定義的引數,用於確定旋轉矩陣中的旋轉角度,其中的維數是嵌入空間的維數。
將基數從 10000 增加到 50000,頻率(或旋轉角度)在各維度上的衰減速度會更慢,這意味著維度越高,角度越大(本質上,這是對頻率的解壓縮)。
此外,我們還在下面的程式碼中引入了一個 freq_config 部分,用於調整頻率;不過,在 Llama 3(只有 Llama 3.1 和 Llama 3.2)中並不需要它,所以稍後會重新訪問這個 freq_config(預設設定為「無」並被忽略)。
import torch
def precompute_rope_params(head_dim, theta_base=10000, context_length=4096, freq_config=None):
assert head_dim % 2 == 0, "Embedding dimension must be even"
# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
################################ NEW ###############################################
# Frequency adjustments
if freq_config is not None:
low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]
wavelen = 2 * torch.pi / inv_freq
inv_freq_llama = torch.where(
wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
)
smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
)
smoothed_inv_freq = (
(1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
)
is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
inv_freq = inv_freq_llama
####################################################################################
# Generate position indices
positions = torch.arange(context_length)
# Compute the angles
angles = positions[:, None] * inv_freq[None, :] # Shape: (context_length, head_dim // 2)
# Expand angles to match the head_dim
angles = torch.cat([angles, angles], dim=1) # Shape: (context_length, head_dim)
# Precompute sine and cosine
cos = torch.cos(angles)
sin = torch.sin(angles)
return cos, sin
總之,與 Llama 2 相比,Llama 3 的新功能是「上下文長度」和 theta 基底引數:
# Instantiate RoPE parameters
llama_2_context_len = 4096
llama_3_context_len = 8192
llama_2_theta_base = 10_000
llama_3_theta_base = 50_000
在 Llama 2 中,用法與以前相同:
# Settings
batch_size = 2
num_heads = 4
head_dim = 16
# Instantiate RoPE parameters
cos, sin = precompute_rope_params(
head_dim=head_dim,
theta_base=llama_3_theta_base,
context_length=llama_3_context_len
)
# Dummy query and key tensors
torch.manual_seed(123)
queries = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)
keys = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)
# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)
1.3 分組查詢注意力
本節將用一種名為分組查詢注意力(GQA)的替代機制來取代多頭注意力(MHA)。簡而言之,可以將 GQA 視為計算和引數效率更高的 MHA 版本。
在 GQA 中,透過在多個注意力頭之間共享來減少鍵和值投影的數量,每個注意力頭仍有其獨特的查詢,但這些查詢關注同一組鍵和值。
下面是具有 2 個 key-value 組的 GQA 示例:
GQA 的主要思想是減少與鍵值對相關的唯一查詢組的數量,從而在不顯著降低建模效能的情況下,減少 MHA 中某些矩陣乘法的大小和引數的數量。
簡而言之,GQA 的主要變化是每個查詢組都需要重複,以匹配與之相關的頭數量,具體實現如下:
import torch.nn as nn
class GroupedQueryAttention(nn.Module):
def __init__(
self, d_in, d_out, context_length, num_heads,
num_kv_groups, # NEW
rope_base=10_000, # NEW
rope_config=None, # NEW
dtype=None
):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads
############################# NEW #############################
# self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
# self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
self.num_kv_groups = num_kv_groups
self.group_size = num_heads // num_kv_groups
################################################################
self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)
self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
cos, sin = precompute_rope_params(
head_dim=self.head_dim,
theta_base=rope_base, # NEW
freq_config=rope_config, # NEW
context_length=8192
)
self.register_buffer("cos", cos)
self.register_buffer("sin", sin)
def forward(self, x):
b, num_tokens, d_in = x.shape
queries = self.W_query(x) # Shape: (b, num_tokens, d_out)
keys = self.W_key(x) # Shape: (b, num_tokens, num_kv_groups * head_dim)
values = self.W_value(x) # Shape: (b, num_tokens, num_kv_groups * head_dim)
# Reshape queries, keys, and values
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
##################### NEW #####################
# keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
# values = values.view(b, num_tokens, self.num_heads, self.head_dim)
keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)
################################################
# Transpose keys, values, and queries
keys = keys.transpose(1, 2) # Shape: (b, num_heads, num_tokens, head_dim)
values = values.transpose(1, 2) # Shape: (b, num_heads, num_tokens, head_dim)
queries = queries.transpose(1, 2) # Shape: (b, num_query_groups, num_tokens, head_dim)
# Apply RoPE
keys = compute_rope(keys, self.cos, self.sin)
queries = compute_rope(queries, self.cos, self.sin)
##################### NEW #####################
# Expand keys and values to match the number of heads
# Shape: (b, num_heads, num_tokens, head_dim)
keys = keys.repeat_interleave(self.group_size, dim=1) # Shape: (b, num_heads, num_tokens, head_dim)
values = values.repeat_interleave(self.group_size, dim=1) # Shape: (b, num_heads, num_tokens, head_dim)
# For example, before repeat_interleave along dim=1 (query groups):
# [K1, K2]
# After repeat_interleave (each query group is repeated group_size times):
# [K1, K1, K2, K2]
# If we used regular repeat instead of repeat_interleave, we'd get:
# [K1, K2, K1, K2]
################################################
# Compute scaled dot-product attention (aka self-attention) with a causal mask
# Shape: (b, num_heads, num_tokens, num_tokens)
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
assert keys.shape[-1] == self.head_dim
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.reshape(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
引數節省的情況,請參考以下來自 GPT 和 Llama 2 程式碼的多頭注意力示例:
# Settings
batch_size = 1
context_len = 3000
max_context_len = 8192
embed_dim = 4096
num_heads = 32
example_batch = torch.randn((batch_size, context_len, embed_dim))
mha=MultiHeadAttention(
d_in=embed_dim,
d_out=embed_dim,
context_length=max_context_len,
num_heads=num_heads
)
mha(example_batch)
print("W_key:", mha.W_key.weight.shape)
print("W_value:", mha.W_value.weight.shape)
print("W_query:", mha.W_query.weight.shape)
W_key: torch.Size([4096, 4096])
W_value: torch.Size([4096, 4096])
W_query: torch.Size([4096, 4096])
現在,如果改用分組查詢注意力,並使用 8 個 kv 組(Llama 3 8B 使用了 8 個 kv 組),可以看到 key 和 value 矩陣的行數減少了 4 倍(因為 32 個注意力頭除以 8 個 kv 組就是 4):
gqa = GroupedQueryAttention(
d_in=embed_dim,
d_out=embed_dim,
context_length=max_context_len,
num_heads=num_heads,
num_kv_groups=8,
rope_base=llama_3_theta_base
)
gqa(example_batch)
print("W_key:", gqa.W_key.weight.shape)
print("W_value:", gqa.W_value.weight.shape)
print("W_query:", gqa.W_query.weight.shape)
W_key: torch.Size([1024, 4096])
W_value: torch.Size([1024, 4096])
W_query: torch.Size([4096, 4096])
順便提一下,為了使分組查詢注意力等同於標準的多頭注意力,可以將查詢組的數量(num_kv_groups)設定為與頭的數量(num_heads)相等。
最後,比較一下下面的引數數量:
print("Total number of parameters:")
mha_total_params = sum(p.numel() for p in mha.parameters())
print(f"MHA: {mha_total_params:,}")
gqa_total_params = sum(p.numel() for p in gqa.parameters())
print(f"GQA: {gqa_total_params:,}")
Total number of parameters:
MHA: 67,108,864
GQA: 41,943,040
# Free up memory:
del mha
del gqa
1.4 更新 TransformerBlock 模組
接下來,更新 Transformer 塊。在這裡,只需將 MultiHeadAttention 與 GroupedQueryAttention 互換,並新增新的 RoPE 設定:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = GroupedQueryAttention( # MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
num_kv_groups=cfg["n_kv_groups"], # NEW
rope_base=cfg["rope_base"], # NEW
rope_config=cfg["rope_freq"], # NEW
dtype=cfg["dtype"]
)
self.ff = FeedForward(cfg)
self.norm1 = RMSNorm(cfg["emb_dim"], eps=1e-5)
self.norm2 = RMSNorm(cfg["emb_dim"], eps=1e-5)
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x.to(torch.bfloat16))
# Shape [batch_size, num_tokens, emb_size]
x = x + shortcut # Add the original input back
# Shortcut connection for feed-forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x.to(torch.bfloat16))
x = x + shortcut # Add the original input back
return x
1.5 定義模型類
幸運的是,在設定模型類時,我們不需要做太多,只需將名稱更新為 Llama3Model
# class Llama2Model(nn.Module):
class Llama3Model(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-5)
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
x = tok_embeds
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x.to(torch.bfloat16))
return logits
2 初始化模型
現在,我們可以定義一個 Llama 3 配置檔案(為便於比較,顯示的是 Llama 2 配置檔案):
LLAMA2_CONFIG_7B = {
"vocab_size": 32_000, # Vocabulary size
"context_length": 4096, # Context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 11_008, # Size of the intermediate dimension in FeedForward
"dtype": torch.bfloat16 # Lower-precision dtype to save memory
}
LLAMA3_CONFIG_8B = {
"vocab_size": 128_256, # NEW: Larger vocabulary size
"context_length": 8192, # NEW: Larger context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 14_336, # NEW: Larger size of the intermediate dimension in FeedForward
"n_kv_groups": 8, # NEW: Key-Value groups for grouped-query attention
"rope_base": 50_000, # NEW: The base in RoPE's "theta" was increased to 50_000
"rope_freq": None, # NEW: Additional configuration for adjusting the RoPE frequencies
"dtype": torch.bfloat16 # Lower-precision dtype to save memory
}
使用這些設定,我們現在可以初始化 Llama 3 8B 模型。
請注意,這需要約 34 GB 記憶體(作為對比,Llama 2 7B 需要約 26 GB 記憶體)
model = Llama3Model(LLAMA3_CONFIG_8B)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
Total number of parameters: 8,030,261,248
如上圖所示,模型包含 80 億個引數。此外,我們還可以使用下面的程式碼計算該模型的記憶體需求:
def model_memory_size(model, input_dtype=torch.float32):
total_params = 0
total_grads = 0
for param in model.parameters():
# Calculate total number of elements per parameter
param_size = param.numel()
total_params += param_size
# Check if gradients are stored for this parameter
if param.requires_grad:
total_grads += param_size
# Calculate buffer size (non-parameters that require memory)
total_buffers = sum(buf.numel() for buf in model.buffers())
# Size in bytes = (Number of elements) * (Size of each element in bytes)
# We assume parameters and gradients are stored in the same type as input dtype
element_size = torch.tensor(0, dtype=input_dtype).element_size()
total_memory_bytes = (total_params + total_grads + total_buffers) * element_size
# Convert bytes to gigabytes
total_memory_gb = total_memory_bytes / (1024**3)
return total_memory_gb
print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")
float32 (PyTorch default): 68.08 GB
bfloat16: 34.04 GB
最後,如果適用,我們還可以將模型轉移到 NVIDIA 或 Apple Silicon GPU 上:
if torch.cuda.is_available():
device = torch.device("cuda")
elif torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
model.to(device);
3 載入 tokenizer
在本節中,我們將為模型載入 tokenizer。
Llama 2 使用了谷歌的 SentencePiece tokenizer ,而不是 OpenAI 基於 Tiktoken 庫的 BPE tokenizer 。然而,Llama 3 恢復使用 Tiktoken 的 BPE tokenizer;具體來說,它使用的是具有擴充套件詞彙的 GPT-4 tokenizer。我們可以在 Meta AI 的官方 Llama 3 儲存庫中找到最初的 Tiktoken 適配程式。
下面重寫了 tokenizer 的程式碼,使其更易讀,更適合本筆記本使用(但表現應該是相似的):
import os
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
class Tokenizer:
def __init__(self, model_path):
assert os.path.isfile(model_path), f"Model file {model_path} not found"
mergeable_ranks = load_tiktoken_bpe(model_path)
num_base_tokens = len(mergeable_ranks)
self.special_tokens = {
"<|begin_of_text|>": 128000,
"<|end_of_text|>": 128001,
"<|start_header_id|>": 128006,
"<|end_header_id|>": 128007,
"<|eot_id|>": 128009,
}
self.special_tokens.update({
f"<|reserved_{i}|>": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()
})
self.model = tiktoken.Encoding(
name=Path(model_path).name,
pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
mergeable_ranks=mergeable_ranks,
special_tokens=self.special_tokens
)
def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):
if bos:
tokens = [self.special_tokens["<|begin_of_text|>"]]
else:
tokens = []
tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)
if eos:
tokens.append(self.special_tokens["<|end_of_text|>"])
return tokens
def decode(self, tokens):
return self.model.decode(tokens)
Meta AI 在 Hugging Face Hub 上共享了 Llama 3 模型的原始權重和 tokenizer 詞庫。
我們將首先從 Hub 下載 tokenizer 詞庫,並將其載入到上述程式碼中。請注意,Meta AI 要求你在下載檔案前接受 Llama 3 許可條款;為此必須建立一個 Hugging Face Hub 賬戶,並訪問 meta-llama/Meta-Llama-3-8B 儲存庫以接受條款。
接下來,需要建立一個訪問 token;要生成一個具有「讀取」許可權的訪問 token,請點選右上角的個人資料圖片,然後點選「設定」。
然後,建立並複製訪問 token,以便複製並貼上到下一個程式碼單元中:
from huggingface_hub import login
import json
with open("config.json", "r") as config_file:
config = json.load(config_file)
access_token = config["HF_ACCESS_TOKEN"]
login(token=access_token)
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
透過訪問 token 登入(這是驗證我們是否接受 Llama 3 許可條款所必需的)後,就可以下載 tokenizer 詞庫了:
from huggingface_hub import hf_hub_download
tokenizer_file_path = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B",
filename="original/tokenizer.model",
local_dir="llama3-files"
)
請注意,在使用 Llama 3 檔案時,我們可能需要 blobfile 軟體包,它用於處理儲存在雲端儲存解決方案(如 Google Cloud Storage (GCS)、Azure Blob Storage 或 Amazon S3)中的資料集或模型。
可以透過取消註釋並執行下面的 pip 命令來安裝此依賴包:
# pip install blobfile
tokenizer = Tokenizer(tokenizer_file_path)
現在,我們可以使用生成函式讓 Llama 3 模型生成新文字:
from previous_chapters import generate, text_to_token_ids, token_ids_to_text
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort",
tokenizer).to(device),
max_new_tokens=30,
context_size=LLAMA3_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort_dead aeros Ingredients başında.extension clangmissions.esp 사진 Ek Pars til DoctorsDaoеньostivan normal Ekized � Ekized � Ek rdr tık%,orgen>',
當然,正如我們在上面看到的,這段文字是毫無意義的,因為我們還沒有訓練過 Llama 3 模型。在下一節中,我們將從 Meta AI 中載入預訓練的權重,而不是自己訓練模型,因為這將花費數萬至數十萬美元。
4 載入預訓練權重
我們將載入下面的「meta-llama/Meta-Llama-3-8B 」base 模型,它是微調前的簡單文字補全模型。
或者,你也可以載入經過指令微調和對齊的「meta-llama/Meta-Llama-3-8B-Instruct」模型,方法是相應修改下一個程式碼單元中的字串。加起來,權重檔案大約有 16 GB 大。
from safetensors.torch import load_file
combined_weights = {}
for i in range(1, 5):
weights_file = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B",
filename=f"model-0000{i}-of-00004.safetensors",
local_dir="llama3-files"
)
current_weights = load_file(weights_file)
combined_weights.update(current_weights)
model-00001-of-00004.safetensors: 0%| | 0.00/4.98G [00:00<?, ?B/s]
model-00002-of-00004.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00003-of-00004.safetensors: 0%| | 0.00/4.92G [00:00<?, ?B/s]
model-00004-of-00004.safetensors: 0%| | 0.00/1.17G [00:00<?, ?B/s]
權重包含以下張量(為簡單起見,只顯示前 15 個張量):
list(combined_weights.keys())[:15]
['model.embed_tokens.weight',
'model.layers.0.input_layernorm.weight',
'model.layers.0.mlp.down_proj.weight',
'model.layers.0.mlp.gate_proj.weight',
'model.layers.0.mlp.up_proj.weight',
'model.layers.0.post_attention_layernorm.weight',
'model.layers.0.self_attn.k_proj.weight',
'model.layers.0.self_attn.o_proj.weight',
'model.layers.0.self_attn.q_proj.weight',
'model.layers.0.self_attn.v_proj.weight',
'model.layers.1.input_layernorm.weight',
'model.layers.1.mlp.down_proj.weight',
'model.layers.1.mlp.gate_proj.weight',
'model.layers.1.mlp.up_proj.weight',
'model.layers.1.post_attention_layernorm.weight']
下面的函式仿照《Build a Large Language Model From Scratch》第 5 章(https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/01_main-chapter-code/ch05.ipynb)中的 load_weights_into_gpt 函式,將預訓練好的權重載入到 Llama 3 模型中:
def assign(left, right, tensor_name="unknown"):
if left.shape != right.shape:
raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")
if isinstance(right, torch.Tensor):
return torch.nn.Parameter(right.clone().detach())
else:
return torch.nn.Parameter(torch.tensor(right))
def load_weights_into_llama(model, param_config, params):
model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
for l in range(param_config["n_layers"]):
# Load attention weights
model.trf_blocks[l].att.W_query.weight = assign(
model.trf_blocks[l].att.W_query.weight,
params[f"model.layers.{l}.self_attn.q_proj.weight"],
f"model.layers.{l}.self_attn.q_proj.weight"
)
model.trf_blocks[l].att.W_key.weight = assign(
model.trf_blocks[l].att.W_key.weight,
params[f"model.layers.{l}.self_attn.k_proj.weight"],
f"model.layers.{l}.self_attn.k_proj.weight"
)
model.trf_blocks[l].att.W_value.weight = assign(
model.trf_blocks[l].att.W_value.weight,
params[f"model.layers.{l}.self_attn.v_proj.weight"],
f"model.layers.{l}.self_attn.v_proj.weight"
)
model.trf_blocks[l].att.out_proj.weight = assign(
model.trf_blocks[l].att.out_proj.weight,
params[f"model.layers.{l}.self_attn.o_proj.weight"],
f"model.layers.{l}.self_attn.o_proj.weight"
)
model.trf_blocks[l].norm1.weight = assign(
model.trf_blocks[l].norm1.weight,
params[f"model.layers.{l}.input_layernorm.weight"],
f"model.layers.{l}.input_layernorm.weight"
)
# Load FeedForward weights
model.trf_blocks[l].ff.fc1.weight = assign(
model.trf_blocks[l].ff.fc1.weight,
params[f"model.layers.{l}.mlp.gate_proj.weight"],
f"model.layers.{l}.mlp.gate_proj.weight"
)
model.trf_blocks[l].ff.fc2.weight = assign(
model.trf_blocks[l].ff.fc2.weight,
params[f"model.layers.{l}.mlp.up_proj.weight"],
f"model.layers.{l}.mlp.up_proj.weight"
)
model.trf_blocks[l].ff.fc3.weight = assign(
model.trf_blocks[l].ff.fc3.weight,
params[f"model.layers.{l}.mlp.down_proj.weight"],
f"model.layers.{l}.mlp.down_proj.weight"
)
model.trf_blocks[l].norm2.weight = assign(
model.trf_blocks[l].norm2.weight,
params[f"model.layers.{l}.post_attention_layernorm.weight"],
f"model.layers.{l}.post_attention_layernorm.weight"
)
# Load output layer weights
model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")
if "lm_head.weight" in params.keys():
model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")
else:
model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
print("Model uses weight tying.")
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device);
del combined_weights # free up memory
接下來,我們就可以使用該模型生成文字了:
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort", tokenizer).to(device),
max_new_tokens=25,
context_size=LLAMA3_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any
5 使用指令微調模型
上面我們使用的是經過預訓練的基礎模型,如果你想使用一個能夠遵循指令的模型,請使用「meta-llama/Llama-3-8b-Instruct」模型,如下所示:
# to free up memory
import gc
del model
gc.collect() # Run Python garbage collector
if torch.cuda.is_available():
torch.cuda.empty_cache()
combined_weights = {}
for i in range(1, 5):
weights_file = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
filename=f"model-0000{i}-of-00004.safetensors",
local_dir="llama3-files"
)
current_weights = load_file(weights_file)
combined_weights.update(current_weights)
model = Llama3Model(LLAMA3_CONFIG_8B)
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device)
del combined_weights # free up memory
model-00001-of-00004.safetensors: 0%| | 0.00/4.98G [00:00<?, ?B/s]
model-00002-of-00004.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00003-of-00004.safetensors: 0%| | 0.00/4.92G [00:00<?, ?B/s]
model-00004-of-00004.safetensors: 0%| | 0.00/1.17G [00:00<?, ?B/s]
請注意,Llama 3 模型最好與微調時使用的正確提示模板一起使用。
下面是一個基於 Meta AI 的 Llama 3 專用 ChatFormat 程式碼的 tokenizer wrapper 類,用於構建提示模板:
class ChatFormat:
def __init__(self, tokenizer):
self.tokenizer = tokenizer
def encode_header(self, message):
tokens = []
tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
return tokens
def encode(self, text):
message = {
"role": "user",
"content": text
}
tokens = self.encode_header(message)
tokens.extend(
self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
)
tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
return tokens
def decode(self, token_ids):
return self.tokenizer.decode(token_ids)
chat_tokenizer = ChatFormat(tokenizer)
用法如下:
token_ids = chat_tokenizer.encode("Hello World!")
print(token_ids)
[128006, 882, 128007, 271, 9906, 4435, 0, 128009]
tokenizer.decode(token_ids)
'<|start_header_id|>user<|end_header_id|>\n\nHello World!<|eot_id|>'
現在,讓我們來看看 Llama 3 教學模式的實際應用:
import re
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("What do llamas eat?", chat_tokenizer).to(device),
max_new_tokens=150,
context_size=LLAMA3_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
output_text = token_ids_to_text(token_ids, tokenizer)
def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):
# Find the index of the first occurrence of "<|end_header_id|>"
index = text.find(header_end)
if index != -1:
# Return the substring starting after "<|end_header_id|>"
return text[index + len(header_end):].strip() # Strip removes leading/trailing whitespace
else:
# If the token is not found, return the original text
return text
print("Output text:\n", clean_text(output_text))
Output text:
Llamas are herbivores, which means they primarily eat plants and plant-based foods. Here are some of the things llamas like to eat:
1. Grass: Llamas love to graze on grass, especially in the spring and summer months.
2. Hay: Hay is a staple in a llama's diet. They like to eat timothy hay, alfalfa hay, and other types of hay.
3. Grains: Llamas may also be fed grains like oats, barley, and corn. However, grains should not make up more than 10% of a llama's diet.
4. Fruits and vegetables: Llamas may enjoy fruits and vegetables as treats, such as apples,
Llama 3.1 8B
在 Llama 3 釋出幾個月後,Meta AI 又推出了 Llama 3.1 模型套件(詳見 Llama 3.1 官方介紹)。
方便的是,我們可以重複使用之前的 Llama 3 程式碼來實現 Llama 3.1 8B:
結構完全相同,唯一的變化是重新調整了 RoPE 頻率,如下配置檔案所示:
LLAMA3_CONFIG_8B = {
"vocab_size": 128_256, # Vocabulary size
"context_length": 8192, # Context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 14_336, # Size of the intermediate dimension in FeedForward
"n_kv_groups": 8, # Key-Value groups for grouped-query attention
"rope_base": 50_000, # The base in RoPE's "theta"
"rope_freq": None, # Additional configuration for adjusting the RoPE frequencies
"dtype": torch.bfloat16 # Lower-precision dtype to save memory
}
LLAMA31_CONFIG_8B = {
"vocab_size": 128_256, # Vocabulary size
"context_length": 8192, # Context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 14_336, # Size of the intermediate dimension in FeedForward
"n_kv_groups": 8, # Key-Value groups for grouped-query attention
"rope_base": 50_000, # The base in RoPE's "theta"
"dtype": torch.bfloat16, # Lower-precision dtype to save memory
"rope_freq": { # NEW: RoPE frequency scaling
"factor": 8.0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"original_context_length": 8192,
}
}
正如我們之前在程式碼中看到的,RoPE 方法使用正弦函式(正弦和餘弦)將位置資訊直接嵌入注意力機制中。
在 Llama 3.1 中,透過附加配置,我們對反向頻率計算進行了額外調整。這些調整會影響不同頻率成分對位置嵌入的貢獻。
讓我們在實踐中試試 Llama 3.1 模型;首先,我們清除舊模型,以釋放一些 GPU 記憶體:
# free up memory
del model
gc.collect() # Run Python garbage collector
if torch.cuda.is_available():
torch.cuda.empty_cache()
接下來,我們下載 tokenizer。
請注意,由於 Llama 3.1 系列與 Llama 3 系列不同,、必須訪問 meta-llama/Llama-3.1-8Brepository,並確認許可條款,這樣 Hugging Face 訪問 token 才能在下載時起作用。
簡單起見,我們在下面只載入 base 模型,但也有一個經過指令微調的版本,你可以將「meta-llama/Llama-3.1-8B」替換為「meta-llama/Llama-3.1-8B-Instruct」。
tokenizer_file_path = hf_hub_download(
repo_id="meta-llama/Llama-3.1-8B",
filename="original/tokenizer.model",
local_dir="llama3-files"
)
tokenizer = Tokenizer(tokenizer_file_path)
model = Llama3Model(LLAMA31_CONFIG_8B)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
Total number of parameters: 8,030,261,248
combined_weights = {}
for i in range(1, 5):
weights_file = hf_hub_download(
repo_id="meta-llama/Llama-3.1-8B",
filename=f"model-0000{i}-of-00004.safetensors",
local_dir="llama3-files"
)
current_weights = load_file(weights_file)
combined_weights.update(current_weights)
load_weights_into_llama(model, LLAMA31_CONFIG_8B, combined_weights)
model.to(device);
model-00001-of-00004.safetensors: 0%| | 0.00/4.98G [00:00<?, ?B/s]
model-00002-of-00004.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00003-of-00004.safetensors: 0%| | 0.00/4.92G [00:00<?, ?B/s]
model-00004-of-00004.safetensors: 0%| | 0.00/1.17G [00:00<?, ?B/s]
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort", tokenizer).to(device),
max_new_tokens=25,
context_size=LLAMA31_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any
Llama 3.2 1B
截至本文撰寫之時,Meta AI 的最新模型是此處公佈的 Llama 3.2 模型。
Llama 3.2 文字模型的程式碼與 Llama 3.1 相似,只是縮小了模型的大小(有 1B 和 3B 版本)。
另一個效率上的調整是,他們又增加了權重繫結(GPT-2 架構中最初使用的概念);在這裡,他們在輸入(token)嵌入層和輸出層中重複使用相同的權重引數值。
Llama 3.2 1B 的模型體積小,甚至可以在許多移動裝置上執行,因此非常方便。
Llama 3.1 8B 和 Llama 3.2 1B 在結構上的差異如下圖所示:
從上圖可以看出,Llama 3.1 8B 和 Llama 3.2 1B 架構的主要區別在於各自的尺寸。
一個小的額外變化是增加了 RoPE rescaling 係數,這反映在下面的配置檔案中:
LLAMA31_CONFIG_8B = {
"vocab_size": 128_256, # Vocabulary size
"context_length": 8192, # Context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 14_336, # Size of the intermediate dimension in FeedForward
"n_kv_groups": 8, # Key-Value groups for grouped-query attention
"rope_base": 50_000, # The base in RoPE's "theta"
"dtype": torch.bfloat16, # Lower-precision dtype to save memory
"rope_freq": { # RoPE frequency scaling
"factor": 8.0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"original_context_length": 8192,
}
}
LLAMA32_CONFIG_1B = {
"vocab_size": 128_256, # Vocabulary size
"context_length": 8192, # Context length
"emb_dim": 2048, # NEW: Half the embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 16, # NEW: Half the number of layers
"hidden_dim": 8192, # NEW: Almopst half the size of the intermediate dimension in FeedForward
"n_kv_groups": 8, # Key-Value groups for grouped-query attention
"rope_base": 50_000, # The base in RoPE's "theta"
"dtype": torch.bfloat16, # Lower-precision dtype to save memory
"rope_freq": { # RoPE frequency scaling
"factor": 32.0, # NEW: Adjustment of the rescaling factor
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"original_context_length": 8192,
}
}
下面,我們可以重複使用 Llama 3.1 8B 部分的程式碼來載入 Llama 3.2 1B 模型。
同樣,由於 Llama 3.2 系列有別於 Llama 3.1 系列,因此必須訪問 meta-llama/Llama-3.2-1B 軟體源並確認許可條款。
簡單起見,我們只在下面載入基本模型,但也有一個經過指令微調的版本,可以用 「meta-llama/Llama-3.2-1B-Instruct」替換「meta-llama/Llama-3.2-1B」。
# free up memory
del model
gc.collect() # Run Python garbage collector
if torch.cuda.is_available():
torch.cuda.empty_cache()
tokenizer_file_path = hf_hub_download(
repo_id="meta-llama/Llama-3.2-1B",
filename="original/tokenizer.model",
local_dir="llama32-files"
)
tokenizer = Tokenizer(tokenizer_file_path)
model = Llama3Model(LLAMA32_CONFIG_1B)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Account for weight tying
total_params_normalized = total_params - model.tok_emb.weight.numel()
print(f"\nTotal number of unique parameters: {total_params_normalized:,}")
Total number of parameters: 1,498,482,688
Total number of unique parameters: 1,235,814,400
weights_file = hf_hub_download(
repo_id="meta-llama/Llama-3.2-1B",
filename=f"model.safetensors",
local_dir="llama32-files"
)
current_weights = load_file(weights_file)
load_weights_into_llama(model, LLAMA32_CONFIG_1B, current_weights)
model.to(device);
Model uses weight tying.
print("Weight tying:", torch.equal(model.tok_emb.weight, model.out_head.weight))
Weight tying: True
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort",
tokenizer).to(device),
max_new_tokens=25,
context_size=LLAMA32_CONFIG_1B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort is made to ensure that the information on this website is accurate. However, we cannot guarantee that the information is accurate, complete
原文連結:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb