This article uses Qwen2ForSequenceClassification to implement a text classification task.
First published on my Zhihu: https://zhuanlan.zhihu.com/p/17468021019
I. Experimental Results and Conclusions
Over the past few months I have run quite a few experiments on classification with large models and collected a bit of experience.
1. Short text
1) Query sentiment classification: generally not as good as BERT.
PS: this conclusion is basically consistent with https://segmentfault.com/a/1190000044485544#item-13
2. Long text
1) Long ASR-transcribed call text: BERT truncated at 512 tokens loses to the LLM.
- The LLM input was not truncated (if both were truncated at 512, the results might well be similar).
- I did not compare against a BERT variant that slides a window over the text.
2) Base vs. Instruct
- When the data is small, fine-tuning Base is worse than fine-tuning Instruct (the Instruct model pays an alignment tax, but with little fine-tuning data it still beats a Base model that has never seen instruction-tuning samples).
3) Full-parameter SFT vs. LoRA
- When the data is small (under 10K samples in total; the per-label requirement depends on the task), full-parameter SFT is worse than LoRA (and its tuning cost is also higher).
3. Ways to improve classification performance
1) Specific to generative fine-tuning
- Mixing in data from another business line in the same domain with a similar data type can add a few points.
- The data distributions must not differ too much, especially in text length, otherwise mixing such data in actually hurts (e.g. one set averaging 1.2K characters, the other 5K).
- Mixing ratio: roughly 2:1; tune it for your own scenario.
- Mixing order: I used random sampling and did not verify whether training on the two sources in separate stages makes a difference.
- Optimize the prompt: add a concise description of each label to the prompt; for short text, few-shot can be tried.
2) Classification-head fine-tuning + generative fine-tuning
- When the data is large (over 10K, with enough samples per label on average), try fine-tuning Base instead of Instruct.
- Data augmentation: try pseudo-labeled samples produced on unlabeled data (labels extracted with a prompt, plus labels extracted by the fine-tuned model).
- When the data is large, try full-parameter SFT.
- When fine-tuning with LoRA, also include the LLM's embedding layer (not verified).
- Try distilling a larger model into a small one (pain points: the large model is hard to tune and more expensive to train, and deployment still requires a small model).
- Try LoRA variants.
- Try hyperparameter tuning (I tried automatic search with optuna without much success; usually just tune lr, epochs and rank).
3) Heavyweight options
- Take a Base model, continue pre-training it on domain data, then instruction-tune it.
- This plan is still unverified.
- I did try a model obtained by instruction-tuning Qwen2-7B-Instruct on domain data; fine-tuning it was actually somewhat worse than fine-tuning Qwen2-7B-Instruct directly. Since the training details of that model are unknown to me, the reason is unclear.
- If it does work, the benefit is that fine-tuning the resulting base model should gain points on every task in the domain.
The first priority is still the data; only after that come the various additive and subtractive tricks.
4. Things to watch
- Learning rate: the more parameters you train, the smaller the learning rate should be.
- Label noise: mislabeled samples need to be removed or corrected during error analysis.
- Business rules for classification: in complex scenarios, settle a complete set of annotation rules up front to avoid rework (which cases the model can handle and which it cannot).
II. Text Classification: from BERT to LLM
Qwen2ForSequenceClassification and BertForSequenceClassification are logically identical: both put a Linear layer on top of the model's output to perform classification.
Every modification you previously made on top of BERT can be carried over to the LLM, e.g. BERT-CRF.
BertForSequenceClassification is a ready-made class for text classification. The number of classes is passed via num_labels, and the constructor shows that the class consists of roughly three parts: a BERT model, a Dropout layer, and a Linear classifier.
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
        self.init_weights()
BERT extracts the text features (the embedding), Dropout guards against overfitting, and the Linear layer is a weak classifier that makes the final prediction; if you need a more complex classification network, you can modify this class accordingly.
The loss function is already defined inside forward(), so you do not have to implement it yourself for training; the return value contains four items:
def forward(...):
    ...
    if labels is not None:
        if self.num_labels == 1:
            # We are doing regression
            loss_fct = MSELoss()
            loss = loss_fct(logits.view(-1), labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs

    return outputs  # (loss), logits, (hidden_states), (attentions)
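For reference, here is a minimal usage sketch of BertForSequenceClassification (the bert-base-chinese checkpoint and the 3-class setup are assumptions chosen only for illustration):
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Load a Chinese BERT checkpoint with a freshly initialized 3-class head (assumed setup).
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)

inputs = tokenizer("這基金表現也太差了吧", return_tensors="pt")
labels = torch.tensor([0])                    # shape (batch_size,)
outputs = model(**inputs, labels=labels)      # passing labels makes forward() also return the loss
print(outputs.loss, outputs.logits.shape)     # logits: (batch_size, num_labels)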
Next, let's look at Qwen2ForSequenceClassification.
Qwen2ForSequenceClassification(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1024, padding_idx=151643)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (score): Linear(in_features=1024, out_features=3, bias=False)
)
The official Qwen2 implementation supports three built-in modes:
- single_label_classification: the loss is CrossEntropyLoss; it takes the logit of the target label and computes the negative log-likelihood.
- multi_label_classification: the loss is BCEWithLogitsLoss; labels are multi-hot, a sigmoid is applied to the predicted logits, the logit of each labeled dimension is used, and the losses are summed.
- regression: the loss is MSELoss; it defaults to single-dimensional regression (regression can serve as a reward model that predicts a score).
The three modes expect different label formats, roughly as follows.
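As a rough illustration (assuming num_labels=3), the label tensors expected by the three modes look like this:
import torch

# single_label_classification: one class index per sample, integer dtype, shape (batch_size,)
labels_single = torch.tensor([0, 2, 1])

# multi_label_classification: multi-hot float targets, shape (batch_size, num_labels)
labels_multi = torch.tensor([[1., 0., 1.],
                             [0., 1., 0.],
                             [0., 0., 1.]])

# regression: float targets; with num_labels == 1, shape (batch_size,) or (batch_size, 1)
labels_reg = torch.tensor([0.7, 0.1, 0.9])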
class Qwen2ForSequenceClassification(Qwen2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Qwen2Model(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
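One detail worth highlighting: unlike BERT, which classifies from the [CLS] token, this implementation pools the logits at the last non-padding token of each sequence. A small trace of that indexing logic (pad_token_id = 151643 as in the Qwen2 configs; the other token ids are made up):
import torch

pad_token_id = 151643
input_ids = torch.tensor([[11, 12, 13, pad_token_id, pad_token_id],
                          [21, 22, pad_token_id, pad_token_id, pad_token_id]])
# index of the first pad token, minus one == index of the last real token
sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
print(sequence_lengths)  # tensor([2, 1]): the classifier reads the hidden states at these positions
With left padding (as used in the fine-tuning code below), argmax finds position 0, so the modulo wraps around to the final position, which is again the last real token.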
III. LoRA Fine-tuning of Qwen2ForSequenceClassification
After LoRA fine-tuning, the LoRA weights are merged back into the model and the full model is saved.
This is because, at the time of writing, the PEFT code did not save the parameters of the classification head's Linear layer, so the trained Qwen2ForSequenceClassification model cannot be reproduced from the LoRA weights alone.
If you need that, the code can be patched slightly, as sketched below.
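One possible patch, as a sketch: recent PEFT versions accept a modules_to_save argument in LoraConfig, and listing the score head there keeps a trained copy of it inside the adapter checkpoint (depending on your PEFT version, TaskType.SEQ_CLS may already do this automatically, so verify against the version you use):
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["score"],  # save (and train) the classification head together with the LoRA weights
)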
The code was tested in the environment provided by ModelScope.
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "不想學習怎麼辦?有興趣,但是拖延症犯了"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs, max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
Output from the first run (the model is downloaded from ModelScope, then the generated reply is printed):
Downloading [config.json]: 100%|██████████| 661/661 [00:00<00:00, 998B/s]
Downloading [configuration.json]: 100%|██████████| 2.00/2.00 [00:00<00:00, 2.30B/s]
Downloading [generation_config.json]: 100%|██████████| 242/242 [00:00<00:00, 557B/s]
Downloading [LICENSE]: 100%|██████████| 7.21k/7.21k [00:00<00:00, 11.7kB/s]
Downloading [merges.txt]: 100%|██████████| 1.59M/1.59M [00:00<00:00, 3.01MB/s]
Downloading [model-00001-of-00002.safetensors]: 100%|██████████| 3.70G/3.70G [00:10<00:00, 373MB/s]
Downloading [model-00002-of-00002.safetensors]: 100%|██████████| 2.05G/2.05G [00:06<00:00, 332MB/s]
Downloading [model.safetensors.index.json]: 100%|██████████| 34.7k/34.7k [00:00<00:00, 56.8kB/s]
Downloading [README.md]: 100%|██████████| 4.79k/4.79k [00:00<00:00, 10.3kB/s]
Downloading [tokenizer.json]: 100%|██████████| 6.71M/6.71M [00:00<00:00, 8.58MB/s]
Downloading [tokenizer_config.json]: 100%|██████████| 7.13k/7.13k [00:00<00:00, 13.8kB/s]
Downloading [vocab.json]: 100%|██████████| 2.65M/2.65M [00:00<00:00, 5.09MB/s]
/usr/local/lib/python3.10/site-packages/accelerate/utils/modeling.py:1405: UserWarning: Current model requires 234882816 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
warnings.warn(
面對興趣與拖延之間的矛盾,確實會讓人感到困擾。這裡有一些建議或許能幫助你克服拖延,更好地堅持學習:
- 設定小目標:將大目標分解為一系列小目標。完成每一個小目標都是一次小小的勝利,這可以增加你的動力和成就感。
- 制定計劃:為自己規劃一個詳細的學習計劃,並儘量按照計劃執行。記得為休息時間留出空間,保持良好的工作與休息平衡。
- 保持積極心態:對自己保持耐心和理解,不要因為一時的困難而放棄。記住,進步的過程就是成長的過程。
Since the ModelScope wrappers do not support LoRA here, I looked up the local cache path:
print(model.model_dir)
/mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct
Listing the files in that directory:
config.json
merges.txt
README.md
configuration.json
model-00001-of-00002.safetensors
tokenizer_config.json
generation_config.json
model-00002-of-00002.safetensors
tokenizer.json
LICENSE
model.safetensors.index.json
vocab.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
Fine-tuning code
### Initialization and random seed
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import numpy as np
import pandas as pd
import random
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Twenty samples were constructed with a large model from a prompt.
import json
x = '''這基金表現也太差了吧,買了半年了還虧著呢。
管理費收得比別的基金都高,感覺就是在給基金公司打工。
想查查具體投了啥,結果發現透明度低得要命,啥也看不清楚。
基金經理換來換去的,都不知道到底誰在管我的錢。
客服電話打過去半天才有人接,問個問題還得等上好幾天才有回覆。
市場稍微有點風吹草動,這基金就跌得比誰都快。
投資組合裡全是同一行業的股票,風險大得讓人睡不著覺。
長期持有也沒見賺多少錢,還不如存銀行定期。
分紅政策一會兒一個樣,根本沒法做財務規劃。
當初宣傳時說得好聽,實際操作起來完全不是那麼回事。'''
x_samples = x.split("\n")
y = '''這基金真的穩啊,買了之後收益一直挺不錯的,感覺很靠譜!
管理團隊超級專業,每次市場波動都能及時調整策略,讓人放心。
透明度很高,隨時都能查到投資組合的情況,心裡有數。
基金經理經驗老道,看準了幾個大機會,賺了不少。
客服態度特別好,有問題總能很快得到解答,服務真是沒得說。
即使在市場不好的時候,這基金的表現也比大多數同類產品強。
分散投資做得很好,風險控制得很到位,睡個安穩覺沒問題。
長期持有的話,回報率真的非常可觀,值得信賴。
分紅政策明確而且穩定,每年都能按時收到分紅,計劃財務很方便。
宣傳時承諾的那些好處都實現了,真心覺得選對了這隻基金。'''
y_samples = y.split("\n")
# Build the labeled records (x = negative reviews, y = positive reviews)
x_data = [{"content": i, "label": 1, "標註類別": "負向"} for i in x_samples]
y_data = [{"content": i, "label": 0, "標註類別": "正向"} for i in y_samples]

def save_json(path, data):
    # Dump the records to a JSON file
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

save_json('data/classify_train.json', x_data[:6]+y_data[:6])
save_json('data/classify_valid.json', x_data[6:8]+y_data[6:8])
save_json('data/classify_test.json', x_data[8:]+y_data[8:])
Data loading
import json
from tqdm import tqdm
from loguru import logger
from datasets import Dataset, load_dataset
def get_dataset_from_json(json_path, cols):
    with open(json_path, "r") as file:
        data = json.load(file)
    df = pd.DataFrame(data)
    dataset = Dataset.from_pandas(df[cols], split='train')
    return dataset

# load_dataset is too slow at loading a JSON dataset, hence the helper above
cols = ['content', 'label', '標註類別']
train_ds = get_dataset_from_json('data/classify_train.json', cols)
logger.info(f"TrainData num: {len(train_ds)}")
valid_ds = get_dataset_from_json('data/classify_valid.json', cols)
logger.info(f"ValidData num: {len(valid_ds)}")
test_ds = get_dataset_from_json('data/classify_test.json', cols)
logger.info(f"TestData num: {len(test_ds)}")
print(train_ds[0])
{'content': '這基金表現也太差了吧,買了半年了還虧著呢。', 'label': 1, '標註類別': '負向'}
Preparing the dataset (simple truncation and padding; no dynamic padding, but see the sketch after this block):
id2label = {0: "正向", 1: "負向"}
label2id = {v:k for k,v in id2label.items()}
from transformers import AutoTokenizer, DataCollatorWithPadding
# from modelscope import AutoTokenizer, DataCollatorWithPadding

model_name_or_path = "/mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct"
model_name = model_name_or_path.split("/")[-1]
print(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side='left')
tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

MAX_LEN = 24
txt_colname = 'content'

def preprocess_function(examples):
    # padding in the map step is not efficient; dynamic batch padding would be better
    return tokenizer(examples[txt_colname], max_length=MAX_LEN, padding=True, truncation=True)

tokenized_train = train_ds.map(preprocess_function, num_proc=64, batched=True)
tokenized_valid = valid_ds.map(preprocess_function, num_proc=64, batched=True)
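If you do want dynamic padding, a common variant (a sketch under the same setup as above) is to drop padding from the map step and let DataCollatorWithPadding pad each batch to its longest sequence:
def preprocess_function_dynamic(examples):
    # no padding here; the collator pads each batch to its longest example at training time
    return tokenizer(examples[txt_colname], max_length=MAX_LEN, truncation=True)

tokenized_train_dyn = train_ds.map(
    preprocess_function_dynamic, batched=True,
    remove_columns=[txt_colname, "標註類別"]  # keep only label / input_ids / attention_mask
)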
Evaluation code with sklearn
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score
)

def evals(test_ds, model):
    k_list = [x[txt_colname] for x in test_ds]
    model.eval()
    k_result = []
    for idx, txt in tqdm(enumerate(k_list)):
        model_inputs = tokenizer([txt], max_length=MAX_LEN, truncation=True, return_tensors="pt").to(model.device)
        logits = model(**model_inputs).logits
        res = int(torch.argmax(logits, axis=1).cpu())
        k_result.append(id2label.get(res))
    y_true = np.array(test_ds['label'])
    y_pred = np.array([label2id.get(x) for x in k_result])
    return y_true, y_pred

def compute_metrics(eval_pred):
    predictions, label = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"f1": f1_score(y_true=label, y_pred=predictions, average='weighted')}

def compute_valid_metrics(eval_pred):
    predictions, label = eval_pred
    y_true, y_pred = label, predictions
    accuracy = accuracy_score(y_true, y_pred)
    print(f'Accuracy: {accuracy}')
    metric_types = ['micro', 'macro', 'weighted']
    for metric_type in metric_types:
        precision = precision_score(y_true, y_pred, average=metric_type)
        recall = recall_score(y_true, y_pred, average=metric_type)
        f1 = f1_score(y_true, y_pred, average=metric_type)
        print(f'{metric_type} Precision: {precision}')
        print(f'{metric_type} Recall: {recall}')
        print(f'{metric_type} F1 Score: {f1}')
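Note that compute_valid_metrics only prints the metrics and returns None, which is why the training log further below shows None after each evaluation. A small variant that returns the numbers as a dict instead (a sketch reusing the sklearn imports above):
def compute_valid_metrics_dict(eval_pred):
    predictions, label = eval_pred
    metrics = {"accuracy": accuracy_score(label, predictions)}
    for metric_type in ['micro', 'macro', 'weighted']:
        metrics[f"{metric_type}_precision"] = precision_score(label, predictions, average=metric_type)
        metrics[f"{metric_type}_recall"] = recall_score(label, predictions, average=metric_type)
        metrics[f"{metric_type}_f1"] = f1_score(label, predictions, average=metric_type)
    return metrics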
Loading the model and training with Trainer
import torch
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType

rank = 64
alpha = rank * 2

training_args = TrainingArguments(
    output_dir=f"./output/{model_name}/sequence_classify/",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,
    r=rank,
    lora_alpha=alpha,
    lora_dropout=0.1
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.config.pad_token_id = tokenizer.pad_token_id

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

logger.info(f"start Trainingrank: {rank}")
trainer.train()

logger.info(f"Valid Set, rank: {rank}")
y_true, y_pred = evals(valid_ds, model)
metrics = compute_valid_metrics((y_pred, y_true))
logger.info(metrics)

logger.info(f"Test Set, rank: {rank}")
y_true, y_pred = evals(test_ds, model)
metrics = compute_valid_metrics((y_pred, y_true))
logger.info(metrics)

saved_model = model.merge_and_unload()
saved_model.save_pretrained('/model/qwen2-3b/seqcls')
Remove the LoraConfig and get_peft_model lines and you get the full-parameter SFT version of this code.
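To reload the merged checkpoint later, it may also help to save the tokenizer next to it; a sketch (same path as above):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Persist the tokenizer so the saved directory is self-contained.
tokenizer.save_pretrained('/model/qwen2-3b/seqcls')

# Later, reload the merged classifier directly with transformers.
reloaded = AutoModelForSequenceClassification.from_pretrained(
    '/model/qwen2-3b/seqcls', torch_dtype=torch.bfloat16, device_map="auto"
)
reloaded_tokenizer = AutoTokenizer.from_pretrained('/model/qwen2-3b/seqcls')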
Structure of the model:
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Qwen2ForSequenceClassification(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 2048)
        (layers): ModuleList(
          (0-35): 36 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
              (k_proj): Linear(in_features=2048, out_features=256, bias=True)
              (v_proj): Linear(in_features=2048, out_features=256, bias=True)
              (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
              (rotary_emb): Qwen2RotaryEmbedding()
            )
            (mlp): Qwen2MLP(
              (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
            (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
          )
        )
        (norm): Qwen2RMSNorm((2048,), eps=1e-06)
      )
      (score): Linear(in_features=2048, out_features=2, bias=False)
    )
  )
)
Prediction
txt = "退錢,什麼辣雞基金"
model_inputs = tokenizer([txt], max_length=MAX_LEN, truncation=True, return_tensors="pt").to(saved_model.device)
logits = saved_model(**model_inputs).logits
res = int(torch.argmax(logits, axis=1).cpu())
print(id2label[res])
負向
Type of the output:
SequenceClassifierOutputWithPast(loss=None, logits=tensor([[-0.1387, 2.3438]], device='cuda:0', grad_fn=<IndexBackward0>), past_key_values=((tensor([[[[ -3.3750, 0.3164, 2.3125, ..., 56.5000, 26.0000, 87.0000],
[ -4.6875, 3.0312, 0.6875, ..., 57.7500, 24.3750, 86.0000],
[ -0.7109, 1.1094, -0.7383, ..., 56.7500, 24.8750, 86.5000],
...,
...,
[-0.2188, 0.2148, 0.4375, ..., -0.1016, 0.9336, -1.1016],
[ 1.3281, 0.3359, 1.3125, ..., -0.3906, 0.0312, -0.0391],
[ 0.8789, 0.5312, 1.4297, ..., 0.1797, -0.9609, -0.6445]]]],
device='cuda:0'))), hidden_states=None, attentions=None)
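If you need class probabilities rather than just the argmax, a softmax over the pooled logits works; a sketch continuing the prediction above (use_cache=False simply avoids materializing the past_key_values shown in the output):
import torch
import torch.nn.functional as F

with torch.no_grad():
    out = saved_model(**model_inputs, use_cache=False)
probs = F.softmax(out.logits.float(), dim=-1)
print({id2label[i]: round(p, 4) for i, p in enumerate(probs[0].tolist())})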
Run logs
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at /mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-10-09 23:54:07.615 | INFO | __main__:<module>:53 - start Trainingrank: 64
trainable params: 119,738,368 || all params: 3,205,681,152 || trainable%: 3.7352
[6/6 00:08, Epoch 3/3]
Epoch Training Loss Validation Loss F1
1 No log 0.988281 0.333333
2 No log 0.527344 0.733333
3 No log 0.453125 1.000000
2024-10-09 23:54:17.371 | INFO | __main__:<module>:56 - Valid Set, rank: 64
4it [00:00, 8.03it/s]
2024-10-09 23:54:17.896 | INFO | __main__:<module>:59 - None
2024-10-09 23:54:17.897 | INFO | __main__:<module>:61 - Test Set, rank: 64
Accuracy: 1.0
micro Precision: 1.0
micro Recall: 1.0
micro F1 Score: 1.0
macro Precision: 1.0
macro Recall: 1.0
macro F1 Score: 1.0
weighted Precision: 1.0
weighted Recall: 1.0
weighted F1 Score: 1.0
4it [00:00, 13.58it/s]
2024-10-09 23:54:18.218 | INFO | __main__:<module>:64 - None
Accuracy: 0.75
micro Precision: 0.75
micro Recall: 0.75
micro F1 Score: 0.75
macro Precision: 0.8333333333333333
macro Recall: 0.75
macro F1 Score: 0.7333333333333334
weighted Precision: 0.8333333333333333
weighted Recall: 0.75
weighted F1 Score: 0.7333333333333334
IV. My Own Test Results
Short text
Single-turn customer queries from everyday dialogues, sentiment polarity classification.
3 classes, maximum length 128 characters, about 6K training samples.
On these queries, the LLM cannot beat a basic BERT.
Accuracy: 0.9334389857369255
micro Precision: 0.9334389857369255
micro Recall: 0.9334389857369255
micro F1 Score: 0.9334389857369255
macro Precision: 0.9292774942877138
macro Recall: 0.9550788300142491
macro F1 Score: 0.9388312342456646
weighted Precision: 0.9418775412386249
weighted Recall: 0.9334389857369255
weighted F1 Score: 0.93383533375322
              precision    recall  f1-score   support
           0       1.00      0.88      0.93       334
           1       0.94      0.99      0.97       101
           2       0.85      0.99      0.92       196
    accuracy                           0.93
   macro avg       0.93
weighted avg       0.94
For comparison I used Chinese-RoBERTa-large-wwm and several of its variants. The 7B, 3B, 1.5B and 0.5B LLMs showed no advantage and could not beat the Large or Base encoder models. Even some pruned models with only a few tens of millions of parameters reach around 85, so the LLM shows no edge here.
Conclusions:
For short text, the LLM's advantages lie in few-shot settings, imbalanced labels, and generating pseudo-labels with a 72B-scale model via prompt + few-shot.
Only when you have more than ten thousand samples and the task is valuable enough is it worth trying a 14B+ model, confirming the gain through tuning, and then distilling it down.
The vast majority of short-text scenarios do not need a large model, unless it is a generation task such as query expansion or multi-turn query rewriting.
Long text
The data here is ASR-transcribed text.
Training set: 4,918 samples, average length 740 characters, maximum 4,631, 75th percentile 918.
With LoRA fine-tuning, 2 epochs usually work a bit better, and the rank needs some tuning.
epoch=1, rank=96, alpha=2*rank
Accuracy: 0.8415637860082305
micro Precision: 0.8415637860082305
micro Recall: 0.8415637860082305
micro F1 Score: 0.8415637860082305
macro Precision: 0.8075007129137883
macro Recall: 0.770659344467927
macro F1 Score: 0.7726373117446225
weighted Precision: 0.8509932419375813
weighted Recall: 0.8415637860082305
weighted F1 Score: 0.8420807262647815
              precision    recall  f1-score   support
           0       0.95      0.83      0.89       163
           1       0.76      0.77      0.77        66
           2       0.78      0.89      0.83        63
           3       0.81      0.81      0.81        42
           4       0.80      0.93      0.86        30
           5       0.48      0.56      0.51        18
           6       1.00      0.43      0.60         7
           7       0.88      0.95      0.92        97
epoch=3, rank=96, alpha=2*rank
Accuracy: 0.8847736625514403
micro Precision: 0.8847736625514403
micro Recall: 0.8847736625514403
micro F1 Score: 0.8847736625514403
macro Precision: 0.8765027065399982
macro Recall: 0.8400805218716799
macro F1 Score: 0.8527883278910355
weighted Precision: 0.8903846924862034
weighted Recall: 0.8847736625514403
weighted F1 Score: 0.8852820009557909
              precision    recall  f1-score   support
           0       0.94      0.89      0.91       163
           1       0.77      0.85      0.81        66
           2       0.81      0.88      0.83        42
           3       0.79      0.90      0.86        63
           4       1.00      0.93      0.97        30
           5       0.92      0.61      0.73        18
           6       0.83      0.71      0.77         7
           7       0.96      0.94      0.95        97
Related resources
- A fairly detailed write-up of classification-head fine-tuning for LLMs: https://zhuanlan.zhihu.com/p/704983302
- A comparison of LoRA fine-tuning of RoBERTa, Llama 2 and Mistral on disaster-tweet analysis: https://segmentfault.com/a/1190000044485544
- Full-parameter SFT code for classification-head fine-tuning (essentially the same code with the LoRA lines removed): https://github.com/muyaostudio/qwen2_seq_cls
- Another implementation on Zhihu: https://zhuanlan.zhihu.com/p/691459595
For related code, GitHub is worth searching, and Kaggle is also recommended; those are the two main places to look.