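Fine-tuning BERT in PyTorch
This article adapts and optimizes Chris McCormick's BERT fine-tuning tutorial for the task of deciding whether the two questions in a pair from the Quora Question Pairs dataset are duplicates. (Most of the text is a translation of the original.)
Original blog post: https://mccormickml.com/2019/07/22/BERT-fine-tuning/
Original Colab notebook: https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX
Project repository for this article: https://github.com/yxf975/pretraining_models_learning
Preface
This article drops much of the introductory material in the original English post and focuses on how to implement basic BERT fine-tuning. It differs from Chris McCormick's version in the following ways:
- The dataset is the Quora Question Pairs dataset
- An option to run on multiple GPUs has been added
- Parts of the code are wrapped in functions for easier reuse
- A prediction section has been added
I will cover the theory behind BERT and other pretrained models in a separate post. Let's get started!
Preparation
Checking the GPU
For torch to use the GPU, we need to identify the GPU and specify it as the device. Later, in the training loop, we will move the data onto that device.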
import torch
# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    n_gpu = torch.cuda.device_count()
    print('There are %d GPU(s) available.' % n_gpu)
    print('We will use the GPU:', [torch.cuda.get_device_name(i) for i in range(n_gpu)])
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
    # Define n_gpu on the CPU path as well, because later code checks it.
    n_gpu = 0
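Installing the Transformers library
At present, Hugging Face's Transformers library appears to be the most widely accepted and most capable PyTorch interface for working with BERT. Besides supporting a variety of pretrained transformer models, the library also includes prebuilt modifications of these models suited to specific tasks. In this tutorial, for example, we will use BertForSequenceClassification.
The library also includes task-specific classes for token classification, question answering, next sentence prediction, and more. Using these prebuilt classes simplifies the process of adapting BERT to your purposes.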
!pip install transformers
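Loading the Quora Question Pairs data
The dataset is hosted on Kaggle and can be downloaded after registering and logging in: https://www.kaggle.com/c/quora-question-pairs
About the Quora Question Pairs dataset
The dataset comes from the Quora platform, where many people ask similarly worded questions. Multiple questions with the same intent can make seekers spend more time finding the best answer and make writers feel they need to answer several versions of the same question.
The task is to classify whether a pair of questions is a duplicate, which is a natural language processing problem. Solving it makes high-quality answers easier to find and improves the experience for Quora's writers, seekers, and readers.
Loading the data with pandas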
import pandas as pd
import numpy as np
# Load the dataset into a pandas dataframe.
train_data = pd.read_csv("./datatrain.csv", index_col="id",nrows=10000)
train_data.head(6)
I display six rows here because the first positive example does not appear until the sixth row.
id | qid1 | qid2 | question1 | question2 | is_duplicate |
---|---|---|---|---|---|
0 | 1 | 2 | What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |
2 | 5 | 6 | How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0 |
3 | 7 | 8 | Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? | 0 |
4 | 9 | 10 | Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? | 0 |
5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and cap rising…what does that say about me? | I’m a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? | 1 |
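The three columns we actually care about are "question1", "question2", and their label "is_duplicate", which indicates whether the pair is a duplicate (0 = not duplicate, 1 = duplicate).
Train/validation split
We split the training set into 80% for training and 20% for validation.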
from sklearn.model_selection import train_test_split
# train_validation data split
X_train, X_val, y_train, y_val = train_test_split(train_data[["question1", "question2"]], train_data["is_duplicate"], test_size=0.2, random_state=405633)
Tokenization & Input Formatting
BERT Tokenizer
from transformers import BertTokenizer
# load bert tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
Finding the longest input length in the data
# Calculate the maximum token length over all question pairs.
max_len = 0
for _, row in train_data.iterrows():
    max_len = max(max_len, len(tokenizer(row['question1'], row['question2'])["input_ids"]))
print("max token length of the input:", max_len)
# Round up to the next power of two, capped at 512 (the longest input bert-base-uncased accepts).
max_length = min(512, pow(2, int(np.log2(max_len) + 1)))
print("max token length for BERT:", max_length)
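The rounding step can be checked in isolation. The following is a minimal sketch, not part of the original notebook: the helper name `padded_length` is hypothetical, and the cap of 512 tokens is an assumption added here because that is the position limit of bert-base-uncased. It uses integer bit operations instead of np.log2, which gives the same power of two for positive integers.

```python
def padded_length(max_len: int, cap: int = 512) -> int:
    """Smallest power of two strictly greater than max_len, capped at `cap`."""
    if max_len < 1:
        raise ValueError("max_len must be a positive integer")
    # bit_length() equals floor(log2(n)) + 1, so 1 << bit_length is 2 ** (floor(log2(n)) + 1).
    return min(cap, 1 << max_len.bit_length())


for n in (100, 128, 300, 600):
    print(n, "->", padded_length(n))
```

Note that an input length that is already a power of two (128 here) is rounded up to the next one (256), which mirrors the formula used in the article.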
Converting to BERT input
from torch.utils.data import TensorDataset
from tqdm import tqdm
# Convert a DataFrame of question pairs (and optional labels) into a TensorDataset of BERT inputs.
def convert_to_dataset_torch(data: pd.DataFrame, labels: pd.Series = None) -> TensorDataset:
    input_ids = []
    attention_masks = []
    token_type_ids = []
    for _, row in tqdm(data.iterrows(), total=data.shape[0]):
        # Encode the pair as [CLS] question1 [SEP] question2 [SEP], padded or truncated to max_length.
        encoded_dict = tokenizer.encode_plus(row["question1"], row["question2"], max_length=max_length, padding='max_length',
                                             return_attention_mask=True, return_tensors='pt', truncation=True)
        # Add the encoded sentences to the list.
        input_ids.append(encoded_dict['input_ids'])
        token_type_ids.append(encoded_dict["token_type_ids"])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])
    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    token_type_ids = torch.cat(token_type_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    # Without labels (prediction time), return only the model inputs.
    if labels is None or labels.empty:
        return TensorDataset(input_ids, attention_masks, token_type_ids)
    labels = torch.tensor(labels.values)
    return TensorDataset(input_ids, attention_masks, token_type_ids, labels)
train = convert_to_dataset_torch(X_train, y_train)
validation = convert_to_dataset_torch(X_val, y_val)
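To make the three tensors easier to picture, here is a toy sketch of how a question pair is laid out. It is an illustration only: `encode_pair` is a hypothetical helper that works on whitespace-split words, whereas the real tokenizer uses WordPiece and returns vocabulary ids. The trimming rule (shorten the longer question first) is an approximation of the tokenizer's default truncation behavior.

```python
def encode_pair(tokens_a, tokens_b, max_length, pad="[PAD]"):
    """Toy BERT-style pair encoding on pre-split tokens (no WordPiece, no vocabulary ids)."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP] tokens")
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve three positions for the special tokens, trimming the longer side first.
    while len(tokens_a) + len(tokens_b) > max_length - 3:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], question 1 and its [SEP]; 1 for question 2 and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    n_pad = max_length - len(tokens)
    return {
        "tokens": tokens + [pad] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        # The attention mask is 1 for real tokens and 0 for padding.
        "attention_mask": [1] * len(tokens) + [0] * n_pad,
    }


enc = encode_pair("how do i learn python".split(),
                  "best way to learn python".split(),
                  max_length=16)
print(enc["tokens"])
print(enc["token_type_ids"])
print(enc["attention_mask"])
```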
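Putting the data into a DataLoader
We also use torch's DataLoader class to create an iterator over each dataset. The iterator serves one batch at a time, so during training only the current batch has to be moved to the device rather than the whole dataset.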
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# Set the batch size for the DataLoader (options from the paper: 16 or 32).
batch_size = 32
# Create the DataLoaders for the training and validation sets.
train_dataloader = DataLoader(
    train,
    sampler = RandomSampler(train),  # Select batches randomly.
    batch_size = batch_size
)
# For validation
validation_dataloader = DataLoader(
    validation,
    sampler = SequentialSampler(validation),  # Pull out batches sequentially.
    batch_size = batch_size
)
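As a quick sanity check on the sizes involved, the number of batches can be worked out by hand. This is a sketch under the article's own settings (10,000 rows read, a 20% validation split, batch size 32, and the epochs = 2 setting used later); `num_batches` is a hypothetical helper, and it assumes no rows are dropped before the split.

```python
import math


def num_batches(n_examples: int, batch_size: int) -> int:
    # With the DataLoader default drop_last=False, a final partial batch is still yielded.
    return math.ceil(n_examples / batch_size)


n_rows, val_percent, batch_size, epochs = 10_000, 20, 32, 2
n_val = math.ceil(n_rows * val_percent / 100)
n_train = n_rows - n_val
train_batches = num_batches(n_train, batch_size)
val_batches = num_batches(n_val, batch_size)
total_steps = train_batches * epochs
print(n_train, n_val, train_batches, val_batches, total_steps)
```

The last validation batch holds only 16 examples, since 2000 is not a multiple of 32.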
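Loading the model
Loading the pretrained BertForSequenceClassification model
We will use BertForSequenceClassification. This is the plain BERT model with a single linear layer added on top for classification, which we will use as a sentence classifier. As we feed in data, the entire pretrained BERT model and the additional untrained classification layer are trained together on our specific task.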
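The "single linear layer on top" can be sketched in a few lines of plain Python. This is a toy illustration with made-up numbers, not the library's implementation: the 4-dimensional `hidden` vector stands in for BERT's pooled output (768-dimensional in bert-base), the weights are arbitrary, and `linear_head` and `cross_entropy` are hypothetical helper names.

```python
import math


def linear_head(hidden, weight, bias):
    """logits[k] = sum_j weight[k][j] * hidden[j] + bias[k]"""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]


def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


hidden = [0.5, -1.0, 2.0, 0.0]            # stand-in for the pooled sentence-pair vector
weight = [[0.1, 0.2, 0.0, 0.3],           # one row of weights per class (2 classes)
          [-0.2, 0.1, 0.4, 0.0]]
bias = [0.0, 0.1]

logits = linear_head(hidden, weight, bias)
loss = cross_entropy(logits, label=1)
print(logits, loss)
```

During fine-tuning, the gradient of this loss flows back through both the new layer and the pretrained BERT weights.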
from transformers import BertForSequenceClassification
# Use PyTorch's AdamW; the AdamW class in transformers is deprecated in favor of it.
from torch.optim import AdamW
# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2,  # The number of output labels: 2 for binary classification.
                     # You can increase this for multi-class tasks.
    output_attentions = False,  # Whether the model returns attention weights.
    output_hidden_states = False,  # Whether the model returns all hidden states.
)
# Move the model to the selected device (GPU if available, otherwise CPU).
model.to(device)
# Wrap the model in DataParallel when more than one GPU is present.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
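You could of course modify the BERT network structure to suit the task; here I use the original model as is.
Optimizer & learning rate scheduler
For fine-tuning, the authors of the BERT paper recommend choosing from the following values (from Appendix A.3 of the BERT paper):
- Batch size: 16, 32 (set in the DataLoader)
- Learning rate (Adam): 5e-5, 3e-5, 2e-5
- Number of epochs: 2, 3, 4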
from transformers import get_linear_schedule_with_warmup
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,  # args.learning_rate
                  eps = 1e-8  # args.adam_epsilon
                  )
# Number of training epochs
epochs = 2
# Total number of training steps is [number of batches] x [number of epochs].
total_steps = len(train_dataloader) * epochs
# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,  # Default value in run_glue.py
                                            num_training_steps = total_steps)
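The shape of this schedule is easy to reproduce by hand: the learning rate rises linearly during warmup (none here) and then decays linearly to zero at the last step. The sketch below is a plain-Python approximation of that rule, not the library code; `linear_schedule_lr` is a hypothetical name, and the 500 total steps assume 250 training batches times 2 epochs.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup phase: scale up from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: scale down from base_lr to 0 at num_training_steps.
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


for step in (0, 125, 250, 375, 500):
    print(step, linear_schedule_lr(step, base_lr=2e-5, num_training_steps=500))
```

With zero warmup steps the very first update already uses the full 2e-5, and the rate is halved by the midpoint of training.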
Training
Time formatting helper
import time
import datetime
# Helper function for formatting elapsed times as hh:mm:ss
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
The fit function
from tqdm import tqdm
# Run one training epoch and return the summed batch losses.
# Relies on the global `device` and `scheduler` defined earlier.
def fit_batch(dataloader, model, optimizer, epoch):
    total_train_loss = 0
    for batch in tqdm(dataloader, desc=f"Training epoch:{epoch+1}", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)
        # Clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()
        # Perform a forward pass (evaluate the model on this training batch).
        outputs = model(input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_masks,
                        labels=labels)
        # Under DataParallel the loss comes back as one value per GPU, so average it
        # (a no-op for a single scalar).
        loss = outputs[0].mean()
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the gradient norm to 1.0 to avoid exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters and take a step using the computed gradient.
        optimizer.step()
        # Update the learning rate.
        scheduler.step()
    return total_train_loss
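Gradient clipping by norm rescales all gradients together when their combined L2 norm exceeds the threshold, and leaves them alone otherwise. Below is a minimal plain-Python sketch of that idea on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale `grads` so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, clipped, norm_after)

unchanged, _ = clip_by_global_norm([0.3, 0.4])
print(unchanged)
```

The direction of the gradient is preserved; only its length changes.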
Validation and evaluation function
from sklearn.metrics import accuracy_score
# Evaluate the model on a dataloader; returns the summed per-batch metric, the summed loss,
# the raw logits, and the predicted labels.
def eval_batch(dataloader, model, metric=accuracy_score):
    total_eval_accuracy = 0
    total_eval_loss = 0
    predictions, predicted_labels = [], []
    for batch in tqdm(dataloader, desc="Evaluating", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(input_ids,
                            token_type_ids=token_type_ids,
                            attention_mask=attention_masks,
                            labels=labels)
        # Average the loss in case DataParallel returned one value per GPU.
        loss = outputs[0].mean()
        logits = outputs[1]
        total_eval_loss += loss.item()
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        y_pred = np.argmax(logits, axis=1).flatten()
        total_eval_accuracy += metric(label_ids, y_pred)
        predictions.extend(logits.tolist())
        predicted_labels.extend(y_pred.tolist())
    return total_eval_accuracy, total_eval_loss, predictions, predicted_labels
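One detail worth knowing about summing per-batch accuracies and dividing by the number of batches: when the last batch is smaller than the others, the result is not exactly the accuracy over all examples. The toy sketch below exaggerates the effect with made-up labels; `accuracy` and `batches` are hypothetical helpers written for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def batches(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0]
size = 4

per_batch = [accuracy(t, p) for t, p in zip(batches(y_true, size), batches(y_pred, size))]
mean_of_batches = sum(per_batch) / len(per_batch)
overall = accuracy(y_true, y_pred)
print(per_batch, mean_of_batches, overall)
```

With a few thousand validation examples and only one short final batch the gap is likely small, but computing the metric once over the returned predicted labels avoids it entirely.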
Training function
def train(train_dataloader, validation_dataloader, model, optimizer, epochs):
    # List to store quantities such as training and validation loss,
    # validation accuracy, and timings.
    training_stats = []
    # Measure the total training time for the whole run.
    total_t0 = time.time()
    for epoch in range(0, epochs):
        # Measure how long the training epoch takes.
        t0 = time.time()
        # Put the model into training mode.
        model.train()
        total_train_loss = fit_batch(train_dataloader, model, optimizer, epoch)
        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)
        t0 = time.time()
        # Put the model in evaluation mode, because the dropout layers behave
        # differently during evaluation.
        model.eval()
        total_eval_accuracy, total_eval_loss, _, _ = eval_batch(validation_dataloader, model)
        # Report the final accuracy for this validation run.
        avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
        print("\n")
        print(f"score: {avg_val_accuracy}")
        # Calculate the average loss over all of the batches.
        avg_val_loss = total_eval_loss / len(validation_dataloader)
        # Measure how long the validation run took.
        validation_time = format_time(time.time() - t0)
        print(f"Validation Loss: {avg_val_loss}")
        print("\n")
        # Record all statistics from this epoch.
        training_stats.append(
            {
                'epoch': epoch,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Valid. score.': avg_val_accuracy,
                'Training Time': training_time,
                'Validation Time': validation_time
            }
        )
    print("")
    print("Training complete!")
    print(f"Total training took {format_time(time.time()-total_t0)}")
    return training_stats
Start training
import random
# Set the seed value all over the place to make this reproducible.
seed_val = 405633
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed_val)
training_stats = train(train_dataloader, validation_dataloader, model, optimizer, epochs)
Viewing the evaluation statistics from training
df_stats = pd.DataFrame(training_stats).set_index('epoch')
df_stats
Prediction
Prediction function
def predict(dataloader, model):
    # Make sure dropout is disabled while predicting.
    model.eval()
    prediction = list()
    for batch in tqdm(dataloader, desc="predicting", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(input_ids,
                            token_type_ids=token_type_ids,
                            attention_mask=attention_masks)
        logits = outputs[0]
        # Move logits to CPU
        logits = logits.detach().cpu().numpy()
        prediction.append(logits)
    pred_logits = np.concatenate(prediction, axis=0)
    pred_label = np.argmax(pred_logits, axis=1).flatten()
    print("done")
    return (pred_label, pred_logits)
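Creating a DataLoader for the test set
# Create the DataLoader for test data.
# test_data is assumed to be a DataFrame with "question1" and "question2" columns,
# loaded beforehand (for example with pd.read_csv from the Kaggle test file).
prediction_data = convert_to_dataset_torch(test_data)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)
Running prediction
You can also use softmax to convert the logits into the corresponding probabilities.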
y_pred, logits = predict(prediction_dataloader, model)
# Get the corresponding probabilities (softmax over the class dimension).
prob = torch.nn.functional.softmax(torch.tensor(logits), dim=1)
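For reference, the softmax step itself is small enough to write out by hand. The sketch below is a plain-Python version applied row by row to made-up logits; the `softmax` helper and the example values are assumptions for illustration, not output from the model.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[2.0, 0.0], [-1.0, 1.0]]   # made-up logits for two question pairs
probs = [softmax(r) for r in logits]
# The predicted label is the index of the largest probability (same as argmax of the logits).
labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(probs, labels)
```

Each row gives the probabilities for "not duplicate" (index 0) and "duplicate" (index 1).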
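Summary
This article showed how to fine-tune a pretrained BERT model for the Quora question pair task. A similar fine-tuning approach can be used for other, comparable text classification problems.
For more accurate predictions, you may need a better-suited pretrained model, changes to the network so that it fits the task more closely, or techniques such as adversarial training.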