Fine-tuning BERT with PyTorch

Published by 楓胤雪 on 2020-12-06

This post adapts Chris McCormick's BERT fine-tuning tutorial to the Quora Question Pairs task of deciding whether two questions ask the same thing. (Most of the prose is a translation of the original.)

Original blog post: https://mccormickml.com/2019/07/22/BERT-fine-tuning/

Original Colab notebook: https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX

Project repository for this post: https://github.com/yxf975/pretraining_models_learning

Preface

This post drops much of the introductory material from the original English article and focuses on how to implement basic BERT fine-tuning. The approach here differs from Chris McCormick's in the following ways:

  • The dataset used is the Quora Question Pairs dataset
  • An option for multi-GPU execution has been added
  • Parts of the code have been wrapped into functions for easier reuse
  • A prediction step has been added

I will cover how pretrained models such as BERT actually work in a separate topic. Let's dive right in!

Preparation

Checking for a GPU

For torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load the data onto that device.

import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    n_gpu = torch.cuda.device_count()

    print('There are %d GPU(s) available.' % n_gpu)

    print('We will use the GPU:', [torch.cuda.get_device_name(i) for i in range(n_gpu)])

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
    n_gpu = 0  # keeps the later n_gpu checks valid on a CPU-only machine

Installing the Transformers library

At the moment, Hugging Face's Transformers library appears to be the most widely adopted and most powerful PyTorch interface for working with BERT. Besides supporting a variety of pretrained transformer models, the library also includes prebuilt variants of those models tailored to specific tasks; for example, in this tutorial we will use BertForSequenceClassification.

The library also includes task-specific classes for token classification, question answering, next-sentence prediction, and so on. Using these prebuilt classes simplifies the process of adapting BERT to your own purpose.

!pip install transformers
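
Besides BertForSequenceClassification, the library exposes other single-purpose heads through the same from_pretrained pattern. A quick illustration (this post only uses the sequence-classification head; the num_labels value below is just an example):

# Other ready-made task heads follow the same loading pattern (illustration only).
from transformers import BertForQuestionAnswering, BertForTokenClassification

qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
ner_model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)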

Loading the Quora Question Pairs data

The dataset is available on Kaggle; register and log in to download it: https://www.kaggle.com/c/quora-question-pairs

About the Quora Question Pairs dataset

This dataset comes from the Quora platform, where many people ask similarly worded questions. Multiple questions with the same intent can make searchers spend more time finding the best answer to a question, and make writers feel they need to answer several versions of the same question.

The task is to classify whether a pair of questions are duplicates, a natural language processing problem. Solving it makes it easier to find high-quality answers to questions, leading to a better experience for Quora writers, searchers, and readers.

Loading the data with pandas

import pandas as pd
import numpy as np

# Load the dataset into a pandas dataframe.
train_data = pd.read_csv("./datatrain.csv", index_col="id",nrows=10000)
train_data.head(6)

I display 6 rows here because the first positive example does not appear until the sixth row.

| id | qid1 | qid2 | question1 | question2 | is_duplicate |
|----|------|------|-----------|-----------|--------------|
| 0 | 1 | 2 | What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
| 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |
| 2 | 5 | 6 | How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0 |
| 3 | 7 | 8 | Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? | 0 |
| 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? | 0 |
| 5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and cap rising…what does that say about me? | I’m a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? | 1 |

The three fields we actually care about are "question1", "question2", and their label "is_duplicate" (0 = not duplicate, 1 = duplicate).

Train/validation split

Split our training data into 80% for training and 20% for validation.

from sklearn.model_selection import train_test_split

# train_validation data split
X_train, X_val, y_train, y_val = train_test_split(train_data[["question1", "question2"]], train_data["is_duplicate"], test_size=0.2, random_state=405633)

Tokenization & input formatting

BERT Tokenizer

from transformers import BertTokenizer

# load bert tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
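
To see what the tokenizer produces for a question pair, here is a quick check on a made-up pair of sentences (illustration only); BERT marks the two segments with special tokens and token_type_ids:

# Sanity check on a made-up question pair (illustration only).
example = tokenizer.encode_plus(
    "What is the capital of France?",
    "Which city is France's capital?",
)
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))
# ['[CLS]', 'what', 'is', ..., '?', '[SEP]', 'which', 'city', ..., '?', '[SEP]']
print(example["token_type_ids"])  # 0s for question1 tokens, 1s for question2 tokens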

Finding the maximum sentence length in the data

#calculate the maximum sentence length
max_len  = 0
for _, row in train_data.iterrows():
    max_len = max(max_len, len(tokenizer(row['question1'],row['question2'])["input_ids"]))

print("max token length of the input:", max_len)
    
# set the maximum token length
max_length = pow(2,int(np.log2(max_len)+1))
print("max token length for BERT:", max_length)

Converting to BERT inputs

from torch.utils.data import TensorDataset
from tqdm import tqdm

# func to convert data to bert input
def convert_to_dataset_torch(data: pd.DataFrame, labels = pd.Series(data=None)) -> TensorDataset:
    input_ids = []
    attention_masks = []
    token_type_ids = []
    for _, row in tqdm(data.iterrows(), total=data.shape[0]):
        encoded_dict = tokenizer.encode_plus(row["question1"], row["question2"], max_length=max_length, pad_to_max_length=True, 
                      return_attention_mask=True, return_tensors='pt', truncation=True)
        # Add the encoded sentences to the list.
        input_ids.append(encoded_dict['input_ids'])
        token_type_ids.append(encoded_dict["token_type_ids"])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])
    
    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    token_type_ids = torch.cat(token_type_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    if labels.empty:
        return TensorDataset(input_ids, attention_masks, token_type_ids)
    else:
        labels = torch.tensor(labels.values)
        return TensorDataset(input_ids, attention_masks, token_type_ids, labels)

train = convert_to_dataset_torch(X_train, y_train)
validation = convert_to_dataset_torch(X_val, y_val)
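
Note that pad_to_max_length has since been deprecated in the transformers library; on a recent release the equivalent call (a sketch, assuming a reasonably recent transformers version) looks like this:

# Equivalent encoding call on newer transformers releases (assumption: a recent version).
encoded_dict = tokenizer.encode_plus(
    "first question text", "second question text",   # placeholder strings
    max_length=max_length,
    padding="max_length",          # replaces pad_to_max_length=True
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt",
)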

Putting the data into a DataLoader

We will also use the torch DataLoader class to create an iterator over our dataset. This helps save memory during training because, unlike a plain for loop, an iterator does not require the entire dataset to be loaded into memory.

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# set batch size for DataLoader(options from paper:16 or 32)
batch_size = 32

# Create the DataLoaders for training and validation sets
train_dataloader = DataLoader(
            train,  
            sampler = RandomSampler(train), # Select batches randomly
            batch_size = batch_size 
        )

# For validation
validation_dataloader = DataLoader(
            validation, 
            sampler = SequentialSampler(validation), # Pull out batches sequentially.
            batch_size = batch_size 
        )
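
A quick way to sanity-check the pipeline is to pull a single batch and look at the tensor shapes (illustration only):

# Each of the first three tensors should be (batch_size, max_length); labels should be (batch_size,).
batch = next(iter(train_dataloader))
for name, tensor in zip(["input_ids", "attention_mask", "token_type_ids", "labels"], batch):
    print(name, tuple(tensor.shape))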

Loading the model

Loading the pretrained BertForSequenceClassification model

We will use BertForSequenceClassification. This is the plain BERT model with a single linear classification layer added on top, which we will use as a sentence-pair classifier. As we feed in data, the entire pretrained BERT model and the additional untrained classification layer are trained on our specific task.

from transformers import BertForSequenceClassification, AdamW, BertConfig

# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU (or the CPU if none is available).
model.to(device)
if n_gpu > 1:
    model = torch.nn.DataParallel(model)

Of course, you could also modify the BERT architecture to better suit the task; here I simply use the stock model.
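
If you do want to change the architecture, one common pattern is to put your own classification head on top of the base BertModel rather than using the prepackaged class. A minimal sketch of that idea (my own illustration, not what the rest of this post uses):

import torch.nn as nn
from transformers import BertModel

class BertPairClassifier(nn.Module):
    """Bare-bones example: BERT encoder + dropout + a linear layer over the pooled [CLS] vector."""
    def __init__(self, num_labels=2, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled = outputs[1]  # pooled output for the [CLS] token
        return self.classifier(self.dropout(pooled))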

Optimizer & learning rate scheduler

For fine-tuning, the authors of the BERT paper recommend choosing from the following values (from Appendix A.3 of the BERT paper):

  • Batch size: 16, 32 (set in the DataLoader above)
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4

from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate
                  eps = 1e-8 # args.adam_epsilon
                )

# Number of training epochs
epochs = 2

# Total number of training steps is [number of batches] x [number of epochs]. 
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

Training

Time formatting helper function

import time
import datetime

# Helper function for formatting elapsed times as hh:mm:ss
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

Fit function

from tqdm import tqdm

def fit_batch(dataloader, model, optimizer, epoch):
    total_train_loss = 0
    
    for batch in tqdm(dataloader, desc=f"Training epoch:{epoch+1}", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)
        
        # clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()
        
        # Perform a forward pass (evaluate the model on this training batch).
        outputs = model(input_ids, 
                        token_type_ids=token_type_ids, 
                        attention_mask=attention_masks, 
                        labels=labels)
        loss = outputs[0]
        if n_gpu > 1:
            loss = loss.mean()  # average the per-GPU losses returned by DataParallel
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0 to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()
        
    return total_train_loss

Evaluation function

from sklearn.metrics import accuracy_score

def eval_batch(dataloader, model, metric=accuracy_score):
    total_eval_accuracy = 0
    total_eval_loss = 0
    predictions , predicted_labels = [], []
    
    for batch in tqdm(dataloader, desc="Evaluating", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(input_ids, 
                            token_type_ids=token_type_ids, 
                            attention_mask=attention_masks,
                            labels=labels)
            loss = outputs[0]
            logits = outputs[1]
        if n_gpu > 1:
            loss = loss.mean()  # average the per-GPU losses returned by DataParallel
        total_eval_loss += loss.item()
        
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        y_pred = np.argmax(logits, axis=1).flatten()
        total_eval_accuracy += metric(label_ids, y_pred)
        
        predictions.extend(logits.tolist())
        predicted_labels.extend(y_pred.tolist())
    
    return total_eval_accuracy, total_eval_loss, predictions ,predicted_labels

Training function

def train(train_dataloader, validation_dataloader, model, optimizer, epochs):
    # list to store a number of quantities such as 
    # training and validation loss, validation accuracy, and timings.
    training_stats = []
    
    # Measure the total training time for the whole run.
    total_t0 = time.time()
    
    for epoch in range(0, epochs):
        # Measure how long the training epoch takes.
        t0 = time.time()
        
        # Reset the total loss for this epoch.
        total_train_loss = 0
        
        # Put the model into training mode. 
        model.train()
        
        total_train_loss = fit_batch(train_dataloader, model, optimizer, epoch)
        
        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)
        
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)
        
        t0 = time.time()
        
        # Put the model in evaluation mode--the dropout layers behave differently
        # during evaluation.
        model.eval()
        

        total_eval_accuracy, total_eval_loss, _, _ = eval_batch(validation_dataloader, model)
        
        # Report the final accuracy for this validation run.
        avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
        print("\n")
        print(f"score: {avg_val_accuracy}")
    
        # Calculate the average loss over all of the batches.
        avg_val_loss = total_eval_loss / len(validation_dataloader)
    
        # Measure how long the validation run took.
        validation_time = format_time(time.time() - t0)
    
        print(f"Validation Loss: {avg_val_loss}")
        print("\n")
    
        # Record all statistics from this epoch.
        training_stats.append(
            {
                'epoch': epoch,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Valid. score.': avg_val_accuracy,
                'Training Time': training_time,
                'Validation Time': validation_time
            }
        )
        

    print("")
    print("Training complete!")

    print(f"Total training took {format_time(time.time()-total_t0)}")
    return training_stats

Starting training

import random

# Set the seed value all over the place to make this reproducible.
seed_val = 405633

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
if n_gpu > 0:
    torch.cuda.manual_seed_all(seed_val)

training_stats = train(train_dataloader, validation_dataloader, model, optimizer, epochs)

Viewing the evaluation statistics from training

df_stats = pd.DataFrame(training_stats).set_index('epoch')
df_stats

Prediction

Prediction function

def predict(dataloader, model):
    prediction = list()
    
    for batch in tqdm(dataloader, desc="predicting", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(input_ids, 
                            token_type_ids=token_type_ids, 
                            attention_mask=attention_masks)
        logits = outputs[0]
        
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        
        prediction.append(logits)
        
    pred_logits = np.concatenate(prediction, axis=0)
    pred_label = np.argmax(pred_logits, axis=1).flatten()
    print("done")
    return (pred_label,pred_logits)

Creating a DataLoader for the test set
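
The test set (test_data) is not loaded earlier in this post; a minimal sketch, assuming Kaggle's test.csv (columns test_id, question1, question2) sits in the working directory:

# Assumption: Kaggle's test.csv is in the working directory; nrows keeps the demo small.
test_data = pd.read_csv("./test.csv", index_col="test_id", nrows=1000)
test_data = test_data.fillna("")  # guard against any missing question text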

# Create the DataLoader for test data.
prediction_data = convert_to_dataset_torch(test_data)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

Running prediction

You can also use softmax to convert the logits into the corresponding probabilities.

y_pred,logits = predict(prediction_dataloader,model)
# get the corresponding probablities
prob = torch.nn.functional.softmax(torch.tensor(logits), dim=1)
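
To produce Kaggle-style submission output, the probability of the positive class can be paired with the test ids (a sketch, assuming test_data keeps test_id as its index as in the loading sketch above):

# Probability of class 1 ("is_duplicate") for each test pair.
submission = pd.DataFrame({
    "test_id": test_data.index,
    "is_duplicate": prob[:, 1].numpy(),
})
submission.to_csv("submission.csv", index=False)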

Summary

This post demonstrated how to fine-tune a pretrained BERT model for the Quora question-pair task. A similar fine-tuning approach can be applied to other, comparable text classification problems.

Of course, if you want more accurate predictions, you may need a better or more suitable pretrained model, a network architecture modified to better fit the task at hand, or additional techniques such as adversarial training.
