Bert文字分類實踐（二）：魔改Bert，融合TextCNN的新思路

盛小賤吖發表於2021-10-11

原文網址 : https://www.cnblogs.com/qingyao/p/15395033.html

寫在前面

文字分類是nlp中一個非常重要的任務，也是非常適合入坑nlp的第一個完整專案。雖然文字分類看似簡單，但裡面的門道好多好多，博主水平有限，只能將平時用到的方法和trick在此做個記錄和分享，希望各位看官都能有所收穫。並且儘可能提供給出簡潔，清晰的程式碼實現。

本文采用的文字分類模型是基於Bert和TextCNN的方法進行魔改，在博主實際的有關文字分類的工作中取得了F1值超越Bert基礎模型近4%的效果。大家可以用自己的資料嘗試一下哦 - _ *

有關於Bert文字分類baseline版本請參考此處

第一部分

模型

Bert模型是Google在2018年10月釋出的語言表示模型，一經問世在NLP領域橫掃了11項任務的最優結果，可謂風頭一時無二。有關於Bert中transformer的模型細節，推薦看這篇。在此不做贅述。

圖一：bert分類模型結構

Bert文字分類模型常見做法為將bert最後一層輸出的第一個token位置（CLS位置）當作句子的表示，後接全連線層進行分類。

圖二：TextCNN分類模型結構

在Bert問世前，TextCNN在文字分類模型中佔據了舉足輕重的位置，源於Cnn網路可以很有效的捕捉文字序列中的n-gram資訊，而分類任務從本質上講是捕捉n-gram排列組合特徵，捕捉關鍵詞，內容，還是句子的上層語義，在句子中均是以n-gram特徵的形式存在的。博主在做完Bert和TextCNN的實驗驚奇的發現，Bert往往可以對一些表述隱晦的句子進行更好的分類，TextCNN往往對關鍵詞更加敏感。所以博主魔改了一下模型，將Bert與TextCNN的思想融合在一起。

圖三：本文魔改模型結構

Bert-Base除去第一層輸入層，有12個encoder層，每個encode層的第一個token（CLS）向量都可以當作句子向量，我們可以抽象的理解為，encode層越淺，句子向量越能代表低階別語義資訊，越深，代表更高階別語義資訊。我們的目的是既想得到有關詞的特徵，又想得到語義特徵，模型具體做法是將第1層到第12層的CLS向量，作為CNN的輸入，分類。

話不多說我們直接看程式碼！

第二部分

pytorch程式碼實現

# -*- coding:utf-8 -*-
# bert融合textcnn思想的Bert+Blend-CNN
# model: Bert+Blend-CNN
# date: 2021.10.11 18:06:11

import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.nn.functional as F
import torch.optim as optim
import transformers
from transformers import AutoModel, AutoTokenizer
import matplotlib.pyplot as plt

train_curve = []
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# # 定義一些引數，模型選擇了最基礎的bert中文模型
batch_size = 2
epoches = 100
model = "bert-base-chinese"
hidden_size = 768
n_class = 2
maxlen = 8

encode_layer=12
filter_sizes = [2, 2, 2]
num_filters = 3

# data，構造一些訓練資料
sentences = ["我喜歡打籃球", "這個相機很好看", "今天玩的特別開心", "我不喜歡你", "太糟糕了", "真是件令人傷心的事情"]
labels = [1, 1, 1, 0, 0, 0]  # 1積極, 0消極.

class MyDataset(Data.Dataset):
  def __init__(self, sentences, labels=None, with_labels=True,):
    self.tokenizer = AutoTokenizer.from_pretrained(model)
    self.with_labels = with_labels
    self.sentences = sentences
    self.labels = labels
  def __len__(self):
    return len(sentences)

  def __getitem__(self, index):
    # Selecting sentence1 and sentence2 at the specified index in the data frame
    sent = self.sentences[index]

    # Tokenize the pair of sentences to get token ids, attention masks and token type ids
    encoded_pair = self.tokenizer(sent,
                    padding='max_length',  # Pad to max_length
                    truncation=True,       # Truncate to max_length
                    max_length=maxlen,  
                    return_tensors='pt')  # Return torch.Tensor objects

    token_ids = encoded_pair['input_ids'].squeeze(0)  # tensor of token ids
    attn_masks = encoded_pair['attention_mask'].squeeze(0)  # binary tensor with "0" for padded values and "1" for the other values
    token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # binary tensor with "0" for the 1st sentence tokens & "1" for the 2nd sentence tokens

    if self.with_labels:  # True if the dataset has labels
      label = self.labels[index]
      return token_ids, attn_masks, token_type_ids, label
    else:
      return token_ids, attn_masks, token_type_ids

train = Data.DataLoader(dataset=MyDataset(sentences, labels), batch_size=batch_size, shuffle=True, num_workers=1)

class TextCNN(nn.Module):
  def __init__(self):
    super(TextCNN, self).__init__()
    self.num_filter_total = num_filters * len(filter_sizes)
    self.Weight = nn.Linear(self.num_filter_total, n_class, bias=False)
    self.bias = nn.Parameter(torch.ones([n_class]))
    self.filter_list = nn.ModuleList([
      nn.Conv2d(1, num_filters, kernel_size=(size, hidden_size)) for size in filter_sizes
    ])

  def forward(self, x):
    # x: [bs, seq, hidden]
    x = x.unsqueeze(1) # [bs, channel=1, seq, hidden]
    
    pooled_outputs = []
    for i, conv in enumerate(self.filter_list):
      h = F.relu(conv(x)) # [bs, channel=1, seq-kernel_size+1, 1]
      mp = nn.MaxPool2d(
        kernel_size = (encode_layer-filter_sizes[i]+1, 1)
      )
      # mp: [bs, channel=3, w, h]
      pooled = mp(h).permute(0, 3, 2, 1) # [bs, h=1, w=1, channel=3]
      pooled_outputs.append(pooled)
    
    h_pool = torch.cat(pooled_outputs, len(filter_sizes)) # [bs, h=1, w=1, channel=3 * 3]
    h_pool_flat = torch.reshape(h_pool, [-1, self.num_filter_total])
    
    output = self.Weight(h_pool_flat) + self.bias # [bs, n_class]

    return output

# model
class Bert_Blend_CNN(nn.Module):
  def __init__(self):
    super(Bert_Blend_CNN, self).__init__()
    self.bert = AutoModel.from_pretrained(model, output_hidden_states=True, return_dict=True)
    self.linear = nn.Linear(hidden_size, n_class)
    self.textcnn = TextCNN()
    
  def forward(self, X):
    input_ids, attention_mask, token_type_ids = X[0], X[1], X[2]
    outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids) # 返回一個output字典
    # 取每一層encode出來的向量
    # outputs.pooler_output: [bs, hidden_size]
    hidden_states = outputs.hidden_states # 13*[bs, seq_len, hidden] 第一層是embedding層不需要
    cls_embeddings = hidden_states[1][:, 0, :].unsqueeze(1) # [bs, 1, hidden]
    # 將每一層的第一個token(cls向量)提取出來，拼在一起當作textcnn的輸入
    for i in range(2, 13):
      cls_embeddings = torch.cat((cls_embeddings, hidden_states[i][:, 0, :].unsqueeze(1)), dim=1)
    # cls_embeddings: [bs, encode_layer=12, hidden]
    logits = self.textcnn(cls_embeddings)
    return logits

bert_blend_cnn = Bert_Blend_CNN().to(device)

optimizer = optim.Adam(bert_blend_cnn.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

# train
sum_loss = 0
total_step = len(train)
for epoch in range(epoches):
  for i, batch in enumerate(train):
    optimizer.zero_grad()
    batch = tuple(p.to(device) for p in batch)
    pred = bert_blend_cnn([batch[0], batch[1], batch[2]])
    loss = loss_fn(pred, batch[3])
    sum_loss += loss.item()

    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
      print('[{}|{}] step:{}/{} loss:{:.4f}'.format(epoch+1, epoches, i+1, total_step, loss.item()))
  train_curve.append(sum_loss)
  sum_loss = 0

# test
bert_blend_cnn.eval()
with torch.no_grad():
  test_text = ['我不喜歡打籃球']
  test = MyDataset(test_text, labels=None, with_labels=False)
  x = test.__getitem__(0)
  x = tuple(p.unsqueeze(0).to(device) for p in x)
  pred = bert_blend_cnn([x[0], x[1], x[2]])
  pred = pred.data.max(dim=1, keepdim=True)[1]
  if pred[0][0] == 0:
    print('消極')
  else:
    print('積極')

pd.DataFrame(train_curve).plot() # loss曲線

測試單條樣本結果：

程式碼連結：

jupyter版本：https://github.com/PouringRain/blog_code/blob/main/nlp/Bert_Blend_CNN.ipynb

py版本：https://github.com/PouringRain/blog_code/blob/main/nlp/bert_blend_cnn.py

喜歡的話，給萌新的github倉庫一顆小星星哦……^ _^

Bert文字分類實踐（一）：實現一個簡單的分類模型
2021-10-10
文字分類模型
文字分類-TextCNN
2018-11-09
文字分類CNN
bert分類的程式碼
2024-06-21
Bert文字分類實踐（三）：處理樣本不均衡和提升模型魯棒性trick
2021-10-16
文字分類模型
5.2.2 用TextCNN做文字分類
2019-12-31
CNN文字分類
使用Bert預訓練模型文字分類（內附原始碼）
2019-03-13
模型文字分類原始碼
超詳細的 Bert 文字分類原始碼解讀 | 附原始碼
2021-06-03
文字分類原始碼
美團BERT的探索和實踐
2022-12-05
【BERT】詳解BERT
2024-06-15
中文新聞情感分類 Bert-Pytorch-transformers
2019-12-24
PyTorchORM
我的BERT！改改字典，讓BERT安全提速不掉分（已開源）
2020-09-25
BERT fine-tune 實踐終極教程
2018-11-23
BERT模型在京東零售業務的應用實踐
2019-03-30
模型
BERT時代與後時代的NLP（二）
2019-06-27
【BERT】你儲存的BERT模型為什麼那麼大？
2020-03-13
模型
transformers(1) 、bert
2024-04-16
ORM
搞定NLP領域的“變形金剛”！手把手教你用BERT進行多標籤文字分類
2019-02-19
文字分類
【NLP】BERT中文實戰踩坑
2018-12-06
Bert下載和使用（以bert-base-uncased為例）
2024-09-16
基於OCR進行Bert獨立語義糾錯實踐
2023-04-07
從RNN到BERT
2020-07-29
RNN
Roberta: Bert調優
2020-09-30
CNN+pytorch實現文字二分類
2021-07-07
CNNPyTorch
中文語料的 Bert finetune
2019-02-23
bert_dnn的程式碼
2024-06-21
DNN
[kotlin]帶分類的RecyclerView通用實現新思路
2018-06-21
KotlinView
百分點認知智慧實驗室出品：TPU與BERT的邂逅
2019-08-30
Bert時代的創新：Bert在NLP各領域的應用進展
2019-06-10
進一步改進GPT和BERT：使用Transformer的語言模型
2019-05-01
GPTORM模型
文字匹配之bert flow：On the Sentence Embeddings from Pre-trained Language Models
2020-12-31
AI
當Bert遇上Keras：這可能是Bert最簡單的開啟姿勢
2019-06-21
Keras
BERT來襲：燃爆AI
2018-10-19
AI
BERT 論文筆記
2019-01-18
筆記
8.3 BERT模型介紹
2020-01-08
模型
BERT 模型壓縮方法
2020-03-02
模型
使用BERT生成句向量
2019-02-19
【論文系列】之 ---- BERT
2024-11-10
7 papers | 對抗樣本前，BERT也不行；AutoML的商業實踐綜述
2019-07-22
TOML

Bert文字分類實踐（二）：魔改Bert，融合TextCNN的新思路

寫在前面

第一部分

模型

第二部分

pytorch程式碼實現

相關文章