BERT Text Classification in Practice (1): Implementing a Simple Classification Model

Posted by 盛小賤吖 on 2021-10-10

Preface

Text classification is a very important NLP task and a great first end-to-end project for anyone getting into the field. Although text classification looks simple, it hides a lot of subtleties. My skill is limited, so I will simply record and share here the methods and tricks I use in practice; I hope everyone who reads this takes something away and enjoys the fun of programming.

Part 1

Model

BERT is a language representation model released by Google in October 2018. On release it swept state-of-the-art results on 11 NLP tasks and was, for a time, unrivaled. For the model details of the Transformer inside BERT, I recommend reading this article; they will not be repeated here.

Figure 1: Structure of the BERT classification model

A common recipe for a BERT text classification model is to take the output at the first token position (the [CLS] position) of BERT's last layer as the sentence representation, and attach a fully connected layer on top for classification. The model itself is very simple; after a quick note on how to read that representation out of the Hugging Face API (sketched below), we go straight to the code!
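Hugging Face's BertModel exposes the sentence representation in two ways: outputs.last_hidden_state[:, 0, :] is the raw final-layer [CLS] hidden state, while outputs.pooler_output is the same [CLS] vector passed through an extra dense + tanh pooler layer. The script in Part 2 uses pooler_output. The snippet below is a minimal standalone sketch of both (it assumes a recent transformers 4.x, where the model output is dict-like by default); it is not part of the full script:

# Minimal sketch: two ways to read a sentence vector out of BertModel
from transformers import AutoModel, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

enc = tok("我喜歡打籃球", return_tensors="pt")
with torch.no_grad():
  out = bert(**enc)

cls_vec = out.last_hidden_state[:, 0, :]  # raw final-layer [CLS] hidden state, shape [1, 768]
pooled = out.pooler_output                # [CLS] after BERT's dense+tanh pooler, shape [1, 768]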

Part 2

PyTorch implementation

# -*- coding:utf-8 -*-
# BERT text classification baseline model
# model: bert
# date: 2021.10.10 10:01

import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.optim as optim
import transformers
from transformers import AutoModel, AutoTokenizer
import matplotlib.pyplot as plt

train_curve = []
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define some hyperparameters; the model is the most basic Chinese BERT
batch_size = 2
epoches = 100
model = "bert-base-chinese"
hidden_size = 768
n_class = 2
maxlen = 8

# data: construct a few toy training samples
sentences = ["我喜歡打籃球", "這個相機很好看", "今天玩的特別開心", "我不喜歡你", "太糟糕了", "真是件令人傷心的事情"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative


# Build the data in BERT's input format
# input_ids: token ids from the tokenizer's vocabulary
# attention_mask: same length as input_ids, 1 at real-token positions, 0 at padding positions
# token_type_ids: 0 for tokens of the first sentence, 1 for tokens of the second sentence
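# For example, with bert-base-chinese a 6-character sentence such as "我喜歡打籃球" is split
# character by character into [CLS] + 6 tokens + [SEP]; real-token positions get attention_mask = 1,
# padded positions get 0, and token_type_ids stays all 0 for a single sentence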
class MyDataset(Data.Dataset):
  def __init__(self, sentences, labels=None, with_labels=True):
    self.tokenizer = AutoTokenizer.from_pretrained(model)
    self.with_labels = with_labels
    self.sentences = sentences
    self.labels = labels

  def __len__(self):
    return len(self.sentences)

  def __getitem__(self, index):
    # Select the sentence at the given index
    sent = self.sentences[index]

    # Tokenize the pair of sentences to get token ids, attention masks and token type ids
    encoded_pair = self.tokenizer(sent,
                    padding='max_length',  # Pad to max_length
                    truncation=True,       # Truncate to max_length
                    max_length=maxlen,  
                    return_tensors='pt')  # Return torch.Tensor objects

    token_ids = encoded_pair['input_ids'].squeeze(0)  # tensor of token ids
    attn_masks = encoded_pair['attention_mask'].squeeze(0)  # binary tensor with "0" for padded values and "1" for the other values
    token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # binary tensor with "0" for the 1st sentence tokens & "1" for the 2nd sentence tokens

    if self.with_labels:  # True if the dataset has labels
      label = self.labels[index]
      return token_ids, attn_masks, token_type_ids, label
    else:
      return token_ids, attn_masks, token_type_ids

train = Data.DataLoader(dataset=MyDataset(sentences, labels), batch_size=batch_size, shuffle=True, num_workers=1)
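# Each batch from the DataLoader is a tuple of tensors:
# token_ids, attn_masks, token_type_ids of shape [batch_size, maxlen], and labels of shape [batch_size]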

# model
class BertClassify(nn.Module):
  def __init__(self):
    super(BertClassify, self).__init__()
    self.bert = AutoModel.from_pretrained(model, output_hidden_states=True, return_dict=True)
    self.linear = nn.Linear(hidden_size, n_class)  # feed the pooled [CLS] vector into a fully connected layer for classification
    self.dropout = nn.Dropout(0.5)
  
  def forward(self, X):
    input_ids, attention_mask, token_type_ids = X[0], X[1], X[2]
    outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)  # returns a dict-like model output
    # classify with the last-layer [CLS] vector, taken here via BERT's pooler
    # outputs.pooler_output: [bs, hidden_size]
    logits = self.linear(self.dropout(outputs.pooler_output))
    
    return logits

bc = BertClassify().to(device)

optimizer = optim.Adam(bc.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()
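# Note: nn.CrossEntropyLoss combines log-softmax and NLL loss internally, so the model's raw logits are passed in directly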

# train
sum_loss = 0
total_step = len(train)
for epoch in range(epoches):
  for i, batch in enumerate(train):
    optimizer.zero_grad()
    batch = tuple(p.to(device) for p in batch)
    pred = bc([batch[0], batch[1], batch[2]])
    loss = loss_fn(pred, batch[3])
    sum_loss += loss.item()

    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
      print('[{}|{}] step:{}/{} loss:{:.4f}'.format(epoch+1, epoches, i+1, total_step, loss.item()))
  train_curve.append(sum_loss)
  sum_loss = 0

# test
bc.eval()
with torch.no_grad():
  test_text = ['我不喜歡打籃球']
  test = MyDataset(test_text, labels=None, with_labels=False)
  x = test.__getitem__(0)
  x = tuple(p.unsqueeze(0).to(device) for p in x)
  pred = bc([x[0], x[1], x[2]])
  pred = pred.argmax(dim=1)  # index of the largest logit per sample
  if pred.item() == 0:
    print('negative')
  else:
    print('positive')
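
  # Optional sketch (not part of the original script): batched inference over several sentences,
  # reusing MyDataset with with_labels=False; '今天天氣真好' is just a made-up extra example sentence
  test_texts = ['我不喜歡打籃球', '今天天氣真好']
  test_loader = Data.DataLoader(MyDataset(test_texts, with_labels=False), batch_size=batch_size)
  for batch in test_loader:
    batch = tuple(p.to(device) for p in batch)
    preds = bc([batch[0], batch[1], batch[2]]).argmax(dim=1)
    print(preds.tolist())  # 0 = negative, 1 = positive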

pd.DataFrame(train_curve).plot()  # plot the training loss curve
plt.show()

Result on a single test sample:


Code links:

Jupyter notebook version: https://github.com/PouringRain/blog_code/blob/main/nlp/bert_classify.ipynb

Python script version: https://github.com/PouringRain/blog_code/blob/main/nlp/bert_classify.py

If you like it, please give this newbie's GitHub repo a little star... ^_^
