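Transformers 2.0 lets you call language models in three lines of code, with support for both TF 2.0 and PyTorch

Published by Jiqizhixin (机器之心) on 2019-09-27

Original article: http://www.jiqizhixin.com/articles/2019-09-27-14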

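NLP researchers have long wanted a flexible way to call different language models. HuggingFace recently open-sourced Transformers 2.0, the latest version of its model library. It lets users fine-tune and apply eight popular language model architectures, and it works with both TensorFlow 2.0 and PyTorch.

HuggingFace, a startup focused on natural language processing (NLP), recently released a major update to its popular Transformers library, giving it deep compatibility with the two major deep learning frameworks, PyTorch and TensorFlow 2.0.

The updated Transformers 2.0 combines the ease of use of PyTorch with the production-grade ecosystem of TensorFlow. With it, researchers and practitioners can more easily pick a different framework for the training, evaluation, and production stages of the same language model.

So what are the notable features of Transformers 2.0, and what does it improve for NLP researchers and practitioners? Jiqizhixin has put together a summary.

Project page: https://github.com/huggingface/transformers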

New features in Transformers 2.0

As easy to use as pytorch-transformers;
As powerful and concise as Keras;
High performance on NLU and NLG tasks;
A low barrier to entry for educators and practitioners.

State-of-the-art (SOTA) NLP for everyone

Deep learning researchers;
Hands-on practitioners;
AI/ML/NLP teachers and educators.

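Lower compute costs and a smaller carbon footprint

Researchers can share trained models instead of always retraining;
Practitioners can reduce compute time and production costs;
8 architectures with more than 30 pretrained models, some of them covering over 100 languages.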

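Choose the right framework for every stage of a model's lifetime

Train SOTA models in 3 lines of code;
Deep interoperability between TensorFlow 2.0 and PyTorch models;
Move a model between TensorFlow 2.0 and PyTorch at will;
Choose the right framework for training, evaluation, and production.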

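Supported models

The project provides a list of supported models. It covers well-known pretrained language models and their variants, and even includes DistilBERT, a distilled BERT model implemented by HuggingFace itself: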

1. BERT (https://github.com/google-research/bert)

2. GPT (https://github.com/openai/finetune-transformer-lm)

3. GPT-2 (https://blog.openai.com/better-language-models/)

4. Transformer-XL (https://github.com/kimiyoung/transformer-xl)

5. XLNet (https://github.com/zihangdai/xlnet/)

6. XLM (https://github.com/facebookresearch/XLM/)

7. RoBERTa (https://github.com/pytorch/fairseq/tree/master/examples/roberta)

8. DistilBERT (https://github.com/huggingface/transformers/tree/master/examples/distillation)

Quick start

How do you use the Transformers toolkit? The project provides many code examples. The following one loads each supported model and inspects its outputs:

import torch
from transformers import *

# Transformers has a unified API
# for 8 transformer architectures and 30 pretrained weights.
#          Model          | Tokenizer          | Pretrained weights shortcut

MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base')]

# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF',
# e.g. TFRobertaModel is the TF 2.0 counterpart of the PyTorch model RobertaModel.

# Let's encode some text in a sequence of hidden-states using each model:
for model_class, tokenizer_class, pretrained_weights in MODELS:
    # Load pretrained model/tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

    # Encode text
    input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # Models outputs are now tuples

# Each architecture is provided with several classes for fine-tuning on down-stream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
                      BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
                      BertForQuestionAnswering]

# All the classes for an architecture can be initialized from pretrained weights for this architecture.
# Note that additional weights added for fine-tuning are only initialized
# and need to be trained on the down-stream task.
pretrained_weights = 'bert-base-uncased'  # reset: the loop above left this set to 'roberta-base'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model
    model = model_class.from_pretrained(pretrained_weights)

    # Models can return full list of hidden-states & attentions weights at each layer
    model = model_class.from_pretrained(pretrained_weights,
                                        output_hidden_states=True,
                                        output_attentions=True)
    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
    all_hidden_states, all_attentions = model(input_ids)[-2:]

    # Models are compatible with Torchscript
    model = model_class.from_pretrained(pretrained_weights, torchscript=True)
    traced_model = torch.jit.trace(model, (input_ids,))

    # Simple serialization for models and tokenizers
    model.save_pretrained('./directory/to/save/')  # save
    model = model_class.from_pretrained('./directory/to/save/')  # re-load
    tokenizer.save_pretrained('./directory/to/save/')  # save
    tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # re-load

#SOTA examples for GLUE, SQUAD, text generation...
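
The naming rule mentioned in the code comments above (prefix a PyTorch class name with 'TF' to get its TensorFlow 2.0 counterpart) can be sketched with plain strings. The helper `tf_class_name` is hypothetical and not part of the Transformers library; it only illustrates the convention and does not import or load any model.

```python
def tf_class_name(pytorch_class_name: str) -> str:
    """Return the TensorFlow 2.0 class name for a PyTorch class name.

    Hypothetical helper: it applies the 'TF' prefix rule quoted in the
    example above and nothing else.
    """
    if pytorch_class_name.startswith("TF"):
        return pytorch_class_name  # already a TensorFlow-style name
    return "TF" + pytorch_class_name


# PyTorch class names taken from the MODELS list above.
PYTORCH_NAMES = ["BertModel", "GPT2Model", "RobertaModel"]
TF_NAMES = [tf_class_name(name) for name in PYTORCH_NAMES]
print(TF_NAMES)
```

With both frameworks installed, a name produced this way could presumably be resolved to a class with `getattr` on the `transformers` module, but that step is left out here so the sketch runs without any dependencies.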

Transformers supports both PyTorch and TensorFlow 2.0, and the two can be used together. The following code uses TensorFlow 2.0 with Transformers:

import tensorflow as tf
import tensorflow_datasets
from transformers import *

#Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')

#Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

#Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule 
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

#Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

#Load the TensorFlow model in PyTorch for inspection
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)

# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

# encode_plus returns a dict, so unpack it as keyword arguments
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
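
The values `steps_per_epoch=115` and `validation_steps=7` passed to `model.fit` above are not arbitrary. A minimal sketch of where they likely come from, assuming the GLUE MRPC split sizes are 3,668 training and 408 validation examples (these sizes are not stated in the article):

```python
import math

# Assumed split sizes for GLUE MRPC (not given in the article itself).
MRPC_TRAIN_EXAMPLES = 3668
MRPC_VALIDATION_EXAMPLES = 408


def steps_per_pass(num_examples: int, batch_size: int) -> int:
    """Batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


train_steps = steps_per_pass(MRPC_TRAIN_EXAMPLES, 32)       # .batch(32)
valid_steps = steps_per_pass(MRPC_VALIDATION_EXAMPLES, 64)  # .batch(64)
print(train_steps, valid_steps)  # 115 7
```

Under the same assumption, two epochs of 115 steps consume 230 training batches in total.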

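Fine-tuning models with the Python scripts

Of course, sometimes you need to fine-tune a model on a specific dataset. The Transformers 2.0 project provides a number of Python scripts that can be run directly, for example:

run_glue.py: an example of fine-tuning BERT, XLNet, and XLM on nine different GLUE tasks (sequence classification);
run_squad.py: an example of fine-tuning BERT, XLNet, and XLM on the SQuAD 2.0 question answering dataset (token-level classification);
run_generation.py: conditional language generation with GPT, GPT-2, Transformer-XL, and XLNet;
other model-specific example code.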

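Fine-tuning on GLUE tasks

The following example uses run_glue.py to fine-tune a model for sequence classification on a GLUE task.

First download the GLUE dataset and install the additional dependencies:

pip install -r ./examples/requirements.txt

Then run the fine-tuning: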

export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python ./examples/run_glue.py \
 --model_type bert \
 --model_name_or_path bert-base-uncased \
 --task_name $TASK_NAME \
 --do_train \
 --do_eval \
 --do_lower_case \
 --data_dir $GLUE_DIR/$TASK_NAME \
 --max_seq_length 128 \
 --per_gpu_eval_batch_size=8 \
 --per_gpu_train_batch_size=8 \
 --learning_rate 2e-5 \
 --num_train_epochs 3.0 \
 --output_dir /tmp/$TASK_NAME/

When running from the command line, you choose the model and the training parameters through these flags.
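
To switch models or tasks without retyping the whole command, the flags can be assembled programmatically. The sketch below is a hypothetical helper (`build_run_glue_command` is not part of the project). It uses only the flags and values that appear in the command above, and the paths are the same placeholders.

```python
import shlex


def build_run_glue_command(model_type, model_name, task_name, glue_dir,
                           output_dir, lower_case=True):
    """Assemble an argv list for ./examples/run_glue.py.

    Hypothetical helper: every flag below is copied from the command shown
    above; only the model, task, and paths are parameterized.
    """
    argv = [
        "python", "./examples/run_glue.py",
        "--model_type", model_type,
        "--model_name_or_path", model_name,
        "--task_name", task_name,
        "--do_train",
        "--do_eval",
        "--data_dir", f"{glue_dir}/{task_name}",
        "--max_seq_length", "128",
        "--per_gpu_eval_batch_size=8",
        "--per_gpu_train_batch_size=8",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3.0",
        "--output_dir", output_dir,
    ]
    if lower_case:
        # Only meaningful for uncased checkpoints such as bert-base-uncased.
        argv.append("--do_lower_case")
    return argv


argv = build_run_glue_command("bert", "bert-base-uncased", "MRPC",
                              "/path/to/glue", "/tmp/MRPC/")
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` from the repository root; the sketch only prints it so that nothing is launched by accident.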

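Fine-tuning on the SQuAD dataset

You can also use run_squad.py to fine-tune on the SQuAD dataset. The command is shown below; $SQUAD_DIR is assumed to point to the directory that holds the SQuAD JSON files: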

python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
 --model_type bert \
 --model_name_or_path bert-large-uncased-whole-word-masking \
 --do_train \
 --do_eval \
 --do_lower_case \
 --train_file $SQUAD_DIR/train-v1.1.json \
 --predict_file $SQUAD_DIR/dev-v1.1.json \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
 --output_dir ../models/wwm_uncased_finetuned_squad/ \
 --per_gpu_eval_batch_size=3 \
 --per_gpu_train_batch_size=3

This command fine-tunes the BERT whole-word-masking model on 8 V100 GPUs and brings the model's F1 score on SQuAD above 93.
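
The command above sets a small per-GPU batch size, so it helps to work out the batch size the optimizer actually sees. A minimal sketch, assuming one process per GPU (as `--nproc_per_node=8` suggests) and no gradient accumulation. The function name is illustrative only.

```python
def effective_batch_size(num_processes: int, per_gpu_batch_size: int) -> int:
    """Examples processed per optimizer step across all data-parallel processes."""
    return num_processes * per_gpu_batch_size


# Values from the run_squad.py command above:
# --nproc_per_node=8 and --per_gpu_train_batch_size=3
total = effective_batch_size(num_processes=8, per_gpu_batch_size=3)
print(total)  # 24
```

On fewer GPUs, keeping the same effective batch size would mean raising the per-GPU batch size accordingly, memory permitting.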

Text generation with a model

You can also use run_generation.py to generate text with a pretrained language model:

python ./examples/run_generation.py \
 --model_type=gpt2 \
 --length=20 \
 --model_name_or_path=gpt2

Installation

How do you install such a convenient tool? You need Python 3.5 or later, plus either PyTorch 1.0.0 or later or TensorFlow 2.0.0-rc1.

Then install with pip:

pip install transformers
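
Before installing, the version requirements above can be checked with a few lines of standard-library Python. This is a rough sketch: `parse_version` and `at_least` are hypothetical helpers that compare only the leading numeric parts of a version string and ignore suffixes such as `-rc1`.

```python
import sys


def parse_version(text):
    """Turn '1.2.0' or '2.0.0-rc1' into a tuple of leading integers, e.g. (2, 0, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at the first non-digit, e.g. the '-' in '0-rc1'
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """True if the installed version is not older than the minimum."""
    return parse_version(installed) >= parse_version(minimum)


python_ok = sys.version_info >= (3, 5)
print("Python 3.5+:", python_ok)
print("PyTorch 1.2.0 satisfies >= 1.0.0:", at_least("1.2.0", "1.0.0"))
```

In practice the installed version string would come from `torch.__version__` or `tf.__version__`; literal strings are used here so the sketch runs without either framework.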

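Mobile deployment is coming soon

HuggingFace says on GitHub that it intends to bring these models to mobile devices, and it provides a repo with code that converts the GPT-2 model to CoreML for on-device use.

Going forward, they plan to push this work further so that users can seamlessly convert large models to CoreML without extra scripts.

Repo: https://github.com/huggingface/swift-coreml-transformers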
