Hugging Face NLP Course Study Notes - 0. Installing the transformers Library & 1. Transformer Models

Published by shizidushu on 2024-09-14


Notes:

  • First published: 2024-09-14
  • Course page: https://huggingface.co/learn/nlp-course/zh-CN/chapter1
  • About: reading notes that keep only the key points, mostly excerpted from the original text with light polishing

0. Installing the transformers library

Create a conda environment and install the packages:

conda create -n hfnlp python=3.12
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.44.2

# Additional packages used later
pip install seqeval
pip install sentencepiece

Use the Hugging Face mirror (see https://hf-mirror.com/ ):

export HF_ENDPOINT=https://hf-mirror.com

Or set the Hugging Face mirror in Python:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

1. Transformer Models

What can Transformers do?

Using pipelines

The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

Warning:

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b ....

Output:

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

Passing several sentences:

classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

Passing some text to a pipeline involves three main steps, as the sketch after this list shows:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The model's predictions are postprocessed into a final, human-readable result.
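A minimal sketch of what these three steps look like without pipeline(), assuming the same default sentiment checkpoint as in the warning above:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Preprocess the text into tensors the model can understand
inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")
# 2. Pass the preprocessed inputs to the model
with torch.no_grad():
    logits = model(**inputs).logits
# 3. Postprocess the raw logits into a human-readable label and score
probs = torch.softmax(logits, dim=-1)
label_id = probs.argmax(dim=-1).item()
print({"label": model.config.id2label[label_id], "score": float(probs[0, label_id])})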

Zero-shot classification

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Warning:

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://hf-mirror.com/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.

Output:

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445952534675598, 0.11197696626186371, 0.043427806347608566]}

This pipeline is called zero-shot because you don't need to fine-tune the model on your data to use it.
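As the warning above suggests, in production it is better to pin an explicit model and revision rather than rely on the default; a minimal sketch pinning the same checkpoint:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", revision="c626438")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)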

Text generation

Now let's see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model auto-completes it by generating the remaining text.

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

Warning:

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://hf-mirror.com/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

Output:

[{'generated_text': 'In this course, we will teach you how to create a simple Python script that uses the default Python scripts for the following tasks, such as adding a linker at the end of a file to a file, editing an array, etc.\n\n'}]

Using other models from the Hub in a pipeline

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task, say, text generation. Go to the Model Hub and click the corresponding tag on the left to display only the models supported for that task.

Let's try the distilgpt2 model! Here's how to load it in the same pipeline as before:

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2
)
[{'generated_text': 'In this course, we will teach you how to make your world better. Our courses focus on how to make an improvement in your life or the things'},
 {'generated_text': 'In this course, we will teach you how to properly design your own design using what is currently in place and using what is best in place. By'}]

Mask filling

The next pipeline you'll try is fill-mask. The idea of this task is to fill in the blanks in a given text:

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
[{'score': 0.19198445975780487,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04209190234541893,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

The top_k argument controls how many possibilities you want to be displayed. Note that here the model fills in the special <mask> word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it's always good to verify the proper mask word when exploring other models.
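One way to check a model's mask token is via its tokenizer; a minimal sketch, using bert-base-uncased as an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.mask_token)  # '[MASK]' for BERT-style models; RoBERTa-style models use '<mask>'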

Named entity recognition

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let's look at an example:

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english ...
[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

We pass the option grouped_entities=True in the pipeline creation function to tell the pipeline to regroup together the parts of the sentence that correspond to the same entity: here the model correctly grouped "Hugging" and "Face" as a single organization, even though the name consists of multiple words.

Named entity recognition (Chinese)

Running the code from the README of https://huggingface.co/shibing624/bert4ner-base-chinese :

pip install seqeval
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏偉來自北京,是個警察,喜歡去王府井遊玩兒。"

def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]
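    # Note: predictions[0] also covers the [CLS]/[SEP] positions while tokens does not,
    # so the char_tags printed below appear shifted by one character. The entity spans
    # extracted afterwards still line up with the sentence, because after the [1:-1]
    # slice pred_labels[j] is in fact the label of sentence character j.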
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
王宏偉來自北京,是個警察,喜歡去王府井遊玩兒。
[('宏', 'B-PER'), ('偉', 'I-PER'), ('來', 'I-PER'), ('自', 'O'), ('北', 'O'), ('京', 'B-LOC'), (',', 'I-LOC'), ('是', 'O'), ('個', 'O'), ('警', 'O'), ('察', 'O'), (',', 'O'), ('喜', 'O'), ('歡', 'O'), ('去', 'O'), ('王', 'O'), ('府', 'B-LOC'), ('井', 'I-LOC'), ('遊', 'I-LOC'), ('玩', 'O'), ('兒', 'O')]
Sentence entity:
[('王宏偉', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]

Alternatively, the shibing624/bert4ner-base-chinese model can be used through the nerpy library.

In addition, LTP can be used for Chinese named entity recognition; its GitHub repository https://github.com/HIT-SCIR/ltp has 4.9K stars.

Question answering

from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
{'score': 0.6949753761291504, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

Note that this pipeline works by extracting information from the provided context; it does not generate the answer from scratch.
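The start and end fields in the result are character offsets into the context, so the answer can be recovered by slicing:

context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
context[33:45]  # 'Hugging Face'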

Summarization

Summarization is the task of reducing a text into a shorter text while keeping the main (important) points of the original. Here's an example:

from transformers import pipeline

summarizer = pipeline("summarization", device=0)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

As with text generation, you can specify a max_length or min_length for the result.
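For example (a minimal sketch; the values are arbitrary, and text stands for the passage above):

summarizer(text, max_length=60, min_length=20)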

Translation

For translation, you can use a default model by providing a language pair in the task name (such as "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the Model Hub. Here we'll try translating from French to English:

pip install sentencepiece
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=0)
translator("Ce cours est produit par Hugging Face.")
[{'translation_text': 'This course is produced by Hugging Face.'}]
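As mentioned above, providing a language pair in the task name also lets you use a default model; a minimal sketch (this downloads whatever model transformers currently defaults to for the task):

from transformers import pipeline

translator = pipeline("translation_en_to_fr")
translator("This course is produced by Hugging Face.")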

Translating from English to Chinese:

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh", device=0)
translator("America has changed dramatically during recent years.")
[{'translation_text': '近年來,美國發生了巨大變化。'}]

Bias and limitations

If you intend to use a pretrained or fine-tuned model in production, please be aware that while these models are powerful, they also have limitations. The biggest one is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, which may include ideological or stereotyped content.

To illustrate this quickly, let's go back to an example with a fill-mask pipeline using the BERT model:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

When asked to fill in the missing word in these two sentences, the model gives only one gender-neutral answer (waiter/waitress).
