Dataset
CoQA is the Conversational Question Answering dataset released by Stanford NLP in 2019, a large-scale dataset for building Conversational Question Answering Systems. It is designed to measure a machine's ability to understand a text passage and answer a series of interconnected questions that appear in a conversation. What makes the dataset unique is that each conversation was collected by pairing two crowdworkers who discuss a passage in question-answer form, so the questions are conversational in nature.
The JSON data contains many fields. For our purposes, we will use the "story" field together with the "input_text" of the "questions" and "answers" to form our data.
Install transformers
!pip install transformers
Import libraries
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
Load the data
coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa.head()
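Each entry in the "data" column is a nested record that holds a story together with its questions and answers. A quick, optional way to peek at that nested structure (assuming the download above succeeded):

#Optional: inspect the nested structure of the first record
first_record = coqa["data"][0]
print(first_record.keys())                  #expect fields such as 'story', 'questions', 'answers'
print(first_record["questions"][0].keys())  #each question carries an 'input_text'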
Data cleaning
For each question-answer pair, we attach the corresponding context.
#required columns in our dataframe
cols = ["text","question","answer"]
#list of lists to create our dataframe
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols)
#saving the dataframe to csv file for further loading
new_df.to_csv("CoQA_data.csv", index=False)
Form the DataFrame
data = pd.read_csv("CoQA_data.csv")
data.head()
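As an optional sanity check, we can confirm how many (context, question, answer) rows we ended up with:

#how many question-answer pairs did we collect?
print("Number of question-answer pairs:", len(data))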
Build the model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
BERT uses WordPiece tokenization. In BERT, rare words are broken down into subwords/pieces, and WordPiece tokenization uses ## to mark tokens that were split off. For example, "Karin" is a common word, so WordPiece does not split it. "Karingu", however, is a rare word, so WordPiece splits it into "Karin" and "##gu". Note that ## is added in front of "gu" to indicate that it is the second part of a split word.
The idea behind WordPiece tokenization is to reduce the vocabulary size and thereby improve training performance. Consider the words run, running, runner. Without WordPiece tokenization, the model would have to store and learn the meaning of all three words independently. With WordPiece tokenization, each of these words can be split into "run" plus a related "##SUFFIX" if there is any suffix (e.g., "run", "##ning", "##ner"). The model then learns the context of the word "run", while the remaining meaning is encoded in the suffix, which it learns from other words with similar suffixes.
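We can see this behaviour with the tokenizer we just loaded. The splits shown in the comments are only what one would typically expect; the exact pieces depend on the model's vocabulary.

#Illustrative check of WordPiece splitting with the tokenizer loaded above.
#The comments show typical splits; exact pieces depend on the vocabulary.
print(tokenizer.tokenize("Karin"))    #e.g. ['karin'] - common enough to stay whole
print(tokenizer.tokenize("Karingu"))  #e.g. ['karin', '##gu'] - rare word, split into pieces
print(tokenizer.tokenize("run running runner"))  #pieces share the 'run' stem where a word is split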
def question_answer(question, text):
    #tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)
    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    #number of tokens in segment A (question)
    num_seg_a = sep_idx + 1
    #number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a
    #list of 0s and 1s for segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)
    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    #reconstructing the answer from the most likely start and end token positions
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start + 1, answer_end + 1):
            if tokens[i][0:2] == "##":
                #merge wordpiece continuation tokens back into the previous word
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
    else:
        #guard against an invalid span (end before start)
        answer = "Unable to find the answer to your question."
    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."
    print("\nPredicted answer:\n{}".format(answer.capitalize()))
Results
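The dialogue below comes from a simple interactive wrapper around question_answer. A minimal sketch of such a loop is shown here; the exact prompt strings are assumptions, only question_answer itself was defined above.

#Minimal interactive loop around question_answer (illustrative sketch)
text = input("Please enter your text: \n")
question = input("Please enter your question: \n")

while True:
    question_answer(question, text)
    flag = input("\nDo you want to ask another question based on this text (Y/N)? ")
    if flag.strip().upper() != "Y":
        print("Bye!")
        break
    question = input("\nPlease enter your question: \n")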
Please enter your text:
The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. The Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. In March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. Scholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. The Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.
Please enter your question:
When was the Vat formally opened?
Answer:
1475
Do you want to ask another question based on this text (Y/N)? Y
Please enter your question:
what is the library for?
Answer:
Research library for history , law , philosophy , science and theology
Do you want to ask another question based on this text (Y/N)? Y
Please enter your question:
for what subjects?
Answer:
History , law , philosophy , science and theology
Do you want to ask another question based on this text (Y/N)? N
Bye!