Building a Text Summarization Application with Amazon SageMaker

Background

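Text summarization condenses one or more documents into a short overview that stays as concise as possible while still reflecting the important content of the source. A good summary plays an important role in information retrieval: for example, indexing summaries instead of the original documents can shorten retrieval time and reduce redundant information in the results, which improves the user experience. With the arrival of the information explosion era, automatic summarization has become an important research topic in natural language processing.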

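The need for text summarization comes from many of our real customer cases. Large volumes of long text are commonplace in news, finance, and law. With labor costs rising, hiring large numbers of professionals to distill information or review content requires a substantial investment. This is where automatic text summarization becomes valuable: a deep learning model trained on large amounts of data can produce a summary of controllable length within a few hundred milliseconds, which greatly improves summarization efficiency and saves considerable labor cost.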

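Current techniques fall broadly into two categories according to how the summary is produced: 1) extractive summarization, which finds the most important sentences in a document and concatenates them; and 2) abstractive summarization, which models the task directly as a sequence-to-sequence generation problem and generates the summary token by token from the source text. Extractive summarization is efficient and highly interpretable, but the extracted text is less semantically coherent than an abstractive summary, so this article focuses on abstractive summarization.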
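
To make the contrast concrete, the following is a minimal sketch of the extractive approach, assuming a simple word-frequency score. It is not the method used in the rest of this article, which fine-tunes an abstractive model; the function name `extractive_summary` and the scoring rule are illustrative assumptions.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


text = "Cats sleep. Cats sleep a lot. Dogs bark."
print(extractive_summary(text, num_sentences=1))  # prints: Cats sleep.
```

Because the output is copied verbatim from the source, it is easy to trace, but it cannot rephrase or merge sentences the way a generated summary can.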

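Amazon SageMaker is a fully managed machine learning platform service from Amazon Web Services. Algorithm engineers and data scientists can use it to quickly build, train, and deploy machine learning (ML) models without managing or operating the underlying resources. As a toolset, it provides end-to-end components for machine learning, including data labeling, data processing, algorithm design, model training, training debugging, hyperparameter tuning, model deployment, and model monitoring. It also draws on Amazon's underlying infrastructure to offer a rich set of compute resources, such as high-performance CPUs, GPUs, and Elastic Inference accelerators, which makes model development and deployment more efficient. This article also builds on Hugging Face, a well-known open-source NLP community that integrates closely with Amazon SageMaker, so NLP models can be trained and deployed on Amazon SageMaker with a few lines of code.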

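Solution Overview

In this example, we use Amazon SageMaker to do the following:

Set up the environment
Download the dataset and preprocess it
Train on the local machine
Train the model with Amazon SageMaker BYOS (Bring Your Own Script)
Deploy to a hosted endpoint and test inference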

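Environment Setup

First, create an Amazon SageMaker Notebook instance. The ml.p3.2xlarge instance type is recommended, because this example includes a local training step that is used to test the code. Increase the volume size to 10 GB or more, because the project needs to download some additional data.

After the notebook starts, open a terminal from the page and run the following commands to download the code.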

cd ~/SageMaker
git clone https://github.com/HaoranLv/nlp_transformer.git

Download and Preprocess the Dataset

Here are several open-source English and Chinese datasets:

1. Public datasets (English)

XSUM, 227k BBC articles
CNN/Dailymail, 93k articles from the CNN, 220k articles from the Daily Mail
NEWSROOM, 3M article-summary pairs written by authors and editors in the newsrooms of 38 major publications
Multi-News, 56k pairs of news articles and their human-written summaries from the http://com
Gigaword, 4M examples extracted from news articles, the task is to generate the headline from the first sentence
arXiv, PubMed, two long document datasets of scientific publications from http://org (113k) and PubMed (215k). The task is to generate the abstract from the paper body.
BIGPATENT, 3 million U.S. patents along with human summaries under nine patent classification categories

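2. Public datasets (Chinese)

LCSTS, the Sina Weibo short-text summarization dataset from Harbin Institute of Technology
chinese_abstractive_corpus, an automatic summarization corpus of education news
NLPCC 2017 task3 Single Document Summarization
"神策杯" 2018 college algorithm masters competition, entertainment news and similar content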

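This article uses Multi-News as the example. The data has two columns: headlines holds the summary and text holds the full article. Because the dataset is small, simply download the raw CSV file from the official site and upload it to the SageMaker Notebook.

Open hp_data.ipynb and run the code.

First, load the dataset: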

import pandas as pd
df = pd.read_csv('./data/hp/summary/news_summary.csv')

Then clean the data:

import re
import string

import pandas as pd


class Settings:

    TRAIN_DATA = "./data/hp/summary/news_summary_total.csv"
    Columns = ['headlines', 'text']
    encoding = 'latin-1'
    columns_dict = {"headlines": "headlines", "text": "text"}
    df_column_list = ['text', 'headlines']
    SUMMARIZE_KEY = ""
    SOURCE_TEXT_KEY = 'text'
    TEST_SIZE = 0.2
    BATCH_SIZE = 16
    source_max_token_len = 128
    target_max_token_len = 50
    train_df_len = 82332
    test_df_len = 20583
    
class Preprocess:
    def __init__(self):
        self.settings = Settings

    def clean_text(self, text):
        # Lowercase, then strip bracketed spans, URLs, HTML-like tags,
        # punctuation, newlines, and tokens that contain digits
        text = text.lower()
        text = re.sub(r'\[.*?\]', '', text)
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        text = re.sub(r'<.*?>+', '', text)
        text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
        text = re.sub(r'\n', '', text)
        text = re.sub(r'\w*\d\w*', '', text)
        return text

    def preprocess_data(self, data_path):
        df = pd.read_csv(data_path, encoding=self.settings.encoding, usecols=self.settings.Columns)
        # Normalize the column names and keep the columns in (source text, summary) order
        df = df.rename(columns=self.settings.columns_dict)
        df = df[self.settings.df_column_list]
        # The source text is left unchanged here (this assignment is a no-op);
        # the "summarize: " task prefix is supplied at training time via --source_prefix
        df[self.settings.SOURCE_TEXT_KEY] = df[self.settings.SOURCE_TEXT_KEY]

        return df
settings=Settings
preprocess=Preprocess()
df = preprocess.preprocess_data(settings.TRAIN_DATA)
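
The clean_text method above is defined but is not called by preprocess_data. As a rough illustration of what it would do if applied, the sketch below copies the same regular expressions into a standalone function and runs them on a made-up sample string (the sample text is an assumption and is not taken from the dataset).

```python
import re
import string


def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # drop [bracketed] spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs
    text = re.sub(r'<.*?>+', '', text)                  # drop HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\n', '', text)                      # drop newlines (no space inserted)
    text = re.sub(r'\w*\d\w*', '', text)                # drop tokens containing digits
    return text


sample = "Read <b>this</b> [ad] at https://example.com now!\nCall 911 today."
print(clean_text(sample).split())
# prints: ['read', 'this', 'at', 'nowcall', 'today']
```

Note that the newline is deleted rather than replaced with a space, so "now" and "call" are merged into one token. If this cleaning step were applied to the training data, replacing newlines with a space would likely be the safer choice.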

Next, split the data into a training set and a test set and save each one:

from sklearn.model_selection import train_test_split

df.to_csv('./data/hp/summary/news_summary_cleaned.csv',index=False)
df2=pd.read_csv('./data/hp/summary/news_summary_cleaned.csv')
order=['text','headlines']
df3=df2[order]
train_df, test_df = train_test_split(df3, test_size=0.2,random_state=100)
train_df.to_csv('./data/hp/summary/news_summary_cleaned_train.csv',index=False)
test_df.to_csv('./data/hp/summary/news_summary_cleaned_test.csv',index=False)
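
As a quick sanity check on the split, the row counts recorded in Settings (train_df_len = 82332 and test_df_len = 20583) can be compared with the 20% test fraction. The helper `split_sizes` below is a hypothetical name, and it assumes the test count is rounded up with the remainder going to training.

```python
def split_sizes(total_rows: int, test_percent: int) -> tuple:
    """Return (train_rows, test_rows) for an integer test percentage."""
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    test_rows = -(-total_rows * test_percent // 100)
    return total_rows - test_rows, test_rows


total_rows = 82332 + 20583          # Settings.train_df_len + Settings.test_df_len
print(split_sizes(total_rows, 20))  # prints: (82332, 20583)
```

The recorded lengths are therefore consistent with TEST_SIZE = 0.2 applied to 102915 rows.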

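Training on the Local Machine

After the data processing above is complete, model training can begin. Running the command below starts training: the code automatically loads google/pegasus-large from the Hugging Face Hub as the pretrained model and then fine-tunes it on the processed dataset.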

!python -u examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google/pegasus-large \
--do_train \
--do_eval \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=1 \
--save_strategy epoch \
--evaluation_strategy epoch \
--overwrite_output_dir \
--predict_with_generate \
--train_file './data/hp/summary/news_summary_cleaned_train.csv' \
--validation_file './data/hp/summary/news_summary_cleaned_test.csv' \
--text_column 'text' \
--summary_column 'headlines' \
--output_dir='./models/local_train/pegasus-hp' \
--num_train_epochs=1.0 \
--eval_steps=500 \
--save_total_limit=3 \
--source_prefix "summarize: " > train_pegasus.log

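When training completes, the script prints a summary of the run to the log.

The validation set is also evaluated with an objective metric; ROUGE is used here.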
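
To give a feel for what ROUGE measures, here is a simplified sketch of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence) F1 scores, applied to the example headline that appears in this article. It assumes plain lowercase alphanumeric tokenization with no stemming, so its numbers will not exactly match those reported by the training script.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def f1(match, pred_len, ref_len):
    if match == 0:
        return 0.0
    precision = match / pred_len
    recall = match / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    # Clipped unigram overlap: each word counts at most as often as in both texts.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f1(overlap, len(pred), len(ref))


def lcs_length(a, b):
    # Dynamic programming over a rolling row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    return f1(lcs_length(pred, ref), len(pred), len(ref))


reference = "Germany accuses Vietnam of kidnapping asylum seeker"
prediction = "Germany accuses Vietnam of kidnapping ex-oil exec, taking him home"
print(round(rouge_1(prediction, reference), 4))  # prints: 0.5556
print(round(rouge_l(prediction, reference), 4))  # prints: 0.5556
```

Both scores coincide here because the five shared words ("Germany accuses Vietnam of kidnapping") appear in the same order in both texts.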

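The model files and the corresponding logs are saved automatically under ./models/local_train/pegasus-hp/checkpoint-500.

These model files can be used directly for local inference. Note that the model path must point to the checkpoint that your own training run just produced.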

import pandas as pd
from transformers import pipeline

df = pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')
print('Source text:', df.loc[0, 'text'])
print('Reference summary:', df.loc[0, 'headlines'])
# Point model= at the checkpoint directory produced by your own training run
summarizer = pipeline("summarization", model="./models/local_train/pegasus-hp/checkpoint-500")
print('Model prediction:', summarizer(df.loc[0, 'text'], max_length=50)[0]['summary_text'])

The output is as follows:

Source text: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.
Reference summary: Germany accuses Vietnam of kidnapping asylum seeker
Model prediction: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home

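This completes the local training and inference workflow for one model.

Training with Amazon SageMaker BYOS

In the example above, we trained a fairly small model step by step in the local environment to validate the code. Next, we organize the code so that it can run on Amazon SageMaker as a managed training job that can scale out to distributed training.

First, consolidate the training code above into a single Python script, then use the preconfigured Hugging Face container on SageMaker. The container can be used in many flexible ways; see the Hugging Face Estimator documentation for details.

Because the prebuilt Hugging Face container on SageMaker already includes the inference logic, we only need to bring the training script from the previous step into the container. The steps are as follows:

Start a Jupyter Notebook, choose python3 as the kernel, and complete the following steps:

Permission configuration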

import sagemaker
import os
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Upload the data to S3

# dataset used
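dataset_name = 'news_summary'
# s3 key prefix for the data
s3_prefix = 'datasets/news_summary'
# local directory that directly contains the train/test CSV files,
# so they land at the top level of the S3 prefix used by fit()
WORK_DIRECTORY = './data/hp/summary/'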
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=s3_prefix)
data_location
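
When upload_data is given a directory, it returns s3://<bucket>/<key_prefix> and keeps each file's path relative to that directory as part of the object key. The sketch below mimics that key layout without calling AWS, to show where the CSV files end up depending on which local directory is uploaded. The bucket name example-bucket and the helper name `s3_uris_for_upload` are hypothetical.

```python
import posixpath


def s3_uris_for_upload(bucket, key_prefix, relative_paths):
    """Map files (relative to the uploaded directory) to their S3 URIs."""
    return [
        "s3://" + posixpath.join(bucket, key_prefix, rel_path)
        for rel_path in relative_paths
    ]


bucket = "example-bucket"  # hypothetical; the real one is sess.default_bucket()
prefix = "datasets/news_summary"

# Uploading ./data/ keeps the hp/summary/ sub-path in the key.
print(s3_uris_for_upload(bucket, prefix, ["hp/summary/news_summary_cleaned_train.csv"])[0])
# prints: s3://example-bucket/datasets/news_summary/hp/summary/news_summary_cleaned_train.csv

# Uploading ./data/hp/summary/ puts the file directly under the prefix.
print(s3_uris_for_upload(bucket, prefix, ["news_summary_cleaned_train.csv"])[0])
# prints: s3://example-bucket/datasets/news_summary/news_summary_cleaned_train.csv
```

Only the second layout matches a channel URI built as data_location + '/news_summary_cleaned_train.csv'.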

Define the hyperparameters and initialize the estimator.

from sagemaker.huggingface import HuggingFace

# hyperparameters which are passed to the training job
hyperparameters={'text_column':'text',
                 'summary_column':'headlines',
                 'train_file':'/opt/ml/input/data/train/news_summary_cleaned_train.csv',
                 'validation_file':'/opt/ml/input/data/test/news_summary_cleaned_test.csv',
                 'output_dir':'/opt/ml/model',
                 'do_train':True,
                 'do_eval':True,
                 'max_source_length': 128,
                 'max_target_length': 128,
                 'model_name_or_path': 't5-large',
                 'learning_rate': 3e-4,
                 'num_train_epochs': 1,
                 'per_device_train_batch_size': 2,#16
                 'gradient_accumulation_steps':2, 
                 'save_strategy':'epoch',
                 'evaluation_strategy':'epoch',
                 'save_total_limit':1,
                 }
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='run_paraphrase.py',
        source_dir='./scripts',
        instance_type='ml.p3.2xlarge',#'ml.p3dn.24xlarge'
        instance_count=1,
        role=role,
        max_run=24*60*60,
        transformers_version='4.6',
        pytorch_version='1.7',
        py_version='py36',
        volume_size=128,
        hyperparameters = hyperparameters,
#         distribution=distribution
)
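
In script mode, SageMaker passes each entry of hyperparameters to the entry point as a --key value command-line argument. The sketch below shows that round trip with argparse, assuming a hand-written parser; the actual entry script in the repository may parse its arguments differently (for example with Hugging Face's HfArgumentParser). The helper names `to_argv`, `str2bool`, and `parse_args` are illustrative.

```python
import argparse


def to_argv(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    argv = []
    for key, value in hyperparameters.items():
        argv += [f"--{key}", str(value)]
    return argv


def str2bool(value):
    # Booleans arrive as the strings "True" / "False".
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--num_train_epochs", type=float)
    parser.add_argument("--do_train", type=str2bool, default=False)
    # Hyperparameters this sketch does not declare are ignored.
    args, _unknown = parser.parse_known_args(argv)
    return args


argv = to_argv({"model_name_or_path": "t5-large", "learning_rate": 3e-4,
                "num_train_epochs": 1, "do_train": True, "text_column": "text"})
args = parse_args(argv)
print(args.model_name_or_path, args.learning_rate, args.do_train)
# prints: t5-large 0.0003 True
```

This is why every key in the hyperparameters dict needs to be an argument name that the training script understands.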

Start model training.

huggingface_estimator.fit(
  {'train': data_location+'/news_summary_cleaned_train.csv',
   'test': data_location+'/news_summary_cleaned_test.csv',}
)
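
Each key passed to fit() is a channel name, and in the default File input mode SageMaker downloads that channel's data to /opt/ml/input/data/<channel>/ inside the training container. The small sketch below shows the mapping, which is why the train_file and validation_file hyperparameters point at those paths. The helper name `channel_file_path` and the bucket name are hypothetical.

```python
import posixpath

INPUT_ROOT = "/opt/ml/input/data"


def channel_file_path(channel, s3_uri):
    """Container path of a single-file channel in File input mode."""
    filename = posixpath.basename(s3_uri)
    return posixpath.join(INPUT_ROOT, channel, filename)


print(channel_file_path(
    "train",
    "s3://example-bucket/datasets/news_summary/news_summary_cleaned_train.csv",
))
# prints: /opt/ml/input/data/train/news_summary_cleaned_train.csv
```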

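After training starts, the training job appears in the Amazon SageMaker console. The job details page shows the training log output and lets you monitor GPU, CPU, and memory utilization to confirm that the program is working properly. After training completes, the training logs are also available in CloudWatch.

Hosted Deployment and Inference Testing

After training completes, the model above can easily be deployed as a real-time endpoint that can be invoked in a production environment.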

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
#    env= {'HF_TASK':'text-generation'},
   model_data="s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2022-04-19-05-56-07-474/output/model.tar.gz",  # path to your trained SageMaker model
   role=role,                                            # IAM role with permissions to create an endpoint
   transformers_version="4.6",                           # Transformers version used
   pytorch_version="1.7",                                # PyTorch version used
   py_version='py36',                                    # Python version used
    
)
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.g4dn.xlarge"
)

Invoking the model

from sagemaker.huggingface.model import HuggingFacePredictor
predictor=HuggingFacePredictor(endpoint_name='huggingface-pytorch-inference-2022-04-19-06-41-55-309')

import time

import pandas as pd

s = time.time()
df = pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')
print('Source text:', df.loc[0, 'text'])
print('Reference summary:', df.loc[0, 'headlines'])
out = predictor.predict({
    'inputs': df.loc[0, 'text'],
    'parameters': {'max_length': 256},
})
e = time.time()  # e - s is the elapsed time in seconds, including the CSV load
print('Model prediction:', out)

The output is as follows:

Source text: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.
Reference summary: Germany accuses Vietnam of kidnapping asylum seeker
Model prediction: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home
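
The request sent to the endpoint is a JSON object with an inputs field and an optional parameters field. The sketch below wraps payload construction, latency measurement, and response handling in small helpers. It assumes the endpoint may return either a plain string or a list such as [{"summary_text": "..."}], and it uses a local stand-in function instead of a real endpoint; all helper names are hypothetical.

```python
import time


def build_payload(text, max_length=256):
    return {"inputs": text, "parameters": {"max_length": max_length}}


def extract_summary(response):
    # Assumption: the response is either a string or [{"summary_text": "..."}].
    if isinstance(response, str):
        return response
    if isinstance(response, list) and response and isinstance(response[0], dict):
        return response[0].get("summary_text", "")
    raise ValueError(f"Unexpected response format: {response!r}")


def timed_predict(predict_fn, payload):
    """Call predict_fn(payload) and return (response, elapsed milliseconds)."""
    start = time.perf_counter()
    response = predict_fn(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return response, elapsed_ms


def fake_predict(payload):
    # Stand-in for predictor.predict: truncates the input by characters.
    limit = payload["parameters"]["max_length"]
    return [{"summary_text": payload["inputs"][:limit]}]


response, elapsed_ms = timed_predict(fake_predict, build_payload("hello world", max_length=5))
print(extract_summary(response))  # prints: hello
print(f"latency: {elapsed_ms:.2f} ms")
```

With a real endpoint, predictor.predict would be passed in place of fake_predict, so that only the network call and model inference are timed.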

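That covers the full process of building a text summarization application with Amazon SageMaker. Combined with Hugging Face, Amazon SageMaker makes it convenient to build, train, and deploy NLP models end to end. All that is required is a training script and the data; a handful of commands then start training and deployment. We also plan to publish more articles on implementing other NLP tasks with Amazon SageMaker, so stay tuned.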

References

Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/index.html
Hugging Face: https://huggingface.co/
Code Link: https://github.com/HaoranLv/nlp_transformer

About the Author

呂浩然
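Applied Scientist at Amazon Web Services, with long-term research and development experience in computer vision, natural language processing, and related fields. Supports Data Lab projects and has extensive algorithm development and production experience in time-series forecasting, object detection, OCR, and natural language generation.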

Source: https://dev.amazoncloud.cn/column/article/630a0a80afd24c6ba21...