How to generate text: using different decoding methods for language generation with Transformers

Introduction

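In recent years, open-ended language generation has attracted growing interest thanks to the rise of large transformer-based language models trained on millions of webpages, such as OpenAI's GPT2 model. The results on conditioned open-ended language generation are impressive, e.g. GPT2's continuation of the unicorn story, XLNet, and controlled text generation with CTRL. Besides the improved transformer architecture and massive unsupervised training data, better decoding methods have also played an important role.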

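This blog post gives a brief overview of different decoding strategies and, more importantly, shows how you can implement them with very little effort using the popular transformers library!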

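All of the following functionalities can be used for auto-regressive language generation. As a quick refresher, auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of the conditional probabilities of each word given the words before it: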

\( P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1: t-1}, W_0) \text{ , with } w_{1: 0} = \emptyset, \)

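Here \( W_0 \) is the initial context word sequence. The length \( T \) of the word sequence is usually determined on-the-fly and corresponds to the timestep \( t=T \) at which the EOS (End Of Sequence) token, which is part of the vocabulary, is generated from \( P(w_{t} | w_{1: t- 1}, W_{0}) \). Auto-regressive language generation is currently available in transformers for GPT2, XLNet, OpenAi-GPT, CTRL, TransfoXL, XLM, Bart and T5, in both PyTorch and TensorFlow (>= 2.0)!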
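The factorization above can be sketched in a few lines of Python. This is only an illustration: the lookup table `COND_PROB` and the helper `sequence_probability` are hypothetical names, and the two probabilities are toy values borrowed from the "The nice woman" example used later in this post, not the output of a real model.

```python
# Hypothetical conditional probabilities P(w_t | w_{1:t-1}, W_0) of a toy model.
# Keys are (history, next word); the numbers are illustrative, not model outputs.
COND_PROB = {
    (("The",), "nice"): 0.5,
    (("The", "nice"), "woman"): 0.4,
}

def sequence_probability(context, words):
    """Multiply the conditional probability of each word given everything before it."""
    prob = 1.0
    history = tuple(context)
    for word in words:
        prob *= COND_PROB[(history, word)]
        history += (word,)
    return prob

p = sequence_probability(["The"], ["nice", "woman"])
print(round(p, 4))  # product of 0.5 and 0.4
```

In practice libraries sum log-probabilities instead of multiplying raw probabilities, because long products underflow quickly.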

We will give a tour of the currently most prominent decoding methods, mainly Greedy search, Beam search, Top-K sampling and Top-p sampling.

Before that, let's quickly install transformers and load the model. We will use GPT2 in TensorFlow 2.1 for demonstration, but the API is one-to-one the same for PyTorch.

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q tensorflow==2.1

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

Greedy Search

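Greedy search simply selects the word with the highest probability as its next word at each timestep \( t \): \( w_t = \operatorname{argmax}_{w} P(w | w_{1:t-1}) \). The following sketch shows greedy search.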

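Starting from the word \( \text{“The”} \), the algorithm greedily chooses the next word of highest conditional probability, \( \text{“nice”} \), and so on. The final generated word sequence is \( \text{“The”}, \text{“nice”}, \text{“woman”} \), with an overall probability of \( 0.5 \times 0.4 = 0.2 \).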
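As a rough illustration of this procedure, here is a minimal greedy search over a hand-written toy model. The table `NEXT_WORD` and the function `greedy_search` are hypothetical; only the probabilities 0.5 for "nice" and 0.4 for "woman" come from the text, the remaining entries are assumptions chosen so that each distribution sums to 1.

```python
# Toy next-word distributions keyed by the sequence generated so far.
# Only P("nice" | "The") = 0.5 and P("woman" | "The nice") = 0.4 come from the
# text; every other number is an illustrative assumption.
NEXT_WORD = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def greedy_search(start, steps):
    """At every step, append the single most probable next word."""
    sequence = tuple(start)
    prob = 1.0
    for _ in range(steps):
        dist = NEXT_WORD[sequence]
        word = max(dist, key=dist.get)  # argmax over the next-word distribution
        prob *= dist[word]
        sequence += (word,)
    return sequence, prob

sequence, prob = greedy_search(["The"], steps=2)
print(sequence, round(prob, 2))
```

With these toy numbers the search follows "nice" and then "woman", reproducing the joint probability of 0.2 from the example.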

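Next, we feed the word sequence \( (\text{“I”}, \text{“enjoy”}, \text{“walking”}, \text{“with”}, \text{“my”}, \text{“cute”}, \text{“dog”}) \) to GPT2 as context and let the model generate the continuation. Let's use this example to see how greedy search can be used in transformers: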

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll

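Alright! We have generated our first short text with GPT2. The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation and seems to be even more so in greedy and beam search, see Vijayakumar et al., 2016 and Shao et al., 2017.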

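The major drawback of greedy search is that it misses high probability words hidden behind a low probability word, as can be seen in the sketch above: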

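The word \( \text{“has”} \), with its high conditional probability of \( 0.9 \), is hidden behind the word \( \text{“dog”} \). Because \( \text{“dog”} \) has only the second-highest conditional probability at t=1, it is not selected, so greedy search misses the word sequence \( \text{“The”}, \text{“dog”}, \text{“has”} \).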

Thankfully, we have beam search to alleviate this problem!

Beam search

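Beam search reduces the risk of missing hidden high probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Let's illustrate with num_beams=2: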

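At time step 1, besides the most likely hypothesis \( (\text{“The”}, \text{“nice”}) \), beam search also keeps track of the second most likely one, \( (\text{“The”}, \text{“dog”}) \). At time step 2, beam search finds that the word sequence \( (\text{“The”}, \text{“dog”}, \text{“has”}) \) has, with \( 0.36 \), a higher probability than \( (\text{“The”}, \text{“nice”}, \text{“woman”}) \), which has \( 0.2 \). Great, it has found the most likely word sequence in our toy example!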

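Beam search generally finds an output sequence with higher probability than greedy search, but it is still not guaranteed to find the globally most likely output.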
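The two-beam walk-through above can be reproduced with a small sketch. `NEXT_WORD` and `beam_search` are hypothetical names; the probabilities for "nice", "dog", "woman" and "has" follow the text (0.4 for "dog" is implied by 0.36 = 0.4 × 0.9), and all other entries are assumptions. Real implementations work with log-probabilities and handle EOS tokens and length normalization, which this sketch leaves out.

```python
# Toy next-word distributions; entries not mentioned in the text are assumptions.
NEXT_WORD = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses after every step."""
    beams = [(tuple(start), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            # Expand every surviving hypothesis by every possible next word.
            for word, p in NEXT_WORD[sequence].items():
                candidates.append((sequence + (word,), prob * p))
        # Rank all expansions by joint probability and prune to the beam width.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

beams = beam_search(["The"], steps=2, num_beams=2)
best_sequence, best_prob = beams[0]
print(best_sequence, round(best_prob, 2))
```

Setting `num_beams=1` in this sketch reduces it to greedy search, which is a handy sanity check.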

Let's see how beam search can be used in transformers. We set num_beams > 1 and early_stopping=True so that generation is finished when all beam hypotheses have reached the EOS token.

# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll

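While the result is arguably more fluent than with greedy search, the output still includes repetitions. A simple remedy is to introduce n-gram (i.e. sequences of n consecutive words) penalties, as introduced by Paulus et al. (2017) and Klein et al. (2017). The most common n-gram penalty makes sure that no n-gram appears twice: whenever a candidate word together with its preceding context would form an n-gram that has already been seen, the probability of that candidate is set to 0.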
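To make the penalty concrete, here is a minimal sketch of the idea. The helpers `banned_next_tokens` and `apply_ngram_block` are hypothetical names and are not the library's implementation; they only show which candidates would be zeroed out.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last n-1 tokens are the prefix of the n-gram a new token would complete.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

def apply_ngram_block(probs, generated, ngram_size):
    """Set the probability of every banned candidate to 0 (no renormalization here)."""
    banned = banned_next_tokens(generated, ngram_size)
    return {word: (0.0 if word in banned else p) for word, p in probs.items()}

generated = "I am not sure if I".split()
banned = banned_next_tokens(generated, ngram_size=2)
blocked = apply_ngram_block({"am": 0.7, "will": 0.3}, generated, ngram_size=2)
print(banned, blocked)
```

Since the 2-gram ("I", "am") has already occurred, "am" is blocked after the second "I", while "will" stays available.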

Let's try it out by setting no_repeat_ngram_size=2 so that no 2-gram appears twice:

# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break

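Nice, that looks much better! We can see that the repetition does not appear anymore. Nevertheless, n-gram penalties have to be used with care. An article about the city New York, for example, should not use a 2-gram penalty, otherwise the name of the city would only appear once in the whole text!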

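Another important feature of beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.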

In transformers, we simply set the parameter num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!

# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 *'-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a step

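As can be seen, the five beam hypotheses are only marginally different from each other, which should not be too surprising when using only 5 beams.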

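For open-ended generation, researchers have recently brought forward a couple of reasons why beam search might not be the best possible option: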

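Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization, see Murray et al. (2018) and Yang et al. (2018). But this is not the case for open-ended generation, where the desired output length can vary greatly, e.g. dialog and story generation.
We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation, since finding a good trade-off between forced "no-repetition" and the maximum allowed repetition of n-grams requires a lot of finetuning.
As argued in Ari Holtzman et al. (2019), high quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring or predictable. The authors show this nicely with a plot comparing the probability a model assigns to human text with what beam search produces: human text is far more surprising.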

So let's stop being boring and introduce some randomness.

Sampling

In its most basic form, sampling means randomly picking the next word \( w_t \) according to its conditional probability distribution:

\( w_t \sim P(w|w_{1:t-1}) \)

Taking the example from above, the following graphic visualizes language generation when sampling.

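It becomes obvious that language generation using sampling is not deterministic anymore. The word \( \text{“car”} \) is sampled from the conditional probability distribution \( P(w | \text{“The”}) \), followed by sampling \( \text{“drives”} \) from \( P(w | \text{“The”}, \text{“car”}) \).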
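A single sampling step can be sketched with the standard library alone. The distribution `dist_after_the` uses assumed toy numbers for \( P(w | \text{“The”}) \), and `sample_next_word` is a hypothetical helper, not part of transformers.

```python
import random

def sample_next_word(dist, rng):
    """Draw one word at random, weighted by its conditional probability."""
    words = list(dist)
    weights = [dist[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Assumed toy values for P(w | "The"); a real model would provide these.
dist_after_the = {"nice": 0.5, "dog": 0.4, "car": 0.1}

rng = random.Random(0)  # fixed seed so repeated runs draw the same words
draws = [sample_next_word(dist_after_the, rng) for _ in range(1000)]
counts = {word: draws.count(word) for word in dist_after_the}
print(counts)
```

Over many draws the counts roughly follow the probabilities, so a low-probability word such as "car" is still picked now and then, which is exactly what greedy and beam search never do.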

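In transformers, we set do_sample=True and deactivate Top-K sampling (more on this later) via top_k=0. In the following, we fix random_seed=0 so that results are reproducible. Feel free to change the random_seed to play around with the model.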

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog. He just gave me a whole new hand sense."

But it seems that the dogs have learned a lot from teasing at the local batte harness once they take on the outside.

"I take

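Interesting! The text seems alright, but when taking a closer look, it is not very coherent. The 3-grams new hand sense and local batte harness are very weird and don't sound like they were written by a human. That is the big problem when sampling word sequences: the models often generate incoherent gibberish, cf. Ari Holtzman et al. (2019).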

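A trick to alleviate this is to make the distribution \( P(w|w_{1:t-1}) \) sharper by lowering the so-called temperature of the softmax. Lowering the temperature essentially increases the likelihood of high probability words and decreases the likelihood of low probability words.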

Applying temperature to our example from above could look as follows.

The conditional next word distribution at step \( t=1 \) becomes much sharper, leaving almost no chance for the word \( \text{“car”} \) to be selected.
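The effect of temperature can be checked numerically with a small sketch. The function `softmax_with_temperature` is a hypothetical helper, and the logits are derived from assumed toy probabilities (0.5, 0.4, 0.1) for "nice", "dog" and "car".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature before applying a softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy distribution over ("nice", "dog", "car"), expressed as logits.
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]

base = softmax_with_temperature(logits, 1.0)    # recovers the original distribution
cooled = softmax_with_temperature(logits, 0.7)  # sharper distribution
print([round(p, 3) for p in base])
print([round(p, 3) for p in cooled])
```

With a temperature below 1, the most likely word gains probability mass and the least likely word loses it; pushing the temperature further towards 0 concentrates nearly all mass on the top word.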

Let's see how we can cool down the distribution in the library by setting temperature=0.7:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't like to be at home too much. I also find it a bit weird when I'm out shopping. I am always away from my house a lot, but I do have a few friends

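OK. There are fewer weird n-grams and the output is a bit more coherent now! While applying temperature can make a distribution less random, in the limit, when the temperature approaches \( 0 \), temperature scaled sampling becomes equal to greedy decoding and will suffer from the same problems as before.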

Top-K Sampling

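Fan et al. (2018) introduced a simple, but very powerful sampling scheme, called Top-K sampling. In Top-K sampling, the K most likely next words are filtered, their probabilities are renormalized, and the next word is sampled from only those K words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.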

We extend the range of candidate words in the example above from 3 words to 10 words to better illustrate Top-K sampling.

[Figure: Top-K sampling]

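Having set \( K = 6 \), in both sampling steps we limit our sampling pool to 6 words. We define the set of the 6 most likely words as \( V_{\text{top-K}} \). In the first step, \( V_{\text{top-K}} \) encompasses only about two-thirds of the whole probability mass, but in the second step it includes almost all of it. Nevertheless, we see that it successfully eliminates the rather weird candidates \( (\text{“not”}, \text{“the”}, \text{“small”}, \text{“told”}) \) in the second sampling step.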
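The filtering and renormalization step can be sketched as follows. The helper `top_k_filter` is a hypothetical name, and both distributions contain made-up numbers rather than the values from the figure; only the word lists loosely follow the example.

```python
def top_k_filter(dist, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}

# Made-up distributions: a flat one for step t=1 and a sharp one for step t=2.
flat = {"nice": 0.18, "dog": 0.16, "car": 0.14, "woman": 0.12, "guy": 0.10,
        "man": 0.08, "people": 0.07, "big": 0.06, "house": 0.05, "cat": 0.04}
sharp = {"drives": 0.60, "is": 0.25, "turns": 0.10, "stops": 0.02, "down": 0.01,
         "a": 0.01, "not": 0.004, "the": 0.003, "small": 0.002, "told": 0.001}

flat_pool = top_k_filter(flat, k=6)
sharp_pool = top_k_filter(sharp, k=6)
print(sorted(flat_pool), sorted(sharp_pool))
```

The next word would then be sampled from the filtered pool only. Note how the same K removes plausible words from the flat distribution while keeping unlikely ones in the sharp distribution.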

Let's see how Top-K can be used in the library by setting top_k=50:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog. It's so good to have an environment where your dog is available to share with you and we'll be taking care of you.

We hope you'll find this story interesting!

I am from

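Not bad at all! The text is arguably the most human-sounding text so far. One concern with Top-K sampling though is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution \( P(w|w_{1:t-1}) \). This can be problematic, as some distributions are very sharp (the distribution on the right in the graph above), whereas others are much flatter (the distribution on the left in the graph above), so a single fixed K does not suit every distribution.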

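In step \( t=1 \), Top-K eliminates the possibility to sample \( (\text{“people”}, \text{“big”}, \text{“house”}, \text{“cat”}) \), which seem like reasonable candidates. On the other hand, in step \( t=2 \) the method includes the arguably ill-fitted words \( (\text{“down”}, \text{“a”}) \) in the sample pool. Thus, limiting the sample pool to a fixed size K could lead the model to produce gibberish for sharp distributions and limit the model's creativity for flat distributions. This intuition led Ari Holtzman et al. (2019) to create Top-p, or nucleus, sampling.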

Top-p (nucleus) sampling

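Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set (i.e. the number of words in it) can dynamically increase and decrease according to the next word's probability distribution. OK, that was very wordy, a picture is worth a thousand words.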

[Figure: Top-p sampling]

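Having set \( p=0.92 \), Top-p sampling sorts the word probabilities in descending order, accumulates them, and picks the minimum number of words whose summed probability exceeds \( p=92\% \) as the sampling pool, defined as \( V_{\text{top-p}} \). At \( t=1 \), \( V_{\text{top-p}} \) contains 9 words, whereas at \( t=2 \) it only has to pick the top 3 words to exceed 92%. Quite simple actually! It can be seen that it keeps a wide range of words where the next word is arguably less predictable, e.g. \( P(w | \text{“The”}) \), and only a few words when the next word seems more predictable, e.g. \( P(w | \text{“The”}, \text{“car”}) \).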
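The selection rule can be sketched in plain Python. `top_p_filter` is a hypothetical helper, and the two distributions hold made-up numbers, chosen only so that the pool sizes match the 9 and 3 words described above.

```python
def top_p_filter(dist, p):
    """Keep the smallest set of most probable words whose cumulative mass reaches p."""
    ranked = sorted(dist.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # stop as soon as the nucleus covers enough mass
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Made-up distributions: a flat one for step t=1 and a sharp one for step t=2.
flat = {"nice": 0.18, "dog": 0.16, "car": 0.14, "woman": 0.12, "guy": 0.10,
        "man": 0.08, "people": 0.07, "big": 0.06, "house": 0.05, "cat": 0.04}
sharp = {"drives": 0.60, "is": 0.25, "turns": 0.10, "stops": 0.02, "down": 0.01,
         "a": 0.01, "not": 0.004, "the": 0.003, "small": 0.002, "told": 0.001}

flat_pool = top_p_filter(flat, p=0.92)
sharp_pool = top_p_filter(sharp, p=0.92)
print(len(flat_pool), len(sharp_pool))
```

Unlike a fixed K, the pool shrinks or grows with the shape of the distribution: the flat distribution keeps most of its candidates, the sharp one only its few dominant words.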

Alright, time to check it out in transformers! We activate Top-p sampling by setting 0 < top_p < 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 *'-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog. He will never be the same. I watch him play.


Guys, my dog needs a name. Especially if he is found with wings.


What was that? I had a lot o

Great, that sounds like it could have been written by a human. Well, maybe not quite yet.

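While in theory Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.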

Finally, to get multiple independently sampled outputs, we can again set the parameter num_return_sequences > 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 *'-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog. It's so good to have the chance to walk with a dog. But I have this problem with the dog and how he's always looking at us and always trying to make me see that I can do something
1: I enjoy walking with my cute dog, she loves taking trips to different places on the planet, even in the desert! The world isn't big enough for us to travel by the bus with our beloved pup, but that's where I find my love
2: I enjoy walking with my cute dog and playing with our kids," said David J. Smith, director of the Humane Society of the US.

"So as a result, I've got more work in my time," he said.

Cool, now you should have all the tools to let your model write your stories with transformers!

Conclusion

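As more recent decoding methods, top-p and top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation. Recently, however, there has been more evidence that the apparent flaws of greedy and beam search, mainly generating repetitive word sequences, are caused by the model (especially the way the model is trained) rather than by the decoding method, cf. Welleck et al. (2019). Also, as demonstrated in Welleck et al. (2020), it looks as if top-K and top-p sampling also suffer from generating repetitive word sequences.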

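In Welleck et al. (2019), the authors show that according to human evaluations, beam search can generate more fluent text than Top-p sampling when the model's training objective is adapted.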

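Open-ended language generation is a rapidly evolving field of research and, as is often the case, there is no one-size-fits-all method here, so one has to see what works best in one's specific use case.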

The good thing is that you can try out all the different decoding methods in transformers.

That was a short introduction on how to use different decoding methods in transformers and on recent trends in open-ended language generation.

Feedback and questions are very welcome on the Github repository.

For more fun generating stories, please take a look at our web app Writing with Transformers.

Thanks to everybody who has contributed to the blog post: Alexander Rush, Julien Chaumand, Thomas Wolf, Victor Sanh, Sam Shleifer, Clément Delangue, Yacine Jernite, Oliver Åstrand and John de Wasseige.

Appendix

There are a couple of additional parameters for the generate method that were not mentioned above. We will explain them here briefly!

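min_length can be used to force the model to not produce an EOS token before min_length is reached. This is used quite frequently in summarization, but can be useful in general if the user wants to have longer outputs.
repetition_penalty can be used to penalize the generation of repeated words. It was first introduced by Keskar et al. (2019) and is also used in the training objective in Welleck et al. (2019). It can be quite effective at preventing repetitions, but seems to be very sensitive to different models and use cases, see for example the discussion on Github.
attention_mask can be used to mask padded tokens.
pad_token_id, bos_token_id, eos_token_id: If the model does not have those tokens by default, the user can manually choose other token ids to represent them.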
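To give an intuition for repetition_penalty, here is a minimal sketch of one common way such a penalty can be applied to raw logits, in the spirit of Keskar et al. (2019). The function `penalize_repetition` is a hypothetical helper, and the exact rule (divide positive logits, multiply negative ones) is an assumption for illustration rather than a statement about what generate does internally.

```python
def penalize_repetition(logits, generated_ids, penalty):
    """Lower the logits of tokens that were already generated.

    Positive logits are divided by `penalty` and negative logits are multiplied
    by it, so with penalty > 1 a repeated token always becomes less likely.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted

logits = [2.0, -1.0, 0.5]  # toy logits for a vocabulary of three tokens
generated_ids = [0, 1, 0]  # tokens 0 and 1 have already been produced
adjusted = penalize_repetition(logits, generated_ids, penalty=1.2)
print([round(x, 3) for x in adjusted])
```

A penalty of 1.0 leaves the logits unchanged, which makes it easy to compare outputs with and without the penalty.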

For more information, please also look into the generate function docstring.

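Original English post: https://hf.co/blog/how-to-generate
Original author: Patrick von Platen
Translator: Matrix Yao (姚偉峰), deep learning engineer at Intel, working on applying transformer-family models to data of various modalities and on training and inference of large-scale models.
Proofreading and layout: zhongdongy (阿東)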
