Generating Human-level Text with Contrastive Search in Transformers

Published by HuggingFace on 2023-05-16

1. Introduction

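Natural language generation (i.e. text generation) is one of the core tasks in natural language processing (NLP). This post introduces contrastive search, the current state-of-the-art decoding method for neural text generation. The method was originally proposed in "A Contrastive Framework for Neural Text Generation" [1], published at NeurIPS 2022. In the follow-up work "Contrastive Search Is What You Need For Neural Text Generation" [2], the authors further show that contrastive search can generate human-level text with off-the-shelf language models across 16 languages.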

[Remark] Readers who are not familiar with text generation can refer to this blog post for more details.

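2. Hugging Face Demo of Contrastive Search

Contrastive search is currently available in transformers on both the PyTorch and TensorFlow backends. You can explore the method in this Colab notebook, choosing the section that matches your backend; the notebook is also linked at the top of the post. We have also built a demo app that directly compares contrastive search with other popular decoding methods (e.g. beam search, top-k sampling [3], and nucleus sampling [4]).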

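3. Environment Installation

Before running the experiments in the following sections, install the transformers library (pinned here to version 4.24.0) as follows: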

pip install torch
pip install "transformers==4.24.0"
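
After installing, it can be handy to confirm which version is actually active in the environment. The sketch below is only an illustration: the helper names `version_tuple` and `supports_contrastive_search` are hypothetical, and treating the pinned 4.24.0 as the minimum required version is an assumption taken from the install command above, not a statement from the library.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.24.0' (or '4.25.0.dev0') into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'rc1'
        parts.append(int(piece))
    return tuple(parts)


def supports_contrastive_search(installed: str, minimum: str = "4.24.0") -> bool:
    """Assumption: the version pinned in this post is treated as the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("transformers")
    print(installed, supports_contrastive_search(installed))
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```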

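4. Problems of Existing Decoding Methods

Decoding methods fall into two categories: (i) deterministic methods and (ii) stochastic methods. We discuss each in turn.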

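4.1. Deterministic Methods

Deterministic methods, such as greedy search and beam search, generate text by selecting the continuation with the highest probability under the language model. However, as previous studies [3][4] have pointed out, deterministic methods often lead to _model degeneration_, i.e. the generated text is unnatural and contains undesirable repetitions.

Below is an example of text generated by greedy search with the GPT-2 model.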

from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
input_ids = tokenizer('DeepMind Company is', return_tensors='pt').input_ids
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

output = model.generate(input_ids, max_length=128)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leading AI research company, with a focus on deep learning and deep learning-based systems.

The company's research is focused on the development of deep learning-based systems that can learn from large amounts of data, and that can be used to solve real-world problems.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies
----------------------------------------------------------------------------------------------------

[Remark] The result generated by greedy search contains obvious repetitions.
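
To see why always picking the most probable token can loop, here is a minimal sketch with a hypothetical toy "language model" in which the next-token distribution depends only on the last token. The table `TOY_MODEL` and its probabilities are made up for illustration and have nothing to do with GPT-2; the point is only that a deterministic argmax revisits the same states once the chain closes on itself.

```python
# Hypothetical toy "language model": next-token probabilities keyed by the last token.
TOY_MODEL = {
    "the": {"research": 0.6, "company": 0.4},
    "research": {"is": 0.7, "team": 0.3},
    "is": {"used": 0.5, "focused": 0.3, "new": 0.2},
    "used": {"by": 0.9, "in": 0.1},
    "by": {"the": 0.8, "a": 0.2},
}


def greedy_decode(model, start, steps):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(steps):
        candidates = model.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_decode(TOY_MODEL, "the", 12)))
```

Because nothing in the selection rule looks at what has already been generated, the same five-token cycle is emitted over and over.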

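4.2. Stochastic Methods

To address the issues of deterministic methods, stochastic methods generate text by introducing randomness into the decoding process. Two widely used stochastic methods are (i) top-k sampling [3] and (ii) nucleus sampling (also called top-p sampling) [4].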
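
The two sampling schemes differ only in how they truncate the next-token distribution before sampling. The following is a minimal pure-Python sketch of that truncation, not the library's implementation; the function names and the example distribution `next_token_probs` are hypothetical.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, rng):
    """Draw one token according to the (already truncated) distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only.
next_token_probs = {"a": 0.5, "the": 0.3, "an": 0.15, "journalism": 0.05}
rng = random.Random(0)
print(top_k_filter(next_token_probs, k=2))
print(top_p_filter(next_token_probs, p=0.9))
print(sample(top_p_filter(next_token_probs, p=0.9), rng))
```

Top-k always keeps a fixed number of candidates, whereas top-p keeps more candidates when the distribution is flat and fewer when it is peaked.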

Below is an example of text generated by nucleus sampling (p=0.95) with the GPT-2 model.

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
input_ids = tokenizer('DeepMind Company is', return_tensors='pt').input_ids
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

# fix the random seed so the sampled output is reproducible
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=128, top_p=0.95, top_k=0)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leading provider of AI-based research, development, and delivery of AI solutions for security, infrastructure, machine learning, communications, and so on."

'AI is not journalism'

Worse still was the message its researchers hoped would reach the world's media — that it was not really research, but rather a get-rich-quick scheme to profit from living forces' ignorance.

"The thing is, we know that people don't consciously assess the value of the others'
information. They understand they will get the same on their own."

One example? Given the details of today
----------------------------------------------------------------------------------------------------

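[Remark] Although nucleus sampling can generate text free of repetitions, the semantic coherence of the generated text is not well maintained. For instance, the generated phrase 'AI is not journalism' is inconsistent with the given prefix, i.e. 'DeepMind Company'.

We note that this semantic inconsistency can be partially remedied by lowering the temperature. However, lowering the temperature brings nucleus sampling closer to greedy search, so the choice becomes a trade-off between greedy search and nucleus sampling. In general, it is challenging to find a temperature that is independent of both the prompt and the model and that avoids the pitfalls of both greedy search and nucleus sampling.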
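
The trade-off can be seen numerically. The sketch below applies a temperature to a hypothetical set of logits (the values are made up) and shows how the probability mass concentrates on the top token as the temperature drops, which is why low-temperature sampling behaves more and more like greedy search.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for four candidate tokens.
logits = [2.0, 1.0, 0.5, 0.0]
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```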


5. Contrastive Search

In this section, we introduce a new decoding method, _contrastive search_, in detail.

5.1. Decoding Objective

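Given the prefix text \( x_{< t} \), the output token \( x_{t} \) is selected as follows:

\[ x_{t} = \underset{v \in V^{(k)}}{\arg\max} \Big\{ (1-\alpha) \times \underbrace{p_{\theta}(v|x_{< t})}_{\text{model confidence}} - \alpha \times \underbrace{\max\{ s(h_{v}, h_{x_{j}}) : 1 \le j \le t-1 \}}_{\text{degeneration penalty}} \Big\} \]

Here, \( V^{(k)} \) is the set of the k most probable candidate tokens under the language model's probability distribution \( p_{\theta}(v|x_{< t}) \). The first term, _model confidence_, is the probability that the language model assigns to each candidate token \( v \). The second term, _degeneration penalty_, measures how dissimilar \( v \) is from each token in the preceding context \( x_{< t} \), where the function \( s(\cdot, \cdot) \) computes the cosine similarity between two token representations. More specifically, the degeneration penalty is defined as the maximum cosine similarity between the representation of \( v \), i.e. \( h_{v} \), and the representations of all tokens in the context \( x_{< t} \). The candidate representation \( h_{v} \) is computed by the language model given the concatenation of \( x_{< t} \) and \( v \). Intuitively, a larger degeneration penalty means that \( v \) is more similar (in the representation space) to the context and is therefore more likely to cause model degeneration. The hyperparameter \( \alpha \) balances the two terms. When \( \alpha=0 \), contrastive search reduces to plain greedy search.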

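[Remark] When generating output, contrastive search jointly considers (i) the probability predicted by the language model, to maintain semantic coherence between the generated text and the prefix text; and (ii) the similarity to the preceding context, to avoid model degeneration.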
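
A single selection step of this objective can be sketched in a few lines of pure Python. This is a minimal illustration, not the implementation in transformers: the function names, the two-dimensional hidden states, and the probabilities are all hypothetical, and the candidate representations are simply given here rather than computed by feeding each candidate back through a language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_step(candidates, context_states, alpha):
    """Pick the next token among the top-k candidates.

    candidates: list of (token, probability, hidden_state) for the top-k tokens.
    context_states: hidden states of the tokens already in the context.
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # degeneration penalty: highest similarity to any context token
        penalty = max(cosine(state, h) for h in context_states)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Hypothetical 2-d hidden states, for illustration only.
context = [[1.0, 0.0], [0.9, 0.1]]
top_k = [
    ("again", 0.6, [1.0, 0.05]),  # most probable, but nearly identical to the context
    ("new", 0.4, [0.0, 1.0]),     # less probable, but dissimilar to the context
]
print(contrastive_step(top_k, context, alpha=0.0))  # only model confidence counts
print(contrastive_step(top_k, context, alpha=0.6))
```

With \( \alpha=0 \) the step picks the most probable candidate, as greedy search would; with \( \alpha=0.6 \) the penalty outweighs the probability gap and the less repetitive candidate wins.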

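5.2. Generating Text with Contrastive Search

Below, we use the same prefix text as in Section 4.1 and Section 4.2 (i.e. "DeepMind Company is") and generate text with contrastive search (k=4 and \( \alpha=0.6 \)). To fully demonstrate the capability of contrastive search, we let the language model generate a long document of 512 tokens: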

from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = 'gpt2-large'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
model.eval()

# prepare the prefix
prefix_text = r'DeepMind Company is'
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids

# generate the result with contrastive search
output = model.generate(input_ids, penalty_alpha=0.6, top_k=4, max_length=512)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

The arguments are as follows:

  • top_k: the hyperparameter \( k \) in contrastive search.
  • penalty_alpha: the hyperparameter \( \alpha \) in contrastive search.

Model output:

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leader in artificial intelligence (AI). We have a long history of working with companies such as Google, Facebook, Amazon, and Microsoft to build products that improve people's lives, and today we are excited to announce that DeepMind's AlphaGo program has won the game of Go, becoming the first program to defeat a professional Go player.

The victory is a testament to the power of deep learning, and to the incredible work of our research team, which has been at the forefront of AI research for the past five years. AlphaGo is one of the most advanced Go programs ever created, and its performance is an important step towards the goal of human-level AI.

"This is the culmination of a decade of hard work," said Andy Ng, co-founder and CTO of DeepMind. "We are thrilled to have achieved this milestone and look forward to continuing to develop AI that can be used in a wide range of applications and to help people live better lives."

DeepMind's work on Go began in 2010, when it began to train a neural network to play Go using millions of games played by top Go players around the world. Since then, the team has refined the algorithm, adding more and more layers of reinforcement learning to make it better at recognizing patterns and making decisions based on those patterns. In the past year and a half, the team has made significant progress in the game, winning a record-tying 13 games in a row to move into the top four of the world rankings.

"The game of Go is a complex game in which players have to be very careful not to overextend their territory, and this is something that we have been able to improve over and over again," said Dr. Demis Hassabis, co-founder and Chief Scientific Officer of DeepMind. "We are very proud of our team's work, and we hope that it will inspire others to take the next step in their research and apply the same techniques to other problems."

In addition to the win in Go, DeepMind has also developed an AI system that can learn to play a number of different games, including poker, Go, and chess. This AI system, called Tarsier, was developed in partnership with Carnegie Mellon University and the University of California, Berkeley, and is being used to teach computer vision and machine learning to identify objects in images and recognize speech in natural language. Tarsier has been trained to play the game of Go and other games on a
----------------------------------------------------------------------------------------------------

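[Remark] The generated text is of very high quality. The entire document is grammatically fluent and semantically coherent. The generated text also maintains factual correctness well. For instance, in the first paragraph, it correctly describes "AlphaGo" as the "first program to defeat a professional Go player".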

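5.3. Visual Demonstration of Contrastive Search

To better understand how contrastive search works, we provide a visual comparison between greedy search (Section 4.1) and contrastive search. Specifically, we visualize the token similarity matrices of the text generated by greedy search and by contrastive search. The similarity between two tokens is defined as the cosine similarity between their token representations (i.e. the hidden states of the last transformer layer). The results of greedy search (top) and contrastive search (bottom) are shown in the figure below.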

[Figure: token similarity matrices for greedy search (top) and contrastive search (bottom)]

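[Remark] In the result of greedy search, the off-diagonal entries show high similarity scores, which clearly indicates the repetitions produced by greedy search. In contrast, in the result of contrastive search, high similarity scores appear mostly on the diagonal, which shows that the degeneration problem is successfully addressed. This desirable property of contrastive search is achieved by introducing the degeneration penalty (see Section 5.1) during decoding.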
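
The quantity being plotted can be computed as follows. This is a minimal sketch under stated assumptions: the hidden states below are hypothetical toy vectors rather than real last-layer states from a model, and `mean_off_diagonal` is an illustrative summary statistic that does not appear in the original analysis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(states):
    """Pairwise cosine similarity between token representations."""
    return [[cosine(u, v) for v in states] for u in states]


def mean_off_diagonal(matrix):
    """Average similarity over all pairs of distinct positions."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values)


# Hypothetical hidden states: a sequence that repeats itself vs. one that does not.
repetitive = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
varied = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(mean_off_diagonal(similarity_matrix(repetitive)))
print(mean_off_diagonal(similarity_matrix(varied)))
```

A repetitive sequence yields high similarity away from the diagonal, while a varied sequence concentrates it on the diagonal, mirroring the pattern described in the remark above.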

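6. More Generated Examples

In this section, we provide more generated examples to compare different decoding methods.

6.1. Example One: GPT-2

In this part, we use GPT-2 to generate text, with the prefix text taken from the blog post in which OpenAI announced the release of GPT-2: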

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Load the language model and prepare the prefix text:

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

prefix_text = r"In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids

6.1.1. Generating Text with Greedy Search

Code:

output = model.generate(input_ids, max_length=512)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The researchers, led by Dr. David R. Williams of the University of California, Santa Cruz, discovered the unicorns in the Andes Mountains of Peru. The area is known for its unique geology and is home to a number of rare species of animals.

The researchers found the unicorns in the Andes Mountains of Peru.

"We were surprised to find that the unicorns were able to communicate with each other," Williams said. "We were also surprised to find that they were able to communicate in English."

The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago.

"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place to hunt and gather food."

The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago.

"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place to hunt and gather food."

The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago.

"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place to hunt and gather food."

The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago.

"The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of the Andes," Williams said. "They were also the first people to use the Andes Mountains as a place to hunt and gather food."

The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago
----------------------------------------------------------------------------------------------------

6.1.2. Generating Text with Nucleus Sampling

Code:

# fix the random seed so the sampled output is reproducible
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=512, top_p=0.95, top_k=0)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. The study was published in the Journal of Zoology in March 2016.

Polygynous mammals such as unicorns have remained largely unknown to science. Professor Gustavo Giacota, from the University of Oxford who led the study, said that they had been documented as far as Eastern Siberia in Russia, but had only been seen a handful of times in the Gobi Desert.

Tiny animals with pale and shiny coats live in the presence of human beings and are hardly likely to be victims of any cruelty. However, there is some evidence of the condition occurring in both humans and animals in remote regions, which might have similarities to "black moles" that coexist on the skin.

It is thought that Unicorns could be inside themselves, that they have different scents depending on their current environment, or just fall out and there are plenty of legends of how they have survived. Experts speculate that the moths and other animals could be remnants of the Yezidi Isis and Charon, which literally is both the word which means great bird, and the Greek word for sound. It is said that the Isis and Charon taught their young the use of voice in the form of calling out to others.

The scientists think that it could be ancient folklore that has survived and is no longer attributed to a real entity
----------------------------------------------------------------------------------------------------

6.1.3. Generating Text with Contrastive Search

Code:

output = model.generate(input_ids, max_length=512, penalty_alpha=0.6, top_k=4)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

According to the BBC, a team of scientists led by Dr David MacKay, from the University of Bristol, spent two years searching for the unicorn herd, which they discovered during a survey of the area.

"It's a very rare find," MacKay told the BBC. "There are a few in the Himalayas, but this is the first time we've been able to find one in such a remote area."

The team was surprised to find a herd of unicorns living in a region that has been known to be a hotbed of poaching, with many of the animals poached for their horns, which are used in traditional Chinese medicine to treat everything from rheumatism to cancer.

"We knew that the area was rich in rhino horn, but we had no idea how many there were, or what they were doing there," MacKay said. "This is an area of high poaching pressure, and we wanted to find out what was going on."

In order to do so, the team used GPS collars to track the animals as they moved around the mountain and the surrounding area. The GPS data was then compared with information gathered from local villagers, who had a wealth of information about the animals' movements, including where they were eating, what they were doing at night, and how much time they spent in the mountains each day.

After analyzing the data, the team determined that the herd consisted of at least three species of unicorns, including a male and two females. One of the females was the mother of the male, and the other two were her daughters. All three had the same horn color, which is believed to be a sign of purity in the animal kingdom.

While the discovery is exciting, it's not the first time scientists have discovered an animal that speaks English. Last year, scientists discovered a species of porcupine that can be heard by humans, and has been dubbed "Porcupine Man" for his ability to converse with the human race.
----------------------------------------------------------------------------------------------------

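6.2. Example Two: OPT

In this part, we use the OPT model [5], recently released by Meta, and take the first two sentences of the abstract of the well-known ResNet paper [6] as the prefix text.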

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.

Load the language model and prepare the prefix text:

import torch
from transformers import AutoTokenizer, OPTForCausalLM
model_name = r'facebook/opt-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = OPTForCausalLM.from_pretrained(model_name)

prefix_text = r"Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously."
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids

6.2.1. Generating Text with Greedy Search

Code:

output = model.generate(input_ids, max_length=256)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We show that the residual learning framework can be used to train deep neural networks that are significantly more difficult to train than those used previously. We also show that the residual learning framework can be used to train deep neural networks that are significantly more difficult to train than those used previously.

The paper presents a new residual learning framework for deep neural networks that is based on the concept of residuals. The residuals are the residuals of the network that are not used in the training process. The residuals are computed by taking the residuals of the network that are used in the training process and subtracting the residuals of the network that are not used in the training process. The residuals are then used to train the network. The residuals are computed by taking the residuals of the network that are used in the training process and subtracting the residuals of the network that are not used in the training process. The residuals are then used to train the network. The residuals are computed by taking the residuals of the network that are used in the training process and subtracting the residuals of the
----------------------------------------------------------------------------------------------------

6.2.2. Generating Text with Nucleus Sampling

Code:

# fix the random seed so the sampled output is reproducible
torch.manual_seed(0)
output = model.generate(input_ids, do_sample=True, max_length=256, top_p=0.95, top_k=0)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. The theory focuses on several aspects of learning, including the dynamics of replicative and non-replicative aspects of learning. This framework emphasizes learning by entropy. New randomized algorithms enable training networks with residual learning, so that deep networks can be deployed as reliably and as efficiently as their more conventional counterparts.
----------------------------------------------------------------------------------------------------

6.2.3. Generating Text with Contrastive Search

Code:

output = model.generate(input_ids, max_length=256, penalty_alpha=0.6, top_k=6)
print("Output:\n" + 100 *'-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 *'-')

Model output:

Output:
----------------------------------------------------------------------------------------------------
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.

In this paper, we propose a model-based residual learning (MBRL) framework that is based on neural networks trained on data that is sparse in terms of dimensionality (e.g., 1, 2, 3, etc.). The network parameters are chosen such that there is a high probability of convergence, i.e., the number of iterations is large enough to minimize the variance of the residuals. This is achieved by training the network on a set of training data, in which the data is sparse in terms of dimensionality, and then discarding the nonparametric part of the data after training is complete.

We show that MBRL outperforms other methods for deep reinforcement learning (RL) and deep convolutional neural networks (CNNs) by a factor of at least 2. In addition, we show that, compared to CNNs, MBRL performs better in two-dimensional (2D) and three-dimensional (3D) cases.
----------------------------------------------------------------------------------------------------

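7. Resources

For more details on contrastive search, see our papers and code: "A Contrastive Framework for Neural Text Generation" [1] and "Contrastive Search Is What You Need For Neural Text Generation" [2].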


8. Citation

@inproceedings{su2022a,
   title={A Contrastive Framework for Neural Text Generation},
   author={Yixuan Su and Tian Lan and Yan Wang and Dani Yogatama and Lingpeng Kong and Nigel Collier},
   booktitle={Advances in Neural Information Processing Systems},
   editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
   year={2022},
   url={https://openreview.net/forum?id=V88BafmH9Pj}
}

@article{su2022contrastiveiswhatyouneed,
  title={Contrastive Search Is What You Need For Neural Text Generation},
  author={Su, Yixuan and Collier, Nigel},
  journal={arXiv preprint arXiv:2210.14140},
  year={2022}
}

References

[1] Su et al., 2022 “A Contrastive Framework for Neural Text Generation”, NeurIPS 2022

[2] Su and Collier, 2022 “Contrastive Search Is What You Need For Neural Text Generation”, Arxiv 2022

[3] Fan et al., 2018 “Hierarchical Neural Story Generation”, ACL 2018

[4] Holtzman et al., 2020 “The Curious Case of Neural Text Degeneration”, ICLR 2020

[5] Zhang et al., 2022 “OPT: Open Pre-trained Transformer Language Models”, Arxiv 2022

[6] He et al., 2016 “Deep Residual Learning for Image Recognition”, CVPR 2016


- Written by Yixuan Su and Tian Lan


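Acknowledgements

We would like to thank Joao Gante (@joaogante), Patrick von Platen (@patrickvonplaten), and Sylvain Gugger (@sgugger) for their help and guidance while we integrated the contrastive search described in this post into the transformers library.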


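Original English post: https://hf.co/blog/introducing-csearch

Original author: Tian Lan

Translator: Matrix Yao (姚偉峰), deep learning engineer at Intel, working on applications of transformer-family models to data of various modalities and on the training and inference of large-scale models.

Review and typesetting: zhongdongy (阿東)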
