Sentence Embeddings: Cross-Encoders and Re-Ranking

Published on 2024-02-12

The goal of this series is to demystify embeddings and show how to use them in your projects. The first blog post covered how to use and scale open-source embedding models, how to pick an existing model, current evaluation methods, and the state of the ecosystem. This second post dives deeper into embeddings and explains the difference between bi-encoders and cross-encoders. We'll then look at the theory behind retrieval and re-ranking, and build a tool that can answer questions about roughly 400 AI papers. At the end, we'll briefly discuss two related papers.

You can read it here or run it in Google Colab by clicking the icon at the top left. Let's get started!

Brief overview

Sentence Transformers supports two types of models: bi-encoders and cross-encoders. Bi-encoders are faster and more scalable, while cross-encoders are more accurate. Although both tackle similar high-level tasks, when to use one versus the other is quite different: bi-encoders are better suited for search, while cross-encoders are better suited for classification and high-precision ranking. The details follow below.

Introduction

The models we've seen so far are all bi-encoders. A bi-encoder encodes input text into a fixed-length vector. To compute the similarity between two sentences, we typically encode each sentence into a vector and then compute the similarity between the two vectors (for example, with cosine similarity). Bi-encoders are trained so that the similarity between a question and a relevant sentence increases while the similarity with other sentences decreases, which explains why they are well suited for search. As mentioned in the previous post, bi-encoders are very fast and easy to scale. Given multiple sentences, a bi-encoder encodes each one independently, so each sentence embedding is independent of the others. That's great for search, because we can encode millions of sentences in parallel. However, it also means the bi-encoder knows nothing about the relationships between sentences.
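As a quick refresher, here is a minimal sketch of what that looks like in code (the model name is just an example of a small bi-encoder):

from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder works here

# Each sentence is encoded independently into a fixed-length vector
emb1 = bi_encoder.encode("How many people live in Berlin?", convert_to_tensor=True)
emb2 = bi_encoder.encode("Berlin has about 3.5 million registered inhabitants.", convert_to_tensor=True)

# The similarity is computed afterwards, outside the model, e.g. with cosine similarity
print(util.cos_sim(emb1, emb2))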

Things are different with a cross-encoder. A cross-encoder encodes the two sentences simultaneously and outputs a classification score. The figure illustrates the difference between the two.

Why use one over the other? Cross-encoders are slower and need more memory, but they are also more accurate. A cross-encoder is a great choice when comparing a few dozen sentences. If we need to compare thousands of sentences, a bi-encoder is the better choice, because otherwise it would be extremely slow. But what if you care about accuracy and still want to compare thousands of sentences efficiently? This is the classic information retrieval scenario: one option is to first use a bi-encoder to reduce the number of candidates (for example, retrieve the 20 most relevant examples) and then use a cross-encoder to produce the final results. This is called re-ranking and is a common technique in information retrieval. We'll learn more about it later.

Because cross-encoders are more accurate, they are also well suited to tasks where small differences matter, such as medical or legal documents, where a tiny difference can change the entire meaning of a sentence.

Cross-encoders

As mentioned before, a cross-encoder encodes the two sentences simultaneously and outputs a classification label. Internally, the cross-encoder produces a single representation that captures both sentences and their relationship. Unlike the embeddings produced by a bi-encoder (which are independent), the cross-encoder's representation depends on both inputs. This is why cross-encoders are better suited for classification and why their quality is higher: they can capture the relationship between the two sentences! The flip side is that cross-encoders are slow if you need to compare thousands of sentences, because they have to encode every sentence pair.

Suppose you have four sentences and you need to compare all possible pairs:

  • A bi-encoder encodes each sentence independently, so it needs to encode four sentences.
  • A cross-encoder has to encode each pair of sentences together, so it needs to encode six pairs (AB, AC, AD, BC, BD, CD).

Let's scale this up. If you have 100,000 sentences and need to compare all possible pairs:

  • A bi-encoder needs to encode 100,000 sentences.
  • A cross-encoder needs to encode 4,999,950,000 sentence pairs (using the combination formula n! / (r!(n-r)!) with n = 100,000 and r = 2; see the quick check below), so this does not scale well.
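As a sanity check of that number, you can compute the pair count directly:

import math

# Unique pairs among 100,000 sentences: C(100000, 2) = 100000 * 99999 / 2
print(math.comb(100_000, 2))  # 4999950000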

No wonder cross-encoders are slower!

Although a cross-encoder does have an intermediate embedding before the classification layer, that embedding is not used for similarity search. This is because cross-encoders are trained to optimize a classification loss rather than a similarity loss, so the embedding is geared toward the classification task and is not meant for similarity tasks.

Cross-encoders can be used for different tasks, for example passage retrieval (given a question and a passage, is the passage relevant to the question?). Let's look at a quick code snippet using a small cross-encoder model trained for exactly this:

!pip install sentence_transformers datasets
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2', max_length=512)
scores = model.predict([('How many people live in Berlin?', 'Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'), 
                        ('How many people live in Berlin?', 'Berlin is well known for its museums.')])
scores
array([ 7.152365 , -6.2870445], dtype=float32)

Another use case for cross-encoders is semantic similarity, much like what we did with bi-encoders: given two sentences, are they semantically similar? Although this is the same task we solved with a bi-encoder, the cross-encoder is more accurate but slower.

model = CrossEncoder('cross-encoder/stsb-TinyBERT-L-4')
scores = model.predict([("The weather today is beautiful", "It's raining!"), 
                        ("The weather today is beautiful", "Today is a sunny day")])
scores
array([0.46552283, 0.6350213 ], dtype=float32)

Retrieval and re-ranking

Now that we understand the difference between cross-encoders and bi-encoders, let's see how to use them in practice to build a retrieve-and-re-rank system. This is a common information retrieval technique: first retrieve the most relevant documents, then re-rank them with a more accurate model. It's a great choice when you need to compare a query against thousands of sentences efficiently while still caring about precision.

Suppose you have a corpus of 100,000 sentences and want to find the most relevant ones for a given query. The first step is to use a bi-encoder to retrieve a fairly large set of candidates (to ensure recall). Then a cross-encoder re-ranks those candidates to produce the final, high-precision results. Here is what the system looks like at a high level:

Let's try it out by building a paper search system! We'll use an AI ArXiv dataset that comes from an excellent Pinecone tutorial on re-ranking. The goal is to ask a question about AI, retrieve the most relevant paper chunks, and answer the question.

from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv-chunked")
dataset["train"]
Found cached dataset json (/home/osanseviero/.cache/huggingface/datasets/jamescalam___json/jamescalam--ai-arxiv-chunked-0d76bdc6812ffd50/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
 0%|          | 0/1 [00:00<?, ?it/s]
Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

If you inspect the dataset, it contains around 400 ArXiv papers that have been chunked: each paper is split into smaller sections so that the model can process them more easily. Here is a sample:

dataset["train"][0]
{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be finetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-specific\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof its language understanding capabilities and being 60% faster. To leverage the\ninductive biases learned by larger models during pre-training, we introduce a triple\nloss combining language modeling, distillation and cosine-distance losses. Our\nsmaller, faster and lighter model is cheaper to pre-train and we demonstrate its',
 'id': '1910.01108',
 'title': 'DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter',
 'summary': 'As Transfer Learning from large-scale pre-trained models becomes more\nprevalent in Natural Language Processing (NLP), operating these large models in\non-the-edge and/or under constrained computational training or inference\nbudgets remains challenging. In this work, we propose a method to pre-train a\nsmaller general-purpose language representation model, called DistilBERT, which\ncan then be fine-tuned with good performances on a wide range of tasks like its\nlarger counterparts. While most prior work investigated the use of distillation\nfor building task-specific models, we leverage knowledge distillation during\nthe pre-training phase and show that it is possible to reduce the size of a\nBERT model by 40%, while retaining 97% of its language understanding\ncapabilities and being 60% faster. To leverage the inductive biases learned by\nlarger models during pre-training, we introduce a triple loss combining\nlanguage modeling, distillation and cosine-distance losses. Our smaller, faster\nand lighter model is cheaper to pre-train and we demonstrate its capabilities\nfor on-device computations in a proof-of-concept experiment and a comparative\non-device study.',
 'source': 'http://arxiv.org/pdf/1910.01108',
 'authors': ['Victor Sanh',
  'Lysandre Debut',
  'Julien Chaumond',
  'Thomas Wolf'],
 'categories': ['cs.CL'],
 'comment': 'February 2020 - Revision: fix bug in evaluation metrics, updated\n  metrics, argumentation unchanged. 5 pages, 1 figure, 4 tables. Accepted at\n  the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing\n  - NeurIPS 2019',
 'journal_ref': None,
 'primary_category': 'cs.CL',
 'published': '20191002',
 'updated': '20200301',
 'references': [{'id': '1910.01108'}]}

Let's grab all the chunks and then encode them:

chunks = dataset["train"]["chunk"] 
len(chunks)
41584

Now we'll use a bi-encoder to encode all the chunks into embeddings. We'll truncate passages that are too long to at most 256 tokens; note that a short maximum sequence length is one of the limitations of many embedding models! We'll specifically use the multi-qa-MiniLM-L6-cos-v1 model, a small model trained to encode questions and passages into a small embedding space. Since it's a bi-encoder, it's fast and easy to scale.

Embedding all 40,000+ chunks takes about 30 seconds on my modest machine. Note that we only need to compute the embeddings once; we can then save them to disk and load them later. In production, you would store the embeddings in a database and load them from there.

from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256

corpus_embeddings = bi_encoder.encode(chunks, convert_to_tensor=True, show_progress_bar=True)
Batches:   0%|          | 0/1300 [00:00<?, ?it/s]
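As mentioned above, these embeddings only need to be computed once. Here's a minimal sketch of caching them to disk (the file name is arbitrary):

import torch

# Cache the encoded corpus after the first run...
torch.save(corpus_embeddings, "corpus_embeddings.pt")

# ...and in later sessions, load it instead of re-encoding the 40,000+ chunks
corpus_embeddings = torch.load("corpus_embeddings.pt")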

Awesome! Now let's try a question and search for its relevant passages. To do that, we encode the question and compute the similarity between the question and all the passages. Let's go and look at the top few results!

from sentence_transformers import util

query = "what is rlhf?"
top_k = 25 # how many chunks to retrieve
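# Note: .cuda() below assumes a GPU is available; remove it to run on the CPU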
query_embedding = bi_encoder.encode(query, convert_to_tensor=True).cuda()

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
hits
[{'corpus_id': 14679, 'score': 0.6097552180290222},
 {'corpus_id': 17387, 'score': 0.5659530162811279},
 {'corpus_id': 39564, 'score': 0.5590510368347168},
 {'corpus_id': 14725, 'score': 0.5585878491401672},
 {'corpus_id': 5628, 'score': 0.5296251773834229},
 {'corpus_id': 14802, 'score': 0.5075011253356934},
 {'corpus_id': 9761, 'score': 0.49943411350250244},
 {'corpus_id': 14716, 'score': 0.4931946098804474},
 {'corpus_id': 9763, 'score': 0.49280521273612976},
 {'corpus_id': 20638, 'score': 0.4884325861930847},
 {'corpus_id': 20653, 'score': 0.4873950183391571},
 {'corpus_id': 9755, 'score': 0.48562008142471313},
 {'corpus_id': 14806, 'score': 0.4792214035987854},
 {'corpus_id': 14805, 'score': 0.475425660610199},
 {'corpus_id': 20652, 'score': 0.4740477204322815},
 {'corpus_id': 20711, 'score': 0.4703512489795685},
 {'corpus_id': 20632, 'score': 0.4695567488670349},
 {'corpus_id': 14750, 'score': 0.46810320019721985},
 {'corpus_id': 14749, 'score': 0.46809980273246765},
 {'corpus_id': 35209, 'score': 0.46695172786712646},
 {'corpus_id': 14671, 'score': 0.46657535433769226},
 {'corpus_id': 14821, 'score': 0.4637290835380554},
 {'corpus_id': 14751, 'score': 0.4585301876068115},
 {'corpus_id': 14815, 'score': 0.45775431394577026},
 {'corpus_id': 35250, 'score': 0.4569615125656128}]
#Let's store the IDs for later
retrieval_corpus_ids = [hit['corpus_id'] for hit in hits]

# Now let's print the top 3 results
for i, hit in enumerate(hits[:3]):
    sample = dataset["train"][hit["corpus_id"]]
    print(f"Top {i+1} passage with score {hit['score']} from {sample['source']}:")
    print(sample["chunk"])
    print("\n")
Top 1 passage with score 0.6097552180290222 from http://arxiv.org/pdf/2204.05862:
learning from human feedback, which we improve on a roughly weekly cadence. See Section 2.3.
4This means that our helpfulness dataset goes ‘up’ in desirability during the conversation, while our harmlessness
dataset goes ‘down’ in desirability. We chose the latter to thoroughly explore bad behavior, but it is likely not ideal
for teaching good behavior. We believe this difference in our data distributions creates subtle problems for RLHF, and
suggest that others who want to use RLHF to train safer models consider the analysis in Section 4.4.
5
1071081091010
Number of Parameters0.20.30.40.50.6Mean Eval Acc
Mean Zero-Shot Accuracy
Plain Language Model
RLHF
1071081091010
Number of Parameters0.20.30.40.50.60.7Mean Eval Acc
Mean Few-Shot Accuracy
Plain Language Model
RLHFFigure 3 RLHF model performance on zero-shot and few-shot NLP tasks. For each model size, we plot
the mean accuracy on MMMLU, Lambada, HellaSwag, OpenBookQA, ARC-Easy, ARC-Challenge, and
TriviaQA. On zero-shot tasks, RLHF training for helpfulness and harmlessness hurts performance for small


Top 2 passage with score 0.5659530162811279 from http://arxiv.org/pdf/2302.07842:
preferences and values which are difficult to capture by hard- coded reward functions.
RLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,
ranking two model generations for the same prompt. This data is then collected to learn a reward model
that predicts a scalar reward given any generated text. The r eward captures human preferences when
judging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient
algorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM
pre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not
be good enough. In such cases, RLHF is typically applied afte r an initial supervised fine-tuning phase using
a small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;
Ouyang et al. ,2022;Stiennon et al. ,2020).
A successful example of RLHF used to teach a LM to use an extern al tool stems from WebGPT Nakano et al.
(2021) (discussed in 3.2.3), a model capable of answering questions using a search engine and providing


Top 3 passage with score 0.5590510368347168 from http://arxiv.org/pdf/2307.09288:
31
5 Discussion
Here, we discuss the interesting properties we have observed with RLHF (Section 5.1). We then discuss the
limitations of L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc (Section 5.2). Lastly, we present our strategy for responsibly releasing these
models (Section 5.3).
5.1 Learnings and Observations
Our tuning process revealed several interesting results, such as L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc ’s abilities to temporally
organize its knowledge, or to call APIs for external tools.
SFT (Mix)
SFT (Annotation)
RLHF (V1)
0.0 0.2 0.4 0.6 0.8 1.0
Reward Model ScoreRLHF (V2)
Figure 20: Distribution shift for progressive versions of L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc , from SFT models towards RLHF.
Beyond Human Supervision. At the outset of the project, many among us expressed a preference for


Great! We got the most similar chunks according to the bi-encoder, which gives us high recall but low precision.

Now, let's re-rank with a high-precision cross-encoder model. We'll use the cross-encoder/ms-marco-MiniLM-L-6-v2 model, which was fine-tuned on the MS MARCO dataset, a large real-world question answering and information retrieval dataset. That makes this model well suited for judging relevance in a question answering setting.

We'll use the same question together with the top 25 chunks we retrieved with the bi-encoder. Let's see the results! Recall that a cross-encoder needs pairs, so we'll build pairs of the question and each chunk.

from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

cross_inp = [[query, chunks[hit['corpus_id']]] for hit in hits]
cross_scores = cross_encoder.predict(cross_inp)
cross_scores 
array([ 1.2227577 ,  5.048051  ,  1.2897239 ,  2.205767  ,  4.4136825 ,
        1.2272772 ,  2.5638275 ,  0.81847703,  2.35553   ,  5.590804  ,
        1.3877895 ,  2.9497519 ,  1.6762824 ,  0.7211323 ,  0.16303705,
        1.3640019 ,  2.3106787 ,  1.5849439 ,  2.9696884 , -1.1079378 ,
        0.7681126 ,  1.5945492 ,  2.2869687 ,  3.5448399 ,  2.056368  ],
      dtype=float32)

Let's add a new cross-score attribute and sort by it!

for idx in range(len(cross_scores)):
    hits[idx]['cross-score'] = cross_scores[idx]
hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
msmarco_l6_corpus_ids = [hit['corpus_id'] for hit in hits] # save for later

hits
[{'corpus_id': 20638, 'score': 0.4884325861930847, 'cross-score': 5.590804},
 {'corpus_id': 17387, 'score': 0.5659530162811279, 'cross-score': 5.048051},
 {'corpus_id': 5628, 'score': 0.5296251773834229, 'cross-score': 4.4136825},
 {'corpus_id': 14815, 'score': 0.45775431394577026, 'cross-score': 3.5448399},
 {'corpus_id': 14749, 'score': 0.46809980273246765, 'cross-score': 2.9696884},
 {'corpus_id': 9755, 'score': 0.48562008142471313, 'cross-score': 2.9497519},
 {'corpus_id': 9761, 'score': 0.49943411350250244, 'cross-score': 2.5638275},
 {'corpus_id': 9763, 'score': 0.49280521273612976, 'cross-score': 2.35553},
 {'corpus_id': 20632, 'score': 0.4695567488670349, 'cross-score': 2.3106787},
 {'corpus_id': 14751, 'score': 0.4585301876068115, 'cross-score': 2.2869687},
 {'corpus_id': 14725, 'score': 0.5585878491401672, 'cross-score': 2.205767},
 {'corpus_id': 35250, 'score': 0.4569615125656128, 'cross-score': 2.056368},
 {'corpus_id': 14806, 'score': 0.4792214035987854, 'cross-score': 1.6762824},
 {'corpus_id': 14821, 'score': 0.4637290835380554, 'cross-score': 1.5945492},
 {'corpus_id': 14750, 'score': 0.46810320019721985, 'cross-score': 1.5849439},
 {'corpus_id': 20653, 'score': 0.4873950183391571, 'cross-score': 1.3877895},
 {'corpus_id': 20711, 'score': 0.4703512489795685, 'cross-score': 1.3640019},
 {'corpus_id': 39564, 'score': 0.5590510368347168, 'cross-score': 1.2897239},
 {'corpus_id': 14802, 'score': 0.5075011253356934, 'cross-score': 1.2272772},
 {'corpus_id': 14679, 'score': 0.6097552180290222, 'cross-score': 1.2227577},
 {'corpus_id': 14716, 'score': 0.4931946098804474, 'cross-score': 0.81847703},
 {'corpus_id': 14671, 'score': 0.46657535433769226, 'cross-score': 0.7681126},
 {'corpus_id': 14805, 'score': 0.475425660610199, 'cross-score': 0.7211323},
 {'corpus_id': 20652, 'score': 0.4740477204322815, 'cross-score': 0.16303705},
 {'corpus_id': 35209, 'score': 0.46695172786712646, 'cross-score': -1.1079378}]

As you can see above, the cross-encoder's ranking does not agree with the bi-encoder's. Interestingly, some of the top cross-encoder results (14815 and 14749) had among the lowest bi-encoder scores. This makes sense: the bi-encoder compares the question and each document by their similarity in embedding space, while the cross-encoder looks at the question and document together and judges their relevance.
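If you want to quantify how much the two orderings disagree, one rough sketch is to compute a rank correlation between them (using the retrieval_corpus_ids and msmarco_l6_corpus_ids lists we stored above):

from scipy.stats import spearmanr

# For each chunk in the cross-encoder ranking, find its position in the bi-encoder ranking
bi_positions = [retrieval_corpus_ids.index(cid) for cid in msmarco_l6_corpus_ids]
cross_positions = list(range(len(msmarco_l6_corpus_ids)))

# A correlation close to 1 means the two models rank the chunks similarly
print(spearmanr(bi_positions, cross_positions))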

for i, hit in enumerate(hits[:3]):
    sample = dataset["train"][hit["corpus_id"]]
    print(f"Top {i+1} passage with score {hit['cross-score']} from {sample['source']}:")
    print(sample["chunk"])
    print("\n")
Top 1 passage with score 0.9668010473251343 from http://arxiv.org/pdf/2204.05862:
Stackoverflow Good Answer vs. Bad Answer Loss Difference
Python FT
Python FT + RLHF(b)Difference in mean log-prob between good and bad
answers to Stack Overflow questions.
Figure 37 Analysis of RLHF on language modeling for good and bad Stack Overflow answers, over many
model sizes, ranging from 13M to 52B parameters. Compared to the baseline model (a pre-trained LM
finetuned on Python code), the RLHF model is more capable of distinguishing quality (right) , but is worse
at language modeling (left) .
the RLHF models obtain worse loss. This is most likely due to optimizing a different objective rather than
pure language modeling.
B.8 Further Analysis of RLHF on Code-Model Snapshots
As discussed in Section 5.3, RLHF improves performance of base code models on code evals. In this appendix, we compare that with simply prompting the base code model with a sample of prompts designed to
elicit helpfulness, harmlessness, and honesty, which we refer to as ‘HHH’ prompts. In particular, they contain
a couple of coding examples. Below is a description of what this prompt looks like:
Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful,


Top 2 passage with score 0.9574587345123291 from http://arxiv.org/pdf/2302.07459:
We examine the influence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an
increasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of
these models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,
previous work shows that the amount of RLHF training can significantly change metrics on a wide range of
personality, political preference, and harm evaluations for a given model size [41]. As a result, it is important
to control for the amount of RLHF training in the analysis of our experiments.
3.2 Experiments
3.2.1 Overview
We test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping
and discrimination. Stereotyping involves the use of generalizations about groups in ways that are often
harmful or undesirable.4To measure stereotyping, we use two well-known stereotyping benchmarks, BBQ
[40] (§3.2.2) and Windogender [49] (§3.2.3). For discrimination, we focus on whether models make disparate
decisions about individuals based on protected characteristics that should have no relevance to the outcome.5
To measure discrimination, we construct a new benchmark to test for the impact of race in a law school course


Top 3 passage with score 0.9408788084983826 from http://arxiv.org/pdf/2302.07842:
preferences and values which are difficult to capture by hard- coded reward functions.
RLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,
ranking two model generations for the same prompt. This data is then collected to learn a reward model
that predicts a scalar reward given any generated text. The r eward captures human preferences when
judging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient
algorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM
pre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not
be good enough. In such cases, RLHF is typically applied afte r an initial supervised fine-tuning phase using
a small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;
Ouyang et al. ,2022;Stiennon et al. ,2020).
A successful example of RLHF used to teach a LM to use an extern al tool stems from WebGPT Nakano et al.
(2021) (discussed in 3.2.3), a model capable of answering questions using a search engine and providing


Excellent! These results do seem relevant to the question. What else can we do to improve them?

Here we used cross-encoder/ms-marco-MiniLM-L-6-v2, a model that is already three years old and quite small. It was one of the best re-ranking models of its time.

To pick a model, I suggest going to the MTEB leaderboard, clicking on Reranking, and choosing a model that suits your needs. The average column is a good proxy for overall quality, but you may care about a specific dataset (for example, MSMarco in the Retrieval tab).

Note that older models, such as MiniLM, are not listed there. Also, not all of the models there are cross-encoders, so always experiment to see whether adding a second, slower re-ranking stage is worth it. Here are a couple of interesting findings:

  1. E5 Mistral 7B Instruct (December 2023): a decoder-based embedding model (unlike the encoder-based models we studied before). This is interesting because using decoders rather than encoders is a recent trend and allows handling longer texts. Here is the related paper.
  2. BAAI Reranker (September 2023): a high-quality re-ranking model with a moderate size (only 278M parameters). Let's use it to get results and compare.
# Same code as before, just different model
cross_encoder = CrossEncoder('BAAI/bge-reranker-base')

cross_inp = [[query, chunks[hit['corpus_id']]] for hit in hits]
cross_scores = cross_encoder.predict(cross_inp)

for idx in range(len(cross_scores)):
    hits[idx]['cross-score'] = cross_scores[idx]

hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
bge_corpus_ids = [hit['corpus_id'] for hit in hits]
for i, hit in enumerate(hits[:3]):
    sample = dataset["train"][hit["corpus_id"]]
    print(f"Top {i+1} passage with score {hit['cross-score']} from {sample['source']}:")
    print(sample["chunk"])
    print("\n")
Top 1 passage with score 0.9668010473251343 from http://arxiv.org/pdf/2204.05862:
Stackoverflow Good Answer vs. Bad Answer Loss Difference
Python FT
Python FT + RLHF(b)Difference in mean log-prob between good and bad
answers to Stack Overflow questions.
Figure 37 Analysis of RLHF on language modeling for good and bad Stack Overflow answers, over many
model sizes, ranging from 13M to 52B parameters. Compared to the baseline model (a pre-trained LM
finetuned on Python code), the RLHF model is more capable of distinguishing quality (right) , but is worse
at language modeling (left) .
the RLHF models obtain worse loss. This is most likely due to optimizing a different objective rather than
pure language modeling.
B.8 Further Analysis of RLHF on Code-Model Snapshots
As discussed in Section 5.3, RLHF improves performance of base code models on code evals. In this appendix, we compare that with simply prompting the base code model with a sample of prompts designed to
elicit helpfulness, harmlessness, and honesty, which we refer to as ‘HHH’ prompts. In particular, they contain
a couple of coding examples. Below is a description of what this prompt looks like:
Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful,


Top 2 passage with score 0.9574587345123291 from http://arxiv.org/pdf/2302.07459:
We examine the influence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an
increasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of
these models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,
previous work shows that the amount of RLHF training can significantly change metrics on a wide range of
personality, political preference, and harm evaluations for a given model size [41]. As a result, it is important
to control for the amount of RLHF training in the analysis of our experiments.
3.2 Experiments
3.2.1 Overview
We test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping
and discrimination. Stereotyping involves the use of generalizations about groups in ways that are often
harmful or undesirable.4To measure stereotyping, we use two well-known stereotyping benchmarks, BBQ
[40] (§3.2.2) and Windogender [49] (§3.2.3). For discrimination, we focus on whether models make disparate
decisions about individuals based on protected characteristics that should have no relevance to the outcome.5
To measure discrimination, we construct a new benchmark to test for the impact of race in a law school course


Top 3 passage with score 0.9408788084983826 from http://arxiv.org/pdf/2302.07842:
preferences and values which are difficult to capture by hard- coded reward functions.
RLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,
ranking two model generations for the same prompt. This data is then collected to learn a reward model
that predicts a scalar reward given any generated text. The r eward captures human preferences when
judging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient
algorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM
pre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not
be good enough. In such cases, RLHF is typically applied afte r an initial supervised fine-tuning phase using
a small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;
Ouyang et al. ,2022;Stiennon et al. ,2020).
A successful example of RLHF used to teach a LM to use an extern al tool stems from WebGPT Nakano et al.
(2021) (discussed in 3.2.3), a model capable of answering questions using a search engine and providing

Let's compare the rankings from the three models:

for i in range(25):
    print(f"Top {i+1} passage. Bi-encoder {retrieval_corpus_ids[i]}, Cross-encoder (MS Marco) {msmarco_l6_corpus_ids[i]}, BGE {bge_corpus_ids[i]}")
Top 1 passage. Bi-encoder 14679, Cross-encoder (MS Marco) 20638, BGE 14815
Top 2 passage. Bi-encoder 17387, Cross-encoder (MS Marco) 17387, BGE 20638
Top 3 passage. Bi-encoder 39564, Cross-encoder (MS Marco) 5628, BGE 17387
Top 4 passage. Bi-encoder 14725, Cross-encoder (MS Marco) 14815, BGE 14679
Top 5 passage. Bi-encoder 5628, Cross-encoder (MS Marco) 14749, BGE 9761
Top 6 passage. Bi-encoder 14802, Cross-encoder (MS Marco) 9755, BGE 39564
Top 7 passage. Bi-encoder 9761, Cross-encoder (MS Marco) 9761, BGE 20632
Top 8 passage. Bi-encoder 14716, Cross-encoder (MS Marco) 9763, BGE 14725
Top 9 passage. Bi-encoder 9763, Cross-encoder (MS Marco) 20632, BGE 9763
Top 10 passage. Bi-encoder 20638, Cross-encoder (MS Marco) 14751, BGE 14750
Top 11 passage. Bi-encoder 20653, Cross-encoder (MS Marco) 14725, BGE 14805
Top 12 passage. Bi-encoder 9755, Cross-encoder (MS Marco) 35250, BGE 9755
Top 13 passage. Bi-encoder 14806, Cross-encoder (MS Marco) 14806, BGE 14821
Top 14 passage. Bi-encoder 14805, Cross-encoder (MS Marco) 14821, BGE 14802
Top 15 passage. Bi-encoder 20652, Cross-encoder (MS Marco) 14750, BGE 14749
Top 16 passage. Bi-encoder 20711, Cross-encoder (MS Marco) 20653, BGE 5628
Top 17 passage. Bi-encoder 20632, Cross-encoder (MS Marco) 20711, BGE 14751
Top 18 passage. Bi-encoder 14750, Cross-encoder (MS Marco) 39564, BGE 14716
Top 19 passage. Bi-encoder 14749, Cross-encoder (MS Marco) 14802, BGE 14806
Top 20 passage. Bi-encoder 35209, Cross-encoder (MS Marco) 14679, BGE 20711
Top 21 passage. Bi-encoder 14671, Cross-encoder (MS Marco) 14716, BGE 20652
Top 22 passage. Bi-encoder 14821, Cross-encoder (MS Marco) 14671, BGE 14671
Top 23 passage. Bi-encoder 14751, Cross-encoder (MS Marco) 14805, BGE 20653
Top 24 passage. Bi-encoder 14815, Cross-encoder (MS Marco) 20652, BGE 35209
Top 25 passage. Bi-encoder 35250, Cross-encoder (MS Marco) 35209, BGE 35250

Very interesting, we got quite different results! Let's briefly look at some of them.

I suggest running something like dataset["train"][20638]["chunk"] to print a specific result. Below is a snapshot of the results.

The bi-encoder did a decent job of retrieving results about RLHF, but it struggled to surface a good, precise answer to what RLHF actually is. I looked at the top 5 results from each model. Going through the passages, 17387 and 20638 are the only ones that really answer the question. While 17387 ranks high for all three models, it's interesting that 20638 ranks low for the bi-encoder but high for the two cross-encoders. You can find the details below:

| Corpus ID | Relevant text and summary | Bi-encoder position | MSMarco position | BGE position |
|---|---|---|---|---|
| 14679 | Discusses implications and applications of RLHF but no definition. | 1 | 20 | 4 |
| 17387 | Describes the process of RLHF in detail and applications | 2 | 2 | 3 |
| 39564 | This chunk is messy and is more of a discussion section intro than an answer | 3 | 18 | 6 |
| 14725 | Characteristics about RLHF but no definition of what it is | 4 | 11 | 8 |
| 20638 | "increasingly popular technique for reducing harmful behaviors in large language models" | 10 | 1 | 2 |
| 5628 | Discusses the reward modeling (a component) but does not define RLHF | 5 | 3 | 16 |
| 14815 | Discusses RLHF but does not define it | 24 | 4 | 1 |
| 14749 | Discusses impact of RLHF but it has no definition | 19 | 5 | 15 |
| 9761 | Discusses the reward modeling (a component) but does not define RLHF | 7 | 7 | 5 |

Re-ranking is a feature that shows up frequently in libraries: llamaindex lets you retrieve with a VectorIndexRetriever and re-rank with LLMRerank (see the tutorial), Cohere offers a re-rank endpoint, and qdrant supports it as well. However, as you've seen, it's quite simple to implement yourself. And if you have a high-quality bi-encoder model, you can also use it for re-ranking and benefit from its speed.

LLMs as rerankers
Some people use generative LLMs as re-rankers. For example, the OpenAI Cookbook has an example where GPT-3 is used as a re-ranker by building a prompt that asks the model whether a document is relevant to the question. While this showcases the impressive capabilities of LLMs, it's usually not the best choice for this task: it can perform worse, and it's more expensive and slower than a cross-encoder.
Experiment to find out what works best for your data. Using an LLM as a re-ranker can help when your documents are very long (which can be a challenge for BERT-based models).
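For illustration only, a prompt-based re-ranker could look roughly like the sketch below; call_llm is a hypothetical helper standing in for whichever LLM API you use:

def call_llm(prompt: str) -> str:
    # Hypothetical helper: send the prompt to your LLM of choice and return its text answer
    raise NotImplementedError

def llm_is_relevant(query: str, passage: str) -> bool:
    prompt = (
        f"Question: {query}\n"
        f"Passage: {passage}\n"
        "Is this passage relevant to the question? Answer Yes or No."
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# Keep only the passages the LLM judges relevant, preserving the retrieval order
# relevant_hits = [hit for hit in hits if llm_is_relevant(query, chunks[hit["corpus_id"]])]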

Extra: SPECTER2

If you're specifically interested in embeddings for scientific tasks, I suggest checking out SPECTER2 from AllenAI, a family of models that generate embeddings for scientific papers. These models can be used to predict links, find the nearest papers, find candidate papers for a given query, classify papers using the embeddings as features, and more!
The base model was trained on scirepeval, a dataset with millions of scientific paper citations. After training, the authors fine-tuned the model with adapters, a parameter-efficient fine-tuning library (don't worry if you don't know what that is). The authors attach a small neural network, called an adapter, to the base model. The adapter is trained to perform a specific task, but training it requires much less data than training the whole model. Because of these differences, we need to use transformers together with the adapters library for inference, for example by running something like:


# Note: AutoAdapterModel comes from the adapters library (the successor to adapter-transformers)
from adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained('allenai/specter2_base')
model.load_adapter("allenai/specter2", source="hf", load_as="proximity", set_active=True)

I recommend reading the model card to learn more about the model and how to use it. You can also read the paper for more details.

Extra: Augmented SBERT

Augmented SBERT is a technique for collecting data to improve bi-encoders. Pre-training and fine-tuning bi-encoders requires a lot of data, so the authors suggest using a cross-encoder to label a large set of input pairs and adding them to the training data. For example, if you have very little labeled data, you can train a cross-encoder on it and then use it to label unlabeled pairs, which can in turn be used to train a bi-encoder.
How do you generate the pairs? We could use random combinations of sentences and then label them with the cross-encoder, but that would produce mostly negative pairs and skew the label distribution. To avoid this, the authors explore different techniques:

  • Kernel Density Estimation (KDE): the goal is for the small gold dataset and the augmented dataset to have similar label distributions. This is achieved by discarding some negative pairs. It is inefficient, of course, because you need to generate many pairs to end up with a few positive ones.
  • BM25: a search-engine algorithm based on lexical overlap (e.g., term frequency, document length, etc.). The authors use it to retrieve the k most similar sentences and then label them with the cross-encoder. This is efficient, but it will miss semantically similar sentences that have little lexical overlap.
  • Semantic search sampling: train a bi-encoder on the gold data and then use it to sample other similar pairs.
  • BM25 + semantic search sampling: combines the two previous approaches, which helps find both lexically and semantically similar sentences.

The Sentence Transformers documentation has nice diagrams and example scripts for this.
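Here is a minimal sketch of the silver-labeling step (it assumes you already have a cross_encoder trained on your small gold dataset and a list of unlabeled sentence pairs called unlabeled_pairs):

from sentence_transformers import InputExample

# unlabeled_pairs: list of (sentence_a, sentence_b) tuples, e.g. sampled with BM25
silver_scores = cross_encoder.predict(unlabeled_pairs)

# Turn the cross-encoder scores into silver training data for the bi-encoder
silver_examples = [
    InputExample(texts=[a, b], label=float(score))
    for (a, b), score in zip(unlabeled_pairs, silver_scores)
]
# silver_examples can now be combined with the gold data to fine-tune a bi-encoder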

Augmented SBERT - image from the original paper

Conclusion

Great, we just learned how to do something very cool: retrieval and re-ranking, a very common application of sentence embeddings. We learned how bi-encoders and cross-encoders differ and when to use each, and we also picked up some techniques for improving bi-encoders, such as Augmented SBERT.

Feel free to tweak the code and play around with it! If you enjoyed this post, a like or a share would mean a lot!

Knowledge check

  1. What is the difference between a bi-encoder and a cross-encoder?
  2. Explain the different steps of re-ranking.
  3. If we compare 30,000 sentences with a bi-encoder, how many embeddings do we need to generate? How many inference runs would a cross-encoder need?
  4. What techniques can be used to improve bi-encoders?

You now have a solid foundation for implementing your own search system. As a follow-up task, I suggest building a similar retrieve-and-re-rank system with a different dataset, and exploring how changing the retrieval and re-ranking models affects the results.
