In the previous chapter we covered DSPy, a standardized prompt-generation framework. But DSPy still leans heavily on the few-shot paradigm of prompt writing, and its optimization gains on purely descriptive, task-instruction-style prompts are limited. This chapter focuses on optimizing descriptive instructions: we first give a brief introduction to structured prompt writing, then discuss UniPrompt, an algorithm that iteratively optimizes a prompt from multiple structured facets.
1. Structured Prompt Writing
1.1 LangGPT
- https://langgptai.feishu.cn/wiki/RXdbwRyASiShtDky381ciwFEnpe
- https://github.com/langgptai/LangGPT
LangGPT was arguably the first to advocate writing prompts in a structured format; the prompts you see on task-flow platforms like Coze today are mostly in this style.
Structured prompts are generally built with Markdown or JSON. Markdown seems more common in the Chinese community now, while JSON was more popular in the early GPT-3.5 days; after all, many open-source models now include large amounts of Markdown samples during SFT. Below is the Markdown-format example provided by LangGPT:
# Role: Your_Role_Name
## Profile
- Author: YZFly
- Version: 1.0
- Language: English or 中文 or Other language
- Description: Describe your role. Give an overview of the role's characteristics and skills
### Skill-1
1.skill description 1
2.skill description 2
### Skill-2
1.skill description 1
2.skill description 2
## Rules
1. Don't break character under any circumstance.
2. Don't talk nonsense and make up facts.
## Workflow
1. First, xxx
2. Then, xxx
3. Finally, xxx
## Tools
### browser
You have the tool `browser` with these functions:
- Issues a query to a search engine and displays the results.
- Opens the webpage with the given id, displaying it.
- Returns to the previous page and displays it.
- Scrolls up or down in the open webpage by the given amount.
- Opens the given URL and displays it.
- Stores a text span from an open webpage. Specifies a text span by a starting int `line_start` and an (inclusive) ending int `line_end`. To quote a single line, use `line_start` = `line_end`.
### python
When you send a message containing Python code to python, it will be executed in a
stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0
seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
### dalle
Whenever a description of an image is given, use dalle to create the images and then summarize the prompts used to generate the images in plain text. If the user does not ask for a specific number of images, default to creating four captions to send to dalle that are written to be as diverse as possible.
### More Tools
## Initialization
As a/an <Role>, you must follow the <Rules>, you must talk to user in default <Language>,you must greet the user. Then introduce yourself and introduce the <Workflow>.
It is easy to see that structured prompts have the following characteristics and advantages:
- Heading separators such as # and ## build a hierarchy: for example, the third-level heading Skill sits under the second-level heading Profile, telling the model that these skills belong to the role profile, while the second-level heading Tools contains third-level headings such as python, browser, and dalle, indicating that these are all tools the model can call.
- Modular task description: each second-level heading is a main module. Building the prompt module by module brings two benefits: modules can be reused, and they make the prompt easier to write and iterate on. Common modules include:
  - profile & skills: who the role is, what it can do, which language it speaks, etc.
  - goal & task: the task and objective, e.g. generating a writing outline from user instructions
  - constraint & requirements: restrictions, e.g. in RAG the answer must come from the given context and must not be made up
  - workflow: for complex tasks, tell the model what to do first and what to do next, e.g. for a scoring task, analyze the question first and then give a 1-5 score
  - example & demos: a few few-shot examples
  - style & output format: requirements on answer format, e.g. for a multiple-choice question only output one of A/B/C/D
  - Init & prefix: a lead-in telling the model the prompt has ended and the answer should begin, e.g. for multiple choice, ">>> The option you consider most reasonable is:"
- Variable references across modules: the final Initialization section uses <Rules> to reference the variable defined earlier, emphasizing to the model that Rules means the rules defined above, not rules in a generic sense. Such variable references are widely used in RAG to constrain the model to the given context, and to further restrict the output format, e.g. "Your answer must be one of <label>".
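The modular layout above lends itself to programmatic assembly. Below is a minimal sketch, assuming hypothetical module names and contents; this is not an official LangGPT tool, just an illustration of keeping modules as reusable data:

```python
# Minimal sketch: assemble a LangGPT-style structured prompt from named
# modules. Module names and contents here are illustrative examples.

def build_prompt(role: str, modules: dict) -> str:
    """Render a structured prompt as Markdown, one ## section per module."""
    lines = [f"# Role: {role}"]
    for name, content in modules.items():
        lines.append(f"## {name}")
        if isinstance(content, list):
            # numbered items, matching the LangGPT Rules/Workflow style
            lines += [f"{i}. {item}" for i, item in enumerate(content, 1)]
        else:
            lines.append(content)
    return "\n".join(lines)

prompt = build_prompt(
    "Review Scorer",
    {
        "Profile": "You score product reviews for helpfulness.",
        "Rules": ["Don't break character.", "Don't make up facts."],
        "Workflow": ["First, analyze the review.", "Then, give a 1-5 score."],
        "Initialization": "As a <Role>, you must follow the <Rules>.",
    },
)
```

Keeping modules as data makes it easy to reuse them across prompts and to iterate on one module at a time.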
The drawbacks of structured prompts are just as obvious:
- They demand a lot from the model; many small models with weak complex-instruction-following ability simply cannot use them. This is easy to understand: an instruction is like a cut through the model's high-dimensional space, and the more complex the instruction, the finer the cuts. For a model whose high-dimensional space is poorly separable to begin with, a few cuts in and there is nothing left, haha.
- The longer the prompt context and the more constraints it carries, the more the model's output gets biased, and many corner cases, when traced back, turn out to be caused by one specific requirement. So my personal advice is to start with the simplest possible prompt and add to it gradually, rather than writing something complex from the outset. Each of your requirements may not help, but every one of them can dig a hole.
1.2 Practical Guide
- https://www.jiqizhixin.com/articles/2024-05-14-4
Building on the structured prompting above, the winner of Singapore's prompt engineering competition offered more tips for writing structured prompts. Here are two that I have personally found useful.
- Use of delimiters: delimiters here broadly mean characters distinct from the hierarchical separators above, including longer runs such as #####, 》》》》 or ------. A few places in a prompt call for special delimiters; the key is to make the model understand that the content before and after the delimiter differs significantly and should be kept semantically separate. For example, in a RAG passage-continuation task, special delimiters are needed to separate the retrieved context 【Context】 from the previously generated passage 【paragraph】 before continuing the passage. In general answering tasks, it is advisable to clearly mark where the answer should begin, as below:
<Annex>
Give a table of the list of row numbers belonging to each cluster, in order to back up your analysis. Use these table headers: [[CLUSTER_NAME], List of Rows].
#############
# START ANALYSIS #
If you understand, ask me for my dataset.
- Use of XML tags: for classification tasks, and tasks whose outputs are enumerable values, constraining labels with XML tends to produce more stable outputs than Markdown.
Classify the sentiment of the following conversations into one of two classes. Give the sentiment classifications without any other preamble text.
<classes>
Positive
Negative
</classes>
<conversations>
[Agent]: Good morning, how can I assist you today?
[Customer]: This product is terrible, nothing like what was advertised!
[Customer]: I’m extremely disappointed and expect a full refund.
[Agent]: Good morning, how can I help you today?
[Customer]: Hi, I just wanted to say that I’m really impressed with your
product. It exceeded my expectations!
</conversations>
2. Structured Prompt Optimization
- Task Facet Learning: A Structured Approach to Prompt Optimization
With the structured prompting above as background, UniPrompt's optimization idea is easier to understand. Structured prompt writing effectively splits a prompt into multiple facets, e.g. profile, rules, workflow, each optimized separately. UniPrompt adopts the same structured view: it has the model directly generate a structured prompt and then optimizes each section in a targeted way, and it also offers a fix for the problem that a prompt's generality is easily swayed by individual samples during iteration. Compared with the direct LLM-reflection optimization and the random-search approaches mentioned in the DSPy chapter, this is more systematic and targeted.
Prompt Optimization?
The early part of the paper is quite interesting: the authors first try to establish whether targeted prompt optimization is a reliable idea at all.
Continuity argument
The authors first use instruction sensitivity, i.e. how much a small change to the instruction shifts task performance (Lipschitz continuity), to check whether optimization is feasible. After all, if any tiny change to the instruction produced a huge change in behavior, random search might be the better fit; but if instruction sensitivity has an upper bound, an optimization approach becomes more suitable. The (probabilistic) Lipschitz continuity definition is as follows:
Given a probability distribution X and a non-negative real number r, where L >= 0 is the Lipschitz constant and d is a distance metric, the condition is:

\[ \Pr_{x, x' \sim X}\big[\, d\big(f(x), f(x')\big) \le L \cdot d(x, x') \,\big] \ge 1 - r \]
In plain terms, the slope of the function's change is bounded within a finite range. To test prompt sensitivity empirically, the paper uses GPT-4 to paraphrase the initial prompt and takes the cosine distance (Ada-002 embeddings) between the paraphrased and original prompts as the magnitude of the instruction change \(d(x, x')\); it then evaluates the paraphrased prompt on a validation set and takes the metric change (Acc) as the magnitude of the performance change \(d(f(x), f(x'))\). As figure (a) in the paper shows, with 95% probability the bound for GPT-4 and GPT-3 is below 1, while for the smaller Llama2-13B it exceeds 2. So the stronger the model, the more robust it is to small instruction changes, and the more feasible instruction optimization becomes.
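The measurement can be sketched as follows; the (distance, accuracy-change) pairs below are toy values standing in for the Ada-002 cosine distances and validation-accuracy deltas from the actual experiment:

```python
# Sketch of the sensitivity check: for paraphrase pairs, compare the change
# in validation accuracy against the embedding distance, and take an
# empirical high-quantile bound on the ratio as a stand-in for the
# Lipschitz constant L. The pairs below are toy values, not real data.

def empirical_lipschitz(pairs, quantile=0.95):
    """pairs: list of (d_prompt, d_accuracy) for paraphrased prompts."""
    ratios = sorted(dy / dx for dx, dy in pairs if dx > 0)
    idx = min(int(quantile * len(ratios)), len(ratios) - 1)
    return ratios[idx]

# toy measurements: prompt-edit distance vs. accuracy shift
pairs = [(0.02, 0.01), (0.05, 0.02), (0.04, 0.01), (0.03, 0.06), (0.06, 0.02)]
L = empirical_lipschitz(pairs)
```

A small bound suggests the prompt landscape is smooth enough for targeted optimization; a large one suggests random search.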
Submodularity argument
With Lipschitz continuity in place, the paper further argues that, under limits on sample count and prompt length, iteratively optimizing the prompt from multiple facets is feasible, and better than iterating on few-shot examples. The discussion is framed in terms of submodularity, defined below; informally, the marginal gain from adding the same element to different sets decreases as the set grows.
For a set V and a non-negative real-valued function f, f is submodular if for all \(A, B \subseteq V\) with \(A \subseteq B\), and for all \(x \in V \setminus B\):

\[ f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B) \]

Under the limits on samples and prompt length, finding the optimal prompt then becomes a submodular-maximization problem: find a set \(S \subseteq V\) that maximizes \(f(S)\) subject to \(|S| < K\). A function satisfying submodularity can be approximately maximized with a greedy algorithm: at each iteration, add the element with the largest marginal gain to the set, until the marginal gain falls below a threshold or the set reaches its size cap.
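The greedy selection can be sketched generically as below; f here is a toy coverage-style submodular function standing in for the validation metric over prompt sections:

```python
# Generic greedy maximization of a set function under a cardinality cap.
# f below is a toy coverage function (submodular), standing in for the
# validation metric over prompt sections.

def greedy_select(elements, f, k, min_gain=0.0):
    """Add the element with the largest marginal gain until the gain drops
    to min_gain or below, or the set reaches size k."""
    S = set()
    while len(S) < k:
        gains = {x: f(S | {x}) - f(S) for x in elements - S}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= min_gain:
            break
        S.add(best)
    return S

# toy submodular f: number of distinct "topics" covered by chosen sections
topics = {"background": {1, 2}, "examples": {2, 3}, "workflow": {4}}
f = lambda S: len(set().union(*(topics[s] for s in S))) if S else 0
chosen = greedy_select(set(topics), f, k=2)
```

The (1 - 1/e) approximation guarantee of this greedy scheme is what makes facet-by-facet prompt construction tractable, provided the gains really do diminish.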
The paper computes the marginal gains of the greedy algorithm for both few-shot and task-facet settings, with the function f above being the validation-set metric. For few-shot, multiple pairs of few-shot sets A and B are randomly sampled, with B smaller than A; the same shot is added to both sets and the resulting change in the validation metric is computed. As the probability distribution in the figure below shows, the few-shot gains concentrate in [-0.01, 0.01], essentially a random distribution with no visible diminishing marginal returns.
For the task-facet setting, whereas few-shot adds a demo, here a section is added, analogous to one sub-module of the structured prompts discussed earlier. The paper uses a fine-tuned model (Llama2-13B) to generate prompts for several facets of a task; in the figure below, Introduction, Task Description, Real-life Application, Background Knowledge, and Challenges are each one section. A and B are sampled as different sets of sections, and the marginal gain of adding a new section is computed. Compared with the few-shot dashed line, the blue Facet line shows a noticeably clearer trend of diminishing marginal returns. This, of course, depends heavily on how sections are generated, so next we describe how the model generates task descriptions from different facets (sections) and how a large model iteratively optimizes them.
UNIPROMPT
The full UNIPROMPT pipeline consists of the following steps:
- Fine-tune Llama2-13B to directly generate a structured initial prompt
The paper builds training samples with GPT-4: given a task description (instructions taken from the tasksource dataset) and a section description such as Background, description, or requirements, GPT-4 generates the content of that section; these samples are then used to fine-tune Llama2-13B.
### Instruction:
You are a prompt engineer, you have to write a structured prompt.
For the given task description, examples and section description,
write the contents of the section that align with section description.
### Task Description:
{data_point['task_description']}
### Section Description:
{data_point['section']}:{section_descriptions[data_point['section']]}
### Response:
{data_point['prompt']}
Given a task description and a section description, the fine-tuned Llama2 generates that section's prompt; below is an example of a generated background-section prompt. For initialization, 10 model-generated prompts are sampled and the one performing best on the validation set is chosen.
Task: glue qnli
Task Description: With no explanation, label A to B with either entailment or not entailment
Section: background
Prompt:
1. Entailment means that the information in statement B can be inferred directly from statement A.
2. Not entailment means that the information in statement B cannot be inferred directly from statement A or is unrelated.
3. Understanding the context and relationship between the two statements is crucial for accurate classification.
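The initialization step can be sketched as follows; the candidates and the scorer are toy stand-ins for the fine-tuned Llama2-13B generations and the real validation accuracy:

```python
# Sketch of prompt initialization: sample several candidate section prompts
# and keep the one scoring highest on the validation set. Candidates and
# the scorer below are toy stand-ins.

def init_prompt(candidates, evaluate):
    """Return the candidate prompt with the best validation score."""
    return max(candidates, key=evaluate)

candidates = [f"Background: rule set v{i}" for i in range(10)]
evaluate = lambda p: int(p[-1]) / 10  # toy stand-in for validation accuracy
best = init_prompt(candidates, evaluate)
```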
- Sample clustering
With the initial prompt in hand, the next step is iterative optimization. To avoid the sample bias that earlier work introduced by optimizing on single or randomly sampled examples, the paper clusters the samples, assuming the task characteristics within each cluster are similar. It uses an LLM prompt to assign each question a topic label and then forms clusters by label. One reason not to use cosine similarity, I suspect, is the gap between semantic similarity and task-characteristic similarity; so this clustering probably needs to be examined case by case, and adjusted per task according to the outputs.
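A minimal sketch of the label-based clustering, with the LLM topic labeler replaced by a hypothetical keyword rule:

```python
# Sketch of the clustering step: group samples by an LLM-assigned topic
# label. label_topic is a hypothetical keyword rule standing in for the
# actual LLM labeling prompt.
from collections import defaultdict

def label_topic(question: str) -> str:
    # stand-in for an LLM call that names the question's topic
    return "algebra" if "solve" in question else "geometry"

def cluster_by_topic(questions):
    clusters = defaultdict(list)
    for q in questions:
        clusters[label_topic(q)].append(q)
    return dict(clusters)

clusters = cluster_by_topic([
    "solve x + 2 = 5",
    "area of a circle with radius 3",
    "solve 2y = 8",
])
```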
- Two-stage feedback generation
On top of the clustering above, samples are further split into mini-batches (3-5 samples). On each mini-batch, GPT-4 generates feedback based on the model's predictions. Then, at batch granularity (5-7 samples), the common themes across the mini-batch feedbacks are extracted, directly producing concrete add/delete/edit suggestions for each section. This two-stage design resembles gradient accumulation: the goal is to keep individual samples, or even individual mini-batches, from pulling the prompt iteration toward idiosyncrasies rather than commonalities (try prompt optimization with an LLM yourself and you will see how easily the model gets led astray, which is why smoothing and common-theme extraction matter so much).
Below are the feedback prompt used on each mini-batch and the summarization prompt used at batch granularity.
You are a teacher and you have to give feedback to your students on their answers.
You are teaching how to solve math problems to your students.
You are given a question, it’s true answer and answer given by student.
You are also given the explanations written by your students while solving the questions.
The questions are answered wrong by the students.
You have to tell why the solution is wrong and what information can be added to the Background Knowledge part that would have helped the student to write better explanations.
## IMPORTANT: You are also given a history of changes you made to the background knowledge part and the change in student’s accuracy after making the change. You have to use this history to make your feedback.
Be explicit and tell the exact information that can be added without further modification / addition.
### IMPORTANT: Give feedback in form of instructions like add a section, add a subsection, set the content of a section, set the content of a subsection, delete a section or delete a subsection in the background knowledge part. Give very granular feedbacks, like if the student has made a mistake in the calculation, then tell what is the mistake in the calculation and how to correct it; if the student has made a mistake in the concept, then tell what is the mistake in the concept and how to correct it.
## Background Knowledge
{current_prompt}
## History
{history_string}
Now, it is your turn to give feedbacks to the students.
You can only provide a one line feedback.
You are given a set of feedbacks for some problems. The set of feedbacks for each problem is separated by the =========== symbol. You have to summarize the feedbacks into a final feedback. You are also given a set of wrong questions.
You need to tell which edit can be applied to aid the student in solving the wrong question.
To achieve your task, try to follow the following steps;
1. Identify the general problem that is being solved by all the feedbacks.
2. Once you have identified the problem, try to make a new feedback that covers most of the feedbacks given. Let's say the problem in the first feedback is the absence of methods to solve linear equations and in the second feedback it is the method to invert a matrix. You know that both of these problems can be addressed by adding how to convert a matrix into row reduced echelon form. So, add that.
3. Try and validate your feedback. Once you have a feedback, try to see if it covers every feedback; if it does not cover some feedback, add that to your new feedback.
4. See the wrong questions and try to identify what is the problem in the question. If the problem is not covered by your feedback, add that to your feedback.
5. You can add specifics like examples, definitions etc. Make sure that the feedback is enough to be directly added without any modification.
You may use the following function templates
add_section(sectioname)
add_subsection(section_name, subsection_name)
set_section_content(section_name, new_content)
set_subsection_content(section_name, subsection_name, new_content)
delete_section(section_name)
delete_subsection(section_name, subsection_name)
Your summary cannot include more than four functions. Make sure that the content is useful, not just a very general statement. Something specific.
Instructions:
{edits}
Wrong Questions:
{wrong_examples_string}
Summary:
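The two-stage flow can be sketched as below; critique_minibatch and summarize_feedback are hypothetical stand-ins for the two GPT-4 prompts, reduced here to counting error types for illustration:

```python
# Sketch of two-stage feedback: per-minibatch critiques first, then one
# batch-level summary distilling their common theme into section edits.
# Both functions are toy stand-ins for the GPT-4 prompt calls.
from collections import Counter

def critique_minibatch(minibatch):
    # stand-in: one feedback string per minibatch of wrong examples
    kinds = [ex["error"] for ex in minibatch]
    return f"add background on {Counter(kinds).most_common(1)[0][0]}"

def summarize_feedback(feedbacks, max_edits=4):
    # stand-in: keep the most common feedbacks, capped at four edits
    # like the real summarization prompt
    common = Counter(feedbacks).most_common(max_edits)
    return [fb for fb, _ in common]

batch = [
    [{"error": "units"}, {"error": "units"}, {"error": "algebra"}],
    [{"error": "units"}, {"error": "rounding"}, {"error": "units"}],
]
edits = summarize_feedback([critique_minibatch(mb) for mb in batch])
```

The aggregation step is what keeps a single noisy mini-batch from dictating the edit, much like gradient accumulation smooths noisy gradients.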
- Prompt editing and optimization from feedback
Using the feedback and edit operations obtained above, the paper has the model edit and revise the prompt with the following instruction. Only revised prompts that improve the validation score are kept (greedy), and at every step the best few optimized prompts are maintained (analogous to beam search with beam size 2). Iteration stops after 5 consecutive rounds with no improvement on the validation set.
You are given an input prompt and a feedback, you have to incorporate the feedback into the input prompt and output the final prompt.
An example of the task is given below
### Input Prompt
Introduction: In this task you have to answer the given question.
### Feedback
The background knowledge is incomplete, it does not include what are the factors that affect the water usage and how many water sources are there.
\\add_subsection("Background Knowledge")
\\add_subsection_content(water usage depends on the population, climate, economic development, and availability of water sources. There are two sources of water, surface water and groundwater.)
### Final Prompt
Introduction: In this task you have to answer the given question.
Background Knowledge: water usage depends on the population, climate, economic development, and availability of water sources. There are two sources of water, surface water and groundwater.
Only output the final prompt nothing else.
### INPUT PROMPT
{current_prompt}
### FEEDBACK
{edits}
### FINAL PROMPT
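The outer loop described above (greedy acceptance, a small beam of best prompts, early stop after 5 flat rounds) can be sketched as follows; propose_edit and score are toy stand-ins for the GPT-4 edit prompt and the validation metric:

```python
# Sketch of the outer optimization loop: keep a small beam of best prompts,
# keep an edited prompt only if it scores at least as well, and stop after
# `patience` rounds without improvement on the validation score.
# propose_edit and score are toy stand-ins.

def optimize(initial, propose_edit, score, beam_size=2, patience=5, max_rounds=50):
    beam = [initial]
    best_score, flat = score(initial), 0
    for _ in range(max_rounds):
        candidates = beam + [propose_edit(p) for p in beam]
        beam = sorted(set(candidates), key=score, reverse=True)[:beam_size]
        top = score(beam[0])
        if top > best_score:
            best_score, flat = top, 0
        else:
            flat += 1
            if flat >= patience:  # 5 flat rounds -> stop iterating
                break
    return beam[0]

# toy setup: "score" counts useful keywords; "edit" appends the next one
keywords = ["background", "workflow", "format"]
score = lambda p: sum(k in p for k in keywords)
propose_edit = lambda p: p + " " + next((k for k in keywords if k not in p), "")
best = optimize("Introduction:", propose_edit, score)
```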
For results, the paper compares against earlier algorithms such as OPRO and ProTeGi, showing fairly significant gains across multiple datasets.