Paper information:
Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li, Sadao Kurohashi:
GPT-RE: In-context Learning for Relation Extraction using Large Language Models. EMNLP 2023: 3534-3547
Introduction
Paragraph 1: Research background: GPT-3, ICL
% NLP frontier: GPT-3
The emergence of large language models (LLMs) such as GPT-3 (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; Rae et al., 2021; Hoffmann et al., 2022) represents a significant advancement in natural language processing (NLP).
% From fine-tuning to ICL
Instead of following a pretraining-and-finetuning pipeline (Devlin et al., 2019; Beltagy et al., 2019; Raffel et al., 2019; Lan et al., 2019; Zhuang et al., 2021), which finetunes a pre-trained model on a task-specific dataset in a fully-supervised manner, LLMs employ a new paradigm known as in-context learning (ICL) (Brown et al., 2020; Min et al., 2022a) which formulates an NLP task under the paradigm of language generation and makes predictions by learning from a few demonstrations.
% ICL vs. fine-tuning
Under the framework of ICL, LLMs achieve remarkable performance rivaling previous fully-supervised methods on various tasks such as solving math problems, commonsense reasoning, text classification, fact retrieval, natural language inference, and semantic parsing, even when provided with only a limited number of demonstrations (Brown et al., 2020; Min et al., 2022b; Zhao et al., 2021; Liu et al., 2022b; Shin et al., 2021).
Paragraph 2: Prior work
% Prior work: ICL
Despite the overall promising performance of LLMs, the utilization of ICL for relation extraction (RE) is still suboptimal.
% Background: RE
RE is a central task in knowledge retrieval that requires a deep understanding of natural language: it seeks to identify a predefined relation between a specific entity pair mentioned in the input sentence, or NULL if no relation holds.
% Background: RE + ICL
Given a test input, ICL for RE constructs the LLM prompt from the task instruction, a few demonstrations retrieved from the training data, and the test input itself.
Then LLMs generate the corresponding relation.
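To make this prompt structure concrete, the following is a minimal sketch of how such an ICL prompt for RE could be assembled from an instruction, retrieved demonstrations, and the test input. The instruction wording, label names, and the `build_re_prompt` helper are illustrative assumptions, not the paper's exact prompt format.

```python
# Minimal sketch of an ICL prompt for RE; wording and labels are illustrative.
def build_re_prompt(instruction, demonstrations, test_sentence, head, tail):
    """demonstrations: list of (sentence, head_entity, tail_entity, relation)."""
    lines = [instruction, ""]
    for sent, h, t, rel in demonstrations:
        lines.append(f"Sentence: {sent}")
        lines.append(f"Head entity: {h}\nTail entity: {t}")
        lines.append(f"Relation: {rel}\n")
    lines.append(f"Sentence: {test_sentence}")
    lines.append(f"Head entity: {head}\nTail entity: {tail}")
    lines.append("Relation:")  # the LLM is expected to generate the label here
    return "\n".join(lines)

prompt = build_re_prompt(
    "Classify the relation between the head and tail entities. "
    "Answer with one of the predefined relation labels, or NULL.",
    [("Bill Gates founded Microsoft.", "Bill Gates", "Microsoft", "org:founded_by")],
    "Larry Page co-founded Google in 1998.", "Larry Page", "Google",
)
print(prompt)
```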
% Prior work: RE + ICL
Recent research (Gutiérrez et al., 2022) has sought to apply GPT-3 ICL to biomedical RE, but the results are relatively negative and suggest that GPT-3 ICL still significantly underperforms fine-tuned models.
Paragraph 3.1: Limitations of prior work
% Overview: limitations
The reasons for the pitfalls of GPT-3 ICL in RE are twofold:
% Limitation 1: low relevance of entities and relations
(1) The low relevance of entities and relations in the demonstrations retrieved for ICL.
% Limitation 1: low relevance of entities and relations: only sentence embeddings considered
Demonstrations are selected randomly or via k-nearest neighbor (kNN) search based on sentence embeddings (Liu et al., 2022b; Gutiérrez et al., 2022).
% Limitation 1: low relevance of entities and relations: only sentence embeddings considered: entities and relations ignored
Regrettably, kNN retrieval based on sentence embeddings is concerned mainly with the relevance of the overall sentence semantics and much less with the specific entities and relations a sentence contains, which leads to low-quality demonstrations.
% Limitation 1: low relevance of entities and relations: only sentence embeddings considered: entities and relations ignored: illustrative example
As shown in Figure 2, the test input retrieves a semantically similar sentence that is nevertheless undesired in terms of its entities and relation.
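For reference, below is a minimal sketch of the criticized baseline: sentence-embedding-based kNN retrieval of demonstrations. The encoder name and the toy training pool are illustrative assumptions, not the setup used in the cited work; the point is that the retrieval key ignores entities and relations entirely.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

# Toy training pool of (sentence, relation label) pairs -- illustrative only.
train_pool = [
    ("Bill Gates founded Microsoft in 1975.", "org:founded_by"),
    ("The storm caused severe flooding in the city.", "cause-effect"),
    ("Apple was co-founded by Steve Jobs.", "org:founded_by"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
train_emb = model.encode([s for s, _ in train_pool])  # (N, d) matrix

def knn_demonstrations(test_sentence, k=2):
    """Retrieve the k training examples whose *sentence* embeddings are closest
    to the test input (cosine similarity). This ignores entities and relations,
    which is exactly the low-relevance problem described above."""
    q = model.encode([test_sentence])[0]
    sims = train_emb @ q / (np.linalg.norm(train_emb, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [train_pool[i] for i in top]

print(knn_demonstrations("Larry Page and Sergey Brin founded Google."))
```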
Paragraph 3.2: Limitations of prior work
% Limitation 2: missing explanations of input-label mappings
(2) The lack of explanations of input-label mappings in demonstrations leads to poor ICL effectiveness: a vanilla form of ICL lists all demonstrations as input-label pairs without any explanation.
% Limitation 2: missing explanations of input-label mappings: LLMs learn only shallow surface clues
This may mislead LLMs into learning shallow clues from surface words, even though a relation can be expressed in diverse forms due to the complexity of language.
% Limitation 2: missing explanations of input-label mappings: LLMs learn only shallow surface clues: improving the quality of each demonstration
Especially since ICL is constrained by a maximum input length, optimizing the learning efficiency of each individual demonstration becomes extremely important.
Paragraph 4.1: This work
% Motivation
To this end, we propose GPT-RE for the RE task.
% Overview: retrieval + reasoning
GPT-RE employs two strategies to resolve the issues above: (1) task-aware retrieval and (2) gold label-induced reasoning.
% Method 1: task-aware retrieval: overview
For (1) task-aware retrieval, the core idea is to perform kNN search with representations that deliberately encode and emphasize entity and relation information, rather than with plain sentence embeddings.
% Method 1: task-aware retrieval: details
We achieve this with two different retrieval approaches: (a) entity-prompted sentence embedding and (b) fine-tuned relation representation, which naturally places emphasis on entities and relations.
% Method 1: task-aware retrieval: advantages
Both methods carry more RE-specific information than sentence semantics alone, thus effectively addressing the problem of low relevance.
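To illustrate approach (a), the sketch below folds the entity pair into the text before encoding, so that the retrieval key reflects the entities in question as well as the sentence semantics. The reformulation template and the encoder are assumptions made for illustration, not the paper's exact construction; approach (b) would instead reuse the relation representation of a fine-tuned RE encoder as the retrieval key.

```python
from sentence_transformers import SentenceTransformer  # assumed encoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def entity_prompted_embedding(sentence, head, tail):
    """Sketch of approach (a): surface the entity pair in the text before
    encoding, so the resulting vector emphasizes the entities the relation
    is about. The template below is an illustrative assumption."""
    prompted = (
        f"The relation between \"{head}\" and \"{tail}\" "
        f"in the context: {sentence}"
    )
    return encoder.encode([prompted])[0]

vec = entity_prompted_embedding(
    "Larry Page co-founded Google in 1998.", "Larry Page", "Google"
)
print(vec.shape)
```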
Paragraph 4.2: This work
% Method 2: input-label reasoning: overview
For (2) gold label-induced reasoning, we propose to inject reasoning logic into the demonstrations to provide more evidence aligning an input with its label, a strategy akin to Chain-of-Thought (CoT) research (Wei et al., 2022; Wang et al., 2022b; Kojima et al., 2022).
% Method 2: input-label reasoning: details and differences from prior work
Unlike previous work, however, we let LLMs elicit the reasoning process that explains not only why a given sentence should be classified under a particular label but also why a NULL example should not be assigned to any of the predefined categories.
% Method 2: input-label reasoning: advantages
This process significantly improves the ability of LLMs to align relations with their diverse surface forms.
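As an illustration of what such reasoning-augmented demonstrations might look like, the sketch below asks an LLM to explain a gold-labeled example (including why a NULL example fits none of the predefined relations) and attaches that explanation to the demonstration. The prompt wording and the `call_llm` helper are hypothetical placeholders, not the paper's exact procedure.

```python
def call_llm(prompt):
    # Hypothetical placeholder: swap in a real LLM completion call here.
    return "<explanation generated by the LLM>"

def reasoning_augmented_demo(sentence, head, tail, gold_label):
    """Ask the LLM to justify the *gold* label, then attach the explanation so
    the demonstration shows an input-label mapping together with its reasoning."""
    if gold_label == "NULL":
        question = (f"Explain why no predefined relation holds between "
                    f"\"{head}\" and \"{tail}\" in: {sentence}")
    else:
        question = (f"Explain why the relation between \"{head}\" and \"{tail}\" "
                    f"in: {sentence} is {gold_label}")
    reasoning = call_llm(question)
    return (f"Sentence: {sentence}\nHead entity: {head}\nTail entity: {tail}\n"
            f"Reasoning: {reasoning}\nRelation: {gold_label}\n")

print(reasoning_augmented_demo(
    "Bill Gates founded Microsoft.", "Bill Gates", "Microsoft", "org:founded_by"
))
```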
Paragraph 5.1: Experimental results
% Problem raised: overprediction of relations
Recent work reveals another crucial problem, termed "overpredicting," as shown in Figure 3: we observe that LLMs have a strong inclination to wrongly classify NULL examples into other predefined labels.
% Overprediction: related work
A similar phenomenon has also been observed in other tasks such as NER (Gutiérrez et al., 2022; Blevins et al., 2022).
% This work's method: experimental results
In this paper, we show that this issue can be alleviated if the representations used for retrieval are supervised with the full set of NULL examples in the training data.
Paragraph 5.2: Experimental results
% Experimental setup: RE
We evaluate our proposed method on three popular general-domain RE datasets, SemEval 2010 Task 8, TACRED, and ACE05, and one scientific-domain dataset, SciERC.
% Results: overview: surpassing both GPT-3 baselines and traditional fine-tuned models
We observe that GPT-RE achieves improvements over not only existing GPT-3 baselines, but also fully-supervised baselines.
% Results: details: SOTA and competitive results
Specifically, GPT-RE achieves SOTA performances on the Semeval and SciERC datasets, and competitive performances on the TACRED and ACE05 datasets.