Automatically Building a Knowledge Graph with ChatGPT

Posted by 哥不是小萝莉 on 2024-04-30

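1. Overview

This article explores how to build a knowledge graph from raw text with OpenAI's gpt-3.5-turbo, using LLM and RAG techniques for text generation, question answering, and efficient extraction of domain-specific knowledge in order to obtain valuable insights. Before starting, we need to clarify a few key concepts.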

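2. Content

2.1 What is a knowledge graph?

A knowledge graph is a semantic network that represents and connects real-world entities such as people, organizations, objects, events, and concepts. It is made up of triples of the form "head entity → relation → tail entity" or, in Semantic Web terms, "subject → predicate → object", which are used to extract and analyze complex relationships between entities. A knowledge graph usually includes an ontology that defines concepts, relations, and their properties. The ontology serves as the formal specification of the concepts and relations in the target domain and gives the network its semantics. Automated agents such as search engines use ontologies to understand web page content so that they can index and display it correctly.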
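As a minimal sketch of the triple structure described above, a small knowledge graph can be held as a plain list of (head, relation, tail) tuples. The triples reuse values from the product example later in this article; the `outgoing` index is only an illustrative choice, not a required representation.

```python
from collections import defaultdict

# Each triple is (head entity, relation, tail entity).
triples = [
    ("YUVORA 3D Brick Wall Stickers", "isProducedBy", "YUVORA"),
    ("YUVORA 3D Brick Wall Stickers", "hasColor", "White"),
    ("YUVORA 3D Brick Wall Stickers", "hasCharacteristic", "Waterproof"),
]

# Index outgoing edges by head entity so its relations can be looked up directly.
outgoing = defaultdict(list)
for head, relation, tail in triples:
    outgoing[head].append((relation, tail))

for relation, tail in outgoing["YUVORA 3D Brick Wall Stickers"]:
    print(relation, "->", tail)
```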

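2.2 Example

2.2.1 Preparing dependencies

We use OpenAI's gpt-3.5-turbo to build a knowledge graph from the product descriptions in a product dataset. The Python dependencies are as follows: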

pip install pandas openai sentence-transformers networkx matplotlib

2.2.2 Reading the data

Read the dataset with the following code:

import json
import logging
import matplotlib.pyplot as plt
import networkx as nx
from networkx import connected_components
from openai import OpenAI
import pandas as pd
from sentence_transformers import SentenceTransformer, util
data = pd.read_csv("products.csv")

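The dataset contains the columns "PRODUCT_ID", "TITLE", "BULLET_POINTS", "DESCRIPTION", "PRODUCT_TYPE_ID", and "PRODUCT_LENGTH". We merge the "TITLE", "BULLET_POINTS", and "DESCRIPTION" columns into a single "text" column, which serves as the product specification from which ChatGPT is prompted to extract entities and relations.

The code is as follows: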

cols = ['TITLE', 'BULLET_POINTS', 'DESCRIPTION']
data['text'] = data[cols].fillna('').astype(str).agg(' '.join, axis=1)  # space-separated; missing values become ''

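2.2.3 Feature extraction

We instruct ChatGPT to extract entities and relations from the provided product specification and return the result as an array of JSON objects. Each JSON object must contain the following keys: 'head', 'head_type', 'relation', 'tail', and 'tail_type'.

The 'head' key must contain the text of the extracted entity, whose type is one of those in the list provided in the user prompt. The 'head_type' key must contain the type of the head entity, taken from that list. The 'relation' key must contain the type of relation between 'head' and 'tail'. The 'tail' key must contain the text of the extracted entity that is the object of the triple, and the 'tail_type' key must contain the type of the tail entity.

We prompt ChatGPT with the entity types and relation types listed below, and map them to the corresponding entities and relations in the Schema.org ontology. The keys of each mapping are the entity and relation types given to ChatGPT; the values are the URLs of the corresponding Schema.org objects and properties.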

# ENTITY TYPES:
entity_types = {
  "product": "https://schema.org/Product", 
  "rating": "https://schema.org/AggregateRating",
  "price": "https://schema.org/Offer", 
  "characteristic": "https://schema.org/PropertyValue", 
  "material": "https://schema.org/Text",
  "manufacturer": "https://schema.org/Organization", 
  "brand": "https://schema.org/Brand", 
  "measurement": "https://schema.org/QuantitativeValue", 
  "organization": "https://schema.org/Organization",  
  "color": "https://schema.org/Text",
}

# RELATION TYPES:
relation_types = {
  "hasCharacteristic": "https://schema.org/additionalProperty",
  "hasColor": "https://schema.org/color",
  "hasBrand": "https://schema.org/brand",
  "isProducedBy": "https://schema.org/manufacturer",
  "hasMeasurement": "https://schema.org/hasMeasurement",
  "isSimilarTo": "https://schema.org/isSimilarTo",
  "madeOfMaterial": "https://schema.org/material",
  "hasPrice": "https://schema.org/offers",
  "hasRating": "https://schema.org/aggregateRating",
  "relatedTo": "https://schema.org/isRelatedTo"
}
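One way these mappings might be applied is sketched below: a small helper that rewrites the type and relation names of an extracted triple into their Schema.org URLs. The helper name `to_schema_org` and the abbreviated `sample_*` dictionaries are assumptions made for illustration; in practice the full mappings defined above would be passed in.

```python
# Abbreviated copies of the mappings, under different names so they do not
# shadow the full dictionaries defined in the article.
sample_entity_types = {
    "product": "https://schema.org/Product",
    "color": "https://schema.org/Text",
}
sample_relation_types = {
    "hasColor": "https://schema.org/color",
}

def to_schema_org(triple, entity_map, relation_map):
    """Return a copy of the triple with types and relation replaced by URLs.

    Unknown names map to None so that they can be filtered out later.
    """
    return {
        "head": triple["head"],
        "head_type": entity_map.get(triple["head_type"]),
        "relation": relation_map.get(triple["relation"]),
        "tail": triple["tail"],
        "tail_type": entity_map.get(triple["tail_type"]),
    }

example = {
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasColor",
    "tail": "White",
    "tail_type": "color",
}
mapped = to_schema_org(example, sample_entity_types, sample_relation_types)
print(mapped["head_type"], mapped["relation"], mapped["tail_type"])
```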

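To use ChatGPT for information extraction, we create an OpenAI client and use the Chat Completions API to generate an output array of JSON objects, one for each identified relation. gpt-3.5-turbo is the default model because its performance is sufficient for this simple demonstration.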

client = OpenAI(api_key="<YOUR_API_KEY>")

Define the extraction function:

def extract_information(text, model="gpt-3.5-turbo"):
    # Send the system instructions plus the filled-in user prompt to the model.
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt.format(
                    entity_types=entity_types,
                    relation_types=relation_types,
                    specification=text
                )
            }
        ]
    )

    # The reply is a raw string that is expected to contain a JSON array.
    return completion.choices[0].message.content
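Because the function returns a raw string, the reply still has to be parsed before use. The sketch below shows one defensive way to do that, assuming the model may occasionally wrap its reply in a Markdown code fence or return something that is not a JSON array. `parse_triples` is a hypothetical helper, not part of the OpenAI library.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the source readable

def parse_triples(raw):
    """Parse the model's reply into a list of dicts; return [] if that fails."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json").
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    # Only a JSON array is a usable answer for this task.
    return parsed if isinstance(parsed, list) else []

fenced_reply = FENCE + 'json\n[{"head": "A", "relation": "hasColor", "tail": "White"}]\n' + FENCE
print(parse_triples(fenced_reply))
print(parse_triples("Sorry, I cannot help with that."))
```

Returning an empty list on failure keeps a single malformed reply from interrupting a long extraction run, at the cost of silently skipping that specification; logging the failure is a reasonable addition.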

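2.2.4 Writing the prompts

The system_prompt variable contains the instructions that guide ChatGPT to extract entities and relations from raw text and return the result as an array of JSON objects, each with the following keys: 'head', 'head_type', 'relation', 'tail', and 'tail_type'.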

system_prompt = """You are an expert agent specialized in analyzing product specifications in an online retail store.
Your task is to identify the entities and relations requested with the user prompt, from a given product specification.
You must generate the output in a JSON containing a list with JSON objects having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt, the "head_type"
key must contain the type of the extracted head entity which must be one of the types from the provided user list,
the "relation" key must contain the type of relation between the "head" and the "tail", the "tail" key must represent the text of an
extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
"""

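The user_prompt variable contains an example of the desired output for a single specification from the dataset and prompts ChatGPT to extract entities and relations from the provided specification in the same way. This is an example of one-shot learning with ChatGPT.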

user_prompt = """Based on the following example, extract entities and relations from the provided text.
Use the following entity types:

# ENTITY TYPES:
{entity_types}

Use the following relation types:
{relation_types}

--> Beginning of example

# Specification
"YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,
 Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT) -40 Tiles
 [Made of soft PE foam,Anti Children's Collision,take care of your family.Waterproof, moist-proof and sound insulated. Easy clean and maintenance with wet cloth,economic wall covering material.,Self adhesive peel and stick wallpaper,Easy paste And removement .Easy To cut DIY the shape according to your room area,The embossed 3d wall sticker offers stunning visual impact. the tiles are light, water proof, anti-collision, they can be installed in minutes over a clean and sleek surface without any mess or specialized tools, and never crack with time.,Peel and stick 3d wallpaper is also an economic wall covering material, they will remain on your walls for as long as you wish them to be. The tiles can also be easily installed directly over existing panels or smooth surface.,Usage range: Featured walls,Kitchen,bedroom,living room, dinning room,TV walls,sofa background,office wall decoration,etc. Don't use in shower and rugged wall surface]
Provide high quality foam 3D wall panels self adhesive peel and stick wallpaper, made of soft PE foam,children's collision, waterproof, moist-proof and sound insulated,easy cleaning and maintenance with wet cloth,economic wall covering material, the material of 3D foam wallpaper is SAFE, easy to paste and remove . Easy to cut DIY the shape according to your decor area. Offers best quality products. This wallpaper we are is a real wallpaper with factory done self adhesive backing. You would be glad that you it. Product features High-density foaming technology Total Three production processes Can be use of up to 10 years Surface Treatment: 3D Deep Embossing Damask Pattern."

################

# Output
[
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "isProducedBy",
    "tail": "YUVORA",
    "tail_type": "manufacturer"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasCharacteristic",
    "tail": "Waterproof",
    "tail_type": "characteristic"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasCharacteristic",
    "tail": "Self Adhesive",
    "tail_type": "characteristic"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasColor",
    "tail": "White",
    "tail_type": "color"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasMeasurement",
    "tail": "70*70 CMT",
    "tail_type": "measurement"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasMeasurement",
    "tail": "40 tiles",
    "tail_type": "measurement"
  }}
]

--> End of example

For the following specification, extract entities and relations as in the provided example.

# Specification
{specification}
################

# Output

"""
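The JSON braces in the prompt above are doubled because the prompt is filled in with str.format, which treats single braces as placeholders. A short illustration with a shortened, hypothetical template (not the article's full prompt):

```python
# A shortened template in the same style as user_prompt.
template = """# Output
[
  {{
    "head": "example head"
  }}
]

# Specification
{specification}
"""

# Doubled braces become literal braces; {specification} is substituted.
filled = template.format(specification="White foam wall tile")
print(filled)
```

With single braces around the JSON example, str.format would try to interpret the JSON content as a replacement field and raise an error instead of producing the prompt.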

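Now we call the extract_information function for each specification in the dataset and build a list of all extracted triples, which represents our knowledge graph. For this demonstration, we generate the knowledge graph from a subset of only 100 product specifications.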

kg = []
for content in data['text'].values[:100]:
  try:
    extracted_relations = extract_information(content)
    extracted_relations = json.loads(extracted_relations)
    kg.extend(extracted_relations)
  except Exception as e:
    logging.error(e)

kg_relations = pd.DataFrame(kg)
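The model is not guaranteed to follow the requested schema, so it may help to filter the extracted objects before building the DataFrame. The following is a minimal sketch of such a check; `is_valid_triple` and the abbreviated `sample_*` type sets are assumptions for illustration.

```python
REQUIRED_KEYS = ("head", "head_type", "relation", "tail", "tail_type")

def is_valid_triple(obj, allowed_entities, allowed_relations):
    """Return True if obj has all five string-valued keys and known types."""
    if not isinstance(obj, dict):
        return False
    if not all(isinstance(obj.get(key), str) for key in REQUIRED_KEYS):
        return False
    return (
        obj["head_type"] in allowed_entities
        and obj["tail_type"] in allowed_entities
        and obj["relation"] in allowed_relations
    )

# Abbreviated, illustrative type sets; the full mappings would be used in practice.
sample_entities = {"product", "color"}
sample_relations = {"hasColor"}

candidates = [
    {"head": "YUVORA 3D Brick Wall Stickers", "head_type": "product",
     "relation": "hasColor", "tail": "White", "tail_type": "color"},
    # Relation and tail type are not in the allowed sets.
    {"head": "YUVORA 3D Brick Wall Stickers", "head_type": "product",
     "relation": "soldBy", "tail": "Somewhere", "tail_type": "shop"},
    # Missing most of the required keys.
    {"head": "incomplete"},
]
valid = [c for c in candidates if is_valid_triple(c, sample_entities, sample_relations)]
print(len(valid), "of", len(candidates), "candidates kept")
```

Since membership tests on a dictionary check its keys, the entity_types and relation_types mappings could be passed directly as the allowed sets.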

The result of the information extraction is shown in the figure below.

2.2.5 Entity resolution

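Entity resolution (ER) is the process of disambiguating entities that correspond to real-world concepts. Here we attempt a basic entity resolution on the head and tail entities in the dataset, so that the entities found in the text have a more concise representation.

We use NLP techniques for entity resolution. More specifically, we use the sentence-transformers library to create an embedding for each head entity and compute the cosine similarity between head entities.

We use the 'all-MiniLM-L6-v2' sentence transformer to create the embeddings because it is a fast and reasonably accurate model that suits this use case. For each pair of head entities, we check whether the similarity is greater than 0.95. If it is, we treat them as the same entity and normalize their text values to be equal. The same applies to tail entities.

This process gives the following result: if we have two entities, one with the value 'Microsoft' and the other 'Microsoft Inc.', they are merged into one.

We load the embedding model and use it to compute the similarity between the first and second head entities as follows.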

heads = kg_relations['head'].values
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(heads)
similarity = util.cos_sim(embeddings[0], embeddings[1])
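The snippet above compares only the first two head entities. The merging step described earlier, with its 0.95 threshold, could be sketched as a greedy pass over all entities, as below. This is an illustration under stated assumptions rather than the article's own implementation: `resolve_entities` is a hypothetical helper, cosine similarity is written out in plain Python, and the toy 3-dimensional vectors stand in for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def resolve_entities(names, vectors, threshold=0.95):
    """Map each name to the first earlier name whose vector is similar enough."""
    canonical = {}
    representatives = []  # (name, vector) pairs kept as canonical forms
    for name, vec in zip(names, vectors):
        if name in canonical:
            continue  # this exact string was already resolved
        for rep_name, rep_vec in representatives:
            if cosine(vec, rep_vec) > threshold:
                canonical[name] = rep_name
                break
        else:
            # No close match found: the name becomes its own canonical form.
            canonical[name] = name
            representatives.append((name, vec))
    return canonical

# Toy vectors standing in for embeddings of the entity names.
names = ["Microsoft", "Microsoft Inc.", "YUVORA"]
vectors = [[1.0, 0.0, 0.1], [0.99, 0.02, 0.1], [0.0, 1.0, 0.0]]
mapping = resolve_entities(names, vectors)
print(mapping)
```

With real data, the vectors would come from the embedding model, and the resulting mapping could be applied to the 'head' and 'tail' columns, for example with the pandas Series.replace method. The greedy pass depends on input order, and comparing every entity against every representative is quadratic in the worst case, which is acceptable for a 100-product demonstration.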

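To visualize the knowledge graph extracted after entity resolution, we use Python's networkx library. First we create an empty graph, then add each extracted relation to it.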

G = nx.Graph()
for _, row in kg_relations.iterrows():
  G.add_edge(row['head'], row['tail'], label=row['relation'])

To draw the graph, we can use the following code:

pos = nx.spring_layout(G, seed=47, k=0.9)
labels = nx.get_edge_attributes(G, 'label')
plt.figure(figsize=(15, 15))
nx.draw(G, pos, with_labels=True, font_size=10, node_size=700, node_color='lightblue', edge_color='gray', alpha=0.6)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=8, label_pos=0.3, verticalalignment='baseline')
plt.title('Product Knowledge Graph')
plt.show()

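A subgraph of the generated knowledge graph is shown in the figure below:

We can see that this approach lets us connect multiple different products through the characteristics they share. This is useful for learning the attributes products have in common, standardizing product specifications, describing web resources with a common schema such as Schema.org, and even making product recommendations based on product specifications.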
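To make the idea of connecting products through shared characteristics concrete, the sketch below groups products by the (relation, tail) pairs they have in common. The product names and triples are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical triples in the same shape as the extracted ones.
sample_kg = [
    {"head": "Wall Sticker A", "relation": "hasCharacteristic", "tail": "Waterproof"},
    {"head": "Phone Case B", "relation": "hasCharacteristic", "tail": "Waterproof"},
    {"head": "Wall Sticker A", "relation": "hasColor", "tail": "White"},
]

# Collect the set of products attached to each (relation, tail) pair.
products_by_feature = defaultdict(set)
for triple in sample_kg:
    products_by_feature[(triple["relation"], triple["tail"])].add(triple["head"])

# Keep only the features that link at least two different products.
shared = {
    feature: sorted(products)
    for feature, products in products_by_feature.items()
    if len(products) > 1
}
print(shared)
```

A simple recommendation heuristic could rank products by how many such features they share with the product a customer is viewing.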

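3. Summary

Most companies have large amounts of unused unstructured data stored in data lakes. Building a knowledge graph to extract insights from this unused data helps obtain information from unprocessed, unstructured text corpora and use it to make better-informed decisions.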
