03 LangChain Beginner's Guide: Implementing Efficient Data Retrieval from Scratch

Posted by onecyl on 2024-11-13

Original URL: https://www.cnblogs.com/onecyl/p/18543849

https://python.langchain.com/v0.2/docs/tutorials/retrievers/

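In this document we get familiar with LangChain's vector store and retriever abstractions. They support retrieving data from (vector) databases and other sources and integrating it into LLM workflows. This matters for applications that fetch data to reason over, as in retrieval-augmented generation, or RAG (see the LangChain RAG tutorial).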

Concepts

This guide focuses on retrieval of text data. It covers the following main concepts:

Documents: units of text
Vector stores: storage for vectors
Retrievers: retrieval

Setup

Jupyter Notebook

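This tutorial, like the others, is probably most convenient to run in a Jupyter notebook. See the Jupyter installation instructions for how to set it up.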

Installation

This tutorial requires the langchain, langchain-chroma, and langchain-openai packages.

pip install langchain langchain-chroma langchain-openai

For more details, see the Installation guide.

LangSmith

Set the environment variables:

export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="..."

In a notebook, you can set them like this:

import getpass
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

Documents

LangChain implements a Document abstraction, which represents a unit of text together with its associated metadata. It has two attributes:

page_content: the content, as a string.
metadata: a dictionary of arbitrary metadata.

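The metadata attribute can capture information about the document's source, its relationship to other documents, and other details. Note that a single Document object often represents one chunk of a larger document.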
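To illustrate the point about chunks, here is a minimal sketch of how one longer text might be cut into several document-like pieces that share a source. The Doc dataclass and the split_into_docs helper are hypothetical stand-ins written for this example, not LangChain classes; they only mirror the page_content and metadata attributes, and the fixed 40-character chunk size is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in that mirrors Document's two attributes."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_docs(text: str, source: str, chunk_size: int = 40) -> list[Doc]:
    """Cut text into fixed-size character chunks, tagging each with its origin."""
    return [
        Doc(
            page_content=text[start:start + chunk_size],
            metadata={"source": source, "start": start},
        )
        for start in range(0, len(text), chunk_size)
    ]


long_text = "Dogs are great companions. " * 4
chunks = split_into_docs(long_text, source="mammal-pets-doc")
for chunk in chunks:
    print(chunk.metadata, repr(chunk.page_content))
```

Each chunk keeps the same source value, so results retrieved later can still be traced back to the text they came from.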

Generate some example documents:

from langchain_core.documents import Document
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

API reference:

Document

Here we have generated five documents whose metadata indicates three distinct "sources".

Vector stores

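Vector search is a common way to store and search unstructured data, such as unstructured text. The idea is to store numeric vectors that are associated with the text. Given a query, we embed it as a vector of the same dimension and use a vector similarity metric to identify related data in the store.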
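The idea can be sketched in a few lines of plain Python. The three-dimensional vectors below are made up for illustration (a real embedding model produces much larger vectors), and cosine similarity is used as one common choice of similarity metric, not necessarily the one a given vector store uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" keyed by the text they stand for.
store = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.7, 0.6, 0.1],
    "goldfish": [0.1, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of the query

# Rank stored items from most to least similar to the query.
ranked = sorted(
    store,
    key=lambda name: cosine_similarity(query_vector, store[name]),
    reverse=True,
)
print(ranked)
```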

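LangChain's VectorStore objects define methods for adding text and Document objects to the store and for querying them with various similarity metrics. They are usually initialized with an embedding model, which determines how text data is turned into numeric vectors.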

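LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (for example, various cloud providers) and require specific credentials to use; some (such as Postgres) run on separate infrastructure that can be operated locally or through a third party; others can run in memory for lightweight workloads. Here we demonstrate LangChain vector stores using Chroma, which includes an in-memory implementation.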

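To instantiate a vector store, we usually need to provide an embedding model that specifies how text should be converted into numeric vectors. Here we use OpenAI's embedding model.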

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
)

API reference:

OpenAIEmbeddings

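Calling .from_documents adds the documents to the vector store. VectorStore also implements methods for adding documents that can be called after the object has been instantiated. Most implementations let you connect to an existing vector store, for example by providing a client, an index name, or other information. See the documentation for the specific integration for more detail.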

Once we have instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

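Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and by maximum marginal relevance (which balances similarity to the query against diversity among the retrieved results).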

The output of these methods generally includes a list of Document objects.
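Maximum marginal relevance is the least obvious item in the list above, so here is a minimal plain-Python sketch of the idea: pick results one at a time, rewarding similarity to the query and penalizing similarity to what has already been picked. The vectors are made up, the mmr function is this example's own illustration of the technique rather than LangChain's implementation, and lambda_mult is simply the weight this sketch uses to trade relevance against diversity.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return the indices of k candidates chosen by maximum marginal relevance."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already selected (0 if nothing yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # very close to the query
    [1.0, 0.12],  # near-duplicate of the first candidate
    [1.0, -0.8],  # less similar to the query, but different from the first
]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # balances both terms
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure similarity ranking
```

With the balanced weight the near-duplicate is skipped in favor of the more distinct candidate; with lambda_mult=1.0 the selection reduces to plain similarity ranking.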

Examples

Return documents based on similarity to a string query:

vectorstore.similarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Async query:

await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Return scores:

# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with
# similarity.
vectorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
  0.3751849830150604),
 (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
  0.48316916823387146),
 (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
  0.49601367115974426),
 (Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
  0.4972994923591614)]
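The comment above says the score is a distance that varies inversely with similarity. A small sketch makes that relationship concrete under two stated assumptions: the vectors are normalized to unit length, and the distance is squared Euclidean distance. In that case the distance equals 2 - 2 * cosine similarity, so a smaller distance means a higher similarity. Which metric a particular store reports depends on its configuration; the vectors here are made up.

```python
import math


def normalize(v: list[float]) -> list[float]:
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: list[float], b: list[float]) -> float:
    # For unit-length vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 0.2])
docs = {
    "cats": normalize([0.9, 0.3]),
    "parrots": normalize([0.2, 1.0]),
}

results = {}
for name, vec in docs.items():
    distance = squared_l2(query, vec)
    similarity = dot(query, vec)
    results[name] = (distance, similarity)
    print(name, round(distance, 4), round(similarity, 4))
```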

Return documents based on similarity to an embedded query:

embedding = OpenAIEmbeddings().embed_query("cat")
vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Learn more:

API reference
How-to guide
Integration-specific docs

Retrievers

LangChain VectorStore objects do not subclass Runnable, so they cannot be integrated directly into LangChain Expression Language (LCEL) chains.

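Retrievers, by contrast, are Runnables: they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.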
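As a rough mental model of that standard interface, the toy class below wraps a function and exposes invoke, batch, and bind. TinyRunnable and keyword_search are hypothetical names invented for this sketch; it is not how LangChain implements Runnable, only an illustration of what "one input in, one output out, plus a batch variant" means.

```python
from typing import Callable, Iterable


class TinyRunnable:
    """Hypothetical toy with invoke/batch/bind; not a LangChain class."""

    def __init__(self, func: Callable, **bound_kwargs):
        self.func = func
        self.bound_kwargs = bound_kwargs

    def bind(self, **kwargs) -> "TinyRunnable":
        # Return a new wrapper with extra keyword arguments fixed in advance.
        return TinyRunnable(self.func, **{**self.bound_kwargs, **kwargs})

    def invoke(self, value):
        return self.func(value, **self.bound_kwargs)

    def batch(self, values: Iterable) -> list:
        return [self.invoke(value) for value in values]


def keyword_search(query: str, k: int = 2) -> list[str]:
    """Stand-in for a search method: substring match over a tiny corpus."""
    corpus = ["cats nap", "cats purr", "sharks swim"]
    return [text for text in corpus if query in text][:k]


toy_retriever = TinyRunnable(keyword_search).bind(k=1)
print(toy_retriever.batch(["cats", "sharks"]))
```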

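We can build a simple runnable of our own without defining a new retriever class, by wrapping the retrieval method we want in a RunnableLambda. Below we build an example around the similarity_search method: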

from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result
retriever.batch(["cat", "shark"])

API reference:

Document
RunnableLambda

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

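Vector stores implement an as_retriever method that generates a VectorStoreRetriever. These retrievers carry search_type and search_kwargs attributes that identify which method of the underlying vector store to call and how to parameterize it. For example, we can replicate the above with the following: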

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)
retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

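VectorStoreRetriever supports the search types "similarity" (the default), "mmr" (maximum marginal relevance), and "similarity_score_threshold". The last one lets us set a similarity score threshold on the documents the retriever returns.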
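The thresholding idea itself is simple and can be sketched in plain Python: keep only the results whose score clears a cutoff. The scores below are made up, the sketch assumes a convention where a higher score means more relevant, and filter_by_threshold is a hypothetical helper for illustration, not a LangChain function.

```python
def filter_by_threshold(
    scored: list[tuple[str, float]], score_threshold: float
) -> list[tuple[str, float]]:
    """Keep (text, score) pairs whose score is at least the threshold."""
    return [(text, score) for text, score in scored if score >= score_threshold]


# Made-up relevance scores, where higher means more relevant.
scored_results = [
    ("Cats are independent pets that often enjoy their own space.", 0.82),
    ("Dogs are great companions, known for their loyalty and friendliness.", 0.55),
    ("Goldfish are popular pets for beginners, requiring relatively simple care.", 0.31),
]

kept = filter_by_threshold(scored_results, score_threshold=0.5)
for text, score in kept:
    print(score, text)
```

With the vectorstore from this guide, the corresponding retriever would be configured with vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}), where 0.5 is an arbitrary illustrative value.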

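Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. Below we show a minimal example.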

OpenAI

pip install -qU langchain-openai

import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
message = """
Answer this question using the provided context only.
{question}
Context:
{context}
"""
prompt = ChatPromptTemplate.from_messages([("human", message)])
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

API reference:

ChatPromptTemplate
RunnablePassthrough

response = rag_chain.invoke("tell me about cats")
print(response.content)

Cats are independent pets that often enjoy their own space.
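To see what the prompt step of the chain contributes, the sketch below fills the same template by hand with plain string formatting, with no LLM call involved. format_docs and build_prompt are hypothetical helpers for this illustration, and joining the retrieved page_content strings with blank lines is just one assumed way to present the context.

```python
TEMPLATE = """
Answer this question using the provided context only.
{question}
Context:
{context}
"""


def format_docs(page_contents: list[str]) -> str:
    """Join retrieved texts into one context block (an assumed formatting choice)."""
    return "\n\n".join(page_contents)


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Fill the template with the question and the retrieved context."""
    return TEMPLATE.format(question=question, context=format_docs(retrieved))


retrieved = ["Cats are independent pets that often enjoy their own space."]
prompt_text = build_prompt("tell me about cats", retrieved)
print(prompt_text)
```

Because the retriever was configured with k=1, only one document reaches the prompt, which is why the model's answer above closely echoes that single document.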

Summary:

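This document provided example code for vector stores and retrieval. It introduced LangChain's vector store and retriever abstractions, covering the concepts behind both and how to use them. A vector store is one way to store and search unstructured data; LangChain's VectorStore objects define methods for adding text and Document objects to the store and for querying them with various similarity metrics. Retrievers are Runnables, implement a standard set of methods, and can be composed into LCEL chains.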

