Langchain-ChatGLM Source Code Walkthrough (Part 2): Document Embedding and Building the FAISS Index

Posted by 有何m不可 on 2024-03-14

1. Introduction

Langchain-ChatGLM should be familiar to most readers. Over the next few weeks I plan to publish a source code walkthrough series, starting by unlocking some of langchain's basic mechanics.

Document question answering breaks down into roughly five stages, each of which is reflected in Langchain:

  1. Upload and parse documents
  2. Vectorize and store the documents
  3. Retrieve (recall) relevant documents
  4. Vectorize the query
  5. Answer questions over the documents

Today we focus on how langchain implements document embedding and the FAISS index construction process.

2. Source Entry Point

In langchain, the document embedding and FAISS construction flow splits into two branches:

1. When a file is loaded for the first time: how faiss.index gets generated

2. When a faiss.index already exists

The source reading below covers each branch in turn.

if len(docs) > 0:
    logger.info("Files loaded; building the vector store")
    if vs_path and os.path.isdir(vs_path) and "index.faiss" in os.listdir(vs_path):
        vector_store = load_vector_store(vs_path, self.embeddings)
        vector_store.add_documents(docs)
        torch_gc()
    else:
        if not vs_path:
            vs_path = os.path.join(
                KB_ROOT_PATH,
                f"""{"".join(lazy_pinyin(os.path.splitext(file)[0]))}_FAISS_{datetime.datetime.now().strftime("%Y%m%d_%H%M%S")}""",
                "vector_store",
            )
        vector_store = MyFAISS.from_documents(docs, self.embeddings)  # docs is a List[Document]
        torch_gc()

    vector_store.save_local(vs_path)

3. When No faiss.index Exists

MyFAISS.from_documents() resolves to the from_documents() classmethod inherited from the parent class VectorStore. Note that self.embeddings here is actually an embeddings object, not a list of vectors:

vector_store = MyFAISS.from_documents(docs, self.embeddings)  # docs is a List[Document]

self.embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_dict[embedding_model],
    model_kwargs={'device': embedding_device},
)

@classmethod
def from_documents(
    cls: Type[VST],
    documents: List[Document],
    embedding: Embeddings,
    **kwargs: Any,
) -> VST:
    """Return VectorStore initialized from documents and embeddings."""
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

from_texts() does three things:

1. Embeds the documents.

2. Creates an in-memory docstore.

3. Initializes the FAISS database.

The cls instance it ultimately returns refers to class FAISS(VectorStore).

Two variables deserve attention here:

embeddings: List[List[float]] — the actual embedding vectors, a 2-D list

embedding: Embeddings — a HuggingFace wrapper object

@classmethod
def from_texts(
    cls,
    texts: List[str],
    embedding: Embeddings,
    metadatas: Optional[List[dict]] = None,
    **kwargs: Any,
) -> FAISS:
    """Construct FAISS wrapper from raw documents.

    This is a user friendly interface that:
        1. Embeds documents.
        2. Creates an in memory docstore
        3. Initializes the FAISS database

    This is intended to be a quick way to get started.

    Example:
        .. code-block:: python

            from langchain import FAISS
            from langchain.embeddings import OpenAIEmbeddings
            embeddings = OpenAIEmbeddings()
            faiss = FAISS.from_texts(texts, embeddings)
    """
    embeddings = embedding.embed_documents(texts)
    return cls.__from(
        texts,
        embeddings,
        embedding,
        metadatas,
        **kwargs,
    )

Let's start with embeddings = embedding.embed_documents(texts).

We find it in chains/modules/embeddings.py, where we can see that the embedding step delegates to client.encode. So if you want to swap in a custom embedding model and it is hosted on HuggingFace, it is straightforward: just set the model name and model path. The method returns the vectors as a List[List[float]]:

def embed_documents(self, texts: List[str]) -> List[List[float]]:
    """Compute doc embeddings using a HuggingFace transformer model.

    Args:
        texts: The list of texts to embed.

    Returns:
        List of embeddings, one for each text.
    """
    texts = list(map(lambda x: x.replace("\n", " "), texts))
    embeddings = self.client.encode(texts, normalize_embeddings=True)
    return embeddings.tolist()

self.client = sentence_transformers.SentenceTransformer(
    self.model_name, cache_folder=self.cache_folder, **self.model_kwargs
)
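
Since everything funnels through sentence_transformers, plugging in a custom HuggingFace model really is just configuration. A minimal sketch of the idea — the dict entries, device value, and paths below are illustrative stand-ins, not taken from the repo:

from langchain.embeddings import HuggingFaceEmbeddings

# Illustrative mapping in the style of the project's embedding_model_dict:
# short name -> HuggingFace model id, or a local checkpoint path.
embedding_model_dict = {
    "text2vec": "GanymedeNil/text2vec-large-chinese",
    "my-model": "/path/to/local/model",
}

embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_dict["my-model"],
    model_kwargs={"device": "cuda"},
)
vectors = embeddings.embed_documents(["hello", "world"])  # List[List[float]], one row per text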

Next, cls.__from() — the core of FAISS index construction lives in here:

@classmethod
def __from(
    cls,
    texts: List[str],
    embeddings: List[List[float]],
    embedding: Embeddings,
    metadatas: Optional[List[dict]] = None,
    normalize_L2: bool = False,
    **kwargs: Any,
) -> FAISS:
    faiss = dependable_faiss_import()
    index = faiss.IndexFlatL2(len(embeddings[0]))
    vector = np.array(embeddings, dtype=np.float32)
    if normalize_L2:
        faiss.normalize_L2(vector)
    index.add(vector)
    documents = []
    for i, text in enumerate(texts):
        metadata = metadatas[i] if metadatas else {}
        documents.append(Document(page_content=text, metadata=metadata))
    index_to_id = {i: str(uuid.uuid4()) for i in range(len(documents))}
    docstore = InMemoryDocstore(
        {index_to_id[i]: doc for i, doc in enumerate(documents)}
    )
    return cls(
        embedding.embed_query,
        index,
        docstore,
        index_to_id,
        normalize_L2=normalize_L2,
        **kwargs,
    )
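
Stripped of the langchain wrapper, the index-building core above reduces to a handful of faiss calls. A minimal standalone sketch with toy data, assuming the faiss-cpu package is installed:

import numpy as np
import faiss

dim = 4  # toy dimensionality; real embedding models emit e.g. 768 or 1024 dims
vectors = np.random.rand(10, dim).astype(np.float32)  # 10 fake document embeddings

index = faiss.IndexFlatL2(dim)  # exact (brute-force) search under L2 distance
index.add(vectors)              # row i of `vectors` becomes index position i

query = np.random.rand(1, dim).astype(np.float32)
distances, positions = index.search(query, 3)  # top-3 nearest neighbours
print(positions[0])  # integer positions, which __from maps to UUIDs via index_to_id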

From this we can pick out the key ingredients of building a FAISS index:

1. faiss.IndexFlatL2 — a flat index measuring similarity by L2 distance (cf. the standalone sketch above)

2. index_to_id — gives each chunk a globally unique id by mapping its index position to a UUID string: {0: uuid_1, 1: uuid_2, ...}

3. docstore — in fact an InMemoryDocstore instance mapping those UUIDs to the chunks:

class InMemoryDocstore(Docstore, AddableMixin):
    """Simple in memory docstore in the form of a dict."""    
    def __init__(self, _dict: Dict[str, Document]):
        """Initialize with dict."""        
        self._dict = _dict

    def add(self, texts: Dict[str, Document]) -> None:
        """Add texts to in memory dictionary."""       
        overlapping = set(texts).intersection(self._dict)
        if overlapping:
            raise ValueError(f"Tried to add ids that already exist: {overlapping}")
        self._dict = dict(self._dict, **texts)

    def search(self, search: str) -> Union[str, Document]:
        """Search via direct lookup."""        
        if search not in self._dict:
            return f"ID {search} not found."        
        else:
            return self._dict[search]
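
Its contract is easy to verify in isolation: add() refuses duplicate ids and search() is a plain dict lookup. A quick sketch, assuming the old langchain package layout this article is based on:

from langchain.docstore import InMemoryDocstore
from langchain.docstore.document import Document

store = InMemoryDocstore({"id-1": Document(page_content="chunk one")})
store.add({"id-2": Document(page_content="chunk two")})  # fine: new id
print(store.search("id-2").page_content)                 # -> chunk two
print(store.search("id-3"))                              # -> ID id-3 not found.
store.add({"id-1": Document(page_content="dup")})        # raises ValueError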

4. embed_query() — the HuggingFace encoding method; this is where a piece of text actually gets turned into a vector:

def embed_query(self, text: str) -> List[float]:
    """Compute query embeddings using a HuggingFace transformer model.

    Args:
        text: The text to embed.

    Returns:
        Embeddings for the text.
    """
    text = text.replace("\n", " ")
    embedding = self.client.encode(text, **self.encode_kwargs)
    return embedding.tolist()

Finally, we can see that vector_store is simply a FAISS object carrying the document information; the vectorization itself has already happened earlier in this flow:

vector_store = MyFAISS.from_documents(docs, self.embeddings)  # docs is a List[Document]

class FAISS(VectorStore):
    """Wrapper around FAISS vector database.

    To use, you should have the ``faiss`` python package installed.

    Example:
        .. code-block:: python

            from langchain import FAISS
            faiss = FAISS(embedding_function, index, docstore, index_to_docstore_id)
    """

    def __init__(
        self,
        embedding_function: Callable,
        index: Any,
        docstore: Docstore,
        index_to_docstore_id: Dict[int, str],
        relevance_score_fn: Optional[
            Callable[[float], float]
        ] = _default_relevance_score_fn,
        normalize_L2: bool = False,
    ):

4. When faiss.index Already Exists

vector_store = load_vector_store(vs_path, self.embeddings)

An lru_cache caching layer sits in front here, and MyFAISS loads the store through its load_local classmethod:

@lru_cache(CACHED_VS_NUM)
def load_vector_store(vs_path, embeddings):
    return MyFAISS.load_local(vs_path, embeddings)
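
Because of lru_cache, calling load_vector_store a second time with the same arguments returns the very same object instead of re-reading from disk (both arguments must be hashable for this to work). A quick illustration with a made-up path:

vs_a = load_vector_store("/kb/sample/vector_store", embeddings)
vs_b = load_vector_store("/kb/sample/vector_store", embeddings)
assert vs_a is vs_b  # second call is served from the cache, no disk I/O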

What ultimately comes back, as you can see, is a FAISS(VectorStore) instance serving as the vector_store:

@classmethod
def load_local(
    cls, folder_path: str, embeddings: Embeddings, index_name: str = "index"
) -> FAISS:
    """Load FAISS index, docstore, and index_to_docstore_id from disk.

    Args:
        folder_path: folder path to load index, docstore,
            and index_to_docstore_id from.
        embeddings: Embeddings to use when generating queries
        index_name: for saving with a specific index file name
    """
    path = Path(folder_path)
    # load index separately since it is not picklable
    faiss = dependable_faiss_import()
    index = faiss.read_index(
        str(path / "{index_name}.faiss".format(index_name=index_name))
    )

    # load docstore and index_to_docstore_id
    with open(path / "{index_name}.pkl".format(index_name=index_name), "rb") as f:
        docstore, index_to_docstore_id = pickle.load(f)
    return cls(embeddings.embed_query, index, docstore, index_to_docstore_id)

Now to the main point: vector_store.add_documents(docs) nests two further calls, shown in order below:

def add_documents(self, documents: List[Document], **kwargs: Any) -> List[str]:
    """Run more documents through the embeddings and add to the vectorstore.

    Args:
        documents (List[Document]): Documents to add to the vectorstore.

    Returns:
        List[str]: List of IDs of the added texts.
    """
    # TODO: Handle the case where the user doesn't provide ids on the Collection
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return self.add_texts(texts, metadatas, **kwargs)


def add_texts(
    self,
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
    **kwargs: Any,
) -> List[str]:
    """Run more texts through the embeddings and add to the vectorstore.

    Args:
        texts: Iterable of strings to add to the vectorstore.
        metadatas: Optional list of metadatas associated with the texts.

    Returns:
        List of ids from adding the texts into the vectorstore.
    """
    if not isinstance(self.docstore, AddableMixin):
        raise ValueError(
            "If trying to add texts, the underlying docstore should support "
            f"adding items, which {self.docstore} does not"
        )
    # Embed and create the documents.
    embeddings = [self.embedding_function(text) for text in texts]
    return self.__add(texts, embeddings, metadatas, **kwargs)
    
def __add(
    self,
    texts: Iterable[str],
    embeddings: Iterable[List[float]],
    metadatas: Optional[List[dict]] = None,
    **kwargs: Any,
) -> List[str]:
    if not isinstance(self.docstore, AddableMixin):
        raise ValueError(
            "If trying to add texts, the underlying docstore should support "
            f"adding items, which {self.docstore} does not"
        )
    documents = []
    for i, text in enumerate(texts):
        metadata = metadatas[i] if metadatas else {}
        documents.append(Document(page_content=text, metadata=metadata))
    # Add to the index, the index_to_id mapping, and the docstore.
    starting_len = len(self.index_to_docstore_id)
    faiss = dependable_faiss_import()
    vector = np.array(embeddings, dtype=np.float32)
    if self._normalize_L2:
        faiss.normalize_L2(vector)
    self.index.add(vector)
    # Get list of index, id, and docs.
    full_info = [
        (starting_len + i, str(uuid.uuid4()), doc)
        for i, doc in enumerate(documents)
    ]
    # Add information to docstore and index.
    self.docstore.add({_id: doc for _, _id, doc in full_info})
    index_to_id = {index: _id for index, _id, _ in full_info}
    self.index_to_docstore_id.update(index_to_id)
    return [_id for _, _id, _ in full_info]

Here, self.index performs the incremental vector insertion, while full_info, self.docstore, and self.index_to_docstore_id perform the incremental bookkeeping. add_documents() returns the following — evidently a list of document ids:

full_info = [
    (starting_len + i, str(uuid.uuid4()), doc)
    for i, doc in enumerate(documents)
]
return [_id for _, _id, _ in full_info]
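
Putting the incremental branch together end to end, a hedged sketch — the path, file name, and text below are illustrative:

from langchain.docstore.document import Document

vector_store = load_vector_store(vs_path, embeddings)  # cached FAISS object
new_docs = [Document(page_content="a new chunk", metadata={"source": "new.txt"})]
new_ids = vector_store.add_documents(new_docs)  # embeds and grows the faiss index
print(new_ids)                                  # one UUID string per added chunk
vector_store.save_local(vs_path)                # persist the enlarged store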

5. Saving to Disk

Saving writes out three things — the index, the docstore, and index_to_docstore_id — split across two files:

{index_name}.faiss: the faiss index holding the vectors

{index_name}.pkl: the pickled docstore object together with the index_to_docstore_id dict

def save_local(self, folder_path: str, index_name: str = "index") -> None:
    """Save FAISS index, docstore, and index_to_docstore_id to disk.

    Args:
        folder_path: folder path to save index, docstore,
            and index_to_docstore_id to.
        index_name: for saving with a specific index file name
    """
    path = Path(folder_path)
    path.mkdir(exist_ok=True, parents=True)

    # save index separately since it is not picklable
    faiss = dependable_faiss_import()
    faiss.write_index(
        self.index, str(path / "{index_name}.faiss".format(index_name=index_name))
    )

    # save docstore and index_to_docstore_id
    with open(path / "{index_name}.pkl".format(index_name=index_name), "wb") as f:
        pickle.dump((self.docstore, self.index_to_docstore_id), f)
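
A save/load round trip preserves all three pieces. A minimal sketch with an illustrative path (note the .pkl half goes through pickle, so only load directories you trust):

vector_store.save_local("/tmp/my_vs")  # writes index.faiss and index.pkl
reloaded = MyFAISS.load_local("/tmp/my_vs", embeddings)
print(reloaded.similarity_search("some query", k=1))  # search the reloaded index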

6. Closing Thoughts

The document embedding and FAISS construction flow is quite winding in the implementation. Read the source carefully, and keep an eye on the variable types in each function definition — with those in mind, understanding comes at half the effort and twice the payoff!

Friends, please don't forget to like, favorite, and share!

Please credit the source when reposting: QA Weekly
