Chroma Vector Database: Study Notes

Posted by 深度学习机器 on 2024-04-13

Original URL: https://www.cnblogs.com/deeplearningmachine/p/18132593

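I. Introduction

Chroma is an open-source AI vector database for quickly building LLM-based applications. It supports Python and JavaScript, is lightweight and quick to install, and can be combined with well-known LLM frameworks such as LangChain and LlamaIndex.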

II. Basic Usage

1 Installation

Installation takes a single command:

pip install chromadb

2 Create a client

import chromadb
chroma_client = chromadb.Client()

3 Create a collection

A collection stores embeddings together with their metadata. You can think of it as a table in a traditional database.

collection = chroma_client.create_collection(name="my_collection")

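4 Add data

A collection can hold documents (text), metadata, and ids. When you add documents, Chroma calls the default embedding model to vectorize them.
ids is required, and each record needs either documents or embeddings. The remaining fields (metadatas, uris, images) are optional.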

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
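The lists passed to add are parallel: the i-th id, document, and metadata entry describe the same record. Below is a minimal sketch of building them from a list of records. The record keys "id", "text", and "source" and the helper name build_add_kwargs are my own assumptions for illustration, not part of Chroma.

```python
from typing import Any


def build_add_kwargs(records: list[dict[str, Any]]) -> dict[str, list]:
    """Turn records into the parallel lists that collection.add expects.

    The record layout ("id", "text", "source") is hypothetical.
    """
    ids, documents, metadatas = [], [], []
    for record in records:
        ids.append(str(record["id"]))          # ids are passed as strings
        documents.append(record["text"])
        metadatas.append({"source": record["source"]})
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within one add call")
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


records = [
    {"id": 1, "text": "This is a document", "source": "my_source"},
    {"id": 2, "text": "This is another document", "source": "my_source"},
]
kwargs = build_add_kwargs(records)
print(kwargs["ids"])
```

With a real collection the result could then be passed as collection.add(**kwargs).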

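If you already have embeddings for your text, you can pass them directly in the embeddings field. Note that the dimension of manually supplied vectors must match the dimension of the embedding model the collection was initialized with, otherwise an error is raised.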

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
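Since a dimension mismatch only shows up as an error at insert time, it can help to check the vectors yourself first. The following is a small sketch of such a check. The helper name check_embedding_dims and the expected_dim argument are assumptions for illustration; the right value depends on the model your collection uses.

```python
def check_embedding_dims(embeddings: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector does not have expected_dim components."""
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {index} has dimension {len(vector)}, expected {expected_dim}"
            )


# The toy vectors from the example above are all 3-dimensional, so this passes.
check_embedding_dims([[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]], expected_dim=3)
```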

5 Query the collection

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)
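The returned results object is a dict of nested lists: the outer list has one entry per query text, and the inner list holds the hits for that query. A minimal sketch of walking through the hits follows. The helper iter_hits is my own, and example_results is a hand-written stand-in with made-up values, not real Chroma output.

```python
def iter_hits(results: dict, query_index: int = 0):
    """Yield (id, document, distance) for one query in a query-style result dict."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    for doc_id, document, distance in zip(ids, documents, distances):
        yield doc_id, document, distance


# Hand-written stand-in with the same nesting as a query result (values are made up).
example_results = {
    "ids": [["id1", "id2"]],
    "documents": [["This is a document", "This is another document"]],
    "distances": [[0.5, 0.9]],
}

for doc_id, document, distance in iter_hits(example_results):
    print(doc_id, document, distance)
```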

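III. Advanced Usage

Persisting data to a local path

The collections created in the code above are never written to disk. They are meant for quick prototyping and disappear when the program exits. To make a collection reusable across runs, only a small change is needed:

# Use PersistentClient instead of Client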
client = chromadb.PersistentClient(path="/path/to/save/to")

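Client/server deployment

Real projects rarely consist of client code alone, so Chroma is also designed to be deployed in client/server mode.

Server start command:

# --path specifies where data is persisted
# Listens on port 8000 by default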
chroma run --path /db_path

Client connection code:

import chromadb
client = chromadb.HttpClient(host='localhost', port=8000)

If your project only needs the client side, you can install the more lightweight client package:

pip install chromadb-client

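On the client side, you connect the same way as before. Compared with the full package, chromadb-client drops many dependencies. In particular, it no longer ships the default embedding model, so you must supply a custom embedding function to vectorize text.

Create or select an existing collection:

# emb_fn is an embedding function instance defined elsewhere, e.g. a custom embedding function
# Create a collection named my_collection; raises an error if it already exists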
collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
# Get the collection named my_collection; raises an error if it does not exist
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)
# Get the collection named my_collection, creating it if it does not exist
collection = client.get_or_create_collection(name="my_collection", embedding_function=emb_fn)

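Exploring a collection

# Return the first 10 records in the collection
collection.peek()
# Return the number of records in the collection
collection.count()
# Rename the collection
collection.modify(name="new_name")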

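Operating on a collection

Add

Adding to a collection is done with add, which was covered above and is not repeated here.

Query

A collection can be queried through two interfaces, query and get.

# Query by text: the embedding model vectorizes the text, then similar vectors are looked up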
collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)

# You can also query by embedding
collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
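Conceptually, a query by embedding ranks the stored vectors by their distance to the query vector and returns the closest n_results. The sketch below does this by brute force with squared L2 distance. It only illustrates the idea and is not how Chroma is implemented (as far as I know, Chroma uses an approximate nearest-neighbor index rather than a full scan). The function names and the stored dict are assumptions for illustration.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_query(
    stored: dict[str, list[float]],
    query_embedding: list[float],
    n_results: int,
) -> list[tuple[str, float]]:
    """Return the n_results closest (id, distance) pairs, nearest first."""
    scored = [
        (doc_id, squared_l2(vector, query_embedding))
        for doc_id, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Toy 3-dimensional vectors keyed by id (made-up values).
stored = {
    "id1": [1.0, 2.0, 3.0],
    "id2": [10.0, 10.0, 10.0],
    "id3": [1.0, 2.0, 4.0],
}
nearest = brute_force_query(stored, [1.0, 2.0, 3.0], n_results=2)
print(nearest)  # [('id1', 0.0), ('id3', 1.0)]
```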

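where and where_document filter on metadata and on document text, respectively. The filter syntax is fairly involved; see the official documentation for details. Personally, I find it somewhat superfluous: for a lightweight database like this and for AI applications, it does not seem strictly necessary.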
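To make the two simple filter forms used in these examples concrete, here is a rough sketch of what they mean for a single record: where requires the metadata fields to equal the given values, and where_document with $contains requires the text to include a substring. The function matches is my own illustration and handles only these two forms; the real filter language supports more operators.

```python
from typing import Optional


def matches(
    metadata: dict,
    document: str,
    where: Optional[dict] = None,
    where_document: Optional[dict] = None,
) -> bool:
    """Check one record against a simple equality `where` and a `$contains` filter."""
    if where:
        for key, expected in where.items():
            if metadata.get(key) != expected:
                return False
    if where_document and "$contains" in where_document:
        if where_document["$contains"] not in document:
            return False
    return True


print(matches({"style": "style1"}, "a search_string here",
              where={"style": "style1"},
              where_document={"$contains": "search_string"}))  # True
```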

collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"},
    where_document={"$contains":"search_string"}
)

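get is closer to a traditional select operation, and it also supports the where and where_document filters.

Delete

Deletion works by specifying ids. If no ids are given, all records matching where are deleted.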

collection.delete(
    ids=["id1", "id2", "id3",...],
    where={"chapter": "20"}
)

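Update

Updates are also made by specifying ids. If an id does not exist, an error is logged and that update is ignored. If you update documents without supplying embeddings, the corresponding embeddings are recomputed as well.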

collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

Custom embedding functions

Chroma supports many embedding models and also lets you define your own. Below is an example that defines an embedding function with a text2vec model:

from chromadb import Documents, EmbeddingFunction, Embeddings
from text2vec import SentenceModel

# Load an embedding model from the text2vec library
model = SentenceModel('text2vec-chinese')

# Documents is a list of strings; Embeddings is a list of float vectors
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents somehow
        return model.encode(input).tolist()
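
When you only want to test the wiring (for example with chromadb-client, which has no default model), a deterministic stand-in with the same call shape can be handy. The class below is a toy of my own: it hashes each text into a fixed-length vector, carries no semantic meaning, and deliberately does not subclass Chroma's EmbeddingFunction so that it runs without chromadb installed. For real use you would subclass EmbeddingFunction as in the example above.

```python
import hashlib


class ToyHashEmbedding:
    """Deterministic placeholder: maps each text to `dim` floats in [0, 1]."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def __call__(self, input: list[str]) -> list[list[float]]:
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale the first `dim` bytes of the hash into the range [0, 1].
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


toy_fn = ToyHashEmbedding(dim=8)
vectors = toy_fn(["doc1", "doc2"])
print(len(vectors), len(vectors[0]))  # 2 8
```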

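Multimodal

Chroma collections support storing and querying multimodal data, as long as the embedding function can vectorize multimodal data. The official documentation gives the following example: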

import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

# Uses OpenAI's CLIP text-image model
embedding_function = OpenCLIPEmbeddingFunction()
# A built-in image loader is also needed
data_loader = ImageLoader()
client = chromadb.Client()

collection = client.create_collection(
    name='multimodal_collection', 
    embedding_function=embedding_function, 
    data_loader=data_loader)

Add images as numpy arrays to the collection

collection.add(
    ids=['id1', 'id2', 'id3'],
    images=[...] # A list of numpy arrays representing images
)

This works like text retrieval, except that you use query_images instead

results = collection.query(
    query_images=[...] # A list of numpy arrays representing images
)
