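LanceDB Vector Database: Study Notes

Posted by 深度学习机器 on 2024-04-15

Original URL: https://www.cnblogs.com/deeplearningmachine/p/18136746

Introduction

LanceDB is an open-source vector database for AI, designed to store, manage, query, and retrieve embeddings of large-scale multimodal data. Its core is written in Rust and built on top of Lance, a columnar data format designed for high-performance ML workloads and fast random access.

Quick Start

Installation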

pip install lancedb

At the time of writing, lancedb 0.6.8 requires pyarrow 12.0.0 or later; in my own testing, pyarrow 15.0 raised an error.

Creating a client

import lancedb
import pandas as pd
import pyarrow as pa

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# Async client (the await must run inside an async function)
# async_db = await lancedb.connect_async(uri)

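Unlike Chroma, LanceDB has no server-client mode; the database runs embedded in your process. It offers both a synchronous and an asynchronous client. The async client appears to be updated more frequently, but judging from the official documentation I found no difference in how the two are used.

Creating a table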

data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]

tbl = db.create_table("my_table", data=data)

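If the table name already exists, this raises an error. To overwrite an existing table of the same name, add the mode='overwrite' argument.

tbl = db.create_table("my_table", data=data, mode='overwrite')

If you would rather open the existing table than overwrite it, add the exist_ok=True argument.

tbl = db.create_table("my_table", data=data, exist_ok=True)

Creating an empty table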

schema = pa.schema([pa.field("vector", pa.list_(pa.float32(), list_size=2))])
tbl = db.create_table("empty_table", schema=schema)

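As in SQL, you can create an empty table first and insert the data later.

Adding data

# Reopen the table whose schema matches the rows below
tbl = db.open_table("my_table")

# Add rows directly
data = [
    {"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
    {"vector": [9.5, 56.2], "item": "buzz", "price": 200.0},
]
tbl.add(data)

# Add a pandas DataFrame
df = pd.DataFrame(data)
tbl.add(df)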

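Querying data

# Synchronous client
tbl.search([100, 100]).limit(2).to_pandas()

This finds the vectors most similar to a query vector. By default no index is built on the vector column, so the search is a brute-force scan of the whole table. The official recommendation is to build an index only once the data grows past roughly 500,000 rows; below that, the latency of a brute-force scan is still acceptable. (My cynical read: that sounds like a dressed-up way of saying indexing is not done for you automatically.)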
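To make "brute-force scan" concrete, here is a minimal pure-Python sketch of what such a search amounts to. It is not LanceDB's implementation; the helper names `l2_squared` and `brute_force_search` are made up for illustration, and squared Euclidean distance is assumed as the metric.

```python
import heapq

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(rows, query, limit):
    """Score every row against the query and keep the `limit` closest ones."""
    return heapq.nsmallest(
        limit, rows, key=lambda row: l2_squared(row["vector"], query)
    )

# The four rows used in the examples above
rows = [
    {"vector": [3.1, 4.1], "item": "foo"},
    {"vector": [5.9, 26.5], "item": "bar"},
    {"vector": [1.3, 1.4], "item": "fizz"},
    {"vector": [9.5, 56.2], "item": "buzz"},
]

top2 = brute_force_search(rows, [100, 100], limit=2)
print([row["item"] for row in top2])  # ['buzz', 'bar']
```

Every query touches every row, so the cost grows linearly with the table size, which is why an index only starts to pay off on larger tables.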

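Deleting data

tbl.delete('item = "fizz"')

This works like a SQL WHERE clause: you pass a filter expression that names the column and the value to match.

Updating data

tbl.update(where='item = "fizz"', values={"vector": [10, 10]})

This works like a SQL UPDATE statement: you give the filter that selects the rows and the new values for the columns to change.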

Dropping a table

db.drop_table("my_table")

Listing all tables

print(db.table_names())
tbl = db.open_table("my_table")

table_names returns every table that has been created in the database, and open_table opens the table with the given name.

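Advanced usage

Data types

Multiple data types

Besides adding rows directly or from a pandas DataFrame, LanceDB also lets you define the schema and add data with pyarrow.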

import pyarrow as pa
schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float16(), 2)),
        pa.field("text", pa.string())
    ]
)

LanceDB supports the float16 data type directly, which gives it a storage advantage over ChromaDB.
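The storage difference can be checked with a short sketch using only Python's standard `struct` module, which supports half precision through the `e` format code. The dimension of 1536 and the sample values are assumptions chosen for illustration; this measures raw vector bytes only, not LanceDB's on-disk format.

```python
import struct

DIM = 1536  # hypothetical embedding dimension, for illustration only
vector = [i / DIM for i in range(DIM)]  # sample values in [0, 1)

as_float32 = struct.pack(f"<{DIM}f", *vector)  # 4 bytes per component
as_float16 = struct.pack(f"<{DIM}e", *vector)  # 2 bytes per component
print(len(as_float32), len(as_float16))  # 6144 3072

# The price of halving the size: decode and measure the worst rounding error
decoded = struct.unpack(f"<{DIM}e", as_float16)
max_error = max(abs(a - b) for a, b in zip(vector, decoded))
print(max_error)
```

Half the bytes per vector comes with a small loss of precision, which is often acceptable for similarity search but worth measuring on your own embeddings.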

Custom data types

from lancedb.pydantic import Vector, LanceModel

class Content(LanceModel):
    movie_id: int
    vector: Vector(128)
    genres: str
    title: str
    imdb_id: int

    @property
    def imdb_url(self) -> str:
        return f"https://www.imdb.com/title/tt{self.imdb_id}"

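LanceModel is a subclass of pydantic.BaseModel. Its main addition is the Vector type, which saves you from hand-writing the vector field in the schema: you only specify the dimension.

Nested data types

from pydantic import BaseModel

class Document(BaseModel):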
    content: str
    source: str
    
class NestedSchema(LanceModel):
    id: str
    vector: Vector(1536)
    document: Document

tbl = db.create_table("nested_table", schema=NestedSchema, mode="overwrite")

Indexes

Creating an IVF_PQ index

tbl.create_index(num_partitions=256, num_sub_vectors=96)

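LanceDB supports IVF_PQ, an inverted file index combined with product quantization. num_partitions is the number of partitions in the index; its default is the square root of the number of rows. num_sub_vectors is the number of sub-vectors each vector is split into; its default is the vector dimension divided by 16.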
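As a rough sketch of how those two defaults work out, the helper below applies the rules stated above to a hypothetical table of 65,536 rows with 1536-dimensional vectors. The function names, the rounding behaviour, and the assumption of 8-bit PQ codes are mine, not taken from LanceDB.

```python
import math

def default_ivf_pq_params(num_rows, dim):
    """Apply the defaults described above: sqrt(rows) and dim / 16."""
    num_partitions = max(1, round(math.sqrt(num_rows)))  # rounding is an assumption
    num_sub_vectors = max(1, dim // 16)
    return num_partitions, num_sub_vectors

def pq_bytes_per_vector(num_sub_vectors, bits_per_code=8):
    """Size of one PQ-compressed vector; 8-bit codes are an assumption."""
    return num_sub_vectors * bits_per_code // 8

num_partitions, num_sub_vectors = default_ivf_pq_params(num_rows=65_536, dim=1536)
print(num_partitions, num_sub_vectors)  # 256 96

raw_bytes = 1536 * 4  # float32 storage for one vector
compressed_bytes = pq_bytes_per_vector(num_sub_vectors)
print(raw_bytes, compressed_bytes)  # 6144 96
```

For this hypothetical table the rules reproduce the values passed to create_index above (256 partitions, 96 sub-vectors). Under the 8-bit assumption each vector shrinks from 6144 bytes to 96 bytes of codes, which is where the speed-up and the loss of accuracy both come from.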

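Building the index on a GPU

tbl.create_index(num_partitions=256, num_sub_vectors=96, accelerator="cuda")
# On Apple silicon, pass accelerator="mps" instead

Index building can be accelerated on a CUDA-capable GPU or with Apple's MPS.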

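Using the index to speed up approximate search

import numpy as np

tbl.search(np.random.random((1536))) \
.limit(2) \
.nprobes(20) \
.refine_factor(10) \
.to_pandas()

nprobes is the number of partitions probed per query and defaults to 20; raising it improves accuracy at the cost of more computation. refine_factor controls a coarse-retrieval step: the search reads extra candidates (limit times refine_factor) and re-ranks them, which improves recall.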
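The effect of the two knobs can be seen in a toy sketch. This is not LanceDB's algorithm: the centroids are fixed by hand, and rounding each coordinate to an integer is a crude stand-in for product quantization. All names (`build_partitions`, `ann_search`, `codes`) are hypothetical.

```python
def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_partitions(vectors, centroids):
    """Assign each vector index to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: l2_squared(vec, centroids[c]))
        partitions[nearest].append(idx)
    return partitions

def ann_search(query, vectors, codes, centroids, partitions,
               k, nprobes, refine_factor=None):
    # 1. Probe only the `nprobes` partitions whose centroids are closest.
    probed = sorted(range(len(centroids)),
                    key=lambda c: l2_squared(query, centroids[c]))[:nprobes]
    candidates = [idx for c in probed for idx in partitions[c]]
    # 2. Rank candidates by approximate distance, computed from lossy codes.
    ranked = sorted(candidates, key=lambda idx: l2_squared(query, codes[idx]))
    if refine_factor is None:
        return ranked[:k]
    # 3. Refine: keep k * refine_factor candidates, re-rank with full vectors.
    shortlist = ranked[: k * refine_factor]
    return sorted(shortlist, key=lambda idx: l2_squared(query, vectors[idx]))[:k]

centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.9, 0.0), (5.4, 0.0), (6.4, 0.0), (9.0, 0.0)]
codes = [tuple(float(round(x)) for x in vec) for vec in vectors]  # lossy copy
partitions = build_partitions(vectors, centroids)
args = (vectors, codes, centroids, partitions)

# nprobes: one probe misses index 1, which sits just across the boundary
one_probe = ann_search((5.1, 0.0), *args, k=2, nprobes=1, refine_factor=10)
two_probes = ann_search((5.1, 0.0), *args, k=2, nprobes=2, refine_factor=10)
print(one_probe, two_probes)  # [2, 3] [1, 2]

# refine_factor: lossy codes rank index 3 first; re-ranking recovers index 2
coarse = ann_search((5.8, 0.0), *args, k=1, nprobes=2)
refined = ann_search((5.8, 0.0), *args, k=1, nprobes=2, refine_factor=2)
print(coarse, refined)  # [3] [2]
```

Both knobs trade time for recall: more probes scan more partitions, and a larger refine factor reads more full vectors for the final re-ranking.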

Embedding models

Built-in embedding models

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="cpu")

class Words(LanceModel):
    text: str = model.SourceField()  # the field the model should embed
    vector: Vector(model.ndims()) = model.VectorField()  # the field that stores the resulting embedding

table = db.create_table("words", schema=Words)
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)

query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)

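A range of sentence-transformers embedding models is supported out of the box. To use a built-in model this way, you declare the model's SourceField and VectorField in the schema; the text is then embedded automatically both when rows are added and when a text query is searched.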
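What the SourceField / VectorField pairing buys you can be sketched with a toy table in plain Python. This is only an illustration of the idea, not how LanceDB implements it: `ToyTable`, `toy_embed`, and the three-word vocabulary are all hypothetical, and the "embedding" is just a 0/1 flag per vocabulary word.

```python
VOCAB = ["hello", "goodbye", "world"]  # hypothetical toy vocabulary

def toy_embed(text):
    """Stand-in for a real model: one 0/1 component per vocabulary word."""
    words = text.lower().split()
    return [1.0 if word in words else 0.0 for word in VOCAB]

class ToyTable:
    """Hypothetical table that embeds the source field on insert and on query."""

    def __init__(self, embed, source_field, vector_field):
        self.embed = embed
        self.source_field = source_field
        self.vector_field = vector_field
        self.rows = []

    def add(self, rows):
        for row in rows:
            # Fill the vector column from the source column automatically
            vector = self.embed(row[self.source_field])
            self.rows.append({**row, self.vector_field: vector})

    def search(self, query_text, limit):
        # The query text goes through the same embedding function
        query_vector = self.embed(query_text)

        def distance(row):
            return sum((a - b) ** 2
                       for a, b in zip(row[self.vector_field], query_vector))

        return sorted(self.rows, key=distance)[:limit]

table = ToyTable(toy_embed, source_field="text", vector_field="vector")
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

best = table.search("hello there", limit=1)[0]
print(best["text"])  # hello world
```

The caller never handles vectors directly: it inserts text and searches with text, which is the convenience the schema annotations provide.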

Custom embedding models

import sentence_transformers
from cachetools import cached
from lancedb.embeddings import TextEmbeddingFunction, register

@register("sentence-transformers")
class SentenceTransformerEmbeddings(TextEmbeddingFunction):
    name: str = "all-MiniLM-L6-v2"
    # set more default instance vars like device, etc.

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._ndims = None

    def generate_embeddings(self, texts):
        # Extra encode() options were elided ("...") in the original template
        return self._embedding_model().encode(list(texts)).tolist()

    def ndims(self):
        if self._ndims is None:
            # Embed one sample text to discover the output dimension
            self._ndims = len(self.generate_embeddings(["foo"])[0])
        return self._ndims

    @cached(cache={})
    def _embedding_model(self):
        return sentence_transformers.SentenceTransformer(self.name)

from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

registry = EmbeddingFunctionRegistry.get_instance()
stransformer = registry.get("sentence-transformers").create()

class TextModelSchema(LanceModel):
    vector: Vector(stransformer.ndims()) = stransformer.VectorField()
    text: str = stransformer.SourceField()

tbl = db.create_table("table", schema=TextModelSchema)

tbl.add(pd.DataFrame({"text": ["halo", "world"]}))
result = tbl.search("world").limit(5)

The official docs provide this template for custom models, but I find it more straightforward to call the model directly and compute the embeddings myself; the template feels like uniformity for its own sake.

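Summary

Compared with ChromaDB, LanceDB has no server mode; everything runs in the client. The project does advertise a cloud-native version, but I suspect most use cases do not need the cloud, so this product feels like the more lightweight of the two.
In addition, no default embedding model is applied when you create a table, which probably gives developers more flexibility. ChromaDB, by contrast, downloads a model from HuggingFace by default, which is unfriendly to intranet environments.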
