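OceanBase recently launched a cloud database with a 365-day free trial. The version has also been upgraded to 4.3, which adds vector support.
The vendor is also running a hands-on AI workshop. The tutorial uses a single-node Docker database, but since a free cloud database is available, I decided to try the cloud database first.
1. Requesting the cloud environment
The home page of the official website carries a large banner, "OB Cloud 365-day free trial". After a few simple steps and a wait of about 5 minutes, a free database instance is created.
Click the "three dots" at the upper right, then create a user, create a database, and obtain the connection string, in that order. After that you can connect to the cloud database over the public network. Be careful when adding entries to the allowlist, so that the instance is not exposed to unauthorized connections.
After entering the instance console, click "Parameter Management" and set ob_vector_memory_limit_percentage to 30 to enable vector search.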
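For reference, the same parameter can also be expressed as an SQL statement instead of a console click. The sketch below only builds that statement. The helper name is hypothetical, the range check is my assumption about the parameter's valid values, and whether a cloud tenant user is allowed to run ALTER SYSTEM is something I have not verified.

```python
def vector_memory_statement(percentage: int) -> str:
    """Build the ALTER SYSTEM statement that sets the vector memory share."""
    # Assumption: valid values are 0 (vector search off) up to, but not including, 100.
    if not 0 <= percentage < 100:
        raise ValueError("percentage must be in [0, 100)")
    return f"ALTER SYSTEM SET ob_vector_memory_limit_percentage = {percentage};"


# The value used in this post.
print(vector_memory_statement(30))
```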
2. Installing Python
The required Python version is at least 3.9 and below 4.0. I used 3.9.6.
# yum install python39
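To confirm that the interpreter on the PATH falls inside that range before installing anything, a small check like the following can help. It is a minimal sketch; the function name is made up for illustration.

```python
import sys


def python_version_supported(version=None):
    """Return True when the version satisfies >=3.9 and <4.0."""
    # Only major and minor matter for this range check.
    major, minor = (version or sys.version_info)[:2]
    return (3, 9) <= (major, minor) < (4, 0)


print(python_version_supported((3, 9, 6)))  # the version used in this post
print(python_version_supported())           # the interpreter running this script
```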
3. Cloning the project
# git clone https://github.com/oceanbase-devhub/ai-workshop-2024.git
4. Installing Poetry
Poetry is a dependency and package management tool for Python that makes installing packages simpler and more convenient.
# python3 -m pip install poetry
# cd ~/ai-workshop-2024
# poetry install
If package downloads are slow, you can replace the official index with the Aliyun mirror.
# cd ai-workshop-2024
# vi pyproject.toml
// Remove the following source entries
[[tool.poetry.source]]
name = "PyPI"
priority = "primary"
[[tool.poetry.source]]
name = "tuna"
url = "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
priority = "supplemental"
// Add the following source entries
[[tool.poetry.source]]
name = "ali"
url = "https://mirrors.aliyun.com/pypi/simple/"
priority = "primary"
[[tool.poetry.source]]
name = "tuna"
url = "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
priority = "supplemental"
// Apply the configuration
# poetry lock
Next, configure the environment variables.
# cp .env.example .env
# vi .env
API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx # Replace with your ZhipuAI API key from https://open.bigmodel.cn/usercenter/apikeys
LLM_MODEL="glm-4-flash"
LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4/" # BigModel (ZhipuAI)
# LLM_BASE_URL="https://api.openai.com/v1/" # OpenAI
# LLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1" # Dashscope (Alibaba)
HF_ENDPOINT=https://hf-mirror.com
BGE_MODEL_PATH=BAAI/bge-m3
OLLAMA_URL=
OLLAMA_TOKEN=
OPENAI_API_KEY=
OPENAI_BASE_URL=
OPENAI_EMBEDDING_MODEL=
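DB_HOST="127.0.0.1" # IP address or domain name of the database
DB_PORT="2881" # Database port
DB_USER="root@test" # User name for the connection
DB_NAME="test" # Database name for the connection
DB_PASSWORD="" # Password for the connection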
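Since the sample values above still point at a local database, it is easy to forget to fill in the cloud connection details. The following is a minimal sketch of a sanity check over the .env contents. The parser is a simplified stand-in written for this post, not the behavior of any real dotenv library, and both function names are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines; a simplified stand-in for a real dotenv parser."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ("'", '"'):
            # Quoted value: keep everything up to the closing quote.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop a trailing " # comment".
            value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_db_settings(env):
    """List the DB_* keys that are absent or empty (DB_PASSWORD may be empty)."""
    required = ("DB_HOST", "DB_PORT", "DB_USER", "DB_NAME")
    return [key for key in required if not env.get(key)]


sample = '''
DB_HOST="127.0.0.1" # host
DB_PORT="2881"
DB_USER="root@test"
DB_NAME=""
# LLM_BASE_URL="https://api.openai.com/v1/"
'''
env = parse_env(sample)
print(env["DB_HOST"], missing_db_settings(env))
```

In practice you would pass the text of the real file, for example `parse_env(open(".env", encoding="utf-8").read())`.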
5. Preparing the BGE-M3 model
# poetry run python utils/prepare_bgem3.py
===================================
BGEM3FlagModel loaded successfully!
===================================
If you see the output above, the model was prepared successfully.
6. Preparing the document data
Clone the documentation from GitHub.
# cd doc_repos
# git config --global http.postBuffer 16000M // increase the HTTP buffer size
# git config --global core.compression -1 // use zlib's default compression level
# git clone --single-branch --branch V4.3.3 https://github.com/oceanbase/oceanbase-doc.git --depth 1 // if git reports an error, add the trailing --depth 1 option
# git clone --single-branch --branch V4.3.0 https://github.com/oceanbase/ocp-doc.git
# git clone --single-branch --branch V4.3.1 https://github.com/oceanbase/odc-doc.git
# git clone --single-branch --branch V4.2.5 https://github.com/oceanbase/oms-doc.git
# git clone --single-branch --branch V2.10.0 https://github.com/oceanbase/obd-doc.git
# git clone --single-branch --branch V4.3.0 https://github.com/oceanbase/oceanbase-proxy-doc.git
# cd ..
Convert the document headings to standard Markdown format.
# poetry run python convert_headings.py \
doc_repos/oceanbase-doc/zh-CN \
doc_repos/ocp-doc/zh-CN \
doc_repos/odc-doc/zh-CN \
doc_repos/oms-doc/zh-CN \
doc_repos/obd-doc/zh-CN \
doc_repos/oceanbase-proxy-doc/zh-CN
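I have not inspected convert_headings.py, so the following is only an illustration of what such a conversion could look like. It assumes that "standard Markdown format" means ATX headings (`#`, `##`) and that the source files use Setext-style headings underlined with `=` or `-`. The function is a hypothetical sketch and ignores code fences and front matter.

```python
import re


def setext_to_atx(markdown):
    """Rewrite Setext headings (underlined with = or -) as ATX (# / ##)."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if lines[i].strip() and re.fullmatch(r"=+\s*", nxt):
            # Title underlined with '=' becomes a level-1 heading.
            out.append("# " + lines[i].strip())
            i += 2
        elif lines[i].strip() and re.fullmatch(r"-{2,}\s*", nxt):
            # Title underlined with '-' becomes a level-2 heading.
            out.append("## " + lines[i].strip())
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)


print(setext_to_atx("Overview\n========\n\ntext\n\nDetails\n-------\nmore"))
```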
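Generate the document embeddings and metadata. This takes a long time and depends heavily on hardware performance, so it doubles as a decent stress test.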
# poetry run python embed_docs.py --doc_base doc_repos/oceanbase-doc/zh-CN
# poetry run python embed_docs.py --doc_base doc_repos/ocp-doc/zh-CN --component ocp
# poetry run python embed_docs.py --doc_base doc_repos/odc-doc/zh-CN --component odc
# poetry run python embed_docs.py --doc_base doc_repos/oms-doc/zh-CN --component oms
# poetry run python embed_docs.py --doc_base doc_repos/obd-doc/zh-CN --component obd
# poetry run python embed_docs.py --doc_base doc_repos/oceanbase-proxy-doc/zh-CN --component odp
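The six commands above differ only in the repository name and the --component value, so they can be generated from a small table. This sketch only prints the commands; the mapping mirrors the commands in this post, and the helper name is made up.

```python
import shlex

# Repository directory -> value passed to --component (None: flag omitted, as above).
COMPONENTS = {
    "oceanbase-doc": None,
    "ocp-doc": "ocp",
    "odc-doc": "odc",
    "oms-doc": "oms",
    "obd-doc": "obd",
    "oceanbase-proxy-doc": "odp",
}


def embed_command(repo, component):
    """Build the argv list for one embed_docs.py run."""
    cmd = ["poetry", "run", "python", "embed_docs.py",
           "--doc_base", f"doc_repos/{repo}/zh-CN"]
    if component:
        cmd += ["--component", component]
    return cmd


for repo, component in COMPONENTS.items():
    print(shlex.join(embed_command(repo, component)))
```

To actually run them, each list could be handed to `subprocess.run(cmd, check=True)` from the project root.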
Save the data that was loaded into the database to a file.
# poetry run python utils/extract.py --output_file ~/my-data.json
Load the preprocessed document data.
# poetry run python utils/load.py --source_file ~/my-data.json
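7. Verifying the database and data
Connect to the database configured in .env. There is a new table named corpus, whose embedding column has the data type VECTOR(1024); this is the vector type.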
mysql> desc corpus;
+----------------+---------------+------+-----+---------+-------+
| Field          | Type          | Null | Key | Default | Extra |
+----------------+---------------+------+-----+---------+-------+
| id             | varchar(4096) | NO   | PRI | NULL    |       |
| embedding      | VECTOR(1024)  | YES  |     | NULL    |       |
| document       | longtext      | YES  |     | NULL    |       |
| metadata       | json          | YES  |     | NULL    |       |
| component_code | int(11)       | NO   | PRI | NULL    |       |
+----------------+---------------+------+-----+---------+-------+
5 rows in set (0.04 sec)
mysql> select count(*) from corpus;
+----------+
| count(*) |
+----------+
|     6500 |
+----------+
1 row in set (0.05 sec)
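Each row in corpus stores one 1024-dimensional embedding, and answering a question comes down to finding the rows whose embeddings lie closest to the embedding of the question. The sketch below shows that idea as a brute-force scan over toy 3-dimensional vectors. It is a conceptual illustration only, not how OceanBase's vector index works internally, and the function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, rows, k=2):
    """rows: list of (id, embedding). Return the ids of the k closest embeddings."""
    ranked = sorted(rows, key=lambda row: l2_distance(query, row[1]))
    return [row_id for row_id, _ in ranked[:k]]


# Toy data standing in for (id, embedding) pairs from the corpus table.
rows = [
    ("a", [0.0, 0.0, 1.0]),
    ("b", [1.0, 0.0, 0.0]),
    ("c", [0.9, 0.1, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], rows))  # ['b', 'c']
```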
8. Starting the web UI
All the preparation is done, so now comes the exciting moment. Genshin Impact, launch!!! (Oops, wrong set.) Start the web UI:
# poetry run streamlit run --server.runOnSave false chat_ui.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://172.xxx.xxx.xxx:8501
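External URL: http://xxx.xxx.xxx.xxx:8501 # This is the URL you can open in a browser
As it happens, none of the IP addresses Streamlit serves on were externally reachable in my setup, so I edited .streamlit/config.toml to specify the externally reachable IP and port.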
[server]
port = 8501
enableCORS = false
[browser]
serverAddress = "192.168.56.110"
gatherUsageStats = false
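After a restart, Streamlit reports 192.168.56.110 as its address.
Try asking a question.
9. Loading the TiDB documentation into the database
First, I found the Chinese TiDB documentation on GitHub and saved the cloned files under doc_repos. The commands below assume the documentation tree sits at doc_repos/docs-cn-release-7.6, so adjust the path if your checkout is named docs-cn.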
git clone https://github.com/pingcap/docs-cn.git --depth 1
Convert the document headings.
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/tiup
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/tiproxy
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/tiflash
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/tidb-lightning
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/tidb-binlog
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/ticdc
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/templates
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/sync-diff-inspector
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/storage-engine
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/sql-statements
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/scripts
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/resources
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/releases
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/performance-schema
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/media
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/information-schema
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/functions-and-operators
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/faq
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/dm
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/develop
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/dashboard
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/config-templates
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/clinic
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/br
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/best-practices
poetry run python convert_headings.py doc_repos/docs-cn-release-7.6/benchmark
Generate the document embeddings and metadata, and store them in the database.
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/tiup
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/tiproxy
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/tiflash
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/tidb-lightning
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/tidb-binlog
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/ticdc
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/templates
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/sync-diff-inspector
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/storage-engine
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/sql-statements
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/scripts
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/resources
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/releases
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/performance-schema
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/media
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/information-schema
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/functions-and-operators
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/faq
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/dm
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/develop
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/dashboard
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/config-templates
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/clinic
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/br
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/best-practices
poetry run python embed_docs.py --doc_base doc_repos/docs-cn-release-7.6/benchmark
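Both command lists in this section follow one pattern: one run for the top-level directory, then one run per immediate subdirectory. A rough sketch of generating them instead of typing them by hand is shown below. The helper name is hypothetical, and the demo builds a throwaway directory tree so that it runs anywhere.

```python
import shlex
import tempfile
from pathlib import Path


def commands_for_tree(script, base, flag=None):
    """One command for base, then one per immediate (non-hidden) subdirectory."""
    base = Path(base)
    subdirs = sorted(p for p in base.iterdir()
                     if p.is_dir() and not p.name.startswith("."))
    prefix = ["poetry", "run", "python", script] + ([flag] if flag else [])
    return [prefix + [target.as_posix()] for target in [base] + subdirs]


# Demo on a temporary tree that mimics a few of the TiDB doc directories.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "docs-cn-release-7.6"
    for name in ("tiup", "tiflash", "br", ".git"):
        (root / name).mkdir(parents=True)
    for cmd in commands_for_tree("convert_headings.py", root):
        print(shlex.join(cmd))
    for cmd in commands_for_tree("embed_docs.py", root, flag="--doc_base"):
        print(shlex.join(cmd))
```

Against the real checkout, `commands_for_tree("embed_docs.py", "doc_repos/docs-cn-release-7.6", flag="--doc_base")` would produce the list, and each entry could then be passed to `subprocess.run`.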
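10. Results
On the web page, turn off "OceanBase-related questions only".
First, ask a question about TiDB; the answer drew on the following local documents.
Next, let the two go head to head.
Overall, the setup is fairly simple. Generating the embeddings needs ample compute resources, and the wait is long.
References:
[Creative Workshop] Building a "Black Myth: Wukong" smart game assistant with OceanBase 4.3.3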
https://github.com/oceanbase-devhub/ai-workshop-2024/blob/main/README_zh.md
https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000001579715
Building a RAG chatbot for free with OceanBase Cloud