talk-to-your-data
https://github.com/fanqingsong/talk-to-your-data
This project helps you build a talk-to-your-data chatbot using the OpenAI LLM, LangChain, and Streamlit. Basically:
- Clone the project
git clone https://github.com/emmakodes/talk-to-your-data.git
cd talk-to-your-data
- Create a new virtual environment called `.venv`
python -m venv .venv
- Activate the virtual environment (the command below is for Windows; on Linux/macOS use `source .venv/bin/activate`)
.venv\Scripts\activate
- Install the project requirements
pip install -r requirements.txt
- Add your document to the `mydocument` directory and delete `animalsinresearch.pdf`, which already exists inside the `mydocument` directory. `animalsinresearch.pdf` is my own document, so remove it unless you want to test with my data.
- Delete the files in the `vector_index` directory so it can hold the vectors of your own document.
- Start the app using the following command:
streamlit run app.py
Results
- Questions unrelated to the document are answered from the model's own knowledge.
- Questions covered in the document are answered from the document.
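The retrieval behavior above can be sketched with toy vectors standing in for real embeddings. The chunks and numbers below are made up for illustration; a real app would get the vectors from an embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for document chunks (real ones come from an embedding model).
chunks = {
    "Animals are used in biomedical research.": [0.9, 0.1, 0.0],
    "Streamlit renders the chat interface.":    [0.1, 0.9, 0.0],
}

def retrieve(query_vec, k=1):
    """Return the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)
    return ranked[:k]

# A query vector close to the first chunk retrieves that chunk;
# its text is then passed to the LLM as context for the answer.
print(retrieve([0.8, 0.2, 0.0]))
```

If no stored chunk is similar enough to the query, there is nothing relevant to pass as context, which is why off-document questions fall back to the model's own knowledge.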
Model reasoning process:
References
https://lmstudio.ai/docs/text-embeddings
https://dev.to/emmakodes_/how-to-build-a-talk-to-your-data-chatbot-using-openai-llm-langchain-and-streamlit-27po
CustomAPIEmbeddings: calling a local embedding model
https://lmstudio.ai/docs/text-embeddings
https://api.python.langchain.com/en/latest/embeddings/langchain_core.embeddings.embeddings.Embeddings.html#langchain_core.embeddings.embeddings.Embeddings
https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
https://python.langchain.com/v0.2/docs/integrations/text_embedding/ollama/#usage
https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/
https://api.python.langchain.com/en/latest/_modules/langchain_community/embeddings/openai.html#OpenAIEmbeddings.embed_query
https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html
Inspired by:
https://github.com/langchain-ai/langchain/discussions/19467
https://stackoverflow.com/questions/77217193/langchain-how-to-use-a-custom-embedding-model-locally
https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.localai.LocalAIEmbeddings.html
https://python.langchain.com/v0.2/docs/integrations/text_embedding/localai/
Implementation
```python
import pprint
from typing import List

import requests
from langchain_core.embeddings import Embeddings


class CustomAPIEmbeddings(Embeddings):
    def __init__(self, model_name: str, api_url: str):
        self.model_name = model_name
        self.api_url = api_url

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        response = requests.post(
            self.api_url,
            headers={'Authorization': 'Bearer your_token_here'},
            json={
                "model": self.model_name,
                "input": texts,
            },
        )
        pprint.pprint(response.json())
        # Adjust this based on the response format of your API.
        data_list = response.json()['data']
        return [one['embedding'] for one in data_list]

    def embed_query(self, text: str) -> List[float]:
        """Embed a single query string by delegating to embed_documents.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        return self.embed_documents([text])[0]


if __name__ == '__main__':
    embeddings = CustomAPIEmbeddings(
        model_name="Xenova/text-embedding-ada-002",
        api_url="http://192.168.0.108:1234/v1/embeddings",
    )
    query = "What is the cultural heritage of India?"
    query_embedding = embeddings.embed_query(query)
    pprint.pprint(query_embedding)
```
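LM Studio's `/v1/embeddings` endpoint returns an OpenAI-compatible payload, so the `['data'][...]['embedding']` extraction in `embed_documents` can be exercised offline against a stubbed response. The payload below mirrors that shape; the embedding values are made up:

```python
# A stub response in the OpenAI-compatible shape returned by
# LM Studio's /v1/embeddings endpoint (values are made up).
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
    ],
    "model": "Xenova/text-embedding-ada-002",
}

def parse_embeddings(payload):
    """The same extraction CustomAPIEmbeddings.embed_documents performs:
    one embedding list per input text, in input order."""
    return [item["embedding"] for item in payload["data"]]

print(parse_embeddings(sample))
```

Keeping the parsing step this small is what makes the class easy to adapt: if your local server wraps the data differently, only this extraction needs adjusting.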
The embedding model and the LLM are not the same model.
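One way to keep that separation explicit is to configure the two as distinct (endpoint, model) pairs: the embedding model only vectorizes text, while the chat LLM only generates answers. The model names and port below are hypothetical placeholders, not values from the project:

```python
# The embedding model and the chat LLM are configured independently.
# Both names and the base URL are hypothetical placeholders.
EMBEDDING_MODEL = "Xenova/text-embedding-ada-002"  # vectorizes text only
CHAT_MODEL = "local-chat-llm"                      # generates answers only

def endpoints(base_url: str = "http://localhost:1234/v1") -> dict:
    """Return the two separate (url, model) pairs a talk-to-your-data app calls."""
    return {
        "embeddings": (base_url + "/embeddings", EMBEDDING_MODEL),
        "chat": (base_url + "/chat/completions", CHAT_MODEL),
    }

print(endpoints())
```

Because the two endpoints are independent, you can swap the embedding model without touching the chat model, as long as you re-embed the documents in `vector_index` so stored and query vectors come from the same model.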
https://www.aneasystone.com/archives/2023/07/doc-qa-using-embedding.html
https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/
https://openai.com/index/introducing-text-and-code-embeddings/