talk-to-your-data

Posted by lightsong on 2024-07-29


https://github.com/fanqingsong/talk-to-your-data

talk-to-your-data project

This project helps you build a talk-to-your-data chatbot using the OpenAI LLM, LangChain, and Streamlit. Basically:

  • You clone the project

    git clone https://github.com/emmakodes/talk-to-your-data.git

  • cd talk-to-your-data

  • Create a new virtual environment called .venv

    python -m venv .venv

  • Activate the virtual environment

.venv\Scripts\activate

  • Install the project requirements

pip install -r requirements.txt

  • Add your document to the mydocument directory and delete animalsinresearch.pdf, which already exists there. animalsinresearch.pdf is my own document, so remove it unless you want to test with my data.

  • Delete the files in the vector_index directory so that it can hold the vectors of your own document.

  • Start the app using the following command:

streamlit run app.py
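
For orientation, here is a rough sketch of the indexing and question-answering flow these steps imply. It is not the project's actual app.py: the mydocument and vector_index directory names come from the steps above, while the loader, splitter, chain, and model choices are assumptions.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings()

# Build the index from the PDF placed in mydocument/ and persist it to
# vector_index/ -- this is why stale index files must be deleted first.
docs = PyPDFLoader("mydocument/animalsinresearch.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
FAISS.from_documents(chunks, embeddings).save_local("vector_index")

# Load the persisted vectors and answer questions against them.
vectorstore = FAISS.load_local("vector_index", embeddings, allow_dangerous_deserialization=True)
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(),
)
print(qa.invoke({"query": "What does the document say?"})["result"])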

Results

Questions unrelated to the document

The model answers from its own knowledge.

Questions covered in the document

The model answers based on the document.
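
This split between document-grounded and general answers usually comes down to how the chain's prompt treats the retrieved context. The prompt below is purely hypothetical (the project's actual prompt is not shown here), but it illustrates one way to get the behaviour above:

from langchain_core.prompts import PromptTemplate

# Hypothetical prompt, not the project's actual one: retrieved chunks are
# stuffed into {context}; when nothing relevant is retrieved, the model is
# allowed to fall back on its own knowledge.
prompt = PromptTemplate.from_template(
    "Use the following context to answer the question. "
    "If the context is not relevant, answer from your own knowledge.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
print(prompt.format(context="(retrieved chunks go here)",
                    question="What is the cultural heritage of India?"))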

The model's reasoning process:

References

https://lmstudio.ai/docs/text-embeddings

https://dev.to/emmakodes_/how-to-build-a-talk-to-your-data-chatbot-using-openai-llm-langchain-and-streamlit-27po

CustomAPIEmbeddings: calling a local text-embedding model

https://lmstudio.ai/docs/text-embeddings

https://api.python.langchain.com/en/latest/embeddings/langchain_core.embeddings.embeddings.Embeddings.html#langchain_core.embeddings.embeddings.Embeddings

https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html

https://python.langchain.com/v0.2/docs/integrations/text_embedding/ollama/#usage

https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/

https://api.python.langchain.com/en/latest/_modules/langchain_community/embeddings/openai.html#OpenAIEmbeddings.embed_query

https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html

Inspired by:

https://github.com/langchain-ai/langchain/discussions/19467

https://stackoverflow.com/questions/77217193/langchain-how-to-use-a-custom-embedding-model-locally

https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.localai.LocalAIEmbeddings.html

https://python.langchain.com/v0.2/docs/integrations/text_embedding/localai/

Implementation

import pprint

import requests
from typing import List
from langchain_core.embeddings import Embeddings


class CustomAPIEmbeddings(Embeddings):
    """LangChain Embeddings wrapper that posts texts to an OpenAI-compatible
    /v1/embeddings endpoint (for example, LM Studio's local server)."""

    def __init__(self, model_name: str, api_url: str):
        self.model_name = model_name
        self.api_url = api_url

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a batch of texts by POSTing them to the embeddings API."""
        response = requests.post(
            self.api_url,
            headers={'Authorization': 'Bearer your_token_here'},
            json={
                "model": self.model_name,
                "input": texts,
            },
        )

        pprint.pprint(response.json())

        data_list = response.json()['data']  # Adjust this based on the response format of your API
        ans = [one['embedding'] for one in data_list]

        return ans

    def embed_query(self, text: str) -> List[float]:
        """Embed a single query text by delegating to embed_documents.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        return self.embed_documents([text])[0]
        #
        # response = requests.post(
        #     self.api_url,
        #     headers={'Authorization': 'Bearer your_token_here'},
        #     json={
        #         "model": self.model_name,
        #         "input": text,
        #     },
        # )
        #
        # ret = response.json()  # Adjust this based on the response format of your API
        # pprint.pprint(ret)
        #
        # return ret['data'][0]['embedding']


if __name__ == '__main__':
    embeddings = CustomAPIEmbeddings(
        model_name="Xenova/text-embedding-ada-002",
        api_url="http://192.168.0.108:1234/v1/embeddings",
        # api_key="sss"
    )

    query = "What is the cultural heritage of India?"
    query_embedding = embeddings.embed_query(query)
    pprint.pprint(query_embedding)

The embedding model and the LLM are not the same model; a sketch of how the two are wired together follows the links below.

https://www.aneasystone.com/archives/2023/07/doc-qa-using-embedding.html

https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/

https://openai.com/index/introducing-text-and-code-embeddings/
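
To make the split between the two models concrete, here is a minimal sketch that feeds the CustomAPIEmbeddings class above into a FAISS store while a separate chat model answers questions. It assumes LM Studio serves both models behind its OpenAI-compatible API at the address used earlier; the chat model identifier and the sample text are illustrative.

from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI

# Embedding model: turns text into vectors (CustomAPIEmbeddings is defined above).
embeddings = CustomAPIEmbeddings(
    model_name="Xenova/text-embedding-ada-002",
    api_url="http://192.168.0.108:1234/v1/embeddings",
)

# Chat LLM: a different model behind the same local OpenAI-compatible server.
llm = ChatOpenAI(
    base_url="http://192.168.0.108:1234/v1",
    api_key="lm-studio",       # local servers typically accept any placeholder key
    model="local-chat-model",  # hypothetical model identifier
)

store = FAISS.from_texts(
    ["India has a rich cultural heritage spanning thousands of years."],
    embedding=embeddings,
)
docs = store.similarity_search("What is the cultural heritage of India?", k=1)
print(llm.invoke(f"Answer using this context: {docs[0].page_content}").content)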