RAG OVERVIEW
https://opendatascience.com/getting-started-with-multimodal-retrieval-augmented-generation/
What is RAG?
RAG is an architectural framework for LLM-powered applications that consists of two main steps:
- Retrieval. In this stage, the system retrieves from the provided knowledge base the context that is most similar to the user’s query. This step relies on the concept of embedding.
Definition
Embedding is the process of transforming data into numerical vectors; it is used to represent text, images, audio, or other complex data types in a multi-dimensional space that preserves the semantic similarity and relevance of the original data. This means, for example, that the embeddings of two semantically similar words or concepts will be mathematically close within that multi-dimensional space.
Embedding is essential for generative AI and RAG (retrieval-augmented generation) because it allows the models to access and compare external knowledge sources with the user input and generate more accurate and reliable outputs.
The process involves embedding the user’s query and retrieving the words or sentences whose vectors are mathematically closest to the query’s vector.
- Augmented Generation. Once the relevant set of words, sentences, or documents is retrieved, it becomes the context from which the LLM generates the response. It is “augmented” in the sense that the context is not simply copy-pasted and presented to the user as it is; rather, it is passed to the LLM as context and used to produce an AI-generated answer. A minimal sketch of this two-step flow is shown after this list.
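To make the two steps concrete, here is a minimal sketch of the retrieve-then-generate loop. The `embed` and `llm_generate` callables are placeholders (the article does not prescribe specific models); any text embedding model and any LLM client can be plugged in.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1 for semantically similar texts."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, chunks: list[str], embed, top_k: int = 3) -> list[str]:
    """Step 1 (Retrieval): return the chunks whose embeddings are closest to the query's."""
    q = np.asarray(embed(query))
    return sorted(chunks, key=lambda c: cosine(q, np.asarray(embed(c))), reverse=True)[:top_k]

def answer(query: str, chunks: list[str], embed, llm_generate) -> str:
    """Step 2 (Augmented Generation): pass the retrieved context to the LLM."""
    context = "\n\n".join(retrieve(query, chunks, embed))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```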
What problems of large models can RAG solve, and what can it not solve?
RAG (Retrieval-Augmented Generation) addresses several key problems in the large-model space, but there are also problems it cannot solve directly. Both sides are analyzed below.
What problems of large models can RAG solve?
- Improving answer accuracy and relevance: By combining a retrieval component with a generation component, RAG uses the retrieved knowledge to support answer generation, improving the accuracy and relevance of responses. This effectively mitigates the “hallucination” problem, where the model fabricates answers that are not grounded in the context.
- Strengthening the model’s generalization ability: By introducing external knowledge sources, RAG lets a large model handle a broader range of questions instead of relying only on the knowledge stored inside the model. This helps improve performance on unfamiliar or rare questions.
- Adapting to different domains and tasks: RAG allows relevant knowledge to be retrieved and injected dynamically according to the needs of different domains and tasks, improving the model’s adaptability and flexibility across scenarios.
- Easing long-input and long-output problems: When handling long inputs and long outputs, RAG can retrieve and integrate relevant knowledge, reducing the burden on the large model and improving processing efficiency and performance.
- Improving transparency and controllability: Through methods such as optimizing prompts and introducing metadata filtering, RAG can make the model more transparent and controllable, so users can better understand the model’s decision process and how it works.
What problems of large models can RAG not solve?
- Data quality and annotation: RAG can use external knowledge sources to improve answer accuracy and relevance, but only if that data is high quality and accurate. High-quality data is often hard to obtain and expensive to annotate, a problem RAG cannot solve directly.
- Model interpretability: Although RAG can improve interpretability through visualization tools and similar means, the complexity and black-box nature of large models still limit how interpretable they can be. RAG cannot fundamentally solve this problem.
- Privacy and security: Using RAG involves processing large amounts of user data, which can lead to privacy leaks and security issues. Techniques such as differential privacy can help protect user privacy, but these problems still require additional attention and effort.
- All types of hallucination: RAG can effectively address hallucinations caused by gaps in the model’s internal knowledge, but its ability to handle hallucinations caused by errors or biases in the training data is limited. Addressing those requires more thorough data cleaning and preprocessing.
- Extremely complex or novel problems: For some extremely complex or novel problems, the large model may still fail to give a satisfactory answer even with external knowledge sources. These cases require stronger reasoning and knowledge-integration capabilities.
In summary, RAG offers significant advantages and potential in the large-model space, but some problems remain beyond its direct reach. In practice, the appropriate methods and strategies should be chosen according to the specific requirements and scenario to optimize and improve large-model performance.
Discriminative vs. generative models in machine learning
https://zhuanlan.zhihu.com/p/74586507
The generation step in RAG builds on this kind of generative capability, augmented with retrieved context.
[Figure: comparison of a discriminative model and a generative model]
In the figure, the left side shows a discriminative model and the right side a generative model, and the difference is clear: a discriminative model looks for a decision boundary and uses that boundary to assign samples to their classes. A generative model is different: it learns the distribution of each class, so it carries more information and can also be used to generate new samples.
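As a small illustration of this distinction (an example of ours, not from the source), the sketch below contrasts a discriminative classifier, which fits a decision boundary for p(y|x), with a generative one, which models each class's distribution p(x|y) and whose fitted parameters can also be used to produce new samples; scikit-learn is used purely for convenience.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Discriminative: logistic regression learns a decision boundary w·x + b = 0.
discriminative = LogisticRegression().fit(X, y)

# Generative: Gaussian Naive Bayes learns each class's feature means and variances.
generative = GaussianNB().fit(X, y)

# The generative model's per-class parameters can be used to sample new points for class 0.
new_points = np.random.normal(loc=generative.theta_[0],
                              scale=np.sqrt(generative.var_[0]),
                              size=(5, 2))
print(discriminative.predict(new_points), generative.predict(new_points))
```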
Multimodal RAG
Introducing Multimodality
Multimodality refers to the integration of information from different modalities (e.g., text, images, audio). Introducing multimodality in RAG further enriches the model’s capabilities:
- Text-Image Fusion: Combining textual context with relevant images allows RAG to provide more comprehensive and context-aware responses. For instance, a medical diagnosis system could benefit from both textual descriptions and relevant medical images.
- Cross-Modal Retrieval: RAG can retrieve information from diverse sources, including text, images, and videos. Cross-modal retrieval enables a deeper understanding of complex topics by leveraging multiple types of data.
- Domain-Specific Knowledge: Multimodal RAG can integrate domain-specific information from various sources. For example, a travel recommendation system could consider both textual descriptions and user-generated photos to suggest personalized destinations.
Multimodal RAG (MM-RAG) follows the same pattern as the “monomodal” RAG described in the previous section, with the difference that we can interact with the model in multiple modalities, and the indexed knowledge base can itself be made of different data formats. The idea behind this pattern is to create a shared embedding space, where data in different modalities can be represented by vectors in the same multidimensional space. Here too, the idea is that similar data will be represented by vectors that are close to each other.
Once our knowledge base is properly embedded, we can store it in a multimodal VectorDB and use it to retrieve relevant context given the user’s query (which can be multimodal as well):
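As an illustration of that storage-and-retrieval step, the sketch below uses Chroma as the vector store (the article does not name a specific VectorDB) and a dummy stand-in for the shared-space embedder; in practice the embedder would be a multimodal model such as CLIP, discussed below.

```python
import hashlib
import chromadb

def embed_in_shared_space(item) -> list[float]:
    # Placeholder: a deterministic pseudo-vector derived from the item so the sketch runs
    # end to end; a real system would use a multimodal embedder (e.g., CLIP) here.
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return [b / 255.0 for b in digest]  # 32-dimensional dummy vector

client = chromadb.Client()
collection = client.create_collection(name="mm_knowledge_base")

# Each knowledge-base entry (text chunk or image caption) gets one vector in the shared space.
items = [
    {"id": "doc-1", "text": "Quarterly revenue grew 12% year over year."},
    {"id": "img-7", "text": "Bar chart of revenue by region (image)."},
]
collection.add(
    ids=[item["id"] for item in items],
    embeddings=[embed_in_shared_space(item["text"]) for item in items],
    documents=[item["text"] for item in items],
)

# The user's query (text or image) is embedded into the same space before searching.
results = collection.query(
    query_embeddings=[embed_in_shared_space("How did revenue change?")],
    n_results=2,
)
print(results["documents"][0])
```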
Building a MM-RAG application with CLIP and GPT-4-vision
Now let’s see how to practically build an MM-RAG application. The idea is to build a conversational application that can receive both text and images as input, as well as retrieve relevant information from a PDF that contains text and images. Hence, in this scenario, multimodality refers to “text + images” data. The goal is to create a shared embedding space where both images and text have their own vector representation.
Let’s break down the architecture into its main steps.
Embedding the multimodal knowledge base
To embed our images, there are two main options we can follow:
1. Using an LMM such as GPT-4-vision to first produce a rich caption of the image, then using a text embedding model such as text-embedding-ada-002 to embed that caption (see the sketch after this list).
2. Using a model that is capable of directly embedding images without intermediate steps. For example, we can use a Vision Transformer for this purpose.
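Below is a minimal sketch of option 1, assuming the OpenAI Python SDK (v1.x) and the GPT-4-vision / text-embedding-ada-002 models named above; the file name and prompt wording are purely illustrative.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(path: str) -> str:
    """Ask the LMM for a rich caption of the image (option 1, step one)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail for search indexing."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content

def embed_caption(caption: str) -> list[float]:
    """Embed the caption with a text embedding model (option 1, step two)."""
    result = client.embeddings.create(model="text-embedding-ada-002", input=caption)
    return result.data[0].embedding

vector = embed_caption(caption_image("report_figure.jpg"))  # hypothetical file name
```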
Definition
The Vision Transformer (ViT) emerged as an alternative to Convolutional Neural Networks (CNNs). Like LLMs, ViT builds on the Transformer architecture (in ViT’s case, the encoder). In ViT, the central mechanism is attention, which enables the model to selectively focus on specific parts of the input sequence during predictions. By teaching the model to attend to relevant input data while disregarding irrelevant portions, ViT enhances its ability to tackle tasks effectively.
What sets attention in Transformers apart is its departure from traditional techniques like recurrence (commonly used in Recurrent Neural Networks or RNNs) and convolutions. Unlike previous models, the Transformer relies solely on attention to compute representations of both input and output. This unique approach allows the Transformer to capture a broader range of relationships between words in a sentence, resulting in a more nuanced representation of the input.
An example of this type of model is CLIP, a ViT developed by OpenAI that can learn visual concepts from natural language supervision. It can perform various image classification tasks by simply providing the names of the visual categories in natural language, without any fine-tuning or labeled data. CLIP achieves this by learning a joint embedding space of images and texts, where images and texts that are semantically related are close to each other. The model was trained on a large-scale dataset of (image, text) pairs collected from the internet.
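The snippet below shows one way to obtain CLIP embeddings for an image and some candidate texts in a shared space, using the public Hugging Face checkpoint openai/clip-vit-base-patch32 (the article does not mandate a particular checkpoint, and the image file name is illustrative).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product_photo.jpg")            # hypothetical image file
texts = ["a red summer dress", "a leather office chair"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

image_emb = outputs.image_embeds   # (1, 512) vector in the shared space
text_emb = outputs.text_embeds     # (2, 512) vectors in the same space

# Semantically related image/text pairs end up close to each other (high cosine similarity).
print(torch.nn.functional.cosine_similarity(image_emb, text_emb))
```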
Once you’ve got your embeddings, you will need to store them somewhere, likely a Vector DB.
Retrieving relevant context
For a fully multimodal experience, we want our application to be able not only to retrieve both text and images but also to receive both as input. This means that users can enjoy a multimodal experience, explaining concepts with both written and visual inputs.
To achieve this result, we will need an LMM to process the user’s input. The idea is that, given a user’s input (text + image), the LMM reasons over it and produces a description of the image that is consistent with the whole context provided by the user. Then, once the text and the image descriptions are obtained, a text embedding model creates the vectors that are compared with those of the knowledge base.
Once the relevant context (text + images) has been gathered from the knowledge base, it is used as input for the LMM to reason over in order to produce the generated answer. Note that the generated answer will contain references to both text and image sources.
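A hedged sketch of this flow is below. It reuses the pieces from the earlier sketches (`caption_image`, `embed_caption`, a Chroma `collection`) plus a placeholder `llm_generate` for the final LMM call; none of these names come from the article.

```python
def multimodal_rag_answer(user_text, user_image_paths, collection,
                          caption_image, embed_caption, llm_generate):
    # 1. Let the LMM describe the user's images in line with the textual context.
    descriptions = [caption_image(p) for p in user_image_paths]
    query = "\n".join([user_text, *descriptions])

    # 2. Embed the combined query and retrieve relevant text + image chunks.
    hits = collection.query(query_embeddings=[embed_caption(query)], n_results=5)
    context = "\n\n".join(hits["documents"][0])

    # 3. Ask the LMM to reason over the retrieved context and produce the answer,
    #    citing both text and image sources.
    prompt = (
        "Use the context below (text passages and image descriptions) to answer, "
        "and cite which sources you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_text}"
    )
    return llm_generate(prompt)
```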
Conclusion
In the ever-expanding landscape of artificial intelligence, Multimodal Retrieval-Augmented Generation emerges as a beacon of promise. This fusion of Large Multimodal Models and external multimodal knowledge sources opens up exciting avenues for research, applications, and societal impact.
In the next few years, we anticipate MM-RAG to evolve into an indispensable tool for content creation, education, and problem-solving, just like “only-text” RAG has started to become indispensable over the last year. In other words, it will enable more effective communication between AI systems and humans.
Use Cases
https://raga.ai/blogs/rag-use-cases-impact
Practical Use Cases of RAG (Retrieval-Augmented Generation)
Document Question Answering Systems: Enhancing Access to Proprietary Documents
Imagine having an enormous base of proprietary documents and needing precise details from them quickly. A document question answering system powered by RAG can transform this process. By asking queries in natural language, you can promptly retrieve specific answers from your documents, saving time and improving efficiency.
Conversational Agents: Customizing LLMs to Specific Guidelines or Manuals
Conversational agents become even more effective when customized to specific guidelines or manuals. With RAG, you can ground language models in concrete conventions and industry standards. This ensures that the AI responds accurately while complying with specific requirements.
Real-time Event Commentary with Live Data and LLMs
For live events, real-time commentary is critical. RAG can connect language models to live data feeds, letting you produce up-to-the-minute commentary that enriches the viewing experience. Whether it’s a sports game, a conference, or a breaking news story, RAG keeps your audience engaged with the latest updates.
Content Generation: Personalizing Content and Ensuring Contextual Relevance
Generating customized content that resonates with your audience can be challenging. RAG helps by using real-time data to create content that is not only relevant but also increasingly personalized. This ensures that your readers find your content appealing and valuable, boosting its effectiveness.
Personalized Recommendation: Evolving Content Recommendations through LLMs
RAG can transform how you provide personalized recommendations. By combining retrieval mechanisms with language models, you can offer recommendations that evolve based on user interactions and preferences. This dynamic approach ensures that your suggestions remain relevant and personalized over time.
Virtual Assistants: Creating More Personalized User Experiences
Virtual assistants equipped with RAG capabilities can provide increasingly personalized user experiences. They can retrieve relevant details and generate answers tailored to the user’s specific needs and context. This makes interactions more relevant and improves user satisfaction.
Customer Support Chatbots: Providing Up-to-date and Accurate Responses
Customer support chatbots need to deliver precise and prompt responses. With RAG, your chatbots can access the latest information, ensuring they give reliable and up-to-date answers. This raises customer service standards and reduces response times.
Business Intelligence and Analysis: Delivering Domain-specific, Relevant Insights
In the context of Business Intelligence, RAG can be a game-changer. By delivering domain-specific insights, RAG enables you to make informed decisions based on the latest and most relevant information. This strengthens your analytical capabilities and helps you stay ahead in your industry.
Healthcare Information Systems: Accessing Medical Research and Patient Data for Better Care
Healthcare professionals can benefit from RAG by accessing medical research and patient records efficiently. RAG allows for quick retrieval of relevant details, supporting better diagnosis and treatment planning and ultimately improving patient care.
Legal Research and Compliance: Assisting in the Analysis of Legal Documents and Regulatory Compliance
Legal professionals can use RAG to streamline the review of legal documents and ensure regulatory compliance. By retrieving and generating relevant legal information, RAG supports thorough research and compliance checks, making legal processes more efficient and accurate.
And that's not all—RAG's utility is expanding into even more advanced, specialized areas. Check out some next-level use cases.
Advanced RAG Use Cases
Gaining Insights from Sales Rep Feedback
Imagine turning your sales representatives’ remarks into a gold mine of actionable insights. You can use Retrieval-Augmented Generation (RAG) to analyze sales feedback. By automatically categorizing and aggregating responses, you can pinpoint trends, common problems, and opportunities.
This lets you proactively address concerns, tailor your approach to customer needs, and ultimately drive better customer success outcomes. It’s like having a 24/7 analyst that turns every piece of feedback into strategic insight.
Medical Insights Miner: Enhancing Research with Real-Time PubMed Data
Stay ahead in medical research by tapping into real-time information from PubMed with RAG. This lets you continuously monitor and extract relevant research findings, keeping you up to date with the latest developments.
By incorporating these insights into your research process, you can improve the quality and timeliness of your studies. This approach accelerates discovery, helps pinpoint emerging trends, and keeps your work at the cutting edge of medical science.
L1/L2 Customer Support Assistant: Improving Customer Support Experiences
Elevate your customer support experience by using RAG to assist your L1 and L2 support teams. This tool can rapidly retrieve and present relevant solutions from a broad knowledge base, ensuring that your support agents always have the right information at their fingertips. By doing so, you can reduce response times, increase resolution rates, and improve overall customer satisfaction. It’s like giving your support team an assistant that never sleeps and always has the answers.
Compliance in Customer Contact Centers: Ensuring Behavior Analysis in Regulated Industries
Ensure your customer contact centers meet regulatory requirements using RAG. This tool can analyze interactions for compliance, flagging any deviation from required protocols.
By providing real-time feedback and recommendations, you can address problems immediately, ensuring that your operations remain within the bounds of industry regulations. This proactive approach not only helps maintain compliance but also builds trust with your customers and stakeholders.
Employee Knowledge Training Assessment: Enhancing Training Effectiveness Across Roles
Revolutionize your employee training programs with RAG. By analyzing training materials and employee responses, you can pinpoint knowledge gaps and areas for improvement.
This tool helps tailor training sessions to specific needs, ensuring that employees across all roles receive the most effective and relevant training. By continuously evaluating and refining your training programs, you can improve productivity, build expertise, and ensure that your employees are always prepared to meet new challenges.
Global SOP Standardization: Analyzing and Improving Standard Operating Procedures
Streamline your worldwide operations by standardizing your Standard Operating Procedures (SOPs) with RAG. This tool can analyze SOPs from different regions, identify inconsistencies, and recommend improvements.
By ensuring that all your SOPs are aligned and up to date, you can improve operational efficiency, reduce errors, and ensure consistent quality across your organization. It’s like having a universal process auditor that ensures every procedure is up to par.
Operations Support Assistant in Manufacturing: Assisting Technical Productivity with Complex Machinery Maintenance
Improve your manufacturing operations with a RAG-powered support assistant. This tool can help maintain complex machinery by providing real-time troubleshooting guidance and maintenance information.
By rapidly retrieving and presenting relevant technical information, you can reduce downtime, improve productivity, and extend the lifespan of your equipment. This approach ensures that your technical workforce always has the details they need to keep your operations running smoothly.
Of course, implementing RAG comes with its own set of best practices and considerations, and we'll explore those next.
https://cloud.baidu.com/article/3326428
Practical application example
Suppose we have an e-commerce platform and a user wants to buy clothing in a particular style. Traditional search may only allow fuzzy matching by keyword, making it hard to meet the user’s needs precisely. With multimodal RAG, the user can upload an image as the query, and the system can automatically retrieve products whose style is similar to the clothing in the image and generate detailed product descriptions and reasons for the recommendation.
This kind of application not only improves the user experience but also greatly increases search accuracy and efficiency.
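A rough sketch of that flow (assuming the product images were indexed with CLIP embeddings in a vector collection, here called `product_index`, and that `llm_generate` wraps a text-generation model) could look like this:

```python
import torch
from PIL import Image

def recommend_from_photo(photo_path, clip_model, clip_processor, product_index, llm_generate):
    # Embed the shopper's uploaded photo into the same space as the indexed product images.
    image = Image.open(photo_path)
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_vec = clip_model.get_image_features(**inputs)[0].tolist()

    # Retrieve products whose image embeddings are closest to the query photo.
    hits = product_index.query(query_embeddings=[image_vec], n_results=3)
    products = hits["documents"][0]

    # Generate descriptions and recommendation rationales grounded in the retrieved products.
    prompt = (
        "A shopper uploaded a photo of a garment. For each similar product below, "
        "write a short description and explain why it matches the photo.\n\n"
        + "\n".join(products)
    )
    return llm_generate(prompt)
```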
https://xie.infoq.cn/article/ed917a90d564e2c06dd120247
Multimodal RAG
Building on multimodal search, OpenSearch combines text-generation large models to offer multimodal RAG capabilities for scenarios such as enterprise knowledge bases and e-commerce shopping guides. After users upload their business data, OpenSearch can not only intelligently understand the information contained in images, but also use it as a reference to generate the corresponding conversational results, providing RAG services backed by enterprise knowledge bases and e-commerce product catalogs.