Advanced RAG: Improving RAG Performance
The ultimate guide to optimising a RAG pipeline from zero to advanced: solving the core challenges of Retrieval-Augmented Generation
In my last blog, I covered RAG extensively and how it’s implemented with LlamaIndex. However, RAG often encounters numerous challenges when answering questions. In this blog, I’ll address these challenges and, more importantly, we will delve into solutions that improve RAG performance and make it production-ready.
I’ll discuss various optimization techniques sourced from different research papers. The majority of these techniques are based on a research paper I particularly enjoyed, titled “Retrieval-Augmented Generation for Large Language Models: A Survey,” which covers most of the recent optimization methods.
Breakdown of RAG workflow
First, we will break down the RAG workflow into three parts to enhance our understanding of RAG, and then optimise each part to improve overall performance:
Pre-Retrieval
In the pre-retrieval step, new data outside of the LLM’s original training dataset (also called external data) is prepared, split into chunks, and then indexed using an embedding model that converts the chunks into numerical representations and stores them in a vector database. This process creates a knowledge library that the LLM can understand.
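To make this concrete, here is a minimal indexing sketch. The file name, the sentence-transformers model, and the naive fixed-size chunking are placeholders, and a plain Python list stands in for the vector database.

```python
from sentence_transformers import SentenceTransformer

# External data the LLM was not trained on; in practice this comes from your documents.
external_data = open("my_document.txt").read()  # placeholder file name

# Naive fixed-size chunking (500 characters); real pipelines use smarter splitters.
chunks = [external_data[i:i + 500] for i in range(0, len(external_data), 500)]

# Convert each chunk into a numerical representation (embedding).
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
chunk_embeddings = embedder.encode(chunks)

# A real system would store these in a vector database (FAISS, Chroma, etc.);
# here the "knowledge library" is just a list of (chunk, embedding) pairs.
vector_store = list(zip(chunks, chunk_embeddings))
```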
Retrieval
In the retrieval step, the most important part of the workflow, the user query is converted into a vector representation (an embedding), and the most relevant chunks are found in the vector database using cosine similarity. The goal is to surface the most highly relevant document chunks from the vector store.
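Continuing that sketch, retrieval boils down to embedding the query and ranking chunks by cosine similarity; `vector_store` and `embedder` are the illustrative objects from the previous snippet.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, vector_store, embedder, top_k=3):
    """Return the top_k chunks most similar to the query."""
    query_embedding = embedder.encode([query])[0]
    scored = [
        (cosine_similarity(query_embedding, emb), chunk)
        for chunk, emb in vector_store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```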
Post-Retrieval
Next, the RAG model augments the user input (or prompt) by adding the relevant retrieved data as context (query + context). This step uses prompt engineering techniques to communicate effectively with the LLM. The augmented prompt allows the large language model to generate an accurate answer to the user’s query using the given context.
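A minimal augmentation step might look like this; the prompt wording is just one reasonable template, not a prescribed format.

```python
def build_augmented_prompt(query, retrieved_chunks):
    """Combine the user query with retrieved context before calling the LLM."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
```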
Goal
We aim to enhance each component of the RAG workflow by applying various techniques to different parts.
Pre-Retrieval Optimisation
Pre-retrieval techniques include improving the quality of indexed data and chunk optimisation. This step can also be described as enhancing semantic representations.
Enhancing data granularity
Improve the quality of your data
‘Garbage in, garbage out’
Data cleaning plays a crucial role in the RAG framework. The performance of your RAG solution depends on how well the data is cleaned and organized. Remove unnecessary information such as special characters, unwanted metadata, or irrelevant text, as shown in the sketch after the list below.
- Remove irrelevant text/documents: Eliminate any documents that the LLM doesn’t need in order to answer questions. Also remove noisy data; this includes special characters, stop words (common words like “the” and “a”), and HTML tags.
- Identify and correct errors: This includes spelling mistakes, typos, and grammatical errors.
- Replacing pronouns with names in split chunks can enhance semantic significance during retrieval.
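As a rough illustration, a cleaning pass built on plain regular expressions might look like the sketch below; which rules you actually apply (stop-word removal in particular) depends on your embedding model and domain.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # illustrative subset; use a full list in practice

def clean_text(text, remove_stop_words=False):
    """Strip HTML tags, special characters, and extra whitespace from raw text."""
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML tags
    text = re.sub(r"[^\w\s.,?!-]", " ", text)      # drop special characters
    text = re.sub(r"\s+", " ", text).strip()       # normalise whitespace
    if remove_stop_words:
        text = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    return text
```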
Adding Metadata
Add metadata, such as concept and level tags, to improve the quality of indexed data.
Adding metadata information involves integrating referenced metadata, such as dates and purposes, into chunks for filtering purposes, and incorporating metadata like chapters and subsections of references to improve retrieval efficiency.
Here are some scenarios where metadata is useful:
- If you search for items and recency is a criterion, you can sort on a date metadata field
- If you search over scientific papers and you know in advance that the information you’re looking for is always located in a specific section, say the experiment section for example, you can add the article section as metadata for each chunk and filter on it to match experiments only
Metadata is useful because it brings an additional layer of structured search on top of vector search.
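Here is a small sketch of that structured layer, assuming each stored chunk carries a metadata dictionary: filter on metadata first, then run the vector search over the survivors.

```python
def filter_by_metadata(vector_store, **criteria):
    """Keep only chunks whose metadata matches every given key/value pair."""
    return [
        item for item in vector_store
        if all(item["metadata"].get(k) == v for k, v in criteria.items())
    ]

# Illustrative store: each entry carries its embedding plus metadata for filtering.
vector_store = [
    {"chunk": "...", "embedding": [...], "metadata": {"section": "experiments", "year": 2023}},
    {"chunk": "...", "embedding": [...], "metadata": {"section": "introduction", "year": 2021}},
]

# Restrict the subsequent vector search to experiment sections only.
candidates = filter_by_metadata(vector_store, section="experiments")
```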
Optimizing index structures
- Knowledge graph or graph neural network indexing: incorporates information from the graph structure to capture relevant context by leveraging relationships between nodes in a graph data index.
- Vector indexing
Chunking Optimisation
Choosing the right chunk_size is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:
Relevance and Granularity
A small chunk_size, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the similarity_top_k setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available.
Response Generation Time
As the chunk_size increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system.
Challenges
If your chunk is too small, it may not include all the information the LLM needs to answer the user’s query; if the chunk is too big, it may contain too much irrelevant information that confuses the LLM, or it may simply be too big to fit into the context window.
Task Specific Chunking
The optimal chunk length, and how much overlap each chunk should have, needs to be determined based on the downstream task.
High-level tasks like summarization require a bigger chunk size, while low-level tasks like coding require smaller chunks.
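As a rough sketch using LangChain’s RecursiveCharacterTextSplitter (the import path varies across versions, and the sizes below are only illustrative), the same document can be split differently depending on the task.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document = "...your source text..."  # placeholder

# Larger chunks with generous overlap for a summarisation-style task.
summary_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)

# Smaller, tighter chunks for fine-grained tasks such as code Q&A.
code_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)

summary_chunks = summary_splitter.split_text(document)
code_chunks = code_splitter.split_text(document)
```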
Chunking Techniques
Small2big or Parent Document Retrieval
The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks, then looks up the parent IDs for those chunks and returns the larger parent documents to the LLM.
It utilizes small text blocks during the initial search phase and subsequently provides larger related text blocks to the language model for processing.
Recursive retrieval involves acquiring smaller chunks during the initial retrieval phase to capture key semantic meanings. Subsequently, larger chunks containing more contextual information are provided to the LLM in later stages of the process. This two-step retrieval method helps to strike a balance between efficiency and the delivery of contextually rich responses.
Steps:
- The process involves breaking the original large document down into smaller, more manageable units called child documents, and larger chunks called parent documents.
- Embeddings are created for each child document; these are richer and more detailed than a single embedding of the entire parent chunk, which helps the framework identify the most relevant child document containing information related to the user’s query.
- Once a match with a child document is established, the entire parent document associated with that child is retrieved. Ultimately, it is the parent chunks that are returned.
- This retrieval of the parent document is significant because it provides a broader context for understanding and responding to the user’s query. Instead of relying solely on the content of the child document, the framework now has access to the entire parent document.
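Below is a hedged sketch using LangChain’s ParentDocumentRetriever; the import paths, the Chroma vector store, and the OpenAI embeddings are assumptions that you can swap for your own stack.

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

docs = [Document(page_content="...your long document text...")]  # placeholder content

# Child chunks are embedded and searched; parent chunks are what gets returned.
retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="children", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),
)
retriever.add_documents(docs)

parent_docs = retriever.invoke("What does the document say about X?")
```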
Sentence Window Retrieval
This chunking technique is very similar to the one above. The core idea behind Sentence Window Retrieval is to selectively fetch context from a custom knowledge base based on the query and then utilize a broader version of this context for more robust text generation.
This process involves embedding a limited set of sentences for retrieval, with the additional context surrounding these sentences, referred to as “window context,” stored separately and linked to them. Once the top similar sentences are identified, this context is reintegrated just before these sentences are sent to the Large Language Model (LLM) for generation, thereby enriching overall contextual comprehension.
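Conceptually, the mechanics look something like the simplified stand-alone sketch below (LlamaIndex packages this as a sentence-window node parser plus a metadata-replacement post-processor; the code here is only an illustration of the idea).

```python
def build_sentence_windows(sentences, window_size=2):
    """Index single sentences, but keep the surrounding window as metadata."""
    records = []
    for i, sentence in enumerate(sentences):
        window = sentences[max(0, i - window_size): i + window_size + 1]
        records.append({"sentence": sentence, "window": " ".join(window)})
    return records

# At query time: embed and match on record["sentence"],
# but send record["window"] to the LLM for richer context.
```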
Retrieval Optimisation
This is the most important part of the RAG workflow: retrieving documents from the vector store based on the user query. This step can also be described as aligning queries and documents.
Query Rewriting
Query rewriting is a fundamental approach for aligning the semantics of a query and a document.
In this process, we leverage the capabilities of a large language model (LLM) to rephrase the user’s query and give it another shot. It’s important to note that two questions that might look the same to a human may not appear similar in the embedding space.
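A minimal rewrite step might look like the sketch below, where `call_llm` is a hypothetical placeholder for whatever LLM client you use.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your LLM client call."""
    raise NotImplementedError

def rewrite_query(user_query: str) -> str:
    """Ask the LLM to rephrase a query so it aligns better with indexed documents."""
    prompt = (
        "Rewrite the following search query to be clearer and more specific, "
        "keeping its original intent:\n\n"
        f"Query: {user_query}\n"
        "Rewritten query:"
    )
    return call_llm(prompt).strip()
```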
MultiQuery Retrievers
The Multi-query Retrieval method utilizes LLMs to generate multiple queries from different perspectives for a given user input query, advantageous for addressing complex problems with multiple sub-problems.
For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
By generating multiple perspectives on the same question, the MultiQuery Retriever might be able to overcome some of the limitations of the distance-based retrieval and get a richer set of results.
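The mechanics can be sketched as follows, reusing the hypothetical `call_llm` and `retrieve` helpers from the earlier snippets (LangChain packages this pattern as MultiQueryRetriever).

```python
def multi_query_retrieve(user_query, vector_store, embedder, n_queries=3, top_k=3):
    """Generate query variants, retrieve for each, and take the unique union."""
    prompt = (
        f"Generate {n_queries} different rephrasings of this question, one per line:\n"
        f"{user_query}"
    )
    variants = [user_query] + call_llm(prompt).splitlines()

    seen, union = set(), []
    for query in variants:
        for chunk in retrieve(query, vector_store, embedder, top_k=top_k):
            if chunk not in seen:  # deduplicate across queries
                seen.add(chunk)
                union.append(chunk)
    return union
```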
HyDE or Query2doc
Both HyDE and Query2doc are similar query-rewriting optimisations. Given that search queries are often short, ambiguous, or lacking necessary background information, LLMs can provide relevant information to guide retrieval systems, as they memorize an enormous amount of knowledge and language patterns by pre-training on trillions of tokens.
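In HyDE, for instance, the LLM first writes a hypothetical answer document and retrieval runs on that document’s embedding rather than on the raw query. A rough sketch, again reusing the hypothetical helpers from above:

```python
def hyde_retrieve(user_query, vector_store, embedder, top_k=3):
    """Retrieve using the embedding of an LLM-written hypothetical answer."""
    hypothetical_doc = call_llm(
        f"Write a short passage that plausibly answers this question:\n{user_query}"
    )
    # Search with the hypothetical document instead of the short, ambiguous query.
    return retrieve(hypothetical_doc, vector_store, embedder, top_k=top_k)
```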
StepBack-prompt
The StepBack-prompt approach encourages the language model to think beyond specific examples and focus on broader concepts and principles.
This template replicates the “Step-Back” prompting technique that improves performance on complex questions by first asking a “step back” question. This technique can be combined with standard question-answering RAG applications by retrieving information for both the original and step-back questions. Below is an example of a step-back prompt.
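One illustrative phrasing of such a prompt (the exact wording and the example are assumptions, not taken verbatim from the paper):

```python
# Illustrative step-back prompt template; {question} is filled with the user's query.
STEP_BACK_PROMPT = (
    "You are an expert at world knowledge. Your task is to take a step back and "
    "paraphrase the question below into a more generic, higher-level question "
    "that is easier to answer.\n\n"
    "Original question: {question}\n"
    "Step-back question:"
)

# Example: "Which team did the player join in the 2010 season?" might step back to
# "What is the player's career history?" — retrieve for both and answer using both contexts.
```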
Fine-tuning Embedding
Fine-tuning embedding models significantly impacts the relevance of retrieved content in RAG systems. This process involves customizing embedding models to enhance retrieval relevance in domain-specific contexts, especially for professional domains dealing with evolving or rare terms.
Generating synthetic dataset for training and evaluation
The key idea here is that training data for fine-tuning can be generated using language models like GPT-3.5-turbo to formulate questions grounded in document chunks. This allows us to generate synthetic positive pairs of (query, relevant document) in a scalable way without requiring human labellers. The final dataset will consist of pairs of questions and text chunks.
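A sketch of that generation loop, with `call_llm` once more standing in for a model such as GPT-3.5-turbo:

```python
def generate_training_pairs(chunks):
    """Create synthetic (question, relevant chunk) pairs for fine-tuning."""
    pairs = []
    for chunk in chunks:
        question = call_llm(
            "Write one question that can be answered using only this text:\n\n"
            f"{chunk}"
        )
        pairs.append({"query": question.strip(), "positive": chunk})
    return pairs
```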
Fine-tune Embedding
Fine-tune any embedding model on the generated training dataset.
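Below is a hedged sketch using the sentence-transformers fit API; the base model, batch size, and number of epochs are illustrative choices.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Synthetic (query, relevant chunk) pairs, e.g. produced by the generation step above.
training_pairs = [
    {"query": "What does RAG stand for?",
     "positive": "Retrieval-Augmented Generation (RAG) combines retrieval with generation."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

train_examples = [
    InputExample(texts=[pair["query"], pair["positive"]]) for pair in training_pairs
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats the other chunks in a batch as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=50)
model.save("finetuned-embedding-model")
```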
Hybrid Search Exploration
The RAG system optimizes its performance by intelligently integrating various techniques, including keyword-based search, semantic search, and vector search.
This approach leverages the unique strengths of each method to accommodate diverse query types and information needs, ensuring consistent retrieval of highly relevant and context-rich information. The use of hybrid search serves as a robust supplement to retrieval strategies, thereby enhancing the overall efficacy of the RAG pipeline.
Common Example
The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. It is also known as “hybrid search”. The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity.
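A hedged sketch of that pattern, combining LangChain’s BM25Retriever with a FAISS dense retriever via EnsembleRetriever (import paths, the embeddings, and the 50/50 weights are assumptions):

```python
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import EnsembleRetriever

texts = ["chunk one ...", "chunk two ...", "chunk three ..."]  # placeholder chunks

sparse = BM25Retriever.from_texts(texts)                            # keyword matching
dense = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()  # semantic matching

hybrid = EnsembleRetriever(retrievers=[sparse, dense], weights=[0.5, 0.5])
results = hybrid.invoke("my question")
```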
Post-Retrieval Optimisation
Re-Ranking
Reranking retrieval results before sending them to the LLM has significantly improved RAG performance.
A high score in vector similarity search does not mean that it will always have the highest relevance.
The core concept involves re-arranging document records to prioritize the most relevant items at the top, thereby limiting the total number of documents. This not only resolves the challenge of context window expansion during retrieval but also enhances retrieval efficiency and responsiveness.
Increase the similarity_top_k in the query engine to retrieve more context passages, which can be reduced to top_n after reranking.
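One concrete way to rerank is with a cross-encoder from sentence-transformers; the model name below is one common choice, not a requirement.

```python
from sentence_transformers import CrossEncoder

def rerank(query, retrieved_chunks, top_n=3):
    """Score each (query, chunk) pair with a cross-encoder and keep the best."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
    ranked = sorted(zip(scores, retrieved_chunks), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
```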
Prompt Compression
Noise in retrieved documents adversely affects RAG performance: the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Here, the emphasis lies in compressing irrelevant context, highlighting pivotal paragraphs, and reducing the overall context length.
Contextual compression
Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, it can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.
A document compressor is a small language model that calculates the mutual information between the user query and the retrieved document to estimate the importance of each element.
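Below is a hedged sketch using LangChain’s ContextualCompressionRetriever, where an LLM-based extractor keeps only the query-relevant parts of each retrieved document; the imports and model choices are assumptions.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

texts = ["a long document with one relevant sentence buried in noise ..."]  # placeholder
base_retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()

# The extractor asks an LLM to pull out only the passages relevant to the query.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

compressed_docs = compression_retriever.invoke("what is the relevant fact?")
```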
Modular RAG
Modular RAG integrates various methods to enhance the different components of RAG, such as incorporating a search module for similarity retrieval and applying a fine-tuning approach to the retriever.
RAG Fusion
RAG Fusion combines 2 approaches:
- Multi-Query Retrieval: utilizes LLMs to generate multiple queries from different perspectives for a given user input query, which is advantageous for addressing complex problems with multiple sub-problems.
- Rerank Retrieved Documents: re-rank all the retrieved documents and remove any documents with low relevance scores.
This advanced technique helps ensure that search results match the user’s intentions, whether they are obvious or not, helping users find more insightful and relevant information.
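RAG Fusion is commonly implemented with reciprocal rank fusion (RRF) across the per-query result lists; a minimal sketch of that fusion step:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of chunks into one ranking using RRF scores."""
    scores = {}
    for results in result_lists:                 # one ranked list per query variant
        for rank, chunk in enumerate(results):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = reciprocal_rank_fusion(
#     [retrieve(q, vector_store, embedder) for q in query_variants]
# )
```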
Final thoughts
This article discusses various techniques to optimize each part of the RAG pipeline and enhance the overall RAG pipeline. You can use one or multiple of these techniques in your RAG pipeline, making it more accurate and more efficient. I hope these techniques can help you build a better RAG pipeline for your app.