Query expansion:
When performing information retrieval, you don’t always get what you want. A method suggested to improve the recall of search systems is query expansion, which adds additional terms to the search query, recovering relevant documents that might not have lexical overlap with the initial query. This idea is essential and valuable to enhance the performance of retrieval-augmented generation (RAG) systems.
Why Use Query Expansion?
Query expansion is vital for several reasons:
- Improves Recall: It helps retrieve documents that are semantically related to the query but don’t necessarily share common keywords.
- Addresses Query Ambiguity: It’s beneficial for short or ambiguous queries, providing more context and clarity.
- Enhances Document Matching: Expanding the query terms increases the likelihood of matching with the correct documents in the database.
The LLM Approach to Query Expansion
Recent advancements propose leveraging Large Language Models (LLMs) for query expansion. Unlike traditional methods like Pseudo-Relevance Feedback (PRF), which depends on the content of the retrieved documents, LLMs utilize their generative capabilities to create meaningful query expansions. This approach taps into the inherent knowledge encoded within the LLM, generating alternative terms and phrases that might be relevant to the original query.
Below is the code for query expansion (note that we’re using the OpenAI chat API).
The prompt asks the LLM to generate a hypothetical answer for a given query. We can combine the generated answer to our original query and then pass it back to our LLM as a joint query. This provides more context to the LLM prompt prior to retrieving the result.
As you can see, using query expansion in RAG systems this way offers several benefits:
- Better document retrieval: Expanded queries lead to more accurate and comprehensive document retrieval, a crucial step in RAG models.
- Enhanced understanding: Expanded queries provide RAG models with a broader context, improving the model’s understanding and responses.
- Versatility: This approach is adaptable to various domains and types of queries, enhancing the versatility of RAG models.
Drawbacks and the Role of Reranking
Although query expansion offers significant benefits, it’s not without its drawbacks:
- Over-expansion: Adding too many terms can sometimes lead to irrelevant document retrieval.
- Quality control: The relevance of expanded terms is only sometimes guaranteed.
To mitigate these issues, reranking plays a crucial role. It refines the initial retrieval output, recalibrating document rankings based on their relevance to the expanded query. This ensures that only the most pertinent documents are prioritized, effectively sifting through the noise introduced by query expansion.
1. Cross-Encoder Reranking
Among the reranking methodologies, cross-encoder models stand out for their ability to significantly enhance search accuracy. These models diverge from traditional ranking metrics, such as cosine similarity, by employing deep learning to evaluate the alignment between each document and the query directly. Cross-encoders output a relevance score by processing the query and document in tandem, enabling a more nuanced document selection process.
Practical Implementation
In practice, cross-encoder reranking is applied after expanding the search query to include a broader set of documents. This approach not only refines the selection of documents from the initial retrieval but also enhances the utility of RAG models by:
- Improving Accuracy: Cross-encoders enhance the precision of document retrieval by ensuring that documents are ranked according to their actual relevance to the query.
- Expanding Versatility: This method adapts seamlessly across various domains and query types, elevating the adaptability of RAG models.
Example Use Case
Consider a scenario where your application needs to retrieve and rank documents based on their relevance to user queries. After initial query expansion, you’d apply a cross-encoder model to rerank the results.
In the example below, we use the sentence transformers Cross-Encoder. You need to pass the retriever documents from your RAG, following which the Cross-Encoder will give you a ranking based on the most relevant documents.
You can use this or the answer from the reranker and do further processing, such as passing top documents to the LLM to get the final answer.
Next, we’ll look at the reranker model, ColBERTv2, which is among the fastest reranker models available today.
2. ColBERT: The reranker model
ColBERT is a document reranker model using a late interaction architecture over BERT, designed to enhance the performance of document retrieval and ranking of the documents. It’s particularly notable for balancing computational efficiency with high accuracy.
Core Idea and Architecture
ColBERT separates the encoding of query and document texts using BERT, allowing offline pre-computation of document encodings and significantly reducing the computational load per query. The model employs a unique approach where each query and document token is encoded into a low-dimensional vector, facilitating fast and accurate retrieval.
Late Interaction Mechanism
The crux of ColBERT’s efficiency lies in its late interaction mechanism. Instead of squashing all token vectors into a single vector, it compares each query vector with every vector of the document. This method ensures a more nuanced and accurate representation of the document’s relevance to the query.
Indexing and Retrieval in Colbert
ColBERT’s indexing process is a three-stage process:
- Centroid Selection: Using k-means clustering to select centroids for residual encoding.
- Passage Encoding: Encoding documents with the selected centroid and computing the quantized residual.
- Index Inversion: Creating an inverted list of embeddings grouped by centroids for fast retrieval.
ColBERT efficiently computes the cosine similarity for each query vector during retrieval, leading to a fast and accurate ranking of documents.
Learn more about Colbert V2 & other ranker comparison benchmarks on this blog.
Practical Implementation
We’ll use the ColBERT reranker with LanceDB, which provides an interface to choose from different ranking hybrid methods for querying the documents.
The results below are from the reranked ColBERT v2 model.
3 .FlashRank
FlashRank is an ultra-lite & super-fast Python library to add reranking to your existing search & retrieval pipelines and is based on SoTA cross-encoders. Ensure you pip install flashrank before running the following code.
All the code is available in the following Colab:
Google Collab for query expansion with reranker model
Conclusion
The integration of LLM-based query expansion and advanced reranking models like Cross-Encoder ColBERT v2 & FlashReranker are opening up new possibilities in the field of information retrieval. These methods enhance the precision and recall of document retrieval systems and ensure that RAG models can deliver highly relevant and contextually richer results. As we continue to explore and innovate within this space, these tools will become easier to use and more common in various use cases and domains.
Our Google Colab notebooks provide a hands-on introduction to implementing query expansion, Cross-Encoder reranking, and leveraging ColBERT v2 models for those eager to dive deeper and experiment with these concepts. Through practical exploration, users can experience firsthand the impact of these technologies on enhancing information retrieval and document ranking processes.
But that’s not all. For even more exciting applications of vector databases and Large Language Models (LLMs), explore the LanceDB repository. LanceDB offers a powerful and versatile vector database that can revolutionize how you work with data.
Explore the full potential of this cutting-edge technology by visiting the vectordb-recipes repository.