The Task
In our project, the goal is to make a professional library of over 33,000 pages from 3,900 PDF documents accessible through an AI assistant. With this volume, RAG is indispensable. Users should be able to ask domain-specific questions and receive a detailed response spanning one or two screens within a reasonable time frame. Page-accurate references to the PDFs allow users to open the documents for further verification.
A Quick Intro to Vector Search
For each page, we calculate an embedding vector using an embedding model[1] and store it in a special database index (vector index). In our case[2], the vector consists of 1,536 floating-point numbers, representing the semantic meaning of the page in 1,536 dimensions. This is beyond human intuition, but this mathematical construct enables us to compare texts semantically by calculating the similarity between their vectors[3]. Theoretically, the similarity value ranges from -1 (the opposite) to 1 (100% identical), but in practice, it typically falls between 0.5 and 0.87.
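To make this concrete, here is a minimal sketch in Python of how a page vector and a question vector could be compared. It assumes the OpenAI Python client and the text-embedding-3-small model from footnote [2]; the helper names are ours, not part of any library.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Return the 1,536-dimensional embedding vector for a text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [-1, 1]; for normalized embeddings this reduces to the dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

page_vector = embed("Text of one PDF page ...")
query_vector = embed("What does the ruling say about easements?")
print(cosine_similarity(page_vector, query_vector))  # typically somewhere between 0.5 and 0.87
```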
Our embeddings encode the entire world into just 1,536 dimensions! Let that sink in for a moment. Common sense dictates there must be natural limitations to this.
When someone asks the assistant a question, it is converted into a vector, and we search the vector index for the most similar pages to pass on to the Large Language Model (the AI) for evaluation. This is called semantic search. The search does not look for terms but for meaning.
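In MongoDB Atlas, which we use for this project (more on that below), this lookup can be expressed as a $vectorSearch aggregation stage. The following is only a sketch; the database, collection, index, and field names are assumptions, not fixed by Atlas.

```python
from pymongo import MongoClient

mongo = MongoClient("mongodb+srv://...")      # Atlas connection string
pages = mongo["library"]["pages"]             # hypothetical database and collection

def semantic_search(query_vector: list[float], limit: int = 20):
    """Return the pages whose embeddings are most similar to the query vector."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "page_embeddings",   # Atlas vector index name (assumption)
                "path": "embedding",          # field holding the 1,536-dim vector (assumption)
                "queryVector": query_vector,
                "numCandidates": 200,         # search breadth; see the performance section below
                "limit": limit,
            }
        },
        {"$project": {"document": 1, "page": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
    return list(pages.aggregate(pipeline))
```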
First Problem: The Needle in the Haystack
For instance, if someone searches for a specific court ruling with the question, “What are the contents of the BGH decision from January 12, 2001, V ZR 420/99?”, the vector search will fail. It cannot encode such a specific detail across 33,000 pages with pinpoint accuracy. Instead, it will return everything loosely related to court rulings – semantically close but imprecise. That’s not enough. We need the exact page where the ruling appears.
Solution: Full-Text Search
Full-text search is a solved problem, with proven implementations like Lucene available for years. If the above question is submitted as a whole to a full-text search, it will pinpoint the page where the court ruling is mentioned with a very high relevance score.
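In Atlas Search (Lucene under the hood), this looks roughly as follows. The sketch reuses the pages collection from above; the index and field names are again assumptions.

```python
def fulltext_search(question: str, limit: int = 20):
    """Lucene full-text search over the page texts, ranked by relevance score."""
    pipeline = [
        {
            "$search": {
                "index": "page_text",                        # Atlas Search index name (assumption)
                "text": {"query": question, "path": "text"}, # 'text' holds the page content (assumption)
            }
        },
        {"$limit": limit},
        {"$project": {"document": 1, "page": 1, "score": {"$meta": "searchScore"}}},
    ]
    return list(pages.aggregate(pipeline))
```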
Full-text search, however, struggles to find semantically similar phrasing that uses different wording. On the other hand, if it returns no results at all, that is a strong indicator that the question does not match the topic.
Second Problem: Performance
This issue is rarely discussed; in the many toy examples you find online, performance simply isn't a concern.
We use MongoDB Atlas because it integrates vector search and Lucene full-text search while offering all the other features you expect from a database. Having everything under one roof significantly simplifies the design of a RAG application.
On an M10 cluster, a vector search across 33,000 pages takes about 3 seconds, while Lucene full-text search requires only around 0.3 seconds. Vector search includes a parameter to limit the number of candidates – in other words, you can adjust the breadth of the search. This parameter has a significant impact on performance.
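If you want to see this effect on your own cluster, a small timing loop is enough. The sketch below reuses the collection and query vector from the earlier sketches and simply varies the candidate limit; the chosen values are arbitrary, and the figures quoted above came from an M10 cluster.

```python
import time

def time_vector_search(query_vector: list[float], num_candidates: int) -> float:
    """Run one $vectorSearch with the given breadth and return the elapsed time in seconds."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "page_embeddings",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": 20,
            }
        }
    ]
    start = time.perf_counter()
    list(pages.aggregate(pipeline))
    return time.perf_counter() - start

for breadth in (50, 200, 1000):
    print(f"numCandidates={breadth}: {time_vector_search(query_vector.tolist(), breadth):.2f} s")
```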
Another key factor is the vector length. We experimented with vectors twice as long, hoping for more accurate results. Unfortunately, the hoped-for accuracy did not materialize, and all we got in return was double the runtime.
The Dilemma
We are facing a dilemma. Some questions are abstract and require broad searches to identify the best sources, while others are highly specific, like searching for a needle in a haystack. In an interactive application, we cannot afford exhaustive precision if we want to keep response times acceptable[4]. It is difficult to tell from the question alone[5] whether it requires a broad or a narrow search, and optimizing the search parameters dynamically based on the question is challenging.
Solution: Hybrid Search
Combining full-text search and vector search balances their strengths and compensates for their weaknesses. Text search becomes the primary method: if it returns no relevant results, we skip the vector search entirely. Otherwise, vector search complements the text search with a moderately configured search breadth.
The results of both searches are merged using Reciprocal Rank Fusion[6], and the top K[7] pages are passed to the LLM for processing.
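As a rough sketch of how this could be wired up: Reciprocal Rank Fusion scores each page by the reciprocal of its rank in each result list, typically 1/(k + rank) with k = 60, and the text search decides whether the vector search runs at all. The function names refer to the earlier sketches; the page identifier is an assumption.

```python
def rrf_merge(result_lists, k: int = 60, top_k: int = 20):
    """Reciprocal Rank Fusion: score(page) = sum over result lists of 1 / (k + rank)."""
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            key = (doc["document"], doc["page"])      # page identifier (assumption)
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
            docs[key] = doc
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[key] for key in ranked[:top_k]]

def hybrid_search(question: str, top_k: int = 20):
    """Text search first; if it finds nothing, skip the vector search entirely."""
    text_hits = fulltext_search(question)
    if not text_hits:
        return []                                     # question does not match the topic
    vector_hits = semantic_search(embed(question).tolist())
    return rrf_merge([text_hits, vector_hits], top_k=top_k)
```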
Conclusion
RAG would fundamentally work with either full-text search or vector search alone, but the results would be far inferior to hybrid search, which better addresses diverse questions.
Retrieval in RAG is a finely balanced compromise tailored to its purpose. An interactive assistant has different requirements than a system that operates in the background and can take as long as needed to compute a result.
My advice: Don’t let this critical part of RAG slip out of your hands by relying on inflexible standard mechanisms. Nobody knows your data and the insights you want to gain better than you do. Off-the-shelf implementations and services will struggle to fine-tune themselves to your individual needs.
The Work Continues
Because there is no alternative to RAG for large document collections, I will continue experimenting with the dynamic optimization of hybrid search. Better LLMs may open up new possibilities. Some ideas include:
- Improving the detection of question types to adjust search parameters accordingly. So far, my experiments in this area have been unreliable.
- Broadening the vector search when the text search yields weak results. This could indicate a general question.
- Letting the LLM decide all search parameters.
Footnotes

[1] Embedding models are a byproduct of language models. They existed even before the LLM era and are computed using machine learning.

[2] OpenAI text-embedding-3-small.

[3] Using the dot product or cosine similarity.

[4] Anthropic's Contextual Retrieval, for example, doubles the number of vector searches. Can this always be justified? As a consequence, do we need to narrow the breadth of both searches, and does this lead to an overall improvement?

[5] "Tell" in the sense of being computable.

[6] How this works can easily be researched or looked up in our RAG Primer.

[7] Top-20 pages work very well. This parameter impacts the LLM's runtime. This is our only adjustment lever in terms of LLM performance.